Repository of Correlations & Similarity Measures

While learning about correlations, I discovered that different data types call for different correlation measures: for example, a binary variable paired with a continuous one, or two ordinal variables.

Here is a virtual toolbox of correlation and similarity measurement methods, for personal and public reference.

Correlations

Pearson’s Correlation Coefficient

Pearson’s is the go-to measure when you have two continuous variables and want to quantify the strength and direction of their linear relationship. Both variables should be approximately normally distributed, and the relationship between them should be linear (not curved). It is sensitive to outliers, so inspect your data visually with a scatterplot before applying it.

When to Use:

  • Both variables are continuous (interval or ratio scale)
  • The relationship is expected to be linear
  • Data is roughly normally distributed
  • Outliers have been addressed

Avoid it when:

  • Your data is ordinal or contains ranks
  • The relationship is non-linear
  • You have significant outliers you cannot remove

Equation:

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{ \sqrt{ \sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2 } }

Python Code:

Python
import numpy as np
from scipy import stats

x = [10, 20, 30, 40, 50]
y = [12, 24, 28, 45, 52]

r, p_value = stats.pearsonr(x, y)
print(f"Pearson r: {r:.4f}, p-value: {p_value:.4f}")
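As a sanity check, r can also be computed directly from the equation above with plain NumPy (same sample data as the snippet):

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50])
y = np.array([12, 24, 28, 45, 52])

# Pearson's r from the definition: sum of cross-deviations over the
# square root of the product of squared deviations
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)
print(f"Manual Pearson r: {r_manual:.4f}")  # matches np.corrcoef(x, y)[0, 1]
```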


Spearman’s Rank Correlation

Spearman’s is a non-parametric alternative to Pearson’s. Rather than working on raw values, it converts both variables to ranks and then computes the correlation of those ranks. This makes it robust to outliers and appropriate for ordinal data or when the relationship is monotonic but not strictly linear (i.e., as one variable increases, the other tends to increase, but not necessarily at a constant rate).

When to Use:

  • One or both variables are ordinal
  • The relationship is monotonic but not necessarily linear
  • Your data contains outliers that cannot be removed
  • Normality cannot be assumed

Avoid it when:

  • You have a confirmed linear relationship between two continuous, normally distributed variables (Pearson’s is more powerful in that case)

Equation:

r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}

Where d_i is the difference between the ranks of the i-th pair, and n is the number of observations.

Python Code:

Python
from scipy import stats

x = [10, 20, 30, 40, 50]
y = [12, 24, 28, 45, 52]

rho, p_value = stats.spearmanr(x, y)
print(f"Spearman rho: {rho:.4f}, p-value: {p_value:.4f}")
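The sample data above is perfectly monotonic, so rho comes out exactly 1. The rank-difference formula is easier to see with one out-of-order pair; a minimal sketch using scipy.stats.rankdata on illustrative (made-up) data:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

# Illustrative data with one out-of-order pair and no ties
x = [10, 20, 30, 40, 50]
y = [12, 24, 20, 45, 52]

rx, ry = rankdata(x), rankdata(y)   # convert both variables to ranks
d = rx - ry                         # rank differences d_i
n = len(x)
rho_manual = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

print(f"Manual Spearman rho: {rho_manual:.4f}")   # 0.9000
print(f"scipy Spearman rho:  {spearmanr(x, y)[0]:.4f}")   # 0.9000
```

The shortcut formula agrees with `spearmanr` only when there are no ties; with ties, SciPy's version (Pearson's r applied to the ranks) is the one to trust.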


Point-Biserial Correlation

Point-Biserial correlation measures the relationship between one continuous variable and one dichotomous (binary) variable, such as pass/fail, yes/no, or 0/1. It is mathematically equivalent to Pearson’s r applied to this specific case, but it is worth calling out explicitly because the data structure is distinct. The continuous variable should be approximately normally distributed within each group of the binary variable.

When to Use:

  • One variable is continuous and the other is binary (naturally or artificially dichotomous)
  • You want to assess how well the binary grouping separates the continuous variable
  • You are performing item analysis for test scoring (e.g., checking whether getting a question right correlates with the total score)

Avoid it when:

  • The binary variable is an artificially forced split of what is actually a continuous underlying variable (consider biserial correlation instead in that case)

Equation:

r_{pb} = \frac{\bar{Y}_1 - \bar{Y}_0}{s_Y} \sqrt{\frac{n_1 n_0}{n^2}}

Where \bar{Y}_1 and \bar{Y}_0 are the means of the continuous variable for each binary group, s_Y is the standard deviation of the continuous variable over the full sample, n_1 and n_0 are the group sizes, and n is the total sample size.

Python Code:

Python
from scipy import stats

continuous = [2.5, 3.1, 4.0, 5.2, 3.8, 2.9, 4.5, 5.1]
binary     = [0,   0,   1,   1,   0,   0,   1,   1  ]

r, p_value = stats.pointbiserialr(binary, continuous)
print(f"Point-Biserial r: {r:.4f}, p-value: {p_value:.4f}")
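For reference, the same value falls out of the group-means formula above. Note that the n² version of the formula pairs with the population standard deviation (ddof=0), which is NumPy's default:

```python
import numpy as np
from scipy import stats

continuous = np.array([2.5, 3.1, 4.0, 5.2, 3.8, 2.9, 4.5, 5.1])
binary = np.array([0, 0, 1, 1, 0, 0, 1, 1])

y1, y0 = continuous[binary == 1], continuous[binary == 0]
n1, n0, n = len(y1), len(y0), len(continuous)
s_y = continuous.std()  # population standard deviation (ddof=0), as the n^2 form requires

r_manual = (y1.mean() - y0.mean()) / s_y * np.sqrt(n1 * n0 / n**2)
print(f"Manual Point-Biserial r: {r_manual:.4f}")  # matches stats.pointbiserialr
```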


Cramer’s V

Cramer’s V measures the strength of association between two categorical variables. It is derived from the chi-squared statistic and always ranges from 0 (no association) to 1 (perfect association), regardless of the size of the contingency table. Unlike Pearson’s or Spearman’s, it does not indicate direction, only strength. It is appropriate when both variables are nominal (unordered categories).

When to Use:

  • Both variables are categorical (nominal)
  • You have already run a chi-squared test and want an effect size measure
  • Your contingency table is larger than 2×2 (for 2×2 tables, Cramer's V reduces to the Phi coefficient)

Avoid it when:

  • Your variables are ordinal (consider Kendall’s tau or Goodman-Kruskal gamma instead)
  • Sample size is very small, as chi-squared assumptions may be violated

Equation:

V = \sqrt{\frac{\chi^2 / n}{\min(r-1,\ c-1)}}

Where \chi^2 is the chi-squared statistic, n is the total number of observations, and r and c are the numbers of rows and columns in the contingency table.

Python Code:

Python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(contingency_table):
    """Cramer's V from a two-way contingency table (DataFrame or 2D array)."""
    chi2, _, _, _ = chi2_contingency(contingency_table)
    n = contingency_table.sum().sum()
    r, c = contingency_table.shape
    return np.sqrt(chi2 / (n * (min(r, c) - 1)))

# Example: favorite color vs. preferred genre
data = pd.DataFrame({
    "Color": ["Red", "Red", "Blue", "Blue", "Green", "Green"],
    "Genre": ["Rock", "Jazz", "Rock", "Classical", "Jazz", "Classical"]
})

contingency_table = pd.crosstab(data["Color"], data["Genre"])
v = cramers_v(contingency_table)
print(f"Cramer's V: {v:.4f}")
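Recent SciPy versions (1.7+) also ship this measure as scipy.stats.contingency.association, which makes a handy cross-check against the hand-rolled function (same example table as above):

```python
import pandas as pd
from scipy.stats.contingency import association

# Same example: favorite color vs. preferred genre
data = pd.DataFrame({
    "Color": ["Red", "Red", "Blue", "Blue", "Green", "Green"],
    "Genre": ["Rock", "Jazz", "Rock", "Classical", "Jazz", "Classical"]
})
table = pd.crosstab(data["Color"], data["Genre"])

# method="cramer" selects Cramer's V (other options: "tschuprow", "pearson")
v = association(table.to_numpy(), method="cramer")
print(f"Cramer's V (scipy): {v:.4f}")  # 0.5000 for this table
```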


Similarity Measures

Cosine Similarity

Cosine similarity measures the angle between two vectors in multi-dimensional space, making it a measure of orientation rather than magnitude. Two vectors pointing in the same direction have a cosine similarity of 1, perpendicular vectors score 0, and opposite vectors score -1. It is widely used in NLP and information retrieval where documents or words are represented as high-dimensional vectors (e.g., TF-IDF or word embeddings), and the raw size of the vectors matters less than their direction.

When to Use:

  • You are working with vector representations of text (e.g., TF-IDF, word2vec, sentence embeddings)
  • Magnitude is not meaningful and only direction (relative composition) matters
  • Comparing documents, user profiles, or item feature vectors in recommendation systems

Avoid it when:

  • Magnitude differences are meaningful and should be captured in the similarity score
  • You are working with simple univariate or bivariate numeric data (Pearson’s or Spearman’s are more interpretable there)

Equation:

\text{cosine similarity}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\lVert \mathbf{A} \rVert \, \lVert \mathbf{B} \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \cdot \sqrt{\sum_{i=1}^{n} B_i^2}}

Python Code:

Python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Two document vectors (e.g., word count or TF-IDF representations)
A = np.array([[1, 2, 0, 3, 1]])
B = np.array([[1, 1, 1, 2, 0]])

similarity = cosine_similarity(A, B)
print(f"Cosine Similarity: {similarity[0][0]:.4f}")

# Manual calculation for reference
dot_product = np.dot(A[0], B[0])
norm_A = np.linalg.norm(A[0])
norm_B = np.linalg.norm(B[0])
manual = dot_product / (norm_A * norm_B)
print(f"Manual Cosine Similarity: {manual:.4f}")
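cosine_similarity also accepts a whole matrix of row vectors at once and returns the full pairwise similarity matrix, which is the usual way to compare every document against every other in a corpus. A small sketch with made-up count vectors:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three document vectors (rows); hypothetical word counts
docs = np.array([
    [1, 2, 0, 3, 1],
    [1, 1, 1, 2, 0],
    [0, 0, 4, 0, 1],
])

# Returns a 3x3 symmetric matrix; the diagonal is 1.0
# (every document is maximally similar to itself)
sim_matrix = cosine_similarity(docs)
print(np.round(sim_matrix, 4))
```

Entry [i, j] of the result is the cosine similarity between documents i and j, so the nearest neighbour of a document is the off-diagonal maximum of its row.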
