Repository of Correlations & Similarity Measures

While learning about correlations, I discovered that different data types call for different correlation measures: for example, a binary variable paired with a continuous one, or two ordinal variables.

Here is a virtual toolbox of correlation and similarity measurement methods, for personal and public reference.

Correlations

Pearson’s Correlation Coefficient

Pearson’s is the go-to measure when you have two continuous variables and want to quantify the strength and direction of their linear relationship. Both variables should be approximately normally distributed, and the relationship between them should be linear (not curved). It is sensitive to outliers, so inspect your data visually with a scatterplot before applying it.

When to Use:

  • Both variables are continuous (interval or ratio scale)
  • The relationship is expected to be linear
  • Data is roughly normally distributed
  • Outliers have been addressed

Avoid it when:

  • Your data is ordinal or contains ranks
  • The relationship is non-linear
  • You have significant outliers you cannot remove

Equation:

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{ \sqrt{ \sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2 } }

Python Code:

Python
import numpy as np
from scipy import stats

x = [10, 20, 30, 40, 50]
y = [12, 24, 28, 45, 52]

r, p_value = stats.pearsonr(x, y)
print(f"Pearson r: {r:.4f}, p-value: {p_value:.4f}")
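As a sanity check, r can also be computed directly from the equation above with plain NumPy (same sample data as the snippet):

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50])
y = np.array([12, 24, 28, 45, 52])

# Pearson's r from the definition: sum of cross-deviations over the
# square root of the product of squared deviations
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)
print(f"Manual Pearson r: {r_manual:.4f}")  # matches np.corrcoef(x, y)[0, 1]
```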


Spearman’s Rank Correlation

Spearman’s is a non-parametric alternative to Pearson’s. Rather than working on raw values, it converts both variables to ranks and then computes the correlation of those ranks. This makes it robust to outliers and appropriate for ordinal data or when the relationship is monotonic but not strictly linear (i.e., as one variable increases, the other tends to increase, but not necessarily at a constant rate).

When to Use:

  • One or both variables are ordinal
  • The relationship is monotonic but not necessarily linear
  • Your data contains outliers that cannot be removed
  • Normality cannot be assumed

Avoid it when:

  • You have a confirmed linear relationship between two continuous, normally distributed variables (Pearson’s is more powerful in that case)

Equation:

r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}

Where d_i is the difference between the ranks of the i-th pair, and n is the number of observations.

Python Code:

Python
from scipy import stats

x = [10, 20, 30, 40, 50]
y = [12, 24, 28, 45, 52]

rho, p_value = stats.spearmanr(x, y)
print(f"Spearman rho: {rho:.4f}, p-value: {p_value:.4f}")
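The sample data above is perfectly monotonic, so rho comes out exactly 1. The rank-difference formula is easier to see with one out-of-order pair; a minimal sketch using scipy.stats.rankdata on illustrative (made-up) data:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

# Illustrative data with one out-of-order pair and no ties
x = [10, 20, 30, 40, 50]
y = [12, 24, 20, 45, 52]

rx, ry = rankdata(x), rankdata(y)   # convert both variables to ranks
d = rx - ry                         # rank differences d_i
n = len(x)
rho_manual = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

print(f"Manual Spearman rho: {rho_manual:.4f}")   # 0.9000
print(f"scipy Spearman rho:  {spearmanr(x, y)[0]:.4f}")   # 0.9000
```

The shortcut formula agrees with `spearmanr` only when there are no ties; with ties, SciPy's version (Pearson's r applied to the ranks) is the one to trust.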


Point-Biserial Correlation

Point-Biserial correlation measures the relationship between one continuous variable and one dichotomous (binary) variable, such as pass/fail, yes/no, or 0/1. It is mathematically equivalent to Pearson’s r applied to this specific case, but it is worth calling out explicitly because the data structure is distinct. The continuous variable should be approximately normally distributed within each group of the binary variable.

When to Use:

  • One variable is continuous and the other is binary (naturally or artificially dichotomous)
  • You want to assess how well the binary grouping separates the continuous variable
  • You are performing item analysis for test scoring (e.g., checking whether getting a question right correlates with the total score)

Avoid it when:

  • The binary variable is an artificially forced split of what is actually a continuous underlying variable (consider biserial correlation instead in that case)

Equation:

r_{pb} = \frac{\bar{Y}_1 - \bar{Y}_0}{s_Y} \sqrt{\frac{n_1 n_0}{n^2}}

Where \bar{Y}_1 and \bar{Y}_0 are the means of the continuous variable for each binary group, s_Y is the standard deviation of the continuous variable over the full sample, n_1 and n_0 are the group sizes, and n is the total sample size.

Python Code:

Python
from scipy import stats

continuous = [2.5, 3.1, 4.0, 5.2, 3.8, 2.9, 4.5, 5.1]
binary     = [0,   0,   1,   1,   0,   0,   1,   1  ]

r, p_value = stats.pointbiserialr(binary, continuous)
print(f"Point-Biserial r: {r:.4f}, p-value: {p_value:.4f}")
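For reference, the same value falls out of the group-means formula above. Note that the n² version of the formula pairs with the population standard deviation (ddof=0), which is NumPy's default:

```python
import numpy as np
from scipy import stats

continuous = np.array([2.5, 3.1, 4.0, 5.2, 3.8, 2.9, 4.5, 5.1])
binary = np.array([0, 0, 1, 1, 0, 0, 1, 1])

y1, y0 = continuous[binary == 1], continuous[binary == 0]
n1, n0, n = len(y1), len(y0), len(continuous)
s_y = continuous.std()  # population standard deviation (ddof=0), as the n^2 form requires

r_manual = (y1.mean() - y0.mean()) / s_y * np.sqrt(n1 * n0 / n**2)
print(f"Manual Point-Biserial r: {r_manual:.4f}")  # matches stats.pointbiserialr
```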


Cramer’s V

Cramer’s V measures the strength of association between two categorical variables. It is derived from the chi-squared statistic and always ranges from 0 (no association) to 1 (perfect association), regardless of the size of the contingency table. Unlike Pearson’s or Spearman’s, it does not indicate direction, only strength. It is appropriate when both variables are nominal (unordered categories).

When to Use:

  • Both variables are categorical (nominal)
  • You have already run a chi-squared test and want an effect size measure
  • Your contingency table is larger than 2×2 (for 2×2 tables, Cramer's V reduces to the Phi coefficient)

Avoid it when:

  • Your variables are ordinal (consider Kendall’s tau or Goodman-Kruskal gamma instead)
  • Sample size is very small, as chi-squared assumptions may be violated

Equation:

V = \sqrt{\frac{\chi^2 / n}{\min(r-1,\ c-1)}}

Where \chi^2 is the chi-squared statistic, n is the total number of observations, and r and c are the numbers of rows and columns in the contingency table.

Python Code:

Python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(contingency_table):
    """Cramer's V from a two-way contingency table (DataFrame or 2D array)."""
    chi2, _, _, _ = chi2_contingency(contingency_table)
    n = contingency_table.sum().sum()
    r, c = contingency_table.shape
    return np.sqrt(chi2 / (n * (min(r, c) - 1)))

# Example: favorite color vs. preferred genre
data = pd.DataFrame({
    "Color": ["Red", "Red", "Blue", "Blue", "Green", "Green"],
    "Genre": ["Rock", "Jazz", "Rock", "Classical", "Jazz", "Classical"]
})

contingency_table = pd.crosstab(data["Color"], data["Genre"])
v = cramers_v(contingency_table)
print(f"Cramer's V: {v:.4f}")
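Recent SciPy versions (1.7+) also ship this measure as scipy.stats.contingency.association, which makes a handy cross-check against the hand-rolled function (same example table as above):

```python
import pandas as pd
from scipy.stats.contingency import association

# Same example: favorite color vs. preferred genre
data = pd.DataFrame({
    "Color": ["Red", "Red", "Blue", "Blue", "Green", "Green"],
    "Genre": ["Rock", "Jazz", "Rock", "Classical", "Jazz", "Classical"]
})
table = pd.crosstab(data["Color"], data["Genre"])

# method="cramer" selects Cramer's V (other options: "tschuprow", "pearson")
v = association(table.to_numpy(), method="cramer")
print(f"Cramer's V (scipy): {v:.4f}")  # 0.5000 for this table
```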


Similarity Measures

Cosine Similarity

Cosine similarity measures the angle between two vectors in multi-dimensional space, making it a measure of orientation rather than magnitude. Two vectors pointing in the same direction have a cosine similarity of 1, perpendicular vectors score 0, and opposite vectors score -1. It is widely used in NLP and information retrieval where documents or words are represented as high-dimensional vectors (e.g., TF-IDF or word embeddings), and the raw size of the vectors matters less than their direction.

When to Use:

  • You are working with vector representations of text (e.g., TF-IDF, word2vec, sentence embeddings)
  • Magnitude is not meaningful and only direction (relative composition) matters
  • Comparing documents, user profiles, or item feature vectors in recommendation systems

Avoid it when:

  • Magnitude differences are meaningful and should be captured in the similarity score
  • You are working with simple univariate or bivariate numeric data (Pearson’s or Spearman’s are more interpretable there)

Equation:

\text{cosine similarity}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\lVert \mathbf{A} \rVert \, \lVert \mathbf{B} \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \cdot \sqrt{\sum_{i=1}^{n} B_i^2}}

Python Code:

Python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Two document vectors (e.g., word count or TF-IDF representations)
A = np.array([[1, 2, 0, 3, 1]])
B = np.array([[1, 1, 1, 2, 0]])

similarity = cosine_similarity(A, B)
print(f"Cosine Similarity: {similarity[0][0]:.4f}")

# Manual calculation for reference
dot_product = np.dot(A[0], B[0])
norm_A = np.linalg.norm(A[0])
norm_B = np.linalg.norm(B[0])
manual = dot_product / (norm_A * norm_B)
print(f"Manual Cosine Similarity: {manual:.4f}")
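cosine_similarity also accepts a whole matrix of row vectors at once and returns the full pairwise similarity matrix, which is the usual way to compare every document against every other in a corpus. A small sketch with made-up count vectors:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three document vectors (rows); hypothetical word counts
docs = np.array([
    [1, 2, 0, 3, 1],
    [1, 1, 1, 2, 0],
    [0, 0, 4, 0, 1],
])

# Returns a 3x3 symmetric matrix; the diagonal is 1.0
# (every document is maximally similar to itself)
sim_matrix = cosine_similarity(docs)
print(np.round(sim_matrix, 4))
```

Entry [i, j] of the result is the cosine similarity between documents i and j, so the nearest neighbour of a document is the off-diagonal maximum of its row.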
