While learning about correlations, I discovered that different data types call for different correlation measures: for example, correlations between a binary and a continuous variable, or between two ordinal variables.
Here is a virtual toolbox of correlation and similarity measures, for personal and public reference.
Correlations
Pearson’s Correlation Coefficient
Pearson’s is the go-to measure when you have two continuous variables and want to quantify the strength and direction of their linear relationship. Both variables should be approximately normally distributed, and the relationship between them should be linear (not curved). It is sensitive to outliers, so inspect your data visually with a scatterplot before applying it.
When to Use:
- Both variables are continuous (interval or ratio scale)
- The relationship is expected to be linear
- Data is roughly normally distributed
- Outliers have been addressed
Avoid it when:
- Your data is ordinal or contains ranks
- The relationship is non-linear
- You have significant outliers you cannot remove
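To see the outlier caveat in action, here is a small sketch on synthetic data: a single extreme point is enough to flip an otherwise strong positive r.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = x + rng.normal(0, 1, 20)  # strong linear relationship plus noise

r_clean, _ = stats.pearsonr(x, y)

# Add a single extreme outlier and recompute
x_out = np.append(x, 100.0)
y_out = np.append(y, -50.0)
r_outlier, _ = stats.pearsonr(x_out, y_out)

print(f"r without outlier: {r_clean:.4f}")
print(f"r with one outlier: {r_outlier:.4f}")
```

This is exactly why a scatterplot belongs before the correlation call: the outlier is obvious visually long before it shows up in the statistic.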
Equation:
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{ \sqrt{ \sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2 } }
Python Code:
import numpy as np
from scipy import stats
x = [10, 20, 30, 40, 50]
y = [12, 24, 28, 45, 52]
r, p_value = stats.pearsonr(x, y)
print(f"Pearson r: {r:.4f}, p-value: {p_value:.4f}")
Sources:
- Wikipedia: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
- SciPy docs: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html
Spearman’s Rank Correlation
Spearman’s is a non-parametric alternative to Pearson’s. Rather than working on raw values, it converts both variables to ranks and then computes the correlation of those ranks. This makes it robust to outliers and appropriate for ordinal data or when the relationship is monotonic but not strictly linear (i.e., as one variable increases, the other tends to increase, but not necessarily at a constant rate).
When to Use:
- One or both variables are ordinal
- The relationship is monotonic but not necessarily linear
- Your data contains outliers that cannot be removed
- Normality cannot be assumed
Avoid it when:
- You have a confirmed linear relationship between two continuous, normally distributed variables (Pearson’s is more powerful in that case)
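A quick illustration of the monotonic-but-not-linear case, using a synthetic exponential relationship: Spearman's rho is a perfect 1 while Pearson's r falls short, and computing Pearson's r on the ranks reproduces rho exactly.

```python
import numpy as np
from scipy import stats

x = np.arange(1, 11, dtype=float)
y = np.exp(x)  # perfectly monotonic, but far from linear

r, _ = stats.pearsonr(x, y)
rho, _ = stats.spearmanr(x, y)

print(f"Pearson r:    {r:.4f}")    # well below 1 despite a perfect monotonic link
print(f"Spearman rho: {rho:.4f}")  # 1.0: the ranks line up perfectly

# Spearman is just Pearson computed on the ranks
rank_r, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))
print(f"Pearson on ranks: {rank_r:.4f}")
```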
Equation:
r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
Where d_i is the difference between the ranks of the i-th pair, and n is the number of observations. Note that this formula assumes no tied ranks; with ties, compute Pearson’s r on the ranks instead.
Python Code:
from scipy import stats
x = [10, 20, 30, 40, 50]
y = [12, 24, 28, 45, 52]
rho, p_value = stats.spearmanr(x, y)
print(f"Spearman rho: {rho:.4f}, p-value: {p_value:.4f}")
Sources:
- Wikipedia: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
- SciPy docs: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html
Point-Biserial Correlation
Point-Biserial correlation measures the relationship between one continuous variable and one dichotomous (binary) variable, such as pass/fail, yes/no, or 0/1. It is mathematically equivalent to Pearson’s r applied to this specific case, but it is worth calling out explicitly because the data structure is distinct. The continuous variable should be approximately normally distributed within each group of the binary variable.
Use it when:
- One variable is continuous and the other is binary (naturally or artificially dichotomous)
- You want to assess how well the binary grouping separates the continuous variable
- Item analysis for test scoring (e.g., whether getting a question right correlates with the total score)
Avoid it when:
- The binary variable is an artificially forced split of what is actually a continuous underlying variable (consider biserial correlation instead in that case)
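Since point-biserial correlation is mathematically equivalent to Pearson's r, the equivalence can be checked directly on a toy dataset:

```python
from scipy import stats

continuous = [2.5, 3.1, 4.0, 5.2, 3.8, 2.9, 4.5, 5.1]
binary = [0, 0, 1, 1, 0, 0, 1, 1]

# Both calls compute the same quantity on binary-vs-continuous data
r_pb, p_pb = stats.pointbiserialr(binary, continuous)
r_pearson, p_pearson = stats.pearsonr(binary, continuous)

print(f"Point-biserial: {r_pb:.6f}")
print(f"Pearson:        {r_pearson:.6f}")  # identical
```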
Equation:
r_{pb} = \frac{\bar{Y}_1 - \bar{Y}_0}{s_Y} \sqrt{\frac{n_1 \cdot n_0}{n^2}}
Where \bar{Y}_1 and \bar{Y}_0 are the means of the continuous variable for each binary group, s_Y is the standard deviation of the continuous variable, n_1 and n_0 are the group sizes, and n is the total sample size.
Python Code:
from scipy import stats
continuous = [2.5, 3.1, 4.0, 5.2, 3.8, 2.9, 4.5, 5.1]
binary = [0, 0, 1, 1, 0, 0, 1, 1]
r, p_value = stats.pointbiserialr(binary, continuous)
print(f"Point-Biserial r: {r:.4f}, p-value: {p_value:.4f}")
Sources:
- Wikipedia: https://en.wikipedia.org/wiki/Point-biserial_correlation_coefficient
- SciPy docs: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pointbiserialr.html
Cramér’s V
Cramér’s V measures the strength of association between two categorical variables. It is derived from the chi-squared statistic and always ranges from 0 (no association) to 1 (perfect association), regardless of the size of the contingency table. Unlike Pearson’s or Spearman’s, it does not indicate direction, only strength. It is appropriate when both variables are nominal (unordered categories).
Use it when:
- Both variables are categorical (nominal)
- You have already run a chi-squared test and want an effect size measure
- Your contingency table is larger than 2×2 (for 2×2 tables, Phi coefficient is equivalent)
Avoid it when:
- Your variables are ordinal (consider Kendall’s tau or Goodman-Kruskal gamma instead)
- Sample size is very small, as chi-squared assumptions may be violated
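The 2×2 note above is easy to verify; here is a minimal sketch with made-up counts (correction=False disables Yates' continuity correction so the raw chi-squared statistic is used):

```python
import numpy as np
from scipy.stats import chi2_contingency

# A 2x2 contingency table (e.g., treatment vs. outcome counts; made-up data)
table = np.array([[30, 10],
                  [12, 28]])

chi2, p, dof, expected = chi2_contingency(table, correction=False)
n = table.sum()
r, c = table.shape

v = np.sqrt((chi2 / n) / min(r - 1, c - 1))
phi = np.sqrt(chi2 / n)  # Phi coefficient for a 2x2 table

print(f"Cramer's V: {v:.4f}")
print(f"Phi:        {phi:.4f}")  # equal for 2x2 tables, since min(r-1, c-1) = 1
```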
Equation:
V = \sqrt{\frac{\chi^2 / n}{\min(r-1, c-1)}}
Where \chi^2 is the chi-squared statistic, n is the total number of observations, r is the number of rows, and c is the number of columns in the contingency table.
Python Code:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
def cramers_v(confusion_matrix):
    chi2, _, _, _ = chi2_contingency(confusion_matrix)
    n = confusion_matrix.sum().sum()
    r, c = confusion_matrix.shape
    return np.sqrt(chi2 / (n * (min(r, c) - 1)))
# Example: favorite color vs. preferred genre
data = pd.DataFrame({
"Color": ["Red", "Red", "Blue", "Blue", "Green", "Green"],
"Genre": ["Rock", "Jazz", "Rock", "Classical", "Jazz", "Classical"]
})
confusion_matrix = pd.crosstab(data["Color"], data["Genre"])
v = cramers_v(confusion_matrix)
print(f"Cramer's V: {v:.4f}")
Sources:
- Wikipedia: https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V
- SciPy chi2_contingency docs: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html
Similarity Measures
Cosine Similarity
Cosine similarity measures the angle between two vectors in multi-dimensional space, making it a measure of orientation rather than magnitude. Two vectors pointing in the same direction have a cosine similarity of 1, perpendicular vectors score 0, and opposite vectors score -1. It is widely used in NLP and information retrieval where documents or words are represented as high-dimensional vectors (e.g., TF-IDF or word embeddings), and the raw size of the vectors matters less than their direction.
Use it when:
- You are working with vector representations of text (e.g., TF-IDF, word2vec, sentence embeddings)
- Magnitude is not meaningful and only direction (relative composition) matters
- Comparing documents, user profiles, or item feature vectors in recommendation systems
Avoid it when:
- Magnitude differences are meaningful and should be captured in the similarity score
- You are working with simple univariate or bivariate numeric data (Pearson’s or Spearman’s are more interpretable there)
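To illustrate the orientation-versus-magnitude point, here is a small sketch: scaling one vector by 10 leaves the cosine similarity unchanged, while the Euclidean distance balloons.

```python
import numpy as np

def cosine_sim(a, b):
    # Angle-based similarity: dot product over the product of norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 3.0, 1.0])

print(f"cos(a, b):   {cosine_sim(a, b):.4f}")
print(f"cos(10a, b): {cosine_sim(10 * a, b):.4f}")  # unchanged: scaling does not move the angle

# Euclidean distance, by contrast, is dominated by magnitude
print(f"dist(a, b):   {np.linalg.norm(a - b):.4f}")
print(f"dist(10a, b): {np.linalg.norm(10 * a - b):.4f}")
```

This is why cosine similarity is the default for TF-IDF vectors: a long document and a short one about the same topic point in the same direction even though their raw counts differ greatly.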
Equation:
\text{cosine similarity}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\left\Vert \mathbf{A} \right\Vert \left\Vert \mathbf{B} \right\Vert} = \frac{\sum_{i=1}^n A_i B_i}{\sqrt{\sum_{i=1}^n A_i^2} \cdot \sqrt{\sum_{i=1}^n B_i^2}}
Python Code:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# Two document vectors (e.g., word count or TF-IDF representations)
A = np.array([[1, 2, 0, 3, 1]])
B = np.array([[1, 1, 1, 2, 0]])
similarity = cosine_similarity(A, B)
print(f"Cosine Similarity: {similarity[0][0]:.4f}")
# Manual calculation for reference
dot_product = np.dot(A[0], B[0])
norm_A = np.linalg.norm(A[0])
norm_B = np.linalg.norm(B[0])
manual = dot_product / (norm_A * norm_B)
print(f"Manual Cosine Similarity: {manual:.4f}")
Sources:
- Wikipedia: https://en.wikipedia.org/wiki/Cosine_similarity
- scikit-learn docs: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html