When chance is involved, extreme cases will become more mediocre over time.
Topics: regression, data analysis, machine learning
Required knowledge: some basics in statistics (linear regression, random normal variables)
Many activities have a non-random component (e.g. skill) and a random component (luck). An extreme observation is, on average, due to a combination of both: high skill and good luck. A new observation of the same case is typically more average, since the luck component is reset.
Sports are a great example of both skill and luck playing a role. Say we have N athletes doing javelin throws. Each has a "skill" factor, defined as their average throw distance (over many throws). We draw a "luck factor" from a standard normal distribution and multiply it by a "luck strength", which determines how much influence luck has.
import numpy as np
N = 5000
np.random.seed(1234)
skill = np.random.normal(loc=50, scale=10, size=N)
So we have 5000 athletes; let's do one round of throwing.
import matplotlib.pyplot as plt
plt.style.use(['dark_background'])
def throw_results(skill, luck_strength=5):
    """Draw results as a random normal around each player's skill."""
    luck_factor = np.random.randn(len(skill))
    results = skill + luck_strength * luck_factor
    return results
round_1 = throw_results(skill)
plt.hist(round_1, bins=30)
plt.xlabel('Throw distance (m)')
plt.ylabel('Number of athletes');
Some athletes performed really well; let's take a closer look and see how they perform in the next round!
top_ids = np.argsort(round_1)[-10:][::-1] # indices of top 10 athletes
for idx in top_ids:
    print(f'Athlete {idx}: {round_1[idx]:.2f} m')
It's time for round 2. All athletes keep the same skill, but their luck is reset. Let's inspect the same 10 athletes as before.
round_2 = throw_results(skill)
for idx in top_ids:
    print(f'Athlete {idx}: {round_2[idx]:.2f} m (previously {round_1[idx]:.2f} m)')
Of the top 10 athletes from round 1, only one improved (athlete 700). Let's compare the round 1 results to round 2.
xrange = np.array([10, 88])
def predict_linear(p, x):
    """Linear prediction using coefficients p from np.polyfit."""
    return p[0] * x + p[1]
def estimate_density(x, y, method='scott', subsample=None, **kwargs):
    """
    Useful for scatterplots. This makes it possible to plot high density
    regions with another color instead of just having overlapping dots.
    See wdplot.prep for full function with documentation.
    https://github.com/wdobbels/wdplot/blob/master/wdplot/prep.py
    """
    import pandas as pd
    from scipy.stats import gaussian_kde

    def kde_sklearn(xy):
        from sklearn.neighbors import KernelDensity
        from sklearn.model_selection import GridSearchCV, LeaveOneOut
        kwargs.setdefault('cv', None)
        kwargs.setdefault('refit', False)
        # Sample bandwidth as log-uniform
        bandwidths = np.logspace(-1, 1, kwargs.pop('n_bandwidths', 30))
        grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                            {'bandwidth': bandwidths},
                            **kwargs)
        grid.fit(xy)
        bw = grid.best_params_['bandwidth']
        kde = KernelDensity(bandwidth=bw).fit(xy)
        z = kde.score_samples(xy)
        return z

    # Convert pandas series to numpy array
    if isinstance(x, pd.Series):
        x = x.values
    if isinstance(y, pd.Series):
        y = y.values
    assert x.shape == y.shape, "x and y must have equal lengths"
    if subsample is not None:
        idx = np.random.choice(x.shape[0], size=subsample, replace=False)
        x, y = x[idx], y[idx]
    xy = np.vstack([x, y])
    if method == 'sklearn':
        z = kde_sklearn(xy.T)
    else:
        kde = gaussian_kde(xy, bw_method=method, **kwargs)
        z = kde(xy)
    # Sort the points by density, so that the densest points are plotted last
    idx = z.argsort()
    return x[idx], y[idx], z[idx]
def scatter_density(x, y, density_kwargs=None, **kwargs):
    """Scatter plot where the point color encodes the local density."""
    if density_kwargs is None:
        density_kwargs = {}
    xc, yc, c = estimate_density(x, y, **density_kwargs)
    plt.scatter(xc, yc, c=c, **kwargs)
scatter_density(round_1, round_2, alpha=0.3)
plt.plot(xrange, xrange, 'w-')
# Linear regression (round 2 dependent on round 1; vertical least squares)
p_vls = np.polyfit(round_1, round_2, 1)
plt.plot(xrange, predict_linear(p_vls, xrange), 'r--')
# Linear regression (round 1 dependent on round 2; horizontal least squares)
p_hls = np.polyfit(round_2, round_1, 1)
plt.plot(predict_linear(p_hls, xrange), xrange, 'g--')
plt.xlabel('Round 1 throw distance')
plt.ylabel('Round 2 throw distance');
The data points form an elliptical cloud. The major axis of the ellipse falls along the 1-to-1 line (solid white line). This white line is also the fit that results from Total Least Squares (TLS), which minimizes the perpendicular distance of the data points to the line. The red line is a traditional Ordinary Least Squares (OLS) fit, minimizing the vertical distances to the line. The green line is also an OLS fit, but with the variables switched: it minimizes the horizontal distances to the line.
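As a sanity check (a small sketch not in the original analysis), the TLS slope can be estimated from the principal axis of the covariance matrix of the two rounds; it should come out close to one, while the OLS slope is noticeably smaller.
# TLS slope = direction of the major axis of the covariance matrix
cov = np.cov(round_1, round_2)
eigvals, eigvecs = np.linalg.eigh(cov)
major_axis = eigvecs[:, np.argmax(eigvals)]  # eigenvector of the largest eigenvalue
tls_slope = major_axis[1] / major_axis[0]
print(f'TLS slope: {tls_slope:.3f}, OLS slope (red line): {p_vls[0]:.3f}')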
You're given this data, and a new athlete joins the arena and performs round 1. How do you predict his round 2 performance?
predict_linear(p_vls, 20)  # red-line prediction for a 20 m round 1 throw
Many people would choose the white line: if he throws 70 m in round 1, then you predict the same outcome for round 2. However, this prediction is not optimal: the red line has a lower variance (i.e. a lower RMSE on average). The red line shows regression to the mean behaviour: it predicts less extreme outcomes. After a 70 m throw in round 1 (which is high), we predict a throw of about 66 m in round 2. On the low end, after a 20 m throw in round 1, we predict a throw of about 26 m. To see why these predictions are better, remember that the best possible prediction is the "skill factor". From the round 1 result alone, we cannot properly separate the contributions of luck and skill, but for above-average outcomes we can expect that, on average, both skill and luck contributed positively. The luck factor is drawn from a standard normal each time, so its expected value for round 2 is 0.
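To make this explicit, here is a small sketch (the helper throw_results_with_luck is hypothetical, not part of the original code) that also returns the luck factor, so we can verify that the best round 1 throwers were, on average, lucky, while the population's average luck is zero.
def throw_results_with_luck(skill, luck_strength=5):
    """Hypothetical variant of throw_results that also returns the luck factor."""
    luck_factor = np.random.randn(len(skill))
    return skill + luck_strength * luck_factor, luck_factor

r1_check, luck_check = throw_results_with_luck(skill)
top_100 = np.argsort(r1_check)[-100:]  # the 100 best round 1 throws
print(f'Mean luck factor of the top 100: {luck_check[top_100].mean():.2f}')
print(f'Mean luck factor of all athletes: {luck_check.mean():.2f}')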
Let's see how the red and white lines compare in root mean square error (RMSE). We observe 100 000 new athletes for two rounds.
def rmse(x, y):
    return np.sqrt(np.mean(np.square(x - y)))

def mean_error(true, pred):
    return np.mean(pred - true)
# Sample 100 000 new athletes
new_skill = np.random.normal(loc=50, scale=10, size=100000)
new_round_1 = throw_results(new_skill)
new_round_2 = throw_results(new_skill)
print('Comparing prediction for round 2 to observed round 2.')
# Red line
round_2_red_prediction = predict_linear(p_vls, new_round_1)
rmse_red = rmse(new_round_2, round_2_red_prediction)
me_red = mean_error(new_round_2, round_2_red_prediction)
print(f'VLS predictor (red) - RMSE = {rmse_red:.2f} m ; ME = {me_red:.4f}')
# White line: round 2 prediction = round 1 result
rmse_white = rmse(new_round_2, new_round_1)
me_white = mean_error(new_round_2, new_round_1)
print(f'1-to-1 predictor (white) - RMSE = {rmse_white:.2f} m ; ME = {me_white:.5f}')
print('Comparing prediction for round 2 to skill (best possible prediction).')
# Red line
round_2_red_prediction = predict_linear(p_vls, new_round_1)
rmse_red = rmse(new_skill, round_2_red_prediction)
me_red = mean_error(new_skill, round_2_red_prediction)
print(f'VLS predictor (red) - RMSE = {rmse_red:.2f} m ; ME = {me_red:.4f}')
# White line: round 2 prediction = round 1 result
rmse_white = rmse(new_skill, new_round_1)
me_white = mean_error(new_skill, new_round_1)
print(f'1-to-1 predictor (white) - RMSE = {rmse_white:.2f} m ; ME = {me_white:.5f}')
The red line has a lower RMSE than the white line: the red predictions are better on average.
In regression problems, it is always useful to compare the predicted values against the ground truth. However, there are a few ways to plot this.
# Predicted vs true
scatter_density(new_round_2, round_2_red_prediction, density_kwargs={'subsample': 5000}, alpha=0.3)
xlim = np.array([10, 90])
plt.plot(xlim, xlim, 'w-')
plt.xlabel('True')
plt.ylabel('Predicted');
It looks a bit tilted; let's plot the residuals...
scatter_density(new_round_2, round_2_red_prediction - new_round_2, density_kwargs={'subsample': 5000}, alpha=0.3)
xlim, ylim = [10, 88], [-33, 33]
plt.axhline(y=0, color='w')
plt.xlim(*xlim)
plt.ylim(*ylim)
plt.suptitle('Red (vertical least squares) predictor')
plt.xlabel('True')
plt.ylabel('Predicted - True');
At this point, you might wonder if the red predictor (vertical least squares) really is better than the white predictor (1-to-1)...
# new_round_1 is the white line prediction: the same result for round 2 as for round 1
scatter_density(new_round_2, new_round_1 - new_round_2, density_kwargs={'subsample': 5000}, alpha=0.3)
xlim, ylim = [10, 88], [-33, 33]
plt.axhline(y=0, color='w')
plt.xlim(*xlim)
plt.ylim(*ylim)
plt.suptitle('White (1-to-1) predictor')
plt.xlabel('True')
plt.ylabel('Predicted - True');
However, in these plots the x-axis shows the ground truth, which is not known in advance! It does not matter that we underpredict large values and overpredict small values: for these extreme cases, luck must have played a role, and the luck part cannot be predicted. Thus, it makes more sense to put the predicted value on the x-axis.
scatter_density(round_2_red_prediction, round_2_red_prediction - new_round_2, density_kwargs={'subsample': 5000}, alpha=0.3)
xlim, ylim = [10, 88], [-33, 33]
plt.axhline(y=0, color='w')
plt.xlim(*xlim)
plt.ylim(*ylim)
plt.suptitle('Red (vertical least squares) predictor')
plt.xlabel('Predicted')
plt.ylabel('Predicted - True');
By switching the x-axis from "True" to "Predicted", the results look a lot better. In this case, it IS important that the residuals average to zero in each bin of predicted value. If they do not, this might point to a problem with the model or the data (e.g. the training set does not represent the test set).
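As a quick numerical check of this (the bin edges below are an arbitrary choice), we can group the residuals of the red predictor by predicted value and verify that each bin's mean residual is close to zero.
# Mean residual per bin of predicted value; all means should be close to zero.
residuals = round_2_red_prediction - new_round_2
bins = np.linspace(30, 70, 9)  # arbitrary bin edges covering the bulk of the predictions
bin_idx = np.digitize(round_2_red_prediction, bins)
for i in range(1, len(bins)):
    mask = bin_idx == i
    if mask.sum() > 0:
        print(f'{bins[i-1]:.0f}-{bins[i]:.0f} m: mean residual {residuals[mask].mean():+.3f} ({mask.sum()} athletes)')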
So which plot should you use? I suggest using the predicted vs true plot to quickly judge how good your model is (both axes have a similar range, so the overall quality is easy to see at a glance). The residual vs predicted plot can then be used to ensure no bias is present.
As you can imagine, the amount by which you have to correct towards the mean is problem-dependent. In particular, it depends on the luck strength: the more luck is involved, the more you have to correct towards the mean. After collecting data from N athletes over two rounds, we can perform a linear fit as before. It can be shown that the model's slope will on average be equal to:
$$ E[\mathrm{slope}] = \frac{\mathrm{Var}[\mathrm{skill}]}{\mathrm{Var}[\mathrm{skill}] + (\mathrm{luck \, strength})^2} $$

The exact slope varies between draws (less so the more athletes are drawn), which is why we can only give its expected value. This slope can range from zero (when the luck strength dominates over skill) to one (when luck plays no role). The red and white predictors cross each other at the mean of the data; what matters is their slope. The red line's slope being smaller than one is what leads to the regression to the mean.
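A quick sketch of why this holds: the OLS slope of round 2 on round 1 equals the covariance of the two rounds divided by the variance of round 1. Writing each round as skill plus the luck strength times an independent standard normal luck term, the luck terms drop out of the covariance but add to the variance of round 1:

$$ \mathrm{slope} = \frac{\mathrm{Cov}[\mathrm{round}_1, \mathrm{round}_2]}{\mathrm{Var}[\mathrm{round}_1]} = \frac{\mathrm{Var}[\mathrm{skill}]}{\mathrm{Var}[\mathrm{skill}] + (\mathrm{luck \, strength})^2} $$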
luck_strength = 5
print(f'Expected slope: {np.var(skill) / (np.var(skill) + np.square(luck_strength)):.3f}, measured slope: {p_vls[0]:.3f}')
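To see the full range of this slope in practice, here is a small sketch (not in the original post) that repeats the fit for a few luck strengths: with little luck the slope approaches one, and with a lot of luck it approaches zero.
# Measured vs expected slope for a range of luck strengths
for ls in [0, 2, 5, 10, 30]:
    r1 = throw_results(skill, luck_strength=ls)
    r2 = throw_results(skill, luck_strength=ls)
    slope = np.polyfit(r1, r2, 1)[0]
    expected = np.var(skill) / (np.var(skill) + ls**2)
    print(f'luck strength {ls:>2}: measured slope {slope:.3f}, expected {expected:.3f}')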
In the example of the javelin throw, the state of interest (the "latent variable") is the "skill factor". However, this factor can only be determined after a large number of observations of each athlete. In most regression problems, however, we simply have independent variables (the input x) and a dependent variable (the prediction target y). For example, we may want to predict the amount of rain that will fall in the next hour from the current temperature and atmospheric pressure. In this case, the temperature and pressure give some information about the amount of rain that will fall, but not enough to perfectly predict it. The "skill" component is the part that we can predict, while the "luck" component is the part that cannot be predicted. Of course, the relation does not have to be linear for regression to the mean to hold. All that is required is some part that cannot be predicted: an imperfect correlation between the prediction target and the input.
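As a small illustration of the same effect in a generic regression setting (a sketch with made-up data, not the rain example itself), we can generate a target that is only partly predictable from the input, fit a linear model, and compare the spread of the predictions to the spread of the target.
x = np.random.randn(10000)
y = 0.6 * x + 0.8 * np.random.randn(10000)  # only part of y is predictable from x
p = np.polyfit(x, y, 1)
y_pred = predict_linear(p, x)
print(f'std of y: {y.std():.2f}, std of the predictions: {y_pred.std():.2f}')
The predictions have a smaller spread than the target itself: the unpredictable part is not reproduced, so extreme targets are predicted closer to the mean.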
While you now hopefully understand regression towards the mean, one should still be careful. Humans want to find an explanation for everything, even things that are purely the result of statistics. I highly recommend reading "Thinking, Fast and Slow" by Daniel Kahneman, which goes into much more depth on this. Adapting one of his stories to the javelin-throwing example, a commentator might say that an athlete who had a particularly good first throw is now stressed to secure his position with a good second throw, resulting in sweaty hands and thus worse performance, while an athlete with a bad first throw has nothing to lose and will improve on the next throw. No such explanation is needed: regression towards the mean is all it takes.
Regression towards the mean was first described by Sir Francis Galton in the late 19th century. See the Wikipedia page for more historical information.