Regression towards the mean

When chance is involved, extreme cases will become more mediocre over time.

Topics: regression, data analysis, machine learning
Required knowledge: some basics in statistics (linear regression, random normal variables)

Run an interactive version of this repo using binder: Binder

Short summary

Many activities have a non-random component (e.g. skill) and a random component (luck). An extreme observation is, on average, due to a combination of both components. A new observation of the same case is therefore typically more average, since the luck component is redrawn.

A game of javelin throw

An athlete throwing a javelin

Sports are a great example of both skill and luck playing a role. Say we have N athletes doing javelin throws. Each has a "skill" factor, defined as their average throw distance (over many throws). For each throw we draw a "luck factor" from a standard normal distribution and multiply it by a "luck strength", which determines how much influence luck has.

In [1]:
import numpy as np

N = 5000

np.random.seed(1234)
skill = np.random.normal(loc=50, scale=10, size=N)

So that's 5000 athletes; let's do one round of throwing.

In [2]:
import matplotlib.pyplot as plt

plt.style.use(['dark_background'])

def throw_results(skill, luck_strength=5):
    """Draw results as random normal around each players skill."""
    
    luck_factor = np.random.randn(len(skill))
    results = skill + luck_strength * luck_factor
    return results

round_1 = throw_results(skill)
plt.hist(round_1, bins=30)
plt.xlabel('Throw distance (m)')
plt.ylabel('Number of athletes');

Some athletes performed really well; let's take a closer look and see how they perform in the next round!

In [3]:
top_ids = np.argsort(round_1)[-10:][::-1]  # indices of top 10 athletes
for idx in top_ids:
    print(f'Athlete {idx}: {round_1[idx]:.2f} m')
Athlete 1307: 87.09 m
Athlete 3588: 84.72 m
Athlete 3233: 84.21 m
Athlete 3143: 84.20 m
Athlete 4007: 83.65 m
Athlete 4803: 83.04 m
Athlete 700: 83.02 m
Athlete 4893: 82.52 m
Athlete 1333: 81.64 m
Athlete 825: 81.45 m

It's time for round 2. All athletes keep the same skill, but their luck is reset. Let's inspect the same 10 athletes as before.

In [4]:
round_2 = throw_results(skill)
for idx in top_ids:
    print(f'Athlete {idx}: {round_2[idx]:.2f} m (previously {round_1[idx]:.2f} m)')
Athlete 1307: 89.69 m (previously 87.09 m)
Athlete 3588: 65.63 m (previously 84.72 m)
Athlete 3233: 78.01 m (previously 84.21 m)
Athlete 3143: 75.61 m (previously 84.20 m)
Athlete 4007: 71.32 m (previously 83.65 m)
Athlete 4803: 69.93 m (previously 83.04 m)
Athlete 700: 73.50 m (previously 83.02 m)
Athlete 4893: 88.73 m (previously 82.52 m)
Athlete 1333: 83.82 m (previously 81.64 m)
Athlete 825: 74.41 m (previously 81.45 m)

Of the top 10 athletes from round 1, only three improved in round 2 (athletes 1307, 4893 and 1333); the rest threw shorter distances. Let's compare the round 1 results to round 2.

In [5]:
xrange = np.array([10, 88])

def predict_linear(p, x):
    """Linear prediction used with p from np.polysub."""
    
    return p[0] * x + p[1]

def estimate_density(x, y, method='scott', subsample=None, **kwargs):
    """
    Useful for scatterplots. This makes it possible to plot high density 
    regions with another color instead of just having overlapping dots.
    
    See wdplot.prep for full function with documentation.
    https://github.com/wdobbels/wdplot/blob/master/wdplot/prep.py
    """
    
    import pandas as pd
    from scipy.stats import gaussian_kde

    def kde_sklearn(xy):
        from sklearn.neighbors import KernelDensity
        from sklearn.model_selection import GridSearchCV, LeaveOneOut

        kwargs.setdefault('cv', None)
        kwargs.setdefault('refit', False)
        # Sample bandwidth as log-uniform
        bandwidths = np.logspace(-1, 1, kwargs.pop('n_bandwidths', 30))
        grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                            {'bandwidth': bandwidths},
                            **kwargs)
        grid.fit(xy)
        bw = grid.best_params_['bandwidth']
        kde = KernelDensity(bandwidth=bw).fit(xy)
        z = kde.score_samples(xy)
        return z

    # Convert pandas series to numpy array
    if isinstance(x, pd.core.series.Series):
        x = x.values
    if isinstance(y, pd.core.series.Series):
        y = y.values
    assert x.shape == y.shape, "x and y must have equal lengths"

    if subsample is not None:
        idx = np.random.choice(x.shape[0], size=subsample, replace=False)
        x, y = x[idx], y[idx]
    xy = np.vstack([x, y])
    if method == 'sklearn':
        z = kde_sklearn(xy.T)
    else:
        kde = gaussian_kde(xy, bw_method=method, **kwargs)
        z = kde(xy)
    # Sort the points by density, so that the densest points are plotted last
    idx = z.argsort()
    return x[idx], y[idx], z[idx]

def scatter_density(x, y, density_kwargs=None, **kwargs):
    if density_kwargs is None:
        density_kwargs = {}
    xc, yc, c = estimate_density(x, y, **density_kwargs)
    plt.scatter(xc, yc, c=c, **kwargs)
    
scatter_density(round_1, round_2, alpha=0.3)
plt.plot(xrange, xrange, 'w-')

# Linear regression (round 2 dependent on round 1; vertical least squares)
p_vls = np.polyfit(round_1, round_2, 1)
plt.plot(xrange, predict_linear(p_vls, xrange), 'r--')
# Linear regression (round 1 dependent on round 2; horizontal least squares)
p_hls = np.polyfit(round_2, round_1, 1)
plt.plot(predict_linear(p_hls, xrange), xrange, 'g--')

plt.xlabel('Round 1 throw distance')
plt.ylabel('Round 2 throw distance');

The data points form an ellipsoidal cloud. The major axis of the ellipse falls along the 1-to-1 line (solid white line). This white line is also the fit that results from Total Least Squares (TLS), which minimizes the perpendicular distance of the data points to the line. The red line is a traditional Ordinary Least Squares (OLS) fit, minimizing the vertical distances to the line. The green line is also an OLS fit, but with the variables switched: it minimizes the horizontal distances to the line.
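
As a quick sanity check of the claim that the white line matches a TLS fit (an added sketch, not part of the original analysis), the TLS slope can be estimated from the first principal axis of the centred data and compared to the two OLS slopes:

# TLS slope from the first principal axis of the centred data (via SVD),
# compared to the two OLS slopes fitted above.
xy_centred = np.vstack([round_1 - round_1.mean(), round_2 - round_2.mean()]).T
_, _, vt = np.linalg.svd(xy_centred, full_matrices=False)
tls_slope = vt[0, 1] / vt[0, 0]  # slope of the major axis

print(f'TLS slope (white line):          {tls_slope:.3f}  (expected close to 1)')
print(f'OLS slope, round 2 on 1 (red):   {p_vls[0]:.3f}')
print(f'OLS slope, round 1 on 2 (green): {1 / p_hls[0]:.3f}  (in round-2-vs-round-1 space)')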

A new athlete enters the arena...

You're given this data, and a new athlete joins the arena and performs round 1. How do you predict their round 2 performance?

In [6]:
predict_linear(p_vls, 20)
Out[6]:
26.435020904992875

Many people would choose the white line: if he throws 70 m in round 1, you predict the same outcome for round 2. However, this prediction is not optimal: the red line has a lower expected error (i.e. a lower RMSE on average). The red line shows regression-to-the-mean behaviour: it predicts less extreme outcomes. After a 70 m throw in round 1 (which is high), we predict a throw of about 66 m in round 2. On the low end, after a 20 m throw in round 1, we predict a throw of about 26 m. To see why these predictions are better, remember that the best possible prediction is the skill factor. From a single round 1 result we cannot properly separate the contributions of luck and skill, but for above-average outcomes we can expect that, on average, both skill and luck contributed positively. The luck factor is drawn from a standard normal each time, so its expected value in round 2 is zero.
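
As a quick check of the numbers above (an added cell, reusing the fitted red line), the prediction for a 70 m first throw can be computed directly; based on the fit above it should come out around 66 m:

# Red-line prediction for a strong 70 m first throw (roughly 66 m expected)
predict_linear(p_vls, 70)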

Note: the skill factor can only be determined from data after many observations of the same athlete. The red line is fit on 5000 athletes over two rounds, but it can be used for a new athlete after a single round in order to predict their second round.

Let's see how the red and white lines compare in root mean square error (RMSE). We observe 100 000 new athletes for two rounds.

In [7]:
def rmse(x, y):
    return np.sqrt(np.mean(np.square(x - y)))

def mean_error(true, pred):
    return np.mean(pred - true)

# Sample 100 000 new athletes
new_skill = np.random.normal(loc=50, scale=10, size=100000)
new_round_1 = throw_results(new_skill)
new_round_2 = throw_results(new_skill)

print('Comparing prediction for round 2 to observed round 2.')
# Red line
round_2_red_prediction = predict_linear(p_vls, new_round_1)
rmse_red = rmse(new_round_2, round_2_red_prediction)
me_red = mean_error(new_round_2, round_2_red_prediction)
print(f'VLS predictor (red) - RMSE = {rmse_red:.2f} m ; ME = {me_red:.4f}')

# White line: round 2 prediction = round 1 result
rmse_white = rmse(new_round_2, new_round_1)
me_white = mean_error(new_round_2, new_round_1)
print(f'1-to-1 predictor (white) - RMSE = {rmse_white:.2f} m ; ME = {me_white:.5f}')
Comparing prediction for round 2 to observed round 2.
VLS predictor (red) - RMSE = 6.73 m ; ME = 0.1178
1-to-1 predictor (white) - RMSE = 7.09 m ; ME = 0.04171
In [8]:
print('Comparing prediction for round 2 to skill (best possible prediction).')

# Red line
round_2_red_prediction = predict_linear(p_vls, new_round_1)
rmse_red = rmse(new_skill, round_2_red_prediction)
me_red = mean_error(new_skill, round_2_red_prediction)
print(f'VLS predictor (red) - RMSE = {rmse_red:.2f} m ; ME = {me_red:.4f}')

# white line: round 2 prediction = round 1 result
rmse_white = rmse(new_skill, new_round_1)
me_white = mean_error(new_skill, new_round_1)
print(f'1-to-1 predictor (white) - RMSE = {rmse_white:.2f} m ; ME = {me_white:.5f}')
Comparing prediction for round 2 to skill (best possible prediction).
VLS predictor (red) - RMSE = 4.49 m ; ME = 0.0807
1-to-1 predictor (white) - RMSE = 5.00 m ; ME = 0.00466

The red line has a lower RMSE than the white line: the red predictions are better on average.

Note: the mean error is lower for the white line predictor in this case. However, it can be shown that for both predictors the mean error goes to zero as the sample size increases: neither predictor is biased.
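
The small residual bias of the red predictor comes from fitting its line on a finite sample of 5000 athletes. As an added sketch (values fluctuate from run to run), one can refit the line on larger and larger samples and check that the mean error on fresh data shrinks towards zero:

# Refit the red line on increasingly large samples and evaluate the mean
# error on a fixed set of fresh athletes; it should drift towards zero.
test_skill = np.random.normal(loc=50, scale=10, size=100_000)
test_r1, test_r2 = throw_results(test_skill), throw_results(test_skill)
for n_fit in [1_000, 10_000, 100_000]:
    fit_skill = np.random.normal(loc=50, scale=10, size=n_fit)
    fit_r1, fit_r2 = throw_results(fit_skill), throw_results(fit_skill)
    p_fit = np.polyfit(fit_r1, fit_r2, 1)
    me = mean_error(test_r2, predict_linear(p_fit, test_r1))
    print(f'fit on {n_fit:>7,d} athletes: ME = {me:+.4f}')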

Different ways of showing the error

In regression problems, it is always useful to compare the predicted values against the ground truth. However, there are a few ways to plot this.

In [9]:
# Predicted vs true
scatter_density(new_round_2, round_2_red_prediction, density_kwargs={'subsample': 5000}, alpha=0.3)
xlim = np.array([10, 90])
plt.plot(xlim, xlim, 'w-')
plt.xlabel('True')
plt.ylabel('Predicted');

It looks a bit tilted; let's plot the residuals...

In [10]:
scatter_density(new_round_2, round_2_red_prediction - new_round_2, density_kwargs={'subsample': 5000}, alpha=0.3)
xlim, ylim = [10, 88], [-33, 33]
plt.axhline(y=0, color='w')
plt.xlim(*xlim)
plt.ylim(*ylim)
plt.suptitle('Red (vertical least squares) predictor')
plt.xlabel('True')
plt.ylabel('Predicted - True');

At this point, you might wonder if the red predictor (vertical least squares) really is better than the white predictor (1-to-1)...

In [11]:
# new_round_1 is the white line prediction: the same result for round 2 as for round 1
scatter_density(new_round_2, new_round_1 - new_round_2, density_kwargs={'subsample': 5000}, alpha=0.3)
xlim, ylim = [10, 88], [-33, 33]
plt.axhline(y=0, color='w')
plt.xlim(*xlim)
plt.ylim(*ylim)
plt.suptitle('White (1-to-1) predictor')
plt.xlabel('True')
plt.ylabel('Predicted - True');

However, in these plots the x-axis shows the ground truth, which is not known in advance! It does not matter that we underpredict large values and overpredict small values: for these extreme cases luck must have played a role, and the luck part cannot be predicted. Thus, it makes more sense to put the predicted value on the x-axis.

Note: the distribution of the predicted values (for the red predictor) will be narrower than that of the real data (the standard deviation will be smaller). While extreme cases tend to be more average in the next round, average cases can get a stroke of (good or bad) luck and become the next round's extremes. Keep in mind that the luck factor cannot be predicted.

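This narrowing can be checked directly (an added snippet; the exact numbers depend on the random draw):

# The spread of the red-line predictions should be clearly smaller than the
# spread of the actual round 2 results: the predictions avoid the extremes.
print(f'std of predicted round 2 distances: {np.std(round_2_red_prediction):.2f} m')
print(f'std of observed round 2 distances:  {np.std(new_round_2):.2f} m')
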
In [12]:
scatter_density(round_2_red_prediction, round_2_red_prediction - new_round_2, density_kwargs={'subsample': 5000}, alpha=0.3)
xlim, ylim = [10, 88], [-33, 33]
plt.axhline(y=0, color='w')
plt.xlim(*xlim)
plt.ylim(*ylim)
plt.suptitle('Red (vertical least squares) predictor')
plt.xlabel('Predicted')
plt.ylabel('Predicted - True');

By switching the x-axis from "True" to "Predicted", the results look a lot better. In this case, it IS important that, per bin of predicted values, the residuals average to zero. If this is not the case, there might be a problem with the model or the data (e.g. the training set does not represent the test set).

Note: an extreme example is when the target variable is purely random each time: only luck, no skill. The only sensible prediction is then to always predict the average (this is called the baseline predictor). In the residual vs predicted plot, we get a vertical bar centered at zero: no bias. In the residual vs true plot, we get a line with slope -1: we overpredict all values below the average and underpredict all values above it. This does not make the predictor biased, however, which is why I find this plot rather misleading.
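
Here is a small added sketch of that pure-luck scenario (a plain scatter is used instead of scatter_density, since all points fall exactly on a line): the target is pure noise, the baseline predictor always predicts the mean, and the residual vs true plot shows the slope of -1 described above.

# Pure-luck target: the baseline predictor always predicts the overall mean.
pure_luck = np.random.normal(loc=50, scale=10, size=5000)
baseline_pred = np.full_like(pure_luck, pure_luck.mean())
residual = baseline_pred - pure_luck  # falls exactly on a line with slope -1

plt.scatter(pure_luck, residual, alpha=0.3)
plt.axhline(y=0, color='w')
plt.suptitle('Baseline predictor, pure-luck target')
plt.xlabel('True')
plt.ylabel('Predicted - True');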

So which plot should you use? I suggest the predicted vs true plot to quickly judge how good your model is (both axes have a similar range, so the scatter around the 1-to-1 line is easy to read). The residual vs predicted plot can then be used to check that no bias is present.

Strength of the regression to the mean

As you can imagine, the amount by which you have to correct towards the mean is problem dependent. In particular, it depends on the luck strength: the more luck is involved, the more has to be corrected towards the mean. After collecting two rounds of data for N athletes, we can perform a linear fit as before. It can be shown that the model's slope will on average be equal to:

$$ E[\mathrm{slope}] = \frac{\mathrm{Var}[\mathrm{skill}]}{\mathrm{Var}[\mathrm{skill}] + (\mathrm{luck \, strength})^2} $$

The exact slope varies between draws (less so the more athletes are drawn), which is why we can only calculate the expected slope. This slope ranges from zero (when the luck strength dominates over skill) to one (when luck plays no role). The red and white lines cross (approximately) at the mean of the data; what matters is their slope. The slope of the red line being smaller than one is what causes the regression to the mean.

In [13]:
luck_strength = 5
print(f'Expected slope: {np.var(skill) / (np.var(skill) + np.square(luck_strength)):.3f}, measured slope: {p_vls[0]:.3f}')
Expected slope: 0.797, measured slope: 0.788
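
To see the slope move between these two limits, the same check can be repeated for a range of luck strengths (an added sketch; measured values will fluctuate a bit between runs):

# Expected vs measured slope for several luck strengths:
# slope -> 1 when luck is negligible, slope -> 0 when luck dominates skill.
for ls in [0, 2, 5, 10, 30]:
    r1 = throw_results(skill, luck_strength=ls)
    r2 = throw_results(skill, luck_strength=ls)
    measured = np.polyfit(r1, r2, 1)[0]
    expected = np.var(skill) / (np.var(skill) + ls ** 2)
    print(f'luck strength {ls:>2d}: expected slope {expected:.3f}, measured {measured:.3f}')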

More general prediction problems

In the javelin example, the latent variable of interest is the skill factor. However, this factor can only be determined after a large number of observations of each athlete. In most regression problems we simply have independent variables (the input x) and dependent variables (the prediction target y). For example, we may want to predict the amount of rain that will fall in the next hour from the current temperature and atmospheric pressure. The temperature and pressure give some information about the upcoming rainfall, but not enough to predict it perfectly. The "skill" component is the part that we can predict, while the "luck" component is the part that cannot be predicted. The relation does not have to be linear for regression to the mean to hold. All that is required is some part that cannot be predicted: an imperfect correlation between the prediction target and the input.
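
To illustrate that linearity is not required, here is an added sketch with a nonlinear signal plus unpredictable noise (the functional form and numbers are purely illustrative): the predictions of a fitted model are less spread out than the observed targets, just like the red line's predictions were.

# Nonlinear "skill" part plus unpredictable "luck" part: the fitted
# predictions are still less extreme than the observed targets.
x = np.random.uniform(-3, 3, size=5000)
y = 3 * np.sin(x) + x + np.random.normal(scale=1.0, size=5000)  # signal + noise

coeffs = np.polyfit(x, y, 3)   # a cubic fit as a stand-in for "any model"
y_pred = np.polyval(coeffs, x)

print(f'std of observed targets:  {np.std(y):.2f}')
print(f'std of model predictions: {np.std(y_pred):.2f}  (smaller: less extreme)')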

Narrative fallacy

While you hopefully now understand regression towards the mean, you should still be careful. Humans want to give an explanation for everything, even things that are purely the result of statistics. I highly recommend reading "Thinking, Fast and Slow" by Daniel Kahneman, which goes into much more depth on this. Adapting one of his stories to the javelin example: a commentator might say that an athlete who had a particularly good first throw is now stressed to secure his position with a good second throw, resulting in sweaty hands and thus worse performance, while an athlete with a bad first throw has nothing to lose and will improve on the next throw. This explanation is not needed at all: regression towards the mean is all we need.

Regression towards the mean was first described by Sir Francis Galton in the late 19th century. See the Wikipedia page for more historical background.