Genotype Doesn't Regress to the Mean

If you run a simulation (see Appendix) in which each chromosome has a score and a person's genotype is simply the sum of these scores, you find that the slope of the child genotype regressed on the average parent genotype ("midparent genotype") is 1.0 and that the correlation (absent assortative mating) is 0.707, i.e. an r² of 0.5. In other words, genotypes shouldn't regress to the mean once both parents are taken into account.
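The two headline numbers can also be checked without running a simulation. The sketch below assumes the same simplified model as the appendix (23 chromosome pairs, independent per-chromosome scores, one chromosome per pair inherited from each parent, no assortative mating); the variable names are only illustrative.

```python
import math

CHROMOSOME_PAIRS = 23

# Each of a person's 46 chromosome scores has variance 1/46,
# so the genotype (their sum) has variance 1.
var_chromosome = 1 / (2 * CHROMOSOME_PAIRS)
var_genotype = 2 * CHROMOSOME_PAIRS * var_chromosome

# A child shares exactly 23 chromosomes with each parent.
cov_parent_child = CHROMOSOME_PAIRS * var_chromosome

# Midparent = (mom + dad) / 2, with mom and dad uncorrelated.
var_midparent = (var_genotype + var_genotype) / 4
cov_midparent_child = (cov_parent_child + cov_parent_child) / 2

slope = cov_midparent_child / var_midparent
correlation = cov_midparent_child / math.sqrt(var_midparent * var_genotype)

print("slope", round(slope, 3))
print("correlation", round(correlation, 3))
```

The slope comes out at 1 because the midparent value has only half the variance of an individual genotype, while its covariance with the child is the same 0.5 as for a single parent.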

To make this concrete, note that the following are all true of genotypes (absent assortative mating):

  • A 2 sigma mom will have a 1 sigma child in expectation.
  • A 2 sigma mom and a 2 sigma dad will have a 2 sigma child in expectation.
  • A couple whose average genotype is higher than that of 97.7% of couples (2 sigma among couples) will have a 1.414 sigma child in expectation.

The last two points can be reconciled: it is extremely rare for a 2 sigma mom to mate with a 2 sigma dad, so such a couple sits far above 2 sigma among couples (about 2.83 sigma, since the midparent value has a standard deviation of only 0.707).
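The arithmetic behind the three cases can be sketched as follows. This assumes standardized genotypes, uncorrelated parents, and a slope of 0.5 per parent, as in the appendix model; `expected_child` is a hypothetical helper, not part of the appendix code.

```python
import math

SLOPE_PER_PARENT = 0.5  # each parent contributes half their genotype in expectation


def expected_child(mom_sigma, dad_sigma=0.0):
    """Expected child genotype in sigmas; an unknown parent is taken as average (0)."""
    return SLOPE_PER_PARENT * mom_sigma + SLOPE_PER_PARENT * dad_sigma


# Standard deviation of the midparent value for two uncorrelated parents.
MIDPARENT_SD = math.sqrt(0.5)

# Case 1: only the mom is known to be 2 sigma.
case_mom_only = expected_child(2.0)

# Case 2: both parents are 2 sigma.
case_both = expected_child(2.0, 2.0)

# Case 3: the couple is 2 sigma among couples, i.e. midparent = 2 * 0.707.
couple_midparent = 2.0 * MIDPARENT_SD
case_couple = expected_child(couple_midparent, couple_midparent)

# How unusual is the couple from case 2, measured in couple sigmas?
both_two_sigma_as_couple = 2.0 / MIDPARENT_SD

print(case_mom_only, case_both, round(case_couple, 3), round(both_two_sigma_as_couple, 2))
```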

Remember, though, that all of this concerns genotype. Since genotype typically correlates only imperfectly with phenotype, phenotypes typically do display a tendency to regress to the mean, even when using the midparent value. In particular, the correlation and the regression slope are both scaled by the heritability of the phenotype.
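A minimal sketch of that scaling, assuming heritability `h2` is the share of phenotypic variance due to genotype, that phenotype is standardized, and that the non-genetic part is independent noise that is not passed on. The function name is hypothetical.

```python
import math


def midparent_child_phenotype_stats(h2):
    """Return (slope, correlation) of child phenotype on midparent phenotype."""
    # Covariance between one parent's and the child's phenotype:
    # h2 times the genotype covariance of 0.5.
    cov_parent_child = 0.5 * h2
    # Averaging two uncorrelated parents leaves the covariance unchanged
    # and halves the variance.
    cov_midparent_child = cov_parent_child
    var_midparent = 0.5
    var_child = 1.0
    slope = cov_midparent_child / var_midparent
    correlation = cov_midparent_child / math.sqrt(var_midparent * var_child)
    return slope, correlation


for h2 in (1.0, 0.8, 0.5):
    slope, correlation = midparent_child_phenotype_stats(h2)
    print(h2, round(slope, 3), round(correlation, 3))
```

With `h2 = 1.0` this reproduces the genotype result (slope 1.0, correlation 0.707); for lower values both numbers shrink in proportion to `h2`.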

A hypothetical trait that is 100% heritable should, therefore, not display any regression to the mean. This is rare in real life, but the lack of genotype regression also has implications for traits that aren't 100% heritable. In particular, it predicts that such traits should "regress" an expected 50% per generation absent assortative mating. Note that this 50% is completely independent of the degree of heritability.
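One way to read the 50% figure, offered here as an interpretation rather than the only possible one: follow a single line of descent from one ancestor whose phenotype is known, with every mate drawn at random. Under the assumptions that `h2` is the share of phenotypic variance due to genotype and that the non-genetic part is not inherited, the expected phenotype of each successive generation is half that of the one before, whatever `h2` is. The helper name below is hypothetical.

```python
def expected_descendant_phenotype(ancestor_sigma, h2, generations):
    """Expected phenotype (in sigmas) of a descendant `generations` steps
    below an ancestor with the given phenotype, mates chosen at random."""
    # h2 converts the ancestor's phenotype into the heritable part;
    # each generation then passes on half of it in expectation.
    return ancestor_sigma * h2 * 0.5 ** generations


for h2 in (1.0, 0.8, 0.4):
    line = [expected_descendant_phenotype(2.0, h2, k) for k in range(1, 5)]
    ratios = [later / earlier for earlier, later in zip(line, line[1:])]
    print(h2, [round(x, 3) for x in line], ratios)
```

Heritability sets the size of the first step, but the ratio between consecutive generations is 0.5 in every row.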

Measurement error, for instance, shows up as unshared environment and biases the heritability estimate downwards. Such error also creates the appearance of regression to the mean even where the underlying genotype does not regress.

Interestingly, "unshared environment" factors (e.g. measurement error) can cause a regression away from the mean for phenotypes. You see, absent such unshared effects, grandparents offer no predictive value after you've already accounted for parents. However, with such unshared effects, grandparents effectively increase the "sample size" and so decrease the noise created by unshared environment. In this way, a child's phenotype will likely regress towards their grandparents' phenotype even after conditioning on their parents' phenotype.
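The grandparent effect can be sketched analytically by regressing the child's phenotype on both the midparent phenotype and the average of the four grandparents' phenotypes. The sketch assumes standardized phenotypes, no assortative mating, purely unshared noise, and `h2` as the share of phenotypic variance due to genotype; the function name is hypothetical.

```python
def child_on_parents_and_grandparents(h2):
    """Population regression coefficients of child phenotype on
    (midparent phenotype, mid-grandparent phenotype)."""
    var_mid = 0.5            # mean of 2 uncorrelated unit-variance parents
    var_grand = 0.25         # mean of 4 uncorrelated unit-variance grandparents
    cov_mid_grand = h2 / 4   # each parent correlates h2/2 with their own 2 parents
    cov_child_mid = h2 / 2   # child correlates h2/2 with each parent
    cov_child_grand = h2 / 4 # child correlates h2/4 with each grandparent

    # Solve the 2x2 normal equations with Cramer's rule.
    det = var_mid * var_grand - cov_mid_grand ** 2
    b_mid = (cov_child_mid * var_grand - cov_child_grand * cov_mid_grand) / det
    b_grand = (var_mid * cov_child_grand - cov_mid_grand * cov_child_mid) / det
    return b_mid, b_grand


for h2 in (1.0, 0.8, 0.5):
    b_mid, b_grand = child_on_parents_and_grandparents(h2)
    print(h2, round(b_mid, 3), round(b_grand, 3))
```

At `h2 = 1.0` the grandparent coefficient is zero, matching the claim that grandparents add nothing once parents are known. Below that, the grandparent coefficient is positive.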

Appendix

import math
import numpy as np
import sklearn.linear_model

N = 100000
HERITABILITY = 0.8

# Each parent carries 23 chromosome pairs. Every chromosome gets an independent
# score, scaled so that a genotype (the sum of all 46 scores) has variance 1.
moms_genes = np.random.normal(0, 1/math.sqrt(46), (N, 23, 2))
dads_genes = np.random.normal(0, 1/math.sqrt(46), (N, 23, 2))
moms_genotype = np.sum(moms_genes, (1,2))
dads_genotype = np.sum(dads_genes, (1,2))

# HERITABILITY is the share of phenotypic variance due to genotype, so the
# genotype is weighted by its square root; the remainder is unshared noise.
def phenotype(genotype):
  noise = np.random.normal(0, 1, len(genotype))
  return math.sqrt(HERITABILITY) * genotype + math.sqrt(1-HERITABILITY) * noise

moms_phenotype = phenotype(moms_genotype)
dads_phenotype = phenotype(dads_genotype)

# For every pair, each parent passes on one of their two chromosomes at random.
moms_choice = np.random.randint(0, 2, (N, 23))
dads_choice = np.random.randint(0, 2, (N, 23))

# Slot 0 holds the chromosome inherited from mom, slot 1 the one from dad.
kids_genes = np.zeros((N, 23, 2))
kids_genes[:, :, 0] = np.where(moms_choice == 0, moms_genes[:, :, 0], moms_genes[:, :, 1])
kids_genes[:, :, 1] = np.where(dads_choice == 0, dads_genes[:, :, 0], dads_genes[:, :, 1])

kids_genotype = np.sum(kids_genes, (1,2))
kids_phenotype = phenotype(kids_genotype)

reg = sklearn.linear_model.LinearRegression().fit(np.array([moms_genotype, dads_genotype]).T, kids_genotype)
print('genotype (mom, dad)')
print('slopes', reg.coef_)
print('r2', reg.score(np.array([moms_genotype, dads_genotype]).T, kids_genotype))
print('')

reg = sklearn.linear_model.LinearRegression().fit(np.array([(moms_genotype + dads_genotype)/2]).T, kids_genotype)
print('genotype (midparent)')
print('slopes', reg.coef_)
print('r2', reg.score(np.array([(moms_genotype + dads_genotype)/2]).T, kids_genotype))
print('')

reg = sklearn.linear_model.LinearRegression().fit(np.array([moms_phenotype, dads_phenotype]).T, kids_phenotype)
print('phenotype (mom, dad)')
print('slopes', reg.coef_)
print('r2', reg.score(np.array([moms_phenotype, dads_phenotype]).T, kids_phenotype))
print('')

reg = sklearn.linear_model.LinearRegression().fit(np.array([(moms_phenotype + dads_phenotype)/2]).T, kids_phenotype)
print('phenotype (midparent)')
print('slopes', reg.coef_)
print('r2', reg.score(np.array([(moms_phenotype + dads_phenotype)/2]).T, kids_phenotype))
print('')