
Lazy Regression

This family of formulas is a fairly powerful tool if you are curious about altering an analysis you find in a paper whose authors report only slopes/correlations, not the raw data itself.

Claim

Suppose I have three standard normal variables: $X$, $Y$, and $Z$. Suppose I know the correlation between each pair of variables (which, since the variables are standardized, is the same as the covariance).

Now suppose I want to perform a multivariable linear regression in which I predict $Z$ as a linear combination of $X$ and $Y$. Can I do that knowing only the pairwise correlations?

Yes.

Let our model be

$$ Z = a_x X + a_y Y $$

then it follows that

$$ a_x = \frac{r_{XZ} - r_{XY} r_{YZ}}{1 - r_{XY}^2} $$ $$ a_y = \frac{r_{YZ} - r_{XY} r_{XZ}}{1 - r_{XY}^2} $$

The proof is left as an exercise for the reader, but it's fairly straightforward from the properties of variance and covariance.
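For instance, the least-squares residual is uncorrelated with each regressor, so taking the covariance of the model with $X$ and then with $Y$ (and using the fact that every variance is $1$) gives the normal equations

$$ r_{XZ} = a_x + a_y r_{XY} $$ $$ r_{YZ} = a_x r_{XY} + a_y $$

and solving this two-equation system for $a_x$ and $a_y$ yields the formulas above.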

Examples

  • If $X$ and $Y$ are uncorrelated, then $r_{XY}=0$, and the coefficients reduce to just $a_x=r_{XZ}$ and $a_y=r_{YZ}$.
  • If $r_{YZ} = r_{XY}r_{XZ}$, then $a_y=0$. In other words, if a plausible causal graph is that $Y$'s only influence on $Z$ is through $X$, then $a_y=0$.
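If you want to convince yourself numerically, here is a minimal sketch (using numpy and an arbitrary made-up correlation matrix, not values from any real paper) comparing the closed-form coefficients against an actual least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pairwise correlations (any positive-definite choice works).
r_xy, r_xz, r_yz = 0.3, 0.5, 0.4
corr = np.array([[1.0, r_xy, r_xz],
                 [r_xy, 1.0, r_yz],
                 [r_xz, r_yz, 1.0]])

# Sample standard normal X, Y, Z with those correlations.
X, Y, Z = rng.multivariate_normal(np.zeros(3), corr, size=200_000).T

# Coefficients from the closed-form claim.
a_x = (r_xz - r_xy * r_yz) / (1 - r_xy**2)
a_y = (r_yz - r_xy * r_xz) / (1 - r_xy**2)

# Coefficients from an actual least-squares fit on the samples.
fit, *_ = np.linalg.lstsq(np.column_stack([X, Y]), Z, rcond=None)

print(a_x, a_y)  # closed form
print(fit)       # empirical fit; agrees up to sampling noise
```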

Extension to Residuals

Suppose you perform a simple linear regression to predict $Z$ from $X$ and are left with the residuals $R = Z - r_{XZ} X$. What is the slope from regressing $R$ on $Y$?

The answer is

$$ r_{YZ} - r_{XZ} r_{XY} $$
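To see this, recall that $\mathrm{Var}(Y) = 1$, so the slope is just the covariance:

$$ \mathrm{Cov}(Y, R) = \mathrm{Cov}(Y, Z - r_{XZ} X) = r_{YZ} - r_{XZ} r_{XY} $$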

What if you instead regress $Y$ on $X$, keep those residuals $R' = Y - r_{XY} X$, and ask for the correlation between $Z$ and $R'$ (the semipartial correlation)? Then the answer is

$$ \frac{r_{YZ} - r_{XZ} r_{XY}}{\sqrt{1 - r_{XY}^2}} $$
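The numerator is the same covariance as before; the denominator is the standard deviation of the residual, since

$$ \mathrm{Var}(Y - r_{XY} X) = 1 - 2 r_{XY}^2 + r_{XY}^2 = 1 - r_{XY}^2 $$

(with $\mathrm{Var}(Z) = 1$ contributing nothing further).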

Extension to Slopes

From the above, you can quite easily prove similar formulas for slopes, where you no longer have to assume the variables are standardized. For instance, if the authors report the simple slope for each pair of variables, you can compute what the multivariable slopes would have been:

$$ Z = a_x X + a_y Y $$

where $s_{ab}$ denotes the simple slope from regressing $a$ on $b$ (so $s_{ab} = \mathrm{Cov}(a,b)/\mathrm{Var}(b)$), and

$$ a_x = \frac{s_{zx} - s_{zy} s_{yx}}{1 - s_{yx} s_{xy}} $$

or, equivalently, since $s_{yx} s_{xy} = r_{xy}^2$:

$$ a_x = \frac{s_{zx} - s_{zy} s_{yx}}{1 - r_{xy}^2} $$
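The same normal-equations trick from the standardized case works here: take $\mathrm{Cov}(Z, X) = a_x \mathrm{Var}(X) + a_y \mathrm{Cov}(X, Y)$ and divide through by $\mathrm{Var}(X)$ (and the analogous equation by $\mathrm{Var}(Y)$) to get

$$ s_{zx} = a_x + a_y s_{yx} $$ $$ s_{zy} = a_x s_{xy} + a_y $$

then solve for $a_x$ and $a_y$.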

Pattern

It doesn't take a genius to recognize that all the above formulas take the form (after relabeling which variable plays which role)

$$ \frac{f(X,Z) - f(X,Y)f(Y,Z)}{\left( 1 - f(X,Y) f(Y,X) \right)^k} $$

where $f$ is either the correlation, covariance, or slope function.

Per our residual example, the numerator can be read as asking "what is the slope between $X$ and $Z$ after controlling for $Y$?".

The denominator likewise has the same form across all the specifications; only the exponent differs, with $k$ one of $0$, $1$, or $1/2$.

Extension to Multiple Dimensions

All the above can be extended to any number of variables. Consider ordinary multivariate linear regression, where $X$ is an n-by-d matrix in which each row is a datapoint, and $Y$ is an n-by-1 vector in which each row is the response you're predicting. Then the ordinary least-squares coefficients are

$$ (X^T X)^{-1} X^T Y $$

Note that $X^T X$ is just the covariance matrix (call it $C$) of the explanatory variables, multiplied by $n$ (assuming the columns are centered at zero, as standardized variables are). So we can rewrite the above:

$$ = (C \cdot n)^{-1} X^T Y $$

Also note that $X^T Y$ is just the vector of covariances between each explanatory variable and the response, again multiplied by $n$. By the same substitution:

$$ = (C \cdot n)^{-1} \mathrm{cov}(X, Y) \cdot n $$

$$ = C^{-1} \mathrm{cov}(X, Y) $$

There are similar extensions for the other models.
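As a sanity check, here is a minimal numpy sketch (with a made-up design matrix and made-up true coefficients, so everything in it is hypothetical) confirming that $C^{-1} \mathrm{cov}(X, Y)$ matches an ordinary least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical centered design: n datapoints, d correlated explanatory variables.
n, d = 100_000, 3
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))
X -= X.mean(axis=0)
beta_true = np.array([0.5, -1.0, 2.0])  # made-up coefficients
Y = X @ beta_true + rng.standard_normal(n)

# Covariance matrix of the explanatory variables ...
C = np.cov(X, rowvar=False)
# ... and the covariance of each explanatory variable with the response.
c_xy = np.array([np.cov(X[:, j], Y)[0, 1] for j in range(d)])

beta_cov = np.linalg.solve(C, c_xy)               # C^{-1} cov(X, Y)
beta_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)  # ordinary least squares

print(beta_cov)  # recovers ~beta_true from covariances alone
print(beta_ols)  # matches beta_cov up to floating-point error
```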

Wikipedia contributors. (2022, June 20). Variance. In Wikipedia, The Free Encyclopedia. Retrieved 22:03, August 5, 2022, from https://en.wikipedia.org/w/index.php?title=Variance&oldid=1093973076#Properties

Wikipedia contributors. (2022, August 2). Covariance. In Wikipedia, The Free Encyclopedia. Retrieved 22:03, August 5, 2022, from https://en.wikipedia.org/w/index.php?title=Covariance&oldid=1101892543#Properties