Causation from Correlation

In general, it isn't valid to infer causation from correlation. Indeed, it is also not valid to infer correlation from causation. This is generally solid advice, and most people should heed it more. However, there are some cracks in it.

The cracks start to form when you look at the word "valid", which, in this case, means "logically valid", which really means "universally valid". While logical validity is very nice, there are lots of times when we can, should, and do make inferences with less strict criteria.

The most famous is probably Bayes' theorem, which, while logically valid itself, allows us to find probabilistic evidence for and against hypotheses even when such evidence falls short of logical proof.

In a similar way, there are times you can use correlation as evidence of causation and, in some situations, it can be quite good evidence.

Limits on Causation With Correlation

At the heart of inferring causation from correlation is the causal graph model. Causal graphs are complicated to explain, so I'm going to elide a full explanation and use an example instead.

Suppose we find a bunch of people who are 40, record their incomes, and record their parents' incomes when they were 40. Suppose we regress child income on parent income and find a slope of 0.5. This suggests high-earning parents tend to have high-earning children, but the slope represents a correlation, not causation, so most people would say we haven't learned anything about whether parents having higher incomes causes their children to have higher incomes.

To understand why we can, in fact, determine something about causation, we have to examine the reason correlation doesn't imply causation. There are two reasons why a correlation of $r$ between $x$ and $y$ does not imply a causation of strength $r$:

  1. It's possible $y$ is actually causing $x$ or that both cause each other to some degree.
  2. It's possible there are some third factors, $z$, that cause both $x$ and $y$.

With our example, we can immediately rule out (1) since the only way for a child's income-at-40 to affect a parent's income-at-40 is with time travel.

This leaves the second option.

To simplify a bit, we can imagine income is determined as

$$ y = \alpha \cdot x + \beta_1 \cdot z_1 + \beta_2 \cdot z_2 + \beta_3 \cdot z_3 + \cdots $$

In other words, we're assuming child income is a linear sum of parent income and other variables.

However, (2) says that each $z$ also has a causal effect on $x$. We can model that as

$$ x = \gamma_1 \cdot z_1 + \gamma_2 \cdot z_2 + \gamma_3 \cdot z_3 + \cdots $$

Finally, assume each $z$ is independent and, to keep the algebra simple, that every variable is scaled to unit variance. Given these assumptions, the correlation between $x$ and $y$ should be

$$ \alpha + \beta_1 \cdot \gamma_1 + \beta_2 \cdot \gamma_2 + \beta_3 \cdot \gamma_3 + \cdots $$

Since $\alpha$ is the true causation, we might call $\beta_1 \cdot \gamma_1 + \beta_2 \cdot \gamma_2 + \beta_3 \cdot \gamma_3 + \cdots$ the "bias" of the correlation when estimating causation. If we can show the "bias" is greater than zero, then we know the causal effect must be smaller than the correlation.
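To make the bias term concrete, here is a small simulation sketch. It is not part of the original argument: the function name `simulate_slope`, the coefficient values, and the choice of normally distributed $z$s are all my own assumptions. It also assumes each $z$ has unit variance and rescales the $\gamma$s so that $x$ has unit variance, which is the situation in which the regression slope of $y$ on $x$ should match $\alpha$ plus the sum of the $\beta_i \cdot \gamma_i$ terms.

```python
import numpy as np


def simulate_slope(alpha, betas, gammas, n=200_000, seed=0):
    """Return (observed slope of y on x, alpha + sum(beta_i * gamma_i)).

    Assumptions (illustrative, not from the article): the z's are
    independent standard normals, and the gammas are rescaled so that
    x has unit variance.
    """
    rng = np.random.default_rng(seed)
    betas = np.asarray(betas, dtype=float)
    gammas = np.asarray(gammas, dtype=float)

    # Rescale so that Var(x) = sum(gamma_i^2) = 1.
    gammas = gammas / np.linalg.norm(gammas)

    # Independent, unit-variance confounders, one column per z.
    z = rng.standard_normal((n, len(betas)))

    # x is caused only by the z's; y is caused by x and by the z's.
    x = z @ gammas
    y = alpha * x + z @ betas

    # Regression slope of y on x: Cov(x, y) / Var(x).
    slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    predicted = alpha + float(betas @ gammas)
    return slope, predicted


if __name__ == "__main__":
    slope, predicted = simulate_slope(0.3, [0.5, 0.2, 0.1], [0.6, 0.3, 0.2])
    print(f"observed slope: {slope:.3f}")
    print(f"alpha + bias:   {predicted:.3f}")
```

With all $\beta$s and $\gamma$s positive, the observed slope should land well above the true $\alpha$ of 0.3 and close to the predicted value, which is the sense in which the correlation overstates the causal effect.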

I'd argue this is quite likely in this example.

For instance, suppose $z_1$ was grandparent income. I'd expect that to have a positive effect on both child and parent incomes, so $ \beta_1 \gt 0 $ and $ \gamma_1 \gt 0 $, which makes $ \beta_1 \cdot \gamma_1 \gt 0 $.

Likewise, suppose $z_2$ was whether the parent is white. Being white has a positive (if unethical) effect on both their own incomes and their children's incomes, so again $ \beta_2 \cdot \gamma_2 \gt 0$.

Indeed, I can think of lots and lots of similar things - things that cause both higher parent and higher child income. Likewise, I can think of things that cause both lower parent and lower child income.

On the other hand, thinking of things that cause parents to earn more and children to earn less is rather difficult.

For this reason, there is a very good chance that

$$ 0 \lt \beta_1 \cdot \gamma_1 + \beta_2 \cdot \gamma_2 + \cdots $$

This implies that the causal effect will be smaller than the observed correlation.

The Assumptions

A sharp reader will, of course, note that we made an awful lot of assumptions. However, many of the assumptions I made were for ease of illustration and weren't strictly necessary for the general conclusion.

For instance, we assumed $y$ was deterministically implied by the other variables. What if there's luck? There's an easy solution: just make one of our $z$s represent luck.

We also assumed each $z$ was independent. However, this assumption is unnecessary. We can add all the $z$s together into one everything-else variable and the same general result holds: if all the same-sign causations are stronger than the reverse-sign ones, the correlation will still place an upper bound on the causal strength.

Likewise, the effect need not be a weighted sum of the causes. For instance, we can model it with

$$ y = x^\alpha \cdot z_1^{\beta_1} \cdot z_2^{\beta_2} $$

and taking the logs will revert us back to the linear case.
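As a quick check of the log trick, the sketch below builds a multiplicative model and fits an ordinary least-squares regression on the logs. Everything specific here is an assumption for illustration: the helper names, the exponent values 0.4, 0.7 and 0.2, and the fact that $x$ is drawn independently of the $z$s (so this only shows the linearization, not the confounding).

```python
import numpy as np


def fit_log_linear(x, z1, z2, y):
    """Least-squares fit of log(y) on log(x), log(z1), log(z2).

    Returns the three fitted coefficients, which play the role of the
    exponents in the multiplicative model.
    """
    design = np.column_stack([np.log(x), np.log(z1), np.log(z2)])
    coefs, *_ = np.linalg.lstsq(design, np.log(y), rcond=None)
    return coefs


def recover_exponents(n=1_000, seed=1):
    """Generate y = x^0.4 * z1^0.7 * z2^0.2 and fit it in log space."""
    rng = np.random.default_rng(seed)
    # Strictly positive inputs so the logs are defined.
    x = rng.lognormal(size=n)
    z1 = rng.lognormal(size=n)
    z2 = rng.lognormal(size=n)
    y = x**0.4 * z1**0.7 * z2**0.2
    return fit_log_linear(x, z1, z2, y)


if __name__ == "__main__":
    print(recover_exponents())
```

Because the logged model is exactly linear with no noise term, the fitted coefficients should match the chosen exponents up to floating-point error.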

We can also support interactions like

$$ y = \alpha_1 \cdot x + \alpha_2 \cdot x \cdot z_1 + \beta_2 \cdot z_2 + \cdots $$

and there are many other formula-related generalizations.

Lax Bounds

Another critique is that linear correlations are often really high even when true effects are small, which means that even if we can theoretically bound causal relationships, this method is practically useless.

The strength of this critique depends entirely on the situation. In some cases, it is spot on, while in others it's overly pessimistic. That being said, there is a method we can use to mitigate this issue.

That method is to include controls in our regression.

For instance, in the income example, we can regress child income on parent income while including race in our linear model. This will reduce the "bias" between parent income and child income, allowing us to achieve a tighter upper bound on the causal strength.

Including more variables can bring the bias down lower and lower. The countervailing force is that the more variables we control for, the less likely we are to know the sign of our bias. Once we get to the point where that sign is unknown, this entire method falls apart.
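Here is a rough simulation of how controls tighten the bound. It is a sketch under made-up assumptions: two confounders `z1` and `z2` with positive effects on both variables, a true causal effect of 0.2, and hypothetical helper names `slope_on_x` and `control_demo`. Independent noise is added to both $x$ and $y$ (think of it as extra "luck" $z$s) so that $x$ is not an exact function of the controls.

```python
import numpy as np


def slope_on_x(x, y, controls=()):
    """OLS coefficient on x when regressing y on x, an intercept, and controls."""
    design = np.column_stack([x, np.ones_like(x), *controls])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs[0]


def control_demo(n=100_000, seed=2, alpha=0.2):
    """Return the slope on x with zero, one, and two confounders controlled."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal(n)
    z2 = rng.standard_normal(n)

    # Both confounders push x and y in the same direction (positive bias).
    x = 0.5 * z1 + 0.5 * z2 + rng.standard_normal(n)
    y = alpha * x + 0.4 * z1 + 0.3 * z2 + rng.standard_normal(n)

    return (
        slope_on_x(x, y),                       # no controls
        slope_on_x(x, y, controls=(z1,)),       # control for z1 only
        slope_on_x(x, y, controls=(z1, z2)),    # control for both
    )


if __name__ == "__main__":
    none, one, both = control_demo()
    print(f"no controls:  {none:.3f}")
    print(f"control z1:   {one:.3f}")
    print(f"control both: {both:.3f}")
```

Each added control should pull the estimate closer to the true effect from above. That only works here because every confounder was built with a same-sign effect; in real data the sign of what remains uncontrolled is exactly the thing that gets harder to know.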

An Example

College education is a good example. As already mentioned, there are lots of other variables that cause people both to obtain a higher degree and to earn more, like IQ, conscientiousness, the ability to postpone gratification, and enjoying tedious paperwork. On the other hand, I have a hard time thinking of any factors that cause more education but less income. For this reason, it's almost certain a Bachelor's degree causes less than a 70% increase in income.

What We Don't Know Can Hurt Us

This brings us to the final critique: that we can never truly know with certainty whether the bias is positive or negative.

This is definitely a problem and this method is sound only to the extent we know the "bias's" sign. Like the other critiques, this one is context-dependent and is important to keep in mind. However, also like the other critiques, this one doesn't refute the central point that, given some reasonable background knowledge, we can use correlation to place upper bounds on how strong the causal relationship between two variables can be.

Wikipedia contributors. (2020, May 18). Causal graph. In Wikipedia, The Free Encyclopedia. Retrieved 22:26, May 20, 2020, from https://en.wikipedia.org/w/index.php?title=Causal_graph&oldid=957368400