College Rankings: Fixed Effect Estimation
[ Part of a sequence of posts constructing my own college rankings. ]
Data Source
In the previous post, we established a model in which we predicted child income from parent income, child SAT score, and the college attended. In this post, we consider various sources of data so that we can apply this model to estimate college fixed effects (FEs), since Chetty et al irritatingly do not report them.
The most difficult (yet crucial) data to get is income data for students/alumni and their parents. I know of three relevant sources:
- College Salary Report - Payscale (Payscale's 2021-22 College Salary Report)
- College Scorecards - U.S. Department of Education (College Scorecard)
- Opportunity Insights (Data Library: Publicly available data we've produced and replication code; Mobility report cards: The role of colleges in intergenerational mobility; Income segregation and intergenerational mobility across colleges in the United States)
All the previous college rankings I found that made any pretense of being rooted in economics used the first two sources, even though both have major limitations. The Payscale data is self-reported and accumulated via a very non-random sample (College Salary Report Methodology). The College Scorecard data only covers students who receive federal financial aid (Data Documentation, College Scorecard), which is obviously a very non-random sample of students.
The last data source doesn't suffer from these problems. It also includes parent income statistics by college and the SAT scores from the same time period as the original study. The reason is simple: this dataset was constructed by the original paper.
This is obviously ideal for reconstructing Chetty et al's analysis and FE estimates. It is also convenient since it aids in analysis later in the sequence.
Fixed Effect Estimation
The baseline analysis is straightforward. First, we just download Table 2: Baseline Cross-Sectional Estimates of Child and Parent Income Distributions by College. The relevant columns are
- super_opeid: Institution OPEID / Cluster ID when combining multiple OPEIDs
- name: name of college (or college group)
- par_rank: mean parental income rank
- k_rank: mean kid earnings rank
See Income segregation and intergenerational mobility across colleges in the United States (and its appendix) for details.
Next, we download the SAT data from Table 10: College Level Characteristics from the IPEDS Database and the College Scorecard. The relevant columns are
- super_opeid
- name
- sat_avg_2001: Average SAT scores (scaled to 1600) in 2001, defined as the mean of the 25th and 75th percentile of math+verbal SAT scores. Missing for institutions that do not require SAT.
I joined the two tables using super_opeid, dropping colleges that were missing any of the relevant fields. This left 814 colleges.
Finally, I computed the estimated FE column as follows:
college_fe = k_rank - 0.100 * par_rank - 1.27 * (sat_avg_2001/100)
For those of you just tuning in, this formula comes from my analysis in the previous post.
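The join and the formula above can be sketched in a few lines of Python. This is only a minimal illustration, not the spreadsheet itself: the function name and the sample rows are made up, and it assumes k_rank and par_rank are expressed as percentiles on a 0-100 scale (rescale first if the downloaded table stores them as fractions).

```python
# Coefficients taken from the formula in the post.
PAR_RANK_COEF = 0.100
SAT_COEF = 1.27  # per 100 SAT points


def estimate_college_fes(income_rows, sat_rows):
    """Join two tables on super_opeid and compute college_fe.

    Colleges missing any relevant field are dropped.
    Assumption: k_rank and par_rank are percentiles (0-100).
    """
    sat_by_id = {row["super_opeid"]: row.get("sat_avg_2001") for row in sat_rows}
    results = []
    for row in income_rows:
        sat = sat_by_id.get(row["super_opeid"])
        needed = (row.get("name"), row.get("par_rank"), row.get("k_rank"), sat)
        if any(value is None for value in needed):
            continue  # drop colleges with missing data
        fe = row["k_rank"] - PAR_RANK_COEF * row["par_rank"] - SAT_COEF * (sat / 100)
        results.append(
            {"super_opeid": row["super_opeid"], "name": row["name"], "college_fe": fe}
        )
    return results


# Made-up rows for illustration only (not real data).
income_rows = [
    {"super_opeid": 1, "name": "College A", "par_rank": 60.0, "k_rank": 70.0},
    {"super_opeid": 2, "name": "College B", "par_rank": 50.0, "k_rank": 55.0},
]
sat_rows = [
    {"super_opeid": 1, "name": "College A", "sat_avg_2001": 1200.0},
    {"super_opeid": 2, "name": "College B", "sat_avg_2001": None},  # no SAT: dropped
]

fes = estimate_college_fes(income_rows, sat_rows)
print(fes)
```

Only College A survives the join in this toy example, since College B has no SAT average.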
The analysis and results can be found in this spreadsheet.
Validation
To validate the above analysis, we again turn to the appendix of Chetty et al (Income segregation and intergenerational mobility across colleges in the United States). The authors construct a model using student-level data where they predict k_rank using five bins for SAT scores, five bins for parent income scores, and a continuous variable for the average SAT score of the college each student attended. In column 1 from Table XV, they report the slope between school-SAT-score and k_rank: 0.016.
This is not identical to the model we reverse-engineered above, but it offers a useful sanity check. When I regress our FE estimates on colleges' average SAT scores, I find a slope of 0.014.
These two slopes are pretty close. It's possible this suggests our model has slightly overcorrected for SAT score, but I think making that kind of fine-grained conclusion based on this broad-strokes validation is a mistake.
So, in the end, I view this as validating the analysis, and I consider Chetty et al's paper as being successfully reverse engineered.
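For readers who want to repeat this kind of check, here is a minimal sketch of a least-squares slope of estimated FEs on average SAT scores. The function name and the numbers are illustrative assumptions, not the actual data, and the units must match those used in Table XV for the comparison to be meaningful.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("xs must not all be identical")
    return sxy / sxx


# Made-up values for illustration only (not real data).
sat_scores = [1000.0, 1200.0, 1400.0]
college_fes = [10.0, 12.0, 14.0]
slope = ols_slope(sat_scores, college_fes)
print(slope)
```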
Doctorate Degrees
Although Chetty et al find relative pay is fairly stable after the age of ~30 in general, they don't examine people with advanced degrees in particular. Other studies, however, do and find that the earnings of people with PhDs and professional degrees are (relatively) depressed even at that age:
Per the above figure, this suggests that for every 10pp increase in the share of graduates who earn PhDs, our pay estimates will be depressed by about 2%, which is about 0.6 percentile ranks. For most colleges, this is a trivial amount, but for some schools, it can make a significant difference in our estimates. For instance, on the extreme end this bumps Caltech and Harvey Mudd up 2.5 and 1.8 percentiles, respectively (about 0.4 and 0.3 SDs). For more conventional elite schools (like Harvard), about 10% of graduates get PhDs, so their numbers get bumped up around 0.6 percentiles.
To adjust for this, we will make the above correction (0.6 percentiles per 10pp in PhD yield). We will compute the yield rate based on the ratio of PhDs attained by people who entered an institution as freshmen, which we can compute from the National Center for Science and Engineering Statistics (Build Table). I confirmed my computations were reasonable by first replicating the similar work done by Swarthmore College (Doctorates Awarded, Swarthmore).
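As a rough sketch, the correction amounts to adding 0.06 percentile ranks per percentage point of PhD yield. The function and constant names below are hypothetical, and the input values are illustrative rather than taken from the data.

```python
# 0.6 percentile ranks per 10pp of PhD yield, as stated in the post.
PERCENTILES_PER_PP_PHD = 0.6 / 10


def phd_adjusted_fe(college_fe, phd_yield_pp):
    """Add back the earnings depression attributed to PhD holders.

    phd_yield_pp is the PhD yield in percentage points (e.g. 10.0 for 10%).
    """
    return college_fe + PERCENTILES_PER_PP_PHD * phd_yield_pp


# Illustrative: a college with a baseline FE of 50 and a 10% PhD yield.
adjusted = phd_adjusted_fe(50.0, 10.0)
print(adjusted)
```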
This adjustment turns out not to matter much: the correlation between the baseline FEs and the PhD-adjusted FEs is very high (r~0.999), so, excluding a couple of outliers like Caltech, this doesn't affect any of the later analysis in a meaningful way.
Finally, as a sanity check, consider Figure IIa from the online appendix of Chetty et al (Income segregation and intergenerational mobility across colleges in the United States):
It's hard to make out, but the graph shows the percentile income for Ivy Plus graduates increasing 0.5pp relative to "Other Four-Year" graduates between the ages of 33 and 36. This is roughly in line with our estimated 0.6pp from the PhD rates. On the one hand, plenty of PhD earners will still be in school by then; on the other hand, we haven't accounted for professional degrees yet...
Professional Degrees
Dealing with professional degrees is trickier.
For instance, it seems obvious that medical doctors are a huge confounder here. Based on some stats from Harvard (Coleman), it seems likely that more than a sixth of Harvard graduates end up in medical school. Based on stats all over the internet, doctors typically make around $300k eventually, but seldom anywhere close to that by their early 30s. Together, these stats imply that top schools' FE estimates could be seriously biased downwards: some back-of-the-envelope math suggests maybe about 4pp for Harvard. Unfortunately, I can't find any data on how many med students come from each school. The closest I could find were
- Top Feeders to Medical School - this source is hard to interpret and unofficial. The data is limited to 30 schools.
- 2021 FACTS: Applicants and Matriculants Data - this source is official and covers ~250 colleges, but it reports how many applicants originate from each college, not matriculants.
The situation is much the same for other professional degrees like JDs and MBAs: I can't find any comprehensive & reliable data. While the income gain is probably not as extreme as for medical doctors, these groups are several times larger, so their effects are probably also quite significant.
That being said, the above source I labeled "unofficial" (Top Feeders to Medical School) does give an interesting idea for resolving these issues: scraping LinkedIn. This is well beyond the scope of this analysis, as it involves either getting LinkedIn to give me lots of data for a blog post or massively violating their TOS and scraping literally millions of profiles, probably at significant financial and temporal cost to myself. Hard pass.
So, moving forward, we'll only adjust the FEs for PhD yield. Those can be found here.
Still, we have some reasons to think this isn't an enormous deal. For instance, Figure IId doesn't spike upwards like you'd expect if a sixth of Harvard's graduates suddenly became high-earning doctors:
It's very unclear to me why we aren't seeing a spike between ~30 and ~35 years of age among the "Ivy Plus" cohort - ideas for why are welcome.