GPA Adjustment
"Grade Point Average" is a simple naïve average of the grades a student has received over a period. It is used as a metric of student success and a factor in college admissions, both directly, and as a means of computing class rank.
Unfortunately, it fails as such a metric, both between and within schools, as schools and teachers have different standards for their students. This post is a dive into a dataset I recently acquired, and examines alternatives to GPA.
The Dataset
In a process that was surprisingly painless, I have received the (anonymized) grades for 3 high schools within a school district. The dataset includes student identifiers, teacher identifiers, course names, and the grade each student received, rounded to A, A/B, B, B/C, C, C/D, D, and F. (There are pass/fail classes, but I've excluded them).
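For concreteness, here is a minimal sketch of how those letter grades can be turned into the numbers used in the rest of the post. The exact point values (half-steps for the split grades) are an assumption on my part, not something taken from the district:

```python
# A small sketch (not the original analysis code): one common convention for
# converting letter grades to grade points, with the split grades (A/B, B/C,
# C/D) landing on half-steps. The district's own mapping may differ.
GRADE_POINTS = {
    "A": 4.0, "A/B": 3.5, "B": 3.0, "B/C": 2.5,
    "C": 2.0, "C/D": 1.5, "D": 1.0, "F": 0.0,
}

def gpa(grades, ap_flags=None, ap_bonus=0.0):
    """Average grade points; pass ap_bonus=1.0 for the 'weighted' GPA discussed below."""
    ap_flags = ap_flags or [False] * len(grades)
    points = [GRADE_POINTS[g] + (ap_bonus if is_ap else 0.0)
              for g, is_ap in zip(grades, ap_flags)]
    return sum(points) / len(points)

# gpa(["A", "B/C", "A/B"], [True, False, False], ap_bonus=1.0) -> 3.67
```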
Here is a scatter plot of one school's GPAs after two semesters:
And here it is for "Weighted" GPAs, which include the +1 bonus for AP/college classes, allowing students to achieve a GPA above 4.0.
For reasons that will hopefully become apparent, it doesn't matter too much which notion of "GPA" we use but, for posterity's sake, we will be using the weighted GPA going forward.
Why GPA Sucks
First, there are classes where high grades are easy, and classes where they are hard. Going forward we will refer to this as the easiness or difficulty of a class. When I say "Algebra 2 is harder than Geometry", the claim is that the same student will tend to receive a lower grade in Algebra 2 than in Geometry — whether that is due to one teacher's higher standards, the material being more difficult, or something else, is not the point.
The main problem with conventional GPA is that it fails to distinguish between easy and difficult classes, leading to perverse behaviors, like students explicitly seeking out easy classes and easy teachers. This hasn't gone unnoticed by teachers and parents. One fix applied by some high schools is to give a free bonus to AP and college classes, in recognition of the fact that college-level courses are more challenging than traditional high school courses.
While this approach might be better than nothing, I've talked with the people who made this decision, and it's clear the actual bonus (+1 on a 4-point scale) was decided rather arbitrarily, with no formal statistical support behind it. Moreover, even if "+1" happened to be the optimal bonus, we could surely do better by giving fine-grained bonuses to every class, removing the injustice of a student's GPA being affected by which classes and teachers they happen to be assigned.
The question, then, is how one might estimate the difficulty of a course. Comparing geometry and algebra may not be too hard (since plenty of students take both courses, so we can directly compare them), but comparing hundreds of classes to each other seems more challenging.
The Linear Approach
One approach is to model the grades a student receives in a class as a function of both the student's academic abilities, and the difficulty of achieving high grades in the class. For instance:
(alice's grade in algebra) = (alice's academic quotient) + (how easy the algebra class is)
(alice's grade in english) = (alice's academic quotient) + (how easy the english class is)
(bob's grade in english) = (bob's academic quotient) + (how easy the english class is)
... etc.
It's worth pointing out that "academic quotient" (or "AQ") here is just some number that predicts the grades a student gets. It's meaningful to the extent that the grades that teachers assign are meaningful.
We then try to estimate each student's academic ability by finding numbers that fit these equations the best. This can be re-written as, essentially, a giant linear regression problem with 1,429 variables (1,174 AQs and 255 class easiness scores) and 15,411 constraints (the number of grades given).
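To make that concrete, here is a sketch of how such a regression can be assembled, assuming grades have already been converted to numbers and students and classes are indexed from 0. The one-row-per-grade structure (a 1 in the student's column and a 1 in the class's column) is exactly the model above; the solver (Scipy's lsqr) and the function names are my own choices rather than the code actually used for this analysis. Note the model has an inherent shift ambiguity (add a constant to every AQ, subtract it from every easiness score, and nothing changes), which the least-squares solver resolves arbitrarily.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

def fit_aq(grades, n_students, n_classes):
    """Fit AQs and class easiness scores from (student, class, numeric_grade) triples.

    Each grade contributes one equation: grade ~= AQ[student] + easiness[class].
    """
    rows, cols, vals, targets = [], [], [], []
    for r, (student, course, grade) in enumerate(grades):
        rows += [r, r]
        cols += [student, n_students + course]   # student column, then class column
        vals += [1.0, 1.0]
        targets.append(grade)
    X = sp.csr_matrix((vals, (rows, cols)),
                      shape=(len(grades), n_students + n_classes))
    beta = lsqr(X, np.array(targets))[0]         # least-squares solution
    return beta[:n_students], beta[n_students:]  # (AQs, class easiness scores)
```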
One of the problems with linear regression is that strong collinearities in your dataset can make your estimates extremely sensitive to small changes. This is undesirable for a summary statistic: you don't want one student getting a 0.01 higher grade in one class to have an enormous effect on our estimate of another student's AQ. While there are standard techniques for dealing with this, one motivating reason behind requesting real-world data was to see how ordinary linear regression would fare.
Happily, there appears to be sufficient mixing between classes to make our estimates reasonably robust, but it is certainly the case that there is moderate uncertainty in our estimates of students' AQs, with an average standard error of around 0.13 points. While this may seem high, it is noticeably lower than the standard error of naive GPA at this high school (which is 0.186).
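For what it's worth, here is one standard way to get standard errors of that sort out of the fit above: the classical OLS covariance formula, with a pseudo-inverse to handle the rank deficiency. This is a sketch of my own, not necessarily how the numbers quoted here were computed.

```python
import numpy as np

def aq_standard_errors(X, targets, beta, n_students):
    """Classical OLS standard errors for the fitted coefficients.

    X is the (sparse) design matrix from fit_aq, beta the fitted coefficients.
    Uses a pseudo-inverse because the design matrix is rank-deficient.
    """
    Xd = X.toarray()                           # 15,411 x 1,429 fits in memory densely
    resid = np.asarray(targets) - Xd @ beta
    dof = Xd.shape[0] - np.linalg.matrix_rank(Xd)
    sigma2 = resid @ resid / dof               # estimated residual variance
    cov = sigma2 * np.linalg.pinv(Xd.T @ Xd)
    return np.sqrt(np.diag(cov))[:n_students]  # standard errors of the AQs
```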
That the regression-based estimates are less noisy than naive GPA aligns with the results of a toy experiment I performed (before I had access to real data), where, even in an environment where students were randomly assigned classes, the linear regression approach converged faster than the naive GPA approach, despite having more variables. In other words, the additional expressiveness of our model (compared to naive GPA) more than makes up for the fact that we have an additional 255 variables (one per class).
Comparing Classes
We don't just get AQ scores for each student from this analysis, we also get scores for how easy a class is. Matching the expectations of the students, math and foreign language classes score as among the most difficult. In addition, the average grade of a class is actually quite predictive of the class' "real" difficulty, particularly when we restrict our sample to classes with at least 10 students. This serves as a sanity check for our model, and also suggests that bell-curve grading, while not perfect, is probably better than the current system.
NOTE: AP classes are comparatively "easy" here, since they get a free +1.0 score. This chart suggests that this bonus is too large if the goal is to simply adjust for the additional difficulty of the course. On the other hand, if the goal is to proactively encourage people to take AP classes, it may be justifiable.
The Elo Method
The AQ approach is not without its problems. First, a teacher that gives out a wider spread of grades will have a stronger effect on a student's AQ. Second, a student can do perfectly in a class and have their AQ drop! (If a teacher gives both strong students and weak students As, then the strong students' AQ would drop, since they were expected to outperform the weak students, but did not.)
An alternative approach that sidesteps these issues is to consider all pairs of students in a class as playing a "game", where the winner is the student with the higher score. We can then compute a student's Elo, which is a number indicating how strong of a "player" they are.
Games that are drawn are ignored. This contrasts with Chess, where drawing with a weaker player hurts your Elo. The benefit of ignoring draws is that a perfect score in a class is guaranteed to never hurt your Elo.
Suppose $E_A$ and $E_B$ are Alice and Bob's Elos respectively. The Elo system models the probability that Alice will "beat" Bob as $(1 + 10^{(E_B - E_A)/400})^{-1}$ or, equivalently, as $\sigma\left(\frac{E_A - E_B}{400} \ln 10\right)$, where $\sigma$ is the logistic function.
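In code, the two forms are the same number:

```python
import math

def win_probability(elo_a, elo_b):
    """P(the player with Elo elo_a beats the player with Elo elo_b)."""
    p_direct  = 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400))
    p_sigmoid = 1.0 / (1.0 + math.exp(-(elo_a - elo_b) / 400 * math.log(10)))
    assert abs(p_direct - p_sigmoid) < 1e-12   # the two forms agree
    return p_direct

# A 400-point gap corresponds to roughly 10:1 odds:
# win_probability(1281, 881) -> 0.909...
```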
In chess, there is a presumption that players are getting better or worse over time, and the algorithm for computing Elo reflects that, with more recent games having a larger effect on a player's rating. This scheme makes little sense in our scenario, where there is no order to our games (Alice is playing Bob, Carol, and David simultaneously!). So instead of the incremental update used in chess, we set up a (massive) logistic regression problem and solve it.
(As a technical aside, the matrix representing every possible game is enormous: 655,000 games by 1,174 students, which would occupy around 3 GB of RAM if stored densely, making naive optimization onerous. The standard machine-learning recipe would be gradient descent on small random batches of data. Fortunately we don't need to resort to this: since the matrix is sparse (every row has one 1 and one -1), we can use Scipy's sparse matrices and compute exact gradients very quickly.)
(As a second aside, I weight the importance of a game as the inverse of the class size. Otherwise large classes would have a disproportionate effect on a student's Elo.)
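Here is a sketch of how the games and weights might be assembled. The function and variable names are mine, and I'm assuming grades have already been mapped to numbers so that higher is better:

```python
import numpy as np
import scipy.sparse as sp
from itertools import combinations

def build_games(class_rosters, n_students):
    """Build the sparse game matrix: one row per decisive game,
    +1 in the winner's column and -1 in the loser's column.

    class_rosters: one list per class of (student_index, numeric_grade) pairs.
    Returns (X, weights), where each game is weighted by 1 / class size.
    """
    rows, cols, vals, weights = [], [], [], []
    n_games = 0
    for roster in class_rosters:
        w = 1.0 / len(roster)
        for (s1, g1), (s2, g2) in combinations(roster, 2):
            if g1 == g2:
                continue                         # draws are ignored entirely
            winner, loser = (s1, s2) if g1 > g2 else (s2, s1)
            rows += [n_games, n_games]
            cols += [winner, loser]
            vals += [1.0, -1.0]
            weights.append(w)
            n_games += 1
    X = sp.csr_matrix((vals, (rows, cols)), shape=(n_games, n_students))
    return X, np.array(weights)
```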
Unfortunately a simple logistic model doesn't quite work, since there are students who have straight As (i.e. have never lost a game) whose optimal Elo is infinite. This is easily solved by adding a small regularization term to keep our Elos reasonably sized. A Bayesian should be happy with this development, since it is equivalent to putting a Gaussian prior on student Elos (instead of using an improper prior).
Unfortunately the Elos we get depend quite a bit on the regularization term we use. I'm not aware of a principled way to pick this term, so I'll be showing results using an L2 penalty of 1.0, since it is the most natural choice, and since it results in a distribution that is roughly similar to the distribution of Elos in chess. Ultimately the regularization term is arbitrary, but, fortunately, whatever number we choose shouldn't affect students' class ranks.
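Putting the pieces together, the regularized fit is an off-the-shelf optimization. This is a sketch under my own assumptions: the penalty is applied on the natural-parameter scale, and the conversion back to Elo-style numbers (multiplying by 400/ln 10 and centering, here at 1200) is an illustrative choice rather than the exact procedure behind the numbers below.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit                    # the logistic function sigma

def fit_elos(X, weights, l2=1.0, scale=400 / np.log(10), center=1200.0):
    """Weighted, L2-regularized logistic regression for Elos.

    X and weights come from build_games; every row is a "win" for its +1
    student, so the target for every game is 1.
    """
    n_students = X.shape[1]

    def loss_and_grad(beta):
        z = X @ beta                                    # beta_winner - beta_loser
        loss = np.sum(weights * np.logaddexp(0.0, -z))  # -sum w * log(sigma(z))
        loss += 0.5 * l2 * beta @ beta
        grad = -X.T @ (weights * (1.0 - expit(z))) + l2 * beta
        return loss, grad

    res = minimize(loss_and_grad, np.zeros(n_students), jac=True, method="L-BFGS-B")
    return center + scale * res.x                       # back to Elo-style numbers
```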
Below is a plot of the resulting student Elos for a single high school. The 5th percentile is 603, the median is 1281, and the 95th percentile is 2143.
The Elo model can have unintuitive behavior. This is most striking for students with straight Fs (i.e. students who have never won a "game").
GPA  | Linear Model | Elo | Num Classes
-----|--------------|-----|------------
0.00 | -0.35        | 818 | 2
0.00 |  0.13        | 200 | 7
0.00 | -0.10        | -65 | 10
0.00 |  0.39        | 633 | 3
0.00 |  0.60        | 766 | 2
0.00 |  0.19        | 648 | 4
What's going on here is that the model (arguably reasonably) considers "losing 1 game" to be less severe than "losing 10 games", and your Elo reflects this ("1 loss is unlucky, 10 losses is a pattern"). As a result, the students above with the fewest classes have the highest Elo.
Closing Remarks
While we've tried to give arguments for abandoning traditional GPA in favor of these more sophisticated models, it's easy to get lost in the math. Perhaps there is a mistake in our analysis? How do we know AQ or Elo is tracking something meaningful at all? Converging faster is only good if you're converging to something useful after all!
One "outside" view is that if one (or both) of your models is generating complete nonsense, they're unlikely to be very correlated with each other. Below we plot the correlations between student ranks using all three systems: GPA (the current system), AQ (the linear model), and Elo.
Fortunately we see that not only are all three systems strongly correlated, but Elo and AQ are the most strongly correlated with each other, which is consistent with (but doesn't prove) our hypothesis that these models measure academic ability better than naive GPA.
Unfortunately, naive GPA is simple, intuitive, and has the support of institutional inertia, so it's not going anywhere.