math Morgan

Linear Regression Generalization

The typical way to perform linear regression is via its closed-form solution:

$$\hat{w} = (X^T X)^{-1} X^T y$$

We need to be careful about naively applying this when the number of dimensions is large. When the number of points (n) equals the number of dimensions (d) and X is invertible, we always achieve exactly zero training loss, even if our test loss is terrible. (When d ≥ n, the matrix X^T X is singular; replacing the inverse with a pseudoinverse gives the minimum-norm solution, which still interpolates the training data.)
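To see this failure concretely, here is a minimal sketch (synthetic Gaussian data; the sizes and the pure-noise labels are my assumptions, not from the post) with n = d. The closed-form fit is exact on the training set and useless on fresh data; `np.linalg.pinv` is used so the same line also covers the singular case:

```python
import numpy as np

rng = np.random.default_rng(0)
n = d = 50                          # as many points as dimensions
X = rng.normal(size=(n, d))
y = rng.normal(size=n)              # labels are pure noise: nothing to learn

# Closed-form least-squares solution; pinv also handles singular X^T X
w_hat = np.linalg.pinv(X.T @ X) @ X.T @ y

train_mse = np.mean((X @ w_hat - y) ** 2)

# Fresh noise drawn from the same distribution: the fit cannot generalize
X_test = rng.normal(size=(1000, d))
y_test = rng.normal(size=1000)
test_mse = np.mean((X_test @ w_hat - y_test) ** 2)

print(train_mse)                    # essentially zero
print(test_mse)                     # much larger than the label variance
```

With a square, almost-surely-invertible X, the solution interpolates the noise exactly, so the training loss is zero by construction while the test loss is large.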

This observation, along with bounds derived from VC dimension, prompts many people to claim that models that are too expressive will generalize worse than models that are not expressive enough.

The story goes that classical machine learning got this horrifyingly wrong, and that modern neural networks prove that an excess of parameters has little to do with how a model generalizes.

While this is true to some extent, I want to push back on the narrative that classical machine learning ever claimed this. The error bounds in question are exclusively worst-case bounds, and, as has been demonstrated many times (most famously in Understanding Deep Learning Requires Rethinking Generalization), they are correct as worst-case statements: it is entirely possible to train a neural network to zero training loss with atrocious test loss.

The claim that expressive models will necessarily generalize poorly was, as far as I can tell, always wrong: as wrong for linear models as it is for deep neural networks.

As a demonstration, consider MNIST restricted to two classes. It is entirely possible to train a linear model (784 pixel weights plus a bias, so 785 parameters) on only 10 data points and achieve over 95% test accuracy.
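The same effect shows up on synthetic stand-in data, which keeps the demonstration self-contained. Below is a minimal sketch (Gaussian clusters standing in for two MNIST classes; the mean direction `mu`, its scale, and all sizes are assumptions, not measurements from the post) where the minimum-norm linear fit of 10 points in 784 dimensions classifies held-out points well:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test = 784, 10, 2000

# Two synthetic "classes": a shared mean direction, sign-flipped per class,
# plus isotropic Gaussian noise. The scale 5.0 is an assumption.
mu = rng.normal(size=d)
mu *= 5.0 / np.linalg.norm(mu)

def sample(n):
    y = rng.choice([-1.0, 1.0], size=n)
    X = y[:, None] * mu + rng.normal(size=(n, d))
    return X, y

X_train, y_train = sample(n_train)
X_test, y_test = sample(n_test)

# Minimum-norm interpolating linear model: 784 weights fit to 10 points
w = np.linalg.pinv(X_train) @ y_train

train_acc = np.mean(np.sign(X_train @ w) == y_train)  # interpolates: 1.0
test_acc = np.mean(np.sign(X_test @ w) == y_test)
print(train_acc, test_acc)
```

Despite having far more parameters than data points, the minimum-norm fit picks up the class-mean direction and generalizes, which is the point of the MNIST example above.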

In retrospect this shouldn't come as a huge surprise, but it firmly rejects the claim that parameter count alone has much of anything to do with generalization.

Footnote:

An interesting observation is that the training regime matters quite a lot here. If the weights are initialized at exactly zero, the model converges to something good. But as the scale of the noise used to initialize the weights increases, generalization drops. For linear models there is a clean explanation: the gradient of the squared loss always lies in the row space of X, so the component of the initialization in the null space of X survives training untouched and contributes pure noise to predictions on new points.