
Kernel Linear Regression

Kernel linear regression is, at its most basic, just a reformulation of linear regression that runs in O(n²d) time instead of O(nd²): it builds an n×n matrix from pairs of samples instead of a d×d matrix from pairs of features, which is a good trade when the number of features is much larger than the number of samples. A proof of their equivalence is below:
 let (X, y) be the training set's features and labels (X has one row per sample, one column per feature)

      XX'X + (Ic)X = XX'X + X(Ic)
      (XX' + Ic) X = X (X'X + Ic)
                 X = inv(XX' + Ic) X (X'X + Ic)
                X' = (X'X + Ic)' (inv(XX' + Ic) X)'
                X' = (X'X + Ic) X' inv(XX' + Ic)        (both matrices are symmetric)
  inv(X'X + Ic) X' = X' inv(XX' + Ic)
inv(X'X + Ic) X' y = X' inv(XX' + Ic) y

  Let "Z" be the features that we want to evaluate on.

inv(X'X + Ic) X' y = X' inv(XX' + Ic) y
Z inv(X'X + Ic) X' y = Z X' inv(XX' + Ic) y

The left side is the standard formula for ridge linear regression (ordinary linear regression when c=0), while the right side is the kernelized version.

(Note: it is actually important that c>0. If c=0, then whichever formulation involves the larger matrix will require inverting a singular matrix: XX' is singular whenever n>d, and X'X is singular whenever d>n.)
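
To make the equivalence concrete, here is a quick numerical sanity check in numpy (a sketch: the dimensions, the ridge strength c, and the random data are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
n, d, m, c = 50, 5, 10, 0.1           # n training points, d features, m evaluation points, ridge strength
X = rng.normal(size=(n, d))           # training features
y = rng.normal(size=n)                # training labels
Z = rng.normal(size=(m, d))           # features to evaluate on

# Left side: Z inv(X'X + Ic) X' y   (standard ridge regression)
primal = Z @ np.linalg.solve(X.T @ X + c * np.eye(d), X.T @ y)

# Right side: Z X' inv(XX' + Ic) y  (kernelized version)
dual = Z @ X.T @ np.linalg.solve(X @ X.T + c * np.eye(n), y)

print(np.allclose(primal, dual))      # True, up to floating-point error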

If we let

K(A,B) = AB' (the matrix of inner products between the rows of A and the rows of B)
we can rewrite the formula as
Z inv(X'X + Ic) X' y = K(Z,X) inv(K(X,X) + Ic) y
We call "K" the "kernel". This reformulation is interesting, because it lets us use other functions for K, beyond a simple inner product. With the correct choices, our model will be able to represent a wide array of functions that would be impossible to express with ordinary linear regression (even using feature expansion).

The kernel function (or "covariance function") is, abstractly, a function that represents how similar two points' labels are, given their features. If k(s,t) is high, we expect the labels of s and t to be similar, if it is low we expect them to be different.

More concretely, these values can be viewed as covariances. If f(x) is the ground-truth function we're trying to model, and we choose a kernel k such that k(s,t) = 3 (for some features s and t), we're claiming that if we only know f(s), our best guess for f(t) is f(s) scaled by 3/k(s,s) (which is simply 3*f(s) when the kernel is normalized so that k(s,s) = 1).
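
(This is the standard conditioning formula for zero-mean jointly Gaussian variables: $E[f(t) \mid f(s)] = \frac{k(s,t)}{k(s,s)} \, f(s)$.)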

Viewed from this perspective, linear regression actually starts to look a little odd. For example, under the plain inner-product kernel, x=999 "correlates" far more strongly with x=1 than x=2 does, despite being much farther away, simply because 999*1 is bigger than 2*1.

Many commonly-used covariance functions "fix" this and instead force the covariance to fall off with distance. The most famous such function is probably the RBF ("radial basis function") kernel: $k(x, y) = \exp(-\gamma (x - y)^2)$, with $(x-y)^2$ replaced by the squared distance $\lVert x - y \rVert^2$ when the features are vectors.
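
As a quick illustration (a sketch: the choice gamma = 0.5 and the specific points are arbitrary), the RBF kernel reverses the oddity above: the nearby point x=2 is now far more "similar" to x=1 than the distant x=999 is.

import numpy as np

def linear_kernel(s, t):
    # plain inner product (scalars here)
    return s * t

def rbf_kernel(s, t, gamma=0.5):
    # covariance decays with squared distance
    return np.exp(-gamma * (s - t) ** 2)

print(linear_kernel(999.0, 1.0), linear_kernel(2.0, 1.0))  # 999.0 vs 2.0  -> the far point looks more similar
print(rbf_kernel(999.0, 1.0), rbf_kernel(2.0, 1.0))        # ~0.0  vs ~0.61 -> the near point looks more similar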