
Kernel Linear Regression

Kernel linear regression is, at its most basic, just a reformulation of linear regression that runs in O(n^2 d) time instead of O(n d^2) (it builds and inverts an n-by-n matrix instead of a d-by-d one). A proof of their equivalence is below:
 Let (X, y) be the training set's features and labels.

      XX'X + (Ic)X = XX'X + X(Ic)
      (XX' + Ic) X = X (X'X + Ic)
                 X = inv(XX' + Ic) X (X'X + Ic)
                X' = (X'X + Ic)' (inv(XX' + Ic) X)'
  inv(X'X + Ic) X' = X' inv(XX' + Ic)
inv(X'X + Ic) X' y = X' inv(XX' + Ic) y

  Let "Z" be the features that we want to evaluate on.

inv(X'X + Ic) X' y = X' inv(XX' + Ic) y
Z inv(X'X + Ic) X' y = Z X' inv(XX' + Ic) y

The left side is the standard formula for ridge linear regression (ordinary linear regression when c=0), while the right side is the kernelized version.

(Note: it actually matters that c>0. If c=0 then at least one of XX' and X'X is singular whenever n differs from d, so either the kernelized or the non-kernelized formulation would require inverting a singular matrix.)
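
As a quick numerical sanity check (a minimal sketch assuming NumPy; the sizes below are arbitrary), both sides of the prediction identity above give the same result as long as c>0:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, m, c = 50, 5, 10, 0.1              # arbitrary sizes and ridge strength
    X = rng.standard_normal((n, d))          # training features
    y = rng.standard_normal(n)               # training labels
    Z = rng.standard_normal((m, d))          # features to evaluate on

    # Left side: ordinary ridge regression (a d-by-d solve).
    ridge = Z @ np.linalg.solve(X.T @ X + c * np.eye(d), X.T @ y)

    # Right side: kernelized ridge regression (an n-by-n solve).
    kernelized = Z @ X.T @ np.linalg.solve(X @ X.T + c * np.eye(n), y)

    print(np.allclose(ridge, kernelized))    # True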

If we let

K(A,B) = AB'
we can rewrite the formula as
Z inv(X'X + Ic) X' y = K(Z,X) inv(K(X,X) + Ic) y
We call "K" the "kernel". This reformulation is interesting, because it lets us use other functions for K, beyond a simple inner product. With the correct choices, our model will be able to represent a wide array of functions that would be impossible to express with ordinary linear regression (even using feature expansion).
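
Here is a minimal sketch of that reformulation (assuming NumPy; kernel_ridge_predict and the kernel names are illustrative, not a standard API):

    import numpy as np

    def linear_kernel(A, B):
        # K(A, B) = AB', the kernel that reproduces ordinary (ridge) linear regression.
        return A @ B.T

    def kernel_ridge_predict(X, y, Z, kernel, c=0.1):
        # Predictions = K(Z, X) inv(K(X, X) + Ic) y
        n = X.shape[0]
        alpha = np.linalg.solve(kernel(X, X) + c * np.eye(n), y)
        return kernel(Z, X) @ alpha

Swapping in any other kernel function is the only change needed to get a different model.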

The kernel function (or "covariance function") is, abstractly, a function that represents how similar two points' labels are, given their features. If k(s,t) is high, we expect the labels of s and t to be similar; if it is low, we expect them to be different.

More concretely, these values can be viewed as covariances. If f(x) is the groundtruth function we're trying to model, and we choose a kernel k such that k(s,t) = 3 (for some features s and t), we're claiming that if we only know f(s), our best guess for f(t) is 3 * f(s) (at least when k(s,s) = 1).
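
More precisely, treating f(s) and f(t) as jointly Gaussian with zero mean and covariance given by k, the conditional mean is $ E[f(t) \mid f(s)] = \frac{k(t,s)}{k(s,s)} f(s) $, which reduces to 3 * f(s) when k(s,s) = 1.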

Viewed from this perspective, linear regression actually starts to look a little odd. For example, under the linear kernel, x=999 "correlates" far more strongly with x=1 than x=2 does, despite being much farther away, because k(999, 1) = 999 is much bigger than k(2, 1) = 2.

Many commonly-used covariance functions "fix" this and instead force the covariance to drop with distance. The most famous such function is probably the RBF ("radial basis function") kernel: $ k(x, y) = \exp(-\gamma (x - y)^2) $
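
As a sketch of what this buys us (assuming NumPy and the kernel_ridge_predict helper sketched above; gamma and c are arbitrary choices), the RBF kernel lets the same machinery fit a function no linear model can represent:

    import numpy as np

    def rbf_kernel(A, B, gamma=10.0):
        # k(x, y) = exp(-gamma * ||x - y||^2), evaluated for every pair of rows.
        sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        return np.exp(-gamma * sq_dists)

    X = np.linspace(0, 1, 40).reshape(-1, 1)     # training features
    y = np.sin(2 * np.pi * X[:, 0])              # a nonlinear groundtruth
    Z = np.linspace(0, 1, 200).reshape(-1, 1)    # evaluation points

    preds = kernel_ridge_predict(X, y, Z, kernel=rbf_kernel, c=1e-3)
    print(np.abs(preds - np.sin(2 * np.pi * Z[:, 0])).max())  # should be small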