Gaussian Processes
Def: A Gaussian Process is a collection of random variables, any finite number of which have a joint Gaussian distributionGaussian processes offer a way to take knowledge of a function at certain points and use that knowledge to place a probability distribution over the value of the function at other points. For example, given that we know the value of a function at x=-1, x=0, x=1, and x=2, we can create a Gaussian Process and find a guess for the function at x = 1.5.
Multivariate Gaussian Distribution
The most natural extension of a Gaussian distribution to multiple dimensions is \[ \begin{bmatrix} \mathcal{N}(\mu_1, \sigma_1^2) & \mathcal{N}(\mu_2, \sigma_2^2) & \dots & \mathcal{N}(\mu_n, \sigma_n^2) \end{bmatrix} \] An example of a 2D Gaussian distribution is:
A multivariate Gaussian distribution is simply a linear combination of a distribution like the one above. What's interesting is that this can make each dimension non-independent.
This means that if you draw a random point from this distribution, knowing its x-value will give you information about its y-value (and vice versa).
In particular, our posterior belief on the y-value of a randomly-drawn point is still a Gaussian distribution. Another way of saying this is that if we "know" that the x-coordinate is "c", then we know the point we sampled was drawn from the line "x=c", and this "slice" of the original multi variate Gaussian distribution is also a multivariate Gaussian distribution.
This property of starting with a multivariate Gaussian distribution, updating on some information, and ending with a multivariate Gaussian distribution is extremely convenient, and is at the core of a Gaussian Process.
We represent a D-dimensional multivariate Gaussian distribution using a mean vector \(\mu\) and a covariance matrix \(\Sigma\): \[ p(\vec{x}) = \mathcal{N}(\vec{\mu}, \Sigma) \] $\vec{\mu}$ contains $D$ elements and $\Sigma$ is a $D \times D$ matrix. The diagonal of $\Sigma$ contains the variance of each variable, while $\Sigma_{ij}$ represents the covariance between dimensions $i$ and $j$. $\Sigma$ must be symmetric and positive semi-definite.
The mean vector represents you best guess of \(x\) and the covariance matrix encodes your uncertainty and how much knowing one dimension can teach you about another dimension.
As an example, suppose you have the following 2-dimension Gaussian distribution: \[ p(X = x, Y = y) = \mathcal{N}([\mu_x, \mu_y], \begin{bmatrix} \sigma_x^2 & \text{cov}(x, y)\\ \text{cov}(x, y) & \sigma_y^2 \end{bmatrix} )\] Then our knowledge of $y$ changes once we know the value of $x$. In particular, if $X = c$, then our updated pdf for $Y$ is proportional to a "slice" of the joint multivariate Gaussian distribution: Formally, we can give the formula for you belief about Y, conditioned on our knowledge of X (I won't derive this): \[ p(Y = y | X = x) = \mathcal{N}(\mu_{y|x}, \Sigma_{y|x})\] \[\mu_{y|x} = \mu_y + \frac{\Sigma_{x,y}}{\Sigma_{y,y}} (x - \mu_x)\] \[ \Sigma_{y|x} = \sigma_y^2 - \frac{\Sigma_{x,y}^2}{\Sigma_{x,x}} \]
It will be useful, however, to know the solution in an arbitrary number of dimensions. Suppose you know the first $N$ dimensions of a sample from a $D$ dimensional multivariate Gaussian distribution, and would like to know your posterior on the remaining dimensions. Then we can organize $\Sigma$ into four corners: \[ \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \] Here $\Sigma_{11}$ is the covariance matrix of the first $N$ dimensions and $\Sigma_{22}$ is the covariance matrix of the remaining dimensions. $\Sigma_{12}$ and $\Sigma_{21}$ are the covariances between the dimensions you know and the dimensions you don't know (note that \(\Sigma_{12} = \Sigma_{21}^T\)). Then the joint distribution is: \[ p(\vec{x}) = \mathcal{N}(\vec{\mu}, \Sigma) \] And the conditional distribution — the one that represents our beliefs about the variables after we make our observations is:
We will be returning to this formula many times throughout this page. Memorizing it isn't necessary, but remembering that we know how to condition multivariate Gaussian distributions like this is crucial.
Gaussian Processes
A Gaussian Process assigns every possible input-value a "dimension" in a multivariate Gaussian distribution. For example, suppose you have a function you are trying to "learn" that takes in either a 0 or a 1 and spits out a real number: \[ f: \{0, 1\} \rightarrow \mathbb{R} \] Then in the image below, we could let the "x-axis" represent our belief over the value of $f(0)$ and the "y-axis" represent our belief over the value of $f(1)$. If we know the value of $f(0)$ (by observing it), then this may give us some information about the value of $f(1)$.
In this case, knowing that $f(0)$ is higher than we expected leads us to believe that $f(1)$ will also be higher than we expected. In particular, if we know the exact joint distribution, then we can compute our posterior on the value of f(1) using the Gaussian-conditioning formula given above.
Unfortunately, many real world problems have uncountably many possible inputs. A far more common case is trying to learn a function that maps real numbers to real numbers: \[ f: \mathbb{R} \rightarrow \mathbb{R} \] A Gaussian Process will reason about this the exact same way it reasoned about the discrete case: it will form a Gaussian distribution such that every possible input-value has a corresponding pdf over its possible output values. Then it will look at the values/dimensions that it does know (i.e. the "training set" or "observations") and use this knowledge to compute a new multivariate distribution over all dimensions that it wants to predict.
At first blush, it seems rather dubious that the machinery we used for finite-dimensional multivariate Gaussian distributions would work well (computationally) for infinite-dimensional Gaussian distributions. In fact, such skepticism would be well placed! Recall the definition from the beginning of the article:
Def: A Gaussian Process is a collection of random variables, any finite number of which have a joint Gaussian distribution
The key is that we only ever care about the Gaussian distribution at a finite number of points. For example, consider this case: We only know the input at 4 locations, and so if we want to determine our prediction of the function at some other point (e.g. x = 0.5), we just form a 5-dimensional multivariate Gaussian distribution, condition on the dimensions we know and see what the resulting distribution is for the dimension we want to know. So if I want to know my posterior at $x = \frac{1}{2}$, then I'm just computing \[ p(f(\frac{1}{2})\ |\ f(-1), f(0), f(1), f(2))\]
If $k(x_1, x_2)$ is a function that returns the covariance of $f(x_1)$ and $f(x_2)$, then we can write \(Sigma\) as \[ \Sigma = \begin{bmatrix} k(-1, -1) & k(-1, 0) & k(-1, 1) & k(-1, 2) & k(-1, \frac{1}{2})\\ k(0, -1) & k(0, 0) & k(0, 1) & k(0, 2) & k(0, \frac{1}{2})\\ k(1, -1) & k(1, 0) & k(1, 1) & k(1, 2) & k(1, \frac{1}{2})\\ k(2, -1) & k(2, 0) & k(2, 1) & k(2, 2) & k(2, \frac{1}{2})\\ k(\frac{1}{2}, -1) & k(\frac{1}{2}, 0) & k(\frac{1}{2}, 1) & k(\frac{1}{2}, 2) & k(\frac{1}{2}, \frac{1}{2}) \end{bmatrix} \] In the above example, I assume $\mu = 0$ (i.e. my prior is that the points are normally distributed around f(x) = 0), then this gives us all we need to plug-and-chug with the Gaussian-conditioning formula: \[ p(\vec{x}_2 | \vec{x}_1) = \mathcal{N}(\vec{\mu}_{2|1}, \Sigma_{22|1})\] \[\vec{\mu}_{2|1} = \vec{\mu}_2 + \Sigma_{21} \Sigma_{11}^{-1} (X_1 - \vec{\mu}_1)\] \[\Sigma_{22|1} = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}\] The y-value of the red line at $x = \frac{1}{2}$ in the image above is simply the mean of the normal distribution that results from conditioning on the four points that were observed. The 95% range is calculated from the variance of the resulting normal distribution.
Noise
As you may have noticed, I have been assuming our observations of the function are perfect, but in practice it is often useful to assume normally distributed noise. Adding noise is a simple matter of adding $\sigma^2 I$ to $k(X,X)$: \[ p(y_* | \vec{X}_*, X, \vec{y}) = \mathcal{N}(\vec{\mu}, \Sigma)\] \[\mu = k(X_*, X) (k(X,X) + \sigma^2 I)^{-1} y\] \[\Sigma = k(X_*, X_*) - k(x_*, X) (k(X,X) + \sigma^2 I)^{-1} k(X, X_*)\] There is some amount of "intuition" behind this formula and it is, of course, derivable. I simply don't think there is much value in deriving it. Rest assured that it follows from no additional assumptions.
If we add noise with a mean of zero and a standard deviation of 0.1, then the example we've been using becomes:
Note that we know have some degree of uncertainty everywhere — even at the points we have observed.
Non-zero Prior
Another thing you may have noticed is that in the above image the function tends towards zero when it lacks any other information. While this may be reasonable in some situations, it would naturally be desirable to be able to encode other priors.
Let $\bar{y}(X)$ yield your prior guess for every point in $X$. Then: \[ p(y_* | \vec{X}_*, X, \vec{y}) = \mathcal{N}(\vec{\mu}, \Sigma)\] \[\mu = \bar{y}(X_*) + k(X_*, X) (K + \sigma^2 I)^{-1} (y - \bar{y}(X))\] \[\Sigma = k(X_*, X_*) - k(x_*, X) (K + \sigma^2 I)^{-1} k(X, X_*)\] The effect of applying the prior $\bar{y}(x) = x$ to the example above is shown below. Note that the uncertainty at every point is unchanged from when we had a prior of zero.
Kernels
So far we have simply assumed the existence of some function $k(x_1, x_2)$, that returns the covariance of $f(x_1)$ and $f(x_2)$. In other words: \[ k(X, X) = \begin{bmatrix} \text{cov}(X_1, X_1) & \dots & \text{cov}(X_1, X_n)\\ \vdots & \ddots & \vdots\\ \text{cov}(X_n, X_1) & \dots & \text{cov}(X_n, X_n)\\ \end{bmatrix}\] And covariance matrices are positive semidefinite, which means that a function $k(x_1, x_2)$ is only a valid kernel function if it yields a positive semidefinite matrix for any input points $X$.
Mercer's Theorem:
$k(X, X)$ is positive semidefinite for any collection of points $X = \begin{bmatrix} x_1, \dots, x_n \end{bmatrix}$
if and only iff
$k(x_1, x_2)$ is symmetric (i.e. $k(x_1, x_2) = k(x_2, x_1)$) and can be expressed as an inner product of some feature expansion: \[ k(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle \]
Which leads to an interesting interpretation of Gaussian Processes, which is that any kernel function you pick is implicitly a choice of a feature expansion.
Indeed, Gaussian Processes can also be derived by rewriting (Bayesian) linear regression in terms of $k(x_1, x_2)$, rather than the feature expansion directly. In this way Gaussian Processes can be viewed as a "kernelized" form of Bayesian linear regression.
To phrase a linear regression as a Gaussian Process, simply let $ k(x_1, x_2) = x_1 \cdot x_2$
More About Kernels
A kernel $k(x_1, x_2)$ is stationary if it can be written as a function of the difference between $x_1$ and $x_2$.
Let $k_1$ and $k_2$ be kernels and let $c > 0$. Then the following are also kernels: \[ k_3(x_1, x_2) = k_1(x_1, x_2) + c\] \[ k_3(x_1, x_2) = k_1(x_1, x_2) \cdot c\] \[ k_3(x_1, x_2) = k_1(x_1, x_2) + k_2(x_1, x_2)\] \[ k_3(x_1, x_2) = k_1(x_1, x_2) \cdot k_2(x_1, x_2)\]
Examples of Kernels
- One of the most popular kernels is called the Gaussian Kernel, Squared Exponential Kernel, or Radial Basis Function \[ k(x_1, x_2) = 2\sigma^2 e^{\frac{(x_1 - x_2)^2}{-2l^2}} \] Which has two hyper parameters: $\sigma$ and $l$. This kernel can be viewed as only considering functions that are infinitely differentiable. In other words, the mean function will always be nice and smooth. An example is given below, with $\sigma = l = 1$.
- But some would say it is too smooth for practical use! The Matérn kernel weakens these requirements. Technically the Matérn kernel is a very wide class of kernels specified by a hyperparameter v. The most popular ones have $v = 1.5$ and $v = 2.5$, which can be written as \[k_{v=1.5}(r) = (1 + \frac{r\sqrt{3}}{l}) \cdot \text{exp}(\frac{r \sqrt{3}}{-l})\] \[k_{v=2.5}(r) = (1 + \frac{r\sqrt{5}}{l} + \frac{5r^2}{3l^2}) \cdot \text{exp}(\frac{r \sqrt{5}}{-l})\] Where $r = |x_1 - x_2|$. These represent forming a prior over functions that are once and twice differentiable (respectively). It is believed that choosing $v > 2.5$ tends to make functions "too smooth" for real problems, while $v < 1.5$ is too rough. An example with $v = 1.5$ and $l = 1.0$ is given below:
- Another useful kernel is the polynomial kernel, which is the kernel-representation of the feature expansion \[ \phi(x) = [1, x, x^2, x^3, x^4, ..., x^p] \] The corresponding kernel is: \[ k(x_1, x_2) = (x_1^T x_2 + 1)^p \] When $p=1$ the resulting mean function will be the least-squares regression line. When $p=2$ it will be the least-squares quadratic line. This is a non-stationary kernel. The plot below uses $p=2$ and a noise with standard deviation 0.1, because if there is no noise, then there is no quadratic equation that passes through every point!
- An unpopular kernel is the Dirac delta function: \[k(x_1, x_2) = \delta(x_1 - x_2)\] But for didatic purposes, here is the plot: