Linear Classifiers


Imagine I have $n$ datapoints of dimension $m$. The goal of classification is to find the function $f$ that takes each row of $X$ and returns the appropriate value of $Y$, and continues to do so as we get more data. Failing that (which we usually do), we want to be correct as often as possible.

If we have two categories and two features, we can think of a linear classifier as drawing a line through the feature-space and guessing that everything above is in one category, and everything below is in the other. This generalizes to higher dimensions, so if we have $m$ dimensions, the classifier draws a $m-1$ hyperplane to make its guesses.



A perceptron's prediction is the sign of a weighted sum of a point's features. In the words of linear algebra, if we have a feature vector $\vec{x}$ and a weight vector $\vec{w}$, then the perceptron takes the dot product: $$\vec{x} \cdot \vec{w}$$ This is known as the activation of the perceptron.

Datapoints classified by a linear classification

For simplicity, let's first consider a binary perceptron - that is a single perceptron that's trying to classify all data into one of two categories: $A$ and $B$. In this case, the classification is straightforward. The perceptron has a weight vector $\vec{w}$, and for every feature vector $x$, it classifies it as $A$ if $\vec{x} \cdot \vec{w} > 0$ and otherwise guesses $B$.

To learn, we first need to define a loss function. To do so, let $y$ denote the true category, using $1$ to denote $A$ and $-1$ to denote $B$. Then, we typically define the loss function as $$\max{\left(0, -y \cdot (\vec{x} \cdot \vec{w}) \right)}$$ where $y$ is the correct classification, $\vec{x}$ is the feature vector, and $vec{w}$ is the weight vector. Some introspection should allow you to convince yourself that the loss will be 0 if we guess the right answer, and some positive number otherwise. The value of the positive number will be proportional to "how far off" our guess was.

From here, we basically just do gradient descent:

Matrix<double> x = matrix of features
Vector<int> y = vector of categories (1s and -1s)
Vector<double> w = [0] * m
while (true) {
    int num_wrong = 0;
    for (uint i = 0; i < n; ++i) {
        double dot = dot_product(x[i], w);
        int guess = dot > 0 ? 1 : -1;
        if (guess == y[i]) continue;
        else ++num_wrong;
        for (uint j = 0; j < w.length; ++j) {
            w[j] += y[i] * x[j];
    if (num_wrong == 0) break;

There is a pretty large problem with this so far: if all your features are 0, then your sum will always be 0, so we'll always guess $B$. This is easy to solve, though, by introducing a bias features, which is just always a 1 for every datapoint.

The running time of this algorithm is $O(l^2/d^2)$ where $l$ is the largest of all the feature vectors' lengths, and where $d$ is the minimum distance between an $A$ feature and a $B$ feature.


The idea is basically the same for multiway perceptrons, except

  1. you create a bunch of binar perceptrons for each category
  2. you guess the category whose dot product is largest
  3. the update step is different. If $w_i$ represents the weights of the correct category, than we update it by adding the features to the weights, in a pairwise fashion. If $w_j$ represents the weights of an incorrect category, then we updated it by subtracting the features from the weights, in a pairwise fashion. Some thought should convince you that this is very similar to the binary case.


  1. They can't learn non-separable data.
  2. The generalizations can be poor.
  3. They tend to overfit the training data with too many iterations.


Margin Infused Relaxed Algorithm (MIRA) is a variation on the perceptron algorithm in which we move the weights more intelligently. In particular, if we guess correctly, we do nothing (ase before), but if we guess incorrectly, we'll multiply the weight change from the perceptron algorithm by $$\min{\left(\frac {\left( \vec{w}_\mathrm{old}-\vec{w}^*_\mathrm{old} \right) \cdot \vec{x} + 1} {2\vec{x} \cdot \vec{x}} ,M\right)}$$ where $\vec{w}_\mathrm{old}$ is the previous weight vector for the guessed class, $\vec{w}_\mathrm{old}^*$ is the previous weight vector for the correct class, and $M$ is some constant.

Here's the pseudocode for the multiclassification case:

MIRA(float[][] input, float[] output, uint iterations) -> float[][] {
    float[][] weights = new float[num_classes][input[0].length];
    for (var it = 0; it < iterations; ++it) {
        for (var i = 0; i < input.length; ++i) {
            best_guess = -1;
            highest_value = -1e300;
            // make a guess
            for (var j = 0; j < num_classes; ++j) {
                value = dot(input[i], weights)
                if (value > highest_value) {
                    highest_value = value;
                    best_guess = j;
            // update weights if we guessed wrong
            if (best_guess != output[i]) {
                // compute tau
                var numerator = dot(subtact(weight[output[i]], weight[best_guess]), input[i]) + 1;
                var denominator = 2 * dot(input[i], input[i]);
                tau = numerator / denominator;

                // make best_guess less likely
                new_weights[best_guess] = subtract(weights[best_gesss], scale(input[i], tau));

                // make output[i] more likely
                new_weights[output[i]] = add(weights[output[i]], scale(input[i], tau));
    return weights;

This is relaed to max-margin methods, which try to define a boundary that maximizes the minimum distances between points and the boundary.

While MIRA and max-margin methods tend to do better in practice, they are still limited to finding a perfect linear solution. If no such solution exists, they fail.

Logistic Regression

Also known as Maximum Entropy, Logistic Regession is a very common machine learning/statistical technique. Unlike the Perceptron and MIRA methods, logistic regression isn't restricted to just outputing a category, but can instead output a probability that a point belongs to a class. In particular, for class $i$, we'll model the probability as $$P(c=i|X=\vec{x})=\frac{1}{1+e^{-\vec{w}_i \cdot \vec{x}}}$$ where $\vec{w}_i$ is the weight vector for class $i$.

If your features are booleans, then you can think of this as each feature "voting" and the overall decision being a weighted sum of the votes.

Maximum Entropy

Intuitively, entropy is a measure how uncertain you are about an outcome. If you are completely certain, the entropy is 0, and if you are uniformly uncertain about an event with $n$ coutcomes, then the entropy is $\log{n}$.

In particular, we define entropy as $$H(X) = \sum_x{P(X=x) \cdot \log{\left(P(X=x)\right)}}$$

Logistic regression is called maximum entropy, because it makes the distribution follow the constraints from the training data, but otherwise maximizes entropy (uncertainty).

In particular, our constraints are that we want a model so that "the empirical expectation offeatures and model expectation of features are the same". In other words, if we are classifying whether animals are mammals, we want it so if 70% of animals with hair are mammals, then we want our model to classify 70% of animals with hair as mammals - even if having hair correlates with, say, producing milk.

Generalized Iterative Scaling

Although we could use any number of better optimization methods, we'll consider generalized linear scaling (GLS) in particular.

Before we go into details, we need to satisfy one requirement that GLS has, which is that the sum of all features must be equal. To do this, we'll introduce a non-negative "slack" feature to each make each datapoint's featuers add up to the same value.

This algorithm repeatedly changes weights so that the empirical expectation of features and model expectation of features are the same.

  1. Compute the expectation of each feature based on the training data. $$E_\tilde{p}\left[\phi_i\right]$$
  2. Initialize the weight vector to 1s
  3. Repeat until convergence:
    1. Compute feature expectations for current model.
    2. Compute ratio of the model's current expectations with the expectations we want based on the training data.
    3. Update the weights based on these ratios by multipling each by $$\left(\frac{E_\tilde{p}\left[\phi_i\right]}{E_\mathrm{current}}\right)^{1/V}$$ where $V$ is the sum of all features in each datapoint - these are all the same, because of the slack feature.

It turns out that, we will

  1. always converge
  2. converge to the same answer regardless of starting weights (assuming they're nonzero)
  3. converge to a distribution that obeys the constraints and maximizes entropy

Pros and Cons

  1. doesn't allow you to find interacitons between features (unless you make that interaction an explicit feature)
  2. avoids possibility of double counting that naive Bayes falls into
  3. often needs regularization to avoid overfitting
  4. can be slow