Random Forests

Random Forests are an extremely popular machine learning model, and are often cited as one of the best out-of-the-box algorithms (particularly in conjunction with Adaboost). This is due to their strong resistance to overfitting, as well as their ability to capture nonlinear relationships. They can be used for both regression and classification problems.

Ensembles

Decision trees are extremely powerful machine learning models, capable of representing an enormous array of relationships. Unfortunately, that same power is their undoing: they have a strong propensity towards overfitting, and their low training error can be exceptionally misleading as a predictor of test error.

The idea of an ensemble is to aggregate the predictions of a number of models in the hope that the combined prediction is better, in expectation, than any individual prediction. Indeed, given an ensemble of (at least somewhat) independent "weak learners" (learners that are correct $50 + \epsilon$ percent of the time on a binary problem), it can be proven that the ensemble converges to a strong learner as more learners are added.
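To see why this is plausible, here is a quick numerical sketch (an illustration of the idea, not part of any forest implementation, and it assumes the classifiers are perfectly independent, which real trees never are). It computes the accuracy of a majority vote over $M$ classifiers that are each independently correct 55% of the time.

from math import comb

def majority_vote_accuracy(M, p):
    # probability that a strict majority of M independent classifiers
    # (each correct with probability p) gets the right answer
    return sum(comb(M, k) * p**k * (1 - p)**(M - k)
               for k in range(M // 2 + 1, M + 1))

for M in (1, 11, 101, 1001):
    print(M, round(majority_vote_accuracy(M, 0.55), 3))
# the vote's accuracy climbs from 0.55 toward 1.0 as M grows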

Random forests are a particular kind of ensemble whose constituents are decision trees. Because decision trees have high variance, using an ensemble is a very natural idea for lowering their variance while (hopefully) maintaining their representational power.

As with all ensemble learning, the ideas of bagging (training each component learner on a random sample of the data rather than the whole dataset) and boosting (weighting some points more than others) are commonly used for random forests.

Training

A key idea in ensemble learning is that the constituents should strive to be reasonably independent — if all your decision trees are the same, you might as well only have one! A fundamental idea for training a random forest is bagging: every decision tree is trained on a sample of the dataset, rather than the entire dataset.

For example, if you have 100 datapoints in your training set, you might train each decision tree on a random set of 80 points. This introduces some variety between the trees, while also ensuring each tree has enough points to be a reasonably accurate model.

If your training set has $N$ data points, a popular technique is to train each tree on a sample of size $N$ drawn with replacement from the training set.
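This is exactly a bootstrap sample. One consequence worth knowing (a standalone check, not forest code): a with-replacement sample of size $N$ contains only about $1 - 1/e \approx 63\%$ of the distinct training points, so each tree really does see a somewhat different dataset.

import numpy as np

N = 1000
rng = np.random.default_rng(0)
indices = rng.choice(N, size=N, replace=True)  # bootstrap sample of row indices
print(len(np.unique(indices)) / N)             # roughly 0.63, i.e. about 1 - 1/e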

Another common practice is to restrict which features each decision tree can see. For example, the first decision tree might only be able to see features 1, 3, and 4, while the second decision tree might only be able to see features 2, 5, and 6. A pretty standard implementation uses this in conjunction with bagging:

import numpy as np

def train_random_forest(X, Y, number_of_trees):
    X, Y = np.asarray(X), np.asarray(Y)
    N = len(X)        # size of training set
    D = len(X[0])     # number of dimensions
    forest = []       # list of trees that we'll return
    for i in range(number_of_trees):
        # bagging: a bootstrap sample of the rows, drawn with replacement
        indices = np.random.choice(N, size=N, replace=True)
        bagged_x = X[indices]
        bagged_y = Y[indices]
        # feature restriction: a random subset of the dimensions (sqrt(D) is a common default)
        random_dimensions = np.random.choice(D, size=max(1, int(np.sqrt(D))), replace=False)
        # each tree is trained on its own bagged data and feature subset
        t = train_decision_tree(bagged_x, bagged_y, random_dimensions)
        forest.append(t)
    return forest
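The pseudocode above only covers training. Here is a minimal prediction sketch, assuming each tree returned by train_decision_tree exposes a predict method for a single point and that we are doing classification by majority vote (for regression you would average the predictions instead); the function name is just illustrative:

from collections import Counter

def predict_random_forest(forest, x):
    # each tree votes on the label for x; the forest returns the majority
    votes = [tree.predict(x) for tree in forest]
    return Counter(votes).most_common(1)[0][0]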

Adaboost

Adaboost is a very common and lauded modification to the naive random forest, and describes a way to reweight the importance of datapoints and trees. Points that are weighted more heavily will be selected more often to train future trees. Adaboost was originally formulated for binary classification.

The algorithm begins by initializing the weight of every point to 1. Every time it trains a tree, it samples (with replacement) from the training data (as is typical for bagging), but with the probability of selecting a datapoint proportional to its weight. If we define the current weights of the training data at the $m$th step of the algorithm as $w_1, w_2, \ldots, w_N$, then we can define the error rate as: \[ e_m = \frac{\sum_i w_i \cdot I(y_i \neq \hat{y}_i)}{\sum_i w_i} = \frac{\text{weighted sum of errors}}{\text{normalization constant}} \] Let $\alpha_m = \frac{e_m}{1 - e_m}$. Then, for every point that was misclassified by the most recent tree, we multiply its weight by $\alpha_m^{-1}$, and then renormalize at the end of the step (so that the weights are probabilities).

As a sanity check, note that if $0 < e_m < 0.5$ (which should always be the case for a weak learner), then $\alpha_m^{-1} > 1$, which means that misclassified points will have their weights increased.
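To make the bookkeeping concrete, here is a hedged sketch of one Adaboost round in the notation above. The adaboost_step name is illustrative; it reuses the train_decision_tree helper from the earlier pseudocode, and it assumes X, Y, and weights are NumPy arrays and that a tree can predict on the whole training matrix at once:

import numpy as np

def adaboost_step(X, Y, weights):
    N = len(X)
    # resample the training set, picking points with probability proportional to their weights
    probabilities = weights / weights.sum()
    indices = np.random.choice(N, size=N, replace=True, p=probabilities)
    # train on the weighted resample (here the tree may see every feature)
    tree = train_decision_tree(X[indices], Y[indices], np.arange(X.shape[1]))

    # weighted error rate e_m and the ratio alpha_m = e_m / (1 - e_m)
    misclassified = tree.predict(X) != Y
    e_m = np.sum(weights * misclassified) / np.sum(weights)
    alpha_m = e_m / (1 - e_m)

    # multiply the weights of misclassified points by 1/alpha_m, then renormalize
    new_weights = weights * np.where(misclassified, 1.0 / alpha_m, 1.0)
    new_weights /= new_weights.sum()
    return tree, alpha_m, new_weights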

The final output of the whole forest is a weighted vote of the trees, where each tree's weight is $\log(1/\alpha_m)$, so that trees with lower error count for more. For labels in $\{-1, +1\}$: \[ \hat{y}(x) = \operatorname{sign}\left( \sum_{m=1}^M \log\!\left(\frac{1}{\alpha_m}\right) g_m(x) \right) \] where $g_m$ is the prediction of the $m$th tree.
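Continuing the sketch above, the full training loop and the weighted vote might look like this (assuming labels in $\{-1, +1\}$ and the adaboost_step helper from before; the function names are illustrative):

import numpy as np

def train_adaboost(X, Y, number_of_trees):
    weights = np.ones(len(X)) / len(X)   # uniform starting weights
    trees, alphas = [], []
    for m in range(number_of_trees):
        tree, alpha_m, weights = adaboost_step(X, Y, weights)
        trees.append(tree)
        alphas.append(alpha_m)
    return trees, alphas

def predict_adaboost(trees, alphas, X):
    # weighted vote: each tree's vote counts log(1 / alpha_m)
    score = sum(np.log(1.0 / a) * tree.predict(X) for tree, a in zip(trees, alphas))
    return np.sign(score)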

Adaboost in Regression (source)

    For regression problems with loss function $L$, let $D = \sup_i L(y_i, \hat{y}_i)$.  We use $D$ to force the errors between zero and one, and compute the expected (weighted) loss:

    \[ e_m = \sum_i \frac{w_i \, L(y_i, \hat{y}_i)}{D} \]

    This "D" trick seems super sketch to me... but I'm not a CS professor.  Then, as in classification, we let $\alpha_m = \frac{e^m}{1 - e^m}$.  Every training point's weight is multiplied by $\alpha_m^{1 - L(y, \hat{y}) / D}$.  If a datapoint is predicted perfectly, it is multiplied by 1, while large errors will force the exponent to be negative (and $\alpha_m < 0.5$)
    

Gradient Boosting

Another common alteration to the random forest algorithm is gradient boosting, which is designed for regression. In this case, every tree is trained on the residuals of the trees before it:

import numpy as np

def gradient_boosting(X, Y, number_of_trees):
    residuals = np.asarray(Y, dtype=float)  # start with the original targets
    forest = []                             # list of trees that we'll return
    for i in range(number_of_trees):
        tree = train_tree(X, residuals)     # fit this tree to the current residuals
        forest.append(tree)                 # add tree to forest
        residuals = residuals - tree.predict(X)  # what's left for the next tree to explain
    return forest

The prediction of the entire forest is then simply the sum of its constituent trees' predictions.
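In code, the prediction is a one-liner (assuming each tree has a predict method, as in the pseudocode above; the function name is illustrative):

def predict_gradient_boosting(forest, X):
    # each tree predicts a correction; the forest's prediction is their sum
    return sum(tree.predict(X) for tree in forest)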

This can be viewed as a form of implicit "boosting", where the 'weight' of each point is the error the forest is currently making on it.