#math

# Random Forests

Random Forests are an extremely popular machine learning model, and are commonly cited as the best out-of-the-box classical algorithm (particularly in conjunction with Adaboost). This is due to their uncanny ability to not overfit, as well as their ability to capture nonlinear relationships. They can be used for both regression and classification problems.

## Decision Trees

Decision trees are extremely powerful machine learning models, capable of representing an enormous array of relationships. Unfortunately, their great power is their undoing, and they pay for it with a strong propensity towards overfitting, and their low training error can be exceptionally misleading as a predictor of test error.

With some care they can be salvaged as workable (if not great) models (e.g. with pruning), but even then, the algorithms that construct decision trees are highly sensitive to the data they are given — even a tiny change in the data can lead to a radically different decision tree.

## Ensembles

The idea of an ensemble is to aggregate the prediction of a number of models in the hopes that the resulting prediction is better than the individuals' predictions in expectation. Indeed, given an ensemble of "independent" (to some degree) "weak learners" (learners that are correct $50 + \epsilon$ percent of the time), it can be proven that ensembles will converge to a strong learner.

Decision trees are ideal for ensembling. By averaging each tree's predictions together, you help mitigate the high variance of their predictions. Moreover you turn this weakness into a strength, as the success of an ensemble rests, in large part, on the diversity of its ensemble. An ensemble of trees is a Random Forest.

Due to concerns over overfitting, and due to the computational cost of training and pruning hundreds of trees, the trees used in random forests are often very shallow and unpruned. In some cases even "decision stumps" are used -- trees that only have one parent/decision node.

## Bagging

A key idea in ensemble learning is that the constituents should strive to be reasonably independent — if all your decision trees are the same, you might as well only have one! A fundamental idea for training a random forest is bagging: every decision tree is trained on a sample of the dataset, rather than the entire dataset.

For example, if you have 100 datapoints in your training set, you might train each decision tree on a random set of 80 points. This introduces some variety in each tree, while also ensuring each tree has enough points to be reasonably accurate models.

If your training set has $N$ data points, a popular technique is to train each tree on a sample of size $N$ drawn with replacement from the training set.

Another common practice for encouraging a healthy level of tree diversity is to randomly sample which features each tree can see. For example, the first decision tree might only be able to see features 1, 3, and 4, while the second decision tree might only be able to see features 1, 2, and 5. Using this in conjunction with bagging is a pretty standard implementation:

Because each tree only sees a subset of the data, this leaves open an interesting opportunity: we can use the data that the tree hasn't seen to compute an unbiased estimate of its performance on the test set.

By applying this idea to every tree in the ensemble and averaging their estimates, we can compute an estimate of the entire ensemble's performance. A common misconception is that this estimate is unbiased, but it is, in fact, pessimistic. This can be intuitively understood: if the accuracy of the forest was just the average accuracy of each tree, we wouldn't bother to use a forest at all &emdash; single tree would be just as accurate!

Adaboost is a very common and lauded modification to the naive random forest, and describes a way to reweight the importance datapoints and trees. Points that are weighted more will be more often selected to be used to train future trees. Adaboost was originally formulated for binary classification.

The algorithm begins by initializing the weight of every point to 1. Every time it trains a tree, it samples (with replacement) from the training data (as is typical for bagging), but with the probability of selecting a datapoint proportional to its weight. If we define the current weights of the training data at some mth step of the algorithm as $w_1, w_2, ..., w_N$, then we can define the error rate as: $e_m = \frac{\Sigma (w_i \cdot I(y_i \neq \hat{y_i}))}{\Sigma (w_i)} = \frac{\text{weighted sum of errors}}{\text{normalization constant}}$ Let $\alpha_m = \frac{e_m}{1 - e_m}$. Then, for every point that was misclassified by the most recent tree, we multiply their weights by $\alpha_m^{-1}$, and then renormalize at the end of the step (so that the weights are probabilities).

As a sanity check, note that if $0 < e_m < 0.5$ (which should always be the case for a weak learner), then $\alpha_m^{-1} > 1$, which means that misclassified points will have their weights increased.

The final output of the whole forest is now the weighted sum of each tree, where each tree's weight is equal to $\alpha_m$: $\hat{y}(x) = \Sigma_{m=1}^M \alpha_m \cdot g_m(x)$ Where $g_m$ is the prediction of the mth tree.

For regression problems with the loss function $L$, let $D = \sup{L(y_i, \hat{y}_i)}$. We use $D$ to force the errors between zero and one, and compute the expected (weighted) loss:

$$e_m = \Sigma \frac{L(y_i, \hat{y}_i) w_i}{D}$$

This "D" trick seems super sketch to me... but I'm not a CS professor. Then, as in classification, we let $\alpha_m = \frac{e^m}{1 - e^m}$. Every training point's weight is multiplied by $\alpha_m^{1 - L(y, \hat{y}) / D}$. If a datapoint is predicted perfectly, it is multiplied by 1, while large errors will force the exponent to be negative (and $\alpha_m < 0.5$)

Another common alteration to the random forest algorithm is gradient boosting, which is designed for regression. In this case, every tree is trained on the residuals of the trees before it:

def gradient_boosting(X, Y, number_of_trees):
forest = []    # list of trees that we'll return
for i in range(number_of_trees):
tree = train_tree(X, Y) # train tree on residuals
forest.append(tree)     # add tree to forest
Y = Y - tree.predict(X) # update Y to be the new residuals
return forest


The prediction of the entire forest is then simply the sum of its constituent trees.

This can be viewed as a form of implicit "boosting", where the 'weight' of each point is the error the forest is currently making on it.