Combination of multiple submodels, relying on the hypothesis that aggregating several "weak" predictors can often produce a powerful model
Let's play: pick a random number (A) in the range [1, 100]. The outcome of the game is:
Game 1: place 100 bets of $1 each
Game 2: place 10 bets of $10 each
Game 3: place a single bet of $100
We simulate each game 10,000 times. What do you expect?
The variance of the total outcome decreases as the number of bets increases, even though the total amount wagered stays the same: averaging many small, independent outcomes smooths out the randomness
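The exact payoff rule of the game is not spelled out above, so the sketch below simply assumes, purely for illustration, that every bet is fair and either wins or loses its stake with probability 0.5; the point is only to compare the spread of the three games.

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs = 10_000  # number of simulated games per setting

def play(n_bets, stake):
    # Assumed payoff: each bet wins or loses its stake with probability 1/2.
    outcomes = rng.choice([-stake, stake], size=(n_runs, n_bets))
    return outcomes.sum(axis=1)  # total result of each simulated game

for n_bets, stake in [(100, 1), (10, 10), (1, 100)]:
    totals = play(n_bets, stake)
    print(f"{n_bets:>3} bets of ${stake:>3}: mean={totals.mean():7.2f}  std={totals.std():7.2f}")
```

Under this assumption the expected total is the same in all three games, but the standard deviation drops from about $100 (one bet of $100) to about $10 (one hundred bets of $1): averaging many independent outcomes reduces variance, which is the intuition bagging builds on.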
Suppose we use the same training algorithm for every predictor in the ensemble, but train each one on a different random subset of the training set, with samples selected randomly with replacement. This means that some samples can be selected several times, whereas others might not be selected at all.
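A minimal sketch of this bootstrap sampling, using plain NumPy (nothing here is specific to a particular library's bagging implementation):

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10
indices = np.arange(n_samples)

# Draw one bootstrap sample: n_samples indices, drawn with replacement
bootstrap = rng.choice(indices, size=n_samples, replace=True)
oob = np.setdiff1d(indices, bootstrap)  # samples never drawn this time

print("bootstrap sample:", np.sort(bootstrap))  # some indices appear several times
print("left out (oob): ", oob)                  # others are not selected at all
```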
Let's consider the following:
What if these predictors are decision trees?
From the m available features, max_features are selected randomly at each node. Usually the number of features tested for splitting a tree node is the square root of the total number of available features. Each node is split into exactly two child nodes and there is no pruning
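In scikit-learn (used here only as an illustrative sketch on synthetic data), this corresponds to setting max_features="sqrt"; trees are grown with binary splits and no pruning by default.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=25, random_state=0)

# At every node, only sqrt(25) = 5 randomly chosen features are evaluated
# for the (binary) split; trees are grown to full depth (no pruning).
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=0)
forest.fit(X, y)
print("depth of first (unpruned) tree:", forest.estimators_[0].get_depth())
```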
The forest's error rate is highly influenced by two things: the correlation between any two trees in the forest and the strength of each individual tree.
Both correlation and strength are controlled by max_features: if max_features increases, both the correlation between trees and the strength of the individual trees increase. Therefore, we have to find an optimal range of max_features.
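One simple way to look for that optimal range is to scan a few values of max_features and compare cross-validated accuracy. The sketch below uses scikit-learn and a synthetic dataset purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=30, n_informative=10,
                           random_state=0)

# Small max_features -> weak but decorrelated trees;
# large max_features -> strong but highly correlated trees.
for max_features in (2, 5, 10, 20, 30):
    forest = RandomForestClassifier(n_estimators=200,
                                    max_features=max_features,
                                    random_state=0, n_jobs=-1)
    acc = cross_val_score(forest, X, y, cv=5).mean()
    print(f"max_features={max_features:>2}  CV accuracy={acc:.3f}")
```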
The main features of Random Forests are listed below
During bagging, approximately one-third of the data samples are left out of each bootstrap training set TRi. Why? Each of the n draws is made with replacement, so the probability that a given sample is never selected is (1 - 1/n)^n ≈ e^(-1) ≈ 0.37.
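A quick numerical check of that one-third figure; the training-set size n below is an arbitrary choice, any reasonably large value gives roughly the same result.

```python
import numpy as np

n = 10_000  # size of the training set (arbitrary, for illustration)

# Probability that a given sample is never picked in n draws with replacement
print((1 - 1 / n) ** n)      # ~0.368
print(np.exp(-1))            # the limit for large n

# Empirical check on one bootstrap sample
rng = np.random.default_rng(0)
bootstrap = rng.integers(0, n, size=n)
print(1 - np.unique(bootstrap).size / n)   # also ~0.37
```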
These samples are called out-of-bag (oob) samples.
OOB samples are used for estimating an unbiased classification error (the oob error estimate) and variable importance
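In scikit-learn, for example, the oob error estimate is available by setting oob_score=True; the sketch below uses synthetic data for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=15, n_informative=5,
                           random_state=0)

# Each tree is evaluated only on the samples it never saw during training,
# so no separate validation set is needed for this error estimate.
forest = RandomForestClassifier(n_estimators=300, oob_score=True,
                                random_state=0)
forest.fit(X, y)
print("oob error estimate:", 1 - forest.oob_score_)
```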
Mean Decrease Accuracy is the most commonly used variable importance measure
Let α be the number of votes cast for the correct class over all oob samples in all trees. Now randomly permute the values of variable m in the oob cases, push them down the trees again, and let β be the number of correct votes after permutation. Then avg(α − β) over all trees is the raw VI (Variable Importance) score for variable m.
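The sketch below illustrates the same permutation idea with scikit-learn's permutation_importance. Note that this function permutes features on a held-out set rather than on the per-tree oob samples, so it approximates Breiman's raw VI score rather than reproducing the exact procedure described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=300, random_state=0)
forest.fit(X_tr, y_tr)

# Permute one variable at a time and measure how much accuracy drops:
# a large drop means the forest relied heavily on that variable.
result = permutation_importance(forest, X_te, y_te, n_repeats=10,
                                random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"variable {i}: mean decrease in accuracy = {imp:.3f}")
```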
Imagine that you send all training samples (including oob samples) down all trees, count the number of times samples n and k end up in the same terminal node, and normalize by dividing by the number of trees. What does this tell you?
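This quantity is the forest's proximity between samples n and k: pairs that often land in the same terminal node are "similar" from the forest's point of view, and proximities can be used, for example, for clustering, outlier detection, or imputing missing values. A rough sketch of how one could compute it with scikit-learn's apply() (synthetic data, illustrative only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# leaves[i, t] = index of the terminal node that sample i reaches in tree t
leaves = forest.apply(X)
n_samples, n_trees = leaves.shape

# proximity[n, k] = fraction of trees in which samples n and k share a leaf
proximity = np.zeros((n_samples, n_samples))
for t in range(n_trees):
    same_leaf = leaves[:, t][:, None] == leaves[:, t][None, :]
    proximity += same_leaf
proximity /= n_trees

print(proximity[0, :5])  # proximity of sample 0 to the first five samples
```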
(1) Random Forests is an ensemble classifier that consists of several decision trees
(2) The decision trees are built on samples selected randomly with replacement (bootstrap sampling)
(3) The most important hyperparameters are the number of decision trees and the number of variables used to split the decision tree nodes
(4) About 1/3 of the samples, called out-of-bag (OOB) samples, are used for assessing the accuracy of the trained model
(5) OOB samples are also used to calculate the importance of the input variables (e.g. Mean Decrease Accuracy)