If you are interested in a Random Forest Introduction, you’ve come to the right place. Welcome! In Random Forest classification, multiple decision trees are created when training the data. Classes or labels are predicted by passing a data point through each tree and selecting the class which occurs the most. There have been several high-profile studies on Random Forests. Perhaps the definitive one is the original paper on this topic in 2001 by Leo Breiman from the University of California. He states:
‘Significant improvements in classification accuracy have resulted from growing an ensemble of trees and letting them vote for the most popular class.’
Decision Tree Recap
Random Forest classification uses decision trees for predictions. Therefore it’s important to understand decision trees and their key parameters for model design. A decision tree algorithm is an ML method for classification and regression that uses a logical workflow with one or more decisions that ultimately result in a data point classification / predicted target. Decision trees consist of nodes which are essentially logical expressions and branches, which are evaluations of those expressions. The basic structure of a decision tree contains a root node and one or more branches, leading to a label classification, also known as a leaf. Such decision trees are trivial and tend to yield better results when used as part of an ensemble. A decision tree algorithm is said to be recursive if it contains one or more non-root nodes that are themselves a subtree of the overall tree. To make a split at a node, the split which maximizes the reduction in dataset impurity is selected and there are several well-known methods to do so.
Dataset impurity, or entropy, refers to the uncertainty in the dataset. Impurity is at its maximum when all possible outcomes have an equal chance of occurring. When there is only one possible outcome for a dataset, there is no impurity present. Choosing the optimal decision tree for a dataset requires that features/splits which result in the highest dataset impurity reduction should be selected at each node. Here, I will explore two of the most widely known methods for calculating dataset impurity.
Entropy & Information Gain
Entropy is formally defined as follows:
which says the entropy of a dataset S is equal to the sum for each class/label of dataset S of -1 times the proportion or probability pi of that class in S, times the log of pi to the base 2. With the log base set to 2, entropy is measures in bits.
Information Gain is the reduction of entropy obtained after splitting a dataset. It is given by:
which says the Information Gain of an attribute A, with entropy S, is equal to the entropy of the dataset, minus the sum for each distinct value v of an attribute A, of the probability of v in S times the entropy of the data subset Sv.
Gini Index & GiniGain
Another popular measure of dataset impurity is the Gini Index. It is given by:
which says the Gini Index of a dataset S is equal to 1 minus the sum for each class i in a dataset S, of the squared proportion of i in S.
The Gini Index is less computationally intensive to calculate than Entropy, as it does not contain a logarithmic function.
GiniGain is an alternative measure of impurity reduction. It is given by:
which says the Gini Gain of an attribute A, with entropy S, is equal to the Gini Index of the dataset, minus the sum for each distinct value v of an attribute A, of the probability of v in S times the Gini Index of the data subset Sv.
Random Forest Training
Random Forest Structure & Mechanics
As the name implies, Random Forests contain a multitude of decision trees, each containing a dataset randomly selected from the training data. The importance of randomness here cannot be overstated. A Random Forest of high quality
contains trees that have little to no correlation with each other. This means that each tree will produce an independent result and when the results are aggregated, the effect of error is diminished.
There are several techniques to achieve randomness and hence, tree independence, in Random Forests. The following are some of the most commonly used techniques.
Also referred to as Bagging, Bootstrap Aggregation works by sampling data at random from the training data with replacement. Replacement in this context means that the same datapoint can be chosen more than once. Bagging should be performed for every tree so that trees are independent of each other from a data standpoint. Typically this would be done for both model design and the execution / testing phases.
Random Feature Selection
Not to be confused with Feature Selection from a model design perspective. Random feature selection in Random Forests refers to the features that are available to choose from when performing a split. Unlike decision trees, Random Forests do not consider all the features for a split. Instead, a specific number of features are chosen at random from the full feature set. There is a lot of discussion on how what the number should be. Commonly, the √p where p is the total number of features. An arbitrarily high number of features can increase the power of an individual tree but increase the correlation between trees. An arbitrarily low number of features decreases the predictive power of any one tree but reduces the correlation between trees. As per the standard approach on hyperparameter tuning, use trial and error to determine the best setting. One can make use of ML packages to streamline this process.
Number of Estimators
Refers to the number of trees in the random forests. A high number of trees can lead to good prediction accuracy on test data. However as will be shown, there comes a point when adding more trees does not add any more value to the ensemble. To classify a datapoint, a Random Forest runs a datapoint through each tree in the forest. Each tree predicts a label and the mode label is deemed the resultant classification. For this reason, I chose odd numbers as candidates for the number of estimators.
Aligning with common approaches, I experimented with two of the main impurity reduction methods during model design. These are Information Gain, and GiniGain. Information Gain was selected as the criterion for testing as it yielded excellent results with a relatively low number of estimators.
Hyperparameter Tuning + Own Experience
In my own experimentation, a Random Forest was coded from scratch. Three hyperparameters were selected and tuned when designing the model. These are: n_estimators, max_features and criterion. Respectively, they refer to how many trees to create, the number of features to choose from for each split, and the function to use for impurity reduction. The following table details the range of settings I supplied to each hyperparameter.
Ten iterations of the algorithm were performed on every hyperparameter combination, yielding 12 * 4 * 2 * 10 = 960 total iterations.
Classification accuracy on training and test data was calculated for all 960 iterations. Results were then grouped by n_estimators, max_features and criterion. The average training & testing accuracies were then observed. Accuracy on classification of training data points was unequivocally 100%. This makes sense as the logic in the decision trees were
built using the training data and none of the trees were pruned.
As is evident from the charts below, every combination of hyperparameter settings yielded an average accuracy after 10 iterations of more than 96%. The configuration which led to the most accurate classification accuracy was using the Gini criterion along with a maximum of 1 random feature to decide on best split, and 131 estimators. This configuration resulted in an accuracy of 99.61%. However, an accuracy of 99.41% was obtained by using only 81 estimators, if configured with the Entropy criterion and a maximum of 2 random features to decide on best split. As the latter model proved to be just as accurate with 50 (38%) less estimators than the former model, I chose the latter model’s hyperparameters as the basis for further testing.
Average Prediction Accuracy % On Training & Test Data Using Information & Gini Gain
Optimal Hyperparameter Settings
What set of parameters are optimal is somewhat subjective. It all depends on the complexity of the dataset, the computation power of the machine, and level of accuracy is deemed acceptable for the use case. In my case, I chose the following:
To gain an insight into model performance relative to an OOTB Random Forest algorithm, I coded up a Python command line program which performed 10 iterations using the optimal hyperparameters on both the hand-built algorithm and a reference implementation using Scikit-Learn. Prediction accuracy vs Scikit-Learn, ROC plots and testing summary were recorded.
The results showed both algorithms producing similar results across 10 iterations. On average, Scikit-Learn’s Random Forest algorithm yielded a classification accuracy of 96.86% compared to 97.06% for the hand-built algorithm. In other words, the hand-built algorithm was 0.2% in magnitude and 0.21% relatively speaking, better Scikit-Learn’s implementation. A two-tailed independent t-test with a null Hypothesis stating that the means are equal was performed on the accuracies with α = 0.05, resulting in a two-tailed p-value of 0.81. The t-test is included in the program results folder and a summary shown below.
Average Prediction Accuracy
Hand-Built: 97.06% Scikit-Learn: 96.86%
ROC Curves for Hand-Built Algorithm
Receiver Operator Curves (ROC) were generated dynamically at runtime for the algorithm coded by hand, considering the classification results of all iterations as a single dataset. The area under the curve in all cases was greater than 0.99.
In this post we introduced Random Forests and the key concepts, parameters and things to watch out for during training. To motivate the algorithm further, a Random Forest algorithm was coded by hand and its accuracy was pitted against a reference implementation in Scikit-Learn. Both algorithms were used to train a dataset and make a prediction on the class. Although the hand-built algorithm outperformed (from an accuracy standpoint) Scikit-Learn’s implementation by 0.21% in relative terms, an independent t-test produced a p-value of 0.81. As this exceed the significance level of 0.05, the null hypothesis was rejected. Therefore, there is no statistically significant difference between the algorithms in terms of prediction accuracy. In other words, the algorithms produced the same results!
With respect to the ROC curves, the results show that for each class, there exists a near-perfect threshold so that maximum true positives can be obtained whilst having little to no false positives. With an AUC of over 0.99 in each case, the models are practically, but not theoretically ideal. Not bad for a hand-built Random Forest!
If you enjoyed this post, please leave a comment below. Likewise, and for more content and news, why not follow me on Twitter or subscribe to my YouTube channel. For direct contact, feel free to use the contact form.