Random Forests

Random forests are an ensemble learning method for classification (and regression) that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by individual trees. – Wikipedia

Check the randomForest manual for options and available tools.
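
Before wiring the forest into caret, here is a minimal sketch of the underlying randomForest() interface, run on the built-in iris data (caret's method="rf", used below, wraps this same function):

library(randomForest)
set.seed(4747)
rf.sketch <- randomForest(Species ~ ., data=iris, ntree=300, mtry=2)
rf.sketch  # prints an OOB error estimate and a confusion matrix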

library(caret)
library(randomForest)  # for getTree(), treesize(), importance(), varImpPlot() below
library(rpart)         # this is where the car90 data lives
library(dplyr)

car <- car90 %>% 
  select(-Mileage, -Reliability, -Eng.Rev, -Gear.Ratio, -Model2, -Sratio.m) %>%
  na.omit()


set.seed(4747)
inTrain <- createDataPartition(y = car$Type, p=0.7, list=FALSE)
car.train <- car[inTrain,]
car.test <- car[-inTrain,]


# note that you can set the number of trees (here 300) when fitting,
# but caret only tunes mtry for method="rf"; ntree is not a tuning parameter
car.rf.train <- train(Type ~., data=car.train, method="rf",
           trControl = trainControl(method="none"), 
           ntree=300, tuneGrid = data.frame(mtry=5),
           importance = TRUE)  # need this for variable importance below

print(car.rf.train$finalModel)
## 
## Call:
##  randomForest(x = x, y = y, ntree = 300, mtry = param$mtry, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 300
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 12.7%
## Confusion matrix:
##         Compact Large Medium Small Sporty Van class.error
## Compact      11     0      2     1      0   0       0.214
## Large         0     4      1     0      0   0       0.200
## Medium        2     0     17     0      0   0       0.105
## Small         0     0      0    14      0   0       0.000
## Sporty        1     0      0     1     10   0       0.167
## Van           1     0      0     0      0   6       0.143
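
As a sanity check, the 12.7% OOB error above is just the off-diagonal fraction of the confusion matrix that the fitted randomForest object stores:

oob.tab <- car.rf.train$finalModel$confusion[, 1:6]  # drop the class.error column
1 - sum(diag(oob.tab)) / sum(oob.tab)                # 9/71 = 0.127, as printed above
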
predictions <- predict(car.rf.train, car.test)

confusionMatrix(car.test$Type, predictions)$overall[1]
## Accuracy 
##    0.815
confusionMatrix(car.test$Type, predictions)$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##       8.15e-01       7.67e-01       6.19e-01       9.37e-01       3.70e-01 
## AccuracyPValue  McnemarPValue 
##       2.95e-06            NaN
confusionMatrix(car.test$Type, predictions)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Compact Large Medium Small Sporty Van
##    Compact       5     0      0     0      0   0
##    Large         0     0      2     0      0   0
##    Medium        0     0      7     0      0   0
##    Small         0     0      0     4      1   0
##    Sporty        1     0      1     0      3   0
##    Van           0     0      0     0      0   3
## 
## Overall Statistics
##                                         
##                Accuracy : 0.815         
##                  95% CI : (0.619, 0.937)
##     No Information Rate : 0.37          
##     P-Value [Acc > NIR] : 2.95e-06      
##                                         
##                   Kappa : 0.767         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: Compact Class: Large Class: Medium
## Sensitivity                   0.833           NA         0.700
## Specificity                   1.000       0.9259         1.000
## Pos Pred Value                1.000           NA         1.000
## Neg Pred Value                0.955           NA         0.850
## Prevalence                    0.222       0.0000         0.370
## Detection Rate                0.185       0.0000         0.259
## Detection Prevalence          0.185       0.0741         0.259
## Balanced Accuracy             0.917           NA         0.850
##                      Class: Small Class: Sporty Class: Van
## Sensitivity                 1.000         0.750      1.000
## Specificity                 0.957         0.913      1.000
## Pos Pred Value              0.800         0.600      1.000
## Neg Pred Value              1.000         0.955      1.000
## Prevalence                  0.148         0.148      0.111
## Detection Rate              0.148         0.111      0.111
## Detection Prevalence        0.185         0.185      0.111
## Balanced Accuracy           0.978         0.832      1.000
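
The Kappa statistic reported above adjusts raw accuracy for agreement expected by chance; we can reproduce the 0.767 directly from the table:

tab <- confusionMatrix(car.test$Type, predictions)$table
p.o <- sum(diag(tab)) / sum(tab)                      # observed agreement, 22/27
p.e <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # agreement expected by chance
(p.o - p.e) / (1 - p.e)                               # Cohen's kappa = 0.767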

Using OOB error to find mtry

Because each tree is grown on a bootstrap sample, roughly a third of the observations are "out of bag" (OOB) for that tree; predicting those cases gives an honest error estimate without a separate resampling loop, which caret exploits via trainControl(method="oob").

set.seed(4747)
# as before, ntree is fixed at 300; only mtry is tuned (here over 1:20)
car.rfm.train <- train(Type ~., data=car.train, method="rf",
           trControl = trainControl(method="oob"), 
           ntree=300, tuneGrid = data.frame(mtry=1:20),
           importance = TRUE)  # need this for variable importance below

car.rfm.train
## Random Forest 
## 
## 71 samples
## 27 predictors
##  6 classes: 'Compact', 'Large', 'Medium', 'Small', 'Sporty', 'Van' 
## 
## No pre-processing
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa
##    1    0.479     0.307
##    2    0.662     0.561
##    3    0.775     0.714
##    4    0.845     0.806
##    5    0.901     0.877
##    6    0.887     0.860
##    7    0.887     0.859
##    8    0.901     0.877
##    9    0.887     0.860
##   10    0.915     0.895
##   11    0.901     0.878
##   12    0.915     0.895
##   13    0.915     0.895
##   14    0.915     0.895
##   15    0.901     0.877
##   16    0.901     0.877
##   17    0.873     0.842
##   18    0.930     0.912
##   19    0.887     0.860
##   20    0.915     0.895
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 18.
predictions <- predict(car.rfm.train, car.test)

confusionMatrix(car.test$Type, predictions)$table
##           Reference
## Prediction Compact Large Medium Small Sporty Van
##    Compact       5     0      0     0      0   0
##    Large         0     0      2     0      0   0
##    Medium        0     0      7     0      0   0
##    Small         1     0      0     4      0   0
##    Sporty        1     0      1     0      3   0
##    Van           0     0      0     0      0   3
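
To eyeball the tuning profile, caret's plot method for train objects charts the resampled accuracy against mtry; here the peak at mtry=18 matches the printout above:

plot(car.rfm.train)  # OOB accuracy as a function of mtry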

A quick glance at the individual trees

We can extract a given tree or get some information about the ensemble.

car.tree2 <- getTree(car.rf.train$finalModel, k=2, labelVar=TRUE) # extract the second tree
car.tree2 %>% head(10)  # status -1 marks a terminal node; its class is in the prediction column
##    left daughter right daughter split var split point status prediction
## 1              2              3      Tank        13.4      1       <NA>
## 2              4              5     Price      9569.0      1       <NA>
## 3              6              7    Length       187.5      1       <NA>
## 4              0              0      <NA>         0.0     -1      Small
## 5              0              0      <NA>         0.0     -1     Sporty
## 6              8              9      Disp       112.5      1       <NA>
## 7             10             11  RearShld        59.8      1       <NA>
## 8             12             13      Tank        15.2      1       <NA>
## 9             14             15    Height        58.8      1       <NA>
## 10             0              0      <NA>         0.0     -1     Medium
treesize(car.rf.train$finalModel) %>% head(15) # terminal-node counts for the first 15 trees
##  [1] 13 15 16 17 19 12 17 14 19 19 13 13 16 19 14
hist(treesize(car.rf.train$finalModel))  # distribution of tree sizes across the ensemble

Tuning the model: mtry

We can also tune the structure, i.e., find the best hyperparameters of the method, via a grid search using cross-validation:

set.seed(4747)
car.rf.traincv <- train(Type ~., data=car.train, method="rf",
           trControl = trainControl(method="cv", number=10), tuneGrid = data.frame(mtry=1:20))

car.rf.traincv  # CV model tells us to use mtry=10
## Random Forest 
## 
## 71 samples
## 27 predictors
##  6 classes: 'Compact', 'Large', 'Medium', 'Small', 'Sporty', 'Van' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 64, 63, 64, 64, 63, 65, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa
##    1    0.498     0.336
##    2    0.734     0.655
##    3    0.845     0.801
##    4    0.857     0.818
##    5    0.898     0.870
##    6    0.884     0.852
##    7    0.884     0.852
##    8    0.884     0.852
##    9    0.898     0.871
##   10    0.927     0.907
##   11    0.873     0.839
##   12    0.884     0.852
##   13    0.898     0.871
##   14    0.915     0.891
##   15    0.915     0.891
##   16    0.902     0.875
##   17    0.888     0.855
##   18    0.871     0.836
##   19    0.890     0.859
##   20    0.871     0.836
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 10.
car.rf.traincv$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 10
## 
##         OOB estimate of  error rate: 8.45%
## Confusion matrix:
##         Compact Large Medium Small Sporty Van class.error
## Compact      13     0      0     1      0   0      0.0714
## Large         0     4      1     0      0   0      0.2000
## Medium        2     0     17     0      0   0      0.1053
## Small         0     0      0    14      0   0      0.0000
## Sporty        1     0      0     0     11   0      0.0833
## Van           1     0      0     0      0   6      0.1429
predictions <- predict(car.rf.traincv, car.test)
confusionMatrix(car.test$Type, predictions)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Compact Large Medium Small Sporty Van
##    Compact       5     0      0     0      0   0
##    Large         0     0      2     0      0   0
##    Medium        0     0      7     0      0   0
##    Small         1     0      0     4      0   0
##    Sporty        1     0      1     0      3   0
##    Van           0     0      0     0      0   3
## 
## Overall Statistics
##                                         
##                Accuracy : 0.815         
##                  95% CI : (0.619, 0.937)
##     No Information Rate : 0.37          
##     P-Value [Acc > NIR] : 2.95e-06      
##                                         
##                   Kappa : 0.767         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: Compact Class: Large Class: Medium
## Sensitivity                   0.714           NA         0.700
## Specificity                   1.000       0.9259         1.000
## Pos Pred Value                1.000           NA         1.000
## Neg Pred Value                0.909           NA         0.850
## Prevalence                    0.259       0.0000         0.370
## Detection Rate                0.185       0.0000         0.259
## Detection Prevalence          0.185       0.0741         0.259
## Balanced Accuracy             0.857           NA         0.850
##                      Class: Small Class: Sporty Class: Van
## Sensitivity                 1.000         1.000      1.000
## Specificity                 0.957         0.917      1.000
## Pos Pred Value              0.800         0.600      1.000
## Neg Pred Value              1.000         1.000      1.000
## Prevalence                  0.148         0.111      0.111
## Detection Rate              0.148         0.111      0.111
## Detection Prevalence        0.185         0.185      0.111
## Balanced Accuracy           0.978         0.958      1.000
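
The selected value also lives on the fitted object, and the CV profile can be plotted the same way as the OOB one:

car.rf.traincv$bestTune  # mtry = 10
plot(car.rf.traincv)     # CV accuracy as a function of mtry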

Variable Importance

From the randomForest package documentation:

Here are the definitions of the variable importance measures. The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The differences between the two are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).

The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.
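
To make the first measure concrete, here is a hand-rolled sketch of the permutation idea for a single predictor (Tank, chosen arbitrarily); unlike randomForest, it permutes the test set rather than the per-tree OOB cases, so it only illustrates the logic. The two built-in measures correspond to the type argument of importance():

set.seed(4747)
baseline <- mean(predict(car.rf.train, car.test) != car.test$Type)
shuffled <- car.test
shuffled$Tank <- sample(shuffled$Tank)   # break Tank's association with Type
permuted <- mean(predict(car.rf.train, shuffled) != car.test$Type)
permuted - baseline  # increase in error rate after permuting Tank

importance(car.rf.train$finalModel, type=1) %>% head()  # mean decrease in accuracy
importance(car.rf.train$finalModel, type=2) %>% head()  # mean decrease in Gini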

# importance() returns a row for every predictor, which is too long to
# print here; uncomment the next line to see the full table.
# importance(car.rf.train$finalModel)
varImpPlot(car.rf.train$finalModel)
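
caret also wraps these measures in varImp(), which rescales importance to 0-100 by default and has its own plot method; a minimal sketch:

car.imp <- varImp(car.rf.train)
plot(car.imp, top=10)  # the ten most important predictors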