Random forests are an ensemble learning method for classification (and regression) that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees. – Wikipedia
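The "mode of the classes" rule is just a majority vote across the trees. Here is a toy base-R sketch of that aggregation step (an illustration, not the randomForest internals), where each column holds one tree's predicted class:

```r
# Each row is an observation, each column one tree's predicted class
votes <- matrix(c("Small", "Small", "Sporty",
                  "Van",   "Van",   "Van"),
                nrow = 2, byrow = TRUE)
# Majority vote per observation (which.max breaks ties toward the first name)
apply(votes, 1, function(v) names(which.max(table(v))))
## [1] "Small" "Van"
```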
Check the randomForest manual for options and available tools.
library(caret)
library(rpart) # this is where the car90 data lives
library(dplyr)
car <- car90 %>%
  select(-Mileage, -Reliability, -Eng.Rev, -Gear.Ratio, -Model2, -Sratio.m) %>%
  na.omit()
set.seed(4747)
inTrain <- createDataPartition(y = car$Type, p = 0.7, list = FALSE)
car.train <- car[inTrain, ]
car.test  <- car[-inTrain, ]
# note that you can set the number of trees (here 300) in the model,
# but you can't use ntree as a tuning parameter like mtry
car.rf.train <- train(Type ~ ., data = car.train, method = "rf",
                      trControl = trainControl(method = "none"),
                      ntree = 300, tuneGrid = data.frame(mtry = 5),
                      importance = TRUE)  # needed for variable importance below
print(car.rf.train$finalModel)
##
## Call:
## randomForest(x = x, y = y, ntree = 300, mtry = param$mtry, importance = TRUE)
## Type of random forest: classification
## Number of trees: 300
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 12.7%
## Confusion matrix:
## Compact Large Medium Small Sporty Van class.error
## Compact 11 0 2 1 0 0 0.214
## Large 0 4 1 0 0 0 0.200
## Medium 2 0 17 0 0 0 0.105
## Small 0 0 0 14 0 0 0.000
## Sporty 1 0 0 1 10 0 0.167
## Van 1 0 0 0 0 6 0.143
predictions <- predict(car.rf.train, car.test)
confusionMatrix(car.test$Type, predictions)$overall[1]
## Accuracy
## 0.815
confusionMatrix(car.test$Type, predictions)$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 8.15e-01 7.67e-01 6.19e-01 9.37e-01 3.70e-01
## AccuracyPValue McnemarPValue
## 2.95e-06 NaN
confusionMatrix(car.test$Type, predictions)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Compact Large Medium Small Sporty Van
## Compact 5 0 0 0 0 0
## Large 0 0 2 0 0 0
## Medium 0 0 7 0 0 0
## Small 0 0 0 4 1 0
## Sporty 1 0 1 0 3 0
## Van 0 0 0 0 0 3
##
## Overall Statistics
##
## Accuracy : 0.815
## 95% CI : (0.619, 0.937)
## No Information Rate : 0.37
## P-Value [Acc > NIR] : 2.95e-06
##
## Kappa : 0.767
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Compact Class: Large Class: Medium
## Sensitivity 0.833 NA 0.700
## Specificity 1.000 0.9259 1.000
## Pos Pred Value 1.000 NA 1.000
## Neg Pred Value 0.955 NA 0.850
## Prevalence 0.222 0.0000 0.370
## Detection Rate 0.185 0.0000 0.259
## Detection Prevalence 0.185 0.0741 0.259
## Balanced Accuracy 0.917 NA 0.850
## Class: Small Class: Sporty Class: Van
## Sensitivity 1.000 0.750 1.000
## Specificity 0.957 0.913 1.000
## Pos Pred Value 0.800 0.600 1.000
## Neg Pred Value 1.000 0.955 1.000
## Prevalence 0.148 0.148 0.111
## Detection Rate 0.148 0.111 0.111
## Detection Prevalence 0.185 0.185 0.111
## Balanced Accuracy 0.978 0.832 1.000
set.seed(4747)
# note that you can set the number of trees (here 300) in the model,
# but you can't use ntree as a tuning parameter like mtry
car.rfm.train <- train(Type ~ ., data = car.train, method = "rf",
                       trControl = trainControl(method = "oob"),
                       ntree = 300, tuneGrid = data.frame(mtry = 1:20),
                       importance = TRUE)  # needed for variable importance below
car.rfm.train
## Random Forest
##
## 71 samples
## 27 predictors
## 6 classes: 'Compact', 'Large', 'Medium', 'Small', 'Sporty', 'Van'
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.479 0.307
## 2 0.662 0.561
## 3 0.775 0.714
## 4 0.845 0.806
## 5 0.901 0.877
## 6 0.887 0.860
## 7 0.887 0.859
## 8 0.901 0.877
## 9 0.887 0.860
## 10 0.915 0.895
## 11 0.901 0.878
## 12 0.915 0.895
## 13 0.915 0.895
## 14 0.915 0.895
## 15 0.901 0.877
## 16 0.901 0.877
## 17 0.873 0.842
## 18 0.930 0.912
## 19 0.887 0.860
## 20 0.915 0.895
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 18.
predictions <- predict(car.rfm.train, car.test)
confusionMatrix(car.test$Type, predictions)$table
## Reference
## Prediction Compact Large Medium Small Sporty Van
## Compact 5 0 0 0 0 0
## Large 0 0 2 0 0 0
## Medium 0 0 7 0 0 0
## Small 1 0 0 4 0 0
## Sporty 1 0 1 0 3 0
## Van 0 0 0 0 0 3
We can extract a given tree or get some information about the ensemble.
car.tree2 <- getTree(car.rf.train$finalModel, k = 2, labelVar = TRUE)  # get the second tree
car.tree2 %>% head(10)
## left daughter right daughter split var split point status prediction
## 1 2 3 Tank 13.4 1 <NA>
## 2 4 5 Price 9569.0 1 <NA>
## 3 6 7 Length 187.5 1 <NA>
## 4 0 0 <NA> 0.0 -1 Small
## 5 0 0 <NA> 0.0 -1 Sporty
## 6 8 9 Disp 112.5 1 <NA>
## 7 10 11 RearShld 59.8 1 <NA>
## 8 12 13 Tank 15.2 1 <NA>
## 9 14 15 Height 58.8 1 <NA>
## 10 0 0 <NA> 0.0 -1 Medium
treesize(car.rf.train$finalModel) %>% head(15)  # number of terminal nodes in each of the first 15 trees
## [1] 13 15 16 17 19 12 17 14 19 19 13 13 16 19 14
hist(treesize(car.rf.train$finalModel))
We can also tune the structure, i.e., find the best hyperparameters of the method, via a grid search using cross-validation (CV):
set.seed(4747)
car.rf.traincv <- train(Type ~ ., data = car.train, method = "rf",
                        trControl = trainControl(method = "cv", number = 10),
                        tuneGrid = data.frame(mtry = 1:20))
car.rf.traincv # CV model tells us to use mtry=10
## Random Forest
##
## 71 samples
## 27 predictors
## 6 classes: 'Compact', 'Large', 'Medium', 'Small', 'Sporty', 'Van'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 64, 63, 64, 64, 63, 65, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.498 0.336
## 2 0.734 0.655
## 3 0.845 0.801
## 4 0.857 0.818
## 5 0.898 0.870
## 6 0.884 0.852
## 7 0.884 0.852
## 8 0.884 0.852
## 9 0.898 0.871
## 10 0.927 0.907
## 11 0.873 0.839
## 12 0.884 0.852
## 13 0.898 0.871
## 14 0.915 0.891
## 15 0.915 0.891
## 16 0.902 0.875
## 17 0.888 0.855
## 18 0.871 0.836
## 19 0.890 0.859
## 20 0.871 0.836
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 10.
car.rf.traincv$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 10
##
## OOB estimate of error rate: 8.45%
## Confusion matrix:
## Compact Large Medium Small Sporty Van class.error
## Compact 13 0 0 1 0 0 0.0714
## Large 0 4 1 0 0 0 0.2000
## Medium 2 0 17 0 0 0 0.1053
## Small 0 0 0 14 0 0 0.0000
## Sporty 1 0 0 0 11 0 0.0833
## Van 1 0 0 0 0 6 0.1429
predictions <- predict(car.rf.traincv, car.test)
confusionMatrix(car.test$Type, predictions)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Compact Large Medium Small Sporty Van
## Compact 5 0 0 0 0 0
## Large 0 0 2 0 0 0
## Medium 0 0 7 0 0 0
## Small 1 0 0 4 0 0
## Sporty 1 0 1 0 3 0
## Van 0 0 0 0 0 3
##
## Overall Statistics
##
## Accuracy : 0.815
## 95% CI : (0.619, 0.937)
## No Information Rate : 0.37
## P-Value [Acc > NIR] : 2.95e-06
##
## Kappa : 0.767
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Compact Class: Large Class: Medium
## Sensitivity 0.714 NA 0.700
## Specificity 1.000 0.9259 1.000
## Pos Pred Value 1.000 NA 1.000
## Neg Pred Value 0.909 NA 0.850
## Prevalence 0.259 0.0000 0.370
## Detection Rate 0.185 0.0000 0.259
## Detection Prevalence 0.185 0.0741 0.259
## Balanced Accuracy 0.857 NA 0.850
## Class: Small Class: Sporty Class: Van
## Sensitivity 1.000 1.000 1.000
## Specificity 0.957 0.917 1.000
## Pos Pred Value 0.800 0.600 1.000
## Neg Pred Value 1.000 1.000 1.000
## Prevalence 0.148 0.111 0.111
## Detection Rate 0.148 0.111 0.111
## Detection Prevalence 0.185 0.185 0.111
## Balanced Accuracy 0.978 0.958 1.000
From the randomForest package documentation:
Here are the definitions of the variable importance measures. The first measure is computed from permuting OOB data: for each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The differences between the two are then averaged over all trees and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).
The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.
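As a concrete toy illustration of both measures in base R (not the randomForest implementation): the first block permutes one predictor of a plain linear model on mtcars and records the increase in MSE; the second computes the Gini decrease for a single hypothetical split.

```r
set.seed(4747)

# Permutation-style importance (sketch): increase in MSE after shuffling wt
fit      <- lm(mpg ~ wt + hp, data = mtcars)
baseline <- mean((mtcars$mpg - predict(fit, mtcars))^2)
perm     <- mtcars
perm$wt  <- sample(perm$wt)          # break the wt-mpg association
permuted <- mean((mtcars$mpg - predict(fit, perm))^2)
permuted - baseline                  # importance of wt: increase in error

# Gini decrease (toy): impurity of a parent node minus the weighted
# impurity of its two children after a hypothetical split
gini   <- function(p) 1 - sum(p^2)
parent <- gini(c(0.5, 0.5))                                   # 0.5
kids   <- 0.5 * gini(c(0.9, 0.1)) + 0.5 * gini(c(0.1, 0.9))   # 0.18
parent - kids                                                 # 0.32
```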
# The importance table is long (the factor predictors are expanded into many
# dummy variables), so it is not printed here. Uncomment the following line
# and run it to see the full listing.
# importance(car.rf.train$finalModel)
varImpPlot(car.rf.train$finalModel)