Jo Hardin
October 25, 2017
Prediction trees are used to predict a response or class \(Y\) from inputs \(X_1, X_2, \ldots, X_n\). If the response is continuous, the tree is called a regression tree; if it is categorical, it is called a classification tree. At each node of the tree, we check the value of one of the inputs \(X_i\), and depending on the (binary) answer we continue down the left or the right sub-branch. When we reach a leaf we find the prediction, usually a simple statistic of the observations in that leaf, such as the most common class (classification) or the mean response (regression).
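As a minimal illustration, the following fits one tree of each kind on built-in R datasets (iris and mtcars, used here only as toy examples):

library(rpart)
# categorical response -> classification tree; each leaf predicts the most common class
class.tree <- rpart(Species ~ ., data = iris)
# continuous response -> regression tree; each leaf predicts the mean response
reg.tree <- rpart(mpg ~ wt + hp, data = mtcars)
class.tree
reg.tree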
library(caret)
library(rpart)
library(rpart.plot)
library(tree)
real.estate <- read.table("http://pages.pomona.edu/~jsh04747/courses/math154/CA_housedata.txt", header=TRUE)
set.seed(4747)
# method="none" in trainControl fits a single model with no resampling
fitControl <- trainControl(method="none")
# method="rpart2" tunes over tree depth; here maxdepth is fixed at 5
tr.house <- train(log(MedianHouseValue) ~ Longitude + Latitude, data=real.estate, method="rpart2",
                  trControl = fitControl, tuneGrid = data.frame(maxdepth=5))
rpart.plot(tr.house$finalModel)
Compare the predictions with the dataset (darker is more expensive); the partition seems to capture the global price trend. Note that this plot uses the tree model (instead of the rpart2 model) because the two implementations optimize the splits differently.
library(tree)
# fit the same model with the tree package
tree.model <- tree(log(MedianHouseValue) ~ Longitude + Latitude, data=real.estate)
# cut the prices into deciles to drive the grey color scale
price.deciles <- quantile(real.estate$MedianHouseValue, 0:10/10)
cut.prices <- cut(real.estate$MedianHouseValue, price.deciles, include.lowest=TRUE)
# ten grey shades for the ten deciles, darkest for the most expensive
plot(real.estate$Longitude, real.estate$Latitude, col=grey(10:1/11)[cut.prices], pch=20,
     xlab="Longitude", ylab="Latitude")
# overlay the tree's rectangular partition on the map
tree::partition.tree(tree.model, ordvars=c("Longitude","Latitude"), add=TRUE)
12) Latitude>=34.7 2844 645.0 11.5
The node that splits at Latitude >= 34.7 contains 2844 houses. The value 645 is the "deviance," the sum of squared deviations from the node mean. The predicted value, 11.5, is the average of the (log) responses in that node. It is not a terminal node (no asterisk).
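These numbers can be recomputed directly from the data. The hand check below follows the path to node 12 in the full printout farther down (Latitude < 38.5, then Longitude >= -122, then Latitude >= 34.7); since the printed cutpoints are rounded, the check is only approximate.

node12 <- with(real.estate, Latitude < 38.5 & Longitude >= -122 & Latitude >= 34.7)
y12 <- log(real.estate$MedianHouseValue[node12])
length(y12)              # n, approximately 2844
sum((y12 - mean(y12))^2) # deviance, approximately 645
mean(y12)                # yval, approximately 11.5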
# tr.house was fit above; print the text representation of the tree
tr.house$finalModel
## n= 20640
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 20640 6690.0 12.1
## 2) Latitude>=38.5 2061 383.0 11.6
## 4) Latitude>=39.4 674 65.5 11.3 *
## 5) Latitude< 39.4 1387 240.0 11.7 *
## 3) Latitude< 38.5 18579 5750.0 12.1
## 6) Longitude>=-122 13941 4400.0 12.1
## 12) Latitude>=34.7 2844 645.0 11.5
## 24) Longitude>=-120 1460 212.0 11.3 *
## 25) Longitude< -120 1384 276.0 11.8 *
## 13) Latitude< 34.7 11097 2690.0 12.2
## 26) Longitude>=-118 8384 1820.0 12.1
## 52) Longitude>=-118 2839 692.0 11.9 *
## 53) Longitude< -118 5545 942.0 12.2 *
## 27) Longitude< -118 2713 465.0 12.5 *
## 7) Longitude< -122 4638 961.0 12.4
## 14) Latitude>=37.9 1063 178.0 12.1 *
## 15) Latitude< 37.9 3575 662.0 12.5 *
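We can also trace a single prediction by hand (the coordinates below are hypothetical, chosen for this example): a house at Longitude -118.5, Latitude 34.0 follows nodes 1 -> 3 -> 6 -> 13 -> 27, so the fitted value should be the yval of leaf 27, about 12.5.

newhouse <- data.frame(Longitude = -118.5, Latitude = 34.0)
predict(tr.house, newhouse)      # log median house value, about 12.5
exp(predict(tr.house, newhouse)) # back on the dollar scale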
Now include all the variables, not just latitude and longitude:
set.seed(4747)
fitControl <- trainControl(method="none")
tr.full.house <- train(log(MedianHouseValue) ~ ., data=real.estate, method="rpart2",
trControl = fitControl, tuneGrid= data.frame(maxdepth=5))
tr.full.house$finalModel
## n= 20640
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 20640 6690.0 12.1
## 2) MedianIncome< 3.55 10381 2660.0 11.8
## 4) MedianIncome< 2.51 4842 1190.0 11.6
## 8) Latitude>=34.5 2520 558.0 11.4
## 16) Longitude>=-120 728 77.1 11.1 *
## 17) Longitude< -120 1792 386.0 11.5
## 34) Latitude>=37.9 1103 150.0 11.4 *
## 35) Latitude< 37.9 689 168.0 11.8 *
## 9) Latitude< 34.5 2322 450.0 11.8
## 18) Longitude>=-118 878 144.0 11.5 *
## 19) Longitude< -118 1444 215.0 11.9 *
## 5) MedianIncome>=2.51 5539 1120.0 11.9
## 10) Latitude>=37.9 1104 124.0 11.7 *
## 11) Latitude< 37.9 4435 902.0 12.0
## 22) Longitude>=-122 4084 771.0 12.0
## 44) Latitude>=34.5 1270 285.0 11.8 *
## 45) Latitude< 34.5 2814 411.0 12.1 *
## 23) Longitude< -122 351 47.7 12.5 *
## 3) MedianIncome>=3.55 10259 1970.0 12.4
## 6) MedianIncome< 5.59 7265 1160.0 12.3
## 12) MedianHouseAge< 38.5 5907 859.0 12.2 *
## 13) MedianHouseAge>=38.5 1358 218.0 12.5 *
## 7) MedianIncome>=5.59 2994 299.0 12.8
## 14) MedianIncome< 7.39 2008 176.0 12.6 *
## 15) MedianIncome>=7.39 986 49.2 13.0 *
rpart.plot(tr.full.house$finalModel)
It turns out that the tree does "better" by being more complex. Why is that? Cross-validation (below) picks maxdepth = 14, the depth with the lowest RMSE (equivalently, the lowest deviance).
# cross-validate the tree-package fit from earlier (cv.tree is from the
# tree package) to see how deviance drops as the tree grows
cv.model <- cv.tree(tree.model)
plot(cv.model$size, cv.model$dev, type="l", xlab="size", ylab="deviance")
# here, let's use all the variables and all the samples
set.seed(4747)
fitControl <- trainControl(method="cv")
# note: parms=list(split="gini") is only meaningful for classification
# trees; with a continuous response, rpart uses anova splitting
tree.cv.house <- train(log(MedianHouseValue) ~ ., data=real.estate, method="rpart2",
                       trControl=fitControl, tuneGrid=data.frame(maxdepth=1:20),
                       parms=list(split="gini"))
tree.cv.house
## CART
##
## 20640 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 18577, 18575, 18577, 18575, 18575, 18576, ...
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared
## 1 0.475 0.305
## 2 0.447 0.382
## 3 0.429 0.433
## 4 0.418 0.459
## 5 0.401 0.503
## 6 0.398 0.510
## 7 0.394 0.520
## 8 0.393 0.523
## 9 0.388 0.535
## 10 0.384 0.545
## 11 0.379 0.557
## 12 0.374 0.569
## 13 0.371 0.576
## 14 0.370 0.577
## 15 0.370 0.577
## 16 0.370 0.577
## 17 0.370 0.577
## 18 0.370 0.577
## 19 0.370 0.577
## 20 0.370 0.577
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 14.
rpart.plot(tree.cv.house$finalModel)
plot(tree.cv.house)
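The chosen depth and its cross-validated error can also be pulled directly from the train object:

tree.cv.house$bestTune # maxdepth = 14
subset(tree.cv.house$results, maxdepth == tree.cv.house$bestTune$maxdepth)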
# first create two datasets: one training, one test
# (seed set before partitioning so the split is reproducible)
set.seed(4747)
inTrain <- createDataPartition(y = real.estate$MedianHouseValue, p=.8, list=FALSE)
house.train <- real.estate[inTrain,]
house.test <- real.estate[-inTrain,]
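A quick sanity check confirms the roughly 80/20 split; note that createDataPartition stratifies on quantiles of the response, so the two pieces have similar price distributions.

nrow(house.train) / nrow(real.estate) # about 0.80
nrow(house.test) / nrow(real.estate)  # about 0.20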
# then use CV on the training data to find the best maxdepth
set.seed(4747)
fitControl <- trainControl(method="cv")
tree.cvtrain.house <- train(log(MedianHouseValue) ~ ., data=house.train, method="rpart2",
trControl=fitControl, tuneGrid=data.frame(maxdepth=1:20),
parms=list(split="gini"))
tree.cvtrain.house
## CART
##
## 16513 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 14862, 14863, 14862, 14861, 14862, 14861, ...
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared
## 1 0.477 0.302
## 2 0.450 0.379
## 3 0.430 0.433
## 4 0.420 0.458
## 5 0.407 0.490
## 6 0.397 0.515
## 7 0.395 0.519
## 8 0.395 0.519
## 9 0.390 0.533
## 10 0.385 0.543
## 11 0.380 0.556
## 12 0.375 0.567
## 13 0.371 0.578
## 14 0.370 0.579
## 15 0.370 0.579
## 16 0.370 0.579
## 17 0.370 0.579
## 18 0.370 0.579
## 19 0.370 0.579
## 20 0.370 0.579
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 14.
tree.train.house <- train(log(MedianHouseValue) ~ ., data=house.train, method="rpart2",
trControl=trainControl(method="none"), tuneGrid=data.frame(maxdepth=14),
parms=list(split="gini"))
# for classification results, use confusionMatrix instead of postResample
test.pred <- predict(tree.train.house, house.test)
postResample(pred = test.pred, obs=log(house.test$MedianHouseValue))
## RMSE Rsquared
## 0.365 0.581
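The test RMSE (0.365) is close to the cross-validated RMSE (0.370), so the CV estimate was honest. As a rough rule of thumb, an RMSE on the log scale translates into a typical multiplicative error on the original dollar scale:

exp(0.365) # about 1.44: a typical prediction is off by roughly 40-45% of the house value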
A few related packages:
- rpart is faster than tree.
- party gives great plotting options (a small sketch follows this list).
- maptree also gives trees from hierarchical clustering.
- randomForest: up next!
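As a quick taste of party (a sketch, assuming the package is installed; the model choice here is just for illustration):

library(party)
ct.house <- ctree(log(MedianHouseValue) ~ Longitude + Latitude, data=real.estate)
plot(ct.house) # party's plot method draws the tree with per-leaf summaries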
Reference: slides built from http://www.stat.cmu.edu/~cshalizi/350/lectures/22/lecture-22.pdf