Jo Hardin
October 25, 2017
Prediction trees are used to predict a response or class \(Y\) from inputs \(X_1, X_2, \ldots, X_n\). If the response is continuous, the tree is called a regression tree; if it is categorical, it is called a classification tree. At each node of the tree, we check the value of one of the inputs \(X_i\), and depending on the (binary) answer we continue down the left or the right sub-branch. When we reach a leaf we find the prediction, usually a simple statistic of the observations in that leaf, such as the most common class (classification) or the mean response (regression).
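As a minimal illustration, the following fits one tree of each kind on built-in R datasets (iris and mtcars, used here only as toy examples):

library(rpart)
# categorical response -> classification tree; each leaf predicts the most common class
class.tree <- rpart(Species ~ ., data = iris)
# continuous response -> regression tree; each leaf predicts the mean response
reg.tree <- rpart(mpg ~ wt + hp, data = mtcars)
class.tree
reg.tree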
library(caret)
library(rpart)
library(rpart.plot)
library(tree)
real.estate <- read.table("http://pages.pomona.edu/~jsh04747/courses/math154/CA_housedata.txt", header=TRUE)
set.seed(4747)
# method="none" in trainControl fits a single model with no resampling
fitControl <- trainControl(method="none")
# method="rpart2" tunes over tree depth; here maxdepth is fixed at 5
tr.house <- train(log(MedianHouseValue) ~ Longitude + Latitude, data=real.estate, method="rpart2",
                  trControl = fitControl, tuneGrid = data.frame(maxdepth=5))
rpart.plot(tr.house$finalModel)
Compare the predictions with the dataset (darker is more expensive); the partition seems to capture the global price trend. Note that this plot uses the tree model (instead of the rpart2 model) because the two implementations optimize the splits differently.
library(tree)
# fit the same model with the tree package
tree.model <- tree(log(MedianHouseValue) ~ Longitude + Latitude, data=real.estate)
# cut the prices into deciles to drive the grey color scale
price.deciles <- quantile(real.estate$MedianHouseValue, 0:10/10)
cut.prices <- cut(real.estate$MedianHouseValue, price.deciles, include.lowest=TRUE)
# ten grey shades for the ten deciles, darkest for the most expensive
plot(real.estate$Longitude, real.estate$Latitude, col=grey(10:1/11)[cut.prices], pch=20,
     xlab="Longitude", ylab="Latitude")
# overlay the tree's rectangular partition on the map
tree::partition.tree(tree.model, ordvars=c("Longitude","Latitude"), add=TRUE)
12) Latitude>=34.7 2844 645.0 11.5
The node that splits at Latitude >= 34.7 contains 2844 houses. The value 645 is the "deviance," the sum of squared deviations from the node mean. The predicted value, 11.5, is the average of the (log) responses in that node. It is not a terminal node (no asterisk).
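These numbers can be recomputed directly from the data. The hand check below follows the path to node 12 in the full printout farther down (Latitude < 38.5, then Longitude >= -122, then Latitude >= 34.7); since the printed cutpoints are rounded, the check is only approximate.

node12 <- with(real.estate, Latitude < 38.5 & Longitude >= -122 & Latitude >= 34.7)
y12 <- log(real.estate$MedianHouseValue[node12])
length(y12)              # n, approximately 2844
sum((y12 - mean(y12))^2) # deviance, approximately 645
mean(y12)                # yval, approximately 11.5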
# tr.house was fit above; print the text representation of the tree
tr.house$finalModel
## n= 20640
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 20640 6690.0 12.1
## 2) Latitude>=38.5 2061 383.0 11.6
## 4) Latitude>=39.4 674 65.5 11.3 *
## 5) Latitude< 39.4 1387 240.0 11.7 *
## 3) Latitude< 38.5 18579 5750.0 12.1
## 6) Longitude>=-122 13941 4400.0 12.1
## 12) Latitude>=34.7 2844 645.0 11.5
## 24) Longitude>=-120 1460 212.0 11.3 *
## 25) Longitude< -120 1384 276.0 11.8 *
## 13) Latitude< 34.7 11097 2690.0 12.2
## 26) Longitude>=-118 8384 1820.0 12.1
## 52) Longitude>=-118 2839 692.0 11.9 *
## 53) Longitude< -118 5545 942.0 12.2 *
## 27) Longitude< -118 2713 465.0 12.5 *
## 7) Longitude< -122 4638 961.0 12.4
## 14) Latitude>=37.9 1063 178.0 12.1 *
## 15) Latitude< 37.9 3575 662.0 12.5 *
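We can also trace a single prediction by hand (the coordinates below are hypothetical, chosen for this example): a house at Longitude -118.5, Latitude 34.0 follows nodes 1 -> 3 -> 6 -> 13 -> 27, so the fitted value should be the yval of leaf 27, about 12.5.

newhouse <- data.frame(Longitude = -118.5, Latitude = 34.0)
predict(tr.house, newhouse)      # log median house value, about 12.5
exp(predict(tr.house, newhouse)) # back on the dollar scale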
Now include all the variables, not just latitude and longitude:
set.seed(4747)
fitControl <- trainControl(method="none")
tr.full.house <- train(log(MedianHouseValue) ~ ., data=real.estate, method="rpart2",
trControl = fitControl, tuneGrid= data.frame(maxdepth=5))
tr.full.house$finalModel
## n= 20640
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 20640 6690.0 12.1
## 2) MedianIncome< 3.55 10381 2660.0 11.8
## 4) MedianIncome< 2.51 4842 1190.0 11.6
## 8) Latitude>=34.5 2520 558.0 11.4
## 16) Longitude>=-120 728 77.1 11.1 *
## 17) Longitude< -120 1792 386.0 11.5
## 34) Latitude>=37.9 1103 150.0 11.4 *
## 35) Latitude< 37.9 689 168.0 11.8 *
## 9) Latitude< 34.5 2322 450.0 11.8
## 18) Longitude>=-118 878 144.0 11.5 *
## 19) Longitude< -118 1444 215.0 11.9 *
## 5) MedianIncome>=2.51 5539 1120.0 11.9
## 10) Latitude>=37.9 1104 124.0 11.7 *
## 11) Latitude< 37.9 4435 902.0 12.0
## 22) Longitude>=-122 4084 771.0 12.0
## 44) Latitude>=34.5 1270 285.0 11.8 *
## 45) Latitude< 34.5 2814 411.0 12.1 *
## 23) Longitude< -122 351 47.7 12.5 *
## 3) MedianIncome>=3.55 10259 1970.0 12.4
## 6) MedianIncome< 5.59 7265 1160.0 12.3
## 12) MedianHouseAge< 38.5 5907 859.0 12.2 *
## 13) MedianHouseAge>=38.5 1358 218.0 12.5 *
## 7) MedianIncome>=5.59 2994 299.0 12.8
## 14) MedianIncome< 7.39 2008 176.0 12.6 *
## 15) MedianIncome>=7.39 986 49.2 13.0 *
rpart.plot(tr.full.house$finalModel)
It turns out that the tree does "better" by being more complex. Why is that? Cross-validation (below) picks maxdepth = 14, the depth with the lowest RMSE (equivalently, the lowest deviance).
# cross-validate the tree-package fit from earlier (cv.tree is from the
# tree package) to see how deviance drops as the tree grows
cv.model <- cv.tree(tree.model)
plot(cv.model$size, cv.model$dev, type="l", xlab="size", ylab="deviance")
# here, let's use all the variables and all the samples
set.seed(4747)
fitControl <- trainControl(method="cv")
# note: parms=list(split="gini") is only meaningful for classification
# trees; with a continuous response, rpart uses anova splitting
tree.cv.house <- train(log(MedianHouseValue) ~ ., data=real.estate, method="rpart2",
                       trControl=fitControl, tuneGrid=data.frame(maxdepth=1:20),
                       parms=list(split="gini"))
tree.cv.house
## CART
##
## 20640 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 18577, 18575, 18577, 18575, 18575, 18576, ...
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared
## 1 0.475 0.305
## 2 0.447 0.382
## 3 0.429 0.433
## 4 0.418 0.459
## 5 0.401 0.503
## 6 0.398 0.510
## 7 0.394 0.520
## 8 0.393 0.523
## 9 0.388 0.535
## 10 0.384 0.545
## 11 0.379 0.557
## 12 0.374 0.569
## 13 0.371 0.576
## 14 0.370 0.577
## 15 0.370 0.577
## 16 0.370 0.577
## 17 0.370 0.577
## 18 0.370 0.577
## 19 0.370 0.577
## 20 0.370 0.577
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 14.
rpart.plot(tree.cv.house$finalModel)
plot(tree.cv.house)
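The chosen depth and its cross-validated error can also be pulled directly from the train object:

tree.cv.house$bestTune # maxdepth = 14
subset(tree.cv.house$results, maxdepth == tree.cv.house$bestTune$maxdepth)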
# first create two datasets: one training, one test
# (seed set before partitioning so the split is reproducible)
set.seed(4747)
inTrain <- createDataPartition(y = real.estate$MedianHouseValue, p=.8, list=FALSE)
house.train <- real.estate[inTrain,]
house.test <- real.estate[-inTrain,]
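A quick sanity check confirms the roughly 80/20 split; note that createDataPartition stratifies on quantiles of the response, so the two pieces have similar price distributions.

nrow(house.train) / nrow(real.estate) # about 0.80
nrow(house.test) / nrow(real.estate)  # about 0.20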
# then use CV on the training data to find the best maxdepth
set.seed(4747)
fitControl <- trainControl(method="cv")
tree.cvtrain.house <- train(log(MedianHouseValue) ~ ., data=house.train, method="rpart2",
trControl=fitControl, tuneGrid=data.frame(maxdepth=1:20),
parms=list(split="gini"))
tree.cvtrain.house
## CART
##
## 16513 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 14862, 14863, 14862, 14861, 14862, 14861, ...
## Resampling results across tuning parameters:
##
## maxdepth RMSE Rsquared
## 1 0.477 0.302
## 2 0.450 0.379
## 3 0.430 0.433
## 4 0.420 0.458
## 5 0.407 0.490
## 6 0.397 0.515
## 7 0.395 0.519
## 8 0.395 0.519
## 9 0.390 0.533
## 10 0.385 0.543
## 11 0.380 0.556
## 12 0.375 0.567
## 13 0.371 0.578
## 14 0.370 0.579
## 15 0.370 0.579
## 16 0.370 0.579
## 17 0.370 0.579
## 18 0.370 0.579
## 19 0.370 0.579
## 20 0.370 0.579
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 14.
tree.train.house <- train(log(MedianHouseValue) ~ ., data=house.train, method="rpart2",
trControl=trainControl(method="none"), tuneGrid=data.frame(maxdepth=14),
parms=list(split="gini"))
# for classification results, use confusionMatrix instead of postResample
test.pred <- predict(tree.train.house, house.test)
postResample(pred = test.pred, obs=log(house.test$MedianHouseValue))
## RMSE Rsquared
## 0.365 0.581
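The test RMSE (0.365) is close to the cross-validated RMSE (0.370), so the CV estimate was honest. As a rough rule of thumb, an RMSE on the log scale translates into a typical multiplicative error on the original dollar scale:

exp(0.365) # about 1.44: a typical prediction is off by roughly 40-45% of the house value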
A few related packages:
- rpart is faster than tree.
- party gives great plotting options (a small sketch follows this list).
- maptree also gives trees from hierarchical clustering.
- randomForest: up next!
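As a quick taste of party (a sketch, assuming the package is installed; the model choice here is just for illustration):

library(party)
ct.house <- ctree(log(MedianHouseValue) ~ Longitude + Latitude, data=real.estate)
plot(ct.house) # party's plot method draws the tree with per-leaf summaries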
Reference: slides built from http://www.stat.cmu.edu/~cshalizi/350/lectures/22/lecture-22.pdf