Math 58B - Introduction to Biostatistics
Jo Hardin
house = read.table("http://www.rossmanchance.com/iscam2/data/housing.txt", header = T,
    sep = "\t")
attach(house)
names(house)
## [1] "sqft" "price" "City" "bedrooms" "baths"
1. Understand how to run a regression with multiple variables.
2. Understand how to check residual plots for normal assumptions.
3. Understand how to find the “best” model (with the knowledge that we don't usually have a definition of “best”).
# panel function for pairs(): writes the correlation between each pair of
# variables in the lower panels of the scatterplot matrix
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
    usr <- par("usr")
    on.exit(par(usr))  # restore the plotting coordinates when done
    par(usr = c(0, 1, 0, 1))
    r <- cor(x, y)
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    text(0.5, 0.5, txt)
}
pairs(cbind(price, sqft, bedrooms, baths), lower.panel = panel.cor, pch = 18)
summary(lm(price ~ sqft))
##
## Call:
## lm(formula = price ~ sqft)
##
## Residuals:
## Min 1Q Median 3Q Max
## -439654 -144256 -52040 97373 636508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65930.3 60993.6 1.08 0.28
## sqft 202.4 26.4 7.67 3.3e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 222000 on 81 degrees of freedom
## Multiple R-squared: 0.421, Adjusted R-squared: 0.414
## F-statistic: 58.8 on 1 and 81 DF, p-value: 3.35e-11
summary(lm(price ~ bedrooms))
##
## Call:
## lm(formula = price ~ bedrooms)
##
## Residuals:
## Min 1Q Median 3Q Max
## -454935 -206553 -76206 190930 798794
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 220612 107208 2.06 0.0428 *
## bedrooms 76865 28802 2.67 0.0092 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 280000 on 81 degrees of freedom
## Multiple R-squared: 0.0808, Adjusted R-squared: 0.0695
## F-statistic: 7.12 on 1 and 81 DF, p-value: 0.00919
In separate simple linear regressions, both explanatory variables (sqft and bedrooms) have significant slopes. Sqft appears to be the stronger predictor, and indeed, the first model has a higher \( R^2 \) - that is, a higher proportion of the variability in price is explained by sqft (42.07%) than by number of bedrooms (8.08%).
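The quoted \( R^2 \) values can be pulled straight out of the model summaries (a small aside, not in the original handout):
summary(lm(price ~ sqft))$r.squared
summary(lm(price ~ bedrooms))$r.squared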
However, it is important to ask whether either relationship actually fits the technical conditions of the linear regression model. We can see from the pairs plot that the relationships look Linear, and we'll assume the observations were collected Independently, but Normality and the Equal-variance error structure we check using residual plots.
par(mfrow = c(1, 2))
plot(lm(price ~ sqft)$fitted, lm(price ~ sqft)$resid, xlab = "fitted w sqft",
    ylab = "resid", pch = 18)
abline(h = 0)
plot(lm(price ~ bedrooms)$fitted, lm(price ~ bedrooms)$resid, xlab = "fitted w bed",
    ylab = "resid", pch = 18)
abline(h = 0)
In both plots the residuals are right-skewed (the positive residuals are more spread out), and the variability of the residuals increases with the fitted values. A natural log transformation of the response should help with both of these problems.
par(mfrow = c(1, 2))
plot(lm(log(price) ~ sqft)$fitted, lm(log(price) ~ sqft)$resid, xlab = "fitted w sqft",
    ylab = "resid for ln", pch = 18)
abline(h = 0)
plot(lm(log(price) ~ bedrooms)$fitted, lm(log(price) ~ bedrooms)$resid, xlab = "fitted w bed",
    ylab = "resid for ln", pch = 18)
abline(h = 0)
Though no residual plot will ever look perfect, these residual plots seem to fit the technical conditions of the model better than the untransformed data.
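A normal quantile plot of the residuals gives a more direct check of the Normality condition; a minimal sketch, not part of the original handout:
# normal quantile plot of the residuals from the transformed model
qqnorm(lm(log(price) ~ sqft)$resid, pch = 18)
qqline(lm(log(price) ~ sqft)$resid)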
We'll stick with the transformed data. What happens when we try to predict price (log(price), here) using BOTH sqft and bedrooms?
summary(lm(log(price) ~ sqft + bedrooms))
##
## Call:
## lm(formula = log(price) ~ sqft + bedrooms)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.081 -0.278 -0.053 0.268 1.221
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.164943 0.173558 70.09 < 2e-16 ***
## sqft 0.000468 0.000066 7.09 4.7e-10 ***
## bedrooms -0.060293 0.057202 -1.05 0.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.45 on 80 degrees of freedom
## Multiple R-squared: 0.448, Adjusted R-squared: 0.435
## F-statistic: 32.5 on 2 and 80 DF, p-value: 4.62e-11
Although the \( R^2 \) value went up (44.84% of variability in log price is explained by sqft and bedrooms), the p-value on bedrooms isn't significant. The p-value here can be interpreted as a hypothesis test on the slope coefficient given the other variables in the model.
0.295 = P(a slope of -0.06029 or more extreme if sqft is in the model and there is no relationship between bedrooms and log(price))
Our output says that once we have sqft in the model, we don't actually need to know anything about the number of bedrooms (even though bedrooms was a significant predictor on its own).
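We can see the same thing more formally with a nested-model F test (a sketch, not part of the original output), comparing the model with only sqft to the model with both predictors:
# nested-model comparison: does adding bedrooms improve the sqft-only model?
anova(lm(log(price) ~ sqft), lm(log(price) ~ sqft + bedrooms))
Because only one coefficient is added, the F-test p-value matches the t-test p-value on bedrooms (0.295).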
The final model regresses log(price) on sqft alone. Note that the coefficient on sqft and the \( R^2 \) value change slightly from the two-variable model because bedrooms has been dropped.
summary(lm(log(price) ~ sqft))
##
## Call:
## lm(formula = log(price) ~ sqft)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.090 -0.296 -0.059 0.287 1.202
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.20e+01 1.24e-01 97.36 < 2e-16 ***
## sqft 4.27e-04 5.35e-05 7.99 7.9e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.45 on 81 degrees of freedom
## Multiple R-squared: 0.441, Adjusted R-squared: 0.434
## F-statistic: 63.8 on 1 and 81 DF, p-value: 7.87e-12
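Because the response is on the log scale, the slope has a multiplicative interpretation (an aside, not in the original handout): each extra square foot multiplies the predicted price by exp(0.000427). For example,
exp(100 * 4.27e-04)  # about 1.044: an extra 100 sqft raises the predicted price by roughly 4.4%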
As with the intervals we built from a single sample, we can now create intervals either for an average (a confidence interval) or for an individual (a prediction interval).
predict(lm(log(price) ~ sqft), newdata = data.frame(sqft = 2000), interval = "confidence")
## fit lwr upr
## 1 12.89 12.79 12.99
I am 95% confident that the true average log price for a 2000 sqft home is between 12.79 log$ and 12.99 log$. Back transforming can be a little tricky: exponentiating the endpoints gives an interval for the median price, not the mean, because the average of the logs is not the log of the average.
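For instance, exponentiating the rounded endpoints above gives approximate bounds for the median price:
exp(c(12.79, 12.99))  # roughly $359,000 to $438,000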
predict(lm(log(price) ~ sqft), newdata = data.frame(sqft = 2000), interval = "prediction")
## fit lwr upr
## 1 12.89 11.99 13.79
I am 95% confident that an individual 2000 sqft home has a log price between 11.99 log$ and 13.79 log$. Now back transforming is easy (because there are no averages), so 95% of homes with 2000 sqft are between $161,126 and $977,301.
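Exponentiating the predict() output directly reproduces that dollar range (up to rounding of the log-scale endpoints shown above):
exp(predict(lm(log(price) ~ sqft), newdata = data.frame(sqft = 2000), interval = "prediction"))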
# scatterplot with fitted line, 95% confidence bands, and 95% prediction bands
plot(sqft, log(price), pch = 18)
sqftlm = lm(log(price) ~ sqft)
abline(sqftlm, col = "red")
newX = seq(min(sqft), max(sqft), 1)
# pointwise confidence bands for the mean log price
prd.CI = predict(sqftlm, newdata = data.frame(sqft = newX), interval = "confidence",
    level = 0.95)
lines(newX, prd.CI[, 2], col = "blue", lty = 2)
lines(newX, prd.CI[, 3], col = "blue", lty = 2)
# pointwise prediction bands for individual log prices
prd.PI = predict(sqftlm, newdata = data.frame(sqft = newX), interval = "prediction",
    level = 0.95)
lines(newX, prd.PI[, 2], col = "green", lty = 3)
lines(newX, prd.PI[, 3], col = "green", lty = 3)
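A legend (not part of the original figure) makes the three sets of lines easier to distinguish:
legend("topleft", legend = c("fitted line", "95% CI", "95% PI"),
    col = c("red", "blue", "green"), lty = 1:3)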