Math 58B - Introduction to Biostatistics
Jo Hardin
house = read.table("http://www.rossmanchance.com/iscam2/data/housing.txt", header = T,
    sep = "\t")
attach(house)
names(house)
## [1] "sqft" "price" "City" "bedrooms" "baths"
1. Understand how to run a regression with multiple variables.
2. Understand how to check residual plots for normal assumptions.
3. Understand how to find the “best” model (with the knowledge that we don't usually have a definition of “best”).
# panel function for pairs(): writes the correlation between each pair of
# variables in the lower panels of the scatterplot matrix
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
    usr <- par("usr")
    on.exit(par(usr))  # restore the plotting coordinates when done
    par(usr = c(0, 1, 0, 1))
    r <- cor(x, y)
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    text(0.5, 0.5, txt)
}
pairs(cbind(price, sqft, bedrooms, baths), lower.panel = panel.cor, pch = 18)
summary(lm(price ~ sqft))
##
## Call:
## lm(formula = price ~ sqft)
##
## Residuals:
## Min 1Q Median 3Q Max
## -439654 -144256 -52040 97373 636508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65930.3 60993.6 1.08 0.28
## sqft 202.4 26.4 7.67 3.3e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 222000 on 81 degrees of freedom
## Multiple R-squared: 0.421, Adjusted R-squared: 0.414
## F-statistic: 58.8 on 1 and 81 DF, p-value: 3.35e-11
summary(lm(price ~ bedrooms))
##
## Call:
## lm(formula = price ~ bedrooms)
##
## Residuals:
## Min 1Q Median 3Q Max
## -454935 -206553 -76206 190930 798794
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 220612 107208 2.06 0.0428 *
## bedrooms 76865 28802 2.67 0.0092 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 280000 on 81 degrees of freedom
## Multiple R-squared: 0.0808, Adjusted R-squared: 0.0695
## F-statistic: 7.12 on 1 and 81 DF, p-value: 0.00919
In separate simple linear regressions, both explanatory variables (sqft and bedrooms) have significant slopes. Sqft appears to be the stronger predictor, and indeed, the first model has a higher \( R^2 \) - that is, a higher proportion of the variability in price is explained by sqft (42.07%) than by number of bedrooms (8.08%).
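The quoted \( R^2 \) values can be pulled straight out of the model summaries (a small aside, not in the original handout):
summary(lm(price ~ sqft))$r.squared
summary(lm(price ~ bedrooms))$r.squared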
However, it is important to ask whether either relationship actually fits the technical conditions of the linear regression model. We can see from the pairs plot that the relationships look Linear, and we'll assume the observations were collected Independently, but Normality and the Equal-variance error structure we check using residual plots.
par(mfrow = c(1, 2))
plot(lm(price ~ sqft)$fitted, lm(price ~ sqft)$resid, xlab = "fitted w sqft",
    ylab = "resid", pch = 18)
abline(h = 0)
plot(lm(price ~ bedrooms)$fitted, lm(price ~ bedrooms)$resid, xlab = "fitted w bed",
    ylab = "resid", pch = 18)
abline(h = 0)
In both plots the residuals are right-skewed (the positive residuals are more spread out), and the variability of the residuals increases with the fitted values. A natural log transformation of the response should help with both of these problems.
par(mfrow = c(1, 2))
plot(lm(log(price) ~ sqft)$fitted, lm(log(price) ~ sqft)$resid, xlab = "fitted w sqft",
    ylab = "resid for ln", pch = 18)
abline(h = 0)
plot(lm(log(price) ~ bedrooms)$fitted, lm(log(price) ~ bedrooms)$resid, xlab = "fitted w bed",
    ylab = "resid for ln", pch = 18)
abline(h = 0)
Though no residual plot will ever look perfect, these residual plots seem to fit the technical conditions of the model better than the untransformed data.
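A normal quantile plot of the residuals gives a more direct check of the Normality condition; a minimal sketch, not part of the original handout:
# normal quantile plot of the residuals from the transformed model
qqnorm(lm(log(price) ~ sqft)$resid, pch = 18)
qqline(lm(log(price) ~ sqft)$resid)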
We'll stick with the transformed data. What happens when we try to predict price (log(price), here) using BOTH sqft and bedrooms?
summary(lm(log(price) ~ sqft + bedrooms))
##
## Call:
## lm(formula = log(price) ~ sqft + bedrooms)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.081 -0.278 -0.053 0.268 1.221
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.164943 0.173558 70.09 < 2e-16 ***
## sqft 0.000468 0.000066 7.09 4.7e-10 ***
## bedrooms -0.060293 0.057202 -1.05 0.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.45 on 80 degrees of freedom
## Multiple R-squared: 0.448, Adjusted R-squared: 0.435
## F-statistic: 32.5 on 2 and 80 DF, p-value: 4.62e-11
Although the \( R^2 \) value went up (44.84% of variability in log price is explained by sqft and bedrooms), the p-value on bedrooms isn't significant. The p-value here can be interpreted as a hypothesis test on the slope coefficient given the other variables in the model.
0.295 = P(a slope of -0.06029 or more extreme if sqft is in the model and there is no relationship between bedrooms and log(price))
Our output says that once we have sqft in the model, we don't actually need to know anything about the number of bedrooms (even though bedrooms was a significant predictor on its own).
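We can see the same thing more formally with a nested-model F test (a sketch, not part of the original output), comparing the model with only sqft to the model with both predictors:
# nested-model comparison: does adding bedrooms improve the sqft-only model?
anova(lm(log(price) ~ sqft), lm(log(price) ~ sqft + bedrooms))
Because only one coefficient is added, the F-test p-value matches the t-test p-value on bedrooms (0.295).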
The final model regresses log(price) on sqft alone. Note that the coefficient on sqft and the \( R^2 \) value change slightly from the two-variable model because bedrooms has been dropped.
summary(lm(log(price) ~ sqft))
##
## Call:
## lm(formula = log(price) ~ sqft)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.090 -0.296 -0.059 0.287 1.202
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.20e+01 1.24e-01 97.36 < 2e-16 ***
## sqft 4.27e-04 5.35e-05 7.99 7.9e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.45 on 81 degrees of freedom
## Multiple R-squared: 0.441, Adjusted R-squared: 0.434
## F-statistic: 63.8 on 1 and 81 DF, p-value: 7.87e-12
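Because the response is on the log scale, the slope has a multiplicative interpretation (an aside, not in the original handout): each extra square foot multiplies the predicted price by exp(0.000427). For example,
exp(100 * 4.27e-04)  # about 1.044: an extra 100 sqft raises the predicted price by roughly 4.4%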
As with the intervals we built from a single sample, we can now create intervals either for an average (a confidence interval) or for an individual (a prediction interval).
predict(lm(log(price) ~ sqft), newdata = data.frame(sqft = 2000), interval = "confidence")
## fit lwr upr
## 1 12.89 12.79 12.99
I am 95% confident that the true average log price for a 2000 sqft home is between 12.79 log$ and 12.99 log$. Back transforming can be a little tricky: exponentiating the endpoints gives an interval for the median price, not the mean, because the average of the logs is not the log of the average.
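For instance, exponentiating the rounded endpoints above gives approximate bounds for the median price:
exp(c(12.79, 12.99))  # roughly $359,000 to $438,000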
predict(lm(log(price) ~ sqft), newdata = data.frame(sqft = 2000), interval = "prediction")
## fit lwr upr
## 1 12.89 11.99 13.79
I am 95% confident that an individual 2000 sqft home has a log price between 11.99 log$ and 13.79 log$. Now back transforming is easy (because there are no averages), so 95% of homes with 2000 sqft are between $161,126 and $977,301.
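Exponentiating the predict() output directly reproduces that dollar range (up to rounding of the log-scale endpoints shown above):
exp(predict(lm(log(price) ~ sqft), newdata = data.frame(sqft = 2000), interval = "prediction"))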
# scatterplot with fitted line, 95% confidence bands, and 95% prediction bands
plot(sqft, log(price), pch = 18)
sqftlm = lm(log(price) ~ sqft)
abline(sqftlm, col = "red")
newX = seq(min(sqft), max(sqft), 1)
# pointwise confidence bands for the mean log price
prd.CI = predict(sqftlm, newdata = data.frame(sqft = newX), interval = "confidence",
    level = 0.95)
lines(newX, prd.CI[, 2], col = "blue", lty = 2)
lines(newX, prd.CI[, 3], col = "blue", lty = 2)
# pointwise prediction bands for individual log prices
prd.PI = predict(sqftlm, newdata = data.frame(sqft = newX), interval = "prediction",
    level = 0.95)
lines(newX, prd.PI[, 2], col = "green", lty = 3)
lines(newX, prd.PI[, 3], col = "green", lty = 3)
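A legend (not part of the original figure) makes the three sets of lines easier to distinguish:
legend("topleft", legend = c("fitted line", "95% CI", "95% PI"),
    col = c("red", "blue", "green"), lty = 1:3)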