Math 152, Fall 2013 Jo Hardin

Bootstrap R code for HW & final

install.packages("bootstrap", repos = "http://cran.us.r-project.org")

## Installing package into '\\wells/fac-staff$/jsh04747/My
## Documents/R/win-library/3.0' (as 'lib' is unspecified)

## package 'bootstrap' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\jsh04747\AppData\Local\Temp\RtmpcbAWa1\downloaded_packages

library(bootstrap)

.1. The formal medical term for vitamin A in the blood is serum retinol. Serum retinol has various beneficial effects, such as protecting against fractures. Medical researchers working with children in Papua New Guinea asked whether recent infections reduce the level of serum retinol. They classified children as recently infected or not on the basis of other blood tests, then measured serum retinol. Of the 90 children in the sample, 55 had been recently infected. The observations, in micromoles per liter, is at . One conclusion from the data is “The mean serum retinol level in uninfected children was 1.255 times the mean level in the infected children. A 95% confience interval for the ratio of means in the population of all children in Papua New Guinea is…”

(a) Find and interpret a two sided CI using a percentile interval.
(b) Find and interpret a two sided CI using a t-CI.
(c ) Find and interpret a two sided CI using a Bootstrap-t CI.
(d) Why might a one-sided interval be better here?
(e) Choose one of your intervals above to find and interpret a one sided interval.

Note: Make sure to be clear about the parameter of interest here (that is, what parameter are you finding a confidence interval for?)

.2. Why is the percentile interval range respecting but the BS-t CI is not? Why would that matter?

retinol = read.table("http://pages.pomona.edu/~jsh04747/courses/math152/retinol.csv", 
    header = T, sep = ",")
attach(retinol)
names(retinol)

## [1] "notinfect" "infect"

notinfect  # note that you don't really want all those NA values!!

##  [1] 0.59 1.44 0.35 0.99 0.83 1.17 1.08 1.04 0.99 0.35 0.35 0.35 0.88 0.67
## [15] 1.22 0.94 0.67 0.23 0.62 0.86 1.15 1.00 0.31 0.34 0.46 0.90 1.13 1.02
## [29] 0.58 0.49 0.39 0.70 0.67 1.11 1.36   NA   NA   NA   NA   NA   NA   NA
## [43]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA

notinfect2 = notinfect[1:35]
# check:
mean(notinfect2)/mean(infect)

## [1] 1.255

Example

Hesketh and Everitt (2000) report on a study by Caplehorn and Bell (1991) that investigated the times (in days) that heroin addicts remained in a clinic for methadone maintenance treatment. The data in heroin.txt include the amount of time that the subjects stayed in the facility until treatment was terminated (column 4). For about 37% of the subjects, the study ended while they were still the in clinic (status=0). Thus, their survival time has been truncated. For this reason we might not want to estimate the mean survival time, but rather some other measure of typical survival time. Below we explore using the median as well as the 25% trimmed mean. We treat the group of 238 patients as representative of the population. [From ISCAM Chance & Rossman, Investigation 4.5.3]

heroin = read.table("http://pages.pomona.edu/~jsh04747/courses/math58/Math58Data/HEROIN.TXT", 
    header = T)
obs.stat <- median(heroin[, 4])
obs.stat

## [1] 367.5


obs.stat2 <- mean(heroin[, 4], trim = 0.25)
obs.stat2

## [1] 378.3


heroin.rs <- sample(heroin[, 4], 238, replace = T)
median(heroin.rs)

## [1] 315.5

mean(heroin.rs, trim = 0.25)

## [1] 350.6

Bootstrapping

test.stat <- c()
test.stat2 <- c()
sd.test.stat <- c()
sd.test.stat2 <- c()

for (i in 1:1000) {
    heroin.rs <- sample(heroin[, 4], 238, replace = T)
    test.stat <- c(test.stat, median(heroin.rs))
    test.stat2 <- c(test.stat2, mean(heroin.rs, trim = 0.25))

    test.stat.rs <- c()
    test.stat2.rs <- c()

    for (j in 1:1000) {
        heroin.rsrs <- sample(heroin.rs, 238, replace = T)
        test.stat.rs <- c(test.stat.rs, median(heroin.rsrs))
        test.stat2.rs <- c(test.stat2.rs, mean(heroin.rsrs, trim = 0.25))
    }
    sd.test.stat <- c(sd.test.stat, sd(test.stat.rs))
    sd.test.stat2 <- c(sd.test.stat2, sd(test.stat2.rs))
}

par(mfcol = c(2, 3))
hist(heroin[, 4])
hist(heroin.rs)
hist(test.stat)
qqnorm(test.stat)
qqline(test.stat)
hist(test.stat2)
qqnorm(test.stat2)
qqline(test.stat2)

plot of chunk unnamed-chunk-4


# median
mean(test.stat)

## [1] 374.9

sd(test.stat)

## [1] 32.76


# 25% trimmed mean
mean(test.stat2)

## [1] 378.7

sd(test.stat2)

## [1] 23.05


# OR - using R's built in functions

bs.proc <- bootstrap(heroin[, 4], nboot = 1000, theta = median)
bs.proc2 <- bootstrap(heroin[, 4], nboot = 1000, theta = mean, trim = 0.25)

95% Percentile CI

c(sort(test.stat)[25], sort(test.stat)[975])

## [1] 319.5 450.0

c(sort(test.stat2)[25], sort(test.stat2)[975])

## [1] 333.7 422.6


# OR - using R's built in functions

c(sort(bs.proc$thetastar)[25], sort(bs.proc$thetastar)[975])

## [1] 319.5 450.0

c(sort(bs.proc2$thetastar)[25], sort(bs.proc2$thetastar)[975])

## [1] 335.4 426.9

95% t-CI

c(obs.stat - qt(0.975, 237) * sd(test.stat), obs.stat + qt(0.975, 237) * sd(test.stat))

## [1] 303 432

c(obs.stat2 - qt(0.975, 237) * sd(test.stat2), obs.stat2 + qt(0.975, 237) * 
    sd(test.stat2))

## [1] 332.9 423.7



# OR - using R's built in functions

se.bs <- sd(bs.proc$thetastar)
se.bs2 <- sd(bs.proc2$thetastar)

c(obs.stat - qt(0.975, 237) * se.bs, obs.stat + qt(0.975, 237) * se.bs)

## [1] 304.4 430.6

c(obs.stat2 - qt(0.975, 237) * se.bs2, obs.stat2 + qt(0.975, 237) * se.bs2)

## [1] 333.1 423.5

95% Bootstrap-t CI

t.hat <- (test.stat - obs.stat)/sd.test.stat
t.hat2 <- (test.stat2 - obs.stat2)/sd.test.stat2

c(obs.stat - sort(t.hat)[975] * sd(test.stat), obs.stat - sort(t.hat)[25] * 
    sd(test.stat))

## [1] 292.8 424.4

c(obs.stat2 - sort(t.hat2)[975] * sd(test.stat2), obs.stat2 - sort(t.hat2)[25] * 
    sd(test.stat2))

## [1] 334.0 425.6

# OR - using R's built in functions

boott(heroin[, 4], 1000, theta = median)

## $confpoints
##      0.001  0.01 0.025  0.05 0.1   0.5 0.9  0.95 0.975  0.99 0.999
## [1,] 285.1 303.4 310.3 325.9 335 367.5 388 394.1 400.1 406.1 417.4
## 
## $theta
## NULL
## 
## $g
## NULL
## 
## $call
## boott(x = heroin[, 4], theta = median, 1000)

boott(heroin[, 4], 1000, theta = mean, trim = 0.25)

## $confpoints
##      0.001  0.01 0.025  0.05   0.1   0.5   0.9  0.95 0.975  0.99 0.999
## [1,] 311.9 316.4 333.1 338.2 346.6 379.3 410.4 422.5 431.6 445.3 468.3
## 
## $theta
## NULL
## 
## $g
## NULL
## 
## $call
## boott(x = heroin[, 4], theta = mean, 1000, trim = 0.25)

95% BCa interval

test.stat.jk <- c()
test.stat2.jk <- c()

for (i in 1:length(heroin[, 4])) {

    test.stat.jk <- c(test.stat.jk, median(heroin[-i, 4]))
    test.stat2.jk <- c(test.stat2.jk, mean(heroin[-i, 4], trim = 0.25))
}

zo.hat <- qnorm(sum(test.stat < obs.stat)/1000, 0, 1)
a.hat <- sum((mean(test.stat.jk) - test.stat.jk)^3)/(6 * (sum((mean(test.stat.jk) - 
    test.stat.jk)^2)^1.5))

zo.hat2 <- qnorm(sum(test.stat2 < obs.stat2)/1000, 0, 1)
a.hat2 <- sum((mean(test.stat2.jk) - test.stat2.jk)^3)/(6 * (sum((mean(test.stat2.jk) - 
    test.stat2.jk)^2)^1.5))

alpha1.bca <- pnorm(zo.hat + (zo.hat + qnorm(0.975))/(1 - a.hat * (zo.hat + 
    qnorm(0.975))))
alpha2.bca <- pnorm(zo.hat + (zo.hat + qnorm(0.025))/(1 - a.hat * (zo.hat + 
    qnorm(0.025))))


alpha1.bca2 <- pnorm(zo.hat2 + (zo.hat2 + qnorm(0.975))/(1 - a.hat2 * (zo.hat2 + 
    qnorm(0.975))))
alpha2.bca2 <- pnorm(zo.hat2 + (zo.hat2 + qnorm(0.025))/(1 - a.hat2 * (zo.hat2 + 
    qnorm(0.025))))


c(sort(test.stat)[ceiling(1000 * alpha2.bca)], sort(test.stat)[ceiling(1000 * 
    alpha1.bca)])

## [1] 317.0 444.5

c(sort(test.stat2)[ceiling(1000 * alpha2.bca2)], sort(test.stat2)[ceiling(1000 * 
    alpha1.bca2)])

## [1] 332.2 421.8


# OR - using R's built in functions

bcanon(heroin[, 4], 1000, theta = median)

## $confpoints
##      alpha bca point
## [1,] 0.025     308.0
## [2,] 0.050     321.0
## [3,] 0.100     337.0
## [4,] 0.160     342.0
## [5,] 0.840     396.5
## [6,] 0.900     406.5
## [7,] 0.950     434.0
## [8,] 0.975     447.5
## 
## $z0
## [1] -0.07024
## 
## $acc
## [1] 0
## 
## $u
##   [1] 367 368 368 368 368 367 367 367 367 367 368 367 367 367 368 367 367
##  [18] 367 367 367 368 368 368 367 368 367 367 368 367 367 368 367 367 367
##  [35] 367 367 367 367 367 367 367 367 367 367 367 367 368 367 367 367 368
##  [52] 367 367 367 367 368 367 367 367 368 368 367 367 367 368 367 368 367
##  [69] 367 367 368 367 367 367 368 368 368 368 367 368 368 367 367 367 367
##  [86] 367 368 368 367 367 368 367 368 367 367 368 367 367 367 368 367 367
## [103] 367 368 368 368 367 367 367 367 367 367 367 367 367 368 368 367 368
## [120] 367 367 368 368 367 367 367 367 367 368 368 367 368 368 368 368 367
## [137] 368 367 368 367 368 367 368 367 367 368 367 367 367 368 368 368 367
## [154] 367 367 368 367 368 368 367 367 367 368 368 368 367 368 367 368 368
## [171] 368 368 367 368 368 368 368 368 368 368 368 368 367 368 368 368 367
## [188] 368 368 367 368 368 368 368 368 368 367 368 368 367 368 368 368 368
## [205] 368 368 368 368 368 368 367 368 368 368 368 368 368 368 368 368 368
## [222] 367 368 368 368 368 367 368 368 368 368 367 368 368 367 367 368 368
## 
## $call
## bcanon(x = heroin[, 4], nboot = 1000, theta = median)

bcanon(heroin[, 4], 1000, theta = mean, trim = 0.25)

## $confpoints
##      alpha bca point
## [1,] 0.025     332.9
## [2,] 0.050     340.1
## [3,] 0.100     348.7
## [4,] 0.160     355.2
## [5,] 0.840     398.4
## [6,] 0.900     404.8
## [7,] 0.950     416.3
## [8,] 0.975     421.6
## 
## $z0
## [1] -0.05015
## 
## $acc
## [1] -2.176e-05
## 
## $u
##   [1] 377.9 379.2 379.3 379.9 379.3 376.5 377.8 376.5 376.5 378.2 380.1
##  [12] 376.5 377.1 376.5 379.7 378.1 376.5 377.2 377.2 376.5 379.7 378.6
##  [23] 379.0 376.5 379.3 376.7 378.4 378.9 376.5 376.5 379.0 376.7 378.2
##  [34] 376.5 376.5 376.5 376.5 376.9 376.5 376.5 376.6 377.1 377.2 376.5
##  [45] 376.5 376.8 380.1 377.4 377.1 376.5 380.1 376.7 376.5 376.5 376.5
##  [56] 380.0 377.7 376.5 377.3 379.3 380.0 378.2 377.8 376.7 378.6 376.5
##  [67] 379.9 378.1 376.5 376.5 379.4 376.5 376.5 377.1 378.6 379.0 379.4
##  [78] 380.1 378.3 379.7 380.1 377.0 377.1 376.5 378.1 376.5 380.1 380.1
##  [89] 377.4 376.9 379.8 377.5 379.5 377.1 376.5 380.1 377.6 376.5 376.5
## [100] 380.1 377.7 376.8 377.6 380.1 380.1 380.1 377.0 376.5 376.5 376.5
## [111] 376.5 376.5 376.5 376.5 376.5 380.1 380.1 376.5 378.4 376.5 376.5
## [122] 379.5 380.1 376.7 376.5 376.5 376.5 376.5 379.9 380.1 376.5 380.1
## [133] 379.1 378.5 378.7 376.5 380.1 376.7 379.2 376.5 378.8 376.5 380.1
## [144] 376.5 376.6 380.1 376.5 376.5 377.6 379.5 380.1 380.1 376.5 376.5
## [155] 376.5 379.7 376.5 379.2 379.6 376.5 377.3 378.2 380.1 380.1 378.5
## [166] 377.0 378.8 377.6 380.1 380.1 378.5 380.1 377.6 380.1 380.1 380.1
## [177] 380.1 380.1 379.6 380.1 379.8 380.1 376.6 380.0 380.1 380.1 376.9
## [188] 379.7 379.9 377.8 378.4 378.6 380.1 378.6 380.0 380.1 376.9 380.1
## [199] 379.1 377.0 379.7 379.7 380.1 380.1 380.1 380.1 379.3 380.1 378.6
## [210] 380.1 377.0 380.1 380.1 380.1 380.1 380.1 380.1 380.1 380.1 380.1
## [221] 380.1 378.2 380.1 380.0 380.0 378.8 377.4 378.7 379.1 379.8 378.4
## [232] 377.0 380.1 380.1 376.9 376.8 380.1 380.1
## 
## $call
## bcanon(x = heroin[, 4], nboot = 1000, theta = mean, trim = 0.25)

Note: Retinol data provided by Francisco Rosales of the Department of Nutritional Sciences, Penn State University. See Rosales et al., “Relation of serum retinol to acute phase proteins and malarial morbidity in Papua New Guinea children,” American Journal of Clinical Nutrition, 71 (2000), pp. 1580-1588. Problem taken from “Bootstrap Methods and Permutation Tests” by Hesterberg et al., Chapter 16, 6e, IPS.