In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package for data visualization. The data can be found in the companion package for OpenIntro labs, oilabs.
Let’s load the packages.
library(dplyr)
library(ggplot2)
library(oilabs)
In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.
Load the nc data set into our workspace.
data(nc)
We have observations on 13 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file:
?nc
Remember that you can check the variables and the number of observations by viewing the data in the data viewer or by using the following command:
glimpse(nc)
We will start by analyzing the number of weeks of gestation of the pregnancy: weeks.
Using visualization and summary statistics, describe the distribution of the number of weeks of gestation. The summary function can be useful.
nc %>%
select(weeks) %>%
summary()
How many mothers are we missing gestation data from?
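One way to count missing values is to sum the logical vector returned by is.na(). The sketch below uses a small made-up vector, toy_weeks (a hypothetical name, not the nc data); swapping in nc$weeks should give the count for the lab data.

```r
# toy_weeks is a made-up vector for illustration only, not the nc data
toy_weeks <- c(39, 40, NA, 38, 41, NA, 37)

# is.na() flags each missing entry; summing the flags counts them
n_missing <- sum(is.na(toy_weeks))
n_missing
```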
Visualize the data using a boxplot and a histogram. What do the plots highlight about the distribution of the data?
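As a starting point, here is a minimal ggplot2 sketch of both plots. It uses a simulated data frame, toy_births (a hypothetical name, not the nc data), so the shapes it shows say nothing about the real sample; replace toy_births with nc to plot the lab data. The binwidth of 1 is an assumption that suits whole-week values.

```r
library(ggplot2)

# toy_births is simulated for illustration only, not the nc data
set.seed(1)
toy_births <- data.frame(weeks = round(rnorm(200, mean = 38.5, sd = 2.5)))

# Boxplot of a single numeric variable: an empty string serves as a dummy x position
box_plot <- ggplot(toy_births, aes(x = "", y = weeks)) +
  geom_boxplot() +
  labs(x = NULL, y = "Weeks of gestation")

# Histogram with one bin per week
hist_plot <- ggplot(toy_births, aes(x = weeks)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Weeks of gestation", y = "Count")

print(box_plot)
print(hist_plot)
```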
Are the technical conditions necessary for inference satisfied? Comment. You can compute the sample size with the summarize command and the n() function.
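The counting step might look like the sketch below. It again uses a made-up data frame, toy_births (hypothetical, not the nc data). Note that n() counts rows, including those with missing values, so the sketch also counts the non-missing entries separately.

```r
library(dplyr)

# toy_births is a made-up data frame for illustration only, not the nc data
toy_births <- data.frame(weeks = c(39, 40, NA, 38, 41, NA, 37))

size_summary <- toy_births %>%
  summarize(
    n_total    = n(),                 # all rows, missing or not
    n_observed = sum(!is.na(weeks))   # rows with a recorded value
  )
size_summary
```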
Write the hypotheses for testing if the average gestation period is consistent with the commonly held belief that humans gestate for 40 weeks.
t.test(x=nc$weeks, mu=, alternative=)
Give the full conclusion associated with the results of the hypothesis test. The conclusion should include English words like “gestation” and “40 weeks”.
Using the CI part of the t.test function, find a 99% confidence interval for the true gestation period in the population. Interpret the interval in the context of the data (use English words that give the interpretation, such as “gestation”). Note that by default you’ll get a 95% confidence interval. To change the confidence level, add a new argument (conf.level) which takes a value between 0 and 1. Also note that, when computing a confidence interval, arguments like mu and alternative are not useful, so you may want to remove them.
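To see how conf.level changes the output, the sketch below compares the default interval with a 90% interval on simulated values (toy_weeks is a hypothetical name, not the nc data, and 90% is chosen only for illustration). The interval is stored in the conf.int element of the object that t.test returns.

```r
# toy_weeks is simulated for illustration only, not the nc data
set.seed(1)
toy_weeks <- rnorm(50, mean = 38.5, sd = 2.5)

# Default: 95% confidence interval
ci_default <- t.test(toy_weeks)$conf.int

# Same data, 90% confidence interval
ci_90 <- t.test(toy_weeks, conf.level = 0.90)$conf.int

ci_default
ci_90
```

Lowering the confidence level narrows the interval; raising it widens the interval.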
Perhaps what the research really shows is that the 50% trimmed mean (the average of the middle 50% of births) is 40 weeks. Using a 99% bootstrap CI, evaluate whether or not the data are consistent with a trimmed mean gestation of 40 weeks. [N.B.: the median analysis doesn’t work here because of the repeated integers in the data set.]
gweeks <- na.omit(nc$weeks) # keep only the non missing data
glimpse(gweeks) # only 998 observations now, that is what we'll bootstrap
resamples <- lapply(1:1000, function(i) sample(gweeks, length(gweeks), replace = TRUE)) # 1000 resamples, each the size of the data
bootstraptmeans <- sapply(resamples, mean, trim=.25) # trim 25% off each end
bootstraptmeans %>% quantile(c(0.005, 0.995))
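To see what trim = .25 does in the code above, here is a small sketch on a made-up vector (toy is hypothetical, not the nc data). With eight values, trimming 25% from each end drops the two smallest and the two largest before averaging.

```r
# toy is a made-up vector for illustration only
toy <- c(1, 2, 3, 4, 5, 6, 7, 100)

plain_mean   <- mean(toy)               # ordinary mean, pulled up by the value 100
trimmed_mean <- mean(toy, trim = 0.25)  # averages only the middle values 3, 4, 5, 6

plain_mean
trimmed_mean
```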
This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was adapted for OpenIntro by Mine Çetinkaya-Rundel from a lab written by the faculty and TAs of UCLA Statistics.