Getting Started

Load packages

In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package. The data can be found in oilabs, the companion package for OpenIntro labs.

Let’s load the packages.

library(dplyr)
library(ggplot2)
library(oilabs)

The data

In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.

Load the nc data set into our workspace.

data(nc)

We have observations on 13 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file:

?nc

  1. What are the cases (observational units) in this data set? How many cases are there in our sample?

Remember that you can answer this question by viewing the data in the data viewer or by using the following command:

glimpse(nc)

Exploratory data analysis

We will start by analyzing the number of weeks of gestation of each pregnancy, recorded in the variable weeks.

Using visualization and summary statistics, describe the distribution of gestation length. The summary function can be useful.

nc %>%
  select(weeks) %>%
  summary()
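The summary output reports missing values in a row labeled NA's. As a cross-check, here is a minimal sketch of counting them directly. The vector toy_weeks is made up for illustration and only stands in for nc$weeks.

```r
# Hypothetical toy values standing in for nc$weeks (not the real data)
toy_weeks <- c(39, 40, NA, 38, 41, NA, 37, 40)

n_missing  <- sum(is.na(toy_weeks))   # number of missing entries
n_observed <- sum(!is.na(toy_weeks))  # number of recorded entries

n_missing
n_observed
```

The same two lines applied to nc$weeks should agree with the NA's count that summary prints.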
  1. How many mothers are we missing gestation data from?

  2. Visualize the data using a boxplot and a histogram. What do the plots highlight about the distribution of the data?
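One possible way to draw the two plots with ggplot2 is sketched below. The data frame toy is invented for illustration and stands in for nc, and the choice of binwidth = 1 is an assumption that suits data recorded in whole weeks.

```r
library(ggplot2)

# Hypothetical toy data standing in for the nc data frame
toy <- data.frame(weeks = c(39, 40, NA, 38, 41, 36, 37, 40, 39, NA))

# Histogram: binwidth = 1 gives one bar per week of gestation;
# na.rm = TRUE drops missing values without a warning
p_hist <- ggplot(toy, aes(x = weeks)) +
  geom_histogram(binwidth = 1, na.rm = TRUE)

# Boxplot: a single box, so x is mapped to an empty label
p_box <- ggplot(toy, aes(x = "", y = weeks)) +
  geom_boxplot(na.rm = TRUE) +
  labs(x = NULL)

p_hist
p_box
```

Replacing toy with nc produces the plots for the real sample. The histogram tends to show shape and skew, while the boxplot makes outliers easier to spot.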


To Turn In

Inference

  1. Are the technical conditions necessary for inference satisfied? Comment. You can compute the number of observations with the summarize command and the n() function.

  2. Write the hypotheses for testing if the average gestation period is consistent with the commonly held belief that humans gestate for 40 weeks.

t.test(x=nc$weeks, mu=, alternative=)
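To show how the arguments of t.test fit together, here is a sketch on simulated data. The sample toy_x and the null value 10 are assumptions chosen for illustration and are not the values for this lab. The alternative argument accepts "two.sided", "less", or "greater".

```r
set.seed(1)                               # make the simulated sample reproducible
toy_x <- rnorm(50, mean = 10.5, sd = 2)   # hypothetical sample, not the nc data

# Test H0: mean = 10 against a two-sided alternative
result <- t.test(x = toy_x, mu = 10, alternative = "two.sided")
result$statistic   # t statistic
result$p.value     # p-value

# A 99% confidence interval: mu and alternative are dropped, conf.level is added
ci99 <- t.test(x = toy_x, conf.level = 0.99)$conf.int
ci99
```

Leaving alternative out of the second call matters: a one-sided alternative would produce a one-sided interval.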
  1. Give the full conclusion associated with the results of the hypothesis test. The conclusion should include English words like “gestation” and “40 weeks”.

  2. Using the confidence interval reported by the t.test function, find a 99% confidence interval for the true mean gestation period in the population. Interpret the interval in the context of the data, using plain English words such as “gestation”. Note that by default you’ll get a 95% confidence interval. To change the confidence level, add the argument conf.level, which takes a value between 0 and 1. Also note that when computing a confidence interval, arguments like mu and alternative are not needed, so you may want to remove them.

  3. Perhaps what the research really shows is that the 50% trimmed mean (the average of the middle 50% of births) is 40 weeks. Using a 99% bootstrap CI, evaluate whether or not the data are consistent with a trimmed mean gestation of 40 weeks. [n.b., the median analysis doesn’t work here because of the repeated integers in the dataset.]

gweeks <- na.omit(nc$weeks)  # keep only the non missing data
glimpse(gweeks)  # only 998 observations now, that is what we'll bootstrap

# draw 1000 bootstrap resamples, each the same size as the original sample
resamples <- lapply(1:1000, function(i) sample(gweeks, length(gweeks), replace = TRUE))
bootstraptmeans <- sapply(resamples, mean, trim = 0.25)  # trim 25% off each end
bootstraptmeans %>% quantile(c(0.005, 0.995))
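The percentile bootstrap above can also be wrapped in a small function, which makes it easy to change the trimming fraction or confidence level. This is only a sketch: boot_trimmed_ci is a hypothetical helper name, and the data are simulated rather than taken from nc.

```r
# Percentile bootstrap CI for a trimmed mean
# (boot_trimmed_ci is a hypothetical helper, not part of any package)
boot_trimmed_ci <- function(x, trim = 0.25, level = 0.99, reps = 1000) {
  x <- x[!is.na(x)]                        # drop missing values first
  boot_stats <- replicate(
    reps,
    mean(sample(x, length(x), replace = TRUE), trim = trim)
  )
  alpha <- 1 - level
  quantile(boot_stats, c(alpha / 2, 1 - alpha / 2))
}

set.seed(2024)                             # reproducible resampling
# Simulated whole-week values plus two missing entries, not the nc data
toy_weeks <- c(round(rnorm(200, mean = 38.5, sd = 2)), NA, NA)
ci <- boot_trimmed_ci(toy_weeks)
ci
```

Setting a seed makes the interval reproducible. Without one, the endpoints will shift slightly from run to run because the resamples are random.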
  1. Give three possible explanations for why the confidence interval(s) on mean / trimmed mean gestation don’t overlap the number 40. Which reason do you think it is? Explain.

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was adapted for OpenIntro by Mine Çetinkaya-Rundel from a lab written by the faculty and TAs of UCLA Statistics.