September 29, 2015

Consider the NHANES dataset.

  • Income (HHIncomeMid - Numerical version of HHIncome derived from the middle income in each category)
  • Health (HealthGen - Self-reported rating of participant's health in general Reported for participants aged 12 years or older. One of Excellent, Vgood, Good, Fair, or Poor.)

Summary of the variables of interest

NHANES %>% select(HealthGen) %>% table()
## .
## Excellent     Vgood      Good      Fair      Poor 
##       878      2508      2956      1010       187
NHANES %>% select(HHIncomeMid) %>% summary()
##   HHIncomeMid    
##  Min.   :  2500  
##  1st Qu.: 30000  
##  Median : 50000  
##  Mean   : 57206  
##  3rd Qu.: 87500  
##  Max.   :100000  
##  NA's   :811

Mean Income broken down by Health

NH.means = NHANES %>% select(HHIncomeMid, HealthGen) %>% 
  filter(!is.na(HealthGen)) %>% group_by(HealthGen) %>% 
  summarize(IncMean = mean(HHIncomeMid, na.rm=TRUE), count=n())
NH.means
## Source: local data frame [5 x 3]
## 
##   HealthGen IncMean count
##      (fctr)   (dbl) (int)
## 1 Excellent   69354   878
## 2     Vgood   65011  2508
## 3      Good   55662  2956
## 4      Fair   44194  1010
## 5      Poor   37027   187

Are the differences in means simply due to random chance??

Income and Health

NHANES %>% select(HHIncomeMid, HealthGen) %>% 
  filter(!is.na(HealthGen)) %>% 
ggplot(aes(x=HealthGen, y=HHIncomeMid)) + geom_boxplot()

Differences in Health

diff.mat = data.frame(matrix(ncol=5, nrow=5))
names(diff.mat) = NH.means$HealthGen
rownames(diff.mat) = NH.means$HealthGen

for(i in 1:5){
  for(j in 1:5){
    diff.mat[i,j] = NH.means$IncMean[i] - NH.means$IncMean[j]   }}

diff.mat
##           Excellent  Vgood   Good  Fair  Poor
## Excellent         0   4344  13692 25161 32327
## Vgood         -4344      0   9348 20817 27983
## Good         -13692  -9348      0 11469 18635
## Fair         -25161 -20817 -11469     0  7166
## Poor         -32327 -27983 -18635 -7166     0

Overall difference

We can measure the overall differences as the amount of variability between each of the means and the overall mean:

\[F = \frac{\text{between-group variability}}{\text{within-group variability}}\] \[F = \frac{\sum_i n_i(\bar{Y}_{i\cdot} - \bar{Y})^2/(K-1)}{\sum_{ij} (Y_{ij}-\bar{Y}_{i\cdot})^2/(N-K)}\] \[SumSqBtwn = \sum_i n_i(\bar{Y}_{i\cdot} - \bar{Y})^2\]

Creating a test statistic

NHANES %>% select(HHIncomeMid, HealthGen) %>% 
  filter(!is.na(HealthGen)) %>% head()
## Source: local data frame [6 x 2]
## 
##   HHIncomeMid HealthGen
##         (int)    (fctr)
## 1       30000      Good
## 2       30000      Good
## 3       30000      Good
## 4       40000      Good
## 5       87500     Vgood
## 6       87500     Vgood

Creating a test statistic

GM = mean(NHANES$HHIncomeMid, na.rm=TRUE); GM
## [1] 57206
NH.means
## Source: local data frame [5 x 3]
## 
##   HealthGen IncMean count
##      (fctr)   (dbl) (int)
## 1 Excellent   69354   878
## 2     Vgood   65011  2508
## 3      Good   55662  2956
## 4      Fair   44194  1010
## 5      Poor   37027   187

Creating a test statistic

NH.means$IncMean - GM
## [1]  12148   7805  -1544 -13013 -20179
(NH.means$IncMean - GM)^2
## [1] 147578150  60910286   2383368 169328332 407181201
NH.means$count
## [1]  878 2508 2956 1010  187
NH.means$count * (NH.means$IncMean - GM)^2
## [1] 1.30e+11 1.53e+11 7.05e+09 1.71e+11 7.61e+10

Creating a test statistic

\[SumSqBtwn = \sum_i n_i(\bar{Y}_{i\cdot} - \bar{Y})^2\]

sum(NH.means$count * (NH.means$IncMean - GM)^2)
## [1] 5.37e+11

Permuting the data:

NH.means = NHANES %>% select(HHIncomeMid, HealthGen) %>% 
  mutate(IncomePerm = HHIncomeMid[sample(1:10000, replace=FALSE)]) %>%
  filter(!is.na(HealthGen)) %>% group_by(HealthGen) %>% 
  summarize(IncMeanP = mean(IncomePerm, na.rm=TRUE), count=n())

sum(NH.means$count * (NH.means$IncMeanP - GM)^2)
## [1] 1.08e+10

Lots of times…

reps = 10000
SSB = c()

for(i in 1:reps){
NH.means = NHANES %>% select(HHIncomeMid, HealthGen) %>% 
  mutate(IncomePerm = HHIncomeMid[sample(1:10000, replace=FALSE)]) %>%
  filter(!is.na(HealthGen)) %>% group_by(HealthGen) %>% 
  summarize(IncMeanP = mean(IncomePerm, na.rm=TRUE), count=n())

  SSB = c(SSB, sum(NH.means$count * (NH.means$IncMeanP - GM)^2))
}

Compared to the real data

NH.means = NHANES %>% select(HHIncomeMid, HealthGen) %>% 
  filter(!is.na(HealthGen)) %>% group_by(HealthGen) %>% 
  summarize(IncMean = mean(HHIncomeMid, na.rm=TRUE), count=n())

realSSB = sum(NH.means$count* (NH.means$IncMean - GM)^2)
realSSB
## [1] 5.37e+11
sum(SSB>realSSB) / reps
## [1] 0

Compared to the real data

realSSB; hist(SSB); abline(v=realSSB)
## [1] 5.37e+11

Other test statistics