Introduction to ggplot2

Jo Hardin (almost entirely taken from Randy Pruim: http://dtkaplan.github.io/CVC/Summer2015/Learn/ggplot2/ggplot2Intro.pdf)

September 10, 2015

Goals

What I will try to do

What I can’t do in one session

HELP

  1. One of the best ways to get started with ggplot is to google what you want to do together with the word ggplot, then look through the images that come up. More often than not, the associated code is there. There are also ggplot galleries of images; one of them is here: https://plot.ly/ggplot2/

  2. Look at the end of this presentation for more help options.

Caffeine and Calories

What was the biggest concern over the average value axes?

  1. It isn’t at the origin.
  2. They should have used all the data possible to find averages.
  3. There wasn’t a random sample.
  4. There wasn’t a label explaining why the axes were where they were.

Pieces of the Graph

Yau gives us nine visual cues, and Wickham translates them into a language using ggplot2. (The items below are from Modern Data Science, chapter 2. See the text on Sakai for more information.)

  1. Visual Cues: the aspects of the figure where we should focus.
    Position (numerical) where in relation to other things?
    Length (numerical) how big (in one dimension)?
    Angle (numerical) how wide? parallel to something else?
    Direction (numerical) at what slope? In a time series, going up or down?
    Shape (categorical) belonging to what group?
    Area (numerical) how big (in two dimensions)? Beware of improper scaling!
    Volume (numerical) how big (in three dimensions)? Beware of improper scaling!
    Shade (either) to what extent? how severely?
    Color (either) to what extent? how severely? Beware of red/green color blindness.

  2. Coordinate System: rectangular, polar, geographic, etc.

  3. Scale: numeric (linear? logarithmic?), categorical (ordered?), time

  4. Context: in comparison to what (think back to ideas from Tufte)

Set up

require(mosaic)
require(lubridate) # package for working with dates
data(Births78)     # restore fresh version of Births78
head(Births78, 3)
##         date births dayofyear
## 1 1978-01-01   7701         1
## 2 1978-01-02   7527         2
## 3 1978-01-03   8825         3

The grammar of graphics

geom: the geometric “shape” used to display data (glyph)

aesthetic: an attribute controlling how a geom is displayed with respect to variables

scale: adjust information in the aesthetic to map onto the plot

guide: helps user convert visual data back into raw data (legends, axes)

stat: a transformation applied to data before geom gets it
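
As a rough illustration of how these five terms show up in code, the sketch below labels each piece of a small plot. The data frame `toy` and its columns are hypothetical and are not part of the data used in these slides.

```r
library(ggplot2)

# Hypothetical toy data, standing in for a real data set
toy <- data.frame(day   = 1:6,
                  count = c(10, 40, 90, 160, 250, 360),
                  group = rep(c("a", "b"), times = 3))

p <- ggplot(data = toy, aes(x = day, y = count, color = group)) +  # aesthetics
  geom_point(stat = "identity") +                  # geom, with its stat spelled out
  scale_y_log10() +                                # scale: log-transform the y values
  guides(color = guide_legend(title = "Group"))    # guide: the legend for color
p

# layer_data() shows what the geom receives after the scale has been applied
layer_data(p)$y
```

Here `stat = "identity"` is already the default for geom_point(); it is written out only to make the stat visible.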

How do we make this plot?

Two Questions:

  1. What do we want R to do? (What is the goal?)

  2. What does R need to know?

How do we make this plot?

Two Questions:

  1. Goal: scatterplot = a plot with points

  2. What does R need to know?

    • data source: Births78

    • aesthetics:

      • date -> x
      • births -> y
      • default color (same for all points)

How do we make this plot?

  1. Goal: scatterplot = a plot with points

    • ggplot() + geom_point()
  2. What does R need to know?

    • data source: data = Births78

    • aesthetics: aes(x = date, y = births)

ggplot(data=Births78, aes(x=date, y=births)) + 
  geom_point()

ggplot() +
  geom_point(data=Births78, aes(x=date, y=births))  
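
The two forms above draw the same picture. A minimal sketch of how one might check that, using a hypothetical data frame `toy` rather than Births78:

```r
library(ggplot2)

toy <- data.frame(x = 1:5, y = c(2, 4, 3, 5, 1))   # hypothetical toy data

# Data and aesthetics given to ggplot(): shared by every layer
p_global <- ggplot(data = toy, aes(x = x, y = y)) + geom_point()

# Data and aesthetics given to the geom: used by that layer only
p_local <- ggplot() + geom_point(data = toy, aes(x = x, y = y))

# Both plots place the points at the same positions
same_positions <- all.equal(layer_data(p_global)[, c("x", "y")],
                            layer_data(p_local)[, c("x", "y")])
same_positions

# Only the first form passes its data and aesthetics on to later layers
p_global + geom_line()
```

The practical difference appears once a second layer is added: with the first form the new layer inherits the data and aesthetics, while with the second form it would need its own.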

What are the visual cues on this plot?

  1. position
  2. length
  3. shape
  4. area/volume
  5. shade/color

Coordinate System? Scale?

How do we make this plot?

What has changed?

Adding day of week to the data set

The wday() function in the lubridate package computes the day of the week from a date.

Births78 <-  
  Births78 %>% 
  mutate(wday = wday(date, label=TRUE))

ggplot(data=Births78) +
  geom_point(aes(x=date, y=births, color=wday))
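
To see what wday() produces before using it as a color variable, it can help to try it on a few dates. This is a small sketch; the three dates are chosen only for illustration.

```r
library(lubridate)

dates <- as.Date(c("1978-01-01", "1978-01-02", "1978-01-07"))

# Without label = TRUE the result is a number (by default the week starts on Sunday)
day_numbers <- wday(dates)
day_numbers

# With label = TRUE the result is an ordered factor of abbreviated day names,
# which ggplot2 treats as a categorical variable when mapped to color
day_labels <- wday(dates, label = TRUE)
day_labels
```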

How do we make this plot?

This time we use lines instead of dots

ggplot(data=Births78) +
  geom_line(aes(x=date, y=births, color=wday)) 

How do we make this plot?

This time we have two layers, one with points and one with lines

ggplot(data=Births78, 
       aes(x=date, y=births, color=wday)) + 
  geom_point() +  geom_line()

What does this do?

Births78 %>%
  ggplot(aes(x=date, y=births, color="navy")) + 
  geom_point()  

This is mapping the color aesthetic to a new variable with only one value (“navy”).
So all the dots get set to the same color, but it’s not navy.

Setting vs. Mapping

If we want to set the color to be navy for all of the dots, we do it this way:

Births78 %>%
  ggplot(aes(x=date, y=births)) +   # map these 
  geom_point(color = "navy")        # set this

How do we make this plot?

Births78 %>%
  ggplot(aes(x=date, y=births)) + 
  geom_line(aes(color=wday)) +       # map color here
  geom_point(color="navy")           # set color here

Setting vs. Mapping (again)

Information is passed to the plot in one of two ways:

  1. map variable information inside the aes (aesthetic) command

  2. set non-variable information (the same for all data points) outside the aes (aesthetic) command
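
One way to see the difference between mapping and setting is to look at the color each layer actually uses. The sketch below assumes a hypothetical data frame `toy`; layer_data() returns the data a layer draws with.

```r
library(ggplot2)

toy <- data.frame(x = 1:3, y = c(2, 1, 3))   # hypothetical toy data

# Mapped: "navy" is treated as a data value and sent through the color scale
p_mapped <- ggplot(toy, aes(x = x, y = y, color = "navy")) + geom_point()

# Set: "navy" is used directly as the color of every point
p_set <- ggplot(toy, aes(x = x, y = y)) + geom_point(color = "navy")

unique(layer_data(p_mapped)$colour)   # a default palette color, not "navy"
unique(layer_data(p_set)$colour)      # the color that was set
```

The mapped version also gets a legend, because a mapped aesthetic always comes with a guide.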

Other geoms

apropos("^geom_")
 [1] "geom_abline"       "geom_area"         "geom_bar"         
 [4] "geom_bin2d"        "geom_blank"        "geom_boxplot"     
 [7] "geom_contour"      "geom_crossbar"     "geom_density"     
[10] "geom_density2d"    "geom_dotplot"      "geom_errorbar"    
[13] "geom_errorbarh"    "geom_freqpoly"     "geom_hex"         
[16] "geom_histogram"    "geom_hline"        "geom_jitter"      
[19] "geom_line"         "geom_linerange"    "geom_map"         
[22] "geom_path"         "geom_point"        "geom_pointrange"  
[25] "geom_polygon"      "geom_quantile"     "geom_rangeframe"  
[28] "geom_raster"       "geom_rect"         "geom_ribbon"      
[31] "geom_rug"          "geom_segment"      "geom_smooth"      
[34] "geom_step"         "geom_text"         "geom_tile"        
[37] "geom_tufteboxplot" "geom_violin"       "geom_vline"       

Help pages will tell you their aesthetics, default stats, etc.

?geom_area             # for example

Let’s try geom_area

Births78 %>%
  ggplot(aes(x=date, y=births, fill=wday)) + 
  geom_area()

This is not a good plot

Side note: what makes a plot good?

Most (all?) graphics are intended to help us make comparisons

Key plot metric: Does my plot make the comparisons I am interested in easy and accurate?

Time for some different data

HELPrct: Health Evaluation and Linkage to Primary care randomized clinical trial

?HELPrct

Subjects admitted for treatment for addiction to one of three substances.

Why are these people in the study?

HELPrct %>% 
  ggplot(aes(x=substance)) + 
  geom_bar()

How old are people in the HELP study?

HELPrct %>% 
  ggplot(aes(x=age)) + 
  geom_histogram()
## stat_bin: binwidth defaulted to range/30. Use ‘binwidth = x’ to adjust this.

Notice the messages

Setting the binwidth manually

HELPrct %>% 
  ggplot(aes(x=age)) + 
  geom_histogram(binwidth=2)
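
To get a feel for what binwidth changes, the bins themselves can be inspected. This sketch uses R's built-in faithful data as a stand-in for HELPrct (an assumption made so the example runs without extra packages).

```r
library(ggplot2)

# Same variable, two different bin widths
p_narrow <- ggplot(faithful, aes(x = waiting)) + geom_histogram(binwidth = 2)
p_wide   <- ggplot(faithful, aes(x = waiting)) + geom_histogram(binwidth = 5)

bins_narrow <- layer_data(p_narrow)
bins_wide   <- layer_data(p_wide)

# Narrower bins mean more bars ...
nrow(bins_narrow)
nrow(bins_wide)

# ... each exactly as wide as requested ...
unique(round(bins_narrow$xmax - bins_narrow$xmin, 8))

# ... and every observation is still counted once
sum(bins_narrow$count)
```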

How old are people in the HELP study? – Other geoms

HELPrct %>% 
  ggplot(aes(x=age)) + 
  geom_freqpoly(binwidth=2)

HELPrct %>% 
  ggplot(aes(x=age)) + 
  geom_density()

Selecting stat and geom manually

Every geom comes with a default stat

HELPrct %>% 
  ggplot(aes(x=age)) + 
  geom_line(stat="density")

Selecting stat and geom manually

Every stat comes with a default geom, every geom with a default stat

HELPrct %>% 
  ggplot(aes(x=age)) + 
  stat_density( geom="line")
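
The two spellings, geom_line(stat="density") and stat_density(geom="line"), should describe the same curve. A minimal sketch of a check, again using the built-in faithful data as a stand-in (an assumption, not the HELPrct data):

```r
library(ggplot2)

# Start from the geom and choose the stat ...
p_geom <- ggplot(faithful, aes(x = waiting)) + geom_line(stat = "density")

# ... or start from the stat and choose the geom
p_stat <- ggplot(faithful, aes(x = waiting)) + stat_density(geom = "line")

# Compare the points each layer would draw
same_curve <- all.equal(layer_data(p_geom)[, c("x", "y")],
                        layer_data(p_stat)[, c("x", "y")])
same_curve
```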

More combinations

HELPrct %>% 
  ggplot(aes(x=age)) + 
  geom_point(stat="bin", binwidth=3) + 
  geom_line(stat="bin", binwidth=3)  

HELPrct %>% 
  ggplot(aes(x=age)) + 
  geom_area(stat="bin", binwidth=3)  

HELPrct %>% 
  ggplot(aes(x=age)) + 
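  # after_stat(count) refers to the bin counts computed by the stat
  # (ggplot2 3.3.0 or later; older code wrote this as ..count..)
  geom_point(stat="bin", binwidth=3, aes(size=after_stat(count))) +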
  geom_line(stat="bin", binwidth=3) 

How much do they drink? (i1)

HELPrct %>% 
  ggplot(aes(x=i1)) + geom_histogram()

HELPrct %>% 
  ggplot(aes(x=i1)) + geom_area(stat="density")

Covariates: Adding in more variables

Using color and linetype:

HELPrct %>% 
  ggplot(aes(x=i1, color=substance, linetype=sex)) + 
  geom_line(stat="density")

Using color and facets

HELPrct %>% 
  ggplot(aes(x=i1, color=substance)) + 
  geom_line(stat="density") + facet_grid( . ~ sex )

Boxplots

Boxplots use stat_boxplot(), which computes a five-number summary (roughly the minimum, the three quartiles, and the maximum of the data) and uses it to define a “box” and “whiskers”. The quantitative variable must be y, and there must be an additional x variable.

HELPrct %>% 
  ggplot(aes(x=substance, y=age, color=sex)) + 
  geom_boxplot()
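
The summary behind a boxplot can be pulled out of the plot and compared with the quartiles computed directly. This is a sketch on a hypothetical one-group data frame `toy` with one extreme value, not on HELPrct.

```r
library(ggplot2)

# Hypothetical toy data: one group with a single extreme value
toy <- data.frame(g = "all", y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30))

p <- ggplot(toy, aes(x = g, y = y)) + geom_boxplot()

# The summary computed by the stat: whisker ends, box edges, and median
box <- layer_data(p)[, c("ymin", "lower", "middle", "upper", "ymax")]
box

# The box edges and the median match the usual quartiles
quartiles <- quantile(toy$y, c(0.25, 0.5, 0.75))
quartiles

# Points beyond the whiskers are stored separately as outliers
outliers <- layer_data(p)$outliers[[1]]
outliers
```

Note that the whiskers stop at the most extreme points that are not flagged as outliers, so the ends of the whiskers need not be the minimum and maximum of the data.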

Horizontal boxplots

Horizontal boxplots are obtained by flipping the coordinate system:

HELPrct %>% 
  ggplot(aes(x=substance, y=age, color=sex)) + 
  geom_boxplot() +
  coord_flip()

Give me some space

We’ve triggered a new feature: dodge (for dodging things left/right). We can control how far things are dodged by setting the dodge width manually.

HELPrct %>% 
  ggplot(aes(x=substance, y=age, color=sex)) + 
  geom_boxplot(position=position_dodge(width=1)) 

Issues with bigger data

require(NHANES)
dim(NHANES)
## [1] 10000    76
NHANES %>%  ggplot(aes(x=Height, y=Weight)) +
  geom_point() + facet_grid( Gender ~ PregnantNow )

Using alpha (opacity)

One way to deal with overplotting is to set the opacity low.

NHANES %>% 
  ggplot(aes(x=Height, y=Weight)) +
  geom_point(alpha=0.01) + facet_grid( Gender ~ PregnantNow )

geom_density2d

Alternatively (or simultaneously) we might prefer a different geom altogether.

NHANES %>% 
  ggplot(aes(x=Height, y=Weight)) +
  geom_density2d() + facet_grid( Gender ~ PregnantNow )

Multiple layers

ggplot( data=HELPrct, aes(x=sex, y=age)) +
  geom_boxplot(outlier.size=0) +
  geom_jitter(alpha=.6) +
  coord_flip()

Multiple layers

ggplot( data=HELPrct, aes(x=sex, y=age)) +
  geom_boxplot(outlier.size=0) +
  geom_point(alpha=.6, position=position_jitter(width=.1, height=0)) +
  coord_flip()
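
Setting height=0 in position_jitter() means the points are only nudged along the categorical axis, so the plotted values themselves stay exact. A small sketch of that behavior on a hypothetical data frame `toy`; the seed argument is added here only to make the jitter reproducible.

```r
library(ggplot2)

# Hypothetical toy data with tied values in two groups
toy <- data.frame(g = rep(c("a", "b"), each = 4),
                  y = c(1, 1, 2, 2, 3, 3, 4, 4))

p <- ggplot(toy, aes(x = g, y = y)) +
  geom_point(position = position_jitter(width = 0.1, height = 0, seed = 1))

jittered <- layer_data(p)

# x positions: near 1 (group a) and 2 (group b), moved by at most 0.1
as.numeric(jittered$x)

# y values: unchanged, because height = 0
jittered$y
```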

Things I haven’t mentioned (much)

require(ggthemes)
qplot( x=date, y=births, data=Births78) + theme_wsj()

Things I haven’t mentioned (much)

ggplot( data=HELPrct, aes(x=substance, y=age, color=sex)) +
  geom_violin(position=position_dodge()) +
  geom_point(aes(color=sex, fill=sex), position=position_jitterdodge()) 

Things I haven’t mentioned (much)

A little bit of everything

ggplot( data=HELPrct, aes(x=substance, y=age, color=sex)) +
  geom_boxplot(coef = 10, position=position_dodge(width=1)) +
  geom_point(aes(fill=sex), alpha=.5, 
             position=position_jitterdodge(dodge.width=1)) + 
  facet_wrap(~homeless)

Some short cuts

  1. qplot() provides “quick plots” for ggplot2
qplot(length, width, data=KidsFeet)

  2. mplot(dataframe) provides an interactive plotting tool for both ggplot2 and lattice.
mplot(HELPrct)

Want to learn more?

What’s around the corner?

ggvis

Dynamic documents