Jo Hardin (almost entirely taken from Randy Pruim: http://dtkaplan.github.io/CVC/Summer2015/Learn/ggplot2/ggplot2Intro.pdf)
September 10, 2015
What I will try to do
give a tour of ggplot2
explain how to think about plots the ggplot2 way
prepare/encourage you to learn more later
What I can’t do in one session
show every bell and whistle
make you an expert at using ggplot2
One of the best ways to get started with ggplot is to google what you want to do together with the word ggplot, then look through the images that come up. More often than not, the associated code is there. There are also ggplot galleries of images; one of them is here: https://plot.ly/ggplot2/
Look at the end of this presentation. More help options there.
What was the biggest concern over the average value axes?
Yau gives us nine visual cues, and Wickham translates them into a language using ggplot2. (The items below are from Modern Data Science, chapter 2. See the text on Sakai for more information.)
Visual Cues: the aspects of the figure where we should focus.
Position (numerical) where in relation to other things?
Length (numerical) how big (in one dimension)?
Angle (numerical) how wide? parallel to something else?
Direction (numerical) at what slope? In a time series, going up or down?
Shape (categorical) belonging to what group?
Area (numerical) how big (in two dimensions)? Beware of improper scaling!
Volume (numerical) how big (in three dimensions)? Beware of improper scaling!
Shade (either) to what extent? how severely?
Color (either) to what extent? how severely? Beware of red/green color blindness.
Coordinate System: rectangular, polar, geographic, etc.
Scale: numeric (linear? logarithmic?), categorical (ordered?), time
Context: in comparison to what (think back to ideas from Tufte)
require(mosaic)
require(lubridate) # package for working with dates
data(Births78) # restore fresh version of Births78
head(Births78, 3)
## date births dayofyear
## 1 1978-01-01 7701 1
## 2 1978-01-02 7527 2
## 3 1978-01-03 8825 3
geom: the geometric “shape” used to display data (glyph)
aesthetic: an attribute controlling how geom is displayed with respect to variables
scale: adjust information in the aesthetic to map onto the plot
guide: helps user convert visual data back into raw data (legends, axes)
stat: a transformation applied to data before geom gets it
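A minimal sketch of where each of these five terms appears in a single plot. The data frame `toy` and its columns are made up for illustration and are not part of Births78.

```r
library(ggplot2)

# Hypothetical toy data, invented only for this sketch
toy <- data.frame(
  day   = 1:6,
  count = c(5, 9, 4, 12, 7, 10),
  group = rep(c("a", "b"), times = 3)
)

p <- ggplot(toy, aes(x = day, y = count, colour = group)) +    # aesthetics: variables -> visual attributes
  geom_point(stat = "identity") +                              # geom: points; stat: identity (the default for points)
  scale_colour_manual(values = c(a = "navy", b = "orange")) +  # scale: which colour each group receives
  guides(colour = guide_legend(title = "Group"))               # guide: legend for reading colours back

print(p)

# layer_data() shows what the geom actually receives after the scale is applied
unique(layer_data(p)$colour)
```

layer_data() is handy throughout: it returns the data frame a layer draws from, after stats and scales have done their work.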
Two Questions:
What do we want R to do? (What is the goal?)
What does R need to know?
Goal: scatterplot = a plot with points
What does R need to know?
data source: Births78
aesthetics:
date -> x
births -> y
Goal: scatterplot = a plot with points
ggplot() + geom_point()
What does R need to know?
data source: data = Births78
aesthetics: aes(x = date, y = births)
ggplot(data=Births78, aes(x=date, y=births)) +
geom_point()
ggplot() +
geom_point(data=Births78, aes(x=date, y=births))
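Both forms above should draw the same picture. A small check of that claim, assuming a made-up data frame `toy` in place of Births78 so the sketch runs without the mosaic package (the values are invented):

```r
library(ggplot2)

# Hypothetical stand-in for Births78; values invented for illustration
toy <- data.frame(
  date   = as.Date("1978-01-01") + 0:4,
  births = c(10, 12, 9, 15, 11)
)

# data and aes given once in ggplot(): inherited by every layer
p_global <- ggplot(data = toy, aes(x = date, y = births)) +
  geom_point()

# data and aes given in the geom: used by that layer only
p_local <- ggplot() +
  geom_point(data = toy, aes(x = date, y = births))

# Compare the positions each layer ends up plotting
same_xy <- isTRUE(all.equal(
  layer_data(p_global)[, c("x", "y")],
  layer_data(p_local)[, c("x", "y")]
))
same_xy
```

The difference only matters once there is more than one layer: defaults in ggplot() are shared, while arguments in a geom stay local to it.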
Coordinate System? Scale?
What has changed?
The wday() function in the lubridate package computes the day of the week from a date.
Births78 <-
Births78 %>%
mutate(wday = wday(date, label=TRUE))
ggplot(data=Births78) +
geom_point(aes(x=date, y=births, color=wday))
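A quick look at what wday() returns, using the first date in the data. The numeric result assumes lubridate's default week start (Sunday = 1); the name `d` is only for illustration.

```r
library(lubridate)

d <- as.Date("1978-01-01")   # first date in Births78

wday(d)                      # day of the week as a number; Sunday is 1 by default
wday(d, label = TRUE)        # ordered factor of abbreviated day names (text depends on locale)
```

Because label=TRUE gives a factor, mapping it to color produces one discrete color per day rather than a continuous gradient.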
This time we use lines instead of dots
ggplot(data=Births78) +
geom_line(aes(x=date, y=births, color=wday))
This time we have two layers, one with points and one with lines
ggplot(data=Births78,
aes(x=date, y=births, color=wday)) +
geom_point() + geom_line()
The layers are placed one on top of the other: the points are below and the lines are above.
data and aes specified in ggplot() affect all geoms
Births78 %>%
ggplot(aes(x=date, y=births, color="navy")) +
geom_point()
This is mapping the color aesthetic to a new variable with only one value (“navy”).
So all the dots get set to the same color, but it’s not navy.
If we want to set the color to be navy for all of the dots, we do it this way:
Births78 %>%
ggplot(aes(x=date, y=births)) + # map these
geom_point(color = "navy") # set this
color = "navy" is now outside of the aesthetics list. That’s how ggplot2 distinguishes between mapping and setting.
Births78 %>%
ggplot(aes(x=date, y=births)) +
geom_line(aes(color=wday)) + # map color here
geom_point(color="navy") # set color here
ggplot() establishes the default data and aesthetics for the geoms, but each geom may change these defaults.
good practice: put into ggplot() the things that affect all (or most) of the layers; rest in geom_blah()
Information is passed to the plot in one of two ways:
map variable information inside the aes (aesthetic) command
set constant (non-variable) information outside the aes (aesthetic) command
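The difference between mapping and setting can be inspected directly. A sketch using a made-up three-row data frame `toy` (hypothetical) and layer_data() to see the color each point really gets:

```r
library(ggplot2)

toy <- data.frame(x = 1:3, y = c(2, 5, 3))   # hypothetical data

# Mapping: "navy" is treated as a one-level variable and sent through the colour scale
p_map <- ggplot(toy, aes(x = x, y = y, colour = "navy")) +
  geom_point()

# Setting: "navy" is used as the literal colour for every point
p_set <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(colour = "navy")

unique(layer_data(p_map)$colour)   # a colour chosen by the default scale, not "navy"
unique(layer_data(p_set)$colour)   # "navy"
```

The mapped version also grows a legend with a single entry labeled navy, which is usually the first hint that something was mapped by accident.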
apropos("^geom_")
[1] "geom_abline" "geom_area" "geom_bar"
[4] "geom_bin2d" "geom_blank" "geom_boxplot"
[7] "geom_contour" "geom_crossbar" "geom_density"
[10] "geom_density2d" "geom_dotplot" "geom_errorbar"
[13] "geom_errorbarh" "geom_freqpoly" "geom_hex"
[16] "geom_histogram" "geom_hline" "geom_jitter"
[19] "geom_line" "geom_linerange" "geom_map"
[22] "geom_path" "geom_point" "geom_pointrange"
[25] "geom_polygon" "geom_quantile" "geom_rangeframe"
[28] "geom_raster" "geom_rect" "geom_ribbon"
[31] "geom_rug" "geom_segment" "geom_smooth"
[34] "geom_step" "geom_text" "geom_tile"
[37] "geom_tufteboxplot" "geom_violin" "geom_vline"
help pages will tell you their aesthetics, default stats, etc.
?geom_area # for example
Births78 %>%
ggplot(aes(x=date, y=births, fill=wday)) +
geom_area()
This is not a good plot
Most (all?) graphics are intended to help us make comparisons
Key plot metric: Does my plot make the comparisons I am interested in?
HELPrct: Health Evaluation and Linkage to Primary care randomized clinical trial
?HELPrct
Subjects admitted for treatment for addiction to one of three substances.
HELPrct %>%
ggplot(aes(x=substance)) +
geom_bar()
Hmm. What’s up with y?
stat_count() (stat_bin() in ggplot2 releases before 2.0.0) is applied to the data before geom_bar() gets to do its thing. Counting the cases in each category creates the y values.
HELPrct %>%
ggplot(aes(x=age)) + geom_histogram()
Notice the messages
stat_bin: Histograms are not mapping the raw data but binned data. stat_bin() performs the data transformation.
binwidth: a default binwidth has been selected, but we should really choose our own.
HELPrct %>%
ggplot(aes(x=age)) +
geom_histogram(binwidth=2)
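To see the binned data that the histogram is really drawing, one option is layer_data(). A sketch with a small made-up vector of ages (hypothetical, not HELPrct) so it runs with ggplot2 alone:

```r
library(ggplot2)

# Hypothetical ages, invented for illustration
ages <- data.frame(age = c(21, 22, 22, 25, 27, 28, 28, 29, 33, 34, 35, 41))

p <- ggplot(ages, aes(x = age)) +
  geom_histogram(binwidth = 2)

# stat_bin() has replaced the raw ages with one row per bin
bins <- layer_data(p)[, c("xmin", "xmax", "count")]
bins

# Every observation lands in exactly one bin
sum(bins$count) == nrow(ages)
```

Changing binwidth changes the rows of this table, which is why the choice should be made deliberately rather than left to the default.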
HELPrct %>%
ggplot(aes(x=age)) +
geom_freqpoly(binwidth=2)
HELPrct %>%
ggplot(aes(x=age)) +
geom_density()
Every geom comes with a default stat. For simple geoms the default is stat_identity(), which does nothing.
HELPrct %>%
ggplot(aes(x=age)) +
geom_line(stat="density")
Every stat comes with a default geom, every geom with a default stat
HELPrct %>%
ggplot(aes(x=age)) +
stat_density( geom="line")
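The two spellings above, geom first or stat first, should lead to the same curve. A sketch that compares the computed densities, assuming a made-up `toy` data frame (hypothetical values) in place of HELPrct:

```r
library(ggplot2)

# Hypothetical ages, invented for illustration
toy <- data.frame(age = c(21, 22, 22, 25, 27, 28, 28, 29, 33, 34, 35, 41))

p_geom <- ggplot(toy, aes(x = age)) + geom_line(stat = "density")   # pick the geom, override its stat
p_stat <- ggplot(toy, aes(x = age)) + stat_density(geom = "line")   # pick the stat, override its geom

d_geom <- layer_data(p_geom)
d_stat <- layer_data(p_stat)

# Same grid of x values and the same estimated density heights
same_curve <- isTRUE(all.equal(d_geom$x, d_stat$x)) &&
  isTRUE(all.equal(d_geom$y, d_stat$y))
same_curve
```

Which form to write is mostly a matter of emphasis: lead with the geom when the shape matters most, with the stat when the transformation does.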
HELPrct %>%
ggplot(aes(x=age)) +
geom_point(stat="bin", binwidth=3) +
geom_line(stat="bin", binwidth=3)
HELPrct %>%
ggplot(aes(x=age)) +
geom_area(stat="bin", binwidth=3)
HELPrct %>%
ggplot(aes(x=age)) +
geom_point(stat="bin", binwidth=3, aes(size=after_stat(count))) + # ggplot2 >= 3.3.0; older code wrote ..count..
geom_line(stat="bin", binwidth=3)
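Variables computed by a stat, such as count, can be mapped to any aesthetic. A sketch assuming ggplot2 3.3.0 or later, where after_stat() is the way to refer to computed variables (`..count..` is the older spelling); the ages are made up for illustration:

```r
library(ggplot2)

# Hypothetical ages, invented for illustration
ages <- data.frame(age = c(21, 22, 22, 25, 27, 28, 28, 29, 33, 34, 35, 41))

# Map the computed variable `density` (instead of the default `count`) to y
p <- ggplot(ages, aes(x = age)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 3)

d <- layer_data(p)

# On the density scale the bar areas add up to 1
total_area <- sum(d$density * (d$xmax - d$xmin))
total_area
```

The help page for each stat lists its computed variables; for stat_bin() these include count and density.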
HELPrct %>%
ggplot(aes(x=i1)) + geom_histogram()
HELPrct %>%
ggplot(aes(x=i1)) + geom_area(stat="density")
Using color and linetype:
HELPrct %>%
ggplot(aes(x=i1, color=substance, linetype=sex)) +
geom_line(stat="density")
Using color and facets
HELPrct %>%
ggplot(aes(x=i1, color=substance)) +
geom_line(stat="density") + facet_grid( . ~ sex )
Boxplots use stat_boxplot(), which computes a five-number summary (roughly the minimum, the three quartiles, and the maximum of the data) and uses them to define a “box” and “whiskers”. In the usage shown here, the quantitative variable is y, and an additional x variable defines the groups.
HELPrct %>%
ggplot(aes(x=substance, y=age, color=sex)) +
geom_boxplot()
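The summary behind a boxplot can be pulled out with layer_data(). A sketch with a made-up data frame `toy` (hypothetical) that has no outliers, so the whisker ends coincide with the minimum and maximum:

```r
library(ggplot2)

# Hypothetical data: the integers 1 to 9 in a single group
toy <- data.frame(grp = "all", value = 1:9)

p <- ggplot(toy, aes(x = grp, y = value)) +
  geom_boxplot()

# Whisker ends (ymin, ymax), hinges (lower, upper) and median (middle)
box <- layer_data(p)[, c("ymin", "lower", "middle", "upper", "ymax")]
box

# Base R's five-number summary for comparison
fivenum(toy$value)
```

With outliers present the whiskers stop at the most extreme point within coef (1.5 by default) times the IQR of the box, so ymin and ymax need not equal the minimum and maximum.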
Horizontal boxplots are obtained by flipping the coordinate system:
HELPrct %>%
ggplot(aes(x=substance, y=age, color=sex)) +
geom_boxplot() +
coord_flip()
coord_flip() may be used with other plots as well to reverse the roles of x and y on the plot.
We’ve triggered a new feature: dodge (for dodging things left/right). We can control how much if we set the dodge manually.
HELPrct %>%
ggplot(aes(x=substance, y=age, color=sex)) +
geom_boxplot(position=position_dodge(width=1))
require(NHANES)
dim(NHANES)
## [1] 10000 76
NHANES %>% ggplot(aes(x=Height, y=Weight)) +
geom_point() + facet_grid( Gender ~ PregnantNow )
One way to deal with overplotting is to set the opacity low.
NHANES %>%
ggplot(aes(x=Height, y=Weight)) +
geom_point(alpha=0.01) + facet_grid( Gender ~ PregnantNow )
Alternatively (or simultaneously) we might prefer a different geom altogether.
NHANES %>%
ggplot(aes(x=Height, y=Weight)) +
geom_density2d() + facet_grid( Gender ~ PregnantNow )
ggplot( data=HELPrct, aes(x=sex, y=age)) +
geom_boxplot(outlier.size=0) +
geom_jitter(alpha=.6) +
coord_flip()
ggplot( data=HELPrct, aes(x=sex, y=age)) +
geom_boxplot(outlier.size=0) +
geom_point(alpha=.6, position=position_jitter(width=.1, height=0)) +
coord_flip()
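The second version above jitters only along the categorical axis, so the ages themselves are not disturbed. A sketch that checks this, assuming a made-up data frame `toy` (hypothetical) and a fixed seed so the jitter is reproducible:

```r
library(ggplot2)

# Hypothetical data, invented for illustration
toy <- data.frame(
  sex = rep(c("female", "male"), each = 4),
  age = c(23, 31, 35, 42, 27, 29, 38, 51)
)

p <- ggplot(toy, aes(x = sex, y = age)) +
  geom_point(alpha = 0.6,
             position = position_jitter(width = 0.1, height = 0, seed = 1))

d <- layer_data(p)

# Horizontal displacement from the category position (1 or 2): at most 0.1
max_shift <- max(abs(as.numeric(d$x) - round(as.numeric(d$x))))
max_shift

# Vertical values are the original ages
ages_kept <- isTRUE(all.equal(sort(as.numeric(d$y)), sort(toy$age)))
ages_kept
```

geom_jitter() with its defaults jitters in both directions, which slightly changes the plotted ages; setting height=0 avoids that.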
scales (fine tuning mapping from data to plot)
guides (so reader can map from plot to data)
coords (coord_flip() is good to know about)
themes (for customizing appearance)
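A brief sketch of how a scale, guide text, and a theme are added to a plot as extra layers of the specification. The data frame `toy` is made up for illustration, and theme_minimal() is just one of the themes shipped with ggplot2:

```r
library(ggplot2)

# Hypothetical data spanning several orders of magnitude
toy <- data.frame(dose = c(1, 2, 3, 4), response = c(10, 100, 1000, 10000))

p <- ggplot(toy, aes(x = dose, y = response)) +
  geom_point() +
  scale_y_log10() +                                # scale: map response to position on a log10 scale
  labs(x = "Dose", y = "Response (log scale)") +   # guide text: axis titles the reader sees
  theme_minimal()                                  # theme: non-data appearance

# The scale transforms the data before the geom sees it
layer_data(p)$y
```

None of these additions touch the data frame itself; they only change how values are translated to and from the page.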
require(ggthemes)
qplot( x=date, y=births, data=Births78) + theme_wsj()
position (position_dodge() can be used for side by side bars)
ggplot( data=HELPrct, aes(x=substance, y=age, color=sex)) +
geom_violin(position=position_dodge()) + # coef is a boxplot argument; geom_violin() does not use it
geom_point(aes(color=sex, fill=sex), position=position_jitterdodge())
position (position_dodge(), position_jitterdodge(), position_stack(), etc.)
ggplot( data=HELPrct, aes(x=substance, y=age, color=sex)) +
geom_boxplot(coef = 10, position=position_dodge(width=1)) +
geom_point(aes(fill=sex), alpha=.5,
position=position_jitterdodge(dodge.width=1)) +
facet_wrap(~homeless)
qplot() provides “quick plots” for ggplot2
qplot(length, width, data=KidsFeet)
mplot(dataframe) provides an interactive plotting tool for both ggplot2 and lattice.
mplot(HELPrct)
Winston Chang’s: R Graphics Cookbook
ggvis
dynamic graphics (brushing, sliders, tooltips, etc.)
uses Vega (D3) to animate plots in a browser
similar structure to ggplot2 but different syntax and names
Dynamic documents
RMarkdown, ggvis, and shiny