August 2022, UC Berkeley
Florica Constantine
Base graphics, ggplot2, and lattice
And here’s some motivation - we can produce a plot like this with a few lines of code.
(Compare to the famous gapminder plot.)
The general call for base plot looks something like this:
Additional parameters can be passed in to customize the plot:
More layers can be added to the plot with additional calls to lines, points, text, etc.
library(dplyr)      # for %>% and filter()
library(gapminder)  # for the gapminder dataset
gapChina <- gapminder %>% filter(country == "China")
plot(gapChina$year, gapChina$gdpPercap)
plot(gapChina$year, gapChina$gdpPercap, type = "l",
     main = "China GDP over time",
     xlab = "Year", ylab = "GDP per capita") # with updated parameters
points(gapChina$year, gapChina$gdpPercap, pch = 16)
points(x = 1977, y = gapChina$gdpPercap[gapChina$year == 1977],
       col = "red", pch = 16)
There are a variety of other types of plots you can make in base graphics.
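As a rough sketch of a few of these other plot types (the specific choices below are illustrative assumptions, not an exhaustive list), note how some functions take a formula while others take vectors:

```r
library(gapminder)  # provides the gapminder data frame

# Subset to a single year with base indexing (no dplyr needed here)
gap2007 <- gapminder[gapminder$year == 2007, ]

# Boxplot via the formula interface: life expectancy by year
boxplot(lifeExp ~ year, data = gapminder,
        xlab = "Year", ylab = "Life expectancy")

# Histogram of a single numeric vector
hist(gap2007$lifeExp, main = "Life expectancy, 2007",
     xlab = "Life expectancy")

# Kernel density estimate, drawn with the generic plot() function
plot(density(gap2007$lifeExp), main = "Density of life expectancy, 2007")

# Barplot of counts: number of countries per continent in 2007
continent_counts <- table(gap2007$continent)
barplot(continent_counts, ylab = "Number of countries")
```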
lattice and ggplot2 generally don’t exhibit this sort of behavior. Here are two examples:
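The sketch below is a stand-in rather than the original pair of examples. It assumes that "this sort of behavior" refers to base `plot()` choosing a plotting method from the class of the object it is given; the model and the density object are illustrative choices.

```r
library(gapminder)

# A fitted linear model has class "lm", so plot() dispatches to plot.lm()
gap_lm <- lm(lifeExp ~ log(gdpPercap) + year, data = gapminder)
class(gap_lm)
par(mfrow = c(2, 2))  # plot.lm() draws four diagnostic plots
plot(gap_lm)

# A kernel density estimate has class "density", so plot() uses plot.density()
gap_dens <- density(gapminder$lifeExp)
class(gap_dens)
par(mfrow = c(1, 1))
plot(gap_dens)
```

In both cases the call is just `plot(object)`; the object's class determines what gets drawn.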
Base graphics, ggplot2, and lattice
Base graphics is
good for exploratory data analysis and sanity checks
inconsistent in syntax across functions: some take x,y while others take formulas
default plotting parameters are ugly, and it can be difficult to customize
that said, one can do essentially anything in base graphics with some work
ggplot2 is
generally more elegant
more syntactically logical (and therefore simpler, once you learn it)
better at grouping
able to interface with maps
lattice is
faster than ggplot2 (though only noticeable over many and large plots)
simpler than ggplot2 (at first)
perhaps better at trellis plots than ggplot2 (some think this)
able to do 3d plots (but be cautious about using 3d plots)
We’ll focus on ggplot2 as it is very powerful, very widely used, and allows one to produce very nice-looking graphics without a lot of coding.
ggplot2
The general call for ggplot2 graphics looks something like this:
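A minimal sketch of that pattern, assuming the gapminder data and an arbitrary choice of variables and layers (the pieces are a data frame, aesthetic mappings, one or more geoms, then further options):

```r
library(ggplot2)
library(gapminder)

p_general <- ggplot(data = gapminder,                   # 1. the data frame
                    aes(x = gdpPercap, y = lifeExp)) +  # 2. aesthetic mappings
  geom_point() +                                        # 3. a geometry layer
  scale_x_log10() +                                     # 4. further options...
  labs(x = "GDP per capita", y = "Life expectancy")     #    ...added with +
p_general  # printing the object draws the plot
```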
Note that ggplot2 graphs in layers in a continuing call (hence the endless +…+…+…), which makes additional layers in the plot.
You can see the layering effect by comparing the same graph with different colors for each layer:
p <- ggplot(data = gapChina, aes(x = year, y = lifeExp)) +
geom_point(color = "red")
p
p + geom_point(aes(x = year, y = lifeExp), color = "gray") + ylab("life expectancy") +
theme_minimal()
And, if you’re desperate for the quick and dirty functionality of base plot, or just like the more familiar syntax at first, ggplot2 offers the qplot() function (deprecated as of ggplot2 3.4.0) as a wrapper for most basic plots:
ggplot2 syntax is very different from base graphics and lattice. It’s built on the grammar of graphics. The basic idea is that the visualization of all data requires four items:
One or more statistics conveying information about the data (identities, means, medians, etc.)
A coordinate system that characterizes the intersections of statistics (at most two for ggplot, three for lattice)
Geometries that differentiate between off-coordinate variation in kind
Scales that differentiate between off-coordinate variation in degree
ggplot2 allows the user to manipulate all four of these items through the stat_*, coord_*, geom_*, and scale_* functions.
All of these are important to truly becoming a ggplot2 master, but today we are going to focus on the most important to basic users and their data layers: ggplot2’s geometries.
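To make the four items concrete, here is a small sketch that touches each family of functions once. The particular layers and the y-axis limits are illustrative assumptions.

```r
library(ggplot2)
library(gapminder)

gap2007 <- subset(gapminder, year == 2007)

p_four <- ggplot(gap2007, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent)) +                       # geometry
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # statistic (fitted line)
  scale_x_log10() +                                          # scale
  coord_cartesian(ylim = c(35, 90))                          # coordinate system
p_four
```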
## Scatterplot
ggplot(gapChina, aes(x = year, y = lifeExp)) + geom_point() +
ggtitle("China's life expectancy")
## Line (time series) plot
ggplot(gapChina, aes(x = year, y = lifeExp)) + geom_line() +
ggtitle("China's life expectancy")
## Boxplot
ggplot(gapminder, aes(x = factor(year), y = lifeExp)) + geom_boxplot() +
ggtitle("World's life expectancy")
## Histogram
gapminder2007 <- gapminder %>% filter(year == 2007)
ggplot(gapminder2007, aes(x = lifeExp)) + geom_histogram(binwidth = 5) +
ggtitle("World's life expectancy")
ggplot2 and tidy data
ggplot2 plays nicely with dplyr and pipes. If you want to manipulate your data specifically for one plot but not save the new dataset, you can call your dplyr chain and pipe it directly into a ggplot call.
# This combines the subsetting and plotting into one step
gapminder %>% filter(year == 2007) %>%
ggplot(aes(x = lifeExp)) + geom_histogram(binwidth = 5) +
ggtitle("World's life expectancy")
Base graphics and ggplot2 have one big difference: ggplot2 requires your data to be in tidy format. For base graphics, it can actually be helpful not to have your data in tidy format.
For example, here ggplot treats country as an aesthetic parameter that differentiates groups of values, whereas base graphics treats each (year, country) pair of columns as a set of inputs to the plot.
Here’s ggplot with the data in a tidy format.
head(gapminder)
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
ggplot(data = gapminder, aes(x = year, y = lifeExp)) +
geom_line(aes(color = country), show.legend = FALSE)
Is that a useful plot?
And here’s use of base graphics, taking advantage of non-tidy, wide-formatted data.
# Base graphics call
library(tidyr) # for spread(); pivot_wider() is the newer equivalent
gapminder_wide <- gapminder %>% select(country, year, lifeExp) %>% spread(country, lifeExp)
gapminder_wide[1:5, 1:5]
## # A tibble: 5 × 5
## year Afghanistan Albania Algeria Angola
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1952 28.8 55.2 43.1 30.0
## 2 1957 30.3 59.3 45.7 32.0
## 3 1962 32.0 64.8 48.3 34
## 4 1967 34.0 66.2 51.4 36.0
## 5 1972 36.1 67.7 54.5 37.9
plot(gapminder_wide$year, gapminder_wide$China, col = 'red', type = 'l', ylim = c(40, 85))
lines(gapminder_wide$year, gapminder_wide$Turkey, col = 'green')
lines(gapminder_wide$year, gapminder_wide$Italy, col = 'blue')
# Legend colors must be listed in the same order as the labels
legend("right", legend = c("China", "Turkey", "Italy"),
       fill = c("red", "green", "blue"))
Of course, as mentioned above, you can always filter your tidy data to replicate this plot with ggplot2…
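One way this replication might look, as a sketch: the object name `three` is hypothetical, and the manual color scale is only there to match the colors of the base graphics version.

```r
library(ggplot2)
library(gapminder)

# Keep the tidy (long) format and simply filter to the three countries
three <- subset(gapminder, country %in% c("China", "Turkey", "Italy"))
three$country <- droplevels(three$country)  # drop the unused country levels

p_three <- ggplot(three, aes(x = year, y = lifeExp)) +
  geom_line(aes(color = country)) +
  scale_color_manual(values = c(China = "red", Turkey = "green", Italy = "blue")) +
  ylim(40, 85)
p_three
```

Here the legend is generated from the color mapping, so the labels and colors cannot get out of sync.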
ggplot2
ggplot2 geoms
We’ve already seen these initial ones.
X-Y scatter plots: geom_point()
X-Y line plots: geom_line() or geom_path()
Histograms: geom_histogram(), geom_col(), or geom_bar()
gapminder2007 <- gapminder %>% filter(year == 2007)
ggplot(gapminder2007, aes(x = lifeExp)) + geom_histogram(binwidth = 5) +
ggtitle("World's life expectancy")
Densities: geom_density(), geom_density2d()
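A short sketch of both density geoms, assuming the 2007 subset of gapminder (rebuilt here with base `subset()` so the block stands alone):

```r
library(ggplot2)
library(gapminder)

gap2007 <- subset(gapminder, year == 2007)

# One-dimensional kernel density estimate of life expectancy
p_dens <- ggplot(gap2007, aes(x = lifeExp)) +
  geom_density() +
  ggtitle("World's life expectancy")
p_dens

# Two-dimensional density contours of life expectancy against GDP per capita
p_dens2d <- ggplot(gap2007, aes(x = gdpPercap, y = lifeExp)) +
  geom_density2d() +
  scale_x_log10()
p_dens2d
```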
Boxplots: geom_boxplot()
# Notice that here, you must explicitly convert numeric years to factors
ggplot(data = gapminder, aes(x = factor(year), y = lifeExp)) +
geom_boxplot()
Contour plots: geom_contour()
data(volcano) # Load volcano contour data
volcano[1:10, 1:10] # Examine volcano dataset (first 10 rows and columns)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 100 100 101 101 101 101 101 100 100 100
## [2,] 101 101 102 102 102 102 102 101 101 101
## [3,] 102 102 103 103 103 103 103 102 102 102
## [4,] 103 103 104 104 104 104 104 103 103 103
## [5,] 104 104 105 105 105 105 105 104 104 103
## [6,] 105 105 105 106 106 106 106 105 105 104
## [7,] 105 106 106 107 107 107 107 106 106 105
## [8,] 106 107 107 108 108 108 108 107 107 106
## [9,] 107 108 108 109 109 109 109 108 108 107
## [10,] 108 109 109 110 110 110 110 109 109 108
library(reshape2) # for melt()
volcano3d <- melt(volcano) # Use reshape2 package to melt the data into tidy form
head(volcano3d) # Examine volcano3d dataset (head)
## Var1 Var2 value
## 1 1 1 100
## 2 2 1 101
## 3 3 1 102
## 4 4 1 103
## 5 5 1 104
## 6 6 1 105
names(volcano3d) <- c("xvar", "yvar", "zvar") # Rename volcano3d columns
ggplot(data = volcano3d, aes(x = xvar, y = yvar, z = zvar)) +
geom_contour()
Tile/image/level plots, heatmaps: geom_tile(), geom_rect(), geom_raster()
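As a sketch of a heatmap, the volcano matrix can be put into long form with base R only (the name `volcano_df` is hypothetical; the column names follow the contour example) and drawn with `geom_raster()` or `geom_tile()`:

```r
library(ggplot2)

# Long-format version of the volcano matrix: one row per grid cell
volcano_df <- data.frame(
  xvar = as.vector(row(volcano)),  # row index of each cell
  yvar = as.vector(col(volcano)),  # column index of each cell
  zvar = as.vector(volcano)        # elevation value
)

# geom_raster() is an efficient choice when all cells have the same size
p_heat <- ggplot(volcano_df, aes(x = xvar, y = yvar, fill = zvar)) +
  geom_raster()
p_heat

# geom_tile() draws the same picture and also allows unequal tile sizes
p_tile <- ggplot(volcano_df, aes(x = xvar, y = yvar, fill = zvar)) +
  geom_tile()
```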
Trellis plots allow you to stratify by a variable, with one panel per categorical value. One uses either facet_grid() or facet_wrap():
This can be quite powerful. It gives you the ability to take account of an additional variable.
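A sketch of the two faceting functions, with the stratifying variables chosen here purely for illustration:

```r
library(ggplot2)
library(gapminder)

# facet_wrap(): one panel per year, wrapped into a grid automatically
p_wrap <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ year)
p_wrap

# facet_grid(): panels laid out in rows (left of ~) and columns (right of ~);
# here one row per continent and a single column
gap2007 <- subset(gapminder, year == 2007)
p_grid <- ggplot(gap2007, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10() +
  facet_grid(continent ~ .)
p_grid
```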
ggplot2 (optional)
ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10()
# Add linear model (lm) smoother
ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10() +
geom_smooth(method = "lm")
# Add local linear model (loess) smoother, span of 0.75 (more smoothed)
ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10() +
geom_smooth(method = "loess", span = .75)
# Add local linear model (loess) smoother, span of 0.25 (less smoothed)
ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10() +
geom_smooth(method = "loess", span = .25)
# Add linear model (lm) smoother, no standard error shading
ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10() +
geom_smooth(method = "lm", se = FALSE)
# Add local linear model (loess) smoother, no standard error shading
ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10() +
geom_smooth(method = "loess", se = FALSE)
aes()
These four aesthetic parameters (color, linetype, shape, size) can be used to show variation in kind (categories) and variation in degree (numeric).
Parameters passed into aes should be variables in your dataset.
Parameters passed to geom_xxx outside of aes should not be related to your dataset – they apply to the whole figure.
ggplot(data = gapminder, aes(x = year, y = lifeExp)) +
geom_line(aes(color = country), show.legend = FALSE)
Note what happens when we specify the color parameter outside of the aesthetic operator. ggplot2 views these specifications as invalid graphical parameters.
## Error in layer(data = data, mapping = mapping, stat = stat, geom = GeomLine, : object 'country' not found
## Error: Unknown colour name: country
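The calls that produced these two errors are not shown; the sketch below reconstructs what they presumably looked like (an assumption), with `tryCatch()` added so the script keeps running and the messages can be inspected.

```r
library(ggplot2)
library(gapminder)

# (1) Unquoted: outside aes(), R looks for an object named `country` in the
# calling environment rather than in the data frame, and fails immediately.
err_unquoted <- tryCatch(
  ggplot(data = gapminder, aes(x = year, y = lifeExp)) +
    geom_line(color = country),
  error = function(e) conditionMessage(e)
)
err_unquoted

# (2) Quoted: the plot object is created, but "country" is treated as a color
# name, so the failure only appears when the plot is drawn.
p_bad <- ggplot(data = gapminder, aes(x = year, y = lifeExp)) +
  geom_line(color = "country")
err_quoted <- tryCatch(
  { print(p_bad); "no error" },
  error = function(e) conditionMessage(e)
)
err_quoted
```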
## this works but only makes sense if we restrict to one country
ggplot(data = gapChina, aes(x = year, y = lifeExp)) +
geom_line(color = "red")
Note: Aesthetics automatically show up in your legend. Parameters (those not mapped to a variable in your data frame) do not!
Differences in kind
## color as the aesthetic to differentiate by continent
ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = continent)) + scale_x_log10()
## point shape as the aesthetic to differentiate by continent
ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(shape = continent)) + scale_x_log10()
## line type as the aesthetic to differentiate by country
gapOceania <- gapminder %>% filter(continent %in% 'Oceania')
ggplot(data = gapOceania, aes(x = year, y = lifeExp)) +
  geom_line(aes(linetype = country))
Differences in degree
## point size as the aesthetic to differentiate by population
ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(size = pop)) + scale_x_log10()
## color as the aesthetic to differentiate by population
ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = pop)) + scale_x_log10() +
scale_color_gradient(low = 'lightgray', high = 'black')
Multiple non-coordinate aesthetics (differences in kind using color, degree using point size)
ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(size = pop, color = continent)) + scale_x_log10()
How many variables have we represented? If we used a trellis plot we could represent yet another variable!
POLL 7A: Which of these ggplot2 calls will work (in the sense of not giving an error, not in the sense of being a useful plot)?
(respond at https://pollev.com/chrispaciorek428)
Aesthetics are handled by their very own scale functions, which allow you to set the limits, breaks, transformations, and any palettes that might determine how you want your data plotted.
ggplot2 includes a number of helpful default scale functions. For example:
scale_x_log10 can transform your data on the fly
scale_color_viridis uses palettes from the viridis package specifically designed to “make plots that are pretty, better represent your data, easier to read by those with colorblindness, and print well in grey scale.”
For example, our data might be better represented using a log10 transformation of per capita GDP:
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = continent)) +
scale_x_log10()
And perhaps we want colors that are a little different:
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = continent)) +
scale_x_log10() +
scale_color_viridis_d()
Or perhaps we want to set our palettes and breaks or labels manually:
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(color = continent)) +
scale_x_log10(labels = scales::dollar) +
scale_color_manual("The continents",
values = c("red", "blue", "green", "yellow", "#800080")) # hex codes work!
For more info about setting scales in ggplot2 and for more helper functions, consider diving into the scales package, which is the backend to much of the scales functionality in ggplot2.
ggplot handles many plot options as additional layers.
ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() +
xlab(label = "GDP per capita") +
ylab(label = "Life expectancy") +
ggtitle(label = "Gapminder")
Or even more simply, use the labs() function:
ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() +
labs(x = "GDP per capita", y = "Life expectancy", title = "Gapminder")
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point()
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(size=3)
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(size=1)
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(color = colors()[11])
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(color = "red")
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(shape = 3)
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(shape = "w")
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(shape = "$", size=5)
ggplot2 (optional)
Elements of the plot not associated with geometries can be adjusted using ggplot themes.
There are some “complete” themes already included with the package:
theme_gray() (the default)
theme_minimal()
theme_bw()
theme_light()
theme_dark()
theme_classic()
But in addition to these, you can tweak just about any element of your plot’s appearance using the theme() function.
For instance, perhaps you want to move the legend from the right to the bottom of your plot; this would be part of the plot theme. Note how you can add options to a complete theme already in the plot:
gapminder %>%
filter(country %in% c("China", "Turkey", "Italy")) %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line(aes(color = country)) +
theme_minimal() +
theme(legend.position = "bottom")
ggplot2 graphs can be combined using the grid.arrange() function in the gridExtra package.
# Initialize gridExtra library
library(gridExtra)
# Create 3 plots to combine in a table
plot1 <- ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) +
geom_point() + scale_x_log10() + annotate('text', 150, 80, label = '(a)')
plot2 <- ggplot(data = gapminder2007, aes(x = pop, y = lifeExp)) +
geom_point() + scale_x_log10() + annotate('text', 1.8e5, 80, label = '(b)')
plot3 <- ggplot(data = gapminder, aes(x = year, y = lifeExp)) +
geom_line(aes(color = country), show.legend = FALSE) +
annotate('text', 1951, 80, label = '(c)')
# Call grid.arrange
grid.arrange(plot1, plot2, plot3, nrow=3, ncol = 1)
patchwork: Combining Multiple ggplot2 plots (optional)
The patchwork package may be used to combine multiple ggplot2 plots using a small set of operators similar to the pipe. It is an alternative to gridExtra and allows complex arrangements to be built nearly effortlessly.
library(patchwork)
# side-by-side plots with a space in between, and a third plot below
(plot1 | plot_spacer() | plot2) / plot3
Feel free to explore more at https://github.com/thomasp85/patchwork.
Note: patchwork is an example of a ggplot2 extension package, of which there are many! One of the benefits to learning and using ggplot2 is that there is a huge community of developers that build separate graphics packages that generally use the same syntax to extend the ggplot2 functionality into things like animation and 3D plotting! Check them out here.
Two basic image types:
Raster: every pixel of a plot contains its own separate coding; bad if you want to resize the image.
Vector: every element of a plot is encoded with a function that gives its coding conditional on several factors; great for resizing, but image files with many elements can be very large.
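As a sketch of writing the same plot in both kinds of format, assuming `ggsave()` is used and with hypothetical file names in a temporary directory: PDF is a vector format and PNG is a raster format.

```r
library(ggplot2)
library(gapminder)

p_out <- ggplot(subset(gapminder, year == 2007),
                aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()

# Hypothetical output locations; ggsave() picks the device from the extension
pdf_file <- file.path(tempdir(), "gap2007.pdf")
png_file <- file.path(tempdir(), "gap2007.png")

ggsave(pdf_file, plot = p_out, width = 6, height = 4)             # vector
ggsave(png_file, plot = p_out, width = 6, height = 4, dpi = 300)  # raster

file.exists(c(pdf_file, png_file))
```

For the raster file, `dpi` fixes the pixel resolution at save time; the vector file can be rescaled afterwards without losing sharpness.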
ggplot
These questions ask you to work with the gapminder dataset.
Plot a histogram of life expectancy.
Plot the gdp per capita against population. Put the x-axis on the log scale.
Clean up your scatterplot with a title and axis labels. Output it as a PDF and see if you’d be comfortable with including it in a report/paper.
Create a trellis plot of life expectancy by gdpPercap scatterplots, one subplot per continent. Use a 2x3 layout of panels in the plot. Now have the size of the points vary with population. Use coord_cartesian() (or scale_x_continuous()) to set the x-axis limits to be in the range from 100 to 50000.
Make a boxplot of life expectancy conditional on binned values of gdp per capita.
Using the data for 2007, recreate as much as you can of this famous Gapminder plot, where the colors are different continents. (Don’t worry about the ‘2015’ in the background and ignore the ‘play’ button at the bottom.)
Create a “trellis” plot where, for a given year, each panel uses a) hollow circles to plot lifeExp as a function of log(gdpPercap), and b) a red loess smoother without standard errors to plot the trend. Turn off the grey background. Figure out how to use partially-transparent points to reduce the effect of the overplotting of points.
In the following code, I try to plot the life expectancy over time, one line per country but where the color is by continent (in the module, we saw how to have the color vary by country).
This doesn’t work. See if you can understand why and how to fix it. Hint, look up “group” in the ggplot2 help information in R.