This tutorial is a gentle introduction to ggplot2, one of the most successful software packages for producing statistical graphics, created by Hadley Wickham based on the “Grammar of Graphics” by Leland Wilkinson. It provides a simple set of core principles, with carefully chosen defaults, to enable quick prototyping as well as publication-quality graphics. In what follows, we will familiarize ourselves with the fundamental concepts and elements of every ggplot2 graphic: how to create a plot object, add data, create a mapping to some aesthetics, and add layers of visual marks.
Source: _tutorials/01/01.Rmd
This tutorial is heavily based on the First steps introductory chapter from the ggplot2 book. The best way to follow this tutorial is by using the RStudio IDE, so ensure that you have it installed on your computer (along with R).
To start with, let’s create a new project by clicking on File > New Project... and following the steps to give it a sensible name (e.g., “dataviz”) and place it under a proper path on your computer. This creates a new folder where you can save all the files (scripts, plots, data…) related to this course, preferably organized following some logic (e.g., all scripts under a scripts subdirectory, etc.). Now, whenever you return to your project (if you close and open RStudio again, your last project is automatically opened; otherwise, click File > Recent Projects > ...), RStudio automatically sets the working directory (of both the file manager and the R session) to the project folder.
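Once you open a new script (File > New File > R Script), the interface is divided into four sections: by default, the script editor (top left), the console (bottom left), the environment and history pane (top right), and the files, plots, packages and help pane (bottom right).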
Save that empty script (Ctrl + S) using a sensible name (e.g., 01-building_graphs_layer_by_layer.R) in your project folder (or some subfolder according to your organization logic).
Now, the recommended workflow is to copy the chunks of code you’ll find in this tutorial into your script, where you can run them (by selecting some lines and hitting Ctrl + Enter, or just hitting Ctrl + Enter repeatedly to run the code line by line), modify them, and try again. Be sure to save your progress frequently, just in case, and to comment your code (# comments start with a hash sign like this). Future-you will thank you.
Another option would be to download the sources of this Rmd document, and use it to tinker with the chunks of code directly.
A very important skill in every programming language is learning how to read its documentation. In R, we can quickly open the manual page for any function just by typing ?name-of-the-function in the console, for example, ?mean (try this yourself). You’ll always find more or less the same structure: a description, the usage, the arguments, further details, the returned value, and some examples.
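As a small sketch of how the documentation can be reached from the console, the following lines use only base R helpers; which of them is most convenient is a matter of taste.

```r
# Open the manual page for a function (both forms are equivalent)
?mean
help("mean")

# Show only the argument list, without leaving the console
args(mean)

# Run the code from the Examples section of the manual page
example(mean)
```

Running the examples of a manual page is often the quickest way to see what a function does before reading the details.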
Ensure that you have
For this tutorial, we need these packages (run the following to install them if you don’t have them already):
install.packages("ggplot2")
In this tutorial, we will mostly use one data set that is bundled with ggplot2: mpg. It includes information about the fuel economy of popular car models in 1999 and 2008, collected by the US Environmental Protection Agency.
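Before plotting anything, it can help to look at the data set itself. The following sketch uses only standard inspection functions; the choice of columns passed to summary() is just an illustration.

```r
library(ggplot2)  # mpg is bundled with ggplot2

dim(mpg)    # number of rows (cars) and columns (variables)
names(mpg)  # variable names
head(mpg)   # first few rows
str(mpg)    # type of each variable

# Numeric summary of a few columns, chosen here only as an example
summary(mpg[, c("displ", "cty", "hwy")])

# The manual page documents every variable
?mpg
```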
The variables are mostly self-explanatory:
- cty and hwy record miles per gallon (mpg) for city and highway driving.
- displ is the engine displacement in litres.
- drv is the drivetrain: front wheel (f), rear wheel (r) or four wheel (4).
- model is the model of car. There are 38 models, selected because they had a new edition every year between 1999 and 2008.
- class is a categorical variable describing the “type” of car: two seater, SUV, compact, etc.

There are three main components to every ggplot: the data, a set of aesthetic mappings between variables in the data and visual properties, and at least one layer that describes how to render each observation.
Every ggplot starts with the object creation, via the ggplot() function:
ggplot()
As you can see, this generates just an empty frame: there is no data, no mappings, and therefore no guides or other elements. Even if we add some data, there is still nothing connecting it to any visual feature:
ggplot(mpg)
Next, we can add some mappings. For instance, if we are interested in the relationship between miles per gallon in highway driving (hwy) and the engine displacement (displ), we would assign those attributes to the y and x positions, respectively, using the aes() function:
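The mapping just described can be sketched as follows. This is a reconstruction based on the surrounding text and on the chunks that come later in this tutorial, so take it as an assumption about the intended code rather than the original chunk.

```r
library(ggplot2)

# displ goes to the x position and hwy to the y position;
# no layer is added yet, so no marks are drawn
ggplot(mpg, aes(displ, hwy))
```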
Now we have obtained something new. Because there is data and there are mappings to the x and y positions, ggplot2 applies some sensible defaults, and automatically adds Cartesian coordinates as well as linear continuous scales that nicely fit the range of our data (you can check this with, e.g., range(mpg$hwy)). Moreover, these scales display nicely formatted guides, with labeled ticks at regular intervals (not too many, not too few), major and minor grid lines, and axis labels named after our variables.
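To make these defaults more tangible, the sketch below first checks the range of both variables and then writes the default scales and coordinate system out by hand. Under the assumption that no other option has been changed, it should draw the same frame as the shorter call without the extra lines.

```r
library(ggplot2)

# Range of the data that the default scales have to cover
range(mpg$displ)
range(mpg$hwy)

# The defaults, written explicitly
ggplot(mpg, aes(displ, hwy)) +
  scale_x_continuous() +
  scale_y_continuous() +
  coord_cartesian()
```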
What is missing here? Of course, the most important bit, which is the visual mark we are going to use to actually represent each observation. In this case, let us use simple points:
ggplot(mpg, aes(displ, hwy)) +
  geom_point()
Even if there is always the temptation to put everything together in a single line, it is good practice to put every function and layer on its own line for readability. Also note that the position channels x and y are so important that you do not need to name them (i.e., x=displ, y=hwy); just remember that x comes first. Other channels like color, fill, shape, alpha, size… must always be named.
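The naming rule can be checked directly on the object that aes() returns. This is only a sketch; the object names a1, a2 and a3 are made up for the illustration.

```r
library(ggplot2)

# Positional form: the first two arguments of aes() are x and y
a1 <- aes(displ, hwy)

# Fully named form, equivalent to the one above
a2 <- aes(x = displ, y = hwy)

names(a1)
names(a2)

# Any other channel has to be named explicitly
# (ggplot2 stores color under its British spelling, colour)
a3 <- aes(displ, hwy, color = class)
names(a3)
```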
As shown above, it is common practice to add data and mapping to the very function that creates the chart object (see ?ggplot), so that they apply as defaults to every single layer we add. It is also possible to add the mapping later and still have it act as a default, as follows:
ggplot(mpg) +
  aes(displ, hwy) +
  geom_point()
This is maybe more readable, especially when the mapping is complex, but the result is the same. We can also avoid setting a default dataset and mapping altogether, and just directly plug them into the layers that need them (note that now the order is mapping, then data):
ggplot() +
  geom_point(aes(displ, hwy), mpg)
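Since the argument order differs between ggplot() (data first) and the layer functions (mapping first), one way to sidestep the issue is to name both arguments. A minimal sketch:

```r
library(ggplot2)

# With named arguments, the order no longer matters
ggplot() +
  geom_point(data = mpg, mapping = aes(displ, hwy))
```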
However, usually we add several layers that refer to the same data, and occasionally some annotation layer that uses another dataset. Therefore, it is generally best to add a default dataset and mapping to avoid duplicated code across layers… or missing ones. For instance, where are the lines here?
ggplot() +
  geom_point(aes(displ, hwy), mpg) +
  geom_line()
Obviously, there are no lines because that layer has neither data nor mapping. If data and mapping are set as defaults, then both layers are drawn:
ggplot(mpg) +
  aes(displ, hwy) +
  geom_point() +
  geom_line()
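Defaults can still be overridden in a single layer when it needs another dataset. The sketch below assumes a hypothetical subset, cars_2008, created only for this illustration; the extra layer inherits the default mapping but draws hollow circles around the cars from 2008.

```r
library(ggplot2)

# Hypothetical subset used only for illustration
cars_2008 <- subset(mpg, year == 2008)

ggplot(mpg) +
  aes(displ, hwy) +
  geom_point() +                                   # default data and mapping
  geom_point(data = cars_2008, shape = 1, size = 3)  # same mapping, other data
```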
How would you describe the relationship between cty and hwy? Do you have any concerns about drawing conclusions from that plot?
What does ggplot(mpg, aes(model, manufacturer)) + geom_point() show? Is it useful? How could you modify the data to make it more informative?
Describe the data, aesthetic mappings and layers used for each of the following plots. You’ll need to guess a little because you haven’t seen all the datasets and functions yet, but use your common sense! See if you can predict what the plot will look like before running the code.
ggplot(mpg, aes(cty, hwy)) + geom_point()
ggplot(diamonds, aes(carat, price)) + geom_point()
ggplot(economics, aes(date, unemploy)) + geom_line()
ggplot(mpg, aes(cty)) + geom_histogram()
To add additional variables to a plot, we can map them into other channels such as color, shape, or size. For instance, let’s represent the car class as the color of the dots:
ggplot(mpg) +
  aes(displ, hwy, color = class) +
  geom_point()
Based on the previous plot, we can see that the group of cars with unusually high fuel economy for their engine size are two seaters: cars with big engines, but lightweight bodies.
Once again, we can observe how ggplot2 applies some more sensible defaults: it detects that class is a categorical variable and applies a default color scale based on hue.

Sometimes it is also useful to split up some aesthetics that may apply only to certain layers. For example, in this case:
ggplot(mpg) +
  aes(displ, hwy) +
  geom_line() +
  geom_point(aes(color = class))
Here, position aesthetics apply to all layers, and color is specific to the layer of points.
Every single aesthetic, every single channel, can be set to a fixed value. For instance, if we do not apply any mapping to color, we have previously seen that ggplot2 just draws black dots by default. But of course, this can be changed:
ggplot(mpg) +
  aes(displ, hwy) +
  geom_point(color = "blue")
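Several channels can be fixed at once in the same way. The particular values below (size 3, 50% transparency, shape 17) are arbitrary choices for illustration.

```r
library(ggplot2)

# Fixed values are passed directly to the layer, outside aes()
ggplot(mpg) +
  aes(displ, hwy) +
  geom_point(color = "blue", size = 3, alpha = 0.5, shape = 17)
```

Some transparency can be helpful here because many cars share the same displ and hwy values and their points overlap.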
Mastering data mappings is an important skill and you will learn more about it in subsequent tutorials. See vignette("ggplot2-specs") for a comprehensive guide on aesthetics.
ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = "blue"))
ggplot(mpg, aes(displ, hwy)) + geom_point(colour = "blue")
Experiment with the color, shape and size aesthetics. What happens when you map them to continuous values? What about categorical values? What happens when you use more than one aesthetic in a plot?
What happens if you map a continuous variable to shape? Why? What happens if you map trans to shape? Why?
How is drive train related to fuel economy? How is drive train related to engine size and class?
Faceting is another fundamental technique for mapping categorical variables. It is most useful, e.g., as an alternative to color hue when there are too many categories and no way of further aggregating the data.
Take for instance the previous class example, with 7 different categories. A solution here is to trade color for position: faceting splits the data into as many subsets as there are categories in the mapped variable. The only difference with other mappings is that it is not specified inside aes(), but directly in the dedicated faceting function, as a formula, preceded by a ~:
ggplot(mpg) +
  aes(displ, hwy) +
  facet_wrap(~class) +
  geom_point()
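As an alternative to the formula notation, recent versions of ggplot2 (3.0.0 and later) also accept the faceting variable wrapped in vars(). A sketch that should give the same panels as the formula version:

```r
library(ggplot2)

# vars(class) plays the same role as the formula ~class
ggplot(mpg) +
  aes(displ, hwy) +
  facet_wrap(vars(class)) +
  geom_point()
```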
What happens if you try to facet by a continuous variable like hwy? What about cyl? What’s the key difference?
Use faceting to explore the 3-way relationship between fuel economy, engine size, and number of cylinders. How does faceting by number of cylinders change your assessment of the relationship between engine size and fuel economy?
Read the documentation for facet_wrap(). What arguments can you use to control how many rows and columns appear in the output?
What does the scales argument to facet_wrap() do? When might you use it?
It should be noted that, with all the code above, we are not only creating chart objects, but also generating and displaying them in one go. This happens with other R objects too: when we do not assign an object to a variable, it is printed. In this case, printing a ggplot means constructing the visual object and displaying it. But of course, as with any other R object, we can save it in a variable and print it later:
p <- ggplot(mpg) +
  aes(displ, hwy) +
  geom_point()
print(p)
p # print is implicit
We can even build it step by step:
p <- ggplot(mpg)
p
p <- p + aes(displ, hwy)
p
p <- p + geom_point()
p
Or using different variables:
p_base <- ggplot(mpg)
p_aes <- aes(displ, hwy)
p_dot <- geom_point()
p_base + p_aes + p_dot
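Because adding a component returns a new object instead of modifying the original one, a common base can be reused for several variants. In this sketch, the names p_points and p_lines are made up for the illustration.

```r
library(ggplot2)

# Shared starting point: data and default mapping
p_base <- ggplot(mpg) +
  aes(displ, hwy)

# Two variants built from the same base; p_base itself is not modified
p_points <- p_base + geom_point()
p_lines <- p_base + geom_line()

p_points
p_lines
```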
This is convenient for interactive usage or reports such as this one. But at other times we might want to produce a graph in a script and save it somewhere else as a standalone image or PDF. This is achieved with the ggsave() function:
ggsave("plot.png", p, width = 5, height = 5)
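The same call can be written with named arguments and an explicit output path. In the sketch below, the file name and the use of a temporary folder are assumptions made only to keep the example self-contained.

```r
library(ggplot2)

p <- ggplot(mpg) +
  aes(displ, hwy) +
  geom_point()

# Hypothetical output location: a temporary folder, so that nothing is
# written into the project by accident
out_file <- file.path(tempdir(), "displ-vs-hwy.png")

# width and height are interpreted in inches unless units says otherwise
ggsave(filename = out_file, plot = p, width = 5, height = 5)

# Check that the file was actually written
file.exists(out_file)
```

In a real project, a path relative to the project folder (e.g., a dedicated subfolder for plots) would usually replace the temporary one.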
Read the documentation for ggsave(). What happens if you do not specify the plot argument?
How can you save the plot as a PDF file?
How can you modify the proportions of the plot?
What happens if you change the resolution for a PNG output? And for an SVG?
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Ucar (2022, Oct. 5). Data visualization | MSc CSS: 01. Building Graphs Layer by Layer. Retrieved from https://csslab.uc3m.es/dataviz/tutorials/01/
BibTeX citation
@misc{ucar202201., author = {Ucar, Iñaki}, title = {Data visualization | MSc CSS: 01. Building Graphs Layer by Layer}, url = {https://csslab.uc3m.es/dataviz/tutorials/01/}, year = {2022} }