Data visualization | MSc CSS: 01. Building Graphs Layer by Layer

Iñaki Ucar

01. Building Graphs Layer by Layer

This tutorial is a gentle introduction to ggplot2, one of the most successful software packages for producing statistical graphics, created by Hadley Wickham based on the “Grammar of Graphs” by Leland Wilkinson. It provides a simple set of core principles, with carefully chosen defaults, to enable quick prototyping as well as publication-quality graphics. In what follows, we will familiarize ourselves with the fundamental concepts and elements of every ggplot2 graphic: how to create a plot object, add data, create a mapping to some aesthetics, and add layers of visual marks.

Author

Affiliation

Iñaki Ucar

Department of Statistics

Published

Oct. 4, 2022

Citation

Ucar, 2022

Source: _tutorials/01/01.Rmd

Preliminaries

This tutorial is heavily based on the First steps introductory chapter from the ggplot2 book. The best way to follow this tutorial is by using the RStudio IDE, so ensure that you have it installed in your computer (along with R).

Workflow

To start with, let’s create a new project by clicking on File > New Project.... Follow the steps to give it a sensible name (e.g., “dataviz”) and place it under a proper path in your computer. This creates a new folder where you can save all your files (scripts, plots, data…) related to this course, preferably organized following some logic (e.g., all scripts under a scripts subdirectory, etc.). Now, whenever you return to your project (if you close and open RStudio again, your last project is automatically opened; otherwise, click File > Recent Projects > ...), RStudio automatically sets the working directory (both of the file manager as well as the R session) to the project folder.

Once you open a new script (File > New File > R Script), the interface is divided in four sections:

Editor (top-left), where you can edit the script file.
Console (bottom-left), where you can execute R commands in the current R session.
Environment (top-right), where you can inspect the variables defined in the current R session.
Files / Plots / Packages / Help / Viewer (bottom-right), where you can browse your files, show your plots, inspect the installed packages, read the manual pages…

Save that empty script (Ctrl + S) using a sensible name (e.g., 01-building_graphs_layer_by_layer.R) in your project folder (or some subfolder according to your organization logic).

Now, the recommended workflow is to copy the chunks of code you’ll find in this tutorial into your script, and there you can run it (by selecting some lines and hitting Ctrl + Enter, or just hitting Ctrl + Enter sequentially to run code line by line), modify it, and try again. Be sure to save your progress with some frequency, just in case, and to comment your code (# comments starts with a hashtag like this). Future-you will thank you.

Another option would be to download the sources of this Rmd document, and use it to tinker with the chunks of code directly.

Reading the documentation

A very important skill for every programming language is learning how to read its documentation. In R, we can quickly open the manual page for any function just by typing ?name-of-the-function in the console, for example, ?mean (try this yourself). Then, you’ll always find the same structure, more or less:

Description of the function, what the function does.
Usage, how to call the function, sometimes with different objects.
Arguments, the description of every argument shown in the usage section.
Value, what the function returns.
Optionally, other sections with more details.
Optionally, References.
Optionally, a list of related functions called See Also.
Optionally, but usually, Examples of usage.

Checkpoint 1

Ensure that you have

created a new project for the course;
created a new script in your project for this tutorial;
located all the relevant panels and parts of the IDE;
figured out how to save the file;
figured out how to send the code from the script to the console for execution;
become accustomed to the documentation.

Required packages

For this tutorial, we need these packages (run the following to install them if you don’t have them already):

install.packages("ggplot2")

Data

In this tutorial, we will mostly use one data set that is bundled with ggplot2: mpg. It includes information about the fuel economy of popular car models in 1999 and 2008, collected by the US Environmental Protection Agency.

library(ggplot2)
mpg

ABCDEFGHIJ0123456789

manufacturer <chr>	model <chr>	displ <dbl>	year <int>	cyl <int>	trans <chr>	drv <chr>	cty <int>	hwy <int>	fl <chr>
audi	a4	1.8	1999	4	auto(l5)	f	18	29	p
audi	a4	1.8	1999	4	manual(m5)	f	21	29	p
audi	a4	2.0	2008	4	manual(m6)	f	20	31	p
audi	a4	2.0	2008	4	auto(av)	f	21	30	p
audi	a4	2.8	1999	6	auto(l5)	f	16	26	p
audi	a4	2.8	1999	6	manual(m5)	f	18	26	p
audi	a4	3.1	2008	6	auto(av)	f	18	27	p
audi	a4 quattro	1.8	1999	4	manual(m5)	4	18	26	p
audi	a4 quattro	1.8	1999	4	auto(l5)	4	16	25	p
audi	a4 quattro	2.0	2008	4	manual(m6)	4	20	28	p

The variables are mostly self-explanatory:

cty and hwy record miles per gallon (mpg) for city and highway driving.
displ is the engine displacement in litres.
drv is the drivetrain: front wheel (f), rear wheel (r) or four wheel (4).
model is the model of car. There are 38 models, selected because they had a new edition every year between 1999 and 2008.
class is a categorical variable describing the “type” of car: two seater, SUV, compact, etc.

Main components

There are three main components to every ggplot:

The data in tidy format.
A set of mappings from data attributes (variables) to visual channels (aesthetics).
At least one layer of visual marks to represent the observations in the dataset, which is usually created with a geom function.

Every ggplot starts with the object creation, via the ggplot() function:

ggplot()

As you can see, this generates just an empty frame: there is no data, no mappings, and therefore no guides or other elements. Even if we add some data, there are still nothing connecting it to any visual feature:

ggplot(mpg)

Next, we can add some mappings. For instance, if we are interested in the relationship between miles per gallon in highway driving (hwy) vs. the engine displacement (displ), we would assign those attributes to y and x positions respectively using the aes() function:

ggplot(mpg, aes(x=displ, y=hwy))

Now we obtained something new. Now, because there is data and mappings to x and y positions, ggplot2 applies some sensible defaults, and automatically adds Cartesian coordinates as well as linear continuous scales that nicely fit to the range of our data (you can check this with e.g. range(mpg$hwy)). Moreover, these scales display nicely formatted guides, with labeled ticks at regular intervals (not too many, not too few), major and minor grid lines, and axis labels after the names of our variables.

What is missing here? Of course, the most important bit, which is the visual mark we are going to use to actually represent each observation. In this case, let us use simple points:

ggplot(mpg, aes(displ, hwy)) +
  geom_point()

Even if there is always the temptation to put everything together in a single line, it is a good practice to separate every function and layer in each own line for readability reasons. Also note that position channels x and y are so important that you do not need to name them (i.e. x=displ, y=hwy), but just remember that x comes first. Other channels like color, fill, shape, alpha, size… must be always named.

As shown above, it is common practice to add data and mapping to the very function that creates the chart object (see ?ggplot), and in this way they apply as defaults to every single layer we add. It is also possible to delay the mapping and still act as a default as follows:

ggplot(mpg) +
  aes(displ, hwy) +
  geom_point()

This is maybe more readable, especially when the mapping is complex, but the result is the same. We can also avoid setting a default dataset and mapping altogether, and just directly plug them into the layers that need them (note that now the order is mapping, then data):

ggplot() +
  geom_point(aes(displ, hwy), mpg)

However, usually we add several layers that refer to the same data, and occasionally some annotation layer that uses another dataset. Therefore, it is generally best to add a default dataset and mapping to avoid duplicated code across layers… or missing ones. For instance, where are the lines here?

ggplot() +
  geom_point(aes(displ, hwy), mpg) +
  geom_line()

Obviously, there are no lines because they do not have any mapping. It data and mapping are set as defaults, then we have both elements:

ggplot(mpg) +
  aes(displ, hwy) +
  geom_point() +
  geom_line()

Checkpoint 2

How would you describe the relationship between cty and hwy? Do you have any concerns about drawing conclusions from that plot?
What does ggplot(mpg, aes(model, manufacturer)) + geom_point() show? Is it useful? How could you modify the data to make it more informative?
Describe the data, aesthetic mappings and layers used for each of the following plots. You’ll need to guess a little because you haven’t seen all the datasets and functions yet, but use your common sense! See if you can predict what the plot will look like before running the code.
1. ggplot(mpg, aes(cty, hwy)) + geom_point()
2. ggplot(diamonds, aes(carat, price)) + geom_point()
3. ggplot(economics, aes(date, unemploy)) + geom_line()
4. ggplot(mpg, aes(cty)) + geom_histogram()

More aesthetics

To add additional variables to a plot, we can map them into other channels such as color, shape, or size. For instance, let’s represent the car class as the color of the dots:

ggplot(mpg) +
  aes(displ, hwy, color=class) +
  geom_point()

Based on the previous plot, we can see that the group of cars with unusually high fuel economy for their engine size are two seaters: cars with big engines, but lightweight bodies.

Once again, we can observe how ggplot2 applies some more sensible defaults:

It detects that class is a categorical variable, a factor, and applies a default color scale based on hue.
At the same time, the scale is responsible for creating a guide, in this case a legend that shows the class levels along with their associated mapping.

Sometimes it is also useful to split up some aesthetics that may apply only to certain layers. For example, in this case:

ggplot(mpg) +
  aes(displ, hwy) +
  geom_line() +
  geom_point(aes(color=class))

Here, position aesthetics apply to all layers, and color is specific to the layer of points.

Every single aesthetic, every single channel, can be set to a fixed value. For instance, if we do not apply any mapping to color, we have previously seen that ggplot2 just draws black dots by default. But of course, this can be changed:

ggplot(mpg) +
  aes(displ, hwy) +
  geom_point(color="blue")

Mastering data mappings is an important skill and you will learn more about it in subsequent tutorials. See vignette("ggplot2-specs") for a comprehensive guide on aesthetics.

Checkpoint 3

Compare the following two plots and reason why you get this result:

ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = "blue"))
ggplot(mpg, aes(displ, hwy)) + geom_point(colour = "blue")

Experiment with the color, shape and size aesthetics. What happens when you map them to continuous values? What about categorical values? What happens when you use more than one aesthetic in a plot?
What happens if you map a continuous variable to shape? Why? What happens if you map trans to shape? Why?
How is drive train related to fuel economy? How is drive train related to engine size and class?

Faceting

This is another fundamental technique for mapping categorical variables. It is most useful e.g. as an alternative to color hue when there are too many categories an no way of further aggregating the data.

Take for instance the previous class example, with 7 different categories. A solution here is to trade color for position: faceting splits the data in as many subsets as categories in the mapped variable. The only difference with other mappings is that it cannot be applied as an aes(), but directly into the dedicated faceting function, and as a formula, preceded by a ~:

ggplot(mpg) +
  aes(displ, hwy) +
  facet_wrap(~class) +
  geom_point()

Checkpoint 4

What happens if you try to facet by a continuous variable like hwy? What about cyl? What’s the key difference?
Use faceting to explore the 3-way relationship between fuel economy, engine size, and number of cylinders. How does faceting by number of cylinders change your assessment of the relationship between engine size and fuel economy?
Read the documentation for facet_wrap(). What arguments can you use to control how many rows and columns appear in the output?
What does the scales argument to facet_wrap() do? When might you use it?

Output

It should be noted that, with all the code above, we are not only creating chart objects, but also generating and displaying them in one go. This happens with other R objects too: when we do not assign an object to a variable, it is printed. In this case, printing a ggplot means constructing the visual object and displaying it. But of course, as with any other R object, we can save it in a variable and print it later:

p <- ggplot(mpg) +
  aes(displ, hwy) +
  geom_point()
print(p)
p # print is implicit

We can even build it step by step:

p <- ggplot(mpg)
p
p <- p + aes(displ, hwy)
p
p <- p + geom_point()
p

Or using different variables:

p_base <- ggplot(mpg)
p_aes <- aes(displ, hwy)
p_dot <- geom_point()
p_base + p_aes + p_dot

This is convenient for interactive usage or reports as this one. But at other times we might want to produce a graph in a script and save it somewhere else as a standalone image or PDF. This is achieved with the ggsave() function:

ggsave("plot.png", p, width = 5, height = 5)

Checkpoint 5

Read the documentation for ggsave(). What happens if you do not specify the plot argument?
How can you save the plot as a PDF file?
How can you modify the proportions of the plot?
What happens if you change the resolution for a PNG output? And a SVG?

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Ucar (2022, Oct. 5). Data visualization | MSc CSS: 01. Building Graphs Layer by Layer. Retrieved from https://csslab.uc3m.es/dataviz/tutorials/01/

BibTeX citation

@misc{ucar202201.,
  author = {Ucar, Iñaki},
  title = {Data visualization | MSc CSS: 01. Building Graphs Layer by Layer},
  url = {https://csslab.uc3m.es/dataviz/tutorials/01/},
  year = {2022}
}

01. Building Graphs Layer by Layer

Author

Affiliation

Published

Citation

Preliminaries

Workflow

Reading the documentation

Checkpoint 1

Required packages

Data

Main components

Checkpoint 2

More aesthetics

Checkpoint 3

Faceting

Checkpoint 4

Output

Checkpoint 5

Footnotes

Reuse

Citation