The course’s final goal is to publish a visualization project on our website, where you are reading this tutorial. The website is an R Markdown project that lives in a GitHub repository. This tutorial will help you fork the repository, create a new draft for your project, and submit it using GitHub’s Pull Request workflow.
This tutorial assumes that you have a working installation of R and RStudio. To start, you can read about the big-picture motivation (Why Git? Why GitHub?), and there is a checklist of all the steps required to prepare for this tutorial.
It is also recommended that you read about some Git basics.
This web page is a Distill website, and as such it consists of a series of R Markdown source files that are compiled into HTML. Data projects, like any other software project, greatly benefit from version control, and this website is no exception. Its sources, along with the resulting HTML products, are hosted in (and published from) a GitHub repository. You can access this repository by clicking on the top-right logo.
Collaboration on GitHub, as on any other Git-based platform, is done via Pull Requests. Essentially, you work on a copy of the project, make changes, and then propose them for inclusion in the original repository. So first you need to fork the repository, meaning that you get a copy in your personal account that is linked to the original one, so that you can later submit your changes. To do this, just click the fork button in the dataviz repository.
By clicking Create fork, you will be redirected to your new repository. Once there, click the green Code button and copy the HTTPS URL.
Back in RStudio, start the dialog to create a new project.
Now, select a Version Control project, managed by Git, and then paste the repository URL you copied in the last step. Adjust the destination path if required.
If everything is properly configured, RStudio will download a copy of the GitHub repository in your personal account to your PC (in Git terms, it will clone the remote repository locally), and the project will be opened.
Now it is time to generate some new content. To create the basic structure of a visualization project, first install the `distill` R package, and then execute the following code:
```r
# install.packages("distill") # if required
distill::create_post(
  title = "The Title of my Article",
  collection = paste0("projects/", format(Sys.Date(), "%Y")),
  author = "My name",
  slug = NIA, # replace NIA with your student identifier
  date_prefix = NULL
)
```
where `NIA` should be your student identifier (i.e., the 9-digit identifier of the form `100xxxxxx`). If you inspect RStudio’s Files pane, you will notice that a new directory `_projects/<year>/100xxxxxx` has been generated. The `Rmd` file with the same name under that directory is now open in the editor pane. Edit the file header as follows:
- Add `categories: "<year>"` below the description (substituting `<year>` by the current year).
- Set `` date: "`r Sys.Date()`" ``. In this way, the date of the article will be the date of the last compilation, which is convenient.
- Add `toc: true` to the `distill_article` options.

The header must look like this:
Also, remove the following part, because it prevents the R code chunks from being displayed in the article, and we want precisely the opposite:
Then save and hit the knit button to compile the document.
As a result, the article `_projects/<year>/100xxxxxx/100xxxxxx.html` has been generated.
Now it is time to tell Git what changes we want to incorporate. There is a new Git pane in RStudio in the top-right panel set. Important: you must always select only the files that were modified under your NIA (i.e., the `_projects/<year>/100xxxxxx` directory), and nothing else.
In the case above, mark as Staged only the yellow directories, ignoring the other files. Once checked, click the Commit button.
In the new dialogue, check once again that all files under your NIA (and only those) are selected. Then, write a descriptive commit message and hit Commit.
Close the dialogue. Currently, your local copy of the repository contains the changes, but the remote one on GitHub does not (the Git pane reads “Your branch is ahead of ‘origin/main’ by 1 commit”). To synchronize the changes, you need to click the Push button.
Finally, you can contribute your article by opening a Pull Request. In your fork, you will see that GitHub notices that your copy is 1 commit ahead of the original repository, and offers you the option to contribute.
Click the Contribute button to start a new Pull Request.
In this new dialogue, set the title to `Project - <your_name>`. Then click Create a draft pull request. Congratulations! You have created your first Pull Request (i.e., something like this), and paved the way to contribute a visualization project for this course.
Usually, Pull Requests contain a more complete version of the final contribution, but here we are learning. So far, we just created a skeleton of the article in “Draft” mode, and we will be adding content gradually.
A Pull Request is a space where maintainers and collaborators work together to shape the final contribution:
Note that, once the Pull Request is open, every commit that the collaborator pushes to their fork is automatically added to the Pull Request. In other words, you do not need to open a new Pull Request (in fact, you cannot). So the cycle repeats until everybody is happy with the result, and the maintainer finally merges the Pull Request.
The workflow is pretty straightforward:

- Update the `Rmd` file with text, code and images (external images need to be placed under `_projects/<year>/100xxxxxx/`, and then referenced by name in the `Rmd`).
- Commit and push the changes under `_projects/<year>/100xxxxxx/`.

Finally, when the project is finished, please click the “Ready for review” button in the Pull Request (at the bottom, on top of the comment box). I will then check that everything is OK, ask for minor changes if required, and merge the Pull Request. At that point, your project will be live!
Use level 2 headings (`## Title`) as the highest level for headings. In other words, please do not use level 1 headings in your posts (`# Title`), because the title of the post is already level 1, so it looks nicer if sections start one level below.
Note that you do not need to add `echo=TRUE` to every chunk, because chunks are shown by default. And we want to show the code, so do not add `echo=FALSE` either unless you have a good reason to hide some special piece of code.
It is good practice to include an initial chunk such as the following, right below the YAML header, to ensure that images use the whole available width and are centered. The `fig.showtext=TRUE` option is important only if you are using external fonts as described here.
```{r setup, include=FALSE}
knitr::opts_chunk$set(out.width="100%", fig.align="center", fig.showtext=TRUE)
```
Your first task is to show and discuss the chart you selected. For this, save a copy, place it next to the `Rmd` file, and then reference it with Markdown syntax as follows (where `<image-file>` stands for the file name of your image):
![](<image-file>){.external width="100%"}
Load only the necessary libraries, i.e., the ones that you actually use. This is important because it makes the code cleaner and easier to understand, and prevents readers from installing a lot of packages that they will not use. Also note that, if you load the `tidyverse` package, a bunch of packages are loaded with it (such as `dplyr`, `ggplot2`… see the output from `tidyverse::tidyverse_packages()` for a complete list), so you do not need to load them separately.
It is good practice to give your chunks descriptive names, so that it is easier to understand what each chunk does. For example, if you are loading the data, you can name the chunk `load-data`.
```{r load-data}
data <- read.csv("mydata.csv")
```
As for the data, please include the required data (and only the required data) in the folder of your project. If the original dataset contains more than you need, please filter it (keeping only the required rows and columns) and save it. Preferably, it should be a CSV or a similar text-based format (if it’s another format, please load the data into R and then save it as CSV). If you need to use a large dataset, please let me know and we can discuss the best way to handle it. Then, you can read the data just using the name of your file, which should sit next to the `Rmd` file. In other words, do not include paths from your own computer, because they will not work on anyone else’s computer.
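As a minimal sketch of the data preparation described above, the following uses R's built-in `mtcars` dataset as a stand-in for a larger original dataset. The selected columns, the filter and the file name `mydata.csv` are illustrative assumptions. It writes to a temporary directory only so that it can be run anywhere; in your project, the CSV would be saved next to the `Rmd` file and read by file name alone.

```r
# Stand-in for a larger original dataset (hypothetical example data).
original <- mtcars
original$model <- rownames(mtcars)

# Keep only the rows and columns that the chart actually needs.
required <- original[original$cyl == 4, c("model", "mpg", "wt")]

# Save as CSV. A temporary directory is used here only for demonstration.
csv_path <- file.path(tempdir(), "mydata.csv")
write.csv(required, csv_path, row.names = FALSE)

# Read it back, as the article would do.
cars <- read.csv(csv_path)
str(cars)
```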
Naming the chunks is especially important for those that produce a chart, because the image file will have the name of the chunk as file name. Note also that chunk names should not contain spaces or special characters.
Chunks that produce a chart should have the `fig.width` and `fig.height` options set to the desired size of the image (in inches; default values are 7 and 5 respectively). With these parameters, you can control the aspect ratio and also the relative size of the fonts. Play with them until you find a good balance.
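For instance, a chunk along these lines would produce a wide, short image. The chunk name, the sizes and the base R plot of the built-in `mtcars` data are only illustrative assumptions; adjust them to your own chart.

```{r mpg-by-weight, fig.width=8, fig.height=4}
# Scatter plot using the built-in mtcars dataset (placeholder data).
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```

When the image is scaled to the width of the text column, smaller values of `fig.width` and `fig.height` make the text in the chart look relatively larger, and larger values make it look smaller.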
You should not change `fig.dpi`. The default is more than enough for a webpage. If your fonts don’t look sharp enough, this can be solved by changing the size of the image (see the previous point).
In `distill` articles, you can set images and tables that span a width larger than the text column. This looks nicer, e.g., for images that are very wide. There are a couple of examples in the Gapminder project. See the documentation for further details.
Note also the `preview=TRUE` option here. Adding this option to a chunk makes the image produced by that chunk the preview image for your article in the “Projects” gallery. See the documentation for further details.
Do not repeat yourself. If you find yourself writing the same code in multiple places, it is a good idea to refactor your code, so that, if you need to change something, you only need to change it in one place:
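A minimal sketch of such a refactoring, using hypothetical data and a hypothetical helper `rescale01()` that is not part of this project: instead of copying the same expression once per column, the logic is written in a single function and reused.

```r
# Hypothetical example data.
scores <- data.frame(a = c(1, 5, 9), b = c(10, 20, 40))

# Repetitive version: the same formula is typed once per column.
# scores$a <- (scores$a - min(scores$a)) / (max(scores$a) - min(scores$a))
# scores$b <- (scores$b - min(scores$b)) / (max(scores$b) - min(scores$b))

# Refactored version: the formula lives in one place.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Apply the helper to every column, keeping the data frame structure.
scores[] <- lapply(scores, rescale01)
scores
```

If the rescaling rule changes later, only the body of `rescale01()` needs to be edited.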
One of the most important best practices is to limit the width of your code. Very long lines of code are harder to read. In this case especially, long lines will run off the page and be lost in the margin, outside of the visible area. So in general it is a good idea to break your lines of code at a maximum width. The common practice is to use a maximum width of 80 characters. There is an option in RStudio to show a visual guide in the editor for this limit. You can activate this in `Tools > Global Options > Code > Display > Show margin`. There is also a package called `styler` that allows you to style your document automatically. After installing the package, if you click “Addins” in the RStudio toolbar, there is a new addin called “Style active file”. This could be a good starting point (although you may not like other styling options that this package applies).
Once the project is finished and merged, you will need to prepare a short presentation. The document format should be a presentation, but I do not particularly care about the file format. If you wish to try an R Markdown presentation, that is great; but a PowerPoint or PDF would be fine too.
Suggested structure:
Be brief: you have 5 minutes, so do not try to explain everything; there will be a post on our website for that.
Do not show all the code; there will be a post on our website for that. Just highlight some small portions of the code that you found especially tricky, challenging, clever…
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Ucar (2022, Oct. 5). Data visualization | MSc CSS: Creating and Submitting your Project. Retrieved from https://csslab.uc3m.es/dataviz/tutorials/project/
BibTeX citation
@misc{ucar2022creating,
  author = {Ucar, Iñaki},
  title = {Data visualization | MSc CSS: Creating and Submitting your Project},
  url = {https://csslab.uc3m.es/dataviz/tutorials/project/},
  year = {2022}
}