Data visualization | MSc CSS: Income inequality: Gini coefficient before and after tax

David Pereiro-Pol

Original chart

Each year Our World in Data updates the graph that is being used as the principal inspiration for this project. This visualization shows the difference caused by the redistribution of taxes in the Gini coefficient which is a measure of statistical dispersion that represents the income inequality within a nation. After applying taxes and transfers, the Gini coefficient tends to be reduced since redistribution is usually progressive, resources flow from rich to poor people.

In this graph, the Gini coefficient before taxes is represented in the x-axis and after taxes in the y-axis. Each country is shown as a point that changes its size depending on the population and its color depending on its continent. Those countries that are closer to the coordinates (0, 0) are in a better situation in terms of income inequality. Moreover, the reduction caused by taxes can be seen in this chart thanks to the inclusion of different lines that indicate how much the Gini coefficient have been reduced due to the redistribution system.

Getting started

Packages and fonts

Firstly, we have to upload all the packages needed to replicate this graph and also the fonts that will be used. For the font part we will be using sysfonts package to retrieve them from Google Fonts and showtext to activate them.

library(tidyverse)
library(readr)
library(ggtext)
library(geomtextpath)
library(scales)
library(ggrepel)
library(ggnewscale)
library(ggthemes)
library(ggiraph)
library(grid)
library(patchwork)
library(glue)

sysfonts::font_add_google("Playfair Display", family="playfair")
sysfonts::font_add_google("Lato", family="lato")
sysfonts::font_add_google("Roboto", family="roboto")
showtext::showtext_auto()

Getting the data

The data needed to replicate the graph can be downloaded from the graph page. By analyzing its structure we can see that the data set is conformed by 78936 observations and 7 variables. This huge amount of observations, taken into account that we are working with country data, is due to the fact that for each country the year variable takes values from -10000 to 2100. Besides the variable year, we have a variable for the name of the country and other for the code of each country, two variables for the Gini coefficient, one for the population and another for the continent.

gini <- read_csv(file = "data/gini_data.csv")

# Cleaning the names

gini <- gini |>
  rename(
    "pre_tax_gini" = paste("10.4.2 - Redistributive impact of",
    "fiscal policy, Gini index (%) - SI_DST_FISP - Prefiscal income"),
    "post_tax_gini" = paste("10.4.2 - Redistributive impact of",
     "fiscal policy, Gini index (%) - SI_DST_FISP - Postfiscal",
     "disposable income")
  ) |>
  janitor::clean_names()

gini

# A tibble: 78,936 × 7
   entity code    year post_tax_gini pre_tax_gini population continent
   <chr>  <chr>  <dbl>         <dbl>        <dbl>      <dbl> <chr>
 1 Abkha… OWID…   2015            NA           NA         NA Asia
 2 Afgha… AFG   -10000            NA           NA      14737 <NA>
 3 Afgha… AFG    -9000            NA           NA      20405 <NA>
 4 Afgha… AFG    -8000            NA           NA      28253 <NA>
 5 Afgha… AFG    -7000            NA           NA      39120 <NA>
 6 Afgha… AFG    -6000            NA           NA      54166 <NA>
 7 Afgha… AFG    -5000            NA           NA      75000 <NA>
 8 Afgha… AFG    -4000            NA           NA     306250 <NA>
 9 Afgha… AFG    -3000            NA           NA     537500 <NA>
10 Afgha… AFG    -2000            NA           NA     768750 <NA>
# ℹ 78,926 more rows

Cleaning the data

During the data cleaning process, we have encountered two significant challenges. The first issue was the limited availability of Gini coefficient data for many countries, with data often being provided for only one or a few years. To address this, the creators of the graph chose to include countries with sparse Gini data in charts spanning five years before and five years after the available data point. For example, a country with data available only for 2013 would be represented in the graphs from 2008 to 2018. The “solution” that they did was not included in the data set, so to address this problem and imitate the distribution of the graphs we created a function that replicate the Gini values five years in the past and five years in the future.

replicate_values <- function(x) {
  j <- 0 # We initialize this variable here to verify the first
  # condition in the beginning of the for loop
  for (i in 1:length(x)) {
    if (!is.na(x[i]) & i >= j) {  # We check if there is a
      # NA and we put i >= j to continue in the point in which
      # the while loop ended
      j <- i + 1 # We change the value of j to start the while
      # loop in the following iteration
      count <- 0 # To count the next 5 positions
      while (j <= length(x) && count < 5 && is.na(x[j])) { # To change
        #possible NA

        if (!is.na(x[j])) break # If there is not a NA break

        x[j] <- x[i] # Change the NA with the previous value

        j <- j + 1

        count <- count + 1
      }
    }
  }

  j <- length(x) + 1 # We do the same but backwards
  for (i in length(x):1) {
    if (!is.na(x[i]) & i <= j) {
      j <- i - 1
      count <- 0
      while (j >= 1 && count < 5 && is.na(x[j])) {
        if (!is.na(x[j])) break
        x[j] <- x[i]
        j <- j - 1
        count <- count + 1
      }
    }
  }
  return(x)
}

## Replication of data to simulate the original distribution of the data
gini_w_replicate <- gini |>
  group_by(entity) |>
  mutate(
    pre_tax_gini = replicate_values(pre_tax_gini),
    post_tax_gini = replicate_values(post_tax_gini)
  ) |>
  ungroup()

The second one was that the name of the continent of each country only was in the year 2015 observation, so we had to expand that variable for each country. We also deleted the observations that did not have any Gini data.

## Replicate the continent and drop NA

gini_tidy <- gini_w_replicate |>
  group_by(entity) |>
  fill(continent, .direction = "downup") |>
  ungroup() |>
  drop_na(post_tax_gini, pre_tax_gini)

Since the graph that we are replicating is the one from the 2020, we filtered the data for that year.

# Gini 2020

gini_tidy_2020 <- gini_tidy |>
  filter(year == 2020)

gini_tidy_2020

# A tibble: 87 × 7
   entity  code   year post_tax_gini pre_tax_gini population continent
   <chr>   <chr> <dbl>         <dbl>        <dbl>      <dbl> <chr>
 1 Argent… ARG    2020         0.418        0.477   45191960 South Am…
 2 Armenia ARM    2020         0.287        0.333    2890894 Asia
 3 Austra… AUS    2020         0.309        0.403   25743787 Oceania
 4 Austria AUT    2020         0.274        0.418    8921402 Europe
 5 Belarus BLR    2020         0.267        0.292    9350943 Europe
 6 Belgium BEL    2020         0.261        0.412   11540103 Europe
 7 Bolivia BOL    2020         0.451        0.462   11816300 South Am…
 8 Brazil  BRA    2020         0.521        0.585  208660845 South Am…
 9 Bulgar… BGR    2020         0.410        0.451    6933654 Europe
10 Cambod… KHM    2020         0.322        0.324   16725482 Asia
# ℹ 77 more rows

Replicating the chart

Coordinates and axes

Once that we have cleaned the data, the first step is to set the axis and the skeleton of our graph. For this we have to set the x-axis for the pre tax Gini and the y-axis for the post tax Gini.

## Structure

p <- ggplot(gini_tidy_2020) +
  aes(x = pre_tax_gini, y = post_tax_gini)

p

Now we have to adjust the coordinates. Note that in the original graph the origin of the chart was set in (0.2, 0.2) and without any expansion, so we have to set the limits correctly and we have to delete the default expansion that ggplot sets. Besides, we have to eliminate the minor grid since it is not displayed in the original chart.

p <- p +
  scale_x_continuous(limits = c(0.2,0.749),
                            minor_breaks = NULL,
                            expand = expansion(0)) +
  scale_y_continuous(limits = c(0.2, 0.660),
                     minor_breaks = NULL,
                     expand = expansion(0))
p

Now we have to change the appearance of the axis to simulate the original one. For that, we will set theme_minimal because it provides us a similar background and we will change the type and color of the major grid.

p <- p + theme_minimal() +
  theme(
        panel.grid.major = element_line(color = "#dddddd",
                                        linetype = "dashed",
                                        linewidth = 0.3
                                        ),
        plot.margin = margin(7,2,7,7),
        plot.background = element_rect(fill = "white", color = NA)
        )

p

Diagonal lines

The next step will be to add the diagonal lines that indicate the level of reduction of the Gini coefficient. This one was a problem during the replication process. Initially, the lines were added with geom_abline and to include the text we were going to use annotate changing the angle argument, but the text did not correctly match the line. After some research, we found a package called geom_textpath that include the function geom_textabline, this function includes both the text and the line, but because of the different colors between the text and the line we have to include both functions, geom_abline and geom_textabline to obtain a better result. However, by using geom_abline we could not set clip = “off” in coord_cartesian to include the logo and the population legend, so, at the end we used annotate to create a finite segment and geaom_textabline.

xmax <- 0.749
ymax <- 0.660

p <- p  + 
  annotate("segment", 
           x = 0.2, y = 0.2, 
           xend = min(xmax, 0.2 + (ymax - 0.2)), 
           yend = min(ymax, 0.2 + (xmax - 0.2)), 
           color = "#dddddd", 
           linetype = "dashed", 
           linewidth = 0.3) +
  annotate("segment", 
           x = 0.3, y = 0.2, 
           xend = min(xmax, 0.3 + (ymax - 0.2) / (2/3)), 
           yend = min(ymax, 0.2 + (xmax - 0.3) * (2/3)), 
           color = "#dddddd", 
           linetype = "dashed", 
           linewidth = 0.3) +
  annotate("segment", 
           x = 0.4, y = 0.2, 
           xend = min(xmax, 0.4 + (ymax - 0.2) / (1/2)), 
           yend = min(ymax, 0.2 + (xmax - 0.4) * (1/2)), 
           color = "#dddddd", 
           linetype = "dashed", 
           linewidth = 0.3) + 
  geom_textabline(slope = 1, intercept = 0,
                  label = "No reduction",
                  linetype = "blank",
                  size = 2.5, 
                  family = "sans",
                  hjust = 0.83,
                  color = "grey70") +
  geom_textabline(slope = 1 / 2, intercept = 0,
                  label = "Reduce by a half",
                  linetype = "blank",
                  size = 2.5, 
                  family = "sans",
                  hjust = 0.82,
                  color = "grey70") +
  geom_textabline(slope = 2 / 3, intercept = 0,
                  label = "Reduce by a third",
                  linetype = "blank",
                  size = 2.5, 
                  family = "sans",
                  hjust = 0.86,
                  color = "grey70") 

p

Points and legends

Since we have already set the background of the chart, now we can add the points, we have to put the variables population and continent to the size and fill aesthetics, respectively. The colors were obtained by using the ImageColor Picker site.

con_colors <- c("#a2559c", "#00847e",
                "#4c6a9c", "#e56e5a",
                "#9a5129", "#883039")

p <- p + geom_point(aes(size = population,
                        fill = continent),
                        alpha = 0.75,
                        shape = 21, stroke = 0.5)  +
  scale_fill_manual(values = con_colors) 

p

To change the format of the two legends we have to use the function guides in combination with the argument override.aes, by doing so we are able to change both legends independently. We deleted the size legend to later add the custom population legend. In this step, besides changing the legends, we are also adjusting the size of the points to better fit the original one.

p <- p + guides(fill = guide_legend(order = 1, 
                                    theme = theme(
        legend.title = element_blank(),
        legend.key.height = unit(0.3, "cm"),
        legend.text = element_text(family = "lato",
                                   size = 8, 
                                   margin = margin(l = 0))),
        override.aes = list(shape = 15, 
                            size = 2,
                            color = con_colors
                            )),
        size = "none") + 
  theme(legend.justification = c("right", "top"),
        legend.box = "vertical",
        legend.spacing.y = unit(0.1, "cm")) +
  scale_size(breaks = c(1e08, 3e08), 
             labels = label_number(scale = 1e-6, suffix = "M"), 
             range = c(0.5, 9))
        
p

Labels

To add the country labels, we are going to create an additional table in which we will reduce the number of countries per continent to minimize the overlaps. If we do not add this step, we will not see any names in the middle part of the graph since there are a lot of European countries and because of the huge amount of overlaps those labels will not appear.

gini_labels <- gini_tidy_2020 |>
  slice_sample(prop = -0.5, by = continent)

In the original chart the labels also vary their size depending on the population, so, since we have already set an scale for the size before, we will be using the function new_scale from the package ggnewscale that allow us to introduce new values for the labels’ scale. In addition, to better control the overlapping we have to use geom_text_repel, the segment.color argument let us erase the segment between the label and the point and the bg.color and the bg.r arguments allow us to add some white padding behind the letters.

p <- p + 
  new_scale("size") + 
  geom_text_repel(aes(label = entity,
                              color = continent,
                              size = population),
                          segment.color = NA,
                          max.overlaps = 7,
                          show.legend = FALSE, 
                          nudge_x = 0.01,
                          bg.color = "white",
                          bg.r = 0.15,
                          data = gini_labels) +
  scale_size(range = c(2.5, 3.5)) +
  scale_color_manual(values = con_colors)

p

Population legend, line and logo

If we check the original plot we can observe that there are a line, that is between the two legends, the logo of Our World in Data and a particular legend for the population. For the line and the logo, we combine some functions of the grid package such as rasterGrob and linesGrob with annotation_custom and for the population legend, we created an additional graph and we used inset_element from patchwork to include it. The legend will be included in the last step.

image_grob <- rasterGrob(png::readPNG("Our_World_in_Data_logo.png"), 
                         x = 1, y = 1,  
                         hjust = 1, vjust = 1, 
                         width = unit(0.2, "npc"))
legend <- ggplot() +
  lims(y = c(0.9, 1)) +
  annotate("point", x = 1, y = 0.95, 
           size = 10, color = "grey80", 
           shape = 21) +
  annotate("point", x = 1, y = 0.9561,
           size = 5, color = "grey80",
           shape = 21) +
  annotate("text",  
           x = 1, y = 0.94, size = 2.5,
           color = "grey30",
           family = "sans",
           label="300M") + 
  annotate("text",  
           x = 1, y = 0.95, size = 2,
           color = "grey30",
           family = "sans",
           label="100M") + 
  annotate("richtext",
           x = 1, y = 0.962, 
           label = "Circles sized by <br>**Population**",
           hjust = 0.5, 
           vjust = 1, 
           family = "lato",
           size = 2.5, 
           color = "#636363",
           fill = NA, 
           label.color = NA) +
  theme_void() +
  theme(plot.margin = margin(50, 50, 50, 50)) 

p <- p + 
  annotation_custom(image_grob, 
                      xmin = 0.59, xmax = 0.89,
                      ymin = 0.62, ymax = 0.74) +
  annotation_custom(grob = linesGrob(gp = gpar(col = "grey90", lwd = 1)),  
    xmin = 0.785, xmax = 0.89, 
    ymin = 0.53, ymax = 0.53) + 
  coord_cartesian(clip = "off") 
p

Title and annotations

Lastly, we have to put the title, the subtitle and the caption using the proper fonts and size. We have obtained the family font by using the MyFonts site.

title_rep <- "Income inequality: Gini coefficient before and after tax, 2020"

subtitle_rep <- 
  paste("Inequality is measured in terms of the Gini coefficient of",
        "income before taxes on the horizontal axis and after<br>taxes on the",
        "vertical axis")

caption_rep <- "**Data source** : World Bank"

tag_rep <- "OurWorldinData.org/economic-inequality | CC BY"

p <- p + labs(title = title_rep,
              subtitle = subtitle_rep,
              caption = caption_rep,
              tag = tag_rep,
              x = "Before tax",
              y = "After tax") +
  theme(plot.title = element_text(face = "bold",
                                  family = "playfair",
                                  size = 14,
                                  color = "#636363"),
        plot.title.position = "plot",
        plot.subtitle = element_markdown(size = 8.5,
                                                 color = "#636363",
                                                 family = "lato"),
        axis.title = element_text(size = 7.5,
                                  color = "#636363",
                                  family = "lato"),
        axis.text = element_text(size = 7.5,
                                 color = "#636363",
                                 family = "lato"),
        plot.caption = element_markdown(size = 7.5,
                                        color = "#636363",
                                        hjust = 0,
                                        family = "lato"),
        plot.caption.position = "plot",
        plot.tag = element_markdown(size = 7.5,
                                    color = "#636363",
                                    hjust = 1,
                                    family = "lato"),
        plot.tag.position = c(1,0.012)) +
  inset_element(legend, align_to = "plot", left = 1.15, bottom = 0.66,
                  right = 1.15, top = 0.66)

Final result

Here we can observe the final result of the replication.

Alternative version

Original chart problems

The original graph provides a good visualization of the differences in the Gini coefficient. However, if we want to compare one country to others or examine the extent of the reduction in the Gini coefficient, the original representation is not the most effective choice.

For these reasons, a lollipop chart offers a clearer way to visualize differences across countries. It also allows us to observe each country’s position after tax redistribution.

In this alternative version, we have labeled the thirty countries with the most significant reductions in their Gini coefficients to emphasize that tax redistribution has a substantial impact on reducing inequality in Europe. Additionally, by leveraging the ggiraph package, we have added an interactive feature to this graph. This allows users to explore and identify the names, the population and the Gini reduction of other countries displayed.

gini_lollipop <- gini_tidy_2020 |> 
  mutate(entity = fct_reorder(entity, desc(post_tax_gini))) |> 
  mutate(difference = pre_tax_gini - post_tax_gini) |> 
  mutate(tooltip = glue(
      "Country: {entity}<br>
      Population: {population}<br>
      Gini reduction: {round(difference, 4)}"))

high_difference <- gini_lollipop |> 
  slice_max(difference, n = 30)

global_mean_pre <- mean(gini_lollipop$pre_tax_gini)
global_mean_post <- mean(gini_lollipop$post_tax_gini)

title_alt <- "Tax redistribution notably <b>reduces</b> income inequality"
subtitle_alt <- 
  paste("Country data available from  <span","style='color:#E69F00'>",
        "Asia</span>, <span style='color:#000000'>Africa</span>,",
        "<span style='color:#56B4E9'>Europe</span>,",
        "<span style='color:#009E73'>North America</span>,",
        "<span style='color:#CC79A7'>Oceania</span>,",
        "and <span style='color:#0072B2'>South America</span>", 
        "<br>shows that taxes reduce the Gini coefficient in all of them.")
y_alt <- paste("Gini coefficient <span style='color:#7570B3'>",
              "post taxes</span> and <span style='color", 
              ":#aeb370'> pre taxes</span>")
caption_alt <- "**Data source** : World Bank"
label_alt <- paste("Labeled points show <br>the 30 countries", 
                   "<br>with bigger reductions <br>of their", 
                   "Gini coefficients. <br>Note the the majority",
                   "of <br>them are from <span",
                   "style='color:#56B4E9;'>Europe</span>")

lolli <- gini_lollipop |> 
  ggplot() + 
  geom_segment_interactive( aes(x = entity, xend = entity,y = post_tax_gini,
                      yend = pre_tax_gini, color = continent, tooltip = tooltip), 
                linewidth  = 1.5,
                alpha = 0.6) +
  geom_point_interactive(aes(x=entity, y=pre_tax_gini, tooltip = tooltip), 
             size=2, 
             color="#aeb370") +
  geom_point_interactive(aes(x=entity, y=post_tax_gini, tooltip = tooltip), 
             size=2, 
             color="#7570B3") +
  coord_flip() +
  theme_minimal() +
  scale_y_continuous(sec.axis = dup_axis()) +
  theme(axis.text.y  = element_blank(),
        panel.grid = element_blank(),
        legend.position = "none") +
  geom_hline(yintercept = global_mean_pre,
             linetype = "dotted",
             alpha = 0.6,
             color = "#aeb370") +
  geom_hline(yintercept = global_mean_post,
             linetype = "dotted",
             alpha = 0.6,
             color = "#7570B3") + 
  geom_hline(yintercept = 0,
             linetype = "dotted",
             alpha = 0.6) +
  scale_color_manual(values = c(
    "Asia" = "#E69F00",
    "Africa" = "#000000",
    "Europe" = "#56B4E9",
    "North America" = "#009E73",
    "Oceania" = "#CC79A7",  
    "South America" = "#0072B2"
  )) + 
  annotate("text",  
           x = 8, y = global_mean_post - 0.01, angle = 90, 
           size = 3, family = "sans",
           label="Global mean post taxes") + 
  annotate("text",  
           x = 8, y=global_mean_pre - 0.01, angle = 90,
           size = 3, family = "sans",
           label="Global mean pre taxes") + 
  annotate("text",  
           x = 43, y = -0.01, angle = 90, size = 3,
           family = "sans",
           label="Perfect income equality") + 
  geom_text(aes(x = entity, 
                      y = post_tax_gini, 
                      label = entity,
                      color = continent), 
                  size = 2,
                  family = "sans",
                  hjust = 0.75,
                  nudge_y = -0.025,
                  data = high_difference) +
  annotate("segment", 
           x = 43, xend = 43, 
           y = 0.2 , yend = 0.1, 
           linewidth=0.3, 
           arrow = arrow(length = unit(0.3, "cm"), type = "open")) +
  annotate("segment", 
           x = 43, xend = 43, 
           y = 0.6 , yend = 0.7,  
           linewidth=0.3, 
           arrow = arrow(length = unit(0.3, "cm"), type = "open")) +
  annotate("text",  
           x = 41, y = 0.65,  
           size = 3, family = "sans",
           label="More \ninequality") +
  annotate("text",  
           x = 41, y = 0.15,  
           size = 3, family = "sans",
           label="Less \ninequality") +
  annotate("richtext",
           x = 75, y = 0.65, 
           label = label_alt,
           hjust = 0.5, 
           vjust = 1, 
           family = "sans",
           size = 3, 
           fill = NA, 
           label.color = NA ) + 
  labs(title = title_alt,
       subtitle = subtitle_alt,
       y = y_alt,
       caption = caption_alt) +
  theme(plot.title = element_markdown(,
                                  family = "sans",
                                  size = 15),
        plot.subtitle = element_markdown(size = 10,
                                                 family = "sans"),
        axis.title = element_markdown(size = 8,
                                                 family = "sans"),
        axis.title.y = element_blank(),
        plot.caption = element_markdown(size = 7.5,
                                                hjust = 0,
                                                family = "sans"),
        plot.title.position = "plot",
        plot.caption.position = "plot",
        legend.position = "none")

interactive_plot <- girafe(ggobj = lolli, height_svg = 10)

tooltip_css <- glue("background-color: #2C3E50;
  color: #ECF0F1;
  padding: 10px;
  border-radius: 5px;
  font-family: 'Arial', sans-serif;
  font-size: 14px;
  box-shadow: 0px 0px 10px rgba(0,0,0,0.5);")

interactive_plot <- girafe_options(
  interactive_plot,
  opts_tooltip(css = tooltip_css, use_fill = FALSE),
  opts_selection(type = "multiple", only_shiny = FALSE),
  opts_zoom(min = 0.5, max = 2),
  opts_sizing(rescale = TRUE))

interactive_plot

Income inequality: Gini coefficient before and after tax

Author

Affiliation

Published

Citation