Data Visualization | MSc CSS: Film Dialogue: Screenplay Dialogue by Gender

Sude Boz

Introduction

Recently, Hollywood movies have been criticized for widespread sexism and racism, with most discussions centering on the idea that white men dominate film roles and dialogue. However, these arguments depend mostly on opinion and observation rather than data, which prevents any genuinely informed conversation. How many movies are actually about men? What changes by genre, era, or box-office revenue? What circumstances generate more diversity?

The researchers behind this visualization did not set out to prove anything, but rather to compile real data. They framed it as a census rather than a study. They searched 8,000 screenplays and matched each character’s lines to an actor. From there, they compiled the number of words spoken by male and female characters across roughly 2,000 films.

This project therefore replicates the dialogue-by-gender analysis for Disney films and then improves some aspects of it to give a broader view of gender in the film industry.

Original graph. Figure from The Pudding

In January 2016, researchers reported that men speak more often than women in Disney’s princess films (Anderson & Daniels, 2016). The authors of the visualization validated the claim and doubled the sample size to 30 Disney films, including Pixar. The outcome shows that 22 of 30 Disney films have a male majority of dialogue. Even in films with female leads, such as Mulan, the dialogue swings male: Mushu, her protector dragon, has 50% more words of dialogue than Mulan herself.

The dataset has limitations. As with Mulan, a story may focus on a particular character even if that emphasis isn’t evident in the dialogue. Moreover, the analysis relies on screenplays, which are not exact representations of the final film.

Replication

1. Men Have 60%+ Dialogue

First, in order to create the graph, we load the libraries we are going to use throughout this project:

# Set up environment
library(tidyverse)
library(readr)
library(stringr)
library(ggplot2)
library(scales)
library(magick)
library(cowplot)
library(showtext)
library(ggrepel)
library(ggbeeswarm)
font_add_google("Poppins", "Poppins")
font_add_google("Merriweather", "merriweather")
showtext_auto()

Next, we load the dataset, which we found through Pandas Basics:

#Loading data
char_url  <- "https://raw.githubusercontent.com/matthewfdaniels/scripts/master/character_list5.csv"
meta_url  <- "https://raw.githubusercontent.com/matthewfdaniels/scripts/master/meta_data7.csv"

characters <- read_csv(char_url)
metadata   <- read_csv(meta_url)
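
Because the two tables are later joined on script_id, it can be worth checking that key before relying on the join. The sketch below is only an illustration on made-up tibbles (toy_characters and toy_metadata are hypothetical stand-ins, not the real data): it looks for script_ids that appear more than once in the metadata, which would duplicate rows after a join, and for character rows with no matching metadata, which would end up with a missing title.

```r
library(dplyr)

# Made-up stand-ins for the characters and metadata tables (illustration only)
toy_characters <- tibble(
  script_id = c(1, 1, 2, 3),
  words     = c(100, 200, 300, 50)
)
toy_metadata <- tibble(
  script_id = c(1, 2, 2),
  title     = c("film_a", "film_b", "film_b")
)

# script_ids listed more than once in the metadata
duplicated_ids <- toy_metadata |>
  count(script_id) |>
  filter(n > 1) |>
  pull(script_id)

# Character rows whose script_id has no metadata row
unmatched <- toy_characters |>
  anti_join(toy_metadata, by = "script_id")

print(duplicated_ids)
print(unmatched)
```

The same two checks could be run on characters and metadata once they are loaded; whether they find anything depends on the data itself.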

# Data Processing
film_order <- c(
  "The Jungle Book", "Monsters, Inc.", "Up", "Toy Story", "The Rescuers Down Under", 
  "Aladdin", "Holes", "Cars 2", "The Lion King", "Ratatouille", 
  "Something Wicked This Way Comes", "Monsters University", "Hercules", "Hunchback Of Notre Dame", 
  "Toy Story 3", "Finding Nemo", "Mulan", "Star Wars: Episode VII - The Force Awakens", 
  "Beauty And The Beast", "The Little Mermaid", "Wreck-It Ralph", "Mighty Joe Young", 
  "Pocahontas") # ordering films manually 

# Defining custom vectors to manually resolve sorting ties in percentages
order_87_percent <- c(
  "The Lion King", "Ratatouille", "Something Wicked This Way Comes", "Monsters University"
  )

# Order for films around 75%
order_75_percent <- c(
  "Toy Story 3", "Finding Nemo", "Mulan"
)


dialogue_data <- characters |>
  # Keep only male and female characters
  filter(gender %in% c("m", "f")) |>

  # Total words spoken per film (script_id) and gender
  group_by(script_id, gender) |>
  summarise(words = sum(words, na.rm = TRUE), .groups = 'drop') |>

  # Pivot to create separate columns for male and female word counts
  pivot_wider(
    names_from = gender,
    values_from = words,
    values_fill = 0,
    names_prefix = "words_"
  ) |>

  # Total words and each gender's share (proportions between 0 and 1)
  mutate(
    words_total = words_m + words_f,
    male_percent = words_m / words_total,
    female_percent = words_f / words_total
  ) |>
  filter(words_total > 0) |>

  # Join metadata on script_id (the key) to get the film titles
  left_join(metadata, by = "script_id") |>
  filter(!is.na(title)) |>

  # Keep only the films listed in film_order
  filter(title %in% film_order) |>

  # Manual rank intended to break ties between films with the same percentage
  mutate(
    custom_rank = case_when(
      title %in% order_87_percent ~ match(title, order_87_percent),
      title %in% order_75_percent ~ match(title, order_75_percent),
      TRUE ~ 999
    )
  ) |>

  # Sort by male percentage (descending)
  arrange(desc(male_percent))

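# Final data preparation for the plot.
# Percentages are rounded before sorting so that films with the same displayed
# value actually tie and custom_rank can decide their order.
top_dialogue_data <- dialogue_data |>
  arrange(desc(round(male_percent, 2)), custom_rank)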

plot_data <- top_dialogue_data |>
  select(title, female_percent, male_percent) |>
  
  # Converting data back to long format for ggplot stacked bar chart
pivot_longer(
  cols= c(male_percent, female_percent),
  names_to = "gender",
  values_to = "percentage"
) |> 
  
  # Set gender as a factor with specific levels for color mapping and stacking order
mutate(
  gender = factor(gender,
                  levels = c("female_percent", "male_percent"),
                  labels = c("Female Dialogue", "Male Dialogue"))
  )

# Critical part: Setting movie titles as a factor with levels in the desired chart order.

plot_data$title <- factor(
  plot_data$title,
  levels = rev(unique(top_dialogue_data$title))
)



# Color palettes and gradients
palette <- c("#356df4", "#dba0a0", "#a9bffe", "#2b2b2b", "#e5e8f2", 
             "#858485", "#517fed", "#ccbaba", "#2261fa")

# Number of films
n_films <- length(levels(plot_data$title))

# Gradient blues for male dialogue 
male_colors <- colorRampPalette(
  c("#a9bffe","#517fed", "#2261fa","#356df4")
)(n_films)

# Gradient peach/pink for female dialogue 
female_colors <- colorRampPalette(
  c("#dba0a0", "#ccbaba")
)(n_films)

#Assigning gradient color per film
plot_data <- plot_data |>
  mutate(
    title_index = as.integer(title),
    fill_color  = ifelse(
      gender == "Male Dialogue",
      male_colors[title_index],
      female_colors[title_index]
    )
  )


#Visualization


fifty_fifty_line <- 0.50

men_graph<-
  ggplot(plot_data, aes(x = percentage, y = title, fill = fill_color)) +

  geom_bar(stat = "identity", position = position_stack(reverse = TRUE), width = 0.8) + #creating stack horizontal bar chart
  geom_vline(xintercept = fifty_fifty_line, 
             linetype = "solid", 
             color = "#858485", 
             linewidth = 0.5) + # Adding the 50/50 vertical reference line
  # Special label for "98% MALE DIALOGUE"
  geom_text(
    data = top_dialogue_data |>
      mutate(label = case_when(
        round(male_percent, 2) == 0.98 ~ "98% MALE\nDIALOGUE",
        TRUE ~ NA_character_
      )) |>
      filter(!is.na(label)),
    aes(x = 1.01, y = 20.8, label = label),
    inherit.aes = FALSE,
    color = "#356df4", 
    hjust = 0, 
    size = 2.3,
    fontface = "bold",
    family = "Poppins"
  ) +
  # Add percentage labels (90%, 87%, 75%, 69%)
  geom_text(
    data = top_dialogue_data |> 
      mutate(label = case_when(
        round(male_percent, 2) == 0.98 ~ NA_character_,
        round(male_percent, 2) == 0.90 ~ "90%",
        round(male_percent, 2) == 0.87 & title == "Ratatouille" ~ "87%",
        round(male_percent, 2) == 0.75 & title == "Mulan" ~ "75%",
        round(male_percent, 2) == 0.69 ~ "69%",
        TRUE ~ NA_character_
      )) |> 
      filter(!is.na(label)),
    aes(x = 1.02, 
        y = title,
        label = label),
    inherit.aes = FALSE,
    color = "#356df4",
    hjust = 0,
    size = 3,
    fontface = "bold"
  ) +
  
    scale_fill_identity() + #using identity scale for gradient colors
  
  # Y-axis configuration
   scale_y_discrete(limits = rev(unique(top_dialogue_data$title)), 
                   expand = expansion(add = 0)) +
  # X-axis configuration with 50/50 at top
  scale_x_continuous(labels = c("50/50" = "50/50"), 
                     breaks = c(0.5),  
                     limits = c(0, 1.08),
                     position = "top") + 
  
  # Set titles and labels
  labs(
    title = "Men have 60%+ Dialogue",
    x = "",
    y = "",
    fill = ""
  ) +
  
  # Applying and customizing the theme
  theme_minimal() +
  theme(
    plot.title = element_text(family = "merriweather", color = "#356df4", 
                              face = "bold", hjust = 0.5, 
                              margin = margin(b = 12, unit = "pt"), size = 16),
    axis.text.x = element_text(size = 10, color = "black", 
                               margin = margin(b = 12, unit = "pt")),
    axis.text.y = element_text(family = "merriweather", size = 12, 
                               color = "black", face = "bold"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    legend.position = "none",
    plot.margin = margin(10, 20, 10, 0)
  )
men_graph

2. Gender Balance, +/- 10%

This graph shows the four films whose dialogue falls within +/- 10 percentage points of an even gender split: Frozen, The Incredibles, Into the Woods, and Tarzan.
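
The three chart groups in this project correspond to a simple rule on a film's male share of dialogue. As a minimal sketch, assuming shares are expressed as proportions between 0 and 1 (the helper name classify_balance and the toy values are hypothetical and not part of the original analysis), the rule could be written as:

```r
# Hypothetical helper: label a film by its male share of dialogue (0 to 1).
# A share of exactly 0.6 counts as "Men have 60%+", and exactly 0.4 as
# "Women have 60%+", mirroring the >= 60 filter used for the women's chart.
classify_balance <- function(male_share) {
  ifelse(
    male_share >= 0.6, "Men have 60%+",
    ifelse(male_share <= 0.4, "Women have 60%+", "Gender balance, +/- 10%")
  )
}

# Made-up shares, for illustration only
toy_shares <- c(0.98, 0.57, 0.36)
toy_groups <- classify_balance(toy_shares)
print(data.frame(male_share = toy_shares, group = toy_groups))
```

In the replication itself the films for each chart are picked by title, so this rule is only a way to read the chart headings, not a step in the pipeline.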

We start by setting up the data and the list of target films:

gender_film <- 
  metadata |> filter(title %in%
  c( "Frozen", "The Incredibles", "Into the Woods", "Tarzan")
)

# Target list of films for the "Gender Balance" chart

target_movies <- c("Frozen", "The Incredibles", "Into the Woods", "Tarzan")

# Data processing and joining
# Filter metadata to get script_ids for the target movies
gender_film_ids <- 
  metadata |> 
  filter(title %in% target_movies)  |> 
   select(script_id, title)       # Selecting only the join key and the title

# Calculating dialogue percentages for the target movies
dialogue_data_balanced <- 
  characters |> 
  inner_join(gender_film_ids, by = "script_id") |>  # Join with filtered metadata to include only target movies
  filter(gender %in% c("m", "f")) |> # filtering male and female genders
  group_by(title, gender) |> 
  summarise(                         # Sum the word counts for dialogue
    total_words = sum(words, na.rm = TRUE),
    .groups = 'drop_last'            # Maintain grouping by title
  ) |> 
  mutate(                            # Calculating the percentages
    movie_total_words = sum(total_words),
    percentage = total_words / movie_total_words * 100
  ) |> 
  ungroup()

#Plot data preparation
plot_data_balance <- dialogue_data_balanced |> 
  mutate(
    # Setting movie titles as a factor with specific levels for consistent sorting
    title = factor(title, levels = c("Frozen", "The Incredibles", "Into the Woods", "Tarzan")),
    # Set gender as a factor
    gender = factor(gender, levels = c("f", "m"))
  )

# Extract and round the percentages for the Frozen label
frozen_male_perc <- round(plot_data_balance[plot_data_balance$title == "Frozen" & plot_data_balance$gender == "m", ]$percentage[1])
frozen_female_perc <- round(plot_data_balance[plot_data_balance$title == "Frozen" & plot_data_balance$gender == "f", ]$percentage[1])

#Color Palette

palette_balance <- c("#7193e4", "#6a8fe6","#7c9ae3", "#e67e7e", "#272727", "#dad9dc", 
                     "#7798e3", "#818082", "#e77b7b", "#e28a8a", "#e48483")

# Number of films
n_films_balance <- length(levels(plot_data_balance$title))

# Gradient blues for male dialogue
male_colors_balance <- colorRampPalette(
  c("#6a8fe6","#7193e4", "#7798e3", "#7c9ae3")
)(n_films_balance)

# Gradient pink/red for female dialogue
female_colors_balance <- colorRampPalette(
  c( "#e28a8a","#e48483","#e67e7e", "#e77b7b")
)(n_films_balance)

# Assign gradient colors per film and gender
plot_data_balance <- plot_data_balance |>
  mutate(
    title_index = as.integer(title),
    fill_color = ifelse(
      gender == "m",
      male_colors_balance[title_index],
      female_colors_balance[title_index]
    )
  )


#Visualization


gender_balance_graph<- 
  ggplot(plot_data_balance, aes(x = percentage, y = title, fill = fill_color)) +
  
  geom_bar(stat = "identity",position =  position_stack(reverse = TRUE), width = 0.7) +  #Creating the horizontal stacked bars
  
   geom_vline(xintercept = 50, 
              linetype = "solid",
              color = "#818082", 
              linewidth = 0.5) +
  
  scale_fill_identity() +  
  
  # Adding the Two-Color Percentage Label for Frozen
  geom_text(
    data = plot_data_balance |>  
      filter(title == "Frozen" & gender == "m"),
    label = paste0(frozen_male_perc, "%"),
    x = 101, 
    color = "#2362fa",  
    fontface = "bold",
    size = 4,
    hjust = 0
  ) +
  
  geom_text(
    data = plot_data_balance |> 
      filter(title == "Frozen" & gender == "f"),
    label = paste0(" / ", frozen_female_perc, "%"),
    x = 110, 
    color = "#e94b55",  
    fontface = "bold",
    size = 4,
    hjust = 0
  ) +
  
  # Add the "50/50" text annotation
  annotate("text", x = 50, y = 4.5, label = "50/50", color = "#818082", size = 4) +
  
  # Extend X-axis limits
  scale_x_continuous(limits = c(0, 115), position = "top") +
  
  # Reversing the Y-axis order
  scale_y_discrete(limits = rev(levels(plot_data_balance$title))) +
  
  # Titles and Labels
  labs(
    title = "Gender Balance, +/- 10%",
    y = NULL,
    x = NULL
  ) +
  
  # Theme Customization
  theme_minimal() + 
  theme(
    legend.position = "none",
    plot.title = element_text(family = "merriweather",
                              hjust = 0.5, 
                              size = 18, 
                              face = "bold", 
                              color = "#272727"),  
    axis.text.x = element_blank(), 
    axis.text.y = element_text(family = "merriweather",
                               size = 12, 
                               face = "bold", 
                               color = "#272727"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    plot.margin = margin(10, 50, 10, 10) 
  )
gender_balance_graph

3. Women Have 60%+ Dialogue

Lastly, we build a graph of the four films in which women have 60% or more of the dialogue: Inside Out, Alice in Wonderland, Maleficent, and Sleeping Beauty.

target_women_film <-
  c("Inside Out", "Alice in Wonderland", "Maleficent", "Sleeping Beauty")

# Calculating Dialogue Data (Reusing the structure from the Men's chart) 
# which ensures we have male_percent and female_percent columns

dialogue_data_women <- characters |> 
  filter(gender %in% c("f", "m")) |> 
  group_by(script_id, gender) |> 
  summarise(words = sum(words, na.rm = TRUE), .groups = 'drop') |> 
  pivot_wider(
    names_from = gender,
    values_from = words,
    values_fill = 0,
    names_prefix = "words_"
  ) |> 
  mutate(
    words_total = words_m + words_f,
    male_percent = words_m / words_total * 100,      # Calculate as percentages (0-100)
    female_percent = words_f / words_total * 100
  ) |> 
  filter(words_total > 0) |> 
  left_join(metadata, by = "script_id") |> 
  filter(!is.na(title)) |> 
  
  #Critical Filter: Keeping only movies where women have 60% or more dialogue
  filter(female_percent >= 60 & title %in% target_women_film) |>
  arrange(desc(female_percent)) # Sort by descending female percentage (highest dominance at the top)

#---- Color Palette--
palette_women <- c("#8ea5de", "#8fa7dc", "#95aadb", "#97abda",
                   "#ed6a6a", "#ed6969", "#ef6465", "#ef6262",
                   "#ff3334", "#9b9aa1", "#050505", "#2562fa", 
                   "#e94953")
n_films_women <- length(unique(dialogue_data_women$title))

male_colors_women <- colorRampPalette(
  c("#8ea5de", "#8fa7dc", "#95aadb", "#97abda")
)(n_films_women)

female_colors_women <- colorRampPalette(
  c("#ed6a6a", "#ed6969", "#ef6465", "#ef6262")
)(n_films_women)

# --- Preparing for Plotting ---
plot_data_women <- dialogue_data_women |> 
  select(title, female_percent, male_percent) |> 
  pivot_longer(
    cols = c(male_percent, female_percent),
    names_to = "gender",
    values_to = "percentage"
  ) |> 
  mutate(
    gender = factor(gender, 
                    levels = c("female_percent", "male_percent"), 
                    labels = c("Female Dialogue", "Male Dialogue"))
  )

# Set factor levels for the Y-axis
plot_data_women$title <- factor(plot_data_women$title, 
                                levels = rev(unique(dialogue_data_women$title)))

# Assigning gradient colors per film and gender
plot_data_women <- plot_data_women |>
  mutate(
    title_index = as.integer(title),
    fill_color = ifelse(
      gender == "Male Dialogue",
      male_colors_women[title_index],
      female_colors_women[title_index]
    )
  )

# --- Extract Custom Label Variables (Inside Out) ---
inside_out_male_perc <- round(dialogue_data_women[dialogue_data_women$title == "Inside Out", ]$male_percent[1])
inside_out_female_perc <- round(dialogue_data_women[dialogue_data_women$title == "Inside Out", ]$female_percent[1])

if (is.na(inside_out_male_perc)) {
  inside_out_male_perc <- 36
  inside_out_female_perc <- 64 
}

fifty_fifty_line <- 50


#Visualization


women_graph <-
  ggplot(plot_data_women, aes(x = percentage, y = title, fill = fill_color)) +
  
  # Creating the stacked horizontal bar chart
  geom_bar(stat = "identity", position = position_stack(reverse = TRUE), width = 0.8) +
  
  # Add the 50/50 vertical reference line
  geom_vline(xintercept = fifty_fifty_line, 
             linetype = "solid", 
             color = "#9b9aa1", 
             linewidth = 0.5) +
  
  #scale for gradient colors
  scale_fill_identity() +
  
  # Adding the two-color percentage 
  # Male percentage label
  geom_text(
    data = plot_data_women |> 
      filter(title == "Inside Out" & gender == "Male Dialogue"),
    label = paste0(inside_out_male_perc, "%"),
    x = 101, 
    color = "#2562fa", 
    fontface = "bold",
    size = 4,
    hjust = 0
  ) +
  
  # Female percentage label
  geom_text(
    data = plot_data_women |> 
      filter(title == "Inside Out" & gender == "Female Dialogue"),
    label = paste0(" / ", inside_out_female_perc, "%"),
    x = 109, 
    color = "#e94953", 
    fontface = "bold",
    size = 4,
    hjust = 0
  ) +
  
# Extending x-axis limits
   
  scale_x_continuous(labels = c("50/50" = "50/50"), 
                     breaks = c(50),  # percentages here run from 0 to 100
                     limits = c(0, 115),
                     position = "top") + 
    # Add the "50/50" text annotation
  annotate("text", x = 50, y = 4.55, label = "50/50", color = "#9b9aa1", size = 4) +
  
  # Y-axis order
  scale_y_discrete(limits = rev(levels(plot_data_women$title))) +
  
  # Titles and labels
  labs(
    title = "Women have 60%+ Dialogue",
    x = "", 
    y = "", 
    fill = "" 
  ) +
  
  # Theme customization
  theme_minimal() +
  theme(
    plot.title = element_text(family = "merriweather", color = "#ff3334", face = "bold", hjust = 0.5, size = 16),
    axis.text.x = element_blank(),
    axis.text.y = element_text(family = "merriweather", size = 12, color = "black", face = "bold"), 
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(), 
    legend.position = "none", 
    plot.margin = margin(10, 50, 10, 10) 
  )

women_graph

Improvement

Men graph improvement
Women graph improvement
Creating a new graph: Cinematic Gender Dialogue Evolution

1.a Men Have 60%+ Dialogue IMPROVEMENT

Since this graph aims to support the criticism that Disney movies are gendered, as seen in the dialogue of male and female characters, I believe it can be improved by adding the top-speaking male and female character of each movie, which helps show who accounts for each gender’s share.
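
Finding the top speaker per gender comes down to a grouped slice_max() followed by a pivot to one row per film. The following is a minimal sketch on made-up rows (toy_characters and its character names are hypothetical), using the same column names as the characters table:

```r
library(dplyr)
library(tidyr)

# Made-up character rows for two hypothetical films
toy_characters <- tibble(
  script_id           = c(1, 1, 1, 2, 2, 2),
  imdb_character_name = c("hero", "sidekick", "heroine", "queen", "king", "fairy"),
  gender              = c("m", "m", "f", "f", "m", "f"),
  words               = c(1200, 1800, 900, 2000, 400, 700)
)

top_speakers <- toy_characters |>
  group_by(script_id, gender) |>
  # Keep the single character with the most words in each film/gender group
  slice_max(words, n = 1, with_ties = FALSE) |>
  ungroup() |>
  select(script_id, gender, imdb_character_name) |>
  # One row per film, with columns top_f and top_m
  pivot_wider(
    names_from = gender,
    values_from = imdb_character_name,
    names_prefix = "top_"
  )

print(top_speakers)
```

Setting with_ties = FALSE guarantees one name per group, so the pivot yields exactly one row per film.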

film_order <- c(
  "The Jungle Book", "Monsters, Inc.", "Up", "Toy Story", "The Rescuers Down Under", 
  "Aladdin", "Holes", "Cars 2", "The Lion King", "Ratatouille", 
  "Something Wicked This Way Comes", "Monsters University", "Hercules", "Hunchback Of Notre Dame", 
  "Toy Story 3", "Finding Nemo", "Mulan", "Star Wars: Episode VII - The Force Awakens", 
  "Beauty And The Beast", "The Little Mermaid", "Wreck-It Ralph", "Mighty Joe Young", 
  "Pocahontas"
)

order_87_percent <- c("The Lion King", "Ratatouille", 
                      "Something Wicked This Way Comes", "Monsters University")
order_75_percent <- c("Toy Story 3", "Finding Nemo", "Mulan")


#New step for improvement : Calculating top characters

#Find the name of the top speaking Male and Female for each script
top_chars_by_movie <- characters |> 
  filter(gender %in% c("m", "f")) |> 
  # Join metadata early so we can filter by the film list
  left_join(metadata |> select(script_id, title), by = "script_id") |> 
  filter(title %in% film_order) |> 
  group_by(script_id, title, gender, imdb_character_name) |> 
  summarise(words = sum(words, na.rm = TRUE), .groups = 'drop') |> 
  group_by(script_id, gender) |> 
  slice_max(words, n = 1, with_ties = FALSE) |> # Get top 1 per gender
  ungroup() |> 
  select(script_id, gender, imdb_character_name) |> 
  pivot_wider(
    names_from = gender,
    values_from = imdb_character_name,
    names_prefix = "top_"
  )


#Main data processing

dialogue_data <- characters |> 
  filter(gender %in% c("m", "f")) |> 
  group_by(script_id, gender) |> 
  summarise(words = sum(words, na.rm = TRUE), .groups = 'drop') |> 
  pivot_wider(
    names_from = gender, values_from = words, values_fill = 0, names_prefix = "words_"
  ) |> 
  mutate(
    words_total = words_m + words_f,
    male_percent = words_m / words_total,
    female_percent = words_f / words_total
  ) |> 
  filter(words_total > 0) |> 
  left_join(metadata, by = "script_id") |> 
  left_join(top_chars_by_movie, by = "script_id") |>  # Join the TOP CHARACTERS calculated above
  filter(!is.na(title)) |> 
  filter(title %in% film_order) |> 
  mutate(
    custom_rank = case_when(
      title %in% order_87_percent ~ match(title, order_87_percent),
      title %in% order_75_percent ~ match(title, order_75_percent),
      TRUE ~ 999
    )
  ) |> 
  arrange(desc(male_percent))


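# Round before sorting so that films with the same displayed percentage tie
# and custom_rank decides their order
top_dialogue_data <- dialogue_data |>
  arrange(desc(round(male_percent, 2)), custom_rank)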


plot_data <- top_dialogue_data |>
  select(title, female_percent, male_percent) |>
  pivot_longer(
    cols = c(male_percent, female_percent),
    names_to = "gender",
    values_to = "percentage"
  ) |> 
  mutate(
    gender = factor(gender,
                    levels = c("female_percent", "male_percent"),
                    labels = c("Female Dialogue", "Male Dialogue"))
  )


plot_data$title <- factor(
  plot_data$title,
  levels = rev(unique(top_dialogue_data$title))
)


#Color Palette

n_films <- length(levels(plot_data$title))

male_colors <- colorRampPalette(c("#a9bffe","#517fed", "#2261fa","#356df4"))(n_films)

female_colors <- colorRampPalette(c("#dba0a0", "#ccbaba"))(n_films)

plot_data <- plot_data |>
  mutate(
    title_index = as.integer(title),
    fill_color  = ifelse(
      gender == "Male Dialogue",
      male_colors[title_index],
      female_colors[title_index]
    )
  )



#Visualization


fifty_fifty_line <- 0.50

men_graph_imp <- ggplot(plot_data, aes(x = percentage, y = title, fill = fill_color)) +
  
  #Stacked Bars (Reverse = TRUE puts Male/Blue on the Left)
  geom_bar(stat = "identity", position = position_stack(reverse = TRUE), width = 0.8) + 
  
  #50/50 Line
  geom_vline(xintercept = fifty_fifty_line, linetype = "solid", color = "#858485", linewidth = 0.5) + 
  

  #Adding male character names 
  geom_text(
    data = top_dialogue_data,
    aes(x = 0.49, y = title, label = top_m), 
    inherit.aes = FALSE,
    color = "white", hjust = 1, size = 2, fontface = "bold", family = "Poppins"
  ) +
  
  #Adding female character names 
  geom_text(
    data = top_dialogue_data,
    aes(x = 0.51, y = title, label = top_f), 
    inherit.aes = FALSE,
    color = "white", hjust = 0, size = 2, fontface = "bold", family = "Poppins"
  ) +

  #Special label for "98% MALE DIALOGUE"
  geom_text(
    data = top_dialogue_data |>
      mutate(label = case_when(
        round(male_percent, 2) == 0.98 ~ "98% MALE\nDIALOGUE",
        TRUE ~ NA_character_
      )) |>
      filter(!is.na(label)),
    aes(x = 1.01, y = 20.8, label = label),
    inherit.aes = FALSE,
    color = "#356df4", hjust = 0, size = 2.3, fontface = "bold", family = "Poppins"
  ) +
  
  #Add percentage labels (90%, 87%, 75%, 69%)
  geom_text(
    data = top_dialogue_data |> 
      mutate(label = case_when(
        round(male_percent, 2) == 0.98 ~ NA_character_,
        round(male_percent, 2) == 0.90 ~ "90%",
        round(male_percent, 2) == 0.87 & title == "Ratatouille" ~ "87%",
        round(male_percent, 2) == 0.75 & title == "Mulan" ~ "75%",
        round(male_percent, 2) == 0.69 ~ "69%",
        TRUE ~ NA_character_
      )) |> 
      filter(!is.na(label)),
    aes(x = 1.02, y = title, label = label),
    inherit.aes = FALSE,
    color = "#356df4", hjust = 0, size = 3, fontface = "bold", family = "Poppins"
  ) +
  
  # Scales
  scale_fill_identity() + 
  scale_y_discrete(limits = rev(unique(top_dialogue_data$title)), expand = expansion(add = 0)) +
  scale_x_continuous(labels = c("50/50" = "50/50"), breaks = c(0.5), limits = c(0, 1.15), position = "top") + 
  
  # Titles and labels
  labs(
    title = "Men have 60%+ Dialogue",
    x = "", 
    y = "", 
    fill = ""
  ) +
  
  # Theme
  theme_minimal() +
  theme(
    plot.title = element_text(family = "merriweather",
                              color = "#356df4",
                              face = "bold", 
                              hjust = 0.5, 
                              margin = margin(b = 12, unit = "pt"), 
                              size = 16),
    axis.text.x = element_text(size = 5, color = "black", 
                               margin = margin(b = 12, unit = "pt"), 
                               family = "Poppins"),
    axis.text.y = element_text(family = "merriweather", 
                               size = 12, 
                               color = "black", 
                               face = "bold"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    legend.position = "none",
    plot.margin = margin(10, 20, 10, 0)
  )

men_graph_imp

3.a Women Have 60%+ Dialogue IMPROVEMENT

The same process is applied here; however, unlike the men’s graph, we also add the release year of each movie.

# Data Preparation

target_women_film <- c("Inside Out", "Alice in Wonderland", "Maleficent", "Sleeping Beauty")

# Calculate top characters for women films 
# Find the name of the top speaking Male and Female for each script in the target list
top_chars_women <- characters |> 
  filter(gender %in% c("m", "f")) |> 
  left_join(metadata |> select(script_id, title), by = "script_id") |> 
  filter(title %in% target_women_film) |> 
  group_by(script_id, title, gender, imdb_character_name) |> 
  summarise(words = sum(words, na.rm = TRUE), .groups = 'drop') |> 
  group_by(script_id, gender) |> 
  slice_max(words, n = 1, with_ties = FALSE) |> 
  ungroup() |> 
  select(script_id, gender, imdb_character_name) |> 
  pivot_wider(
    names_from = gender,
    values_from = imdb_character_name,
    names_prefix = "top_"
  )

#Calculating Dialogue data
dialogue_data_women <- characters |> 
  filter(gender %in% c("f", "m")) |> 
  group_by(script_id, gender) |> 
  summarise(words = sum(words, na.rm = TRUE), .groups = 'drop') |> 
  pivot_wider(
    names_from = gender,
    values_from = words,
    values_fill = 0,
    names_prefix = "words_"
  ) |> 
  mutate(
    words_total = words_m + words_f,
    male_percent = words_m / words_total * 100,      
    female_percent = words_f / words_total * 100
  ) |> 
  filter(words_total > 0) |> 
  left_join(metadata, by = "script_id") |> 
  # JOIN TOP CHARACTERS HERE
  left_join(top_chars_women, by = "script_id") |>
  filter(!is.na(title)) |> 
  filter(female_percent >= 60 & title %in% target_women_film) |>
  arrange(desc(female_percent)) 

# Color Palette

palette_women <- c("#8ea5de", "#8fa7dc", "#95aadb", "#97abda",
                   "#ed6a6a", "#ed6969", "#ef6465", "#ef6262",
                   "#ff3334", "#9b9aa1", "#050505", "#2562fa", 
                   "#e94953")

n_films_women <- length(unique(dialogue_data_women$title))

male_colors_women <- colorRampPalette(c("#8ea5de", "#8fa7dc", "#95aadb", "#97abda"))(n_films_women)
female_colors_women <- colorRampPalette(c("#ed6a6a", "#ed6969", "#ef6465", "#ef6262"))(n_films_women)



#Preparing plot data
plot_data_women <- dialogue_data_women |> 
  select(title, female_percent, male_percent) |> 
  pivot_longer(
    cols = c(male_percent, female_percent),
    names_to = "gender",
    values_to = "percentage"
  ) |> 
  mutate(
    gender = factor(gender, 
                    levels = c("female_percent", "male_percent"), 
                    labels = c("Female Dialogue", "Male Dialogue"))
  )

# Setting factor levels for Y-axis
plot_data_women$title <- factor(plot_data_women$title, levels = rev(unique(dialogue_data_women$title)))

# Assigning gradient colors
plot_data_women <- plot_data_women |>
  mutate(
    title_index = as.integer(title),
    fill_color = ifelse(
      gender == "Male Dialogue",
      male_colors_women[title_index],
      female_colors_women[title_index]
    )
  )

# Extract labels for "Inside Out"
inside_out_male_perc <- round(dialogue_data_women[dialogue_data_women$title == "Inside Out", ]$male_percent[1])
inside_out_female_perc <- round(dialogue_data_women[dialogue_data_women$title == "Inside Out", ]$female_percent[1])

if (is.na(inside_out_male_perc)) {
  inside_out_male_perc <- 36
  inside_out_female_perc <- 64 
}

fifty_fifty_line <- 50



#Visualization



women_graph_imp <- ggplot(plot_data_women, aes(x = percentage, y = title, fill = fill_color)) +
  
 geom_bar(stat = "identity", position = position_stack(reverse = TRUE), width = 0.8) +
  
  # 50/50 Line
  geom_vline(xintercept = fifty_fifty_line, linetype = "solid", color = "#9b9aa1", linewidth = 0.5) +
  
  # Adding years 
  geom_text(
    data = dialogue_data_women,
    aes(x = 0.5, y = title, label = year), 
    inherit.aes = FALSE,
    color = "white", hjust = 0, size = 3, family = "Poppins", fontface = "bold"
  ) +
  
  # Adding male character names (left of the 50/50 line, right-aligned)
  geom_text(
    data = dialogue_data_women,
    aes(x = 49, y = title, label = top_m), 
    inherit.aes = FALSE,
    color = "white", hjust = 1, size = 3, family = "Poppins", fontface = "bold"
  ) +
  
  # Adding female character names (right of the 50/50 line, left-aligned)
  geom_text(
    data = dialogue_data_women,
    aes(x = 51, y = title, label = top_f), 
    inherit.aes = FALSE,
    color = "white", hjust = 0, size = 3, family = "Poppins", fontface = "bold"
  ) +

  # Male percentage label 
  geom_text(
    data = plot_data_women |> filter(title == "Inside Out" & gender == "Male Dialogue"),
    label = paste0(inside_out_male_perc, "%"),
    x = 101, color = "#2562fa", fontface = "bold", size = 4, hjust = 0, family = "Poppins"
  ) +
  
  # Female percentage label 
  geom_text(
    data = plot_data_women |> filter(title == "Inside Out" & gender == "Female Dialogue"),
    label = paste0(" / ", inside_out_female_perc, "%"),
    x = 109, color = "#e94953", fontface = "bold", size = 4, hjust = 0, family = "Poppins"
  ) +
  
  # Scales and labels
  scale_fill_identity() +
  scale_x_continuous(labels = c("50/50" = "50/50"), breaks = c(50), limits = c(0, 120), position = "top") + 
  
  # 50/50 Text
  annotate("text", x = 50, y = 4.55, label = "50/50", color = "#9b9aa1", size = 4, family = "Poppins") +
  
  scale_y_discrete(limits = rev(levels(plot_data_women$title))) +
  
  labs(
    title = "Women have 60%+ Dialogue",
    x = "", y = "", fill = ""
  ) +
  
  theme_minimal() +
  theme(
    plot.title = element_text(family = "merriweather", color = "#ff3334", face = "bold", hjust = 0.5, size = 16),
    axis.text.x = element_blank(), # Hide x-axis text; the 50/50 label is drawn with annotate()
    axis.text.y = element_text(family = "merriweather", size = 12, color = "black", face = "bold"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    legend.position = "none",
    plot.margin = margin(10, 50, 10, 10 )
  )

print(women_graph_imp)

Beeswarm Trend Plot: Cinematic Gender Dialogue Evolution

This improved year graph serves as a longitudinal trend analysis of film dialogue. While the original graphs highlight individual disparities in popular films, this version uses a beeswarm trend plot to give a bigger picture of the film industry’s evolution over nearly a century. The improved graph has three purposes:

Wider industry context: Unlike the fixed list of 23 films, this graph plots every available movie in the dataset to show whether male-dominated dialogue is a universal standard or an exception.
Decadal density: Films are grouped into decades, and the thickness of each “swarm” shows how the volume of films produced has grown, while highlighting that the “center of gravity” of dialogue has not shifted much since the 1930s.
Visualization process: The 50/50 line serves as a permanent benchmark for equality. It lets us see that even the outliers (the films with the most female dialogue) often barely reach parity.

# Data preparation
# Processing dialogue data and grouping years into decades for better flow
beeswarm_data <- characters |> 
  filter(gender %in% c("m", "f")) |> 
  group_by(script_id, gender) |> 
  summarise(words = sum(words, na.rm = TRUE), .groups = 'drop') |> 
  pivot_wider(names_from = gender, values_from = words, values_fill = 0) |> 
  left_join(metadata |> select(script_id, title, year), by = "script_id") |> 
  mutate(
    male_percent = m / (m + f),
    decade = floor(year / 10) * 10
  ) |> 
  filter(!is.na(year), !is.na(title))


#  Identify outliers and manual positioning

# Selecting specific films to label and define manual "nudges" for text placement
outliers_manual <- beeswarm_data |>
  group_by(decade) |>
  slice_min(order_by = male_percent, n = 1, with_ties = FALSE) |> 
  ungroup() |>
  mutate(
    # Vertical nudge (Y-axis)
    nudge_y_val = case_when(
      title == "Mulan" ~ -0.05,
      title == "The Little Mermaid" ~ -0.08,
      title == "Frozen" ~ -0.02,
      TRUE ~ -0.04 # Default value for others
    ),
    # Horizontal nudge (X-axis)
    nudge_x_val = case_when(
      title == "Mulan" ~ 0.2,
      title == "The Little Mermaid" ~ -0.3,
      TRUE ~ 0 # Stay centered by default
    )
  )


#  Visualization

year_graph <-
  ggplot(beeswarm_data, aes(x = factor(decade), y = male_percent, color = male_percent)) +

  geom_hline(yintercept = 0.5, linetype = "dashed", color = "#858485", alpha = 0.6) +
  
  # Distributing points using the organic "Beeswarm" pattern
  geom_quasirandom(alpha = 0.5, size = 2, width = 0.4) + 
  
  # Manually adjusted label layer
  geom_text_repel(
    data = outliers_manual,
    aes(label = title),
    nudge_x = outliers_manual$nudge_x_val, # Use manual X shifts
    nudge_y = outliers_manual$nudge_y_val, # Use manual Y shifts
    size = 3.5,
    color = "black",
    fontface = "bold",
    segment.color = 'grey50',
    direction = "both",      # Allow movement in both directions to resolve overlaps
    min.segment.length = 0   # Lines will always be drawn to the points
  ) +
  
  scale_color_gradient2(
    low = "#dba0a0", 
    mid = "#d1d1d1", 
    high = "#356df4", 
    midpoint = 0.5,
    labels = percent_format(),
    # legend bar
    guide = guide_colorbar(
      title = "Male %",
      title.position = "top",
      barwidth = 1,
      barheight = 10,
      nbin = 50
    ) 
  ) +
  scale_y_continuous(labels = percent_format()) +
  theme_minimal(base_family = "Poppins") +
  labs(
    title = "Cinematic Gender Dialogue Evolution",
    subtitle = "Labeled films represent the most female-dominant scripts of each decade.",
    x = "Decade",
    y = "Male Dialogue %"
  ) +
  theme(
    legend.position = "right",
    legend.title = element_text(size = 9, face = "bold"),
    legend.text = element_text(size = 8),
    panel.grid.major.x = element_blank(),
    plot.title = element_text(face = "bold", size = 16, color = "#356df4")
  )
year_graph

Key findings about “Cinematic Gender Dialogue Evolution”

Across nearly 100 years of the film industry, the vast majority of films cluster between 60% and 90% male dialogue. Male-dominated scripts are therefore the “norm” rather than an exception, which may point to a “Glass Ceiling” for dialogue. Furthermore, even in films centered on female protagonists, such as Mulan and Pocahontas, the supporting and secondary characters remain overwhelmingly male, often resulting in men speaking over 70% of the words. Another finding is the industry’s resistance to change over the decades: while the volume of films has exploded since the 1990s, the “center” of each cluster has not moved significantly toward the 50/50 line, suggesting that modern cinema still struggles with gender parity in scripts. Lastly, there is the power of the outliers. The labeled outliers are evidence that balanced scripts are possible; however, their rarity emphasizes how far the industry is from achieving consistent gender equality in storytelling.
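The observation that the “center” of each decade’s cluster has barely moved is read off the plot by eye; it could also be checked numerically. The following is a minimal sketch, not part of the original analysis. It assumes a data frame shaped like beeswarm_data (columns decade and male_percent); the toy_data values are invented purely for illustration and the names toy_data and decade_summary are hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for beeswarm_data.
# The values are invented for illustration, not taken from the real dataset.
toy_data <- data.frame(
  title        = c("A", "B", "C", "D", "E", "F"),
  decade       = c(1980, 1980, 1980, 2010, 2010, 2010),
  male_percent = c(0.90, 0.70, 0.40, 0.85, 0.75, 0.45)
)

# Summarise each decade: number of films, median male share of dialogue,
# and the proportion of films in which men speak more than half the words
decade_summary <- toy_data |>
  group_by(decade) |>
  summarise(
    n_films             = n(),
    median_male         = median(male_percent),
    share_male_majority = mean(male_percent > 0.5),
    .groups = "drop"
  )

print(decade_summary)
```

Replacing toy_data with beeswarm_data would give one row per decade; a median that stays roughly flat across rows would support the visual impression described above.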

Final Graph

The final graph combines four graphs in a single layout, using ggdraw from the cowplot package to position each panel manually. Images of a princess and a prince are added with the magick package. The layout brings together the improved Men’s and Women’s graphs, the gender balance graph, and the Cinematic Gender Dialogue Evolution graph created earlier, with top character names added for both genders. This lets us compare decades as a longitudinal trend and see which top characters appear in each film.

img_princess <- image_read("princess.png") 
img_prince    <- image_read("prince.png")

# Layout configuration
final_graph <- ggdraw() +
  # 1. Main Left Plot: men_graph_imp 
  draw_plot(men_graph_imp, 
            x = 0, y = 0.4, 
            width = 0.55, height = 0.45) +
  
  # 2. Right Top Plot: gender_balance_graph
  draw_plot(gender_balance_graph, 
            x = 0.55, y = 0.65, 
            width = 0.43, height = 0.2) +
  
  # 3. Right Middle Plot: women_graph_imp
  draw_plot(women_graph_imp, 
            x = 0.55, y = 0.4, 
            width = 0.43, height = 0.25) +
  
  # 4. Bottom Anchor: year_graph 
  draw_plot(year_graph, 
            x = 0.05, y = 0.05, 
            width = 0.9, height = 0.32) +
  
  draw_image(img_princess, x = 0.04, y = 0.88, width = 0.2, height = 0.1) +
  draw_image(img_prince, x = 0.75, y = 0.88, width = 0.2, height = 0.1) +
  
  draw_label("SCREENPLAY DIALOGUE BY GENDER", 
             x = 0.5, y = 0.94, 
             hjust = 0.5,
             fontfamily = "merriweather",
             fontface = "bold", 
             size = 28) +
  
  draw_label("An analysis of 2,000 scripts: Trends across decades and individual films", 
             x = 0.5, y = 0.89, 
             hjust = 0.5, 
             fontfamily = "merriweather",
             size = 12, 
             color = "grey40")

print(final_graph)
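Because each panel is placed by hand in ggdraw’s 0 to 1 canvas coordinates, it is easy to introduce overlapping or out-of-canvas panels when tweaking the numbers. The sketch below is one possible sanity check in base R, not part of the original workflow. The coordinates are copied from the draw_plot() calls above; the helper rects_overlap and the objects panels, overlaps and inside_canvas are hypothetical names introduced for illustration.

```r
# Panel rectangles in canvas coordinates, copied from the draw_plot() calls
panels <- data.frame(
  name   = c("men_graph_imp", "gender_balance_graph",
             "women_graph_imp", "year_graph"),
  x      = c(0.00, 0.55, 0.55, 0.05),
  y      = c(0.40, 0.65, 0.40, 0.05),
  width  = c(0.55, 0.43, 0.43, 0.90),
  height = c(0.45, 0.20, 0.25, 0.32)
)

# Hypothetical helper: TRUE when two rectangles share interior area.
# The tolerance keeps panels that merely touch along an edge from counting.
rects_overlap <- function(a, b, tol = 1e-9) {
  (a$x < b$x + b$width  - tol) && (b$x < a$x + a$width  - tol) &&
  (a$y < b$y + b$height - tol) && (b$y < a$y + a$height - tol)
}

# Test every pair of panels for overlap
pairs <- combn(nrow(panels), 2)
overlaps <- apply(pairs, 2, function(idx) {
  rects_overlap(panels[idx[1], ], panels[idx[2], ])
})

# Check that every panel stays inside the unit canvas
inside_canvas <- with(
  panels,
  x >= 0 & y >= 0 & x + width <= 1 + 1e-9 & y + height <= 1 + 1e-9
)

print(any(overlaps))
print(all(inside_canvas))
```

The check covers only the four plot panels; the images and text labels would need their rendered sizes to be tested the same way.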
