2,000 Screenplays: Dialogue Broken Down by Gender
Hollywood movies have recently been criticized for widespread sexism and racism, with most discussions centering on the idea that white men dominate film roles and dialogue. However, these arguments mostly rest on opinions and observations rather than data, which prevents any genuinely informed conversation. How many movies are actually about men? What changes by genre, era, or box-office revenue? What circumstances generate more diversity?
The researchers behind this visualization did not set out to prove anything, but rather to compile real data. They framed it as a census rather than a study. Their approach was to search 8,000 screenplays and match each character’s lines to an actor. From there, they compiled the number of words spoken by male and female characters across roughly 2,000 films.
This project therefore first replicates their breakdown of dialogue by gender for Disney films, and then improves some aspects of it to give a wider view of gender in the film industry.

Original graph. Figure from The Pudding
In January 2016, researchers reported that men speak more often than women in Disney’s princess films (Anderson & Daniels, 2016). The authors of the visualization validated the claim and doubled the sample size to 30 Disney films, including Pixar. The outcome shows that 22 of 30 Disney films have a male majority of dialogue. Even in films with female leads, such as Mulan, the dialogue swings male: Mushu, her protector dragon, has 50% more words of dialogue than Mulan herself.
The dataset has limitations. As in Mulan, a story may focus on a particular character even if this emphasis isn’t evident in the dialogue. Moreover, the analysis relies on screenplays, which are not exact representations of the final film.
First, in order to create the graph, we load the libraries we are going to use throughout this project:
# Set up environment
library(tidyverse)
library(readr)
library(stringr)
library(ggplot2)
library(scales)
library(magick)
library(cowplot)
library(showtext)
library(ggrepel)
library(ggbeeswarm)
font_add_google("Poppins", "Poppins")
font_add_google("Merriweather", "merriweather")
showtext_auto()
Next, we load the dataset, which we found through Pandas Basics and which is hosted on GitHub:
#Loading data
char_url <- "https://raw.githubusercontent.com/matthewfdaniels/scripts/master/character_list5.csv"
meta_url <- "https://raw.githubusercontent.com/matthewfdaniels/scripts/master/meta_data7.csv"
characters <- read_csv(char_url)
metadata <- read_csv(meta_url)
# Data Processing
film_order <- c(
"The Jungle Book", "Monsters, Inc.", "Up", "Toy Story", "The Rescuers Down Under",
"Aladdin", "Holes", "Cars 2", "The Lion King", "Ratatouille",
"Something Wicked This Way Comes", "Monsters University", "Hercules", "Hunchback Of Notre Dame",
"Toy Story 3", "Finding Nemo", "Mulan", "Star Wars: Episode VII - The Force Awakens",
"Beauty And The Beast", "The Little Mermaid", "Wreck-It Ralph", "Mighty Joe Young",
"Pocahontas") # ordering films manually
# Defining custom vectors to manually resolve sorting ties in percentages
order_87_percent <- c(
"The Lion King", "Ratatouille", "Something Wicked This Way Comes", "Monsters University"
)
# Order for films around 75%
order_75_percent <- c(
"Toy Story 3", "Finding Nemo", "Mulan"
)
dialogue_data <- characters |>
filter(gender %in% c("m", "f")) |> # Filtering for only male and female characters
# Calculating total words spoken per film (script_id) and gender
group_by(script_id, gender) |>
summarise(words = sum(words, na.rm = TRUE), .groups = 'drop') |>
# Pivot to create separate columns for male and female word counts
pivot_wider(
names_from = gender,
values_from = words,
values_fill = 0,
names_prefix = "words_"
) |>
# Calculating total words and percentages
mutate(
words_total = words_m + words_f,
male_percent = words_m / words_total,
female_percent = words_f / words_total
) |>
filter(words_total > 0) |>
# Joining metadata for the film titles; script_id is the join key
left_join(metadata, by = "script_id") |>
filter(!is.na(title)) |>
# Filtering only film_order
filter(title %in% film_order) |>
mutate(
custom_rank = case_when(
title %in% order_87_percent ~ match(title, order_87_percent),
title %in% order_75_percent ~ match(title, order_75_percent),
TRUE ~ 999
)
) |>
# Sorting data primarily by male percentage (descending)
arrange(desc(male_percent))
# Final data preparation for the plot
top_dialogue_data <- dialogue_data |>
arrange(desc(male_percent), custom_rank)
plot_data <- top_dialogue_data |>
select(title, female_percent, male_percent) |>
# Converting data back to long format for ggplot stacked bar chart
pivot_longer(
cols= c(male_percent, female_percent),
names_to = "gender",
values_to = "percentage"
) |>
# Set gender as a factor with specific levels for color mapping and stacking order
mutate(
gender = factor(gender,
levels = c("female_percent", "male_percent"),
labels = c("Female Dialogue", "Male Dialogue"))
)
# Critical part: Setting movie titles as a factor with levels in the desired chart order.
plot_data$title <- factor(
plot_data$title,
levels = rev(unique(top_dialogue_data$title))
)
# Color palettes and gradients
palette <- c("#356df4", "#dba0a0", "#a9bffe", "#2b2b2b", "#e5e8f2",
"#858485", "#517fed", "#ccbaba", "#2261fa")
# Number of films
n_films <- length(levels(plot_data$title))
# Gradient blues for male dialogue
male_colors <- colorRampPalette(
c("#a9bffe","#517fed", "#2261fa","#356df4")
)(n_films)
# Gradient peach/pink for female dialogue
female_colors <- colorRampPalette(
c("#dba0a0", "#ccbaba")
)(n_films)
#Assigning gradient color per film
plot_data <- plot_data |>
mutate(
title_index = as.integer(title),
fill_color = ifelse(
gender == "Male Dialogue",
male_colors[title_index],
female_colors[title_index]
)
)
#Visualization
fifty_fifty_line <- 0.50
men_graph<-
ggplot(plot_data, aes(x = percentage, y = title, fill = fill_color)) +
geom_bar(stat = "identity", position = position_stack(reverse = TRUE), width = 0.8) + #creating stack horizontal bar chart
geom_vline(xintercept = fifty_fifty_line,
linetype = "solid",
color = "#858485",
linewidth = 0.5) + # Adding the 50/50 vertical reference line
# Special label for "98% MALE DIALOGUE"
geom_text(
data = top_dialogue_data |>
mutate(label = case_when(
round(male_percent, 2) == 0.98 ~ "98% MALE\nDIALOGUE",
TRUE ~ NA_character_
)) |>
filter(!is.na(label)),
aes(x = 1.01, y = 20.8, label = label),
inherit.aes = FALSE,
color = "#356df4",
hjust = 0,
size = 2.3,
fontface = "bold",
family = "Poppins"
) +
# Add percentage labels (90%, 87%, 75%, 69%)
geom_text(
data = top_dialogue_data |>
mutate(label = case_when(
round(male_percent, 2) == 0.98 ~ NA_character_,
round(male_percent, 2) == 0.90 ~ "90%",
round(male_percent, 2) == 0.87 & title == "Ratatouille" ~ "87%",
round(male_percent, 2) == 0.75 & title == "Mulan" ~ "75%",
round(male_percent, 2) == 0.69 ~ "69%",
TRUE ~ NA_character_
)) |>
filter(!is.na(label)),
aes(x = 1.02,
y = title,
label = label),
inherit.aes = FALSE,
color = "#356df4",
hjust = 0,
size = 3,
fontface = "bold"
) +
scale_fill_identity() + #using identity scale for gradient colors
# Y-axis configuration
scale_y_discrete(limits = rev(unique(top_dialogue_data$title)),
expand = expansion(add = 0)) +
# X-axis configuration with 50/50 at top
scale_x_continuous(labels = c("50/50" = "50/50"),
breaks = c(0.5),
limits = c(0, 1.08),
position = "top") +
# Set titles and labels
labs(
title = "Men have 60%+ Dialogue",
x = "",
y = "",
fill = ""
) +
# Applying and customizing the theme
theme_minimal() +
theme(
plot.title = element_text(family = "merriweather", color = "#356df4",
face = "bold", hjust = 0.5,
margin = margin(b = 12, unit = "pt"), size = 16),
axis.text.x = element_text(size = 10, color = "black",
margin = margin(b = 12, unit = "pt")),
axis.text.y = element_text(family = "merriweather", size = 12,
color = "black", face = "bold"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "none",
plot.margin = margin(10, 20, 10, 0)
)
men_graph
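
As a sanity check on the share computation behind this chart, the same group, pivot, and divide steps can be run on a tiny hand-made table. This is only a minimal sketch: the `toy_characters` rows and their word counts are invented for illustration and are not taken from the dataset.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy rows with the same columns the pipeline relies on
toy_characters <- tibble(
  script_id = c(1, 1, 1, 2, 2),
  gender    = c("m", "m", "f", "f", "m"),
  words     = c(600, 300, 100, 500, 500)
)

toy_shares <- toy_characters |>
  filter(gender %in% c("m", "f")) |>
  # Total words per film and gender
  group_by(script_id, gender) |>
  summarise(words = sum(words, na.rm = TRUE), .groups = "drop") |>
  # One row per film, one column per gender
  pivot_wider(names_from = gender, values_from = words,
              values_fill = 0, names_prefix = "words_") |>
  # Shares as proportions between 0 and 1
  mutate(
    words_total    = words_m + words_f,
    male_percent   = words_m / words_total,
    female_percent = words_f / words_total
  )

toy_shares
```

In this toy table, film 1 has 900 male and 100 female words, so its male share is 0.9, while film 2 is split evenly. The two shares of a film always sum to 1, which is a quick way to spot a broken pivot.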

This graph shows the films whose dialogue is balanced to within +/- 10 percent of a 50/50 split. There are four: Frozen, The Incredibles, Into the Woods, and Tarzan.
We start by setting up the data:
# Target list of films for the "Gender Balance" chart
target_movies <- c("Frozen", "The Incredibles", "Into the Woods", "Tarzan")
# Data processing and joining
# Filter metadata to get script_ids for the target movies
gender_film_ids <-
metadata |>
filter(title %in% target_movies) |>
select(script_id, title) # Selecting only the join key and the title
# Calculating dialogue percentages for the target movies
dialogue_data_balanced <-
characters |>
inner_join(gender_film_ids, by = "script_id") |> # Join with filtered metadata to include only target movies
filter(gender %in% c("m", "f")) |> # filtering male and female genders
group_by(title, gender) |>
summarise( # Sum the word counts for dialogue
total_words = sum(words, na.rm = TRUE),
.groups = 'drop_last' # Maintain grouping by title
) |>
mutate( # Calculating the percentages
movie_total_words = sum(total_words),
percentage = total_words / movie_total_words * 100
) |>
ungroup()
#Plot data preparation
plot_data_balance <- dialogue_data_balanced |>
mutate(
# Setting movie titles as a factor with specific levels for consistent sorting
title = factor(title, levels = c("Frozen", "The Incredibles", "Into the Woods", "Tarzan")),
# Set gender as a factor
gender = factor(gender, levels = c("f", "m"))
)
# Extract and round the percentages for the Frozen label
frozen_male_perc <- round(plot_data_balance[plot_data_balance$title == "Frozen" & plot_data_balance$gender == "m", ]$percentage[1])
frozen_female_perc <- round(plot_data_balance[plot_data_balance$title == "Frozen" & plot_data_balance$gender == "f", ]$percentage[1])
#Color Palette
palette_balance <- c("#7193e4", "#6a8fe6","#7c9ae3", "#e67e7e", "#272727", "#dad9dc",
"#7798e3", "#818082", "#e77b7b", "#e28a8a", "#e48483")
# Number of films
n_films_balance <- length(levels(plot_data_balance$title))
# Gradient blues for male dialogue
male_colors_balance <- colorRampPalette(
c("#6a8fe6","#7193e4", "#7798e3", "#7c9ae3")
)(n_films_balance)
# Gradient pink/red for female dialogue
female_colors_balance <- colorRampPalette(
c( "#e28a8a","#e48483","#e67e7e", "#e77b7b")
)(n_films_balance)
# Assign gradient colors per film and gender
plot_data_balance <- plot_data_balance |>
mutate(
title_index = as.integer(title),
fill_color = ifelse(
gender == "m",
male_colors_balance[title_index],
female_colors_balance[title_index]
)
)
#Visualization
gender_balance_graph<-
ggplot(plot_data_balance, aes(x = percentage, y = title, fill = fill_color)) +
geom_bar(stat = "identity",position = position_stack(reverse = TRUE), width = 0.7) + #Creating the horizontal stacked bars
geom_vline(xintercept = 50,
linetype = "solid",
color = "#818082",
linewidth = 0.5) +
scale_fill_identity() +
# Adding the Two-Color Percentage Label for Frozen
geom_text(
data = plot_data_balance |>
filter(title == "Frozen" & gender == "m"),
label = paste0(frozen_male_perc, "%"),
x = 101,
color = "#2362fa",
fontface = "bold",
size = 4,
hjust = 0
) +
geom_text(
data = plot_data_balance |>
filter(title == "Frozen" & gender == "f"),
label = paste0(" / ", frozen_female_perc, "%"),
x = 110,
color = "#e94b55",
fontface = "bold",
size = 4,
hjust = 0
) +
# Add the "50/50" text annotation
annotate("text", x = 50, y = 4.5, label = "50/50", color = "#818082", size = 4) +
# Extend X-axis limits
scale_x_continuous(limits = c(0, 115), position = "top") +
# Reversing the Y-axis order
scale_y_discrete(limits = rev(levels(plot_data_balance$title))) +
# Titles and Labels
labs(
title = "Gender Balance, +/- 10%",
y = NULL,
x = NULL
) +
# Theme Customization
theme_minimal() +
theme(
legend.position = "none",
plot.title = element_text(family = "merriweather",
hjust = 0.5,
size = 18,
face = "bold",
color = "#272727"),
axis.text.x = element_blank(),
axis.text.y = element_text(family = "merriweather",
size = 12,
face = "bold",
color = "#272727"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.margin = margin(10, 50, 10, 10)
)
gender_balance_graph
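
The three chart titles used in this project ("Men have 60%+ Dialogue", "Gender Balance, +/- 10%", "Women have 60%+ Dialogue") amount to a simple rule on the male share of dialogue. The sketch below spells that rule out. The helper `classify_dialogue()` is hypothetical, the input shares are made up, and treating exactly 60% and exactly 40% as belonging to the outer groups is an assumption rather than something the source states.

```r
# Hypothetical helper: assign a film to one of the three chart groups
# based on its male share of dialogue (a proportion between 0 and 1).
# Assumption: the 0.60 and 0.40 boundaries count toward the outer groups.
classify_dialogue <- function(male_share) {
  ifelse(male_share >= 0.60, "Men have 60%+ Dialogue",
    ifelse(male_share <= 0.40, "Women have 60%+ Dialogue",
           "Gender Balance, +/- 10%"))
}

# Made-up shares, for illustration only
classify_dialogue(c(0.98, 0.75, 0.57, 0.45, 0.36))
```

Because `ifelse()` is vectorised, the same helper could be applied to a whole `male_percent` column, provided the column is on the 0 to 1 scale and not the 0 to 100 scale.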

Lastly, we draw a graph of the films in which women have 60 percent or more of the dialogue. There are four: Inside Out, Alice in Wonderland, Maleficent, and Sleeping Beauty.
target_women_film <-
c("Inside Out", "Alice in Wonderland", "Maleficent", "Sleeping Beauty")
# Calculating Dialogue Data (Reusing the structure from the Men's chart)
# which ensures we have male_percent and female_percent columns
dialogue_data_women <- characters |>
filter(gender %in% c("f", "m")) |>
group_by(script_id, gender) |>
summarise(words = sum(words, na.rm = TRUE), .groups = 'drop') |>
pivot_wider(
names_from = gender,
values_from = words,
values_fill = 0,
names_prefix = "words_"
) |>
mutate(
words_total = words_m + words_f,
male_percent = words_m / words_total * 100, # Calculate as percentages (0-100)
female_percent = words_f / words_total * 100
) |>
filter(words_total > 0) |>
left_join(metadata, by = "script_id") |>
filter(!is.na(title)) |>
#Critical Filter: Keeping only movies where women have 60% or more dialogue
filter(female_percent >= 60 & title %in% target_women_film) |>
arrange(desc(female_percent)) # Sort by descending female percentage (highest dominance at the top)
#---- Color Palette--
palette_women <- c("#8ea5de", "#8fa7dc", "#95aadb", "#97abda",
"#ed6a6a", "#ed6969", "#ef6465", "#ef6262",
"#ff3334", "#9b9aa1", "#050505", "#2562fa",
"#e94953")
n_films_women <- length(unique(dialogue_data_women$title))
male_colors_women <- colorRampPalette(
c("#8ea5de", "#8fa7dc", "#95aadb", "#97abda")
)(n_films_women)
female_colors_women <- colorRampPalette(
c("#ed6a6a", "#ed6969", "#ef6465", "#ef6262")
)(n_films_women)
# --- Preparing for Plotting ---
plot_data_women <- dialogue_data_women |>
select(title, female_percent, male_percent) |>
pivot_longer(
cols = c(male_percent, female_percent),
names_to = "gender",
values_to = "percentage"
) |>
mutate(
gender = factor(gender,
levels = c("female_percent", "male_percent"),
labels = c("Female Dialogue", "Male Dialogue"))
)
# Set factor levels for the Y-axis
plot_data_women$title <- factor(plot_data_women$title,
levels = rev(unique(dialogue_data_women$title)))
# Assigning gradient colors per film and gender
plot_data_women <- plot_data_women |>
mutate(
title_index = as.integer(title),
fill_color = ifelse(
gender == "Male Dialogue",
male_colors_women[title_index],
female_colors_women[title_index]
)
)
# --- Extract Custom Label Variables (Inside Out) ---
inside_out_male_perc <- round(dialogue_data_women[dialogue_data_women$title == "Inside Out", ]$male_percent[1])
inside_out_female_perc <- round(dialogue_data_women[dialogue_data_women$title == "Inside Out", ]$female_percent[1])
if (is.na(inside_out_male_perc)) {
inside_out_male_perc <- 36
inside_out_female_perc <- 64
}
fifty_fifty_line <- 50
#Visualization
women_graph <-
ggplot(plot_data_women, aes(x = percentage, y = title, fill = fill_color)) +
# Creating the stacked horizontal bar chart
geom_bar(stat = "identity", position = position_stack(reverse = TRUE), width = 0.8) +
# Add the 50/50 vertical reference line
geom_vline(xintercept = fifty_fifty_line,
linetype = "solid",
color = "#9b9aa1",
linewidth = 0.5) +
#scale for gradient colors
scale_fill_identity() +
# Adding the two-color percentage
# Male percentage label
geom_text(
data = plot_data_women |>
filter(title == "Inside Out" & gender == "Male Dialogue"),
label = paste0(inside_out_male_perc, "%"),
x = 101,
color = "#2562fa",
fontface = "bold",
size = 4,
hjust = 0
) +
# Female percentage label
geom_text(
data = plot_data_women |>
filter(title == "Inside Out" & gender == "Female Dialogue"),
label = paste0(" / ", inside_out_female_perc, "%"),
x = 109,
color = "#e94953",
fontface = "bold",
size = 4,
hjust = 0
) +
# Extending x-axis limits
scale_x_continuous(labels = c("50/50" = "50/50"),
breaks = c(50), # percentages run from 0 to 100 here, so the midpoint is 50
limits = c(0, 115),
position = "top") +
# Add the "50/50" text annotation
annotate("text", x = 50, y = 4.55, label = "50/50", color = "#9b9aa1", size = 4) +
# Y-axis order
scale_y_discrete(limits = rev(levels(plot_data_women$title))) +
# Titles and labels
labs(
title = "Women have 60%+ Dialogue",
x = "",
y = "",
fill = ""
) +
# Theme customization
theme_minimal() +
theme(
plot.title = element_text(family = "merriweather", color = "#ff3334", face = "bold", hjust = 0.5, size = 16),
axis.text.x = element_blank(),
axis.text.y = element_text(family = "merriweather", size = 12, color = "black", face = "bold"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "none",
plot.margin = margin(10, 50, 10, 10)
)
women_graph

Men graph improvement
Women graph improvement
Creating a new graph: Cinematic Gender Dialogue Evolution
Since this graph is meant to support the criticism of how gendered Disney movies are, as measured by the dialogue of male and female characters, I believe it can be improved by adding the top-speaking male and female character of each movie, which helps show who accounts for each gender’s share.
film_order <- c(
"The Jungle Book", "Monsters, Inc.", "Up", "Toy Story", "The Rescuers Down Under",
"Aladdin", "Holes", "Cars 2", "The Lion King", "Ratatouille",
"Something Wicked This Way Comes", "Monsters University", "Hercules", "Hunchback Of Notre Dame", "Toy Story 3", "Finding Nemo", "Mulan", "Star Wars: Episode VII - The Force Awakens",
"Beauty And The Beast", "The Little Mermaid", "Wreck-It Ralph", "Mighty Joe Young",
"Pocahontas"
)
order_87_percent <- c("The Lion King", "Ratatouille",
"Something Wicked This Way Comes", "Monsters University")
order_75_percent <- c("Toy Story 3", "Finding Nemo", "Mulan")
#New step for improvement : Calculating top characters
#Find the name of the top speaking Male and Female for each script
top_chars_by_movie <- characters |>
filter(gender %in% c("m", "f")) |>
# Join metadata early to filter by the specific film list
left_join(metadata |> select(script_id, title), by = "script_id") |>
filter(title %in% film_order) |>
group_by(script_id, title, gender, imdb_character_name) |>
summarise(words = sum(words, na.rm = TRUE), .groups = 'drop') |>
group_by(script_id, gender) |>
slice_max(words, n = 1, with_ties = FALSE) |> # Get top 1 per gender
ungroup() |>
select(script_id, gender, imdb_character_name) |>
pivot_wider(
names_from = gender,
values_from = imdb_character_name,
names_prefix = "top_"
)
#Main data processing
dialogue_data <- characters |>
filter(gender %in% c("m", "f")) |>
group_by(script_id, gender) |>
summarise(words = sum(words, na.rm = TRUE), .groups = 'drop') |>
pivot_wider(
names_from = gender, values_from = words, values_fill = 0, names_prefix = "words_"
) |>
mutate(
words_total = words_m + words_f,
male_percent = words_m / words_total,
female_percent = words_f / words_total
) |>
filter(words_total > 0) |>
left_join(metadata, by = "script_id") |>
left_join(top_chars_by_movie, by = "script_id") |> # Join the TOP CHARACTERS calculated above
filter(!is.na(title)) |>
filter(title %in% film_order) |>
mutate(
custom_rank = case_when(
title %in% order_87_percent ~ match(title, order_87_percent),
title %in% order_75_percent ~ match(title, order_75_percent),
TRUE ~ 999
)
) |>
arrange(desc(male_percent))
top_dialogue_data <- dialogue_data |>
arrange(desc(male_percent), custom_rank)
plot_data <- top_dialogue_data |>
select(title, female_percent, male_percent) |>
pivot_longer(
cols = c(male_percent, female_percent),
names_to = "gender",
values_to = "percentage"
) |>
mutate(
gender = factor(gender,
levels = c("female_percent", "male_percent"),
labels = c("Female Dialogue", "Male Dialogue"))
)
plot_data$title <- factor(
plot_data$title,
levels = rev(unique(top_dialogue_data$title))
)
#Color Palette
n_films <- length(levels(plot_data$title))
male_colors <- colorRampPalette(c("#a9bffe","#517fed", "#2261fa","#356df4"))(n_films)
female_colors <- colorRampPalette(c("#dba0a0", "#ccbaba"))(n_films)
plot_data <- plot_data |>
mutate(
title_index = as.integer(title),
fill_color = ifelse(
gender == "Male Dialogue",
male_colors[title_index],
female_colors[title_index]
)
)
#Visualization
fifty_fifty_line <- 0.50
men_graph_imp <- ggplot(plot_data, aes(x = percentage, y = title, fill = fill_color)) +
#Stacked Bars (Reverse = TRUE puts Male/Blue on the Left)
geom_bar(stat = "identity", position = position_stack(reverse = TRUE), width = 0.8) +
#50/50 Line
geom_vline(xintercept = fifty_fifty_line, linetype = "solid", color = "#858485", linewidth = 0.5) +
#Adding male character names
geom_text(
data = top_dialogue_data,
aes(x = 0.49, y = title, label = top_m),
inherit.aes = FALSE,
color = "white", hjust = 1, size = 2, fontface = "bold", family = "Poppins"
) +
#Adding female character names
geom_text(
data = top_dialogue_data,
aes(x = 0.51, y = title, label = top_f),
inherit.aes = FALSE,
color = "white", hjust = 0, size = 2, fontface = "bold", family = "Poppins"
) +
#Special label for "98% MALE DIALOGUE"
geom_text(
data = top_dialogue_data |>
mutate(label = case_when(
round(male_percent, 2) == 0.98 ~ "98% MALE\nDIALOGUE",
TRUE ~ NA_character_
)) |>
filter(!is.na(label)),
aes(x = 1.01, y = 20.8, label = label),
inherit.aes = FALSE,
color = "#356df4", hjust = 0, size = 2.3, fontface = "bold", family = "Poppins"
) +
#Add percentage labels (90%, 87%, 75%, 69%)
geom_text(
data = top_dialogue_data |>
mutate(label = case_when(
round(male_percent, 2) == 0.98 ~ NA_character_,
round(male_percent, 2) == 0.90 ~ "90%",
round(male_percent, 2) == 0.87 & title == "Ratatouille" ~ "87%",
round(male_percent, 2) == 0.75 & title == "Mulan" ~ "75%",
round(male_percent, 2) == 0.69 ~ "69%",
TRUE ~ NA_character_
)) |>
filter(!is.na(label)),
aes(x = 1.02, y = title, label = label),
inherit.aes = FALSE,
color = "#356df4", hjust = 0, size = 3, fontface = "bold", family = "Poppins"
) +
# Scales
scale_fill_identity() +
scale_y_discrete(limits = rev(unique(top_dialogue_data$title)), expand = expansion(add = 0)) +
scale_x_continuous(labels = c("50/50" = "50/50"), breaks = c(0.5), limits = c(0, 1.15), position = "top") +
# Titles and labels
labs(
title = "Men have 60%+ Dialogue",
x = "",
y = "",
fill = ""
) +
# Theme
theme_minimal() +
theme(
plot.title = element_text(family = "merriweather",
color = "#356df4",
face = "bold",
hjust = 0.5,
margin = margin(b = 12, unit = "pt"),
size = 16),
axis.text.x = element_text(size = 5, color = "black",
margin = margin(b = 12, unit = "pt"),
family = "Poppins"),
axis.text.y = element_text(family = "merriweather",
size = 12,
color = "black",
face = "bold"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "none",
plot.margin = margin(10, 20, 10, 0)
)
men_graph_imp
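
The "top character per gender" step can be tried in isolation on a small table. This is a minimal sketch with invented character names and word counts (`toy_cast` is hypothetical); it only illustrates how `slice_max()` and `pivot_wider()` combine to give one `top_m` and one `top_f` value per film.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy rows; names and word counts are invented
toy_cast <- tibble(
  script_id           = c(1, 1, 1, 1),
  imdb_character_name = c("hero", "sidekick", "heroine", "villain"),
  gender              = c("m", "m", "f", "f"),
  words               = c(400, 650, 500, 120)
)

toy_top <- toy_cast |>
  group_by(script_id, gender) |>
  # Keep the single character with the most words in each film and gender
  slice_max(words, n = 1, with_ties = FALSE) |>
  ungroup() |>
  select(script_id, gender, imdb_character_name) |>
  # Spread the two names into top_f and top_m columns
  pivot_wider(names_from = gender,
              values_from = imdb_character_name,
              names_prefix = "top_")

toy_top
```

Setting `with_ties = FALSE` guarantees exactly one row per film and gender, so `pivot_wider()` returns plain character columns instead of list-columns when two characters happen to have the same word count.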

The same process is applied here; however, unlike the men’s graph, we also add the release year of each movie.
# Data Preparation
target_women_film <- c("Inside Out", "Alice in Wonderland", "Maleficent", "Sleeping Beauty")
# Calculate top characters for women films
# Find the name of the top speaking Male and Female for each script in the target list
top_chars_women <- characters |>
filter(gender %in% c("m", "f")) |>
left_join(metadata |> select(script_id, title), by = "script_id") |>
filter(title %in% target_women_film) |>
group_by(script_id, title, gender, imdb_character_name) |>
summarise(words = sum(words, na.rm = TRUE), .groups = 'drop') |>
group_by(script_id, gender) |>
slice_max(words, n = 1, with_ties = FALSE) |>
ungroup() |>
select(script_id, gender, imdb_character_name) |>
pivot_wider(
names_from = gender,
values_from = imdb_character_name,
names_prefix = "top_"
)
#Calculating Dialogue data
dialogue_data_women <- characters |>
filter(gender %in% c("f", "m")) |>
group_by(script_id, gender) |>
summarise(words = sum(words, na.rm = TRUE), .groups = 'drop') |>
pivot_wider(
names_from = gender,
values_from = words,
values_fill = 0,
names_prefix = "words_"
) |>
mutate(
words_total = words_m + words_f,
male_percent = words_m / words_total * 100,
female_percent = words_f / words_total * 100
) |>
filter(words_total > 0) |>
left_join(metadata, by = "script_id") |>
# JOIN TOP CHARACTERS HERE
left_join(top_chars_women, by = "script_id") |>
filter(!is.na(title)) |>
filter(female_percent >= 60 & title %in% target_women_film) |>
arrange(desc(female_percent))
# Color Palette
palette_women <- c("#8ea5de", "#8fa7dc", "#95aadb", "#97abda",
"#ed6a6a", "#ed6969", "#ef6465", "#ef6262",
"#ff3334", "#9b9aa1", "#050505", "#2562fa",
"#e94953")
n_films_women <- length(unique(dialogue_data_women$title))
male_colors_women <- colorRampPalette(c("#8ea5de", "#8fa7dc", "#95aadb", "#97abda"))(n_films_women)
female_colors_women <- colorRampPalette(c("#ed6a6a", "#ed6969", "#ef6465", "#ef6262"))(n_films_women)
#Preparing plot data
plot_data_women <- dialogue_data_women |>
select(title, female_percent, male_percent) |>
pivot_longer(
cols = c(male_percent, female_percent),
names_to = "gender",
values_to = "percentage"
) |>
mutate(
gender = factor(gender,
levels = c("female_percent", "male_percent"),
labels = c("Female Dialogue", "Male Dialogue"))
)
# Setting factor levels for Y-axis
plot_data_women$title <- factor(plot_data_women$title, levels = rev(unique(dialogue_data_women$title)))
# Assigning gradient colors
plot_data_women <- plot_data_women |>
mutate(
title_index = as.integer(title),
fill_color = ifelse(
gender == "Male Dialogue",
male_colors_women[title_index],
female_colors_women[title_index]
)
)
# Extract labels for "Inside Out"
inside_out_male_perc <- round(dialogue_data_women[dialogue_data_women$title == "Inside Out", ]$male_percent[1])
inside_out_female_perc <- round(dialogue_data_women[dialogue_data_women$title == "Inside Out", ]$female_percent[1])
if (is.na(inside_out_male_perc)) {
inside_out_male_perc <- 36
inside_out_female_perc <- 64
}
fifty_fifty_line <- 50
#Visualization
women_graph_imp <- ggplot(plot_data_women, aes(x = percentage, y = title, fill = fill_color)) +
geom_bar(stat = "identity", position = position_stack(reverse = TRUE), width = 0.8) +
# 50/50 Line
geom_vline(xintercept = fifty_fifty_line, linetype = "solid", color = "#9b9aa1", linewidth = 0.5) +
# Adding years
geom_text(
data = dialogue_data_women,
aes(x = 0.5, y = title, label = year),
inherit.aes = FALSE,
color = "white", hjust = 0, size = 3, family = "Poppins", fontface = "bold"
) +
# Adding male character names (left of the 50/50 line, right-aligned)
geom_text(
data = dialogue_data_women,
aes(x = 49, y = title, label = top_m),
inherit.aes = FALSE,
color = "white", hjust = 1, size = 3, family = "Poppins", fontface = "bold"
) +
# Adding female characters name
geom_text(
data = dialogue_data_women,
aes(x = 51, y = title, label = top_f),
inherit.aes = FALSE,
color = "white", hjust = 0, size = 3, family = "Poppins", fontface = "bold"
) +
# Male percentage label
geom_text(
data = plot_data_women |> filter(title == "Inside Out" & gender == "Male Dialogue"),
label = paste0(inside_out_male_perc, "%"),
x = 101, color = "#2562fa", fontface = "bold", size = 4, hjust = 0, family = "Poppins"
) +
# Female percentage label
geom_text(
data = plot_data_women |> filter(title == "Inside Out" & gender == "Female Dialogue"),
label = paste0(" / ", inside_out_female_perc, "%"),
x = 109, color = "#e94953", fontface = "bold", size = 4, hjust = 0, family = "Poppins"
) +
# Scales and labels
scale_fill_identity() +
scale_x_continuous(labels = c("50/50" = "50/50"), breaks = c(50), limits = c(0, 120), position = "top") +
# 50/50 Text
annotate("text", x = 50, y = 4.55, label = "50/50", color = "#9b9aa1", size = 4, family = "Poppins") +
scale_y_discrete(limits = rev(levels(plot_data_women$title))) +
labs(
title = "Women have 60%+ Dialogue",
x = "", y = "", fill = ""
) +
theme_minimal() +
theme(
plot.title = element_text(family = "merriweather", color = "#ff3334", face = "bold", hjust = 0.5, size = 16),
axis.text.x = element_blank(), # Hiding the x-axis text
axis.text.y = element_text(family = "merriweather", size = 12, color = "black", face = "bold"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "none",
plot.margin = margin(10, 50, 10, 10 )
)
print(women_graph_imp)

This new year-based graph serves as a longitudinal trend analysis of film dialogue. While the original graphs highlight individual disparities in popular films, this improvement aims to give a bigger picture of the film industry’s evolution over nearly a century, using a beeswarm trend plot. The graph has three purposes:
Wider context in industry: Unlike the fixed list of 23 films, this graph plots every available movie in the dataset to show whether the male-dominated dialogue trend is a universal standard or an exception.
Decadal density: The thickness of each “swarm”, with films grouped into decades, shows how the volume of films produced has grown, while highlighting that the “center of gravity” of dialogue hasn’t shifted much since the 1930s.
Visualization process: The 50/50 line serves as a permanent benchmark for equality. It lets us see that the outliers (the films with the most female dialogue) often barely reach parity.
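
The decade grouping behind the swarms is plain arithmetic: divide the year by 10, round down, and multiply by 10 again. The sketch below shows that step, plus how the lowest male share per decade can be picked out. The helper name `to_decade()` and the `toy` data frame are hypothetical, and the titles, years, and shares are invented.

```r
# Map a release year to the start of its decade, e.g. 1998 -> 1990
to_decade <- function(year) floor(year / 10) * 10

# Invented example films, for illustration only
toy <- data.frame(
  title        = c("A", "B", "C", "D"),
  year         = c(1991, 1998, 2004, 2009),
  male_percent = c(0.80, 0.55, 0.70, 0.62)
)
toy$decade <- to_decade(toy$year)

# Lowest male share per decade, i.e. the film that would get a label
lowest <- tapply(toy$male_percent, toy$decade, min)
lowest
```

Films A and B fall into the 1990s and C and D into the 2000s, so the labeled films in this toy example would be B (0.55) and D (0.62).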
# Data preparation
# Processing dialogue data and grouping years into decades for better flow
beeswarm_data <- characters |>
filter(gender %in% c("m", "f")) |>
group_by(script_id, gender) |>
summarise(words = sum(words, na.rm = TRUE), .groups = 'drop') |>
pivot_wider(names_from = gender, values_from = words, values_fill = 0) |>
left_join(metadata |> select(script_id, title, year), by = "script_id") |>
mutate(
male_percent = m / (m + f),
decade = floor(year / 10) * 10
) |>
filter(!is.na(year), !is.na(title))
# Identify outliers and manual positioning
# Selecting specific films to label and define manual "nudges" for text placement
outliers_manual <- beeswarm_data |>
group_by(decade) |>
slice_min(order_by = male_percent, n = 1, with_ties = FALSE) |>
ungroup() |>
mutate(
# Vertical nudge (Y-axis)
nudge_y_val = case_when(
title == "Mulan" ~ -0.05,
title == "The Little Mermaid" ~ -0.08,
title == "Frozen" ~ -0.02,
TRUE ~ -0.04 # Default value for others
),
# Horizontal nudge (X-axis)
nudge_x_val = case_when(
title == "Mulan" ~ 0.2,
title == "The Little Mermaid" ~ -0.3,
TRUE ~ 0 # Stay centered by default
)
)
# Visualization
year_graph <-
ggplot(beeswarm_data, aes(x = factor(decade), y = male_percent, color = male_percent)) +
geom_hline(yintercept = 0.5, linetype = "dashed", color = "#858485", alpha = 0.6) +
# Distributing points using the organic "Beeswarm" pattern
geom_quasirandom(alpha = 0.5, size = 2, width = 0.4) +
# Manually adjusted label layer
geom_text_repel(
data = outliers_manual,
aes(label = title),
nudge_x = outliers_manual$nudge_x_val, # Use manual X shifts
nudge_y = outliers_manual$nudge_y_val, # Use manual Y shifts
size = 3.5,
color = "black",
fontface = "bold",
segment.color = 'grey50',
direction = "both", # Allow movement in both directions to resolve overlaps
min.segment.length = 0 # Lines will always be drawn to the points
) +
scale_color_gradient2(
low = "#dba0a0",
mid = "#d1d1d1",
high = "#356df4",
midpoint = 0.5,
labels = percent_format(),
# legend bar
guide = guide_colorbar(
title = "Male %",
title.position = "top",
barwidth = 1,
barheight = 10,
nbin = 50
)
) +
scale_y_continuous(labels = percent_format()) +
theme_minimal(base_family = "Poppins") +
labs(
title = "Cinematic Gender Dialogue Evolution",
subtitle = "Labeled films represent the most female-dominant scripts of each decade.",
x = "Decade",
y = "Male Dialogue %"
) +
theme(
legend.position = "right",
legend.title = element_text(size = 9, face = "bold"),
legend.text = element_text(size = 8),
panel.grid.major.x = element_blank(),
plot.title = element_text(face = "bold", size = 16, color = "#356df4")
)
year_graph

Across nearly 100 years of the film industry, the vast majority of films cluster between 60% and 90% male dialogue. This means male-dominated scripts are the “norm” rather than an exception, which may point to a “glass ceiling” for dialogue. Furthermore, even when films center on female protagonists, as Mulan and Pocahontas do, the supporting and secondary characters remain overwhelmingly male, often resulting in men speaking over 70% of the words. Another finding is the industry’s resistance to change over the decades: while the volume of films has exploded since the 1990s, the “center” of the clusters hasn’t moved significantly toward the 50/50 line, suggesting that modern cinema still struggles with gender parity in scripts. Lastly, there is the power of the outliers. The labeled outliers serve as evidence that balanced scripts are possible; however, their rarity emphasizes how far the industry is from achieving consistent gender equality in storytelling.
The final graph displays four graphs together, using ggdraw from the cowplot package to position each one manually. Additionally, images of princesses and princes are added using the magick package. The figure therefore combines the improved men’s and women’s graphs, which now carry the top character names for both genders, with the gender balance graph and the Cinematic Gender Dialogue Evolution graph created earlier. This lets us compare films across years as a longitudinal trend and see which characters speak most in each movie.
img_princess <- image_read("princess.png")
img_prince <- image_read("prince.png")
# Layout configuration
final_graph <- ggdraw() +
# 1. Main Left Plot: men_graph_imp
draw_plot(men_graph_imp,
x = 0, y = 0.4,
width = 0.55, height = 0.45) +
# 2. Right Top Plot: gender_balance_graph
draw_plot(gender_balance_graph,
x = 0.55, y = 0.65,
width = 0.43, height = 0.2) +
# 3. Right Middle Plot: women_graph_imp
draw_plot(women_graph_imp,
x = 0.55, y = 0.4,
width = 0.43, height = 0.25) +
# 4. Bottom Anchor: year_graph
draw_plot(year_graph,
x = 0.05, y = 0.05,
width = 0.9, height = 0.32) +
draw_image(img_princess, x = 0.04, y = 0.88, width = 0.2, height = 0.1) +
draw_image(img_prince, x = 0.75, y = 0.88, width = 0.2, height = 0.1) +
draw_label("SCREENPLAY DIALOGUE BY GENDER",
x = 0.5, y = 0.94,
hjust = 0.5,
fontfamily = "merriweather",
fontface = "bold",
size = 28) +
draw_label("An analysis of 2,000 scripts: Trends across decades and individual films",
x = 0.5, y = 0.89,
hjust = 0.5,
fontfamily = "merriweather",
size = 12,
color = "grey40")
print(final_graph)
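
Since the panels are placed by hand, it can help to check that their rectangles stay inside the canvas and do not cover each other. The sketch below does this in base R with the x, y, width, and height values copied from the `draw_plot()` calls above. The `overlaps()` helper and the tolerance value are hypothetical additions, not part of cowplot.

```r
# Panel rectangles copied from the draw_plot() calls,
# in ggdraw()'s 0 to 1 canvas coordinates
panels <- data.frame(
  name   = c("men", "balance", "women", "year"),
  x      = c(0.00, 0.55, 0.55, 0.05),
  y      = c(0.40, 0.65, 0.40, 0.05),
  width  = c(0.55, 0.43, 0.43, 0.90),
  height = c(0.45, 0.20, 0.25, 0.32)
)

# Hypothetical helper: TRUE if two panels share more than a common edge.
# The small tolerance avoids false alarms from floating-point rounding.
overlaps <- function(a, b, tol = 1e-9) {
  x_overlap <- min(a$x + a$width,  b$x + b$width)  - max(a$x, b$x)
  y_overlap <- min(a$y + a$height, b$y + b$height) - max(a$y, b$y)
  x_overlap > tol && y_overlap > tol
}

# Test every pair of panels
panel_pairs <- combn(nrow(panels), 2)
any_overlap <- any(apply(panel_pairs, 2, function(p) {
  overlaps(panels[p[1], ], panels[p[2], ])
}))

# Every panel must fit inside the unit canvas
inside_canvas <- all(panels$x >= 0, panels$y >= 0,
                     panels$x + panels$width  <= 1,
                     panels$y + panels$height <= 1)

any_overlap
inside_canvas
```

With these coordinates the panels only touch along shared edges, so the check reports no overlap and all four fit on the canvas. The title, subtitle, and images sit above y = 0.85 and are not part of this check.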

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Boz (2026, Jan. 14). Data Visualization | MSc CSS: Film Dialogue: Screenplay Dialogue by Gender. Retrieved from https://csslab.uc3m.es/dataviz/projects/2025/100569570/
BibTeX citation
@misc{boz2026film,
author = {Boz, Sude},
title = {Data Visualization | MSc CSS: Film Dialogue: Screenplay Dialogue by Gender},
url = {https://csslab.uc3m.es/dataviz/projects/2025/100569570/},
year = {2026}
}