GBBO continued: an interactive Plotly jitter plot

How to chart baker performance across all seasons?

As part of the good old Great British Bake Off (GBBO) project, I wanted to create a chart that would succinctly capture every baker's performance in every season. First, this necessitated the invention of a "score" feature. On the actual show, no one gets a numerical score. However, on the (very helpful) Wikipedia pages for each season, fans have recorded which bakers were favorites, least favorites, Star Bakers, and eliminated in each episode (read this post to learn more about how I wrangled the data from Wikipedia tables). This information can be used to create our own numerical representation of baker performance. I decided to give each baker 1 point for "favorite", -1 point for "least favorite", and 2 points for being named Star Baker.
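A rough sketch of this scoring, and of the per-baker totals used in the next step, could look like the following (the column names here are stand-ins for whatever your episode-level data actually uses):

library(dplyr)

#assumed: one row per baker per episode, with "result" and "status" fields
scores <- episodes %>%
  mutate(score = case_when(result == "Star Baker"     ~ 2,
                           status == "Favorite"       ~ 1,
                           status == "Least Favorite" ~ -1,
                           TRUE                       ~ 0)) %>%
  group_by(season, baker) %>%
  summarize(endsum = sum(score), .groups = "drop") #final score per baker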

To compare every baker's performance in every season, I summed each baker's episode scores to get their "final score" at the point they were eliminated (or when they won). For each season, a dot plot comparing each baker's final score could look something like this:

Combining all seasons would give you something like this:

Full code here

The dot-plot-specific portion of the code is below:

geom_dotplot(binaxis='y', stackdir='center',
             aes(fill = winflag)
             ,dotsize = .5
             ,alpha = .8
             ,stackgroups = TRUE)

We can see that the points at around 0 Final Score start to run into each other when stacked, and things begin to look a little cluttered. This is where the jitter plot comes in.

Jitter plots

The jitter plot strikes me as a strange being in the world of data visualization. Unlike almost all other plots, the jitter plot adds an element of chaos. It forgoes absolute precision for the sake of readability.

In essence, a jitter plot is just a dot-plot for situations when there are too many dots for stacking. Instead of spacing the dots equally apart from each other without overlap, a jitter plot “jitters” the dots in a random manner, within a given area.

Using geom_jitter() instead of geom_dotplot():

The jitter-plot-specific portion of the code is below:

geom_jitter(height = .1,width = .25,
            aes(color = winflag), 
              alpha =.8,
              size = 4) 

The points here are less precise, but flow better. There is some overlap, but the randomness makes it easier to tell points apart, and a sense of the density of the points is not lost.

The height and width arguments determine how much wiggle room the jitter plot has to work with in the y and x dimensions, respectively. Smaller numbers mean a tighter radius and more overlap; bigger numbers give a wider area to jitter within.

It is possible to achieve jittering with geom_dotplot() by setting position = position_jitter(width = ?, height = ?), but geom_jitter() has other advantages, so we will use it from here on out for this post.
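For reference, geom_jitter() is just a shorthand for geom_point() with a jitter position adjustment, so the same effect can be written explicitly:

#equivalent to the geom_jitter() call above
geom_point(position = position_jitter(width = .25, height = .1),
           aes(color = winflag), alpha = .8, size = 4)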

One advantage of geom_jitter() is that it is easy to change dot attributes based on a grouping variable (not so with geom_dotplot(), which has no size or shape aesthetic built into its parameters, only "dotsize"). To highlight winners and runners-up in my jitter plot, I was able to change the code as follows:

Full code here

To change size based on a categorical variable, I added size as an attribute to the geom_jitter aesthetics. I then had to add a scale_size_manual() function to define the size for each category. If you have multiple aesthetics that you want to show up in one legend, it's important to define the same name (even if it's blank) for all of the scale functions:

geom_jitter(height = .1,width = .25,
              aes(color = winflag,
                  size = winflag), 
              alpha =.8) +
scale_color_manual(name = "",
                     values = c('Winner' = 'goldenrod',
                                'Runner-up' = '#86dba5',
                                'Eliminated before final' = '#e68a95')) +
scale_size_manual(name = "",
                    values = c('Winner' = 5,
                               'Runner-up' = 4,
                               'Eliminated before final' = 2))

The trouble with labels

I’m sure you noticed that there were helpful data labels in my single-season example. If we were to add the same labels to our jitter plot, we would get this beauty:

Clearly, there are too many observations here to have both intelligible labels and visible points. Wouldn’t it be nice if we could have an interactive plot that would allow a user to choose a point and see more information dynamically? We can!

Plotly

Using Plotly, we can turn our static plot into a dynamic plot that provides much more information upon hover or click:

Charts or apps can be made completely in Plotly, or we can use the ggplotly() function to turn a plot originally made in ggplot into an interactive Plotly visualization.

On a very basic level, all I did to turn our jitter plot into a Plotly plot was save the jitter plot as an object and apply the ggplotly() function. The full code to get the formatted, interactive Plotly visualization is below:

#save fully formatted ggplot jitterplot

p <- ggplot(jitter, aes(season, endsum, group = baker)) +
  geom_jitter(height = .1,width = .25,
              aes(color = winflag,
                  size = winflag,
                  text = paste('Baker:', baker,
                               '<br>Status:', winflag,
                               '<br>Max Episode:', maxep,
                               '<br>Final Score:', endsum)), 
              alpha =.8
  ) +
  scale_x_continuous(limits = c(1.8,12.2), breaks=seq(2,12,by=1)) +
  scale_y_continuous(limits = c(-4,13), breaks=seq(-4,13,by=2)) +
  coord_flip() +
  geom_vline(xintercept = seq(2.5, 11.5, by = 1),
             color = "gray30", linetype = "dashed", size = .5) +
  labs(x = "Season", y = "Final Score") +
  theme_minimal() +
  theme(
    panel.grid.major.x = element_blank(),
    panel.grid.major.y = element_blank(),
    axis.text.x = element_text(family = "Arial"),
    text = element_text(size = 14, family = 'Arial')
  ) +
  scale_color_manual(name = "",
                     values = c('Winner' = 'goldenrod',
                                'Runner-up' = '#86dba5',
                                'Eliminated before final' = '#e68a95')) +
  scale_size_manual(name = "",
                    values = c('Winner' = 5,
                               'Runner-up' = 4,
                               'Eliminated before final' = 2)) 

#apply ggplotly() function to our plot, specify the text for the tooltip, remove the toolbar and format the legend

ggplotly(p,tooltip = "text") %>%
  config(displayModeBar = F) %>%
  layout(legend = list(orientation = "v", 
                       xanchor = "center", 
                       x = 1,
                       y=.3,
                       bordercolor = "#edd99f",
                       borderwidth = 2,
                       bgcolor = "#ffdbfa",
                       font = list(
                         family = "Arial",
                         size = 14,
                         color = "#000")))

Only one tweak had to be made to the original ggplot jitter plot code to get the Plotly visualization to work correctly. This was the addition of the “text” aesthetic in the geom_jitter() function:

text = paste('Baker:', baker,
             '<br>Status:', winflag,
             '<br>Max Episode:', maxep,
             '<br>Final Score:', endsum)

You’ll notice that in the ggplotly() function, the tooltip parameter was set to “text”.

ggplotly(p,tooltip = "text") 

This is the code that defines what text the user will see when hovering over a point.

The other additions to the ggplotly code are aesthetic. I use config(displayModeBar = F) to remove the toolbar that automatically gets added to Plotly plots (not necessary, I just don’t like how it looks). The layout() function is used to create the custom legend.

And there you have it: making a static plot dynamic was that easy!

ggplot’s geom_tile… not just for heat maps!

If you’ve read my previous blog post, you’ll know that I was able to convince a group of unsuspecting peers (ok, maybe one was suspecting) to create a final project Shiny app all about the Great British Bake Off (GBBO) using data I had previously scraped and wrangled from Wikipedia.

An unexpected challenge that came up while making the app was conceptualizing and creating charts or visualizations that displayed descriptive information about the GBBO. The ability to plot two variables to view a relationship has been drilled so deeply into my head that at this point, it’s second nature. This makes it easy to think of ideas for bar charts or line graphs that compare measures of baker performance. The ability to visually display information that isn’t necessarily trying to prove a point is not drilled into a science major’s head at all. Simple information like each episode’s theme is also important to summarize visually, and counter-intuitively more difficult to plot in a way that adds value.

geom_tile

As someone who works with SQL tables endlessly day after day, I think I naturally gravitate toward organizing and understanding information in a grid. This, to me, is the beauty of geom_tile.

Before making this app, I had only ever used geom_tile to create heat maps. Based on my Google searching, this is probably what geom_tile is used for 99% of the time. A standard heat map compares two categorical variables with one on each axis, forming a grid. The squares in the grid are most often colored based on the value of a continuous variable, like temperature, or perhaps something like age:

#the season/episode summary (one row per season and episode, with avg.age) is piped into ggplot here
ggplot(aes(as.factor(episode),as.factor(season.x), fill = avg.age)) +
  geom_tile(color = "white",
            lwd = .5,
            linetype = 1) +
  scale_fill_gradient(low = "pink", high = "brown", name = "Avg. Age") +
  geom_text(aes(label = round(avg.age,0)), color="white", size=rel(4)) +
  xlab("Episode") + 
  ylab("Season") +
  ggtitle("Average Baker Age by Season and Episode") +
  theme_minimal() +
  theme(panel.grid.major = element_blank())
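The summary table piped into that plot could be built with something like this sketch (the dataset and column names here are assumptions, not the exact ones I used):

library(dplyr)

#"bakers_episodes" and "age" are assumed names
agetiles <- bakers_episodes %>%
  group_by(season.x, episode) %>%
  summarize(avg.age = mean(age, na.rm = TRUE), .groups = "drop")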

*I made this as a quick example but it’s actually quite interesting. You can see that for many seasons, there is a tendency for average age to decrease as a season progresses – meaning younger bakers make it further in the competition. You can also see that some seasons have older or younger bakers in general (for example season 10 vs. season 5).

Though heat maps are usually colored with a continuous variable, they also work if you make the continuous variable discrete, or use a categorical variable:

colorpal <- colorRampPalette(c('lightsalmon','royalblue'))(5)

#again, the season/episode summary (now with an agegroup variable) is piped in here
ggplot(aes(as.factor(episode),as.factor(season.x), fill = agegroup)) +
  geom_tile(color = "white",
            lwd = .5,
            linetype = 1) +
  geom_text(aes(label = round(avg.age,0)), color="white", size=rel(4)) +
  xlab("Episode") + 
  ylab("Season") +
  ggtitle("Average Baker Age by Season and Episode") +
  theme_minimal() +
  theme(panel.grid.major = element_blank()) +
  scale_fill_manual(values = colorpal, name = "Age Group") 
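The agegroup variable used above could come from binning the average age, for example with cut(); the bin edges below are my own guess, not the ones used in the app:

#bin average age into 5 discrete groups (edges are illustrative)
agetiles <- agetiles %>%
  mutate(agegroup = cut(avg.age,
                        breaks = c(20, 30, 35, 40, 45, 60),
                        labels = c("20-30", "30-35", "35-40", "40-45", "45+")))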

*This version conveys the same information as the original, but it’s simplified and maybe slightly easier to pick up on the main point of the chart.

Plotting GBBO episode themes using geom_tile

In the Great British Bake Off, each episode in a season has a theme. Some of these themes are repeated season after season, such as Cake, Biscuits, and Bread, while other themes are unique to a season. Unique themes tend to fall into basic types, like countries (Japan, Germany, Denmark etc.) or ingredients (Chocolate, Dairy, Spice etc.). I wanted to find a way to show which themes are repeated across all seasons and which themes are unique, as well as when a theme occurs in each season, in one concise visual.

We will use a dataset that I scraped called weekthemes, which has four fields: season, week, week_theme and category. I coded theme categories myself:

Let’s start with a basic geom_tile with seasons on one axis and episode theme on the other:

weekthemes %>%
  mutate(season = as.factor(season)) %>%
  
  ggplot(aes(season, week_theme)) + 
  geom_tile(size = .5) 

Not very beautiful without formatting, but a start. Notice that the themes are organized alphabetically. That’s not necessarily bad, but it’s also not the most logical way to organize themes. We could perhaps organize them by category, popularity, or something else. I know from watching GBBO that certain themes usually occur around the same time each season – for instance, cake week is almost always the first episode, while (obviously) the final is always last.

Wrangling week themes to order the y axis:

This is where good old data wrangling comes in. I find that it’s easiest to order categorical variables by sorting the variable’s factor levels before putting the data into ggplot.

First, let's create a variable that ranks themes by their average episode number across all the seasons in which they occur:

weekthemes %>%
  group_by(week_theme) %>%
  summarize(avgweek = mean(week)) %>%
  mutate(ranking = rank(avgweek, ties.method = 'first'))

Now that we have a ranking for each theme, we can join this back to the original weekthemes dataset, and sort the week_theme variable by our new ranking variable. Let’s also use geom_text to add a data label to our chart, so that we can see which week each theme occurs in each season:

weekthemes %>%
  group_by(week_theme) %>%
  summarize(avgweek = mean(week)) %>%
  mutate(ranking = rank(avgweek, ties.method = 'first')) %>%
  inner_join(weekthemes) %>%
  mutate(week_theme = factor(week_theme, levels= unique(week_theme[order(desc(ranking))]))) %>%
  mutate(season = as.factor(season)) %>%
  
  ggplot(aes(season, week_theme)) + 
  geom_tile(size = .5) +
  geom_text(aes(label = week), color="white", size=rel(4)) 

This already feels more organized, and the chart now tells a slightly different story. The eye is drawn to the top of the chart, where the viewer can clearly see that Cake week is almost always first. As you navigate down the chart, other themes that tend to occur in the middle of the seasons become more unique and less common. Pâtisserie is almost always second to last and the Final of course last.

Adding another categorical variable:

To break up the large number of themes on the y axis, I want to add another level of organization with an additional category variable. I coded this variable myself for a few reasons. The first reason is that over the course of GBBO, certain themes have morphed into similar themes. Take for instance Pastry; in earlier seasons, there was no Pastry week. Seasons 2 and 3 had a Tarts episode and a Pies episode. In seasons 4 and 5, Tarts and Pies appear to have morphed into "Tarts and Pies," and a separate Pastry episode was introduced. From season 6 on, only Pastry remains. Though these themes are all slightly different, they all deal with the same basic category of shortcrust pastry.

The second reason I decided to code a category variable was to organize unusual themes together. As the seasons of GBBO have progressed, certain patterns in themes have arisen. Take for example country themes. There have been several seasons with a week devoted to the baking of one country, though the same country has never been repeated in multiple episodes. The country themes are thematically related, though not technically the same. I wanted a way to group these themes together.

Let’s add Category as the fill variable to our chart:

mycolors <- colorRampPalette(c('#fa8ca9','#ffdbfa','lightgoldenrod','#cce0ff',"#d4b7a9"))(12)

weekthemes %>%
  group_by(week_theme) %>%
  summarize(avgweek = mean(week)) %>%
  mutate(ranking = rank(avgweek, ties.method = 'first')) %>%
  inner_join(weekthemes) %>%
  mutate(week_theme = factor(week_theme, levels= unique(week_theme[order(desc(ranking))]))) %>%
  mutate(season = as.factor(season)) %>%
  
  ggplot(aes(season, week_theme, fill=category)) + 
  geom_tile(size = .5) +
  geom_text(aes(label = week), color="white", size=rel(4)) +
  scale_fill_manual(values = mycolors, name = "Category") 

Immediately, the addition of the category variable to color the tiles adds another level of structure to the chart. The chart also becomes easier to explore, as viewers can now examine themes across seasons by category, looking for patterns, similarities or differences.

Ultimately, for this chart, I decided to change the sorting of the themes to go by theme category, to emphasize each category more than individual themes. I accomplished this by changing the ranking variable to group by category rather than theme:

weekthemes %>%
  group_by(category) %>%
  summarize(avgweek = mean(week)) %>%
  mutate(ranking = rank(avgweek, ties.method = 'first')) %>%
  inner_join(weekthemes) %>%
  mutate(week_theme = factor(week_theme, levels= unique(week_theme[order(desc(ranking))]))) %>%
  mutate(season = as.factor(season)) %>%
  
  ggplot(aes(season, week_theme, fill=category)) + 
  geom_tile(size = .5) +
  geom_text(aes(label = week), color="white", size=rel(4)) +
  scale_fill_manual(values = mycolors, name = "Category") 

Sorting the themes by category so that all themes in the same category are next to each other allows viewers to see which themes fit into which category much more easily, while still getting a sense of which themes tend to occur at specific times in a season. I preferred this sorting method, but there is no right or wrong answer here; it would have been equally valid to leave the theme levels ordered as they were before.

Formatting and refining

The only thing left is to refine the formatting of the chart. There is truly no limit here, but for easy understandability I made the following changes to get my final product:

mycolors <- colorRampPalette(c('#fa8ca9','#ffdbfa','lightgoldenrod','#cce0ff',"#d4b7a9"))(12)

weekthemes %>%
  group_by(category) %>%
  summarize(avgweek = mean(week)) %>%
  mutate(ranking = rank(avgweek, ties.method = 'first')) %>%
  inner_join(weekthemes) %>%
  mutate(week_theme = factor(week_theme, levels= unique(week_theme[order(desc(ranking))]))) %>%
  mutate(season = as.factor(season)) %>%
  
  ggplot(aes(season, week_theme, fill=category)) + 
  geom_tile(color = 'gray20', size = .5) +
  scale_fill_manual(values = mycolors, name = "Category") +
  scale_x_discrete(position = "top",
                   labels=c("2" = "S2", "3" = "S3",
                            "4" = "S4", "5" = "S5",   
                            "6" = "S6", "7" = "S7", 
                            "8" = "S8", "9" = "S9", 
                            "10" = "S10", "11" = "S11", 
                            "12" = "S12")) +   
  labs(color = "Category") +
  geom_text(aes(label = week), color="black", size=rel(5)) +
  xlab("") + 
  ylab("") +
  ggtitle("Great British Bake Off Week Themes Season by Season Comparison") +
  theme(panel.grid.major.y = element_line(color = "#edd99f"),
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_line(),
        panel.border = element_rect(fill=NA,color="white", size=0.5, linetype="solid"),
        axis.line = element_blank(),
        axis.ticks = element_blank(),
        panel.background = element_rect(fill="white"),
        plot.background = element_rect(fill="white"),
        legend.text = element_text(size=12),
        legend.title = element_text(size=14),
        title = element_text(size =14),
        axis.text = element_text(color="black", size=14))

It might not be the prettiest possible version, but I was going for complete readability here.

In conclusion

geom_tile is an incredibly versatile tool to plot much more than just heat maps. I had a lot of fun hacking it to create this informative chart, and will certainly use it again for future charting needs!

Code and data used in this post can be found here.

Visualizing Winter Olympics success

This semester has been busy! I haven’t had a chance to dive into the side projects I’ve been brewing but I’ve really been enjoying the data visualization class I’m taking with Thomas Brambor. Our first assignment was to build some visualizations using historical Winter Olympics data going up to 2014, using ggplot and Plotly. I had a lot of fun doing this assignment and discovered some very cool tips and tricks in the process of perfecting my charts. If you’re viewing this on a phone – sorry, the Plotly charts do not take kindly to that. The full assignment with code is below:

Visualizing gender-neutral baby names with ggplot and Plotly

I’m finally taking a much-anticipated (by me) class for my MA program called “Data Visualization.” An optional exercise was to play around with a dataset of baby names from US census data. I had some fun creating this interactive chart of the most popular gender-neutral baby names over time.

Names included in this chart must have been in the top 10% of all names for a given year, with a boy:girl or girl:boy ratio of no more than 100:1.

The design of this chart exposes patterns in the predominant sex of a given name over time. Interestingly, it looks like a majority of popular baby names move from a higher ratio of boys to girls to a lower ratio over time. There are many more fascinating insights to find!

The code I wrote to generate this chart is below:

library(babynames)
library(ggplot2)   
library(magrittr)   
library(dplyr)  
library(RColorBrewer)
library(colorways2) #my color package
library(ggthemes)

f <- babynames %>% filter(sex=="F")
m <- babynames %>% filter(sex=="M")

unisex1 <- merge(f,m ,by=c("name","year"),all = TRUE)

base1 <- unisex1 %>%
  group_by(year) %>%
  mutate(overall=n.x+n.y) %>%
  mutate(ratio= n.y/n.x) %>%
  arrange(desc(ratio)) %>%
  mutate(logratio=log(ratio)) %>%
  mutate(overallcentile = ntile(overall,10)) %>%
  filter(tolower(name) != "unknown") %>%
  filter(tolower(name) != "infant") %>%
  filter(tolower(name) != "baby") %>%
  filter(overallcentile >=  10) %>%
  filter(abs(logratio) <= 2) 

d <- highlight_key(base1, ~name)

#had to make a new palette out of an existing one with 74 colors, one for each name
nb.cols <- 74
mycolors <- colorRampPalette(ballpit)(nb.cols)

p <- ggplot(d, aes(year, logratio, col= name)) + 
  geom_hline(yintercept=0, linetype="dashed", color = "black") +
  geom_line() + 
  theme_tufte() +
  geom_point() + 
  scale_y_continuous(labels = c("1:100", "1:10", "1:1","10:1","100:1")) +
  labs(title="Gender Distribution of Most Popular Gender-Neutral Names Over Time", x ="", y = "Boy:Girl ratio (log scale)") +
  theme( text=element_text(family="Helvetica",size = 14),plot.title = element_text(size = 14),axis.text = element_text(size = 12), axis.title = element_text(size = 14))+
  scale_x_continuous(breaks = round(seq(min(1880), max(2020), by = 10),1)) +
  scale_color_manual(values = mycolors) 
  
gg <- ggplotly(p)   
  
highlight(gg, dynamic = F, color = "black",selected = attrs_selected(showlegend = FALSE)) %>% 
 layout(margin = list(b = 40)) %>%
 layout(legend=list(title=list(text='')))

Designing functions to generate and display color palettes in R

In this post, I will go over the methodology I used to design the color palettes and functions to display them that comprise the colorways package referred to in my previous blog post.

Identifying the Problem

When it comes to color, I definitely believe in being adventurous and having many options based on mood and context. I think visualizations are best when they take an artistic and sometimes unexpected approach to color. Though blending in is sometimes necessary (for publication, or serious work-related presentations, maybe), when I work on projects for myself, I like to push the envelope a bit. It’s an aspect of data analysis that I truly enjoy.

I’ve found that palettes from RColorBrewer or default palettes work fine for certain situations, like heat maps that need a spectrum of color. But where they break down for me is in the fairly common occurrence of graphing several distinct subgroups. This calls for palettes with colors that are different enough from each other that no two subgroups are perceived as more similar than the others. RColorBrewer has a few “qualitative palettes” for this purpose, but I find them to be a bit generic.

Of course, RColorBrewer isn’t the only package for color palettes out there. This website has compiled a ton of color palette packages in one place, with a handy tool to select palettes. The palettes can then be retrieved in R using the Paletteer package, which is incredibly useful for accessing color palettes from many sources in a standardized way. Now that I found it, I will certainly be saving this resource and using it in the future. Some examples of favorites I’ve found from this site:

The current crop of color packages still lacks some features that, in my opinion, are essential to choosing a palette. As far as I know, they don't come with easy-to-use, built-in color shuffling, and they don't offer dynamic color previews in charts or graphs.

So, I set out to create my own package with three distinct functionalities:

  1. Save my own palettes
  2. Display any palette (either native or from another package) in a variety of forms (basic palette, charts, graphs)
  3. Shuffle color palettes if desired and save the newly ordered list of colors

Saving my own palettes

This is just as easy as choosing colors and putting them in lists. The fun part of course is naming the palettes based on my own whimsy. Some examples:

krampus <- c("#0782A6","#A66507","#994846","#CDD845","#624FAF","#52735D","#BBAE92","#FED2E7","#FFE402")
ballpit <- c("#5C33FF","#FF8E07","#E2E442","#42E44F","#C67CF9","#F64EBC","#ACF64E" ,"#C11736","#00B6A0")
donut <- c("#FA88F1","#2DEC93","#8FE2FF","#FF882B","#D80D0D","#D0A321","#369830","#B681FF","#858585")

I decided to go with 9 colors in each palette for uniformity and to maximize novel color combinations when shuffling. I intentionally ordered each palette so that my favorite color combinations come at the beginning; that way, when a palette isn't shuffled, optimal color combinations are preselected.


Paletteprint: view all palettes

The first thing I wanted to do was write a function that would display all the palettes at once. To do this, I had to first create a list of palettes and their names:

palettes <- list(dino, hive, rumpus, taffy, sleuth, martian, krampus, tulip, donut, donette, creme, farmhand, mayhem, ballpit, january, pair1)

names(palettes) <- c("dino", "hive", "rumpus", "taffy", "sleuth", "martian", "krampus", "tulip", "donut", "donette", "creme", "farmhand", "mayhem","ballpit","january","pair1")

Then I wrote a function using rectangles in base R graphing to make a chart:

paletteprint <- function() {
  bordercolor <- "black"
  x <- c(-8,27)
  y <- c(0,(length(palettes)*3))
  i <- (length(palettes)*3)
  n <- 1
  plot(1, type="n", xlab="", ylab="", xlim=x, ylim=y,axes=FALSE, frame.plot=FALSE)
  for (p in palettes) {
    rect(0,i,3,i-1, col = p[1], border = bordercolor, lwd = 2)
    rect(3,i,6,i-1, col = p[2], border = bordercolor, lwd = 2)
    rect(6,i,9,i-1, col = p[3], border = bordercolor, lwd = 2)
    rect(9,i,12,i-1, col = p[4], border = bordercolor, lwd = 2)
    rect(12,i,15,i-1, col = p[5], border = bordercolor, lwd = 2)
    rect(15,i,18,i-1, col = p[6], border = bordercolor, lwd = 2)
    rect(18,i,21,i-1, col = p[7], border = bordercolor, lwd = 2)
    rect(21,i,24,i-1, col = p[8], border = bordercolor, lwd = 2)
    rect(24,i,27,i-1, col = p[9], border = bordercolor, lwd = 2)
    text(x = -8, y = i-.5, # Coordinates
         label = names(palettes[n]), pos =4)
    i = i-3
    n= n+1
  }
}

The output when calling paletteprint() looks like this:


Colordisplay: view, shuffle and choose the number of colors in a palette

Next I wanted a function that would display the colors in a selected palette in a straightforward way, allow me to choose how many colors to see, shuffle the colors if desired, and return a list of the colors displayed.

colordisplay <- function(palette, number = 9, bordercolor = "black", shuffle = "no") {

  if (shuffle == "yes"){
    shuff <- sample(seq(from = 1, to = length(palette), by = 1), size = length(palette), replace = FALSE)
  }

  else {
    shuff <- seq(1, length(palette), by=1)
  }

  if (number == 9) {
    names = c(palette[shuff[1]], palette[shuff[2]],palette[shuff[3]],palette[shuff[4]],palette[shuff[5]],palette[shuff[6]],palette[shuff[7]],palette[shuff[8]],palette[shuff[9]])
    title <- paste(names, collapse = ", ")
    x <- c(0,3)
    y <- c(7,10)
    plot(1, type="n", xlab="", ylab="", xlim=x, ylim=y,axes=FALSE, main = title, frame.plot=FALSE)
    rect(0,10,1,9, col = palette[shuff[1]], border = bordercolor, lwd = 4)
    rect(1,10,2,9, col = palette[shuff[2]], border = bordercolor, lwd = 4)
    rect(2,10,3,9, col = palette[shuff[3]], border = bordercolor, lwd = 4)
    rect(0,9,1,8, col = palette[shuff[4]], border = bordercolor, lwd = 4)
    rect(1,9,2,8, col = palette[shuff[5]], border = bordercolor, lwd = 4)
    rect(2,9,3,8, col = palette[shuff[6]], border = bordercolor, lwd = 4)
    rect(0,8,1,7, col = palette[shuff[7]], border = bordercolor, lwd = 4)
    rect(1,8,2,7, col = palette[shuff[8]], border = bordercolor, lwd = 4)
    rect(2,8,3,7, col = palette[shuff[9]], border = bordercolor, lwd = 4)

    return(title)
  }

  else if (number == 8) {
    names = c(palette[shuff[1]], palette[shuff[2]],palette[shuff[3]],palette[shuff[4]],palette[shuff[5]],palette[shuff[6]],palette[shuff[7]],palette[shuff[8]])
    title <- paste(names, collapse = ", ")
    x <- c(0,4)
    y <- c(8,10)
    plot(1, type="n", xlab="", ylab="", xlim=x, ylim=y,axes=FALSE, main=title, frame.plot=FALSE)
    rect(0,10,1,9, col = palette[shuff[1]], border = bordercolor, lwd = 4)
    rect(1,10,2,9, col = palette[shuff[2]], border = bordercolor, lwd = 4)
    rect(2,10,3,9, col = palette[shuff[3]], border = bordercolor, lwd = 4)
    rect(3,10,4,9, col = palette[shuff[4]], border = bordercolor, lwd = 4)
    rect(0,9,1,8, col = palette[shuff[5]], border = bordercolor, lwd = 4)
    rect(1,9,2,8, col = palette[shuff[6]], border = bordercolor, lwd = 4)
    rect(2,9,3,8, col = palette[shuff[7]], border = bordercolor, lwd = 4)
    rect(3,9,4,8, col = palette[shuff[8]], border = bordercolor, lwd = 4)

    return(title)
  }

# and so on until number == 2

Ok, yes, there might have been a better way to set this up as a loop that doesn't require writing a new if statement for every number of colors. BUT I did want control over how the display looks for each number of colors chosen, so honestly I don't think it's really that needlessly wordy.
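For what it's worth, here is a rough sketch of what a loop-based layout could look like (this is not the package's code, and it gives up the per-count layout control I wanted):

#hypothetical compact alternative: draw n swatches in a grid, up to 3 per row
colordisplay_compact <- function(palette, number = length(palette), shuffle = "no") {
  cols <- if (shuffle == "yes") sample(palette, number) else palette[seq_len(number)]
  ncols <- min(3, number)
  nrows <- ceiling(number / ncols)
  plot(1, type = "n", xlab = "", ylab = "", xlim = c(0, ncols), ylim = c(0, nrows),
       axes = FALSE, main = paste(cols, collapse = ", "), frame.plot = FALSE)
  for (i in seq_len(number)) {
    x <- (i - 1) %% ncols            #column position
    y <- nrows - ((i - 1) %/% ncols) #row position, filling top to bottom
    rect(x, y - 1, x + 1, y, col = cols[i], border = "black", lwd = 4)
  }
  invisible(cols)
}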

If I call colordisplay, choose my palette and use all default parameters, the result will look like this (fully zoomed out):

colordisplay(january)

I will also get this as output:

[1] "#F1F07C, #BB7EEE, #98EAC8, #65859C, #8C2438, #ADA99D, #AD840C, #398726, #EC5570"

If I call colordisplay with shuffle on, I get a randomly shuffled output with the corresponding list of colors:

colordisplay(january, shuffle = "yes")
[1] "#AD840C, #98EAC8, #8C2438, #EC5570, #F1F07C, #65859C, #BB7EEE, #398726, #ADA99

Changing the number parameter between 2-8 colors will result in the following shapes:

You can also choose to display a palette from another package, or using paletteer:

colordisplay(paletteer_d("nationalparkcolors::DeathValley"), number = 6, shuffle = "yes")
[1] "#E79498FF, #73652DFF, #F7E790FF, #514289FF, #B23539FF, #FAB57CFF"

Colorbar: view a palette as a bar chart:

I wanted to write a function that would allow me to input a palette, the number of colors I want from that palette, and whether I want the colors to be shuffled, and output a sample bar chart. This is the main functionality that I feel current color packages lack. Of course, you can keep your own code for a sample bar chart and test colors manually, but this function has made it so much easier to quickly assess whether a set of colors will work for a visualization. I also think it’s a ton of fun to shuffle the colors again and again and see what comes up!

The code behind this function is pretty similar to what I wrote for colordisplay. Full code for all functions can be found in this github repo.
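To give a sense of the idea, here is a stripped-down sketch (not the package's actual code): build a dummy bar chart colored with the first n colors of a palette and return the colors used.

library(ggplot2)

colorbar_sketch <- function(palette, number = 5, shuffle = "no") {
  cols <- if (shuffle == "yes") sample(palette, number) else palette[seq_len(number)]
  df <- data.frame(group = LETTERS[seq_len(number)],
                   value = sample(3:10, number, replace = TRUE))
  print(ggplot(df, aes(group, value, fill = group)) +
          geom_col(show.legend = FALSE) +
          scale_fill_manual(values = cols) +
          theme_minimal())
  paste(cols, collapse = ", ") #return the ordered list of colors, like the real function
}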

Let’s try colorbar with the tulip palette and default parameters:

colorbar(palette = tulip)

We can add up to 7 colors in the bar chart:

colorbar(palette = tulip, number = 7)

We can choose to stack the bars:

colorbar(palette = tulip, number = 7, stacked = "yes")

We can shuffle the colors:

colorbar(palette = tulip, number = 7, stacked = "yes", shuffle = "yes")

Like colordisplay, colorbar will always output the ordered list of colors for each chart:

[1] "#D7DDDE, #A25E5F, #FFC27E, #FFFBC7, #9B8054, #E0E38C, #E0C2F6"

Lastly, we can use palettes from other packages:

colorbar(palette = paletteer_d("wesanderson::IsleofDogs2"), number = 4)

Colorscatter: view a palette as a scatter plot:

Colorscatter is virtually the same function as colorbar, but instead of a bar chart, it displays a scatter plot. The only difference in parameters between the two is that colorscatter lacks the "stacked" feature, for obvious reasons.

Let’s try the default colorscatter function using the “ballpit” palette:

colorscatter(palette = ballpit)

Like colorbar, we can view up to 7 colors using colorscatter:

colorscatter(palette = ballpit, number = 7)

We can also shuffle colors:

colorscatter(palette = ballpit, number = 5, shuffle = "yes")

Colorscatter will also return an ordered list of colors each time it is run:

[1] "#C11736, #E2E442, #5C33FF, #00B6A0, #ACF64E"

And of course like colorbar, colorscatter will accept external palettes:

colorscatter(palette = paletteer_d("tvthemes::simpsons"), number = 5, shuffle = "yes")

In conclusion:

I've certainly only scratched the surface of what's out there and what's possible in the world of R color packages. I'm happy with the package I've put together, but I can already think of some improvements or additional functionalities I'd like to add (for example, a parameter for switching the charts to dark backgrounds). For now I think this is a great start and I'm excited to do more research and make more updates!

For full code, visit the colorways package GitHub repo here

Creating a local R package using RStudio

If there’s one thing I love about R, it’s visualizations. And if there’s one thing I love about visualizations, it’s getting to choose the colors. But if there’s one thing I hate about getting to choose the colors, it’s forgetting which colors I like and having to look up color names or hex codes and saving new lists of colors of varying lengths each time I want to generate a new chart or graph. I’ve suspected for a while that it would pay to make a package to save color combinations, but it always felt like too much of a chore. Then I got Omicron.

I was faced with a week’s worth of abyss, quarantined in a single room in my family’s home after flying to California for the holidays. After a healthy amount of grimacing, I decided enough was enough, I would do something with the meager tools at my disposal and finally take on the task of creating a package to save and display color combinations. I spent a few days putting palettes together and writing functions to display, shuffle, and return the color codes based on user input. I’ll write another post just about the contents of the package, because I like it and think it’s a lot of fun.

For now I want to document how I was able to convert functions and data into a package that can be called at any time. This is definitely not the only way to save code as a package, but this is the easiest way to do it that worked for me. Because I already made the color package, I will make a new package for the sake of this post called michiganjfrog.

Step 1: Open a new project in RStudio

In RStudio, navigate to File -> New Project… You will then be prompted to save the current workspace; do so if you like. The project wizard will pop up. Choose New Directory -> Package. You'll get a form that looks like this:

Here you will pick the package name, where the package will be saved, and whether to create a git repository. My package uses only base R so I didn’t worry about checking “Use renv with this project,” but for future reference, checking that box could be useful for dependency management. I checked “Create a git repository” because I wanted to track changes to my package on GitHub.

Once the package is created, you’ll get something that looks like this:

The wizard already comes with a sample R script with a function called “hello.”

Step 2: Save a function

To save more functions, select File -> New File -> R Script and write your function. It is generally best practice to save functions in their own R scripts, or to save related functions together.

When you save your R script, RStudio will automatically suggest you save it to the “R” folder within your package. That is exactly where it should be saved. Don’t change that. Name your file either the same name as your function or a short, easy to remember name for your related functions.
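For example, a (made-up) one-function script for the michiganjfrog package could be saved as R/frogfact.R:

#R/frogfact.R - a made-up example function for the michiganjfrog package
frogfact <- function() {
  facts <- c("green", "yellow", "top hat", "shy", "eternal", "ragtime")
  sample(facts, 1) #return one random fact
}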

Step 3: Build your package with the new function

This step doesn't necessarily have to happen right after you save your first function, but it's useful to rebuild your package as often as possible, and in particular before creating the documentation file for your function, for reasons that will become clear in step 4. To build (basically, save) your package, navigate to the top right corner of RStudio, select the "Build" tab, and click "Install and Restart."

Step 4: Document your function

Go to File -> New File -> R Documentation… A dialog box will pop up that looks like this:

If you’ve already built the package with a function by the same name as the “Topic name,” RStudio will recognize this and automatically fill in some of the fields in the documentation template when you press OK:

You can fill in the other fields with any relevant information, then press “preview” to view the html rendering of the documentation file:

To save the documentation, follow step 3 to rebuild the package.

Step 5: Add data

Adding data such as lists or tables to a package is relatively easy. Just open a new script, define your data and use the use_data() function (from the usethis package) to save the data within the package. It should look like this:

library(usethis)  #use_data() comes from the usethis package

frogfacts <- c("green","yellow","top hat","shy","eternal","ragtime")

use_data(frogfacts)

Make sure you run the code and rebuild the package to save your changes.

Step 6: Connect to GitHub

After creating a new repository on GitHub, open a terminal window and type the following:

cd desktop/michiganjfrog  # or wherever the package is saved
git remote add origin https://github.com/szaidman22/michiganjfrog.git
git branch -M main
git add -A
git commit -m "adding package contents"
git push --set-upstream origin main

This will connect the package directory to the GitHub repository and add everything to the main branch. To commit and push changes from RStudio, navigate to the "Git" tab at the top right of the RStudio window:

Check any of the documents/directories within the package with changes that you want to push to GitHub, then press the “Commit” button, close the dialog box that pops up, and press the “Push” button:

To check that the changes were pushed properly, go to the GitHub repository and look for the commits.

Step 7: Check the package

There are many ways to check that the package does what it's supposed to, but in my opinion the easiest is just to open a new project and try to use it. Because the package is already saved and installed on your own machine, there is no need to install it again. All you have to do is load it with library().
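In a fresh session, that amounts to something like this (frogfacts is the data object saved in step 5):

library(michiganjfrog)

frogfacts   #the data object saved with use_data()
?frogfacts  #its documentation page, if one was written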

It looks like the package and descriptions work! Stay tuned to hear more about the color palette package I made last month.

Gender and Race in the Tech Industry – Analysis of Bias in Compensation

As part of my coursework for the QMSS MA program at Columbia, I designed a hierarchical regression model to analyze salary data from levels.fyi, focusing specifically on the significance of race and gender variables in predicting total annual compensation. I wrote the full project in R markdown and included it below, along with a much shorter summary in this post. I hope you enjoy 🙂


Background – Current Gender and Wage gaps in the US

In their 2021 Gender and Pay Gap Report, which analyzes pay disparities in the US across all industries via crowdsourced data, PayScale found that, without adding any control variables, women make 82¢ for every dollar earned by men. After adding control variables, women made 98¢ for every dollar earned by men, leaving a 2% difference attributable purely to discrimination based on gender.

PayScale's findings on the racial wage gap show that, with or without controls for demographics, both men and women of most races earn less than white men. Interestingly, when controlling for external factors, Asian men and women earn more than any other group.

Data and Research Design

Levels.fyi is a website founded in 2017 as a place for tech industry professionals around the world to anonymously share detailed compensation information. In 2020, levels.fyi began collecting race, gender, and education information from users along with salary information.

Using data from levels.fyi, I asked the question: can any of the variance in compensation in the tech industry be explained by racial and gender differences? If so, how much of this variance can be attributed to differences in years of experience, job title, educational attainment, and cost of living between genders and racial groups?

My dependent variable was total annual compensation, with gender, race, education, total years of experience, years at the current company and cost of living index as independent variables.

My sample came from a comprehensive dataset of scraped salary postings from levels.fyi. I limited my analysis to jobs in the US and removed NA values for the target independent variables. I also removed records with total yearly compensation equal to 0. I joined this data to a separate table with cost of living index values by US state.

Hierarchical Regression Model

Stepwise multiple regression was used to assess whether gender and/or race would contribute any significant additional explanatory power to the prediction of total annual compensation beyond that of the control variables. The equations for each step of the hierarchical regression model are below:

Model 2a (control variables only): 

ln(Total Annual Compensation) = β0 + β1(Years of Experience) + β2(Years at Company) + β3(Education) + β4(Title) + β5(Index)

Model 2b (control variables + gender): 

ln(Total Annual Compensation) = β0 + β1(Years of Experience) + β2(Years at Company) + β3(Education) + β4(Title) + β5(Index) + β6(Gender)

Model 2c (control variables + gender + race): 

ln(Total Annual Compensation) = β0 + β1(Years of Experience) + β2(Years at Company) + β3(Education) + β4(Title) + β5(Index) + β6(Gender) + β7(Race)
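As a sketch of the stepwise comparison in R (the variable names below are stand-ins, not the exact fields in the levels.fyi extract):

m2a <- lm(log(total_comp) ~ years_experience + years_at_company +
            education + title + col_index, data = salaries)
m2b <- update(m2a, . ~ . + gender) #add gender
m2c <- update(m2b, . ~ . + race)   #add race

anova(m2a, m2b) #F test: does gender add significant explanatory power?
anova(m2b, m2c) #F test: does race add significant explanatory power?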

Results

The adjusted R^2 value for model 2a was .410, meaning that 41% of the variance in total annual compensation in the sample could be explained by the control variables alone.

After controlling for years of experience, years at the company, education, job title and cost of living index, being female was associated with a 7% decrease in total yearly compensation on average compared to being male. The addition of gender in model 2b added .2% of explanatory power overall. An F test between models 2a and 2b confirmed that this was a significant increase in explanatory power.

In the final step of the hierarchical regression, after controlling for years of experience, years at the company, education, job title, cost of living index and gender, the only group that differed significantly from White posters at the p < .001 threshold in total annual compensation was Black posters, who were compensated on average 6% less annually. The difference in total annual compensation between White and Asian posters was significant at the p < .01 threshold, with Asian posters making on average 2% more annually than White posters. The addition of race added only .1% more explanatory power to the overall model. An F test between models 2b and 2c confirmed that this was a significant increase in explanatory power.

Discussion

The direction of every coefficient in the final model agrees with PayScale's 2021 analysis. By and large, the magnitude of the coefficients for race variables agreed with PayScale's analysis as well. One notable difference between my analysis and PayScale's is the magnitude of the coefficient for Gender [Female]. In the controlled model (Model 2c), the coefficient for Gender [Female] of -.07 is substantially larger in magnitude than what would be expected based on the 2% difference in pay that PayScale found between men and women across industries. This suggests that there may be a larger degree of discrimination in pay based on gender in the tech industry compared to other industries. Perhaps some of this effect can be explained by debunked yet pervasive stereotypes that women naturally have less ability in quantitative disciplines. This is a fascinating area for further research.

There was a large drop in coefficient magnitude for the dummy-coded race and gender variables after controlling for years of experience, years at the company, education, job title and cost of living. This drop was especially large in the coefficient for Race [Black], going from -.17 to -.06, a difference of roughly 11 percentage points in the estimated pay gap. This suggests that there is important information contained in the control variables that should be explored further. Systemic differences between racial groups and genders in educational attainment, job title, years of experience and tenure at a company would all affect total annual compensation. If these mediating factors are not addressed along with outright discrimination, financial parity for demographic groups that have historically been excluded from the tech industry will be severely slowed.

Ultimately, there remains a significant amount of variance in total annual compensation that cannot be explained by any of the control variables, particularly for Black tech workers and for women. As the population of these groups rises in the industry, it is increasingly important to continue to analyze the biased systems and attitudes that contribute to this phenomenon.

Pivoting data in R

Bakeoff Data Collection Part 2

I left off the last GBBO data collection post with code in Github to generate a dataset that combines every episode table from seasons 2-11. This included information on what bakers made for the signature and showstopper bakes, as well as their performance in the technical challenges.

Some crucial information that the episode table doesn’t have is the results of the episode – who is crowned “Star Baker” and who is eliminated. This information is contained in a separate chart for each season that looks like this:

The elimination chart is set up differently from the episode table in a few ways. First, while there is still one row per baker, each column represents an episode, rather than there being a separate table for each episode. The chart also uses color to store information. While some of the colors and text are redundant, favorite and least favorite baker information is only stored with color.

The goal:

  • extract all the data from this table, including colors
  • convert it to a shape that can be joined easily to the episodes table I made in the previous GBBO post

This will involve the dreaded… pivot.

Pivot Philosophy Lesson

I am constantly traumatized by pivoting with SQL at work. I don’t think pivoting in SQL is intuitive at all, and after some soul searching I think I know why. It’s a cliché but at my core I am a visual learner who understands concepts best if I can picture them in my head spatially. When I first had to pivot data while working at the NIH I used SPSS, which gives you the option to pivot between “wide” and “long” data – this I understood very well because I could clearly imagine how I wanted the data to rearrange itself. Wide data stores information in more columns and long data stores information in more rows. To convert between the two, you either have to combine columns and, if you’re not aggregating, add additional rows from what was previously a single observation (wide to long), or you combine multiple rows into one and (again, if not aggregating) add additional variables (long to wide).

The way SQL is written is usually quite helpful and intuitive because it reads like a full sentence that tells a story about data. It’s great for picking and choosing from the data you have, but it gets clunky when you want to imagine something new. Because SQL describes data in this literary way, it makes sense to me that it breaks down in its utility when dealing with data as a spatial structure.

I think it is much simpler and more intuitive to use R (or Python, I assume, never tried) to pivot data because it is a functional programming language that uses objects. I greatly prefer pivoting in R using a simple function to building a fussy query in SQL with multiple CTEs. The function I have used for this program is from the tidyr package which is R royalty and truly needs no introduction or explanation. It’s a bonus that the function is called pivot_longer (there is also, naturally, pivot_wider) – so intuitive, so good.
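As a toy illustration of the two shapes before we get to the real data:

library(tidyr)

#wide: one row per baker, one column per episode
wide <- data.frame(baker = c("Peter", "Dave"),
                   ep1 = c("Star Baker", "Safe"),
                   ep2 = c("Safe", "Eliminated"))

#long: one row per baker per episode
pivot_longer(wide, cols = starts_with("ep"),
             names_to = "episode", names_prefix = "ep",
             values_to = "result")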

Pivoting from Wide to Long with R

Our elimination chart data is wide. There is only one row per baker, and 10 columns, one for each episode. The episode data that I made in the previous GBBO post is long – rather than a single row for each baker and a column for each episode, there is a row for each baker and episode, and one episode column (instead of 10) that tells us which episode the data in each row refers to. To merge the elimination chart data with the episode data, I’ll need to change it from having one column for each episode to having one row for each baker and episode, with one episode column – essentially multiplying the number of rows by 10 and reducing the columns from 10 to 1. Then I’ll be able to add additional variables to hold the favorite, least favorite, star baker and eliminated baker data.

Let’s continue using season 11 as an example. First, we will pull the raw data into a table called “result”

url = "https://en.wikipedia.org/wiki/The_Great_British_Bake_Off_(series_11)"
webpage <- read_html(url)
tables <- html_nodes(webpage,'table.wikitable')
tables <- html_table(tables, header = TRUE)

result <- data.frame(tables[2])

The shape of this table is identical to that of the chart, though it didn't automatically use the first row as column names. That's fine; we'll actually use this to our advantage when we pivot.

The code to pivot the result table is as follows:

result.pivot <- tidyr::pivot_longer(subset(result, Elimination.chart != 'Baker'),
                                    cols = starts_with("Elimination.chart."), 
                                    names_to = "episode", 
                                    names_prefix = "Elimination.chart.",
                                    values_to = "result")

I've used the pivot_longer function from tidyr and removed the extra row with column names from the data. The cols argument should list all the columns you want to pivot. I used the starts_with function to include only columns that begin with "Elimination.chart.", so I'm pivoting every column except the first, which contains the baker names. The names_to argument creates a new column (in this case called "episode") that will house the name of the column that was pivoted to a row. The names_prefix argument strips the prefix "Elimination.chart." from those column names, leaving a nice clean value for the episode number. Finally, the values_to argument creates a new column ("result") that houses the values from the columns that were pivoted.

We end up with this:

This looks nice! This looks sleek! This looks like the episodes table! One problem – this doesn’t contain any background color information 🙁

Now, HTML or XML or any markup language is not my strong suit, but that is OK because I am great at googling. I was finally able to find and finesse some code from Stack Overflow to miraculously pull the color information into a table that is *close* to the same shape as our pivoted data:

colors <- webpage %>% html_nodes(xpath='//*[@id="mw-content-text"]/div/table[3]') %>% html_nodes('td')
colors <- bind_rows(lapply(xml_attrs(colors), function(x) data.frame(as.list(x), stringsAsFactors=FALSE)))

We get three columns called align, style and colspan. Each row seems to correspond to one td tag or cell in the data table, read from left to right. If you look at the “Peter” row in the original chart, you’ll see that the left-most cell has no background and is left aligned. That’s row 1 in the table. In week 1, Peter was Star Baker, which has a pale yellow background. That’s row 2 – with style “background:LemonChiffon;”. Peter was a favorite baker in weeks 4 and 5. You can see that there is one merged cell with a dark blue background for those weeks. This is represented in row 5 with style “background:Cornflowerblue;” and colspan = 2. This is a more succinct way of recording the html table data, but we’ll need to ungroup weeks 4 and 5 to get one row per episode. We’ll also have to remove the rows with no style value, as those don’t contain any episode data.

#in columns with colspan NA, replace with 1
colors$colspan <- replace(colors$colspan, is.na(colors$colspan), 1)

#makes a new table with colspan as numeric rather than string
dups <- colors %>% transform(colspan = as.numeric(colspan)) 

#uses the rep function to duplicate rows by the value in colspan - creating one row per episode
colorsduped <- dups[rep(seq_len(nrow(dups)), as.numeric(unlist(c(dups$colspan)))), ]

#keeps only rows with a value with the string "background" in the style column (so removes the meaningless left align rows)
colorsduped <- colorsduped %>% filter(tolower(colorsduped$style) %like% "background")

#now we can make a new column called status that translates the cornflower blue and plum colors into "Favorite" and "Least Favorite" bakers of the week.
colorsduped <- colorsduped %>% mutate(status =
                                        case_when(tolower(style) %like% "cornflower" ~ "Favorite", 
                                                  tolower(style) %like% "plum" ~ "Least Favorite",
                                                  TRUE ~ "")
)

#reset index that got messed with by the duplications
rownames(colorsduped) <- 1:nrow(colorsduped)

Now we have one row per episode per baker. Even though we don’t have a name variable in here, because the data is ordered in the same way as the first table we pivoted with text information, and because we reset the index, we can join this table to our first table on the index using the merge function:

#merge result.pivot and colorsduped to get all information in one place
final <- merge(result.pivot, colorsduped[,c("style","status")], by = 0)

#sorting for the sake of explanation
final <-final[order(final$baker, as.numeric(final$episode)),]

Beautiful! Now we can do this once for each season, combine all the seasons, and merge the columns we want (result and status) onto the episodes dataset using the baker.season.episode variable, roughly as sketched below. Complete code for this can be found on this Github page.
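Here is that final step with placeholder object names (results.by.season is an assumed list holding each season's "final" table, and baker.season.episode is the join key described above):

#stack the per-season tables, then merge the result and status columns onto the episodes data
allresults <- dplyr::bind_rows(results.by.season)

episodes.full <- merge(episodes,
                       allresults[, c("baker.season.episode", "result", "status")],
                       by = "baker.season.episode", all.x = TRUE)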

Whipping up some Great British Bake Off Data

With R!

Following the “recipe” 😉 for blogs here on SofiaZaidman.com, I will begin with a preamble explaining why on earth I would spend so much of my free time collecting and combing through every crumb 😉 of data I can find on The Great British Bake Off (GBBO).

There was about a year-long period when I commuted from Brooklyn to Columbia's Morningside campus, during which I would download GBBO episodes to my phone and watch one on the subway during my morning commute and another in the evening. This means I have now seen every season available on Netflix probably more than three times through. It is hands down my favorite TV show.

Data from the GBBO is ripe for analysis because the show treats baking like a sport. An obvious ML place to start is predicting the winner based on demographic information and performance in challenges. This has been done beautifully by Danny Antaki in a project called DeepBake. I absolutely love this and maybe one day will spend enough time learning neural nets to do this myself.

Another fun idea I had was to make a GBBO recipe generator based on past GBBO recipe names. This is something I have desperately wanted to do ever since this Buzzfeed article from the relatively early days of neural net memes, which I still think is one of the most hilarious things ever. Jacqueline Nolis (co-author of Build a Career in Data Science, co-host of the podcast of the same name that I highly recommend, and a kindred spirit when it comes to fun data science projects) gave a talk on building an AI like this and has a really great tutorial on her Github page.

Before I can do any of these things of course, I have to source the GBBO data. I’ve started enjoying data wrangling in a very nerdy way, so I was excited when I noticed that the data used for DeepBake was pulled directly from each GBBO season’s Wikipedia page using a super cool R package called rvest. I thought I’d take a stab at it myself to learn a new scraping technique in R rather than Python.

Scraping Wikipedia tables in R using rvest

Inspect tables on Wikipedia

The first step in scraping Wikipedia tables is to inspect the Wikipedia page you’d like to scrape. I went to the GBBO master page and clicked on each season’s individual pages. Most season pages have one table per episode with four columns: Baker, Signature, Technical and Showstopper. Here is one example:

Now let’s try rvest

I adapted some all-purpose rvest code to pull all tables on a Wikipedia page. The code is pretty straightforward and manages to use very few lines to extract the information we want. Basically, the html_nodes function retrieves all wikitables on the webpage we specify, and the html_table function converts the list of nodes into a list of tibbles, one per table.

# open rvest, data.table and some common packages that may come in handy
library("rvest")
library("data.table")
library("magrittr")
library("dplyr")
library("ggplot2")

# get season 3 tables from wikipedia
url <- "https://en.wikipedia.org/wiki/The_Great_British_Bake_Off_(series_3)"
webpage <- read_html(url)
table3nodes <- html_nodes(webpage,'table.wikitable')
table3 <- html_table(table3nodes, header = TRUE)

The list of nodes should look like this:

Once the list of nodes is converted to a list of tables, each table should look something like this:

Next I constructed a loop for each season to:

  • Pull and label individual episode data from each season’s page (have to check manually on Wikipedia to know which tables will have episode data)
  • Convert the data to data frames
  • Extract signature, technical and showstopper challenge names and save as new variables
  • Merge all episode data into one data frame

#set count
count <- 0

#create empty data frame with column names
season3 <- data.frame(matrix(ncol = 9, nrow = 0))
x <- c('baker','signature','technical.rank','showstopper','episode', 'signature.name','technical.name','showstopper.name', 'season')
colnames(season3) <- x

#build for loop
for (episode in table3[3:12]) {

  ep <- data.frame(episode)
  count = count +1
  ep['episode'] = count
  
  library(stringr)
  signature_name <- str_replace_all(colnames(ep[2]), "[[:punct:]]", " ")
  ep['signature.challenge'] = str_remove(signature_name, 'Signature.')
  
  technical_name <- str_replace_all(colnames(ep[3]), "[[:punct:]]", " ")
  ep['technical.challenge'] = str_remove(technical_name, 'Technical.')
  
  showstopper_name <- str_replace_all(colnames(ep[4]), "[[:punct:]]", " ")
  ep['showstopper.challenge'] = str_remove(showstopper_name, 'Showstopper.')
  
  ep['season'] = '3'
  
  colnames(ep) <- x
  
  season3 <- rbind(season3, ep)
}

This code will create a data frame called season3 with 9 columns and 76 observations – one per baker per episode:

A loop like this can be built and run for each season, and the resulting data frames combined into one master data frame that contains the information from every episode table (sketched below). The complete code and final dataset are in this Github repository.
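For example (object names illustrative), once each season's loop has produced its own data frame:

#stack the per-season data frames into one master table
gbbo.episodes <- rbind(season2, season3, season4, season5, season6,
                       season7, season8, season9, season10, season11)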

I’ll add more to this project soon!