Whipping up some Great British Bake Off Data
With R!
Following the “recipe” 😉 for blogs here on SofiaZaidman.com, I will begin with a preamble explaining why on earth I would spend so much of my free time collecting and combing through every crumb 😉 of data I can find on The Great British Bake Off (GBBO).
For about a year I commuted from Brooklyn to Columbia’s Morningside campus, and during that stretch I would download GBBO episodes to my phone and watch one on the subway during my morning commute and another in the evening. As a result, I have probably seen every season available on Netflix more than three times through. It is hands down my favorite TV show.
Data from the GBBO is ripe for analysis because the show treats baking like a sport. An obvious place to start with machine learning is predicting the winner from demographic information and performance in the challenges. This has been done beautifully by Danny Antaki in a project called DeepBake. I absolutely love this and maybe one day will spend enough time learning neural nets to do this myself.
Another fun idea I had was to make a GBBO recipe generator based on past GBBO recipe names. This is something I have desperately wanted to do ever since this Buzzfeed article from the relatively early days of neural net memes, which I still think is one of the most hilarious things ever. Jacqueline Nolis (co-author of Build a Career in Data Science, co-host of the podcast of the same name, which I highly recommend, and a kindred spirit when it comes to fun data science projects) gave a talk on building an AI like this and has a really great tutorial on her Github page.
Before I can do any of these things, of course, I have to source the GBBO data. I’ve started enjoying data wrangling in a very nerdy way, so I was excited when I noticed that the data used for DeepBake was pulled directly from each GBBO season’s Wikipedia page using a super cool R package called rvest. I thought I’d take a stab at it myself to learn a new scraping technique in R rather than Python.
Scraping Wikipedia tables in R using rvest
Inspect tables on Wikipedia
The first step in scraping Wikipedia tables is to inspect the Wikipedia page you’d like to scrape. I went to the GBBO master page and clicked on each season’s individual pages. Most season pages have one table per episode with four columns: Baker, Signature, Technical and Showstopper. Here is one example:
Now let’s try rvest
I adapted some all-purpose rvest code to pull all of the tables on a Wikipedia page. The code is pretty straightforward and manages to extract the information we want in very few lines. Basically, the html_nodes function retrieves every node with the wikitable class on the page we specify, and the html_table function converts that list of nodes into a list of tibbles, one per table.
# load rvest, data.table and some common packages that may come in handy
library("rvest")
library("data.table")
library("magrittr")
library("dplyr")
library("ggplot2")
# get season 3 tables from wikipedia
url <- "https://en.wikipedia.org/wiki/The_Great_British_Bake_Off_(series_3)"
webpage <- read_html(url)
table3nodes <- html_nodes(webpage,'table.wikitable')
table3 <- html_table(table3nodes, header = TRUE)
The list of nodes should look like this:
Once the list of nodes is converted to a list of tibbles, each table should look something like this:
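If you’d rather confirm this from the console than from a screenshot, a few quick lines will show what came back (on the series 3 page, the episode tables happen to be elements 3 through 12 of table3):
# quick console check of the scraped objects
length(table3nodes)   # number of wikitable nodes found on the page
class(table3[[3]])    # each element of table3 is a tibble (or data frame, depending on your rvest version)
head(table3[[3]])     # first few rows of the first episode table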
Next I constructed a loop for each season to:
- Pull and label individual episode data from each season’s page (have to check manually on Wikipedia to know which tables will have episode data)
- Convert the data to data frames
- Extract signature, technical and showstopper challenge names and save as new variables
- Merge all episode data into one data frame
# set a counter to track the episode number
count <- 0
# create an empty data frame with the column names we want
season3 <- data.frame(matrix(ncol = 9, nrow = 0))
x <- c('baker', 'signature', 'technical.rank', 'showstopper', 'episode',
       'signature.name', 'technical.name', 'showstopper.name', 'season')
colnames(season3) <- x
# stringr is used to clean the challenge names out of the column headers
library(stringr)
# build for loop over the episode tables (tables 3 through 12 on the series 3 page)
for (episode in table3[3:12]) {
  ep <- data.frame(episode)
  count <- count + 1
  ep['episode'] <- count
  # the challenge names live in the column headers; strip the punctuation and the
  # leading challenge label to recover them
  signature_name <- str_replace_all(colnames(ep)[2], "[[:punct:]]", " ")
  ep['signature.challenge'] <- str_remove(signature_name, 'Signature.')
  technical_name <- str_replace_all(colnames(ep)[3], "[[:punct:]]", " ")
  ep['technical.challenge'] <- str_remove(technical_name, 'Technical.')
  showstopper_name <- str_replace_all(colnames(ep)[4], "[[:punct:]]", " ")
  ep['showstopper.challenge'] <- str_remove(showstopper_name, 'Showstopper.')
  ep['season'] <- '3'
  colnames(ep) <- x
  season3 <- rbind(season3, ep)
}
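To see what the header clean-up inside the loop is doing, here is a toy example with a made-up column name in the dot-mangled style that data.frame() produces from a Wikipedia header (the real names vary by episode):
# illustration only: a hypothetical mangled header
raw_name <- "Signature..Tea.Loaf."
cleaned <- str_replace_all(raw_name, "[[:punct:]]", " ")   # "Signature  Tea Loaf "
str_remove(cleaned, 'Signature.')                          # " Tea Loaf " (stray spaces and all)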
Running the loop creates a data frame called season3 with 9 columns and 76 observations – one row per baker per episode:
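A quick sanity check from the console (the exact row count will differ for other seasons, since the number of bakers varies):
# confirm the shape and peek at the assembled season data frame
dim(season3)    # expect 9 columns
head(season3)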
A loop like this can be built and run for each season, and the resulting data frames can then be combined into one master data frame containing the information from every episode table. The complete code and final dataset are in this Github repository.
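The repository has the full version, but the combining step is essentially just a row bind; here is a minimal sketch, assuming each season’s loop produced a data frame with the same nine columns (season4 and season5 are hypothetical names standing in for the other seasons):
# stack the per-season data frames into one master data frame
# (season4 and season5 are placeholders; they would come from running the same loop
#  on each season's Wikipedia page)
all_seasons <- dplyr::bind_rows(season3, season4, season5)
# base R equivalent: all_seasons <- rbind(season3, season4, season5)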
I’ll add more to this project soon!