Bakeoff Data Collection Part 2
Pivoting Data in R
I left off the last GBBO data collection post with code on GitHub that generates a dataset combining every episode table from seasons 2-11. That dataset includes what the bakers made for the signature and showstopper bakes, as well as how they performed in the technical challenges.
Some crucial information that the episode table doesn’t have is the results of the episode – who is crowned “Star Baker” and who is eliminated. This information is contained in a separate chart for each season that looks like this:
The elimination chart is set up differently from the episode tables in a few ways. First, while there is still one row per baker, each column represents an episode, rather than each episode getting its own table. The chart also uses color to store information. Some of the colors are redundant with the cell text, but the favorite and least favorite baker information is stored only as color.
The goal:
- extract all the data from this table, including colors
- convert it to a shape that can be joined easily to the episodes table I made in the previous GBBO post
This will involve the dreaded… pivot.
Pivot Philosophy Lesson
I am constantly traumatized by pivoting with SQL at work. I don’t think pivoting in SQL is intuitive at all, and after some soul searching I think I know why. It’s a cliché, but at my core I am a visual learner who understands concepts best if I can picture them in my head spatially. When I first had to pivot data while working at the NIH, I used SPSS, which gives you the option to pivot between “wide” and “long” data – this I understood very well because I could clearly imagine how I wanted the data to rearrange itself. Wide data stores information in more columns; long data stores information in more rows. To convert wide data to long, you combine columns and, if you’re not aggregating, add rows, so what was a single observation becomes several. To convert long data to wide, you combine multiple rows into one and (again, if not aggregating) add columns.
The way SQL is written is usually quite helpful and intuitive because it reads like a full sentence that tells a story about data. It’s great for picking and choosing from the data you have, but it gets clunky when you want to imagine something new. Because SQL describes data in this literary way, it makes sense to me that its utility breaks down when you have to treat data as a spatial structure.
I think it is much simpler and more intuitive to use R (or Python, I assume – never tried) to pivot data, because it is a functional programming language that uses objects. I greatly prefer pivoting in R with a simple function to building a fussy query in SQL with multiple CTEs. The function I’ve used for this program is from the tidyr package, which is R royalty and truly needs no introduction or explanation. It’s a bonus that the function is called pivot_longer (there is also, naturally, pivot_wider) – so intuitive, so good.
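To make the wide/long distinction concrete before we touch any Bake Off data, here’s a toy example (the data frame and its values are made up purely for illustration):

library(tidyr)

# a wide table: one row per baker, one column per episode
wide <- data.frame(baker = c("Ann", "Bea"),
                   ep1 = c("safe", "out"),
                   ep2 = c("star baker", NA))

# pivot to long: one row per baker per episode
long <- pivot_longer(wide, cols = starts_with("ep"),
                     names_to = "episode", names_prefix = "ep",
                     values_to = "result")
# long now has three columns (baker, episode, result) and four rows instead of two

# and pivot_wider reverses the operation
wide.again <- pivot_wider(long, names_from = "episode",
                          names_prefix = "ep", values_from = "result")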
Pivoting from Wide to Long with R
Our elimination chart data is wide. There is only one row per baker, and 10 columns, one for each episode. The episode data that I made in the previous GBBO post is long – rather than a single row for each baker and a column for each episode, there is a row for each baker and episode, and one episode column (instead of 10) that tells us which episode the data in each row refers to. To merge the elimination chart data with the episode data, I’ll need to change it from having one column for each episode to having one row for each baker and episode, with one episode column – essentially multiplying the number of rows by 10 and reducing the columns from 10 to 1. Then I’ll be able to add additional variables to hold the favorite, least favorite, star baker and eliminated baker data.
Let’s continue using season 11 as an example. First, we’ll pull the raw data into a table called “result”:
url = "https://en.wikipedia.org/wiki/The_Great_British_Bake_Off_(series_11)"
webpage <- read_html(url)
tables <- html_nodes(webpage,'table.wikitable')
tables <- html_table(tables, header = TRUE)
result <- data.frame(tables[2])
The shape of this table is identical to that of the chart, though the parser didn’t automatically treat the first row as column names. That’s fine – we’ll actually use this to our advantage when we pivot.
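Assuming the page parses the same way for you, the data.frame() call munges the merged “Elimination chart” header into syntactically valid column names, which is where the prefix used in the pivot below comes from:

names(result)
# "Elimination.chart"  "Elimination.chart.1"  "Elimination.chart.2"  ...  "Elimination.chart.10"
# (the first holds the baker names; the numbered ones are episodes 1 through 10)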
The code to pivot the result table is as follows:
# pivot every episode column into rows, first subsetting out the stray row of column names
result.pivot <- tidyr::pivot_longer(subset(result, Elimination.chart != 'Baker'),
                                    cols = starts_with("Elimination.chart."),
                                    names_to = "episode",
                                    names_prefix = "Elimination.chart.",
                                    values_to = "result")

# rename the remaining first column so we can refer to it as "baker" later on
names(result.pivot)[names(result.pivot) == "Elimination.chart"] <- "baker"
I’ve used the pivot_longer function from tidyr and removed the extra row of column names from the data. The cols argument should list all the columns you want to pivot. I used the starts_with function to include only columns that begin with “Elimination.chart.”, so I’m pivoting every column except the first, which contains the baker names. The names_to argument creates a new column (in this case called “episode”) that houses the name of the column each value was pivoted from. The names_prefix argument tells pivot_longer to strip the prefix “Elimination.chart.” from those column names, leaving us with a nice clean value for the episode number. Finally, the values_to argument creates a new column (“result”) that houses the values from the pivoted columns.
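A quick sanity check on the pivot – assuming season 11’s 12 bakers and 10 episodes parsed cleanly, the row counts should work out like this:

nrow(result.pivot)           # expect 12 bakers x 10 episodes = 120
table(result.pivot$episode)  # expect 12 rows for each episode number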
We end up with a tidy three-column table – baker, episode, and result – with one row per baker and episode.
This looks nice! This looks sleek! This looks like the episodes table! One problem – this doesn’t contain any background color information 🙁
Now, HTML or XML or any markup language is not my strong suit, but that is OK because I am great at googling. I was finally able to find and finesse some code from Stack Overflow to miraculously pull the color information into a table that is *close* to the same shape as our pivoted data:
library(dplyr)  # bind_rows and the %>% pipe
library(xml2)   # xml_attrs

# grab every td cell in the elimination chart; note that the xpath index (table[3])
# counts all tables under the content div, not just wikitables, so it can differ
# from the wikitable index used earlier
colors <- webpage %>% html_nodes(xpath='//*[@id="mw-content-text"]/div/table[3]') %>% html_nodes('td')

# collect each cell's html attributes (align, style, colspan) into one data frame row
colors <- bind_rows(lapply(xml_attrs(colors), function(x) data.frame(as.list(x), stringsAsFactors=FALSE)))
We get three columns called align, style and colspan. Each row corresponds to one td tag, or cell, in the data table, read from left to right. If you look at the “Peter” row in the original chart, you’ll see that the left-most cell has no background and is left-aligned. That’s row 1 in the table. In week 1, Peter was Star Baker, which has a pale yellow background. That’s row 2 – with style “background:LemonChiffon;”. Peter was a favorite baker in weeks 4 and 5. You can see that there is one merged cell with a dark blue background for those weeks. This is represented in row 5 with style “background:Cornflowerblue;” and colspan = 2. This is a more succinct way of recording the HTML table data, but we’ll need to ungroup weeks 4 and 5 to get one row per episode. We’ll also have to remove the rows with no style value, as those don’t contain any episode data.
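Reconstructing Peter’s stretch of the table from that description, the first few rows look roughly like this (treat the exact layout as illustrative):

#   align  style                        colspan
# 1 left   <NA>                         <NA>      <- baker name cell
# 2 <NA>   background:LemonChiffon;     <NA>      <- Star Baker, week 1
# ...
# 5 <NA>   background:Cornflowerblue;   2         <- favorite baker, weeks 4-5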
library(data.table)  # for the %like% operator

# in rows where colspan is NA (an unmerged cell), treat it as spanning 1 column
colors$colspan <- replace(colors$colspan, is.na(colors$colspan), 1)

# make a new table with colspan as numeric rather than string
dups <- colors %>% transform(colspan = as.numeric(colspan))

# use the rep function to repeat each row colspan times - creating one row per episode
colorsduped <- dups[rep(seq_len(nrow(dups)), dups$colspan), ]

# keep only rows whose style contains the string "background" (this removes the
# meaningless left-align rows)
colorsduped <- colorsduped %>% filter(tolower(style) %like% "background")

# now we can make a new column called status that translates the cornflower blue
# and plum backgrounds into "Favorite" and "Least Favorite" bakers of the week
colorsduped <- colorsduped %>% mutate(status =
  case_when(tolower(style) %like% "cornflower" ~ "Favorite",
            tolower(style) %like% "plum" ~ "Least Favorite",
            TRUE ~ "")
)

# reset the row index, which got messed up by the duplications
rownames(colorsduped) <- 1:nrow(colorsduped)
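If the rep() indexing trick looks opaque, here’s what it does with a toy index vector – repeating each row number by its colspan is exactly how the merged weeks 4-5 cell becomes two episode rows:

rep(seq_len(3), times = c(1, 2, 1))
# [1] 1 2 2 3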
Now we have one row per baker per episode. Even though we don’t have a name variable in here, the rows are ordered the same way as the text table we pivoted earlier, and we’ve reset the index, so we can join the two tables on the row index using the merge function:
# merge result.pivot and colorsduped to get all the information in one place
# (by = 0 tells merge to join on row names, i.e. the row index)
final <- merge(result.pivot, colorsduped[, c("style", "status")], by = 0)

# sorting for the sake of explanation
final <- final[order(final$baker, as.numeric(final$episode)), ]
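Because that join is purely positional, it’s worth confirming the two tables really do have one row per baker per episode before trusting the result – a one-line assertion like this will catch a chart that parsed differently:

# the row-index join is only valid if the row counts match exactly
stopifnot(nrow(result.pivot) == nrow(colorsduped))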
Beautiful! Now we can do this once for each season, combine all the seasons, and merge the columns we want (result and status) onto the episodes dataset using the baker.season.episode variable. Complete code for this can be found on this GitHub page.
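As a minimal sketch of that last step – assuming every season’s page follows the same URL pattern and table layout as season 11 (they don’t always, which is what the full code on GitHub accounts for), and assuming the key is built by pasting baker, season, and episode together – the per-season wrapper could look like this:

library(rvest)
library(xml2)
library(dplyr)
library(tidyr)
library(data.table)

get_season_results <- function(series_num) {
  url <- paste0("https://en.wikipedia.org/wiki/The_Great_British_Bake_Off_(series_", series_num, ")")
  webpage <- read_html(url)

  # same pivot steps as above
  tables <- html_table(html_nodes(webpage, 'table.wikitable'), header = TRUE)
  result <- data.frame(tables[2])
  result.pivot <- pivot_longer(subset(result, Elimination.chart != 'Baker'),
                               cols = starts_with("Elimination.chart."),
                               names_to = "episode",
                               names_prefix = "Elimination.chart.",
                               values_to = "result")
  names(result.pivot)[names(result.pivot) == "Elimination.chart"] <- "baker"

  # same color steps as above
  colors <- webpage %>%
    html_nodes(xpath = '//*[@id="mw-content-text"]/div/table[3]') %>%
    html_nodes('td')
  colors <- bind_rows(lapply(xml_attrs(colors),
                             function(x) data.frame(as.list(x), stringsAsFactors = FALSE)))
  colors$colspan <- replace(colors$colspan, is.na(colors$colspan), 1)
  dups <- transform(colors, colspan = as.numeric(colspan))
  colorsduped <- dups[rep(seq_len(nrow(dups)), dups$colspan), ]
  colorsduped <- colorsduped %>%
    filter(tolower(style) %like% "background") %>%
    mutate(status = case_when(tolower(style) %like% "cornflower" ~ "Favorite",
                              tolower(style) %like% "plum" ~ "Least Favorite",
                              TRUE ~ ""))
  rownames(colorsduped) <- 1:nrow(colorsduped)

  # positional merge, tagged with the season number
  final <- merge(result.pivot, colorsduped[, c("style", "status")], by = 0)
  final$season <- series_num
  final
}

# stack seasons 2-11 and build the join key (the key format is an assumption)
all_results <- do.call(rbind, lapply(2:11, get_season_results))
all_results$baker.season.episode <- paste(all_results$baker, all_results$season,
                                          all_results$episode, sep = ".")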