ggplot’s geom_tile… not just for heat maps!

If you’ve read my previous blog post, you’ll know that I was able to convince a group of unsuspecting peers (ok, maybe one was suspecting) to create a final project Shiny app all about the Great British Bake Off (GBBO) using data I had previously scraped and wrangled from Wikipedia.

An unexpected challenge that came up while making the app was conceptualizing and creating visualizations that display descriptive information about the GBBO. Plotting two variables to show a relationship has been drilled so deeply into my head that it’s second nature, which makes it easy to think of bar charts or line graphs that compare measures of baker performance. Visually displaying information that isn’t trying to prove a point, on the other hand, is not drilled into a science major’s head at all. Simple information like each episode’s theme is also worth summarizing visually, and it is counterintuitively more difficult to plot in a way that adds value.

geom_tile

As someone who works with SQL tables endlessly day after day, I think I naturally gravitate toward organizing and understanding information in a grid. This, to me, is the beauty of geom_tile.

Before making this app, I had only ever used geom_tile to create heat maps. Based on my Google searching, this is probably what geom_tile is used for 99% of the time. A standard heat map compares two categorical variables, one on each axis, forming a grid. The squares in the grid are most often colored by the value of a continuous variable, like temperature, or perhaps something like age:

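library(dplyr)    # provides the %>% pipe
library(ggplot2)

# `age_by_episode` is a placeholder name for a data frame with one row per
# season/episode pair and the columns episode, season.x and avg.age
age_by_episode %>%
  ggplot(aes(as.factor(episode), as.factor(season.x), fill = avg.age)) +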
  geom_tile(color = "white",
            lwd = .5,
            linetype = 1) +
  scale_fill_gradient(low = "pink", high = "brown", name = "Avg. Age") +
  geom_text(aes(label = round(avg.age,0)), color="white", size=rel(4)) +
  xlab("Episode") + 
  ylab("Season") +
  ggtitle("Average Baker Age by Season and Episode") +
  theme_minimal() +
  theme(panel.grid.major = element_blank())

*I made this as a quick example but it’s actually quite interesting. You can see that for many seasons, there is a tendency for average age to decrease as a season progresses – meaning younger bakers make it further in the competition. You can also see that some seasons have older or younger bakers in general (for example season 10 vs. season 5).

Though heat maps are usually colored with a continuous variable, they also work if you make the continuous variable discrete, or use a categorical variable:

colorpal <- colorRampPalette(c('lightsalmon','royalblue'))(5)

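# `age_by_episode` is a placeholder name for the same data frame as before,
# with an added discrete agegroup column
age_by_episode %>%
  ggplot(aes(as.factor(episode), as.factor(season.x), fill = agegroup)) +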
  geom_tile(color = "white",
            lwd = .5,
            linetype = 1) +
  geom_text(aes(label = round(avg.age,0)), color="white", size=rel(4)) +
  xlab("Episode") + 
  ylab("Season") +
  ggtitle("Average Baker Age by Season and Episode") +
  theme_minimal() +
  theme(panel.grid.major = element_blank()) +
  scale_fill_manual(values = colorpal, name = "Age Group") 

*This version conveys the same information as the original, but it’s simplified, which may make the main point of the chart slightly easier to pick up.
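The agegroup column used for the fill isn’t created in the snippet above. One way to build a column like it, assuming avg.age is numeric, is base R’s cut(). The ages, break points and labels below are made up for illustration and are not the ones behind the chart:

```r
# Hypothetical average ages, standing in for the avg.age column
avg.age <- c(24.5, 31.2, 37.8, 42.0, 55.3)

# cut() turns a numeric vector into a factor of labelled intervals.
# right = FALSE closes each interval on the left: [20, 30), [30, 40), ...
agegroup <- cut(avg.age,
                breaks = c(20, 30, 40, 50, 60),
                labels = c("20s", "30s", "40s", "50s"),
                right = FALSE)

agegroup
```

Because the result is a factor, ggplot treats it as discrete, which is what scale_fill_manual expects.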

Plotting GBBO episode themes using geom_tile

In the Great British Bake Off, each episode in a season has a theme. Some of these themes are repeated season after season, such as Cake, Biscuits, and Bread, while other themes are unique to a season. Unique themes tend to fall into basic types, like countries (Japan, Germany, Denmark etc.) or ingredients (Chocolate, Dairy, Spice etc.). I wanted to find a way to show which themes are repeated across all seasons and which themes are unique, as well as when a theme occurs in each season, in one concise visual.

We will use a dataset that I scraped called weekthemes, which has four fields: season, week, week_theme and category. I coded the theme categories myself.

Let’s start with a basic geom_tile with seasons on one axis and episode theme on the other:

weekthemes %>%
  mutate(season = as.factor(season)) %>%
  
  ggplot(aes(season, week_theme)) + 
  geom_tile(size = .5) 

Not very beautiful without formatting, but a start. Notice that the themes are organized alphabetically. That’s not necessarily bad, but it’s also not the most logical way to organize themes. We could perhaps organize them by category, popularity, or something else. I know from watching GBBO that certain themes usually occur around the same time each season – for instance, cake week is almost always the first episode, while (obviously) the final is always last.

Wrangling week themes to order the y axis:

This is where good old data wrangling comes in. I find that it’s easiest to order categorical variables by sorting the variable’s factor levels before putting the data into ggplot.

First, let’s create a variable that ranks themes by the average week in which they occur, across all the seasons they appear in:

weekthemes %>%
  group_by(week_theme) %>%
  summarize(avgweek = mean(week)) %>%   # average week in which each theme occurs
  mutate(ranking = rank(avgweek, ties.method = 'first'))   # 1 = earliest on average
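Several themes can share the same average week, and ties.method = 'first' gives each of them its own integer rank anyway. A small sketch with made-up averages shows how that differs from rank()’s default:

```r
# Made-up average weeks; two themes tie at 5.5
avgweek <- c(Cake = 1, Spice = 5.5, Dairy = 5.5, Final = 10)

# Default (ties.method = "average"): tied values share a fractional rank
rank_default <- rank(avgweek)

# ties.method = "first": ties are broken by position, so ranks are distinct integers
rank_first <- rank(avgweek, ties.method = "first")

rank_default
rank_first
```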

Now that we have a ranking for each theme, we can join this back to the original weekthemes dataset, and sort the week_theme variable by our new ranking variable. Let’s also use geom_text to add a data label to our chart, so that we can see which week each theme occurs in each season:

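weekthemes %>%
  group_by(week_theme) %>%
  summarize(avgweek = mean(week)) %>%
  mutate(ranking = rank(avgweek, ties.method = 'first')) %>%
  # join the per-theme ranking back onto the full dataset
  inner_join(weekthemes, by = "week_theme") %>%
  # highest ranking becomes the first level, so the earliest themes end up at the top of the y axis
  mutate(week_theme = factor(week_theme, levels = unique(week_theme[order(desc(ranking))]))) %>%
  mutate(season = as.factor(season)) %>%
  
  ggplot(aes(season, week_theme)) + 
  geom_tile(size = .5) +
  # label each tile with the week in which the theme occurred
  geom_text(aes(label = week), color="white", size=rel(4)) 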

This already feels more organized, and the chart now tells a slightly different story. The eye is drawn to the top of the chart, where the viewer can clearly see that Cake week is almost always first. Moving down the chart, the themes that tend to occur in the middle of a season are less common, and more of them are unique to a single season. Pâtisserie is almost always second to last, and the Final is of course last.
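To see what the level-ordering step does on its own, here is a minimal sketch on a toy table. The rows are invented for illustration and are not the real weekthemes data:

```r
library(dplyr)

# Toy stand-in for weekthemes (invented rows)
toy <- tibble(
  season     = c(2, 2, 2, 3, 3, 3),
  week       = c(1, 2, 3, 1, 2, 3),
  week_theme = c("Cake", "Tarts", "Final", "Cake", "Bread", "Final")
)

ranked <- toy %>%
  group_by(week_theme) %>%
  summarize(avgweek = mean(week)) %>%
  mutate(ranking = rank(avgweek, ties.method = "first")) %>%
  inner_join(toy, by = "week_theme") %>%
  mutate(week_theme = factor(week_theme,
                             levels = unique(week_theme[order(desc(ranking))])))

# Levels run from the latest theme to the earliest
levels(ranked$week_theme)
```

A discrete y axis draws the first level at the bottom, so sorting by descending ranking is what puts the earliest theme (the last level) at the top of the chart.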

Adding another categorical variable:

To break up the large number of themes on the y axis, I want to add another level of organization with an additional category variable. I coded this variable myself for a few reasons. The first reason is that over the course of GBBO, certain themes have morphed into similar themes. Take for instance Pastry; in earlier seasons, there was no Pastry week. Seasons 2 and 3 had a Tarts episode and a Pies episode. In seasons 4 and 5, Tarts and Pies appear to have morphed into “Tarts and Pies,” and a separate Pastry episode was introduced. From season 6 on, only Pastry remains. Though these themes are all slightly different, they all deal with the same basic category of shortcrust pastries.

The second reason I decided to code a category variable was to group unusual themes together. As the seasons of GBBO have progressed, certain patterns in themes have arisen. Take for example country themes. There have been several seasons with a week devoted to the baking of one country, though the same country has never been repeated in multiple episodes. The country themes are conceptually related, though not technically the same. I wanted a way to group these themes together.

Let’s add Category as the fill variable to our chart:

mycolors <- colorRampPalette(c('#fa8ca9','#ffdbfa','lightgoldenrod','#cce0ff',"#d4b7a9"))(12)

weekthemes %>%
  group_by(week_theme) %>%
  summarize(avgweek = mean(week)) %>%
  mutate(ranking = rank(avgweek, ties.method = 'first')) %>%
  inner_join(weekthemes, by = "week_theme") %>%
  mutate(week_theme = factor(week_theme, levels= unique(week_theme[order(desc(ranking))]))) %>%
  mutate(season = as.factor(season)) %>%
  
  ggplot(aes(season, week_theme, fill=category)) + 
  geom_tile(size = .5) +
  geom_text(aes(label = week), color="white", size=rel(4)) +
  scale_fill_manual(values = mycolors, name = "Category") 

Immediately, coloring the tiles by category adds another level of structure to the chart. The chart also invites more exploration, as viewers can now choose to examine themes across seasons by category, looking for patterns, similarities or differences.

Ultimately, for this chart, I decided to change the sorting of the themes to go by theme category, to emphasize each category more than individual themes. I accomplished this by changing the ranking variable to group by category rather than theme:

weekthemes %>%
  group_by(category) %>%
  summarize(avgweek = mean(week)) %>%
  mutate(ranking = rank(avgweek, ties.method = 'first')) %>%
  inner_join(weekthemes, by = "category") %>%
  mutate(week_theme = factor(week_theme, levels= unique(week_theme[order(desc(ranking))]))) %>%
  mutate(season = as.factor(season)) %>%
  
  ggplot(aes(season, week_theme, fill=category)) + 
  geom_tile(size = .5) +
  geom_text(aes(label = week), color="white", size=rel(4)) +
  scale_fill_manual(values = mycolors, name = "Category") 

Sorting the themes by category so that all themes in the same category sit next to each other lets viewers see much more easily which themes fit into which category, while still giving a sense of which themes tend to occur at specific times in a season. I preferred this sorting method, but there is no right or wrong answer here: it would have been equally valid to keep the theme-level ordering from before.
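The reason this keeps a category together is that every theme in a category now shares one ranking value, so ordering by it places those themes side by side. A small check on invented rows (again, not the real weekthemes data):

```r
library(dplyr)

# Toy stand-in for weekthemes with a category column (invented rows)
toy <- tibble(
  season     = c(2, 2, 2, 2, 3, 3, 3, 3),
  week       = c(1, 2, 3, 4, 1, 2, 3, 4),
  week_theme = c("Cake", "Tarts", "Pies", "Final", "Cake", "Japan", "Pastry", "Final"),
  category   = c("Cake", "Pastry", "Pastry", "Final", "Cake", "Country", "Pastry", "Final")
)

by_category <- toy %>%
  group_by(category) %>%
  summarize(avgweek = mean(week)) %>%
  mutate(ranking = rank(avgweek, ties.method = "first")) %>%
  inner_join(toy, by = "category") %>%
  mutate(week_theme = factor(week_theme,
                             levels = unique(week_theme[order(desc(ranking))])))

# One row per theme, listed in factor-level order with its category alongside
by_category %>%
  distinct(week_theme, category) %>%
  arrange(week_theme)
```

In the printed table the three Pastry-category themes appear on consecutive rows, which is the adjacency the final chart relies on.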

Formatting and refining

The only thing left is to refine the formatting of the chart. There is truly no limit here, but for readability I made the following changes to get my final product:

mycolors <- colorRampPalette(c('#fa8ca9','#ffdbfa','lightgoldenrod','#cce0ff',"#d4b7a9"))(12)

weekthemes %>%
  group_by(category) %>%
  summarize(avgweek = mean(week)) %>%
  mutate(ranking = rank(avgweek, ties.method = 'first')) %>%
  inner_join(weekthemes, by = "category") %>%
  mutate(week_theme = factor(week_theme, levels= unique(week_theme[order(desc(ranking))]))) %>%
  mutate(season = as.factor(season)) %>%
  
  ggplot(aes(season, week_theme, fill=category)) + 
  geom_tile(color = 'gray20', size = .5) +
  scale_fill_manual(values = mycolors, name = "Category") +
  scale_x_discrete(position = "top",
                   labels=c("2" = "S2", "3" = "S3",
                            "4" = "S4", "5" = "S5",   
                            "6" = "S6", "7" = "S7", 
                            "8" = "S8", "9" = "S9", 
                            "10" = "S10", "11" = "S11", 
                            "12" = "S12")) +   
  labs(color = "Category") +
  geom_text(aes(label = week), color="black", size=rel(5)) +
  xlab("") + 
  ylab("") +
  ggtitle("Great British Bake Off Week Themes Season by Season Comparison") +
  theme(panel.grid.major.y = element_line(color = "#edd99f"),
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_line(),
        panel.border = element_rect(fill=NA,color="white", size=0.5, linetype="solid"),
        axis.line = element_blank(),
        axis.ticks = element_blank(),
        panel.background = element_rect(fill="white"),
        plot.background = element_rect(fill="white"),
        legend.text = element_text(size=12),
        legend.title = element_text(size=14),
        title = element_text(size =14),
        axis.text = element_text(color="black", size=14))

It might not be the prettiest possible version, but I was going for complete readability here.
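The named vector passed to labels in the code above lists every season by hand. As a possible alternative (not what this chart uses), scale_x_discrete() also accepts a function for labels, which prefixes each season with "S" however many seasons there are. The toy data frame and the season_label helper are invented for this sketch:

```r
library(ggplot2)

# Toy data: season stored as a factor, as in the post
toy <- data.frame(season     = factor(c(2, 3, 10)),
                  week_theme = c("Cake", "Bread", "Final"))

# Hypothetical helper: turns the break values "2", "3", ... into "S2", "S3", ...
season_label <- function(x) paste0("S", x)

p <- ggplot(toy, aes(season, week_theme)) +
  geom_tile(color = "gray20") +
  # labels can be a function that receives the breaks and returns the labels
  scale_x_discrete(position = "top", labels = season_label)

p
```

The trade-off is flexibility: a named vector lets you relabel individual seasons however you like, while the function applies one rule to all of them.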

In conclusion

geom_tile is an incredibly versatile tool to plot much more than just heat maps. I had a lot of fun hacking it to create this informative chart, and will certainly use it again for future charting needs!

Code and data used in this post can be found here.