Gender and Race in the Tech Industry – Analysis of Bias in Compensation

As part of my coursework for the QMSS MA program at Columbia, I designed a hierarchical regression model to analyze salary data from levels.fyi, focusing specifically on the significance of race and gender variables in predicting total annual compensation. I wrote the full project in R markdown and included it below, along with a much shorter summary in this post. I hope you enjoy 🙂


Background – Current Gender and Wage gaps in the US

In their 2021 Gender Pay Gap Report, which analyzes pay disparities in the US across all industries via crowdsourced data, PayScale found that, without any control variables, women make 82¢ for every dollar earned by men. After adding control variables, women made 98¢ for every dollar earned by men, leaving a 2% gap that the controls cannot explain and that is commonly attributed to gender-based discrimination.

PayScale’s findings on the racial wage gap show that, with or without demographic controls, men and women of most racial groups earn less than white men. Interestingly, when controlling for external factors, Asian men and women earn more than any other group.

Data and Research Design

Levels.fyi is a website founded in 2017 as a place for tech industry professionals around the world to anonymously share detailed compensation information. In 2020, levels.fyi began collecting race, gender, and education information from users along with salary information.

Using data from levels.fyi, I asked the question: can any of the variance in compensation in the tech industry be explained by racial and gender differences? If so, how much of this variance can be attributed to differences in years of experience, job title, educational attainment, and cost of living between genders and racial groups?

My dependent variable was total annual compensation, with gender, race, education, total years of experience, years at the current company and cost of living index as independent variables.

My sample came from a comprehensive dataset of scraped salary postings from levels.fyi. I limited my analysis to jobs in the US and removed NA values for the target independent variables. I also removed records with total yearly compensation equal to 0. I joined this data to a separate table with cost of living index values by US state.

Hierarchical Regression Model

Hierarchical multiple regression was used to assess whether gender and/or race would contribute any significant additional explanatory power to the prediction of total annual compensation beyond that of the control variables. The equations for each step of the hierarchical regression model are below:

Model 2a (control variables only): 

ln(Total Annual Compensation) = β0 + β1(Years of Experience) + β2(Years at Company) + β3(Education) + β4(Title) + β5(Index)

Model 2b (control variables + gender): 

ln(Total Annual Compensation) = β0 + β1(Years of Experience) + β2(Years at Company) + β3(Education) + β4(Title) + β5(Index) + β6(Gender)

Model 2c (control variables + gender + race): 

ln(Total Annual Compensation) = β0 + β1(Years of Experience) + β2(Years at Company) + β3(Education) + β4(Title) + β5(Index) + β6(Gender) + β7(Race)
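For context, the skeleton of this setup in R looks roughly like the sketch below. This is a minimal sketch, not the original R Markdown, and the data frame and column names (salaries, total_comp, and so on) are hypothetical stand-ins:

#Hedged sketch: fit the three nested models on a hypothetical "salaries" data frame
model_2a <- lm(log(total_comp) ~ yrs_experience + yrs_at_company +
                 education + title + col_index, data = salaries)
model_2b <- update(model_2a, . ~ . + gender)
model_2c <- update(model_2b, . ~ . + race)

#F tests for the additional explanatory power added at each step
anova(model_2a, model_2b)
anova(model_2b, model_2c)

#Adjusted R^2 at each step
sapply(list(model_2a, model_2b, model_2c), function(m) summary(m)$adj.r.squared)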

Results

The adjusted R^2 value for model 2a was .410, meaning that 41% of the variance in total annual compensation in the sample could be explained by the control variables alone.

After controlling for years of experience, years at the company, education, job title and cost of living index, being female was associated with a 7% lower total yearly compensation on average compared to being male. Adding gender in model 2b increased the variance explained by 0.2 percentage points overall, and an F test between models 2a and 2b confirmed that this was a significant increase in explanatory power.

In the final step of the hierarchical regression, after controlling for years of experience, years at the company, education, job title, cost of living index and gender, the only group that differed significantly from White posters in total annual compensation at the p < .001 threshold was Black posters, who were compensated on average 6% less annually. The difference in total annual compensation between White and Asian posters was significant at the p < .01 threshold, with Asian posters making on average 2% more annually than White posters. Adding race increased the variance explained by only 0.1 percentage points, and an F test between models 2b and 2c confirmed that this was a significant increase in explanatory power.

Discussion

The directions of all coefficients in the final model agree with PayScale’s 2021 analysis, and by and large the magnitudes of the race coefficients agree with PayScale’s analysis as well. One notable difference between my analysis and PayScale’s is the magnitude of the coefficient for Gender [Female]. In the controlled model (Model 2c), the Gender [Female] coefficient of -.07 is much larger in magnitude than would be expected from the 2% controlled pay gap PayScale found between men and women across industries. This suggests that there may be a larger degree of gender-based pay discrimination in the tech industry than in other industries. Perhaps some of this effect can be explained by debunked yet pervasive stereotypes that women naturally have less ability in quantitative disciplines. This is a fascinating area for further research.

There was a large drop in coefficient magnitude for the dummy-coded race and gender variables after controlling for years of experience, years at the company, education, job title and cost of living. The drop was especially large for the Race [Black] coefficient, which went from -.17 to -.06 – a reduction of 11 percentage points in the estimated pay gap. This suggests that there is important information contained in the control variables that should be explored further. Systemic differences between racial groups and genders in educational attainment, job title, years of experience and tenure at a company all affect total annual compensation. If these mediating factors are not addressed along with outright discrimination, progress toward financial parity for demographic groups that have historically been excluded from the tech industry will be severely slowed.

Ultimately, there remains a significant amount of variance in total annual compensation that cannot be explained by any of the control variables, particularly for Black tech workers and for women. As the population of these groups rises in the industry, it is increasingly important to continue to analyze the biased systems and attitudes that contribute to this phenomenon.

Bakeoff Data Collection Part 2

Pivoting Data in R

I left off the last GBBO data collection post with code on GitHub to generate a dataset that combines every episode table from seasons 2-11. This included information on what bakers made for the signature and showstopper bakes, as well as their performance in the technical challenges.

Some crucial information that the episode table doesn’t have is the results of the episode – who is crowned “Star Baker” and who is eliminated. This information is contained in a separate chart for each season that looks like this:

The elimination chart is set up differently from the episode table in a few ways. First, while there is still one row per baker, each column represents an episode, rather than there being a separate table for each episode. The chart also uses color to store information: while some of the colors and text are redundant, the favorite and least favorite baker information is stored only in the cell colors.

The goal:

  • extract all the data from this table, including colors
  • convert it to a shape that can be joined easily to the episodes table I made in the previous GBBO post

This will involve the dreaded… pivot.

Pivot Philosophy Lesson

I am constantly traumatized by pivoting with SQL at work. I don’t think pivoting in SQL is intuitive at all, and after some soul searching I think I know why. It’s a cliché, but at my core I am a visual learner who understands concepts best if I can picture them in my head spatially. When I first had to pivot data while working at the NIH I used SPSS, which gives you the option to pivot between “wide” and “long” data – this I understood very well because I could clearly imagine how I wanted the data to rearrange itself. Wide data stores information in more columns, and long data stores information in more rows. To convert between the two, you either gather multiple columns into one and (if you’re not aggregating) add additional rows for what was previously a single observation (wide to long), or you combine multiple rows into one and (again, if not aggregating) add additional columns (long to wide).

The way SQL is written is usually quite helpful and intuitive because it reads like a full sentence that tells a story about data. It’s great for picking and choosing from the data you have, but it gets clunky when you want to imagine something new. Because SQL describes data in this literary way, it makes sense to me that it breaks down in its utility when dealing with data as a spatial structure.

I think it is much simpler and more intuitive to pivot data in R (or Python, I assume – never tried) because you hand the whole table, as an object, to a single reshaping function. I greatly prefer pivoting in R with a simple function to building a fussy query in SQL with multiple CTEs. The function I have used for this project comes from the tidyr package, which is R royalty and truly needs no introduction or explanation. It’s a bonus that the function is called pivot_longer (there is also, naturally, pivot_wider) – so intuitive, so good. A quick toy example is below.
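To make the wide-versus-long idea concrete before we touch the real chart, here is a tiny made-up example (the data and column names are invented for illustration, not taken from the GBBO tables):

#Toy example: two bakers, two episodes, stored wide (one column per episode)
wide <- data.frame(baker = c("Ann", "Ben"),
                   episode1 = c("Star Baker", "Safe"),
                   episode2 = c("Safe", "Eliminated"))

#The same information stored long: one row per baker per episode
long <- tidyr::pivot_longer(wide,
                            cols = starts_with("episode"),
                            names_to = "episode",
                            names_prefix = "episode",
                            values_to = "result")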

Pivoting from Wide to Long with R

Our elimination chart data is wide. There is only one row per baker, and 10 columns, one for each episode. The episode data that I made in the previous GBBO post is long – rather than a single row for each baker and a column for each episode, there is a row for each baker and episode, and one episode column (instead of 10) that tells us which episode the data in each row refers to. To merge the elimination chart data with the episode data, I’ll need to change it from having one column for each episode to having one row for each baker and episode, with one episode column – essentially multiplying the number of rows by 10 and reducing the columns from 10 to 1. Then I’ll be able to add additional variables to hold the favorite, least favorite, star baker and eliminated baker data.

Let’s continue using season 11 as an example. First, we will pull the raw data into a table called “result”:

url = "https://en.wikipedia.org/wiki/The_Great_British_Bake_Off_(series_11)"
webpage <- read_html(url)
tables <- html_nodes(webpage,'table.wikitable')
tables <- html_table(tables, header = TRUE)

result <- data.frame(tables[2])

The shape of this table is identical to that of the chart, though it didn’t automatically assume the first row contains the column names. That’s fine – we’ll actually use this to our advantage when we pivot.

The code to pivot the result table is as follows:

result.pivot <- tidyr::pivot_longer(subset(result, Elimination.chart != 'Baker'),
                                    cols = starts_with("Elimination.chart."), 
                                    names_to = "episode", 
                                    names_prefix = "Elimination.chart.",
                                    values_to = "result")

I’ve used the pivot_longer function from tidyr and removed the extra row of column names from the data. The cols argument should list all the columns you want to pivot. I used the starts_with function to include only columns that begin with “Elimination.chart.”, so I’m pivoting every column except the first, which contains the baker names. The names_to argument creates a new column (in this case called “episode”) that will house the name of the column that each value was pivoted from. The names_prefix argument strips the prefix “Elimination.chart.” from those column names, leaving a nice clean value for the episode number. Finally, the values_to argument creates a new column (“result”) that houses the values from the pivoted columns.

We end up with this:

This looks nice! This looks sleek! This looks like the episodes table! One problem – this doesn’t contain any background color information 🙁

Now, HTML, XML, and markup languages in general are not my strong suit, but that is OK because I am great at googling. I was finally able to find and finesse some code from Stack Overflow to miraculously pull the color information into a table that is *close* to the same shape as our pivoted data:

colors <- webpage %>% html_nodes(xpath='//*[@id="mw-content-text"]/div/table[3]') %>% html_nodes('td')
colors <- bind_rows(lapply(xml_attrs(colors), function(x) data.frame(as.list(x), stringsAsFactors=FALSE)))

We get three columns called align, style and colspan. Each row seems to correspond to one td tag or cell in the data table, read from left to right. If you look at the “Peter” row in the original chart, you’ll see that the left-most cell has no background and is left aligned. That’s row 1 in the table. In week 1, Peter was Star Baker, which has a pale yellow background. That’s row 2 – with style “background:LemonChiffon;”. Peter was a favorite baker in weeks 4 and 5. You can see that there is one merged cell with a dark blue background for those weeks. This is represented in row 5 with style “background:Cornflowerblue;” and colspan = 2. This is a more succinct way of recording the html table data, but we’ll need to ungroup weeks 4 and 5 to get one row per episode. We’ll also have to remove the rows with no style value, as those don’t contain any episode data.

#in rows where colspan is NA, replace it with 1
colors$colspan <- replace(colors$colspan, is.na(colors$colspan), 1)

#makes a new table with colspan as numeric rather than string
dups <- colors %>% transform(colspan = as.numeric(colspan)) 

#uses the rep function to duplicate rows by the value in colspan - creating one row per episode
colorsduped <- dups[rep(seq_len(nrow(dups)), as.numeric(unlist(c(dups$colspan)))), ]

#keep only rows whose style contains the string "background" (%like% comes from data.table) - this removes the meaningless left-align rows
colorsduped <- colorsduped %>% filter(tolower(style) %like% "background")

#now we can make a new column called status that translates the cornflower blue and plum colors into "Favorite" and "Least Favorite" bakers of the week.
colorsduped <- colorsduped %>% mutate(status =
                                        case_when(tolower(style) %like% "cornflower" ~ "Favorite", 
                                                  tolower(style) %like% "plum" ~ "Least Favorite",
                                                  TRUE ~ "")
)

#reset index that got messed with by the duplications
rownames(colorsduped) <- 1:nrow(colorsduped)

Now we have one row per episode per baker. Even though there is no name variable in this table, the rows are in the same order as the pivoted text table, and we have reset the index, so we can join it to our first table on the row index using the merge function:

#merge result.pivot and colorsduped to get all information in one place
final <- merge(result.pivot, colorsduped[,c("style","status")], by = 0)

#sorting for the sake of explanation (the baker names are still in the Elimination.chart column)
final <- final[order(final$Elimination.chart, as.numeric(final$episode)),]

Beautiful! Now we can do this once for each season, combine all the seasons, and merge the columns we want (result and status) to the episodes dataset using the baker.season.episode variable. Complete code for this can be found on this Github page.
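The linked GitHub code has the real version of that last step, but roughly it could look like the sketch below (the object names here – season_results_list, episodes – are hypothetical stand-ins):

#Hedged sketch: stack every season's result/status table, build the join key,
#and attach result and status to the episodes dataset
all_results <- dplyr::bind_rows(season_results_list)
all_results$baker.season.episode <- paste(all_results$baker,
                                          all_results$season,
                                          all_results$episode, sep = ".")
episodes <- merge(episodes,
                  all_results[, c("baker.season.episode", "result", "status")],
                  by = "baker.season.episode", all.x = TRUE)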

Whipping up some Great British Bake Off Data

With R!

Following the “recipe” 😉 for blogs here on SofiaZaidman.com, I will begin with a preamble explaining why on earth I would spend so much of my free time collecting and combing through every crumb 😉 of data I can find on The Great British Bake Off (GBBO).

For about a year I commuted from Brooklyn to Columbia’s Morningside campus, and during that time I would download GBBO episodes to my phone and watch one on the subway in the morning and another in the evening. As a result, I have seen every season available on Netflix probably more than three times through. It is hands down my favorite TV show.

Data from the GBBO is ripe for analysis because the show treats baking like a sport. An obvious ML place to start is predicting the winner based on demographic information and performance in challenges. This has been done beautifully by Danny Antaki in a project called DeepBake. I absolutely love this and maybe one day will spend enough time learning neural nets to do this myself.

Another fun idea I had was to make a GBBO recipe generator based on past GBBO recipe names. This is something I have desperately wanted to do ever since this Buzzfeed article from the relatively early days of neural net memes, which I still think is one of the most hilarious things ever. Jacqueline Nolis (co-author of Build a Career in Data Science, co-host of the podcast of the same name that I highly recommend, and a fun-data-science-project kindred spirit) gave a talk on building an AI like this and has a really great tutorial on her Github page.

Before I can do any of these things of course, I have to source the GBBO data. I’ve started enjoying data wrangling in a very nerdy way, so I was excited when I noticed that the data used for DeepBake was pulled directly from each GBBO season’s Wikipedia page using a super cool R package called rvest. I thought I’d take a stab at it myself to learn a new scraping technique in R rather than Python.

Scraping Wikipedia tables in R using rvest

Inspect tables on Wikipedia

The first step in scraping Wikipedia tables is to inspect the Wikipedia page you’d like to scrape. I went to the GBBO master page and clicked on each season’s individual pages. Most season pages have one table per episode with four columns: Baker, Signature, Technical and Showstopper. Here is one example:

Now let’s try rvest

I adapted some all-purpose rvest code to pull all the tables on a Wikipedia page. The code is pretty straightforward and manages to use very few lines to extract the information we want. Basically, the html_nodes function retrieves all wikitables on the webpage we specify, and the html_table function converts that list of nodes into a list of tibbles, one per table.

# open rvest, data.table and some common packages that may come in handy
library("rvest")
library("data.table")
library("magrittr")
library("dplyr")
library("ggplot2")

# get season 3 tables from wikipedia
url <- "https://en.wikipedia.org/wiki/The_Great_British_Bake_Off_(series_3)"
webpage <- read_html(url)
table3nodes <- html_nodes(webpage,'table.wikitable')
table3 <- html_table(table3nodes, header = TRUE)

The list of nodes should look like this:

Once the list of nodes is converted to a list of tibbles, each table should look something like this:

Next I constructed a loop for each season to:

  • Pull and label individual episode data from each season’s page (have to check manually on Wikipedia to know which tables will have episode data)
  • Convert the data to data frames
  • Extract signature, technical and showstopper challenge names and save as new variables
  • Merge all episode data into one data frame

#set count
count <- 0

#create empty data frame with column names
season3 <- data.frame(matrix(ncol = 9, nrow = 0))
x <- c('baker','signature','technical.rank','showstopper','episode', 'signature.name','technical.name','showstopper.name', 'season')
colnames(season3) <- x

#load stringr for the string cleaning functions used below
library(stringr)

#build for loop
for (episode in table3[3:12]) {

  ep <- data.frame(episode)
  count <- count + 1
  ep['episode'] = count
  
  signature_name <- str_replace_all(colnames(ep[2]), "[[:punct:]]", " ")
  ep['signature.challenge'] = str_remove(signature_name, 'Signature.')
  
  technical_name <- str_replace_all(colnames(ep[3]), "[[:punct:]]", " ")
  ep['technical.challenge'] = str_remove(technical_name, 'Technical.')
  
  showstopper_name <- str_replace_all(colnames(ep[4]), "[[:punct:]]", " ")
  ep['showstopper.challenge'] = str_remove(showstopper_name, 'Showstopper.')
  
  ep['season'] = '3'
  
  colnames(ep) <- x
  
  season3 <- rbind(season3, ep)
}

This code will create a data frame called season3 with 9 columns and 76 observations – one per baker per episode:

A loop like this can be built and run for each season, and the resulting data frames can be combined into one master data frame that contains the information from every episode table – a rough sketch of that last step is below. The complete code and final dataset are in this Github repository.
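As a hedged sketch (the per-season object names are hypothetical; the linked repository has the real code), the combining step amounts to stacking the per-season frames:

#Hedged sketch: stack the per-season data frames into one master frame
all_seasons <- do.call(rbind, list(season2, season3, season4, season5, season6,
                                   season7, season8, season9, season10, season11))

#one row per baker per episode across seasons 2-11
dim(all_seasons)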

I’ll add more to this project soon!

Wrangling an Image Dataset (under the sea)

Follow my process to scrape and wrangle a dataset of Spongebob title card images for eventual GAN processing.

This post covers:

  • Scraping images from the web using Beautiful Soup
  • Converting image data to a format digestible by a Generative Adversarial Network (GAN) algorithm

Preface: I’m your biggest fan

In 1999 I was bombarded with Spongebob promos while watching CatDog, memorized the date and time of the premiere, and made sure I was seated in front of the TV right when it aired. As history can attest, it did not disappoint.

A scientist from birth, I was won over by the extent to which real-world biological details were used to inform Spongebob’s characters and plot lines. Only recently did I learn that the late creator, Stephen Hillenberg, was actually a trained marine biologist.

I was equally enamored with Spongebob’s visual style. The early seasons’ aesthetic elicits the same feeling for me as visiting San Francisco’s Fisherman’s Wharf at the turn of the millennium. It’s something akin to coloring the children’s menu at an old, dusty seafood restaurant with an assortment of crayons pulled at random from the Crayola 60 pack while wearing a Hawaiian print skort from Limited Too.

One of the most delightful aspects of Spongebob is the title cards. Many of these are seared in my mind to this day, paired with the sounds of a ukulele or slide guitar. Each card has a distinct personality – many in the same general style but each unique enough to remember for years.

I recently came across a Spongebob fan website with a comprehensive archive of title cards. I wondered if it might be possible to scrape the title card images into a dataset and wrangle them into a neural network friendly format. I’ve been interested in training a GAN to generate images for a while; is it possible to generate novel Spongebob title cards with this data?

In this first post of the Spongebob title card adventure, I will walk through scraping and wrangling image data into a format compatible with GAN algorithms.

Scraping

I used the Beautiful Soup library to scrape the title card image links and episode names from this website. Detailed information on using Beautiful Soup can be found in this blog post on scraping web data.

import requests
from bs4 import BeautifulSoup
import bleach
import pandas as pd 
url = "https://spongebob.fandom.com/wiki/List_of_title_cards#.23"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id="content")


images = results.find_all('img')
sources =[]
titles = []

for image in images:
    source = image.attrs['src']
    sources.append(source)

titles_raw = results.find_all(class_= 'lightbox-caption')

for title in titles_raw:
    title = bleach.clean(str(title), tags=[], strip=True).strip()
    titles.append(title)
            
sources = [x for x in sources if not x.startswith('data')]
sources = [x.replace('/scale-to-width-down/185', '') for x in sources]

spongedata = pd.DataFrame(list(zip(sources, titles)),columns = ['Image_Source','Title'])
spongedata

Downloading Images

Now that we have the image links, we can use the requests module to open and save the images. There are a number of ways to do this – this stack overflow thread was particularly helpful in finding the right method.

Essentially, requests is used to download the image from the url into memory. Once downloaded, the shutil module is used to properly store the image to a jpg file.

I imported a random title card below:

import matplotlib.pyplot as plt
import matplotlib.image as mpim
import shutil

response = requests.get(spongedata.Image_Source[390], stream=True) 
with open("sponge1.jpg", 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)
del response
img = mpim.imread('sponge1.jpg')
imgplot = plt.imshow(img)

Note that the image is approximately 1000×1400 pixels.

Here is another random title card:

response = requests.get(spongedata.Image_Source[211], stream=True) 
with open("sponge1.jpg", 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)
del response
img = mpim.imread('sponge1.jpg')
imgplot = plt.imshow(img)

This title card is approximately 1000×1750 pixels – not quite the same size as the first one we looked at. To avoid issues down the road, I decided to resize each title card to a uniform number of pixels (PIL’s resize takes a (width, height) tuple and returns a new image):

img = img.resize((300, 200))

CIFAR10 Image Structure

Now let’s take a look at the kind of image data that others have used with GANs successfully. I did some research and found an incredibly helpful tutorial on constructing a GAN using the CIFAR10 image database. The article is located here.

The CIFAR10 database consists of 60,000 32×32 color images stored as arrays and split into a 50,000 image train and 10,000 image test dataset.

The train and test datasets are stored as 4D arrays – for example, the shape of the train dataset is (50000, 32, 32, 3). The deepest dimension of the array holds the 3 RGB values assigned to each pixel, and the two middle dimensions are the 32 rows and 32 columns of pixels in each image. In total, the train set holds 50,000 of these 32×32 arrays of RGB values.

An image can be converted to a 3D array very simply by calling the numpy.array() function after opening the image file with the Python Imaging Library (PIL):

from PIL import Image
import numpy as np

image = Image.open('sponge1.jpg')
image = np.array(image)
image
image.shape

Now that we know how to get an image into an array format similar to those in the CIFAR10 dataset, we can write code to loop through our scraped data, convert all our title cards to arrays and combine them all into one 4D array like the CIFAR10 data:

title_cards = []
index = 0
for x in spongedata.Image_Source:
    response = requests.get(spongedata.Image_Source[index], stream=True) 
    with open("sponge1.jpg", 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    del response
    img = Image.open('sponge1.jpg')
    img = img.resize((300, 200))
    img = (np.array(img))
    title_cards.append(img)
    index = index + 1
    #print(index)
title_cards = np.array(title_cards)
title_cards.shape

Yay! We’ve managed to get a dataset of Spongebob title cards in something analogous to the CIFAR10 format that we can now try and feed into a GAN. Stay tuned for more Spongebob title card neural network posts!

Understanding ANOVA and Interaction Effects using Python

Analyze the effects of the fictional snail performance supplement Snail Fuel on olympic snail athletes!

Let’s pretend we’ve been hired by the fictional snail energy drink and performance supplement company Snail Fuel™ to conduct an analysis on the effectiveness of their product. They want to know if snails who drink their supplement can expect a significant increase in muscle mass. Anecdotally, Snail Fuel has heard that snails with red shells in particular have been able to make impressive gains using their product. If this can be proven true, Snail Fuel will change their marketing strategy accordingly.

We have set up a small clinical trial with 30 snail olympians split into 6 groups across two dimensions: shell color and placebo/Snail Fuel. Each snail has continued exercising as usual while drinking either a placebo fuel made of plain protein powder or Snail Fuel. The snails were scored on % increase in muscle mass after one month.

We plan to analyze this data using a two-way ANOVA (analysis of variance). The ultimate goal of an ANOVA is to explain the variance in our sample and identify where it comes from. In our two-way case, each snail’s score can be broken down into the following sources:

score = grand mean + row effect (first independent variable) + column effect (second independent variable) + interaction effect (interaction of the first and second variables) + error

An ANOVA will be able to tell us how much of the variance in our sample can be attributed to each source, and whether that variance meets a threshold of significance.

Let’s generate a Pandas dataframe with the snail’s scores:

import pandas as pd

yellow_placebo = [2, 4, 3, 1, 5]
yellow_fuel = [5, 7, 6, 6, 6]
blue_placebo = [5, 6, 4, 2, 8]
blue_fuel = [5, 6, 4, 7, 8]
red_placebo = [2, 7, 5, 2, 4]
red_fuel = [11, 13, 12, 14, 10]

data = []
for i in yellow_placebo:
    data.append((i,"yellow","N"))
for i in yellow_fuel:
    data.append((i, "yellow", "Y"))
for i in blue_placebo:
    data.append((i, "blue", "N"))
for i in blue_fuel:
    data.append((i, "blue","Y"))
for i in red_placebo:
    data.append((i, "red","N"))
for i in red_fuel:
    data.append((i, "red","Y"))
    
columns = ['Percent_Muscle_Increase','Shell_Color','Snail_Fuel']
snail_fuel_data = pd.DataFrame(data,columns=columns)
snail_fuel_data.head()

We can place each snail’s score in a 3×2 grid according to which subgroup the snail falls into:

         Placebo              Snail Fuel
Yellow   2, 4, 3, 1, 5        5, 7, 6, 6, 6
Blue     5, 6, 4, 2, 8        5, 6, 4, 7, 8
Red      2, 7, 5, 2, 4        11, 13, 12, 14, 10

Before we do any calculations, we can visualize our data using a box plot. This will show the distribution of our dependent variable (percent muscle increase) broken into groups by the categorical placebo/Snail Fuel variable. We can color each point by the shell color variable as well.

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8,5))
colors = ["yellow", "blue", "red"]
ax = sns.boxplot(x='Snail_Fuel', y='Percent_Muscle_Increase', data=snail_fuel_data, color='white')
ax = sns.swarmplot(x='Snail_Fuel', y='Percent_Muscle_Increase', data=snail_fuel_data, hue='Shell_Color', palette=colors)
plt.show()

Let’s find the average % increase in muscle mass for each subgroup:

         Placebo              Snail Fuel
Yellow   2, 4, 3, 1, 5        5, 7, 6, 6, 6
         mean = 3             mean = 6
Blue     5, 6, 4, 2, 8        5, 6, 4, 7, 8
         mean = 5             mean = 6
Red      2, 7, 5, 2, 4        11, 13, 12, 14, 10
         mean = 4             mean = 12

Let’s plot only the means of our data:

means = snail_fuel_data.groupby(['Shell_Color', 'Snail_Fuel'], as_index=False).mean()

plt.figure(figsize=(8,5))
colors = ["blue", "red","goldenrod"]
ax = sns.swarmplot(x="Snail_Fuel", y='Percent_Muscle_Increase', data=means, hue ='Shell_Color', palette=colors, s = 7)
plt.plot( 'Snail_Fuel', 'Percent_Muscle_Increase', data=means[(means.Shell_Color == "red")], marker='', color='red', linewidth=2)
plt.plot( 'Snail_Fuel', 'Percent_Muscle_Increase', data=means[(means.Shell_Color == "blue")], marker='', color='blue', linewidth=2)
plt.plot( 'Snail_Fuel', 'Percent_Muscle_Increase', data=means[(means.Shell_Color == "yellow")], marker='', color='goldenrod', linewidth=2)

plt.show()

Main effects

We can already see that red snails who drank Snail Fuel have the highest average % gain in muscle mass. It also looks like snails of every shell color who drank Snail Fuel gained more muscle than their placebo counterparts. How can we know the extent to which having a red shell increases Snail Fuel’s efficacy if it appears that snails of every shell color saw benefits?

First, we need to know exactly what effects are coming from what variables. How much of an effect does drinking Snail Fuel vs. the placebo have on muscle mass? How much of an effect does shell color have on muscle mass? These are known as the main effects for each variable. We can find this by calculating a grand mean (the average % increase of muscle mass for all snails in the study) and the individual row and column means. The main effect for each row or column will be its distance from the grand mean.

Interaction effects

         Placebo              Snail Fuel
Yellow   2, 4, 3, 1, 5        5, 7, 6, 6, 6          Row 1 mean: 4.5
         mean = 3             mean = 6               (effect = -1.5)
Blue     5, 6, 4, 2, 8        5, 6, 4, 7, 8          Row 2 mean: 5.5
         mean = 5             mean = 6               (effect = -.5)
Red      2, 7, 5, 2, 4        11, 13, 12, 14, 10     Row 3 mean: 8
         mean = 4             mean = 12              (effect = +2)
         Col 1 mean: 4        Col 2 mean: 8          Grand mean = 6
         (effect = -2)        (effect = +2)

The main effects for each variable will always add to zero.

We see that the main effect for having a red shell is +2, meaning a snail with a red shell, regardless of whether they were assigned the placebo or the Snail Fuel, would be expected to have a % increase in muscle mass 2 units greater than the average value for all snails.

Similarly, we see that the main effect for drinking Snail Fuel is also +2, meaning that regardless of shell color, if we know a snail was assigned to drink Snail Fuel, we would expect that snail’s % gain in muscle mass to be 2 units greater than the grand mean.

By this logic, we can theorize that a red-shelled snail who drank Snail Fuel would have a % increase in muscle mass 2 (red shell) + 2 (Snail Fuel) units greater than the grand mean.

Now that we know the main effects for our variables, we can calculate the expected means taking main effect into account for each subgroup by adding or subtracting the relevant variable effects from the grand mean:

         Placebo                Snail Fuel
Yellow   Expected mean: 2.5     Expected mean: 6.5     (effect = -1.5)
Blue     Expected mean: 3.5     Expected mean: 7.5     (effect = -.5)
Red      Expected mean: 6       Expected mean: 10      (effect = +2)
         (effect = -2)          (effect = +2)          Grand mean = 6

Let’s compare the expected means with main effects to the observed means we already calculated:

         Placebo                 Snail Fuel
Yellow   Observed mean: 3        Observed mean: 6
         Expected mean: 2.5      Expected mean: 6.5
         difference: +.5         difference: -.5
Blue     Observed mean: 5        Observed mean: 6
         Expected mean: 3.5      Expected mean: 7.5
         difference: +1.5        difference: -1.5
Red      Observed mean: 4        Observed mean: 12
         Expected mean: 6        Expected mean: 10
         difference: -2          difference: +2

The difference between the observed mean and the grand mean + main effects for each subgroup is known as the subgroup’s interaction effect. In other words, a subgroup’s interaction effect accounts for the variance in the observed mean for that subgroup that is not explained by the row or column effects. It can also be denoted as the row X column effect.

Now we can see clearly that there is a +2 unit positive interaction of red shell color and Snail Fuel. That means that snails with red shells who drank Snail Fuel saw, on average, muscle gains 2 percentage points greater than we would have expected based on the shell color and placebo/Snail Fuel main effects alone.

Mathematically, interaction effects must cancel each other out – so a positive interaction of red shell color and Snail Fuel also means a negative interaction of red shell color and placebo. Accordingly, we find that yellow and blue-shelled snails have negative interaction effects for Snail Fuel. On average, blue snails and yellow snails saw 1.5 and .5 fewer percentage point gains in muscle mass than expected.

Running a two-way ANOVA in Python using statsmodels

We can test for the significance of the main effects and interaction effects in our data by running a two-way ANOVA test using the statsmodels library. The code is quite simple and easy to read:

import statsmodels.api as sm
from statsmodels.formula.api import ols

#perform two-way ANOVA
model = ols('Percent_Muscle_Increase ~ C(Snail_Fuel) + C(Shell_Color) + C(Snail_Fuel):C(Shell_Color)', data=snail_fuel_data).fit()
sm.stats.anova_lm(model, typ=3)

We have specified that the ANOVA should test the relation of the dependent variable (percent muscle increase) to the categorical variables “Snail_Fuel” and “Shell_Color” in addition to their interaction “C(Snail_Fuel):C(Shell_Color).” We have also specified that the model should use a type 3 sum of squares. This is the sum of squares methodology used when an interaction effect is thought to be present in the data. For more information, see this helpful article from Towards Data Science.

The output of our test is below:

Interestingly and perhaps not surprisingly, the model tells us that the only significant effect in our ANOVA is the interaction of shell color and snail fuel. The p values (PR(>F)) for the snail fuel and shell color variables on their own would not meet the <.05 standard of significance.

Given these results, we can say that for the total population of snails in the study, there was no statistically significant difference in muscle increase between those who drank Snail Fuel and those who drank a placebo energy drink. However, there was a statistically significant interaction between shell color and drinking Snail Fuel, such that snails with red shells saw larger gains in muscle mass than snails with yellow and blue shells.

If we wanted to dig even deeper, we can take a look at the model summary:

print(model.summary())

Digging into the model summary, we can discern that the only significant p value corresponds to a positive 4.583 t score for “C(Snail_Fuel)[T.Y]:C(Shell_Color)[T.Red].” That translates to a significant positive interaction of drinking snail fuel and having a red shell.

So it looks like the anecdotal evidence has proven true after all, at least in our very small sample.