Sofia’s 1st Annual Halloween Spooktacular: Candy Decision Trees

To be perfectly honest, the only reason I decided to write this was to have somewhere to put this absolutely adorable little pumpkin with the fall foliage boots. Every great artist must sacrifice for their creations.

The Candy Ranking Dataset

Last fall, as I browsed Kaggle searching for datasets to use for a class assignment, I found The Ultimate Halloween Candy Power Ranking. Someone had created a website that randomly pits two Halloween candies against each other and asks visitors to choose which one they prefer. After 269,000 votes were collected, data on each candy's win percentage, along with descriptive information for each candy, was posted as a dataset. I've included the dataset below:


Most of the descriptive columns are yes or no flags for certain candy characteristics: whether the candy has chocolate, if it has nuts (“peanutyalmondy”), if it’s hard or soft, if it’s fruity, if it has caramel, if it has a crisped rice wafer (oddly specific), if it has nougat, if it’s in a bar form and if it comes in a pack with other candies (“pluribus”). There are also continuous columns for sugar percentile and price percentile compared to the rest of the candies in the dataset, but for the purposes of this post, I’m going to only focus on the flags.

An obvious question to ask of this dataset from a machine learning perspective is: can we predict a candy’s win percentage if we know its features? Asked another way: what features will predict a Halloween candy’s popularity?

Let’s do a bit of visualizing first. Here I’ve plotted average win percentage vs. feature for all 9 yes or no candy descriptors:

Many of the candy features seem to have a noticeable impact on a candy’s win percentage. When we compare the top and bottom candies, we see concrete examples of this.

All the top 10 candies are chocolate, while all the bottom 10 are not chocolate:

Clearly, a candy being chocolate would be a useful feature for predicting a candy’s win percentage. Candies with nuts are also more likely to be in the top 10 than the bottom 10, though not as extreme as chocolate vs. no chocolate:

Taken together, these charts indicate that creating a model using the candy descriptors to predict win percent could be possible. There is clearly a difference between the top performing candies and bottom performing candies that the descriptors are at least partially capturing.

Candy Decision Tree

My first idea when deciding what would be a fun model to use with the candy dataset was a decision tree. Decision trees aren’t usually the best performing models, but they are useful in an illustrative way, providing a kind of roadmap for how they make their predictions.

I think a decision tree in this situation could be thought of as functioning much like a person presented with a Halloween candy, trying to decide how much they want to eat it. During training, the tree will run through the features it's given – in this case, the candy descriptors. It will decide on one feature for its first decision – say, chocolate. After separating the candies into chocolate or not chocolate, it will then choose a feature for each of the resulting groups and split again, and so on, until the candies have been separated into some number of groups, each with a distinct combination of descriptors. Each of these groups will then have a win percentage prediction associated with it based on the candies in the training dataset that fall within that group. This number can then be used to assign win percentage predictions to new candies that have the same characteristics as candies in the training dataset.

This process reminds me of when my brother and I would get back from trick or treating, dump all our candies into piles, organize them into categories and trade back and forth. The decision tree is dumping out the candies in the training dataset and organizing them into piles in its own way, and I enjoy that.

Decision tree parameters

There are many parameters that can be tuned in a decision tree, but I'm going to focus on max depth for now. Max depth refers to the number of successive decisions, or levels of splits, the tree is allowed to make. If max depth were set to 1, the decision tree would only make one decision. You would never really do this with a model meant for accurate predictions, but it's interesting to see what a decision tree would pick if it could only make one decision. Let's try it out, using Python:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv('candy-data.csv')

# Target: win percentage. Features: only the yes/no descriptor flags.
y = df['winpercent']
X = df[df.columns.drop(['winpercent','competitorname','sugarpercent','pricepercent'])]

# 50/50 train/test split because the dataset is small
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.5, random_state=130)

# A decision tree regressor limited to a single split
rf = DecisionTreeRegressor(random_state=122, max_depth=1)
rf.fit(X_train, y_train)

In the above code, we divide the candy data into features (X) and target (y), then split it into train and test datasets. Because the dataset is small, I used a 50/50 train/test split. Next, we fit a decision tree regressor to the training data. I've specified that the decision tree should only split once.

rf.score(X_train, y_train)
0.43092954635899416

The R² value for our model with one split is .431, meaning that about 43.1% of the variance in win percentage in the training data can be explained by the model. At face value that's not very good, but keep in mind that with a max depth of 1, the model is only using a single feature.

The beautiful thing about decision trees is that there are several built-in ways to view how the tree came to its decision. We can generate a chart of the decision tree as follows:

from matplotlib import pyplot as plt
from sklearn import tree

# Draw the fitted tree with the feature names filled in
fig = plt.figure(figsize=(20,20))
_ = tree.plot_tree(rf,
                   feature_names=list(X.columns),
                   filled=True)
plt.show()

The tree did indeed choose chocolate as the most useful feature for separating candies into a lower performing and a higher performing group. This visualization is great because it gives so much information, including the mean squared error (mse) for each stage in the decision process. We can also see the predicted value that the model assigns to each group. Candies that aren’t chocolate get a 42% win percentage, and chocolate candies get a 60% win percentage.
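Just to make that roadmap concrete, here's a quick example of using the fitted tree to score a hypothetical new candy. The only column name it relies on is chocolate, which we know is in the dataset; everything else is just reused from the code above:

# Build a one-row frame of all-zero flags, then flip on 'chocolate'
new_candy = pd.DataFrame([dict.fromkeys(X.columns, 0)])
new_candy['chocolate'] = 1

# With max_depth=1, this lands in the chocolate group, near its ~60% mean
print(rf.predict(new_candy))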

Let’s add one more split and see what happens:

rf = DecisionTreeRegressor(random_state=122, max_depth=2)
rf.fit(X_train, y_train)

We now get one more level of splits in the decision tree, resulting in 4 possible predictions based on candy features. If the candy is chocolate, the model will then look to see if it has nuts (peanutyalmondy). Candies that are both chocolate and have nuts get the highest predicted win percentage of 71%. If the candy isn’t chocolate, the model will look to see if it is fruity. Fruity candies that aren’t chocolate have a higher predicted win percentage than candies that are neither chocolate nor fruity.

We can add another layer to our decision tree:

rf = DecisionTreeRegressor(random_state=122, max_depth=3)
rf.fit(X_train, y_train)

At this point it’s a little hard to see the figure, but you get the idea. The model has added one more level of splits to the tree, and we now have 8 separate groups of predicted probabilities based on the candy features. In this model, the candies with the highest predicted win percentages are chocolate, “peanutyalmondy”, and bars, and the candies with the lowest predicted win percentage are hard, not fruity, not chocolatey candies.

If we score this model on the training data, we see that it explains much more of the variance than our model with only one split:

rf.score(X_train, y_train)
0.6322525569863932

The R² value for our model with a max depth of 3 is .632, meaning that about 63.2% of the variance in win percentage in the training data can be explained by the model. That's about 20 percentage points more variance explained than the first model with one split.

We can also check how our model will perform in predicting unseen data:

rf.score(X_test, y_test)
0.5348005970923211

Given the small size of the dataset, the fact that I was being lazy and not using cross-validation, and the tendency of decision trees to overfit, the model fits the training data much better than the test data. However, all things considered, it isn't so bad. The model was still able to explain more than half of the variance in win percentage for completely unseen candies.
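For what it's worth, here's a minimal sketch of what cross-validation could look like here (not part of the original analysis), just to get an estimate that depends less on one particular split:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated R^2 for the depth-3 tree on the full dataset
cv_tree = DecisionTreeRegressor(random_state=122, max_depth=3)
scores = cross_val_score(cv_tree, X, y, cv=5, scoring='r2')
print(scores.mean(), scores.std())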

Happy Halloween!

Matrix of Confusion: Tuning the Logistic Regression Prediction Threshold for Imbalanced Classes (the long way)

I must admit that lately I’ve found myself feeling like the brain trapped in the Matrix of Confusion™ (confusion matrix) pictured above. It’s been quite a busy little month or so. I got back from a wonderful and much needed vacation at the beginning of September and subsequently whiplashed straight into a new semester (and back to good old work). Of course, this isn’t just any run of the mill semester. It’s my second to last (or so) semester, which means I need to buckle down and actually come up with a thesis topic and proposal. Perhaps, if you’re familiar with the contents of this blog, you’ll see that this could be somewhat of a difficult task. I tend to have a lot of cute little ideas and interests related to my hobbies, but it’s hard to translate this into something that could be classified as social science research. I do have a substantial amount of legitimate research experience, but this is in the field of cognitive neuroscience (also not a social science).

Anywho, in times of trouble like this, sometimes it helps to give the old noggin a rest from the big questions and pick something small to figure out to use as a distraction 🙂

At the end of my last post I looked at the confusion matrix for the test predictions made by the logistic regression trained on tweets labeled pro-choice and pro-life. Examining the confusion matrix was much more useful in my opinion than simply looking at numbers for precision, recall, and F1, because it shows exactly how tweets are getting labeled in both classes:

                      Predicted pro-choice    Predicted pro-life
Actual pro-choice     135                     8
Actual pro-life       43                      19

Confusion matrix

The model predicted pro-choice tweets with a precision of 135/178 = 75.8%. It was slightly worse at correctly predicting pro-life tweets, with a precision of 19/27 = 70.4%. Our overall accuracy ends up being (135+19)/(135+19+43+8) = 75.1%.

To get the baseline accuracy to compare this accuracy to, we can take the overall percentage of the most abundant class (pro-choice tweets). This was 72.6%. So, theoretically, if instead of going through the trouble to train a model we had just guessed that all tweets were pro-choice, our accuracy would have been 72.6%. Our overall accuracy using the model was 75.1%, so overall we did better by about 3% 🥲.

                      Predicted pro-choice    Predicted pro-life
Actual pro-choice     143                     0
Actual pro-life       62                      0

Confusion matrix if the model classified all tweets as pro-choice

Ok… that doesn’t sound like we did amazingly… however, how you want to measure success really does depend on your perspective. In the scenario just described, we would be classifying none of the pro-life tweets correctly. Why would we want that?! Comparing precisions in this theoretical scenario to our model, we would predict pro-choice tweets worse by 100-75.8 = 24.2%, but we would also predict pro-life tweets better by 70.4-0 = 70.4%.

There is also the question of recall – the ratio of correctly classified “positive” tweets to total actual positive tweets (true positive rate), which we can look at from the pro-choice or the pro-life side. Our model has very good recall if considering pro-choice tweets to be positives, with a recall of 135/(135+8) = 94.4%. Only 5.6% of true pro-choice tweets were labeled pro-life. On the flip-side, the recall if we consider pro-life tweets to be positive was 19/(19+43)= 30.6%. This means that in fact, about 70% of pro-life tweets were predicted to be pro-choice. Not great.
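If you want to double-check the arithmetic, all of these numbers fall straight out of the confusion matrix counts:

import numpy as np

# rows = actual, columns = predicted; order: pro-choice, pro-life
cm = np.array([[135,  8],
               [ 43, 19]])

accuracy = cm.trace() / cm.sum()              # (135 + 19) / 205 ≈ .751
precision_choice = cm[0, 0] / cm[:, 0].sum()  # 135 / 178 ≈ .758
precision_life = cm[1, 1] / cm[:, 1].sum()    # 19 / 27 ≈ .704
recall_choice = cm[0, 0] / cm[0, :].sum()     # 135 / 143 ≈ .944
recall_life = cm[1, 1] / cm[1, :].sum()       # 19 / 62 ≈ .306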

ROC Curve

ROC curves plot true positive vs. false positive rate across thresholds for binary classifiers. At the end of my last blog post, I mentioned that we could potentially tweak the classification threshold for our model one way or the other to get better predictions for the pro-life group. Plotting an ROC curve can help visualize this. Again, we can look at this from the pro-choice or pro-life perspective (either one can be treated as “positive”).
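For reference, this is roughly how such a curve gets generated with scikit-learn. The names here are placeholders – clf for the fitted logistic regression from the last post, and X_test/y_test for the held-out tweets with pro-life coded as 1 (the "positive" class):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

probs = clf.predict_proba(X_test)[:, 1]          # predicted probability of pro-life
fpr, tpr, thresholds = roc_curve(y_test, probs)  # one (fpr, tpr) point per threshold
auc = roc_auc_score(y_test, probs)               # area under the curve

plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
plt.plot([0, 1], [0, 1], linestyle='--')         # the y = x "random guessing" line
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()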

Pro-Life ROC Curve

I’ve chosen to look at thresholds within a few percentage points of 50%. The ROC curve plots the false positive rate (in this case, the percent of real pro-choice tweets labeled pro-life) on the x axis, and the true positive rate (real pro-life tweets labeled pro-life) on the y axis. You can see that the .5 threshold lines up with the recall discussed a couple paragraphs ago. These numbers shift drastically with only a slight threshold change. If we chose .44 for the threshold, we would label over 75% of pro-life tweets correctly, as opposed to the 32.3% we get from the .5 threshold. Of course, this would also increase the amount of pro-choice tweets that are falsely labeled as pro-life, up from 5.6% to about 38%.

Pro-Choice ROC Curve

This is giving essentially the same information as the other ROC curve but in reverse. I find it useful to look from both perspectives.

We can use the area under the ROC curve to understand how well the model performs overall:

A thought experiment:

Say the model performed absolutely perfectly and predicted every tweet with 100% confidence. In that case, any threshold strictly between 0 and 1 would produce no false positives and no false negatives, so every one of those points would sit at (0,1) – the top left corner. (The curve's endpoints are always (0,0), where nothing is labeled positive, and (1,1), where everything is.) This would result in the ROC curve not being a curve at all, but a right angle, covering 100% of the area of the chart.

Say the model performs perfectly terribly, with its confidence completely unrelated to the actual label of each tweet. In that scenario, at any given threshold, a true pro-life tweet is just as likely to land above the cutoff as a true pro-choice tweet, so the true positive and false positive rates rise together. If there were infinite observations in the dataset and the model's predictions were perfectly random, the ROC curve would just be the line y = x, and would cover 50% of the area of the chart:

So, the closer the ROC curve bends toward the top left corner, the better the model.

Here’s my model colored in:

I find this kind of thing so useful to fundamentally understand what’s going on. We can see exactly how much better than completely random the model is predicting (darker green), and the corresponding area that the model isn’t predicting (red).

I added two lines at .726. These represent the baseline accuracy – that is, the accuracy we would get if we had classified every tweet as pro-choice, based on the overall percentage of pro-choice tweets in the dataset. We want the % of true pro-life tweets labeled as pro-choice (false positive rate, x axis) to fall below this line, and the % of true pro-choice tweets labeled as pro-choice (true positive rate, y axis) to fall above it. We can see that the classification thresholds that meet these criteria are .46, .47, .48, .49, .5, and barely .51. A reasonable argument could be made for choosing any of these thresholds.

For thoroughness’s sake, here is the shaded ROC curve when considering pro-life to be “positive.” I’ve changed the baseline accuracy threshold lines to use the overall percentage of pro-choice tweets in the dataset. You can see that the same thresholds are performing better than baseline again, .51, .5, .49, .48, .47 and .46.

The ROC curve is clearly useful in assessing the ratio of true-positives (recall) to false-positives. What about precision, though?

Another way to look at this is with a precision-recall curve:

Here we are plotting recall on the x axis and precision on the y axis – in this case, we consider pro-choice to be “positive.” This allows us to see the trade-off between the ratio of correctly identified pro-choice tweets to all real pro-choice tweets (recall), and the ratio of correctly identified pro-choice tweets to all tweets labeled pro-choice (precision). We can see that for most thresholds tested, both precision and recall are quite high for pro-choice tweets. Overall, the model does a good job of predicting pro-choice tweets as being pro-choice (recall) and a pretty high percentage of the tweets it classifies as being pro-choice actually are pro-choice (precision).
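Here's a rough sketch of how that curve comes together, again with placeholder names (clf, X_test, and y_test_choice, where pro-choice is coded as 1 for this view):

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

probs_choice = clf.predict_proba(X_test)[:, 1]   # predicted probability of pro-choice
precision, recall, thresholds = precision_recall_curve(y_test_choice, probs_choice)

plt.plot(recall, precision)
plt.axhline(0.726, linestyle='--')   # baseline precision = share of pro-choice tweets
plt.xlabel('Recall (pro-choice)')
plt.ylabel('Precision (pro-choice)')
plt.show()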

However, remember that the data was imbalanced to begin with: 72.6% of the tweets in the dataset are pro-choice. If we were to pick tweets randomly out of a hat, eventually we would get 72.6% pro-choice. This makes the model's precision in predicting pro-choice tweets somewhat less impressive. For most thresholds, the population of tweets predicted to be pro-choice is only slightly more pro-choice than random.

Let’s look at the precision-recall curve with pro-life as positives:

The precision-recall curve considering pro-life tweets as positives tells a very different story. In this scenario, our baseline accuracy to beat would be 27.4% (100 – 72.6%) representing the overall percentage of pro-life tweets in the dataset. We can see that at all thresholds besides 0, our model predicts pro-life tweets better than baseline, and for most thresholds, this is substantially better than baseline. That’s encouraging.

Recall is more of a problem from this perspective. If we want a threshold with more than 50% of actual pro-life tweets being labeled correctly as pro-life, we only have .46, .45, and .44 to choose from. However, if we chose one of these thresholds to improve pro-life recall, pro-life precision would drop, meaning that more actual pro-choice tweets would get labeled pro-life (false positives).

F1 – harmonic mean of precision and recall:

The F1 score is a way to assess precision and recall simultaneously. The formula is as follows:

F1 = 2 * (precision * recall)/(precision + recall)

F1 weighs precision and recall equally, with 0 as the worst score and 1 as the best score. You probably noticed that I labeled the precision-recall curves with corresponding F1 scores. Maybe the most logical way to select a threshold in our case would simply be to find the highest F1 score. Again, though, what we consider to be “best” depends on perspective, as the best F1 score from the pro-life perspective is not the best F1 score from the pro-choice perspective.
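As a sketch of what "pick the threshold with the best weighted F1" could look like in code (same placeholder names as above, with probs holding the predicted probability of pro-life and y_test coded with pro-life as 1):

import numpy as np
from sklearn.metrics import f1_score

thresholds = np.arange(0.40, 0.61, 0.01)
# weighted-average F1 across both classes at each candidate threshold
f1s = [f1_score(y_test, (probs >= t).astype(int), average='weighted') for t in thresholds]

best = thresholds[int(np.argmax(f1s))]
print(best, max(f1s))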

Conclusions?

What to take away from all this? Well… if I wanted to use my model to capture a wider swath of pro-life tweets, I think I would tweak the threshold to .46. At this threshold, recall increases to 66.1% from 30.6% – so we are classifying more than double the percentage of pro-life tweets correctly compared to before. Unfortunately, pro-life precision at this threshold drops to 56.2% from 70.4%. So just over half of the tweets labeled pro-life would actually be pro-life. That’s not ideal, but in this scenario I’d rather water-down the predicted pro-life tweet population than miss so many true pro-life tweets and have them get lost in the pro-choice predictions.

At .46 from the pro-choice perspective, precision is quite high at 84.1%, so a greater percentage of the tweets labeled as pro-choice are truly pro-choice. The pro-choice recall takes a hit, though, going from the very very good 94.4% to 77.6% – meaning only 77.6% of all pro-choice tweets are labeled correctly. This is only slightly better than baseline.
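Mechanically, using a threshold other than .5 just means thresholding predict_proba yourself instead of calling predict (placeholder names again):

from sklearn.metrics import confusion_matrix

prob_life = clf.predict_proba(X_test)[:, 1]   # predicted probability of pro-life
preds_46 = (prob_life >= 0.46).astype(int)    # 1 = pro-life, 0 = pro-choice

print(confusion_matrix(y_test, preds_46))     # how the counts shift at the new threshold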

In the end, going with the threshold with the best weighted average F1 score across classes would probably make sense in almost all cases. A boring end to a boring post 🙂

Gender and Race in the Tech Industry – Analysis of Bias in Compensation

As part of my coursework for the QMSS MA program at Columbia, I designed a hierarchical regression model to analyze salary data from levels.fyi, focusing specifically on the significance of race and gender variables in predicting total annual compensation. I wrote the full project in R markdown and included it below, along with a much shorter summary in this post. I hope you enjoy 🙂


Background – Current Gender and Wage gaps in the US

In their 2021 Gender and Pay Gap Report, which analyzes pay disparities in the US across all industries via crowdsourced data, PayScale found that, without adding any control variables, women make 82¢ for every dollar earned by men. After adding control variables, women made 98¢ for every dollar earned by men, leaving a 2% difference attributable purely to discrimination based on gender.

PayScale’s findings on the racial wage gap show that, with or without controlling for demographics, both men and women of most races earn less than white men. Interestingly, when controlling for external factors, Asian men and women earn more than any other group.

Data and Research Design

Levels.fyi is a website founded in 2017 as a place for tech industry professionals around the world to anonymously share detailed compensation information. In 2020, levels.fyi began collecting race, gender, and education information from users along with salary information.

Using data from levels.fyi, I asked the question: can any of the variance in compensation in the tech industry be explained by racial and gender differences? If so, how much of this variance can be attributed to differences in years of experience, job title, educational attainment, and cost of living between genders and racial groups?

My dependent variable was total annual compensation, with gender, race, education, total years of experience, years at the current company and cost of living index as independent variables.

My sample came from a comprehensive dataset of scraped salary postings from levels.fyi. I limited my analysis to jobs in the US and removed NA values for the target independent variables. I also removed records with total yearly compensation equal to 0. I joined this data to a separate table with cost of living index values by US state.

Hierarchical Regression Model

Hierarchical (sequential) multiple regression was used to assess whether gender and/or race would contribute any significant additional explanatory power to the prediction of total annual compensation beyond that of the control variables. The equations for each step of the hierarchical regression model are below:

Model 2a (control variables only): 

ln(Total Annual Compensation) = β0 + β1(Years of Experience) + β2(Years at Company) + β3(Education) + β4(Title) + β5(Index)

Model 2b (control variables + gender): 

ln(Total Annual Compensation) = β0 + β1(Years of Experience) + β2(Years at Company) + β3(Education) + β4(Title) + β5(Index) + β6(Gender)

Model 2c (control variables + gender + race): 

ln(Total Annual Compensation) = β0 + β1(Years of Experience) + β2(Years at Company) + β3(Education) + β4(Title) + β5(Index) + β6(Gender) + β7(Race)
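The original analysis was written in R, but for consistency with the rest of this blog, here is a rough Python/statsmodels sketch of the same three nested models. The file and column names are placeholders and would need to match the actual levels.fyi extract:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Placeholder file and column names -- adjust to the actual levels.fyi extract
df = pd.read_csv('levels_fyi_us.csv')

controls = ('np.log(totalyearlycompensation) ~ yearsofexperience + yearsatcompany'
            ' + C(education) + C(title) + cost_of_living_index')

m2a = smf.ols(controls, data=df).fit()                              # Model 2a: controls only
m2b = smf.ols(controls + ' + C(gender)', data=df).fit()             # Model 2b: + gender
m2c = smf.ols(controls + ' + C(gender) + C(race)', data=df).fit()   # Model 2c: + gender + race

# F tests for the added explanatory power at each step
print(anova_lm(m2a, m2b))
print(anova_lm(m2b, m2c))
print(m2a.rsquared_adj, m2b.rsquared_adj, m2c.rsquared_adj)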

Results

The adjusted R^2 value for model 2a was .410, meaning that 41% of the variance in total annual compensation in the sample could be explained by the control variables alone.

After controlling for years of experience, years at the company, education, job title and cost of living index, being female was associated with a 7% lower total yearly compensation on average compared to being male. The addition of gender in model 2b added .2% of explanatory power overall. An F test between models 2a and 2b confirmed that this was a significant increase in explanatory power.

In the final step of the hierarchical regression, after controlling for years of experience, years at the company, education, job title, cost of living index and gender, the only group that differed significantly from White posters at the p < .001 threshold in total annual compensation was Black posters, who were compensated on average 6% less annually. The difference in total annual compensation between White and Asian posters was significant at the p < .01 threshold, with Asian posters making on average 2% more annually than White posters. The addition of race added only .1% more explanatory power to the overall model. An F test between models 2b and 2c confirmed that this was a significant increase in explanatory power.

Discussion

The directions of all coefficients in the final model agree with PayScale’s 2021 analysis. By and large, the magnitudes of the coefficients for the race variables agreed with PayScale’s analysis as well. One notable difference between my analysis and PayScale’s is the magnitude of the coefficient for Gender [Female]. In the controlled model (Model 2c), the coefficient for Gender [Female] of -.07 is considerably larger in magnitude than what would have been expected based on the 2% difference in pay that PayScale found between men and women across industries. This suggests that there may be a larger degree of gender-based pay discrimination in the tech industry than in other industries. Perhaps some of this effect can be explained by debunked yet pervasive stereotypes that women naturally have less ability in quantitative disciplines. This is a fascinating area for further research.

There was a large drop in coefficient magnitude for the dummy-coded race and gender variables after controlling for years of experience, years at the company, education, job title and cost of living. This drop was especially large for the Race [Black] coefficient, which went from -.17 to -.06 – a difference of 11 percentage points. This suggests that there is important information contained in the control variables that should be explored further. Systemic differences between racial groups and genders in educational attainment, job title, years of experience and tenure at a company would all affect total annual compensation. If these mediating factors are not addressed along with outright discrimination, progress toward financial parity for demographic groups that have historically been excluded from the tech industry will be severely slowed.

Ultimately, there remains a significant amount of variance in total annual compensation that cannot be explained by any of the control variables, particularly for Black tech workers and for women. As the population of these groups rises in the industry, it is increasingly important to continue to analyze the biased systems and attitudes that contribute to this phenomenon.

Understanding ANOVA and Interaction Effects using Python

Analyze the effects of the fictional snail performance supplement Snail Fuel on olympic snail athletes!

Let’s pretend we’ve been hired by the fictional snail energy drink and performance supplement company Snail Fuel™ to conduct an analysis on the effectiveness of their product. They want to know if snails who drink their supplement can expect a significant increase in muscle mass. Anecdotally, Snail Fuel has heard that snails with red shells in particular have been able to make impressive gains using their product. If this can be proven true, Snail Fuel will change their marketing strategy accordingly.

We have set up a small clinical trial with 30 snail olympians split into 6 groups across two dimensions: shell color and placebo/Snail Fuel. Each snail has continued exercising as usual while drinking either a placebo fuel made of plain protein powder, or Snail Fuel. The snails were scored on % increase in muscle mass after one month.

We plan to analyze this data using a two-way ANOVA test – also known as an analysis of variance. The ultimate goal of an ANOVA is to explain variance in our sample and where the variance is coming from. In our case, in a two-way ANOVA, all variance can be explained from the following sources:

grand mean + row effect (first independent variable) + column effect (second independent variable) + interaction effect (interaction of first and second variable) + error

An ANOVA will be able to tell us how much of the variance in our sample can be attributed to each source, and whether that variance meets a threshold of significance.

Let’s generate a Pandas dataframe with the snail’s scores:

import pandas as pd

# % increase in muscle mass for each snail, by shell color and placebo (N) / Snail Fuel (Y)
yellow_placebo = [2, 4, 3, 1, 5]
yellow_fuel = [5, 7, 6, 6, 6]
blue_placebo = [5, 6, 4, 2, 8]
blue_fuel = [5, 6, 4, 7, 8]
red_placebo = [2, 7, 5, 2, 4]
red_fuel = [11, 13, 12, 14, 10]

data = []
for i in yellow_placebo:
    data.append((i, "yellow", "N"))
for i in yellow_fuel:
    data.append((i, "yellow", "Y"))
for i in blue_placebo:
    data.append((i, "blue", "N"))
for i in blue_fuel:
    data.append((i, "blue", "Y"))
for i in red_placebo:
    data.append((i, "red", "N"))
for i in red_fuel:
    data.append((i, "red", "Y"))

columns = ['Percent_Muscle_Increase', 'Shell_Color', 'Snail_Fuel']
snail_fuel_data = pd.DataFrame(data, columns=columns)
snail_fuel_data.head()

We can place each snail’s score in a 3×2 grid according to which subgroup the snail falls into:

         Placebo             Snail Fuel
Yellow   2, 4, 3, 1, 5       5, 7, 6, 6, 6
Blue     5, 6, 4, 2, 8       5, 6, 4, 7, 8
Red      2, 7, 5, 2, 4       11, 13, 12, 14, 10

Before we do any calculations, we can visualize our data using a box plot. This will show the mean and range of our dependent variable (percent muscle increase) broken into groups by the categorical variable placebo/snail fuel. We can color each point based on the shell color variable as well.

import matplotlib.pyplot as plt
import seaborn as sns

# Box plot of % muscle increase by placebo/Snail Fuel, with each snail's
# score overlaid and colored by shell color
plt.figure(figsize=(8,5))
colors = ["yellow", "blue", "red"]
ax = sns.boxplot(x='Snail_Fuel', y='Percent_Muscle_Increase', data=snail_fuel_data, color='white')
ax = sns.swarmplot(x='Snail_Fuel', y='Percent_Muscle_Increase', data=snail_fuel_data, hue='Shell_Color', palette=colors)
plt.show()

Let’s find the average % increase in muscle mass for each subgroup:

         Placebo                       Snail Fuel
Yellow   2, 4, 3, 1, 5 (mean = 3)      5, 7, 6, 6, 6 (mean = 6)
Blue     5, 6, 4, 2, 8 (mean = 5)      5, 6, 4, 7, 8 (mean = 6)
Red      2, 7, 5, 2, 4 (mean = 4)      11, 13, 12, 14, 10 (mean = 12)

Let’s plot only the means of our data:

means = snail_fuel_data.groupby(['Shell_Color', 'Snail_Fuel'], as_index=False).mean()

plt.figure(figsize=(8,5))
colors = ["blue", "red","goldenrod"]
ax = sns.swarmplot(x="Snail_Fuel", y='Percent_Muscle_Increase', data=means, hue ='Shell_Color', palette=colors, s = 7)
plt.plot( 'Snail_Fuel', 'Percent_Muscle_Increase', data=means[(means.Shell_Color == "red")], marker='', color='red', linewidth=2)
plt.plot( 'Snail_Fuel', 'Percent_Muscle_Increase', data=means[(means.Shell_Color == "blue")], marker='', color='blue', linewidth=2)
plt.plot( 'Snail_Fuel', 'Percent_Muscle_Increase', data=means[(means.Shell_Color == "yellow")], marker='', color='goldenrod', linewidth=2)

plt.show()

Main effects

We can already see that red snails who drank Snail Fuel have the highest average % gain in muscle mass. It also looks like for all shell colors, those who drank Snail Fuel gained more muscle. How can we know the extent that having a red shell increases Snail Fuel’s efficacy if it appears that snails of every shell color saw benefits?

First, we need to know exactly what effects are coming from what variables. How much of an effect does drinking Snail Fuel vs. the placebo have on muscle mass? How much of an effect does shell color have on muscle mass? These are known as the main effects for each variable. We can find this by calculating a grand mean (the average % increase of muscle mass for all snails in the study) and the individual row and column means. The main effect for each row or column will be its distance from the grand mean.
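These row, column, and grand means are easy to check in code from the dataframe we built earlier (not part of the original walkthrough, just a quick pandas sketch):

# Grand mean and row/column (main) effects, straight from the dataframe
grand_mean = snail_fuel_data['Percent_Muscle_Increase'].mean()   # 6.0

color_effects = (snail_fuel_data.groupby('Shell_Color')['Percent_Muscle_Increase'].mean()
                 - grand_mean)   # shell color main effects: blue -0.5, red +2.0, yellow -1.5
fuel_effects = (snail_fuel_data.groupby('Snail_Fuel')['Percent_Muscle_Increase'].mean()
                - grand_mean)    # placebo (N) / Snail Fuel (Y) main effects: -2.0 / +2.0

print(grand_mean)
print(color_effects)
print(fuel_effects)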

Interaction effects

         Placebo                       Snail Fuel                        Row means
Yellow   2, 4, 3, 1, 5 (mean = 3)      5, 7, 6, 6, 6 (mean = 6)          Row 1 mean: 4.5 (effect = -1.5)
Blue     5, 6, 4, 2, 8 (mean = 5)      5, 6, 4, 7, 8 (mean = 6)          Row 2 mean: 5.5 (effect = -.5)
Red      2, 7, 5, 2, 4 (mean = 4)      11, 13, 12, 14, 10 (mean = 12)    Row 3 mean: 8 (effect = +2)
         Col 1 mean: 4 (effect = -2)   Col 2 mean: 8 (effect = +2)       Grand mean = 6

The main effects for each variable will always add to zero.

We see that the main effect for having a red shell is +2, meaning a snail with a red shell, regardless of whether they were assigned the placebo or the Snail Fuel, would be expected to have a % increase in muscle mass 2 units greater than the average value for all snails.

Similarly, we see that the main effect for drinking Snail Fuel is also +2, meaning that regardless of shell color, if we know a snail was assigned to drink Snail Fuel, we would expect that snail’s % gain in muscle mass to be 2 units greater than the grand mean.

By this logic, we can theorize that a red-shelled snail who drank Snail Fuel would have a % increase in muscle mass 2 (red shell) + 2 (Snail Fuel) units greater than the grand mean.

Now that we know the main effects for our variables, we can calculate the expected means taking main effect into account for each subgroup by adding or subtracting the relevant variable effects from the grand mean:

         Placebo               Snail Fuel
Yellow   Expected mean: 2.5    Expected mean: 6.5    (effect = -1.5)
Blue     Expected mean: 3.5    Expected mean: 7.5    (effect = -.5)
Red      Expected mean: 6      Expected mean: 10     (effect = +2)
         (effect = -2)         (effect = +2)         Grand mean = 6

Let’s compare the expected means with main effects to the observed means we already calculated:

         Placebo                 Snail Fuel
Yellow   Observed mean: 3        Observed mean: 6
         Expected mean: 2.5      Expected mean: 6.5
         difference: +.5         difference: -.5
Blue     Observed mean: 5        Observed mean: 6
         Expected mean: 3.5      Expected mean: 7.5
         difference: +1.5        difference: -1.5
Red      Observed mean: 4        Observed mean: 12
         Expected mean: 6        Expected mean: 10
         difference: -2          difference: +2

The difference between the observed mean and the grand mean + main effects for each subgroup is known as the subgroup’s interaction effect. In other words, a subgroup’s interaction effect accounts for the variance in the observed mean for that subgroup that is not explained by the row or column effects. It can also be denoted as the row X column effect.

Now we can see clearly that there is a +2 unit positive interaction of red shell color and Snail Fuel. That means that snails with red shells who drank Snail fuel had an average of 2% greater muscle gains than we would have expected based on the shell color and placebo/Snail Fuel main effects alone.

Mathematically, interaction effects must cancel each other out – so a positive interaction of red shell color and Snail Fuel also means a negative interaction of red shell color and placebo. Accordingly, we find that yellow and blue-shelled snails have negative interaction effects for Snail Fuel. On average, blue snails and yellow snails saw 1.5 and .5 fewer percentage point gains in muscle mass than expected.
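If you'd rather let pandas do the bookkeeping, here's a small sketch that reproduces the interaction effects directly from the dataframe (it recomputes the grand mean and main effects so it stands on its own):

import numpy as np

# Interaction effect per cell = observed cell mean - (grand mean + row effect + column effect)
grand = snail_fuel_data['Percent_Muscle_Increase'].mean()
row_eff = snail_fuel_data.groupby('Shell_Color')['Percent_Muscle_Increase'].mean() - grand
col_eff = snail_fuel_data.groupby('Snail_Fuel')['Percent_Muscle_Increase'].mean() - grand

observed = snail_fuel_data.pivot_table(index='Shell_Color', columns='Snail_Fuel',
                                       values='Percent_Muscle_Increase', aggfunc='mean')
expected = grand + np.add.outer(row_eff.loc[observed.index].to_numpy(),
                                col_eff.loc[observed.columns].to_numpy())

interaction = observed - expected
print(interaction)   # the red / Snail Fuel cell comes out at +2, matching the table above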

Running a two-way ANOVA in Python using statsmodels

We can test for the significance of the main effects and interaction effects in our data by running a two-way ANOVA test using the statsmodels library. The code is quite simple and easy to read:

import statsmodels.api as sm
from statsmodels.formula.api import ols

#perform two-way ANOVA
model = ols('Percent_Muscle_Increase ~ C(Snail_Fuel) + C(Shell_Color) + C(Snail_Fuel):C(Shell_Color)', data=snail_fuel_data).fit()
sm.stats.anova_lm(model, typ=3)

We have specified that the ANOVA should test the relation of the dependent variable (percent muscle increase) to the categorical variables “Snail_Fuel” and “Shell_Color” in addition to their interaction “C(Snail_Fuel):C(Shell_Color).” We have also specified that the model should use a type 3 sum of squares. This is the sum of squares methodology used when an interaction effect is thought to be present in the data. For more information, see this helpful article from Towards Data Science.

The output of our test is below:

Interestingly and perhaps not surprisingly, the model tells us that the only significant effect in our ANOVA is the interaction of shell color and snail fuel. The p values (PR(>F)) for the snail fuel and shell color variables on their own would not meet the <.05 standard of significance.

Given these results, we can say that for the total population of snails in the study, there was no statistically significant difference in muscle increase between those who drank Snail Fuel and those who drank a placebo energy drink. However, there was a statistically significant interaction between shell color and drinking Snail Fuel, such that snails with red shells saw larger gains in muscle mass from Snail Fuel than snails with yellow or blue shells did.

If we wanted to dig even deeper, we can take a look at the model summary:

print(model.summary())

Digging into the model summary, we can discern that the only significant p value corresponds to a positive 4.583 t score for “C(Snail_Fuel)[T.Y]:C(Shell_Color)[T.Red].” That translates to a significant positive interaction of drinking snail fuel and having a red shell.

So it looks like the anecdotal evidence has proven true after all, at least in our very small sample.