Sofia’s 1st Annual Halloween Spooktacular: Candy Decision Trees

To be perfectly honest, the only reason I decided to write this was to have somewhere to put this absolutely adorable little pumpkin with the fall foliage boots. Every great artist must sacrifice for their creations.

The Candy Ranking Dataset

Last fall as I browsed Kaggle, searching for datasets to use for a class assignment, I found The Ultimate Halloween Candy Power Ranking. Someone had created a website that randomly pits two Halloween candies against each other and asks visitors to choose which one they prefer. After 269,000 votes were collected, a dataset was posted with each candy’s win percentage, as well as descriptive information for each candy. I’ve included the dataset below:


Most of the descriptive columns are yes or no flags for certain candy characteristics: whether the candy has chocolate, if it has nuts (“peanutyalmondy”), if it’s hard or soft, if it’s fruity, if it has caramel, if it has a crisped rice wafer (oddly specific), if it has nougat, if it’s in a bar form and if it comes in a pack with other candies (“pluribus”). There are also continuous columns for sugar percentile and price percentile compared to the rest of the candies in the dataset, but for the purposes of this post, I’m going to only focus on the flags.

An obvious question to ask of this dataset from a machine learning perspective is: can we predict a candy’s win percentage if we know its features? Asked another way: what features will predict a Halloween candy’s popularity?

Let’s do a bit of visualizing first. Here I’ve plotted average win percentage vs. feature for all 9 yes or no candy descriptors:

Many of the candy features seem to have a noticeable impact on a candy’s win percentage. When we compare the top and bottom candies, we see concrete examples of this.

All the top 10 candies are chocolate, while all the bottom 10 are not chocolate:

Clearly, a candy being chocolate would be a useful feature for predicting a candy’s win percentage. Candies with nuts are also more likely to be in the top 10 than the bottom 10, though not as extreme as chocolate vs. no chocolate:

Taken together, these charts indicate that creating a model using the candy descriptors to predict win percent could be possible. There is clearly a difference between the top performing candies and bottom performing candies that the descriptors are at least partially capturing.

Candy Decision Tree

My first idea when deciding what would be a fun model to use with the candy dataset was a decision tree. Decision trees aren’t usually the best performing models, but they are useful in an illustrative way, providing a kind of roadmap for how they make their predictions.

I think a decision tree in this situation could be thought to function much like a person presented with a halloween candy, trying to decide how much they want to eat it. During training, the tree will run through the features it’s given – in this case, the candy descriptors. It will decide on one feature for its first decision – say, chocolate. After separating the candies into chocolate or not chocolate, it will then choose two more features and make more splits, until the candies have been separated into some number of groups, each with distinct combinations of descriptors. Each of these groups will then have a win percentage prediction associated with it based on the candies in the training dataset that fall within that group. This number can then be used to assign win percentage predictions to new candies that have the same characteristics as candies in the training dataset.

This process reminds me of when my brother and I would get back from trick or treating, dump all our candies into piles, organize them into categories and trade back and forth. The decision tree is dumping out the candies in the training dataset and organizing them into piles in its own way, and I enjoy that.

Decision tree parameters

There are many parameters that can be tuned in a decision tree, but I’m going to focus on max depth for now. Max depth refers to the number of levels of decisions, or splits, the tree is allowed to make. If max depth were set to 1, the decision tree would only make one decision. You would never do this with a model intended for real predictions, but it’s interesting to see which feature a decision tree picks when it can only make one decision. Let’s try it out, using python:

import pandas as pd
df = pd.read_csv('candy-data.csv')

y = df['winpercent'] 
X = df[df.columns.drop(['winpercent','competitorname','sugarpercent','pricepercent'])]

from sklearn.model_selection import train_test_split
import numpy as np

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = .5, random_state=130)

from sklearn.tree import DecisionTreeRegressor

rf = DecisionTreeRegressor(random_state=122, max_depth = 1)

rf.fit(X_train, y_train)

In the above code, we divide the candy data into features (X) and target (y), then split it into train and test datasets. Because the dataset was small, I used a 50/50 train/test split. Next we fit a decision tree regressor to the training data. I’ve specified that the decision tree should only split once.

rf.score(X_train, y_train)
0.43092954635899416

The R² value for our model with 1 split is .431, meaning that about 43.1% of the variance in win percentage in the training data can be explained by the model. At face value that’s not very good, but keep in mind that with a max depth of 1, the model is only using a single feature.

The beautiful thing about decision trees is that there are several built in ways to view how the tree came to its decision. We can generate a chart with the decision tree as follows:

from matplotlib import pyplot as plt
from sklearn import tree

fig = plt.figure(figsize=(20,20))
_ = tree.plot_tree(rf, 
                   feature_names=X.columns,  
                   filled=True)

The tree did indeed choose chocolate as the most useful feature for separating candies into a lower performing and a higher performing group. This visualization is great because it gives so much information, including the mean squared error (mse) for each stage in the decision process. We can also see the predicted value that the model assigns to each group. Candies that aren’t chocolate get a 42% win percentage, and chocolate candies get a 60% win percentage.
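
Once the tree is fit, scoring a new candy is a one-liner with rf.predict. Here’s a quick sketch of what that could look like; the column names in the dictionary are an assumption on my part that they match the flag columns in candy-data.csv:

# a hypothetical new candy: a chocolate bar with no other flags set
# (column names assumed to match the dataset's flag columns)
new_candy = pd.DataFrame([{
    'chocolate': 1, 'fruity': 0, 'caramel': 0, 'peanutyalmondy': 0,
    'nougat': 0, 'crispedricewafer': 0, 'hard': 0, 'bar': 1, 'pluribus': 0
}], columns=X.columns)

# with max_depth=1 the tree only checks the chocolate flag,
# so this candy should land in the ~60% win percentage group described above
rf.predict(new_candy)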

Let’s add one more split and see what happens:

rf = DecisionTreeRegressor(random_state=122, max_depth = 2)

We now get one more level of splits in the decision tree, resulting in 4 possible predictions based on candy features. If the candy is chocolate, the model will then look to see if it has nuts (peanutyalmondy). Candies that are both chocolate and have nuts get the highest predicted win percentage of 71%. If the candy isn’t chocolate, the model will look to see if it is fruity. Fruity candies that aren’t chocolate have a higher predicted win percentage than candies that are neither chocolate nor fruity.

We can add another layer to our decision tree:

rf = DecisionTreeRegressor(random_state=122, max_depth = 3) 

At this point it’s a little hard to see the figure, but you get the idea. The model has added one more level of splits to the tree, and we now have 8 separate groups of predicted win percentages based on the candy features. In this model, the candies with the highest predicted win percentages are those that are chocolate, “peanutyalmondy”, and in bar form, and the candies with the lowest predicted win percentages are hard candies that are neither fruity nor chocolate.

If we score this model on the training data, we see that it explains much more of the variance than our model with only one split:

rf.score(X_train, y_train)
0.6322525569863932

The R² value for our model with a max depth of 3 is .632, meaning that about 63.2% of the variance in win percentage in the training data can be explained by the model. That’s about 20 percentage points more variance explained than the first model, which made only one split.

We can also check how our model will perform in predicting unseen data:

rf.score(X_test, y_test)
0.5348005970923211

Given the size of the dataset (small), the fact that I was being lazy and not using cross validation, and decision trees’ natural tendency to overfit, the model fits the training data much better than the test data. However, all things considered, it isn’t so bad. The model was still able to explain more than half of the variance in win percentage for completely unseen candies.
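
For what it’s worth, cross validation wouldn’t take much extra code. Here’s a minimal sketch using scikit-learn’s cross_val_score with the X and y defined earlier:

from sklearn.model_selection import cross_val_score

# 5-fold cross validation of a depth-3 tree on the full dataset
cv_scores = cross_val_score(DecisionTreeRegressor(random_state=122, max_depth=3),
                            X, y, cv=5, scoring='r2')
print(cv_scores.mean(), cv_scores.std())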

Happy Halloween!

Predicting pro-choice vs. pro-life tweets with NLP: Part 1 (prepping and training the model)

This summer, I decided to enroll in an accelerated, half-semester long Natural Language Processing class. Why not enhance the already miserably hot and humid month of July with an additional 6 hours per week of sitting in a small wooden seat after work, I thought. It was indeed a brutal month that has left me with a literal pain in my neck, probably from hauling two separate computers uptown twice a week and subsequently hunching over them. And here I am, hunching again to record what I learned for posterity.

Complaints aside, I enjoyed the class. It was programming-heavy and taught in Python, which was a welcome change of pace after a full year of classes only taught in R (though, as repeat readers of this blog should know, I am an R lover). I was already familiar with the essential concepts of NLP, having used it for several projects in other classes, outside of class and at work. I was happy to devote a solid chunk of time and attention to going more in-depth with NLP.

Naturally, there was a final group project. Our group decided to analyze tweets surrounding the recent Roe v. Wade ruling. I volunteered to train a model to label the tweets as either pro-choice or pro-life. This post will outline how I did that.

Labeled data from Surge AI

The first thing required to create our model is a dataset with tweets already labeled as pro-choice or pro-life. As luck would have it, this dataset existed when I googled around. Surge AI is a data labeling platform that crowdsources labeling of data used for ML models. A sample dataset of 1025 tweets labeled pro-choice vs. pro-life was available on their website for free download. 

I imported the dataset to python and saved it as a pandas dataframe:

import pandas as pd
roe = pd.read_csv('surge_ai_roe.csv')

The dataset included columns with the tweet text, as well as a “bio” column with text from the user’s bio. Scrolling through the data, it’s apparent that the bio text is useful in identifying whether a person’s tweet is pro-choice or pro-life (take for example row 15 above). I decided to use both the tweet text and bio text as input for the predictive model.

The data was skewed considerably toward pro-choice tweets. 744 (72.6%) of the 1025 tweets in the Surge AI dataset were labeled pro-choice. This was accounted for in the model later on.

Preparing the tweets for use in a model

To use text as a feature in a model, it must eventually be turned into numbers. This can be accomplished through vectorization. When text is vectorized, it is converted from a chunk of words into a series of variables, one for each unique word in the chunk. Each variable is then assigned a number corresponding to how many times that word appears in the text:

Text: I am disappointed I never saw the Minions movie in theaters.

Vectorized text: i: 2, am: 1, disappointed: 1, never: 1, saw: 1, the: 1, minions: 1, movie: 1, in: 1, theaters: 1

To vectorize a corpus containing multiple texts, the same logic applies, but the number of variables grows to include all of the words that occur in all of the texts:

Text 1: He looks for eggs.

Text 2: The eggs are for the cookies.
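
To make this concrete, here’s a small sketch that vectorizes those two example texts with scikit-learn’s CountVectorizer (get_feature_names_out requires a recent scikit-learn; older versions use get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

texts = ["He looks for eggs.", "The eggs are for the cookies."]
vec = CountVectorizer()
counts = pd.DataFrame(vec.fit_transform(texts).toarray(),
                      columns=vec.get_feature_names_out())
print(counts)
#    are  cookies  eggs  for  he  looks  the
# 0    0        0     1    1   1      1    0
# 1    1        1     1    1   0      0    2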

Before our tweets can be vectorized, they must be cleaned. There are several ways a tweet should be cleaned or transformed from its original state in order to be vectorized properly:

  1. Convert to lowercase
  2. Remove numbers and punctuation
  3. Deal with @s and hashtags
  4. Handle Emojis
  5. Remove stop words (maybe)

Though there are a number of steps involved, cleaning the text requires relatively little code.

Convert to lowercase, remove numbers and remove punctuation

To convert to lowercase and remove numbers and punctuation, the following function can be used:

#lowercase, remove numbers and punctuation (keep letters and apostrophes)
def clean_txt(text):
    import re
    new_text = re.sub("[^A-Za-z']", " ", text).lower()
    return new_text

The re library is used for regex operations, which I hardly understand, but everything can be googled 🙂

Clean @s and hashtags

To clean the tags and hashtags for the tweets, I simply googled and found a function on stack overflow:

#removes URLs and @s, and strips the # from hashtags (adapted from Stack Overflow)
def tweet_clean(text):
    from urllib.parse import urlparse
    new_text = ''
    for i in text.split():
        s, n, p, pa, q, f = urlparse(i)
        if s and n:
            pass  # skip URLs
        elif i[:1] == '@':
            pass  # skip @mentions
        elif i[:1] == '#':
            new_text = new_text.strip() + ' ' + i[1:]  # keep hashtag text, drop the #
        else:
            new_text = new_text.strip() + ' ' + i
    return new_text

Convert emojis to text

I wanted to keep the meaning of the emojis rather than completely remove them. Luckily, there is a library that allows you to convert emojis into text:

import emoji
emoji.demojize('🥖🐣🤌🥰')

‘:baguette_bread::hatching_chick::pinched_fingers::smiling_face_with_hearts:’

Remove stop words

The function below can be used to remove English stop words (common words that hold no real meaning, like and, or, the, of, etc.)

#function to remove stopwords (requires nltk.download('stopwords') on first use)
def rem_sw(text):
    import nltk
    from nltk.corpus import stopwords
    sw = set(stopwords.words('english'))
    new_text = [word for word in text.split() if word not in sw]
    return ' '.join(new_text)

I didn’t remove stop words in my final model because it actually performed slightly better when I left them in!

Create new cleaned text columns for use in our model

Using the functions above, I cleaned the tweet text and user bio columns and combined them into a cleaned column called “combined”. I then removed any record with a “combined” length of 0.

roe["clean_text"] = roe.text.apply(tweet_clean).apply(emoji.demojize).apply(clean_txt)

roe["bio"] = roe["bio"].fillna('')

roe["clean_bio"] = roe.bio.apply(emoji.demojize).apply(clean_txt)

roe["combined"] = roe[['clean_text','clean_bio']].apply(lambda x: ' '.join(x), axis=1)

roe["combined_len"] = roe.combined.apply(len)

roe = roe[roe.combined_len > 0]

Now that we have a fully prepared column with the text we want to use in our model, we can vectorize it.

Vectorizing

Below is the code used to vectorize a text column using sklearn’s CountVectorizer and/or TfidfVectorizer functions.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd

#Use for simple count vectorizer
#my_vec = CountVectorizer(ngram_range=(1, 2)) 

#Use for tf-idf
my_vec = TfidfVectorizer(ngram_range=(1, 2))

transformed = pd.DataFrame(my_vec.fit_transform(list(roe["combined"])).toarray()) 

transformed.columns = my_vec.get_feature_names()  # on newer scikit-learn versions, use get_feature_names_out()

I used tf-idf (term frequency–inverse document frequency) vectorization rather than simple count vectorization for my model. Tf-idf is similar to count vectorization, but it down-weights words that appear in many records across the corpus, so distinctive words end up with higher scores than very common ones.

The ngram_range parameter allows you to specify whether you would like to include single words (unigrams), bigrams, trigrams, etc. in your vectorization. By adding (1, 2) I indicated that I would like the vectorization to create features for each word and each bigram that appears in my corpus. To use only single words you would write (1, 1); only bigrams, (2, 2); bigrams and trigrams, (2, 3); all three would be (1, 3).
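
As a tiny illustration, reusing the CountVectorizer imported above, here’s what the feature names look like for a single sentence with ngram_range=(1, 2):

# unigram + bigram features for one short sentence
demo_vec = CountVectorizer(ngram_range=(1, 2))
demo_vec.fit(["he looks for eggs"])
print(demo_vec.get_feature_names_out())
# ['eggs' 'for' 'for eggs' 'he' 'he looks' 'looks' 'looks for']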

Making the model

There are many types of models that could be used for predicting a binary outcome (in this case pro-choice or pro-life). Perhaps the simplest is logistic regression.

from sklearn.linear_model import LogisticRegression

my_model = LogisticRegression(class_weight= 'balanced')

Next I used the train_test_split function to split my original data into train and test datasets.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
     transformed, roe.label, test_size=0.2, random_state=85)

Now it’s quite simple to fit the model:

my_model.fit(X_train, y_train)

And we can get the predicted classes for the test data (called y_pred):

y_pred = my_model.predict(X_test)

We can also view the probabilities associated with these predictions:

my_model.predict_proba(X_test)

Finally, we can look at the precision, recall and F1 score to see how the model performed:

from sklearn.metrics import precision_recall_fscore_support
import pandas as pd

metrics = pd.DataFrame(precision_recall_fscore_support(
    y_test, y_pred, average='weighted'))

metrics.index = ["precision","recall","fscore","na"]

All in all, this is not a terrible showing given that there were only about 800 records to train the model with. These metrics have been calculated using the weighted average parameter in sklearn’s precision_recall_fscore_support function, as there were quite a few more tweets labeled as pro-choice vs. pro-life in the dataset.

We can also take a look at the confusion matrix ourselves to better understand how tweets were labeled:

from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
                     Predicted pro-choice    Predicted pro-life
Actual pro-choice    135                     8
Actual pro-life      43                      19

We see that there are 135 “true negatives” (tweets correctly labeled as pro-choice), 8 “false positives” (tweets incorrectly labeled as pro-life), 43 “false negatives” (tweets incorrectly labeled as pro-choice) and 19 true positives (tweets correctly labeled as pro-life) in our testing sample. This is important information. The model has good precision when it comes to categorizing pro-choice tweets (135/(135+43) = .76) and good precision for pro-life tweets (19/(19+8) = .70). This means the model is not producing a high percentage of false positives in either category.

The model’s recall is another story. Recall assesses the model’s accuracy in correctly classifying the tweets that truly belong to each category. With recall we ask: of the tweets that are actually pro-choice or pro-life, how many were labeled correctly? For pro-choice, recall is very high (135/(135+8) = .94). This tells us that the model is VERY good at correctly labeling pro-choice tweets. However, recall for the pro-life tweets is bad (19/(19+43) = .31). The model is more likely to think that true pro-life tweets are actually pro-choice tweets.

For the purposes of this project and with the constraint of having relatively little labeled data to train the model on, this doesn’t especially bother me. To fix this, we could potentially change the probability threshold such that more tweets get labeled as pro-life.
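
As a rough sketch of that idea, we could take the predicted probabilities and apply our own cutoff instead of the default 0.5. The 'pro-life'/'pro-choice' label strings below are assumptions on my part; use whatever values actually appear in my_model.classes_:

import numpy as np

probs = my_model.predict_proba(X_test)
pro_life_idx = list(my_model.classes_).index('pro-life')  # assumed label string

# anything above a 35% pro-life probability now gets the pro-life label
threshold = 0.35
y_pred_adjusted = np.where(probs[:, pro_life_idx] >= threshold, 'pro-life', 'pro-choice')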

In summary

Those are the essential steps I took to wrangle and prep tweets for use in a basic ML model to predict pro-choice vs. pro-life ideology. In the next post, I’ll show how I was able to label and analyze a novel dataset of scraped tweets using the model I trained.

Wrangling an Image Dataset (under the sea)

Follow my process to scrape and wrangle a dataset of Spongebob title card images for eventual GAN processing.

This post covers:

  • Scraping images from the web using Beautiful Soup
  • Converting image data to a format digestible by a Generative Adversarial Network (GAN) algorithm

Preface: I’m your biggest fan

In 1999 I was bombarded with Spongebob promos while watching CatDog, memorized the date and time of the premiere, and made sure I was seated in front of the TV right when it aired. As history can attest, it did not disappoint.

A scientist from birth, I was won over by the extent to which real-world biological details were used to inform Spongebob’s characters and plot lines. Only recently did I learn that the late creator, Stephen Hillenberg, was actually a trained marine biologist.

I was equally enamored with Spongebob’s visual style. The early seasons’ aesthetic elicits the same feeling for me as visiting San Francisco’s Fisherman’s Wharf at the turn of the millennium. It’s something akin to coloring the children’s menu at an old, dusty seafood restaurant with an assortment of crayons pulled at random from the Crayola 60 pack while wearing a Hawaiian print skort from Limited Too.

One of the most delightful aspects of Spongebob is the title cards. Many of these are seared in my mind to this day, paired with the sounds of a ukulele or slide guitar. Each card has a distinct personality – many in the same general style but each unique enough to remember for years.

I recently came across a Spongebob fan website with a comprehensive archive of title cards. I wondered if it might be possible to scrape the title card images into a dataset and wrangle them into a neural network friendly format. I’ve been interested in training a GAN to generate images for a while; is it possible to generate novel Spongebob title cards with this data?

In this first post of the Spongebob title card adventure, I will walk through scraping and wrangling image data into a format compatible with GAN algorithms.

Scraping

I used the Beautiful Soup library to scrape the title card image links and episode names from this website. Detailed information on using Beautiful Soup can be found in this blog post on scraping web data.

import requests
from bs4 import BeautifulSoup
import bleach
import pandas as pd 
url = "https://spongebob.fandom.com/wiki/List_of_title_cards#.23"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id="content")


images = results.find_all('img')
sources =[]
titles = []

for image in images:
    source = image.attrs['src']
    sources.append(source)

titles_raw = results.find_all(class_= 'lightbox-caption')

for title in titles_raw:
    title = bleach.clean(str(title), tags=[], strip=True).strip()
    titles.append(title)
            
sources = [x for x in sources if not x.startswith('data')]
sources = [x.replace('/scale-to-width-down/185', '') for x in sources]

spongedata = pd.DataFrame(list(zip(sources, titles)),columns = ['Image_Source','Title'])
spongedata

Downloading Images

Now that we have the image links, we can use the requests module to open and save the images. There are a number of ways to do this – this stack overflow thread was particularly helpful in finding the right method.

Essentially, requests is used to download the image from the url into memory. Once downloaded, the shutil module is used to properly store the image to a jpg file.

I imported a random title card below:

import matplotlib.pyplot as plt
import matplotlib.image as mpim
import shutil

response = requests.get(spongedata.Image_Source[390], stream=True) 
with open("sponge1.jpg", 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)
del response
img = mpim.imread('sponge1.jpg')
imgplot = plt.imshow(img)

Note that the image is approximately 1000×1400 pixels.

Here is another random title card:

response = requests.get(spongedata.Image_Source[211], stream=True) 
with open("sponge1.jpg", 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)
del response
img = mpim.imread('sponge1.jpg')
imgplot = plt.imshow(img)

This title card is approximately 1000×1750 pixels – not quite the same size as the first one we looked at. To avoid issues down the road, I decided to resize each title card to a uniform number of pixels:

image.resize((300, 200))  # PIL's resize returns a new image; the result is assigned inside the loop below

CIFAR10 Image Structure

Now let’s take a look at the kind of image data that others have used with GANs successfully. I did some research and found an incredibly helpful tutorial on constructing a GAN using the CIFAR10 image database. The article is located here.

The CIFAR10 database consists of 60,000 32×32 color images stored as arrays and split into a 50,000 image train and 10,000 image test dataset.

The train and test datasets are stored as 4D arrays – for example, the shape of the train dataset is (50000, 32, 32, 3). The first dimension is the number of images (50,000), the next two are the number of rows and columns of pixels in each image (32×32), and the deepest dimension holds the 3 RGB values assigned to each pixel.
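
One way to see those shapes for yourself is the Keras CIFAR10 loader (this assumes you have TensorFlow installed; the dataset can also be downloaded directly from its homepage):

from tensorflow.keras.datasets import cifar10

# load the train/test splits and inspect their shapes
(train_images, train_labels), (test_images, test_labels) = cifar10.load_data()
print(train_images.shape)  # (50000, 32, 32, 3)
print(test_images.shape)   # (10000, 32, 32, 3)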

An image can be converted to a 3D array very simply by calling the numpy.array() function after opening the image file using the PIL python image library:

from PIL import Image
import numpy as np

image = Image.open('sponge1.jpg')
image = np.array(image)
image.shape

Now that we know how to get an image into an array format similar to those in the CIFAR10 dataset, we can write code to loop through our scraped data, convert all our title cards to arrays and combine them all into one 4D array like the CIFAR10 data:

title_cards = []
for src in spongedata.Image_Source:
    # download each title card, resize it, and convert it to an array
    response = requests.get(src, stream=True)
    with open("sponge1.jpg", 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    del response
    img = Image.open('sponge1.jpg')
    img = img.resize((300, 200))
    title_cards.append(np.array(img))
title_cards = np.array(title_cards)
title_cards.shape

Yay! We’ve managed to get a dataset of Spongebob title cards in something analogous to the CIFAR10 format that we can now try and feed into a GAN. Stay tuned for more Spongebob title card neural network posts!

Understanding ANOVA and Interaction Effects

Understanding ANOVA and Interaction Effects using Python

Analyze the effects of the fictional snail performance supplement Snail Fuel on olympic snail athletes!

Let’s pretend we’ve been hired by the fictional snail energy drink and performance supplement company Snail Fuel™ to conduct an analysis on the effectiveness of their product. They want to know if snails who drink their supplement can expect a significant increase in muscle mass. Anecdotally, Snail Fuel has heard that snails with red shells in particular have been able to make impressive gains using their product. If this can be proven true, Snail Fuel will change their marketing strategy accordingly.

We have set up a small clinical trial with 30 snail olympians split into 6 groups across two dimensions: shell color and placebo/Snail Fuel. Each snail has continued exercising as usual while drinking either a placebo fuel made of plain protein powder or Snail Fuel. The snails were scored on % increase in muscle mass after one month.

We plan to analyze this data using a two-way ANOVA test – also known as an analysis of variance. The ultimate goal of an ANOVA is to explain variance in our sample and where the variance is coming from. In our case, in a two-way ANOVA, all variance can be explained from the following sources:

each snail’s score = grand mean + row effect (first independent variable) + column effect (second independent variable) + interaction effect (interaction of the first and second variables) + error

An ANOVA will be able to tell us how much of the variance in our sample can be attributed to each source, and whether that variance meets a threshold of significance.

Let’s generate a Pandas dataframe with the snails’ scores:

import pandas as pd

yellow_placebo = [2, 4, 3, 1, 5]
yellow_fuel = [5, 7, 6, 6, 6]
blue_placebo = [5, 6, 4, 2, 8]
blue_fuel = [5, 6, 4, 7, 8]
red_placebo = [2, 7, 5, 2, 4]
red_fuel = [11, 13, 12, 14, 10]

data = []
for i in yellow_placebo:
    data.append((i,"yellow","N"))
for i in yellow_fuel:
    data.append((i, "yellow", "Y"))
for i in blue_placebo:
    data.append((i, "blue", "N"))
for i in blue_fuel:
    data.append((i, "blue","Y"))
for i in red_placebo:
    data.append((i, "red","N"))
for i in red_fuel:
    data.append((i, "red","Y"))
    
columns = ['Percent_Muscle_Increase','Shell_Color','Snail_Fuel']
snail_fuel_data = pd.DataFrame(data,columns=columns)
snail_fuel_data.head()

We can place each snail’s score in a 3×2 grid according to which subgroup the snail falls into:

         Placebo             Snail Fuel
Yellow   2, 4, 3, 1, 5       5, 7, 6, 6, 6
Blue     5, 6, 4, 2, 8       5, 6, 4, 7, 8
Red      2, 7, 5, 2, 4       11, 13, 12, 14, 10

Before we do any calculations, we can visualize our data using a box plot. This will show the median and spread of our dependent variable (percent muscle increase) broken into groups by the categorical placebo/Snail Fuel variable. We can color each point based on the shell color variable as well.

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8,5))
colors = ["yellow", "blue", "red"]
ax = sns.boxplot(x='Snail_Fuel', y='Percent_Muscle_Increase', data=snail_fuel_data, color='white')
ax = sns.swarmplot(x='Snail_Fuel', y='Percent_Muscle_Increase', data=snail_fuel_data, hue='Shell_Color', palette=colors)
plt.show()

Let’s find the average % increase in muscle mass for each subgroup:

         Placebo                        Snail Fuel
Yellow   2, 4, 3, 1, 5 (mean = 3)       5, 7, 6, 6, 6 (mean = 6)
Blue     5, 6, 4, 2, 8 (mean = 5)       5, 6, 4, 7, 8 (mean = 6)
Red      2, 7, 5, 2, 4 (mean = 4)       11, 13, 12, 14, 10 (mean = 12)

Let’s plot only the means of our data:

means = snail_fuel_data.groupby(['Shell_Color', 'Snail_Fuel'], as_index=False).mean()

plt.figure(figsize=(8,5))
colors = ["blue", "red","goldenrod"]
ax = sns.swarmplot(x="Snail_Fuel", y='Percent_Muscle_Increase', data=means, hue ='Shell_Color', palette=colors, s = 7)
plt.plot( 'Snail_Fuel', 'Percent_Muscle_Increase', data=means[(means.Shell_Color == "red")], marker='', color='red', linewidth=2)
plt.plot( 'Snail_Fuel', 'Percent_Muscle_Increase', data=means[(means.Shell_Color == "blue")], marker='', color='blue', linewidth=2)
plt.plot( 'Snail_Fuel', 'Percent_Muscle_Increase', data=means[(means.Shell_Color == "yellow")], marker='', color='goldenrod', linewidth=2)

plt.show()

Main effects

We can already see that red snails who drank Snail Fuel have the highest average % gain in muscle mass. It also looks like for all shell colors, those who drank Snail Fuel gained more muscle. How can we know the extent that having a red shell increases Snail Fuel’s efficacy if it appears that snails of every shell color saw benefits?

First, we need to know exactly what effects are coming from what variables. How much of an effect does drinking Snail Fuel vs. the placebo have on muscle mass? How much of an effect does shell color have on muscle mass? These are known as the main effects for each variable. We can find this by calculating a grand mean (the average % increase of muscle mass for all snails in the study) and the individual row and column means. The main effect for each row or column will be its distance from the grand mean.
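
Here’s a minimal sketch of that arithmetic using the snail_fuel_data frame we built earlier (the values in the comments assume the scores listed above):

# grand mean and the marginal (row and column) means
grand_mean = snail_fuel_data['Percent_Muscle_Increase'].mean()
row_means = snail_fuel_data.groupby('Shell_Color')['Percent_Muscle_Increase'].mean()
col_means = snail_fuel_data.groupby('Snail_Fuel')['Percent_Muscle_Increase'].mean()

row_effects = row_means - grand_mean   # main effect of shell color
col_effects = col_means - grand_mean   # main effect of placebo vs. Snail Fuel

print(grand_mean)    # 6.0 with the scores above
print(row_effects)   # blue: -0.5, red: +2.0, yellow: -1.5
print(col_effects)   # N (placebo): -2.0, Y (Snail Fuel): +2.0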

Interaction effects

         Placebo                       Snail Fuel                        Row means
Yellow   2, 4, 3, 1, 5 (mean = 3)      5, 7, 6, 6, 6 (mean = 6)          Row 1 mean: 4.5 (effect = -1.5)
Blue     5, 6, 4, 2, 8 (mean = 5)      5, 6, 4, 7, 8 (mean = 6)          Row 2 mean: 5.5 (effect = -0.5)
Red      2, 7, 5, 2, 4 (mean = 4)      11, 13, 12, 14, 10 (mean = 12)    Row 3 mean: 8 (effect = +2)
         Col 1 mean: 4 (effect = -2)   Col 2 mean: 8 (effect = +2)       Grand mean = 6

The main effects for each variable will always add to zero (with equally sized groups, as we have here).

We see that the main effect for having a red shell is +2, meaning a snail with a red shell, regardless of whether they were assigned the placebo or the Snail Fuel, would be expected to have a % increase in muscle mass 2 units greater than the average value for all snails.

Similarly, we see that the main effect for drinking Snail Fuel is also +2, meaning that regardless of shell color, if we know a snail was assigned to drink Snail Fuel, we would expect that snail’s % gain in muscle mass to be 2 units greater than the grand mean.

By this logic, we can theorize that a red-shelled snail who drank Snail Fuel would have a % increase in muscle mass 2 (red shell) + 2 (Snail Fuel) units greater than the grand mean.

Now that we know the main effects for our variables, we can calculate the expected means taking main effect into account for each subgroup by adding or subtracting the relevant variable effects from the grand mean:

         Placebo                 Snail Fuel
Yellow   Expected mean: 2.5      Expected mean: 6.5      (effect = -1.5)
Blue     Expected mean: 3.5      Expected mean: 7.5      (effect = -0.5)
Red      Expected mean: 6        Expected mean: 10       (effect = +2)
         (effect = -2)           (effect = +2)           Grand mean = 6

Let’s compare the expected means with main effects to the observed means we already calculated:

         Placebo                 Snail Fuel
Yellow   Observed mean: 3        Observed mean: 6
         Expected mean: 2.5      Expected mean: 6.5
         Difference: +0.5        Difference: -0.5
Blue     Observed mean: 5        Observed mean: 6
         Expected mean: 3.5      Expected mean: 7.5
         Difference: +1.5        Difference: -1.5
Red      Observed mean: 4        Observed mean: 12
         Expected mean: 6        Expected mean: 10
         Difference: -2          Difference: +2

The difference between the observed mean and the grand mean + main effects for each subgroup is known as the subgroup’s interaction effect. In other words, a subgroup’s interaction effect accounts for the variance in the observed mean for that subgroup that is not explained by the row or column effects. It can also be denoted as the row X column effect.

Now we can see clearly that there is a +2 unit positive interaction of red shell color and Snail Fuel. That means that snails with red shells who drank Snail Fuel gained, on average, 2 percentage points more muscle mass than we would have expected based on the shell color and placebo/Snail Fuel main effects alone.

Mathematically, interaction effects must cancel each other out – so a positive interaction of red shell color and Snail Fuel also means a negative interaction of red shell color and placebo. Accordingly, we find that yellow and blue-shelled snails have negative interaction effects for Snail Fuel. On average, blue snails and yellow snails saw 1.5 and .5 fewer percentage point gains in muscle mass than expected.
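
Continuing the sketch from the main effects section, the interaction effects can be computed the same way: subtract the grand mean plus row and column effects from each subgroup’s observed mean.

# observed mean for each shell color / drink combination
cell_means = snail_fuel_data.groupby(['Shell_Color', 'Snail_Fuel'])['Percent_Muscle_Increase'].mean()

interaction_effects = {}
for (color, fuel), observed in cell_means.items():
    expected = grand_mean + row_effects[color] + col_effects[fuel]
    interaction_effects[(color, fuel)] = observed - expected

print(interaction_effects)
# with the scores above, ('red', 'Y') comes out to +2.0 and ('red', 'N') to -2.0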

Running a two-way ANOVA in Python using statsmodels

We can test for the significance of the main effects and interaction effects in our data by running a two-way ANOVA test using the statsmodels library. The code is quite simple and easy to read:

import statsmodels.api as sm
from statsmodels.formula.api import ols

#perform two-way ANOVA
model = ols('Percent_Muscle_Increase ~ C(Snail_Fuel) + C(Shell_Color) + C(Snail_Fuel):C(Shell_Color)', data=snail_fuel_data).fit()
sm.stats.anova_lm(model, typ=3)

We have specified that the ANOVA should test the relation of the dependent variable (percent muscle increase) to the categorical variables “Snail_Fuel” and “Shell_Color” in addition to their interaction “C(Snail_Fuel):C(Shell_Color).” We have also specified that the model should use a type 3 sum of squares. This is the sum of squares methodology used when an interaction effect is thought to be present in the data. For more information, see this helpful article from Towards Data Science.

The output of our test is below:

Interestingly and perhaps not surprisingly, the model tells us that the only significant effect in our ANOVA is the interaction of shell color and snail fuel. The p values (PR(>F)) for the snail fuel and shell color variables on their own would not meet the <.05 standard of significance.

Given these results, we can say that for the total population of snails in the study, there was no statistically significant difference in muscle increase between those who drank Snail Fuel and those who drank a placebo energy drink. However, there was a statistically significant interaction between shell color and drinking Snail Fuel, such that snails with red shells saw larger gains in muscle mass than snails with yellow and blue shells.

If we want to dig even deeper, we can take a look at the model summary:

print(model.summary())

Digging into the model summary, we can discern that the only significant p value corresponds to a positive 4.583 t score for “C(Snail_Fuel)[T.Y]:C(Shell_Color)[T.Red].” That translates to a significant positive interaction of drinking snail fuel and having a red shell.

So it looks like the anecdotal evidence has proven true after all, at least in our very small sample.