Sofia’s 1st Annual Halloween Spooktacular: Candy Decision Trees

To be perfectly honest, the only reason I decided to write this was to have somewhere to put this absolutely adorable little pumpkin with the fall foliage boots. Every great artist must sacrifice for their creations.

The Candy Ranking Dataset

Last fall as I browsed Kaggle, searching for datasets to use for a class assignment, I found The Ultimate Halloween Candy Power Ranking. Someone had created a website that randomly pits two Halloween candies against each other and asks visitors to choose which one they prefer. After 269,000 votes were collected, each candy’s win percentage, along with descriptive information for each candy, was posted as a dataset. I’ve included the dataset below:


Most of the descriptive columns are yes or no flags for certain candy characteristics: whether the candy has chocolate, if it has nuts (“peanutyalmondy”), if it’s hard or soft, if it’s fruity, if it has caramel, if it has a crisped rice wafer (oddly specific), if it has nougat, if it’s in a bar form and if it comes in a pack with other candies (“pluribus”). There are also continuous columns for sugar percentile and price percentile compared to the rest of the candies in the dataset, but for the purposes of this post, I’m going to only focus on the flags.

An obvious question to ask of this dataset from a machine learning perspective is: can we predict a candy’s win percentage if we know its features? Asked another way: what features will predict a Halloween candy’s popularity?

Let’s do a bit of visualizing first. Here I’ve plotted average win percentage vs. feature for all 9 yes or no candy descriptors:
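The averages behind a chart like this take only a few lines of pandas. The sketch below uses a tiny made-up table: the rows in `toy` are hypothetical, not the real candy data, and only the column names are borrowed from the dataset. It computes the mean win percentage with and without each flag.

```python
import pandas as pd

# Hypothetical toy rows, NOT the real candy data; only the column names match.
toy = pd.DataFrame({
    "chocolate":  [1, 1, 0, 0],
    "fruity":     [0, 0, 1, 0],
    "winpercent": [70.0, 60.0, 45.0, 35.0],
})

flags = ["chocolate", "fruity"]

# For each flag, average winpercent among candies without (0) and with (1) it.
avg_by_flag = pd.DataFrame(
    {flag: toy.groupby(flag)["winpercent"].mean() for flag in flags}
).T
avg_by_flag.columns = ["no", "yes"]

print(avg_by_flag)
```

Calling `avg_by_flag.plot(kind="bar")` on the result would draw one pair of bars per flag, which is roughly the shape of the chart described here.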

Many of the candy features seem to have a noticeable impact on a candy’s win percentage. When we compare the top and bottom candies, we see concrete examples of this.

All the top 10 candies are chocolate, while all the bottom 10 are not chocolate:

Clearly, a candy being chocolate would be a useful feature for predicting a candy’s win percentage. Candies with nuts are also more likely to be in the top 10 than the bottom 10, though the gap is not as extreme as chocolate vs. no chocolate:

Taken together, these charts indicate that creating a model using the candy descriptors to predict win percent could be possible. There is clearly a difference between the top performing candies and bottom performing candies that the descriptors are at least partially capturing.

Candy Decision Tree

My first idea when deciding what would be a fun model to use with the candy dataset was a decision tree. Decision trees aren’t usually the best performing models, but they are useful in an illustrative way, providing a kind of roadmap for how they make their predictions.

I think a decision tree in this situation can be thought of as working much like a person presented with a Halloween candy, trying to decide how much they want to eat it. During training, the tree runs through the features it’s given, in this case the candy descriptors. It picks one feature for its first decision, say, chocolate. After separating the candies into chocolate and not chocolate, it picks a feature for each of those two groups and splits again, continuing until the candies have been separated into some number of groups, each with a distinct combination of descriptors. Each of these groups then gets a win percentage prediction: the average win percentage of the training candies that fall within it. This number can then be used to assign win percentage predictions to new candies that have the same characteristics as candies in the training dataset.
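To make the "decide on one feature" step concrete, here is a simplified sketch of how a regression tree might pick its first split: try each yes/no feature, and keep the one that leaves the smallest total squared error around the two group averages. The helper names (`sse`, `best_first_split`) and the five toy candies are made up for illustration; this is not scikit-learn’s actual implementation, just the same idea in miniature.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_first_split(rows, target, features):
    """Return the yes/no feature whose split leaves the smallest total squared error."""
    scores = {}
    for feature in features:
        yes = [r[target] for r in rows if r[feature] == 1]
        no = [r[target] for r in rows if r[feature] == 0]
        if not yes or not no:
            continue  # this feature does not separate anything
        scores[feature] = sse(yes) + sse(no)
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical toy candies, not rows from the real dataset.
rows = [
    {"chocolate": 1, "hard": 0, "winpercent": 70.0},
    {"chocolate": 1, "hard": 0, "winpercent": 60.0},
    {"chocolate": 0, "hard": 1, "winpercent": 30.0},
    {"chocolate": 0, "hard": 0, "winpercent": 45.0},
    {"chocolate": 0, "hard": 1, "winpercent": 35.0},
]

best, scores = best_first_split(rows, "winpercent", ["chocolate", "hard"])
print(best, scores)
```

On these toy rows, splitting on chocolate leaves less squared error than splitting on hard, so chocolate wins the first decision. A deeper tree would repeat the same search inside each resulting pile.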

This process reminds me of when my brother and I would get back from trick or treating, dump all our candies into piles, organize them into categories and trade back and forth. The decision tree is dumping out the candies in the training dataset and organizing them into piles in its own way, and I enjoy that.

Decision tree parameters

There are many parameters that can be tuned in a decision tree, but I’m going to focus on max depth for now. Max depth is the maximum number of decisions, or splits, the tree can make in a row before it has to give a prediction. If max depth is set to 1, the decision tree makes only one decision. This isn’t something you would ever really do with a model meant for accurate predictions, but it’s interesting to see what a decision tree would pick if it could only make one decision. Let’s try it out, using Python:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv('candy-data.csv')

# Target: win percentage. Features: the nine yes/no flags only.
y = df['winpercent']
X = df.drop(columns=['winpercent', 'competitorname', 'sugarpercent', 'pricepercent'])

# 50/50 split because the dataset is small.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.5, random_state=130)

# A tree that is only allowed to split once.
rf = DecisionTreeRegressor(random_state=122, max_depth=1)

rf.fit(X_train, y_train)

In the above code, we divide the candy data into features (X) and target (y), then split it into train and test datasets. Because the dataset is small, I used a 50/50 train/test split. Next we fit a decision tree regressor model to the training data. I’ve specified that the decision tree should only split once.

rf.score(X_train, y_train)
0.43092954635899416

The R² value for our model with 1 split is .431, meaning that about 43.1% of the variance in win percentage in the training data can be explained by the model. At face value that’s not very good, but keep in mind that with a max depth of 1, the model is using only 1 feature.
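For a regressor, `score` reports R², which is 1 minus the ratio of the model’s squared error to the squared error of always guessing the mean. Here is a small by-hand version to show what the number measures. The function name `r_squared` and the four toy values are made up for illustration.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical win percentages and the predictions of a one-split tree:
# each candy is predicted as the average of its own group.
actual = [70.0, 60.0, 40.0, 30.0]
predicted = [65.0, 65.0, 35.0, 35.0]

r2 = r_squared(actual, predicted)
print(r2)  # 1 - 100/1000 = 0.9
```

A model that always predicts the overall mean scores 0, and a model that predicts every value exactly scores 1.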

The beautiful thing about decision trees is that there are several built in ways to view how the tree came to its decision. We can generate a chart with the decision tree as follows:

from matplotlib import pyplot as plt
from sklearn import tree

fig = plt.figure(figsize=(20,20))
_ = tree.plot_tree(rf, 
                   feature_names=list(X.columns),  # a plain list; some scikit-learn versions reject a pandas Index
                   filled=True)

The tree did indeed choose chocolate as the most useful feature for separating candies into a lower performing and a higher performing group. This visualization is great because it gives so much information, including the mean squared error (mse) for each stage in the decision process. We can also see the predicted value that the model assigns to each group. Candies that aren’t chocolate get a predicted win percentage of 42%, and chocolate candies get 60%.
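If the plot is hard to read, scikit-learn can also print a tree as plain text with `export_text`. The sketch below fits a one-split tree on five made-up candies (the `X_toy` and `y_toy` values are hypothetical, not the real dataset) so that it runs on its own.

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical toy data: columns are [chocolate, hard].
feature_names = ["chocolate", "hard"]
X_toy = [[1, 0], [1, 0], [0, 1], [0, 0], [0, 1]]
y_toy = [70.0, 60.0, 30.0, 45.0, 35.0]

toy_tree = DecisionTreeRegressor(random_state=122, max_depth=1)
toy_tree.fit(X_toy, y_toy)

# One line per decision and one line per leaf with its predicted value.
report = export_text(toy_tree, feature_names=feature_names)
print(report)
```

The same call should work on the candy model by passing `rf` and `list(X.columns)` instead of the toy objects.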

Let’s add one more split and see what happens:

rf = DecisionTreeRegressor(random_state=122, max_depth=2)
rf.fit(X_train, y_train)

We now get one more level of splits in the decision tree, resulting in 4 possible predictions based on candy features. If the candy is chocolate, the model will then look to see if it has nuts (peanutyalmondy). Candies that are both chocolate and have nuts get the highest predicted win percentage of 71%. If the candy isn’t chocolate, the model will look to see if it is fruity. Fruity candies that aren’t chocolate have a higher predicted win percentage than candies that are neither chocolate nor fruity.

We can add another layer to our decision tree:

rf = DecisionTreeRegressor(random_state=122, max_depth=3)
rf.fit(X_train, y_train)

At this point it’s a little hard to see the figure, but you get the idea. The model has added one more level of splits to the tree, and we now have 8 separate groups, each with its own predicted win percentage based on the candy features. In this model, the candies with the highest predicted win percentages are chocolate, “peanutyalmondy”, and bars, and the candies with the lowest predicted win percentage are hard, not fruity, not chocolatey candies.
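The group counts follow from max depth: each extra level can at most double the number of groups, so a tree of depth d ends up with at most 2^d of them (2, then 4, then 8). A quick way to check this is `get_n_leaves()`. The sketch below uses randomly generated yes/no flags as a stand-in for the candy data; everything in it is synthetic, and the coefficients are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: 60 made-up candies with 4 yes/no flags.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(60, 4))
y_toy = (40 + 20 * X_toy[:, 0] + 8 * X_toy[:, 1] - 6 * X_toy[:, 2]
         + rng.normal(0, 5, size=60))

# Count the final groups (leaves) for each max depth.
leaf_counts = {}
for depth in (1, 2, 3):
    toy_tree = DecisionTreeRegressor(random_state=122, max_depth=depth)
    toy_tree.fit(X_toy, y_toy)
    leaf_counts[depth] = toy_tree.get_n_leaves()

print(leaf_counts)
```

A tree can end up with fewer than 2^d leaves if some branch runs out of useful splits early, so the bound is a ceiling rather than a guarantee.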

If we score this model on the training data, we see that it explains much more variance than our model with only one split:

rf.score(X_train, y_train)
0.6322525569863932

The R² value for our model with a max depth of 3 is .632, meaning that about 63.2% of the variance in win percentage in the training data can be explained by the model. That’s about 20 percentage points more variance explained than the first model with one split (.632 vs. .431).

We can also check how our model will perform in predicting unseen data:

rf.score(X_test, y_test)
0.5348005970923211

Given the small size of the dataset, the fact that I was being lazy and not using cross validation, and the nature of decision trees, it’s no surprise that the model fits the training data better than the test data (.632 vs. .535). However, all things considered, it isn’t so bad. The model was still able to explain more than half of the variance in win percentage for completely unseen candies.
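For the less lazy, here is roughly what cross validation could look like with scikit-learn’s `cross_val_score`. Instead of one 50/50 split, the data is cut into 5 folds and each fold takes a turn as the test set. The data below is randomly generated as a stand-in (it is not the candy dataset, and the coefficients are arbitrary), so the scores it prints say nothing about real candies.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: 80 made-up candies with 3 yes/no flags.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(80, 3))
y_toy = 40 + 20 * X_toy[:, 0] + 8 * X_toy[:, 1] + rng.normal(0, 5, size=80)

model = DecisionTreeRegressor(random_state=122, max_depth=3)

# For a regressor the default score is R^2, computed on each held-out fold.
scores = cross_val_score(model, X_toy, y_toy, cv=5)
mean_score = scores.mean()

print(scores)
print(mean_score)
```

With the real data, passing `X` and `y` in place of the toy arrays should give five R² values whose average is a steadier estimate than a single train/test split.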

Happy Halloween!