Sofia’s 1st Annual Halloween Spooktacular: Candy Decision Trees

To be perfectly honest, the only reason I decided to write this was to have somewhere to put this absolutely adorable little pumpkin with the fall foliage boots. Every great artist must sacrifice for their creations.

The Candy Ranking Dataset

Last fall as I browsed Kaggle, searching for datasets to use for a class assignment, I found The Ultimate Halloween Candy Power Ranking. Someone had created a website that randomly pits two Halloween candies against each other and asks visitors to choose which one they prefer. After 269,000 votes were collected, data on each candy’s win percentage, as well as descriptive information for each candy, was posted as a dataset. I’ve included the dataset below:


Most of the descriptive columns are yes or no flags for certain candy characteristics: whether the candy has chocolate, if it has nuts (“peanutyalmondy”), if it’s hard or soft, if it’s fruity, if it has caramel, if it has a crisped rice wafer (oddly specific), if it has nougat, if it’s in bar form and if it comes in a pack with other candies (“pluribus”). There are also continuous columns for sugar percentile and price percentile compared to the rest of the candies in the dataset, but for the purposes of this post, I’m only going to focus on the flags.

An obvious question to ask of this dataset from a machine learning perspective is: can we predict a candy’s win percentage if we know its features? Asked another way: what features will predict a Halloween candy’s popularity?

Let’s do a bit of visualizing first. Here I’ve plotted average win percentage vs. feature for all 9 yes or no candy descriptors:

Many of the candy features seem to have a noticeable impact on a candy’s win percentage. When we compare the top and bottom candies, we see concrete examples of this.

All the top 10 candies are chocolate, while all the bottom 10 are not chocolate:

Clearly, a candy being chocolate would be a useful feature for predicting a candy’s win percentage. Candies with nuts are also more likely to be in the top 10 than the bottom 10, though not as extreme as chocolate vs. no chocolate:

Taken together, these charts indicate that creating a model using the candy descriptors to predict win percent could be possible. There is clearly a difference between the top performing candies and bottom performing candies that the descriptors are at least partially capturing.

Candy Decision Tree

My first idea when deciding what would be a fun model to use with the candy dataset was a decision tree. Decision trees aren’t usually the best performing models, but they are useful in an illustrative way, providing a kind of roadmap for how they make their predictions.

I think a decision tree in this situation could be thought to function much like a person presented with a Halloween candy, trying to decide how much they want to eat it. During training, the tree will run through the features it’s given – in this case, the candy descriptors. It will decide on one feature for its first decision – say, chocolate. After separating the candies into chocolate or not chocolate, it will then choose two more features and make more splits, until the candies have been separated into some number of groups, each with distinct combinations of descriptors. Each of these groups will then have a win percentage prediction associated with it based on the candies in the training dataset that fall within that group. This number can then be used to assign win percentage predictions to new candies that have the same characteristics as candies in the training dataset.

This process reminds me of when my brother and I would get back from trick or treating, dump all our candies into piles, organize them into categories and trade back and forth. The decision tree is dumping out the candies in the training dataset and organizing them into piles in its own way, and I enjoy that.

Decision tree parameters

There are many parameters that can be tuned in a decision tree, but I’m going to focus on max depth for now. Max depth caps how many levels of splits the tree is allowed to make. If max depth were set to 1, the decision tree would only make one decision. You’d never really do this with a model meant for actual predictions, but it’s interesting to see what a decision tree would pick if it could only make one decision. Let’s try it out, using Python:

import pandas as pd
df = pd.read_csv('candy-data.csv')

y = df['winpercent'] 
X = df[df.columns.drop(['winpercent','competitorname','sugarpercent','pricepercent'])]

from sklearn.model_selection import train_test_split
import numpy as np

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = .5, random_state=130)

from sklearn.tree import DecisionTreeRegressor

rf = DecisionTreeRegressor(random_state=122, max_depth = 1)

rf.fit(X_train, y_train)

In the above code, we divide the candy data into features (X) and target (y), then split them into train and test datasets. Because the dataset was small, I used a 50/50 train/test split. Next we fit a decision tree regressor to the training data. I’ve specified that the decision tree should only split once.

rf.score(X_train, y_train)
0.43092954635899416

The R2 value for our model with 1 split is .431, meaning that about 43.1% of the variance in win percentage in the training data can be explained by the model. At face value that’s not very good, but consider that with a max depth of 1, the model is only using a single feature.

The beautiful thing about decision trees is that there are several built in ways to view how the tree came to its decision. We can generate a chart with the decision tree as follows:

from matplotlib import pyplot as plt
from sklearn import tree

fig = plt.figure(figsize=(20,20))
_ = tree.plot_tree(rf, 
                   feature_names=X.columns,  
                   filled=True)

The tree did indeed choose chocolate as the most useful feature for separating candies into a lower performing and a higher performing group. This visualization is great because it gives so much information, including the mean squared error (mse) for each stage in the decision process. We can also see the predicted value that the model assigns to each group. Candies that aren’t chocolate get a 42% win percentage, and chocolate candies get a 60% win percentage.

Let’s add one more split and see what happens:

rf = DecisionTreeRegressor(random_state=122, max_depth = 2)

We now get one more level of splits in the decision tree, resulting in 4 possible predictions based on candy features. If the candy is chocolate, the model will then look to see if it has nuts (peanutyalmondy). Candies that are both chocolate and have nuts get the highest predicted win percentage of 71%. If the candy isn’t chocolate, the model will look to see if it is fruity. Fruity candies that aren’t chocolate have a higher predicted win percentage than candies that are neither chocolate nor fruity.

We can add another layer to our decision tree:

rf = DecisionTreeRegressor(random_state=122, max_depth = 3) 

At this point it’s a little hard to see the figure, but you get the idea. The model has added one more level of splits to the tree, and we now have 8 separate groups of predicted win percentages based on the candy features. In this model, the candies with the highest predicted win percentages are chocolate, “peanutyalmondy” candies in bar form, and the candies with the lowest predicted win percentages are hard candies that are neither fruity nor chocolate.
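Since the plotted figure is getting crowded, it’s worth noting that scikit-learn can also dump the split structure as plain text. A minimal sketch, assuming the depth-3 tree has been refit to the training data as above:

from sklearn.tree import export_text

#print each split and the predicted win percentage at every leaf
print(export_text(rf, feature_names=list(X.columns)))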

If we score this model on the training data, we see that it explains much more of the variance than our model with only one split:

rf.score(X_train, y_train)
0.6322525569863932

The R2 value for our model with a max depth of 3 is .632, meaning that about 63.2% of the variance in win percentage in the training data can be explained by the model. That’s an improvement of about 20 percentage points over the first model with one split.

We can also check how our model will perform in predicting unseen data:

rf.score(X_test, y_test)
0.5348005970923211

Given the size of the dataset (small), the fact that I was being lazy and not using cross validation, and the tendency of decision trees to overfit, the model fits the training data much better than the test data. However, all things considered, it isn’t so bad. The model was still able to explain more than half of the variance in win percentage for completely unseen candies.
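For the record, cross validation wouldn’t have taken much extra effort. Here’s a minimal sketch with scikit-learn’s cross_val_score, scoring a depth-3 tree by R2 across 5 folds of the full dataset:

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

#averaging R2 across 5 folds gives a less split-dependent estimate of performance
scores = cross_val_score(DecisionTreeRegressor(max_depth=3, random_state=122), X, y, cv=5)
print(scores.mean())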

Happy Halloween!

Matrix of Confusion: Tuning the Logistic Regression Prediction Threshold for Imbalanced Classes (the long way)

I must admit that lately I’ve found myself feeling like the brain trapped in the Matrix of Confusion™ (confusion matrix) pictured above. It’s been quite a busy little month or so. I got back from a wonderful and much needed vacation at the beginning of September and subsequently whiplashed straight into a new semester (and back to good old work). Of course, this isn’t just any run of the mill semester. It’s my second to last (or so) semester, which means I need to buckle down and actually come up with a thesis topic and proposal. Perhaps, if you’re familiar with the contents of this blog, you’ll see that this could be somewhat of a difficult task. I tend to have a lot of cute little ideas and interests related to my hobbies, but it’s hard to translate this into something that could be classified as social science research. I do have a substantial amount of legitimate research experience, but this is in the field of cognitive neuroscience (also not a social science).

Anywho, in times of trouble like this, sometimes it helps to give the old noggin a rest from the big questions and pick something small to figure out to use as a distraction 🙂

At the end of my last post I looked at the confusion matrix for the test predictions made by the logistic regression trained on tweets labeled pro-choice and pro-life. Examining the confusion matrix was much more useful in my opinion than simply looking at numbers for precision, recall, and F1, because it shows exactly how tweets are getting labeled in both classes:

                      Predicted pro-choice    Predicted pro-life
Actual pro-choice     135                     8
Actual pro-life       43                      19
Confusion matrix

The model predicted pro-choice tweets with a precision of 135/178 = 75.8%. It was slightly worse at correctly predicting pro-life tweets, with a precision of 19/27 = 70.4%. Our overall accuracy ends up being (135+19)/(135+19+43+8) = 75.1%.

To get the baseline accuracy to compare this accuracy to, we can take the overall percentage of the most abundant class (pro-choice tweets). This was 72.6%. So, theoretically, if instead of going through the trouble to train a model we had just guessed that all tweets were pro-choice, our accuracy would have been 72.6%. Our overall accuracy using the model was 75.1%, so overall we did better by about 2.5 percentage points 🥲.

                      Predicted pro-choice    Predicted pro-life
Actual pro-choice     143                     0
Actual pro-life       62                      0
Confusion matrix if the model classified all tweets as pro-choice

Ok… that doesn’t sound like we did amazingly… however, how you want to measure success really does depend on your perspective. In the scenario just described, we would be classifying none of the pro-life tweets correctly. Why would we want that?! Compared with this theoretical all-pro-choice classifier, our model gives up a little on the pro-choice side, but in exchange it can actually identify pro-life tweets, with a precision of 70.4% rather than 0%.

There is also the question of recall – the ratio of correctly classified “positive” tweets to total actual positive tweets (true positive rate), which we can look at from the pro-choice or the pro-life side. Our model has very good recall if considering pro-choice tweets to be positives, with a recall of 135/(135+8) = 94.4%. Only 5.6% of true pro-choice tweets were labeled pro-life. On the flip-side, the recall if we consider pro-life tweets to be positive was 19/(19+43)= 30.6%. This means that in fact, about 70% of pro-life tweets were predicted to be pro-choice. Not great.

ROC Curve

ROC curves plot true positive vs. false positive rate across thresholds for binary classifiers. At the end of my last blog post, I mentioned that we could potentially tweak the classification threshold for our model one way or the other to get better predictions for the pro-life group. Plotting an ROC curve can help visualize this. Again, we can look at this from the pro-choice or pro-life perspective (either one can be treated as “positive”).
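Scikit-learn will generate the points for these curves directly from the model’s predicted probabilities. A minimal sketch, assuming the my_model, X_test and y_test objects from the last post are still around (the pos_label bit assumes the pro-life label sorts second in my_model.classes_; adjust to however the labels are actually encoded):

from sklearn.metrics import roc_curve, roc_auc_score

#probability column for the class treated as "positive" (predict_proba columns follow my_model.classes_)
pos_class = my_model.classes_[1]
proba = my_model.predict_proba(X_test)[:, 1]

#one (false positive rate, true positive rate) point per threshold
fpr, tpr, thresholds = roc_curve(y_test, proba, pos_label=pos_class)

#area under the ROC curve
print(roc_auc_score(y_test == pos_class, proba))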

Pro-Life ROC Curve

I’ve chosen to look at thresholds within a few percentage points of 50%. The ROC curve plots the false positive rate (in this case, the percent of real pro-choice tweets labeled pro-life) on the x axis, and the true positive rate (real pro-life tweets labeled pro-life) on the y axis. You can see that the .5 threshold lines up with the recall discussed a couple paragraphs ago. These numbers shift drastically with only a slight threshold change. If we chose .44 for the threshold, we would label over 75% of pro-life tweets correctly, as opposed to the 30.6% we get from the .5 threshold. Of course, this would also increase the number of pro-choice tweets that are falsely labeled as pro-life, up from 5.6% to about 38%.

Pro-Choice ROC Curve

This is giving essentially the same information as the other ROC curve but in reverse. I find it useful to look from both perspectives.

We can use the area under the ROC curve to understand how well the model performs overall:

A thought experiment:

Say the model performed absolutely perfectly and predicted every tweet with 100% confidence. In that case, for any threshold between 0 and 1 there would be no false positives and no false negatives, so every one of those thresholds would be plotted at (0,1); only a threshold of 0, where everything gets labeled positive, would sit at (1,1). This would result in the ROC curve not being a curve at all, but a right angle, covering 100% of the area of the chart.

Say the model performs perfectly terribly, with predictions no better than a coin flip. In this theoretical scenario, at every threshold except 0 and 1 the model would be equally likely to label any tweet pro-choice or pro-life, regardless of its true class. If there were infinite observations in the dataset and the model was perfectly random with its predictions, the ROC curve would just be the line y=x, and would cover 50% of the area of the chart:

So, the more the ROC curve bows up toward the top-left corner, the better the model.

Here’s my model colored in:

I find this kind of thing so useful to fundamentally understand what’s going on. We can see exactly how much better than completely random the model is predicting (darker green), and the corresponding area that the model isn’t predicting (red).

I added two lines at .726. These represent the baseline accuracy – that is, if we had classified every tweet as pro-choice based on the overall percentage of pro-choice tweets in the dataset. We want the % of true pro-life tweets labeled as pro-choice (false positive rate, x axis) to be below this threshold, and % of true pro-choice tweets labeled as pro-choice (true positive rate, y axis) to be above this threshold. We can see that the thresholds that meet these criteria are .46, .47, .48, .49, .5, and barely .51. A reasonable argument could be made for choosing any of these thresholds.

For thoroughness’s sake, here is the shaded ROC curve when considering pro-life to be “positive.” I’ve changed the baseline accuracy threshold lines to use the overall percentage of pro-life tweets in the dataset. You can see that the same thresholds are performing better than baseline again, .51, .5, .49, .48, .47 and .46.

The ROC curve is clearly useful in assessing the ratio of true-positives (recall) to false-positives. What about precision, though?

Another way to look at this is with a precision-recall curve:

Here we are plotting recall on the x axis and precision on the y axis – in this case, we consider pro-choice to be “positive.” This allows us to see the trade-off between the ratio of correctly identified pro-choice tweets to all real pro-choice tweets (recall), and the ratio of correctly identified pro-choice tweets to all tweets labeled pro-choice (precision). We can see that for most thresholds tested, both precision and recall are quite high for pro-choice tweets. Overall, the model does a good job of predicting pro-choice tweets as being pro-choice (recall) and a pretty high percentage of the tweets it classifies as being pro-choice actually are pro-choice (precision).
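The points for a precision-recall curve can be pulled out of the model the same way. A minimal sketch, again reusing my_model, X_test and y_test, treating whichever class sits in the first predict_proba column as positive (if the labels sort alphabetically, that will be the pro-choice class):

from sklearn.metrics import precision_recall_curve

#treat the first class in my_model.classes_ as "positive" and grab its probability column
pos_class = my_model.classes_[0]
proba = my_model.predict_proba(X_test)[:, 0]

#precision and recall at every threshold the model's probabilities produce
precision, recall, thresholds = precision_recall_curve(y_test, proba, pos_label=pos_class)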

However, remember that the data was imbalanced to begin with: 72.6% of the tweets in the dataset are pro-choice. If we were to pick tweets randomly out of a hat, eventually we would get 72.6% pro-choice. This makes the model’s precision in predicting pro-choice tweets somewhat less impressive. For most thresholds, the population of tweets predicted to be pro-choice is only slightly more pro-choice than random.

Let’s look at the precision-recall curve with pro-life as positives:

The precision-recall curve considering pro-life tweets as positives tells a very different story. In this scenario, our baseline precision to beat would be 27.4% (100% – 72.6%), the overall percentage of pro-life tweets in the dataset. We can see that at all thresholds besides 0, our model predicts pro-life tweets better than baseline, and for most thresholds, substantially better. That’s encouraging.

Recall is more of a problem from this perspective. If we want a threshold with more than 50% of actual pro-life tweets being labeled correctly as pro-life, we only have .46, .45, and .44 to choose from. However, if we chose one of these thresholds to improve pro-life recall, pro-life precision would drop, meaning that more actual pro-choice tweets would get labeled pro-life (false positives).

F1 – harmonic mean of precision and recall:

The F1 score is a way to assess precision and recall simultaneously. The formula is as follows:

F1 = 2 * (precision * recall)/(precision + recall)
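For example, plugging in the pro-life precision and recall at the .5 threshold from earlier: F1 = 2 * (.704 * .306)/(.704 + .306) ≈ .43.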

F1 weighs precision and recall equally, with 0 as the worst score and 1 as the best score. You probably noticed that I labeled the precision-recall curves with corresponding F1 scores. Maybe the most logical way to select a threshold in our case would simply be to find the highest F1 score. Again, though, what we consider to be “best” depends on perspective, as the best F1 score from the pro-life perspective is not the best F1 score from the pro-choice perspective.

Conclusions?

What to take away from all this? Well… if I wanted to use my model to capture a wider swath of pro-life tweets, I think I would tweak the threshold to .46. At this threshold, recall increases to 66.1% from 30.6% – so we are classifying more than double the percentage of pro-life tweets correctly compared to before. Unfortunately, pro-life precision at this threshold drops to 56.2% from 70.4%. So just over half of the tweets labeled pro-life would actually be pro-life. That’s not ideal, but in this scenario I’d rather water-down the predicted pro-life tweet population than miss so many true pro-life tweets and have them get lost in the pro-choice predictions.

At .46 from the pro-choice perspective, precision is quite high at 84.1%, so a greater percentage of the tweets labeled as pro-choice are truly pro-choice. The pro-choice recall takes a hit, though, going from the very very good 94.4% to 77.6% – meaning only 77.6% of all pro-choice tweets are labeled correctly. This is only slightly better than baseline.
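Actually applying a tweaked threshold is simple, since the model exposes its probabilities. A minimal sketch, assuming the pro-life label sits in the second predict_proba column as in the ROC code above:

import numpy as np

#label a tweet pro-life whenever its pro-life probability clears the chosen threshold
pro_life_proba = my_model.predict_proba(X_test)[:, 1]
y_pred_46 = np.where(pro_life_proba >= 0.46, my_model.classes_[1], my_model.classes_[0])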

In the end, going with the threshold with the best weighted average F1 score across classes would probably make sense in almost all cases. A boring end to a boring post 🙂

Predicting pro-choice vs. pro-life tweets with NLP: Part 1 (prepping and training the model)

This summer, I decided to enroll in an accelerated, half-semester long Natural Language Processing class. Why not enhance the already miserably hot and humid month of July with an additional 6 hours per week of sitting in a small wooden seat after work, I thought. It was indeed a brutal month that has left me with a literal pain in my neck, probably from hauling two separate computers uptown twice a week and subsequently hunching over them. And here I am, hunching again to record what I learned for posterity.

Complaints aside, I enjoyed the class. It was programming-heavy and taught in Python, which was a welcome change of pace after a full year of classes only taught in R (though, as repeat readers of this blog should know, I am an R lover). I was already familiar with the essential concepts of NLP, having used it for several projects in other classes, outside of class and at work. I was happy to devote a solid chunk of time and attention to going more in-depth with NLP.

Naturally, there was a final group project. Our group decided to analyze tweets surrounding the recent Roe v. Wade ruling. I volunteered to train a model to label the tweets as either pro-choice or pro-life. This post will outline how I did that.

Labeled data from Surge AI

The first thing required to create our model is a dataset with tweets already labeled as pro-choice or pro-life. As luck would have it, a bit of googling turned up exactly such a dataset. Surge AI is a data labeling platform that crowdsources the labeling of data used for ML models. A sample dataset of 1025 tweets labeled pro-choice vs. pro-life was available on their website for free download.

I imported the dataset to python and saved it as a pandas dataframe:

import pandas as pd
roe = pd.read_csv('surge_ai_roe.csv')

The dataset included columns with the tweet text, as well as a “bio” column with text from the user’s bio. Scrolling through the data, it’s apparent that the bio text is useful in identifying whether a person’s tweet is pro-choice or pro-life (take for example row 15 above). I decided to use both the tweet text and bio text as input for the predictive model.

The data was skewed considerably toward pro-choice tweets. 744 (72.6%) of the 1025 tweets in the Surge AI dataset were labeled pro-choice. This was accounted for in the model later on.

Preparing the tweets for use in a model

To use text as a feature in a model, it must eventually be turned into numbers. This can be accomplished through vectorization. When text is vectorized, it is converted from one chunk of words into a series of variables, one for each unique word in the original chunk. Each variable is then assigned a number corresponding to how often that word appears in the text:

Text: I am disappointed I never saw the Minions movie in theaters.

Vectorized text:

To vectorize a corpus containing multiple texts, the same logic applies, but the number of variables grows to include all of the words that occur in all of the texts:

Text 1: He looks for eggs.

Text 2: The eggs are for the cookies.
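To make this concrete, here’s a minimal sketch of what vectorizing those two example texts looks like with scikit-learn’s CountVectorizer (the same CountVectorizer that appears as an option in the vectorizing code further down):

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

texts = ["He looks for eggs.", "The eggs are for the cookies."]
vec = CountVectorizer()

#one row per text, one column per unique word, values are counts of that word in that text
counts = pd.DataFrame(vec.fit_transform(texts).toarray(),
                      columns=vec.get_feature_names_out())  #get_feature_names() on older scikit-learn
print(counts)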

Before our tweets can be vectorized, they must be cleaned. There are several ways a tweet should be cleaned or transformed from its original state in order to be vectorized properly:

  1. Convert to lowercase
  2. Remove numbers and punctuation
  3. Deal with @s and hashtags
  4. Handle Emojis
  5. Remove stop words (maybe)

Though there are a number of steps involved, cleaning the text requires relatively little code.

Convert to lowercase, remove numbers and remove punctuation

To convert to lowercase and remove numbers and punctuation, the following function can be used:

#remove numbers and punctuation, lowercase
def clean_txt(text):
    import re
    #keep only letters and apostrophes, replace everything else with a space, then lowercase
    new_text = re.sub("[^A-Za-z']", " ", text).lower()
    return new_text

The re library is used for regex operations, which I hardly understand, but everything can be googled 🙂

Clean @s and hashtags

To clean the tags and hashtags for the tweets, I simply googled and found a function on stack overflow:

#removes @s and cleans hashtags, found on google
def tweet_clean(text):
    from urllib.parse import urlparse
    new_text = ''
    for i in text.split():
        s, n, p, pa, q, f = urlparse(i)
        if s and n:
            #drop anything that parses as a URL
            pass
        elif i[:1] == '@':
            #drop @mentions entirely
            pass
        elif i[:1] == '#':
            #keep hashtags, minus the # symbol
            new_text = new_text.strip() + ' ' + i[1:]
        else:
            new_text = new_text.strip() + ' ' + i
    return new_text

Convert emojis to text

I wanted to keep the meaning of the emojis rather than completely remove them. Luckily, there is a library that allows you to convert emojis into text:

import emoji
emoji.demojize('🥖🐣🤌🥰')

‘:baguette_bread::hatching_chick::pinched_fingers::smiling_face_with_hearts:’

Remove stop words

The function below can be used to remove English stop words (common words that hold no real meaning, like and, or, the, of, etc.)

#function to remove stopwords
def rem_sw(text):
    import nltk
    from nltk.corpus import stopwords
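    #note: the stopword list has to be downloaded once first, via nltk.download('stopwords')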
    sw = list(set(stopwords.words('english')))
    new_text = [word for word in text.split() if word not in sw]
    return ' '.join(new_text)

I didn’t remove stop words in my final model because it actually performed slightly better when I left them in!

Create new cleaned text columns for use in our model

Using the functions above, I cleaned the tweet text and user bio columns and combined them into a cleaned column called “combined”. I then removed any record with a “combined” length of 0.

roe["clean_text"] = roe.text.apply(tweet_clean).apply(emoji.demojize).apply(clean_txt)

roe["bio"] = roe["bio"].fillna('')

roe["clean_bio"] = roe.bio.apply(emoji.demojize).apply(clean_txt)

roe["combined"] = roe[['clean_text','clean_bio']].apply(lambda x: ' '.join(x), axis=1)

roe["combined_len"] = roe.combined.apply(len)

roe = roe[roe.combined_len > 0]

Now that we have a fully prepared column with the text we want to use in our model, we can vectorize it.

Vectorizing

Below is the code used to vectorize a text column using sklearn’s CountVectorizer or TfidfVectorizer classes.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd

#Use for simple count vectorizer
#my_vec = CountVectorizer(ngram_range=(1, 2)) 

#Use for tf-idf
my_vec = TfidfVectorizer(ngram_range=(1, 2))

transformed = pd.DataFrame(my_vec.fit_transform(list(roe["combined"])).toarray()) 

transformed.columns = my_vec.get_feature_names()  #use get_feature_names_out() on newer scikit-learn versions

I used tf-idf (term frequency–inverse document frequency) vectorization rather than simple count vectorization for my model. Tf-idf is similar to count vectorization, but weighs word scores based on how often they appear in a record vs. the full corpus.

The ngram_range argument lets you specify whether to include single words (unigrams), bigrams, trigrams, etc. in your vectorization. By adding (1, 2) I indicated that the vectorization should create features for each word and each bigram that appears in my corpus. To use only single words you would write (1, 1), only bigrams (2, 2), bigrams through trigrams (2, 3), and all three would be (1, 3).
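As a tiny illustration of what that produces, here’s a sketch of the feature names CountVectorizer generates for one of the example sentences from earlier with ngram_range=(1, 2):

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2)).fit(["He looks for eggs."])

#unigrams and bigrams: eggs, for, for eggs, he, he looks, looks, looks for
print(vec.get_feature_names_out())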

Making the model

There are many types of models that could be used for predicting a binary outcome (in this case pro-choice or pro-life). Perhaps the simplest is logistic regression.

from sklearn.linear_model import LogisticRegression

my_model = LogisticRegression(class_weight= 'balanced')

Next I used the train_test_split function to split my original data into train and test datasets.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
     transformed, roe.label, test_size=0.2, random_state=85)

Now it’s quite simple to fit the model:

my_model.fit(X_train, y_train)

And we can get the predicted classes for the test data (called y_pred):

y_pred = my_model.predict(X_test)

We can also view the probabilities associated with these predictions as well:

my_model.predict_proba(X_test)

Finally, we can look at the precision, recall and F1 score to see how the model performed:

from sklearn.metrics import precision_recall_fscore_support
import pandas as pd

metrics = pd.DataFrame(precision_recall_fscore_support(
    y_test, y_pred, average='weighted'))

metrics.index = ["precision","recall","fscore","na"]

All in all, this is not a terrible showing given that there were only about 800 records to train the model with. These metrics have been calculated using the weighted average parameter in sklearn’s precision_recall_fscore_support function, as there were quite a few more tweets labeled as pro-choice vs. pro-life in the dataset.

We can also take a look at the confusion matrix ourselves to better understand how tweets were labeled:

from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

                      Predicted pro-choice    Predicted pro-life
Actual pro-choice     135                     8
Actual pro-life       43                      19

We see that there are 135 “true negatives” (tweets correctly labeled as pro-choice), 8 “false positives” (tweets incorrectly labeled as pro-life), 43 “false negatives” (tweets incorrectly labeled as pro-choice) and 19 true positives (tweets correctly labeled as pro-life) in our testing sample. This is important information. The model has good precision when it comes to categorizing pro-choice tweets (135/(135+43) = .76) and good precision for pro-life tweets (19/(19+8) = .70). This means the model is not producing a high percentage of false positives in either category.

The model’s recall is another story. Recall assesses the model’s ability to correctly identify the true positives in each category. With recall we ask, of the tweets that are actually pro-choice or pro-life, how many were labeled correctly? For pro-choice, recall is very high (135/(135+8) = .94). This tells us that the model is VERY good at correctly labeling pro-choice tweets. However, recall for the pro-life tweets is bad (19/(19+43) = .31). The model is more likely to think that true pro-life tweets are actually pro-choice tweets.

For the purposes of this project and with the constraint of having relatively little labeled data to train the model on, this doesn’t especially bother me. To fix this, we could potentially change the probability threshold such that more tweets get labeled as pro-life.

In summary

Those are the essential steps I took to wrangle and prep tweets for use in a basic ML model to predict pro-choice vs. pro-life ideology. In the next post, I’ll show how I was able to label and analyze a novel dataset of scraped tweets using the model I trained.

GBBO continued: an interactive Plotly jitter plot

How to chart baker performance across all seasons?

As part of the good old Great British Bake Off (GBBO) project, I wanted to create a chart that would succinctly capture every baker’s performance in every season. First, this necessitated the invention of a “score” feature. On the actual show, no one gets a numerical score. However, on the (very helpful) Wikipedia pages for each season, fans have recorded which bakers were favorites, least favorites, Star Bakers and eliminated each episode (read this post to learn more about how I wrangled the data from Wikipedia tables). This information can be used to create our own numerical representation of baker performance. I decided to give each baker 1 point for “favorite”, -1 point for “least favorite”, and 2 points for being named the Star Baker.

To compare every baker’s performance in every season, I summed all bakers’ scores to get their “final score” when they were eliminated (or when they won). For each season, a dot plot comparing each baker’s final score could look something like this:

Combining all seasons would give you something like this:

Full code here

The dot-plot-specific portion of the code is below:

geom_dotplot(binaxis='y', stackdir='center',
             aes(fill = winflag)
             ,dotsize = .5
             ,alpha = .8
             ,stackgroups = TRUE)

We can see that the points at around 0 Final Score start to run into each other when stacked, and things begin to look a little cluttered. This is where the jitter plot comes in.

Jitter plots

The jitter plot strikes me as a strange being in the world of data visualization. Unlike almost all other plots, the jitter plot adds an element of chaos. It forgoes absolute precision for the sake of readability.

In essence, a jitter plot is just a dot-plot for situations when there are too many dots for stacking. Instead of spacing the dots equally apart from each other without overlap, a jitter plot “jitters” the dots in a random manner, within a given area.

Using geom_jitter() instead of geom_dotplot():

The jitter-plot-specific portion of the code is below:

geom_jitter(height = .1,width = .25,
            aes(color = winflag), 
              alpha =.8,
              size = 4) 

The points here are less precise, but flow better. There is some overlap, but the randomness makes it easier to tell points apart, and a sense of the density of the points is not lost.

The height and width arguments determine how much wiggle room the jitter plot has to work with in the vertical and horizontal directions. Smaller numbers mean a tighter radius and more overlapping; bigger numbers give a wider radius to jitter within.

It is possible to achieve jittering while using geom_dotplot() using position = position_jitter(width = ?, height = ?), though there are other advantages of using geom_jitter(). We will use geom_jitter() from here on out for this post.

One advantage of geom_jitter() is that it’s easy to change dot attributes based on a grouping variable (not so with geom_dotplot(), which has no size or shape aesthetic built into its parameters, only “dotsize”). To highlight winners and runners-up in my jitter plot, I was able to change the code as follows:

Full code here

To change size based on a categorical variable, I added size as an attribute to the geom_jitter aesthetics. I then had to add a scale_size_manual() function to define the size for each category. If you have multiple aesthetics that you want to show up in one legend, it’s important to define the same name (even if it’s blank) for all of the scale functions:

geom_jitter(height = .1,width = .25,
              aes(color = winflag,
                  size = winflag), 
              alpha =.8) +
scale_color_manual(name = "",
                     values = c('Winner' = 'goldenrod',
                                'Runner-up' = '#86dba5',
                                'Eliminated before final' = '#e68a95')) +
scale_size_manual(name = "",
                    values = c('Winner' = 5,
                               'Runner-up' = 4,
                               'Eliminated before final' = 2))

The trouble with labels

I’m sure you noticed that there were helpful data labels in my single-season example. If we were to add the same labels to our jitter plot, we would get this beauty:

Clearly, there are too many observations here to have both intelligible labels and visible points. Wouldn’t it be nice if we could have an interactive plot that would allow a user to choose a point and see more information dynamically? We can!

Plotly

Using Plotly, we can turn our static plot into a dynamic plot that provides much more information upon hover or click:

Charts or apps can be made completely in Plotly, or we can use the ggplotly() function to turn a plot originally made in ggplot into an interactive Plotly visualization.

On a very basic level, all I did to turn our jitter plot into a Plotly plot was save the jitterplot and apply the ggplotly() function. The full code to get the formatted, interactive Plotly visualization is below:

#save fully formatted ggplot jitterplot

p <- ggplot(jitter, aes(season, endsum, group = baker)) +
  geom_jitter(height = .1,width = .25,
              aes(color = winflag,
                  size = winflag,
                  text = paste('Baker:', baker,
                               '<br>Status:', winflag,
                               '<br>Max Episode:', maxep,
                               '<br>Final Score:', endsum)), 
              alpha =.8
  ) +
  scale_x_continuous(limits = c(1.8,12.2), breaks=seq(2,12,by=1)) +
  scale_y_continuous(limits = c(-4,13), breaks=seq(-4,13,by=2)) +
  coord_flip() +
  geom_vline(xintercept=2.5, color = "gray30", linetype = "dashed", size = .5) +
  geom_vline(xintercept = 3.5,color = "gray30", linetype = "dashed", size = .5) +
  geom_vline(xintercept = 4.5,color = "gray30", linetype = "dashed", size = .5) +
  geom_vline(xintercept=5.5,color = "gray30", linetype = "dashed", size = .5) +
  geom_vline(xintercept = 6.5,color = "gray30", linetype = "dashed", size = .5) +
  geom_vline(xintercept=7.5,color = "gray30", linetype = "dashed", size = .5) +
  geom_vline(xintercept = 8.5,color = "gray30", linetype = "dashed", size = .5) +
  geom_vline(xintercept=9.5,color = "gray30", linetype = "dashed", size = .5) +
  geom_vline(xintercept=10.5,color = "gray30", linetype = "dashed", size = .5) +
  geom_vline(xintercept = 11.5,color = "gray30", linetype = "dashed", size = .5) +
  labs(x = "Season", y = "Final Score") +
  theme_minimal() +
  theme(
    panel.grid.major.x = element_blank(),
    panel.grid.major.y = element_blank(),
    axis.text.x = element_text(family = "Arial"),
    text = element_text(size = 14, family = 'Arial')
  ) +
  scale_color_manual(name = "",
                     values = c('Winner' = 'goldenrod',
                                'Runner-up' = '#86dba5',
                                'Eliminated before final' = '#e68a95')) +
  scale_size_manual(name = "",
                    values = c('Winner' = 5,
                               'Runner-up' = 4,
                               'Eliminated before final' = 2)) 

#apply ggplotly() function to our plot, specify the text for the tooltip, remove the toolbar and format the legend

ggplotly(p,tooltip = "text") %>%
  config(displayModeBar = F) %>%
  layout(legend = list(orientation = "v", 
                       xanchor = "center", 
                       x = 1,
                       y=.3,
                       bordercolor = "#edd99f",
                       borderwidth = 2,
                       bgcolor = "#ffdbfa",
                       font = list(
                         family = "Arial",
                         size = 14,
                         color = "#000")))

Only one tweak had to be made to the original ggplot jitter plot code to get the Plotly visualization to work correctly. This was the addition of the “text” aesthetic in the geom_jitter() function:

text = paste('Baker:', baker,
                               '<br>Status:', winflag,
                               '<br>Max Episode:', maxep,
                               '<br>Final Score:', endsum)

You’ll notice that in the ggplotly() function, the tooltip parameter was set to “text”.

ggplotly(p,tooltip = "text") 

This code is crucial in defining what text the user will see when hovering over a point.

The other additions to the ggplotly code are aesthetic. I use config(displayModeBar = F) to remove the toolbar that automatically gets added to Plotly plots (not necessary, I just don’t like how it looks). The layout() function is used to create the custom legend.

And there you have it: making a static plot dynamic was that easy!

ggplot’s geom_tile… not just for heat maps!

If you’ve read my previous blog post, you’ll know that I was able to convince a group of unsuspecting peers (ok, maybe one was suspecting) to create a final project Shiny app all about the Great British Bake Off (GBBO) using data I had previously scraped and wrangled from Wikipedia.

An unexpected challenge that came up while making the app was conceptualizing and creating charts or visualizations that displayed descriptive information about the GBBO. The ability to plot two variables to view a relationship has been drilled so deeply into my head that at this point, it’s second nature. This makes it easy to think of ideas for bar charts or line graphs that compare measures of baker performance. The ability to visually display information that isn’t necessarily trying to prove a point is not drilled into a science major’s head at all. Simple information like each episode’s theme is also important to summarize visually, and counter-intuitively more difficult to plot in a way that adds value.

geom_tile

As someone who works with SQL tables endlessly day after day, I think I naturally gravitate toward organizing and understanding information in a grid. This, to me, is the beauty of geom_tile.

Before making this app, I had only ever used geom_tile to create heat maps. Based on my google searching, this is probably what geom_tile is used for 99% of the time. A standard heat map compares two categorical variables with one on each axis, forming a grid. The squares in the grid are most often colored based on the value of a continuous variable, like temperature, or perhaps something like age:

ggplot(aes(as.factor(episode),as.factor(season.x), fill = avg.age)) +
  geom_tile(color = "white",
            lwd = .5,
            linetype = 1) +
  scale_fill_gradient(low = "pink", high = "brown", name = "Avg. Age") +
  geom_text(aes(label = round(avg.age,0)), color="white", size=rel(4)) +
  xlab("Episode") + 
  ylab("Season") +
  ggtitle("Average Baker Age by Season and Episode") +
  theme_minimal() +
  theme(panel.grid.major = element_blank())

*I made this as a quick example but it’s actually quite interesting. You can see that for many seasons, there is a tendency for average age to decrease as a season progresses – meaning younger bakers make it further in the competition. You can also see that some seasons have older or younger bakers in general (for example season 10 vs. season 5).

Though heat maps are usually colored with a continuous variable, they also work if you make the continuous variable discrete, or use a categorical variable:

colorpal <- colorRampPalette(c('lightsalmon','royalblue'))(5)

ggplot(aes(as.factor(episode),as.factor(season.x), fill = agegroup)) +
  geom_tile(color = "white",
            lwd = .5,
            linetype = 1) +
  geom_text(aes(label = round(avg.age,0)), color="white", size=rel(4)) +
  xlab("Episode") + 
  ylab("Season") +
  ggtitle("Average Baker Age by Season and Episode") +
  theme_minimal() +
  theme(panel.grid.major = element_blank()) +
  scale_fill_manual(values = colorpal, name = "Age Group") 

*This version conveys the same information as the original, but it’s simplified and maybe slightly easier to pick up on the main point of the chart.

Plotting GBBO episode themes using geom_tile

In the Great British Bake Off, each episode in a season has a theme. Some of these themes are repeated season after season, such as Cake, Biscuits, and Bread, while other themes are unique to a season. Unique themes tend to fall into basic types, like countries (Japan, Germany, Denmark etc.) or ingredients (Chocolate, Dairy, Spice etc.). I wanted to find a way to show which themes are repeated across all seasons and which themes are unique, as well as when a theme occurs in each season, in one concise visual.

We will use a dataset that I scraped called weekthemes, which has four fields: season, week, week_theme and category. I coded theme categories myself:

Let’s start with a basic geom_tile with seasons on one axis and episode theme on the other:

weekthemes %>%
  mutate(season = as.factor(season)) %>%
  
  ggplot(aes(season, week_theme)) + 
  geom_tile(size = .5) 

Not very beautiful without formatting, but a start. Notice that the themes are organized alphabetically. That’s not necessarily bad, but it’s also not the most logical way to organize themes. We could perhaps organize them by category, popularity, or something else. I know from watching GBBO that certain themes usually occur around the same time each season – for instance, cake week is almost always the first episode, while (obviously) the final is always last.

Wrangling week themes to order the y axis:

This is where good old data wrangling comes in. I find that it’s easiest to order categorical variables by sorting the variable’s factor levels before putting the data into ggplot.

First, let’s create a variable that sorts themes by their average episode across all seasons they occur:

weekthemes %>%
  group_by(week_theme) %>%
  summarize(avgweek = mean(week)) %>%
  mutate(ranking = rank(avgweek, ties.method = 'first')) %>%

Now that we have a ranking for each theme, we can join this back to the original weekthemes dataset, and sort the week_theme variable by our new ranking variable. Let’s also use geom_text to add a data label to our chart, so that we can see which week each theme occurs in each season:

weekthemes %>%
  group_by(week_theme) %>%
  summarize(avgweek = mean(week)) %>%
  mutate(ranking = rank(avgweek, ties.method = 'first')) %>%
  inner_join(weekthemes) %>%
  mutate(week_theme = factor(week_theme, levels= unique(week_theme[order(desc(ranking))]))) %>%
  mutate(season = as.factor(season)) %>%
  
  ggplot(aes(season, week_theme)) + 
  geom_tile(size = .5) +
  geom_text(aes(label = week), color="white", size=rel(4)) 

This already feels more organized, and the chart now tells a slightly different story. The eye is drawn to the top of the chart, where the viewer can clearly see that Cake week is almost always first. As you navigate down the chart, other themes that tend to occur in the middle of the seasons become more unique and less common. Pâtisserie is almost always second to last and the Final of course last.

Adding another categorical variable:

To break up the large number of themes on the y axis, I want to add another level of organization with an additional category variable. I coded this variable myself for a few reasons. The first reason is that over the course of GBBO, certain themes have morphed into similar themes. Take for instance Pastry; in earlier seasons, there was no Pastry week. Seasons 2 and 3 had a Tarts episode and a Pies episode. In seasons 4 and 5, Tarts and Pies appear to have morphed into “Tarts and Pies,” and a separate Pastry episode was introduced. From season 6 on, only Pastry remains. Though these themes are all slightly different, they all deal with the same basic category of short crust pastries.

The second reason I decided to code a category variable was to organize unusual themes together. As the seasons of GBBO have progressed, certain patterns in themes have arisen. Take for example country themes. There have been several seasons with a week devoted to the baking of one country, though the same country has never been repeated in multiple episodes. The country themes are ideologically related, though not technically the same. I wanted a way to group these themes together.

Let’s add Category as the fill variable to our chart:

mycolors <- colorRampPalette(c('#fa8ca9','#ffdbfa','lightgoldenrod','#cce0ff',"#d4b7a9"))(12)

weekthemes %>%
  group_by(week_theme) %>%
  summarize(avgweek = mean(week)) %>%
  mutate(ranking = rank(avgweek, ties.method = 'first')) %>%
  inner_join(weekthemes) %>%
  mutate(week_theme = factor(week_theme, levels= unique(week_theme[order(desc(ranking))]))) %>%
  mutate(season = as.factor(season)) %>%
  
  ggplot(aes(season, week_theme, fill=category)) + 
  geom_tile(size = .5) +
  geom_text(aes(label = week), color="white", size=rel(4)) +
  scale_fill_manual(values = mycolors, name = "Category") 

Immediately, the addition of the category variable to color the tiles adds another level of structure to the chart. The chart becomes more interactive, as viewers can now choose to examine themes across seasons by category, looking for patterns, similarities or differences.

Ultimately, for this chart, I decided to change the sorting of the themes to go by theme category, to emphasize each category more than individual themes. I accomplished this by changing the ranking variable to group by category rather than theme:

weekthemes %>%
  group_by(category) %>%
  summarize(avgweek = mean(week)) %>%
  mutate(ranking = rank(avgweek, ties.method = 'first')) %>%
  inner_join(weekthemes) %>%
  mutate(week_theme = factor(week_theme, levels= unique(week_theme[order(desc(ranking))]))) %>%
  mutate(season = as.factor(season)) %>%
  
  ggplot(aes(season, week_theme, fill=category)) + 
  geom_tile(size = .5) +
  geom_text(aes(label = week), color="white", size=rel(4)) +
  scale_fill_manual(values = mycolors, name = "Category") 

Sorting the themes by category so that all themes in the same category are next to each other allows viewers to see which themes fit into which category much more easily, while still getting a sense of which themes tend to occur at specific times in a season. I preferred this sorting method, but there is no right or wrong answer here – it would have been equally valid to level the themes as they were before.

Formatting and refining

The only thing left is to refine the formatting of the chart. There is truly no limit here, but for easy understandability I made the following changes to get my final product:

mycolors <- colorRampPalette(c('#fa8ca9','#ffdbfa','lightgoldenrod','#cce0ff',"#d4b7a9"))(12)

weekthemes %>%
  group_by(category) %>%
  summarize(avgweek = mean(week)) %>%
  mutate(ranking = rank(avgweek, ties.method = 'first')) %>%
  inner_join(weekthemes) %>%
  mutate(week_theme = factor(week_theme, levels= unique(week_theme[order(desc(ranking))]))) %>%
  mutate(season = as.factor(season)) %>%
  
  ggplot(aes(season, week_theme, fill=category)) + 
  geom_tile(color = 'gray20', size = .5) +
  scale_fill_manual(values = mycolors, name = "Category") +
  scale_x_discrete(position = "top",
                   labels=c("2" = "S2", "3" = "S3",
                            "4" = "S4", "5" = "S5",   
                            "6" = "S6", "7" = "S7", 
                            "8" = "S8", "9" = "S9", 
                            "10" = "S10", "11" = "S11", 
                            "12" = "S12")) +   
  labs(color = "Category") +
  geom_text(aes(label = week), color="black", size=rel(5)) +
  xlab("") + 
  ylab("") +
  ggtitle("Great British Bake Off Week Themes Season by Season Comparison") +
  theme(panel.grid.major.y = element_line(color = "#edd99f"),
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_line(),
        panel.border = element_rect(fill=NA,color="white", size=0.5, linetype="solid"),
        axis.line = element_blank(),
        axis.ticks = element_blank(),
        panel.background = element_rect(fill="white"),
        plot.background = element_rect(fill="white"),
        legend.text = element_text(size=12),
        legend.title = element_text(size=14),
        title = element_text(size =14),
        axis.text = element_text(color="black", size=14))

It might not be the prettiest possible version, but I was going for complete readability here.

In conclusion

geom_tile is an incredibly versatile tool to plot much more than just heat maps. I had a lot of fun hacking it to create this informative chart, and will certainly use it again for future charting needs!

Code and data used in this post can be found here.

Mapping California wine regions using Shapefiles and tmap

California’s Napa Valley is only about an hour’s drive from Berkeley, where I grew up. Even as a sub-drinking age child, I was often given a good reason to spend a day in or around Napa. Hiking, visiting a friend’s grandparents with a pool, driving my younger sister to artistic rollerskating practice (a story for another time). Add to this a family-wide obsession with the early 2000’s classic Sideways and frequent trips through the Santa Ynez Valley for some classic Solvang aebelskiver, and you get the recipe for a lifelong love of the California Wine Country.

My partner works in the world of fancy beverages, and has recently started cataloguing the wines he’s tasted. There are some good apps available (Vivino) for tracking and reviewing wines, and generating predictions based on taste profiles. However, he has found that Vivino falls short when it comes to geographically tracking specific wines and varietals. He wants to be able to see which regions and subregions he has tried wines from in an interactive map format rather than a list. I agree that this would be more interesting. Though after doing my own research, I can see why Vivino lacks a robust mapping feature. There is very little reliable geographic data freely available for world wine regions online. I think the primary reason for this is probably gatekeeping: the information exists, it’s just locked away in very expensive published books. Who would have guessed that there would be snobbery surrounding data on wine 🙂

The good news is that there is some high quality data available, at least for US wine regions, thanks to the UC Davis library: here. It’s a really cool project – they’ve catalogued viticultural areas for California and other wine producing states in the US in great detail. I decided to start exploring this data by creating my own wine region map for California. I also used this little project as an opportunity to explore some of R’s mapping features with the tmap package.

Get CA Viticultural Area Polygons via Shapefile

The first step in plotting any kind of area on a map is getting the actual data that holds information about its shape. This can come in several forms – raster (pixels, like a JPEG) or polygons (area boundaries stored as sets of coordinate vertices, i.e. vector data, like a PDF). Polygon areas are commonly stored as either GeoJSON files or Shapefiles. Either can be easily read into R using a package called rgdal. The polygons I wanted to use for my map come as either GeoJSON or Shapefiles.

I chose to use the Shapefile to import the polygons into R. When you download a Shapefile, it will come as a folder with several files like this:

Don’t worry about what exactly all these separate files are. All you need to do to import the full Shapefile into R is run a very simple line of code using the readOGR function from the rgdal package:

cawine <- readOGR(dsn = "Desktop/CA_avas_shapefile", layer = "CA_avas")

All I did here was add the path to the Shapefile folder, and specify the layer name, which is the name of all the files within the folder. This will get you a “Large SpatialPolygonsDataFrame” in the object space, which looks like this when you click it:

You’ll notice that the data and polygons object both have 142 items, one for each viticultural area. When you expand the data object, you can see that it is formatted as a data frame with columns that provide information and additional context for the polygons:

If you expand the polygons object, you’ll see that each item contains coordinate information for a specific viticultural area:

Plot the areas with tmap

Now that we have the viticultural area Shapefile loaded, we can begin to plot the wine regions using the tmap package:

library(tmap)

tm_shape(cawine) +
   tm_fill("name",legend.show = FALSE, alpha = .5) +
   tm_borders()

tmap works like ggplot in that plots are built by adding function layers on top of each other. Each plot needs a tm_shape() object, which provides the data for the layers that follow. tmap will send back an error if you call only tm_shape() without at least one layer following it. Here I’ve called tm_shape() using our cawine shapefile, added a fill layer with tm_fill() based on the name of each area, and added a border for each area with tm_borders(). The output looks like this:

We’re getting somewhere, but obviously the areas are floating in space with no background map for context. There are several ways we can fix this, and which one we use depends on the end goal. Perhaps the most straightforward fix is to add an additional Shapefile layer with the shape of California to our plot. This will require finding a Shapefile for the outline of California. I just googled this, clicked on one of the first links, here, and downloaded the state boundary file. I added this Shapefile as a base layer for the viticultural areas like so:

ca <- readOGR(dsn = "Desktop/CA-state-boundary", layer = "CA_State_TIGER2016")

tmap_mode("plot")
  tm_shape(ca) +
    tm_fill("white") +
    tm_borders()  +
  tm_shape(cawine) +
    tm_fill("name",legend.show = FALSE, alpha = .5) +
    tm_borders() +
  tm_layout(bg.color = "#cfe4e8") 

This doesn’t look bad, though there is a bit of an awkward edge around the coastline for some reason – this could potentially be a result of the viticultural area Shapefile being mapped to a different projection than the California Shapefile. To check this, I changed the map projection for both Shapefiles to the Equal Earth projection like so:

tmap_mode("plot")
  tm_shape(ca, projection = 8857) +
    tm_fill("white") +
    tm_borders()  +
  tm_shape(cawine, projection = 8857) +
    tm_fill("name",legend.show = FALSE, alpha = .5) +
    tm_borders() +
    tm_layout(bg.color = "#cfe4e8") 

A fascinating new shape for California! We see that the edge on the coastline is still there, so we can safely assume it isn’t a projection mismatch but part of the Shapefile itself. In fact, I believe it’s there because the California Shapefile extends partially into the ocean territory that belongs to the state.
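Another way to rule out a projection mismatch (not something I did originally, but a quick sanity check) is to compare the coordinate reference systems of the two layers directly with proj4string() from the sp package:

proj4string(ca)
proj4string(cawine)
identical(proj4string(ca), proj4string(cawine))  #TRUE means the two layers share a CRS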

This static map is nice to get a broad overview of the regions, or to build an attractive map that could be printed out, but if we wanted more detailed information or additional context about the viticultural areas, it would be worth our while to construct a more dynamic map.

Make the map interactive

Making a map interactive using tmap is surprisingly simple. To get started, all you need to do is change the mapping mode to “view” using tmap_mode():

tmap_mode("view")
  tm_shape(ca) +
    tm_fill("white") +
    tm_borders()  +
  tm_shape(cawine) +
    tm_fill("name",legend.show = FALSE, alpha = .5) +
    tm_borders() 

Clearly there is some refining to do here, but this is a great start toward building interactivity that took essentially no work. If you click on the layers icon you can toggle between different default Esri base maps, and add or remove the ca and cawine Shapefile layers.
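If you want to hold onto this interactive version or share it, the map can be assigned to an object and written out as a standalone HTML file with tmap_save(). This is an extra step beyond what I did for this post, and the file name is just an example:

wine_map <- tm_shape(cawine) +
  tm_fill("name", legend.show = FALSE, alpha = .5) +
  tm_borders()

tmap_save(wine_map, filename = "ca_wine_regions.html")  #writes an interactive HTML map in view mode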

For the next iteration of our interactive map, I added more information to the popup labels using the popup.vars argument in the fill layer. I also changed the id argument to use the name variable so that the nicely formatted viticultural area name is displayed on hover. Finally, I removed the California Shapefile, as it was no longer needed, and added a custom tm_basemap() layer with Stamen’s TonerLite map:

tmap_mode("view")
  tm_shape(cawine) +
    tm_fill("name",legend.show = FALSE, alpha = .5,
            id="name",
            popup.vars = c("Within" = "within",
                           "Counties" = "county",
                           "Created" = "created")) +
    tm_borders() +
  tm_basemap("Stamen.TonerLite")

Now we have a map that actually provides a good deal of information on the geographic location of all the viticultural areas in the Shapefile dataset. Of course there are infinite ways to iterate on this basic map, but for my purposes, simply visualizing where each region is and getting some basic information displayed is enough, and I’m pleased with the simplicity of the final product. Stay tuned for more wine data exploration in the future!

Code and data used to create the maps in this tutorial are here

Visualizing Winter Olympics success

This semester has been busy! I haven’t had a chance to dive into the side projects I’ve been brewing but I’ve really been enjoying the data visualization class I’m taking with Thomas Brambor. Our first assignment was to build some visualizations using historical Winter Olympics data going up to 2014, using ggplot and Plotly. I had a lot of fun doing this assignment and discovered some very cool tips and tricks in the process of perfecting my charts. If you’re viewing this on a phone – sorry, the Plotly charts do not take kindly to that. The full assignment with code is below:

Visualizing gender-neutral baby names with ggplot and Plotly

I’m finally taking a much-anticipated (by me) class for my MA program called “Data Visualization.” An optional exercise was to play around with a dataset of baby names from US census data. I had some fun creating this interactive chart of the most popular gender-neutral baby names over time.

Names included in this chart must have been in the top 10% of all names for a given year, with a boy:girl or girl:boy ratio of no more than 100:1.

The design of this chart exposes patterns in the predominant sex of a given name over time. Interestingly, it looks like a majority of these popular names shift from being given mostly to boys toward being given mostly to girls over time. There are many more fascinating insights to find!

The code I wrote to generate this chart is below:

library(babynames)
library(ggplot2)
library(plotly)     #for highlight_key(), ggplotly() and highlight()
library(magrittr)
library(dplyr)
library(RColorBrewer)
library(colorways2) #my color package
library(ggthemes)

f <- babynames %>% filter(sex=="F")
m <- babynames %>% filter(sex=="M")

unisex1 <- merge(f,m ,by=c("name","year"),all = TRUE)

base1 <- unisex1 %>%
  group_by(year) %>%
  mutate(overall=n.x+n.y) %>%
  mutate(ratio = n.y/n.x) %>% #boy count over girl count
  arrange(desc(ratio)) %>%
  mutate(logratio = log10(ratio)) %>% #log10 so that -2 to 2 matches the 1:100 to 100:1 axis labels
  mutate(overallcentile = ntile(overall,10)) %>%
  filter(tolower(name) != "unknown") %>%
  filter(tolower(name) != "infant") %>%
  filter(tolower(name) != "baby") %>%
  filter(overallcentile >=  10) %>%
  filter(abs(logratio) <= 2) 

d <- highlight_key(base1, ~name)

#had to make a new palette out of an existing one with 74 colors, one for each name
nb.cols <- 74
mycolors <- colorRampPalette(ballpit)(nb.cols)

p <- ggplot(d, aes(year, logratio, col= name)) + 
  geom_hline(yintercept=0, linetype="dashed", color = "black") +
  geom_line() + 
  theme_tufte() +
  geom_point() + 
  scale_y_continuous(labels = c("1:100", "1:10", "1:1","10:1","100:1")) +
  labs(title="Gender Distribution of Most Popular Gender-Neutral Names Over Time", x ="", y = "Boy:Girl ratio (log scale)") +
  theme( text=element_text(family="Helvetica",size = 14),plot.title = element_text(size = 14),axis.text = element_text(size = 12), axis.title = element_text(size = 14))+
  scale_x_continuous(breaks = seq(1880, 2020, by = 10)) +
  scale_color_manual(values = mycolors) 
  
gg <- ggplotly(p)   
  
highlight(gg, dynamic = F, color = "black",selected = attrs_selected(showlegend = FALSE)) %>% 
 layout(margin = list(b = 40)) %>%
 layout(legend=list(title=list(text='')))

Designing functions to generate and display color palettes in R

In this post, I’ll go over how I designed the color palettes and the display functions that make up the colorways package referred to in my previous blog post.

Identifying the Problem

When it comes to color, I definitely believe in being adventurous and having many options based on mood and context. I think visualizations are best when they take an artistic and sometimes unexpected approach to color. Though blending in is sometimes necessary (for publication, or serious work-related presentations, maybe), when I work on projects for myself, I like to push the envelope a bit. It’s an aspect of data analysis that I truly enjoy.

I’ve found that palettes from RColorBrewer or default palettes work fine for certain situations, like heat maps that need a spectrum of color. But where they break down for me is in the fairly common occurrence of graphing several distinct subgroups. This calls for palettes with colors that are different enough from each other that no two subgroups are perceived as more similar than the others. RColorBrewer has a few “qualitative palettes” for this purpose, but I find them to be a bit generic.
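For reference, pulling one of RColorBrewer’s qualitative palettes looks like this ("Set2" is just one example of the handful available):

library(RColorBrewer)
brewer.pal(n = 8, name = "Set2")   #returns 8 hex codes from the qualitative "Set2" palette
display.brewer.all(type = "qual")  #preview all of the qualitative palettes at once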

Of course, RColorBrewer isn’t the only package for color palettes out there. This website has compiled a ton of color palette packages in one place, with a handy tool to select palettes. The palettes can then be retrieved in R using the Paletteer package, which is incredibly useful for accessing color palettes from many sources in a standardized way. Now that I found it, I will certainly be saving this resource and using it in the future. Some examples of favorites I’ve found from this site:
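For instance, one of these favorites can be pulled into R with a single paletteer call (IsleofDogs2 is a palette I use again later in this post):

library(paletteer)
paletteer_d("wesanderson::IsleofDogs2")  #returns the palette as a vector of hex colors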

Still, the current crop of color packages lacks some features that, in my opinion, are essential to choosing a palette. As far as I know, none of them come with easy-to-use, built-in color shuffling, and none come with dynamic color previews in charts or graphs.

So, I set out to create my own package with three distinct functionalities:

  1. Save my own palettes
  2. Display any palette (either native or from another package) in a variety of forms (basic palette, charts, graphs)
  3. Shuffle color palettes if desired and save the newly ordered list of colors

Saving my own palettes

This is just as easy as choosing colors and putting them in lists. The fun part of course is naming the palettes based on my own whimsy. Some examples:

krampus <- c("#0782A6","#A66507","#994846","#CDD845","#624FAF","#52735D","#BBAE92","#FED2E7","#FFE402")
ballpit <- c("#5C33FF","#FF8E07","#E2E442","#42E44F","#C67CF9","#F64EBC","#ACF64E" ,"#C11736","#00B6A0")
donut <- c("#FA88F1","#2DEC93","#8FE2FF","#FF882B","#D80D0D","#D0A321","#369830","#B681FF","#858585")

I decided to go with 9 colors in each palette for uniformity and to maximize novel color combinations when shuffling. I ordered each palette intentionally so that my favorite color combinations come at the beginning; that way, when the palettes aren’t shuffled, the best combinations are preselected.


Paletteprint: view all palettes

The first thing I wanted to do was write a function that would display all the palettes at once. To do this, I had to first create a list of palettes and their names:

palettes <- list(dino, hive, rumpus, taffy, sleuth, martian, krampus, tulip, donut, donette, creme, farmhand, mayhem, ballpit, january, pair1)

names(palettes) <- c("dino", "hive", "rumpus", "taffy", "sleuth", "martian", "krampus", "tulip", "donut", "donette", "creme", "farmhand", "mayhem","ballpit","january","pair1")

Then I wrote a function using rectangles in base R graphing to make a chart:

paletteprint <- function() {
  bordercolor <- "black"
  x <- c(-8,27)                    #extra space on the left for the palette names
  y <- c(0,(length(palettes)*3))
  i <- (length(palettes)*3)        #top edge of the current row, starting from the top
  n <- 1                           #index into the palette names
  plot(1, type="n", xlab="", ylab="", xlim=x, ylim=y,axes=FALSE, frame.plot=FALSE)
  for (p in palettes) {
    rect(0,i,3,i-1, col = p[1], border = bordercolor, lwd = 2)
    rect(3,i,6,i-1, col = p[2], border = bordercolor, lwd = 2)
    rect(6,i,9,i-1, col = p[3], border = bordercolor, lwd = 2)
    rect(9,i,12,i-1, col = p[4], border = bordercolor, lwd = 2)
    rect(12,i,15,i-1, col = p[5], border = bordercolor, lwd = 2)
    rect(15,i,18,i-1, col = p[6], border = bordercolor, lwd = 2)
    rect(18,i,21,i-1, col = p[7], border = bordercolor, lwd = 2)
    rect(21,i,24,i-1, col = p[8], border = bordercolor, lwd = 2)
    rect(24,i,27,i-1, col = p[9], border = bordercolor, lwd = 2)
    text(x = -8, y = i-.5, # Coordinates
         label = names(palettes[n]), pos =4)
    i = i-3
    n= n+1
  }
}

The output when calling paletteprint() looks like this:


Colordisplay: view, shuffle and choose the number of colors in a palette

Next I wanted a function that would display the colors in a selected palette in a straightforward way, let me choose how many colors to see, shuffle the colors if desired, and return a list of the colors displayed.

colordisplay <- function(palette, number = 9, bordercolor = "black", shuffle = "no") {

  if (shuffle == "yes"){
    shuff <- sample(seq(from = 1, to = length(palette), by = 1), size = length(palette), replace = FALSE)
  }

  else {
    shuff <- seq(1, length(palette), by=1)
  }

  if (number == 9) {
    names = c(palette[shuff[1]], palette[shuff[2]],palette[shuff[3]],palette[shuff[4]],palette[shuff[5]],palette[shuff[6]],palette[shuff[7]],palette[shuff[8]],palette[shuff[9]])
    title <- paste(names, collapse = ", ")
    x <- c(0,3)
    y <- c(7,10)
    plot(1, type="n", xlab="", ylab="", xlim=x, ylim=y,axes=FALSE, main = title, frame.plot=FALSE)
    rect(0,10,1,9, col = palette[shuff[1]], border = bordercolor, lwd = 4)
    rect(1,10,2,9, col = palette[shuff[2]], border = bordercolor, lwd = 4)
    rect(2,10,3,9, col = palette[shuff[3]], border = bordercolor, lwd = 4)
    rect(0,9,1,8, col = palette[shuff[4]], border = bordercolor, lwd = 4)
    rect(1,9,2,8, col = palette[shuff[5]], border = bordercolor, lwd = 4)
    rect(2,9,3,8, col = palette[shuff[6]], border = bordercolor, lwd = 4)
    rect(0,8,1,7, col = palette[shuff[7]], border = bordercolor, lwd = 4)
    rect(1,8,2,7, col = palette[shuff[8]], border = bordercolor, lwd = 4)
    rect(2,8,3,7, col = palette[shuff[9]], border = bordercolor, lwd = 4)

    return(title)
  }

  else if (number == 8) {
    names = c(palette[shuff[1]], palette[shuff[2]],palette[shuff[3]],palette[shuff[4]],palette[shuff[5]],palette[shuff[6]],palette[shuff[7]],palette[shuff[8]])
    title <- paste(names, collapse = ", ")
    x <- c(0,4)
    y <- c(8,10)
    plot(1, type="n", xlab="", ylab="", xlim=x, ylim=y,axes=FALSE, main=title, frame.plot=FALSE)
    rect(0,10,1,9, col = palette[shuff[1]], border = bordercolor, lwd = 4)
    rect(1,10,2,9, col = palette[shuff[2]], border = bordercolor, lwd = 4)
    rect(2,10,3,9, col = palette[shuff[3]], border = bordercolor, lwd = 4)
    rect(3,10,4,9, col = palette[shuff[4]], border = bordercolor, lwd = 4)
    rect(0,9,1,8, col = palette[shuff[5]], border = bordercolor, lwd = 4)
    rect(1,9,2,8, col = palette[shuff[6]], border = bordercolor, lwd = 4)
    rect(2,9,3,8, col = palette[shuff[7]], border = bordercolor, lwd = 4)
    rect(3,9,4,8, col = palette[shuff[8]], border = bordercolor, lwd = 4)

    return(title)
  }

# and so on until number == 2

Ok, yes, there might have been a better way to set this up as a loop rather than writing a new if statement for every number of colors, BUT I wanted control over how the display looks for each number of colors chosen, so honestly I don’t think it’s really that needlessly wordy.
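For the curious, here’s a rough sketch of how the 9-color layout could be generated with a loop instead. This is just an illustration of the idea, not code from the package, and it gives up the per-count layout control mentioned above:

#loop-based sketch of the 3 x 3 layout used when number == 9
colordisplay_sketch <- function(palette, bordercolor = "black", shuffle = "no") {
  shuff <- if (shuffle == "yes") sample(seq_along(palette)) else seq_along(palette)
  cols <- palette[shuff[1:9]]
  title <- paste(cols, collapse = ", ")
  plot(1, type="n", xlab="", ylab="", xlim=c(0,3), ylim=c(7,10), axes=FALSE, main = title, frame.plot=FALSE)
  for (k in 1:9) {
    xpos <- (k - 1) %% 3        #column: 0, 1 or 2
    ytop <- 10 - (k - 1) %/% 3  #top edge of the row: 10, 9 or 8
    rect(xpos, ytop, xpos + 1, ytop - 1, col = cols[k], border = bordercolor, lwd = 4)
  }
  return(title)
}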

If I call colordisplay, choose my palette and use all default parameters, the result will look like this (fully zoomed out):

colordisplay(january)

I will also get this as output:

[1] "#F1F07C, #BB7EEE, #98EAC8, #65859C, #8C2438, #ADA99D, #AD840C, #398726, #EC5570"

If I call colordisplay with shuffle on, I get a randomly shuffled output with the corresponding list of colors:

colordisplay(january, shuffle = "yes")
[1] "#AD840C, #98EAC8, #8C2438, #EC5570, #F1F07C, #65859C, #BB7EEE, #398726, #ADA99

Changing the number parameter between 2-8 colors will result in the following shapes:

You can also choose to display a palette from another package, or using paletteer:

colordisplay(paletteer_d("nationalparkcolors::DeathValley"), number = 6, shuffle = "yes")
[1] "#E79498FF, #73652DFF, #F7E790FF, #514289FF, #B23539FF, #FAB57CFF"

Colorbar: view a palette as a bar chart

I wanted to write a function that would allow me to input a palette, the number of colors I want from that palette, and whether I want the colors to be shuffled, and output a sample bar chart. This is the main functionality that I feel current color packages lack. Of course, you can keep your own code for a sample bar chart and test colors manually, but this function has made it so much easier to quickly assess whether a set of colors will work for a visualization. I also think it’s a ton of fun to shuffle the colors again and again and see what comes up!

The code behind this function is pretty similar to what I wrote for colordisplay. Full code for all functions can be found in this github repo.
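Since the full code lives in the repo, here’s just a rough approximation of what a colorbar-style function could look like, using made-up bar heights (this is my sketch, not the package’s actual code):

#rough approximation of a colorbar-style preview, not the actual package code
colorbar_sketch <- function(palette, number = 5, shuffle = "no", stacked = "no") {
  shuff <- if (shuffle == "yes") sample(seq_along(palette)) else seq_along(palette)
  cols <- palette[shuff[1:number]]
  dummy <- matrix(sample(5:15, number * 3, replace = TRUE), nrow = number)  #one row per color, 3 groups
  barplot(dummy, col = cols, beside = (stacked != "yes"),
          names.arg = paste("Group", 1:3), border = "white")
  paste(cols, collapse = ", ")  #return the ordered list of colors, like colordisplay
}

colorbar_sketch(tulip, number = 7, shuffle = "yes")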

Let’s try colorbar with the tulip palette and default parameters:

colorbar(palette = tulip)

We can add up to 7 colors in the bar chart:

colorbar(palette = tulip, number = 7)

We can choose to stack the bars:

colorbar(palette = tulip, number = 7, stacked = "yes")

We can shuffle the colors:

colorbar(palette = tulip, number = 7, stacked = "yes", shuffle = "yes")

Like colordisplay, colorbar will always output the ordered list of colors for each chart:

[1] "#D7DDDE, #A25E5F, #FFC27E, #FFFBC7, #9B8054, #E0E38C, #E0C2F6"

Lastly, we can use palettes from other packages:

colorbar(palette = paletteer_d("wesanderson::IsleofDogs2"), number = 4)

Colorscatter: view a palette as a scatter plot

Colorscatter is virtually the same function as colorbar, but instead of a bar chart, it displays a scatter plot. The only difference in parameters between the two is that colorscatter lacks the “stacked” option, for obvious reasons.

Let’s try the default colorscatter function using the “ballpit” palette:

colorscatter(palette = ballpit)

Like colorbar, we can view up to 7 colors using colorscatter:

colorscatter(palette = ballpit, number = 7)

We can also shuffle colors:

colorscatter(palette = ballpit, number = 5, shuffle = "yes")

Colorscatter will also return an ordered list of colors each time it is run:

[1] "#C11736, #E2E442, #5C33FF, #00B6A0, #ACF64E"

And of course like colorbar, colorscatter will accept external palettes:

colorscatter(palette = paletteer_d("tvthemes::simpsons"), number = 5, shuffle = "yes")

In conclusion:

I’ve certainly only scratched the surface of what’s out there and what’s possible in the world of R color packages. I’m happy with the package I’ve put together, but I can already think of some improvements and additional functionalities I’d like to add (for example, a parameter for switching the charts to dark backgrounds). For now I think this is a great start and I’m excited to do more research and make more updates!

For full code, visit the colorways package GitHub repo here

Creating a local R package using RStudio

If there’s one thing I love about R, it’s visualizations. And if there’s one thing I love about visualizations, it’s getting to choose the colors. But if there’s one thing I hate about getting to choose the colors, it’s forgetting which colors I like and having to look up color names or hex codes and saving new lists of colors of varying lengths each time I want to generate a new chart or graph. I’ve suspected for a while that it would pay to make a package to save color combinations, but it always felt like too much of a chore. Then I got Omicron.

I was faced with a week’s worth of abyss, quarantined in a single room in my family’s home after flying to California for the holidays. After a healthy amount of grimacing, I decided enough was enough, I would do something with the meager tools at my disposal and finally take on the task of creating a package to save and display color combinations. I spent a few days putting palettes together and writing functions to display, shuffle, and return the color codes based on user input. I’ll write another post just about the contents of the package, because I like it and think it’s a lot of fun.

For now I want to document how I was able to convert functions and data into a package that can be called at any time. This is definitely not the only way to save code as a package, but this is the easiest way to do it that worked for me. Because I already made the color package, I will make a new package for the sake of this post called michiganjfrog.

Step 1: Open a new project in RStudio

In RStudio, navigate to File -> New Project… you will then be prompted to save the current workspace, and do so if you like. The project wizard will pop up. Choose New Directory -> Package. You’ll get a form that looks like this:

Here you will pick the package name, where the package will be saved, and whether to create a git repository. My package uses only base R so I didn’t worry about checking “Use renv with this project,” but for future reference, checking that box could be useful for dependency management. I checked “Create a git repository” because I wanted to track changes to my package on GitHub.

Once the package is created, you’ll get something that looks like this:

The wizard already comes with a sample R script with a function called “hello.”

Step 2: Save a function

To save more functions, select File -> New File -> R Script and write your function. It is generally best practice to save functions in their own R scripts, or to save related functions together.

When you save your R script, RStudio will automatically suggest you save it to the “R” folder within your package. That is exactly where it should be saved. Don’t change that. Name your file either the same name as your function or a short, easy to remember name for your related functions.
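For example, a (completely made-up) function saved as R/croak.R inside michiganjfrog might look like this:

#contents of R/croak.R, a toy function for the michiganjfrog package
croak <- function(times = 3) {
  #repeat "ribbit" the requested number of times and return it as a single string
  paste(rep("ribbit", times), collapse = " ")
}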

Step 3: Build your package with the new function

This step doesn’t necessarily have to happen right after you save your first function, but I think it’s useful down the road to rebuild your package as often as possible, and especially before creating the documentation file for your function, for reasons that will become clear in step 4. To build (basically, save) your package, navigate to the top right corner of RStudio, select the “Build” tab, then click “Install and Restart.”

Step 4: Document your function

Go to File -> New File -> R Documentation… A dialog box will pop up that looks like this:

If you’ve already built the package with a function by the same name as the “Topic name,” RStudio will recognize this and automatically fill in some of the fields in the documentation template when you press OK:

You can fill in the other fields with any relevant information, then press “preview” to view the html rendering of the documentation file:

To save the documentation, follow step 3 to rebuild the package.

Step 5: Add data

Adding data such as lists or tables to a package is relatively easy. Just open a new script, define your data and use the use_data() function from the usethis package to save the data within the package. It should look like this:

library(usethis) #use_data() comes from the usethis package

frogfacts <- c("green","yellow","top hat","shy","eternal","ragtime")

use_data(frogfacts)

Make sure you run the code and rebuild the package to save your changes.

Step 6: Connect to GitHub

After creating a new directory in GitHub, open a terminal window and type the following:

cd desktop/michiganjfrog   # or wherever the package is saved
git remote add origin https://github.com/szaidman22/michiganjfrog.git
git branch -M main
git add -A
git commit -m "adding package contents"
git push --set-upstream origin main

This will connect the package directory to the GitHub directory and add everything to the main branch. To push and commit changes from RStudio, navigate to the “Git” tab at the top right of the RStudio window:

Check any of the documents/directories within the package with changes that you want to push to GitHub, then press the “Commit” button, close the dialog box that pops up, and press the “Push” button:

To check that the changes were pushed properly, go to the GitHub directory and look for commits.

Step 7: Check the package

There are many ways to check that the package does what it’s supposed to, but in my opinion the easiest is just to open a new project and try to use it. Because the package already lives installed on your own machine, there’s nothing to install again. All you have to do is load the package from the library:
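Loading and poking at the package from a fresh session looks something like this (croak() is the hypothetical example function from step 2; output omitted):

library(michiganjfrog)

frogfacts        #the data object saved with use_data()
croak(times = 2) #the toy example function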

It looks like the package and descriptions work! Stay tuned to hear more about the color palette package I made last month.