This summer, I decided to enroll in an accelerated, half-semester-long Natural Language Processing class. Why not enhance the already miserably hot and humid month of July with an additional 6 hours per week of sitting in a small wooden seat after work, I thought. It was indeed a brutal month that has left me with a literal pain in my neck, probably from hauling two separate computers uptown twice a week and subsequently hunching over them. And here I am, hunching again to record what I learned for posterity.
Complaints aside, I enjoyed the class. It was programming-heavy and taught in Python, which was a welcome change of pace after a full year of classes taught only in R (though, as repeat readers of this blog should know, I am an R lover). I was already familiar with the essential concepts of NLP, having used it for several projects in other classes, on my own, and at work. I was happy to devote a solid chunk of time and attention to going more in-depth with NLP.
Naturally, there was a final group project. Our group decided to analyze tweets surrounding the recent Roe v. Wade ruling. I volunteered to train a model to label the tweets as either pro-choice or pro-life. This post will outline how I did that.
Labeled data from Surge AI
The first thing required to create our model is a dataset of tweets already labeled as pro-choice or pro-life. As luck would have it, such a dataset turned up when I googled around. Surge AI is a data labeling platform that crowdsources the labeling of data used for ML models. A sample dataset of 1025 tweets labeled pro-choice vs. pro-life was available on their website for free download.
I imported the dataset to Python and saved it as a pandas dataframe:
import pandas as pd

roe = pd.read_csv('surge_ai_roe.csv')
The dataset included columns with the tweet text, as well as a “bio” column with text from the user’s bio. Scrolling through the data, it’s apparent that the bio text can be useful in identifying whether a person’s tweet is pro-choice or pro-life; some bios state the user’s stance outright. I decided to use both the tweet text and bio text as input for the predictive model.
The data was skewed considerably toward pro-choice tweets: 744 (72.6%) of the 1025 tweets in the Surge AI dataset were labeled pro-choice. This imbalance is accounted for in the model later on.
Preparing the tweets for use in a model
To use text as a feature in a model, it must eventually be turned into numbers. This can be accomplished through vectorization. When text is vectorized, it is converted from one chunk of words into a series of variables, one for each unique word in the original chunk. Each variable is then assigned a number corresponding to how many times its word appears in that text:
Text: I am disappointed I never saw the Minions movie in theaters.
Vectorized text:

| i | am | disappointed | never | saw | the | minions | movie | in | theaters |
|---|----|--------------|-------|-----|-----|---------|-------|----|----------|
| 2 | 1  | 1            | 1     | 1   | 1   | 1       | 1     | 1  | 1        |
To vectorize a corpus containing multiple texts, the same logic applies, but the number of variables grows to include every word that occurs in any of the texts:
Text 1: He looks for eggs.
Text 2: The eggs are for the cookies.

|        | he | looks | for | eggs | the | are | cookies |
|--------|----|-------|-----|------|-----|-----|---------|
| Text 1 | 1  | 1     | 1   | 1    | 0   | 0   | 0       |
| Text 2 | 0  | 0     | 1   | 1    | 2   | 1   | 1       |
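To make this concrete, here is a minimal sketch of the same idea using sklearn's CountVectorizer on the two example texts (the vectorizer sorts its vocabulary alphabetically; on sklearn versions before 1.0, the last call would be get_feature_names() instead):

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

texts = ["He looks for eggs.", "The eggs are for the cookies."]

vec = CountVectorizer()
counts = pd.DataFrame(vec.fit_transform(texts).toarray(),
                      columns=vec.get_feature_names_out())
print(counts)
#    are  cookies  eggs  for  he  looks  the
# 0    0        0     1    1   1      1    0
# 1    1        1     1    1   0      0    2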
Before our tweets can be vectorized, they must be cleaned. There are several ways a tweet should be cleaned or transformed from its original state in order to be vectorized properly:
- Convert to lowercase
- Remove numbers and punctuation
- Deal with @s and hashtags
- Handle emojis
- Remove stop words (maybe)
Though there are a number of steps involved, cleaning the text requires relatively little code.
Convert to lowercase, remove numbers and remove punctuation
To convert to lowercase and remove numbers and punctuation, the following function can be used:
#remove numbers and punctuation, lowercase
import re

def clean_txt(text):
    #replace anything that isn't a letter or apostrophe with a space
    new_text = re.sub("[^A-Za-z']", " ", text).lower()
    return new_text
The re library is used for regex operations, which I hardly understand, but everything can be googled 🙂 Here, the pattern [^A-Za-z'] matches any character that isn’t a letter or an apostrophe and swaps it for a space.
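For example, on a made-up string:

clean_txt("I read 3 articles!!")
# returns 'i read   articles  ' -- the leftover extra spaces are harmless,
# since the vectorizer later splits on whitespace anyway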
Clean @s and hashtags
To clean the @s and hashtags out of the tweets (along with any URLs), I simply googled and found a function on Stack Overflow:
#removes URLs and @s, cleans hashtags; found on google
from urllib.parse import urlparse

def tweet_clean(text):
    new_text = ''
    for i in text.split():
        s, n, p, pa, q, f = urlparse(i)
        if s and n:
            #the token parses as a URL (it has a scheme and a domain): drop it
            pass
        elif i[:1] == '@':
            #drop @mentions
            pass
        elif i[:1] == '#':
            #keep the hashtag text, minus the #
            new_text = new_text.strip() + ' ' + i[1:]
        else:
            new_text = new_text.strip() + ' ' + i
    return new_text.strip()
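Here is what that does to a hypothetical tweet (the handle, URL and hashtag are invented for illustration):

tweet_clean("@user check https://t.co/abc #RoeVWade now")
# returns 'check RoeVWade now'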
Convert emojis to text
I wanted to keep the meaning of the emojis rather than remove them entirely. Luckily, the emoji library lets you convert emojis into text:
import emoji
emoji.demojize('🥖🐣🤌🥰')
# ':baguette_bread::hatching_chick::pinched_fingers::smiling_face_with_hearts:'
Remove stop words
The function below can be used to remove English stop words (common words that carry little meaning on their own, like and, or, the, of, etc.):
#function to remove stopwords
import nltk
from nltk.corpus import stopwords
#nltk.download('stopwords')  #run once if the stopword list isn't installed

def rem_sw(text):
    sw = set(stopwords.words('english'))
    new_text = [word for word in text.split() if word not in sw]
    return ' '.join(new_text)
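For example, on a made-up sentence (assuming the NLTK stopword list has been downloaded):

rem_sw("the model is on the way")
# returns 'model way'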
I didn’t remove stop words in my final model because it actually performed slightly better when I left them in!
Create new cleaned text columns for use in our model
Using the functions above, I cleaned the tweet text and user bio columns and combined them into a cleaned column called “combined”. I then removed any record with a “combined” length of 0.
roe["clean_text"] = roe.text.apply(tweet_clean).apply(emoji.demojize).apply(clean_txt)
roe["bio"] = roe["bio"].fillna('')
roe["clean_bio"] = roe.bio.apply(emoji.demojize).apply(clean_txt)
roe["combined"] = roe[['clean_text','clean_bio']].apply(lambda x: ' '.join(x), axis=1)
roe["combined_len"] = roe.combined.apply(len)
roe = roe[roe.combined_len > 0]
Now that we have a fully prepared column with the text we want to use in our model, we can vectorize it.
Vectorizing
Below is the code used to vectorize a text column using sklearn’s CountVectorizer or TfidfVectorizer classes.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd

#Use for simple count vectorizer
#my_vec = CountVectorizer(ngram_range=(1, 2))

#Use for tf-idf
my_vec = TfidfVectorizer(ngram_range=(1, 2))

transformed = pd.DataFrame(my_vec.fit_transform(list(roe["combined"])).toarray())
transformed.columns = my_vec.get_feature_names_out()  #use get_feature_names() on older sklearn
I used tf-idf (term frequency–inverse document frequency) vectorization rather than simple count vectorization for my model. Tf-idf is similar to count vectorization, but it weights each word’s score by how often the word appears in a given record relative to the full corpus, so words that show up everywhere count for less.
The ngram_range argument lets you specify whether to include single words (unigrams), bigrams, trigrams, etc. in your vectorization. By passing (1, 2), I indicated that the vectorization should create features for each word and each bigram that appears in my corpus. To use only single words you would write (1, 1); only bigrams, (2, 2); bigrams through trigrams, (2, 3); and all three, (1, 3).
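As a quick illustration of what ngram_range=(1, 2) produces, here is a sketch on a single made-up document; the feature names include every unigram and bigram:

from sklearn.feature_extraction.text import TfidfVectorizer

demo = TfidfVectorizer(ngram_range=(1, 2))
demo.fit(["my body my choice"])
print(demo.get_feature_names_out())
# ['body' 'body my' 'choice' 'my' 'my body' 'my choice']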
Making the model
There are many types of models that could be used for predicting a binary outcome (in this case pro-choice or pro-life). Perhaps the simplest is logistic regression.
from sklearn.linear_model import LogisticRegression

#class_weight='balanced' compensates for the skew toward pro-choice tweets
my_model = LogisticRegression(class_weight='balanced')
Next I used the train_test_split function to split my original data into train and test datasets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
transformed, roe.label, test_size=0.2, random_state=85)
Now it’s quite simple to fit the model:
my_model.fit(X_train, y_train)
And we can get the predicted classes for the test data (called y_pred):
y_pred = my_model.predict(X_test)
We can also view the probabilities associated with these predictions:
my_model.predict_proba(X_test)
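Each row of predict_proba holds one probability per class, in the order given by my_model.classes_. A quick way to see them side by side (the column names will be whatever values the label column holds):

probs = pd.DataFrame(my_model.predict_proba(X_test),
                     columns=my_model.classes_)
probs.head()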
Finally, we can look at the precision, recall and F1 score to see how the model performed:
from sklearn.metrics import precision_recall_fscore_support
import pandas as pd
metrics = pd.DataFrame(precision_recall_fscore_support(
    y_test, y_pred, average='weighted'))
#the fourth value (support) comes back as None when an average is requested
metrics.index = ["precision","recall","fscore","na"]
All in all, this is not a terrible showing given that there were only about 800 records to train the model with. These metrics were calculated using the weighted average parameter in sklearn’s precision_recall_fscore_support function, since there were quite a few more tweets labeled pro-choice than pro-life in the dataset.
We can also take a look at the confusion matrix ourselves to better understand how tweets were labeled:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)
#with two classes, ravel() unpacks to tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
|                   | Predicted pro-choice | Predicted pro-life |
|-------------------|----------------------|--------------------|
| Actual pro-choice | 135                  | 8                  |
| Actual pro-life   | 43                   | 19                 |
We see that there are 135 “true negatives” (tweets correctly labeled as pro-choice), 8 “false positives” (tweets incorrectly labeled as pro-life), 43 “false negatives” (tweets incorrectly labeled as pro-choice) and 19 “true positives” (tweets correctly labeled as pro-life) in our testing sample. The model has good precision when it comes to categorizing pro-choice tweets (135/(135+43) = .76) and decent precision for pro-life tweets (19/(19+8) = .70). This means the model is not producing a high percentage of false positives in either category.
The model’s recall is another story. Recall assesses the model’s ability to correctly capture the true members of each category. With recall we ask: of the tweets that are actually pro-choice or pro-life, how many were labeled correctly? For pro-choice, recall is very high (135/(135+8) = .94). This tells us that the model is VERY good at correctly labeling pro-choice tweets. However, recall for the pro-life tweets is bad (19/(19+43) = .31). The model often mistakes true pro-life tweets for pro-choice tweets.
For the purposes of this project, and given the constraint of having relatively little labeled data to train the model on, this doesn’t especially bother me. To fix it, we could potentially lower the probability threshold at which a tweet gets labeled pro-life, as sketched below.
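As a rough sketch of what that could look like, assuming the pro-life label is the second entry in my_model.classes_ and using a made-up cutoff of 0.4 in place of the default 0.5:

import numpy as np

#probability assigned to the second class (assumed here to be pro-life)
prolife_prob = my_model.predict_proba(X_test)[:, 1]

#label a tweet pro-life whenever that probability clears the lower bar
threshold = 0.4  #hypothetical cutoff; it would need tuning
y_pred_adjusted = np.where(prolife_prob >= threshold,
                           my_model.classes_[1],  #pro-life
                           my_model.classes_[0])  #pro-choice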
In summary
Those are the essential steps I took to wrangle and prep tweets for use in a basic ML model to predict pro-choice vs. pro-life ideology. In the next post, I’ll show how I was able to label and analyze a novel dataset of scraped tweets using the model I trained.