Matrix of Confusion: Tuning the Logistic Regression Prediction Threshold for Imbalanced Classes (the long way)

I must admit that lately I’ve found myself feeling like the brain trapped in the Matrix of Confusion™ (confusion matrix) pictured above. It’s been quite a busy little month or so. I got back from a wonderful and much needed vacation at the beginning of September and subsequently whiplashed straight into a new semester (and back to good old work). Of course, this isn’t just any run of the mill semester. It’s my second to last (or so) semester, which means I need to buckle down and actually come up with a thesis topic and proposal. Perhaps, if you’re familiar with the contents of this blog, you’ll see that this could be somewhat of a difficult task. I tend to have a lot of cute little ideas and interests related to my hobbies, but it’s hard to translate this into something that could be classified as social science research. I do have a substantial amount of legitimate research experience, but this is in the field of cognitive neuroscience (also not a social science).

Anywho, in times of trouble like this, sometimes it helps to give the old noggin a rest from the big questions and pick something small to figure out to use as a distraction 🙂

At the end of my last post I looked at the confusion matrix for the test predictions made by the logistic regression trained on tweets labeled pro-choice and pro-life. Examining the confusion matrix was much more useful in my opinion than simply looking at numbers for precision, recall, and F1, because it shows exactly how tweets are getting labeled in both classes:

                     Predicted pro-choice   Predicted pro-life
Actual pro-choice                     135                    8
Actual pro-life                        43                   19
Confusion matrix

The model predicted pro-choice tweets with a precision of 135/178 = 75.8%. It was slightly worse at correctly predicting pro-life tweets, with a precision of 19/27 = 70.4%. Our overall accuracy ends up being (135+19)/(135+19+43+8) = 75.1%.
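For reference, here is roughly how I pull these numbers out of scikit-learn. This is just a minimal sketch – it assumes the fitted classifier from the last post is called model, the held-out tweets and labels are X_test / y_test, and the label strings are literally “pro-choice” and “pro-life” (all of those names are my shorthand here, not necessarily what’s in the actual notebook):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

# Predicted labels for the held-out test tweets
y_pred = model.predict(X_test)

# Rows are the actual class, columns are the predicted class
cm = confusion_matrix(y_test, y_pred, labels=["pro-choice", "pro-life"])
print(cm)

# Per-class precision and the overall accuracy discussed above
print("pro-choice precision:", precision_score(y_test, y_pred, pos_label="pro-choice"))
print("pro-life precision:", precision_score(y_test, y_pred, pos_label="pro-life"))
print("overall accuracy:", accuracy_score(y_test, y_pred))
```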

To get the baseline accuracy to compare this accuracy to, we can take the overall percentage of the most abundant class (pro-choice tweets). This was 72.6%. So, theoretically, if instead of going through the trouble to train a model we had just guessed that all tweets were pro-choice, our accuracy would have been 72.6%. Our overall accuracy using the model was 75.1%, so overall we did better by about 2.5 percentage points 🥲.

                     Predicted pro-choice   Predicted pro-life
Actual pro-choice                     143                    0
Actual pro-life                        62                    0
Confusion matrix if the model classified all tweets as pro-choice

Ok… that doesn’t sound like we did amazingly… however, how you want to measure success really does depend on your perspective. In the scenario just described, we would be classifying none of the pro-life tweets correctly. Why would we want that?! Compared to our model, the guess-everything-pro-choice approach would label every actual pro-choice tweet correctly, but it would never label a single pro-life tweet correctly – the model gives up a little on the pro-choice side in exchange for a pro-life precision of 70.4% instead of nothing at all.
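If you want that guess-everything-pro-choice baseline as an actual model rather than a back-of-the-envelope number, scikit-learn’s DummyClassifier will build it for you. Another sketch with the same assumed variable names (plus X_train / y_train for the training split):

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# A "model" that always predicts the most frequent class in the training data
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

baseline_pred = baseline.predict(X_test)
print("baseline accuracy:", accuracy_score(y_test, baseline_pred))
print(confusion_matrix(y_test, baseline_pred, labels=["pro-choice", "pro-life"]))
```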

There is also the question of recall – the ratio of correctly classified “positive” tweets to total actual positive tweets (true positive rate), which we can look at from the pro-choice or the pro-life side. Our model has very good recall if considering pro-choice tweets to be positives, with a recall of 135/(135+8) = 94.4%. Only 5.6% of true pro-choice tweets were labeled pro-life. On the flip side, the recall if we consider pro-life tweets to be positive was 19/(19+43) = 30.6%. This means that, in fact, about 70% of pro-life tweets were predicted to be pro-choice. Not great.
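A quick way to see both perspectives at once is classification_report, which prints precision, recall, and F1 with each class treated in turn as the “positive” one (using the same assumed y_test / y_pred as in the sketch above):

```python
from sklearn.metrics import classification_report

# One row per class: precision, recall, F1, and support
print(classification_report(y_test, y_pred, labels=["pro-choice", "pro-life"]))
```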

ROC Curve

ROC curves plot true positive vs. false positive rate across thresholds for binary classifiers. At the end of my last blog post, I mentioned that we could potentially tweak the classification threshold for our model one way or the other to get better predictions for the pro-life group. Plotting an ROC curve can help visualize this. Again, we can look at this from the pro-choice or pro-life perspective (either one can be treated as “positive”).

Pro-Life ROC Curve

I’ve chosen to look at thresholds within a few percentage points of 50%. The ROC curve plots the false positive rate (in this case, the percent of real pro-choice tweets labeled pro-life) on the x axis, and the true positive rate (real pro-life tweets labeled pro-life) on the y axis. You can see that the .5 threshold lines up with the recall discussed a couple paragraphs ago. These numbers shift drastically with only a slight threshold change. If we chose .44 for the threshold, we would label over 75% of pro-life tweets correctly, as opposed to the 30.6% we get from the .5 threshold. Of course, this would also increase the number of pro-choice tweets falsely labeled as pro-life, from 5.6% up to about 38%.
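Here’s roughly how a curve like this comes together with scikit-learn’s roc_curve. A sketch, assuming the model exposes predict_proba and that its classes_ attribute tells us which column holds the pro-life probability:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Probability that each test tweet is pro-life
prolife_col = list(model.classes_).index("pro-life")
prolife_scores = model.predict_proba(X_test)[:, prolife_col]

# False positive rate, true positive rate, and the thresholds that produce them
fpr, tpr, thresholds = roc_curve(y_test, prolife_scores, pos_label="pro-life")

plt.plot(fpr, tpr, marker="o")
plt.plot([0, 1], [0, 1], linestyle="--")  # the random-guess diagonal
plt.xlabel("False positive rate (pro-choice tweets labeled pro-life)")
plt.ylabel("True positive rate (pro-life tweets labeled pro-life)")
plt.title("Pro-life ROC curve")
plt.show()
```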

Pro-Choice ROC Curve

This is giving essentially the same information as the other ROC curve but in reverse. I find it useful to look from both perspectives.

We can use the area under the ROC curve to understand how well the model performs overall:

A thought experiment:

Say the model performed absolutely perfectly and predicted every tweet with 100% confidence. In that case, at any threshold strictly between 0 and 1 there would be no false positives and no false negatives, so every one of those thresholds would be plotted at (0,1). The only other points would be the trivial endpoints where everything gets labeled positive, at (1,1), or nothing does, at (0,0). This would result in the ROC curve not being a curve at all, but a right angle running up the left edge and across the top of the chart, covering 100% of its area.

Say the model performs perfectly terribly, with scores that are essentially coin flips carrying no information about any tweet. In this theoretical scenario, at every threshold except 0 and 1, any given tweet – pro-choice or pro-life – would be equally likely to land on either side of it. If there were infinite observations in the dataset and the model were perfectly random with its predictions, the ROC curve would just be the line y = x, and would cover 50% of the area of the chart:

So, the more sharply the ROC curve rises toward the top-left corner – and the more of the chart’s area it covers – the better the model.
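The area under the curve (AUC) boils that intuition down to one number: 1.0 for the perfect right angle, roughly 0.5 for coin-flip scores. A quick sketch of both extremes on made-up toy data, plus the real model’s score (reusing the assumed prolife_scores from the ROC sketch above):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_toy = np.array([1, 1, 1, 0, 0, 0] * 1000)

perfect_scores = y_toy.astype(float)    # 100% confident and always right
random_scores = rng.random(len(y_toy))  # pure noise

print(roc_auc_score(y_toy, perfect_scores))  # 1.0
print(roc_auc_score(y_toy, random_scores))   # roughly 0.5

# The real model, with pro-life treated as the positive class
print(roc_auc_score(np.asarray(y_test) == "pro-life", prolife_scores))
```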

Here’s my model colored in:

I find this kind of thing so useful for fundamentally understanding what’s going on. We can see exactly how much better than completely random the model is predicting (darker green), and the corresponding area it fails to capture (red).

I added two lines at .726. These represent the baseline accuracy – that is, the accuracy we would get by simply classifying every tweet as pro-choice, which equals the overall percentage of pro-choice tweets in the dataset. We want the % of true pro-life tweets labeled as pro-choice (false positive rate, x axis) to be below this line, and the % of true pro-choice tweets labeled as pro-choice (true positive rate, y axis) to be above it. We can see that the thresholds that meet these criteria are .46, .47, .48, .49, .5, and barely .51. A reasonable argument could be made for choosing any of these thresholds.
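Picking out which thresholds clear both .726 lines is easy enough to do by brute force, which is really the spirit of doing this the long way. A sketch over a hand-picked grid of thresholds, assuming (as the recall numbers above suggest) that a tweet gets labeled pro-life whenever its pro-life probability is at least the threshold:

```python
import numpy as np

y_true = np.asarray(y_test)
baseline = 0.726  # overall share of pro-choice tweets in the dataset

for t in np.arange(0.44, 0.56, 0.01):  # thresholds within a few points of 50%
    pred = np.where(prolife_scores >= t, "pro-life", "pro-choice")
    # Pro-choice as "positive": true positive rate and false positive rate
    tpr = np.mean(pred[y_true == "pro-choice"] == "pro-choice")
    fpr = np.mean(pred[y_true == "pro-life"] == "pro-choice")
    if tpr > baseline and fpr < baseline:
        print(f"threshold {t:.2f} beats baseline (TPR={tpr:.3f}, FPR={fpr:.3f})")
```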

For thoroughness’s sake, here is the shaded ROC curve when considering pro-life to be “positive.” I’ve changed the baseline accuracy lines to use the overall percentage of pro-life tweets in the dataset (27.4%). You can see that the same thresholds perform better than baseline again: .51, .5, .49, .48, .47, and .46.

The ROC curve is clearly useful for assessing the trade-off between the true positive rate (recall) and the false positive rate. What about precision, though?

Another way to look at this is with a precision-recall curve:

Here we are plotting recall on the x axis and precision on the y axis – in this case, we consider pro-choice to be “positive.” This allows us to see the trade-off between the ratio of correctly identified pro-choice tweets to all real pro-choice tweets (recall), and the ratio of correctly identified pro-choice tweets to all tweets labeled pro-choice (precision). We can see that for most thresholds tested, both precision and recall are quite high for pro-choice tweets. Overall, the model does a good job of predicting pro-choice tweets as being pro-choice (recall) and a pretty high percentage of the tweets it classifies as being pro-choice actually are pro-choice (precision).
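The scikit-learn counterpart here is precision_recall_curve. A sketch with pro-choice as the positive class, again reusing the assumed prolife_scores from earlier:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Pro-choice as the "positive" class, so the score is the pro-choice probability
prochoice_scores = 1 - prolife_scores
precision, recall, thresholds = precision_recall_curve(
    y_test, prochoice_scores, pos_label="pro-choice"
)

plt.plot(recall, precision, marker=".")
plt.axhline(0.726, linestyle="--")  # no-skill precision = share of pro-choice tweets
plt.xlabel("Recall (pro-choice)")
plt.ylabel("Precision (pro-choice)")
plt.title("Pro-choice precision-recall curve")
plt.show()
```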

However, remember that the data was imbalanced to begin with: 72.6% of the tweets in the dataset are pro-choice. If we were to pick tweets randomly out of a hat, eventually we would get 72.6% pro-choice. This makes the model’s precision in predicting pro-choice tweets somewhat less impressive. For most thresholds, the population of tweets predicted to be pro-choice is only slightly more pro-choice than random.

Let’s look at the precision-recall curve with pro-life as positives:

The precision-recall curve considering pro-life tweets as positives tells a very different story. In this scenario, our baseline precision to beat would be 27.4% (100% – 72.6%), the overall percentage of pro-life tweets in the dataset. We can see that at all thresholds besides 0, our model predicts pro-life tweets better than baseline, and for most thresholds substantially better. That’s encouraging.

Recall is more of a problem from this perspective. If we want a threshold with more than 50% of actual pro-life tweets being labeled correctly as pro-life, we only have .46, .45, and .44 to choose from. However, if we chose one of these thresholds to improve pro-life recall, pro-life precision would drop, meaning that more actual pro-choice tweets would get labeled pro-life (false positives).

F1 – harmonic mean of precision and recall:

The F1 score is a way to assess precision and recall simultaneously. The formula is as follows:

F1 = 2 * (precision * recall)/(precision + recall)

F1 weighs precision and recall equally, with 0 as the worst score and 1 as the best score. You probably noticed that I labeled the precision-recall curves with corresponding F1 scores. Maybe the most logical way to select a threshold in our case would simply be to find the highest F1 score. Again, though, what we consider to be “best” depends on perspective, as the best F1 score from the pro-life perspective is not the best F1 score from the pro-choice perspective.
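Here’s a sketch of how I’d scan the same hand-picked thresholds for F1 directly, with both per-class scores and the weighted average across classes (same assumed variables as the earlier sketches):

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.asarray(y_test)

for t in np.arange(0.44, 0.56, 0.01):
    pred = np.where(prolife_scores >= t, "pro-life", "pro-choice")
    f1_choice = f1_score(y_true, pred, pos_label="pro-choice")
    f1_life = f1_score(y_true, pred, pos_label="pro-life")
    f1_weighted = f1_score(y_true, pred, average="weighted")
    print(f"{t:.2f}: pro-choice F1={f1_choice:.3f}, pro-life F1={f1_life:.3f}, weighted={f1_weighted:.3f}")
```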

Conclusions?

What to take away from all this? Well… if I wanted to use my model to capture a wider swath of pro-life tweets, I think I would tweak the threshold to .46. At this threshold, recall increases to 66.1% from 30.6% – so we are classifying more than double the percentage of pro-life tweets correctly compared to before. Unfortunately, pro-life precision at this threshold drops to 56.2% from 70.4%. So just over half of the tweets labeled pro-life would actually be pro-life. That’s not ideal, but in this scenario I’d rather water down the predicted pro-life tweet population than miss so many true pro-life tweets and have them get lost in the pro-choice predictions.

At .46 from the pro-choice perspective, precision is quite high at 84.1%, so a greater percentage of the tweets labeled as pro-choice are truly pro-choice. The pro-choice recall takes a hit, though, going from the very very good 94.4% to 77.6% – meaning only 77.6% of all pro-choice tweets are labeled correctly. This is only slightly better than baseline.
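Actually using the tweaked threshold just means comparing the predicted probabilities to .46 instead of calling predict(). One last sketch with the same assumed names, applying the threshold to the pro-life probability as before:

```python
import numpy as np
from sklearn.metrics import classification_report

# Label a tweet pro-life when its pro-life probability is at least 0.46
tuned_pred = np.where(prolife_scores >= 0.46, "pro-life", "pro-choice")
print(classification_report(y_test, tuned_pred, labels=["pro-choice", "pro-life"]))
```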

In the end, going with the threshold with the best weighted average F1 score across classes would probably make sense in almost all cases. A boring end to a boring post 🙂