We know that machine learning models are needed to learn patterns from our data.

If you didn’t know that, you should probably read this post first.

But how do we know if our models are actually doing what we want them to do? I mean, there’s not much point in having models if they can’t perform, right?

Fortunately, there are key metrics we can use to evaluate how well our ML models are actually performing.

ML Models vs Male Models (A Quick Comparison)

Since we’re talking about models, let’s talk about one you may recognize:

Your ML Model may be really, really, really, ridiculously bad at predicting things.

Imagine Derek Zoolander strutting down the runway. He’s cocky, full of confidence, believing he’s nailed the walk.

He hears the audience cheering, but little does he know, the fashion designers with a real eye for detail are shaking their heads in disapproval. They’re thinking:

“Outdated clothes. Over-used facial expression. Probably can’t understand ML”.

Even though it looked like he nailed the walk, the pose, and the attitude, the fashion designers aren’t seeing what they’re looking for.

This is a lot like a machine learning model that performs well on the data it was trained on, but fails to generalize to new, unseen data (a problem called overfitting).

In order to truly shine on the runway, Mr. Zoolander needs more than just the walk and the pose; he needs to consider what the fashion designers actually wanted.

It’s similar in Machine Learning. To evaluate a model effectively, we need to consider multiple metrics, and how they work together in context.

The Fantastic Four Of ML Metrics

When it comes to evaluating ML models, we have an entire toolkit of metrics we can use. I don’t want to overload your brain just yet, so today I’ll start with what I call the Fantastic Four of measuring ML models’ performance:

Accuracy, Precision, Recall, and F1 Score.

These 4 all have one thing in common: they compare what our model predicts to what’s actually true in reality.

This means we have 4 possible outcomes:

| Outcome | Model Says | Reality | Example |
| --- | --- | --- | --- |
| True Positive (TP) | Yes | Yes | Calling this post a great read. |
| True Negative (TN) | No | No | Rejecting crocs. |
| False Positive (FP) | Yes | No | Thinking your cat actually likes you. |
| False Negative (FN) | No | Yes | Missing out on Bitcoin in 2010. |

Get real comfortable with those terms above, because they’re foundational to understanding how these metrics work.
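If it helps to see those four outcomes in code, here’s a minimal Python sketch of counting them for a spam classifier. The two label lists are made up purely for illustration:

```python
# Minimal sketch: counting TP, TN, FP, FN for a spam classifier.
# The labels below are toy data, invented just to show the counting logic.
actual    = ["spam", "spam", "not spam", "not spam", "spam", "not spam"]
predicted = ["spam", "not spam", "not spam", "spam", "spam", "not spam"]

tp = sum(1 for a, p in zip(actual, predicted) if a == "spam" and p == "spam")
tn = sum(1 for a, p in zip(actual, predicted) if a == "not spam" and p == "not spam")
fp = sum(1 for a, p in zip(actual, predicted) if a == "not spam" and p == "spam")
fn = sum(1 for a, p in zip(actual, predicted) if a == "spam" and p == "not spam")

print(tp, tn, fp, fn)  # 2 2 1 1
```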

Let’s start with Accuracy.

📊 Accuracy - “How Often Am I Right?”

Accuracy tells you how often your model got it right. It adds up the True Positives and True Negatives, then divides that by the total number of predictions.

Formula:

Accuracy = (True Positives + True Negatives) / Total Predictions

Here’s an example:

Let’s say you have an email classifier that determines if an email is spam or not spam. Let’s say your True Positives (real spam emails) are 40, and your True Negatives (real non-spam emails) are 30, your False Positives (predicted spam, but actually not) are 10, and your False Negatives (missed spam) are 20.

Following the formula:

Accuracy = (40 + 30) / (40 + 30 + 10 + 20) = 70 / 100

In this case, our Accuracy = 70%
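Here’s that arithmetic as a quick Python sketch, using the example counts above:

```python
# Accuracy with the spam-classifier counts from the example above
tp, tn, fp, fn = 40, 30, 10, 20

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"Accuracy: {accuracy:.0%}")  # Accuracy: 70%
```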

🎯 Precision - “How Picky Am I?”

Precision focuses on the quality of your positive predictions. It answers: “When I said something was positive, how often was I actually right?” To determine precision, you take your True Positives and divide them by True Positives + False Positives.

Formula:

Precision = True Positives / (True Positives + False Positives)

Example:

Using our email classifier example above, let’s say our model says 10 of our emails are spam, but only 8 of them actually are. In this case, our precision = 8 / 10 = 80%.

Precision = 80%
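The same calculation in Python, using the counts from this example:

```python
# Precision: of the 10 emails flagged as spam, only 8 really were spam
tp, fp = 8, 2  # 10 flagged - 8 correct = 2 false positives

precision = tp / (tp + fp)
print(f"Precision: {precision:.0%}")  # Precision: 80%
```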

🕸️ Recall - “How Much Did I Catch?”

Recall is all about catching the actual positives. It answers: “Of all the real positives out there, how many did I catch?” For this, you take the True Positives and divide them by True Positives + False Negatives. (← False Negatives are the key difference between Recall and Precision.)

Formula:

Recall = True Positives / (True Positives + False Negatives)

Example:

This time, let’s say we have 20 emails that are actually spam. Our model caught 8. Our recall in this case would be 8 / 20 = 40%.

Recall = 40%
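In Python, with the numbers from this example:

```python
# Recall: 20 emails were actually spam, the model only caught 8 of them
tp, fn = 8, 12  # 20 actual spam - 8 caught = 12 false negatives

recall = tp / (tp + fn)
print(f"Recall: {recall:.0%}")  # Recall: 40%
```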

⚖️ F1 Score - “The Perfect Balance”

F1 Score balances precision and recall into a single number. If an F1 racer were to measure their F1 score, they’d probably measure their precision by how many laps they nailed without crashing, and their recall by how many laps they actually finished.

It’s the harmonic mean of the two: a fine balance that punishes a big gap between precision and recall more than a simple average would.

F1 Score showing Thanos what true balance looks like.

Formula:

F1 Score = 2 x (Precision x Recall) / (Precision + Recall)

Example:

Let’s say our precision is 80%, and our recall is 40%. We’d calculate our F1 Score as follows:

F1 Score = 2 x (0.80 x 0.40) / (0.80 + 0.40) = 0.64 / 1.20 ≈ 0.53

F1 Score: 53%
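And as a quick Python check:

```python
# F1 Score from the precision and recall computed above
precision, recall = 0.80, 0.40

f1 = 2 * (precision * recall) / (precision + recall)
print(f"F1 Score: {f1:.0%}")  # F1 Score: 53%
```

(If you’re already using scikit-learn, sklearn.metrics has accuracy_score, precision_score, recall_score, and f1_score that compute all four of these directly from your label and prediction lists.)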

⭐ Conclusion

Awesome! With the help of Zoolander, we’ve covered the basics of model evaluation. This is one of those posts that will require multiple reads, so I recommend you run at least a few epochs on this topic, just so you can really solidify the concepts!

Next time, I’ll present Part 2, where we find even more ways to nitpick your model’s performance.

Until then, keep your accuracy high and your losses low!
