
SURPRISE!

Okay, you’re probably thinking: what the hell is this? Why is there a random fabulous pose in my inbox? Did someone hack this week’s post?
Nope. I assure you that zest-master is meant to be there.
Believe it or not, the mental response you just had upon seeing something unexpected is at the core of today’s topic: Information Theory.
It’s the math of surprise — the science of measuring just how much your brain goes “wait, WHAT?” when something unexpected shows up.
And yes, it matters a lot in Machine Learning.

📖 What Is Information Theory?
Imagine an old man with his walker. He spends most of his time sitting around, not really moving or talking.
Then all of a sudden, he does this:

Grandpa’s new moves: Information Dance Theory.
This sudden decision to start busting moves can be seen as surprising new data. And that data can be measured.
At its core, Information Theory is about measuring uncertainty and information.
For example:
If I tell you, “The sun will rise tomorrow,” that’s not very surprising. (Practically zero information.)
If I tell you, “Grandpa just owned the dance floor,” that’s a lot of information. (Maximum surprise.)
Information theory is a field of applied mathematics that quantifies information. Its main concept is entropy, which is a measure of the uncertainty or randomness in a message.
Just like how people behave randomly, sometimes our data in machine learning can hold randomness or uncertainty.
Information Theory is basically the rulebook for handling that uncertainty — it teaches models how to spot the meaningful patterns in data while ignoring the random junk.
Basically, Information Theory makes sure your model doesn’t make dumb predictions.
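If you want to see that sun-versus-grandpa gap in actual numbers, here’s a tiny sketch. The probabilities are completely made up for illustration; the only real ingredient is the formula for surprise (a.k.a. self-information): the rarer the event, the more bits of surprise you get.

```python
import math

def surprise(p):
    """Self-information of an event with probability p, in bits."""
    return -math.log2(p)

print(surprise(0.999))  # "The sun will rise tomorrow": ~0.0014 bits. Yawn.
print(surprise(0.001))  # "Grandpa owned the dance floor": ~10 bits. Wait, WHAT?
```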

🖊️ Entropy
Entropy is the main character in the story of Information Theory. So what is it?
It’s just a fancy word for uncertainty or randomness.
Think of the old man dancing above. If he’s always taking a nap, barely moving, then the situation has low entropy because there’s almost no uncertainty about what he’ll do next.
But the moment he suddenly starts doing the Electric Slide 🕺…
The entire system is flooded with uncertainty and surprise, which means the situation has high entropy.
A machine learning model uses this idea to find features in data that, like Grandpa, shatter predictability and provide valuable insight.
Entropy is used in Decision Trees to figure out the best way to split data. The model prefers questions that reduce uncertainty the most.
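To make that concrete, here’s a minimal sketch of Shannon entropy in plain Python (the grandpa data is invented, obviously). A decision tree would compare this number before and after a split and pick whichever question shrinks it the most.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of outcomes, in bits."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

print(entropy(["nap"] * 10))                 # 0.0 bits: pure naptime, zero surprise
print(entropy(["nap"] * 9 + ["dance"]))      # ~0.47 bits: mostly predictable
print(entropy(["nap"] * 5 + ["dance"] * 5))  # 1.0 bit: maximum uncertainty
```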

🧑‍🏫 Cross-Entropy - “The Stern Teacher”
Speaking of entropy, there’s a close cousin in machine learning you’ll hear a lot about: cross-entropy.
While entropy measures the uncertainty in data, cross-entropy measures how different your model’s predicted probabilities are from the actual truth.
But if your model messes up, cross-entropy might punish it like this:

Cross-entropy doesn’t just shrug off mistakes—it slaps your model with a big penalty. The more confidently wrong it is, the harsher the punishment.
You can think of it like grading:
A student who gets the right answer with 99% confidence gets an A+.
A student who gets it wrong with 99% confidence gets an F minus minus.
👉 In ML: Cross-entropy is the loss function behind classification tasks (like logistic regression, neural networks, image classifiers, etc.).
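Here’s the grading analogy as a sketch, using binary cross-entropy for a single prediction (the confidence numbers are invented). Notice how the penalty explodes when the model is confidently wrong.

```python
import math

def cross_entropy(true_label, predicted_prob):
    """Binary cross-entropy for one example: -log(prob assigned to the truth)."""
    p = predicted_prob if true_label == 1 else 1 - predicted_prob
    return -math.log(p)

print(cross_entropy(1, 0.99))  # ~0.01: confidently right, the A+ student
print(cross_entropy(1, 0.50))  # ~0.69: shrugging, a C at best
print(cross_entropy(1, 0.01))  # ~4.61: confidently wrong, F minus minus
```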

🤝 Mutual Information
Can you think of two people who love gossip?
They both seem to know everyone else’s business. If these two giggly gossipers whisper to each other as you walk past them…
You know they’re sharing something about you.
The more their conversations overlap, the more “mutual information” there is between them.
In machine learning, mutual information is basically a fancy way of measuring how much knowing one thing tells you about another. It measures how much two variables "get each other."
For example, knowing the temperature outside tells you a lot about ice cream sales (high mutual information), but knowing the day of the week probably tells you next to nothing about it (low mutual information).
In machine learning, we use this to dump the boring, irrelevant features and focus only on the cool ones that have something to say.
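Here’s a sketch of mutual information between two discrete variables, with a made-up toy dataset for the temperature and ice cream story.

```python
import numpy as np

def mutual_information(x, y):
    """Mutual information between two discrete variables, in bits."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))          # joint probability
            p_x, p_y = np.mean(x == xv), np.mean(y == yv)  # marginals
            if p_xy > 0:
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi

temperature = ["hot", "hot", "hot", "cold", "cold", "cold"]
sales       = ["high", "high", "high", "low", "low", "low"]
weekday     = ["mon", "tue", "wed", "mon", "tue", "wed"]

print(mutual_information(temperature, sales))  # 1.0 bit: they fully "get" each other
print(mutual_information(weekday, sales))      # 0.0 bits: weekday has nothing to say
```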

🔗 Kullback–Leibler Divergence (KL Divergence)
Kullback-what now?
I decided to hit you with some gibberish, since “mutual information” is boring and self-explanatory.
Yes, “Kullback-Leibler” sounds like I’m talking in another language, but I’ll make it less scary than it sounds.
KL Divergence is basically quantified disappointment.
Think of a super strict tiger mom, whose son gets a B- on his biology exam. Now imagine if her disappointment could be quantified. You could say her KL Divergence score is:

KL Divergence is highly relevant to machine learning because it serves as a loss function, especially in classification and probabilistic models.
Wait, loss function? You might be thinking that Cross-Entropy is the loss function.
You’d be right.
But think of it like this: Cross-Entropy = Entropy of the truth + KL Divergence. KL Divergence is the part of Cross-Entropy your model can actually do something about.
The difference is simple: if Cross-Entropy is your total penalty on a quiz, then Entropy is the unavoidable penalty baked into the quiz itself, and KL Divergence is the extra penalty you racked up by being wrong.
In machine learning, the baked-in part (the entropy of the true labels) is a constant that never changes. So, if your goal is to get the best possible score (minimize Cross-Entropy), it’s the same thing as racking up the least extra penalty (minimizing KL Divergence).
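You can check that quiz math directly. In this sketch (distributions made up), cross-entropy splits into the fixed entropy of the truth plus the KL divergence, exactly as described.

```python
import math

def entropy(p):
    """Entropy of a discrete distribution: the unavoidable part."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Total penalty when the truth is p but the model believes q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """Quantified disappointment: the extra penalty beyond the unavoidable."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.8, 0.2]  # the truth
q = [0.5, 0.5]  # the model's guess

print(cross_entropy(p, q))               # ~0.693
print(entropy(p) + kl_divergence(p, q))  # ~0.693, the exact same number
```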

⭐ Conclusion
Information theory provides the fundamental tools a model needs to separate signal from noise.
While probability and statistics help a model understand past data, information theory gives it a way to measure the true value of new insights.
Next week, we'll dive into neural networks, the digital brains that learn just like us—only without the crippling anxiety.

