Label Smoothing: The Overlooked and Rarely Discussed Regularization Technique

Make your model less overconfident.

Nov 06, 2023

For every instance in single-label classification datasets, the entire probability mass belongs to a single class, and the rest are zero.

This is depicted below:

Label probability mass distribution on the instance level

The issue is that such label distributions can push the model to predict the true class for every sample with very high confidence.

This can hurt its generalization.

Label smoothing is a rarely discussed regularization technique that elegantly addresses this issue.

Reduce the probability mass of true class label

As depicted above, with label smoothing:

We intentionally reduce the probability mass of the true class slightly.
The reduced probability mass is uniformly distributed to all other classes.
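To make the two steps above concrete, here is a minimal sketch in plain Python. The function name `smooth_label` and the parameter `epsilon` (the amount of mass taken away from the true class) are assumptions made for illustration; they are not taken from the notebook.

```python
def smooth_label(true_class, num_classes, epsilon=0.1):
    """Return a smoothed target distribution as a list of floats.

    epsilon is the (assumed) amount of probability mass removed from the
    true class; it is shared equally among the other num_classes - 1 classes.
    """
    if not 0.0 <= epsilon < 1.0:
        raise ValueError("epsilon must be in [0, 1)")
    if num_classes < 2:
        raise ValueError("need at least two classes")
    off_value = epsilon / (num_classes - 1)
    target = [off_value] * num_classes
    target[true_class] = 1.0 - epsilon
    return target


# epsilon=0.0 recovers the usual one-hot label.
one_hot = smooth_label(true_class=2, num_classes=5, epsilon=0.0)
smoothed = smooth_label(true_class=2, num_classes=5, epsilon=0.1)
print(one_hot)
print(smoothed)
```

Both targets still sum to 1. Note that some library implementations spread the removed mass over all classes, the true class included, so the exact numbers can differ slightly from this sketch.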

Simply put, this can be thought of as asking the model to be “less overconfident” during training and prediction while still attempting to make accurate predictions.

This makes intuitive sense as well.
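The intuition can be checked numerically. The sketch below compares the cross-entropy loss of a moderately confident prediction and an extremely confident one, first against a one-hot target and then against a smoothed target. The logits, the three-class setup, and `epsilon = 0.1` are hypothetical values chosen only for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def cross_entropy(logits, target):
    """Cross-entropy between a target distribution and softmax(logits)."""
    log_probs = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(target, log_probs))


def smoothed_target(true_class, num_classes, epsilon):
    """Target with 1 - epsilon on the true class, the rest shared evenly."""
    target = [epsilon / (num_classes - 1)] * num_classes
    target[true_class] = 1.0 - epsilon
    return target


# Hypothetical logits for one 3-class sample whose true class is 0.
moderate = [2.0, 0.0, 0.0]
extreme = [20.0, 0.0, 0.0]

for eps in (0.0, 0.1):
    target = smoothed_target(0, 3, eps)
    print(eps, cross_entropy(moderate, target), cross_entropy(extreme, target))
```

With the one-hot target, the extreme logits give the lower loss, so training keeps rewarding ever larger logits. With the smoothed target, the extreme logits give the higher loss, so the model is no longer rewarded for pushing its confidence toward 100%.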

The efficacy of this technique is evident from the image below:

In this experiment, I trained two neural networks on the Fashion MNIST dataset with the exact same weight initialization.

One without label smoothing.
Another with label smoothing.

The model with label smoothing achieved better test accuracy, i.e., better generalization.

Pretty handy, isn’t it?

When not to use label smoothing?

After using label smoothing for many of my projects, I have also realized that it is not well suited for all use cases.

So it’s important to know when you should not use it.

See, if you only care about getting the final prediction correct and improving generalization, label smoothing will be a pretty handy technique.

However, I wouldn’t recommend using it if you care about both:

Getting the prediction correct.
Understanding the model’s confidence in that prediction.

This is because, as we discussed above, label smoothing guides the model to become “less overconfident” about its predictions.

Thus, we typically notice a drop in the confidence values for every prediction, as depicted below:

On a specific test instance:

The model without label smoothing outputs 99% probability for class 3.
With label smoothing, although the prediction is still correct, the confidence drops to 74%.
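One way to see why the confidence drops: with a smoothed target, the loss is no longer minimized by a prediction of 100% on the true class. The sketch below grid-searches the true-class probability that minimizes the cross-entropy, assuming the remaining probability is split evenly across the other classes. The helper names, the 10-class setup, and the `epsilon` values are assumptions for illustration; the sketch does not reproduce the 99% and 74% figures above, which come from the trained models.

```python
import math


def expected_loss(p_true, num_classes, epsilon):
    """Cross-entropy of a smoothed target against a prediction that puts
    p_true on the true class and splits the remainder evenly."""
    k_other = num_classes - 1
    p_other = (1.0 - p_true) / k_other
    t_other = epsilon / k_other
    loss = -(1.0 - epsilon) * math.log(p_true)
    if t_other > 0.0:
        loss -= k_other * t_other * math.log(p_other)
    return loss


def best_confidence(num_classes, epsilon, steps=999):
    """Grid-search the true-class probability that minimizes the loss."""
    grid = [i / (steps + 1) for i in range(1, steps + 1)]  # 0.001 .. 0.999
    return min(grid, key=lambda p: expected_loss(p, num_classes, epsilon))


for eps in (0.0, 0.1, 0.25):
    print(eps, best_confidence(num_classes=10, epsilon=eps))
```

Without smoothing, the best confidence sits at the top of the grid, while with smoothing it settles at `1 - epsilon`. The predicted probabilities therefore track the smoothing amount as well as the model's actual certainty, which is why they are harder to read as a confidence estimate.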

This is something to keep in mind when using label smoothing.

Nonetheless, the technique is a promising way to regularize deep learning models.

You can download the code notebook here: Label Smoothing Notebook.

👉 Over to you: What could be some other things to take care of when using label smoothing?

Thanks for reading!
