Model Compression: An Overlooked ML Technique That Deserves Much More Attention

A step towards the real-world utility of ML models.

Avi Chawla
Sep 18, 2023

Imagine you intend to train an ML model that will also be deployed to production.

To begin, you may spend hours:

  • Meticulously creating relevant features,

  • Tuning hyperparameters,

  • Training the best-performing model.

But all of this is just a small part of the entire process.

Much of your engineering effort will go into making the model production-friendly.

That's because the model that gets shipped is almost NEVER determined by performance alone, a misconception that many people hold.

Deployment considerations

Instead, we also consider several operational and feasibility metrics, such as:

  • Inference Latency: Time taken by the model to return a prediction.

  • Model size: The memory occupied by the model.

  • Ease of scalability, etc.
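
To make the first two metrics concrete, here is a minimal sketch of how you might measure them, assuming a PyTorch model (the architecture and input shape below are placeholders):

```python
import os
import time

import torch
import torch.nn as nn

# Placeholder model; substitute your own trained network.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

x = torch.randn(1, 128)  # one sample, matching the model's input size

# Inference latency: average over repeated forward passes.
with torch.no_grad():
    start = time.perf_counter()
    for _ in range(100):
        model(x)
    latency_ms = (time.perf_counter() - start) / 100 * 1000

# Model size: serialize the parameters and check the file size.
torch.save(model.state_dict(), "model.pt")
size_mb = os.path.getsize("model.pt") / 1e6

print(f"Latency: {latency_ms:.2f} ms/prediction | Size: {size_mb:.2f} MB")
```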

For instance, consider the table below. It compares the accuracy and size of a large neural network with its pruned (or reduced/compressed) versions:

Results of model compression

Looking at the above results, don’t you strongly prefer deploying the model that is 72% smaller yet still (almost) as accurate as the large one?

I am sure you do!

It makes no sense to deploy the large model when one of its heavily pruned versions performs equally well.

The techniques that help us achieve this are called model compression techniques.

These techniques allow us to reduce both the latency and size of the original model, which directly helps in:

  • Lowering computation costs.

  • Reducing model footprint.

  • Improving user experience due to low latency…

…all of which are critical metrics for businesses.

Four widely popular model compression techniques, each illustrated with a minimal code sketch after this list, are:

  • Knowledge distillation: Train a large model (teacher) and transfer its knowledge to a smaller model (student).

  • Model Pruning: Remove irrelevant edges and nodes from a network. Three popular types of pruning:

    • Zero pruning

    • Activation pruning (the table we discussed above depicts the results of activation pruning).

    • Redundancy pruning

  • Low-rank factorization: Decompose weight matrices into smaller “low-rank” matrices.

  • Quantization: Reduce the model’s memory usage by using lower-bit representation to store parameters.
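
To give a flavor of how these work in practice, below are minimal, illustrative sketches in PyTorch. First, knowledge distillation: this uses the common soft-target formulation, blending temperature-softened teacher logits with the usual hard-label loss. The `teacher`, `student`, and hyperparameters are placeholder assumptions, not the exact recipe from the deep dive.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, x, y, optimizer, T=4.0, alpha=0.5):
    """One training step blending hard-label loss with soft teacher targets."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(x)  # the teacher's "knowledge": its output distribution
    student_logits = student(x)

    # Soft loss: match the student's (temperature-softened) distribution
    # to the teacher's; the T*T factor keeps gradient scales comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)

    # Hard loss: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, y)

    loss = alpha * soft + (1 - alpha) * hard
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```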

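Next, pruning. PyTorch ships utilities in torch.nn.utils.prune; the sketch below does simple magnitude-based (L1) pruning, which zeroes the smallest weights. The three variants named above differ in what they deem irrelevant, but the mechanics look roughly like this:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Placeholder network; substitute your own trained model.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# Zero out the 50% of weights with the smallest magnitude in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # make the pruning permanent

# Sanity check: roughly half of all parameters should now be exactly zero.
total = sum(p.numel() for p in model.parameters())
zeros = sum((p == 0).sum().item() for p in model.parameters())
print(f"Sparsity: {zeros / total:.1%}")
```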

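Low-rank factorization typically boils down to a truncated SVD: keep only the top-k singular values so that one large weight matrix is replaced by two much thinner ones. A self-contained sketch (the matrix and rank here are arbitrary placeholders):

```python
import torch

W = torch.randn(512, 512)  # placeholder weight matrix: 262,144 parameters
k = 32                     # target rank

U, S, Vh = torch.linalg.svd(W, full_matrices=False)
A = U[:, :k] * S[:k]       # 512 x k
B = Vh[:k, :]              # k x 512
W_approx = A @ B           # A and B total ~32,768 parameters instead of 262,144

print(torch.norm(W - W_approx) / torch.norm(W))  # relative approximation error
```

Finally, quantization. PyTorch's dynamic quantization is nearly a one-liner; it stores the weights of the selected layer types as 8-bit integers instead of 32-bit floats:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# Linear weights are stored as int8; activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```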
While the above techniques are pretty effective, there are many questions to ask here:

  • How do these techniques work end-to-end?

  • How do we define “knowledge”?

  • How do we transfer knowledge from one model to another?

  • Which edges and neurons should be pruned in a network?

  • How do we determine the pruning criteria?

  • How do we decompose a weight matrix into smaller matrices?

  • How does having multiple matrices improve performance?

  • Which data type should we downcast our weight matrices to, and how do we decide this?

  • And most importantly, how do we implement these techniques in our projects?

Can you answer these questions?

If not, then this is precisely what we covered in a recent machine learning deep dive: Model Compression: A Critical Step Towards Efficient Machine Learning.


In my opinion, for years, the primary objective in model development has been to achieve the best performance metrics.

I vividly remember that when I first transitioned to industry, the focus shifted radically from raw accuracy to considerations such as efficiency, speed, and resource consumption.

While my complex models excelled in research and competition settings, they were not entirely suitable for real-world deployment due to scalability, maintenance, and latency concerns.

Learning model compression techniques was one of the most valuable skills I cultivated at that time to address these business concerns.

👉 Interested folks can read it here: Model Compression: A Critical Step Towards Efficient Machine Learning.

I am confident you will learn a lot of practical skills from this 20-min deep dive :)

👉 If you liked this post, don’t forget to leave a like ❤️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights. The button is located towards the bottom of this email.

Thanks for reading!
