Daily Dose of Data Science

How Zero-inflated Datasets Can Ruin Your Regression Modeling

...and here's how you can prevent it.

Avi Chawla
Aug 24, 2023

In a typical regression dataset, the target variable is spread fairly evenly over its range, with no large spike at any single value.

But, at times, the target variable may have plenty of zeros. Such datasets are called zero-inflated datasets.

Zero-inflated target distribution

They can cause problems during regression modeling, because a regression model cannot always predict exact “zero” values when, ideally, it should.

For instance, consider simple linear regression. The regression line will output exactly “zero” only once (if it has a non-zero slope).

Simple linear regression fit outputs zero only once
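
To make this concrete, here is a minimal sketch using NumPy. The synthetic data (zero targets for x < 5, a roughly linear target otherwise) is an assumption made for illustration and is not the dataset from the figures. It fits an ordinary least-squares line and counts how many predictions are exactly zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical zero-inflated data: the target is exactly 0 for x < 5
# and roughly linear in x otherwise.
x = rng.uniform(0, 10, size=200)
y = np.where(x < 5, 0.0, 2.0 * x + rng.normal(0, 1, size=200))

# Ordinary least-squares line: y_hat = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

n_true_zeros = int(np.sum(y == 0))      # zeros in the true target
n_pred_zeros = int(np.sum(y_hat == 0))  # exact zeros among the predictions
x_root = -intercept / slope             # the single input where the line is zero

print(f"true zeros: {n_true_zeros}, predicted zeros: {n_pred_zeros}")
print(f"the fitted line crosses zero only at x = {x_root:.3f}")
```

A line with a non-zero slope is zero only at x = -intercept / slope, so unless a sample lands exactly on that point, none of the many true zeros are reproduced.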

This issue persists:

  • Not only in higher dimensions...

  • But also in complex models like neural nets for regression.

One great way to solve this is by training a combination of a classification and a regression model.

This goes as follows:

  • Mark all non-zero targets as “1” and the rest as “0”.

  • Train a binary classifier on this dataset.

Train a binary classifier
  • Next, train a regression model only on those data points with a non-zero true target.

Train a regression model
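
The training steps above can be sketched with scikit-learn as follows. The choice of `LogisticRegression` and `LinearRegression`, the helper name `fit_two_stage`, and the synthetic data are all assumptions made for illustration; any binary classifier and any regressor could take their place.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression


def fit_two_stage(X, y):
    """Fit a zero/non-zero classifier and a regressor on the non-zero rows.

    Hypothetical helper, not part of scikit-learn.
    """
    X = np.asarray(X)
    y = np.asarray(y)

    # Mark all non-zero targets as 1 and the rest as 0.
    is_nonzero = (y != 0).astype(int)

    # Train the binary classifier on every row, using the binary labels.
    clf = LogisticRegression().fit(X, is_nonzero)

    # Train the regressor only on rows whose true target is non-zero.
    mask = is_nonzero == 1
    reg = LinearRegression().fit(X[mask], y[mask])
    return clf, reg


# Hypothetical zero-inflated data: zero target for x < 5, linear otherwise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.where(X[:, 0] < 5, 0.0, 2.0 * X[:, 0] + rng.normal(0, 1, size=300))

clf, reg = fit_two_stage(X, y)
print("regressor slope on non-zero rows:", reg.coef_[0])
```

Because the regressor never sees the zero rows, its fit is not pulled toward zero by them.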

During prediction:

Prediction phase
  • If the classifier's output is “0”, the final output is also zero.

  • If the classifier's output is “1”, use the regression model to predict the final output.
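
The prediction rule can be sketched as a small function that works with any classifier and regressor exposing a scikit-learn-style `predict` method. The function name `predict_two_stage` and the two stand-in model classes are hypothetical; they exist only so the example runs on its own.

```python
import numpy as np


def predict_two_stage(clf, reg, X):
    """Return 0 where the classifier says "zero", else the regressor's output."""
    X = np.asarray(X)
    labels = np.asarray(clf.predict(X))  # 0 = zero target, 1 = non-zero target
    out = np.zeros(len(X), dtype=float)  # classifier output 0 -> final output 0
    nonzero = labels == 1
    if nonzero.any():
        # Classifier output 1 -> use the regression model for those rows.
        out[nonzero] = np.asarray(reg.predict(X[nonzero]), dtype=float)
    return out


class ThresholdClassifier:
    """Stand-in classifier (hypothetical): predicts 1 when the feature is >= 5."""

    def predict(self, X):
        return (np.asarray(X)[:, 0] >= 5).astype(int)


class DoubleRegressor:
    """Stand-in regressor (hypothetical): predicts twice the feature value."""

    def predict(self, X):
        return 2.0 * np.asarray(X)[:, 0]


X_new = np.array([[1.0], [4.0], [6.0], [9.0]])
preds = predict_two_stage(ThresholdClassifier(), DoubleRegressor(), X_new)
print(preds)
```

The first two rows are classified as “0” and get an exact zero; the last two are passed to the regressor.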

The image below compares this approach with a regular regression model:

Regression vs. Regression + Classification results
  • Linear regression alone underfits the data.

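  • Linear regression with a classifier predicts exact zeros wherever the classifier outputs “0” and fits the remaining non-zero points with the regression model.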

👉 Over to you: What are other ways to train a model on a zero-inflated dataset?

Thanks for reading :)


2 Comments
kverdecia
Aug 24 (Liked by Avi Chawla)

Good content. You should write a book with these daily doses.

Mmdkry
Aug 24

How do you predict targets that are equal to zero? What does the model output when y = 0? I think we have to create two models, one for zeros and one for non-zeros. I ran into this problem, and I am looking for a way to predict when y goes from zero to non-zero and how that change depends on a single feature.

© 2023 Avi Chawla