How Zero-inflated Datasets Ruin Your Regression Modeling

...and here's how you can prevent it.

Dec 26, 2023

In a typical regression dataset, the target variable is spread fairly smoothly across its range.

But, at times, the target variable may have plenty of zeros. Such datasets are called zero-inflated datasets.

They can cause problems during regression modeling because a regression model rarely predicts an exact “zero”, even when, ideally, it should.

For instance, consider simple linear regression. A fitted line with a non-zero slope outputs exactly “zero” at only one input value: the point where it crosses the x-axis.

Simple linear regression fit outputs zero only once
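To make this concrete, here is a minimal sketch on synthetic data. The dataset, the cutoff at x = 5, and all variable names are assumptions made for illustration and do not come from the original post. It fits an ordinary least-squares line with NumPy and counts how many predictions are exactly zero.

```python
import numpy as np

# Hypothetical zero-inflated data: targets are exactly zero whenever x < 5.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.where(x < 5, 0.0, 2.0 * x + rng.normal(0, 0.5, size=200))

# Ordinary least-squares line: y_hat = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

# The only input at which the line outputs zero is its x-intercept.
x_zero = -intercept / slope

n_true_zeros = int(np.sum(y == 0.0))
n_pred_zeros = int(np.sum(y_hat == 0.0))
print(f"exact zeros in targets:     {n_true_zeros}")
print(f"exact zeros in predictions: {n_pred_zeros}")
print(f"line crosses zero at x = {x_zero:.3f}")
```

Many targets are exactly zero, but the fitted line can hit zero only at `x_zero`, so almost none of its predictions will.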

This issue persists not only in higher dimensions but also in more complex models, such as neural nets for regression.

One great way to solve this is by training a combination of a classification and a regression model.

This goes as follows:

Mark all non-zero targets as “1” and the rest as “0”.
Train a binary classifier on this dataset.

Next, train a regression model only on those data points with a non-zero true target.

During prediction:

If the classifier's output is “0”, the final output is also zero.
If the classifier's output is “1”, use the regression model to predict the final output.
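The steps above can be sketched with scikit-learn as follows. This is one possible arrangement, not the author's exact code: the synthetic data, the choice of `LogisticRegression` and `LinearRegression`, and the helper `predict_two_stage` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical zero-inflated data: zero below x = 5, linear plus noise above it.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 1))
y = np.where(X[:, 0] < 5, 0.0, 3.0 * X[:, 0] + rng.normal(0, 1.0, size=500))

# Step 1: label non-zero targets as 1 and zero targets as 0, then train a classifier.
nonzero_mask = y != 0
clf = LogisticRegression().fit(X, nonzero_mask.astype(int))

# Step 2: train the regressor only on rows whose true target is non-zero.
reg = LinearRegression().fit(X[nonzero_mask], y[nonzero_mask])

def predict_two_stage(X_new):
    """Return 0 where the classifier says 'zero', else the regressor's output."""
    X_new = np.asarray(X_new, dtype=float)
    out = np.zeros(len(X_new))
    predicted_nonzero = clf.predict(X_new) == 1
    if predicted_nonzero.any():
        out[predicted_nonzero] = reg.predict(X_new[predicted_nonzero])
    return out

# Compare against a single linear regression trained on all rows.
two_stage_preds = predict_two_stage(X)
baseline_preds = LinearRegression().fit(X, y).predict(X)

two_stage_mse = float(np.mean((two_stage_preds - y) ** 2))
baseline_mse = float(np.mean((baseline_preds - y) ** 2))
print(f"two-stage MSE: {two_stage_mse:.3f}")
print(f"baseline MSE:  {baseline_mse:.3f}")
```

The classifier acts as a gate, so the combined model can output exact zeros, and the regressor never has to compromise between the zero and non-zero regions.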

The advantage of this approach over a plain regression model is evident from the image below:

Regression vs. Regression + Classification results

Linear regression alone underfits the data.
Linear regression with a classifier performs as expected.

👉 Over to you: What are other ways to train a model on a zero-inflated dataset?



Daily Dose of Data Science
