You Can Build Any Linear Model If You Learn Just One Thing About Them

The most underappreciated skill in building linear models.

Dec 11, 2023

Many of you appreciated yesterday’s post on Poisson regression.

Today, we shall continue discussing that topic, and I will help you cultivate what I think is one of the MOST overlooked and underappreciated skills in developing linear models.

And I can guarantee that harnessing this skill will give you so much clarity and intuition in the modeling stages.

But let’s do a quick recap of yesterday’s post before we proceed.

Recap

Having a non-negative response in the training data does not stop linear regression from outputting negative values.

Essentially, you can always extrapolate the regression fit for some inputs to get a negative output.

Extrapolation of the linear regression fit

While this is not an “issue” per se, negative outputs may not make sense in cases where you can never have such outcomes.

For instance:

Predicting the number of calls received.
Predicting the number of cars sold in a year, etc.

More specifically, the issue arises when modeling a count-based response, where a negative output wouldn’t make sense.

In such cases, Poisson regression often turns out to be a more suitable linear model than linear regression.

This is evident from the image below:

Please read yesterday’s post for in-depth info: The Modeling Limitations of Linear Regression Which Poisson Regression Addresses.

I thought of continuing yesterday’s post because some of the replies I received appeared to be somewhat overly optimistic about how Poisson regression can handle those limitations.

But, trust me, Poisson regression is no magic.

It’s just that, in our specific use case, the data generation process didn’t perfectly align with what linear regression is designed to handle.

In other words, earlier when we trained a linear regression model, we inherently assumed that the data was sampled from a normal distribution.

But that was not true in this Poisson regression case.

Instead, it came from a Poisson distribution, which is why Poisson regression worked better.

Thus, the takeaway is that whenever you train linear models, always always and always think about the data generation process.

It goes like this:

Okay, I have this data.
I want to fit a linear model through it.
What information do I get about the data generation process that can help me select an appropriate linear model?

You’d start appreciating the importance of data generation when you’d realize that literally member of the generalized linear model family stems from altering the data generation process.

For instance:

If the data generation process involves a Normal distribution → you get linear regression.
If the data has only positive integers in the response variable, maybe it came from a Poisson distribution → and this gives us Poisson regression. This is precisely what we discussed yesterday.
If the data has only two targets — 0 and 1, maybe it was generated using Bernoulli distribution → and this gives rise to logistic regression.
If the data has finite and fixed categories (0, 1, 2,…n), then this hints towards Binomial distribution → and we get Binomial regression.

See…

Every linear model makes an assumption and is then derived from an underlying data generation process.

Data generation assumption of linear models

Thus, developing a habit of holding for a second and thinking about the data generation process will give you so much clarity in the modeling stages.

I am confident this will help you get rid of that annoying and helpless habit of relentlessly using a specific sklearn algorithm without truly knowing why you are using it.

Consequently, you’d know which algorithm to use and, most importantly, why.

This improves your credibility as a data scientist and allows you to approach data science problems with intuition and clarity rather than hit-and-trial.

In fact, once you understand the data generation process, you will automatically get to know about most of the assumptions of that specific linear model.

Here, I would really encourage you to read the deep dive on generalized linear models (GLMs) because there are so many nuances that one must be aware of.

👉 Read it here: Generalized Linear Models (GLMs): The Supercharged Linear Regression.

A linear regression model is undeniably an extremely powerful model, in my opinion.

However, it does make some strict assumptions about the type of data it can model.

So, if linear regression is the only linear model you know, it is difficult to expect good modeling results because nothing stops real-world datasets from violating these assumptions.

That is why being aware of its extensions is immensely important.

The GLM article covers the following topics:

How does linear regression model data?
The limitations of linear regression.
What are GLMs?
What are the core components of GLMs?
How do they relax the assumptions of linear regression?
What are the common types of GLMs?
What are the assumptions of these GLMs?
How do we use maximum likelihood estimates with these GLMs?
How to build a custom GLM for your own data?
Best practices and takeaways.

👉 Interested folks can read it here: Generalized Linear Models (GLMs): The Supercharged Linear Regression.

👉 If you liked this post, don’t forget to leave a like ❤️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights.

Thanks so much for appreciating the effort :)

The button is located towards the bottom of this email.

Thanks for reading!

Daily Dose of Data Science

Discussion about this post

Ready for more?