Most Sklearn Users Don't Know This About Its LinearRegression Implementation

Always review what a specific implementation is hiding underneath.

May 17, 2023

Sklearn's LinearRegression class implements the ordinary least squares (OLS) method to find the best fit.

Some important characteristics of OLS are:

It is a deterministic algorithm. If run multiple times on the same data, it will always produce the same weights.
It has no hyperparameters.
Its closed-form solution involves matrix inversion, which is cubic in the number of features. This gets computationally expensive with many features.

Read this answer to learn more about OLS’ run-time complexity: StackOverflow.
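To make the first and third points concrete, here is a minimal sketch on synthetic data. The data, the variable names, and the choice of fit_intercept=False (so that the coefficients line up with the textbook normal equations) are assumptions made for illustration, not something taken from Sklearn's internals.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed for illustration): 200 samples, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.5, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + rng.normal(scale=0.1, size=200)

# Textbook OLS via the normal equations: w = (X^T X)^(-1) X^T y.
# Solving the linear system avoids forming the inverse explicitly,
# but the cost is still cubic in the number of features.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Fit the same model twice. No randomness is involved,
# so both fits give the same weights.
model_a = LinearRegression(fit_intercept=False).fit(X, y)
model_b = LinearRegression(fit_intercept=False).fit(X, y)

same_across_runs = np.allclose(model_a.coef_, model_b.coef_)
matches_closed_form = np.allclose(model_a.coef_, w_closed)
print(same_across_runs, matches_closed_form)
```

Note that there is nothing to tune here: no learning rate, no iteration count, no random seed.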

SGDRegressor, however:

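is a stochastic algorithm. It finds an approximate solution using iterative optimization.
has hyperparameters (for example, the learning rate and the number of iterations).
involves stochastic gradient descent, which is relatively inexpensive: each update costs time linear in the number of features.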
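The sketch below illustrates these three points on synthetic data. The hyperparameter values (alpha, eta0, max_iter, tol) and the two seeds are arbitrary choices for illustration, not recommended settings, and the features are standardized first because SGD is sensitive to feature scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed for illustration): 2000 samples, 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
true_w = rng.normal(size=10)
y = X @ true_w + rng.normal(scale=0.1, size=2000)

# SGD is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Exact least-squares solution, used as the reference.
ols = LinearRegression().fit(X_scaled, y)

# Illustrative hyperparameters; a tiny alpha keeps the L2 penalty negligible.
params = dict(alpha=1e-6, eta0=0.01, max_iter=1000, tol=1e-4)
sgd_a = SGDRegressor(random_state=0, **params).fit(X_scaled, y)
sgd_b = SGDRegressor(random_state=1, **params).fit(X_scaled, y)

# Approximate: close to the OLS weights, but not identical.
gap_to_ols = np.max(np.abs(sgd_a.coef_ - ols.coef_))
# Stochastic: a different seed shuffles the data differently.
gap_between_seeds = np.max(np.abs(sgd_a.coef_ - sgd_b.coef_))
print(gap_to_ols, gap_between_seeds)
```

Both gaps should be small but non-zero, which is what "approximate" and "stochastic" mean in practice.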

Now, if you have many features, Sklearn's LinearRegression will be computationally expensive.

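This is because it relies on OLS, whose closed-form solution involves matrix inversion. And as mentioned above, inverting a matrix is cubic in the number of features. (Under the hood, Sklearn solves the least-squares problem with scipy.linalg.lstsq rather than forming the inverse explicitly, but the cost still grows much faster than linearly with the number of features.)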

This explains the run-time improvement provided by Sklearn's SGDRegressor over LinearRegression.
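If you want to measure this on your own machine, a rough timing harness could look like the sketch below. The helper name time_fit and the data sizes are assumptions for illustration. The sizes are deliberately small so the script finishes quickly, so increase n_features to see how the two estimators scale. Actual timings depend on hardware, the linear algebra backend, and SGDRegressor's stopping settings.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor


def time_fit(model, X, y):
    """Return the wall-clock seconds taken by model.fit(X, y)."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


# Synthetic data (sizes are assumptions; increase n_features to see the trend).
rng = np.random.default_rng(0)
n_samples, n_features = 2000, 300
X = rng.normal(size=(n_samples, n_features))
y = X @ rng.normal(size=n_features) + rng.normal(scale=0.1, size=n_samples)

timings = {
    "LinearRegression": time_fit(LinearRegression(), X, y),
    "SGDRegressor": time_fit(SGDRegressor(random_state=0), X, y),
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f} s")
```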

So remember...

When you have many features, avoid using Sklearn's LinearRegression.

Instead, use the SGDRegressor.

This will help you:

Improve run-time.
Avoid memory errors.
Implement batching (if needed). I have covered this in one of my previous posts here: A Lesser-Known Feature of Sklearn To Train Models on Large Datasets.
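As a minimal sketch of the batching point, SGDRegressor exposes partial_fit, which updates the model one chunk at a time so the full dataset never has to sit in memory. The generator batch_stream below is a hypothetical stand-in for reading chunks from disk, and the batch count and size are arbitrary.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
n_features = 20
true_w = rng.normal(size=n_features)


def batch_stream(n_batches, batch_size):
    """Hypothetical stand-in for chunks read from disk or a database."""
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, n_features))
        y = X @ true_w + rng.normal(scale=0.1, size=batch_size)
        yield X, y


model = SGDRegressor(random_state=0)

# Each call performs one pass of SGD over the given batch only.
for X_batch, y_batch in batch_stream(n_batches=200, batch_size=256):
    model.partial_fit(X_batch, y_batch)

# Evaluate on a fresh, held-out batch.
X_test, y_test = next(batch_stream(n_batches=1, batch_size=1000))
r2 = model.score(X_test, y_test)
print(f"Held-out R^2: {r2:.3f}")
```

LinearRegression has no partial_fit, so it needs the whole feature matrix in memory at once.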

👉 Over to you: What are some tradeoffs between using LinearRegression vs. SGDRegressor?

Find the code for my tips here: GitHub.

I like to explore, experiment and write about data science concepts and tools. You can read my articles on Medium. Also, you can connect with me on LinkedIn and Twitter.
