Daily Dose of Data Science


Why R-squared is a Flawed Regression Metric

...and how to avoid misleading conclusions from it.

Avi Chawla
Jul 26, 2023



R² is widely used across data science and statistics to assess a model.

It is often interpreted as a performance metric for evaluating a model when, in reality, it is not one.

Let’s understand!

R² measures the fraction of variability in the outcome variable that a model captures.

It is defined as follows:

\(R^2 = \frac{\text{Variability captured by the model}}{\text{Total variability in the data}}\)
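As a rough sketch of how this ratio is usually computed in practice: the helper below uses the common form 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean. For a least-squares fit with an intercept, this equals the captured fraction above. The function name `r_squared` and the toy values are illustrative assumptions, not code from the post.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    # Total variability of the outcome around its mean.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    # Variability the predictions fail to capture.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the outcome has zero variance")
    return 1 - ss_res / ss_tot


y = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y, y))          # perfect predictions -> 1.0
print(r_squared(y, [2.5] * 4))  # always predicting the mean -> 0.0
```

If scikit-learn is available, `sklearn.metrics.r2_score` computes the same quantity.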

In simple words, variability here refers to how much the outcome variable (y) spreads around its mean. That spread includes both genuine signal and random noise.

Left: The outcome variable has zero variance. Right: The outcome variable has a non-zero variance.

Thus, the more variability captured, the higher the R².

This means that solely optimizing for R² on the training data as a performance measure:

  • rewards overfitting: a model flexible enough to pass through every training point reaches an R² of 1.

  • leads us to engineer the model in a way that captures random noise instead of underlying patterns.

It is important to note that:

  • R² is NOT a measure of predictive power.

  • Instead, R² is a measure of how closely a model fits the data it was trained on.

Thus, you should NEVER use training R² on its own to judge how well a model will predict unseen data.

This is evident from the image below:

  • An overly complex and overfitted model gets an almost perfect R² of 1.

  • A better and more generalized model gets a lower R² score.
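The contrast described above can be reproduced with a small, hedged sketch. It assumes a made-up setup rather than the data behind the image: the underlying pattern is the line y = 2x + 1, the training "noise" is a fixed alternating ±0.3 (chosen so the result is reproducible without a random seed), and the held-out points are noiseless midpoints between the training inputs. The helper names are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot


def fit_line(xs, ys):
    """Closed-form least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: a * x + b


def fit_interpolant(xs, ys):
    """Lagrange polynomial that passes through every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def truth(x):
    # Assumed underlying pattern for this illustration.
    return 2 * x + 1


x_train = [i / 9 for i in range(10)]
noise = [0.3 if i % 2 == 0 else -0.3 for i in range(10)]  # stylized, deterministic noise
y_train = [truth(x) + e for x, e in zip(x_train, noise)]
x_test = [(i + 0.5) / 9 for i in range(9)]  # midpoints between training inputs
y_test = [truth(x) for x in x_test]

models = {
    "straight line": fit_line(x_train, y_train),
    "degree-9 interpolant": fit_interpolant(x_train, y_train),
}
results = {}
for name, model in models.items():
    train = r_squared(y_train, [model(x) for x in x_train])
    test = r_squared(y_test, [model(x) for x in x_test])
    results[name] = (train, test)
    print(f"{name}: training R^2 = {train:.3f}, held-out R^2 = {test:.3f}")
```

With these inputs, the straight line scores roughly 0.8 on the training data and above 0.99 on the held-out points. The interpolating polynomial scores 1 on the training data and a strongly negative R² on the held-out points. The same metric tells opposite stories depending on which data it is computed on.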

Some other flaws of R2 are:

  • Training R² never decreases as you add more features to a least-squares regression, even if they are random noise.

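  • In some cases, one can determine R² even before fitting a model, which is counterintuitive. For example, if y = a·x + ε with independent noise ε, the population R² of the correct linear model is a²·Var(x) / (a²·Var(x) + Var(ε)), which is fixed by the data-generating process alone.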
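The first flaw above can be checked numerically. The sketch below is a minimal illustration under assumed settings (30 synthetic points, one real feature, five appended columns of pure Gaussian noise). It fits ordinary least squares with an intercept through the normal equations, using only the standard library. The helper names `solve` and `training_r2` are hypothetical.

```python
import random


def solve(a, b):
    """Solve the square linear system a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (m[i][n] - sum(m[i][j] * w[j] for j in range(i + 1, n))) / m[i][i]
    return w


def training_r2(features, y):
    """Fit least squares with an intercept and return R^2 on the same data."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    w = solve(xtx, xty)
    preds = [sum(wi * ri for wi, ri in zip(w, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


rng = random.Random(42)
n = 30
x = [rng.uniform(0, 1) for _ in range(n)]
y = [3 * xi + rng.gauss(0, 0.5) for xi in x]

features = [[xi] for xi in x]
scores = [training_r2(features, y)]
for _ in range(5):
    # Append a column of pure noise that has nothing to do with y.
    features = [f + [rng.gauss(0, 1)] for f in features]
    scores.append(training_r2(features, y))

for k, s in enumerate(scores):
    print(f"{k} noise features: training R^2 = {s:.4f}")
```

Each printed score should be at least as large as the one before it, up to floating-point rounding, even though the added columns carry no information about y.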

👉 Read my full blog on the A-Z of R², what it is, its limitations, and much more here: Flaws of R² Metric.

👉 Over to you: What are some other flaws in R²?

3 Comments
Joe Corliss
Jul 30

This post is deeply misleading. The problem at hand is not the R-squared metric, it is model overfitting. The use of R-squared does not cause overfitting, it is the lack of model validation and/or regularization.

Devansh
Writes Technology Made Simple
Jul 26

Would you be interested in collaborating? I write about AI Research, ML Engineering and DS as well. My main newsletter- AI Made Simple- is currently at 97K readers- https://artificialintelligencemadesimple.substack.com/


© 2023 Avi Chawla