Daily Dose of Data Science

Why Correlation (and Other Summary Statistics) Can Be Misleading

...And here's how to avoid drawing misleading conclusions.

Avi Chawla
Aug 15, 2023

Many data scientists rely solely on the correlation matrix to study the association between variables.

What often goes unnoticed is that the correlation coefficient can be heavily driven by just a few outliers.

This is evident from the image above.

The addition of just two outliers drastically changed:

  • the correlation

  • the regression fit

This is why plotting the data is so important.

It can save you from wrong conclusions that you might otherwise draw by looking only at the summary statistics.

One thing I often do when using a correlation matrix is to create a PairPlot as well (shown below).

This lets me check whether the scatter plot of two variables agrees with their correlation measure.

👉 Over to you: What are some other measures you take when using summary statistics?

4 Comments
Jon
Aug 28 · Liked by Avi Chawla

How should one deal with those two outliers? Please advise.

Joe Corliss
Aug 15 · Liked by Avi Chawla

If there are too many variables to plot individually, then Spearman's rank correlation can provide a robust measure of the association between each pair of variables.
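
One possible way to try this suggestion is pandas' `DataFrame.corr(method="spearman")`, which returns the rank correlation for every pair of columns at once. The sketch below is only an illustration: the data is synthetic and the two extreme points are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for this sketch): y follows x closely.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
df = pd.DataFrame({"x": x, "y": 3 * x + rng.normal(scale=0.5, size=100)})

# Hypothetical contamination: two extreme points far off the trend.
extreme = pd.DataFrame({"x": [15.0, 16.0], "y": [-40.0, -45.0]})
df_out = pd.concat([df, extreme], ignore_index=True)

# Both calls return a full matrix; .loc picks out the single pair of interest.
pearson = df_out.corr(method="pearson").loc["x", "y"]
spearman = df_out.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Because Spearman's coefficient works on ranks, the two extreme points can shift it only as much as any other two observations could, whereas their raw magnitudes dominate the Pearson coefficient. With very small samples, two outliers still make up a large share of the ranks, so the rank-based measure is less protective there.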

© 2023 Avi Chawla