Daily Dose of Data Science


Are You Misinterpreting Correlation for Predictiveness?

Here’s what you should use instead to measure predictiveness.

Avi Chawla
Aug 28, 2023

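Correlation measures how strongly two features vary together: linearly in the case of Pearson correlation, or monotonically in the case of rank-based measures such as Spearman correlation.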

This makes correlation symmetric: corr(A, B) = corr(B, A).

Yet, associations are often asymmetric.

For instance, given a date, it is easy to tell the month. But given a month, you can never tell the date.

One can tell month from date but not the other way around

Correlation, being symmetric, entirely ignores this notion.
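To make the date and month example concrete, here is a minimal sketch in plain Python. The year 2023, the day-of-year encoding of a date, and the helper name `pearson` are assumptions chosen for illustration, not part of the original post. It shows that correlation returns the same number in both directions, even though only one direction is actually determined.

```python
from datetime import date, timedelta


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Every calendar day of 2023 (an arbitrary non-leap year, chosen for illustration).
days = [date(2023, 1, 1) + timedelta(days=i) for i in range(365)]
day_of_year = [d.timetuple().tm_yday for d in days]
month = [d.month for d in days]

# Correlation gives the same number whichever variable comes first.
r_date_month = pearson(day_of_year, month)
r_month_date = pearson(month, day_of_year)

# The mapping itself is one-directional: a date fixes its month,
# but a single month is shared by many dates.
dates_per_month = {}
for d in days:
    dates_per_month.setdefault(d.month, []).append(d)

print(f"corr(date, month) = {r_date_month:.4f}")
print(f"corr(month, date) = {r_month_date:.4f}")
print("dates sharing one month:", sorted({len(v) for v in dates_per_month.values()}))
```

Both correlation values are identical, while each month is shared by 28, 30, or 31 dates. The single correlation number cannot express which direction is the easy one.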

What’s more, correlation is not meant to quantify how well a feature can predict the outcome.

Yet, at times, it is misinterpreted as a measure of “predictiveness”.

Lastly, correlation is mostly limited to numerical data. But categorical data is equally important for predictive models.

The Predictive Power Score (PPS) addresses each of these limitations.

As the name suggests, it measures the predictive power of a feature.

PPS(a → b) is calculated as follows:

  • If the target (b) is numeric:

    • Train a Decision Tree Regressor that predicts b using a.

    • Find PPS by comparing its mean absolute error (MAE) to the MAE of a baseline model that always predicts the median of b.

  • If the target (b) is categorical:

    • Train a Decision Tree Classifier that predicts b using a.

    • Find PPS by comparing its F1 score to the F1 score of a baseline model (random or most-frequent-class prediction).
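The two recipes above can be sketched with scikit-learn as follows. This is a rough illustration, not the reference implementation: the function names, the 4-fold shuffled cross-validation, the weighted F1 average, the clipping at zero, and the exact normalisation formulas are assumptions, loosely modelled on how the open-source ppscore package is understood to work. Only numeric features are handled here; a categorical feature would need to be encoded first.

```python
import numpy as np
from sklearn.metrics import f1_score, mean_absolute_error
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor


def pps_numeric_target(a, b, seed=0):
    """Sketch of PPS(a -> b) for a numeric target b (assumed formula: 1 - MAE / MAE_baseline)."""
    X = np.asarray(a, dtype=float).reshape(-1, 1)
    y = np.asarray(b, dtype=float)
    folds = KFold(n_splits=4, shuffle=True, random_state=seed)
    pred = cross_val_predict(DecisionTreeRegressor(random_state=seed), X, y, cv=folds)
    mae_model = mean_absolute_error(y, pred)
    # Baseline: always predict the median of the target.
    mae_baseline = mean_absolute_error(y, np.full_like(y, np.median(y)))
    if mae_baseline == 0:  # constant target, nothing to predict
        return 0.0
    return max(0.0, 1.0 - mae_model / mae_baseline)


def pps_categorical_target(a, b, seed=0):
    """Sketch of PPS(a -> b) for a categorical target b (assumed formula: rescaled weighted F1)."""
    X = np.asarray(a, dtype=float).reshape(-1, 1)
    y = np.asarray(b)
    folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=seed)
    pred = cross_val_predict(DecisionTreeClassifier(random_state=seed), X, y, cv=folds)
    f1_model = f1_score(y, pred, average="weighted", zero_division=0)
    # Baseline: the better of "always the most frequent class" and a random shuffle of the labels.
    labels, counts = np.unique(y, return_counts=True)
    most_frequent = np.full(y.shape, labels[np.argmax(counts)])
    shuffled = np.random.default_rng(seed).permutation(y)
    f1_baseline = max(
        f1_score(y, most_frequent, average="weighted", zero_division=0),
        f1_score(y, shuffled, average="weighted", zero_division=0),
    )
    if f1_baseline == 1:  # baseline already perfect
        return 0.0
    return max(0.0, (f1_model - f1_baseline) / (1.0 - f1_baseline))


# Made-up data for a quick check.
x = np.linspace(-3, 3, 400)
y_num = x ** 2                              # numeric target, fully determined by x
label = np.where(x > 0, "pos", "neg")       # categorical target, fully determined by x
noise = np.random.default_rng(0).normal(size=x.size)  # target unrelated to x

score_num = pps_numeric_target(x, y_num)
score_cat = pps_categorical_target(x, label)
score_noise = pps_numeric_target(x, noise)
print(score_num, score_cat, score_noise)
```

On this toy data the first two scores come out close to 1 and the noise score close to 0. Evaluating on held-out folds matters: a fully grown tree scored on its own training data would look perfect even on pure noise.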

Thus, PPS:

  • is asymmetric, meaning PPS(a → b) need not equal PPS(b → a).

  • can be used on categorical targets (b).

  • can be used to measure the predictive power of categorical features (a).

  • can capture both linear and non-linear relationships.

  • can capture both monotonic and non-monotonic relationships.

Its effectiveness is evident in the original post's figure, which compares correlation and PPS on three datasets.

For all three datasets:

  • Correlation is low.

  • PPS (x → y) is high.

  • PPS (y → x) is zero.
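The same pattern can be reproduced on a synthetic dataset. The cosine data and the helper `pps_numeric` below are assumptions for illustration, not the datasets from the post. The helper follows the numeric-target recipe (decision tree versus median baseline, scored by MAE on held-out folds), with an assumed normalisation of 1 minus the MAE ratio, clipped at zero.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor


def pps_numeric(feature, target, seed=0):
    """Rough PPS(feature -> target) for numeric columns; 0 means no better than the median."""
    X = np.asarray(feature, dtype=float).reshape(-1, 1)
    y = np.asarray(target, dtype=float)
    folds = KFold(n_splits=4, shuffle=True, random_state=seed)
    pred = cross_val_predict(DecisionTreeRegressor(random_state=seed), X, y, cv=folds)
    mae_model = mean_absolute_error(y, pred)
    mae_median = mean_absolute_error(y, np.full_like(y, np.median(y)))
    return max(0.0, 1.0 - mae_model / mae_median)


# Hypothetical non-monotonic relationship: four full periods of a cosine wave.
x = np.linspace(-4 * np.pi, 4 * np.pi, 801)
y = np.cos(x)

pearson_r = np.corrcoef(x, y)[0, 1]   # symmetric measure
pps_xy = pps_numeric(x, y)            # how well x predicts y
pps_yx = pps_numeric(y, x)            # how well y predicts x

print(f"Pearson r  = {pearson_r:.3f}")
print(f"PPS(x → y) = {pps_xy:.3f}")
print(f"PPS(y → x) = {pps_yx:.3f}")
```

Here x pins down y exactly, so PPS(x → y) is high, while each y value corresponds to several x values, so PPS(y → x) stays near zero. The correlation is close to zero in both directions.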

That being said, it is important to note that correlation has its place.

When selecting between PPS and correlation, first set a clear objective about what you wish to learn about the data:

  • Do you want to know the general monotonic trend between two variables? Correlation will help.

  • Do you want to know the predictiveness of a feature? PPS will help.

👉 Over to you: What other points will you add here about PPS vs. Correlation?

Get started with PPS: GitHub.

1 Comment
Joe Corliss
Sep 1

Mutual information is another good measure.


© 2023 Avi Chawla