Daily Dose of Data Science

Share this post

Covariate Shift Is Way More Problematic Than Most People Think

www.blog.dailydoseofds.com

Discover more from Daily Dose of Data Science

High-quality insights on Data Science and Python, along with best practices — shared daily. Get a free 550+ page data science PDF guide and 450+ practice questions notebook.
Over 50,000 subscribers
Continue reading
Sign in

Covariate Shift Is Way More Problematic Than Most People Think

Here's what most people overlook when detecting covariate shift.

Avi Chawla
Oct 24, 2023
30
Share this post

Covariate Shift Is Way More Problematic Than Most People Think

www.blog.dailydoseofds.com
2
Share

Almost all real-world ML models gradually degrade in performance due to covariate shift.

Data distribution changes over time

For starters, covariate shift happens when the distribution of features changes over time, but the true (natural) relationship between input and output remains the same. We have already discussed it in detail before.

Test sample distribution is different from the training distribution

It is a serious problem because we trained the model on one distribution, but it is being used to predict on another distribution in production.

Thus, it is critical to detect covariate shift early so that models continue to work well.

One of the most common and intuitive ways to detect covariate shift is by simply comparing the feature distribution in training data to that in production.

This could be done in many ways, such as:

  • Compare their summary statistics — mean, median, etc.

  • Inspect visually using distribution plots.

  • Perform hypothesis testing.

  • Measure distances between training/production distributions using Bhattacharyya distance, KS test, etc.

While these approaches are often effective, the biggest problem is that they work on a single feature at a time.

But, in real life, we may observe multivariate covariate shift as well.

Multivariate covariate shift happens when:

  • The distribution of individual distributions remains the same:

    • P(X1) and P(X2) individually remain the same.

  • But their joint distribution changes:

    • P(X1, X2) changes.

This is evident from the image below:

Blue → Training data; Red → Production data.
  • From the KDE plots on the top and the right, it is clear that the distribution of both features (covariates) is almost the same.

  • But, the scatter plot reveals that their joint distribution in training (Blue) differs from that in production (Red).

And it is easy to guess that the univariate covariate shift detection methods discussed above will produce misleading results.

For instance, as demonstrated below, measuring the Bhattacharyya distance between a training and production feature gives a very low distance value, indicating high similarity:

In fact, even though the individual feature distribution is the same, we can confirm experimentally that this will result in a drop in model performance:

Let’s say the true output y = 2*Feature_A + 3*Feature_B + noise.

In most cases, the true output predictions on production data are not immediately available. In the above demonstration, just assume that we have already gathered the true predictions somehow.

As depicted above, the model performance drops by 20% in production, which is huge.

Here, we may completely rule out the possibility of covariate shift if we think that covariate shift can never be multivariate in nature.

Of course, in the figure below, it was easy to identify multivariate covariate shift because we are only looking at two features.

But multivariate covariate shift can happen with more than two features as well.

Unlike the bivariate case above, visual inspection will not be possible for higher dimensions.

Now you know the problem.

So how can we detect this?

Today, instead of proposing the solution myself, I want you to give this problem a thought.

More specifically, you have to think about:

How can we reliably detect multivariate covariate shift?

We will continue the discussion tomorrow, but in the meantime, I would love to hear from you :)

Multivariate covariate shift is a common problem that many real-world ML models suffer from.

Thus, it is critical to detect covariate shift early so that models continue to work as expected.

I am eagerly waiting to hear from you.

Thanks for reading!

Thanks for reading Daily Dose of Data Science! Subscribe for free to learn something new and insightful about Python and Data Science every day. Also, get a Free Data Science PDF (550+ pages) with 320+ tips.

👉 If you liked this post, don’t forget to leave a like ❤️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights. The button is located towards the bottom of this email.


Latest full articles

If you’re not a full subscriber, here’s what you missed last month:

  • You Cannot Build Large Data Projects Until You Learn Data Version Control!

  • Why Bagging is So Ridiculously Effective At Variance Reduction?

  • Sklearn Models are Not Deployment Friendly! Supercharge Them With Tensor Computations.

  • Deploy, Version Control, and Manage ML Models Right From Your Jupyter Notebook with Modelbit

  • Model Compression: A Critical Step Towards Efficient Machine Learning.

  • Gaussian Mixture Models (GMMs): The Flexible Twin of KMeans.

  • Bayesian Optimization for Hyperparameter Tuning.

To receive all full articles and support the Daily Dose of Data Science, consider subscribing:

I want to read full articles.


👉 Tell the world what makes this newsletter special for you by leaving a review here :)

Review Daily Dose of Data Science

👉 If you love reading this newsletter, feel free to share it with friends!

Share Daily Dose of Data Science

30
Share this post

Covariate Shift Is Way More Problematic Than Most People Think

www.blog.dailydoseofds.com
2
Share
Previous
Next
2 Comments
Share this discussion

Covariate Shift Is Way More Problematic Than Most People Think

www.blog.dailydoseofds.com
Joe Corliss
Oct 24·edited Oct 24Liked by Avi Chawla

If dataset 1 is stored in the matrix X (rows = observations, columns = variables) and dataset 2 is stored in Y, we could calculate X^T*X and Y^T*Y, which removes the difference in sample size and is independent of the orders of the rows. Then we could calculate the Frobenius distance between them, ||X^T*X - Y^T*Y||_F. Not sure if it means anything though, just a thought.

Expand full comment
Reply
Share
Paulo B
Oct 24

Hi Avi Chawla, Hi Everyone!

This is an important topic which i face a lot of time on my daily work. For me, my big question is: how to use previous data and model to adapt for this covariate shift?

In a regression (or classification) framework, most time the new data is arriving which is different for the one used to built the model, but we do not have targets/labels for this new data. Thus, (re)train a (new) model on the new data is not feasible. Which adaptation we can follow to deal with this question?

Thank you,

Expand full comment
Reply
Share
Top
New
Community

No posts

Ready for more?

© 2023 Avi Chawla
Privacy ∙ Terms ∙ Collection notice
Start WritingGet the app
Substack is the home for great writing