Daily Dose of Data Science

MissForest: A Better Alternative To Zero (or Mean) Imputation

www.blog.dailydoseofds.com

Discover more from Daily Dose of Data Science

High-quality insights on Data Science and Python, along with best practices — shared daily. Get a 550+ Page Data Science PDF Guide and 450+ Practice Questions Notebook, FREE.
Over 36,000 subscribers
Continue reading
Sign in

MissForest: A Better Alternative To Zero (or Mean) Imputation

Missing value imputation using Random Forest.

Avi Chawla
Aug 14, 2023

Replacing (imputing) missing values with mean or zero or any other fixed value:

  • alters summary statistics

  • changes the distribution

  • inflates the presence of a specific value

This can lead to:

  • inaccurate modeling

  • incorrect conclusions, and more.

Instead, try to impute missing values more precisely, using the information available in the other features.
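To make the distortion concrete, here is a minimal sketch using only the standard library. The toy column and the helper `impute_constant` are made up for illustration. It compares the mean and standard deviation of the observed values against the same column after mean and zero imputation.

```python
import statistics

# Hypothetical toy column: six observed values plus four missing entries (None).
observed = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0]
column = observed + [None] * 4

def impute_constant(values, fill):
    """Replace every missing entry with the same fixed value."""
    return [fill if v is None else v for v in values]

mean_filled = impute_constant(column, statistics.mean(observed))
zero_filled = impute_constant(column, 0.0)

# Print mean and (population) standard deviation for each version of the column.
for name, data in [("observed", observed), ("mean", mean_filled), ("zero", zero_filled)]:
    print(name, round(statistics.mean(data), 2), round(statistics.pstdev(data), 2))
```

Mean imputation keeps the mean but shrinks the spread, because four identical values are stacked at the center. Zero imputation shifts the mean and inflates the spread.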

In an earlier post, we discussed the kNN imputer. Today’s post builds on that by addressing its limitations:

  1. High run-time for imputation — especially for high-dimensional datasets.

  2. Issues with distance calculation in case of categorical non-missing features.

  3. Requires feature scaling, since distance calculations are sensitive to feature magnitudes.
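The scaling issue can be illustrated with scikit-learn's `KNNImputer`. This is a small sketch on made-up data: the three rows and the helper `min_max_scale_features` are assumptions for illustration. With raw units, the income column dominates the distance, so the nearest neighbor (and therefore the imputed value) changes once the features are rescaled.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical rows: [income, age, target]; the last row's target is missing.
X = np.array([
    [50000.0, 25.0, 1.0],
    [51000.0, 65.0, 10.0],
    [50600.0, 26.0, np.nan],
])

def min_max_scale_features(data, feature_cols):
    """Rescale the chosen columns to [0, 1]; other columns are left untouched."""
    scaled = data.copy()
    for j in feature_cols:
        lo, hi = np.nanmin(data[:, j]), np.nanmax(data[:, j])
        scaled[:, j] = (data[:, j] - lo) / (hi - lo)
    return scaled

imputer = KNNImputer(n_neighbors=1)

# Raw units: the row with the closest income wins, despite a 39-year age gap.
raw_fill = imputer.fit_transform(X)[2, 2]

# Scaled units: the row with a nearly identical age becomes the nearest neighbor.
scaled_fill = imputer.fit_transform(min_max_scale_features(X, [0, 1]))[2, 2]

print(raw_fill, scaled_fill)
```

Tree-based models split on thresholds rather than distances, which is why a Random Forest based imputer does not need this preprocessing step.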

MissForest imputer is another reliable choice for missing value imputation.

As the name suggests, it imputes missing values using the Random Forest algorithm.

The following figure depicts how it works:

Visual illustration of MissForest imputer
  • Step 1: To begin, fill the missing values of the feature with a simple initial guess, such as its mean or median.

  • Step 2: Train a Random Forest to predict the feature that has missing values from the other features.

  • Step 3: Impute ONLY originally missing values using Random Forest’s prediction.

  • Step 4: Back to Step 2. Use the imputed dataset from Step 3 to train the next Random Forest model.

  • Step 5: Repeat until convergence (or max iterations).
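The steps above can be sketched with scikit-learn's `RandomForestRegressor`. This is an illustrative sketch, not the MissingPy implementation: the function name `missforest_single_feature`, its parameters, and the synthetic data are assumptions, and it assumes one numeric column with missing values while all other columns are complete. Under that assumption the model is trained on the originally observed rows, so the loop settles after the second pass; the iterations matter more when several columns have gaps.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def missforest_single_feature(X, target_col, max_iter=5, tol=1e-3, random_state=0):
    """Iteratively impute one numeric column with a Random Forest (sketch)."""
    X = X.astype(float)  # astype returns a copy, so the input is not modified
    missing = np.isnan(X[:, target_col])
    if not missing.any():
        return X
    other_cols = [j for j in range(X.shape[1]) if j != target_col]

    # Step 1: initial guess (here, the mean of the observed values).
    X[missing, target_col] = X[~missing, target_col].mean()

    for _ in range(max_iter):
        previous = X[missing, target_col].copy()
        # Step 2: model the feature from the other columns,
        # using the rows where it was originally observed.
        model = RandomForestRegressor(n_estimators=100, random_state=random_state)
        model.fit(X[~missing][:, other_cols], X[~missing, target_col])
        # Step 3: overwrite ONLY the originally missing entries.
        X[missing, target_col] = model.predict(X[missing][:, other_cols])
        # Steps 4-5: repeat until the imputed values stop changing.
        if np.mean((X[missing, target_col] - previous) ** 2) < tol:
            break
    return X

# Synthetic demo: the second column is roughly twice the first.
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 200)
X_full = np.column_stack([x1, 2 * x1 + rng.normal(0, 0.1, 200)])
X_gap = X_full.copy()
X_gap[:40, 1] = np.nan  # hide 40 values
X_filled = missforest_single_feature(X_gap, target_col=1)
```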

In case of multiple missing features, the idea (somewhat) stays the same:

MissForest imputer for multiple missing features
  • Impute features sequentially in increasing order of missingness, so features with fewer missing values are imputed first.
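One way to approximate this multi-feature procedure is scikit-learn's `IterativeImputer` with a Random Forest as its estimator and `imputation_order="ascending"`, which visits features from fewest to most missing values. This is a stand-in for illustration rather than MissForest itself, and the synthetic data below is an assumption.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before the next import)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: columns 1 and 2 both depend on column 0.
rng = np.random.default_rng(0)
a = rng.uniform(0, 10, 300)
X_true = np.column_stack([
    a,
    2 * a + rng.normal(0, 0.2, 300),
    -a + rng.normal(0, 0.2, 300),
])

X_missing = X_true.copy()
X_missing[rng.choice(300, 30, replace=False), 1] = np.nan  # 10% missing
X_missing[rng.choice(300, 90, replace=False), 2] = np.nan  # 30% missing

# Column indices sorted by missing count: fewest missing values first.
order = np.argsort(np.isnan(X_missing).sum(axis=0), kind="stable")

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    initial_strategy="mean",        # Step 1: start from a mean fill
    imputation_order="ascending",   # fewest missing values first
    max_iter=5,
    random_state=0,
)
X_imputed = imputer.fit_transform(X_missing)
```

Each round re-fits one model per feature on the current imputed dataset, so later features benefit from the improved estimates of earlier ones.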

Its effectiveness over Mean/Zero imputation is evident from the image below.

MissForest vs. Mean imputation vs. Zero imputation
  • Mean/Zero alters the summary statistics and distribution.

  • MissForest imputer preserves them.

What’s more, because MissForest is based on Random Forest, it works with both categorical and continuous data. It can impute even when the non-missing features are categorical.
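As a rough illustration of why categorical predictors are not a problem for a tree-based imputer, here is a small sketch on made-up data (the `city` and `price` columns are hypothetical). It integer-encodes a categorical feature and uses a Random Forest to fill in a numeric column. It shows the idea only and does not use the MissingPy API.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: a fully observed categorical feature and a numeric column with gaps.
city = np.array(["A", "B"] * 20)
price = np.where(city == "A", 10.0, 20.0)
price[[0, 1, 2, 3]] = np.nan  # two "A" rows and two "B" rows lose their price

# Integer-encode the categories. Trees split on thresholds, so no scaling
# or distance metric is needed (with many categories, other encodings may fit better).
_, city_code = np.unique(city, return_inverse=True)
features = city_code.reshape(-1, 1)

missing = np.isnan(price)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(features[~missing], price[~missing])

# Fill only the missing prices with the model's predictions.
price_imputed = price.copy()
price_imputed[missing] = model.predict(features[missing])
```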

Get started with MissForest imputer: MissingPy MissForest.

👉 Over to you: What are some other better ways to impute missing values?

3 Comments
David Esp
Aug 15

Better still is not to impute anything but rather to leave it up to each model's perspective how to treat missing values. For example "distance functions" can be customised and in some cases asymmetric (e.g. reflecting some aspect of the application domain). Preprocessing data presupposes downstream purposes (that might change over time).

David Esp
Aug 15

Outlier-tolerant e.g. median would be better than mean - though in your example that would just place the spike in a "better" place.
