Daily Dose of Data Science

Random Forest May Not Need An Explicit Validation Set For Evaluation

www.blog.dailydoseofds.com

Discover more from Daily Dose of Data Science

High-quality insights on Data Science and Python, along with best practices — shared daily. Get a 550+ Page Data Science PDF Guide and 450+ Practice Questions Notebook, FREE.
Over 36,000 subscribers
Continue reading
Sign in

Random Forest May Not Need An Explicit Validation Set For Evaluation

A guide to out-of-bag validation.

Avi Chawla
Jun 26, 2023
We all know that ML models should not be evaluated on the training data. Thus, we should always keep a held-out validation/test set for evaluation.

But random forests are an exception.

In other words, you can reliably evaluate a random forest using the training data itself, with no explicit held-out set.

Confused?

Let me explain.

To recap, a random forest is trained as follows:

  • First, draw several bootstrap samples from the training data, i.e., subsets sampled with replacement.

  • Next, train one decision tree per bootstrap sample.

  • Finally, aggregate the predictions of all trees (majority vote for classification, averaging for regression) to get the final prediction.
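
The recap above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not a full random forest implementation; the dataset is synthetic and all variable names are my own. (A real random forest also subsamples features at every split, which max_features="sqrt" approximates here.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic toy dataset, purely for illustration
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

rng = np.random.default_rng(0)
n_trees, n = 25, len(X)

trees = []
for i in range(n_trees):
    # Step 1: draw a bootstrap sample (same size as the data, WITH replacement)
    idx = rng.integers(0, n, size=n)
    # Step 2: train one decision tree per bootstrap sample
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# Step 3: aggregate all predictions (majority vote) to get the final prediction
all_preds = np.stack([t.predict(X) for t in trees])      # shape: (n_trees, n)
final_pred = (all_preds.mean(axis=0) >= 0.5).astype(int)
```

Scikit-learn's RandomForestClassifier does exactly this internally, so in practice you would rarely write the loop yourself.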

Since every decision tree is trained on a bootstrap sample, it never sees some of the training points: on average, about 37% of them are left out of its sample.

Thus, we can use those unseen points to validate that specific decision tree.

This is called out-of-bag (OOB) validation.

Calculating the out-of-bag score for the whole random forest is simple too.

For every data point in the training set:

  • Gather predictions from all decision trees for which that point was out-of-bag, i.e., the trees that never trained on it.

  • Aggregate those predictions to get a final prediction for that point.

Finally, score these aggregated predictions against the true labels to get the out-of-bag score.
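
The steps above can be computed from scratch as follows. This is a sketch under the same assumptions as before (synthetic data, hand-rolled bootstrap loop); in practice scikit-learn handles all of this for you.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
rng = np.random.default_rng(0)
n_trees, n = 50, len(X)

trees, oob_masks = [], []
for i in range(n_trees):
    idx = rng.integers(0, n, size=n)        # bootstrap sample (with replacement)
    mask = np.ones(n, dtype=bool)
    mask[idx] = False                       # True -> this point is OOB for this tree
    oob_masks.append(mask)
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# For every point, aggregate votes ONLY from trees that never trained on it
votes, counts = np.zeros(n), np.zeros(n)
for tree, mask in zip(trees, oob_masks):
    votes[mask] += tree.predict(X[mask])
    counts[mask] += 1

covered = counts > 0                        # points that were OOB for at least one tree
oob_pred = (votes[covered] / counts[covered] >= 0.5).astype(int)
oob_score = (oob_pred == y[covered]).mean()
print(f"Out-of-bag accuracy: {oob_score:.3f}")
```

With enough trees, virtually every point is out-of-bag for at least one tree, so the OOB score is computed over (nearly) the whole training set.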

Out-of-bag validation has several benefits:

  • If you have limited data, you avoid sacrificing a chunk of it for a separate validation split.

  • It is computationally cheaper than, say, k-fold cross-validation, since no extra model retraining is needed.

  • It ensures there is no data leakage, since each tree is scored only on points it never saw during training.

Luckily, out-of-bag validation is built right into sklearn's random forest implementation: you simply pass oob_score=True when creating the model.

[Image] Parameter for out-of-bag scoring, as specified in the official docs
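
Here is a minimal usage example (the dataset is synthetic and the exact score will depend on your data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# oob_score=True makes sklearn compute the OOB score during fit
# (requires the default bootstrap=True)
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

print(f"Out-of-bag accuracy: {rf.oob_score_:.3f}")
```

After fitting, the score is available as the oob_score_ attribute, so no separate validation pass is needed.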

👉 Over to you:

  1. What are some limitations of out-of-bag validation?

  2. How reliable is the out-of-bag score to tune the hyperparameters of the random forest model?


Find the code for my tips here: GitHub.

I like to explore, experiment and write about data science concepts and tools. You can read my articles on Medium. Also, you can connect with me on LinkedIn and Twitter.
