Daily Dose of Data Science

Your Random Forest Model is Never the Best Random Forest Model You Can Build

The coolest trick to improve random forest models.

Avi Chawla
Oct 10, 2023

Random forest is a powerful and robust model: an ensemble of many different decision trees.

But there's a common problem: whenever we use a random forest, we almost always train many more base decision trees than we actually need.

Of course, the number of trees can be tuned as a hyperparameter, but that requires training many different random forest models, which takes time.

Today, I will share one of the most incredible tricks I formulated recently to:

  • Increase the accuracy of a random forest model.

  • Decrease its size.

  • Drastically reduce its prediction run-time.

And all this without having to ever retrain the model.

Are you ready?

Let’s begin!

Logic

We know that a random forest model is an ensemble of many different decision trees:

The final prediction is generated by aggregating the predictions from each individual and independent decision tree.

Since each decision tree in a random forest is an independent model, each tree has its own test accuracy, right?

This also means that there will be some underperforming and some well-performing decision trees. Agreed?

So what if we do the following:

  • We find the test accuracy of every decision tree.

  • We sort the accuracies in decreasing order.

  • We keep only the “k” top-performing decision trees and remove the rest.

Once done, we’ll be left with only the best-performing decision trees in our random forest, as evaluated on the test set.

Cool, isn’t it?

Now, how to decide “k”?

It’s simple.

We can create a cumulative accuracy plot.

It will be a line plot depicting the accuracy of the random forest:

  • Considering only the first two decision trees.

  • Considering only the first three decision trees.

  • Considering only the first four decision trees.

  • And so on.

It is expected that the accuracy will first increase with the number of decision trees and then decrease.

Looking at this plot, we can find the optimal “k”.

Implementation

Let’s look at its implementation.

First, we train our random forest as we usually would:

Train the random forest model
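For reference, here's a minimal training sketch. The dataset (make_classification), the train/test split, and n_estimators=100 are assumptions for illustration, not the author's original setup:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical dataset and split, purely for illustration
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a random forest with a (deliberately large) number of trees
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)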

Next, we must compute the accuracy of each decision tree model.

In sklearn, the individual trees can be accessed via the model.estimators_ attribute.

Estimators of random forest are a list of decision trees
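A quick way to check this, assuming the model trained in the sketch above:

# model.estimators_ is a plain Python list of the fitted DecisionTreeClassifier objects
print(type(model.estimators_))     # a regular Python list
print(len(model.estimators_))      # equals n_estimators (100 in the sketch above)
print(type(model.estimators_[0]))  # an individual DecisionTreeClassifier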

Thus, we iterate over all trees and compute their test accuracy:

Compute the accuracy of all trees
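One way to build this array (a sketch, reusing the X_test/y_test split assumed above; each tree is scored on the test set individually):

# score every individual tree on the test set and record [tree id, test accuracy]
model_accs = np.array([
    [tree_id, tree.score(X_test, y_test)]
    for tree_id, tree in enumerate(model.estimators_)
])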

model_accs is a NumPy array that stores each tree's id along with its test accuracy:

>>> model_accs
array([[ 0.   ,  0.815], # [tree id, test accuracy]
       [ 1.   ,  0.77 ],
       [ 2.   ,  0.795],
          ...

Now, we must rearrange the decision tree models in the model.estimators_ list in decreasing order of test accuracies:

# sort on the second column (test accuracy) in decreasing order
>>> sorted_indices = np.argsort(model_accs[:, 1])[::-1]

# obtain the list of model ids according to the sorted accuracies
>>> model_ids = model_accs[sorted_indices][:, 0].astype(int)
>>> model_ids
array([65, 97, 18, 24, 38, 11,...

This tells us that the tree at index 65 is the best-performing one, followed by the tree at index 97, and so on.

Now, let’s rearrange the tree models in model.estimators_ list in the order of model_ids:

# create numpy array, rearrange the models and convert back to list
model.estimators_ = np.array(model.estimators_)[model_ids].tolist()

Done!

Finally, we create the plot discussed earlier: a line plot of the random forest's test accuracy when considering only the first two decision trees, then the first three, then the first four, and so on.

The code to compute cumulative accuracies is demonstrated below:

Create the cumulative accuracy plot
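Here is a sketch of that loop (the author's exact code may differ; it reuses the test split assumed earlier):

from copy import deepcopy

small_model = deepcopy(model)   # copy of the base model (trees already sorted)
cumulative_accs = []

# score the forest using only the first k (best-performing) trees
for k in range(2, len(model.estimators_) + 1):
    small_model.estimators_ = model.estimators_[:k]
    small_model.n_estimators = k
    cumulative_accs.append(small_model.score(X_test, y_test))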

In the above code:

  • We create a copy of the base model called small_model.

  • In each iteration, we set small_model’s trees to the first “k” trees of the base model.

  • Finally, we score the small_model with just “k” trees.

Plotting the cumulative accuracy result, we get the following plot:

Cumulative accuracy score
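The plot itself takes only a few lines of matplotlib (a sketch, assuming the cumulative_accs list computed above):

import matplotlib.pyplot as plt

ks = list(range(2, len(model.estimators_) + 1))
plt.plot(ks, cumulative_accs)
plt.xlabel("Number of top-performing trees (k)")
plt.ylabel("Test accuracy")
plt.title("Cumulative accuracy")
plt.show()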

It is clear that the maximum test accuracy is reached by considering only 10 trees, which is a ten-fold reduction in the number of trees.

Comparing the pruned model to the full model on accuracy and prediction run-time, we get:

  • a 3% accuracy boost, and

  • a 6.5× faster prediction run-time (a rough timing sketch follows this list).
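If you want to verify the speedup yourself, here's a rough timing sketch. The exact numbers will vary with hardware and data; pruned_model keeps only the best k = 10 trees found from the plot above:

import time
from copy import deepcopy

k = 10  # best k from the cumulative accuracy plot
pruned_model = deepcopy(model)
pruned_model.estimators_ = model.estimators_[:k]
pruned_model.n_estimators = k

for name, m in [("full model", model), ("pruned model", pruned_model)]:
    start = time.perf_counter()
    m.predict(X_test)
    elapsed = time.perf_counter() - start
    print(f"{name}: accuracy = {m.score(X_test, y_test):.3f}, predict time = {elapsed:.4f} s")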

Now, tell me something:

  • Did we do any retraining or hyperparameter tuning? No.

  • As we reduced the number of decision trees, didn’t we improve the run time? Of course, we did.

Isn’t that cool?

You can download this notebook here: Jupyter Notebook.

👉 Over to you: What are some other cool ways to improve model run-time?


Thanks for reading!



8 Comments

Omar AlSuwaidi (Oct 11):

This is actually really neat and clever!

I'm guessing you could even consider a tree's performance criteria based on its weighted train and test scores, so as not to overfit the training set nor overtune (overfit) the validation/test set.

Marcell Nagy (Oct 11):

Noice! The only thing that I was thinking is that now it is tuned for the test set, and we don't know how it would perform on another data set. So it is always advised to make decisions about the model using the validation set, and never the test set. So the only thing I would change is to reduce the number of trees using the validation set and then, at the end, check the performance on the test set.
