Daily Dose of Data Science

Is Categorical Feature Encoding Always Necessary Before Training ML Models?

www.blog.dailydoseofds.com

If not, when is it not needed?

Avi Chawla
May 25, 2023

Categorical features often need special handling because many algorithms accept only numerical input.

Thus, handling these features appropriately is crucial for accurate and meaningful analysis.

For instance, one common approach is to use one-hot encoding, as shown below:

One-hot encoding

Encoding categorical data lets algorithms that expect numbers process it.
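To make the idea concrete, here is a minimal sketch of one-hot encoding in plain Python. The `one_hot_encode` helper and the `colors` column are made up for illustration; in practice you would more likely reach for a library routine such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder`.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 column per distinct category."""
    # Sort the distinct categories so the column order is deterministic.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}

    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one "hot" position per row
        rows.append(row)
    return categories, rows


# Hypothetical sample column, used only for illustration.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)

print(categories)
for value, row in zip(colors, encoded):
    print(value, row)
```

Each row contains a single 1 in the column of its category, so a feature with k distinct values becomes k numerical columns.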

But is it always necessary?

While encoding categorical data is often crucial, knowing when to do it is equally important.

The following visual depicts which algorithms need categorical data encoding and which don’t.

Categorization of algorithms based on categorical data encoding requirement

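As shown above, many ML algorithms can, in principle, work well without categorical data encoding. These include decision trees, random forests, naive Bayes, gradient boosting, and more. Whether you can actually skip encoding also depends on the implementation: scikit-learn's decision trees, for example, expect numerical input.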

Consider a decision tree, for instance. It can split the data based on exact categorical feature values. This makes categorical feature encoding an unnecessary step.

Decision tree
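The splitting idea can be sketched in a few lines of plain Python. This is an illustrative toy, not how any particular library implements trees: the helpers `gini` and `best_categorical_split` and the `weather`/`play` data are assumptions made up for this example. It tries every test of the form "feature == category" and keeps the one with the lowest weighted Gini impurity, without turning the categories into numbers.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_categorical_split(feature, labels):
    """Try every 'feature == category' test; return (category, weighted Gini)."""
    n = len(labels)
    best = None
    for category in sorted(set(feature)):
        # Left branch: rows that match the category; right branch: the rest.
        left = [y for x, y in zip(feature, labels) if x == category]
        right = [y for x, y in zip(feature, labels) if x != category]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (category, score)
    return best


# Hypothetical toy data: raw string categories, no encoding applied.
weather = ["sunny", "sunny", "rainy", "overcast", "rainy", "overcast"]
play = ["no", "no", "yes", "yes", "yes", "yes"]

split_category, split_score = best_categorical_split(weather, play)
print(split_category, split_score)
```

On this toy data, the test "weather == sunny" separates the two classes perfectly, so it is the split that gets selected.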

Thus, it's important to understand the nature of your data and the algorithm you intend to use.

You may not need to encode categorical data at all if the algorithm can work with categories directly.
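Naive Bayes, also on the list of algorithms mentioned earlier, is another case where raw categories are enough: the model only needs counts of how often each category appears within each class. Below is a minimal sketch with Laplace smoothing. The function names, the `alpha` parameter, and the toy data are all assumptions for illustration, not a reference implementation.

```python
from collections import Counter, defaultdict


def fit_categorical_nb(rows, labels, alpha=1.0):
    """Count category frequencies per (feature index, class)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> Counter
    vocab = [set() for _ in range(len(rows[0]))]  # distinct values per feature
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
            vocab[j].add(value)
    return class_counts, value_counts, vocab, alpha


def predict_categorical_nb(model, row):
    """Return the class with the highest smoothed posterior score."""
    class_counts, value_counts, vocab, alpha = model
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # class prior
        for j, value in enumerate(row):
            count = value_counts[(j, label)][value]
            # Laplace-smoothed estimate of P(value | class)
            score *= (count + alpha) / (n_label + alpha * len(vocab[j]))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data: two categorical features kept as raw strings.
rows = [
    ("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
    ("overcast", "hot"), ("rainy", "cool"), ("overcast", "cool"),
]
labels = ["no", "no", "yes", "yes", "yes", "yes"]

model = fit_categorical_nb(rows, labels)
pred_sunny_hot = predict_categorical_nb(model, ("sunny", "hot"))
pred_rainy_cool = predict_categorical_nb(model, ("rainy", "cool"))
print(pred_sunny_hot, pred_rainy_cool)
```

The categories are only ever used as dictionary keys, so no numerical encoding is involved at any step.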

👉 Over to you: Where would you place k-nearest neighbors in this chart? Let me know :)

Find the code for my tips here: GitHub.

I like to explore, experiment and write about data science concepts and tools. You can read my articles on Medium. Also, you can connect with me on LinkedIn and Twitter.

1 Comment
Rupe
May 25

I think sklearn cannot handle raw categorical data, so if you use that library, you do need to encode categorical data even for decision trees, etc. There are other libraries, I think one is called H2O, that do handle categorical data correctly.

© 2023 Avi Chawla