Daily Dose of Data Science

The Most Overlooked Problem With One-Hot Encoding

Hint: This is NOT about sparse data representation.

Avi Chawla
May 27, 2023

With one-hot encoding, we introduce a big problem in the data.

One-hot encoding illustration

When we one-hot encode categorical data, we unknowingly introduce perfect multicollinearity.

Multicollinearity arises when one feature can be predicted, exactly or almost exactly, as a linear combination of other features.

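Because the one-hot encoded features of every row sum to 1, any one of them equals 1 minus the sum of the others. Each column is fully determined by the rest, which is perfect multicollinearity (in a model with an intercept, the one-hot columns add up to exactly the intercept column).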
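
To make this concrete, here is a minimal sketch using NumPy. The `colors` feature and its values are made up for illustration; the point is only that the encoded columns sum to 1 in every row, so a design matrix with an intercept column loses one rank.

```python
import numpy as np

# Hypothetical toy categorical feature with three levels.
colors = np.array(["red", "green", "blue", "green", "red", "blue"])

# One-hot encode by comparing each value against every category.
categories = np.unique(colors)  # sorted: blue, green, red
one_hot = (colors[:, None] == categories).astype(float)

# Every row contains exactly one 1, so the columns sum to 1 in each row.
row_sums = one_hot.sum(axis=1)

# Any column can therefore be rebuilt as 1 minus the sum of the others.
reconstructed_last = 1.0 - one_hot[:, :-1].sum(axis=1)

# Add an intercept column of ones: 4 columns, but only 3 are independent.
design = np.column_stack([np.ones(len(colors)), one_hot])
rank = np.linalg.matrix_rank(design)

print("row sums:", row_sums)
print("columns:", design.shape[1], "rank:", rank)
```

The rank is one less than the number of columns, which is what "perfect multicollinearity" means in linear-algebra terms.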

Multicollinearity with one-hot encoding

This is often called the Dummy Variable Trap.

It is bad because:

  • The model has redundant features

  • Regression coefficients aren’t reliable in the presence of multicollinearity, because many different sets of coefficients fit the data equally well
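
The second point can be sketched with a small NumPy example. The design matrix and the coefficient values below are hypothetical; the sketch only shows that, with an intercept plus all one-hot columns, two very different coefficient vectors produce identical predictions, so the data cannot tell them apart.

```python
import numpy as np

# Hypothetical design matrix: intercept column + three one-hot columns.
one_hot = np.eye(3)[[0, 1, 2, 1, 0, 2]]
X = np.column_stack([np.ones(6), one_hot])

# Coefficients laid out as [intercept, b_a, b_b, b_c] (made-up values).
beta_1 = np.array([0.0, 5.0, 7.0, 9.0])

# Shift the intercept up by c and every dummy coefficient down by c.
c = 100.0
beta_2 = beta_1 + np.array([c, -c, -c, -c])

# Both coefficient vectors give exactly the same predictions.
pred_1 = X @ beta_1
pred_2 = X @ beta_2
same_predictions = np.allclose(pred_1, pred_2)

# X^T X is singular (rank 3 out of 4), so least squares has no unique solution.
gram_rank = np.linalg.matrix_rank(X.T @ X)

print("same predictions:", same_predictions)
print("rank of X^T X:", gram_rank)
```

Since any value of `c` works, the individual coefficients carry no stable meaning on their own.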

So how do we resolve this?

The solution is simple.

Drop any one of the one-hot encoded features.

This removes the perfect multicollinearity by breaking the linear relationship that existed before.

No multicollinearity with one-hot encoding when one feature is dropped
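
One way to do this in practice is the `drop_first` argument of `pandas.get_dummies`. The sketch below uses a made-up `color` column and a hypothetical helper, `design_rank`, to compare the rank of the design matrix (with an intercept) before and after dropping one column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"color": ["red", "green", "blue", "green", "red", "blue"]})

# Full one-hot encoding: one column per category.
full = pd.get_dummies(df["color"], dtype=float)

# Same encoding with the first category (here "blue") dropped.
reduced = pd.get_dummies(df["color"], drop_first=True, dtype=float)


def design_rank(dummies):
    """Return (rank, number of columns) of [intercept | dummies]."""
    X = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
    return int(np.linalg.matrix_rank(X)), X.shape[1]


full_rank, full_cols = design_rank(full)
reduced_rank, reduced_cols = design_rank(reduced)

print("full:   ", full_cols, "columns, rank", full_rank)
print("reduced:", reduced_cols, "columns, rank", reduced_rank)
```

After dropping one column, the rank equals the number of columns, so the exact linear dependence is gone. The dropped category becomes the reference level that the remaining coefficients are measured against. scikit-learn's `OneHotEncoder` offers a similar option through its `drop` parameter.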

So remember...

Whenever we one-hot encode categorical data and keep every column, we introduce perfect multicollinearity.

To avoid this, drop one column and proceed.

👉 Over to you: What are some other problems with one-hot encoding?



1 Comment
Cesar Mendoza
Jul 10

Does it really matter which feature is dropped?

© 2023 Avi Chawla