Daily Dose of Data Science

The Limitation of PCA Which Many Folks Often Ignore


Not all datasets are linearly separable.

Avi Chawla
May 3, 2023

Imagine you have a classification dataset. If you use PCA to reduce its dimensionality, you are implicitly assuming that the data is linearly separable.

But that is not always the case, and PCA will produce poor results on such datasets.

PCA on a linearly separable and inseparable dataset

If you wish to read how PCA works, I would highly recommend reading one of my previous posts: A Visual and Overly Simplified Guide to PCA.

To resolve this, we can use the kernel trick (i.e., KernelPCA). The idea is to:

  • Project the data, using a kernel function, to a new space in which it becomes linearly separable.

  • Apply the standard PCA algorithm to the transformed data.

For instance, in the image below, the original data is linearly inseparable. Applying PCA directly does not produce a useful projection.

PCA on a linearly inseparable dataset

But, as mentioned above, KernelPCA first transforms the data into a space where it is linearly separable and then applies standard PCA, yielding a projection in which the classes can be separated by a line.

KernelPCA on a linearly inseparable dataset

Sklearn provides a KernelPCA wrapper that supports many popular kernel functions. You can find more details here: Sklearn Docs.
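For instance, a quick sketch of the wrapper with the built-in kernel names scikit-learn accepts:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, _ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# KernelPCA supports several popular kernels out of the box.
for kernel in ["linear", "poly", "rbf", "sigmoid", "cosine"]:
    X_new = KernelPCA(n_components=2, kernel=kernel).fit_transform(X)
    print(kernel, X_new.shape)
```

Choosing the kernel (and its parameters, such as `gamma` for the RBF kernel) is a modeling decision; cross-validation is the usual way to pick them.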

Having said that, it is also worth noting that the run-time of PCA is cubic in the number of dimensions of the data: the eigendecomposition of the d × d covariance matrix costs O(d³).

PCA run-time

When we use KernelPCA, the original data (in n dimensions) is implicitly projected to a new, higher-dimensional space (m dimensions, with m > n) before PCA is applied. This extra work increases the overall run-time beyond that of standard PCA.
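In practice, implementations such as scikit-learn's never materialize the m-dimensional space explicitly; instead they eigendecompose an N × N kernel matrix over the N samples, so the cost scales with the sample count. A minimal sketch of what is actually computed:

```python
# KernelPCA works on an N x N kernel (Gram) matrix over the N samples,
# rather than on an explicit m-dimensional projection of the data.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = rng.randn(300, 5)      # 300 samples, 5 features

K = rbf_kernel(X)          # the matrix KernelPCA eigendecomposes
print(K.shape)             # (300, 300): size depends on samples, not features
```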


Over to you: What are some other limitations of PCA that you know of? Let me know :)





Find the code for my tips here: GitHub.

I like to explore, experiment and write about data science concepts and tools. You can read my articles on Medium. Also, you can connect with me on LinkedIn and Twitter.
