Daily Dose of Data Science

The Limitation of PCA Which Many Folks Often Ignore


Not all datasets are linearly separable.

Avi Chawla
May 3, 2023

Imagine you have a classification dataset. If you use PCA to reduce its dimensionality, you are implicitly assuming that the data is linearly separable.

But that is not always the case, and PCA will produce poor results on such datasets.

PCA on a linearly separable and inseparable dataset

If you wish to read how PCA works, I would highly recommend reading one of my previous posts: A Visual and Overly Simplified Guide to PCA.

To resolve this, we can use the kernel trick (i.e., KernelPCA). The idea is to:

  • Project the data, using a kernel function, to a new space in which it becomes linearly separable.

  • Apply the standard PCA algorithm to the transformed data.

For instance, in the image below, the original data is linearly inseparable. Applying PCA directly does not produce a useful projection.

PCA on a linearly inseparable dataset

But, as mentioned above, KernelPCA first transforms the data into a space where it is linearly separable and then applies standard PCA, yielding a projection in which the classes can be separated by a line.

KernelPCA on a linearly inseparable dataset

Sklearn provides a KernelPCA wrapper that supports many popular kernel functions. You can find more details here: Sklearn Docs.
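For instance, a quick sketch of the wrapper with the built-in kernel names scikit-learn accepts:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, _ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# KernelPCA supports several popular kernels out of the box.
for kernel in ["linear", "poly", "rbf", "sigmoid", "cosine"]:
    X_new = KernelPCA(n_components=2, kernel=kernel).fit_transform(X)
    print(kernel, X_new.shape)
```

Choosing the kernel (and its parameters, such as `gamma` for the RBF kernel) is a modeling decision; cross-validation is the usual way to pick them.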

Having said that, it is also worth noting that the run-time of PCA is cubic in the number of dimensions of the data: the eigendecomposition of the d × d covariance matrix costs O(d³).

PCA run-time

When we use KernelPCA, the original data (in n dimensions) is implicitly projected to a new, higher-dimensional space (m dimensions, with m > n) before PCA is applied. This extra work increases the overall run-time beyond that of standard PCA.
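In practice, implementations such as scikit-learn's never materialize the m-dimensional space explicitly; instead they eigendecompose an N × N kernel matrix over the N samples, so the cost scales with the sample count. A minimal sketch of what is actually computed:

```python
# KernelPCA works on an N x N kernel (Gram) matrix over the N samples,
# rather than on an explicit m-dimensional projection of the data.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = rng.randn(300, 5)      # 300 samples, 5 features

K = rbf_kernel(X)          # the matrix KernelPCA eigendecomposes
print(K.shape)             # (300, 300): size depends on samples, not features
```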


Over to you: What are some other limitations of PCA that you know of? Let me know :)





Find the code for my tips here: GitHub.

I like to explore, experiment and write about data science concepts and tools. You can read my articles on Medium. Also, you can connect with me on LinkedIn and Twitter.
