The Supercharged Version of KMeans That Deserves Much More Attention

Robustify KMeans with centroid addition and removal.

Avi Chawla
Oct 12, 2023

KMeans is widely used for its simplicity and effectiveness as a clustering algorithm.

Yet, as we all know, its performance is heavily dependent on the centroid initialization step.

Thus, a poor initialization can leave us with inaccurate clusters, as depicted below:

KMeans produces inaccurate clustering results

Of course, rerunning with different initializations does help at times.

But I have never liked the unnecessary run-time overhead it introduces.

So today, let me share a neat and supercharged upgrade to KMeans, which addresses this issue while also producing better clustering results.

It’s called the Breathing KMeans algorithm.

Let’s understand how it works.

Step 1: Run KMeans

First, we run the usual KMeans clustering only once, i.e., without rerunning the algorithm with a different initialization.

Clustering results of KMeans after running the algorithm without repetition and until convergence

This gives us the location of “k” centroids, which may or may not be accurate.
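
For concreteness, here is a minimal sketch of this step with scikit-learn. The toy data and variable names are mine, purely for illustration:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# toy data: 1,000 points drawn around 5 blobs
X, _ = make_blobs(n_samples=1000, centers=5, random_state=42)

# one plain KMeans run: a single initialization (n_init=1), trained to convergence
km = KMeans(n_clusters=5, n_init=1, random_state=0)
km.fit(X)
centroids, labels = km.cluster_centers_, km.labels_  # the "k" centroids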

Step 2: Breathe-in step

To the “k” centroids obtained from Step 1, we add “m” new centroids.

As per the Breathing KMeans research paper, m=5 was found to be a good value.

Now, you might be thinking, where do we add these “m” centroids?

Where the new centroids get added is decided based on the error associated with each centroid.

Simply put, a centroid’s error is the sum of squared distances from the centroid to the points assigned to it.

The error of a centroid
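
In code, this error could be computed with a small helper like the one below. The function name is mine, and this is an illustrative sketch (building on the snippet above), not the paper's implementation:

import numpy as np

# per-centroid error: sum of squared distances to the points assigned to it
def centroid_errors(X, centroids, labels):
    return np.array([
        ((X[labels == j] - centroids[j]) ** 2).sum()
        for j in range(len(centroids))
    ])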

Thus, we add “m” centroids in the vicinity of centroids with high error.

Let’s understand more intuitively why this makes sense.

Centroids with high and low error

In the above KMeans clustering results:

  • The centroid at the top has a high error.

  • All other centroids have relatively low error.

Intuitively speaking, if a centroid has a very high error, it is possible that multiple clusters are associated with it.

Thus, we would want to split this cluster.

Adding new centroids near clusters with high error will precisely fulfill this objective.

After adding “m” new centroids, we get a total of “k+m” centroids.

Finally, we run KMeans again, only once, initialized with these “k+m” centroids.

This gives us the location of “k+m” centroids.
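
Continuing the sketch, a breathe-in step could look roughly like this. The paper's exact placement rule is more refined, so treat the small random offset below as a stand-in:

def breathe_in(X, centroids, labels, m):
    rng = np.random.default_rng(0)
    errors = centroid_errors(X, centroids, labels)
    worst = np.argsort(errors)[-m:]  # indices of the m highest-error centroids
    # place each new centroid right next to a high-error one (small random offset)
    new = centroids[worst] + rng.normal(scale=0.01, size=(m, X.shape[1]))
    seeds = np.vstack([centroids, new])  # k + m seeds
    # one KMeans run, initialized with the k + m centroids
    km = KMeans(n_clusters=len(seeds), init=seeds, n_init=1).fit(X)
    return km.cluster_centers_, km.labels_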

Step 3: Breathe-out step

Next, we want to remove “m” centroids from the “k+m” centroids obtained above.

Here, you might be thinking, which “m” centroids should we remove?

Which centroids get removed is decided based on each centroid’s “utility”.

Simply put, a centroid’s utility is proportional to its distance from all other centroids.

The greater this distance, the more isolated a centroid is.

Hence, the higher its utility.

Utility of centroids
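
One simple way to capture this in code is to use each centroid's distance to its nearest neighboring centroid as a utility proxy. The paper's actual utility measure differs in detail; this continues the snippets above:

from scipy.spatial.distance import cdist

# utility proxy: a centroid's distance to its nearest neighboring centroid
def centroid_utility(centroids):
    d = cdist(centroids, centroids)  # pairwise centroid distances
    np.fill_diagonal(d, np.inf)      # ignore each centroid's distance to itself
    return d.min(axis=1)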

This makes intuitive sense as well.

If two centroids are very close, they will have low utility, as they are likely serving the same cluster.

Thus, we should remove one of them.

This is demonstrated below:

Remove low utility centroid

After removing one of two nearby low-utility centroids, the remaining centroid becomes very useful.

So, in practice, after removing one centroid, we update the utility values of all other centroids.

We repeat the process until all “m” low-utility centroids have been removed.

This gives back “k” centroids.

Finally, we run KMeans again, only once, initialized with these “k” centroids.
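
Under the same simplified utility proxy, a breathe-out step could be sketched as follows (again, an illustrative helper of my own naming):

def breathe_out(X, centroids, m):
    c = centroids.copy()
    for _ in range(m):
        utility = centroid_utility(c)               # recompute after each removal
        c = np.delete(c, utility.argmin(), axis=0)  # drop the least useful centroid
    # one KMeans run, initialized with the surviving k centroids
    km = KMeans(n_clusters=len(c), init=c, n_init=1).fit(X)
    return km.cluster_centers_, km.labels_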

Step 4: Decrease m by 1

Step 5: Repeat Steps 2 to 4 until m=0

Done!
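
Tying the sketches above into a single driver loop (only illustrative; the open-source implementation mentioned below handles the details properly):

def breathing_kmeans(X, k, m=5):
    km = KMeans(n_clusters=k, n_init=1).fit(X)  # Step 1: one plain KMeans run
    centroids, labels = km.cluster_centers_, km.labels_
    while m > 0:
        centroids, labels = breathe_in(X, centroids, labels, m)  # Step 2
        centroids, labels = breathe_out(X, centroids, m)         # Step 3
        m -= 1                                                   # Step 4
    return centroids, labels  # Step 5 ends once m reaches 0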

These repeated breathing cycles (breathe-in and breathe-out steps) almost always provide a faster and better solution than standard KMeans with repetitions.

In each cycle:

  • New centroids are added at “good” locations. This helps in splitting clusters occupied by a single centroid.

  • Low-utility centroids are removed. This helps in eliminating centroids that are likely in the same cluster.

As a result, it is expected to converge to a near-optimal solution faster.

The effectiveness of Breathing KMeans over KMeans is evident from the image below:

KMeans vs. Breathing KMeans results

  • KMeans produced two misplaced centroids.

  • Breathing KMeans accurately clustered the data, with a 50% run-time improvement.

There is also an open-source implementation of Breathing KMeans, with a sklearn-like API.

To get started, install the bkmeans library:

pip install bkmeans

Next, run the algorithm as follows (the toy data below is only for illustration; in practice, X is your own dataset):

from sklearn.datasets import make_blobs
from bkmeans import BKMeans

# illustrative data; replace with your own feature matrix X
X, _ = make_blobs(n_samples=10000, centers=100, random_state=0)

bkm = BKMeans(n_clusters=100)
bkm.fit(X)
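
Since the API is sklearn-like, after fitting you should be able to access the usual attributes, such as bkm.cluster_centers_ and bkm.labels_.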

Isn’t that a cool upgrade to KMeans?

👉 Over to you: What are some other ways to improve KMeans’ clustering and its run-time?


Thanks for reading!


Learned something new today?

If yes, then there is a wealth of information and practical insights in the deep dives below.

Read them next until I get back to you with another insightful newsletter issue tomorrow.

  • Why Bagging is So Ridiculously Effective At Variance Reduction?

  • Sklearn Models are Not Deployment Friendly! Supercharge Them With Tensor Computations.

  • Deploy, Version Control, and Manage ML Models Right From Your Jupyter Notebook with Modelbit

  • Model Compression: A Critical Step Towards Efficient Machine Learning.

  • Generalized Linear Models (GLMs): The Supercharged Linear Regression.

  • Gaussian Mixture Models (GMMs): The Flexible Twin of KMeans.

  • Bayesian Optimization for Hyperparameter Tuning.

  • Formulating the PCA Algorithm From Scratch.

