A (Highly) Important Point to Consider Before You Use KMeans Next Time
Estimating the success rate of KMeans.
The most important yet often overlooked step of KMeans is its centroid initialization. Here's something to consider before you use it next time.
KMeans selects the initial centroids randomly. As a result, it can converge to a poor local minimum at times, with some centroids ending up in the wrong place. This requires us to repeat clustering several times with different initializations.
Yet, repeated clustering does not guarantee that you will quickly end up with the correct clusters. This is especially true when you have many centroids to begin with.
Instead, KMeans++ takes a smarter approach to initialize centroids.
The first centroid is selected randomly. But every subsequent centroid is sampled with a probability proportional to the squared distance of a point from its nearest already-chosen centroid.
In other words, a point that is far away from the existing centroids is more likely to be selected as the next initial centroid. This way, all the initial centroids are likely to lie in different clusters already, and the algorithm may converge faster and more accurately.
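To make the seeding step concrete, here is a minimal sketch of KMeans++ initialization in plain Python. It is an illustration under assumptions, not sklearn's implementation: the function names `squared_distance` and `kmeans_pp_init` and the toy `points` are made up for this example, and it assumes the points are distinct tuples with at least `k` of them.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_init(points, k, seed=None):
    """Pick k initial centroids using KMeans++-style seeding (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: the first centroid is chosen uniformly at random.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Step 2: weight every point by the squared distance to its
        # nearest already-chosen centroid (chosen points get weight 0).
        weights = [
            min(squared_distance(p, c) for c in centroids) for p in points
        ]
        # Step 3: sample the next centroid with probability proportional
        # to those weights, so far-away points are favoured.
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

# Toy data (hypothetical): three tight, well-separated groups.
points = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
    (5.0, 5.0), (5.1, 5.2), (5.2, 4.9),
    (10.0, 0.0), (10.1, 0.2), (9.9, 0.1),
]
centroids = kmeans_pp_init(points, k=3, seed=0)
print(centroids)
```

Because a point that is already a centroid has weight zero, it cannot be picked twice, and points near an existing centroid are rarely picked. This only covers initialization; the usual KMeans iterations would run afterwards.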
The impact is evident from the bar plots shown below. They depict the frequency of the number of misplaced centroids obtained (analyzed manually) after training 50 different models with KMeans and KMeans++.
On the given dataset, out of the 50 models, KMeans with random initialization produced zero misplaced centroids only once, which is a success rate of just 2%.
In contrast, KMeans++ never produced any misplaced centroids.
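If you want to try a similar comparison yourself, the sketch below shows one possible setup. Everything about it is an assumption: the post's dataset is not given, so it uses synthetic blobs from `make_blobs`, and instead of manual inspection it uses a simple automatic proxy (the hypothetical helper `count_misplaced` counts the true blob centers that no fitted centroid is closest to). Your counts will differ from the ones reported above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def count_misplaced(fitted_centroids, true_centers):
    """Proxy metric: number of true centers that no fitted centroid is closest to."""
    # Pairwise distances: rows are fitted centroids, columns are true centers.
    dists = np.linalg.norm(
        fitted_centroids[:, None, :] - true_centers[None, :, :], axis=2
    )
    covered = set(dists.argmin(axis=1).tolist())
    return len(true_centers) - len(covered)

def run_trials(init, n_trials=50, n_clusters=10):
    """Fit n_trials single-initialization models and record the proxy count."""
    # Synthetic stand-in data (assumption, not the dataset from the post).
    rng = np.random.default_rng(0)
    true_centers = rng.uniform(-10, 10, size=(n_clusters, 2))
    X, _ = make_blobs(
        n_samples=1000, centers=true_centers, cluster_std=0.5, random_state=0
    )
    counts = []
    for seed in range(n_trials):
        # n_init=1 so each model relies on one initialization only.
        model = KMeans(
            n_clusters=n_clusters, init=init, n_init=1, random_state=seed
        )
        model.fit(X)
        counts.append(count_misplaced(model.cluster_centers_, true_centers))
    return counts

random_counts = run_trials("random")
kpp_counts = run_trials("k-means++")
print("random init, runs with zero misplaced:", random_counts.count(0))
print("k-means++ init, runs with zero misplaced:", kpp_counts.count(0))
```

Setting `n_init=1` is deliberate here: it isolates the effect of a single initialization, which is what the comparison is about.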
Luckily, if you are using sklearn, you don’t need to worry about the initialization step. This is because sklearn's KMeans uses the KMeans++ approach by default (init="k-means++").
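As a quick check, the snippet below inspects the default on a fresh estimator and shows how you would opt into plain random initialization instead. The `n_clusters`, `n_init` and `random_state` values are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans

# No init argument given, so the estimator keeps sklearn's default.
default_model = KMeans(n_clusters=3)
print(default_model.init)  # k-means++

# Plain random initialization has to be requested explicitly.
# n_init=10 repeats the clustering and keeps the best run by inertia.
random_model = KMeans(n_clusters=3, init="random", n_init=10, random_state=0)
print(random_model.init)  # random
```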
However, if you have a custom implementation, do give it a thought.
I hope you may have noticed that I have made these daily newsletter issues a bit more detailed. What is your feedback about this change?
Find the code for my tips here: GitHub.