Discover more from Daily Dose of Data Science
Evaluate Clustering Performance Without Ground Truth Labels
Three reliable methods for clustering evaluation.
In the absence of ground truth labels, evaluating clustering performance is difficult.
Yet, there are a few performance metrics that can help.
Using them, you can compare multiple clustering results, say, those obtained with a different number of centroids.
This is especially useful for high-dimensional datasets, as visual evaluation is difficult.
for every point, find average distance to all other points within its cluster (A)
for every point, find average distance to all points in the nearest cluster (B)
score for a point is (B-A)/max(B, A)
compute the average of all individual scores to get the overall clustering score
computed on all samples, thus, it's computationally expensive
a higher score indicates better and well-separated clusters.
I covered this here if you wish to understand Silhoutte Coefficient with diagrams: The Limitations Of Elbow Curve And What You Should Replace It With.
A: sum of squared distance between all centroids and overall dataset center
B: sum of squared distance between all points and their specific centroid
metric is computed as A/B (with an additional scaling factor)
relatively faster to compute
it is sensitive to scale
a higher score indicates well-separated clusters
measures the similarity between clusters
thus, a lower score indicates dissimilarity and better clustering
Luckily, they are neatly integrated with sklearn too.
👉 Over to you: What are some other ways to evaluate clustering performance in such situations?
Thanks for reading Daily Dose of Data Science! Subscribe for free to learn something new and insightful about Python and Data Science every day. Also, get a Free Data Science PDF (350+ pages) with 250+ tips.
👉 Tell the world what makes this newsletter special for you by leaving a review here :)
👉 If you liked this post, don’t forget to leave a like ❤️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights. The button is located towards the bottom of this email.
👉 If you love reading this newsletter, feel free to share it with friends!
👉 Sponsor the Daily Dose of Data Science Newsletter. More info here: Sponsorship details.
Find the code for my tips here: GitHub.