Daily Dose of Data Science

The Limitations of DBSCAN Clustering Which Many Often Overlook

...And here's a better alternative to work with.

Avi Chawla
Oct 4, 2023

DBSCAN is a density-based clustering algorithm: it forms clusters from contiguous regions where data points are densely packed.

This makes it more robust than algorithms like KMeans because:

  • Being “density-based”, it can identify clusters of arbitrary shapes.

  • KMeans, in contrast, can only create globular clusters.

A comparison between DBSCAN and KMeans is shown below:

KMeans vs. DBSCAN on non-globular clustering datasets

  • KMeans attempts to form globular clusters. Hence, it fails to identify the correct clusters.

  • DBSCAN relies on the concept of “density”, making it more robust.

Yet, it is important to note that DBSCAN also has some limitations, which many often overlook.

Let’s understand these today.

To begin, DBSCAN assumes that the local density of data points is (somewhat) globally uniform. This is governed by its eps parameter, which fixes a single neighborhood radius for the entire dataset.

Thus, it may struggle to identify clusters with varying densities.

As a result, getting promising results may require several rounds of hyperparameter tuning.
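To see this trade-off concretely, here is a minimal sketch using scikit-learn. The synthetic dataset and the parameter values are illustrative assumptions, not from the original post: one tight cluster and one sparse cluster share a single global eps, so no single value serves both densities well.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Illustrative data: one tight cluster and one sparse cluster,
# i.e., two very different local densities.
X, _ = make_blobs(
    n_samples=[200, 200],
    centers=[(0, 0), (10, 10)],
    cluster_std=[0.3, 3.0],
    random_state=42,
)

# A single global eps must trade off both densities: a small eps
# labels most of the sparse cluster as noise, while a large eps
# risks merging nearby structure.
for eps in (0.3, 1.0, 3.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

With a small eps, the sparse cluster dissolves into noise; raising eps recovers it, but only by loosening the density criterion everywhere at once.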

HDBSCAN can be a better choice for density-based clustering.

It relaxes the assumption of local uniform density, which makes it more robust to clusters of varying densities by exploring many different density scales.

Its effectiveness is evident from the image below:

On a dataset with three clusters, each with a different density:

  • DBSCAN struggles to identify correct clusters.

  • HDBSCAN is found to be more robust.

What's more:

  • DBSCAN is a scale-variant algorithm. Thus, clustering results for data X, 2X, 3X, etc., can be entirely different.

  • On the other hand, HDBSCAN is scale-invariant. So, clustering results remain the same across different scales of data.

We can also verify this experimentally:

Clustering results differ with the scale of data for DBSCAN

  • The results vary with the data scale for DBSCAN.

  • The results remain unaltered with the scale for HDBSCAN.

👉 Over to you: Can you explain why HDBSCAN is scale-invariant?

Thanks for reading Daily Dose of Data Science! Subscribe for free to learn something new and insightful about Python and Data Science every day. Also, get a Free Data Science PDF (550+ pages) with 320+ tips.

👉 If you liked this post, don’t forget to leave a like ❤️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights. The button is located towards the bottom of this email.

Thanks for reading!


In case you missed it

Yesterday, I introduced The Daily Dose of Data Science Lab, a cohort-based platform for you to:

  • Attend weekly live sessions (office hours) hosted by me and invited guests.

  • Enroll in self-paced and live courses.

  • Get private mentoring.

  • Join query discussions.

  • Find answers to your data-related problems.

  • Refer to the internal data science resources, and more.

To ensure an optimal and engaging experience, The Lab will always operate at a capacity of 120 active participants.

So, if you are interested in receiving further updates about The Lab, please fill out this form: The Lab interest form.

Note: Filling out the form DOES NOT mean you must join The Lab. This is just an interest form to indicate that you are interested in learning more before making a decision.

I will be sharing more details with the respondents soon.

Thank you :)


Latest full articles

If you’re not a full subscriber, here’s what you missed last month:

  • Deploy, Version Control, and Manage ML Models Right From Your Jupyter Notebook with Modelbit

  • Model Compression: A Critical Step Towards Efficient Machine Learning

  • Generalized Linear Models (GLMs): The Supercharged Linear Regression

  • Gaussian Mixture Models (GMMs): The Flexible Twin of KMeans

  • Bayesian Optimization for Hyperparameter Tuning

  • Formulating the PCA Algorithm From Scratch

To receive all full articles and support the Daily Dose of Data Science, consider subscribing:

I want to read full articles.


👉 Tell the world what makes this newsletter special for you by leaving a review here :)


👉 If you love reading this newsletter, feel free to share it with friends!

