Daily Dose of Data Science

How Would You Identify Fuzzy Duplicates In A Dataset With A Million Records?

Avi Chawla
Jan 26, 2023

Imagine you have over a million records with fuzzy duplicates. How would you identify potential duplicates?

The naive approach of comparing every pair of records is infeasible at this scale. With a million records, that's on the order of n², roughly 10^12 comparisons. Assuming a speed of 10,000 comparisons per second, it would take roughly 3 years to complete.
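To make the arithmetic concrete, here's a quick back-of-the-envelope sketch in Python. The one-million record count and the 10,000-comparisons-per-second speed are the same assumptions as above:

```python
# Back-of-the-envelope cost of brute-force pairwise comparison
# (assumed: 1 million records, 10,000 comparisons per second).
n = 1_000_000
comparisons = n ** 2                    # ~10^12; distinct pairs are about half of this
seconds = comparisons / 10_000          # at 10,000 comparisons per second
years = seconds / (60 * 60 * 24 * 365)
print(f"{comparisons:.0e} comparisons -> ~{years:.1f} years")   # ~3.2 years
```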

The csvdedupe tool (linked below) solves this by cleverly reducing the number of comparisons. For instance, comparing the name “Daniel” to “Philip”, or “Shannon” to “Julia”, makes no sense: they are guaranteed to be distinct records.

Thus, it groups the data into smaller buckets based on blocking rules and only compares records that fall into the same bucket. One rule could be to group all records whose names share the same first three letters.

This way, it drastically reduces the number of comparisons while giving up very little accuracy. A minimal sketch of the idea is shown below.
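Here's a minimal sketch of that blocking idea in plain Python. This is not csvdedupe's actual code; the toy records, the first-three-letters rule, and the 0.85 similarity threshold (using the standard library's difflib.SequenceMatcher) are all illustrative assumptions:

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Toy records; record 2 is a fuzzy duplicate of record 1.
records = [
    {"id": 1, "name": "Daniel Smith"},
    {"id": 2, "name": "Danial Smith"},
    {"id": 3, "name": "Philip Jones"},
    {"id": 4, "name": "Shannon Lee"},
]

# Blocking rule: bucket records by the first three letters of the name.
buckets = defaultdict(list)
for rec in records:
    buckets[rec["name"][:3].lower()].append(rec)

# Compare pairs only within a bucket, never across the whole dataset.
for bucket in buckets.values():
    for i in range(len(bucket)):
        for j in range(i + 1, len(bucket)):
            a, b = bucket[i], bucket[j]
            score = SequenceMatcher(None, a["name"], b["name"]).ratio()
            if score > 0.85:
                print(f"Potential duplicates: {a['id']} and {b['id']} ({score:.2f})")
```

With a million records spread across many buckets, only the pairs that share a bucket ever get compared, which is where the orders-of-magnitude speedup comes from.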

Read more: csvdedupe.

Share this post on LinkedIn: Post Link.

Thanks for reading Daily Dose of Data Science! Subscribe for free to receive new posts and support my work.


Find the code for my tips here: GitHub.

I like to explore, experiment and write about data science concepts and tools. You can read my articles on Medium. Also, you can connect with me on LinkedIn.
