Hacker News
Show HN: One-Click CSV Deduplication (open-source) (dedupe.it)
4 points by remolacha 32 days ago | 2 comments
I made an app to fuzzy-deduplicate my Google Sheets and CRM records

- No manual configuration required
- Works out-of-the-box on most data types (ex. people, companies, product catalog)

Implementation details:

- Embeds records using an E5 model
- Performs similarity search using DuckDB w/ vector similarity extension
- Does last-mile comparison and merges duplicates using Claude
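
To make the shape of that pipeline concrete, here is a minimal, stdlib-only sketch of the three stages. It is not the code from the repo, and every name in it is hypothetical: `embed` uses character trigram counts as a stand-in for an E5 embedding, `candidate_pairs` does a brute-force cosine comparison as a stand-in for the DuckDB vector search, and `judge` is any callable you pass in, standing in for the Claude comparison step. The `threshold` value is an arbitrary assumption.

```python
import math
from collections import Counter
from itertools import combinations


def embed(text, n=3):
    """Stand-in embedding: character trigram counts (NOT an E5 model)."""
    s = f" {text.lower().strip()} "
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def candidate_pairs(records, threshold=0.5):
    """Brute-force stand-in for the vector similarity search.

    This compares every pair, so it is quadratic; a real vector
    index would avoid that.
    """
    vectors = [embed(r) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if cosine(vectors[i], vectors[j]) >= threshold
    ]


def dedupe(records, judge, threshold=0.5):
    """Group records that `judge` confirms as duplicates.

    `judge(a, b) -> bool` is a hypothetical hook for the last-mile
    comparison (an LLM call in the original post).
    """
    parent = list(range(len(records)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in candidate_pairs(records, threshold):
        if judge(records[i], records[j]):
            parent[find(j)] = find(i)

    groups = {}
    for idx, record in enumerate(records):
        groups.setdefault(find(idx), []).append(record)
    return list(groups.values())


if __name__ == "__main__":
    rows = ["Acme Corp", "ACME Corp.", "Globex Inc"]
    # Trivial judge that accepts every candidate pair.
    print(dedupe(rows, judge=lambda a, b: True))
```

The point of the two-stage design is that the cheap similarity pass narrows the field, so the expensive judge only sees plausible pairs. How the real app merges the fields of a confirmed duplicate group is not covered here.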

Demo video: https://youtu.be/7mZ0kdwXBwM

Github repo (Apache 2.0 licensed): https://github.com/SnowPilotOrg/dedupe_it

Lmk any feedback on how to make this better!




Curious how this scales. Just tried this with the test dataset and it was probably the slickest deduplication experience I’ve had.


Appreciate the kind words! It scales linearly in both speed and cost. We haven't yet optimized the prompts & choice of model to minimize token usage, so I'd recommend emailing us for advice if you want to run this on a large dataset.



