
Show HN: Difference Digests for Efficient Reconciliation - hundredwatt
https://github.com/hundredwatt/difference_digest
======
hundredwatt
I was trying to solve the problem of detecting "hard deletes" in an ETL
pipeline that updates data incrementally by key and came across "Difference
Digests" as described in this paper:
[https://www.ics.uci.edu/~eppstein/pubs/EppGooUye-SIGCOMM-11.pdf](https://www.ics.uci.edu/~eppstein/pubs/EppGooUye-SIGCOMM-11.pdf)

Basically, I wanted to figure out which rows were still in one database but
had been deleted from the other, without downloading both tables and doing a
row-by-row comparison.

I open-sourced a Go implementation of Difference Digests that encodes large
tables directly in a SQL database, so you only have to download the relatively
small Bloom-filter structures for comparison.

With this implementation, I benchmarked a data set of 10,000,000 rows stored
in two remote databases, where 100 rows had been deleted from one of them.
Downloading the IDs from both tables would require 153MB of data transfer.
Using the built-in SQL functions of difference digests requires downloading
only 253KB (a 99.9% reduction!) to perform a full comparison and identify
which 100 IDs had been deleted.

The amount of data transfer required is proportional to the number of
differences, so for large data sets where you are detecting a small number of
deletions, the performance is incredible!

I had a lot of fun implementing the algorithms (Strata Estimator and
Invertible Bloom Filter) and hope others find this useful :)
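For anyone curious how the Invertible Bloom Filter part works, here is a minimal, self-contained sketch in Go (not the library's actual API; names, sizes, and the simple FNV-based hashing are my own illustrative choices). Each side builds a filter over its key set, one side ships its filter over, and subtracting the two filters cancels out every key present in both, so only the differences remain and can be "peeled" out of pure cells:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Illustrative parameters: 3 hash functions, each mapping into its own
// partition of 20 cells, so a key always touches 3 distinct cells.
const (
	numHashes = 3
	partSize  = 20
	numCells  = numHashes * partSize
	checkSeed = 0xdeadbeef // seed for the per-key integrity checksum
)

type cell struct {
	count   int    // signed: insertions minus subtractions
	keySum  uint64 // XOR of all keys hashed into this cell
	hashSum uint64 // XOR of each key's checksum, to detect "pure" cells
}

type ibf struct{ cells [numCells]cell }

func hashKey(key, seed uint64) uint64 {
	h := fnv.New64a()
	var buf [16]byte
	for i := 0; i < 8; i++ {
		buf[i] = byte(seed >> (8 * i))
		buf[8+i] = byte(key >> (8 * i))
	}
	h.Write(buf[:])
	return h.Sum64()
}

// indicesOf returns the 3 distinct cells a key maps to (one per partition).
func indicesOf(key uint64) [numHashes]int {
	var idx [numHashes]int
	for i := 0; i < numHashes; i++ {
		idx[i] = i*partSize + int(hashKey(key, uint64(i))%partSize)
	}
	return idx
}

func (f *ibf) insert(key uint64) {
	check := hashKey(key, checkSeed)
	for _, i := range indicesOf(key) {
		f.cells[i].count++
		f.cells[i].keySum ^= key
		f.cells[i].hashSum ^= check
	}
}

// subtract computes the cell-wise difference of two filters; shared keys
// cancel, leaving only the symmetric difference encoded in the cells.
func subtract(a, b *ibf) *ibf {
	d := &ibf{}
	for i := range d.cells {
		d.cells[i].count = a.cells[i].count - b.cells[i].count
		d.cells[i].keySum = a.cells[i].keySum ^ b.cells[i].keySum
		d.cells[i].hashSum = a.cells[i].hashSum ^ b.cells[i].hashSum
	}
	return d
}

// decode repeatedly peels "pure" cells (count ±1 with a matching checksum)
// to recover onlyA (keys only in a) and onlyB (keys only in b).
func decode(d *ibf) (onlyA, onlyB []uint64, ok bool) {
	for changed := true; changed; {
		changed = false
		for i := range d.cells {
			c := d.cells[i]
			if (c.count == 1 || c.count == -1) &&
				hashKey(c.keySum, checkSeed) == c.hashSum {
				key, sign := c.keySum, c.count
				if sign == 1 {
					onlyA = append(onlyA, key)
				} else {
					onlyB = append(onlyB, key)
				}
				check := hashKey(key, checkSeed)
				for _, j := range indicesOf(key) {
					d.cells[j].count -= sign
					d.cells[j].keySum ^= key
					d.cells[j].hashSum ^= check
				}
				changed = true
			}
		}
	}
	for i := range d.cells {
		if d.cells[i].count != 0 {
			return onlyA, onlyB, false // undecodable: filter too small
		}
	}
	return onlyA, onlyB, true
}

func main() {
	a, b := &ibf{}, &ibf{}
	for k := uint64(1); k <= 10; k++ {
		a.insert(k)
		if k != 3 && k != 7 { // simulate two deleted rows on side b
			b.insert(k)
		}
	}
	onlyA, onlyB, ok := decode(subtract(a, b))
	fmt.Println(ok, onlyA, onlyB)
}
```

The key property matching the benchmark above: the filter size only needs to scale with the number of *differences*, not the table size, which is why a 10M-row comparison fits in a few hundred KB. (The Strata Estimator's job is to estimate that difference count first, so you know how big an IBF to build.)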

