> "My second point was that this — streaming a "non-lossy" database as a change log into one or more "lossy" ones — is such a common operation that it should be a solved problem. It certainly requires something more than a queue."
This is almost exactly the cross-DC replication problem, which is a subject of active research.
A changelog on the source side is only sort of helpful. It's useful for flagging which rows may have changed, but given that you don't trust the target database, you also need to do repair.
Correct repair is impossible without full syncs (or at least a partial sync since the last known-perfectly-synced snapshot), unless your data model is idempotent and commutative. On-the-fly repair requires out-of-order re-application of previous operations.
The easiest way to reason about commutativity is to just make everything in your database immutable. So this is a solved problem, but it requires compromises in the data model that people are mostly unwilling to live with.
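To make that concrete, here's a minimal sketch (Python, with sqlite3 standing in for the target database; the schema and names are made up) of an append-only event log, where replaying the changelog is idempotent and order-insensitive by construction:

```python
# Sketch: an append-only (immutable) event log. Re-applying the changelog
# is idempotent (INSERT OR IGNORE) and commutative (rows are never
# updated, so apply order doesn't matter). All names are hypothetical.
import sqlite3

target = sqlite3.connect(":memory:")
target.execute("""
    CREATE TABLE events (
        event_id   TEXT PRIMARY KEY,   -- globally unique, assigned at the source
        entity_id  TEXT NOT NULL,
        payload    TEXT NOT NULL,
        created_at INTEGER NOT NULL    -- immutable: never updated in place
    )
""")

def apply_changelog(conn, batch):
    # Duplicate deliveries and out-of-order deliveries are both harmless.
    conn.executemany("INSERT OR IGNORE INTO events VALUES (?, ?, ?, ?)", batch)
    conn.commit()

batch = [("e1", "user:1", '{"name": "a"}', 100),
         ("e2", "user:1", '{"name": "b"}', 101)]
apply_changelog(target, batch)
apply_changelog(target, list(reversed(batch)))  # replayed, reordered: same state
```

The compromise is that "current state" becomes a derived view (e.g. the latest event per entity) rather than a mutable row, which is exactly the part people tend to balk at.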
You can do pretty well if your target database supports idempotent operations.
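Concretely, "idempotent operations" can be as little as a last-write-wins upsert keyed on the source's updated_at: re-delivery is a no-op, and a late-arriving older change can't clobber a newer one. A sketch, assuming SQLite >= 3.24 upsert syntax and made-up names:

```python
# Sketch: idempotent, last-write-wins apply at the target.
# Requires SQLite >= 3.24 for ON CONFLICT ... DO UPDATE.
import sqlite3

target = sqlite3.connect(":memory:")
target.execute("""
    CREATE TABLE users (
        id         TEXT PRIMARY KEY,
        name       TEXT,
        updated_at INTEGER NOT NULL
    )
""")

def apply_change(conn, row_id, name, updated_at):
    conn.execute(
        """
        INSERT INTO users (id, name, updated_at) VALUES (?, ?, ?)
        ON CONFLICT (id) DO UPDATE
            SET name = excluded.name, updated_at = excluded.updated_at
            WHERE excluded.updated_at > users.updated_at
        """,
        (row_id, name, updated_at),
    )
    conn.commit()

apply_change(target, "user:1", "new name", 200)
apply_change(target, "user:1", "old name", 100)  # late arrival: ignored
apply_change(target, "user:1", "new name", 200)  # re-delivery: no-op
```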
If you're trying to do pretty well, then you can do a Merkle tree comparison of both sides at some time (T0 = now() - epsilon) to efficiently narrow down which records have been lost or misplaced. Then you re-sync them. Here, for efficiency, your Merkle tree implementation will ideally span field(s) related to the updated_at time, so that only a small subset of the tree is changing all the time. This is a tricky thing to tune.
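A minimal sketch of the idea, with leaves bucketed by updated_at day so that old buckets stay stable and only the "hot" end of the tree churns (the bucketing scheme and all names here are my assumptions):

```python
# Sketch: compare two sides via a hash tree whose leaves are buckets of
# updated_at (one per day here). Rows are (pk, payload, updated_at)
# tuples as of T0; how you snapshot both sides at T0 is up to you.
import hashlib
from collections import defaultdict

def bucket_hashes(rows):
    # Leaf level: one digest per updated_at day bucket.
    buckets = defaultdict(list)
    for pk, payload, updated_at in rows:
        buckets[updated_at // 86400].append(f"{pk}:{payload}")
    return {day: hashlib.sha256("\n".join(sorted(items)).encode()).hexdigest()
            for day, items in buckets.items()}

def diverged_buckets(source_rows, target_rows):
    src, dst = bucket_hashes(source_rows), bucket_hashes(target_rows)

    def root(h):
        # Root comparison: one digest over all leaf digests.
        return hashlib.sha256(repr(sorted(h.items())).encode()).digest()

    if root(src) == root(dst):
        return []  # nothing to repair
    # Otherwise descend: any bucket whose hash differs (or is missing on
    # one side) needs a row-level re-sync.
    return [d for d in set(src) | set(dst) if src.get(d) != dst.get(d)]

source = [("u1", "a", 100), ("u2", "b", 90000)]
target = [("u1", "a", 100)]              # target lost the second row
print(diverged_buckets(source, target))  # -> [1]: the day bucket to re-sync
```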
You'll still be "open to odd inconsistencies when updates are applied out of order" if you haven't made your data model immutable, but I think this is mostly in line with your hopes.
You'd implement the Merkle trees yourself in the application layer. Alternatively, you could use hash lists; it'd be somewhat similar to how you'd implement geohashing. Let's say you just take a SHA-X of each row, stored as a hex varchar, then do something like "SELECT SUBSTR(`sha_column`, 1, n), COUNT(*) FROM `your_table` GROUP BY 1". If there's a count mismatch, drill down into it by widening the prefix: the first two chars, the first three chars, etc. Materialize some of these views as needed. It's ugly and tricky to tune.
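Roughly what that drill-down could look like when driven from the application layer; sqlite3 stands in for both sides, the table and column names are hypothetical, and for the demo the hashes are truncated to four hex chars:

```python
# Sketch: prefix drill-down over per-row SHA hashes stored as hex.
# Compare GROUP BY counts on widening prefixes and recurse only into
# prefixes whose counts disagree. Count equality is a cheap check, not
# airtight; a real version would also compare an aggregate hash per prefix.
import sqlite3

def prefix_counts(conn, prefix, n):
    rows = conn.execute(
        "SELECT SUBSTR(sha_column, 1, ?) AS p, COUNT(*) "
        "FROM rows WHERE sha_column LIKE ? || '%' GROUP BY p",
        (n, prefix),
    )
    return dict(rows.fetchall())

def find_bad_prefixes(src, dst, prefix="", n=1, max_depth=8):
    bad = []
    s, d = prefix_counts(src, prefix, n), prefix_counts(dst, prefix, n)
    for p in set(s) | set(d):
        if s.get(p, 0) != d.get(p, 0):
            if n >= max_depth:
                bad.append(p)  # re-sync every row under this prefix
            else:
                bad += find_bad_prefixes(src, dst, p, n + 1, max_depth)
    return bad

def make_db(hashes):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rows (sha_column TEXT)")
    conn.executemany("INSERT INTO rows VALUES (?)", [(h,) for h in hashes])
    return conn

src = make_db(["ab12", "ab34", "cd56"])
dst = make_db(["ab12", "cd56"])  # target lost a row
print(find_bad_prefixes(src, dst, max_depth=4))  # -> ['ab34']
```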
Merkle trees aren't interesting in the no-repair case, as the changelog is more direct and has no downside.