Hacker News new | past | comments | ask | show | jobs | submit login
Using Bloom filters to efficiently synchronise hash graphs (kleppmann.com)
171 points by ignoramous on Dec 3, 2020 | hide | past | favorite | 18 comments

It's possible to be dramatically more communication efficient than bloom filters to synchronize sets.

Bloom filters require O(my_set_size) bandwidth, with a good constant factor. This work optimizes it somewhat by excluding members that existed at the time of last synchronization, but it still wastes bandwidth when those commits came in from another source.

It's possible to instead realize O(symmetric_difference_between_sets) when synchronizing a pair. Applied to this kind of usage the bandwidth required would be reduced by whatever redundancy factor (e.g. number of different repos you'd learn the same commit from).

This can be accomplished via error correcting codes, for a high performance industrial implementation I contributed to, see: https://github.com/sipa/minisketch/ (I tried submitting this to HN a while back, but no joy--).

For our paper on applying this to Bitcoin transaction relay to render rumouring bandwidth largely independent of peer count:


[Similar to the motivations of the post, we consider immunity to attackers to be critical-- and most prior efficient gossip work is extremely non-robust.]

For the specific case of synchronizing the blockchain itself, these fancy techniques aren't needed because there is only a single best chain 'list'-- there is only one relevant tip-- and it can be efficiently synchronized with bisection. Bitcoin sends up to 100 hashes at once (though usually more like 30), the first few dense around the best tip and the rest exponentially spaced to extremely efficiently find the highest point of commonalty.

Wow. If I understood its capabilities right, minisketch seems indistinguishable from magic.

Would you please clear a bunch of things:

0. Is my understanding that total elements in a set doesn't matter (afa reconcilation is concerned), right? See #1.

1. If Alice knows for sure there have been, say, 50 new additions (or rather, updates to existing elements) of elements of size 10 bits each (to a set of say 100M total other unchanged elements) since the last sync with Bob, does a sketch(10, 0, 50) by Alice cover all changes Bob needs to see or sketch(10, 0, 100) is needed?

2. A sketch(10, 0, 100) would occupy atmost 10x100 bits? Or, atleast 10x100 bits?

3. What's state of the art in set-reconciliation research right now apart from Pin Sketches?

4. Do implementations in other languages exist?

An ask:

1. A WASM executable would be great.


Hi, other author here.

0. Indeed. Though if your set consists of 100 million elements, you'll indirectly need to have at least 27 bit elements just to be able to identify them. So for practical problems you can probably say that the sketch size does grow logarithmitically with the set sizes.

1. The capacity scales with the symmetric difference between the sets. So if one party has [A,B,C] and the other has [B,C,D,E], you need capacity 3 (for A,D,E). So every "change" to an element costs at most 2 capacity; insertions/deletions each cost 1 capacity.

2. Sketch size is exactly (element size in bits)*capacity. So 1000 bits for a sketch with capacity of 100 10-bit elements.

3. I'm not really up to date here. I believe that PinSketch is the best known deterministic reconciliation (meaning that reconstruction is guaranteed to succeed if the capacity is sufficient). There are more performant algorithms and more trade-offs to make (between capacity and probility, e.g.) if you allow probabilistic recovery.

4. I just wrote a (slow) native Python re-implementation for demonstration/testing purpose: https://github.com/sipa/minisketch/pull/26. The C++ code relies on many low-level optimizations that are hard to replicate in higher level languages, so expect a very serious performance drop.

1. Shouldn't be hard to compile to WASM with emscripten, but I haven't tried.

Thanks a lot. I have use for something exactly like minisketch but need to interface it with JavaScript. I'll take a stab at it with Emscripten.

Can you point towards other performant algorithms? IBLT, from the minisketch GitHub readme at least, seems wasteful in terms of space.

If you're aware: What'd be a comparable data structure to Pin Sketches for membership checks (whether "X" exists in a set of, say, 100M records). Minimal Acyclic Finite State Automata (MA-FSA) and Succinct Tries come to mind but those aren't optimal though deterministic.

Yeah, IBLT has a size overhead which mostly matters for small differences/capacities. It's also probabilistic, so the higher your probability for recovery has to be, the larger it gets. Assuming you don't actually know an strict upper bound on the symmetric difference size before reconciliation, PinSketch of course also becomes probabilistic.

On the other hand, IBLT is very fast (and linear in the capacity; PinSketch is quadratic in the capacity).

A data structure for set lookups? Sounds like a Bloom filter. If you want it deterministic, just send the set in sorter order say? I don't think there is much fancy possible here.

That does sound like absolute magic. One hundred million? I didn't know about this algorithm, and it looks revolutionary for a ton of things.

If sketch size grows with only log(n) where N is the set size, then we are talking about practical efficient non-thrashing replication of billions of records at broadband speeds.

(insert mind=blown meme...)

I think the most mind-blowing aspect about it is that (PinSketch) sketches are identical in size to the bandwidth that would be required to just send the difference if you did know it ahead of time.

Perhaps this intuition helps: imagine we have sets of positive numbers, and we know that the difference is at most 1 element. So there is one element that you have but I don't, or the other way around. In this case there is a very easy solution: just send the sum of all your elements. If it's larger than the sum of the receiver's element, the (absolute value of the) difference is exactly the element that one party has and the other doesn't.

In a way, PinSketch is generalizing this "send sum of your elements" in such a way that it works for more than one difference.

This sounds very interesting, so I tried reading some of the math in https://github.com/sipa/minisketch/blob/master/doc/math.md (I'm not great this kind of math).

In my understanding, a Bloom filter, while being O(set_size), nonetheless has a small multiplier for the linear factor (in the OP I think they mentioned using 10 bits). As I read the PinSketch setup, it looks like it would actually need the full bit size of the set members, so for SHA-256 you need the full 256 bits per set member. So while PinSketch would only need O(symmetric_difference_between_sets), the element size factor is ~25 times larger?

And also, in the example from the minisketch readme, it looks A still needs to send the full sketch (not just the symdiff) in the first message, right?

In other words, there are a bunch of trade-offs here that might very much depend on your application?

You can make the hashes you as small as you like-- subject to the limitation that you'll get false positives if you make it too small ---- the same problem that small hashes get in bloom filters.

Bloom filters also have poor communications efficiencies for small numbers of entries: you have to oversize the filter a lot. This can be ameliorated by entropy coding the oversized bloom filter, but once you're going that far you're better off with using a golumb coded set or something more efficient.

I think the only case where an approximate set-filter approach will be more efficient is where you want to communicate a small set one way to someone whos set is much larger. The issue there is the the sketch needs one entry per item it removes, while a bloom filter can be smaller (assuming the difference is big enough to overcome the bloom overheads).

Information theoretically it should be possible to decode a minisketch 'beyond the limit' using knowledge of the specific items in the local set. ... but the only algorithm I know for this has exponential time (brute force remove items from your set and retry the reconstruction). This problem is mathematically pretty close to list-input decoding a error correcting code, so efficient algorithms may be possible-- but so far there hasn't been any research to that end.

[You can, of course, combine a filter with efficient reconciliation in cases where an set size differences are expected. The filter's FP rate just needs to be low enough to get the receivers set size close to the sender's]

> it looks A still needs to send the full sketch

The sketch is exactly the same size as the symdiff it can recover.

An interesting challenge is that you likely don't know in advance how big the symdiff is... If you're too conservative you waste bandwidth, but never more bandwidth then you use sending the whole set. You can reduce your waste at the expense of round-trips, in an obvious manner: fetching a little more of the sketch each time until you have enough.

In the erlay paper (bitcoin application of this stuff), we give a scheme to bisect when the difference is bigger than expected without wasting much bandwidth. The bisection loses a little communications efficiency but avoids quadratic computation costs which become problematic for big set differences.

It depends on the protocol requirements. If you have 256-bit data elements to reconcile, and want 100% guarantee that reconciliation will succeed with 1 round-trip, then you'll indeed need to use sketches with 256-bit elements.

However, if you're fine with more roundtrips, or probabilistic reconciliation, you can typically use far smaller elements. Have the parties agree on a random salt, and construct sketches of short salted hashes of your data elements (where the size of these hashes depends on your tolerance for failure due to collisions in your sets). Run the reconciliation algorithm to figure out the difference, and then send/request the real elements based on their short hashes.

From what I can see in the graphs on GitHub it seems really computationally intensive, only a few thousand 30-bit elements per second (for decoding if I understand it correctly).

So I guess that is the catch which makes it impractical for more widespread use. Or am I wrong, is there an alternative speed wise?

Depends on how you split it up: if you're synchronizing 100 differences at a time it's on the order of 100,000 differences per second per core which is plenty fast for many applications.

If you're working on thousands of differences at a time, then IBLT's communications efficiency may be a acceptable, and it is considerably faster.

[You can always decrease the amount you feed into reconciliation at once by using hashing to subdivide your inputs, at a small loss to efficiency due to imbalancing.]

So this should be used to speed up rsync, or Google Wave, amongst many other such merge problems.

This is nice. How does it compare to something like a polynomial based representation like in [1] (used by the old keyserver gossip network)? I find Bloom filters more intuitive to reason about, but [1] does mention them as prior work and seems to think its scheme outperforms them.

[1] http://dx.doi.org/10.1109/TIT.2003.815784 http://ipsit.bu.edu/documents/ieee-it3-web.pdf

I think the approaches based on invertible bloom filters are generally more efficient.

https://dl.acm.org/doi/abs/10.1145/2043164.2018462 (What's the difference?: efficient set reconciliation without prior context) 2011

https://link.springer.com/article/10.1007/s00446-017-0316-0 (Simple Multi-Party Set Reconciliation) 2015

They both briefly mention the Minsky paper.

> I think the approaches based on invertible bloom filters are generally more efficient.

More computationally efficient for sure, but their communications efficiency is only good asymptotically.

For "small" cases such as where the set difference has a few hundred or thousand entries and especially where the set members are small (e.g. <=64-bit hashes) there is a pretty large constant factor needed to get a low failure rate from invertible bloom filters.

E.g. for a set difference of size 128 we measured that you need 32x communications overhead to get a 1 in 10,000 failure rate. 8x overhead for 2048 items. The issue is that iblt decodability depends on the non-existence of cycles in a random graph and this only holds for sufficiently large graphs.

For applications (such as Bitcoin transaction relay or -- I think -- SCM commits) which frequently synchronize a small number of items these communications overheads take that approach entirely out of the running. Fortunately, at these sizes the quadratic computation of communications efficient approaches is not a big deal.

Interestingly, it is completely unambiguous that IBLT and dense code based reconciliation can be combined (sort of the dual to Raptor codes, which combine sparse fountain codes and dense RS-like codes for block erasure recovery)-- but it's an open question if the combination can capture the advantages of both approaches (linear time decode and concrete communications efficiency (as opposed to asymptotic)) for set reconciliation as is the case for block erasure codes.


I'm thinking even polynomial methods can "fail" if you underestimated the difference didn't send a big enough sketch.

Won't you have to do some sort of exponential fallback anyway to cover that case? If so, you could also use a lower overhead in the IBT and just rely on the same fallback mechanism.

Depending on your communications cost model there is no reason to use an exponential backoff with pinsketch: you can just send one (or N) more sketch-entries at a time. At a cost of one or two extra entries total, you can tell if you have enough before doing the costly computation.

[There are some utility functions in minisketch that tell you how much overhead you need to make all failures fast-detectable with a cryptographic level of certainty, it's typically a tiny amount]

Your application may also know by construction that the symmetric difference won't be larger than expected. This is the case for e.g. biometric key generation-- if the difference larger than expected it just means the auth failed. Likewise for using pinsketch for anonymous communication.

If you notice on our graphs IBLT never gets to particularly low overhead-- at least for small keys, even when you accept low reliability.

Efficient fallback also doesn't work quite as well for iblt:

For pinsketch when the sketch is too small you can send more entries (a shorter sketch is just a prefix of a longer sketch), which is perfectly efficient (+/- rtt costs).

You can also bisect the sketch by splitting into two sketches with an arbitrary hash function and conserve your prior communication: you send the sketch of just the left half and the receiver can get the sketch of the right half by xoring the left with the parent sketch. Bisection has communications overhead because splitting will not distribute items perfectly equally.

For pinsketch the only reason you'd do the bisection thing instead of extending the sketch is because decoding has O(N^2) computational cost, but for IBLT bisection is the only option on failure other than totally start over with a bigger sketch (and waste all the prior bandwidth).

Unfortunately the bisection approach doesn't work as well on IBLT because while pinsketch will be successful in every split segment as soon as that segment is under capacity, IBLT will also continue to fail until the splitting removes all cycles from each segment. So you get inefficiency both from imbalance and from the need to eliminate cycles.

Obviously, what trade-offs will make the most sense depends on the application. We've contemplated adding IBLT to minisketch as an option, particularly because the two approaches can be combined in several different ways: e.g. IBLTs where the buckets are small pinsketches just works(tm)) or e.g. when IBLTs fail they often decode most differences but have a few cycles, a small parallel pinsketch can have the known differences removed and resolve the cycles.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact