Hacker News new | past | comments | ask | show | jobs | submit login
Xor Filters: Faster and Smaller Than Bloom Filters (lemire.me)
457 points by matt_d on Dec 20, 2019 | hide | past | favorite | 81 comments

They praise bloom filters for being simple and describe how it works in two sentences. They criticize cuckoo filters for being too complicated. They praise their own algorithm for being simple, and then they don't describe it, and instead link to a repo of several hundred lines of code.

My guess is it's one of those things that are simple to understand for the guy that wrote it but inscrutable for everyone else. I'm not ashamed to admit that I have several such code blobs in production right now, but I won't pretend to call them simple.

The xor filter is more complex, but the concepts behind it are quite accessible. Let's give it a try.

The xor filter queries for a key by computing three hashes h1(k), h2(k), h3(k) of the key k and using those to index into three arrays a1, a2, a3 of K-bit values. The 3 values loaded from those arrays are xor'd together and compared to a fingerprint f(k).[0] If they are equal, the key is assumed to be contained in the filter.

Assuming that the fingerprint function is random, the probability of false positives is 2^-K.

Constructing the xor filter requires choosing the three hash functions at random, and solving the system of linear equations given by:

   a1[h1(k_i)] + a2[h2(k_i)] + a3[h3(k_i)] = f(k_i) for i = 1..N
If the arrays are big enough (each one a fraction larger than N/3 where N is the number of elements), then the probability is high that the system has a solution (this comes down to the hyperedges {h1(k_i), h2(k_i), h3(k_i)} being "acyclical" -- the acyclic part would be easier to understand if you only had two hash functions giving you normal undirected edges, but the proof of acyclicity only works with 3 or more hash functions). If it doesn't, you just pick new hash functions at random and retry.

The complicated part of the algorithm is solving the system of linear equations efficiently. The complicated part of the correctness proof is showing that the system has a solution with high probability.

[0] The fingerprint is also a hash function. The difference is that f(k) can be fixed for the algorithm, whereas the h1, h2, h3 need to be chosen randomly when the xor filter is built.

Edit: Note that the paper combines the three arrays into a single large one, but that's an implementation detail. Though it makes you wonder whether one couldn't have three hash functions that each cover the entire range of the larger array and still make it work. That could potentially decrease the required size of the array marginally, at the expense of a more difficult proof.

He probably did not talk about it because this post was an announcement of his paper: https://arxiv.org/abs/1912.08258, the XOR filter description starts at page 3.

There's also the linked paper that has algorithmic description (starts on p. 3).

I agree, I kept reading to see their approach and "woof, there it is, so much better, you work it out!". sigh.

Rule #1 of getting something adopted breached.

>My guess is it's one of those things that are simple to understand for the guy that wrote it but inscrutable for everyone else.

Or, one could just read the code instead of guessing - it's not that hard to follow, similar to Bloom idea + XORing the hashes.

> not that hard to follow

I'd say that the requirement to solve a system of linear equations that may or may not even have a solution in order to construct the filter might qualify as hard to follow.

Neither the blog nor the repo contain benchmarks, which is bit weird when the title of a technical article has word "Faster".

Then after digging through the actual paper it turns out that the construction performance of Xor Filter is actually massively slower than Bloom.. May not matter for many uses, but critical for others (like mine).

> Blocked Bloom = 10 ns/key

> Bloom 8 = 40 ns/key

> Xor 8 = 110 ns/key

Co-author of the paper here. The repo https://github.com/FastFilter/fastfilter_cpp contains all the source code and benchmarks. Yes, construction of the xor filter is slower, but I wouldn't say massively. You should probably compare Bloom 12 against Xor 8, as those have similar false positive rates. Bloom 12 is 60 ns/key, versus Xor 8 at 110 ns/key. Cuckoo filter is somewhere between. Those number don't include calculating the hash code of a key, which might be another 30 ns/key or so.

What is your use case? Maybe Blocked Bloom might be better (if you have the memory).

Thanks, found the benchmarks (your blog links primarily to different repo, "/xorfilter").

My use case is a new kind of debugger (https://BugJail.com), which works by capturing 20+ million events per second from running application, and then reconstructing model of program execution from the raw events. With 20+ million events per second that 110ns per key is just not going to work..

Very interesting! If both construction and lookup need to be very fast, and if you do have enough memory, then I would say a blocked Bloom filter would be best (without knowing your use case in detail). With a regular Bloom filter, you will have lots of CPU cache misses. Cuckoo and xor filter might make sense if you can do (batch) construction in background threads, multiple filters concurrently. That way you can save memory.

The filters are used to seek from database files, so lookup/query side performance is not major issue (Xor filters were mainly interesting because of the smaller size). Normally a secondary index would do the job fine, but because of the high sustained rate there is no time to write it. And all CPU cores are already maxed out, so moving more to background doesn't provide net-benefit..

+Thanks everybody for your encouraging words :)

That is a very interesting project that you have on the go there, it doesn't happen often that I see tools that I think might be game changers coming along and this just might be one of those.


Btw you have a bug in your HN profile, it shows "Karma 158422", which seems to be about two orders of magnitude off? ;)

I know, I asked dang to fix it but he refused. I'll have to live with it I guess.

What I like about your project is that I've always felt that debuggers and such give us the same 'von Neumann' view of running programmers that we feel our CPU has. It's a bit like trying to figure out why your car won't start when all you have is a view through a microscope. And besides that, it is quite often that which is not happening that is indicative of an issue and debuggers usually allow you to focus only very well on everything that does happen.

Maybe you feel that way because that's exactly what debugger was designed for: CPU register access in assembler programs..

It's pretty insane that every other aspect of software engineering has changed so dramatically since the 1970s, but debugger has remained fundamentally unchanged for almost half a century now.

Also.. next BugJail beta will have SQL queries against captured program execution (select * from field_write join method_call...) to test hypothesis about runtime behavior. If you can express "that which is not happening" in SQL and get back 0 rows, then it didn't happen. Do you think this could solve your need?

It might be possible to use integer compression with block offsets like in the TurboPFor [1] demo.

TurboPFor offers also direct access to single values in a block.

1 - https://github.com/powturbo/TurboPFor

Thanks for the link, but actually I already use integer compression.

The website mentions a waiting list. Can you point me in the direction of where I can get email notifications about this project? Would be very interested in this for .NET. Is it similar to https://devblogs.microsoft.com/dotnet/debugging-net-apps-wit... ?

Sorry, the waiting list is indeed down atm :( You can just send one-line support ticket at https://support.bugjail.com/hc/en-us/requests/new and I'll add you manually.

The underlying technology in BugJail and Microsoft TTD is somewhat similar (capturing based), but the functionality build on top of that is very different: Microsoft TTD still looks and works like "normal debugger plus back button", while BugJail is total redesign from scratch and looks nothing like any traditional debugger.

Wow. That is a really cool project. I can see why you need that level of performance.

This is damn awesome. Keep up the good work!

Yes but grandparent is correct. The first word in the title is still “faster”.

I know it's frowned upon to say this at HN, but I'm beginning to wonder who's reading the actual page. At least to me the page seems pretty clear about it:

>The xor filters takes a bit longer to build, but once built, it uses less memory and is about 25% faster in some demonstration test.

When they say "faster," they mean for querying. Which is actually pretty useful for a lot of use cases where bloom filters are used today.

To be clear, I wrote specifically "construction performance".

The blog already stated that "build is bit slower", so i was mainly commenting that instead of "bit slower" it is actually order of magnitude slower (in some cases).

According to the paper, the query side performance is indeed faster, so they are not misleading in any way.

The paper claims that it is more space efficient than golomb coded sequences, but this is clearly not the case at least when the golomb parameters and FP rate are set optimally[1] a GCS can be well within a fraction of 1% of the information theoretic limit.

The main limitation of GCS in terms of communications efficiency is just that there aren't that many optimal rate options. Though arithmetic coding the differences isn't really that much more complex to implement and it's optimal for whatever parameters you want.

Now, if it wanted to talk about _access_ efficiency, then sure, GCS don't result in an data structure thats efficient for random access.

[1] https://gist.github.com/sipa/576d5f09c3b86c3b1b75598d799fc84...

(Co-author here) For GCS, I also found the Golomb-Rice parameter Q-1 to be better, this is what we used. According to my calculation, the overhead of GCS is around 1.5 bit/key. See also https://github.com/0xcb/Golomb-coded-map . Sure, with arithmetic coding, or ANS coding, it should be lower; it would be interesting to see how much. I kind of like GCS.

The overhead you're seeing is because you've mismatched your FP rate with the optimal size for your GCS code. If they're exactly matched the overhead is very low (not as low as an arithmetic code simply because each entry is still coded into bits, so you get something like a half-bit overhead on average from that).

I would love to see improvements to the GCS code I have, but I tested with many different sizes and Golomb-Rice parameters... I was not able to reduce the size overhead in any way. Maybe you can help? There are two implementations, one is Java: https://github.com/FastFilter/fastfilter_java/blob/master/sr... and the other is C++: https://github.com/FastFilter/fastfilter_cpp/blob/master/src...

Good point about Q-1. I never looked at optimizing parameters when writing up the GCM.

A quick empirical check shows a small improvement for Q-1, but the claim that it comes in under 1% from the minimum seems to be based on an unusual definition of the minimum. nullc gives it as log2(eM), but the real minimum is simply log2(M), in which case the unary terminator alone imposes a 5% penalty at M=2^20.

It may still be possible to optimize the parameters further, but the gain would be perhaps a tenth of a bit per item.

> but the real minimum is simply log2(M)

The minimum entropy to code such a thing is log2(MN choose N)/N bits per element.

log2(eM) is just an approximation when N is large.

You cannot code with N uniform choices out of MN options with less than log2(MN choose N) bits on average, by the pigeonhole principle.

> the unary terminator alone

The terminator carries information. If M is chosen well the terminator carries almost one bit of information.

The minimum for approximate membership testers is log2((M choose N)/(N + fp_rate * (M - N) choose N))/N, assuming M >> N. The asymptotic is log2(1/fp_rate). For simplicity's sake below I pretend we're dealing with a set of 1, and so 1/fp_rate = M.

I thought you had this in mind in the first paragraph with "1 / log(2) (around 1.44) times more compact than Bloom filters". However using your definition of the minimum, for practical purposes it isn't possible to be anywhere near 1.44 times more compact than Bloom filters.

Using your example of M=2^20, log2(M) log2(e) / log2(eM) = 1.34. Personally I've never spent more than 45 bits per item on such structures, but even at 64 bits log2(eM) only reaches 1.41.

The oft-cited 1.44x Bloom filter tax is simply taken from its per-item cost of log2(M) log2(e), absent the information-theoretic minimum of log2(M), leaving just the log2(e) multiplier.

I think I was caught up thinking the discussion was all about the efficiency of the GCS coding for the underlying bit-set; which it is indeed indeed nearly optimal for (but only for some parameters, which don't happen to be where the fp rate is 1/2^k, which unfortunately is the case usually considered in papers!). Sorry about that.

The bloom cost 1.44 times the lower bound. The difference between an optimally coded bitset and the asymptotic lower bound is an additive 1.44 bits per element. This is a pretty big difference!

If I were coding a static approximate membership query structure, I wouldn't use either bloom filters or GCS, I'd use an Elias-Fano-coded sequence of fingerprints. That has nearly the same compression efficiency as Golomb coding but can be efficiently queried with no decompression.

Interesting idea! Elias-Fano monotone sequences unfortunately do need a bit more space than just a list of Rice codes. What is possible, without having to increase the space usage, is to re-arrange the Rice codes in each bucket, so that all variable parts come first, and the fixed parts at the end. That way, lookup speed would be much, much faster for large buckets. We used this for RecSplit: https://arxiv.org/abs/1910.06416 I will try that out for GCS as well!

Direct link to paper (pdf) - https://arxiv.org/pdf/1912.08258.pdf

I haven't fully read the paper yet, but it seems like a variation of a bloom filter where membership is determined by xoring three hashes (see algorithm 1 in the pdf)

If that's the case then both membership and non-membership will be probabilistic, rather than one or the other being certain like boom filters, right?

"Membership test: returns true if the key x is likely in S, false otherwise"

I think they circumvent the problem by making a complicated insertion procedure where all of the keys need to be inserted at once, it seems uses a stack to avoid hash collisions

Take a look at https://github.com/FastFilter/xorfilter/blob/master/xorfilte...

I can usually read go, but I've never been that great at deciphering "academic code" so the exact details escape me here.

Yeah, TFA says it's intended as an immutable structure that has to be rebuilt to add new data to it - which limits its applications.

Yep, also the insertion procedure is retried with increasingly larger arrays until no “collisions” are encountered.

Reminds me of minimal perfect hash construction.

Yes, the underlying theory is from a minimal perfect hash function called "BDZ": http://cmph.sourceforge.net/bdz.html . Yes, sometimes a retry is needed, but the array is not made larger; it just retries with a different seed.

One of the advantages of bloom filters is that you can also do intersections, unions and cardinality estimations thereof on remote sets. These don't seem to support that.

Yes it is possible to do this, but do people actually use those features? For (pure) cardinality estimation, there is HyperLogLog, which can also be merged...

> do people actually use those features?

I did, not sure how common it is though. For much-larger-than-memory problems it can give a tremendous speedup to be able to do an estimate or logical operation(s) before having to hit the disk.

I used that to sum unique visitors on websites, by summing daily bloom filters corresponding to the set of visitors each day. It was pretty cool to be able to sum "unique" metrics, which is usually not possible in classic reporting tools.

As far as I can tell it doesn’t support insertion operations. That’s a massive drawback vs both cuckoo and bloom filters.

It’s somewhat disingenuous that they are clearly attempting to position this as being superior to the other data structures, while their’s cannot do one of the core operations.

I think there are many use cases where you don't add entries after construction, for example for log-structured merge trees. You anyway can't add too many, otherwise it will hurt the false positive rate. What _are_ important use case where you add entries afterwards? BTW for cuckoo filters, add and remove can sometimes fail (depending on parameters and luck).

There are real-world applications where this* would drop in. I believe Chrome's malicious-site blocker employs a Bloom filter blob.

* assuming it does what it says, I haven't looked.

Assuming you're talking about Google Safe Browsing it needs the ability to update the data structures. Implementations vary but I wouldn't use Bloom Filters and this seems even less appropriate.

I'm sure there's some application where this is the right choice, but the reaction here shows that people don't like being told the propaganda approach to new data structures. Just returning true is also technically "faster and smaller than Bloom Filters" as the headline for this item says when I write this. Almost useless, but faster and smaller.

I’m unsure if they do partial/delta updates — I’ve never looked at the spec in detail - but I don’t know how we’ll bloom filters deal with bulk delta updates.

It is absolutely outrageous that you are accusing the authors of this work of dishonesty because of limitations they explicitly called out in their introductory post as well as the paper.

The compared Cuckoo filters have a similar property, no? (Depends on luck if you can update it or not)

When the table is less full than some cliff-effect-threshold the probability of failure is arbitrarily negligible.

If having a failure rate of 1 in 2^hundreds is still not good enough for you, you could augment the data structure with a vector of insertion failures, which will be empty in almost all universes and so scanning it on every query will cost nothing but a single well predicted branch.

If the number of candidate insertion places is fairly large (like ... 8) then the cuckoo filter can be quite full while still keeping a negligible chance of insertion failure.

[Of course, all assuming someone can't attack your hash function.]

(Co-author of the paper.) You are right. However, it is a bit hard to find the threshold. Even the original author of the cuckoo filter didn't get it right, see https://github.com/efficient/cuckoofilter/issues/9 - for this reason we also lowered the maximum load to 0.94. However, I found (running the benchmarks) that sometimes that is not sufficient (insertion still fails). So, where is the cliff exactly? Is it 0.9? Or lower? And, as you say, it also depends on the hash function. We have used the Murmur3 finalizer, which is statistically very good. What if you use a weaker hash function? For a regular cuckoo hash table, you can just re-build it. But for a cuckoo filter rebuild is not easily possible as you don't have the keys.

You are also right regarding augmenting the cuckoo filter (e.g. "cuckoo hashing with a stash"), but it slows things down. Or use a different algorithm to insert (breath first I think). But it would add complexity.

I think, as is, the cuckoo filter is quite a tricky data structure: it can do a lot in theory, but it is very tricky to get right.

It would be more fair to compare with minimal perfect hashing, as it does not support insertion too

Also of academic interest; Neural Bloom filter: https://arxiv.org/abs/1906.04304

Achives optimally 36% space saving compared to BL but needs an fp accelerator to achieve the same throughput. Compares to perfect hasing, which also does not suppprt insertions.

Comparing a static data structure to two well-known dynamic structures is disingenuous, to say the least. And for a static structure you can do better than just a minimal perfect hash table of fingerprints. You can use a succinct data structure to compress the fingerprints in a form that is still efficiently queryable (unlike say Golomb-compressed sequences): https://www.antoniomallia.it/sorted-integers-compression-wit...

Elias-fano are overestimated without any proof or comparison benchmarks.

Like other integer compression methods, you still need offsets to small blocks for efficient random access.

Getting a single value in a block is done with slow branchy bit ops like ctz and bit shifting. That's decoding

The fastest elias-fano implementation is in https://github.com/powturbo/TurboPFor and is partially using SIMD

VLDB also just published a paper on a similar, improved bloom-like data structure "Vacuum Filters: More Space-Efficient and Faster Replacement for Bloom and Cuckoo Filters" http://www.vldb.org/pvldb/vol13/p197-wang.pdf

It would be interesting to get a comparison between Xor and Vacuum.

Thanks for the link! I wasn't aware of this. The source code is available at https://github.com/wuwuz/Vacuum-Filter/

To me it looks like a version of the cuckoo filter that isn't restricted to a 2^n size. We made the same change in the benchmark of the xor filter against the cuckoo filter (and others made the same change). There are some more changes, and the paper claims the vacuum filter is faster than the cuckoo filter. I will try to add the vacuum filter to our benchmark suite at https://github.com/FastFilter/fastfilter_cpp and see what results I get.

Interesting. Where does 3 come from for the number of hash functions? Why not 4 or 5?

The minimum number motivated to work is also the cheapest. 2 was simply deemed insecure.

No, actually 2 would also work. It's just that 3 hash functions needs the least space for some reason (less than 2, less than 4 or more).

Well met, and thanks for identifying the solution using overhead.

Insecure? What does that mean in the context of a read-only data structure?

Good point, loose language on my part. I am thinking of false positives or collision. The required statistical properties aren't met.

How is this different from a hashset where you simply never do the final key comparison? As far as I can tell both require linear storage space and have around the same storage complexity.

It seems like the main "innovation" here is simply using a perfect hash function for your hashset.

From the paper: "We store the fingerprints in an array B with capacity c slightly larger than the cardinality of the set |S | (i.e., c ≈ 1.23 × |S |)."

The constant factor is different. If you want a small number of collisions, you will need a very small load-factor (you can't do most techniques for handling collisions because you aren't storing the key). So if you treat your hash-table as an array of bits and you run at a load-factor of say .1, then you are now using 10 bits per key, which is a lot more space than a bloom filter or an xor filter with the same false-positive rate.

Bloom filters don't require linear space. You can essentially create a set of arbitrarily large objects, as long as you have a hash for them. It requires O(C) memory where C is the size of your largest hash function. As a trade-off, it's not deterministic, since there can be hash collisions due to pigenhole principle. So, even though bloom filter claims the object is in the set, there is a certain probability (that you can calculate) that it's not actually there. On the other hand, a tree-set or hash-set will be deterministically correct, but they require O(N) space and tree-set requires O(logN) lookup cost.

Bloom filters absolutely call for linear space if you are using them appropriately: given the hash collision probability you are working with the size of the required filter is linear in the number of elements you want to store. Yes, it is "technically true" (the most useless kind of true) that you can store as much stuff as you want in the filter... but you get increasingly worse false positive rates that eventually just saturate to near 100%. If you actually read any of the math--including this linked work on Xor filters--you will always see them described in terms of "bits per item stored" (where the number of bits is extremely small vs. what you would expect). Hell: Wikipedia even describes the algorithm's storage requirements as "A Bloom filter with 1% error and an optimal value of k, in contrast, requires only about 9.6 bits per element, regardless of the size of the elements.".

I disagree. It requires linear space in the sense that you need another data structure to deal with hash collisions. If you read my comment I was talking about exactly that. In normal applications of bloom filter it's built before a cache layer to ask if your data is in the cache. For this you do not need a list of objects since if bloom filter wrongly determined that object is in the set, it will be a cache miss and as long as a bloom filter operation is faster than a cache miss this is usually a good tradeoff; because normally when you're about to have a cache miss, you end up not going to cache because bloom filter says the object is not there. For this application, you do NOT need O(N) space inside the bloom filter layer, but obviously eventually someone will need to go retrieve this data.

> If you actually read any of the math

I mean I did and I was trying to describe that math above. I built bloom filters many times before. I'm curious if you misread my comment.

You said the following, so I assume you agree with it:

> It requires O(C) memory where C is the size of your largest hash function.

For any given expected false-positive rate P, C must be chosen to be a size of O(N), so a bloom filter with expected false-positive rate P will be of size O(N).

How big of a problem can false positives get?

In terms of the URL blacklist, is the 1% of false positives 1% of all possible URLs? Or of the amount of requests to the filter? Or what?

Of the amount of blacklisted URLs, for sure. The filter doesn't know anything about possible URLs.

The C++ code, anyway, is remarkably prolix, probably 4-10x. It is probably best that they did not attempt Rust, under the circumstances.

From the blog post: > It would be relatively easy to start from the Go version and produce a Rust version, but I had to stop somewhere.

(I bet someone in hacker news is saying "challenge accepted!" and we'll have a xorfilter crate very soon)

Yes, I read that.

The point was that as bad as their C++ code is, their Rust version might be even more disappointing.

Downvoting me just for mentioning Rust only demonstrates immaturity.

(Author here) The project started by taking the cuckoo filter source code (which happens to be written in C++), and extend it. A lot of code comes from that conversion. Other implementations (Morton, CQF) are also written in C++. I write Java normally, so C++ code is not my strength, but C++ was needed to be able to compare with others. A Rust version would be nice!

I don’t have a ton of time but the Go version does seem very easy to port, we’ll see!

I downvoted you because wordy code isn't bad. "Prolix" is not a negative adjective when describing code. Golf is a game and not for real systems.

If you don't understand how having several times as much code as needed, reproducing features also provided by thoroughly-tested Standard libraries, creates reliability risks, this probably will not be the place you learn it. But you could try.

People promoting "safe" languages that don't support encapsulating powerful semantics in libraries prefer to downplay the risks in not having access to such libraries. Needing to open-code the same features, or rely on less well-encapsulated semantics is a fertile source of real-world bugs.

In the race to reliable systems, access to and use of sound, well-tested, usable libraries multiplies efforts.

Downvoting people because you don't understand their argument is disgraceful. Ask.

Applications are open for YC Winter 2022

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact