My guess is it's one of those things that are simple to understand for the guy that wrote it but inscrutable for everyone else. I'm not ashamed to admit that I have several such code blobs in production right now, but I won't pretend to call them simple.
The xor filter queries for a key by computing three hashes h1(k), h2(k), h3(k) of the key k and using those to index into three arrays a1, a2, a3 of K-bit values. The 3 values loaded from those arrays are xor'd together and compared to a fingerprint f(k). If they are equal, the key is assumed to be contained in the filter.
Assuming that the fingerprint function is random, the probability of false positives is 2^-K.
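A rough Go sketch of that lookup (everything here, mix(), the seeding scheme, and the fingerprint, is a placeholder made up for illustration, not what the paper or the FastFilter implementations actually use):

    package xorsketch

    // mix is a splitmix64-style placeholder; a real filter uses a proper
    // seeded hash family.
    func mix(x uint64) uint64 {
        x ^= x >> 30
        x *= 0xbf58476d1ce4e5b9
        x ^= x >> 27
        x *= 0x94d049bb133111eb
        x ^= x >> 31
        return x
    }

    type xorFilter struct {
        a1, a2, a3 []uint8 // three arrays of K = 8 bit values
        seed       uint64
    }

    func (f *xorFilter) h(i int, key uint64) uint64 {
        return mix(key ^ f.seed ^ uint64(i)*0x9e3779b97f4a7c15)
    }

    func (f *xorFilter) fingerprint(key uint64) uint8 {
        return uint8(mix(key) >> 56)
    }

    // contains: hash the key three ways, xor the three loaded values, and
    // compare with the fingerprint; false positives occur with probability
    // about 2^-K.
    func (f *xorFilter) contains(key uint64) bool {
        v1 := f.a1[f.h(0, key)%uint64(len(f.a1))]
        v2 := f.a2[f.h(1, key)%uint64(len(f.a2))]
        v3 := f.a3[f.h(2, key)%uint64(len(f.a3))]
        return v1^v2^v3 == f.fingerprint(key)
    }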
Constructing the xor filter requires choosing the three hash functions at random and solving the system of linear equations (addition here being bitwise xor, i.e. over GF(2)) given by:
a1[h1(k_i)] + a2[h2(k_i)] + a3[h3(k_i)] = f(k_i) for i = 1..N
The complicated part of the algorithm is solving the system of linear equations efficiently. The complicated part of the correctness proof is showing that the system has a solution with high probability.
 The fingerprint is also a hash function. The difference is that f(k) can be fixed for the algorithm, whereas the h1, h2, h3 need to be chosen randomly when the xor filter is built.
Edit: Note that the paper combines the three arrays into a single large one, but that's an implementation detail. Though it makes you wonder whether one couldn't have three hash functions that each cover the entire range of the larger array and still make it work. That could potentially decrease the required size of the array marginally, at the expense of a more difficult proof.
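Continuing that sketch, here is the flavor of the construction ("peeling"). The function names (build, tryBuild, slots, slotValue) are my own, the keys are assumed distinct, and slotsPerArray is assumed to be around 1.23 * len(keys) / 3; the real code linked elsewhere in the thread is far more careful about sizing and speed:

    // build retries with fresh seeds until peeling succeeds.
    func build(keys []uint64, slotsPerArray int) *xorFilter {
        for seed := uint64(1); ; seed++ {
            f := &xorFilter{
                a1:   make([]uint8, slotsPerArray),
                a2:   make([]uint8, slotsPerArray),
                a3:   make([]uint8, slotsPerArray),
                seed: seed,
            }
            if f.tryBuild(keys) {
                return f
            }
        }
    }

    // slots maps a key to one slot in each array, indexed as one virtual
    // array of length 3n (n per array).
    func (f *xorFilter) slots(key uint64) [3]int {
        n := len(f.a1)
        return [3]int{
            int(f.h(0, key) % uint64(n)),
            int(f.h(1, key)%uint64(n)) + n,
            int(f.h(2, key)%uint64(n)) + 2*n,
        }
    }

    func (f *xorFilter) slotValue(s int) *uint8 {
        n := len(f.a1)
        switch s / n {
        case 0:
            return &f.a1[s%n]
        case 1:
            return &f.a2[s%n]
        default:
            return &f.a3[s%n]
        }
    }

    func (f *xorFilter) tryBuild(keys []uint64) bool {
        n := len(f.a1)
        count := make([]int, 3*n)      // how many keys touch each slot
        xorKeys := make([]uint64, 3*n) // xor of the keys touching each slot
        for _, k := range keys {
            for _, s := range f.slots(k) {
                count[s]++
                xorKeys[s] ^= k
            }
        }
        var queue []int // slots currently touched by exactly one key
        for s, c := range count {
            if c == 1 {
                queue = append(queue, s)
            }
        }
        type peeled struct {
            key  uint64
            slot int
        }
        var stack []peeled
        for len(queue) > 0 {
            s := queue[len(queue)-1]
            queue = queue[:len(queue)-1]
            if count[s] != 1 {
                continue // stale queue entry
            }
            k := xorKeys[s] // the single remaining key touching slot s
            stack = append(stack, peeled{k, s})
            for _, t := range f.slots(k) {
                count[t]--
                xorKeys[t] ^= k
                if count[t] == 1 {
                    queue = append(queue, t)
                }
            }
        }
        if len(stack) != len(keys) {
            return false // peeling got stuck; caller retries with a new seed
        }
        // Assign in reverse peel order: each key's chosen slot is still zero
        // and is never written again, so setting it makes
        // a1[h1(k)] ^ a2[h2(k)] ^ a3[h3(k)] == fingerprint(k).
        for i := len(stack) - 1; i >= 0; i-- {
            k, s := stack[i].key, stack[i].slot
            v := f.fingerprint(k)
            for _, t := range f.slots(k) {
                if t != s {
                    v ^= *f.slotValue(t)
                }
            }
            *f.slotValue(s) = v
        }
        return true
    }

With roughly 1.23x as many slots as keys in total, the peeling succeeds with high probability, which is the part of the correctness proof called complicated above.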
Rule #1 of getting something adopted breached.
Or, one could just read the code instead of guessing - it's not that hard to follow, similar to Bloom idea + XORing the hashes.
I'd say that the requirement to solve a system of linear equations that may or may not even have a solution in order to construct the filter might qualify as hard to follow.
Then after digging through the actual paper it turns out that constructing an Xor Filter is actually massively slower than a Bloom filter. That may not matter for many uses, but it is critical for others (like mine).
> Blocked Bloom = 10 ns/key
> Bloom 8 = 40 ns/key
> Xor 8 = 110 ns/key
What is your use case? Maybe Blocked Bloom might be better (if you have the memory).
My use case is a new kind of debugger (https://BugJail.com), which works by capturing 20+ million events per second from the running application and then reconstructing a model of program execution from the raw events. With 20+ million events per second, that 110 ns per key is just not going to work.
Thanks everybody for your encouraging words :)
Btw you have a bug in your HN profile, it shows "Karma 158422", which seems to be about two orders of magnitude off? ;)
What I like about your project is that I've always felt that debuggers and such give us the same 'von Neumann' view of running programs that we feel our CPU has. It's a bit like trying to figure out why your car won't start when all you have is a view through a microscope. And besides that, it is quite often that which is not happening that is indicative of an issue, and debuggers usually only let you focus, admittedly very well, on everything that does happen.
It's pretty insane that every other aspect of software engineering has changed so dramatically since the 1970s, but the debugger has remained fundamentally unchanged for almost half a century now.
Also, the next BugJail beta will have SQL queries against captured program execution (select * from field_write join method_call...) to test hypotheses about runtime behavior. If you can express "that which is not happening" in SQL and get back 0 rows, then it didn't happen. Do you think this could solve your need?
TurboPFor also offers direct access to single values in a block.
1 - https://github.com/powturbo/TurboPFor
The underlying technology in BugJail and Microsoft TTD is somewhat similar (capture-based), but the functionality built on top of that is very different: Microsoft TTD still looks and works like a "normal debugger plus back button", while BugJail is a total redesign from scratch and looks nothing like any traditional debugger.
>The xor filters takes a bit longer to build, but once built, it uses less memory and is about 25% faster in some demonstration test.
When they say "faster," they mean for querying. Which is actually pretty useful for a lot of use cases where bloom filters are used today.
The blog already stated that the build is "a bit slower", so I was mainly commenting that instead of "a bit slower" it is actually an order of magnitude slower (in some cases).
According to the paper, the query side performance is indeed faster, so they are not misleading in any way.
The main limitation of GCS in terms of communications efficiency is just that there aren't that many optimal rate options. Though arithmetic coding the differences isn't really that much more complex to implement and it's optimal for whatever parameters you want.
Now, if it wanted to talk about _access_ efficiency, then sure, GCS don't result in a data structure that's efficient for random access.
A quick empirical check shows a small improvement for Q-1, but the claim that it comes in under 1% from the minimum seems to be based on an unusual definition of the minimum. nullc gives it as log2(eM), but the real minimum is simply log2(M), in which case the unary terminator alone imposes a 5% penalty at M=2^20.
It may still be possible to optimize the parameters further, but the gain would be perhaps a tenth of a bit per item.
The minimum entropy to code such a thing is log2(MN choose N)/N bits per element.
log2(eM) is just an approximation when N is large.
You cannot code with N uniform choices out of MN options with less than log2(MN choose N) bits on average, by the pigeonhole principle.
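(Quick sanity check on that: when M is reasonably large, (MN choose N) is roughly (MN e / N)^N = (eM)^N by Stirling, so log2(MN choose N)/N comes out to about log2(M) + log2(e) = log2(eM).)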
> the unary terminator alone
The terminator carries information. If M is chosen well the terminator carries almost one bit of information.
I thought you had this in mind in the first paragraph with "1 / log(2) (around 1.44) times more compact than Bloom filters". However using your definition of the minimum, for practical purposes it isn't possible to be anywhere near 1.44 times more compact than Bloom filters.
Using your example of M=2^20, log2(M) log2(e) / log2(eM) = 1.34. Personally I've never spent more than 45 bits per item on such structures, but even at 64 bits log2(eM) only reaches 1.41.
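(For concreteness: at M = 2^20 the Bloom filter costs log2(M) log2(e) = 20 x 1.443, about 28.9 bits per item, while log2(eM) = 20 + 1.443, about 21.4 bits, which is where the 1.34 ratio above comes from.)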
The oft-cited 1.44x Bloom filter tax is simply its per-item cost of log2(M) log2(e) divided by the information-theoretic minimum of log2(M), leaving just the log2(e) multiplier.
The Bloom filter costs 1.44 times the lower bound. The difference between an optimally coded bitset and the asymptotic lower bound is an additive 1.44 bits per element. This is a pretty big difference!
I haven't fully read the paper yet, but it seems like a variation of a bloom filter where membership is determined by xoring three hashes (see algorithm 1 in the pdf)
I think they circumvent the problem by using a complicated insertion procedure where all of the keys need to be inserted at once; it seems to use a stack to avoid hash collisions.
Take a look at https://github.com/FastFilter/xorfilter/blob/master/xorfilte...
I can usually read go, but I've never been that great at deciphering "academic code" so the exact details escape me here.
Reminds me of minimal perfect hash construction.
I did, not sure how common it is though. For much-larger-than-memory problems it can give a tremendous speedup to be able to do an estimate or logical operation(s) before having to hit the disk.
It’s somewhat disingenuous that they are clearly attempting to position this as being superior to the other data structures, while theirs cannot do one of the core operations.
* assuming it does what it says, I haven't looked.
I'm sure there's some application where this is the right choice, but the reaction here shows that people don't like being told the propaganda approach to new data structures. Just returning true is also technically "faster and smaller than Bloom Filters" as the headline for this item says when I write this. Almost useless, but faster and smaller.
If having a failure rate of 1 in 2^hundreds is still not good enough for you, you could augment the data structure with a vector of insertion failures, which will be empty in almost all universes and so scanning it on every query will cost nothing but a single well predicted branch.
If the number of candidate insertion places is fairly large (like ... 8) then the cuckoo filter can be quite full while still keeping a negligible chance of insertion failure.
[Of course, all assuming someone can't attack your hash function.]
You are also right regarding augmenting the cuckoo filter (e.g. "cuckoo hashing with a stash"), but it slows things down. Or use a different insertion algorithm (breadth-first, I think). But it would add complexity.
I think, as is, the cuckoo filter is quite a tricky data structure: it can do a lot in theory, but it is very tricky to get right.
Achieves an optimal 36% space saving compared to BL but needs an fp accelerator to achieve the same throughput. Compares to perfect hashing, which also does not support insertions.
Like other integer compression methods, you still need offsets to small blocks for efficient random access.
Getting a single value in a block is done with slow, branchy bit ops like ctz and bit shifting. That's decoding.
The fastest Elias-Fano implementation is in https://github.com/powturbo/TurboPFor and partially uses SIMD.
It would be interesting to get a comparison between Xor and Vacuum.
To me it looks like a version of the cuckoo filter that isn't restricted to a 2^n size. We made the same change in the benchmark of the xor filter against the cuckoo filter (and others made the same change). There are some more changes, and the paper claims the vacuum filter is faster than the cuckoo filter. I will try to add the vacuum filter to our benchmark suite at https://github.com/FastFilter/fastfilter_cpp and see what results I get.
It seems like the main "innovation" here is simply using a perfect hash function for your hashset.
From the paper: "We store the fingerprints in an array B with capacity c slightly larger than the cardinality of the set |S | (i.e., c ≈ 1.23 × |S |)."
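With the usual 8-bit fingerprints, that 1.23 factor works out to roughly 1.23 x 8 = 9.84 bits per key, which (if I remember the paper's tables right) is where the space numbers for the 8-bit variant come from.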
> If you actually read any of the math
I mean, I did, and I was trying to describe that math above. I've built bloom filters many times before. I'm curious if you misread my comment.
> It requires O(C) memory where C is the size of your largest hash function.
For any given expected false-positive rate P, C must be chosen to be a size of O(N), so a bloom filter with expected false-positive rate P will be of size O(N).
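(Concretely, the standard sizing for a Bloom filter with n keys and target false-positive rate p is m = n log2(1/p) log2(e) bits with about log2(1/p) hash functions, i.e. linear in n for any fixed p.)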
In terms of the URL blacklist, is the 1% of false positives 1% of all possible URLs? Or of the amount of requests to the filter? Or what?
(I bet someone in hacker news is saying "challenge accepted!" and we'll have a xorfilter crate very soon)
The point was that as bad as their C++ code is, their Rust version might be even more disappointing.
Downvoting me just for mentioning Rust only demonstrates immaturity.
People promoting "safe" languages that don't support encapsulating powerful semantics in libraries prefer to downplay the risks in not having access to such libraries. Needing to open-code the same features, or rely on less well-encapsulated semantics is a fertile source of real-world bugs.
In the race to reliable systems, access to and use of sound, well-tested, usable libraries multiplies efforts.
Downvoting people because you don't understand their argument is disgraceful. Ask.