
Vectorized Bloom Filters for Advanced SIMD Processors (2014) [pdf] - tjalfi
https://pdfs.semanticscholar.org/6a5e/0414a01c19da4a48ac4018b5687d782ad25c.pdf
======
jbapple
This is a nice piece of work and still relevant. The source is available:

[http://www.cs.columbia.edu/~orestis/vbf.c](http://www.cs.columbia.edu/~orestis/vbf.c)

It has very high throughput for L2-resident filters, as long as most queries
return "false" and you can use a bulk API. It was, IIRC, about 4x faster than
a hand-made "horizontal" SIMD Bloom filter, and 20x faster than cuckoo
filters.

By "horizontal" SIMD, I am using the language of a follow-up paper by the same
team at Columbia, "Rethinking SIMD Vectorization for In-Memory Databases",
[http://www.cs.columbia.edu/~orestis/sigmod15.pdf](http://www.cs.columbia.edu/~orestis/sigmod15.pdf),
[http://www.cs.columbia.edu/~orestis/sigmod15source.zip](http://www.cs.columbia.edu/~orestis/sigmod15source.zip).
In that paper, "vertical" SIMD for hash-based containers means
"process[ing] a different input key per vector lane", while "horizontal"
SIMD puts the same key in every lane.

I suspect the results in this paper could be improved with more modern gather
techniques on newer x86-64 processors.

~~~
psi-squared
> I suspect the results in this paper could be improved with more modern
> gather techniques on newer x86-64 processors.

By that, do you mean the AVX2 'gather' type instructions? If not, I'd be
interested to know what those techniques are.

As for AVX2 gathers, I had to look this up recently and it sounds like they're
about as fast as manually unpacking the vector and performing scalar loads.
That is to say, they're decidedly _not_ fast. On the other hand, it sounds
like (as of Skylake) they're bottlenecked on accesses to the L1 cache, so
they're about as fast as they reasonably could be.

Source: [https://stackoverflow.com/questions/21774454/how-are-the-gather-instructions-in-avx2-implemented](https://stackoverflow.com/questions/21774454/how-are-the-gather-instructions-in-avx2-implemented)

Not sure about performance on Zen, but I would imagine it's similar?

~~~
jbapple
> do you mean the AVX2 'gather' type instructions?

Yes, and AVX512, with two caveats:

1. This is processor-dependent: different processors have different relative
latencies and throughputs for these instructions.

2. Sometimes a related instruction is faster than the one designed
specifically for the purpose. For instance, in the glibc on my machine,
memcmp and strncmp are NOT implemented using the SSE4.2 string-comparison
instructions, but instead use ptest and pcmpeq, respectively, because it
is faster to do so. The same phenomenon could be true of gathers as well.

