
When Bloom filters don't bloom - jgrahamc
https://blog.cloudflare.com/when-bloom-filters-dont-bloom/
======
majke
In the final program `mmuniq` I did a couple of, I think, interesting hacks.

[https://github.com/cloudflare/cloudflare-blog/blob/master/20...](https://github.com/cloudflare/cloudflare-blog/blob/master/2020-02-mmuniq/mmuniq.c)

First, I used a hash function based on the aesni (aesenc) instruction. See this:

[https://gist.github.com/majek/96dd615ed6c8aa64f60aac14e3f6ab...](https://gist.github.com/majek/96dd615ed6c8aa64f60aac14e3f6ab5a)

While I have little proof it's a good hash, it seems good enough, and is
_slightly_ (5-10%?) faster than siphash24 in this context.

Then I mixed computing the hash with finding newlines (\n). This allows me to
do only one user-data load into XMM registers.
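
Roughly, the newline-finding half looks like this (a simplified sketch using SSE2 intrinsics, not the actual mmuniq.c code, which fuses this load with the AES hashing rounds):

    #include <emmintrin.h>  /* SSE2 */
    #include <stdint.h>
    
    /* Return the offset of the first '\n' in a 16-byte chunk, or 16 if
       there is none. The same XMM load can then feed the hash rounds. */
    static int first_newline(const uint8_t *p) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)p);
        __m128i nl    = _mm_set1_epi8('\n');
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, nl));
        return mask ? __builtin_ctz(mask) : 16;
    }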

Most importantly, to offset the RAM latency cost, I'm doing 64 prefetches as I
parse the input, and only after that do I actually touch the hash table. The
memory latency is still the biggest time sink, but at least this seems to
speed up the program 2x or more. The hash table without this batching+prefetch
takes 6-8 seconds; with batching it goes down below 3s.
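
The batching idea, as a simplified sketch (not the actual mmuniq.c code; FNV-1a stands in for the aesenc-based hash, and the sizes are made up):

    #include <stdint.h>
    #include <stddef.h>
    
    #define BATCH      64
    #define TABLE_BITS 26  /* 64M slots * 8 bytes = 512 MiB, example size */
    #define TABLE_MASK ((1UL << TABLE_BITS) - 1)
    
    static uint64_t *table;  /* calloc(1UL << TABLE_BITS, 8) at startup */
    
    /* Stand-in hash; mmuniq actually uses an aesenc-based one. */
    static uint64_t hash64(const char *s, size_t len) {
        uint64_t h = 0xcbf29ce484222325ULL;
        for (size_t i = 0; i < len; i++) {
            h ^= (uint8_t)s[i];
            h *= 0x100000001b3ULL;
        }
        return h | 1;  /* reserve 0 to mean "empty slot" */
    }
    
    /* Hash a whole batch and prefetch each target slot first; probe the
       table only after all the loads are in flight. Returns the number
       of lines not seen before. Assumes the table stays sparse. */
    static size_t process_batch(const char **lines, const size_t *lens,
                                size_t n) {
        uint64_t h[BATCH];
        size_t fresh = 0;
        for (size_t i = 0; i < n; i++) {
            h[i] = hash64(lines[i], lens[i]);
            __builtin_prefetch(&table[h[i] & TABLE_MASK]);
        }
        for (size_t i = 0; i < n; i++) {  /* linear probing */
            size_t slot = h[i] & TABLE_MASK;
            while (table[slot] != 0 && table[slot] != h[i])
                slot = (slot + 1) & TABLE_MASK;
            if (table[slot] == 0) { table[slot] = h[i]; fresh++; }
        }
        return fresh;
    }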

I suspect linear probing / open addressing of the hash table may carry some
penalty. While it plays nicely with the cache prefetch, it generally leads to
longer probe chains. This means we need to keep the hash table sparse, with a
load factor not above 0.6-0.75. See this

[https://en.wikipedia.org/wiki/File:Hash_table_average_insert...](https://en.wikipedia.org/wiki/File:Hash_table_average_insertion_time.png)

from
[https://en.wikipedia.org/wiki/Hash_table](https://en.wikipedia.org/wiki/Hash_table)

~~~
willvarfar
I'm a bit confused: you are now storing the IPv4 addresses in a hash table
using a 64-bit hash?

Why not just use the 32-bit address as a key, and grow the 'blocks', so that
if two addresses are just a couple of digits apart, you promote them to a /24
block, etc.?

~~~
majke
Apologies, maybe I oversimplified the original problem. I'm dealing with IPs
(both v4 and v6), subnets, and ranges (which may or may not align to subnets).
These map to one or more datacenter numbers.

I could indeed define a data model, parse the data thoroughly, optimize the
in-memory data structure, and so on. But that requires a rigid data structure,
knowing the access pattern, and understanding the problem space, and I'm not
there yet. Instead, I created this generic tool which works with any text
file, and fell into a rabbit hole of over-optimizing it. That's it.

~~~
taywrobel
FWIW, you should be able to represent individual IPs, ranges, and subnets all
in CIDR notation, though for ranges you may need multiple CIDR entries to
cover the whole range.

CIDR for IPv4 consists of the 32-bit address and a 32-bit mask, so with some
bit packing you can uniquely represent them in 64 bits without hashing.
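
Something like this, say (a rough sketch; the packing layout is arbitrary):

    #include <stdint.h>
    
    /* Pack an IPv4 CIDR entry (address + prefix length) into a single
       64-bit key: address in the high 32 bits, prefix length below. */
    static uint64_t pack_cidr4(uint32_t addr, uint8_t prefix_len) {
        return ((uint64_t)addr << 32) | prefix_len;
    }
    
    static void unpack_cidr4(uint64_t key, uint32_t *addr,
                             uint8_t *prefix_len) {
        *addr = (uint32_t)(key >> 32);
        *prefix_len = (uint8_t)(key & 0xff);
    }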

The problem you’ll run into there is doing a “contains” check on an origin IP
against a list of CIDRs, but you’ll need to do that anyway since you’re
dealing with subnets, I assume.

~~~
jsn
A 32-bit mask is _way_ too generous; you only need a 5-bit masklen. It doesn't
matter much though, since they have v6 addresses and ranges as well.

~~~
willvarfar
It would save a lot of space to just keep separate lists for each mask length,
etc.

------
throwaway_pdp09
A blocked bloom filter works on cache blocks <[https://www.tutorialspoint.com/blocked-bloom-filter>](https://www.tutorialspoint.com/blocked-bloom-filter>). It takes more memory but far fewer hits to RAM, so much better caching behaviour. It should solve your problem.
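
The core of it looks something like this (a rough sketch assuming 64-byte cache lines; the block count and the bit-slicing of the hash are made up):

    #include <stdint.h>
    
    #define NUM_BLOCKS (1UL << 20)  /* example: 1M blocks = 64 MiB */
    #define K 4                     /* probes per item, all in one block */
    
    /* One block = one 64-byte cache line, so a query touches RAM once. */
    typedef struct { uint64_t bits[8]; } block_t;
    static _Alignas(64) block_t blocks[NUM_BLOCKS];
    
    static void bbf_add(uint64_t h) {
        block_t *b = &blocks[(h >> 40) & (NUM_BLOCKS - 1)];
        for (int i = 0; i < K; i++) {
            unsigned bit = (h >> (i * 9)) & 511;  /* pick 1 of 512 bits */
            b->bits[bit >> 6] |= 1ULL << (bit & 63);
        }
    }
    
    static int bbf_maybe_contains(uint64_t h) {
        block_t *b = &blocks[(h >> 40) & (NUM_BLOCKS - 1)];
        for (int i = 0; i < K; i++) {
            unsigned bit = (h >> (i * 9)) & 511;
            if (!(b->bits[bit >> 6] & (1ULL << (bit & 63)))) return 0;
        }
        return 1;  /* all K bits set: probably present */
    }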

Edit: it will use more RAM than cuckoo filters.

~~~
aw1621107
FYI, that closing angle bracket breaks the link.

~~~
ignoramous
Working link: [https://www.tutorialspoint.com/blocked-bloom-filter](https://www.tutorialspoint.com/blocked-bloom-filter)

------
Hitton
> time (cat logs.txt | sort | uniq > /dev/null)

If I skip the two useless pipes and use "sort -u logs.txt > /dev/null"
instead, I'm already twice as fast as the original (it seems that piping into
sort effectively prevents parallelization).

~~~
repsilat
For me sort|uniq is _nine_ times slower than

      LANG=C sort -u

but `sort -u` on its own is only marginally faster than sort|uniq.

I can't remember exactly what LANG=C does, but I think it makes sort not need
to do some fancy Unicode stuff? If the person writing the article just needs
to uniqify IP addresses, they should use it.

~~~
majke
Oh boy. This is drastic. Thanks for the great advice:

      marek:~$ time (cat logs-popcount-org.txt | sort -u | wc -l)
      39057531
    
      real 2m37.387s
      user 2m35.626s
      sys 0m2.937s
    
      marek:~$ time (cat logs-popcount-org.txt | LANG=C sort -u -S6G | wc -l)
      39057531
      
      real 0m12.908s
      user 0m42.826s
      sys 0m3.586s

------
cube2222
Even though I agree with another commenter that it's surprising the author
used a bloom filter instead of a hash map as the baseline, this article is
still an excellent small walkthrough of quick, ad-hoc, low-level performance
profiling.

------
727374
I went down the Bloom filter rabbit hole for a project a while back and then
wondered if I could actually fit the entire set space in memory. The set was
IDs up to 2^32, so basically a giant bit array. I believe the IPv4 universe of
this article is actually the same size as in my project. I coded up my project
using Java bit arrays and got things working decently well, using a big heap.
Then I found out there are a bunch of _compressed_ bit array libraries such as
EWAH and Roaring Bitmap. When I substituted Roaring for the stock Java BitSet
implementation, I saw space and computation improve by many orders of
magnitude. Roaring uses a bunch of tricks to achieve this, but it mostly comes
down to encoding large runs of 1s or 0s. Obviously it's not as tiny as a bloom
filter, but still pretty small for most modern machines if you have a sparse
set. [https://www.roaringbitmap.org/](https://www.roaringbitmap.org/)
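
For anyone who wants the same thing from C, the CRoaring library exposes roughly this (a small sketch from memory; check the library docs for the exact API):

    #include <roaring/roaring.h>  /* CRoaring */
    #include <stdint.h>
    #include <stdio.h>
    
    int main(void) {
        roaring_bitmap_t *seen = roaring_bitmap_create();
    
        /* Dedup a stream of IPv4 addresses as 32-bit integers. */
        uint32_t ips[] = { 0x01020304, 0x01020304, 0xc0a80001 };
        for (size_t i = 0; i < sizeof(ips) / sizeof(ips[0]); i++) {
            if (!roaring_bitmap_contains(seen, ips[i])) {
                roaring_bitmap_add(seen, ips[i]);
                printf("unique: %u\n", (unsigned)ips[i]);
            }
        }
    
        printf("cardinality: %llu\n",
               (unsigned long long)roaring_bitmap_get_cardinality(seen));
        roaring_bitmap_free(seen);
        return 0;
    }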

~~~
deepsun
And it's not probabilistic; it gives an exact answer as to whether an element
is in the set or not.

------
jessermeyer
Bloom filters are ideal candidates for answering the question 'Is x _not_ in
the set?'. If the answer is yes, nothing, including the item you're asking
about, ever hashed to that location. If the answer is no, all you know is that
something hashed there, which may or may not be your item.
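
As a toy sketch of those two answers (two derived hash mixes standing in for k independent hash functions; sizes are arbitrary):

    #include <stdint.h>
    
    #define FILTER_BITS (1UL << 24)  /* toy size: 2 MiB of bits */
    static uint64_t filter[FILTER_BITS / 64];
    
    static uint64_t h1(uint64_t x) { return x * 0x9e3779b97f4a7c15ULL; }
    static uint64_t h2(uint64_t x) {
        return (x ^ (x >> 33)) * 0xff51afd7ed558ccdULL;
    }
    
    static void bloom_add(uint64_t x) {
        uint64_t a = h1(x) % FILTER_BITS, b = h2(x) % FILTER_BITS;
        filter[a >> 6] |= 1ULL << (a & 63);
        filter[b >> 6] |= 1ULL << (b & 63);
    }
    
    /* 0 means "definitely not in the set". 1 only means "something
       hashed to all these positions", possibly your item, possibly not. */
    static int bloom_maybe_contains(uint64_t x) {
        uint64_t a = h1(x) % FILTER_BITS, b = h2(x) % FILTER_BITS;
        return (filter[a >> 6] >> (a & 63)) &
               (filter[b >> 6] >> (b & 63)) & 1;
    }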

~~~
jiofih
Which works perfectly for this case: if the answer is yes, add it to output,
otherwise skip.

~~~
reitzensteinm
It's not that simple. If the correctness of your program relies on the "not in
set" test being accurate, you're going to need to make the filter huge, and
slow.

Probabilistic data structures are about trading off correctness and
performance. If you try to push the correctness up to near perfect, they'll
quickly stop making sense and you should just use an actual perfect algorithm
instead, as the author did.

Bloom filters are great for early outs, where you can save a chunk of
computation on a definite negative, but still be correct in case of false
positive.

~~~
jiofih
Not sure I follow. For detecting uniques all you care about is the definite
negatives? Either way, the author addresses precision in his post.

~~~
reitzensteinm
False positives mean you're throwing away genuinely unique items that hash the
same. If you use a 256 bit Bloom filter to process a billion items, you'll get
at most 256 results that are indeed unique.

The rate of false positives you require out of the data structure is key. If
your program is correct with a 20% false positive rate, you're golden. If the
goal is more or less 0, look elsewhere.

The author addresses precision, but not in a way that questions whether a
Bloom filter is indeed the right tool for the job.

~~~
jiofih
But you don’t care about false positives, only the definite negatives! I was
trying to point out that the top comment questioning the utility of a bloom
filter is addressed in the post.

~~~
reitzensteinm
If you don't care about the false positives, you don't need a Bloom filter.
Just reject everything.

------
kevingadd
I'm not sure why the author started with bloom filters instead of a hash
table, to be honest. The workload seems ideal for a hash table. It's
interesting to see how big the performance gap was, though; I wouldn't have
expected such a difference. It probably comes down to the fact that the bloom
filter has to spread its data across many locations, so if the set is
particularly large it's always going to lose to a hash table by hitting main
memory more times (in scenarios where you can use either one, at least).

I suppose they started with a bloom filter because they intended to use one
when consuming the data, i.e. checking incoming requests against a 'malicious
IP' set?

~~~
stefan_
I don't know why the author didn't stick to their shell script, which took all
of 2 minutes for a manual data cleanup step. Like, go grab a coffee.

~~~
majke
2 minutes was for 40M items. I had 1B items to sort/uniq.

------
mike_d
> For example, source IPs belonging to a legitimate Italian ISP should not
> arrive in a Brazilian datacenter.

This is an assumption based on a very naive understanding of how packets get
delivered on the Internet. I for one wouldn't enjoy being blocked from
CloudFlare sites just because of poor routing or peering.

[https://en.wikipedia.org/wiki/Hot-potato_and_cold-potato_rou...](https://en.wikipedia.org/wiki/Hot-potato_and_cold-potato_routing)

~~~
jdlshore
Do you understand how rude you're being? This is an article about Bloom
filters, not Internet routing.

------
ncmncm
Why not just use an array of 2^32 bits -- a half gigabyte -- and leave off
hashing altogether?

All it would cost is the excess runtime, which we should not mind giving up
unless we smoke.

If necessary, you could have two or more. 256 of them would fit in 128G, which
lots of servers have without even needing it all.

~~~
kevin_thibedeau
2^32 == 4GiB

~~~
miloignis
2^32 bytes is 4 GiB, but 2^32 bits is 512 MiB. In the GP's hypothetical array,
I believe you would test a bit with ((array[idx >> 3] >> (idx & 7)) & 1) != 0.
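
Spelled out as a sketch (calloc zero-fills, so every address starts "unseen"):

    #include <stdint.h>
    #include <stdlib.h>
    
    /* One bit per possible IPv4 address: 2^32 bits = 512 MiB. */
    static uint8_t *seen;
    
    /* Returns whether ip was already marked, and marks it. */
    static int test_and_set(uint32_t ip) {
        uint8_t mask = 1u << (ip & 7);
        int was_set = (seen[ip >> 3] & mask) != 0;
        seen[ip >> 3] |= mask;
        return was_set;
    }
    
    int main(void) {
        seen = calloc(1UL << 29, 1);  /* 2^32 bits / 8 = 2^29 bytes */
        if (!seen) return 1;
        /* for each parsed IPv4 address ip:
           if (!test_and_set(ip)) it has not been seen before */
        free(seen);
        return 0;
    }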

------
Freaky
> Check out this excellent visualization by Thomas Hurst showing how
> parameters influence each other:

> [https://hur.st/bloomfilter/](https://hur.st/bloomfilter/)

 _Blush_. However will I cope with all this extra traffic?

One common way to improve bloom filter cache performance is to divide the
filter into blocks. Splitting at cache-line granularity is mentioned elsewhere
in the thread, but it would be interesting to see how much performance a more
naive approach would gain, for instance splitting the filter into 4KiB pages.

I've done this for disk-backed filters, but never looked to see if it improved
performance generally.

------
heartbeats
Another solution is to change the order things are checked in. With optimum k,
each hash takes about 1/2 of the possibilities out. If he batched it so the
memory accesses are roughly ordered, it would be much faster.

EDIT:

So, for instance, if k = 19, that means there are 19 hash functions. He goes
through all the inputs and checks: do any of this input's hashes fall within
the first 1/19 of the memory space? If not, keep it. If so, check whether any
of those bits are unset. If not, keep it. If they are, zero the pointer to the
input in the array. After this is done, he should be rid of roughly 32% (0.5 *
(1-(18/19)^19)) of candidates. The second pass throws out 33% of candidates,
and so on.

He could even keep an absurdly large value for _k_ and _n_: if it is 128G,
then k = 19'053, meaning he can use an even finer increment. He'd have to
spill the filter to disk, but the access patterns would be great.

------
Hello71
Seems like it would be worth comparing to the old `awk '!a[$0] { a[$0]=1;
print }'`. I would assume that such arrays are implemented internally using
hash tables. Probably not as efficient as a C implementation, but the used
parts of the interpreter should fit in I-cache, so it should be within a few
times as fast.

~~~
majke
Here you go:

      marek:~$ time (cat logs-popcount-org.txt | awk '!a[$0] {a[$0]=1; print }'|wc -l)
      39057531
    
      real 0m41.236s
      user 0m38.179s
      sys 0m5.447s
    

So: sort: 2m, awk 41 seconds. Also, awk used 6.1G of RAM at peak.

~~~
Hello71
Yeah, a generic, probably pointer-heavy hash table is definitely gonna be
worse on memory. I'm surprised that it's _that_ much worse on time though; I
expected it to be closer. I guess the cache misses are probably worse with
such a large table.

~~~
majke
Ok, I'll bite again:

      marek:~$ cat logs-popcount-org.txt | perf stat -d awk '!a[$0] { a[$0]=1; print }' > /dev/null 
      
       Performance counter stats for 'awk !a[$0] { a[$0]=1; print }':
    
               40,318.47 msec task-clock:u             
                       0      context-switches:u       
                       0      cpu-migrations:u         
               1,670,649      page-faults:u            
         112,979,634,215      cycles:u                 
          93,441,976,758      instructions:u           
          18,990,099,679      branches:u               
             208,386,137      branch-misses:u          
          26,093,832,363      L1-dcache-loads:u        
             708,880,979      L1-dcache-load-misses:u  
             464,332,790      LLC-loads:u              
             245,913,835      LLC-load-misses:u        
    
            40.337768657 seconds time elapsed
      
            36.851718000 seconds user
             3.468126000 seconds sys
    
    

Compare this to the optimized approach, which has 57M LLC-load-misses and 7M
instructions.

~~~
ncmncm
I would welcome seeing a comparison in your environment to using the simple
1/2 GB array of bits, with no hashing or storage of IP addresses. (Extra
points for hugetlb mapping.)

------
alecco
The latency problem must be due to (unnecessary data) dependencies and not
taking advantage of the superscalar architecture. Modern CPUs support 32 or
more in-flight memory operations.

After hiding latency, the next bottleneck would be instruction throughput, so
vector scatter/gather can alleviate this problem.

~~~
BeeOnRope
32 per core? Most modern Intel only supports 10 or 12 outstanding (line)
accesses per core. AMD is a bit better. Ice Lake is significantly better.

~~~
alecco
I think it's per memory controller. Probably per socket.

------
aratauto
On a somewhat unrelated note, if you want to handle very large Bloom filters
(billions of entries with low false positive rates), there is an open source
Java library that can help you do that: [https://github.com/nixer-io/nixer-spring-plugin/tree/master/...](https://github.com/nixer-io/nixer-spring-plugin/tree/master/bloom-tool).

There is also a command line utility that accompanies the library:
[https://github.com/nixer-io/nixer-spring-plugin/tree/master/...](https://github.com/nixer-io/nixer-spring-plugin/tree/master/bloom-tool).

------
_nhynes
Another potentially useful data structure, with good asymptotic complexity but
probably also poor cache locality, is the Van Emde Boas tree [0]. I've never
seen one used in practice, but they sure make for excellent p-set problems!

[0]
[https://en.wikipedia.org/wiki/Van_Emde_Boas_tree](https://en.wikipedia.org/wiki/Van_Emde_Boas_tree)

------
thomashusa
There is a very interesting talk [1] by Chandler Carruth that I stumbled upon
last weekend, and it is very much related to this:

[1]
[https://www.youtube.com/watch?v=nXaxk27zwlk&t=4678s](https://www.youtube.com/watch?v=nXaxk27zwlk&t=4678s)

------
mark-r
The generic sort command is easy, but in this case maybe a custom radix sort
would have been faster?

------
dpc_pw
Relevant, maybe: [https://medium.com/adobetech/filtering-duplicates-on-the-com...](https://medium.com/adobetech/filtering-duplicates-on-the-command-line-30x-faster-than-sort-uniq-96ca5f7b4277)

------
nraynaud
I am surprised; isn't Cloudflare a Go shop?

~~~
jgrahamc
It would be ridiculous to do everything in one language. Different tools for
different situations. We use Go, Rust, C++, Python, Lua, JavaScript, ...

