Age-Partitioned Bloom Filters

jedberg · on Jan 10, 2020

This is really great!

For those that don't know, bloom filters are really good at determining when something is not in a set. So if says "not in set" that's 100% accurate and super fast.

We used this to our advantage at reddit. When you load a comments page, we had to determine if you've voted on anything. So with the bloom filter in front, the query for "have they voted on anything here" was very quick if the answer was no.

But we still had to make the query. So another hack we did was to put a value on your user object to track the last time you interacted with the website. If the comments page was made after your last interaction, then we didn't even have to do the query.

With this, we wouldn't have to do the two step process -- it sounds like this would be super fast when it was time boxed. Maybe...

ignoramous · on Jan 11, 2020

BloomFilters make for good indices, too. My first encounter with them was from the BigTable paper.

We used it in front of a Radix Tree that at places had too much depth. The filter took something like 2MiB at 7% false-positive rate vs 100+ MiB for the Radix Tree. We later bitmap-compressed the Radix Tree [0] to under 5MiB at the cost of making the lookups very very slow (P99 at 4ms, P50 at 1ms compared to P99 at 400ns before, iirc) but added a TinyLFU-based [1] lean-HAMT [2] cache of 512KiB in the front to speed lookups of popular items and a Robinhood cache [3] of 256KiB to take care of the long-tail KVs.

[0] https://news.ycombinator.com/item?id=2348619

[1] http://highscalability.com/blog/2016/1/25/design-of-a-modern...

[2] https://blog.acolyer.org/2015/11/27/hamt

[3] https://blog.acolyer.org/2018/10/26/robinhood-tail-latency-a...

elfexec · on Jan 11, 2020

> For those that don't know, bloom filters are really good at determining when something is not in a set. So if says "not in set" that's 100% accurate and super fast.

Agreed. It allows for false positives, but disallows false negatives.

rolleiflex · on Jan 10, 2020

Aether (https://getaether.net) also uses a roughly equivalent version of this I had thought, up to this point, that I ‘discovered’. I call it ‘Rolling Bloom’. (Not sure which implementation is first.)

Here’s the Go implementation: https://github.com/nehbit/aether/blob/master/aether-core/aet...

CarlosBaquero · on Jan 10, 2020

Thanks for the pointer. I took a look at the code and it’s different. Looks like Aether maintains several bloom filters, each link to a given time range. In contrast the paper builds a different kind of bloom filter where hash functions can overlap.

rolleiflex · on Jan 10, 2020

Interesting, thanks for taking a look. That is correct. Aether also does some partial merging in the cases where the time range required does not exactly match the filter sequence.

microcolonel · on Jan 10, 2020

I've used something like this before, thank you for characterizing it. :- )

I used mine for authentication rejection.

hinkley · on Jan 10, 2020

Do you suppose there's a timing attack where someone can figure out if any particular users (target user, or admins) are currently online and get up to nonsense if they aren't?

Not sure if your app is big enough to care, but it might be you want to normalize the response times, so server load is reduced but auth time is fixed.

microcolonel · on Jan 10, 2020

It was a general rejection system, so all valid (and recently invalidated) tokens were in the same bloom. If we rejected your token, your connection was cut immediately without response. The normal mechanisms for limiting short-lived connections from a given address apply, so your ability to sample for that would be severely limited, and chances are you wouldn't even have a well-formed token to sample with anyway (since the tokens were authenticated).

I don't work on that anymore though. I might do something similar in the future, but being able to detect when admins are snoozing is low on my list of concerns; if it's designed properly, it'll be cheaper to hire PIs to just look and see when they're snoozing.

stupidcar · on Jan 11, 2020

Missed opportunity to title this paper "OK Bloomer".

dragontamer · on Jan 11, 2020

An interesting hack is to use a count-based bloom filter, maybe 8-bits per count instead of 1-bit.

As long as the "count" doesn't overflow to the max size (in this case: 255), you can always "remove" items from a Counting Bloom Filter. In effect, a traditional 1-bit bloom filter is simply a counting-bloom filter that "overflows" on the first hit.

If people really need deletion on their bloom filter, do realize it is possible! (as long as you select a proper bit-size).