
Age-Partitioned Bloom Filters - ifcologne
https://arxiv.org/abs/2001.03147
======
jedberg
This is really great!

For those that don't know, bloom filters are really good at determining when
something is not in a set. So if says "not in set" that's 100% accurate and
super fast.

We used this to our advantage at reddit. When you load a comments page, we had
to determine if you've voted on anything. So with the bloom filter in front,
the query for "have they voted on anything here" was very quick if the answer
was no.

But we still had to make the query. So another hack we did was to put a value
on your user object to track the last time you interacted with the website. If
the comments page was made after your last interaction, then we didn't even
have to do the query.

With this, we wouldn't have to do the two step process -- it sounds like this
would be super fast when it was time boxed. Maybe...

~~~
ignoramous
BloomFilters make for good indices, too. My first encounter with them was from
the BigTable paper.

We used it in front of a Radix Tree that at places had too much depth. The
filter took something like 2MiB at 7% false-positive rate vs 100+ MiB for the
Radix Tree. We later bitmap-compressed the Radix Tree [0] to under 5MiB at the
cost of making the lookups very very slow (P99 at 4ms, P50 at 1ms compared to
P99 at 400ns before, iirc) but added a TinyLFU-based [1] lean-HAMT [2] cache
of 512KiB in the front to speed lookups of popular items and a Robinhood cache
[3] of 256KiB to take care of the long-tail KVs.

[0]
[https://news.ycombinator.com/item?id=2348619](https://news.ycombinator.com/item?id=2348619)

[1] [http://highscalability.com/blog/2016/1/25/design-of-a-
modern...](http://highscalability.com/blog/2016/1/25/design-of-a-modern-
cache.html)

[2]
[https://blog.acolyer.org/2015/11/27/hamt](https://blog.acolyer.org/2015/11/27/hamt)

[3] [https://blog.acolyer.org/2018/10/26/robinhood-tail-
latency-a...](https://blog.acolyer.org/2018/10/26/robinhood-tail-latency-
aware-caching-dynamic-reallocation-from-cache-rich-to-cache-poor/)

------
rolleiflex
Aether ([https://getaether.net](https://getaether.net)) also uses a roughly
equivalent version of this I had thought, up to this point, that I
‘discovered’. I call it ‘Rolling Bloom’. (Not sure which implementation is
first.)

Here’s the Go implementation:
[https://github.com/nehbit/aether/blob/master/aether-
core/aet...](https://github.com/nehbit/aether/blob/master/aether-
core/aether/services/rollingbloom/rollingbloom.go)

~~~
CarlosBaquero
Thanks for the pointer. I took a look at the code and it’s different. Looks
like Aether maintains several bloom filters, each link to a given time range.
In contrast the paper builds a different kind of bloom filter where hash
functions can overlap.

~~~
rolleiflex
Interesting, thanks for taking a look. That is correct. Aether also does some
partial merging in the cases where the time range required does not exactly
match the filter sequence.

------
microcolonel
I've used something like this before, thank you for characterizing it. :- )

I used mine for authentication rejection.

~~~
hinkley
Do you suppose there's a timing attack where someone can figure out if any
particular users (target user, or admins) are currently online and get up to
nonsense if they aren't?

Not sure if your app is big enough to care, but it might be you want to
normalize the response times, so server load is reduced but auth time is
fixed.

~~~
microcolonel
It was a general rejection system, so all valid (and recently invalidated)
tokens were in the same bloom. If we rejected your token, your connection was
cut immediately without response. The normal mechanisms for limiting short-
lived connections from a given address apply, so your ability to sample for
that would be severely limited, and chances are you wouldn't even have a well-
formed token to sample with anyway (since the tokens were authenticated).

I don't work on that anymore though. I might do something similar in the
future, but being able to detect when admins are snoozing is low on my list of
concerns; if it's designed properly, it'll be cheaper to hire PIs to just look
and see when they're snoozing.

------
stupidcar
Missed opportunity to title this paper "OK Bloomer".

------
dragontamer
An interesting hack is to use a count-based bloom filter, maybe 8-bits per
count instead of 1-bit.

As long as the "count" doesn't overflow to the max size (in this case: 255),
you can always "remove" items from a Counting Bloom Filter. In effect, a
traditional 1-bit bloom filter is simply a counting-bloom filter that
"overflows" on the first hit.

If people really need deletion on their bloom filter, do realize it is
possible! (as long as you select a proper bit-size).

