
Scalable Bloom Filters (2007) [pdf] - setra
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.62.7953&rep=rep1&type=pdf
======
signa11
this is a kind of tangential comment/rant:

but to me it seems that research papers _must_ be, for lack of a better term,
_runnable_. i, and hopefully others as well, would like to replicate all
these wonderful results that are advertised in these papers. without that,
they are just advertisements of scholarship rather than scholarship
itself. a set of instructions + environment which generated these figures
would be very welcome.

<end-rant>

on the subject of bloom filters, have a look at this one:
[https://www3.cs.stonybrook.edu/~ppandey/files/p775-pandey.pd...](https://www3.cs.stonybrook.edu/~ppandey/files/p775-pandey.pdf)
(A General-Purpose Counting Filter: Making Every Bit Count)

~~~
mceachen
Agreed. The first time this was on HN, I wanted to understand both types of
bloom filters, so I wrote
[https://github.com/mceachen/bloomer](https://github.com/mceachen/bloomer).

------
opencoff
I used this paper for building a scalable bloom filter for use in an ad-tech
stack. The performance was better than DJB's CDB.

[https://github.com/opencoff/portable-lib](https://github.com/opencoff/portable-lib)

The bloom filter code is in src/bloom.c; the header file is in
inc/utils/bloom.h.

I implemented a serialization/deserialization of the bloom filters as well
(src/bloom_marshal.c).

The tests are in test/t_bloom.c.

------
esturk
Can someone help me understand the query part?

It says that a query is done on each BF, even on the ones that were added
after the initial storage. So suppose we have only 2 iterations. In the first
BF, there's k0 hash functions and in the 2nd (iteration) BF, there's now k1
hash functions.

So naturally, an item is stored using the k0 hash functions. But in order to
query, I run against k1 hash functions which is a larger set. If any one of
the k1-k0 extra hash functions returns 0, won't that be a false negative?

~~~
cmurphycode
Yes, you are right - you have to query each filter.

To keep the contract of the bloom filter, a "no" can only come if ALL filters
return no. Each filter is probed with its own hash functions, so a 0 from the
second filter only means the item was not added to that filter. So if one of
the filters returns 0 and the other returns 1, the answer is maybe (i.e. a yes
with some false positive probability). If we instead answered no, it would be
a false negative as you stated, so we can't do that. This paper doesn't defeat
this property of bloom filters.
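
A minimal Python sketch of that query rule, in case it helps. Everything here is illustrative rather than the paper's exact construction: the class names, the SHA-256 double hashing, the default parameters, and the use of a plain (unpartitioned) bit array are all assumptions. It also assumes each new filter gets one more hash function than the previous one, which corresponds to halving the per-filter error target.

```python
import hashlib
import math


class BloomFilter:
    """Plain fixed-size Bloom filter with its own number of hash functions."""

    def __init__(self, capacity, num_hashes):
        self.capacity = capacity
        self.num_hashes = num_hashes
        # Size the bit array so it is roughly half full at capacity.
        self.num_bits = math.ceil(capacity * num_hashes / math.log(2))
        self.bits = bytearray((self.num_bits + 7) // 8)
        self.count = 0

    def _positions(self, item):
        # Derive the bit positions by double hashing (an illustrative choice).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)
        self.count += 1

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


class ScalableBloomFilter:
    """A growing list of Bloom filters; only the newest one accepts inserts."""

    def __init__(self, capacity=100, k0=4, growth=2):
        self.capacity = capacity
        self.k0 = k0
        self.growth = growth
        self.filters = [self._new_filter(0)]

    def _new_filter(self, index):
        # Later filters are larger and use more hash functions (k0 + index).
        return BloomFilter(self.capacity * self.growth ** index, self.k0 + index)

    def add(self, item):
        current = self.filters[-1]
        if current.count >= current.capacity:
            current = self._new_filter(len(self.filters))
            self.filters.append(current)
        current.add(item)

    def __contains__(self, item):
        # "No" only if every filter says no; any single "yes" means "maybe".
        # Each filter checks the item with its own hash functions.
        return any(item in f for f in self.filters)


sbf = ScalableBloomFilter()
for i in range(500):
    sbf.add(f"item-{i}")
```

An item that was inserted while the first filter was current is found by the first filter's own k0 hash functions; the later filters simply report "not here", which does not turn the overall answer into a false negative.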

The cool insight in this paper is that how you choose the new filter size
allows for a relatively nice tradeoff of "wasted" size, and a target for the
effective false positive ratio, even in the face of growth. We are increasing
the probability for false positives, but depending on how you pick the sizes,
you can do better than if you simply allocated another bloom of the same size,
forever and ever.

(When I say "wasted" size, I mean the extra bits you need to get a certain
false positive ratio, when you compare it to a properly sized filter from the
get-go. In essence, you're paying some overhead when you get to a certain
size. In exchange, you do not need to have guessed the size correctly /
allocated all that memory from the get-go.)
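
To put rough numbers on that tradeoff, here is a small Python sketch. It assumes, as I understand the paper, that filter i is sized for a false positive probability of p0 * r**i with a tightening ratio 0 < r < 1, and that the filters err independently; the function names and the sample values are mine.

```python
def compound_false_positive(p0, r, stages):
    """Chance that at least one of `stages` filters gives a false positive,
    when filter i has false positive probability p0 * r**i."""
    all_say_no = 1.0
    for i in range(stages):
        all_say_no *= 1.0 - p0 * r ** i
    return 1.0 - all_say_no


def geometric_bound(p0, r):
    """Upper bound on the compound probability for any number of stages:
    the sum of the geometric series p0 * r**i."""
    return p0 / (1.0 - r)


# Tightening each new filter (r = 0.5) keeps the total under the bound.
tightened = compound_false_positive(0.01, 0.5, 20)

# Reusing the same error target forever (r = 1.0) keeps accumulating error.
same_target = compound_false_positive(0.01, 1.0, 20)

print(f"tightened:   {tightened:.4f}")
print(f"same target: {same_target:.4f}")
print(f"bound:       {geometric_bound(0.01, 0.5):.4f}")
```

The tighter targets for later filters are where the "wasted" bits go: each new filter spends more bits per item than the previous one so that the sum stays bounded.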

------
bistro17
implementation of this paper -
[https://metacpan.org/pod/Bloom::Scalable](https://metacpan.org/pod/Bloom::Scalable)

------
stochastic_monk
This should also be marked with a year. A cursory google search has
StackOverflow answers from 2013.

A lack of a year label implicitly suggests that it’s new, e.g. Hacker News.
Please add a tag with the appropriate year.

~~~
grzm
Looks like it's from 2007:

[https://www.sciencedirect.com/science/article/pii/S002001900...](https://www.sciencedirect.com/science/article/pii/S0020019006003127)

~~~
dang
OK, added. Thanks!

------
howitworks
This should be labeled as a PDF

~~~
dang
Yes. Added.

------
bhouston
Hacker News loves upvoting articles about bloom filters (and Bayesian
probability - it's on the front page again this evening.) Personally I've never
found a use for either of them in practice.

~~~
cryptonector
They have _one_ use: as a fast check for whether you should bother with a
slower check. Bloom filters are basically a sort of a pre-cache.
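
A rough Python sketch of that pre-check pattern. `TinyBloom` and `GuardedStore` are hypothetical names, the dict stands in for whatever slow lookup (disk, network) you are guarding, and the sizes are arbitrary.

```python
import hashlib


class TinyBloom:
    """Very small Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class GuardedStore:
    """Hypothetical slow key-value store fronted by a Bloom filter."""

    def __init__(self):
        self.slow_store = {}   # stand-in for a disk or network lookup
        self.bloom = TinyBloom()
        self.slow_lookups = 0

    def put(self, key, value):
        self.slow_store[key] = value
        self.bloom.add(key)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None        # definite miss: skip the slow path entirely
        self.slow_lookups += 1  # "maybe": pay for the real check
        return self.slow_store.get(key)


store = GuardedStore()
store.put("alice", 1)
hit = store.get("alice")
miss = store.get("nobody")
```

A false positive from the filter only costs one unnecessary slow lookup; the answer returned to the caller is still correct.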

~~~
FreakLegion
They actually have other uses, like measuring file similarity in sdhash and
dimensionality reduction in machine learning.

~~~
cryptonector
Interesting. Thanks!

