
Flow Analysis & Time-based Bloom Filters - zerop
http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/
======
T_S_
tl;dr A Bloom filter implements the predicate isMemberOf efficiently--at the
cost of some false positives. Its performance degrades when the set is large.
Solution: index your set by time, implementing a Bloom filter for each epoch.
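
For readers new to the structure, here is a minimal sketch of the isMemberOf
predicate described above. It is not the article's code: the class name, the
sizes, and the SHA-256 double-hashing scheme are all assumptions chosen for
illustration.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: m bits, k bit positions per item."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item: str) -> list:
        # Assumption: derive k positions from two halves of one SHA-256
        # digest (double hashing). Any k independent hashes would do.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bf = BloomFilter(m=1024, k=4)
bf.add("10.0.0.1")
print("10.0.0.1" in bf)  # an added item is always reported as present
```

The more items share the same m bits, the more bits are set and the more
often an item that was never added is reported as present.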

------
MichaelGG
This is a neat idea. However, the Redis backend seems rather pointless. If I'm
understanding the code correctly, each bit of the filter is implemented as an
item in Redis. From the comments[1] the overhead is ~130 bytes. So each bit
ends up needing 1000 bits? At that point, it'd even be more efficient to
just use a hashtable with the SHA256 hash of the item as the key and storing a
count+expiry time...

[1]: In the comments: "The simple backend I added treats each key as an
individual bucket, and based on my rough testing, Redis added about ~130 bytes
of overhead per key - which, of course, is a huge difference compared to the
bit vector."
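
A rough sketch of the arithmetic and of the hashtable alternative suggested
above. The ~130 byte figure is the one quoted from the article's comments;
the ExpiringSet class and its method names are hypothetical, and an
in-memory dict stands in for whatever store would actually be used.

```python
import hashlib
import time

PER_KEY_OVERHEAD_BYTES = 130  # figure quoted in the comment above
bits_per_bucket = PER_KEY_OVERHEAD_BYTES * 8     # 1040 bits to store 1 bit
sha256_bits = hashlib.sha256().digest_size * 8   # 256 bits per full digest


class ExpiringSet:
    """Hypothetical exact alternative: SHA-256 digest -> [count, expiry]."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic) -> None:
        self.ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # digest -> [count, expires_at]

    def add(self, item: str) -> None:
        key = hashlib.sha256(item.encode("utf-8")).digest()
        now = self._clock()
        entry = self._entries.get(key)
        if entry is None or entry[1] <= now:
            # New or expired entry: start counting from one.
            self._entries[key] = [1, now + self.ttl]
        else:
            entry[0] += 1
            entry[1] = now + self.ttl  # each sighting refreshes the expiry

    def count(self, item: str) -> int:
        key = hashlib.sha256(item.encode("utf-8")).digest()
        entry = self._entries.get(key)
        if entry is None or entry[1] <= self._clock():
            self._entries.pop(key, None)  # drop expired entries lazily
            return 0
        return entry[0]
```

Under these assumptions a single one-bit bucket costs about four times as
many bits as a whole SHA-256 digest, and a filter with k hash functions
touches k such buckets per item.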

~~~
igrigorik
New version of bloomfilter.rb uses SETBIT/GETBIT:
[https://github.com/igrigorik/bloomfilter-rb/blob/master/lib/...](https://github.com/igrigorik/bloomfilter-rb/blob/master/lib/bloomfilter/redis.rb)

The "bit" per key was prior to setbit/getbit functionality. The new
implementation is, in fact, very efficient.
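
To illustrate why bit addressing is so much cheaper than one key per bit,
here is a pure-Python stand-in for a single Redis string addressed with
SETBIT/GETBIT semantics. It is not the bloomfilter-rb code; the BitString
class is hypothetical and only mimics the commands' documented behaviour
(offset 0 is the most significant bit of the first byte, the string grows
with zero padding, and SETBIT returns the previous bit).

```python
class BitString:
    """In-memory imitation of one Redis string used as a bit vector."""

    def __init__(self) -> None:
        self.data = bytearray()

    def setbit(self, offset: int, value: int) -> int:
        byte_index, bit_index = divmod(offset, 8)
        if byte_index >= len(self.data):
            # Grow with zero bytes, as Redis does for out-of-range offsets.
            self.data.extend(bytes(byte_index + 1 - len(self.data)))
        mask = 0x80 >> bit_index  # bit 0 is the most significant bit
        previous = 1 if self.data[byte_index] & mask else 0
        if value:
            self.data[byte_index] |= mask
        else:
            self.data[byte_index] &= ~mask & 0xFF
        return previous

    def getbit(self, offset: int) -> int:
        byte_index, bit_index = divmod(offset, 8)
        if byte_index >= len(self.data):
            return 0  # bits past the end read as zero
        return 1 if self.data[byte_index] & (0x80 >> bit_index) else 0


bits = BitString()
bits.setbit(999, 1)
print(len(bits.data))  # 125 bytes hold all 1000 bit positions
```

The whole filter lives under one key, so the per-key overhead is paid once
rather than once per bit.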

~~~
MichaelGG
But not for the one that supports expiry:

[https://github.com/igrigorik/bloomfilter-rb/blob/master/lib/...](https://github.com/igrigorik/bloomfilter-rb/blob/master/lib/bloomfilter/counting_redis.rb#L55)

The point there is to use Redis's expiration feature, which obviously does
not apply to individual bits.

------
snikolic
First, it seems like unnecessary overhead - memory, and potentially network -
to store a Bloomfilter in Redis. A modern machine can pretty easily fit a BF
containing hundreds of millions of items with a very, very low false positive
rate into memory. Why get Redis involved?
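
As a back-of-the-envelope check of that claim, the textbook sizing formulas
for a Bloom filter are m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2
hash functions. The item count and false positive rate below are example
values picked for illustration, not figures from the article.

```python
import math


def bloom_parameters(n_items: int, false_positive_rate: float):
    """Return (bits, hash_count) from the standard sizing formulas."""
    m = math.ceil(
        -n_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    k = max(1, round(m / n_items * math.log(2)))
    return m, k


# Assumed example: 200 million items at a 0.1% false positive rate.
n = 200_000_000
m, k = bloom_parameters(n, 0.001)
print(f"{m / n:.1f} bits per item, {k} hashes, {m / 8 / 2**20:.0f} MiB")
```

For these example values the result is roughly 14.4 bits per item, which
puts the whole filter at a few hundred megabytes.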

Second, an alternative solution might be using multiple, smaller Bloomfilters,
and dumping the oldest BF shard whenever the most recent one fills up. This
wouldn't be appropriate for all use cases, but it's simpler, and potentially
superior in some cases.
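
A minimal sketch of that shard-rotation idea, assuming a fixed number of
shards and a per-shard item capacity. The class names and parameters are
made up for illustration, and the small filter inside uses SHA-256 double
hashing purely for convenience.

```python
import hashlib
from collections import deque


class SmallBloom:
    """One shard: a small Bloom filter that also counts its insertions."""

    def __init__(self, m: int, k: int) -> None:
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)
        self.count = 0

    def _positions(self, item: str) -> list:
        d = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)
        self.count += 1

    def __contains__(self, item: str) -> bool:
        return all(
            (self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item)
        )


class RotatingBloom:
    """Keeps at most `shards` filters; the oldest is dropped on rotation."""

    def __init__(self, shards: int, capacity: int, m: int, k: int) -> None:
        self.capacity, self.m, self.k = capacity, m, k
        self.shards = deque([SmallBloom(m, k)], maxlen=shards)

    def add(self, item: str) -> None:
        if self.shards[-1].count >= self.capacity:
            # Appending to a full deque silently discards the oldest shard.
            self.shards.append(SmallBloom(self.m, self.k))
        self.shards[-1].add(item)

    def __contains__(self, item: str) -> bool:
        return any(item in shard for shard in self.shards)
```

Rotation here is driven by fill level; replacing the capacity check with a
clock check would give one shard per time epoch instead.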

~~~
spearo77
The article mentions specifically that storing the Bloomfilter in Redis or
memcached is done only to allow sharing between multiple processes.

------
gwern
> Finally, we also make use of the native expire functionality in Redis to
> guarantee that keys are only stored for a bounded amount of time.

So if I'm understanding this, the time-based aspect comes solely from the fact
that Redis/the backend is storing the timestamp of entries and also deleting
them behind the Bloom filter's back?

Well, that's a lot simpler than the strategies I was imagining.
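
A small sketch of that reading, with a dict of expiry timestamps standing in
for Redis keys that carry a TTL. This is an illustration of the idea only,
not the library's implementation; the ExpiringBloom class, its parameters,
and the injected clock are all assumptions.

```python
import hashlib
import time


class ExpiringBloom:
    """Bloom-style filter whose buckets disappear once their TTL passes."""

    def __init__(self, m: int, k: int, ttl: float, clock=time.monotonic):
        self.m, self.k, self.ttl, self.clock = m, k, ttl, clock
        self.expires_at = {}  # bucket index -> expiry timestamp

    def _buckets(self, item: str) -> list:
        d = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        # Setting a bucket also (re)starts its expiry timer.
        deadline = self.clock() + self.ttl
        for b in self._buckets(item):
            self.expires_at[b] = deadline

    def __contains__(self, item: str) -> bool:
        # A missing or expired bucket counts as unset.
        now = self.clock()
        return all(
            self.expires_at.get(b, float("-inf")) > now
            for b in self._buckets(item)
        )
```

The filter itself never removes anything; entries simply stop matching once
the backing store has expired their buckets.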

------
geoffw8
Hey - I love this site, I'm reasonably new to Ruby stuff and it gives me a
taster of things I love reading about, and things I'd like to do. Can anyone
please suggest any other similar blogs/places?

Thanks!

~~~
slig
There are a lot of gems on Ilya Grigorik's blog; start there.

~~~
geoffw8
Yeah, it's been a regular haunt of mine for a while. If you know of any more,
please share!

~~~
petercooper
Articles (including Ilya's) are frequently submitted by authors to
<http://rubyflow.com/>

~~~
geoffw8
Thanks Peter.

------
jasondavies
Cool, I might add this to bloomfilter.js if I get a chance:
<https://github.com/jasondavies/bloomfilter.js>

