One more fun thing about Bloom filters: generalized, they're not just a way of testing for set membership. They're a way of getting an upper bound on some function of a finite input set. For example, suppose I have a million names along with each person's height, and I want a data structure that will give me an upper bound on a person's height given their name. I know, this is totally contrived, but here's what you could do (a quick code sketch follows the list):
1. Make an array of height values, initially zero. Call it heights.
2. For each name, get several hash values: h1, h2, ... hn.
3. For each hash h, set heights[h] = max(heights[h], height). In other words, you're only pushing the height values upward.
4. To get an upper bound on a person's height, hash their name and look at your heights array. The upper bound is min(heights[h1], heights[h2], ... heights[hn]).
(If you want to be very technical, this works not only for numbers, but for any set that supports max() and min() operations. A Bloom filter for set membership uses such a set with two elements, 0 and 1.)
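Here's a minimal Python sketch of the structure described above. The class and method names (HeightSketch, add, upper_bound) are my own for illustration, and salted SHA-256 stands in for the n independent hash functions:

    import hashlib

    class HeightSketch:
        def __init__(self, size=1_000_000, num_hashes=3):
            self.size = size
            self.num_hashes = num_hashes
            self.heights = [0] * size  # step 1: array of height values, initially zero

        def _hashes(self, name):
            # Step 2: derive several hash values h1, h2, ... hn from the name.
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{name}".encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, name, height):
            # Step 3: only ever push the stored values upward.
            for h in self._hashes(name):
                self.heights[h] = max(self.heights[h], height)

        def upper_bound(self, name):
            # Step 4: the tightest upper bound is the minimum over the slots.
            return min(self.heights[h] for h in self._hashes(name))

    # Hypothetical usage:
    sketch = HeightSketch(size=1000)
    sketch.add("Alice", 170)
    sketch.add("Bob", 185)
    print(sketch.upper_bound("Alice"))  # at least 170; collisions can only raise it

Note that querying a name that was never added can return 0, which is the analogue of a Bloom filter saying "definitely not present"; collisions can only make the reported bound too high, never too low.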
Edited to fix my phone's auto-corrections.
A new research collaboration between UC Berkeley and MIT on probabilistic query answering. It lets users specify a trade-off between error/confidence bounds and query time.
17 TB on 100 nodes? That's a lot of nodes to hold 17 TB; on average 174 GB-ish of data each.
The speed is super impressive, but using 100 nodes makes this look more like a parallel processing achievement than "big data".
Assuming roughly 1 GB/sec of scan throughput per node: 1 GB/node/sec * 100 nodes * 2 sec = 200 GB scanned.
17 TB is 85 times that number.
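Spelled out (the ~1 GB/sec per-node scan rate is an assumption about typical disk throughput, not a number from the article):

    # Back-of-the-envelope check of how much data 100 nodes can scan in 2 seconds.
    scan_gb_per_node_per_sec = 1      # assumed sequential-read throughput per node
    nodes = 100
    seconds = 2
    scanned_gb = scan_gb_per_node_per_sec * nodes * seconds  # 200 GB
    total_gb = 17 * 1000                                     # 17 TB
    print(total_gb / scanned_gb)                             # 85.0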