
Probabilistic Data Structures For Web Analytics And Data Mining - aespinoza
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
======
pjscott
This is probably the most readable article I've seen on the topic. If the
author is reading this: your article is good and you should feel good.

One more fun thing about Bloom filters: generalized, they're not just a way of
testing for set membership. They're a way of getting an upper bound on some
function of a finite input set. For example, suppose I have a set of a
million names, along with how tall each person is, and I want a data structure
that will give me an upper bound on a person's height given their name. I
know this is totally contrived, but here's what you could do:

1. Make an array of height values, initially zero. Call it heights.

2. For each name, get several hash values: h1, h2, ... hn.

3. For each hash h, set heights[h] = max(heights[h], height). In other words,
you're only pushing the height values upward.

4. To get an upper bound on a person's height, hash their name and look at
your heights array. The upper bound is min(heights[h1], heights[h2], ...
heights[hn]).

(If you want to be _very_ technical, this works not only for numbers, but for
any set that supports max() and min() operations. A Bloom filter for set
membership uses such a set with two elements, 0 and 1.)
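
If it helps to see the idea in code, here's a minimal Python sketch of steps
1–4 (the class name, hashing scheme, and sizes are my own illustrative
choices, not anything from the article):

    import hashlib

    class UpperBoundSketch:
        """Bloom-filter-style sketch: returns an upper bound on a numeric
        value (here, a height) associated with a key (here, a name)."""

        def __init__(self, size=1_000_000, num_hashes=4):
            self.size = size
            self.num_hashes = num_hashes
            self.heights = [0] * size            # step 1: all cells start at zero

        def _hashes(self, name):
            # step 2: derive several hash values for the key
            for i in range(self.num_hashes):
                digest = hashlib.sha1(f"{i}:{name}".encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, name, height):
            # step 3: only ever push cell values upward
            for h in self._hashes(name):
                self.heights[h] = max(self.heights[h], height)

        def upper_bound(self, name):
            # step 4: the smallest touched cell is still >= the true height
            return min(self.heights[h] for h in self._hashes(name))

For example, after add("alice", 170), upper_bound("alice") is guaranteed to be
at least 170: collisions can only push the answer up, never down, much like a
Bloom filter gives false positives but never false negatives.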

------
thecroat
Hey eeeverrrybody. So if my brain melted, and I only understood about half of
this, where do I go to learn more to better understand this? I love this
stuff, but my melon can only retain/understand so much without going insane.

~~~
davidcuddeback
Try implementing some of the data structures and throw some toy problems at
them. That's a really good way to get a feel for them. I've done that with AI
algorithms, and I found that it really helps solidify the concepts because
sometimes the results don't make sense and I have to figure out "why".
Implementing the algorithms forces you to do a deep dive, but at a pace that
hopefully won't fry your brain. Don't overlook the papers that the author
linked to at the bottom of the article.

Edited to fix my phone's auto-corrections.

------
mwexler
What a great read. I'm always curious what a good user experience would be
for letting a user trade "time" against "precision/accuracy" in an interface.
slider comes to mind, but has anyone seen a clever implementation that lets
users choose how they want an answer delivered?

~~~
rxin
Check this out: <http://blinkdb.org/>

It's a new research collaboration between UC Berkeley and MIT on probabilistic
query answering. It lets users specify the trade-off between error/confidence
bounds and query time.

~~~
rickmode
"BlinkDB can execute a range of queries over a real-world query trace on up to
17 TB of data and 100 nodes in 2 seconds, with an error of 2–10%"

17 TB on 100 nodes? That's a lot of nodes to hold 17 TB; on average 174 GB-ish
of data each.

The speed is super impressive, but using 100 nodes makes this look more like a
parallel processing achievement than "big data".

~~~
rxin
Even with parallel processing, assuming you can handle 1G of data per node per
second (which is a fairly optimistic estimate for on-disk data):

1G/node/sec * 100 nodes * 2 sec = 200G.

17TB is 85 times that number.

------
Estragon
These techniques are quite sensitive to the distribution of the data under
study. For instance, if the numbers being processed by the LogLog counter
aren't uniformly distributed, the cardinality estimates you get from it could
be arbitrarily biased.
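
To make that concrete, here's a rough Python sketch of a LogLog-style counter
(the names and constants are illustrative, not the article's implementation).
The estimator assumes the bits feeding the registers look uniformly random,
which is why skipping or weakening the hash step can skew the result:

    import hashlib

    def _rank(x, bits):
        """1-indexed position of the first 1-bit in a bits-wide integer."""
        for i in range(1, bits + 1):
            if x & (1 << (bits - i)):
                return i
        return bits

    def loglog_estimate(values, m=256, use_hash=True):
        """Rough LogLog cardinality estimate with m registers (m a power of 2).
        Setting use_hash=False feeds raw integer values straight into the
        registers, illustrating how a non-uniform input distribution skews
        the estimate."""
        p = m.bit_length() - 1                 # m = 2**p registers
        registers = [0] * m
        for v in values:
            if use_hash:
                x = int(hashlib.sha1(str(v).encode()).hexdigest(), 16) & 0xFFFFFFFF
            else:
                x = int(v) & 0xFFFFFFFF        # unhashed, for illustration only
            bucket = x >> (32 - p)             # top p bits pick a register
            rest = x & ((1 << (32 - p)) - 1)   # remaining bits feed the rank
            registers[bucket] = max(registers[bucket], _rank(rest, 32 - p))
        alpha = 0.39701                        # asymptotic LogLog bias correction
        return alpha * m * 2 ** (sum(registers) / m)

With a decent hash, plain LogLog's standard error is roughly 1.3/sqrt(m), so
about 8% with 256 registers; with use_hash=False and clustered integer inputs,
most registers are never touched and the answer can be off by orders of
magnitude, which is the kind of bias described above.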

