
Stream Processing and Probabilistic Methods: Data at Scale - tylertreat
http://www.bravenewgeek.com/stream-processing-and-probabilistic-methods/
======
xaa
It's good that the "Big Data" community is finally shifting the paradigm
(back) to stream processing. I mean, it is the abstraction behind pipes,
which were invented when data we now consider small was Big. Now if only
someone would take the UNIX pipe and make it transparently multi-machine
instead of writing an ungodly large Java framework to emulate it badly,
slowly, and verbosely...

However, I was a little disappointed by the "probabilistic" methods. I was
thinking of things like approximate kNN, online regression, that sort of
thing, in which you actually trade accuracy for speed and streamability.
Bloom filters don't actually lose any accuracy in the example given, since
there is a fallback to the database in the case of a false positive.
Instead, they are an optimization technique.
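
To make that concrete, here is a minimal sketch of the pattern, with a
hand-rolled filter and a plain dict standing in for the database (all names
here are illustrative, not from the article): a "no" from the filter is
always correct, and every "maybe" falls through to an exact lookup, so the
system as a whole never returns a wrong answer; it only saves disk reads.

```python
# Illustrative sketch: a Bloom filter in front of a key-value store.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 20, num_hashes=7):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # derive k bit positions from salted hashes of the item
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))

def lookup(bloom, db, key):
    """`db` is any mapping, standing in for the real database."""
    if not bloom.might_contain(key):
        return None           # definite miss: skip the disk entirely
    return db.get(key)        # possible false positive: exact fallback
```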

The more interesting probabilistic methods to me are the ones that say: we are
willing to give up the accuracy of the traditional technique, but are hoping
to make up for it by being able to process more data. But of course
"probabilistic method" is a broad and context-dependent term.

~~~
rawnlq
A key modeling assumption with streaming data structures is that the stream
itself is much, much larger than available memory. So you can't use any data
structure that is linear in the stream size, and you have to trade accuracy
for space (e.g., instead of a set you have a Bloom filter, instead of a
multiset you have a count-min sketch, etc.). This does fit your criteria,
doesn't it?
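
For the multiset case, a count-min sketch fits in a few lines; here is a
rough Python sketch for illustration (the width/depth parameters are
arbitrary). Memory stays fixed at width * depth counters no matter how long
the stream runs, and estimates can only err upward, never downward.

```python
# Rough count-min sketch: a fixed-size replacement for a multiset.
import hashlib

class CountMinSketch:
    def __init__(self, width=2048, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _col(self, row, item):
        # each row uses a differently-salted hash of the item
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._col(row, item)] += count

    def estimate(self, item):
        # each row overestimates due to collisions; take the minimum
        return min(self.table[row][self._col(row, item)]
                   for row in range(self.depth))
```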

~~~
xaa
My nitpicking was more about how they were presented. If you are simply
using a Bloom filter to avoid random disk access on the database, then no,
you haven't lost accuracy in your system as a whole.

Also, and I guess this is more fuzzy, my perception is that set-membership
algorithms, for very large sets, are largely used in search and
recommendation. These problems have a convenient property: if you get the
top N right, the ordering of the rest hardly matters. Contrast this with
regression or classification, where you are looking for a _global_ result.

Also, I agree that it is generally a bad idea to use linear data structures
in streaming algorithms, but they are not automatically out in practice (as
opposed to in the CS literature), provided you can keep the constant factor
small.

~~~
tylertreat
Fair criticism :) The reason I mostly focused on various Bloom filters is
that I recently encountered a problem at work that required them:
deduplicating messages in a distributed messaging system. I was mostly
journaling my research.
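
For the curious, the dedup pattern looks roughly like this toy sketch (not
the production code; note that a plain filter saturates on an unbounded
stream, which is why variants like stable Bloom filters exist):

```python
# Toy sketch of Bloom-filter dedup on a message stream. A True from
# seen() may be a false positive; a False is always correct.
def make_deduper(num_bits=1 << 22, num_hashes=4):
    bits = bytearray(num_bits // 8)
    def seen(msg_id):
        idxs = [hash((i, msg_id)) % num_bits for i in range(num_hashes)]
        dup = all(bits[i // 8] >> (i % 8) & 1 for i in idxs)
        for i in idxs:                    # record this message ID
            bits[i // 8] |= 1 << (i % 8)
        return dup
    return seen

seen = make_deduper()
for msg_id in ["a", "b", "a"]:
    if not seen(msg_id):
        print("processing", msg_id)   # "a", "b"; second "a" is dropped
```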

It would be fun to look at stochastic ways to approximate kNN and other
sketches.

~~~
xaa
It was an interesting read for sure, and I appreciate it for what it is. Of
course, this is HN, so we have to criticize ;)

Vowpal Wabbit is a great project that implements approximate, streaming
versions of many basic machine learning algorithms, and it has incredibly
high throughput even when run as a single process. VW by itself has made me
deeply skeptical of the "need" for systems like Spark.

Locality-sensitive hashing is a cool approximate kNN algorithm, along with
others on the Wikipedia kNN page. I think kNN is especially suitable for
approximate techniques because the "nearest neighbors" are often a little
unstable and highly dependent on the distance metric anyway, so you lose
little by moving to an approximate algorithm. When you approximate
regression, OTOH, you complicate your ability to make statistically valid
statements about the model.
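
As a flavor of how simple the random-hyperplane variant of LSH can be, here
is a bare-bones sketch assuming cosine-style similarity and NumPy (the
parameter choices are illustrative): points whose projections share a sign
pattern land in the same bucket, and a query scans only its own bucket
rather than the whole dataset. Real systems use many hash tables and
multi-probe lookups to boost recall.

```python
# Bare-bones random-hyperplane LSH for approximate kNN.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def build_index(points, num_planes=12):
    planes = rng.normal(size=(num_planes, points.shape[1]))
    buckets = defaultdict(list)
    for i, p in enumerate(points):
        # the sign pattern of the projections is the bucket key
        buckets[tuple(planes @ p > 0)].append(i)
    return planes, buckets

def query(points, planes, buckets, q, k=5):
    candidates = buckets.get(tuple(planes @ q > 0), [])
    # exact ranking, but only within the candidate bucket: approximate
    return sorted(candidates,
                  key=lambda i: np.linalg.norm(points[i] - q))[:k]

pts = rng.normal(size=(1000, 16))
planes, buckets = build_index(pts)
print(query(pts, planes, buckets, pts[0]))  # indices of approx neighbors
```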

------
rawnlq
Another pretty good article on this topic:
[https://highlyscalable.wordpress.com/2012/05/01/probabilisti...](https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)

