It's good that the "Big Data" community is finally shifting the paradigm (back) to stream processing. I mean, it is the abstraction behind pipes, which were invented when data we now consider small was Big. Now if only someone would take the UNIX pipe and make it transparently multi-machine, instead of writing ungodly large Java frameworks that emulate it badly, slowly, and verbosely...
However, I was a little disappointed by the "probabilistic" methods. I was thinking of things like approximate kNN, online regression, that sort of thing, in which you actually trade accuracy for speed and streamability. Bloom filters don't lose any accuracy in the example given, since there is a fallback to the database in the case of a false positive. Instead, they are an optimization technique.
The more interesting probabilistic methods to me are the ones that say: we are willing to give up the accuracy of the traditional technique, but are hoping to make up for it by being able to process more data. But of course "probabilistic method" is a broad and context-dependent term.
A key modeling assumption with streaming data structures is that the stream itself is much, much larger than available memory. So you can't use any data structure that is linear in the stream size, and you have to trade accuracy for less space (e.g., instead of a set you have a Bloom filter, instead of a multiset you have a count-min sketch, etc.). That does fit your criteria, doesn't it?
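For example, the multiset-to-sketch trade looks roughly like this (a minimal Python sketch; the width and depth values are illustrative, not tuned):

    import hashlib

    class CountMinSketch:
        """Approximate multiset counts in O(width * depth) space,
        independent of the stream size. Estimates only ever overcount."""

        def __init__(self, width=1024, depth=5):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _cells(self, item):
            for row in range(self.depth):
                h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
                yield row, int(h, 16) % self.width

        def add(self, item, count=1):
            for row, col in self._cells(item):
                self.table[row][col] += count

        def estimate(self, item):
            # Hash collisions only inflate cells, so the minimum across
            # rows is the tightest upper bound on the true count.
            return min(self.table[row][col] for row, col in self._cells(item))

Widening the table buys accuracy back at the cost of space, which is exactly the dial you're describing.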
My nitpicking was more about how they were presented. If you are simply using a Bloom filter to avoid random disk access on the database, then no, you haven't lost accuracy in the system as a whole.
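i.e., something like this pattern, where bloom and db are hypothetical stand-ins for the filter and the store:

    def lookup(key, bloom, db):
        # A negative from the filter is definitive (Bloom filters have
        # no false negatives), so we skip the disk read entirely.
        if key not in bloom:
            return None
        # A false positive just costs one wasted query; the answer
        # returned is still exact.
        return db.get(key)

Every answer is still exact; the filter only decides whether you pay for the disk read.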
Also, and I guess this is more fuzzy, my perception is that set-membership algorithms, for very large sets, are largely used in search and recommendation. These problems have unique properties: if you get the top N right, the ordering of the rest hardly matters. Contrast this with regression or classification, where you are looking for a global result.
Also, I agree that it is generally a bad idea to use linear data structures in streaming algorithms, but in practice (as opposed to the CS literature) they are not automatically ruled out, if you can keep the constant factor small.
Fair criticism :) The reason I mostly focused on various Bloom filters is that I recently hit a problem at work that required them: deduplicating messages in a distributed messaging system. The post was mostly me journaling my research.
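In case anyone is curious, the shape of it was roughly this (a simplified Python sketch, not the production code; the bit count and hash count are illustrative):

    import hashlib

    class BloomFilter:
        def __init__(self, nbits=1 << 23, nhashes=4):
            self.nbits, self.nhashes = nbits, nhashes
            self.bits = bytearray(nbits // 8)

        def _positions(self, key):
            for i in range(self.nhashes):
                h = hashlib.sha256(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.nbits

        def add(self, key):
            for p in self._positions(key):
                self.bits[p >> 3] |= 1 << (p & 7)

        def __contains__(self, key):
            return all(self.bits[p >> 3] & (1 << (p & 7))
                       for p in self._positions(key))

    def consume(stream, handle):
        # Deduplicate an at-least-once stream of (msg_id, payload) pairs.
        # A false positive drops a fresh message, so in reality you size
        # the filter for a tolerable error rate, or confirm against
        # durable storage before discarding.
        seen = BloomFilter()
        for msg_id, payload in stream:
            if msg_id in seen:
                continue  # almost certainly a redelivery
            seen.add(msg_id)
            handle(payload)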
It would be fun to look at stochastic ways to approximate k-NN and other sketches.
It was an interesting read for sure, and I appreciate it for what it is. Of course, this is HN, so we have to criticize ;)
Vowpal Wabbit is a great project that implements approximate, streaming versions of a lot of basic machine learning algorithms. And it has incredibly high throughput even when run as a single process. VW by itself has made me greatly skeptical of the "need" for systems like Spark.
Locality-sensitive hashing is a cool approximate kNN algorithm, along with the others on the Wikipedia kNN page. I think kNN is especially suitable for approximate techniques because the "nearest neighbors" are often a little unstable and highly dependent on the distance metric anyway, so you lose little by moving to an approximate algorithm. When you approximate regression, OTOH, you complicate your ability to make statistically valid statements about the model.
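A minimal sketch of the random-hyperplane flavor of LSH for cosine similarity (the function names and parameters here are mine, purely illustrative):

    import numpy as np

    def build_lsh_index(vectors, n_planes=16, seed=0):
        # Points on the same side of every random hyperplane share a
        # bucket, so near neighbors (by cosine) tend to collide.
        # More planes means smaller buckets: faster queries, lower recall.
        rng = np.random.default_rng(seed)
        planes = rng.standard_normal((n_planes, vectors.shape[1]))
        buckets = {}
        for idx, v in enumerate(vectors):
            key = tuple((planes @ v) > 0)
            buckets.setdefault(key, []).append(idx)
        return planes, buckets

    def approx_neighbors(query, vectors, planes, buckets, k=5):
        # Only score the query's own bucket. The candidates we never
        # look at are exactly the accuracy traded away for speed.
        def cos(i):
            v = vectors[i]
            return v @ query / (np.linalg.norm(v) * np.linalg.norm(query) + 1e-12)
        candidates = buckets.get(tuple((planes @ query) > 0), [])
        return sorted(candidates, key=cos, reverse=True)[:k]

If a true neighbor lands on the wrong side of a single plane you miss it; multiple independent tables are the usual fix for recall, at a linear cost in memory.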