

Introducing Streaming K-Means in Spark MLlib 1.2 - rxin
http://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html

======
rxin
This is a cool feature, and is one of the prime examples of what Spark's tight
integration of various libraries can enable (in this case Spark Streaming and
MLlib). It was originally designed by Jeremy Freeman to handle workloads in
neuroscience, which IIRC were generating data at 1TB per 30 minutes.
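
For anyone who wants to try it, a minimal sketch of the new API, assuming
Spark Streaming 1.2 with MLlib on the classpath (the input path, k, decay
factor, and dimensionality below are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.mllib.clustering.StreamingKMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingKMeansSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("StreamingKMeansSketch")
        val ssc = new StreamingContext(conf, Seconds(10))

        // One vector per line in text form, e.g. "[0.1, 2.3, -1.0]".
        val trainingData = ssc.textFileStream("/training/data/dir").map(Vectors.parse)

        val model = new StreamingKMeans()
          .setK(5)                  // number of clusters, fixed up front
          .setDecayFactor(0.5)      // exponentially forget old batches
          .setRandomCenters(3, 0.0) // 3-dimensional data, zero initial weight

        // Cluster centers are updated as each micro-batch arrives.
        model.trainOn(trainingData)

        ssc.start()
        ssc.awaitTermination()
      }
    }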

------
hcrisp
Sounds similar to an exponential moving average [1], which is itself a
one-pole IIR digital filter.

[1] http://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average
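
The analogy is fairly direct: the decay factor in the streaming k-means
update plays a role similar to the filter coefficient. A toy Scala sketch of
the one-pole recursion s_t = alpha * x_t + (1 - alpha) * s_{t-1} (alpha and
the input here are made up):

    // One-pole IIR low-pass filter, i.e. an exponential moving average.
    // Larger alpha (in (0, 1]) forgets old samples faster.
    def ema(xs: Seq[Double], alpha: Double): Seq[Double] =
      xs.tail.scanLeft(xs.head) { (s, x) => alpha * x + (1 - alpha) * s }

    // e.g. ema(Seq(1.0, 2.0, 3.0, 4.0), 0.5) == Seq(1.0, 1.5, 2.25, 3.125)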

------
michaelmior
Is it true that this doesn't support dynamic values of k? That is, the
algorithm isn't adaptive to a changing number of clusters? That said, I
suppose for some small range of k values, you could do this trivially by
tracking them all and picking the best.

~~~
tlarkworthy
That's a general drawback of k-means: you have to select k. In practice you
try a range of values and see how well each summarizes your data (e.g. via
leave-one-out cross-validation).

You could do that here too; just train over a range of k (rough sketch
below). If only there were a streaming leave-one-out cross-validation for
k-means to complement this approach ...

(It is possible to do this in a streaming style; see LWPR.)
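
A rough sketch of the "track them all" idea, assuming Spark's StreamingKMeans
API and an existing StreamingContext `ssc` (the paths, k range, and
parameters are made up; the score is the plain within-set sum of squared
errors, not leave-one-out CV):

    import org.apache.spark.mllib.clustering.StreamingKMeans
    import org.apache.spark.mllib.linalg.Vectors

    val training   = ssc.textFileStream("/training/data/dir").map(Vectors.parse)
    val validation = ssc.textFileStream("/validation/data/dir").map(Vectors.parse)

    // Train one model per candidate k on the same stream.
    val models = Seq(2, 4, 8, 16).map { k =>
      val m = new StreamingKMeans()
        .setK(k)
        .setDecayFactor(0.5)
        .setRandomCenters(3, 0.0)
      m.trainOn(training)
      (k, m)
    }

    // After each batch, report the cost per k; pick the k past which
    // the cost stops dropping sharply (the usual "elbow").
    validation.foreachRDD { rdd =>
      if (rdd.count() > 0) {
        models.foreach { case (k, m) =>
          println(s"k=$k cost=${m.latestModel().computeCost(rdd)}")
        }
      }
    }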

~~~
michaelmior
I understand that's a drawback of k-means. I was just wondering if this was
something that Spark solved natively. Looks like the answer is no for the time
being. Thanks for the pointer to LWPR :)

------
cfregly
Very interesting post. Incidentally, Hacker News uses a similar type of
time-decay algorithm in its ranking!

