

Yinyang K-Means: A Replacement for Classic K-Means with Consistent Speedup - jcr
http://jmlr.org/proceedings/papers/v37/ding15.html

======
huac
Huge, if true.

Researchers included a link to (C++) source code:
[http://research.csc.ncsu.edu/nc-
caps/yykmeans.tar.bz2](http://research.csc.ncsu.edu/nc-caps/yykmeans.tar.bz2)

~~~
ogrisel
I agree it's useful although there also exists a more scalable approximate
variant of classical k-Means and the new Yinyang k-Means (referenced in the
paper), namely Mini-batch k-Means:

[http://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf](http://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf)

This approximate method is implemented (at least) in sofia-ml and scikit-
learn.

~~~
thecopy
Another approach is to transform the dataset into a smaller but representative
dataset, called a core-set, and running k-means on that core-set instead which
produces a "(1+e )-approximation for the optimal cluster centers".

See
[http://people.csail.mit.edu/dannyf/kmeancoreset.pdf](http://people.csail.mit.edu/dannyf/kmeancoreset.pdf)

------
ilurk
> Overall average Speedup (X) of Yinyang over Standard (9.36), Elkan (6.12),
> Drake (3.08)

There wasn't a single mention on GPU use. So I'm wondering if the yinyang
algorithm would port well to the GPU.

Just a quick google led me to reports of GPU speedups of 2x to 75x.

------
jordigh
I haven't read the paper yet, but I'm about to. Before I do, I want to ask: do
people usually implement O(n^2) distance calculations in k-means when trying
to find the closest centroid that each point is to? I think that's what I
usually see, which seems so dumb to me. Are metric trees that obscure?

~~~
ogrisel
That would be interesting to investigate but you have to keep in mind that the
centroids move between each iterations. So you have to re-build a new metric
tree at each iteration. kd-tree / build time complexity are between `n log(n)`
and `n (log(n))^2`. You don't benefit from the classifical speed-up of metric
trees where you build the tree once and then do many queries on the same tree.

Also metric trees such as k-d tree and ball tree do no bring any speed-up in
high dimensions.

~~~
CyberDildonics
>Also metric trees such as k-d tree and ball tree do no bring any speed-up in
high dimensions.

This is not true. Only their naive implementations don't bring speedups in
high dimensions and it depends on the number of data points, query points, and
dimensions. Best bin first is one common optimization.

