

K-means clustering explained and visualized with d3 - tmcw
http://macwright.org/2012/09/16/k-means.html

======
tzs
The article is describing Lloyd's Algorithm. It would be good to mention the
name somewhere.

It would probably be a good idea to mention what to do if a cluster ends up
with no points assigned at some iteration.

It should also be noted that there are other methods for choosing the initial
cluster centers than choosing randomly from the data. For instance, choosing
random points in the smallest d-dimensional rectangle that contains the data
points (where d is the dimensionality of the space containing the data points)
is popular.
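
For concreteness, that bounding-box initialization might look like this (a
Python sketch; the thread names no language, so the function name and shape of
the data are my own assumptions):

```python
import random

def init_centers_in_bbox(points, k):
    """Pick k random points inside the smallest axis-aligned
    d-dimensional rectangle containing the data points."""
    d = len(points[0])
    lo = [min(p[i] for p in points) for i in range(d)]
    hi = [max(p[i] for p in points) for i in range(d)]
    return [tuple(random.uniform(lo[i], hi[i]) for i in range(d))
            for _ in range(k)]
```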

When checking a point against each cluster center to determine which center to
assign that point to, you can speed things up a little by using the square of
the distance rather than the distance. This saves you NK square root
calculations, where N is the number of points and K is the number of clusters.
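
The trick works because squaring is monotonic for non-negative values, so the
nearest center by squared distance is also the nearest by distance. A Python
sketch of the assignment step (helper names are mine):

```python
def assign(points, centers):
    """Assign each point to the index of its nearest center,
    comparing squared distances -- the ordering is identical,
    and it skips one sqrt per point/center pair."""
    def sq_dist(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]
```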

~~~
tmcw
Thanks! This now mentions and links Lloyd's algorithm.

What happens when a cluster has no points assigned? I'm not sure, do you know?

Is there much reason for choosing other initial cluster means? Just a better
approximation?

I'd rather not talk about optimizations to the algorithm too much; highly
optimized algorithms aren't the best way to learn, and this is not an article
about implementation.

~~~
tzs
What you do when a cluster has no points depends on the details of the problem
you are trying to apply clustering to. If you don't actually care how many
clusters you have, you could just drop such cluster centers from the final
output.

If, on the other hand, you really want the number of clusters you asked for,
one common approach is to abort the run if a cluster becomes empty and start
over with a new set of initial clusters.
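
The abort-and-restart approach could be sketched like this in Python (my own
framing of the loop; the thread doesn't prescribe one, and the iteration and
restart limits here are arbitrary):

```python
import random

def lloyd_with_restarts(points, k, iters=100, max_restarts=20):
    """Lloyd's algorithm that aborts the run and restarts with fresh
    random centers whenever a cluster ends up with no points."""
    def sq_dist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))

    for _ in range(max_restarts):
        centers = random.sample(points, k)  # initial centers from the data
        ok = True
        for _ in range(iters):
            groups = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda j: sq_dist(p, centers[j]))
                groups[j].append(p)
            if any(not g for g in groups):
                ok = False  # a cluster went empty: abort this run
                break
            # move each center to the mean of its assigned points
            centers = [tuple(sum(coord) / len(g) for coord in zip(*g))
                       for g in groups]
        if ok:
            return centers
    raise RuntimeError("no run produced k non-empty clusters")
```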

I'm not sure if it really matters much how you pick the initial clusters.
Random within the space, rather than random from among the points, can simply
be easier. Suppose you've got 100 points and are trying to split into 10
clusters. If you pick from the points, you need to pick without replacement so
that you won't pick the same point twice. This can complicate things a little,
depending on what data structures you are using and what's in your language's
library.

If, on the other hand, you are just picking 10 random points in the space, the
chance of picking the same point twice is so low you can get away with not
worrying about it.
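
How much of a complication that is does depend on the library. In Python, for
instance (my choice of language, not the thread's), the without-replacement
pick is a single standard-library call, whereas a language without an
equivalent would need a shuffle or a retry loop:

```python
import random

points = [(i, i * i) for i in range(100)]  # 100 toy data points
k = 10

# Picking initial centers from the data: sample WITHOUT replacement,
# so the same point can never be chosen twice.
centers = random.sample(points, k)
```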

If you are curious, another place K-means clustering is used is in choosing
points to serve as the centers for modeling a dataset with radial basis
functions. There is an excellent set of lectures and slides on this in the
Caltech machine learning video library here:
<http://work.caltech.edu/library/>

