
Introduction to K-means Clustering - DataScienceInc
https://www.datascience.com/blog/introduction-to-k-means-clustering-algorithm-learn-data-science-tutorials
======
eveningcoffee
K-means is based on some quite strong assumptions - that your data follows a
specific case of the Gaussian distribution (roughly spherical clusters of
similar spread). The plus side is that the algorithm is relatively easy to
understand and implement, so it is a good starting point for clustering.

Once you've got this, the next step is probably to consider whether the
centroid is really a good representation of the distribution of your data. If
you still believe that your data roughly follows some form of Gaussian
distribution, then K-means can be levelled up with the EM algorithm.
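
The "levelling up" here amounts to fitting a Gaussian mixture with EM: soft
responsibilities instead of hard assignments, and per-component variances. A
minimal 1-D sketch with two components (the data and initial guesses are made
up for illustration, not production code):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D data: two Gaussian components with means 0 and 5
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])

# Initial guesses for the two components
mu = np.array([x.min(), x.max()])  # means
sd = np.array([1.0, 1.0])          # standard deviations
w = np.array([0.5, 0.5])           # mixing weights

for _ in range(50):
    # E-step: responsibility of each component for each point
    dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the soft assignments
    nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    w = nk / len(x)
```

After a few dozen iterations the estimated means should settle near the true
component means; K-means is the limiting case where the responsibilities are
hard 0/1 assignments and the variances are fixed and equal.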

Moreover, if you are faced with high-dimensional data, it may be worth first
trying to reduce the dimensionality of the data with PCA or t-SNE, or a
combination of both. Or perhaps use autoencoders.
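
A minimal sketch of the PCA step using plain NumPy (the synthetic low-rank
data is made up for illustration; scikit-learn's PCA does essentially the same
thing via SVD):

```python
import numpy as np

rng = np.random.default_rng(1)
# 50-dimensional data that really lives in a 5-dimensional subspace
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 50))

# PCA via SVD of the centered data
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:5].T  # project onto the top 5 principal axes

# Fraction of total variance captured by the kept components
evr = (S[:5] ** 2) / (S ** 2).sum()
```

The explained-variance ratio tells you how many components are worth keeping
before handing the reduced data to K-means.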

~~~
hnhg
What about data that generally follows something more like a power law
distribution? How would you cluster that?

~~~
a_bonobo
scikit-learn's materials (amazing in general) have a nice overview of their
clustering methods and how they perform on some test data sets:
[http://scikit-learn.org/stable/modules/clustering.html](http://scikit-learn.org/stable/modules/clustering.html)

How would you even cluster a power law distribution? There's nothing that
separates it into two or more groups.

~~~
hnhg
A lot of online engagement data looks like a power law distribution on large
platforms. The mean (obviously) has huge variance. Say you have multiple
metrics per user like this. You can still apply k-means but the huge variance
makes it rather ineffective. That's what I've found, anyway.

------
daveguy
My favorite cluster algorithm comparison:

[http://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbs...](http://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb)

------
ArmandGrillet
That is a lot of text for such a simple algorithm. The regular implementation:

1\. K observations are chosen randomly from the dataset as initial means.

2\. K clusters are created by associating every observation with its nearest
mean.

3\. The centroid of each of the K clusters becomes the new mean.

Steps 2 and 3 are repeated until we reach convergence. Very simple cases are
badly clustered with this algorithm:
[https://cdn-images-1.medium.com/max/1200/1*BGSo8t0ktVLDn4Chv...](https://cdn-images-1.medium.com/max/1200/1*BGSo8t0ktVLDn4ChvH23kA.png)
(an example with K=2). However, it is the simplest clustering algorithm and
thus a good starting point for people who want to learn about this field!
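
The three steps can be sketched in a few lines of NumPy (a toy sketch, not
production code - real implementations also handle empty clusters and do
multiple restarts):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose K observations at random as the initial means
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: associate every observation with its nearest mean
        labels = np.argmin(((X[:, None] - means) ** 2).sum(axis=2), axis=1)
        # Step 3: the centroid of each cluster becomes the new mean
        new_means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_means, means):  # stop when the means stabilize
            break
        means = new_means
    return means, labels
```

Usage: `means, labels = kmeans(X, 2)` on an (n, d) array gives you the two
cluster centers and a label per row.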

~~~
amelius
> until we reach convergence

You forgot to explain why (if) this algorithm converges.

------
bkanber
I wrote a JS implementation and tutorial on k-means a few years back [1], if
anyone is looking for this in the browser. Great writeup, I appreciate the
math focus!

[1] [http://burakkanber.com/blog/machine-learning-k-means-cluster...](http://burakkanber.com/blog/machine-learning-k-means-clustering-in-javascript-part-1/)

------
gesman
In my experiments, K-means clustering proved super useful for detecting
malicious activity, cyberattacks and all kinds of suspicious behavior without
knowing any rules or patterns in advance.

Here's how K-means with K=20 detected malicious activity on a real-world, live
web server, just by selecting the smallest clusters (the bottom table shows
specifics about what fell into the smallest clusters - all hits but the first
one are attempts by nefarious actors to exploit the web application):

[http://imgur.com/a/l3EGW](http://imgur.com/a/l3EGW)

This was built using the free version of Splunk with the free machine learning
toolkit add-on.
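
The "select the smallest clusters" idea can be sketched with scikit-learn (the
synthetic traffic features here are made up for illustration, and I use a
smaller K than the K=20 above to suit the toy data size):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic "traffic" features: 500 ordinary requests plus 5 outliers
normal = rng.normal(0.0, 1.0, size=(500, 4))
odd = rng.normal(8.0, 0.5, size=(5, 4))
X = np.vstack([normal, odd])

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
sizes = np.bincount(km.labels_, minlength=10)

# Flag everything in the smallest cluster as a candidate anomaly
smallest = int(np.argmin(sizes))
suspects = np.flatnonzero(km.labels_ == smallest)
```

The rare behavior ends up in a tiny cluster of its own, so ranking clusters by
size surfaces it without any predefined rules.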

~~~
gglanzani
I wonder if that's really K-means, as it tends to work well only when the
sizes of the clusters are approximately the same.

Or maybe Splunk labels it as K-means when it is really some other algorithm.

~~~
gesman
Actually, it's K-means applied after PCA (to reduce dimensionality) and after
standard scaling.
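
That stack - scale, reduce, cluster - maps directly onto a scikit-learn
pipeline. A sketch with made-up data (the feature counts and dimensions are
purely illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Made-up feature matrix whose column blocks have wildly different scales
X = np.hstack([rng.normal(0, 1, (200, 10)), rng.normal(0, 100, (200, 10))])

pipe = make_pipeline(
    StandardScaler(),      # put every feature on a comparable scale
    PCA(n_components=5),   # reduce dimensionality before clustering
    KMeans(n_clusters=20, n_init=10, random_state=0),
)
labels = pipe.fit_predict(X)
```

Without the scaler, the high-variance columns would dominate both the PCA and
the distance computations inside K-means.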

------
loeber
If you find this interesting, you may also enjoy my introduction to k-means
clustering, in which I implement Lloyd's algorithm and use it to classify
handwritten digits:

[http://johnloeber.com/docs/kmeans.html](http://johnloeber.com/docs/kmeans.html)

------
amelius
See also self-organizing maps. They seem more robust. See e.g. [1].

[1]
[https://pdfs.semanticscholar.org/3ffe/8f8a7b0d00297e0cd74d20...](https://pdfs.semanticscholar.org/3ffe/8f8a7b0d00297e0cd74d20b5d936349d6cbc.pdf)

------
tylerhou
I created a visualization of k-means for some people at my school a few months
ago. I'll put it here for anyone who might be interested.

[https://github.com/tylerhou/kmeans](https://github.com/tylerhou/kmeans)

~~~
wallygold
I've also done a visualization:
[http://www.darynr.com/kmeans.htm](http://www.darynr.com/kmeans.htm)

------
gpderetta
In a previous life I've implemented news clustering algorithms for a couple of
companies.

I wasn't impressed with the results from k-means, and I've always found
agglomerative clustering algorithms to work better for natural text documents.

------
dnprock
I made a simpler example of k-means clustering in JavaScript, demonstrating
the concept.

[https://nxsheet.com/sheets/56fb78812efc70776ddce752](https://nxsheet.com/sheets/56fb78812efc70776ddce752)

