
K-means clustering using sklearn and Python - dhiraj8899
https://heartbeat.fritz.ai/k-means-clustering-using-sklearn-and-python-4a054d67b187
======
lawlorino
This just seems like an advert for a mailing list and some ML platform I've
never heard of; the article content doesn't offer anything new over the
million other beginner sklearn tutorials out there.

~~~
andrewmatte
Yes, and the bait-and-switch from the flower business?

And K-means??? Why not HDBSCAN?

~~~
Buetol
Each time somebody brings up K-means, I show them this clustering benchmark
by the scikit-learn project: [https://scikit-learn.org/stable/_images/sphx_glr_plot_cluste...](https://scikit-learn.org/stable/_images/sphx_glr_plot_cluster_comparison_001.png)

~~~
vasili111
Any accompanying text for the image?

~~~
Buetol
Yes, here's the context: [https://scikit-learn.org/stable/modules/clustering.html](https://scikit-learn.org/stable/modules/clustering.html)

------
sillysaurusx
I see a lot of people asking for more advanced notebooks.

Recently I was asked to participate in a competition to identify brain
hemorrhages ([https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detect...](https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection)).
It turns out jhoward has published a lot of Kaggle notebooks walking through
the entire process of gathering the data, cleaning it, implementing a learning
algorithm, and submitting an entry.

If you're trying to do practical end to end machine learning, these are
definitely worth studying.

Notebooks:

[https://www.kaggle.com/jhoward/from-prototyping-to-submissio...](https://www.kaggle.com/jhoward/from-prototyping-to-submission-fastai)

[https://www.kaggle.com/jhoward/cleaning-the-data-for-rapid-p...](https://www.kaggle.com/jhoward/cleaning-the-data-for-rapid-prototyping-fastai)

[https://www.kaggle.com/jhoward/some-dicom-gotchas-to-be-awar...](https://www.kaggle.com/jhoward/some-dicom-gotchas-to-be-aware-of-fastai)

[https://www.kaggle.com/jhoward/don-t-see-like-a-radiologist-...](https://www.kaggle.com/jhoward/don-t-see-like-a-radiologist-fastai)

The last one is particularly interesting: it proposes a new color map for
distinguishing 65536 different greyscale values. Turbo is an alternative:
[https://ai.googleblog.com/2019/08/turbo-improved-rainbow-col...](https://ai.googleblog.com/2019/08/turbo-improved-rainbow-colormap-for.html)

~~~
ska
I haven't looked at them carefully but two things jump out at me:

1) There is some very useful stuff there, particularly for someone new to
consuming medical imaging data. The writeups aim to be fairly complete.

2) There are some places where jhoward is being naive, e.g. CT image scaling,
where the advice could get you into trouble.

------
coleifer
You can implement it, albeit more slowly, in pure python using just 20-30
lines of code. I wrote a blog post a while back showing how kmeans can be used
to identify dominant colors in images. It has many applications and is a handy
tool for roughly grouping data. Care is needed to pick the optimal starting
centroids and _k_.
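
For anyone curious what those 20-30 lines look like, here is one minimal pure-Python sketch of the standard Lloyd's-algorithm loop (the toy data and naive random initialization are my own illustration, not taken from the blog post):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # naive init; k-means++ picks better seeds
    for _ in range(iters):
        # Assignment step: each point joins the nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (keep the old centroid if the cluster went empty).
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Two obvious blobs around (0, 0) and (10, 10):
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(data, k=2)
```

On this toy data the two centroids settle on the two blob means after a couple of iterations; for real work you'd still reach for scikit-learn.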

~~~
lawlorino
> You can implement it, albeit more slowly, in pure python using just 20-30
> lines of code.

This is a good exercise for anyone starting out in machine learning, but I'd
always stick to a well-known library if I were actually using it for something
else.

> Care is needed to pick the optimal starting centroids and k.

Definitely, and I think it speaks to the laziness of the linked article that
they just say "use the elbow method" for choosing k. In my ~4 years as a data
scientist I've never seen or heard of this working for any "real world"
problem. Metrics like the silhouette score are much more useful and quantifiable.
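
For illustration, a minimal sketch of picking k by silhouette score with scikit-learn (the synthetic four-blob data and the candidate range of k are assumptions made for this example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic blobs, so the "right" k is known to be 4.
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=1.0,
    random_state=42,
)

# Score each candidate k; silhouette ranges from -1 to 1, higher is better.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

Unlike eyeballing an elbow, this gives you a single number to compare runs by, which is what makes it usable in a pipeline.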

------
tjpaudio
There are like a million blog-style posts of this basic intro-to-ML-with-
pandas-and-Python kind. It's so prevalent that I'm starting to wonder what the
motivation is.

~~~
ApolloFortyNine
They're basically just ads for whatever the author is selling. Since you
obviously can't post a literal ad to Hacker News, people write simple how-to
articles (they have to be simple to reach the most people) and post those
instead.

------
durbatuluk
Should I start calling myself an ML expert? I'm 32 years old, and most of these
"ML algorithms" were just called statistics when I learned them. Maybe I'm
missing something: where is the LEARNING in k-means? Numerical methods seem to
be on the rise now.

Nice write-up for someone starting out; more detail about the algorithm's
steps would attract more readers.

~~~
Rainymood
>Should I start calling myself an ML expert? I'm 32 years old, and most of
these "ML algorithms" were just called statistics when I learned them. Maybe
I'm missing something: where is the LEARNING in k-means? Numerical methods
seem to be on the rise now.

Yes, you can. I studied statistics, and I cringe at what "machine learning"
and "AI" have been watered down to: simple statistics.

>Nice write-up for someone starting out; more detail about the algorithm's
steps would attract more readers.

I disagree; making it even simpler would attract more readers. You see the
same with YouTube tutorial series that have 22 parts: the first part has
200,000 views, the second 150,000, and the 20th only has 400 or so.

------
dna_polymerase
This was one of the first lectures in my ML course back in university. We
didn't even have pictures of grapes or the local farmers market back then.
Also, we implemented this stupidly easy algorithm without any libraries.

------
thiago_fm
Sorry if I'm hijacking the thread, but is K-means really still relevant? My
teacher in 2008 liked to use it for everything, and it's so basic and easy to
grasp that it's perhaps not even worth an article. Are people building amazing
AI with it? Why do people like writing about it so much?

~~~
TuringNYC
Yes, very relevant. You don't always need advanced models; you can accomplish
quite a bit with classic ones. I used it for two separate customer projects
recently, in two different industries.

~~~
ci5er
SVD is also simple and versatile.
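
As a rough illustration of that point, a truncated SVD in plain NumPy doubles as a quick dimensionality-reduction step (the random data here is purely illustrative; centering the columns makes the result coincide with PCA):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X -= X.mean(axis=0)  # center the columns so the SVD coincides with PCA

# Thin SVD: X = U @ diag(s) @ Vt, with singular values s sorted descending.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Project onto the top-2 right singular vectors: a rank-2 embedding of X.
X2 = X @ Vt[:2].T
```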

------
john-rowa
I liked the elbow-method implementation part, which is designed to help find
the optimal number of clusters in a dataset. Thanks.
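
For reference, a minimal version of the elbow method with scikit-learn (the synthetic three-blob data and the range of k tried are assumptions for this sketch, not from the article):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated blobs, so the elbow should appear at k = 3.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [8, 8], [0, 8]],
    cluster_std=1.0,
    random_state=0,
)

# Inertia (within-cluster sum of squared distances) keeps falling as k grows,
# but the drops become tiny once k passes the true cluster count: the "elbow".
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 7)
}
```

Plotting `inertias` against k and looking for the bend is the whole method; as noted elsewhere in the thread, the bend is often ambiguous on real data.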

~~~
xibalba
Did you create this account just to make a positive comment on this article?

------
sunrise100
There are many developers who might not have done K-means clustering, or
unsupervised learning at all; we should think about them as well. I think the
article did a good job of explaining the related concepts. I liked it.

~~~
commandlinefan
I don’t know; he didn’t explain how the algorithm works, what it does, or
anything you couldn’t just have gotten from the scikit-learn docs. It would
have been much better if it actually described the algorithm or presented an
implementation of it (which isn’t that complex, BTW).

