
K-means Clustering 86 Single Malt Scotch Whiskies - platz
http://blog.revolutionanalytics.com/2013/12/k-means-clustering-86-single-malt-scotch-whiskies.html
======
tlarkworthy
A few technical issues:

-High dimensional dataset; some redundant dimensions are probably being "counted twice". Solution: PCA your data first.

-Sampling artefacts. Run the k-means algorithm lots of times and average.

If the data points were lower dimension, the clusters would be able to cover
the space better. Bowmore is a smoky whisky (score 3), but it has enough
other interesting characteristics that it does not align with the
obvious "Islay" cluster (smoky centre of 2.8). I think it's in the wrong
cluster, due to redundant fields having too much power.
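
To make the redundancy point concrete, here is a minimal sketch in plain Python that finds the first principal component by power iteration. The scores are invented for illustration (they are not the real whisky data), and the helper names `center`, `covariance` and `first_component` are hypothetical, not taken from the article.

```python
import math

# Invented scores for illustration only (NOT the real whisky data).
# Columns: smoky, medicinal, sweet. Smoky and medicinal move together.
scores = [
    [4, 4, 1], [3, 4, 1], [4, 3, 2], [1, 0, 3],
    [0, 1, 2], [1, 1, 3], [0, 0, 2], [2, 1, 2],
]

def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centred rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iters=200):
    """Leading eigenvector and eigenvalue of cov via power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d  # fixed start keeps the sketch deterministic
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the eigenvalue estimate.
    cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * cov_v[i] for i in range(d))
    return v, eigenvalue

cov = covariance(center(scores))
component, eigenvalue = first_component(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))

print("loadings (smoky, medicinal, sweet):", [round(x, 2) for x in component])
print("share of variance on first component:", round(explained, 2))
```

On this toy data the smoky and medicinal columns load on the same component, so clustering on component scores would, in effect, count that shared signal once instead of once per column.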

~~~
_deh
Couldn't resist a quick play with PCA. Initially because, looking at the
cluster centres, cluster 4 looked much more distinct than the others and I
wondered if the rest could be sharpened up. Then realised quite how sharply
the cluster sizes differ - two small (6 distilleries each) and two large (41
and 33). Quite an interesting result for k-means, and a reason for
sharper/less sharp clusters. Having played with the data a bit, I buy into the
smoky whiskies being a very different beast to the others, but I'm less sure
about the other small segment (tobacco-but-not-smoky).

On your point, turns out PCA alone doesn't tend to help with Bowmore (or the
similar Highland Park). The first component you get is basically medicinal-
and-smoky, and these two only score 1 for medicinal. They're also probably a
bit too honeyed or floral for cluster 4. But clustering on the components and
tweaking the starting values can draw Isle of Jura, Oban, Old Pulteney and
occasionally Glen Garioch into the smoky cluster. On the other hand, looking
at a biplot of the first two components you could argue for two clusters -
the six in OP's original cluster 4 vs. the rest.

The most important thing is that this work led OP to Talisker - result. But I
suspect a definitive clustering will require more features - so, sadly,
someone needs to taste them all again...

------
mjw
This is awesome! I wonder if anyone's done something similar with beers?

Anyway a few "next thing to try" suggestions from a machine learning
perspective:

The model selection process used here is by its own admission quite ad-hoc,
based on a gut feel about diminishing returns. There are various more
principled methods you can use to find the sweet spot between over- and
under-fitting with these kinds of models, a lot of them based on held-out
validation data.

One way to do this would be leave-one-out cross validation (LOO-CV): hold out
one whisky, fit the model, and see how 'surprised' the model is by the held-
out whisky, repeat for the next whisky and average over all the folds. Because
the dataset is tiny this should be quite feasible.

To measure 'surprisal' you could e.g. look at the distance from the held-out
data point to the nearest cluster, although something better motivated would
be if you switched to a probabilistic model and used likelihood of the held-
out data. Probably the simplest next thing you could try in that direction
would be a Gaussian mixture model (GMM) trained using EM. K-means is actually
a degenerate limiting case of this.
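
A rough sketch of the LOO-CV idea, using the simpler "distance to the nearest cluster centre" score rather than a likelihood. It is plain Python with a bare-bones Lloyd's k-means; the toy 2-D points and the function names (`kmeans`, `loo_surprise`) are made up for illustration and are not from the article.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's algorithm from a random start; returns the centroids."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids

def loo_surprise(points, k, seed=0):
    """Mean distance from each held-out point to its nearest centroid."""
    rng = random.Random(seed)
    total = 0.0
    for i, held_out in enumerate(points):
        training = points[:i] + points[i + 1:]
        centroids = kmeans(training, k, rng)
        total += min(dist(held_out, c) for c in centroids)
    return total / len(points)

# Toy data (an assumption for the sketch): two well-separated groups.
points = [[0, 0], [0, 1], [1, 0], [1, 1],
          [10, 10], [10, 11], [11, 10], [11, 11]]

for k in (1, 2, 3):
    print(k, round(loo_surprise(points, k), 3))
```

This distance score can keep shrinking as k grows, so it is better read as a curve to inspect than as something to minimise outright; a held-out likelihood under a probabilistic model is the better-motivated score.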

A probabilistic model would also allow you to use Bayesian model selection
criteria, which can get quite interesting (and might lead you eventually to
things like Dirichlet process mixture models).

It would make it easier to compare the model's explanatory power with other
unsupervised probabilistic models. For example some kind of latent factor
model like factor analysis or pPCA would be quite interesting to investigate
too, whether taken alone or in combination with clustering as a dimensionality
reduction step as tlarkworthy is suggesting.

Also concur that doing multiple runs with different randomised initialization
is generally a good idea for k-means or EM, since they can get stuck in poor
local minima. It's perhaps more common practice to pick the best of multiple
runs than to average them, though.
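
A minimal sketch of "best of several runs", assuming the usual within-cluster sum of squares (inertia) as the score for ranking runs. Plain Python again; the toy points and the names `lloyd` and `best_of` are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd(points, k, rng, iters=100):
    """One k-means run from a random start; returns (inertia, centroids)."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    inertia = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return inertia, centroids

def best_of(points, k, restarts=20, seed=0):
    """Run k-means `restarts` times and keep the lowest-inertia run."""
    rng = random.Random(seed)
    runs = [lloyd(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[0])

# Toy data (an assumption for the sketch): three small, separated groups.
points = [[0, 0], [0, 1], [1, 0],
          [10, 0], [10, 1], [11, 0],
          [5, 9], [5, 10], [6, 9]]

single_inertia, _ = best_of(points, 3, restarts=1)
best_inertia, best_centroids = best_of(points, 3, restarts=20)
print("one run:   ", round(single_inertia, 2))
print("best of 20:", round(best_inertia, 2))
```

Picking the best run also sidesteps a practical snag with averaging: cluster labels are arbitrary from one run to the next, so centroids would have to be matched up across runs before they could be averaged.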

~~~
_deh
Something on beer: [http://bit.ly/JLNgZA](http://bit.ly/JLNgZA) (wisc.edu)

------
cschmidt
There was previously a published paper that did a hierarchical clustering of a
bunch of whiskies:

    A Classification of Pure Malt Scotch Whiskies
    Lapointe and Legendre
    Appl. Statist. (1994) 43, No. 1, pp. 237-257

[http://www.dcs.ed.ac.uk/home/jhb/whisky/lapointe/text.html](http://www.dcs.ed.ac.uk/home/jhb/whisky/lapointe/text.html)

[http://www.albany.edu/psychology/bcd/share/scotch_classifica...](http://www.albany.edu/psychology/bcd/share/scotch_classification/lapointe&legendre_1994.pdf)

It would be interesting to compare their results with the OP.

------
bazzargh
He's missing the newest and smallest distillery on Islay on that map -
Kilchoman. Might want to give it a go.

[http://kilchomandistillery.com/](http://kilchomandistillery.com/)

... also the bewildering array sold by the others, particularly Bruichladdich,
and the ridiculously expensive remaining stock of the defunct Port Ellen:

[http://www.theguardian.com/lifeandstyle/2013/oct/26/scotch-w...](http://www.theguardian.com/lifeandstyle/2013/oct/26/scotch-whisky-village-distillery-died)

------
surlyadopter
At a certain point I suppose taste is subjective, but Laphroaig is definitely
not the smokiest or peatiest of the Islay malts.

~~~
KHPatel
It is a little subjective. I've always thought Ardbeg has a smokier taste than
Laphroaig, but a fellow whisky-enthusiast friend of mine thinks the complete
opposite.

edit: on the post. Very curious to see your thoughts on Talisker and whether
it fits the profile the results gave it.

~~~
surlyadopter
I find Ardbeg (at least the Uigeadail) to be far smokier than Laphroaig. The
Lagavulin as well. To me the Talisker is quite bland, the sherry flavor really
drowned out any peatiness.

~~~
EpicEng
Agreed. I love the Talisker Storm, but I have a Lagavulin 16 in my hand right
now that gives it a run for its money. Definitely more punch in the smoke and
peat departments.

~~~
nsns
The elements affecting the taste of a Whiskey are mind-bogglingly complex,
e.g. -
[https://www.youtube.com/watch?v=O_aLgTRQjmM&t=10m21s](https://www.youtube.com/watch?v=O_aLgTRQjmM&t=10m21s)

~~~
EpicEng
I know, and I love Ralfy's reviews. Can't say my palate is up to his level,
but luckily, practicing is fun.

------
s-p
taste is subjective, but peating levels are not. the standard levels for the
Trinity of Peat are as follows:

Ardbeg: 55ppm (this has changed over time)

Lagavulin: 40-45ppm

Laphroaig: 40ppm

however, it's important to distinguish between standard levels and novelty
bottlings. the Ardbeg Supernova is peated to over 100ppm, twice the usual
level. Bruichladdich, which is a relatively lightly-peated malt, produces the
highly-peated Port Charlotte line, as well as the Octomore, peated to about
170ppm.

so, unless you're talking about the standard line, you can't simply talk about
the peating level of Ardbeg.

finally, since all single malts are aged in a charred oak cask which acts as a
filter over time, in general, the older the malt, the less smoky it will
taste. compare a Laphroaig 10yo cask strength to a Laphroaig 18yo, and/or
25yo, if you're ever lucky enough to do so.

------
egospring
For those looking to learn more, there's a guy by the name of Ralfy who does
fantastic in-depth reviews that will get you excited to enjoy whisky:
[http://www.youtube.com/user/ralfystuff](http://www.youtube.com/user/ralfystuff)

------
EpicEng
I have to agree on the nod to Talisker. My favorite drink at the moment is
Talisker Storm. Not a fan of the packaging, but who cares? It's delicious.

