My attempt to summarise the difference in language familiar to computer scientists is that you can look at the frequentist vs Bayesian debate as being about when a worst-case analysis is preferable to an average-case analysis for the unknown parameters of a statistical model.
There's something you don't know (the parameters). Are you looking to make statements which bound how bad things could be under the worst-case setting of those parameters? Or do you have some idea upfront about how likely different parameter settings are, and want to make statements about them in the "average" case?
Rather like with worst-case vs average-case analysis of algorithms, which is more appropriate depends on what you're trying to do, and sometimes both are interesting.
This is awesome! I wonder if anyone's done something similar with beers?
Anyway a few "next thing to try" suggestions from a machine learning perspective:
The model selection process used here is by its own admission quite ad hoc, based on a gut feel about diminishing returns. There are various more principled methods you can use to find the sweet spot between over- and under-fitting with this kind of model, many of them based on held-out validation data.
One way to do this would be leave-one-out cross-validation (LOO-CV): hold out one whisky, fit the model, see how 'surprised' the model is by the held-out whisky, then repeat for the next whisky and average over all the folds. Because the dataset is tiny this should be quite feasible.
To measure 'surprisal' you could e.g. look at the distance from the held-out data point to the nearest cluster, although something better motivated would be to switch to a probabilistic model and use the likelihood of the held-out data. Probably the simplest next thing to try in that direction would be a Gaussian mixture model (GMM) trained using EM; k-means is actually a degenerate limiting case of this.
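Concretely, here's a minimal sketch of the LOO-CV idea with scikit-learn, using a GMM's held-out log-likelihood as the surprisal measure (random data stands in for the real whisky-by-flavour matrix, which I don't have):

```python
# Hedged sketch: choose the number of clusters by average held-out
# log-likelihood under leave-one-out cross-validation.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))  # stand-in for the (whisky x flavour) matrix

def loo_log_likelihood(X, n_components):
    """Mean per-sample log-likelihood of the held-out point over LOO folds."""
    scores = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        gmm = GaussianMixture(n_components=n_components, n_init=3,
                              random_state=0).fit(X[train_idx])
        scores.append(gmm.score(X[test_idx]))  # log-likelihood per sample
    return float(np.mean(scores))

for k in (1, 2, 3):
    print(k, loo_log_likelihood(X, k))
```

The k with the highest held-out log-likelihood is your candidate sweet spot; a pure k-means analogue would use (negative) distance to the nearest centroid in place of `score`.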
A probabilistic model would also allow you to use Bayesian model selection criteria, which can get quite interesting (and might lead you eventually to things like Dirichlet process mixture models).
It would also make it easier to compare the model's explanatory power with other unsupervised probabilistic models. For example, some kind of latent factor model like factor analysis or probabilistic PCA (pPCA) would be quite interesting to investigate too, whether taken alone or in combination with clustering as a dimensionality-reduction step, as tlarkworthy is suggesting.
Also concur that doing multiple runs with different randomised initialisations is generally a good idea for k-means or EM, since they can get stuck in poor local minima. It's perhaps more common practice to pick the best of multiple runs than to average them, though.
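As a sketch of the pick-the-best strategy with scikit-learn (synthetic data, not the author's):

```python
# k-means can get stuck in poor local minima, so run it from several random
# initialisations and keep the best run, judged by within-cluster sum of
# squares (`inertia_` in scikit-learn).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))

runs = [KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X)
        for seed in range(10)]
best = min(runs, key=lambda km: km.inertia_)
```

scikit-learn will do this loop for you internally if you just pass `KMeans(n_clusters=4, n_init=10)`.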
Good to see Bayesian model selection get a mention. Bayesian model averaging is pretty interesting, too, in that it comes, in a sense, with built-in protection against overfitting.
I still think there is something quite fundamental, though, about validation sets and other related resampling-based methods for estimating generalisation performance (cross-validation, bootstrap, jackknife and so on).
The built-in picture you get about predictive performance from Bayesian methods comes with strong caveats -- "IF you believe in your model and your priors over its parameters, THEN this is what you should expect". Adding extra layers of hyperparameters and doing model selection or averaging over them might sometimes make things less sensitive to your assumptions, but it doesn't make this problem go away; anything the method tells you is dependent on its strong assumptions about the generative mechanism.
Most sensible people don't believe their models are true ("all models are false, some models are useful"), and don't really fully trust a method, fancy Bayesian methods included, until they've seen how well it does on held-out data. So then it comes back to the fundamentals -- non-parametric methods for estimating generalisation performance which make as few assumptions as possible about the data and the model they're evaluating.
Cross-validation isn't the only one of these, and perhaps not the best, but it's certainly one of the simplest. One thing people do forget about it is that it does make at least one basic assumption about your data -- independence -- which is often not true and can be pretty disastrous if you're dealing with (e.g.) time-series data.
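To illustrate on toy data with scikit-learn: a forward-chaining splitter like `TimeSeriesSplit` respects time order, whereas ordinary k-fold happily trains on the future:

```python
# Plain k-fold CV assumes exchangeable (independent) samples; with
# time-series data that leaks the future into the training folds.
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    # every training index strictly precedes every test index
    assert train_idx.max() < test_idx.min()

# by contrast, the first ordinary k-fold split trains on later data:
train_idx, test_idx = next(iter(KFold(n_splits=3).split(X)))
assert train_idx.max() > test_idx.min()
```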
I agree. As a Bayesian hoping to understand my data, P(X|M1) is useful: it's the probability I have for X under M1's modelling assumptions. Of course M1 is an approximation, but that's how science is done. You get to understand how your model behaves, and you may say "Well, X is a bit higher than it should be, but that's because M1 assumes a linear response, and we know that's not quite true".
Bayesian model averaging entails P(X) = P(X|M1)P(M1) + P(X|M2)P(M2). It assumes that either M1 or M2 is true. No conclusions can be derived from that. It might be useful from a purely predictive standpoint (maybe), but it has no place inside the scientific pipeline.
There is a related quantity, the Bayes factor P(X|M1)/P(X|M2). That's how much the data favours M1 over M2, and it's a sensible formula because it doesn't rely on the abominable P(M1) + P(M2) = 1.
Model averaging can be quite useful when you're averaging over versions of the same model with different hyperparameters, e.g. the number of clusters in a mixture model.
Yeah, but in this case, there's a crucial difference: within the assumptions of a mixture model M, N=1, 2, ... clusters do make an exhaustive partition of the space, whereas if I compute a distribution for models M1 and M2, there is always M3, M4, ... lurking unexpressed and unaccounted for. In other words,
P(N=1|M) + P(N=2|M) + ... = 1
P(M1) + P(M2) << 1
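As a sketch of averaging over the number of clusters within one mixture model: BIC approximates -2 times the log marginal likelihood, so (under a uniform prior over K) BIC differences give rough posterior weights over K. Toy data with two well-separated blobs:

```python
# Approximate P(N=k | M, data) via BIC weights over candidate cluster counts.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-3, 1, (40, 2)),   # blob one
               rng.normal(3, 1, (40, 2))])   # blob two

bics = np.array([GaussianMixture(k, n_init=3, random_state=0).fit(X).bic(X)
                 for k in range(1, 6)])
w = np.exp(-(bics - bics.min()) / 2)  # unnormalised weights, exp(-BIC/2)
w /= w.sum()                          # normalised so the weights sum to 1
```

Here the weight vector concentrates on K=2; you could then average predictions over K with these weights rather than committing to a single K.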
Is the number of clusters even a hyperparameter? Wiki says that hyperparameters are parameters of the prior distribution. What do you think?
Great explanation. I would like to add that held-out data is often used in Bayesian learning too -- for example, in cases where you intentionally over-specify the model (adding more parameters than might be needed) because you don't really know what the best model might be. Inference then continues for as long as the likelihood on held-out data keeps increasing. An example is gesture recognition for the Kinect. If anyone finds this info useful, I also recommend the Coursera course on Probabilistic Graphical Models.
Yes the London market is a bit nuts, but if you're paying anything close to 25k/year (that's 480/week!) for a 2 bed in a "not great" part of London then either you have very high standards or you're being seriously ripped off.
The coloured dots are cute and all, but if the goal is to make the relationship between salary and union membership visually apparent, some more traditional visualisations might have made this clearer -- for example, boxplots of salary broken down by union membership.
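For example, with pandas (a sketch on made-up salary data, since I don't have the article's dataset -- the `salary`/`union` column names are my invention):

```python
# Sketch: salary distribution broken down by union membership as boxplots.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "salary": rng.normal(50_000, 8_000, size=200),
    "union": rng.choice(["member", "non-member"], size=200),
})
ax = df.boxplot(column="salary", by="union")
ax.figure.savefig("salary_by_union.png")
```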
So with a few minor complications, convexity generalises to Riemannian manifolds like the surface of the Earth. You need to replace "straight line" with "minimising geodesic", i.e. shortest path; these don't depend on the choice of coordinate chart, just on the Riemannian manifold structure (which includes an inner product, hence a metric).
I wonder if you could prove this "probability of a random line segment violating convexity" definition equivalent to something given in terms of a ratio of different areas like the area to convex hull area suggestion below.
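A Monte Carlo sketch of that definition, for a hypothetical L-shaped region (a truly convex shape would score 1; the L-shape, everything here, is my own stand-in example):

```python
# Estimate "probability that a random segment between two random interior
# points stays inside the shape" -- a candidate convexity measure.
import numpy as np

rng = np.random.default_rng(2)

def inside(p):
    """L-shape: unit square minus its upper-right quarter."""
    x, y = p
    in_square = 0 <= x <= 1 and 0 <= y <= 1
    in_notch = x > 0.5 and y > 0.5
    return in_square and not in_notch

def sample_point():
    while True:  # rejection sampling from the bounding box
        p = rng.uniform(0, 1, size=2)
        if inside(p):
            return p

def convexity(n_pairs=2000, n_checks=20):
    hits = 0
    for _ in range(n_pairs):
        a, b = sample_point(), sample_point()
        ts = np.linspace(0, 1, n_checks)  # points along the segment
        if all(inside(a + t * (b - a)) for t in ts):
            hits += 1
    return hits / n_pairs

print(convexity())  # well below 1 for the L-shape
```

Comparing this estimate with the area / convex-hull-area ratio on a family of shapes would be one empirical way to probe whether the two notions agree.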
Would you not take the mean height above sea level of all points along the border and divide that by the mean height above sea level of the entire state (which I guess is really an average over some sampling of points)?
> A Novel Method for Applying a Trivial Modification of an Already-Known Algorithm to Some Type of Specific Data
Papers sharing empirical findings on applications of existing research can be very useful to those of us also, you know, looking to apply said research. Don't underestimate the amount of value (and legwork!) involved in figuring out how to adapt and apply theoretical work to real problems in some CS-related fields.
Obviously that kind of thing isn't accepted at the top conferences, so you know where not to look if you're not interested in it.
On the subject of scraping data from OCR'd tables:
I heard from a colleague who moved into finance that there's a mini arms race going on between some funds(?) who are subject to regulatory requirements to release financial performance metrics but for a variety of reasons would rather not (and certainly would rather not make the data machine-readable), and other hedge funds who want to run automated trading strategies off said released figures.
They keep obfuscating the tables to make them harder and harder to parse algorithmically while still remaining theoretically human-readable.