The point of a competition is to meet specific parameters as well as possible and push the boundaries of what can be done. It's not meant to create a "daily driver".
I realize he argued against this with the coin flip test, but that is why you'd ideally want to have many of these competitions over time. If you start to see the same names popping up at the top regularly, you know there's some sort of significance to them. Teams would ultimately want to trend towards whatever wins competitions most consistently, so they'd want to rely on models they think are the most likely to perform in a real world test. They wouldn't want to simply rely on a coin flip.
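To make that concrete, here's a quick toy simulation (all numbers made up: 100 equally skilled teams except one slightly better one, plus per-competition evaluation noise):

```python
import numpy as np

rng = np.random.default_rng(0)

n_teams, n_competitions = 100, 50
true_skill = np.full(n_teams, 0.80)   # everyone equally good...
true_skill[0] = 0.81                  # ...except one genuinely better team

noise_sd = 0.01  # per-competition evaluation noise
scores = true_skill + rng.normal(0.0, noise_sd, size=(n_competitions, n_teams))
winners = scores.argmax(axis=1)

print("better team's win rate:", (winners == 0).mean())  # far above 1/100
print("number of distinct winners:", len(set(winners)))  # luck still spreads wins
```

In any single competition the better team can easily lose, but over many competitions its name keeps popping up at the top, which is exactly the signal you'd look for.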
And in large competitions, you have the chance of batching the top performers together and seeing what is common between them. Presumably there's a reason these models pan out over the rest in aggregate; are they worth pursuing a bit?
I think we agree at the end though, even within my analogy. A huge value of F1 racing is the publicity the teams give their sponsors. They might learn some information that can be pushed down to their consumer vehicles, but it's marginal compared to team winnings and the value of saying "See? Our engineers are the best".
I don't know if I necessarily agree with that, but there is definitely a danger in evaluating models purely based on their predictive power when many models are being evaluated on a common dataset - which is exactly the mainstream practice in the world of deep learning research - and it is therefore wise to be wary, not just as individuals but as a community.
What the picture was trying to say is that, within a given year, the "winner" becomes less likely to be truly better than the second-place team. AlexNet was clearly better than the alternatives, even with Bonferroni-adjusted significance thresholds. Less so by 2016/17.
I'm writing a follow-up on ImageNet in particular to address some of the nuance. It is very clearly not a representative example of ML competitions, but the same effects still apply to some extent (imo).
My understanding is that the context is "usable models in clinical setting". Am I reading it wrong?
Could this just surface the teams that submitted the most models? Maybe some sort of wins-per-submission score could help with this?
When did someone last really pick it apart in its operational environment?
To elaborate for the other commenters: vaishaalshankar's team has created a new ImageNet evaluation dataset from scratch, and they observed that the leaderboard positions of popular image recognition models didn't change much when switching to the new evaluation. The actual performance of the models decreased significantly, but without affecting the ranking.
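For intuition, the check this amounts to is a rank correlation between the two evaluations; a tiny sketch with hypothetical accuracy numbers (the paper's actual numbers differ):

```python
from scipy.stats import spearmanr

# Hypothetical accuracies for five models on the original test set and on
# the newly collected one: absolute numbers drop, ordering barely moves.
orig_acc = [0.76, 0.74, 0.71, 0.69, 0.66]
new_acc  = [0.63, 0.62, 0.58, 0.55, 0.53]

rho, p = spearmanr(orig_acc, new_acc)
print(f"rank correlation: {rho:.2f}")  # 1.00 here: same leaderboard order
```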
The OP starts from a not very controversial claim: there's a good chance that the winner of a Kaggle competition is not actually better than any of the other top k contestants, for quite large values of k. But then he completely overplays his hand, and by the time he gets to talk about ImageNet, he makes claims that were actually falsified by vaishaalshankar's paper.
That paper shows that ImageNet classifiers retain their ranking on data drawn from the same distribution as the training set. That isn't even close to the same thing as generalising to real-world data. Medical AI shows us over and over again that truly out-of-distribution, unseen data (external validation) is a completely different challenge from simply drawing multiple test sets from your home clinic.
Again, I don't actually think ImageNet was as problematic as other competitions, but there is better evidence for that (not the least of which is that for the first half of ImageNet's life, the differences between models were large, the cumulative number of tests was fairly small, and the test set was huge: i.e. what I wrote supports ImageNet as fairly reliable).
Not the same distribution, it's new data collected and processed according to the same recipe. A quite different distribution, demonstrated by the fact that the accuracy numbers drop sharply. That's why it's so surprising that the rankings do not change that much. (Okay, in principle, a possible explanation is that it is the exact same distribution, with a fixed percentage of mislabeled or impossibly hard-to-label datapoints added. Appendix B2 of the paper deals with this possibility.)
In any case, I fully agree that this kind of generalization is still much easier than generalizing to real world data.
> Again, I don't actually think ImageNet was as problematic as other competitions, but there is better evidence for that (not the least of which is that for the first half of ImageNet's life, the differences between models were large, the cumulative number of tests was fairly small, and the test set was huge: i.e. what I wrote supports ImageNet as fairly reliable).
CIFAR-10 is basically the opposite of your list of requirements. Train set small, test set small, test set public, small number of labels, grid searched to death. And yet, look at the CIFAR-10 graph from that paper. The exact same pattern as ImageNet.
> Of course, you can’t submit more than one model. That would be cheating. One of the models could perform well, the equivalent of getting 8 heads with a fair coin, just by chance.
> Good thing there is a rule against submitting multiple models, or any one of the other 99 participants and their 99 models could win, just by being lucky.
I wonder what the author must think of poker tournaments. Even assuming there is luck involved, unless all of the models are equally bad (which would be surprising), teams that produce better models should win much more than their fair share, where the fair share is 1/N and N is the total number of submissions.
But let's say that the author is correct that ML competitions are mostly luck. That is a testable hypothesis: in particular, we would expect little to no correlation between the credentials of the competitors and their ranking in the competition. Is that actually the case? Do unknown individuals who have just started doing machine learning win on their first Kaggle submission? If the author's hypothesis is correct, one would expect that to happen fairly often, and one would expect that even highly expert competitors win approximately (only) their fair share.
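That hypothesis is cheap to test given the data; a sketch with hypothetical (prior rating, final rank) pairs:

```python
from scipy.stats import spearmanr

# Hypothetical data: each competitor's prior rating and their final rank
# in one competition (rank 1 is best).
prior_rating = [2300, 2150, 1900, 1700, 1500, 1200, 1000, 800]
final_rank   = [   2,    1,    4,    3,    6,    5,    8,   7]

rho, p = spearmanr(prior_rating, final_rank)
print(f"rho = {rho:.2f}, p = {p:.4f}")
```

Under the "mostly luck" hypothesis, rho should hover around zero; a strong negative correlation (better rating, better rank) would falsify it.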
What they're saying, as far as I understand it, is that the best k teams (where k << N) each win roughly 1/k of the time.
> "If one or both of the sample proportions are close to 0 or 1 then this approximation is not valid and you need to consider an alternative sample size calculation method."
0.86 is fairly close to 1. And they're not proportions but rather averages of Dice coefficients. The statistics used here seem very suspicious. I'd like to see a Monte Carlo simulation or at least the assumptions / derivation of the formula used.
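Here's the sort of Monte Carlo I mean, with loudly made-up assumptions (per-case Dice scores drawn from a Beta distribution around each model's true mean; real Dice distributions are messier, and the concentration, gap, and test size are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: per-case Dice scores are Beta-distributed around each model's
# true mean (concentration=20 is a guess).
def dice_sample(true_mean, n_cases, concentration=20):
    return rng.beta(true_mean * concentration,
                    (1 - true_mean) * concentration, n_cases)

n_cases, n_trials = 3000, 2000
mean_a, mean_b = 0.8614, 0.8600  # a 0.0014 gap, like the competition's

# How often does the truly worse model post the better test-set average?
upsets = sum(dice_sample(mean_b, n_cases).mean()
             > dice_sample(mean_a, n_cases).mean()
             for _ in range(n_trials))
print(f"worse model 'wins' {upsets / n_trials:.0%} of simulated test sets")
```

Under these assumptions the worse model tops the leaderboard in roughly a quarter of simulated test sets, which is the author's point, but derived from stated assumptions rather than a dubious proportion formula.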
The post also doesn't draw on any actual reported experiences of either competitors or hosts of competitions. Generally, in post-competition wrap-ups there's a lot of information sharing, and very significant development of understanding comes about.
To take the ImageNet example, which the post claims is a likely case of overfitting: each year's ImageNet results led to very significant advances in our understanding of image classification, which are today very widely used in industry and research.
Good ResNet models trained on ImageNet (the good parts, not just people) tend to produce state-of-the-art results for almost every transfer-learning domain they're applied to.
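For anyone who hasn't seen it, the basic recipe is a few lines in modern torchvision (a minimal sketch, assuming torchvision >= 0.13 for the weights API; the layer choices and class count are illustrative):

```python
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained ResNet-50 and freeze the backbone.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for param in model.parameters():
    param.requires_grad = False

# Swap the classification head for the new task (10 classes is illustrative).
model.fc = nn.Linear(model.fc.in_features, 10)
# ...then fine-tune model.fc (or unfreeze later blocks) on the target dataset.
```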
No statistician would do what I did for a formal publication, but I think what I did gets the point across.
But I don't think it's fair to complain about $30,000 prizes being awarded to first rather than second place in a specific competition without doing at least a little checking of whether that was actually the case. And the article kind of reads like cynicism that all machine learning is a waste, and that all the algorithms are just producing random numbers that happen to be right some of the time and win competitions by chance.
I've fallen victim to getting a Twitter bump and assuming that people know I'm not anti-ML.
The blog post is meant to be educational, not argumentative. Since it has gotten wider exposure, I'll do a follow-up to clarify my position on ImageNet.
Two reasons:

1. It's harder to do this than to optimise the bejeezus out of a dataset and throw the best one over the wall (and this is often done in good faith, complete with a whole gamut of "standard practices" that are in fact information leaks from test to train, like checking which features are informative on the test set before training; see the sketch below).

2. Folks don't know better, and best practice is sparsely documented and rarely taught. This is because there are almost no practitioners-turned-teachers in comp sci. I'm not running down the great people who do great work pushing the field, they are my betters, but the next generation is being misled into thinking that the skills they pick up in their ML classes will keep them gainfully employed in the long term.
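To spell out the leak in point 1, here's a minimal sketch of the anti-pattern next to the honest version (synthetic data; the exact gap varies by run, but the mechanism is the point):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data: 2000 features, only 5 informative, far fewer samples.
X, y = make_classification(n_samples=200, n_features=2000, n_informative=5,
                           random_state=0)

# Leaky: feature selection sees the test rows before the split.
X_sel = SelectKBest(k=20).fit_transform(X, y)
Xtr, Xte, ytr, yte = train_test_split(X_sel, y, random_state=0)
leaky = LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)

# Honest: selection lives inside the pipeline, fit on training rows only.
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
honest = make_pipeline(SelectKBest(k=20), LogisticRegression(max_iter=1000))
honest_score = honest.fit(Xtr, ytr).score(Xte, yte)

print(f"leaky: {leaky:.2f}  honest: {honest_score:.2f}")
```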
A common (industry) scenario is that you have a classifier that is 99.xx% or 5x.xx% accurate (it makes very few mistakes but they are costly, or it's a bit better than a coin flip, but we'll take that as it's what pays the mortgage), and we need to be absolutely, positively certain that the one we are fielding really works and is the best one we have (or it's homeless-a-go-go).
With the calculator:
99.79% vs 99.65% @ 99% confidence & 99% power → 68,457 examples
50.79% vs 50.65% @ 99% & 99% → 6,129,161 examples
which is why fancy models with marginal demonstrated improvement are often kept in the drawer, much to the frustration of ML folks who are sure it works and have proved that it will make the firm $10M a week.
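Those calculator numbers can be roughly reproduced with a standard two-proportion power analysis; a sketch using statsmodels (the calculator's exact formula likely differs, so expect the same ballpark rather than an exact match):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def total_n(p1, p2, alpha=0.01, power=0.99):
    """Total examples needed to separate two accuracies (two-sided z-test)."""
    h = proportion_effectsize(p1, p2)  # Cohen's h
    n_per_group = NormalIndPower().solve_power(
        effect_size=h, alpha=alpha, power=power, alternative='two-sided')
    return int(2 * n_per_group)

print(total_n(0.9979, 0.9965))  # on the order of 68k
print(total_n(0.5079, 0.5065))  # on the order of 6.1M
```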
Winners have often developed really interesting feature engineering strategies for a domain, as well as very well-organised tuning and stacking systems.
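For readers who haven't seen stacking: the production-grade versions are elaborate, but the core idea fits in a few lines of sklearn (a toy sketch; the base models and dataset are arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Base models' out-of-fold predictions become features for a meta-learner.
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=0)),
                ('svm', SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
)
print(cross_val_score(stack, X, y, cv=5).mean())
```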
Maybe a lot of the difference between 1st and 5th percentile is luck, but the top end of Kaggle offers really valuable insight into building effective models, even if you might want to simplify the implementation a bit in the commercial world.
The winning models are hardly ever used in production, but the set of skills needed to get a gold is.
Agree with a mild version of the author's statement: that for many competitions, the difference between top n spots is not statistically significant. However, the author's statement (as represented by this chart https://lukeoakdenrayner.files.wordpress.com/2019/09/ai-comp...) is far too strong.
The actual best model may not always win, but will typically be in the top 0.1%.
There are people on this thread who have poked holes in the author's sample size calculator (I'm not going to rehash that).
But an empirical observation: the same top ranked Kagglers consistently perform well in competition after competition.
You can see this by digging through the profiles of top-ranked Kagglers (https://www.kaggle.com/rankings), or by looking at competition leaderboards. For example, the leaderboard screenshot the author shared in the post (https://lukeoakdenrayner.files.wordpress.com/2019/09/pneumo-...) shows that 11 of the 13 top performers are Masters and Grandmasters, which puts them among the top-ranked ~1.5K members of our community of 3.4M data scientists (orange and gold dots under the profile pictures indicate Master and Grandmaster rank).
I actually think the author's headline is often correct: there are many cases where machine learning competitions don't produce useful models. But for a completely different reason: Competitions sometimes have leakage.
As a funny example: I remember we were once given a dataset to predict prostate cancer from ~300 variables. One of the variables was "had prostate cancer surgery". Turned out that was a very good predictor of prostate cancer ;). Thankfully that was an example where we caught the leakage. Unfortunately there are cases where we don't catch the leakage.
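A cheap sanity check that would have caught that one: scan for any single feature that nearly predicts the label on its own (a hedged sketch; `flag_leaky_features` is a made-up helper name, and a high single-feature AUC is a smell, not proof of leakage):

```python
from sklearn.metrics import roc_auc_score

def flag_leaky_features(X, y, names=None, threshold=0.95):
    """Flag columns whose single-feature AUC against the label is suspiciously high."""
    suspects = []
    for j in range(X.shape[1]):
        auc = roc_auc_score(y, X[:, j])
        auc = max(auc, 1.0 - auc)  # direction of the association doesn't matter
        if auc > threshold:
            suspects.append(names[j] if names is not None else j)
    return suspects

# Usage (hypothetical): flag_leaky_features(X, y, names=feature_names)
# A column like "had prostate cancer surgery" would score near 1.0.
```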
Of course a title like, "Here's why you can't directly deploy ML competition models into a production environment" doesn't grab as many clicks.
The article makes several good points. But just because the testing isn’t sufficient to prove that the winner didn’t just get lucky, that doesn’t prove that the winner did just get lucky.
If almost every competition on Kaggle has a winner that is not significantly better than the bulk of the field, then that pattern is proof enough. Chance correlations leading you to not reject the null can only take you so far.
What you could do is actually describe the kinds of errors the network makes: in the CT example, false positives, false negatives, and wrong diagnoses. We can try to analyze what the network is detecting, rather than accept a result on some test set as real.
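Concretely, that starts with something as mundane as a per-class error breakdown (a sketch with made-up labels for a three-way CT-style task):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels: 0 = normal, 1 = finding A, 2 = finding B.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 0]
y_pred = [0, 1, 0, 1, 0, 2, 2, 2, 1, 0]

# Rows = truth, columns = prediction: the off-diagonal cells separate
# false positives, false negatives, and wrong-diagnosis confusions.
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=['normal', 'A', 'B']))
```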
The millions of trials is an overstatement, but indeed a few hundred thousand are needed to actually discern a winner, presuming the network did not cheat by focusing on, say, population statistics (e.g. certain cranium sizes being more likely to present with problems). Relying on population statistics derived from a small sample (even if representative, which it's not) is very risky...
The earlier example says that the difference between winners in an arbitrarily picked Kaggle competition was "0.0014". Sure, I agree, that seems small. But then this random diagram about image classification says of the Google model that "the improvement year on year slows (the effect size decreases)". That's not even true. The effect is exponential!
2011 → 2012: 38.6% improvement (labelled "Reliable Improvement")
2014 → 2015: 34.9% improvement (labelled "Probably Overfitting")
Are these really the thoughts of someone well versed in statistics? I get the feeling they are just upset they lost a competition and decided to rant about it (as you can see from the aggressive use of memes). This is not a well-thought-out argument against ML competitions, but you might be fooled into thinking it is, because it contains just enough discussion of statistics that you might not notice it doesn't hold up.
That is the main problem, and the lack of systems thinking is on your side. There is a strong pressure to cheat and overfit. Sharing in fact makes this even stronger.
We've had some fun with that when trying to use ML for something as complex as music rhythm envelope extraction (which is easy in comparison to a CT task).
The best results approached 90% accuracy on the big suite, but real-world results were closer to 40%. A slightly worse solution reliably did 70% (and it wasn't even a neural network, plus it was possible to improve).
General best approaches sometimes indeed work, but sometimes (often?) they are overfitted in architecture, not just in dataset.
Solid case in point: the U-Net came to prominence from a medical Kaggle competition. Was it a "useful" model? The author might not be wrong in saying the model wouldn't work as well in the wild, but I would definitely say it was useful. The U-Net is still a very commonly used architecture.
By being lucky. And with a large enough number of solutions, at least one solution will almost surely be lucky.
Besides that, I'm very curious about who you are and what you do. Willing to reveal more?
I’m kind of an introvert so I hate hackathons, but I did one and had about 75 people code for a day. My boss wasn’t happy, but I was surprised at the fun culture and the connections created across teams. Five years later, people still mention something they learned, a technique they use, or someone they met that they continue to work with.
I don’t think these competitions are the best means for solving specific problems, though I do think they are part of a good portfolio.
I think that they are valuable for other purposes like communication and interoperability.
The competition and test set will still have hidden biases due to the ontology you use.
Sufficient optimization pressure always eventually overcomes your bias-control metrics, in terms of actual utility rather than the metrics themselves.
It's well known in cryptography and security that all abstractions are leaky.
I'm always for examining replicability, but the current paradigm to me seems misguided and this articulates some of the reasons very well.