Hacker News
AI competitions don’t produce useful models (lukeoakdenrayner.wordpress.com)
199 points by baylearn on Sept 19, 2019 | 62 comments

This is a bit like saying "F1 doesn't produce useful cars", isn't it?

The point of a competition is to meet specific parameters as well as possible and push the boundaries of what can be done. It's not meant to create a "daily driver".

I realize he argued against this with the coin flip test, but that is why you'd ideally want to have many of these competitions over time. If you start to see the same names popping up at the top regularly, you know there's some sort of significance to them. Teams would ultimately want to trend towards whatever wins competitions most consistently, so they'd want to rely on models they think are the most likely to perform in a real world test. They wouldn't want to simply rely on a coin flip.

And in large competitions, you have the chance of batching the top performers together and seeing what is common between them. Presumably there's a reason these models pan out over the rest in aggregate; are they worth pursuing a bit?

I think we agree at the end though, even within my analogy. A huge value of F1 racing is the publicity the teams give their sponsors. They might learn some information that can be pushed down to their consumer vehicles, but it's marginal compared to team winnings and the value of saying "See? Our engineers are the best".

The article isn't actually that trivial; it makes a very provocative and bold statement towards the end, essentially declaring that all stated progress in image classification in the last 5 years is questionable.

I don't know if I necessarily agree with that, but there is definitely a danger in evaluating models purely based on their predictive power when many models are being evaluated on a common dataset - which is exactly the mainstream practice in the world of deep learning research - and it is therefore wise to be wary, not just as individuals but as a community.

Hi, author here. I didn't actually mean to suggest that the last 5 years of performance improvement could be spurious. That clearly isn't true. I use resnets/densenets etc in my day to day work!

What the picture was trying to say is that, within a given year, the "winner" becomes less likely to be truly better than the second place team. Alexnet was clearly better than the alternative, even with Bonferroni adjusted significance thresholds. Less so by 2016/17.

I'm writing a follow up on imagenet in particular to address some of the nuance. It is very clearly not a representative example of ML competitions, but the same effects still apply to some extent (imo).

We can't have a meaningful discussion based on titles, misunderstandings of the article's context, or overgeneralization.

My understanding is that the context is "usable models in clinical setting". Am I reading it wrong?

> Teams would ultimately want to trend towards whatever wins competitions most consistently

Could this just surface the teams that submitted the most models? Maybe some sort of wins-per-submission score could help with this?

Our company is very thankful for the models developed for a competition we held through Kaggle. The model is in our current release product, and is the current known state of the art for its task in accuracy / performance.

How sure are you that it works? I know that this sounds really flippant, but I have seen a number of well regarded models that when we / I took them to bits were discovered to actually not be doing better than dominant class prediction, and in a couple of cases had actually become arbitrary.

When did someone last really pick it apart in its operational environment?

Yeah, the model works. It's an individual identification task with many individuals, and we have lots of labeled data that we held out from the competition.

Is it a single model or stacking of multiple models?

There were several top contenders that used ensembles. We went with a single model (also in the top few) for reasons of integration cost. I don't actually remember which placed first in the competition.

https://arxiv.org/abs/1902.10811 is a useful counterpoint to this article's comments on ImageNet overfitting (esp e.g. §3.3, "Few Changes in the Relative Order").

One of the authors of the above paper here. There is a whole line of work that supports the notion that all progress we are making via this leaderboard (or competition) mechanism is in fact real progress -- and that we are not simply overfitting to the test set. This thread by Moritz Hardt does a good job laying out a few reasons why this may be the case: https://twitter.com/mrtz/status/1134158716251516928

I'd like to signal boost this; it's a very important line of research. Congrats on a groundbreaking paper! The first time I saw it, I was completely shocked by that perfect linear fit in Figure 1 of "Do ImageNet Classifiers Generalize to ImageNet?".

To elaborate for the other commenters: vaishaalshankar's team has created a new ImageNet evaluation dataset from scratch, and they observed that the leaderboard positions of popular image recognition models didn't change much when switching to the new evaluation. The actual performance of the models decreased significantly, but without affecting the ranking.
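To make "the accuracy drops but the ranking holds" concrete, here is a minimal sketch. The model names and accuracy figures are made up for illustration (they are not taken from the paper), and `spearman` is a hypothetical helper that assumes there are no tied scores.

```python
def ranks(values):
    """Rank values from 1 (largest) to n; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n**2 - 1))


# Hypothetical accuracies (illustrative only, not from the paper).
original_test = {"model_a": 0.76, "model_b": 0.79, "model_c": 0.81, "model_d": 0.83}
new_test = {"model_a": 0.63, "model_b": 0.67, "model_c": 0.70, "model_d": 0.72}

names = sorted(original_test)
rho = spearman([original_test[m] for m in names], [new_test[m] for m in names])
mean_drop = sum(original_test[m] - new_test[m] for m in names) / len(names)
print(f"rank correlation: {rho:.2f}, mean accuracy drop: {mean_drop:.3f}")
```

In this toy case every model loses more than ten points of accuracy, yet the rank correlation is 1 because the ordering is unchanged.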

The OP starts from a not very controversial claim: there's a good chance that the winner of a Kaggle competition is not actually better than any of the other top k contestants, for quite large values of k. But then he completely overplays his hand, and by the time he gets to talk about ImageNet, he makes claims that were actually falsified by vaishaalshankar's paper.

I don't actually think imagenet is anywhere near as susceptible to crowd based overfitting as most kaggle competitions, but I don't actually think that paper falsifies the claim that it is.

That paper shows that imagenet classifiers retain their ranking on data drawn from the same distribution as the training set. That isn't even close to the same thing as generalising to real world data. Medical AI shows us over and over again that truly out of distribution unseen data (external validation) is a completely different challenge to simply drawing multiple test sets from your home clinic.

Again, I don't actually think imagenet was as problematic as other competitions, but there is better evidence for that (not the least of which is that for the first half of imagenet's life, the differences in models were large, the number of tests was cumulatively fairly small, and the test set was huge: ie what I wrote supports imagenet as fairly reliable).

> That paper shows that imagenet classifiers retain their ranking on data drawn from the same distribution as the training set. That isn't even close to the same thing as generalising to real world data.

Not the same distribution, it's new data collected and processed according to the same recipe. A quite different distribution, demonstrated by the fact that the accuracy numbers drop sharply. That's why it's so surprising that the rankings do not change that much. (Okay, in principle, a possible explanation is that it is the exact same distribution, with a fixed percentage of mislabeled or impossibly hard-to-label datapoints added. Appendix B2 of the paper deals with this possibility.)

In any case, I fully agree that this kind of generalization is still much easier than generalizing to real world data.

> Again, I don't actually think imagenet was as problematic as other competitions, but there is better evidence for that (not the least of which is that for the first half of imagenet's life, the differences in models were large, the number of tests was cumulatively fairly small, and the test set was huge: ie what I wrote supports imagenet as fairly reliable).

CIFAR-10 is basically the opposite of your list of requirements. Train set small, test set small, test set public, small number of labels, grid searched to death. And yet, look at the CIFAR-10 graph from that paper. The exact same pattern as ImageNet.

> Now imagine you aren’t flipping coins. Imagine you are all running a model on a competition test set. Instead of wondering if your coin is magic, you instead are hoping that your model is the best one, about to earn you $25,000.

> Of course, you can’t submit more than one model. That would be cheating. One of the models could perform well, the equivalent of getting 8 heads with a fair coin, just by chance.

> Good thing there is a rule against it submitting multiple models, or any one of the other 99 participants and their 99 models could win, just by being lucky

I wonder what the author must think of Poker tournaments. Even assuming there is luck involved, unless all of the models are equally bad (which would be surprising) teams that produce better models should win much more than their fair share, where fair share is 1/N when N is the total number of submissions.

But let's say that the author is correct that ML competitions are mostly luck. That is a testable hypothesis - in particular, we would expect little to no correlation between the credentials of the competitors and their ranking in the competition. Is that actually the case? Do unknown individuals who just started doing Machine Learning win on their first Kaggle submission? If the author's hypothesis is correct, one would expect that to happen fairly often, and one would expect that even highly expert competitors should win approximately (only) their fair share.

I don't think the author is saying that the best team only wins 1/N of the time, where N is the total number of participants. Far from it.

What they're saying, as far as I understand it, is that the best k teams (where k << N) each win roughly 1/k of the time.
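That reading is easy to check with a small simulation. The sketch below assumes k models with exactly the same true accuracy, each scored on its own independent draw of test cases; the values of `k`, `true_acc`, and `n_test` are arbitrary choices for illustration, not numbers from the article.

```python
import random
from collections import Counter


def simulate_winners(k=5, true_acc=0.86, n_test=300, n_competitions=1000, seed=0):
    """Count how often each of k equally good models tops the leaderboard."""
    rng = random.Random(seed)
    wins = Counter()
    for _ in range(n_competitions):
        # Each model's score is the number of test cases it gets right.
        scores = [
            sum(rng.random() < true_acc for _ in range(n_test)) for _ in range(k)
        ]
        best = max(scores)
        # Break ties at random so no model is favoured by its index.
        winner = rng.choice([i for i, s in enumerate(scores) if s == best])
        wins[winner] += 1
    return [wins[i] / n_competitions for i in range(k)]


win_rates = simulate_winners()
print([round(r, 2) for r in win_rates])
```

With identical true accuracy, each model's win rate should hover around 1/k. Giving one model a genuinely higher `true_acc` would be the natural next experiment.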

While there is variance in how my models do against a test set, it's highly unlikely that my rank 10 model is going to dethrone a #1 rank model, or that a rank 1k model is going to beat mine. It's possible, but only because I overfit the public leaderboard, and good ML practices help prevent that (for example, if a change improves your cross-validation score but does worse on the public leaderboard, be inclined to trust your cross-validation).

I guess the calculator he used is https://select-statistics.co.uk/calculators/sample-size-calc.... The calculator includes this caveat:

> "If one or both of the sample proportions are close to 0 or 1 then this approximation is not valid and you need to consider an alternative sample size calculation method."

0.86 is fairly close to 1. And they're not proportions but rather averages of Dice coefficients. The statistics used here seem very suspicious. I'd like to see a Monte Carlo simulation or at least the assumptions / derivation of the formula used.

You're spot on - the calculations in the post are totally wrong. For instance, a Dice coefficient is itself the average of hundreds of thousands of observations (each pixel in an image), so you can't just treat the coefficient as a single data point. However you also can't just take a standard error, since the points are highly spatially correlated.
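For readers unfamiliar with the metric, here is a minimal sketch of a Dice coefficient for binary masks, showing that one score summarises many pixel-level comparisons. The function name, the flattened-list representation, and the convention for two empty masks are assumptions for illustration, not the competition's scoring code.

```python
def dice_coefficient(pred, truth):
    """Dice score for two equal-length binary masks (flattened pixel lists)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    overlap = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Convention (an assumption here): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2 * overlap / total


# Toy 1-D "image": neighbouring pixels tend to share a label, which is why
# per-pixel observations cannot be treated as independent samples.
truth = [0, 0, 1, 1, 1, 1, 0, 0]
pred = [0, 1, 1, 1, 1, 0, 0, 0]
print(dice_coefficient(pred, truth))  # 3 overlapping pixels, 4 + 4 positives -> 0.75
```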

The post also doesn't draw on any actual reported experiences of either competitors or hosts of competitions. Generally in post competition wrap-ups there's a lot of information sharing and very significant development of understanding that comes about.

To take the imagenet example that's claimed in the post as being a likely example of over-fitting: each year's imagenet results led to very significant advances in our understanding of image classification, which are today very widely used in industry and research.

Have you actually tried to use one of those networks on a decent real dataset? They're making common category errors. ImageNet roulette is an interesting fun example of what happens if you feed an overfitted network with real images. Such as misclassifying children. :)

ImageNet Roulette deliberately uses a terrible categorization scheme that has long been acknowledged as so poor as to not admit meaningful results in order to make the highly political point that ML should never be applied to people. There's a reason most people scrub that whole piece of the taxonomy before training.

Good Resnet models trained on ImageNet (the good parts, not just people) tend to result in state of the art results for almost every transfer-learning domain they're applied to.

Hi, author here. There are a range of ways the estimates can be improved, although many require data that isn't available. The main point is that having a ballpark idea of how reliable your results are is good, and you can achieve that with this sort of simple napkin maths.

No statistician would do what I did for a formal publication, but I think what I did gets the point across.

Yes, it's a very thought-provoking article. I'm sure there are many competitions on Kaggle that were won due to the testing/training splits or other incidental choices rather than better machine learning.

But I don't think it's fair to complain about $30,000 prizes being awarded to first rather than second place in a specific competition without doing at least a little checking of whether that was actually the case. And the article kind of reads like cynicism that all machine learning is a waste, and all the algorithms are just producing random numbers that randomly happen to be right some of the time and win the competition by random chance.

All I can really say is that my usual readers understand that I am pro-ML; in fact I'm probably more gung ho about the potential of deep learning than many of my compatriots.

I've fallen victim to getting a Twitter bump, and assuming that people know I'm not anti-ML.

The blog post is meant to be educational, not argumentative. Since it has got wider exposure I'll do a follow up to clarify my position on imagenet.

It's a great post; I love ML, I've spent many years trying to get value out of it, and sometimes succeeding. But folks are applying it without any of the checks and balances that are needed to produce real value in a sustained way.

Two reasons:

1. It's harder to do this than to optimise the behooozas out of a dataset and throw the best model over the wall. (This is often done in good faith, complete with a whole gamut of "standard practices" that are in fact information leaks from test to train, like checking which features are informative on the test set before training.)

2. Folks don't know better, and best practice is sparsely documented and rarely taught. This is because there are almost no practitioners turned teachers in comp sci. I'm not running down the great people who do great work pushing the field (they are my betters), but the next generation is being misled into thinking that the skills they are picking up in their ML classes are going to keep them gainfully employed in the long term.

This is the problem with the internet and links. You come in with inappropriate context and make judgements based on single pages of text.

If you look at the calculator and plug some values of your own you can see that the absolute value of the numbers is not what is generating the large sample sizes. The problem that is being shown up is the small gap in predictive power of the two classifiers being differentiated.

A common (industry) scenario is that you have a classifier that is 99.xx or 5x.xx accurate (it makes very few mistakes but they are costly, or it's a bit better than a coin flip, but we'll take that as it's what pays the mortgage), and we need to be absolutely positively certain that the one that we are fielding really works and is the best one we have (or homeless-a-go-go)

With the calculator:

99.79 vs 99.65 @ 99% confidence & 99% power -> 68457 examples

50.79 vs 50.65 @ 99% confidence & 99% power -> 6129161 examples

which is why fancy models with marginal demonstrated improvement are often kept in the drawer - much to the frustration of ML folks who are sure that it works and have proved that it will make the firm $10M a week.
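The figures quoted from the calculator look consistent with the textbook normal-approximation formula for comparing two proportions, n = (z_alpha/2 + z_beta)^2 * [p1(1 - p1) + p2(1 - p2)] / (p1 - p2)^2 per group. Whether the linked calculator uses exactly this formula is an assumption, and `sample_size_two_proportions` is a hypothetical helper name. The calculator's own caveat about proportions close to 0 or 1 applies to this sketch as well.

```python
import math
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, confidence=0.99, power=0.99):
    """Per-group sample size from the normal-approximation formula."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


print(sample_size_two_proportions(0.9979, 0.9965))
print(sample_size_two_proportions(0.5079, 0.5065))
```

The gap `p1 - p2` is squared in the denominator, so halving the gap roughly quadruples the required sample. That is the commenter's point: the size of the gap, far more than the absolute accuracy, drives these numbers.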

As someone who develops machine-learning models as a job, I can say that there are snippets from Kaggle kernels that run in minutes, and outperform months of manual effort by a business analyst.

Winners have often developed really interesting feature engineering strategies for a domain, as well as very well organised tuning and stacking systems.

Maybe a lot of the difference between 1st and 5th percentile is luck, but the top end of Kaggle is a really valuable insight into building effective models, even if you might want to simplify the implementation a bit in the commercial world.

That's exactly what the author is trying to say, I think. We shouldn't put so much emphasis on who got into first place (which is mostly determined by luck) but rather investigate all techniques used by the top 5th or 10th or whatnot percentile, which is meaningfully separated from the rest.

Given that a handful of names occupy Gold places in kaggle competitions, I would not call it luck. Given how hard it is to stay at the top on private set, I would not call it luck.

The winning models are hardly ever used in production, but the set of skills needed to get a gold is.

Kaggle CEO here.

Agree with a mild version of the author's statement: that for many competitions, the difference between top n spots is not statistically significant. However, the author's statement (as represented by this chart https://lukeoakdenrayner.files.wordpress.com/2019/09/ai-comp...) is far too strong.

The actual best model may not always win, but will typically be in the top 0.1%.

There are people on this thread who have poked holes in the author's sample size calculator (I'm not going to rehash that).

But an empirical observation: the same top ranked Kagglers consistently perform well in competition after competition.

You can see this by digging through the profiles of top ranked Kagglers (https://www.kaggle.com/rankings). Or by looking at competition leaderboards. For example the leaderboard screenshot the author shared in the post (https://lukeoakdenrayner.files.wordpress.com/2019/09/pneumo-...) shows 11 of the 13 top performers are Masters and Grandmasters, which puts them at the top ranked 1.5K members of our community of 3.4MM data scientists (orange and gold dots under the profile pictures indicate Master and Grandmaster rank).

I actually think the author's headline is often correct: there are many cases where machine learning competitions don't produce useful models. But for a completely different reason: Competitions sometimes have leakage.

To elaborate on leakage: it's a case where something in the training or test dataset wouldn't be available in a production setting.

As a funny example: I remember we were once given a dataset to predict prostate cancer from ~300 variables. One of the variables was "had prostate cancer surgery". Turned out that was a very good predictor of prostate cancer ;). Thankfully that was an example where we caught the leakage. Unfortunately there are cases where we don't catch the leakage.
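One cheap screen for this kind of leak is to check whether any single feature predicts the label almost perfectly on its own. The sketch below uses a tiny made-up dataset; the column names, the 0.95 threshold, and the helper `single_feature_accuracy` are all assumptions for illustration, not a description of any check Kaggle runs.

```python
def single_feature_accuracy(rows, feature, label):
    """Best accuracy from predicting the label with one binary feature alone."""
    agree = sum(row[feature] == row[label] for row in rows)
    # Allow for an inverted relationship (feature == 1 means label == 0).
    return max(agree, len(rows) - agree) / len(rows)


# Hypothetical records: "had_surgery" is only known after diagnosis.
rows = [
    {"age_over_60": 1, "had_surgery": 1, "cancer": 1},
    {"age_over_60": 1, "had_surgery": 0, "cancer": 0},
    {"age_over_60": 0, "had_surgery": 0, "cancer": 0},
    {"age_over_60": 1, "had_surgery": 1, "cancer": 1},
    {"age_over_60": 0, "had_surgery": 1, "cancer": 1},
    {"age_over_60": 0, "had_surgery": 0, "cancer": 0},
]

suspicious = [
    name
    for name in ("age_over_60", "had_surgery")
    if single_feature_accuracy(rows, name, "cancer") >= 0.95
]
print(suspicious)  # ['had_surgery']
```

A flagged feature is not proof of leakage, only a prompt to ask whether that value would really be available at prediction time.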

It seems like the author is saying two things. 1) The hold out method isn't robust enough and 2) There may not be much difference between the top performing models. Which, okay, there may be a point there. But that doesn't really support the thesis that competitions don't produce useful models. It's more like they don't produce models that can be deployed directly for clinical use. There's no reason the techniques used by the competitors can't contribute to something that is clinically useful.

Of course a title like, "Here's why you can't directly deploy ML competition models into a production environment" doesn't grab as many clicks.

The headline, apart from being clickbaity, is accepting the null hypothesis. Which is a STAT101 no no.

The article makes several good points. But just because the testing isn’t sufficient to prove that the winner didn’t just get lucky, it doesn’t prove that the winner did just get lucky.

There's nothing problematic about accepting the null hypothesis, it's just that instead of controlling for Type I error, you need to control for Type II error, i.e. ensure sufficient power.

Yeah, we can't disprove anything, yadda yadda.

If almost every competition on Kaggle has a winner that is not significantly better than the bulk of the field, then that is proof. Chance correlations leading to you not rejecting the null can only take you so far.

I think the point is we are left with uncertainty. Your prior should be that we don't know which competitor is best, and after the competition we are still unsure.

Isn't that the same thing as "not producing useful models"? Like, sure, some of the models may work, but unless you know which ones you can't make use of them.

Yes, very true, but if we're still unsure it may be worth testing them more, while if we've proven they don't work we can abandon them.

How do you test them more, with which nonbiased dataset that does not exist?

What you could do is actually describe the kind of errors the network makes. In the example of CT, false positive, false negative, wrong diagnosis. We can try to analyze what the network is detecting, rather than accept a result on some test set as real.

The millions of trials is an overstatement, but indeed a few hundred thousand are needed to actually discern a winner, presuming the network did not cheat by focusing on, say, population statistics - say, certain cranium sizes being more likely to present with problems. Relying on population statistics derived from a small sample (even if representative, which it's not) is very risky...

It's also possible that if you have a lot of models that all score very close to each other that they just ALL work.

There was a good discussion on Reddit on how the objective of Kaggle competitions from the competition creator's perspective isn't necessarily a useful model. https://www.reddit.com/r/MachineLearning/comments/d50lr3/d_w...

I like the diagram where it shows the improvement from "human" to "Google" and labels it as "probably overfitting". Not only does it look like the person is drawing an extra hurricane bubble to prove a point in a presidential style, but it's complete nonsense.

The earlier example says that the difference between winners in an arbitrarily picked Kaggle competition was "0.0014". Sure, I agree, seems small. But for this random diagram about image classification, which includes the Google model, the article claims that "the improvement year on year slows (the effect size decreases)". But that's not even true. The effect is exponential!

2011->2012: 38.6% "Reliable Improvement"

2014->2015: 34.9% "Probably Overfitting"

Are these really the thoughts of someone well versed in statistics? I get the feeling they are just upset they lost at a competition and decided instead to rant about it (as you can see from the aggressive use of memes). This is not a well thought out argument against ML competitions, but you might be fooled into thinking it was because it contains just enough discussion of statistics that you might not notice it doesn't hold up.

Sigh. This beautifully captures the tragedy of biostats and epidemiology, which are held hostage by a lack of systems thinking. Unlike, say, a clinical trial, ML competitions are not limited to the data available at the start of the competition; in fact, by having a fair measurement of performance there is a strong incentive to label and share more data and run checkpointed models. Further, since the goal is not “publishing” with arbitrary requirements for p-values and power, an empirical strategy is likely to provide much better long-term accountability and better models. Sadly, the cult of biostats is so deeply vested in publishing, rather than thinking end to end (designing a system as a whole), that any other model is quickly rejected by the community.

You're making a critical mistake. Why make a network for a competition that won't produce great results without major modifications on real data?

That is the main problem, and the lack of systems thinking is on your side. There is a strong pressure to cheat and overfit. Sharing in fact makes this even stronger.

We've had some fun with that when trying to use ML for something as complex as music rhythm envelope extraction. (Which is easy in comparison to CT test.)

Best results were approaching 90% accuracy on the big suite, but real results were closer to 40%. A slightly worse solution did reliably 70%. (And was not a neural network even, plus possible to improve.)

General best approaches sometimes indeed work, but sometimes (often?) they are overfitted in architecture, not even dataset.

My comment is really similar to what's already been said: this reads a bit like an "I don't really want to waste my time, so neither should you" post. I mean, nothing wrong with that, but it's a bit more negative and sour than I personally would like. Plus, I think the entire spirit of the competition was a bit lost on this person.

Solid case in point: the U-Net came to prominence from a medical kaggle competition. Was it a "useful" model? The author might not be wrong in saying the model wouldn't work as well in the wild, but I would definitely say it was useful. The U-Net is still a very commonly used architecture.

Huh? How do you "overfit" on the test set if you don't have the test set? And also to call a good ML result a "coin toss" shows a profound lack of understanding of what goes into such "coin tosses", and why solving practical problems with ML is an entirely different ballgame than training a classifier on imagenet (which is in itself pretty hard if you want SOTA results).

> How do you "overfit" on the test set if you don't have the test set?

By being lucky. And with a large enough number of solutions, at least one solution will almost surely be lucky.
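The arithmetic behind "almost surely" is short. As a sketch, assume each of N entries independently has probability p of a lucky result. Here "lucky" is taken to mean at least 8 heads in 10 fair coin flips and N = 100; both are illustrative assumptions rather than figures from the article.

```python
from math import comb


def prob_at_least(heads, flips, p_head=0.5):
    """Binomial tail: chance of at least `heads` heads in `flips` flips."""
    return sum(
        comb(flips, k) * p_head**k * (1 - p_head) ** (flips - k)
        for k in range(heads, flips + 1)
    )


def prob_any_lucky(p_single, n_entries):
    """Chance that at least one of n independent entries gets lucky."""
    return 1 - (1 - p_single) ** n_entries


p_single = prob_at_least(8, 10)
print(f"one entry: {p_single:.4f}")
print(f"any of 100 entries: {prob_any_lucky(p_single, 100):.4f}")
```

Real submissions are not independent (teams share data, kernels, and ideas), so this overstates the effect somewhat, but the direction holds: a rare fluke for one entry becomes close to certain for the field as a whole.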

I guess I'm extraordinarily "lucky" then. I don't participate in competitions, but I'm often "luckier" than entire teams of people working tirelessly for months. And I charge a lot of money for it.

I'm intentionally misunderstanding your comment now for academic reasons: it's great that you recognise the role of survival bias and luck in becoming successful at many things.

Besides that, I'm very curious about who you are and what you do. Willing to reveal more?

My comment on this would be that the winner of a particular competition might not be the best model (in fact they are likely to be sneaky, e.g. extract information about the distribution of the hold out set and tune for that), but competitions are a great way to get a survey of the state of the art by looking in general at the types of approaches used by the top competitors. There is definitely information there.

The post talks about coin flipping, or 0/1 classification. Many competitions use different scores however - multiclasses, learning to find bounding boxes of objects, etc. It is much less likely to find "good" answers on the test set by chance. I think the points in the article are important, but with this context become a non-issue, when a random answer is unlikely to be correct.

The article is not about models being indistinguishable from random classifiers, the difference there should be very significant even on the tasks it discussed. Instead, the problem originates from the small differences in test set performance between the top N models. While that difference may very well increase when moving from binary classification to a more technically involved regression task, that is by no means guaranteed, and the main points of the article still apply.

Not AI related, but I had a boss who thought hackathons were great because they would spontaneously produce great things in a day.

I’m kind of an introvert so I hate hackathons, but I did one and had about 75 people code for a day. My boss wasn’t happy but I was surprised at the fun culture and connections created across teams. Five years later people mention something they learned or a technique they use or someone they met that they continue to work with.

I don’t think these competitions are the best means for solving specific problems, though I think they are part of a good portfolio.

I think that they are valuable for other purposes like communication and interoperability.

It feels that the article demonstrates, by and large, the law of diminishing marginal utility is still at work in research communities, which makes sense. At some point you should call it a day and do something else.


The competition and test set will still have hidden biases due to the ontology you use.

Sufficient optimization pressure always eventually overcomes your bias control metrics, in the context of actual utility rather than other metrics.

It's well known in cryptography and security that all abstractions are leaky.

Well how much of AI research and development in general is actually reflecting on the human experience rather than spewing unholy algebra onto the skeleton of past discoveries?

There are so many things just so plain wrong about this (I attempted to respond, then had to stop), that I feel this post is more of an attempt to instill the frustration felt when the author attempted to compete and promptly got run over by some SotA- hungry boost-junkies from countries where the p-test is not part of the curriculum in schools. I really don't know how to constructively salvage this... Talk about the role of luck in games?

I think the hold-out paradigm is woefully overrated in part for similar reasons. It's always seemed odd to me to emphasize hold out samples when you know their asymptotic performance in the form of fit statistics.

I'm always for examining replicability, but the current paradigm to me seems misguided and this articulates some of the reasons very well.

If the problem is getting "lucky" on a test set, then it seems like you could do several rounds of splitting test/training set, retraining from scratch each time, and then take the median performance. Not great if training takes a long time, but it would at least conclusively answer the question.
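The procedure described here can be sketched in a few lines. The example below uses synthetic one-dimensional data and a deliberately trivial classifier (a threshold halfway between the two class means); both are stand-ins chosen for illustration, and `repeated_split_scores` is a hypothetical helper.

```python
import random
from statistics import mean, median


def make_data(n, rng):
    """Synthetic 1-D data: class 0 centred at 0.0, class 1 centred at 1.5."""
    data = []
    for _ in range(n):
        label = rng.randrange(2)
        data.append((rng.gauss(1.5 * label, 1.0), label))
    return data


def fit_threshold(train):
    """'Train' by placing a threshold midway between the two class means."""
    mean0 = mean(x for x, y in train if y == 0)
    mean1 = mean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2


def accuracy(threshold, test):
    """Fraction of test points on the correct side of the threshold."""
    return sum((x > threshold) == bool(y) for x, y in test) / len(test)


def repeated_split_scores(data, n_rounds=25, test_fraction=0.3, seed=0):
    """Reshuffle, re-split and retrain from scratch on every round."""
    rng = random.Random(seed)
    n_test = int(len(data) * test_fraction)
    scores = []
    for _ in range(n_rounds):
        shuffled = data[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        scores.append(accuracy(fit_threshold(train), test))
    return scores


data = make_data(400, random.Random(42))
scores = repeated_split_scores(data)
print(f"median accuracy: {median(scores):.3f}")
print(f"range across splits: {min(scores):.3f} to {max(scores):.3f}")
```

The spread between the best and worst split gives a rough sense of how much of a leaderboard gap could be down to the split alone. The splits overlap, so the scores are not independent, and this only works for whoever holds all the labelled data; competitors cannot re-split a hidden test set.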

They tend to produce overfitted models instead, because such competitions optimize in that direction.
