
AI competitions don’t produce useful models - baylearn
https://lukeoakdenrayner.wordpress.com/2019/09/19/ai-competitions-dont-produce-useful-models/
======
Sileni
This is a bit like saying "F1 doesn't produce useful cars", isn't it?

The point of a competition is to meet specific parameters as well as possible
and push the boundaries of what can be done. It's not meant to create a "daily
driver".

I realize he argued against this with the coin flip test, but that is why
you'd ideally want to have many of these competitions over time. If you start
to see the same names popping up at the top regularly, you know there's some
sort of significance to them. Teams would ultimately want to trend towards
whatever wins competitions most consistently, so they'd want to rely on models
they think are the most likely to perform in a real world test. They wouldn't
want to simply rely on a coin flip.

And in large competitions, you have the chance of batching the top performers
together and seeing what is common between them. Presumably there's a reason
these models pan out over the rest in aggregate; are they worth pursuing a
bit?

I think we agree at the end though, even within my analogy. A huge value of F1
racing is the publicity the teams give their sponsors. They might learn some
information that can be pushed down to their consumer vehicles, but it's
marginal compared to team winnings and the value of saying "See? Our engineers
are the best".

~~~
xenocyon
The article isn't actually that trivial; it makes a very provocative and bold
statement towards the end, essentially declaring that all stated progress in
image classification in the last 5 years is questionable.

I don't know if I necessarily agree with that, but there is definitely a
danger in evaluating models purely based on their predictive power when many
models are being evaluated on a common dataset - which is exactly the
mainstream practice in the world of deep learning research - and it is
therefore wise to be wary, not just as individuals but as a community.

~~~
lukeor
Hi, author here. I didn't actually mean to suggest that the last 5 years of
performance improvement could be spurious. That clearly isn't true. I use
resnets/densenets etc in my day to day work!

What the picture was trying to say is that, within a given year, the "winner"
becomes less likely to be truly better than the second place team. Alexnet was
clearly better than the alternative, even with Bonferroni adjusted
significance thresholds. Less so by 2016/17.

I'm writing a follow up on imagenet in particular to address some of the
nuance. It is very clearly not a representative example of ML competitions,
but the same effects still apply to some extent (imo).

------
probinso
Our company is very thankful for the models developed for a competition we
held through Kaggle. The model is in our current release product, and is the
current known state of the art for its task in accuracy / performance.

~~~
sgt101
How sure are you that it works? I know that this sounds _really_ flippant, but
I have seen a number of well-regarded models that, when we / I took them to
bits, turned out not to be doing better than dominant-class prediction, and in
a couple of cases had actually become arbitrary.

When did someone last really pick it apart in its operational environment?

~~~
probinso
Yeah, the model works. It's an individual-identification task with many
individuals, and we have lots of labeled data that we held out from the
competition.

------
ajtulloch
[https://arxiv.org/abs/1902.10811](https://arxiv.org/abs/1902.10811) is a
useful counterpoint to this article's comments on ImageNet overfitting (esp.
e.g. §3.3, "Few Changes in the Relative Order").

~~~
vaishaalshankar
One of the authors of the above paper here. There is a whole line of work that
supports the notion that all the progress we are making via this leaderboard
(or competition) mechanism is in fact real progress -- and that we are not
simply overfitting to the test set. This thread by Moritz Hardt does a good
job laying out a few reasons why this may be the case:
[https://twitter.com/mrtz/status/1134158716251516928](https://twitter.com/mrtz/status/1134158716251516928)

~~~
skinner_
I'd like to signal-boost this; it's a very important line of research.
Congrats on a groundbreaking paper! The first time I saw it, I was completely
shocked by that perfect linear fit in Figure 1 of "Do ImageNet Classifiers
Generalize to ImageNet?".

To elaborate for the other commenters: vaishaalshankar's team has created a
new ImageNet evaluation dataset from scratch, and they observed that the
leaderboard positions of popular image recognition models didn't change much
when switching to the new evaluation. The actual performance of the models
decreased significantly, but without affecting the ranking.

The OP starts from a not very controversial claim: there's a good chance that
the winner of a Kaggle competition is not actually better than any of the
other top k contestants, for quite large values of k. But then he completely
overplays his hand, and by the time he gets to talk about ImageNet, he makes
claims that were actually falsified by vaishaalshankar's paper.

~~~
lukeor
I don't actually think imagenet is anywhere near as susceptible to crowd based
overfitting as most kaggle competitions, but I don't actually think that paper
falsifies the claim that it is.

That paper shows that imagenet classifiers retain their ranking on data drawn
from the same distribution as the training set. That isn't even close to the
same thing as generalising to real world data. Medical AI shows us over and
over again that truly out of distribution unseen data (external validation) is
a completely different challenge to simply drawing multiple test sets from
your home clinic.

Again, I don't actually think imagenet was as problematic as other
competitions, but there is better evidence for that (not the least of which is
that for the first half of imagenet's life, the differences in models were
large, the number of tests was cumulatively fairly small, and the test set was
huge: ie what I wrote supports imagenet as fairly reliable).

~~~
skinner_
> That paper shows that imagenet classifiers retain their ranking on data
> drawn from the same distribution as the training set. That isn't even close
> to the same thing as generalising to real world data.

Not the same distribution, it's new data collected and processed according to
the same recipe. A quite different distribution, demonstrated by the fact that
the accuracy numbers drop sharply. That's why it's so surprising that the
rankings do not change that much. (Okay, in principle, a possible explanation
is that it is the exact same distribution, with a fixed percentage of
mislabeled or impossibly hard-to-label datapoints added. Appendix B2 of the
paper deals with this possibility.)

In any case, I fully agree that this kind of generalization is still much
easier than generalizing to real world data.

> Again, I don't actually think imagenet was as problematic as other
> competitions, but there is better evidence for that (not the least of which
> is that for the first half of imagenet's life, the differences in models
> were large, the number of tests was cumulatively fairly small, and the test
> set was huge: ie what I wrote supports imagenet as fairly reliable).

CIFAR-10 is basically the opposite of your list of requirements. Train set
small, test set small, test set public, small number of labels, grid searched
to death. And yet, look at the CIFAR-10 graph from that paper. The exact same
pattern as ImageNet.

------
landryraccoon
> Now imagine you aren’t flipping coins. Imagine you are all running a model
> on a competition test set. Instead of wondering if your coin is magic, you
> instead are hoping that your model is the best one, about to earn you
> $25,000.

> Of course, you can’t submit more than one model. That would be cheating. One
> of the models could perform well, the equivalent of getting 8 heads with a
> fair coin, just by chance.

> Good thing there is a rule against submitting multiple models, or any one
> of the other 99 participants and their 99 models could win, just by being
> lucky.

I wonder what the author must think of Poker tournaments. Even assuming there
is luck involved, unless all of the models are equally bad (which would be
surprising), teams that produce better models should win much more than their
fair share, where fair share is 1/N and N is the total number of submissions.

But let's say that the author is correct that ML competitions are mostly luck.
That is a testable hypothesis - in particular, we would expect little to no
correlation between the credentials of the competitors and their ranking in
the competition. Is that actually the case? Do unknown individuals who just
started doing Machine Learning win on their first Kaggle submission? If the
author's hypothesis is correct, one would expect that to happen fairly often,
and one would expect that even highly expert competitors should win
approximately (only) their fair share.
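
That's easy to sanity-check with a quick simulation. A sketch with made-up
numbers (100 teams, a 3,000-case test set, one team whose true accuracy is a
full point better than everyone else's; none of this is from the post):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up numbers: 100 teams scored once on the same 3,000-case test
# set; team 0 is genuinely better than everyone else.
n_teams, n_cases, n_comps = 100, 3000, 10_000
p = np.full(n_teams, 0.85)
p[0] = 0.86  # the genuinely better team

# Each row is one simulated competition, each column a team's score.
scores = rng.binomial(n_cases, p, size=(n_comps, n_teams))
win_rate = (scores.argmax(axis=1) == 0).mean()
print(f"better team's win rate: {win_rate:.2f} (fair share would be 0.01)")
```

Under these assumptions the better team wins far more than its 1/N fair
share, yet still loses most individual competitions to lucky runners-up.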

~~~
kqr
I don't think the author is saying that the best team only wins 1/N of the
time, where N is the total number of participants. Far from it.

What they're saying, as far as I understand it, is that the best k teams
(where k << N) each win roughly 1/k of the time.

------
Mathnerd314
I guess the calculator he used is [https://select-
statistics.co.uk/calculators/sample-size-calc...](https://select-
statistics.co.uk/calculators/sample-size-calculator-two-proportions/). The
calculator includes this caveat:

> _" If one or both of the sample proportions are close to 0 or 1 then this
> approximation is not valid and you need to consider an alternative sample
> size calculation method."_

0.86 is fairly close to 1. And they're not proportions but rather averages of
Dice coefficients. The statistics used here seem very suspicious. I'd like to
see a Monte Carlo simulation or at least the assumptions / derivation of the
formula used.
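
As a crude stand-in for that Monte Carlo (a sketch only: it treats each test
case as an independent Bernoulli draw, mirroring the calculator's
two-independent-proportions assumption, which per the caveat above is not
right for averaged Dice coefficients, and the test set size is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two models with *identical* true per-case success probability,
# scored on same-sized test sets: how big a leaderboard gap does
# sampling noise alone produce?
p_true = 0.86        # roughly the scores discussed in the post
n_cases = 3000       # assumed test set size
n_trials = 100_000

a = rng.binomial(n_cases, p_true, n_trials) / n_cases
b = rng.binomial(n_cases, p_true, n_trials) / n_cases
gap = np.abs(a - b)

print(f"95th percentile of chance-only gap: {np.quantile(gap, 0.95):.4f}")
```

With these assumptions the chance-only gap comes out around 0.017, so far
smaller observed differences would be indistinguishable from noise; a real
check would rerun this with the actual metric and test set size.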

~~~
jph00
You're spot on - the calculations in the post are totally wrong. For instance,
a Dice coefficient is itself the average of hundreds of thousands of
observations (each pixel in an image), so you can't just treat the coefficient
as a single data point. However you also can't just take a standard error,
since the points are highly spatially correlated.
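
One standard way around both problems (a sketch of the general technique, not
anything from the post) is to bootstrap over whole images rather than pixels,
so the spatial correlation within an image never gets split apart:

```python
import numpy as np

def dice(pred, truth):
    """Dice coefficient for one pair of binary masks."""
    inter = np.logical_and(pred, truth).sum()
    return 2 * inter / (pred.sum() + truth.sum())

def bootstrap_mean_dice_ci(per_image_dice, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean Dice score, resampling at
    the image level so correlated pixels always stay together."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_image_dice)
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    means = scores[idx].mean(axis=1)
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])
```

Comparing two models then means bootstrapping the paired per-image
differences, not the pooled pixels.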

The post also doesn't draw on any actual reported experiences of either
competitors or hosts of competitions. Generally in post competition wrap-ups
there's a lot of information sharing and very significant development of
understanding that comes about.

To take the imagenet example that's claimed in the post as being a likely
example of over-fitting: each year's imagenet results led to very significant
advances in our understanding of image classification, which are today very
widely used in industry and research.

~~~
AstralStorm
Have you actually tried to use one of those networks on a decent real dataset?
They make common category errors. ImageNet Roulette is an interesting, fun
example of what happens if you feed an overfitted network real images, such
as misclassifying children. :)

~~~
bermanoid
ImageNet Roulette deliberately uses a _terrible_ categorization scheme, one
that has long been acknowledged as too poor to admit meaningful results, in
order to make the highly political point that ML should never be applied to
people. There's a reason most people scrub that whole piece of the taxonomy
before training.

Good ResNet models trained on ImageNet (the good parts, not the people
categories) tend to produce state-of-the-art results in almost every
transfer-learning domain they're applied to.

------
oli5679
As someone who develops machine-learning models as a job, I can say that there
are snippets from Kaggle kernels that run in minutes and outperform months of
manual effort by a business analyst.

Winners have often developed really interesting feature-engineering strategies
for a domain, as well as very well organised tuning and stacking systems.

Maybe a lot of the difference between the 1st and 5th percentile is luck, but
the top end of Kaggle is a really valuable insight into building effective
models, even if you might want to simplify the implementation a bit in the
commercial world.

~~~
kqr
That's exactly what the author is trying to say, I think. We shouldn't put so
much emphasis on who got into first place (which is mostly determined by luck)
but rather investigate all techniques used by the top 5th or 10th or whatnot
percentile, which is meaningfully separated from the rest.

------
sibmike
Given that a handful of names occupy the gold places in Kaggle competitions, I
would not call it luck. Given how hard it is to stay at the top on the private
set, I would not call it luck.

The winning models are hardly ever used in production, but the set of skills
needed to get a gold is.

------
antgoldbloom
Kaggle CEO here.

Agree with a mild version of the author's statement: that for many
competitions, the difference between top n spots is not statistically
significant. However, the author's statement (as represented by this chart
[https://lukeoakdenrayner.files.wordpress.com/2019/09/ai-
comp...](https://lukeoakdenrayner.files.wordpress.com/2019/09/ai-
competition4-1.png?w=620&h=311)) is far too strong.

The actual best model may not always win, but will typically be in the top
0.1%.

There are people on this thread who have poked holes in the author's sample
size calculator (I'm not going to rehash that).

But an empirical observation: the same top ranked Kagglers consistently
perform well in competition after competition.

You can see this by digging through the profiles of top ranked Kagglers
([https://www.kaggle.com/rankings](https://www.kaggle.com/rankings)). Or by
looking at competition leaderboards. For example the leaderboard screenshot
the author shared in the post
([https://lukeoakdenrayner.files.wordpress.com/2019/09/pneumo-...](https://lukeoakdenrayner.files.wordpress.com/2019/09/pneumo-
challenge.png?w=663&h=521)) shows 11 of the 13 top performers are Masters and
Grandmasters, which puts them at the top ranked 1.5K members of our community
of 3.4MM data scientists (orange and gold dots under the profile pictures
indicate Master and Grandmaster rank).

I actually think the author's headline is often correct: there are many cases
where machine learning competitions don't produce useful models. But for a
completely different reason: Competitions sometimes have leakage.

~~~
antgoldbloom
To elaborate on leakage: it's a case where something in the training or test
dataset wouldn't be available in a production setting.

As a funny example: I remember we were once given a dataset to predict
prostate cancer from ~300 variables. One of the variables was "had prostate
cancer surgery". Turned out that was a very good predictor of prostate cancer
;). Thankfully that was an example where we caught the leakage. Unfortunately
there are cases where we don't catch the leakage.
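
A crude automated screen for this (my sketch; not something Kaggle
necessarily runs) is to flag any single column that is suspiciously
predictive on its own, the way "had prostate cancer surgery" was:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def flag_leaky_features(X, y, threshold=0.95):
    """Flag columns of X that alone achieve near-perfect AUC against
    the binary target y; such columns are leakage suspects."""
    suspects = []
    for j in range(X.shape[1]):
        auc = roc_auc_score(y, X[:, j])
        auc = max(auc, 1.0 - auc)  # direction of the signal doesn't matter
        if auc >= threshold:
            suspects.append(j)
    return suspects
```

This only catches single-column leaks, of course; subtler leakage still needs
a domain expert asking how each field gets populated.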

------
ineedasername
It seems like the author is saying two things. 1) The hold out method isn't
robust enough and 2) There may not be much difference between the top
performing models. Which, okay, there may be a point there. But that doesn't
really support the thesis that competitions don't produce useful models. It's
more like they don't produce models that can be deployed directly for clinical
use. There's no reason the techniques used by the competitors can't contribute
to something that is clinically useful.

Of course a title like, "Here's why you can't directly deploy ML competition
models into a production environment" doesn't grab as many clicks.

------
mr_toad
The headline, apart from being clickbaity, is _accepting the null hypothesis_.
Which is a STAT101 no-no.

The article makes several good points. But just because the testing isn’t
sufficient to prove that the winner didn’t just get lucky, it doesn’t _prove_
that the winner did just get lucky.

~~~
mlnewbie
Yeah, we can't disprove anything, yadda yadda.

If almost every competition on Kaggle has a winner that is not significantly
better than the bulk of the field, then that is proof. Chance correlations
leading to you not rejecting the null can only take you so far.

~~~
skybrian
I think the point is we are left with uncertainty. Your prior should be that
we don't know which competitor is best, and after the competition we are still
unsure.

~~~
mlnewbie
Isn't that the same thing as "not producing useful models"? Like, sure, some
of the models may work, but unless you know which ones you can't _make use_ of
them.

~~~
skybrian
Yes, very true, but if we're still unsure it may be worth testing them more,
while if we've proven they don't work we can abandon them.

~~~
AstralStorm
How do you test them more, with which unbiased dataset that does not exist?

What you could do is actually describe the kinds of errors the network makes.
In the CT example: false positives, false negatives, wrong diagnoses. We can
try to analyze what the network is detecting, rather than accept a result on
some test set as real.

The millions of trials is an overstatement, but a few hundred thousand are
indeed needed to actually discern a winner, presuming the network did not
cheat by leaning on population statistics, e.g. certain cranium sizes being
more likely to present with problems. Relying on population statistics derived
from a small sample (even if representative, which it's not) is very risky...

------
minimaxir
There was a good discussion on Reddit on how the objective of Kaggle
competitions from the competition creator's perspective isn't necessarily a
_useful_ model.
[https://www.reddit.com/r/MachineLearning/comments/d50lr3/d_w...](https://www.reddit.com/r/MachineLearning/comments/d50lr3/d_why_are_kaggle_prizes_so_low/)

------
ggggtez
I like the diagram where it shows the improvement from "human" to "Google" and
labels it as "probably overfitting". Not only does it look like the person is
drawing an extra hurricane bubble to prove a point in a presidential style,
but it's complete nonsense.

The earlier example says that the difference between winners in an arbitrarily
picked Kaggle competition was "0.0014". Sure, I agree, seems small. But this
random diagram about image classification claims that for the Google model
"the improvement year on year slows (the effect size decreases)". That's not
even true. The effect is exponential!

2011 -> 2012: 38.6% ("Reliable Improvement")
2014 -> 2015: 34.9% ("Probably Overfitting")

Is this really the thoughts of someone well versed in statistics? I get the
feeling they are just upset they lost at a competition and decided instead to
rant about it (as you can see from the aggressive use of memes). This is not a
well thought out argument against ML competitions, but you might be fooled
into thinking it was because it contains just enough discussion of statistics
that you might not notice it doesn't hold up.

------
mlvsepi
Sigh, this beautifully captures the tragedy of biostats and epidemiology,
which are held hostage by a lack of systems thinking. Unlike, say, clinical
trials, ML competitions are not limited to the data available at the start of
the competition; in fact, by having a fair measurement of performance, there
is a strong incentive to label and share more data and to run checkpointed
models. Further, since the goal is not "publishing", with its arbitrary
requirements on p-values / power, an empirical strategy is likely to provide
much better long-term accountability and better models. Sadly, the cult of
biostats is so deeply vested in publishing, rather than thinking end to end
(designing the system as a whole), that any other model is quickly rejected
by the community.

~~~
AstralStorm
You're making a critical mistake. Why build a network for a competition if it
won't produce great results on real data without major modifications?

That is the main problem, and the lack of systems thinking is on your side.
There is a strong pressure to cheat and overfit. Sharing in fact makes this
even stronger.

We had some fun with that when trying to use ML for something as complex as
music rhythm-envelope extraction. (Which is easy in comparison to the CT task.)

Best results were approaching 90% accuracy on the big suite, but real results
were closer to 40%. A slightly worse solution did reliably 70%. (And was not a
neural network even, plus possible to improve.)

The generally best approaches sometimes do work, but sometimes (often?) they
are overfitted in architecture, not just to the dataset.

------
ackbar03
My comment is really similar to what's already been said: this reads a bit
like an "I don't really want to waste my time, so neither should you" post. I
mean, nothing wrong with that, but it's a bit more negative and sour than I
personally would like. Plus I think the entire spirit of the competition was a
bit lost on this person.

Solid case in point: the U-Net came to prominence from a medical imaging
competition. Was it a "useful" model? The author might not be wrong in saying
the model wouldn't work as well in the wild, but I would definitely say it was
useful. The U-Net is still a very commonly used architecture.

------
m0zg
Huh? How do you "overfit" on the test set if you don't have the test set? And
also to call a good ML result a "coin toss" shows a profound lack of
understanding of what goes into such "coin tosses", and why solving practical
problems with ML is an entirely different ballgame than training a classifier
on imagenet (which is in itself pretty hard if you want SOTA results).

~~~
kqr
> How do you "overfit" on the test set if you don't have the test set?

By being lucky. And with a large enough number of solutions, at least one
solution will almost surely be lucky.

~~~
m0zg
I guess I'm extraordinarily "lucky" then. I don't participate in competitions,
but I'm often "luckier" than entire teams of people working tirelessly for
months. And I charge a lot of money for it.

~~~
kqr
I'm intentionally misunderstanding your comment now for academic reasons: it's
great that you recognise the role of survival bias and luck in becoming
successful at many things.

Besides that, I'm very curious about who you are and what you do. Willing to
reveal more?

------
salty_biscuits
My comment on this would be that the winner of a particular competition might
not have the best model (in fact they are likely to be sneaky, e.g. extract
information about the distribution of the hold-out set and tune for that), but
competitions are a great way to get a survey of the state of the art by
looking at the types of approaches used by the top competitors in general.
There is definitely information there.

------
fishooter
The post talks about coin flipping, i.e. 0/1 classification. Many competitions
use different scores, however: multiclass problems, learning to find bounding
boxes of objects, etc. There it is much less likely to get "good" answers on
the test set by chance. I think the points in the article are important, but
in this context they become a non-issue, since a random answer is unlikely to
be correct.

~~~
MrMoenty
The article is not about models being indistinguishable from random
classifiers, the difference there should be very significant even on the tasks
it discussed. Instead, the problem originates from the small differences in
test set performance between the top N models. While that difference may very
well increase when moving from binary classification to a more technically
involved regression task, that is by no means guaranteed, and the main points
of the article still apply.

------
prepend
Not AI related, but I had a boss who thought hackathons were great because
they would spontaneously produce great things in a day.

I’m kind of an introvert so I hate hackathons but did one and had about 75
code for a day. My boss wasn’t happy but I was surprised at the fun culture
and connections created across teams. Five years later people mention
something they learned or a technique they use or someone they met that they
continue to work with.

I don’t think these competitions are the best mean for solving specific
problems, though think they are part of a good portfolio.

I think that they are valuable for other purposes like communication and
interoperability.

------
euske
It feels like the article demonstrates that, by and large, the law of
diminishing marginal utility is still at work in research communities, which
makes sense. At some point you should call it a day and do something else.

------
emmab
[https://www.lesswrong.com/posts/5gQLrJr2yhPzMCcni/the-
optimi...](https://www.lesswrong.com/posts/5gQLrJr2yhPzMCcni/the-optimizer-s-
curse-and-how-to-beat-it)

The competition and test set will still have hidden biases due to the ontology
you use.

Sufficient optimization pressure always eventually overcomes your bias control
metrics, in the context of actual utility rather than other metrics.

It's well known in cryptography and security that all abstractions are leaky.

------
tuespetre
Well how much of AI research and development in general is actually reflecting
on the human experience rather than spewing unholy algebra onto the skeleton
of past discoveries?

------
ipsa
There are so many things just so plainly wrong about this (I attempted to
respond, then had to stop) that I feel the post is more an attempt to vent
the frustration the author felt when he attempted to compete and promptly
got run over by some SotA-hungry boost-junkies from countries where the
p-test is not part of the school curriculum. I really don't know how to
constructively salvage this... Talk about the role of luck in games?

------
gyuserbti
I think the hold-out paradigm is woefully overrated in part for similar
reasons. It's always seemed odd to me to emphasize hold out samples when you
know their asymptotic performance in the form of fit statistics.

I'm always for examining replicability, but the current paradigm to me seems
misguided and this articulates some of the reasons very well.

------
ahupp
If the problem is getting "lucky" on a test set, then it seems like you could
do several rounds of splitting test/training set, retraining from scratch each
time, and then take the median performance. Not great if training takes a long
time, but it would at least conclusively answer the question.
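
A minimal sketch of that idea (assuming scikit-learn, with a logistic
regression as a stand-in for whatever model is being judged):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit

def median_holdout_score(X, y, n_splits=20, test_size=0.2, seed=0):
    """Re-split and retrain from scratch n_splits times, reporting the
    median test score so no single lucky split decides the result."""
    splitter = ShuffleSplit(n_splits=n_splits, test_size=test_size,
                            random_state=seed)
    scores = []
    for train_idx, test_idx in splitter.split(X):
        model = LogisticRegression(max_iter=1000)  # stand-in model
        model.fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    return float(np.median(scores))
```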

------
tanilama
They tend to produce overfitted models instead, because such competitions
optimize in that direction.

