
A Look into Machine Learning's First Cheating Scandal - metachris
http://dswalter.github.io/blog/machine-learnings-first-cheating-scandal/
======
TuringTest
Long story short:

- The LSVRC "visual recognition" competition has a rule limiting each
contestant to running their entries against the ImageNet test dataset at most
twice per week.

- The Baidu team ran their tests much more often, claiming that they
understood the limit to apply per person, not per team.

- This rule is in place because more frequent submissions distort the quality
of the test, shifting its emphasis from true algorithmic advances to
overlearning (adapting too much to the specific data in the test dataset);
given this contest's high profile, that can influence the whole machine
learning discipline. ( * )

- As a result, the whole Baidu company has been banned from the competition
for a year.

( * ) ("The Baidu team’s oversubmissions tilted the balance of forward
progress on the LSVRC from algorithmic advances to hyperparameter
optimization.")

~~~
cocoflunchy
Do they test against the whole dataset? Wouldn't that be fixed by having a
test dataset and a separate, bigger dataset that would be used for the final
grading (like on Kaggle)?

~~~
dasboth
I should have read the comments first; I just asked the same thing. I always
wondered how that worked, though - you don't submit an algorithm, do you? Just
a spreadsheet with your predictions, in which case how can they evaluate your
algorithm on another test set?

~~~
p4wnc6
You get to see the feature vectors of the test set, or at least you submit
your algorithm to execute on them. Seeing the features of the test set
without the labels can't help you: modeling the distribution over outcomes
for a feature vector whose target outcome you don't know _is_ the problem
you're solving.

But if you can repeatedly run different models against the test set _and you
do get to see a score_ , you could do something like random parameter
searching or other optimization ideas to tune your algorithm to be highly
overfitted to the test set.
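
To make that concrete, here's a minimal Python sketch of that feedback loop
(leaderboard_score is a fake stand-in for the competition server, not
anything Baidu is known to have run):

    import random

    random.seed(0)

    def leaderboard_score(params):
        # Stand-in for the competition server: in reality this would train
        # a model with `params` and score it on the secret test set. Here
        # a noisy synthetic score keeps the sketch runnable end to end.
        return -(params["lr"] - 0.01) ** 2 + random.gauss(0, 1e-5)

    best_params, best_score = None, float("-inf")
    for _ in range(200):  # 200 submissions, far beyond 2 per week
        params = {"lr": 10 ** random.uniform(-5, -1)}
        score = leaderboard_score(params)  # one more "peek" at the test set
        if score > best_score:
            best_params, best_score = params, score

    # best_params is now tuned to this particular test set, not the task.
    print(best_params, best_score)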

Another way to avoid this would be to develop several test sets that are
roughly "equivalent" in terms of their distributional properties, and then
randomly change the test set periodically, or change the test set right after
the final submission deadline, to discourage people from pursuing overfitting.

~~~
dasboth
That makes sense. I was thinking of the multiple test set approach, but rather
than forcing people to re-submit on a new test set near the end (if that's
what you meant), the organisers could just return the score based on a random
fraction of the test set. 'Cheaters' would then overfit to just half of the
actual test set and this would be apparent when the final scores (on the other
half) are revealed. I suspect this is what Kaggle do, as they don't make you
re-submit at the end (AFAIK).
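
For what it's worth, that random-fraction split is easy to sketch (numpy,
with made-up names; I don't know Kaggle's actual internals):

    import numpy as np

    rng = np.random.default_rng(42)

    n = 10_000
    y_true = rng.integers(0, 2, size=n)  # hidden test labels
    public = rng.random(n) < 0.5         # random half drives the leaderboard

    def public_score(y_pred):
        # What contestants see during the competition.
        return float((y_pred[public] == y_true[public]).mean())

    def private_score(y_pred):
        # Computed once on the other half, after the deadline;
        # overfitting to public_score won't transfer here.
        return float((y_pred[~public] == y_true[~public]).mean())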

------
not_that_noob
To translate to a more familiar domain, think about the SATs. After a period
of study, students take the SAT where they get back the composite score of how
they did, but never the actual answers to the questions on the test.

Now imagine a student can take the test repeatedly over the space of a few
days, and can use the score to reverse engineer the answers to the questions.
They can put in random answers and note which ones cause the score to go up.
Of course the real life SATs don't allow this, and they change up the
questions to prevent this sort of cheating. If this were possible, our
enterprising/cheating student could derive the complete answer key over time,
noting the changes in scores for each run. And once they have the key, they
can ace the test. No longer is it a test of their aptitude, but rather of
their knowledge of the answer key.
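
A toy version of that extraction, in Python (everything here is contrived,
per the analogy; real SATs obviously permit nothing like it):

    import random

    random.seed(1)
    OPTIONS = "ABCDE"
    key = [random.choice(OPTIONS) for _ in range(100)]  # secret answer key

    def take_test(answers):
        # The only feedback a test-taker gets: a raw score.
        return sum(a == k for a, k in zip(answers, key))

    answers = ["A"] * len(key)
    for i in range(len(key)):
        for opt in OPTIONS:
            trial = answers[:i] + [opt] + answers[i + 1:]
            if take_test(trial) > take_test(answers):
                answers[i] = opt  # a score bump reveals the answer

    assert answers == key  # a perfect score, with zero aptitude involved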

This scandal is analogous to that admittedly contrived example. With an ML
test set, it's not possible to change the data, because you want it
standardized so you can evaluate the improvements that new approaches bring.
It's the only way to have a meaningful yardstick to measure against. Thus,
the only way to prevent such gaming is to restrict multiple submissions, so
that you can't do 'hyperparameter optimization' - i.e. overlearn on the test
set.

That's why it's cheating - it's not a measure of how well your algorithm did,
but rather on how well you reverse-engineered the answer key. It's a huge
disservice to the field and the people who did this should be ashamed of
themselves.

------
philh
Idle speculation:

> "The key sentence here is, 'Please note that you cannot make more than 2
> submissions per week.' It is our understanding that this is to say that one
> individual can at most upload twice per week. Otherwise, if the limit was
> set for a team, the proper text should be 'your team' instead," Wu wrote.

I wonder whether, to a native Chinese speaker, this really does sound like
it's talking about individual people, and saying "you" when one means "your
team" seems really bizarre. Can any Chinese speakers weigh in?

(Even stipulating this, the affair still sounds more like malice than
incompetence on the part of Wu.)

~~~
ryporter
I agree with your last remark. Regardless of the letter of the law, any ML
researcher who could possibly win such a competition should clearly understand
its spirit. They had to know that they were training on the test data, a
cardinal sin in machine learning.

~~~
chestervonwinch
Exactly. This is machine learning chapter 0 stuff. To think a team of (most
likely) machine learning PhDs innocently misinterpreted the rules is naive.
More likely, the researchers felt enough pressure to produce significant
results that they let their morals take the sidecar.

~~~
p4wnc6
Especially when you consider that the first resort should be to _ask for
clarification_ if you are unsure. It's completely unreasonable for them to
claim that they read the rules and, upon a single reading, _were completely
sure_ the limit applied to individual participants rather than whole teams -
that there was not even a microscopic amount of doubt that maybe, just maybe,
it applied to whole teams.

This sounds like they are trying to cover their malicious intent with an "it's
better to ask for forgiveness than permission" kind of trick.

------
dasboth
Congratulations to the author for both of his last 2 posts making it to the HN
front page!

This explains the need to drive home the train-test idea from the last post.
I hadn't thought about this before, but multiple submissions do amount to
multiple peeks at your held-out test set, which is a huge ML no-no.

I don't know much about LSVRC, but doesn't the way Kaggle works prevent this?
AFAIR you get a "public" test score which is used for the leaderboards, but
once the deadline for submissions is up, each submission is evaluated on a
held-out test set, giving you a "private" score. Now that I think about it,
I'm not sure how that works; I guess the accuracy they show you as your
public score is only on part of the submitted rows? Regardless of how that's
done, couldn't the LSVRC organisers do something similar?

------
hbogert
Couldn't the LSVRC just limit the number of submissions? Why would you rely
on contestants' competence or good intentions for this?

~~~
masklinn
> Couldn't the LSVRC just limit the amount of submissions?

One of the article's notes indicates that it most likely does, and that the
Baidu team wilfully got around that limit:

> Members of the Baidu team had to create multiple logins in order to
> circumvent the “two submissions per week” rule

~~~
fsam
The fact that they created multiple logins was the biggest ethical breach here
IMO. It was a willful violation of the contest rules.

The absolute number of submissions is somewhat arbitrary, e.g. why are 40
okay, but 200 are not?

~~~
masklinn
There isn't an absolute limit on submissions, only a rate limit. The rate
limit is probably somewhat arbitrary but it has a purpose: the aim of the
challenge is to measure and compare algorithmic progress in machine learning;
that's also why the test dataset is kept secret.

Without that, it becomes possible to pre-tune the algorithm to more closely
match or better recognise the specific dataset ("hyperparameter
optimisation") without it working any better in the general case: the
submitter does better on the specific challenge but doesn't actually advance
the field.

The limit is there to skew incentives towards algorithmic improvements; the
specific rate doesn't really matter as long as it makes hyperparameter
overfitting less convenient/efficient than algorithmic work.
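
Enforcing the limit per team rather than per login is straightforward if
accounts can be mapped to teams; a minimal server-side sketch (try_submit
and the team mapping are hypothetical, not LSVRC's actual setup):

    import time
    from collections import defaultdict

    WINDOW = 7 * 24 * 3600  # one week, in seconds
    MAX_SUBMISSIONS = 2

    submissions = defaultdict(list)  # team id -> submission timestamps

    def try_submit(team_id, now=None):
        now = time.time() if now is None else now
        recent = [t for t in submissions[team_id] if now - t < WINDOW]
        if len(recent) >= MAX_SUBMISSIONS:
            return False  # rejected: the per-team budget is spent
        submissions[team_id] = recent + [now]
        return True

Of course, this only pushes the problem to mapping accounts onto teams,
which is exactly what the multiple-logins trick exploited.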

------
HappyTypist
> there are almost no papers focusing on 3 or 4-layer CNN’s these days, for
> example

What's the 'best practices' for the number of hidden layers in a CNN? 1 or 2
hidden layers?

~~~
argonaut
Certainly more than 6. It depends on what you're going for. If you're just
trying out a new technique (e.g. a new activation function), super-deep
nets don't really matter. But if you're trying to get the best performance
possible on a task (for a competition, say), you can tack on many more layers
(MSR did ~150 layers).
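
To illustrate the "tack on many more layers" point, here's a hedged PyTorch
sketch (an illustration of stacking depth, not any competition model; very
deep variants like MSR's ~150-layer net also needed residual connections to
be trainable):

    import torch.nn as nn

    def conv_block(channels):
        # One repeatable unit of depth: conv -> batchnorm -> ReLU.
        return nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )

    depth = 20  # depth is mostly how many blocks you stack
    net = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=3, padding=1),
        *[conv_block(64) for _ in range(depth)],
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(64, 1000),  # e.g. the 1000 ImageNet classes
    )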

------
Flockster
So what would a better dataset look like? Does bigger always equal harder? What
are the criteria to measure that?

And wouldn't video datasets be somewhat easier to analyse given the fact that
you have multiple frames of the same object?

~~~
ninjin
It is very difficult to measure data set quality. The goal of a data set is
to give you insight into how well an approach performs for any given data
point (in this case an image and its associated label). The problem is that
we can not collect all the possible data points, so we have to settle for
what we hope is a fair sample of them.

While bigger usually means better, if there was, for example, a bias in how
the data points were selected, the data set can actually be worse than a
smaller one without such a bias. Building a good data set really depends on
the current understanding of the nature and difficulty of the task, both on
the part of the scientific field and of the researchers building the data
set.

The very same problem exists in the medical domain. How do you select the
right patients to evaluate a treatment on? How do you know that there is not
a specific genetic, racial, gender, etc. trait that leads to side effects?
The answer, of course, is that you can not know with absolute certainty
unless you include every human on the planet (and then there is the question
of as-yet unborn humans), but you can use your experience and medical
knowledge to select the patients (your data set) for the medical trial so as
to minimise the possible impact of as-yet unknown side effects.

------
jmount
Nice article on why it is cheating (in a mathematical sense, independent of
language) to get scores from the hold-out leaderboard too many times, plus
some methods to mitigate the effect:
[http://arxiv.org/abs/1502.04585](http://arxiv.org/abs/1502.04585)
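
The paper's core trick, if I recall it correctly, is a leaderboard that only
reports a new score when a submission beats the best so far by more than a
fixed step size; a rough sketch (details approximate, from memory):

    STEP = 0.001  # step size; improvements smaller than this stay invisible

    best_loss = float("inf")

    def ladder_report(empirical_loss):
        # Report a fresh (rounded) score only on a significant improvement;
        # otherwise repeat the previous best, starving the fine-grained
        # feedback that leaderboard overfitting depends on.
        global best_loss
        if empirical_loss < best_loss - STEP:
            best_loss = round(empirical_loss / STEP) * STEP
        return best_loss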

------
mikeskim
As long as you make the public leaderboard set small and the private,
one-shot leaderboard set very large, the number of submissions matters very
little to the final rankings. The only real issue is hand-labeling the public
leaderboard set to augment the training data.

------
king_of_nouns
Meh.. I'm not so sure about this.

Didn't people claim "cheating" back when the first compilers started doing
data flow analysis too?

~~~
ska
You might have a point if there were any sort of parallel between what they
are doing and data flow analysis, but there really isn't.

The parallel would be more like adding a detector for certain benchmarks to
your compiler and outputting hand-tuned assembly for that case ... except
even worse than that, because you'd have to implement the detector and
assembly generation in such a way that it made your compiler behave worse on
general input.

~~~
titanomachy
I'm not sure if the analogy was intentional but your comment made me
immediately think of the Volkswagen emissions scandal.

~~~
ska
There is a parallel, but only to the first part. What VW did was cheat on
benchmarks, much like certain driver vendors and compilers have been known to
do.

But what we're talking about here is much, much worse from a design point of
view. It's specializing your system for the benchmark in such a way that you
actually make it worse in general.

