A Look into Machine Learning's First Cheating Scandal (dswalter.github.io)
246 points by metachris on Dec 14, 2015 | 51 comments



Long story short:

- The LSVRC "visual recognition" competition has a rule that limits how often each contestant can run their entries against the ImageNet test data: at most twice per week.

- The Baidu team ran their tests much more often, claiming that they understood the limit to apply per person rather than per team.

- This rule is in place because more frequent submissions distort the quality of the test by shifting its emphasis toward overfitting (adapting too closely to the specific data included in the test dataset) instead of true algorithmic advances, which can influence the whole machine learning discipline given this contest's high profile. ( * )

- As a result, the whole Baidu company has been banned from the competition for a year.

( * ) ("The Baidu team’s oversubmissions tilted the balance of forward progress on the LSVRC from algorithmic advances to hyperparameter optimization.")


Do they test against the whole dataset? Wouldn't that be fixed by having a test dataset and a separate, bigger dataset that would be used for the final grading (like on Kaggle)?


They have a separate test set. But by submitting a lot of entries with slightly different parameters, you can optimize your parameters based on the test score.

Some (all?) Kaggle competitions also have a daily submission limit to avoid this kind of cheating.


I think you misunderstand. Kaggle has a training set (known to all participants), a "public leaderboard" test set (secret) and a "private leaderboard" test set (also secret). You can get your model's score on the "public" test set a couple of times per day. Your score on the private test set is only revealed once, after the competition has ended.

People can, and do, overfit their model to the public test set, but doing so does not improve their score on the private test set, so cheating is prevented even without the submission limit.

The submission limit helps ensure the leaderboard generated from the public test set stays close to the leaderboard generated by the private test set while the competition is running, so that you can get an idea about your standing. But participants know better than taking the public leaderboard too seriously.
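
To make the mechanics concrete, here's a minimal sketch of how that kind of split could be scored (a toy Python illustration with made-up sizes and names, not Kaggle's actual code):

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical hidden ground-truth labels for the full test set.
    true_labels = rng.integers(0, 2, size=10_000)

    # The organiser secretly partitions the test set once, up front.
    public_idx = rng.choice(len(true_labels), size=3_000, replace=False)
    private_idx = np.setdiff1d(np.arange(len(true_labels)), public_idx)

    def public_score(predictions):
        # Shown to participants during the competition (rate limited).
        return (predictions[public_idx] == true_labels[public_idx]).mean()

    def private_score(predictions):
        # Revealed only once, after the deadline; decides the final ranking.
        return (predictions[private_idx] == true_labels[private_idx]).mean()

Tuning a submission to push public_score up does nothing for private_score, because the private rows are never scored until the end.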


I stand corrected :) It seems like ILSVRC only has two datasets though; I wonder why they don't use a Kaggle-like approach? My guess would be that they want people to be able to know their test score in order to put it in their papers.


The grading test set is secret.

This is what the Baidu team were probing.


I should have read the comments, I just asked the same thing. I always wondered how that worked, though - you don't submit an algorithm, do you? Just a spreadsheet with your predictions, in which case how can they evaluate your algorithm on another test set?


You get to see the feature vectors of the test set, or at least submit your algorithm to execute upon them. If you can see the features of the test set but not the labels or target variable for each feature vector, then that alone can't help you: modeling the distribution over outcomes for a feature vector whose target outcome you don't know is exactly the problem you're solving.

But if you can repeatedly run different models against the test set and you do get to see a score, you could do something like random parameter searching or other optimization ideas to tune your algorithm to be highly overfitted to the test set.
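
As a toy demonstration of how badly this can inflate scores (everything below is invented for illustration): generate labels the "model" knows nothing about, treat each random guess as a different "parameter setting", and keep whichever submission scores best.

    import numpy as np

    rng = np.random.default_rng(1)

    # A secret test set of 200 binary labels the submitter knows nothing about.
    test_labels = rng.integers(0, 2, size=200)

    best_score = 0.0
    for attempt in range(1_000):              # one "submission" per attempt
        # Each "parameter setting" is just a fresh set of random guesses.
        predictions = rng.integers(0, 2, size=200)
        score = (predictions == test_labels).mean()
        best_score = max(best_score, score)

    print(best_score)   # usually around 0.60, even though every guess is pure noise

The winning number reflects nothing but luck on this particular test set; on fresh data the same "winner" would score about 0.5, which is exactly why repeated probing is restricted.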

Another way to avoid this would be to develop several test sets that are roughly "equivalent" in terms of the distributional properties, and then randomly change the test set periodically, or change the test set right after the final submission deadline, to discourage people from pursuing overfitting.


That makes sense. I was thinking of the multiple test set approach, but rather than forcing people to re-submit on a new test set near the end (if that's what you meant), the organisers could just return the score based on a random fraction of the test set. 'Cheaters' would then overfit to just half of the actual test set and this would be apparent when the final scores (on the other half) are revealed. I suspect this is what Kaggle do, as they don't make you re-submit at the end (AFAIK).


Publish the final test set right before the deadline.

Alternatively, require the participants to provide an API and call them, instead of them submitting things.


To translate to a more familiar domain, think about the SATs. After a period of study, students take the SAT where they get back the composite score of how they did, but never the actual answers to the questions on the test.

Now imagine a student can take the test repeatedly over the space of a few days, and can use the score to reverse engineer the answers to the questions. They can put in random answers and note which ones cause the score to go up. Of course the real life SATs don't allow this, and they change up the questions to prevent this sort of cheating. If this were possible, our enterprising/cheating student can derive the complete answer key over time, noting the changes in scores for each run. And once they have the key, they can ace the test. No longer is it a test of their aptitude, but rather of their knowledge of the answer key.
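
A tiny sketch of that reverse engineering (the answer key and grading function below are made up purely for illustration):

    # Hypothetical 5-question multiple-choice test with choices A-D.
    ANSWER_KEY = ["B", "D", "A", "C", "B"]          # secret

    def grade(submission):
        # The only feedback the test-taker gets: the number of correct answers.
        return sum(a == b for a, b in zip(submission, ANSWER_KEY))

    # Start from an arbitrary submission and probe one question at a time.
    recovered = ["A"] * len(ANSWER_KEY)
    for i in range(len(ANSWER_KEY)):
        baseline = grade(recovered)
        for choice in "ABCD":
            trial = list(recovered)
            trial[i] = choice
            if grade(trial) > baseline:
                recovered[i] = choice
                break

    print(recovered == ANSWER_KEY)   # True, after a couple of dozen "retakes"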

This scandal is analogous to this admittedly contrived example. With an ML test set, it's not possible to change the data, because you want it standardized so you can evaluate improvements that new approaches may bring. It's the only way to have a meaningful yardstick to measure against. Thus, the only way to prevent such gaming is to restrict multiple submissions, so that you can't do 'hyperparameter optimization' - i.e. overlearn on the test set.

That's why it's cheating - it's not a measure of how well your algorithm did, but rather on how well you reverse-engineered the answer key. It's a huge disservice to the field and the people who did this should be ashamed of themselves.


Idle speculation:

> "The key sentence here is, 'Please note that you cannot make more than 2 submissions per week.' It is our understanding that this is to say that one individual can at most upload twice per week. Otherwise, if the limit was set for a team, the proper text should be 'your team' instead," Wu wrote.

I wonder whether, to a native Chinese speaker, this really does sound like it's talking about individual people, and saying "you" when one means "your team" seems really bizarre. Can any Chinese speakers weigh in?

(Even stipulating this, the affair still sounds more like malice than incompetence on the part of Wu.)


I agree with your last remark. Regardless of the letter of the law, any ML researcher who could possibly win such a competition should clearly understand its spirit. They had to know that they were training on the test data, a cardinal sin in machine learning.


Exactly. This is machine learning chapter 0 stuff. To think a team of (most likely) machine learning PhDs innocently misinterpreted the rules is naive. More likely, I imagine the researchers felt enough pressure to obtain significant results that they let their morals take the sidecar.


Especially when you consider that the first response, if you are unsure, should be to ask for clarification. It's completely unreasonable for them to claim that they read the rules and, upon their singular reading of the rule, were completely sure that the limit was for individual participants and not whole teams. That there was not even any microscopic amount of doubt that maybe, just maybe, it was for whole teams.

This sounds like they are trying to cover their malicious intent with an "it's better to ask for forgiveness than permission" kind of trick.


just growth hacking a competition, right?


Once upon a time, English had both singular and plural second-person pronouns; "thou/thee" and "ye/you" respectively. Over time, the use of the first three declined drastically, and "you" is the only one in popular use today.

https://en.wikipedia.org/wiki/English_personal_pronouns#Arch...

Chinese, however, still uses both singular and plural second-person pronouns; 你 (nǐ, lit. you) and 你們 (nǐmen, lit. you all) respectively.

https://en.wikipedia.org/wiki/Chinese_pronouns#Personal_pron...

Note that the above is drastically simplified and I recommend reading the links for more information.


English is relatively unusual in merging its second-person pronouns, although the use of the second-person plural in more formal or polite situations (aka the T-V distinction) is extremely widespread.

However, the matter of personal pronouns is such a fundamental aspect of grammar (at least in Indo-European languages) that it was literally the first piece of grammar introduced when we studied French. (It may not have been the first chapter, since the first chapter may have been limited to stock phrases and some pronunciation guides; it's been way too long). I find it highly implausible that any foreign speaker that has a working proficiency of English could somehow think that "you" is not used for both individuals and groups.


I have seen some very strange errors in English written by native Chinese speakers; however, it does seem implausible that every single person on their team interpreted the sentence in exactly the same wrong way.


Even if that's the case, they would have noticed the inconsistency in language when they were forced to create multiple logins to submit more frequently.


Congratulations to the author for both of his last 2 posts making it to the HN front page!

This explains the need to drill home the train-test idea from the last post. I hadn't thought about this before but multiple submissions do amount to multiple peeks at your held-out test set, which is a huge ML no-no.

I don't know much about LSVRC, but doesn't the way Kaggle works prevent this? AFAIR you get a "public" test score which is used for the leaderboards, but once the deadline for submissions is up, each submission is evaluated on a held-out test set, giving you a "private" score. Now that I think about it, I'm not sure how that works; I guess the accuracy they show you as your public score is only on part of the submitted rows? Regardless of how that's done, could the LSVRC organisers not do something similar?


Couldn't the LSVRC just limit the amount of submissions? Why would you rely on participants' competence or good intentions for this?


> Couldn't the LSVRC just limit the amount of submissions?

One of the article's notes indicates that it most likely does, and that the Baidu team wilfully got around that limit:

> Members of the Baidu team had to create multiple logins in order to circumvent the “two submissions per week” rule


The fact that they created multiple logins was the biggest ethical breach here IMO. It was a willful violation of the contest rules.

The absolute number of submissions is somewhat arbitrary, e.g. why are 40 okay, but 200 are not?


There isn't an absolute limit on submissions, only a rate limit. The rate limit is probably somewhat arbitrary but it has a purpose: the aim of the challenge is to measure and compare algorithmic progress in machine learning, that's also why the test dataset is kept secret.

Without that, it becomes possible to pre-tune the algorithm to more closely match or better recognise the specific dataset ("hyperparameter optimisation") but not work any better in the general case, so the submitter does better on the specific challenge, but doesn't actually advance the field.

The limit is there to skew incentives towards algorithmic improvements, the specific rate doesn't really matter as long as it makes hyperparameter overfitting less convenient/efficient than algorithmic work.


The research community is a small one. Everyone knows everyone in the community.


> there are almost no papers focusing on 3 or 4-layer CNN’s these days, for example

What's the current 'best practice' for the number of hidden layers in a CNN? 1 or 2 hidden layers?


Certainly more than 6. It depends on what you're going for. If you're just trying out a new technique (e.g. a new activation function, etc.) super-deep nets don't really matter. But if you're trying to get the best performance possible on a task (for a competition, say), you can tack on many more layers (MSR did ~150 layers).
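
For a sense of what "adding layers" means in code, here's a toy PyTorch-style sketch (arbitrary layer counts and channel widths, not any particular published architecture):

    import torch.nn as nn

    def make_cnn(num_conv_layers=8, num_classes=10):
        # Stack an arbitrary number of 3x3 conv blocks, then classify.
        layers, channels = [], 3                       # start from RGB input
        for i in range(num_conv_layers):
            out_channels = 32 * (2 ** min(i // 2, 3))  # grow channel width
            layers += [nn.Conv2d(channels, out_channels, 3, padding=1),
                       nn.ReLU()]
            channels = out_channels
        layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                   nn.Linear(channels, num_classes)]
        return nn.Sequential(*layers)

    # make_cnn(4) gives a shallow net; make_cnn(50) a much deeper one
    # (in practice very deep stacks also need tricks like residual connections).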


So what would a better dataset look like? Does bigger always equal harder? What are the criteria to measure that?

And wouldn't video datasets be somewhat easier to analyse given the fact that you have multiple frames of the same object?


It is very difficult to measure data set quality. The goal of a data set is to give you insight into how well an approach performs for any given data point (in this case an image and associated label). The problem is that we cannot collect all the possible data points, so we have to settle for what we hope to be a fair sample of all data points.

While bigger usually means better, if, for example, there was a bias in how the data points were selected, the data set can actually be worse than a smaller one where there was no such bias. Building a good data set really depends on the current understanding of the nature and difficulty of the task, both on the part of the scientific field and of the researchers working on the data set.
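
A toy illustration of that point (all numbers invented): imagine a classifier that is strong on cats and weak on dogs, and measure it on a fair sample versus a bigger but cat-heavy one.

    import numpy as np

    rng = np.random.default_rng(2)

    # Imaginary classifier: 95% accurate on cats, 60% accurate on dogs.
    # If the real-world mix is 50/50, its true accuracy is 0.775.
    def measured_accuracy(n_cats, n_dogs):
        cat_hits = rng.random(n_cats) < 0.95
        dog_hits = rng.random(n_dogs) < 0.60
        return np.concatenate([cat_hits, dog_hits]).mean()

    print(measured_accuracy(5_000, 5_000))    # fair sample: about 0.78
    print(measured_accuracy(45_000, 5_000))   # larger, biased sample: about 0.91

The larger data set gives the more misleading answer, because it was collected with a bias.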

The very same problem exists in the medical domain. How do you select the right patients to evaluate for a treatment? How do you know that there is not a specific genetic, racial, gender, etc. trait that leads to side effects? The answer is of course that you cannot know with absolute certainty unless you include every human on the planet (and then there is the question of as-yet-unborn humans), but you can use your experience and medical knowledge to try to select patients (your data set) for the medical trial so as to minimise the possible impact of as-yet-unknown side effects.


The point of a dataset is to serve as a representative sample of the full spectrum of possible input data. Datasets need to be properly classified, with the different elements / 'intricacies' well distributed.

Think of it as using Hacker News posts to illustrate what posts are on topic. You could give someone a better impression if you showed them 10 posts instead of just 3, but if the 10 posts were all on the same subject then they wouldn't be of any additional utility. And if you accidentally include an off-topic post, then that user is going to come away with a mistaken impression of what's on topic.

As for video... that's a long story.


Nice article on why it is cheating (in a mathematical sense, independent of language) to get scores from the hold-out leaderboard too many times (plus some methods to mitigate the effect): http://arxiv.org/abs/1502.04585


As long as you make the public leaderboard set small and the private, one-shot leaderboard set very large, the number of submissions matters very little in the final rankings. The only real issue is hand-labeling the public leaderboard set to augment training data.


[deleted]


You can skip that part and start reading at "Conclusions". The first section merely describes an unrelated experiment, shown for contrast as an example of successful research that doesn't commit the sin of overfitting. In fact, the reasons why this is such a scandal are explained in surprisingly understandable layman's prose.

Edit: An MIT Technology Review piece linked from the article has a simpler and more detailed explanation of how the test system was abused, and why they did it.

http://www.technologyreview.com/view/538111/why-and-how-baid...


That's n choose 2...


[flagged]


Happy to oblige. First: Chinese companies do not cheat on 'everything'; that's a generalization that holds absolutely no water. Second, begging for downvotes is such a non-productive thing to do that it is explicitly covered in the HN guidelines:

"Please don't bait other users by inviting them to downvote you or declare that you'll probably get downvoted."

https://news.ycombinator.com/newsguidelines.html


Practically all companies cheat whenever there is a competitive or financial advantage in doing so and they think that they can get away with it (legally and from a PR perspective), either in ways traceable to orders from the top or via more localised teams/individuals acting without sufficient internal oversight.

Look at BMW for the most recent very public non-Chinese example of this.

It is not specifically a problem with China, though it certainly seems to be more common there currently. I suspect this is due to a mix of issues:

* The massive growth and changes there in the last couple of decades have left "due diligence" and similar concepts playing catch-up.

* As they move to a more western way of operating, boundaries are being tested (and sometimes actively pushed). This should reduce over time.

* To a certain extent, operators over there see what companies elsewhere get caught for, assume that far more is going on and not being reported, and feel it fair that they behave the same way. Also, short-term thinking leads people who aren't expecting to be in the same place in a few years' time to notice that some of these scandals don't erupt for many years after the fact, so they consider it worth the risk for localised gain.

* There is a perception that companies in the west are disposed to turning a blind eye to such issues in order to cut costs by using a less expensive third party. Such a disposition encourages such behaviour, and I'm of the opinion that this perception is far from false.

Again, there is nothing specifically Chinese about this; it is a more general issue of insufficiently regulated capitalism, but the sheer number of companies and individuals over there trying to grow into the market can make it look so.


> Look at BMW for the most recent very public non-Chinese example of this.

Did I miss something? Or is this an example of what I meant when I said that other German manufacturers would be negatively affected by the VW scandal as well?


No, you didn't miss something. I experienced a memory fault and picked out the wrong car manufacturer with a W in their acronym...


I'm disappointed this was upvoted back from gray. I downvoted you because downvote baiting is just about the least classy comment you could make.


Most probably true, but the shame is to be put on China as the "red" system it is, and not on the people.


Meh.. I'm not so sure about this.

Didn't people claim "cheating" back when the first compilers started doing data flow analysis too?


You might have a point if there was any sort of parallel between what they are doing and data flow analysis, but there really isn't.

The parallel would be more like adding a detector for certain benchmarks in your compiler, and outputting hand tuned assembly for that case ... except even worse than that, because you'd have to implement the detector and assembly generation in such a way that it made your compiler behave worse on general input.


I'm not sure if the analogy was intentional but your comment made me immediately think of the Volkswagen emissions scandal.


There is a parallel, but only to the first part. What VW did was cheat on benchmarks, much like certain driver vendors and compilers have been known to do.

But what we're talking about here is much, much worse from a design point of view. It's specializing your system for the benchmark, in such a way that you actually make it worse in general.


It's a contest. It has specific rules to prevent a very specific type of synthetic advantage (training to the test set). This was about as cheating as it gets. That's why the company fired the guy in charge of the group that cheated.


> It's a contest

Yes, who cares? The bigger picture is advancing the field, not scoring some bigger number in some artificial environment.

> This was about as cheating as it gets.

Like I said, so was data flow analysis originally.

--

The point is to not just dismiss this as "cheating" but take a closer look at how the current benchmark is flawed and how this sort of shortcut might be useful.


This sort of "shortcut" exploits a well known flaw in the training any machine learning or regression model using any algorithm. It is the problem of overfitting. The problem is that given enough parameters you can tweak those parameters so that your model exactly fits a target data set (including the random noise). The reason for excluding and limiting access to the test set is so that parameters can't be modified such that the model is adjusted to the specific test data set. It is a well known problem and exploit that the rules are designed to guard against. Here is the wiki page on overfitting:

https://en.wikipedia.org/wiki/Overfitting
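
A classic toy illustration (plain numpy, nothing to do with ImageNet specifically): fit a polynomial with as many parameters as training points, and it reproduces the training set, noise and all, while doing much worse between those points.

    import numpy as np

    rng = np.random.default_rng(3)

    # Ten noisy training points from a simple underlying relationship y = x.
    x_train = np.linspace(0, 1, 10)
    y_train = x_train + rng.normal(0, 0.1, size=10)

    # Degree-9 polynomial: 10 parameters for 10 points, so it fits the noise exactly.
    coeffs = np.polyfit(x_train, y_train, deg=9)

    x_new = np.linspace(0, 1, 100)
    train_error = np.abs(np.polyval(coeffs, x_train) - y_train).max()
    new_error = np.abs(np.polyval(coeffs, x_new) - x_new).max()

    print(train_error)   # essentially zero: the noise has been memorised
    print(new_error)     # typically far larger, especially between the training points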

Say the test data set, by chance, has 1% more dogs than the training set. By tweaking your algorithm to guess "dog" an extra 1% of the time, you may be able to get a 0.1% increase in your success rate. Also, that tweaking isn't a manual process; it's part of the training process, based on feedback from your test set score. The improvement is enough to "win", but it's not because your algorithm is better.

That goes along with the argument that the data set is at its end of life. Maybe we're at the point where gaming the system is the only way to eke out the 0.1% needed to "win". In that case it's time to move on to a tougher test.

EDIT: I'd like to point out that the ImageNet competition is continually on top of the "time to make it more challenging" aspect. They introduced localization in 2011 (identifying not just what, but where items are). The 2015 competition includes, for the first time, recognition and localization tests in video clips.


The blog post explains why the behavior in question is cheating, not a "useful shortcut". Overfitting a model isn't a real advancement to the field.

The point is to take a closer look at the current benchmarks and potentially come up with better ones.


To give an extreme example of overfitting, let's say you have a test set which has these examples: B -> 11, H -> 5, M -> 11. Well, you can solve these easily. Just make a function like this:

    function solve(letter) {
        return {B: 11, H: 5, M: 11}[letter];
    }
machine learning 4Head. Of course, the point is to solve for examples you haven't seen yet, so this "solution" isn't amusing anyone. That said, overfitting commonly happens even without trying to overfit, and people use techniques to minimize it.


To support what the others have said, but put in different words: the point of the rule is so that it will be more likely that breakthroughs in contest results translate to breakthroughs in the field itself.





