[dupe] Troubling Trends in Machine Learning Scholarship (approximatelycorrect.com)
124 points by nabla9 7 months ago | 37 comments

This is only going to get worse. I see literally no counterforce putting up any kind of barrier against the incentive to, e.g., overfit in order to get exaggerated results. The lucrative reward is too large not to fall for shady tactics (which are sometimes not even conscious; not being critical enough of your own work is a form of model loosening that acts in favour of more impressive results).

A major factor is that machine learning is so conference-deadline driven. Does the main result look good enough? Submit it! Ablation studies are the last thing you do, and often you are pressed for time because of the deadline. Since so much work is done by PhD students, they are incentivized to get their work out as fast as possible: they need to graduate, and papers in top conferences play a huge role in getting jobs in academia and industry.

I'm aware of multiple papers by top labs where their state-of-the-art results are really the result of some very minor change (e.g., some pre-processing technique, bigger images) that was swept under the rug in their paper. All the math and the complex model had minimal impact on the actual numeric result, with the actual reason being unpublishable in a top venue.

Your entire first paragraph makes me cringe. Coming from an academic/research background, all of the things you mentioned go against everything I was taught about being rigorous with your research.

Is this more evidence we have really entered the "click baiting" era? Where the end result of the research is now secondary to getting published and into big conferences to get some notoriety?

  all of the things you mentioned go against
  everything I was taught in terms of being
  rigorous with your research.
There are certain areas where "the system" asks people to do things right - but provides vast rewards for doing things wrong, if you can avoid detection.

If you're a pro cyclist who isn't doping, good for you! But there's no medal for finishing fourth; you'll be rewarded with medals and sponsorship cash for doping undetectably.

Academics whose rigorous papers aren't quite getting into top-tier journals are in a similar situation.

I also struggle to understand why PhD-led research in AI is held in such high regard. Only very rarely does it hold a candle to team-led research in the private sector.

Didn't you make the exact same post here? https://news.ycombinator.com/item?id=17497235

I disagree. There is a cost in publishing bad work, and an incentive for 'predators'. For example, MIT published a bad paper at this ICML, and I was incentivized to tell it publicly: https://medium.com/the-ai-lab/mit-paper-in-ai-for-drug-disco...

"The ubiquity of this issue is evidenced by the paper introducing the Adam optimizer [35]. In the course of introducing an optimizer with strong empirical performance, it also offers a theorem regarding convergence in the convex case, which is perhaps unnecessary in an applied paper focusing on non-convex optimization. The proof was later shown to be incorrect in [63]."

I strongly disagree here. Researchers, please keep trying to prove that your proposed optimizers are correct, even if you can only manage to prove it for a subclass of problems, and even if it risks publishing an accidentally incorrect proof.

Adam may not be the clearest example. I'd say the clearest example is the original GAN paper, where the "theorems" were completely unrelated to what they were doing and trivial at the same time.

>>and even if it risks publishing an accidentally incorrect proof.

No. If a researcher has even the slightest doubt about the correctness of their proof, they should either spend the time to verify it or talk to an expert. There is no lack of experts in convex optimization in particular. Or, instead of publishing the proof, one could just say "we believe this is provable by these methods". But one cannot shift the burden of verification to the readers while claiming the results. This is not how trust is built.

Yeah I read that proof as more of a reassurance than an emphasized part of the paper. If the method does not even work on convex problems, it's highly suspicious as a tool for nonconvex problems.

Yes, but why do it on paper when you can use an automated prover instead, since you're proving properties of an algorithm?

It smells of laziness...

Can you cite any non-trivial examples where people writing papers include machine-checked proofs, in fields of study unrelated to the study of machine-checked proofs?

"Failure to identify the sources of empirical gains, e.g. emphasizing unnecessary modifications to neural architectures when gains actually stem from hyper-parameter tuning."

I think everyone is looking for gains coming from architecture modification since more power and clever hyperparameter tuning can only go so far.

But given that any of these systems requires hyperparameter tuning, and that requires a lot of time, it is inherently hard to distinguish between tuning and novel architectures. If someone says "X worked for me but Y didn't", tuning could always be the explanation.

It seems like one could only really scientifically distinguish two architectures if the hyperparameter methodology was fixed. But that's about the opposite of now as far as my amateur exposure to the field goes.

You could also try to devise optimal hyperparameters by genetic algorithm (or a similar global optimizer) for every architecture, with a large fixed number of generations.

That should give an apples-to-apples comparison... or best-to-best. At the same time, the fragility of the hyperparameters can be evaluated.
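A minimal sketch of what that could look like, assuming a toy setup: `validation_loss` here is a cheap stand-in for "train the architecture with these hyperparameters and measure validation loss" (in reality the expensive step), and all the names and ranges are illustrative.

```python
import random

# Toy stand-in for training a fixed architecture with hyperparameters
# (learning rate, batch size) and returning its validation loss.
# The "true" optimum is lr=0.01, batch_size=64.
def validation_loss(lr, batch_size):
    return (lr - 0.01) ** 2 + (batch_size - 64) ** 2 / 1e4

def mutate(ind):
    # Perturb both hyperparameters slightly, staying in valid ranges.
    lr, bs = ind
    return (max(1e-5, lr + random.gauss(0, 0.005)),
            max(1, bs + random.choice([-16, 0, 16])))

def ga_tune(generations=20, pop_size=10, seed=0):
    random.seed(seed)
    pop = [(random.uniform(1e-4, 0.1), random.choice([16, 32, 64, 128]))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ind: validation_loss(*ind))
        survivors = pop[:pop_size // 2]          # elitism: keep the best half
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return min(pop, key=lambda ind: validation_loss(*ind))

best = ga_tune()
```

Running the same `ga_tune` budget against each architecture is what makes the comparison best-to-best; re-running with different seeds is what exposes hyperparameter fragility.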

I really appreciate the detailed analysis in the article.

Genetic algorithms, Bayesian optimization, simulated annealing, etc. are 101-level optimization algorithms; they won't get you anywhere. Hyperparameter tuning is as NP-hard as it gets, in the same theoretical ballpark as cryptology when it comes to measuring how demanding it is. In deep learning you are basically doing meta-optimization, since training the neural network is itself already non-linear optimization (hint: Adam is a non-linear optimizer); non-linear optimization is generally NP-hard unless restricted to some trivial cases like convex quadratic programming, and here you want to optimize over an infinite set of already NP-hard problems.

What are you saying goes beyond "101 optimization algorithms"? Are you saying to put another "layer" of deep learning to tune the parameters of the first "layer"? Doesn't this just add even more parameters to tune?

No, you don't put another layer of Deep Learning (well, you technically could if you had some training data for optimization, but almost certainly you don't).

You simply put another layer of (mixed-integer) non-linear optimization on top of the deep learning hyperparameters, i.e. hyper-parameters like the learning rate, batch size, category weights, even the loss-function composition, become the variables you optimize. The (mixed-integer) non-linear optimization method will have its own set of parameters (they all do), so later you might indeed want to optimize those as well, if you get hold of some hyper-computing device that compresses the first two steps to less than your lifetime. You can probably nest this kind of hyper^n optimization as deeply as you like, but I doubt it would lead you anywhere, given how terrible non-linear optimization performance is on general functions and how short the existence of the Universe is.

Sorry, it's still unclear. Do you consider genetic algorithms, etc. to be an example of (mixed-integer) non-linear optimization? Or are you talking about some other method?

>"You can probably do this kind of hyper^n optimization as deeply as you like but I doubt it would lead you anywhere, given how terrible non-linear optimization performance on general functions is and how short the existence of Universe is."

In practice you don't need to find the actual optimum though, just do better than the alternatives. This reminds me of "a single NN layer can approximate any function". Sure, but in practice a reasonable approximation will arise much more easily with other architectures.


By "101" I thought you meant something like "low level" or "first thing you learn/try". Did you mean something else?

It seems like you're confusing an algorithm that could attain the absolute optimum for the hyperparameters with something that could attain "good enough" values for them.

After all, the models that work "well enough" now, that attain state-of-the-art results, don't have the absolute best parameters but rather parameters that come from rules of thumb, trial and error, and sometimes the 101-level optimizations mentioned. I mean, we know things work, there are working results; the challenge is determining exactly how they arise.

Today you can do much better - there is a whole active and evolving sub-field that treats hyperparameter tuning as an optimization problem. An example is performing Bayesian optimization using Gaussian processes.

I hope the research from the community becomes mainstream soon.

One could optimize test error over hyperparameters by various methods. But at that point does test or validation error continue to provide the functionality that a test or validation set is supposed to provide?

You are not doing away with the test/validation set. You are doing away with:

   1. grid search - you would now only test for parameters that look promising (based on some criteria). Instead of covering the whole grid, you want to be smart about which points to try out. [1]
   2. you might still do grid search but you want to preferentially allocate resources to parameter exploration based on how promising they seem. Hyperband - [2]- is an example. 
   3. you can do both - you can be smart about picking parameters, but for the ones you pick you can preferentially allocate resources. An example is *Bayesian Optimization and Hyperband* (BOHB) [3].
[1] Bayesian Optimization using Gaussian Processes is an example of this. https://arxiv.org/pdf/1206.2944.pdf Here's a library that helps you do this: https://github.com/JasperSnoek/spearmint. But there are other techniques in this family like Randomized Online Aggressive Racing (ROAR), Deep Network for Global Optimization (DNGO), Tree-Structured Parzen Estimators (TPE), etc

[2] Hyperband: https://arxiv.org/abs/1603.06560

[3] BOHB: https://arxiv.org/abs/1807.01774
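The resource-allocation idea behind Hyperband [2] can be sketched with successive halving, its core subroutine: train many configs briefly, keep the most promising fraction, and give the survivors more budget. Everything here is a toy assumption - `loss_after` stands in for "validation loss of config `c` after training for `budget` epochs", and for simplicity its budget penalty is the same for every config (a real run would have config-dependent noise that shrinks with budget).

```python
import random

# Toy stand-in for "validation loss of config c after `budget` epochs".
# More budget -> closer to the config's true asymptotic loss.
def loss_after(config, budget):
    true_loss = (config - 0.3) ** 2   # toy: the best config is 0.3
    return true_loss + 1.0 / budget   # early (small-budget) results look worse

def successive_halving(configs, min_budget=1, eta=2, rounds=4):
    budget = min_budget
    for _ in range(rounds):
        scored = sorted(configs, key=lambda c: loss_after(c, budget))
        configs = scored[:max(1, len(configs) // eta)]  # keep top 1/eta
        budget *= eta                                   # survivors train longer
    return configs[0]

random.seed(0)
candidates = [random.random() for _ in range(16)]  # 16 random configs
best = successive_halving(candidates)
```

Hyperband proper wraps this loop in an outer loop over different (n_configs, min_budget) trade-offs; BOHB [3] additionally replaces the random candidate sampling with a model-based (Bayesian) proposal.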

Yeah that makes perfect sense -- optimizing hyperparameters more efficiently but on the train set.

Do you really consider this an obstacle? Obviously you need to have a validation set for the optimization and a new/unused test set for estimating out of sample performance.

"Population Based Training" goes some way to sorting this out, if you have a fuckton of compute available.

People have literally done this analysis, and the finding was that most gains were the result of improved hyperparameters. Why is this controversial?

References? I mean, I in fact believe this; none of my post above was arguing with the basic point. Rather, my point was literally that this is controversial because this testing is hard. You can't just write a loop over all papers, hardware configurations, and so forth. I can't imagine a given paper would do more than look at a handful of papers, do some common-sense reasoning and some statistical reasoning. And from the common-sense side of it, yeah, it seems logical, but the problem is that's less than absolute, and as the OP and my GP note, researchers have an incentive to write "that one paper where the architecture really matters".

I skimmed this, found it troubling, and started trying to make a checklist of dos/don'ts to apply to my own work.


"Mathiness" as the author puts it is one of my main complaints of many academic papers, this is not at all limited to ML. I think we really need to get away from typical math notation and start to present the calculations as code. This is more readable to the modern practicing professionally and the algorithm's degree of parallelization opportunities (or troubling lack thereof) would be more clear.

The paper needs to explain what is being done, why and under what assumptions. As long as each of those are clearly explained, in my experience, it doesn't really matter whether each of them is explained in words/math/code. Yes, one form might be a little bit more work for people from a certain background, but that's a hump you can get over relatively easily after reading a few papers on the topic. The worst is when you have to guess at what the authors might be doing and why they're doing it that way, and what assumptions they might be operating under... and all they've done is dumped their results within the page limit and before the conference submission deadline.

I've struggled with that a couple of times.

Anecdotal evidence: I wrote a technical report describing an algorithm and its novel applications, etc. A good compromise between a high-level description of the concepts and a documentation of the actual code (and thus a guide for implementing it), in my opinion.

I tried to turn that into a scientific paper and it was rejected. Changed the entire notation to a "math-y" notation that actually "obfuscates or impresses rather than clarifies" (direct quote from the article). Accepted with positive comments on the very same aspects that got it rejected in the first place. This in a Computer Science conference.

isn't that true for absolutely everything in CS?

Doing bad science is an issue that all scientific fields need to be vigilant about (e.g. the "replication crisis" in psychology). It is particularly relevant to machine learning at this point in time because i) the field is growing extremely rapidly and ii) there is lots of $s on the table for those who achieve commercialisable results. So the incentives to do bad science are increasing while the checks against it are weakening. These forces are not at play in, say, theoretical CS.

> e.g. the "replication crisis" in psychology

Just an aside, it's not just psychology. When you google "replication crisis" most articles say "... in science" for good reason.


When mentioning "replication crisis" I should probably add articles like this one too: https://www.insidehighered.com/news/2018/04/05/scholar-chall...

That article is incredibly frightening.

> to conclude that, although misconduct and questionable research methods do occur in “relatively small” frequencies,

which is kind of expected in a career track that doesn't exactly come with hedge fund-type money -- why would many people be dishonest for grants?

> there is “no evidence” that the issue is growing.

which might mean that the issue isn't in how science is practiced nowadays, but that the scientific process (grants, publication, peer review, research programs) has always been broken.

It's not clear that any of these is the case, but it's depressing to see "well, people are generally not crooks and the game has always been played this way" as an optimistic view of the replication crisis.

While outright fraud is probably rare, "questionable research methods/practices" are standard. Most people doing them don't even know there is something wrong with it. Pick any biomed or social-science paper and I will quickly find some indication of them (lack of blinding, weird unexplained sample-size changes, missing controls, failure to show a direct comparison of the variables you claim are related, etc.).

