
Missing data hinder replication of artificial intelligence studies - jonbaer
http://www.sciencemag.org/news/2018/02/missing-data-hinder-replication-artificial-intelligence-studies
======
lettergram
Hell, I did an analysis on "Causality in Crypto Markets":

[https://blog.projectpiglet.com/2018/01/causality-in-cryptoma...](https://blog.projectpiglet.com/2018/01/causality-in-cryptomarkets/)

I provided code and the data (since I knew no one would believe me without
evidence). Why do we believe researchers who have more incentive than I do to
publish? Their jobs require publication, or at least are greatly improved by
it, so they have every incentive to be less than honest about their
"experiments". Most of the "A.I. revolution" isn't about algorithms; it's
about having more data and _the right data_. That's why Google and Facebook
can outcompete in research as much as they do. They have all the data.

Personally, I've almost made it a life mission to try and replicate every
study I can. After watching people in labs (private and public biology labs)
just straight up make up data, I never believe anything.

I also wrote some stuff on backtesting (which in industry is viewed as the
gold standard, but IMO is deeply prone to flaws):
[https://blog.projectpiglet.com/2018/01/perils-of-backtesting...](https://blog.projectpiglet.com/2018/01/perils-of-backtesting/)
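
To give a flavor of the kind of flaw I mean, here's a toy sketch (not from
the post, just an illustration of one classic pitfall, look-ahead bias): a
"strategy" that accidentally conditions on the very return it is supposed to
predict backtests beautifully on pure noise.

    # Toy illustration of look-ahead bias: the "leaky" strategy conditions on
    # today's return (information it couldn't have had), so the backtest looks
    # profitable even though the simulated returns contain no signal at all.
    import numpy as np

    rng = np.random.default_rng(42)
    returns = rng.normal(0.0, 0.01, size=10_000)  # i.i.d. daily returns, pure noise

    honest = returns[np.roll(returns, 1) > 0]  # trade on yesterday's sign
    leaky = returns[returns > 0]               # trade on today's sign (look-ahead)

    print("honest mean daily return:", honest.mean())  # ~0
    print("leaky mean daily return: ", leaky.mean())   # clearly positive

Real backtests fail in subtler versions of the same way: survivorship bias in
the asset universe, tuning parameters on the evaluation period, and so on.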

Having worked in both industry and academia... I don't believe 90% of the
research I hear.

~~~
madenine
It's gotten to the point where the first thing I do when reading a paper is
skim for their training data.

If it's not from a public/standard/replicable data source, my bullshit meter
hops up a couple of steps.

"We improved on the current SotA by 15%, but you can't see the data or the
code. Trust us, it's for real."

~~~
MaxBarraclough
A similar line of thinking to the _always provide the source code behind your
paper_ position.

~~~
dmreedy
In fact, I'd go so far as to say that data _is_ source code in the case of
Machine Learning; it just goes through a kind of compilation step.

~~~
MaxBarraclough
Careful now, don't tempt the LISPers.

------
kuschku
I hinted at this issue here on HN a few days ago:

 _This is why about 90% of the papers Google publishes are useless._

 _They describe deep learning successes, with datasets that aren't public, on
hardware that isn't public, with software that isn't public._

 _They describe technologies, be it Spanner, Borg, FlumeJava or others,
relying on closed hard- and software._

 _If an average university professor can't replicate research 1:1 based
solely on existing, open technology and the paper, the paper should not be
accepted by any journal. In fact, it shouldn't even be accepted at
conferences, or be able to get a DOI. That's not worthy of being called
science._

And it's problematic when a large amount of AI research happens at Google,
Facebook and co., and none of it can be replicated. Science requires
replication; if AI research can't be replicated, it's not science.

~~~
biggodoggo
The problem is that when BigCo does AI research, they are not doing it for the
greater good; they are trying to find a competitive edge. Unfortunately there
are no good incentives (besides doing the "right" thing) for companies to open
their datasets to the public, so until there are we can expect more of the
same...

~~~
tensor
This isn't fair to researchers at companies. While the goal of the
organization may be to make money, the individual researchers who publish the
papers are very likely doing it for the greater good.

~~~
YeGoblynQueenne
Yeah, I don't see that. If you're an employee of a company (not least one
that's paying you a very fat salary), there is an expectation that you will do
work that benefits the company's bottom line. If that happens to be aligned
with the "greater good" then OK, but there is no such guarantee, and if it
isn't... well, then your research will only benefit the company.

~~~
tensor
Of course your work benefits the bottom line, but being able to publish is
often a perk that companies have to offer in order to attract and keep people
who care about those things. It's actually incredibly insulting that you would
assume you know the motivations of individual employees.

The world isn't as black and white as you'd like it to be.

~~~
janekm
If journals/conferences were to refuse unreplicable results, companies that
wish to offer this perk to their employees would have to make a sufficient
part of their datasets public. This would be a good thing.

------
lacker
I was shocked in CS graduate school the first time I downloaded the code
referred to by a very well-respected paper in the field, and found that it did
not even compile. By the 20th time, I was less shocked, but I had lost a lot
of respect for CS academia.

The correct standard should be: if you claim you wrote a program that does X,
the program should be available along with simple instructions that explain
how to run it. The reviewers should then follow the instructions and verify
that it actually works and returns the results described in the paper! This
basic process would sadly invalidate most computer science research.
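
To be concrete, the "simple instructions" could be as small as a single
self-contained entry point that reviewers run and diff against the numbers in
the paper. A minimal sketch of the shape I mean (the file name and the stub
experiment are made up for illustration; a real paper would plug in its own
data and model code):

    # reproduce.py (hypothetical): one command, prints the metrics a reviewer
    # can compare against the paper's reported table.
    import json
    import random

    def run_experiment(seed: int) -> dict:
        # Stand-in for the paper's actual training and evaluation code.
        random.seed(seed)
        predictions = [random.random() > 0.5 for _ in range(100)]
        labels = [random.random() > 0.5 for _ in range(100)]
        accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
        return {"accuracy": round(accuracy, 3)}

    if __name__ == "__main__":
        print(json.dumps(run_experiment(seed=42), indent=2))

Even that much would catch the papers whose code doesn't compile.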

~~~
PeterisP
The vast majority of such papers don't really involve a claim that they wrote
a program that does X. I mean, they probably did write such a program, but
that's tangential to the paper and not particularly relevant; the fact that
they did isn't one of claims in the paper, it's not something they're trying
to prove/show in the conclusions. There are some papers in the style "Software
package for doing X", but those are a minority and generally limited to major
releases of widely used open source tools for that problem; most CS papers are
not about software, just like most astronomy papers are not about telescopes.

E.g. the paper is about some new method to approach X. In general, the paper
could be valid even without any program implementation whatsoever; but you'll
likely supplement the method description with some experimental evaluation.
The paper is not about the experimental evaluation or the "experimental
apparatus", i.e. the code they used. The paper is likely making some claims
about the method as such, not about any single particular implementation of
the method, including their own. As part of the main claim "this method seems
good and interesting" they're providing some evidence: "we tried, and in
certain conditions described here applying method X was 10% better than
method Y". But the actual code used to do that is just supplementary
material; the code is a tool they used in research, not the result of the
research they published. The code is the "telescope" you used to make an
observation, but the paper is about the observation, not about the telescope.

It's just as with medicine: we generally accept papers that say "well, we
tried this procedure on 100 patients and 73 of them got better" without
requiring video evidence of those 73 patients actually getting better. In the
same manner, we accept CS papers that say "well, we tried this procedure on
100 datapoints and 73 of them got correct results" and don't require the
reviewers to reproduce the experiments and verify that the experimental
results aren't falsified, just as reviewers don't try to reproduce
experiments in pretty much any other science.

~~~
infogulch
The difference is that when a medical paper claims

> well, we tried this procedure on 100 patients

"this procedure" is the most important part, and they describe it in detail in
the paper, hopefully well enough that someone could attempt to replicate it,
and some do attempt to replicate it. (Not that procedure descriptions in such
papers are always sufficient for this.)

The difference between that and

> we tried this procedure on 100 datapoints

is that it's nigh impossible to describe an ML procedure in enough detail to
reproduce it with just the description in the paper. Tiny changes in the
parameters and construction can completely change the result; the only way to
reproduce it is if you have the source code. And also the source _data_,
which is just as important as the source code, if not more so (see sibling
thread).
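
As a toy illustration of that sensitivity (my own sketch, not taken from any
particular paper): train the same small model twice, changing only the random
seed and one hyperparameter, and the scores can easily differ by more than
the "improvement" many papers report.

    # Same model family, same data; only the seed and learning rate differ.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for seed, lr in [(0, 1e-3), (1, 1e-2)]:
        clf = MLPClassifier(hidden_layer_sizes=(32,), learning_rate_init=lr,
                            max_iter=200, random_state=seed)
        clf.fit(X_tr, y_tr)
        print(f"seed={seed} lr={lr}: test accuracy {clf.score(X_te, y_te):.3f}")

Without the exact configuration, seed handling, and data, a reader has no way
to tell which of those runs a headline number corresponds to.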

The opportunity that academic CS has over _every other science_ is that it
could empower _every reader_ with the capability to verify the results of
_every paper they read_, and this is _actually attainable_. Reviewers of
other sciences don't reproduce findings themselves for purely practical
reasons that don't need to exist in CS.

~~~
joepvd
Trying out a procedure on one dataset amounts to a single data point. You need
to process a bunch of datasets to establish the performance of the procedure
under study.

I thought ML folks would have the statistical background to know you cannot
infer a true statement from a single occurrence?

------
Eridrus
This seems pretty straightforward to resolve: just have conferences say they
will add half a point to the average review score for papers with code.

Most reviewers seem to agree that, outside the top and bottom 10% of papers,
the primary differentiator is luck, so we might as well add code to that mix.

------
carlmr
>Gundersen says the culture needs to change. "It's not about shaming," he
says. "It's just about being honest."

I think that it should be about shaming. If you shame those who don't provide
enough information to replicate, you might create an incentive to publish
properly.

Replication problems create so much noise for further research that I would
classify a researcher who doesn't publish replicable results as hostile to the
research community.

------
j_m_b
Welcome to the world of science.... where the data is scant but the results
are impressive!

Seriously though, publish your data in the supplementary materials. Quit being
afraid of failure or someone catching your mistakes. This is science, not
figure skating.

~~~
epistasis
I would encourage going one step further, and finding a repository for the
data outside of the journal. Journals can be terrible at keeping supplementary
files around, and I have people email asking for the supplementary data after
a website redesign, or similar. Journals also sometimes do weird things like
convert moderate size tables to PDF (making them nearly useless), and it's not
clear what's happening when you're submitting the article always.

The other advantage of a repository is that it gets included in large data
collections, and is in a standardized format.

Often it's a lot of work to upload to repositories, nearly as much as the
analysis (if you have a good analysis pipeline), but is well worth it both for
oneself and for the community.

------
gbrown
Missing data hinders the generalizability and long-term stability of ML
techniques even within an organization. This is just one symptom of the fact
that the "data generating mechanism" changes constantly, in both obvious and
subtle ways, due to changes in business practice and market conditions.

------
diminish
For the last few weeks I've been reading a lot of scientific research on
speech recognition algorithms and technology. Without open data, open
algorithms and open software, it's quite impossible to reproduce any of the
results claimed by some articles. I think science, and especially computer
science, has a fundamental problem in execution and research publishing. We
need a novel, revolutionary approach beyond peer review, where
'reviewability' is quick and within reach, if not instant, bundled inside the
research documents themselves.

------
afandian
If you're interested in finding out more about data citation, you may find
these interesting:

[https://www.crossref.org/blog/the-research-nexus---better-re...](https://www.crossref.org/blog/the-research-nexus---better-research-through-better-metadata/)

(edit - missing data citations from scholarly publications cause/propagate the
replication problem described in the article)

