
Why Can't I Reproduce Their Results? - sonabinu
http://theorangeduck.com/page/reproduce-their-results
======
wittyreference
There's certainly a kernel of truth to this.

I worked in a cancer lab as an undergrad. New PI had just come over to the
school. He had us working on a protocol he'd developed in his last lab to
culture a specific type of glandular tissue in a specific way.

I and two other students spent months trying to recreate his results, at his
behest. When we couldn't get it to work, he'd shrug and say some shit like
"keep plugging away at the variables," or "you didn't pipette it right." I
don't even know what the fuck the second one means.

Then an experienced grad student joined the lab. He spent, I don't know, a week at it? And he was successfully reproducing the culture.

I still don't know how he did it. I just know that he wasn't carrying a magic
wand, and the PI certainly wasn't perpetrating a fraud _against himself_. It
was just, I guess, experience and skill.

~~~
bsder
> It was just, I guess, experience and skill.

Lab technique is definitely a thing. Mostly it's about knowing what _not_ to
pay attention to so that you can give extra attention to the things that
matter. It's also about recognizing the intermediate signals that say "Yep,
still good." (ie. Are the bubbles the right size? Is the foam the right color?
Does it smell right/wrong? Did a layer of glop form at the correct step?)

However, I _rarely_ see anyone taking good enough notes to reproduce an experiment. I had a Physics Lab taught by a professor who would randomly take someone's notebook for an assignment and reproduce the experiment using their notes--and fail them on that assignment if he couldn't.

Very few students ever passed that. You really have to be excessively
meticulous. However, when you are debugging a _real_ experiment without a
known correct answer, if you aren't that meticulous you will never find your
own bugs.

~~~
scott_s
I'm reminded of baking. The recipes for everything are all known, but there's
still an enormous amount of skill and tacit knowledge required.

~~~
amluto
Hah. A lot of excellent bread recipes involve a step like “add flour until it
looks right.”

~~~
kwhitefoot
Not the ones intended for professional use.

I have a textbook of cookery [1] written for students intending to become professional cooks; it emphasizes precise measurements and repeatable processes. Where it expects adjustments, it explains what that means and what constitutes "it looks right". It stands head and shoulders above pretty much all other cookery books I have read. But of course it lacks glossy pictures and is not backed up by a video of Nigella Lawson in a low-cut dress (not that I'm objecting to Nigella and her cleavage).

[1] Professional Cookery: The Process Approach by Daniel R. Stevenson

------
altvali
As an outsider to academia, I see this as a long array of excuses, and it saddens me. A huge number of hours are wasted every year by undergrads fighting with incomplete papers when we could do better. We could enforce higher standards. Everything in a paper should be explainable and reproducible. Have a look at the efforts of PapersWithCode, Arxiv Sanity, colah's blog, or 3blue1brown in curating content or explaining concepts. I couldn't find a single excuse in this blog post for which we can't come up with a solution, if we have the consensus to enforce it.

~~~
yiyus
Enforcing higher standards is way more difficult than you think, even impractical most of the time. Reaching consensus and then enforcing it is not trivial. Requiring the level of quality of 3blue1brown from every PhD student is extremely naive.

Your comment is the equivalent of saying there should be no armies and instead
we should be peaceful and love each other. Great idea. And those excuses of
why I cannot connect A to B? Let's just reach consensus to use the same
connector everywhere!

I'd like to live in that world, but unfortunately it is not the one we got.

~~~
altvali
You're making a strawman here. What I want is for prestigious journals/conferences not to accept papers unless they come with the code, dataset, and weights required to replicate the result, and for every formula to come with explanations of its terms. If the terms are not widely used, reviewers should request a more thorough explanation.

It's not that difficult. There's already a Distill Prize for Clarity in Machine Learning. A great spark would be if a company like DeepMind or OpenAI enforced standards like these internally and hosted a conference that rewards papers meeting that standard. It would be a great PR move, at great benefit to humanity.

~~~
yiyus
There are many good papers that would not fulfill your requirements. I agree that ideally they would go through several revisions until every detail is fixed, but rejection rates would skyrocket. We would lose many great papers.

I am not too familiar with the AI field, but I know that at least in my field the quality is very often too poor. We have to try to improve it. But if I rejected every paper I review because it lacks every detail needed to replicate it, I would have rejected 99% of them. And some of the ones that eventually got accepted were very valid and very useful papers, for which replication turned out not to be so important after all.

------
YeGoblynQueenne
Large parts of this post are naked apologism for grave aberrations in computer
science research, the equivalent of code smells that have calcified and are
now "how we've always done things" because nobody has the courage to do
anything to fix them, or even knows how. That these are addressed to new PhDs
as "this is the real world and you must get used to it" is just tragic. The
surest way for all this garbage to remain the way it is, is for new PhDs to
accept it as it is and not try to do anything to change it.

~~~
voxl
Asking new PhD students to change culture is like asking Entry-Level
Developers to change the management culture of Google.

~~~
asdff
The difference is all professors were once lowly PhD students.

~~~
throwaway2245
but not all PhD students become professors - only those who successfully
(choose to) navigate the status quo.

------
m0zg
To be fair, in my field (deep learning, computer vision), papers often do not contain enough information to reproduce the results. To take a recent example, Google's EfficientDet paper did not contain enough detail to implement BiFPN, so nobody could replicate their results until the official implementation was released. And even then, to the best of my knowledge, nobody has been able to train the models to the same accuracy in PyTorch - the results matching Google's merely port the TensorFlow weights.

Much of the recent "efficient" DL work is like that. Efficient models are notoriously difficult to train, and all manner of secret sauce simply goes unmentioned; without it you won't get the same result. At the higher end of a metric, a single percentage point can mean a 10% increase in error rate, so this is not negligible.
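
To make that arithmetic concrete (illustrative numbers, not from any particular paper):

    # Dropping from 90% to 89% accuracy looks like "just one point",
    # but the error rate goes from 10% to 11%: a 10% relative increase.
    acc_before, acc_after = 0.90, 0.89
    err_before, err_after = 1 - acc_before, 1 - acc_after
    print(f"{(err_after - err_before) / err_before:.0%}")  # 10%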

To the authors' credit though, a lot of this work does get released in full
source code form, so even if you can't achieve the same result on your own
hardware, you can at least test the results using the provided weights, and
see that they _are_ in fact achievable.

~~~
StandardFuture
This is why science needs to get rid of "papers" and make everything a public
repository of information. GitHub, etc. could be used to file "Issues" (like
your anecdote) to ensure a recorded history of what the paper might be missing
or how it could be better expressed, etc.

In fact, I really do not understand why in 2020 we are writing papers/PDFs
anymore. We have _more_ than enough software and software tooling to have
superior formats for sharing and discussing information.

Is our biggest hurdle to this that the most experienced
scientists/professors/researchers in our day are not as technically competent
as they would need to be? Otherwise, why is this not already the norm? We do
not need ArXiv or journals, we need GitXiv.

~~~
m0zg
That papers are unnecessary is a common misconception among those not in the field. Papers are _extremely_ useful. I read a couple carefully every week and skim many others. But I will concede that a paper is not sufficient in most cases, at least in my field. And researchers have only a very limited amount of time to devote to engaging in "discussions". Arguably their time would be much better spent creating the next great thing rather than fixing bugs in their latest paper, as frustrating as that is for everyone else.

~~~
justinmeiners
I think the author of that comment was saying a paper alone is typically not
sufficient. A researcher should summarize their work in a written paper AND
provide resources, code, media, etc.

~~~
m0zg
Yes, but don't hold your breath for:

1\. Researchers ditching papers as the primary means of exchanging ideas.

2\. Researchers going back and fixing bugs, or even reviewing PRs.

None of this is going to happen.

~~~
justinmeiners
I don't think that's what's being requested. It's OK if resources are dumped and left "as is". We just need more than we are getting, and it's not that researchers don't have these materials; it's just not yet expected that they publish them.

------
aledalgrande
My personal experience with trying to recreate tech from research papers
(real-time mobile computer vision in my case):

\- assuming highly specialized math skills (e.g. manifolds), which I didn't have at the time, so it made testing and bug hunting harder

\- code missing, or code is a long C file vomited out by Matlab, with some
pieces switched to (Desktop) SSE instructions for speed gains

\- papers were missing vital parameters to reproduce the experiment (e.g.
exact setting for a variable that influences an optimization loop precision vs
speed)

\- the experiment was very constrained and the whole algorithm would never have worked in real life, which is what I had suspected for the first few months (meh)

\- most papers, as the article says, are just a little bump over the previous
ones, so now you have to read and implement a tree of papers

\- sometimes a dataset was needed to train a model, but the dataset was closed source and incredibly expensive, so not a good avenue

At the time I was also working with the first version of (Apple) Metal which, after I went crazy over why my algo wasn't working, I discovered had a precision bug in the division operator. FML

Still, it was a very instructive experience. The biggest takeaway, if you do something similar: don't be certain that once you've implemented an algorithm, it will work as advertised. It's totally different from, say, writing an API; it's not a well-constrained problem.

~~~
snovv_crash
In my experience of the same field, the only lab you can consistently trust
the outputs from is Microsoft Research. I don't know if they have some kind of
blind QA or something, but everything from them has worked as advertised.

~~~
aledalgrande
Yeah, Microsoft Research, Disney Research and ETH are the most reliable. You
learn very fast to read only papers from the biggest universities and
companies.

------
ishcheklein
Hey, DVC maintainer here. For those who are interested in this topic, I like this post about the same problem (industry focused): [https://petewarden.com/2018/03/19/the-machine-learning-reproducibility-crisis/](https://petewarden.com/2018/03/19/the-machine-learning-reproducibility-crisis/) and an excellent talk from Patrick Ball on how they structure data projects: [https://www.youtube.com/watch?v=ZSunU9GQdcI&t=1s](https://www.youtube.com/watch?v=ZSunU9GQdcI&t=1s)

~~~
Pick-A-Hill2019
I agree and disagree (btw, previously discussed 2 days ago at [https://news.ycombinator.com/item?id=23732531](https://news.ycombinator.com/item?id=23732531)).

Yes, I agree: nine times out of ten you've made a typo, connected the red wire where the black wire should be.

But I disagree with the overall sentiment of the article, which fails to highlight that, when all else has failed and you have double-checked things on your end... sometimes there IS a typo, an incorrect equation, a simple error IN THE ORIGINAL research. Sure, blame yourself as a first instinct (which is a good thing to do), BUT there is indeed a replication crisis currently in all fields of study and research. Forgive the link to a totally non-tech field, but it illustrates my counter-point as well as any other.

"The Role of Replication (The Replication Crisis in Psychology)"[1]

The openness of psychological research – sharing our methodology and data via
publication – is a key to the effectiveness of the scientific method. This
allows other psychologists to know exactly how data were gathered, and it
allows them to potentially use the same methods to test new hypotheses.

In an ideal world, this openness also allows other researchers to check
whether a study was valid by replication – essentially, using the same methods
to see if they yield the same results. The ability to replicate allows us to
hold researchers accountable for their work.

[1] [https://courses.lumenlearning.com/ivytech-psychology1/chapter/the-replication-crisis-in-psychology/](https://courses.lumenlearning.com/ivytech-psychology1/chapter/the-replication-crisis-in-psychology/)

------
entha_saava
It is saddening to see so much pessimism in the thread.

Tight regulations on reproducibility are the first thing academia needs. Academia these days is a rat race with so much dishonesty and optimizing for the measure (citation count, etc.). I have seen too many professors producing low-quality papers for the sake of producing papers, which creates a lot of noise and a cargo-cult PhD culture. Without reproducibility, how can you even trust the results?

While academic code doesn't adhere to the code-quality standards of software development, I don't think many software engineers would ridicule an author for publishing their code, let alone other academics.

Btw, I submitted a link (a very comprehensive article by a CS academic) but it was lost in the HN noise. Maybe someone with higher karma could repost it? I found the article through the nofreeview-noreview manifesto website, and it's very well written; it covers a number of problems like this one: [http://a3nm.net/work/research/wrong](http://a3nm.net/work/research/wrong). PS: I am in no way affiliated with the author. Just mentioning it because that article deserves to be on the HN front page.

------
worik
This is why publishing code and data, together, is so important.

Irreproducible results in computing have no justification in this day and age.

~~~
projektfu
If you run their code on their data you haven't reproduced their result. You have merely copied it. If their code works on another very similar data set to produce a very similar result, then you have something. Or if you're trying to understand the technique, you have to write your own code and see if it reproduces their result on their data.

~~~
worik
What nonsense! Copying something is reproducing it. If it cannot be copied
then it cannot be reproduced. If it can be copied then it can be investigated.

I ran into exactly this problem with White's Reality Check (WRC), described in White's (2000) Econometrica paper titled "A Reality Check for Data Snooping".

The algorithms that were being tested were so poorly described that less than half could be reproduced.

Once we had reproduced what we could, we found two fundamental problems with White's Reality Check: one theoretical, blindingly obvious once you see it, and the other that they ignored trading costs. When we measured the levels of trading costs that would permit the algorithms to work, it was clear they could not.

That we could not deduce without reproducing the results. If we could have copied the original code it would have shaved about eight weeks from my research.

BTW, White's Reality Check was at the time one of the most cited papers in its, rather narrow, field.

~~~
projektfu
But running their code, presumably, would get their result on their data,
unless it was a fraud or weird misconfiguration. We’re not talking about
fraud, we’re talking about reproducibility supporting a theory.

You suspected the result was wrong, so your interest was in debugging their
code to contradict their result. That has value but also is not a reproduction
of the study.

------
gitgud
Even academic papers with open-source code can be infuriating to get working. Usually it's a mess of hidden dependencies, specific versions of global libraries, hard-coded paths, undocumented compiler settings, specific OS versions...

Usually, the highly specific experience and knowledge of the author is assumed of the reader...
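
One low-effort mitigation authors could ship alongside the code (a sketch; the version pins here are made up): a small guard that fails loudly when the environment drifts from the one the results were produced on.

    import sys
    
    # Hypothetical pins: replace with the versions the results were produced on.
    EXPECTED = {"numpy": "1.18.5", "torch": "1.5.1"}
    
    for name, want in EXPECTED.items():
        have = __import__(name).__version__
        if have != want:
            sys.exit(f"{name} {have} != expected {want}: results may not reproduce")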

------
Heliosmaster
A bit of a plug, but this is exactly the reason why we are building
Nextjournal [0].

We've built the platform from the ground up with immutability in mind,
leveraging Clojure and Datomic which are a great fit for this architecture.

[0]: [https://nextjournal.com](https://nextjournal.com)

------
User23
Just as a neat bit of trivia that isn't mentioned in the article, the inventor
of the equals sign was Robert Recorde[1]. Dijkstra provides some additional
background at [2].

[1]
[https://en.wikipedia.org/wiki/Robert_Recorde](https://en.wikipedia.org/wiki/Robert_Recorde)

[2]
[https://www.cs.utexas.edu/users/EWD/transcriptions/EWD10xx/E...](https://www.cs.utexas.edu/users/EWD/transcriptions/EWD10xx/EWD1073.html)

------
smitty1e
> To describe the implementation in a way which is less precise, but simpler,
> shorter, and easier for the reader to understand.

I'm waiting for the textbook that offers formulae with code and a bit of regression data.

------
eximius
I do wonder if standardized equipment with digital controls (or digitally
monitored controls) to record all salient values throughout an experiment
would 'solve' this.

Obviously some cutting edge stuff can't use standardized equipment, but can
you standardize a lot of other stuff?

~~~
rjsw
I don't think you need to go as far as standardizing the equipment; capturing the control settings as well as the output data in a standard form would do.

I'm currently trying to push the use of an existing standard [1] for capturing
engineering experimental results.

[1]
[https://en.wikipedia.org/wiki/ISO_10303](https://en.wikipedia.org/wiki/ISO_10303)
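
Even without a full standard like ISO 10303, writing the control settings next to the output data in a machine-readable form already goes a long way. A minimal sketch (the instrument settings here are invented):

    import json
    from datetime import datetime, timezone
    
    # Hypothetical control settings for one experimental run.
    run = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "settings": {"laser_power_mw": 35.0, "exposure_ms": 120, "temperature_c": 21.4},
        "output_file": "run_0042.csv",
    }
    
    # Store the settings alongside the output data they produced.
    with open("run_0042.meta.json", "w") as f:
        json.dump(run, f, indent=2)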

------
fxtentacle
In the case of AI: just train again. Good or bad luck with the random weight
initialization can have a huge influence on your results. Nobody really talks
about it, but many "pro" papers use a global seed and deterministic randomness
to avoid the issue.
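
For what it's worth, a minimal sketch of what that global seeding usually looks like in a PyTorch setup (an assumed, typical arrangement; some GPU ops remain nondeterministic regardless):

    import random
    import numpy as np
    import torch
    
    SEED = 42  # the "global seed" many papers fix but rarely mention
    
    random.seed(SEED)                 # Python's own RNG
    np.random.seed(SEED)              # NumPy RNG (shuffling, augmentation)
    torch.manual_seed(SEED)           # PyTorch CPU and CUDA RNGs
    torch.backends.cudnn.deterministic = True  # pick deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False     # disable nondeterministic autotuning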

------
gravypod
> I'm about to tell you something which can sometimes be harder to believe
> than conspiracy theories about academia: you've got a bug in your code.

If this is the case in the majority of instances where someone fails to reproduce the software backing a paper, then there may be another issue at play. Someone re-implementing a paper is "just" taking a written description of something and transcribing it to code. If a mistake that can completely ruin the results of the work can be made this easily, it should be fairly straightforward to see that the original implementer could also have made a mistake that threw off their results.

Does the world of academia have any tools to prevent this? It seems like this could affect a lot of the research being done today. Given the following:

1\. Most research being done today utilizes some software to generate its results.

2\. This software often encodes some novel way of implementing some analysis
method.

3\. The published paper from research with a novel method of analysis will
need to describe their method of analysis.

4\. Future researchers will find this paper, attempt to implement the
described analysis method, and publish new results from this implementation.

We can see we are wasting a lot of resources reimplementing already written
code. We may also not be implementing this code correctly and may be skewing
results in different ways.

> Debugging research code is extremely difficult, and requires a different
> mind set - the attention to detail you need to adopt will be beyond anything
> you've done before. With research code, and numerical or data-driven code in
> particular, bugs will not manifest themselves in crashes. Often bugged code
> will not only run, but produce some kind of broken result. It is up to you
> to validate that every line of code you write is correct. No one is going to
> do that for you. And yes, sometimes that means doing things the hard way:
> examining data by hand, and staring at lines of code one by one until you
> spot what is wrong.

This is a very common mindset among academics, but I don't understand why that is the case. A paper seems like a fantastic opportunity to define an interface boundary. If your paper describes a new method to take A and transform it into B, then it should be possible for you to write and publish your `a_to_b` method alongside your paper. You could even write unit tests for your `a_to_b`. If a future researcher comes along and finds a way to apply your `a_to_b` to more things, they could modify your code and, just by rerunning your tests, verify that their implementation actually works.
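
As a sketch of what that could look like (the transform `a_to_b` and its test values are invented here purely for illustration):

    import unittest
    import numpy as np
    
    def a_to_b(a: np.ndarray) -> np.ndarray:
        """The paper's hypothetical transform: scale each row to unit sum."""
        return a / a.sum(axis=1, keepdims=True)
    
    class TestAToB(unittest.TestCase):
        def test_rows_sum_to_one(self):
            b = a_to_b(np.array([[1.0, 3.0], [2.0, 2.0]]))
            np.testing.assert_allclose(b.sum(axis=1), [1.0, 1.0])
    
        def test_known_value(self):
            b = a_to_b(np.array([[1.0, 3.0]]))
            np.testing.assert_allclose(b, [[0.25, 0.75]])
    
    if __name__ == "__main__":
        unittest.main()

A future researcher who re-implements the method can rerun exactly these tests against their own version to confirm they understood the edge cases.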

If a future researcher decided to use your `a_to_b`, you could write some code to automatically generate a list of papers to reference.

If we are seriously spending this much time treading the same water then it
should be possible to dramatically improve the quality and throughput of
academics by providing some tool like this to them.

I know someone will say "but you gain so much knowledge re-implementing XYZ", and to them I'd say that you don't need to read the code while reading the paper. You could write the code yourself and just use the unit tests provided by the author to make sure you fully understand each edge case.

~~~
yummypaint
There have been pushes to publish more code in recent years. Most journal
article formats at least allow for a "supplemental material" section like
here:
[https://journals.aps.org/prc/authors](https://journals.aps.org/prc/authors)

Publishing code in a journal article is analogous to publishing key equations. Equations are just functions themselves, after all. However, articles have to be succinct, and journals still exist in a hybrid physical/digital space. It isn't very useful to physically ink thousands of lines of code onto a page, especially if it represents a vastly different level of technical detail than the rest of the paper. In practice, people who write comprehensive software typically make it available through GitHub or similar, and put a reference in the article. If not that, people will send you their code if you contact them. If they're stingy you may not get the source, though. I don't know of any funding agency requirements that source code be made available, though I think that might be a good thing to try.

I think the biggest difficulty is with medium-sized pieces of code. Small
enough that the bookkeeping/maintenance needed to make it available gets
skipped, but large enough that it isn't possible to provide details in the
article.

------
7532yahoogmail
Wow. Hell of a good read. And smart points too.

------
ipunchghosts
Let me make a statement and let you judge it. (It's below, stated as "STATEMENT".)

BACKGROUND: I have been working for 15 years in industry doing hardcore ML. I had the fortunate drive and background to be able to get my master's degree from an R1 school while working full time. No watered-down online degree, no certificate. I drove to class twice a week for 4 years and did a full thesis, which was published. Since then, I have published 6 papers, all peer reviewed. I even did a sabbatical with another research lab, to which I was invited.

After 15 years, I decided to go back and get my PhD, all while continuing to work full time. My thought was that it would be easy to get a PhD with all the technical experience and math chops I've developed over the last 15 years. I have essentially been doing math 5 days a week for 15 years. Here's what happened...

Coursework was a breeze. I barely put any time into it and I can easily get a B+. This is really helpful because I am working 40-50 hours a week at my full-time job and managing my family. I passed my candidacy exam on the first try with little issue (this is rare for my department).

The biggest hangup I have about the PhD process is what my advisor wants me to do when writing papers. He is the youngest full professor in the department and comes from a well-known and well-respected graduate research university. But the way he has me slant my papers is absurd. Results which I feel are very important for the reader to assess whether they should use the method, he has me remove because they are "too subtle." He is constantly beating on me to think about "the casual reviewer."

Students in the lab produce papers which are very brittle and overfit to the test data. His lab uses the same dataset paper after paper. My advisor was so proud of a method his top student produced that he offered the code for my workplace to use. It didn't work as well as a much simpler method we used. Eventually we gave the student our data so there could be the fairest possible shake at getting the student's method to work. The student never got the method to work nearly as well as in his published paper, despite telling my company and me over and over that "it will work". The student is now at Amazon Lab126.

STATEMENT: Academia is peer-review driven, but the peers are other academics, and so the system of innovation is dead; academics have very little understanding of what actually works in practice. Great example: it's no surprise that Google has such a hard time using ML on MRI datasets. The groups working on this are made up of PhDs from my grad lab!

TL;DR - worked for 15 years, went back for a PhD, here's what I hear:

"think of the casual reviewer"

"fiddle with your net so that your results are better than X"

"you have to sell your results so that its clear your method has merit"

"can you get me results that are 1% better? use the tricks from blog Y"

"As long as your results are 1% better, you are fine"

Edit 1: The boasts above are given to avoid "your experience doesn't count because you X" strawmans, where X = {are lazy, are a young student, are inexperienced, went to an easy school, are in an easy program, are naive to the peer review process}.

~~~
devalgo
Of course you have the other side of the coin, where those same ML students take a pre-trained VGG model, fine-tune it on a couple thousand pictures of hotdogs or whatever, and raise millions in VC money for their "AI" company.

~~~
mennis16
Don't these companies usually end up failing though? I'm not very familiar
with the startup space but in academia it feels like these system-gaming labs
receive perpetual encouragement.

~~~
ncmncm
"Failing" is more nuanced than you might think.

In a very large number of cases, the startup closing its doors after two years
is not considered a failure by the VC fund. The startup successfully spent the
money the VC firm was contractually obligated to invest, may have employed the
VC's choices of officers for a significant period (providing them income and
experience), maybe purchased a great deal of tech from suppliers the VC
officers are themselves invested in, and may have left valuable assets that
could be snapped up.

The biggest problem a VC firm has is the five billion dollars of others' money
they have to "place" in 30/60/90 days. What happens after placement is much
less their problem. They know most of the placements are duds, but they and
the actual investors knew that up front. Once the money is "placed", though,
much of it can be siphoned off for the benefit of the VCs' cronies or one or another non-dud. Maybe a non-dud or a non-startup buys up the assets of a dud for pennies on the dollar and extracts something usable, like patents or
equipment. Sure, the investor lost that money, but somebody got it, and
somebody got what it bought.

None of this is good for most people who do a startup, unless theirs is chosen as a non-dud. The chosen duds are valuable for money laundering, which few startup principals really meant to sign up for. Some did.

------
Ecco
Who’s «Reviewer 2»?

~~~
tetromino_
Slang term for an unreasonably nitpicky reviewer who will never be happy with the state of your paper. Here is an introduction to the stereotype: [https://arstechnica.com/science/2020/06/empirical-analysis-tells-reviewer-2-go-f-yourself/](https://arstechnica.com/science/2020/06/empirical-analysis-tells-reviewer-2-go-f-yourself/)

