
Machine Learning Reproducibility Checklist [pdf] - sonabinu
https://www.cs.mcgill.ca/~jpineau/ReproducibilityChecklist.pdf
======
Abishek_Muthian
This is a useful checklist. It reminded me of a recent episode of Babbage from
Economist Radio[1] on whether AI is the end of the scientific method.

The argument was that in ML/DL, experiments are run at large scale without a
hypothesis, in a trial-and-error fashion of radical empiricism, which runs
against the scientific method, i.e. hypothesis, experiment, observation, theory.

[1][https://soundcloud.com/theeconomist/babbage-ai-the-end-of-
th...](https://soundcloud.com/theeconomist/babbage-ai-the-end-of-the)

~~~
djaque
I'd say that the scientific method is just a formal process taught to school
kids and that most scientists don't follow it either. At least they don't in
my field (physics).

It's more like "hypothesis" -> "experiment" -> "uh, this is kind of weird" ->
"changes hypothesis to fit the data" -> "take some more data" -> "huh...
here's this cool thing unrelated to any of my other hypotheses" -> "switches
topic to something more viable".

~~~
abdullahkhalids
You are absolutely right, but I don't think the philosophy of science demands
that you actually follow the scientific method during discovery mode. What it
says, though, is that if you discover any scientific knowledge, its discovery
process must be recastable in the form of the scientific method. Meaning, when
someone else checks or replicates your result, they can actually follow the
four-step process and recreate that same knowledge.

Science, more than just a collection of facts, is a description of a set of
experimental processes that recreate those facts. That is where its
authority-less power comes from.

------
sillysaurusx
This checklist has some flaws. Most interesting results in ML have no proof.

For example, can you give a proof of superconvergence? What’s the exact
learning rate that causes it, and why? Did you know that you can often get
away with a high learning rate for a time, and then divergence happens? What’s
the proof of that?
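For what it's worth, superconvergence is usually reported under one-cycle
learning-rate schedules (Smith & Topin). A minimal sketch of such a schedule,
where the peak LR, floor, and step counts are made-up illustration values and
not a recipe:

```python
# Hypothetical one-cycle learning-rate schedule: ramp linearly up to a high
# peak LR for the first half of training, then back down. The "get away with
# a high LR for a time" behavior lives in the middle of this schedule.

def one_cycle_lr(step, total_steps, max_lr=1.0, min_lr=0.04):
    """Linear warmup to max_lr at the midpoint, then linear decay."""
    half = total_steps // 2
    if step < half:
        frac = step / half                  # warmup phase
    else:
        frac = (total_steps - step) / half  # decay phase
    return min_lr + (max_lr - min_lr) * frac

schedule = [one_cycle_lr(s, 100) for s in range(100)]
print(schedule[0])       # starts at the small min_lr
print(max(schedule))     # peaks at max_lr mid-training
```

Nothing in this sketch proves *why* the high-LR phase sometimes works and
sometimes diverges, which is exactly the point: the schedule is easy to write
down, the guarantee isn't.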

Give a proof that under all circumstances and wind conditions, lowering your
airplane’s flaps by 5 degrees will help you land safely.

Also, what about datasets that you’re not allowed to release? I personally
despise such datasets, but I found myself in the ironic position of having a
10GB dataset dropped in my lap that was a perfect fit for my current project.
Unfortunately it wasn’t until after training was mostly complete that we
realized we hadn’t asked whether the author was comfortable releasing it, and
indeed the answer was no. So what to do? Just don’t talk about it?

I guess the list is good as a set of ideals to aim for. I just wish some
consideration was given that you often can’t meet all of those goals.

Most of OpenAI's work would be excluded by this checklist. I don't think
anyone would argue that OpenAI doesn't do important work, or that their
results aren't in some sense reproducible.

~~~
TeMPOraL
> _Give a proof that under all circumstances and wind conditions, lowering
> your airplane’s flaps by 5 degrees will help you land safely._

My passing familiarity with aerodynamics and control theory suggests that you
could derive a multidimensional shell in parameter space (of wind,
temperature, airspeed and other conditions), to form an envelope within which
the plane will behave predictably, so that you can prove whether lowering
flaps by 5 degrees at a particular point will help you land safely. Accounting
for model uncertainty, that envelope would likely be tighter than it could be
if we knew our physics better, but that's still far better than a black box ML
model that doesn't give you guarantees that similar inputs will lead to
similar outputs (there's a mathematical formalism to this whose name escapes
me now).
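To make the envelope idea concrete, here's a toy sketch where the envelope is
just a rectangular box of conditions. All the condition names and bounds are
hypothetical, and real flight envelopes are far more complex shells than a box:

```python
# Toy "safety envelope": a control action is certified only for states that
# fall inside these (low, high) bounds. Everything here is illustrative.
ENVELOPE = {
    "airspeed_kt": (110.0, 180.0),
    "headwind_kt": (-10.0, 25.0),
    "temp_c":      (-20.0, 45.0),
}

def inside_envelope(state):
    """True iff every condition lies within its certified bounds."""
    return all(lo <= state[k] <= hi for k, (lo, hi) in ENVELOPE.items())

print(inside_envelope({"airspeed_kt": 140, "headwind_kt": 5, "temp_c": 15}))
print(inside_envelope({"airspeed_kt": 200, "headwind_kt": 5, "temp_c": 15}))
```

The point of the formal approach is that inside the box the behavior is
provably predictable; a black-box model offers no such membership test.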

~~~
igorkraw
>but that's still far better than a black box ML model that doesn't give you
guarantees that similar inputs will lead to similar outputs (there's a
mathematical formalism to this whose name escapes me now).

You might be thinking of K-Lipschitz smoothness or K-Lipschitz continuity of a
model for a given norm, i.e. $\Vert f(x)-f(y)\Vert \leq K \Vert x-y\Vert$. A
guarantee that K is smaller than or equal to a certain value is called a
Lipschitz certificate. We _can_ by now give this type of guarantee in balls
around the training data (coming out of adversarial-example research), but
with some limitations, and the generalization of Lipschitz certificates to
test data and/or other norms is pretty bad in general.
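A quick sketch of what the condition means in practice, on a toy linear model
(the model and sampling here are assumptions for illustration; note that a max
over sampled pairs only gives a *lower* bound on K, whereas a real certificate
bounds K from above analytically):

```python
# Empirically probe the Lipschitz condition ||f(x)-f(y)|| <= K ||x-y||
# for a toy linear model f(x) = W x, whose exact K is the spectral norm of W.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))  # toy "model" weights

def f(x):
    return W @ x

def empirical_lipschitz(f, dim, n_pairs=1000):
    """Max ratio ||f(x)-f(y)|| / ||x-y|| over random pairs: a lower bound on K."""
    best = 0.0
    for _ in range(n_pairs):
        x, y = rng.normal(size=dim), rng.normal(size=dim)
        best = max(best, np.linalg.norm(f(x) - f(y)) / np.linalg.norm(x - y))
    return best

K_lower = empirical_lipschitz(f, 5)
K_true = np.linalg.norm(W, 2)  # spectral norm = exact K for a linear map
assert 0.0 < K_lower <= K_true
```

For a deep net there is no closed-form K, which is why certified bounds are
only available locally, in balls around the training data.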

I personally think that the term "black box ML model" needs to die with
respect to neural networks; the theory work being done has pried open that box
sufficiently by now that we can start reasoning about them somewhat. People
just generally don't like the answers, because they unveil limitations or
challenges.

------
YeGoblynQueenne
This gives a little more context:

[https://www.nature.com/articles/d41586-019-03895-5](https://www.nature.com/articles/d41586-019-03895-5)

------
DrNuke
This is aimed at production or critical applications, though, not forefront or
blue-sky research. In the former case, we need a shared and agreed framework
to make sure everyone, everywhere, gets statistically comparable results, and
this checklist helps in that sense. In the latter case, it is an open field:
we look for agreeable approximations of results before any method exists, and
the method is devised later to fit the concordant results.

~~~
MAXPOOL
I agree with you in principle. I still think reproducibility should be a goal
even in pure, blue-sky machine learning research, for the following reasons:

Even basic research can be sensitive to omitted parameters, setups and
starting conditions. Honest mistakes, accidental omissions and failing to spot
sensitivity to parameters happen all the time. Writing "A clear explanation
of any assumptions" is rarely comprehensive. It's easy to miss things and
become blind to some fine details.
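One practice that catches omitted parameters: dump every assumption (seed,
hyperparameters, environment) next to the results, so a replicator starts
from the same state. A minimal sketch, where all the field names and values
are illustrative rather than any standard:

```python
# Record the full run configuration alongside the results, then reload it
# to verify a replicator would start from identical settings.
import json
import platform
import random
import sys

config = {
    "seed": 42,                # hypothetical hyperparameters
    "learning_rate": 3e-4,
    "batch_size": 64,
    "python": sys.version.split()[0],
    "platform": platform.platform(),
}
random.seed(config["seed"])

with open("run_config.json", "w") as fh:
    json.dump(config, fh, indent=2, sort_keys=True)

# A replicator reloads the exact same settings instead of guessing them.
with open("run_config.json") as fh:
    reloaded = json.load(fh)
assert reloaded["seed"] == config["seed"]
```

It's trivial bookkeeping, but it's exactly the kind of "clear explanation of
any assumptions" that prose write-ups tend to leave incomplete.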

Discovering and documenting new, interesting phenomena and dynamics is also
part of basic research. Experimental discoveries published without
explanation should still be reproducible.

