
Rapid Prototyping of Interactive Data Science Workflows in Jupyter - wcrichton
http://willcrichton.net/notes/rapid-prototyping-data-science-jupyter/
======
mlthoughts2018
I’ve invested a lot of time into the IPython & Jupyter tools. In one job, I
even wrote a 0MQ kernel to use IPython as a REPL for a programming language
created inside my company (which had horrible developer tooling).

After working on several large machine learning and data science projects over
the years, I’ve sadly come to the conclusion that the notebook environment is
the wrong approach for shareable, extensible, or reproducible systems. On my
team now, aside from the tiniest, most ephemeral prototypes that can be thrown
away entirely, nobody ever uses Jupyter notebooks for anything.

The projects I’m referring to are very similar to those in this post: an
in-house image annotation tool we wrote with PyQt instead of notebook widgets,
lots of interactive Flask web apps with simple React pages to demo and explore
face detection results, diagnostic graphs, search engine output, and
prototyping systems for laying out image search results pages under different
ML-based ranking algorithms.

The main reason the notebook caused problems is that it leads to poor software
craftsmanship, which should be a consideration even at the earliest stages of
models and prototypes (because good craftsmanship makes you produce
higher-quality output faster, not out of any spiritual commitment to coding
standards).

Even when notebooks contain code you wouldn’t want merged into a given library
or project’s main branch, that code still needs review. And notebooks are
awful units of work to review: they break models of automated testing, lack
the context for enforced style guidelines, and are generally written with a
mix of priorities organized around the way the author wants to think about the
prototype, instead of separating these concerns into decoupled units.

Through review, you quickly realize that anything meant for reuse has to
follow good design principles: separating concerns and organizing things into
decoupled functions or possibly classes.

When you factor all of these out of the notebook and into a helper module (and
write tests), you then see a bunch of magic constants or initialization
parameters that need to be moved into a parameter file, so that you can easily
vary parameters, keep them under version control, and connect result artifacts
(usually embedded plots, tables, and saved data files) to the parameter
settings used to generate them.
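
A minimal sketch of this pattern (the file names, parameters, and helper
functions below are hypothetical, not from the projects described):

    import hashlib
    import json
    from pathlib import Path

    def load_params(path="params.json"):
        """Load experiment parameters from a version-controlled file."""
        with open(path) as f:
            return json.load(f)

    def run_dir_for(params, base="results"):
        """Name the output directory after a hash of the exact parameter
        settings, so every artifact traces back to what produced it."""
        digest = hashlib.sha256(
            json.dumps(params, sort_keys=True).encode()
        ).hexdigest()[:8]
        out = Path(base) / digest
        out.mkdir(parents=True, exist_ok=True)
        # Keep a copy of the parameters next to the artifacts.
        (out / "params.json").write_text(json.dumps(params, indent=2))
        return out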

Then you realize you need to make the overall software environment
reproducible too, because a notebook itself is never “reproducible” apart from
the software environment it was run in, and other people picking up your
notebook would need the same data, and possibly a Docker container or other
container or VM, to match the same software, libraries, network settings, port
numbers, etc.
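
As a sketch, “matching the same software” could be as small as a Dockerfile
like this (the base image, files, and entry point are hypothetical):

    FROM python:3.6-slim
    WORKDIR /app
    # Install the exact pinned library versions the analysis was run with.
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    COPY . .
    # Port the demo app or notebook server is expected to be reachable on.
    EXPOSE 8888
    CMD ["python", "run_analysis.py"]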

And this process goes on until you realize you have to factor things into a
tested module, automate the creation of the surrounding Docker container or
other environment, extract and document all your parameters, and so on, until
finally there is nothing left in the notebook but custom plotting code or data
displays.

And the display units could be, e.g., a version-controlled LaTeX file that
picks up image files or renders table templates with data, or a separate web
app that uses good design to implement a reusable video display or something,
all called from a boring non-notebook launch script, all of which could be
driven by a simple custom Makefile, making the whole thing just as interactive
as the notebook (generally more so) without all of its problems.
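
The driving Makefile might be as simple as this sketch (the targets and file
names are hypothetical):

    .PHONY: all
    all: report.pdf

    # Regenerate results whenever the code or parameters change.
    results/metrics.json: run_analysis.py params.json
    	python run_analysis.py --params params.json --out results

    # Render the LaTeX report that picks up the generated figures.
    report.pdf: report.tex results/metrics.json
    	pdflatex report.tex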

I fully agree this requires a team with good software craftsmanship skills.
But anybody competent enough to write the original notebook can also learn
these skills, and reorganize the work with just a few craftsmanship principles
that almost immediately reveal the notebook as something that just gets in the
way and slows you down.

And for teams needing to make reproducible data studies or reproducible model
training environments for real business situations, this effectively makes a
notebook inappropriate most of the time.

The main use cases where the notebook remains value-additive are situations
where the prototype can be thrown away at any time, needs no maintenance, and
contains no code or analysis that needs to be reused.

This does happen for some kinds of tutorials, some pedagogical uses, and some
totally ad hoc work, like slinging some code to whip together a quick answer
for someone on the business team. The notebook can still be good for these
cases.

But these really represent a small number of use cases, certainly much smaller
than the set of use cases the notebook is marketed towards.

My experience, after being a zealot for IPython notebooks in the early days,
is that notebooks are just way oversold, that they encourage a
low-craftsmanship way of thinking, and that they are best avoided as a general
rule of thumb.

~~~
alquinte
I feel like most of the problems you list come from trying to version control
notebooks, which may just be a futile endeavor.

I primarily use notebooks as a worklog for exploratory analysis. I can use
them to track the history of experiments and the thought processes that led me
to those experiments, to help me avoid retreading the same ground and to keep
a single cognitive thread throughout the whole process. And if any idea is
worth building actual software for, the notebook then becomes a good reference
for those projects.

The crucial idea here is to NOT change the notebook over time, just add to it.
Then there are no versioning issues. And when it gets too large to navigate
efficiently, just break it apart and continue the thread in the next notebook.

Of course in a collaborative environment this kind of thing is less useful,
but for keeping track of individual work it can still be effective.

~~~
mlthoughts2018
I am sympathetic to your perspective, especially because it's basically what I
thought when I was in grad school too, and I totally know where you are coming
from. But I wanted to highlight what I think are reasonable counterpoints to
some of what you wrote:

> "I can use them to track down the history of experiments and thought
> processes that led me to those experiments"

But this is exactly what you _can't_ do with notebooks, mainly _because_ they
are not well-suited for review. This is true even when working alone, but is
amplified when working in a team because even experimental choices require
code review.

For example, I worked on a problem once where we needed to train a neural
network for age prediction. Someone on the team very familiar with how to
write this using Keras began writing up an ad hoc training script, meant to be
a "playground" for changing parameters, re-running analysis, etc.

But through code review, even before any experiment was run, we caught bugs in
some of the code, mistakes in how the data was being loaded and cleaned, and
assumptions that had been made about less frequently used activation function
parameters. By having extra reviewers look at it and _review the intended
experiment itself, as well as the code to implement it_, we saved a lot of
time, gained from a diversity of opinions about how to handle it, and avoided
having to debug incorrect diagnostic or accuracy charts later to backtrack to
these bugs.

Code review doesn't always solve all of these problems, much the same way that
it doesn't solve all the problems of straight software engineering. But it is
clearly such a valuable tool that it almost always should be a baked-in,
mandatory part of the process, even for greenfield experimental code. And the
notebook makes this so hard that people don't do it, and when they do it, they
don't benefit much from it.

> "The crucial idea here is to NOT change the notebook over time, just add to
> it. Then there are no versioning issues."

I wish it were that easy, and that an append-only notebook resolved all
versioning problems, but that is extremely rare. In many cases, you need to
run the same analysis notebook with a variety of different parameters, where
each run corresponds to a separate experiment. Doing this by manually tweaking
parameters, hoping you didn't make a typo, and then storing the output
artifacts in a file called "run7_alpha3.0_learningrate_0.001_04242018.tar.gz"
leads to awful reproducibility problems.
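
A sketch of the alternative: drive each experiment from an explicit, recorded
parameter set instead of hand-edited cells (the grid, paths, and train.py
script below are hypothetical):

    import itertools
    import json
    import subprocess
    from pathlib import Path

    # Each combination of settings is one experiment.
    grid = {"alpha": [1.0, 3.0], "learning_rate": [0.001, 0.01]}

    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        out = Path("results") / "-".join(f"{k}={v}" for k, v in params.items())
        out.mkdir(parents=True, exist_ok=True)
        # Record the exact settings next to the artifacts they produce.
        (out / "params.json").write_text(json.dumps(params, indent=2))
        subprocess.run(
            ["python", "train.py",
             "--params", str(out / "params.json"), "--out", str(out)],
            check=True,
        )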

The other thing is that even if you only add to a notebook, you might be
adding the use of new third-party libraries, or adding code that assumes some
version of something has been implicitly upgraded in the background. So you'd
still need a ground-truth source of all your dependencies, like
requirements.txt for Python, possibly with more information about the Python
version itself (does your notebook code rely on stably ordered dictionaries
starting in Python 3.6? How would someone coming to the notebook know that if
they are using 3.5 and the results seem funny for no discernible reason?). And
in general you'll need some overall environment management setup to ensure,
for reproducibility, that other people can download the same data, set up the
same language environment, get the right versions of the required libraries,
and have other system dependencies if there are any (maybe they need a certain
version of cuDNN installed? Or maybe you used TensorFlow built from source
with extra optimization flags enabled?)
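
For instance, a pinned requirements file plus an explicit guard at the top of
the code makes those assumptions visible (the version pins below are
hypothetical):

    # requirements.txt -- exact pins, not loose ranges
    #   numpy==1.19.5
    #   tensorflow==2.4.1

    import sys

    # The analysis relies on insertion-ordered dicts, an implementation
    # detail of CPython 3.6 that only became a language guarantee in 3.7.
    assert sys.version_info >= (3, 6), (
        "This code assumes ordered dictionaries; run it on Python 3.6+."
    )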

The idea is to take a good "project hygiene" and software craftsmanship
approach from the outset, to avoid situations where someone asks, "why is this
notebook broken on Bob's laptop? It works on Alice's Ubuntu server...", and to
always design for code review and testability from the start, because they are
huge productivity enhancers and the benefits compound over time, even for
research-oriented prototypes.

