
How to Grow Software Architecture out of Jupyter Notebooks - jedwhite
https://github.com/guillaume-chevalier/How-to-Grow-Neat-Software-Architecture-out-of-Jupyter-Notebooks
======
laichzeit0
So I'm a software developer of 10 years who started using Jupyter Notebooks in
the last year. I absolutely love the REPL that IPython gives you. Do a query
that takes really long, store it in some variable, spend the next hour or two
working on that dataset in memory, changing code, iterating, all the while
never having to re-execute that query or reload the data, because it's just a
REPL.

How can one get the same type of developer experience in Python without using
Jupyter notebooks? What I'm talking about, for people who maybe do traditional
development in, let's say, Java or C++, is this: imagine running your code in a
debugger, setting a breakpoint, and when it hits that breakpoint you see "ah,
here's the problem", you fix the code, and have it re-execute, all the while
never having stopped your program at all. No re-compiling and then having to
re-execute from the beginning. Once you've done things the Jupyter way, how
can you possibly want to go back to writing code in the traditional sense?
It's just too slow a process.

How do you get that same experience without using Jupyter? I tried the VSCode
plugin [1] that tries to make things the same as a Jupyter Notebook, but it's
nowhere near as smooth an experience and feels clunky.

[1]
[https://marketplace.visualstudio.com/items?itemName=donjayam...](https://marketplace.visualstudio.com/items?itemName=donjayamanne.jupyter)

~~~
malux85
One thing that I would love in python is some sort of snapshotting,

Like “this is my global state” - the entire state of the python process, all
objects, etc.

Then I change a variable and that creates a new state. We have this now.

What I want is a rewind button, so I can go back to the previous state
(quickly preferably) like a time travelling debugger.

Someone build this, I’ll pay for it (and I’m sure others would too)

I know there’d be some problems with external changes (API calls altering
external systems, writing to files, etc.) that a time travelling debugger
might not be able to reverse, but even so, I would happily live with this
limitation if such a tool existed

~~~
detaro
A basic way of getting that on Unix systems is forking - you get a copy of the
process with the same memory state, and likely relatively small interfaces
would suffice to manage which process is currently receiving input from e.g. a
REPL. (Sockets, open files, ... are more complex, but I guess many REPL
workflows could do without that). Sadly you can't easily save much memory for
the snapshots from copy-on-write (which otherwise is a neat trick with forking
to keep old state) due to the way Python manages memory, at least not without
modifying the interpreter.
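
A minimal sketch of the fork-as-snapshot idea (Unix only; the toy `state` dict is illustrative):

```python
import os

# The child gets a copy-on-write copy of the whole interpreter state,
# so mutations it makes never reach the parent's "snapshot".
state = {"x": 1}

pid = os.fork()
if pid == 0:            # child: mutate freely, then exit
    state["x"] = 999
    os._exit(0)

os.waitpid(pid, 0)      # parent: wait for the child to finish
print(state["x"])       # parent still sees 1
```

As noted above, CPython's reference counting writes to every object it touches, so the copy-on-write pages get dirtied quickly and the snapshots are not as cheap as they would be in other runtimes.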

~~~
thomasballinger
If anyone wants to walk through these ideas more slowly, I write about them in
[http://ballingt.com/interactive-interpreter-undo](http://ballingt.com/interactive-interpreter-undo)
and implement them in
[https://github.com/thomasballinger/rlundo](https://github.com/thomasballinger/rlundo),
but, as you say, I eventually decided I had to implement an interpreter for
which serialization of state was more efficient and all external resources had
amenable interfaces:
[http://dalsegno.ballingt.com/](http://dalsegno.ballingt.com/)

~~~
malux85
Will you marry me?

This is really cool, I'm going to dig deeper into the code tonight. Awesome
work

------
fabatka
My first thought when I saw the title: why would you want to do that? I feel
like the only justification it gave (faster feedback) is way too insignificant
compared to the downsides of developing inside a notebook. Joel Grus has a
presentation that he gave at the 2018 JupyterCon about this which I found very
enlightening and convincing. It is about why _not_ to use notebooks apart from
the very beginning of any serious project. The slides are available here:
[https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUh...](https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/edit#slide=id.g3d168d2fd3_0_255)

Of course I still use notebooks for experimenting (usually it is open beside
my IDE), I just think that developing a whole software architecture inside
notebooks is not something notebooks are for.

~~~
ptd
If you don't mind elaborating a bit, how deep do you have to get into a
project before you get past the point of experimenting?

Is there value in moving away from notebooks altogether while you are still
learning?

Do these comments apply to platforms such as MyBinder or Jupyterhub?

Finally (other than Grus, who is excellent), are there any resources, people,
or websites you would recommend for maintaining a sterile workflow or for
general knowledge?

Thanks.

~~~
fabatka
Sorry, I'm afraid I'm much less experienced than you think I am. My comment's
value is entirely in sharing the slides, I only put some of my thoughts there
so that my comment isn't just a link to a google drive file.

------
Myrmornis
I’m a backend python software engineer in my day job, working on production
code in a standard git, text editor, .py files workflow. I occasionally use
Jupyter notebooks (e.g. yesterday for the AdventOfCode challenges). I am
ashamed to say that despite the beautiful front-end that works so well, I
absolutely hate using Jupyter notebooks.

\- The mutable state with global variables from all cells in scope drives me
crazy. I just want to run an ephemeral script repeatedly so I can be sure what
the state is during execution.

\- The process of starting the server and moving code from version-controlled
.py files into notebook cells to become part of Untitled(25).ipynb, which
can’t be sanely version controlled, drives me crazy.

\- Not being able to use my normal text editor drives me crazy.

Instead of building up lines of well-tested python functions in a disciplined
line-by-line fashion, periodically committed to git with readable diffs, I end
up with a chaotic mess of code fragments in various cells with no confidence
regarding what will happen when some subset of the cells are executed, and
none of it in git let alone with a sane commit history.

I’ve tried so many times over the last 10 years, and I feel bad because it’s
such an amazing project, but I really dislike the experience of using Jupyter
instead of standard python development tools.

The reason I do it is for graphics. This isn’t a Jupyter gripe other than the
diverted attention, but why the fuck can’t I just import matplotlib in a
normal python file? (Under macOS it throws something about “frameworks” which
no one cares to understand. I think there’s some incantation that makes it
work, but seriously, this is ridiculous.) And maybe draw graphics in a GUI
graphics widget from the (i)python shell, like R does.

(No need to reply "Because you haven't written it"! This is deliberately a
rant; I contribute to open source projects.)

~~~
GChevalier
Never tried macOS, but matplotlib works fine under Windows and Linux. Maybe
you could save plots to images on disk and prevent them from showing? I once
ran code that used plt on a server and I needed something like
`matplotlib.use('Agg')` to prevent the code from crashing because of the lack
of graphical output.

See this SO answer:
[https://stackoverflow.com/a/34583288/2476920](https://stackoverflow.com/a/34583288/2476920)
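
For what it's worth, a minimal sketch of that workaround (the filename is illustrative); the key detail is that the backend must be selected before pyplot is imported:

```python
import matplotlib
matplotlib.use("Agg")             # headless backend: render to memory/disk,
import matplotlib.pyplot as plt   # never try to open a GUI window

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])
fig.savefig("plot.png")           # inspect the file instead of a window
```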

Personally, I love having notebook cells so I can code without re-running
everything. Especially in deep learning, training a model takes a long time.
Jupyter is very good for creating and debugging code that: A) needs a trained
model loaded for it to work, when you want to skip the saving/loading of the
model, or B) saves-then-loads a model.

If the "mutable state with global variables" drives you crazy, you may want to
avoid reusing the same variable names from one cell to another, and reset the
notebook more often. Also, avoid side effects (such as writing to or loading
from disk) and try to write what are called pure functions (e.g., avoid
singletons and service locators; pass references instead). If your code is
clean and doesn't have too many side effects, you should be able to work in
notebooks without headaches.

EDIT: typo.

~~~
GChevalier
Also, you should be able to use your favorite editor for the code outside
notebooks (over time, more and more of the code will be outside of your
notebook). You might often work in the editor, and at other times in the
notebook depending on the nature of the work. As the project advances,
notebooks will become less and less important, they only kickstart projects.

~~~
Myrmornis
Thanks. Yes when I'm being organized I manage to get the notebook able to
`reload` a module that I'm working on in my editor, which is probably the most
important step towards reducing code in the notebook.
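
A self-contained sketch of that reload loop, using a throwaway module to stand in for the file being edited (all names are illustrative):

```python
import importlib
import pathlib
import sys
import tempfile

sys.dont_write_bytecode = True    # avoid stale bytecode confusing the reload

# Create a throwaway module to play the role of the file open in the editor.
tmp = tempfile.mkdtemp()
pathlib.Path(tmp, "scratch.py").write_text("VERSION = 1\n")
sys.path.insert(0, tmp)

import scratch
print(scratch.VERSION)            # 1

# "Edit the file in your text editor"...
pathlib.Path(tmp, "scratch.py").write_text("VERSION = 2\n")

# ...then pull the change into the live session without restarting the kernel
# or losing any in-memory data:
importlib.reload(scratch)
print(scratch.VERSION)            # 2
```

IPython's `%load_ext autoreload` / `%autoreload 2` magics automate exactly this, re-importing edited modules before each cell runs.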

But in general I wonder whether this is what I'm looking for:
[https://github.com/daleroberts/itermplot](https://github.com/daleroberts/itermplot)

------
ontouchstart
Unlike most IDEs, which are desktop based, the Jupyter Notebook system is
fundamentally network based, with a standard HTML5 front end sending messages
back and forth to the kernel.

So fundamentally Jupyter is not the document-based application most of us
software developers grew up with. You have to think in terms of messaging; the
cells are just input/output terminals. The rendering of cells is determined by
the runtime container that parses the .ipynb (JSON) string.

Here is a contrived example to explain it:

[https://nbviewer.jupyter.org/gist/ontouchstart/ea1631f69e507...](https://nbviewer.jupyter.org/gist/ontouchstart/ea1631f69e507a81a9d9ec56b79e4d11)

There is a lot going on, depending on the container.

~~~
TuringTest
That's good insight. Since the '50s, coding has been understood as creating
text objects that contain the entire specification of a program in a
programming language.

Input data, parameters and configuration are considered external to the code
object. It's the same with other activities like debugging, exploratory
programming and building tests, which are considered peripheral to the "true
coding" activity of altering the source code that will run in the production
environment. The single successful exception to that model is the spreadsheet,
where data and code are indissolubly merged in the same document; and, despite
its success as a development tool for non-programmers, professional developers
usually despise it.

Could it be that Jupyter and notebooks are making developers comfortable with
a new approach, where the distinction between "coding" and "running the code"
is blurred, and where partial execution and mixing code with data are the
norm?

Coders are accustomed to using the REPL for this approach, but in the end only
the code that is "committed" by copying it to the source file is considered
the "final" version. Yet, with notebooks, code that has been written within an
interactive session can be stored as the official definition for a part of a
program, without having to transfer it to a separate object.

~~~
ontouchstart
At the semantic level, coding is still about creating objects and managing
their relationships with the environment. An .ipynb file is itself just an
object tree. Since the Jupyter people made the wise decision of using an
easy-to-parse JSON format instead of a more efficient binary format, the
platform is opened up to a big ecosystem in which people can make tools for
specific needs. Like the document-centric desktop application revolution in
the '80s that gave us MS Office and Adobe Photoshop, the open, data-centric
web application revolution we are seeing in this decade will fundamentally
change how we consume and produce software. It is a natural evolution of
computing-system production: from hardware panels to punch cards, to text code
instructions, to object graph management in network environments. (The
buzzword is “the cloud”.)
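
To make that concrete, here is a hedged sketch of how any tool can treat a notebook as plain JSON (the inline dict stands in for a real .ipynb file on disk, which has the same shape):

```python
import json

# A minimal notebook in the nbformat-v4 shape, built inline so the
# example is self-contained.
nb = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "outputs": [], "source": ["print('hi')"]},
    ],
}

# Because it is plain JSON, any tool can round-trip and query it.
raw = json.dumps(nb)
code_cells = [c["source"] for c in json.loads(raw)["cells"]
              if c["cell_type"] == "code"]
print(code_cells)  # [["print('hi')"]]
```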

~~~
TuringTest
Yeah, you're right that coding is creating computational objects.

I was pointing out that "coding" is traditionally understood to refer only to
the sequences of instructions in source and object code that remain
unmodified during runtime execution, but there are other views that extend the
definition to all the other objects involved in running the software in its
environment.

In this view, coding is about defining automated behaviors that can execute on
their own, without a human leading every step by hand. I'm thinking in
particular about End-User Development approaches, which allow non-developers
to create such automations without ever learning the syntax of a formal
language.

Under this approach, the abstraction level may be lower, since the actual
instances of data used are as important to understanding the system as the
code that processes them.

~~~
ontouchstart
Another type of coding is describing the desired state and let the runtime
environment achieve the transition.

Unix Makefile is one of the early examples. So are HTML/CSS etc.

More recent examples are the dataflow graphs in TensorFlow

[https://www.tensorflow.org/guide/graphs](https://www.tensorflow.org/guide/graphs)

and kubernetes

[https://github.com/kubernetes/kubernetes/blob/release-1.3/do...](https://github.com/kubernetes/kubernetes/blob/release-1.3/docs/design/README.md)

A new idea just occurred to me during a technical lunch meeting with a bunch
of friends today: since a Jupyter Notebook itself is nothing but an object
graph, it might be possible to use a "computed" .ipynb for this kind of
"coding", i.e., describing the desired state.

That would be TDD in another sense.

~~~
TuringTest
Yeah, I happen to work with declarative statements + problem solver for a
living :-)

In this context, a precise textual description helps to define exactly what
you want to achieve, but a graphical view helps to inspect how many different
cases there are in your data, and how the available contexts are connected.

Representing state machines with graphs is a relatively common thing in rapid
prototyping apps, so I've been thinking of building a prototyping app that
would also allow embedding real code and data in them (similar to visual GUI
builders, but allowing much more code to be defined visually). There are
several tools attempting to achieve that, but all fail at the abstraction
problem of defining new instances of your objects (i.e., you can't create new
instances of arbitrary visually defined objects, only of things in tables).

~~~
ontouchstart
I really enjoy this long and insightful discussion. Let’s see how deep we can
go (in terms of the thread tree and thought itself).

Declarative, constraint-driven problem solving was pioneered by
[https://en.m.wikipedia.org/wiki/Sketchpad](https://en.m.wikipedia.org/wiki/Sketchpad)

and was also used in places like Apple’s

[https://en.wikipedia.org/wiki/Interface_Builder](https://en.wikipedia.org/wiki/Interface_Builder)

It is interesting that you use "textual description" for the byte-stream-based
artifact that we call code, which is runtime independent.

On the other hand, the term "graph" in your second paragraph has a double
meaning: a visual representation for the human eye and brain to parse, and a
mathematical graph that can be serialized in JSON, XML or whatever byte
stream.

I think many people are excited about the GUI revolution pioneered by Doug
Engelbart, Ivan Sutherland and Alan Kay, for different reasons. Some people
are excited about the implementations: bitmap displays, mice, etc.; some
people are excited about the principles and abstractions: MVC, constraints,
dataflow, messaging, etc.

With the Jupyter notebook system, which is based on three state machines (a
persistent kernel process, a DOM session in the browser, and a message queue),
it is the first time that we might have a nice abstraction to capture a graph
in _space and time_.

That is at least why I am getting excited. The other features are just bells
and whistles.

~~~
TuringTest
Earlier today, I took a note to myself saying this:

The next-generation development environment will be a combination of:

* a wiki (outlining content + version control),

* a spreadsheet (functional reactive coding + lightweight schema definition + easy working on collections), and

* a mockup wireframe builder (visual layout + visual state machines + rapid prototyping).

Pluggable connectors to constraint solvers and machine learning would be a
plus.

I'm not sure that Jupyter fully captures all parts of the computation graph;
most of all, restoring a previous state after closing and reopening it is
cumbersome (if I'm not mistaken you need to run each individual cell by hand
in the right order). There are theoretical models that could fix that, though,
like those allowing for time travel debugging. And the visuals are simple and
flexible enough to support many use cases. So yes, web notebooks are a good
basis for a knowledge-building programmable platform.

~~~
ontouchstart
The Jupyter system has three different subsystems that communicate via message
passing: the browser state, ZeroMQ and the kernel process. Each subsystem has
its own leaky abstraction. That is why version control and time travel
debugging in the traditional sense become impossible. We should embrace this
challenge and think beyond document-based IDEs.

For example, you can think of rendered notebooks as caching. I have even
played with the concept of using a rendered notebook as the input/data source
of another notebook. This looks like a radical idea, but it is exactly how
[https://nbviewer.jupyter.org/](https://nbviewer.jupyter.org/) works.

~~~
TuringTest
> the concept of using rendered notebook as the input/data source of another
> notebook

Isn't that essentially what a spreadsheet does? Each table contains the
results of a previous computation, intertwined with input and configuration
data, and you just keep chaining more and more programs through a single
unifying metaphor (the cell coordinate system plus function expressions). This
model also gets you reactive recomputation for free.

It's a powerful system, but quite old; I'd say the only radical thing is
convincing classic developers that there's a better model than their beloved
imperative, one-step-at-a-time idea of what running a program means.

~~~
ontouchstart
Not exactly the same model as spreadsheets. In spreadsheets, a cell is a value
in memory. The sheet is a graph of states.

In Jupyter Notebook, the cells are nothing but messages. And the saved
notebook files are a JSON-serialized byte stream of the messages and metadata.
It doesn’t keep the states, just data, a digital artifact.

~~~
TuringTest
Yet spreadsheet cells may also be formulas, which are akin to method calls /
message passing / function application. The value of a cell with a formula is
the memoized result of its latest function call execution. I don't see an
essential difference with a notebook cell, other than the layout - grid vs
linear. (Although I might be missing something; I don't have broad experience
with Jupyter's computation model).

Plus, spreadsheets may also be distributed.

~~~
ontouchstart
Good points.

I don't care much about the visual layout of Jupyter notebook, so in this
sense, it is not much different than spreadsheets.

However, when I say Jupyter cells are nothing but messages to/from the kernel,
that is profoundly different from spreadsheets, where cells can be messages as
you described. The same goes for distributed runtime states: spreadsheets may
also be distributed, but the Jupyter notebook runtime is always distributed.
You might have a half system like nbviewer that doesn't have a kernel running,
but that is not a full Jupyter system. (I am distinguishing the Jupyter
runtime from the JSON byte-stream artifact that we call a notebook file.)

~~~
TuringTest
So, if I understand it right, Jupyter follows an agent-based model of
computation, right? As opposed to spreadsheets, which follow a functional
reactive paradigm.

Yes, these are different computation models; although formally it is possible
to transform mathematically one into the other, and vice-versa (in the same
way that you can transform any computation model to a Turing machine or lambda
calculus).

It may not have practical consequences, but it means that you can translate
insights gained from one model to the other.

~~~
ontouchstart
This thread is getting too deep and too narrow (visually) to read on HN. So I
will summarize and stop here.

1\. It is a good observation that the Jupyter runtime follows an agent-based
computation model instead of the MVC of Smalltalk or the functional reactive
model of spreadsheets. Theoretically they are equivalent in the Turing sense.
But this model opens huge opportunities, allowing it to mimic how the human
brain and human societies work. It is all about messages, context containers
and just-in-time computation.

2\. This paradigm change also affects how we grow (instead of build) software
architecture, which is the topic of the root of this HN submission. In the
build/develop paradigm, software engineers focused on requirements,
specifications, foundations, frameworks, ..., i.e., the house-building
metaphor
([https://en.wikipedia.org/wiki/Design_Patterns](https://en.wikipedia.org/wiki/Design_Patterns)).
But if you are growing a software project, there is no architecture, and you
should use agriculture metaphors instead: ecosystem, environment, energy
cycle, water cycle, etc. It is about seeding, weeding, environment control,
harvesting and sustainability. Push it a little bit further, and programmers
in this paradigm are more like software farmers than data plumbers in a data
warehouse or code assemblers in a factory.

It was a nice discussion in a virtual public square. I enjoyed it a lot. I
hope you did as well.

~~~
TuringTest
I did.

------
TuringTest
Web notebooks are the new system shell, it's just that most developers haven't
realized it yet.

The original shell was an evolution over the teletype. Developers would type
their code on a typewriter machine connected to the CPU over a wire (maybe
involving punched tape or cards); and the CPU spouted back the results of the
commands executed.

Later, when video screens were adopted, they emulated the same linear approach
of the terminal on the new device, because developers tend to be an extremely
conservative bunch when it comes to the approach taken by their tools of the
trade. It took several decades to exploit the advantages of interactive
sessions that the new media allowed, creating a new family of IDEs with live
debugging, property inspectors and Intellisense/contextual help.

We are at a similar turning point with respect to classic development tools.
The notebook is still used mostly as an improved version of the bash shell
"with graphics", but there's a lot of research to do on how to enhance the
model with new tools to take advantage of the new medium (including literate
programming, web-based collaboration, or instant feedback of command execution
on persistent objects).

People are exploring the possibilities of this approach, with Bret Victor
possibly being the most influential. I'm sure the personal experiences of
thousands of developers with their home-made workflows, involving lots of
heterogeneous tools used in creative ways, could be studied to design a more
general environment that will change how large software projects are built.

~~~
solomatov
I think [https://datalore.io/](https://datalore.io/) is further along than
Jupyter with respect to being a better shell for developers. It has
interactive recomputation, built-in collaboration, integrated version control,
etc.

------
breatheoften
I made a module recently to offer some more options for some of these
problems - particularly useful when you have code that you _want_ to keep
organized in a notebook for continued interactive experimentation and/or for
explaining a prototype process, but you also want to reuse that code from an
external script or program. It lets you call any notebook as a function.
Keyword arguments can be passed, and there’s a simple way to receive these
values within the notebook to allow the notebook to be parameterizable when
invoked externally.

[https://github.com/breathe/NotebookScripter](https://github.com/breathe/NotebookScripter)

~~~
davecap1
Is this similar to papermill?

[https://github.com/nteract/papermill](https://github.com/nteract/papermill)

~~~
breatheoften
I hadn’t come across papermill before - it looks quite full-featured. It seems
to be oriented a bit towards “scripting notebooks”, with the ability to
generate output notebook files with populated output cells. I think it must
have to emulate some of the logic of the notebook UI to do that.

NotebookScripter is much more minimal and doesn’t support generating/updating
.ipynb files. It has just one api allowing one to treat a notebook file as a
function that can be called from another python program.

I’m not entirely sure whether papermill can run the target notebook in the
calling process — it looks like it spins up a Jupyter kernel and communicates
with it via message passing. NotebookScripter creates an ipython interpreter
context in process and execs the notebook code in process.

------
marmaduke
> it might be a good idea to skip unit tests

Barring some barriers (e.g., use caching where appropriate), I've found that
writing unit tests is a neat, clean replacement for notebooks.

~~~
TomBombadildoze
A thousand times this. Jupyter notebooks are an antipattern. They encourage
bad habits and yield copy-pasted, hacked-together, unmaintainable code. The
REPL is a tool for exploration, not for full-blown development.

> barring some barriers (e.g. use caching where appropriate)

Fixtures.
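
For readers who haven't met the term: a hedged sketch of how a session-scoped pytest fixture gives you the notebook's load-it-once workflow inside a test suite (the `dataset` here is just a stand-in for a slow query or model load):

```python
import pytest

# A session-scoped fixture plays the role of the notebook's long-lived
# variable: the expensive setup runs once, and every test reuses the result.
@pytest.fixture(scope="session")
def dataset():
    # stand-in for the slow query or model load you'd keep in a notebook cell
    return list(range(1_000_000))

def test_length(dataset):
    assert len(dataset) == 1_000_000

def test_head(dataset):
    assert dataset[:3] == [0, 1, 2]
```

Run with `pytest`; the fixture body executes once per session, not once per test.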

~~~
marmaduke
Yes, thanks for the vocab. Notebooks are like a single mutable
global-namespace fixture. Definitely a regression in terms of engineering
practice, but Python’s slow startup and reloading semantics are also barriers,
and I’d like to see a nailgun (the JVM startup accelerator) for Python
someday.

~~~
TomBombadildoze
> nailgun (JVM startup accelerator) for Python someday

It's not an exact analogue, but PyPy sort of fills that niche. Unfortunately,
it's two major versions behind and it's not a 100% drop-in replacement.

------
im3w1l
Gradually pulling stable code out of the notebook into a tested library has
worked well for me.

------
GChevalier
Author here.

Whoa, thank you all for the nice comments, I didn't expect to make such a buzz
here, nor today. I'm glad to see the reactions - even the bad ones; they seem
aligned with what I thought. Yes, notebooks are very useful for a faster
coding cycle, but they easily become heavy (I'd love to see better multiline
editing and better autocompletion in Jupyter).

Seems like I already posted my article 2 months ago, but I renamed the GitHub
repo since then, which may explain why someone else (jedwhite) could submit my
article again:
[https://news.ycombinator.com/item?id=18339703](https://news.ycombinator.com/item?id=18339703)

I didn't submit it twice to HN. Well, nice to see that in a parallel world my
post made the 1st page of HN! :-)

But could this mean that my HN account is "shadow banned" or something?
Strange to see that all my own submissions on HN haven't got much attention
for months. Or maybe it's just the random factor... Well, thanks!

------
skybrian
It seems weird that notebooks don't have built-in unit testing. All you'd have
to do is compare the current output to the saved "known good" output and show
the diffs. It could be a checkbox next to each item in the notebook.
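
A toy sketch of that check, with illustrative cell data; a real implementation would read the `source` and `outputs` arrays from the .ipynb JSON instead:

```python
# Re-run each cell's source and diff against its recorded "known good"
# output; an empty diff list means the notebook still behaves as saved.
saved_cells = [
    {"source": "1 + 1", "output": "2"},
    {"source": "sorted([3, 1])", "output": "[1, 3]"},
]

def rerun(cell):
    return repr(eval(cell["source"]))  # fine for toy expressions only

diffs = [(c["source"], c["output"], rerun(c))
         for c in saved_cells if rerun(c) != c["output"]]
print(diffs)  # [] means every cell still matches its saved output
```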

~~~
DarkCrusader2
nbval does something similar to what you want.

[https://github.com/computationalmodelling/nbval](https://github.com/computationalmodelling/nbval)

------
rafiki6
I'd argue that this misses the mark. Jupyter Notebooks are much more an
evolution of Excel than of other development tools.

------
tarasmatsyk
TL;DR

Did not get any insights here except links to books.

What I prefer to do is use Jupyter for all R&D tasks, then `productize` the
modules/parts that proved to work. When wrapping solutions as a product, we go
through the typical engineering process: TDD, CI/CD, Docker, code review, etc.
In any case, we end up with a bunch of machine instructions, and there is no
point in breaking the best practices of the engineering craft; they have been
established for years. No need to reinvent the wheel.

------
InGodsName
I no longer code anything.

I've been working remotely at a big company for a while now.

I write the requirements for a function, then I send it to my subcontractors
on a freelancer website.

While I am waiting, I play with my kids, kiss my wife and take her out
shopping.

I get back the functions, I paste them into the codebase... change a thing or
two.

I am producing 3000 lines of code daily with this approach.

I used to write 200 lines of code alone.

One key thing: my passion for programming has been killed entirely, because I
always have to work according to whatever plans management devises. Management
changes direction, randomly kills projects, changes requirements - this makes
me feel like a slave who has to dance to satisfy the masters, no matter how
many landmines they place on the dance floor.

So now I just create an outline and let cheap labor fill in the blanks.

My value proposition isn't code; it's that I can understand the problems at
hand and create plans to tackle them.

The choice of problem isn't in my hands, though the area of the problem
definitely is.

I am never going back to an open office plan and sebum/earwax-filled
noise-cancelling headphones.

~~~
BeetleB
Are you an employee or a contractor? If the former, are you not violating IP
by subcontracting?

~~~
clueless123
One-price, one-person, all-tasks contractor. I farm out/purchase all the basic
graphics, use tons of open source libraries, and farm out most of the tedious
stuff. My employers know; as long as we deliver, they do not care.

