
Making Git and Jupyter Notebooks play nice - mana99
http://timstaley.co.uk/posts/making-git-and-jupyter-notebooks-play-nice/
======
TimSAstro
Hi, HN!

Author here. A friend mentioned this was on the front page so I wanted to stop
by and make explicit that this advice is OUT OF DATE as far as I'm concerned.
It's way too much hassle (I work in a much larger team now than I did then!)
and doesn't play well with rebase etc.

These days I either recommend the jupytext approach (not tried it but seems
sensible) or personally I just use Sphinx-gallery.

Advantages:

* Plain python files play well with IDE refactoring, Black formatter, etc etc.

* You now have a readymade 'tutorial' page for your docs.

* Files are run with every docs build, so you can configure things to alert you when they're broken.

Disadvantages:

* You end up editing a throwaway notebook file. If you forget to copy-paste your edits back to the source, and rebuild, you have lost your edits. However, this forces me to keep the 'temporary, exploratory' nature at the front of my mind and not allow the notebook code to grow too large before performing some clean-up.
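
For anyone who hasn't seen one, a Sphinx-gallery source file is just a plain Python script along the lines below. This is a minimal sketch: the file name `plot_summary_stats.py`, the title, and the sample data are made up for illustration, and the docs configuration that picks the file up is not shown.

```python
"""
Summary statistics of a small sample
====================================

The module docstring becomes the title and intro text of the tutorial page.
"""
# Hypothetical file name: plot_summary_stats.py

# %%
# Comment blocks that follow a ``# %%`` marker are rendered as text
# between the code cells in the built docs.

import statistics

samples = [2.0, 4.0, 4.0, 5.0]  # illustrative data only
mean = statistics.mean(samples)
print(f"mean = {mean:.2f}")

# %%
# Each code block is executed during the docs build, so an exception
# raised here would show up as a broken example.

spread = statistics.pstdev(samples)
print(f"population std dev = {spread:.3f}")
```

Because the file is ordinary Python, it can be formatted, refactored and diffed like any other module.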

~~~
lukasm
Can you not use
https://code.visualstudio.com/docs/python/jupyter-support-py#_export-a-jupyter-notebook
to export/import?

------
FridgeSeal
Things I’ve found in checked-in notebooks:

* database credentials.

* sensitive data

* a whole DataBricks webpage because the person didn’t understand how to export just the notebook.

* collections of notebooks named only what step in the process they are, and literally nothing about what they actually do.

* Whole base64 encoded images and zip files

* packages imported by manually manipulating system environment paths

* multi-processing/multithreading by shelling out and calling new python instances

* good old “don’t run these cells”

~~~
starpilot
Untitled9.ipynb

------
kevcampb
Or just use jupytext and only commit the .py files. Works for us. Commits just
look like normal python code, with a few comment markers for cells
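
To give a rough idea of what such a committed file looks like, here is a minimal sketch that turns notebook JSON into a script with `# %%` cell markers, dropping all outputs. It is not jupytext's implementation (use jupytext itself for real work); the function name `notebook_to_percent_script` is hypothetical and the code assumes the usual nbformat 4 layout with a top-level `cells` list.

```python
def notebook_to_percent_script(nb):
    """Render a notebook dict as a '# %%' style Python script (outputs dropped)."""
    chunks = []
    for cell in nb.get("cells", []):
        src = cell.get("source", "")
        # "source" may be stored as a list of lines or as a single string.
        text = "".join(src) if isinstance(src, list) else src
        if cell.get("cell_type") == "code":
            chunks.append("# %%\n" + text)
        elif cell.get("cell_type") == "markdown":
            commented = "\n".join(("# " + line).rstrip() for line in text.split("\n"))
            chunks.append("# %% [markdown]\n" + commented)
        # Other cell types (e.g. raw) are skipped in this sketch.
    return "\n\n".join(chunks) + "\n"


example = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n", "text"]},
        {
            "cell_type": "code",
            "source": ["x = 1\n", "print(x)"],
            "outputs": [{"output_type": "stream", "text": ["1\n"]}],
        },
    ]
}
print(notebook_to_percent_script(example))
```

A diff of the resulting text only changes when a cell's source changes, which is why the commits read like normal Python.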

~~~
nabdab
I love jupytext, but I feel like it’s a patch on a problem that should have
just been solved. Just change Jupyter to work directly in the generated file
format and skip the “pair files” hassle.

~~~
leni536
Or it could be solved at the git tooling side by introducing Jupyter specific
merge and diff tools.

~~~
SifJar
adding git tooling for a specific file type seems like a slippery slope, no?
(assuming you are saying that git itself should have this tooling built in -
if you mean some sort of addon, fair enough but then everyone who uses jupyter
& git needs to install that addon)

~~~
leni536
I am not suggesting that git itself should ship with a bunch of custom merge
utilities for specific file types. git ships with a mechanism that allows
custom merge drivers. Setting up custom merge drivers might not be ergonomic
right now, but it could have some benefits compared to the transformation
approach. For example merge conflicts could result in a valid notebook and it
could be manually resolved inside the notebook interface, no need to dive into
the text file.
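
As a rough illustration of the idea, below is a sketch of a cell-level three-way merge that always produces a valid notebook, with conflicts left as marked cells to resolve in the notebook interface. Git invokes a custom merge driver with the ancestor, current and other versions (the `%O`, `%A` and `%B` placeholders in `merge.<name>.driver`), expects the result written to the `%A` file, and treats a non-zero exit status as a conflict. Everything else here is an assumption for illustration: the function names, the conflict markers, and the simplification that all three versions have the same number of cells.

```python
import json
import sys


def source_of(cell):
    src = cell.get("source", "")
    return "".join(src) if isinstance(src, list) else src


def note(text):
    # A markdown cell used as a visible conflict marker.
    return {"cell_type": "markdown", "metadata": {}, "source": [text]}


def merge_notebooks(base, ours, theirs):
    """Return (merged_notebook, had_conflict), comparing cells by position."""
    b, o, t = base["cells"], ours["cells"], theirs["cells"]
    if not (len(b) == len(o) == len(t)):
        # Added or removed cells are out of scope for this sketch.
        return ours, True
    merged, conflict = [], False
    for bc, oc, tc in zip(b, o, t):
        b_src, o_src, t_src = source_of(bc), source_of(oc), source_of(tc)
        if o_src == b_src:
            merged.append(tc)  # only their side changed (or nobody did)
        elif t_src == b_src or t_src == o_src:
            merged.append(oc)  # only our side changed, or both agree
        else:
            conflict = True
            merged += [note("**CONFLICT: ours**"), oc,
                       note("**CONFLICT: theirs**"), tc]
    result = dict(ours)
    result["cells"] = merged
    return result, conflict


def main(base_path, ours_path, theirs_path):
    docs = []
    for path in (base_path, ours_path, theirs_path):
        with open(path, encoding="utf-8") as f:
            docs.append(json.load(f))
    result, conflict = merge_notebooks(*docs)
    with open(ours_path, "w", encoding="utf-8") as f:
        json.dump(result, f, indent=1)
    return 1 if conflict else 0


if __name__ == "__main__" and len(sys.argv) == 4:
    sys.exit(main(*sys.argv[1:4]))
```

A real driver would also need to handle inserted, deleted and reordered cells, plus outputs and metadata.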

------
morotter
I found a good compromise by using VSCode Python extension. You can import
Jupyter notebooks as Python scripts and the other way around [0]. If I need to
work on a notebook, I prefer working with a Python script and the interactive
window [1]. Then I commit both script and ipynb version.

[0]
https://code.visualstudio.com/docs/python/jupyter-support-py#_export-a-jupyter-notebook

[1]
https://code.visualstudio.com/docs/python/jupyter-support-py#_python-interactive-window

~~~
bobbylarrybobby
Agree 100%. Only thing I’d add is that the killer feature of the extension is
that it allows one to treat ordinary python files as notebooks (without
converting!) by 1. Connecting to a persistent kernel instance 2. Allowing the
user to use magic comments to delineate code cells within the python file.
With these two things, and the ability to export as a real notebook, you get
the great experience of a notebook (submitting easily editable cells to a
kernel one at a time, as many times as you want) without all of the usual
baggage that that would entail. Plus you get to edit in a decent editor (and
edit things other than just python files in it) instead of the crap Jupyter
forces you to use.
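
A short sketch of what such a file can look like, assuming the `# %%` marker as the cell delimiter; the variable names and values are illustrative only. Run top to bottom it is an ordinary script, while an editor that understands the markers can send one cell at a time to a kernel that keeps `radii` and `areas` in memory between runs.

```python
# %%
import math

radii = [1.0, 2.0, 3.0]  # illustrative data

# %%
# Re-running only this cell reuses `radii` from the kernel's state.
areas = [math.pi * r ** 2 for r in radii]

# %%
for r, a in zip(radii, areas):
    print(f"r={r:.1f} area={a:.2f}")
```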

------
kbumsik
Can Jupyter stop using JSON and look for a new structure (not only about the
file format but also the data fields representation)? The current format is
unordered and contains huge binary blobs which makes it very inefficient and
version control is simply a pain.

I believe a new design is worth sacrificing backward compatibility.

------
wodenokoto
A lot of people are mentioning what is essentially R markdown files as a
better approach.

In .rmd the notebook is just markdown where code segments inside triple
backticks can be executed.

That sounds nice and git compatible (and it is), but you lose a lot of the
convenience of Jupyter notebooks, because output isn’t stored together with
input.

The nice thing about Jupyter notebooks is you get story, code and results in a
single package.

~~~
bobongo
At the beginning of your Rmd file, add a chunk that spins your Rmd file to an
R script.

https://www.garrickadenbuie.com/blog/convert-r-markdown-rmd-files-to-r-scripts/

Now you have:

- single file (Rmd, plain text) where you can do any edits, which, when
compiled, produces:

- (1) you get story, code and results in a single package: one output file
(usually, html), where all inputs and outputs are stored together (or, just
the outputs, if you turn off echo'ing, e.g. for producing a report),

- (2) you get the code: another output file, which is a simple R script
version of your Rmd file.

Code updates to the Rmd are much easier to view in the output R script (#2),
which is also much more convenient for debugging. For static text updates, I
look at the Rmd file. For content updates (e.g. did my data change between the
runs?), I look at the html (or Word or PDF etc) file (#1).

------
amirathi
A year ago, I was frustrated (and surprised) to see one can't do code reviews
with Jupyter Notebooks. GitHub diffs for notebooks JSON are super messy.

I set out to build ReviewNB[1], code review tool for Jupyter Notebooks. Turns
out a lot of other people had this exact problem. One can see visual diffs &
write review comments on notebook cell. Currently only works with GitHub
though.

If you want to diff locally (before committing changes), you might like
nbdime[2].

[1] [https://www.reviewnb.com/](https://www.reviewnb.com/)

[2] [https://github.com/jupyter/nbdime](https://github.com/jupyter/nbdime)
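
To show why a notebook-aware diff is so much quieter than a raw JSON diff, here is a minimal standard-library sketch that compares only cell sources and ignores outputs and execution counts. It is a stand-in for illustration, not how nbdime or ReviewNB work; the function names are hypothetical, and pairing cells by position is a deliberate simplification.

```python
import difflib


def cell_sources(nb):
    """Return each cell's source as one string."""
    out = []
    for cell in nb.get("cells", []):
        src = cell.get("source", "")
        out.append("".join(src) if isinstance(src, list) else src)
    return out


def diff_notebook_sources(old_nb, new_nb):
    """Unified diff lines for every cell whose source differs."""
    old, new = cell_sources(old_nb), cell_sources(new_nb)
    lines = []
    for i in range(max(len(old), len(new))):
        a = old[i] if i < len(old) else ""
        b = new[i] if i < len(new) else ""
        if a != b:
            lines.extend(difflib.unified_diff(
                a.splitlines(), b.splitlines(),
                fromfile=f"cell {i} (old)", tofile=f"cell {i} (new)",
                lineterm=""))
    return lines


old_nb = {"cells": [{"cell_type": "code", "source": ["x = 1"],
                     "outputs": [{"output_type": "stream", "text": ["1\n"]}]}]}
new_nb = {"cells": [{"cell_type": "code", "source": ["x = 2"],
                     "outputs": [{"output_type": "stream", "text": ["2\n"]}]}]}
print("\n".join(diff_notebook_sources(old_nb, new_nb)))
```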

------
jdnier
Note: This article is from February 2017.

------
spicyramen
Google AI team did a very good presentation and blog post about this and other
common problems:
https://cloudblog.withgoogle.com/products/ai-machine-learning/best-practices-that-can-improve-the-life-of-any-developer-using-jupyter-notebooks/amp/

------
stared
I have some projects based on Jupyter Notebook (e.g.
https://github.com/stared/thinking-in-tensors-writing-in-pytorch),
and collaboration sucks. Even with git, I need to resort to "don't touch these
files, I am working on them right now".

I was thinking about using RMarkdown files (see
https://towardsdatascience.com/version-control-with-jupyter-notebooks-f096f4d7035a),
as diffs make sense for them.

Do any of you use this approach, or have insights on how to make it good for
collaboration AND visible on GitHub?

------
posedge
Interesting. Is it possible to integrate this as a git difftool instead?
Sometimes you want to include the cell output in the repository.

------
TimD1
I just clear all outputs before committing code. Works well enough for me, but
it's cool to see a more advanced solution!
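
For reference, clearing outputs can also be scripted instead of done by hand. The sketch below uses only the standard library and assumes the usual nbformat 4 layout, where code cells carry `outputs` and `execution_count`; the function names are hypothetical.

```python
import copy
import json


def clear_outputs(nb):
    """Return a copy of the notebook dict with code-cell outputs removed."""
    cleaned = copy.deepcopy(nb)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def clear_outputs_in_file(path):
    """Rewrite the .ipynb file at `path` in place without outputs."""
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(clear_outputs(nb), f, indent=1, ensure_ascii=False)
        f.write("\n")


example_nb = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
        {"cell_type": "code", "metadata": {}, "source": ["1 + 1"],
         "execution_count": 7,
         "outputs": [{"output_type": "execute_result",
                      "data": {"text/plain": ["2"]},
                      "metadata": {}, "execution_count": 7}]},
    ],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
}
stripped = clear_outputs(example_nb)
```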

~~~
steve19
The problem with that is when there are very long running cells and you want
to see the output and keep developing, and committing code as you go along.

------
globuous
Neat project! Dearly needed. But may I ask, why not use pandoc to convert
notebooks to orgmode and git that?

------
etiam
Anybody aware of a corresponding solution for Mercurial yet?

