
Why is version control in Jupyter notebooks so hard? - themlaiguy
Are there any tools that help with version control on notebooks?
======
snilzzor
I've been clearing my output using nbconvert before putting the notebook into
version control. I have a pre-commit hook and a check in CI. This works for my
use case, but I can understand needing to preserve output.

jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace my_notebook_name.ipynb
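
For the CI check, something along these lines might work. It is only a rough sketch, not my actual hook: it relies on .ipynb files being plain JSON and uses just the standard library, and the function names (strip_outputs, has_outputs, dirty_notebooks) are made up for illustration.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs and counts cleared."""
    cleaned = json.loads(json.dumps(notebook))  # cheap deep copy of JSON data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def has_outputs(notebook: dict) -> bool:
    """True if any code cell still carries outputs or an execution count."""
    return any(
        cell.get("cell_type") == "code"
        and bool(cell.get("outputs") or cell.get("execution_count") is not None)
        for cell in notebook.get("cells", [])
    )


def dirty_notebooks(paths: list) -> list:
    """Return the subset of notebook paths whose outputs were not cleared."""
    dirty = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            if has_outputs(json.load(handle)):
                dirty.append(path)
    return dirty
```

A hook or CI step could call dirty_notebooks() on the changed .ipynb files and fail if the returned list is non-empty.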

------
amirathi
You bet. I built ReviewNB[1] specifically for Jupyter Notebook code reviews.

There's also:

- nbstripout[2] for stripping outputs automatically before every commit

- nbdime[3] for diffing notebooks locally

- jupytext[4] for converting notebooks to Markdown and vice versa

[1] https://www.reviewnb.com/

[2] https://github.com/kynan/nbstripout

[3] https://github.com/jupyter/nbdime

[4] https://github.com/mwouts/jupytext
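
To give a feel for why a notebook-aware diff helps, here is a rough sketch that compares only the code-cell sources of two notebooks and ignores outputs and execution counts. This is not how nbdime works internally; it is a standard-library illustration, and the helper names (code_lines, diff_code) are hypothetical.

```python
import difflib


def code_lines(notebook: dict) -> list:
    """Flatten the code cells of a notebook dict into a list of source lines."""
    lines = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # the format allows a list of strings
            source = "".join(source)
        lines.append(f"# cell {index}")
        lines.extend(source.splitlines())
    return lines


def diff_code(old: dict, new: dict) -> list:
    """Unified diff of code-cell sources only; outputs and counts are ignored."""
    return list(
        difflib.unified_diff(
            code_lines(old), code_lines(new), "old", "new", lineterm=""
        )
    )
```

Loading two versions of a notebook with json.load and printing "\n".join(diff_code(old, new)) gives a diff that stays quiet when only outputs changed.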

------
PaulHoule
It is simple. Code is one thing and data is another thing; you can mix
arbitrary code with arbitrary data but the result might not make sense!

Neurotypicals have a hard time with this kind of contradiction: they will try
one simplistic answer that almost works, then a different one, then go back to
the old one, and eventually they will get interested in something else and
give up.

For instance, it makes sense to strip the data out of a jupyter notebook
before checking in. You can version manage the code that way. However, people
also really want to look at the notebook in github and see the analysis, the
data, the results.

~~~
chewxy
> Neurotypicals have a hard time with this kind of contradiction ...

Wat.

The problem with Jupyter notebooks and version control is that Jupyter
notebooks encapsulate temporality. You see this in the little [N] boxes on the
left of each cell.

I suppose this is what you mean by "data"? Its use is a little atypical.

I have a slightly different workflow. I observed that the top parts of my
Jupyter notebooks are generally more or less static compared to the lower
parts. This allows the top parts to slowly coalesce into a proper program over
git commits.

I also try to have a linear notion of variables (i.e. a variable is defined
exactly once and used exactly once - permitting construction of values in
loops of course).

This style of development helps with version control as well. Restart the
kernel and clear output, then run each cell exactly once before a git commit.
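
A check for that last rule could look roughly like the sketch below: it accepts a notebook only if the [N] counters of its code cells read 1, 2, 3, ... from top to bottom. The function name is made up, and skipping empty code cells is an assumption on my part (as far as I know Jupyter does not give them a counter).

```python
def ran_top_to_bottom_once(notebook: dict) -> bool:
    """True if non-empty code cells carry execution counts 1..N in order."""
    counts = [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
        # Assumption: empty code cells are never executed, so skip them.
        and "".join(cell.get("source", "")).strip()
    ]
    return counts == list(range(1, len(counts) + 1))
```

Feed it the dict from json.load on the .ipynb file; a False result means some cell was skipped, re-run, or run out of order since the last kernel restart.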

