How to Version-Control Jupyter Notebooks (nextjournal.com)
164 points by tosh on Dec 22, 2018 | 44 comments

nbstripout [1] is my favorite tool for this. Installing it in your Git repo is 2 lines:

$ pip install --upgrade nbstripout # install nbstripout bin

$ nbstripout --install # install Git hook in current repo

Then, any .ipynb files that you check in will have their output stripped in the index (without affecting your working copy).

(Surprised it's not mentioned in the article.)

[1] https://github.com/kynan/nbstripout

In addition to stripping the output, I will advertise a Jupyter trick that has helped me many times: When I'm ready to walk away from a project for a while, or even if I just need a coffee break, I will do "restart and run all cells." This ensures that there is no hidden state that I've forgotten about, and that someone else could run the same notebook without mishap.
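A rough way to check after the fact whether a saved notebook was run cleanly top to bottom is to look at the execution counts stored in the .ipynb JSON. This is only a sketch: it assumes the nbformat 4 layout (a top-level "cells" list), it skips empty code cells, and the function name is made up for illustration. It does not replace actually doing "restart and run all".

```python
import json


def ran_top_to_bottom(notebook_json: str) -> bool:
    """Return True if the non-empty code cells are numbered 1, 2, 3, ... in order.

    Assumes nbformat 4, where cells live in a top-level "cells" list.
    """
    nb = json.loads(notebook_json)
    counts = [
        cell.get("execution_count")
        for cell in nb.get("cells", [])
        # "source" may be a string or a list of strings; join handles both.
        if cell.get("cell_type") == "code" and "".join(cell.get("source", "")).strip()
    ]
    return counts == list(range(1, len(counts) + 1))
```

If this returns False, some cells were probably run out of order or re-run, so the saved outputs may depend on hidden state.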

This is my current go-to solution. It does require collaborators to have it installed too.

If you use CI, you can run the same checks there so that any commit that skipped them fails the build. I do this with the `pre-commit` utility; it's very handy for running checks repeatably.
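For the CI side, a check along these lines might be enough. It is a minimal sketch, not how nbstripout or pre-commit work internally: it assumes nbformat 4 notebooks, and the names `has_output` and `dirty_notebooks` are hypothetical.

```python
import json
from typing import Iterable, List


def has_output(notebook_json: str) -> bool:
    """Return True if any code cell still carries outputs or an execution count."""
    nb = json.loads(notebook_json)
    return any(
        bool(cell.get("outputs")) or cell.get("execution_count") is not None
        for cell in nb.get("cells", [])
        if cell.get("cell_type") == "code"
    )


def dirty_notebooks(paths: Iterable[str]) -> List[str]:
    """Return the paths of notebooks that were committed without being stripped."""
    dirty = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            if has_output(handle.read()):
                dirty.append(path)
    return dirty
```

A CI step could call `dirty_notebooks` on the changed .ipynb files and exit non-zero if the returned list is not empty.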

This is great advice, thanks! In practice, do collaborators also need to have it installed?

I wonder if it would be easier and/or possible (it should be) to just write a filter in jq that strips outputs, prompt_number and execution_count from each cell.
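The same idea is only a few lines in Python, too. This is a sketch that assumes the nbformat 4 layout (top-level "cells"); older notebooks that nest cells under "worksheets" are not handled, and `strip_notebook` is a made-up name, not part of any library.

```python
import json


def strip_notebook(notebook_json: str) -> str:
    """Return the notebook JSON with outputs and execution counters removed."""
    nb = json.loads(notebook_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        cell["outputs"] = []
        cell["execution_count"] = None
        # Field used by older notebook formats; dropped here if it is present.
        cell.pop("prompt_number", None)
    return json.dumps(nb, indent=1, sort_keys=True) + "\n"
```

Reading from stdin and writing to stdout would let it act as a Git clean filter, which is roughly the role nbstripout fills.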

I haven't really followed Jupyter's development lately, so maybe this is already happening, but I think what you really need is some concept of a workspace, rather than a single file.

The problem with notebooks is that they get unwieldy, and you want to keep a bunch of code around that's only useful in certain cases, or just starts doing "too much".

Sure, you can factor this code out into a library/function, but there's nothing that makes that easy, and once you've made it into a library, there's nothing that helps you easily make changes to that library in a different notebook.

Perhaps notebooks should have variables accessible as if they were modules. This would solve my personal problem of building libraries.

Jupyter has great potential to be a new kind of IDE, it just needs more resources.

Am I the only one who looks at this and thinks: "wtf, no, versioning notebooks should not be this tedious?" instead of suggesting other horrendous ways of versioning them?

Absolutely. Point #4 offers a more sane alternative.

Don't use Jupyter Notebooks for anything you want to version-control. A notebook is like a one-line Perl script: write, run and delete.

If you need more than that, use the plain text file source code.

Actually, just forget Jupyter Notebooks and use good old plain text source code like the rest of the programmers.

As a data scientist, I disagree strongly on this. Writing "typical" application code, sure, jupyter is (probably) overkill. But for cv, nlp, data sanitizing, etc, you are constantly iterating over algorithms and visually viewing the output. Multi-stage pipelines just require rerunning a cell.

Caching to disk is cumbersome for data that's usually junk.

Cells and integrated vis is such a massive leap forward that using plain old text feels like banging rocks together.

Pretty much this. As a quant / data scientist, I quite often have notebooks just hanging there for weeks with a few hundred GB of ready-to-use data preloaded and preprocessed in the kernel which makes the experimenting with it incredibly ergonomic.

Being able to quickly check the output while iterating on an algorithm, or visualise intermediate results, is irreplaceable.

But you still don't need a notebook. Cell evaluation is all you need, and that can be done without all the notebook hassle.

The article doesn't mention that one can use Hydrogen and avoid this problem: https://nteract.io/.

Export isn’t great atm but can be combined with pweave: http://mpastell.com/pweave/docs.html

I think VSCode has something similar.

This gives another advantage of using a proper editor and its entire ecosystem.

With the VSCode Python extension you can directly create cells with #%% in a similar way to Hydrogen. There is also Neuron which allows you to see outputs in a separate pane.
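To make the cell-marker idea concrete, here is a small sketch of how a plain .py file with `#%%` (or `# %%`) markers breaks down into cells. It only illustrates the convention; it is not how the VSCode extension or Hydrogen implement it, and `split_cells` is a hypothetical helper.

```python
from typing import List


def split_cells(source: str) -> List[str]:
    """Split a script into cells on lines starting with '# %%' or '#%%'."""
    cells: List[str] = []
    current: List[str] = []

    def flush() -> None:
        # Keep only cells that contain something besides blank lines.
        if any(line.strip() for line in current):
            cells.append("\n".join(current).strip())

    for line in source.splitlines():
        if line.startswith("# %%") or line.startswith("#%%"):
            flush()
            current = []
        else:
            current.append(line)
    flush()
    return cells
```

Because the file stays ordinary Python, it diffs and merges like any other source file, and the markers are just comments when the script runs normally.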

I'm still struggling to find a setup in which cells are auto-generated (or unnecessary like in RStudio) and the autocomplete works as well as in JupyterLab. If I could reliably see all methods/submodules/inline documentation + path autocomplete quickly and for all packages, I would switch to VSCode. (There's a good chance that this is just due to me not being fully aware of what's available in VSCode.)

Have you tried https://atom.io/packages/ide-python for autocompletion/inline documentation? It uses https://github.com/davidhalter/jedi. Also I'd be surprised if those things aren't done properly in VSCode with Python extension(s).

edit: Atom IDE (that this package links to) was deprecated by Facebook a week or so ago. I'm not sure what dependencies packages like the above have on atom-ide-ui.

I have never programmed in R before, but why do you say that there is no need for cells?

I use cells/notebooks in Python, so I can keep my code organized and run computationally intensive things once... Is this something that is not needed in R?

So firstly, you can use R in Jupyter in the exact same way you use Python (ju-pyt-er stands for Julia, Python, R).

R also has RMarkdown, which lets you have notebooks with executable cells (code chunks), and they play much nicer with version control than .ipynb files.

What I was referring to in my previous post is working with a .R file (which is plain text) in RStudio. If my cursor is on a single line which is also one statement, ctrl/cmd + enter executes that statement and shows me the output in the console or in a separate pane for plots. If the cursor is within a multi-line expression such as a plot declaration, beginning of a loop, function declaration, then the interpreter figures out that I want to run multiple lines and executes the whole loop/declares function/creates plot. Or I can also select some code and run it.

Ideally, this is the kind of behaviour that I'd like to replicate with a .py file. It's a nice interactive workflow and also solves the problems that jupyter has with version control.
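For what it's worth, the "figure out which statement the cursor is in" part can be approximated for Python with the standard `ast` module. This is a minimal sketch, assuming Python 3.8+ (for `end_lineno` and `ast.get_source_segment`) and looking only at top-level statements; `statement_at` is a made-up name and this is not what RStudio or any editor actually does.

```python
import ast


def statement_at(source: str, line: int) -> str:
    """Return the top-level statement that spans the given 1-based line number."""
    tree = ast.parse(source)
    for node in tree.body:
        if node.lineno <= line <= node.end_lineno:
            # Slice the original text so multi-line loops/functions come back whole.
            return ast.get_source_segment(source, node) or ""
    return ""  # blank line or comment between statements
```

An editor plugin could send the returned text to a running kernel, which would give the ctrl/cmd + enter behaviour without any cell markers.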

Interesting... I'm currently working on VSNotebooks (extension for VScode), which is a fork from Neuron... I would love to get some ideas that could help bring notebooks into the future, so thanks for your reply!

Pweave seems to give the same as RMarkdown but for Python and other languages: https://github.com/mpastell/Pweave. Examples: http://mpastell.com/pweave/examples/index.html

Improved export from Hydrogen to Jupyter Notebooks is at the top of my wishlist, and I'm hoping to submit a PR for it soon. See this issue: https://github.com/nteract/hydrogen/issues/1296

In case anyone reads this, I submitted a PR to support exporting Markdown cells from Hydrogen: https://github.com/nteract/hydrogen/pull/1498

Mentioned in the article: manual nbconvert, nbdime, ReviewNB (currently GitHub only), jupytext.

Jupytext includes a bit of YAML in the e.g. Python/R/Julia/Markdown header. https://github.com/mwouts/jupytext

Huge +1s for both nbdime and jupytext. Excellent tools both.

Really enjoying jupytext. I do a bunch of my training from Jupyter and it has made my workflow better.

Holy moly, the JS main bundle on this site is 16MB.

I imagine it's because this is a web jupyter book or something. It's definitely extremely slow loading the page on mobile even after downloading the assets, so it's probably super unoptimized.

Right, we didn’t get around to splitting the js bundles up yet but will do so soon, thanks for the reminder. Currently the main bundle is the same for the page and the editor which you can try at https://nextjournal.com/try

Heavy page weights really should be called out in the submission title, IMHO. Not all of us are on unlimited data plans.

nbdime (https://nbdime.readthedocs.io/en/latest/) works very well for me, including in the terminal with `git diff` (https://nbdime.readthedocs.io/en/latest/vcs.html#usage). That `git diff` integration is my favourite part of nbdime, but the article skips it.

Just write python scripts instead in the first place. Import to Jupyter if you really, really need the notebook UI.

Write methods in Python, call them from the notebook. This works well for collaboration: team members can just fork the notebook, and merges become fairly trivial.
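As a tiny illustration of that split, the logic could live in a plain module like the one below, with the notebook holding little more than the import and the call. The module name `analysis.py` and the function `summarize` are hypothetical, chosen only for the example.

```python
# analysis.py (hypothetical module name; lives next to the notebook)
from statistics import mean, median
from typing import Dict, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Return a few summary statistics for a non-empty sequence of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    return {"n": len(values), "mean": mean(values), "median": median(values)}


# In a notebook cell, the only code would then be something like:
#   from analysis import summarize
#   summarize([1.0, 2.0, 3.0])
```

Changes to the logic then show up as ordinary line diffs in the .py file, and the notebook diff stays small.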

This makes the notebook just a convenient way to visualize or share with non team members.

Or give Emacs Org mode a try :)

At my university I took a data science course where we had to work on Jupyter notebooks in groups, and merging was horrible. When we raised this with the course team, they recommended that we use Google Drive. I still think the Jupyter notebook file format wasn't designed for collaboration.

Do we already have something with the quality and workflow of RStudio but for Python?

Spyder and Rodeo don't even come close at this point. Does PyCharm allow something similar?

Rstudio and R Notebooks work great with Python: https://cran.r-project.org/web/packages/reticulate/vignettes...

Yes, but it still feels hacky and only supports R markdown.

R markdown? I'm not sure I understand; the text between chunks is plain ol' markdown. It can be overridden with LaTeX as needed. What parts feel hacky?

My point was that I don't want to work with markdown at all. Ideally, I'd like to work with a .py file in the same way that I can execute parts of a .R file in RStudio.

I hear ya. I think you can do that in Rstudio, but I've only tried with .R files.

Yep, looks possible with code completion in the next iteration: https://twitter.com/grahamimac/status/1076881510194651136?s=...

I have a TOC that covers the right third of the page. Do other folks have that? Does the document author not see it?

It's an issue with the responsive layout. If the window is wide enough the overlap goes away; narrow enough the TOC goes away.
