Ask HN: Are there any good Diff tools for Jupyter Notebooks?

stiff · on May 22, 2022

You can use jupytext to maintain dual .py/.ipynb representation of notebooks and keep both versions in sync:

https://github.com/mwouts/jupytext/blob/main/docs/paired-not...

It works both ways, it can update the .py file each time you save the notebook, or you can edit the .py file and have the jupytext command line tool update the .ipynb.

yasser_kaddoura · on May 22, 2022

This with https://github.com/untitled-ai/jupyter_ascending + your Editor to have a supercharged notebook workflow.

pen2l · on May 22, 2022

On the one hand: cool, if you're an avid emacsen or a vimmer, yeah, ok. OTOH, gosh that is such a cluttered and cumbersome setup. Just bring in vim/emacs bindings to your jupyter: https://github.com/lambdalisue/jupyter-vim-binding. There's a handful of plugins, choose one.

Whatever the final solution everyone decides should be, I just hope it doesn't involve having two redundant windows open side-by-side like that. Ideally, it should probably be instantiating an emacs client within Jupyter as that seems the most logical.

nolroz · on May 22, 2022

Visual Studio Code has a diffing view for notebooks that looks very promising. https://code.visualstudio.com/docs/datascience/jupyter-noteb...

Royi · on May 22, 2022

I wish VS Code had a simple option: clear a Jupyter notebook's output when it is closed. Or something that separates the display of the output from the saved file (still an `ipynb` file). See [1].

[1]: https://github.com/microsoft/vscode-jupyter/issues/9514

domenicrosati · on May 22, 2022

Woah this is exactly what I'm looking for! Thanks

dahart · on May 22, 2022

Can you talk more about why you’re working in Jupyter Notebooks at a level that needs diff reviews? Are you reviewing your own work, or the work of others?

One option would be to start a policy to always “restart and clear output” before saving. This cleans the output cells and makes the .ipynb files diffable. Just happens to also make them nice for storing in version control.

Another option would be to work in pure python files in the first place, and only use Jupyter after the fact. The close brother to Jupyter is the Spyder IDE, which gives you most of the benefits of quick visual outputs, but also has a nice python debugger built in.

pen2l · on May 22, 2022

edit: Googling reveals nbdime, has this been looked into? - https://nbdime.readthedocs.io/en/latest/

Not OP but I can imagine easily the need for what he's asking.

You'll find a lot of algorithms for data and image processing saved as notebooks these days offered to you. Let's say you make some changes from the provided code and after a handful of changes something is not working right. You might want to diff from where you are back to a working version in hopes that differences that emerge might clue you into where to look for where the problem might be.

As an aside, I want to say Jupyter notebooks (more so JupyterLab) are sort of a disruptive change to our coding workflows. We've had interpreters for a long time, sure, but creating interactive graphs on the fly is a godsend; insights come to you in such a workflow that wouldn't otherwise. I hope this catches on; I actually want my shell terminal to become more Jupyter-like. Also, fun fact: did you know you could do real-time collaboration on Jupyter notebooks? https://jupyterlab.readthedocs.io/en/stable/user/rtc.html

dahart · on May 22, 2022

Oh I can totally imagine use-cases too, but I’d love to hear what the OP’s use case actually is. I also agree completely on the disruption that Jupyter brings, and that it has just massive benefits. But when a workflow isn’t giving you everything you want, it’s worth evaluating whether the tools you’re using are the right tools for the job, right?

One example would be that Jupyter is well designed for a lot of prototyping and for single-person scenarios. It’s well designed for sharing and for including notes and narrative with code. It’s just not really designed for multi-user workflows. That’s not a negative in my book, it’s just a fact that makes me reach for a different tool when I need to collaborate.

Also don’t overlook Spyder, which is part of the same ecosystem as Jupyter, they’re usually bundled together, and Spyder gives you the interactive features you want but might better support a production workflow that is multi-user, collaborative, and also more easily diffable.

All that said, it might be awesome if someone builds a Jupyter diff tool that is designed to ignore the output cells!
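A rough sketch of what "diff that ignores the output cells" could mean, using only the Python standard library. This is not how nbdime or any existing tool works; `code_lines` and `diff_sources` are made-up helper names, and the two notebooks are tiny hand-written dicts in the nbformat 4 layout rather than real files.

```python
import difflib


def code_lines(notebook: dict) -> list[str]:
    """Flatten the code cells of a notebook dict into a list of source lines."""
    lines = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        # nbformat allows "source" to be one string or a list of strings.
        text = source if isinstance(source, str) else "".join(source)
        lines.append(f"# --- cell {index} ---")
        lines.extend(text.splitlines())
    return lines


def diff_sources(old: dict, new: dict) -> list[str]:
    """Unified diff of code-cell sources only; outputs never enter the comparison."""
    return list(
        difflib.unified_diff(code_lines(old), code_lines(new), "old", "new", lineterm="")
    )


# Hypothetical notebooks: same structure, one changed line, different outputs.
old_nb = {"cells": [{"cell_type": "code", "source": ["x = 1\n", "print(x)"],
                     "outputs": [{"text": ["1\n"]}]}]}
new_nb = {"cells": [{"cell_type": "code", "source": ["x = 2\n", "print(x)"],
                     "outputs": [{"text": ["2\n"]}]}]}

for line in diff_sources(old_nb, new_nb):
    print(line)
```

Only the `x = 1` to `x = 2` change shows up; the changed output is never compared. Markdown cells are skipped here too, which may or may not be what you want.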

domenicrosati · on May 22, 2022

Hey there - OP here. I haven't used Spyder; I'll have to check it out.

The primary use case is: I am a researcher in NLP, where speed of prototyping is key. I work in an environment where research fragments are primarily Jupyter notebooks. So diffing notebooks typically means reviewing my own changes when modifying my own and others' research sketches, since it's helpful to see how the code changes.

What really resonates with me is what others have said: I need to run cells that take 2-6 hours to compute, so recomputing cells is annoying... I don't love notebooks for their messy state, which causes obvious and very annoying problems, and I am not an advocate for notebooks in production for this reason, but the flexibility of computing stuff, having that persist, and doing downstream prototyping makes notebooks amazing! Markdown and LaTeX in there are also really helpful.

The secondary use case is PRs, but typically reviewing others' research code isn't at the granular level of notebook riffs across a few commits, so it doesn't come up often.

ivansavz · on May 22, 2022

> https://jupyterlab.readthedocs.io/en/stable/user/rtc.html

Wow! Realtime notebook collaborative editing! This is going to be so cool for teaching (allow students to fill-in part of the code block).

Have you tried this yet? Is the idea to run Jupyter on a machine with a public IP and port 8888 open, allowing the server to be accessed by multiple people at the same time? Would this work with services like `ngrok` that make your personal computer available online?

aulin · on May 22, 2022

Not OP, but restart and clear output can be quite compute-intensive if you're working with big datasets or training ML models. There are many ways to mitigate this, like saving weights and only redoing the inference, but it's not always worth it when you're iterating through models and parameters or doing exploratory data analysis. Most of the time you just want to keep the results/outputs of the previous run and improve from there.

dahart · on May 22, 2022

That’s a great point, I sometimes avoid clearing outputs when I’m playing with Pytorch just because retraining takes a while. This has been motivating me to learn how to be fluent with saving weights to disk.

aulin · on May 22, 2022

I used something as a precommit hook in the past that removed plots and other rendered content and only kept text and code in git index. I'm almost sure it was https://github.com/kynan/nbstripout but it's been a while and I could be wrong.

Once the hook was in place git diff worked well enough to not need any other diffing tool.

karlicoss · on May 22, 2022

yep, can confirm, it basically filters out any notebook output from version control (while keeping it intact in the notebook file itself). This works seamlessly with diffing, committing, staging, etc.

cschmidt · on May 22, 2022

There is https://nbdime.readthedocs.io/en/latest/, although I haven't used it personally to know how good it is.

primarydonkey · on May 22, 2022

I used this and can recommend it, since it also shows you the outputs of different versions.

But as another commenter said, when I got to the point of needing to diff my notebooks, I realized that I could move some of the code into separate python files.

If you're a business analyst, one use case is if you need to process some data e.g. every quarter, but the data changes a bit every time so you need to update the approach slightly (e.g. data structure changes, new mapping rules). With nbdiff it's easy to keep track of changes while having some helpful visualizations in the same file.

ivansavz · on May 22, 2022

I use nbdime daily for notebooks under version control thanks to the following configs in my ~/.gitconfig (global git config).

   [diff "jupyternotebook"]
     command = git-nbdiffdriver diff --ignore-details
   [difftool "nbdime"]
     cmd = git-nbdifftool diff --ignore-details \"$LOCAL\" \"$REMOTE\" \"$BASE\"

I'm not sure if this is a standard setup or if I copy-pasted from some blog post, but overall it's working great.

There are some issues with it, like (1) it will unnecessarily mark graphics as changed (e.g. re-generated figures from the same code), and (2) the diffs become less meaningful if large chunks of cells were moved, but overall it works great.

If it supported a `--color-words` option then it would be super helpful for seeing only which words have changes, instead of whole lines changed (very good for long paragraphs of Markdown text).

amirathi · on May 30, 2022

Here are tools people commonly use for notebook version control with git -

[1] nbdime to view local diffs & merge changes

[2] jupytext for 2-way sync between notebook & markdown/scripts

[3] JupyterLab git extension for git clone / pull / push & see visual diffs

[4] JupyterLab gitplus to create GitHub PRs from JupyterLab

[5] ReviewNB for reviewing & diff'ing notebook PRs / Commits on GitHub

Disclaimer: While I’m the author of last two (GitPlus & ReviewNB), I’ve represented the overall landscape in an unbiased way. I've been working on this specific problem for 3+ years & regularly talk to teams who use GitHub with notebooks.

[1] https://nbdime.readthedocs.io

[2] https://jupytext.readthedocs.io

[3] https://github.com/jupyterlab/jupyterlab-git

[4] https://github.com/ReviewNB/jupyterlab-gitplus

[5] https://www.reviewnb.com/

iqkznnft · on May 22, 2022

The solution is don't use ipynb. Instead, use an IDE that can run code segments in files, and version those files.

You end up with files which are syntactically correct code, versionable, and can be run in segments just like ipynb. Win, win, win.
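For anyone who hasn't seen this style: a plain `.py` file split into runnable segments usually looks something like the sketch below. `# %%` is the cell marker that Spyder, the VS Code Python extension, and jupytext's percent format all recognize; the data and variable names are invented for illustration.

```python
# %% Load data (hypothetical inline sample instead of a real file)
values = [3, 1, 4, 1, 5, 9, 2, 6]

# %% Compute summary statistics
mean = sum(values) / len(values)
largest = max(values)

# %% Report
print(f"mean={mean:.2f} max={largest}")
```

To Python the markers are just comments, so the file also runs top to bottom as an ordinary script and diffs like any other source file.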

exevp · on May 22, 2022

You can use clean and smudge filters in git. Since notebook files are JSON, it's pretty straightforward to strip outputs from them using `jq`:

http://timstaley.co.uk/posts/making-git-and-jupyter-notebook...
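If `jq` isn't available, the same stripping step can be sketched in Python with just the `json` module. This is an illustration of the idea, not the filter from the linked post; `strip_outputs` is a made-up name and the notebook is a minimal hand-written dict in the nbformat 4 layout.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs removed."""
    cleaned = json.loads(json.dumps(notebook))  # cheap deep copy of JSON data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Tiny hypothetical notebook used only for illustration.
example = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["2\n"]}],
            "source": ["print(1 + 1)"],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

stripped = strip_outputs(example)
print(json.dumps(stripped, indent=1, sort_keys=True))
```

A clean filter reads the file on stdin and writes the cleaned version to stdout, so wrapping `strip_outputs` between `json.load(sys.stdin)` and `json.dump(..., sys.stdout)` should be enough to turn this into one.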

yanbianhobo · on May 22, 2022

We use ReviewNB at work, it integrates very nicely with github providing the same PR review workflow, it’s a paid tool though.

rgavuliak · on May 22, 2022

We’re using ReviewNB; it works, though we don’t do too many iterations of a notebook.

barrrrald · on May 22, 2022

Hex just launched a diff view feature, along with git sync and a clean file format: https://hex.tech/blog/github-sync

dkeathley · on May 22, 2022

In addition to this, you can keep a dual markdown version that uses a much more human-readable syntax and preserves both code and markdown sections of the Jupyter notebook. This is also via jupytext. In both jupyterlab and jupyter you can pair the two versions (something like what is discussed here: https://www.wrighters.io/jupytext-notebooks-as-markdown-or-p...) and they will stay in sync automatically.

freedomben · on May 22, 2022

For the Elixir equivalent of Jupyter (called Livebook) I've been keeping the markdown files in a `livebooks` directory so diffing them is as easy as `git diff` or any other existing text-based diff tools. It's been pretty successful.

TekMol · on May 22, 2022

In Google Colab, when you "Download ipynb" you get a file that looks like json.

You can prettify it via "python3 -m json.tool" for example. Then you have a structure that you can diff via your favorite diff tool.

What is a pita about it?
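Here is the prettify-then-diff idea as a small Python sketch, with `json.dumps(..., sort_keys=True)` standing in for `python3 -m json.tool --sort-keys` and `difflib` standing in for the diff tool. The two "exports" are invented single-line strings, and `normalize` is a made-up helper name.

```python
import difflib
import json


def normalize(raw: str) -> list[str]:
    """Re-serialize notebook JSON with stable key order and one item per line."""
    return json.dumps(json.loads(raw), indent=4, sort_keys=True).splitlines()


# Two hypothetical single-line exports: same code, different key order,
# and a different execution_count because the cell was re-run.
before = '{"cells": [{"cell_type": "code", "execution_count": 1, "source": ["x = 1"], "outputs": []}]}'
after = '{"cells": [{"source": ["x = 1"], "cell_type": "code", "outputs": [], "execution_count": 2}]}'

diff = list(
    difflib.unified_diff(normalize(before), normalize(after), "before", "after", lineterm="")
)
print("\n".join(diff))
```

Sorting keys removes the reordering noise, but the diff still reports the `execution_count` bump even though the code is identical, which is the kind of churn other comments in this thread strip out before diffing.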