I think Jupyter notebooks are quite useful as "rich display" shells. I often use them to set up simple interactive demos or tutorials to show folks, or to keep notes or scratch work for myself.
That being said, I do think the "reproducibility" aspect of the notebook is overblown for the reasons other comments cite. Notebooks are hard to version control and diff, and are easy to "corrupt." I often see Jupyter notebooks described as "literate programs," and I really don't think that's an apt description. The notebook is basically the IPython shell exposed to the browser where you can display rich output.
This is where I think the R ecosystem's approach to the problem is better (a bit like org-mode & org-babel). For them, there is a literate program in plain text. Code blocks can be executed interactively and results displayed inline by a "viewer" on the document (like that provided by RStudio), but executing code doesn't change the source code of the program, and diffs/versions are only created by editing the source. At any point, the file can be "compiled" or processed into a static output document like HTML or PDF.
This is essentially literate programming, but with an intermediate "interactive" feature facilitated by an external program. The RMarkdown source doesn't know it's being interacted with or executed, and you can edit it like any other literate program.
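For anyone who hasn't seen it, an RMarkdown file is just plain text along these lines (a minimal sketch, not a complete example; `cars` is a built-in R dataset):

    ---
    title: "Example analysis"
    output: html_document
    ---

    Some prose explaining the analysis.

    ```{r}
    summary(cars)
    plot(cars)
    ```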
Interaction, reproducibility, and publication have fundamental tensions with each other. Jupyter notebooks are trying to do all three in the same software/format, and my sense is that they're starting to strain against those tensions.
I like the R approach so much more.
Additionally, RStudio is an incredibly powerful IDE for data analysis.
EDIT: Interestingly, however, I still use ESS https://ess.r-project.org/ but that's because I love Emacs too much :D
I also tend to gravitate towards ESS, and probably split my R development time between Emacs and RStudio. I've even written a very kludgy Rmd notebook mode that uses overlays to show evaluation results from code chunks. But RStudio is very well-designed and ESS just doesn't compare feature-wise, sadly.
Although it's gone far beyond just Julia and Python now.
Edit: Ahurmazda is right.
...the core programming languages supported by Jupyter are Julia, Python and R. While the name Jupyter is not a direct acronym for these languages, it nods its head in those directions. In particular, the "y" in the middle of Jupyter was chosen to honor our Python heritage.
Hydrogen is a fantastically well-executed and useful piece of software. (As is Jupyter.) I’ve used it in my own programming, and also in teaching introductory software development, where it’s a helpful transition from Jupyter notebooks (which are used in the early part of the course, for problem sets / reading journals) to text files and code editors (which are introduced later, and used for team projects, and for projects that use Flask or PyGame).
But also — relevant to this sub-thread — “Hydrogen” (it relates “Jupyter” to “Atom”) has to be one of the best project names ever. It’s right up there with “Pyramid Scheme”.
The name is also in homage to Galileo's notebooks recording the discovery of the moons of Jupiter (the four we now call the 'Galilean moons'), and also to a bar called Jupiter in Berkeley, which the core team has visited quite often. The last one's more like a funny coincidence, though.
That said, I've always rhymed it with lumpy, haha
Sometimes you want to present data and graphics, and have a bit of interactivity. Notebooks make it easy to share your code/data/graphics. And they beat a PowerPoint any day (for this use case, anyway).
Thanks Jupyter Team!
One reason I don't use it is that I started doing data science before the Python ecosystem was viable -- before Pandas existed. I use R for data science (which I generally find superior due to Hadley Wickham's libraries).
I know Jupyter supports R now, but I already had a terminal/web-based workflow by the time that happened.
More importantly, I think this recent blog post finally crystallized why I don't program in REPLs: Because they encourage global variables! I naturally structure my code into functions from the outset.
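A contrived sketch of the difference (all names made up):

    # REPL-style: every intermediate is a global, and later snippets
    # silently depend on whatever happens to be sitting in the namespace.
    raw = [1, 2, 3, 4]
    cleaned = [x for x in raw if x > 1]
    average = sum(cleaned) / len(cleaned)

    # Structured into functions from the outset: dependencies are explicit
    # arguments, and nothing leaks into the global namespace.
    def clean(raw):
        return [x for x in raw if x > 1]

    def average_of(values):
        return sum(values) / len(values)

    print(average_of(clean([1, 2, 3, 4])))  # 3.0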
I don't like the persistence because it can lead to "wrong" programs. I prefer to test my programs with a "clean slate", i.e. by starting a new process.
Jupyter's structure of delimited code cells enables a programming style where each cell can be treated as an atomic unit: if a cell completes, its effects persist in memory for other cells to process.
However, this style of programming with Jupyter has its limits. For example, Jupyter penalizes abstraction by removing this interactive debuggability.
In other words, if you put all your code in functions like I do, then Jupyter doesn't add anything. It doesn't let you "step through" the function like a debugger does.
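To make that concrete, a toy example (hypothetical names):

    # Top-level cell style: every intermediate persists in the kernel,
    # so a later cell can inspect `sorted_data` directly.
    data = [3, 1, 2]
    sorted_data = sorted(data)
    middle = sorted_data[len(sorted_data) // 2]

    # The same logic behind an abstraction: the intermediates become
    # locals that vanish when the call returns, so no later cell can
    # poke at them (short of a real debugger).
    def median_of(data):
        sorted_data = sorted(data)
        return sorted_data[len(sorted_data) // 2]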
Though I think I should somehow try to get over this, because there are a lot of benefits to something like Jupyter, like having graphics inline.
Or maybe Jupyter just needs an integrated debugger? And maybe the ability to clear state or tree-walk definitions? I don't like having unused definitions laying around in my workspace.
Also, does Jupyter have any notion of data flow? I don't think it does, because Python doesn't. I think Observable might address some of my gripes, but I haven't tried it yet:
I'd really love an (ideally typed) expression-oriented language to take notebook programming to the next level ... This is my dream for what I want from Swift notebooks (a rough sketch of the memoization idea follows the list):
- memoizes all (non-loop?) values created at the notebook file scope
- automatically invalidates memoized entries after a source change, using control-flow analysis and code-coverage data from previous runs
- provides an approximation of the conceptual model of re-running the whole notebook 'from the beginning' on each execution, but pulls memoized values from the cache when present, to make execution fast and to avoid repeating side effects
- allows easy interactive 'invalidate all memoized entries before/after here' operations in the notebook file ...
- involves editing code normally, in a regular source file in the IDE (with maybe a different extension to imply the different execution semantics)
- allows executing code in the debugger when desired, without having to change anything ...
- supports inline graphical representations
- supports wiring custom graphical UI models into the notebook's execution context ...
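Out of curiosity, here is a mock-up of just the memoization part in Python (entirely hypothetical, and it only covers the hash-the-source half; the control-flow-analysis invalidation is the hard part):

    import hashlib

    _memo = {}  # cell id -> (hash of source, cached value)

    def run_cell(cell_id, source, env):
        # Re-evaluate a cell only if its source changed since the last run.
        digest = hashlib.sha256(source.encode()).hexdigest()
        hit = _memo.get(cell_id)
        if hit is not None and hit[0] == digest:
            return hit[1]             # source unchanged: reuse memoized value
        value = eval(source, env)     # a "cell" here is a single expression
        _memo[cell_id] = (digest, value)
        return value

    env = {"n": 10}
    print(run_cell("c1", "sum(range(n))", env))      # evaluates: 45
    print(run_cell("c1", "sum(range(n))", env))      # memoized: 45, no re-run
    print(run_cell("c1", "sum(range(n + 1))", env))  # source changed: 55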
I'd encourage you to give https://beta.observablehq.com a try. From all of your points — aside from a statically typed language and editing in your normal IDE — we try to hit that target on the nose.
Every cell is reevaluated only when any of its inputs changes, and inline and custom graphical representations can render your live data — and can even themselves be passed as inputs to other cells. For a very simple example, see: https://beta.observablehq.com/@mbostock/d3-brushable-scatter...
If anything should use the dataflow model, it's data analysis!!!
And yes that's why I mentioned Observable, and I'm glad jashkenas also responded. As far as I understand, it's like a spreadsheet, so when you update your inputs, the outputs become consistent automatically.
It's sort of like Make (or perhaps Make in reverse). Dataflow also allows your code to be parallelized. Some scientists don't care about this, but engineers do. It's an eye-opening experience to speed up naive data analysis by 1000x or more with shell scripts and a little C++.
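To make the spreadsheet/Make analogy concrete, here is a toy version of that model (hypothetical; graphlib is in Python's standard library). Each cell declares its inputs, and changing a cell re-runs only it and its downstream dependents, in topological order:

    from graphlib import TopologicalSorter

    inputs = {"a": set(), "b": {"a"}, "c": {"a", "b"}}  # cell -> its inputs
    formulas = {
        "a": lambda v: 2,
        "b": lambda v: v["a"] * 10,
        "c": lambda v: v["a"] + v["b"],
    }
    values = {}

    def recompute(changed):
        # Walk cells in dependency order; re-run a cell only if it changed
        # or one of its inputs was just recomputed.
        dirty = {changed}
        for cell in TopologicalSorter(inputs).static_order():
            if cell in dirty or inputs[cell] & dirty:
                values[cell] = formulas[cell](values)
                dirty.add(cell)

    recompute("a")                        # first run computes everything
    print(values)                         # {'a': 2, 'b': 20, 'c': 22}
    formulas["b"] = lambda v: v["a"] + 1
    recompute("b")                        # only b and c re-run; a is reused
    print(values)                         # {'a': 2, 'b': 3, 'c': 5}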
    a = 2
    def square(x):
        return x ** 2
    b = 5
This is something that you can't do in MATLAB, which I still use primarily. In MATLAB you have to create a new file for each function, which ruins the readability of the script (notebook) when you publish it, because the source of the functions is not included. So if I write a MATLAB notebook that I want to publish and share, I end up avoiding functions for as long as possible and using copy-paste instead ...
Currently, the main things that keep me with MATLAB regardless are:
a) It feels more responsive, probably because it's a native app and not running in the browser.
b) I don't like to work in the browser. It's distracting.
c) I like MATLAB's profiler, debugger and workspace (constant visual inspection of global variables in a separate window), which come right out of the box. For Python/Jupyter I have to set all of this up manually.
Note: MATLAB now has an 'interactive script' feature that is similar to Jupyter, but it is so slow for > 100 LOC that it's completely useless. Even the MathWorks developers admit this. Instead I use %% to separate my MATLAB scripts into executable blocks (Ctrl+Enter) and then use the 'publish' function to create a LaTeX file, which I can then compile to PDF. This produces vector-based math formulas and figures (which 'publish to html / pdf' doesn't).
It sounds like RStudio might be a better model for what we want:
https://en.wikipedia.org/wiki/RStudio (click on the screenshot)
It only works with R, but a lot of people say great things about it.
I saw another commenter say that Jupyter is more like Mathematica notebooks, and RStudio is more like MATLAB. I think this sounds right.
In the former, the interactive experience is more central. In the latter, you are developing a program, and the IDE helps you do it interactively. But the program is central. (At least this is true for R, it sounds like it might not be as true for MATLAB. But I know that pretty large programs are written in MATLAB.)
For now I'm still sticking with my highly-custom shell-based workflow. But I do want to make interactive graphics less painful. Right now I juggle a web browser, a terminal, a text editor, and an R REPL!
1. The default front-end is a weak platform for getting work done.
2. Running a remote kernel is a pain in the ass (cat a config file then manually tunnel 4 ports over SSH), and I can't seem to get it to work on Windows at all.
This is an issue at my company because we do a lot of work on remote servers that can be accessed only through SSH or JupyterHub. Individual users do not have control over the latter, so we are stuck with the inadequate default experience I just described above.
3. No kernel other than IPython is mature.
IRKernel is getting there. Everything else is at best a beta-quality product.
4. Notebooks are not a plain text file format.
Hand editing a notebook is messy. Notebooks do not play well with version control systems and diff tools. RMarkdown and Knitr/Sweave are just preprocessors for established plain text formats (Markdown, and LaTeX with some extra syntax). With those formats you can take advantage of a wealth of existing tooling, and you're free to edit the file in a normal text editor without relying on a special front end. Ironically, having everything formatted as JSON should make it easier to write those special front ends, but I have not seen any good ones yet.
I hear so many good things about it. I wrote this comment about it:
But ANY of those four is a dealbreaker for me. I want to use languages other than Python, with remote kernels, and I want version control. And I like my text editor to be really fast.
I think it comes down to a scientific background vs. a software background. I've memorized a boatload of tools and weird shell incantations, but the result is that I have a more solid workflow than Jupyter provides. Solid in the sense that it is likely to produce reliable results, not that it's "easier".
But if you don't have that software engineering background then I understand that Jupyter makes a whole bunch of things easier. It's not optimal in my view, but it's easier.
- Better front-end integration - e.g. a separate vim process connecting to and editing cells of a running notebook, updating the browser view on each change
- Fewer bugs and more parity between the python kernel and non-python kernels
Other than that, I try to put any lengthy code in functions in a module alongside the notebook, so that the notebook mostly contains one-line commands to kick off a calculation or generate a plot. I also have a shortcut that copies the content of the current browser text field (notebook cell) into MacVim, and pastes it back automatically as soon as I close the editor.
I do prefer RStudio's REPL approach of being able to run code by line or by block (likely inspired by MATLAB's IDE), rather than Jupyter's approach of executing code by cell (which was inspired by Mathematica). They both let you try stuff out easily while maintaining state, but the former is far easier to productionize.
2. Remote kernels over SSH aren't that hard -- I do this all the time via SSH tunnels. I start Jupyter Lab in an SSH console (usually on a cloud-based VM), and create a tunnel to port 8888 (the default) using my Windows SSH app (Bitvise). 1 port. That's it.
3. No comment - I only use the Python kernel.
4. Correct. Notebooks do present challenges for version control.
I want the opposite. I want to use a remote kernel with a local client.
This lets me run computationally heavy Jupyter calculations on a beefy remote backend in the cloud. My local browser merely talks to that backend via a tunnel.
Here's something on the web that describes this  -- except with Bitvise on Windows, you don't have to enter any SSH commands. The tunnel setup etc. is all done via a GUI. This is a pretty standard SSH tunnel technique. You can use this for more than just Jupyter.
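For anyone without Bitvise, the plain-OpenSSH version of the same tunnel looks roughly like this (user and host names are placeholders):

    # on the remote VM: start Jupyter without opening a browser
    jupyter lab --no-browser --port 8888

    # on the local machine: forward local port 8888 to the remote one
    ssh -N -L 8888:localhost:8888 user@remote-host

    # then browse to http://localhost:8888 locally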
BTW I managed to get it to work. I think I had missed a port the first time I tried.
Kernel maturity: the Julia and Haskell kernels are pretty well supported, I understand, though I haven't used them myself.
Alternative frontends: Emacs IPython Notebook is pretty well maintained, if that's to your taste.
I use Rmarkdown and Sweave to write homework for my students in a very Jupyter way. I also use them to generate data-driven static webpages and to procedurally generate production-quality, easily formatted PDF and HTML reports. I also use them as a templating system for auto-generated model-diagnostic emails. Perhaps I need to return to Jupyter to see what I'm missing, but I don't really know what purpose it would serve, or what kind of work it would make easier.
JupyterLab also allows an RMarkdown-like workflow, where code blocks in a markdown document can be executed to display graphs.
I believe the important part is to allow interoperability between the different ways people want to work. You can't have one size fits all, and there is still a lot of work that can be done to cover some use cases.
I see it as useful where illustrating and explaining some computational steps is at least as important as executing them. Teaching is one obvious use case, but it's also valuable for sharing scientific methods, documenting a library with runnable examples, or presentations at programming conferences.
When a model and the associated data pipelines hit production, they need to be version controlled - plain text files can't be beat for that. The idea that the same tool should be both IDE and report is also very strange to me - I can see how it lowers the barrier to entry in some cases, but it doesn't seem optimal for most uses.
I do agree that Jupyter has helped a lot to get reproducibility on people's radars, and that's a positive thing.
I like being able to use my existing tools (text editor, make, version control, etc). I also like being able to write clean functions and run unit tests while still ultimately being able to generate a clean final document.
I see two use cases for this sort of notebook thing. One is reproducible research, where I find a woven solution far preferable to a notebook. I can use version control to assist in iteration, and use an editor of my choice. The other is exploratory analysis, which Rmarkdown/Sweave/knitr is not really a good substitute for.
For exploratory analysis, I've found that the terminal integration in Vim, together with the REPL, works well. It lets me save half-working stuff and record some of my thought process; in a Jupyter notebook the only things I can leave in are working code snippets.
Wikipedia says it does incremental rebuilding, which was news to me. I've never used it, but I know a lot of R users who use it.
One thing that I think is missing from IPython, though, is the ability to save a given state of the interpreter, with all the variables in it. That way one could perform a time-consuming data loading/parsing step once, and restart from that point if some variables get messed up. Jupyter can't do that either, AFAIK.
Even stuff that's technically plain text is easier when you can display tables and other formatted text. E.g. I have a tiny little notebook that generates LaTeX code for a normal distribution table; it's a notebook because then I can display an HTML preview in a few lines of code.
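That kind of notebook can be just a few lines. A rough sketch (assuming scipy for the normal CDF):

    from IPython.display import HTML, display
    from scipy.stats import norm

    zs = [0.0, 0.5, 1.0, 1.5, 2.0]
    # the actual deliverable: LaTeX table rows for Phi(z)
    print("\n".join(rf"{z:.1f} & {norm.cdf(z):.4f} \\" for z in zs))
    # the notebook payoff: a quick HTML preview of the same numbers
    rows = "".join(f"<tr><td>{z:.1f}</td><td>{norm.cdf(z):.4f}</td></tr>" for z in zs)
    display(HTML(f"<table><tr><th>z</th><th>&Phi;(z)</th></tr>{rows}</table>"))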
Jupyter can't save interpreter state - that seems essentially impossible without adding state-saving and -loading code to every single library and dependency.
but IPython is already well integrated with matplotlib in --pylab mode, and is pretty interactive. That's how I use it.
re: saving state - it's technically a nightmare, I agree. I thought it might be possible at the OS level - just dumping the whole process. There would still be issues of what to do with open files, network connections, etc., but it seems that in many use cases that would be enough.
I think even saving state by dumping the whole process is unfeasible. What happens if some dependency gets upgraded, e.g. for a critical security hole? The problems seem unavoidable, so I think we're stuck.
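For what it's worth, the closest partial workaround I know of is dill's session pickling, which snapshots whatever in the namespace happens to be serializable (open files, sockets, and many C-backed objects won't survive, and an upgraded dependency can break unpickling, exactly as described above):

    import dill

    x = list(range(10))              # some expensive-to-recreate state
    dill.dump_session("state.pkl")   # snapshot the interpreter's globals

    # ... later, in a fresh interpreter ...
    import dill
    dill.load_session("state.pkl")   # restores whatever was picklable
    print(x)                         # [0, 1, ..., 9]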
For transplants from Matlab or Mathematica it's wonderful.
In my heart, I wish the "data community" would show more care for security (and, relatedly, privacy), with a deeper focus on features that simplify access control, and guidelines on how to enforce "reasonable defaults".
I fear that Jupyter in many companies is becoming the next Jenkins, with unconstrained access to all data vs all infra, and this will lead to more and more incidents and leaks.
I very much hope that recognition like this will foster not only better tools and support, but also best practices and security considerations.
But, back to the focus of the post, congrats on this success!
We will probably put that in the context of GDPR/HIPAA/FERPA and follow these guidelines to make Jupyter "ready" for those frameworks. We can't say that Jupyter is itself compliant, as that depends on the context in which it is deployed, but we want to make it as easy as possible for a team of researchers on a low budget, or a company with 1000+ users, to deploy a secure, auditable, and safe Jupyter environment.
I was like "Did they actually achieve this?"
It is really great for solving one-off problems or for learning.
The Jupyter code itself, while verbose, is also pretty extensible. I just put together something that lets me connect to my Spark Kubernetes pod.
I think being able to customize Jupyter and add new kernels (languages) is where it becomes really powerful and awesome.
The major downside is that it disables JS by default within notebooks, so if you're using Bokeh you'll have to install the jupyterlab extension, but the innovations around files, downloads, views on notebooks, etc. are worth the price of admission.
One of the issues is that you always have internal state as soon as you interact with a data source or sink. If you read from or write to an API, that's stateful. Your file system is stateful... etc.
It's an interesting but hard problem that we'd be happy to have more help with.
Also, this is not for the faint of heart, but before closing a session, I do a "restart kernel and run all." This ensures that my notebook is running the way I'd think if I open it up later and try to re-run it.
My personal best practice is to 'restart and run all' to run the entire thing before I commit it to github. Jupyter git integration is a whole separate bag of worms, as the combination of presentation/code violates some core git assumptions.
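That "restart and run all" check can also be scripted as a pre-commit step. Assuming nbconvert is installed (the notebook name is a placeholder):

    # execute the notebook top to bottom, failing if any cell errors
    jupyter nbconvert --to notebook --execute --inplace analysis.ipynb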