So I'm software developer for 10 years that started using Jupyter Notebooks the ...

thaw13579 · on Dec 12, 2018

It might seem odd, but I achieve a similar thing by scripting my data processing tasks with GNU Make. You can incrementally add steps to the process, and it will retain the existing results as you go along. My use case is scientific computing, where some of the steps may be the results of a costly optimization or simulation. I can update the workflow quite easily and not worry about redoing the expensive bits or making copies. Of course, all of the dependency management and parallel processing of Make are a bonus as well.

dlcmh · on Dec 12, 2018

Thanks, I’d never thought of doing this before - found 2 resources to help me explore this approach:

http://zmjones.com/make/ http://blog.kaggle.com/2012/10/15/make-for-data-scientists/

leethargo · on Dec 12, 2018

Does not sound odd, but in your case you will store results on disk, rather than in memory? That means that there might be some overhead in (de-)serialization and also some restrictions on what kind of data can be stored as an intermediate result?

AboutTheWhisles · on Dec 12, 2018

You can mitigate the reading from disk by memory mapping the file.

I don't think serialization is a problem either. It is possible to make data structures that always exist as a single span of memory, so that they don't have a separate serialized representation. Most data structures are better served like this anyway, since prefetching and linear memory access is a huge speedup over pointer chasing. Building a data structure out of many separate heap allocations will also slow it down.

gnulinux · on Dec 12, 2018

You can memory map and store binary, in-memory data (as long as you don't use pointers). So there'll be some overhead serializing (removing pointers) but ime it's not noticeable since most serialization tasks can be done in milliseconds (even if the dataset is Terabytes) whereas most algorithmic tasks are in the order of seconds, minutes or hours.

jacobush · on Dec 12, 2018

Probably but it's not a given. It depends on the tools. The tools might work with memory mapped data structures directly.

gnulinux · on Dec 12, 2018

Same! I use GNU Make for this. TBH I've never seen anyone using Make for this and my coworkers think it's a bit odd. But I think it's very practical.

nostrademons · on Dec 12, 2018

I've been doing a lot of unstructured data mining in Kotlin recently that's heavily dependent upon heuristics and how the algorithms interact with the precise data given.

I'll often write the code so that each function is a bunch of value definitions in a pipeline, followed by a trivial return statement. No side effects if I can help it. Then I run it in the IntelliJ debugger and set a breakpoint on each return statement. When it hits the breakpoint, I instantly know the value of each subexpression, and can inspect them in case anything is amiss. If I need to rewrite something, I'll write the new code in a watch and immediately evaluate it. I basically use watches in the debugger as a REPL, except that I have the ability to factor out subexpressions and independently check the correctness of them, plus IntelliJ syntax-checks and autocompletes my code within the watch. Also, the sequence in which breakpoints are hit gives me a trace of the code.

The one downside is that I still have to recompile and re-run once I've fixed the bug and identified the correct code, but I mitigate that by architecting the app so that I can easily run it on a subset of data. So for example, my initial seed data set consists of a Hadoop job over a corpus of 35,000 1G files, but I wrote a little shim that can run the mapper & reducer locally, without starting a Hadoop cluster, over a single shard. And then I've got a command-line arg in that shim that lets me run it on just a single record within that shard, skipping over the rest. And I've got other flags that let me select which specific algorithm to run over that single record in that single shard in that 35TB 3B record corpus. I also rely heavily on crawling, but I save the output of each crawl as a WARC file that can be processed in a similar way, without needing to go back out to the Internet. The upshot is that within about 10 seconds I can stop on a breakpoint in any piece of the code, on any input data, and view the complete state of the program.

malux85 · on Dec 12, 2018

One thing that I would love in python is some sort of snapshotting,

Like “this is my global state” - the entire state of the python process, all objects, etc.

Then I change a variable and that creates a new state. We have this now.

What I want is a rewind button, so I can go back to the previous state (quickly preferably) like a time travelling debugger.

Someone build this, I’ll pay for it (and I’m sure others would too)

I know there’d be some problems with external changes (API calls altering external system, writing to files, etc) that a time travelling debugger might not be able to reverse, but even so, I would live with this limitation happily if such a tool existed

mpweiher · on Dec 12, 2018

> snapshotting ... entire state of the python process, all objects, etc.

Someone else in this thread mentioned Smalltalk: this is exactly what a Smalltalk image gives you.

The image is everything, all your objects, code, temporary, everything. And at any time you can do "save" or even "save as" to persist a copy.

And there are variants that can work with Python code as well.

http://squeak.org/

TuringTest · on Dec 12, 2018

It's a shame that Smalltalk never gained traction as a system language, and the C/Unix approach took over enterprise-grade development.

I wonder what software engineering would look like if all the decades-long advancement of tools for stabilization and deployment of code had been researched over platforms that combined system and application code, like Smalltalk did. Instead, we consider both layers as isolated from each other. A lot of work over virtualization and containers seem to be used as a workaround for not having this integrated approach as part of the platform design.

scroot · on Dec 12, 2018

It could still work really well as the systems environment and language for a personal computing device. One needs to build the proper abstractions and DSLs on top of it, of course, but it would be a much more conducive environment to "literate computing," if that's the goal (and I think it should be).

TuringTest · on Dec 12, 2018

I think a lot of people reading this thread will agree :-)

It is definitely an approach more apt to personal computing than to industrial engineering, at least with the level of maturity of the tools available today. The increased ease of use of these environments may be a nice-to-have for professional developers, but they are essential for non-developers.

detaro · on Dec 12, 2018

A basic way of getting that on Unix systems is forking - you get a copy of the process with the same memory state, and likely relatively small interfaces would suffice to manage which process is currently receiving input from e.g. a REPL. (Sockets, open files, ... are more complex, but I guess many REPL workflows could do without that). Sadly you can't easily save much memory for the snapshots from copy-on-write (which otherwise is a neat trick with forking to keep old state) due to the way Python manages memory, at least not without modifying the interpreter.

thomasballinger · on Dec 12, 2018

If anyone wants to walk through these ideas more slowly I write about them in http://ballingt.com/interactive-interpreter-undo and implement this in https://github.com/thomasballinger/rlundo but as you say, eventually decided I had to implement an interpreter for which serialization of state was more efficient and all external resources had amenable interfaces http://dalsegno.ballingt.com/

malux85 · on Dec 12, 2018

Will you marry me?

this is really cool, I'm going to give deeper into the code tonight. Awesome work

detaro · on Dec 12, 2018

hooking into readline is a clever idea!

CMCDragonkai · on Dec 12, 2018

Interesting COW for whole process images and a tree like interface to jump between them. Could work if well if the process image itself had structured interfaces for everything important. Otherwise you'd end up with black box images everywhere.

solomatov · on Dec 12, 2018

If you want snapshotting, take a look at http://datalore.io/ It stores snapshots between cells, and can incrementally recalculate only what's needed.

heavyset_go · on Dec 12, 2018

I believe IPython cells keep their state throughout execution of the notebook.

PurpleRamen · on Dec 12, 2018

Just start ipython? The Ipython shell is the original jupyter, with ipython notebook been splitted of at some point.

But what you describe is normal behaviour in any REPL. The main advantage notebooks have over shells is a better sessionhandling and richful datastructures. Images, markdown and stuff is a bit hard to display on a regular terminal. Ipython has the qt-frontend for this, but then moved to notebooks

mlthoughts2018 · on Dec 12, 2018

The IPython console is great. It’s the Notebook (not console) that’s terrible. The two things are not very similar to each other.

I’ve been saying it for years, even before the great meme presentation this year: notebooks are not the appropriate medium for 99% of rapid prototyping or shared experiment work and notebooks make for terrible units of reproducibility (especially compared to just traditionally version controlled source code with a Makefile or with a Docker container etc).

Pedagogy is probably the only serious use case of a notebook. If you’re not working on a 100% pure pedagogy task, it’s a giant mistake to be doing what you’re doing in a Jupyter notebook.

breck · on Dec 12, 2018

Some tactics I've picked up:

- I work mostly from a clean, normal Python codebase in my normal IDE

- I'll have a file like "common.py" or something

- I keep a Jupyter notebook running always, and I'll do a (from importlib import reload) reload(common) to update any code in my live environment

I also:

- Cache a lot to disk (np.save/np.load).

- For every long-running method, have a tiny subset to unit test it

thomasballinger · on Dec 12, 2018

Less powerful but more automatic: I added a reload modules key to bpython (F6) for this and an auto reload which reruns the repl session each time an imported module changes (F5)

amatic · on Dec 12, 2018

I've found Atom's Hydrogen[1] to be pretty neat for my purposes, currently data analysis and plotting with python. Separate code into cells with #%%, crtl+enter to run. No need to reload data from files. [1]https://atom.io/packages/hydrogen

im3w1l · on Dec 12, 2018

> imagine running your code in a debugger, setting a breakpoint, when it hits that breakpoint, you see "ah here's the problem", you fix the code, and have it re-execute all the while not having stopped your program at all. No re-compiling, and then having to re-execute from the beginning.

Java in Eclipse works exactly like this.

pacala · on Dec 12, 2018

Run/debug code via PyCharm/VSCode. Cache long computations on disk via a suitable pickle, x = cache(time_consuming_fn, 'path/to/cached/data'). When stale results are suspected, 'restart' the environment by deleting the cache directory.

ajdhsjakafjt · on Dec 12, 2018

Matlab has a somewhat different focus, but essentially offers the experience you describe.

(just want to say: this concept is around for ages and Jupyter could adapt some of Matlab's GUI features for sure)

laichzeit0 · on Dec 12, 2018

I know it's not a new concept, but it's just not the way traditional software is developed. Tradition software is usually done in the complete antithesis to the way a REPL environment works.

I guess that's the main reason why so many software developers don't "get" why people want to write code in a browser. Even normal Python developers.

TeMPOraL · on Dec 12, 2018

I'm a REPL-first obstinate Lisper and a recent fan of Observable Notebooks, and I still don't like writing code in the browser. The browser is an excellent rendering engine, but absolutely sucks at typing & developer conveniences. If someone could make a cross between Emacs and Chrome (preferably with elisp instead of JS), I'd be in heaven.

orbifold · on Dec 12, 2018

It is called Visual Studio Code. Switched from emacs and haven’t looked back so far.

gnulinux · on Dec 12, 2018

I don't "get" why people want to write code in a browser NOT because I find REPL useless but because I find browser a poor choice of program to input code. I use REPL all the time, but I use emacs to do this. My firefox isn't customized to type code, it's customized to type HN comments and watch cat videos. Why use firefox instead of emacs to type code?

EForEndeavour · on Dec 13, 2018

In my view, Jupyter Notebooks are web-based because they are designed with sharing in mind. It's far from perfect (git does not play nicely with .ipynb files :(), but here are two specific advantages that come to mind:

- You can run a Jupyter Notebook locally or on a remote server, seamlessly working in a graphics- and Markdown-capable REPL in either case

- You can easily share and export the final notebook in its post-run state, presenting Markdown, LaTeX formulas, code, comments, plots, and results in the order they were written and executed, in a single (HTML or PDF) document (or even a slide deck, if you dare)

To your last point about limiting use of the browser to writing short comments and watching videos: I'd say that's a matter of personal taste. Browsers are perfectly capable of quickly handling the volume of code and text that would appear in a reasonable notebook.

That said, I grew up with vim keybindings, and while Jupyter supports the basics (modal editing, j and k to move between cells, dd to delete a cell, etc.) and has super useful default keyboard shortcuts, I still move and type faster in vim. But Jupyter adds so much value as a seamless, graphics-enabled, run-anywhere REPL that I happily accept the slight slowdown in my typing and navigation, which is never the bottleneck in my R&D work anyway.

You can replace "firefox" and "emacs" with other things to take that last argument to an extreme. For example: "My [laptop] isn't customized to type code, it's customized to type HN comments and watch cat videos. Why use [a laptop] instead of [a workstation] to type code?" I thought like that 15 years ago, but technology and I changed.

gnulinux · on Dec 13, 2018

It's not about performance. I have no problem with jupyter notebook. I use ein-mode in emacs which is basically a better frontend for jupyter notebook AND you get all benefits of emacs since cells are individual emacs buffers. It's also not about browsers. Writing code in a medium that's not designed to write code encourages bad behavior. Jupyter notebook is supposed to be optimized for Julia, Python and R but somehow writing Python code is next to impossible because it doesn't understand what to do with indentation. There is almost no word suggestion, when I type a few characters and click tab I almost always get horrible suggestion and have to navigate with arrow keys. I cannot use C-q to comment stuff out/in intelligently (comment-dwim), I cannot use regex, I cannot fuzzy search, I cannot use multiple cursors, I cannot use replace-text, I cannot highlight SQL embedded in python. I can go on for days. And this is not a problem of jupyter notebook. Why reinvent the wheel? We already have text editors and IDEs that solve the problem of inputing code, why would you ever want to type code in a web input box? That is, unless someone spends more money and makes a full-fledged IDE in local javascript. So much wasted effort for a problem that was solved in 1960s.

molteanu · on Dec 12, 2018

> How do you get that same experience without using Jupyter?

You use an image-based language (i.e. Lisp, Smalltalk)

opium_tea · on Dec 12, 2018

One method I use fairly frequently is to combine ipdb and importlib. You can set a breakpoint in one module, after some time intensive task, and pause there in your console. Write some code in another module - your working module - and then import it from the ipdb prompt. You can then make some changes to your working module and use importlib.reload(working_module). It will reload that module practically instantaneously and execute any top level code.

AboutTheWhisles · on Dec 12, 2018

Here is that exact workflow in modern C++

https://github.com/LiveAsynchronousVisualizedArchitecture/la...

All the IO between nodes is serialized, so an output that is already in shared memory (to be examined or visualized in other programs) can also be written to disk.

It can then be loaded as a 'constant' node and iterations can be done with hot reloading on the baked output.

They key is that you need both live/hot/automatic reloading of execution AND snapshots/replay/rewind/baking.

If you only have one or the other, you can't isolate a part of a program and iterate on it nearly as quickly.

Pete_D · on Dec 12, 2018

I use emacs python-mode for this. It's more or less a wrapper around the REPL you'd get by typing "python" at a command line, but with some integration to support sending updated function definitions, tab completion, etc.

tempay · on Dec 12, 2018

I get a very similar experience using IPython from a terminal, sometimes copying code over from my main text editor (where I have linting, multiple cursors and other niceties). As I do a lot of plotting I use itermplot[1] to display inline graphics. When I'm done I copy everything using %history XX to a permanent home in libraries or standalone scripts.

[1] https://github.com/daleroberts/itermplot

skocznymroczny · on Dec 12, 2018

But you can do that in Java? Java debugger (and IDEs like Netbeans) can hotswap a class in the middle of execution without restarting the application.

gronne · on Dec 12, 2018

The Spyder3 ide might be worth looking at. It won’t be as nifty with graphs and charts but its great for iterative data wrangling and debugging.

source99 · on Dec 12, 2018

I’ve spent a fair amount of time architecting my software to be able to set a flag and instead of processing a full typical run lasting several hours of compute it processes a subset in minutes or seconds. It’s not as clean as a Jupiter notebook but it’s much more repeatable than working in a notebook.

I have a hard time maintaining discipline in a notebook and therefore don’t find them very useful

dewy · on Dec 12, 2018

You could try the Spyder IDE, which has some similarities to Jupyter.

It runs code in IPython, but you write in a regular Python file rather than a notebook. I've used it in exactly the manner you describe, doing a load to memory once, then running and rerunning analyses (or parts of them) in the same Ipython console.

KerrickStaley · on Dec 12, 2018

one specific solution: If your code downloads web content (e.g. a scraper), and you want to avoid having to re-download when you restart it, the requests_cache module [1] works really well.

Also echoing thaw13579's comment; I used GNU Make in one project [2] for this and it worked OK.

[1] https://github.com/reclosedev/requests-cache [2] https://github.com/kerrickstaley/tatoeba_rank/blob/master/Ma...

neves · on Dec 12, 2018

The REPL is a really different way to develop software. Jupyter was grown from IPython, a really great REPL. I prefer it to Jupyter. I always start a console connected to my notebook executing the magic %qtconsole

orbifold · on Dec 12, 2018

For c++ there is root/cling which is what all the experiments at CERN use in one way or another. They offer a Jupiter plugin as well, but it can be used in the console independently.

raducu · on Dec 14, 2018

That's only because you store primitives in memory in jupyter; I find I write terrible code in Jupyter precisely because of that.

mattip · on Dec 12, 2018

You can use spyder, and execute chunks of code in a file by delinieating them with #%%. This gives you something like cells but in a single python file with no json format.

CMCDragonkai · on Dec 12, 2018

Look to Smalltalk for lessons on this.

lvncelot · on Dec 13, 2018

I'm using kind of middle way to what's described in the article. I'll hack away in a Notebook because the problems that arise early in coding are problems that can be fixed quickly, and the time spent waiting for the code vs. time fixing problem is sometimes magnitudes higher.

But at some point I'm getting to a stage where I realize that I'm starting to write architecture, not algorithms. And at that point the only thing I'll do with my notebooks is glance at them to look at particular solutions, and maybe run them once or twice for sanity checks (or to just be happy that I solved the problem :D).

I then try to flip a mental switch. There are fundamental differences between writing python scripts for scientific or any other purposes and writing python architecture. Which is why I've moved away from copy-pasting code from notebooks to codebases. (I've been stubborn to actually ingrain this for a year or so.). I treat my own notebooks as if they were foreign solutions. Good for knowing how to tackle the problem, but I'll keep them the hell away from my clipboard.

At that point, it shouldn't matter if I have to re-run everything again for tests, because the stability and reliability of my code is now the highest priority. Also, if your development progress is anything like mine, the actual problems you fix a few weeks into a project aren't problems where you try a solution every few minutes, but rather once or twice a day. So the extra time spent waiting doesn't really bother me.

TL;DR: Hack away in Jupyter notebooks for exploration, then lock it all away once you know how to tackle the problem and work 'traditionally'.

alecco · on Dec 12, 2018

REPL for exploring is good. But for actual development I prefer some simple TDD.