
So I've been a software developer for 10 years and started using Jupyter Notebooks in the last year. I absolutely love the REPL that IPython gives you. Do a query that takes really long, store it in some variable, spend the next hour or two working on that dataset in memory, changing code, iterating, all the while never having to re-execute that query or load data, because it's just a REPL.

How can one get the same type of developer experience in Python without using Jupyter notebooks? What I'm talking about, for people who do traditional development in, let's say, Java or C++, is this: imagine running your code in a debugger, setting a breakpoint, and when it hits that breakpoint you see "ah, here's the problem", you fix the code and have it re-execute, all without ever having stopped your program. No re-compiling and then having to re-execute from the beginning. Once you've done things the Jupyter way, how can you possibly want to go back to writing code in the traditional sense? It's just too slow a process.

How do you get that same experience without using Jupyter? I tried the VSCode plugin [1] that tries to make things the same as a Jupyter Notebook, but it's nowhere near as smooth an experience and feels clunky.

[1] https://marketplace.visualstudio.com/items?itemName=donjayam...




It might seem odd, but I achieve a similar thing by scripting my data processing tasks with GNU Make. You can incrementally add steps to the process, and it will retain the existing results as you go along. My use case is scientific computing, where some of the steps may be the results of a costly optimization or simulation. I can update the workflow quite easily and not worry about redoing the expensive bits or making copies. Of course, all of the dependency management and parallel processing of Make are a bonus as well.


Thanks, I’d never thought of doing this before - found 2 resources to help me explore this approach:

http://zmjones.com/make/
http://blog.kaggle.com/2012/10/15/make-for-data-scientists/


Does not sound odd, but in your case you will store results on disk, rather than in memory? That means that there might be some overhead in (de-)serialization and also some restrictions on what kind of data can be stored as an intermediate result?


You can mitigate the reading from disk by memory mapping the file.

I don't think serialization is a problem either. It is possible to make data structures that always exist as a single span of memory, so that they don't have a separate serialized representation. Most data structures are better served like this anyway, since prefetching and linear memory access are a huge speedup over pointer chasing. Building a data structure out of many separate heap allocations will also slow it down.


You can memory map and store binary, in-memory data (as long as you don't use pointers). So there'll be some overhead serializing (removing pointers), but in my experience it's not noticeable, since most serialization tasks can be done in milliseconds (even if the dataset is terabytes), whereas most algorithmic tasks are on the order of seconds, minutes or hours.
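
For instance, a minimal NumPy sketch of the idea (the file name, dtype and shape are just illustrative): the array is backed directly by a file, so there is no separate serialized form and "loading" it later is essentially free.

  import numpy as np

  # Write once, e.g. at the end of the expensive pipeline step.
  a = np.memmap("intermediate.dat", dtype="float64", mode="w+", shape=(1_000_000,))
  a[:] = np.random.rand(1_000_000)
  a.flush()

  # Later runs map the same file back in; pages are only pulled from disk
  # as they are touched, so there is no up-front deserialization cost.
  b = np.memmap("intermediate.dat", dtype="float64", mode="r", shape=(1_000_000,))
  print(b[:5])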


Probably but it's not a given. It depends on the tools. The tools might work with memory mapped data structures directly.


Same! I use GNU Make for this. TBH I've never seen anyone using Make for this and my coworkers think it's a bit odd. But I think it's very practical.


I've been doing a lot of unstructured data mining in Kotlin recently that's heavily dependent upon heuristics and how the algorithms interact with the precise data given.

I'll often write the code so that each function is a bunch of value definitions in a pipeline, followed by a trivial return statement. No side effects if I can help it. Then I run it in the IntelliJ debugger and set a breakpoint on each return statement. When it hits the breakpoint, I instantly know the value of each subexpression, and can inspect them in case anything is amiss. If I need to rewrite something, I'll write the new code in a watch and immediately evaluate it. I basically use watches in the debugger as a REPL, except that I have the ability to factor out subexpressions and independently check the correctness of them, plus IntelliJ syntax-checks and autocompletes my code within the watch. Also, the sequence in which breakpoints are hit gives me a trace of the code.

The one downside is that I still have to recompile and re-run once I've fixed the bug and identified the correct code, but I mitigate that by architecting the app so that I can easily run it on a subset of data. So for example, my initial seed data set consists of a Hadoop job over a corpus of 35,000 1G files, but I wrote a little shim that can run the mapper & reducer locally, without starting a Hadoop cluster, over a single shard. And then I've got a command-line arg in that shim that lets me run it on just a single record within that shard, skipping over the rest. And I've got other flags that let me select which specific algorithm to run over that single record in that single shard in that 35TB 3B record corpus. I also rely heavily on crawling, but I save the output of each crawl as a WARC file that can be processed in a similar way, without needing to go back out to the Internet. The upshot is that within about 10 seconds I can stop on a breakpoint in any piece of the code, on any input data, and view the complete state of the program.
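
For anyone curious what that function shape looks like, here's a toy sketch (in Python, since that's most of this thread; the logic is made up): every intermediate result gets its own name, so a single breakpoint on the return line exposes the whole computation.

  def score_text(text: str) -> dict:
      # Each step is a named value; no side effects.
      tokens = text.lower().split()
      filtered = [t for t in tokens if t.isalpha()]
      counts = {t: filtered.count(t) for t in set(filtered)}
      weighted = {t: c / len(filtered) for t, c in counts.items()}
      return weighted  # breakpoint here: tokens, filtered, counts, weighted are all inspectable

  print(score_text("the quick brown fox jumps over the lazy fox"))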


One thing that I would love in python is some sort of snapshotting,

Like “this is my global state” - the entire state of the python process, all objects, etc.

Then I change a variable and that creates a new state. We have this now.

What I want is a rewind button, so I can go back to the previous state (quickly preferably) like a time travelling debugger.

Someone build this, I’ll pay for it (and I’m sure others would too)

I know there’d be some problems with external changes (API calls altering external system, writing to files, etc) that a time travelling debugger might not be able to reverse, but even so, I would live with this limitation happily if such a tool existed


> snapshotting ... entire state of the python process, all objects, etc.

Someone else in this thread mentioned Smalltalk: this is exactly what a Smalltalk image gives you.

The image is everything: all your objects, code, temporary state, everything. And at any time you can do "save" or even "save as" to persist a copy.

And there are variants that can work with Python code as well.

http://squeak.org/


It's a shame that Smalltalk never gained traction as a system language, and the C/Unix approach took over enterprise-grade development.

I wonder what software engineering would look like if all the decades-long advancement of tools for stabilization and deployment of code had been researched on platforms that combined system and application code, like Smalltalk did. Instead, we treat both layers as isolated from each other. A lot of the work on virtualization and containers seems to be a workaround for not having this integrated approach as part of the platform design.


It could still work really well as the systems environment and language for a personal computing device. One needs to build the proper abstractions and DSLs on top of it, of course, but it would be a much more conducive environment to "literate computing," if that's the goal (and I think it should be).


I think a lot of people reading this thread will agree :-)

It is definitely an approach more apt to personal computing than to industrial engineering, at least with the level of maturity of the tools available today. The increased ease of use of these environments may be a nice-to-have for professional developers, but they are essential for non-developers.


A basic way of getting that on Unix systems is forking - you get a copy of the process with the same memory state, and likely relatively small interfaces would suffice to manage which process is currently receiving input from e.g. a REPL. (Sockets, open files, ... are more complex, but I guess many REPL workflows could do without that). Sadly you can't easily save much memory for the snapshots from copy-on-write (which otherwise is a neat trick with forking to keep old state) due to the way Python manages memory, at least not without modifying the interpreter.
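
A minimal, Unix-only sketch of that fork trick (the exit-to-rewind convention here is just for illustration):

  import os
  import sys

  def snapshot():
      """Fork: the child keeps working, the parent blocks and holds the saved state."""
      pid = os.fork()
      if pid == 0:
          return False          # child: carry on with the current state
      os.waitpid(pid, 0)        # parent: wait until the child is done
      return True               # parent resumes with the pre-snapshot memory

  data = {"expensive": list(range(10_000_000))}  # pretend this took an hour to build

  if not snapshot():
      data["expensive"] = None              # experiment freely in the child
      print("child: clobbered the data")
      sys.exit(0)                           # "rewind" by exiting the child
  print("parent: data intact, len =", len(data["expensive"]))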


If anyone wants to walk through these ideas more slowly, I write about them in http://ballingt.com/interactive-interpreter-undo and implement this in https://github.com/thomasballinger/rlundo but, as you say, I eventually decided I had to implement an interpreter for which serialization of state was more efficient and all external resources had amenable interfaces: http://dalsegno.ballingt.com/


Will you marry me?

This is really cool, I'm going to dig deeper into the code tonight. Awesome work!


hooking into readline is a clever idea!


Interesting: COW for whole process images and a tree-like interface to jump between them. Could work well if the process image itself had structured interfaces for everything important. Otherwise you'd end up with black-box images everywhere.


If you want snapshotting, take a look at http://datalore.io/ It stores snapshots between cells, and can incrementally recalculate only what's needed.


I believe IPython cells keep their state throughout execution of the notebook.


Just start IPython? The IPython shell is the original Jupyter; the IPython notebook was split off at some point.

But what you describe is normal behaviour in any REPL. The main advantages notebooks have over shells are better session handling and richer data structures. Images, Markdown and the like are a bit hard to display in a regular terminal. IPython has the Qt frontend for this, but development then moved to notebooks.


The IPython console is great. It’s the Notebook (not console) that’s terrible. The two things are not very similar to each other.

I’ve been saying it for years, even before the great meme presentation this year: notebooks are not the appropriate medium for 99% of rapid prototyping or shared experiment work and notebooks make for terrible units of reproducibility (especially compared to just traditionally version controlled source code with a Makefile or with a Docker container etc).

Pedagogy is probably the only serious use case of a notebook. If you’re not working on a 100% pure pedagogy task, it’s a giant mistake to be doing what you’re doing in a Jupyter notebook.


Some tactics I've picked up:

- I work mostly from a clean, normal Python codebase in my normal IDE

- I'll have a file like "common.py" or something

- I keep a Jupyter notebook running always, and I'll do a (from importlib import reload) reload(common) to update any code in my live environment (see the sketch after this list)

I also:

- Cache a lot to disk (np.save/np.load).

- For every long-running method, have a tiny subset to unit test it
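
A sketch of what one of those notebook cells ends up looking like (module and file names here are hypothetical):

  import importlib
  import numpy as np
  import common                        # ordinary module, edited in the IDE

  features = np.load("features.npy")   # heavy data, cached earlier with np.save

  importlib.reload(common)             # pick up the latest edits to common.py
  result = common.transform(features)  # re-run only the cheap part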


Less powerful but more automatic: I added a reload-modules key (F6) to bpython for this, and an auto-reload mode (F5) which reruns the REPL session each time an imported module changes.


I've found Atom's Hydrogen[1] to be pretty neat for my purposes, currently data analysis and plotting with Python. Separate code into cells with #%%, ctrl+enter to run. No need to reload data from files. [1] https://atom.io/packages/hydrogen


> imagine running your code in a debugger, setting a breakpoint, when it hits that breakpoint, you see "ah here's the problem", you fix the code, and have it re-execute all the while not having stopped your program at all. No re-compiling, and then having to re-execute from the beginning.

Java in Eclipse works exactly like this.


Run/debug code via PyCharm/VSCode. Cache long computations on disk via a suitable pickle, x = cache(time_consuming_fn, 'path/to/cached/data'). When stale results are suspected, 'restart' the environment by deleting the cache directory.
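
A minimal sketch of such a cache() helper (the name and behaviour here are assumptions, not any particular library's API):

  import os
  import pickle

  def cache(fn, path):
      """Run fn() once and pickle the result; later calls just load the pickle."""
      if os.path.exists(path):
          with open(path, "rb") as f:
              return pickle.load(f)
      result = fn()
      dirname = os.path.dirname(path)
      if dirname:
          os.makedirs(dirname, exist_ok=True)
      with open(path, "wb") as f:
          pickle.dump(result, f)
      return result

  # x = cache(time_consuming_fn, 'path/to/cached/data')
  # Deleting the cache directory forces a fresh run.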


Matlab has a somewhat different focus, but essentially offers the experience you describe.

(just want to say: this concept has been around for ages and Jupyter could adapt some of Matlab's GUI features for sure)


I know it's not a new concept, but it's just not the way traditional software is developed. Traditional software is usually developed in complete antithesis to the way a REPL environment works.

I guess that's the main reason why so many software developers don't "get" why people want to write code in a browser. Even normal Python developers.


I'm a REPL-first obstinate Lisper and a recent fan of Observable Notebooks, and I still don't like writing code in the browser. The browser is an excellent rendering engine, but absolutely sucks at typing & developer conveniences. If someone could make a cross between Emacs and Chrome (preferably with elisp instead of JS), I'd be in heaven.


It is called Visual Studio Code. Switched from emacs and haven’t looked back so far.


I don't "get" why people want to write code in a browser NOT because I find REPL useless but because I find browser a poor choice of program to input code. I use REPL all the time, but I use emacs to do this. My firefox isn't customized to type code, it's customized to type HN comments and watch cat videos. Why use firefox instead of emacs to type code?


In my view, Jupyter Notebooks are web-based because they are designed with sharing in mind. It's far from perfect (git does not play nicely with .ipynb files :(), but here are two specific advantages that come to mind:

- You can run a Jupyter Notebook locally or on a remote server, seamlessly working in a graphics- and Markdown-capable REPL in either case

- You can easily share and export the final notebook in its post-run state, presenting Markdown, LaTeX formulas, code, comments, plots, and results in the order they were written and executed, in a single (HTML or PDF) document (or even a slide deck, if you dare)

To your last point about limiting use of the browser to writing short comments and watching videos: I'd say that's a matter of personal taste. Browsers are perfectly capable of quickly handling the volume of code and text that would appear in a reasonable notebook.

That said, I grew up with vim keybindings, and while Jupyter supports the basics (modal editing, j and k to move between cells, dd to delete a cell, etc.) and has super useful default keyboard shortcuts, I still move and type faster in vim. But Jupyter adds so much value as a seamless, graphics-enabled, run-anywhere REPL that I happily accept the slight slowdown in my typing and navigation, which is never the bottleneck in my R&D work anyway.

You can replace "firefox" and "emacs" with other things to take that last argument to an extreme. For example: "My [laptop] isn't customized to type code, it's customized to type HN comments and watch cat videos. Why use [a laptop] instead of [a workstation] to type code?" I thought like that 15 years ago, but technology and I changed.


It's not about performance. I have no problem with Jupyter notebooks. I use ein-mode in emacs, which is basically a better frontend for Jupyter notebooks AND you get all the benefits of emacs, since cells are individual emacs buffers. It's also not about browsers. Writing code in a medium that's not designed for writing code encourages bad behavior. Jupyter notebook is supposed to be optimized for Julia, Python and R, but somehow writing Python code is next to impossible because it doesn't understand what to do with indentation. There is almost no word suggestion; when I type a few characters and hit tab I almost always get horrible suggestions and have to navigate with arrow keys. I cannot use C-q to comment stuff out/in intelligently (comment-dwim), I cannot use regex, I cannot fuzzy search, I cannot use multiple cursors, I cannot use replace-text, I cannot highlight SQL embedded in Python. I can go on for days. And this is not a problem of Jupyter notebook. Why reinvent the wheel? We already have text editors and IDEs that solve the problem of inputting code, why would you ever want to type code in a web input box? That is, unless someone spends more money and makes a full-fledged IDE in local JavaScript. So much wasted effort for a problem that was solved in the 1960s.


> How do you get that same experience without using Jupyter?

You use an image-based language (i.e. Lisp, Smalltalk)


One method I use fairly frequently is to combine ipdb and importlib. You can set a breakpoint in one module, after some time-intensive task, and pause there in your console. Write some code in another module - your working module - and then import it from the ipdb prompt. You can then make some changes to your working module and use importlib.reload(working_module). It will reload that module practically instantaneously and execute any top-level code.
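
Roughly, the loop looks like this (the dataset loader and working_module are hypothetical stand-ins):

  import importlib
  import ipdb

  def load_huge_dataset():
      # stand-in for the expensive query / optimization step
      return list(range(10_000_000))

  data = load_huge_dataset()
  ipdb.set_trace()   # pause here; `data` stays alive in this console

  # At the ipdb prompt (working_module is whatever file you're editing):
  #   import working_module
  #   working_module.analyze(data)
  #   ... edit working_module.py in your editor ...
  #   importlib.reload(working_module)
  #   working_module.analyze(data)   # new code, same in-memory data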


Here is that exact workflow in modern C++

https://github.com/LiveAsynchronousVisualizedArchitecture/la...

All the IO between nodes is serialized, so an output that is already in shared memory (to be examined or visualized in other programs) can also be written to disk.

It can then be loaded as a 'constant' node and iterations can be done with hot reloading on the baked output.

The key is that you need both live/hot/automatic reloading of execution AND snapshots/replay/rewind/baking.

If you only have one or the other, you can't isolate a part of a program and iterate on it nearly as quickly.


I use emacs python-mode for this. It's more or less a wrapper around the REPL you'd get by typing "python" at a command line, but with some integration to support sending updated function definitions, tab completion, etc.


I get a very similar experience using IPython from a terminal, sometimes copying code over from my main text editor (where I have linting, multiple cursors and other niceties). As I do a lot of plotting I use itermplot[1] to display inline graphics. When I'm done I copy everything using %history XX to a permanent home in libraries or standalone scripts.

[1] https://github.com/daleroberts/itermplot


But you can do that in Java? The Java debugger (and IDEs like NetBeans) can hot-swap a class in the middle of execution without restarting the application.


The Spyder3 IDE might be worth looking at. It won’t be as nifty with graphs and charts, but it’s great for iterative data wrangling and debugging.


I’ve spent a fair amount of time architecting my software so that I can set a flag and, instead of processing a full typical run lasting several hours of compute, it processes a subset in minutes or seconds. It’s not as clean as a Jupyter notebook, but it’s much more repeatable than working in a notebook.

I have a hard time maintaining discipline in a notebook and therefore don’t find them very useful


You could try the Spyder IDE, which has some similarities to Jupyter.

It runs code in IPython, but you write in a regular Python file rather than a notebook. I've used it in exactly the manner you describe, doing a load to memory once, then running and rerunning analyses (or parts of them) in the same Ipython console.


one specific solution: If your code downloads web content (e.g. a scraper), and you want to avoid having to re-download when you restart it, the requests_cache module [1] works really well.

Also echoing thaw13579's comment; I used GNU Make in one project [2] for this and it worked OK.

[1] https://github.com/reclosedev/requests-cache [2] https://github.com/kerrickstaley/tatoeba_rank/blob/master/Ma...
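
For reference, the whole requests_cache setup is about two lines (assuming the default SQLite backend; if I remember the API right, responses carry a from_cache flag):

  import requests
  import requests_cache

  requests_cache.install_cache("scrape_cache")   # creates scrape_cache.sqlite

  r = requests.get("https://example.com/")       # hits the network once
  r = requests.get("https://example.com/")       # served from the local cache
  print(r.from_cache)                            # True on the cached call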


The REPL is a really different way to develop software. Jupyter grew out of IPython, a really great REPL. I prefer it to Jupyter. I always start a console connected to my notebook by executing the magic %qtconsole.


For C++ there is ROOT/Cling, which is what all the experiments at CERN use in one way or another. They offer a Jupyter plugin as well, but it can be used in the console independently.


That's only because you store primitives in memory in jupyter; I find I write terrible code in Jupyter precisely because of that.


You can use Spyder, and execute chunks of code in a file by delineating them with #%%. This gives you something like cells, but in a single Python file with no JSON format.
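
For example, a single .py file like this (file and column names are made up); running a cell keeps df alive in the IPython kernel, so only the cell you're editing needs to be re-run:

  #%% Load the data once (slow; run this cell a single time)
  import pandas as pd
  df = pd.read_csv("big_query_dump.csv")

  #%% Iterate on the analysis (re-run only this cell as you edit it)
  summary = df.groupby("category").size()
  print(summary.head())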


Look to Smalltalk for lessons on this.


I'm using a kind of middle way compared to what's described in the article. I'll hack away in a notebook because the problems that arise early in coding are problems that can be fixed quickly, and the time spent waiting for the code, vs. the time spent fixing the problem, is sometimes magnitudes higher.

But at some point I'm getting to a stage where I realize that I'm starting to write architecture, not algorithms. And at that point the only thing I'll do with my notebooks is glance at them to look at particular solutions, and maybe run them once or twice for sanity checks (or to just be happy that I solved the problem :D).

I then try to flip a mental switch. There are fundamental differences between writing Python scripts for scientific or any other purposes and writing Python architecture. Which is why I've moved away from copy-pasting code from notebooks to codebases. (I've had to be stubborn with myself for a year or so to actually ingrain this.) I treat my own notebooks as if they were foreign solutions. Good for knowing how to tackle the problem, but I'll keep them the hell away from my clipboard.

At that point, it shouldn't matter if I have to re-run everything again for tests, because the stability and reliability of my code is now the highest priority. Also, if your development progress is anything like mine, the actual problems you fix a few weeks into a project aren't problems where you try a solution every few minutes, but rather once or twice a day. So the extra time spent waiting doesn't really bother me.

TL;DR: Hack away in Jupyter notebooks for exploration, then lock it all away once you know how to tackle the problem and work 'traditionally'.


REPL for exploring is good. But for actual development I prefer some simple TDD.



