How can one get the same type of developer experience in Python without using Jupyter notebooks? For people who do traditional development in, let's say, Java or C++: imagine running your code in a debugger, setting a breakpoint, and when it hits that breakpoint you see "ah, here's the problem", you fix the code and have it re-execute, all without having stopped your program at all. No recompiling, no re-executing from the beginning. Once you've done things the Jupyter way, how can you possibly want to go back to writing code in the traditional sense? It's just too slow a process.
How do you get that same experience without using Jupyter? I tried the VSCode plugin that tries to make things the same as a Jupyter notebook, but it's nowhere near as smooth an experience and feels clunky.
I don't think serialization is a problem either. It is possible to make data structures that always exist as a single span of memory, so that they don't have a separate serialized representation. Most data structures are better served like this anyway, since prefetching and linear memory access are a huge speedup over pointer chasing. Building a data structure out of many separate heap allocations will also slow it down.
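To make that concrete with a sketch of my own (not the parent's code): in Python, a NumPy structured array lives in one contiguous buffer, so its bytes effectively are its serialized form.

```python
# Sketch: a "single span of memory" record table; serialization is a memcpy.
import numpy as np

record = np.dtype([("id", np.int64), ("score", np.float64)])
table = np.zeros(1000, dtype=record)         # one contiguous heap allocation
table["id"] = np.arange(1000)

raw = table.tobytes()                        # no separate serialized format
restored = np.frombuffer(raw, dtype=record)  # zero-copy view over the bytes
assert restored["id"][42] == 42
```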
I'll often write the code so that each function is a bunch of value definitions in a pipeline, followed by a trivial return statement. No side effects if I can help it. Then I run it in the IntelliJ debugger and set a breakpoint on each return statement. When it hits the breakpoint, I instantly know the value of each subexpression, and can inspect them in case anything is amiss. If I need to rewrite something, I'll write the new code in a watch and immediately evaluate it. I basically use watches in the debugger as a REPL, except that I have the ability to factor out subexpressions and independently check the correctness of them, plus IntelliJ syntax-checks and autocompletes my code within the watch. Also, the sequence in which breakpoints are hit gives me a trace of the code.
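A toy Python rendering of that style (my own sketch; the commenter works in Java/IntelliJ): every subexpression gets a name, so a breakpoint on the trivial return exposes the whole pipeline at once.

```python
def summarize(samples: list) -> dict:
    # Each step is a named value; no side effects.
    cleaned = [s for s in samples if s == s]                 # drop NaNs
    count = len(cleaned)
    mean = sum(cleaned) / count if count else 0.0
    spread = (max(cleaned) - min(cleaned)) if cleaned else 0.0
    # Break here: cleaned, count, mean, spread are all inspectable at once.
    return {"count": count, "mean": mean, "spread": spread}
```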
The one downside is that I still have to recompile and re-run once I've fixed the bug and identified the correct code, but I mitigate that by architecting the app so that I can easily run it on a subset of data. So for example, my initial seed data set consists of a Hadoop job over a corpus of 35,000 1G files, but I wrote a little shim that can run the mapper & reducer locally, without starting a Hadoop cluster, over a single shard. And then I've got a command-line arg in that shim that lets me run it on just a single record within that shard, skipping over the rest. And I've got other flags that let me select which specific algorithm to run over that single record in that single shard in that 35TB 3B record corpus. I also rely heavily on crawling, but I save the output of each crawl as a WARC file that can be processed in a similar way, without needing to go back out to the Internet. The upshot is that within about 10 seconds I can stop on a breakpoint in any piece of the code, on any input data, and view the complete state of the program.
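A rough sketch of what such a shim could look like in Python (every name here is invented; the commenter's actual shim is for Java/Hadoop):

```python
# Hypothetical local shim: run the job on one shard, optionally one record,
# optionally one algorithm. No cluster involved.
import argparse

def read_shard(path):
    """Stub standing in for real shard parsing."""
    yield from ({"i": i} for i in range(3))

def run_algorithms(record, only="all"):
    """Stub standing in for the real algorithm dispatch."""
    print(only, record)

parser = argparse.ArgumentParser(description="Run mapper/reducer locally.")
parser.add_argument("--shard", required=True, help="path to a single shard")
parser.add_argument("--record", type=int, help="only this record index")
parser.add_argument("--algorithm", default="all", help="only this algorithm")
args = parser.parse_args()

for i, rec in enumerate(read_shard(args.shard)):
    if args.record is not None and i != args.record:
        continue  # skip everything except the record under study
    run_algorithms(rec, only=args.algorithm)
```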
Like “this is my global state” - the entire state of the python process, all objects, etc.
Then I change a variable and that creates a new state. We have this now.
What I want is a rewind button, so I can go back to the previous state (quickly preferably) like a time travelling debugger.
Someone build this, I’ll pay for it (and I’m sure others would too)
I know there’d be some problems with external changes (API calls altering external systems, writing to files, etc.) that a time travelling debugger might not be able to reverse, but even so, I would live with this limitation happily if such a tool existed
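Until someone builds it, a crude approximation can be hand-rolled: snapshot the state you care about before each mutation, and pop snapshots to go back. A toy sketch of mine, with the same caveat that external effects are not reversed:

```python
import copy

history = []

def checkpoint(state):
    """Save a deep copy of the current state."""
    history.append(copy.deepcopy(state))

def rewind():
    """Return to the most recent snapshot."""
    return history.pop()

state = {"x": 1, "items": [1, 2]}
checkpoint(state)
state["x"] = 99          # a change creates a new state...
state = rewind()         # ...and the rewind button undoes it
assert state == {"x": 1, "items": [1, 2]}
```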
Someone else in this thread mentioned Smalltalk: this is exactly what a Smalltalk image gives you.
The image is everything, all your objects, code, temporary, everything. And at any time you can do "save" or even "save as" to persist a copy.
And there are variants that can work with Python code as well.
I wonder what software engineering would look like if all the decades-long advancement of tools for stabilization and deployment of code had been researched over platforms that combined system and application code, like Smalltalk did. Instead, we consider both layers as isolated from each other. A lot of work over virtualization and containers seem to be used as a workaround for not having this integrated approach as part of the platform design.
It is definitely an approach more apt to personal computing than to industrial engineering, at least with the level of maturity of the tools available today. The increased ease of use of these environments may be a nice-to-have for professional developers, but they are essential for non-developers.
this is really cool, I'm going to dig deeper into the code tonight. Awesome work
But what you describe is normal behaviour in any REPL. The main advantages notebooks have over shells are better session handling and richer data structures. Images, Markdown and the like are a bit hard to display in a regular terminal. IPython has the Qt frontend for this, but then moved to notebooks
I’ve been saying it for years, even before the great meme presentation this year: notebooks are not the appropriate medium for 99% of rapid prototyping or shared experiment work, and notebooks make for terrible units of reproducibility (especially compared to traditionally version-controlled source code with a Makefile, a Docker container, etc.).
Pedagogy is probably the only serious use case of a notebook. If you’re not working on a 100% pure pedagogy task, it’s a giant mistake to be doing what you’re doing in a Jupyter notebook.
- I work mostly from a clean, normal Python codebase in my normal IDE
- I'll have a file like "common.py" or something
- I keep a Jupyter notebook running at all times, and I'll do `from importlib import reload; reload(common)` to update any code in my live environment (see the sketch after this list)
- Cache a lot to disk (np.save/np.load).
- For every long-running method, keep a tiny subset of data to unit-test it on
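Put together, the loop looks roughly like this (the module name `common` comes from the list above; the function names are illustrative):

```python
from importlib import reload
import numpy as np

import common                        # the normal, version-controlled code

features = common.build_features()   # hypothetical long-running step
np.save("features.npy", features)    # cache it to disk once

# ... edit common.py in the IDE, then, back in the live kernel:
reload(common)                       # pick up new code without restarting
features = np.load("features.npy")   # skip the expensive recompute
common.analyze(features)             # hypothetical fast step to iterate on
```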
Java in Eclipse works exactly like this.
(just want to say: this concept has been around for ages, and Jupyter could certainly adapt some of Matlab's GUI features)
I guess that's the main reason why so many software developers don't "get" why people want to write code in a browser. Even normal Python developers.
- You can run a Jupyter Notebook locally or on a remote server, seamlessly working in a graphics- and Markdown-capable REPL in either case
- You can easily share and export the final notebook in its post-run state, presenting Markdown, LaTeX formulas, code, comments, plots, and results in the order they were written and executed, in a single (HTML or PDF) document (or even a slide deck, if you dare)
To your last point about limiting use of the browser to writing short comments and watching videos: I'd say that's a matter of personal taste. Browsers are perfectly capable of quickly handling the volume of code and text that would appear in a reasonable notebook.
That said, I grew up with vim keybindings, and while Jupyter supports the basics (modal editing, j and k to move between cells, dd to delete a cell, etc.) and has super useful default keyboard shortcuts, I still move and type faster in vim. But Jupyter adds so much value as a seamless, graphics-enabled, run-anywhere REPL that I happily accept the slight slowdown in my typing and navigation, which is never the bottleneck in my R&D work anyway.
You can replace "firefox" and "emacs" with other things to take that last argument to an extreme. For example: "My [laptop] isn't customized to type code, it's customized to type HN comments and watch cat videos. Why use [a laptop] instead of [a workstation] to type code?" I thought like that 15 years ago, but technology and I changed.
You use an image-based language (e.g. Lisp, Smalltalk)
All the IO between nodes is serialized, so an output that is already in shared memory (to be examined or visualized in other programs) can also be written to disk.
It can then be loaded as a 'constant' node and iterations can be done with hot reloading on the baked output.
The key is that you need both live/hot/automatic reloading of execution AND snapshots/replay/rewind/baking.
If you only have one or the other, you can't isolate a part of a program and iterate on it nearly as quickly.
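A minimal Python sketch of the baking half (the node/graph framing is the parent comment's; the code and names are mine):

```python
import pickle

def expensive_node(xs):
    """Stands in for a slow upstream stage of the graph."""
    return [x * x for x in xs]

# Bake: run the slow node once and persist its output.
with open("node_output.pkl", "wb") as f:
    pickle.dump(expensive_node(range(100000)), f)

# Iterate: load the baked output as a 'constant' node; only the code
# downstream of it needs hot reloading from here on.
with open("node_output.pkl", "rb") as f:
    constant = pickle.load(f)
```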
I have a hard time maintaining discipline in a notebook and therefore don’t find them very useful
It runs code in IPython, but you write in a regular Python file rather than a notebook. I've used it in exactly the manner you describe: doing a load into memory once, then running and rerunning analyses (or parts of them) in the same IPython console.
Also echoing thaw13579's comment; I used GNU Make in one project for this and it worked OK.
But at some point I'm getting to a stage where I realize that I'm starting to write architecture, not algorithms. And at that point the only thing I'll do with my notebooks is glance at them to look at particular solutions, and maybe run them once or twice for sanity checks (or to just be happy that I solved the problem :D).
I then try to flip a mental switch. There are fundamental differences between writing Python scripts for scientific (or any other) purposes and writing Python architecture. Which is why I've moved away from copy-pasting code from notebooks into codebases. (It took me a year or so of being stubborn with myself to actually ingrain this.) I treat my own notebooks as if they were foreign solutions: good for knowing how to tackle the problem, but I'll keep them the hell away from my clipboard.
At that point, it shouldn't matter if I have to re-run everything again for tests, because the stability and reliability of my code is now the highest priority. Also, if your development progress is anything like mine, the actual problems you fix a few weeks into a project aren't problems where you try a solution every few minutes, but rather once or twice a day. So the extra time spent waiting doesn't really bother me.
TL;DR: Hack away in Jupyter notebooks for exploration, then lock it all away once you know how to tackle the problem and work 'traditionally'.
Of course I still use notebooks for experimenting (usually it is open beside my IDE), I just think that developing a whole software architecture inside notebooks is not something notebooks are for.
Is there value in moving away from notebooks altogether while you are still learning?
Do these comments apply to platforms such as MyBinder or Jupyterhub?
Finally (other than Grus, who is excellent), are there any resources, people, or websites you would recommend for maintaining a sterile workflow or for general knowledge?
- The mutable state with global variables from all cells in scope drives me crazy. I just want to run an ephemeral script repeatedly so I can be sure what the state is during execution.
- The process of starting the server and moving code from version-controlled .py files into notebook cells to become part of Untitled(25).ipynb, which can’t be sanely version controlled, drives me crazy.
- Not being able to use my normal text editor drives me crazy.
Instead of building up lines of well-tested python functions in a disciplined line-by-line fashion, periodically committed to git with readable diffs, I end up with a chaotic mess of code fragments in various cells with no confidence regarding what will happen when some subset of the cells are executed, and none of it in git let alone with a sane commit history.
I’ve tried so many times over the last 10 years, and I feel bad because it’s such an amazing project, but I really dislike the experience of using Jupyter instead of standard python development tools.
The reason I do it is for graphics. This isn't a Jupyter gripe other than the diverted attention, but why the fuck can't I just import matplotlib in a normal Python file? (Under macOS it throws something about "frameworks" that no one cares to understand. I think there's some incantation that makes it work, but seriously, this is ridiculous.) And maybe draw graphics to a GUI widget from the (i)python shell, like R does.
(No need to reply "Because you haven't written it"! This is deliberately a rant; I contribute to open source projects.)
See this SO answer: https://stackoverflow.com/a/34583288/2476920
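If I remember the usual workaround correctly (an assumption on my part; the linked answer has the details), the incantation is to select a backend before pyplot is imported:

```python
import matplotlib
matplotlib.use("TkAgg")   # or "Agg" if writing files is enough; this avoids
                          # the macOS "framework" complaint on non-framework builds
import matplotlib.pyplot as plt

plt.plot([0, 1, 2], [0, 1, 4])
plt.savefig("out.png")    # works even with the windowless "Agg" backend
```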
Personally, I love notebook cells for being able to code without re-running everything. Especially in deep learning, training a model takes a long time. Jupyter is very good for creating and debugging code that:
A) needs a trained model loaded for it to work, but where you want to skip the save/load step, or
B) saves and then loads a model.
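An illustrative two-cell version of case A, with scikit-learn standing in for "a model that is slow to produce":

```python
# --- cell 1: run once per kernel session ---
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])

# --- cell 2: re-run freely while debugging the downstream code ---
predictions = model.predict([[3.0], [4.0]])
print(predictions)  # roughly [3. 4.]
```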
If the "mutable state with global variables" drives you crazy, you may want to avoid reusing the same variable names from one cell to another, and reset the notebook more often. Also, avoid side effects (such as writing/loading from disks) and try to have what's called pure functions (e.g.: avoid singletons or service locators, pass references instead). If your code is clean and does not do too much side effects, you should be able to work fine in notebooks without having headaches.
But in general I wonder whether this is what I'm looking for: https://github.com/daleroberts/itermplot
So, fundamentally, Jupyter is not a document-based application like the ones most of us software developers grew up with. You have to think in terms of messaging, where the cells are just the input/output terminals. The rendering of cells is determined by the runtime container that parses the .ipynb (JSON) string.
Here is a contrived example to explain it:
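A simplified sketch of a single code cell as stored in the .ipynb file: one input message plus its recorded output messages (details trimmed).

```python
cell = {
    "cell_type": "code",
    "execution_count": 1,
    "metadata": {},
    "source": ["print('hello')"],
    "outputs": [
        {"output_type": "stream", "name": "stdout", "text": ["hello\n"]}
    ],
}
```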
There is a lot going on, and how much depends on the container.
Input data, parameters, and configurations are considered external to the code object. It's the same with other activities like debugging, exploratory programming, and building tests, which are considered peripheral to the "true coding" activity of altering the source code that will run in the production environment.
The single successful exception to that model is the spreadsheet, where data and code are indissolubly merged in the same document; and, despite its success as a development tool for non-programmers, professional developers usually despise it.
Could it be that Jupyter and notebooks are making developers comfortable with a new approach, where the distinction between "coding" and "running the code" is blurred, and where partial execution and mixing code with data are the norm?
Coders are accustomed to using the REPL for this approach, but in the end only the code that is "committed" by copying it to the source file is considered the "final" version. Yet, with notebooks, code that has been written within an interactive session can be stored as the official definition for a part of a program, without having to transfer it to a separate object.
I was pointing out that "coding" is traditionally understood to refer only to the sequences of instructions in source and object code that remains unmodified during runtime execution, but there are other views that extend the definition to all the other objects involved in running the software in its environment.
In this view, coding is about defining automated behaviors that can execute on their own without a human leading by hand every step. I'm thinking in particular about End-User Development approaches, which allow non-developers to create such automations without ever learning the syntax of a formal language.
Under this approach, the abstraction level may be lower, since the actual instances of data used are as important to understanding the system as the code that processes them.
The Unix Makefile is one of the early examples, as are HTML/CSS, etc.
A more recent example is the dataflow graph in TensorFlow.
A new idea just occurred to me during a technical lunch meeting with a bunch of friends today: since a Jupyter notebook itself is nothing but an object graph, it might be possible to use a "computed" .ipynb for this kind of "coding", i.e., describing desired state.
That would be TDD in another sense.
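A speculative sketch of what "computing" a notebook could look like, using nbformat (the reference library for the format); the desired-state idea is the parent comment's, the code is mine:

```python
import nbformat
from nbformat.v4 import new_notebook, new_code_cell, new_markdown_cell

# Build the notebook's object graph programmatically instead of typing it.
nb = new_notebook(cells=[
    new_markdown_cell("## Desired state"),
    new_code_cell("print('hello')"),
])
nbformat.write(nb, "computed.ipynb")   # an .ipynb produced by code, not typing
```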
In this context, a precise textual description helps to exactly define what you want to achieve, but a graphical view helps to inspect how many different cases there are in your data, and how the available contexts are connected.
Representing state machines with graphs is a relatively common thing in rapid prototyping apps, so I've been thinking of building a prototyping app that would also allow embedding real code and data in them (similar to visual GUI builders, but allowing much more code to be defined visually). There are several tools attempting to achieve that, but all fail at the abstraction problem of defining new instances of your objects (i.e. you can't create new instances of arbitrary visually defined objects, only of things in tables).
Declarative, constraint-driven problem solving was pioneered by
and is also used in places like Apple’s
It is interesting that you use "textual description" for the byte-stream-based artifact that we call code, which is runtime-independent.
On the other hand, the term "graph" in your second paragraph has a double meaning: a visual representation for the human eye and brain to parse, and a mathematical graph that can be serialized in JSON, XML, or whatever byte stream.
I think many people are excited about the GUI revolution pioneered by Doug Engelbart, Ivan Sutherland and Alan Kay for different reasons. Some people are excited about the implementations: bitmap displays, the mouse, etc.; some people are excited about the principles and abstractions: MVC, constraints, data flow, messaging, etc.
With the Jupyter notebook system, which is based on three state machines (a persistent kernel process, a DOM session in the browser, and a message queue), we might for the first time have a nice abstraction to capture a graph in space and time.
That is at least why I am getting excited. The other features are just bells and whistles.
The next-generation development environment will be a combination of:
* a wiki (outlining content + version control),
* a spreadsheet (functional reactive coding + lightweight schema definition + easy working on collections), and
* a mockup wireframe builder (visual layout + visual state machines + rapid prototyping).
Pluggable connectors to constraint solvers and machine learning would be a plus.
I'm not sure that Jupyter fully captures all parts of the computation graph; most of all, restoring a previous state after closing and reopening it is cumbersome (if I'm not mistaken you need to run each individual cell by hand in the right order). There are theoretical models that could fix that, though, like those allowing for time travel debugging. And the visuals are simple and flexible enough to support many use cases. So yes, web notebooks are a good basis for a knowledge-building programmable platform.
For example, you can think of rendered notebooks as caching. I have even played with the concept of using a rendered notebook as the input/data source of another notebook. This looks like a radical idea, but it is exactly how https://nbviewer.jupyter.org/ works.
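For instance (a sketch of mine; nbformat is the reference library for the format), the cached outputs can be read straight out of the upstream .ipynb:

```python
import nbformat

nb = nbformat.read("upstream.ipynb", as_version=4)   # hypothetical file
for cell in nb.cells:
    # Markdown cells have no outputs; .get handles that gracefully.
    for out in cell.get("outputs", []):
        if out["output_type"] == "stream":
            print(out["text"])   # reuse the rendered output downstream
```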
Isn't that essentially what a spreadsheet does? Each table contains the results of a previous computation, intertwined with input and configuration data, and you just keep chaining more and more programs through a single unifying metaphor (the cell coordinate system plus function expressions). This model also gets you reactive recomputation for free.
It's a powerful system, but quite old; I'd say the only radical thing is convincing classic developers that there's a better model than their beloved imperative, one-step-at-a-time idea of what running a program means.
In Jupyter Notebook, the cells are nothing but messages. And the saved notebook files are a JSON-serialized byte stream of the messages and metadata. It doesn't keep the states, just data: a digital artifact.
Plus, spreadsheets may also be distributed.
I don't care much about the visual layout of Jupyter notebook, so in this sense, it is not much different than spreadsheets.
However, when I say Jupyter cells are nothing but messages to/from the kernel, that is profoundly different from spreadsheets, where cells can be messages as you described. The same goes for distributed runtime state. Spreadsheets may also be distributed, but the Jupyter notebook runtime is always distributed. You might have a half-system like nbviewer that doesn't have a kernel running, but that is not a full Jupyter system.
(I am distinguishing Jupyter runtime from the JSON byte stream artifact that we call a notebook file)
Yes, these are different computation models, although formally it is possible to transform one into the other mathematically, and vice versa (in the same way that you can transform any computation model into a Turing machine or lambda calculus).
This has no practical consequences, but it means that you can translate insights gained from one model to the other.
1. It is a good observation that the Jupyter runtime follows an agent-based computation model instead of the MVC of Smalltalk or functional reactive spreadsheets. Theoretically they are equivalent in the Turing sense. But this model opens huge opportunities, allowing it to mimic how the human brain and human societies work. It is all about messages, context containers and just-in-time computation.
2. This paradigm change also affects how we grow (instead of build) software architecture, which is the topic of the root of this HN submission. In the build/develop paradigm, software engineers focused on requirements, specifications, foundations, frameworks, ..., i.e., the house-building metaphor (https://en.wikipedia.org/wiki/Design_Patterns). But if you are growing a software project, there is no architecture. Then you should use agriculture metaphors: ecosystem, environment, energy cycle, water cycle, etc. It is about seeding, weeding, environmental control, harvesting and sustainability. Pushed a little further, programmers in this paradigm are more like software farmers than data plumbers in a data warehouse or code assemblers in a factory.
It was a nice discussion in a virtual public square. I enjoyed it a lot. I hope you did as well.
Now it is hard to debug programs like this unless you trace all the links and cached assets. On the other hand, many API-centered, cloud-native applications behave like this, and you have to deal with it one way or another. So to some extent, Jupyter is the IDE for the age of the cloud.
The original shell was an evolution of the teletype. Developers would type their code on a typewriter-like machine connected to the CPU over a wire (maybe involving punched tape or cards), and the CPU spouted back the results of the commands executed.
Later, when video screens were adopted, they emulated the same linear approach of the terminal on the new device, because developers tend to be an extremely conservative bunch when it comes to the approach taken by their tools of the trade. It took several decades to exploit the advantages of interactive sessions that the new media allowed, creating a new family of IDEs with live debugging, property inspectors and Intellisense/contextual help.
We are at a similar turning point with respect to classic development tools. The notebook is still used mostly as an improved version of the bash shell "with graphics", but there's a lot of research to do on how to enhance the model with new tools to take advantage of the new medium (including literate programming, web-based collaboration, or instant feedback of command execution on persistent objects).
People are exploring the possibilities of this approach, with Bret Victor possibly being the most influential. I'm sure the personal experiences of thousands of developers with their home-made workflows, involving lots of heterogeneous tools used in creative ways, could be studied to design a more general environment that will change how large software projects are built.
NotebookScripter is much more minimal and doesn’t support generating/updating .ipynb files. It has just one API, allowing one to treat a notebook file as a function that can be called from another Python program.
I’m not entirely sure whether papermill can run the target notebook in the calling process — it looks like it spins up a Jupyter kernel and communicates with it via message passing. NotebookScripter creates an IPython interpreter context in process and execs the notebook code in process.
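Usage looks roughly like this, if I have the entry point right (worth double-checking against the project's README):

```python
from NotebookScripter import run_notebook

# Execute ./analysis.ipynb in-process; the returned object exposes the
# notebook's top-level variables as attributes. The file name and the
# "result" variable are hypothetical.
nb = run_notebook("./analysis.ipynb")
print(nb.result)
```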
barring some barriers (e.g. use caching where appropriate) I've found writing unit tests is a neat clean replacement for notebooks.
> barring some barriers (e.g. use caching where appropriate)
It's not an exact analogue, but PyPy sort of fills that niche. Unfortunately, it's two major versions behind and it's not a 100% drop-in replacement.
Whoa, thank you all for the nice comments, I didn't expect to make such a buzz here, nor today. I'm glad to see the reactions, even the bad ones; they seem aligned with what I thought. Yes, notebooks are very useful for a faster coding cycle, but they easily become unwieldy (I'd love to see better multiline editing and better autocompletion in Jupyter).
Seems like I already posted my article 2 months ago, but renamed the GitHub repo since then, which may explain why someone else (jedwhite) could submit my article again: https://news.ycombinator.com/item?id=18339703
I didn't submit it twice to HN. Well, it's nice to see that in a parallel world my post made the first page of HN! :-)
But could this mean that my HN account is like "shadow banned" or something? Strange to see that all my own submissions on HN haven't got much attention for months. Or maybe it's just the random factor... Well, thanks!
Did not get any insights here except links to books.
What I prefer to do is use Jupyter for all R&D tasks, then `productize` the modules/parts that have proven to work. When wrapping solutions into a product, we go through a typical engineering process: TDD, CI/CD, Docker, code review, etc.
In any case, we end up with a bunch of machine instructions, and there is no point in breaking the best practices of the engineering craft; they have been established for years. No need to reinvent the wheel.
I've been working remotely at a big company for a while now.
I write the requirements for a function, then I send them to my subcontractors on a freelancer website.
While I am waiting, I play with my kids, kiss my wife, and take her out shopping.
I get the functions back, paste them into the codebase... and change a thing or two.
I am writing 3,000 lines of code daily with this approach.
I used to write 200 lines of code on my own.
One key thing: my passion for programming has been killed entirely, because I always have to work according to whatever plans management devises. Management changes direction, randomly kills projects, changes requirements; this makes me feel like a slave who has to dance to satisfy the masters, no matter how many landmines they place on the dance floor.
So now I just create the outline and let cheap labor fill in the blanks.
My value proposition isn't code; it's that I can understand the problems at hand and create plans to tackle them.
The choice of problem isn't in my hands, though the area of the problem definitely is.
I am never going back to the open-plan office and its sebum- and earwax-filled noise-cancelling headphones.
Don't you mean you "collect" ~3000 lines of code per day?
And how much does that even cost you? How do you ensure quality? Do you pay per hour or per result?