
I want a notebook where causality can only flow forward through the cells. I hate notebook time-loops where a variable from a deleted cell can still be in scope.

1. Checkpoint the interpreter state after every cell execution.

2. If I edit a cell, roll back to the previous checkpoint and let execution follow from there.

I can't tell you how many times I've seen accidental persistence of dead state waste hours of people's time.
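
Even a crude version of (1) gets you surprisingly far. A minimal sketch with dill (just an assumption on my part; it won't capture threads, open files, or other live resources, which is exactly the hard part):

    import dill  # not stdlib; assumes `pip install dill`

    def checkpoint(cell_no):
        # snapshot the whole __main__ namespace right after a cell runs
        dill.dump_session(f"checkpoint_{cell_no}.pkl")

    def rollback(cell_no):
        # restore the namespace as it was just after that cell
        dill.load_session(f"checkpoint_{cell_no}.pkl")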




My problem with notebooks is that I feel like the natural mental model for them is a spreadsheet mental model, not a REPL mental model. Under that assumption, changing a calculation in the middle means that all of the cells that depend on that calculation would be updated, but instead you need to go and manually re-run the cells after it that depend on that calculation (or re-run the entire notebook) to see the effect on later things. Keeping track of the internal state of the REPL environment is tricky, and my notebooks have usually just ended up being convenient REPL blocks rather than a useful notebook since that's the workflow it emphasizes.


That's something that I think Observable [1], in my modest usage, seems to do well.

[1] https://observablehq.com/


Yep, the real complaint is “dead state”, not out-of-order execution. Worrying about linear flow per se turns out to be misguided, born of a lack of imagination about (or experience with) a better model: reactive re-rendering of dependent cells. Observable entirely solves the dead state problem, in a much more effective way than merely guaranteeing linear flow would.

* * *

More generally, Observable solves or at least ameliorates every item in the linked article’s list of complaints. (In 2020, any survey about modern notebook environments really should be discussing it.)

I found the article quite superficial. More like “water cooler gripes from notebook users we polled” than fundamental problems with or opportunities for notebooks as a cognitive tool. I think you could have learned more or less the same thing from going to whatever online forum Jupyter users write their complaints at and skimming the discussion for a couple weeks.

I guess this might be the best we can hope for from the results of a questionnaire like this. But it seems crazy having an article about notebook UI which makes no mention of spreadsheets, literate programming, Mathematica, REPLs, Bret Victor’s work, etc.

From the title I was hoping for something more thoughtful and insightful.


You can get a Jupyter extension[1] that allows you to add tags and dependencies and in this way construct the dependency graph as you go along. Of course, you have to do it manually and the interface is a bit clunky, but it does what it says.

In practice I think taking care not to accidentally shadow variables is much more important: this dependency business only makes sense once you have a clear idea of what you need and by that point you are mostly done anyway.

[1] https://jupyter-contrib-nbextensions.readthedocs.io/en/lates...


I don’t understand what you are trying to say in your second paragraph, but I highly recommend you spend a few weeks playing with http://observablehq.com instead of speculating about the differences.

In practice, I find it to be dramatically better than previous notebook environments for data analysis, exploratory programming / computational research, prototyping, data visualization, and writing/reading interactive documents (blog posts, software library documentation, expository papers ...). It has a lower barrier to starting new projects and a lower-friction flow throughout.

I find it better at every stage of my thinking process from blank page up through final code/document, and would recommend it vs. Jupyter or Matlab or Mathematica in every case unless some specific software library is needed which is unavailable in Javascript. The only other tool I really need is pen and paper, though I also use http://desmos.com/calculator and Photoshop a fair bit.


This falls apart when computation is a factor, though. You can't recompute the whole notebook on every commit when there are 30 cells that each take 2-8 seconds to complete.




In Jupyter I approach this by structuring my exploratory analysis in sections, with the minimum of variables reused between sections.

Typically the time-intensive data prep stage is section 1.

The remaining sections are designed essentially like function blocks: data inputs listed in the first cell and data outputs/visualizations towards the end.

Once I decide the exploratory analysis in a section is more-or-less right, I bundle up the code cells into a standalone function, ready for reuse later in my analysis.
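
Concretely, a finished section usually collapses into something like this (names and columns are just illustrative):

    import pandas as pd

    # "Section 2" bundled into one function: inputs up front, output at the end
    def order_features(df_customers: pd.DataFrame, df_orders: pd.DataFrame) -> pd.DataFrame:
        df = df_customers.merge(df_orders, on="customer_id", how="left")
        df["orders_per_month"] = df["order_count"] / df["tenure_months"]
        return df[["customer_id", "orders_per_month"]]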

Jupyter notebooks can easily get disorganised with out-of-order state. However, that is also their strength: exploratory analysis and trying different code approaches is inherently a creative rather than a linear activity.


Maybe I'm missing a joke here, but if that's your workflow then there's absolutely no advantage to notebooks over something like Spyder or even VS Code.


No, that's not the workflow. You work in the notebook as normal but from time to time (say every two hours) rerun the whole thing.

One advantage of this is that it forces you to name your variables such that they don't overwrite each other. Further down the line this enables sophisticated comparisons of states (e.g. dataframes) before and after (something data scientists need)


If you have a few long data loading and preprocessing steps, it's a pain to wait for them to run again, so people try to avoid it.

When something odd begins to happen, they don't immediately consider the possibility that it's not their bug and waste time trying to 'debug' the problem instead of just rerunning the notebook.


Would it be a solution to store intermediate computations in an in-memory or on-disk database like Redis or SQLite? It's a matter of a few minutes to run a Docker instance and write simple read/write + serialize Python util functions.
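
Something like this sketch would probably do, using stdlib sqlite3 + pickle instead of Redis so you don't even need Docker (names are made up):

    import pickle
    import sqlite3

    conn = sqlite3.connect("intermediate.db")
    conn.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, val BLOB)")

    def save(key, obj):
        # serialize any intermediate result and upsert it under a key
        conn.execute("INSERT OR REPLACE INTO cache VALUES (?, ?)", (key, pickle.dumps(obj)))
        conn.commit()

    def load(key):
        # return the cached object, or None if it was never saved
        row = conn.execute("SELECT val FROM cache WHERE key = ?", (key,)).fetchone()
        return pickle.loads(row[0]) if row else None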


Surely, it would be a solution, but for the average data scientist I don't think it's a matter of a few minutes.


You don't reload every time you write a line of code. Nobody's insane like that. You reload every two hours or so. This is good enough for all but the most extreme data sets.


Well, if block 1 takes ages but everything after it is dependent, 2 -> 3 -> 4 etc., it would obviously be nice to just re-run block 2 and have those changes cascade.


I break long running data imports out into separate Notebooks or .py files, and persist the results.
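
For dataframes, persisting can be as simple as this (assuming pandas with pyarrow installed; the file name is just an example):

    import pandas as pd

    # end of the slow notebook: stand-in for the expensive import, then persist it
    df = pd.DataFrame({"user": [1, 2], "clicks": [10, 3]})
    df.to_parquet("raw_events.parquet")

    # top of the analysis notebook: reload instead of recomputing
    df = pd.read_parquet("raw_events.parquet")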

Always restart & re-run for usable results.


That’s what I’d always do. On more complex notebooks, though, is it possible that isn’t a solution? I wouldn’t think so, but I am happy to be surprised. Then again, I use notebooks only at the end of a project to present work in “executable presentation” style. Restart and Run All has always been sufficient for me. More generally, I took a look at notebooks, thought, “Why develop with all the extra baggage?” and left it at that until ready to experiment with presentation methods for (tight) core ideas.


Why are you even using notebooks at all then?


> One advantage of this is that it forces you to name your variables such that they don't overwrite each other. Further down the line this enables sophisticated comparisons of states (e.g. dataframes) before and after (something data scientists need)

Also, not sure about you, but I like seeing all of my outputs on a single browser page without having to write any glue code whatsoever.


A couple years out of college we finally took a hard look at the credit cards and realized we had fucked up.

We were gonna buckle down, pay the cards down hard for a while, 'color' our money so we both had discretionary spending separate from, say, the power bill. She had much more Excel experience than I did so she worked up a spreadsheet.

It was bad. We had worked up some 'fair' notion of proportionality and she basically had no spending money and mine was pretty bleak. So I redid the numbers from scratch with a split that was better for her. In the new spreadsheet she had much more spending money and... hold on, I had a bit more too? I looked at her spreadsheet repeatedly and I never did figure out where a couple hundred bucks got lost. I went back to sanity checking mine instead to make sure I wasn't wrong. It checked out.

I wonder sometimes how often small companies discover they've been running in the red instead of the black, because some cell got zeroed out, a sum didn't cover an entire column, or embezzlement is encoded straight into the spreadsheet without anyone noticing.

There's gotta be a better way.


The entire accounting department at any company exists to make sure their numbers are spot on. If your wife had an entire accounting department scrutinizing her numbers, they'd find the discrepancy. These are people who were willing to sacrifice their entire professional careers (and their lives, during busy season at least) to do nothing but tinker with Excel for 40+ years; always trust a masochist verging on the insane.


> I wonder sometimes how often small companies discover they've been running in the red instead of the black, because some cell got zeroed out, a sum didn't cover an entire column

This is a really interesting insight (actually obvious when you think about it). I'm currently working on a spreadsheet app and these kinds of observations are very interesting to me. I guess things like named cells/variables will help (instead of using $A$4 etc.). Range selection could also be more intelligent (it could actively warn you if a range selection seems to be missing a few cells of the same data type). Do you have any other insights here?


One company I used to work for had this happen: there was a magic spreadsheet in the accounting system - it was one factor in the massive restructuring of the company - ICAN was the other.


Going back to even some of the earliest literate programming exercises by Knuth, there's a lot of demonstrable usefulness in being able to write the code "out of order", or at least to present it in such a form. Setup requirements often aren't interesting to the main narrative flow, and may even distract from it, such that the "natural" place to put that stuff in, say, a textbook is in the back as an appendix.

A good notebook (again, similar to early literate programming tools) should help you piece the final execution flow back into the procedural flow needed for the given compiler/interpreter, but it probably should still let you rearrange it to best fit your narrative/thesis/arc.


This is how RunKit does it for Node.js, and I think it’s working quite well for them.

We (at Nextjournal) tried doing the same for other languages (Python, Julia, R) and felt that it didn’t work nearly as well. You often want to change a cell at the top, e.g. to add a new import, and it can be quite annoying when long-running dependent cells re-execute automatically. I now think that automatic execution of dependent cells works great when your use case is fast-executing cells (see observablehq), but we need to figure out something else for longer-running cells. One idea I haven’t tried yet is to only automatically re-run cells whose last execution finished within a given time threshold.
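
Roughly this, as a hypothetical sketch of the idea (not anything we've actually built):

    import time

    FAST_ENOUGH_S = 2.0   # assumed cutoff
    last_runtime = {}     # cell_id -> duration of its last execution, in seconds

    def run_cell(cell_id, thunk):
        start = time.monotonic()
        result = thunk()
        last_runtime[cell_id] = time.monotonic() - start
        return result

    def should_autorun(cell_id):
        # auto-re-execute a dependent cell only if its last run was quick;
        # slow cells wait for an explicit click
        return last_runtime.get(cell_id, float("inf")) <= FAST_ENOUGH_S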

I hear a lot of complaints about hidden state, but I think it’s less of a problem in reality, and tolerating it is just a lot faster than always rerunning things from a clean slate. Clojure's live programming model [1] works incredibly well by giving the user full control over what should be evaluated, and Clojure's focus on immutability also helps this work really well. I rarely run into issues where I'm still depending on a var that's been removed, and then there's still the reloaded workflow [2].

Overall I think notebooks are currently a great improvement for people who would otherwise write plain scripts – working on one is a lot quicker when you have an easy way to execute just parts of it. Plus there's the obvious benefit of interleaving prose and results. That doesn't mean we shouldn't be thinking about addressing the hidden state problem, but I think notebooks add a lot of value nevertheless.

[1] https://clojure.org/guides/repl/introduction

[2] http://thinkrelevance.com/blog/2013/06/04/clojure-workflow-r...


If people are wondering about cases that can cause this - a common one (for me) is a mis-spelled variable name. If you go back and change it, the old one is still there and if you make the same mistake twice you will have code that runs but doesn't work. It's then really not obvious why it doesn't work.
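
In miniature (the names are made up):

    # the cell as first executed, with the typo:
    revnue_total = 1000

    # the cell after the "fix" and a re-run:
    revenue_total = 1000

    # the old binding is still live in the kernel, so code that still references
    # the typo'd name keeps "working" even though no visible cell defines it:
    print(revnue_total)   # 1000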


It's best to think of the notebook as a REPL. So you'd want to run `del foo` on the old name.

In fact, this is a good counterexample. Why should the notebook delete the old variable name? What if its value is a thread currently executing?

Notebooks are REPLs, and it's better to get used to that than to try to enforce some confusing time traveling.


But, strangely, Jupyter doesn't also give you a REPL (like, say, RStudio does). I'm always making new cells in the middle to output the column names of my spreadsheet, and then I have to delete them. I used to just always have an ipython REPL running and test things out in there as I wrote. You can start an ipython instance on the same kernel, but IIRC that messed up my plots when I did it.


You can get a REPL attached to a notebook in jupyter. When you open a console in jupyter-lab you have the option of attaching it to an already running kernel. Using the notebook interface you can connect a console using `jupyter console --existing`. By default this connects to the most recent session, but you can also specify a session by passing a token.


Yeah, but problems with that are what my last sentence referred to. I didn't try debugging it for long, though.


Just get the extension: https://jupyter-contrib-nbextensions.readthedocs.io/en/lates...

there are quite a few very useful ones, my favourite being collapsible headings: https://jupyter-contrib-nbextensions.readthedocs.io/en/lates...


Yes! I've wanted this too.

Colab has a nice feature that's close to this: Insert -> Scratch code cell


> It's best to think of the notebook as a REPL.

With sections of the history easily replaced.

> So you'd want to run `del foo` on the old name.

And then delete this line, because if you leave it in it'll break when you try and run the file all the way through.

> Notebooks are REPLs, and it's better to get used to that than to try to enforce some confusing time traveling.

Do you mean by treating them as append only and never rerunning any cells?

> In fact, this is a good counterexample. Why should the notebook delete the old variable name? What if its value is a thread currently executing?

The notebook has no idea if it can or can't, but that doesn't mean that leaving it in is good; it's simply the only realistic option.


Agree. Get used to the habit of deleting old objects/names when you are replacing them, if you work in notebooks.


It's an easy thing to miss though, because you also then need to delete the line of code you used to delete the old object/name so you have no record of cleaning up after yourself.


Be careful what you wish for.

Hot reloads can become very expensive. Especially when it comes to computationally heavy tasks that notebooks are built for.

If you decide you want hot reloads by default, it'd mean each time you click on a cell and then click on another you'd be restarting the whole notebook.

If you had massive datasets you were loading, or other args that you were parsing manually or at a prompt, you'd have to go back and do all that. Don't even get me started on the operations you'd have done with those dataframes prior.

I think it is a good thing that notebooks separate instructions and re-execute manually by default. The cost of the alternative is just too high


> I think it is a good thing that notebooks separate instructions and re-execute manually by default. The cost of the alternative is just too high.

Maybe add a "lock" toggle so a user can block a cell from being automatically executed? The heavy numeric setup tasks could then be gathered in a few cells and locked, leaving the lighter plotting & summary stats cells free to update reactively.


Toggle???

Toggle a whole environment and interpreter's behavior? Do you know how much architecture that would involve? That's like trying to tell IDLE to be able to both delete or keep your variables on exit, or the JVM to have a toggle switch for memory and garbage management.

Why doesn't the developer make themselves useful and simply write a save function that freezes their buffer variable values to a text, JSON or SQLite file that they can read from or stream, rather than trying to set back a whole community years of progress in an effort to accommodate perhaps entitled or lazy devs?

Can you even imagine the architectural costs of trying to accommodate streaming data and timestamped data as opposed to you just writing your own stuff to file?


I think adding a toggle to run or not run a cell would be a trivial change, something like adding a property to the cell (https://raw.githubusercontent.com/jupyter/notebook/master/no...) and checking whether that is set or not before running the contents.
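
Something along these lines, as a rough sketch (the "locked" metadata key is made up, not an official nbformat field, and exec() is a crude stand-in for real kernel execution):

    import nbformat

    nb = nbformat.read("analysis.ipynb", as_version=4)   # hypothetical notebook file
    for cell in nb.cells:
        if cell.cell_type != "code" or cell.metadata.get("locked"):
            continue              # leave cells flagged as locked untouched
        exec(cell.source)         # crude stand-in for sending the cell to the kernel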


I don't think re-executing by default would be beneficial. I just don't want state present in the interpreter that isn't in any live cell, and I only want time to flow in one direction. Other than that, I think the other ergonomics of notebooks are fine.


RunKit does this: https://runkit.com/home

> RunKit allows you to rewind your work to a previous point, even filesystem changes are rewound!


If the interpreter state contains large variables, checkpointing might not be viable (e.g. I have dataframes that are 100s of GB - large fractions of total available memory - and reading/writing from the hard drive all the time would be relatively slow; if you could save deltas I guess it wouldn't be too space-inefficient, but I imagine it would still be slow).

At the same time, I do like the idea of an append only notebook where you can:

1. Only run cells in sequential order

2. Only edit cells that are below the most recently run cell.

Thankfully you can enforce this through coding practice, and the notebook is then relatively guaranteed to be "run all"-able. You will need to refactor it after the initial dirty run, but at least it's easy to reason about.


I want a notebook situation where the platform understands sampling, so that, while I'm doing my EDA and initial development and generally doing the kinds of work that are appropriate to do in a notebook, I'm never working with 100GB data frames.

I suspect that a big part of my annoyance about the current state of the data space is that parts of the ecosystem were designed with the needs of data scientists in mind, and other parts of the ecosystem were designed with the needs of data engineers in mind, and it's all been jammed together in a way that makes sure nobody can ever be happy.


You can already sample data if you want (or sequentially load partial data, which is what I usually do if I just want to test basic transformations), but if you need to worry about rare occurrences (and don't know the rate) then sampling can be dangerous. For example, when validating data there are edge cases that are very rare (i.e. sometimes I catch issues affecting less than one record per billion), and it can be hard to catch them without looking at all of the data.


Assuming the data isn't changed, forking wouldn't cause any extra memory usage thanks to CoW. If only a subset of the data is changed, same thing - only the changed cells will take extra space. The problem only occurs when the whole variable changes - in which case yeah, you're SOL. I wonder what the usage patterns are for such datasets?


Personal experience: when first looking at the data I often do lots of map/reduce-style operations which might transform large portions of the dataframe.

Question, if you use CoW then presumably your variable blocks are no longer contiguous, wouldn't this really slow down vector operations?


> Question, if you use CoW then presumably your variable blocks are no longer contiguous, wouldn't this really slow down vector operations?

I don't think so. Vector operations require the data to be aligned to whatever the vector size is, no? E.g. 16-byte vector ops require the data to be aligned to 16 bytes, etc... At least that's my understanding.


I've played with prototypes of this by calling fork on IPython to take snapshots of interpreter state (https://github.com/thomasballinger/rlundo/blob/master/readme...), but if you can't serialize state fully, rerunning from the top (bpython's approach) can work, or rerunning only what a dependency DAG shows is necessary (the approach of my current employer, Observable) works nicely.
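
The core of the fork trick looks roughly like this (a sketch, not rlundo's actual code; POSIX only, and it won't survive threads or open plot windows):

    import os
    import sys

    def checkpoint():
        pid = os.fork()
        if pid == 0:
            return                 # child: carries on executing new input
        os.waitpid(pid, 0)         # parent: sleeps, holding the old interpreter state
        # if the child exits (the user "undoes"), the parent wakes up here with
        # everything exactly as it was at the checkpoint

    def rollback():
        sys.exit(0)                # discard the current (child) state; parent resumes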


Check out our reproducibility-focused notebook Vizier (https://vizierdb.info). In Vizier, inter-cell communication happens through Spark dataframes (and we're working on other datatypes too). This makes it possible for Vizier to track inter-cell dependencies and automatically reschedule execution of dependent cells. (It also makes Vizier a polyglot notebook :) )


Makefiles have this issue too: sometimes things have been incorrectly made because the dependencies in the makefile are wrong.

Unless it takes more than a few seconds to run a notebook, rerun every cell up to the point you're editing, always.

And then if it does take minutes, and you find yourself in an unexplainable rut, then run the entire notebook, and get a cup of coffee.


That's part of what drove me to write TopShell, which is a notebook-like interface:

https://github.com/topshell-language/topshell

  * Information only flows downwards.
  * Computations are cached.
  * Things are automatically recomputed when the values they depend on change.
  * Things with effects are instead cleared and await confirmation before running.


I haven't dug into it myself, but Netflix makes something called Polynote that is supposed to add some awareness of the sequence of the cells to combat this.


https://datalore.io does exactly this


Might be worth trying out nodebook [1], which at least enforces the forward directionality you mentioned.

Also Polynote by Netflix, as a user below mentioned.

[1] https://github.com/stitchfix/nodebook


s/Netflix/Stitch Fix/

(don't reply, and I'll delete this)



