The Future of Notebooks: Lessons from JupyterCon 353 points by wcrichton 5 months ago | 154 comments

 I don't get the people here saying don't use Jupyter notebooks, or they are bad software engineering. So much Python development is trying snippets of code in a REPL as you introspect live objects, then once they're right pasting them into the IDE. All my Jupyter notebooks are like that, where my code starts as cells of a line or two, as I check each output. Then I coalesce them into a function (which avoids the problem of execution order that some people here mention). Then I may move those functions into a normal Python module and build up complex classes, add unit tests. Or if the goal is communicating with other people the results of running that code, I'll add explanatory text and mathematical formulas, break it up into sections, include references to papers and web links. I think what people miss here is that many software development tasks, especially in data science and machine learning, are essentially exploratory data analysis. And the outputs are graphs, tables of data etc, designed to be read by humans. Jupyter's ease of use for iteration, and the ability to interleave documentation, formulas, graphs and tables with the code is a big win. In my daily work, I have my IDE and Jupyter open side by side. The Jupyter notebooks are version controlled like my .py modules. Over time, I expect editors like VSCode to edit Jupyter notebooks natively. And also more code editing to move from IDEs into the Jupyter (via integrating the Monaco editor). We will all benefit from that cross-pollination, whether we are principally software engineers or data scientists.
 Yeah, I hear you about the jupyter-is-for-draft-code. I thought that's how everyone used Jupyter for software dev. I guess this is the MATLAB-style software dev---try something in the REPL, and once you figure out the parameters interactively copy-paste that line/paragraph into your program. As soon as I have one piece of functionality working as a function, I move the code to the "main" file for the project and import the function into the notebook. This means I never have to write more than a few paragraphs of code in the notebook and instead use it for testing and glue code. I don't think of the notebook as something sharable with others, it's more like WIP code... if I keep the notebook around it's usually because it can end up a useful test harness when code needs to be revisited/updated. As for global state, one thing that I find really helpful is the keyboard shortcut 00. If you focus anywhere outside of a cell and you press zero twice, this restarts the kernel and you can re-run the commands from the beginning in a "clean state" so what you see matches what is. This approach works well when you're editing the source code that gets imported into the notebook---you never have to worry about re-importing since you're always restarting from scratch.
 Jupyter code is draft code. Draft code rarely makes it into clean code, because that requires extra tedious effort. Most of us are too overworked to have the time or energy to perform it. Alternatively one can write exploratory code as unit tests and run them inside an IDE. When the exploration is done, three quarters of the tedious cleanup effort is already done. There are already unit tests. There are already APIs exercised by the unit tests. The code is in much better shape. The missing two bits:
* No charting support. To the best of my knowledge, neither PyCharm nor VSCode has an output console with HTML support.
* Caching. Some operations take a lot of time. At the extreme, fast.ai style development, where some cells train a model for hours/days. Re-running such code from scratch on every test run is impractical. An opportunity for a library with explicit save/load/cache/checkpoint support.
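A minimal sketch of the kind of explicit result caching described above, using only the standard library. The names `disk_cache` and `slow_square` are hypothetical, and the sketch assumes that arguments and results can be pickled; joblib's `Memory` class is an existing, more complete take on the same idea.

```python
import functools
import hashlib
import pickle
import tempfile
from pathlib import Path


def disk_cache(cache_dir):
    """Decorator factory: store each result as a pickle file under cache_dir."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Key on the function name plus its pickled arguments.
            payload = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            path = cache_dir / (hashlib.sha256(payload).hexdigest() + ".pkl")
            if path.exists():
                # Cache hit: load the stored result instead of recomputing.
                return pickle.loads(path.read_bytes())
            result = func(*args, **kwargs)
            path.write_bytes(pickle.dumps(result))
            return result

        return wrapper

    return decorator


calls = []  # records how often the expensive body really runs


@disk_cache(tempfile.mkdtemp())
def slow_square(x):
    calls.append(x)  # stands in for hours of training
    return x * x


first = slow_square(4)   # computed and written to disk
second = slow_square(4)  # read back from disk, body not executed
```

Pointing `cache_dir` at a directory inside the project, rather than a temporary one, lets the cache survive test runs and kernel restarts.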
 VSCode has decent Jupyter support with its Python plugin. What I really missed in my Jupyter notebooks was a world-class editor with keyboard shortcuts and code completion. So I went the other route, running Jupyter notebooks in VSCode. I made a really simple Python CLI module that would convert from Jupyter notebook to Python files with runnable cells in VSCode, and vice versa. https://github.com/nojvek/vscode-ipynb-py-converter/blob/mas... Really upped my game while doing the Udacity courses. VSCode does indeed have an HTML output console.
 Wow! This is insanely cool! You should package this up as a VS Code plugin so more people can use it. At a minimum, please post a license.txt file in your repo so folks can actually use it (MIT or similar is pretty easy and common for something like this, without a license.txt file it's not clear whether others can actually use it and many places that means they can't).
 You should check this out https://github.com/nteract/hydrogen
 Hydrogen looks very cool! Am I right that hydrogen and the https://github.com/DonJayamanne/vscodeJupyter plugin (the one nojvek's plugin is designed to work with) differ in that hydrogen shows the plots inline (like Jupyter does) but doesn't support markdown? [edit: the biggest difference is that hydrogen is a stand-alone atom-like editor not a vscode plugin]
 I usually mark all my repos as MIT but will explicitly add a LICENSE.txt. Thanks
 I only recently learned myself that Jupyter notebooks do have code completion, triggered by hitting the tab button - although possibly you were thinking of something more advanced than that. In any case I'll check out your code on github.
 Actually, if you activate the 'Hinterland' extension, you don't need to hit tab. And there are a ton more extensions too. Their GitHub repo has instructions on how to set up the extensions - https://github.com/ipython-contrib/jupyter_contrib_nbextensi...
 This is how I code, exploratory code as unit tests.
> Alternatively one can write exploratory code as unit tests and run them inside an IDE. When the exploration is done, three quarters of the tedious cleanup effort is already done. There are already unit tests. There are already APIs exercised by the unit tests. The code is in much better shape.
 > And also more code editing to move from IDEs into the Jupyter
Jupyter, and all REPLs in general, rely on a concrete top-level execution context to provide useful feedback. Much of the code written outside of data science doesn’t really have that, which is why you don’t see notebooks being used for software development very often.
 I frequently use IPython or REPL.it for software development. It's not my main editor, and I don't write large chunks of my code there, but for rapidly prototyping a small subset of some task I find them invaluable. But you are right about notebooks - I don't find them particularly useful, except as a better REPL, and no-one I know uses them for "normal" (non Data Science) software development.
 I think my favorite non "data science" use case is web scraping. Ad-hoc nature of the task lends itself to using a notebook. A lot of times I don't want the full suite and boilerplate Scrapy gives you so I write it out in a notebook with Parsel & Requests. Once done export and clean up.
 I'm not sure this is as limiting as it sounds. You can import whatever modules you want from within a notebook, and most projects tend to be modular. (Those that aren't can usually be shimmed pretty quickly.)
 Used all the time in Ruby (repl, not notebook). In fact I think anyone doing Ruby dev who doesn't have a pry session open is missing out on a good 30% productivity boost, unless they know their APIs inside out.
 I would agree with this, although an IDE with really good introspection and an integrated debugger can approximate the experience. Lately I've started using spaceneovim, which is hopefully going to help with the back-and-forth workflow a bit, modulo issues of what-mode-am-I-in-again? The terminal is two keystrokes away, and having all of the normal vim commands available in both contexts is something that I think will be more and more useful.
 > Over time, I expect editors like VSCode to edit Jupyter notebooks natively.
Please, please, please. Given the rising importance of interactive programming, I really think it deserves more than an extension to give us a seamless experience.
 Related issue: https://github.com/Microsoft/vscode/issues/34739
 I just wish it wasn’t so tedious to version control Jupyter notebooks. I always try and clear the output before I stage it so I don’t have large binaries of plots or figures that were generated.
 Check out nbstripout. It sets up a git filter so that your output is stripped as the notebook is committed. But you keep the output locally.
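To illustrate what "stripping output" means for the notebook file itself, here is a rough sketch of the transformation on notebook JSON. It is not nbstripout's actual implementation; `strip_outputs` is a hypothetical helper and the sketch assumes the nbformat 4 layout, where each code cell carries `outputs` and `execution_count` fields.

```python
import json


def strip_outputs(notebook):
    """Clear outputs and execution counts from every code cell, in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


# A tiny hand-written notebook with one markdown cell and one executed code cell.
raw = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["1 + 1"],
            "execution_count": 7,
            "outputs": [{
                "output_type": "execute_result",
                "execution_count": 7,
                "data": {"text/plain": ["2"]},
                "metadata": {},
            }],
        },
    ],
})

stripped = strip_outputs(json.loads(raw))
```

With the real tool, running `nbstripout --install` inside a repository sets up the git filter, so the committed version is stripped while the working copy keeps its outputs.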
 Jupyter notebooks are for data science, mostly because visualization is required. Something you do once, report it and it is done. It doesn't make sense to use Jupyter for other stuff. It doesn't make much sense to use them for training big deep learning models because there are better tools for that.
 > Something you do once, report it and it is done.
I have been having a hard time getting comfortable with that idea. There is usually "a lot of" untested and unreviewed code in the notebook that produces the analysis.
 Well, there are many things to consider:
- Code is small, most of it is calling well-known libraries with the algorithms.
- Every 3-4 lines of code you usually show the results of what is happening (either how the data changed, a graph or whatever).
- 95% of the errors in data science come from bad practices: you either didn't clean up the data correctly, you didn't split the train/test/validation sets correctly or at the right step in the process, you chose the wrong algorithm, etc.
- Most of the time you don't know whether you are doing something wrong or not because you don't have enough knowledge, but the code shows something all the time. No bugs in the code.
Those are the reasons I don't care very much about bugs in the code: usually there are bigger problems in the data science process than in the code.
 What tools can you recommend for machine learning?
 Scripts to train the model, Jupyter to do your analysis. This is, AFAIK, what most ML shops do if they’re using Python.
 It depends what you want to do
 > Over time, I expect editors like VSCode to edit Jupyter notebooks natively.
PyCharm has support for Jupyter notebooks [1], but for now it's not very good [2]. Two weeks ago JetBrains released this official comment [2]: Hi, we are aware of the problems our Jupyter support has and we aren't going to fix them, because the current support is going to retire. Currently we are working on another support of Jupyter notebook (even 2!), which hopefully will be better and much more usable. We plan to make it ready for 2018.3 PyCharm release, so stay tuned. My note: The current PyCharm is version 2018.2.
 > I don't get the people here saying don't use Jupyter notebooks, or they are bad software engineering.
I have a python window open on my desk right now. But the model of programming here contrasts sharply, in my mind, with the one discussed yesterday, from HtDP https://news.ycombinator.com/item?id=17826959. I absolutely use the REPL, for instance to determine where to put what arguments into a library function. But I have also in the past had it happen that I found a routine of mine that missed an edge case and there were a good number of projects that subsequently used that routine, that now could be wrong. Some of those could have led to a publication, so there could in that sense be a permanence to that error. Oops. Those of us who have gotten bit, or who teach and routinely see students develop habits we know will lead to them getting bit, worry about systems that are not developed so much as pasted together.
 > The Jupyter notebooks are version controlled like my .py modules.
That's interesting, could you give us some more details? Last time I tried to put a Jupyter notebook under git it was a mess. Do you use another tool than git or have they made tools to help with version control? Or is it just your workflow that helps, like emptying all cell results before saving?
 Our version control uses a monorepo, with non-branching workflows, development against the head version, and frequent commits/pushes to production. For Jupyter notebooks, cell results get emptied before saving, unless declared public. That is to ensure data confidentiality, not really to help version control. Of course, standard diff gets confused with ipynbs. You need a tool like nbdime (notebook diff and merge).
 Thanks for sharing nbdime! Been looking for something like this for a while.
 Not OP, but I can recommend the handy https://github.com/kynan/nbstripout, which acts as a git filter that makes version control ignore cell outputs. With that approach, though the notebooks are clean, they're still fairly poor for easily evaluating diffs between versions. If code review / diffs are more important than preserving the notebook, then you could use a post save hook to convert notebook input to a .py file and output to .html:
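A minimal sketch of such a hook, assuming the classic Notebook server's `FileContentsManager.post_save_hook` setting in `jupyter_notebook_config.py` and the `jupyter nbconvert` command line. The helper name `conversion_commands` is hypothetical, and this is not necessarily the hook the commenter had in mind.

```python
import os
import subprocess


def conversion_commands(fname):
    """Return the nbconvert invocations for one notebook file."""
    return [
        ["jupyter", "nbconvert", "--to", "script", fname],  # code cells -> .py
        ["jupyter", "nbconvert", "--to", "html", fname],    # rendered output -> .html
    ]


def post_save(model, os_path, contents_manager):
    """Post-save hook: export the saved notebook to .py and .html next to it."""
    if model["type"] != "notebook":
        return  # ignore plain files and directories
    directory, fname = os.path.split(os_path)
    for command in conversion_commands(fname):
        subprocess.check_call(command, cwd=directory)


# In jupyter_notebook_config.py, where the config object `c` is provided:
# c.FileContentsManager.post_save_hook = post_save
```

The generated .py file then gives reviewers an ordinary text diff, while the .html file preserves the rendered results.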
 > I don't get the people here saying don't use Jupyter notebooks, or they are bad software engineering.
How many businesses rely on bodged up Excel spreadsheets to run their business? This is that, but on steroids. For better or worse.
 Honestly, I just use a file called play.py, create a function called test_it() or something, and maybe use ipdb if I need to stop somewhere and explore. Then: pytest -s play.py
I just don’t understand what the appeal of Jupyter is, but IPython is great.
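For illustration, a scratch file in that style might look like the sketch below. The function `split_suffix` is a made-up example of something being explored, and the `ipdb` line is left commented out so the file runs without it installed.

```python
# play.py: scratch file, run with `pytest -s play.py`


def split_suffix(filename):
    """Split a file name at its last dot into (stem, suffix)."""
    stem, _, suffix = filename.rpartition(".")
    return stem, suffix


def test_it():
    result = split_suffix("report.final.csv")
    print(result)  # shown on the console because -s disables output capture
    # import ipdb; ipdb.set_trace()  # uncomment to stop here and explore
    assert result == ("report.final", "csv")
```

Once the behaviour is right, `split_suffix` can move to a real module and `test_it` is already a unit test for it.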
 > So much Python development is trying snippets of code in a REPL as you introspect live objects, then once they're right pasting them into the IDE.
This approach to development might be a lot rarer than you think – I've never seen anyone do it frequently. For tiniest snippets ("If I want to split there and then use that suffix--how do I need to index? Let's check in the REPL real quick" or "What did that stdlib function return again? [Because that's of course not documented, why would anyone need to know that, after all]" and stuff like "I have this list of tuples like that, now, how do I need to zip/unpack/star that stuff such that I get a list of X instead. Let's try in the REPL with [(1,2), (3,4)]!"), sure, for "actual code" that's going somewhere... nope.
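As a concrete instance of the zip/unpack/star question above, this is the kind of snippet one would try in the REPL; the variable names are illustrative only.

```python
pairs = [(1, 2), (3, 4)]

# Star-unpacking hands each tuple to zip as a separate argument,
# which transposes rows into columns.
firsts, seconds = zip(*pairs)

# The same transposition, with each column turned into a list.
columns = [list(column) for column in zip(*pairs)]
```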
 I understand why they became popular, but as a software engineer considering how they work, I am just full of disappointment. We're going to spend the next ten years re-inventing every single software engineering best practice for Jupyter's weirdo environment.
 You reason from a downside (more layman programmers without proper workflow). I reason from an upside: much better flow than Excel or Excel/VBA and hence fewer errors, better accountability, etc.
 In the mid-90s I was a Mech Eng undergrad using a program called MathCAD which provided a "notebook" interactive computation mixed with text environment by running as a Word plugin. In 2018 I use Jupyter and it's not clear where the progress has been. There are too many compromises trying to make it work in a web browser. For what Jupyter is for I find that RStudio or Spyder are infinitely superior. I do interactive exploring and transforming that into both documents and reusable and deployable code all in the one tool
 I too prefer RStudio/RMarkdown. Now we have the Reticulate package that allows running Python inside RMarkdown, for my purposes it's way better than Jupyter. It's just plain text so my editor and VCS and everything plays nicely. But perhaps the biggest strength is that you go through Pandoc, so you can just click a button and get the output as a Word file for sending to someone who's not a software dev, or you can get it as a LaTeX source file and extend it into a proper formal document like a journal paper, or you can do it as HTML where embedding Bokeh scripts and other interactive things Just Works. I've even made presentations with it going through the reveal.js framework, where you can put an interactive plot on one of your slides, and show people live "what happens when we change this parameter". That's still semi-witchcraft in 2018, but it's going to become a common thing (hopefully).
 Unfortunately RStudio also makes compromises to make it work in a web browser. A native tool could be even better.
 RStudio is mainly a native tool, the web version is quite inferior. Linux, Windows and OSX native versions
 One has to stretch the term quite a lot to say that RStudio is a "native" GUI. The interface is just a browser window, the only native part is the menu bar.
 This is definitely one of my concerns too. Ad hoc code inside these notebooks is almost completely unmanageable from any reasonable software maintenance perspective, and refactoring code out of them is prohibitively difficult as well. I really want something to emerge that combines the best of both worlds of an IDE and notebook development, but there isn't anything close currently.
 ob-ipython enables IDE-like editing features within the code cells. It's embedded in a polyglot, git-friendly, literate programming environment called Org-mode. I use it every day and love it. Other goodies:
- easily manage multiple kernels (in different languages / machines) in one file
- tree-based organization manages complexity better than linear notebooks
- no browser in sight (unless you need interactive widgets)
- highly exportable, including to ipynb via ox-ipynb
Downsides:
- Emacs-only
- small user base / limited docs
- can't easily import from ipynb
- async cell execution support is early-stage
See Scimax ipython for examples
 Org-mode is strictly speaking the most powerful notebook programming environment out there. I’m also much more excited by R Notebooks as implemented by RStudio than I am by Jupyter. R Notebooks take the same basic approach as org-mode, implementing a smaller set of functionality built around a more mainstream-palatable Markdown format. I don’t like R Notebooks as much as org-mode (why reinvent the wheel!?), but at least the general approach plays nicely with git. On the other hand, the lisp-addled part of me starts to think that the Jupyter format being JSON might actually be a step up from a poorly specified markup format that has to be parsed into data structures. Maybe the real issue is that git is an insufficient revision control system, and that we need revision control systems that can revision data structures, rather than simple text diffing.
 Could you please elaborate on how you use orgmode for interactive programming?
 Interesting. I use emacs for hacking Python but I've never heard of this tool. Do you actually use this for software development or is it more of a data-science-type exploration tool?
 I use Org for software development whenever I can, which is currently every day. Usually, new code starts in cells with some Org-managed context (e.g. a Jupyter kernel in a remote container with some DB/service access). This is done using the :session code cell keyword, which works per subtree. Managing remote sessions like this generally keeps me away from terminals. Surrounding the cell are various mini-dashboards with useful docs / links / commands for that part of the project. Since Org supports embedding elisp and shell commands in clickable links [1], these mini-dashboards can be made very quickly. Org lets me edit the code using the proper Emacs mode for its language, while pulling dynamic completion / docs from the Jupyter kernel. Just like Jupyter notebooks, I can view rich outputs from the cells in-line. I can then name the outputs and make them inputs to other cells, including ones in different languages / kernels. AFAIK that's an Org-only trick. Most code eventually finds its way to normal source files (see Org's "tangle" feature). This feels more natural than moving code from notebooks since, again, the cell editing mode is the same as the one for source files. Org's tree-manipulation capabilities + support for multiple sessions means that (so far) I've only ever needed 1 Org file per project. I track this in git, which is simple since Org is just plain-text. To share with non-Org users, I usually export to ipynb [2] or, for static docs, HTML [3].
 Thanks.
 related "literate devops in emacs" https://youtu.be/dljNabciEGg
 What? What are you using Jupyter notebooks for where you want maintenance? They should be records of data analysis/ procedures, not code that runs in production or something.
 Did you read the article? "A Netflix engineer described how they have replaced Bash scripts with Jupyter notebooks for ETL pipelines and cron jobs."
 Maybe read the article. They are experimenting with putting the notebooks directly into production ala bash script. I don’t think this is a great idea either.
 Yeah, I read it after, silly of me to comment first.
 exactly. I use it for quick data exploration and experimentation but it remains at that level.
 People do that in Excel too and the next thing you know a spreadsheet is managing a portfolio or being used as the basis for published science!
 everyone knows that excel is a drawing program https://www.thisiscolossal.com/2017/12/tatsuo-horiuchi-excel...
 I can’t tell if this is sarcasm or not... I’ll assume it is and upvote :-)
 It's not the fault of the tools if users don't know any better.
 So what would you suggest a financial modeler, who has no experience of any programming environments, use instead? The value of Excel is that it is a zero-config tool, available everywhere as a 'standard' business installation, allows very quick iteration with visual output in a certain range of tasks, is battle tested on millions of computers... etc. And everyone else is using it too. For a programmer it's easy to suggest to just pick some good language. Once you know two programming languages you can become fluent in a third pretty quickly. But picking up fluency in the first language? That's hard. You can't just suggest to use Python or such. You need to provide a tool that holds hands, has lots of tutorials available etc. Visual programming is not necessarily the answer either. I've understood that without discipline LabVIEW programs become actual visual spaghetti pretty quickly.
 Of course it's hard. Just like it's hard to learn a new language that is far from your native tongue. But we should not be scared of learning something difficult, because the returns are well worth the investment. There is only so much you can do in Excel and if you want to advance your career you need to move past it even if you have no programming background. Additionally, it's never been easier than in 2018 to find free resources to learn. The only things you need are awareness, time, and will.
 It's also not the fault of the tools if there are no better alternatives.
 Some simpler notebook-like environments stay closer to source code in that they basically are source code with interleaved results (e.g. as comments). IMHO they hit a sweet spot between REPLs and those notebook environments inspired by mathematica, maple and similar more mathematically oriented software products.
 That's just a REPL with basic editor integration (eval at point, paste result). Surprisingly unpopular outside Emacs/Lisp land.
 Well yes, but these notebooks are little more, aren't they.
 People are already doing bad programming practices in large scale with MATLAB, I guess jupyter is a step up from that.
 On the other hand, notebooks encourage many good practices: literate programming, purely functional code with immutable outputs, visibility of state, and reproducibility.
 Testing, style checking, code coverage...
 As the speaker for the scheduling notebook talk that got referenced here, I can probably give a few insights about how we're using notebooks. A lot of the risks and concerns brought up here were talked about in the session. The slides ended up on https://conferences.oreilly.com/jupyter/jup-ny/public/schedu... (and hopefully the talks themselves will get posted soon). In particular I referenced how we treat and emphasize notebooks as an integration tool which acts as a good place to combine actions with documentation, visuals, and output logs. There's a section on Integrating Notebook which outlines how we approach this. There's also a strong emphasis on pushing complexity and shared code into the repositories housing notebooks. Effectively you end up with a lot of the same best practices found in non-notebook development, and the same abuses that lead to unmaintainable code -- which sometimes is needed in the short term. So far we've had a lot of success with notebooks in production as parameterizable templates, or as a way to easily produce scheduled reports or machine learning experiments. Many users just provide the parameters while supporting teams provide the tested templates. Other users like being able to simply schedule their iterated work without needing to translate to another medium. One of the biggest wins though is gaining a shared interface for debugging, experimenting, and reusing code by having notebooks as the output artifacts of execution (even if it's just executing some other code elsewhere on behalf of the user). Papermill made a lot of this possible by separating input notebooks from output notebooks, and by being able to inject runtime values into those output notebooks.
 We also are doing a blog series on how we're approaching notebooks at https://medium.com/netflix-techblog/notebook-innovation-591e... (the scheduling post is the second in the series). Also I'd recommend listening to talks from the conference with an open mind. There were a lot of great discussions and positive energy around notebooks. Hopefully these help describe a few of the new patterns and tools available in the notebook space.
 I really like notebooks as a way to share an analysis or data visualization. For me the biggest benefits are:
- Integration. So nice to have access to the compute, data and libraries all in one place. There is a surprising amount of hassle moving data, setting up paths and libraries, etc. Notebooks almost act like a container for data analysis rather than for a service.
- Sharable and reproducible. My coworkers can reproduce and explore some new idea with almost no effort, especially important when their strengths are more ML or stats than devops.
- Literate programming. It is really nice to have plots, markdown and code all in one place when the deliverable is a report or analysis rather than code.
Even with these benefits I do think the criticisms about software development are right on point. Notebooks are a step backward in terms of software engineering environment, with none of the modern tooling, version control, testing frameworks etc. I think that the folks who dismiss notebooks as a platform though are missing some important benefits that have long been absent in current editors. Larger companies like Facebook and Google are already facing the reality that devops is a pain point even for sophisticated software engineers and have developed remote code editors like Cider and Nuclide to try and enable bringing code development to the data and compute rather than doing it from the laptop. R has been working for a long time on integrating analysis results and code in a reproducible package with Sweave/knitr, and Python now has Pweave in a similar fashion. I hear all the issues with Jupyter and I'm not particularly married to the current form of notebooks. I do think though that remote development, data visualization, and support for literate/report programming and sharing code are first class features that I'll continue to want in the future.
 It is a disaster that more than half of the research papers do not provide data and the code, required either to reproduce research (if it is a theoretical one) or reproduce results analysis. Wider adoption of JupyterLab (and alike) will help to solve this problem. Using PDF and paper for distributing research results feels like XIX century these days.
 Most comments here express disbelief and disappointment in Jupyter from the software engineering point of view. What exactly is wrong with it? I use Jupyter daily and find no other Python environment more productive, be it scripts or IPython or IDEs. Granted, I work in scientific computing and use Python for data wrangling and stats. I find immense value in interactivity and iteration speed.
 I also work in scientific computing, and I love Jupyter. I think out-of-order or unknown-order execution is the biggest source of my own mistakes / bugs. If the order in which your cells are executed affects your result, you've got a bug waiting to bite you or someone else. I've been there. And we tend to run cells more than once, while refining our work. This potentially affects reproducibility, which is a tenet of scientific computing. So it's important in our world. I personally think there might be a couple of ways to deal with this issue. One is to put handcuffs on Jupyter and inadvertently make it more difficult or cumbersome to use. But this might be valuable if the size of your project and importance of your result warrants it. Another way is for us to recognize that at some point we have to grow up and learn some techniques from the programmers. In my revisionist history, software engineering began when programs got too big to see the whole thing on one screen and intuitively understand. One of my little habits is that I do a "restart kernel and run all cells" when I'm ready to take a break or close out at the end of the day. If it fails, the causes of the problem tend to be superficial, and nothing to lose sleep over. Another habit is to take cells that work and are nice, and enclose their innards in a function so that their impact on the global variable space is minimized. This can be done during a sleepy time in the afternoon when you're not really in flow anyway.
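A small sketch of the last habit, enclosing a working cell's innards in a function. The name `summarize` and the sample data are invented for illustration.

```python
import statistics


def summarize(samples):
    """Former cell body: drop missing values, then compute summary statistics."""
    cleaned = [s for s in samples if s is not None]  # local, not a notebook global
    return {"n": len(cleaned), "mean": statistics.mean(cleaned)}


# Only the input and the returned summary live at notebook level; the
# intermediate `cleaned` list cannot leak into, or depend on, other cells.
summary = summarize([1.0, None, 3.0])
```

Because the function reads nothing but its argument, re-running the cell in any order gives the same result.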
 > I think out-of-order or unknown-order execution is the biggest source of my own mistakes / bugs.
To avoid the state problem, I start each cell (or, in org-mode, each src block) with a command to set up the environment for that cell, usually by importing a script named something like setup_abc123.py. Cells are not allowed to reference code or results from other cells except through the filesystem, which is under the watchful eye of git. In this workflow, cells exist only to write plots or tables into the notebook for human consumption. This workflow, for me, has completely eliminated errors due to hidden state dependencies. Before I adopted it I made a serious state-related error every couple of days, which was intolerable. As a bonus, using a single script to set up the analysis environment makes it trivial to start a new analysis, even with an entirely different tool chain.
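One way to picture that workflow is the sketch below, where each "cell" is modelled as a function that begins with the same setup call and exchanges results only through files. All names (`setup_env`, `cell_fit`, `cell_report`, `fit.json`) are hypothetical stand-ins, with `setup_env` playing the role of the imported setup script.

```python
import json
import tempfile
from pathlib import Path


def setup_env(root):
    """Stand-in for the setup script: prepare and return the paths a cell may use."""
    results = Path(root) / "results"
    results.mkdir(parents=True, exist_ok=True)
    return {"results": results}


def cell_fit(root):
    env = setup_env(root)  # every cell starts from the same setup
    # Results leave the cell only through the filesystem.
    (env["results"] / "fit.json").write_text(json.dumps({"slope": 2.0}))


def cell_report(root):
    env = setup_env(root)
    # No variables are shared with cell_fit; the input is read back from disk.
    fit = json.loads((env["results"] / "fit.json").read_text())
    return f"slope = {fit['slope']}"


project = tempfile.mkdtemp()  # a git-tracked project directory in the real workflow
cell_fit(project)
report = cell_report(project)
```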
 > In my revisionist history, software engineering began when programs got too big to see the whole thing on one screen and intuitively understand.Laughing, politely, at that summary of "history." Software engineering started when there were no screens. Code was written on typewriters (called "terminals") and then printed to cards or long rolls of paper (and occasionally using switches to indicate the 1's and 0's you wanted set as bits in your code). Before the Mac a "screen" of code was pretty universally 24 lines of green or white text on a black background, 80 monospace characters per line. A few specialized terminals had 25x80 so they could show a status line below the 24x80 standard area, but they were rare beasts. Yes, there were a few systems like Plato that predated the Mac but if you need a tl;dr it's the observation that, like most revisionist histories, this one is unfortunately total nonsense when looked at in the context of actual history (even if lots of people who didn't experience the actual history believe it).
 Quite agreed, and it's why I deliberately chose the "revisionist" label.Before screens, there were pages, before pages there were cards... going back to plugboards, etc.Each of us got in at a certain time point. When I learned programming, we had a workstation that simulated a card punch, let you see the contents of one card at a time on a little vacuum-fluorescent display, and stored the data on an 8" floppy. You handed the floppy to the computer operator, and received a printout on green-bar paper.The other way I got into "programming" was staring at a nest of wires, reading my output on an oscilloscope, and making corrections with a soldering iron. ;-)The common theme was that a program gets too large for somebody to look at and understand it, much less for a team to work on it all at once, unless disciplines are adopted that tend to come from software engineering. When I'm mentoring colleagues who are just beginning to get into scientific programming, some of these simple techniques, such as avoiding globals and creating functions, are in fact an improvement.
 It sounds like, instead of putting the sentence in the past tense and calling it a revisionist history, what you are really doing is making a present-tense generalization based on years of historical observation: software engineering begins when the code gets too big to see the whole thing on one screen and intuitively understand (aka Analog31's law).
 Yes, that's perfect.
 Maybe I can provide a bit of perspective on this, as I have lots of conflicting feelings about Jupyter. When I'm doing some bit of data wrangling or just exploratory work with data I quite like it - at least at first. As you pointed out, it's really easy to extremely quickly iterate on things and start to get an idea of what's in the data, what techniques work, and which don't. It's great. Until it isn't. I'm probably "using it wrong" but all the suggestions I've seen for doing it "right" start cutting into the iteration time. If I'm in Jupyter it's because I want to flail around real fast and see what happens. And that causes all sorts of problems. Which cells do I need to rerun together? Which cells should I not run again? What the heck is actually in all these variables right now anyway, because I don't remember which order I've run (and rerun) all the cells in. And what was that one approach or parameter that really worked well that one time? I don't remember. These sorts of things aren't an issue with a non-Jupyter approach. And because I'm taking my time anyway I'm probably being diligent with source control, and all that. Even pulling code out of a notebook into a more stable workflow is a pain, because you have no assurances you can just copy/paste a cell and have it work. In my experience, it never will. It's one of those things that has pretty heavy costs and benefits. And unfortunately they aren't really possible to separate. Not coincidentally, I feel the same about dynamic languages and even less usefully typed static languages. Jupyter takes an already fairly extreme position on the stable(safe)/easy continuum and really ramps it up to a point where it's hard to wrangle time and work done easy into some sort of stability. It's great, and it's awful.
 I think notebooks could be hugely improved by saving a snapshot of state after each cell, and having the linear order of cells only move forward in time. Accidental variable reuse (especially from a cell that is in the "future" or maybe no longer exists) is a huge source of bugs.
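To make the snapshot idea concrete, here is a minimal sketch in plain Python, not a description of how Jupyter or any existing tool works. It assumes each cell is just a string of source code, runs the cells strictly top to bottom, and keeps a deep copy of the variables after each one. The name `run_with_snapshots` is hypothetical.

```python
import copy


def run_with_snapshots(cells):
    """Run cell sources strictly top to bottom, saving the variables after each."""
    namespace = {}
    snapshots = []
    for source in cells:
        exec(source, namespace)
        # Copy user-defined names only; exec() injects "__builtins__" itself.
        # deepcopy is a simplification: it fails on modules, open files, etc.
        snapshots.append({
            name: copy.deepcopy(value)
            for name, value in namespace.items()
            if not name.startswith("__")
        })
    return snapshots


cells = ["x = [1, 2]", "y = sum(x)", "x.append(10)"]
snapshots = run_with_snapshots(cells)
print(snapshots[1])  # the state as it was right after the second cell
```

Because every snapshot is a copy, the later `x.append(10)` cannot reach back and change what the second cell saw, which is the "only move forward in time" property described above.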
 From other comments, it sounds like something called Cocalc might be of interest to you. I haven't even looked at it yet but it sounds like it does at least some of those things.
 > And what was that one approach or parameter that really worked well that one time? I gave a talk at JupyterCon this week about Cocalc, which partly solves this one problem by providing a Time travel slider with complete history.
 I took abstract algebra with Prof. Kedlaya at UCSD who likes to use Cocalc for his other courses, haven't used it though.
 > Which cells do I need to rerun together? Which cells should I not run again. What the heck is actually in all these variables right now anyway, because I don't remember which order I've run (and rerun) all the cells in. And what was that one approach or parameter that really worked well that one time? I don't remember. Right. Get that all the time. My solution is to try to condense the useful code built up across different cells into a reusable block, most often a function. Do that while the 'state' is still fresh in your mind and you remember the order of execution. It also helps to break up a long notebook into sections using headings and markdown cells with comments. I often start by writing down, in prose, the question/problem I am trying to solve; that makes it easier to recover the context of the code cells that follow.
 > My solution is to try to condense the useful code built up across different cells into a reusable block, most often a function. Do that while the 'state' is still fresh in your mind and you remember the order of execution. Right, that's the sort of thing I was talking about in terms of "using it right." It makes sense and is really a good idea, it just starts eating into the advantages you get from using Jupyter in the first place. I'm still trying to figure out how best to integrate it in my workflow. So far I've mostly settled on using it to get some rough ideas about the data and libraries I'm using, and then just reworking from scratch in a more stable setting. As needed I'll go back and try things out there, possibly in a new notebook entirely and just do my best to copy/paste the setup over. In getting it working the setup stuff at least gets consolidated. It's... well awful but eh, I haven't really found a better way.
 Jupyter is also really useful for figuring out how you want to do something before you copy it to your IDE. Every time I work with a new API or library I try it out in jupyter first.
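The advice a few comments up, condensing exploratory cells into a function before the state fades from memory, might look roughly like this. It is an illustrative sketch only: the function name `summarize`, its `trim` parameter, and the sample data are all made up for the example.

```python
import statistics

# Exploratory version, spread over several cells and relying on globals:
#   ordered = sorted(values)
#   kept = ordered[1:-1]
#   statistics.mean(kept)


def summarize(values, trim=0):
    """Drop the `trim` smallest and largest values, then summarize the rest."""
    ordered = sorted(values)
    kept = ordered[trim:len(ordered) - trim] if trim else ordered
    return {
        "n": len(kept),
        "mean": statistics.mean(kept),
        "median": statistics.median(kept),
    }


result = summarize([1, 2, 3, 100], trim=1)
print(result)
```

Once the logic lives in a function, everything it depends on is passed in as an argument, so the order in which earlier cells were run no longer matters, and the function can be moved into a module unchanged.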
 What are the benefits of this vs just an integrated repl?
 There is value when e.g. you need more than simply textual output (think plots, etc.).
 For me it is having a complete view of what I have done previously and rerunning past codeblocks by making changes.
 Being able to re-run and modify cells out of order.
 I am a computational biologist with a heavy emphasis on the data analysis. I did try Jupyter a couple of years ago and here are my concerns with it, compared to my usual flow (Pycharm + pure python + pickle to store results of heavy processing). 1) Extracting functions is harder 2) Your git commits become completely borked 3) Opening some data-heavy notebooks is nigh impossible once they have been shut down 4) Import of other modules you have in local is pretty non-trivial. 5) Refactoring is pretty hard 6) Sphinx for autodoc extraction is pretty much out of the picture 7) Non-deterministic re-runs - depending on the cell execution order you can get very different results. That's an issue when you are coming back to your code a couple of months later and try to figure what you did to get there. There are likely work-arounds for most of these problems, but the issue is that with my standard workflow they are non-issues to start with. In my experience, Jupyter is pretty good if you rely only on existing libraries that you are piecing together, but once you need to do more involved development work, you are screwed.
 My hunch is the same disdain you probably have towards scientists using excel is how many software engineers feel about your using jupyter. It isn't that you aren't getting good results and doing good work. It is that there are many practices that were hammered out in software development that were completely tossed out the window. 4GL comes to mind as a comparison. Makes great demos. Is not likely to lead to good engineering.
 I use Jupyter for two things: 1. Interactive computation and plotting, run data through a pipeline. 2. Quick prototyping and testing. Once a function/class is ready, it goes into a module, together with some unit tests. Nothing beats Jupyter for that second use case: 90% of the bugs are squashed through interaction and inspection in the notebook cells, not through unit tests.
 That some folks can be productive with a tool doesn't say, necessarily, much about the tool. Some could probably make the same two claims for excel. It isn't that the tool should be banned, per se. Just that many practices that have been rather proven in software are much harder to do in this environment. Sounds like what you like is the live coding aspect. Many of the lisp environments of yesteryear would have probably appealed to you. So would many earlier math packages such as matlab and mathematica. Both of those have had environments similar to the new "notebooks" for many years. For the pure coding side, imagine using an environment that actually allowed live redefinition of stuff in the code you had written. Step debugging. Breakpoints. Conditional breakpoints.
 > Some could probably make the same two claims for excelAs a tool Excel has provided astronomical real world value.There are only a small handful of other tools that even come close.
 I'm pretty sure this puts us in agreement. Right? :) I think the counter is there is probably a lot of damage excel has done, as well. I'm not convinced other tools would have resulted in no bugs/problems.
 - hard to use linters - hard to use formatters like yapf - can't use the shortcuts from my editor - easy to screw up the order of execution - harder to review, generate nice diff, fix merge conflicts
 I don't think there's anything inherently wrong with jupyter notebooks, but they run in the browser without behaving in the same way a webpage or web app will in terms of managing state. This is sort of tantamount to opening a website in your browser, but as soon as it loads, you need to refresh it so as to "correct" the state of the site. That's my only complaint about jupyter notebooks. It's a powerful tool but acts a little funky if you're not used to it.
 Not sure why I got downvoted... I have been working in web development for the last 5 years and have started working with Jupyter notebooks in the last 4 weeks while taking some Data Science courses. I really feel like state management with jupyter notebooks is a bit whacky. From my experience having built and maintained websites, it's bizarre to open a notebook with a previous state but have cells not work because the notebook needs to be rerun from top to bottom. In contrast, when opening a web page, the state usually reflects what you might expect without having to refresh the page. I am sure there is some setting to correct this (either rerun on open or don't save state on exit), but it hurts my mental model of how a browser based application should work.
 How do you find it compares to Spyder?
 Spyder is too clunky. I work with a number of remote compute clusters, and like the ability to fire up a jupyter server through ssh, enable port forwarding, and work with the notebook in my browser. If I need a file browser or I want to quickly view two notebooks side-by-side, Jupyter Lab is still more convenient than Spyder. Although it doesn't support all the features of the notebook, and you end up installing additional extensions managed through yarn and webpack (jeez).
 This is a really important comment. I started my career in scientific computing, went over to software/web development for a while, and am back over to scientific computing. I've been involved in teaching efforts lately, mainly for analysts who want to do data wrangling and stats, and I think it is a good environment for some of what they want to do. I never used Jupyter before getting involved in this kind of coding, data wrangling and stats. And it took me a while to get used to it. The "lack of state" is confusing, as well as the apparent non-linearity of a jupyter notebook (it can appear that certain commands are done in order from top to bottom, when if you look at the number next to them, they may have been executed in a completely different order). I actually chalked this up to my background in software development, figuring I was more wired to think of state as something that would be cleared every time a script was run. Analysts (statisticians, engineers, data scientist types) like to interact back and forth very iteratively with their data, and I think they're less likely to be tripped up by some back of mind assumption that you can determine the state of variables and so forth by looking at a bunch of commands as if they were run top to bottom in a file. It got me thinking about when I'd use jupyter notebooks for my own work. In short, if I were writing a program, even just a one page script, I wouldn't use an interactive notebook (I have opened .py files in jupyter and just used it as a text editor). But if I wanted to "do my math homework"? A one off where I need to get an answer to some complicated questions that will require a lot of data wrangling and stats? Yeah, I'd probably reach for jupyter notebook or something like it. Otherwise? I can't remember the O'Reilly book where this was written (sorry I wish I had the cite) but the author wrote that his favorite dev environment is python and a text editor. That's still the case for me as well. I hate to conclude with the disappointing "the problems are when you use it for the wrong task", mainly because I've objected when people have made this argument in the past. If a tool tends to get used for the wrong task, that's partly a problem with who is using it... but we shouldn't let the tool itself off the hook too easily. It's still worth thinking through. I'm worried I'm falling into the trap of making this argument when I like the tool, and objecting to it when I don't. I'll have to think about that a bit. As it stands, though, I haven't had a huge problem with Jupyter Notebook, and I like it a lot, probably because I don't use it when Python + vi (or a "better" but still basic text editor) would do the trick.
 For the times that you mention that you currently use Jupyter, how is it better than simply randomly selecting a small subset of data at the beginning and then writing your code in a full featured IDE with debugging functionality (including the ability to add breakpoints and see variable contents at those points, etc.)? The one line at the beginning to randomly subset the large data set is easy to delete or comment out once you have the code working the way you like - to me it seems more effective to do that in a native IDE like Eclipse+PyDev or in a lightweight IDE like Spyder than to use Jupyter, which requires so many compromises.
 People complaining that notebooks are not good for software engineering are missing one important point – notebooks are very often used by people who are not software engineers and/or not doing software engineering work. (The point is otherwise valid, I have seen software engineering in notebooks and it was horrific.)
 "jupyter is the new bash"As much as I love Jupyter this is a bridge too far even for me. There's nothing about Jupyter that makes sense for non-interactive scripting. Although, I suppose it would help you do block chain calculations... which would also be a terrible idea.
 Inline documentation and charting (which for data tasks is a honking huge deal). It's literate programming with an actual use-case. Very into this idea.
 None of that is what people use bash for, however.
 You'd be surprised. It's certainly not _all_ of what people use bash for, but there are a lot of sketchy ETL (and sampling, and small-scale analytics) jobs out there held together with bash and awk and hopes and dreams, and replacing those with something like this is a substantial win.
 I don't think the full talk for that is out yet but I think it could be brilliant. Literate bash scripts with built in metrics reporting that you can just pull out if necessary? Documentation and selective re-run? Sign me up!
 If Jupyter is not your cup of tea you might like Cauldron, the unnotebook - http://www.unnotebook.com Interview with the author in May, 2017 - https://www.linkedin.com/pulse/why-cauldron-might-right-data...
 As a pretty heavy notebook user (for computer vision and machine learning), I have to say that all of these new features are nice but they don't address the real problems with the current ecosystem. Namely, there's no good way to transition exploratory notebook code to python modules. We need linting and refactoring tools in the notebook and something better than aimport for moving code into separate .py files.
 I despise notebooks being used in production for the reasons given early in the article: you can execute code in any order so authors usually end up with spaghetti and long pages without clear flows. I would rather scientists use and learn the tools that have been developed over decades on collaborative code writing using version control. It helps integrate their solutions too.
 This is a code style problem not a notebook problem: the same people would probably write spaghetti without clear control flow if writing a Python batch script instead of a Notebook.
 I agree, but tools like version control do help a ton to manage and mitigate these kinds of risks. Actually, it could be super super cool if someone figured out how to version control notebooks, including tracking of code execution flow.
 To everyone frustrated with notebooks, I urge you to check out nextjournal.com: - Notebooks are automatically versioned - You can reference and reuse parts of other notebooks immutably through something they call "Transclusions": https://nextjournal.com/nextjournal/transclusions - The technology-stacks underlying each notebook are immutable, meaning that notebooks work on any machine (no "hidden" dependencies). - You can explicitly reference other results instead of relying on global state, making execution order irrelevant. It's currently in private beta, but people are starting to use it in production. (EDIT: You can sign up using the code curryon2018) (Disclaimer: friends of mine are building this and I used to work on it in the past.)
 I love jupyter notebooks, but unfortunately the code editor is unusable to me. I cannot stand the bizarre parenthesis auto-completion, and the autoindentation settings; and there's no easy way to remove them.
 I feel the same way. One workaround is that some editors have a plugin. Emacs has a mode called ein that is fantastic. As you poke around getting the snippets to work, you can cut/paste into a different file/buffer. After you start the Jupyter notebook, copy the login token, run ein:notebooklist-login, then ein:notebooklist-open, and away you go.
 Just be careful to use https://github.com/millejoh/emacs-ipython-notebook instead of the unmaintained https://github.com/tkf/emacs-ipython-notebook. Somehow the latter comes up as my first search result.
 "On the education side of things, Jupyter is quickly gaining adoption in universities around America, particularly for data science courses. Conversely, data science is increasingly becoming students’ first exposure to programming and computer science"Am I the only one this two sentences worries deeply?Or is it just a rehash of the twenty year old "my first exposure to programming was Excel" (which, from what I'm reading is what the article is talking about : a glorified spreadsheet).
 I can’t recommend enough using Atom with the Hydrogen plugin. A Jupyter kernel runs in the back and you can execute a line or multiple lines of code at once just like the web notebooks. But you’re editing a plain code file, not an .ipynb, which is easy to check into git, move to inside a library, and get full IDE tooling on.
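As a rough illustration of the "plain code file with cells" idea, here is a small sketch that splits an ordinary Python source string into cells at marker comments and runs them in order. This is not Hydrogen's implementation. The `# %%` marker is assumed here as the cell separator (a convention several editors use), and `split_cells` is a hypothetical helper.

```python
def split_cells(source, marker="# %%"):
    """Split plain Python source into cells at lines that start with `marker`."""
    cells, current = [], []
    for line in source.splitlines():
        if line.startswith(marker):
            # A marker line closes the previous cell (if it had any code).
            if any(part.strip() for part in current):
                cells.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if any(part.strip() for part in current):
        cells.append("\n".join(current))
    return cells


script = "# %% load\nx = [1, 2, 3]\n# %% summarize\ntotal = sum(x)\n"
cells = split_cells(script)

# Run the cells top to bottom in one shared namespace, like a fresh kernel.
namespace = {}
for cell in cells:
    exec(cell, namespace)
print(namespace["total"])
```

The file stays an ordinary `.py` file, so diffs, linters, and imports keep working; the markers are just comments.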
 When you consider that it is impossible to reuse notebooks in each other... Or unit test them... Or that it is not especially easy to version control them in any sort of branch/merge workflow... Jupyter Notebooks are much closer to Excel spreadsheets than they are to what most people would consider actual programs.
 >impossible to reuse notebooks in each other... Or unit test them... It's quite easy to unit test notebooks. Whenever you write a function you want to test, write some unit tests (or even just asserts) in the same cell where it's defined, then any time the cell is run to define the function, the tests are run too. Likely you'll be doing some manual tests when you write a block of code cum function, and it's not hard to copy the input crafted from those manual tests to a unit test.
 > write some unit tests (or even just asserts) in the same cell it's defined. As many people will do this as presently write tests for their Excel macros or formulas.
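For readers who have not seen the "tests in the same cell" pattern being debated here, a minimal sketch of what one such cell could contain follows. The function `normalize` is invented for the example; the point is only that the asserts sit directly under the definition, so re-running the cell re-runs the checks.

```python
def normalize(values):
    """Scale a list of numbers so that they sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize values that sum to zero")
    return [v / total for v in values]


# Checks live in the same cell: redefining the function re-runs them.
assert normalize([1, 1, 2]) == [0.25, 0.25, 0.5]
assert normalize([5]) == [1.0]
try:
    normalize([0, 0])
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError for an all-zero input")
```

When the function later moves into a module, the same asserts can be lifted almost verbatim into a test file.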
 Except Excel is in some ways much better designed -- it automatically recalculates everything if you change a cell, and makes it easy to link to other spreadsheets.
 Yes there’s no concept of cell dependency in Jupyter. Often find myself just re-running the entire notebook!
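To show what spreadsheet-style cell dependency could mean for a notebook, here is a deliberately naive sketch using the standard `ast` module. It is an assumption-laden illustration, not a feature of Jupyter: cells are plain source strings, only top-level name assignments and uses are tracked, and in-place mutation, attribute access, and imports are ignored. The function names are hypothetical.

```python
import ast


def names_defined_and_used(source):
    """Return (defined, used) sets of variable names for one cell's source."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defined.add(node.name)
    return defined, used


def cells_to_rerun(cells, changed_index):
    """List the cells that become stale when cells[changed_index] is edited."""
    dirty_names = set(names_defined_and_used(cells[changed_index])[0])
    stale = [changed_index]
    for i in range(changed_index + 1, len(cells)):
        defined, used = names_defined_and_used(cells[i])
        if used & dirty_names:
            stale.append(i)
            # Anything this cell defines is now suspect as well.
            dirty_names |= defined
    return stale


cells = ["a = 1", "b = a + 1", "c = 10", "d = b * c"]
print(cells_to_rerun(cells, 0))
print(cells_to_rerun(cells, 2))
```

With this kind of tracking, editing the first cell would flag only the cells downstream of `a`, rather than forcing a re-run of the entire notebook.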
 Spreadsheets have been insanely popular and useful. If the similarity is at all deep, there's opportunity in figuring out version control, unit testing, reusable packages etc, even if they seem hopelessly impossible on the surface.
 I agree that working in bash probably feels snappier, but Jupyter is not as clunky as you describe. My laptop is a mediocre one and I can start Jupyter, navigate to the code and open the editor in about 3 seconds in total. In my experience, kernels work just fine unless you are doing something you shouldn't. My kernel problems occur at points where I have memory issues. Though I agree with you that you don't understand what's wrong once the kernel dies.
 >In my experience, kernels work just fine unless you are doing something you shouldn't. Here's the thing. I'm sure many of the complaints I voiced can be chalked up to lack of experience. But even if it weren't for the issues about documentation and troubleshooting, I'd still have yet another tool to manage, learn to use, learn best practices, etc. Instead of just getting things done with my tool it just gets in the way and tells me there are things I should and shouldn't do. It takes time to learn about it. And now, I'm not sure the investment is worth the effort. Despite its warts, everyone uses bash and it's really easy to learn about it and translate that knowledge into many applications. And, it does the job when I ask it to. Same for Python, IDE/editor usage, LaTeX, and all the other toys in your typical academia toolbox. I'm all for learning. But time is short, deadlines loom close, and the good enough is often the better's worst enemy. And when it turns out the better isn't actually better, welp.
 For what it's worth, I love the fact that Jupyter is browser-based.> It's the entire reason it's so easy to format text / math in a Jupyter notebook. There's Markdown + MathJax for basic interactive formatting, no need to compile or anything. If you need anything more sophisticated (like a box around text, etc.) just throw in a
and it'll automatically show up in the notebook. > I only need one Chrome window (or more, if I choose) for my entire workspace. I can have tabs for notebooks AND webpages open at the same time, in the same program. I don't need to worry about having dozens of windows open! I also don't need to deal with some developer's idea of a tabbed workspace--Chrome's tab work exactly how I need them to.Maybe you could get all the same benefits from an Electron app, but people don't quite like those either :).
 I used to be a bpython lover; now I mainly use ipython. I use jupyter with the ipython kernel once in a while, but for 99% of my use cases ipython is good enough (testing out code, checking help info, etc.). For anything where you want graphs (e.g. matlab), a notebook will serve you well.
 I like notebooks. I like them for data work. I like them for prototyping things that many may consider software engineering. I still think this presentation has a lot of good points about where they could improve.
 I wrote an importer for notebooks that loads and runs the cells based on keywords in the markdown.
 from Joel Grus, “I don’t like Notebooks”
 I found this most exciting: > While Jupyter notebooks have traditionally been a humans-only entrypoint into a program, researchers and companies alike are increasingly using notebooks for automation. It's part of a slow journey back to the power of the Lispms and the Smalltalk environments.
 "jupyter is the new bash" has to be the most underwhelming hyperbole of the decade
 Are you kidding? The shell is a massively important part of modern computers, and making it more expressive would be a huge win. You might be familiar with TermKit[0] which also tried this.
 \${SHELL} is\ horrible\ and\ dated
 Even in its present state, Jupyter is strictly better than plain REPL, especially if you need visualization. It's like Python in general: it was not designed to write 10-100KLOC programs. Python was designed for scripting. Jupyter was designed for interactive data exploration, and to create a record of results which you can view without re-executing the cells. Is it error prone? Yes. But then so is REPL.
 > Python was designed for scripting. It's far more than scripting, and for much bigger programs than 10k LOC... My company's Python codebase is 35m LOC/500k modules, with 25k commits per week and contributions from 2.5k developers per month. To my mind, Python has the opposite problem. I've been a Python programmer since 2000. Every few years, I think I should devote more time to other languages (first Java, then R, then Haskell, then JS, ...). But the Python ecosystem just keeps getting stronger. My current focus is data science and machine learning. In these domains, there are no good reasons to drop Python.
 JS was also designed for scripting
 > Python was designed for scripting.Can you provide a citation?
 I don’t provide citations for such obvious things. If you’re interested, you can find the origin story as written by Guido himself in Python FAQ using any of the available search engines.
 I was not able to find it. What I was able to find is this: "Python is an interpreted, interactive, object-oriented programming language." https://docs.python.org/3/faq/general.html#what-is-python Yes, some people do write scripts in Python. Does that make Python a scripting language? Maybe in your definition of scripting language.
 A bit below your section you will find https://docs.python.org/3/faq/general.html#why-was-python-cr... Notable part: [..]We needed a better way to do system administration[..]
 You should never use notebooks - period. My TL;DR: • Confusing for beginners • Encourage bad practices • Poor editor
 Why not?
 My TL;DR: • Confusing for beginners • Encourage bad practices • Poor editor
 As many already said, I think it is a great tool for prototyping and data exploration, but when it comes to moving code to production, for me it makes very little sense to use it. Netflix said that if the job breaks they can enter the notebook with the data and see what is wrong. For me it feels like they did development with zero safeguards, and if it breaks they check why, instead of logging problems and dealing with edge cases in the code beforehand.
 I'm not sure exactly what Netflix means by "if the job breaks", but I've used jupyter notebook in production to do ML before. If you consider the notebook as a way to augment your logs with plots, it might make a bit more sense. Running a jupyter notebook is a nice way to generate an HTML report for a job. For a typical ML pipeline, you first plot some stats about the input data, then train some model, plot some training loss, a confusion matrix, some examples of predictions, etc... If some job gives a strange result (maybe that's what they mean by break), having the notebook rendered as an HTML page with all the plots is a very effective way to do a first round of diagnostics. You can also start the notebook with the same parameters and 'run' through your report, which is a nice way to do interactive debugging. Also in this case, the notebook itself was quite small in terms of lines of code. All the functions were implemented in modules, so it's really like the notebook is your 20-line 'main' function. So you need some discipline among your team.
