An alternative strategy is to wrap your C++ code in a library and use pybind11 for lightweight bindings and Python-based interactivity. This approach means more glue code, but Python has much better data science tooling.
I think Cling will see the greatest opportunity in algo tuning and multi-platform development. That's a different kind of data science, focused more on benchmarking than consumer problems. I just wish the C++ build tool JSON ecosystem were friendlier.
 https://cdn.rawgit.com/root-project/cling/master/www/index.h... See "Embedding Cling"
With that said, this project looks amazing. Interactive/interpreted C/C++ would be great to have even outside of data science applications.
Maybe for you. For myself (a professional data scientist) and every other professional data scientist I have worked with, notebooks are useful and better than Windows Notepad, otherwise we would all be using Windows Notepad.
This is such a strongly worded post that I have to wonder what exactly you are trying to do with notebooks, that makes it not just suboptimal but atrocious. Because it sounds like you're using it for something other than its intended purpose, in which case of course it's atrocious, because you're using a tool for something other than what it was designed to do.
If you think of a notebook like a tool for literate programming, then sure, I can see why you might have issues with them. They don't integrate well with text editors, other tooling such as test runners, or even the Python import system. I would strongly recommend against writing a library in Jupyter. I would strongly recommend against writing low-level CUDA code in a Python notebook. I would weakly recommend against writing production ETL pipelines in a notebook (although if your ETL pipeline is just untested spaghetti code, you might as well do it in a notebook).
Notebooks are not competing with IDEs, text editors, or any other dev tooling. Notebooks are an alternative to REPLs. If you use a notebook like a super-powered REPL, you are going to get a lot out of it. I can't think of a single complaint about a notebook that can't also be leveled at an "Editor+REPL" type of workflow, and I can think of many problems with the Editor+REPL setup that are solved by notebooks.
If you prefer using Spyder and RStudio with the console and plot window, that's great. I honestly do too. There is a lot to dislike about Jupyter Notebook and Jupyter Lab specifically. But notebooks in general, and Jupyter specifically, are still the most efficient tool I've found for doing something, keeping track of / taking notes on what I did, and displaying the results in a way that I and my colleagues can easily understand what I did.
There are also other notebook formats such as RMarkdown and Pluto, both of which have much nicer plain-text representations than Jupyter Notebook. I'm a big fan of those, too. But there are tradeoffs in every design and I'm not about to malign the Jupyter project (or its ancestor IPython) for its choice to use JSON.
The key qualifier is 'from a development perspective'. Notebooks do have their usefulness. I do some data-science related tasks but I'm mostly a developer, so admittedly that could skew my perspective.
> If you think of a notebook like a tool for literate programming...
I mean, that's kind of a selling point of notebooks, though...
> They don't integrate well with text editors
They don't integrate well with anything, not even other notebooks! From that alone I sense strong Word vibes.
> If you prefer using Spyder and RStudio with the console and plot window, that's great.
My usual setup is pretty much the same as yours with Visual Studio Code, but yeah, that was the point... So notebooks are at least somewhat competing in the same space as IDEs after all? Is there a subtlety I'm missing, or are "I prefer IDEs" and "Notebooks are not competing with IDEs" in direct contradiction?
> keeping track of / taking notes on what I did, and displaying the results in a way that I and my colleagues can easily understand what I did.
Again, I'm not arguing with that; I was talking about notebooks as a development tool. I can very well see the appeal of notebooks as a presentation/storytelling/documentation format, and to an extent I even agree they have their usefulness there, and I certainly have no qualms with their usage in that niche. But I just couldn't bear to sit in front of Jupyter instead of VSC to code something, hence my strong wording.
When you want to run stuff in production, you move to a regular code dev environment (or handoff to developers).
Now some things in business are infrequent enough that you can live with Jupyter for the once-a-quarter-or-so runs when you need to rerun updates, but that's not the same as production code that runs daily/weekly.
Spyder and RStudio are very much not oriented toward traditional software development and very much oriented toward interactive work between the text editor window and the console/REPL window. That is, they are fancy implementations of the Editor+REPL workflow. The fact that you can develop software in that kind of environment is not the point.
It doesn't have a huge audience at present, but https://nbdev.fast.ai/ is doing just that. I'm personally of the opinion that those looking to use notebooks as a programming environment would be better suited by something like Smalltalk. That is, something that seamlessly blends between stored programs and the runtime. You can see some progression towards this in VS Code (I believe there's an issue about exposing variables evaluated in the ipython REPL in autocomplete), so I'm not terribly bullish on Jupyter-style notebooks being the next big thing.
Really? Developing software in Jupyter must be almost as common as the "just code an Excel macro bro" approach.
>weakly recommend against writing production ETL pipelines in a notebook
I find prototyping and developing ETL code in notebooks to be the most efficient, especially when dealing with new and unfamiliar data sources. The interactivity and feedback loop make defining what an ETL process should be doing a lot easier.
That said, once it's working, everything should get moved out into a module with tests.
A notebook is great if you don't change anything, but at that point it's basically a REPL with inline graphics, which is usually what I really want...
It does take a certain discipline to make reproducible notebooks. I like to always end the day with a "restart and run all" if possible.
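To make the "restart and run all" discipline concrete, here's a minimal sketch of what that check amounts to: run every cell top-to-bottom in a fresh namespace, so any dependence on hidden out-of-order state surfaces as an error. The `run_all` helper is hypothetical, not part of Jupyter; real cells live in the .ipynb file, but plain strings illustrate the idea.

```python
# Hypothetical sketch: emulate "Restart & Run All" on a list of cell
# sources, to check that a notebook doesn't depend on hidden state.
def run_all(cells):
    ns = {}  # fresh namespace, as after a kernel restart
    for i, src in enumerate(cells):
        try:
            exec(src, ns)
        except Exception as e:
            raise RuntimeError(f"cell {i} failed on a clean run: {e}") from e
    return ns

cells = [
    "x = 2",
    "y = x * 21",
]
ns = run_all(cells)
print(ns["y"])  # 42
```

A cell that only worked because of something you ran earlier and then deleted (say, referencing a name no surviving cell defines) fails immediately here, which is exactly what the end-of-day clean run is meant to catch.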
Give me an indicator that a cell is stale, but I don't want eager execution on all the time.
How so? You can execute lines of a file out-of-order in a REPL just as easily as you can execute notebook cells out-of-order.
From my experience, the problem is that many people in DS positions, or who start with notebooks, are unaware of what editors are for and don't have much grasp of code maintainability or collaboration.
They simply don't know what they don't know, so their notebooks are spaghetti code mixed with tons of random output that was a sanity check for them at the time, but now not even they know why they printed or displayed it.
IMO their ideal use case is in tasks like communicating homework exercises to a teacher. Or for tiny stand-alone projects.
The problems they are designed to solve are, supposedly, numerous: reproducibility, literate programming, interactive exploratory analysis. But they end up getting in the way more than helping.
1) Reproducibility: nothing in a notebook is more reproducible than a simple source file. And you cannot really reuse anything within the notebook. Notebooks make re-running the written code more convenient for people not familiar with your stack. But, as mentioned above, that loses its appeal in any substantial project, where you will not be able to share the whole dataset that fuels the notebook.
2) Literate programming: it works only when you don't change anything, or make the notebook very abstract. If you do an analysis on a prepared dataset, add some comments, draw some pictures, then you have a "literate programming" document. But it lives only as long as nothing changes. If you change the dataset it depends on, you will likely have to change the comments in the notebook too; you cannot just re-run it. And in my experience, nobody besides the original author reads the comments added to a notebook. Almost every time somebody sent me a notebook, I had additional questions about the results, which required me to interact and ask anyway. It also burdens you with describing what you do twice: once in code, and again in text. You can either guess what the recipients will ask about and try to cover all corners, or simply answer those questions directly in less time.
3) Interactive exploratory analysis - notebooks slow it down. Writing simple functions in an editor and sending them to a REPL or simply exploring things in the REPL directly - both are faster alternatives. With notebooks you have to care about presenting the figures and pictures and comments from step one. In a typical exploratory analysis you are just trying things out without any idea of preserving them, before you stumble upon something useful.
4) On top of all that, notebooks make things difficult to change. Say you have a notebook that loads 3 datasets, compares them with pictures and tables, transforms them, then compares them some more. And you decide to change the color of one picture. You will either have to re-run the notebook from the start just to change that picture, or start using a cache, which, in my experience, is never the better option.
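For concreteness, the caching alternative mentioned above usually amounts to something like this stdlib-only sketch (pickle + hashlib; all names here are illustrative, not from any particular tool). The comment marks exactly where the staleness risk creeps in:

```python
# Minimal sketch of caching intermediate notebook results to disk.
import hashlib
import pickle
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp())  # a real notebook would use a project dir

def disk_cache(fn):
    """Cache a function's result on disk, keyed by its name and args."""
    def wrapper(*args):
        key = hashlib.sha256(pickle.dumps((fn.__name__, args))).hexdigest()
        path = CACHE_DIR / f"{key}.pkl"
        if path.exists():
            # Stale-cache risk: if the underlying data or the function
            # body changed, the key doesn't, and you get yesterday's result.
            return pickle.loads(path.read_bytes())
        result = fn(*args)
        path.write_bytes(pickle.dumps(result))
        return result
    return wrapper

calls = []

@disk_cache
def load_dataset(name):
    calls.append(name)       # track how often the "expensive" step runs
    return [len(name)] * 3   # stand-in for an expensive load/transform

first = load_dataset("sales")
second = load_dataset("sales")  # served from disk; the body never runs
print(first, len(calls))
```

The second call skips the expensive step, which is the appeal; the invisible invalidation problem is why the approach tends to backfire.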
Just adding an alternative opinion.
It is. I still use them regularly. I also avoid them like the plague regularly. It all depends on whether I happen to be wearing my Type A Data Scientist hat or my Type B Data Scientist hat at the moment.
I see why scientists use it, but it lacks any FFI!
Having said this, for anything that goes even slightly in the direction of production (like some kind of model training and/or processing pipelines), you should absolutely not use it.
I get this angle and agree to an extent: demos/throwaway code is where they really shine. But I don't entirely buy the "literate" aspect (disclosure: I assist in some DS tasks but I'm mostly a developer, so that might skew my perspective).
Knuth's WEB, love it or hate it, was at least arbitrarily and easily composable, notebooks on the other hand have hidden state and plenty of footguns even for those who are already familiar with the language. It's basically impossible to compose notebooks in any meaningful way.
I think you could make the idea work (or at least work better) with a purely functional language, say Haskell, instead of Python. Purely functional semantics lend themselves to natural interpretations closer to the notebook format than OOP semantics do.
It’s a tutorial on how to use some NLP ML library. It’s a pretty excellent example of where this format shines, imho, and as you can see it’s absolutely literate.
Btw I also identify as primarily a developer, but over the past few years been confronted with these notebooks more and more and I’ve learned to appreciate their use over time. But as with all good things, they’re sometimes overused.
Composability... eh. You can run a Python notebook from the command line just like a Python source file. You can even run one notebook from another, or "compile" the notebook to Python source and import from it. But that's somewhat of an "off-label" use to me. The idea is not to be a general-purpose literate programming tool, although it turns out to be pretty close to one.
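The "compile to Python source" trick is less exotic than it sounds, because an .ipynb file is just JSON: cells carry a `cell_type` and their `source` as a list of lines. Here's a sketch using only the stdlib, with a minimal hand-made notebook dict standing in for a real file (in practice you'd reach for `jupyter nbconvert --to script`, which also handles magics and outputs):

```python
# Extract the code cells from a (minimal, hand-made) notebook JSON
# and stitch them into importable Python source.
import json

nb_json = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Analysis notes\n"]},
        {"cell_type": "code", "source": ["def double(x):\n", "    return 2 * x\n"]},
        {"cell_type": "code", "source": ["result = double(21)\n"]},
    ],
    "nbformat": 4,
    "nbformat_minor": 5,
})

nb = json.loads(nb_json)
source = "\n".join(
    "".join(cell["source"])        # each cell's source is a list of lines
    for cell in nb["cells"]
    if cell["cell_type"] == "code"  # markdown cells are dropped
)
print(source)
```

Running the extracted `source` behaves like an ordinary script, which is exactly why importing from a notebook is possible at all, even if it's off-label.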
It's a notebook, like a lab notebook. You do stuff and take notes on that stuff, and it keeps the results of what you did + your notes all together in one file that can be rendered to HTML or PDF. That's fucking fantastic IMO. Once you're at the point where you're trying to compose a lot of notebooks you should strongly consider opening an IDE and developing a project-local library instead. Or at least using some kind of task runner / data flow engine thing.
I hate to be the "no true Scotsman" guy about notebooks, but all of this anti-notebook invective just doesn't make sense to me. Go ahead and write all you want about how notebooks are not an IDE replacement - that's fine, and nobody (at least nobody that I respect) ever claimed they were.
Based on my experience as a physics student, it encourages bad code (DRY violations, etc.), but considering academics often have no real training as programmers, as opposed to computer practitioners, I guess it's better than them dumping raw C into a file and never touching it again?
A simple REPL is all that's needed to both do A-type and B-type data exploration. (I won't use the term "data scientist", it's an exaggeration in most cases.)
Python has a REPL, R has a REPL, Perl has PDL and both a simple REPL (https://github.com/viviparous/preplish) and a more complex one (https://metacpan.org/pod/Reply).
Jupyter should not be used as an IDE because it is the wrong tool for development. A-type data explorers just want a painless UI and may not care much about the horrible agglutination of incomplete/slow/broken solutions that Jupyter represents.
I used to do EDA in REPLs when I was a more research-focused scientist, but I've moved over to the engineering side in the last few years.
What made the transition natural for me is that I never skimped on reproducibility when doing EDA. I've seen other scientists, once they have something, "toss it over" and have the entire thing rewritten; that isn't really scientists owning their work.
Nowadays I do test-driven EDA, i.e. via pytest. Write the test first, describing what you're "exploring for", and then the code to document how you got there. As a bonus, an engineer can rewrite it later, since you've done them the favor of writing tests they can validate against.
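A minimal sketch of what that looks like in practice (the dataset, field names, and threshold are made up for illustration; pytest would collect the `test_` function automatically):

```python
# Test-first EDA sketch: the assertion records what you were
# "exploring for", the helper records how you got it.
def completeness(rows, field):
    """Fraction of rows with a non-missing value for `field`."""
    present = sum(1 for r in rows if r.get(field) is not None)
    return present / len(rows)

def test_age_field_is_mostly_populated():
    rows = [  # stand-in for the real data load
        {"age": 31}, {"age": 45}, {"age": None}, {"age": 28},
    ]
    # The finding being pinned down: 3 of 4 records carry an age.
    assert completeness(rows, "age") == 0.75

test_age_field_is_mostly_populated()
```

The nice property is that the "finding" is now executable: rerun it against next quarter's data and it either still holds or fails loudly, instead of silently rotting in a notebook cell.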
That explains their popularity.
The xeus-cling work is awesome and has made it possible to do data science prototyping in C++. There are lots of other C++ notebook examples in the examples repository: https://github.com/mlpack/examples/
I also managed to prototype interactive audio programming with http://bela.io with Cling https://gist.github.com/jarmitage/6e411ae8746c04d6ecbee1cbc1...
But, beyond that, even being comfortable with polyglot programming, it's still better when you've got what you need in one place.
I consider that a problem, more so the way JIT efforts tend to be downplayed in the community.
I wonder how many users outside HEP there are though.