Interactive C++ for Data Science (llvm.org)
157 points by Bootvis on Dec 23, 2020 | 56 comments

Lovely to see Cling progress so much (see the CUDA+jupyter example on the page), but a major challenge remains: dependencies. It used to be that you had to use a forked LLVM just to use Cling, but major kudos they cleaned that up. However for large C++ projects with all sorts of compiler flags, include paths, etc, it’s still a major effort to get stuff running.

An alternative strategy is to wrap your C++ stuff in a library and use pybind11 for lightweight bindings and python-based interactivity. This approach means more glue code, but python has much better data science tooling.

I think Cling will see the greatest opportunity in algo tuning and multi-platform development. That’s a different kind of data science, focused more on benchmarking than consumer problems. I just wish the C++ build tool JSON ecosystem were friendlier.

To be fair, dependency management is a C++ issue in general, so it's not the Cling team's fault.

Oh yeah sorry I'm not blaming cling at all. It might be easier if cling were completely baked into llvm, but obviously there's a debate to be had about that. I see today you can embed cling[1] which is a lot easier than when you had to use ROOT. Still not the simplest thing though.

[1] https://cdn.rawgit.com/root-project/cling/master/www/index.h... See "Embedding Cling"

In this case it's more of an LLVM / no stable ABI issue.

I really wish the data science/algo community weren't going in the direction of Jupyter/Notebooks. It's an atrocious experience from a development standpoint, several steps backwards from even just editing code in Windows notepad.

With that said, this project looks amazing. Interactive/interpreted C/C++ would be great to have even outside of data science applications.

> several steps backwards from even just editing code in Windows notepad.

Maybe for you. For myself (a professional data scientist) and every other professional data scientist I have worked with, notebooks are useful and better than Windows Notepad, otherwise we would all be using Windows Notepad.

This is such a strongly worded post that I have to wonder what exactly you are trying to do with notebooks, that makes it not just suboptimal but atrocious. Because it sounds like you're using it for something other than its intended purpose, in which case of course it's atrocious, because you're using a tool for something other than what it was designed to do.

If you think of a notebook like a tool for literate programming, then sure, I can see why you might have issues with them. They don't integrate well with text editors, other tooling such as test runners, or even the Python import system. I would strongly recommend against writing a library in Jupyter. I would strongly recommend against writing low-level CUDA code in a Python notebook. I would weakly recommend against writing production ETL pipelines in a notebook (although if your ETL pipeline is just untested spaghetti code, you might as well do it in a notebook).

Notebooks are not competing with IDEs, text editors, or any other dev tooling. Notebooks are an alternative to REPLs. If you use a notebook like a super-powered REPL, you are going to get a lot out of it. I can't think of a single complaint about a notebook that can't also be leveled at an "Editor+REPL" type of workflow, and I can think of many problems with the Editor+REPL setup that are solved by notebooks.

If you prefer using Spyder and RStudio with the console and plot window, that's great. I honestly do too. There is a lot to dislike about Jupyter Notebook and Jupyter Lab specifically. But notebooks in general, and Jupyter specifically, are still the most efficient tool I've found for doing something, keeping track of / taking notes on what I did, and displaying the results in a way that I and my colleagues can easily understand what I did.

There are also other notebook formats such as RMarkdown and Pluto, both of which have much nicer plain-text representations than Jupyter Notebook. I'm a big fan of those, too. But there are tradeoffs in every design and I'm not about to malign the Jupyter project (or its ancestor IPython) for its choice to use JSON.

> For myself (a professional data scientist) and every other...

The key qualifier is 'from a development perspective'. Notebooks do have their usefulness. I do some data-science related tasks but I'm mostly a developer, so admittedly that could skew my perspective.

> If you think of a notebook like a tool for literate programming...

I mean, that's kind of a selling point of notebooks, though...

> They don't integrate well with text editors

They don't integrate well with anything, not even other notebooks! From that alone I sense strong Word vibes.

> If you prefer using Spyder and RStudio with the console and plot window, that's great.

My usual setup is pretty much the same as yours with Visual Studio Code, but yeah, that was the point... So notebooks are at least somewhat competing in the same space as IDEs after all? Is there a subtlety I'm missing or "I prefer IDEs" and "Notebooks are not competing with IDEs" are in direct contradiction?

> keeping track of / taking notes on what I did, and displaying the results in a way that I and my colleagues can easily understand what I did.

Again, I'm not arguing with that, I was talking about notebooks as a development tool. I can very well see the appeal of notebooks as a presentation/storytelling/documentation format, and to an extent even I agree they have their usefulness there, and I certainly have no qualms with their usage in that niche. But I just couldn't bear to sit in front of Jupyter instead of VSC to code something, hence my strong wording.

Notebooks are primarily a discovery tool when you're exploring your data, deciding what ETL transformations and feature engineering to create and testing algorithms.

When you want to run stuff in production, you move to a regular code dev environment (or handoff to developers).

Now some things in business are infrequent enough that you can live with Jupyter for the once a quarter or so when you need to rerun updates but that's not the same as production code that runs daily/weekly.

My point is that notebooks are not a development tool, so it makes no sense to criticize them from that perspective. Nobody is asking or even suggesting that you develop software in a notebook, and I haven't met anyone who does it.

Spyder and RStudio are very much not oriented toward traditional software development and very much oriented toward interactive work between the text editor window and the console/REPL window. That is, they are fancy implementations of the Editor+REPL workflow. The fact that you can develop software in that kind of environment is not the point.

> Nobody is asking or even suggesting that you develop software in a notebook

It doesn't have a huge audience at present, but https://nbdev.fast.ai/ is doing just that. I'm personally of the opinion that those looking to use notebooks as a programming environment would be better suited by something like Smalltalk. That is, something that seamlessly blends between stored programs and the runtime. You can see some progression towards this in VS Code (I believe there's an issue about exposing variables evaluated in the ipython REPL in autocomplete), so I'm not terribly bullish on Jupyter-style notebooks being the next big thing.

> ...and I haven't met anyone who does it.

Really? Developing software in Jupyter must be almost as common as the "just code an Excel macro bro" approach.

Working on an analytics team myself, I strongly agree with everything in your reply. I just want to nitpick one bit:

>weakly recommend against writing production ETL pipelines in a notebook

I find prototyping and developing ETL code in notebooks to be the most efficient. Especially when dealing with new and unfamiliar data sources. The interactivity and feedback loop makes defining what an ETL process should be doing a lot easier.

That said, once it's working, everything should get moved out into a module with tests.

The problem with a notebook (that you don't have with a REPL) is the execution order is sort of arbitrary and changes don't propagate well.
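A minimal sketch of that failure mode, assuming a kernel can be modeled as one shared namespace that cells are `exec`'d into. The cell contents and variable names are made up for illustration:

```python
# Model a kernel as a single shared namespace; each "cell" is a string of code.
kernel = {}

cell_1 = "threshold = 10"
cell_2 = "flagged = [x for x in [5, 12, 20] if x > threshold]"

exec(cell_1, kernel)
exec(cell_2, kernel)
first_run = kernel["flagged"]      # computed with threshold = 10

# Edit cell 1 and re-run only that cell, as one might in a notebook.
cell_1 = "threshold = 15"
exec(cell_1, kernel)
stale = kernel["flagged"]          # unchanged: cell 2 was never re-run

# "Restart and run all": a fresh namespace, cells executed top to bottom.
fresh = {}
for cell in (cell_1, cell_2):
    exec(cell, fresh)
after_restart = fresh["flagged"]

print(first_run, stale, after_restart)
```

The page now shows `threshold = 15` next to a `flagged` that was computed with 10, and nothing flags the mismatch until the fresh run.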

A notebook is great if you don't change anything, but at that point it's basically a REPL with inline graphics, which is usually what I really want...

This is a pain (state gets confusing), but the statefulness is also a huge boon when rapid prototyping. You don't need to rerun calculations to tweak a single variable. Sometimes the data preprocessing can be really expensive.

It does take a certain discipline to make reproducible notebooks. I like to always end the day with a "restart and run all" if possible.

Right I guess what is needed is some way for cell dependencies to be known, so that all cells that depend on a changed cell will automatically update.

Just restart your kernel if you think something spooky related to state is occurring. This is the number one criticism I see of notebooks, but it is so easy to deal with if you are aware of it. The benefits of being able to rerun an isolated part of your code far outweigh the harms IMO.

It's done in Pluto.jl: Julia notebooks that track cell dependencies and rerun dependent cells when a variable’s value changes.

Maybe I am misunderstanding how the system works but I don't actually want that kind of Excel-like eager behavior like 90% of the time. Sure it's neat for simple computations but some cells take hours to run in my workflow.

Give me an indicator that a cell is stale but I don't want eager execution on all the time.
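A rough sketch of what such a stale indicator could look like, using only the standard `ast` module. This is not how Pluto or Jupyter do it; `names_defined_and_used` and `stale_cells` are hypothetical helpers, and the analysis is deliberately naive (it assumes cells run top to bottom and ignores attribute mutation, augmented assignment, imports, and function scoping):

```python
import ast

def names_defined_and_used(source):
    """Return (defined, used) sets of variable names appearing in a cell."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
    return defined, used

def stale_cells(cells, changed_index):
    """Indices of later cells that transitively depend on the changed cell."""
    dirty, _ = names_defined_and_used(cells[changed_index])
    stale = []
    for i in range(changed_index + 1, len(cells)):
        defined, used = names_defined_and_used(cells[i])
        if used & dirty:
            stale.append(i)
            dirty |= defined   # whatever this cell defines is now suspect too
        else:
            dirty -= defined   # redefined from clean inputs, so no longer suspect
    return stale

cells = [
    "raw = [3, 1, 2]",
    "clean = sorted(raw)",
    "label = 'demo'",
    "top = clean[-1]",
]
print(stale_cells(cells, 0))  # → [1, 3]
```

Nothing is re-executed here; the result is just a list of cells to mark, which leaves the decision to rerun an hours-long cell with the user.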

> The problem with a notebook (that you don't have with a REPL) is the execution order is sort of arbitrary and changes don't propagate well.

How so? You can execute lines of a file out-of-order in a REPL just as easily as you can execute notebook cells out-of-order.

I like notebooks for EDA and some prototyping but then move to VS Code.

From my experience, the problem is that many people in DS positions or who start with notebooks are unaware of what editors are for and don’t have much grasp on code maintainability or collaboration.

They simply don’t know what they don’t know, so their notebooks are spaghetti code mixed with tons of random displayed things that were sanity checks for them at the time, but now not even they know why they printed or displayed them.

Don't know about Jupyter, but I've been using Rmarkdown for as long as it has been available. Recently, however, I've been moving away from Rmarkdown and I don't miss it.

IMO their ideal use case is in tasks like communicating homework exercises to a teacher. Or for tiny stand-alone projects.

The problems they are designed to solve are, supposedly, numerous: reproducibility, literate programming, interactive exploratory analysis. But they end up getting in the way more than helping.

1) Reproducibility: nothing in the notebook is more reproducible than a simple source file. And you cannot really reuse anything within the notebook. Notebooks make re-running the written code more convenient for people not familiar with your stack. But, as mentioned above, that loses its appeal in any substantial project, where you will not be able to share all the data that fuels the notebook.

2) Literate programming: it works only when you don't change anything, or make the notebook very abstract. If you do an analysis on a prepared dataset, add some comments, draw some pictures, then you have a "literate programming" document. But it lives only as long as nothing is changed. If you change the dataset it depended on you will likely have to change the comments in the notebook. You cannot just re-run it. And the comments added to the notebook - in my experience - nobody reads them, besides the original author. Almost every time somebody sent me a notebook I had additional questions about the results, which required me to interact and ask questions. It just adds additional burden on your part to describe what you do two times: once in the code, and another time in text. You can either guess what the recipients will be asking about and try to cover all corners, or simply answer those questions directly in a shorter amount of time.

3) Interactive exploratory analysis - notebooks slow it down. Writing simple functions in an editor and sending them to a REPL or simply exploring things in the REPL directly - both are faster alternatives. With notebooks you have to care about presenting the figures and pictures and comments from step one. In a typical exploratory analysis you are just trying things out without any idea of preserving them, before you stumble upon something useful.

4) On top of all that notebooks make things difficult to change. Say you have a notebook that loads 3 datasets, compares them with pictures and tables, transforms them, then compares them some more. And you decide to change the color of one picture. You either will have to re-run this notebook from the start just to change the picture. Or, alternatively, you start using cache, which, in my experience, is never a better option.

Just adding an alternative opinion.

I have to disagree with your assessment of notebooks. I use them mostly for exploratory data analysis. It’s great for being able to view multiple graphs at the same time, tweak some parameters and see what changes. Then of course being able to easily share your findings. I agree as a development environment it has a lot of drawbacks. Hard to debug and global variables galore. I don’t view them as an end to end solution but as one step in the workflow

Don't forget also documenting your findings. Without a notebook, results end up divorced from code stuffed into a folder with weird names. To properly document you have to write something in latex or word or markdown, but with a notebook you can keep code, graphs, and notes all together.

I don't like Jupyter notebooks either. Recently tried Pluto notebooks and they're way better. Pluto is reactive: you can change the value of a variable in a cell, for example, and the change will propagate to all cells that depend on that value. In the end you can save the code and run it normally outside of a notebook. For now these are Julia based, but Julia's a pretty nice alternative for data science at this point - much faster than Python, higher level (more Python-like) than C++.


I strongly recommend Jeremy Howard's (founder of fast.ai) recent talk - "I Like Notebooks" (https://youtu.be/9Q6sLbz37gk). He addresses some of the concerns/critiques about jupyter notebooks.

> I really wish the data science/algo community weren't going in the direction of Jupyter/Notebooks. It's an atrocious experience from a development standpoint

It is. I still use them regularly. I also avoid them like the plague regularly. It all depends on whether I happen to be wearing my Type A Data Scientist hat or my Type B Data Scientist hat at the moment.

What are type A and Type B?

> I really wish the data science/algo community weren't going in the direction of Jupyter/Notebooks. It's an atrocious experience from a development standpoint

I see why scientists use it, but it lacks any FFI!

Just use Python libraries, wrappers, gRPC, or Thrift.

I think the whole selling point of notebooks is their literate style, and it being a kind of collaborative, shareable scratch pad. I’ve done so many demos and pitches of ideas with notebooks, it’s a super useful tool for story telling and reproducible tutorials.

Having said this, for anything that goes even slightly in the direction of production (like some kind of model training and/or processing pipelines), you should absolutely not use it.

> I think the whole selling point of notebooks is their literate style...

I get this angle and I agree to an extent, demos/throwaway code is where they really shine, but I don't entirely buy the "literate" aspect (disclosure: I assist in some DS tasks but I'm mostly a developer, so that might skew my perspective).

Knuth's WEB, love it or hate it, was at least arbitrarily and easily composable, notebooks on the other hand have hidden state and plenty of footguns even for those who are already familiar with the language. It's basically impossible to compose notebooks in any meaningful way.

I think you could make the idea work (or at least work better) with a purely functional language, say Haskell, instead of Python. Purely functional semantics are more amenable to natural interpretations closer to the Notebook format than an OOP.

Just a random example I encountered the other day, take a look at this: https://github.com/huggingface/transformers/blob/master/note...

It’s a tutorial on how to use some NLP ML library. It’s a pretty excellent example of where this format shines, imho, and as you can see it’s absolutely literate.

Btw I also identify as primarily a developer, but over the past few years been confronted with these notebooks more and more and I’ve learned to appreciate their use over time. But as with all good things, they’re sometimes overused.

The whole accusation of hidden state makes no sense to me. Jupyter Notebook has "restart & run all" button. It's no different from keeping a REPL open all day. If you don't restart your REPL before re-running your program, you have at least as much of a hidden state problem as any notebook. And don't get me started on spreadsheets.

Composability... eh. You can run a Python notebook from the command line just like a Python source file. You can even run one notebook from another, or "compile" the notebook to Python source and import from it. But that's somewhat of an "off-label" use to me. The idea is not to be a general-purpose literate programming tool, although it turns out to be pretty close to one.
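To make the "compile to Python source" point concrete, here is a stdlib-only sketch. It assumes the notebook follows the nbformat 4 JSON layout (a top-level `cells` list whose entries have `cell_type` and `source`) and contains no IPython magics. `notebook_to_source` is a hypothetical helper; in practice `jupyter nbconvert --to script` does this job properly:

```python
import json

def notebook_to_source(notebook_json):
    """Concatenate the code cells of an nbformat-4 style notebook into one script."""
    nb = json.loads(notebook_json)
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        src = cell.get("source", "")
        # "source" may be stored as a single string or as a list of lines.
        chunks.append(src if isinstance(src, str) else "".join(src))
    return "\n\n".join(chunks) + "\n"

# A tiny hand-written notebook, for illustration only.
example = json.dumps({
    "nbformat": 4, "nbformat_minor": 5, "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes\n"]},
        {"cell_type": "code", "metadata": {}, "outputs": [], "execution_count": None,
         "source": ["x = 2\n", "y = x * 21"]},
        {"cell_type": "code", "metadata": {}, "outputs": [], "execution_count": None,
         "source": "print(y)"},
    ],
})

script = notebook_to_source(example)
namespace = {}
exec(script, namespace)  # runs the cells top to bottom in a fresh namespace
```

Running the extracted script in a fresh namespace is effectively "restart and run all" without the notebook UI.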

It's a notebook, like a lab notebook. You do stuff and take notes on that stuff, and it keeps the results of what you did + your notes all together in one file that can be rendered to HTML or PDF. That's fucking fantastic IMO. Once you're at the point where you're trying to compose a lot of notebooks you should strongly consider opening an IDE and developing a project-local library instead. Or at least using some kind of task runner / data flow engine thing.

I hate to be the "no true Scotsman" guy about notebooks, but all of this anti-notebook invective just doesn't make sense to me. Go ahead and write all you want about how notebooks are not an IDE replacement - that's fine, and nobody (at least nobody that I respect) ever claimed they were.

It reminds me a bit of programmers who condemn any large Excel sheet, saying that it should be moved into a database. What it usually means is that they don't understand the problems it's solving.

This is one of the selling points of Pluto.jl for Julia... less hidden state for interactive coding.


The one thing about Pluto that turned me off is that when you export to PDF it puts an advertisement in the header and footer for Pluto. It was embarrassing to send my work to non-technical people.

Hmm. On the one hand I sort of agree but I think giving people a roughly structured play pen isn't always bad.

Based on my experiences as a Physics student, it encourages bad code (ignoring DRY, etc.), but considering academics often have no real training as programmers as opposed to computer-practitioners, I guess it's better than them dumping raw C into a file and never touching it again?

Jupyter is a Rube-Goldbergian nightmare. Python is a memory hog. There are better solutions, to be sure.

A simple REPL is all that's needed to both do A-type and B-type data exploration. (I won't use the term "data scientist", it's an exaggeration in most cases.)

Python has a REPL, R has a REPL, Perl has PDL and both a simple REPL (https://github.com/viviparous/preplish) and a more complex one (https://metacpan.org/pod/Reply).

Jupyter should not be used as an IDE because it is the wrong tool for development. A-type data explorers just want a painless UI and may not care much about the horrible agglutination of incomplete/slow/broken solutions that Jupyter represents.

I want to write an article about "test-driven EDA."

I used to do EDA in REPLs when I was a more research-focused scientist, but I've made a transition over to the engineering side the last few years.

What made the transition natural for me is that I never shirked on reproducibility when doing EDA. I've seen other scientists, once they have something, "toss it over" and get the entire thing rewritten-- this isn't really scientists owning their work.

Nowadays, I do test-driven EDA, i.e. using pytest. First write the test that describes what you're "exploring for", then the code that documents how you got it. As a bonus, an engineer can rewrite it since you've done them a favor and written tests they can validate against.
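A sketch of what that might look like, with the caveat that the commenter gives no details: the data, `load_daily_orders`, and `find_outliers` below are all invented for illustration. The `test_` functions are plain asserts, so pytest would collect them if this lived in a `test_*.py` file:

```python
import statistics

def load_daily_orders():
    """Stand-in for real data loading; the numbers are hypothetical."""
    return [102, 98, 110, 95, 430, 101, 99]

def find_outliers(values, k=3.0):
    """Flag values more than k median-absolute-deviations from the median."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [v for v in values if mad and abs(v - med) / mad > k]

# The tests state the finding; the functions above document how it was reached.
def test_exactly_one_spike_day():
    assert find_outliers(load_daily_orders()) == [430]

def test_typical_day_is_near_100():
    assert 95 <= statistics.median(load_daily_orders()) <= 105
```

If the upstream data changes and the spike disappears, the test fails, which is a louder signal than a stale plot in a notebook.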

You apparently seem to misunderstand what notebooks are good for. They are not only better than notepad, they are better than any IDE or vim/emacs for data exploration, data visualization and machine learning model development.

Ha-ha-ha :-) - too late! Every generation will make their own "Worse is better" mistake. Have faith (and patience), what works and is good, will (eventually) be re-discovered.

The high energy physics community has been using interpreted C++ REPLs for over 20 years now :) (first with CINT, now with cling).

I always viewed Jupyter/Notebooks as a sort of powerpoint, where quite simple content is presented in a flashy manner.

That explains their popularity.

mlpack, a C++ machine learning library, includes xeus-cling notebooks directly on their homepage: https://www.mlpack.org/

The xeus-cling work is awesome and has made it possible to do data science prototyping in C++. There are lots of other C++ notebook examples in the examples repository: https://github.com/mlpack/examples/

Here's a live coding music synth based on Cling https://github.com/nwoeanhinnogaehr/tinyspec-cling

I also managed to prototype interactive audio programming with http://bela.io with Cling https://gist.github.com/jarmitage/6e411ae8746c04d6ecbee1cbc1...

Oh wow! I had missed Cling completely. A REPL for quickly testing out how something works was one of the things I was missing when doing C++. Nice to know there are good developments in that direction.

Not related, but this got me thinking: for some reason, I really enjoy old things, and new things, but not changing things. I really like things as they are, and learn new things when needed. I don't like statically type-checked Python when I write a Python script. If I need static type checking, I'll probably just use a different language. I am pretty happy to develop in Python and convert to C++ if needed for machine learning projects (maybe a Python interface for a C++ library).

Given the progress of developer friendliness in C++, I'd rather use it directly if the underlying library is written in C++ anyway.

It is not a problem using Python as an interface for C++ libraries. It is just easier to use. I don't go to C++ for its friendliness. I feel that everything is trying to be everything else. Programmers never had a problem switching to a new language when needed.

> Programmers never had a problem switching to a new language when needed.

Sure, plenty of programmers always have, and the more financial draw there is to programming and the more people spend their career from education through work in an environment where a programmer is defined as “a JavaScript programmer” or a “.NET programmer”, etc., the more will.

But, beyond that, even being comfortable with polyglot programming, it's still better when you've got what you need in one place.

Someone has to have the thankless burden of writing the Python glue code.

I consider that a problem, moreso the way JIT efforts tend to be downplayed in the community.

I find being able to directly evaluate code snippets by attaching to a Jupyter kernel (running locally or in the cloud) is one of the most important development efficiency boosters.

Glad they fixed this issue: https://github.com/jupyter-xeus/xeus-cling/issues/91 which made me stop trying years ago. I would like to try again. Have been happy using Julia since then.

cling is a marvel, especially compared to its predecessor, CINT, which was a crazy hack that kinda/sorta worked just well enough.

I wonder how many users outside HEP there are though.

Stop trying to make C++ happen. It's like "fetch" in Mean Girls. It's never gonna catch on
