Hacker News
JavaScript and the next decade of data programming (2020) (benschmidt.org)
122 points by tim_sw on June 4, 2021 | 60 comments

Cool to see this :) We wrote the JS version of Arrow with some of these insights in mind. We even had wasm + WebCL backends before building Arrow and then using it to standardize some of our protocols.

Going in the wild these last few years, we've had a few fundamental lessons-learned here. A lot of the above is still true, and in fact even more so. At the same time, for the data world, there are tough points like client heterogeneity and data scale. The way I like to describe it now is we've moved from a "thin <> thick" client/server model to "thick <> supercomputer" client/server. With modern latencies and bandwidth, your interactive compute session is distributed GPUs: it's a weird way to build apps and think about what a computer is, but totally commoditized and doable now. So Arrow is great for streaming MB/s to the browser. Potentially in a couple years, GB/s to v8/browser, mostly waiting for some basic issues Chrome devs don't prioritize, not fundamental ones. This gets hard when we want to interactively visually analyze say 10 GB of click trails: that can quickly get into 100 GB of in-memory representation needing streaming / out-of-core tricks, and download speed issues. Likewise, WebGPU is cool, but it's still not even what CUDA / OpenCL were exposing 10 years ago, again, mostly b/c Google WebGL people prioritize other stuff.

Given that, at least for the next few years, we've been continuing to build for "1-10 GB client GPU <> 10GB-1TB server GPUs". On the plus side, "on the internet, nobody knows you're a dog", so we've been free to do GPU nodejs extensions that WebGL standards people don't get to dictate.

> Potentially in a couple years, GB/s to v8/browser, mostly waiting for some basic issues Chrome devs don't prioritize, not fundamental ones. This gets hard when we want to interactively visually analyze say 10 GB of click trails: that can quickly get into 100 GB of in-memory representation needing streaming / out-of-core tricks, and download speed issues.

what issues aren't being prioritized?

> Likewise, WebGPU is cool, but it's still not even what CUDA / OpenCL were exposing 10 years ago, again, mostly b/c Google WebGL people prioritize other stuff.

what's missing vs say opencl?

a lot of shade cast with very few contestable claims. I want to listen to your points but right now the article seems far more right to me than your ill-described counterclaims. I don't see the limits you claim with webgpu, and I don't see what the browser folk need to be attending to.

ex: As GPUs can now have 40-80GB RAM, it'd be great to create & move around big buffers, esp. using optimized i/o interfaces like cuFile. But way before then, we should be able to allocate a big buffer: https://stackoverflow.com/questions/8974375/whats-the-maximu...

ex: gpu programming quickly gets into things like controlling reads/writes of data across warps as soon as you leave stuff like doing recursive fractals and wanting to do data structures designed for data parallelism, with some of the most elementary benefiting from this being things like on-gpu creation and manipulation of trees and graphs. I don't expect people learning/using webgl2 to want to do that, but I 100% expect the frameworks they're relying on to be doing it for how they implement things like reductions.

I was checking the spec every year or so to see if it supports basic memory model pragma equivs but still no, so basically still a basic opengl mindset, which is fair, just not letting webdevs do late 2000's style gpgpu. I tried working w/ khronos etc. on standards here but Google & HW manufacturers killed it for effectively competitive + resourcing reasons. That was ~10 years ago, and main issues are still true.

Nowadays, I'm more focused on areas we can control, and aren't subject to these standards bodies. NodeJS ecosystems are not restricted, and the ecosystem is incentivized to compete on advancing HW/SW in the data center. So instead, I get to work on questions like "so each PCI card gives us 16-64 GB/s bandwidth to a GPU, and the new SSD arrays give us 12 GB/s per controller, so if we do a few cheap GPUs on a box, can we do interactive visual analytics on that?" And "Is on-GPU decompression faster than sending uncompressed when all same-node?" I wish standards were building for this kind of stuff, but maybe you have a better idea on how to make it happen...

Although I share the author's sentiments against Python and the cloud, calling JS a "backend" of the data science stack is a non-trivial stretch. It's obviously a front-end. Fast GPU computation is great for the frontend as well as the backend, so don't worry about it being wasted.

I'd rather expect cloud vendors adopt JS/wasm as a UDF / extension for their compute offering. BigQuery does this already [1]. S3 Object lambda [2] is close. Please leave web browser as a frontend.

[1] https://cloud.google.com/bigquery/docs/reference/standard-sq... [2] https://aws.amazon.com/blogs/aws/introducing-amazon-s3-objec...

It can be a "backend" the way Python is a "backend". It's the glue, the developer-facing API.

The heavy lifting will likely be done by cross-compiling C++ and Fortran code to WASM, and expecting it to be JITted, or maybe rewriting portions of it in JS where JIT is already good.

Whoever is first to implement a well-working GPGPU interface in the browser, wins.

WASM won’t be performant enough for heavy matrix computations because it wouldn’t be able to use SIMD (in its current state). There’s a WASM SIMD proposal in the works but it’s only for 128-bit vectors and I’m not sure it will perform the same as native code.

just one example: deno has webgpu support. whatever mentality you hold now, however you think of it today, I expect almost all of us need to loosen our preconceptions rather quickly.

as for what the browser ought to be, I again think you should lower your preconceptions & not be so opposed to the browser exploring new data crunching roles. especially as folk explore moving away from big cloud being the near-exclusive means of computing.

That's one attitude. I'm on the other side, with low expectations of browser technology. People claim a lot about the Web and browsers, but it takes extremely long to get the fruit. I'm happy to be wrong eventually and will take it if it does come true. But until then, I'd stick to the pragmatic side.

That said, someone has to chase the ambition to make it happen. Maybe I was being too negative and discouraging here.

> But if you need to continually be checking random samples of a dataframe, re-running modules, and seeing if your regular expressions correctly clean a dataset, you are using a notebook interface today, even if you bundle your code into a module at some point.

Notebooks like jupyter and observable are not a good advance. The primary reason is that they are leading the community of data programmers away from git-based workflows. ("git" is a shorthand for whatever version-control here due to its dominance). Although it may seem like a strange claim, the primary reason "programmers" and "software engineers" use git is for correctness/debugging: the idea of not being able to switch accurately between different states of your codebase to compare program behavior is insanity to a "programmer" / "software engineer". And it should be insanity also to data programmers. But due to the rise in popularity of notebooks, it is not. These things just do not encourage the use of version control to compare different states of the code.

It's going to lead to many buggy analyses, and many data programmers missing the opportunity to learn how to write code to a high level of correctness.

This is a good point.

On the other hand, developer productivity in notebook environments is so high that iterating your way to correct code may be faster than trying to plan it all out in the beginning.

There is also the aspect of exploration. The hardest part of most data programmer projects is understanding the data and optimal wrangling.

This is something I often wonder about. In my own experience the perceived productivity gains in prototyping often come with externalized technical debt that leads to an overall loss of productivity. Notebook-driven development tends to produce models that are too tightly coupled with an underlying data pipeline, as the workflow very much facilitates this. There’s a measured loss of productivity when a model and pipeline need to be decoupled, e.g., to split the pipeline or repurpose the model.

Obviously coupling components is not unique to notebooks, but there just seems to be a culture in notebook-driven development to disproportionately push poorly designed code artifacts. The MLEs and data scientists I’ve worked with who write simple functional modules tend to introduce less technical debt, at the expense sometimes of taking longer to push out a demo prototype.

All that said, I think the aforementioned issues are not truly intrinsic to a notebook-driven development, but rather an emerging culture that has rather aptly been called Potemkin data science: https://mcorrell.medium.com/potemkin-data-science-fba2b5ba5c...

>On the other hand, developer productivity in notebook environments is so high that iterating your way to correct code may be faster than trying to plan it all out in the beginning.

In science, what kills you isn't the mistake you see and correct. It's the mistake you don't see, that then gets published.

Interesting, but not true. You clearly don't work in science: the main thing is getting publications. Mostly people don't pay that much attention to them, so if there are errors, (a) probably no one will notice and (b) you can always explain it away with some obfuscation.

I do work in science, and I care about getting things right.

> On the other hand, developer productivity in notebook environments is so high that iterating your way to correct code may be faster than trying to plan it all out in the beginning.

That's not a fair comparison though is it? Those of us who use git, the vast majority of the time we iterate our way to correct code; we don't plan it out at the beginning. It's just that we iterate over the course of a sequence of commits, pausing on each commit to study application behavior. Notebooks users also iterate over a sequence of code evolutions, except mostly without systematic checkpointing and without clearing in-memory state between code changes.

There is nothing inherent in notebooks that prevents the use of tools like git.

...except the need to commit the current state to make things reproducible.

And I'm afraid it's not just code in cells, but also the sequence of its execution, editing, execution again, etc, because the order of running the cells is arbitrary (whatever the user commands interactively), and the results of previous invocations form the dreaded arbitrarily mutable global state.

> ...except the need to commit the current state to make things reproducible.

This can definitely be a problem, but at least in the instances where I've worked with teams using Notebooks it hasn't been an issue. Usually data is either coming from networked storage (S3 etc.) as CSV/Parquet/HDF5, a database query with well-defined (enough) criteria, or API calls.

> because the order of running the cells is arbitrary

Most people I know have to restart their notebooks often enough to have to re-run all cells again anyway. You usually get into a good habit of making sure they're mostly "in order", or grouping common code into cells to minimize the number of cell executions and ctrl+enter's needed.

> dreaded arbitrarily mutable global state

It's definitely a double-edged sword, but the global state is a big part of what makes notebooks so useful. I have a repl-based environment, the ability to run arbitrary code, and global state to iteratively explore the domain of the problem I'm working on. That sword cuts me many times along the way, but they're (mostly) nicks. But it's usually worth the output, especially to the business.

In all these instances, teams shared notebooks through git and most of the issues were simply understanding the context of their usage more than issues with their execution. I don't remember more than a few notebooks really being used by more than a few people anyway, so I've learned to consider them more as collaborative artifacts in an exploratory process to more mature, "packaged" code evolved from those notebooks.

This has nothing to do with the ability to use a tool like git. Notebooks make sloppy practices easier, but all of these problems exist in non notebook code too.

You have no idea how many times I’ve run into non notebook code that assumes a database or something is in a certain state with no clear explanation of how to get to that state. Or, a library that requires a certain order of operations that is unclear.

For Jupyter sure, it's true, but not for Observable.

A couple of Brad Myers's students at CMU worked out a concept for lightweight versioning, with a particular focus on users of code notebooks.


Unfortunately, they moved on to an evolution of the initial concept that's a lot less compelling/promising.

Are there any good workflows that depend on out of order execution outside the exploration phase?

To use notebooks reproducibly, the whole thing should be cleared and run with some frequency. Global state shouldn't matter.

There's also the nbconvert --to script jupyter function that can convert a notebook into a script.

I am involved in a quite heavy web-based data processing and visualization application (browser regularly consumes 8GB RAM and more for a single tab).

I have to say that you really can see that browsers are not made for this kind of workload, even if Javascript itself might be fast enough, down to very stupid little things (e.g. accidentally pointing your mouse in the debug window at a variable holding a large array risks freezing the entire browser).

Looks like something that can be solved by moving the task to background with [web workers][0].

[0]: https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers...

At least with Firefox, doing tasks in the background was sometimes quite painful, especially during development. For example, I do some CPU-intensive stuff in the background, close the tab because I noticed a mistake in my code -> background task still runs for several seconds eating CPU and memory until it is killed or finished (I didn't check to be honest) by the browser.

Edit: As with everything related to complex web applications (and I had two or three in my career), it is obvious that the whole environment (client, protocols, languages,...) was never designed for this.

Couple days ago we posted a Show HN with similar arguments on why JS for data makes sense, see: https://news.ycombinator.com/item?id=27334145 -- Ben's article is much more eloquent and presents several examples, but there are a few libraries worth mentioning to complement this article: danfo.js (dataframes with tensorflow), ml5.js (a keras-like library), tidy.js (a tidy port to JS), and tensorflow.js.

> When I started tinkering with JS, I thought of it as slow; but web developers are far more obsessive about speed than any other high-level, dynamically typed language I’ve seen.

Given that, isn't it odd that JS still encourages map/filter array iterations that make multiple passes over a single array? I.e. that it hasn't yet introduced some sort of comprehension or itertools-with-compilation feature that does array processing in a single pass?

JavaScript has generator functions. Comprehensions are really just sugar around this functionality. You could pretty easily write your own set of generator function-powered methods for mapping and filtering.
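
A minimal sketch of what that could look like, assuming two hand-rolled helpers (`lazyMap` and `lazyFilter` are hypothetical names, not part of any standard library). Each element passes through the filter and then the map before the next element is read, so nothing is allocated between the stages:

```javascript
// Hypothetical helpers for illustration; not from any library.
function* lazyMap(iterable, fn) {
  for (const item of iterable) {
    yield fn(item);
  }
}

function* lazyFilter(iterable, predicate) {
  for (const item of iterable) {
    if (predicate(item)) {
      yield item;
    }
  }
}

const numbers = [1, 2, 3, 4, 5, 6];

// Stages are composed lazily: no intermediate array exists between them.
const evensSquared = lazyMap(
  lazyFilter(numbers, (n) => n % 2 === 0),
  (n) => n * n
);

// The only array allocated is the final one, when the iterator is spread.
const singlePassResult = [...evensSquared];
console.log(singlePassResult);
```

The code still reads as two decoupled steps, but the input is walked once. Whether it beats chained `map`/`filter` in practice depends on the engine and the data size, so it is worth measuring before relying on it.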

Which I've started doing, but it is still a real shame that the "default way" is so inefficient, especially given how many devs never even realize it. It contributes a whole lot of accidental-slowness to the ecosystem.

> the "default way" is so inefficient

A big reason why I say screw map, filter, reduce, etc. and my "default way" never stopped being ordinary for loops.

bastawhiz's (correct) point is that it's easy in JS to have the best of both worlds: write your code as multiple, decoupled passes, with the performance of a for loop. The only problem is that in JS you have to go out of your way to do it. But those of us who know about it can, and generally should, do so.

Though I have found generators in JS often come with an overhead compared to a well-written for loop. (Some languages like rust or Julia will inline all the closures for you and remove all the overhead).

Still a great way to avoid intermediate allocations, though.

can you point me to a good tutorial showing how to do this? i'm a conflicted web dev regarding loops. I love the simplicity that map/filter have, but I hate that they require looping over the same array more than once, which makes me choose for loops over them... Never thought I could have the best of both worlds

This may be what you meant, but it's worse than just making multiple passes over a single array: it allocates a whole new array on each pass to hold the results, even for something like a slice()

I prefer to have readable code first, and performance improvement later and only if needed.

Other languages like Rust can achieve basically the same syntax with lazy evaluation/composition of phases, only collecting the results into a new data structure as the very last step.

Are you saying you find

    Array.map(e => f(e))
Easier to read than

    for (const e of Array) { f(e); }
I don’t see how they really are any more or less readable, although I’ll admit map gave me a day of rethinking when I switched from 100% python to using react on the frontend.

Technically you would use

  Array.forEach(e => f(e))
If you don't care about transformations. Which is basically the same as your for

Map is when you want to return a new collection with each original item in it mapped to a "different" thing.

Programming punditry makes a lot more sense if, when you see someone talking about the code being more readable, you understand that most of them are lying to themselves about what they really mean. Which is code that is more writable. In other words, something that's easier for them, the lazy person who is punching keys and derping around in a text editor.

forEach is an outlier, because most of these methods transform something into something else and can be chained, and that's where the readability increase comes in. I usually prefer for loops over forEach because it emphasizes the fact that forEach doesn't return a value; it's used purely for the sake of side-effects.

I think


is the more apt analog there as it wouldn't be generating a new array of the return like map would.

Maybe, I only went with map since it was in the GP.

It also offers Array.reduce(), which can filter and map in one go, among other tasks.
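
For illustration, a small sketch of filtering and mapping in a single `reduce` call (the data and the variable names are made up for the example):

```javascript
const inputs = [1, 2, 3, 4, 5, 6];

// One pass over the array: keep the even numbers and square them,
// pushing into a single accumulator instead of building a filtered copy first.
const evenSquares = inputs.reduce((acc, n) => {
  if (n % 2 === 0) {
    acc.push(n * n);
  }
  return acc;
}, []);

console.log(evenSquares);
```

The accumulator is mutated in place on purpose: returning a fresh array on every step (for example with spread syntax) would bring back the per-element allocations this is meant to avoid.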

Comprehensions are nice and I wish JS had them, but can’t you achieve the same thing with a for loop?

You can, but transducers would be a more consistent api
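
A rough sketch of the idea, assuming hand-written helpers (`mapping`, `filtering`, `composeAll`, and `pushTo` are hypothetical names for this example, not a library API). A transducer wraps a reducer and returns a new reducer, so several steps collapse into one `reduce` pass:

```javascript
// Hypothetical minimal transducer helpers, written out for illustration.
const mapping = (fn) => (reducer) => (acc, x) => reducer(acc, fn(x));
const filtering = (pred) => (reducer) => (acc, x) =>
  pred(x) ? reducer(acc, x) : acc;

// Right-to-left function composition: composeAll(f, g)(x) === f(g(x)).
const composeAll = (...fns) => (x) => fns.reduceRight((v, f) => f(v), x);

// The final reducer decides what gets built; here, an array.
const pushTo = (acc, x) => {
  acc.push(x);
  return acc;
};

// Data flows through the steps in the order listed: filter first, then map.
const evenThenSquare = composeAll(
  filtering((n) => n % 2 === 0),
  mapping((n) => n * n)
);

const transduced = [1, 2, 3, 4, 5, 6].reduce(evenThenSquare(pushTo), []);
console.log(transduced);
```

The same `evenThenSquare` pipeline could be reused with a different final reducer (a sum, say) without touching the steps, which is the consistency argument being made.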

I think for a significant portion of R users the jump to JavaScript is not plausible as things stand.

R is one of the few languages / platforms where people who don't give a fuck about software engineering can successfully create mostly correct computer programs with reasonable speed.

JavaScript is not ready for those people yet.

(Although I am pretty impressed with the work the Observable people are doing in this regard.)

Do you have any advice for R users on making the transition? Even though I am primarily an R user, these days I have been using a fair amount of JS in my work, and I'm not sure how I can get to the "next level" in my programming that you're alluding to. It seems like most of the tutorials are either too rudimentary or way over my head (like the three step drawing meme).

Maybe it makes sense for someone in my position to just go bite the bullet and go through the Eloquent JavaScript even though I feel like I already have a fair grasp of the content? Or learn some key principles of Software Engineering?

To be blunt, if you can seriously cope with R, javascript should be a piece of piss. R has a very, very steep learning curve... perhaps the transition is less to do with changing languages and more to do with learning principles of software engineering?

Not only is Javascript not ready for those people yet, Javascript has been moving further and further away from those people for years, both in the language spec and in the community and ecosystem around the language.

About the #1 WebGL complaint, is it really so that you can't use integers in attributes?

https://www.khronos.org/registry/webgl/specs/latest/2.0/#3.7... seems to say otherwise, and also https://developer.mozilla.org/en-US/docs/Web/API/WebGL2Rende... says you can use integer types including gl.INT, gl.SHORT etc, and here's someone using int32 textures successfully: https://stackoverflow.com/questions/43905968/how-to-get-32-b...

There is no requirement for integer arithmetic in WebGL 1; all numbers are essentially floats, even if the attribute is not float.

Integer support is GLSL ES 3.0 (the shader language) i.e. WebGL2.

Intentionally talking about the years-obsolete WebGL 1 feature set as "WebGL" while proposing WebGPU would be pretty iffy. If the issue is that WebGL 2, an incremental update to WebGL 1, doesn't have 100% user coverage after 4.5 years, WebGPU is going to be much farther off as it's not even fully specced yet.

But since the article didn't say anything about webgl versions, it seems more likely that the author was just unaware.

Probably supported only in WebGL2

One thing I'm curious about is how do you get the data to the browser. Particularly if you are interested in doing it in a low latency way with frequent data updates?

For instance what do I need to do if I want to have a chart of timeseries data that is being updated in near realtime, say every 5 seconds, appear on a webpage?

In my org, data originates from Operational Historians, which are essentially ring buffers for time series data and look a lot like SQL databases. How does JavaScript communicate with this sort of data efficiently? In the "old school", non-live-updating R world I would use a JDBC connection. What is the JavaScript way of receiving the data?

Websockets. We do price feeds for inhouse trading platform and we don't use anything else, it works very well for browser-backend and backend-backend comms.
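
A minimal browser-side sketch of that pattern, under stated assumptions: the feed URL, the JSON message shape (`{"t": ..., "v": ...}`), the `drawChart` callback, and the helper names are all hypothetical. `WebSocket`, its `message` event, and `event.data` are the standard browser API.

```javascript
// Keep only the newest maxPoints samples (a client-side rolling window).
function appendPoint(points, point, maxPoints) {
  const next = points.concat([point]);
  return next.length > maxPoints ? next.slice(next.length - maxPoints) : next;
}

// Assumed message format: JSON text such as {"t": 1622800000000, "v": 42.1}.
function parseMessage(text) {
  const { t, v } = JSON.parse(text);
  return { t: Number(t), v: Number(v) };
}

// Open the socket and hand the updated window to a redraw callback
// every time the server pushes a new sample.
function subscribe(url, redraw, maxPoints = 720) {
  let points = [];
  const socket = new WebSocket(url);
  socket.addEventListener("message", (event) => {
    points = appendPoint(points, parseMessage(event.data), maxPoints);
    redraw(points);
  });
  return socket;
}

// Usage in a page (hypothetical URL and callback):
// subscribe("wss://example.com/feed", drawChart);
```

With one sample every 5 seconds, a window of 720 points covers the last hour. Something on the server still has to read from the historian and push each new sample to connected clients.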

So much mention of Dplyr and pandas being slow and yet not a single mention of data.table.


It's possible to do almost everything in JavaScript and SQL. However, the moment a developer wants to use time series forecasting models like ARIMA and ARCH for predictive forecasting there isn't much in either technology (side cases include Oracle providing ARIMA functions in SQL). Unfortunately, it requires the server handing off computation to Python or R, somehow.

Does anyone know of a tutorial how to implement python modules scipy, numpy, pandas, statsmodels, sklearn with a Docker image which can listen for incoming data streamed or otherwise from a node web server?

Doesn't address the docker part, but the syntax for this node -> python bridge looks reasonable, and there's a short comparison section for other solutions: https://github.com/Submersible/node-python-bridge

Good post.

If the author is reading, the demos are still pretty intensive and a way to toggle them on and off would be much appreciated for those of us on less powerful devices.

Summary: "Why bother with R and Python?"
