The Path to Parallel JavaScript (blog.mozilla.org)
226 points by leo2urlevan on Feb 26, 2015 | 56 comments

This is nice for some pretty limited use cases, but the most common use case for multithreading in app-like programs (which is what Worker-based apps presumably service: these are not documents) is removing latency from the UI thread. But as long as only the main thread can touch the UI, and the main thread also can't access shared memory, this limits uses of this to scenarios where copying the entire render state from a Worker is reasonably fast — in which case, Workers currently already solve the problem. This proposed implementation of shared memory doesn't actually solve one of the big remaining needs for shared memory, which is when it's prohibitively expensive to copy state between a Worker thread and the UI thread at 60fps.

For example, Workers aren't particularly useful for games in their current iteration: the overhead of copying the state of the world back to the rendering thread is high. This is exactly the problem that shared memory would solve, were it not limited to Workers. This puts web export (or even primary web-based game authorship) at a significant disadvantage as compared to native apps: native code can share memory, and web-based implementations can't. In many cases architectures that are optimal for shared-memory threading are pathological when the rendering thread requires copies, meaning that threading gets thrown out the window for web. Even with asm.js-compiled "near-native" performance on the single core, you can only use 25% of the available CPU if you can't use multithreading. A 4x performance hit is the difference between 60fps and 15fps... Or 15fps and ~4fps.
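The frame-rate arithmetic above can be checked in a few lines. This is only a sketch: it assumes a workload that just fits the 60fps budget when spread perfectly over four cores, which real games rarely achieve, and the helper names `frameMs` and `fpsAfterSlowdown` are made up for illustration.

```javascript
// Frame time in milliseconds for a given frame rate.
function frameMs(fps) {
  return 1000 / fps;
}

// Frame rate once the same work takes `slowdown` times longer.
function fpsAfterSlowdown(fps, slowdown) {
  return 1000 / (frameMs(fps) * slowdown);
}

const budget60 = frameMs(60);            // about 16.67 ms per frame
const oneCore = fpsAfterSlowdown(60, 4); // about 15 fps on a single core
const worse = fpsAfterSlowdown(15, 4);   // about 3.75 fps, the "~4fps" case
console.log(budget60.toFixed(2), oneCore.toFixed(2), worse.toFixed(2));
```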

The title of the blog post got me pretty excited, but the proposal is fairly disappointing in terms of unlocking better performance for web apps. The use cases here are pretty limited to things like CPU-bound number crunching, and I doubt too many people are running machine learning algorithms in a browser as compared to the number of people who're using browsers to, y'know, render UIs. By all means scope the problem down to sharing primitive data in ArrayBuffers (we can build abstractions on top of that!), but limiting it to Worker threads makes it near-useless for most web applications. Workers already solve the use cases for UIs that can tolerate copies between the UI thread and the Worker threads, and this proposal doesn't allow us to solve needs for UIs that can't.

The blog post does mention the main thread, as something that is more complex and in need of further investigation.

Still, even without shared memory being accessible to the main thread, I think sharing between workers can be extremely useful. Yes, you need to proxy information to the main thread in order to render, but that doesn't need to be a big problem. See for example the test here, where a 3D game's rendering was proxied to the main thread, with little loss of performance.


That very small overhead could be worth letting the game run in multiple workers while using shared memory.

Also, things like Canvas 2D and WebGL are APIs that might exist in workers, there are efforts towards that happening right now. That would eliminate the need to copy anything to the main thread, and avoid a lot of latency.

I can't speak to the BananaBread codebase, since I haven't read it — although BananaBread performs poorly as compared to commercial engines, regardless — but if you look at even the published benchmarks in that blog post, the Worker-based implementation is slower in all cases except for Firefox with two bots. Chrome is always slower when using Workers, sometimes massively so, and Firefox with ten bots is slower multithreaded than single-threaded.

Regardless, shared memory isn't only useful for WebGL. It's useful for any kind of UI where you don't want to block the main thread, and if you don't have DOM access it's tough to make that work. If copying is fine then current Workers are good enough; if copying isn't fine, then this doesn't change that.

I agree with

> If copying is fine then current Workers are good enough; if copying isn't fine, then this doesn't change that.

I am saying that copying is fine (for most apps).

Copying is fine because BananaBread is indeed less performant than commercial engines, as you said, and that actually makes it a better test candidate here. It does far more GL commands than a perfectly optimized engine would, which means much more overhead in terms of sending messages to the main thread.

Despite that extra overhead, it does very well. 2 bots, a realistic workload, is fast in Firefox, and the slowdown in Chrome (where message-passing overhead is higher) is almost negligible. 10 bots, as mentioned in the blogpost, is a stress test, not a realistic workload. As expected, things start to get slower there. (Game is still playable though!)

And large amounts of GL commands, as tested there, are much more than what a typical UI application would need. So for UI applications, that just want to not stall the main thread, I think a single worker proxying operations to the main thread could be enough. Copying is fine.

The proposal in the blogpost here is for other cases. Workers + copying already solve quite a lot, very well. What this new proposal focuses on are computation-intensive applications, that can multiply their performance by using multiple threads. For example, an online photo-editing application needs this. This might be just a small fraction of websites, but they are important too.

Here are some things that BananaBread doesn't test that I suspect would break larger games in commercial engines:

* High-poly models. This is one area where copying breaks down: that can be a ton of data. BananaBread has a single, low-poly model that it uses for NPCs, and it presumably rarely (if ever?) needs to be re-copied.

* Large, seamless worlds. If you can transfer the entire world into a single static GL buffer, sure, copying isn't a problem since it only happens once on boot. If you need to incrementally load and unload in chunks, you're going to be paying that cost again and again.

* Multi-pass rendering. In fact, the proxying approach makes multi-pass rendering impossible, as noted in the blog post.

By all means, if your application already works single-threaded, or within the confines of the existing spec, you're going to be fine. But memory copies aren't free, and UI offloading is one of the biggest reasons to use shared memory.

Shared memory in Workers is nice — it doesn't make anything worse, and it makes some things better! — but it's a little disappointing that the main thread can't access the shared memory buffers. That's all.

I understand your disappointment, clearly more opportunities are opened up on the main thread. As the article says, this is a proposed first step, and the main thread is trickier, so it can be considered carefully later on. Meanwhile, for a large set of use cases, the current proposal can provide massive speedups.

> if copying isn't fine, then this doesn't change that.

Good point. I imagine it would be possible to transfer ownership of a SharedArrayBuffer to the main thread from a worker, just as it is done today with zero-copy transferable ArrayBuffers.

References to the SharedArrayBuffer in all workers would be made unusable. When the code on the main thread is done it can transfer ownership of the SharedArrayBuffer back to the workers.
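For reference, this is roughly how the existing zero-copy transfer of a plain ArrayBuffer behaves. The sketch uses `structuredClone` with a `transfer` list as a single-threaded stand-in for `postMessage(data, [buffer])`. Whether a SharedArrayBuffer could ever be handed over this way is the commenter's speculation, not something the proposal states.

```javascript
// A plain (non-shared) buffer holding some data.
const original = new ArrayBuffer(1024);
new Uint8Array(original)[0] = 42;

// Transfer instead of copy: the receiver gets the memory and the
// sender's reference is detached.
const moved = structuredClone(original, { transfer: [original] });

console.log(original.byteLength);      // 0, the sender can no longer use it
console.log(moved.byteLength);         // 1024
console.log(new Uint8Array(moved)[0]); // 42
```

The detached sender-side buffer is what "made unusable" would look like if the same idea were applied to shared buffers.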

I'm of the opinion that adding "proper" (read: low-level) multithreading is a bad idea for Javascript.

How will it interact with the existing event loop? How will scheduling work? How about sharing state properly? Is my work going to crash because some bozo writes a jQuery plugin for background processing that doesn't understand how spinning on a mutex can cause trouble?

Javascript developers generally lack, due to the language and the environment, any sort of empathy for what is and is not a good idea in a place where concurrent modification of shared state is possible.

Web workers, wisely, solved this problem by creating explicit messaging. This gets you most of the way there without running into the really nasty bugs you see in native code.

Adding a shared buffer object as proposed (presumably giving an interleaved set of views onto the same underlying backing array, much as we do with typed buffers) could be acceptable, because you can guarantee isolated access to elements.

However, adding the more general threading support you see in, for example, C++ would be a nightmare. I'm already a little wary of the maintenance and legacy costs that ES6 is going to impose on us...giving devs the power to do dumb shit with threads isn't going to help our industry.

An addendum.

I think that adding full support for things like Canvas and WebGL rendering would relieve the majority of the remaining issues people have, if done alongside that shared buffers approach.

Hell, written in a functional style Javascript looks none too different from, say, PLINQ, and if they decide to add library-level function parallelism I could absolutely get behind it.

Just save us from copy-and-pasted $.raiseSemaphore() and $.joinOnMonitor()--and history shows us that is what we should expect.


One more thing...JS does not lend itself to static analysis, and that may make writing safe parallel code even more difficult.

Great developments, since, imho, we really need multi-threading to make decent user interfaces (ones without hiccups due to blocking of the CPU resource).

I think what we need is immutable data-structures to be shareable between threads. This approach should also allow structural sharing between threads, allowing for efficient and safe data structures.

Also, I could see a use for a mechanism where a thread creates a data-structure, then marks it as read-only, such that it can become shared.
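The "create, then mark read-only" step can be approximated today with `Object.freeze`. This is only an illustration of the marking step: a frozen object is still copied by `postMessage`, so it does not give the cross-thread sharing the comment asks for. `deepFreeze` is a hypothetical helper and handles plain objects and arrays only.

```javascript
// Recursively freeze a graph of plain objects and arrays.
// Freezing before recursing keeps this safe for cyclic graphs;
// objects that are already frozen are assumed to be fully frozen.
function deepFreeze(value) {
  if (value !== null && typeof value === "object" && !Object.isFrozen(value)) {
    Object.freeze(value);
    for (const key of Object.keys(value)) {
      deepFreeze(value[key]);
    }
  }
  return value;
}

const world = deepFreeze({ players: [{ x: 1, y: 2 }], tick: 0 });

// Writes now throw in strict mode (and are silently ignored otherwise).
function tryWrite() {
  "use strict";
  try {
    world.players[0].x = 99;
    return "written";
  } catch (err) {
    return err.name;
  }
}
console.log(tryWrite(), world.players[0].x); // TypeError 1
```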

Anecdotally, I've found that most performance hiccups don't come from computation blocking the UI, but from rendering one part of the UI blocking every other part from rendering.

The solution to that being parallel or async paint/layout, which is something I never hear mentioned (probably because it's a really hard problem).

Servo is actually in the process of implementing a parallel layout engine[1][2]

1: http://en.wikipedia.org/wiki/Servo_(layout_engine) 2: http://pcwalton.github.io/blog/2014/02/25/revamped-parallel-...

They've recently landed a tool that makes it possible to visualize the parallel rendering: https://github.com/servo/servo/pull/4969

Interesting. I wonder how they handle incremental rendering though. Only the forward-path is described in your reference [2].

As far as I can tell there is nothing unusual or difficult here. Servo's incremental layout implementation is here: https://github.com/servo/servo/blob/master/components/layout...

I believe that the Servo project (https://github.com/servo/servo) is attempting to address exactly this.

The proposal is only for sharing memory between Worker threads, which can't access the DOM (which also means canvas, WebGL, etc. can't be accessed). It doesn't do anything to help make UIs smoother: you still have to copy to transfer data from the UI thread to the Worker threads and back. If that isn't a problem, then Workers as-is will unblock your UI: you can already do this with Workers as they're implemented today.

All that this does is make algorithms running inside Workers faster. It doesn't unblock UIs, since Workers already do that if you're willing to accept copying. If copying back and forth from the main UI thread is too slow, this proposal won't change that.

There is active work on allowing WebGL and canvas in workers, with the worker talking directly (not via the main thread) to the compositor.

Re-pasting a link that was buried further down in the discussion. Transferable objects provide zero copy sending/receiving of large amounts of binary data.


What's wrong with message passing though - MPI does it, Erlang does it - surely this paradigm can deliver good performance for data and task parallelism.

I hope Mozilla will continue experimenting with parallel js. Exciting times!

Message passing can be very efficient for some tasks. But there are cases where it is hard to optimize out copying.

For example, let's say you're running a raytracer, and you have several web workers each render a slice of the frame. Then each worker can transfer back their output to the "main" thread (using existing typed array transfer). But the main thread now has several separate typed arrays, one from each worker. If it wants to combine them all into one contiguous typed array, it needs to do a copy of the data, which is something we'd like to avoid.

In this case, what you really want is to have a single contiguous typed array, and let each worker write to a slice of it. Something similar to that would be possible in what is proposed in the blogpost.
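A rough sketch of that layout, assuming a `SharedArrayBuffer` like the one the proposal describes. To stay runnable in one file, the "workers" here are ordinary function calls. In a real page each call would run in its own Worker that received the same buffer, and the frame size and `renderSlice` helper are invented for the example.

```javascript
const WIDTH = 4, HEIGHT = 4, WORKERS = 2; // tiny hypothetical frame
const frame = new SharedArrayBuffer(WIDTH * HEIGHT * 4); // RGBA, 4 bytes per pixel

// Stand-in for the code a worker would run: it only touches its own rows.
function renderSlice(buffer, firstRow, rowCount, shade) {
  const bytesPerRow = WIDTH * 4;
  const slice = new Uint8ClampedArray(buffer, firstRow * bytesPerRow, rowCount * bytesPerRow);
  slice.fill(shade); // a real raytracer would compute each pixel here
}

const rowsPerWorker = HEIGHT / WORKERS;
for (let w = 0; w < WORKERS; w++) {
  renderSlice(frame, w * rowsPerWorker, rowsPerWorker, 100 + w);
}

// The whole frame is already contiguous, so no merge copy is needed.
const pixels = new Uint8ClampedArray(frame);
console.log(pixels[0], pixels[pixels.length - 1]); // 100 101
```

Because each slice covers a disjoint byte range, the writers never touch the same element, so this particular pattern needs no locking beyond a signal that all slices are done.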

Copying in cache is very fast now. Spending time avoiding copies is no longer always a win like it used to be.

In cache is key, though.

Decoded image data is 4 bytes per pixel, so a raytraced image of any sort of reasonable size would barely (if at all) fit in L3 cache even on modern processors. And you need to fit both the source and the destination, right?
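To put a number on that, here is the arithmetic for one frame size. The 1920x1080 resolution is an assumption chosen for illustration. L3 cache sizes vary by processor, so the comparison is left to the reader rather than hard-coded.

```javascript
// Bytes needed for a decoded RGBA frame (4 bytes per pixel).
function frameBytes(width, height) {
  return width * height * 4;
}

const MiB = 1024 * 1024;
const fullHd = frameBytes(1920, 1080); // 8,294,400 bytes

console.log((fullHd / MiB).toFixed(2));       // 7.91 (MiB for one copy)
console.log(((2 * fullHd) / MiB).toFixed(2)); // 15.82 (MiB for source plus destination)
```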

The shared ArrayBuffer interface being described here is following the philosophy of the Extensible Web Manifesto [0]. The idea is that libraries providing higher-level APIs and programming models, such as message passing, can be built on top of low-level primitives.

[0] https://extensiblewebmanifesto.org/

When you have a large amount of data that needs to be sequentially iterated through by multiple threads — e.g. in games, where the UI thread and the physics thread are often entirely separate but read off a shared world state — message passing falls over. The copies are just too expensive.

Message passing is great for things that use small amounts of data but lots of CPU, though!

Forgive me if my comment seems ignorant, as I've had experience with MPI and threaded code, but not professionally. I also do not provide any numbers or profiles.

I would think message passing in terms of MPI is acceptable, because "the cost of copying" is insignificant compared to "the cost of network latency." When you have very little overhead (multiple threads performing atomic operations on shared memory), then "the cost of copying" becomes relatively significant. And if you want to target JS given existing legacy C++ code that probably won't be rewritten, well then the JS execution environment will have to be the one to bend.

Which existing C++ code are you referring to? It seems like a bad idea to compromise the soundness of Javascript to do whatever it is you are talking about doing. I don't think that legacy C++ code is something that should be causing anything to bend.

It starts to become a problem when the messages are very large. Consider the case where you need to send a big set to a thread, so that it can use it as a part of a computation. Conversion to JSON would be too slow, because every element of the set would have to be visited upon invocation.
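A small sketch of what the JSON route costs for a Set. The size of 100000 is arbitrary. Note that `postMessage` itself uses structured cloning rather than JSON, which avoids the string step but still has to walk every element.

```javascript
const big = new Set();
for (let i = 0; i < 100000; i++) big.add(i);

// A Set has no enumerable properties, so JSON sees an empty object.
console.log(JSON.stringify(big)); // {}

// To send it as JSON, every element is visited once to build an array,
// again to build the string, and again on the receiving side to rebuild the Set.
const wire = JSON.stringify(Array.from(big));
const rebuilt = new Set(JSON.parse(wire));
console.log(rebuilt.size); // 100000
```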

You often want large computations to be performed in a side-thread (to avoid blocking the UI thread). It would be a pity if such computations couldn't take large data-structures as an argument, because large computations often take large amounts of data as input.

I think they mostly mean the overhead of (de)serialization from/into JSON. It's also hard to pass binary data that way.

This was mentioned in one line of the article, so it's easy to miss (I only recently heard about them, which is why I caught it), but transferring binary data can actually be done with Transferable objects already[1].

However, for some high-performance applications even that overhead might be too much because it requires allocation. Also, having regions of opt-in shared memory allows for higher-level languages / patterns where message passing isn't the perfect answer.

An example off the top of my head that hits on both points (no-alloc + higher level patterns) would be to have one worker writing something encoded with SBE[2] into a shared buffer and having another consume it.

That will be zero-allocation (thus no GC pressure) and very fast. For a class of applications, avoiding GC pressure altogether is really important.
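A minimal sketch of the idea, using a hand-rolled fixed layout and `DataView` rather than the actual SBE library or wire format. The field names and offsets are invented for illustration, and the producer and consumer run in the same thread here. Across real workers, something like `Atomics` would also be needed to signal when a message is ready.

```javascript
// Hypothetical fixed layout for one "trade" message (not the real SBE format):
//   offset 0: uint32 id, offset 4: float64 price, offset 12: uint16 quantity
const MESSAGE_BYTES = 14;
const buffer = new SharedArrayBuffer(MESSAGE_BYTES);
const view = new DataView(buffer); // created once and reused for every message

// Producer side: overwrite the fields in place, allocating nothing.
function writeTrade(id, price, quantity) {
  view.setUint32(0, id, true); // true = little-endian
  view.setFloat64(4, price, true);
  view.setUint16(12, quantity, true);
}

// Consumer side: read a field straight out of the shared bytes.
function readPrice() {
  return view.getFloat64(4, true);
}

writeTrade(7, 101.25, 300);
console.log(view.getUint32(0, true), readPrice(), view.getUint16(12, true)); // 7 101.25 300
```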

It's a little sad that you can't share that back with the main thread, but it's not a deal breaker by any stretch. Think about the main thread in the web like a classic GUI event loop that you use only for rendering; you can still use transferable objects to get data to it, and you should be fine.

1: https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers...

2: https://github.com/real-logic/simple-binary-encoding

Would it work to have the data structures be copy-on-write? That way if the worker only reads then it's O(1), you just pass a reference to the worker.

I imagine it'd be a pain to write a garbage collector for something like that.

Copy on write is already implemented-ish with transferable objects [1] (just copy then write the copy).

However, copy on write still requires allocation and for some applications that is a deal breaker.

1: https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers...
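The copy-on-write idea from this exchange could look something like the following in user code. This is a single-threaded sketch with a hypothetical `CowArray` wrapper, not an engine feature. It shows reads staying free and the first write paying for an allocation, which is the cost the reply objects to.

```javascript
// Wraps a typed array; reads go to the shared data, the first write clones it.
class CowArray {
  constructor(source) {
    this.data = source;
    this.owned = false; // true once we hold a private copy
  }
  get(index) {
    return this.data[index];
  }
  set(index, value) {
    if (!this.owned) {
      this.data = this.data.slice(); // the allocation happens here
      this.owned = true;
    }
    this.data[index] = value;
  }
}

const base = new Float64Array([1, 2, 3]);
const a = new CowArray(base);
const b = new CowArray(base);
b.set(0, 99);
console.log(a.get(0), b.get(0), base[0]); // 1 99 1
```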

Perhaps we need a fast copy mechanism for JS to make message passing a more attractive option? Or make the interpreter recognize such cases and do it under the hood/natively. Keeping my fingers crossed for a more functional approach.

Will Spidermonkey be rewritten in Rust eventually?

Sidenote: WebGL 2.0 should come with ASTC support to be relatively future-proof.

Note that JITing compilers gain a lot less from Rust's safety guarantees than most code: you can easily JIT some code that breaks one of the safety properties that Rust ordinarily defends.

That is too far in the future to see.

From what I've seen, SIMD.js does make doing some calculations quicker, although it's not the 4x increase one might assume. It's more like 30% to 50%. The latency and overhead associated with moving the SIMD.js calculations into a Web Worker actually reduce the performance increase to as little as 10% to 20%.

While message passing might make sense for some tasks, it's not going to be quick enough to do things in 16ms and achieve the magical 60fps. Eventually shared memory is going to be needed, and if we get that far I think there will have to be some sort of acknowledgement that these performance-related features can't possibly be foolproof.

Oh, and an asynchronous thread safe DOM, please and thanks :)

I've been thinking that something combined with the use of async functions (es7) could be used for Shared* locking...

    sharedObject.lock(^() => {
      /* use object, which is locked until async promise resolves/rejects */
    });

where `^` is a short key for an async lambda function... `async` keyword could be used too... just throwing the `^` out there.

This could lock the object, allowing for an async function to execute... when the async function promise resolves, the lock is released... it would have to be limited to async functions though. but that would likely hit the JS engines around the same time as any shared objects anyway.
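Within a single thread, the "locked until the promise settles" behaviour can be sketched with a promise chain. `AsyncLock` is a hypothetical helper, it uses the `async` keyword rather than the `^` shorthand, and it only serializes tasks on one event loop. It says nothing about how a real cross-thread lock on shared objects would work.

```javascript
// A promise-chain lock: each task runs only after the previous one settles.
class AsyncLock {
  constructor() {
    this.tail = Promise.resolve();
  }
  // Runs `task` (an async function) while holding the lock; returns its promise.
  lock(task) {
    const run = this.tail.then(() => task());
    this.tail = run.catch(() => {}); // a rejected task still releases the lock
    return run;
  }
}

const guard = new AsyncLock();
const log = [];
const pause = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const first = guard.lock(async () => {
  log.push("first start");
  await pause(20); // the lock stays held across this await
  log.push("first end");
});
const second = guard.lock(async () => {
  log.push("second start");
});

const done = Promise.all([first, second]).then(() => log);
done.then((entries) => console.log(entries.join(", ")));
// first start, first end, second start
```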

For me, the biggest problem with workers is that you can't simply pass the functions that the worker needs (separated from state) from the main window... it means you're creating a separate script for workers, which isn't so bad, just not always easy to reason about.
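The limitation is easy to reproduce: functions are rejected by the structured clone algorithm that `postMessage` relies on. The sketch below uses `structuredClone` to show the failure without spawning a worker, then shows the usual workaround of shipping source text, which only works for self-contained functions and may be blocked by a page's Content Security Policy.

```javascript
function double(x) {
  return x * 2;
}

let errorName = null;
try {
  structuredClone({ task: double }); // same cloning rules as postMessage
} catch (err) {
  errorName = err.name;
}
console.log(errorName); // DataCloneError

// Source text can be sent, but any closure state is lost, so only
// functions that depend solely on their arguments survive the trip.
const source = double.toString();
const revived = new Function("return (" + source + ")")();
console.log(revived(21)); // 42
```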

It also makes isomorphic interfaces (async node & browser) slightly harder.

Does anyone know what the origin of async/await is? It's very convenient for simple cases, but it makes it very hard to mix synchronous and async code (C#).

How does it make it difficult?

You can only call async functions inside async functions.

See also What color is your function? http://journal.stuffwithstuff.com/2015/02/01/what-color-is-y...

If that were the case then async functions would be useless, since the global scope is not an async function. You can call async functions from regular functions; they will just return a promise. You can then use normal promise handling behavior (passing a callback function, which is how you'd currently handle async anyway).
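A short sketch of that point, using async functions as they were later standardized. `fetchAnswer` and `regularCaller` are placeholder names.

```javascript
async function fetchAnswer() {
  return 42; // an async function always wraps its result in a promise
}

// A regular, non-async function can call it; it just gets a promise back.
function regularCaller() {
  const pending = fetchAnswer();
  return pending.then((value) => value + 1);
}

const result = regularCaller();
console.log(result instanceof Promise);     // true
result.then((value) => console.log(value)); // 43
```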

In real-world scenarios though, most of the time you'll be reacting to events, and not calling async code from sync code.

Edit: To clarify I'm talking about the ES7 async/await feature.

I have encountered this problem: https://stackoverflow.com/questions/28708238/catching-except...

Basically, it's hard and prone to bugs.

The implementation in JS seems to be simpler and consequently less error-prone than the one in C#. Calling an async function just returns a Promise. The await operator just yields to the event loop until a given Promise is fulfilled. Exceptions are handled "normally" inside the async function with try/catch; outside, they're handled with a callback, like in existing code that uses promises.
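The two error-handling paths described here look roughly like this, again using async functions as later standardized. The function names are placeholders.

```javascript
async function failing() {
  throw new Error("boom"); // becomes a rejected promise for the caller
}

// Inside an async function: ordinary try/catch around await.
async function handledInside() {
  try {
    await failing();
    return "no error";
  } catch (err) {
    return "caught inside: " + err.message;
  }
}

// Outside: the rejection arrives through the promise's catch callback.
const outside = failing().catch((err) => "caught outside: " + err.message);
const inside = handledInside();

Promise.all([inside, outside]).then((messages) => console.log(messages));
```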

I would love to see a channel primitive similar to Golang's. They really seem to have hit the nail on the head with that one.
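For a sense of what such a primitive might look like on top of promises, here is a deliberately small sketch. It is not Go's channel semantics: this hypothetical `Channel` is unbounded, `send` never blocks, and everything runs on one event loop rather than across workers.

```javascript
// Minimal unbounded channel: send never blocks, receive returns a promise.
class Channel {
  constructor() {
    this.values = [];    // sent but not yet received
    this.receivers = []; // resolve callbacks waiting for a value
  }
  send(value) {
    const waiting = this.receivers.shift();
    if (waiting) waiting(value); // hand the value straight to a waiting receiver
    else this.values.push(value);
  }
  receive() {
    if (this.values.length > 0) return Promise.resolve(this.values.shift());
    return new Promise((resolve) => this.receivers.push(resolve));
  }
}

const channel = new Channel();
const seen = [];

async function consumer() {
  for (let i = 0; i < 3; i++) seen.push(await channel.receive());
}

const finished = consumer(); // starts waiting before anything is sent
channel.send("a");
channel.send("b");
channel.send("c");
finished.then(() => console.log(seen.join(""))); // abc
```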

The CSP model is very nice, but I don't think that's what this proposal is aiming for. CSP is more about concurrency, while the post makes it clear they are looking at parallelism. This is more addressing "things that appear to be single threaded but run faster" like image analysis, for instance.

They're obviously separate concepts, but tangentially related. Even in Go you'll learn that sometimes introducing multithreading to a concurrent application will actually reduce your performance due to reconciliation of context-switching within the application.

Javascript already has its own inherent concurrency, obviously, but it's not outlandish to say that introducing a goroutine/coroutine concept would be a lot more elegant and manageable.

Clojure and ClojureScript have core.async which does a great/better job https://github.com/clojure/core.async/blob/master/examples/w...

There's also a video if you care to watch https://www.youtube.com/watch?v=enwIIGzhahw

Clojure's core.async library works wonders in ClojureScript giving you channel primitives and coroutines.

I've never tried using them across web workers however.

I have always wanted Canvas as well as read-only array buffers (for things like map-reduce image analysis) to be accessible in web workers.

Why not implement the goroutine model?

Aren't goroutines pretty similar to WebWorkers, but with special syntax for creating them (and sending/receiving messages over channels) and perhaps lighter weight (though that may just be an implementation issue with current JS engines)?

Edit: never mind, it looks like goroutines can share memory (but channels are the preferred method of synchronization): https://golang.org/ref/mem

Good Article, Thanks.

This is really important work!

Oh no, will this become an ego thing? Parallelism must lie somewhere near abstraction as a premature optimization. Computers are hard, even single threaded.
