Replacing JavaScript Hot Path with WebAssembly (developers.google.com)
226 points by markdog12 on Feb 14, 2019 | 103 comments

My guess is that the JS implementation of the worst-performing browser is having trouble with the non-1 for-loop steps. Doing 90-degree image rotation with fixed steps and some index calculations should work better (0.18 sec vs 1.5 sec for their implementation in node.js):

    for (var y = 0; y < height; y++)
        for (var x = 0; x < width; x++)
            b[x + y*width] = a[y + (width - 1 - x)*height];
Although that's still far from the theoretical maximum throughput because the cache utilization is really bad. If you apply loop tiling, it should be even faster. This problem is closely related to matrix transpose, so there is a great deal of research you can build upon.

EDIT: 0.07 seconds with loop tiling:

    // assumes width and height are multiples of 64
    for (var y0 = 0; y0 < height; y0 += 64){
        for (var x0 = 0; x0 < width; x0 += 64){
            for (var y = y0; y < y0 + 64; y++){
                for (var x = x0; x < x0 + 64; x++){
                    b[x + y*width] = a[y + (width - 1 - x)*height];
                }
            }
        }
    }

Your 0.18 sec result is (to use the units they used in the article) 180ms, and if I understand correctly their best webassembly compiled and executed result (?) is 300ms. Beautiful.

EDIT: But it could also be that your computer is somewhat faster than theirs? Do you happen to have a very fast CPU? Can you say which one? When I run C-like C++ versions of your code I get the speeds you get with node.js. Either way, your results are much better overall than what they managed, so it's still great work!

    #include <stdio.h>
    int main(int argc, char* argv[]) {
        enum { height = 4096, width = 4096 };
        unsigned* a = new unsigned[ height*width ];
        unsigned* b = new unsigned[ height*width ];
        if ( argc < 2 ) { // call with no params
            // to measure overhead when just allocations
            // and no calculations are done
            printf( "%p %p\n", (void*)a, (void*)b );
            return 1;
        }
        if ( argv[1][0] == '1' ) { // call with 1 the fastest
            for (unsigned y0 = 0; y0 < height; y0 += 64)
                for (unsigned x0 = 0; x0 < width; x0 += 64)
                    for (unsigned y = y0; y < y0 + 64; y++)
                        for (unsigned x = x0; x < x0 + 64; x++)
                            b[x + y*width] = a[y + (width - 1 - x)*height];
        } else { // any other argument: the plain, untiled loop
            for (unsigned y = 0; y < height; y++)
                for (unsigned x = 0; x < width; x++)
                    b[x + y*width] = a[y + (width - 1 - x)*height];
        }

        return 0;
    }

I think it's fast because of the L1 cache or something like that. I don't fully understand it, but this is what I got.

The fastest version is the fastest because it's the most cache-friendly one of all which were presented. See e.g.


But note that robko made an improvement even before making that.

> made an improvement even before

Or maybe not: my short experiments with the simplified version based on their algorithm and his JavaScript versions gave some conflicting results. I haven't thoroughly verified them, this note is just to motivate the others to try.

I get 60ms in C. But in your code, the compiler might decide to remove most of the code since b is not used after being calculated. I checked the assembly code and it does not seem to be the case here, but it's still something to be aware of.

> I get 60ms in C

OK, I get cca 80ms for my run with the parameter 1 on my main computer, and 200ms on N3150 Celeron.

> b is not used after being calculated

Until now, I'd never seen a C compiler optimize away the call to the allocator and the accesses to the arrays it allocated. Maybe it's different now? Hm, dead code elimination... I guess randomly initializing a few values before the loop, then reading and printing a few values after it, should always be safe... Now that I think about it, so should filling the array with zeroes beforehand.

Maybe this is what you meant but the snippet can be optimised a ton as well unless I'm missing something:

- Move the "y * width" calculation outside of the "for x" loop.

- The multiply operators can be replaced with addition e.g. replace "y * width" with "counter += width" each y iteration and similarly for the x loop.
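A minimal sketch of the two suggestions above, in JavaScript to match the snippet under discussion. The function name `rotateReduced` and the running indices `rowStart` and `src` are illustrative, not from the article, and whether this beats the plain loop depends on the JIT, which may already perform the same transformation.

```javascript
// `a` is the source image (rows of `height` elements),
// `b` receives the rotated image (rows of `width` elements).
function rotateReduced(a, b, width, height) {
  // rowStart replaces y * width and is advanced once per row.
  for (let y = 0, rowStart = 0; y < height; y++, rowStart += width) {
    // src starts at y + (width - 1) * height and moves back by
    // `height` for every x, replacing the multiply in the inner loop.
    let src = y + (width - 1) * height;
    for (let x = 0; x < width; x++, src -= height) {
      b[rowStart + x] = a[src];
    }
  }
  return b;
}

// Tiny check: width = 2, height = 3.
const input = Uint32Array.from([1, 2, 3, 4, 5, 6]);
const out = rotateReduced(input, new Uint32Array(6), 2, 3);
console.log(Array.from(out)); // [ 4, 1, 5, 2, 6, 3 ]
```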

Optimising inner loops is really fun.

How much of the speed up in the article is because the JS engine can't figure out how to optimise it compared to the WebAssembly compiler?

These code motion/strength reduction optimizations are standard even in mildly optimizing compilers. I would be very surprised if an optimizing JavaScript compiler did not perform them automatically.

I tried a few micro-optimizations, but they did not make a measurable difference, so I kept the code short instead. But maybe some JIT is particularly bad at loop hoisting, so it might make a difference there.

Huh interesting! I always disliked butchering code to do processor cache optimizations and I kinda worked under the impression that a browser’s JS and wasm compilers would do these optimizations for me.

I’ll definitely give tiling a spin (although at this point we are definitely fast enough™️)

Can someone please explain why loop tiling increases performance in JS so dramatically? Is it mainly due to the fact that inner loops have constant size (64) and get called more frequently, and thus get promoted faster into deeper stages of JS runtime optimization?

My guess is that if you invoke the original code (before tiling) in an external loop (rotating images of exactly the same size), you will get a similar perf boost (not that it has any practical implication, but just to understand how the optimization works).

No, it's faster because the working set of 64 * 64 * 4 * 2 bytes can (almost) fit in CPU core L1 cache. Further cache levels are slower and finally the memory is glacially slow.
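The arithmetic behind that working-set figure, as a small JavaScript sketch. The 32 KiB L1 data cache size is an assumption (a common size on desktop x86 cores), not something stated in the thread.

```javascript
const TILE = 64;            // tile edge, in pixels
const BYTES_PER_PIXEL = 4;  // one 32-bit pixel
const BUFFERS = 2;          // the source tile and the destination tile

const workingSetBytes = TILE * TILE * BYTES_PER_PIXEL * BUFFERS;
const l1Bytes = 32 * 1024;  // assumed L1 data cache size

console.log(workingSetBytes);            // 32768
console.log(workingSetBytes / l1Bytes);  // 1
```

With these numbers the two tiles fill the assumed cache exactly, which is why the fit is only "almost".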

WASM example would speed up as well using the same approach. Or C, Rust or whatever.

To add background, this is a standard optimization technique that has been employed in eg fortran compilers since at least the 1980s.

Doesn't this rely on the CPU prefetching the memory to cache? Do current CPUs from Intel&AMD detect access patterns like this successfully? I.e. where you're accessing 64-element slices from a bigger array with a specific stride.

The idea is that the Y dimension is going to have a limited nr (here 64) of hot cache lines while a tile is processed. After going through one set of 64 vertical lines, the Y accesses are going to be near the Y accesses from the previous outer-tile-loop iteration.

(Stride detecting prefetch can help, especially on the first iteration of a tile, but is not required for a speedup).

BTW this is the motivation for GPUs (and sometimes other graphics applications) using "swizzled" texture/image formats, where pixels are organised into various kinds of screen-locality preserving clumps. https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-...

I tested these two pieces of code in different browsers on i7-8750H with 16GB of RAM.

Chrome: 248 ms vs 93 ms

Firefox: 552 ms vs 93 ms

MS Edge: 7486 ms vs 6186 ms

IE: 9590 ms vs 9156 ms

These are some WTF results, to be honest.

> As I understand, their main goal was to achieve easily readable and maintainable code, even to the detriment of performance.

Seems like a tricky goal for image algorithms in general where you're performing the same action over and over on millions of pixels. Obscure inner loop optimisations are pretty much required.

In these situations, I would sometimes keep the code for the naive but slow version around next to the highly optimised but difficult to understand version. You can compare the output of them to find bugs as well.
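One way to put that into practice, sketched in JavaScript: keep the plain loop as a reference and check the tiled version against it. The names `rotateNaive`, `rotateTiled` and `sameOutput` are made up for this example, and the `Math.min` bounds are an addition so that sizes which are not multiples of the tile still work.

```javascript
// Reference version: short and easy to read.
function rotateNaive(a, width, height) {
  const b = new Uint32Array(width * height);
  for (let y = 0; y < height; y++)
    for (let x = 0; x < width; x++)
      b[x + y * width] = a[y + (width - 1 - x) * height];
  return b;
}

// Optimised version: same mapping, visited tile by tile.
function rotateTiled(a, width, height, tile = 64) {
  const b = new Uint32Array(width * height);
  for (let y0 = 0; y0 < height; y0 += tile) {
    const yEnd = Math.min(y0 + tile, height);
    for (let x0 = 0; x0 < width; x0 += tile) {
      const xEnd = Math.min(x0 + tile, width);
      for (let y = y0; y < yEnd; y++)
        for (let x = x0; x < xEnd; x++)
          b[x + y * width] = a[y + (width - 1 - x) * height];
    }
  }
  return b;
}

// Run both on the same input and compare element by element.
function sameOutput(width, height) {
  const a = new Uint32Array(width * height);
  for (let i = 0; i < a.length; i++) a[i] = i; // every pixel distinct
  const expected = rotateNaive(a, width, height);
  const actual = rotateTiled(a, width, height);
  return expected.every((v, i) => v === actual[i]);
}

console.log(sameOutput(100, 70)); // true
```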

> My guess is that the JS implementation of the worst-performing browser is having trouble with the non-1 for-loop steps.

Why would non-1 for loop be slower in some browsers? Does the compiler add some sort of prefetch instruction in the faster browsers based on the loop increment?

Did you see the benchmarks? There's almost no difference between javascript and wasm except for a single certain browser. So you're really going to take on the maintenance burden to get that better performance?

This is a cool technique but I can just imagine the looks on my teammates' faces when I tell them it isn't react... :/

We have to remember that the current WASM spec is still "just" an MVP. It doesn't yet include performance-related features (like SIMD). WASM is also fairly recent. The JS interpreters/JITs in browsers have seen years of optimization with a trove of real-world usage. It will take some time for WASM to be able to compete seriously.

Another factor is also that the WASM compilers for various languages (Rust, C/C++, etc) are obviously recent too and not super optimized.

My own tiny experiment suggests that WASM can already yield quite a decent performance gain, but only with very compute-intensive loads, which is not a typical problem in frontend development. The size gain is also real, but you need to handcraft your WASM or forget about using the std and other stuff in the language you are compiling from (Rust generates a very fat binary with a naïve implementation, for example).

Still, I am quite optimistic about WASM. I was actually impressed that, even though it is quite recent, it can already compete with JS when it comes to performance. Once the various performance-related specs are finalized and implemented, and browsers and compilers start heavily optimizing WASM, we should really see some real-world gains.

It isn't a rebuttal to this article, but I found this blog post about optimizing javascript to be really, really interesting:


Don't forget to follow that up with the second follow-up: http://fitzgeraldnick.com/2018/02/26/speed-without-wizardry....

WASM's biggest claim to fame is providing web development access to non js devs. Having done C for a majority of my life, the ability to build and execute C code for large scale web deployment is appealing!

Weird that they didn't say which browser is which :/

Probably were worried it would look like they're trying to shame their business partners (Apple and Microsoft I guess).

Actually it seems that the second worst in JavaScript (when executing their example) is Chrome?

User robko here https://news.ycombinator.com/item?id=19167078 measured the code on node.js, which is based on Chrome's V8, and got 1.5 sec versus the article author's roughly 2.7 s, so it would seem that robko's CPU is almost twice as fast. The other two (fast) JavaScript engines are under 500 ms and the slowest is 8 seconds, so Chrome's V8 remains the only candidate for the second-worst performer on their example.

I wish they had at least posted a browser-runnable version of their test so we could see for ourselves which browser is which, or compare JS vs WASM on our own systems. (On this type of code, I'd expect Safari to be the fastest, not Chrome.)

See my "minimal" C++ translation in my other post here. There's not much to add. For JavaScript start with their code, but add the allocation, just replace allocations with var a = new Uint32Array(height * width); and b the same. Add the timing (1), put in HTML and you're done. It's easy, just a few minutes for anybody who works with that (and this site should be filled with the competent developers AFAIK).

1) https://developer.mozilla.org/en-US/docs/Web/API/Performance
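A minimal sketch of the harness described above, using `performance.now()` from the linked Performance API. The `timeIt` helper and the 1024 x 1024 size are assumptions made for this example (the thread uses 4096 x 4096); timings are printed rather than asserted because they vary by machine and engine.

```javascript
// Fall back to Date.now() where the Performance API is unavailable.
const now = typeof performance !== 'undefined'
  ? () => performance.now()
  : () => Date.now();

// Run fn once and report how long it took, in milliseconds.
function timeIt(label, fn) {
  const start = now();
  fn();
  const elapsed = now() - start;
  console.log(`${label}: ${elapsed.toFixed(1)} ms`);
  return elapsed;
}

const width = 1024, height = 1024;
const a = new Uint32Array(height * width);
const b = new Uint32Array(height * width);
for (let i = 0; i < a.length; i++) a[i] = i; // non-trivial input

const elapsedMs = timeIt('plain loop', () => {
  for (let y = 0; y < height; y++)
    for (let x = 0; x < width; x++)
      b[x + y * width] = a[y + (width - 1 - x) * height];
});
```

Pasted into a script tag in an HTML page, the same code runs in a browser; a single run includes JIT warm-up, so repeating the call a few times gives a fairer picture.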

They said it in the article:

>Note: Due to legal concerns, I won’t name any browsers in this article.

What legal concerns? Companies benchmark their competitors all the time.

Yep. It's complete bullshit and it's a shame to see cowardly corporate legal fearmongering like this in a company like Google, that was once on the same wavelength as the technical/hacker community. As if Firefox, Microsoft or Apple would sue them for publishing one browser benchmark.

Even worse if it were a pretext to not make Chrome look bad.

"Legal concerns" is a weird excuse, but personally, I'm glad they didn't name names. The point of this article isn't to shame any browser vendors, it's to talk about WebAssembly. Naming the browsers would have just distracted from the article's topic.

If that were the explanation given, it would be completely fine :)

> There's almost no difference between javascript and wasm except for a single certain browser.

For very large values of "single", approaching "two". In the "Speed comparison per language" chart, Browser 3 is more than 5x slower than Browser 2 on JavaScript/WASM, and Browser 4 is slower still. So there are very significant improvements on two out of the four browsers tested.

Yes if it provides a better product for users.

That's not true in almost any circumstance. What "benchmarks" are you referencing?

There's got to be at least one RESF member on your team. You can sell the technique by telling them they get to use Rust in your web front end.

The "predictable performance" point applies not just to performance across browsers but also that you don't need to pay JIT warm-up costs. A while back, I ran some benchmarks on the same codebase in TypeScript and AssemblyScript and found that wasm was much faster than JS for short computations and often slower than JS when V8 is given multiple seconds to fully warm up the JIT:


So really, it depends a lot on the use case. In my case, it's often a short-lived node process that a user is directly waiting on, so compiling to wasm is probably useful. It also depends on what you're doing; some types of work (e.g. where you'd want careful memory management) are a lot harder for V8 to optimize from JS and can be expressed more nicely in AssemblyScript or another language that gives more memory flexibility.

For that, it looks like unless you're running the same js on a really huge dataset webassembly will win (going from the second speed test). Even when you're compiling 50MB of JS with that thing, Wasm is 5% slower than JS, and when you're compiling 500KB (more typical) it's 300% faster.

Wow all these numbers seem insanely bad. 500 milliseconds to transpose 16 million pixels (so 64mil bytes)? A modern CPU should be able to do that at least 10x faster, if not 100x.

They are bad but not way off for that basic for loop, depending on which rotation is being applied.

Using their code on my Intel-based workstation at around 3 GHz using GCC 7.3 it takes around 80-100ms to rotate a 4096x4096 buffer 90 or 270, and 14ms to rotate 180.

Max memory bandwidth of something like an i9-9900k is 41.2GB/s. This test reads & writes 128 MiB of data. So max theoretical achievable performance here is around 3-4ms. Max theoretical. So 100x is not really feasible. 10x, though, very much is, as the quick convert shows a peak time of 14ms with a 180° rotation.

Of course the major source of slowness here is that the reads/writes are not sequential, and the 90 & 270 rotations are achieving a fraction of the possible bandwidth they could as the input reads are jumping around, so every single one is a cache miss and the other 60 bytes in each cache line on the miss will be purged before it's used again.

Flipping it would mean the writes are never utilizing a full cache line, either, though. So you can't really "fix" that, not easily at least. So either your read or write bandwidth ends up tanking and you can only achieve roughly 6% of the max (only ever using 4 bytes of the 64-byte cache line) for that half of the problem. Without some clever magic to handle this your max theoretical on a 41.2GB/s CPU drops to around 50ms.
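The back-of-the-envelope numbers above can be reproduced with a few lines of JavaScript. This only restates the figures quoted in the comment (41.2 GB/s, a 4096 x 4096 buffer of 4-byte pixels, 64-byte cache lines); it is a rough bound, not a measurement.

```javascript
const pixels = 4096 * 4096;
const bytesMoved = pixels * 4 * 2;  // read every pixel once, write it once
const bandwidth = 41.2e9;           // bytes per second (quoted peak)

// Time to move the data if every byte went at peak bandwidth.
const bestCaseMs = (bytesMoved / bandwidth) * 1000;
console.log(bytesMoved / (1024 * 1024)); // 128 (MiB)
console.log(bestCaseMs.toFixed(2));      // 3.26

// Using only 4 bytes out of each 64-byte cache line:
const lineUtilization = 4 / 64;
console.log(lineUtilization);            // 0.0625, i.e. roughly 6%
```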

All that said it's clear that WASM is very far off from native levels of performance. ~5x slower isn't something to brag about. But hey maybe the test system was a potato, and the 500ms isn't as bad as it sounds.

You are correct. The code is using an inefficient cache access pattern, so most of the time is spent waiting.

You probably won't get 100x faster without SIMD, but 10x is certainly doable. Unfortunately, SIMD.js support was removed from Chrome and Firefox a while ago, even though it is not available in wasm to this day.

How would SIMD do anything to address the problem's fundamental anti-cache-friendly access patterns? You'd need to restructure the problem to be cache-friendly, but SIMD won't really be relevant to that.

You can use both at once. Usually, you'd have something like 64x64 tiles in cache and use 4x4 or 8x8 tiles for SIMD.

Or better yet, WebGL should be able to do this in a few ms on a GPU.

Or simply use the canvas api, which has super optimized graphics libraries behind it - rather than reimplementing the wheel :)

But I get that this was really about how much wasm can help performance, as a percentage vs JS. You could always write an “optimized” routine and compare those, and theoretically achieve something similar.

The article mentions why they couldn't use canvas for this: they are running this code in a worker, and canvas support in workers is not great in browsers so far.

Not only that, there's a nasty bug in Chrome that makes it unusable for our use-case https://bugs.chromium.org/p/chromium/issues/detail?id=906619

Ah, my bad for skimming - I thought most canvas stuff worked these days? (I recall that many years ago, when I worked on such things, fonts were the biggest problem, but also people generally wanting to be able to paint DOM elements in there as well)

It is OffscreenCanvas, the variant that works in web workers, that has poor browser support.

In my experience, the canvas api is very slow and not well thought-out. For example, to create a native image object from raw pixels, you have to copy the pixels into an ImageData object, draw it to a canvas, create a data URL from the canvas and then load an image from that data URL.

Can we expect a day when WASM will be first class citizen in browsers (i.e like JavaScript) and not just a sidekick?

With threads, SIMD, GC, direct DOM access and more, and tools like Emscripten and wasm-pack... yes



And don't forget support for lock-free programming (memory fence instructions), useful when you want to implement your own specific concurrent GC, for example.

Not anytime soon IMO because WASM still has to access browser APIs through the DOM, which is really built with JS in mind.

HTML DOM is described in terms of IDL interfaces, complete with types. I wouldn't say that it's optimized for JS - indeed, that's why jQuery and similar were introduced. When WHATWG took over, they improved it specifically for better JS interop, but it's still straightforward to map to most statically typed languages.



The problem isn’t exposing the APIs, the problem is that wasm has what is essentially the C memory model, so you couldn’t trust any pointer/object you get from wasm land.

That’s why there’s so much work being put into giving wasm a more typical (for a vm) typed heap. Similar issues occur with lifetime of objects - if you get anything from the DOM, you have to keep it live while wasm references it, but wasm has no idea of what memory or a handle is.

These are solvable problems, but you’re not getting dom access until after they’re solved.

Why can't wasm just use opaque handles for DOM objects? It doesn't need them to be in wasm-accessible memory, after all. It just needs to be able to invoke methods on them.

It’s not “wasm just needs to be able to invoke them”

Because the wasm memory model doesn’t have typed memory - if you call a dom api and get a handle back, you need to store it. Then you need to be able to pass it back to the host vm.

So now your wasm code needs to make sure the handle stays live - wasm by design doesn’t interact with the host GC, so you have to manually keep the handle alive (refcounting APIs or whatever), and the host VM has to have some way to deal with you trying to use the handle without having kept it alive.

Similarly, because wasm is designed around storing raw memory in the heap, the wasm code can treat the handles as integers. E.g. an attacker can just generate spoofed handles and try to create type-confusion bugs, or maybe manually over-release things.

So the problem isn’t “how do we let wasm make these calls” but rather “how do we do that without making it trivially exploitable”.
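To make the concern concrete, here is a rough JavaScript sketch of the kind of handle table a host could keep so that wasm only ever sees integers. The `HandleTable` class and its methods are hypothetical, written for illustration; they are not an API from any browser or toolchain. Note what it does and does not address: an unknown or already-released handle is rejected, but a forged integer that happens to name a live object of the wrong type still gets through, which is the type-confusion risk described above.

```javascript
class HandleTable {
  constructor() {
    this.slots = new Map(); // handle (integer) -> host object
    this.next = 1;          // 0 is reserved as "no object"
  }
  // Host side: store an object and give wasm an integer to hold on to.
  // The Map entry also keeps the object alive for the host GC.
  alloc(obj) {
    const handle = this.next++;
    this.slots.set(handle, obj);
    return handle;
  }
  // Every call coming from wasm has to validate the integer it passes.
  get(handle) {
    if (!this.slots.has(handle)) {
      throw new Error(`unknown or already released handle: ${handle}`);
    }
    return this.slots.get(handle);
  }
  // wasm must call this explicitly; forgetting to do so leaks the object.
  release(handle) {
    if (!this.slots.delete(handle)) {
      throw new Error(`unknown or already released handle: ${handle}`);
    }
  }
}

const table = new HandleTable();
const h = table.alloc({ tagName: 'DIV' }); // stand-in for a DOM node
console.log(table.get(h).tagName);         // DIV
table.release(h);
```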

WASM ref handles for DOM nodes are coming.


But surely that is also fundamentally a solved problem? I mean, we've had distributed systems for a long time, and they had to deal with all the same issues - lifetime, security etc.

Distributed systems are designed (for better or worse) on the idea of non-malicious nodes.

Those that aren't have an extremely limited API - that would be logically not dissimilar from "untrusted wasm talks to more trusted JS".

Why not anytime soon? VBScript and Dart were given the ability to access the DOM in IE and Chrome in the past.

I would have loved to see Go included in the comparison as well. It can compile to wasm since 1.11.

Go was pretty much a non-starter. They (currently) need a runtime which will make the file size non-competitive to the other ones. Also, since only Chrome has support for threads in WebAssembly (in Canary), we’d not be able to make any use of the concurrency.

I'd be tempted to hand-write the WAT for that. It's not that bad, much easier than dropping into a x86 asm block in C or something.

I did :D Turns out compilers are pretty good at generating WAT.

Hmm.. the Google article stipulates:

> WebAssembly on the other hand is built entirely around raw execution speed. So if we want fast, predictable performance across browsers for code like this, WebAssembly can help.

So I wanted to see how I could use WebAssembly in a React web app. I found this SO question, which reports the opposite:

> When running this [ WebAssembly] code in Chrome, I observe "pauses" that cause the app to be a bit jittery. Running the app in Firefox is a lot faster and smoother.


Same code, different browser, different performance. I'd love seeing a Google Developer answering that question in depth..

That looks like a bug, rather than an inherent characteristic of wasm.

I would try optimizing the JS before dropping down to webassembly. For example try replacing let and const with var as let and const in loops have to create a new variable for each iteration.

That's not how let and const work at all, where did you get that impression?

Have you ever made a for loop using var only to have the variable point to the last value in the iteration? And had to make a closure using forEach, a function or a self-calling function? With let you do not have to do that, as a new variable is created for each iteration instead of being reassigned, as happens when you use var.
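The behaviour being argued about can be checked directly. In this sketch each loop stores closures and calls them afterwards; `withVar` and `withLet` are just names chosen for the example. It shows the per-iteration binding that `let` gets in a `for` loop. Whether that has any measurable cost in a hot loop that creates no closures is a separate question that this example does not answer.

```javascript
function withVar() {
  const fns = [];
  for (var i = 0; i < 3; i++) fns.push(() => i); // one shared `i`
  return fns.map((f) => f());
}

function withLet() {
  const fns = [];
  for (let i = 0; i < 3; i++) fns.push(() => i); // fresh `i` per iteration
  return fns.map((f) => f());
}

console.log(withVar()); // [ 3, 3, 3 ]
console.log(withLet()); // [ 0, 1, 2 ]
```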

WebAssembly is a bit underwhelming to be honest. It feels like every week there is a new language that can come close to C performance meanwhile they've been working on WebAssembly for years and years and it can barely beat JS.

Shouldn't WA as a greenfield project with its extremely basic memory model and lack of runtime or standard library be super easy to optimize?

After all, there is no point in having the bad ergonomics of assembly together with the awful performance of JS, right?

Those are runtimes in the second range. Are they doing that in a separate thread or do they block the UI? And how long does it take to transfer the data to that thread?

The performance gains are so small, it's not worth the setup overhead. The average user won’t see the difference. Hence the simplicity of this module. Just do it in JS, Chrome has 70% market share, why would you ever bother?

V8 has received decades of optimizations and it can easily compete with compiled languages in terms of speed.

I was hyped to death for WASM, but this is the tenth article I’m reading on this subject and I still end up at the same conclusion: there is no advantage for front-end developers in using WASM.

Only rendering engines (Unity, Adobe products, Autodesk) can really benefit from this.

> chrome [h]as 70% market share why would you ever bother ?

This view seriously needs to die. It's honestly not that hard to test in two or three browsers, and the differences are minor enough that it isn't a pain. But the only way that's possible is through Web standardization, which only happens when there are diverse options.

As web developers, it's our duty to keep the web healthy, and that means not only optimizing for a single browser.

Agreed. This is IE 6.0 all over again.

With the exception that Chrome is a good browser.

While msft did abuse their position to solidify an IE centric world, people need to realize that when ie4/5/6 were released they were dramatically better than the competitors. The problem is that post-domination they simply stagnated and so the design shortcomings start being a problem.

It needs to be repeated: at the time IE /was/ a good browser. Just like chrome today. And similar to chrome played fast and loose with web exposed features. Sometimes for the better (XHR was an IE invention), sometimes for worse (so was activeX).

> Sometimes for the better (XHR was an IE invention), sometimes for worse (so was activeX).

Wasn't XMLHttpRequest an ActiveX object?

I.E. 6 was a good browser when it was released too.

When IE6 was released it was much better than the nearest competitor. But stagnating for 6 years made it become basically one of the worst.

and an open source browser with at least 2 major browser vendors using its core rendering engine... Edge and Opera

But that simply means there’s functionally only one browser - the fact that there are different skins isn’t really relevant.

Standards are not about “anyone can just use that implementation” they are about “anyone can make a competing implementation”.

Looking at the sources is not a specification.

The standards are just economically prohibitive to implement all over again.

By that logic it was a waste of time for Firefox to exist -- there was already IE, or it was a waste of time for webkit to exist as there was already khtml, or blink because webkit, etc, etc

People only caring about one browser is exactly what caused ie6 to become such a problem - everyone had to reverse engineer whatever it was doing because nothing was specified.

> By that logic it was a waste of time for Firefox to exist -- there was already IE,

No, IE would need to be open source for that logic to be applicable there, since the idea is to use a well-developed open source code base instead of rolling your own thing.

> or it was a waste of time for webkit to exist as there was already khtml, or blink because webkit,

You actually undercut your own point with these examples: WebKit was a fork of KHTML, Blink was a fork of WebKit. The developers in question believed that it would have been a waste of time to start from scratch, and so they didn't!

Maybe, but they were only possible because web developers had started considering Firefox in addition to IE. Even then the amount of time spent reverse engineering IE behavior was absurd - when webkit forked khtml it could not render yahoo.com correctly (it mattered then ;) ).

This post is saying you only need to test chrome because it’s 80% of the market. Back in the day IE was more than 90% of the market.

If all you do is test on chrome you force every competitor to reverse engineer chrome (you can’t fork chrome to make a gpl browser). Alternatively you give up and just use chrome (skinned or not), and that dictates the features you get (I don’t see chrome getting built in tracker blocking any time soon).

You can’t use alternative browsers because the web is filled with sites that are only tested on chrome.

Congratulations you have recreated IE.

No, it's not like IE at all because IE was closed source. This was what I was trying to say earlier: the whole reason IE was "bad" was because it stagnated, which would not have been possible if it was open source. In this case, it's more like Linux.

The article sets out to prove the predictability of WASM's performance, and not necessarily a performance gain wrt js.

> This confirms what we laid out at the start: WebAssembly gives you predictable performance. No matter which language we choose, the variance between browsers and languages is minimal

If you're not hyped about WASM, it's probably because your app and customer base's browser preferences are on the js engine's JIT happy-path, which could hold true for most apps. There could very easily be a js path that is significantly worse in performance on chrome, just saying, 70% market share is both a blessing and a curse.

Another major reason for WASM hype is for C#, Rust, C, C++, Go devs to reach parity with js in terms of web accessibility. Frameworks like Blazor (from MSFT) have taken all the best practices & advantages of React and made them available to C# devs.

> have taken all the best practices & advantages of React and made them available to C# devs.

The irony is that C# devs were the first to use reactive programming before React even existed.

Chrome hasn't optimized wasm very well yet. Wasm isn't and never was meant for frontend. It's meant for crunching data and making it possible to use the thousands of C libs computers run on to have a safe and efficient execution environment that is not restrained by the host software having implemented that C lib directly.

For example there was this app in C# that would convert images into 512 color palette and use dithering to retain some quality. I made a version in the browser, but because of js being too slow it didn't work for large images. Thing is, mine was far safer and accessible than the C# program.

anyone like to take a guess what "Browser 4" is?

Too bad no wasm on IE 11

Where it would be needed most

Depends on your use case... a lot of places are just plain deprecating IE altogether.

As well they should:-)

But IE 11 has been given a stay of execution until 2025 I believe.

And sadly most of my users are still on it (government and healthcare)

Isn't Microsoft one of the founding companies in the initiative? [0]

[0] https://webassembly.org/

Microsoft !== IE

The logo on the front page you linked is the logo of Microsoft's other browser, Edge. There is no other mention of Microsoft or Internet Explorer on it.

(overly ironic) tl;dr "let's rewrite something in this new browser feature because the other browser feature we added last week is not supported anywhere and buggy in chrome"
