My guess is that the JS implementation of the worst-performing browser is having trouble with the non-1 for-loop steps. Doing 90-degree image rotation with fixed steps and some index calculations should work better (0.18 sec vs 1.5 sec for their implementation in node.js):
for (var y = 0; y < height; y++)
    for (var x = 0; x < width; x++)
        b[x + y*width] = a[y + (width - 1 - x)*height];
Although that's still far from the theoretical maximum throughput because the cache utilization is really bad. If you apply loop tiling, it should be even faster. This problem is closely related to matrix transpose, so there is a great deal of research you can build upon.
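A quick sanity check of the index math on a tiny square image (my own snippet, not from the article; for a square image width === height, so a single size N suffices):

```javascript
// 90-degree clockwise rotation of a square N x N image stored
// row-major, using the same index formula as the loop above.
function rotate90(a, N) {
  const b = new Uint32Array(N * N);
  for (let y = 0; y < N; y++)
    for (let x = 0; x < N; x++)
      b[x + y * N] = a[y + (N - 1 - x) * N];
  return b;
}

// Input 2x2 image:        Rotated 90 degrees clockwise:
//   0 1                     2 0
//   2 3                     3 1
console.log(Array.from(rotate90(Uint32Array.from([0, 1, 2, 3]), 2)));
// → [ 2, 0, 3, 1 ]
```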
EDIT: 0.07 seconds with loop tiling:
// assumes width and height are multiples of 64
for (var y0 = 0; y0 < height; y0 += 64) {
    for (var x0 = 0; x0 < width; x0 += 64) {
        for (var y = y0; y < y0 + 64; y++) {
            for (var x = x0; x < x0 + 64; x++) {
                b[x + y*width] = a[y + (width - 1 - x)*height];
            }
        }
    }
}
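For anyone copying the tiled version: a quick equivalence check against the untiled loop, on a square image whose size is a multiple of the 64-pixel tile (my own sketch, not from the original comment):

```javascript
// Untiled reference rotation (square N x N image).
function rotateNaive(a, N) {
  const b = new Uint32Array(N * N);
  for (let y = 0; y < N; y++)
    for (let x = 0; x < N; x++)
      b[x + y * N] = a[y + (N - 1 - x) * N];
  return b;
}

// Tiled rotation; N must be a multiple of the 64-pixel tile size.
function rotateTiled(a, N) {
  const b = new Uint32Array(N * N);
  for (let y0 = 0; y0 < N; y0 += 64)
    for (let x0 = 0; x0 < N; x0 += 64)
      for (let y = y0; y < y0 + 64; y++)
        for (let x = x0; x < x0 + 64; x++)
          b[x + y * N] = a[y + (N - 1 - x) * N];
  return b;
}

// Tiling only reorders iterations, so the outputs must match exactly.
const N = 256;
const src = Uint32Array.from({ length: N * N }, (_, i) => i);
const b1 = rotateNaive(src, N);
const b2 = rotateTiled(src, N);
let same = true;
for (let i = 0; i < b1.length; i++) if (b1[i] !== b2[i]) same = false;
console.log(same); // true
```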
Your 0.18 sec result is (to use the units the article uses) 180 ms, and if I understand correctly their best compiled-and-executed WebAssembly result is 300 ms. Beautiful.
EDIT: But it could also be that your computer is simply faster than theirs? Do you happen to have a very fast CPU? Can you say which? When I run C-like C++ versions of your code, I get the speeds you get with node.js. Still, you achieved much better results overall than they did; it's great work!
#include <stdio.h>

int main(int argc, char* argv[]) {
    enum { height = 4096, width = 4096 };
    unsigned* a = new unsigned[height * width];
    unsigned* b = new unsigned[height * width];
    if (argc < 2) { // call with no params
        // to measure overhead when just allocations
        // and no calculations are done
        printf("%p %p\n", (void*)a, (void*)b);
        return 1;
    }
    if (argv[1][0] == '1') // call with 1 for the fastest
        for (unsigned y0 = 0; y0 < height; y0 += 64)
            for (unsigned x0 = 0; x0 < width; x0 += 64)
                for (unsigned y = y0; y < y0 + 64; y++)
                    for (unsigned x = x0; x < x0 + 64; x++)
                        b[x + y*width] = a[y + (width - 1 - x)*height];
    else
        for (unsigned y = 0; y < height; y++)
            for (unsigned x = 0; x < width; x++)
                b[x + y*width] = a[y + (width - 1 - x)*height];
    return 0;
}
Or maybe not: my short experiments with a simplified version based on their algorithm and his JavaScript versions gave some conflicting results. I haven't verified them thoroughly; this note is just to motivate others to try.
I get 60ms in C. But in your code, the compiler might decide to remove most of the code since b is not used after being calculated. I checked the assembly code and it does not seem to be the case here, but it's still something to be aware of.
OK, I get ca. 80 ms for my run with the parameter 1 on my main computer, and 200 ms on an N3150 Celeron.
> b is not used after being calculated
I've never before seen a C compiler optimize away the call to the allocator and the accesses to the allocated arrays. Maybe it's different now? Hm, dead code elimination... I guess randomly initializing a few values before the loop and reading and printing a few values after it should always be safe. Come to think of it, filling the array with zeroes beforehand should work too.
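Sketched in JavaScript (the same idea applies to C): one cheap way to make the benchmarked loop observable so dead code elimination cannot remove it. The checksum indices here are arbitrary, my own choice:

```javascript
// Benchmark guard: initialize a couple of inputs before the loop and
// consume a few outputs afterwards, so the rotation has an observable
// effect and cannot be dead-code-eliminated.
const N = 1024;
const a = new Uint32Array(N * N);
const b = new Uint32Array(N * N);
a[0] = 1;
a[N * N - 1] = 2;

const t0 = Date.now();
for (let y = 0; y < N; y++)
  for (let x = 0; x < N; x++)
    b[x + y * N] = a[y + (N - 1 - x) * N];
const t1 = Date.now();

// Reading and printing a few results makes the loop's effect observable.
const checksum = b[0] + b[(N * N) >> 1] + b[N * N - 1];
console.log(`${t1 - t0} ms, checksum ${checksum}`);
```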
These code motion/strength reduction optimizations are standard even in mildly optimizing compilers. I would be very surprised if an optimizing JavaScript compiler did not perform them automatically.
I tried a few micro-optimizations, but they did not make a measurable difference, so I kept the code short instead. But maybe some JIT is particularly bad at loop hoisting, so it might make a difference there.
Huh interesting! I always disliked butchering code to do processor cache optimizations and I kinda worked under the impression that a browser’s JS and wasm compilers would do these optimizations for me.
I’ll definitely give tiling a spin (although at this point we are definitely fast enough™️)
Can someone please explain why loop tiling increases performance in JS so dramatically? Is it mainly because the inner loops have a constant size (64) and run more often, and thus get promoted faster into the deeper stages of the JS runtime's optimization pipeline?
My guess is that if you invoke the whole original code (before tiling) in an external loop (rotating images of exactly the same size over and over), you will get a similar perf boost (not that this has practical implications; it's just to understand how the optimizer works).
No, it's faster because the working set of 64 * 64 pixels * 4 bytes * 2 arrays = 32 KiB can (almost) fit in the CPU core's L1 data cache. Further cache levels are slower, and finally main memory is glacially slow.
WASM example would speed up as well using the same approach. Or C, Rust or whatever.
Doesn't this rely on the CPU prefetching the memory to cache? Do current CPUs from Intel&AMD detect access patterns like this successfully? I.e. where you're accessing 64-element slices from a bigger array with a specific stride.
The idea is that the Y dimension is going to have a limited number (here 64) of hot cache lines while a tile is processed.
After going through one set of 64 vertical lines, the Y accesses are going to be near the Y accesses from the previous outer-tile-loop iteration.
(Stride detecting prefetch can help, especially on the first iteration of a tile, but is not required for a speedup).
BTW this is the motivation for GPUs (and sometimes other graphics applications) using "swizzled" texture/image formats, where pixels are organised into various kinds of screen-locality preserving clumps. https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-...
> As I understand it, the main goal was to achieve easily readable and maintainable code, even to the detriment of performance.
Seems like a tricky goal for image algorithms in general where you're performing the same action over and over on millions of pixels. Obscure inner loop optimisations are pretty much required.
In these situations, I would sometimes keep the code for the naive but slow version around next to the highly optimised but difficult to understand version. You can compare the output of them to find bugs as well.
> My guess is that the JS implementation of the worst-performing browser is having trouble with the non-1 for-loop steps.
Why would non-1 for loop be slower in some browsers? Does the compiler add some sort of prefetch instruction in the faster browsers based on the loop increment?
Did you see the benchmarks? There's almost no difference between javascript and wasm except for a single certain browser. So you're really going to take on the maintenance burden to get that better performance?
This is a cool technique but I can just imagine the looks on my team mates faces when I tell them it isn't react... :/
We have to remember that the current WASM spec is still "just" an MVP. It doesn't yet include performance-related features (like SIMD).
WASM is also fairly recent. The JS interpreters/JITs in browsers have seen years of optimization against a trove of real-world usage. It will take some time for WASM to be able to compete seriously.
Another factor is also that the WASM compilers for various languages (Rust, C/C++, etc) are obviously recent too and not super optimized.
My own tiny experiment suggests that WASM can already yield quite a decent performance gain, but only for very compute-intensive loads, which is not a typical problem in frontend development.
The size gain is also real, but you need to handcraft your WASM or forgo the standard library and other conveniences of the language you are compiling from (a naïve Rust implementation generates a very fat binary, for example).
Still, I am quite optimistic about WASM. I was actually impressed that, even though it is quite recent, it can already compete with JS when it comes to performance. Once the various performance-related specs are finalized and implemented, and browsers and compilers start heavily optimizing WASM, we should see some real-world gains.
WASM's biggest claim to fame is providing web development access to non js devs. Having done C for a majority of my life, the ability to build and execute C code for large scale web deployment is appealing!
Actually it seems that the second worst in JavaScript (when executing their example) is Chrome?
User robko here https://news.ycombinator.com/item?id=19167078 measured the code on node.js, which is based on Chrome's V8. He got 1.5 sec where the article's author got around 2.7 sec, so robko seems to have an almost twice as fast CPU. The other two (fast) JavaScript results are under 500 ms, and the slowest is 8 seconds, so Chrome's V8 remains the only candidate for the second-worst performer on their example.
I wish they had at least posted a browser-runnable version of their test so we could see for ourselves which browser is which, or compare JS vs WASM on our own systems. (On this type of code, I'd expect Safari to be the fastest, not Chrome.)
See my "minimal" C++ translation in my other post here; there's not much to add. For JavaScript, start with their code but add the allocations: just replace them with
var a = new Uint32Array(height * width); and the same for b. Add the timing, put it in HTML and you're done. It's easy, just a few minutes for anybody who works with this stuff (and this site should be filled with competent developers, AFAIK).
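For concreteness, a minimal version of that harness might look like this (my own sketch; I use 2048 instead of the article's 4096 so it finishes quickly, and performance.now works in both browsers and node):

```javascript
// Square test image; the article uses 4096, 2048 keeps this demo quick.
const size = 2048;
const a = new Uint32Array(size * size);
const b = new Uint32Array(size * size);

// Time a function and print the elapsed milliseconds.
function time(label, fn) {
  const t0 = performance.now();
  fn();
  console.log(`${label}: ${(performance.now() - t0).toFixed(1)} ms`);
}

time("plain", () => {
  for (let y = 0; y < size; y++)
    for (let x = 0; x < size; x++)
      b[x + y * size] = a[y + (size - 1 - x) * size];
});

time("tiled", () => {
  for (let y0 = 0; y0 < size; y0 += 64)
    for (let x0 = 0; x0 < size; x0 += 64)
      for (let y = y0; y < y0 + 64; y++)
        for (let x = x0; x < x0 + 64; x++)
          b[x + y * size] = a[y + (size - 1 - x) * size];
});
```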
Yep. It's complete bullshit, and it's a shame to see this kind of cowardly corporate legal fearmongering from a company like Google, which was once on the same wavelength as the technical/hacker community. As if Firefox, Microsoft or Apple would sue them for publishing one browser benchmark.
Even worse if it were a pretext to not make Chrome look bad.
"Legal concerns" is a weird excuse, but personally, I'm glad they didn't name names. The point of this article isn't to shame any browser vendors, it's to talk about WebAssembly. Naming the browsers would have just distracted from the article's topic.
> There's almost no difference between javascript and wasm except for a single certain browser.
For very large values of "single", approaching "two". In the "Speed comparison per language" chart, Browser 3 is more than 5x slower than Browser 2 on JavaScript/WASM, and Browser 4 is slower still. So there are very significant improvements on two out of the four browsers tested.
The "predictable performance" point applies not just to performance across browsers but also that you don't need to pay JIT warm-up costs. A while back, I ran some benchmarks on the same codebase in TypeScript and AssemblyScript and found that wasm was much faster than JS for short computations and often slower than JS when V8 is given multiple seconds to fully warm up the JIT:
So really, it depends a lot on the use case. In my case, it's often a short-lived node process that a user is directly waiting on, so compiling to wasm is probably useful. It also depends on what you're doing; some types of work (e.g. where you'd want careful memory management) are a lot harder for V8 to optimize from JS and can be expressed more nicely in AssemblyScript or another language that gives more memory flexibility.
From that, it looks like WebAssembly will win unless you're running the same JS on a really huge dataset (going by the second speed test). Even when you're compiling 50 MB of JS with that thing, wasm is only 5% slower than JS, and when you're compiling 500 KB (more typical) it's 300% faster.
Wow, all these numbers seem insanely bad. 500 milliseconds to transpose 16 million pixels (so 64 million bytes)? A modern CPU should be able to do that at least 10x faster, if not 100x.
They are bad but not way off for that basic for loop, depending on which rotation is being applied.
Using their code on my Intel-based workstation at around 3 GHz with GCC 7.3, it takes around 80-100 ms to rotate a 4096x4096 buffer by 90 or 270 degrees, and 14 ms to rotate by 180.
Max memory bandwidth of something like an i9-9900K is 41.2 GB/s. This test reads and writes 128 MiB of data, so the max theoretically achievable time here is around 3-4 ms. Max theoretical. So 100x is not really feasible; 10x, though, very much is, as the 14 ms peak for the 180-degree rotation shows.
Of course the major source of slowness here is that the reads/writes are not sequential, and the 90 & 270 rotations are achieving a fraction of the possible bandwidth they could as the input reads are jumping around, so every single one is a cache miss and the other 60 bytes in each cache line on the miss will be purged before it's used again.
Flipping it would mean the writes are never utilizing a full cache line, either, though. So you can't really "fix" that, not easily at least. So either your read or write bandwidth ends up tanking and you can only achieve roughly 6% of the max (only ever using 4 bytes of the 64-byte cache line) for that half of the problem. Without some clever magic to handle this your max theoretical on a 41.2GB/s CPU drops to around 50ms.
All that said it's clear that WASM is very far off from native levels of performance. ~5x slower isn't something to brag about. But hey maybe the test system was a potato, and the 500ms isn't as bad as it sounds.
You are correct. The code is using an inefficient cache access pattern, so most of the time is spent waiting.
You probably won't get 100x faster without SIMD, but 10x is certainly doable. Unfortunately, SIMD.js support was removed from Chrome and Firefox a while ago, even though SIMD is not available in wasm to this day.
How would SIMD do anything to address the problem's fundamental anti-cache-friendly access patterns? You'd need to restructure the problem to be cache-friendly, but SIMD won't really be relevant to that.
Or simply use the canvas api, which has super optimized graphics libraries behind it - rather than reimplementing the wheel :)
But I get that this was really about how much wasm can help performance as a percentage vs JS; you could always write an "optimized" routine in each and compare those, and theoretically achieve something similar.
The article mentions why they couldn't use canvas for this: they are running this code in a worker, and canvas support in workers is not great in browsers so far.
Ah, my bad for skimming. I thought most canvas stuff worked these days? (I recall, from many years ago when I worked on such things, that fonts were the biggest problem, but also that people generally wanted to be able to paint DOM elements into their canvases as well.)
In my experience, the canvas api is very slow and not well thought-out. For example, to create a native image object from raw pixels, you have to copy the pixels into an ImageData object, draw it to a canvas, create a data URL from the canvas and then load an image from that data URL.
And don't forget support for lock-free programming (memory fence instructions), useful when you want to implement your own specific concurrent GC, for example.
HTML DOM is described in terms of IDL interfaces, complete with types. I wouldn't say that it's optimized for JS - indeed, that's why jQuery and similar were introduced. When WHATWG took over, they improved it specifically for better JS interop, but it's still straightforward to map to most statically typed languages.
The problem isn't exposing the APIs; the problem is that wasm has what is essentially the C memory model, so you couldn't trust any pointer/object you get from wasm land.
That's why there is so much work being put into giving wasm a more typical (for a VM) typed heap. Similar issues occur with object lifetimes: if you get anything from the DOM, you have to keep it live while wasm references it, but wasm has no idea what memory or a handle is.
These are solvable problems, but you’re not getting dom access until after they’re solved.
Why can't wasm just use opaque handles for DOM objects? It doesn't need them to be in wasm-accessible memory, after all. It just needs to be able to invoke methods on them.
It's not just "wasm needs to be able to invoke them".
Because the wasm memory model doesn't have typed memory, if you call a DOM API and get a handle back, you need to store it, and then be able to pass it back to the host VM.
So now your wasm code needs to make sure the handle stays live. Wasm by design doesn't interact with the host GC, so you have to keep the handle alive manually (refcounting APIs or whatever), and the host VM has to have some way to deal with you trying to use a handle without having kept it alive.
Similarly, because wasm is designed around storing raw memory in its heap, wasm code can treat handles as plain integers. E.g. an attacker can generate spoofed handles to try to create type-confusion bugs, or maybe manually over-release things.
So the problem isn’t “how do we let wasm make these calls” but rather “how do we do that without making it trivially exploitable”.
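To make the handle discussion concrete, here is a minimal host-side handle table sketch in JS. All names are hypothetical, this is not any real proposal's API, just the general shape:

```javascript
// Host-side handle table: wasm code only ever sees small integers,
// never raw references to host (e.g. DOM) objects.
class HandleTable {
  constructor() {
    this.entries = new Map(); // handle -> { obj, refs }
    this.next = 1;            // 0 is reserved as the "null handle"
  }
  create(obj) {
    const h = this.next++;
    this.entries.set(h, { obj, refs: 1 });
    return h;
  }
  // Spoofed or stale integers fail here instead of causing
  // type confusion in the host.
  get(h) {
    const e = this.entries.get(h);
    if (!e) throw new Error(`invalid handle ${h}`);
    return e.obj;
  }
  addRef(h) {
    const e = this.entries.get(h);
    if (!e) throw new Error(`invalid handle ${h}`);
    e.refs++;
  }
  release(h) {
    const e = this.entries.get(h);
    if (!e) throw new Error(`over-release of handle ${h}`);
    if (--e.refs === 0) this.entries.delete(h);
  }
}

// Usage: the host hands wasm the integer h; wasm passes it back
// through imports whenever it wants to touch the object.
const table = new HandleTable();
const h = table.create({ tagName: "DIV" }); // stand-in for a DOM node
console.log(table.get(h).tagName);          // "DIV"
table.release(h);                           // h is now invalid
```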
But surely that is also fundamentally a solved problem? I mean, we've had distributed systems for a long time, and they had to deal with all the same issues - lifetime, security etc.
Go was pretty much a non-starter. It (currently) needs a runtime, which makes the file size non-competitive with the other options. Also, since only Chrome has support for threads in WebAssembly (in Canary), we'd not be able to make any use of the concurrency.
> WebAssembly on the other hand is built entirely around raw execution speed. So if we want fast, predictable performance across browsers for code like this, WebAssembly can help.
So I wanted to see how I could use WebAssembly in a React webapp. I found this SO question that observes the opposite:
> When running this [ WebAssembly] code in Chrome, I observe "pauses" that cause the app to be a bit jittery. Running the app in Firefox is a lot faster and smoother.
I would try optimizing the JS before dropping down to WebAssembly. For example, try replacing let and const with var, as let and const in loops have to create a new binding for each iteration.
Have you ever made a for loop using var, only to have the variable point to the last value of the iteration? And had to make a closure using forEach, a function, or a self-calling function? With let you don't have to do that, as a new binding is created for each iteration instead of being reassigned as with var.
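The classic demonstration of the difference (standard JS semantics, easy to verify in any console):

```javascript
// With var there is one shared binding: every callback sees the final value.
const withVar = [];
for (var i = 0; i < 3; i++) withVar.push(() => i);
console.log(withVar.map(f => f())); // [ 3, 3, 3 ]

// With let each iteration gets a fresh binding.
const withLet = [];
for (let j = 0; j < 3; j++) withLet.push(() => j);
console.log(withLet.map(f => f())); // [ 0, 1, 2 ]

// The old workaround: capture the value via a self-calling function.
const withIife = [];
for (var k = 0; k < 3; k++) ((n) => withIife.push(() => n))(k);
console.log(withIife.map(f => f())); // [ 0, 1, 2 ]
```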
WebAssembly is a bit underwhelming to be honest. It feels like every week there is a new language that can come close to C performance meanwhile they've been working on WebAssembly for years and years and it can barely beat JS.
Shouldn't WA, as a greenfield project with its extremely basic memory model and lack of runtime or standard library, be super easy to optimize?
After all, there is no point in having the bad ergonomics of assembly together with the awful performance of JS, right?
Those are runtimes in the second range. Are they doing that in a separate thread or do they block the UI? And how long does it take to transfer the data to that thread?
The performance gains are so small, it's not worth this setup overhead. The average user won't see the difference.
Hence the simplicity of this module. Just do it in JS; Chrome has 70% market share, why would you ever bother?
V8 has received a decade of optimizations and it can easily compete with compiled languages in terms of speed.
I was hyped to death for WASM, but this is the tenth article I'm reading on this subject and I still end up at the same conclusion: there is no advantage for front-end developers to use WASM.
Only rendering engines (Unity, Adobe products, Autodesk) can really benefit from this.
> chrome [h]as 70% market share why would you ever bother ?
This view seriously needs to die. It's honestly not that hard to test in two or three browsers, and the differences are minor enough that it isn't a pain. But the only way that's possible is through Web standardization, which only happens when there are diverse options.
As web developers, it's our duty to keep the web healthy, and that means not only optimizing for a single browser.
While MSFT did abuse their position to solidify an IE-centric world, people need to realize that when IE4/5/6 were released, they were dramatically better than their competitors. The problem is that, post-domination, IE simply stagnated, so its design shortcomings started being a problem.
It needs to be repeated: at the time, IE /was/ a good browser. Just like Chrome today. And similar to Chrome, it played fast and loose with web-exposed features. Sometimes for the better (XHR was an IE invention), sometimes for worse (so was ActiveX).
By that logic it was a waste of time for Firefox to exist -- there was already IE, or it was a waste of time for webkit to exist as there was already khtml, or blink because webkit, etc, etc
People only caring about one browser is exactly what caused ie6 to become such a problem - everyone had to reverse engineer whatever it was doing because nothing was specified.
> By that logic it was a waste of time for Firefox to exist -- there was already IE,
No, IE would need be to be open source for that logic to be applicable there, since the idea is to use a well-developed open source code base instead of rolling your own thing.
> or it was a waste of time for webkit to exist as there was already khtml, or blink because webkit,
You actually undercut your own point with these examples: WebKit was a fork of KHTML, Blink was a fork of WebKit. The developers in question believed that it would have been a waste of time to start from scratch, and so they didn't!
Maybe, but they were only possible because web developers had started considering Firefox in addition to IE. Even then, the amount of time spent reverse-engineering IE behavior was absurd: when WebKit forked KHTML, it could not render yahoo.com correctly (and that mattered then ;)).
This post is saying you only need to test chrome because it’s 80% of the market. Back in the day IE was more than 90% of the market.
If all you do is test on chrome you force every competitor to reverse engineer chrome (you can’t fork chrome to make a gpl browser). Alternatively you give up and just use chrome (skinned or not), and that dictates the features you get (I don’t see chrome getting built in tracker blocking any time soon).
You can’t use alternative browsers because the web is filled with sites that are only tested on chrome.
No, it's not like IE at all because IE was closed source. This was what I was trying to say earlier: the whole reason IE was "bad" was because it stagnated, which would not have been possible if it was open source. In this case, it's more like Linux.
The article sets out to prove the predictability of WASM's performance, and not necessarily a performance gain wrt js.
> This confirms what we laid out at the start: WebAssembly gives you predictable performance. No matter which language we choose, the variance between browsers and languages is minimal
If you're not hyped about WASM, it's probably because your app and customer base's browser preferences are on the js engine's JIT happy-path, which could hold true for most apps. There could very easily be a js path that is significantly worse in performance on chrome, just saying, 70% market share is both a blessing and a curse.
Another major reason for WASM hype is for C#, Rust, C, C++, Go devs to reach parity with js in terms of web accessibility. Frameworks like Blazor (from MSFT) have taken all the best practices & advantages of React and made them available to C# devs.
Chrome hasn't optimized wasm very well yet. Wasm isn't, and never was, meant for the frontend. It's meant for crunching data, and for making it possible to use the thousands of C libs computers run on in a safe and efficient execution environment, without the host software having to implement that C lib's functionality directly.
For example, there was this app in C# that would convert images to a 512-color palette and use dithering to retain some quality. I made a version in the browser, but because JS was too slow, it didn't work for large images. Thing is, mine was far safer and more accessible than the C# program.
The logo on the front page you linked is the logo of Microsoft's other browser, Edge. There is no other mention of Microsoft or Internet Explorer on it.
(overly ironic) tl;dr "let's rewrite something in this new browser feature because the other browser feature we added last week is not supported anywhere and buggy in chrome"