
A first look at WebAssembly performance - rch
http://www.stefankrause.net/wp/?p=405
======
flohofwoe
The C vs Firefox results are about the same that I'm seeing in my 8-bit
emulator
([http://floooh.github.io/virtualkc/](http://floooh.github.io/virtualkc/)).
For the Amstrad CPC (currently the most expensive system to emulate) on my 2.8GHz Core i5
MBP I'm seeing about 1.3 to 1.5 ms 'emulator time' per 16.6ms frame for the
WASM/asm.js version on FF Nightly, and for the native version (clang -O3)
about 1.2 to 1.4ms.

This 'core emulator loop' is pretty much 100% integer code, with a lot of 8-
and 16-bit integer bit operations (shifting, masking, ...). No calls into the
browser or operating system, and almost no calls into the CRT or C++ stdlib
(may be a small memcpy here and there).

The performance differences for 'pure' C code between browser- and native-
version are so small now that they are no longer relevant, at least for the
use cases I encountered, or rather within the normal performance differences
when running the code on a slightly different CPU.

Calling out into HTML5 APIs is a whole different topic though ;)

~~~
dreta
Languages like Java and JavaScript don't let you lay data out in memory
directly, nor do they give you much control over how memory is accessed, so
any performance benchmark involving C is entirely superficial.

I can write 2 programs in C, both which iterate over some amount of elements
and perform the same calculations on the same amount of data, and have one
take 500ms and the other take 8s. It's all a matter of how you lay things out
in memory.

~~~
joakleaf
That's a bold claim. Sure, it is theoretically possible to have two versions
of the same C program take either 500ms or 8s purely due to memory layout.

But I would like to challenge you to actually do it! I.e. the same number of
calculations, on the same amount of data, and a factor of 16 in run time,
with only the memory layout as the actual difference between the two
implementations.

Up for it?

~~~
Coding_Cat
I'll take that challenge. Here's how to do it, with a little bit of sneaky
interpretation of 'memory layout':

Have an algorithm which takes a large struct S and looks at a subset Sf to
determine what to do with S: for some values of Sf use all of S, otherwise skip
it (e.g. when the distance between 2 particles < threshold, calculate the
force).

Now have a very low pass rate for the filter so that total time ~= time taken
to read all of Sf. Make Sf a single byte and S >= 64 bytes, i.e. one cache
line (a full page if you really want to beat it).

And now compare array-of-structs vs struct-of-arrays ;). You should see a ~64x
performance difference in the asymptotic case.

AoS will use one byte per cache line read. SoA will use 64 bytes per cache
line. If you're memory bound, this will translate to an almost 64x speed
difference. There are other ways to achieve the same effect, but that is the
gist of getting 64x performance: using 1 byte vs 64. If you want to go really
crazy, use a single bit and a bitfield array for the SoA. If you use a
bitfield, the passing chance doesn't have to be that low.

I happened to have to present on AoS vs SoA today, so I have a less extreme
benchmark to show the difference (~16x because it filters over 32-bit ints):

[http://imgur.com/a/HuXFR](http://imgur.com/a/HuXFR)
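The filter scheme above can be sketched directly in JS with typed arrays (a hypothetical illustration; `STRIDE`, `countPassingAoS`, and `countPassingSoA` are invented names). The AoS layout reads one useful byte per 64-byte cache line; the SoA layout packs the filter bytes together so one cache line serves 64 checks:

```javascript
// AoS vs SoA filter sketch. Each "particle" carries a 1-byte filter field
// plus 63 bytes of payload, for a 64-byte record (one cache line).
const N = 1 << 20;      // number of particles
const STRIDE = 64;      // record size in bytes == cache line size

// AoS: the filter byte is interleaved with payload, one per 64-byte record.
const aos = new Uint8Array(N * STRIDE);
// SoA: all filter bytes packed contiguously.
const soa = new Uint8Array(N);

// Count particles passing the cheap filter; a (rare) pass would then trigger
// the expensive full-record computation.
function countPassingAoS(a) {
  let hits = 0;
  for (let i = 0; i < N; i++) if (a[i * STRIDE] > 250) hits++; // 1 useful byte per line
  return hits;
}

function countPassingSoA(a) {
  let hits = 0;
  for (let i = 0; i < N; i++) if (a[i] > 250) hits++; // 64 useful bytes per line
  return hits;
}
```

Timing the two loops with N large enough that the AoS data doesn't fit in cache should show the memory-bound gap described above.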

~~~
sqeaky
So you can write code that is 64x slower when you specifically optimize for a
slowdown. Cool.

Can you run that in WebAssembly and get the same slowdown? If so, most of the
same things that allow C/C++ to be faster than other languages on native
systems will also allow them to be faster in the browser.

~~~
Coding_Cat
Yeah, this will affect any language that stores structs as POD (plain old
data), as long as it has somewhat sensible alignment and allocation rules
(i.e. doesn't hide every struct behind a pointer, doesn't align chars to 8
bytes).

------
Derbasti
The way I understand it, WebAssembly is all about the size of the binary and
parsing overhead. Or, at a higher level, about enabling a level playing field
between more languages than just JavaScript.

Speed improvements from a common runtime and bytecode are certainly welcome,
but if they are possible with WebAssembly, they are also possible with
plain JavaScript, and therefore shouldn't be visible in a comparison between
the two.

I, for one, hope that WebAssembly will enable, say, Lua as a first class
citizen on the web.

~~~
hacker_9
> Speed improvements from a common runtime and bytecode are certainly welcome,
> but if they are possible with WebAssembly, they are also possible with
> plain JavaScript

Nope, a static language will execute faster than a dynamic language, because
the runtime knows precisely what type everything is and how much space to
allocate. Additionally, there's no faffing around with dictionaries for
dynamic types. Currently JS engines have to figure out the types dynamically
at runtime, but even then they can't always be sure, so they have to fall back
to slower dictionary lookups.

Once we have a static language -> webasm, it should execute an order of
magnitude faster than JS -> webasm.

Edit: Additionally, browsers will finally consume less power, CPU time and
memory, because the execution runtime process will be so streamlined.

~~~
Klathmon
That's not really true, as JS engines will already compile things with the
assumption that the types won't change and "bail out" if they do.

So if you can "hint" to the JIT that a variable is an integer, and that it
will stay an integer, then it will not only "unbox" it and treat it as an
integer, it will compile the code very similarly to how a static language
would.

In asm.js, this is done by using little "tricks" of JS to hint to the compiler
that something is (for example) an integer by appending `|0` to it.

So there really shouldn't be any major difference between "static language ->
webasm" and "JS -> webasm" if the JS is written with performance in mind.

In practice you'll probably see greater performance from a static language
simply because you "can't" violate those rules, so there doesn't need to be
any "guarding" or checking, but you aren't looking at anything like an
order-of-magnitude speed increase here; it'd be incremental in most cases.

~~~
flukus
> That's not really true, as JS engines will already compile things with the
> assumption that the types won't change and "bail out" if they do.

Which makes for some great benchmarks but poor real world results.

~~~
Klathmon
It actually works surprisingly well.

After all, no JS engine out there would really tune their engines to benefit
benchmarks at the expense of real-world performance.

Even with all the dynamic ability that something like JS gives you, most devs
still create an object to store stuff, then don't change it. That means that
an aggressive policy of "compile it as it is, and bail if it changes" ends up
working much more often than it doesn't.

And in the cases where it doesn't, falling back to "compile it dynamically"
isn't going to be any slower than if they did that first, so it's basically a
free optimization (as long as the compilation doesn't take too long).

------
strainer
I think there is no JS version implemented which accesses the bodies'
parameters like this:

x = body.x[body_index]

I expect this should be much faster than accessing them like this:

x = body[body_index].x

Because the latter requires a pointer-from-property calculation for every
single value access (which must somehow be optimised).

The former just requires a pointer-from-property calculation for every array
(not element) involved in the n-body loops; the individual values are picked
out by array indexing, which is inherently much simpler than property access.
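A minimal sketch of the two layouts (field names and the `stepSoA` helper are illustrative, not from the benchmark):

```javascript
// AoS: one object per body; every read resolves a property on a distinct object.
const bodiesAoS = [
  { x: 0.0, y: 0.0, vx: 1.0, vy: 0.5 },
  { x: 1.0, y: 2.0, vx: -0.5, vy: 0.25 },
];

// SoA: one typed array per parameter; elements are reached by plain indexing.
const bodiesSoA = {
  x:  new Float64Array([0.0, 1.0]),
  y:  new Float64Array([0.0, 2.0]),
  vx: new Float64Array([1.0, -0.5]),
  vy: new Float64Array([0.5, 0.25]),
};

function stepSoA(b, dt) {
  // Property lookups happen once per array, not once per element.
  const { x, y, vx, vy } = b;
  for (let i = 0; i < x.length; i++) {
    x[i] += vx[i] * dt;
    y[i] += vy[i] * dt;
  }
}
```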

I've used this method for nbody physics. I can modify a test to it and push to
the repo.

~~~
krausest
That would be great. Just send me a pull request!

~~~
strainer
Well, I got it completing and it's taking about 35% longer than the
'original' method. I do expect index lookups to be faster than property
lookups, so I believe there must be some inefficiency in my hasty refactoring.

Or there might be an issue with there being only 5 bodies in the test. I'm a
bit puzzled.

I'll send the code later in the day.

------
batmansmk
Our benchmarks show amazing results... but not in speed. WebAssembly for us is
about the size of the binary and the speed of parsing more than the speed of
execution... A JavaScript JIT with enough execution data can, in theory, be
even better than C.

~~~
richardwhiuk
You could feed the profile data back into the C compiler in theory, so in
practice, I think the C _could_ always be faster.

~~~
foota
If you have dynamic data the best optimizations could change, and you can't
feed the profile back to the compiler while the program is running. (Well...
maybe in theory.)

~~~
flukus
The JVM has been capable of this for a while. I think the overhead of
profiling and applying optimizations has always been greater than any
efficiency gains.

~~~
szemet
_' the overhead of profiling and applying optimizations has always been
greater than any efficiency gains.'_

Not necessarily, especially if you explicitly rely on them.

For example in C++ you deliberately have to avoid overusing virtual functions
if you don't want the overhead. In Java, virtual functions are the default,
and you usually don't care: if you call them a lot (e.g. in a tight loop) JVM
will adaptively give you the proper implementation - without per call
indirection - or may even inline it.

If you code C++ in Java style then the JVM will win - so the overhead of
(runtime) profiling and applying optimizations is not always greater.

~~~
Vogtinator
G++ is able to optimize virtual calls to non-virtual calls in most cases
(speculative devirtualization) by adding quick checks into the code (like
if(obj->vtable == Class::vtable) Class::virtualMethod(obj);), which the JVM
would do as well.

So if G++ is able to tell the code paths leading to the virtual call, it does
the same optimization as the JVM.
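The guard pattern translates directly to JS; here's an illustrative sketch (class and function names invented) of the same idea, i.e. what a JIT conceptually emits for a call site it predicts is monomorphic:

```javascript
class Circle {
  constructor(r) { this.r = r; }
  area() { return Math.PI * this.r * this.r; }
}

class Square {
  constructor(s) { this.s = s; }
  area() { return this.s * this.s; }
}

// Check the concrete type first, take the inlined fast path, and only fall
// back to ordinary dynamic dispatch when the guess is wrong.
function areaDevirt(shape) {
  if (shape.constructor === Circle) {
    return Math.PI * shape.r * shape.r; // inlined, no virtual call
  }
  return shape.area(); // generic fallback
}
```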

~~~
pjmlp
g++ cannot do that across dynamic libraries. It needs to see the source code
at compile time.

This is where JIT wins, as it can do that with binary deployed code.

------
simooooo
Am I the only one that was expecting web assembly to be like 10x JS speeds?

~~~
benjismith
Well, remember, WebAssembly is just a different representation format for the
same instruction set, executed by the underlying virtual machine (v8,
spidermonkey, etc). So you shouldn't expect to see much of a performance
difference.

The main benefit of WebAssembly is so that we can write code using our
favorite languages (not just javascript!) and compile to a common binary
format the browser understands. The objectives of WebAssembly never had
anything to do with performance.

~~~
pmontra
Are there DOM bindings for languages other than JavaScript or is WebAssembly
only for backend/webservices?

~~~
Matthias247
As far as I understand you can call JS APIs from WebAssembly, which means you
use the already available DOM APIs. I don't know exactly what the calling
convention from WASM to JS looks like, but I guess it's specified somewhere.

~~~
TheCycoONE
There are a few, none of them optimal for DOM manipulation yet because they
all involve calling JavaScript. There are plans to add wasm DOM hooks
eventually. [https://kripken.github.io/emscripten-
site/docs/porting/conne...](https://kripken.github.io/emscripten-
site/docs/porting/connecting_cpp_and_javascript/Interacting-with-code.html)
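For reference, the JS side of that interop typically goes through Emscripten's `cwrap`. In the sketch below the tiny fake `Module` (and its `sum` export) is invented so the example is self-contained; a real build would supply both:

```javascript
// Minimal stand-in for an Emscripten Module: a real build exports `_sum`
// from C (e.g. via EMSCRIPTEN_KEEPALIVE) and provides cwrap itself.
const Module = {
  _sum: (a, b) => (a + b) | 0,
  cwrap(name, returnType, argTypes) {
    const fn = this['_' + name];
    return (...args) => fn(...args);
  },
};

// Wrap the exported function once, then call it like plain JS. Any DOM
// manipulation still happens on the JS side, since wasm has no direct
// DOM access yet.
const sum = Module.cwrap('sum', 'number', ['number', 'number']);
```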

------
DannyBee
Just wanted to say thanks for being so thorough in describing the compilation
flags, versions, etc.

It's really refreshing :)

You also just may want to s/gcc/clang/, none of the gcc compile commands you
are running are using gcc, they are all clang/llvm. Apple just aliased the gcc
binary to clang on macs.

------
astrodust
These performance numbers are amazingly impressive considering we're talking
about JavaScript vs. C.

Hopefully the various browser vendors can iron out some of the performance
problems here and even things up better.

------
matt4077
I'm surprised Chrome doesn't manage to be the fastest in any of these tests,
considering the focus Google has put on performance and the resources I
thought they were devoting to Chrome.

It appears as if they'd come out ahead on average, but not by a definitive
margin. Especially surprised by FF which I thought had somewhat dropped the
ball (and/or simply been overrun once Google's business objectives aligned
with the users' interests).

~~~
icebraining
Mozilla has done a great job of keeping up; they created some benchmarks to
keep track of how well they're doing:

[https://arewefastyet.com/](https://arewefastyet.com/)

[https://areweslimyet.com/](https://areweslimyet.com/)

------
emersonrsantos
How did the author turn off SpeedStep on a MacBook?

I assume he's using macOS because of the Safari benchmark.

~~~
ktta
He does talk about the laptop he uses at the end of the post.

>All tests were performed on a 2015 MacBook Pro, 2.5 GHz Intel Core i7, 16 GB
1600 MHz DDR3. For all tests the best of three runs was selected for the
result.

~~~
emersonrsantos
But how do you disable CPU throttling on an Intel MacBook to do accurate
benchmarks?

~~~
krausest
I didn't. I believe that it's better to run a few times and take the best
run. I paid for the turbo mode CPU and I'd like to know the performance on my
machine. There's a lot going on on my machine, and even more when a browser is
running. The only thing (besides running on a clean machine) one can do about
it is to measure multiple runs. Three runs are maybe a bit too few for
scientific results, but I considered them good enough for a casual benchmark.
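That "best of N runs" approach is easy to sketch (the helper name is invented); the rationale is that the fastest run is the one with the least interference from everything else on the machine:

```javascript
// Run fn `runs` times and report the fastest wall-clock time in ms.
function bestOfRuns(fn, runs = 3) {
  let best = Infinity;
  for (let i = 0; i < runs; i++) {
    const t0 = Date.now();
    fn();
    const elapsed = Date.now() - t0;
    if (elapsed < best) best = elapsed;
  }
  return best;
}
```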

~~~
igouy
[http://benchmarksgame.alioth.debian.org/for-programming-
lang...](http://benchmarksgame.alioth.debian.org/for-programming-language-
researchers.html)

Those charts show 300 runs of Java n-body #3.

------
exabrial
I really want to like web assembly, but every time I read about it I SMH. We
already have a bunch of great runtimes, compilers, and opcodes. I hate to see
the effort duplicated yet again. Or maybe I'm missing something obvious?

~~~
Ajedi32
> We already have a bunch of great runtimes, compilers, and opcodes.

I'm confused. Are you suggesting we, for example, use the JVM for this
instead? WASM is being designed specifically with the web in mind, and has a
lot of design goals which just weren't a concern for other runtimes. See:
[http://webassembly.org/docs/high-level-
goals/](http://webassembly.org/docs/high-level-goals/)

~~~
exabrial
Well, not exactly, but I feel like the JVM, LLVM, Parrot, CLR, or any one of
those that have well-defined opcodes could be leveraged instead of producing
something entirely new. There's quite a bit of investment in those projects...
Does that make what I was trying to say clearer?

~~~
n00b101
The reason for not using LLVM bitcode or ASM.js is already covered in detail
here
[https://github.com/WebAssembly/design/blob/master/FAQ.md](https://github.com/WebAssembly/design/blob/master/FAQ.md)

I am unsure how JVM or CLR are relevant. WebAssembly is not a virtual machine
byte code (and neither is LLVM, despite the name). As the name "WebAssembly"
suggests, it is like an assembly language level target for "the web" (more
precisely, for JavaScript interpreters found in web browsers).

It is true that there is a lot of unfortunate duplication of effort and
fracturing of technology happening here, due to historical legacy issues with
web browsers. JVM and CLR are indeed highly mature and presumably secure VMs
today. If we were to wave a magic wand and rewrite the history of the web
browsers, perhaps it would be much better if JavaScript never existed and
browsers used an off-the-shelf, standardized VM like JVM or CLR and we wrote
web apps in C# or Java (or, at our discretion, F#, etc). This wouldn't solve
the problem that WebAssembly is solving (i.e compiling native code, eg C++,
for execution in the browser), but then maybe we wouldn't have needed that
today if everyone (native and web) had standardized on a single VM target. But
that's just not how things turned out, browsers created their own parallel
technology stack from scratch so things like WebAssembly need to be created
from scratch to work within the constraints that we are stuck with.

~~~
zigzigzag
I wrote more on this above, so I won't repeat myself in this comment, but yes
WebAssembly is a "virtual machine byte code". It is literally a bytecode
language that doesn't target physical machines. It bears no resemblance to x86
or ARM so it has to be JIT compiled or interpreted.

Saying it doesn't target a VM because it targets "the web" is meaningless.

~~~
n00b101
Fair enough. You are correct, WebAssembly is a bytecode for the underlying
machinery of current ECMAScript/JavaScript engines. I think you raise a very
good question, why did the WebAssembly committee decide to implement their own
stack machine bytecode instead of re-using an existing bytecode like JVM or
CLI (which, interestingly, is an ECMA standard)? It's troubling. The closest
explanation I could find is:

"Why not a fully-general stack machine?

The WebAssembly stack machine is restricted to structured control flow and
structured use of the stack. This greatly simplifies one-pass verification,
avoiding a fixpoint computation like that of other stack machines such as the
Java Virtual Machine (prior to stack maps). This also simplifies compilation
and manipulation of WebAssembly code by other tools. Further generalization of
the WebAssembly stack machine is planned post-MVP, such as the addition of
multiple return values from control flow constructs and function calls." [1]

[1]
[http://webassembly.org/docs/rationale/](http://webassembly.org/docs/rationale/)

------
singularity2001
If you compare manually optimized JavaScript with manually optimized
WebAssembly (which was not available to the OP?), could you expect WebAssembly
to be consistently faster than JavaScript on all browsers?

