
Staring at the Sun: Dalvik vs. Asm.js vs. Native - evilpie
https://blog.mozilla.org/javascript/2013/08/01/staring-at-the-sun-dalvik-vs-spidermonkey/
======
peterhunt
I've been watching a bunch of this JS performance stuff unfold over the past
few months (including that huge rant a while ago). People lament the lack of
JITing in UIWebView or that JS is inherently slow or whatever, but if you're
targeting mobile, there's usually only one thing that matters:

Can your JS rendering code consistently execute within 16ms?

JS (even without JIT) is certainly fast enough to do this if you offload
anything intensive to workers (in fact, I recommend that you put your whole
app except for the real-time aspects in a worker if possible) and schedule
long-running tasks over several requestAnimationFrames.
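
Concretely, the scheduling I mean looks roughly like this (untested sketch; the frame budget and the example workload are placeholders for whatever your app actually does):

    // Run queued slices of work each frame, yielding back to the browser once
    // the per-frame budget is nearly spent, so rendering can stay under ~16ms.
    var FRAME_BUDGET_MS = 12;  // leave headroom for layout/paint
    var slices = [];           // each entry is a function doing a small piece of work

    function pump() {
      var start = performance.now();
      while (slices.length && performance.now() - start < FRAME_BUDGET_MS) {
        slices.shift()();      // run one slice
      }
      if (slices.length) {
        requestAnimationFrame(pump);  // finish the rest on later frames
      }
    }

    // Example: sum a large array in 10k-element slices instead of one long loop.
    var data = new Float64Array(1000000);
    var total = 0;
    for (var i = 0; i < data.length; i += 10000) {
      (function (start) {
        slices.push(function () {
          var end = Math.min(start + 10000, data.length);
          for (var j = start; j < end; j++) total += data[j];
        });
      })(i);
    }
    requestAnimationFrame(pump);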

Usually the only issue is that GC pauses can cause hiccups > 16ms, and you have
no control over that. This has traditionally been seen as a deal-breaker.

That's why I'm excited about Asm.js (and LLJS) -- even on generic JS runtimes,
it's my understanding they don't generate garbage, and can execute without GC
pauses. So I'm looking forward to writing most of my app in traditional JS in
a worker, and the realtime components in Asm.js.

~~~
modeless
GC pauses are not the only issue. Image loading often introduces longer pauses
than GC, and input latency is a huge problem too. I've written a benchmark
that exposes these responsiveness issues:
[http://google.github.io/latency-benchmark](http://google.github.io/latency-benchmark)

~~~
peterhunt
Yeah jpeg decoding is a big one, but I think it's less to do with raw
performance and more to do with the fact that there aren't any hooks to help
you control the user experience around it (i.e. there is no way to determine
if you're done decoding unless you write the decoder yourself and draw to
canvas, which increases your latency by a lot).

IIRC, dropped frames from jpeg decoding are less of an issue in today's WebKit,
but I'm not sure. I seem to remember there being workarounds like adding a
no-op CSS animation or iframe to coax the browser into doing it off the UI
thread.

That's a nice demo you have there, btw, I'll have to check it out more.

~~~
modeless
Thanks! JPEG decoding isn't the only problem with image loading. Image
resizing, texture upload, and relayout/painting related to image loading all
contribute to jank. Also, as you pointed out, it's impossible to schedule any
of those things at appropriate times because you have no control over, or
visibility into, when they happen.

~~~
peterhunt
Are there relayout issues I'm not aware of with images when you're using
absolute positioning?

I did some experiments a while ago with running jpgjs in a webworker and was
able to get zero-jank images by splitting the copy to canvas across multiple
requestAnimationFrames. However, it adds a significant amount of latency and
I'm not sure it's worth the trade-off.
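
For reference, the main-thread half of that experiment looked something like this (a sketch, not my actual code - the worker file and its message format are assumptions):

    // A worker ('decode-worker.js', hypothetical name) fetches and decodes a
    // JPEG off the main thread, then posts back raw RGBA pixels. The main thread
    // blits a band of rows per frame so no single frame blows the 16ms budget.
    var worker = new Worker('decode-worker.js');
    var ctx = document.querySelector('canvas').getContext('2d');

    worker.onmessage = function (e) {
      var width = e.data.width, height = e.data.height;
      var image = ctx.createImageData(width, height);
      image.data.set(new Uint8ClampedArray(e.data.pixels));

      var rowsPerFrame = 64;  // tune: more rows = fewer frames, more per-frame cost
      var y = 0;
      function blitSlice() {
        // Dirty-rect form of putImageData: only repaint this frame's band of rows.
        ctx.putImageData(image, 0, 0, 0, y, width, Math.min(rowsPerFrame, height - y));
        y += rowsPerFrame;
        if (y < height) requestAnimationFrame(blitSlice);
      }
      requestAnimationFrame(blitSlice);
    };

    worker.postMessage({ url: 'photo.jpg' });  // worker runs jpgjs (or similar) on it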

I am wondering, though, whether a pipelined progressive jpeg
downloader/decoder/renderer built with this technique would yield positive
results.

~~~
modeless
I don't think so, I just mentioned layout since it's something that can be
triggered by image loading.

I don't think progressive JPEG is worth the rendering overhead in general, but
doing JPEG decoding in a JS worker is not as crazy as it sounds if you're
serious about reducing jank.

------
chrisaycock
The point about memory allocation is similar to Walter Bright's explanation of
D compiler performance [1]. There, the issue was _deallocation_; Walter
"cheated" by never deallocating any memory since the compiler is a short-lived
process.

As a complement in this Mozilla test, Kannan Vijayan believes that Asm.js ran
so fast in the _Binary Trees_ test because there is a single large allocation
at process startup, much like a memory pool. The moral of the story is to
always be aware of what malloc() and free() are doing to your code.

[1]
[https://news.ycombinator.com/item?id=6103883](https://news.ycombinator.com/item?id=6103883)

~~~
Sharlin
Practically the first thing to do in any serious game development project is
to pick an allocator other than malloc.

~~~
ancarda
As someone who has never touched C/C++ or managed their own memory, can you
explain how other allocators work and how the performance differs from malloc?

~~~
aaronblohowiak
The complexity of malloc comes from the fact that the objects can be of
variable size and have unknown lifespans. Faster allocators usually work on
"pools" of objects that have the SAME size and/or lifespan. Using "pools" lets
you avoid the extra overhead.

If you can be sure that your allocator is only called in one thread, you can
also avoid expensive memory fences.
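
The same idea translates to JS as object pooling (a rough sketch of the concept, not a malloc replacement - you recycle fixed-shape objects instead of allocating fresh ones, so the allocator/GC has less to do):

    // Minimal fixed-shape object pool: "allocating" pops from a free list,
    // "freeing" pushes back, so steady-state use creates no garbage.
    function Pool(create) {
      this.create = create;  // factory that builds one object
      this.free = [];
    }
    Pool.prototype.acquire = function () {
      return this.free.length ? this.free.pop() : this.create();
    };
    Pool.prototype.release = function (obj) {
      this.free.push(obj);   // caller must not touch obj after releasing it
    };

    // Usage: reuse particle objects instead of allocating one per frame.
    var particles = new Pool(function () { return { x: 0, y: 0, vx: 0, vy: 0 }; });
    var p = particles.acquire();
    p.x = 10; p.y = 20;
    particles.release(p);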

~~~
marshray
I don't know this for a fact, but some implementations of malloc() seem to
leave room for realloc() to later grow the memory in place.

A custom allocator may gain efficiency by optimizing explicitly for or against
the realloc() scenario.

~~~
spc476
If you are following the C standard for malloc(), then realloc() comes along with it.
In fact, according to the C standard, malloc(size) is the same as
realloc(NULL,size) and free(ptr) is the same as realloc(ptr,0).

realloc() can also shrink a previous allocation, in addition to growing it
(but you must be aware that you might get back an entirely new pointer).

~~~
mistercow
>but you must be aware that you might get back an entirely new pointer

Which is super fun if someone else adds code that keeps a copy of the pointer,
not knowing that it might change.

Incidentally, is that possibility realistic in a 64-bit system? It seems like
the addresses could easily be spaced out enough that you would never actually
expect to see a realloc return a different pointer.

~~~
marshray
They could, but it would be extremely inefficient for small allocations. Pages
are 8 KiB minimum, so it would waste memory, bloat up the page tables, and
ruin cache efficiency.

~~~
mistercow
Hmm, I see what you mean. But it seems like that could be solved by more
sophisticated virtual memory management, although it would essentially take an
entire extra layer of virtualization to make it work, which might have other
performance consequences.

~~~
Sharlin
Basically you would just reimplement malloc at the MMU level.

------
tyre
This is a rather dubious comparison, seemingly more targeted at selling asm.js
than at making a meaningful measurement. The author includes asm.js but not
plain JS. Well, he actually does test it, but removes it from the results
because _it did too well_.

From the article: "To be frank, I didn’t include the regular Javascript scores
in the results because regular Javascript did far too well, and I felt that
including those scores would actually confuse the analysis instead of help it."

[https://blog.mozilla.org/javascript/files/2013/08/Dalvik-vs-...](https://blog.mozilla.org/javascript/files/2013/08/Dalvik-vs-ASM-vs-Native-vs-JS.png)

~~~
kannanvijayan
Author here. The "regular JS scores" do REALLY well on several benchmarks
(faster than native, even), for one primary reason: the transcendental math
cache (which every JS engine uses).

This is a very specific optimization that expects that functions like "sin",
"cos", and "tan" will be called repeatedly with the same inputs, and puts a
cache in front of those functions.
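
As an illustration (this is just the idea sketched in JS, not the engine's actual implementation, which lives in C++ with small fixed-size tables):

    // Conceptual transcendental math cache: remember (input, result) pairs for
    // Math.sin and serve repeated inputs without recomputing.
    var sinCache = Object.create(null);
    var realSin = Math.sin;
    Math.sin = function (x) {
      var hit = sinCache[x];
      return hit !== undefined ? hit : (sinCache[x] = realSin(x));
    };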

In sunspider, this helps. In the real world, we call this "overoptimization
for sunspider".

I've said this before, and I'll say it again: Sunspider is a poor benchmark to
use to talk about JS engines. All of them game it - ALL of them.

If I _had_ included the regular JS sunspider scores, the comparison would be
unfair since all the JS engines are specifically optimized for sunspider.

The reason these benchmarks are somewhat appropriate for Java and C++ is
_because_ Java and C++ compilers and libraries have not been optimized with
sunspider in mind, and things like the transcendental math cache don't skew
numbers.

(Also, to be clear: OdinMonkey, the asm.js compiler in SpiderMonkey, does NOT
use the transcendental math cache like the optimizer for regular JS code does.
This is why "plain JS" is faster than asm.js in several of the benchmarks).

~~~
mraleph
> for one primary reason: the transcendental math cache

If this is the primary reason then it implies that those benchmarks primarily
measure performance of transcendental operations with repetitive arguments and
thus nothing interesting.

Why do you even bother to include them in this case?

> Sunspider is a poor benchmark to use to talk about JS engines

If you think it is a poor benchmark, why do you use it or remind the outside
world of its existence?

In my opinion SunSpider is indeed a poor benchmark in general to talk about
_any_ kind of adaptive JIT (which includes JVM). That is why _I_ never use it
for anything.

~~~
kannanvijayan
I mentioned the reasons for using SunSpider: the tests are effectively
micro-benchmarks that exercise various implementation features on their
respective platforms, they're easy to port, and they're simple to understand.

If we treat them appropriately and carefully (i.e. we don't look at two scores
and use them to make an absolute judgement, but treat them simply as a starting
point for investigation and for thinking about what they imply about the
platform), they are of some use.

Even with the transcendental-heavy benches, it might still have been the case
that there was some other implementation issue that slowed down asm.js on
those benches. It's good to confirm that there aren't.

NBodies suggests asm.js costs associated with double-indirection. NSieve
suggests that asm.js ARM codegen could be improved relative to x86. Binary-
trees suggests that Dalvik may have an issue with highly recursive code.
Binary-trees also suggests that there may be a perf issue with the default NDK
libc's malloc (or free, or both) implementation.

All of these things are useful to think about, as long as we avoid the pitfall
of using the benchmark as a judgement tool, and remember to use it as a
starting point for analysis.

Lastly, I felt the exercise was useful in confirming that across a set of
commonly accepted microbenches, asm.js was generally able to hold its own.
It's good to confirm these things empirically, instead of assuming them.

~~~
mraleph
Yes, I understand what VM implementors can derive from individual results of
each and every micro-benchmark.

[I usually go as far as saying that only VM implementors and people with deep
enough understanding of VMs should ever micro-benchmark and pay attention to
micro-benchmarking results]

I would however argue that you could just run each microbenchmark separately
and report results without even mentioning that those benchmarks (in their
short running forms) constitute SunSpider.

Another thing that you could have done is either disable the transcendental
cache for IonMonkey-generated code altogether (or add a transcendental cache to
the Java / C++ code) and then report pure JavaScript results on the main,
prominently visible graph.

> asm.js was generally able to hold its own

I am sure that you wanted to say OdinMonkey here instead of asm.js. asm.js is
a set of syntactic and typing rules, it does not have any performance per se.
OdinMonkey is an implementation.

I have seen people conflate these two things together again and again.

~~~
kannanvijayan
I disagree that only VM implementors should think about these things. It
suggests a degree of overspecialization that I think is good to avoid. Even if
one is not a VM implementor, it's always valuable to have a solid grasp of how
to critically evaluate results. That's something I tried to promote with my
article: to encourage people to think more deeply about benchmarks than simply
noting the final number and proceeding.

I considered disabling the math cache in SpiderMonkey and using those scores,
but it seemed inappropriate. I don't know to what extent the tuning and other
optimizations are otherwise targeted to SunSpider. The math cache is obvious,
but there may be many non-obvious tunings. Let's face it, JS engine devs have
been trying to improve scores for a long time now. To what degree has the
tuning of GC, of when-to-jit, of what-to-jit, of when-to-inline, of
what-to-inline, of the numerous thresholds.. been targeted at SunSpider?

I didn't feel comfortable just removing the math cache and simply stating
"well now we've leveled the playing field so native JS is not at an
advantage".

Also, you're entirely correct about OdinMonkey vs asm.js. It's too easy to use
"asm.js" as a shorthand for particular implementations. It's something people
tend to do (e.g. "java" to mean "the Sun JVM"), but I should definitely try to
avoid that.

------
justinsb
Kudos to the author for including their code. Too many benchmarks don't. And
it is amazing that JS is even in the same ballpark as Java.

That said, like all benchmarks, there are systematic biases. I looked at the
binary trees benchmark. The obvious problem is that it uses far too few
iterations (100), so the runtime was 60 milliseconds (on my laptop). That's
really not enough time for JIT to kick in, although probably JS does JIT more
eagerly than Java. I upped it to 10000 iterations: (OpenJDK) Java took 3.3s,
JS (with node) took 7.8s, C++ took 15s. (C++ is really hurt by alloc/free
compared to garbage collection.) Switching C++ to use Google's TCMalloc brought it
down to 10s.

When you see a benchmark that says that X is faster than Java, and X does not
include the letter C, take it with a pinch of salt!

~~~
janjongboom
But now you're comparing unoptimized javascript to java, instead of asm.js.

~~~
justinsb
Ah - fair point. How can I repeat the asm.js results on my laptop?

~~~
azakai
Get the benchmark code, get emscripten, and compile them:

emcc -O2 nsieve.cpp

for example.

Yes, I also had to increase the runtime; they were quite short on a laptop -
they were meant for a phone, I think.

~~~
nxn
Not sure if this is relevant these days, but the Emscripten FAQ mentions the
need to use "-s ASM_JS=1" as an argument to emcc in order for it to actually
output asm.js style code. See "Q. How fast will the compiled code be?" from:
[https://github.com/kripken/emscripten/wiki/FAQ](https://github.com/kripken/emscripten/wiki/FAQ)

~~~
azakai
That's outdated, thanks, I'll fix it now.

------
zbowling
So we have a better NDK than Google's at Apportable. We use our own malloc and
a better C++ runtime (libc++ instead of libstdc++), and Clang 3.4 instead of
GCC.

One problem with your tests is that you are using the system malloc, and that
is horribly slow (and the Dalvik GC will even obtain a global lock on it every
now and then). Firefox does not use the system malloc (it uses jemalloc
instead). This actually yields big time savings in tests that call malloc at
any point.

I would love to run your tests on our platform. Can you publish the exact times
you got? I want to spin it up and see if I can get better numbers on the
native side.

~~~
kannanvijayan
Sure, here's a paste of a CSV file containing the data I recorded. This was
taken on a Nexus 4 running Android 4.2.2. There are two columns for each bench
- the right column containing nine individual scores, and the left column
containing their aggregate stats:

[http://paste.ubuntu.com/5960016/](http://paste.ubuntu.com/5960016/)

------
JoachimSchipper
On one hand, it's great that the author did all this work. On the other hand,
the benchmark is quite dubious: aside from the Javascript-engines-are-heavily-
optimized-for-SunSpider thingy (which, if you click the link at the end of the
article, ends up often making plain JS faster than asm.js), the most obvious
port of JS code is likely not the most obvious way to write Java/C++, let
alone the most performant way.

Still, the fact that you can compare Javascript and C++ without needing a log
scale is quite an achievement.

~~~
krallja
The asm.js code is not written by hand; it is compiled from C++ via Emscripten,
so there's no "obvious port of JS code" - it's all done by the compiler.

~~~
duaneb
Did the author write the C++ code? This is all pretty useless without being
able to reproduce it.

EDIT: I'll eat my hat, the code looks great.

EDIT2: Looks like he allocates memory in an inner loop, no wonder native is so
slow... I wouldn't take the native benchmarks seriously at all.
[https://github.com/kannanvijayan/benchdalvik/blob/master/nsi...](https://github.com/kannanvijayan/benchdalvik/blob/master/nsieve/nsieve.cc#L38)

~~~
kannanvijayan
I tried to keep this as faithful as possible to the original JS code I was
copying from.

In the actual SunSpider benchmark (in Javascript), a new Array object is
allocated within that same loop, so I wrote the C++ and Java code to mimic
that behaviour.
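
Roughly, the JS original has this shape (paraphrased, not the exact SunSpider source):

    // The driver allocates a fresh flags array on every iteration, so the
    // benchmark exercises the allocator as much as the sieve itself.
    function nsieve(m, isPrime) {
      var count = 0;
      for (var i = 2; i <= m; i++) isPrime[i] = true;
      for (var i = 2; i <= m; i++) {
        if (isPrime[i]) {
          for (var k = i + i; k <= m; k += i) isPrime[k] = false;
          count++;
        }
      }
      return count;
    }

    function sieveAll() {
      for (var i = 1; i <= 3; i++) {
        var m = (1 << i) * 10000;
        var flags = new Array(m + 1);  // allocated inside the loop, by design
        nsieve(m, flags);
      }
    }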

I could have pulled the Array allocation out across all of the
implementations, but I tried to avoid making any changes to the benchmarks
unless there was a correctness-issue involved (e.g. moving makeCumulative in
the fasta benchmark out to the prelude was a correctness issue.. since it's
wrong to run it more than once on the same array).

~~~
duaneb
Perhaps, but that's still something you just _wouldn't do_ in native code. If
you were to write code that way you wouldn't be writing C++ in the first place
(I would hope). I understand the reasoning, but I also think it's misleading
in terms of results.

~~~
kannanvijayan
It's not how you would write it in Javascript either, or Java. Actually, in
general you wouldn't be implementing a sieve algorithm at all.

That's a general pitfall of benchmarks like these - the micros don't test real
programs so much as they test a limited set of implementation mechanisms. In
this case, what we're measuring is: "allocate an array, fill it, and then scan
it with various stride lengths and mutate it, and then free it.. what does
that cost on average, given this spread of array sizes?"

The useful thing with these benchmarks isn't what the final numbers are, but
why they are what they are, and what that suggests about the underlying
implementation.

To put it another way, I think these sorts of comparisons are more useful for
being able to confirm that some set of mechanisms work roughly equivalently in
one vs the other system.. rather than useful for saying one is "better" than
the other in any objective way.

For example, the nsieve result suggested an issue with ARM code generation with
asm.js. But the fact that scores between asm.js and native are pretty close on
x86 desktops suggests that outside of code generation, the cost of allocating
arrays, scanning them, and mutating them like this is roughly equivalent on
asm.js and native.

Similarly, the nbodies result might be suggesting that double-indirection in
hot code is a weak spot for asm.js compared to native.

The fasta result suggests that there are high overheads associated with using
Java collections for small lookup tables of primitive values.

With benchmarks, my opinion is that the scores themselves are less important
than how you interpret them. Thus I'm not as concerned about how one would
optimally implement a looped sieve algorithm in C++ vs Java vs JS, since
that's not what I'm trying to get at.

~~~
duaneb
Ok, I'll give you this. I think you're entirely correct and I realize I'm not
seeing the forest for the trees.

------
justinsb
The spectral norm test appears to be wrong. On C++ & Java it gets the math
expression in the A function wrong, and produces NaN, which apparently has a
huge performance cost. Taking the correct version from the Javascript (and
converting to a double correctly) gives the correct result, and (OpenJDK) Java
runs twice as fast.

~~~
justinsb
Kannan, if you want to fix it, the Java diff is:

- return ((double)1)/((i+j) *(i+j+1) /((double)2+i+1));

+ return 1.0/((i+j)*(i+j+1)/2+i+1);

That gives the same count as the JS. It'd be interesting to know whether
Dalvik/ARM has the same performance hit on NaN!

~~~
kannanvijayan
Thanks!

I think the "/2" should be a "/((double)2)", since in C and Java it'll treat
the former as a truncated int divide. I do see the issue with the precedence
on the 2+i+1, though. Will fix up and re-run.

~~~
justinsb
You're quite right. You can also use 2.0

~~~
kannanvijayan
Fixed up and pushed to repo. If you see anything else that's incorrect, please
feel free to let me know either by posting or through mail, and thanks again
for pointing that out.

With the changes, the scores didn't change dramatically, and the asm.js
version got a bit faster too, actually coming in closer to the native score
than before, which is weird. Overall, scores got better across the board by
about 8% or so.

The fact that asm.js does better now than it did before somehow suggests to me
that there may be an issue in the NDK libc's malloc or free implementation,
and that the better scores are simply allowing that issue to have more effect.
This is another one of those programs which does a bunch of "biggish"
allocs/frees repeatedly.

I'll put the updated charts up with an edit note shortly.

------
MBCook
Interesting. It's nice to see someone actually do analysis instead of just
dumping numbers and saying "So X is the fastest, except when it's Y".

I'd like to see all these re-run on a desktop with a desktop JVM. I'm curious
if the problems Dalvik showed on the binary tree bench are Dalvik-specific or
if HotSpot has optimizations that fix some of the issues identified.

------
lttlrck
Its a shame the benchmarks chosen did not allow useful comparison between
Asm.js and plain Javascript to be included. Plain JS is an important baseline.

------
kenster07
It is quite misleading to write native code in the same way you write the
javascript.

A worthwhile benchmark would test optimized code in all contestant languages.

~~~
kenster07
Was this point downvoted for a good reason?

I will reiterate my point. There is no point in comparing suboptimal C++ to
optimized asm.

Rather than downvoting me, the author of the study ought to rerun the
benchmark with optimized C++.

------
fuckjavascript
Well, it's good to see that the browser developers have finally conceded total
defeat on Javascript as the basis for their platform, and are now simply
constructing a nice little low-level virtual machine for general compilation.
Granted, they're still hypocritically trying to keep up appearances by sharing
as much machinery as possible with their Javascript VM, but nobody's perfect,
I guess.

