I've been watching a bunch of this JS performance stuff unfold over the past few months (including that huge rant a while ago). People lament the lack of JITing in UIWebView, or claim that JS is inherently slow, or whatever, but if you're targeting mobile, there's usually only one thing that matters:
Can your JS rendering code consistently execute within 16ms?
JS (even without JIT) is certainly fast enough to do this if you offload anything intensive to workers (in fact, I recommend putting your whole app except for the real-time aspects in a worker if possible) and schedule long-running tasks across several requestAnimationFrame callbacks.
Usually the only issue is that GC pauses can cause hiccups > 16ms, and you have no control over that. This has traditionally been seen as a deal-breaker.
That's why I'm excited about asm.js (and LLJS) -- even on generic JS runtimes, it's my understanding that they don't generate garbage and so can execute without GC pauses. So I'm looking forward to writing most of my app in traditional JS in a worker, and the real-time components in asm.js.
It's not just about loading the JS; it's about loading the entire WebView, or whatever other environment. (Obviously, this is an issue that every application faces, it's just something worth paying attention to.)
GC pauses are not the only issue. Image loading often introduces longer pauses than GC, and input latency is a huge problem too. I've written a benchmark that exposes these responsiveness issues: http://google.github.io/latency-benchmark
Yeah, JPEG decoding is a big one, but I think it's less to do with raw performance and more to do with the fact that there aren't any hooks to help you control the user experience around it (i.e. there is no way to determine when decoding has finished unless you write the decoder yourself and draw to canvas, which adds a lot of latency).
IIRC, dropped frames from JPEG decoding are less of an issue in today's WebKit, but I'm not sure. I seem to remember there being workarounds like adding a no-op CSS animation or an iframe to coax the browser into doing the decoding off the UI thread.
That's a nice demo you have there, btw, I'll have to check it out more.
Thanks! JPEG decoding isn't the only problem with image loading. Image resizing, texture upload, and the relayout/painting triggered by image loading all contribute to jank. And, as you pointed out, it's impossible to schedule any of those things at appropriate times because you have no control over, or visibility into, when they happen.
Are there relayout issues I'm not aware of with images when you're using absolute positioning?
I did some experiments a while ago with running jpgjs in a web worker and was able to get zero-jank images by splitting the copy to canvas across multiple requestAnimationFrames. However, it adds a significant amount of latency and I'm not sure it's worth the trade-off.
I am wondering, though, whether a pipelined progressive JPEG downloader/decoder/renderer built with this technique would yield positive results.
I don't think so, I just mentioned layout since it's something that can be triggered by image loading.
I don't think progressive JPEG is worth the rendering overhead in general, but doing JPEG decoding in a JS worker is not as crazy as it sounds if you're serious about reducing jank.
asm.js is a way to write JavaScript to help the JIT. It's intended to be autogenerated by tools. It has nothing to do with GC. asm.js will accumulate garbage just like normal JavaScript.
I don't think that's correct, since I believe asm.js manages its own heap via a big typed array and does not allocate anything tracked by the runtime GC (at least that's what LLJS does, and I believe asm.js does the same thing).
Hmm perhaps it's not part of the spec, but I believe all languages that compile down to it do this. And the spec mentions "absence of garbage collection" so I believe you're at the very least strongly encouraged to avoid it (if not forbidden -- does OdinMonkey even have a GC?)
The point about memory allocation is similar to Walter Bright's explanation of D compiler performance [1]. There, the issue was deallocation; Walter "cheated" by never deallocating any memory since the compiler is a short-lived process.
As a complement, in this Mozilla test Kannan Vijayan believes that asm.js ran so fast in the binary-trees test because there is a single large allocation at process startup, much like a memory pool. The moral of the story is to always be aware of what malloc() and free() are doing in your code.
As someone who has never touched C/C++ or managed their own memory, can you explain how other allocators work and how the performance differs from malloc?
The complexity of malloc comes from the fact that the objects can be of variable size and have unknown lifespans. Faster allocators usually work on "pools" of objects that have the SAME size and/or lifespan. Using "pools" lets you avoid the extra overhead.
If you can be sure that your allocator is only called in one thread, you can also avoid expensive memory fences.
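To make that concrete, here's a minimal single-threaded fixed-size pool sketch (names like pool_init/pool_alloc are just illustrative; this is untested example code, not something from the thread): every slot is the same size, so allocation and freeing are just a free-list pop and push, with none of malloc's per-allocation bookkeeping or locking.

```c
#include <stddef.h>
#include <stdlib.h>

/* A single-threaded pool of fixed-size slots kept on an intrusive free list.
   Each free slot stores a pointer to the next free slot in its first bytes. */
typedef struct {
    void *free_list;  /* next free slot, or NULL if the pool is exhausted */
    char *storage;    /* one big backing allocation */
} pool_t;

static int pool_init(pool_t *p, size_t obj_size, size_t count) {
    /* Round the slot size up so every slot can hold (and align) a pointer. */
    obj_size = (obj_size + sizeof(void *) - 1) & ~(sizeof(void *) - 1);
    p->storage = malloc(obj_size * count);
    if (!p->storage)
        return -1;
    p->free_list = NULL;
    for (size_t i = 0; i < count; i++) {   /* thread every slot onto the free list */
        void *slot = p->storage + i * obj_size;
        *(void **)slot = p->free_list;
        p->free_list = slot;
    }
    return 0;
}

static void *pool_alloc(pool_t *p) {
    void *slot = p->free_list;
    if (slot)
        p->free_list = *(void **)slot;     /* pop: a couple of loads and stores */
    return slot;                           /* NULL when the pool is empty */
}

static void pool_free(pool_t *p, void *slot) {
    *(void **)slot = p->free_list;         /* push back onto the free list */
    p->free_list = slot;
}
```

Because there's no locking, no per-object size header, and no searching for a suitably sized hole, the fast path is just a pointer swap.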
If you are following the C standard, realloc() comes along with malloc(). In fact, malloc(size) is equivalent to realloc(NULL, size), and realloc(ptr, 0) behaves much like free(ptr) (though that second equivalence isn't guaranteed by every revision of the standard).
realloc() can also shrink a previous allocation, in addition to growing it (but you must be aware that you might get back an entirely new pointer).
Which is why you might get back a new pointer---because the original block couldn't be grown, so a new block has to be allocated and the contents copied.
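As a small, hedged illustration of the usual defensive pattern (hypothetical helper name, not from the thread): assign realloc's result to a temporary so a failed grow doesn't leak or clobber the original pointer.

```c
#include <stdlib.h>

/* Grow a heap buffer. Returns 0 on success and updates *bufp (which may now
   be a completely different address), or -1 on failure, in which case the
   original block in *bufp is untouched and still owned by the caller. */
static int grow_buffer(char **bufp, size_t new_size) {
    char *tmp = realloc(*bufp, new_size);  /* realloc(NULL, n) behaves like malloc(n) */
    if (tmp == NULL)
        return -1;
    *bufp = tmp;
    return 0;
}
```

Any code that stashed the old pointer somewhere else has to refresh it after a successful grow, which is exactly the trap described below.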
>but you must be aware that you might get back an entirely new pointer
Which is super fun if someone else adds code that keeps a copy of the pointer, not knowing that it might change.
Incidentally, is that possibility realistic in a 64-bit system? It seems like the addresses could easily be spaced out enough that you would never actually expect to see a realloc return a different pointer.
They could, but it would be extremely inefficient for small allocations. Pages are typically 4 KiB at minimum, so it would waste memory, bloat the page tables, and ruin cache efficiency.
Hmm, I see what you mean. But it seems like that could be solved by more sophisticated virtual memory management, although it would essentially take another whole layer of virtualization to make it work, which might have other performance consequences.
For example, jemalloc stores medium-sized allocations in big contiguous buffers, binned by size rounded up to 16 bytes. So if you realloc to a different size class, your pointer will change, because you need to move to a different big buffer.
Here's a hopefully illustrative example of a special-purpose memory manager: the naive bitmap allocator.
Take a fixed-size pool of memory and divide it into a fixed number of equally-sized regions. Say your game's entities take up a maximum of 32 bytes of memory each, and you are OK with imposing a hard limit of 256 entities in the world at once: you could allocate an 8-kilobyte memory buffer and manage it yourself. Already there are apparent benefits to this scheme in reducing memory fragmentation (all of the game entities are in this 8-kilobyte region, not scattered all throughout memory) and thus improving cache locality (it's quicker to sequentially access memory in a small region than it is to "hop around" all over the place).
How might we manage this memory? We really just need a way of allocating regions ("I want to create a new entity, please give me a free region to store it in") and freeing them when we're done with them (enemy died or whatnot). Because we have a fixed number of fixed-size regions, it's very straightforward (well under 100 lines of C) to use a bitmap for doing so. Well, bitmap is the common terminology for this data structure, but I prefer to think of it as a binary set: you have a bunch of bits in RAM, and a 1 bit at some integer index indicates that the index is in the set, while a 0 bit indicates that it isn't. For example, if you have the byte 00010011, then 0, 1, and 4 would be in the set, while 2, 3, 5, 6, and 7 would not. [EDIT] Er, I guess "set of integers under some limit" is a more accurate description, but I like the way that "binary set" sounds.
So, dividing the 256 regions of the example by 32 bits in an x86 machine word, you would need 8 machine words to store this binary set, which take up 32 bytes total. All we need to keep track of is whether or not each region is free, so it's a suitable data structure for this purpose.
Most modern machine architectures have instructions for finding the index of the first set (1) bit in a machine word in a single, very quick operation.[1] So, we can use that to our advantage by representing a free region as a set bit and an allocated region as a clear bit. Then, when you need to allocate a new entity, we can very quickly iterate over the bitmap like so: for each machine word, skip it if it is equal to 0 (all bits are clear, i.e. every region it tracks is already allocated), and otherwise call find-first-set to find the index of the next free region. Clear that bit (to indicate that it is now in use), and return the index of that bit to the caller (who can then use it like an index into an array, or convert it to a pointer). Freeing a region is even simpler: you just set the bit in the bitmap corresponding to the region's index in the memory buffer.
We also get another very useful operation for free with this scheme: Iterating over every object in the set. Say, we want to call "update" on every entity in our buffer. Make a copy of the binary set, and invert it, so that now a set bit indicates an allocated region. Then, it's very simple to create an iterator that can traverse every entity in existence: Call find-first-set to find allocated entities, call their update function, then clear the associated bit in the inverted binary set. Keep doing so until you reach the end of the set.
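A rough sketch of that scheme in C (an untested illustration; the names are mine, and I'm using GCC/Clang's __builtin_ffs as the find-first-set operation):

```c
#include <stdint.h>
#include <string.h>

#define REGION_SIZE  32                     /* bytes per entity slot     */
#define REGION_COUNT 256                    /* hard cap on live entities */
#define WORD_BITS    32

/* 256 regions / 32 bits per word = 8 words of bookkeeping (32 bytes).
   A set bit means "this region is free". */
static uint32_t      free_map[REGION_COUNT / WORD_BITS];
static unsigned char storage[REGION_COUNT * REGION_SIZE];   /* the 8 KB pool */

static void bitmap_init(void) {
    memset(free_map, 0xFF, sizeof free_map);         /* everything starts free */
}

static void *region_ptr(int index) {
    return &storage[index * REGION_SIZE];
}

/* Returns a region index, or -1 if the pool is full. */
static int bitmap_alloc(void) {
    for (int w = 0; w < REGION_COUNT / WORD_BITS; w++) {
        if (free_map[w] == 0)
            continue;                                 /* this word is fully allocated */
        int bit = __builtin_ffs((int)free_map[w]) - 1;   /* find-first-set */
        free_map[w] &= ~(1u << bit);                  /* mark the region as in use */
        return w * WORD_BITS + bit;
    }
    return -1;
}

static void bitmap_free(int index) {
    free_map[index / WORD_BITS] |= 1u << (index % WORD_BITS);
}

/* Iterate over every allocated region: invert a copy of each word so that a
   set bit means "allocated", then repeatedly find-first-set and clear. */
static void bitmap_for_each(void (*fn)(void *entity)) {
    for (int w = 0; w < REGION_COUNT / WORD_BITS; w++) {
        uint32_t live = ~free_map[w];
        while (live) {
            int bit = __builtin_ffs((int)live) - 1;
            fn(region_ptr(w * WORD_BITS + bit));
            live &= ~(1u << bit);
        }
    }
}
```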
That's a very simple scheme to implement, but you can only really use it when it fits your memory allocation patterns well, or when you can tailor your memory allocation patterns to fit the memory management scheme. In a game, it can be acceptable to impose a hard limit on, say, the number of entities on screen (you could easily have a rule somewhere that stops spawning enemies if a certain limit has been reached, or start despawning less important entities when memory gets tight), but this probably wouldn't fly if you were maintaining a data structure for handling website requests: "Sorry, this site only supports up to 1K concurrent users, please come back later"
Now, think about the kind of changes you might have to make to make this manager more "flexible": How could you make it handle regions of various sizes? How could you make it handle growing the memory buffer (so that you can create more than the "maximum" number of entities)? How would you make it handle various types of objects that just happen to take the same amount of space? How about handling extremely large swaths of memory? Can we reasonably predict how the memory manager will be used in the program, or is that entirely out of our hands? The closer you get to malloc's use case, the less useful this data structure becomes, and by the same token, you should be starting to see how a manager that is expected to perform as best as possible under arbitrary usage patterns can be sub-optimal for a specialized use case.
Lastly, it's perfectly acceptable to use different memory management techniques at different levels of memory granularity. Consider that that is essentially what you'd be doing if you malloc a buffer that you then manage with the bitmapped memory allocator I described.
I once made a C Python extension considerably faster by being more careful with malloc and free - I can recommend it as an easy-ish way to speed up a C/C++ program.
Anyone who has implemented a memory allocator knows how expensive it can be. If you have a simpler algorithm that works with your data, allocate a huge chunk and manage it yourself.
My favourite example of memory allocation anti-patterns:
A decade or so ago I worked on a GUI and we needed a Type 1 font parser. FreeType was not a good fit for our (very memory-constrained, slow) platform. I was surprised, though, at how slow the library was, and did some profiling. The first thing I noticed was a huge number of malloc() calls, and they were all small - a huge portion of them were only 4 bytes. The malloc() implementation on our platform had an overhead of at least 12 bytes per allocation, so not only was it slow, it also wasted a ludicrous amount of memory.
All in all there were multiple allocations per glyph despite the fonts being loaded and unloaded as a single unit. Yikes.
Since we were in a rush, my hacky fix was a search/replace of malloc()/free() with a special version of malloc() that would just grab the next free chunk from a pool (or, if the pool was full, allocate another 4KB block and start allocating from that) and a no-op for free(); then I added code to initialise the pool and free it in the open/close parts of the font API.
The result was drastically reduced memory usage and at least an order of magnitude speedup on our platform.
(The library in question was t1lib, a library originally released by Adobe, and the stupid memory handling is still in there as of today despite sporadic updates over the last decade; I guess it's not used much any more with FreeType, but I'm almost tempted to do a cleaner fix and submit it - the fix we did was not suitable for general usage as it explicitly assumed a single set of fonts would be loaded and unloaded in the same order, as it didn't keep pools per loaded font)
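For the curious, the replacement allocator was roughly this shape (reconstructed as a sketch with made-up names, not the actual t1lib patch): bump-allocate out of 4KB blocks, make free() a no-op, and tear the whole pool down when the fonts are unloaded.

```c
#include <stdlib.h>

#define BLOCK_SIZE 4096

struct block {
    struct block *next;
    size_t        used;
    unsigned char data[BLOCK_SIZE];
};

static struct block *pool_head;   /* most recently added 4 KB block */

static void *pool_malloc(size_t size) {
    size = (size + 7) & ~(size_t)7;                  /* keep returned pointers aligned */
    if (size > BLOCK_SIZE)
        return NULL;                                 /* the font code never needs this much */
    if (!pool_head || pool_head->used + size > BLOCK_SIZE) {
        struct block *b = malloc(sizeof *b);         /* one real malloc per 4 KB block */
        if (!b)
            return NULL;
        b->next = pool_head;
        b->used = 0;
        pool_head = b;
    }
    void *p = pool_head->data + pool_head->used;     /* bump the pointer, no per-chunk header */
    pool_head->used += size;
    return p;
}

static void pool_free(void *p) { (void)p; }          /* deliberately a no-op */

static void pool_destroy(void) {                     /* called from the font close path */
    while (pool_head) {
        struct block *next = pool_head->next;
        free(pool_head);
        pool_head = next;
    }
}
```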
As someone with basic C experience, could you link me to some resources about how to use malloc properly (ie, allocating a block of RAM and managing it yourself)? I find the whole thing fascinating.
If you need both malloc and realloc, then these are not the droids you're looking for - the system implementation won't be that much slower than something you can hack together.
If you are using heap RAM for data that persists over the life of a long-lived application, then again the system version is going to serve you well (with a DIY scheme you'd allocate something like a 4KB chunk and then have to hang on to it for as long as even one byte of it is still needed, which can dramatically increase the RAM used by your process compared to what it actually requires).
If your particular problem is just "I need to allocate some heap RAM because my arrays might be too big for the stack," then allocating one big chunk at the start of your function (of a size tailored to your particular problem) will save you a bunch of slow system calls.
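A tiny sketch of that last case (hypothetical function, just to show the shape): one allocation sized to the problem, carved into the arrays you need, and one free at the end.

```c
#include <stdlib.h>

/* Instead of three mallocs (one per array), do a single allocation up front
   and carve it into sub-arrays with pointer arithmetic. */
static int process(size_t n) {
    double *scratch = malloc(3 * n * sizeof *scratch);
    if (!scratch)
        return -1;
    double *xs = scratch;            /* first n doubles */
    double *ys = scratch + n;        /* next n doubles  */
    double *zs = scratch + 2 * n;    /* last n doubles  */
    for (size_t i = 0; i < n; i++) {
        xs[i] = (double)i;
        ys[i] = 2.0 * xs[i];
        zs[i] = xs[i] + ys[i];
    }
    /* ... real work on xs/ys/zs goes here ... */
    free(scratch);                   /* one free, at the end */
    return 0;
}
```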
This is a rather dubious comparison and seems more targeted at selling asm.js than at making a meaningful comparison. The author includes asm.js but not native JS. Well, he actually does test it, but removes it from the results because it did too well.
From the article:
To be frank, I didn’t include the regular Javascript scores in the results because regular Javascript did far too well, and I felt that including those scores would actually confuse the analysis instead of help it.
Author here. The "regular JS scores" do REALLY well on several benchmarks (faster than native, even), for one primary reason: the transcendental math cache (which every JS engine uses).
This is a very specific optimization that expects that functions like "sin", "cos", and "tan" will be called repeatedly with the same inputs, and puts a cache in front of those functions.
In sunspider, this helps. In the real world, we call this "overoptimization for sunspider".
I've said this before, and I'll say it again: Sunspider is a poor benchmark to use to talk about JS engines. All of them game it - ALL of them.
If I _had_ included the regular JS sunspider scores, the comparison would be unfair since all the JS engines are specifically optimized for sunspider.
The reason these benchmarks are somewhat appropriate for Java and C++ is _because_ Java and C++ compilers and libraries have not been optimized with sunspider in mind, and things like the transcendental math cache don't skew numbers.
(Also, to be clear: OdinMonkey, the asm.js compiler in SpiderMonkey, does NOT use the transcendental math cache like the optimizer for regular JS code does. This is why "plain JS" is faster than asm.js in several of the benchmarks).
> for one primary reason: the transcendental math cache
If this is the primary reason then it implies that those benchmarks primarily measure performance of transcendental operations with repetitive arguments and thus nothing interesting.
Why do you even bother include them in this case?
> Sunspider is a poor benchmark to use to talk about JS engines
If you think it is a poor benchmark why do you use it or remind outside world of its existence?
In my opinion SunSpider is indeed a poor benchmark in general to talk about any kind of adaptive JIT (which includes JVM). That is why I never use it for anything.
I mention the reasons for using SunSpider. They're effectively micro-benchmarks that exercise various implementation features on their respective platforms, they're easy to port, and simple to understand.
If we treat them appropriately and carefully (i.e. we don't look at two scores and use them to make an absolute judgement, but simply as a starting point for investigation and for thinking about what they imply about the platform), they are of some use.
Even with the transcendental-heavy benches, it might still have been the case that there was some other implementation issue that slowed down asm.js on those benches. It's good to confirm that there aren't.
NBodies suggests asm.js costs associated with double-indirection. NSieve suggests that asm.js ARM codegen could be improved relative to x86. Binary-trees suggests that Dalvik may have an issue with highly recursive code. Binary-trees also suggests that there may be a perf issue with the default NDK libc's malloc (or free, or both) implementation.
All of these things are useful to think about, as long as we avoid the pitfall of using the benchmark as a judgement tool, and remember to use it as a starting point for analysis.
Lastly, I felt the exercise was useful in confirming that across a set of commonly accepted microbenches, asm.js was generally able to hold its own. It's good to confirm these things empirically, instead of assuming them.
Yes, I understand what VM implementors can derive from individual results of each and every micro-benchmark.
[I usually go as far as saying that only VM implementors and people with deep enough understanding of VMs should ever micro-benchmark and pay attention to micro-benchmarking results]
I would however argue that you could just run each microbenchmark separately and report results without even mentioning that those benchmarks (in their short running forms) constitute SunSpider.
Another thing you could have done is either disable the transcendental cache for IonMonkey-generated code altogether or add a transcendental cache to the Java/C++ code, and then report the pure JavaScript results on the main, prominently visible graph.
> asm.js was generally able to hold its own
I am sure that you wanted to say OdinMonkey here instead of asm.js. asm.js is a set of syntactic and typing rules, it does not have any performance per se. OdinMonkey is an implementation.
I have seen people conflate these two things together again and again.
I disagree that only VM implementors should think about these things. It suggests a degree of overspecialization that I think is good to avoid. Even if one is not a VM implementor, it's always good to have a good grasp on critical evaluation of results. That's something I tried to promote with my article: to encourage people to think more deeply about benchmarks than simply noting the final number and proceeding.
I considered disabling the math cache in SpiderMonkey and using those scores, but it seemed inappropriate. I don't know to what extent the tuning and other optimizations are otherwise targeted at SunSpider. The math cache is obvious, but there may be many non-obvious tunings. Let's face it, JS engine devs have been trying to improve scores for a long time now. To what degree has the tuning of GC, of when-to-JIT, of what-to-JIT, of when-to-inline, of what-to-inline, and of the numerous thresholds been targeted at SunSpider?
I didn't feel comfortable just removing the math cache and simply stating "well now we've leveled the playing field so native JS is not at an advantage".
Also, you're entirely correct about OdinMonkey vs asm.js. It's too easy to use "asm.js" as a shorthand for particular implementations. It's something people tend to do (e.g. "java" to mean "the Sun JVM"), but I should definitely try to avoid that.
Kudos to the author for including their code. Too many benchmarks don't. And it is amazing that JS is even in the same ballpark as Java.
That said, like all benchmarks, there are systematic biases. I looked at the binary trees benchmark. The obvious problem is that it uses far too few iterations (100), so the runtime was 60 milliseconds (on my laptop). That's really not enough time for the JIT to kick in, although JS probably JITs more eagerly than Java. I upped it to 10000 iterations: (OpenJDK) Java took 3.3s, JS (with node) took 7.8s, C++ took 15s. (C++ is really hurt here by explicit alloc/free of every node versus garbage collection.) Switching C++ to use Google's TCMalloc brought it down to 10s.
When you see a benchmark that says that X is faster than Java, and X does not include the letter C, take it with a pinch of salt!
One thing to keep in mind is that OpenJDK is different from Dalvik, and their JITs are very different.
Dalvik optimizes more heavily in favour of memory footprint (one of the reasons for its tracing-JIT approach as opposed to a method-JIT approach).
Indeed, on the desktop, Java does very well on a lot of these benchmarks. My goal with these measurements, though, was to get an idea of what kind of performance could be expected by people who were writing asm.js code for Firefox OS devices, and how it compares to performance of similar code on other mobile platforms.
Yes, these benchmarks by Mozilla are biased, but I wouldn't be surprised if asm.js actually performs as well as Dalvik, for two reasons. First, the Dalvik VM has always been optimized for memory footprint rather than speed. Secondly, Google has focused their recent performance optimizations on the NDK, the platform that everyone uses when performance is critical. I think Google feels Dalvik is good enough for everything else.
I've looked at every changelog since Android 2.0, and the last time I saw any performance tweaks to Dalvik was with the addition of the JIT in 2.2. Dalvik is about 3x slower than Oracle's HotSpot[1]. Until Google moves Lars Bak to the Android team or buys Myriad for their Dalvik Turbo, I don't see this changing anytime soon.
> And it is amazing that JS is even in the same ballpark as Java
It's not. The comparison is between C++/native, C++/asm.js and Java/Dalvik.
Yes, one of the C++ benchmarks goes through an intermediate Javascript stage, but the whole point of asm.js is that we can do so without too much of a performance penalty.
Not sure if this is relevant these days, but the Emscripten FAQ mentions the need to use "-s ASM_JS=1" as an argument to emcc in order for it to actually output asm.js style code. See "Q. How fast will the compiled code be?" from: https://github.com/kripken/emscripten/wiki/FAQ
So we have a better NDK than Google's at Apportable. We use our own malloc and a better C++ runtime (libc++ instead of libstdc++), and we use Clang 3.4 instead of GCC.
One problem with your tests is that you are using the system malloc, which is horribly slow (and the Dalvik GC will even take a global lock on it every now and then). Firefox does not use the system malloc (it uses jemalloc instead). This yields big time savings in any test that calls malloc at some point.
I would love to run your tests on our platform. Can you publish the exact times you got? I want to spin it up and see if I can get better numbers on the native side.
Sure, here's a paste of a CSV file containing the data I recorded. This was taken on a Nexus 4 running Android 4.2.2. There are two columns for each bench - the right column containing nine individual scores, and the left column containing their aggregate stats:
All the code for the benchmarks is available, and all the tools used to compile them are open source, so please do make that comparison - it would be interesting to see!
On one hand, it's great that the author did all this work. On the other hand, the benchmark is quite dubious: aside from the Javascript-engines-are-heavily-optimized-for-SunSpider thingy (which, if you click the link at the end of the article, ends up often making plain JS faster than asm.js), the most obvious port of JS code is likely not the most obvious way to write Java/C++, let alone the most performant way.
Still, the fact that you can compare Javascript and C++ without needing a log scale is quite an achievement.
The asm.js isn't written by hand; it's compiled from C++ via Emscripten, so there's no "obvious port of JS code" - it's all done by the compiler.
I tried to keep this as faithful as possible to the original JS code I was copying from.
In the actual SunSpider benchmark (in Javascript), a new Array object is allocated within that same loop, so I wrote the C++ and Java code to mimic that behaviour.
I could have pulled the Array allocation out across all of the implementations, but I tried to avoid making any changes to the benchmarks unless there was a correctness-issue involved (e.g. moving makeCumulative in the fasta benchmark out to the prelude was a correctness issue.. since it's wrong to run it more than once on the same array).
Perhaps, but that's still something you just wouldn't do in native code. If you were to write code that way you wouldn't be writing C++ in the first place (I would hope). I understand the reasoning, but I also think it's misleading in terms of results.
It's not how you would write it in Javascript either, or Java. Actually, in general you wouldn't be implementing a sieve algorithm at all.
That's a general pitfall of benchmarks like these - the micros don't test real programs so much as they test a limited set of implementation mechanisms. In this case, what we're measuring is: "allocate an array, fill it, and then scan it with various stride lengths and mutate it, and then free it.. what does that cost on average, given this spread of array sizes?"
The useful thing with these benchmarks isn't what the final numbers are, but why they are what they are, and what that suggests about the underlying implementation.
To put it another way, I think these sorts of comparisons are more useful for being able to confirm that some set of mechanisms work roughly equivalently in one vs the other system.. rather than useful for saying one is "better" than the other in any objective way.
For example, the nsieve result suggested an issue with ARM code generation for asm.js. But the fact that scores between asm.js and native are pretty close on x86 desktops suggests that, code generation aside, the cost of allocating arrays, scanning them, and mutating them like this is roughly equivalent on asm.js and native.
Similarly, the nbodies result might be suggesting that double-indirection in hot code is a weak spot for asm.js compared to native.
The fasta result suggests that there are high overheads associated with using Java collections for small lookup tables of primitive values.
With benchmarks, my opinion is that the scores themselves are less important than how you interpret them. Thus I'm not as concerned about how one would optimally implement a looped sieve algorithm in C++ vs Java vs JS, since that's not what I'm trying to get at.
The spectral norm test appears to be wrong. In C++ and Java it gets the math expression in the A function wrong and produces NaN, which apparently has a huge performance cost. Taking the correct version from the JavaScript (and converting to a double correctly) gives the correct result, and (OpenJDK) Java runs twice as fast.
I think the "/2" should be a "/((double)2)", since in C and Java it'll treat the former as a truncated int divide. I do see the issue with the precedence on the 2+i+1, though. Will fix up and re-run.
Fixed up and pushed to repo. If you see anything else that's incorrect, please feel free to let me know either by posting or through mail, and thanks again for pointing that out.
With the changes, the scores didn't change dramatically, and the asm.js version got a bit faster too, actually coming in closer to the native score than before, which is weird. Overall, scores got better across the board by about 8% or so.
The fact that asm.js does better now than it did before somehow suggests to me that there may be an issue in the NDK libc's malloc or free implementation, and that the better scores are simply allowing that issue to have more effect. This is another one of those programs which does a bunch of "biggish" allocs/frees repeatedly.
I'll put the updated charts up with an edit note shortly.
Hmm. Could you post the code for your corrected version?
I tried to find where a NaN would be introduced in the C++ and Java code, but I can't really spot it. The equation only has adds, multiplies, and divides, no negative numbers, and no divides by zero that I can see.
If you show me where the problem is, it wouldn't take me too long to re-run everything with the corrected code and update the results.
Interesting. It's nice to see someone actually do analysis instead of just dumping numbers and saying "So X is the fastest, except when it's Y".
I'd like to see all these re-run on a desktop with a desktop JVM. I'm curious whether the problems Dalvik showed on the binary tree bench are Dalvik-specific, or whether HotSpot has optimizations that fix some of the issues identified.
It's a shame the benchmarks chosen did not allow a useful comparison between asm.js and plain JavaScript to be included. Plain JS is an important baseline.
Well, it's good to see that the browser developers have finally conceded total defeat on Javascript as the basis for their platform, and are now simply constructing a nice little low-level virtual machine for general compilation. Granted, they're still hypocritically trying to keep up appearances by sharing as much machinery as possible with their Javascript VM, but nobody's perfect, I guess.