Can your JS rendering code consistently execute within 16ms?
JS (even without JIT) is certainly fast enough to do this if you offload anything intensive to workers (in fact, I recommend that you put your whole app except for the real time aspects in a worker if possible) and schedule long running tasks over several requestAnimationFrames.
Usually the only issue is that GC pauses can cause hiccups > 16ms and you're in no control of that. This has traditionally been seen as a deal-breaker.
That's why I'm excited about Asm.js (and LLJS) -- even on generic JS runtimes, it's my understanding they don't generate garbage, and can execute without GC pauses. So I'm looking forward to writing most of my app in traditional JS in a worker, and the realtime components in Asm.js.
There are two things that matter, actually:
1. How quickly can you get your app up and running and responding to user input?
2. Once that's done, can you continue to maintain 60 fps?
IIRC I think dropped frames from jpeg decoding is less of an issue in today's webkit, but I'm not sure. I seem to remember there being workarounds like adding a no-op CSS animation or iframe to coax the browser to do it off of the UI thread.
That's a nice demo you have there, btw, I'll have to check it out more.
I did some experiments a while ago with running jpgjs in a webworker and was able to get zero-jank images by splitting the copy to canvas across multiple requestAnimationFrames. However it adds a significant amount of latency and I'm not sure it's worth the trade off.
I am wondering, though, if one built a pipelined progressive jpeg downloaded/decoder/renderer with this technique if it would yield positive results.
I don't think progressive JPEG is worth the rendering overhead in general, but doing JPEG decoding in a JS worker is not as crazy as it sounds if you're serious about reducing jank.
"The asm.js model provides a model closer to C/C++ by eliminating dynamic type guards, boxed values, and garbage collection."
From the asm.js FAQ (http://asmjs.org/faq.html)
The creation of big typed array simulating a memory pool is a construction of Emscripten. asm.js in that context is simply the vehicle / language.
As a complement in this Mozilla test, Kannan Vijayan believes that Asm.js ran so fast in the Binary Trees test because there is a single large allocation at process startup, much like a memory pool. The moral of the story is always be aware of what malloc() and free() are doing to your code.
If you can be sure that your allocator is only called in one thread, you can also avoid expensive memory fences.
A custom allocator may gain efficiency by optimizing explicitly for or against the realloc() scenario.
realloc() can also shrink a previous allocation, in addition to growing it (but you must be aware that you might get back an entirely new pointer).
The probability by which a given allocation pattern is likely to result in a successful growing realloc() seems like a pretty big design choice.
Do you leave space after allocations for them to grow? Do you have power-of-2 size-based tables with an allocation bitmap? If so, what sizes? Etc.
Which is super fun if someone else adds code that keeps a copy of the pointer, not knowing that it might change.
Incidentally, is that possibility realistic in a 64-bit system? It seems like the addresses could easily be spaced out enough that you would never actually expect to see a realloc return a different pointer.
For example, jemalloc stores medium-sized allocations in big contiguous buffers, binned by size rounded up to 16 bytes. So if you realloc to a different size class, your pointer will change, because you need to move to a different big buffer.
Take a fixed-size pool of memory and divide it into a fixed number of equally-sized regions. Say, if your game's entities take up a maximum of 32 bytes of memory each, and you are ok with imposing a hard limit of 256 entities in the world at once, you could allocate an 8 kilobyte memory buffer and manage it yourself. Already there are apparent benefits in this scheme in reducing memory fragmentation (all of the game entities are in this 8 kilobyte region, and not scattered all throughout memory) and thus cache coherency (it's quicker to sequentially access memory in a small region of memory than it is to "hop around" all over the place).
How might we manage this memory? We really just need a way of allocating regions ("I want to create a new entity, please give me a free region to store it in") and freeing them when we're done with them (enemy died or whatnot). Because we have a fixed number of fixed-size regions, it's very straightforward (well under a 100 lines of C) to use a bitmap for doing so. Well, bitmap is the common terminology for this data structure, but I prefer to think of it as a binary set: you have a bunch of bits in RAM, and a 1 bit at some integer index indicates that the index is in the set, and a 0 bit indicates that it isn't. For example, if you have the byte 00010011, then 0, 1, and 4 would be in the set, while 2, 3, 5, 6, and 7 would not. [EDIT] Er, I guess "set of integers under some limit" is a more accurate description, but I like the way that "binary set" sounds.
So, dividing the 256 regions of the example by 32 bits in an x86 machine word, you would need 8 machine words to store this binary set, which take up 32 bytes total. All we need to keep track of is whether or not each region is free, so it's a suitable data structure for this purpose.
Most modern machine architectures have instructions for finding the index of the first set (1) bit in a machine word in a single, very quick operation. So, we can use that to our advantage by representing a free region as a set bit, and an allocated region as a clear bit. Then, when you need to allocate a new entity, we can very quickly iterate over the bitmap like so: for each machine word, skip it if it is equal to 0 (all bits are clear, ie the region is fully allocated anyway), and otherwise call find-first-set to find the index of the next free region. Clear that bit (to indicate that it is now in use), and return the index of that bit to the caller (who can then use it like an index into an array, or convert it to a pointer). Freeing regions is even simpler: You just set the bit in the bitmap corresponding to the index in the memory buffer.
We also get another very useful operation for free with this scheme: Iterating over every object in the set. Say, we want to call "update" on every entity in our buffer. Make a copy of the binary set, and invert it, so that now a set bit indicates an allocated region. Then, it's very simple to create an iterator that can traverse every entity in existence: Call find-first-set to find allocated entities, call their update function, then clear the associated bit in the inverted binary set. Keep doing so until you reach the end of the set.
That's a very simple scheme to implement, but you can only really use it when it fits your memory allocation patterns well, or when you can tailor your memory allocation patterns to fit the memory management scheme. In a game, it can be acceptable to impose a hard limit on, say, the number of entities on screen (you could easily have a rule somewhere that stops spawning enemies if a certain limit has been reached, or start despawning less important entities when memory gets tight), but this probably wouldn't fly if you were maintaining a data structure for handling website requests: "Sorry, this site only supports up to 1K concurrent users, please come back later"
Now, think about the kind of changes you might have to make to make this manager more "flexible": How could you make it handle regions of various sizes? How could you make it handle growing the memory buffer (so that you can create more than the "maximum" number of entities)? How would you make it handle various types of objects that just happen to take the same amount of space? How about handling extremely large swaths of memory? Can we reasonably predict how the memory manager will be used in the program, or is that entirely out of our hands? The closer you get to malloc's use case, the less useful this data structure becomes, and by the same token, you should be starting to see how a manager that is expected to perform as best as possible under arbitrary usage patterns can be sub-optimal for a specialized use case.
Lastly, it's perfectly acceptable to use different memory management techniques at different levels of memory granularity. Consider that that is essentially what you'd be doing if you malloc a buffer that you then manage with the bitmapped memory allocator I described.
EDIT: Looks like the author of the native benchmarks needs to learn this lesson: https://github.com/kannanvijayan/benchdalvik/blob/master/nsi...
A decade or so ago I worked on a GUI and we needed a Type 1 font parser. Freetype was not a good fit to our (very memory constrained, slow) platform. I was suprised though, how slow the library was, and did some profiling. First thing I noticed was a huge number of malloc() calls. And they were all small. A huge portion of them were only 4 bytes. The malloc() implementation on our platform had an overhead of at least 12 bytes per allocation, so not only was it slow, it also wasted a ludicrous amount of memory.
All in all there were multiple allocations per glyph despite the fonts being loaded and unloaded as a single unit. Yikes.
Since we were in a rush, my hacky fix was a search/replace for malloc()/free() to special a special version of malloc that would just grab the next free chunk from a pool, or if the pool was full allocate another 4KB block and start allocating from that, and nothing for free, and then I added code to initialise the pool and free it to the open/close parts of the font API.
The result was drastically reduced memory usage and at least an order of magnitude speedup on our platform.
(The library in question was t1lib, a library originally released by Adobe, and the stupid memory handling is still in there as of today despite sporadic updates over the last decade; I guess it's not used much any more with FreeType, but I'm almost tempted to do a cleaner fix and submit it - the fix we did was not suitable for general usage as it explicitly assumed a single set of fonts would be loaded and unloaded in the same order, as it didn't keep pools per loaded font)
If you are using heap RAM for persistence over a long lived application, then again the system version is going to be helpful (you're going to allocate something like a 4KB chunk, and then you'll have to hang on to that as long as >= 1 byte is still needed from it, which will dramatically increase the RAM used by your process compared to what it requires.)
If your particular problem is just "I need to allocate some heap RAM because my arrays might be too big for the stack," then allocating one big chunk at the start of your function (of a size tailored to your particular problem) will save you a bunch of slow system calls.
Similarly, if your process is short lived, then you can also forget about free() calls, and just use one big buffer and your own allocator on it - see "everyone thinks about garbage collection the wrong way (2010)" http://blogs.msdn.com/b/oldnewthing/archive/2010/08/09/10047... (recent HN discussion: https://news.ycombinator.com/item?id=6131786 )
From the article:
This is a very specific optimization that expects that functions like "sin", "cos", and "tan" will be called repeatedly with the same inputs, and puts a cache in front of those functions.
In sunspider, this helps. In the real world, we call this "overoptimization for sunspider".
I've said this before, and I'll say it again: Sunspider is a poor benchmark to use to talk about JS engines. All of them game it - ALL of them.
If I _had_ included the regular JS sunspider scores, the comparison would be unfair since all the JS engines are specifically optimized for sunspider.
The reason these benchmarks are somewhat appropriate for Java and C++ is _because_ Java and C++ compilers and libraries have not been optimized with sunspider in mind, and things like the transcendental math cache don't skew numbers.
(Also, to be clear: OdinMonkey, the asm.js compiler in SpiderMonkey, does NOT use the transcendental math cache like the optimizer for regular JS code does. This is why "plain JS" is faster than asm.js in several of the benchmarks).
If this is the primary reason then it implies that those benchmarks primarily measure performance of transcendental operations with repetitive arguments and thus nothing interesting.
Why do you even bother include them in this case?
> Sunspider is a poor benchmark to use to talk about JS engines
If you think it is a poor benchmark why do you use it or remind outside world of its existence?
In my opinion SunSpider is indeed a poor benchmark in general to talk about any kind of adaptive JIT (which includes JVM). That is why I never use it for anything.
If we treat them appropriately and carefully (i.e. we don't look at two scores and use it to make an absolute judgement, but simply as a starting point for investigation and thinking about what it implies about the platform), they are of some use.
Even with the transcendental-heavy benches, it might still have been the case that there was some other implementation issue that slowed down asm.js on those benches. It's good to confirm that there aren't.
NBodies suggests asm.js costs associated with double-indirection. NSieve suggest that asm.js ARM codegen could be improved relative to x86. Binary-trees suggests that Dalvik may have an issue with highly recursive code. Binary-trees also suggests that there may be a perf issue with the default NDK libc's malloc (or free, or both) implementation.
All of these things are useful to think about, as long as we avoid the pitfall of using the benchmark as a judgement tool, and remember to use it as a starting point for analysis.
Lastly, I felt the exercise was useful in confirming that across a set of commonly accepted microbenches, asm.js was generally able to hold its own. It's good to confirm these things empirically, instead of assuming them.
[I usually go as far as saying that only VM implementors and people with deep enough understanding of VMs should ever micro-benchmark and pay attention to micro-benchmarking results]
I would however argue that you could just run each microbenchmark separately and report results without even mentioning that those benchmarks (in their short running forms) constitute SunSpider.
> asm.js was generally able to hold its own
I am sure that you wanted to say OdinMonkey here instead of asm.js. asm.js is a set of syntactic and typing rules, it does not have any performance per se. OdinMonkey is an implementation.
I have seen people conflate these two things together again and again.
I considered disabling the math cache in SpiderMonkey and using those scores, but it seemed inappropriate. I don't know to what extent the tuning and other optimizations are otherwise targeted to SunSpider. The math cache is obvious, but there may be many non-obvious tunings. Let's face it, JS engine devs have been trying to improve scores for a long time now. To what degree have tuning of GC, of when-to-jit, of what-to-jit, of when-to-inline, of what-to-inline, of the numerous thresholds.. been targeted at SunSpider?
I didn't feel comfortable just removing the math cache and simply stating "well now we've leveled the playing field so native JS is not at an advantage".
Also, you're entirely correct about OdinMonkey vs asm.js. It's too easy to use "asm.js" as a shorthand for particular implementations. It's something people tend to do (e.g. "java" to mean "the Sun JVM"), but I should definitely try to avoid that.
That said, like all benchmarks, there are systematic biases. I looked at the binary trees benchmark. The obvious problem is that it uses far too few iterations (100), so the runtime was 60 milliseconds (on my laptop). That's really not enough time for JIT to kick in, although probably JS does JIT more eagerly than Java. I upped it to 10000 iterations: (OpenJDK) Java took 3.3s, JS (with node) took 7.8s, C++ took 15s. (C++ is really hurt by garbage collection vs alloc/free.) Switching C++ to use Google's TCMalloc brought it down to 10s.
When you see a benchmark that says that X is faster than Java, and X does not include the letter C, take it with a pinch of salt!
Dalvik optimizes more heavily in favour of memory space (one of the reasons for its tracing JIT approach as opposed to method JIT approach).
Indeed, on the desktop, Java does very well on a lot of these benchmarks. My goal with these measurements, though, was to get an idea of what kind of performance could be expected by people who were writing asm.js code for Firefox OS devices, and how it compares to performance of similar code on other mobile platforms.
I've looked at every change log since Android 2.0 and the last time I saw any performance tweaks to Dalvik was with the addition of JIT in 2.2. Dalvik is about 3x slower than Oracle's Hotspot. Until Google moves Lars Bak to the Android team or buys Myriad for their Dalvik Turbo, I don't see this changing anytime soon.
It's not. The comparison is between C++/native, C++/asm.js and Java/Dalvik.
emcc -O2 nsieve.cpp
Yes, I also had to increase the runtime, they were quite short on a laptop - they were meant for a phone I think.
One problem with your tests is that you are using the system malloc, and that is horribly slow (and dalvik gc will even obtain a global lock on it every now and then). Firefox does not use the system malloc (instead it uses jemalloc). This actually has a big time savings in tests that will call malloc at any point.
I would love to run your tests on our platform. Can publish the exact times you got? I want to spin it up and see if I can get better numbers on the native side.
EDIT: I'll eat my hat, the code looks great.
EDIT2: Looks like he allocates memory in an inner loop, no wonder native is so slow... I wouldn't take the native benchmarks seriously at all. https://github.com/kannanvijayan/benchdalvik/blob/master/nsi...
I could have pulled the Array allocation out across all of the implementations, but I tried to avoid making any changes to the benchmarks unless there was a correctness-issue involved (e.g. moving makeCumulative in the fasta benchmark out to the prelude was a correctness issue.. since it's wrong to run it more than once on the same array).
That's a general pitfall of benchmarks like these - the micros don't test real programs so much as they test a limited set of implementation mechanisms. In this case, what we're measuring is: "allocate an array, fill it, and then scan it with various stride lengths and mutate it, and then free it.. what does that cost on average, given this spread of array sizes?"
The useful thing with these benchmarks isn't what the final numbers are, but why they are what they are, and what that suggests about the underlying implementation.
To put it another way, I think these sorts of comparisons are more useful for being able to confirm that some set of mechanisms work roughly equivalently in one vs the other system.. rather than useful for saying one is "better" than the other in any objective way.
For example, the nsieve result suggested an issue with ARM codegeneration with asm.js. But the fact that scores between asm.js and native are pretty close on x86 desktops suggests that outside of codegeneration, the cost of allocating arrays, scanning them, and mutating them like this is roughly equivalent on asm.js and native.
Similarly, the nbodies result might be suggesting that double-indirection in hot code is a weak spot for asm.js compared to native.
The fasta result suggests that there are high overheads associated with using Java collections for small lookup tables of primitive values.
With benchmarks, my opinion is that the scores themselves are less important than how you interpret them. Thus I'm not as concerned about how one would optimally implement a looped sieve algorithm in C++ vs Java vs JS, since that's not what I'm trying to get at.
- return ((double)1)/((i+j)(i+j+1)/((double)2+i+1));
+ return 1.0/((i+j)(i+j+1)/2+i+1);
That gives the same count as the JS. It'd be interesting to know whether Dalvik/ARM has the same performance hit on NaN!
I think the "/2" should be a "/((double)2)", since in C and Java it'll treat the former as a truncated int divide. I do see the issue with the precedence on the 2+i+1, though. Will fix up and re-run.
With the changes, the scores didn't change dramatically, and the asm.js version got a bit faster too, actually coming in closer to the native score than before, which is weird. Overall, scores got better across the board by about 8% or so.
The fact that asm.js does better now than it did before somehow suggests to me that there may be an issue in the NDK libc's malloc or free implementation, and that the better scores are simply allowing that issue to have more effect. This is another one of those programs which does a bunch of "biggish" allocs/frees repeatedly.
I'll put the updated charts up with an edit note shortly.
I tried to find where a NaN would be introduced in the C++ and Java code, but I can't really spot it. The equation only has adds, multiplies, and divides, no negative numbers, and no divides by zero that I can see.
If you show me where the problem is, it wouldn't take me too long to re-run everything with the corrected code and update the results.
I'd like to see all these re-run on a desktop with a desktop JVM. I'm curious if the problems Dalvik showed on the binary tree bench are specific Dalvik specific or if Hotspot has optimizations that fix some of the issues identified.
A wothwhile benchmark would test optimized code in all contestant languages.
I will reiterate my point. There is no point in comparing suboptimal C++ to optimized asm.
Rather than downvoting me, the author if the study ought to rerun the benchmark with optimized C++.