> Profile-driven compilation implies that we might invoke an optimizing
> compiler while the function is running and we may want to transfer the
> function’s execution into optimized code in the middle of a loop; to our
> knowledge the FTL is the first compiler to do on-stack-replacement for
> hot-loop transfer into LLVM-compiled code.
I know WebKit has tons of tests, but there must be some truly hairy bugs in that code. Does anybody know how they exercise the various optimize/deoptimize steps?
It turns out that instrumentation and optimization are actually harder problems than one would think, which puts a lot of rather nifty technology farther out of reach than it otherwise would be.
Yup. Another is Open Shading Language. They ended up with performance 25% faster than the previous shaders, which were coded by hand in C. It's used now in Blender's "Cycles" renderer. Yay, open source!
If you mean something else…then I guess not. It's a shader after all, not a general purpose programming language.
I tend to use dynamic in the sense of static vs. dynamic, where dynamic is "happens at runtime" and static is "happens before runtime".
This is the patch point intrinsic documentation: http://llvm.org/docs/StackMaps.html. This is a really significant addition to LLVM, because it opens up a whole world of speculative optimizations, even in static languages. Java, for example, suffers on LLVM for want of an effective way to support optimistic devirtualization.
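To make the speculation idea concrete, here's a minimal sketch (hypothetical C-like code with made-up names; not WebKit's or LLVM's actual code) of what an optimistically devirtualized call site looks like once a patchpoint gives the runtime somewhere to bail out to:

/* Hypothetical sketch of optimistic devirtualization; Object, Klass,
   expected_klass, Widget_draw, and deoptimize are invented names. */
typedef struct Klass Klass;
typedef struct { Klass* klass; } Object;

extern Klass* expected_klass;   /* the class the profiler observed at this call site */
void Widget_draw(Object* obj);  /* the method we speculate will be called */
void deoptimize(Object* obj);   /* bail out to baseline code via the patchpoint */

void callSpeculatively(Object* obj) {
    if (obj->klass == expected_klass)
        Widget_draw(obj);       /* fast path: a direct (and inlinable) call */
    else
        deoptimize(obj);        /* slow path: the stack map lets the runtime
                                   reconstruct state in unoptimized code */
}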
You're right that you can do some optimizations such as reassociation of pointer arithmetic, but I'm not sure how much they matter in practice—I'd be quite surprised if they outweighed the advantage of being able to allocate in ~5 instructions.
That said, in a practical sense I'm certain the advantage of being able to use the mature optimizations in the LLVM JIT outweighs the disadvantages of having to use conservative GC (and as I understand it, Azul is working on precise GC support for LLVM). Given their constraints, I think they did a great job.
Edit: I've been informed that the fast path in JSC for pulling a bin off the free list is around this many instructions. I should be more precise about what I mean: you can stay on the fast path more often. You can always allocate very quickly, only falling off the fast path if the nursery fills up. Of course, falling off a malloc fast path is rare anyhow, so this isn't likely to be that much of a difference in practice…
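For concreteness, here's a rough sketch of the kind of bump-pointer nursery fast path I have in mind (hypothetical names and layout, not JSC's or SpiderMonkey's actual allocator); the fast path is roughly a compare, an add, and a couple of moves:

#include <stddef.h>

/* Hypothetical bump-pointer nursery allocator, for illustration only.
   Assumes 'bytes' is already rounded up to the allocation alignment. */
typedef struct { char* cursor; char* limit; } Nursery;

void* allocateSlow(Nursery* n, size_t bytes);   /* collect the nursery, then retry */

static inline void* allocateFast(Nursery* n, size_t bytes) {
    char* result = n->cursor;
    if ((size_t)(n->limit - result) < bytes)    /* nursery full? */
        return allocateSlow(n, bytes);          /* rare: fall off the fast path */
    n->cursor = result + bytes;                 /* bump the pointer */
    return result;
}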
I guess around turns of the event loop you can probably get precise compaction though, since nothing's on the stack.
Edit: I didn't think it was correct that all production-quality semispaces are scattered throughout the address space, and indeed it isn't (though I probably misunderstood you): SpiderMonkey uses a contiguous-in-memory nursery.
This is going to be a very very big deal.
I wonder, however, how slow register spilling really is. I will test it when I have time, but logically, it shouldn't take up much time. Under the x64 ABI, 6 registers are used for argument passing, and the rest of the arguments are passed on the stack. So, when the runtime calls into GC functions, all but at most 6 pointers are already on the stack, at (in theory) predictable locations. Those 6 registers can be pushed to the stack in 6 instructions that take up 8 bytes in total, so the impact on code size should be minimal, and those pushes are probably also much faster than most other memory accesses. Furthermore, both OCaml and Haskell use register spilling, and while not quite at C-like speeds, they are mostly faster than JS engines and probably also faster than the FTL JIT.
Of course, predicting the stack map after LLVM finishes its optimisations is another thing entirely, but I sincerely hope the developers implement it. EDIT: it seems that LLVM includes some features that allow one to create a stack map, though I wonder if it can be made as efficient as the GHC stack map, which is simply a bitmap/pointer in each stack frame, identifying which words in the frame are pointers and which aren't.
(Tested using https://defuse.ca/online-x86-assembler.htm#disassembly)
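For reference, here's a rough sketch of the kind of per-frame bitmap stack map I mean (hypothetical layout and names; GHC's real info tables are more involved):

#include <stdint.h>

/* Hypothetical per-frame stack map: one bit per word-sized slot in the frame
   (so this toy version handles frames of up to 32 slots). */
typedef struct {
    uint32_t slot_count;   /* number of word-sized slots in this frame */
    uint32_t bitmap;       /* bit i set => slot i holds a GC pointer */
} FrameMap;

void markPointer(void* p);   /* assumed GC marking routine */

void scanFrame(void** frame_base, const FrameMap* map) {
    for (uint32_t i = 0; i < map->slot_count; i++) {
        if (map->bitmap & (1u << i))
            markPointer(frame_base[i]);   /* precise: only the known pointer slots get scanned */
    }
}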
All the GC has to do is force-spill all callee-save registers onto the stack, or just save them on the side. A typical stack-scanning algorithm for a conservative-on-the-stack collector (the more general class of collectors to which WebKit's Bartlett-based GC belongs) looks like:
void scanConservatively(void* begin, void* end); // scans every pointer-sized word in [begin, end)

jmp_buf buf;
setjmp(buf); // hack: dumps the callee-save registers into buf
void* start = &buf; // approximate top of this thread's stack
void* end = ... get the address of where you first entered the VM ...
scanConservatively(&buf, &buf + 1); // scan the register dump
scanConservatively(start, end); // scan the rest of the stack up to the VM entry frame
Notice that this uses setjmp() as a hack to extract all callee-save registers.
That is sufficient to: (1) get all of the pointers you're interested in and (2) allow the compiler (in WebKit's case, LLVM) maximum freedom to not spill registers any more than it would in a normal C calling convention where no GC is in play.
Bottom line: Bartlett = zero overhead stack accounting.
This project was targeting game code for the register-heavy, branch-unfriendly, in-order PowerPC cores of the PS3 and Xbox 360. Performance on scalar numerical code was equal to that of equivalent C++ code.
1. Consistency. You don't have to worry that some critical bit of type information will suddenly start to fail to be propagated all the way through multiple stages. This isn't really a problem for standard benchmarks, which are closely watched for regressions, but can be a problem for random code in the wild. JIT heuristics are complex.
2. Startup speed. AOT does not have to go through all of the various warmup phases, which are needed by JITs both to avoid wasting time on optimizing code that is not important for perf and to get type hints.
(1) It requires a pause to compile a bunch of stuff before you even get going, while in our case the LLInt would have executed it a few times already.
(2) Even for asm.js-style code, dynamic profiling info can help you optimize things more than you could with pure AOT. For example, you can make inlining decisions on the fly, based on what functions are actually called.
(3) Caching compiled code is a good idea for more than just asm.js, no need to make it a special case.
On point 3 above: I've slowly come to the opinion that caching optimized jitcode is a bad idea in the general case.
Optimized jitcode attaches many constraints across various heap objects (e.g. hidden types), and those constraints would need to be serialized in a way that lets them be loaded in a foreign runtime and matched back up to objects in the foreign heap (if those objects even exist). It's pretty complex to build a design that lets you answer these questions:
1. what piece of JS code in my heap does this cached jitcode correspond to?
2. how do I wire the jitcode into this brand new object graph and attach all the corresponding constraints correctly?
And even once you do build that design, the wins would be meager in general. The high-optimization tier already runs only on very, very hot code, so it's invoked on only a small fraction of the functions on the heap. Reading the jitcode from disk is slow, and you wouldn't just be reading the code, but also the serialized metadata describing how to map that jitcode onto the correct set of objects and hidden types that constraints need to be attached to. It's not even clear that the savings on compilation would exceed the cost of reading all this information from disk. Unless you are pulling in, in one go, a large amount of compiled jitcode that you know will be executed, disk caching has drawbacks that are hard to justify.
The reason asm.js code is a good candidate for targeted caching is because it avoids all of this business. You bake in no heap pointers, and you have no invalidation constraints hanging off of arbitrary heap objects pointing back at your jitcode. Asm is structured so that answering those two questions above is very easy, and it also allows you to cache large amounts of jitcode (a full module) that can then be loaded in one go and be guaranteed to be valid and executable.
Anyway, just food for thought. That said, awesome work and congrats again :)
I think for games you actually want the pause ahead of time, during the loading screen. Having the first few seconds stutter isn't a good experience.
> (2) Even for asm.js-style code, dynamic profiling info can help you optimize things more than you could with pure AOT. For example, you can make inlining decisions on the fly, based on what functions are actually called.
Well, for asm.js code the static inlining is already done by LLVM/Emscripten before the code is even shipped, so I'm not sure how much of a benefit this actually brings in practice—you can only choose to inline functions more aggressively than LLVM already did, not less aggressively. (Might help work around Emscripten's outlining pass, however.) You also eat CPU cycles that could be used for the app by doing non-AOT compilation (though, granted, Emscripten apps of today tend to use only one core, so background compilation will fix this on multicore).
> (3) Caching compiled code is a good idea for more than just asm.js, no need to make it a special case.
If we turn out to be wrong, though, we will not be religious about it or anything.
Hopefully the next Safari on iOS and OS X will get many more improvements.
doesnt "use asm" simply skip the initial profiling tiers that gather type stats etc? most of the benefit of compiling to asm.js comes from fast/explicit type coersion.
I imagine even writing a Ruby -> JS transpiler that used the WebKit VM would provide a speedup, similar to how JRuby works on the JVM, but native compilation would be even better.
Moreover, Dropbox is trying a roughly similar architecture for their Python implementation.
Rubinius already does Ruby -> LLVM.
dubbed the FTL – short for Fourth Tier LLVM
But of course, your point about confusing people with acronyms still stands!
But I bet the engineers like the name.
FTL beats V8 on almost all of those.
It would be more interesting to see results for benchmarks such as DeltaBlue or Richards. These would correlate much better with Gmail and Hipmunk performance, as opposed to in-browser demos of Unreal Tournament.
WebKit Nightly: 25141
Chrome Canary: 29962
That's a 38% speedup, which is none too shabby.
There is a discontinuity about a month ago, which is around when FTL was enabled, I believe. There you can see the nice speedups on Mandreel and the others you mention.
Of course, the FTL is tuned for the opposite: slower to compile, but it generates great code. That shows through on asm.js tests, which tend to be longer-running, and on some of the other tests in Octane that also have a high running-time/code-size ratio.
That's the cool thing about having four tiers. Each tier has some scenarios on which it shines.