Introducing the WebKit FTL JIT (webkit.org)
491 points by panic on May 13, 2014 | 96 comments



    > Profile-driven compilation implies that we might invoke an optimizing
    > compiler while the function is running and we may want to transfer the
    > function’s execution into optimized code in the middle of a loop; to our
    > knowledge the FTL is the first compiler to do on-stack-replacement for
    > hot-loop transfer into LLVM-compiled code.
Reading this practically made my hair stand on end; it is one hell of a technical feat, especially considering they had no ability to pre-plan so that both LLVM and their previous engine would maintain identical stack layouts. It's really insane they got this to work at all.

I was reminded of the old story about a search engine that tried to supplant JS with a Java bytecode-clone because at the time it was widely believed Javascript had reached its evolutionary limit. How times change!


You don't have to maintain identical stack layouts, just be able to translate between frame formats at transitions. Most of the necessary semantics can be expressed as a function call: figure out which JS values are live across the transition point, and pass them as arguments to a function call that transitions to the interpreter, which proceeds using those values. That can be done without any changes to LLVM, although obviously the code patching can't, and it's more efficient not to require values to be shuffled into argument registers. Keeping things updated when the target function is inlined is also tricky. But overall it's not as invasive a change to LLVM as you'd think. See: A Modular Approach to On-Stack Replacement in LLVM, https://www.google.com/url?sa=t&source=web&rct=j&ei=IrByU5Hx....
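
To make that concrete, here's a minimal sketch (all names hypothetical; nothing here is JSC's or LLVM's actual API) of the "OSR as a function call" idea, with the live values handed off as ordinary arguments:

    #include <cstdint>

    typedef uint64_t JSValue;

    // Hypothetical runtime entry point: rebuilds a frame for the other
    // tier from the live values and resumes at the given bytecode offset.
    extern "C" JSValue osrTransfer(uint32_t bytecodeOffset,
                                   const JSValue* liveValues,
                                   uint32_t numLive);

    extern "C" bool shouldTransferTiers();

    JSValue hotLoop(JSValue a, JSValue b) {
        for (;;) {
            // ... loop body ...
            if (shouldTransferTiers()) {
                // The values live across this point (here just a and b)
                // become plain call arguments, so neither compiler has to
                // know the other's stack layout.
                JSValue live[] = { a, b };
                return osrTransfer(/* bytecodeOffset */ 42, live, 2);
            }
        }
    }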



Whoops, I read further down and saw that you posted this link hours before me.


And it's not just two engines, but four, with at least one of them using the amount of available executable memory to choose whether to switch between them.

I know WebKit has tons of tests, but there must be some truly hairy bugs in that code. Does anybody know how they exercise the various optimize/deoptimize steps?


One neat testing strategy is to artificially lower the counters used to tier up. Instead of using 1000 for the DFG and 100000 for the FTL, use something like 10 for the DFG and 100 for the FTL. This causes lots of opt/deopt stress and can uncover correctness bugs that hide in more normal executions.
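
A toy illustration (the thresholds are from the comment above; everything else is hypothetical, not JSC's real code): tier-up is driven by simple execution counters, so shrinking them makes code bounce through the tiers after only a handful of runs:

    #include <cstdio>

    struct TierThresholds {
        int dfg; // executions before tiering up to the DFG JIT
        int ftl; // executions before tiering up to the FTL
    };

    const TierThresholds kNormal = { 1000, 100000 };
    const TierThresholds kStress = { 10,   100 };

    int tierFor(int executionCount, const TierThresholds& t) {
        if (executionCount >= t.ftl) return 4; // FTL
        if (executionCount >= t.dfg) return 3; // DFG
        return 1;                              // interpreter/baseline
    }

    int main() {
        // Under stress thresholds a function reaches the FTL after only
        // 100 runs, so opt/deopt transitions happen constantly.
        int runs[] = { 5, 50, 500 };
        for (int n : runs)
            printf("after %d runs: tier %d (stress) vs tier %d (normal)\n",
                   n, tierFor(n, kStress), tierFor(n, kNormal));
    }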



Indeed. Although, to be honest, I'd expected such technology to be the norm for most interpreted/compiled languages at this point.

It turns out that instrumentation and optimization are actually harder problems than one would think, which puts a lot of rather nifty technology farther out of reach than it otherwise would be.


Maybe the first browser to do so, but my colleagues at university were doing this 2-3 years ago (profile-guided hot-code replacement in LLVM for generic, non-browser-specific code). I will see if their papers are available without journal access.


Can you provide the papers' names for those of us who have access to academic journals?


This project is such a great technological achievement for both the WebKit and LLVM communities. There are so many 'first times' in this project. This is the first time profile-guided information is used inside the LLVM _JIT_. This is the first time the LLVM infrastructure has supported self-modifying code. This is one of the very few successful projects that have used LLVM to accelerate a dynamic language. This is the first time WebKit has integrated its runtime (JITs and garbage collector) with an external JIT. This is the first JavaScript implementation that has advanced features such as auto-vectorization. Congrats guys!


> This is one of the very few successful projects that used LLVM to accelerate a dynamic language.

Yup. Another is Open Shading Language[0]. They ended up with performance 25% faster than the previous shaders, which were coded by hand in C. It's used now in Blender's "Cycles" renderer. Yay, open source!

[0] https://github.com/imageworks/OpenShadingLanguage/


I don't think that OpenShadingLanguage is dynamic.


If by dynamic you mean that the compiled code for the exact same shader source can be wildly different at runtime based on the inputs, then OSL is dynamic in that sense.

If you mean something else…then I guess not. It's a shader after all, not a general purpose programming language.

I tend to use dynamic in the sense of static vs. dynamic, where dynamic is "happens at runtime" and static is "happens before runtime".


I think nadav256 meant dynamically typed. I.e., most languages that target LLVM are statically typed, and this project is one of the few (the first?) to successfully leverage LLVM for a dynamically typed language.


Julia is another dynamically typed language using LLVM. On the other hand, Julia was designed to use LLVM. I think this may be the first success with a dynamically typed language that was not designed around LLVM.


What about Rubinius? (don't know much about it; does it not qualify as successful?)


There's also Python's "Unladen Swallow" attempt.[0]

[0] http://en.wikipedia.org/wiki/Unladen_Swallow


Unladen Swallow is a dead project at this point (although many lessons were learned from it). The active Python LLVM JIT project is Dropbox's Pyston:

https://github.com/dropbox/pyston

https://tech.dropbox.com/2014/04/introducing-pyston-an-upcom...


There is also numba, http://numba.pydata.org/, which uses LLVM to compile annotated Python code, although this is more limited of course.


The Bartlett mostly-copying collector is just a really neat design. Even if your compiler gives you precise stack maps, conservative root scavenging still has the major advantage of giving the optimizer the most freedom. It's the basis of SBCL's collector, which is probably 20 years old now. Good to see it still has legs.

This is the patchpoint intrinsic documentation: http://llvm.org/docs/StackMaps.html. This is a really significant addition to LLVM, because it opens up a whole world of speculative optimizations, even in static languages. Java, for example, suffers on LLVM for want of an effective way to support optimistic devirtualization.
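
To illustrate the kind of speculative optimization patchpoints enable, here's a minimal sketch (names hypothetical; not LLVM's or JSC's API) of optimistic devirtualization: a cheap guard on the profiled receiver type, a direct call on the fast path, and a slow path the runtime could patch:

    struct HiddenClass;

    struct Object {
        const HiddenClass* klass;
        double value;
    };

    typedef double (*Method)(const Object*);

    // Stand-in for the patchable slow path: a full dynamic lookup.
    double slowPathLookupAndCall(const Object* obj) {
        return obj->value; // placeholder for the real dispatch
    }

    // What the optimized code conceptually emits.
    double speculativeCall(const Object* obj,
                           const HiddenClass* expectedClass,
                           Method predictedTarget) {
        if (obj->klass == expectedClass)
            return predictedTarget(obj);   // devirtualized, inlinable call
        return slowPathLookupAndCall(obj); // guard failed: deopt/patch here
    }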


Well, with conservative-on-stack GC you lose the ability to do bump allocation in the nursery for all objects (since you can't update all roots when you tenure objects, so you can't move them, so you can't do two-space stop-and-copy), which is to my mind a pretty severe drawback.

You're right that you can do some optimizations such as reassociation of pointer arithmetic, but I'm not sure how much they matter in practice—I'd be quite surprised if they outweighed the advantage of being able to allocate in ~5 instructions [1].

That said, in a practical sense I'm certain the advantage of being able to use the mature optimizations in the LLVM JIT outweighs the disadvantages of having to use conservative GC (and as I understand Azul is working on precise GC support for LLVM). Given their constraints I think they did a great job.

[1] Edit: I've been informed that the fast path in JSC is around this many instructions for pulling a bin off the free list. I should be more precise: what I mean is that you can stay on the fast path more often. You can always allocate very quickly, only falling off the fast path if the nursery fills up. Of course, falling off a malloc fast path is rare anyhow, so this isn't likely to be that much of a difference in practice…
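
For reference, a minimal sketch (names hypothetical) of the bump-allocation fast path being counted here: bump, compare against the limit, and only call the collector when the nursery fills up:

    #include <cstddef>

    struct Nursery {
        char* bump;  // next free byte
        char* limit; // end of the nursery
    };

    void* allocateSlowPath(Nursery*, size_t); // runs a collection

    inline void* allocate(Nursery* n, size_t bytes) {
        char* result = n->bump;
        char* newBump = result + bytes;
        if (newBump > n->limit)                // nursery full: rare,
            return allocateSlowPath(n, bytes); // out-of-line slow path
        n->bump = newBump; // fast path is roughly: load, add, compare,
        return result;     // branch, store -- about five instructions
    }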


Both MPS and SBCL, which are Bartlett-style collectors, use bump-pointer allocation. They divide the heap into pages and bump-allocate into pages. At collection time, when a page is the target of an ambiguous pointer, it is pinned and promoted in place. A pinned page is not used for allocation the next round. The downside is that unless the ambiguous pointers are transient, a page may be pinned and unusable for allocation for a while.
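
A minimal sketch (names hypothetical) of that pinning rule: any page hit by an ambiguous root gets promoted in place and withheld from allocation for the next round:

    #include <cstdint>
    #include <vector>

    struct Page {
        uintptr_t start, end; // address range of this heap page
        bool pinned;
        bool allocatable;
    };

    // Called for every ambiguous pointer found during stack scanning.
    void notePossiblePointer(std::vector<Page>& heap, uintptr_t p) {
        for (Page& page : heap) {
            if (p >= page.start && p < page.end) {
                page.pinned = true;       // survivors are promoted in
                page.allocatable = false; // place; skip this page for
                break;                    // allocation next round
            }
        }
    }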


JavaScriptCore also has bump-pointer allocation, as mentioned by the post. We don't use it for all things but we think our tradeoff choices are pretty good.


Right, I specifically mentioned two-space collection for this reason (though I guess I should have been more precise that it doesn't forbid bump pointer allocation, just means you may leak entire semispaces for a while).


The most common incarnation of Bartlett is a semispace mostly-copying collector with bump pointer allocation. You never leak a whole space. The only requirement is that the spaces are a discontiguous bag of pages. Note that in proper GC terminology this is the right way to say it - a space is not a contiguous region but rather an abstract bag of heap locations, though usually it's a bag of very large pages, like 64KB in WebKit.

You'd want discontiguous spaces anyway if you want to support dynamic heap growing/shrinking or other features like pinning (for fast FFI), etc. It's usually only either academic collectors, or collectors in one-VM-per-process systems like server VMs, that can even afford to make a space be a contiguous region of virtual memory. So, Bartlett ends up incurring zero net overhead for embeddable VM environments like a browser engine.


Presumably you still leak individual pages if you have stuck stack pointers though?

I guess around turns of the event loop you can probably get precise compaction though, since nothing's on the stack.

Edit: I didn't think it was correct that all production-quality semispaces are scattered throughout the address space, and indeed it's not (though I probably misunderstood you): SpiderMonkey uses a contiguous-in-memory nursery.


If you use a block-structured heap, which is convenient for other reasons (GHC uses one), you just leak parts of entire semispaces. :)


Not if you compact, which you can't (fully) do if you have stuck pointers in a conservative collector.


This is a much bigger deal than people are giving credit for, because of the other thing that Apple uses LLVM for. It's the primary compiler for Objective-C, and thus Cocoa (Mac apps) and Cocoa Touch (iOS apps) as well. If Apple has JavaScript compiling on-the-fly at this speed, this also means it would be pretty trivial to expose the Objective-C runtime to JavaScript, and mix and match C & JavaScript code.

This is going to be a very very big deal.



You can't get the JSContext of a UIWebView though, which is a big hole.



I know, but that isn't a realistic option since it's not a public API and therefore not something you should use in a production application.


Yes, this is a private API - meaning not officially public, so you can't use it in production. I would love to have the new FTL JIT engine officially supported for WebView JS execution. It would bring UIWebView on iOS 7 up to par with the Android WebView on Android 4.4.


I doubt "it would be pretty trivial". The problems are probably in cross-language object lifetimes and memory management--not in whether both generate machine instructions using LLVM machinery.


I'm so glad to see that WebKit is not dead after the Blink fork. I still use Safari as my main browser, but its developer tools and optimizing compiler lag behind.


Right, but seeing the speed comparison with the LLInt [1] makes me really sad that we can't even use the baseline JIT in iOS apps. I still have some hope that Apple is working on a way to run JSC in an isolated, trusted process that is allowed to use the JIT. Otherwise the new FTL JIT doesn't get us anything in many cases.

[1] http://www.webkit.org/blog-files/ftl-jit/four_tier_performan...


And allow other competing engines on the platform, while they're at it.


They use a conservative GC, which I understand, as they were using it before the FTL JIT and it required minimal changes to integrate with an LLVM-based JIT. However, in the blog post they mention several times that they wanted to avoid stack maps because that would require spilling pointers from registers to the stack, which they say is undesirable for performance reasons.

I wonder, however, how slow register spilling really is. I will test it when I have time, but logically it shouldn't take up much time. Under the x64 ABI, 6 registers are used for argument passing [1], and the rest of the arguments are passed on the stack. So when the runtime calls into GC functions, all but at most 6 pointers are already on the stack, at (in theory) predictable locations. Those 6 registers can be pushed to the stack in 6 instructions that take up 8 bytes [2], so the impact on code size should be minimal, and the pushes are probably also much faster than most other memory accesses. Furthermore, both OCaml and Haskell spill registers this way, and while not quite at C-like speeds, they are mostly faster than JS engines and probably also faster than the FTL JIT.

Of course, predicting the stack map after LLVM finishes its optimisations is another thing entirely, but I sincerely hope the developers implement it. EDIT: it seems that LLVM includes some features [3] that allow one to create a stack map, though I wonder if it can be made as efficient as the GHC stack map, which is simply a bitmap/pointer in each stack frame identifying which words in the frame are pointers and which aren't (a sketch follows the footnotes).

[1] http://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_...

[2] tested using https://defuse.ca/online-x86-assembler.htm#disassembly

[3] http://llvm.org/docs/GarbageCollection.html
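
For comparison, a minimal sketch (layout hypothetical) of that GHC-style scheme: each frame carries a bitmap, and the collector visits exactly the slots whose bits are set (assuming at most 32 slots per frame for simplicity):

    #include <cstdint>

    struct Frame {
        uint32_t numSlots;  // word-sized slots in this frame (<= 32 here)
        uint32_t bitmap;    // bit i set => slot i holds a GC pointer
        uintptr_t* slots;   // the frame's contents
    };

    void scanFrame(const Frame* f, void (*markRoot)(uintptr_t*)) {
        for (uint32_t i = 0; i < f->numSlots; i++) {
            if (f->bitmap & (1u << i)) // precise: only real pointers
                markRoot(&f->slots[i]); // are treated as roots
        }
    }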


Registers have to be spilled at every location where the GC can be entered, which includes function calls and loop back edges. Having to spill values held in registers across a loop back edge is quite suboptimal.


That isn't true with all stack map implementations though: mature ones (such as OCaml's IIRC) have the ability to track registers as well, eliminating the spilling requirement. But LLVM doesn't have support for this yet (although I'll be delighted if Azul lands their patches to implement it!)


Yes, I should clarify this is a limitation of LLVM's current gcroot design.


That's a half-truth. Callee-save registers are a beautiful thing. When LLVM emits code for a function call, with high probability all of the important state will not be spilled - it'll be left in callee-save registers and it will be up to the callee to save those registers if the callee wishes to use them.

All that the GC has to do is force-spill all callee saves onto the stack, or just save them on the side. A typical conservative-on-the-stack (the more general class of collectors to which WebKit's Bartlett-based GC belongs) algorithm for stack scanning looks like:

    void scanConservatively(void* begin, void* end);

    void scanMyStack() {
        int fake;
        void* start = &fake;
        void* end = ...; /* get the address of where you first entered the VM */
        jmp_buf buf;
        setjmp(buf);
        scanConservatively(start, end);
        scanConservatively(&buf, &buf + 1);
    }

Notice that this uses setjmp() as a hack to extract all callee-save registers.

That is sufficient to: (1) get all of the pointers you're interested in and (2) allow the compiler (in WebKit's case, LLVM) maximum freedom to not spill registers any more than it would in a normal C calling convention where no GC is in play.

Bottom line: Bartlett = zero overhead stack accounting.


My post was talking about LLVM's gcroot mechanism, which forces spills anywhere GC could be invoked. Sorry if that was unclear.


For a similar project I was on -- a managed language built on top of LLVM -- we ensured GC events only came from calls to GC functions, either explicit collects or allocates. Spills of references to the stack only occurred at function calls, not in loops. Our collector was an accurate collector, though.

This project was targeting game code for the register-heavy, branch-unfriendly, in-order PowerPC PS3 and Xbox 360. Performance in scalar numerical code was equal to equivalent C++ code.


If I recall correctly, GHC (at least the non-LLVM code generator) doesn't spill at every location where GC could be entered. Instead, it checks whether GC needs to run by testing whether there is any space left on the stack and heap (admittedly trivial, thanks to its bump-pointer allocation), and only spills and enters the GC if necessary.


I wonder how difficult it would be to take the asm.js hints and, after validating that the code meets the contract it claims, push it all into the FTL JIT to get a huge speed boost on that code. With the ability to do hot transfers into LLVM-compiled code, it should be possible to do this without any real noticeable issues for the user.


That is roughly the design that OdinMonkey, the Firefox asm.js compiler, uses: verify the type information, then dump it all onto the high performance JS compiler backend. Two big advantages of an AOT approach like OdinMonkey are:

1. Consistency. You don't have to worry that some critical bit of type information will suddenly start to fail to be propagated all the way through multiple stages. This isn't really a problem for standard benchmarks, which are closely watched for regressions, but can be a problem for random code in the wild. JIT heuristics are complex.

2. Startup speed. AOT does not have to go through all of the various warmup phases, which are needed by JITs both to avoid wasting time on optimizing code that is not important for perf and to get type hints.


We already get a huge speed boost on asm.js code, without having to special case it. It's likely that LLVM could do even better on it, but at this time we're not sure that a specific hint that it's asm.js would help.


It may not help throughput, but it will certainly help warmup and caching.


Counterpoints:

(1) It requires a pause to compile a bunch of stuff before you even get going, whereas in our case the LLInt would have executed it a few times already.

(2) Even for asm.js-style code, dynamic profiling info can help you optimize things more than you could with pure AOT. For example, you can make inlining decisions on the fly, based on what functions are actually called.

(3) Caching compiled code is a good idea for more than just asm.js, no need to make it a special case.


Congratulations on the landing. I've been looking forward to this. Great to see LLVM getting improved to be a better JIT target, too.

On point 3, above: I've slowly come to the opinion that caching optimized jitcode is a bad idea in the general case.

Optimized jitcode attaches many constraints across various heap objects (e.g. hidden types), and those constraints would need to be serialized in a way that they can be loaded in a foreign runtime, matched up back to objects in the foreign heap (if those objects even exist). It's pretty complex to build a design that allows you to answer the questions:

1. what piece of JS code in my heap does this cached jitcode correspond to?

2. how do I wire the jitcode into this brand new object graph and attach all the corresponding constraints correctly?

And once you do build that design, the wins would be meager in general. The high optimization tier already runs only on very-very-hot code, so it's invoked on only a small fraction of the functions on the heap. Reading the jitcode from disk is slow, and you wouldn't just be reading the code, but the serialized metadata describing how to map that jitcode to the correct set of objects and hidden types that constraints need to be attached to. It's not even clear that the savings on compilation would exceed the costs of reading all this information from disk. Unless you are pulling in, in one go, a large amount of compiled jitcode that you know will be executed, disk caching has drawbacks that are hard to justify.

The reason asm.js code is a good candidate for targeted caching is that it avoids all of this business. You bake in no heap pointers, and you have no invalidation constraints hanging off of arbitrary heap objects pointing back at your jitcode. Asm is structured so that answering those two questions above is very easy, and it also allows you to cache large amounts of jitcode (a full module) that can then be loaded in one go and be guaranteed to be valid and executable.

Anyway, just food for thought. That said, awesome work and congrats again :)


> (1) It requires a pause to compile a bunch of stuff before you even get going, whereas in our case the LLInt would have executed it a few times already.

I think for games you actually want the pause ahead of time, during the loading screen. Having the first few seconds stutter isn't a good experience.

> (2) Even for asm.js-style code, dynamic profiling info can help you optimize things more than you could with pure AOT. For example, you can make inlining decisions on the fly, based on what functions are actually called.

Well, for asm.js code the static inlining is already done by LLVM/Emscripten before the code is even shipped, so I'm not sure how much of a benefit this actually brings in practice—you can only choose to inline functions more aggressively than LLVM already did, not less aggressively. (Might help work around Emscripten's outlining pass, however.) You also eat CPU cycles that could be used for the app by doing non-AOT compilation (though, granted, Emscripten apps of today tend to use only one core, so background compilation will fix this on multicore).

> (3) Caching compiled code is a good idea for more than just asm.js, no need to make it a special case.

No argument there, although it's harder to handle the full generality of JavaScript due to the temptation to bake pointers into the jitcode. If you can make it work though, hey, that's awesome.


The hypothesis of the JSC folks is that we will be able to get very good results without any special-case handling for code labeled as asm.js, just by having a great general compiler, particularly with more tuning. In the things we profiled, there wasn't a case where AOT, or even triggering the fourth tier earlier, seemed like the most valuable thing to do.

If we turn out to be wrong, though, we will not be religious about it or anything.


> If we turn out to be wrong, though, we will not be religious about it or anything.

Well said!


I am glad WebKit is thriving. I was worried that the Blink fork, with Google and Opera leaving, would mean WebKit gets no love.

Hopefully the next Safari on iOS and OS X will get many more improvements.


In some ways, it's easier to go fast when you don't have to argue over basic strategy or maintain duplicate code paths. Even if you lose some contributors in the process.


Yes, but in this case it isn't just "some" contributors. Google used to contribute more than double the commits and people on the project.

http://hypercritical.co/2013/04/12/code-hard-or-go-home


Google was a valuable contributor. But a lot of Google's commits were not necessarily that high value to non-Chrome ports of WebKit. They had forked code paths for a lot of things (so changes on their side of the #ifdef fork don't help anyone else), web platform features that were not necessarily of interest to others, and refactorings that did not turn out for the better. We have since ripped that stuff out. They did also contribute a lot of very valuable things, most notably security fixes and new web platform features that were useful. In fact, we sometimes still merge stuff from Blink (and they sometimes still merge from WebKit). But hacking on WebKit is a lot more fun now that we don't have to argue about how things affect V8, or the Chromium process model.


You are right for WebKit as a whole, but Google didn't contribute much to the JavaScriptCore part of WebKit, using V8 instead. So I think it is indeed reasonable that this part of WebKit's development will speed up after the breakup with Google.


> Note that the FTL isn’t special-casing for asm.js by recognizing the "use asm" pragma. All of the performance is from the DFG’s type inference and LLVM’s low-level optimizing power.

Doesn't "use asm" simply skip the initial profiling tiers that gather type stats, etc.? Most of the benefit of compiling to asm.js comes from fast/explicit type coercion.


Is this architecture generic enough that one could, say, build a Ruby compiler on top of it?

I imagine even writing a Ruby -> JS transpiler that used the WebKit VM would provide a speedup, similar to how JRuby works on the JVM, but native compilation would be even better.


Yes.

Moreover, Dropbox is trying a roughly similar architecture for their Python implementation.

https://github.com/dropbox/pyston


You might be able to try that with http://opalrb.org/

Rubinius already does Ruby -> LLVM.


But as much as I hate to admit it, Rubinius is still slow.


Part of this project was building support into LLVM that you would need to build a really great dynamic language JIT. Another key aspect was having lower compilation tiers that don't use LLVM. Using LLVM as a fourth tier works much better than using it as a first or even second tier (we have tried in the past).


    dubbed the FTL – short for Fourth Tier LLVM
Is it still a backronym if it redefines an existing acronym? (even a fictional one?)

http://en.wikipedia.org/wiki/List_of_Battlestar_Galactica_ob...


The term FTL is generally used, and predates Battlestar Galactica.[1]

But of course, your point of confusing people with acronyms still stands!

http://en.wikipedia.org/wiki/Faster-than-light


When I first saw references to Apple's engine updates the other day and they were referred to as FTLJIT, I just assumed it stood for Faster-Than-Light. Perhaps that's just too nerdy for Apple to admit. Maybe it's not a backronym; it just worked out that way.

But I bet the engineers like the name.


I bet the "redefinition" is intentional.


I adore threads like this. They really bring to focus exactly how much more I have to learn.


It's called FTL, but I am not actually sure it's Faster Than LuaJIT.


No benchmarks versus V8?


You can see some benchmarks comparing them here:

http://arewefastyet.com/#machine=12&view=breakdown&suite=asm...

http://arewefastyet.com/#machine=12&view=breakdown&suite=asm...

FTL beats V8 on almost all of those.


All these benchmarks have "asmjs" in their names, which tells me they are not benchmarking idiomatic JavaScript, but JS generated by Emscripten. If that's the case, they do not represent the performance of hand-written applications, which are the majority of all apps on the web.

It would be more interesting to see results for benchmarks such as DeltaBlue or Richards. These would correlate much better with gmail and hipmunk performance, as opposed to in-browser demos of Unreal Tournament.


I wish I knew if lower is better or higher is better on the first page (y axis is not described on the graphs).


Lower in the graphs is always better on awfy. Lower numbers are better in all benchmarks except Octane, where higher numbers are better (its axis is reversed, so lower in the graph is still better).


For me, Chrome still beats this WebKit and the latest Firefox for deep-learning MNIST in the browser:

http://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist....


I'm trying to figure out how to measure performance here. Both Chrome and Firefox seem to take about the same wall-clock time to go through 5000 examples. Should I be waiting longer? Measuring something else?


Would be interesting to see Octane numbers.


On a new Mac Pro 2013, 8-core, 3GHz, I measure:

Safari: 18159

WebKit Nightly: 25141

Chrome Canary: 29962

That's a 38% speedup for the nightly over Safari, which is none too shabby.


In my test, Octane numbers improved by 25%. Improved benchmarks are Richards and DeltaBlue, which the blog post already analyzed in depth, and Mandreel and zlib, which are asm.js-like.


You can also see online results at

http://arewefastyet.com/#machine=12&view=breakdown&suite=oct...

There is a discontinuity about a month ago, which is around when FTL was enabled, I believe. There you can see the nice speedups on Mandreel and the others you mention.


To me, the most stunning part is JavaScriptCore's much better performance on TypeScript. And it seems it has always been that way, even before FTL was enabled. I wonder why.


It's a good question. I think I know why. WebKit's preexisting JITs were tuned for fast code load. Even the DFG JIT, the third-tier optimizing JIT, is deliberately tuned to generate code quickly even if it means not optimizing as much. TypeScript is a ton of code, so low-latency compilation is crucial.

Of course the FTL is tuned for the opposite - slower to compile but generates great code. That shows through on asm.js tests, which tend to be longer-running, and some of the other tests in Octane which also have a high running-time/code-size ratio.

That's the cool thing about having four tiers. Each tier has some scenarios on which it shines.


So in terms of compiler speed, DFG > V8 > FTL, and in terms of code quality, FTL > V8 > DFG, so DFG+FTL wins both against V8? I guess that makes sense.


V8 has multiple tiers too, so I'm not sure you can express it quite that simply.


I think you can read Crankshaft instead of V8 then. V8 has two tiers, and the lower tier is below DFG.


Even then, I'm not sure it's a simple linear relationship. Crankshaft is better on some code, and the FTL is better on some code. There are even cases where the DFG (plus the tiers below it) does better than Crankshaft. (Talking about speed here; for compile time you are probably right for a wide range of code.)


You can try it yourself with a WebKit nightly if you want. http://nightly.webkit.org/


Wonder how this compares to node - specifically https://github.com/rogerwang/node-webkit


ELI5 version please?


Safari got faster.


Much like how hairdressers don't necessarily have the most promising haircuts, it would seem that companies that make the finest web browsers don't necessarily have the greatest web pages! I don't think there is an ounce of JavaScript on the WebKit website, yet that article goes waaay over the heads of mere mortals on mega-speedy JavaScript.


This is actually a good thing. Do you expect a page about racecars to come with a racecar attached? JavaScript is already overused on the web.



