
Introducing the WebKit FTL JIT - panic
https://www.webkit.org/blog/3362/introducing-the-webkit-ftl-jit/
======
hosay123

        > Profile-driven compilation implies that we might invoke an optimizing
        > compiler while the function is running and we may want to transfer the
        > function’s execution into optimized code in the middle of a loop; to our
        > knowledge the FTL is the first compiler to do on-stack-replacement for
        > hot-loop transfer into LLVM-compiled code.
    

Reading this practically made my hair stand on end. It is one hell of a
technical feat, especially considering that they had no way to pre-plan so
that both LLVM and their previous engine would maintain identical stack
layouts in their designs. It's really insane they got this to work at all.

I was reminded of the old story about a search engine that tried to supplant
JS with a Java bytecode clone because, at the time, it was widely believed
JavaScript had reached its evolutionary limit. How times change!

~~~
rayiner
You don't have to maintain identical stack layouts, just be able to translate
between frame formats at transitions. Most of the necessary semantics can be
expressed as a function call: figure out which JS values are live across the
transition point, and pass them as arguments to a function call that
transitions to the interpreter, which proceeds using those values. That can be
done without any changes to LLVM, although obviously the code patching can't,
and it's more efficient not to require values to be shuffled into argument
registers. Keeping things updated when the target function is inlined is also
tricky. But overall it's not as invasive a change to LLVM as you'd think (a
sketch of the function-call formulation follows the link below). See:
A Modular Approach to On-Stack Replacement in LLVM,
[http://www.sable.mcgill.ca/publications/techreports/2012-1/lameed-12-osr-TR.pdf](http://www.sable.mcgill.ca/publications/techreports/2012-1/lameed-12-osr-TR.pdf)
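
A minimal sketch of that function-call formulation, in C++ with entirely
made-up names (this is not JSC's or LLVM's actual API): the values that are
live across the transition arrive as ordinary call arguments, so the two
engines never have to agree on a stack frame layout.

    // Hypothetical sketch: an OSR exit expressed as a plain function call.
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    using JSValue = std::uint64_t;  // pretend boxed JS value

    struct InterpreterFrame {
        std::uint32_t resumePC;       // bytecode offset to resume at
        std::vector<JSValue> locals;  // live values, in frame slot order
    };

    // Stub interpreter: resumes execution from the rebuilt frame.
    void interpret(InterpreterFrame& frame) {
        std::printf("resuming at bytecode %u with %zu live values\n",
                    frame.resumePC, frame.locals.size());
    }

    // The runtime function the optimized code calls at a side exit. Because
    // the live values arrive as ordinary arguments, the optimized code and
    // the interpreter never need to share a frame format.
    void osrExitToInterpreter(std::uint32_t resumePC,
                              const JSValue* live, std::size_t count) {
        InterpreterFrame frame{resumePC, std::vector<JSValue>(live, live + count)};
        interpret(frame);
    }

    int main() {
        // What a compiled side exit conceptually does with its live state.
        JSValue live[] = {1, 2, 3};
        osrExitToInterpreter(/*resumePC=*/42, live, 3);
    }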

~~~
jevinskie
Also
[http://llvm.org/docs/StackMaps.html](http://llvm.org/docs/StackMaps.html)

~~~
jevinskie
Whoops, I read further down and saw that you posted this link hours before me.

------
nadav256
This project is such a great technological achievement for both the WebKit
and LLVM communities. There are so many 'first times' in this project. This is
the first time profile-guided information is used inside the LLVM _JIT_. This
is the first time the LLVM infrastructure has supported self-modifying code.
This is one of the very few successful projects that used LLVM to accelerate a
dynamic language. This is the first time WebKit integrated its runtime (JITs
and garbage collector) with an external JIT. This is the first JavaScript
implementation that has advanced features such as auto-vectorization.
Congrats, guys!

~~~
erichocean
_> This is one of the very few successful projects that used LLVM to
accelerate a dynamic language._

Yup. Another is Open Shading Language[0]. They ended up with performance 25%
faster than the previous shaders, which were coded by hand in C. It's used now
in Blender's "Cycles" renderer. Yay, open source!

[0]
[https://github.com/imageworks/OpenShadingLanguage/](https://github.com/imageworks/OpenShadingLanguage/)

~~~
nadav256
I don't think that OpenShadingLanguage is dynamic.

~~~
erichocean
If by _dynamic_, you mean: the compiled code for the same exact shader source
code can be wildly different at runtime, based on the inputs, then OSL is
dynamic in that sense.

If you mean something else… then I guess not. It's a shader after all, not a
general-purpose programming language.

I tend to use dynamic in the sense of _static vs. dynamic_, where _dynamic_
is "happens at runtime" and _static_ is "happens before runtime".

~~~
bronxbomber92
I think nadav256 meant dynamically typed. That is, most languages that target
LLVM are statically typed, and this project is one of the few (the first?) to
successfully leverage LLVM for a dynamically typed language.

~~~
sanxiyn
Julia is another dynamically typed language using LLVM. On the other hand,
Julia was _designed_ to use LLVM. I think this may be the first success with a
dynamically typed language not designed to use LLVM.

~~~
ihnorton
What about Rubinius? (don't know much about it; does it not qualify as
successful?)

~~~
erichocean
There's also Python's "Unladen Swallow" attempt.[0]

[0]
[http://en.wikipedia.org/wiki/Unladen_Swallow](http://en.wikipedia.org/wiki/Unladen_Swallow)

~~~
bobbyi_settv
Unladen Swallow is a dead project at this point (although many lessons were
learned from it). The active Python LLVM JIT project is Dropbox's Pyston:

[https://github.com/dropbox/pyston](https://github.com/dropbox/pyston)

[https://tech.dropbox.com/2014/04/introducing-pyston-an-upcom...](https://tech.dropbox.com/2014/04/introducing-pyston-an-upcoming-jit-based-python-implementation/)

------
rayiner
The Bartlett mostly-copying collector is just a really neat design. Even if
your compiler gives you precise stack maps, conservative root scavenging still
has the major advantage of giving the optimizer the most freedom. It's the
basis of SBCL's collector, which is probably 20 years old now. Good to see it
still has legs.

This is the patch point intrinsic documentation:
[http://llvm.org/docs/StackMaps.html](http://llvm.org/docs/StackMaps.html).
This is a really significant addition to LLVM, because it opens up a whole
world of speculative optimizations, even in static languages. Java, for
example, suffers on LLVM for want of an effective way to support optimistic
devirtualization.
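
As a rough C++ analogy (not actual patchpoint IR; all names invented),
optimistic devirtualization boils down to a cheap guard on the speculated
class plus an inlined fast path, with the slow path at a patchable call site
that the runtime could rewrite if the speculation keeps failing.

    // C++ analogy for a speculated, devirtualized call with a deopt guard.
    #include <cstdio>

    struct Klass { const char* name; };
    struct Object { const Klass* klass; int value; };

    const Klass expectedKlass{"Point"};  // the class the profiler observed

    // Slow path; with a patchpoint, this call site would be a patchable
    // region the runtime can later rewrite.
    int virtualCallSlow(Object* o) {
        std::printf("deopt: unexpected class %s\n", o->klass->name);
        return o->value * 2;  // generic dispatch stands in here
    }

    int speculatedGetDouble(Object* o) {
        if (o->klass == &expectedKlass)  // cheap guard on the speculated class
            return o->value * 2;         // devirtualized + inlined fast path
        return virtualCallSlow(o);       // guard failed: take the exit
    }

    int main() {
        Object p{&expectedKlass, 21};
        std::printf("%d\n", speculatedGetDouble(&p));  // prints 42
    }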

~~~
pcwalton
Well, with conservative-on-stack GC you lose the ability to do bump allocation
in the nursery for all objects (since you can't update all roots when you
tenure objects, so you can't move them, so you can't do two-space stop-and-
copy), which is to my mind a pretty severe drawback.

You're right that you can do some optimizations such as reassociation of
pointer arithmetic, but I'm not sure how much they matter in practice—I'd be
quite surprised if they outweighed the advantage of being able to allocate in
~5 instructions [1].

That said, in a practical sense I'm certain the advantage of being able to
use the mature optimizations in the LLVM JIT outweighs the disadvantages of
having to use conservative GC (and, as I understand it, Azul is working on
precise GC support for LLVM). Given their constraints, I think they did a
great job.

[1] Edit: I've been informed that the fast path in JSC is around this many
instructions for pulling a bin off the free list. I should be more precise:
what I mean is that you can stay on the fast path more often. You can always
allocate very quickly, only falling off the fast path if the nursery fills up.
Of course, falling off a malloc fast path is rare anyhow, so this isn't likely
to be that much of a difference in practice…
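
For concreteness, a minimal sketch of the bump-allocation fast path under
discussion (illustrative names, not JSC's actual allocator): a pointer bump
and a limit check, falling into the collector only when the nursery fills.

    // Nursery bump allocation: the fast path is a bump plus a limit check.
    #include <cstddef>
    #include <cstdlib>

    struct Nursery {
        char* cursor;  // next free byte
        char* limit;   // end of the nursery
    };

    void* allocateSlow(Nursery& n, std::size_t bytes);

    // The fast path: roughly a handful of instructions when inlined.
    inline void* allocate(Nursery& n, std::size_t bytes) {
        char* result = n.cursor;
        char* newCursor = result + bytes;
        if (newCursor > n.limit)           // nursery full?
            return allocateSlow(n, bytes); // rare: collect and retry
        n.cursor = newCursor;              // bump
        return result;
    }

    // Stub so the sketch compiles; a real slow path would run a minor GC
    // and retry the allocation.
    void* allocateSlow(Nursery& n, std::size_t bytes) {
        (void)n;
        return std::malloc(bytes);
    }

    int main() {
        static char buffer[1 << 20];
        Nursery n{buffer, buffer + sizeof(buffer)};
        void* obj = allocate(n, 64);
        (void)obj;
    }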

~~~
rayiner
Both MPS and SBCL, which are Bartlett-style collectors, use bump-pointer
allocation. They divide the heap into pages and bump-allocate into pages. At
collection time, when a page is the target of an ambiguous pointer, it is
pinned and promoted in place. A pinned page is not used for allocation the
next round. The downside is that unless the ambiguous pointers are transient,
a page may be pinned and unusable for allocation for a while.
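
A rough sketch of that pinning step (invented names, not MPS or SBCL code):
conservatively treat every stack word as a potential pointer, and pin any
heap page such a word lands on, so the page is promoted in place rather than
evacuated.

    // Conservative root scan that pins the pages ambiguous pointers hit.
    #include <cstdint>
    #include <unordered_set>

    constexpr std::uintptr_t kPageSize = 64 * 1024;  // e.g. 64KB pages

    struct Heap {
        std::unordered_set<std::uintptr_t> pageStarts;   // pages the heap owns
        std::unordered_set<std::uintptr_t> pinnedPages;  // pinned this cycle

        bool ownsPage(std::uintptr_t page) const {
            return pageStarts.count(page) != 0;
        }
    };

    // Treat every stack word as a potential (ambiguous) pointer.
    void conservativelyPin(Heap& heap, const std::uintptr_t* stackLo,
                           const std::uintptr_t* stackHi) {
        for (const std::uintptr_t* p = stackLo; p < stackHi; ++p) {
            std::uintptr_t page = *p & ~(kPageSize - 1);
            if (heap.ownsPage(page))
                heap.pinnedPages.insert(page);  // promote in place; don't move
        }
    }

    int main() {
        Heap heap;
        heap.pageStarts.insert(0x10000);  // pretend the heap owns one page
        std::uintptr_t fakeStack[] = {0x10040, 0xdeadbeef};
        conservativelyPin(heap, fakeStack, fakeStack + 2);
        return heap.pinnedPages.size() == 1 ? 0 : 1;  // pins only the real page
    }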

~~~
pcwalton
Right, I specifically mentioned two-space collection for this reason (though I
guess I should have been more precise: it doesn't forbid bump-pointer
allocation, it just means you may leak entire semispaces for a while).

~~~
pizlonator
The most common incarnation of Bartlett is a semispace mostly-copying
collector with bump pointer allocation. You never leak a whole space. The only
requirement is that the spaces are a discontiguous bag of pages. Note that in
proper GC terminology this is the right way to say it - a _space_ is not a
contiguous region but rather an abstract bag of heap locations, though usually
it's a bag of very large pages, like 64KB in WebKit. You'd want discontiguous
spaces anyway if you want to support dynamic heap growing/shrinking or other
features like pinning (for fast FFI), etc. It's usually only either academic
collectors, or collectors in one-VM-per-process systems like server VMs, that
can even afford to make a space be a contiguous region of virtual memory. So,
Bartlett ends up incurring zero net overhead for embeddable VM environments
like a browser engine.

~~~
pcwalton
Presumably you still leak individual pages if you have stuck stack pointers
though?

I guess around turns of the event loop you can probably get precise compaction
though, since nothing's on the stack.

Edit: I didn't think it was correct that all production-quality semispaces
are scattered throughout the address space, and indeed it's not (though I
probably misunderstood you): SpiderMonkey uses a contiguous-in-memory nursery.

------
InTheArena
This is a much bigger deal than people are giving credit for, because of the
other thing that Apple uses LLVM for: it's the primary compiler for
Objective-C, and thus Cocoa (Mac apps) and Cocoa Touch (iOS apps) as well. If
Apple has JavaScript compiling on-the-fly at this speed, it would be pretty
trivial to expose the Objective-C runtime to JavaScript and mix and match C &
JavaScript code.

This is going to be a very, very big deal.

~~~
overcyn
You've been able to do that for a while.

[http://opensource.apple.com/source/JavaScriptCore/JavaScript...](http://opensource.apple.com/source/JavaScriptCore/JavaScriptCore-7537.65/API/JSValue.h)

~~~
alayne
You can't get the JSContext of a UIWebView though, which is a big hole.

~~~
tehno
Yes, you can on iOS 7: [http://stackoverflow.com/questions/19664206/access-the-javas...](http://stackoverflow.com/questions/19664206/access-the-javascriptcore-engine-of-a-uiwebview)

~~~
alayne
I know, but that isn't a realistic option since it's not a public API and
therefore not something you should use in a production application.

~~~
suyash
Yes, this is a private API - meaning not officially public - so you can't use
it in production. I would love to have the new FTL JIT engine supported
officially for WebView JS execution. It would bring UIWebView on iOS 7 up to
par with the Android WebView on Android 4.4.

------
tambourine_man
I'm so glad to see that WebKit is not dead after the Blink fork. I still use
Safari as my main browser, but its developer tools and optimizing compiler lag
behind.

~~~
phoboslab
Right, but seeing the speed comparison with the LLInt [1] makes me really sad
that we can't even use the baseline JIT in iOS apps. I still have some hope
that Apple is working on a way to run JSC in an isolated, trusted process that
is allowed to use the JIT. Otherwise the new FTL JIT doesn't get us anything
in many cases.

[1] [http://www.webkit.org/blog-files/ftl-jit/four_tier_performan...](http://www.webkit.org/blog-files/ftl-jit/four_tier_performance.png)

~~~
higherpurpose
And allow other competing engines on the platform, while they're at it.

------
tomp
They use a conservative GC, which I understand: they were using it before the
FTL JIT, and it required minimal changes for integration with the LLVM-based
JIT. However, in the blog post, they mention several times that they wanted to
avoid stack maps because those would require spilling pointers from registers
to the stack, which they say is undesirable for performance reasons.

I wonder, however, how slow register spilling really is. I will test it when I
have time, but logically, it shouldn't take much time. Under the x86-64 ABI, 6
registers are used for argument passing [1], and the rest of the arguments are
passed on the stack. So, when the runtime calls into GC functions, all but at
most 6 pointers are already on the stack, at (in theory) predictable
locations. Those 6 registers can be pushed to the stack in 6 instructions that
take up 8 bytes [2], so the impact on code size should be minimal, and the
pushes are probably also much faster than most other memory accesses.
Furthermore, both OCaml and Haskell spill registers at GC points, and while
not quite at C-like speeds, they are mostly faster than JS engines and
probably also faster than the FTL JIT.

Of course, predicting the stack map after LLVM finishes its optimisations is
another thing entirely, but I sincerely hope the developers implement it.
EDIT: it seems that LLVM includes some features [3] that allow one to create a
stack map, though I wonder if it can be made as efficient as the GHC stack
map, which is simply a bitmap/pointer in each stack frame identifying which
words in the frame are pointers and which aren't (a sketch of that scheme
follows the links below).

[1]
[http://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_...](http://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions)

[2] tested using
[https://defuse.ca/online-x86-assembler.htm#disassembly](https://defuse.ca/online-x86-assembler.htm#disassembly)

[3]
[http://llvm.org/docs/GarbageCollection.html](http://llvm.org/docs/GarbageCollection.html)
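
A sketch of the GHC-style bitmap scheme mentioned above, with invented names:
each frame carries a small descriptor whose bits say which word slots hold
heap pointers, so the collector can walk frames precisely.

    // Precise frame scanning driven by a per-frame pointer bitmap.
    #include <cstdint>

    struct FrameDescriptor {
        std::uint32_t slotCount;    // word-sized slots in the frame (<= 32 here)
        std::uint32_t pointerMask;  // bit i set => slot i holds a heap pointer
    };

    // Hand a precise root to the collector (stubbed for the sketch).
    void markRoot(std::uintptr_t* slot) { (void)slot; }

    // Scan one frame: only the slots the bitmap flags are treated as roots.
    void scanFrame(std::uintptr_t* frameBase, const FrameDescriptor& desc) {
        for (std::uint32_t i = 0; i < desc.slotCount; ++i)
            if (desc.pointerMask & (1u << i))
                markRoot(&frameBase[i]);
    }

    int main() {
        std::uintptr_t frame[4] = {};
        FrameDescriptor desc{4, 0b0101};  // slots 0 and 2 are pointers
        scanFrame(frame, desc);
    }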

~~~
rayiner
Registers have to be spilled at every location where the GC can be entered,
which includes function calls and loop back edges. Having to spill values held
in registers across a loop back edge is quite suboptimal.

~~~
pcwalton
That isn't true of all stack map implementations, though: mature ones (such
as OCaml's, IIRC) can track registers as well, eliminating the spilling
requirement. But LLVM doesn't have support for this yet (although I'll be
delighted if Azul lands their patches to implement it!).

~~~
rayiner
Yes, I should clarify that this is a limitation of LLVM's current gcroot
design.

------
simcop2387
I wonder how difficult it would be to take the asm.js hints and, after
validating that the code meets the contracts it provides, push it all through
the FTL JIT to get a huge speed boost on that code. With the ability to do hot
transfers into LLVM-compiled code, it should be possible without any real
noticeable issues for the user.

~~~
om2
We already get a huge speed boost on asm.js code without having to
special-case it. It's likely that LLVM could do even better on it, but at this
time we're not sure that a specific hint that it's asm.js would help.

~~~
sanxiyn
It may not help throughput, but it will certainly help warmup and caching.

~~~
om2
Counterpoints:

(1) It requires a pause to compile a bunch of stuff before you even get
going, while in our case the LLInt would have executed it a few times already.

(2) Even for asm.js-style code, dynamic profiling info can help you optimize
things more than you could with pure AOT. For example, you can make inlining
decisions on the fly, based on what functions are actually called.

(3) Caching compiled code is a good idea for more than just asm.js, no need to
make it a special case.

~~~
pcwalton
> (1) It requires a pause to compile a bunch of stuff before you even get
> going, while in our case the LLInt would have executed it a few times
> already.

I think for games you actually want the pause ahead of time, during the
loading screen. Having the first few seconds stutter isn't a good experience.

> (2) Even for asm.js-style code, dynamic profiling info can help you optimize
> things more than you could with pure AOT. For example, you can make inlining
> decisions on the fly, based on what functions are actually called.

Well, for asm.js code the static inlining is already done by LLVM/Emscripten
before the code is even shipped, so I'm not sure how much of a benefit this
actually brings in practice—you can only choose to inline functions more
aggressively than LLVM already did, not less aggressively. (Might help work
around Emscripten's outlining pass, however.) You also eat CPU cycles that
could be used for the app by doing non-AOT compilation (though, granted,
Emscripten apps of today tend to use only one core, so background compilation
will fix this on multicore).

> (3) Caching compiled code is a good idea for more than just asm.js, no need
> to make it a special case.

No argument there, although it's harder to handle the full generality of
JavaScript due to the temptation to bake pointers into the jitcode. If you can
make it work though, hey, that's awesome.

~~~
om2
The hypothesis of the JSC folks is that it will be able to get very good
results without any special-case handling for code labeled as asm.js, just by
having a great general compiler, particularly with more tuning. Of the things
we profiled, there wasn't a case where AOT, or even triggering the fourth tier
earlier, seemed like the most valuable thing to do.

If we turn out to be wrong, though, we will not be religious about it or
anything.

~~~
Tloewald
> If we turn out to be wrong, though, we will not be religious about it or
> anything.

Well said!

------
ksec
I am glad WebKit is thriving. I was worried that the Blink fork, with Google
and Opera leaving, would mean WebKit gets no love.

Hopefully the next Safari on iOS and OS X will get many more improvements.

~~~
om2
In some ways, it's easier to go fast when you don't have to argue over basic
strategy or maintain duplicate code paths. Even if you lose some contributors
in the process.

~~~
ksec
Yes, but in this case it isn't just "some" contributors. Google used to
contribute more than double the commits and people on the project.

[http://hypercritical.co/2013/04/12/code-hard-or-go-home](http://hypercritical.co/2013/04/12/code-hard-or-go-home)

~~~
om2
Google was a valuable contributor. But a lot of Google's commits were not
necessarily of high value to non-Chrome ports of WebKit. They had forked code
paths for a lot of things (so changes on their side of the #ifdef fork didn't
help anyone else), web platform features that were not necessarily of interest
to others, and refactorings that did not turn out for the better. We have
since ripped that stuff out. They did also contribute a lot of very valuable
things, most notably security fixes and new web platform features that _were_
useful. In fact, we sometimes still merge stuff from Blink (and they sometimes
still merge from WebKit). But hacking on WebKit is a lot more fun now that we
don't have to argue about how things affect V8, or the Chromium process model.

------
leeoniya
> Note that the FTL isn’t special-casing for asm.js by recognizing the "use
> asm" pragma. All of the performance is from the DFG’s type inference and
> LLVM’s low-level optimizing power.

doesnt "use asm" simply skip the initial profiling tiers that gather type
stats etc? most of the benefit of compiling to asm.js comes from fast/explicit
type coersion.

------
lobster_johnson
Is this architecture generic enough that one could, say, build a Ruby compiler
on top of it?

I imagine even writing a Ruby -> JS transpiler that used the WebKit VM would
provide a speedup, similar to how JRuby works on the JVM, but native
compilation would be even better.

~~~
steveklabnik
You might be able to try that with [http://opalrb.org/](http://opalrb.org/)

Rubinius already does Ruby -> LLVM.

~~~
ksec
But as much as I hate to admit it, Rubinius is still slow.

~~~
om2
Part of this project was building support into LLVM that you would need to
build a really great dynamic language JIT. Another key aspect was having lower
compilation tiers that _don't_ use LLVM. Using LLVM as a fourth tier works
much better than using it as a first or even second tier (we have tried in the
past).

------
aaronbrethorst

        dubbed the FTL – short for Fourth Tier LLVM
    

Is it still a backronym if it redefines an existing acronym? (even a fictional
one?)

[http://en.wikipedia.org/wiki/List_of_Battlestar_Galactica_ob...](http://en.wikipedia.org/wiki/List_of_Battlestar_Galactica_objects#FTL_Drive)

~~~
scott_karana
The term FTL is in general use, and predates Battlestar Galactica.[1]

But of course, your point about confusing people with acronyms still stands!

[http://en.wikipedia.org/wiki/Faster-than-light](http://en.wikipedia.org/wiki/Faster-than-light)

~~~
MBCook
When I first saw references to Apple's engine updates the other day, referred
to as the FTL JIT, I just assumed it stood for Faster-Than-Light. Perhaps
that's just too nerdy for Apple to admit. Maybe it's not a backronym; it just
worked out that way.

But I bet the engineers like the name.

------
jwarren
I adore threads like this. They really bring into focus exactly how much more
I have to learn.

------
otikik
It's called FTL, but I am not actually sure it's Faster Than LuaJIT.

------
nivertech
no benchmarks versus V8?

~~~
azakai
You can see some benchmarks comparing them here:

[http://arewefastyet.com/#machine=12&view=breakdown&suite=asm...](http://arewefastyet.com/#machine=12&view=breakdown&suite=asmjs-ubench)

[http://arewefastyet.com/#machine=12&view=breakdown&suite=asm...](http://arewefastyet.com/#machine=12&view=breakdown&suite=asmjs-apps)

FTL beats V8 on almost all of those.

~~~
acqq
I wish I knew whether lower is better or higher is better on the first page
(the y-axis is not labeled on the graphs).

~~~
azakai
Lower in the graphs is always better on awfy. That corresponds to lower
numbers being better in all benchmarks except Octane, where higher numbers are
better (that axis is reversed, so lower in the graph is still better).

------
cromwellian
Would be interesting to see Octane numbers.

~~~
sanxiyn
In my test, Octane numbers improved by 25%. The improved benchmarks are
Richards and DeltaBlue, which the blog post already analyzed in depth, and
Mandreel and zlib, which are asm.js-like.

~~~
azakai
You can also see online results at

[http://arewefastyet.com/#machine=12&view=breakdown&suite=oct...](http://arewefastyet.com/#machine=12&view=breakdown&suite=octane)

There is a discontinuity from about a month ago, which is around when FTL was
enabled, I believe. There you can see the nice speedups on Mandreel and the
others you mention.

~~~
sanxiyn
To me, the most stunning part is JavaScriptCore's much better performance on
Typescript. And it seems it has always been that way, even before FTL was
enabled. I wonder why.

~~~
pizlonator
It's a good question. I think I know why. WebKit's preexisting JITs were tuned
for fast code load. Even the DFG JIT, the third-tier optimizing JIT, is
deliberately tuned to generate code quickly even if it means not optimizing as
much. Typescript is a _ton_ of code so low-latency compilation is crucial.

Of course, the FTL is tuned for the opposite: slower to compile, but it
generates great code. That shows through on asm.js tests, which tend to be
longer-running, and on some of the other tests in Octane, which also have a
high running-time/code-size ratio.

That's the cool thing about having four tiers. Each tier has some scenarios
in which it shines.

~~~
sanxiyn
So in terms of compiler speed, DFG > V8 > FTL, and in terms of code quality,
FTL > V8 > DFG, so DFG+FTL wins both against V8? I guess that makes sense.

~~~
om2
V8 has multiple tiers too, so I'm not sure you can express it quite that
simply.

~~~
sanxiyn
I think you can read Crankshaft instead of V8 then. V8 has two tiers, and the
lower tier is below DFG.

~~~
om2
Even then, I'm not sure it's a simple linear relationship. Crankshaft is
better on some code, and FTL is better on some code. There are even cases
where DFG (plus the tiers below it) does better than Crankshaft. (Talking
about speed here; for compile time you are probably right for a wide range of
code.)

------
jongraehl
Wonder how this compares to node - specifically
[https://github.com/rogerwang/node-webkit](https://github.com/rogerwang/node-webkit)

------
harichinnan
ELI5 version please?

~~~
munificent
Safari got faster.

------
Theodores
Much like how hairdressers don't necessarily have the best haircuts, it would
seem that companies that make the finest web browsers don't necessarily have
the greatest web pages! I don't think there is an ounce of JavaScript on the
WebKit website, yet that article on mega-speedy JavaScript goes waaay over the
heads of mere mortals.

~~~
rimantas
This is actually a good thing. Do you expect a page about racecars to come
with a racecar attached? JavaScript is already overused on the web.

