
Author of LuaJIT explains why compilers can't beat hand-coded assembly - BrandonM
http://article.gmane.org/gmane.comp.lang.lua.general/75426
======
BrandonM
In the "BareMetal is a 64-bit OS for x86-64 based computers" comment thread
(<http://news.ycombinator.com/item?id=2587640>) haberman linked to this email
(<http://news.ycombinator.com/item?id=2588170>), and I thought it was
interesting enough to garner its own submission.

~~~
silentbicycle
Definitely. Thanks for posting this, I wish HN still regularly had more hard
tech content like this.

~~~
BrandonM
You're welcome. The thanks should go to haberman, though, who not only linked
to the email but apparently also asked the questions that led to such a great
explanation.

Reading it really made me want to do some assembly programming.

~~~
haberman
I used this info as the basis of writing some assembly of my own:
[http://blog.reverberate.org/2011/04/25/upb-status-and-prelim...](http://blog.reverberate.org/2011/04/25/upb-status-and-preliminary-performance-numbers/)

DynASM, which I used to write my JIT (also written by LuaJIT's author Mike
Pall) is also an impressive piece of work. It makes writing a JIT so pleasant
that I meant to write an article walking through the process of building a
simple JIT. I was going to implement a JIT for the "Universal Machine" from
ICFP 2006 as the example (<http://www.boundvariable.org/>), which is an
absolutely delightful problem, but unfortunately discovered that it is heavily
self-modifying, and thus would be hard to get a big performance boost from.
But it looks like someone did it anyway: <http://paul.bluereboot.net/um/>

~~~
ssp
_It makes writing a JIT so pleasant that I meant to write an article walking
through the process of building a simple JIT._

Please do, if you have the time. DynASM really is a stroke of genius.

The way it works is that the JIT contains bits of pre-generated machine code,
and then a little bit of bytecode stitches these bits together and emits them.
This means the actual runtime piece of the JIT doesn't need to know all the
wacky details of x86 instruction encoding. All that is done in a preprocessor
written in Lua.

I never got around to finding out how to do dynamic register allocation
though. It seemed to me that there was no way to parameterize the pre-
generated machine code, but presumably I'm missing something.

------
billforsternz
I enjoyed this article, I always find the argument that "these days you can't
beat the compiler" fishy. I remember more than 25 years ago I wrote an 8080
machine code interpreter in 8088/86 assembly. My innermost interpreter
comprised 3 lines of asm:

    
    
      lodsb     ;al = *si++
      mov ah,al ;eg if al=3eh (=mvi a,dd), then ax=0x3e3e
      jmp ax    ;eg goto 0x3e3e. Preloader puts the emulation
                ; code for mvi a,dd followed by another copy
                ; of the inner interpreter at cs:0x3e3e.
    

This approach uses 64K to sparsely store the emulation code for each of the
256 opcodes at 0x0000,0x0101,0x0202 ... etc. which seemed a pretty outrageous
waste of memory at the time. But it was worth it because none of the 3 inner
interpreter instructions modified the flags, which meant I could use the
native 8086 flags to directly emulate the 8080 flags, a huge speed win.
Remember that the 8086 had LAHF and SAHF instructions to help solve this very
problem, yet my approach avoiding the whole flags issue was still twice as
quick as commercial equivalents.

Okay admittedly this is all irrelevant dinosaur nostalgia from the perspective
of modern technology, but it reinforces the point that there's no substitute
for human creativity when it comes to real hardcore optimization tricks.

~~~
froydnj
> I enjoyed this article, I always find the argument that "these days you
> can't beat the compiler" fishy.

I enjoyed your example; very clever. I think the "these days you can't beat
the compiler" should really be understood to carry a qualifier like "for
compiling things that we understand well enough and that occur often enough
that it's worth teaching the compiler how to optimize them". Knuth's maxim
"Science is what we understand well enough to explain to a computer. Art is
everything else we do." seems particularly apt here.

Problems where humans can apply very domain-specific knowledge to come up with
clever solutions are not the sort of problems that a general-purpose compiler
is going to generate good code for. Knowing when you're dealing with one of
those problems (e.g. writing inner loops for multi-precision arithmetic or
vectorizing multimedia codecs) is part of being a good programmer. But to a
first or even second approximation, you are not dealing with one of those
problems, so you might as well just sit back and let the compiler do its job.

~~~
billforsternz
I was with you until "but to a first or even second approximation, you are not
dealing with one of those problems, so ...". Perhaps there is a typo there and
it should have been something like "but _when_ to a first or even...."

------
tptacek
A great find. And some of this argument is very simple even if you don't do a
lot of assembly programming; for instance:

* that C-code interpreters force the compiler to do register allocation across the slow paths of all instructions, even though the interpreter dev can in this case definitely predict which things need to be in registers;

* and, the optimizer in the compiler makes it hard to tune the code for the 20% that does 80% of the work, whereas an assembly program can hoist the slow-path code out of the I-cache entirely.

~~~
wingo
Dunno. I've found with __builtin_expect() in GCC, the code corresponding to an
unlikely branch is placed at the end of the interpreter body, even after all
the other instructions, which is probably as good as you can do icache-wise.
I'd have to check again, but such annotations could inform register allocation
as well.

~~~
CountHackulus
I was thinking about this, and I think you're mostly right. There's a bit more
that needs to be done, but some sort of expect builtin (name varies based on
compiler) would help.

What I'm thinking is, he needs to factor all the instructions out into
separate functions so that he stops hitting the optimization boundaries (or he
may be able to override them with parameters to the compiler). Then, mark the
hot paths with the expect builtin.

Once that's in, use profile-guided optimization (again, name varies based on
compiler) to feed the compiler more frequency information on the branches.
Partial inlining (not sure if GCC has this opt yet) should take care of the
rest.

As for the register assignment, well that's hard to get right while getting
the instruction scheduling right at the same time. Profile guided optimization
will help guide the global register allocator in the right direction, but it
likely won't be perfect. Thankfully, on modern architectures, register-to-
register moves are pretty much NOPs thanks to register files, short-
circuiting, and out-of-order execution.

------
VMG
To be clear, he qualifies this with the statement

 _Final words: A modern compiler contains a huge set of heuristics that
interact with each other in strange ways. They have been tuned for 'average'
code. But an interpreter is a very different beast, so you'll inevitably get
disappointing results._

More info in this reddit comment:
[http://www.reddit.com/r/programming/comments/badl2/luajit_2_...](http://www.reddit.com/r/programming/comments/badl2/luajit_2_beta_3_is_out_support_both_x32_x64/c0lrus0)

~~~
stcredzero
What if someone developed pragmas and a compiler that could use that
information with a different set of heuristics?

~~~
_delirium
I wonder how close you'd get to that by running the interpreter with gcc's or
llvm's profiling on, and then recompiling with profile-guided optimization. If
it's just a matter of compilers having heuristics tuned towards the wrong kind
of workload, PGO ought to re-tune them appropriately, if enough of the
heuristics in question actually look at the profiling data.

~~~
caf
I was thinking this too - you would think that profile guided optimisation
ought to be able to identify the slow paths.

------
chrisb
A couple of years ago I wrote a direct-threaded .NET interpreter. This article
made me go and have a look at the assembly that is being generated for the
simple 'add' instruction case.

The .NET IL is stack-based, and the interpreter runs it directly, so it will
never be as efficient as LuaJIT's register-based approach, but I was somewhat
disappointed to see this as the output of the Windows VS2008 C compiler in
release mode (all optimizations enabled):

    
    
      JIT_ADD_I32I32_start:
      	BINARY_OP(I32, I32, I32, +);
      0041164D  mov         eax,dword ptr [pCurEvalStack] 
      00411650  sub         eax,4 
      00411653  mov         dword ptr [pCurEvalStack],eax 
      00411656  mov         ecx,dword ptr [pCurEvalStack] 
      00411659  mov         edx,dword ptr [ecx-4] 
      0041165C  mov         eax,dword ptr [pCurEvalStack] 
      0041165F  add         edx,dword ptr [eax] 
      00411661  mov         ecx,dword ptr [pCurEvalStack] 
      00411664  mov         dword ptr [ecx-4],edx 
      JIT_ADD_I32I32_end:
      	GO_NEXT();
      00411667  mov         edi,dword ptr [pCurOp] 
      0041166A  add         edi,4 
      0041166D  mov         dword ptr [pCurOp],edi 
      00411670  jmp         dword ptr [edi-4]
    

The assembly after 'BINARY_OP(...)' is the integer add code. pCurEvalStack
points to the top of the current evaluation stack. Binary ops operate on the
top two items on the stack and push the result back onto it.

The assembly after 'GO_NEXT()' is the standard epilogue, which just dispatches
the next instruction (see
<http://en.wikipedia.org/wiki/Threaded_code#Direct_threading> for description
of direct-threading).

There are 3 separate loads of the same memory address, which surely could be
done better.

I'll probably have a go at altering the definition of the BINARY_OP(...) macro
to see if I can persuade it to generate better code. I'm not keen to hand-
write the assembly for this as it can compile to multiple processors.

The macro is currently defined as:

    
    
      #define BINARY_OP(returnType, type1, type2, op) \
      	pCurEvalStack -= sizeof(type1) + sizeof(type2) - sizeof(returnType); \
      	*(returnType*)(pCurEvalStack - sizeof(returnType)) = \
      	*(type1*)(pCurEvalStack - sizeof(returnType)) op \
      	*(type2*)(pCurEvalStack - sizeof(returnType) + sizeof(type1))
    

...which can probably be improved, or at least specialised to give better
results for binary ops where the operand and result types are all the same -
e.g. integer addition.

~~~
chrisb
Frustratingly, I just changed the macro to this:

    
    
      #define BINARY_OP(returnType, type1, type2, op) \
      { \
      	register PTR pRet = pCurEvalStack - sizeof(type1) - sizeof(type2); \
      	pCurEvalStack = pRet + sizeof(returnType); \
      	*(returnType*)pRet = *(type1*)pRet op *(type2*)(pRet + sizeof(type1)); \
      }
    

And the assembly generated is this:

    
    
      	BINARY_OP(I32, I32, I32, +);
      004117BA  mov         eax,dword ptr [pCurEvalStack] 
      004117BD  sub         eax,8 
      004117C0  mov         dword ptr [pRet],eax 
      004117C6  mov         ecx,dword ptr [pRet] 
      004117CC  add         ecx,4 
      004117CF  mov         dword ptr [pCurEvalStack],ecx 
      004117D2  mov         edx,dword ptr [pRet] 
      004117D8  mov         eax,dword ptr [edx] 
      004117DA  mov         ecx,dword ptr [pRet] 
      004117E0  add         eax,dword ptr [ecx+4] 
      004117E3  mov         edx,dword ptr [pRet] 
      004117E9  mov         dword ptr [edx],eax 
    

which is still unimaginably terrible. Why isn't it re-using values that have
already been loaded into registers? Why isn't it using a register for pRet?
I've even told it to! Although I think it's documented that the MS compiler
ignores the 'register' keyword.

And this is with all optimisations turned on. How depressing.

~~~
maximilianburke
Most compilers completely ignore "register" these days. I believe also that
aliasing the stack when you're performing the actual operation is greatly
hindering the compiler's ability to optimize.

~~~
chrisb
It does look as though no optimisation is being performed at all.

I just isolated the use of the BINARY_OP macro and put it in a simple-ish
test function. Now it's being optimized excellently.

The apparently unoptimisable code is in a hugely long and complex function,
and I wonder if something in it is preventing all optimisation from occurring
within that function. I've quickly looked through the assembly produced in the
function and all of it appears unoptimised, whereas code in other functions is
optimised ok.

What can prevent all optimisation from occurring in a function?

~~~
kenjackson
How long is long? Is this a code-generated function? I've seen some compilers
that turn off optimization past certain size limits due to compile-time
throughput issues.

If you can, break the function up into pieces and see if it still doesn't
optimize.

~~~
chrisb
The function is just over 2900 lines long. Every line crafted lovingly by
hand.

The whole source file is here: <http://pastebin.com/9L8N3AVF>

The function starts at line 232, and the disassembly I was looking at is from
line 1852.

This is the function that implements the direct-threaded interpreter. Direct
threading works by using gotos (jmps) to dispatch the next instruction to be
interpreted, which means that I don't think it can be broken up into multiple
smaller functions.

Please let me know if you think I'm wrong :)

~~~
swolchok
What if you made BINARY_OP a __forceinline (<http://msdn.microsoft.com/en-us/library/z8y1yy88.aspx>) function instead of a macro?

Things that the function contains that might disable optimization: a lot of
gotos, inline assembly in GO_NEXT (this _does_ affect optimization:
<http://msdn.microsoft.com/en-us/library/5hd5ywk0.aspx>)...

It's also not immediately obvious to me that pCurEvalStack is initialized.

Having a look at the architecture optimization manual re: the redundant loads,
it's not immediately obvious to me that modern processors won't handle this
fine (<http://www.intel.com/Assets/PDF/manual/248966.pdf>).

~~~
chrisb
Thanks for the idea. Unfortunately it doesn't improve things:

    
    
      	pCurEvalStack = TestAdd(pCurEvalStack);
      0040BFFA  mov         eax,dword ptr [pCurEvalStack] 
      0040BFFD  mov         dword ptr [ebp-694h],eax 
      0040C003  mov         ecx,dword ptr [ebp-694h] 
      0040C009  sub         ecx,8 
      0040C00C  mov         dword ptr [ebp-690h],ecx 
      0040C012  mov         edx,dword ptr [ebp-694h] 
      0040C018  sub         edx,4 
      0040C01B  mov         dword ptr [ebp-694h],edx 
      0040C021  mov         eax,dword ptr [ebp-690h] 
      0040C027  mov         ecx,dword ptr [eax] 
      0040C029  mov         edx,dword ptr [ebp-694h] 
      0040C02F  add         ecx,dword ptr [edx] 
      0040C031  mov         eax,dword ptr [ebp-690h] 
      0040C037  mov         dword ptr [eax],ecx 
      0040C039  mov         ecx,dword ptr [ebp-694h] 
      0040C03F  mov         dword ptr [pCurEvalStack],ecx 
    

And also the nice thing about the BINARY_OP() macro is that it can take types
and the operator as arguments (e.g. BINARY_OP(I32, I32, I32, +),
BINARY_OP(I64, I64, I32, <<) ) meaning it can be used for many operations on
many types. I would have to write separate functions for each case if using an
inline function.

And pCurEvalStack is initialised in the LOAD_METHOD_STATE macro, referenced
first on line 590, before interpretation begins.

------
sb
This is a very interesting post that fits nicely with many other interesting
statements Mike Pall has made (such as his reference to the LuaJIT2
interpreter being faster than the LuaJIT1 JIT compiler in some cases
[<http://lambda-the-ultimate.org/node/3851#comment-57646>].)

On a related note, a similar problem is highlighted in a paper by Anton Ertl
and David Gregg ([1]), where they claim that the slowdown of an efficient
interpreter relative to an optimizing native-code compiler is about 1:10,
whereas the slowdown of an inefficient interpreter relative to an efficient
one is about 1:100. Consequently, there seems to be a lot of optimization
potential (though I admit that depending on the kind of interpreter, it might
be unlikely to stay within a slowdown of 1:10).

 _[1]_ : Ertl, Gregg. "The Structure and Performance of Efficient
Interpreters", 2001.
([http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69....](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.1513&rep=rep1&type=pdf))

------
bajsejohannes
While this is true, it's important to note that you need to know a lot about
how a modern processor works to do this.

I once looked at the possibility of optimizing the bytecode dispatcher for a
JavaScript engine (Opera's old one), and my 386 assembly skills were far
outdated: removing one seemingly useless instruction from the ~10 core
assembly instructions actually slowed the whole thing down. (Unfortunately, I
have forgotten the details.)

------
KirinDave
It seems to me that this is more of a complaint leveled at C-ish compilers.
One could envision a language that makes the sorts of structures that baffle a
normal compiler more visible and amenable to optimization. I'm not sure if
some syntax and some clever semantics could address all these problems, but...

When I was first learning C formally, I had this idea that C was the "fastest
language". My instructor said, day 1, that the problem with C was that it was
not amenable to optimization. It was pretty surprising, but as time goes on
and languages, compilers, and interpreters get better I think we'll see this
sort of problem more and more.

~~~
ori_b
// One could envision a language that makes the sort of structures that baffle
the normal compiler more visible and amenable to optimization. //

Yes, the problem is that it starts to look like assembly at that point. With a
whole bunch of notes on each variable and operation to let the compiler know
about whether you care about specific aspects of the result of the operation.
(See the 'restrict' keyword in C for an example of what I mean)

~~~
CountHackulus
That sounds like PL/X[1] to me. That language will let you do things that
REALLY aren't advisable on most systems like specify pointer sizes (per
variable), specify address mode (arbitrary blocks, functions, nested
functions, per module, whatever), specify registers for any variable. Oh an my
favourite, when calling a function, you can specify an arbitrary global label
for it to return to.

Or of course you can write as cleanly as possible, let the compiler do the
lion's share of the work, and fix up what you need to.

[1]<http://en.wikipedia.org/wiki/PL/X> and
<http://c2.com/cgi/wiki?PlsLanguage>

------
thesz
There was an attempt to compress code for VLIW processors using stack
machines.

[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.3...](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.3758)

"On a TriMedia VLIW with a load latency of three cycles and a jump latency of
four cycles, the interpreter achieves a peak performance of four cycles per
instruction and a sustained performance of 6.27 cycles per instruction.
Experiments are described that demonstrate the compression quality of the
system and the execution speed of the pipelined interpreter; these were found
to be about five times more compact than native TriMedia code and a slowdown
of about eight times, respectively."

They used pragmas, loop pipelining
(<http://en.wikipedia.org/wiki/Software_pipelining>), and compiler
optimizations. Their results aren't that bad: they achieved CISC code density
at the cost of an 8-times slowdown.

------
programminggeek
I've always thought that the standard wisdom is that _most_ developers can't
beat the compiler by hand-writing _most_ of their code in ASM, but in things
like game development, code starts at a higher level (Lua, C++, C) and works
its way down to increase performance as bottlenecks are found.

Compilers are like Google or Wikipedia, 99% of the time they know more than
99% of developers.

------
gregholmberg
LuaJIT is freakishly fast for a certain (not small) set of problems [0].

[0]: <http://luajit.org/performance_x86.html>

------
_tef
Related paper: Context threading

<http://www.cs.toronto.edu/syslab/pubs/demkea_context.ps>

The gist is you can structure your interpreter to improve branch prediction by
keeping your vm & hardware in similar states.

------
davidhollander
For anyone recently trying LuaJIT2:

If you are wondering why readline (history) support is missing from the
interpreter and how to add it: [http://dhllndr.posterous.com/adding-readline-support-to-luaj...](http://dhllndr.posterous.com/adding-readline-support-to-luajit2)

Also, I just put a small functional programming library for Lua on GitHub
about an hour before reading this: <https://github.com/davidhollander/fn>
Here's an example of how to create a sum function with Fold Left and an
addition function:

    
    
        require'fn'
        sum = fn.partial( fn.foldl, function(a, b) return a+b end)
        x = sum(1, 2, 3, 4, 5, 6, 7, 8, 9)
        print(x)
        -- 45
    

...and for anyone who hasn't tried LuaJIT2: <http://luajit.org/download.html>

------
Someone
The referenced page is more interesting than what the title promises. The
title is trivially true; the referenced page makes the stronger claim that a
compiler cannot _tie_ with hand-coded assembly.

~~~
BrandonM
All I have to say for the title is that I did my best to not make it
linkbait-y.

------
pessimist
Would runtime optimization or profile based optimization help here?

------
NY_USA_Hacker
It's a very old story: hand-coded assembler can commonly get execution times a
factor of several faster than compiled code, and more when using special
instructions on special data. But on large programs, no, because some of what
compilers and runtime libraries do for large programs is too difficult for
hand-coded assembler. Indeed, some of what compilers do in register allocation
(and maybe cache usage) will be too difficult for hand-coded assembler.

More generally, the execution time of a small, single-threaded program is now
hardly ever of much concern. E.g., when you have a processor with dozens of
cores, it's easy to let a small, single-threaded program have a core and then
f'get about it.

Execution time remains important but now mostly just for large, multi-threaded
programs.

