
When FFI Function Calls Beat Native C - signa11
https://nullprogram.com/blog/2018/05/27/
======
jcranmer
This is one of those things that sounds better than it really is.

The timing used here in this post is ns per call, which somewhat obfuscates
what's going on. A better metric would be clock cycles of overhead, because
the differences here are only around 2-3 clock cycles. In other words, the
potential savings are going to be drowned out by jitter caused by things like
the processor deciding to service interrupts, or your hyperthread partner
doing work. If the relevant code is so hot that the cost is showing up in
profiles, then very likely, the fact that a function call is going through the
PLT versus being directly called is not going to be the thing that makes the
most difference: it's the fact that you're making the function call in the
first place.
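
For scale, the conversion is straightforward (assuming a ~3 GHz clock, so ~3
cycles per ns; the plt/jit figures here are taken from the benchmark numbers
further down the thread):

    
    
        overhead_cycles = overhead_ns * clock_GHz
        (2.31 ns/call - 1.54 ns/call) * 3 cycles/ns ~= 2.3 cycles/call
    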

~~~
ajross
Well, sure, but by the same measure and for the same reasons, "clock cycles"
are not a deterministic measure of software performance, while "nanoseconds"
are, by definition. The test can't easily control for cache effects (which
are vastly bigger in aggregate than IRQ or SMT effects, FWIW), so it doesn't
try; it assumes everything averages out, and it does.

The effect seems bigger than it is because, for all but the most trivial
leaf functions, PLT indirection just isn't a big deal. Never has been,
isn't now, never will be. It's a cool trick that a JIT can skip that step, but
there are bigger fish to fry.

~~~
achamayou
And the most trivial cases will be inlined more often than not.

~~~
BeeOnRope
Well, FFI calls will never be inlined, by definition, and that's kind of what
people are testing here.

~~~
achamayou
This is comparing FFI calls against native C, and in idiomatic native C, short
calls will be inlined.

------
nneonneo
The PLT is mostly a necessary evil of two requirements: lazy-loaded symbols
and W^X memory. The former is necessary to reduce binary load times, and the
latter is necessary for security (RWX segments in memory are ripe for abuse in
the event of a vulnerability). A JIT makes the tradeoff very differently: it
has a slow startup (a "warm up" period) and often creates RWX segments, but
gains the speed advantages seen here.
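
A rough C model of the lazy-binding machinery (the names here are
illustrative, not the real dynamic-linker entry points):

    
    
        #include <stdio.h>
        
        typedef void (*fn)(void);
        
        static void real_foo(void) { puts("foo"); }
        static void lazy_resolver(void);
        
        static fn got_foo = lazy_resolver;       /* the writable GOT slot */
        
        static void foo_plt(void) { got_foo(); } /* PLT stub: indirect call */
        
        static void lazy_resolver(void) {
            got_foo = real_foo;  /* first call: resolve and patch the GOT */
            real_foo();
        }
        
        int main(void) {
            foo_plt();  /* routed through the resolver */
            foo_plt();  /* now just call -> jmp *GOT -> real_foo */
        }
    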

Of course, the GOT referred to by the PLT is itself a bit of a security
vulnerability, because it's usually a writable chunk of memory with a ton of
frequently-called function pointers; corrupting it gives you relatively easy
control over program flow. The new mitigation is "full RelRO", which resolves
all GOT entries during binary load, then marks the GOT segment read-only. This
gains in security, but trades off binary load time, and it is still going to
be slower at runtime than the JIT. Full RelRO is usually off by default
because of this added overhead.
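
For reference, full RelRO is enabled with a pair of standard GNU ld flags,
shown here with a plain cc invocation:

    
    
        cc -Wl,-z,relro,-z,now -o prog prog.c
        readelf -l prog | grep GNU_RELRO   # the read-only-after-reloc segment
    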

Modifying the code segment to include direct function references (either
during binary load or later, during runtime) is also not a great approach
because it means being unable to share the code segment between processes,
which would increase memory pressure. We could do what Windows does, which is
to have static randomization (randomization of binary addresses is performed
once per boot, not once per process), enabling code to be "statically
relocated" lazily per boot, but this carries its own complexities, and opens
you up to different security vulnerabilities.

There's no magic bullet here, unfortunately. There are just tradeoffs
everywhere in many directions - between runtime speed, startup time, security
and memory usage.

~~~
ioquatix
Static linking avoids this overhead, right?

~~~
nneonneo
Yes, a statically-linked binary will have all internal references resolved by
the linker. If PIE is disabled, the addresses will all be fixed, which makes
exploiting such a binary much easier. Glibc has only _just_ added support for
static PIE executables (glibc 2.27, in Feb of this year), and I'm not sure
what the
implementation is - whether it relocates all the references in the binary
(increased memory usage), or uses a PLT/GOT approach to support relocations
(increased runtime overhead).
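
For anyone who wants to poke at it, recent toolchains expose this directly
(assuming GCC 8's -static-pie flag paired with the new glibc support):

    
    
        cc -static-pie -o prog prog.c
        readelf -h prog | grep Type   # reports DYN, despite static linking
    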

~~~
brandmeyer
Depending on the linkage, function calls internal to the library can use
ordinary PC-relative function calls instead of fully indirect function calls.
So you get the low per-call cost, but pay a higher per-process cost in memory
consumption.

~~~
BeeOnRope
Exactly.

The cost of indirect jumps isn't implied by position independent (PIE/PIC)
code alone, but by PIE/PIC combined with jumps across separately loadable
segments. For jumps _within_ a segment you can just use a direct relative
call, which is the fastest type of call and doesn't require any load-time
fixup
ever.

Of course then there is also the consideration of symbol interposition: if you
want to support that even _within_ a statically linked thing (usually
something other than the main executable), then you need a GOT-like thing, but
not necessarily a PLT-like thing.
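
A concrete way to opt a library's internal calls out of interposition is
symbol visibility; a minimal sketch (compile with cc -shared -fPIC):

    
    
        /* Hidden symbols can't be interposed, so calls to them compile to
           direct PC-relative calls: no PLT stub, no GOT load. */
        __attribute__((visibility("hidden")))
        int helper(int x) { return x + 1; }
        
        int api_entry(int x) {   /* exported: still interposable */
            return helper(x);    /* direct call */
        }
    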

------
wvenable
This reminds me of working with C++ and the Pebble SDK. The Pebble builds ELF
binaries but doesn't have an ELF loader. It also uses position independent
code. The SDK uses objcopy to strip the code out of the ELF, and the Pebble OS
just copies the code into memory and jumps to it.

In C, this works fine. But in C++, gcc generates the GOT for virtual method
calls, and there is no ELF loader to fill it.

I accidentally found out that if my code contains a destructor that makes an
assignment to a static variable, then no GOT is generated and virtual method
calls work for the entire project! I'm not sure of the logic behind that, but
this is the code that is needed:

    
    
        class SampleObject
        {
        public:
            //! Destroys the underlying Pebble window
            virtual ~SampleObject()
            {
                // The assignment to a static variable makes the virtual methods work for subclasses
                // I have no idea why!
                dummy() = 0;
            }

        private:
            // Dummy static variable needed for virtual method pointers to work (see destructor)
            inline static int& dummy() { static int dummy = 0; return dummy; }
        };

------
viraptor
Fortunately "native C" is not one, unchangeable thing. If the plt overhead
really is important in your situation, you can still compile either static, or
with "\--fno-plt".

------
nneonneo
FWIW, I get very different results from inside my Linux VM (MacBook Pro, 2.9
GHz Core i7):

    
    
        jit: 1.433856 ns/call
        plt: 1.700611 ns/call
        ind: 1.404397 ns/call
    

The PLT is double-indirected (call->jmp->func), but the single-indirect call
performs the same as the JIT. This backs up @tlb's note that modern x86
machines will predict indirect calls.

~~~
nneonneo
Now that I'm thinking about it a bit more - I recall that GCC 6.1 added
"-fno-plt" for x64 targets, which replaces PLT calls with "call [func@got]"
instead, eliminating the double indirection. Unfortunately, I don't have a
GCC 6.1 environment, so I'd be curious to see how the benchmark performs if
someone runs it with "-fno-plt". A combination of that plus indirect call
prediction should basically close the gap with the JIT approach.

~~~
damien
It does close the gap.

Before:

    
    
      cc -shared -fPIC -Os -s -o empty.so empty.c
      cc -std=c99 -Wall -Wextra -O3 -g3 -fpie -pie -o benchmark benchmark.c ./empty.so -ldl
    
      jit: 1.541608 ns/call
      plt: 2.309939 ns/call
      ind: 1.540583 ns/call
      

After:

    
    
      cc -shared -fPIC -Os -s -o empty.so empty.c
      cc -std=c99 -Wall -Wextra -O3 -g3 -fpie -fno-plt -pie -o benchmark benchmark.c ./empty.so -ldl
    
      jit: 1.541836 ns/call
      plt: 1.539239 ns/call
      ind: 1.542874 ns/call

------
FrozenVoid
A rule of thumb: if you see "faster than C" claims, it's either poor C or a
different code path/algorithm that's not equivalent to the C code. An FFI
that is faster than native isn't FFI but some other, more direct mechanism -
in this case, one that bypasses function address resolution. 'static inline'
is of course even faster. The only legitimate C overhead I can recall is libc
startup time with glibc, and you can use faster libc alternatives.
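
The 'static inline' point, in code form (a trivial sketch):

    
    
        /* In a header: every translation unit sees the body, so the
           compiler can inline it and the call overhead vanishes. */
        static inline int add1(int x) { return x + 1; }
    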

~~~
gaius
Unless it is FORTRAN, obviously

~~~
Noughmad
FORTRAN is only faster than C because it doesn't allow aliasing. Use the
`restrict` keyword in C and you should get FORTRAN-like performance.
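
A minimal sketch of the difference (hypothetical function; without restrict
the compiler has to assume y and x may alias and be conservative):

    
    
        #include <stddef.h>
        
        /* restrict promises no aliasing, much like Fortran array
           arguments, so the loop can be vectorized freely. */
        void axpy(double *restrict y, const double *restrict x,
                  double a, size_t n) {
            for (size_t i = 0; i < n; i++)
                y[i] += a * x[i];
        }
    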

~~~
ArneBab
Fortran can give you much better speed at comparable code-complexity, as long
as your problem is one which FORTRAN is optimized for, i.e. multidimensional
array manipulation or numerical work.

------
abacate
Any micro performance gains from a function call will become rounding errors
compared to the extra cost of any arithmetic operation in a dynamically-typed
language.

It may be an interesting article for describing how JITs and some
late-optimization techniques can produce optimized code given the right
preconditions, but I fail to see how the title is supposed to describe the
contents of the article.

Last, but not least, performance-sensitive C code tends to be statically
linked.

------
newnewpdro
PLT and GOT are quite orthogonal to the C programming language; they are
implementation details of dynamic linking and shared libraries.

Back in the 90s we would statically link performance-sensitive executables to
avoid this overhead. It didn't require changing the language.

------
ndesaulniers
Damn, he beat me to the writeup.

> The most surprising result of the benchmark is that LuaJIT’s FFI is
> substantially faster than C.

[https://news.ycombinator.com/threads?id=ndesaulniers](https://news.ycombinator.com/threads?id=ndesaulniers)

------
gravypod
I'm still amazed that there is code inlining across language boundaries. I
saw that on the last post and I still can't imagine all of the edge cases
that are being handled by that. Are there many other languages that do that
other than LuaJIT?

~~~
nicoburns
Rust is planning to enable inlining of C functions in the near future. I
believe this should work in both directions; it works because both are
compiled using LLVM.

~~~
steveklabnik
[https://github.com/rust-lang/rust/pull/50000](https://github.com/rust-lang/rust/pull/50000)
is the first bit,
[https://github.com/rust-lang/rust/issues/49879](https://github.com/rust-lang/rust/issues/49879)
is the tracking issue.

------
ars
Time to bring back self-modifying code?

A C program that modifies itself as it runs, to inline commonly called
functions or to make direct jumps instead of indirect ones.
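
In the small, that's essentially what the article's JIT benchmark does; a
minimal sketch of the mechanism (x86-64 Linux, hypothetical helper, no error
handling, with a W^X-friendly mprotect flip at the end):

    
    
        #include <stdint.h>
        #include <string.h>
        #include <sys/mman.h>
        
        typedef void (*fn)(void);
        
        /* Emit "movabs rax, <target>; jmp rax" into a fresh page. */
        static fn make_direct_caller(void (*target)(void))
        {
            uint8_t *buf = mmap(0, 4096, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            uintptr_t addr = (uintptr_t)target;
            buf[0] = 0x48; buf[1] = 0xb8;    /* movabs rax, imm64 */
            memcpy(buf + 2, &addr, 8);
            buf[10] = 0xff; buf[11] = 0xe0;  /* jmp rax */
            mprotect(buf, 4096, PROT_READ | PROT_EXEC);
            return (fn)buf;
        }
    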

~~~
chrisseaton
We already have these - JIT compilers, as the other person commented. You can
even get them for C if you want.

------
exabrial
I didn't even know the PLT existed... Fascinating. I'm curious if other JITed
runtimes (like the JVM) offer the same optimizations.

~~~
electrum
The JVM could do this, but there is a lot of other overhead for JNI calls (by
virtue of the JNI API specification). Project Panama is working on improving
native interoperability:
[http://openjdk.java.net/projects/panama/](http://openjdk.java.net/projects/panama/)

~~~
MaxBarraclough
I was under the impression that although C->Java is slow, Java->C calls have
been fast for years, with the JIT generating full-speed calls -
[https://web.archive.org/web/20160304055443/http://nerds-cent...](https://web.archive.org/web/20160304055443/http://nerds-central.blogspot.co.uk/2012/04/extreme-jni-performance.html)

~~~
aardvark179
It's fast, but it depends on exactly what you're doing. There is always a bit
of work managing the stack and converting between ABIs, and JNI performs
quite poorly when dealing with anything other than primitive types.

The sort of thing Panama is adding is support for layouts of structures so
that you can get results to and from C quickly, along with things like
expressing vector algorithms in a way that allows good optimisation on
different CPUs.

~~~
MaxBarraclough
Sounds good. Better FFI is always good, and Java seems to be way behind .Net
when it comes to utilising SIMD.

------
stmw
Especially timely as Spectre/Meltdown mitigations put indirect calls and
branch prediction under more performance pressure -- a nice reminder of how
these things work.

------
xyproto
How important is code sharing between processes for regular Linux desktop or
server distros? Could JIT-compiled applications be a sensible default in the
future?

------
jorangreef
It would be great if V8 could follow LuaJIT in this.

------
wwarner
immutability ftw

------
cryptonector
Uh, whatever. This isn't about LuaJIT, nor any JIT, nor C, nor any high-level
programming language FFI.

    
    
        ELF != C
        C does not imply ELF
        ELF does not imply C
    

This is entirely and _only_ about dynamic linking vs. static linking. Selling
it as anything else is bunk and click-bait.

And yes, dlsym(3) should return the address of the PLT entry, not the direct
address
of the function: so that dlsym(foo, "bar") == bar (when you can link with
whatever object provides "bar").
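
A quick way to observe the identity property described here (using cos from
libm as a stand-in; link with -lm -ldl, and glibc wants _GNU_SOURCE for
RTLD_DEFAULT):

    
    
        #define _GNU_SOURCE
        #include <dlfcn.h>
        #include <math.h>
        #include <stdio.h>
        
        int main(void)
        {
            /* Compare the address the code sees with what dlsym returns. */
            void *p = dlsym(RTLD_DEFAULT, "cos");
            printf("dlsym: %p  direct: %p\n", p, (void *)cos);
        }
    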

------
lukego
LuaJIT is awesome. It's just amazing how much information is available to the
JIT. I've been using it for more than five years now and I still can't believe
that such a high-level dynamic language actually _is_ competitive with C for
system programming.

That Mike guy knows a thing or two, I reckon.

