Hacker News new | comments | show | ask | jobs | submit login
When FFI Function Calls Beat Native C (nullprogram.com)
242 points by signa11 4 months ago | hide | past | web | favorite | 75 comments

This is one of those things that sounds better than it really is.

The timing used here in this post is ns per call, which somewhat obfuscates what's going on. A better metric would be clock cycles of overhead, because the differences here are only around 2-3 clock cycles. In other words, the potential savings is going to be drowned out by jitter caused by things like the processor deciding to service interrupts, or your hyperthread partner doing work. If the relevant code is so hot that the cost is showing up in profiles, then very likely, the fact that a function call is going through the PLT versus being directly called is not going to be the thing that makes the most difference: it's the fact that you're making the function call in the first place.

Well, sure, but by the same measure and for the same reasons, "clock cycles" are not a deterministic measure of software performance, while "nanoseconds" are, by definition. The test can't easily control for cache effects (which are vastly bigger in aggregate than IRQ or SMT effects, FWIW), so it doesn't try and assumes everything averages out, and it does.

The reason the effect seems bigger than it is is that for all but the most trivial leaf functions, PLT indirection just isn't a big deal. Never has been, isn't now, never will be. It's a cool trick that a JIT can skip that step, but there are bigger fish to fry.

And the most trivial cases will be inlined more often than not.

Well FFI calls will never be inlined, by definition, and that's kind of what people are testing here.

This is comparing FFI calls against native C, and in idiomatic native C, short calls will be inlined

The author mentions that for real things the PLT and indirect call overhead is in the noise, which makes you wonder what the point of all of this is. Besides, the headline is click-bait, and the contents is obnoxious: it's all about static vs. dynamic linking vs. using dlopen()/dlsym(). There's nothing in TFA specific to C, for example, nor even to Lua, as any statically-linked program written in any high-level programming language with an FFI to C or assembly will evince this performance advantage over a dynamically-linked version of the same.

This doesn't seem right. Many languages have only a single way to do FFI (e.g., usually via calling a dynamic object) so the concept of "statically linked FFI" doesn't even apply to many languages.

The whole GOT/PLT thing is kind of C/C++ specific, since that's compiled into the calling code, and supports specific C ABI features like symbol interposition, and so is not a necessary component of calling a dynamic library. Especially JIT-compiled languages don't really need either of the features the GOT or PLT offer.

I think the main performance are really (1) dynamically linked C/C++ with GOT + PLT, (2) statically linked C/C++ and then (potentially) separate categories for the FFI mechanism for various high level languages, which may happen to coincide with (1) or (2) or have some other behavior.

Certainly the very slow results for some languages indicate that they are doing something other than a plain call via mechanisms (1) or (2).

The PLT is mostly a necessary evil of two requirements: lazy-loaded symbols and W^X memory. The former is necessary to reduce binary load times, and the latter is necessary for security (RWX segments in memory are ripe for abuse in the event of a vulnerability). A JIT trades off very differently: slow startup ("warm up" period) and often creates RWX segments, but gains speed advantages as seen here.

Of course, the GOT referred to by the PLT is itself is a bit of a security vulnerability, because it's usually a writable chunk of memory with a ton of frequently-called function pointers; corrupting it gives you relatively easy control over program flow. The new mitigation is "full RelRO", which resolves all GOT entries during binary load, then marks the GOT segment read-only. This gains in security, but trades off binary load time, and it is still going to be slower at runtime than the JIT. Full RelRO is usually off by default because of this added overhead.

Modifying the code segment to include direct function references (either during binary load or later, during runtime) is also not a great approach because it means being unable to share the code segment between processes, which would increase memory pressure. We could do what Windows does, which is to have static randomization (randomization of binary addresses is performed once per boot, not once per process), enabling code to be "statically relocated" lazily per boot, but this carries its own complexities, and opens you up to different security vulnerabilities.

There's no magic bullet here, unfortunately. There are just tradeoffs everywhere in many directions - between runtime speed, startup time, security and memory usage.

> Modifying the code segment to include direct function references is also not a great approach because it means being unable to share the code segment between processes

Realistically how big of a deal is this? I feel like it's overblown, no? Literally the only huge program I have that sneezes out a gazillion instances of itself is Chrome. On everything else there's usually only one and most a few instances.

If you work on a non-Windows platform, then many executables will rely on libraries provided by the package manager. Your code might not be using that library from within a different process, but that doesn't mean that somebody else isn't.

Yeah sorry I wasn't clear, I was talking about Windows since this is the approach Windows takes. And somehow (perhaps mistakenly) I assumed "executable" meant EXE, not DLL (since I refer to the two as "binaries"). Otherwise yeah I agree [1].

[1] https://news.ycombinator.com/item?id=17175361

Not all non-Windows platforms have package managers, or are UNIX clones.

All JIT:ted VM:s modify their code segments. So not a big deal at all. Code modification is done to implement dynamic dispatch efficiently. For example, suppose the compiler generates code for "a + b" If it can't infer the types of a and b, it generates a call to a stub function. When that stub function is called, it looks at the types of a and b and generates efficient dispatch code. For example, if a and b were ints it would generate code like: "if (type(a)==int && type(b)==int) { return fast_add(a, b); } else { slow_but_generic_add(a, b); }"

A sophisticated JIT could even keep track on how frequently the fast track is taken and generate a new stub based on that. It could even inline the stub into the calling function and recompile that one.

> So not a big deal at all.

Er...a big problem that all JITted VMs share. For one, not permitted on iOS, and memory pressure from dirty pages is much bigger than from clean pages, exacerbated by the fact that they're not shared.

I tested this overhead in one of my latest jits. With full PIC and PIE and ASLR (indirect calls via PLT via GOT), I got a 10x overhead over static no PIC, no PIE, all direct near calls. This was mostly observable on windows, as this was the only secure by default platform then. Linux was 10x faster. I.e. I got no observable win by replacing the calls with jitted near calls. Nowadays with PIE, PIC and ASLR maybe the jit solution is better again.

One big factor of direct calls vs indirect calls is the CPU i-cache, which preloads the direct call, but not the indirect call.

I believe the branch target buffer on modern x86 hardware predicts and preloads indirect calls too. This analysis claims that Core 2 and later chips predict indirect branches quite well: http://www.agner.org/optimize/microarchitecture.pdf

That's certainly true. But if you are consuming fewer BTB resources by making more direct calls, then the CPU can spend its BTBs accelerating things like predictable virtual function calls and the like.

Yes, this. And the BTB is always slower than a direct call.

The first call always takes a big hit, only then it will be cached and subsequent calls are about the same speed is direct calls, with 1-2 cycs overhead. The address rarely changes, so it will always be in the first level BTB. But still.

You could have an initial code segment which generates the dynamic mappings and then performs a system call.

The system call would map in a new non-shared page which the kernel has dynamically generated from the supplied mappings. The user-land never sees the page as writable, only as executable, the kernel generates only jump instructions, and limits this behaviour to specific pages determined when the executable is loaded.

That's what full Relocation Read-Only essentially does, right? You pay for this by having to bind all addresses at startup time rather than lazily as they're needed.

Static linking avoids this overhead right?

Yes, a statically-linked binary will have all internal references resolved by the linker. If PIE is disabled, the addresses will all be fixed, which makes exploiting such a binary much easier. Glibc has only just added support for static PIE executables (in Feb of this year), and I'm not sure what the implementation is - whether it relocates all the references in the binary (increased memory usage), or uses a PLT/GOT approach to support relocations (increased runtime overhead).

Depending on the linkage, function calls internal to the library can use ordinary PC-relative function calls instead of fully indirect function calls. So you get the low per-call cost, but pay a higher per-process cost in memory consumption.


The cost of indirect jumps isn't implied by position independent (PIE/PIC) code alone, but by PIE/PIC combined with jumps across separately loadable segments. For jumps _within_ a segment you can just use a direct relative call which is the fastest type of call and doesn't require any load-time fixup ever.

Of course then there is also the consideration of symbol interposition: if you want to support that even _within_ a statically linked thing (usually something other than the main executable), then you need a GOT-like thing, but not necessarily a PLT-like thing.

Does it necessarily use more memory though? Eg. code is usually read-only hence can be shared by multiple processes (for apps that create many invocations of the same executable).

You'd have to check every page to see if you already have a read-only copy in memory though, right?

Not quite, necessarily. It means another program can't share the same DLL as that one. So you might well end up using more memory that way.

But you can use LTO with more libs which may reduce your overall binary size.

Increases memory usage, disk space, patch requirements and load time due to lack of shared library caching

Yes, at the cost of a bloated binary size.

But what if you link in just the functions you need, vs having the enormity of Library X on your disk that the package manager put there because it can’t know what you actually use/need?

Sure, but generally Library X is already on your system because someone else needs it. So why not use it if it's there?

Maybe you want it in a container with minimum dependency

This reminds me of working with C++ and the Pebble SDK. The Pebble builds ELF binaries but doesn't have an ELF loader. It also uses position independent code. The SDK uses objtools to strip the code out of the ELF and the Pebble OS just copies the code into memory and jumps to it.

In C, this works fine. But in C++, gcc generates the GOT for virtual method calls but there is no ELF loader to fill it.

I accidentally found out if my code contains a destructor that makes an assignment to a static variable then no GOT is generated and virtual method calls work for the entire project! I'm not sure the logic behind that but this is the code that is needed:

    class SampleObject
        //! Destroys the underlying Pebble window
        virtual ~SampleObject() 
            // The assignment to a static variable makes the virtual methods work for subclasses
            // I have no idea why!
            dummy() = 0; 

        // Dummy static variable needed for virutal method pointers to work (see destructor)
        inline static int& dummy() { static int dummy = 0; return dummy; }  

Fortunately "native C" is not one, unchangeable thing. If the plt overhead really is important in your situation, you can still compile either static, or with "--fno-plt".

FWIW, I get very different results from inside my Linux VM (MacBook Pro, 2.9 GHz Core i7):

    jit: 1.433856 ns/call
    plt: 1.700611 ns/call
    ind: 1.404397 ns/call
The PLT is double-indirected (call->jmp->func), but the single-indirect call performs the same as the JIT. This backs up @tlb's note that modern x86 machines will predict indirect calls.

Now that I'm thinking about it a bit more - I recall that GCC 6.1 added "-fno-plt" for x64 targets, which replaces PLT calls with "call [func@got]" instead, eliminating the double-indirection. Unfortunately, I don't have a GCC 6.1 environment, so I'd be curious to see if someone can run the benchmark with "-fno-plt" and how it performs. A combination of that plus indirect call prediction should basically close the gap with the JIT approach.

It does close the gap.


  cc -shared -fPIC -Os -s -o empty.so empty.c
  cc -std=c99 -Wall -Wextra -O3 -g3 -fpie -pie -o benchmark benchmark.c ./empty.so -ldl

  jit: 1.541608 ns/call
  plt: 2.309939 ns/call
  ind: 1.540583 ns/call

  cc -shared -fPIC -Os -s -o empty.so empty.c
  cc -std=c99 -Wall -Wextra -O3 -g3 -fpie -fno-plt -pie -o benchmark benchmark.c ./empty.so -ldl

  jit: 1.541836 ns/call
  plt: 1.539239 ns/call
  ind: 1.542874 ns/call

See this thread from yesterday: https://news.ycombinator.com/item?id=17165448

tl;dr: getting the function address first with dlsym or using -fno-plt does close the gap.

Which CPU you got? Lots of MacBook Pros have 2.9 GHz i7, ranging from 2012 to 2018 models.

It's a MacBookPro14,3, which makes it an i7-7820HQ according to everymac.

A rule of thumb:If you see "faster than C" claims its either poor C or they use different code path/algorithm thats not equivalent to C code. A FFI that is faster than native, isn't FFI but some other mechanism that more direct - in this case bypassing the address of the function resolution. 'static inline' is of course even faster. The only C overhead i recall that is legit is libc startup time with glibc, and you can use faster libc alternatives.

Another rule of thumb: if you see "faster than C" and the article mentions LuaJIT, then it's correct.

TFA is not really about C either. Nor is it about Lua. It's about static vs. dynamic linking and nothing more. It's just irritating and click-baity.

Unless it is FORTRAN obviously

FORTRAN is only faster than C because it doesn't allow aliasing. Use the `restrict` keyword in C and you should get FORTRAN-like performance.

Fortran can give you much better speed at comparable code-complexity, as long as your problem is one which FORTRAN is optimized for, i.e. multidimensional array manipulation or numerical work.

How is the example in this article and the article is based on not equivalent to the C functionality?

Yes, the JIT is able to take advantage of runtime code generation to embed the final loaded address in the compiled assembly, which is something C _cannot do_ or at least does not do in its default configuration to support PIC, symbol interposition and other features.

JITs can have all the same support, but because of runtime code generation (which comes with its own costs) the final call-site can be more efficient.

So it's a reasonable rule of thumb, but it has plenty of exceptions. Other examples include JITs creating code optimized to the runtime-observed data distribution and JITs efficiently using specific hardware features of the current hardware rather than statically compiling in lowest-common-denominator or trying to use runtime dispatch.

Any micro performance gains from a function call will become rounding errors compared to the extra cost of any arithmetic operation on a dynamically-typed language.

It may be an interesting article for describing how JIT and some late-optimization techniques can produce optimized code given the right optimization preconditions, but I fail to see how the title is supposed to describe the contents of the article.

Last, but not least, performance-sensitive C code tends to be statically linked.

PLT and GOT are quite orthogonal to the C programming language, these are implementation details of dynamic linking and shared libraries.

Back in the 90s we would statically link performance-sensitive executables to avoid this overhead. It didn't require changing the language.

Damn, he beat me to the writeup.

> The most surprising result of the benchmark is that LuaJIT’s FFI is substantially faster than C.


I'm still amazed that there is code in-lining across lineages. I saw that on the last post and I still can't imagine all of the edge cases that are being handled by that. Are there many other languages that do that other than LuaJIT?

Rust is planning to enable inlining C functions in the near future. I believe this should work in both directions, and works because they are both compiled using LLVM.

The native code isn't being inlined into the compiled LUA code - it is still making a function call.

It's just that the way it happens to do the call is faster than C calling a function in a shared object.

Repeat after me: TFA is not about C nor Lua, it's about static vs. dynamic linking only.

Did you actually read it? It compares three cases: dynamic linking, calls through a function pointer and calls where the address to call is compiled directly into an immediate call instruction using a small JIT intended to emulate the LuaJIT case.

The benchmark didn't even include a static linking case! In general, however, I expect the static linking case to perform like the JIT case since both will almost always use a call with 32-bit relative immediate jump. I was a bit confused about why the article author chose to write a JIT rather than just use static linking, since I think they achieve the same thing here.

Time to bring back self modifying code?

A C program, that self modifies as it runs, to inline commonly called functions, or make direct jumps instead of indirect.

We already have these - JIT compilers as the other person commented. You can even get them for C if you want.

So you mean a JIT compiler?

In general, you want to avoid self modifying code. You gain a major security benefit by ensuring that each page in the address space is "Writeable" OR "Executable" but not both. This is often abbreviated to W^X.

Without W^X, many simple bugs can easily become bad exploits. You can inject some code into memory (using any program input) and then use the bug to cause it to be executed: immediate control.

If W^X is in place, then the only code that could be executed by an exploit is code that is already there. Techniques like ROP allow this to be exploited still, but it's much harder and needs more of security hole to start with.

I didn't even know the plt existed... Fascinating. I'm curious if other JITd runtimes (like the jvm) offer the same optimizations

The JVM could do this, but there is a lot of other overhead for JNI calls (by virtue of the JNI API specification). Project Panama is working on improving native interoperability: http://openjdk.java.net/projects/panama/

I was under the impression that although C->Java is slow, Java->C calls have been fast for years, with the JIT generating full-speed calls - https://web.archive.org/web/20160304055443/http://nerds-cent...

It's fast, but it depends on exactly what you're doing. There is always a bit of work managing the stack and converting between ABIs, and JNI is quite poor in performance when dealing with anything other than primitive types.

The sort of thing Panama is adding is support for layouts of structures so that you can get results to and from C quickly, along with things like expressing vector algorithms in a way that allows good optimisation on different CPUs.

Sounds good. Better FFI is always good, and Java seems to be way behind .Net when it comes to utilising SIMD.

Nice! I didn't know this existed, but this is great.

THis is essentially free optimization for every JITed runtime that is capable of inlining FFI calls because you get the real entry point from dlsym() and there is no sane way to get address of the .plt indirected entry point. On the other hand JVM AFAIK is not capable of inlining FFI calls (at the least because there is no API for direct FFI calls).

Especially timely as Spectre/Meltdown mitigations put indirect calls and branch predictions under more performance pressure -- nice reminder of how these things work.

How important is code sharing between processes for a regular Linux desktop or server distros? Could JIT compiled applications be a sensible default in the future?

It would be great if V8 could follow LuaJIT in this.

immutability ftw

Uh, whatever. This isn't about LuaJIT, nor any JIT, nor C, nor any high-level programming language FFI.

    ELF != C
    C does not imply ELF
    ELF does not imply C
This is about entirely and only dynamic linking vs. static linking. Selling it as anything else is bunk and click-bait.

And yes, dlsym(3) should return the address of the PLT, not the direct address of the function: so that dlsym(foo, "bar") == bar (when you can link with whatever object provides "bar").

LuaJIT is awesome. It's just amazing how much information is available to the JIT. I've been using it for more than five years now and I still can't believe that such a high-level dynamic language actually is competitive with C for system programming.

That Mike guy knows a thing or two I reckon.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact