Hacker News new | past | comments | ask | show | jobs | submit login
Comparing the C FFI overhead in various programming languages (github.com/dyu)
141 points by based2 on May 26, 2018 | hide | past | favorite | 83 comments

For those wondering why luajit is faster than C/C++/Rust, that's probably because it does direct calls to the function, while C/C++/Rust go through the PLT (procedure linkage table).

I don't have luajit to validate this, but changing the C test to dlsym() the symbol first and use that in the loop instead makes the test take 727ms instead of 896ms when going through the PLT on my machine.

Edit: confirmed after installing luajit, got the same time as the modified C test.

Thank you. I ran both through lldb to see why luajit was faster and was confused how luajit was faster despite doing quite a bit more work:

    luajit loop:

    0x12a0effe0: movl   %eax, %ebp
    0x12a0effe2: movl   %ebp, %edi
    0x12a0effe4: callq  *%rbx
    0x12a0effe6: movsd  0x8(%rsp), %xmm0          ; xmm0 = mem[0],zero
    0x12a0effec: xorps  %xmm7, %xmm7
    0x12a0effef: cvtsi2sdl %eax, %xmm7
    0x12a0efff3: ucomisd %xmm7, %xmm0
    0x12a0efff7: ja     0x12a0effe0

    c_hello loop:

    0x100000ea0 <+80>:  movl   %ebx, %edi
    0x100000ea2 <+82>:  callq  0x100000ee6               ; symbol stub for: plusone
    0x100000ea7 <+87>:  movl   %eax, %ebx
    0x100000ea9 <+89>:  cmpl   %r14d, %ebx
    0x100000eac <+92>:  jl     0x100000ea0               ; <+80>
    0x100000ee6 <+0>: jmpq   *0x134(%rip)

I'm still a bit surprised that that extra jumpq is more expensive than all those extra instructions luajit is executing.

EDIT: For those curious, Lua handles all numbers as floating-point; there are no ints in Lua. That's what all the extra instructions are for. Interestingly it looks like LuaJIT is storing the loop variable x as int (in the eax register) and just converting it to floating-point for comparison against "count" each loop. Interesting bending of the rules there.

So, running under perf stat tells me the luajit version runs very much more instructions (5M vs. 3M with my patched C version), but they run in the same number of cycles (2.5M), with the luajit version using more instructions per cycle. The unpatched C version uses 3.5M instructions, and 3.5M cycles, which is half the number of instructions per cycle compared to the luajit version.

How did you patch it? Just compile with "-fno-plt" or something more?

  diff --git a/hello.c b/hello.c
  index 725d5ec..b7b1745 100644
  --- a/hello.c
  +++ b/hello.c
  @@ -1,3 +1,5 @@
  +#define _GNU_SOURCE
  +#include <dlfcn.h>
   #include <stdio.h>
   #include "newplus/plus.h"

  @@ -7,8 +9,9 @@ void run(int count)
       long long end;
       int x = 0;
  +    int (*_plusone)(int) = dlsym(RTLD_DEFAULT, "plusone");
       while (x < count)
  -        x = plusone(x);
  +        x = _plusone(x);
       printf("%lld\n", current_timestamp() - start);
And you need to add -ldl to the link flags.

Edit: but yeah, -fno-plt works too.

> For those curious, Lua handles all numbers as floating-point; there are no ints in Lua.

Same as in JavaScript. I'm surprised by that. On the surface, having just floats and no ints sounds like a pretty dumb idea. What am I missing? Is there a reasonable rationale for such choice?

Well, doubles can represent all the numbers a 52 bit int can, so there's not a huge reason to not use doubles for whole numbers. It also means that new programmers don't get confused when they do 1/3 and get 1, and the language can be significantly simpler when there's only one number type. Also, when you try to store numbers bigger than 2^52, doubles degrade relatively gracefully, while ints wrap around.

There are definitely downsides too, but it's a trade-off, not an exclusively bad decision.

> It also means that new programmers don't get confused when they do 1/3 and get 1,

I'm not a new programmer (nearly 2 decades of professional programming experience), and I would be very confused if 1/3 using integer math returned 1.

Haha, that's what I get for writing comments late at night. I meant something like 4/3.

It was a perfect example of how integer math is not intuitive :-)

> Well, doubles can represent all the numbers a 52 bit int can, so there's not a huge reason to not use doubles for whole numbers.

Not exactly.

> It also means that new programmers don't get confused when they do 1/3 and get 1

Is this (actually, getting 0 from 1/3) really more surprising than doing 2+2 and getting 3.9999999999999?

Floats are mostly inexact and introduce errors with almost every operation. It's something you have to constantly keep in mind when doing any kind of math with meaningful comsequences. I for one think that exact types are both more useful for many practical cases, as well as significantly simpler in use.

> > Well, doubles can represent all the numbers a 52 bit int can, so there's not a huge reason to not use doubles for whole numbers.

> Not exactly.

Yes, exactly [0]. The 52 bit significand gives LuaJIT and other NaN-tagged languages exact integer representations within that range.

[0]: https://en.wikipedia.org/wiki/Floating-point_arithmetic#Inte...

> Is this (actually, getting 0 from 1/3) really more surprising than doing 2+2 and getting 3.9999999999999?

No, so it's a good thing doubles never do anything remotely like that

All floating point integer operations with values smaller than 252 are exact. Most floating point operations with small values with few significant figures are precise.

Fraction, and there are fractions like 1/10 that cannot be represented precisely in base two, much like 1/3 cannot be represented precisely in base 10.

Beyond that, there's the usual issue of pushing out the least significant bits, but that's insurmountable while working in finite precision.

I think you meant 2^52, not 252.

correct. thanks.

There are ints in Lua 5.3, and ffi ints in luajit.

But generally, its simpler, why should people care about representations. Lua also let you compike as int only I think.

> why should people care about representations

Because ints are exact, while floats introduce errors with nearly every operation.

Representations are important in programming - different ones have different consequences.

Javascript is likely to get 64 bit integers soon. It's busy making its way through TC39, anyways.

Lua 5.3 introduced actual 64 bit integers but those probably will never get in LuaJIT because its deeply married to the NaN tagging trick, which limits unboxed values to aprox 48 bits.

You're thinking of the 47 bit address space limitation.

LuaJIT represents the full range of valid 64 bit floats faithfully, giving a 52 bit signed int.

Either way, it cant represent 64 bit integers unless they are boxed (which is very slow)

I believe this is equivalent to adding the `nonlazybind` attribute to an extern declaration - see http://llvm.org/docs/LangRef.html#function-attributes

> This attribute suppresses lazy symbol binding for the function. This may make calls to the function faster, at the cost of extra program startup time if the function is not called during program startup.

I didn't find the zig issue for this feature so I created it: https://github.com/ziglang/zig/issues/1024

If we did the "Static control flow analysis" idea then zig would bench the same as the luajit/modified c test.

> I don't have luajit to validate this, but changing the C test to dlsym() the symbol first and use that in the loop instead makes the test take 727ms instead of 896ms when going through the PLT on my machine.

Isn't this what every C or C++ library that wraps another with dlopen / dlsym does ?

This is what Visual C++'s __declspec(dllimport) is for. :-) On Windows, GCC also seems to handle it correctly, along with its own synonym __attribute__((dllimport)).

FFI topic, yay!

A year ago, I made a very nice FFI for TXR Lisp.

Example: creating a window with SDL, GTK, X11 and Win32, from scratch: http://nongnu.org/txr/rosetta-solutions-main.html#Window%20c...

(The Win32 example is an almost expression-for-expression translation of the C sample from MSDN called "Your First Windows Program". It re-creates all needed Win32 data types using the FFI macro language.)

Unix Stackexchange accepted answer: decoding IP datagrams from tcpdump using TXR Lisp FFI: https://unix.stackexchange.com/a/379759/16369

The FFI type system has a richly nuanced declarative mechanism to specify data passing semantics (think: who owns what pointer, who must free what) at every level of nesting.

It has bitfields, unions, enums (typable to any integral type). Supports alignment and packing in structs and has special integral types that encode in little or big endian. Unicode, ASCII and UTF-8 string encoding; understands null termination as well as character arrays that are not null terminated. Bitfields can be signed and unsigned and their underlying cell type can be specified.

Pointer types can be tagged with symbols for a measure of type safety, to catch situations when a widget is passed to an API that expects a doohickey and such.


Because Wren's a bytecode interpreted dynamic language much of the overall time is probably spent just executing the surrounding benchmark code. I.e. in:

    while (x < count) {
        x = FFI.plusone(x)
It probably spends most of the total elapsed time on:

    while (x < count) {
        x = ...
It would be worth running a similar benchmark but with just the FFI part removed and then subtract that from the time. (And do the same thing for the other languages too, of course.)

Or just unroll the loop a few times.

I was thinking about suggesting that too. :)

It is like decade after Mike Pall show the world LuaJit and FFI, he is still schooling everyone.

It is very sad he left Reddit, HN and almost every other online communities along with LuaJIT.

Edit: Sorry I was wrong. Looks like he is still working on LuaJIT!

The Python "cffi" module (one of the many possible ways to do FFI in Python) works in a similar way: https://cffi.readthedocs.io/en/latest/overview.html#simple-e...

Thanks for the link. From this hn post, I'm baffled at how any other language could beat c at c ffi. (Hopefully your article explains how. Reading it now).

Am I getting this right? Is LuaJIT inlining linked assembly into the program at runtime? That's amazing.

Not sure if LuaJIT will do that, but it's not doing that here. LuaJIT is calling the library just like normal. See my and glandium's comment thread.

TL;DR: LuaJIT's trick is that it avoids the linkage table.

No, it is not. It still makes a call to the method in the shared object, it's just the way this call is made is more efficient than C.

"Inlinig" compiled assembly from a shared object is basically impossible because you can't just move machine code around arbitrary since it may access arbitrary additional code or data at relative offsets. If you move it, you'll break it.

That was my takeaway, but not sure to what extent. You could then do peephole optimization after. You'd lose out on the memory saving from shared objects, but that's a trade-off I'm cool with.

Inlining assembly at runtime is what JITs do.

The go results are quite surprisingly atrocious...

I’m not a go developer: is this because of the green threads usage?

Yeah, switching from the "Go stack" to the "C stack" is quite expensive. See this post from Dave Cheney: https://dave.cheney.net/2016/01/18/cgo-is-not-go

I think it is because of the way go does the stack which is unusual. Perhaps that is done to facilitate green threads. It is too bad because Go has many properties that are good for gamedev - AOT compiled exes, some exposure to pointers (any type can be a value or reference at your whim), low latency GC, and good SDL2 and opengl bindings. But the high cost to call into those bindings is a downside.

There is also a philosophy among the Go maintainers that you shouldn't call into C unless you have to. Unfortunately you can't draw pixels efficiently (at all?) in pure go, so...

> I think it is because of the way go does the stack which is unusual. Perhaps that is done to facilitate green threads.

Kinda. C's stacks are big (1~8MB default depending on the system IIRC), and while that's mostly vmem Go still doesn't want to pay for a full C stack per goroutine, plus since it has a runtime it can make different assumption and grow the stack dynamically if necessary.

So rather than set up a C stack per goroutine, Go sets up its own stack (initially 8K, reduced to 2K in 1.4) and if it hits a stack overflow it copies the existing stack to a new one (similar to hitting the limit on a vector).

But C can't handle that, it expects enough stacks, and it's got no idea where the stack ends or how to resize it (the underlying platform just faults the program on stack overflow), so you can't just jump to C code from Go code, you need an actual C stack for things to work, and that makes every C call from Go very expensive.

Rust used to do that as well, but decided to leave it behind as it went lower level and fast C interop was more important than builtin green threads.

Erlang does something similar to Go (by default a process has ~2.6K allocated, of which ~1.8K is for the process's heap and stack) but the FFI is more involved (and the base language slower) so you can't just go "I'll just import cgo and call that library" and then everybody dies.

Really? Thats not my understanding, its much smarter than that. Goroutines steal memory from the current pthread machine stack, that is, the machine stack. The problem calling C from a goroutine is that whilst you have a real C stack right there .. other goroutines expect to be able to steal memory from it, and once you call C you don't know how much stack C is going to use. So whilst C is running, goroutines cannot allocate stack memory, which means they cannot call subroutines. The only way to fix that is to give the C function being called its own stack.

The problem you're going to have is that if 10K goroutines all call PCRE you need 10K stacks, because all the calls are (potentially) concurrent.

What makes go work is that the compiler calculates how much local memory a goroutine requires and so after a serialised bump of the stack pointer the routine cannot run out of stack. Serialising the bump between competing goroutines is extremely fast (no locks required). Deallocation is trickier, I think go uses copy collection, i.e. it copies the stack when it runs out of address space on the stack, NOT because its out of memory (the OS can always add to the end), but because the copying compacts the stack by not copying unused blocks. Its a stock standard garbage collection algorithm .. used in a novel way.

The core of Go is very smart. Pity about the rest of the language.

> Really? Thats not my understanding, its much smarter than that. Goroutines steal memory from the current pthread machine stack, that is, the machine stack.

There is no "machine stack", and yes in the details it tries to set up and memoise C stacks, but it still need to switch out the stack and copy a bunch of crap onto there, and that's expensive.

This gives me an idea on how to improve a cgo call: have a pool of C-stacks, preallocated and managed by Go.

brb time to bench

There is one technically, I didn't mention it because I didn't want to dive into the minutiae of the system.

They're not surprising. cgo is expensive as hell (and changes the semantics of the entire language).

Possibly gccgo pays less heavy a price?

I'm a Python programmer, so what I'm most interested in are Python and PyPy. Both with ctypes and with the cffi module.

Here looks like a GitHub gist you could use to do it for cffi vs Cython in Python, and PyPy.

< https://gist.github.com/brentp/7e173302952b210aeaf3 >

Thanks! Yes, it does.

ctypes is pretty slow in my experience. Cython is a nice alternative. Native speed and datatypes where you need them, plain old python anywhere else.

Yes, my experience with CPython is that ctypes is slower than writing (or generating, as is the case with Cython) the Python/C extension.

However, one advantage with ctypes is that it can work with other Python implementations, like PyPy.

Interesting article with interesting results!

I see there is a pull request to test the Erlang BEAM (via Elixir) with seemingly similar results to Go. To me that's pretty interesting given that Erlang's BEAM is a Virtual Machine and Go is not.

Additionally, thanks for the pointer to Tup, sounds like a very interesting Make alternative for some of my larger projects.

Tup is a gem. I use both tup and gn+ninja.

Tup is just a generic build system that can apply to any input whereas gn is focused only on c/c++.

Is there a reason Dart is so badly performing here? And wouldn't that have a big impact on Google's new Flutter project that they're building an OS around? I'd assume Flutter would need to be calling into FFI for each access to the Chromium graphics stack that it is built on.

Good question. Hopefully the google folks have plans to improve their dart FFI. I mean, if Mike Pall can do it ... :-)

I'd be interested in seeing Ruby-FFI, CPython, CFFI (Common Lisp) and C#'s DllImport.

I was curious about Common Lisp's CFFI also, so I tried it out. [1]

On my machine, the C version prints between 6855 and 6910.

Using SBCL (cffi-bench:run 2000000000) prints between 9185 and 9205.

Clisp is significantly slower, and (cffi-bench:run 200000000) is already at 55500, compared to ~100 in SBCL.

CCL crashes with a segmentation fault, and I'm still debugging.

Tomorrow I'll see about creating a build script and opening a pull request to the main repo.

[1] https://github.com/jl2/ffi-overhead/tree/master/common-lisp/...

EDIT: I recompiled the .so and C program with -O3, and it now prints between 4740-4780. SBCL using the -O3 library prints 7935 and 8935.

If we use the C performance ratio between the machines to adjust machine to machine comparisons. SBCL would be around 1600 in the original benchmark, between D and Java.

I built out a ruby-ffi test immediately upon seeing this. I want to put a little more effort in to it (I need to run some errands and spent all of 5 minutes on this), but... 2 samples/runs, 500M calls: 51050, 50853. No surprises there, really.

edit/ Ran this on my desktop (gaming rig) i7-4790k, 16GB RAM, gcc 7.3.1 20180406, Arch on Windows Subsystem for Linux. I have a machine closer in specs to the original authors that I'll test it against later today (hopefully).

This is very cool, but note that it is a microbenchmark comparing the overhead of calling plus(int, int). This is a very specific case of FFI that is easy and simple.

For my work on the Oil shell, I care more about moving strings back and forth across the VM boundary (not to mention bigger objects like records and arrays of records). There are many more design choices in that case, and I suspect the results will also look different.

Haskell has one of my favourite C FFIs, and there are also libraries that let you seamlessly mix C into Haskell expressions:


C++ and Felix obviously have the lowest overhead .. namely zero in both cases :-)

This seems to partially explain why Go programmers prefer pure-Go libraries.

Personally, the overhead doesn't bother me that much; you can manage it by moving more work into C, cutting down on the number of FFI calls. What steers me away from cgo is that it complicates the build/release process and makes it harder to write cross-platform code. Go makes these things painless, so there's a strong incentive to not stray outside the Go toolchain.

Wow, why is the overhead so high for Go? Something to do with the garbage collector maybe?

Go does not use the C stack, so calling into C requires setting up a C stack, trampolining from the Go context to the C context, making the call and coming back.

As a result, while cgo is easy to use:

1. calling a C function from Go is ~100 times more expensive than calling a Go function from Go — or was a few years back anyway this may have improved a bit since: https://www.cockroachlabs.com/blog/the-cost-and-complexity-o...

2. and using cgo has semantics impact on the entire program: https://dave.cheney.net/2016/01/18/cgo-is-not-go

This is why Go libraries generally don't wrap native libraries and go software only calls into C when they really have no other choice, and/or have very "coalesced" APIs available (a single function call doing a lot of work whereas languages like Python are happy with making lots of very small calls to C).

And why Go itself does not use the platform-provided standard libraries and performs syscalls directly, breaking any time those syscalls change[0] — which is not that rare because linux is the only platforms where raw syscalls are the kernel's API, on most other systems the libc is the platform's API. And even on linux it breaks from time to time[1] because Go doesn't want to link to libc but still wants to benefit from vdso[2].

[0] https://github.com/golang/go/issues/16606

[1] https://marcan.st/2017/12/debugging-an-evil-go-runtime-bug/

[2] https://twitter.com/bcantrill/status/774290166164754433?lang...

> Thus, cgo has an overhead because it performs a stack switch, not thread switch.

> It saves and restores all registers when C function is called, while it's not required when Go function or assembly function is called.


Would be interesting to know which ffi backend library is used for each.

libffi, dyncall, self-written, a jit library, llvm, ...

In my tests these varied wildly. It has nothing to do with the language using these.

I believe most of these if not all of the languages have a native C compatibility layer. Java has JNI. Zig and Rust inherently use llvm. Go has cgo. nim compiles to C. luajit has its own FFI notation. Etc.

This would be worth mentioning on the project issue page, methinks. The author might not want to do the work but it's a very good thing to keep in mind for anybody looking at the project and making judgement calls off of it.


I didn't do so because it wasn't my idea. I certainly don't mind though.

Interesting that d and nim get better performance than c itself??

I woudl expect d, rust, nim, c, and zig, to all have equivelant performance since they use the same calling convention as c...

I believe you're misreading the numbers. They're the "elapsed time in millis", so lower is better. The only result better than c itself seems to be luajit, which is quite impressive. I wonder how it achieves that.

Ahhh oops you're right. Point still stands, though -- I wonder why those languages should have different results?

They're both GC'd languages so the FFI may have some additional work/bookkeeping to do around the native calls, which C or Rust does not.

OP poses a good question and I don't think it's a simple case of GC vs. no GC. In Nim's case it might have something to do with not using the C int types (PR already open to fix this: https://github.com/dyu/ffi-overhead/pull/6)

Higher-level languages know more about the context and the stack. You also need a good inliner. You can do fast calls when you know exactly what registers you need to save and which not. With pure generic C you need to save most, with a more high-level language only those really needed, or even none, when you use the stack instead.

Common-Lisp e.g. usually leads the ffi benchmarks.

Hmm? Not really.

In any case C or higher level language the caller knows what registers are in used and need saving (to the extent they intersect with the ABIs clobbered register set).

In any case the caller usually doesn't know what registers the callee will use and so has to rely on the ABI.

Of course if things can be inlined its a different story, but this is about FFI, i.e., no inlining.


But several VM's/ABI's don't clobber the regs by themselves (i.e. all locals are volatile), thus the external C func can use all the regs they want. Hence the intro doesn't need to save much. E.g. this also helps with GC pauses to scan registers, not needed there. We had great success with such an ABI (potion, a jitted lua variant).

Native compilers with a better ABI than C has also advantages: Fortran (no need to save the return ptr), Common Lisp (mult. return values, can tune the FFI much better than usual ffi's), Haskell.

The biggest problem is still callback, calling functions from the FFI back to us. This is costly. The FFI decl. needs to know that. And closures (local upvalues on the stack) are only properly supported with Haible's avcall.

And some newer VM's solved the expensive FFI or C extension problem by redoing the external C part, such as pypy or Graal. They convert the called C function to their own format, and as such can inline it or know the used registers.

I'm not following you. Clearly in a pure JIT environment or with custom calling conventions you can do better than a typical "blind" C call restricted by the platform ABI - but we're talking about FFI calling into a "plain" shared object, right?

Such functions obey the platform ABI, so the caller needs to save exactly every callee-clobbered register they are using.

Also, if the external C function wants to use any callee-saved registers, it needs to save them. It doesn't know that it's being called by a VM that allows FFI functions to clobber everything (if that's what you're saying).

So yeah, there is all kinds of great stuff you can do with respect to function calls within a VM, or when you are working outside the platform C ABI (e.g., if you invented your own ABI), but it isn't clear how that relates to making fast FFI calls from higher level languages.

Nice to see Julia styling in these benchmarks!

yay I made front page (finally?).

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact