
Comparing the C FFI overhead in various programming languages - based2
https://github.com/dyu/ffi-overhead
======
glandium
For those wondering why luajit is faster than C/C++/Rust, that's probably
because it does direct calls to the function, while C/C++/Rust go through the
PLT (procedure linkage table).

I don't have luajit to validate this, but changing the C test to dlsym() the
symbol first and use that in the loop instead makes the test take 727ms
instead of 896ms when going through the PLT on my machine.

Edit: confirmed after installing luajit, got the same time as the modified C
test.

~~~
fpgaminer
Thank you. I ran both through lldb to see why luajit was faster and was
confused how luajit was faster despite doing quite a bit more work:

    
    
        luajit loop:
    
        0x12a0effe0: movl   %eax, %ebp
        0x12a0effe2: movl   %ebp, %edi
        0x12a0effe4: callq  *%rbx
        0x12a0effe6: movsd  0x8(%rsp), %xmm0          ; xmm0 = mem[0],zero
        0x12a0effec: xorps  %xmm7, %xmm7
        0x12a0effef: cvtsi2sdl %eax, %xmm7
        0x12a0efff3: ucomisd %xmm7, %xmm0
        0x12a0efff7: ja     0x12a0effe0
    
        c_hello loop:
    
        0x100000ea0 <+80>:  movl   %ebx, %edi
        0x100000ea2 <+82>:  callq  0x100000ee6               ; symbol stub for: plusone
        0x100000ea7 <+87>:  movl   %eax, %ebx
        0x100000ea9 <+89>:  cmpl   %r14d, %ebx
        0x100000eac <+92>:  jl     0x100000ea0               ; <+80>
        
        
        0x100000ee6 <+0>: jmpq   *0x134(%rip)
    
    

I'm still a bit surprised that that extra jmpq is more expensive than all
those extra instructions luajit is executing.

EDIT: For those curious, Lua handles all numbers as floating-point; there are
no ints in Lua. That's what all the extra instructions are for. Interestingly
it looks like LuaJIT is storing the loop variable x as int (in the eax
register) and just converting it to floating-point for comparison against
"count" each loop. Interesting bending of the rules there.

~~~
TeMPOraL
> _For those curious, Lua handles all numbers as floating-point; there are no
> ints in Lua._

Same as in JavaScript. I'm surprised by that. On the surface, having just
floats and no ints sounds like a pretty dumb idea. What am I missing? Is there
a reasonable rationale for such choice?

~~~
mort96
Well, doubles can represent all the numbers a 52 bit int can, so there's not a
huge reason to not use doubles for whole numbers. It also means that new
programmers don't get confused when they do 1/3 and get 1, and the language
can be significantly simpler when there's only one number type. Also, when you
try to store numbers bigger than 2^52, doubles degrade relatively gracefully,
while ints wrap around.

There are definitely downsides too, but it's a trade-off, not an exclusively
bad decision.

~~~
TeMPOraL
> _Well, doubles can represent all the numbers a 52 bit int can, so there's
> not a huge reason to not use doubles for whole numbers._

Not exactly.

> _It also means that new programmers don't get confused when they do 1/3 and
> get 1_

Is this (actually, getting 0 from 1/3) really more surprising than doing 2+2
and getting 3.9999999999999?

Floats are mostly inexact and introduce errors with almost every operation.
It's something you have to constantly keep in mind when doing any kind of math
with meaningful consequences. I for one think that exact types are both more
useful for _many_ practical cases and significantly simpler in use.

~~~
ori_b
All floating point integer operations with values smaller than 252 are exact.
Most floating point operations on small values with few significant figures
are precise.

Fractions are another story: there are fractions like 1/10 that cannot be
represented precisely in base two, much like 1/3 cannot be represented
precisely in base 10.

Beyond that, there's the usual issue of pushing out the least significant
bits, but that's insurmountable while working in finite precision.

~~~
nneonneo
I think you meant 2^52, not 252.

~~~
ori_b
correct. thanks.

------
kazinator
FFI topic, yay!

A year ago, I made a very nice FFI for TXR Lisp.

Example: creating a window with SDL, GTK, X11 and Win32, from scratch:
[http://nongnu.org/txr/rosetta-solutions-main.html#Window%20c...](http://nongnu.org/txr/rosetta-solutions-main.html#Window%20creation)

(The Win32 example is an almost expression-for-expression translation of the C
sample from MSDN called "Your First Windows Program". It re-creates all needed
Win32 data types using the FFI macro language.)

Unix Stackexchange accepted answer: decoding IP datagrams from tcpdump using
TXR Lisp FFI:
[https://unix.stackexchange.com/a/379759/16369](https://unix.stackexchange.com/a/379759/16369)

The FFI type system has a richly nuanced declarative mechanism to specify data
passing semantics (think: who owns what pointer, who must free what) at every
level of nesting.

It has bitfields, unions, enums (typable to any integral type). Supports
alignment and packing in structs and has special integral types that encode in
little or big endian. Unicode, ASCII and UTF-8 string encoding; understands
null termination as well as character arrays that are not null terminated.
Bitfields can be signed and unsigned and their underlying cell type can be
specified.

Pointer types can be tagged with symbols for a measure of type safety, to
catch situations when a widget is passed to an API that expects a doohickey
and such.

[http://nongnu.org/txr/txr-manpage.html#N-0275762A](http://nongnu.org/txr/txr-manpage.html#N-0275762A)

------
munificent
Because Wren's a bytecode-interpreted dynamic language, much of the overall
time is probably spent just executing the surrounding benchmark code. I.e. in:

    
    
        FFI.start()
        while (x < count) {
            x = FFI.plusone(x)
        }
        FFI.stop()
    

It probably spends most of the total elapsed time on:

    
    
        FFI.start()
        while (x < count) {
            x = ...
        }
        FFI.stop()
    

It would be worth running a similar benchmark but with just the FFI part
removed and then subtract that from the time. (And do the same thing for the
other languages too, of course.)

~~~
yourMadness
Or just unroll the loop a few times.

~~~
munificent
I was thinking about suggesting that too. :)

------
based2
[http://luajit.org/ext_ffi.html](http://luajit.org/ext_ffi.html)

~~~
ndesaulniers
Thanks for the link. From this HN post, I'm baffled at how any other language
could beat C at C FFI. (Hopefully your article explains how. Reading it now).

~~~
gravypod
Am I getting this right? Is LuaJIT inlining linked assembly into the program
at runtime? That's amazing.

~~~
fpgaminer
Not sure if LuaJIT will do that, but it's not doing that here. LuaJIT is
calling the library just like normal. See my and glandium's comment thread.

TL;DR: LuaJIT's trick is that it avoids the linkage table.

------
ComputerGuru
The go results are quite surprisingly atrocious...

I’m not a go developer: is this because of the green threads usage?

~~~
gameswithgo
I think it is because of the way go does the stack, which is unusual. Perhaps
that is done to facilitate green threads. It is too bad, because Go has many
properties that are good for gamedev: AOT compiled exes, some exposure to
pointers (any type can be a value or reference at your whim), low latency GC,
and good SDL2 and opengl bindings. But the high cost to call into those
bindings is a downside.

There is also a philosophy among the Go maintainers that you shouldn't call
into C unless you have to. Unfortunately you can't draw pixels efficiently (at
all?) in pure go, so...

~~~
masklinn
> I think it is because of the way go does the stack which is unusual. Perhaps
> that is done to facilitate green threads.

Kinda. C's stacks are big (1~8MB default depending on the system IIRC), and
while that's mostly vmem Go still doesn't want to pay for a full C stack per
goroutine, plus since it has a runtime it can make different assumptions and
grow the stack dynamically if necessary.

So rather than set up a C stack per goroutine, Go sets up its own stack
(initially 8K, reduced to 2K in 1.4) and if it hits a stack overflow it copies
the existing stack to a new one (similar to hitting the limit on a vector).

But C can't handle that: it expects enough stack, and it's got no idea where
the stack ends or how to resize it (the underlying platform just faults the
program on stack overflow). So you can't just jump to C code from Go code; you
need an actual C stack for things to work, and that makes every C call from Go
very expensive.

Rust used to do that as well, but decided to leave it behind as it went lower
level and fast C interop was more important than builtin green threads.

Erlang does something similar to Go (by default a process has ~2.6K allocated,
of which ~1.8K is for the process's heap and stack) but the FFI is more
involved (and the base language slower) so you can't just go "I'll just import
cgo and call that library" and then everybody dies.

~~~
Yttrill
Really? That's not my understanding; it's much smarter than that. Goroutines
steal memory from the current pthread machine stack, that is, the machine
stack. The problem calling C from a goroutine is that whilst you have a real C
stack right there .. _other goroutines_ expect to be able to steal memory from
it, and once you call C you don't know how much stack C is going to use. So
whilst C is running, goroutines cannot allocate stack memory, which means they
cannot call subroutines. The only way to fix that is to give the C function
being called its own stack.

The problem you're going to have is that if 10K goroutines all call PCRE you
need 10K stacks, because all the calls are (potentially) concurrent.

What makes go work is that the compiler calculates how much local memory a
goroutine requires and so after a serialised bump of the stack pointer the
routine cannot run out of stack. Serialising the bump between competing
goroutines is extremely fast (no locks required). Deallocation is trickier; I
think go uses copy collection, i.e. it copies the stack when it runs out of
address space on the stack, NOT because it's out of memory (the OS can always
add to the end), but because the copying compacts the stack by not copying
unused blocks. It's a stock-standard garbage collection algorithm .. used in a
novel way.

The core of Go is very smart. Pity about the rest of the language.

~~~
masklinn
> Really? Thats not my understanding, its much smarter than that. Goroutines
> steal memory from the current pthread machine stack, that is, the machine
> stack.

There is no "machine stack", and yes, in the details it tries to set up and
memoise C stacks, but it still needs to switch out the stack and copy a bunch
of crap onto there, and that's expensive.

------
eesmith
I'm a Python programmer, so what I'm most interested in are Python and PyPy.
Both with ctypes and with the cffi module.

~~~
mlthoughts2018
Here looks like a GitHub gist you could use to do it for cffi vs Cython in
Python, and PyPy.

[https://gist.github.com/brentp/7e173302952b210aeaf3](https://gist.github.com/brentp/7e173302952b210aeaf3)

~~~
eesmith
Thanks! Yes, it does.

------
lkurusa
Interesting article with interesting results!

I see there is a pull request to test the Erlang BEAM (via Elixir) with
seemingly similar results to Go. To me that's pretty interesting given that
Erlang's BEAM is a Virtual Machine and Go is not.

Additionally, thanks for the pointer to Tup, sounds like a very interesting
Make alternative for some of my larger projects.

~~~
dyu-
Tup is a gem. I use both tup and gn+ninja.

Tup is just a generic build system that can apply to any input whereas gn is
focused only on c/c++.

------
RyanZAG
Is there a reason Dart performs so badly here? And wouldn't that have a big
impact on Google's new Flutter project that they're building an OS around? I'd
assume Flutter would need to be calling into FFI for each access to the
Chromium graphics stack that it is built on.

~~~
dyu-
Good question. Hopefully the google folks have plans to improve their dart
FFI. I mean, if Mike Pall can do it ... :-)

------
vortico
I'd be interested in seeing Ruby-FFI, CPython, CFFI (Common Lisp) and C#'s
DllImport.

~~~
jlarocco
I was curious about Common Lisp's CFFI also, so I tried it out. [1]

On my machine, the C version prints between 6855 and 6910.

Using SBCL (cffi-bench:run 2000000000) prints between 9185 and 9205.

Clisp is significantly slower, and (cffi-bench:run 200000000) is already at
55500, compared to ~100 in SBCL.

CCL crashes with a segmentation fault, and I'm still debugging.

Tomorrow I'll see about creating a build script and opening a pull request to
the main repo.

[1] [https://github.com/jl2/ffi-overhead/tree/master/common-lisp/...](https://github.com/jl2/ffi-overhead/tree/master/common-lisp/cffi-bench/cffi-bench.lisp)

EDIT: I recompiled the .so and C program with -O3, and it now prints between
4740 and 4780. SBCL using the -O3 library prints between 7935 and 8935.

~~~
nabla9
If we use the C performance ratio between the machines to adjust for machine-
to-machine differences, SBCL would be around 1600 in the original benchmark,
between D and Java.

------
chubot
This is very cool, but note that it is a microbenchmark comparing the overhead
of calling plus(int, int). This is a very specific case of FFI that is easy
and simple.

For my work on the Oil shell, I care more about moving strings back and forth
across the VM boundary (not to mention bigger objects like records and arrays
of records). There are many more design choices in that case, and I suspect
the results will also look different.

------
mrkgnao
Haskell has one of my favourite C FFIs, and there are also libraries that let
you seamlessly mix C into Haskell expressions:

[http://hackage.haskell.org/package/inline-c-0.6.1.0#readme](http://hackage.haskell.org/package/inline-c-0.6.1.0#readme)

------
Yttrill
C++ and Felix obviously have the lowest overhead .. namely zero in both cases
:-)

------
moosingin3space
This seems to partially explain why Go programmers prefer pure-Go libraries.

~~~
pitaj
Wow, why is the overhead so high for Go? Something to do with the garbage
collector maybe?

~~~
masklinn
Go does not use the C stack, so calling into C requires setting up a C stack,
trampolining from the Go context to the C context, making the call and coming
back.

As a result, while cgo is easy to use:

1\. calling a C function from Go is ~100 times more expensive than calling a
Go function from Go — or was a few years back anyway; this may have improved a
bit since: [https://www.cockroachlabs.com/blog/the-cost-and-complexity-o...](https://www.cockroachlabs.com/blog/the-cost-and-complexity-of-cgo/)

2\. and using cgo has semantic impact on the entire program:
[https://dave.cheney.net/2016/01/18/cgo-is-not-go](https://dave.cheney.net/2016/01/18/cgo-is-not-go)

This is why Go libraries generally don't wrap native libraries and go software
only calls into C when they really have no other choice, and/or have very
"coalesced" APIs available (a single function call doing a lot of work whereas
languages like Python are happy with making lots of very small calls to C).

And why Go itself does not use the platform-provided standard libraries and
performs syscalls directly, breaking any time those syscalls change[0] — which
is not that rare, because linux is the only platform where raw syscalls are
the kernel's API; on most other systems the libc is the platform's API. And
even on linux it breaks from time to time[1] because Go doesn't want to link
to libc but still wants to benefit from vdso[2].

[0]
[https://github.com/golang/go/issues/16606](https://github.com/golang/go/issues/16606)

[1] [https://marcan.st/2017/12/debugging-an-evil-go-runtime-bug/](https://marcan.st/2017/12/debugging-an-evil-go-runtime-bug/)

[2]
[https://twitter.com/bcantrill/status/774290166164754433?lang...](https://twitter.com/bcantrill/status/774290166164754433?lang=en)

------
rurban
Would be interesting to know which ffi backend library is used for each.

libffi, dyncall, self-written, a jit library, llvm, ...

In my tests these varied wildly. It has nothing to do with the language using
these.

~~~
isaachier
I believe most of these languages, if not all, have a native C compatibility
layer. Java has JNI. Zig and Rust inherently use LLVM. Go has cgo. Nim
compiles to C. LuaJIT has its own FFI notation. Etc.

------
earenndil
Interesting that d and nim get better performance than c itself??

I would expect d, rust, nim, c, and zig to all have equivalent performance
since they use the same calling convention as c...

~~~
rurban
Higher-level languages know more about the context and the stack. You also
need a good inliner. You can do fast calls when you know exactly what
registers you need to save and which not. With pure generic C you need to save
most, with a more high-level language only those really needed, or even none,
when you use the stack instead.

Common-Lisp e.g. usually leads the ffi benchmarks.

~~~
BeeOnRope
Hmm? Not really.

In any case, C or higher level language, the caller knows what registers are
in use and need saving (to the extent they intersect with the ABI's clobbered
register set).

In any case, the caller usually doesn't know what registers the callee will
use and so has to rely on the ABI.

Of course if things can be inlined it's a different story, but this is about
FFI, i.e., no inlining.

~~~
rurban
Yes.

But several VMs/ABIs don't clobber the regs by themselves (i.e. all locals
are volatile), thus the external C func can use all the regs they want. Hence
the prologue doesn't need to save much. E.g. this also helps with GC pauses:
no registers need to be scanned there. We had great success with such an ABI
(potion, a jitted lua variant).

Native compilers with a better ABI than C also have advantages: Fortran (no
need to save the return ptr), Common Lisp (mult. return values, can tune the
FFI much better than usual ffi's), Haskell.

The biggest problem is still callbacks: calling functions from the FFI back
into us. This is costly, and the FFI declaration needs to know that. And
closures (local upvalues on the stack) are only properly supported with
Haible's avcall.

And some newer VM's solved the expensive FFI or C extension problem by redoing
the external C part, such as pypy or Graal. They convert the called C function
to their own format, and as such can inline it or know the used registers.

~~~
BeeOnRope
I'm not following you. Clearly in a pure JIT environment or with custom
calling conventions you can do better than a typical "blind" C call restricted
by the platform ABI - but we're talking about FFI calling into a "plain"
shared object, right?

Such functions obey the platform ABI, so the caller needs to save exactly
every callee-clobbered register they are using.

Also, if the external C function wants to use any callee-saved registers, it
needs to save them. It doesn't know that it's being called by a VM that allows
FFI functions to clobber everything (if that's what you're saying).

So yeah, there is all kinds of great stuff you can do with respect to function
calls within a VM, or when you are working outside the platform C ABI (e.g.,
if you invented your own ABI), but it isn't clear how that relates to making
fast FFI calls from higher level languages.

------
ChrisRackauckas
Nice to see Julia styling in these benchmarks!

------
dyu-
yay I made front page (finally?).

