
A high-speed network driver written in C, Rust, Go, C#, Java - Sindisil
https://github.com/ixy-languages/ixy-languages
======
kyrra
There is discussion about C# and Java being faster than Go, but one
interesting thing to note is that both C# and Java have to use C to interface
with the kernel.

Java: [https://github.com/ixy-languages/ixy.java/blob/master/ixy/src/ixy/c/ixy.c](https://github.com/ixy-languages/ixy.java/blob/master/ixy/src/ixy/c/ixy.c)

C#: [https://github.com/ixy-languages/ixy.cs/blob/master/src/ixy_c/ixy.c](https://github.com/ixy-languages/ixy.cs/blob/master/src/ixy_c/ixy.c)

Java needs a bit more C to make it work. C# only seems to need it for DMA
access. But when you look at the Go code, they got away with being pure Go and
using the syscall and unsafe package. So that's at least one plus for Go.

(The main README calls this out, but it seemed worth mentioning here too.)

As a Java coder for my day-job, I do like the breakdown they have of the
performance of the different GCs for their Java implementation.
[https://github.com/ixy-languages/ixy-languages/blob/master/Java-garbage-collectors.md](https://github.com/ixy-languages/ixy-languages/blob/master/Java-garbage-collectors.md)

~~~
dep_b
I can't really believe the Swift implementation needs to be that slow.
Objective-C used to be 100% C compatible and Swift more or less has complete
bridging to C because of the need to use these APIs.

Objective-C was often called slow because iterating over an NSArray was much
slower than doing it in C. Well, if you needed to do it fast in Objective-C you
wouldn't do it using the user friendly and safe (for 1984) higher level
objects.

I think only Rust really allows you to write really safe and still really fast
code though.

~~~
emmericp
Yes, we could write most of the critical part in C and it would probably be
faster. But then it wouldn't be a Swift driver.

~~~
skohan
Are these benchmarks single-threaded? I took a brief look at the Swift
codebase, and I noticed that you are using semaphores, but there doesn't seem
to be any parallel execution anywhere in the project.

~~~
emmericp
The Semaphore is only used during initialization, never in the critical path,
see profiling results in the main repo

------
tylerl
If you can't see or interpret the graphs (mobile browser, etc.) here's a quick
description of the relative performance in terms that might be useful even
without the graphs.

Bidirectional forwarding, Packets per second: Here, the batch size matters;
small batches have a lower packet rate across the board. Each language has
increasing throughput with increasing batch size up to some point, and then
the chart goes flat. Python is by far the slowest, not even diverging from the
zero line. C is consistently the fastest, but flattens out at 16-packet batch
at 27Mpps. Rust is consistently about 10% slower than C until C flattens out,
then Rust catches up at the 32-packet batch size, and both are flat at 27Mpps.
Go is ever so slightly faster than C# until the 16-packet batch size where
they cross (at 19Mpps), then C# is consistently about 2Mpps faster than Go. At
the 256-packet batch size, C# reaches 27Mpps, and Go 25Mpps. Java is faster
than C# and Go at very low batch sizes, but at 4 packets per batch Java slows
down (10Mpps), and quickly reaches its peak of 11 to 12 Mpps. OCaml and
Haskell follow a similar curve, with Haskell consistently about 15% slower
than Java, and OCaml somewhere between the two. Finally, Swift and Javascript
are indistinguishable from each other, both about half the speed of Haskell
across the board.

Latency, at 90, 99, 99.9, 99.99.. etc., percentile. 1Mpps: All have zero-ish
latency at the 90 percentile point, then Javascript latency quickly jumps to
150us, then again at 99.99%ile jumps again to 300us. C# is the next to
increase: at the 99%ile mark there's a steady increase till it hits 40us at
99.99%ile. Then a steady increase to about 60us. Haskell keeps it at about
10us until 99.99%ile, then a steady increase to about 60us, and a sudden spike
at the end to 250us. Java latency remains low until 99.95%ile, then it quickly
spikes up reaching a max of 325us. Next OCaml spikes at around 99.99%ile,
reaching a max of about 170us. Next comes Swift, with a maximum of about 70us.
Finally, C, Rust, and Go have the lowest latency. Rust and C are
indistinguishable, and Go latency diverges to about 20% higher than the other
two at the 99.999%ile mark, where it sways, eventually hitting around 25us
while C and Rust hit about 22us.

~~~
gnode
The Rust page also compares the performance of the Rust implementation using
prefetching, which slightly outperforms C for some batch sizes.
[https://github.com/ixy-languages/ixy.rs#performance](https://github.com/ixy-languages/ixy.rs#performance)

It would be a bit of a cheat, as it isn't portable, but it would be nice to
see prefetching in the C implementation for the sake of comparison.

------
userbinator
Cross-language comparisons are always interesting to look at; if I had the
time, I'd really like to write one in Asm and see how it compares.

I've written NIC drivers for some older chipsets, and IMHO it's not something
that's particularly "algorithmic" in computation or could necessarily show
off/exercise a programming language well; what's really measured here is
probably an approximation to how fast these languages can copy memory, because
that's ultimately what a NIC driver mostly does (besides waiting.) To send,
you put the data in a buffer and tell the NIC to send it. To receive, the NIC
tells you when it has received something, and you copy the data out.
Nonetheless, the astonishingly bad performance of the Python version is
surprising.

Although I haven't looked at the source in any detail, I know that newer NICs
do a lot more of the processing (e.g. checksums) that would've been done in
the host software, so that would be another way in which the performance of
the host software wouldn't be evident.

One other thing I'd like to see is a chart of the binary sizes too (with and
without all the runtime dependencies).

~~~
bsder
> Nonetheless, the astonishingly bad performance of the Python version is
> surprising.

In the paper, they point out that the Python version is the only one they
didn't bother to optimize.

However, my takeaway is that practically everybody can handle north of 1
gigabit per second (2 million packets per second x 64 bytes per packet x 8
bits per byte) even on a 1.6GHz core. I find _THAT_ quite a bit more
astonishing actually.

~~~
fgonzag
I don't see why it's that surprising. We've been stuck on 1Gbps for the better
part of 20 years. What's surprising to me is that wired networking was sorta
left behind by the tech wave. Sure, 10Gbps exists, but it's still not that
affordable or widespread.

~~~
Goz3rr
I wouldn't say it was exactly left behind, because the average consumer will
not really benefit from anything over 1Gbit. 1Gbit is already enough to
saturate most consumer hard drives.

I run 10Gbit inside my home, for the sole reason of getting quicker transfers
between my PC and NAS, and it didn't even cost me that much (if you go with
10Gbit fiber instead of copper). My NAS has 4 SFP+ ports and functions as a
switch. I bought second hand PCIe SFP+ NICs for $40 each and matching
transceivers for $15 each. 10 m of fiber costs less than $10.

There's no point in going higher, because 10Gbit is already way past the
sequential writing speed of the drive array in my NAS, and it's pretty much
saturating the NVMe cache drive in the NAS or the NVMe storage in my PC.

That's not to say you can't go faster, because 100, 200 and 400Gbit are very
much possible and in use in datacenters and the like.

~~~
cure
> I wouldn't say it was exactly left behind, because the average consumer will
> not really benefit from anything over 1Gbit. 1Gbit is already enough to
> saturate most consumer hard drives.

That hasn't been true for a long time. Even one single spinning rust hard
drive made in the last decade can do sequential reads at ~120-150MiB/sec,
which is easily enough to saturate a 1 Gbit/s link.
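
The arithmetic behind that claim can be checked with a short Python sketch. It compares raw link capacity with a drive's sequential read rate and ignores all protocol overhead, which is a simplifying assumption; the function names are made up for this illustration.

```python
# Back-of-the-envelope check: can one drive fill a 1 Gbit/s link?
# Protocol overhead (Ethernet, IP, TCP) is ignored, so the real usable
# capacity of the link is somewhat lower than computed here.

MIB = 1024 * 1024  # bytes per mebibyte


def link_capacity_mib_per_s(link_bits_per_s: float) -> float:
    """Raw link capacity expressed in MiB/s."""
    return link_bits_per_s / 8 / MIB


def saturates(drive_mib_per_s: float, link_bits_per_s: float) -> bool:
    """True if the drive can deliver at least the raw link capacity."""
    return drive_mib_per_s >= link_capacity_mib_per_s(link_bits_per_s)


if __name__ == "__main__":
    gigabit = 1e9
    print(f"1 Gbit/s = {link_capacity_mib_per_s(gigabit):.1f} MiB/s")
    for rate in (120, 150):  # the range quoted in the comment above
        print(f"{rate} MiB/s drive saturates the link: {saturates(rate, gigabit)}")
```

A 1 Gbit/s link carries about 119.2 MiB/s at most, so even the low end of the quoted 120-150 MiB/s range is enough to fill it.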

SSDs have way, way higher throughput for sequential read and write. Good SSDs
will also beat that number handily for random read/writes.

And of course, any machine with more than 1 hard drive can easily saturate a
1Gbit/s network.

I also find it surprising that wired networking has been 'stuck' on 1Gbit/s
for decades.

------
saurik
That JavaScript and Swift have essentially the same performance here is
extremely telling: there are essentially four performance regimes (five if you
count Python, but clearly from the graphs you should not ;P), and what would
really be interesting--and which this page isn't bothering to even examine?!
:(--is what is causing each of these four regimes. I want to know what is so
similar about C# and Go that is causing them to have about the same
performance, and yet much more performance (at higher batch sizes) than the
regime of Java/OCaml/Haskell (a group which can't be explained by their
garbage collectors as one of the garbage collectors tested for Java was "don't
collect garbage" and it had the same performance). It frankly makes me expect
there to be some algorithmic difference between those two regimes that is
causing the difference, and it has nothing to do with
language/runtime/fundamental performance.

~~~
emmericp
Swift spends 76% of the time incrementing/decrementing reference counts; ARC
is just very bad at pushing tens of millions of objects through it every
second.

There's some more evaluation for Swift here: [https://github.com/ixy-languages/ixy.swift/tree/master/performance](https://github.com/ixy-languages/ixy.swift/tree/master/performance)

It's just a coincidence that JavaScript and Swift end up with almost the same
performance; there is nothing similar between these two runtimes and
implementations.

~~~
skohan
This is also a clear optimization target. It is very possible to write Swift
code which requires very little reference-counting overhead.

~~~
dep_b
The problem is that 99% of all Swift developers use the language to create
front-ends for powerful devices and you never need to squeeze the last drop of
performance out of them.

------
antoinealb
The author of this project presented it last year at CCC, here is the video:
[https://media.ccc.de/v/35c3-9670-safe_and_secure_drivers_in_high-level_languages](https://media.ccc.de/v/35c3-9670-safe_and_secure_drivers_in_high-level_languages)

~~~
ksangeelee
Thanks, that was interesting. If anyone is excited enough to try driving
peripherals in userspace via hardware registers, I can recommend starting with
a Raspberry Pi, since it has several well documented peripherals (UART, SPI,
I2C, DMA, and of course lots of GPIO), and the techniques described in this
talk are transferable.

A search for 'raspberry pi mmap' will yield a lot of good starting points.
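
For readers who have not seen the technique, here is a minimal Python sketch of register-style access through `mmap`. It maps an ordinary temporary file as a stand-in for a device file, since mapping real hardware needs root and an actual board; the helper names and the offset `0x10` are hypothetical and not taken from any datasheet.

```python
import mmap
import struct
import tempfile

REGION_SIZE = mmap.PAGESIZE  # size of the mapped "register window"


def write_reg32(region, offset, value):
    """Store a 32-bit little-endian value at a byte offset in the mapping."""
    struct.pack_into("<I", region, offset, value)


def read_reg32(region, offset):
    """Load a 32-bit little-endian value from a byte offset in the mapping."""
    return struct.unpack_from("<I", region, offset)[0]


def demo():
    # Assumption: a temporary file stands in for the device file that a
    # real userspace driver would open and map.
    with tempfile.TemporaryFile() as backing:
        backing.write(b"\x00" * REGION_SIZE)
        backing.flush()
        with mmap.mmap(backing.fileno(), REGION_SIZE) as region:
            write_reg32(region, 0x10, 0xDEADBEEF)  # hypothetical register offset
            return read_reg32(region, 0x10)


if __name__ == "__main__":
    print(hex(demo()))
```

On real hardware the same pattern applies to the mapped device region, but access width and ordering matter there, and this sketch does not address either.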

------
kerng
Cool to see C# being up there close to C and before Golang.

I haven't used C# much over the last year due to job change but always felt
like one of the most mature languages out there. Now working in Go and it's a
bit frustrating in comparison.

~~~
tylerl
Go isn't designed to feel mature, it's designed to be boring and effective.
It's designed to keep code complexity low even as the complexity of problems
and solutions increases. It's designed to allow large teams of medium-skill
programmers to consistently produce safe and effective solutions. The most
precise description I've heard to date is: "Go is a _get shit done_ language."

~~~
pjmlp
Basically it is designed for writing boilerplate libraries and code generators
to cover up lack of language features, which even well known projects are
forced to make use of (k8s).

I bet a G2EE variant isn't too far away.

~~~
grumpydba
Yet on the infrastructure side, it's used much more than C#. Go figure.

~~~
pjmlp
You again.

What infrastructure, those riding the consulting and conference Docker and K8s
2019 wave fad?!?

~~~
grumpydba
Right now I'm using prometheus and grafana to monitor around 8k database
servers (sql server too BTW). We have pricing applications using influxdb.
Docker. Openstack. Minio. We also have mattermost.

All of this in a conservative big bank. My friends in the banking sector tell
the same story.

True there are lots of c# enterprisey web apps.

However given the amount of boilerplate you describe, I cannot understand how
such useful and reliable tools can be delivered in Go.

A hint: just because a language is not to your liking, it does not mean that it
is not useful, performant and reliable.

~~~
pjmlp
One anecdote doesn't make the IT industry.

~~~
grumpydba
I'm talking about the whole banking sector in France.

~~~
pjmlp
And yet I haven't seen any of that on our Fortune 500 French clients, which
naturally includes banks, go figure.

~~~
grumpydba
Are you working in operations and infrastructure? My take is that writing
enterprisey applications you are not exposed to those tools. I'm in ops.

~~~
pjmlp
Not personally, but we do have mixed teams.

AWS, Azure, actual hardware racks, plain old VMs, JEE containers, .NET
packages, Ansible, Puppet, Chef, whatever scripting stack, but surely not one
line of Go related code.

------
chrisaycock
A specific finding from this research is on the front page:

[https://news.ycombinator.com/item?id=20944403](https://news.ycombinator.com/item?id=20944403)

Rust was found to be slightly slower than C because of bounds checking, which
the compiler keeps even in production builds.

~~~
mlindner
Except their answer is wrong, because Rust (LLVM rather) does eliminate bounds
checks. They're comparing GCC vs LLVM here more than they are comparing C vs
Rust. They should have compiled their C code in LLVM. Their implementation is
littered with uses of "unsafe" which means it's almost impossible for the
compiler to eliminate the bounds checks.

~~~
GrayShade
There's a per-packet bounds check here [1] which probably can't be eliminated
by the compiler because it cycles over the array. I imagine that's noticeable.

[1]: [https://github.com/ixy-languages/ixy.rs/blob/master/src/ixgbe.rs#L175](https://github.com/ixy-languages/ixy.rs/blob/master/src/ixgbe.rs#L175)

~~~
ChrisSD
So the bounds check is:

    
    
        queue.bufs_in_use[rx_index]
    

If so the bounds check could possibly be safely eliminated by the programmer
because I think `wrap_ring` ensures that rx_index will always be in bounds?
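
The invariant can be illustrated with a small Python sketch. It assumes that `wrap_ring` advances the index by one and masks it with `ring_size - 1`, and that the ring size is a power of two; both are assumptions made for this illustration rather than something confirmed in the thread, and `walk` is a made-up helper.

```python
def is_power_of_two(n: int) -> bool:
    return n > 0 and n & (n - 1) == 0


def wrap_ring(index: int, ring_size: int) -> int:
    """Advance the index by one, wrapping around via a bit mask.

    Assumed form; the mask only works when ring_size is a power of two.
    """
    return (index + 1) & (ring_size - 1)


def walk(ring_size: int, steps: int) -> list:
    """Visit ring slots the way a receive loop would and count the visits."""
    assert is_power_of_two(ring_size)
    bufs_in_use = [0] * ring_size
    index = 0
    for _ in range(steps):
        # The invariant a bounds check would re-verify on every packet.
        assert 0 <= index < ring_size
        bufs_in_use[index] += 1
        index = wrap_ring(index, ring_size)
    return bufs_in_use


if __name__ == "__main__":
    print(walk(8, 20))
```

Because the mask can never produce a value of `ring_size` or more, the index stays in range as long as it starts in range. That is the argument for treating the per-packet check as redundant.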

~~~
GrayShade
Yes. It wouldn't be too unidiomatic to use get_unchecked in those two places,
perhaps with a debug_assert! in place.

It would be really nice if this wasn't needed, but it's a valid use of unsafe
code.

------
chvid
So why the difference in "language" speeds?

Some of the results do not quite follow the conventional expectation.
For example the Swift implementation is as slow as JavaScript. JavaScript is a
lot faster than Python. Java is considerably slower than the usually very
similar C#.

The implementation is fairly complex; so it is a bit hard to see what is going
on. But it must be possible to pin the big performance differences implied by
the two graphs to something?

~~~
ygra
Python is interpreted bytecode. This means that for every small instruction on
the bytecode there's a round trip to the Python interpreter that has to
execute that instruction. This is faster than parsing and interpreting at the
same time, such as shells often do, but it's still a lot slower than JIT
compilers.
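
To make the bytecode point concrete, here is a small sketch using Python's standard `dis` module to list the instructions the interpreter steps through for a trivial function. The function `add_lengths` is a made-up example, and the exact instruction names and counts vary between CPython versions.

```python
import dis


def add_lengths(a, b):
    """A trivial function whose bytecode we want to inspect."""
    return len(a) + len(b)


def bytecode_ops(func):
    """Return the names of the bytecode instructions compiled for func."""
    return [ins.opname for ins in dis.get_instructions(func)]


if __name__ == "__main__":
    ops = bytecode_ops(add_lengths)
    print(len(ops), "bytecode instructions for one line of Python:")
    for name in ops:
        print("  ", name)
```

Each listed instruction is dispatched individually by the interpreter loop, which is the per-instruction overhead a JIT compiler avoids by emitting machine code.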

Now, a just-in-time (JIT) compiler transforms the code into machine code at
runtime. Usually from bytecode. Java, C#, and JavaScript all use this model
predominantly these days. This takes a bit of work during runtime and you
cannot afford too complicated optimizations that a C or C++ compiler would do,
but it comes close (and for certain reasons is even better sometimes). So
that's the main reason why JavaScript is faster than Python. There's a Python
JIT compiler, PyPy, that might close the gap, though. And for Python in
particular there are also other options to improve speed somewhat, one of them
involves converting the Python code to C. Not too idiomatic, usually, though.

As for Java and C#, that's a point where it can sometimes show that C# has
been designed to be a high-level language that can drop down to low levels if
needed. C# has pointers and the ability to control memory layout of your data,
if you need it. This turns off a lot of niceties and safeties that the
language usually offers (you also need the _unsafe_ keyword, which has that
name for a reason), but can improve speed. Newer versions of C# increasingly
added other features that allow you to _safely_ write code that performs
predictably fast. But even value types and reified generics go a long way
toward making things faster by default than being required to always use
classes and the heap.

Java on the other hand has few of those features where the developer is offered
low-level control. It has one major advantage, though, in that its own JIT
compiler is a lot more advanced and can do some crazy transformations and
optimizations. One might argue that Java needs that much magic because you
don't have much control at the language level to make things fast, so as far
as performance goes between C# and Java this may be pretty much the tradeoff
between complicated language and complicated JIT compiler.

Which benchmarks show Java being faster than C# depends a bit on how
the code was written, but recently .NET has become a lot better as well, and
popular multi-language benchmarks often show C# faster than Java.

------
AlEinstein
Surprisingly good performance for the C# implementation!

~~~
jcranmer
For me, that was the line that surprised me the most. The .NET VM has had a
reputation as being a worse variant of the JVM, but it seems that now the
tables have turned.

~~~
fortran77
Really? To me it was always a runtime VM done right! The .net CLR is much more
stable, leak-proof, and performant in my experience. I have .net services that
run on servers for years without ever being restarted.

Given that C# and "Rust" are neck and neck, I'd rather have a nice GC language
to work with.

~~~
thethirdone
Go and C# are pretty much neck and neck until the batch size gets large. Rust
is always ahead of C#.

------
molyss
That's a very interesting experiment on many levels. Haven't taken the time to
look at the paper yet, but I'm curious how you got your number of pps vs
Gb/s in the README:

"full bidirectional load at 20 Gbit/s with 64 byte packets (29.76 Mpps)".
That sounds like 20Gb/s should be closer to 40Mpps than to 30Mpps. Did you hit CPU
limits on the packet generator, or am I missing some packet header overhead ?

Did you try bigger than 64-byte packets ? I'm curious how various runtimes
would handle that.

And how long did you run the benchmarks ? I couldn't really figure it out from
the GitHub repo or the paper. Mostly wondering if Java and other GC'd languages
showed improvement or degradation over time. I could see the JITs kicking in,
but I could also see the GCs causing latency spikes.

~~~
benou
> am I missing some packet header overhead ?

Yes: Ethernet adds 20 bytes: 8 byte preamble/start of frame delimiter + 12
byte interframe gap

=> the "on-the-wire" size is actually 84 bytes (64 + 20), i.e. 672 bits

=> 20 Gbit/s / 672 bits per packet = 29.76 Mpps
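
The same calculation as a short Python sketch, using the overhead figures given above (8 bytes of preamble and start-of-frame delimiter plus a 12-byte interframe gap). The function and constant names are made up for this illustration.

```python
PREAMBLE_AND_SFD_BYTES = 8   # preamble + start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12    # mandatory idle time between frames


def max_packet_rate(link_bits_per_s: float, frame_bytes: int) -> float:
    """Maximum packets per second for a given frame size, counting the
    per-frame overhead that occupies the wire but is not part of the frame."""
    wire_bytes = frame_bytes + PREAMBLE_AND_SFD_BYTES + INTERFRAME_GAP_BYTES
    return link_bits_per_s / (wire_bytes * 8)


if __name__ == "__main__":
    rate = max_packet_rate(20e9, 64)
    naive = 20e9 / (64 * 8)  # what you get if the overhead is forgotten
    print(f"with overhead:    {rate / 1e6:.2f} Mpps")
    print(f"without overhead: {naive / 1e6:.2f} Mpps")
```

Forgetting the 20 bytes of overhead gives roughly 39 Mpps, which is where the "closer to 40Mpps" expectation in the question comes from.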

> Did you try bigger than 64-byte packets ? I'm curious how various runtimes
> would handle that.

In typical forwarding, packet size does not impact forwarding that much until
you hit some bandwidth limit (PCIe, DDR and/or L3 cache) because you only
touch the packet header (typically the 1st 64-bytes cacheline in the packet).
The data transfer itself will be done by NIC DMA.

~~~
emmericp
PCIe bandwidth also decreases with increasing packet size as there's a lot of
overhead per packet. Memory isn't used, it's all handled in cache, hitting
main memory is super slow.

------
azhenley
The fact that Go is slower than C# really amazes me! Not long ago I switched
from C# to Go on a project for performance reasons, but maybe I need to go
back.

~~~
apta
What made you come to the conclusion that golang was faster than C#? The hype
and claims we see in blogs that are not backed up by anything?

Both C# and Java are faster than golang.

~~~
hermitdev
Usually where I see C# slow down is not because of the language, but because
of the over-engineered "enterprisey" solutions that Java has a bad rep for,
e.g. idioms like FactoryProviderFactory types.

A lot of the projects I work on, for instance, heavily utilize dependency
injection for no gain. There's only one implementation, there are no test mocks.
It's just overengineered and obfuscated for no reason.

Coming from a predominantly C++ background, we eschew virtual wherever
possible, favoring compile time polymorphism to runtime whenever possible,
because we're cognizant of the overhead of the indirect dispatch and likely
loss of optimization opportunities to inline trivial calls.

For sure, one can write C# or Java that can keep up, or even outperform C++ in
some circumstances, but you're not going to do it with "enterprise" patterns
hiding behind interfaces and factories and dependency injection.

~~~
pjmlp
That isn't what Turbo Vision, OWL, VCL, MFC, Motif++, PowerPlant, C Set++, ATL,
Qt, Unreal, COM/UWP, wxWidgets, JUCE look like.

There are the CppCon talks, the Modern C++ advocacy, and then there is the
code that everyone at most corporations actually writes.

~~~
blt
Virtual dispatch is particularly well suited for constructing dynamic GUIs at
runtime. Doesn't mean that "everyone" is writing code like that.

~~~
pjmlp
COM/UWP is not only for GUIs, it is the full area of modern Windows APIs.

Then there are ORM like the ill fated POET.

Yeah just like not everyone is writing code that "eschew virtual wherever
possible, favoring compile time polymorphism to runtime whenever possible",
especially at large corporations with mixed language teams.

Beyond C++ conference talks, I am yet to see stuff like SFINAE and tag
dispatching in the C++ code I occasionally deal with. Granted those are
libraries that get called from Java/.NET projects.

~~~
hermitdev
I have written a fair amount of C++ template metaprogramming and policy based
libraries. One library I wrote, in particular, was a templated generic
matching engine primarily used in the self-clearing of trades. Through
template policies, it could be configured to do one-to-one, one-to-many,
many-to-many matching based upon template args, for example. I also did a bit of
SFINAE in writing a home-grown ORM lib. I haven't really written any libs
using tag dispatching, but I've certainly used my fair share (looking at you,
Boost MultiIndex).

You don't usually see these sorts of types wrapped for Java or .Net, and if
they are, you usually have some sort of proxy in between to hide the
templates.

------
non-entity
Is there a compelling reason to write high level user mode drivers like this
over traditional kernel drivers? I remember finding this repo a few years back
and being fascinated.

~~~
pjmlp
Traditional monolithic kernel drivers are on their way out, especially given
security concerns.

[https://developer.apple.com/videos/play/wwdc2019/702/](https://developer.apple.com/videos/play/wwdc2019/702/)

[https://source.android.com/devices/architecture/hal-types](https://source.android.com/devices/architecture/hal-types)

[https://docs.microsoft.com/en-us/windows-hardware/drivers/develop/getting-started-with-universal-drivers](https://docs.microsoft.com/en-us/windows-hardware/drivers/develop/getting-started-with-universal-drivers)

~~~
non-entity
Interesting. I suppose the ability to (almost?) debug drivers like I would any
other user-mode applications could be a big win.

------
Shorel
Rust has definitely earned my respect.

Someone add D lang to this test! I want to know!

------
mister_hn
It misses C++

~~~
pjmlp
While C++ is way better than using C, it doesn't forbid "writing C with a C++
compiler", which renders all the safety features it offers useless if one
isn't allowed to tame the team via static analysis tooling.

------
nicky19890202
This is good.

------
yc12340
I am calling into question the validity of this project as a benchmark.

The author asserts that "it's virtually impossible to write allocation-free
idiomatic Java code, so we still allocate... 20 bytes on average per forwarded
packet". This sounds questionable: does that mean that he actually performs
a JVM memory allocation for _every_ packet?! Furthermore, the specifics of
memory management look murky. One implementation uses "volatile" C writes [1]
(simply storing data to memory). Another implementation of the same thing
uses a full CPU memory barrier [2]. Which one is right?

In my opinion, significant inconsistencies between implementations render any
comparison between them invalid. And when a whole cross-language test suite is
written by one person, you can be sure that they don't really excel in many
of those languages.

This is why I like Benchmark Game: all benchmarks are submitted by users, so
they are a lot closer to how decent real-world programmers would solve the
problem. Still not perfect, but at least that counts as an attempt.

1: [https://github.com/ixy-languages/ixy.java/blob/fcad50339e5372d7cc1cc53d7a7360bec04aaebd/ixy/src/main/java/de/tum/in/net/ixy/ixgbe/IxgbeTxQueue.java](https://github.com/ixy-languages/ixy.java/blob/fcad50339e5372d7cc1cc53d7a7360bec04aaebd/ixy/src/main/java/de/tum/in/net/ixy/ixgbe/IxgbeTxQueue.java)

2: [https://github.com/ixy-languages/ixy.java/blob/fcad50339e5372d7cc1cc53d7a7360bec04aaebd/ixy/src/ixy/c/ixy.c](https://github.com/ixy-languages/ixy.java/blob/fcad50339e5372d7cc1cc53d7a7360bec04aaebd/ixy/src/ixy/c/ixy.c)

~~~
emmericp
Java reaches 52% of C speed in the benchmark game ("fastest measurement at the
largest workload" data set, geometric mean), we reach 38%. Seems like our
implementation is within a reasonable range for something that's usually not
done in Java.

A full memory barrier is not required, but some languages only offer that. For
example, Go had the same problem. It's not a bottleneck because it goes to
MMIO PCIe space which is super slow anyways (awaits a whole PCIe roundtrip).

And no, it obviously wasn't written by only one person but a team of 10.

No, we are not saying that we allocate for every packet. We say that we
allocate 20 bytes on average per packet.

~~~
altfredd
Java supports all sorts of memory barriers via VarHandles (see
GET_OPAQUE/SET_OPAQUE). VarHandles can be created from contents of
ByteBuffers:

[https://docs.oracle.com/javase/9/docs/api/java/lang/invoke/MethodHandles.html#byteBufferViewVarHandle-java.lang.Class-java.nio.ByteOrder-](https://docs.oracle.com/javase/9/docs/api/java/lang/invoke/MethodHandles.html#byteBufferViewVarHandle-java.lang.Class-java.nio.ByteOrder-)

