
mTCP: A high-performance user-level TCP stack for multicore systems - gyre007
https://shader.kaist.edu/mtcp/
======
iforgotpassword
Not to be confused with mTCP, a TCP/IP library for DOS focused on low memory
footprint and efficiency so that it's usable on IBM XT class machines. It
includes a bunch of applications like IRC, HTTP, and FTP clients and servers.

[https://brutman.com/mTCP/](https://brutman.com/mTCP/)

------
yxhuvud
Interesting. However, it was last modified in 2014, and a lot has happened in
the Linux kernel since then. I wonder what the state of the art is nowadays
and how it compares to the kernel.

~~~
jdsully
The kernel is still quite slow at handling TCP/IP packets. It's the bane of my
existence with KeyDB. I'm still experimenting with io_uring, which should help
a little bit, but that was only released this year.

~~~
yxhuvud
I'm not claiming it is fast. I'm claiming that a lot of water has passed under
the bridge and a lot of development has been done, both on the kernel and
probably also on custom TCP stacks. Hence my call for a more up-to-date
comparison.

That said, io_uring does indeed seem like great progress.

~~~
jdsully
I should be clearer: from my perspective as someone following this closely,
little progress has been made, with the notable exception of io_uring.

------
amiga-workbench
I mistook this for
[http://www.brutman.com/mTCP/](http://www.brutman.com/mTCP/), but I suppose
there are limited naming choices when it comes to TCP stacks.

------
Radle
But why not optimize the TCP part of the kernel?

~~~
raphaelj
I worked on such a TCP stack while doing my MSc. thesis a couple of years ago
[1].

Handling TCP in the kernel has some overhead due to system calls.

Also, the way sockets are designed does not make them very scalable, as you
have lock contention on the TCP state machine. The SO_REUSEPORT feature
introduced in Linux 3.9 solves some of these lock issues, but the kernel TCP
stack is still not fully parallel [2].

\--

[1] [https://github.com/RaphaelJ/rusty](https://github.com/RaphaelJ/rusty)

[2]
[https://raw.githubusercontent.com/RaphaelJ/rusty/master/doc/...](https://raw.githubusercontent.com/RaphaelJ/rusty/master/doc/img/performances.png)

~~~
derefr
> But the kernel TCP stack is still not fully parallel

Yes, and this project, if moved into kernel-space, would entirely replace that
stack and its state-machine.

> system calls

You’ve still got the overhead (context switches and memory copies) of getting
the _IP_ packet out of/into the kernel, which I don’t think is all that much
less than the overhead of getting a TCP packet out of/into the kernel.

Really, what you want is SR-IOV to allow the user-space process to do direct
Ethernet DMA to its own dedicated network card. No copies at all!

But if you’re willing to do _that_, then the application is basically acting
as its own kernel... so why not just admit that, and instead of writing a
user-space process that has half the features of a kernel, just either 1.
write your logic as a Linux kernel driver, or 2. compile your program into a
unikernel framework? Then your VM-nee-application’s host can be a proper VMM
like Xen or ESXi, where it’s easier to configure that SR-IOV dedication as
part of your VM-nee-application’s workload configuration.

For this reason, I’ve never understood people trying to do things like this
“in user-space.” You’re playing at being a kernel—with all of the _problems_
of being a kernel—without the ability to rely on an existing, well-written
kernel as a basis for your logic (like e.g. the parts that handle the L1-L3
layers of the network stack, which you aren’t changing much.)

~~~
scott_s
> But if you’re willing to do that, then the application is basically acting
> as its own kernel... so why not just admit that, and instead of writing a
> user-space process that has half the features of a kernel, just either 1.
> write your logic as a Linux kernel driver, or 2. compile your program into a
> unikernel framework?

Because the kernel still does a lot of other things for you besides
networking - I think it's a stretch to say that all user-space networking
makes your work "half" that of a kernel. And _you're_ not necessarily the
one doing it. _You_ may be an application, and your user-level TCP (including
kernel bypass) may be a _library_ from someone else. But to your general point
that you are now well past the city walls, and may run into trouble, I agree.
I assume that this sort of thing is only done by a small number of people.

~~~
yxhuvud
It does raise the question of whether it would be possible for the kernel to
implement features that, so to speak, erect city walls so that a process
can't access whatever it wants. I.e., granting access to _some_ of the
Ethernet DMA but not all of it, or something similar. Perhaps it isn't
possible in theory, but perhaps it is, or could be.

------
snvzz
Sane OSs have been taking proper advantage of SMP for a while.

Such as Dragonfly:
[https://www.dragonflybsd.org/features/#index2h2](https://www.dragonflybsd.org/features/#index2h2)

~~~
unmole
Yet Dragonfly's network performance is nowhere near as good as that of Linux,
let alone FreeBSD.

~~~
imheresamir
Could you provide a source for this? Just curious.

~~~
snvzz
I'm curious too, as the ones I've seen show Linux outdone by Dragonfly,
matching throughput at much lower latency.

[https://leaf.dragonflybsd.org/~sephe/perf_cmp.pdf](https://leaf.dragonflybsd.org/~sephe/perf_cmp.pdf)

------
ohnoesjmr
Presumably this needs it's own IP sockets, which means this needs root to run?

------
nopacience
Does your site use mTCP ?

------
stcredzero
The implication of what mTCP is doing: Only threads on the same CPU core can
communicate at the highest speeds. Isn't this situation broken? Shouldn't we
have better hardware support for communicating threads? Why in modern
software, are threads communicating through CPU cache invalidation?

~~~
dooglius
Moving to e.g. shared scratchpad memory would be a major paradigm shift;
you'd have to do a lot of coordination between CPU vendors, OS/kernel
developers, high-performance library writers, etc. to make it happen.

~~~
dragontamer
> Moving to e.g. shared scratchpad memory would be a major paradigm shift

Shared scratchpad memory is (slowly) happening in the GPGPU world, at least in
very limited circumstances.

But yeah, I think that's why GPGPU programmers are managing to get better
scaling than classical CPUs, because GPGPUs are more willing to toy with the
memory model, and GPGPU programmers are willing to use those highly-
specialized communication features.

~~~
dooglius
I don't know much about GPGPU-land, but all of the difficulties I foresee have
to do with multiprocessing/sharing/context switching... my guess is that only
one logical program can use the scratchpad at a time?

~~~
dragontamer
> I don't know much about GPGPU-land, but all of the difficulties I foresee
> have to do with multiprocessing/sharing/context switching... my guess is
> that only one logical program can use the scratchpad at a time?

Well... that one logical program can have 1024 SIMD-threads (maybe 16x (AMD)
or 32x (NVidia) "actual" threads... where an "actual thread" is a "ticking
program counter" by my definition), but yeah, it's "one logical program" from
the perspective of the GPU.

Even more specifically, it's not just "one logical program" but "one
threadgroup". So if you have a program that spins up 4096 SIMD-threads, only
1024 of them at a time can actually share a particular shared memory. (The
GPU will allocate, say, 10kB to each of the different threadgroups, but
threads 0-1023 can only touch their own 10kB block, while threads 1024-2047
can only touch another 10kB block.)

GPU Shared Memory can be "split" to different programs. So one program can
reserve 10kB, while a 2nd program can reserve 20kB, and then both can run on
one GPU-unit. But the two programs are unable to touch each other's shared
memory.

