Why do we use the Linux kernel's TCP stack? (jvns.ca)
427 points by nkurz on July 2, 2016 | 130 comments

Please don't rewrite your network stack unless you can afford to dedicate a team to support it full time.

Twice in my career I have been on teams where we decided to rewrite IP or TCP stacks. The justifications were different each time, though never perf.

The projects were filled with lots of early confidence and successes. "So much faster" and "wow, my code is a lot simpler than the kernel equivalent, I am smart!" We shipped versions that worked, with high confidence and enthusiasm. It was fun. We were smart. We could rewrite core Internet protocol implementations and be better!

Then the bug reports started to roll in. Our clean implementations started to get cluttered with nuances of the spec we didn't appreciate. We wasted weeks chasing implementation bugs in other network stacks that were de facto but undocumented parts of the Internet's "real" spec. Accommodating these quirks cluttered that pretty code further. Performance decreased.

In both cases, after about a year, we found ourselves wishing we had not rewritten the network stack. We started making plans to eliminate the custom stack, a task now much more complicated because we had to transition active deployments away from it.

I have not made that mistake a third time.

If you are Google, Facebook, or another internet behemoth that is optimizing for efficiency at scale and can afford to dedicate a team to the problem, do it. But if you are a startup trying to get a product off the ground, this is premature optimization. Stay far, far away.

Agreed. Also, it likely makes you and your company a bad Internet citizen to roll your own, for the reasons you've mentioned. My first company sold proxy servers, and I can't count the number of times a buggy router stack or other embedded thing broke the web for some users some of the time. If you make an off-by-one mistake in your PMTU discovery code, you're gonna waste somebody's day. If you don't deal with encapsulation right, you're going to waste someone's day. If you respond incorrectly to ICMP messages, you're going to waste someone's day. The list of things that can go wrong is endless.

I kinda feel like it's comparable to rolling your own encryption. Yes, most of the standards are well-written and well-defined, and you can spend a couple weeks really grokking Stevens' book (or whatever the modern equivalent is, I dunno, as I don't implement anything at that level anymore), but you're gonna spend years becoming bug-compatible with the rest of the Internet (or coming to realize your interpretation and the rest of the world's interpretation of the spec differ).

2 million requests a second sounds amazing. But, the price is high. As you note, if you have a team dedicated to it, cool. And, if you want to do it for fun and experience, that's also cool. But, Linux has a lot of wisdom built-in. There's like a gazillion commits over its life that have fixed various networking bugs and quirks.

Why does it make someone a bad internet citizen to implement internet protocols? Is there something official about the Linux kernel (the insecure, buggy, written-in-C Linux kernel)? I would start clean-sheet for a lot of reasons, mostly security, but I don't know what that has to do with my internet virtues.

No, there's nothing official about the Linux kernel implementation. Though you could make the case for BSD...

That aside, the reason it makes you a bad citizen is the lack of follow through. If you're a small organization writing a network implementation, you know from the start that your implementation is unlikely to get the decade+ of follow-through that is necessary to fix your interoperability problems. Instead you're putting that cost on every other internet host that is going to have to interact with your out-of-spec implementation until the day the last instance of your code is shut down.

I wrote an IP stack for my little toy OS a few years back, and even that experience was miserable and hard. It never works quite right, because either your implementation sucks or the people you're talking to play fast and loose with the standards. More often, both. It taught me that the papers don't matter and the real standards are what's in the wild, and that what's happening in the wild is an order of magnitude more insane than you can imagine.

The specification, as written in the RFC, is pretty useless... But luckily people sat down and formalised TCP/IP as spoken on the Internet (10 years ago, Linux-2.4.20/FreeBSD-4.6/Windows XP): https://www.cl.cam.ac.uk/~pes20/Netsem/index.html

All the code is now BSD licensed https://github.com/PeterSewell/netsem

What is missing in 10 years of TCP? Apart from SACK and delayed ACK not much it seems (in FreeBSD congestion control algorithms are pluggable nowadays).

And I am currently reviving it (working hard on a test suite using DTrace probes): https://www.cl.cam.ac.uk/~pes20/HuginnTCP/

Please ping me if you're interested (hannes at mehnert dot org) and eager to help.

> "Please don't rewrite your network stack unless you can afford to dedicate a team to support it full time."

Generally agreed but there's a caveat to this statement.

> "Twice in my career I have been on teams where we decided to rewrite IP or TCP stacks."

This is the caveat. Don't try to rewrite a protocol that's already running billions of devices on the internet which you have to interoperate with. It's a much more tractable problem when you control both ends.

Twice in my career, I've worked on protocols (with success... the first company went public, the second has its protocol working well in millions of devices). Both times, the protocols were key to the business of course, and both times the "custom" protocol only had to interoperate against itself on the other end.

I've always thought custom implementations of this flavor were either smart or dumb. It's never a merely OK choice; there's no middle ground. I think in your case, it's a big win. In the op's case, it was a bad choice.

I'm not saying the op's team aren't smart and capable people. They just made a bad choice.

Fundamentally, it's a hard thing to get perfect. There are lots of pointy edges in that code. If it is to be undertaken, it needs to be a huge win. The potential for a huge win is worth the effort.

Did you work at Onlive or something? I was always curious what their protocol was doing. I never bothered to look, because I figured it would take a while to figure out what I was looking at.

No it wasn't Onlive. The company that went public was Riverbed and they used their custom TCP between their appliances. My current company (which I founded) is PacketZoom and our custom protocol (built on top of UDP) communicates between our SDK in the mobile app and our servers distributed around the world.

We use Riverbeds on either side of our sat shots here in Antarctica. They work fairly well for the most part. I didn't know they were running a custom TCP implementation in between the end devices.

Veteran of a similar affair. Well, two stacks, and once, an HTTP proxy.

You may be smarter than the average bear, but you have to deal with other people's um . . . questionable decisions (and bugs).

Customers won't care that the FuppedUckTron-9000 web server they bought on eBay is non-compliant and that its Content-Length needs to have special casing to work around some spectacular drain-bamage, they only care about their valuable business data ^H^H^H porn.

Stacks and proxies are usually thankless.

The original article claims that having the TCP stack in the kernel causes performance problems because it needs to do excessive locking.

I can't judge, but if really that is true, then in principle, a user-space library could be written to take care of all those corner cases you mention, and still be faster than the kernel stack.

Of course that wouldn't be everyone rolling their own.

I've only poked at the FreeBSD TCP stack and not the Linux stack, but it seems like if the problem is locking, you should be able to get good results from working on the locking (finer grained locks / tweaking parameters) in less time than building a full tcp stack.

What kind of limitations are people seeing with the Linux kernel? If I'm interpreting Netflix's paper[1] correctly, they could push at least 20 Gbps of unencrypted content with a single-socket E5-2650L (the document isn't super clear, though; it says the servers were designed for 40 Gbps). My servers usually run out of application CPU before they run out of network -- but I've run some of them up to 10 Gbps without a lot of tuning.

[1] https://people.freebsd.org/~rrs/asiabsd_2015_tls.pdf Context is accelerating https downloads, but some decent numbers anyway.

Gbps is not created equal. Traffic with many small packets takes a lot more resources than traffic with fewer but bigger ones. Netflix packets would be as big as they come.

Yes, in fact for things like Juniper/Cisco firewalls they will always quote PPS in full MTU packets. If you want to bring that shiny new firewall to its knees try sending it traffic with the minimum MTU of 68 bytes at line rate for the NIC.

Ah, I'm also dealing with large packets, generally.

The problem isn't locking so much; it's that you have to dispatch to a kernel thread when you're requesting and sending data, paying the cost of that context switch every time. In userspace you can spin a polling thread on its own core and DMA data up and down to the hardware all day long without yielding your thread to another one.

The kernel is mapped into the top of the address space of each user-space process. That is generally pretty efficient, which is why it is done that way.

Sure, that saves you from dumping TLB state, but you still need to save register state and copy data from a user-supplied buffer into a kernel-owned, device-mapped buffer, wiping the L1 data and instruction caches in the process.

For 99% of use cases this isn't a problem, but if you're trying to save every possible microsecond, it definitely is.

Sure, I was more commenting on the parent post that suggested the cost was due to a "context switch", when it's not a context switch at all; it's a mode switch, to "kernel mode."

If you are trying to save microseconds you are probably running special hardware like the SolarFlare network cards which also run the drivers in user space. These are generally hedge funds or high frequency trading shops. I can't imagine anyone else could justify the price.

For example, you could use the BSD TCP stack which has been refactored into a user space library as part of the rump kernel project.

I guess most locking is in place to allow multiplexing: different applications need access to different sockets, which nevertheless send and receive data through the same NIC.

If you implement the whole stack in userspace and have only a single thread which processes all data, you might get away with less locking. However, as soon as there are multiple threads which want to send/receive on multiple endpoints, there is the same need for synchronization, and it would need to be implemented in userspace.

In practice you will bind a single core to each NIC and then run a single polling thread on each networking core with rtprio enabled.
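As a rough illustration of that single-polling-thread structure, here is a Python sketch using ordinary sockets and the standard `selectors` module (a real kernel-bypass stack would busy-poll NIC rings from C instead of calling the kernel's poller): one thread services every connection, so per-connection state needs no locks at all.

```python
import selectors
import socket

# One thread, one event loop: every socket is serviced here, so the
# per-connection state (the `received` dict) never needs a lock.
sel = selectors.DefaultSelector()
received = {}

def on_readable(conn):
    # Callback invoked by the loop when `conn` has data waiting.
    received[conn] = received.get(conn, b"") + conn.recv(4096)

# A few connected socket pairs stand in for real clients.
pairs = [socket.socketpair() for _ in range(3)]
for server_side, _ in pairs:
    server_side.setblocking(False)
    sel.register(server_side, selectors.EVENT_READ, on_readable)

for i, (_, client_side) in enumerate(pairs):
    client_side.sendall(b"hello %d" % i)

# One iteration of the event loop: dispatch every ready socket.
for key, _events in sel.select(timeout=1):
    key.data(key.fileobj)

print(sorted(received.values()))
```

The structural point is the same one DPDK/netmap applications exploit: with exactly one thread touching the data path per core, synchronization disappears from the hot path.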

The problem here is mostly not throughput, which I guess most of the high-end hardware can handle, but how many connections per second it can handle; that's where the bottleneck is in the Linux kernel. I cannot imagine the Linux kernel supporting 1 million CPS now, but the same is possible with a userspace TCP stack.

I suspect this is not limited to networking stacks.

I see a whole lot of fretting over technical debt here on HN, and the more I see it, the more I find myself thinking that said debt does not just happen. There will always be a reason, and it likely can never be fully avoided.

As such, whenever I see someone joyfully ripping out old code or rewriting something from scratch, I can't help thinking that at best it has reset the clock by a few years. We will be right back where we started, and then some, soon enough.

Yes and no.

For example, the company I currently work for has 15 years of unaddressed technical debt, and it has enormous impact on everything we're doing today. Fixing it is absolutely necessary, and where it's been possible to do so it's already made a significant impact.

However, in a few years things will likely still be awful and even the shiny new improvements will have been dirtied up, because the cultural and managerial decisions that led to the current state are still in full effect. You're correct in that technical debt is ironically usually as much a people problem as a technical one and solely-technical solutions are usually inadequate.

That doesn't mean that I'm going to stop fixing things where I can and pushing for change though, because at the very least it means my life there will be a little saner (until I finally burn out on fighting the tide and find somewhere different to go).

I have seen variations on this so, so many times over the years. "Why use that generic version, when we can roll our own?" Without a dedicated team, you're then stuck implementing and maintaining something on your own, while dozens, hundreds, or even thousands of people participate in the development of open-source libraries.

This is similar to making a private fork of an open-source project. At first, it seems fantastic. But pretty soon, you discover that no one, including the original authors, can provide you with advice and support.

I can't even imagine how painful that would be for part of the network stack. Yukko.

Agree that for most situations, especially public internet facing, this is the right advice.

However, in addition to big companies using it internally, it can also work when the environment is otherwise controlled. For example, scylladb is built on top of seastar and has its own userspace networking stack based on dpdk that it can use, and it works just fine, since only other database instances and well-behaved clients will be interacting with those servers. No dedicated team required, just realistic isolation of the scope of this "custom" protocol.

So essentially, you are saying that the Internet is doomed to eternal cruft. It seems like the Left-pad incident shows that a Chinese dolls level of interdependency results in inherent instability.

But there is an army of programmers, sysadmins and so forth to fix all this. Moreover, this is what people have gotten to work on a really large scale. It seems like programming in two hundred years will become 90% "listening to the mythology" and 10% actual logic. But that's how it happens.

> "So essentially, you are saying that the Internet is doomed to eternal cruft."

No. This is a classic disruption story. While the establishment is smug and comfortable about the accumulated cruft of decades, others are working on the problems they're completely ignoring. Check out my other comments in this thread.

If there is to be disruptive change, it will be a new protocol that solves global-scale problems the old one did not. The automobile did not depend on the horse. An incrementally better TCP implementation will not disrupt TCP.

Unfortunately a large portion of protocol development seems to be occurring at the application layer even if it doesn't belong there, which is how we ended up with HTTP/2 and WebSockets.

May I ask what kind of startups you worked at where you felt the need to rewrite it?

The pragmatics are all against rewriting the network stack, as you have thoroughly explained. Though I have a feeling of unease. The network stack implementations are lacking diversity. They improve slower than they could otherwise do. They have plenty of undocumented obscure corner cases. Developing an implementation of a [de-facto] standard requires a solid open test suite. The Web platform has one, https://github.com/w3c/web-platform-tests. Is there an equivalent suite for TCP/IP?

Certainly there are high-quality commercial testing appliances for precise performance and correctness figures. For example, https://www.ixiacom.com/products/ixanvl

Years ago I worked for a company who was building a piece of networking hardware, and we had bought some code that implemented various well-known routing protocols. This included the TCP/IP layer, but it was just the BSD stack.

Our dev hardware had a boot loader that would use FTP to download the most recent images to finish the boot. After we cut over to this code, we started having an occasional problem where this download would fail -- the box would send a spurious RST and kill the connection. I was one of the protocols people, and it ended up on my plate.

I spent 3 days staring at tcpdump, adding enormous piles of debug traces, and finally going through the code line-by-line with a copy of the Stevens book open on my lap. Eventually I found the problem - they had made one single change to the BSD code, and in so doing introduced the bug. It was compiler-dependent.

Writing this code correctly is hard.

> It was compiler-dependent.

Reminds me of a bug I was wrangling in GTK, now hopefully long gone, that only showed up if the lib was compiled with a certain optimization level in GCC.

A few quick notes:

- The Linux network stack has been getting faster. Workarounds that may have made sense 5 years ago, vs 2.6.something, may make less sense today vs 4.7. It makes it hard to research the topic, as you'll find statements of fact that are no longer true today.

- The kernel stack is complex, but then it also handles a lot of TCP/IP behaviors. Eg, buffer bloat. You might code a lightweight stack that works great in microbenchmarking (factors faster), but falls apart when connected to real devices. I've worked on differing TCP/IP implementations that had this behavior, and the stack that was slower in the lab was faster for the customer's real world messy workload.

- What is Facebook doing? eXpress Data Path: https://github.com/iovisor/bpf-docs/raw/master/Express_Data_... ... this is really new and exciting stuff.

- Unikernels :-)

Another technology worth mentioning is the Linux network parallelism features (which I think came from Google):


My colleague at Netflix, Amer Ather, wrote about this tuning and more in a post detailing some network tuning from a Xen guest, where he was reaching 2 million packets/sec:


Google Maglev: 2008

XDP (Facebook and others): 2016

I worked on ICQ project as a backend developer (C/C++) after Mail.Ru bought ICQ from AOL.

We had legacy from AOL - AOL's proprietary TCP stack in user space.

ICQ used AOL's TCP stack for external connections with outer world (i.e. with clients).

I asked AOL engineers why they needed TCP stack in user space. They said that back in old days (90s) there was no good scalable TCP implementation which could handle many simultaneous connections properly.

Each process which used this TCP stack had to fully own network interface.

Of course, we gradually replaced AOL's TCP stack with native stack because nowadays Linux TCP stack is good enough.

Maybe these days someone still needs a proprietary TCP stack in user land to meet specific needs that aren't met by the native Linux implementation.

Also if you have TCP stack in user land, you have more control over it.

AOL had a bespoke webserver too.


I really liked AOLServer! My current company's first website ran on OpenACS, which was written in Tcl and ran on aolserver. It was really quite a nice system (and Tcl is a quite nice language).

But, the web server worked with the standard Linux network stack; at least, it did by the time I saw it.

Tcl may be the most underappreciated language.

It was also written in Tcl.

They said that back in old days (90s) there was no good scalable TCP implementation which could handle many simultaneous connections properly.

I think that's true, using select() the max number of connections is ~1024. epoll() lets you do many more, but that wasn't available back then.

select doesn't have such an inherent limit. The limit is an artifact of the way the fd_set structure and macros are defined and implemented. Most unix-like kernels, including Linux, will accept a much larger fd_set. The userspace libraries (including glibc) even often permit you to redefine FD_SETSIZE at compile time, which makes it even easier. That can cause problems when sharing fd_sets between different libraries, though, so isn't a good idea.

IME once you get past an FD_SETSIZE of 8192 the overhead of preparing 3KB of data on every single select call begins to show appreciable CPU load. CPUs are fast enough that you'll rarely get more than a handful of pending events per call even with thousands of active connections, so the cost isn't amortized well.
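The "3KB of data on every single select call" figure follows directly from how `fd_set` is laid out: a bitmap with one bit per descriptor, and three sets (read, write, except) copied in and out on every call:

```python
FD_SETSIZE = 8192                     # the enlarged set size from the comment
bytes_per_set = FD_SETSIZE // 8       # fd_set is a bitmap: one bit per fd
sets_per_call = 3                     # readfds, writefds, exceptfds
total = bytes_per_set * sets_per_call
print(total)                          # bytes prepared per select() call
```

At the default FD_SETSIZE of 1024 the same arithmetic gives only 384 bytes, which is why select() feels free until you scale it up.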

Nonetheless, if you use epoll without actually making use of and benefitting from persistent events (as some naive applications do, including a much-touted "high-performance" platform), select can even be faster than epoll under load. Under load, a naive application which resets events on a descriptor every time it processes that descriptor is pushing much more data through syscalls than select would.

One of the interesting features of the AOL TCP stack is that it listened on all the ports -- pretty easy to do if you're in userspace.

The core functionality required is a "zero copy" networking lib: dpdk or netmap.

Normally, when the kernel receives data from the network, it allocates a block in kernel memory and copies the data into it. Then your read operation copies that data into user space. The "zero copy" networking stacks avoid that data copy. The way it works, as it was explained to me, is that they use a shared memory-mapped zone. This zone is organized as a pool of blocks managed with non-blocking lists. Blocks have a fixed size, big enough to hold ~1500-byte IP packets. I never used it, so I don't know the details.

When data arrives, it is written directly in place in a block of the memory-mapped zone. In user space you use select/epoll/kqueue, or polling if the waiting time is very small. Once you have a block, you can process it. This block contains raw data received from the network card, so it's up to you to encode and decode the TCP/IP headers, or use an existing lib like mTCP that does that for you. I was told that it can work with dpdk and netmap.
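To make "it's up to you to decode the headers" concrete, here is a hedged sketch (the field layout follows the standard IPv4 and TCP header formats; a real stack must also handle options, fragmentation, checksums, and much else, and the checksums here are simply left zero) of building and parsing minimal fixed-size headers with `struct`:

```python
import struct

def build_packet(src_port, dst_port, seq):
    """Build a minimal IPv4 header + TCP SYN header (no options, no payload).
    Checksums are left as 0 for the sketch."""
    ip = struct.pack("!BBHHHBBH4s4s",
                     0x45,              # version 4, header length 5 words
                     0, 40,             # TOS, total length (20 IP + 20 TCP)
                     0, 0,              # identification, flags/fragment
                     64, 6, 0,          # TTL, protocol 6 = TCP, checksum 0
                     bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
    tcp = struct.pack("!HHIIBBHHH",
                      src_port, dst_port, seq, 0,
                      5 << 4,            # data offset = 5 words, no options
                      0x02,              # flags: SYN
                      65535, 0, 0)       # window, checksum 0, urgent pointer
    return ip + tcp

def parse_packet(raw):
    """Recover protocol, ports, and sequence number from a raw packet."""
    ihl = (raw[0] & 0x0F) * 4            # IP header length in bytes
    proto = raw[9]
    src_port, dst_port, seq = struct.unpack_from("!HHI", raw, ihl)
    return proto, src_port, dst_port, seq

pkt = build_packet(12345, 80, 1000)
print(parse_packet(pkt))   # (6, 12345, 80, 1000)
```

Even this toy version hints at why the job balloons: the parse side must trust nothing (header lengths, protocol numbers, truncated packets) once real traffic shows up.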

My colleague is currently using netmap for a high-performance data-acquisition application on a LAN and plans to test dpdk with mTCP this summer. mTCP should simplify programming. At CERN they are now testing data-acquisition setups using dpdk to be able to use commodity hardware.

My colleague told me that netmap is available in the BSD kernel, so you can use it right away. It is not included in the Linux kernel, so you need to patch it in. Zero copy is the future of network programming; Linux is late on this one. Then there is dpdk, on which I don't have much info yet, except that it is made by Intel, it is open source, and it is compatible with AMD processors. It is apparently not easy to install.

Since dpdk and netmap communicate directly with the network card, they only work with supported network cards.

The gain in performance is significant, but I have no numbers at hand to give.

You can do zero copy I/O on Linux using vmsplice+extensions, and on FreeBSD using regular read/write calls. Both can do DMA with user space buffers with much less disruption to traditional design patterns for implementing network daemons.

I think the real benefit of DPDK and netmap is that you're avoiding all the logic of the existing IP stacks, not to mention firewall rules, etc. At the same time you're now responsible for all of that. And IMO the amazing throughput most people claim to see with DPDK is a result of simply neglecting to implement all the hard logic which makes the Internet actually work. All the weirdness, head scratching, and hair pulling these solutions cause is an externality engineers will never care about, and in most cases probably be oblivious about. The exception to this state of affairs is when they're doing read-only packet sniffing and filtering, simply passing the packets back out another interface.

If most of what you care about is avoiding the cost of the kernel/userspace split, then you can just use NetBSD Rump or similar unikernel frameworks.

Do you know if the best FreeBSD zero-copy support is only available for Tigon NICs? I couldn't quite figure it out from reading the following pages:

- http://www.freebsd.org/cgi/man.cgi?query=zero_copy

- http://www.kegel.com/c10k.html#zerocopy

- http://people.freebsd.org/~ken/zero_copy/

I don't actually know. I never went down the rabbit hole of zero-copy, only watched various teams crash and burn.

I've had success simply reducing the number of data copies in userspace. I use tools like Ragel, lean fifo buffers in C with moving read and write windows (i.e. slices), etc.

For example, for a [now defunct] startup, I implemented a real-time streaming radio transcoder which inserted targeted, per-listener, dynamically selected ad spots into live streams. So, it would take an existing radio source (Flash, ICY, MMS, etc.), transcode the codec and format to suit the listener, and when it detected ad spots, select and insert a targeted ad.

On a _single_ E3 Haswell core (core, not chip) I was able to transcode 5,000 streams in real-time, each with dynamically selected and inserted ad spots, cycling between 30 seconds of the stream and a new 30-second ad for the stress test. All software; no hardware or even SIMD (other than what GCC could squeeze out). The Linux kernel was spending more CPU time handling interrupts and pushing out the packets than my daemon was. At that point I knew I was already at the 80% solution and moved on to more feature development. I knew that after fiddling with the kernel I'd have more than enough performance.

FFmpeg, GStreamer, etc. couldn't even come close to that. I had written all my own stream parsers and writers, both so I had control over buffer management and because I needed frame-level control for splicing in ad spots. The only libraries I used were for the low-level codecs and for resampling. Notably, there was actually more copying than you'd think; enough to keep the interfaces relatively clean. The key, apparently, was maintaining data locality. Some libraries go to extraordinary lengths for zero-copy, but they end up with so much pointer indirection that it's a net loss.

Avoiding the logic of TCP stack processing does indeed come into play. My colleague processes data sent through a TCP connection on a LAN, so he basically ignores the whole TCP header of the incoming data. He simply checks the source IP.

He will test mTCP to get a full-fledged TCP/IP stack, so he will be able to measure its overhead.

FreeBSD netmap user here. You actually have to recompile the kernel with "device netmap" added to your kernconf. Piece of cake; after 20 minutes you are good to go. But you need a real network card, and the FreeBSD driver must be ready for netmap. Using Intel 10 Gbps adapters (~200 euros) is a safe avenue (the FreeBSD ixgbe driver). Even in VMware, you can pass through the PCI address of the adapter port to your virtual machine and have it talk to the card directly. Everything works very well.

The gain in performance is mind-boggling! Trying to sniff approx. 2+ Gbps of traffic with Suricata using the "normal" avenue of libpcap ends up dropping a small percentage of the packets, and the machine wastes an incredible amount of CPU. Using Suricata with netmap (no need to recompile; the Suricata pkgng binary build from FreeBSD comes ready) uses exactly one capture thread and drops ZERO packets. This behavior is stable for days!

Netmap is hands down awesome.

I was looking at Chelsio NICs and there were mentions of netmap support. Do you know what it means for a NIC to support netmap vs. one that doesn't? Is it an extra optimization/fast path?

Can't give a good technical answer to that. But I suspect that it should be a matter of driver mostly. When you mmap /dev/netmap from userland, the OS TCP/IP stack is disconnected and you get access to the card tx/rx rings. Obviously the driver has to facilitate this.

Netmap is in GENERIC now.

Netmap will work over any NIC now, but not at speed. For all the speed gains, you need netmap support in the driver(s) you want to use.

Yes GENERIC has it! But 10.3-RELEASE doesn't have it by default yet, so I had to compile.

You can also get very fancy with Intel DDIO (Data-Direct IO) where you can have the NIC write packets directly into L3 cache.

Do you know if any DPDK or netmap drivers can be made to ignore the layer 2 Ethernet CRC/checksum of incoming packets? There are some interesting applications for using Wireshark to diagnose protocols that use Ethernet for layer 1 and sort of for layer 2, but slightly vary the layer 2 protocol (e.g. a malformed packet CRC, or unusual packet sizes, or no inter-frame spacing), but I don't know of hardware that will pass packets that fail the CRC or arrive closer than 96nsec.

I think in that case it would make sense to build your own NIC so that it can operate out of spec. Ethernet PHY ICs that speak MII or RMII are usually only a few dollars each. One of those might give you the signal you are looking for.

The CRC is often checked directly in hardware, so you would have to configure the NIC properly, and I have not seen many high-performance NICs that allow you to modify that setting.
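For software that does want to validate frames itself, the Ethernet FCS is the standard CRC-32 (same polynomial and bit ordering as `zlib.crc32`), appended to the frame least-significant byte first. A minimal sketch, assuming the hardware passes the trailing FCS through to you:

```python
import struct
import zlib

def append_fcs(frame):
    """Append the Ethernet FCS (CRC-32, little-endian) to a frame body."""
    return frame + struct.pack("<I", zlib.crc32(frame))

def fcs_ok(frame_with_fcs):
    """Recompute the CRC over the body and compare with the stored FCS."""
    body, stored = frame_with_fcs[:-4], frame_with_fcs[-4:]
    return struct.pack("<I", zlib.crc32(body)) == stored

frame = append_fcs(b"\x00" * 60)            # a padded minimum-size frame body
corrupt = bytes([frame[0] ^ 0x01]) + frame[1:]  # flip one payload bit
print(fcs_ok(frame), fcs_ok(corrupt))       # True False
```

CRC-32 is guaranteed to catch any single-bit error, which is why the corrupted frame always fails the check.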

Both can. You can disable hw CSUM for dpdk, and netmap doesn't yet support it.

Is that the layer 3 or layer 2 checksum?

I don't understand where the performance gain comes from if you're just replacing the in-kernel TCP/IP handling with user-mode TCP/IP handling.

As the article says, it is largely a matter of lock-free structures and not crossing the kernel/userland barrier. Also, cache locality (combined with DDIO, which somebody else mentioned) is great.

With a userspace stack, you also get lots of tuning capability, which is not available with the kernel. One cool tunable is how long to busy-wait before sleeping for an interrupt.

This benchmark I posted on HN once shows how significant these effects are (using TCP loopback): https://news.ycombinator.com/item?id=9027365

I'm not sure. My colleague explained to me that the main gain came from avoiding a malloc in kernel space and a data copy. But it's also true that many small reads of a single block imply many system calls. I don't know the real relative processing cost of these operations.
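On the syscall-count point: the kernel API already offers some batching even without a userspace stack. For instance, `writev()` submits many small buffers in a single system call instead of one call per buffer. A small sketch using a pipe (POSIX only):

```python
import os

# 100 small buffers that would naively cost 100 write() syscalls.
r, w = os.pipe()
chunks = [b"x" * 10 for _ in range(100)]

# One system call submits all 100 buffers at once.
written = os.writev(w, chunks)
os.close(w)

# Drain the pipe to confirm everything arrived contiguously.
data = b""
while True:
    block = os.read(r, 4096)
    if not block:
        break
    data += block
os.close(r)
print(written, len(data))   # both 1000
```

Vectored I/O doesn't eliminate the kernel crossing or the copy, but it amortizes the per-syscall cost, which is often the cheapest first step before reaching for dpdk or netmap.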

From what I have seen, most applications using DPDK do not require TCP/IP handling. For example, packet monitoring, or data transfers between nodes that are just connected by a single cable (thus no need for routing or TCP arrival guarantees)...

But even reimplementing the TCP/IP stack would probably be faster: you don't make system calls (which are expensive), you don't go around copying data back and forth to socket buffers, you don't have to decide which application should receive the packet... It's hard, but it can yield performance improvements.

On the other hand, I'd try to activate other features for improved performance, such as running applications on the same cores that the queues are running on, configuring the NIC RX queues, and using jumbo frames (>1514 bytes) if you control the network path. All of these can yield noticeable improvements without much effort.

I haven't yet run into a case where it makes sense to do it, but you can gain some efficiency because you avoid context switching between the kernel and the program.

People who are interested in this should take a look at DragonflyBSD: https://www.dragonflybsd.org/

The things this quote claims are crucial to high-performance networking, and says need to be done outside of the Linux kernel, are already done in the DragonflyBSD kernel:

    The key to better networking scalability, says Van, 
    is to get rid of locking and shared data as much as
    possible, and to make sure that as much processing work
    as possible is done on the CPU where the application is
    running. It is, he says, simply the end-to-end
    principle in action yet again. This principle, which
    says that all of the intelligence in the network
    belongs at the ends of the connections, doesn't stop at the kernel.
DragonflyBSD solves the locking and shared-data problem by running a network stack per CPU core. If you've got 4 cores, you've got 4 kernel threads handling TCP. I don't understand the details of scheduling processes onto the CPU handling their network connections, but Matt Dillon, the head of the project, claims it's accounted for in the scheduler:

    <dillon>	we already do all of that
    <dillon>	its a fairly sophisticated scheduling algorithm.  
    How processes get grouped together and vs the network
    protocol stacks depends on the cpu topology and the load on the machine

edit: If you want to try it out on a VPS, vultr is the best one I know that allows you to upload custom ISOs (https://www.vultr.com/coupons/). There was a bug fixed recently in the virtio drivers that should make it pretty stable running on a VPS. You'll have to wait for 4.5 (a few weeks from now) or just upgrade from master (very easy).

Is it feasible to use DragonFly BSD for small to medium-sized sites? I am currently using NetBSD for my servers.

I don't see why it wouldn't be feasible for a site of any size. My only concern is running into issues with HAMMER. I was unable to mount my HAMMER partition after I filled up the drive. Not sure if it's a virtio bug or something that would affect actual disks as well, but that's something to be careful with.

We don't use the kernel TCP implementation because we decided to rethink the entire networking stack (from DNS to TCP/HTTP/TLS) as it applies to native mobile apps. So we wrote this whole stack using UDP instead of TCP, and of course implemented it entirely in userspace on both the client side and the server side.

Today's mobile app ecosystem has the amazing advantage of rapidly updating binary apps running on a billion devices. This allows us to ship new client side code (with app updates) to millions of devices every few days. It's the first time in computing history that such rapid iteration on network protocols has been possible and the only way to take advantage is to ditch the kernel and ship your code with the app.

EDIT: I wrote a thing about it in much more detail than the space here will afford : http://www.infoworld.com/article/3016733/application-develop...

Questions: (Disclaimer, I know only the surface of networking)

"For example, PZP is able to identify the device, rather than its IP address, as the endpoint for data packets, allowing it to accommodate the intermittent nature of mobile connections in a fault-tolerant manner":

In what way is this different from NAT? Is this somehow aware of the phone switching towers (topology changes)? Or is there another advantage? How do they get that information?

"The PacketZoom protocol recovers from dropped packets “gracefully,” with minimal overhead above the amount of data lost in the dropped packets."

I don't know signal processing well enough to know how this works; I'm guessing you don't need to do things in order like TCP, so you can ask for a resend and then stitch it in?

Where is the beef? Is it in following devices rather than IPs? What does the interface look like?

> ... I'm guessing you don't need to do things in order like TCP ...

kind of tangentially relevant: the problem with tcp is that it conflates congestion loss with error loss. this causes backoff, ergo poor performance under conditions such as lossy links, handovers etc. etc.

udp doesn't really suffer from any of this because, well, you do everything yourself. another advantage with udp, imho, is that the server side can be scaled quite easily :) e.g. you can have multiple processes terminating a udp control protocol, and demuxing connections onto session tasks which can be spread around.
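The multi-process UDP termination described here is essentially what `SO_REUSEPORT` (Linux 3.9+) provides: several sockets bind the same port, and the kernel hashes each incoming datagram's flow to one of them. A minimal sketch:

```python
import socket

def worker_socket(port: int) -> socket.socket:
    """One of N UDP sockets sharing a port; the kernel load-balances
    incoming datagrams across them by flow hash, so each worker process
    can terminate its share of the control protocol independently."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Must be set on every socket BEFORE bind for the shared bind to work.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    return s

a = worker_socket(0)                  # bind to a free port
port = a.getsockname()[1]
b = worker_socket(port)               # second bind to the SAME port succeeds
print(a.getsockname()[1] == b.getsockname()[1])  # → True
```

In a real deployment each worker would be a separate process (the option works across processes owned by the same user), and session state would be keyed per flow.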

i am kind of _surprised_ that ip addresses for devices end up changing. this might _only_ be true if you are handing over between different radio access networks (wifi -> lte -> wifi etc.). roaming within a single network would (must?) not change your address at all. for example, in an LTE network, the PGW, being an ip anchor, assigns addresses to the devices, and there is only one such 'thing' per epc core (slightly convoluted when moving from one operator to another, but the basic idea is still there)

from the article linked by GP, it seems that each device needs to register with a cloud based server, via which the device traffic is routed. this initial registration can be used to exchange things like IMEI or somesuch for device identification, which then remains the same as the device moves around.

"this initial registration can be used to exchange things like IMEI or somesuch for device identification"

The PacketZoom stack does create its own randomly generated identifier per app install (so two different apps on the same device will have different IDs). Nothing at all that would positively identify a device or a user (IMEI, UDID, etc.). That's a strict no-no by App Store/Play Store policies.

> Nothing at all that would positively identify a device or a user (IMIE, UDID etc.) ...

if we ignore (for a moment) that play-store/app-store policies forbid imei use by applications, i am just curious why you think IMEI cannot be used for _device_ identification ? thanks !

"forbid imei use by applications, i am just curious why you think IMEI cannot be used for _device_ identification"

I may have phrased it unclearly. I was trying to say that things like UDID or IMEI that can be used to identify individual users and/or devices are strictly verboten by app store policies (and PacketZoom doesn't read or use any of those identifiers)

> I may have phrased it unclearly.

cool ! thanks for clarifying the whole thing.

> the problem with tcp is that it conflates congestion loss with error loss. this causes backoff, ergo poor performance under conditions such as lossy links, handovers etc. etc.

Indeed. But while there are many other congestion-handling protocols that are frequently used on smaller networks [1], is there anything better than TCP-style that is thought to work at "Internet scale"?

[1] E.g. Infiniband has some kind of credit/token based flow control, producing a lossless fabric. There's even stuff like that for Ethernet, e.g. https://en.wikipedia.org/wiki/Audio_Video_Bridging has 802.1Qav, https://en.wikipedia.org/wiki/Credit-based_fair_queuing

"In what way is this different from NAT?"

I'm not sure how NAT solves this problem at all, unless I'm missing something. If you lose a TCP connection from a NAT'd device, there's no way to create the same connection ("same" in terms of the {IP1, port1, IP2, port2} tuple) again.

"I'm guessing you don't need to do things in order like TCP, so you can ask to resend and then stitch it in"

Yes that helps. It also helps if the protocol is smart enough to distinguish between congestion based loss vs media error based loss (with a high degree of accuracy).

"Where is the beef?"

Check out my comment upthread (or downthread... where it's placed on the page now) https://news.ycombinator.com/item?id=12021454

See HIP (host identity protocol) for the IP equivalent.

Not quite relevant to the OP article, as the approach you described would still need to traverse the kernel and would not be able to realize the performance gains the article described.

It's not just about locking and context switches. There's a deeper principle at play here (referenced from Julia's original post.)

"It is, he says, simply the end-to-end principle in action yet again. This principle, which says that all of the intelligence in the network belongs at the ends of the connections, doesn't stop at the kernel. It should continue, pushing as much work as possible out of the core kernel and toward the actual applications."

You can take this principle further. The two ends can do much smarter things when they know a lot about each other. What is the app's use case (one static file, lots of tiny files, short API calls, long-running connections with intermittent message exchange, etc.)? What are the current network conditions: mobile radio type/carrier/location/time/subnet etc.? Can the sending side do a quick db lookup to make intelligent decisions about window sizes/retransmits etc.? What if you had billions of data points on network conditions from around the world, from every conceivable network: what could you learn from that?

An app session might make 10 concurrent requests but want to give higher priority (more bandwidth) to the first one. Or even strict priority queueing might be desired. Going further, you might want to cancel requests that are queued up on the server side without losing the built-up state of the connection. And this is just getting started.

There's a lot more. Smarter server discovery, easier key exchange for encryption, independence from the IP address as the endpoint and the resulting resilience against network discontinuities. Once you take control of both ends and narrow the scope down to "solve native mobile apps", the solution space becomes much broader. The article I posted in the GP comment contains a lot more details. Some blog posts on our site contain even more.

Now, I presented the above as "what ifs". The reality is that we're already doing most of those things for our customers. The rest are in the pipeline.

Kernel TCP (or TCP in general) is a limited, one-size-fits-all approach codified (dare I say, ossified) into RFCs a long time ago. Formally there's a process to approve changes to the protocol, but those have to go through a years- if not decades-long process to take effect widely (google the sagas of I10 or TFO if you're interested). The ability to iterate on protocol tweaks on both sides (client and server) within days of conceiving an idea is amazingly powerful.

I've done low-level network programming for most of my career. I've worked a bit in the Linux network stack, as well as writing two partial userspace TCP stacks for different applications. Both were for middlebox-type appliances where the goal was primarily to replicate enough of what the TCP state machines would look like on both ends of the transaction to make some decision (in one case for firewalling, and in the other for analysis to infer user behavior at one end of the connection when it could be pivoting through multiple devices in the network).

It feels like networking isn't really a special case at all; just like most things, the OS networking stacks are well understood, mature, and probably not worth trying to replace if you need a flexible, general-purpose networking stack that balances latency, throughput, correctness, security, memory and CPU efficiency, etc. But because they try to balance a lot of different concerns across a wide swath of use cases, it's not hard to implement a better version if you have a very specific need.

Very well said.

Well I think the main reason is that you don't want to have hardware dependencies in your software. You instead depend on the kernel socket interface, and the kernel abstracts over hardware.

Most of the situations mentioned are where you explicitly have control over the hardware:

    - embedded devices
    - high frequency trading -- obviously they control their own hardware
    - Google -- ditto, data centers have very specific hardware
My understanding is that you can rely on Intel network cards having a similar programming interface, but there is still a fair bit of diversity in other hardware (correct me if I'm wrong).

I think there's no open source user space TCP stack because then you would have to recreate all the portability of the Linux kernel in user space... although I could be wrong about this.

There are DPDK drivers for a large number of NICs, not just Intel ones. Either the kernel or DPDK needs drivers anyhow; it just depends on how they are written.

And when there is a bug in a string function, your program sends data to the wrong IP because, oops! No protected memory! And when your userspace process crashes, nobody is there to close your TCP connections for you. That's a reasonable trade-off if you really need the raw request rate.

But if your web server spends most of its time waiting for databases and talking to caches and network file systems, the common hardware abstraction of the kernel starts being worth it. Or if your binary needs to run on heterogeneous hardware. Or if your single server runs multiple separate server processes.

Hybrid memory mapped stacks may end up being the best of both worlds, though. Time will tell!

What is a "Hybrid memory mapped stack"?

Edit: Oh, you must be referring to the network stack. I thought you meant the program stack (you know, the place where automatic variables in C are allocated).

(I'm still not sure what "hybrid" is supposed to mean here though)

"The TCP standard is evolving, and if you have to always use your kernel's TCP stack, that means you can NEVER EVOLVE." -> This is untrue; the Linux kernel is open source, so you could "easily" write your own stack and replace the current one, either in a branch or by trying to get it into the official tree.

Context: “Google can't force Android vendors to rebase kernels but requires new TCP functionality such as TCP fast open.”

I don’t think they’re actually doing that though.

This also implies that the Linux kernel TCP stack is never evolving which is an odd claim to make.

Dropping in to plug SPDK (http://spdk.io), which is like DPDK but for storage devices. This will become increasingly relevant as SSDs become much faster with next generation media, and storage has the added benefit that all SSDs have one of a couple standardized interfaces and can share drivers.

The equivalent layer of a TCP/IP stack for storage is probably a filesystem, and the kernel filesystems and block layers are at least as inefficient as the network stack for similar reasons.

> ... which is like DPDK but for storage devices.

honest question, with so many 'things' taking matter into their own hands, how do you ensure that they play 'nice' with each other ?

any and all insights are greatly appreciated !

Two answers:

1) Modern hardware is getting much better at multiplexing resources (aka "making the things play nice with each other"). See for example sr-iov.

2) More and more applications are distributed across many machines, with the resources of each machine entirely consumed by that one application. In that case there is only one "thing" running so the problem of multiplexing goes away. This is the premise of technologies like unikernel.

> 1) Modern hardware is getting much better at multiplexing resources (aka "making the things play nice with each other"). See for example sr-iov.

sorry, but with sr-iov, you still have one single 'control' application muxing resources for applications sitting above it. for example, with dpdk, you more or less take over the complete network card.

the 'control' might be even human in some cases :) which/who carefully lays out the resource mapping.

another example: intel's cmt/cat techniques (https://github.com/01org/intel-cmt-cat/wiki) map the nic's rx/tx rings to the processor's l3 cache etc. this is of course assuming that userland applications have complete control over pci-e lanes etc.

now, if you have two such control applications e.g. one doing packet-io and the other doing disk-io, how do you ensure that these control applications don't starve each other out ?

in canonical settings, the kernel would be democratizing (is that even a word?) access to the underlying h/w. but since that is bypassed, we are in a strange new world...

Sharing CPU resources between user space drivers is certainly a challenge. The best way to view this is that the kernel provides a general purpose solution for sharing resources with some associated overhead. Tools like DPDK and SPDK let you opt out of that, but now you are responsible for intelligently sharing the hardware.

You, as the application developer, have a distinct advantage though - you only need to solve the problem for your application, and using that knowledge can often lead to more efficient solutions. This may mean dedicating cores to the network or disk, or it may mean working in fixed sized batches, etc.

> ... you only need to solve the problem for your application, and using that knowledge can often lead to more efficient solutions...

this ! _exactly_ this :) imho, the fundamental re-architecture of the i/o subsystem for x86 machines has kind of relegated this playing field to mostly solving _only_ a s/w problem, rather than a combination of h/w and s/w.

for example, earlier if you wanted to write a very high performance node in, say the epc-core e.g. SGW/PGW/MME etc. you would assemble a bunch of folks with very diverse set of expertise. right from h/w i/o subsystem designers who could do npu's, switch-fabrics etc. to driver dudes, to 'infrastructure' folks to application programmers etc. etc.

in the current incarnation, a vanilla off the shelf x86 machine is more than sufficient. and if your s/w architecture is _right_, you can scale quite easily.

Do you have any experience using POSIX async IO (aio_read, ...) or Linux kernel async IO (io_submit, ...)?

I'm curious how throughput and CPU usage compare between these and SPDK. My (relatively confident) guess is that performance will be better with SPDK, but if it's not that much better, programming against a more general interface (a filesystem) is appealing.

Submitting and completing a 4k I/O using SPDK is about 7 times more CPU efficient than the equivalent operation with libaio, which is opening a raw block device with O_DIRECT.

Said another way, on a recent Xeon CPU you can expect to drive somewhere around 3 million 4k I/O per second with SPDK per core. With libaio, you can do about 450,000 per core off the top of my head.
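For what it's worth, those two per-core figures are consistent with the ~7x efficiency claim above:

```python
# Figures quoted in the comment above (4k I/O per second, per core).
spdk_iops = 3_000_000     # SPDK
libaio_iops = 450_000     # libaio on a raw block device with O_DIRECT

print(f"{spdk_iops / libaio_iops:.1f}x")  # → 6.7x, i.e. roughly 7x
```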

SPDK has no locks or cross core communication required, so it scales linearly with additional CPU cores. Blk-mq in the kernel also helped the kernel scaling problem significantly, but I'm not sure if it is perfectly linear yet.

Most applications need something like a filesystem to function - there is no denying that. Using SPDK requires applications to implement at least the minimal set of features in the filesystem that their application needs. Many databases and storage services already bypass the filesystem or use the filesystem as a block allocator only, so it is not a big leap from there to SPDK.

Something I would really like to see in modern operating systems is the ability to do TCP like UDP, i.e. you bind a socket and then send and receive TCP packets (everything after the ports in the TCP header) as datagrams.

The problem that solves is this. Right now if you don't want to use the OS TCP implementation your choices are a) raw sockets or tun/tap or kernel drivers or something equally heavy-handed that all require privileges, or b) encapsulate in UDP.

Which makes "encapsulate in UDP" a great choice until you see this:

  $ cat /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_established 
  $ cat /proc/sys/net/netfilter/nf_conntrack_udp_timeout_stream 
The reason for that difference is that UDP is connectionless. With TCP a long timeout is reasonable because the state can be discarded earlier, when the middlebox sees the connection close.

So if you need long-lived sessions, UDP requires you to send a lot of keepalives. But "TCP as datagrams" would let you actually look like TCP to the middlebox and get the long timeout and the ability to notify it when you're done, without requiring CAP_NET_ADMIN or similar from the OS.

I blame this mostly on the monolithic kernel design of GNU/Linux. I wish someone would make an effort at a hybrid or pure microkernel GNU/Linux. I'm keeping an eye on MINIX, which looks interesting as well, but the only other interesting network stack I'm aware of is DragonFly BSD's.

Exactly, was going to say the same thing. If you have a microkernel, your TCP stack will be in user space and can be more of a framework where applications can be a lot more flexible in how they use it. With the right library structure you could even have custom OSI layers while reusing the rest. This would certainly be less buggy and of higher quality than everyone rewriting it wholly. In that sense, I'm glad we have efforts to reuse seL4 for general purpose operating systems, because it also has a capability system.

> libuinet is a library version of FreeBSD's TCP stack. I guess there's a theme here.

That is excellent, considering FreeBSD's TCP stack is supposed to be faster and better designed than that of Linux.

TL;DR: we use it because then you don't have to write your own. The real question is why it's slow. The answer to that, according to someone they quoted, is that if you write your own TCP stack, it's inside your application and less stuff has to be copied and switched around (no overhead from receiving in the kernel and then having to pass it on). So it's not that the kernel's implementation is shitty, it's that you'd have to write your own for every application if you wanted to improve it substantially.

Given your comment and the one about having written a stack from scratch twice and regretting it, wouldn't it then be a good idea to use the kernel's battle-hardened code but in user space? At most, try to upstream some patches so the same code can be used in userspace. All the benefits (well-exercised code) and none of the drawbacks (context switching).

Or am I hopelessly naive?

I think you can do this with the NetBSD rump kernel.

> it's not that the kernel's implementation is shitty

It's not shitty, but it's full of locks and does no batching, so it's definitely not suited to modern CPUs.


I guess using a unikernel also falls under this. Do unikernels address some of the problems of running a userspace IP stack on a regular OS?

Well, yeah, I guess. When context switches are the problem, you either move the application into kernel space (which is what unikernels do) or move the network stack into userspace (which is what netmap, etc. do).

One reason to write a user space stack or, at least, a custom kernel module is to avoid the plethora of copying that can occur. Routing VXLAN interfaces to some kind of layer 2 encryption bridge? The kernel is going to copy that shit like 3 times before it hits user space. Writing your own stack can often save you all that copying around between bridges and interfaces.


I'm going to plug my own zero-copy lib for Linux in golang. It has general-purpose TCP/IP support (though no logic for handling TCP communication itself). It's lock-free and thread safe: https://github.com/nathanjsweet/zsocket

>it can do, in some benchmarks, 2 million requests per second.

Is there a standard specification for machines that benchmark network software? Or, when someone quotes a number like this, do they mean "in our environment"?

Based on the article's title, I was expecting reasons to use the Linux kernel's TCP stack - i.e. to actually answer the question. I guess it was rhetorical, though, since the whole article was about reasons why one shouldn't use Linux's TCP stack.

All those reasons, however, reek rather strongly of premature optimization, and that is the best reason why one should and does use the Linux kernel's TCP stack. 99.995% of the time, there are far worse bottlenecks in one's setup than one's TCP implementation.

What would be more interesting, and useful, compared to all these network stacks, would be a machine readable specification of TCP/IP from which a correct implementation could be engineered.

However, the realist in me concedes, the specification itself, given the present state of the art, would probably fix, unsatisfactorily, many implementation details (in order for the implementation to pass the spec).

I believe we need a network protocol with a solid, simple semantics. IP, that is not.

Have you seen the network semantics research project (https://www.cl.cam.ac.uk/~pes20/Netsem/index.html)? It is a formal model of TCP/IP validated against Linux 2.4.20, FreeBSD 4.6, and Windows XP (yes, that was ~10 years ago).

It is nowadays BSD licensed on GitHub https://github.com/PeterSewell/netsem (and I'm currently reviving it https://www.cl.cam.ac.uk/~pes20/HuginnTCP/)...

No, I wasn't aware. Thank you for pointing this work out. This is exactly the kind of thing I was hoping for. Wonderful!

I can't wait to see what you have planned.

I was thinking, after I posted my comment, that it'd be cool if someone could produce a fuzz tester that used both the specification and the fact that you can turn the Linux and NetBSD network stacks into libraries (LibOS and rumpkernel, respectively), and co-engineer/evolve the spec while also finding and fixing bugs in both network stacks.

Excited by what you'll be up to!

hmm, my other OS is MirageOS (https://mirage.io) -- also see https://nqsb.io contains my previous two years of work ;)

I'd rather call it extensive exploration than fuzz testing what is in my mind...

> As far as I can tell, there aren't any available general purpose open source userspace TCP/IP stacks available. There are a few specialized ones

"specialised" is probably key. If you have a hard problem and also time, money and skills you can probably make something that handles your special case very quickly, and doesn't do the rest of the general case at all. That is an expected trade-off.

The kernel TCP stack has two important constraints that alternatives usually don't have:

1. Ability to handle packets to/from multiple applications

2. Expose the BSD socket API

Did a UDP/IP/Ethernet "stack" (relying on a packet driver shim or somesuch to send/receive the Ethernet frames) in NASM for DOS while having a couple months of experience, using Tanenbaum's book and the RFCs.

Worked great (it sat in a TSR and was stress-tested by sending a storm of packets over a short network cable directly to the machine (otherwise the network printers would halt :)), all while Elite:Frontiers ran its Cobra ship animation on the screen).

Were the times :)

[ humble point is: with something as simple as UDP, special-cased for no packet fragmentation etc., and when the scenario is a LAN, you can do it easily, even in a couple of days, if you really have no other options ]

From what I recall, while Linux doesn't have that in the kernel anymore, Microsoft's IIS still uses an in-kernel HTTP driver, but I don't know if it's optional. And to be clear, khttpd never got popular. It was written during the times of the C10K battles, and eventually syscalls were added so that Apache and friends could achieve the same thing without an httpd in the kernel.

Edit: khttpd (real name: TUX) has actually only existed in Red Hat's and SuSE's kernels and was never mainlined.

IIRC Microsoft IIS got pwned really hard because of a bug in that driver…

Not as far as I know.

Have I missed something where you provided an alternative?

The article mentions more than one alternative.
