
Why do we use the Linux kernel's TCP stack? - nkurz
http://jvns.ca/blog/2016/06/30/why-do-we-use-the-linux-kernels-tcp-stack/
======
jjguy
Please don't rewrite your network stack unless you can afford to dedicate a
team to support it full time.

Twice in my career I have been on teams where we decided to rewrite IP or TCP
stacks. The justifications were different each time, though never perf.

The projects were filled with lots of early confidence and successes. "So much
faster" and "wow, my code is a lot simpler than the kernel equivalent, I am
smart!" We shipped versions that worked, with high confidence and enthusiasm.
It was fun. We were smart. We could rewrite core Internet protocol
implementations and be better!

Then the bug reports started to roll in. Our clean implementations started to
get cluttered with nuances in the spec we hadn't appreciated. We wasted weeks
chasing implementation bugs in other network stacks that were de facto but
undocumented parts of the internet's "real" spec. Accommodating these
cluttered that pretty code further. Performance decreased.

In both cases, after about a year, we found ourselves wishing we had not
rewritten the network stack. We started making plans to eliminate the
dependency, now much more complicated because we had to transition active
deployments away.

I have not made that mistake a third time.

If you are Google, Facebook or another internet behemoth that is optimizing
for efficiency at scale and can afford to dedicate a team to the problem, do
it. But if you are a startup trying to get a product off the ground, this is
premature optimization. Stay far, far away.

~~~
adrianratnapala
The original article claims that having the TCP stack in the kernel causes
performance problems because it needs to do excessive locking.

I can't judge, but if that is really true, then in principle a user-space
library could be written to take care of all those corner cases you mention,
and still be faster than the kernel stack.

Of course, that wouldn't mean everyone rolling their own.

~~~
toast0
I've only poked at the FreeBSD TCP stack and not the Linux stack, but it seems
like if the problem is locking, you should be able to get good results from
working on the locking (finer-grained locks / tweaking parameters) in less
time than building a full TCP stack.

What kind of limitations are people seeing with the Linux kernel? If I'm
interpreting Netflix's paper[1] correctly, they could push at least 20 Gbps of
unencrypted content with a single-socket E5-2650L (the document isn't super
clear, though; it says they were designed for 40 Gbps). My servers usually run
out of application CPU before they run out of network -- but I've run some of
them up to 10 Gbps without a lot of tuning.

[1]
[https://people.freebsd.org/~rrs/asiabsd_2015_tls.pdf](https://people.freebsd.org/~rrs/asiabsd_2015_tls.pdf)
Context is accelerating https downloads, but some decent numbers anyway.

~~~
yxhuvud
Not all Gbps are created equal. Traffic with many small packets takes a lot
more resources than traffic with fewer but bigger ones. Netflix's packets
would be as big as they come.

~~~
bogomipz
Yes, in fact for things like Juniper/Cisco firewalls they will always quote
PPS in full-MTU packets. If you want to bring that shiny new firewall to its
knees, try sending it traffic with the minimum MTU of 68 bytes at line rate
for the NIC.
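The difference is easy to put rough numbers on. A back-of-the-envelope sketch
(it ignores Ethernet framing overhead such as the preamble, inter-frame gap,
and FCS, so real-world packet rates are somewhat lower):

```python
# Rough packets-per-second needed to saturate a link at a given packet
# size. Framing overhead is ignored, so real figures are somewhat lower.
def packets_per_second(link_bps: float, packet_bytes: int) -> float:
    return link_bps / (packet_bytes * 8)

line_rate = 10e9  # a 10 Gbps link

# Full-size 1500-byte packets: well under a million pps.
print(f"{packets_per_second(line_rate, 1500):,.0f} pps")  # 833,333 pps

# Minimum-size 64-byte frames: nearly 20 million pps.
print(f"{packets_per_second(line_rate, 64):,.0f} pps")    # 19,531,250 pps
```

Same line rate, roughly a 23x difference in per-packet work for the device.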

------
tingley
Years ago I worked for a company who was building a piece of networking
hardware, and we had bought some code that implemented various well-known
routing protocols. This included the TCP/IP layer, but it was just the BSD
stack.

Our dev hardware had a boot loader that would use FTP to download the most
recent images to finish the boot. After we cut over to this code, we started
having an occasional problem where this download would fail -- the box would
send a spurious RST and kill the connection. I was one of the protocols
people, and it ended up on my plate.

I spent 3 days staring at tcpdump, adding enormous piles of debug traces, and
finally going through the code line-by-line with a copy of the Stevens book
open on my lap. Eventually I found the problem - they had made one single
change to the BSD code, and in so doing introduced the bug. It was compiler-
dependent.

Writing this code correctly is hard.

~~~
digi_owl
> It was compiler-dependent.

Reminds me of a bug I was wrangling in GTK, now hopefully long gone, that only
showed up if the lib was compiled with a certain level of optimization set in
GCC.

------
brendangregg
A few quick notes:

- The Linux network stack has been getting faster. Workarounds that may have
made sense 5 years ago, vs 2.6.something, may make less sense today vs 4.7. It
makes it hard to research the topic, as you'll find statements of fact that
are no longer true today.

- The kernel stack is complex, but then it also handles a lot of TCP/IP
behaviors, e.g. buffer bloat. You might code a lightweight stack that works
great in microbenchmarking (factors faster), but falls apart when connected to
real devices. I've worked on differing TCP/IP implementations that had this
behavior, and the stack that was slower in the lab was faster for the
customer's real-world, messy workload.

- What is Facebook doing? eXpress Data Path:
[https://github.com/iovisor/bpf-docs/raw/master/Express_Data_Path.pdf](https://github.com/iovisor/bpf-docs/raw/master/Express_Data_Path.pdf)
... this is really new and exciting stuff.

- Unikernels :-)

~~~
brendangregg
Another technology worth mentioning is the Linux network parallelism features
(which I think came from Google):

[https://www.kernel.org/doc/Documentation/networking/scaling.txt](https://www.kernel.org/doc/Documentation/networking/scaling.txt)

My colleague at Netflix, Amer Ather, wrote about this tuning and more in a
post detailing some network tuning from a Xen guest, where he was reaching 2
million packets/sec:

[http://techblog.cloudperf.net/2016/05/](http://techblog.cloudperf.net/2016/05/)

------
hal9000xp
I worked on ICQ project as a backend developer (C/C++) after Mail.Ru bought
ICQ from AOL.

We had legacy from AOL - AOL's proprietary TCP stack in user space.

ICQ used AOL's TCP stack for external connections with outer world (i.e. with
clients).

I asked AOL engineers why they needed a TCP stack in user space. They said
that back in the old days (the '90s) there was no good, scalable TCP
implementation that could handle many simultaneous connections properly.

Each process which used this TCP stack had to fully own network interface.

Of course, we gradually replaced AOL's TCP stack with native stack because
nowadays Linux TCP stack is good enough.

Maybe these days someone still needs a proprietary TCP stack in user land to
meet specific needs that aren't addressed by the native Linux implementation.

Also if you have TCP stack in user land, you have more control over it.

~~~
ams6110
AOL had a bespoke webserver too.

[https://en.wikipedia.org/wiki/AOLserver](https://en.wikipedia.org/wiki/AOLserver)

~~~
SwellJoe
I really liked AOLServer! My current company's first website ran on OpenACS,
which was written in Tcl and ran on aolserver. It was really quite a nice
system (and Tcl is a quite nice language).

But, the web server worked with the standard Linux network stack; at least, it
did by the time I saw it.

~~~
nulltype
Tcl may be the most underappreciated language.

------
chmike
The core functionality required is a "zero copy" networking lib: dpdk,
netmap.

Normally, when the kernel receives data from the network, it allocates a
block in the kernel and copies the data into it. Then your read operation
copies that data into your user space. The "zero copy" networking stacks
avoid that data copy. The way it works, as it was explained to me, is that
they use a shared memory-mapped zone. This zone is organized as a pool of
blocks managed with non-blocking lists. Blocks have a fixed size, big enough
to hold a ~1500-byte IP packet. I never used it so I don't know the details.

When data arrives, it is written directly in place in a block of the
memory-mapped zone. In user space you use select/epoll/kqueue, or polling if
the waiting time is very small. Once you have a block you can process it. The
block contains raw network data received from the network card, so it's up to
you to encode and decode the TCP/IP headers, or use an existing lib like mTCP
that does that for you. I was told that it can work with dpdk and netmap.
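As a toy illustration of the pool-of-blocks mechanism described above (this is
only a sketch of the idea, not the actual netmap or dpdk API, and the block
and pool sizes are arbitrary):

```python
from collections import deque

BLOCK_SIZE = 2048   # fixed size, big enough for a ~1500-byte IP packet
NUM_BLOCKS = 8

# Preallocated pool of blocks. In a real zero-copy stack this memory is
# a zone mmap'ed and shared between the driver/kernel and the application.
pool = [bytearray(BLOCK_SIZE) for _ in range(NUM_BLOCKS)]
free_blocks = deque(range(NUM_BLOCKS))  # indices of empty blocks
rx_ring = deque()                       # (index, length) of filled blocks

def nic_writes(frame: bytes) -> None:
    """Stand-in for the NIC: the frame lands directly in a pool block."""
    idx = free_blocks.popleft()
    pool[idx][:len(frame)] = frame      # written in place, no kernel copy
    rx_ring.append((idx, len(frame)))

def app_reads() -> bytes:
    """The application processes the block, then recycles it to the pool."""
    idx, length = rx_ring.popleft()
    # Copied to bytes only to keep the toy simple; the point is that the
    # NIC-to-application path itself involved no intermediate copy.
    data = bytes(memoryview(pool[idx])[:length])
    free_blocks.append(idx)             # block returns to the pool
    return data

nic_writes(b"\x00" * 14 + b"raw packet bytes")
pkt = app_reads()
print(pkt[14:])  # the app decodes headers itself -> b'raw packet bytes'
```

In the real libraries the "produce" side is the network card DMA-ing frames
into the shared rings, which is why only supported NICs work.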

My colleague is currently using netmap for a high-performance data
acquisition application in a LAN and plans to test dpdk with mTCP this summer.
mTCP should simplify programming. At CERN they are now testing data
acquisition setups using dpdk in order to use commodity hardware.

My colleague told me that netmap is available in the BSD kernel, so you can
use it right away. It is not included in the Linux kernel, so there you need
to patch it in. Zero copy is the future of network programming; Linux is late
on this one. Then there is dpdk, on which I don't have much info yet except
that it is made by Intel, is open source, and is compatible with AMD
processors. It is apparently not easy to install.

Since dpdk and netmap communicate directly with the network card, they only
work with supported network cards.

The gain in performance is significant, but I have no numbers at hand to give.

~~~
aduitsis
FreeBSD netmap user here. You actually have to recompile the kernel with
"device netmap" added to your kernconf. Piece of cake; after 20 minutes you
are good to go. But you need a real network card, and the FreeBSD driver must
be ready for netmap. Using Intel 10 Gbps adapters (~200 euros) is a safe
avenue (the FreeBSD ixgbe driver). Even in VMware, you can pass through the
PCI address of the adapter port to your virtual machine and have it talk to
the card directly. Everything works very well.

The gain in performance is mind-boggling! Trying to sniff approx. 2+ Gbps of
traffic with Suricata using the "normal" avenue of libpcap ends up dropping a
small percentage of the packets, and the machine wastes an incredible amount
of CPU. Using Suricata with netmap (no need to recompile; the Suricata pkgng
binary build from FreeBSD comes ready) uses exactly one capture thread and
drops ZERO packets. This behavior is stable for days!

Netmap is hands down awesome.

~~~
cm3
I was looking at Chelsio NICs and there were mentions of netmap support. Do
you know what it means for a NIC to support netmap vs one that doesn't? Is it
an extra optimization/fast path?

~~~
aduitsis
Can't give a good technical answer to that, but I suspect it's mostly a
matter of the driver. When you mmap /dev/netmap from userland, the OS TCP/IP
stack is disconnected and you get access to the card's tx/rx rings. Obviously
the driver has to facilitate this.

------
cgag
People who are interested in this should take a look at DragonflyBSD:
[https://www.dragonflybsd.org/](https://www.dragonflybsd.org/)

The things this quote claims are crucial to high-performance networking, and
that need to be done outside of the Linux kernel, are already done in the
DragonFly BSD kernel:

    
    
        The key to better networking scalability, says Van, 
        is to get rid of locking and shared data as much as
        possible, and to make sure that as much processing work
        as possible is done on the CPU where the application is
        running. It is, he says, simply the end-to-end
        principle in action yet again. This principle, which
        says that all of the intelligence in the network
        belongs at the ends of the connections, doesn't stop at the kernel.
    

DragonFly BSD solves the locking and shared data problem by running a network
stack per CPU core. If you've got 4 cores, you've got 4 kernel threads
handling TCP. I don't understand the details of scheduling processes to the
CPU handling their network connections, but Matt Dillon, the head of the
project, claims it's accounted for in the scheduler:

    
    
        <dillon>	we already do all of that
        <dillon>	its a fairly sophisticated scheduling algorithm.  
        How processes get grouped together and vs the network
        protocol stacks depends on the cpu topology and the load on the machine


edit: If you want to try it out on a VPS, vultr is the best one I know that
allows you to upload custom ISOs
([https://www.vultr.com/coupons/](https://www.vultr.com/coupons/)). There was
a bug fixed recently in the virtio drivers that should make it pretty stable
running on a VPS. You'll have to wait for 4.5 (a few weeks from now) or just
upgrade from master (very easy).

~~~
terminalcommand
Is it feasible to use DragonFly BSD for small-to-medium-sized sites? I am
currently using NetBSD for my servers.

~~~
cgag
I don't see why it wouldn't be feasible for a site of any size. My only
concern is running into issues with HAMMER. I was unable to mount my HAMMER
partition after I filled up the drive. Not sure if it's a virtio bug or
something that would affect actual disks as well, but that's something to be
careful with.

------
chetanahuja
We don't use the kernel TCP implementation because we decided to rethink the
entire networking stack (from DNS to TCP/HTTP/TLS) as it applies to native
mobile apps. So we wrote this whole stack, but using UDP instead of TCP, and
of course implemented entirely in userspace on both the client side and the
server side.

Today's mobile app ecosystem has the amazing advantage of rapidly updating
binary apps running on a billion devices. This allows us to ship new client
side code (with app updates) to millions of devices every few days. It's the
first time in computing history that such rapid iteration on network protocols
has been possible and the only way to take advantage is to ditch the kernel
and ship your code with the app.

EDIT: I wrote a thing about it in much more detail than the space here will
afford:
[http://www.infoworld.com/article/3016733/application-development/in-search-of-a-cure-for-slow-mobile-downloads.html](http://www.infoworld.com/article/3016733/application-development/in-search-of-a-cure-for-slow-mobile-downloads.html)

~~~
patmcguire
Questions: (Disclaimer, I know only the surface of networking)

"For example, PZP is able to identify the device, rather than its IP address,
as the endpoint for data packets, allowing it to accommodate the intermittent
nature of mobile connections in a fault-tolerant manner":

In what way is this different from NAT? Is this somehow aware of the phone
switching towers (topology changes)? Or is there another advantage? How do
they get that information?

"The PacketZoom protocol recovers from dropped packets “gracefully,” with
minimal overhead above the amount of data lost in the dropped packets."

I don't know signal processing well enough to know how this works. I'm
guessing you don't need to do things in order like TCP, so you can ask for a
resend and then stitch it in?

Where is the beef? Is it in following devices rather than IPs? What does the
interface look like?

~~~
signa11
> ... I'm guessing you don't need to do things in order like TCP ...

kind of tangentially relevant: the problem with tcp is that it conflates
congestion loss with error loss. this causes backoff, ergo poor performance
under conditions such as lossy links, handovers etc. etc.

udp doesn't really suffer from any of this because, well you do everything
yourself. another advantage with udp, imho, is that server side can be scaled
quite easily :) e.g. you can have multiple processes terminating a udp control
protocol, and demuxing connections onto session tasks which can be spread
around.
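a minimal sketch of that kind of server-side scale-out, using SO_REUSEPORT
(my choice for the example; there are other ways to spread load across
processes) so that several worker sockets share one udp port and the kernel
demuxes incoming datagrams between them:

```python
import select
import socket

# Two "worker" sockets bound to the same UDP port via SO_REUSEPORT; on
# Linux the kernel hashes each datagram's 4-tuple to pick a worker, so
# a given client sticks to one worker.
workers = []
port = 0
for _ in range(2):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))   # first bind picks the port,
    port = s.getsockname()[1]     # the others reuse it
    workers.append(s)

# A client sends one datagram; exactly one worker receives it.
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"hello", ("127.0.0.1", port))

ready, _, _ = select.select(workers, [], [], 1.0)
data, peer = ready[0].recvfrom(1500)
print(data)  # b'hello'
```

in a real deployment each worker would be its own process, handing connections
off to session tasks as described above.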

i am kind of _surprised_ that ip addresses for devices end up changing. this
might _only_ be true if you are handing over from/to different radio access
networks wifi -> lte -> wifi etc. etc. roaming within a network would (must ?)
not end up changing your addresses at all. for example, in an LTE network, PGW
being an ip anchor, assigns addresses to the devices, and there is only one
such 'thing' per epc core (slightly convoluted when moving from one operator
to another, but the basic idea is still there)

from the article linked by GP, it seems that each device needs to register
with a cloud based server, via which the device traffic is routed. this
initial registration can be used to exchange things like IMEI or somesuch for
device identification, which then remains the same as the device moves around.

~~~
chetanahuja
_" this initial registration can be used to exchange things like IMEI or
somesuch for device identificatio"_

The PacketZoom stack does create our own randomly generated identifier per
app install (so two different apps on the same device will have different
ids). Nothing at all that would positively identify a device or a user (IMEI,
UDID, etc.). That's a strict no-no by app/play store policies.

~~~
signa11
> Nothing at all that would positively identify a device or a user (IMEI, UDID
> etc.) ...

if we ignore (for a moment) that play-store/app-store policies forbid imei use
by applications, i am just curious why you think IMEI cannot be used for
_device_ identification ? thanks !

~~~
chetanahuja
_" forbid imei use by applications, i am just curious why you think IMEI
cannot be used for _device_ identification"_

I may have phrased it unclearly. I was trying to say that things like UDID or
IMEI, which can be used to identify individual users and/or devices, are
strictly verboten by app store policies (and PacketZoom doesn't read or use
any of those identifiers).

~~~
signa11
> I may have phrased it unclearly.

cool ! thanks for clarifying the whole thing.

------
rebeccaskinner
I've done low-level network programming for most of my career, including a
bit in the Linux network stack, as well as writing two partial userspace TCP
stacks for different applications. Both were for middlebox-type appliances
where the goal was primarily to replicate enough of what the TCP state
machines would look like on both ends of the connection to make some decision
(in one case for firewalling; in the other, for analysis trying to make
inferences about user behavior at one end of the connection when they could
be pivoting through multiple devices in the network).

It feels like networking isn't really a special case at all. Like most
things, the OS networking stacks are well understood, mature, and probably
not worth trying to replace if you need a flexible, general-purpose
networking stack that balances latency, throughput, correctness, security,
memory and CPU efficiency, etc. But because they are trying to balance a lot
of different concerns across a wide swath of use cases, it's not hard to
implement a better version if you have a very specific need.

~~~
vocatus_gate
Very well said.

------
chubot
Well, I think the main reason is that you don't want to have hardware
dependencies in your software. You instead depend on the kernel socket
interface, and the kernel abstracts over the hardware.

Most of the situations mentioned are where you explicitly have control over
the hardware:

    
    
        - embedded devices
        - high frequency trading -- obviously they control their own hardware
        - Google -- ditto, data centers have very specific hardware
    

My understanding is that you can rely on Intel network cards having a similar
programming interface, but there is still a fair bit of diversity in other
hardware (correct me if I'm wrong).

I think there's no open source user space TCP stack because then you would
have to recreate all the portability of the Linux kernel in user space...
although I could be wrong about this.

~~~
virtuallynathan
There are DPDK drivers for a large number of NICs, not just Intel ones.
Either the kernel or DPDK needs drivers anyhow; it just depends on how they
are written.

------
jwatte
And when there is a bug in a string function, your program sends data to the
wrong IP because OOPS! No protected memory! And when your user space process
crashes, nobody is there to close your TCP connections for you. That's a
reasonable trade off if you really need the raw request rate.

But if your web server spends most of its time waiting for databases and
talking to caches and network file systems, the common hardware abstraction
of the kernel starts being worth it. Or if your binary needs to run on
heterogeneous hardware. Or if your single server runs multiple separate
server processes.

Hybrid memory mapped stacks may end up being the best of both worlds, though.
Time will tell!

~~~
catern
What is a "Hybrid memory mapped stack"?

Edit: Oh, you must be referring to the network stack. I thought you meant the
program stack (you know, the place where automatic variables in C are
allocated).

(I'm still not sure what "hybrid" is supposed to mean here though)

------
Qantourisc
"The TCP standard is evolving, and if you have to always use your kernel's TCP
stack, that means you can NEVER EVOLVE." -> This is untrue, the Linux kernel
is open source, you could "easily" write your own and replace the current.
Either in a branch or trying to get it in the official tree.

~~~
deno
Context: “Google can't force Android vendors to rebase kernels but requires
new TCP functionality such as TCP fast open.”

I don’t think they’re actually doing that though.

------
benlwalker
Dropping in to plug SPDK ([http://spdk.io](http://spdk.io)), which is like
DPDK but for storage devices. This will become increasingly relevant as SSDs
become much faster with next generation media, and storage has the added
benefit that all SSDs have one of a couple standardized interfaces and can
share drivers.

The equivalent layer of a TCP/IP stack for storage is probably a filesystem,
and the kernel filesystems and block layers are at least as inefficient as the
network stack for similar reasons.

~~~
signa11
> ... which is like DPDK but for storage devices.

honest question: with so many 'things' taking matters into their own hands,
how do you ensure that they play 'nice' with each other ?

any and all insights are greatly appreciated !

~~~
shykes
Two answers:

1) Modern hardware is getting much better at multiplexing resources (aka
"making the things play nice with each other"). See for example sr-iov.

2) More and more applications are distributed across many machines, with the
resources of each machine entirely consumed by that one application. In that
case there is only one "thing" running, so the problem of multiplexing goes
away. This is the premise of technologies like unikernels.

~~~
signa11
> 1) Modern hardware is getting much better at multiplexing resources (aka
> "making the things play nice with each other"). See for example sr-iov.

sorry, but with sr-iov, you still have one single 'control' application muxing
resources for applications sitting above it. for example, with dpdk, you more
or less take over the complete network card.

the 'control' might be even human in some cases :) which/who carefully lays
out the resource mapping.

another example: intel's cmt/cat techniques
([https://github.com/01org/intel-cmt-cat/wiki](https://github.com/01org/intel-cmt-cat/wiki))
map a nic's rx/tx rings to the processor's l3 cache etc. this is of course
assuming that userland applications have complete control over pci-e lanes
etc.

now, if you have two such control applications e.g. one doing packet-io and
the other doing disk-io, how do you ensure that these control applications
don't starve each other out ?

in canonical settings, kernel would be democratizing (is that even a word ?)
access to the underlying h/w. but since that is bypassed, we are in a strange
new world...

~~~
benlwalker
Sharing CPU resources between user space drivers is certainly a challenge. The
best way to view this is that the kernel provides a general purpose solution
for sharing resources with some associated overhead. Tools like DPDK and SPDK
let you opt out of that, but now you are responsible for intelligently sharing
the hardware.

You, as the application developer, have a distinct advantage though - you only
need to solve the problem for your application, and using that knowledge can
often lead to more efficient solutions. This may mean dedicating cores to the
network or disk, or it may mean working in fixed sized batches, etc.

~~~
signa11
> ... you only need to solve the problem for your application, and using that
> knowledge can often lead to more efficient solutions...

this ! _exactly_ this :) imho, the fundamental re-architecture of the I/O
subsystem for x86 machines has kind of relegated this playing field to mostly
solving _only_ a s/w problem, rather than a combination of h/w and s/w.

for example, earlier if you wanted to write a very high performance node in,
say the epc-core e.g. SGW/PGW/MME etc. you would assemble a bunch of folks
with very diverse set of expertise. right from h/w i/o subsystem designers who
could do npu's, switch-fabrics etc. to driver dudes, to 'infrastructure' folks
to application programmers etc. etc.

in the current incarnation, a vanilla off the shelf x86 machine is more than
sufficient. and if your s/w architecture is _right_, you can scale quite
easily.

------
zrm
Something I would really like to see in modern operating systems is the
ability to do TCP like UDP, i.e. you bind a socket and then send and receive
TCP packets (everything after the ports in the TCP header) as datagrams.

The problem that solves is this. Right now, if you don't want to use the OS
TCP implementation, your choices are a) raw sockets or tun/tap or kernel
drivers or something equally heavy-handed, all of which require privileges,
or b) encapsulating in UDP.

Which makes "encapsulate in UDP" a great choice until you see this:

    
    
      $ cat /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_established 
      432000
      $ cat /proc/sys/net/netfilter/nf_conntrack_udp_timeout_stream 
      180
    

The reason for that is that UDP is connectionless. With TCP, a long timeout
is reasonable because the state can be discarded earlier, when the middlebox
sees the connection close.

So if you need long-lived sessions, UDP requires you to send a lot of
keepalives. But "TCP as datagrams" would let you actually look like TCP to
the middlebox, get the long timeout, and be able to notify the middlebox when
you're done, all without requiring CAP_NET_ADMIN or similar from the OS.
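What those keepalives look like in practice is trivial but wasteful: the
application has to emit periodic no-op datagrams just to keep middlebox state
alive. A sketch (the 25-second interval is a hypothetical choice, kept well
under the ~180-second conntrack timeout shown above; the demo shrinks it so
the example runs quickly):

```python
import socket
import threading
import time

KEEPALIVE = b"\x00"   # a no-op datagram the peer simply ignores
INTERVAL = 25.0       # hypothetical; must stay well under the
                      # middlebox's ~180 s UDP timeout

def keepalive_loop(sock: socket.socket, peer, stop: threading.Event,
                   interval: float = INTERVAL) -> None:
    """Send a tiny datagram every `interval` seconds so stateful
    middleboxes (NAT, conntrack) keep the UDP "session" alive."""
    while not stop.wait(interval):
        sock.sendto(KEEPALIVE, peer)

# Demo on loopback with a short interval.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
stop = threading.Event()
t = threading.Thread(target=keepalive_loop,
                     args=(client, server.getsockname(), stop, 0.05))
t.start()

time.sleep(0.2)       # let a few keepalives go out
stop.set()
t.join()

server.settimeout(1.0)
data, _ = server.recvfrom(16)
print(data)  # b'\x00'
```

With real TCP semantics visible to the middlebox, none of this traffic would
be necessary.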

------
arca_vorago
I blame this mostly on the monolithic kernel design of GNU/Linux. I wish
someone would make an effort at a hybrid or pure microkernel GNU/Linux. I'm
keeping an eye on MINIX, which looks interesting as well, but the only other
interesting network stack I'm aware of is DragonFly BSD.

~~~
cm3
Exactly, was going to say the same thing. If you have a microkernel, your TCP
stack will be in user space and can be more of a framework where applications
can be a lot more flexible in how they use it. With the right library
structure you could even have custom OSI layers while reusing the rest. This
would certainly be less buggy and of higher quality than everyone rewriting it
wholly. In that sense, I'm glad we have efforts to reuse seL4 for general
purpose operating systems, because it also has a capability system.

------
hiphopyo
> libuinet is a library version of FreeBSD's TCP stack. I guess there's a
> theme here.

That is excellent, considering FreeBSD's TCP stack is supposed to be faster
and better designed than that of Linux.

------
lucb1e
TL;DR: we use it because then you don't have to write your own. The real
question is why it's slow. The answer to that, according to someone they
quoted, is that if you write your own TCP stack, it's inside your application
and less stuff has to be copied and switched around (no overhead from
receiving in the kernel and then having to pass it on). So it's not that the
kernel's implementation is shitty, it's that you'd have to write your own for
every application if you wanted to improve it substantially.

~~~
greggman
Given your comment, and the one about having written a stack from scratch
twice and regretting it, wouldn't it then be a good idea to use the kernel's
battle-hardened code, but in user space? At most, try to upstream some
patches so the same code can be used in userspace. All the benefits
(well-exercised code) and none of the drawbacks (context switching).

Or am I hopelessly naive?

~~~
nulltype
I think you can do that with the NetBSD rump kernel.

------
KayEss
I guess using a unikernel also falls under this. Do they address some of the
problems of trying this on a regular OS with a user space IP stack?

~~~
floatboth
Well, yeah, I guess. When context switches are the problem, you either move
the application into kernel space (which is what unikernels do) or move the
network stack into userspace (which is what netmap, etc. do).
------
hacknat
One reason to write a user space stack or, at least, a custom kernel module is
to avoid the plethora of copying that can occur. Routing VXLAN interfaces to
some kind of layer 2 encryption bridge? The kernel is going to copy that shit
like 3 times before it hits user space. Writing your own stack can often save
you all that copying around between bridges and interfaces.

Edit:

I'm going to plug my own zero-copy lib for Linux, written in Go. It has
general-purpose TCP/IP support (though no logic for handling TCP
communication itself). It's lock-free and thread-safe:
[https://github.com/nathanjsweet/zsocket](https://github.com/nathanjsweet/zsocket)

------
known
Reminds me of khttpd:
[http://lxr.free-electrons.com/source/net/khttpd/?v=2.4.37](http://lxr.free-electrons.com/source/net/khttpd/?v=2.4.37)

~~~
cm3
From what I recall, while Linux doesn't have that in the kernel anymore,
Microsoft's IIS still uses an httpd driver, but I don't know if it's
optional. And to be clear, khttpd never got popular. It was written during
the C10K battles, and eventually syscalls were added for Apache and friends
to achieve the same without an httpd in the kernel.

Edit: khttpd (real name: tux) has actually only existed in Red Hat's and
SuSE's kernel and was never mainlined.

~~~
floatboth
IIRC Microsoft IIS got pwned really hard because of a bug in that driver…

~~~
yuhong
Not as far as I know.

~~~
cm3
[https://technet.microsoft.com/en-us/library/security/ms15-034.aspx](https://technet.microsoft.com/en-us/library/security/ms15-034.aspx)

[https://github.com/xPaw/HTTPsys](https://github.com/xPaw/HTTPsys)

[https://ma.ttias.be/remote-code-execution-via-http-request-in-iis-on-windows/](https://ma.ttias.be/remote-code-execution-via-http-request-in-iis-on-windows/)

------
xxr
>it can do, in some benchmarks, 2 million requests per second.

Is there a standard specification for machines that benchmark network
software? Or, when someone quotes a number like this, do they mean "in our
environment"?

------
yellowapple
Based on the article's title, I was expecting reasons to use the Linux
kernel's TCP stack - i.e. to actually answer the question. I guess it was
rhetorical, though, since the whole article was about reasons why one
_shouldn't_ use Linux's TCP stack.

All those reasons, however, reek rather strongly of premature optimization,
and _that_ is the best reason why one should and does use the Linux kernel's
TCP stack. 99.995% of the time, there are far worse bottlenecks in one's setup
than one's TCP implementation.

------
throwaway000002
What would be more interesting, and useful, compared to all these network
stacks, would be a machine readable specification of TCP/IP from which a
correct implementation could be engineered.

However, the realist in me concedes, the specification itself, given the
present state of the art, would probably fix, unsatisfactorily, many
implementation details (in order for the implementation to pass the spec).

I believe we need a network protocol with a solid, simple semantics. IP, that
is not.

~~~
hannesm
You have seen the network semantics research project
[https://www.cl.cam.ac.uk/~pes20/Netsem/index.html](https://www.cl.cam.ac.uk/~pes20/Netsem/index.html)?
It is a formal model of TCP/IP validated against Linux 2.4.20 / FreeBSD 4.6 /
Windows XP (yes, that was ~10 years ago).

It is nowadays BSD licensed on GitHub
[https://github.com/PeterSewell/netsem](https://github.com/PeterSewell/netsem)
(and I'm currently reviving it
[https://www.cl.cam.ac.uk/~pes20/HuginnTCP/](https://www.cl.cam.ac.uk/~pes20/HuginnTCP/))...

~~~
throwaway000002
No, I wasn't aware. Thank you for pointing this work out. This is _exactly_
the kind of thing I was hoping for. Wonderful!

I can't wait to see what you have planned.

I was thinking, after I posted my comment, that it'd be cool if someone could
produce a fuzz tester that used both the specification and the fact that you
can turn the Linux and NetBSD network stacks into libraries (libOS and
rumpkernel respectively), and co-engineer/evolve the spec while also finding
and fixing bugs in both network stacks.

Excited by what you'll be up to!

~~~
hannesm
hmm, my other OS is MirageOS ([https://mirage.io](https://mirage.io)) -- also
see [https://nqsb.io](https://nqsb.io), which contains my previous two years
of work ;)

I'd rather call what I have in mind extensive exploration rather than fuzz
testing...

------
SideburnsOfDoom
> As far as I can tell, there aren't any available general purpose open source
> userspace TCP/IP stacks available. There are a few specialized ones

"specialised" is probably key. If you have a hard problem and also time, money
and skills you can probably make something that handles your special case very
quickly, and doesn't do the rest of the general case at all. That is an
expected trade-off.

------
vbernat
The kernel TCP stack has two important constraints that alternatives usually
don't have:

1. Ability to handle packets to/from multiple applications

2. Expose the BSD socket API

------
mlvljr
Did a UDP/IP/Ethernet "stack" (it relied on a packet driver shim or some such
to send/receive the Ethernet frames) in NASM for DOS, while having a couple
of months of experience, using Tanenbaum's book and the RFCs.

Worked great (it sat in a TSR and was stress-tested by sending a storm of
packets over a short network cable directly to the machine (otherwise network
printers would halt :)) while Elite:Frontiers was running its Cobra ship
animation on the screen).

Were the times :)

[ humble point is, with things as simple as UDP, special-cased for no packet
fragmentation, etc., and when the scenario is a LAN, you can do it easily,
even in a couple of days, if you really have no other options ]

------
tcarey83
Have I missed something where you provided an alternative?

~~~
adamnemecek
The article mentions more than one alternative.

