
10 Million Concurrent Connections – The Kernel is the Problem - snaky
http://highscalability.com/blog/2013/5/13/the-secret-to-10-million-concurrent-connections-the-kernel-i.html
======
haberman
> The kernel is the problem

I had an epiphany one day when I realized that the kernel is nothing but a
library with an expensive calling convention.

The only reason we bother calling the kernel at all is because it has
privileges that userspace programs don't have, and it uses those privileges to
control/mediate access to shared resources.

The downside of the kernel is that there's no way around it. You can't "opt
out" of any of its decisions. With any normal library, if had an O(n^2)
algorithm somewhere, or wasn't optimized for your use case, or just generally
got in the way, you could choose not to link against it. User-space libraries
are democratic in this respect; you vote with your linker line. But with the
kernel, it's "my way or the highway." The kernel is the only way to the
hardware.
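
You can feel that calling convention directly. A minimal sketch (Linux
assumed; a crude timer, not a rigorous benchmark):

    /* Time a trivial syscall in a tight loop. We call syscall(SYS_getpid)
       directly because glibc may cache getpid(). Expect tens to hundreds
       of ns per call, versus ~1 ns for a plain function call - the
       "expensive calling convention" in action. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void) {
        struct timespec a, b;
        long i, n = 1000000;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (i = 0; i < n; i++)
            syscall(SYS_getpid);            /* cross into the kernel */
        clock_gettime(CLOCK_MONOTONIC, &b);
        double ns = ((b.tv_sec - a.tv_sec) * 1e9
                     + (b.tv_nsec - a.tv_nsec)) / (double)n;
        printf("%.1f ns per syscall\n", ns);
        return 0;
    }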

Here's an unfortunate example: O_DIRECT is one of those few ways that you
_can_ sidestep the kernel. With O_DIRECT you can bypass the kernel's page
cache, which there are very good reasons for wanting to do. But Linus's
opinion is "I should have fought back harder"
(<https://lkml.org/lkml/2007/1/10/233>). He thinks it's _unfortunate_ that you
can sidestep the kernel, because "You need a buffer whatever IO you do, and it
might as well be the page cache."

But what if you want better memory accounting, or better isolation between
users, or a different data structure, or any number of other tweaks to the
page cache implementation? Well thankfully we have O_DIRECT. Otherwise, your
only choice would be to try to convince the Linux kernel maintainers to
integrate your change, after you've tweaked it so that it's suitable for
absolutely everyone else that uses Linux, and given it an interface that Linux
is willing to support forever. Good luck with that.
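
For reference, here's roughly what opting out via O_DIRECT looks like (a
minimal sketch; the filename is made up, and the 4096-byte alignment is an
assumption - the real requirement depends on the device's logical block size):

    /* Bypass the page cache: buffer, offset, and length must all be
       suitably aligned, because the kernel is no longer buffering for us. */
    #define _GNU_SOURCE            /* O_DIRECT is a GNU extension flag */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY | O_DIRECT);
        if (fd < 0) return 1;
        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) return 1;
        ssize_t n = read(fd, buf, 4096);   /* DMA straight into our buffer */
        /* ...cache, account for, and evict this data on our own terms... */
        free(buf);
        close(fd);
        return n < 0;
    }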

The kernel has always been the problem. User-space is where it's at.

~~~
adamnemecek
People have had that realization in the past.

<http://en.wikipedia.org/wiki/Exokernel>

<http://en.wikipedia.org/wiki/Microkernel>

<http://en.wikipedia.org/wiki/Tanenbaum%E2%80%93Torvalds_debate>

There are good reasons why they have not caught on, performance being the most
salient one.

~~~
haberman
> People have had that realization in the past

"The kernel is just a library" isn't exactly the same sentiment as "the kernel
should be as small as possible" -- I believed the latter before I fully
understood the former. "The kernel is just a library" means that all of the
experience we have designing and factoring userspace APIs carries over into
kernel design. Furthermore it means that the kernel is a strictly less
flexible library than userspace libraries, with a strictly more expensive
calling convention, and that its _only_ advantage is that it can protect and
mediate access to hardware.

> There are good reasons why they have not caught on, performance being the
> most salient one.

Most of the received wisdom about microkernels is based on outdated designs
like Mach, and not modern designs like L4. L4 is significantly more efficient
than Mach.

~~~
adamnemecek
Exokernels use library operating systems.

L4 was also a one-of-a-kind endeavor, extremely optimized for a specific
architecture. I can't really imagine that something like this would be
achievable for a commercial OS.

~~~
haberman
OKL4 is built on a third-generation L4 microkernel and is deployed to over 1.5
billion mobile devices:
<http://www.ok-labs.com/releases/release/ok-labs-software-surpasses-milestone-of-1.5-billion-mobile-device-shipments>

~~~
adamnemecek
"[OKL4] is a microkernel-based embedded hypervisor". That does not quite sound
like an OS.

~~~
tmzt
What is an OS but a process hypervisor?

------
minimax
_Don’t scribble data all over memory via pointers. Each time you follow a
pointer it will be a cache miss: [hash pointer] - > [Task Control Block] ->
[Socket] -> [App]. That’s four cache misses._

Incidentally, game programmers have been spreading the gospel on this issue
for several years now. For more, see:

<http://macton.smugmug.com/gallery/8936708_T6zQX#!i=593426709&k=ZX4pZ>

<http://research.scee.net/files/presentations/gcapaustralia09/Pitfalls_of_Object_Oriented_Programming_GCAP_09.pdf>

<http://www.altdevblogaday.com/2011/11/07/an-example-in-data-oriented-design-sound-parameters/>
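
The core trick is replacing pointer chains with contiguous arrays of just the
hot data. A toy sketch of the contrast (types are mine, purely illustrative):

    #include <stddef.h>

    struct socket_info;
    struct app_state;

    /* Pointer-chasing layout: each hop is a likely cache miss. */
    struct conn {
        struct conn *next;          /* -> miss */
        struct socket_info *sock;   /* -> miss */
        struct app_state *app;      /* -> miss */
        unsigned last_seen;
    };

    /* Data-oriented layout: the hot fields packed contiguously, so a
       scan over all connections streams through cache lines in order. */
    void sweep_idle(unsigned *last_seen, unsigned *state, size_t n,
                    unsigned now) {
        for (size_t i = 0; i < n; i++)  /* sequential: prefetcher-friendly */
            if (now - last_seen[i] > 60)
                state[i] = 0;           /* mark idle without touching any
                                           of the cold per-connection data */
    }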

~~~
dietrichepp
I couldn't help but laugh reading the first link. Sure, there are some good
ideas for optimization. But (1) no profiling, and (2) there are a couple
really bogus suggestions.

For example, suggestion #33, to shift by a constant amount instead of a
variable amount. You see, it only _looks_ like you're shifting by a variable
amount. This is one of those things that compilers have been optimizing for
years and are very good at: strength reduction wrt variables that depend on
the loop variable.

You'll also see game programmers do things like "x >> 4" instead of "x/16",
because "x >> 4" is faster. It _is_ faster in assembly language, but your
compiler already knows that and you are just making the code harder to read
every time you do a micro-optimization.
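
To be concrete: for unsigned operands any optimizing compiler emits the same
shift for both spellings (check with "gcc -O2 -S"), and for signed operands
the two aren't even equivalent on negative values, so the "fast" version can
quietly change behavior. A small sketch:

    #include <stdio.h>

    unsigned udiv(unsigned x) { return x / 16; }  /* compiles to a shift */
    unsigned ushr(unsigned x) { return x >> 4; }  /* identical object code */

    int sdiv(int x) { return x / 16; }  /* rounds toward zero */
    int sshr(int x) { return x >> 4; }  /* rounds toward minus infinity
                                           (on two's-complement targets) */
    int main(void) {
        printf("%d %d\n", sdiv(-1), sshr(-1));  /* prints: 0 -1 */
        return 0;
    }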

Game programmers spread the gospel on lots of premature optimization nonsense,
in my opinion, and are mostly inexperienced when it comes to writing
maintainable code. It's a kind of hazard of the industry. Performance problems
mean cutting beloved features, and rather than doing any maintenance you just
start a new game from scratch. (Not universally, of course. There are a few
programmers professionally working on engines.)

~~~
to3m
Slide #33 is actually reasonable advice. Variable shift = 11 cycle latency on
PS3/Xbox360, and it blocks both threads and disables interrupts as it runs.
(Will the compiler figure this out? Maybe, maybe not. But if you write the
code you want - which in this case you might as well, since the transformation
is simple - then you won't have to worry about that. As a general point of
principle you should write the code you want, rather than something else that
you hope will become the code you have in mind; computers are very bad at
figuring out intent, but excellent at simply following instructions.)

What are the other bogus suggestions? The overall thrust of the slides seems
to me valid: know what's expensive (branches, memory access, mini-pitfalls
like the microcoded thing), pick a strategy that avoids all of that, and don't
throw away performance. Performance always ends up being an issue; you don't
have to carefully preserve every last cycle, but that's no excuse for just
pissing away cycles doing pointless stuff.

Not explicitly stated, but apparent from comparing the suggested approach to
the original one, is that you can't always just leave this stuff to the last
minute, when you've had the profiler tell you what's a bottleneck and what
isn't. The requisite changes might have far-reaching consequences, and so it's
worth giving a bit of thought to performance matters when you start and things
are a bit more in flux.

A bit of domain knowledge also won't hurt. I bet if this function were ten
times worse, but called "PostDeserializeFromDVD", it wouldn't attract very
much attention.

~~~
smegel
> but that's no excuse for just pissing away cycles doing pointless stuff.

Yes there is - maintainable code, and programmer time and effort. What are you
worried about - a large power bill due to your CPU doing what it's supposed to
be doing?

On an unrelated note, this kind of attitude is the first thing I test for
during a programmer interview, and is my strongest case for eliminating
potential candidates. I made the mistake once of letting one through - his
first task was a relatively straightforward modification to a large C program.
A week later I was a bit worried he hadn't reported back as done, so I went to
check up on him, and it turned out he was busy changing _every line of code_
to adjust formatting and variable names, not to mention making these kinds of
pointless micro-optimizations. And he hadn't even checked in once; he was
saving the checkin itself for another multiple-day effort. Sigh. I tried using
the "premature optimization is the root of all evil" line on him to get my
point across (and see if he had heard of it), and when I saw his eyes flare up
in anger I knew he had to go. Sad really, because he was otherwise quite
bright.

Now I basically put C++/game programmer applications in a "special pile" to be
considered as a last resort. I just don't need this kind of arrogance and
cowboy mentality wrecking the place. It's like sending a bull into a china
shop.

~~~
to3m
If performance is a requirement, it's a requirement, and you need to bear it
in mind. And working in games, it usually is. Virtually every project has
problems with performance, and dealing with the issues properly at the end of
a project can be very hard. By that point, the code is usually insufficiently
malleable to be safely transformed in the necessary fashion, and there's a
very good chance that you'll introduce new bugs anyway (or cause problems by
fixing existing ones).

So, armed with a few simple rules of thumb about what is particularly
expensive (let's say: memory accesses, branching, indirect jumps, square
root/trig/pow/etc., integer division), and a bit of experience about which
parts tend to cause problems that can be rather intrusive to fix (and object
culling is one of these), one might reasonably put in a bit of forethought and
try to structure things in a way that means they're relatively efficient from
the start. Better that than just producing something that's likely to be a
bottleneck, but written in a way that means it is never going to be efficient,
whatever you do to it, without a ground-up rewrite. And that's the sort of
approach the slide deck appears to be advocating.

Seems uncontroversial, to my mind. Some (and I'd agree) might even just call
this planning ahead. It seems that when you plan ahead by drawing up diagrams
of classes and objects, because you've been burned by people diving in and
coming up with a big ball of spaghetti, that's good planning ahead. But when
you plan ahead by trying to ensure that the code executes efficiently, because
you've been burned by people producing slothful code that wastes half its
execution time and requires redoing because of that, that's premature
optimisation and a massive waste of time.

As with any time you make plans for the future, sometimes you get it wrong.
Ars longa vita brevis, and all that.

------
tankenmate
On the face of it this is a nonsensical problem: a 10Gbps ethernet connection
is not going to need 10M concurrent TCP connections. Ethernet + TCP/IP
protocol overhead is approx 4% (62 bytes of overhead out of 1542 bytes max per
packet; no jumbo packets over the wider internet yet), and the average UK
broadband connection is now over 12Mbps
(<http://media.ofcom.org.uk/2013/03/14/average-uk-broadband-speeds-hit-double-figures/>),
which gives approximately 800 connections to fill the pipe. Even at a paltry
1% bandwidth usage per connection and 4x10Gbps adaptors in the server, that is
still only 320,000 connections; I have done 150k connections on a high-end
desktop Linux machine. Available memory (4k x 2 socket buffers per connection;
you do want connections to be as fast as possible, no?) and bandwidth will
limit the number of connections long before you get to 10M. You are far better
off buying multiple machines (redundancy) in more than one location (even more
redundancy) before heading off to load yourself up with technical debt that
could choke a zoo full of elephants.
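
For the curious, the rough arithmetic behind those figures, using the numbers
above:

    10 Gbit/s / 12 Mbit/s                 ~= 800 full-rate connections per link
    800 x 100 (at 1% usage) x 4 adaptors   = 320,000 connections
    10M connections x (4 KB x 2) buffers  ~= 80 GB of socket buffers alone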

The Linux kernel has come a long way in the last few years, with improvements
to SMP scaling of sockets, zero copy, and large numbers of file handles; make
use of it! The level of technical skill already being applied to these
problems inside the kernel is probably more than you can afford to hire.

~~~
zokier
There are use cases that need minimal to no bandwidth but a lot of
connections; IMAP-IDLE comes to mind as a premier example. Essentially
anything that requires "push" capability these days relies on open
connections; other examples are websockets and instant messaging. The number
of concurrent users is also on the rise because of the always-on nature of
cellphones. While 10M sounds a bit far-fetched today, I think that using
available bandwidth as a measure for connection count is misguided.

~~~
tankenmate
UDP, or better still SCTP, would be a far better protocol for this use case,
though there are legacy issues with NAT and with protocol support (e.g. IMAP,
websockets). Multi-homing is coming down the pipeline; the real pity is that
the devices that could most benefit from it - mobile computing, i.e. iOS and
Android - largely refuse to support it. Maybe pushing the Firefox guys to
support SCTP in their OS / websockets implementation would force the other
parties to the table... or maybe a user-space implementation on top of UDP
would be the way to go.

~~~
qu4z-2
I think the problem with UDP is all the people who think that NAT is a
perfectly valid firewall/use case and not a hack.

~~~
tankenmate
Agreed, a serious "real world" problem; having said that, people are working
on it. I had a search around earlier and found there is a draft RFC for the
UDP encapsulation of SCTP
(<http://tools.ietf.org/html/draft-ietf-tsvwg-sctp-udp-encaps-14>); this,
combined with a soul-destroying use of zero-data-payload keepalives to fend
off NAT stupidity, and maybe a server-side endpoint abuse of port 53 to keep
"Carrier Grade" NAT quiet, might do the trick. All this should work on mobile
platforms.
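
The keepalive half at least is cheap. A minimal sketch (POSIX sockets; the
address, the port-53 trick, and the 25-second interval are all assumptions
about a particular NAT):

    /* Send an empty UDP datagram periodically: no application payload,
       but enough traffic to keep the NAT's port mapping alive. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in srv = {0};
        srv.sin_family = AF_INET;
        srv.sin_port = htons(53);   /* the port-53 abuse mentioned above */
        inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr);  /* example addr */
        for (;;) {
            sendto(fd, "", 0, 0, (struct sockaddr *)&srv, sizeof srv);
            sleep(25);  /* stay under typical UDP NAT timeouts (~30s) */
        }
    }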

In general doing this "properly" is an exercise in icky compromise.

------
byte_bach
Well, sorry to say I don't buy the "Unix was designed as a phone switch
control plane" nonsense at all. Here is Dennis Ritchie:

"From the point of view of the group that was to be most involved in the
beginnings of Unix (K. Thompson, Ritchie, M. D. McIlroy, J. F. Ossanna), the
decline and fall of Multics had a directly felt effect. We were among the last
Bell Laboratories holdouts actually working on Multics, so we still felt some
sort of stake in its success. More important, the convenient interactive
computing service that Multics had promised to the entire community was in
fact available to our limited group, at first under the CTSS system used to
develop Multics, and later under Multics itself. Even though Multics could not
then support many users, it could support us, albeit at exorbitant cost. We
didn't want to lose the pleasant niche we occupied, because no similar ones
were available; even the time-sharing service that would later be offered
under GE's operating system did not exist. What we wanted to preserve was not
just a good environment in which to do programming, but a system around which
a fellowship could form. We knew from experience that the essence of communal
computing, as supplied by remote-access, time-shared machines, is not just to
type programs into a terminal instead of a keypunch, but to encourage close
communication."

Unix was designed for people, not for Bell System Switches:
<http://cm.bell-labs.com/cm/cs/who/dmr/hist.html>

------
wmf
I think the original source is easier to read:
<http://c10m.robertgraham.com/p/manifesto.html>

------
gioele
> The problem with packets is they go through the Unix kernel. The network
> stack is complicated and slow. The path of packets to your application needs
> to be more direct. Don’t let the OS handle the packets.

Why don't you focus on simplifying the network stack instead?

The stack is complicated and slow for a reason: it takes care of many things.
I will believe that you can do better when you provide the same amount of
functionality (QoS, firewall, probing, tracing). If you say that you can do
without all these additional features, why don't you go and optimize the stack
so that the most basic code path is smaller and faster?

~~~
toki5
If you have one specialized need (i.e. one _specific_ path for packets to
travel), then it is just as valid an approach to trash every other path.

To turn your own question around: Why bother trying to optimize a stack that
contains a _lot_ of stuff you don't strictly need? The talk doesn't deal with
general-purpose servers that perform multiple roles; it says "if you have a
singular role, here is how you hyper-optimize to support that role at a huge
scale."

~~~
gioele
> If you have one specialized need (i.e. one specific path for packets to
> travel), then it is just as valid an approach to trash every other path.

You only have one specialized need at the very beginning. Then you start
seeing the need for "just another little feature". And soon you will start
replicating big parts of the network stack. Call it Greenspun's tenth rule for
the network, if you want.

Also, most of the time, what you "just need" also happens to be the common
need of many other users out there. Joining forces to fix a single code path
is a much better investment than redoing things in userland for the sake of
it.

------
jtchang
Well of course it is the problem. The kernel does lots of other stuff.

If you were to ask someone to build a vehicle that can go really fast you
might end up with a car. But ideally you'd really want a rocket.

I'm sure that there are systems out there that serve web pages with only bare
metal. Where the "kernel" exists only as an architectural stub. Why shouldn't
the NIC serve web pages directly?

~~~
PeterisP
For example, there are systems that do video streaming directly from NAND
memory to the NIC, bypassing the kernel to ensure high throughput (say,
multiple 100Gbps optic links) and no jitter - Linux handles the control part
(data management, initiating new streams, etc.), but the data path stays
outside of it.

------
JoeAltmaier
Embedded programmers have been scheduling I/O like this for decades as well. A
thread is a lot on an embedded kernel.

~~~
spartango
The point about statically allocating "huge pages" aligns with my experience
as an embedded programmer as well. Dynamic allocation is simply a no-no on
those systems.

~~~
JoeAltmaier
Usually because of the runtime: locks and garbage collection strike when you
can't afford them (realtime buffering).

So my approach is usually a heap cache. Calculate the log2 size, choose a heap
bucket, and only if it's empty go to the real heap. It never frees, so it
never garbage collects (neither does a single large allocation, so no loss);
it just relinks freed blocks onto the cache bucket by size for simple reuse.
It soon converges to a working set and never takes longer than a critical
section and a link/unlink, at least once it settles down.
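
A minimal sketch of that scheme (my naming, not the parent's; locking omitted,
though in the embedded setting each alloc/free would sit inside a critical
section):

    #include <stdlib.h>

    #define NBUCKETS 32

    struct blk { struct blk *next; };
    static struct blk *bucket[NBUCKETS];   /* one free list per log2 size */

    static int size_class(size_t n) {      /* smallest b with (1<<b) >= n */
        int b = 4;                         /* min 16 bytes, fits struct blk */
        while (((size_t)1 << b) < n) b++;
        return b;
    }

    void *cache_alloc(size_t n) {
        int b = size_class(n);
        if (bucket[b]) {                   /* reuse from the cache bucket */
            struct blk *p = bucket[b];
            bucket[b] = p->next;
            return p;
        }
        return malloc((size_t)1 << b);     /* bucket empty: hit real heap */
    }

    void cache_free(void *p, size_t n) {
        struct blk *q = p;                 /* never free(): relink by size */
        q->next = bucket[size_class(n)];
        bucket[size_class(n)] = q;
    }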

------
tzury
A different version of the same presentation, a more "class(y)" one...

<http://www.youtube.com/watch?v=ZFvnPAIo4F0>

------
m1ke
Can anyone recommend a good place to start reading about the *nix kernel for
non-programmers? Or about any of the other programming concepts referred to in
the post? I'm a pretty green sysop guy and want to try to understand the post
better (currently I'm getting maybe 10-20% of it).

------
flametroll
Here is an interesting concept that thinks 'outside the box':

<http://www.youtube.com/watch?v=gq1vDG-st1k>

~~~
mindjiver
Interesting how similar this is to an advanced 4G cell network. In previous
generations of networks there was always a "controller" node taking care of
the handover between cells. This of course didn't scale very well when there
was a spike in traffic. So in 4G cell networks there are no controllers, and
all handovers are done by the cells themselves.

Interesting to see this thinking applied to something slightly different,
thanks for the link!

------
tmzt
This presentation hints at one possible implementation of this, specifically
for VoIP/RTP.

<http://www.cluecon.com/presentation/building-conferenceivr-platform-nodejs-sipjs-and-webrtc/>

I would love to see actual working code for IP/SIP/RTP in node.js using packet
sockets, if anybody knows of an open source project.

------
anon1685
If this became a real problem, wouldn't the pragmatic solution be to
distribute the load across multiple machines? I fail to see how handling 10M
connections on a single machine has any merit, taking into account reliability
for example, apart from being able to brag about it.

------
jacob019
Writing custom drivers for each NIC is a blocker. Could 10M be achieved with
raw sockets?

~~~
alexkus
libnetfilter_queue for rx, raw socket for tx.

That was going to be the basis for a project to handle large numbers of
concurrent connections. Still on the drawing board though.
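
For anyone curious what that split looks like, a rough sketch, assuming Linux
and libnetfilter_queue (error handling omitted; the queue number and the
iptables rule in the comment are hypothetical choices):

    /* rx: divert inbound packets to user space with something like
       `iptables -A INPUT -p tcp --dport 80 -j NFQUEUE --queue-num 0`;
       tx: craft replies on a raw socket, building headers ourselves. */
    #include <libnetfilter_queue/libnetfilter_queue.h>
    #include <linux/netfilter.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int on_packet(struct nfq_q_handle *qh, struct nfgenmsg *msg,
                         struct nfq_data *pkt, void *txp) {
        unsigned char *data;
        int len = nfq_get_payload(pkt, &data);
        /* ...run our own TCP state machine over data[0..len) and send
           replies via the raw socket *(int *)txp... */
        struct nfqnl_msg_packet_hdr *ph = nfq_get_msg_packet_hdr(pkt);
        return nfq_set_verdict(qh, ntohl(ph->packet_id), NF_DROP, 0, NULL);
    }

    int main(void) {
        int tx = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);   /* tx side */
        struct nfq_handle *h = nfq_open();
        struct nfq_q_handle *qh = nfq_create_queue(h, 0, on_packet, &tx);
        nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff);
        char buf[65536];
        for (;;) {                                         /* rx loop */
            int n = recv(nfq_fd(h), buf, sizeof buf, 0);
            if (n > 0) nfq_handle_packet(h, buf, n);
        }
    }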

------
zerop
What should I be doing to handle thousands of quick, short-lived connections?

------
chrismealy
Where are people getting these super cheap servers?

------
abraininavat
First of all, what a disorganized article. Who the hell edited this?

Secondly, the article surmises that the kernel is the problem, which seems
right. Then it makes a leap to state that _the way to do this is to write your
own driver_.

Who said? I certainly don't agree. You are still running a multi-user kernel,
but you've now sacrificed its stability by running a parallel network stack.
Linux wasn't written to work that way, and it's hard to know what you're
getting into when you subvert your OS like that. This article would be much
better if it talked about the other options out there. For example...

Why not take out the middle-man entirely with something like OpenMirage
(<http://www.openmirage.org/>)? Get rid of the lowest-common-denominator
performance you get using a general-purpose OS and make your application the
OS. Link in whatever OS-like services (network, filesystem, etc) you may need.
Talk to the hardware directly without going through a middle-man and without
ugly hacks.

~~~
jsnell
Well, he's mostly suggesting using a user-space networking stack provided by
your vendor (Intel's DPDK). But writing your own userspace driver from scratch
is actually very simple -- I've done it for Intel's 1Gbps and 10Gbps cards
before the DPDK was made publicly available. We're talking ~1000 lines of
user-space code, and less than a month of work. Writing your own userspace
networking stack is of course more complicated, depending on exactly which
level of protocol support you need. Supporting say ARP or EtherChannel is
trivial, while a production quality and interoperable TCP stack will be man-
years of work.

But having written systems doing 10M TCP connections at 10Gbps, I strongly
believe that you want to relegate the kernel to the role of a management
system. Having the system split across the kernel and user-space will lead to
unacceptable performance, pretty much no matter where you split things. And
having the full application in the kernel would be a nightmare for
development, debugging, and deployment. (And I sure as hell am not going to
choose a pre-alpha research project over Linux as the base of telecoms
systems.)

~~~
snaky
Just found another initiative for a userspace TCP/IP stack that looks somewhat
promising: <http://www.openonload.org>

Presentation - <http://www.slideshare.net/shemminger/uio-final>

