10 Million Concurrent Connections – The Kernel is the Problem

haberman · on May 14, 2013

> The kernel is the problem

I had an epiphany one day when I realized that the kernel is nothing but a library with an expensive calling convention.

The only reason we bother calling the kernel at all is because it has privileges that userspace programs don't have, and it uses those privileges to control/mediate access to shared resources.

The downside of the kernel is that there's no way around it. You can't "opt out" of any of its decisions. With any normal library, if had an O(n^2) algorithm somewhere, or wasn't optimized for your use case, or just generally got in the way, you could choose not to link against it. User-space libraries are democratic in this respect; you vote with your linker line. But with the kernel, it's "my way or the highway." The kernel is the only way to the hardware.

Here's an unfortunate example: O_DIRECT is one of those few ways that you can sidestep the kernel. With O_DIRECT you can bypass the kernel's page cache, which there are very good reasons for wanting to do. But Linus's opinion is that "I should have fought back harder": https://lkml.org/lkml/2007/1/10/233 He thinks it's unfortunate that you can sidestep the kernel, because "You need a buffer whatever IO you do, and it might as well be the page cache."

But what if you want better memory accounting, or better isolation between users, or a different data structure, or any number of other tweaks to the page cache implementation? Well thankfully we have O_DIRECT. Otherwise, your only choice would be to try to convince the Linux kernel maintainers to integrate your change, after you've tweaked it so that it's suitable for absolutely everyone else that uses Linux, and given it an interface that Linux is willing to support forever. Good luck with that.

The kernel has always been the problem. User-space is where it's at.

adamnemecek · on May 14, 2013

People have had that realization in the past

http://en.wikipedia.org/wiki/Exokernel

http://en.wikipedia.org/wiki/Microkernel

http://en.wikipedia.org/wiki/Tanenbaum%E2%80%93Torvalds_deba...

There are good reasons why they have not caught on, performance being the most salient one.

haberman · on May 14, 2013

> People have had that realization in the past

"The kernel is just a library" isn't exactly the same sentiment as "the kernel should be as small as possible" -- I believed the latter before I fully understood the former. "The kernel is just a library" means that all of the experience we have designing and factoring userspace APIs carries over into kernel design. Furthermore it means that the kernel is a strictly less flexible library than userspace libraries, with a strictly more expensive calling convention, and that its only advantage is that it can protect and mediate access to hardware.

> There are good reasons why they have not caught on, performance being the most salient one.

Most of the received wisdom about microkernels is based on outdated designs like Mach, and not modern designs like L4. L4 is significantly more efficient than Mach.

caf · on May 14, 2013

...and that its only advantage is that it can protect and mediate access to hardware.

It has another advantage, which is that it acts as a trusted third party that can allow two mutually untrusting users/program to share a software-implemented resource - for example, a disk cache or a TCP stack.

adamnemecek · on May 14, 2013

Exokernels use library operating systems.

L4 was also one of a kind endeavor extremely optimized for the specific architecture. I can't really imagine that something like this could be achievable for a commercial OS.

haberman · on May 14, 2013

OKL4 is built on a third-generation L4 microkernel and is deployed to over 1.5 billion mobile devices: http://www.ok-labs.com/releases/release/ok-labs-software-sur...

adamnemecek · on May 14, 2013

"[OKL4] is a microkernel-based embedded hypervisor". That does not quite sound like an OS.

tmzt · on May 14, 2013

What is an OS but a process hypervisor?

duaneb · on May 14, 2013

monolithic kernels are good for general purpose computing, but lose their edge in specific use cases (e.g. 10M connections).

ithkuil · on May 14, 2013

Instead of moving kernel functionality inside user mode, another way of shortcutting this "expensive calling convention" would be to run your code in kernel mode (kernel module, or with something like kernel mode linux). (Possibly implementing most of your code with tools/languages that makes screwing up with global memory difficult or impossible)

Security and isolation might be provided by other means like virtualization.

Not saying it's a better way, it's just an alternative to writing time critical drivers in user-space as TFA proposes. I suspect that if you have to handle 10 million connections, you don't really need a full featured multiuser desktop system and you can manage this component to have specific deployment requirements.

JoeAltmaier · on May 14, 2013

'Virtual hardware' is an old idea, nobody has mentioned it here. The idea is, the hardware interface can be exposed in a page in user space. The driver is a library that can run in the application process. The hardware has to reconcile several user-mode apps but that's usually just queueuing.

Gain: no user-kernel process switches; no kernel call to poll completion; no data copying between processes; no interrupt transition. This was the idea behind InfiniBand but virtual hardware is older than that.

snaky · on May 22, 2013

>Gain: no user-kernel process switches; no kernel call to poll completion; no data copying between processes; no interrupt transition

AFAIK that's may be implemented on existing Linux system using Intel DPDK or PF_RING (routing packets directly from NIC to mmap'ed memory in userspace), "maxcpus=" Linux boot directive (to run Linux kernel on one or two cores only), and "pthread_setaffinity_np()" call in user process (to run the process on other cores which will never be imterrupted by Linux kernel).

tbrownaw · on May 14, 2013

Compile everything to MSIL (or Java) bytecode. Apparently the time saved on syscalls and on not using some of the hardware memory protection features is enough to compensate for the type-safety overhead, so it's not even slower.

http://research.microsoft.com/en-us/projects/singularity/

lukego · on May 14, 2013

Agree completely. Linux is developing pretty nice opt-out mechanisms though, despite Linus's best intentions.

In Snabb Switch (http://snabb.co/snabbswitch/) we take large blocks of RAM (HugeTLB) and whole PCI device access (mmap via sysfs) from Linux and then do everything else ourselves in userspace. The kernel is more like a BIOS in that sense: it gets you up and running and takes care of the details you really don't care about.

EDIT: Extremely early draft ahem detailed design document: http://snabb.co/misc/snabbswitch-07may2013.pdf

npsimons · on May 14, 2013

But what if you want better memory accounting, or better isolation between users, or a different data structure, or any number of other tweaks to the page cache implementation?

Oh, I don't know, you could always write your own kernel module. Considering I used to work on a kernel module that hooked the IRQ calls to implement hard RT, I'm not convinced that the kernel is that constraining. If you really feel constrained by the kernel, you could do without and go bare metal. Even if you don't want to do that, you've still got tons of other possibilities (the BSDs, any of the OSS microkernels, etc, etc). At least you've got those options with FLOSS kernels; I'd like to see how far you can get with changes to a closed source kernel, or worse, some hardware that's locked down to "keep you safe from malicious code".

As for the approval process of Linux patches, there's good reason for that, and other FLOSS kernels are just as (if not more) picky. The truth is, the Linux kernel has probably had more use cases thrown at it than any other software, ever. It's not perfect, nor heavily optimized towards any particular configuration; but it's highly configurable, and with a bit of tweaking (which would take less time and be less error prone than going bare metal), you can almost always get an acceptable solution to your problem.

Some cases where routing around the kernel are of genuine interest and necessary; most of the time, though, I far too often hear of programmers blaming others for performance problems that they haven't run a profiler on (if they've even heard of profiling).

jbert · on May 14, 2013

That's an interesting perspective.

Thinking in those terms, the benefits of the "privileged lib" are:

- has a lot of functionality (support for hardware X, Y etc, good implementations of important, difficult algorithms)

- interoperability and network effects, if you use this library you can conveniently interoperate with other applications using it.

You can also get around the "strictly more expensive calling convention" by writing your code to run as a loadable kernel module. Obviously that's inconvenient, but that's kind of the point of this - trading convenience for performance.

haberman · on May 14, 2013

It's funny that you mention those as benefits, because to me those are two big reasons why userspace is better.

For example, think of LLVM. GCC has a lot of functionality, and implements a lot of important, difficult algorithms. But because GCC is in userspace, LLVM/Clang have been able to get better and better until they can realistically challenge GCC in this space. People are starting to use LLVM more and more; it offers really compelling new tools like ASAN (http://clang.llvm.org/docs/AddressSanitizer.html), Clang Static Analyzer (http://clang-analyzer.llvm.org/), and LLDB. LLVM is also offering an embeddable and modular architecture, which GCC has opposed on philosophical grounds for years.

Because this is in user-space, LLVM was able to form and grow alongside GCC. People didn't have to make a big switch just to try out LLVM; it's not intrusive like it would be to switch from Linux to FreeBSD or anything like that. That's why I think the network effects of user-space are better. In the kernel it's "get merged or die." In user-space, similar projects can compete and people can vote with their linker lines.

There are tons of examples of this. In the last 10 years we've seen a lot of displacement in userspace, where next-gen technologies have made a lot of inroads against the incumbents:

    - Apache has been challenged by nginx
    - Perl has been challenged by Python/Ruby
    - screen has been challenged by tmux

None of these upstart competitors had to ask permission or get buy-in from the incumbents, they just did it. Now compare this with when Con Kolivas had an idea in 2007 for how to improve the Linux scheduler. He was really passionate about this and was maintaining his own patch against the Linux kernel -- he called his scheduler the Staircase Deadline scheduler (http://lwn.net/Articles/224865/). He showed positive results and had a lot of fans of his patches. But then Ingo Molnar, the established Linux scheduler guy, took Con's ideas and created his own separate rewrite of the scheduler that he called the "Completely Fair Scheduler." Linus picked Ingo's scheduler instead of Con's, which left Con frustrated and he stopped working on Linux.

We'll never get to find out what might have happened if Con could have realistically offered his work as an alternative. The scheduler is in the kernel, which means that Linus picks winners and losers. In userspace, no one has that power.

jbert · on May 14, 2013

I think it's less about "one true kernel" with too much control by an established crew and more about system integration and risk.

The vast majority of people don't run a Linus kernel. They run a distro kernel. Distros can (and do) ship multiple kernels with different sets of patches and options. They have a default, but they also have a default web server and C compiler.

The kernel is abused by all kinds of different workloads. Distros choosing to offer kernels with more "speculative" patches will have to support them. The kernel is a risk-averse environment. I think that's the reason, not fiat by Linus.

Also note, a good counterexample to your main point is the android kernel. Linus-kernel:Android-kernel is quite close to gcc:llvm.

Replacing big chunk of functionality takes a lot of resources, so it helps to be apple (LLVM) or google (android kernel).

Also, don't underestimate kernel modules either. If you want to, say, expose page table information that "linux doesn't let you", you can write a module to do so.

[edit - removed Con Kolivas related part of response. Don't want to drag up old flamewar.]

s_baby · on May 14, 2013

You might find HaLVM interesting. Runs a Haskell stack directly on the machine.

http://corp.galois.com/halvm

tikhonj · on May 14, 2013

Also mirage for OCaml. Unlike HalVM--which does seem like an awesome project--it's currently under active development. They also have some clever ideas about security.

fulafel · on May 14, 2013

> The downside of the kernel is that there's no way around it. You can't "opt out" of any of its decisions.

Well, you can sidestep most of it. For example, re the c10k/c10m problem, you can the transport protocol impl to userspace. See example here how they did it with SCTP: http://www.cs.ubc.ca/labs/dsg/mpi-sctpuserspace_ICCCN2012.pd... - you could do the same with TCP if you don't care about having a TCP-enabled kernel.

The other example of course is all the protocols built on top of UDP.

This still has the NIC driver shoveling at least the raw IP packets via the kernel driver, but you could sidestep that too by using the hypervisor interface to route the NIC pci space and interrupts to your program (like Xen uses).

gonzo · on May 14, 2013

> I had an epiphany one day when I realized that the kernel is nothing but a library with an expensive calling convention.

And a difficult debugging environment.

minimax · on May 13, 2013

Don’t scribble data all over memory via pointers. Each time you follow a pointer it will be a cache miss: [hash pointer] -> [Task Control Block] -> [Socket] -> [App]. That’s four cache misses.

Incidentally, game programmers have been spreading the gospel on this issue for several years now. See for more:

http://macton.smugmug.com/gallery/8936708_T6zQX#!i=593426709...

http://research.scee.net/files/presentations/gcapaustralia09...

http://www.altdevblogaday.com/2011/11/07/an-example-in-data-...

dietrichepp · on May 13, 2013

I couldn't help but laugh reading the first link. Sure, there are some good ideas for optimization. But (1) no profiling, and (2) there are a couple really bogus suggestions.

For example, suggestion #33, to shift by a constant amount instead of a variable amount. You see, it only looks like you're shifting by a variable amount. This is one of those things that compilers have been optimizing for years and are very good at: strength reduction wrt variables that depend on the loop variable.

You'll also see game programmers do things like "x >> 4" instead of "x/16", because "x >> 4" is faster. It is faster in assembly language, but your compiler already knows that and you are just making the code harder to read every time you do a micro-optimization.

Game programmers spread the gospel on lots of premature optimization nonsense, in my opinion, and are mostly inexperienced when it comes to writing mantainable code. It's a kind of hazard of the industry. Performance problems means cutting beloved features, and rather than doing any maintenance you just start a new game from scratch. (Not universally, of course. There are a few programmers professionally working on engines.)

to3m · on May 13, 2013

Slide #33 is actually reasonable advice. Variable shift = 11 cycle latency on PS3/Xbox360, and it blocks both threads and disables interrupts as it runs. (Will the compiler figure this out? Maybe, maybe not. But if you write the code you want - which in this case you might as well, since the transformation is simple - then you won't have to worry about that. As a general point of principle you should write the code you want, rather than something else that you hope will become the code you have in mind; computers are very bad at figuring out intent, but excellent at simply following instructions.)

What are the other bogus suggestions? The overall thrust of the slides seems to me valid: know what's expensive (branches, memory access, mini-pitfalls like the micrododed thing), pick a strategy that avoids all of that, and don't throw away performance. Performance always ends up being an issue; you don't have to carefully preserve every last cycle, but that's no excuse for just pissing away cycles doing pointless stuff.

Not explicitly stated, but apparent from comparing the suggested approach to the original one, is that you can't always just leave this stuff to the last minute, when you've had the profiler tell you what's a bottleneck and what isn't. The requisite changes might have far-reaching consequences, and so it's worth giving a bit of thought to performance matters when you start and things are a bit more in flux.

A bit of domain knowledge also won't hurt. I bet if this function were ten times worse, but called "PostDeserializeFromDVD", it wouldn't attract very much attention.

smegel · on May 13, 2013

> but that's no excuse for just pissing away cycles doing pointless stuff.

Yes there is - maintainable code, programmer time and effort. What are you worried about, a large power bill due to your cpu doing what is supposed to be doing?

On an unrelated note this kind of attitude is the first thing I test for during a programmer interview, and is my strongest case for eliminating potential candidates. I made the mistake once of letting one through - his first task was a relatively straightforward modification to a large C program. A week later I was a bit worried he hadn't reported back done, so I went to check up on him and it turns out he was busy changing every line of code to adjust formatting and variable names, not to mention these kinds of pointless micro-optimizations. And he hadn't even checked in once, he was saving the checkin itself for another multiple day effort. Sigh. I tried using the "premature optimization is the root of all evil" line on him to get my point across (and see if he had heard of it), and it was when I saw his eyes flare up in anger I knew he had to go. Sad really because he was otherwise quite bright.

Now I basically put C++/game programmer applications in a "special pile" to be considered as a last resort. I just dont need this kind of arrogance and cowbow mentality wrecking the place. Its like sending a bull into a china shop.

to3m · on May 13, 2013

If performance is a requirement, it's a requirement, and you need to bear it in mind. And working in games, it usually is. Virtually every project has problems with performance, and dealing with the issues properly at the end of a project can be very hard. By that point, the code is usually insufficiently malleable to be safely transformed in the necessary fashion, and there's a very good chance that you'll introduce new bugs anyway (or cause problems by fixing existing ones).

So, armed with a few simple rules of thumb about what is particularly expensive (let's say: memory accesses, branching, indirect jumps, square root/trig/pow/etc., integer division), and a bit of experience about which parts tend to cause problems that can be rather intrusive to fix (and object culling is one of these), one might reasonably put in a bit of forethought and try to structure things in a way that means they're relatively efficient from the start. Better that than just producing something that's likely to be a bottleneck, but written in a way that means it is never going to be efficient, whatever you do to it, without a ground-up rewrite. And that's the sort of approach the slide deck appears to be advocating.

Seems uncontroversial, to my mind. Some (and I'd agree) might even just call this planning ahead. Seems that when you plan ahead by drawing up diagrams of classes and objects, because you've been burned by having people just diving in and coming up with a big ball of spaghetti, that's good planning ahead. But when you plan ahead by trying to ensure that the code executes in an efficient manner, because you've been burned by having people come up with slothful code that wastes half its execution time and requires redoing because of that, that's premature optimisation, and a massive waste of time.

As with any time you make plans for the future, sometimes you get it wrong. Ars longa vita brevis, and all that.

cecilpl · on May 14, 2013

> What are you worried about, a large power bill due to your cpu doing what is supposed to be doing?

Video games are real-time. There's a hard limit on the number of cycles you can make use of per frame.

11 cycles spent mispredicting a branch is 11 cycles you can't use for better-looking IK or smarter pathfinding.

dietrichepp · on May 13, 2013

> Variable shift = 11 cycle latency on PS3/Xbox360

That's not the issue here. It doesn't matter if the variable shift has to get its results by carrier pigeon from a monk in Tennessee, because the variable shift is eliminated by the compiler. When the programmer is manually doing work that has already been automated, your process is inefficient.

to3m · on May 14, 2013

I tried some super-simple test code and ran it through the PS3 compilers... but was unable to get either to turn the variable-width shift into a fixed-width one. With the right flag included, gcc was even good enough to warn me about the microcoded instruction.

I also tried gcc (OS X, x64) and llvm (OS X, x64/ARM) and they didn't do the transformation either. (I'm not sure I would expect this for ARM anyway, but for x64 the variable shift code looked a bit more involved than I was expecting. Perhaps a transformation into a fixed-width shift would be beneficial in that case as well.)

Test code:

    float f(const float *f,unsigned mask){
        unsigned i;
        float t=0.f;
        
        for(i=0;i<32;++i){
            if(mask&(1<<i))
                t+=f[i];
        }
        
        return t;
    }
        
    float f2(const float *f,unsigned mask){
        unsigned i,m;
        float t=0.f;
        
        for(i=0,m=1;i<32;++i,m<<=1){
            if(mask&m)
                t+=f[i];
        }
        
        return t;
    }

Compile options were "-O3" for SNC (this seems to be about your lot as far as options go) and "-O6 -fstrength-reduce" for gcc (obviously I could spend all day fiddling with all the possible options, but I won't, which I admit is a flaw in my experiment - but I believe snc is supposed to produce much better code anyway). And in both cases, the code for `f' includes a variable shift, and the code for `f2' didn't.

Still, I would stand by my maxim even if the data in this case were against me. It's the winning strategy. Why rely on the compiler's ability to second-guess your intentions, when you could just tell it what you want, straight up?

Arelius · on May 14, 2013

Thanks for doing these tests, Super insightful.

While, I'm normally of the school of thought to let the compiler do the optimization. Modern compilers often miss what would seem like rather trivial optimizations, And often due to assumptions that the language spec won't allow that the programmer otherwise can make.

dietrichepp · on May 14, 2013

I'm not sure there's a point in testing on x64 because I'm not sure that changing to a fixed shift is actually an optimization. I think that the variable shift was slow back in the Pentium days but it's really a thing of the past.

I'm honestly surprised that the variable shift didn't get fixed. This strength reduction will happen if you use a multiply instead of a shift: the multiply will get reduced to an addition. I had assumed that the same would hold for variable and fixed width shifts.

I would file this as a bug against the compilers in question. I don't think it would take very long to fix.

keypusher · on May 14, 2013

Thanks for this.

derleth · on May 13, 2013

> Will the compiler figure this out? Maybe, maybe not.

Does your compiler not generate assembly? Does your platform not have objdump or a similar tool you can use as a disassembler? I suppose asking you to profile is too much, but you can at least look.

TheEzEzz · on May 14, 2013

Keep in mind that for games you are often targeting three different consoles plus PC. You may also port to other platforms eventually, and won't be able to easily track down all these lines of code at a later date to guarantee they optimize down correctly on your new target.

groby_b · on May 14, 2013

Since I actually happen to know Mike, I can guarantee you he profiled. He might not share the results in the slides, because they were about making a point and getting laughs in the process, not production code.

This is one of those things that compilers have been optimizing for years and are very good at: strength reduction wrt variables that depend on the loop variable.

Many compilers are good at it. In many instances. But if you do multi-platform development, with entirely different compilers for each platform, doing manual strength reduction is a good investment of your time if performance really matters.

and rather than doing any maintenance you just start a new game from scratch

Thankfully, that model is fading. It was a result of rapidly changing architectures, so that often at the beginning of the next console cycle, you had to rewrite anyways - the hardware was so different that what was previously fast suddenly was a disaster.

Given the fact that hardware is moving closer and closer to being bog-standard, and that code bases are large enough that a rewrite is actually insanity, not a couple of bored weekends, this mentality is fading out.

Slowly, granted, but there's hope :)

TheMakeA · on May 13, 2013

I think the points that you bring up are mostly true for well established and common platforms and architectures, but the author, Mike Acton, focuses mostly on the PS3, where it is more conceivable that a compiler would not e.g. optimize a division operation to a shift, especially for an SPU.

I don't think it's fair to generalize game programmers for premature optimization nonsense. Game development certainly has much stricter requirements on performance than say, Joe's Ruby on Rails website, and so I would be much more inclined to trust them for optimization tips. That being said though, yes, a lot of them have no idea what they are talking about.

I have been on both sides of the fence. I have worked with people who attempt to optimize every single instruction, and with people who gladly misinterpret "premature optimization is the root of all evil." Both of them are obviously toxic in their own ways.

mierle · on May 13, 2013

Be careful: "x / 16" won't get converted to "x >> 4" by the compiler if x is a signed int. I lost an argument about this in a code review once. Look at the disassembly for the case that x is and is not signed if you don't believe me.

spc476 · on May 14, 2013

Odd. The code:

    int main(int argc,char *argv[])
    {
      int x = strtol(argv[1],NULL,10);
      printf("result is %d\n",x/16);
      return 0;
    }

The resulting assembly contained no division instructions, nor a call to a divide routine, but there is one instruction that does an arithmetic right shift by 4. Changing x to unsigned changed the instruction to a logical right shift by 4.

I think it really depends upon the compiler.

mierle · on May 14, 2013

Strangely, I can't reproduce the effect with a divide by 16. With GCC 4.5.2 -O2, I see a "shrl $4, %eax" for unsigned, and "sarl $4, %eax" for signed. However, if I divide by 2, the results are different; the function:

  int div_by_2(int x) {return x/2;}

compiles to:

  div_by_2:
    movl	%edi, %eax
    shrl	$31, %eax
    addl	%edi, %eax
    sarl	%eax
    ret

Meanwhile, the signed variant:

  unsigned int div_by_2(unsigned int x) {return x/2;}

compiles to:

  div_by_2:
    movl	%edi, %eax
    shrl	%eax
    ret

My conclusion remains that for hot loops, you should check the compilers assembly to make sure the compiler is applying the strength reduction you expect.

to3m · on May 14, 2013

I got the following (-O6 -fstrength-reduce). A single arithmetic shift is no good, so the code adjusts the value first.

    movl	%eax, %esi
    sarl	$31, %esi  ; ESI = FFFFFFFF (-) / 00000000 (+)
    shrl	$28, %esi  ; ESI = 0000000F (-) / 00000000 (+)
    addl	%eax, %esi ; offset to cancel invalid values
    sarl	$4, %esi   ; shift

It first generates an offset value in ESI: 15 if the value to divide was negative, or 0 if it was positive.

Then it adds this to the original value. This is the cunning part - well I think so anyway. It ensures that for the values where x>>4 would give the wrong value (i.e., -15<=x<0), x is made positive and smaller than 16, so that x>>4 gives the correct value of zero. For all other values, the bottom 4 bits don't affect the result of the division, so it's fine to offset them as well.

(Don't think I've ever seen VC++ generate anything like this, but I'd pretty much always have an explicit shift rather than a divide by 16 so maybe it does and I'd just never noticed. The choice of 16 usually comes from having some 4-bit field somewhere, and when dealing with bitfields you're best off using bitwise instructions.)

snaky · on May 13, 2013

>mostly inexperienced when it comes to writing mantainable code

Mostly. Just like any other average guys in any industry.

But game industry is continuously bringing back many interesting technologies, especially performance-related. Like LuaJIT, rapidly becoming mainstream and widely used, GPU-assisted calculation, used even in PostgreSQL - https://wiki.postgresql.org/wiki/PGStrom - and so much more.

pasquinelli · on May 13, 2013

everything you said is super true. hella super true.

tankenmate · on May 13, 2013

On the face of it this is a nonsensical problem; a 10Gbps ethernet connection is not going to need 10M concurrent tcp connections; ethernet + tcp/ip protocol overhead is approx 4% (62 bytes overhead from 1542 bytes max per packet, no jumbo packets over the wider internet yet), the average UK broadband connection is now over 12Mbps (http://media.ofcom.org.uk/2013/03/14/average-uk-broadband-sp...), that gives approximately 800 connections to fill the pipe. Even at a paltry 1% bandwidth usage per connection and 4x10Gbps adaptors in the server that is still 320,000 connections, I have done 150k connections on a high end desktop Linux machine. Available memory (4kx2 socket buffers; you do want connections to be as fast as possible no?) and bandwidth will be a limit to the number of connections long before you get to 10M connections. You are far better off buying multiple machines (redundancy) in more than one location (even more redundancy) before heading off to load yourself up with technical debt that could choke a zoo full of elephants.

The Linux kernel has come a long way in the last few years with improvements to SMP scaling of sockets, zero copy, and large numbers of file handles; make use of it! The level of technical skill being applied to fix kernel problems is probably more expensive than you can afford.

jsnell · on May 13, 2013

If you look at real network traffic flowing through e.g. a mobile operator's core network, you'll find that for 1Gbps of traffic there will be anywhere from 200k to 500k open TCP connections (depending a bit on the traffic mix, and on what idle connection thresholds you use for the analysis). So if you're making TCP-aware network appliances targeting a 10Gbps traffic rate, it's not a bad idea to plan for simultaneous 10M connections. You might get away with a factor of 2 or 5 less, but probably not with a factor of 10. 150k connections would be a joke.

zokier · on May 13, 2013

There are usecases that use minimal to no bandwidth but a lot of connections. IMAP-IDLE comes to mind as a premier example. Essentially anything that requires "push" capability these days relies on open connections, other examples are websockets and instant messaging. Also the number of concurrent users is on the rise because of the always-on nature of cellphones. While 10M sounds bit far fetched today, I think that using available bandwidth as a measure for connection count is misguided.

tankenmate · on May 13, 2013

UDP, or better still SCTP, would be a far better protocol for this use case; having said that there are legacy issues with NAT, protocol support (e.g. IMAP, websockets). Having said that multi-homing is coming down the pipeline, the real pity is that the devices that could most benefit from it, mobile computing, largely refuse to support it; i.e. iOS and Android. Maybe pushing the firefox guys to support SCTP on their OS / websockets implementation would force the other parties to the table... Maybe a user space implementation on top of UDP would be the way to go.

qu4z-2 · on May 14, 2013

I think the problem with UDP is all the people who think that NAT is a perfectly valid firewall/use case and not a hack.

tankenmate · on May 14, 2013

Agreed a serious "real world" problem. having said that people are working on it. I have a search around earlier and found there is a draft RFC for the UDP encapsulation of SCTP (http://tools.ietf.org/html/draft-ietf-tsvwg-sctp-udp-encaps-...), this combined with a soul destroying use of a zero data payload keep alive to fend off NAT stupidity, and maybe a server side end point abuse of port 53 to keep "Carrier Grade" NAT quiet might be the trick. All this should work on mobile platforms.

In general doing this "properly" is an exercise in icky compromise.

snaky · on May 13, 2013

>Available memory (4kx2 socket buffers; you do want connections to be as fast as possible no?)

When 1U commodity x86 server built on C602 Intel chipset can be packed with 768 GB RAM..

nly · on May 13, 2013

Your numbers assume TCP will always be the dominant connection-oriented protocol.

rorrr2 · on May 13, 2013

Your math assumes all of your connections are saturated to the max (on user's end).

If that's the case, then yeah, 10 million connections would require a 14.3 terabytes/sec connection, which doesn't exist for regular PCs, as far as I know.

There are many applications when most of your connections are sitting idly and checking something like memcache for status updates once a second.

byte_bach · on May 14, 2013

Well,

   sorry to say I don't buy the "Unix was designed as a phone switch control plane" nonsense at all. Here is Dennis Ritchie:

"From the point of view of the group that was to be most involved in the beginnings of Unix (K. Thompson, Ritchie, M. D. McIlroy, J. F. Ossanna), the decline and fall of Multics had a directly felt effect. We were among the last Bell Laboratories holdouts actually working on Multics, so we still felt some sort of stake in its success. More important, the convenient interactive computing service that Multics had promised to the entire community was in fact available to our limited group, at first under the CTSS system used to develop Multics, and later under Multics itself. Even though Multics could not then support many users, it could support us, albeit at exorbitant cost. We didn't want to lose the pleasant niche we occupied, because no similar ones were available; even the time-sharing service that would later be offered under GE's operating system did not exist. What we wanted to preserve was not just a good environment in which to do programming, but a system around which a fellowship could form. We knew from experience that the essence of communal computing, as supplied by remote-access, time-shared machines, is not just to type programs into a terminal instead of a keypunch, but to encourage close communication. "

Unix was designed for people, not for Bell System Switches: http://cm.bell-labs.com/cm/cs/who/dmr/hist.html

wmf · on May 13, 2013

I think the original source is easier to read: http://c10m.robertgraham.com/p/manifesto.html

gioele · on May 13, 2013

> The problem with packets is they go through the Unix kernel. The network stack is complicated and slow. The path of packets to your application needs to be more direct. Don’t let the OS handle the packets.

Why don't you focus on simplifying the network stack instead?

The stack is complicated and slow for a reason: it takes care of many things. I will believe that you can do better when you provide the same amount of functionality (QoS, firewall, probing, tracing). If you say that you can do without all these additional features, why don't you go and optimize the stack so that the most basic code path is smaller and faster?

toki5 · on May 13, 2013

If you have one specialized need (i.e. one specific path for packets to travel), then it is just as valid an approach to trash every other path.

To turn your own question around: Why bother trying to optimize a stack that contains a lot of stuff you don't strictly need? The talk doesn't deal with general-purpose servers that perform multiple roles; it says "if you have a singular role, here is how you hyper-optimize to support that role at a huge scale."

gioele · on May 14, 2013

> If you have one specialized need (i.e. one specific path for packets to travel), then it is just as valid an approach to trash every other path.

You only have one specialized need at the very beginning. Then you start seeing the need for "just another little feature". And soon you will start replicating big parts of the network stack. Call it Greenspun's tenth rule for the network, if you want.

Also, most of the time, what you "just need" happens to be also the common need of many other users out there. Joining forces to fix a single code path is a much better investment than redoing things in userland for the sake of it.

PeterisP · on May 13, 2013

Well, the whole point of the article is that for some needs even the option of extra functionality hurts - you want to remove every extra 'if', every unneeded (for you) field in a data structure to make it fit in cache better, etc.

wmf · on May 13, 2013

This is a worthwhile topic of research and there are some papers at HotOS about it, but I don't blame people for taking shortcuts if they can't solve the general case.

lmm · on May 13, 2013

Would linux accept those patches?

yid · on May 13, 2013

It doesn't have to -- you can just have a loadable kernel module that takes over the network stack.

lmm · on May 13, 2013

Such a thing would be very difficult to maintain given that linux refuses to provide a stable API.

yid · on May 13, 2013

Agreed, but it's possible (in theory, much like writing your own network stack).

wmf · on May 13, 2013

Linux developers are very opposed to having their lovingly crafted network stack replaced with something crappy and/or proprietary (see TOE).

jtchang · on May 13, 2013

Well of course it is the problem. The kernel does lots of other stuff.

If you were to ask someone to build a vehicle that can go really fast you might end up with a car. But ideally you'd really want a rocket.

I'm sure that there are systems out there that serve web pages with only bare metal. Where the "kernel" exists only as an architectural stub. Why shouldn't the NIC serve web pages directly?

PeterisP · on May 13, 2013

For example, there are systems that do video streaming directly from NAND-memory to NIC, bypassing the kernel to ensure high throughput (say, multiple 100Gbps optic links) and no jitter - linux handles the control part (data management, initiating new streams, etc), but the data part is outside of it.

JoeAltmaier · on May 13, 2013

Embedded programmers have been scheduling I/O like this for decades as well. A thread is a lot on an embedded kernel.

spartango · on May 14, 2013

The point about statically allocating "huge pages" aligns with my experience as an embedded programmer as well. Dynamic allocation is simply a no-no on those systems.

JoeAltmaier · on May 14, 2013

Usually because of the runtime - locks, garbage collection strike when you can't afford it (realtime buffering).

So my approach is a heap-cache usually. Calculate the log2 size, choose a heap bucket, if its empty THEN go to the heap. Never frees so never garbage collects (neither does single-large-allocation so no loss), just relinks freed blocks on the cache bucket by size for simple reuse. It soon comes to a working-set and never takes longer than a critical-section and link/unlink, at least once it settles down.

tzury · on May 13, 2013

a different version of the same presentation, a more "class(y)" one...

http://www.youtube.com/watch?v=ZFvnPAIo4F0

m1ke · on May 14, 2013

Anyone can recommend a good place to start reading about *nix kernel for non-programmers? Or any other programing concepts that are being referred in the post? I'm pretty green sysop guy and want to try and understand better the post (currently I'm getting maybe 10-20%)

flametroll · on May 13, 2013

Here is an interesting concept that thinks 'outside the box':

http://www.youtube.com/watch?v=gq1vDG-st1k

mindjiver · on May 14, 2013

Interesting how similar this is an advanced 4G cell network. In previous generations of networks there was always a "controller" node taking care of the handover between the cells. This of course didn't scale very well when there was a spike in traffic usage. So in 4G cell networks there are no controllers and all handovers are done via the cells instead.

Interesting to see this thinking applied to something slightly different, thanks for the link!

tmzt · on May 14, 2013

This presentation hints at one possible implementation of this, specifically for VoIP/RTP.

http://www.cluecon.com/presentation/building-conferenceivr-p...

I would love to see actual working code for IP/SIP/RTP in node.js using packet sockets, if anybody knows of an open source project.

anon1685 · on May 14, 2013

If this became a real problem, wouldn't the pragmatic solution be to distribute the load across multiple machines? I fail to see how handling 10M connections on a single machine has any merit, taking into account reliability for example, apart from being able to brag about it.

jacob019 · on May 13, 2013

writing custom drivers for each NIC is a blocker. Could 10M be achieved with raw sockets?

alexkus · on May 13, 2013

libnetfilter_queue for rx, raw socket for tx.

That was going to be the basis for a project to handle large numbers of concurrent connections. Still on the drawing board though.

zerop · on May 14, 2013

What should be I doing to handle thousands of quick and short lived connections?

chrismealy · on May 14, 2013

Where are people getting these super cheap servers?

abraininavat · on May 13, 2013

First of all, what a disorganized article. Who the hell edited this?

Secondly, the article surmises that the kernel is the problem, which seems right. Then it makes a leap to state that the way to do this is to write your own driver.

Who said? I certainly don't agree. You are still running a multi-user kernel, but you've now sacrificed its stability by running a parallel network stack. Linux wasn't written to work in that way and it's hard to know what you're getting into when you subvert your OS in that way. This article would be much better if it talked about other options out there. For example..

Why not take out the middle-man entirely with something like OpenMirage (http://www.openmirage.org/)? Get rid of the lowest-common-denominator performance you get using a general-purpose OS and make your application the OS. Link in whatever OS-like services (network, filesystem, etc) you may need. Talk to the hardware directly without going through a middle-man and without ugly hacks.

jsnell · on May 13, 2013

Well, he's mostly suggesting using a user-space networking stack provided by your vendor (Intel's DPDK). But writing your own userspace driver from scratch is actually very simple -- I've done it for Intel's 1Gbps and 10Gbps cards before the DPDK was made publicly available. We're talking ~1000 lines of user-space code, and less than a month of work. Writing your own userspace networking stack is of course more complicated, depending on exactly which level of protocol support you need. Supporting say ARP or EtherChannel is trivial, while a production quality and interoperable TCP stack will be man-years of work.

But having written systems doing 10M TCP connections at 10Gbps, I strongly believe that you want to relegate the kernel into the role of a management system. Having the system split across the kernel and user-space will lead to unacceptable performance, pretty much no matter where you split things. And having the full application in the kernel would be a nightmare for development, debugging, and deployment. (And I sure as hell am not going to choose a pre-alpha research project over Linux as the base of telecoms systems.)

snaky · on May 14, 2013

Just found another iniciative about userspace tcp/ip stack that looks somewhat promising - http://www.openonload.org

Presentation - http://www.slideshare.net/shemminger/uio-final

clgbrst · on May 14, 2013

Have you personnally studied the code of this research project?

toddh · on May 13, 2013

> First of all, what a disorganized article. Who the hell edited this?

I have to agree. It didn't really turn out as well as I wanted. The talk is really good, but I could never make this read smoothly.

Fortunately I think the content is excellent and that will hopefully save the day.

My take is along the lines of the earlier embedded systems comment. This is the sort of thing you would have done in a real-time OS snagging packets directly in an ISR and queuing them to a task for later processing. The processor and memory suggestions are just common practice in that arena.

But that kind of sucks for numerous reasons. It's great having a widely supported OS that you can leverage while replacing the bits you need to meet your goals. That's pretty genius in my book.

toki5 · on May 13, 2013

>I have to agree. It didn't really turn out as well as I wanted. The talk is really good, but I could never make this read smoothly.

Props for stepping forward and saying this :-)

I think it has to do with that you're trying to capture the content of a talk largely by presenting the bullet points from that talk's slide deck, but in doing so miss a lot of context that came from the presentation.

Your summary at the bottom is great -- I think it should be the whole article. This would offer viewers a snapshot of what they can expect out of the video so that they don't miss anything (for example, a very common takeaway I'm seeing from this and from people I show it to is that this is "just" a network stack hack -- no mention of his proposal for multi-core optimization, etc).

snaky · on May 13, 2013

There is another one alternative, rump kernel[1], mentioned in commentary to original posting about "direct" network processing[2]. The recent development was discussed in LuaJIT mailinglist recently (the thread is interesting by all means in "minus OS" context by the way)[3]

[1] http://www.netbsd.org/docs/rump/

[2] http://erratasec.blogspot.ru/2013/02/custom-stack-it-goes-to...

[3] http://thread.gmane.org/gmane.comp.lang.lua.luajit/2493

justincormack · on May 13, 2013

Hope to have some real deliverables in a week or two....

bradleyjg · on May 13, 2013

The answer is right there in the article: "Yet, what you have is code running on Linux that you can debug normally, it’s not some sort of weird hardware system that you need custom engineer for. You get the performance of custom hardware that you would expect for your data plane, but with your familiar programming and development environment."

Maybe Mirage will pan out, but for right now it's in the pre-alpha stage. We've heard about the glories of microkernals before (GNU Hurd, anyone?).

snaky · on May 13, 2013

That's obvious that not every developer could solve 10M problem on x86 commodity hardware, of course. But obviously there are particular cases where it's worth it clearly.

"Debug normally" isn't the case with BigTable or any other MapReduce implementation, is it? Do we think Google and big bunch of others made a mistake with it? Obviously not.

bradleyjg · on May 13, 2013

Distributed software has its own pain points in development and testing, but SFAIK BigTable and MapReduce nodes run on commodity hardware running a flavor of linux.[1]

[1] This paper seems to support that: http://int3.de/res/GfsMapReduce/GfsMapReducePaper.pdf

snaky · on May 13, 2013

That was my point exactly. Running Big Data on commodity hardware nodes (instead of Oracle/SunFire) was just as crazy during early Google days as OS kernel bypassing today.

bradleyjg · on May 13, 2013

I see. So the argument is that because something that was considered crazy panned out against the odds, therefore everything considered crazy is actually a good bet?

That seems like dubious logic.

Also unlike Google's bet on commodity hardware, I don't see the millions of dollars of savings to be had if the micro-kernel idea works out. At best you'll get an incremental performance benefit and easier debugging. It'd be different if there were no open source OSes and so rolling your own was the only alternative to paying millions in licensing fees.