
The C10K problem - luu
http://www.kegel.com/c10k.html
======
spacemanmatt
Last time I built a server project from scratch, it was on a 300MHz quad-core
Xeon base, which hosted around 150 simultaneous web users. It took some effort
to make it scale like that, but it was worth it. My hardware maintenance was
low because we really maximized the software's capacity.

In modern times, RAM and CPUs are more than 10 times bigger and faster, but I
am seeing people get around 25 times LESS out of them, because they choose
terrible tools, don't benchmark, and generally don't care. Now (at a different
company) our app server pool has 150 nodes and each serves 4 users. The
application is significantly less complex than what I built 13 years ago.

I sincerely doubt any of my coworkers have read this document. It shows.

~~~
davidw
A lot of the calculation is the $$$ you make per connection. For a SaaS kind
of thing, you can generally keep that high enough not to worry about squeezing
every last drop of efficiency out of things. A much bigger worry for most
startup-ish companies is finding product-market fit.

~~~
DrJ
While I would agree that trying to squeeze every last drop of efficiency may
not be worth the time/money for the company, I think the parent is implying
that people are wasting a lot of extra resources due to poor tools, poor
coding, poor profiling, poor maintenance, etc.

I would agree though that for startups the main problem is finding where the
$$$ is; even so, it is important to know what the cost per
user/connection/instance is and to be able to control or monitor it.

------
pyalot2
The C10K problem is still not solved. A run-of-the-mill VPS typically manages
around 1k simultaneous connections and 10k requests/s, and that is over plain
HTTP. Over TLS this drops to about 100 connections and 600 requests/s. A
cheap Amazon instance will usually manage around 1/5th of all that.

With a bit of tweaking you can push the plain-HTTP case a bit higher, but the
TLS route will not get much better, because TLS wasn't designed with speed in
mind (the handshake is slow and expensive) and the prevalent implementation
(OpenSSL) is not written for high-performance servers (it basically dictates
that you run blocking sockets with a thread per connection).
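
To make this concrete, here is a minimal sketch of the blocking
thread-per-connection TLS pattern I mean (the port and cert/key paths are
placeholders; error handling is omitted):

    /* Sketch of a blocking thread-per-connection TLS echo server using
     * the classic (pre-1.1.0) OpenSSL API. Compile with:
     *   cc tls_threads.c -lssl -lcrypto -lpthread
     * "server.crt"/"server.key" and port 4433 are placeholders. */
    #include <openssl/ssl.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <pthread.h>
    #include <unistd.h>

    static SSL_CTX *ctx;

    static void *handle_client(void *arg)
    {
        int fd = (int)(long)arg;
        SSL *ssl = SSL_new(ctx);
        SSL_set_fd(ssl, fd);
        if (SSL_accept(ssl) == 1) {   /* blocking handshake: the expensive part */
            char buf[4096];
            int n = SSL_read(ssl, buf, sizeof buf);  /* blocks the whole thread */
            if (n > 0)
                SSL_write(ssl, buf, n);              /* echo it back */
        }
        SSL_shutdown(ssl);
        SSL_free(ssl);
        close(fd);
        return NULL;
    }

    int main(void)
    {
        SSL_library_init();
        ctx = SSL_CTX_new(SSLv23_server_method());
        SSL_CTX_use_certificate_file(ctx, "server.crt", SSL_FILETYPE_PEM);
        SSL_CTX_use_PrivateKey_file(ctx, "server.key", SSL_FILETYPE_PEM);

        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(4433);
        bind(lfd, (struct sockaddr *)&addr, sizeof addr);
        listen(lfd, 128);

        for (;;) {
            int cfd = accept(lfd, NULL, NULL);
            pthread_t t;               /* one OS thread per connection: fine */
            pthread_create(&t, NULL,   /* for hundreds of users, painful at  */
                           handle_client, (void *)(long)cfd);  /* 10k        */
            pthread_detach(t);
        }
    }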

Unfortunately SPDY and the various HTTP/2 proposals rely on TLS (in order to
punch through proxies), which means going back about 10 years in server
performance.

So it is little surprise that companies have started offering "cloud"
solutions, because the typical VPS can't handle today's high-traffic internet
over TLS (worse than plain HTTP by a factor of 10), and the typical cloud
server is worse than a VPS by a factor of 5, creating an artificial 50x
performance degradation. Obviously, when faced with the question of running
10 servers or 100, most small companies turn to the even worse "cloud"
solution, requiring even more servers (500).

The whole affair is a sodding mess, and we're wasting massive amounts of
energy and capital by insisting on doing things inefficiently. By rights our
VPS servers should easily be able to break through the 10k limit in every
way, but they can't, because the OS wastes a lot of time running an
inefficient network stack, and because TLS and OpenSSL can't be bothered to
get their act together.

And that is how, in the year of our lord 2014, more than 15 years after
somebody wrote the C10K article, and with webservers at least 128x faster
than back in the '90s, most websites out there still can't stand up to
serious traffic, take ages to load, and are hosted on infrastructure (routers
and whatnot) that buckles under even light DDoSing.

~~~
wbsun
When it comes to transactions and stateful sessions, C10K, or even C1K, is
far from solved.

~~~
pyalot2
But that's a bit absurd, isn't it? A recent Core i7 reaches 124,850 MIPS.
That means that if it takes 1/1,000th of a second to handle a connection, the
CPU could execute 124 million instructions during that time. At 1/10,000th of
a second it could still execute 12 million instructions. Minimal asynchronous
connection handling certainly doesn't require more than, say, 10,000
instructions, so our servers are under-delivering on their performance by a
factor of at least 1,200x, perhaps even 12,000x.

Our servers should be able to surpass C1k easily; even C10k shouldn't tax
them. They should, by rights, only be taxed by the C10M problem.
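
For a sense of what "minimal asynchronous connection handling" means, here's
a rough sketch using Linux's epoll (the port and canned response are made up;
error handling and partial writes are omitted):

    /* Minimal asynchronous connection handling with epoll. The per-event
     * work is a handful of syscalls plus very little user-space logic,
     * nowhere near the per-connection instruction budget estimated above. */
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);                 /* placeholder port */
        bind(lfd, (struct sockaddr *)&addr, sizeof addr);
        listen(lfd, SOMAXCONN);

        int ep = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
        epoll_ctl(ep, EPOLL_CTL_ADD, lfd, &ev);

        static const char reply[] =
            "HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok";

        for (;;) {
            struct epoll_event events[1024];
            int n = epoll_wait(ep, events, 1024, -1);  /* sleep until work */
            for (int i = 0; i < n; i++) {
                int fd = events[i].data.fd;
                if (fd == lfd) {                       /* new connection */
                    int cfd = accept(lfd, NULL, NULL);
                    fcntl(cfd, F_SETFL, O_NONBLOCK);
                    struct epoll_event cev =
                        { .events = EPOLLIN, .data.fd = cfd };
                    epoll_ctl(ep, EPOLL_CTL_ADD, cfd, &cev);
                } else {                               /* readable client */
                    char buf[4096];
                    ssize_t r = read(fd, buf, sizeof buf);
                    if (r <= 0) { close(fd); continue; } /* EOF or error */
                    write(fd, reply, sizeof reply - 1);  /* canned response */
                    close(fd);                           /* no keep-alive */
                }
            }
        }
    }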

~~~
sharpneli
There are two numbers that have not really changed: memory latency and
processor clock speed.

The latency of a main-memory read is still around 100ns, and has been for
over 10 years. That means your CPU has to wait hundreds of clock cycles for a
read from RAM that misses the cache, and with huge datasets it probably does
miss.
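
You can see this for yourself with a pointer-chasing loop, roughly like the
sketch below (array and step counts are arbitrary; each load depends on the
previous one, so the CPU can't overlap them):

    /* Rough pointer-chasing sketch to expose main-memory latency. With a
     * working set far larger than the caches, time per step approaches
     * raw DRAM latency. Compile with -O2; sizes are arbitrary. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N     (8 * 1024 * 1024)   /* 8M pointers = 64MB working set */
    #define STEPS (10 * 1000 * 1000)

    int main(void)
    {
        size_t *next = malloc(N * sizeof *next);
        for (size_t i = 0; i < N; i++) next[i] = i;
        /* Sattolo's algorithm: one random cycle through the array, so
         * every step is an unpredictable, dependent load. */
        srand(1);
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;            /* crude but adequate */
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t p = 0;
        for (long s = 0; s < STEPS; s++) p = next[p]; /* the dependent chain */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per load (p=%zu)\n", ns / STEPS, p); /* p defeats DCE */
        return 0;
    }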

Another issue is processor clock speed. Yes, it is true that a modern i7 can
reach 124,850 MIPS, but that number comes from having 4 cores, each able to
retire up to 8 instructions per clock; you are still limited when executing
dependent instructions.

That sounds like a lot, but one must remember that it reaches 8 instructions
per clock only when the instructions are a good mix of float/int operations,
there are no branches, and the instructions do not depend on each other. In
practice you reach maybe 1-2 instructions per clock, and some code can drop
even to 0.5 IPC (a bunch of unpredictable branches and whatnot).

Writing code that exploits large memory bandwidth despite poor latency,
combined with massive CPU throughput when the instructions are not too
dependent on each other, is almost like writing modern GPU programs.

It would be interesting to see what kind of web-server performance one could
get by carefully writing one in OpenCL (using a CPU target, not a GPU).

~~~
pyalot2
Yeah, I'm not disputing that there are bottlenecks in the system (it's not
only the memory; the bus between the NIC and the CPU is also to blame).

But blaming bad I/O performance on large datasets misses the point slightly.
You can perfectly well write a program that doesn't use the heap and has a
smaller stack footprint than those processors have L2 cache... (of course,
that would be a test program).

But network performance is probably bound not so much by system latency as by
an abysmally bad software stack, starting with the kernel, through the
networking stack, to the sheer idea of TLS and its prevalent implementation
(OpenSSL).

Indeed, there have been calls to get rid of it all: ban the OS from all but
one or two cores, throw out the whole network stack with all its layers, and
implement the networking directly in the application that needs it.

~~~
rbanffy
> there have been calls to get rid of it all: ban the OS from all but one or
> two cores, throw out the whole network stack with all its layers, and
> implement the networking directly in the application that needs it

You certainly know building and supporting that would cost more or less the
same as building and operating a sizeable datacenter. If it succeeds.

Using all the processing power a modern CPU offers on real code with real
data is almost impossible. And it's not only memory latency and instruction
interdependence: there are latencies all over a PC even before you leave the
rackmount chassis, and the supporting network is another source of
uncontrollable latency. Most apps I manage spend 99.99% of their time waiting
for something to happen, be it the next packet or the results from another
server (which is actually a cluster behind one or more load balancers).

You may get somewhat better cache hit ratios by tweaking thread/core
affinity, but that won't take you to where you want to be.

If you really need that much performance, I'd suggest building your own VLIW
architecture and generating the instruction mix on auxiliary CPUs, on the
fly, as a single continuous thread built from all incoming requests for the
VLIW core to devour. That would be a huge undertaking, but it would also be
pretty cool CompSci.

~~~
vidarh
> You certainly know building and supporting that would cost more or less the
> same as building and operating a sizeable datacenter. If it succeeds.

There are plenty of solutions for that already. For the simplest case,
user-space networking, you can pick from a number of "off the shelf"
solutions:

[http://lukego.github.io/blog/2013/01/04/kernel-bypass-networking/](http://lukego.github.io/blog/2013/01/04/kernel-bypass-networking/)
[http://www.openonload.org/](http://www.openonload.org/)

------
snaky
2013 update:

The C10M problem

It's time for servers to handle 10 million simultaneous connections, don't
you think? After all, computers now have 1000 times the memory they had 15
years ago, when they first started handling 10 thousand connections.

Today (2013), $1200 will buy you a computer with 8 cores, 64 gigabytes of
RAM, 10-gbps Ethernet, and a solid state drive. Such systems should be able
to handle:

- 10 million concurrent connections
- 10 gigabits/second
- 10 million packets/second
- 10 microsecond latency
- 10 microsecond jitter
- 1 million connections/second

[http://c10m.robertgraham.com/p/manifesto.html](http://c10m.robertgraham.com/p/manifesto.html)

~~~
peterwwillis
Read the "Other Performance Metric Relationships" part at the bottom of this
page[1]. Basically, just because your machine may be able to physically hold
10 million connections open, does not mean your machine could handle _opening_
10 million connections in a reasonable amount of time, much less handling 10
million _transactions_ in a reasonable amount of time. If you can't open that
many connections at once, or process that many transactions, just being able
to keep them open becomes moot.

This article[2] breaks down the issues fairly well. To handle this kind of
traffic, you basically have to redesign huge swaths of technology that exist
precisely so we don't have to implement these things more than once. I don't
see anyone investing in that without a specific itch to scratch (like deep
packet inspection).

[1] [http://www.cisco.com/web/about/security/intelligence/network_performance_metrics.html](http://www.cisco.com/web/about/security/intelligence/network_performance_metrics.html)

[2] [http://highscalability.com/blog/2013/5/13/the-secret-to-10-million-concurrent-connections-the-kernel-i.html](http://highscalability.com/blog/2013/5/13/the-secret-to-10-million-concurrent-connections-the-kernel-i.html)

------
jrmenon
Big fan of the kqueue() mentioned in the article. IIRC, with sockets it not
only tells you when a socket fd is ready (say, for a non-blocking read), but
also reports the number of bytes available to read, which lets you write
efficient code (i.e., no reading into some fixed-size buffer and looping
again to see if more data remains).
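
From memory, the socket side looks roughly like this (a sketch only; error
handling is omitted and the per-event malloc is just for illustration):

    /* Sketch of the kqueue pattern described above: EVFILT_READ fires when
     * a socket is readable, and kev.data carries the number of bytes
     * already buffered, so a single read() of exactly that size suffices.
     * BSD/macOS only. */
    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <sys/socket.h>
    #include <stdlib.h>
    #include <unistd.h>

    void serve(int listen_fd)
    {
        int kq = kqueue();
        struct kevent ev;
        EV_SET(&ev, listen_fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
        kevent(kq, &ev, 1, NULL, 0, NULL);          /* register the listener */

        for (;;) {
            struct kevent kev;
            if (kevent(kq, NULL, 0, &kev, 1, NULL) < 1)  /* wait for an event */
                continue;
            int fd = (int)kev.ident;
            if (fd == listen_fd) {                  /* new connection: watch it */
                int cfd = accept(listen_fd, NULL, NULL);
                EV_SET(&ev, cfd, EVFILT_READ, EV_ADD, 0, 0, NULL);
                kevent(kq, &ev, 1, NULL, 0, NULL);
            } else {
                /* kev.data = bytes available: size the buffer exactly */
                char *buf = malloc((size_t)kev.data);
                read(fd, buf, (size_t)kev.data);    /* one read, no guess-and-loop */
                /* ... process buf ... */
                free(buf);
                if (kev.flags & EV_EOF)             /* peer closed the socket */
                    close(fd);
            }
        }
    }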

Also, I think that for files/directories you can listen for any changes that
occur.

Wish this were officially available in Linux.

------
gberger
(1999)?

~~~
wging
Arguably (2009) at the earliest. It was certainly originally written earlier,
but it seems to be something of a living document. The first sentence after
the table of contents is:

"See Nick Black's execellent Fast UNIX Servers page for a circa-2009 look at
the situation."

~~~
jleader
If you scroll down to the bottom, there's a changelog describing some changes
applied during 2003-2011, followed by "Copyright 1999-2014", which gives a
few clues about how old the document is.

------
dontdieych
If you decide to publish a long, serious article on the web, please consider
using CSS for better readability. Don't rely too much on the browsers'
default styles; their vendors are too busy making JavaScript fast. In
particular, please don't use 100% width. Sure, our displays are far wider
than they are tall, but that width is good for movies, not text. Hold any
book up to your monitor and compare its line width.

[https://www.readability.com/articles/zrgovuxr](https://www.readability.com/articles/zrgovuxr)

TW;DR: Too Wide, Didn't Read.

~~~
ricardobeat
This article is 15 years old. There wasn't much in the way of CSS back then,
much less responsive design. But the brilliant thing is that it still works
perfectly, because it's so simple! Pop the tab out into its own window and
resize it, or use a readability browser extension/website. Oh, you just did
that.

~~~
elwell
haha yeah

