
The C10M problem
http://c10m.robertgraham.com/p/manifesto.html
======
erichocean
What's significant to me is that you can do this stuff today on stock Linux.
No need to run weird single-purpose kernels, strange hypervisors, etc.

You can SSH into your box. You can debug with gdb. Valgrind. Everything is
normal...except the performance, which is just insane.

Given how easy it is, there isn't really a good excuse anymore to not write
data plane applications the "right" way, instead of jamming everything through
the kernel like we've been doing. Especially with Intel's latest E5
processors, the performance is just phenomenal.

If you want a fun, accessible project to play around with these concepts,
Snabb Switch[0] makes it easy to write these kinds of apps with LuaJIT, which
also has a super easy way to bind to C libraries. It's fast too: 40 million
packets a second using a scripting language(!).

I wrote a little bit about a recent project I completed that used these
principles here:
[https://news.ycombinator.com/item?id=7231407](https://news.ycombinator.com/item?id=7231407)

[0]
[https://github.com/SnabbCo/snabbswitch](https://github.com/SnabbCo/snabbswitch)

~~~
nl
Snabb Switch looks awesome.

Is there a list of "things" (not sure of the terminology?) people have built
with it?

I guess the downside is that you can't virtualize it (I realize that is kind
of the point, but it does reduce the accessibility of it).

~~~
lukego
I'm the Snabb Switch originator.

The project is new: I and other open source contributors are currently under
contract to build a Network Functions Virtualization platform for Deutsche
Telekom's TeraStream project [1] [2]. This is called Snabb NFV [3] and it's
going to be totally open source and integrated with OpenStack.

Currently we are doing a lot of virtualization work from the "outside" of the
VM: implementing Intel VMDq hardware acceleration and providing zero-copy
Virtio-net to the VMs. So the virtual machine will see normal Virtio-net but
we will make that operate really fast.

Inside the VMs we can either access a hardware NIC directly (via IOMMU "PCI
passthrough") or we can write a device driver for the Virtio-net device.

So, early days, first major product being built, and lots of potential both
inside and outside VMs, lots of fantastic products to build with nobody yet
building them :-)

[1] [http://blog.ipspace.net/2013/11/deutsche-telekom-terastream-...](http://blog.ipspace.net/2013/11/deutsche-telekom-terastream-designed.html)

[2]
[https://ripe67.ripe.net/archives/video/3/](https://ripe67.ripe.net/archives/video/3/)

[3] [https://github.com/SnabbCo/snabbswitch/blob/snabbnfv-readme/...](https://github.com/SnabbCo/snabbswitch/blob/snabbnfv-readme/src/designs/nfv/README.md)

~~~
chubot
How about this use case: I have a ChromeCast on my home network, but I want to
sandbox/log its traffic. I would want to write some logic to ignore video
data, because that's big. But I want to see the metadata and which servers
it's talking to. I want to see when it's auto-updating itself with new
binaries and record them.

Is that a good use case for Snabb Switch, or is there an easier way to
accomplish what I want?

~~~
lukego
That sounds pretty reasonable to me.

If you can express how you want to filter with a fancy pcap-filter expression,
then tcpdump is the easy answer. Otherwise you might want to code it up in Lua
with snabbswitch.
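
If C is more your speed, the same pcap-filter idea via libpcap looks roughly
like this (a hedged sketch: the interface name and the filter expression are
placeholders you'd adapt, e.g. to skip bulky video-sized packets):

    #include <pcap/pcap.h>
    #include <stdio.h>

    /* Print a one-line summary for every packet matching the filter. */
    static void handler(u_char *user, const struct pcap_pkthdr *h,
                        const u_char *bytes) {
        (void)user; (void)bytes;
        printf("captured %u bytes (on wire: %u)\n", h->caplen, h->len);
    }

    int main(void) {
        char errbuf[PCAP_ERRBUF_SIZE];
        /* "eth0" is a placeholder; the filter keeps small (metadata-sized)
         * TCP packets and skips bulk video-sized ones. */
        pcap_t *p = pcap_open_live("eth0", 65535, 1, 1000, errbuf);
        if (!p) { fprintf(stderr, "%s\n", errbuf); return 1; }

        struct bpf_program prog;
        if (pcap_compile(p, &prog, "tcp and less 600", 1,
                         PCAP_NETMASK_UNKNOWN) == -1 ||
            pcap_setfilter(p, &prog) == -1) {
            fprintf(stderr, "%s\n", pcap_geterr(p));
            return 1;
        }
        pcap_loop(p, -1, handler, NULL);  /* runs until interrupted */
        pcap_close(p);
        return 0;
    }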

Here is our basic trace store/replay library today:
[https://github.com/SnabbCo/snabbswitch/blob/master/src/lib/p...](https://github.com/SnabbCo/snabbswitch/blob/master/src/lib/pcap/pcap.lua)

~~~
chubot
OK, and I forgot to say I might want to deny some traffic... like disabling
auto-updates but still allowing it to contact other servers to play video.
AFAIK tcpdump doesn't let you do that.

Thanks for the very cool project! I will have to learn more about it.

------
wpietri
On the one hand, I love this. There's an old-school, down-to-the-metal,
efficiency-is-everything angle that resonates deeply with me.

On the other hand, I worry that just means I'm old. There are a lot of
perfectly competent developers out there that have very little idea about the
concerns that motivate thinking like this C10M manifesto.

I sometimes wonder: is my urge toward efficiency something like my
grandmother's Depression-era tendency to save string? Is this kind of
efficiency effectively obsolete for general-purpose programming? I hope not,
but I'm definitely not confident.

~~~
logicchains
I believe it's what used to be called 'craftsmanship'. Taking pride in
creating things that are efficient and not wasteful for no other reason than
the desire to make the best product possible.

~~~
usea
It is wasteful to spend time doing something that benefits nobody, or where
the costs outweigh the benefits. Your time is worth something.

~~~
srean
I think the "just good enough for business" attitude contributed to the demise
of the American car and the rise of the Japanese ones. The American tradition
was to use engineering tolerances that would maximize throughput under the
constraint that it produced a pretty functional car. The Japanese tradition on
the other hand was to use tighter tolerances, well because there was room for
tightening the tolerance. At the surface level, or the MBA level it seems that
the Japanese way is just dumb. It does not make monetary sense when measured
in sales per year. But it turns out that the benefits express themselves at a
different timescale: better brand and a culture of devotion to improvement,
rather than to being "just good enough to work so that I can move to the next
thing".

Part of this applies to the craft of software too. One can be sloppy and churn
out functional websites by the dozen. At the superficial level, the goal of
extracting the most from a 500MHz Pentium-III might seem brain-dead, with
little or no payoff. But it pays back by instilling an attitude of deeply
learning your craft, and that learning does pay back, even though that
specific well-tuned web server might not. You don't have to build all your
servers that way, but if it doesn't hurt you a little when you know exactly
what you could have done to extract some more juice out of it, you will not
reach that level of understanding. If you are impatient or sloppy, you will
build impatience into your product, and it will show.

Besides, it is always easy to overvalue one's time; it sometimes correlates
with conceit.

@sitkack, what makes you think I was talking about the 50s? Going by the
downvote I seem to have touched a nerve. You are right that Japanese products
around that time were synonymous with bad quality, and not just in the US.
Much water has flowed under the bridge since then.

~~~
sitkack
Your recollection of Japanese manufacturing is clouded by mythology. Japanese
quality sucked in the 1950s; they had a horrible worldwide reputation for
shoddy goods.

~~~
logicchains
Maybe German manufacturing would be a better example?

~~~
guard-of-terra
German quality sucked some half century earlier, so much so that England
required them to write "Made in" on their goods to warn consumers.
Quality comes and goes.

~~~
sitkack
Exactly. When people complain about the quality of Chinese goods as a proxy
for China and Chinese people as a whole, I ask them where their MacBook Pro is
made and they quickly shut up.

All of these lazy, low tolerance sweeping generalizations need to go!

------
axman6
It seems we've already passed this problem: "We also show that with Mio,
McNettle (an SDN controller written in Haskell) can scale effectively to 40+
cores, reach a throughput of over 20 million new requests per second on a
single machine, and hence become the fastest of all existing SDN
controllers."[1] (reddit discussion at [2])

This new IO manager was added to GHC 7.8, which is due for final release very
soon (currently in RC stage). That said, I'm not sure whether all (or even
most) of the criteria have been met. But hey, at least they're already doing
20M new requests per second.

[1] [http://haskell.cs.yale.edu/wp-content/uploads/2013/08/hask03...](http://haskell.cs.yale.edu/wp-content/uploads/2013/08/hask035-voellmy.pdf)

[2] [http://www.reddit.com/r/haskell/comments/1k6fsl/mio_a_highpe...](http://www.reddit.com/r/haskell/comments/1k6fsl/mio_a_highperformance_multicore_io_manager_for/)

------
jared314
Previous Discussion:
[https://news.ycombinator.com/item?id=5699552](https://news.ycombinator.com/item?id=5699552)
(9 months ago)

High Scalability Post: [http://highscalability.com/blog/2013/5/13/the-secret-to-10-m...](http://highscalability.com/blog/2013/5/13/the-secret-to-10-million-concurrent-connections-the-kernel-i.html)

Original Shmoocon Presentation:
[http://www.youtube.com/watch?v=73XNtI0w7jA](http://www.youtube.com/watch?v=73XNtI0w7jA)

------
alberth
WhatsApp is achieving ~3M concurrent connections on a single node. [1][2]

The architecture is FreeBSD and Erlang.

It does make me wonder, and I've asked this question before [3], why can
WhatsApp handle so much load per node when Twitter struggled for so many years
(e.g. Fail Whale)?

[1] [http://blog.whatsapp.com/index.php/2012/01/1-million-is-so-2...](http://blog.whatsapp.com/index.php/2012/01/1-million-is-so-2011/)

[2, slide 16] [http://www.erlang-factory.com/upload/presentations/558/efsf2...](http://www.erlang-factory.com/upload/presentations/558/efsf2012-whatsapp-scaling.pdf)

[3]
[https://news.ycombinator.com/item?id=7171613](https://news.ycombinator.com/item?id=7171613)

~~~
barrkel
The problem of 1:1 messaging is slightly different from Twitter's, which is more
m:n. 1:1 messaging can be handled reasonably easily with a mailbox per user,
and there is no shared state. Messaging with m:n has different optimal
patterns depending on the relative ratios of m and n. Twitter has many users
with millions of followers; if Twitter used a 1:1 mailbox approach like a chat
app, these users would be whole countries worth of load on their own.
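
A toy way to see the asymmetry (numbers invented): with fan-out on write,
delivery cost is one mailbox write per recipient, so one celebrity tweet
costs as much as millions of 1:1 chat messages.

    #include <stdio.h>

    /* Toy cost model: delivering one message means one write per
     * recipient mailbox (fan-out on write). Numbers are invented. */
    int main(void) {
        long chat_recipients = 1;             /* 1:1 messaging */
        long celebrity_followers = 10000000;  /* m:n messaging */

        printf("chat message: %ld mailbox write(s)\n", chat_recipients);
        printf("celebrity tweet: %ld mailbox writes\n", celebrity_followers);
        /* At this ratio, one tweet costs as much as ten million chat
         * messages, which is why m:n systems switch to fan-out on read
         * (merging followee timelines at read time) for huge accounts. */
        return 0;
    }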

That's not to say that Twitter's scaling issues were wholly forgivable. They
weren't fatal to the service, but I don't think they were necessary with good
design from the start. High popularity is a good problem to have though.

------
joosters
If you are going to write a big article on a 'problem', then it would be a
good idea to spend some time explaining the problem, perhaps with some
scenarios (real world or otherwise) to solve. Instead, this article just leaps
ahead with a blind-faith 'we must do this!' attitude.

That's great if you are just toying with this sort of thing for fun, but
perhaps worthless if you are advocating a style of server design for others.

Also, the decade-old C10K problem could draw some interesting parallels. First
of all, are machines today 1000 times faster? If they are, then even if you
hit the 10M magic number, you will still only be able to do the same amount of
work per-connection that you could have done 10 years ago. I am guessing that
many internet services are much more complicated than a decade ago...

And if you can achieve 10M connections per server, you really should be asking
yourself whether you actually want to. Why not split it down to 1M each over
10 servers? No need for insane high-end machines, and the failover when a
single machine dies is much less painful. You'll likely get a much improved
latency per-connection as well.

------
rdtsc
Here is how the C2M-to-C3M connections problem was solved in 2011 using Erlang and
FreeBSD:

[http://www.erlang-factory.com/upload/presentations/558/efsf2...](http://www.erlang-factory.com/upload/presentations/558/efsf2012-whatsapp-scaling.pdf)

It shows good practical tricks and pitfalls. It was 3 years ago so I can only
assume it got better, but who knows.

Here is the thing though: do you need to solve the C*M problem on a single
machine? Sometimes you do, but sometimes you don't. If you don't, and you
distribute your system, you have to fight against sequential points in your
system. So you put up a load balancer and spread your requests across 100
servers at 100K connections each. Feels like a win, except all those
connections have to live at the same time and then access a common ACID DB
back-end. So now you have to think about your storage back-end: can that
scale? If your existing DB can't handle it, now you have to think about your
data model. And then if you redesign your data model, you might have to
redesign your application's behavior, and so on.

~~~
chongli
If you could do 10M connections on one machine, then why not 1B on 100? Does
it even make sense to have a billion simultaneous connections?

~~~
erichocean
If by "connection", you mean TCP, probably not. But that's not the only way to
maintain connections, and it's certainly not the only reliable network
protocol.

My latest project keeps every "connection" open at all times, but it's a
custom UDP-based reliable messaging protocol, not TCP. At Facebook's scale,
we'd have the equivalent of one billion connections "open". It's easy to keep
them open, despite changing IP addresses, because every packet is public-key
authenticated and encrypted, so you don't have to rely on IP addresses to know
who you're talking to...

It also means you only pay for a connection setup time once. For mobile
devices, the improvement in latency is palpable.
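
To sketch the lookup idea in C (toy code, invented field names, not our
production implementation): sessions are found by a key ID carried in each
packet, and the source address is merely updated as a side effect.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define MAX_SESSIONS 1024
    #define KEY_ID_LEN 8        /* toy size; real IDs would be bigger */

    /* Identity lives in the key, not the IP: the session survives any
     * address change because lookup never touches the source address. */
    struct session {
        uint8_t  key_id[KEY_ID_LEN];
        uint32_t last_ip;
        uint16_t last_port;
        int      in_use;
    };

    static struct session table[MAX_SESSIONS];

    static struct session *find_session(const uint8_t *key_id) {
        for (int i = 0; i < MAX_SESSIONS; i++)  /* toy linear scan */
            if (table[i].in_use &&
                memcmp(table[i].key_id, key_id, KEY_ID_LEN) == 0)
                return &table[i];
        return NULL;
    }

    /* Called only after the packet's signature verified against key_id. */
    static void on_packet(const uint8_t *key_id, uint32_t ip, uint16_t port) {
        struct session *s = find_session(key_id);
        if (s) { s->last_ip = ip; s->last_port = port; }
    }

    int main(void) {
        uint8_t id[KEY_ID_LEN] = "device01";  /* exactly 8 bytes, no NUL */
        memcpy(table[0].key_id, id, KEY_ID_LEN);
        table[0].in_use = 1;

        on_packet(id, 0x0A000001, 40000);  /* device on one network...    */
        on_packet(id, 0xC0A80101, 51234);  /* ...same device, new ip:port */
        printf("session still found, last port %u\n", table[0].last_port);
        return 0;
    }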

~~~
mh-
Can I ask some questions about that? Extremely interested.

I've considered doing something similar for our messaging/signalling protocol
(currently standard TCP sockets established to several million mobile
devices.)

I had concerns about what the deliverability of UDP would be on mobile
networks; many carriers are going towards NAT'ing everything, (forced)
transparent proxies, etc.

Are you only using UDP from device->infrastructure? If not, do you rely on the
devices providing an ip:port over a heartbeat of sorts (to keep up with IP
changes)?

Do you have any issues with deliverability, in either direction? (not due to
UDP's inherent properties, but because of carrier network behavior)

Thanks very much for anything you're able to answer.

~~~
erichocean
Good questions.

I'm using UDP in both directions, and I do have a heartbeat (currently set at
30 seconds, but I think we could go to 60 seconds without any problems). We do
use that to keep track of IP:PORT changes, but also (mainly?) to keep the UDP
hole punched, due to carriers NAT'ing everything.
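
The keepalive side of that is tiny; a minimal sketch in C, with a placeholder
address and port rather than our real endpoint:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Send a tiny datagram every 30s so the carrier NAT keeps the port
     * mapping alive; the server learns the device's current ip:port
     * from each packet's source address. */
    int main(void) {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in srv;
        memset(&srv, 0, sizeof(srv));
        srv.sin_family = AF_INET;
        srv.sin_port = htons(9000);                       /* placeholder */
        inet_pton(AF_INET, "203.0.113.1", &srv.sin_addr); /* placeholder */

        for (;;) {
            sendto(fd, "hb", 2, 0, (struct sockaddr *)&srv, sizeof(srv));
            sleep(30);  /* heartbeat interval; 60s may also work */
        }
    }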

It works, and it works well. It's the same idea behind WebRTC, except instead
of going peer-to-peer, you go client<->server.

All I've seen so far is the usual UDP stuff: dropped packets, re-ordered
packets, and duplicate packets. Nothing out of the ordinary. Our network
protocol handles those things without any difficulties.
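
The standard trick for duplicates and re-ordering, for anyone curious, is a
sliding anti-replay window like the one IPsec and DTLS use. This is that
generic technique in C, not our actual code:

    #include <stdint.h>
    #include <stdio.h>

    static uint64_t highest = 0;
    static uint64_t window = 0;  /* bit i set => seq (highest - i) seen */

    /* Accept a packet if its sequence number is new and within 64 of
     * the highest seen; duplicates and ancient packets are dropped,
     * while modestly re-ordered ones are still accepted. */
    static int accept_seq(uint64_t seq) {
        if (seq > highest) {                  /* new highest: slide window */
            uint64_t shift = seq - highest;
            window = (shift >= 64) ? 0 : window << shift;
            window |= 1;                      /* bit 0 is 'highest' itself */
            highest = seq;
            return 1;
        }
        uint64_t off = highest - seq;
        if (off >= 64) return 0;              /* too old to track: drop */
        if (window & (1ULL << off)) return 0; /* duplicate: drop */
        window |= 1ULL << off;                /* late but new: accept */
        return 1;
    }

    int main(void) {
        uint64_t seqs[] = {1, 3, 2, 3, 10, 9, 1};
        for (int i = 0; i < 7; i++)
            printf("seq %llu -> %s\n", (unsigned long long)seqs[i],
                   accept_seq(seqs[i]) ? "accept" : "drop");
        return 0;
    }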

We did it specifically because all of our clients are mobile devices, and we
didn't want to have to do the lengthy TCP connection setup (or worse, SSL
setup) each time the network changed—which is often.

The biggest downside of UDP at the moment is that Apple only allows TCP
connections in the background. That seems like a silly decision, but whatever,
it's what they've done. I may, someday, set up a bunch of TCP forwarders for
iOS devices running in the background. Our messages can be decoded just fine
over byte-oriented streams, so it wouldn't change much.

It's a tough call. On the one hand, our UDP clients do not need to reconnect,
since the connection is always set up. So when they wake, they send a packet
to the server (Hello), and the server immediately sends back any immediate
updates, such as new chat messages.

Our read path is perhaps over-optimized, so it's exactly the network latency
of one round trip to get the updates since you last opened the app. It takes
longer to get the UI up in some cases, so that's why we haven't done the TCP
background thing. For others, that might be a much more important
consideration.

~~~
Scaevolus
A lot of these features sound similar to MinimaLT:
[http://cr.yp.to/tcpip/minimalt-20130522.pdf](http://cr.yp.to/tcpip/minimalt-20130522.pdf)

Do you have explicit DoS protections?

~~~
erichocean
That paper (as well as CurveCP) was definitely a huge inspiration for what I'm
doing.

A big difference is I'm not running a packet scheduler (e.g. Chicago).
Frankly, our data rate on a per device basis is just minuscule, and our
internal protocol has back pressure built into it anyway, so I just skipped
that part. If it becomes an issue (unlikely), I'll of course actually add an
explicit scheduler so our UDP traffic plays nicely with others.

I'm not doing MinimaLT's puzzle step (although I really like the concept). At the
moment, all I can do is deny connections from unknown devices if we're under
attack (I can drop packets from an unknown device with a single hash lookup).
We're also in the process of moving to OVH, which is able to block DDoS stuff
at the network edge, should it happen.

There's more stuff I've got planned, but that's it for now.

~~~
Scaevolus
It's interesting how much you can streamline a protocol for a single use case.
Did you retain the crypto mostly intact, including the PFS?

------
leoh
Projects such as the Erlang VM running right on top of xen seem like promising
initiatives to get the kind of performance mentioned
([http://erlangonxen.org/](http://erlangonxen.org/)).

~~~
rdtsc
I wish they open-sourced it and let others look at the code and experiment
with it. For a lot of developers, if it isn't open, it doesn't exist. It is
their code and they can do whatever they want with it, but that is my view of
the project.

------
ehsanu1
An implementation of the idea:
[http://www.openmirage.org/](http://www.openmirage.org/)

A good talk about it by one of the developers/researchers:
[http://vimeo.com/16189862](http://vimeo.com/16189862)

------
cjbprime
> There is no way for the primary service (such as a web server) to get
> priority on the system, leaving everything else (like the SSH console) as a
> secondary priority.

Just for the record -- the SSH console _is_ the primary priority. If the web
server always beats the SSH console and the web server is currently chewing
100% CPU due to a coding bug, you can no longer get in to fix it.

------
swah
Those two articles,
[http://blog.erratasec.com/2013/02/multi-core-scaling-its-not...](http://blog.erratasec.com/2013/02/multi-core-scaling-its-not-multi.html)
(from Robert Graham) and
[http://paultyma.blogspot.com.br/2008/03/writing-java-multith...](http://paultyma.blogspot.com.br/2008/03/writing-java-multithreaded-servers.html),
seem to say opposing things about how threads should be used.
should be used.

Having no experience with writing Java servers, I wonder if any of you guys
have an opinion on this.

------
ubikation
I think Cheetah, the web server from MIT's exokernel project, proved this, and
HaLVM by Galois does pretty well with the network speed that Xen provides, but
I forget by how much.

The netmap FreeBSD/Linux interface is awesome! I'm looking forward to seeing
more examples of its use.
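
For anyone wanting a taste, this is the shape of a minimal netmap receive
loop in C (a hedged sketch: the interface name is a placeholder and error
handling is trimmed):

    #include <poll.h>
    #include <stdio.h>
    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>

    /* Open the NIC in netmap mode and count packets: the rings are
     * mapped into userspace, so no per-packet syscalls or copies. */
    int main(void) {
        struct nm_desc *d = nm_open("netmap:eth0", NULL, 0, NULL);
        if (d == NULL) { perror("nm_open"); return 1; }

        struct nm_pkthdr h;
        unsigned long count = 0;
        while (count < 1000000) {
            const u_char *buf = nm_nextpkt(d, &h);
            if (buf == NULL) {          /* ring empty: block for more */
                struct pollfd pfd = { .fd = NETMAP_FD(d), .events = POLLIN };
                poll(&pfd, 1, -1);
                continue;
            }
            count++;
        }
        printf("received %lu packets\n", count);
        nm_close(d);
        return 0;
    }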

~~~
oscargrouch
I would just love to see that exokernel from MIT in practice some day in some
OS. I think the research is from the nineties, isn't it?

Also, netmap from FreeBSD was the first thing that came to my head, as a
relief from the I/O bottleneck of modern systems.

As in the original C10K, FreeBSD to the rescue here, since it was the first
OS with the kqueue interface, and now it's netmap. The numbers for the
speedup in the original paper are astounding.

------
memracom
Just what are these resources that we are using more efficiently? CPU? RAM?

Are they that important? Should we not be trying to use electricity more
efficiently, since that is a real-world consumable resource? How many
connections can you handle per kilowatt-hour?

~~~
logicchains
Generally electricity use is roughly proportional to CPU and RAM usage, as
they're powered by electricity. If you have two otherwise identical programs,
one of which uses 50% of the CPU and the other of which uses 10%, chances are
the latter will use less electricity.

------
EdwardDiego
At the risk of sounding dumb, aren't we still limited to 65,534 ports on an
interface?

~~~
dxhdr
Port numbers only have to be unique per ip:port endpoint. A TCP connection is
identified by the "quadruple" source_ip:source_port, dest_ip:dest_port. You
can have as many connections as you want to the same dest_ip on port 80, one
per distinct source ip:port, as long as each quadruple is unique (a single
source_ip tops out at roughly 65,535 of them).
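
In code terms, the key a connection table is indexed by is the whole
quadruple, not the port alone (a toy sketch in C, field names invented):

    #include <stdint.h>
    #include <stdio.h>

    /* A TCP connection is identified by this 4-tuple; a server can hold
     * one entry per distinct remote ip:port even though its own ip:port
     * never changes. */
    struct conn_key {
        uint32_t src_ip;
        uint16_t src_port;
        uint32_t dst_ip;
        uint16_t dst_port;
    };

    int main(void) {
        /* Two connections to the same server 10.0.0.1:80 from the same
         * client IP differ only in source port, yet are distinct keys. */
        struct conn_key a = { 0x0A000002, 50000, 0x0A000001, 80 };
        struct conn_key b = { 0x0A000002, 50001, 0x0A000001, 80 };
        printf("distinct: %d\n", a.src_port != b.src_port);
        return 0;
    }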

~~~
perlgeek
Also with IPv6 you can easily route a whole /64 net (2 **64 IPs!) onto a
single machine.

~~~
sp332
2^64 IPs. :)

------
voltagex_
>Content Blocked (content_filter_denied)

>Content Category: "Piracy/Copyright Concerns"

I'm starting to use these blocks at my workplace as a measure of site quality
(this will be a high quality article). Can someone dump the text for me?

~~~
sb057
That page is essentially a glorified intro to his series of blog entries, and
they are on another domain, so perhaps they are not blocked:

[http://blog.erratasec.com/search/label/C10M](http://blog.erratasec.com/search/label/C10M)

~~~
voltagex_
Ah, erratasec. I'm surprised that isn't blocked here, too.

Thanks for the link.

------
Aloisius
What's the current state of internet switches? Back when I used to run the
Napster backend, one of our biggest problems was that switches, regardless of
whether or not they claimed "line-speed" networking, would blow up once you
pumped too many pps at them. We went through every single piece of equipment
Cisco sold (all the way to having two fully loaded 12K BFRs) and still had
issues.

Mind you, this was partially because of the specifics of our system - a couple
million logged in users with tens of thousands of users logging in every
second pushing large file lists, a widely used chat system which meant lots of
tiny packets, a very large number of searches (small packets coming in, small
to large going out) and a huge number of users that were on dialup fragmenting
packets to heck (tiny MTUs!).

I imagine a lot of the kinds of systems you'd want 10M simultaneous connections
for would hit similar situations (games and chat, for instance), though I'm not
sure I'd want to (imagine power knocking out the machine, or an upgrade, and
having all 10 million users auto-reconnect at once).

~~~
wmf
10 Gbps switches are pretty good and are generally line rate (as long as you
avoid ten-year-old chassis).

------
ganessh
"There is no way for the primary service (such as a web server) to get
priority on the system, leaving everything else (like the SSH console) as a
secondary priority" \- Can't we use the nice command (nice +n command) when
these process are started to change its priority? I am sorry if it is so naive
question

~~~
slashnull
He probably meant that from a TCP point of view, as in there is no way to give
a higher priority to incoming TCP connections going into the server than to
those going into SSH, even if you could use nice to assign more CPU time to
your server.

Or perhaps he meant that the infrastructure used to do multitasking still has
to interrupt both his server and SSH, but then he described how the kernel can
be set to leave some cores free of work, with the server pinned to use only
those cores so that it runs absolutely uninterrupted.
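
That second trick is real, for what it's worth: boot with isolcpus= to keep
the scheduler off some cores, then pin the server onto them. A minimal sketch
in C (the core number here is arbitrary):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread to core 3. If core 3 was also isolated
     * from the scheduler at boot (isolcpus=3), nothing else will ever
     * be scheduled there and the server runs effectively uninterrupted. */
    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(3, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned to core 3\n");
        /* ... run the data-plane loop here ... */
        return 0;
    }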

Not the only bizarre and confusing statement he wrote, anyways.

------
BadassFractal
This article on High Scalability also covers part of the problem:
[http://highscalability.com/blog/2014/2/5/littles-law-scalabi...](http://highscalability.com/blog/2014/2/5/littles-law-scalability-and-fault-tolerance-the-os-is-your-b.html)

------
ksec
I think OSv or something similar would be part of that solution: a single-user,
single-purpose OS designed to do one or a few things, and those only.

I could only hope OSv development would move faster.

------
dschiptsov
So, he is trying to suggest that a pthread-mutex-based approach won't scale
(what news!) and, consequently, that the JVM is crap after all? The next step
would be to admit that the very idea of "parallelizing" sequential code that
imperatively processes sequential data by merely wrapping it in threads is
nonsense too? Where is this world heading?

------
nwmcsween
So an exokernel?

~~~
wmf
People don't use actual exokernels; they just use Linux like an exokernel. Aka
"1975 programming".

------
zerop
One more problem is the cloud. We host on the cloud, and cloud service
providers might be using old hardware. The newest hardware or a specific OS
might be the winner, but there are no such options on the cloud. How do you
tackle that?

~~~
jon-wood
If this sort of thing matters that much to you then you bite the bullet and
rent a rack to fill with physical servers somewhere.

------
eranation
What about the academic operating system research that was done years ago?
Exokernel, SPIN: these all aim to solve the "OS is the problem" issue. Why
don't we see more in that direction?

------
slashnull
The two bottom-most articles (protocol parsing and commodity x86) are
seriously pure dump, but fortunately the ones about multi-core scaling are
pretty damn interesting.

------
porlw
Isn't this more-or-less how mainframes work?

