
One Million Concurrent TCP connections - pvodsevhcm
http://blog.whatsapp.com/index.php/2011/09/one-million/
======
mrb
I remember reading around 2002-2004 about a sysadmin managing a very large
"supernode" p2p server who was able to fine-tune its Linux kernel, and to
recompile an optimized version of the p2p app (to allocate data structures as
small as possible for each client) to support up to one million concurrent TCP
connections. It wasn't a test system; it was a production server that routinely
reached this many connections at its daily peak.

If it was possible in 2002-2004, I am not impressed that it is still possible
in 2011.

One of the optimizations was to reduce the per-connection TCP buffers
(net.ipv4.tcp_{mem,rmem,wmem}) to only allocate one physical memory page (4kB)
per client, so that one million concurrent TCP connections would only need 4GB
RAM. His machine had barely more than 4GB RAM (6 or 8? can't remember), which
was _a lot of RAM_ at the time.
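
The arithmetic behind that figure can be checked with a minimal sketch. It assumes exactly one 4 kB page of buffer per connection and nothing else; the function name is made up for illustration, and real per-connection kernel overhead would come on top of this.

```python
# Assumption: each connection pins exactly one 4 kB page of buffer space.
PAGE_SIZE = 4 * 1024
CONNECTIONS = 1_000_000


def buffer_memory_bytes(connections: int, bytes_per_connection: int) -> int:
    """Total buffer memory if every connection holds a fixed amount."""
    return connections * bytes_per_connection


total = buffer_memory_bytes(CONNECTIONS, PAGE_SIZE)
# Report in both decimal gigabytes and binary gibibytes.
print(f"{total / 10**9:.2f} GB ({total / 2**30:.2f} GiB)")
# prints "4.10 GB (3.81 GiB)"
```

So "4GB" is a fair round number for the buffers alone, which fits a machine with 6 or 8 GB.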

I cannot find a link to my story though...

~~~
mmaunder
Agree with this being underwhelming. Maybe they'll hire someone who
understands that Erlang is using kqueue to make this possible.

Running netstat|grep like this on a high concurrency server takes a long time
to run. I've never found a faster way to get real-time stats on our busy
servers and would be interested if anyone else has.

~~~
thomasknowles
Actually, I would run ss rather than netstat with this many concurrent
connections, as netstat reads /proc/net/tcp and is quite slow. However, you need
to have the tcp_diag module loaded; otherwise ss falls back to /proc/net/tcp.
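
To see why the /proc/net/tcp route gets slow, here is a rough sketch of what a tool has to do with it: read one text row per socket and decode the hex state column. The function name and the sample rows are made up for illustration (the rows are shortened, not captured from a real server), and this is not how netstat itself is implemented.

```python
from collections import Counter

# Subset of the hex codes found in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT", "0A": "LISTEN"}


def count_tcp_states(lines):
    """Count sockets per TCP state from /proc/net/tcp-style text rows."""
    counts = Counter()
    rows = iter(lines)
    next(rows, None)  # skip the header row
    for line in rows:
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or malformed rows
        state = fields[3]
        counts[TCP_STATES.get(state, state)] += 1
    return counts


# Hypothetical, shortened sample rows in the same column order as the real file.
SAMPLE = [
    "  sl  local_address rem_address   st tx_queue rx_queue",
    "   0: 00000000:0050 00000000:0000 0A 00000000:00000000",
    "   1: 0100007F:0050 0100007F:C350 01 00000000:00000000",
    "   2: 0100007F:0050 0100007F:C351 01 00000000:00000000",
]

states = count_tcp_states(SAMPLE)
print(dict(states))
```

On a Linux box the same function can be fed the real file with `count_tcp_states(open("/proc/net/tcp"))`. With a million sockets that is a million rows of text to format in the kernel and parse in userspace, which is the cost the tcp_diag path avoids.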

~~~
X-Istence
You may not have read the entire article, but he clearly explains that their
tech stack is FreeBSD + Erlang not Linux + Erlang.

There is no /proc/net/tcp on FreeBSD. Hell, there is no /proc unless it is
specifically mounted by the administrator, but tools certainly don't use it to
get data.

~~~
thomasknowles
I read it; I just presume most users here are on Linux.

------
tworats
The WhatsApp guys are very sharp ex-Yahoo guys who've had tremendous
experience with scaling systems. Rick Reed is fairly legendary. Yahoo was a
long time FreeBSD shop, so it's not surprising they went with that.

I hope they publish how they did it - in fact let me drop them an email and
see if I can convince them to do so.

~~~
cperciva
_Yahoo was a long time FreeBSD shop_

FWIW, Yahoo still uses FreeBSD extensively.

------
lenn0x
It's been done before. Here is an article using Erlang and Linux.

[http://www.metabrew.com/article/a-million-user-comet-applica...](http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-1)

Part 3 is my favorite.
[http://www.metabrew.com/article/a-million-user-comet-applica...](http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-3)

~~~
jen_h
This article series is kind of a Bible to me. It's not going to solve all of
your problems, for sure, depending on yer stack, and setting yourself up for a
forkbomb isn't the wisest in all situations, but it's got a lot of good advice
& is pretty good about providing you a "Okay, tweak these parameters and then
try to break it" baseline.

------
indygreg2
Technical details would be interesting. Until then, here's Urban Airship's
post from last year on 500k connections on Linux:

[http://urbanairship.com/blog/2010/09/29/linux-kernel-tuning-...](http://urbanairship.com/blog/2010/09/29/linux-kernel-tuning-for-c500k/)

------
jsr
OK, you hooked me with the title. But "FreeBSD + Erlang" was kind of a
dissatisfying reason for how you achieved it. Would love to hear more details!
How far we've come since <http://www.kegel.com/c10k.html>

~~~
getsat
They could have done it using C, Ruby, Python, or any other language. kqueue
is what makes FreeBSD (and OSX) awesome at concurrency.

<http://en.wikipedia.org/wiki/Kqueue>

~~~
silentbicycle
How does kqueue compare to epoll on Linux? I've written C code using kqueue on
OpenBSD and OS X, but have only used epoll via libev (and not at especially
high load). I thought the big change came from trading level- for edge-
triggered nonblocking IO, but maybe the kqueue implementation is superior for
sockets somehow?

The main advantage Erlang has over C/Python/Ruby/etc. is that asynchronous IO
is the default _throughout all its libraries_ , and it has a novel technique
for handling errors. Its asynchronous design is ultimately about fault
tolerance, not raw speed. Also, it can automatically and intelligently handle
a lot of asynchronous control flow that node.js makes you manage by hand
(which is so 70s!).

You can make event-driven asynchronous systems pretty smoothly in languages
with first class coroutines/continuations (like Lua and Scheme), but most
libraries aren't written with that use case in mind. Erlang's pervasive
immutability also makes actual parallelism easier.

With that many connections, another big issue is space usage -- keeping
buffers, object overhead, etc. low per connection. Some languages fare far,
far better than others here.

~~~
asomiv
Yes I would say kqueue, the interface, is superior to epoll. Kqueue allows one
to batch modify watcher states and to retrieve watcher states in a single
system call. With epoll, you have to call a system call for every
modification. Kqueue also allows one to watch for things like filesystem
changes and process state changes, epoll is limited to socket/pipe I/O only.
It's a shame that Linux doesn't support kqueue.

But as awesome as kqueue is, OS X apparently broke it:
[http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod#OS_X_AN...](http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod#OS_X_AND_DARWIN_BUGS)

~~~
Flow
IIRC, kqueue can be told to read the whole http-request before letting client
know there's data to read.

~~~
gorset
That's actually accept_filter(9). The FreeBSD man page has an interesting
note:

    The accept filter concept was pioneered by David Filo at Yahoo! and
    refined to be a loadable module system by Alfred Perlstein.

The closest you can get by using kqueue is to set a low water mark, so that a
read event is only returned when there's enough data ready.

~~~
Flow
Ah, that's it! Thanks for finding it.

------
Rickasaurus
This may be a dumb question (I'm not a networks guy) but how do you maintain
so many connections with just 65535 ports? Can you have more than one
connection per port?

~~~
caf
The server generally only ever uses _one_ port, no matter how many clients are
connected. It is the tuple of (client IP, client port, server IP, server port)
that must be unique for each TCP connection - so the limit of 65535 ports is
only relevant for how many connections a _single_ client can make to a single
server.

~~~
nathanappere
I believe this is incorrect. The server usually listens on one port, but
every time it does an accept, a different random port is used, and the client
starts talking to the server on that new port.

~~~
caf
This is a surprisingly common misconception. When you accept, you get a new
_socket_, but it is on the same local port. You can readily see this by
running 'netstat' on a busy server.
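
The same thing can be checked without netstat. The sketch below is a minimal loopback experiment, assuming a local machine where 127.0.0.1 is usable; the function name is made up for illustration. It opens a listener, connects a few clients, and compares the local port of every accepted socket with the listening port.

```python
import socket


def accepted_local_ports(n_clients=3):
    """Return (listen_port, local ports of accepted sockets, client ports)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    clients, accepted = [], []
    try:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(n_clients)
        listen_port = listener.getsockname()[1]
        for _ in range(n_clients):
            clients.append(socket.create_connection(("127.0.0.1", listen_port)))
            conn, _peer = listener.accept()  # a new socket per connection
            accepted.append(conn)
        local_ports = [c.getsockname()[1] for c in accepted]
        peer_ports = [c.getpeername()[1] for c in accepted]
        return listen_port, local_ports, peer_ports
    finally:
        for s in clients + accepted + [listener]:
            s.close()


listen_port, local_ports, peer_ports = accepted_local_ports()
print("listening on", listen_port)
print("accepted sockets use local ports", local_ports)
print("clients connected from ports", peer_ports)
```

Every accepted socket reports the listening port as its local port; only the client-side port differs, which is what keeps each (client IP, client port, server IP, server port) tuple unique.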

~~~
nathanappere
Just checked it and you're right. I believed I had encountered the behaviour I
described but I do not remember in which context. Anyway glad I learned
something!

~~~
dsl
FTP.

------
chubs
They beat me to it! I've only gotten to 500k on EC2; however, I believe there's
some trickery in their firewalls / NAT which is holding me back... If anyone's
interested in the gory details, see: <http://splinter.com.au/tag/comet>

~~~
forsaken
Curious about what you've seen from the EC2 networking gear that's holding you
back? Firewall not letting more connections through?

~~~
chubs
Well, at the moment i can't pin it down to EC2, but it's the only thing i can
imagine it'd be. The network/cpu/memory usage is all healthy, and there's
nothing in the kernel log, so that's what i'm guessing is the cause. Although
i may be wrong.

~~~
csarva
We've observed the bottleneck to be an upper limit on packets/sec for a given
instance type. On an m1.large this is about 100k/sec. I believe it's due to
the virtual NIC just not being fast enough to handle high traffic loads.

The rightscale folks found the same thing:

[http://blog.rightscale.com/2010/04/01/benchmarking-load-bala...](http://blog.rightscale.com/2010/04/01/benchmarking-load-balancers-in-the-cloud/)

------
huhtenberg
In absolute numbers - wow, that's impressive.

It was several years ago, but I've done my share of high-concurrency stuff
under Linux and the highest I got to was about 200K connections - at which
point the single-threaded server bottlenecked at its disk I/O.

The main issue is not the actual connection count; it's what the per-socket OS
overhead is (so as not to exhaust non-swappable kernel memory), how many sockets
are concurrently _active_ (have inbound or outbound data queued), and whether the
application can handle all the events that epoll/kqueue report. This is not
rocket science by any means, and the kernel is relatively easy to fine-tune
even when the actual load is present.
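
For readers who have not written this kind of loop, here is a small sketch of the "handle the events the kernel reports" part. It uses Python's `selectors` module, which picks epoll on Linux and kqueue on the BSDs and OS X; the function name and the use of local socket pairs instead of real network clients are assumptions made to keep the example self-contained.

```python
import selectors
import socket


def read_ready_sockets(messages):
    """Send one message per socket pair and read each when reported ready."""
    sel = selectors.DefaultSelector()  # epoll, kqueue, or a fallback
    pairs = [socket.socketpair() for _ in messages]
    received = {}
    try:
        for index, (server_side, _client_side) in enumerate(pairs):
            server_side.setblocking(False)
            sel.register(server_side, selectors.EVENT_READ, data=index)
        for (_server_side, client_side), message in zip(pairs, messages):
            client_side.sendall(message)
        # Only sockets with queued inbound data are returned by select().
        while len(received) < len(messages):
            for key, _events in sel.select(timeout=1.0):
                received[key.data] = key.fileobj.recv(4096)
                sel.unregister(key.fileobj)
        return [received[i] for i in range(len(messages))]
    finally:
        sel.close()
        for a, b in pairs:
            a.close()
            b.close()


result = read_ready_sockets([b"a", b"bb", b"ccc"])
print(result)
```

The loop only ever touches sockets that are active, so idle connections cost memory but no CPU, which is why the active count matters more than the total.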

------
cnlwsu
I would be curious about the hardware used; it can make this much more or less
impressive. I have done a test with 1 million concurrent TCP connections using
Java (MINA) on an Ubuntu system... but it had 64 GB of RAM. It kept running
for weeks under the load, which I felt pretty good about.

------
getsat
What is the kernel structure overhead (in bytes) per connection on FreeBSD?

------
imsy
Good work. On Imsy (www.imsy.com) we hope to achieve the same with Node.js on
EC2. Presently the numbers are smaller (in the 100K range), but it looks like it
will scale smoothly to 1 million.

~~~
henry501
It's hard to say that something is going to scale smoothly up an order of
magnitude until you're there. 900K requests is a lot of room for things to go
wrong.

------
kitsune_
How much ram does this single server have, what about its cores?

------
jen_h
A. That is AWESOME. Props, guys! B. Just how long did it take for that netstat
to return? ;)

~~~
spokengent
FWIW, netstat isn't fun to use. conntrack is better, or cat
/proc/net/ip_conntrack

~~~
Luyt
They're on FreeBSD. conntrack is a Linux utility.

~~~
spokengent
ah ok I see...

------
gabi38
How would kqueue compare to Windows's IO completion ports in terms of
performance?

------
flzz
Just because you can doesn't mean you should.

~~~
nivertech
Not many can, but some are fortunate enough to actually need this stuff.

