
10M Concurrent Websockets - rayascott
http://goroutines.com/10m
======
bcardarella
The headline can be misleading, these are not active real-world socket
connections. According to the article they are mostly idle, with pings every
five minutes.

For comparison, Chris McCord (the creator of the Phoenix framework) did some
work on getting 2M concurrent connections with real-world(ish) data being
broadcast across all nodes: [http://www.phoenixframework.org/blog/the-road-
to-2-million-w...](http://www.phoenixframework.org/blog/the-road-to-2-million-
websocket-connections)

~~~
laylomo
And if I recall correctly, the 2 million figure was just an artificial limit
that they hit due to the configuration settings during the test, since it just
"seemed like a really big number" at the time. In theory, it could probably go
even higher.

------
gamache
For a counterpoint, here's a benchmark of a single server running
Elixir/Phoenix and supporting 2M concurrent websockets, sending pings once per
second (rather than every five minutes like this Go benchmark). The socket
handling code is simpler, too.

[http://www.phoenixframework.org/blog/the-road-
to-2-million-w...](http://www.phoenixframework.org/blog/the-road-to-2-million-
websocket-connections)

~~~
nulltype
Where is the ping once per second part? All I could find was this commit
removing pings:
[https://github.com/Gazler/phoenix_chat_example/commit/f809b2...](https://github.com/Gazler/phoenix_chat_example/commit/f809b2314a636624be12012469f07a280ace00ba)

~~~
gamache
Toward the bottom of the post: "This let us to reach 2M subscribers without
timeouts and maintain 1s broadcasts."

~~~
nulltype
That sounds like 1s latency. Is it producing a sustained rate of 2Mpps?

------
eloff
This is kind of lousy, actually, to use that much memory and other resources for
10M nearly completely idle connections.

If the kernel is bypassed, I have reason to believe Go can handle the line
rate of 16.7M packets/sec (on a 10Gbps NIC). I would love to integrate Intel's
DPDK with Go. I've done some groundwork on this before - it's very feasible
to just replace the Go network stack in a transparent way. If any company is
interested in sponsoring that as an open-source project, my contact info is in
my profile.

~~~
jeff_marshall
As I'm sure you are well aware, there is quite a large gap between the DPDK
and a working TCP stack.

Do you plan to write the TCP stack from scratch? If not, are you planning to
leverage one of the golang unikernel back-end ports?

10Gb/s (even per-core) doesn't seem like much of a selling point for a network
stack on its own, given where the rest of the industry is.

~~~
eloff
I wouldn't dare write my own. I'm not expert enough in TCP/IP stacks to know
what I don't know that will hurt me. I was thinking of lifting the relatively
simple TCP/IP stack from the Apache licensed Seastar framework. Since that
also is built for DPDK and sees production use I know that it should be easy
to adapt and probably doesn't have serious flaws.

I'm actually strongly tempted to just glue Seastar and Go together. That's
the lowest-risk approach.

~~~
hadagribble
I haven't gotten around to trying it myself yet, but
OpenFastpath+OpenDataplane supposedly has a userspace port of the FreeBSD
stack that runs on DPDK through ODP-DPDK.

[http://www.openfastpath.org/](http://www.openfastpath.org/)

[http://www.opendataplane.org/](http://www.opendataplane.org/)

------
vvanders
Key quote:

Sending a ping every 5 minutes at 10M connections was roughly the limit of
what the server could handle. This ends up being only about 30k pings per
second, which is not a terribly high number.

~~~
tim333
Their figures are quite a bit worse than MigratoryData's commercial WebSocket
Server software written in Java -

12M sockets, 200k pings/sec, 12-core server, 54GB memory use, 57% CPU

[https://mrotaru.wordpress.com/2013/06/20/12-million-
concurre...](https://mrotaru.wordpress.com/2013/06/20/12-million-concurrent-
connections-with-migratorydata-websocket-server/)

~~~
michelrotaru
Thanks for the mention, Tim.

In that benchmark, pings are actually 512-byte messages. So, MigratoryData
published 200,000 messages/sec to 12 million clients at a total bandwidth of
over 1 Gbps.

BTW, that benchmark result has recently been improved in terms of latency. The
JVM garbage collection pauses were almost eliminated using the Zing JVM from
Azul Systems, and the MigratoryData server was able to publish almost 1 Gbps
to 10 million concurrent clients with an average end-to-end latency of 15
milliseconds, a 99th percentile of 24 milliseconds, and a maximum latency
of 126 milliseconds (computed over almost 2 billion messages):

[https://t.co/MVjLDLWS3U](https://t.co/MVjLDLWS3U)

Mihai

~~~
nulltype
Does MigratoryData use the kernel networking stack or bypass it? Also is the
Zing JVM included or do you have to license that separately?

~~~
michelrotaru
MigratoryData uses the kernel networking stack and it can be configured to run
as a non-root user.

Zing JVM is not a requirement for the MigratoryData server. MigratoryData
includes the Oracle JVM by default, which is currently used in production. With
the Oracle JVM we currently get quite decent garbage collection (GC) pauses for
Internet applications, so latency is not impacted too much by the GC pauses
that occur from time to time.

Zing JVM could be used in certain projects to reduce those GC pauses to near
zero, so that latency is not impacted even in the worst case, when a GC
occurs. So, if Zing JVM is necessary for such an ultra-low-latency application,
it should be licensed separately.

------
emerongi
Elixir & Phoenix Framework: 2M concurrent connections on a 40core/128GB box,
with a real-time chat example: [http://www.phoenixframework.org/blog/the-road-
to-2-million-w...](http://www.phoenixframework.org/blog/the-road-to-2-million-
websocket-connections)

~~~
insanebits
It might even be a 20-core processor, because htop actually shows threads, and
most processors nowadays run 2 threads per core.

For example, my R710 shows 16 threads while it's only 2x4 real cores.

~~~
Rafert
It's not as simple as 2 threads per core; the performance gain from
hyper-threading depends on the application you're running, afaik.

~~~
phamilton
I think the point was more that 40 cores was a ceiling for the computational
hardware, and it's very possible the actual hardware was 20 cores with
hyper-threading, which would result in much lower perf than a true 40-core
system.

------
hughes
> This is a 32-core machine with 208GB of memory. Sending a ping every 5
> minutes at 10M connections was roughly the limit of what the server could
> handle

While the 10M figure is impressive, this doesn't sound practical or useful.

~~~
iagooar
Why isn't it practical? Maybe not in the cloud, but there are some hosting
companies that offer quite powerful machines in the same ballpark.

Hetzner, a German hosting company, has some really good root servers: for
about 120€/month you get a beast of a machine with 16 cores and 128GB of RAM.
If you add some basic load balancing, you can achieve 10M for a really low
price. Add something like Docker and you can get a PaaS-like setup that
handles millions of connections without breaking a sweat.

~~~
insanebits
If you're going for C10M you can already forget about Docker. For performance
you really must be as close to the hardware as possible, which possibly means
tweaking the kernel.

The best example I know of squeezing as much as possible out of bare-metal
servers would be StackOverflow, which runs on something in the ballpark of 20
servers (excluding replicas), IIRC.

~~~
cyphar
You do know that processes in a Docker container are regular processes, right?
They just have a different namespace. They are as "close to the hardware" as
all of your other processes. I'm not sure how you'd go about using your own
TCP stack, but if you can do it with a normal process you can do it inside a
Docker container.

~~~
withjive
When working with Docker, it is possible to get down to the metal as you
mentioned, usually by bypassing Docker networking and using host networking,
but then you lose a lot of the benefits of containers in the first place.

For example, a Weave or Calico networking layer on top (which adds a fair bit
of latency if you're aiming for 10M connections) makes scaling containers
quite easy.

I would imagine that if you plan on using Docker in your infrastructure
seriously, you are aiming for a multi-host setup with many containers spread
throughout, and can settle for 100k connections per container easily.

------
smaili
> This is a 32-core machine with 208GB of memory.

If we're talking about these kind of specs, something tells me Erlang would do
even better.

~~~
rdtsc
Besides, I wouldn't trust stuff with shared memory when it comes to that many
concurrent connections.

I'd want it to be something with more safety guarantees -- could be compile
time (strong type system: Haskell, Rust) or strong runtime fault tolerance --
Erlang/Elixir and other languages on the BEAM VM.

------
Xeoncross
Can anyone expand on what the actual limit is here - the Linux kernel or the
network card driver? If the server is only at 10% load and half the memory is
free, it seems there might be either A) non-obvious tweaking needed or B)
actual changes needed to the system library design.

~~~
jzwinck
Probably a combination of context switching, memory bandwidth, and interrupt
handling.

If you're going to use a machine with 208 GB of RAM you may as well buy a
better network card. All major vendors of low latency adapters (which in this
case means handling lots of small messages) have transparent drivers for Linux
kernel bypass now. For example see OpenOnload. Then you avoid the kernel with
no code changes.

Once you remove the kernel from the critical path, you should be concerned
with I/O resources, which is already illustrated by the article's note about
how they get 5x better performance by using four machines with the same total
core count. What that does is buy you four times the I/O resources, like
memory channels and PCIe lanes.

------
eonnen
"It is similar to a push notification server like the iOS Apple Push
Notification Service, but without the ability to store messages if the client
is offline."

Not even close. Layer in X.509 authentication and flaky mobile networks, and
now you're into tuning backoff while dealing with the CPU overhead of TLS
negotiation.

Also of note, AWS's network used to flake out at around 400K TCP connections
from the public network. It may no longer be the case, but it certainly was
~2012.

~~~
toast0
FYI, there's an unpublished limit on established TCP connections, bigger
instances have bigger limits. It's hard to hit them with normal applications
(you'll run out of memory from the TCP buffers and application state), but you
can definitely hit them if you're doing weird things.

------
Xeoncross
This is more about maintaining 10,000,000 open connections than about actually
sending useful data to this many people or even allowing input from this many
connections.

------
revelation
This is pretty useless, other than to make you realize you are doing something
terribly wrong. Another good hint is the number of system constants you have
to tweak.

If you have these requirements (no read, only write, zero connection state)
you would be best served by doing userspace raw IO with something like DPDK
instead of the huge machinery that the kernel spins up for every TCP
connection.

------
Animats
It's useful if you want push notifications using the W3C Push API. That lets
you get notifications without becoming a slave to Google Cloud Messaging [1]
or Apple Push Notification Service [2]. This should now work in Firefox and
Google Chrome browsers[3]. Microsoft and Apple aren't on board yet.

[1] [https://developers.google.com/web/updates/2015/03/push-
notif...](https://developers.google.com/web/updates/2015/03/push-
notifications-on-the-open-web?hl=en) [2]
[https://www.raywenderlich.com/32960/apple-push-
notification-...](https://www.raywenderlich.com/32960/apple-push-notification-
services-in-ios-6-tutorial-part-1) [3] [http://caniuse.com/#feat=push-
api](http://caniuse.com/#feat=push-api)

~~~
azinman2
So if Apple isn't on board and it's just for some web browsers... it isn't
really an option at all :/

------
fenesiistvan
The extra processing required for websockets is so simple and well understood
that we should just talk about raw TCP socket optimizations instead. With
websocket processing you can only do a few tricks to further optimize a good
implementation; on the TCP/IP layer, however, the optimization tricks can be
endless and much more complicated.

------
drdaeman
I see the quite high TCP memory net.ipv4.tcp_mem setting of 100M max pages
(assuming standard 4K pages, that's 381GiB - more than the physical memory
the machine actually has, which feels a bit unsafe; I think kernels can
sometimes lock up badly when overpressured, so it's better to start spewing
errors reasonably early). But I don't see any tweaks to how much each
individual connection uses. I wonder if tcp_rmem/tcp_wmem tweaks could improve
performance - I'm not sure, but I think it's not completely unreasonable to
assume that idle connections don't need the default (IIRC) 85KiB read and
16KiB write buffers.
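
For reference, these are the knobs in question. The values below (min, default,
max bytes per socket) are purely illustrative, not recommendations from the
article:

```
# /etc/sysctl.conf - per-socket TCP buffer sizing: min default max (bytes)
net.ipv4.tcp_rmem = 4096 16384 6291456
net.ipv4.tcp_wmem = 4096 8192 4194304
```

Shrinking the middle (default) value is what would matter for mostly-idle
connections; the kernel can still grow a busy connection's buffers toward max.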

------
jamescun
This benchmark feels a little lazy. While I have no doubt that Go could match
the 2M concurrent websockets of Elixir's Phoenix framework, a benchmark should
aim to implement the same functionality to be considered useful.

The only interesting thing here is that with that 10M-connection figure, each
connection-handler goroutine contained at least 2 channels (sub and t), and
they seemed to scale (no mention of latency, though).

~~~
andrewstuart2
Seems like a latency of 5 minutes since he mentioned the server was about at
its limits.

------
arrty88
Seems inefficient to create a new timer per connection, no?

t := time.NewTicker(pingPeriod)

Maybe if you hold all of the connections in one array and create one timer for
the ping, things would be faster.

Curious how this compares to netty...

------
benwilber0
> After the websocket connection has been setup, the server never reads any
> data from the connection, it only writes messages to the client.

yeah not exactly difficult

------
hinkley
Back in my day we were still trying to solve the C10K problem. You darn kids
don't know how good you got it.

Now git off of my lawn, I need to take a nap.

------
mmaunder
The sysctl config is helpful for any high concurrency app e.g. node.

------
iLoch
You can tell this is web scale because of how complex the example is.

~~~
jkarneges
Yes, this is mostly a showcase of modern hardware and the Linux kernel.

However, Go's ability to do coroutines + multicore certainly makes it easier
to take advantage of the system's full potential, and this ability is not
terribly common in other languages (at least not in such an automatic way). Of
course the same result could be achieved with any event-driven language in a
less automatic way.

~~~
smt88
> _this ability is not terribly common in other languages_

Arguably any functional language does this even more automatically than Go
does.

~~~
jkarneges
Any functional language? Certainly Erlang works this way, but I'm less sure
about others.

~~~
smt88
One of the most important benefits of functional programming is that the
compiler can figure out which operations can run in parallel and which can't.
In other scenarios, the programmer has to explicitly decide these things
herself, so you end up with extra bugs (race conditions, for example).

The language doesn't matter, as long as its compiler can expect purely
functional code _and_ the compiler is written to take advantage of that code.

Rust kind of succeeds in having it both ways -- the programmer decides what
runs in parallel, but the compiler won't allow race conditions.

I can't imagine any better scenario than just writing code that can be
automatically parallelized, though.

~~~
jerf
So far as I know, all attempts at automatic parallelization have been
extremely disappointing. Even when dealing with Haskell code or other
similarly functional code, you don't get much out of it. See for instance:
[http://stackoverflow.com/questions/15005670/why-is-there-
no-...](http://stackoverflow.com/questions/15005670/why-is-there-no-implicit-
parallelism-in-haskell) I've seen a lot of other research I can't seem to dig
up right now that suggests that analyzing programs "in the wild" tends to
produce optimal speedups of less than even 2x across whole programs in
general.

I'm not holding out much hope for automatic parallelization to save us, or
even help all that much. The only remaining hope is that a language written
from the beginning to afford more automatically-parallel programs will help,
but those efforts (Fortress most notably) also have not gone well.
[https://en.wikipedia.org/wiki/Fortress_%28programming_langua...](https://en.wikipedia.org/wiki/Fortress_%28programming_language%29)
,
[http://web.cs.ucla.edu/~palsberg/course/cs239/F07/slides/for...](http://web.cs.ucla.edu/~palsberg/course/cs239/F07/slides/fortress.pdf)
starting page 33

It's a pity, really. But unfortunately, I wouldn't be promising anybody
anything about automatic parallelization, because even in _lab_ conditions
nobody's gotten it to work very well, so far as I know. It's a weird case,
because my intuition, like most other people's, agrees that there ought to be
a lot of opportunity for it, but the evidence is contradicting that.

Anyone who's got a solid link to contradictory evidence, I'd _love_ to hear
about it, but my impression is that research has mostly stopped on this topic,
because that's how poorly it has gone.

------
angersock
The interesting thing is...how long before you have 10M users?

I'm very much a fan of this sort of work--it's obviously useful for things
like data collection from sensors--but I can't help but wonder if it has any
practical business value for early-to-mid stage startups.

~~~
phamilton
Others have cited the Phoenix example of 2M.

We sort of take it for granted that you will have to rebuild your architecture
once you hit that point. What you gain for that is a platform where you can be
productive and iterate quickly.

What the Phoenix example shows is that without a huge loss of productivity up
front, you can put off the rewrite much longer.

Presumably, Go would provide similar benefits.

~~~
rozap
The important thing about Phoenix is that it uses regular old pg2 groups for
pubsub that can be distributed across nodes. Phoenix already does sharding for
you, you just need to add another node and the stuff just works.

When I was first playing with multi node Phoenix stuff I was amazed. It was
too easy to scale horizontally.

I'm not sure what the distribution story is for Go...

~~~
phamilton
The distribution story in this article is "publish via Redis". I much prefer
Phoenix's approach.

------
webaba
Most web devs think they need websockets but they really don't.

~~~
0x4a42
Care to explain?

~~~
webaba
I feel like frameworks like Meteor promote the systematic use of websockets,
and more and more developers don't even try to implement scalable, stateless
server code.

~~~
qaq
adoption of websockets in my experience is driven more by increasing mobile
usage.

------
julie1
How much time before companies wake up and think that hiring people who can
do the same job with only C1000 is a better investment?

------
morgosmaci

        apt-get install -y redis-server
        echo "bind *" >> /etc/redis/redis.conf
    

/shiver....

And your new redis server joins all of the other unprotected ones directly
connected to the net with no protection or real security.

------
zzzcpan
Sorry, but this is pointless. And I have a feeling that you could do better
even with a scripting language like Perl, Python, Ruby and a single-threaded
event loop.

~~~
windlep
Yes, exactly. Which is why we switched from Go to Python (using twisted), and
dropped our memory use substantially for holding open hundreds of thousands
(and soon millions) of websocket connections.

M:N schedulers and the primitives they use for their 'lightweight' threading
are always going to use more memory than a single-threaded event loop.

~~~
robbles
Why doesn't Elixir/Erlang have this problem too?

It's using green threads as well, with 1 "process" per connection, right? Is
it not the same kind of scheduling?

~~~
codesushi42
I'm guessing some kind of multiplexing of the sockets is being used in the
Erlang case. There's no green thread for every connection.

I don't understand what the grandparent is saying though. Yeah, if you create
a thread for every request, then you're going to be killed by the memory
overhead. This isn't unique to Go, and I remember it being a problem when many
noob Java programmers would spawn a thread per request before the introduction
of java.nio.

The downside with Twisted, Tornado, or whatever in Python is that your code
isn't parallelized. It's concurrent, yes, but you aren't taking advantage of
your multiple cores without forking another Python process, due to the GIL.

Go, Scala, Java, etc. are truly multi-threaded and compile to native code
ahead of time or via JIT. Saying your performance advantage is due to
switching from Go to Python is a spurious claim. You weren't doing it right.

~~~
strmpnk
For most scaled up examples in Erlang and Elixir, there is a BEAM process
(green thread) per connection as well as others (usually arranged as a
supervised pool so one can avoid an acceptor bottleneck). There are a few
reasons it does better but the biggest is a carefully tuned SMP scheduling
mechanism and aggressive preemption policies. Some of these choices actually
hurt throughput in favor of fairness and latency. All in the name of
reliability over speed.

~~~
codesushi42
That's helpful. Thanks.

