
The Road to 2M Websocket Connections in Phoenix - chrismccord
http://www.phoenixframework.org/v1.0.0/blog/the-road-to-2-million-websocket-connections
======
polskibus
Great job Phoenix team! I wonder how much of this wouldn't even be possible if
not for the great BEAM platform and Cowboy web server.

WhatsApp famously handles 2M connections on a FreeBSD box (that number is old;
I bet they've beaten it by now).
[http://highscalability.com/blog/2014/2/26/the-whatsapp-archi...](http://highscalability.com/blog/2014/2/26/the-whatsapp-architecture-facebook-bought-for-19-billion.html)

I wonder whether these two cases are in any way comparable. Different stacks,
different machines, different tests. The only similarity is BEAM/Erlang being
used as the platform. That speaks well of its scalability!

~~~
chrismccord
None of it would be possible without BEAM and Cowboy :) Cowboy is a huge
driver of our great numbers. It handles the TCP stack and websocket
connections for us. As I said elsewhere in this thread, we are doing more than
just holding idle WS connections open, but we're standing on the shoulders of
giants as far as Cowboy and Erlang are concerned.

------
bcardarella
We've been using Phoenix for several applications for over a year. The channel
functionality is so easy to use and implement. Highly recommend people check
this out:
[http://www.phoenixframework.org/docs/channels](http://www.phoenixframework.org/docs/channels)

~~~
jsmoov
I'd also recommend the Programming Phoenix book; it's still in beta, but the
chapter on Channels just came out:

[https://pragprog.com/book/phoenix/programming-phoenix](https://pragprog.com/book/phoenix/programming-phoenix)

------
chrismccord
My ElixirConfEU keynote gave a high-level overview of the channels design for
those interested - quick jump to place in presentation:
[https://youtu.be/u21S_vq5CTw?t=22m34s](https://youtu.be/u21S_vq5CTw?t=22m34s)

I'm also happy to answer any other questions. The team is very excited about
the progress of our channel and PubSub layers over the last year.

------
toast0
If you listen on multiple ports, you can get way more than 64k connections per
client. If the workload is mostly idle, tsung should be able to manage quite a
few connections too.

~~~
chrismccord
We'd like to explore a private network so we can get away with more than 64k
per client, as it starts to get expensive spinning up 50 servers whenever we
want to test. Unfortunately, the instances that were lent to us came with no
way for us to assign additional IPs.

~~~
toast0
You don't need multiple IPs; TCP connections are a unique 4-tuple: {ClientIP,
ServerIP, ClientPort, ServerPort}. If you add more ServerPorts, you can re-use
the client ports. Now, if you want to test past 4 billion connections, you
need more IPs. :)
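
A minimal sketch of the trick in Elixir terms (host and ports are made up, and
it assumes listeners on both server ports; binding the same local port twice
works because the remote ports differ, so the 4-tuples are distinct):

    # bind local port 50000 for two connections to different server ports
    {:ok, a} = :gen_tcp.connect({10, 0, 0, 2}, 4000, port: 50_000, reuseaddr: true)
    {:ok, b} = :gen_tcp.connect({10, 0, 0, 2}, 4001, port: 50_000, reuseaddr: true)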

~~~
noir_lord
> You don't need multiple IPs; TCP connections are a unique 4-tuple:
> {ClientIP, ServerIP, ClientPort, ServerPort}.

Sometimes someone throws something out so casually, and yet you realise the
innate sense of what they just said. I literally stopped with my coffee
halfway to my mouth.

I'd never have thought of doing it that way.

~~~
toast0
I may have connection-tested a few things in the past ;) If you're
benchmarking with tsung, it's nice, because it has hooks that let you pick the
client port when making the connection; you'll probably need this, because the
OS's default port-assignment algorithms tend to get it wrong with this much
going on.

------
mtw
I'd be interested in benchmarks vs other similar frameworks for the same use
case. The TechEmpower benchmarks below include JSON/DB query tests; however,
Phoenix lags behind there (will it be at the top soon?)
[https://www.techempower.com/benchmarks/previews/round11/](https://www.techempower.com/benchmarks/previews/round11/)

~~~
out_of_protocol
Take a look at the source code. The Phoenix entry in question is:

1) Outdated: the version used is really, really old (0.8 or so)

2) Configured with a really low database pool size, unlike the other
participants (~10x lower)

3) Configured to do many things the other frameworks skip, like generating a
CSRF token

All these points have already been fixed, but most likely the new results will
land in Round 12.

~~~
mtw
OK, we'll see in Round 12.

But why did you downvote my comment? I was just reporting a fact: a benchmark
performed by a third party. It's not like I was voicing a weird opinion or
fabricating anything.

~~~
out_of_protocol
I didn't (I have neither the rights nor the intention). Also, it's a valid
question imo. Btw, I tested it [the old version] locally with a single change:
setting the pool size to 256 (similar to the other frameworks). Result: a ~10x
performance boost
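
For reference, that change is roughly one line in the Ecto repo config (the
app and repo names here are illustrative, not the actual benchmark entry):

    # config/prod.exs
    config :hello_phoenix, HelloPhoenix.Repo,
      adapter: Ecto.Adapters.Postgres,
      pool_size: 256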

~~~
pekk
This is par for the course with the TechEmpower benchmarks. Certain
technologies' benchmarks have always been relentlessly optimized, while others
are left to "the community", which means you can submit PRs, but they'll
likely be rejected even if they reflect the usual deployment patterns for a
high proportion of the users of the language/framework being represented.

~~~
koshak
That's why this benchmark seems irrelevant.

------
ntrepid8
I've been using Elixir/Cowboy/Phoenix for the past few months and it's been
great. The Elixir ecosystem still seems to be getting built out, but there are
a lot of Erlang libraries to use as well.

------
alberth
I'd be interested in seeing this same performance benchmark run against
various other OSes.

In particular DragonFly BSD, since it has a lockless network stack.

And FreeBSD for good measure, given that WhatsApp runs an Erlang/FreeBSD stack
and has documented 2-3M connections themselves.

~~~
chrismccord
We were shocked at how well stock Ubuntu performed under these tests. Given
the stories of FreeBSD tuning, we expected to hit an OS ceiling, but other
than a dozen shell script lines setting some sysctl limits, everything else
was stock. I would also be very interested in how other OSes perform.
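
For a sense of what that tuning looks like, these are typical sysctl knobs for
large connection counts (illustrative values, not necessarily the exact lines
we used):

    # /etc/sysctl.conf -- illustrative values only
    fs.file-max = 5000000                        # system-wide fd limit
    net.ipv4.ip_local_port_range = 1024 65535    # widen ephemeral port range
    net.core.somaxconn = 65535                   # deeper accept backlog
    net.ipv4.tcp_mem = 786432 1048576 26777216   # TCP memory thresholds (pages)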

------
ch4s3
This is really cool; I'm hoping to see more small companies experimenting with
Phoenix. It seems like a great tool.

------
Thaxll
Is it a relevant benchmark to open 2M connections without doing anything with
them?

~~~
chrismccord
I was worried some might not make it that far into the post and come away with
that thought, but such is the nature of long-form prose :). This is more than
idle WS connections. Every channel client connects to the server (via WS),
then subscribes via our PubSub layer to the chat room topic. Once we connect
all 2M clients, we use the real chat app to broadcast a message to all 2M
users. Broadcasts go out to all WS clients in about 2s. 2M people in the same
chat room might not be a real-world fit, but it stressed our PubSub layer to
its max, which was one of the goals of these tests. We were actually shocked
at how well broadcasts hold up at these levels.
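
For the curious, the shape of the code being exercised is roughly this (module
and topic names here are illustrative, not the exact benchmark app):

    defmodule Chat.RoomChannel do
      use Phoenix.Channel

      # every one of the 2M clients joins the same topic
      def join("rooms:lobby", _message, socket) do
        {:ok, socket}
      end

      # a single incoming message fans out to every subscriber
      def handle_in("new_msg", %{"body" => body}, socket) do
        broadcast!(socket, "new_msg", %{body: body})
        {:noreply, socket}
      end
    end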

~~~
BenoitP
What this benchmark shows is how lightweight the per-websocket memory
footprint is. 2M users is really impressive.

2M users per machine is great if they are mostly idle. This is the use case of
WhatsApp, and their stats are [1]:
WhatsApp, and their stats are[1]:

> Peaked at 2.8M connections per server

> 571k packets/sec

> >200k dist msgs/sec

Not every app is meant to have mostly idle users. The question is whether a
real-time MMO FPS could be done in Phoenix (let's limit each user's
neighbourhood to the 10 closest players).

I'd be very interested in the other corners of the envelope for Phoenix: a
requests/second benchmark over websockets, with the associated HdrHistogram
latency plot, like [2]

[1] [http://highscalability.com/blog/2014/2/26/the-whatsapp-archi...](http://highscalability.com/blog/2014/2/26/the-whatsapp-architecture-facebook-bought-for-19-billion.html)

[2] [http://www.ostinelli.net/a-comparison-between-misultin-mochi...](http://www.ostinelli.net/a-comparison-between-misultin-mochiweb-cowboy-nodejs-and-tornadoweb/)

~~~
mguillemot
I am a very happy user of Phoenix myself, and also an MMO game programmer, and
unfortunately I can say with confidence that you won't implement a real-time
MMO FPS on top of Phoenix channels any time soon.

Not because of Phoenix itself, but because all modern lag-compensating
techniques rely on specific properties of UDP that are not implementable over
TCP (and thus not over Websockets or HTTP long poll, which are the currently
supported Phoenix Channel transports).

Not to diminish the value of Phoenix: the framework is really pleasant to use
both on dev side and ops side. And you could use it today to implement the
server-side of most games, even some "massive multiplayer" ones, as long as
latency is not your primary concern.

~~~
chrismccord
You are right our default transports (WS/LongPoll) are not well suited for FPS
requirements, but just to be clear transports are adapter based and you can
implement your own today for your own esoteric protocol or serialization
format. Outlined here:
[http://hexdocs.pm/phoenix/Phoenix.Socket.Transport.html](http://hexdocs.pm/phoenix/Phoenix.Socket.Transport.html)

The backend channel code remains the same and the transport takes care of the
underlying communication details.
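
For instance, wiring a custom transport into a socket looks roughly like this
(the custom transport module here is hypothetical, just to show the shape):

    defmodule MyApp.UserSocket do
      use Phoenix.Socket

      # a custom transport module alongside the built-in WS/LongPoll ones
      transport :custom, MyApp.CustomTransport

      channel "rooms:*", MyApp.RoomChannel
    end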

------
rozap
The other day I set up hot code upgrades (on a single-node production
instance) with exrm and Phoenix. It was a breeze and took about 30 minutes of
reading and setup. And now I can upgrade the backend while all the websocket
connections stay running. So cool!
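
A rough sketch of the exrm flow (app name and versions are placeholders; see
the exrm docs for the exact steps):

    MIX_ENV=prod mix release                 # build release 0.0.1, deploy, start
    # bump the version in mix.exs, then...
    MIX_ENV=prod mix release                 # build 0.0.2 (exrm generates the upgrade)
    rel/my_app/bin/my_app upgrade "0.0.2"    # hot upgrade the running node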

------
nulltype
Now that you can buy servers in tiny linear increments, it seems more
interesting (but less headline-y) to talk about KB per connection or CPU per
connection. It looks like this is about 41KB per connection.

For comparison, I checked my not-very-optimized Go server, and it appears
(based on the VSZ value) to use 25KB per connection in Go, with 43KB per
connection used by nginx (to provide SSL termination).

~~~
chrismccord
Like I highlighted elsewhere, did your Go server set up a separate multiplexed
process for the PubSub channel and track the PubSub subscribers on each client
connection? Comparing the client size to raw WS connections isn't accurate in
this context. Horizontal scaling is nice, but you also have to consider the
overhead of broadcasts that are now distributed over dozens/hundreds of nodes
vs a handful of large nodes.

~~~
nulltype
I track the pubsub subscribers on each client connection, but I'm not sure
what you mean by a multiplexed process for the pubsub channel.

Depending on your traffic patterns, considering the broadcast activity is
potentially important, and then you might worry more about CPU than RAM.

I use Redis myself, but gnatsd or Redis Cluster may help with scaling the
pubsub part of the system.
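
On the Phoenix side, the PubSub layer is adapter-based too; there's a
Redis-backed adapter, configured roughly like this (names illustrative; check
the phoenix_pubsub_redis docs for the exact options):

    # config/config.exs
    config :my_app, MyApp.Endpoint,
      pubsub: [name: MyApp.PubSub,
               adapter: Phoenix.PubSub.Redis,
               host: "127.0.0.1"]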

------
eridius
Great article! One minor nit: When explaining the bag vs duplicate_bag thing,
it says

> _The difference between bag and duplicate_bag is that duplicate_bag will
> allow multiple entries for the same key._

bag also allows multiple entries per key. After reading the ETS documentation,
it appears that duplicate_bag allows the same object instance to appear as a
value for a key multiple times, whereas bag only allows an object instance to
be added once to a key (e.g. so if you add the exact same object instance to
the table using a given key, bag will only end up with one value for the key,
but duplicate_bag will happily have many identical values for that key).

The following sentence is still fine:

> _Since each socket can only connect once and have one pid, using a duplicate
> bag didn't cause any issues for us._
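
A quick illustration of the difference in an iex session (table and key names
arbitrary):

    bag = :ets.new(:bag_demo, [:bag])
    :ets.insert(bag, {:socket1, :pid_a})
    :ets.insert(bag, {:socket1, :pid_a})   # identical object: stored once
    :ets.insert(bag, {:socket1, :pid_b})   # same key, new object: stored
    :ets.lookup(bag, :socket1)             #=> [socket1: :pid_a, socket1: :pid_b]

    dbag = :ets.new(:dup_demo, [:duplicate_bag])
    :ets.insert(dbag, {:socket1, :pid_a})
    :ets.insert(dbag, {:socket1, :pid_a})  # identical object: stored twice
    :ets.lookup(dbag, :socket1)            #=> [socket1: :pid_a, socket1: :pid_a]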

------
jkarneges
I love this kind of stuff. A while back I attempted to do similar benchmarking
with Pushpin:
[http://blog.fanout.io/2013/10/30/pushing-to-100000-api-clien...](http://blog.fanout.io/2013/10/30/pushing-to-100000-api-clients-simultaneously/)

Like the Phoenix test, this tested broadcast throughput. However, we couldn't
push more than 5,000 messages per second out of a single EC2 m1.xlarge
instance without latency increasing. My theory is that we were hitting the
maximum throughput of the underlying NIC, causing packet loss and requiring
TCP to retransmit (hence the latency). At some point I want to try adding
multiple virtual NICs to the same EC2 instance and see if that helps.

~~~
philsnow
EC2 instances (at least in EC2 Classic) do indeed have a limit of ~120K
packets per second. I don't know if this is per NIC, so I would be interested
in seeing your results.

------
arrty88
Am I reading this correctly... 83GB of RAM in use for 2M connections? That
seems like a lot of overhead.

I have achieved 500k concurrent connections using Atmosphere, consuming
around 10GB of RAM.

~~~
chrismccord
That's correct. We have not made any attempt at optimizing connection size
yet, so there should be room to trim this down. But honestly, I'm thrilled
with the numbers as is. This box on Rackspace will run you $1500/mo, which is
going to be peanuts for a company requiring this kind of scale, but the more
we can trim down the connection size the better.

~~~
darthdeus
> which is going to be peanuts for a company requiring this kind of scale

That depends. Two million connections from people on an e-commerce site is a
lot, but two million connections for a side thing like some
analytics/ads/background-job-whatever doesn't have to be that much, especially
if you consider 3rd-party code.

~~~
chrismccord
Cases will vary, but if you have _2 million_ active users at a time and can't
cover $1500/mo, then I'm not sure what your options will be. Along these
lines, I'm really excited about the kinds of creations Phoenix will enable,
exactly because of the scale it gives you on current hardware. I think we'll
see disruptive applications/services come out because of what you can get out
of a single machine, so I find stressing the price as unaffordable in this
context really bizarre.

~~~
darthdeus
I understand, I do like the idea of Phoenix, and it is definitely great to be
able to keep 2M connections open at all.

I'm just wondering where the per-connection overhead is going. Is there some
inherent limitation of the WebSocket protocol that forces the server to keep
large buffers or something? Not trying to bash Phoenix, I'm just genuinely
interested in the lowest possible overhead one could achieve while keeping a
WS connection open.

~~~
asonge
Doing some rough math with my limited understanding of Linux network
internals: it's about 40KB per connection in this benchmark. I know that
Cowboy is going to require ~4KB or so per connection. Consulting a local
Ubuntu install, the default minimum TCP buffer sizes will be at least 4KB each
(x2, for read and write), but by default they're 16KB each, and by default the
max grows to 1.5MB or so each. This is required for TCP retransmits and such.
If you have clients on shoddy connections or see packet loss, your memory
usage could skyrocket on you. I remember reading about a case where a service
died, despite ~33% memory headroom, when the TCP packet loss rate went up
(while still staying under 1%): the buffer sizes grew large enough to run out
of memory.

So that's 8KB (it will be higher with more usage) for TCP buffers in the
kernel, 4KB or so for Cowboy, and 28KB or so left for various other bits of
the system when amortized per connection.
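
(Sanity check: 40KB × 2M connections ≈ 80GB, which lines up with the ~83GB of
RAM reported elsewhere in this thread.)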

------
paulus_magnus2
Isn't there a 65k limit on the number of open ports in the TCP stack?

~~~
asonge
As noted above, TCP connection limits are only on a unique 4-tuple of
{local-ip, local-port, remote-ip, remote-port}.

