
Handling 1M websocket connections in Go - 1nvalid
https://github.com/eranyanay/1m-go-websockets
======
rdtsc
Very cool. Noticed they had to go down to epoll:
[https://github.com/eranyanay/1m-go-
websockets/blob/master/3_...](https://github.com/eranyanay/1m-go-
websockets/blob/master/3_optimize_ws_goroutines/epoll.go#L32). That seems a bit
too low level. Why not spawn 1M goroutines? Is that not feasible?

This also reminds me of Erlang VM handling 2M for Whatsapp on a single server
in 2012:

[https://blog.whatsapp.com/196/1-million-is-
so-2011](https://blog.whatsapp.com/196/1-million-is-so-2011)
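
For a rough sense of the goroutine option, here's a quick sketch (mine, not from the repo) that parks N idle goroutines and measures the runtime's memory growth. Each goroutine starts with a few KB of stack, so 1M of them costs on the order of gigabytes before any connection buffers are allocated; that memory, plus scheduler pressure, is the usual argument for epoll here:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// spawnIdle parks n goroutines, simulating n idle connection handlers,
// and reports the runtime's approximate memory growth per goroutine.
func spawnIdle(n int) (perG uint64, release func()) {
	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)

	done := make(chan struct{})
	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			wg.Done()
			<-done // park until released, like a handler blocked on a read
		}()
	}
	wg.Wait()

	runtime.ReadMemStats(&after)
	return (after.Sys - before.Sys) / uint64(n), func() { close(done) }
}

func main() {
	const n = 100_000 // scaled down from 1M for a quick run
	perG, release := spawnIdle(n)
	fmt.Printf("%d idle goroutines cost roughly %d bytes each\n", n, perG)
	release()
}
```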

~~~
windlep
> Noticed they had to go down to epoll: [https://github.com/eranyanay/1m-go-
> websockets/blob/master/3_...](https://github.com/eranyanay/1m-go-
> websockets/blob/master/3_..). that seems a bit too low level.

Agreed, I'd be quite curious how many libraries in the Go ecosystem they can't
use as a result, either because the library spawns a goroutine itself, uses
one under the hood, etc.

Having to drop to this level makes me wonder if it'd be better to just use a
language better suited to this type of asynchronous networking (C/C++/Rust).

~~~
kjksf
Very few Go libraries start goroutines on their own. Concurrency is mostly an
application-level concern.

The HTTP library is unusual in that sense.

At the same time, the beauty of how the http library is designed, and of the
solution he describes, is that there are hooks that let you be even more
efficient with very little code.

Or to put it differently: Go gives you excellent (compared to everything else
out there except C++/Rust) networking performance out of the box and you can
go even faster with a minimal amount of effort.

What you call "dropping to this level" is 80 lines of code
([https://github.com/eranyanay/1m-go-
websockets/blob/master/3_...](https://github.com/eranyanay/1m-go-
websockets/blob/master/3_optimize_ws_goroutines/epoll.go)) and now you don't
even have to write them yourself.

~~~
windlep
To be more specific, lots of idiomatic Go patterns typically involve using
channels. Working at this level means you can't do that per 'client' since
that would imply at least one goroutine per client.

There are plenty of libraries that involve channels (and assume you have
goroutines per connection/client); do those play well with the use of epoll in
this manner? I would assume I can't use the stdlib time package to do a
timeout, for example, since it provides a channel to wait on, while in this
setup I need objects that work with epoll. Obviously in that case I could use
evio, which abstracts over epoll/kqueue and provides a Tick...

So my comment was more that "dropping to this level" means dropping the use of
the majority of standard Go concurrency idioms which revolve around goroutines
and channels. It's not about the raw lines of code involved. Writing code in
this style of async doesn't feel very Go-like, and when you look at an example
of code using this style, it's exactly the thing Go was trying to avoid
with goroutines ([https://github.com/tidwall/evio/blob/master/examples/http-
se...](https://github.com/tidwall/evio/blob/master/examples/http-
server/main.go)).

You end up with async event-based code, not the clean synchronous-appearing Go
code that goroutines and channels provide.
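
Concretely, the idiom being given up looks something like this (a generic sketch, not from any of the linked code); in the epoll setup there is no per-client goroutine left to block in this select:

```go
package main

import (
	"fmt"
	"time"
)

// handle is the goroutine-per-connection idiom: each client gets a
// goroutine that blocks in a select over its message channel and a
// stdlib timer channel. Channel-based timeouts like this assume a
// dedicated goroutine exists to sit in the select.
func handle(msgs <-chan string) string {
	select {
	case m := <-msgs:
		return "got: " + m
	case <-time.After(50 * time.Millisecond):
		return "timeout"
	}
}

func main() {
	live := make(chan string, 1)
	live <- "ping"
	fmt.Println(handle(live)) // got: ping
	// An empty channel never delivers, so the timer fires instead.
	fmt.Println(handle(make(chan string))) // timeout
}
```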

------
rhizome
I'd like to see this kind of story combined with the "how to build a business"
threads to discuss what kinds of business models and sizes require a million
websocket connections to one process.

1M simultaneous users of a React (or other) app: is that a decently simple
case? What are some sites that have this level of activity? I found a 5-year-
old article that says Spotify had 20MM simultaneous users then, but spread
over 12,000 machines, so suffice it to say I'm having trouble finding a use
case here besides (good!) research.

~~~
art0rz
I have written a 1MM+ simultaneous-user websocket server in Node for a second-
screen app for a well-known talent TV show. It was running on a single server
(with failovers and redundancy, of course) just fine, but 99% of the server's
work was just broadcasting. The hard part is sending individual messages to
users with user-specific content.

~~~
toredash
Agree on the hard part. Fan-out models like this with the same content are
"easy". Getting unique content to specific users in an efficient way, at a
scale of 100,000-plus users, that's hard.

~~~
zepolen
Do you have an example of this type of unique content?

~~~
toredash
A quiz, for instance, with a leaderboard or prizes.

Let's say you publish a quiz to 1 million users. Everyone responds and you
store the responses centrally. Now you send out an aggregated view of the
results (e.g. % answered A).

Now, you want to inform one or more users that they won a prize. How do you do
this efficiently?

Or better, you want to display each user's ranking if there are multiple
quizzes.
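
A hypothetical sketch of the difference (all names here are made up): broadcast is one loop over identical bytes, while prize/ranking messages need a per-user index and, worse, a distinct payload per recipient:

```go
package main

import (
	"fmt"
	"sync"
)

// Hub is a toy registry: broadcast walks every outbox with the same
// bytes, while targeted sends (prize, per-user ranking) use a
// user -> outbox index. The hard part at scale isn't this lookup but
// producing a distinct message per user.
type Hub struct {
	mu    sync.RWMutex
	users map[string]chan string // userID -> outbox
}

func NewHub() *Hub { return &Hub{users: make(map[string]chan string)} }

func (h *Hub) Join(id string) chan string {
	h.mu.Lock()
	defer h.mu.Unlock()
	ch := make(chan string, 8) // a real server would drop/kick slow consumers
	h.users[id] = ch
	return ch
}

// Broadcast: same payload to everyone (the "easy" 99%).
func (h *Hub) Broadcast(msg string) {
	h.mu.RLock()
	defer h.mu.RUnlock()
	for _, ch := range h.users {
		ch <- msg
	}
}

// SendTo: unique payload to one user; reports whether the user exists.
func (h *Hub) SendTo(id, msg string) bool {
	h.mu.RLock()
	defer h.mu.RUnlock()
	ch, ok := h.users[id]
	if ok {
		ch <- msg
	}
	return ok
}

func main() {
	h := NewHub()
	a, b := h.Join("alice"), h.Join("bob")
	h.Broadcast("quiz: 64% answered A") // identical bytes to all
	h.SendTo("alice", "you ranked #3")  // unique content to one user
	fmt.Println(<-a, "|", <-b)
	fmt.Println(<-a)
}
```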

------
cuddlecake
Can someone make a comparison between this and the Elixir Phoenix 2M websocket
connections example?

I want to sleep.

~~~
brightball
I remember seeing a talk on the 2M-websocket Elixir example, and one of the
keys to it was that the sockets were actually being used and processing
messages intermittently during the test. That's important to keep in mind vs.
simply opening them.

The other thing I'd be interested to see Elixir demonstrate would be doing a
hot deployment without triggering all 2M connections to try to reconnect at
the same time.

The other important note on the Phoenix example was that they were starting
3 processes for each socket (to supervise it and handle failures/reconnects),
if I remember correctly.

~~~
zensavona
With Elixir/OTP there would be no disconnection/reconnection. The way hot code
push works with OTP is that each "channel" (OTP "process"/websocket
connection) has a GenServer, which is basically a state machine. When a new
version of the function becomes available in the VM, the current state is
passed into a function (supplied by you) that mutates it into a shape
compatible with your new code; the next time the data is passed around the
loop to the state-management function, it goes to the new one instead. Because
the state data is stored quite separately from the actual functions that
mutate it, there is no need to disconnect or do any mass purging of memory
when there's a hot code deployment.

~~~
Casperin
I recorded an example of what that looks like about a year ago
[https://www.youtube.com/watch?v=CZWMc2cXUAw](https://www.youtube.com/watch?v=CZWMc2cXUAw)
-- I no longer work in Elixir, but this magic still kind of blows my mind.

------
ElijahLynn
related:

[https://phoenixframework.org/blog/the-road-to-2-million-
webs...](https://phoenixframework.org/blog/the-road-to-2-million-websocket-
connections)

[http://goroutines.com/10m](http://goroutines.com/10m)

~~~
adanhawth
...and an HN link for the latter:

[https://news.ycombinator.com/item?id=11320023](https://news.ycombinator.com/item?id=11320023)

------
spenrose
An issue with Go channels and high-performance networking from 2016:

"There was one fundamental mistake made, however, which is that we shouldn't
have used channels. ... First, they don't perform well enough. ... Second,
they make it very hard to prevent message loss. ... Third, the buffered
channels mean that Heka consumes much more RAM than would be otherwise needed"

[https://mail.mozilla.org/pipermail/heka/2016-May/001059.html](https://mail.mozilla.org/pipermail/heka/2016-May/001059.html)

~~~
cdoxsey
Not really a networking issue. Channels are (relatively) inefficient for small
payloads; with a decent payload size they're almost never the bottleneck in a
real-world program. You really have to benchmark it.

It sounds like in this case the message-loss-prevention requirement rendered
the original design flawed. I don't see any reason why you couldn't use an
on-disk queue in Go vs. other languages... though the cgo overhead of the Lua
binding sounds like it was also an issue.

------
otabdeveloper2
"1M socket connections" is easy.

Having them all do something useful at the same time is the hard part. (And
no, "async" won't save you here.)

~~~
reilly3000
Agreed, but traffic doesn't typically flow like that. Maybe for a live event
stream's chat?

------
jondubois
These kinds of benchmarks are not very meaningful. I think that pretty much
any modern framework/language can handle at least 1 million idle WebSockets.
It's much more interesting to measure performance when you start sending
messages through them at regular intervals.

~~~
kjksf
When people rewrite a system from X, where X is in (Python, Ruby, Node,
Clojure), to Go they usually see at least a 10x improvement.

This is just one recent example: [https://www.infoq.com/articles/api-gateway-
clojure-golang](https://www.infoq.com/articles/api-gateway-clojure-golang)

The money quote:

"The end result enabled us to reduce 25 instances (c4 xlarge) running Clojure
code - able to process 60 concurrent requests, to two instances (c3.2xlarge)
running Go code able to support ~5000 concurrent requests a minute"

If you google around you'll find more stories like that.

To do better than Go you would have to drop to C++ or Rust.

~~~
karim
_" When people rewrite system from X where X in (Python, Ruby, Node, Clojure)
to Go they usually see at least 10x improvement."_

Sure but how much of the performance gains come from Go and how much come from
just having a better understanding of the problem the second time around?

~~~
zepolen
Go is statically typed and compiled; it literally does 10x less work than
dynamic languages for most things.

PyPy gets a near-10x speedup over CPython as well.

------
lfmunoz4
1 million tcp connections with vertx / kotlin

[https://github.com/lfmunoz/vertx-kt-rocket](https://github.com/lfmunoz/vertx-
kt-rocket)

Nothing special other than having to tune Linux

~~~
mateuszf
The JVM is very optimized, so it's actually something not that non-special.

~~~
pmlopes
> "so it's actually something not that non-special."

That is not totally true. This is a mix of two things: using the JVM (which,
like you said, has been tuned and optimized for heavy loads) plus using a
truly asynchronous and reactive programming (and IO) model built on great
technologies (in this specific case: Kotlin, Eclipse Vert.x and Netty).

As an experiment, if you picked another random set of libraries (imagine a
servlet container), achieving the same results would not be so trivial; see
for example:

[https://www.techempower.com/benchmarks/#section=data-r17&hw=...](https://www.techempower.com/benchmarks/#section=data-r17&hw=ph&test=db)

And observe that Eclipse Vert.x is on the top for these reasons while other
JVM frameworks are far behind.

~~~
mateuszf
That's why I used "not" twice. I meant that it _IS_ special; sorry for the
confusion, I'm not a native speaker.

------
ValleZ
I thought you can make only 65k connections per port because of TCP
limitations.

~~~
hathawsh
The limit is 65k connections per client IP address, because a TCP connection
is identified by the (client IP, client port, server IP, server port) tuple
and a client has only ~65k ports to draw from. You can theoretically accept
connections from all possible client IP addresses simultaneously.

This means that in order to test 1M simultaneous connections, you would need
at least 16 client IP addresses (1,000,000 / 65,536 ≈ 15.3). Probably more.

~~~
athenot
As a fun aside, you can do this on localhost by defining additional addresses
tied to the loopback interface:

    127.0.0.2
    127.0.0.3
    etc.

I once had to test a piece of software that identified its connections by the
source IP, so I had a script to create thousands of loopback addresses within
127.0.0.0/8 and then run against that software and verify the connections were
doing what they were supposed to do. (This was on Mac OS X.)
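
On macOS that step looks roughly like this (a sketch using standard ifconfig alias syntax; the Linux equivalent is `ip addr add 127.0.0.X/8 dev lo`). It prints the commands so you can review them before piping to `sudo sh`:

```shell
# Emit ifconfig alias commands for extra loopback addresses.
# Review the output, then apply with:  ./aliases.sh | sudo sh
for i in $(seq 2 254); do
  echo "ifconfig lo0 alias 127.0.0.$i up"
done
```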

------
lkramer
Is there a video of the talk?

