Handling 1M websocket connections in Go

rdtsc · on Feb 13, 2019

Very cool. Noticed they had to go down to epoll: https://github.com/eranyanay/1m-go-websockets/blob/master/3_... that seems a bit too low level. Why not spawn 1m goroutines is that not feasible?

This also reminds me of Erlang VM handling 2M for Whatsapp on a single server in 2012:

https://blog.whatsapp.com/196/1-million-is-so-2011

windlep · on Feb 13, 2019

> Noticed they had to down to epoll: https://github.com/eranyanay/1m-go-websockets/blob/master/3_.... that seems a bit too low level.

Agreed, I'd be quite curious how many libraries in the Go ecosystem they can't use as a result. Either because the library spawns a goroutine, uses one under the hood, etc.

Having to drop to this level makes me wonder if it'd be better to just use a language better suited to this type of asynchronous networking (C/C++/Rust).

kjksf · on Feb 13, 2019

Very few Go libraries start goroutines on their own. Concurrency is mostly an application level concern.

HTTP library is unusual in that sense.

At the same time, the beauty of how http library is designed and the solution he describes is that there are hooks that allow being even more efficient with very little code.

Or to put it differently: Go gives you excellent (compared to everything else out there except C++/Rust) networking performance out of the box and you can go even faster with a minimal amount of effort.

What you call "dropping to this level" is 80 lines of code (https://github.com/eranyanay/1m-go-websockets/blob/master/3_...) and now you don't even have to write them yourself.

windlep · on Feb 14, 2019

To be more specific, lots of idiomatic Go patterns typically involve using channels. Working at this level means you can't do that per 'client' since that would imply at least one goroutine per client.

There's plenty of libraries that involve channels (and assume you have goroutines per connection/client), do those play well with the use of epoll in this manner? I would assume I can't use the stdlib time package to do a timeout for example, since that provides a channel to wait on, while in this setup I need objects that work with epoll. Obviously in that case I could use evio which abstracts over epoll/kqueue and provides a Tick...

So my comment was more that "dropping to this level" means dropping the use of the majority of standard Go concurrency idioms which revolve around goroutines and channels. It's not about the raw lines of code involved. Writing code in this style of async doesn't feel very Go-like and when you look at an example of some code using this style its exactly the thing Go was trying to avoid with goroutines (https://github.com/tidwall/evio/blob/master/examples/http-se...).

You end up with async event-based code, not the clean synchronous-appearing Go code that goroutines and channels provide.

e-r-a-n · on Feb 13, 2019

The idea behind going all the way down to epoll is to reduce the memory footprint by around 30% otherwise, with every goroutine stack consuming 4-8KB, it's not feasible to go with 1goroutine-1websocket approach

cdoxsey · on Feb 14, 2019

That's what, 8GB of ram? A lot for a laptop... but not too much for a server.

doublerebel · on Feb 14, 2019

30% extra RAM is a very noticeable server expense.

ElijahLynn · on Feb 13, 2019

Same. Reminded me of Erlang too!

rhizome · on Feb 13, 2019

I'd like to see this kind of story combined with the "how to build a business" threads to discuss what kinds of business models and sizes require a million websocket connections to one process.

1M simultaneous users of a React (or other) app, is that a decently simple case? What are some sites that have this level of activity? I found a 5yr old article that says Spotify had 20MM simultaneous users then, but spread over 12,000 machines, so suffice it to say I'm having trouble finding a use-case here besides (good!) research.

leonhandreke · on Feb 13, 2019

Apparently WhatsApp at some point handled 2M connections per server: http://highscalability.com/blog/2014/3/31/how-whatsapp-grew-...

art0rz · on Feb 13, 2019

I have written a 1mm+ simultaneous user websocket server for a second screen app for a well known talent TV show in Node. It was running on a single server (with failovers and redundancies of course) just fine, but 99% of the server was just broadcasting. The hard part is sending single messages to users with user-specific content.

toredash · on Feb 14, 2019

Agree on the hard part. Fan out models like this with the same content is "easy". Getting unique content to specific users in an efficient way, in scale of 100.000 users plus, that's hard.

zepolen · on Feb 15, 2019

Do you have an example of this type of unique content?

toredash · on Feb 15, 2019

Quiz for instance with a leaderboard or prizing.

Let say you publish an Quiz to 1 mill users. Everyone response and you store the responses centrally. Now you send out an aggregated view of the results (e.g. % answered A).

Now, you want to inform one or more users that they won a prize. How do you do this effectively ?

Of better, you want to display the ranking to each user if there are multiple quiz.

kjksf · on Feb 13, 2019

One of their slides (https://speakerdeck.com/eranyanay/going-infinite-handling-1m...) lists the use cases for such functionality and they are pretty broad (message queues, chat apps, notifications, social feeds, collaborative editing, location updates).

When you consider the price of the hardware and complexity of the system, it's obviously useful not just for handling 1M connections per host but also 10k connections per host.

If some other technology / framework can only do 1k connection per host, then you can per 10x less for hardware and have room to spare. And the system will simpler which means faster to develop.

rhizome · on Feb 15, 2019

Sure, I can imagine all kinds of websocket applications and their economic tradeoffs, just not ones that have ever enjoyed a million connections against a single server instance.

sudhirj · on Feb 14, 2019

I’m working on https://whatsgoingon.in that needs this and uses the base technique - which is actually pretty cool and good enough. I can live with 20GB of RAM for a million connections, given that it’s a paid product (live blogging)

tomohawk · on Feb 13, 2019

Flip it around. If the stack you're using is memory or compute intensive for a given load, it will cost you more to provision that load. You may not need to service 1M connections, but you may need to service 10K connections with a single box instead of multiple boxes due to cost.

z3t4 · on Feb 14, 2019

I like to think of projects like this when a team of "engineers" have trouble scaling their cloud service to support more then one simultaneous user.

cuddlecake · on Feb 13, 2019

Can someone make a comparison between this and the Elixir Phoenix 2m websocket conmections example.

I want to sleep.

brightball · on Feb 13, 2019

I remember seeing a talk on the 2m websocket Elixir example and one of the keys to it was that the sockets were actually being used and processing messages intermittently during the test. Important thing to keep in mind vs simply opening.

The other thing I'd be interested to see Elixir demonstrate would be doing a hot deployment to avoid triggering all 2m connections to try reconnect at the same time.

The other important note with the Phoenix example was that they were starting 3 processes for each socket (to supervisor, handle failures/reconnects) if I remember correctly.

zensavona · on Feb 14, 2019

With Elixir/OTP there would be no disconnection/reconnection. The way hot code push works with OTP is that each "channel" (OTP "process"/websocket connection) has a GenServer, which is basically a state machine, and when there is a new version of the function available in the VM, the current state is passed into a function (supplied by you) to mutate it into a shape compatible with your new change, then next time the data is passed around the loop to the state management function, it goes to the new one instead. Because the state data is stored quite separately to the actual functions who mutate it, there is no need to disconnect or do any mass purging of memory when there's a hot code deployment.

Casperin · on Feb 15, 2019

I recorded an example of what that looks like about a year ago https://www.youtube.com/watch?v=CZWMc2cXUAw -- I no longer work in Elixir, but this magic still kind of blows my mind.

e-r-a-n · on Feb 13, 2019

If you look at the server implementation, the client messages are also read and parsed, but are not printed to stdout for demonstration purposes (otherwise it hard to keep track) You can uncomment that printf instruction and see that it still performs nearly the same way

ponyous · on Feb 14, 2019

I did hot code upgrades on a server with WS connections active and I don't remember connections being broken.

mcintyre1994 · on Feb 13, 2019

Do you have a link to that Elixir one by any chance?

brad0 · on Feb 13, 2019

https://phoenixframework.org/blog/the-road-to-2-million-webs...

ElijahLynn · on Feb 13, 2019

http://goroutines.com/10m

adanhawth · on Feb 13, 2019

...and an HN link for the latter:

https://news.ycombinator.com/item?id=11320023

spenrose · on Feb 14, 2019

An issue with Go channels and high-performance networking from 2016:

"There was one fundamental mistake made, however, which is that we shouldn't have used channels. ... First, they don't perform well enough. ... Second, they make it very hard to prevent message loss. ... Third, the buffered channels mean that Heka consumes much more RAM than would be otherwise needed"

https://mail.mozilla.org/pipermail/heka/2016-May/001059.html

cdoxsey · on Feb 14, 2019

Not really a networking issue. Channels are (relatively) inefficient for small payloads. With a decent payload size they're almost never the bottleneck in a real-world program. You really have to benchmark it.

It sounds like in this case the message-loss-prevention rendered the original design flawed. I don't see any reason why you couldn't use an on-disk queue in Go vs other languages... though the cgo overhead of the lua binding sounds like it was also an issue.

otabdeveloper2 · on Feb 14, 2019

"1M socket connections" is easy.

Having them all do something useful at the same time is the hard part. (And no, "async" won't save you here.)

reilly3000 · on Feb 14, 2019

Agreed, but traffic doesn't typically flow like that. Maybe for a live event stream's chat?

cryptica · on Feb 13, 2019

These kinds of benchmarks are not very meaningful. I think that pretty much any modern framework/language can handle at least 1 million idle WebSockets. It's much more interesting to measure performance when you start sending messages through them at regular intervals.

kjksf · on Feb 13, 2019

When people rewrite system from X where X in (Python, Ruby, Node, Clojure) to Go they usually see at least 10x improvement.

This is just one recent example: https://www.infoq.com/articles/api-gateway-clojure-golang

The money quote:

"The end result enabled us to reduce 25 instances (c4 xlarge) running Clojure code - able to process 60 concurrent requests, to two instances (c3.2xlarge) running Go code able to support ~5000 concurrent requests a minute"

If you google around you'll find more stories like that.

To do better than Go you would have to drop to C++ or Rust.

karim · on Feb 14, 2019

"When people rewrite system from X where X in (Python, Ruby, Node, Clojure) to Go they usually see at least 10x improvement."

Sure but how much of the performance gains come from Go and how much come from just having a better understanding of the problem the second time around?

zepolen · on Feb 14, 2019

Go is strongly typed and compiled, it literally does 10x less work than dynamic languages for most things.

Pypy leads to a near 10x speed up over cPython as well.

Scarbutt · on Feb 14, 2019

"The end result enabled us to reduce 25 instances (c4 xlarge) running Clojure code - able to process 60 concurrent requests, to two instances (c3.2xlarge) running Go code able to support ~5000 concurrent requests a minute"

That doesn't make any sense though.

piranha · on Feb 14, 2019

That article is anything but useful. They have spent significantly more time designing and writing app in Go than their prototype in Clojure and now it’s faster? What a surprise!

fulafel · on Feb 15, 2019

Hmm, there's a slight disparity between the quote and their benchmark table, which shows Clojure and Go neck and neck in the request/sec dept. I guess all the system re-architecting explains most of the difference.

In any case, as the article text also mentions, on the Go side they are using a complete reverse-proxy library that's included with the Go stdlib, which can be a significant advantage aside from the properties of the language itself.

But then it seems they ended up reimplementing many things that the JVM provides: "In order to achieve out of the box functionality such as CPU and memory usage metrics, business logic counters and more - we needed to write basically this entire stack from scratch, which enabled us a much more rapid deep dive of the intricacies of Golang."

apta · on Feb 14, 2019

> To do better than Go you would have to drop to C++ or Rust.

C++ and Rust yes, but also Java and .NET.

weberc2 · on Feb 14, 2019

In my experience Java, C#, and Go are roughly on par. Go is a bit easier to optimize though.

basic6 · on Feb 14, 2019

I didn't think Go rewrites were common anymore since the results were mediocre.

kasey_junk · on Feb 14, 2019

Or Java or C# or basically any non-interpreted language.

rakenodiax · on Feb 14, 2019

In this case, Clojure is already running on the JVM

kasey_junk · on Feb 14, 2019

The Clojure compiler does not generally produce the same byte code as the Java one.

Idiomatic clojure is routinely slower than idiomatic Java and that’s a well known and expected outcome.

e-r-a-n · on Feb 13, 2019

They are not idling. Look at the code.

lfmunoz4 · on Feb 13, 2019

1 million tcp connections with vertx / kotlin

https://github.com/lfmunoz/vertx-kt-rocket

Nothing special other than having to tune Linux

mateuszf · on Feb 13, 2019

Jvm is very optimized, so it's actually something not that non-special.

pmlopes · on Feb 14, 2019

> "so it's actually something not that non-special."

That is not totally true, This is a mix of 2 things, using the JVM (which like you said is being tuned and optimized for heavy loads) + using a true asynchronous and reactive programming (and IO) model built on great technologies such as (in this specific case: Kotlin, Eclipse Vert.x and Netty).

As an experiment if you would pick another random set of libraries (imagine a servlet container) achieving the same results would not be so trivial, see for example:

https://www.techempower.com/benchmarks/#section=data-r17&hw=...

And observe that Eclipse Vert.x is on the top for these reasons while other JVM frameworks are far behind.

mateuszf · on Feb 14, 2019

That's why I used not two times. I meant that it IS special, sorry for confusion, I'm not a native speaker.

ValleZ · on Feb 13, 2019

I thought you can make only 65k connections per port because of TCP limitations.

hathawsh · on Feb 13, 2019

The limit is 65k connections per client IP address, limited by the number of available ports on the client. You can theoretically accept connections from all possible client IP addresses simultaneously.

This means that in order to test 1M simultaneous connections, you would need to use at least 16 client IP addresses. Probably more.

athenot · on Feb 13, 2019

As a fun aside, you can do this on localhost by defining additional addresses tied to the loopback interface:

    127.0.0.2
    127.0.0.3
    etc.

I once had to test a piece of software that identified its connections by the source IP, so I had a script to create thousands of loopback addresses within 127.0.0.0/8 and then run against that software and verify the connections were doing what they were supposed to do. (This was on Mac OS X.)

e-r-a-n · on Feb 13, 2019

And most linux distros limits the ephemeral range even further to about 30K ports If you take a deeper look, the solution uses Docker to connect clients from different network namespaces to mitigate this

shereadsthenews · on Feb 13, 2019

This comment appears routinely on HN. I'd love to find out how something so basic about TCP came to be so widely misunderstood.

wmf · on Feb 13, 2019

Because almost everyone tests with one client IP and one server IP.

sempron64 · on Feb 13, 2019

The limit is only between a single client and server ip, since there can only be up to 65k source port/ip pairs (assuming the destination port is a single port, which it seems to be in the code examples). So if there are multiple client ips it seems like it's not an issue.

https://stackoverflow.com/a/2332756

kondro · on Feb 13, 2019

An open connection is identified by all 4 values of the unique tuple source IP, source port, destination IP, destination port.

foobiekr · on Feb 13, 2019

Deep down, the key for a connection is the concatenation of SA DA SP DP. DA and DP are constant for a given service in the simple case and the others are 4 octets [although not all of it usable] for IPv4 and 2 octets (for port).

lessclue · on Feb 14, 2019

UNIX sockets can help.

e-r-a-n · on Feb 14, 2019

lkramer · on Feb 13, 2019

Is there a video of the talk?