Agreed, I'd be quite curious how many libraries in the Go ecosystem they can't use as a result. Either because the library spawns a goroutine, uses one under the hood, etc.
Having to drop to this level makes me wonder if it'd be better to just use a language better suited to this type of asynchronous networking (C/C++/Rust).
Very few Go libraries start goroutines on their own. Concurrency is mostly an application level concern.
HTTP library is unusual in that sense.
At the same time, the beauty of how http library is designed and the solution he describes is that there are hooks that allow being even more efficient with very little code.
Or to put it differently: Go gives you excellent (compared to everything else out there except C++/Rust) networking performance out of the box and you can go even faster with a minimal amount of effort.
To be more specific, lots of idiomatic Go patterns typically involve using channels. Working at this level means you can't do that per 'client' since that would imply at least one goroutine per client.
There's plenty of libraries that involve channels (and assume you have goroutines per connection/client), do those play well with the use of epoll in this manner? I would assume I can't use the stdlib time package to do a timeout for example, since that provides a channel to wait on, while in this setup I need objects that work with epoll. Obviously in that case I could use evio which abstracts over epoll/kqueue and provides a Tick...
So my comment was more that "dropping to this level" means dropping the use of the majority of standard Go concurrency idioms which revolve around goroutines and channels. It's not about the raw lines of code involved. Writing code in this style of async doesn't feel very Go-like and when you look at an example of some code using this style its exactly the thing Go was trying to avoid with goroutines (https://github.com/tidwall/evio/blob/master/examples/http-se...).
You end up with async event-based code, not the clean synchronous-appearing Go code that goroutines and channels provide.
The idea behind going all the way down to epoll is to reduce the memory footprint by around 30%
otherwise, with every goroutine stack consuming 4-8KB, it's not feasible to go with 1goroutine-1websocket approach
I'd like to see this kind of story combined with the "how to build a business" threads to discuss what kinds of business models and sizes require a million websocket connections to one process.
1M simultaneous users of a React (or other) app, is that a decently simple case? What are some sites that have this level of activity? I found a 5yr old article that says Spotify had 20MM simultaneous users then, but spread over 12,000 machines, so suffice it to say I'm having trouble finding a use-case here besides (good!) research.
I have written a 1mm+ simultaneous user websocket server for a second screen app for a well known talent TV show in Node. It was running on a single server (with failovers and redundancies of course) just fine, but 99% of the server was just broadcasting. The hard part is sending single messages to users with user-specific content.
Agree on the hard part. Fan out models like this with the same content is "easy". Getting unique content to specific users in an efficient way, in scale of 100.000 users plus, that's hard.
Let say you publish an Quiz to 1 mill users. Everyone response and you store the responses centrally. Now you send out an aggregated view of the results (e.g. % answered A).
Now, you want to inform one or more users that they won a prize. How do you do this effectively ?
Of better, you want to display the ranking to each user if there are multiple quiz.
One of their slides (https://speakerdeck.com/eranyanay/going-infinite-handling-1m...) lists the use cases for such functionality and they are pretty broad (message queues, chat apps, notifications, social feeds, collaborative editing, location updates).
When you consider the price of the hardware and complexity of the system, it's obviously useful not just for handling 1M connections per host but also 10k connections per host.
If some other technology / framework can only do 1k connection per host, then you can per 10x less for hardware and have room to spare. And the system will simpler which means faster to develop.
Sure, I can imagine all kinds of websocket applications and their economic tradeoffs, just not ones that have ever enjoyed a million connections against a single server instance.
I’m working on https://whatsgoingon.in that needs this and uses the base technique - which is actually pretty cool and good enough. I can live with 20GB of RAM for a million connections, given that it’s a paid product (live blogging)
Flip it around. If the stack you're using is memory or compute intensive for a given load, it will cost you more to provision that load. You may not need to service 1M connections, but you may need to service 10K connections with a single box instead of multiple boxes due to cost.
I remember seeing a talk on the 2m websocket Elixir example and one of the keys to it was that the sockets were actually being used and processing messages intermittently during the test. Important thing to keep in mind vs simply opening.
The other thing I'd be interested to see Elixir demonstrate would be doing a hot deployment to avoid triggering all 2m connections to try reconnect at the same time.
The other important note with the Phoenix example was that they were starting 3 processes for each socket (to supervisor, handle failures/reconnects) if I remember correctly.
With Elixir/OTP there would be no disconnection/reconnection. The way hot code push works with OTP is that each "channel" (OTP "process"/websocket connection) has a GenServer, which is basically a state machine, and when there is a new version of the function available in the VM, the current state is passed into a function (supplied by you) to mutate it into a shape compatible with your new change, then next time the data is passed around the loop to the state management function, it goes to the new one instead. Because the state data is stored quite separately to the actual functions who mutate it, there is no need to disconnect or do any mass purging of memory when there's a hot code deployment.
I recorded an example of what that looks like about a year ago https://www.youtube.com/watch?v=CZWMc2cXUAw -- I no longer work in Elixir, but this magic still kind of blows my mind.
If you look at the server implementation, the client messages are also read and parsed, but are not printed to stdout for demonstration purposes (otherwise it hard to keep track)
You can uncomment that printf instruction and see that it still performs nearly the same way
An issue with Go channels and high-performance networking from 2016:
"There was one fundamental mistake made, however, which is that we shouldn't have used channels. ... First, they don't perform well enough. ... Second, they make it very hard to prevent message loss. ... Third, the buffered channels mean that Heka consumes much more RAM than would be otherwise needed"
Not really a networking issue. Channels are (relatively) inefficient for small payloads. With a decent payload size they're almost never the bottleneck in a real-world program. You really have to benchmark it.
It sounds like in this case the message-loss-prevention rendered the original design flawed. I don't see any reason why you couldn't use an on-disk queue in Go vs other languages... though the cgo overhead of the lua binding sounds like it was also an issue.
These kinds of benchmarks are not very meaningful. I think that pretty much any modern framework/language can handle at least 1 million idle WebSockets. It's much more interesting to measure performance when you start sending messages through them at regular intervals.
"The end result enabled us to reduce 25 instances (c4 xlarge) running Clojure code - able to process 60 concurrent requests, to two instances (c3.2xlarge) running Go code able to support ~5000 concurrent requests a minute"
If you google around you'll find more stories like that.
To do better than Go you would have to drop to C++ or Rust.
"When people rewrite system from X where X in (Python, Ruby, Node, Clojure) to Go they usually see at least 10x improvement."
Sure but how much of the performance gains come from Go and how much come from just having a better understanding of the problem the second time around?
"The end result enabled us to reduce 25 instances (c4 xlarge) running Clojure code - able to process 60 concurrent requests, to two instances (c3.2xlarge) running Go code able to support ~5000 concurrent requests a minute"
That article is anything but useful. They have spent significantly more time designing and writing app in Go than their prototype in Clojure and now it’s faster? What a surprise!
Hmm, there's a slight disparity between the quote and their benchmark table, which shows Clojure and Go neck and neck in the request/sec dept. I guess all the system re-architecting explains most of the difference.
In any case, as the article text also mentions, on the Go side they are using a complete reverse-proxy library that's included with the Go stdlib, which can be a significant advantage aside from the properties of the language itself.
But then it seems they ended up reimplementing many things that the JVM provides: "In order to achieve out of the box functionality such as CPU and memory usage metrics, business logic counters and more - we needed to write basically this entire stack from scratch, which enabled us a much more rapid deep dive of the intricacies of Golang."
> "so it's actually something not that non-special."
That is not totally true, This is a mix of 2 things, using the JVM (which like you said is being tuned and optimized for heavy loads) + using a true asynchronous and reactive programming (and IO) model built on great technologies such as (in this specific case: Kotlin, Eclipse Vert.x and Netty).
As an experiment if you would pick another random set of libraries (imagine a servlet container) achieving the same results would not be so trivial, see for example:
The limit is 65k connections per client IP address, limited by the number of available ports on the client. You can theoretically accept connections from all possible client IP addresses simultaneously.
This means that in order to test 1M simultaneous connections, you would need to use at least 16 client IP addresses. Probably more.
As a fun aside, you can do this on localhost by defining additional addresses tied to the loopback interface:
127.0.0.2
127.0.0.3
etc.
I once had to test a piece of software that identified its connections by the source IP, so I had a script to create thousands of loopback addresses within 127.0.0.0/8 and then run against that software and verify the connections were doing what they were supposed to do. (This was on Mac OS X.)
And most linux distros limits the ephemeral range even further to about 30K ports
If you take a deeper look, the solution uses Docker to connect clients from different network namespaces to mitigate this
The limit is only between a single client and server ip, since there can only be up to 65k source port/ip pairs (assuming the destination port is a single port, which it seems to be in the code examples). So if there are multiple client ips it seems like it's not an issue.
Deep down, the key for a connection is the concatenation of SA DA SP DP. DA and DP are constant for a given service in the simple case and the others are 4 octets [although not all of it usable] for IPv4 and 2 octets (for port).
This also reminds me of Erlang VM handling 2M for Whatsapp on a single server in 2012:
https://blog.whatsapp.com/196/1-million-is-so-2011