The headline can be misleading: these are not active real-world socket connections. According to the article, they are mostly idle, with pings every five minutes.
And if I recall correctly, the 2 million figure was just an artificial limit that they hit due to the configuration settings during the test, since it just "seemed like a really big number" at the time. In theory, it could probably go even higher.
For a counterpoint, here's a benchmark of a single server running Elixir/Phoenix and supporting 2M concurrent websockets, sending pings once per second (rather than every five minutes like this Go benchmark). The socket handling code is simpler, too.
It's actually kind of lousy to use that much memory and other resources for 10M nearly completely idle connections.
If the kernel is bypassed, I have reason to believe Go can handle the line rate of 16.7M packets/sec (on a 10 Gbps NIC). I would love to integrate Intel's DPDK with Go. I've done some groundwork on this before; it's very feasible to replace the Go network stack in a transparent way. If any company is interested in sponsoring that as an open-source project, my contact info is in my profile.
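To make "transparent" concrete: most Go code is written against the net.Listener / net.Conn interfaces, so a userspace stack could slot in underneath without callers changing. A rough sketch of that seam (the dpdkstack package in the comment below is hypothetical; it doesn't exist):

    package main

    import (
        "log"
        "net"
        "net/http"
    )

    // serve is written purely against the net.Listener interface, so it never
    // needs to know which TCP/IP stack produced the listener.
    func serve(ln net.Listener) error {
        return http.Serve(ln, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("pong\n"))
        }))
    }

    func main() {
        // Today: the kernel's stack via the standard library.
        ln, err := net.Listen("tcp", ":8080")
        if err != nil {
            log.Fatal(err)
        }
        // Hypothetically: ln = dpdkstack.Listen(":8080"), a DPDK-backed
        // implementation of net.Listener, with no changes to serve() or
        // anything layered above it.
        log.Fatal(serve(ln))
    }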
I wouldn't dare write my own. I'm not enough of an expert in TCP/IP stacks to know what I don't know that will hurt me. I was thinking of lifting the relatively simple TCP/IP stack from the Apache-licensed Seastar framework. Since that is also built for DPDK and sees production use, I know it should be easy to adapt and probably doesn't have serious flaws.
I'm actually strongly tempted to just glue Seastar and Go together. That's the lowest-risk approach.
I haven't gotten around to trying it myself yet, but OpenFastPath + OpenDataPlane supposedly has a userspace port of the FreeBSD stack that runs on DPDK through ODP-DPDK.
Yes, OpenOnload is an easy way to get a boost. It doesn't help you much in the commodity cloud, though. There are still a lot of advantages to managing your own hardware when it's feasible.
Sending a ping every 5 minutes at 10M connections was roughly the limit of what the server could handle. That works out to only about 33,000 pings per second (10M / 300 s), which is not a terribly high number.
In that benchmark, pings are actually 512-byte messages. So, MigratoryData published 200,000 messages/sec to 12 million clients at a total bandwidth of over 1 Gbps.
BTW, that benchmark result has recently been improved in terms of latency. The JVM garbage collection pauses were almost eliminated using the Zing JVM from Azul Systems, and the MigratoryData server was able to publish almost 1 Gbps to 10 million concurrent clients with an average end-to-end latency of 15 milliseconds, a 99th percentile of 24 milliseconds, and a maximum latency of 126 milliseconds (computed over almost 2 billion messages).
MigratoryData uses the kernel networking stack and it can be configured to run as a non-root user.
The Zing JVM is not a requirement for the MigratoryData server. MigratoryData ships with the Oracle JVM by default, and that is what is currently used in production. With the Oracle JVM we currently get quite decent garbage collection (GC) pauses for Internet applications, so latency is not impacted much by the GC pauses that occur from time to time.
The Zing JVM could be used in certain projects to reduce those GC pauses to near zero, so that latency is not impacted even in the worst case, when a GC occurs. If Zing is needed for such an ultra-low-latency application, it has to be licensed separately.
I think the point was more that 40 cores was a ceiling for the compute hardware, and it's very possible the actual hardware was 20 physical cores with hyperthreading, which would result in much lower performance than a true 40-core system.
> This is a 32-core machine with 208GB of memory. Sending a ping every 5 minutes at 10M connections was roughly the limit of what the server could handle
While the 10M figure is impressive, this doesn't sound practical or useful.
Why isn't it practical? Maybe not in the cloud, but there are some hosting companies that offer quite powerful machines in the same ballpark.
Hetzner, a German hosting company, has some really good root servers: for about 120 €/month you get a beast of a machine with 16 cores and 128 GB of RAM. If you add some basic load balancing, you can achieve 10M for a really low price. Add something like Docker and you can get a PaaS-like setup that handles millions of connections without breaking a sweat.
If you're going for C10M you can already forget about Docker. For performance you really must be as close to the hardware as possible, which possibly means tweaking the kernel.
The best example I know of squeezing as much as possible out of bare-metal servers would be Stack Overflow, which runs on something in the ballpark of 20 servers (excluding replicas), IIRC.
You do know that processes in a Docker container are regular processes, right? They just have a different namespace. They are as "close to the hardware" as all of your other processes. I'm not sure how you'd go about using your own TCP stack, but if you can do it with a normal process, you can do it inside a Docker container.
When working with Docker, it is possible to get down to the metal as you mentioned, usually by bypassing the Docker networking and using the host network directly, but then you lose a lot of the benefits of containers in the first place.
For example, a Weave or Calico networking layer on top (which adds a fair bit of latency if you're aiming for 10M connections) makes scaling containers quite easy.
I would imagine that if you plan on using Docker in your infrastructure seriously, you are aiming for a multi-host setup with many containers spread throughout, and can easily settle for 100k connections per container.
Besides, I wouldn't trust stuff with shared memory when it comes to that many concurrent connections.
I'd want something with more safety guarantees, either at compile time (a strong type system: Haskell, Rust) or via strong runtime fault tolerance (Erlang/Elixir and the other languages on the BEAM VM).
Can anyone expand on what is actually limiting the Linux kernel or the network card driver? If the server is only at 10% load and half the memory is free, it seems there might be either A) non-obvious tweaking needed or B) actual changes required to the system library design.
Probably a combination of context switching, memory bandwidth, and interrupt handling.
If you're going to use a machine with 208 GB of RAM you may as well buy a better network card. All major vendors of low latency adapters (which in this case means handling lots of small messages) have transparent drivers for Linux kernel bypass now. For example see OpenOnload. Then you avoid the kernel with no code changes.
Once you remove the kernel from the critical path, you should be concerned with I/O resources, which is already illustrated by the article's note about how they get 5x better performance by using four machines with the same total core count. What that does is buy you four times the I/O resources, like memory channels and PCIe lanes.
"It is similar to a push notification server like the iOS Apple Push Notification Service, but without the ability to store messages if the client is offline."
Not even close. Layer in X.509 authentication and flaky mobile networks, and now you're into tuning backoff while dealing with the CPU overhead of TLS negotiation.
Also of note, AWS's network used to flake out at around 400K TCP connections from the public network. That may no longer be the case, but it certainly was around 2012.
FYI, there's an unpublished limit on established TCP connections, bigger instances have bigger limits. It's hard to hit them with normal applications (you'll run out of memory from the TCP buffers and application state), but you can definitely hit them if you're doing weird things.
This is more about maintaining 10,000,000 open connections than about actually sending useful data to this many people or even allowing input from this many connections.
This is pretty useless, other than to make you realize you are doing something terribly wrong. Another good hint is the number of system constants you have to tweak.
If you have these requirements (no read, only write, zero connection state) you would be best served by doing userspace raw IO with something like DPDK instead of the huge machinery that the kernel spins up for every TCP connection.
It's useful if you want push notifications using the W3C Push API. That lets you get notifications without becoming a slave to Google Cloud Messaging [1] or Apple Push Notification Service [2]. This should now work in Firefox and Google Chrome browsers[3]. Microsoft and Apple aren't on board yet.
The extra processing required for websockets is so simple and well understood that we should just talk about raw TCP socket optimizations instead. With websocket processing you can do only a few tricks to further optimize a good implementation; on the TCP/IP layer, however, the optimization tricks are endless and much more complicated.
I see the quite high net.ipv4.tcp_mem setting of 100M max pages (assuming standard 4K pages, that's 381 GiB, more than the physical memory the machine actually has, which feels a bit unsafe: kernels can lock up badly when memory pressure gets severe, so it's better to start spewing errors reasonably early). But I don't see any tweaks to how much each individual connection uses. I wonder whether tcp_rmem/tcp_wmem tweaks could improve things; it doesn't seem completely unreasonable to assume that idle connections don't need the default (IIRC) 85 KiB read and 16 KiB write buffers.
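On the application side, Go does let you shrink the per-socket buffers if you want to experiment with that angle. A sketch with arbitrary sizes; note that (IIRC) setting SO_RCVBUF/SO_SNDBUF explicitly opts the socket out of the kernel's receive autotuning, so whether this actually saves memory on idle connections would need measuring:

    package main

    import (
        "log"
        "net"
    )

    func main() {
        ln, err := net.Listen("tcp", ":9000")
        if err != nil {
            log.Fatal(err)
        }
        for {
            conn, err := ln.Accept()
            if err != nil {
                continue
            }
            // Shrink the per-socket kernel buffers (SO_RCVBUF / SO_SNDBUF).
            // 4 KiB is an arbitrary illustration value; the kernel rounds it,
            // and explicit sizes disable receive autotuning for the socket.
            if tcp, ok := conn.(*net.TCPConn); ok {
                tcp.SetReadBuffer(4 << 10)
                tcp.SetWriteBuffer(4 << 10)
            }
            go func(c net.Conn) {
                defer c.Close()
                buf := make([]byte, 512)
                for {
                    if _, err := c.Read(buf); err != nil {
                        return
                    }
                }
            }(conn)
        }
    }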
This benchmark feels a little lazy. While I am in no doubt that Go could match the 2M concurrent websockets of Elixir's Phoenix framework, a benchmark should aim to implement the same functionality to be considered useful.
The only interesting thing here is that at that 10M-connection figure, each connection-handler goroutine contained at least two channels (sub and t), and they seemed to scale (no mention of latency, though).
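I haven't dug into the repo, but the shape I'd expect for that handler is roughly the following (a guess for illustration, not the benchmark's actual code): one goroutine per connection, selecting over its subscription channel and a ping ticker.

    // Package handler is a guessed-at reconstruction, not the benchmark's code.
    package handler

    import (
        "net"
        "time"
    )

    // Handle runs in its own goroutine, one per connection. sub delivers
    // published messages; t fires the periodic ping (every 5 minutes in the
    // article's setup).
    func Handle(conn net.Conn, sub <-chan []byte) {
        defer conn.Close()
        t := time.NewTicker(5 * time.Minute)
        defer t.Stop()
        for {
            select {
            case msg, ok := <-sub:
                if !ok {
                    return
                }
                if _, err := conn.Write(msg); err != nil {
                    return
                }
            case <-t.C:
                if _, err := conn.Write([]byte("ping")); err != nil {
                    return
                }
            }
        }
    }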
Yes, this is mostly a showcase of modern hardware and the Linux kernel.
However, Go's ability to do coroutines + multicore certainly makes it easier to take advantage of the system's full potential, and this ability is not terribly common in other languages (at least not in such an automatic way). Of course the same result could be achieved with any event-driven language in a less automatic way.
Go effectively uses a thread-per-connection model with goroutines (M:N threading rather than 1:1 threading, though). You won't be able to fully exploit the hardware that way.
It is, however, easier to use, but the context switching and stack overhead would be too great at anything close to 10M connections.
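For a rough sense of the stack-overhead part, it's easy to measure the parked-goroutine cost directly and compare it against what an event loop keeps per connection (typically just a small struct plus buffers). A back-of-the-envelope sketch; the number varies by Go version, since initial goroutine stacks are a few KB:

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        const n = 100000
        block := make(chan struct{})

        var before, after runtime.MemStats
        runtime.GC()
        runtime.ReadMemStats(&before)

        // Spawn parked goroutines, standing in for idle connection handlers.
        for i := 0; i < n; i++ {
            go func() { <-block }()
        }

        runtime.GC()
        runtime.ReadMemStats(&after)
        fmt.Printf("~%d bytes per parked goroutine\n", (after.Sys-before.Sys)/n)

        close(block)
    }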
One of the most important benefits of functional programming is that the compiler can figure out which operations can run in parallel and which can't. In other scenarios, the programmer has to explicitly decide these things herself, so you end up with extra bugs (race conditions, for example).
The language doesn't matter, as long as the compiler can assume purely functional code and is written to take advantage of it.
Rust kind of succeeds in having it both ways -- the programmer decides what runs in parallel, but the compiler won't allow race conditions.
I can't imagine any better scenario than just writing code that can be automatically parallelized, though.
So far as I know, all attempts at automatic parallelization have been extremely disappointing. Even when dealing with Haskell or similarly functional code, you don't get much out of it. See for instance: http://stackoverflow.com/questions/15005670/why-is-there-no-... I've seen a lot of other research I can't seem to dig up right now suggesting that analyzing programs "in the wild" tends to find optimal speedups of less than 2x across whole programs in general.
It's a pity, really. But unfortunately, I wouldn't promise anybody anything about automatic parallelization, because even in lab conditions nobody has gotten it to work very well, so far as I know. It's a weird case: my intuition, like most other people's, says there ought to be a lot of opportunity for it, but the evidence contradicts that.
Anyone who's got a solid link to contradictory evidence, I'd love to hear about it, but my impression is that research has mostly stopped on this topic, because that's how poorly it has gone.
The interesting thing is...how long before you have 10M users?
I'm very much a fan of this sort of work--it's obviously useful for things like data collection from sensors--but I can't help but wonder if it has any practical business value for early-to-mid stage startups.
We sort of take it for granted that you will have to rebuild your architecture once you hit that point. What you gain for that is a platform where you can be productive and iterate quickly.
What the Phoenix example shows is that without a huge loss of productivity up front, you can put off the rewrite much longer.
The important thing about Phoenix is that it uses regular old pg2 groups for pubsub, which can be distributed across nodes. Phoenix already does the sharding for you; you just add another node and the stuff just works.
When I was first playing with multi node Phoenix stuff I was amazed. It was too easy to scale horizontally.
I'm not sure what the distribution story is for Go...
I feel like frameworks like Meteor promote systematic use of websockets, and more and more developers don't even try to implement scalable stateless server code.
Sorry, but this is pointless. And I have a feeling that you could do better even with a scripting language like Perl, Python, or Ruby and a single-threaded event loop.
You are right about the evented approach, but the Go solution (and previously published Erlang/Elixir solutions) allow the developer to write sequential code instead of evented code.
Sequential code tends to be more maintainable, readable, etc. vs. evented code.
So while you are right about evented systems, this isn't pointless.
Yes, exactly. Which is why we switched from Go to Python (using twisted), and dropped our memory use substantially for holding open hundreds of thousands (and soon millions) of websocket connections.
M:N schedulers and the primitives they use for their 'lightweight' threading are always going to use more memory than a single-threaded event loop.
I believe Erlang can only block at the top level. This means it doesn't need to keep a full stack per thread around, only enough space for a single activation frame.
I'm guessing some kind of multiplexing of the sockets is being used in the Erlang case. There's no green thread for every connection.
I don't understand what the grandparent is saying though. Yeah, if you create a thread for every request, then you're going to be killed by the memory overhead. This isn't unique to Go, and I remember it being a problem when many noob Java programmers would spawn a thread per request before the introduction of java.nio.
The downside with Twisted, Tornado, or whatever in Python is that your code isn't parallelized. It's concurrent, yes, but you aren't taking advantage of your multiple cores without forking another Python process, due to the GIL.
Go, Scala, Java, etc. are truly multithreaded and compile to native code ahead of time or via a JIT. Saying your performance advantage is due to switching from Go to Python is a spurious claim. You weren't doing it right.
For most scaled up examples in Erlang and Elixir, there is a BEAM process (green thread) per connection as well as others (usually arranged as a supervised pool so one can avoid an acceptor bottleneck). There are a few reasons it does better but the biggest is a carefully tuned SMP scheduling mechanism and aggressive preemption policies. Some of these choices actually hurt throughput in favor of fairness and latency. All in the name of reliability over speed.
Yeah, compared to that, Go also does some extra locking, extra syscalls to wake up the event loop if channels are used, etc. So for this kind of load I would expect Go to be slower than any single-threaded event loop.
I know nothing about the implementation of the Go scheduler (or Go in general, really), but there is in principle no reason why it would require additional locking or syscalls compared to a plain event-based design. Is it just a limitation of the current implementation?
Oh $DEITY. That can't possibly be right. Are they serialising all FD creation through a single mutex?? Why can't they just close all sockets that need closing after fork?
Also, the FD mutex that needs to be taken before each read and write is nasty (is it just to prevent a race with close, or is it for something else?), but at least that won't usually require a syscall, and in sane applications it could be optimised with a Java-like biased lock.
For comparison, Chris McCord (the creator of the Phoenix framework) did some work on getting 2M concurrent connections with real-world(ish) data being broadcast across all nodes: http://www.phoenixframework.org/blog/the-road-to-2-million-w...