The headline can be misleading: these are not active real-world socket connections. According to the article, they are mostly idle, with pings every five minutes.
And if I recall correctly, the 2 million figure was just an artificial limit that they hit due to the configuration settings during the test, since it just "seemed like a really big number" at the time. In theory, it could probably go even higher.
For a counterpoint, here's a benchmark of a single server running Elixir/Phoenix and supporting 2M concurrent websockets, sending pings once per second (rather than every five minutes like this Go benchmark). The socket handling code is simpler, too.
It's actually kind of lousy to use that much memory and other resources for 10M nearly completely idle connections.
If the kernel is bypassed, I have reason to believe Go can handle the line rate of 16.7M packets/sec (on a 10 Gbps NIC). I would love to integrate Intel's DPDK with Go. I've done some groundwork on this before; it's very feasible to replace the Go network stack in a transparent way. If any company is interested in sponsoring that as an open-source project, my contact info is in my profile.
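To make "transparent" concrete: most Go code is written against the net.Listener / net.Conn interfaces, so a userspace stack could slot in underneath without callers changing. A rough sketch of that seam (the dpdkstack package in the comment below is hypothetical; it doesn't exist):

    package main

    import (
        "log"
        "net"
        "net/http"
    )

    // serve is written purely against the net.Listener interface, so it never
    // needs to know which TCP/IP stack produced the listener.
    func serve(ln net.Listener) error {
        return http.Serve(ln, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("pong\n"))
        }))
    }

    func main() {
        // Today: the kernel's stack via the standard library.
        ln, err := net.Listen("tcp", ":8080")
        if err != nil {
            log.Fatal(err)
        }
        // Hypothetically: ln = dpdkstack.Listen(":8080"), a DPDK-backed
        // implementation of net.Listener, with no changes to serve() or
        // anything layered above it.
        log.Fatal(serve(ln))
    }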
I wouldn't dare write my own. I'm not enough of an expert in TCP/IP stacks to know what I don't know that will hurt me. I was thinking of lifting the relatively simple TCP/IP stack from the Apache-licensed Seastar framework. Since that is also built for DPDK and sees production use, I know it should be easy to adapt and probably doesn't have serious flaws.
I'm actually strongly tempted to just glue Seastar and Go together. That's the lowest-risk approach.
I haven't gotten around to trying it myself yet, but OpenFastPath + OpenDataPlane supposedly has a userspace port of the FreeBSD stack that runs on DPDK through ODP-DPDK.
Yes, OpenOnload is an easy way to get a boost. It doesn't help you much in the commodity cloud, though. There are still a lot of advantages to managing your own hardware when it's feasible.
Sending a ping every 5 minutes at 10M connections was roughly the limit of what the server could handle. That works out to only about 33,000 pings per second (10M / 300 s), which is not a terribly high number.
In that benchmark, pings are actually 512-byte messages. So, MigratoryData published 200,000 messages/sec to 12 million clients at a total bandwidth of over 1 Gbps.
BTW, that benchmark result has recently been improved in terms of latency. The JVM garbage collection pauses were almost eliminated using the Zing JVM from Azul Systems, and the MigratoryData server was able to publish almost 1 Gbps to 10 million concurrent clients with an average end-to-end latency of 15 milliseconds, a 99th percentile of 24 milliseconds, and a maximum latency of 126 milliseconds (computed over almost 2 billion messages).
MigratoryData uses the kernel networking stack and it can be configured to run as a non-root user.
The Zing JVM is not a requirement for the MigratoryData server. MigratoryData ships with the Oracle JVM by default, and that is what is currently used in production. With the Oracle JVM we currently get quite decent garbage collection (GC) pauses for Internet applications, so latency is not impacted much by the GC pauses that occur from time to time.
The Zing JVM could be used in certain projects to reduce those GC pauses to near zero, so that latency is not impacted even in the worst case, when a GC occurs. If Zing is needed for such an ultra-low-latency application, it has to be licensed separately.
I think the point was more that 40 cores was a ceiling for the compute hardware, and it's very possible the actual hardware was 20 physical cores with hyperthreading, which would result in much lower performance than a true 40-core system.
> This is a 32-core machine with 208GB of memory. Sending a ping every 5 minutes at 10M connections was roughly the limit of what the server could handle
While the 10M figure is impressive, this doesn't sound practical or useful.
Why isn't it practical? Maybe not in the cloud, but there are some hosting companies that offer quite powerful machines in the same ballpark.
Hetzner, a German hosting company, has some really good root servers: for about 120 €/month you get a beast of a machine with 16 cores and 128 GB of RAM. If you add some basic load balancing, you can achieve 10M for a really low price. Add something like Docker and you can get a PaaS-like setup that handles millions of connections without breaking a sweat.
If you're going for C10M you can already forget about Docker. For performance you really must be as close to the hardware as possible, which possibly means tweaking the kernel.
The best example I know of squeezing as much as possible out of bare-metal servers would be Stack Overflow, which runs on something in the ballpark of 20 servers (excluding replicas), IIRC.
You do know that processes in a Docker container are regular processes, right? They just have a different namespace. They are as "close to the hardware" as all of your other processes. I'm not sure how you'd go about using your own TCP stack, but if you can do it with a normal process, you can do it inside a Docker container.
When working with Docker, it is possible to get down to the metal as you mentioned, usually by bypassing the Docker networking and using the host network directly, but then you lose a lot of the benefits of containers in the first place.
For example, a Weave or Calico networking layer on top (which adds a fair bit of latency if you're aiming for 10M connections) makes scaling containers quite easy.
I would imagine that if you plan on using Docker in your infrastructure seriously, you are aiming for a multi-host setup with many containers spread throughout, and can easily settle for 100k connections per container.
Besides, I wouldn't trust stuff with shared memory when it comes to that many concurrent connections.
I'd want something with more safety guarantees, either at compile time (a strong type system: Haskell, Rust) or via strong runtime fault tolerance (Erlang/Elixir and the other languages on the BEAM VM).
Can anyone expand on what is actually limiting the Linux kernel or the network card driver? If the server is only at 10% load and half the memory is free, it seems there might be either A) non-obvious tweaking needed or B) actual changes required to the system library design.
Probably a combination of context switching, memory bandwidth, and interrupt handling.
If you're going to use a machine with 208 GB of RAM you may as well buy a better network card. All major vendors of low latency adapters (which in this case means handling lots of small messages) have transparent drivers for Linux kernel bypass now. For example see OpenOnload. Then you avoid the kernel with no code changes.
Once you remove the kernel from the critical path, you should be concerned with I/O resources, which is already illustrated by the article's note about how they get 5x better performance by using four machines with the same total core count. What that does is buy you four times the I/O resources, like memory channels and PCIe lanes.
"It is similar to a push notification server like the iOS Apple Push Notification Service, but without the ability to store messages if the client is offline."
Not even close. Layer in X.509 authentication and flaky mobile networks, and now you're into tuning backoff while dealing with the CPU overhead of TLS negotiation.
Also of note, AWS's network used to flake out at around 400K TCP connections from the public network. That may no longer be the case, but it certainly was around 2012.
FYI, there's an unpublished limit on established TCP connections, bigger instances have bigger limits. It's hard to hit them with normal applications (you'll run out of memory from the TCP buffers and application state), but you can definitely hit them if you're doing weird things.
This is more about maintaining 10,000,000 open connections than about actually sending useful data to this many people or even allowing input from this many connections.
This is pretty useless, other than to make you realize you are doing something terribly wrong. Another good hint is the number of system constants you have to tweak.
If you have these requirements (no read, only write, zero connection state) you would be best served by doing userspace raw IO with something like DPDK instead of the huge machinery that the kernel spins up for every TCP connection.
It's useful if you want push notifications using the W3C Push API. That lets you get notifications without becoming a slave to Google Cloud Messaging [1] or Apple Push Notification Service [2]. This should now work in Firefox and Google Chrome browsers[3]. Microsoft and Apple aren't on board yet.
The extra processing required for websockets is so simple and well understood that we should just talk about raw TCP socket optimizations instead. With websocket processing you can do only a few tricks to further optimize a good implementation; on the TCP/IP layer, however, the optimization tricks are endless and much more complicated.
I see the quite high net.ipv4.tcp_mem setting of 100M max pages (assuming standard 4K pages, that's 381 GiB, more than the physical memory the machine actually has, which feels a bit unsafe: kernels can lock up badly when memory pressure gets severe, so it's better to start spewing errors reasonably early). But I don't see any tweaks to how much each individual connection uses. I wonder whether tcp_rmem/tcp_wmem tweaks could improve things; it doesn't seem completely unreasonable to assume that idle connections don't need the default (IIRC) 85 KiB read and 16 KiB write buffers.
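On the application side, Go does let you shrink the per-socket buffers if you want to experiment with that angle. A sketch with arbitrary sizes; note that (IIRC) setting SO_RCVBUF/SO_SNDBUF explicitly opts the socket out of the kernel's receive autotuning, so whether this actually saves memory on idle connections would need measuring:

    package main

    import (
        "log"
        "net"
    )

    func main() {
        ln, err := net.Listen("tcp", ":9000")
        if err != nil {
            log.Fatal(err)
        }
        for {
            conn, err := ln.Accept()
            if err != nil {
                continue
            }
            // Shrink the per-socket kernel buffers (SO_RCVBUF / SO_SNDBUF).
            // 4 KiB is an arbitrary illustration value; the kernel rounds it,
            // and explicit sizes disable receive autotuning for the socket.
            if tcp, ok := conn.(*net.TCPConn); ok {
                tcp.SetReadBuffer(4 << 10)
                tcp.SetWriteBuffer(4 << 10)
            }
            go func(c net.Conn) {
                defer c.Close()
                buf := make([]byte, 512)
                for {
                    if _, err := c.Read(buf); err != nil {
                        return
                    }
                }
            }(conn)
        }
    }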
This benchmark feels a little lazy. While I am in no doubt that Go could match the 2M concurrent websockets of Elixir's Phoenix framework, a benchmark should aim to implement the same functionality to be considered useful.
The only interesting thing here is that at that 10M-connection figure, each connection-handler goroutine contained at least two channels (sub and t), and they seemed to scale (no mention of latency, though).
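I haven't dug into the repo, but the shape I'd expect for that handler is roughly the following (a guess for illustration, not the benchmark's actual code): one goroutine per connection, selecting over its subscription channel and a ping ticker.

    // Package handler is a guessed-at reconstruction, not the benchmark's code.
    package handler

    import (
        "net"
        "time"
    )

    // Handle runs in its own goroutine, one per connection. sub delivers
    // published messages; t fires the periodic ping (every 5 minutes in the
    // article's setup).
    func Handle(conn net.Conn, sub <-chan []byte) {
        defer conn.Close()
        t := time.NewTicker(5 * time.Minute)
        defer t.Stop()
        for {
            select {
            case msg, ok := <-sub:
                if !ok {
                    return
                }
                if _, err := conn.Write(msg); err != nil {
                    return
                }
            case <-t.C:
                if _, err := conn.Write([]byte("ping")); err != nil {
                    return
                }
            }
        }
    }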
Yes, this is mostly a showcase of modern hardware and the Linux kernel.
However, Go's ability to do coroutines + multicore certainly makes it easier to take advantage of the system's full potential, and this ability is not terribly common in other languages (at least not in such an automatic way). Of course the same result could be achieved with any event-driven language in a less automatic way.
Go effectively uses a thread-per-connection model with goroutines (M:N threading rather than 1:1 threading, though). You won't be able to fully exploit the hardware that way.
It is, however, easier to use, but the context switching and stack overhead would be too great at anything close to 10M connections.
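For a rough sense of the stack-overhead part, it's easy to measure the parked-goroutine cost directly and compare it against what an event loop keeps per connection (typically just a small struct plus buffers). A back-of-the-envelope sketch; the number varies by Go version, since initial goroutine stacks are a few KB:

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        const n = 100000
        block := make(chan struct{})

        var before, after runtime.MemStats
        runtime.GC()
        runtime.ReadMemStats(&before)

        // Spawn parked goroutines, standing in for idle connection handlers.
        for i := 0; i < n; i++ {
            go func() { <-block }()
        }

        runtime.GC()
        runtime.ReadMemStats(&after)
        fmt.Printf("~%d bytes per parked goroutine\n", (after.Sys-before.Sys)/n)

        close(block)
    }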
One of the most important benefits of functional programming is that the compiler can figure out which operations can run in parallel and which can't. In other scenarios, the programmer has to explicitly decide these things herself, so you end up with extra bugs (race conditions, for example).
The language doesn't matter, as long as the compiler can assume purely functional code and is written to take advantage of it.
Rust kind of succeeds in having it both ways -- the programmer decides what runs in parallel, but the compiler won't allow race conditions.
I can't imagine any better scenario than just writing code that can be automatically parallelized, though.
So far as I know, all attempts at automatic parallelization have been extremely disappointing. Even when dealing with Haskell or similarly functional code, you don't get much out of it. See for instance: http://stackoverflow.com/questions/15005670/why-is-there-no-... I've seen a lot of other research I can't seem to dig up right now suggesting that analyzing programs "in the wild" tends to find optimal speedups of less than 2x across whole programs in general.
It's a pity, really. But unfortunately, I wouldn't promise anybody anything about automatic parallelization, because even in lab conditions nobody has gotten it to work very well, so far as I know. It's a weird case: my intuition, like most other people's, says there ought to be a lot of opportunity for it, but the evidence contradicts that.
Anyone who's got a solid link to contradictory evidence, I'd love to hear about it, but my impression is that research has mostly stopped on this topic, because that's how poorly it has gone.
The interesting thing is...how long before you have 10M users?
I'm very much a fan of this sort of work--it's obviously useful for things like data collection from sensors--but I can't help but wonder if it has any practical business value for early-to-mid stage startups.
We sort of take it for granted that you will have to rebuild your architecture once you hit that point. What you gain for that is a platform where you can be productive and iterate quickly.
What the Phoenix example shows is that without a huge loss of productivity up front, you can put off the rewrite much longer.
The important thing about Phoenix is that it uses regular old pg2 groups for pubsub, which can be distributed across nodes. Phoenix already does the sharding for you; you just add another node and the stuff just works.
When I was first playing with multi node Phoenix stuff I was amazed. It was too easy to scale horizontally.
I'm not sure what the distribution story is for Go...
I feel like frameworks like Meteor promote systematic use of websockets, and more and more developers don't even try to implement scalable stateless server code.
Sorry, but this is pointless. And I have a feeling that you could do better even with a scripting language like Perl, Python, or Ruby and a single-threaded event loop.
You are right about the evented approach, but the Go solution (and previously published Erlang/Elixir solutions) allow the developer to write sequential code instead of evented code.
Sequential code tends to be more maintainable, readable, etc. vs. evented code.
So while you are right about evented systems, this isn't pointless.
Yes, exactly. Which is why we switched from Go to Python (using twisted), and dropped our memory use substantially for holding open hundreds of thousands (and soon millions) of websocket connections.
M:N schedulers and the primitives they use for their 'lightweight' threading are always going to use more memory than a single-threaded event loop.
I believe Erlang can only block at the top level. This means it doesn't need to keep a full stack per thread around, only enough space for a single activation frame.
I'm guessing some kind of multiplexing of the sockets is being used in the Erlang case. There's no green thread for every connection.
I don't understand what the grandparent is saying though. Yeah, if you create a thread for every request, then you're going to be killed by the memory overhead. This isn't unique to Go, and I remember it being a problem when many noob Java programmers would spawn a thread per request before the introduction of java.nio.
The downside with Twisted, Tornado, or whatever in Python is that your code isn't parallelized. It's concurrent, yes, but you aren't taking advantage of your multiple cores without forking another Python process, due to the GIL.
Go, Scala, Java, etc. are truly multithreaded and compile to native code ahead of time or via a JIT. Saying your performance advantage is due to switching from Go to Python is a spurious claim. You weren't doing it right.
For most scaled up examples in Erlang and Elixir, there is a BEAM process (green thread) per connection as well as others (usually arranged as a supervised pool so one can avoid an acceptor bottleneck). There are a few reasons it does better but the biggest is a carefully tuned SMP scheduling mechanism and aggressive preemption policies. Some of these choices actually hurt throughput in favor of fairness and latency. All in the name of reliability over speed.
Yeah, compared to that, Go also does some extra locking, extra syscalls to wake up the event loop if channels are used, etc. So for this kind of load I would expect Go to be slower than any single-threaded event loop.
I know nothing about the implementation of the Go scheduler (or Go in general, really), but there is in principle no reason why it would require additional locking or syscalls compared to a plain event-based design. Is it just a limitation of the current implementation?
Oh $DEITY. That can't possibly be right. Are they serialising all FD creation through a single mutex?? Why can't they just close all sockets that need closing after fork?
Also, the FD mutex that needs to be taken before each read and write is nasty (is it just to prevent a race with close, or is it for something else?), but at least that won't usually require a syscall, and in sane applications it could be optimised with a Java-like biased lock.
For comparison, Chris McCord (the creator of the Phoenix framework) did some work on getting 2M concurrent connections with real-world(ish) data being broadcast across all nodes: http://www.phoenixframework.org/blog/the-road-to-2-million-w...