10M Concurrent Websockets (goroutines.com)
265 points by 0x4542 on March 19, 2016 | 80 comments



The headline can be misleading: these are not active real-world socket connections. According to the article, they are mostly idle, with pings every five minutes.

For comparison, Chris McCord (the creator of the Phoenix framework) did some work on getting 2M concurrent connections with real-world(ish) data being broadcast across all nodes: http://www.phoenixframework.org/blog/the-road-to-2-million-w...


And if I recall correctly, the 2 million figure was just an artificial limit that they hit due to the configuration settings during the test, since it just "seemed like a really big number" at the time. In theory, it could probably go even higher.


For a counterpoint, here's a benchmark of a single server running Elixir/Phoenix and supporting 2M concurrent websockets, sending pings once per second (rather than every five minutes like this Go benchmark). The socket handling code is simpler, too.

http://www.phoenixframework.org/blog/the-road-to-2-million-w...


Where is the ping once per second part? All I could find was this commit removing pings: https://github.com/Gazler/phoenix_chat_example/commit/f809b2...


Toward the bottom of the post: "This let us to reach 2M subscribers without timeouts and maintain 1s broadcasts."


That sounds like 1s latency. Is it producing a sustained rate of 2Mpps?


This is actually kind of lousy: using that much memory and other resources for 10M nearly completely idle connections.

If the kernel is bypassed, I have reason to believe Go can handle the line rate of 16.7M packets/sec (on a 10 Gbps NIC). I would love to integrate Intel's DPDK with Go. I've done some groundwork on this before - it's very feasible to just replace the Go network stack in a transparent way. If any company is interested in sponsoring that as an open-source project, my contact info is in my profile.


As I'm sure you are well aware, there is quite a large gap between the DPDK and a working TCP stack.

Do you plan to write the TCP stack from scratch? If not, are you planning to leverage one of the golang unikernel back-end ports?

10Gb/s (even per-core) doesn't seem like much of a selling point for a network stack on its own, given where the rest of the industry is.


I wouldn't dare write my own. I'm not expert enough in TCP/IP stacks to know what I don't know that will hurt me. I was thinking of lifting the relatively simple TCP/IP stack from the Apache-licensed Seastar framework. Since that is also built for DPDK and sees production use, I expect it should be easy to adapt and probably doesn't have serious flaws.

I'm actually strongly tempted to just glue Seastar and Go together. That's the lowest-risk approach.


I haven't gotten around to trying it myself yet, but OpenFastpath+OpenDataplane supposedly has a userspace port of the FreeBSD stack that runs on DPDK through ODP-DPDK.

http://www.openfastpath.org/

http://www.opendataplane.org/


Thanks for the pointer to Seastar, I wasn't aware of that project.

Good luck with your work - I'm curious to see the result.


You can easily throw some openonload/vma at it before going full DPDK


Yes openonload is an easy way to get a boost. It doesn't help you much in the commodity cloud though. There are still a lot of advantages to managing your own hardware when it's feasible.


Key quote:

Sending a ping every 5 minutes at 10M connections was roughly the limit of what the server could handle. This ends up being only about 30k pings per second, which is not a terribly high number.
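
(For the arithmetic: 10,000,000 connections / 300 seconds ≈ 33,000 pings per second.)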


Their figures are quite a bit worse than MigratoryData's commercial WebSocket server software written in Java:

12M sockets, 200k pings/sec, on a 12-core server using 54 GB of memory at 57% CPU

https://mrotaru.wordpress.com/2013/06/20/12-million-concurre...


Thanks for the mention, Tim.

In that benchmark, pings are actually 512-byte messages. So, MigratoryData published 200,000 messages/sec to 12 million clients at a total bandwidth of over 1 Gbps.
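
(For scale: 200,000 messages/sec × 512 bytes × 8 bits ≈ 0.82 Gbps of payload, so over 1 Gbps once WebSocket framing and TCP/IP overhead are included.)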

BTW, that benchmark result has recently been improved in terms of latency. The JVM garbage collection pauses were almost eliminated using the Zing JVM from Azul Systems, and the MigratoryData server was able to publish almost 1 Gbps to 10 million concurrent clients with an average end-to-end latency of 15 milliseconds, a 99th percentile of 24 milliseconds, and a maximum latency of 126 milliseconds (computed over almost 2 billion messages):

https://t.co/MVjLDLWS3U

Mihai


Does MigratoryData use the kernel networking stack or bypass it? Also is the Zing JVM included or do you have to license that separately?


MigratoryData uses the kernel networking stack and it can be configured to run as a non-root user.

Zing JVM is not a requirement for the MigratoryData server. MigratoryData ships with the Oracle JVM by default, which is what is currently used in production. With the Oracle JVM we currently get quite decent garbage collection (GC) pauses for Internet applications, so latency is not impacted much by the GC pauses that occur from time to time.

The Zing JVM can be used in certain projects to reduce those GC pauses to near zero, so latency is not impacted even in the worst case, when a GC occurs. So, if the Zing JVM is needed for such an ultra-low-latency application, it has to be licensed separately.


Elixir & Phoenix Framework: 2M concurrent connections on a 40core/128GB box, with a real-time chat example: http://www.phoenixframework.org/blog/the-road-to-2-million-w...


It might even be a 20-core processor, because htop actually shows threads, and most processors nowadays run 2 threads per core.

For example, my R710 shows 16 threads while it only has 2x4 real cores.


It's not as simple as 2 threads per core; the performance gain from hyper-threading depends on the application you're running, AFAIK.


I think the point was more that 40 cores was stated as the hardware ceiling, and it's very possible the actual hardware was 20 cores with hyper-threading, which would result in much lower perf than a true 40-core system.


> This is a 32-core machine with 208GB of memory. Sending a ping every 5 minutes at 10M connections was roughly the limit of what the server could handle

While the 10M figure is impressive, this doesn't sound practical or useful.


Yeah, the original C10M formulation (from 2013) specified 8 cores, 64 GB RAM, 10 Gbps Ethernet, and an SSD for:

- 10 million concurrent connections

- 10 gigabits/second

- 10 million packets/second

- 10 microsecond latency

- 10 microsecond jitter

- 1 million connections/second

However it was for raw connections, not for websockets.


Why isn't it practical? Maybe not in the cloud, but there are some hosting companies that offer quite powerful machines in the same ballpark.

Hetzner, a German hosting company, has some really good root servers: for about 120€/month you get a beast of a machine with 16 cores and 128 GB of RAM. If you add some basic load balancing, you can achieve 10M for a really low price. Add something like Docker and you can get a PaaS-like setup that handles millions of connections without breaking a sweat.


If you're going for C10M you can already forget about Docker. For performance you really must be as close to the hardware as possible, which possibly means tweaking the kernel.

The best example I know of squeezing as much as possible out of bare-metal servers would be Stack Overflow, which runs on something in the ballpark of 20 servers (excluding replicas), IIRC.


You do know that processes in a Docker container are regular processes, right? They just have a different namespace. They are as "close to the hardware" as all of your other processes. I'm not sure how you'd go about using your own TCP stack, but if you can do it with a normal process you can do it inside a Docker container.


When working with Docker, it is possible to get down to the metal as you mentioned.

Usually by bypassing the docker networking and using Host only network, but then you lose a lot of the benefits of containers in the first place.
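
Something like this, where the image name is just a placeholder:

    docker run --net=host my-websocket-server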

For example, a weave or calico networking layer on top (which adds a fair bit of latency if you're aiming for 10M connections) makes scaling containers quite easy.

I would imagine that if you plan on using Docker seriously in your infrastructure, you are aiming for a multi-host setup with many containers spread throughout, and can settle for 100k connections per container easily.


Because 1 packet every 5 minutes only comes out to about 30k packets per second, which is actually pretty small.


> This is a 32-core machine with 208GB of memory.

If we're talking about these kind of specs, something tells me Erlang would do even better.


Besides, I wouldn't trust stuff with shared memory when it comes to that many concurrent connections.

I'd want it to be something with more safety guarantees -- either at compile time (a strong type system: Haskell, Rust) or via strong runtime fault tolerance -- Erlang/Elixir and other languages on the BEAM VM.


Can anyone expand on what the actual limit is in the Linux kernel or network card driver? If the server is only at 10% load and half the memory is free, it seems there might be either A) non-obvious tweaking needed or B) actual changes to the system library design.


Probably a combination of context switching, memory bandwidth, and interrupt handling.

If you're going to use a machine with 208 GB of RAM you may as well buy a better network card. All major vendors of low latency adapters (which in this case means handling lots of small messages) have transparent drivers for Linux kernel bypass now. For example see OpenOnload. Then you avoid the kernel with no code changes.
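
With OpenOnload that typically means wrapping the server in the onload launcher (the binary name here is just a placeholder):

    onload ./websocket-server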

Once you remove the kernel from the critical path, you should be concerned with I/O resources, which is already illustrated by the article's note about how they get 5x better performance by using four machines with the same total core count. What that does is buy you four times the I/O resources, like memory channels and PCIe lanes.


"It is similar to a push notification server like the iOS Apple Push Notification Service, but without the ability to store messages if the client is offline."

Not even close. Layer in X509 authentication and flaky mobile networks and now you're into tuning backoff while dealing with the CPU overhead of the TLS negotiation.

Also of note, AWS' network used to flake out at around 400K TCP connections from the public network. That may no longer be the case, but it certainly was ~2012.


FYI, there's an unpublished limit on established TCP connections; bigger instances have bigger limits. It's hard to hit them with normal applications (you'll run out of memory from the TCP buffers and application state), but you can definitely hit them if you're doing weird things.


This is more about maintaining 10,000,000 open connections than about actually sending useful data to this many people or even allowing input from this many connections.


This is pretty useless other than to make you realize you are doing something terribly wrong. Another good hint is the number of system constants you have to tweak.

If you have these requirements (no read, only write, zero connection state) you would be best served by doing userspace raw IO with something like DPDK instead of the huge machinery that the kernel spins up for every TCP connection.


It's useful if you want push notifications using the W3C Push API. That lets you get notifications without becoming a slave to Google Cloud Messaging [1] or Apple Push Notification Service [2]. This should now work in Firefox and Google Chrome browsers[3]. Microsoft and Apple aren't on board yet.

[1] https://developers.google.com/web/updates/2015/03/push-notif... [2] https://www.raywenderlich.com/32960/apple-push-notification-... [3] http://caniuse.com/#feat=push-api


So if apple isn't on board and it's just for some web browsers.... It isn't really an option at all :/


The extra processing required for websockets is so simple and well understood that we should really just talk about raw TCP socket optimizations instead. With websocket processing there are only a few tricks to further optimize a good implementation; on the TCP/IP layer, however, the optimization tricks can be endless and much more complicated.


I see the quite high net.ipv4.tcp_mem setting of 100M max pages (assuming standard 4K pages, that's 381 GiB, even more than the physical memory the machine actually has, which feels a bit unsafe; I think kernels can sometimes lock up badly when memory-pressured, so it's better to start returning errors reasonably early), but I don't see any tweaks to how much each individual connection uses. I wonder if tcp_rmem/tcp_wmem tweaks could improve performance. I'm not sure, but I think it's not completely unreasonable to assume that idle connections don't need the default (IIRC) 85 KiB read and 16 KiB write buffers.
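
For reference, a sketch of the kind of per-socket buffer tuning being suggested; the values are illustrative, not something tested at 10M connections:

    # min / default / max receive buffer per socket, in bytes
    net.ipv4.tcp_rmem = 4096 4096 87380
    # min / default / max send buffer per socket, in bytes
    net.ipv4.tcp_wmem = 4096 4096 65536
    # low / pressure / high thresholds for total TCP memory, in 4K pages
    net.ipv4.tcp_mem = 786432 1048576 1572864

Lowering the middle (default) value is what shrinks the footprint of idle connections; the max still lets busy sockets grow their buffers.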


This benchmark feels a little lazy. While I have no doubt that Go could match the 2M concurrent websockets of Elixir's Phoenix framework, a benchmark should aim to implement the same functionality to be considered useful.

The only interesting thing here is that at the 10M connection figure, each connection handler goroutine contained at least 2 channels (sub and t), and they still seemed to scale (no mention of latency, though).


Seems like a latency of 5 minutes since he mentioned the server was about at its limits.


Seems inefficient to create a new timer per connection, no?

t := time.NewTicker(pingPeriod)

Maybe if you hold all of the connections in one array and create one timer for the ping, things would be faster.
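
A rough sketch of that shared-ticker idea, assuming the gorilla/websocket package (registry and error handling simplified):

    package main

    import (
        "sync"
        "time"

        "github.com/gorilla/websocket"
    )

    var (
        mu    sync.Mutex
        conns = map[*websocket.Conn]struct{}{} // every open connection
    )

    // pingAll pings every registered connection off one shared ticker,
    // instead of running a time.Ticker per connection goroutine.
    func pingAll(pingPeriod time.Duration) {
        t := time.NewTicker(pingPeriod)
        defer t.Stop()
        for range t.C {
            mu.Lock()
            for c := range conns {
                // WriteControl is documented as safe to call concurrently
                // with other writes; a real server would also drop conns
                // that return an error here.
                c.WriteControl(websocket.PingMessage, nil, time.Now().Add(10*time.Second))
            }
            mu.Unlock()
        }
    }

    func main() {
        go pingAll(5 * time.Minute)
        select {} // block forever; connections would be registered elsewhere
    }

Whether that actually wins is an open question, though: one sweep over 10M connections under a single lock has its own latency problems, so you'd probably shard the registry.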

Curious how this compares to netty...


> After the websocket connection has been setup, the server never reads any data from the connection, it only writes messages to the client.

yeah not exactly difficult


Back in my day we were still trying to solve the C10K problem. You darn kids don't know how good you got it.

Now git off of my lawn, I need to take a nap.


The sysctl config is helpful for any high concurrency app e.g. node.


You can tell this is web scale because of how complex the example is.


Yes, this is mostly a showcase of modern hardware and the Linux kernel.

However, Go's ability to do coroutines + multicore certainly makes it easier to take advantage of the system's full potential, and this ability is not terribly common in other languages (at least not in such an automatic way). Of course the same result could be achieved with any event-driven language in a less automatic way.
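
For example, the per-connection code in this style ends up looking sequential even though the scheduler is multiplexing millions of goroutines onto a handful of cores. A minimal sketch, assuming gorilla/websocket (imports as in the sketch above) and not the article's exact code:

    // handle runs as one goroutine per websocket connection: it blocks on
    // whichever event arrives first and writes sequentially, while the Go
    // runtime multiplexes all such goroutines onto the available cores.
    func handle(conn *websocket.Conn, sub <-chan []byte, pingPeriod time.Duration) {
        t := time.NewTicker(pingPeriod)
        defer t.Stop()
        defer conn.Close()
        for {
            select {
            case msg := <-sub: // a message published to this client
                if err := conn.WriteMessage(websocket.TextMessage, msg); err != nil {
                    return
                }
            case <-t.C: // periodic keepalive ping
                if err := conn.WriteMessage(websocket.PingMessage, nil); err != nil {
                    return
                }
            }
        }
    }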


Go effectively uses a thread-per-connection model with goroutines (M:N threading instead of 1:1 threading, though). You won't be able to fully exploit the hardware that way.

It is, however, easier to use, but the context switching and stack overhead would be too great at anything close to 10M connections.
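
Back-of-the-envelope, assuming Go's ~2 KB initial goroutine stack (post-1.4):

    10,000,000 goroutines × 2 KB stack ≈ 20 GB

and that's before socket buffers or per-connection channel state, so it's a meaningful slice of even a 208 GB machine.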


> this ability is not terribly common in other languages

Arguably any functional language does this even more automatically than Go does.


Any functional language? Certainly Erlang works this way, but I'm less sure about others.


One of the most important benefits of functional programming is that the compiler can figure out which operations can run in parallel and which can't. In other scenarios, the programmer has to explicitly decide these things herself, so you end up with extra bugs (race conditions, for example).

The language doesn't matter, as long as its compiler can expect purely functional code and the compiler is written to take advantage of that code.

Rust kind of succeeds in having it both ways -- the programmer decides what runs in parallel, but the compiler won't allow race conditions.

I can't imagine any better scenario than just writing code that can be automatically parallelized, though.


So far as I know, all attempts at automatic parallelization have been extremely disappointing. Even when dealing with Haskell code or other similarly functional code, you don't get much out of it. See for instance: http://stackoverflow.com/questions/15005670/why-is-there-no-... I've seen a lot of other research I can't seem to dig up right now suggesting that analyzing programs "in the wild" tends to produce optimal speedups of less than even 2x across whole programs in general.

I'm not holding out much hope for automatic parallelization to save us, or even help all that much. The only remaining hope is that a language written from the beginning to afford more automatically-parallel programs will help, but those efforts (Fortress most notably) also have not gone well. https://en.wikipedia.org/wiki/Fortress_%28programming_langua... , http://web.cs.ucla.edu/~palsberg/course/cs239/F07/slides/for... starting page 33

It's a pity, really. But unfortunately, I wouldn't be promising anybody anything about automatic parallelization, because even in lab conditions nobody's gotten it to work very well, so far as I know. It's a weird case, because my intuition, like most other people's, agrees that there ought to be a lot of opportunity for it, but the evidence is contradicting that.

Anyone who's got a solid link to contradictory evidence, I'd love to hear about it, but my impression is that research has mostly stopped on this topic, because that's how poorly it has gone.


Haskell, Ocaml and the like tend to have strong preemptive scheduling using a "green threads" approach, which will give similar scalability.



The interesting thing is...how long before you have 10M users?

I'm very much a fan of this sort of work--it's obviously useful for things like data collection from sensors--but I can't help but wonder if it has any practical business value for early-to-mid stage startups.


Others have cited the Phoenix example of 2M.

We sort of take it for granted that you will have to rebuild your architecture once you hit that point. What you gain for that is a platform where you can be productive and iterate quickly.

What the Phoenix example shows is that without a huge loss of productivity up front, you can put off the rewrite much longer.

Presumably, Go would provide similar benefits.


The important thing about Phoenix is that it uses regular old pg2 groups for pubsub that can be distributed across nodes. Phoenix already does the sharding for you: you just add another node and the thing just works.

When I was first playing with multi node Phoenix stuff I was amazed. It was too easy to scale horizontally.

I'm not sure what the distribution story is for Go...


The distribution story in this article is "publish via Redis". I much prefer Phoenix's approach.
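
For the curious, the publish-via-Redis pattern in Go looks roughly like this; a sketch assuming the go-redis client, which may not be what the article actually uses:

    package main

    import (
        "context"

        "github.com/redis/go-redis/v9"
    )

    func main() {
        ctx := context.Background()
        rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

        // Each frontend node subscribes to the shared channel; whatever
        // arrives gets fanned out to that node's local websocket conns.
        sub := rdb.Subscribe(ctx, "broadcast")
        defer sub.Close()

        for msg := range sub.Channel() {
            _ = msg.Payload // hand off to the local connection goroutines
        }
    }

    // Any node publishes with rdb.Publish(ctx, "broadcast", payload) and
    // Redis delivers a copy to every subscribed node.

The contrast with Phoenix's pg2-based pubsub is that Redis here is an extra moving part you have to run and scale separately.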


I guess if you're writing anything with messaging, you may ponder how to make it work if you get millions of users, even if the probability is low.


Most web devs think they need websockets but they really don't.


Care to explain?


I feel like frameworks like Meteor promote systematic use of websockets, and more and more developers don't even try to implement scalable stateless server code.


adoption of websockets in my experience is driven more by increasing mobile usage.


How long before companies wake up and realize that hiring people who can do the same job with only C1000 is a better investment?


    apt-get install -y redis-server
    echo "bind *" >> /etc/redis/redis.conf
/shiver....

And your new Redis server joins all of the other unprotected ones sitting directly on the net with no real security.
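
At a minimum you'd want something along these lines in redis.conf instead (addresses illustrative):

    # bind only to loopback and the private interface the app servers use
    bind 127.0.0.1 10.0.0.5
    # require a password from every client
    requirepass use-a-long-random-secret-here

plus a firewall rule so only the app hosts can reach port 6379.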


Sorry, but this is pointless. And I have a feeling that you could do better even with a scripting language like Perl, Python, Ruby and a single-threaded event loop.


You are right about the evented approach, but the Go solution (and previously published Erlang/Elixir solutions) allow the developer to write sequential code instead of evented code.

Sequential code tends to be more maintainable, readable, etc. vs. evented code.

So while you are right about evented systems, this isn't pointless.


> Sequential code tends to be more maintainable, readable, etc. vs. evented code.

This is simply not true, I'm sorry.

As for pointless - the whole thing shows how Go is useless at 10M connections and nothing more.


Yes, exactly. Which is why we switched from Go to Python (using twisted), and dropped our memory use substantially for holding open hundreds of thousands (and soon millions) of websocket connections.

M:N schedulers and the primitives they use for their 'lightweight' threading are always going to use more memory than a single-threaded event loop.


Why doesn't Elixir/Erlang have this problem too?

It's using green threads as well, with 1 "process" per connection, right? Is it not the same kind of scheduling?


I believe erlang can only block at the top level. This means it doesn't need to keep a stack per thread around but only enough space for a single activation frame.


I'm guessing some kind of multiplexing of the sockets is being used in the Erlang case. There's no green thread for every connection.

I don't understand what the grandparent is saying though. Yeah, if you create a thread for every request, then you're going to be killed by the memory overhead. This isn't unique to Go, and I remember it being a problem when many noob Java programmers would spawn a thread per request before the introduction of java.nio.

The downside with Twisted, Tornado, or whatever in Python is that your code isn't parallelized. It's concurrent, yes, but you aren't taking advantage of your multiple cores without forking another Python process, due to the GIL.

Go, Scala, Java, etc. are truly multi-threaded and compile to native code, either ahead of time or via a JIT. Saying your performance advantage is due to switching from Go to Python is a spurious claim. You weren't doing it right.


For most scaled up examples in Erlang and Elixir, there is a BEAM process (green thread) per connection as well as others (usually arranged as a supervised pool so one can avoid an acceptor bottleneck). There are a few reasons it does better but the biggest is a carefully tuned SMP scheduling mechanism and aggressive preemption policies. Some of these choices actually hurt throughput in favor of fairness and latency. All in the name of reliability over speed.


That's helpful. Thanks.


Yeah, compared to that Go also does some extra locking, extra syscalls to wake up the event loop if channels are used, etc. So for this kind of load I would expect Go to be slower than any single-threaded event loop.


I know nothing about the implementation of Go scheduler (or go in general, really), but there is in principle no reason why it would require additional locking or syscalls compared to a plain event based design. Is it just a limitation of the current implementation?


It is both, the implementation and all of the ideas behind it.

These should give you some idea on what is going on there:

https://golang.org/src/net/fd_mutex.go

https://golang.org/src/net/fd_unix.go#L237

https://golang.org/src/syscall/exec_unix.go#L17


Oh $DEITY. That can't possibly be right. Are they serialising all FD creation through a single mutex?? Why can't they just close all sockets that need closing after fork?

Also, the FD mutex that needs to be taken before each read and write is nasty (is it just to prevent a race with close, or is it for something else?), but at least that won't usually require any syscall and in sane applications could be optimised with a Java-like biased lock.



