Hacker News new | past | comments | ask | show | jobs | submit login
600k concurrent websocket connections on AWS using Node.js (2015) (jayway.com)
357 points by dgelks on Oct 11, 2019 | hide | past | favorite | 137 comments

A few pieces of advice based on running https://github.com/mozilla-services/autopush-rs, which handles tens of millions of concurrent connections across a fleet of small EC2 instances.

1) Consider not running the largest instance you need to handle your workload, but instead distributing it across smaller instances. This allows for progressive rollout to test new versions, reduces the thundering herd when you restart or replace an instance, etc.

2) Don't set up security group rules that limit what addresses can connect to your websocket port. As soon as you do that connection tracking kicks in and you'll hit undocumented hard limits on the number of established connections to an instance. These limits vary based on the instance size and can easily become your bottleneck.

3) Beware of ELBs. Under the hood an ELB is made of multiple load balancers and is supposed to scale out when those load balancers hit capacity. A single load balancer can only handle a certain number of concurrent connections. In my experience ELBs don't automatically scale our when that limit is reached. You need AWS support to manually do that for you. At a certain traffic level, expect support to tell you to create multiple ELBs and distribute traffic across them yourself. ALBs or NLBs may handle this better; I'm not sure. If possible design your system to distribute connections itself instead of requiring a load balancer.

2 and 3 are frustrating because they happen at a layer of EC2 that you have little visibility into. The best way to avoid problems is to test everything at the expected real user load. In our case, when we were planning a change that would dramatically increase the number of clients connecting and doing real work, we first used our experimentation system to have a set of clients establish a dummy connection, then gradually ramped up that number of clients in the experiment as we worked through issues.

If you have a lot of trafic you shoudn't use an ELB in the first place, should be using an NLB by default it comes with a 40Gb pipe and multi millions connections.


We did not use an NLB because it is expensive in its pricing model for TLS connections (and we prefer to let AWS terminate TLS). For 20 million connections, an NLB would cost $28,800/month. It's drastically cheaper to use an ELB, provided you can talk to AWS to have them scale it for you.

When I last discussed this with AWS, having them scale it for you meant that you actually had to contact them to pre-warm the ELB whenever you expected more traffic. It didn't work for our use cases where we didn't know when to expect more traffic.

Yup, that hasn't changed in our experience. If you're looking at bursty traffic beyond what AWS will pre-warm for you, then this likely won't work. We have fairly consistent loads throughout the day.

Back in the day, in my first job, I created load balancing using LVS. It was not easy to set up, compared to AWS, but it worked, and it was completely free, plus some man-hours up front. I wonder if it would be worth it nowadays...

Is it drastically cheaper to use an NLB without letting AWS terminate TLS? I thought AWS was expensive for TLS termination, generally, unless you need the client locality for lower latency.

Yes, if you terminate TLS its drastically cheaper. Without TLS, each NLB unit lets you have 100,000 concurrent connections. With TLS, a single unit is 3,000 connections. Similar pricing occurs with ALB's.

Terminating TLS yourself incurs some CPU cost and a bit more memory cost. How much CPU/memory is eaten depends on the efficiency of your code. Our Rust implementation roughly matches C code efficiency, so we could handle terminating TLS ourselves if ELB stops being feasible at some point.

You can also employ DNS round robin with health checks in Route53

ELB is pretty old and should be avoided tbh, use ALB and if you can't should switch to NLB ( for pure tcp LB ).

Same issue as NLB, they charge for units based on concurrent connections and it gets very expensive very quickly.

DNS round robin has the disadvantage of not guaranteeing that the traffic will be equally distributed. This can be a problem if there's a lot of automated traffic.

But, with that said, in term of price, it's unbeatable.

I have had success DNS round robin-ing a cluster of HAproxy LBs which handle SSL and then proxy to my backend websocket gateways which keeps the real load evenly distributed where it matters and also run health checks on the nodes.

The point is at scale you can't guarantee equal but something around there

That works well, but if you do ecommerce or a web app for which people write a lot of bots, then the traffic won't necessarily be that well distributed (mostly because users while run bots out of dedicated servers or such and will tend to only query dns once).

A bit of an edge case but pretty much the only issue I've had so far which is why I bring it up.

That's the first tool I grab for balancing _a lot_ of connections.

> 1) Consider not running the largest instance you need to handle your workload, but instead distributing it across smaller instances. This allows for progressive rollout to test new versions, reduces the thundering herd when you restart or replace an instance, etc.

I came here to say this. Horizontal compute is a miracle.

Distribution itself might be a world of pain depending on the nature of your application. For example stateful apps where low latency is expected.

How meaningful is the benchmark for handling idle connections? Handling more is better, of course, but if the server melts down when 0.1% of them have any activity, maybe maximizing idle connections isn't the right place to expend optimization effort?

For us, the experiment (https://blog.mozilla.org/services/2018/10/03/upcoming-push-s...) with mostly idle clients was very helpful, since it flushed out problems with out load balancing layer and in the end gave us confidence that it, our app, and our persistence layer could safely handle many more connections.

If we had begun having problems once we started sending more push messages, we would have simply stopped using the new service (https://github.com/mozilla-services/megaphone) responsible for that until we worked through them.

Depends on your workload

Could you explain or post a link to something about #2? I’ve never heard of that before!

I can not. AWS, for whatever reason, does not want this publicly documented. I would write up what we learned in our testing in more detail, but I have been asked not to.

You can find some discussion of this behavior in places like https://forums.aws.amazon.com/thread.jspa?threadID=231806. I originally became aware of the issue, before hitting in production, from the HN comment at https://news.ycombinator.com/item?id=18314138

I'm glad you found my comment helpful!

I myself found out about to avoid the connection tracking in this thread https://news.ycombinator.com/item?id=15724072

It's super frustrating that Amazon doesn't document this in a more approachable manner.

Their response displeases me.

This is what you need to know: "Not all flows of traffic are tracked. If a security group rule permits TCP or UDP flows for all traffic ( and there is a corresponding rule in the other direction that permits all response traffic ( for all ports (0-65535), then that flow of traffic is not tracked. The response traffic is therefore allowed to flow based on the inbound or outbound rule that permits the response traffic, and not on tracking information."

Very interesting. It makes a lot of sense from a Linux perspective. The conntrack table requires memory, so there is some physical limit on its size. That also explains why it scales per instance type.

This is also why "endless scaling" is a fud. At some point or some level, there is either a physical, soft, standard or some kind of limit

I can corroborate this, at Verizon we had problems scaling out ELBs for a very similar use case.

I would also be interested in this. First time I hear about such a limitation.

I recently played around with Athena for load balancer logs, and with sql it is easy to cross reference many connection entries from the load balancer. Do you believe this could help in getting more visibility to spot bottleneck problems stated in 2 and 3?

Not really, because there won't be logs for connections that were never established.

How are you keeping track of those?

As I recall, when we hit the ELBs connection limits the failed connections were reflected in the SpilloverCount metric.

I believe we only hit the security group limits in load testing. There, we could see connections fail to an instance when that instance's established connections as reported by netstat or similar tools hit a certain threshold.

Some dude in 2012 did 1 million active connections with whatever node.js and v8 versions we had at that time :) http://blog.caustik.com/2012/08/19/node-js-w1m-concurrent-co...

Edit: I wasn't aware but there was some build-up to it: - 100k connections: http://blog.caustik.com/2012/04/08/scaling-node-js-to-100k-c... - 250k connections: http://blog.caustik.com/2012/04/10/node-js-w250k-concurrent-...

Yeah, we used nginx-push-stream-module[1] to support 1M connections with lower boxes. Websocket-as-a-service. Really cool module. Was a realtime project for a live TV contest where people could participate with their phones.

[1] https://github.com/wandenberg/nginx-push-stream-module

Did not know about this module. Instead, we are using a pretty identical one [1]. Works flawlessly

[1] https://github.com/slact/nchan

Sweet, didn't know about nchan too.

Nchan is amazing

So 1/4 of what Elixir/Erlang can handle, but more difficult and less reliable:


Gary was using a much more powerful instance (both in cpu and memory) that this post uses when he reached 2 million with Phoenix. But Phoenix is also doing quite a lot more per connection than node. Basically for multiple reasons they're incomparable experiments, there's really no basis here to draw conclusions like '1/4 of what Elixir/Erlang can handle' from.

"but more difficult" sounds pretty subjective... I can get behind the "less reliable" comment because Erlang is rock solid.

Sorry to be that guy, but I can find a TON more people capable of supporting ECMA vs. Erlang and/or Elixr. I say this as a huge fanboy of both.

I know I'll get rolled on HN for saying this because we all drink the optimization Koolaid but I feel this is worth mentioning.

You have a fair point, but if by "more difficult" the grandparent meant "more difficult to maintain", I'd have to agree with him. The failure modes of processes running in the BEAM are much more transparent and understandable than the usual ad-hoc JS nodejs server.

IMO they're on the same level but I think it's just my style of how I manage my Node.js applications down my try/catch blocks, exception handling, and primarily process isolation.

> usual ad-hoc JS nodejs server

Yep - that's the key why I feel they're on the same level. I heavily rely on the cluster module and IPC to get my work done which gives me true process isolation/safety however I admit it's more code in Node.js to make it rock-solid ;)

Anecdotal, but I find that the typical Node.js developer is 100% "single-process" with weird supervisors like PM2. I also see over-engineered fleets of EC2 instances behind ELB/ALBs with health checking kicking bad instances out + a way to replace them... this is also stupid common in Docker/docker-compose/K8s as well.

For the record I think that development pattern described in the above paragraph is total crap and I put it on the same level of idiots giving modern PHP7 a bad name. Devs who practice like this completely blew the content from their CS100/200 level courses out their butt (yeah yeah I say this as a dropout myself whatever ha).

Not fair !

You are comparing "4 CPUs and 15GB of memory" for NodeJS with "40 CPUs and 128 GB of memory" for Elixir/Phoenix

Ok, here's a better comparison:

"Websocket Shootout: Clojure, C++, Elixir, Go, NodeJS, and Ruby"


> NodeJS performance is hampered by its single-threaded architecture, but given that fact it performs very well. The NodeJS server was the smallest overall in lines of code and extremely quick and easy to write. Javascript is the lingua franca for web development, so NodeJS is probably the easiest platform for which to hire or train developers.

Do you actually read the blog post before making the claim ? The author is not using "Cluster" module to scale connections across CPUs with "Sticky-session" library. Also, neither he mentioned the runtime flags with which the launched all VMs in comparison nor he mentioned the optimization at OS level.

It seems that the author actually used the "cluster" module, but just does not mention it in the article: https://github.com/hashrocket/websocket-shootout/blob/master...

I expected all benchmarks that use the (linux) kernel for I/O events would perform similarly. The code is so small that the interpreter overhead should not degrade the performance too much. May be the JSON parsing / serialization?

So you think that "sticky-session" is not important ? Can you tell why ?

That benchmark is from 2016, but the use of `golang.org/x/net/websocket` package for Go should be replaced with `github.com/gorilla/websocket` and re-run, if it hasn't been already. Based on their git repo, gorilla/websocket has existed in some form since at least 2013 though not sure how production-ready it would've been in 2016.

Both articles from 2015.

There is more performant web-socket implementation than the one mentioned in the blog. It can handle 6X more connections and much less memory



Note that the blog post is from 2015. There are many optimization (Ignition and TurboFan pipeline) has been done in V8 since then, especially offloading GC activity to separate thread than NodeJS Main thread.

Unfortunately it is written by someone whose technical ability far exceeds his people skills, which are essential for a library module that developers can safely depend on.

This is doubly unfortunate because I very much share his views on bloated frameworks etc. :(

Yeah that's the dev who pushed a purposefully broken version of their library to npm because he was upset that they wouldn't let him delete it.

There is that Phoenix thing when they famously did 2 Million.


I agree with the suggestion that smaller instances that can be scaled is not a bad idea.

I don't get point of using Node.js when compared to something like Elixir. Elixir's Phoenix can handle more numbers of concurrent connections as well as provide reliability with better programming abstractions, distribution, pretty good language.

There are a handful of languages that would be suitable for this. The "right one" to use depends on more than just the language features for that task:

- What third party libraries do you need to use? Some languages have very good support for some, and less for others.

- What are the internal integrations you need to support? Can they be over the network or are you calling into code in a particular language?

- What is the pool of skills available to you as a team? Do you go with a language that has a reputation of being really good for this task but of which the team knows very little (and therefore will have a learning curve working out the common pitfalls), or do you go with a better understood language which the team has already mastered, and stretch it to go beyond what mere mortals do with it? Note: there's no right answer here, both options have severe drawbacks.

- Related to the previous: what's your company's culture regarding technical diversity?

This is a very typical diplomatic answer to stick with old, known programming languages.

Maybe because a lot of the time it's good advice?

Often case that isn't the wrong way to go. Newer doesn't mean better, and there's always trade-offs to consider. "It's newer" by itself really shouldn't be much of a consideration.

Erlang is older, more suitable, more stable and less well known. Nobody mentioned that it is better because newer.

No, but the person I'm responding to is implying that using old, known languages is somehow a bad thing.

But in this concrete case his point is valid, I'd likewise expect Erlang's BEAM to outperform node.js in concurrency any day of the week. Elixir being new is fairly irrelevant because it compiles down to the same erlang BYTE CODE. Erlang is 30 years old.

This is actually not true, if you take a really optimzied C/C++ library for WS Nodejs will crush Elixir by a large margin. Elixir / Erlang is slow and consume a lot of memories compared to more native languages or C libraries.

Ex: https://github.com/uNetworking/uWebSockets

Isn't this like saying one can bolt a rocket onto a Ford to make it much faster than a Toyota? One can incorporate C/C++ into Elixir, too.

At that point, you may as well implement your own server in C++ using epoll.

Nah. Devs don't touch C++ code in this case.

I don't understand people eating hot dogs. Hamburgers have...

Guess you haven't heard what hot dogs are made from



It's not about which is the best technology. The barrier to entry for Node.js is next to nothing and the size of the ecosystem is incomparable to Elixir. Which means tons of companies will go with Node and feed back to the ecosystem and so on...

If we are going on size of ecosystem/what companies are invested in/etc, then you may as well stick with Java and the JVM with something like Vert.x. No need for Node.

nodejs ecosystem is much larger than the java one. also the amount of money companies invested in node is at least on par with java. think only about google chrome.

There's no advantage to using Node in a lot of cases though. Maybe if you're creating a microservice to generate email html or something. Data/state management, stream/job processing? Java or Kotlin all the way.

I have written a lot of Node services. Spent six years doing it. Gimme Java plz.

> The barrier to entry for Node.js is next to nothing

Assuming existing experience with JS / npm / async style. If you don't have that, I'm not sure which would be harder to start with. Given a little bit of experience with each, I'd actually lean towards elixir being a simple choice. Then again it depends on whether you're cool with training new devs in case of lack of elixir people.

Because javascript is one of the most popular programming languages in the world and the pool of Elixir programmers is almost non-existent?

From this https://www.ycombinator.com/topcompanies/ Brex, Podium, PagerDuty use Elixir. They are in top YC companies, not a surprise.

Do you think they got big because they use it or do they use it because they're big?

Pre-optimization is the root of all evil, my dude.

@runj__ You should read this blog post by Brex founder. He really says how Elixir suited well to the problem they were trying to solve https://medium.com/brexeng/why-brex-chose-elixir-fe1a4f31319...

In the end they picked it because they already knew it.

That's still a really good reason to pick a technology right? Especially if speed to market is one of your goals.

And that's the reason why most people chose Java(Script)

Discord used Elixir from the start

Thanks for pointing those out. I would really like to switch to Elixir in my day job.

> pool of Elixir programmers is almost non-existent

Nonsense. There's more Elixir programmers (or at least those who want to program Elixir professionally) than there are Elixir jobs.

More Elixir programmers than Elixir jobs is not an incentive to learn and/or use a language.

No, but Elixir/Erlang OTP is worth learning in and of itself.

I mean, sure, but i already have a long list of that, probably longer than I can learn in a lifetime.

Things get prioritised when they help pay the bills

I am totally with you on Elixir/ Phoenix. But using Node.js is about leveraging frontend devs to get productive on the backend fast. At least, I think that’s the idea. Maybe also leveraging Google’s dependence on V8 (and thus all the engineering love it gets), although I don’t think that’s really a great argument compared to the EVM.

Well, Node.js is just good tech with a simple async-everything concurrency model.

Ya. nodejs is "JavaScript bindings for libev".

Deno (also by Ryan Dahl) "TypeScript bindings for libev" may become a viable successor. https://deno.land/manual.html#introduction

let’s hope so! Looks like a great project

The difference in performance between any languages are very small, like only one order of magnitude, but it's usually possible to get two orders of magnitude better performance in any language by optimizing. Eg. If your manager wont allow you to cut the AWS bill 100x by doing some optimizations. He she/she will certainly not allow you to rewrite everything in another language in order to cut bills by 10x.

You could make the same argument comparing Elixir and Go, or Go and C++

Interesting details; it would be nice to see how those ulimit/networking numbers were arrived at.

The title should have [2015].

Yep - title should've had [2015], my bad!

Added now.

Genuinely missing the point question: isn't the number of concurrent socket (websocket or otherwise) connections just a function of the underlying OS and number of instances thereof, not a function of Node.js ?

Even if the OS can take it, it requires proper engineering in your stack (ie, node.js here).

it wasn't long ago that even managing 10K connections on a server was considered quite a feat - see http://www.kegel.com/c10k.html

number of connections ultimately depends on the OS limits assuming you have infinite RAM and CPU resources to start your sessions, do the handshakes, serve ping/pong packets, register events to event loops and fire timers.

Doesn't sound so impressive. I've done close to a million on a single Digital Ocean droplet using nchan[0] before. Latency was reasonable even with that many connections, you just need to set your buffer sizes carefully. Handshakes are also expensive, so it's useful to to be able to control the client and build in some smart reconnect/back-off logic.

[0] https://www.nginx.com/resources/wiki/modules/Nchan/

would be nice to compare all providers with something like https://www.techempower.com/benchmarks/ (azure & aws)

Does anyone have a more recent experience? I currently use socket.io 2.2 with node v10.16, no v8 tweaks in a docker container. At ~1000 sockets, sometimes the server receives spikes of 8000 HTTP reqs/sec, which it has to distribute to the websockets, up to 100 msgs/sec, ~1kb/msg to each socket. These spikes are making the server unstable, socket.io switches most of the clients from websockets to xhr polling.

I found this article very useful to resolve connections moving to xhr polling on our Node.js 10.16 server https://medium.com/@k1d_bl4ck/a-quick-story-about-node-js-so...

I don't have any experiments to share, but you can go father if you stop using socket.io, but I guess you need something to deal with long polling.

You should consider tweaking --max_old_space_size, we got a lot of mileage giving node more memory.

I make heavy use of socket.io chatrooms and have to support older browsers. Haven't found any better solution yet, but I'll try the tweak, thanks. I'm also looking into load balancing via haproxy or nginx, since I need less than 10k concurrent clients.

Have you tried "sticky-cluster" ?


I thought this was going to be about AWS-managed websockets using API Gateway. I've been using that at a really small scale and it's got a great API but other than almost certainly being much more expensive than the EC2 machine used here I wonder how well it works with that sort of scale.

I have written and deployed message queues in Node.js that take data from Redis and push it out on websockets. It is a pain to deal with the GC in weird cases. This was about 5 years ago, so the details might not accurate.

Things worked fine until some day some customer started sending multi-megabyte strings over the system. It is difficult to actually track down that it is GC that is halting the system and then figuring out ways to fix the issue. We ended up not using JavaScript strings and instead using Node.js buffers to do the transport - I don't recall the Node.js library for Redis supporting that out of the box.

It's better but still a pain. The GC is so much more fragile than the JVM.

Would have used nchan for that probably :p

Does anyone have experience doing the same on GCP?

In particular right now I am trying to add live reloading to my App Engine Standard app but Standard doesn't support long lived connections (so no websockets) and App Engine Flexible seems like it will be pricy.

I think I can set up a single separate websocket instance which is only responsible for doing listen/notify on postgres and telling the client when it's time to refetch from the main webserver again.

Does this sound approximately workable? Will I actually be able to reach the connection numbers like in this article?

I wonder how it would compare in terms of cost with doing it via the api-gateway. It would depend on how your app and user base scales and what the sockets are being used for I suppose.


Api gateway is very expensive once you get sustained load. EC2 much cheaper.

Lots of wisdom on this page ;)

Just want to add. Real-world, often the predominant use case is not optimizing for "max-conns". But <100,000 concurrent users, who instead need to be connected for a very long time.

In this instance, I've found Caddy's websocket directive, inspired by Websocketd, to be quite robust and elegant. It's just a process per conn. Handling stdin, stout style messaging ;)

But once they all start doing stuff then what happens? 600k is more of a function of memory right?

Why are websites still hijacking scroll behavior in 2019? I can't even take the article seriously with my scrolling bouncing and glitching all over the place.

A better solution would be to use nchan+nginx and then your Node API is just a simple stateless REST service. Will scale better and be easier to maintain.

How do you test something like this internally?

Load testing tools like Tsung.

So what's the point of attempting to max out the count of idle connections on some cloud engine?

I did this with Netty back in 2010. Increasing nf_conntrack_max is a hard-learned lesson.

Why is this specific to AWS ?

I think the idea is to dispel the myth that you can’t get that kind of performance on a cloud instance.

From what I understand there is no real "performance" in said test. Just a bunch of idle connections doing big nothing and eating all CPU time with garbage collection.

Does anyone know what latency overhead switching to ECS/EKS would add?

It's not difficult to see that such amount of connections is only useful if your server performs a trivial task. Otherwise, the bottleneck will be your CPU(s).

What do you need this many connections for?

One possible use is for high traffic websites. For example, StackExchange peaked at 500k concurrent websocket connections back in 2016. [0]

[0] https://nickcraver.com/blog/2016/02/17/stack-overflow-the-ar...

Surely not on one endpoint

We’re spread over 9 servers but only due to ephemeral port and handle exhaustion issues. Each server is fronted by 4 HAProxy frontends that each handle ~18k connections.

Since Nick’s post we’ve moved from StackExchange.NetGain to the managed websocket implementation in .NET Core 3 using Kestrel and libuv. That sits at around 2.3GB RAM and 0.4% CPU. Memory could be better (it used to be < 1GB with NetGain) but improvements would likely come from tweaks to how we configure GC under .NET Core which we haven’t really investigated yet.

We could run on far fewer machines but we have 9 sitting there serving traffic for the sites themselves so no harm in spreading the load a little!

Ephemeral port exhaustion is easy to handle if you control HAProxy and the origins.

You'll need the source [1] option on your server lines, and you also need to adjust to allow more connections, one of these will do: have the origin server listen on more ports, add more ips to the origin server and listen to those too, add more ips to the proxy and use those to connect as well.

I'm not sure about handle exhaustion? I've run into file descriptor limits, those are usually simple to set (until you run into a code enforced limit)

[1] https://cbonte.github.io/haproxy-dconv/1.7/configuration.htm...

I might be missing some details there to be honest; I'm more familiar with the application side of things than the infrastructure :)

Maybe just a couple. If you google the StackExchange Hardware Setup, you will find they use surprisingly little Hardware. Back in 2013, they could go with 2 single webservers not accounting for failover reduncancy [0]. They have upgraded since, but still, if you consider the amount of hits they get, they run a pretty humble hardware [1].

One of the arguments I make when people say microservices + cloud built to the core is the only way to scale - clearly a cleverly architected approach can save you lots on hardware/hosting money.

[0]: https://nickcraver.com/blog/2013/11/22/what-it-takes-to-run-... [1]: https://nickcraver.com/blog/2016/02/17/stack-overflow-the-ar...

Many kinds of products. One of our log aggregation products in AWS currently sits at 20TB ingress from over 500k connections. This is where slight optimizations start to have an enormous financial impact.

Lots of people using your site.

Would be nice to rebench with the new websocketstream API!


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact