1) Consider distributing your workload across several smaller instances rather than running the single largest instance that can handle it. This allows for progressive rollouts to test new versions, reduces the thundering herd when you restart or replace an instance, etc.
2) Don't set up security group rules that limit which addresses can connect to your websocket port. As soon as you do, connection tracking kicks in and you'll hit undocumented hard limits on the number of established connections to an instance. These limits vary with instance size and can easily become your bottleneck.
3) Beware of ELBs. Under the hood an ELB is made up of multiple load balancers and is supposed to scale out when those load balancers hit capacity. A single load balancer can only handle a certain number of concurrent connections. In my experience ELBs don't automatically scale out when that limit is reached; you need AWS support to do it for you manually. At a certain traffic level, expect support to tell you to create multiple ELBs and distribute traffic across them yourself. ALBs or NLBs may handle this better; I'm not sure. If possible, design your system to distribute connections itself instead of relying on a load balancer.
2 and 3 are frustrating because they happen at a layer of EC2 that you have little visibility into. The best way to avoid problems is to test everything at the expected real user load. In our case, when we were planning a change that would dramatically increase the number of clients connecting and doing real work, we first used our experimentation system to have a set of clients establish a dummy connection, then gradually ramped up that number of clients in the experiment as we worked through issues.
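(Illustrative only, not the setup described above: a rough Node.js sketch of a dummy-connection load client, assuming the 'ws' package and a made-up endpoint.)

  // dummy-client.js - open N idle websocket connections and hold them open
  // while watching the server side for failures (made-up endpoint)
  const WebSocket = require('ws');

  const TARGET = process.env.WS_URL || 'wss://push.example.com/ws';
  const COUNT = Number(process.env.WS_COUNT || 1000);

  let established = 0;
  for (let i = 0; i < COUNT; i++) {
    const ws = new WebSocket(TARGET);
    ws.on('open', () => {
      established++;
      setInterval(() => ws.ping(), 30000); // keep-alive so idle connections stay up
    });
    ws.on('close', () => { established--; });
    ws.on('error', (err) => console.error('connect failed:', err.message));
  }

  setInterval(() => console.log(`established: ${established}/${COUNT}`), 5000);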
Terminating TLS yourself incurs some CPU cost and a bit of extra memory; how much depends on the efficiency of your code. Our Rust implementation roughly matches the efficiency of C code, so we could handle terminating TLS ourselves if ELB stops being feasible at some point.
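(Their code is Rust; purely to illustrate the idea of terminating TLS in the application process instead of at the load balancer, here's a Node.js sketch using the 'ws' package and made-up cert paths.)

  // tls-ws.js - terminate TLS in the application process instead of at the LB
  const fs = require('fs');
  const https = require('https');
  const WebSocket = require('ws');

  const server = https.createServer({
    cert: fs.readFileSync('/etc/ssl/ws.crt'), // made-up paths
    key: fs.readFileSync('/etc/ssl/ws.key'),
  });

  const wss = new WebSocket.Server({ server });
  wss.on('connection', (ws) => {
    ws.on('message', (msg) => ws.send(msg)); // echo, just for demonstration
  });

  server.listen(8443); // 443 in practice; 8443 avoids needing root here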
But, with that said, in terms of price, it's unbeatable.
A bit of an edge case, but it's pretty much the only issue I've had so far, which is why I bring it up.
I came here to say this. Horizontal compute is a miracle.
If we had begun having problems once we started sending more push messages, we would have simply stopped using the new service (https://github.com/mozilla-services/megaphone) responsible for that until we worked through them.
You can find some discussion of this behavior in places like https://forums.aws.amazon.com/thread.jspa?threadID=231806. I originally became aware of the issue, before hitting it in production, from the HN comment at https://news.ycombinator.com/item?id=18314138
I myself found out how to avoid the connection tracking from this thread: https://news.ycombinator.com/item?id=15724072
It's super frustrating that Amazon doesn't document this in a more approachable manner.
I believe we only hit the security group limits in load testing. There, we could see connections to an instance start failing once that instance's established connection count, as reported by netstat or similar tools, hit a certain threshold.
Edit: I wasn't aware but there was some build-up to it:
- 100k connections: http://blog.caustik.com/2012/04/08/scaling-node-js-to-100k-c...
- 250k connections: http://blog.caustik.com/2012/04/10/node-js-w250k-concurrent-...
Sorry to be that guy, but I can find a TON more people capable of supporting ECMAScript than Erlang and/or Elixir. I say this as a huge fanboy of both.
I know I'll get rolled on HN for saying this because we all drink the optimization Kool-Aid, but I feel this is worth mentioning.
> usual ad-hoc JS nodejs server
Yep - that's key to why I feel they're on the same level. I heavily rely on the cluster module and IPC to get my work done, which gives me true process isolation/safety, though I admit it takes more code in Node.js to make it rock-solid ;) (rough sketch of the pattern after this comment)
Anecdotal, but I find that the typical Node.js developer is 100% "single-process" with weird supervisors like PM2. I also see over-engineered fleets of EC2 instances behind ELBs/ALBs with health checks kicking bad instances out plus some way to replace them... this is stupid common in Docker/docker-compose/K8s setups as well.
For the record, I think the development pattern described in the above paragraph is total crap, and I put it on the same level as the idiots giving modern PHP7 a bad name. Devs who practice like this completely blew the content of their CS100/200-level courses out their butt (yeah, yeah, I say this as a dropout myself, ha).
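(Not that poster's code - just a minimal sketch of the cluster + IPC pattern being referred to, assuming the 'ws' package.)

  // cluster-ws.js - one worker per CPU core, each running its own websocket
  // server; the primary process restarts crashed workers and can fan messages
  // out to all of them over IPC
  const cluster = require('cluster');
  const os = require('os');

  if (cluster.isMaster) {
    for (let i = 0; i < os.cpus().length; i++) cluster.fork();

    // process isolation: a crashed worker only takes down its own connections
    cluster.on('exit', (worker) => {
      console.log(`worker ${worker.process.pid} died, starting a replacement`);
      cluster.fork();
    });

    // broadcast to every worker over IPC
    setInterval(() => {
      for (const id in cluster.workers) cluster.workers[id].send({ type: 'tick' });
    }, 10000);
  } else {
    const WebSocket = require('ws');
    // the listening port is shared across workers by the cluster module
    const wss = new WebSocket.Server({ port: 8080 });

    process.on('message', (msg) => {
      for (const client of wss.clients) {
        if (client.readyState === WebSocket.OPEN) client.send(JSON.stringify(msg));
      }
    });
  }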
You are comparing "4 CPUs and 15 GB of memory" for NodeJS with "40 CPUs and 128 GB of memory" for Elixir/Phoenix.
"Websocket Shootout: Clojure, C++, Elixir, Go, NodeJS, and Ruby"
Did you actually read the blog post before making that claim? The author is not using the "cluster" module with the "sticky-session" library to scale connections across CPUs. He also doesn't mention the runtime flags with which he launched the VMs in the comparison, nor any OS-level optimizations.
I expected all benchmarks that use the (Linux) kernel for I/O events to perform similarly. The code is so small that interpreter overhead shouldn't degrade performance too much. Maybe it's the JSON parsing/serialization?
Note that the blog post is from 2015. Many optimizations (e.g. the Ignition and TurboFan pipeline) have landed in V8 since then, including offloading GC work to threads separate from the Node.js main thread.
This is doubly unfortunate because I very much share his views on bloated frameworks etc. :(
I agree with the suggestion that smaller instances that can be scaled out are not a bad idea.
- What third party libraries do you need to use? Some languages have very good support for some, and less for others.
- What are the internal integrations you need to support? Can they be over the network or are you calling into code in a particular language?
- What is the pool of skills available to you as a team? Do you go with a language that has a reputation for being really good for this task but of which the team knows very little (and therefore will have a learning curve working out the common pitfalls), or do you go with a better-understood language which the team has already mastered, and stretch it to go beyond what mere mortals do with it? Note: there's no right answer here; both options have severe drawbacks.
- Related to the previous: what's your company's culture regarding technical diversity?
I have written a lot of Node services. Spent six years doing it. Gimme Java plz.
Assuming existing experience with JS / npm / async style. If you don't have that, I'm not sure which would be harder to start with. Given a little bit of experience with each, I'd actually lean towards Elixir being the simpler choice. Then again, it depends on whether you're cool with training new devs if you can't find Elixir people.
Premature optimization is the root of all evil, my dude.
Nonsense. There are more Elixir programmers (or at least people who want to program Elixir professionally) than there are Elixir jobs.
Things get prioritised when they help pay the bills.
Deno (also by Ryan Dahl), "TypeScript bindings for libev", may become a viable successor. https://deno.land/manual.html#introduction
The title should have .
It wasn't long ago that even managing 10K connections on a server was considered quite a feat - see http://www.kegel.com/c10k.html
You should consider tweaking --max_old_space_size; we got a lot of mileage out of giving node more memory.
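(For illustration: the flag goes on the node command line, and from inside the process you can confirm the heap ceiling it sets.)

  // start node with a larger old-generation heap (value is in MB), e.g.
  //   node --max_old_space_size=8192 server.js
  const v8 = require('v8');

  // heap_size_limit reflects the configured ceiling; used_heap_size is current usage
  const { heap_size_limit, used_heap_size } = v8.getHeapStatistics();
  console.log(`heap limit: ${Math.round(heap_size_limit / 1e6)} MB, ` +
              `used: ${Math.round(used_heap_size / 1e6)} MB`);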
Would have used nchan for that probably :p
In particular, right now I am trying to add live reloading to my App Engine Standard app, but Standard doesn't support long-lived connections (so no websockets) and App Engine Flexible seems like it will be pricey.
I think I can set up a single separate websocket instance which is only responsible for doing LISTEN/NOTIFY on Postgres and telling the client when it's time to refetch from the main webserver again (rough sketch below).
Does this sound approximately workable? Will I actually be able to reach the connection numbers like in this article?
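Roughly what I have in mind, as an untested sketch using the 'pg' and 'ws' packages:

  // notify-relay.js - a single websocket instance that turns Postgres NOTIFY
  // events into "time to refetch" hints for connected browsers
  const { Client } = require('pg');
  const WebSocket = require('ws');

  const wss = new WebSocket.Server({ port: 8081 });

  const pg = new Client({ connectionString: process.env.DATABASE_URL });
  pg.connect().then(() => pg.query('LISTEN changes'));

  pg.on('notification', (msg) => {
    // every NOTIFY on the "changes" channel fans out to all clients
    for (const client of wss.clients) {
      if (client.readyState === WebSocket.OPEN) {
        client.send(JSON.stringify({ refetch: true, payload: msg.payload }));
      }
    }
  });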
Just want to add: in the real world, the predominant use case is often not optimizing for max connections, but rather <100,000 concurrent users who instead need to stay connected for a very long time.
In this instance, I've found Caddy's websocket directive, inspired by websocketd, to be quite robust and elegant. It's just a process per connection, handling messaging over stdin/stdout ;)
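(A hypothetical per-connection handler in that style, in Node.js for consistency with the rest of the thread - Caddy/websocketd would spawn one of these per client.)

  // handler.js - one process per websocket connection; each line on stdin is a
  // message from the client, each line written to stdout goes back to them
  const readline = require('readline');

  const rl = readline.createInterface({ input: process.stdin });

  rl.on('line', (line) => {
    // echo the message back with a timestamp (replace with real logic)
    process.stdout.write(JSON.stringify({ echo: line, at: Date.now() }) + '\n');
  });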
Since Nick’s post we’ve moved from StackExchange.NetGain to the managed websocket implementation in .NET Core 3 using Kestrel and libuv. That sits at around 2.3GB RAM and 0.4% CPU. Memory could be better (it used to be < 1GB with NetGain) but improvements would likely come from tweaks to how we configure GC under .NET Core which we haven’t really investigated yet.
We could run on far fewer machines, but we have 9 sitting there serving traffic for the sites themselves, so there's no harm in spreading the load a little!
You'll need the source option on your server lines, and you also need to make room for more proxy-to-origin connections; any one of these will do: have the origin server listen on more ports, add more IPs to the origin server and listen on those too, or add more IPs to the proxy and use those to connect from as well.
I'm not sure about handle exhaustion? I've run into file descriptor limits; those are usually simple to raise (until you run into a limit enforced in code).
This is one of the arguments I make when people say microservices + cloud-native-to-the-core is the only way to scale - clearly a cleverly architected approach can save you a lot of hardware/hosting money.