
Glad you pointed this out! We could have gone into a lot more detail on the reasons that we've arrived at our current system architecture, but it would've distracted from the root problem we were solving in the post. Happy to go into it a bit here.

Node event loop blockages are the primary reason we have so many processes running. We have enough integrations and iterate on them quickly enough that our infrastructure essentially treats them as untrusted/breakable. We want to avoid ReDoS-style bugs from affecting more than the current request, so we handle one request per process. A little inelegant, but we've still been able to horizontally scale the system, and frankly the extra infrastructure cost hasn't been enough to be worth the effort to change it.
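
To make the event-loop concern concrete, here's a tiny illustration (simplified, not our actual integration code) of how one pathological regex stalls everything else in the same Node process:

    // A nested quantifier makes this regex backtrack exponentially on inputs
    // that almost match, which pins the event loop.
    const redosPattern = /^(a+)+$/;
    const hostileInput = "a".repeat(30) + "!"; // never matches; forces full backtracking

    function handleRequest(body: string): boolean {
      // While this synchronous .test() runs (potentially for a very long time),
      // no timers, I/O callbacks, or other requests in this process make progress.
      return redosPattern.test(body);
    }

    // With one request per process, only this request suffers; with N concurrent
    // requests per process, all N would stall behind this call.
    handleRequest(hostileInput);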

To get around the start-task rate limit, we've tried running multiple identical containers per ECS task. However, each container needs to be marked as "essential" in CloudFormation so that our capacity doesn't silently degrade when a container exits, and that means one container exiting also takes down the other containers in the same task.
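
Roughly, the shape of what we tried (a CDK-style sketch with placeholder names and image, not our actual infrastructure code):

    import * as cdk from "aws-cdk-lib";
    import * as ecs from "aws-cdk-lib/aws-ecs";

    // Sketch of "multiple identical containers per ECS task"; names and the
    // image are placeholders, not our real setup.
    const app = new cdk.App();
    const stack = new cdk.Stack(app, "WorkerStack");

    const taskDef = new ecs.FargateTaskDefinition(stack, "WorkerTask", {
      cpu: 1024,
      memoryLimitMiB: 2048,
    });

    for (let i = 0; i < 4; i++) {
      taskDef.addContainer(`worker-${i}`, {
        image: ecs.ContainerImage.fromRegistry("example/worker:latest"),
        // essential: true keeps capacity honest (ECS replaces the task if a
        // worker dies), but it also means one worker exiting stops the other
        // three containers in the same task.
        essential: true,
      });
    }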

Multiple processes per container is another interesting approach. We've used Node subprocesses in the past, but we found them tricky for reasons that are unrelated to deploy speed.
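
For context, the subprocess approach looked roughly like this (a simplified sketch; the worker module path and message shapes are made up for illustration):

    import { fork } from "node:child_process";

    // Simplified sketch: farm an integration call out to a subprocess so a
    // blocked or crashed child doesn't take down the parent process.
    // "./integration-worker.js" and the message shapes are illustrative only.
    function runInSubprocess(payload: unknown, timeoutMs: number): Promise<unknown> {
      return new Promise((resolve, reject) => {
        const child = fork("./integration-worker.js");

        const timer = setTimeout(() => {
          child.kill("SIGKILL"); // a blocked child can't be asked nicely
          reject(new Error("integration timed out"));
        }, timeoutMs);

        child.once("message", (result) => {
          clearTimeout(timer);
          child.kill();
          resolve(result);
        });
        child.once("error", (err) => {
          clearTimeout(timer);
          reject(err);
        });

        child.send(payload);
      });
    }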

One thing we've really liked about rolling our own approach is that we decide when to declare a deploy complete. ECS is pretty conservative: it won't declare a deploy complete until the final container has finished draining, which can take minutes for some of our requests. With our fast deploys, we declare a deploy complete as soon as the final container running old code stops accepting new requests, which is significantly sooner. This makes follow-on deploys and rollbacks much smoother.




Specifically, the problem is that your integrations may block, right? If they crashed or never returned, Node would deal with that okay, wouldn't it?

What are these "integrations"? Could you farm them out to a flock of subprocesses, so you needed fewer top-level Node processes? What are they written in? If Node, how do you even get them to block?

And, of course, have you considered rewriting it in ... carrier lost


If we handled N concurrent requests per process, a crash would definitely have the bleed-over effect we're trying to avoid, where a problem in a single request causes the other N-1 to fail. "Crash" here primarily means OOMs or unhandled rejections (which we exit on). And yes, the other big problem is blocking the event loop with an inefficient regexp, a massive sort operation, clumsy use of Ramda, etc. That sort of blockage causes our ECS health checks to fail, so the container eventually gets killed.
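
The "exit on unhandled rejections" policy is just a process-level hook, something like:

    // Simplified version of the policy: treat an unhandled rejection as fatal.
    // With one request per process this only fails the current request; with N
    // concurrent requests it would take the other N-1 down with it.
    process.on("unhandledRejection", (reason) => {
      console.error("unhandled rejection, exiting:", reason);
      process.exit(1);
    });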

"Never returning" is an interesting problem in Node. We have gRPC request timeouts, so the request fails on the client side after a set amount of time. Our Node server gets a cancellation event when this happens, but there's no guarantee that the in-flight request ever stops processing or gives up its resources, even if it's not blocking the event loop. E.g. our request handler could have a while(true) loop that continually does an async network request, and even though each gRPC request eventually times out, we would eventually have a swarm of zombie while(true) loops that are operating. To address this problem, we thread a lightweight context object into our framework which can essentially call an isCancelled() method. Before doing most low-level async operations (e.g. network requests), we check isCancelled() and throw if the gRPC request has timed out.

The integrations are all written in TypeScript. As mentioned in another comment, we've attempted multiple processes per container and multiple containers per task. Right now, we see the bigger win as being able to fix "all of the above" and run multiple requests per Node process, but it'll take some legwork to get there :)


Great insights, thanks for replying!



