
Glad you pointed this out! We could have gone into a lot more detail on the reasons that we've arrived at our current system architecture, but it would've distracted from the root problem we were solving in the post. Happy to go into it a bit here.

Node event loop blockages are the primary reason we have so many processes running. We have enough integrations and iterate on them quickly enough that our infrastructure essentially treats them as untrusted/breakable. We want to avoid ReDoS-style bugs from affecting more than the current request, so we handle one request per process. A little inelegant, but we've still been able to horizontally scale the system, and frankly the extra infrastructure cost hasn't been enough to be worth the effort to change it.
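
To make the event-loop concern concrete, here's a tiny illustration (simplified, not our actual integration code) of how one pathological regex stalls everything else in the same Node process:

    // A nested quantifier makes this regex backtrack exponentially on inputs
    // that almost match, which pins the event loop.
    const redosPattern = /^(a+)+$/;
    const hostileInput = "a".repeat(30) + "!"; // never matches; forces full backtracking

    function handleRequest(body: string): boolean {
      // While this synchronous .test() runs (potentially for a very long time),
      // no timers, I/O callbacks, or other requests in this process make progress.
      return redosPattern.test(body);
    }

    // With one request per process, only this request suffers; with N concurrent
    // requests per process, all N would stall behind this call.
    handleRequest(hostileInput);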

To get around the start-task rate limit, we've tried running multiple identical containers per ECS task. However, each container needs to be marked as "essential" in CloudFormation so that our capacity doesn't silently degrade when a container exits, and that means one container exiting also takes down the other containers in the same task.
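
Roughly, the shape of what we tried (a CDK-style sketch with placeholder names and image, not our actual infrastructure code):

    import * as cdk from "aws-cdk-lib";
    import * as ecs from "aws-cdk-lib/aws-ecs";

    // Sketch of "multiple identical containers per ECS task"; names and the
    // image are placeholders, not our real setup.
    const app = new cdk.App();
    const stack = new cdk.Stack(app, "WorkerStack");

    const taskDef = new ecs.FargateTaskDefinition(stack, "WorkerTask", {
      cpu: 1024,
      memoryLimitMiB: 2048,
    });

    for (let i = 0; i < 4; i++) {
      taskDef.addContainer(`worker-${i}`, {
        image: ecs.ContainerImage.fromRegistry("example/worker:latest"),
        // essential: true keeps capacity honest (ECS replaces the task if a
        // worker dies), but it also means one worker exiting stops the other
        // three containers in the same task.
        essential: true,
      });
    }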

Multiple processes per container is another interesting approach. We've used Node subprocesses in the past, but we found them tricky for reasons that are unrelated to deploy speed.
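
For context, the subprocess approach looked roughly like this (a simplified sketch; the worker module path and message shapes are made up for illustration):

    import { fork } from "node:child_process";

    // Simplified sketch: farm an integration call out to a subprocess so a
    // blocked or crashed child doesn't take down the parent process.
    // "./integration-worker.js" and the message shapes are illustrative only.
    function runInSubprocess(payload: unknown, timeoutMs: number): Promise<unknown> {
      return new Promise((resolve, reject) => {
        const child = fork("./integration-worker.js");

        const timer = setTimeout(() => {
          child.kill("SIGKILL"); // a blocked child can't be asked nicely
          reject(new Error("integration timed out"));
        }, timeoutMs);

        child.once("message", (result) => {
          clearTimeout(timer);
          child.kill();
          resolve(result);
        });
        child.once("error", (err) => {
          clearTimeout(timer);
          reject(err);
        });

        child.send(payload);
      });
    }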

One thing we've really liked about rolling our own approach is that we decide when to declare a deploy complete. ECS is pretty conservative: it won't declare a deploy complete until the final container has finished draining, which can take minutes for some of our requests. With our fast deploys, we declare a deploy complete as soon as the final container running old code stops accepting new requests, which is significantly sooner. This makes follow-on deploys and rollbacks much smoother.




Specifically, the problem is that your integrations may block, right? If they crashed or never returned, Node would deal with that okay, wouldn't it?

What are these "integrations"? Could you farm them out to a flock of subprocesses, so you needed fewer top-level Node processes? What are they written in? If Node, how do you even get them to block?

And, of course, have you considered rewriting it in ... carrier lost


If we handled N concurrent requests per process, a crash would definitely have the bleed-over effect we're trying to avoid, where a problem in a single request causes the other N-1 to fail. "Crash" here primarily means OOMs or unhandled rejections (which we exit on). And yes, the other big problem is blocking the event loop with an inefficient regexp, a massive sort operation, clumsy use of Ramda, etc. That sort of blockage causes our ECS health checks to fail, so the container eventually gets killed.
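
The "exit on unhandled rejections" policy is just a process-level hook, something like:

    // Simplified version of the policy: treat an unhandled rejection as fatal.
    // With one request per process this only fails the current request; with N
    // concurrent requests it would take the other N-1 down with it.
    process.on("unhandledRejection", (reason) => {
      console.error("unhandled rejection, exiting:", reason);
      process.exit(1);
    });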

"Never returning" is an interesting problem in Node. We have gRPC request timeouts, so the request fails on the client side after a set amount of time. Our Node server gets a cancellation event when this happens, but there's no guarantee that the in-flight request ever stops processing or gives up its resources, even if it's not blocking the event loop. E.g. our request handler could have a while(true) loop that continually does an async network request, and even though each gRPC request eventually times out, we would eventually have a swarm of zombie while(true) loops that are operating. To address this problem, we thread a lightweight context object into our framework which can essentially call an isCancelled() method. Before doing most low-level async operations (e.g. network requests), we check isCancelled() and throw if the gRPC request has timed out.

The integrations are all written in TypeScript. As mentioned in another comment, we've attempted multiple processes per container and multiple containers per task. Right now, we see the bigger win as being able to fix "all of the above" and run multiple requests per Node process, but it'll take some legwork to get there :)


Great insights, thanks for replying!



