The only way this makes sense to me is if they have to contend with lots of expensive parsing, event sequencing, and throttling requirements. Payment APIs, bank websites, etc. can be quite byzantine. I can understand how you might code yourself into a corner with a monolithic Node app and basically just say "F-it, we're doing this synchronously!"
I don't even think it's a terribly bad thing to do, assuming it favors feature velocity... but at that point, I'd recommend moving away from Node towards something like Python. And if you wanted to dip your toes back into async plumbing land, explore Go or Elixir.
I have never seen a good argument for using golang for business logic. If you are writing the actual server then sure, use golang. If you are writing some high-speed network interconnect, use golang. Some crazy caching system, sure use golang. The public WS endpoint, use golang.
But if you need to access a DB with golang for anything more than, like, a session token, then you made the wrong choice and you need to go back and re-assess.
Elixir is in the "germination phase" and I predict massive adoption in the next 5 years. It is a truly excellent platform, every fintech company I know at least has their toe in the water. Everyone I show this video to [1] just says "well, shit."
You hit the nail on the head here. When N different API requests simultaneously time out – all because a ramda.uniq call in one of them received an array of 100,000 nested objects – it's easy to make a spot code fix, but harder to systematically prevent it from happening in the future. There aren't really linters for "bad event loop blockage". Code reviews are the main tool we have, but you'd be surprised what sorts of logic can trickily block the event loop. For API reliability and development velocity in the short-term, by far the easiest approach was to throw more infrastructure at the problem.
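For anyone who hasn't watched this happen live, here's a minimal sketch of the failure mode (plain http server; dedupeHuge is a made-up stand-in for that ramda.uniq call, everything here is illustrative):

    const http = require('http');

    // Stand-in for something like ramda.uniq over 100k nested objects:
    // a synchronous, CPU-bound loop the event loop cannot preempt.
    function dedupeHuge(items) {
      const seen = new Set();
      const out = [];
      for (const item of items) {
        const key = JSON.stringify(item); // deep-ish equality, deliberately expensive
        if (!seen.has(key)) {
          seen.add(key);
          out.push(item);
        }
      }
      return out;
    }

    const bigArray = Array.from({ length: 100000 }, (_, i) => ({ a: { b: i % 1000 } }));

    http.createServer((req, res) => {
      if (req.url === '/slow') {
        dedupeHuge(bigArray); // blocks the whole process while it runs
        res.end('done\n');
      } else {
        res.end('fast\n'); // normally instant, but stuck behind any /slow call
      }
    }).listen(3000);

While dedupeHuge chews through that array, every other request on the process just sits in the queue, which is exactly why N unrelated endpoints time out at the same moment.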
We do use Go for almost all of our other services, and there are an increasing number of integrations written in Python. But we're still using and investing in our Node integrations code for the foreseeable future, and this was an important step for simplifying our infrastructure.
We certainly hope the tooling and rollout process in the post were instructive for anyone using Node, even if their stacks were pristine from day 1 and never need this sort of complex migration :)
Taking a wild guess: Some of their bank integrations probably require browser automation. If you're doing browser automation, the best tool for the job is (currently) Puppeteer, which runs on Node. There are other third-party language bindings for the Chrome dev tools protocol, but Puppeteer is developed by Google as a first-class citizen alongside Chrome.
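For reference, a scraping integration in Puppeteer looks roughly like this; the URL, selectors, and env vars are placeholders, not anything from a real integration:

    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();

      // Placeholder login flow; real integrations differ per bank.
      await page.goto('https://bank.example.com/login', { waitUntil: 'networkidle2' });
      await page.type('#username', process.env.BANK_USER);
      await page.type('#password', process.env.BANK_PASS);
      await Promise.all([
        page.waitForNavigation({ waitUntil: 'networkidle2' }),
        page.click('#submit'),
      ]);

      const balance = await page.$eval('.account-balance', el => el.textContent);
      console.log('balance:', balance);

      await browser.close();
    })();

Each launch() spawns a full Chrome process, which is where the per-integration cost comes from.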
Presumably not every integration requires browser automation, so they might not all be running at once. But they have a $25k monthly EC2 bill, so the numbers are in the right ballpark.
FWIW, I reliably have 6 puppeteer/chrome instances (headful, even) going on a single box and it's not even at half capacity.
That was my thought too. They've got a problem where they have no idea what a given transaction costs, and some unpredictable fraction of transactions result in serious work that holds up the event queue.
God knows they could be waiting for some reel-to-reel tape to spin up somewhere...
But the whole point of synchronous I/O is to isolate the programmer from having to think about the fact that spinning up that tape takes a non-zero amount of time. I have a feeling this gets lost sometimes in all the "async I/O is the GREATEST!" craze.
Async is nice, if you can handle it. But that's not easy in complex systems and processes. It is certainly easier to work with an old-fashioned process that blocks while waiting for whatever you need to wait for, and to scale by letting the OS run lots of those processes in parallel. Sure, it's less efficient. But it's easier for the devs to handle.
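In Node terms, the closest thing to that model is the built-in cluster module: one worker per core, and the OS scheduler does the juggling. A minimal sketch:

    const cluster = require('cluster');
    const http = require('http');
    const os = require('os');

    if (cluster.isPrimary) { // `isMaster` on Node < 16
      // One worker per core; the OS scheduler does the juggling.
      for (let i = 0; i < os.cpus().length; i++) cluster.fork();
      cluster.on('exit', () => cluster.fork()); // replace dead workers
    } else {
      http.createServer((req, res) => {
        // If this handler blocks, only this worker stalls; siblings keep serving.
        res.end(`handled by worker ${process.pid}\n`);
      }).listen(3000); // workers share the port via the primary
    }

One slow or blocked worker only hurts itself; the rest keep serving.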
I just read the hidden undertone of this article as "our devs aren't that smart after all".
But you need to know whether you can do that something first, or whether you've already done that something too many times in the last N minutes (and could get blocked, forcing thousands of other somethings into an endless queue). Or whether that something could take too long while you could have been doing 200 other somethings in the same time, etc. It's not that simple.
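The "too many times in the last N minutes" check alone means every call site needs bookkeeping like this (a hand-rolled sliding-window limiter; the names and numbers are invented):

    // At most `limit` calls per `windowMs`, tracked with a sliding window.
    function makeLimiter(limit, windowMs) {
      let timestamps = [];
      return function tryAcquire() {
        const now = Date.now();
        timestamps = timestamps.filter(t => now - t < windowMs);
        if (timestamps.length >= limit) return false;
        timestamps.push(now);
        return true;
      };
    }

    const canCallBank = makeLimiter(100, 5 * 60 * 1000); // invented: 100 calls / 5 min

    async function doSomething(task) {
      if (!canCallBank()) throw new Error('rate limited, retry later');
      return task();
    }

And that's the easy part; deciding what to do when tryAcquire says no (fail fast? back off? requeue?) is where it stops being simple.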
Their velocity might have been slowed by figuring out how to manage 4,000 containers effectively. If they had dealt with managing concurrency effectively sooner, they would have needed 30x fewer containers: roughly 133.
Not so much: they're using ECS, which takes care of a lot of those headaches, and it sounds like they're coordinating with a load balancer / reverse proxy to distribute those requests... A 1:1 request model in that kind of system is really simple to set up. Orchestrating multiple requests per node was probably much more time-intensive.
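At minimum, the "multiple requests per node" version needs a per-process concurrency cap so a burst can't pile up unboundedly. A hand-rolled sketch of what libraries like p-limit do (the cap value is invented):

    // At most `max` tasks in flight per process; extra callers wait their turn.
    function makeSemaphore(max) {
      let active = 0;
      const waiting = [];
      const release = () => {
        active--;
        if (waiting.length > 0) {
          active++;
          waiting.shift()(); // wake the next queued caller
        }
      };
      return async function run(fn) {
        if (active >= max) {
          await new Promise(resolve => waiting.push(resolve));
        } else {
          active++;
        }
        try {
          return await fn();
        } finally {
          release();
        }
      };
    }

    const limit = makeSemaphore(30); // e.g. 30 concurrent integrations per process
    // usage: limit(() => runIntegration(request))

Not rocket science, but it's more moving parts than one-request-one-container.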