We 30x'd our Node parallelism (plaid.com)
202 points by bjacokes 7 months ago | 243 comments

I was building scalable node applications a few years ago for a very large e-commerce player (millions of customers). I think node.js is a great platform, but its apparent simplicity means there are hordes, and I mean like 90+% of the community, who can "just get things done" without understanding what is going on under the hood at all. And to be fair, for most startupy types of companies that need to iterate fast, that is what you want to optimize for.

My interview screening question was pretty simple: "Is node.js single threaded or multithreaded?" And most would spit back the blogspam headline: "Single threaded!" I think the most correct answer is "it's complicated," but I would accept "single threaded" because most people would say that is the "right" answer. So I would follow up with: "what exactly happens in a default installation if we have, say, 5 requests come in at exactly the same time to just return some static content from disk?" (Node's default threadpool is 4.) And here is where you could see their understanding just fall apart. Some would say the requests would be handled entirely synchronously, others completely in parallel, but then had no idea what the cause of the parallelism was. Very few actually understood that node is an event loop executing javascript backed by a threadpool for async operations.

Before reading this post, I was like: eh, this is a waste of time, it's typical Medium bullshit; they almost certainly found they were doing some blocking call in the event loop, removed it, and voila, 30x speedup. It was interesting because it was a lot worse! They spent all this time and hard work figuring out everything except what was taking so long in the event loop, and it seems that was the last place they actually looked.

Anyway, node can be a highly scalable platform (https://changelog.com/podcast/116) but you need to understand it or it will bite you. When I was last doing this stuff, upwards of 80% of our time was being spent essentially just JSON.parse()'ing, and we were looking to move to protobufs to avoid that.

This is why I recommend that anyone running Node in production use a tracing tool like New Relic. It's super easy to see what is blocking the event loop. Just choose a duration (say 10ms) and look for any execution spans that are longer than that duration.

Ideally you want to be yielding back to the event loop at least every 1 ms. Anything that takes too long without yielding will show up as a latency delay before your code is able to start handling a new request (technically a background thread in Node.js will pick up the request, but your code won't start executing in response to it until you yield back to the event loop again).

To be honest, the more difficult thing to diagnose is sometimes event loop overburdening. If each of your execution spans takes 1ms, then you can only do a max of 1000 of them per second (assuming there is no delay between executions, but there is). So if you are trying to handle a large number of requests per second, the event loop may end up with say 1005 execution spans per second that it needs to execute to handle that request volume. Because you can't do 1005ms of work in 1000ms, the extra work will queue up.

So gradually you will end up with 5 backlogged execution spans stacking up per second. Each second you will get 5ms more latency. The overall request latency will just gradually increase and increase as work gets further and further delayed in the queue.
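A cheap way to watch for both problems (long spans and a backlogged loop) during development is a timer-drift probe. This is a sketch, not a production monitor; the interval and durations are arbitrary:

```javascript
// Schedule a timer every 100 ms and measure how late it fires: that drift is
// event-loop lag, i.e. work that couldn't be scheduled on time.
const INTERVAL = 100;
let last = Date.now();
let maxLag = 0;

const probe = setInterval(() => {
  const now = Date.now();
  const lag = now - last - INTERVAL; // how late this tick fired
  if (lag > maxLag) maxLag = lag;
  last = now;
}, INTERVAL);

// Simulate a handler that hogs the loop for ~250 ms.
setTimeout(() => {
  const end = Date.now() + 250;
  while (Date.now() < end) {} // busy-wait: nothing else can run meanwhile
}, 150);

setTimeout(() => {
  clearInterval(probe);
  console.log(`worst observed lag: ${maxLag} ms`);
}, 600);
```

If the worst observed lag keeps climbing over time rather than staying flat, you are in the backlog scenario described above.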

Overall I just think of Node.js as a fancy CPU scheduler. As long as you give it even, decently sized chunks of work to schedule, and you don't give it too many to schedule, you will be fine. Anyway, I'm a huge fan of Node.js, but yeah, it's easy to fall into some gotchas if you don't study how it works. The simplicity is a bit misleading.

Elastic APM is better than New Relic in how it traces node and it is completely free and open source (you can use a cloud product).

Disclosure and bias: I work on Node core, and in core meetings I always hear ranting about incorrect async_hooks usage by everyone but Elastic APM. I have used both products and have no affiliation with either company.

Interesting! I've had too many bad experiences over the years with Elasticsearch and ELK stack so I usually avoid Elasticsearch based products like the plague. Maybe if someone else runs it for me though...

Do you know what they are doing differently with async_hooks in Elastic APM?

How exactly do you use NewRelic to see what is blocking the event loop? I thought we always needed Flame Graphs for it (which NR doesn't provide)

A flame graph is only needed if you are having a lot of trouble pinning down the exact call that is taking a long time. In most cases I've found traces to be all I needed. For example, if I see that a span of execution took longer than expected, and I read through that code and see a bunch of JSON serialization, it's pretty obvious.

To be honest that comes with a lot of Node.js experience though. You probably need flame graphs to start out with, but eventually I find New Relic traces to be all I need, as I already have a sense for the relative CPU weight of the various calls inside a trace span.

> Very few actually understood that node is an event loop executing javascript backed by a threadpool for async operations.

This is true; JavaScript itself is mostly a synchronous programming language, with host environments that provide asynchronicity.

A caveat though is that the most important part of I/O is network I/O (tcp/udp sockets) and Node uses real async operations there rather than a threadpool.

FS is just really hard to get right in a cross platform way and that's why it's on the threadpool. Some other stuff like dns is also famously on the threadpool but tcp sockets are not - it's a big part of why Node is fast.

Here is a SO answer that expands a bit more on kevstev's excellent comment:


At my last gig, I maintained a handful of existing, and created some new nodejs things. I had previously done a lot of Java and even hacked on Apache. I had no prior nodejs experience.

The first thing that really bothered me about our use of nodejs was no one could say why stuff would fail in production. So many moving parts. One of my team members figured out some edge case interactions between nodejs and nginx (used for HTTPS), which I would have never figured out on my own. It wouldn't have even occurred to me to look there. But other crashes, caused by apparent leaks, were mystifying.

The second, and bigger, thing that really bothered me about nodejs, and expressjs in particular, was the notion of back pressure is completely missing. If it's in there, I couldn't find it. So our endpoints were still accepting new socket connections without processing responses from backend services (eg redis, other nodejs endpoints, auth services), which would either zombie or ABEND those backends. And no one could figure out why.

I only understood what was happening because I'd already been through all that "architecture" madness a decade earlier with Java services.

I guess what I'm saying is while I LOVE nodejs' closeness to the metal, I didn't like going back in time 10-15 years.

Also, npm is crap.

>I think node.js is a great platform,

I'm curious as to why. For large scale applications like this, you have other options that offer higher performance ceilings, have more safety and correctness features, and are likely more productive as well. What is the attraction to node?

A guy has to invent a scripting language for browsers in 9 days -> he decides on a lisp -> management says no, it has to look like Java -> he comes up with something -> it's dynamically typed -> let's run a huge banking infrastructure on this


The real killer feature is async. Since a modern web request typically spends most of its time waiting for database calls, file system requests or similar, a naively coded server in most languages can handle relatively few requests per thread, so you scale up the number of threads to something like 100 per core, and now the overhead of running and switching between these threads is limiting the performance.

Being used to Node, I was flabbergasted when writing C for Linux*. The file system calls just leave my thread hanging while the result is being generated; if I use them on a network drive they might hang for a minute before timing out, so I have to make a thread for each file system call, solely so that it can stall without bringing down the whole application.

* I have no delusions that Windows is any better, Linux is just what I have first hand experience with.

>a naively coded server in most languages can handle relatively few requests per thread, so you scale up the number of threads to something like 100 per core, and now the overhead of running and switching between these threads is limiting the performance.

This would be true if you hired someone to write a server in C about 15 years ago. It's not true today. And I hope you're not putting a naively coded server like that in production, or at least doing the hour of research once you notice it's awfully slow to solve the problem.

Like if you wrote your backend in Go, Rust, Java or any number of languages (even C/C++ with common dependencies!) and did a little reading while you designed it, this issue wouldn't exist.

It happens that Windows is better. It has much better kernel support for async IO.

You're handily skipping over 15 years of improvement and iteration between the last and second-last points there.

My background was doing low level C++ in HFT/algorithmic trading for years, with a bit of Java interspersed, before taking this complete right turn into webdev in js for ecommerce. In C++, doing web stuff was very difficult: build times were long, JSON support existed but was awkward, and iterating was just painful. Even when you had your whole build/deploy setup going, there was still a lot of work if you wanted to build a CRUD app, just marshalling and unmarshalling objects to the DB, etc. It was a drag at best and a real pain at worst.

Java... was a bit better, and I don't really have a problem with Java as a language. Java programmers, however, seem to really delight in building architectural monuments and get paid by the abstraction. In every Java system I have jumped into, you are always neck deep in XML, massive object hierarchies, and factory factories, and it's just like: where the F is the code that actually does stuff?! I remember working on one project where I was building the "engine" and another guy was building the web interface for it, using Spring when it was relatively new, and he happily declared "all I need to do is wire up my configs now, I am pretty much done." Narrator: A week later, he still wasn't done... and this wasn't unusual in my experience. In most Java apps "config" was just as complex and problematic as code, and the mindset of "it's just config" made config changes more likely to cause a production outage of some sort.

Then, you take a look at node. You look at a getting started tutorial. It's JavaScript on the front and on the back. The JSON in between is "native" and is convenient and easy to use, easy to read, lightweight, and just makes a lot of intuitive sense, especially when I had found myself neck deep in XML in previous jobs for the same tasks. I had a nice looking HTML5 web app running in a few minutes; my mind was blown. Then you take a look at the frameworks, express and hapi, and the vast module ecosystem, and how easy it was to build a simple CRUD website with leveldb, or mysql, or really an endless array of storage options. And people were using those options! It wasn't just the bog standard RDBMS being used every place, with your only real choice being mysql, postgres, or if you had money, Oracle. Building endpoints with routes in these frameworks made your code easy to divide up along clear lines, and there just wasn't the endless miles of boilerplate/scaffold code, or ugly syntax and type systems to fight with and plan ahead of. Things Just Worked. Turning around a code change was a matter of seconds, not a minutes long build process. I had never felt so productive, and writing code was fun again! Deploys were easy, restarts were fast. Rollbacks, when necessary, were painless. There was a plugin/module for everything (too much in hindsight).

Now, this was 6 years ago. Go was around, but still kind of a blip on the radar; Ruby/Python were probably the closest real contenders. Ruby had lost steam; I honestly took some cursory looks at it, but it didn't seem to have traction. Python suffered from its single threadedness and GIL, and its popularity was mostly with the ML crowd. Flask and such existed, but were pretty rudimentary compared to what Express/Hapi were offering, and no one seemed that interested in those projects. I like Go a lot, and for a pure backend service it might be my go-to today, as one of the original arguments for Node was "it's the same language on the front and the back, no more delineation between FE and BE developers, anyone can jump in and fix the bugs!" Which, along the lines of my original comment, didn't really work out in reality, at least not on larger systems. People drawn to FE work usually have never done real systems development and don't understand how things work under the hood, which isn't a problem until one day it is, and then it's a huge one.

The dynamic typing argument... is somewhat valid, but I found that enforcing API contracts with hapi/joi gave you the equivalent of type safety at your interface borders, while still giving you the flexibility of dynamic typing within your code. In fact, Joi went even further than just type checking: it could check that your int was within range for the field, that your dates were formatted properly, etc. In mega large codebases this will come back to bite you, but I found the plugin architecture of Hapi really discouraged that kind of crap from leaking in, and it was easy to build truly modularized code.

The performance ceilings aren't that different, and not that impactful, at least not until you get to FANG scale, and I mean literally only FANG scale. We were running a billion dollar business on 8 fairly small VMs for the API layer, which handled all of the ecommerce transaction handling. I remember at one point we encountered a memory leak of some sort in node, and the instances were falling over and dying about once an hour, but restarting and recovering; this was causing a few % error rates for our customers. I was insistent that we get all hands on deck to figure this out ASAP, and our head of Ops type person said "kevstev, we can throw hardware at this problem to meet SLOs until you get it under control. Your monthly server costs are less than my studio apartment cost me per month in Jersey City 15 years ago."

You just have to have a basic understanding of what's going on at an architectural level, something a few hours of the right reading and experimenting can get you if you have the proper background. The number of gotchas to avoid to get that performance was an order of magnitude, if not more, fewer than in a language like C++ (which I feel has actually gotten so complicated and difficult to grok that it's become a parody of itself, and I say that as someone who used it and adored it for 15 years).

Thank you so much for putting into words thoughts that resemble my own. I absolutely love how productive I am working with Node in JS vs anything else. I like C# and have enjoyed learning Rust, but nothing really compares.

Yes, there are a number of footguns, but that's true of any language and platform. You can do stupid things in any number of platforms and languages. I don't even see anything particularly egregious in TFA for that matter.

There are fortune 100 companies with systems handling hundreds of thousands of requests per second on a couple dozen servers in Node.js... Is it the absolute best performance for CPU intensive operations? Not really. Does it handle more simultaneous requests than everything else? Not even close. What it does offer is a really good mix of good enough performance with unmatched developer productivity.

> safety and correctness features

You can achieve safety and correctness features for node via good lint rules and typescript/flow.

You can technically achieve all of that even in an obscure language like Brainfuck but it doesn’t mean it’s a good idea.

Why would you duct tape hacks on top of hacks to achieve the result you want instead of using a language that has already all of the functionality built-in?

I find it really annoying how JavaScript is treated on developer forums like this one. Why is it that when the same exact things are done in a typical language it's called “tooling” but in JavaScript it's “duct tape hacks on top of hacks”?

Don't get me wrong, my favorite language is Rust, but pretending the JavaScript ecosystem is unusable doesn't make you cool. I can be extremely productive in TypeScript.

JavaScript itself is not a hack, it works fine for its original purpose - making webpages interactive.

What I consider hacks is the hundreds of different tools and dialects of JS that are used to bend JS into doing something it wasn’t really designed for.

JS is fine in the browser, and TypeScript is also fine there because you don’t really have the choice to run anything different (although Web Assembly might change the game soon). But on the backend you have the privilege to pick between dozens of different languages that are better suited to the task and support the features you want out of the box without layering hacks on top. Why not just go with one of those?

I can appreciate your suspicion of a dynamically typed language, but TypeScript isn't really a "hack"; it's a superset of the language. I am not a huge fan of it; to me it's kind of the worst of both worlds. If you need type safety, use something else.

However, if you haven't taken a look at Joi- https://hapi.dev/family/joi/?v=16.1.8 you probably should. To me this was a very happy medium: you can enforce your "types" at the point of ingress and egress at your API, and even do validations there, while still being able to enjoy the flexibility of dynamic typing within your code.
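For anyone who hasn't used Joi: the idea is that a schema validates shape, ranges, and formats once at the API boundary. A hand-rolled sketch of the same idea (this is illustrative plain JS, not Joi's actual API; the field names are made up):

```javascript
// Validate an incoming payload at the boundary, then trust the object internally.
function validateOrder(payload) {
  const errors = [];
  if (!Number.isInteger(payload.quantity) || payload.quantity < 1 || payload.quantity > 100) {
    errors.push('quantity must be an integer between 1 and 100');
  }
  if (typeof payload.sku !== 'string' || !/^[A-Z]{3}-\d{4}$/.test(payload.sku)) {
    errors.push('sku must look like ABC-1234');
  }
  if (typeof payload.orderedAt !== 'string' || Number.isNaN(Date.parse(payload.orderedAt))) {
    errors.push('orderedAt must be a parseable date string');
  }
  return { ok: errors.length === 0, errors };
}

console.log(validateOrder({ quantity: 2, sku: 'ABC-1234', orderedAt: '2020-01-15' }));
// → { ok: true, errors: [] }
```

In Joi, this whole function collapses to a declarative schema, which is the appeal being described here.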

Lol - I don’t think I’ve ever met someone as enamored with Joi as I am. I think I abuse it in all the same ways as you do. I even publish my Joi schemas for request validation with my API specifications.

We chose JavaScript/Node because it's the language we knew best, and because we have to write it in the client as we're building a web-based service. No other language offers us that, unless it compiles to JavaScript (eg TypeScript).

This grants us a magnitude of benefits, like being able to do server-side rendering and share code between the client and server. It also means we only need to hire people who can write JS, instead of JS and something else.

TypeScript, Flow, and ESLint aren't hacks. These are mature tools used by some of the largest, most sophisticated engineering teams in the world.

Most bugs encountered in production systems aren't type based issues. Types are more useful for developer productivity (e.g., intellisense) than any other purpose.

Previously worked at a Node company - at one point in an effort to improve code quality we ran statistics on errors we'd seen over the past [period - forget exactly]. Type errors were our most common source of error both by number of total errors and number of distinct errors.

I'd love to see any data or case studies that claim the opposite if you have any.


> Is a 15% reduction in bugs making it all the way through your development pipeline worth it to you?...


> 38% of bugs at Airbnb could have been prevented by TypeScript according to postmortem analysis

I've never seen a number far outside of the 15-30% range.

In my experience, most bugs are operator error. Developers didn't code for branching paths that should've been accounted for, etc.

Personally, I'm a fan of TypeScript. Just don't expect to remove the majority of your bugs via its usage. The old "no silver bullet" adage.

How is JavaScript worth it if it typically adds 15-30% more bugs? That’s an enormous figure.

Well, it runs everywhere, for one. That's an attractive quality to a language.

One could argue that all errors are type errors. There is no type system, static or dynamic, inferred or annotated, that is good enough to catch all bugs. :P

Just curious- did you use Joi or anything similar to try to at least verify at some point that you had a valid object?

A follow up, but I also wanted to highlight: plenty of typed systems (like Java microservices) have bugs. If 100% of bugs were due to typing issues, those systems would -never- have bugs. Ever. Yet we know that's not the case. That's another rationale for how the 1/4 to 1/3 fewer bugs ratio makes logical sense.

> When I was last doing this stuff, upwards of 80% of our time was being spent essentially just JSON.parse()'ing, and we were looking to move to protobufs to avoid that.

It's only tangentially related to your question, but I can't help but ask: why do people use JSON instead of protobufs at all?

I'm mostly a client-side developer, and most of my server-side experience is in hobby projects; still, I always used protobufs and loved it. They never damaged my feature velocity, apart from an hour to set up the build system in the beginning, and type safety helped me quite a few times when I forgot to sync changes in protocol on client and server side. Are there some secret advantages of going with json that I don't see because of limited experience?

JSON existed before protobufs is really it. When I left the node world 3 years ago, protobufs were the new hot thing. Any new project should start with them over JSON imho.

There is some friction to them though, and I think a lot of it is that most tutorials and beginner books like to keep things as simple as possible, so people start their little project, it gets traction, and then they figure out they need protobufs, but now it's hard to introduce them. In most projects, even today, it seems that it's the version 2.0 that gets protobufs; v1.0 keeps JSON for simplicity, unless you have a bunch of seasoned devs involved.

What does happen with the 5th request?

It waits in a queue in a background thread in the node.js http library until your code can execute to handle it. So if your code takes a long time before returning back to the event loop the request will just wait in that queue for a long time before the next opportunity for the event loop to execute code in response to the event.

It will get queued until one of the 4 requests in front of it has its task (returning the file) complete.

Well... yes. If you're using `express.static()`, or if all five requests are getting different things from the disk. If you're using something that caches your static content in-memory, then the first request will use the thread pool to read that content, but the other four requests won't touch the thread pool - it's all async IO at that point, so it's all happening "concurrently" in a single thread, being multiplexed right here: https://github.com/libuv/libuv/blob/1ce6393a5780538ad8601cae...

Is it really only 4? When I look at my VM stats I seem to recall it having something like 15. Of course, this could be a config change on the Node Alpine container.

I have a Node service where I get tens of thousands requests a second and I still thought Node was single threaded. Where can I read about this?

The event loop is single threaded. Async tasks are executed on I/O threads which are configurable. So if your app is I/O bound, the event loop will typically dequeue tasks pretty quickly allowing lots of requests.

That's a great interview question (especially if you're not so much into hiring :))

Another one is: what happens when a node process completes execution?

  // node ex.js
  function foo() { /* something async here */ }

This is a fun question to discuss (I think some consider this a bug in node).

Node.js is a good abstraction layer. In my experience, everything gets leaky once you get hundreds of concurrent users.

The only way this makes sense to me is if they have to contend with lots of expensive parsing, event sequencing, and throttling requirements. Payment APIs, bank websites, etc can be quite byzantine. I could understand how one might code yourself into a corner with a monolithic node app and basically just say "F-it, we're doing this synchronously!"

I don't even think it's a terribly bad thing to do assuming it favors feature velocity.... but at that point, I'd recommend moving away from Node towards something like Python. And if you wanted to dip your toes back into async plumbing land, explore Go or Elixir.

> explore Go or Elixir

I have never seen a good argument for using golang for business logic. If you are writing the actual server then sure, use golang. If you are writing some high-speed network interconnect, use golang. Some crazy caching system, sure use golang. The public WS endpoint, use golang.

But if you need to access a DB with golang for anything more than, like, a session token, then you made the wrong choice and you need to go back and re-assess.

Elixir is in the "germination phase" and I predict massive adoption in the next 5 years. It is a truly excellent platform, every fintech company I know at least has their toe in the water. Everyone I show this video to [1] just says "well, shit."

[1] https://www.youtube.com/watch?v=JvBT4XBdoUE

What is wrong with accessing DB from golang?

Nothing. But I imagine with “business logic” you’d favor expressiveness over speed and type safety.

You hit the nail on the head here. When N different API requests simultaneously time out – all because a ramda.uniq call in one of them received an array of 100,000 nested objects – it's easy to make a spot code fix, but harder to systematically prevent it from happening in the future. There aren't really linters for "bad event loop blockage". Code reviews are the main tool we have, but you'd be surprised what sorts of logic can trickily block the event loop. For API reliability and development velocity in the short-term, by far the easiest approach was to throw more infrastructure at the problem.
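For the record, the usual mitigation for a call like that (when the work can't be moved off-process) is to chunk it and yield between chunks. A sketch, where the chunk size and `keyOf` helper are illustrative choices, not anything from Plaid's code:

```javascript
// Dedupe a large array by key without monopolizing the event loop: process
// fixed-size chunks and yield with setImmediate between them.
function dedupeChunk(seen, out, chunk, keyOf) {
  for (const item of chunk) {
    const k = keyOf(item);
    if (!seen.has(k)) {
      seen.add(k);
      out.push(item);
    }
  }
}

function uniqInChunks(items, keyOf, chunkSize = 1000) {
  return new Promise((resolve) => {
    const seen = new Set();
    const out = [];
    let i = 0;
    (function step() {
      dedupeChunk(seen, out, items.slice(i, i + chunkSize), keyOf);
      i += chunkSize;
      if (i >= items.length) return resolve(out);
      setImmediate(step); // let other requests run between chunks
    })();
  });
}

uniqInChunks([{ id: 1 }, { id: 2 }, { id: 1 }], (t) => t.id)
  .then((unique) => console.log(unique.length)); // 2
```

Comparing by an extracted key is also O(n) rather than pairwise deep-equals, which helps the 100,000-transaction case on its own; the harder problem, as the parent says, is systematically catching this class of code in review.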

We do use Go for almost all of our other services, and there are an increasing number of integrations written in Python. But we're still using and investing in our Node integrations code for the foreseeable future, and this was an important step for simplifying our infrastructure.

We certainly hope the tooling and rollout process in the post were instructive for anyone using Node, even if their stacks were pristine from day 1 and never need this sort of complex migration :)

> I'd recommend moving away from Node...

Taking a wild guess: Some of their bank integrations probably require browser automation. If you're doing browser automation, the best tool for the job is (currently) Puppeteer, which runs on Node. There are other third-party language bindings for the Chrome dev tools protocol, but Puppeteer is developed by Google as a first-class citizen alongside Chrome.

I think that overemphasizes Puppeteer itself.

It's really just bindings for the dev tools protocol.

Half the GitHub issues result in "well the protocol requires X and we can't change that".

Puppeteer is popular because it's web automation protocol bindings for a web language, not because it's a sophisticated layer or does very much.

There are literally dozens of language bindings for the protocol. [1] Some are quite good and widely used, for example chromedp (Go bindings). [2]

[1] https://github.com/ChromeDevTools/awesome-chrome-devtools#pr...

[2] https://github.com/chromedp/chromedp

4000 chrome instances? Probably not. Here I am trying to run 4 chrome instances in parallel in CI without crashing.

Presumably not every integration requires browser automation, so they might not all be going at once. But they have a $25k monthly EC2 bill, so it's not out of the ballpark.

FWIW, I reliably have 6 puppeteer/chrome instances (headful, even) going on a single box and it's not even at half capacity.

Check out the selenoid project. It’s like a selenium grid but dockerized with novnc, etc.

That was my thought too. They've got a problem where they've got no idea what a given transaction costs, and some unpredictable amount of transactions result in serious work that holds up the event queue.

God knows they could be waiting for some reel to reel tape to spin up somewhere...

The whole point of async I/O is to be able to do something useful while waiting for tape to spin up.

I don’t buy it.

But the whole point of synchronous I/O is to isolate the programmer from having to think about the fact that spinning up tape takes non-zero time. I have a feeling that this gets lost sometimes in all that "async I/O is the GREATEST!" craze.

Async is nice - if you can handle it. But this is not easy to do in complex systems and processes. It is certainly easier to work with an old-fashioned process that blocks when waiting for whatever you need to wait for, and just scale by letting the OS run lots of those in parallel. Sure, it's less efficient. But it's easier for the devs to handle.

I just read the hidden undertone of this article as "our devs aren't that smart after all".

But you need to know if you can do that something first, or if you've done that something too many times in the last N minutes (and could get blocked, forcing thousands of other somethings to get endlessly queued). Or if that something could take too long, and actually you could be doing 200 other somethings in the same time etc. It's not that simple.

The article certainly raises more questions than answers that's for sure.

Haskell also has very nice concurrency IMO.

Their velocity might have been slowed by figuring out how to manage 4,000 containers effectively. If they had dealt with managing concurrency effectively sooner, they would need 30x fewer containers: 133.

Not so much, they're using ECS which takes care of a lot of those headaches and sounds like they're coordinating with a load balancer / reverse proxy for distributing those requests... A 1-1 request model in that kind of system is really simple to setup. Setting up to orchestrate multiple requests per node was probably much more time intensive.


That was an interesting read, thanks for linking to it. It's hard finding articles online discussing Node and performance, most people just dismiss it as an unviable option due to scale and speed concerns. 30x really is quite the jump though.

> Each Node worker runs a gRPC server

Not going to lie, this kind of surprised me. When I think of a Node backend I think of ExpressJS. Not because I think Express is better, but because it's been pushed around in the past few years as the fastest, simplest way of running a backend.

Yet, if you're going to be running a gRPC server, why not use a more performant language with better multithreading support? I thought this article was about them optimizing a grandfathered-in solution (such as Express), but I can't tell why they built out a gRPC server in Node in the first place.

Our integrations are primarily written in Node, which was the original language used for everything at Plaid. Almost all of those original services (except for integrations) have been migrated to Go or Python at this point. We've standardized on gRPC as our wire format, so we stayed consistent and used gRPC in Node.

With perfect hindsight, it's a fair point that all the pros and cons could net out to another language being best for our integrations. Integrations are the largest and most quickly-changing codebase at Plaid, so such a migration would be a massive undertaking. We definitely didn't want to block scalability improvements on doing a language migration.

I've been hoping that the Cloudflare folks will open source parts of their Workers; they seem to have figured out a secure, performant way to run untrusted javascript at scale.

The Node gRPC implementation is fine. It uses the C++ implementation, which is the gold standard. It has Prometheus and OpenTracing interceptors. You basically give nothing up by using it, if your team wants to write in a language that runs on Node.

The bigger issue to me is that (at least the last time I looked) you can't use the cluster module with gRPC in Node, so the only real way to take advantage of extra CPU capacity, if available, is worker threads or self-managed external processes rather than cluster integration.

I think you can just run one node.js per core (or whatever the optimal balance is) and tell your load balancer that there are instances of your service available at hostname:8080, hostname:8081, etc. A lot of people are going to get this "for free" when they tell their container orchestrator that they want 8 replicas that each request 1 cpu, and the scheduler finds a node with 8 free cpus.

I’d be curious to hear more about the circumstances that ended up with a blocked runloop. Are there hundreds of junior engineers, or perhaps third parties writing code that you don’t control? I have seen people accidentally write blocking code, but not at such an egregious rate that we couldn’t catch it in code review, or at worst the runloop detector would alert on it in prod and we would roll back the deploy.

For instances where you actually know you need lots of CPU, there are now strategies for offloading that specific work, although they have taken a while to get nice and easy to use.

Sure, one example I remember off the top of my head is a bank that sometimes returned duplicate transaction data, so an engineer had called ramda.uniq on the transaction array. Transactions are nested objects and slow to compare, so when you find an account with 100,000 transactions... kaboom. Some scenarios are more subtle, but a common theme is that the amount of data in an account can vary by many orders of magnitude.

> We were running 4,000 Node containers (or "workers") for our bank integration service. The service was originally designed such that each worker would process only a single request at a time. This design lessened the impact of integrations that accidentally blocked the event loop, and allowed us to ignore the variability in resource usage across different integrations. But since our total capacity was capped at 4,000 concurrent requests, the system did not gracefully scale.

I can't be the only person who reads stories like this and wonders how they arrived at that solution in the first place?

They failed to scale because their previous approach was a worker per request – a model that was roundly moved away from industry-wide, because that's how CGI and Apache modules worked and it didn't scale well.

I thought one of the key selling points of Node was a fully async standard library, enabling better scaling in-process.

But then you read stories like this, and I find it hard to relate to the original problem.

There are a couple of reasons that the legacy scaling model was viable for us. As mentioned in the post, only 1/10 of our traffic was from the API, which gave us a roundabout way to scale by diverting resources. And it's only viable to use this model of scaling when the business value of a request is high – we were originally quite happy to spin up more containers when we reached our scaling limit. That's the pragmatic reason why we were processing one request per container.

In terms of what issues caused us to move away from parallelism in the first place, it was all the CPU-bound stuff that you might expect: ReDoS-style issues, post-processing arrays in very large edge cases, programmer error, etc.

> In terms of what issues caused us to move away from parallelism in the first place, it was all the CPU-bound stuff that you might expect: ReDoS-style issues, post-processing arrays in very large edge cases, programmer error, etc.

But these are not parallelism problems. These are single threading problems, which the core problem with Node.js, not parallelism in general. Hence I think the question stands: why did you choose node for this?

It was chosen about 6 years ago when the product was first being developed, so most of us on the engineering team weren't around when the decision was made. The main choice we're making at this point is: what's the impact and ROI of a language migration vs getting Node to work as well as we can?

> what's the impact and ROI of a language migration vs getting Node to work as well as we can?

Hiring an architect costs what? Putting the genie back in the bottle is a problem Plaid baked into its early success, which is common with startups that hire engineers with zero architecture knowledge.

Yeah, I don't get it either, at all. The original poster wrote below:

> In terms of what issues caused us to move away from parallelism in the first place, it was all the CPU-bound stuff that you might expect: ReDoS-style issues, post-processing arrays in very large edge cases, programmer error, etc.

But it's trivial (a single line) in Node to place breaks in CPU-bound processing so the event loop can fire, and as for "programmer error"... many commenters below are also complaining that async programming is too hard or finicky.
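For instance (a hedged sketch, not anyone's actual code), the "single line" is typically an awaited setImmediate inside the hot loop:

```javascript
// Process a large array without starving the event loop: every CHUNK
// iterations, yield so queued timers and I/O callbacks get a turn.
const CHUNK = 1000;

async function sumWithYields(items) {
  let total = 0;
  for (let i = 0; i < items.length; i++) {
    total += items[i];
    if (i % CHUNK === CHUNK - 1) {
      await new Promise(resolve => setImmediate(resolve)); // the one-line break
    }
  }
  return total;
}
```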

But that's like complaining about C because pointers are hard, or Java because OOP is hard, or databases because planning indexes is hard.

Once you "get" async, pointers, OOP, or indexes, it's easy. And it's part of your job as a professional programmer to get it. Async is no trickier than anything else.

The setup in the first place makes absolutely no sense to me, using a language exactly opposite of how it's meant to be.

In one case I cited elsewhere in the comments, an engineer had called ramda.uniq on an array of nested objects which was occasionally very large. When calling into external packages, I don't think we have as much control over yielding to the event loop, but I could be wrong. I know that there are some JSON/regex libraries that give you some protection on this front.

I agree that it would be nice if all developers were infallible – I'm reminded of a friend describing their company, where "we don't write tests because we all write good code". At a certain point, you have to look for processes – linters, monitoring, testing, language choices [1] – where people can't shoot themselves in the foot. (Code reviews being only moderately less fallible than a single engineer.) It's not enough to just say "be better" whenever bad code is written.

I think when the decision was made (years ago) to handle a single request per container, they couldn't find such a process to prevent event loop blockages, other than migrating an already-large codebase away from Node. As others have pointed out, maybe such a migration is necessary – after all, event loop blockages are still an inherent risk because of how Node works. It's just a lower risk than it was a year or two ago, because we've significantly improved our usage of the event loop, and also have tooling in place to catch blockages before they become an issue.

[1] https://news.ycombinator.com/item?id=18564643

Oh, interesting about external libraries.

Yeah, external libraries for Node ought to be designed so that any function that might ever take a meaningful length of time is callable as async. But if they're badly designed or not intended for large inputs, they might not be. You'd definitely need to find another library or write your own there, so I get that.

The issue is that many developers coming from synchronous programming don't get asynchronous programming. They could both improve the code by not writing blocking code, and also use something like the cluster module (https://nodejs.org/api/cluster.html).

I get that.

What I don't get is how nobody treats it as an issue when developers coming from Python or Java to C don't get pointers.

The assumption is that you learn.

But for some reason, people think it's "OK" to not get async, that it's the language's fault rather than the programmer's. That's what I don't understand. It's like a different cultural standard gets applied.

I agree with you, but I think many developers get reluctant to change when they have been doing something one way for a long time, especially if they feel that one way works fine. I can also understand the position, as sometimes it can be fatiguing when technologies are constantly changing. For this project though, if they are actively going to avoid asynchronous programming, they may have been better off choosing a synchronous language.

ubu7737 7 months ago [flagged]

There is simply no excuse for this. You are a software engineer or you are not, the cadence of change is part of the technology aspiration.

I have no patience for persons who don't belong to the discipline.

At some point you have to take a step back and realize you've grown beyond your tech and reach out for something else. Elixir sounds like a great fit for these problems.

For example, Discord reached for Rust and built tiny Rust components that are called from Elixir for their server user list. Some servers have 200,000+ people online, and Elixir wasn't cutting it performance-wise. Rust – boom, now it works.

I feel like this article is missing a crucial piece of information: why was integration code blocking the event loop in the first place?...

Agreed. This is the second scratch my head moment from the Plaid engineering team blog recently.

They didn't actually understand Node very well at first and then later they figured it out.

Related, from the article:

> We hypothesized that increasing the Node maximum heap size from the default 1.7GB may help. To solve this problem, we started running Node with the max heap size set to 6GB [..], which was an arbitrary higher value that still fit within our EC2 instances.

Sounds like they were utilizing their EC2 instances very poorly. Why not run more workers per instance, or switch to an instance type with less RAM (or more CPUs)?

They were using ECS, and it also looks like they had to work through a couple of bottlenecks to get multiple requests per node working well... I think they could get further by using the newer Worker Threads API, since gRPC doesn't work with the cluster module.

Pretty sure apache + cgi would scale better :)

> I can't be the only person who reads stories like this and wonders how they arrived at that solution in the first place?

No, you are not. I wonder which CTO would allow this; like everyone here, I find the exact use case not really clear (or at least why this solution is a great fit for it), but it sounds like a weird and expensive solution to some issue. I really don't understand these "solutions", and I am almost 100% sure I (with a team – though the point is that this is not the best solution for the problem) could whip up something far simpler and more efficient. But of course there may be problems it does fit?

I wonder what percentage of the massive compute power of huge cloud data centers is spent just chugging away on ugly clunky hacks to run bad code?

Quite a large percentage. I worked on a site that got 1 request per second and they were able to handle it by spinning up like 20 VMs. Turns out they were just using Entity Framework wrong. Whoops.

But also, you have to consider that most places aren't Plaid, and at most places developer time is more expensive than throwing an extra machine at the problem.

I got the same feeling. We use node and usually split to 500 concurrent requests per process.

Still interesting...

The way it always happens: many people working toward a solution, not agreeing on one, and then compromising on something in the middle, even if it makes no sense.

> I thought one of the key selling points with Node was an fully async standard library, enabling better scaling in process.

We still have an event loop that is trivially blocked by very simple programmer errors, destroying the whole advantage that you describe here.

The fact that Node ships a fully asynchronous standard library doesn't in any way fix the fact that Node is a runtime for a language that itself is a mistake.

> We still have an event loop that is trivially blocked by very simple programmer errors, destroying the whole advantage that you describe here.

So they fixed the issue that some requests blocked... by making all requests blocking.

This is the worst kind of software engineering.

There is a massive deadlocking design mistake in the centre of the language - literally a huge red button with DO NOT PRESS printed on it. Thousands of programmers pass it by every single day, or hour, or minute, and the creators of the runtime insist that it is impossible to fix that button whatsoever; instead, all users need to work around it by ensuring that their code in no way presses that red button on purpose or even by accident.

These people insist that it is impossible to program normally and in a language that is actually sane and does not advertise obvious and gaping design mistakes as "features of the language". These people advertise the analogue of Python's Global Interpreter Lock as the core foundation of their language.

These people advertise Node and the language it implements as practical for implementing multithreaded applications. Posts such as this show what sort of bullshit it is; it is only practical to use Node for parallelism if each single Node instance is only ever run single-threaded. You don't parallelize by running multiple threads, you parallelize by running multiple Node runtimes.

This is no longer an act against productivity or usability. This is simply insane and shows one of the most basic things that are wrong about Node's language and approach. It is impossible to write a multithreaded program if your language of choice makes it trivial, and practically unavoidable, to globally lock your whole runtime with every single line of code you write and import as your dependencies.

Running a Node worker for each thread is standard practice.

No different than having a dedicated threadpool for asynchronous programming on the JVM.

Yes blocking the event loop is easy. No it's not THAT easy. I've never done it because you think about it while writing code. It's part of the environment. I have had to fix lots of reports etc that try to load up the world and iterate through it in a loop that doesn't yield to the event loop. It's possible to never make that mistake by understanding your environment (just like how managing pointers in C is "hard").

> Yes blocking the event loop is easy. No it's not THAT easy. I've never done it because you think about it while writing code.

To be pedantic, all JavaScript functions block the event loop. It's just that the vast majority of functions execute so quickly that the amount of time your function blocks is very short.

I once had a loop that processed tons of data and would block the event loop for 1-3 seconds. I ended up solving it by writing an asynchronous loop which used promises and process.nextTick() to make each iteration a separate execution block on the event loop. But I've only had to do something weird like that once in 10 years of node development.

Right sorry. I meant for a significant amount of time that took down production or caused serious issues. I've taken down prod in other ways though :p

At one company I built API middleware that calculated time spent in the event loop vs waiting on dependencies (other apis, Redis, etc) by managing a couple counters per request. It was very helpful.
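Something like this is enough to catch blockage (a minimal sketch; the function name and thresholds are made up, not the middleware described above): schedule a timer and see how late it fires.

```javascript
// If a 100ms interval timer fires 400ms late, the loop was blocked ~400ms.
function startLagMonitor(intervalMs = 100, onLag = console.warn) {
  let last = Date.now();
  const timer = setInterval(() => {
    const now = Date.now();
    const lagMs = now - last - intervalMs;
    if (lagMs > 50) onLag(`event loop lagged ~${lagMs}ms`);
    last = now;
  }, intervalMs);
  timer.unref(); // don't keep the process alive just to monitor it
  return timer;
}
```

Newer Node versions also ship perf_hooks.monitorEventLoopDelay, which does this properly with a histogram.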

I mean, running multiple node runtimes (aka multiprocessing) actually sounds like a reasonable compromise for parallelism. That's the standard solution for dynamic languages without great multithreading support. If you needed great multithreading support then Node probably wasn't the right choice for you in the first place, but for most applications, it's probably fine.

However, running multiple containers for parallelism sounds a little bit crazy. In the worst case, each container may be running on its own server, but even assuming multiple containers per host, I'm guessing they were running a not-insignificant number of instances, which is probably why they were able to save $300k in server costs.

Yet this is the case mentioned in the article.

> We were running 4,000 Node containers (or "workers") for our bank integration service.

Yes, I agree. I'm arguing against your characterization of node as a poor runtime. For many line-of-business applications node is a fine choice. However, plaid has to integrate with many banks which only expose web pages, not APIs, so I'm guessing that they have to do a non-trivial amount of CPU work to scrape and process HTML responses. For this, it may not be such a good choice.

All I'm saying is that the choice of language is (usually) not the issue. Poor architecture design causes a lot more problems than whether you choose python/java/ruby/node for your webapp.

Yeah, I'm confused why they didn't start with running multiple processes per container. There must be a reason?

> [javascript] makes it trivial, and practically unavoidable, to globally lock your whole runtime with every single line of code you write and import as your dependencies.

As someone not well-versed in js, could you describe one such case? Concurrent access to a global from two threads? Mutexes? My background is more with systems languages and I have done very little js for the browser, so I do not see that big red button.

An event loop is basically a single thread that executes functions from a FIFO queue. A function can put itself on a queue by "yielding", which means, allowing other functions to execute.

If your function blocks, for example, by performing a wait on something without yielding, then nothing else gets computed because of the wait, but the event queue is occupied since the function has not yielded. This breaks the cooperative part of the cooperative multithreading mechanism of Node.

All right, thank you. So it's basically the same as any other non-preemptive multitasking design.

If the event loop is part of the Node language/runtime, I can see the case for making it preemptive.

> So it's basically the same as any other non-preemptive multitasking design.

All this has happened before. All this will happen again.

In Python, you can execute time.sleep(10) in an asyncio program, which means it won't yield control back to the event loop, effectively preventing the runtime from doing anything for 10 seconds.

In Javascript world I guess you could do the same by replacing time.sleep with some CPU bound code. eg a big calculation or an infinite for loop.
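The JS equivalent is a sketch like this (the helper name is hypothetical):

```javascript
// A synchronous busy-wait: CPU-bound, never yields, so no timer, I/O
// callback, or other request can run until it returns.
function busySleep(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) { /* spin */ }
}

setTimeout(() => console.log('I only run after the block ends'), 0);
busySleep(100); // the whole runtime is frozen for these 100ms
```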

Yes... often JSON serialization and deserialization are the biggest blockers, and if they're using Puppeteer (as suggested in other comments) that can have its own instance overhead. I think with ECS they used the "easy" button to start off with, and now that they've hit those limits they've had to refactor, which isn't unreasonable imho.

I'm working on a project now that's very CPU bound and using limited workers behind an MQ as a distributed RPC behind a fronting API interface so that I can handle scaling... though it's also a Windows-only library involved. There's definitely an art to scaling certain types of workloads and many different options. Sometimes the simplest solution you can come up with really is the best option.

The blog in question indicates some uncommon and questionable engineering practices with Node.js. There are likely hundreds of success stories for every one like that.

The first Node.js service I wrote and maintained processed thousands of requests in parallel and ran successfully in production until the company it was developed for ran out of money.

But how many of these "questionable" practices get held up as engineering marvels by the creators of the monstrosity? When you look at this blog post, you can see the author really felt a sense of pride in this Frankenstein and wanted to show the whole village.

I didn't get that impression.. what I got was a few probably common issues that will come up when trying to handle more parallel requests with Node. Note: gRPC doesn't support cluster, so you'd have to manage your own multi-thread (Workers) or multi-process (fork) options.

Multithreading is still possible – for clarity of code. Multiple processing threads can be in flight concurrently, but only one runs at a time. This covers only a subset of useful applications, but it's not totally worthless.

Yes, it is possible, and actually works - unless one of your threads blocks the event loop and prevents all others from running. This way it takes a single locking thread to halt the whole system, whereas with standard threads a single stalling thread does not cause all others to starve. This is a risk in Node that multiple other runtimes of other languages do not have.

Are you seriously arguing against event loops as a category? Blocking in non-blocking code is going to be an issue whether the language is C or JavaScript (duh).

For these kinds of programming, yes, I argue against it. A single stalling function in Node deadlocks the whole system; a single stalling thread in the C++ model still permits other threads to run. This is a risk that is completely avoidable by not using languages which require event loops at their core.

JavaScript doesn't require event loop design. You can do a PHP like backend design with JS, where each request is handled by a fresh process, and all JS functions block. There's nothing in the language that prevents this.

Some features would become unusable, like Promises and async/await, but those would be worthless in such a design anyway.

Nobody does it. Nobody implemented it yet. Nobody made it production ready. That is what a few minutes of googling gave me.

As much as I'd like to see it, your JS without an event loop is a purely theoretical construct so far.

I did blocking design using Duktape for my runtime Linux distro reinstallation tool and some other projects. It works fine and it's easier than implementing async aware bindings.

I wouldn't implement it for the HTTP server/backend use case, because I find the blocking/share nothing architecture of PHP limiting.

The point is the language doesn't prevent this kind of architecture.

It's simple but effective.

>by very simple programmer errors

I can't help but feel that is also an issue, and in another given language this particular issue might not happen... but they'd hit another.

It's easy to say "don't use X, so that problem Y won't happen", but hard to predict what happens when you move from (language, platform, or whatever) X to (language, platform, or whatever) Z... and I suspect people often hit new issues and realize that maybe X wasn't the problem.

I see it all the time and I feel like "Wait guys, I'm not sure we're fixing the right thing!?!?"

This article raises a lot more questions than answers IMO.

Async is just modern cooperative multitasking, and just like the 90s, it's easy to accidentally lock the whole system.

Yeah, I remember just how nice it was going from Cooperative MT to Preemptive multi tasking -- the general view was that anything that only did Cooperative MT was just a Toy.

I'd bet that the orders of magnitude of speed from Moore's law did in CMT by making PMT doable without a huge speed hit.

Nothing I've seen from async is cleaner, easier to maintain, or better from a cognitive load POV. It's just more efficient for certain types of loads because you're being a consenting adult and not breaking things.

CMT for concurrency + DLP for parallelism is a hell of a lot more scalable in LOC than the unsafe hammer of preemption. We keep unsafe parallelism to only tiny GPU code snippets.

Node's concurrency and parallelism problems don't require deep language changes, just runtime ones and some ergonomics to get closer to Go – and it's not far off IMO, because of the last few years of work on async, safe (fresh-env) eval, and JSON/buffer messaging.

More exciting would be something like Apache Arrow / Berkeley's 'Plasma', but that stuff is still more exploratory.

For folks looking for a good primer on the terminology and history of cooperative vs preemptive multitasking, I paused reading this thread to look for an article.


This article does a good overview

What language were you using that had preemptive multi tasking?

We are no longer in the 90s. The code has increased in volume a hundredfold and it comes from literally everywhere. You can no longer trust everything on your machine or your network to be bug-free or otherwise non-hostile.

Creating a system in the 21st century that tries to follow ideals from the 90s gives us the kind of idiocy that we can witness here.

I think you may have misinterpreted earthboundkid; the claim isn't that it worked in the 90s, the claim is that it was already broken in the 90s.

You are otherwise on the right track, though Node does technically have one advantage, which is that it is a cooperatively-scheduled island in a preemptively-scheduled overall OS. In the 1990s, when the cooperatively-scheduled program was not cooperative, you locked the machine, not the process [1]. There is a reason why Apple went very aggressive with the OS X rewrite; the previous Systems had basically written themselves into a corner where they had to use cooperative multitasking because so much code made use of the implicit promises it provided, yet they could no longer afford to compete with Microsoft if they didn't get off of it, because the complexity just kept going up, up, up and the problem was going to continue getting exponentially worse.

For a Node program, you only have to account for the Node program itself, not everything running on the computer. Still, you're in the same exponentially-growing-complexity trap (with a very initially-safe-seeming low exponent, but it still gets you in the end), you just reset yourself back to a point earlier on the curve.

[1] There are various details, caveats, interrupts, etc, the picture is more complicated than one sentence can convey, but the principle still held and it was still possible to wedge the machine fairly badly for varying periods of time with simple bad code.

If I thought my bank was running thousands of node containers in parallel to handle transactions, I think I'd look for a new bank.

I mean, if you write slow code in a high throughput environment you'll just kill the CPU from context switching between threads instead.

It doesn't matter if it's in an event loop or thread per request. Architect things correctly.

> trivially blocked by very simple programmer errors

Can you give an example please?

I think it's much easier to block a thread with C#'s async programming model than node's...

Node only has one thread. Everything else follows.

No it doesn’t. We’ve had good models for concurrency in single-threaded systems for a while now.

You say no it doesn't and then speak about concurrency models for single-threaded systems. Choose one :)

Node can't not be single-threaded, because JavaScript is. Node.js is single-threaded. It has a single event loop in a single thread, and all the "concurrency" is simply queued on that loop. It offloads some system-related tasks to libuv, but that's it. And the thread pool that libuv creates is very limited.

Anything that doesn't end up in libuv (that is, probably vast majority of user code) will only ever run in one thread... because Javascript is single-threaded hence V8 is single-threaded hence Node.js you get the gist.

And of course, node.js even has a separate documentation section titled "Don't Block the Event Loop (or the Worker Pool)" [1] because it's trivial to block the event loop.

[1] https://nodejs.org/en/docs/guides/dont-block-the-event-loop/

Not OP, but… concurrency is not the same as parallelism. This is an important hair to split:


"Concurrency" doesn't warrant scare quotes even when describing a single threaded program.

It's true that Node.js allows one task to block many others, but that's an implementation detail of Node.js, not a guaranteed result which necessarily applies to any program using only one OS thread. Other programs suffer from analogous starvation/deadlock/livelock/priority inversion problems, but those would be implementation details too, not guaranteed results from using multiple OS threads.

I never said anything about other programs. I never said that other single-threaded programs don't suffer from the same problems. That is sort of the point. Node.js is single-threaded and, as a result, has all the same problems a single-threaded program has. And no amount of hand-waving can change that fact.

You’re focusing on single-threadedness like that’s a problem but it’s not. For example, you can run multiple processes on a single core system just fine — even if one of those processes hangs.

AFAIK there’s nothing in the JS language that forces a single-threaded implementation. I think the real reasons that Node is single threaded are a) legacy and more importantly b) a lot of existing code would break if events were dispatched in parallel.

> you can run multiple processes on a single core system just fine — even if one of those processes hangs.

Yes, I can. Node.js doesn't do that.

The model being "don't do that".

Nodejs has an event loop running in a single thread, but most calls you'll do will be async (and therefore let other operations continue to be executed by the event loop). The entire programming ecosystem is async by default.

The only thing I can think of where a programmer could "easily" block the event loop would be if they explicitly use sync filesystem or other blocking calls instead of the async API. But in any project with reasonable code review I don't think this would happen.

> The only thing I can think of where a programmer could "easily" block the event loop would be

Parsing JSON. Or executing a regex. Or anything, really, that blocks the thread: https://nodejs.org/en/docs/guides/dont-block-the-event-loop/

There's no magic

Well it's a question of magnitude. In theory everything blocks the loop even if it's only 1 CPU cycle before it yields.

In practice I've rarely seen production nodejs applications cause significant CPU blocking issues. Huge JSON parses sometimes, yes, but then it has to be one hell of a payload to cause any significant issue.

Regexes? Have you really seen regexes block CPU for significant time relative to the rest of the application? I'm sure it's possible with a crazy enough runaway pattern, but I've never seen it happen.

I really don't understand what people are getting at here. Node.js is an async programming model; it's non-blocking by default. Are the people saying it's trivial to break Node.js even Node.js developers?

My original response was to this [1]

> I think it's much easier to block a thread with C#'s async programming model than node's...

Which spawned a discussion in which people seemingly think that you can't block a thread in node, or that since node is async it means it's not single-threaded etc.

Since it is single-threaded, it's quite possible that the original post wouldn't need 4000 node instances otherwise (emphasis mine):

> We were running 4,000 Node containers (or "workers") for our bank integration service. The service was originally designed such that each worker would process only a single request at a time. This design lessened the impact of integrations that accidentally blocked the event loop

[1] https://news.ycombinator.com/item?id=21782591

I wonder if all this was, at root, picking up jobs from a message queue where they only wanted each process to have one job in flight at a time.

> I can't be the only person who reads stories like this and wonders how they arrived at that solution in the first place?

Here's how it probably worked: they liked Node, they liked containers, they put Node into containers and it worked, and they stuck with it as the user base grew.

I've encountered different issues with NodeJS services in the past (and still do), both with CPU bottlenecks and heap allocations. So I wrote openprofiling-node [0] this summer to help me profile my apps directly in production and export the results to an S3 bucket. I believe it may help someone else here, so I'm posting it.

[0]: https://github.com/vmarchaud/openprofiling-node

On a positive note: this was a good write up.


...Or you could just use Erlang or Elixir, where concurrency and parallelism come pretty much out of the box, with very little effort required for you to fine-tune the desired policy / strategy.

The insistence on using Javascript is just beyond lunacy at this point.

Well, if Elixir had a type system like JavaScript has, I'd instantly switch to it. But for now I'm staying with Node because of TypeScript.

True, it doesn't have it. Between pattern matching and function guards however, it has a decent way to protect against common errors.

The true treasure is Erlang / Elixir's runtime though. The parallelism, the self-healing, the preemptive scheduling.

They write (somewhere in the middle)

> Since V8 implements a stop-the-world GC, new tasks will inevitably receive less CPU time, reducing the worker’s throughput

But there is this Google blog post from January 2019:


> Over the past years the V8 garbage collector (GC) has changed a lot. The Orinoco project has taken a sequential, stop-the-world garbage collector and transformed it into a mostly parallel and concurrent collector with incremental fallback.

So I guess they used an older Node.js version. The current LTS version is 12.x, and it's from around the middle of this year.


PS: If the blog author reads this, there is an accessibility problem with the Google-hosted inline images. If I try - without ad blocker - in an anonymous window I see none of the inline images. Logged into Google with my own account I can see some but not all the images. Apparently which images I can see depends on being logged in to my Google account? I also tried IE Edge just to see if the browser makes a difference - no inline images visible there either.

When I try to view the image in a new tab, I get:

Your client does not have permission to get URL /Iw-RdHoPjbwuSAqJHK3C0Sy8m29NqzeHPtmJ7CVFuYqwr4CbwpGjwn9O4bcDNtCf_hLD4FGc75nkQYnJBgyA-CT2ikBDWQD-nAtqxXa4Lw2yDuh_-ywcsDaer6m4LyVtljwfrajO from this server. (Client IP address: [redacted])

Rate-limit exceeded That’s all we know.

Fixed the images about half an hour ago, sorry about this!

Ditto, images weren't showing up for me

Compared to a compiled language, node / JIT langs make it difficult to know what will be fast in prod.

V8's JIT means that things like the order of keys in an object, or the number of different call sites of a function, might affect whether your function gets optimized.

And there's no easy way to find out if a JS function is falling back to slow mode, or to tell the build system "this is a hot path, don't let me write code that deopts this call".
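A minimal illustration of the key-order point (the optimization behavior itself is a V8 implementation detail and version-dependent, so treat this as a sketch, not a guarantee):

```javascript
// Two objects with the same keys added in a different order get different
// hidden classes in V8, so the property access in getX can go from
// monomorphic to polymorphic depending on which shapes the call site sees.
function getX(point) {
  return point.x;
}

const a = { x: 1, y: 2 }; // shape {x, y}
const b = { y: 2, x: 1 }; // shape {y, x} -- a different hidden class

let sum = 0;
for (let i = 0; i < 100000; i++) {
  sum += getX(i % 2 === 0 ? a : b); // this call site now sees two shapes
}
console.log(sum); // 100000

// `node --trace-deopt` and `--allow-natives-syntax` (%GetOptimizationStatus)
// are the closest things to "finding out", though neither is build-system friendly.
```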

It's not clear from the article why they were only able to run one request per Node process, but that alone makes it questionable why they used Node at all: the entire point of the environment has been nixed. It's hard to understand from the article how they arrived at that point in the first place.

"Only 10% of Plaid's data pulls involve a user who is present"

Since they provide an API, it seems like some of the calls where they think a user isn't present might actually have one present.

We thread knowledge of whether a data pull was initiated by the API or by our cron-style service into our load-balancing layer, so this ends up being pretty straightforward.

Ahh, got it. The "present and linking their account" part threw me off. Sounded like only the "linking" call was getting the fast lane.

The other 90% are not triggered by the API, they are "periodic transaction updates" - presumably they refresh once a day or something.

Yeah, I read that, but it's not clear exactly what those calls are. It sorta sounds like making assumptions on how their users are using the API.

In fact, it sounds like they think "linking an account" is the only "user present" API call:

"Only 10% of Plaid's data pulls involve a user who is present and linking their account to an app"

No, I'd read this as "linking their account to an app" meaning that the Plaid account's API credentials are configured in the app, so the app can call the Plaid API (presumably interactively, on user interaction).

I don't want to be that guy, but why did they start with nodejs for something like this instead of using the JVM or Go?

They probably already had some decent experience with Node and it solved their initial problem well enough. Refactoring or rewriting costs usually make engineering managers frown (wrongfully), and so it becomes much harder to fix this in the long term.

I have experience with Node, as a frontend developer... but I would never in my wildest dreams use Node for any kind of production backend. And it's not even a question of the programming language - the problem is NPM and the whole package management ecosystem, which is inherently insecure.

My guess is because their system is primarily issuing HTTP requests and extracting data out of responses: html, xml, json, plaintext, etc. Web scraping is a messy business and using a language that allows you to be flexible with string manipulation and types goes a long way toward sanity.

How is Javascript better at string manipulation? I've never encountered anything special there that I can't do in just about every other language. Javascript just has more helper functions out of the box.

I wouldn't characterize it as "better" but specifically easier and more flexible for the people writing and maintaining these scrapers. I'm also speaking more broadly about scripting languages (not javascript specifically) vs the aforementioned JVM or Go, and the ease with which you can deal with inconsistent, frequently changing, and often completely invalid inputs from a wide variety of data sources.

Plaid's use case here is automating logins, responding to captchas, manipulating those on-screen virtual keypads to respond to security questions, chaining together multiple HTTP requests, and then parsing out frequently invalid, rapidly changing, and just plain broken content from a wide multitude of banking websites.

To me, that seems like a case against Javascript. Invalid or broken content should return an error, the parser shouldn't try to "fix" it.

And things like data types should be strictly enforced, otherwise you can get unpredictable results, which is especially bad when you're dealing with money transfers.

> Invalid or broken content should return an error, the parser shouldn't try to "fix" it.

You would have a really difficult life in web scraping. You do not have the guarantees of well-formed data. Instead you get HTML with mismatched tags, JSON with newlines in the middle of strings, content that claims it's UTF-8 but upon closer inspection is actually GB2312, pagination endpoints with off-by-one errors, etc. It's an absolute mess and taking the stance of "well, they didn't encode their JSON correctly, so we're not going to operate on their data" isn't a very effective strategy.
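As a concrete sketch of what that forgiving posture looks like in practice - `parseLoose` and its single cleanup rule are illustrative, not any library's API - scraping code often ends up with a strict parse plus fallback repair passes for known real-world defects:

```javascript
// Hedged sketch of a tolerant JSON reader: try the strict parse first,
// then fall back to a cleanup pass for one common defect (raw newlines
// inside string literals, which strict JSON forbids).
function parseLoose(text) {
  try {
    return JSON.parse(text);
  } catch (err) {
    // Escape bare newlines that appear inside double-quoted strings.
    const repaired = text.replace(/"(?:[^"\\]|\\.)*"/g, (s) => s.replace(/\n/g, '\\n'));
    return JSON.parse(repaired); // still throws if the payload is truly hopeless
  }
}

const broken = '{"note": "line one\nline two"}'; // invalid: raw newline in a string
const doc = parseLoose(broken);
console.log(doc.note); // "line one" + newline + "line two"
```

Real scrapers accumulate many such rules (encoding sniffing, tag balancing, pagination fixups), which is the flexibility argument being made above.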

> Which is especially bad when you're dealing with money transfers

Afaik Plaid is read-only. They fetch information from financial institutions and make it available through an API.

I'm actually quite experienced with web scraping, mostly using PHP and XPath, but also with Javascript as well as a custom approach written in Rust. I know in detail what an inconsistent mess everything is.

That's why I'm so uncomfortable handling things like bank transfers over such inconsistent, buggy systems, which is what Plaid does. It's not read-only: https://plaid.com/use-cases/consumer-payments/

Not to say I don't trust Plaid, I'm sure they're aware of all this and very careful about how they do things.

Then I feel like we're in agreement that web scraping is a messy and imprecise art, and that the flexibility a scripting language like PHP provides is immense.

I have no affiliation with plaid, I've honestly only heard negative things about those guys, I was only empathizing with the difficulties in maintaining thousands of different scrapers and why I felt a scripting language provided far more latitude to get things done.

Node is pretty good for managing HTTP requests as long as the responses aren't too large. But parsing data, especially html/xml, is CPU-intensive in node and probably not a great fit.

I’d be curious to hear your reevaluation of moving this to Lambda after some of the major announcements during re:invent. My guess is some of the reasons you went ECS have been addressed with these announcements. Obviously some of the new features are still preview, but would be interested to hear your analysis none the less.

Oftentimes there's a several month delay from when stuff is announced at re:invent and when it's GA. I don't think anyone would ever make technical decisions based on announcements; they would wait until they could touch it and actually create a proof of concept. In other words, the "analysis" is nonexistent, since there's nothing to analyze.

Does node have something similar to how apcu is used with PHP?

That is, an mmap based kv store so that if you choose to run more than one node process on a single server, it has a fast kv cache?

I'm aware you can use redis or similar, but a simple mmap kv store is simpler and faster for a single server use case.

I totally see what you mean, coming from the PHP world myself a few years ago. The key thing to note is that Node.js (like many other languages, including Java) starts a server process that does not stop until you explicitly restart it (or it crashes) - unlike PHP, where every request starts a brand new process on a clean slate (hence needing APCu to store a local per-server memory cache). That means what you accomplish with APCu in PHP can be trivially accomplished with a plain Object in Node.js (i.e. a map/hash), by virtue of the require cache: every time you require the module, it returns the same instance of the object.
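The module-singleton pattern described above, as a sketch (the TTL handling is illustrative, not from any particular library):

```javascript
// cache.js -- because Node caches modules, every require() of this file
// gets the same `cache` object, so it works as a per-process KV cache.
const store = new Map();

const cache = {
  set(key, value, ttlMs) {
    store.set(key, { value, expiresAt: ttlMs ? Date.now() + ttlMs : null });
  },
  get(key) {
    const entry = store.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt && entry.expiresAt < Date.now()) {
      store.delete(key); // lazy expiry on read
      return undefined;
    }
    return entry.value;
  },
};

if (typeof module !== 'undefined') module.exports = cache;

cache.set('rates', { usd: 1.08 }, 60000);
console.log(cache.get('rates').usd); // 1.08
```

Note this only shares within one process; multiple Node processes on the same box each get their own copy, which is the gap APCu-style shared memory fills.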

If you want a simple open source lib to do exactly that for you and provide an easy to use API, you can use something like https://www.npmjs.com/package/tmp-cache .

The context is multiple node processes running on a single box, so a shared cache across processes has value for some use cases. I don't think the cache module you suggested would work in that case.

I'm aware of the runtime model differences between node and PHP.

You can use something like LMDB from just about every language.

Ah, yeah. I suppose that would mean you need a fast Node.js serializer. APCu uses its own serializer, which is fast-ish.

You can use something like FlatBuffers to have non-copying reads.

In case anyone else gets excited by JSONStream, know that the package hasn't been updated in over a year, and the GitHub repo was archived by the author with no link to a successor.

I'm maintaining a fork here that incorporates all of the valid open PRs from the original repo + some more updates: https://github.com/contra/JSONStream

It isn't published on NPM (you can use it as a git dependency), but if people are interested, I can publish it.

Thanks for sharing!

Why don't you publish releases?

Oboe has a similar API, can't speak for performance though.


$300k is $300k, but they just raised $250M last year, is this a really good use of time for their engineering team? That's a little above ~0.1% of capital.

Why wouldn't it be? You save 300k, that's an engineer salary... that's pretty much the meaning of a job, building value that's higher than your salary. This clearly took less than a year of engineer time. Seems like they got their value out of that employee.

That's just one of the benefits.

> our system is more robust to increases in external request latencies or spikes in API traffic from our customers

A good example of avoiding premature optimization. I'd imagine delaying tackling this problem freed them up to tackle problems that impact users.

This only holds if they didn’t pour hours into the original solution. Setting up and managing 4000 node services doesn’t sound like a quick hack.

While we were worried about event loop blockages causing outages, another more subtle problem would have been if event loop blockages doubled our user-facing latency. (If you read the section on latency ratios, you'll see that comparing parallel vs non-parallel workers was the most useful stat in figuring out how effectively we were using the event loop.) It definitely gave us peace-of-mind to know that event loop blockages wouldn't have an effect beyond the requests they're processing.

Honestly, the accounting for which would've been higher impact – investing in parallelism earlier, or adding infrastructure and having more resources to devote to other pressing needs – is difficult to do, even in retrospect. There was surprisingly little effort required to get to 4,000 node containers in an ECS cluster, other than deploy speed issues which we talked about in a previous post [1]. But it's possible this migration process would have been easier if we had done it sooner.

[1] https://blog.plaid.com/how-we-reduced-deployment-times-by-95...

ubu7737 7 months ago [flagged]

> But it's possible this migration process would have been easier if we had done it sooner.

What the f*? Of course it would have been easier if you had done it sooner. What you lacked was the willpower from decision-makers who had growth of dollar-signs in their eyes.

You've littered this thread with comments explaining how every move you made was based on ROI. That's the kiss of death for architecture concerns, and bizarrely it puts Node.js on the list of runtimes for data/stream processing backends.

No matter how many times you explain how you made these decisions, I can't help getting the feeling you were wearing horse blinders.

Edit: I find it impossible to imagine that nobody on the engineering team ever shouted, Hey look out! We are basically a Web farm for banking-related requests, this is insane! Surely you've heard from those people and they were let go.

I say "possible" because our system observability was less mature even 12 months ago. Firefighting 10 different root causes of memory or event loop issues without the right tooling in place would be a nightmare. That's why we did a deep dive into the tooling that we considered to be a prerequisite for this project – hopefully it's helpful for others in our situation.

Different companies make different decisions when weighing ROI against architecture concerns. We're heavy on pragmatism and impact at Plaid, so it's quite intentional that we don't fall all the way on the latter end of the spectrum. I appreciate the discussion in the comments as to how effectively we are balancing these two concerns – certainly this is an area where reasonable people can disagree.

Writing and maintaining concurrent code for greenfield projects is relatively hard compared to sync code.

Provisioning and deploying with ECS is usually just mouse clicks.

Ironic. Linked images failing to display due to "Rate limit exceeded"...

4k containers? That's microservices going macro big time.

I don't like to be overly negative, especially when a company/team is being transparent about what they're doing and giving insight into their engineering practices - but has anyone else's estimation of Plaid's engineering team just gone down the toilet?

This blog post gives me the impression that either Plaid is filled with either junior or incompetent engineers - to scale to 4k containers serving 1 request each for an API workload is absolute insanity.

These engineers are building stuff for banking. Banking!! There is literally no way I'm going near Plaid with a very long bargepole after reading this.

If I were someone senior at Plaid, I'd be pulling this blog post before it harms their reputation any further.

I mean, that's always the thing, isn't it? If a company publishes about the problems it has, the question is whether other companies have the same problems and just hide it, or whether this company is actually worse. This comes up a lot with gitlab, for instance; remember the time they discovered they had no backups? At most companies, customers would never find out about that, so I'm not sure that them telling us about it usefully informs my view of their competence. Similarly, here, the only way we'd know about this if they didn't say anything would be poor performance, which would be... less than surprising, on a financial website, in my experience. So maybe they suck, or maybe they're equal with others but a little more open; I don't know how to tell.

Thanks for a reasoned response to what I realise was a very negative comment. I do agree with what you've said, and I do feel a little bad for slamming them when they're being transparent.

OTOH, I do still feel this is so bad they need to be called out on it, and it really does scare me off using them. Given they're being transparent, it boggles the mind that they've tried to justify this, rather than just owning it and admitting it was the result of letting a junior do some resume-driven development (or however it came about).

I posted it on my engineering org's random channel. The 4,000 instances of the same service thing immediately got a laugh out of everyone. How a tech company operated like this is beyond me...

Hah, I actually did the same, and it was basically a stream of WTFs?!

To put a positive spin on this (perhaps a first for me in this thread!), I plan on doing something of an internal post-mortem with my team, where we'll look at the deficiencies of this design, try to reason about how on earth it came to fruition, and critique our in-place review processes to make sure something like this never happens to us.

Hi, Plaid engineer here (not the author, but I helped with the post).

I don't think we've tried to assert that the old system is perfect. We went into some detail in the post about why it took us this far. Certainly, the single request per container approach wouldn't scale if our unit economics were different. We didn't get into this too much in the post, but the Node service sits behind a couple of layers of Go services, so we had more control over scaling API traffic than it might appear.

Likewise, I hope we didn't give the impression that the new system is perfect. We've explored other languages for integrations in the past (even Haskell, at one point), and are continuing to do so. A migration away from our years-old Node integrations codebase would be a massive undertaking at this point. Absent that, it doesn't seem consistent to say "you're incompetent for handling 1 request per container" and also "you're incompetent for writing this post" – if you believe the former then it makes sense to be an advocate for this project, at least until a language migration can be done.

I think the set of hoops we had to jump through in order to add concurrent requests without adding latency is a good demonstration of why we didn't do this sooner. It wasn't a massive undertaking by any means, but it wasn't trivial. At any rate, we're not really looking for a gold star here – just putting this out there and hoping this will be useful for others who are, as other commenters have put it, building their own "Frankensteins" :)

I mean, my read of this is:

1. We used a system which uses event loops to achieve great concurrency, but we turned that off because we don't trust it.

2. Instead, we spent $300k/yr rolling out one-process-per-API as though we were using Apache 1.3.

3. We used an arbitrary JSON library without knowing anything about its performance characteristics, which, it turns out, were inordinately bad.

It's not that this wasn't a great exercise in engineering and problem-solving, or that it's not a great demonstration of how to solve scaling problems at scale, those are definitely true. It's more that "we spent $300k/yr more than we needed to so our engineers didn't need to learn how to use our technology stack properly."

I'm not meaning to be harsh, I've kludged enough garbage into production in my lifetime, but more that the fact that you got into that situation in the first place gives a poor impression of either your development team or your development processes.

I don't disagree with most of what you're saying. The Nth engineer at a startup rarely looks with admiration at decisions made by the (N/10)th engineer – but it was those decisions which helped the company grow to its current size. Likewise, I think most of us will be happy if the company 10x's again. Then some super-duper-senior engineer can look at the decisions we're making now – they're not perfect, but we're doing the best we can with our current knowledge and resources – and gripe about them. The circle of life goes on.

FWIW, I don't believe Node was chosen specifically for its concurrency – it was just the language chosen for the entire stack by the company's founding engineers, and lives on in just this one service.

Gah, I can't believe you're still trying to justify this madness!

In a parallel universe, a barely-competent engineer would have designed something far more obvious, simple, and performant - all while using fewer hours, and not borderline-fraudulently wasting substantial amounts of your VCs' money.

If the company 10x's again, it won't be because of poor engineering, it'll be because of marketing and VC's who don't know how you're wasting their money. If the company 0.1x's, it'll likely be because of a security breach because of appalling design.

So, in your estimation, plaid's engineering is a bunch of less-than-half-competent madmen, who might as well be committing fraud, correct?

Is that overly negative or just the right amount?

If I may add, Monzo bank, which utilizes Golang, has an interesting stack.


Ah, a serious question as an aside to my last (scathing) comment - does Plaid have architects? What about architecture and/or code reviews?

I'd be very interested in reading about that.

Your question feels pretty sarcastic, but I think it'd be great to write a post about our process for project specs and reviews. We do have an old post on the blog about code reviews that you could read, but it's more focused on the cultural approach of making code reviews less intimidating for junior engineers, less on the details. While we don't have an "architect" title, a fair portion of our more recent hires have 10+ years of experience (I don't have exact numbers though), and we're working on the best way to coordinate efforts and improve our system-wide architecture.

In general, I think it'd be great if more companies communicated their engineering practices and war stories externally – I'd love to read such posts myself! Unfortunately, it takes quite a bit of time to write a post (for engineers who are already stretched thin), and it seems that being honest about shortcomings at an early-stage company is an invitation for people to be personally disrespectful. It is what it is, but I imagine that's one reason we see such posts from only a small handful of startups, and the subject matter is often cherrypicked and sugarcoated.

So anyway, thanks for engaging with our post all the same – and maybe we'll be back with a post on architecture reviews in 5 years when all the kinks are ironed out :)

Reading the article, I'd fully expected a post-mortem at the end, describing how architecture and code review processes were going to be tightened up to ensure a monstrosity like this never happened again - that would have been transparent, interesting, and given me confidence in Plaid's engineering.

Instead, you've peppered this thread with comments that kind-of, sort-of justify the approach taken.

I'm sorry, but this approach cannot be justified - it's overly complex, and far from the simplest or most obvious approach. I'm truly shocked that Plaid has produced an architecture like this, and doubly so that Plaid would try to justify it. My guess here (and given the attempts at justification, this is me being really charitable) is that a junior dev was given too much leeway and did some resume-driven development, just so they could say they'd worked with 4k containers.

It's important to keep in mind that efficiency isn't usually particularly important for a startup. I'm sure they knew when they initially set up this system that it wasn't performant...but it was nice and quick and easy and gets the feature out the door. Why should they worry about $100k or whatever when they're funded for > $350M? Their bottleneck is engineer hours, not dollars.

Instead the rational thing to do is build something quick and dirty and optimize later, and that's exactly what they've done.

I understand that, and plenty times myself I've "done the simplest" thing - sometimes you need to ship an MVP, fast.

The difference here is that what they did wasn't even the simplest thing - it was a crazy, insanely wasteful thing that just happened to work for a while. Being honest, for me, it's an indefensible approach.

> Why should they worry about $100k or whatever when they're funded for > $350M? Their bottleneck is engineer hours, not dollars

Argh, but this rubs me up the wrong way! Any halfway-competent engineer could have built something simpler and much more performant, likely in far fewer hours too. Sometimes stopping, thinking, and discussing for a few minutes or hours will save numerous hours. I mean, how many hours did they spend on this "diagnosis" alone?

>Their bottleneck is engineer hours, not dollars.

Their bottleneck was software that couldn't scale past a hard stop. I guess having a known breaking point of scalability is a good thing? But building things in a way where you either have to overhaul your development runtime or stop scaling past a certain point is pretty terrible.

It seems like the only reason they did this was because they really felt the pain of it from the business and dev side and they were lucky enough that they had traffic spikes to raise these issues. If they had more consistent day-to-day traffic then this would have just hit a breaking point one day and they would've been fucked until it was fixed.

Frankly, the reality there is so surreal it's actually surprising the article doesn't mention uploading CSV files over FTP, generating Excel files, having cameras pointed at monitors of legacy systems to read data (no, this is not a joke), or spawning a promise for Martha to cross-check something and click OK somewhere behind two bastion hosts, three firewalls, and one and a half SOAP integrations. I'm not defending "4k containers because the event loop can be blocked", which is silly - just recalling the context. In other words, you can do the shittiest automated thing there and you're a hero. The hero of the next year or two is going to be somebody shrinking it by another X orders of magnitude.

I kind of alluded to it in my reply, but I tend to agree - they spent a lot of time and hard work looking in all the wrong places! It's hard to imagine how they missed the forest for the trees so badly here.

Worse - they never really explain where that 30x improvement came from, or whether they even understand it themselves. They talk a lot about getting their memory issues under control, but hardly at all about actual parallelism - and even then they seem to confuse it with merely speeding up operations that were blocking.

I kind of expected this post to be "We did a whoops and had a blocking call to a DB/fs/compression call/whatever. This was all happening in the event loop and not being farmed out to the threadpool by libuv. We fixed it and now look like heroes to our CTO!"

What they talk about are issues that blocked them from parallelism per node and how they resolved the issues. I'm not sure what additional information you're expecting?

Though I'm somewhat surprised they didn't use Worker patterns per node with self monitoring for health above and beyond what they already did.

For banking... accurate, simple, safe, reliable are more important than performance/throughput. IMHO optimizing the above and for developer efficiency should be the first priority and for scale or max throughput later.

The simplest solution is to scale to one worker per node initially if you're doing anything compute intensive... once you've done that, and/or you need better performance for any number of reasons including cost, then you can do more. Now, I'm not sure I would have gotten to 4k nodes before I started to re-evaluate parallelism or better scaling options, but the initial implementation is absolutely fine.

> For banking... accurate, simple, safe, reliable are more important than performance/throughput. IMHO optimizing the above and for developer efficiency should be the first priority and for scale or max throughput later.

I get it, but come on - this was not a "performance optimisation" issue, but one of bad architecture; an architecture that certainly doesn't inspire confidence in the priorities you mention: accuracy, simplicity, safety.

Having a worker that does one request, processes that one request and returns a result isn't accurate, simple or safe? Scaling that simplistic interface in a 1:1 manner across many systems via ECS is pretty straight forward.

Now, in addition to probably optimizing what they've done, converting to, for example another container system, like K8s where they can scale vertically a bit better may have been another approach.

The biggest issue that I see is gRPC doesn't work that great with Node. You can't use it with cluster which means you have to self-manage threads/processes and it adds complexity there. Yeah, there's definitely issues that come into scaling in terms of performance optimization.... but where they started from isn't unreasonable imho.

ubu7737 7 months ago [flagged]

Where I work compliance is job #1.

That doesn't prevent us from thinking about performance. GTFO with this nonsense.

They started with a simple implementation where one node handles one request at a time... end to end. They used ECS for easy-button scaling. That's a perfectly reasonable approach for starting out.

I would have probably pushed for a shift in orchestration to Kubernetes, along with some tweaking, as an initial uplift. Others would rewrite the whole thing in another language. They chose to add a bit of complexity to support multiple requests per node. These approaches all have their pluses and minuses, but in the end it doesn't mean the initial approach was bad, or that their refactor wasn't pragmatic or practical.

Dramatic rewrites to a codebase lead to instability and in practice fail as much as succeed.

> Dramatic rewrites to a codebase

Otherwise known as re-architecting?

Your comments in this thread have been breaking the site guidelines, and getting worse as they go along. Would you please review https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here? Note the bit about curious conversation.


I'm not sure if you're aware, but dang is one of the Hacker News moderators in that article. Dang stands for DANiel Gackle.

Linus Torvalds quote: "You need to grow thick skin". I also got a knee-jerk reaction when reading the first part, but the article explained it well, given that they probably don't want to give out too much information.

So how would you have engineered it? I would just send the data uncompressed, given that the receiving server is probably in the same data center, with switches capable of handling Tbits of data per second.

I liked the article, but would have wanted more details. I love optimizations; it's such a drug, the rush when you make something x times faster. This article doesn't give me a bad impression. On the contrary, I'm thinking about sending in an application.

In their defense, it looks like they have over 400 employees and have raised over $350 million in funding. By all the measures that truly matter right now, they seem like a very successful company.

I can guarantee you that a VPE or CTO who can say they helped build that... but ran into a scaling issue born of their own success, will have no trouble finding employment and no reason to be ashamed. All the more impressive if it was just a bunch of junior engineers.

This comment says more about you than it does about Plaid. Their "insane" design met business requirements successfully enough to grow them into a multi-billion dollar company.

Did you consider the likely (and more charitable) explanation that they were aware their design was "bad", but had higher priorities until now?

If I were you, I'd be pulling your comment before it harms your reputation any further. :)

>multi-billion dollar company

WeWork is a "multi-billion dollar company" in the same way that Plaid is. Private funding valuations don't really mean anything anymore.

> Did you consider the likely (and more charitable) explanation that they were aware their design was "bad"

I think marketing and VC valuations grew them into a multi-billion dollar company; whether they remain one depends in large part on how fast they burn through VC cash - so, not looking too good on that front...

Not even a halfway-competent engineer would come up with such a complex, underperforming solution to a simple problem - I think a higher priority should be hiring engineers who actually have a clue what they're doing.

As for meeting business requirements... while this might have worked for a while, it was plainly not a good way to meet them, and given that Plaid is in the banking sector, it really doesn't bode well for the future (I'm having flash-forwards already to security breaches, plaintext passwords, etc...).

ubu7737 7 months ago [flagged]

Downvoted for angering the VC class...

> no way I'm going near Plaid with a very long bargepole after reading this

But you'd go to a competitor who hasn't published a blog post, whose internal code you haven't audited and simply presume is just fine?

In Plaid's defense, a lack of performance tuning isn't necessarily a lack of security focus.

> In Plaid's defense, a lack of performance tuning isn't necessarily a lack of security focus

Come on, this is not about "performance tuning", where you're trying to eke out every last drop of performance - it's about a completely indefensible, complex, wasteful solution to a simple problem.

I'd say engineering insanity at this level is very worrisome for what they've done at the security side of things.

ubu7737 7 months ago [flagged]

ROFL, "performance tuning"? This is not tuning, this is architecture.

Some people think you can just write software, sell it to customers, and it's "tuning" to make it work properly.

You should be fired from whatever job you have.

My guess is that you have no job, you are fronting USD.

In which case you have absolutely no place in this conversation and you should be ashamed of yourself for speaking up.

A fool and his money are easily parted.

We've banned this account for breaking the site guidelines and ignoring our request to stop.

If you don't want to be banned, you're welcome to email hn@ycombinator.com and give us reason to believe that you'll follow the rules in the future.


TLDR: How to spend millions of dollars of our investors' money because we hired junior devs who chose a framework that was trendy but couldn't scale.

> We were running 4,000 Node containers


Nobody involved in this project should be allowed to ever be in the same room as a computer again.

This comment breaks the site guidelines and is not cool. Would you please read https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here?


Why? They had a 12-factor-ish app that scaled the normal way: run more copies. Eventually that got expensive. They had the observability to figure out what was making it expensive and whether or not their fixes had an effect. They then saved $300,000.

Seems like everything went right to me.

I would be worried if the blog post was "we randomly tweaked some stuff and we can't measure it but it's a little better" or "we rewrote it in go and in the rewrite introduced 87 new bugs while fixing 42 old bugs". They engineered a solution, built from good investment in infrastructure, rather than ninja-ing a hack. That, to me, is a very good thing.

A lot of people seem deeply upset that Node was involved, but I think that's a red herring. The problem they had -- allocate a large chunk of memory, keep a reference to it while it is slowly sent to another server, then free the memory -- is going to happen in any language. (I don't super agree with their solution of "make the server faster", because one day it will be slow for some other reason and this problem will crop up again. Instead, they probably just need a fixed amount of memory dedicated to this process, and to drop the debug payload when the buffer is full. Or just put it in the request path if it's crucial that it be produced every time no matter what. At least that will apply backpressure to calling services, pop the circuit breaker, and redirect requests to a region where S3 isn't broken. But I don't think the debug information is THAT important ;)
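The "fixed memory budget, drop when full" idea above can be sketched in a few lines. This is a hypothetical illustration (the class name, byte cap, and API are all made up, not Plaid's code): payloads are queued up to a byte limit and shed beyond it, so a slow uploader can never pin unbounded memory.

```javascript
// Toy bounded buffer for debug payloads: enqueue up to a fixed byte
// budget, drop (and count) anything beyond it instead of growing the heap
// while a slow S3 upload drains the queue.
class BoundedDebugBuffer {
  constructor(maxBytes) {
    this.maxBytes = maxBytes;
    this.usedBytes = 0;
    this.queue = [];
    this.dropped = 0; // how many payloads were shed under pressure
  }
  push(payload) {
    const size = Buffer.byteLength(payload);
    if (this.usedBytes + size > this.maxBytes) {
      this.dropped += 1; // shed load rather than exceed the budget
      return false;
    }
    this.queue.push(payload);
    this.usedBytes += size;
    return true;
  }
  shift() {
    const payload = this.queue.shift();
    if (payload !== undefined) this.usedBytes -= Buffer.byteLength(payload);
    return payload;
  }
}

const buf = new BoundedDebugBuffer(10); // 10-byte budget for the demo
console.log(buf.push('aaaa'), buf.push('bbbb'), buf.push('cccc'), buf.dropped);
// true true false 1
```

A real version would also surface the drop counter as a metric, so you notice when debug data is being shed — which is the trade-off versus putting the upload in the request path and letting backpressure do its thing.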

> Why? They had a 12 factor -ish app that scaled the normal way

So, yes, horizontal scaling is good, especially for stateless workloads - but that doesn't mean you run the most hopelessly under-performing code imaginable on each node, so you basically have to scale out like this! I mean, seriously, 4000 containers to serve 4000 concurrent requests? I mean, I can't even...

I honestly can't believe the attempts in this thread to justify such an utterly, horrendously bad architecture - there are 1,001 better, even simpler, ways to approach this.

Yes, premature optimisation is bad, but optimisation here was nowhere near premature.

I'm going to disagree.

When you start a business, you have no idea what it's going to grow into, or if it's going to grow. So you start simple. The design was good enough for there to one day be too many customers. That is huge.

When this happened, they started a second copy of their app, and could now handle twice as many customers. Repeat 3998 more times. Now the toy app is making some real money, so you can afford to deep-dive into the system and fix the technical problems.

They avoided the real issue that kills startups, having a customer call you because they want to buy your service and you saying "sorry, we aren't accepting any new customers right now because Hacker News comments don't like our software architectures."

To save $300,000 they first needed to waste $300,000 by reinventing a problem that was solved in 1967.

I don't know how many software engineers they have on that team, but considering how much they've raised and how much their product is used, $300k actually seems quite cheap for something that people here consider an awfully big mistake.

I would say the same thing about hiring people who make snide dismissals.
