
We 30x'd our Node parallelism - bjacokes
https://blog.plaid.com/how-we-parallelized-our-node-service-by-30x/
======
kevstev
I was building scalable node applications a few years ago for a very large
e-commerce player- millions of customers. I think node.js is a great platform,
but its apparent simplicity means there are hordes, and I mean like 90+% of
the community, that can "just get things done" without understanding what is
going on under the hood at all. And to be fair, for most startupy types of
companies that need to iterate fast, that is what you want to optimize for.

My interview screening question was pretty simple- "Is node.js single threaded
or multithreaded?" Most would just spit back the blogspam headline- "Single
threaded!" I think the most correct answer is "it's complicated", but I would
accept "single threaded" because most people would say that's the "right" answer. So I
would follow up with- "what exactly happens in a default installation if we
have say... 5 requests come in at exactly the same time to just return some
static content from disk?" (Node's default threadpool is 4). And here is where
you could see their understanding just fell apart. Some would say they would
be handled entirely synchronously, others completely in parallel- but then had
no idea what the cause of the parallelism was. Very few actually understood
that node is an event loop executing javascript backed by a threadpool for
async operations.

Before reading this post, I was like eh, this is a waste of time- it's typical
Medium bullshit- they almost certainly found they were doing some blocking
call in the event loop, removed it, and voila, 30x speedup. It was
interesting because it was a lot worse! They spent all this time and hard work
figuring out everything but what was taking so long in the event loop, and it
seems that was the last place they actually looked.

Anyway, node can be a highly scalable platform
([https://changelog.com/podcast/116](https://changelog.com/podcast/116)), but
you need to understand it or it will bite you. When I was
last doing this stuff, upwards of 80% of our time was being spent essentially
just JSON.parse()'ing, and we were looking to move to protobufs to avoid that.
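As a rough illustration of why JSON.parse can dominate profiles like this, here's a hedged sketch with a synthetic payload - the shape and size are invented, but the point stands either way: the parse runs synchronously on the event loop, and nothing else can run while it does.

```javascript
// Measure the synchronous event-loop time JSON.parse takes on a large,
// made-up payload of nested objects.
const payload = JSON.stringify({
  transactions: Array.from({ length: 100000 }, (_, i) => ({
    id: i, amount: i * 0.01, description: `txn ${i}`,
  })),
});

const start = process.hrtime.bigint();
const parsed = JSON.parse(payload); // blocks the event loop for the duration
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
console.log(`parsed ${parsed.transactions.length} transactions in ${elapsedMs.toFixed(1)}ms`);
```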

~~~
gameswithgo
>I think node.js is a great platform,

I'm curious as to why. For large scale applications like this, you have other
options that offer higher performance ceilings, have more safety and
correctness features, and are likely more productive as well. What is the
attraction to node?

A guy has to invent a scripting language for browsers in 9 days -> he decides
on a lisp -> management says no, it has to look like Java -> he comes up with
something -> it's dynamically typed -> let's run a huge banking infrastructure
on this

wat

~~~
NohatCoder
The real killer feature is async. Since a modern web request typically spends
most of its time waiting on database calls, file system requests, or similar,
a naively coded server in most languages can handle relatively few requests
per thread. So you scale up the number of threads to something like 100 per
core, and now the overhead of running and switching between these threads
limits performance.

Being used to Node, I was flabbergasted when writing C for Linux*. The file
system calls just leave my thread hanging while the result is being generated;
on a network drive they might hang for a minute before timing out. So I have
to spawn a thread for each file system call, solely so that it can stall
without bringing down the whole application.

* I have no delusions that Windows is any better, Linux is just what I have first hand experience with.

~~~
unlinked_dll
>a naively coded server in most languages can handle relatively few requests
per thread, so you scale up the number of threads to something like 100 per
core, and now the overhead of running and switching between these threads is
limiting the performance.

This would be true if you hired someone to write a server in C about 15 years
ago. It's not true today. And I hope you're not putting a naively coded server
like that in production, or at least doing the hour of research once you
notice it's awfully slow to solve the problem.

Like if you wrote your backend in Go, Rust, Java or any number of languages
(even C/C++ with common dependencies!) and did a little reading while you
designed it, this issue wouldn't exist.

------
spamizbad
The only way this makes sense to me is if they have to contend with lots of
expensive parsing, event sequencing, and throttling requirements. Payment
APIs, bank websites, etc. can be quite byzantine. I could understand how you
might code yourself into a corner with a monolithic node app and basically
just say "F-it, we're doing this synchronously!"

I don't even think it's a terribly _bad_ thing to do assuming it favors
feature velocity.... but at that point, I'd recommend moving away from Node
towards something like Python. And if you wanted to dip your toes back into
async plumbing land, explore Go or Elixir.

~~~
stickfigure
_I 'd recommend moving away from Node..._

Taking a wild guess: Some of their bank integrations probably require browser
automation. If you're doing browser automation, the best tool for the job is
(currently) Puppeteer, which runs on Node. There are other third-party
language bindings for the Chrome dev tools protocol, but Puppeteer is
developed by Google as a first-class citizen alongside Chrome.

~~~
rdsubhas
4000 chrome instances? Probably not. Here I am trying to run 4 chrome
instances in parallel in CI without crashing.

~~~
stickfigure
Presumably not every integration requires browser automation, so they might
not all be going at once. But they have a $25k monthly EC2 bill, so it's not
out of the ballpark.

FWIW, I reliably have 6 puppeteer/chrome instances (headful, even) going on a
single box and it's not even at half capacity.

------
rauchp
That was an interesting read, thanks for linking to it. It's hard finding
articles online discussing Node and performance, most people just dismiss it
as an unviable option due to scale and speed concerns. 30x really is quite the
jump though.

> Each Node worker runs a gRPC server

Not going to lie, this kind of surprised me. When I think of a Node backend I
think of ExpressJS. Not because I think Express is better, but because it's
been pushed around in the past few years as the fastest, simplest way of
running a backend.

Yet, if you're going to be running a gRPC server, why not use a more
performant language with better multithreading support? I thought this article
was about them optimizing a grandfathered-in solution (such as Express), but I
can't tell why they built out a gRPC server in Node in the first place.

~~~
bjacokes
Our integrations are primarily written in Node, which was the original
language used for everything at Plaid. Almost all of those original services
(except for integrations) have been migrated to Go or Python at this point.
We've standardized on gRPC as our wire format, so we stayed consistent and
used gRPC in Node.

With perfect hindsight, it's a fair point that all the pros and cons could net
out to another language being best for our integrations. Integrations are the
largest and most quickly-changing codebase at Plaid, so such a migration would
be a massive undertaking. We definitely didn't want to block scalability
improvements on doing a language migration.

~~~
mnutt
I've been hoping that the Cloudflare folks will open source parts of their
Workers; they seem to have figured out a secure, performant way to run
untrusted javascript at scale.

------
mnutt
I’d be curious to hear more about the circumstances that ended up with a
blocked runloop. Are there hundreds of junior engineers, or perhaps third
parties writing code that you don’t control? I have seen people accidentally
write blocking code, but not at such an egregious rate that we couldn’t catch
it in code review, or at worst the runloop detector would alert on it in prod
and we would roll back the deploy.

For instances where you actually know you need lots of CPU, there are now
strategies for offloading that specific work, although they have taken a while
to get nice and easy to use.

~~~
bjacokes
Sure, one example I remember off the top of my head is a bank that sometimes
returned duplicate transaction data, so an engineer had called ramda.uniq on
the transaction array. Transactions are nested objects and slow to compare, so
when you find an account with 100,000 transactions... kaboom. Some scenarios
are more subtle, but a common theme is that the amount of data in an account
can vary by many orders of magnitude.
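For illustration, a sketch of the cheap fix for this class of problem - the transaction fields below are hypothetical, but the idea is that ramda's uniq does pairwise deep comparisons (quadratic on 100,000 nested objects), while records with a usable key can be deduplicated in a single O(n) pass over a Map:

```javascript
// Deduplicate by a caller-supplied key instead of deep equality.
// One pass, one hash lookup per element.
function uniqByKey(transactions, keyFn) {
  const seen = new Map();
  for (const txn of transactions) {
    const key = keyFn(txn);
    if (!seen.has(key)) seen.set(key, txn);
  }
  return [...seen.values()];
}

const txns = [
  { id: 'a1', amount: 9.99, meta: { pending: false } },
  { id: 'a1', amount: 9.99, meta: { pending: false } }, // duplicate from the bank
  { id: 'b2', amount: 4.5, meta: { pending: true } },
];
const deduped = uniqByKey(txns, (t) => t.id);
console.log(deduped.length); // → 2
```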

------
7777fps
> We were running 4,000 Node containers (or "workers") for our bank
> integration service. The service was originally designed such that each
> worker would process only a single request at a time. This design lessened
> the impact of integrations that accidentally blocked the event loop, and
> allowed us to ignore the variability in resource usage across different
> integrations. But since our total capacity was capped at 4,000 concurrent
> requests, the system did not gracefully scale.

I can't be the only person who reads stories like this and wonders how they
arrived at that solution in the first place?

They failed to scale because their previous approach to scaling was a worker
per request - a model the industry roundly moved away from, because that's how
CGI and Apache modules worked and it didn't scale well.

I thought one of the key selling points of Node was a fully async standard
library, enabling better scaling in process.

But then you read stories like this, and I find it hard to relate to the
original problem.

~~~
phoe-krk
> I thought one of the key selling points of Node was a fully async
> standard library, enabling better scaling in process.

We still have an event loop that is trivially blocked by very simple
programmer errors, destroying the whole advantage that you describe here.

The fact that Node ships a fully asynchronous standard library doesn't in any
way fix the fact that Node is a runtime for a language that itself is a
mistake.

~~~
nicoburns
> We still have an event loop that is trivially blocked by very simple
> programmer errors, destroying the whole advantage that you describe here.

So they fixed the issue that some requests blocked... by making all requests
blocking.

~~~
phoe-krk
This is the worst kind of software engineering.

There is a massive deadlocking design mistake in the centre of the language -
literally a huge red button with DO NOT PRESS printed on it. Thousands of
programmers pass it by every single day, or hour, or minute, and the creators
of the runtime insist that it is impossible to fix that button whatsoever;
instead, all users need to work around it by ensuring that their code in no
way presses that red button on purpose or even by accident.

These people insist that it is impossible to program normally and in a
language that is actually sane and does not advertise obvious and gaping
design mistakes as "features of the language". These people advertise the
analogue of Python's Global Interpreter Lock as the core foundation of their
language.

These people advertise Node and the language it implements as practical for
implementing multithreaded applications. Posts such as this show what sort of
bullshit it is; it is only practical to use Node for parallelism _if each
single Node instance is only ever run single-threaded_. You don't parallelize
by running multiple threads, you parallelize by running multiple Node
runtimes.

This is no longer an act against productivity or usability. This is simply
insane and shows one of the most basic things that are wrong about Node's
language and approach. It is impossible to write a multithreaded program if
your language of choice makes it trivial, and practically unavoidable, to
globally lock your whole runtime with every single line of code you write and
import as your dependencies.
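For what it's worth, the "red button" being described fits in a few lines - any synchronous loop on the event-loop thread delays every pending timer and I/O callback, however unrelated:

```javascript
const start = Date.now();
// A timer scheduled for "now"...
setTimeout(() => {
  console.log(`timer scheduled for 0ms fired after ${Date.now() - start}ms`);
}, 0);

// ...cannot fire until this synchronous busy-wait releases the only
// JavaScript thread, ~200ms later.
while (Date.now() - start < 200) { /* busy wait */ }
```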

~~~
eshyong
I mean, running multiple node runtimes (aka multiprocessing) actually sounds
like a reasonable compromise for parallelism. That's the standard solution for
dynamic languages without great multithreading support. If you needed great
multithreading support then Node probably wasn't the right choice for you in
the first place, but for most applications, it's probably fine.

However, running multiple _containers_ for parallelism sounds a little bit
crazy. In the worst case, each container may be running on its own server, but
even assuming multiple containers per host, I'm guessing they were running a
significant number of instances, which is probably why they were able to save
$300k in server costs.

~~~
phoe-krk
Yet this is the case mentioned in the article.

> We were running 4,000 Node containers (or "workers") for our bank
> integration service.

~~~
eshyong
Yes, I agree. I'm arguing against your characterization of node as a poor
runtime. For many line-of-business applications node is a fine choice.
However, Plaid has to integrate with many banks which only expose web pages,
not APIs, so I'm guessing that they have to do a non-trivial amount of CPU
work to scrape and process HTML responses. For this, it may not be such a good
choice.

All I'm saying is that the choice of language is (usually) not the issue. Poor
architecture design causes a lot more problems than whether you choose
python/java/ruby/node for your webapp.

------
vmarchaud
I've encountered various issues with NodeJS services in the past (and still
do), both with CPU bottlenecks and heap allocations. So I wrote openprofiling-
node [0] this summer to help me profile my apps directly in production and
export the results to an S3 bucket. I believe it may help someone else here,
so I'm posting it.

[0]: [https://github.com/vmarchaud/openprofiling-
node](https://github.com/vmarchaud/openprofiling-node)

------
jdc0589
On a positive note: this was a good write up.

On a negative note: FOR THE LOVE OF ALL THAT IS HOLY, HOW DID THIS HAPPEN.

------
pdimitar
...Or you could just use Erlang or Elixir, where concurrency _and_ parallelism
come pretty much out of the box, with very little effort required for you to
fine-tune the desired policy / strategy.

The insistence on using Javascript is just beyond lunacy at this point.

~~~
timmy-turner
Well, if Elixir had a type system like TypeScript has, I'd instantly switch to
it. But atm I'm staying with Node because of TypeScript.

~~~
pdimitar
True, it doesn't have it. Between pattern matching and function guards
however, it has a decent way to protect against common errors.

The true treasure is Erlang / Elixir's runtime though. The parallelism, the
self-healing, the preemptive scheduling.

------
nosianu
They write (somewhere in the middle)

> _Since V8 implements a stop-the-world GC, new tasks will inevitably receive
> less CPU time, reducing the worker’s throughput_

But there is this Google blog post from January 2019:

[https://v8.dev/blog/trash-talk](https://v8.dev/blog/trash-talk)

> _Over the past years the V8 garbage collector (GC) has changed a lot. The
> Orinoco project has taken a sequential, stop-the-world garbage collector and
> transformed it into a mostly parallel and concurrent collector with
> incremental fallback._

So I guess they used an older node.js version. The current LTS version is 12.x
and it is from around the middle of this year.

\---

PS: If the blog author reads this, there is an accessibility problem with the
Google-hosted inline images. If I try - without ad blocker - in an anonymous
window I see none of the inline images. Logged into Google with my own account
I can see _some_ but not all the images. Apparently which images I can see
depends on being logged in to my Google account? I also tried IE Edge just to
see if the browser makes a difference - no inline images visible there either.

~~~
jimbo1qaz
When I try to view the image in a new tab, I get:

Your client does not have permission to get URL /Iw-
RdHoPjbwuSAqJHK3C0Sy8m29NqzeHPtmJ7CVFuYqwr4CbwpGjwn9O4bcDNtCf_hLD4FGc75nkQYnJBgyA-
CT2ikBDWQD-nAtqxXa4Lw2yDuh_-ywcsDaer6m4LyVtljwfrajO from this server. (Client
IP address: [redacted])

Rate-limit exceeded That’s all we know.

------
awinter-py
Compared to a compiled language, node / JIT langs make it difficult to know
what will be fast in prod.

V8 JIT means that things like order of keys in an object or number of
different calls to a function might affect whether your function gets
optimized.

And there's no easy way to find out if a JS function is falling back to slow
mode, or to tell the build system 'this is a hot path, don't let me write code
that deopts this call'.

------
bfrog
It's not clear from the article why they were only able to run one request per
node process, but that alone makes it questionable why they used Node at all-
the entire point of the environment has been nixed. The article leaves it
quite confounding how they arrived at that point in the first place.

------
tyingq
_" Only 10% of Plaid's data pulls involve a user who is present"_

Since they provide an API, it seems like some of the calls where they think a
user isn't present might actually have one present.

~~~
bjacokes
We thread knowledge of whether a data pull was initiated by the API or by our
cron-style service into our load-balancing layer, so this ends up being pretty
straightforward.

~~~
tyingq
Ahh, got it. The "present and linking their account" part threw me off.
Sounded like only the "linking" call was getting the fast lane.

------
Scarbutt
I don't want to be that guy, but why did they start with nodejs for something
like this instead of using the JVM or Go?

~~~
meritt
My guess is because their system is primarily issuing HTTP requests and
extracting data out of responses: html, xml, json, plaintext, etc. Web
scraping is a messy business and using a language that allows you to be
flexible with string manipulation and types goes a long way toward sanity.

~~~
calibas
How is Javascript better at string manipulation? I've never encountered
anything special there that I can't do in just about every other language.
Javascript just has more helper functions out of the box.

~~~
meritt
I wouldn't characterize it as "better" but specifically easier and more
flexible for the people writing and maintaining these scrapers. I'm also
speaking more broadly about scripting languages (not javascript specifically)
vs the aforementioned JVM or Go, and the ease with which you can deal with
inconsistent, frequently changing, and often completely invalid inputs from a
wide variety of data sources.

Plaid's use case here is automating logins, responding to captchas,
manipulating those on-screen virtual keypads to respond to security questions,
chaining together multiple HTTP requests, and then parsing out frequently
invalid, rapidly changing, and just plain broken content from a wide multitude
of banking websites.

~~~
calibas
To me, that seems like a case against Javascript. Invalid or broken content
should return an error, the parser shouldn't try to "fix" it.

And things like data types should be strictly enforced, otherwise you can get
unpredictable results, which is especially bad when you're dealing with money
transfers.

~~~
meritt
> Invalid or broken content should return an error, the parser shouldn't try
> to "fix" it.

You would have a really difficult life in web scraping. You do not have the
guarantees of well-formed data. Instead you get HTML with mismatched tags,
JSON with newlines in the middle of strings, content that claims it's UTF-8
but upon closer inspection is actually GB2312, pagination endpoints with off-
by-one errors, etc. It's an absolute mess and taking the stance of "well, they
didn't encode their JSON correctly, so we're not going to operate on their
data" isn't a very effective strategy.

> Which is especially bad when you're dealing with money transfers

Afaik Plaid is read-only. They fetch information from financial institutions
and make it available through an API.

~~~
calibas
I'm actually quite experienced with web scraping, mostly using PHP and XPath,
but also with Javascript as well as a custom approach written in Rust. I know
in detail what an inconsistent mess everything is.

That's why I'm so uncomfortable handling things like bank transfers over such
inconsistent, buggy systems, which is what Plaid does. It's not read-only:
[https://plaid.com/use-cases/consumer-payments/](https://plaid.com/use-
cases/consumer-payments/)

Not to say I don't trust Plaid, I'm sure they're aware of all this and very
careful about how they do things.

~~~
meritt
Then I feel like we're in agreement that web scraping is a messy and imprecise
art, and that the flexibility a scripting language like PHP provides is
immense.

I have no affiliation with Plaid- I've honestly only heard negative things
about those guys. I was only empathizing with the difficulties of maintaining
thousands of different scrapers, and explaining why I felt a scripting
language provides far more latitude to get things done.

------
rynop
I’d be curious to hear your reevaluation of moving this to Lambda after some
of the major announcements during re:invent. My guess is some of the reasons
you went ECS have been addressed with these announcements. Obviously some of
the new features are still preview, but would be interested to hear your
analysis none the less.

~~~
bdcravens
Oftentimes there's a several-month delay between when stuff is announced at
re:Invent and when it's GA. I don't think anyone would make technical
decisions based on announcements; they would wait until they could touch it
and actually create a proof of concept. In other words, the "analysis" is
nonexistent, since there's nothing to analyze.

------
tyingq
Does node have something similar to how apcu is used with PHP?

That is, an mmap based kv store so that if you choose to run more than one
node process on a single server, it has a fast kv cache?

I'm aware you can use redis or similar, but a simple mmap kv store is simpler
and faster for a single server use case.

~~~
godot
I totally see what you mean, coming from a PHP world myself a few years ago.
The key thing to note is that node.js (like many other languages, including
Java) starts a server process that does not stop until you explicitly restart
it (or it crashes), unlike PHP, where every request starts a brand new process
on a clean slate (hence needing APCu for a local memory cache per server).
Meaning, what you accomplish with APCu in PHP can be trivially accomplished
with a simple Object in node.js (i.e. a map/hash), by virtue of the require
cache (every time you require the lib, it returns the same instance of the
object).

If you want a simple open source lib to do exactly that for you and provide an
easy to use API, you can use something like
[https://www.npmjs.com/package/tmp-cache](https://www.npmjs.com/package/tmp-
cache) .

~~~
tyingq
The context is multiple node processes running on a single box, so a shared
cache across processes has value for some use cases. I don't think the cache
module you suggested would work in that case.

I'm aware of the runtime model differences between node and PHP.

------
mceachen
In case anyone else gets excited by JSONStream, know that the package hasn't
been updated in over a year, and the GitHub repo was archived by the author
with no link to a successor.

~~~
contrahax
I'm maintaining a fork here that incorporates all of the valid open PRs from
the original repo + some more updates:
[https://github.com/contra/JSONStream](https://github.com/contra/JSONStream)

It isn't published on NPM (you can use it as a git dependency) but if people
are interested I can.

~~~
mceachen
Thanks for sharing!

Why don't you publish releases?

------
FanaHOVA
$300k is $300k, but they just raised $250M last year, is this a really good
use of time for their engineering team? That's a little above ~0.1% of
capital.

~~~
dwild
Why wouldn't it be? You save 300k, that's an engineer salary... that's pretty
much the meaning of a job, building value that's higher than your salary. This
clearly took less than a year of engineer time. Seems like they got their
value out of that employee.

------
deedubaya
A good example of avoiding premature optimization. I'd imagine deferring this
problem freed them up to work on problems that impact users.

~~~
coddle-hark
This only holds if they didn’t pour hours into the original solution. Setting
up and managing 4000 node services doesn’t sound like a quick hack.

~~~
bjacokes
While we were worried about event loop blockages causing outages, another more
subtle problem would have been if event loop blockages doubled our user-facing
latency. (If you read the section on latency ratios, you'll see that comparing
parallel vs non-parallel workers was the most useful stat in figuring out how
effectively we were using the event loop.) It definitely gave us peace of mind
to know that event loop blockages wouldn't have an effect beyond the requests
they're processing.

Honestly, the accounting for which would've been higher impact – investing in
parallelism earlier, or adding infrastructure and having more resources to
devote to other pressing needs – is difficult to do, even in retrospect. There
was surprisingly little effort required to get to 4,000 node containers in an
ECS cluster, other than deploy speed issues which we talked about in a
previous post [1]. But it's possible this migration process would have been
easier if we had done it sooner.

[1] [https://blog.plaid.com/how-we-reduced-deployment-times-
by-95...](https://blog.plaid.com/how-we-reduced-deployment-times-by-95/)

~~~
ubu7737
> But it's possible this migration process would have been easier if we had
> done it sooner.

What the f***? Of course it would have been easier if you had done it sooner.
What you lacked was the willpower from decision-makers who had growth of
dollar-signs in their eyes.

You've littered this thread with comments explaining how every move you made
was based on ROI. That's the kiss of death for architecture concerns, and
bizarrely it puts Node.js on the list of runtimes for data/stream processing
backends.

No matter how many times you explain how you made these decisions, I can't
help getting the feeling you were wearing horse blinders.

Edit: I find it impossible to imagine that nobody on the engineering team ever
shouted, Hey look out! We are basically a Web farm for banking-related
requests, this is insane! Surely you've heard from those people and they were
let go.

~~~
bjacokes
I say "possible" because our system observability was less mature even 12
months ago. Firefighting 10 different root causes of memory or event loop
issues without the right tooling in place would be a nightmare. That's why we
did a deep dive into the tooling that we considered to be a prerequisite for
this project – hopefully it's helpful for others in our situation.

Different companies make different decisions when weighing ROI against
architecture concerns. We're heavy on pragmatism and impact at Plaid, so it's
quite intentional that we don't fall all the way on the latter end of the
spectrum. I appreciate the discussion in the comments as to how effectively we
are balancing these two concerns – certainly this is an area where reasonable
people can disagree.

------
supermatt
Ironic. Linked images failing to display due to "Rate limit exceeded"...

------
mirekrusin
4k containers? That's microservices going macro big time.

------
GordonS
I don't like to be overly negative, especially when a company/team is being
transparent about what they're doing and giving insight into their engineering
practices - but has anyone else's estimation of Plaid's engineering team just
gone down the toilet?

This blog post gives me the impression that Plaid is filled with either junior
or incompetent engineers- scaling to 4k containers serving 1 request each for
an API workload is absolute insanity.

These engineers are building stuff for _banking_. _Banking_!! There is
literally no way I'm going near Plaid with a very long bargepole after reading
this.

It I was someone senior at Plaid, I'd be pulling this blog post before it
harms reputation any further.

~~~
sicromoft
This comment says more about you than it does about Plaid. Their "insane"
design met business requirements successfully enough to grow them into a
multi-billion dollar company.

Did you consider the likely (and more charitable) explanation that they were
aware their design was "bad", but had higher priorities until now?

If I were you, I'd be pulling your comment before it harms your reputation any
further. :)

~~~
GordonS
> Did you consider the likely (and more charitable) explanation that they were
> aware their design was "bad"

I think marketing and VC valuations grew them into a multi-billion dollar
company; whether they remain so, to a large part relies on how fast they burn
through VC cash - so, not looking too good on that front...

No even halfway-competent engineer would come up with such a complex,
unperformant solution to a simple problem- I think a higher priority should be
hiring engineers who actually have a clue what they're doing.

As for meeting business requirements... while this might have worked for a
while, it was plainly not a good way to meet them, and given Plaid are in the
banking sector, really doesn't bode well for the future (I'm having
flashforwards already to security breaches, plaintext passwords etc...).

~~~
ubu7737
Downvoted for angering the VC class...

------
CyanLite2
TLDR: How to spend millions of dollars of our investors' money because we
hired junior devs who chose a framework that was trendy but couldn't scale.

------
Phil_Latio
> We were running 4,000 Node containers

LOL

------
PixyMisa
Nobody involved in this project should be allowed to ever be in the same room
as a computer again.

~~~
jrockway
Why? They had a 12-factor-ish app that scaled the normal way: run more copies.
Eventually that got expensive. They had the observability to figure
out what was making it expensive and whether or not their fixes had an effect.
They then saved $300,000.

Seems like everything went right to me.

I would be worried if the blog post was "we randomly tweaked some stuff and we
can't measure it but it's a little better" or "we rewrote it in go and in the
rewrite introduced 87 new bugs while fixing 42 old bugs". They engineered a
solution, built from good investment in infrastructure, rather than ninja-ing
a hack. That, to me, is a very good thing.

A lot of people seem deeply upset that Node was involved, but I think that's a
red herring. The problem they had -- allocate a large chunk of memory, keep a
reference to it while it is slowly sent to another server, free memory -- is
going to happen in any language. (I don't super agree with their solution of
"make the server faster" because one day it's going to be slow for some other
reason and this problem will crop up again. Instead they probably just need a
fixed amount of memory to dedicate to this process and to drop the debug
payload when the buffer is full. Or just put it in the request path if it's
crucial that it be produced every time no matter what. At least that will
apply backpressure to calling services, pop the circuit breaker, and redirect
requests to a region where S3 isn't broken. But I don't think the debug
information is THAT important ;)

~~~
GordonS
> Why? They had a 12 factor -ish app that scaled the normal way

So, yes, horizontal scaling is good, especially for stateless workloads - but
that doesn't mean you run the most hopelessly under-performing code imaginable
on each node, so you basically _have_ to scale out like this! I mean,
seriously, _4000_ containers to serve 4000 concurrent requests? I mean, I
can't even...

I honestly can't believe the attempts in this thread to justify such an
utterly, _horrendously_ bad architecture - there are 1001 better, simpler
even, ways to approach this.

Yes, premature optimisation is bad, but optimisation here was nowhere _near_
premature.

~~~
jrockway
I'm going to disagree.

When you start a business, you have no idea what it's going to grow into, or
if it's going to grow. So you start simple. The design was good enough for
there to one day be too many customers. That is huge.

When this happened, they started a second copy of their app, and could now
handle twice as many customers. Repeat 3998 more times. Now the toy app is
making some real money, so you can afford to deep-dive into the system and fix
the technical problems.

They avoided the real issue that kills startups, having a customer call you
because they want to buy your service and you saying "sorry, we aren't
accepting any new customers right now because Hacker News comments don't like
our software architectures."

