Geometric effects of certain system design choices (rachelbythebay.com)
73 points by picture on Jan 5, 2021 | 9 comments


It's a bit of a wobbly tower of reasoning to go causally from "you used node.js" to "you overwhelmed your cloud provider's DNS service and not even dnsmasq can dig you out".

There are a ton of interventions you could do on that journey and I don't think it makes a convincing argument against picking X in the first place.

I suppose it is a good example of the unforeseen resource constraints that big, successful systems encounter. In my limited experience, almost all commonly used software breaks under massive load, and at the largest scales you have to radically modify it or write your own. But by then you are rich, so it doesn't matter.


Based on her previous writing, I suspect that she was talking about Python with a forking WSGI server like gunicorn, not Node.js. Node.js would have the same problem, but to a lesser degree; since it's natively async, you don't need as many worker processes.
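
For illustration, here's a minimal sketch of what that forking setup usually looks like, assuming gunicorn's standard config knobs (the bind address and worker count here are made up): every worker is a separate process doing its own DNS lookups and holding its own outbound connections, whereas a single async process multiplexes everything over one set.

    # gunicorn.conf.py -- hypothetical config for a forking WSGI deployment.
    # Each worker is a full process with its own DNS lookups and its own
    # pool of outbound connections to downstream services.
    import multiprocessing

    bind = "0.0.0.0:8000"
    workers = multiprocessing.cpu_count() * 2 + 1   # e.g. 17 processes on 8 cores
    worker_class = "sync"                           # classic forking/blocking workers
    max_requests = 1000                             # recycle workers periodically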


Don't you solve at least half of this by putting a load balancer in front of the service being hit (which is internal, so we control it)? Then the clients don't need to know which machines are up and working, you don't overload DNS responses with a bunch of endpoints, the load balancer keeps track of which machines are available, and the clients don't need to share any state about the status of the individual remote machines.

I'm not sure what this piece is getting at ("don't do this thing that no one would do anyway"), other than ignoring 20+ years of well-understood scaling methodology, and listing a bunch of unrelated gotchas (at high enough traffic rates, you have to deal with running out of ephemeral ports even without the design outlined). This isn't a great modern example of the compounding effects of bad design.


In-process caches of DNS replies would solve it (they don't have to last for long, 1 minute is enough). Doesn't libcurl do that by default?
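
A minimal sketch of such a cache, assuming you resolve names yourself through the standard library (the 60-second TTL matches the "1 minute is enough" above):

    # Tiny in-process DNS cache: resolve once, reuse the answer for 60 seconds.
    import socket
    import time

    _TTL = 60.0
    _cache = {}  # (host, port) -> (expiry_timestamp, addrinfo_list)

    def cached_getaddrinfo(host, port):
        key = (host, port)
        now = time.monotonic()
        hit = _cache.get(key)
        if hit and hit[0] > now:
            return hit[1]
        result = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        _cache[key] = (now + _TTL, result)
        return result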

Federating the internal DNS would solve it. Who creates a DNS system without redundancy?

Keeping the connections open in each process would solve it. But then you'd need HTTP/1.1 or newer; maybe that's too new.
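
A minimal sketch of what that looks like inside one worker process, assuming the requests library and a placeholder hostname: one long-lived, pooled session reused across calls instead of a new connection (and DNS lookup) per request.

    # One long-lived Session per worker process: HTTP/1.1 keep-alive plus a
    # connection pool, so repeated calls reuse sockets instead of re-resolving
    # and reconnecting every time.
    import requests

    session = requests.Session()  # module-level, created once per process

    def call_backend(path):
        # "internal-service" is a placeholder hostname for the internal API.
        return session.get(f"http://internal-service{path}", timeout=2.0)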

In fact, that scenario requires breaking just about every good practice for the tools used. Yet the author mainly blames the lack of threads. Yeah, sharing that coordination info would increase your scale by some 10x, while creating a lot of other, unrelated failure modes.


I guess the obvious answer to this is: congratulations! You must have a successful product, or you wouldn’t be seeing so many requests. Now that you have a successful product, you have the budget to generate a solution.

Personally, in this situation, I would find some multithreaded RPC proxy thingy where a single instance would sit between the worker processes and the rest of the services.
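
A rough sketch of that idea, assuming plain HTTP and the Python standard library plus requests ("internal-service" and the port are placeholders): one multithreaded sidecar on localhost that every worker process talks to, so only this one process holds keep-alive connections out to the backends.

    # Local "RPC proxy" sidecar: the forking workers all talk to 127.0.0.1:9000,
    # and this single multithreaded process holds the pooled, keep-alive
    # connections (and the cached DNS answers) to the real backend.
    from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
    import requests

    UPSTREAM = "http://internal-service"
    session = requests.Session()  # shared connection pool across handler threads

    class ProxyHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            resp = session.get(UPSTREAM + self.path, timeout=2.0)
            self.send_response(resp.status_code)
            self.send_header("Content-Type",
                             resp.headers.get("Content-Type", "text/plain"))
            self.end_headers()
            self.wfile.write(resp.content)

    if __name__ == "__main__":
        ThreadingHTTPServer(("127.0.0.1", 9000), ProxyHandler).serve_forever()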


The multi-threaded, single-process app might work well until you need to scale across many machines; at that point the multi-process architecture might be easier to scale. But you'll probably end up rewriting the whole architecture anyway.


This reads like a Hollywood disaster movie. Could it happen? Sure. But how likely is it that enough people go down the path of X and become large enough to hit these limits? I want to read those stories, not this.


Not all the same decisions: we didn't have a forking model at the core, except maybe some stuff from gunicorn, but RPC and DNS woes did happen in geometric fashion.

And that was with barely a few services and maybe fewer than 20 VMs in total. Some things were triggered by me putting a load balancer in front of services (it was the first project where I used Thrift, and the project that left me with strong reasons to never use it again), so we ended up with several cases where developers went "it must be the load balancer" and we would finally track it down to stupid Django logging configured to use MySQL as a log store, with MySQL conked out by a colocated Redis server, because the new service that asked for it didn't specify expected usage. That was fun. So was "we DoS'd our DNS system and thus everything died because Kerberos couldn't resolve reverse queries."


Similarly, I've seen DNS problems at scale with coarse-grained (not micro) services that were in Java/Scala (also using Thrift) with threads. Gave up on Thrift due to the poor/inconsistent language support even between Java<->Scala.



