
Cloud-Induced Damage - brodouevencode
https://rachelbythebay.com/w/2020/05/06/scale/
======
brodouevencode
Is it bad form to drop the first comment on your own submission? Either way,
here goes. I'd like to address each of these four mini-rants in the order posted.

> Scapegoating

Most CSPs (all the ones I've worked with - the big three) have something known
as a "shared responsibility model". If there's a problem with the application,
configuration, or machine OS, it's on you. If it's a networking backplane,
hardware, or virtualization problem, it's on them. If IO speeds to disk fall
outside the promised bounds, you call them up and yell at them. Generally
they'll issue a refund, have an engineer within slapping distance get to work
on it, and resolve it faster than it would take you to research and resolve
the problem yourself. This has been the case for the past 8 years or so. There
really is no scapegoating, only understanding who is responsible for what.

> Profiling is a thing, after all.

Not just in a DC the team operates, but also in the cloud. The thing about the
cloud is that inefficiencies don't reveal themselves as having to throw
hardware at the problem because users drop off; they reveal themselves in
cost. "Oh wow, service X now runs on 1200 machines." Trust me, your finance
person will see that problem before you do. Profiling is _required_ in the
cloud for cost optimization. All too often I've seen companies lift and shift
to the cloud, not rewrite or cloud-optimize a damn thing, and wonder why their
bill is so high. The cloud is not the data center (the cloud is a little less
forgiving than the DC), but the underlying principles work for both.
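
To make that concrete, here's a minimal sketch using Python's stdlib profiler;
handle_request is a hypothetical stand-in for your actual service code:

    # Minimal sketch: Python's stdlib profiler pointed at a
    # hypothetical request handler, to find the hot path before
    # the monthly bill finds it for you.
    import cProfile
    import pstats

    def handle_request(n):
        # stand-in for your actual service code
        return sum(i * i for i in range(n))

    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(1000):
        handle_request(10_000)
    profiler.disable()

    # top five entries by cumulative time = where the money goes
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)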

> elasticity thrashing

This will only ever happen once. The engineer who set up that stupid rule will
forever be mocked, and rightfully so, and no one in the org will ever make
that mistake again.
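
For anyone who hasn't watched it happen: a toy Python simulation of the
failure mode, comparing a naive threshold rule against one with a cooldown
and a wider dead band (all numbers here are made up):

    # Toy simulation of elasticity thrashing (all numbers made up):
    # a naive rule reacts to every sample; a saner rule adds a
    # cooldown period and a wider dead band between the thresholds.
    load = [55, 80, 45, 82, 48, 79, 50]  # hypothetical CPU% samples

    def naive(samples, up=75, down=50):
        replicas, events = 2, 0
        for s in samples:
            if s > up:
                replicas += 1
                events += 1
            elif s < down:
                replicas = max(1, replicas - 1)
                events += 1
        return events

    def with_cooldown(samples, up=75, down=40, cooldown=3):
        replicas, events, wait = 2, 0, 0
        for s in samples:
            if wait > 0:
                wait -= 1  # ignore samples until the cooldown expires
                continue
            if s > up or s < down:
                replicas = max(1, replicas + (1 if s > up else -1))
                events += 1
                wait = cooldown
        return events

    print(naive(load), with_cooldown(load))  # the naive rule flaps: 5 vs 2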

> Scaling and lag

Here we see a scenario in which there's a primary service, with supplementary
services that scale depending on the load of the primary (sounds like a
typical microservice architecture). But in the scenario the primary dies, the
supplementaries scale to nothing, and when the primary comes back online the
supplementaries may take a while to scale back up, and they and everything
upstream of them suffer. Because there's not much detail around it, let's
assume the typical microservice-backed service that processes some _thing_.
There are a couple of sub-points to this:

1. Event-driven architecture exists for a reason. We're talking Kafka,
Kinesis, Dataflow, ActiveMQ, Zuul, Eureka, Redis, Hystrix, HAProxy, Consul,
KeyDB and a whole host of other message queuing/passing tools, routing and
service discovery utilities, and caching. These give the supplementaries time
to scale as necessary without losing anything except time (see the queue
sketch after this list). If the primary has a hard requirement that a
supplementary be online, that supplementary may need to be folded back into
the primary's codebase.

2. Your supplementaries should be scaling. The lag may need to be tuned (see
the smoothing sketch further down), but yes, they should scale up and down
with the primary. There's no sense in running a whole cluster of machines
just to send emails when no one is on the primary site. That's a giant waste
of money. And if the primary is crashing with some sort of regularity during
peak hours, fix it!
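
Here's a minimal Python sketch of point 1, with an in-process queue.Queue
standing in for Kafka/Kinesis/whatever; the names are hypothetical, the shape
is the point:

    # Sketch of point 1: the primary fires events onto a queue and the
    # supplementary drains it at its own pace, so a slow scale-up costs
    # time, not data. queue.Queue stands in for Kafka/Kinesis/etc.
    import queue
    import threading
    import time

    work = queue.Queue()

    def primary():
        for i in range(5):
            work.put(f"event-{i}")  # fire and forget

    def supplementary():
        time.sleep(0.5)  # pretend we scaled to zero and came back late
        while True:
            try:
                event = work.get(timeout=1)
            except queue.Empty:
                return
            print("processed", event)

    threading.Thread(target=primary).start()
    supplementary()  # everything the primary enqueued is still there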

Hell, with respect to this point, Kubernetes will address or fix most of this
for you.
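
Kubernetes' HPA gives you knobs like stabilization windows for exactly this.
As a pure-Python illustration of the tuning knob from point 2: size the
supplementary off a smoothed view of primary load so it tracks real demand
without chasing every blip (all parameters hypothetical):

    # Sketch of point 2: size the supplementary off an exponentially
    # smoothed view of primary load, so it tracks real demand without
    # chasing every blip. alpha is the "lag" knob; tune it.
    def smoothed_replicas(samples, per_replica=100, alpha=0.3):
        ema, plan = 0.0, []
        for s in samples:
            ema = alpha * s + (1 - alpha) * ema  # exponential moving average
            plan.append(max(1, round(ema / per_replica)))
        return plan

    primary_rps = [900, 50, 40, 870, 920, 880]  # hypothetical requests/sec
    print(smoothed_replicas(primary_rps))  # [3, 2, 2, 4, 5, 6]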

EDIT: clarity

