
Taking too much slack out of the rubber band - r4um
http://rachelbythebay.com/w/2019/11/10/scale/
======
roland35
I see this kind of thinking all the time in hardware engineering as well, and
it all boils down to premature optimization. Cost almost always is driving
this.

One example: a recent project was a very cost-sensitive machine in which a
small heater was copied over from another product, but no one actually
verified that it met the required limits (just the default use case). Well, it
turns out it wasn't quite powerful enough, and by the end of the project it
was way too late and expensive to fix! Also, all the engineering time spent
figuring this out was wasted (though it often seems management doesn't count
engineering time the same way as parts cost)!

I've since learned that at the beginning of a project it is critical to
identify the riskiest parts of the design, isolate them into a module, and
over-spec that module, hopefully with a path to reducing cost later on. But
the most important thing I've learned is: don't try to solve tomorrow's
problems today!

~~~
techslave
> management doesn't count engineering time the same way as parts cost

because the IRS doesn’t either

~~~
falcolas
When you account for the employment taxes on every employee, they're not
exactly cheap (around 30-35% on top of base salary). The difference really is
that whether the engineer is looking at the heating coil or not, the company
is still paying them.

~~~
techslave
no, the difference is whether the cost is allocated to fixed or overhead cost
(presumably scaling slowly) or COGS (presumably scaling linearly).

increase in COGS looks very bad for the business.

[https://www.investopedia.com/ask/answers/101314/what-are-dif...](https://www.investopedia.com/ask/answers/101314/what-are-differences-between-operating-expenses-and-cost-goods-sold-cogs.asp)

~~~
roland35
It certainly depends a lot on your quantities... if you are only making
100-200 of something a year then I think engineering time would dominate the
calculation. But as with all things, it depends and you need to do the math
for your own situation!

------
amalcon
I've spent quite a bit of time on a problem very similar to this. It's
surprisingly challenging. Imagine this scenario:

Some service has three units of capacity available (e.g. VMs). This is the
minimum amount allowed, on the theory that things won't break too badly if one
of them happens to crash. You target 66% CPU utilization. Suddenly, one goes
down, and the software sees 100% CPU utilization on the other two. What should
the software do?

Well, the obvious thing is to add one more instance, assuming that one of them
crashed and its load shifted to the other two. However, what if the thing that
actually happened is that the demand doubled, and the load caused the crash?
Then, you should probably add six more instances (assuming that the two
remaining live ones are going to go down while those six are coming up).

If you look at only CPU utilization, it's impossible to tell the difference
between these two situations.
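
A rough back-of-the-envelope version of why (same numbers as above; the key
issue is that the utilization signal saturates):

    # 3 instances targeted at 66% CPU; capacity in arbitrary "work units".
    per_instance_capacity = 1.0
    steady_load = 3 * 0.66           # ~2 work units spread across three instances

    def observed_cpu(load, live_instances):
        # Utilization saturates at 100%, so any overload beyond that is invisible.
        return min(load / (live_instances * per_instance_capacity), 1.0)

    print(observed_cpu(steady_load, 2))      # one instance crashed, same demand: ~0.99
    print(observed_cpu(steady_load * 2, 2))  # demand doubled, crash from overload: capped at 1.0

Both readings come out as "pegged", even though the right responses differ by
a factor of six.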

~~~
sohex
Which is why proper monitoring and understanding of the system as a whole is
imperative. Utilization doesn't come from nowhere, be it CPU, memory, or
anything else. If you understand that requests per second x generates CPU
usage y then you can monitor at the edge and scale according to actual need.
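
As a rough sketch of that idea (the per-instance figure and the floor are
made-up numbers; in practice they'd come from benchmarking and your load
profile):

    import math

    REQS_PER_INSTANCE = 250   # hypothetical: measured throughput at the CPU target
    MIN_INSTANCES = 3

    def desired_instances(edge_rps):
        # Scale on demand measured at the edge, not on CPU observed inside the fleet.
        return max(MIN_INSTANCES, math.ceil(edge_rps / REQS_PER_INSTANCE))

    desired_instances(500)    # baseline: max(3, ceil(500/250)) = 3
    desired_instances(1000)   # doubled demand is visible at the edge even while fleet CPU is pegged: 4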

------
dehrmann
This is even scarier in the physical world. Just-in-time logistics means
companies aren't warehousing inventories as large as they used to. In the case
of major events (natural disasters, terrorist attacks, etc.), there isn't
enough reserve supply to go around.

~~~
Enginerrrd
This has been a major generational cultural shift. I learned from old-timers
who grew up in the Depression and have a 3-month supply of food and batteries
and a water filter in their garage, a tow strap and toolkit in their car, a
shotgun in their safe, and money in seven different bank accounts as well as
at least 3 cash stash spots and some gold hidden away somewhere. They grew up
in a time when almost nothing had the reliability that (we think) things have
today. A less safe time all around.

The reality is, we have a much more interconnected web of dependencies with
little capacity to absorb disruptions. We'll almost certainly see much more
significant consequences when those now-low-probability events finally do
occur.

------
SideburnsOfDoom
The dynamic scaling version of cascading failure

------
jes
It’s important that systems have some design margin (buffers of one kind or
another) so that a disruption / transient event in one part of the system is
absorbed locally and not passed on to the rest of the system.

------
thaniri
It seems like this problem is solved by simply setting a sensible minimum in
an autoscaling group. And not "everyone on Earth was abducted by aliens and
stopped using the service" levels of minimum.

Say I'm an e-commerce site and I can see historically (or just make an
educated guess if it's my first holiday sale) that on Black Friday I get "n"
requests per second to my service.

I'll set my autoscaling group the day before to be able to handle that "n"
number of requests, with the ability to grow if my expectations are exceeded.
If my expectations are not met, then my autoscaling group won't shrink. Then
the day after the holiday sale, I can configure my autoscaling group to have a
different minimum.

This solves the problem of balancing between capacity planning and saving
money by not having idle resources running.

If you're the type of person who hates human intervention for running your
operation, then fine. Put in a scheduled config change every year before a
sale to change your autoscaling group size.
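
On AWS, for example, that scheduled change can itself be expressed as
scheduled actions on the autoscaling group. A rough sketch with boto3 (the
group name, sizes, and dates are all placeholders):

    from datetime import datetime, timezone
    import boto3

    autoscaling = boto3.client("autoscaling")

    # Raise the floor ahead of the sale...
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="shop-web",
        ScheduledActionName="black-friday-floor",
        StartTime=datetime(2019, 11, 28, 0, 0, tzinfo=timezone.utc),
        MinSize=40,    # sized for the historical (or guessed) Black Friday "n"
        MaxSize=120,   # room to grow if expectations are exceeded
    )

    # ...and drop it again once the sale is over.
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="shop-web",
        ScheduledActionName="post-sale-floor",
        StartTime=datetime(2019, 12, 3, 0, 0, tzinfo=timezone.utc),
        MinSize=6,     # back to the everyday minimum
        MaxSize=120,
    )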

It's pretty rare to have enormous spikes in application usage without good
reason, such as video-game releases, holiday sales, startup openings, or viral
social media campaigns.

~~~
pbalau
> It seems like this problem is solved by simply setting a sensible minimum in
> an autoscaling group.

Do you really think people do things because it makes sense to do them for
their particular situation or because those things are "the thing to do(tm)"?

Most people go to see Mona Lisa because that's what people do when in Paris,
not because they care about that particular piece of art.

Same with automation. It really makes me sad when I see people "automating"
things they barely understand how to do manually, let alone when to do them.

Yes, your example is perfectly valid, but it assumes one understands the
system they are working with, and generally people have no bloody clue what
they are doing.

------
GauntletWizard
I recently gave a talk at SRECon [1] about a partial solution: Using a PID
controller. It won't solve all instances of this problem, but properly tuned,
it will dampen the effect of these sudden events and quicken the response
times to them.

[1]
[https://www.usenix.org/conference/srecon19emea/presentation/...](https://www.usenix.org/conference/srecon19emea/presentation/hahn)
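
For a rough flavor of the approach (a toy sketch, not the code from the talk;
the gains and the floor of 3 are placeholders, and tuning them is where the
actual work is):

    class PIDScaler:
        # Turn a utilization error into an instance-count adjustment.
        def __init__(self, target=0.66, kp=8.0, ki=0.5, kd=2.0):
            self.target = target
            self.kp, self.ki, self.kd = kp, ki, kd
            self.integral = 0.0
            self.prev_error = 0.0

        def step(self, utilization, current_instances, dt=1.0):
            error = utilization - self.target             # positive when running too hot
            self.integral += error * dt                   # accumulates persistent offsets
            derivative = (error - self.prev_error) / dt   # reacts to sudden jumps
            self.prev_error = error
            adjustment = self.kp * error + self.ki * self.integral + self.kd * derivative
            return max(3, current_instances + round(adjustment))

Roughly speaking, the derivative term is what speeds up the response to a
sudden change, and the integral term is what keeps the system from settling at
a persistent offset from the target.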

------
thunderbong
Money quote - "Capacity engineering is no joke."

------
dgritsko
> Of course, at some point, [...] the local service gets restarted by the ops
> team (because it can't self-heal, naturally)

Maybe off-topic, but what are some good strategies for the kind of "self-
healing" being talked about here? If a service needs to be restarted, how
could you automate the detection and restart process?

~~~
perlgeek
In the simplest case, the service could shut itself down, and the supervising
daemon / scheduler would restart it.

Supervisors like systemd also have a watchdog that will force-restart a
service that hasn't checked in for some time.
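
A minimal sketch of the check-in side, assuming the unit sets WatchdogSec= and
an appropriate Restart= policy (handle_one_request() is a hypothetical
stand-in for the service's real work):

    import os
    import socket
    import time

    def sd_notify(state):
        # Send a notification to systemd over the socket named in NOTIFY_SOCKET.
        addr = os.environ.get("NOTIFY_SOCKET")
        if not addr:
            return  # not running under systemd; nothing to do
        if addr.startswith("@"):
            addr = "\0" + addr[1:]  # abstract-namespace socket
        with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
            sock.connect(addr)
            sock.sendall(state.encode())

    sd_notify("READY=1")
    while True:
        handle_one_request()     # hypothetical: the service's actual work goes here
        sd_notify("WATCHDOG=1")  # if this stops arriving, systemd force-restarts the unit
        time.sleep(5)            # keep the interval comfortably under WatchdogSec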

For a service that manages its own network connection, implementing auto-
reconnect can be a form of self-healing (and surprisingly hard to get right in
all edge cases).

The key is, as Rachel wrote in the OP, to get a good signal. You need to be
able to distinguish a working from a non-working service to implement reliable
self-healing.

~~~
dgritsko
> You need to be able to distinguish a working from a non-working service to
> implement reliable self-healing.

I think this is the crux of what I was trying to get at. Curious to read how
others have approached this problem.

------
patmcguire
There's something related called the bullwhip effect. I _think_ that throwing
away requests under load rather than putting them in some overflow queue
prevents it. The effects aren't magnified down the chain of services as each
one scales up, because every service only ever sees its actual incoming
traffic.
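
A tiny sketch of shedding instead of queueing (the cap and the reject/process
helpers are placeholders):

    import threading

    MAX_IN_FLIGHT = 200                       # placeholder: derive from measured capacity
    in_flight = threading.Semaphore(MAX_IN_FLIGHT)

    def handle(request):
        # Reject excess work instead of queueing it: a 503 is an honest signal to
        # the caller, while a deep backlog quietly inflates apparent demand for
        # every service downstream.
        if not in_flight.acquire(blocking=False):
            return reject(request, status=503)    # hypothetical helper
        try:
            return process(request)               # hypothetical helper
        finally:
            in_flight.release()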

------
jerkstate
dynamically scaling down based on cpu consumption is the wrong way to do it
IMO. if your site is decently sized you have a pretty typical diurnal pattern
with weekly cyclical variation; that's your baseline.
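
something like a per-hour-of-week floor derived from history, say (all numbers
here are placeholders, and ts is assumed to be a datetime):

    import math
    from collections import defaultdict

    REQS_PER_INSTANCE = 250      # placeholder per-instance throughput
    history = defaultdict(list)  # (weekday, hour) -> observed req/s samples

    def record(ts, rps):
        history[(ts.weekday(), ts.hour)].append(rps)

    def scaling_floor(ts):
        samples = history[(ts.weekday(), ts.hour)]
        if not samples:
            return 3                                  # fallback minimum
        typical = sorted(samples)[len(samples) // 2]  # median traffic for this hour
        return max(3, math.ceil(typical / REQS_PER_INSTANCE))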

~~~
NoodleIncident
> For another thing, how about knowing approximately what the traffic is
> supposed to look like for that time of day, on that day of the week and/or
> year? Then, don't down-scale past that by default?

------
insanebits
But if your service was down for longer than it takes to downscale to the
minimum, scaling back up is not that big of an issue. It was down anyway.
Also, 24/7 instances exist for a reason; autoscaling is there to handle
spikes, not normal traffic.

~~~
Arnt
Pay attention, and don't confuse intention with effect.

What she's saying is that if you configure scaling such that it'll scale down
when demand is unusually low, and then demand returns, the spike may be a
difficult one to handle, _particularly_ if your services depend on each other
but each scales only based on its own history.

If A needs B, which needs C, and demand suddenly returns to A, does that cause
C to scale up? Or will A scale up first, but C stay low for another half-hour
because it recently scaled down?

Having C stay below demand for half an hour after an outage ends wasn't
anyone's intention when the autoscaling was configured. But as I wrote, don't
confuse intention with effect.

------
DasIch
That just means you should scale based on the work to be done rather than poor
proxies such as CPU utilization. Also set a reasonable minimum and maximum
based on observed load in production and review this as part of regular
operational reviews.
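
For a queue-fed service that might look something like this (the throughput
and bounds are placeholders that would come from observed production load):

    import math

    ITEMS_PER_INSTANCE_PER_MIN = 1200        # placeholder measured throughput
    MIN_INSTANCES, MAX_INSTANCES = 4, 60     # placeholder bounds from observed load

    def desired_instances(queue_backlog, target_drain_minutes=5.0):
        # Size the fleet so the outstanding work drains within the target,
        # while never leaving the hard bounds.
        needed = math.ceil(queue_backlog / (ITEMS_PER_INSTANCE_PER_MIN * target_drain_minutes))
        return min(MAX_INSTANCES, max(MIN_INSTANCES, needed))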

------
svacko
OT: can you update the link to use the https version of the site? The author
has not implemented http->https redirect for the site yet.

~~~
taneq
Do you not run HTTPS Everywhere?

~~~
matheusmoreira
Mobile Chrome does not support extensions.

~~~
eru
Off-topic: Firefox on Android does support extensions. Useful for that crucial
ad blocker.

~~~
Filligree
And it's just as fast as chrome, at least on newer phones. That's a crucial
difference from last year.

------
diminoten
Good edge case to consider when designing an auto scaling service, but now
that I'm aware of it, I think I'll be able to design around the problem with
some combo of the suggested solutions, and still get the autoscaling that I
feel like the article was trying to convince me not to do...

~~~
acdha
I don’t think the article was saying not to auto-scale as much as realizing
it’s less of an edge case than it might sound and being careful not to
underestimate the level of effort or overestimate the savings. That rang true
to me — I’ve seen a lot of people realize the staff time they spent ended up
pushing the time to recoup many years into the future. This is especially
common if they’re inspired by a big tech company’s cool blog post or talk
describing something amortized across a much larger volume.

------
tus88
If scaling up is painful there is something wrong with the architecture. Aside
from this scenario, what if you just get a spike in traffic? If your scaling
solution can't handle it, get a better one, otherwise what's the point?

~~~
saagarjha
Presumably servers take time to boot and initialize, which is still a problem
if you get a spike, but spikes aren't as sudden as "everything just turned
back on".

~~~
bostik
Yup. In a reasonably, but not entirely, optimised setup, the spin-up time for
a new node, from the scale-up event firing to the node being able to serve
traffic, may be 2-3 minutes. And trust me, after a couple of mishaps with
_very_ aggressive scale-ups, you will not let your system launch the full
demand-absorbing capacity all at once.

In a _fully_ optimised setup, each service image is itself 100% preconfigured
and only provisions node secrets during boot. Even one of these nodes easily
takes 30-40 seconds from launch event to actually serving traffic: it may join
the load balancer just 25 seconds in, but the load balancer will want to see
at least two good health checks before allowing any traffic to it.

The problem with aggressive upscaling in the depicted scenario is that your
plumbing layer is also likely scaled down. Hitting it with a cascade of new
nodes has the risk of going all thundering herd, crippling the system for both
existing and new nodes.

------
tqkxzugoaupvwqr
Useful anecdote to learn from but not the article I expected from reading the
title. I was prepared to read a story about literal rubber bands.

