
Nines are not enough: meaningful metrics for clouds - feross
https://blog.acolyer.org/2019/06/19/nines-are-not-enough/
======
deathanatos
> _List the good outcomes you want, and the bad outcomes to be avoided_

I feel like sometimes you don't know the bad outcomes, until they happen.
E.g., years ago, my team had a fairly big outage/issue caused by S3 reads
taking multiples of 60s. (That is, our latency graph had very large spikes
right over 60s, 120s, 180s¹.) These were reads typically in the 10 KiB to
100 KiB range, sometimes as large as single-digit megs; i.e., they should
take milliseconds, maybe a second, not 3 minutes. It took a significant amount
of back-and-forth² as it either only affected our bucket, or just nobody else
noticed. But it slowed our processing down so much that we built up an
incredible backlog. (Processing a file had previously taken <1s, and was now
taking over three minutes in some cases, a 200x slowdown!)

This is _still_ not covered by the S3 SLA.

Also had a different cloud provider where the "Create VM" API returned 200 OK.
The VM never finished booting. The SLA was over the 200 OK, not the actual
completion of the task. Basically, _exactly_ the example in the article, with
a real world "we're paying for _this_?" provider.

¹I'm simplifying. It was weirdly actually 70s, 130s, and 190s. I can only
presume that's a 10s timeout and _n_ 60s timeouts, somewhere.

²Woe unto the person who doesn't come to support with request IDs. My
impression the whole time was "these guys can't see their own response
latency?"
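
The footnote's guess about where 70s, 130s, and 190s come from can be
sanity-checked with a toy model (purely speculative; the timeout values are
the footnote's guesses, not anything confirmed by the provider):

```python
def modeled_latency(n_read_timeouts: int,
                    connect_timeout_s: int = 10,
                    read_timeout_s: int = 60) -> int:
    """Speculative model: one fixed ~10s timeout somewhere in the
    stack, followed by n back-to-back 60s timeouts."""
    return connect_timeout_s + n_read_timeouts * read_timeout_s

# Matches the observed spikes at 70s, 130s, 190s
spikes = [modeled_latency(n) for n in (1, 2, 3)]
```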

~~~
jkoudys
I did service work at a big corp for years, and we would've loved to take a
problem like yours. I'd always want the tickets that came in with some
interesting, visible pattern and a client who's already done some analysis.
Much more fun than taking 6 hours telling a room full of panicky executives
how important their problem is to me, before getting to the 10 minutes it
actually takes to solve it.

------
jugg1es
I can second the sentiment in this article about SLA/SLE being very hard to
define. For example, my company originally agreed to an SLA with a customer
regarding API response time. The problem was that the SLA was a flat 500ms and
didn't take into account the nature of the possible queries. It's possible to
request up to 18 months' worth of data, which is never going to return in
500ms.

I had to spend 6 hours analyzing data to figure out what factors of a query
actually impacted response time. It required me re-learning advanced
spreadsheet skills to find correlations in log data. We're now in the process
of rewriting the agreement because this analysis was not done at contract
time.

This is a topic not really written about that much.

~~~
lazyant
Not for your particular case of a long reply, but response-time SLOs are
usually measured in percentiles ("90% of responses within 500ms").
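
As a concrete illustration (not from the comment; sample values invented), a
nearest-rank percentile check against such an SLO:

```python
def percentile(samples, p):
    """Nearest-rank percentile: smallest value such that at least
    p percent of samples are <= it."""
    ordered = sorted(samples)
    # ceil(p/100 * n) as an integer rank, clamped to at least 1
    rank = max(1, -(-len(ordered) * p // 100))
    return ordered[rank - 1]

# Hypothetical response times in milliseconds
latencies_ms = [120, 95, 430, 210, 88, 1500, 140, 160, 99, 310]
p90 = percentile(latencies_ms, 90)
slo_met = p90 <= 500  # "90% of responses within 500ms"
```

Note the single 1500ms outlier doesn't break the SLO here, which is exactly
why percentile targets are friendlier than a flat per-request cap.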

------
inflatableDodo
Altitude, extent, velocity, humidity, temperature, pressure, droplet size,
electrical charge, acidity, pictorial similarity and fluffiness.

~~~
jerrre
It only now dawned on me why Digital Ocean's VMs[1] are called droplets...

[1] not 100% sure if that's what they are.

~~~
knd775
They are VMs

------
peter303
No one has achieved five nines yet.

Embarrassing recent downtimes for Google and AWS.

~~~
NickNameNick
I've been trying to convince my boss that 'nine fives' is a perfectly
acceptable target.

Seriously, some systems only need to be up when people are actually using
them. It doesn't matter if they don't work out of hours or over weekends.

~~~
masklinn
> Seriously, some systems only need to be up when people are actually using
> them. It doesn't matter if they don't work out of hours or over weekends.

OTOH "out of hours" or "over weekends" is a very good time to make batch
processes happen, so for some business services it might be better to not be
up during hours, but reliably be up outside of them.

Another issue with that is when the service is used internationally /
globally, or even just by 24/7 businesses; then even "over the weekend" is
not necessarily a thing.

~~~
flukus
> Another issue with that is when the service is used internationally /
> globally

This also often happens prematurely. If the team in the US needs live data
entered in Asia then there's not much you can do; the system has to be
global. But an often overlooked option (today) is running multiple instances
of the same software, with each only needing nine fives in its region. Even
if you do need live data it might be better to have another process shuffling
it between international instances. Also helps with latency.
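
A minimal sketch of what that regional split might look like (all names and
endpoints here are hypothetical; the data-shuffling job is elided):

```python
# Hypothetical endpoints for independent regional instances of the same app.
REGIONAL_ENDPOINTS = {
    "us": "https://us.example.internal",
    "asia": "https://asia.example.internal",
}

def endpoint_for(user_region: str) -> str:
    """Route each user to their local instance; each instance then only
    needs to meet its availability target during its own region's hours."""
    return REGIONAL_ENDPOINTS.get(user_region, REGIONAL_ENDPOINTS["us"])
```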

------
ijpoijpoihpiuoh
The only real SLA is the ability to vote with your feet and go to another
provider that satisfies your business needs, or conversely to increase your
investment with a particular cloud if you are satisfied with their
performance. Providers know this, and they know that even if they wanted to,
they can't capture all the possible values that are important to your
particular business in numerical metrics. They also know that what is
important to your business might not be important to another one, and
therefore that it is hopeless to try to capture the entire gamut in a single
set of metrics that are universally applicable. Any attempt would be futile,
so it's best to stick to simpler metrics and let businesses and customers
decide for themselves whether a given cloud meets their particular needs. SLA
violation penalties hurt a little, but the real pain comes when you lose
business.

------
carapace
Over the years I've found that a lot of the confusion surrounding computers
can be cut through by seeing them as a kind of factory. An elegant system or
piece of software looks like an elegant factory, while a clunky or
over-designed system or program is like a Rube Goldberg machine.

What makes good sense in leasing unused factory capacity would make good sense
in leasing cloud resources, eh?

