
Meaningful Availability - zdw
https://blog.acolyer.org/2020/02/26/meaningful-availability/
======
jedberg
This was a big thing for us at Netflix too. It was extremely rare for all of
Netflix to be down. Almost every outage was a partial outage.

For us to measure our success in increasing availability, we first had to
figure out a way to measure availability.

We came up with a multi-pronged approach. The first thing we did was figure
out how to predict how much traffic there should be at any given time. This
was basically using historical data to determine the shape of the curve and
then adjusting it to fit current traffic.

Then we would figure out how far off of the predicted traffic we were, and
that was our downtime.
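
The prediction-based measurement described above might be sketched roughly
like this. This is a toy illustration, not Netflix's actual method; the
function name, the tolerance threshold, and the numbers are all invented:

```python
def availability_from_traffic(predicted, observed, tolerance=0.05):
    """Fraction of expected traffic actually served.

    predicted/observed: request counts per time bucket.
    tolerance: shortfalls smaller than this fraction of predicted
    traffic are treated as normal forecast noise, not downtime.
    """
    served = 0.0
    expected = 0.0
    for p, o in zip(predicted, observed):
        expected += p
        shortfall = max(0.0, p - o)
        if shortfall / p <= tolerance:
            shortfall = 0.0  # within forecast noise, count as served
        served += p - shortfall
    return served / expected

# Example: one healthy bucket, one bucket with a 40% traffic drop.
print(availability_from_traffic([1000, 1000], [990, 600]))  # 0.8
```

The point of the tolerance parameter is that the forecast is never exact, so
small deviations have to be written off as noise rather than outage.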

Since we also had control of the client experience in almost all cases, we
were able to measure from the client side as well (and trust that data), and
include that data in our measurements.

Where things got interesting was when, say, a large ISP was down. A bunch of
people couldn't get to us, but that wasn't really our fault (or was it?). At
the end of the day the users didn't care if it was our fault or their ISP's,
so we counted that against ourselves.

All of this is to say that yes, it's really hard to figure out uptime,
especially for distributed systems where almost every failure is a partial
failure.

But at the end of the day your users don't care which microservice or network
segment was at fault, they only care that they couldn't use the product.

~~~
feyman_r
>> Since we also had control of the client experience in almost all cases, we
were able to measure from the client side as well (and trust that data), and
include that data in our measurements.

The W3C Network Error Logging specification now allows websites to register
client-side errors (DNS, TCP, TLS etc.), record them offline, and send them
via side-channel telemetry to a _different_ endpoint later on. This has
changed how we measure client-side availability on browser-based services.

Chromium-based browsers (68+) support this feature, enabled by default.
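
For the curious, opting a site into NEL is just two response headers. A
minimal sketch (the group name and endpoint URL here are placeholders, not a
real service):

```
Report-To: { "group": "network-errors", "max_age": 2592000,
             "endpoints": [{ "url": "https://telemetry.example.com/reports" }] }
NEL: { "report_to": "network-errors", "max_age": 2592000 }
```

The browser caches the policy for max_age seconds and later POSTs error
reports (DNS, TCP, TLS failures etc.) to the named endpoint, which can be on
different infrastructure than the origin that failed.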

>> Where things got interesting was when say a large ISP was down.

NEL also helps the service owner determine with some level of accuracy
whether a problem in an ISP is specifically related to the service, or to the
ISP.

{disclaimer: work for MSFT, specifically on NEL-related tech}

[1] https://w3c.github.io/network-error-logging/

~~~
jedberg
How do you guarantee that the client hasn't modified the telemetry data?

~~~
yjftsjthsd-h
I doubt you can, but why would they? I mean, yes, maybe an unethical
competitor could use it as an attack, and yes we live in a world where
AdNauseam exists, but I would be _surprised_ if that kind of thing was enough
to matter at any scale (i.e. it'd disappear below your noise floor unless it
was a really good mass attack).

~~~
jedberg
Well, it depends on the use case. At Netflix for example, it was the outliers
that were interesting. There aren't many 2011 LG TVs out there, but if they
were all consistently failing, that would be something we'd want to call out.

Even every 2011 LG TV failing at once falls below the noise floor, unless
you're specifically looking for it.

But if you _are_ specifically looking for outliers, then a single bad actor
can really mess up your day.

------
bitminer
This is a useful and interesting contribution from Google.

However, like a lot of current software and web-scale systems research, almost
no attention is paid to contributions from other industries, or even current
practice in automotive, aerospace, consumer or finance, to name a few.

The earliest reference in the paper is 1986, with most post-2001. A definition
of availability is attributed to Toeroe and Tam 2012. It was most definitely
in use decades before!

The authors and interested others would benefit from reviewing current systems
engineering practice (see INCOSE.org organization). Texts such as

Blanchard and Blyler, Systems Engineering Management

Blanchard and Fabrycky, Systems Engineering Analysis

NASA, Systems Engineering Handbook (and accompanying management handbook)

and others would open a few eyes, I expect.

The biggest issue with the term "availability" for a single-instance system
is well identified by Google. The term originated, however, for systems with
many thousands of instances, such as procurement of combat aircraft,
operating vehicle fleets and so on. Substituting another measure (and term)
is a benefit.

However, one issue I didn't see addressed in the paper: how to measure success
for the purpose of an SLA. Contracts would need a simple comparison to a
single number.

