
Your nines are not my nines - zdw
http://rachelbythebay.com/w/2019/07/15/giant/
======
altmind
Million times this.

It's shocking how an "elevated rate of errors for a specific endpoint" on your
cloud provider's status page gets amplified into a soft outage of your product
when your writes to disk never return, your databases return inconsistent
data, or your orchestration takes drastic measures over a failing health
check.

When you have a lot of components in your cloud mix, the failure of any one
stage (network -> balancing -> querying -> rendering -> persistence) brings
everything down.

If 10 of your cloud services each have a reliability of 99.999%, the combined
reliability is not 99.999%.

Cloud providers can claim mountain-high availability, yet users will never get
their apps running at the advertised reliability, because there are now
multiple subcomponents that can each fail.
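
A back-of-the-envelope sketch of why the nines multiply (the numbers are illustrative):

```python
# Availability of serially-dependent components multiplies:
# the system is up only when every component is up.
components = [0.99999] * 10  # ten services, each "five nines"

availability = 1.0
for a in components:
    availability *= a

downtime_min_per_year = (1 - availability) * 365 * 24 * 60
print(f"combined availability: {availability:.6f}")              # 0.999900
print(f"expected downtime: {downtime_min_per_year:.0f} min/year")  # ~53
```

Ten "five nines" services chained together already cost you a nine: roughly 53 minutes of downtime a year instead of 5.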

~~~
m463
I worked at a company once where each bug had a really interesting field: root
cause.

I wish I could remember the values you could fill in; they were very
intelligently chosen.

What I learned: if you didn't know what the root cause was, you probably
didn't fix anything.

~~~
devdas
There is no root cause.

[https://www.kitchensoap.com/2012/02/10/each-necessary-but-
on...](https://www.kitchensoap.com/2012/02/10/each-necessary-but-only-jointly-
sufficient/)

[https://blog.acolyer.org/2016/02/10/how-complex-systems-
fail...](https://blog.acolyer.org/2016/02/10/how-complex-systems-fail/)

[http://web.mit.edu/2.75/resources/random/How%20Complex%20Sys...](http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf)

~~~
inflatableDodo
Another way to look at that is that a root cause can be a set.

~~~
hnick
I've had more than one case at work where it came down to bad requirements.
Both systems worked exactly as specified and were bug-free (for the issue in
scope); they just had, by design, a different understanding of reality. The
root cause here was some mixture of poor understanding of the problem domain
by various staff.

~~~
JohnFen
> Both systems worked exactly as specified and were bug free

If the system is working as designed, then there really is no bug. A bug is a
malfunction, after all.

I've always called broken systems that are working as designed BAD: Broken As
Designed.

~~~
hnick
I'd agree. The defect (not sure you can call it a bug?) was in the design/spec
not the system.

------
ChuckMcM
This rings so true it hurts. At a very large, very blue company, I recall
trying to explain to an account manager that I wanted to write the SLAs in
terms of my footprint. Which is to say: given the resources you have allocated
to my account, let's set some SLAs like "latency from any node to any node",
"latency from any node to the primary internet", "latency from any node to the
secondary internet", "availability of primary internet", "availability of
secondary internet", and "blended availability of both."

I had a bunch of these things, all of which were things that were tracked,
measured, and monitored, in an existing setup.

Their response was, "We really don't have any way to provide the data for your
SLAs, much less actually sign up to enforce them." I suggested that they were
not serious about being in the 'cloud' business then. They seemed miffed.

~~~
beering
Would you be willing to pay more to have the SLAs behave like that, and if so,
how much more? Genuinely curious, maybe there's a market for "cloud but better
SLAs".

~~~
ChuckMcM
Good question. The point of the article is that the value of cloud SLAs is
inversely proportional to the size of the cloud.

Think of it this way: consider an "availability" SLA defined as 'the mean
availability of all hosts in our cloud'. If it's reported at "five 9's", or
99.999%, that means a cloud of 100,000 machines could have one machine down
for days at a time without the SLA ever slipping. Big providers average over
multiple hundreds of thousands of machines; your stuff could be down all the
time and yet 'everything' is "meeting all the SLAs".
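
The fleet-averaging arithmetic is easy to sketch (the fleet size and SLA figure are from the scenario above; the 30-day month is an assumption):

```python
# A "mean availability across all hosts" SLA spreads the error budget
# over the whole fleet, not over *your* machines.
fleet_size = 100_000
fleet_sla = 0.99999            # "five nines", averaged fleet-wide
month_minutes = 30 * 24 * 60   # a 30-day month

# Total machine-minutes of downtime the fleet can absorb while still
# "meeting" the SLA:
budget = fleet_size * (1 - fleet_sla) * month_minutes
print(f"{budget / (24 * 60):.0f} machine-days of downtime per month")  # 30
```

In other words, one unlucky machine (perhaps yours) can be down for the entire month while the fleet-wide number stays at five nines.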

You see this outside of data centers in other overly generalized metrics.
Unemployment is only 3.7%[1]! Yay, right? Tell that to the people of Magoffin
County, Kentucky, where unemployment is 12.3%[2].

So would I pay more? I don't know. If none of the service providers offered
SLAs based on my footprint, it wouldn't be a choice. If one did, it would
become the preferred choice even if it were more expensive. Would all of them
then follow to remain competitive? Another good question. It could be a good
differentiator for the #3 cloud provider, Google. I know they have the
technology to do it if they chose to.

[1]
[https://www.bls.gov/news.release/pdf/empsit.pdf](https://www.bls.gov/news.release/pdf/empsit.pdf)

[2] [https://www.lanereport.com/112723/2019/04/state-releases-
cou...](https://www.lanereport.com/112723/2019/04/state-releases-county-
unemployment-data-for-march-2019/)

~~~
rossjudson
You might want to have a look at some Google Cloud SLAs[1]. They are generally
calculated based on actual performance on a particular customer's RPC traffic,
down to individual RPCs (at least on Google Cloud Storage, where I work). Read
through the agreements to the definition of "Error Rate", and I think you'll
find the terms you're looking for.

You're welcome! :)

The root post does raise an important issue, though -- just because GCS thinks
it's doing great on your RPCs doesn't mean that your _system_ is doing great.

[1] [https://cloud.google.com/terms/sla/](https://cloud.google.com/terms/sla/)

~~~
jaytaylor
How are Google Cloud SLAs valuable when the service regularly has multi-hour
(IIRC, > 5 hours!) (sometimes global) outages?

I posit that Google is not presently in a good position to hold up as a role
model or case study for effective cloud provider SLAs.

~~~
badpun
>How are Google Cloud SLAs valuable when the service regularly has multi-hour
(IIRC, > 5 hours!) (sometimes global) outages?

You can sue them? (if that's not forbidden by the contract).

~~~
azernik
See my sibling comment - they have a process for talking to them to get
discounts for months when they don't meet the SLA. If they _don't_ give you
said discount, you can sue them for breach of contract, but if they pay out
you can't sue them for damages.

------
tschellenbach
I think these blog posts create a false narrative. They should start by
acknowledging that in 99% of cases, any outsourced system's stability will be
better than anything in-house. Yes, there are some vendors that don't do a
good job; I've heard many people complain about Layer, for instance. But most
vendors (AWS, Stripe, Algolia, Stream) can invest more time and effort in
stability than you can feasibly do for an in-house solution. This is not
surprising: if you do something for thousands of customers, you can dedicate
more effort to QA, docs, maintenance, monitoring, firefighting, etc. For every
story of someone having vendor issues, there are dozens of things going to
shit with in-house code.

There have been times in the past when I was annoyed with AWS stability
issues. We've all been there. But I also know that AWS is more stable than
anything I could feasibly build in-house.

~~~
AstralStorm
Will it? Response time to failure often matters more than availability
itself. Redundancy as well, which can be cheaper to handle manually than to
rely on some opaque provider process. And support on any cloud is terrible for
a small client.

------
tootie
I've been on the receiving end of this from the POV of a Fortune 50 company.
Companies that are not gnats on anyone's windshield. The treatment is the
same: these big guys all just suck at professional service. All their money
goes into sales and product engineering. Lock-in breeds retention, and the
biggest clients tend to be the most locked in, too.

~~~
user5994461
Having worked in an F50 too.

When there is a choice between a public cloud with a 99.xxx% SLA or an
internal cloud with 90% uptime and a 6-month SLA to get a server, the right
choice is always cloud.

~~~
DangitBobby
>public cloud

>internal cloud

>the right choice is always cloud

I can't tell if you missed a word or are making a joke...

~~~
notatoad
"internal cloud" is not the cloud. It's just a bottle full of mist.

~~~
illvm
I thought it was something like Azure Stack.

------
cortesoft
This can be complicated, though. I work for a large CDN, and we have systems
that monitor our customers' experience. Almost every issue those systems
discover, however, ends up being an issue with a customer's origin or
configuration. We ended up having to change our procedures for responding to
issues we discovered, because all of our support time was spent investigating
them, only to realize they were outside our control.

There are always two sides to these sorts of things.

~~~
ropman76
One of the most frustrating things about situations like this is actually
getting ahold of someone with enough experience to say where the issue is to
begin with, even if it's out of the provider's control.

I have sent a lot of log files to a cloud vendor trying to find out why their
web-hosted application was so slow (6-10 second response times on a CRM app
they provided). If someone had responded with an actual answer ("your firewall
is blocking traffic", "try this setup", etc.), I could have worked with that.
Instead we got nothing but stealth ticket closes and "sorry, we don't know why
this is slow" responses. This article hit a nerve, because you really do dance
to someone else's tune when you go to the "cloud".

~~~
rossjudson
I think there is a lot of room for cloud provider innovation in this area. It
shouldn't take a human to tell you what's wrong.

------
hinkley
We've talked about these classes of problem from time to time, and a lot of
ideas have been put forward, but what's the solution?

If I have a multi-tenant system and no one customer is dominant (a dominant
customer always causes problems, IMO), my 'biggest customer' might only be 4%
of my traffic. There are a million things that can go wrong to make this
customer's experience different from everyone else's, from getting my sharding
solution wrong to small-c n^2 issues (with a whole lot of space in between for
n*log(n) problems).

If I'm doing 95th-percentile calculations, that will not show up in my
metrics. If I have a larger customer that's 10% of my traffic, almost half of
their users could be having issues before my alerts go off.

And then there's explaining to your boss that 5 9's across twenty interacting
services is around 99.98%, and that's only if degradation in one service
doesn't cause failure in another.
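
Both of those effects are easy to check with a couple of lines (the shares and failure rates here are the illustrative ones from above):

```python
# 1) A p95 alert threshold tolerates 5% of requests failing. If one
#    customer is 10% of traffic, almost half of *their* requests can
#    fail without tripping the alert.
big_customer_share = 0.10
customer_failure_rate = 0.45          # nearly half their users broken
overall_error_rate = big_customer_share * customer_failure_rate
print(f"{overall_error_rate:.3f}")    # 0.045 -> still below the 5% line

# 2) Twenty serially-dependent services at five nines each:
system_availability = 0.99999 ** 20
print(f"{system_availability:.4%}")   # 99.9800%
```

So the alert never fires, and the twenty-service chain has already slipped from five nines to roughly 99.98%.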

~~~
jameshart
The issue is that percentiles are a very crude tool. This isn't just a
multitenancy problem; it can manifest in any multiuser system. If 0.5% of your
traffic comes from New Zealand, a DNS issue affecting your CDN routing that
causes all NZ traffic to time out won't affect your 99th-percentile load-time
graph at all. Essentially, percentiles are useless for discovering problems
that have a strong effect on a small portion of your traffic.
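
A quick simulation of that blind spot (the latency distribution and the 0.5% share are invented for illustration):

```python
import random

random.seed(42)
# 99.5% of requests: healthy, roughly 200 ms.
latencies = [random.gauss(200, 30) for _ in range(9950)]
# The 0.5% "New Zealand" slice: hard 30-second timeouts.
latencies += [30_000.0] * 50

latencies.sort()
p99 = latencies[int(len(latencies) * 0.99) - 1]
print(f"p99: {p99:.0f} ms")  # still in the healthy tail, nowhere near 30 s
```

All 50 timeouts sit above the 99th percentile, so p99 barely moves even though an entire country's traffic is down.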

Scatter plots and histograms are much better at telling you when the
distribution of a stat has gone bimodal, with a small but consistent outlier
group. Percentiles only make sense for telling you, when you already know what
shape the distribution curve should be, how flat that curve is right now. They
don't tell you when the shape has changed.

~~~
bosie
What scatter plot and/or histogram do you have in mind to dig up the small but
consistent group in the outlier group?

~~~
jameshart
Ah, yeah - figuring out what they have in common is the trick :)

Good APMs and trace tools will let you zero in on traces by characteristics -
so if you notice there’s a bump in requests which have a 2 second load time,
you can select them all and analyze how they are distributed - whether they
are mostly one browser, one location, or one user even. But you need a solid
strategy for tagging and logging traces.

------
Rafuino
I've checked and the top cloud players all have uptime SLAs (which according
to the blog post don't seem to have the necessary granularity to matter). See
[https://aws.amazon.com/compute/sla/](https://aws.amazon.com/compute/sla/),
[https://cloud.google.com/compute/sla](https://cloud.google.com/compute/sla),
and [https://azure.microsoft.com/en-
us/support/legal/sla/summary/](https://azure.microsoft.com/en-
us/support/legal/sla/summary/) for examples.

But are there other SLAs, e.g. for in-zone latency or hardware performance
(IOPS or bandwidth from local or remote storage)? Are those kinds of SLAs part
of larger private agreements (for, say, Netflix, a huge AWS customer), or is
uptime the only SLA offered? I haven't been able to find any info on this in
my searches...

~~~
013a
They all have SLAs, but the "tier" at which a problem happens dramatically
affects how the SLA pays out, or if it pays out.

For example, within the AWS Compute SLA you linked:

> Unavailable is defined as: For Amazon EC2 (other than Single EC2 Instances),
> Amazon ECS, or Amazon Fargate, when all of your running instances or running
> tasks, as applicable, deployed in two or more AZs in the same AWS region
> (or, if there is only one AZ in the AWS region, that AZ and an AZ in another
> AWS region) concurrently have no external connectivity.

Get this: Single EC2 Instances have an SLA of _90%_. Seriously. It's in that
agreement.

In other words: AZ outages rarely see a payout, because you "didn't architect
your cloud correctly". We were told some nicer variation of this when asking
for a reimbursement a few years back. You do have to ask, you know. They could
literally automate this process, but they don't. Whatever.

Let's also be clear about the language here: there's no "payout" at all. What
happens is, you get the amazing privilege of not being forced to pay them for
a product that didn't work.

That ties directly back to the article; they pay out based on their
architecture and SLAs, which are not your architecture and SLAs, unless you
perfectly match your architecture to their architecture, which will have gaps,
and then you're bought in so hard that you could never leave if you needed to.

~~~
unreal37
I know it's just an example, but it really IS how you architect your solution.
If you have a single EC2 instance without redundancy that's important to
something, you're doing it wrong.

~~~
yjftsjthsd-h
You're not _wrong_, but it feels like a cheap answer. Why should I pay twice
as much just because AWS can't keep an instance up? (And double-cost can be
understating it; I've run commercial software where the multi-
instance/clustered version is far more expensive than the single-node version.
I'd actually _like_ to make it fully multi-AZ, but I'm not going to get the
company to drop that kind of money on it.)

------
aluminussoma
The problem is that the vendor is incentivized to publicly use whatever metric
shows the highest availability. Otherwise, the vendor will have to pay back
credits. The vendor's nines are never my nines.

~~~
toomuchtodo
The only way this gets solved is through cloud consumers providing streams of
telemetry (sanitized of anything of value besides success/failure metrics of
the underlying cloud primitives) to a central uptime-stats broker
(Speedtest.net meets DataDog meets the Internet Weather Map). The incentive
for a vendor to fudge or exaggerate uptime claims through sales and marketing
is too high; let the data speak for itself.

Do you trust AWS' status page? Or are you coming to Hacker News to ask why
your network latency between instances has skyrocketed unexpectedly?

~~~
hinkley
I wonder if this is a sort of thing you could interest EFF or another
organization to put funding behind.

------
inlined
This doesn’t seem to be true (at least for all vendors). AFAIK, Google Cloud
has per-customer SLAs, though you might need to have enough traffic for
statistical significance in some products.

~~~
wrs
Using GCE as an example [0], it’s per-customer, but _all instances_ in
_multiple zones_ have to be unavailable. You could have 99% instance failure
and not qualify.

>Loss of external connectivity or persistent disk access for all running
Instances, when Instances are placed across two or more Zones in the same
Region.

[0]
[https://cloud.google.com/compute/sla](https://cloud.google.com/compute/sla)

------
sumanthvepa
I usually advise my clients to treat cloud providers like they would treat
hardware: perhaps a bit more reliable, though in practice that is usually not
the case. If you cannot afford for your database to be unavailable, invest in
creating a standby for it. That advice holds regardless of whether the
database runs in a cloud or not. Base your investment decisions on the
downtime you observe from the cloud provider. Don't expect your cloud provider
to magically ensure your app has n nines of reliability. That's on you.

------
ducktypegoose
I get the sense rachelbythebay may be another satisfied Azure customer.

~~~
quickthrower2
Azure customer gets the blues.

------
stefco_
Cloud solutions have plenty of issues, but I'm quite surprised there aren't
more replies talking about how many impossible problems have been made
tractable and reliable thanks to the cloud (or to be more precise, made _much
cheaper_ to solve reliably). The article makes a great point about
_accountability_ , in the sense that no one at a cloud-providing corp is
immensely worried about transient failures that only affect small user sets
heavily (which sucks if you're affected). But for my scientific computing use
cases, getting things working reliably at any sort of scale within budget is
_impossible_ without the cloud. Research institutions' computing clusters are
just smaller, less reliable, less flexible versions of the cloud (good luck
getting sysadmins to do anything useful at all).

One of the collaborations I work in, LIGO, recently gave up on private servers
and transitioned to AWS for our Gravitational Candidate Database [1] because
the cloud is so much better. I made this change to my own low-latency search
framework [2] years ago. If you're not "lucky" enough to (be forced to) use a
university/collaboration cluster, you'd have to maintain your own server,
which is orders of magnitude less reliable and more expensive/difficult. I
understand that not all workflows are the same, but for all of my nontrivial
applications, cloud providers save so much time and money that I can do
something as bold as making a provider-agnostic architecture with more robust
failover. I recognize that more complicated workflows might require e.g. 10
separate AWS services with AWS-specific features causing lock-in, but at that
level of complexity, I'm guessing the problem must be virtually impossible
with a non-cloud solution anyway. If you really can't figure out another way
to deal with resiliency, you might just need to accept that your problem space
is really hard and that you're lucky to even be able to run it at all. Again,
I think the original article is right about the fact that _you have to account
for this_ yourself; the cloud is not magic, and your code still has to
understand that it is (like all abstractions) going to leak.

Again, the point about responsiveness in the original article is very well-
taken; I'm just surprised more people aren't observing that _overall_, the
reliability, cost, and flexibility provided by cloud solutions are utterly
transformative.

[1] [http://gracedb.ligo.org](http://gracedb.ligo.org)

[2] [http://multimessenger.science](http://multimessenger.science)

------
crazygringo
I have a hard time taking this article seriously when it’s all “innuendo” and
not actually naming any names or providing any verifiable facts at all.

If the author had a specific problem with specific SLA’s, tell us with real
details.

And SLA’s aren’t for winning the lottery or providing impossible-to-meet
standards. You need to look at what they actually cover, compare with your
costs and reliability of running infrastructure in-house, and then pick the
right tradeoff for you. I can’t even tell if the author is accusing cloud
providers of fraud, of being misleading, if the author just never understood
the SLA properly, or what.

------
devnonymous
Well, even their nines aren't a whole lot of nines, if you've been paying
attention to all the outages of late.

------
ex3xu
Isn't this why service-level agreements exist? If the nines of uptime are that
important to your business and you don't want to be a gnat on a windshield,
you've got to give the vendor some financial incentive to pay attention to
you, right?

Or is Rachel talking about a situation where you have an SLA in place, but you
can't even prove downtime to the vendor because their monitoring software is
inadequate?

~~~
mentat
The cost to your business will always be greater than the maximum refund on
the SLA, even if it's a full refund. That's why you're using the provider in
the first place, since you can make more money than they're charging.

~~~
gowld
That's not true. If you get a month refund for a day down, you can still come
out ahead. Even so, it's not reasonable to compare to perfection, you should
compare to other options.

------
sabujp
I'm not sure how this person has their system architected, but they should
look closely at the nines the company is talking about. Is it 5 nines across
all regions? Within a single region? What about for the specific service? It
really all depends; the post is an oversimplification, or they haven't
architected their system appropriately to actually get 5 nines on the host
cloud.

------
kwhitefoot
Surely when you buy such a service, the uptime guarantee applies to the
service you have paid for? What happens to the rest of the customers is
irrelevant, and the average availability of the aggregate even more so.

------
LaserToy
They are not the only ones who aren't monitoring from the customer's
perspective. It is actually hard, as it involves much more than watching some
metric. I proposed a user-experience monitoring system to my ex-employer,
based on a very simple principle: a problem is the difference between what a
customer wants and should be able to do, and what the customer is able to do
right now. It's kind of funny, but some major engineering companies (maybe
even all) don't know what user experience is...

------
_bxg1
This plays into something I've been thinking about recently, which is that
even when a technology scales indefinitely, maybe technology _business_
doesn't. Maybe, in a world where hosting and CRUD apps and everything feel
like solved problems, there's still a place for smaller providers that can
interface with their customers directly and tailor themselves to their needs.
It's a vaguely comforting thought.

------
shanemhansen
I think the fundamental problem, in engineering terms, is that most cloud
deployments effectively wire together cloud components serially.

Your LB may have some nines, your individual VMs (or set of VMs in a region)
may have some nines, and your data store may have some nines, but if they
aren't all working together, it's unlikely your business will be up.

This is inherently customer-dependent, and yet it's super predictable (nobody
uses only an LB).

------
weberc2
Can we change the link URL to use HTTPS?

~~~
quickthrower2
Not sure why you were downvoted, but that would be a good idea, since the site
supports HTTPS.

------
mattbillenstein
You can get closer to their nines by using less of their stuff - if you use
every service AWS offers to power a single app, you'll have markedly fewer 9's
than if you only use a few...

The foundational services (VMs, DNS, S3, etc.) I've found to be more reliable
than others (EBS).

------
blueyes
Public clouds: The new Comcast.

------
hermitdev
I'm testing changes to a process that uses Azure. We have dozens of on-site
SQL Server DBs, but this one process decided to use Cosmos because... they
didn't have to write as much code. The developer is gone, but I'm left
supporting a process that pulls data down from its source, does a little
transform, then shoves it into Cosmos. Then it pulls the data directly back
from Cosmos to load it into our internal SQL Server DBs. Why?! It's a total
facepalm to me. An extra stage, an extra step, extra complexity, and extra
cost for no gain. We don't serve any external pages or services; there's no
reason for this data to be in the cloud. It's all internal use.

Fucking hate devs that do this, especially the ones that wander on before they
have to justify their actions to anyone.

/rant

------
johngalt
It's not just the cloud provider's perspective on outages; that is the rosiest
interpretation of misleading availability stats. There is an obvious moral
hazard involved, because most availability tracking is self-reported and
outage criteria are vague.

This behavior certainly isn't limited to cloud providers. If anything,
internal operations departments are worse. The only difference is that
internal departments can be pressured more effectively.

------
nsxwolf
What does that title even mean?

~~~
ska
It should read "Your nines are not my nines" which is a little better.

------
ravedave5
This is so incredibly true. My company has internal services used between
teams. For some reason my app can always tell when another app is down and
they never can.

------
JTbane
Even funnier, when a massive network outage occurs, cloud providers shrug and
say "not my problem :-)"

~~~
darkcha0s
How would you mitigate that if you were running it on-premise?

------
zerocrates
Is this title a consequence of some automatic HN system to try to reduce
"fluff" in titles? It's kind of nonsensical in this instance. I saw another
title earlier today missing a leading "How" that also didn't make much sense.

~~~
sctb
Indeed. Sorry! If you notice such disfigurement in the future feel free to
email hn@ycombinator.com and we might get to it quicker.

------
draw_down
I suppose that’s true. But it also occurs to me that their ops team is not
your ops team, their observability stack is not yours, etc.

I work for a company that’s fairly well known here. I can’t recall us having
an outage (or something less severe than a full outage) that was our cloud
provider’s fault and not ours. I’d recommend the appropriate caution before
“blaming the compiler”.

------
frostyj
I still remember when S3 bragged about their SLA being five nines. After the
notorious incident ~2 years ago, they took it down to three nines.

~~~
icedchai
You may be confused. "One nine" would mean 90% availability.

~~~
frostyj
My bad - 5 nines vs. 3 nines.

~~~
quickthrower2
a 40% drop in nines. 40% less reliable!

