
SRE Fundamentals: SLIs, SLAs and SLOs - nealmueller
https://cloudplatform.googleblog.com/2018/07/sre-fundamentals-slis-slas-and-slos.html
======
Animats
Google: _" An SLA normally involves a promise to someone using your service
that its availability should meet a certain level over a certain period, and
if it fails to do so then some kind of penalty will be paid. This might be a
partial refund of the service subscription fee paid by customers for that
period, or additional subscription time added for free."_

"Partial refund". That's a very low standard for a service level agreement,
but typical of Google. Your whole business is down, it's their fault, and all
you get is a partial refund on the service.

A service level agreement is really a service packaged with an insurance
product. The insurance product part should be evaluated as such - does it
cover enough risk and is the coverage amount high enough? You can buy business
interruption insurance from insurance companies, and should price that out in
comparison with the cost and benefits of an SLA. If this is crucial to your
core business, as with an entire retail chain going down because a cloud-based
point of sale system goes down, it needs to be priced accordingly.

See: [1]

[1] https://www.researchgate.net/publication/226123605_Managing_Violations_in_Service_Level_Agreements

~~~
luhn
> "Partial refund". That's a very low standard for a service level agreement,
> but typical of Google.

It seems to be _the_ standard. The most generous SLA I've seen is 5% off the
monthly bill for each 30 minutes of downtime (up to 100%). If I'm down for 10
hours, waiving one month of bills doesn't come close to the damage done.

An SLA seems to be more of a promise than an agreement, because if the service
goes down you're SOL and the provider gets a slap on the wrist (partial
refund).

~~~
stephengillie
I've worked for a cloud provider who paid 45x for downtime. If you were down
for an hour, you got 45 hours of credit on your bill.

My current ISP credits 5x: I was impacted in an outage expected to last all
day, and they credited me 5 days on my next bill.

~~~
joshuamorton
450 hours is less than a 720-hour month, so that SLA is actually worse than
the one the above poster described, at least if you're down for less than
about 16 hours.
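
Back of the envelope, assuming a 30-day (720-hour) billing month:

    # Compare the two credit schemes, as a fraction of the monthly bill.
    MONTH_HOURS = 30 * 24  # 720

    def credit_5pct_per_30min(outage_hours):
        # 5% of the monthly bill per 30 minutes of downtime, capped at 100%
        return min(1.0, 0.05 * (outage_hours / 0.5))

    def credit_45x(outage_hours):
        # 45 hours of bill credit per hour of downtime
        return 45 * outage_hours / MONTH_HOURS

    for h in (1, 10, 16):
        print(h, credit_5pct_per_30min(h), credit_45x(h))
    # 1h:  0.10 vs 0.0625 -- 45x pays less
    # 10h: 1.00 vs 0.625  -- 45x pays less
    # 16h: 1.00 vs 1.00   -- break-even at ~16 hours of downtime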

------
pspeter3
This is a great article for defining terms. For some reason though, this quote
made me laugh out loud:

"Excessive availability can become a problem because now it’s the expectation.
Don’t make your system overly reliable if you don’t intend to commit to it to
being that reliable."

~~~
ksmith14
The Google SREs mentioned this in their book; the Chubby locking service had
uptime that was so high that folks started to neglect making their own
services resilient to Chubby failures:
https://landing.google.com/sre/book/chapters/service-level-objectives.html#xref_risk-management_global-chubby-planned-outage

~~~
robax
+1 for this book. As a junior DevOps engineer this book has been super
helpful.

~~~
philsnow
The book is structured in a way that makes it pretty easy to jump around and
pick and choose which parts you want to read or skip, so it's not a very large
commitment to read it.

------
peterwwillis
If you're building a system from scratch, keep in mind that this way of
designing your service may not be flexible enough. You don't want just
service-level objectives, agreements, and indicators; you want
_customer-level_ ones.

Your service may end up providing for multiple customers with different
requirements. Maybe 1% of your customers will end up using 99% of your
resources, creating uncomfortable situations that affect the other 99% of
customers. To get away from this you have to start spinning off multiple
identical services just for groups of customers, which is really annoying to
maintain. You may find you need to add hard resource limits to control
customer behavior, which is hard to add after the fact.

Instead, if you design your new system from scratch with customer-specific
isolation and service levels, you can run one giant service and still prevent
customer-specific load from hampering the rest of the service. You can also
just run duplicate services at different levels of availability based on
customer requirements, but that's not going to work forever.
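
For illustration, the kind of hard per-customer limit I mean might look like
this (a minimal token-bucket sketch; all names and numbers are made up):

    import time
    from collections import defaultdict

    class PerCustomerLimiter:
        """Token bucket per customer: a heavy customer exhausts only
        their own bucket, not the shared service."""

        def __init__(self, rate_per_sec=10.0, burst=20.0):
            self.rate = rate_per_sec
            self.burst = burst
            self.tokens = defaultdict(lambda: burst)
            self.last = defaultdict(time.monotonic)

        def allow(self, customer_id):
            now = time.monotonic()
            elapsed = now - self.last[customer_id]
            self.last[customer_id] = now
            # refill this customer's bucket, up to the burst ceiling
            self.tokens[customer_id] = min(
                self.burst, self.tokens[customer_id] + elapsed * self.rate)
            if self.tokens[customer_id] >= 1.0:
                self.tokens[customer_id] -= 1.0
                return True
            return False

    limiter = PerCustomerLimiter()
    print(limiter.allow("customer-42"))  # True until that bucket drains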

As an aside, I'm looking forward to reading ITIL 2019 to see what new
processes they've adopted. I think everyone who's getting into SRE stuff
should have a solid foundation on the basics of IT Operations management
first.

------
asn1parse
In ops, we often have other internal groups that we either work with or
support. It's often useful to view these groups as customers: you apply the
same policies, perhaps with a few exceptions, to manage the relationship.
Typically we call this the OLA, the operating level agreement.

I can only speak from my own experience, but operations groups I've been part
of that lacked this concept typically suffered various kinds of reputational
damage, because there were no rules around how internal groups assess
accountability. With an OLA in place, you can defend your position as long as
you stayed within its terms. For example, when we started building VAData
data centers all over the world for Amazon, having an OLA let us push back on
groups that claimed we were not holding up our end of the agreement.

~~~
mlthoughts2018
I work in machine learning, where my team’s ML web services are typically
requested by other in-house teams to provide features for their business
logic, and so our SLAs are also agreements with other in-house teams.

What I’ve found is that product managers and business people are typically
extremely resistant to traditional concepts of software requirements or
feature planning, because they want flexibility to change requirements late in
development without any negative repercussions for them.

But somehow the language of SLAs magically clicks and they are more receptive
to defining a service agreement. Then you ask them, from a business point of
view, how much uptime does it need, what sort of throughput does it have to
support, is the budget for outages or failures distributed equally across all
features or more important for some features than others?
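
Even a rough translation of those answers into numbers moves the conversation
along. For example (a 30-day month; the targets are hypothetical):

    # Turn an uptime answer into a concrete monthly error budget.
    MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes

    for target in (0.99, 0.999, 0.9999):
        budget_min = (1 - target) * MONTH_MINUTES
        print(f"{target:.2%} uptime -> {budget_min:.0f} min of downtime budget/month")
    # 99.00% -> 432 min (~7.2 h)
    # 99.90% -> 43 min
    # 99.99% -> 4 min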

This practically leads directly to the same scoping and requirements
discussion you would have had in traditional software planning, but for some
reason the language of SLAs is more palatable, so I’ve found it is an
effective way to get around some non-tech person in the loop who might be
fighting against detailing a proper spec or documenting priority delivery
among features.

------
bpchaps
When reading these articles, never forget that your company is NOT Google! If
your company doesn't have the management/infrastructure/communication/skill
structure that Google has, then it will be very difficult to implement these
fundamentals.

In many cases, an SRE role exists to save costs. If your company doesn't get
its shit together and doesn't give its SREs the support they need, then
they'll hate their jobs and the company.

~~~
oblio
I have no idea why you’re being downvoted. It’s the same thing as
Borg/Kubernetes, MapReduce/Hadoop: some things just don’t apply or aren’t as
effective unless you’re operating at a huge scale and with Google’s culture.

~~~
mmt
> unless you’re operating at a huge scale and with Google’s culture.

I'm not sure one has to go to the extreme of _huge_ scale, anywhere near
where Google is now (not that that's what you said), nor adopt all the
aspects of their culture, but I agree that key _fundamental_ aspects are
often missed.

My favorite example is to point out that Google does _not_ run Hadoop on
expensive, virtualized AWS instances (or even brand-name servers with useless-
for-purpose features[1] that creep up the cost). Rather, one of their
competitive advantages, from the very start, has been to optimize hardware
that they purchase, customize, and operate for cost (and performance).

The other is, as you mention, culture, which involves a remarkable amount of
specialization, with groups dedicated to hardware, networking, internal
tooling (i.e. building and maintaining the Hadoop-equivalent), and, of
course, SRE, who couldn't even begin to do their jobs without all those other
groups' support.

Of course, there's an argument to be made that things like k8s and PaaS/IaaS
can take the place of all those supporting groups at Google, but my
counterargument is that they both fail to impart any benefit of customization
(or, conversely cultural benefit of the mindset of doing everything that way
across the entire company) and carry a tremendous cost (in money and
complexity).

[1] redundant power supplies, high-density chassis, onboard hardware RAID

------
strmpnk
These distinctions started making more sense when I realized they map to OKRs,
which is generally how Google is said to track individual and team
performance.

In general, it's good to be precise about how you measure and when something
is a hard or soft boundary. Otherwise, firefighting gets out of control. It's
hard to determine when to stop something and put out a fire if you can't
prioritize issues based on the boundaries you've set for your system.

~~~
smueller1234
SLOs certainly don't rigidly map to OKRs. Maybe it's easier to consider them
(two sided) commitments about the quality of service? They're more of an
ongoing measure of quality rather than a quarterly objective.

~~~
strmpnk
Good point on the quarterly vs. continuous measurement. I'm not implying
they're rigidly mapped, but it makes sense that you can put quality changes
down as an objective for a team. This can be end-of-quarter quality, but also
the general rate of change over the entire quarter.

Depending on the situation, I've seen teams aim to achieve certain SLOs, but
the objective can also be to achieve other things without letting the SLOs
suffer (if they're already at a reasonably high quality).

------
zzzcpan
So, how do you choose that service level objective? How do you know which
solutions to implement so as not to make things "overly reliable"? Isn't that
the more important question? Doing this without some sort of methodology will
almost always result in useless solutions and overpaying cloud and other
hosting providers, like implementing rather expensive failover within the
datacenter while ignoring how unreliable datacenters are and how cheaply you
can implement failover between datacenters via DNS.

I like the idea of modelling availability/reliability for this. Even if you
don't have the right numbers and do it on a napkin, not in code, it can still
highlight the solutions with the best cost/benefit ratios.
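
Even the napkin version is only a couple of lines. With made-up numbers, and
the usual caveat that datacenter failures aren't truly independent:

    # Back-of-the-napkin availability model; numbers are illustrative only.
    dc = 0.995                   # one datacenter
    in_dc_failover = 0.998       # pricey redundancy inside that datacenter
    two_dcs = 1 - (1 - dc) ** 2  # cheap DNS failover across two datacenters,
                                 # assuming independent failures

    print(f"one DC:           {dc:.4%}")              # 99.5000%
    print(f"in-DC redundancy: {in_dc_failover:.4%}")  # 99.8000%
    print(f"two DCs via DNS:  {two_dcs:.4%}")         # 99.9975%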

~~~
gcardone_
Disclaimer: I am an SRE at Google, opinions are my own.

There's an excellent talk by Google VP of SRE Ben Treynor:
[https://www.youtube.com/watch?v=iF9NoqYBb4U](https://www.youtube.com/watch?v=iF9NoqYBb4U).
tl;dw: try to measure actual user experience, and make sure that even the
long tail of customers still gets a good product experience. What "good
product experience" means depends on your product.

The rest of the error budget is for you to spend on releasing new features,
changing the underlying architecture, etc.

------
erikb
So there is one obscure metric "service is available, i.e. can do its job",
and this metric has different attributes: there are actual metric values
(SLIs), there are internal goals (SLOs) and there are legally binding promises
(SLAs) to users/customers. I would argue that there is not much content here.

Content, imo, would be something like this: We define "available" as
"processor_load<99% and disk_load<99% and ram_load<99% and server responds
with http 200 on port xyz", because reason_a, reason_b, reason_c. But other
people could argue that it is not so much about the node as about how
service_x is experienced, so one could instead track the speed of HTTP
responses to user requests and require them to be under 0.1 sec at least 95%
of the time, etc.
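
The latency version, for instance, is trivial to measure (the sample numbers
are invented):

    def latency_sli(response_times_sec, threshold=0.1):
        # fraction of requests answered within the threshold (here 0.1s)
        fast = sum(1 for t in response_times_sec if t < threshold)
        return fast / len(response_times_sec)

    # SLO from the example above: at least 95% of responses under 0.1s
    samples = [0.03, 0.05, 0.08, 0.12, 0.04]
    print(latency_sli(samples))          # 0.8
    print(latency_sli(samples) >= 0.95)  # False: this window misses the SLO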

That you should track metrics, that you should set goals, and that you should
define SLAs with your customers/users is standard business practice, not new
knowledge.

------
alttab
"Within Google, we implement periodic downtime in some services to prevent a
service from being overly available."

Uh..... what?

~~~
nbm
Services have different relationships with each other in terms of
dependencies, and in terms of what you think those dependencies are.

If your idea of how things work is that services A, B, and C can optionally
use service D, falling back to some other process otherwise, then as long as
D has never failed, you've never exercised that fallback process. And
services X, Y, and Z, which rely on services A, B, and C, haven't had to deal
with those services using their fallback processes either. So, instead of
waiting for D to fail, you can take it down at a convenient time.

This applies to services as a whole, or services within a locality, or all
services in some availability zone.
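
Roughly, in code (following the A/D example above; everything here is a
hypothetical sketch):

    import random

    class ServiceD:
        # Stand-in for the optional dependency D (hypothetical).
        def enrich(self, request):
            if random.random() < 0.01:  # simulate D failing, or being taken down
                raise ConnectionError("service D unavailable")
            return request + " +enriched-by-D"

    service_d = ServiceD()

    def local_default(request):
        # The fallback process: degrade gracefully instead of failing outright.
        return request + " +default"

    def handle_request(request):
        # Service A optionally uses D. Planned outages of D exercise the
        # except-branch before a real failure does.
        try:
            return service_d.enrich(request)
        except ConnectionError:
            return local_default(request)

    print(handle_request("req-1"))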

------
ProAm
This is an interesting article from a company that has almost nil customer
support.

~~~
rahimnathwani
From the movie The Negotiator:

A Marine and a sailor are taking a piss. The Marine goes to leave without
washing up. The sailor says, 'In the Navy they teach us to wash our hands.'
The Marine turns to him and says 'in the Marines they teach us not to piss on
our hands'.

BTW it's not true that Google has almost nil customer support. There's
extensive support for paying customers (for ads, GCP, GSuite etc.).

But it's amazing to me how reliable things like Gmail are, and how in so many
years I've never felt the need to seek support.

~~~
iamdave
The joke in that scene always baffled me, because the Marines are born of the
Navy and still carry a _lot_ of the Navy's epistemology; why would they be
taught something so fundamental so differently?

(Yes it's a joke but sometimes I overthink things, heh)

~~~
jldugger
I've also seen it as Harvard and MIT graduates, then someone comes in, washes
his hands first, saying "at Yale, they taught us to wash our hands before
touching a holy object."

~~~
tpfour
Quite OT but I almost always wash my hands _before_ (and after) using the
restroom. Especially in a public place, it always made sense to me to do it
before and after. It seems much more hygienic both for the "holy object" and
other people!

~~~
walshemj
I was told that if you work in a chemical plant or a chip fab, you learn to
wash your hands before, because you don't want chemicals on sensitive parts.

And of course you don't know what germs etc. are on the taps :-)

~~~
TomMarius
You usually use gloves though

~~~
walshemj
Yes, so there is nothing wrong with an extra layer :-) Ask any medical
practitioner why they put gloves on after they wash their hands.

------
insiderinsider
Getting the definitions right

------
saywatnow
Does Site Reliability include using assets from no less than 7 domains and
requiring Javascript to present a few paragraphs of text?

~~~
TheCoelacanth
Presumably their blog is very low on the list of things they care about the
reliability of.

