SRE Fundamentals: SLIs, SLAs and SLOs (googleblog.com)
287 points by nealmueller 7 months ago | 83 comments

Google: "An SLA normally involves a promise to someone using your service that its availability should meet a certain level over a certain period, and if it fails to do so then some kind of penalty will be paid. This might be a partial refund of the service subscription fee paid by customers for that period, or additional subscription time added for free."

"Partial refund". That's a very low standard for a service level agreement, but typical of Google. Your whole business is down, it's their fault, and all you get is a partial refund on the service.

A service level agreement is really a service packaged with an insurance product. The insurance product part should be evaluated as such: does it cover enough risk, and is the coverage amount high enough? You can buy business interruption insurance from insurance companies, and should price that out in comparison with the costs and benefits of an SLA. If this is crucial to your core business, as with an entire retail chain going down because a cloud-based point of sale system goes down, it needs to be priced accordingly.

See: [1]

[1] https://www.researchgate.net/publication/226123605_Managing_...

> "Partial refund". That's a very low standard for a service level agreement, but typical of Google.

It's a standard across the industry, pretty much since the beginning of SLAs.

They're not insurance, and not meant to compensate you if your business is disrupted. That's on you. (And there are many ways to protect your business from provider outages.)

SLA payouts are meant to be mildly punitive, and to align incentives -- in aggregate, the SLA payouts add up and can hurt Google if there are a lot of customers affected by frequent outages.

This doesn't make any sense to me. How is Google supposed to be in a position to price the business risk of individual customers into a standard SLA that they offer to all their customers? That would require Google to charge different amounts of money per customer (commensurate with the business risk placed on Google's services for that customer), running actuarial numbers to ensure that Google would have the means to pay out when the SLA is violated. Doing so would place undue burden on customers, who would need to prove business risk before buying the service, and many customers are unaware of the real business risk of downtime (having not run the numbers) anyway.

With that said... Maybe it's a good product idea, to sell varying levels of SLA violation insurance alongside the service covered by the SLA. The default, free level of insurance covers the cost of the service itself, as it does today, but perhaps a customer could buy premium insurance from Google that the SLA will not be violated, increasing the payouts to offset business risk. After all, who better to put a price on the risk than Google themselves? So probably, Google can offer a better price on offsetting the risk, than a third party insurer which doesn't have access to Google's internal data.

The evil part of outages is that no matter how many resources you pour into building a more reliable system, they still happen. This is true for every company, including Google. So when a company chooses between cloud providers, it compares their SLAs with what it could achieve itself. Usually it's pretty hard for a random shop to reach good SLAs. So I don't see "business risk" here. Risks are present all the time; a CTO should try hard to minimize them, but there is no way to remove them.

Selling insurance for SLAs seems like an interesting idea, but this kind of insurance might be really similar to earthquake insurance: SLA violations tend to be uncommon (otherwise why commit to them?), but when one happens it might be a huge cascading failure. Would you like to buy one? All the quirks of earthquake insurance apply.

On the other side, Google has zero incentive to violate SLAs. A. You really cannot control how large the violation would be. B. Damage to branding >>>>>> money payout.

> "Partial refund". That's a very low standard for a service level agreement, but typical of Google.

It seems to be the standard. The most generous SLA I've seen is 5% off the monthly bill for each 30 minutes of downtime (up to 100%). If I'm down for 10 hours, waiving one month of bills doesn't come close to the damage done.

An SLA seems to be more of a promise than an agreement, because if the service goes down you're SOL and the provider gets a slap on the wrist (partial refund).

I've worked for a cloud provider who paid 45x for downtime. If you were down for an hour, you got 45 hours of credit on your bill.

My current ISP credits 5x - I was impacted in an outage expected to last all day, and they credited me 5 days on my next bill.

450 hours (10 hours of downtime at 45x) is less than a month, so that SLA is actually worse than the one the poster above described, at least if you're down for less than a day.
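For anyone who wants to check that comparison, here is a rough sketch in Python. It assumes a 30-day (720-hour) billing month, that only whole 30-minute blocks count under the 5% scheme, and that neither provider credits more than the full month. None of those details are stated in the thread, and the function names are made up for illustration.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day billing month

def credit_hours_percent_schedule(downtime_hours, percent_per_half_hour=5.0):
    """Credit, in hours of service, for '5% of the monthly bill per 30 minutes down'."""
    half_hours = int(downtime_hours * 2)  # assumption: only whole 30-minute blocks count
    percent = min(100.0, half_hours * percent_per_half_hour)  # capped at 100% of the bill
    return HOURS_PER_MONTH * percent / 100.0

def credit_hours_multiplier(downtime_hours, multiplier=45):
    """Credit, in hours of service, for 'N hours of credit per hour down'."""
    # assumption: the credit cannot exceed the whole month
    return min(HOURS_PER_MONTH, downtime_hours * multiplier)

for hours in (1, 10, 16, 24):
    print(hours, credit_hours_percent_schedule(hours), credit_hours_multiplier(hours))
```

Under these assumptions the 45x scheme pays less until roughly 16 hours of downtime, at which point both schemes reach a full month of credit.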

I'm not aware of any SLA from any cloud provider or ISP that offers anything other than a partial refund and/or credit. This is most certainly not specific to Google.

Check out the contract for a lottery or gambling system provider. They usually provide that the service provider is responsible for all losses for downtime or other errors on the provider's part, including fraud and theft. GTech pays about 0.5% of their revenue in penalties.

>A service level agreement is really a service packaged with an insurance product.

Now, such a service may be sold on the very high end... but in the general case, that's not what "Service Level Agreement" usually means.

(as an aside, I strongly suggest you get your business insurance from a party other than your service provider; serious outages can bankrupt service providers as-is... if they had to pay out customer damages, that would become a lot more likely.)

> "Partial refund". That's a very low standard for a service level agreement, but typical of Google. Your whole business is down, it's their fault, and all you get a partial refund on the service.

The SLA is the contract. While this may not be possible, you'd normally have to negotiate a higher payout for a higher service cost, but otherwise it's fixed based on the amount you pay, not on the amount your business makes.

This is a great article for defining terms. For some reason though, this quote made me laugh out loud:

"Excessive availability can become a problem because now it’s the expectation. Don’t make your system overly reliable if you don’t intend to commit to it to being that reliable."

The Google SREs mentioned this in their book; the Chubby locking service had uptime that was so high that folks started to neglect making their own services resilient to Chubby failures: https://landing.google.com/sre/book/chapters/service-level-o...

+1 for this book. As a junior DevOps engineer this book has been super helpful.

the book is structured in a way that makes it pretty easy to jump around and pick and choose which parts you want to read or skip, so it's not a very large commitment to read it

Mine just came in the mail today. Pretty stoked.

Still that's bad design on the clients' part. E.g. - Just because malloc "never" fails doesn't mean it can't fail :) so better error check for it.

Doesn't matter. Engineering around human failure is part of the profession.

That's a beautiful way to put it. I'd read that book.

Well, I'm a Google SRE so...

Failure of malloc() might be a bad example to pick because on linux, by default, most distros overcommit, so malloc won't fail, generally. Instead, malloc will succeed allocating the address space just fine, but the RAM will get allocated upon first use, meaning that even though malloc gave you a supposedly valid pointer rather than NULL, actually using that pointer will crash your program.

Other distros may have this differently and return NULL. It's not portable and also just bad to not check for it.

Is there a way to fix this/switch it off? I never got the rationale for this behaviour.

There's a sysctl: vm.overcommit_memory=2
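If you want to see which mode a box is in before changing it, something like this works as a quick check. It is only a sketch: it assumes a Linux system exposing /proc/sys/vm/overcommit_memory, and the helper names are invented for the example. The three mode meanings are the ones described in the kernel's overcommit documentation.

```python
from pathlib import Path

OVERCOMMIT_PATH = Path("/proc/sys/vm/overcommit_memory")  # Linux only

MODES = {
    0: "heuristic overcommit (the usual default)",
    1: "always overcommit",
    2: "strict accounting: allocations beyond the commit limit fail",
}

def describe_overcommit(mode):
    """Human-readable meaning of a vm.overcommit_memory value."""
    return MODES.get(mode, "unknown mode")

def current_overcommit_mode():
    """Return the current mode as an int, or None when the file is not available."""
    try:
        return int(OVERCOMMIT_PATH.read_text().strip())
    except (OSError, ValueError):
        return None

mode = current_overcommit_mode()
print(mode, describe_overcommit(mode) if mode is not None else "no Linux /proc here")
```

Only in mode 2 can a program expect malloc to report failure up front instead of being killed later when it touches the memory.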

What most people don't realize is that you will get more OOMs if you disable overcommit.

This is actually a serious point, not a joke.

New services may be launched with provisional technology to establish or evaluate a market or pricing model. The underlying technology in the initial implementation may have different performance or availability characteristics to what's actually envisioned for the full-scale product, and care has to be taken to actually compensate for this - i.e. introducing synthetic delay/jitter/faults to avoid setting the wrong expectation for the product.

It's funny but true. All observable properties of a system will eventually become hard dependencies for someone.

Isn't that one of the reasons for Netflix' chaos monkey? To make sure no one thinks "my dependency will always be there"?

That's more part of CHaP: https://medium.com/netflix-techblog/chap-chaos-automation-pl... and FIT: https://medium.com/netflix-techblog/fit-failure-injection-te... -- it's a mechanism to artificially inject errors to understand how upstream dependencies affect your availability.

I guess what is meant here is that one should never make the mistake of assuming that a highly reliable system can be built. As you approach a 100% reliable system, you start experiencing failures caused by minute disturbances or flaws in underlying dependencies (hardware, physical location) that can't be controlled. This is what they realized while trying to push the limits of building a highly reliable system.

> I guess what is meant here is that one should never make the mistake of assuming that a highly reliable system can be built.

Wrong guess imho. It means building highly reliable systems requires knowledge and experience. Trying to build them and solving the problems step-by-step is one way to understand how it can be achieved.

This is a semi-variant of Hyrum's law.

If you're building a system from scratch, keep in mind that this way of designing your service may not be flexible enough. You don't want just service-level objectives, agreements and indicators; you want customer-level ones too.

Your service may end up providing for multiple customers with different requirements. Maybe 1% of your customers will end up using 99% of your resources, creating uncomfortable situations that affect the other 99% of customers. To get away from this you have to start spinning off multiple identical services just for groups of customers, which is really annoying to maintain. You may find you need to add hard resource limits to control customer behavior, which is hard to add after the fact.

Instead, if you design your new system from scratch with customer-specific isolation and service levels, you can run one giant service and still prevent customer-specific load from hampering the rest of the service. You can also just run duplicate services at different levels of availability based on customer requirements, but that's not going to work forever.
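To make the "hard resource limits per customer" idea concrete, here is a minimal sketch of one common way to do it: a token bucket kept per customer. This is an illustration only, not what the commenter necessarily built. The class name, the rate and burst values, and the customer IDs are all invented, and time is passed in explicitly so the behaviour is easy to follow.

```python
from collections import defaultdict

class PerCustomerLimiter:
    """One token bucket per customer, so a heavy customer cannot starve the rest."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = defaultdict(lambda: float(burst))  # each customer starts with a full bucket
        self.last_seen = {}

    def allow(self, customer_id, now):
        # Refill this customer's bucket for the time elapsed since its last request.
        elapsed = now - self.last_seen.get(customer_id, now)
        self.last_seen[customer_id] = now
        self.tokens[customer_id] = min(self.burst, self.tokens[customer_id] + elapsed * self.rate)
        # Spend one token per request, or reject when the bucket is empty.
        if self.tokens[customer_id] >= 1.0:
            self.tokens[customer_id] -= 1.0
            return True
        return False

limiter = PerCustomerLimiter(rate_per_sec=1.0, burst=2)
heavy = [limiter.allow("heavy", now=0.0) for _ in range(4)]
light = limiter.allow("light", now=0.0)
print(heavy, light)
```

The heavy customer runs out of tokens after its burst, while the light customer's request still goes through, which is the isolation property being described.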

As an aside, I'm looking forward to reading ITIL 2019 to see what new processes they've adopted. I think everyone who's getting into SRE stuff should have a solid foundation on the basics of IT Operations management first.

In ops, we often have other internal groups that we either work with or support. It's often useful to view these groups as a customer, then you use the same policies, perhaps with a few exceptions in some cases, to manage the relationship. Typically we call this the OLA, the operating level agreement. I can only speak for my own experience, but operations groups I've been part of that don't have this concept of the operating level agreement typically suffer various types of damage to reputation. This is because there are no rules around how internal groups assess accountability, and therefore by having the terms of the OLA, you have the ability to defend your position as long as you stayed within the terms of the OLA. For example when we started building VAData data centers all over the world for Amazon, by having an OLA, we were able to push back on groups that claimed we were not holding up our end of the agreement.

I work in machine learning, where my team’s ML web services are typically requested by other in-house teams to provide features for their business logic, and so our SLAs are also agreements with other in-house teams.

What I’ve found is that product managers and business people are typically extremely resistant to traditional concepts of software requirements or feature planning, because they want flexibility to change requirements late in development without any negative repercussions for them.

But somehow the language of SLAs magically clicks and they are more receptive to defining a service agreement. Then you ask them, from a business point of view, how much uptime does it need, what sort of throughput does it have to support, is the budget for outages or failures distributed equally across all features or more important for some features than others?

This practically leads directly to the same scoping and requirements discussion you would have had in traditional software planning, but for some reason the language of SLAs is more palatable, so I’ve found it is an effective way to get around some non-tech person in the loop who might be fighting against detailing a proper spec or documenting priority delivery among features.

When reading these articles, never forget that your company is NOT Google! If your company doesn't have a management/infrastructure/communication/skill structure that Google has, then it will be very difficult to implement these fundamentals.

In many cases, an SRE is a job to save costs. If your company doesn't get its shit together and doesn't give your SREs the support it needs, then they'll hate their jobs and the company.

I have to disagree. The typical and intuitive ways of reasoning about outages and outage risk - screaming at the engineers until they fix it, desperately passing the buck, finding someone to fire in the aftermath - are not a good fit for any context. Every company can benefit from a more principled mental model of system reliability.

If your company's management doesn't even know what an SRE is, then you're stuck in the same exact place, where the SREs are the ones being screamed at instead. Some companies just rename "devops" to "SRE".

I think the renaming is fine as long as it also comes with the responsibility of driving the tracking and improving of site reliability :)

I just renamed my microwave to "refrigerator", but all of my food caught on fire and started leaking operational debt! :(

> In many cases, an SRE is a job to save costs.

This is 100% the case. I would actually argue that it is the only job of the SRE organization: hit the budgets by balancing the cost of availability against the cost of unavailability. If the org has massive budgets and general budget flexibility, then it is easy. Otherwise, SREs are magicians pulling rabbits out of a hat, inventing the most awe-inspiring methods/tools/hacks/workarounds needed to meet and beat budget targets.

In the other orgs SREs are an indirect level of outsourcing of everything to SaaS providers.

Yep. It's pretty disheartening, frankly.

I have no idea why you’re being downvoted. It’s the same thing as Borg/Kubernetes, MapReduce/Hadoop: some things just don’t apply or aren’t as effective unless you’re operating at a huge scale and with Google’s culture.

> unless you’re operating at a huge scale and with Google’s culture.

I'm not sure one has to go to the extreme of huge scale, anywhere near where Google is now, (not that that's what you said), nor all the aspects of their culture, but I agree that key fundamental aspects are often missed.

My favorite example is to point out that Google does not run Hadoop on expensive, virtualized AWS instances (or even brand-name servers with useless-for-purpose features[1] that creep up the cost). Rather, one of their competitive advantages, from the very start, has been to optimize hardware that they purchase, customize, and operate for cost (and performance).

The other is, as you mention, culture, which involves a remarkable amount of specialization, with groups dedicated to hardware, networking, internal tooling (i.e. building and maintaining the Hadoop-equivalent), and, of course, SRE, who couldn't even begin to do their jobs without all those other groups' support.

Of course, there's an argument to be made that things like k8s and PaaS/IaaS can take the place of all those supporting groups at Google, but my counterargument is that they both fail to impart any benefit of customization (or, conversely cultural benefit of the mindset of doing everything that way across the entire company) and carry a tremendous cost (in money and complexity).

[1] redundant power supplies, high-density chassis, onboard hardware RAID

No idea, either. This whole thread is being bombarded with downvotes.

Downvoters: Whatsup?

These distinctions started making more sense when I realized they map to OKRs, which are generally how Google is said to track individual and team performance.

In general, it's good to be precise about how you measure and when something is a hard or soft boundary. Otherwise, firefighting gets out of control. It's hard to determine when to stop something and put out a fire if you can't prioritize issues based on the boundaries you've set for your system.

SLOs certainly don't rigidly map to OKRs. Maybe it's easier to consider them (two sided) commitments about the quality of service? They're more of an ongoing measure of quality rather than a quarterly objective.

Good point on the quarterly vs continuous measurement. I'm not implying they are rigidly mapped but it makes sense you can put quality changes down as an objective for a team. This can be both end-of-quarter quality but also the general rate of change over the entire quarter.

Depending on the situation, I have seen teams aim to achieve certain SLOs but it can also be that certain other things can be achieved without letting the SLOs suffer (if they're already at a reasonably high quality).

Correct. Breaking or risking the SLO will instead lead to stopping new features until reliability is restored.

So, how do you choose that service level objective? How do you know which solutions to implement to not make things "overly reliable"? Isn't that the more important question? Doing this without some sort of methodology will almost always result in useless solutions and overpaying cloud and other hosting providers, like implementing rather expensive failover within the datacenter while ignoring how unreliable datacenters are and how cheaply you can implement failover between datacenters via DNS.

I like the idea of modelling availability/reliability for this. Even if you don't have the right numbers and do it on a napkin, not in code, it still can highlight solutions with best cost/benefit ratios.
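A napkin model like that fits in a few lines of Python. This is a sketch under strong assumptions: failures are independent, failover is instant and perfect, and the availability numbers are made up purely for illustration.

```python
def serial(*availabilities):
    """All components are required, so their availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def redundant(*availabilities):
    """Any one replica is enough; assumes replicas fail independently."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down

# Made-up numbers for illustration only.
datacenter = 0.995
server = 0.999

# Two servers behind failover inside a single datacenter.
in_dc_failover = serial(datacenter, redundant(server, server))
# One server in each of two datacenters, with failover between them.
cross_dc_failover = redundant(serial(datacenter, server), serial(datacenter, server))
print(f"{in_dc_failover:.6f} {cross_dc_failover:.6f}")
```

With these inputs the in-datacenter failover can never beat the availability of the datacenter itself, while the cross-datacenter setup can, which is the cost/benefit point being made. A real model would also have to account for DNS propagation time and correlated failures.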

Disclaimer: I am an SRE at Google, opinions are my own.

There's an excellent talk by Google VP of SRE Ben Treynor: https://www.youtube.com/watch?v=iF9NoqYBb4U. tl;dw: try to measure actual user experience, and make sure that even the long tail of customers still gets a good product experience. What "good product experience" means depends on your product.

The rest of the error budget is for you to spend on releasing new features, changing the underlying architecture, etc.
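As a rough illustration of what "error budget" means in numbers, here is a sketch assuming a purely time-based availability SLO over a 30-day window. The 99.9% target and the 30 minutes of downtime are example values, not anything from the talk.

```python
def error_budget_minutes(slo, window_days=30):
    """Downtime allowed by an availability SLO over the window, in minutes."""
    return (1.0 - slo) * window_days * 24 * 60

def remaining_budget_minutes(slo, downtime_minutes, window_days=30):
    """Budget left after the downtime already observed in the window."""
    return error_budget_minutes(slo, window_days) - downtime_minutes

budget = error_budget_minutes(0.999)
left = remaining_budget_minutes(0.999, downtime_minutes=30)
# Elsewhere in the thread: once the budget is gone, feature work stops until reliability recovers.
print(round(budget, 1), round(left, 1), "freeze releases" if left <= 0 else "ok to ship")
```

A 99.9% target over 30 days leaves about 43 minutes to spend on releases, migrations and plain bad luck.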

So there is one obscure metric "service is available, i.e. can do its job", and this metric has different attributes: there are actual metric values (SLIs), there are internal goals (SLOs) and there are legally binding promises (SLAs) to users/customers. I would argue that there is not much content here.

Content, imo, would be something like this: We define "available" as "processor_load<99% and disk_load<99% and ram_load<99% and server responds with http 200 on port xyz", because reason_a, reason_b, reason_c. But other people could argue that it is not as much about the node but about how service_x is experienced, so one could track the speed of http responses to user requests and they should be under 0.1sec over 95% of the time. etc...
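The second, user-facing definition is easy to turn into a concrete SLI. A minimal sketch, assuming the measurement is request-based (fraction of requests that are both successful and fast) rather than time-based, with a hypothetical sample of (status, latency) pairs standing in for real logs:

```python
def latency_sli(requests, threshold_sec=0.1):
    """Fraction of requests that returned HTTP 200 within the latency threshold."""
    if not requests:
        return None  # no traffic means nothing to measure
    good = sum(1 for status, seconds in requests if status == 200 and seconds < threshold_sec)
    return good / len(requests)

# Hypothetical sample of (status, latency in seconds) pairs.
sample = [(200, 0.03), (200, 0.08), (200, 0.25), (500, 0.02)] + [(200, 0.05)] * 16
sli = latency_sli(sample)
print(sli, sli >= 0.95)  # measured SLI, and whether it meets a 95% objective
```

Here one slow request and one server error out of twenty are enough to miss a 95% objective, even though every node-level metric might look healthy.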

That you should track metrics, that you should set goals, and that you should define SLAs with your customers/users is standard business practice, not new knowledge.

"Within Google, we implement periodic downtime in some services to prevent a service from being overly available."

Uh..... what?

Services have different relationships with each other in terms of dependencies, and in terms of what you think those dependencies are.

If your idea of how things work is that services A, B, and C can optionally use service D, else use some fallback process, then if D has never failed, then you've never used that fallback process. And services X, Y, and Z which rely on services A, B, and C haven't had to deal with those services using their fallback processes either. So, instead of waiting for D to fail, you can take it down at a convenient time.

This applies to services as a whole, or services within a locality, or all services in some availability zone.

Read the full context of that quote. There's even more in the SRE book.

"Don’t make your system overly reliable if you don’t intend to commit to it to being that reliable"

If a service has exceeded the reliability target for a given time period, you can take it down to basically let users know that this can happen and to not expect more.

You don't want them to get to the point where they are integrating so much with a service (and assuming a higher reliability that you have not promised) that they end up mad at you when it performs worse, but still as intended, at a later date.

Imagine if in python open('file.txt', 'r') never failed so no one ever bothered to put a try block. To prevent this from happening they purposely have open() fail a couple times.
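To spell the analogy out: Python's real open() does not do this, so the wrapper below is purely hypothetical. It is a sketch of deliberate fault injection, with the names flaky_open and read_or_default and the failure rate invented for the example.

```python
import random

def flaky_open(path, mode="r", failure_rate=0.01, rng=random):
    """Like open(), but raises OSError on a small fraction of calls (hypothetical)."""
    if rng.random() < failure_rate:
        raise OSError(f"injected failure opening {path!r}")
    return open(path, mode)

def read_or_default(path, default="", **kwargs):
    """A caller that handles the failure instead of assuming open() always works."""
    try:
        with flaky_open(path, **kwargs) as handle:
            return handle.read()
    except OSError:
        return default

# failure_rate=1.0 forces the injected failure, so the fallback path is exercised.
print(read_or_default("file.txt", default="fallback", failure_rate=1.0))
```

Callers that never wrote the except branch would break on the first injected failure, which is exactly the kind of hidden assumption that planned downtime is meant to flush out.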

There’s a particular global system that’s very reliable — Global Chubby — and to keep people from putting it in their serving path they just regularly take it down for like an hour per quarter.

Similar concept to Netflix Chaos Monkey.

If you exceed your SLO, people think your service is more reliable than it is.

When you have your infrequent but expected failures, they are caught by surprise unless you normalize your SLO burn.

This is an interesting article from a company that has almost nil customer support.

From the movie The Negotiator:

A Marine and a sailor are taking a piss. The Marine goes to leave without washing up. The sailor says, 'In the Navy they teach us to wash our hands.' The Marine turns to him and says 'in the Marines they teach us not to piss on our hands'.

BTW it's not true that Google has almost nil customer support. There's extensive support for paying customers (for ads, GCP, GSuite etc.).

But it's amazing to me how reliable things like Gmail are, and how in so many years I've never felt the need to seek support.

I have a $700 phone I bought from Google with apps made by Google and Gmail search hasn't worked in weeks.

Don't know if I got feature flagged or what. Unaffected by clearing data, clearing cache, or signing out of my account.

Search just returns nonsensical results.

The joke in that scene always baffled me, because the Marines are born of the Navy and still carry a lot of the Navy's epistemology. Why would they be taught something so fundamental so differently?

(Yes it's a joke but sometimes I overthink things, heh)

I've also seen it as Harvard and MIT graduates, then someone comes in, washes his hands first, saying "at Yale, they taught us to wash our hands before touching a holy object."

Quite OT but I almost always wash my hands _before_ (and after) using the restroom. Especially in a public place, it always made sense to me to do it before and after. It seems much more hygienic both for the "holy object" and other people!

I was told that if you work in a chemical plant or a chip fab you learn to wash your hands before, because you don't want chemicals on sensitive parts

And of course you don't know what germs etc are on the taps :-)

You usually use gloves though

Yes, so there is nothing wrong with an extra layer :-) ask any medical practitioner why they put gloves on after they wash their hands.

I thought one of the advantages of the "holey" variety was that you don't have to put your hands on anything to piss... ehh, Eli.

The original joke involved simply the demonym for the servicemember and can be used with any service, and says “they teach sailors/soldiers/airmen to ...” and “well they teach marines/etc ...,” which would probably make you less confused by the joke. I heard that joke growing up involving airmen in both directions of the joke, and it was common until that film.

Relatedly, in case you don’t know this, never think you can call a marine a sailor based on the lineage you’re discussing here. Soldier is also only an appropriate term for someone in the Army, and there are countless films that screw this up. It’s less about the service and more of an identity.

> Relatedly, in case you don’t know this, never think you can call a marine a sailor based on the lineage you’re discussing here.

As an Army veteran (who doesn't really like announcing himself as such when it doesn't add to the discussion), quite well aware. I do, however, make light-hearted jokes about Crayons from time to time ;) It's a fun sibling rivalry we have, the Army and the Corps.

> how reliable things like Gmail are

Except when they aren't and even have to get data back from backup tapes.

We buy a support package and receive excellent support for our GCP services.

Although that, technically, refutes an accusation of non-existence of customer support, it begs the question of what it means to be enough of a "customer" to receive support (and at what level):

Is it enough to use a gratis product? ("Paying" for it with data or ad-eyeballs, I suppose)

Is it enough to pay money for the product?

Must one also pay a subscription fee in addition to paying for the product in the first place? [1]

Is something else, sometimes, necessary (such as volume/clout)?

I think we've seen most of the spectrum of answers from the software industry (especially "enterprise" software), with the main novelty being the existence of web/SaaS gratis products.

[1] Depending on where on the spectrum between hand-holding and mere bug fixes the support ends up falling, this could be characterized as double-dipping

I'm not sure what "almost nil customer support" measures out to, but speaking for myself and not Google Cloud (my employer) I know we have:

- fantastic support internally (probably not what you're caring about),

- support to external globally-scaled customers whose issues don't exist because technical account management helped set up clear goals, such as uptime, described in the blog (probably also not what you're counting)

- support for even the smallest companies willing to pay as little as $100/user/month for Role-Based Support[1] and also receive direct access to support until they decide it's no longer needed (and by design, scale support costs to zero)

1. https://cloud.google.com/support/

> This is an interesting article from a company that has almost nil customer support.

This never gets old. It's almost as predictable as some random reference to "don't be evil".

If you don’t think a company has customer support it’s because you aren’t a customer.

IIRC they define their customers as other internal teams, separate from external customers, who (I'm guessing) are handled by external product teams.

That's generally correct. Though a lot of SLOs of SRE teams are influenced by external commitments as well. But I suppose that's pretty obvious considering there's SRE teams supporting cloud products.

I suspect they meant their internal customers :-)

Getting the definitions right

Does Site Reliability include using assets from no less than 7 domains and requiring Javascript to present a few paragraphs of text?

Presumably their blog is very low on the list of things they care about the reliability of.
