"Partial refund". That's a very low standard for a service level agreement, but typical of Google. Your whole business is down, it's their fault, and all you get a partial refund on the service.
A service level agreement is really a service packaged with an insurance product. The insurance product part should be evaluated as such - does it cover enough risk, and is the coverage amount high enough? You can buy business interruption insurance from insurance companies, and should price that out against the cost and benefits of an SLA. If this is crucial to your core business, as with an entire retail chain going dark because its cloud-based point of sale system fails, it needs to be priced accordingly.
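To make that comparison concrete, here's a back-of-the-envelope sketch in Python. Every number in it (outage probability, outage cost, SLA credit) is a made-up assumption for illustration, not data from any provider or insurer:

```python
# Hypothetical comparison of an SLA credit vs. standalone business
# interruption insurance. All numbers below are assumptions.
outage_prob_per_year = 0.02    # assumed chance of a serious outage in a year
outage_cost = 500_000          # assumed business loss from such an outage
sla_credit = 2_000             # assumed SLA payout (roughly one month of fees)

# Expected annual loss that the SLA leaves uncovered:
expected_uncovered_loss = outage_prob_per_year * (outage_cost - sla_credit)
print(expected_uncovered_loss)  # compare this figure against an insurer's premium
```

If an insurer quotes an annual premium well below that expected uncovered loss, the standalone policy is worth a serious look.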
It's a standard across the industry, pretty much since the beginning of SLAs.
They're not insurance, and not meant to compensate you if your business is disrupted. That's on you. (And there are many ways to protect your business from provider outages.)
SLA payouts are meant to be mildly punitive, and to align incentives -- in aggregate, the SLA payouts add up and can hurt Google if there are a lot of customers affected by frequent outages.
With that said... Maybe it's a good product idea, to sell varying levels of SLA violation insurance alongside the service covered by the SLA. The default, free level of insurance covers the cost of the service itself, as it does today, but perhaps a customer could buy premium insurance from Google against SLA violations, increasing the payouts to offset business risk. After all, who better to put a price on the risk than Google themselves? So Google can probably offer a better price on offsetting the risk than a third-party insurer which doesn't have access to Google's internal data.
Selling insurance for SLAs is an interesting idea, but this kind of insurance might be really similar to earthquake insurance: SLA violations tend to be uncommon (otherwise why commit to them?), but once one happens it can be a huge cascading failure. Would you like to buy one? All the quirks of earthquake insurance apply.
On the other side, Google has zero incentive to violate SLAs. A. You really cannot control how large the violation would be. B. Damage to branding >>>>>> money payout.
It seems to be the standard. The most generous SLA I've seen is 5% off the monthly bill for each 30 minutes of downtime (up to 100%). If I'm down for 10 hours, waiving one month of bills doesn't come close to the damage done.
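To make the math concrete, here's a sketch in Python of that "5% off per 30 minutes, up to 100%" formula. The monthly bill and hourly loss figures are hypothetical:

```python
def sla_credit(monthly_bill, downtime_minutes, pct_per_30min=0.05, cap=1.0):
    """Credit under a '5% off per 30 minutes of downtime, up to 100%' SLA."""
    increments = downtime_minutes // 30
    return monthly_bill * min(increments * pct_per_30min, cap)

# Hypothetical numbers: a $2,000/month service, a 10-hour outage,
# and a business losing $5,000 per hour while down.
credit = sla_credit(2_000, 10 * 60)   # 600 min -> 20 increments -> capped at 100%
loss = 10 * 5_000
print(credit, loss)
```

The 10-hour outage hits the cap, so the best case is one month of fees back against a loss an order of magnitude larger.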
An SLA seems to be more of a promise than an agreement, because if the service goes down you're SOL and the provider gets a slap on the wrist (partial refund).
My current ISP credits 5x - I was impacted in an outage expected to last all day, and they credited me 5 days on my next bill.
Now, such a service may be sold on the very high end... but in the general case, that's not what "Service Level Agreement" usually means.
(as an aside, I strongly suggest you get your business insurance from a party other than your service provider; serious outages can bankrupt service providers as-is... if they had to pay out customer damages, that would become a lot more likely.)
The SLA is the contract. You'd normally have to negotiate a higher payout for a higher service cost (though this may not always be possible); otherwise it's fixed based on the amount you pay, not on the amount your business makes.
"Excessive availability can become a problem because now it’s the expectation. Don’t make your system overly reliable if you don’t intend to commit to it to being that reliable."
What most people don't realize is that you will get more OOMs if you disable overcommit.
New services may be launched on provisional technology to establish or evaluate a market or pricing model. The underlying technology in the initial implementation may have different performance or availability characteristics from what's actually envisioned for the full-scale product, and care has to be taken to compensate for this - i.e. introducing synthetic delay/jitter/faults to avoid setting the wrong expectations for the product.
Wrong guess imho. It means building highly reliable systems requires knowledge and experience. Trying to build them and solving the problems step by step is one way to understand how it can be achieved.
Your service may end up providing for multiple customers with different requirements. Maybe 1% of your customers will end up using 99% of your resources, creating uncomfortable situations that affect the other 99% of customers. To get away from this you have to start spinning off multiple identical services just for groups of customers, which is really annoying to maintain. You may find you need to add hard resource limits to control customer behavior, which is hard to add after the fact.
Instead, if you design your new system from scratch with customer-specific isolation and service levels, you can run one giant service and still prevent customer-specific load from hampering the rest of the service. You can also just run duplicate services at different levels of availability based on customer requirements, but that's not going to work forever.
As an aside, I'm looking forward to reading ITIL 2019 to see what new processes they've adopted. I think everyone who's getting into SRE stuff should have a solid foundation on the basics of IT Operations management first.
What I’ve found is that product managers and business people are typically extremely resistant to traditional concepts of software requirements or feature planning, because they want flexibility to change requirements late in development without any negative repercussions for them.
But somehow the language of SLAs magically clicks and they are more receptive to defining a service agreement. Then you ask them, from a business point of view, how much uptime does it need, what sort of throughput does it have to support, is the budget for outages or failures distributed equally across all features or more important for some features than others?
This leads almost directly to the same scoping and requirements discussion you would have had in traditional software planning, but for some reason the language of SLAs is more palatable, so I’ve found it is an effective way to get around a non-tech person in the loop who might be fighting against detailing a proper spec or documenting delivery priorities among features.
In many cases, an SRE is a job to save costs. If your company doesn't get its shit together and doesn't give your SREs the support they need, then they'll hate their jobs and the company.
This is 100% the case. I would actually argue that it is the only job of the SRE organization - hit the budgets by balancing the costs of availability vs. the costs of unavailability. If the org has massive budgets and general budget flexibility, then it is easy. Otherwise, SREs have to be magicians, pulling rabbits out of hats and inventing the most awe-inspiring methods/tools/hacks/workarounds needed to meet and beat budget targets.
In the other orgs SREs are an indirect level of outsourcing of everything to SaaS providers.
I'm not sure one has to go to the extreme of huge scale, anywhere near where Google is now (not that that's what you said), nor adopt all aspects of their culture, but I agree that key fundamental aspects are often missed.
My favorite example is to point out that Google does not run Hadoop on expensive, virtualized AWS instances (or even brand-name servers with useless-for-purpose features that creep up the cost). Rather, one of their competitive advantages, from the very start, has been to optimize hardware that they purchase, customize, and operate for cost (and performance).
The other is, as you mention, culture, which involves a remarkable amount of specialization, with groups dedicated to hardware, networking, internal tooling (i.e. building and maintaining the Hadoop-equivalent), and, of course, SRE, who couldn't even begin to do their jobs without all those other groups' support.
Of course, there's an argument to be made that things like k8s and PaaS/IaaS can take the place of all those supporting groups at Google, but my counterargument is that they both fail to impart any benefit of customization (or, conversely, the cultural benefit of the mindset of doing everything that way across the entire company) and carry a tremendous cost (in money and complexity).
redundant power supplies, high-density chassis, onboard hardware RAID
In general, it's good to be precise about how you measure and when something is a hard or soft boundary. Otherwise, firefighting gets out of control. It's hard to determine when to stop something and put out a fire if you can't prioritize issues based on the boundaries you've set for your system.
Depending on the situation, I have seen teams aim to achieve certain SLOs, but sometimes other goals can be pursued without letting the SLOs suffer (if they're already at a reasonably high quality).
I like the idea of modelling availability/reliability for this. Even if you don't have the right numbers and do it on a napkin, not in code, it can still highlight the solutions with the best cost/benefit ratios.
There's an excellent talk by Google VP of SRE Ben Treynor: https://www.youtube.com/watch?v=iF9NoqYBb4U. tl;dw: try to measure actual user experience, and make sure that even the long tail of customers still gets a good product experience. What "good product experience" means depends on your product.
The rest of the error budget is for you to spend on releasing new features, changing the underlying architecture, etc.
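For a sense of scale, here's a quick sketch of how big an error budget actually is, assuming a hypothetical 99.9% monthly availability SLO:

```python
# Hypothetical 99.9% availability SLO over a 30-day month.
slo = 0.999
minutes_in_month = 30 * 24 * 60                    # 43,200 minutes
error_budget_minutes = (1 - slo) * minutes_in_month
print(error_budget_minutes)                        # roughly 43 minutes of downtime
```

Whatever outages don't consume out of those minutes is what's left for risky releases and architecture changes.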
Content, imo, would be something like this: We define "available" as "processor_load < 99% and disk_load < 99% and ram_load < 99% and the server responds with HTTP 200 on port xyz", because reason_a, reason_b, reason_c. But other people could argue that it is not as much about the node as about how service_x is experienced, so one could instead track the speed of HTTP responses to user requests and require that they be under 0.1 sec over 95% of the time. Etc...
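As a sketch, both definitions from the comment above can be written as simple predicates. The thresholds are the ones given there; the function names and everything else are made up:

```python
def node_available(cpu_load, disk_load, ram_load, http_status):
    # Node-centric definition: every load under 99% and HTTP 200 on the port.
    return (cpu_load < 0.99 and disk_load < 0.99
            and ram_load < 0.99 and http_status == 200)

def latency_slo_met(latencies_sec, threshold_sec=0.1, quantile=0.95):
    # User-experience definition: at least 95% of responses under 0.1 sec.
    fast = sum(1 for latency in latencies_sec if latency < threshold_sec)
    return fast / len(latencies_sec) >= quantile
```

The point of writing it down this precisely is that the two definitions can disagree: a node can pass the load check while users see slow responses, which is exactly the argument for the second formulation.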
That you should track metrics, that you should set goals, and that you should define SLAs with your customers/users is standard business practice, not new knowledge.
If your idea of how things work is that services A, B, and C can optionally use service D, else use some fallback process, then if D has never failed, then you've never used that fallback process. And services X, Y, and Z which rely on services A, B, and C haven't had to deal with those services using their fallback processes either. So, instead of waiting for D to fail, you can take it down at a convenient time.
This applies to services as a whole, or services within a locality, or all services in some availability zone.
"Don’t make your system overly reliable if you don’t intend to commit to it to being that reliable"
If a service has exceeded the reliability target for a given time period, you can take it down to basically let users know that this can happen and to not expect more.
You don't want them to get to the point where they are integrating so much with a service (and assuming a higher reliability than you have actually promised) that they end up mad at you when it performs worse, but still as intended, at a later date.
When your infrequent but expected failures do happen, users are caught by surprise unless you normalize your SLO burn.
A Marine and a sailor are taking a piss. The Marine goes to leave without washing up. The sailor says, 'In the Navy they teach us to wash our hands.' The Marine turns to him and says 'in the Marines they teach us not to piss on our hands'.
BTW it's not true that Google has almost nil customer support. There's extensive support for paying customers (for ads, GCP, GSuite etc.).
But it's amazing to me how reliable things like Gmail are, and how in so many years I've never felt the need to seek support.
Don't know if I got feature flagged or what. Unaffected by clearing data, clearing cache, or signing out of my account.
Search just returns nonsensical results.
(Yes it's a joke but sometimes I overthink things, heh)
And of course you don't know what germs etc are on the taps :-)
Relatedly, in case you don’t know this, never think you can call a Marine a sailor based on the lineage you’re discussing here. "Soldier" is also only an appropriate term for someone in the Army, and there are countless films that screw this up. It’s less about the service and more of an identity.
As an Army veteran (who doesn't really like announcing himself as such when doesn't add to the discussion), quite well aware. I do-however make light-hearted jokes about Crayons from time-to-time ;) It's a fun sibling rivalry we have, the Army and the Corps.
Except when they aren't and even have to get data back from backup tapes.
Is it enough to use a gratis product? ("Paying" for it with data or ad-eyeballs, I suppose)
Is it enough to pay money for the product?
Must one also pay a subscription fee in addition to paying for the product in the first place? 
Is something else, sometimes, necessary (such as volume/clout)?
I think we've seen most of the spectrum of answers from the software industry (especially "enterprise" software), with the main novelty being the existence of web/SaaS gratis products.
Depending on where on the spectrum between hand-holding and mere bug fixes the support ends up falling, this could be characterized as double-dipping.
- fantastic support internally (probably not what you're caring about),
- support to external globally-scaled customers whose issues don't exist because technical account management helped set up clear goals, such as uptime, described in the blog (probably also not what you're counting)
- support for even the smallest companies willing to pay as little as $100/user/month for Role-Based Support and also receive direct access to support until they decide it's no longer needed (and by design, scale support costs to zero)
This never gets old. It's almost as predictable as some random reference to "don't be evil".