It's shocking how an "elevated rate of errors for a specific endpoint" on your cloud provider's status page gets amplified into a soft outage of your product: your writes to disk never return, your databases return inconsistent data, or your orchestration takes drastic measures in response to failing health checks.
When you have a lot of components in your cloud mix, the failure of one stage (network -> balancing -> querying -> rendering -> persistence) brings everything down.
If 10 of your cloud services each have a reliability of 99.999%, the reliability of the whole is not 99.999%.
Cloud providers can claim mountain-high availability, but users will never get their apps running at the advertised reliability, since there are multiple subcomponents that can each fail.
Uptime and error metrics are technical and should be kept away from managers.
I'm not sure how much time I have to participate but I wouldn't mind chipping in a bit on a co-op in this space.
But it might be easier to convince Is it Down Right Now to grow some fangs, or socialize the idea that it does (perception counts for a lot).
But I think I'm going to be waiting a while.
Realistically they're also on a "Pub" server which is just another client with a known IP that facilitates transfer of data between peers.
you could take it another step and build a personal image gallery / sharing client on SSB as the protocol supports (and has) many different apps that run on it.
Amazon did this. It went pretty well. So well they decided to sell the results of the expertise.
Automated status page updates can also reduce trust, since then the status page is itself exposed to more kinds of system failures.
I wish I could remember the values you could fill in, they were very intelligently chosen.
What I learned: if you didn't know what the root cause was, you probably didn't fix anything.
I've been in many orgs where root cause was either completely missing or completely missed the point. Recently I quit a company that thought they were doing RCA. Their process was to send out an email that there was an outage, and then N hours or days later another email saying it was "fixed": they scaled something, or thought they had found the problem, and you shouldn't worry about it anymore. Literally, we had weeks where the exact same outage occurred multiple times. And with every outage, the exact same response.
So... I asked simple questions of leadership as to why an RCA process was not implemented. Why RCA did not require a standardized template to be filled out as part of a production outage. Why a "5 Whys" approach wasn't being considered to truly expose the actual root cause. Why there wasn't any accountability.
At the end of the day, failure cause is not root cause, and many struggle because they conflate the two. Honestly, when an organization doesn't hold true RCA as a critical part of its engineering process, I personally feel that organization will inevitably hit a glass ceiling. Among other problem areas, the disconnect on RCA was, for me, why I couldn't stay at that company anymore. It was embarrassing watching from the inside as the same mistakes were made over and over with nobody the wiser.
If the system is working as designed, then there really is no bug. A bug is a malfunction, after all.
I've always called broken systems that are working as designed BAD: Broken As Designed.
In general, I'm so used to RCA and layered mitigations (what one of our greybeards calls "belt and suspenders") that I don't know how quality happens without it. I'm a convert to the idea that if you can't fix a problem directly, the fix has to isolate or be as close to the problem as possible. Otherwise the bad state just ripples outward as complexity.
The gist was that the causes were appropriate and educational. Folks couldn't choose "user is an idiot", instead having to choose "the interface was confusing".
But a freeform report is also necessary. How else are you going to adequately explain what, where, why, etc., the root cause was?
Yes! If you don't know root cause, then you don't know what went wrong. Not only do you not know what to fix, any shotgun debugging is likely to have only fixed one symptom, leaving the actual malfunction in place.
Almost all of the companies I've owned or worked for have recognized this with a simple rule: if you haven't found (and proven) root cause, then the bug cannot be closed as fixed. Any company that doesn't have a variation of this going on is a company whose products you can't trust (and a company I would prefer not to work for).
When I'm seeing 100% failure rate, there's often nothing on their status page. Or there's some bullshit metric like VM acquisition times are double normal for, say, some Windows VM. But I'm not seeing 8% failure rate. I'm not seeing an extra 30 seconds. I'm seeing 100% failure rate, with long timeouts, and retries.
(The answer is about 99.99%.)
1. Suppose we define availability as "at least one is up".
If the failures are completely independent, then the probability of any one being down is 10^-5 (five nines) and the probability of all 10 being down at the same time is (10^-5)^10 = 10^-50 (fifty nines).
2. If we instead define availability as "all 10 are up" (which is essentially equivalent to one failure causes a cascading failure) then in the same scenario where failures are independent, this is (1-10^-5)^10 ~= 99.99% (four nines).
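A minimal sketch of both calculations in Python, keeping the parent's assumption that failures are completely independent (which rarely holds in practice):

    # Ten services, each with five-nines availability; failures assumed independent.
    p_down = 1e-5   # probability a single service is down at any moment
    n = 10

    # Definition 1: "available" means at least one of the 10 is up.
    p_all_down = p_down ** n
    print(f"P(all {n} down at once) = {p_all_down:.1e}")   # 1.0e-50: effectively never

    # Definition 2: "available" means all 10 are up (any one failure cascades).
    p_all_up = (1 - p_down) ** n
    print(f"P(all {n} up) = {p_all_up:.6f}")                # ~0.999900, i.e. four nines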
> The problem is that they weren't monitoring from the customer's perspective. Had they done that, it would have been clear that oodles of requests from some subset of customers were failing. They would have also realized that certain customers had all of their requests failing.
This is saying that if you are small, all your failing requests are within the 0.001% that the provider is allowed to fail.
I suppose this depends on how 99.999% uptime is defined in the SLA.
It's like an episode of Dirk Niblick: https://www.youtube.com/watch?v=bCoGMYV3UPk
I had a bunch of these things, all of which were things that were tracked, measured, and monitored, in an existing setup.
Their response was, "We really don't have any way to provide the data for your SLAs, much less actually sign up to enforce them." I suggested that they were not serious about being in the 'cloud' business then. They seemed miffed.
Think of it this way: consider an "availability" SLA defined as 'the mean availability of all hosts in our cloud'. If it's reported as "five 9's", or 99.999%, that means a cloud of 100,000 machines could have one machine down for days at a time and never cause the SLA to slip (quick arithmetic below). Big providers average over multiple hundreds of thousands of machines, so your stuff could be down all the time and yet 'everything' is "meeting all the SLAs".
You see this outside of data centers in other overly generalized metrics. Unemployment is only 3.7%! Yay right? Tell that to the people of Magoffin County Kentucky where unemployment is 12.3%
So would I pay more? I don't know. If none of the service providers would offer SLAs based on my footprint, it would not be a choice. If one does, then it becomes the preferred choice even if it is more expensive. And at that point, do all of them offer it to remain competitive? Another good question. It could be a good differentiator for the #3 cloud provider, Google. I know they have the technology to do it if they chose to.
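To put numbers on the fleet-averaging point above (a sketch with made-up figures: 100,000 hosts over a 30-day month, one of them down the entire month, everything else perfect):

    hosts = 100_000
    month_hours = 30 * 24

    total_host_hours = hosts * month_hours
    down_host_hours = 1 * month_hours        # one unlucky host, down all month

    fleet_availability = 1 - down_host_hours / total_host_hours
    print(f"fleet-mean availability: {fleet_availability:.4%}")   # 99.9990%, still "five nines"
    # ...but if that one host was yours, your availability for the month was 0%.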
You're welcome! :)
The root post does raise an important issue, though -- just because GCS thinks it's doing great on your RPCs doesn't mean that your system is doing great.
I'd posit that Google is not presently in a good position to be held up as a role model or case study for effective cloud provider SLAs.
You can sue them? (if that's not forbidden by the contract).
The granularity may be harder to define for cloud services, but it is very much doable; it's all about making sure that the target metrics have zero connection to the global state of the system.
For example, Amazon will give you a 30% refund "for the individual Included Service in the affected AWS region for the monthly billing cycle in which the Unavailability occurred" if availability during a month drops to, but not below, 95% (that's a 1.5 day downtime).
That means that if your service goes 100% down because EC2 was completely broken in a region for 1.5 days, you get a refund of 9 days worth of EC2 (compute) charges, but not the associated EBS (disk) or S3 (storage) or other charges.
And "unavailability" counts only if at least two availability zones at the same time are completely down. And then you have to request the credit in a very specific format.
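Back-of-the-envelope for the terms quoted above (a sketch; the 30% credit tier and 95% threshold are as described in the parent comment, and the dollar figures are invented):

    month_days = 30
    downtime_days = 1.5                       # EC2 completely broken in the region
    availability = 1 - downtime_days / month_days
    print(f"monthly availability: {availability:.2%}")            # 95.00%

    ec2_bill, ebs_bill, s3_bill = 10_000, 3_000, 2_000            # invented monthly charges
    credit = 0.30 * ec2_bill                                      # 30% of the affected service only
    print(f"credit: ${credit:,.0f}  (~{0.30 * month_days:.0f} days' worth of EC2 charges)")
    print(f"not credited: ${ebs_bill + s3_bill:,} of EBS/S3 charges for the same outage")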
Google and Azure look extremely similar.
Are better SLAs typically negotiated? Because based on this, it seems like the only thing keeping cloud providers reliably above their SLAs is the fear of losing current and future customers, not the SLAs themselves. In other words, the SLAs are completely meaningless.
There have been times in the past when I was annoyed with AWS stability issues. We've all been there. But I also know that AWS is more stable than anything I could feasibly build in-house.
I can't acknowledge this, as it has not been my experience at all.
When there is a choice to make between public cloud with 99.xxx% SLA or the internal cloud with 90% uptime and 6 months SLA to get a server, the right choice is always cloud.
>the right choice is always cloud
I can't tell if you missed a word or are making a joke...
When I was in ops, thinking the world was going to collapse and we were all going to get fired if a service went down, I didn't really get it. It's just a negotiating tactic. It's a cost of doing business.
There are always two sides to these sorts of things.
I have sent a lot of log files to a cloud vendor trying to find out why their web-hosted application was so slow (6-10 second response times on a CRM app they provided). If someone had responded with an actual answer ("your firewall is blocking traffic" or "try this setup", etc.) I could have worked with that. Instead we got nothing but stealth ticket closes and “sorry we don’t know why this is slow” responses. This article hit a nerve because you really do dance to someone else’s tune when you go to the “cloud”.
The same can be said of large orgs with a large on-prem footprint.
The author wants us to look at things from the customer's perspective. The thing is, we (and presumably all major cloud providers) do. Every feature released, every API call, has a canary associated with it that does nothing but pretend to be a customer using that feature. There are definitely cases that slip through the cracks that shouldn't have (we forgot to properly test for a certain condition or combination use case etc) but the vast majority of the time a customer experiences an outage it's because of something the customer did.
That's not to excuse the 5 9 guarantee that inspires fake confidence. But we're always upfront with customers that there's a shared responsibility for availability: it's our responsibility to make sure what customers pay for works, but it's also a customer's responsibility that there's enough redundancy in their architecture for their use case.
However, we will spend a great deal of time resolving their issues regardless. Last week, for example, we had a customer encountering failures with their program using our product. I obtained the source for the customer's application and debugged it for them.
I like that we do this -- it's really nice to solve a customer's problem, and even nicer to be able to tell them it wasn't the fault of our software. It's expensive, of course, but our support contracts are priced to take this into account.
If I have a multi-tenant system, and no one customer is dominant (always causes problems IMO), my 'biggest customer' might only be 4% of my traffic. There are a million things that can go wrong that make this customer's experiences different from everyone else's, from getting my sharding solution wrong to small-C n^2 issues (and a whole lot of space between for nlog(n) problems).
If I'm doing 95th percentile calculations, that will not show up in my metrics. If I have a larger customer that's 10% of my traffic, almost half of their users could be having issues before my alerts go off.
And then there's explaining to your boss that 5 9's across twenty interacting services is around 99.98%, and that's only if degradation in one service doesn't cause failure in another.
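Rough numbers for both points (a sketch; the 5% global error-rate alert threshold is my assumption, chosen to mirror a 95th-percentile-style cutoff):

    # A customer who is 10% of traffic can have nearly half their requests failing
    # before a global 5% error-rate alert fires.
    customer_share = 0.10
    customer_failure_rate = 0.49
    print(f"global error rate: {customer_share * customer_failure_rate:.1%}")   # 4.9%, no alert

    # Twenty services at five nines each, where any single failure takes the
    # request down, compound to roughly four nines.
    print(f"20 services in series: {0.99999 ** 20:.4%}")                        # ~99.9800%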
Scatter plots and histograms are much better at telling you when the distribution of a stat has gone bimodal with a small but consistent group in the outlier group. Percentiles only make sense for telling you, when you already know what shape the distribution curve should be, how flat that curve is right now. They don’t tell you when the shape has changed.
Good APMs and trace tools will let you zero in on traces by characteristics - so if you notice there’s a bump in requests which have a 2 second load time, you can select them all and analyze how they are distributed - whether they are mostly one browser, one location, or one user even. But you need a solid strategy for tagging and logging traces.
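A quick illustration of the bimodal point with synthetic latencies (a sketch, no real APM or trace data involved): a small but consistent slow cohort barely moves the p95, yet it is unmissable in a histogram.

    import random

    random.seed(1)

    # 97% of requests are fast (~100 ms), 3% hit a slow path (~2 s).
    latencies = [random.gauss(100, 20) for _ in range(9_700)] + \
                [random.gauss(2_000, 100) for _ in range(300)]

    def percentile(xs, p):
        xs = sorted(xs)
        return xs[int(p / 100 * (len(xs) - 1))]

    print(f"p95 = {percentile(latencies, 95):.0f} ms")   # still looks fine-ish

    # Crude text histogram: the second bump around 2 s stands out immediately.
    buckets = {}
    for x in latencies:
        buckets[int(x // 250) * 250] = buckets.get(int(x // 250) * 250, 0) + 1
    for b in sorted(buckets):
        print(f"{b:>5} ms+  {'#' * max(1, buckets[b] // 100)}")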
 https://honeycomb.io. Disclaimer: I used to work for them.
...but now you're in a recursive problem: Who watches the watcher? If the watcher goes down, your insights are gone. Do you devote your entire engineering staff to monitoring, then?
A two-pronged approach would be better: Customer Touch-Point monitoring built into your product and external monitoring should your CTP monitoring go down. If your external monitoring goes down, you still have the CTP, so not all visibility is lost.
And what does it mean? And am I stupid, or is this not a term everyone knows?
In my experience, the things that are easiest to write off as unique to one case, non-representative, or too rare to fix (so they don't have to be thoroughly addressed) are warning signs of a robustness issue. Still doesn't mean that they'll get fixed then and there, but they often come around to bite you in the ass later.
But are there other SLAs like for in-zone latency, or hardware performance (e.g. IOPS or bandwidth from your local or remote storage)? Are these kinds of SLAs part of larger private agreements (like, Netflix, a huge AWS customer), or is uptime the only SLA offered? Haven't been able to find any info on this in my searches...
For example, within the AWS Compute SLA you linked:
> Unavailable is defined as: For Amazon EC2 (other than Single EC2 Instances), Amazon ECS, or Amazon Fargate, when all of your running instances or running tasks, as applicable, deployed in two or more AZs in the same AWS region (or, if there is only one AZ in the AWS region, that AZ and an AZ in another AWS region) concurrently have no external connectivity.
Get this: Single EC2 instances have an SLA of 90%. Seriously. It's in that article.
In other words: AZ outages rarely result in a payout, because you "didn't architect your cloud correctly". And we were told some nicer variation of this when asking for a reimbursement a few years back. You do have to ask, you know. They could literally automate this process, but they don't. Whatever.
Let's also be clear about the language here: There's no "pay out" at all. What happens is, you get the amazing privilege of not being forced to pay them for a product that didn't work.
That ties directly back to the article: they pay out based on their architecture and SLAs, which are not your architecture and SLAs, unless you perfectly match your architecture to theirs (which will still have gaps), and by then you're bought in so hard that you could never leave if you needed to.
Do you trust AWS' status page? Or are you coming to Hacker News to ask why your network latency between instances has skyrocketed unexpectedly?
As an aside, once in a while I imagine what kind of field day Upton Sinclair would have with this swing of the pendulum toward dystopia.
Ineptitude, and it being a hard problem, are sufficient to explain the status quo.
>Loss of external connectivity or persistent disk access for all running Instances, when Instances are placed across two or more Zones in the same Region.
One of the collaborations I work in, LIGO, recently gave up on private servers and transitioned to AWS for our Gravitational Candidate Database because the cloud is so much better. I made this change to my own low-latency search framework years ago. If you're not "lucky" enough to (be forced to) use a university/collaboration cluster, you'd have to maintain your own server, which is orders of magnitude less reliable and more expensive/difficult. I understand that not all workflows are the same, but for all of my nontrivial applications, cloud providers save so much time and money that I can do something as bold as making a provider-agnostic architecture with more robust failover. I recognize that more complicated workflows might require e.g. 10 separate AWS services with AWS-specific features causing lock-in, but at that level of complexity, I'm guessing the problem must be virtually impossible with a non-cloud solution anyway. If you really can't figure out another way to deal with resiliency, you might just need to accept that your problem space is really hard and that you're lucky to even be able to run it at all. Again, I think the original article is right about the fact that you have to account for this yourself; the cloud is not magic, and your code still has to understand that it is (like all abstractions) going to leak.
Again, the point about responsiveness in the original article is very well taken; I'm just surprised more people aren't observing that, overall, the reliability, cost, and flexibility provided by cloud solutions are utterly transformative.
If the author had a specific problem with specific SLA’s, tell us with real details.
And SLA’s aren’t for winning the lottery or providing impossible-to-meet standards. You need to look at what they actually cover, compare with your costs and reliability of running infrastructure in-house, and then pick the right tradeoff for you. I can’t even tell if the author is accusing cloud providers of fraud, of being misleading, if the author just never understood the SLA properly, or what.
Or is Rachel talking about a situation where you have an SLA in place, but you can't even prove downtime to the vendor because their monitoring software is inadequate?
If a provider promises that, overall, 99.5% of all requests will succeed, but the 0.5% errors are all concentrated on some few customers / regions / AZs, customers can have a very bad day.
So this is about promising each customer that 99.5% of all their requests will succeed, and monitor in a way that makes sure you can keep that promise.
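A minimal sketch of what monitoring per-customer (instead of globally) looks like, with synthetic data and invented customer IDs: the global success rate comfortably clears 99.5% while one customer is completely broken.

    from collections import defaultdict

    # Synthetic request log: customer "c42" is ~0.1% of traffic and every one of
    # their requests fails; everyone else succeeds.
    requests = [("c42", False)] * 100 + [(f"c{i % 40}", True) for i in range(99_900)]

    totals, failures = defaultdict(int), defaultdict(int)
    for customer, ok in requests:
        totals[customer] += 1
        failures[customer] += (not ok)

    print(f"global success rate: {1 - sum(failures.values()) / len(requests):.2%}")   # 99.90%

    # The promise is to each customer, so surface the worst ones, not the mean.
    worst = sorted(totals, key=lambda c: failures[c] / totals[c], reverse=True)[:3]
    for c in worst:
        print(f"{c}: {1 - failures[c] / totals[c]:.2%} success over {totals[c]} requests")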
For the last gcloud outage, i think you have to talk to people and APPLY for a credit, obviously very few did that https://news.ycombinator.com/item?id=20078296
amazon ec2 clearly states their sla gives you credits https://aws.amazon.com/compute/sla/
azure compute clearly states they give you credits https://azure.microsoft.com/en-us/support/legal/sla/virtual-...
wanna better sla? pay up, like i said in the beginning. the cost of an sla is proportional to the payout, so it works like insurance, not like a coercive measure to increase reliability.
Your LB may have some nines, your individual vms (or set of vms in a region) may have some nines, your data store may have some nines, but if all of them aren't working together it's unlikely your business will be up.
This is inherently customer-dependent and yet it's super predictable (nobody only uses a lb).
The foundational services (VMs, dns, s3, etc) I've found to be more reliable than others (ebs).
Fucking hate devs that do this, especially the ones that wander on before they have to justify their actions to anyone.
This behavior certainly isn't limited to cloud providers. If anything internal operations departments are worse. The only difference is that internal departments can be pressured more effectively.
I work for a company that’s fairly well known here. I can’t recall us having an outage (or something less severe than a full outage) that was our cloud provider’s fault and not ours. I’d recommend the appropriate caution before “blaming the compiler”.