Your nines are not my nines (rachelbythebay.com)
424 points by zdw 3 months ago | 129 comments

Million times this.

It's shocking how an "elevated rate of errors for specific endpoint" on your cloud provider's status page is actually amplified into a soft outage of your product: your writes to disk never return, your databases return inconsistent data, or your orchestration takes drastic measures over the failing health check.

When you have a lot of components in your cloud mix, failure of one stage (network -> balancing -> querying -> rendering -> persistence) brings everything down.

if 10 of your cloud services each have a reliability of 99.999, all together the reliability is not 99.999.

Cloud providers can claim mountain-high availability, whereas users will never get their apps running with the advertised reliability, because now there are multiple subcomponents that can fail.

The fact that many status pages are updated manually, and that any incident disclosure needs to get approval from management (AWS?), does not add to trust in the status page.

Uptime and error metrics are technical and should be kept away from managers.

Maybe it's time for a consumer watchdog group to step in and do their own reporting for services like this. Like https://www.isitdownrightnow.com/ but with sharper teeth.

I'm not sure how much time I have to participate but I wouldn't mind chipping in a bit on a co-op in this space.

But it might be easier to convince Is it Down Right Now to grow some fangs, or socialize the idea that it does (perception counts for a lot).

Or it's time to go back to one's own infrastructure and take one's own destiny into one's own hands, along with the responsibility.

For quite a while now I've been hoping that cloud tools will hit the turnkey solution point where anyone technically literate can manage a small private cloud. Baby pictures shouldn't be on Facebook. They should be hosted on my machine, and replicated on Uncle Bill's and Aunt Sally's, and only family members should have access to them.

But I think I'm going to be waiting a while.

In a way, Scuttlebutt provides this. It's a distributed social network with no central storage of anything. You post an image, and it's on your machine and on the machines you connect to that have chosen to "follow" you. So your baby pictures are on your machine and "Uncle Bill's and Aunt Sally's".

Realistically they're also on a "Pub" server which is just another client with a known IP that facilitates transfer of data between peers.

You could take it another step and build a personal image gallery / sharing client on SSB, as the protocol supports (and has) many different apps that run on it.

I can't remember ever seeing this work out well lol. Happy to be proven wrong one day, though.

I have my own datacenter. Works well and has for the past 20 years. Costs me peanuts because I know what I'm doing and how to do it. Will never be an Amazon or any other "cloud" provider's customer.

How do you address the central "something went boom" issue of the post? Like, if a backhoe takes out the line connecting the data center, do you still have any "nines"?

I have set up everything to be redundant.

You may have heard of a company called Amazon.

It appears I had forgotten about that back-story!

What thread do you think you are replying to?

> own infrastructure and take one's own destiny into one's own hands, along with the responsibility.

Amazon did this. It went pretty well. So well they decided to sell the results of the expertise.

This was extremely evident in the slow response and poor communication during the recent Salesforce outage.

> The fact that many status pages are updated manually and any incident disclosure need to get approval from management(aws?) does not add to the status page trust.

Automated status page updates can also reduce trust, since then the status page is itself exposed to more kinds of system failures.

I worked at a company once where each bug had a really interesting field: root cause

I wish I could remember the values you could fill in, they were very intelligently chosen.

What I learned: if you didn't know what the root cause was, you probably didn't fix anything.

1000% this.

I've been in many orgs where root cause was either completely missing or completely missed the point. Recently I quit a company that thought they were doing RCA. In fact, they sent out an email saying there was an outage and then, some number of hours or days later, another email to say it was "fixed": they had scaled, or thought they had found the problem, and we shouldn't worry about it anymore. We literally had weeks where the exact same outage occurred multiple times. And with every outage, the exact same response.

So... I asked simple questions of leadership as to why an RCA process was not implemented. Why RCA did not require a standardized template to be filled out as part of a production outage. Why a "5 Whys" approach wasn't being considered to truly expose the actual root cause. Why there wasn't any accountability.

At the end of the day failure cause is not root cause, and many people conflate the two. Honestly, when an organization doesn't hold true RCA as a critical part of its engineering process, I personally feel that the organization will inevitably hit a glass ceiling. Among other problem areas, the disconnect with RCA was, for me, why I couldn't stay at that company anymore. It was embarrassing watching from the inside as the same mistakes were made over and over with nobody the wiser.

So how did the leadership respond to your questions? Did they agree, and did things change?

They "took it as feedback". The problem was the leadership was convinced they had a real RCA process in place. They didn't.

The OP is talking about bugs not complex system failures. These are not the same thing. It might take multiple bugs to cause a complex system to fail, because complex systems often have enormous amounts of complexity built in, but a single bug is often a fairly unidimensional thing. You can identify the root cause of a bug.

Another way to look at that is that a root cause can be a set.

I've had more than one case at work where it would come down to bad requirements. Both systems worked exactly as specified and were bug free (for the issue in scope). They just had a different understanding of reality by design. Root cause here is some mixture of poor understanding of the problem domain by various staff.

> Both systems worked exactly as specified and were bug free

If the system is working as designed, then there really is no bug. A bug is a malfunction, after all.

I've always called broken systems that are working as designed BAD: Broken As Designed.

I'd agree. The defect (not sure you can call it a bug?) was in the design/spec not the system.

Was it like a list of predefined values? Where I work they do root cause analysis for everything, but with freeform answers, so what you describe might be different from what I'm used to.

In general, I'm so used to RCA and layered mitigations (what one of our greybeards calls "belt and suspenders") that I don't know how quality happens without it. I'm a convert to the idea that if you can't fix a problem directly, the fix has to isolate or be as close to the problem as possible. Otherwise the bad state just ripples outward as complexity.

Unfortunately this was probably 20 years ago. I know it was a list of predefined values in a drop-down, but I'm not sure if there was an other/write-in field.

The gist was that the causes were appropriate and educational. Folks couldn't choose "user is an idiot", instead having to choose "the interface was confusing".

Well "the interface was confusing" doesn't really rule out "user is an idiot", but most likely will make matters worse.

I like having a set of broad predefined values (it helps with standardization, which helps with searching in the future).

But a freeform report is also necessary. How else are you going to adequately explain what, where, why, etc., the root cause was?

> if you didn't know what the root cause was, you probably didn't fix anything.

Yes! If you don't know root cause, then you don't know what went wrong. Not only do you not know what to fix, any shotgun debugging is likely to have only fixed one symptom, leaving the actual malfunction in place.

Almost all of the companies I've owned or worked for have recognized this with a simple rule: if you haven't found (and proven) root cause, then the bug cannot be closed as fixed. Any company that doesn't have a variation of this going on is a company whose products you can't trust (and a company I would prefer not to work for).

Saucelabs is very bad this way. We have tunnels flake out once in a while (I'm still convinced there's a concurrency bug in their tunnel implementation, based on missing events I've seen in test logs), but sometimes Sauce is just having issues.

When I'm seeing 100% failure rate, there's often nothing on their status page. Or there's some bullshit metric like VM acquisition times are double normal for, say, some Windows VM. But I'm not seeing 8% failure rate. I'm not seeing an extra 30 seconds. I'm seeing 100% failure rate, with long timeouts, and retries.

> if 10 of your cloud services each have a reliability of 99.999, all together the reliability is not 99.999.

(The answer is 99.99%.)

It's probably not in practice, since that assumes failures are perfectly independent. If they are perfectly correlated, the answer is still 99.999. For most real cases, it will be between those extremes.

I think it depends on how you define availability.

1. Suppose we define availability as "at least one is up". If the failures are completely independent, then the probability of any one being down is 10^-5 (five nines) and the probability of all 10 being down at the same time is (10^-5)^10 = 10^-50 (fifty nines).

2. If we instead define availability as "all 10 are up" (which is essentially equivalent to one failure causes a cascading failure) then in the same scenario where failures are independent, this is (1-10^-5)^10 ~= 99.99% (four nines).
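To put numbers on the two definitions above, here is a minimal Python sketch. It assumes failures are fully independent, which (as noted upthread) real systems rarely are, and the function names are made up for illustration.

```python
def all_up(availability: float, n: int) -> float:
    """Probability that all n independent components are up at once."""
    return availability ** n


def all_down(availability: float, n: int) -> float:
    """Probability that all n independent components are down at once."""
    return (1.0 - availability) ** n


FIVE_NINES = 0.99999

# Definition 2: every one of the 10 services must be up (a serial chain).
print(f"10 services, all up: {all_up(FIVE_NINES, 10):.4%}")
# The same chain with twenty services.
print(f"20 services, all up: {all_up(FIVE_NINES, 20):.4%}")
# Definition 1: only a total outage counts, so report the chance all 10 are down.
print(f"10 services, all down: {all_down(FIVE_NINES, 10):.1e}")
```

With perfectly correlated failures the chain would stay at 99.999%, so the "all up" figures are the pessimistic end of the range rather than a prediction.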

Although I agree with your assessment that the total uptime of the system is the multiplication of the individual systems, that doesn’t appear to be the point of the article.

> The problem is that they weren't monitoring from the customer's perspective. Had they done that, it would have been clear that oodles of requests from some subset of customers were failing. They would have also realized that certain customers had all of their requests failing.

This is saying that if you are small, all your failing requests are within the 0.001% that the provider is allowed to fail.

I suppose this depends on how 99.999% uptime is defined in the SLA.

> if 10 of your cloud services each have a reliability of 99.999, all together the reliability is not 99.999.

It's like an episode of Dirk Niblick: https://www.youtube.com/watch?v=bCoGMYV3UPk

This rings so true it hurts. At a very large, very blue, company I recall a time trying to explain to an account manager that I wanted to write the SLAs in terms of my footprint, which is to say, given the resources you have allocated to my account, let's set some SLAs like "latency from any node to any node", "latency from any node to the primary internet", "latency from any node to the secondary internet", "availability of primary internet", "secondary internet" and "blended availability of both."

I had a bunch of these things, all of which were things that were tracked, measured, and monitored, in an existing setup.

Their response was, "We really don't have any way to provide the data for your SLAs, much less actually sign up to enforce them." I suggested that they were not serious about being in the 'cloud' business then. They seemed miffed.

Would you be willing to pay more to have the SLAs behave like that, and if so, how much more? Genuinely curious, maybe there's a market for "cloud but better SLAs".

Good question. The point of the article is that the value of cloud SLAs is inversely proportional to the size of the cloud.

Think of it this way: consider the definition of an "availability" SLA as 'the mean availability of all hosts in our cloud'. If it's reported at "five 9's" or 99.999%, that means that a cloud of 100,000 machines could have one machine down for days at a time and never cause their SLA to slip. Big providers average over multiple hundreds of thousands of machines; your stuff could be down all the time and yet 'everything' is "meeting all the SLAs".
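A rough sketch of that averaging effect in Python. The fleet size matches the example above; the function name and the single-host customer are assumptions for illustration.

```python
def mean_availability(total_hosts, hosts_down, down_fraction=1.0):
    """Mean availability over a window when hosts_down hosts are each
    unavailable for down_fraction of that window and the rest are fully up."""
    return 1.0 - (hosts_down * down_fraction) / total_hosts


# Provider view: 100,000 hosts, one of them down for the entire window.
fleet = mean_availability(total_hosts=100_000, hosts_down=1)
# Customer view: that one host is the customer's whole footprint.
customer = mean_availability(total_hosts=1, hosts_down=1)

print(f"fleet-wide: {fleet:.3%}")
print(f"customer:   {customer:.3%}")
```

The same outage sits right at five nines in the provider's average and is a total outage for the customer who owns that host.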

You see this outside of data centers in other overly generalized metrics. Unemployment is only 3.7%[1]! Yay, right? Tell that to the people of Magoffin County, Kentucky, where unemployment is 12.3%[2].

So would I pay more? I don't know. If none of the service providers would offer SLAs based on my footprint, it would not be a choice. If one does, then it becomes the preferred choice even if it is more expensive. At which point do all of them have to offer it to remain competitive? Another good question. Could be a good differentiator for the #3 cloud provider, Google. I know they have the technology to do it if they chose to.

[1] https://www.bls.gov/news.release/pdf/empsit.pdf

[2] https://www.lanereport.com/112723/2019/04/state-releases-cou...

You might want to have a look at some Google Cloud SLAs[1]. They are generally calculated based on actual performance on a particular customer's RPC traffic, down to individual RPCs (at least on Google Cloud Storage, where I work). Read through the agreements to the definition of "Error Rate", and I think you'll find the terms you're looking for.

You're welcome! :)

The root post does raise an important issue, though -- just because GCS thinks it's doing great on your RPCs doesn't mean that your system is doing great.

[1] https://cloud.google.com/terms/sla/

How are Google Cloud SLAs valuable when the service regularly has multi-hour (IIRC, > 5 hours!) (sometimes global) outages?

I posit that Google is not presently in a good position to be highlighted as a role model or case study for effective cloud provider SLAs.

You report to them your outage (https://support.google.com/cloud/contact/cloud_platform_sla) and then get monetary credits. e.g. for an arbitrarily-clicked service [1], it's 10% off for a month when they don't hit three 9s, 25% if they don't hit two 9s, and 50% off if they're below 95%. Which honestly isn't a very high uptime requirement, but there is a very clear process for getting payouts when they miss it.
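The tiers described above can be sketched as a small lookup in Python. The thresholds are the ones quoted in this comment, not taken from the SLA text itself, and treating each boundary as strictly "below" is an assumption; the function name is hypothetical.

```python
# (monthly uptime threshold in percent, credit in percent), most severe first.
# Assumption: a tier applies when uptime is strictly below its threshold.
CREDIT_TIERS = [
    (95.0, 50),
    (99.0, 25),
    (99.9, 10),
]


def service_credit_percent(monthly_uptime_percent):
    """Return the credit percentage for a month with the given uptime."""
    for threshold, credit in CREDIT_TIERS:
        if monthly_uptime_percent < threshold:
            return credit
    return 0


for uptime in (99.95, 99.5, 97.0, 90.0):
    print(f"{uptime}% uptime: {service_credit_percent(uptime)}% credit")
```

Note that the credit is a share of the bill for that service, so it says nothing about what the outage cost you.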

[1] https://cloud.google.com/filestore/sla

>How are Google Cloud SLAs valuable when the service regularly has multi-hour (IIRC, > 5 hours!) (sometimes global) outages?

You can sue them? (if that's not forbidden by the contract).

See my sibling comment - they have a process for talking to them to get discounts for months when they don't meet the SLA. If they don't give you said discount, you can sue them for breach of contract, but if they pay out you can't sue them for damages.

In the old phone model, SLAs are per-customer: it doesn't matter what your network as a whole looks like, if any phone line goes below N 9s you're going to be paying out the contract penalties. It's still better for the telco to fail for fewer customers, because then they don't have to pay out as much.

The granularity may be harder to define for cloud services, but it is very much doable; it's all about making sure that the target metrics have zero connection to the global state of the system.

Are there any large providers where the SLAs are meaningful, instead of a variation of "you don't have to pay us if we didn't provide the service" (i.e. something not remotely related to the damage typically caused by such outages)?

For example, Amazon will give you a 30% refund "for the individual Included Service in the affected AWS region for the monthly billing cycle in which the Unavailability occurred" if availability during a month drops to, but not below, 95% (that's a 1.5 day downtime).

That means that if your service goes 100% down because EC2 was completely broken in a region for 1.5 days, you get a refund of 9 days' worth of EC2 (compute) charges, but not the associated EBS (disk) or S3 (storage) or other charges.
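The arithmetic behind those two figures, as a short Python sketch. It assumes a 30-day billing cycle, and the helper names are made up.

```python
DAYS_IN_MONTH = 30  # assumption: a 30-day billing cycle


def downtime_days(availability_percent, days=DAYS_IN_MONTH):
    """Days of downtime implied by a monthly availability percentage."""
    return days * (100 - availability_percent) / 100


def credit_in_days(credit_percent, days=DAYS_IN_MONTH):
    """A percentage credit on the monthly bill, expressed as days of service."""
    return days * credit_percent / 100


print("downtime at 95% availability:", downtime_days(95), "days")
print("30% credit is worth:", credit_in_days(30), "days of that service")
```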

And "unavailability" counts only if at least two availability zones at the same time are completely down. And then you have to request the credit in a very specific format.

Google and Azure look extremely similar.

Are better SLAs typically negotiated? Because based on this, it seems like the only thing keeping cloud providers reliably above their SLAs is the fear of losing current and future customers, not the SLAs themselves. In other words, the SLAs are completely meaningless.

I think these blogposts create a false narrative. It should start by acknowledging that in 99% of cases any outsourced systems' stability will be better than anything in-house. Yes, there are some vendors that don't do a good job. I heard many people complaining about Layer, for instance. But most vendors (AWS, Stripe, Algolia, Stream) can invest more time and effort in stability than you can feasibly do for an in-house solution. This is not surprising: if you do something for thousands of customers you can dedicate more effort to QA, docs, maintenance, monitoring, firefighting etc. For every story of someone having vendor issues there are dozens of things going to shit with in-house code.

There have been times in the past when I was annoyed with AWS stability issues. We've all been there. But I also know that AWS is more stable than anything I could feasibly build in-house.

Will it? Oftentimes the response time to a failure matters more than the availability itself. So does redundancy, which can be cheaper to handle manually than to rely on some unknown process of a provider. The support on any cloud is terrible for a small client.

> It should start by acknowledging that in 99% of cases any outsourced systems' stability will be better than anything in-house.

I can't acknowledge this, as it has not been my experience at all.

I've been on the receiving end of this from the POV of a Fortune 50 company. Companies that are not gnats on anyone's window. Treatment is the same. These big guys just all suck at professional service. All their money goes into sales and product engineering. Lock-in breeds retention. The biggest clients are the ones that tend to be the most locked in too.

Having worked in F50 too.

When there is a choice to make between public cloud with 99.xxx% SLA or the internal cloud with 90% uptime and 6 months SLA to get a server, the right choice is always cloud.

>public cloud

>internal cloud

>the right choice is always cloud

I can't tell if you missed a word or are making a joke...

"internal cloud" is not the cloud. It's just a bottle full of mist.

I thought it was something like Azure Stack.

The SLA is just a metric used to negotiate discounts/credits next time around.

When I was in ops, thinking the world was going to collapse and we were all going to get fired if a service went down, I didn't really get it. It's just a negotiating tactic. It's a cost of doing business.

I'm not sure, but managing 1000s of servers/storage/whatever manually is quite a feat.

Yeah, no kidding. Hybrid is great if you can swing it.

So naturally internal cloud always gets picked.

This can be complicated, though. I work for a large CDN, and we have systems that monitor our customer experience. Almost every issue those systems discover, however, ends up being an issue with a customer's origin or configuration. We ended up having to change our procedures on how we responded to issues we discovered because all of our support time was spent checking these issues, and realizing they were outside our control.

There are always two sides to these sorts of things.

One of the most frustrating things about dealing with situations like this is actually getting ahold of someone with enough experience to say where the issue is to begin with even if it’s out of the provider’s control.

I have sent a lot of log files to a cloud vendor trying to find out why their web-hosted application was so slow (6-10 second response times on a CRM app they provided). If someone had responded with an actual answer (your firewall is blocking traffic, or try this setup, etc.) I could have worked with that. Instead we got nothing but stealth ticket closes and “sorry we don’t know why this is slow” responses. This article hit a nerve because you really do dance to someone else’s tune when you go to the “cloud”.

I think there is a lot of room for cloud provider innovation in this area. It shouldn't take a human to tell you what's wrong.

"because you really do dance to someone else’s tune when you go to the “cloud”"

The same can be said of large orgs with a large on-prem footprint.

I work for a company in a similar situation, and I agree.

The author wants us to look at things from the customer's perspective. The thing is, we (and presumably all major cloud providers) do. Every feature released, every API call, has a canary associated with it that does nothing but pretend to be a customer using that feature. There are definitely cases that slip through the cracks that shouldn't have (we forgot to properly test for a certain condition or combination use case etc) but the vast majority of the time a customer experiences an outage it's because of something the customer did.

That's not to excuse the five-nines guarantee that inspires fake confidence. But we're always upfront with customers that there's a shared responsibility for availability: it's our responsibility to make sure what customers pay for works, but it's also a customer's responsibility to make sure there's enough redundancy in their architecture for their use case.

Where I currently work (all of our customers are enterprises), we encounter this all the time. I'd say about 90% of our serious support tickets are problems with the customer's system or code and are completely out of our control.

However, we will spend a great deal of time resolving their issues regardless. Last week, for example, we had a customer encountering failures with their program using our product. I obtained the source for the customer's application and debugged it for them.

I like that we do this -- it's really nice to solve a customer's problem, and even nicer to be able to tell them it wasn't the fault of our software. It's expensive, of course, but our support contracts are priced to take this into account.

We've talked about these classes of problem from time to time. A lot of ideas have been put forward, but what's the solution?

If I have a multi-tenant system, and no one customer is dominant (always causes problems IMO), my 'biggest customer' might only be 4% of my traffic. There are a million things that can go wrong that make this customer's experiences different from everyone else's, from getting my sharding solution wrong to small-C n^2 issues (and a whole lot of space between for nlog(n) problems).

If I'm doing 95th percentile calculations that will not show up in my metrics. If I have a larger customer that's 10% of my traffic, almost half of their users could be having issues before my alerts go off.

And then there's explaining to your boss that 5 9's across twenty interacting services is around 99.98%, and that's only if degradation in one service doesn't cause failure in another.

The issue is that percentiles are a very crude tool. This isn’t just a multitenancy problem, it can manifest in any multiuser system. If 0.5% of your traffic comes from New Zealand, a DNS issue affecting your CDN routing that causes all NZ traffic to time out won’t affect your 99th percentile loadtime graph at all. Essentially, percentiles are useless for discovering problems that have a strong effect on a small portion of your traffic.
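A small Python sketch of the New Zealand scenario, using made-up numbers (1000 requests, a 200 ms baseline, a 30 second timeout) and a nearest-rank percentile. It is only meant to show the blind spot, not how any particular monitoring tool computes percentiles.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def failure_share(values, timeout_ms):
    """Fraction of requests that took at least timeout_ms."""
    return sum(1 for v in values if v >= timeout_ms) / len(values)


TIMEOUT_MS = 30_000  # hypothetical client timeout

# Hypothetical traffic: 1000 requests, all served in 200 ms.
healthy = [200] * 1000
# Same traffic, but the 0.5% coming from one region now times out.
regional_outage = [200] * 995 + [TIMEOUT_MS] * 5

for name, sample in (("healthy", healthy), ("regional outage", regional_outage)):
    print(name, "p99 =", percentile(sample, 99), "ms,",
          "timed out =", f"{failure_share(sample, TIMEOUT_MS):.1%}")
```

The p99 is identical in both samples; only a count of the slow requests (or a histogram) shows that one group of users is completely broken.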

Scatter plots and histograms are much better at telling you when the distribution of a stat has gone bimodal with a small but consistent group in the outlier group. Percentiles only make sense for telling you, when you already know what shape the distribution curve should be, how flat that curve is right now. They don’t tell you when the shape has changed.

What scatter plot and/or histogram do you have in mind to dig up the small but consistent group in the outlier group?

Ah, yeah - figuring out what they have in common is the trick :)

Good APMs and trace tools will let you zero in on traces by characteristics - so if you notice there’s a bump in requests which have a 2 second load time, you can select them all and analyze how they are distributed - whether they are mostly one browser, one location, or one user even. But you need a solid strategy for tagging and logging traces.

You could statistically test whether your data matches the expected distribution with automated tests.

That's the "monitor from the customer's point of view" approach the OP alludes to. If you use tools like Honeycomb [1] that can easily and routinely answer questions like "show me the 95th percentile latencies for each of the 10 customers experiencing the worst latencies", then situations like you're describing are a lot easier to discover.
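As a rough illustration of that kind of query, here is a plain-Python sketch. It is not Honeycomb's API; the customer names, latencies, and function names are all invented, and a real system would run this over tagged traces rather than an in-memory list.

```python
import math
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(95 * len(ordered) / 100))
    return ordered[rank - 1]


def worst_customers(requests, n=10):
    """Return the n customers with the highest per-customer p95 latency.

    requests is an iterable of (customer_id, latency_ms) pairs.
    """
    by_customer = defaultdict(list)
    for customer, latency_ms in requests:
        by_customer[customer].append(latency_ms)
    ranked = sorted(by_customer.items(), key=lambda item: p95(item[1]), reverse=True)
    return [(customer, p95(latencies)) for customer, latencies in ranked[:n]]


# Hypothetical traffic: one tiny customer is badly broken.
requests = (
    [("acme", 120)] * 96 + [("acme", 150)] * 4
    + [("mid", 200)] * 20
    + [("tiny", 5000)] * 3
)

print("global p95:", p95([latency for _, latency in requests]), "ms")
for customer, latency in worst_customers(requests, n=2):
    print(customer, "p95:", latency, "ms")
```

Grouping before taking the percentile is the whole trick: the global figure is dominated by the large customers, while the per-customer ranking puts the broken one at the top.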

[1] https://honeycomb.io. Disclaimer: I used to work for them.

>That's the "monitor from the customer's point of view" approach the OP alludes to.

...but now you're in a recursive problem: Who watches the watcher? If the watcher goes down, your insights are gone. Do you devote your entire engineering staff to monitoring, then?

A two-pronged approach would be better: Customer Touch-Point monitoring built into your product and external monitoring should your CTP monitoring go down. If your external monitoring goes down, you still have the CTP, so not all visibility is lost.

Having standard shaped telemetry for SDKs (via OpenTelemetry hopefully now) then allowing opt-in aggregation of customer views, either as an observability product to peers or to the providers is what I've been trying to get interest in for the last few years. Having consistent data to show cloud support is also helpful, even if only for one company, especially if you can show you're usually right. A short blog post on this I wrote a bit ago: https://lightstep.com/blog/tough-conversations-with-cloud-pr...

What role does the word "shaped" have in your opening sentence? Telemetry that is standard-shaped? Standard telemetry that is shaped?

And what does it mean? And am I stupid, or is this not a term everyone knows?

>There are a million things that can go wrong that make this customer's experiences different from everyone else

In my experience the things that are easier to write-off as unique to one case, non-representative, or too rare to fix, so they don't have to be thoroughly addressed are warning signs of a robustness issue. Still doesn't mean that they'll get fixed then and there, but they often come around to bite you in the ass later.

I've checked and the top cloud players all have uptime SLAs (which according to the blog post don't seem to have the necessary granularity to matter). See https://aws.amazon.com/compute/sla/, https://cloud.google.com/compute/sla, and https://azure.microsoft.com/en-us/support/legal/sla/summary/ for examples.

But are there other SLAs like for in-zone latency, or hardware performance (e.g. IOPS or bandwidth from your local or remote storage)? Are these kinds of SLAs part of larger private agreements (like, Netflix, a huge AWS customer), or is uptime the only SLA offered? Haven't been able to find any info on this in my searches...

They all have SLAs, but the "tier" at which a problem happens dramatically affects how the SLA pays out, or if it pays out.

For example, within the AWS Compute SLA you linked:

> Unavailable is defined as: For Amazon EC2 (other than Single EC2 Instances), Amazon ECS, or Amazon Fargate, when all of your running instances or running tasks, as applicable, deployed in two or more AZs in the same AWS region (or, if there is only one AZ in the AWS region, that AZ and an AZ in another AWS region) concurrently have no external connectivity.

Get this: Single EC2 instances have an SLA of 90%. Seriously. It's in that article.

In other words, AZ outages rarely see a payout, because you "didn't architect your cloud correctly". And we've been told some nicer variation of this when asking for a reimbursement a few years back. You do have to ask, you know. They could literally automate this process, but they don't. Whatever.

Let's also be clear about the language here: There's no "pay out" at all. What happens is, you get the amazing privilege of not being forced to pay them for a product that didn't work.

That ties directly back to the article; they pay out based on their architecture and SLAs, which are not your architecture and SLAs, unless you perfectly match your architecture to their architecture, which will have gaps, and then you're bought in so hard that you could never leave if you needed to.

I know it's just an example, but it really IS how you architect your solution. If you have a single EC2 instance without redundancy that's important to something, you're doing it wrong.

You're not wrong, but it feels like a cheap answer. Why should I pay twice as much just because AWS can't keep an instance up? (And double-cost can be understating it; I've run commercial software where the multi-instance/clustered version is far more expensive than the single-node version. I'd actually like to make it fully multi-AZ, but I'm not gonna get the company to drop that kind of money on it.)

Wow, I missed the 90% uptime for a single instance. That's pretty terrible.

It's not 90% uptime, it's 90% SLA. As in the threshold for "down for more than that, we pay out".

The problem is that the vendor is incentivized to publicly use whatever metric shows the highest availability. Otherwise, the vendor will have to pay back credits. The vendor's nines are never my nines.

The only way this gets solved is through cloud consumers providing streams of telemetry (sanitized of any data of value besides success/failure metrics of the underlying cloud primitives) to a central uptime stats broker (Speedtest.net meets DataDog meets the Internet Weather Map). The incentive to fudge or exaggerate your uptime claims as a vendor through sales and marketing is too high; let the data speak for itself.

Do you trust AWS' status page? Or are you coming to Hacker News to ask why your network latency between instances has skyrocketed unexpectedly?

I wonder if this is a sort of thing you could interest EFF or another organization to put funding behind.

"It is difficult to get a man to understand something, when his salary depends on his not understanding it."

As an aside, once in a while I imagine what kind of field day Upton Sinclair would have with this latest swing of the pendulum toward dystopia.

No, the vendor is incentivized to provide a good experience to the customer, especially vendors that aren't AWS. They know they have everything to lose from customers changing clouds.

Ineptitude, and it being a hard problem, are sufficient to explain the status quo.

This doesn’t seem to be true (at least for all vendors). AFAIK, Google Cloud has per-customer SLAs, though you might need to have enough traffic for statistical significance in some products.

Using GCE as an example [0], it’s per-customer, but all instances in multiple zones have to be unavailable. You could have 99% instance failure and not qualify.

>Loss of external connectivity or persistent disk access for all running Instances, when Instances are placed across two or more Zones in the same Region.

[0] https://cloud.google.com/compute/sla

Is that significance also counted on Google scale? :)

I usually advise my clients to treat cloud providers like they would treat hardware, perhaps a bit more reliable (though in practice that is usually not the case). If you cannot afford for your database to be unavailable, invest in creating a backup db for your database. That advice holds regardless of whether said database is run on a cloud or not. Base your investment decisions on the downtime you observe of the cloud provider. Don't expect your cloud provider to magically ensure your app has n 9s reliability. That's on you.

I get the sense rachelbythebay may be another satisfied Azure customer.

Azure customer gets the blues.

Cloud solutions have plenty of issues, but I'm quite surprised there aren't more replies talking about how many impossible problems have been made tractable and reliable thanks to the cloud (or to be more precise, made much cheaper to solve reliably). The article makes a great point about accountability, in the sense that no one at a cloud-providing corp is immensely worried about transient failures that only affect small user sets heavily (which sucks if you're affected). But for my scientific computing use cases, getting things working reliably at any sort of scale within budget is impossible without the cloud. Research institutions' computing clusters are just smaller, less reliable, less flexible versions of the cloud (good luck getting sysadmins to do anything useful at all).

One of the collaborations I work in, LIGO, recently gave up on private servers and transitioned to AWS for our Gravitational Candidate Database [1] because the cloud is so much better. I made this change to my own low-latency search framework [2] years ago. If you're not "lucky" enough to (be forced to) use a university/collaboration cluster, you'd have to maintain your own server, which is orders of magnitude less reliable and more expensive/difficult. I understand that not all workflows are the same, but for all of my nontrivial applications, cloud providers save so much time and money that I can do something as bold as making a provider-agnostic architecture with more robust failover. I recognize that more complicated workflows might require e.g. 10 separate AWS services with AWS-specific features causing lock-in, but at that level of complexity, I'm guessing the problem must be virtually impossible with a non-cloud solution anyway. If you really can't figure out another way to deal with resiliency, you might just need to accept that your problem space is really hard and that you're lucky to even be able to run it at all. Again, I think the original article is right about the fact that you have to account for this yourself; the cloud is not magic, and your code still has to understand that it is (like all abstractions) going to leak.

Again, the point about responsiveness in the original article is very well-taken; I'm just surprised more people aren't observing that, overall, the reliability, cost, and flexibility provided by cloud solutions are utterly transformative.

[1] http://gracedb.ligo.org

[2] http://multimessenger.science

I have a hard time taking this article seriously when it’s all “innuendo” and not actually naming any names or providing any verifiable facts at all.

If the author had a specific problem with specific SLA’s, tell us with real details.

And SLA’s aren’t for winning the lottery or providing impossible-to-meet standards. You need to look at what they actually cover, compare with your costs and reliability of running infrastructure in-house, and then pick the right tradeoff for you. I can’t even tell if the author is accusing cloud providers of fraud, of being misleading, if the author just never understood the SLA properly, or what.

Well, even their nines aren't a whole lot of nines if you've been paying attention to all the outages of late.

Isn't this why service-level agreements exist? If the nines of uptime are that important to your business and you don't want to be a gnat on a windshield, you've got to give the vendor some financial incentive to pay attention to you, right?

Or is Rachel talking about a situation where you have an SLA in place, but you can't even prove downtime to the vendor because their monitoring software is inadequate?

I think this is about the granularity of SLAs and monitoring.

If a provider promises that, overall, 99.5% of all requests will succeed, but the 0.5% errors are all concentrated on some few customers / regions / AZs, customers can have a very bad day.

So this is about promising each customer that 99.5% of all their requests will succeed, and monitor in a way that makes sure you can keep that promise.
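To make the granularity point concrete, here is a minimal Python sketch. The traffic numbers and customer names are made up for illustration, not taken from any provider: 200 customers send 1,000 requests each, and every failed request lands on a single customer. The aggregate success rate still reads 99.5%.

```python
from collections import Counter

def success_rates(requests):
    """requests: iterable of (customer_id, succeeded) pairs.
    Returns (aggregate success rate, {customer_id: success rate})."""
    total = Counter()
    ok = Counter()
    for customer, succeeded in requests:
        total[customer] += 1
        if succeeded:
            ok[customer] += 1
    aggregate = sum(ok.values()) / sum(total.values())
    per_customer = {c: ok[c] / total[c] for c in total}
    return aggregate, per_customer

# Hypothetical traffic: 200 customers x 1,000 requests.
# All 1,000 failures (0.5% of 200,000) hit "customer-0".
requests = [
    (f"customer-{i}", i != 0)
    for i in range(200)
    for _ in range(1000)
]

aggregate, per_customer = success_rates(requests)
worst = min(per_customer, key=per_customer.get)
breaches = [c for c, rate in per_customer.items() if rate < 0.995]

print(f"aggregate success rate: {aggregate:.3%}")
print(f"worst customer: {worst} at {per_customer[worst]:.1%}")
print(f"customers below 99.5%: {len(breaches)}")
```

A provider that only tracks `aggregate` sees a met target; one that tracks `per_customer` sees a customer who was completely down.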

To get a proper SLA you need to pay for SLA. What SLA google, azure, aws have is useless, some service credit proportional to the outage duration. Totally nothing to cover lost profits or direct damage.

For the last gcloud outage, i think you have to talk to people and APPLY for a credit, obviously very few did that https://news.ycombinator.com/item?id=20078296

Working with enterprises that actually have individual SLA's with one of the clouds you mentioned, this is not true. You can negotiate your own SLA's with certain providers, and not just get "credits." These enterprises have mature enough monitoring solutions to be able to prove to the provider that they didn't hold up on their end. Besides that, every half-way capable solution architect wouldn't move "system critical" software to the cloud, hence reducing "lost profits" or "direct damage". If you're doing that, it's seriously your own fault.

google cloud compute clearly states that their free sla benefits come in form of credits https://cloud.google.com/compute/sla

amazon ec2 clearly states their sla gives you credits https://aws.amazon.com/compute/sla/

azure compute clearly states they give you credits https://azure.microsoft.com/en-us/support/legal/sla/virtual-...

wanna better sla - pay up, like i said in the beginning. as the cost of sla is proportional to payout that works like an insurance, not like coercive measure to increase reliability.

Did you skip the part where I said enterprises get separate SLA's from the ones you and I get?

It's a bit of both. But also, you need to remember, legal may pass blame, but they won't keep your app up. Your app being down is likely bad, even if you get paid for it.

The cost to your business will always be greater than the maximum refund on the SLA, even if it's a full refund. That's why you're using the provider in the first place, since you can make more money than they're charging.

That's not true. If you get a month refund for a day down, you can still come out ahead. Even so, it's not reasonable to compare to perfection, you should compare to other options.
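A rough way to check when a refund actually covers the loss, as a Python sketch with made-up numbers. The function name, the fee and revenue figures, and the 30-day month are all assumptions for illustration, and it ignores reputational damage or any cost beyond lost revenue.

```python
def net_effect_of_outage(monthly_fee, monthly_revenue, days_down,
                         refund_fraction, days_in_month=30):
    """Refund received minus revenue lost while down.
    Positive means the refund more than covers the lost revenue."""
    lost_revenue = monthly_revenue * days_down / days_in_month
    refund = monthly_fee * refund_fraction
    return refund - lost_revenue

# Hypothetical: $10k/month bill, full month refunded for one day down.
small_margin = net_effect_of_outage(10_000, 50_000, 1, 1.0)
large_margin = net_effect_of_outage(10_000, 400_000, 1, 1.0)

print(f"revenue 5x the bill:  {small_margin:+.2f}")
print(f"revenue 40x the bill: {large_margin:+.2f}")
```

Under these assumptions, a full-month refund for one day down leaves you ahead only while monthly revenue is less than about 30 times the monthly bill.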

An SLA could technically go beyond refunding payments into paying out fines.

I'm not sure how this person has their system architected, but they should look closely at the 9's the company is talking about. Is it 5 9's across all regions? Within a single region? What about for the specific service? It really all depends; either the post is an oversimplification or they haven't architected their system appropriately to actually get 5 nines on the host cloud.

Surely when you buy such a service the uptime guarantee applies to the service you have paid for? What happens to the rest of the customers is irrelevant. And the average availability in aggregate is even more so.

They are not the only ones who are not monitoring from the customer's perspective. It is actually hard, as it involves much more than watching some metric. I proposed a user experience monitoring system for my ex-employer, which was based on a very simple principle: a problem is the difference between what the customer wants and should be able to do and what the customer is able to do right now. It's kind of funny, but some major eng companies (maybe even all) don’t know what user experience is...

This plays into something I've been thinking about recently, which is that even when a technology scales indefinitely, maybe technology business doesn't. Maybe, in a world where hosting and CRUD apps and everything feel like solved problems, there's still a place for smaller providers that can interface with their customers directly and tailor themselves to their needs. It's a vaguely comforting thought.

I think the fundamental problem, in engineering terms, is that most cloud deployments effectively wire together cloud components serially.

Your LB may have some nines, your individual vms (or set of vms in a region) may have some nines, your data store may have some nines, but if all of them aren't working together it's unlikely your business will be up.

This is inherently customer-dependent and yet it's super predictable (nobody only uses a lb).
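A back-of-the-envelope Python sketch of the serial wiring described above. The availability figures are hypothetical, and the calculation assumes component failures are independent, which real outages often are not.

```python
import math

def serial_availability(availabilities):
    """Availability of a chain where every component must be up,
    assuming independent failures."""
    return math.prod(availabilities)

def nines(availability):
    """Number of nines, e.g. 0.999 -> about 3."""
    return -math.log10(1 - availability)

# Hypothetical per-component availabilities, not vendor figures.
chain = {"load_balancer": 0.9999, "vms": 0.9995, "data_store": 0.9999}
combined = serial_availability(chain.values())

for name, a in chain.items():
    print(f"{name}: {a:.4%} ({nines(a):.2f} nines)")
print(f"end to end: {combined:.4%} ({nines(combined):.2f} nines)")
```

The end-to-end figure is always lower than the weakest component in the chain, and it drops further with every component added.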

Can we change the link URL to use HTTPS?

Not sure why you were downvoted but that would be a good idea since the site supports HTTPS

You can get closer to their nines by using less of their stuff - if you use every service aws offers to power a single app, you'll have markedly less 9's than if you only use a few...

The foundational services (VMs, dns, s3, etc) I've found to be more reliable than others (ebs).

Public clouds: The new Comcast.

I'm testing changes to a process that uses Azure. We have dozens of on-site SQL Server dbs, but this one process uses Cosmos, because... the developer didn't have to write as much code. The developer is gone, but I'm left supporting a process that sucks data down from its source, does a little transform, then shoves it into Cosmos. Then it pulls the data directly back from Cosmos to load it into our internal server dbs. Why?! It's a total facepalm to me. Extra stage, extra step, extra complexity and extra cost for no gain. We don't serve any external pages or services. No reason for this data to be in the cloud. All internal use.

Fucking hate devs that do this, especially the ones that wander on before they have to justify their actions to anyone.


It's not just the cloud providers' perspective on outages. That is the rosiest interpretation of misleading availability stats. There is an obvious moral hazard involved because most availability tracking is self-reported and outage criteria are vague.

This behavior certainly isn't limited to cloud providers. If anything internal operations departments are worse. The only difference is that internal departments can be pressured more effectively.

What does that title even mean?

It should read "Your nines are not my nines" which is a little better.

I think HN has some algorithms that attempt to clean up titles to make them less clickbaity, something may have gone wrong here

This is so incredibly true. My company has internal services used between teams. For some reason my app can always tell when another app is down and they never can.

Even funnier, when a massive network outage occurs, cloud providers shrug and say "not my problem :-)"

How would you mitigate that if you were running it on-premise?

Is this title a consequence of some automatic HN system to try to reduce "fluff" in titles? It's kind of nonsensical in this instance. I saw another title earlier today missing a leading "How" that also didn't make much sense.

Indeed. Sorry! If you notice such disfigurement in the future feel free to email hn@ycombinator.com and we might get to it quicker.

I suppose that’s true. But it also occurs to me that their ops team is not your ops team, their observability stack is not yours, etc.

I work for a company that’s fairly well known here. I can’t recall us having an outage (or something less severe than a full outage) that was our cloud provider’s fault and not ours. I’d recommend the appropriate caution before “blaming the compiler”.

Still remember back to the time when S3 bragged their SLA as five nines. After the notorious incident ~2 years ago, they made it down to three nine.

You may be confused. "One nine" would mean 90% availability.

my bad. 5 nines v.s. 3 nines.

a 40% drop in nines. 40% less reliable!
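Joking aside, the arithmetic is worth spelling out. A small Python sketch, assuming a 365-day year and treating "n nines" as an availability of 1 - 10^-n:

```python
def availability_from_nines(n):
    """5 nines -> 0.99999, 3 nines -> 0.999."""
    return 1 - 10 ** (-n)

def downtime_minutes_per_year(availability, minutes_per_year=365 * 24 * 60):
    """Allowed downtime per year at a given availability."""
    return (1 - availability) * minutes_per_year

five = downtime_minutes_per_year(availability_from_nines(5))
three = downtime_minutes_per_year(availability_from_nines(3))
ratio = three / five

print(f"5 nines: {five:.2f} minutes/year")
print(f"3 nines: {three / 60:.2f} hours/year")
print(f"allowed downtime grows by a factor of {ratio:.0f}")
```

Going from five nines to three allows roughly 100 times more downtime (about 5 minutes a year versus almost 9 hours).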
