The reliability pillar of the AWS Well-Architected Framework [pdf] (awsstatic.com)
149 points by robfig 27 days ago | 60 comments



The section about calculating the availability of hard and redundant dependencies ignores the fact that systems often fail in tandem. For example, you might have a primary and secondary database in different AZs, each with 99.9% availability. This gives read operations a hypothetical uptime of six nines. But hidden SPOFs like operator error, VM infrastructure failing, or load balancer and other networking outages can make the 0.1% failure windows overlap, blowing away your six-nines dependency guarantee.


Many run into this when they learn the hard way that “RAID is not a backup”: wiping out all your data on one set of disks also wipes it on the other. Ditto for naive replication with no rollback capability. It’s a fancier version of the same thing.

“DROP TABLE customer”

Oh crap! Better failover to DR. Oh crap! It’s dropped there too?!


exponentially delayed replication down the replica chain...


Why increase the interval at all? Why not just equally, or at most linearly increasing, delayed replication?


With replication at 1 minute intervals and ten replicas, you can step back at most 10 minutes.

Same with replicas at 1, 2, 4, 8, etc minutes, ten replicas let you step back up to 17 hours.

Much more time to detect and correct a catastrophic error.
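
A quick back-of-the-envelope sketch of the two step-back windows, assuming (as the comment implies) that each hop's delay compounds down the chain; the replica count and intervals are the ones used above:

```python
# Maximum step-back window for a chain of N delayed replicas:
# constant 1-minute hops vs. exponentially growing hops.

def max_step_back(delays_minutes):
    # The last replica lags the primary by the sum of all per-hop delays.
    return sum(delays_minutes)

n = 10
constant = [1] * n                         # 1, 1, 1, ... minutes
exponential = [2 ** k for k in range(n)]   # 1, 2, 4, ... 512 minutes

print(max_step_back(constant))      # 10 minutes
print(max_step_back(exponential))   # 1023 minutes, roughly 17 hours
```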


p.14:

"Because 99.999% availability provides for less than 5 minutes of downtime per year, every operation performed to the system in production will need to be automated and tested with the same level of care. With a 5 minute per year budget, human judgement and action is completely off the table for failure recovery"

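For reference, a minimal sketch of the downtime budgets behind those figures (the 99.999% target is from the quote above; the other rows are for comparison):

```python
# Yearly downtime budget implied by an availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for availability in (0.999, 0.9999, 0.99999):
    budget = MINUTES_PER_YEAR * (1 - availability)
    print(f"{availability:.3%} -> {budget:.1f} minutes of downtime per year")

# 99.900% -> 525.6 minutes of downtime per year
# 99.990% -> 52.6 minutes of downtime per year
# 99.999% -> 5.3 minutes of downtime per year (the roughly-five-minute budget in the quote)
```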

That just gets you back to automated systems, for example, causing their own failure, then responding to it by causing more failure.

For example, a relatively common occurrence is a BGP link getting saturated. Get that situation bad enough and the BGP session will go down, which will redirect all that traffic to another link with another BGP session, which then proceeds to go down. Meanwhile the original session comes back up and ... And then the failures synchronize and cause a third link to go down (each time taking more traffic with it and therefore causing failures faster).
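
A toy sketch of that cascade dynamic (hypothetical capacities and traffic figures, not a model of real BGP behaviour): each time a saturated link drops its session, its traffic piles onto the surviving links, which pushes the next one over the edge.

```python
# Toy cascade: links drop when overloaded, and their traffic is
# redistributed evenly across the links that are still up.
# Numbers are illustrative only.

links = {"link-a": 10.0, "link-b": 10.0, "link-c": 10.0}   # capacity in Gbps
traffic = {"link-a": 12.0, "link-b": 7.0, "link-c": 6.0}   # offered load in Gbps

up = set(links)
changed = True
while changed:
    changed = False
    for name in sorted(up):
        if traffic[name] > links[name]:
            # Session goes down; shed this link's traffic onto the survivors.
            up.discard(name)
            survivors = sorted(up)
            print(f"{name} saturated ({traffic[name]:.1f} > {links[name]:.1f} Gbps), session drops")
            if survivors:
                share = traffic[name] / len(survivors)
                for other in survivors:
                    traffic[other] += share
            traffic[name] = 0.0
            changed = True
            break

print("links still up:", sorted(up))  # with these numbers: none
```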

The second issue discussed is that the math in the statistics only works if the failures never synchronize. That's true for a lot of statistical analyses and mostly people ... just don't care. Yes that makes those analyses wrong. But we don't have a better way of doing those analyses.


Out-of-model risks are what usually kill you.


This is a tautology, because if the risk was successfully modelled, it would not be able to kill you.


What’s incorrect about accepting a risk that could kill you, such as flying on airplanes or becoming a soldier? Another example is going all-in during a poker game: you’re simply saying you’re OK with that severe risk for the corresponding upsides.


My point is that stating “out-of-model risks are what kill you” is a tautology. For systems where known risks are modelled, it’s highly likely that any “fatality” or equivalent events are controlled for. If you were to then look at the “fatalities” that occur after that, I would expect them to arise from risks not currently known or modelled in the system.

I make no comment around acceptance of risk, “fatal” or otherwise. :)


Not necessarily, a risk can be part of the model but not accurately quantified. Having a model doesn't mean it's perfect.


Risk not included in your model is still a risk


Couldn't agree with you more. Specifically, certain (poorly thought-out) failure-handling mechanisms actually end up transmitting the very condition that triggered the first failure to other parts of the system - thus creating cascading failures.




This is not true. AWS provides multiple constructs to help isolate failures, such as Availability Zones that provide separate locations, and Regions that are designed to be completely isolated from each other. You can also use different AWS Accounts to provide logical separation of your different workloads and limit management access.


This is total submarine marketing by AWS. I'm sure some very smart people with good intentions at AWS produced this, but the fact is they're not independent academics providing objectively "the best" practices; they're a company trying to sell you something. They probably sincerely believe the AWS way of doing things is best, but frankly, I know as much as you do, or at least if you want to win me over you need to comprehensively prove it, not just preach from the hill.

TL;DR: stop telling me you're doing better architecture in the name of selling me something.


I don't look at the well-architected framework as "the AWS way is the best way of doing things" but rather "this is the best way of doing things on AWS".


> "... For example, if a system makes use of two independent components, each with an availability of 99.9%, the resulting system availability is >99.999% ... "

This does not seem correct.


Independent and redundant components is the missing context from which the quote was taken; with that context, the quote is correct.


What they probably mean is that if each component can fail independently with a probability of 0.001 (0.1%) then the probability of both of them failing is 0.001 * 0.001 = 0.000001 (0.0001%)

If the system only needs one of the components to be working, then availability is 1 - 0.000001 = 0.999999 (99.9999%)
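
A minimal sketch of both compositions, using the 99.9% figure from the thread: redundant (parallel) components multiply their failure probabilities, while hard (serial) dependencies multiply their availabilities.

```python
# Composing availabilities under the independence assumption.

def parallel(*avail):
    # System works if at least one component works (redundant components).
    p_all_fail = 1.0
    for a in avail:
        p_all_fail *= (1.0 - a)
    return 1.0 - p_all_fail

def serial(*avail):
    # System works only if every component works (hard dependencies).
    p = 1.0
    for a in avail:
        p *= a
    return p

print(f"{parallel(0.999, 0.999):.6%}")  # 99.999900% -> the ">99.999%" in the quote
print(f"{serial(0.999, 0.999):.4%}")    # 99.8001%   -> the serial figure that comes up later in the thread
```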


This is correct; however, in reality completely independent components are very rare. Even things that seem independent and truly redundant, e.g. the jet engines of an airliner, are much more likely to fail after one of them fails. Therefore this line of reasoning must be applied with extreme care.


Indeed, contaminated fuel would do it.


I would not model the risk on fuel as part of the risk on jet engines.


Other correlated risks include weather, birds, thrust, age, time since last maintenance...

In fact, I'm having trouble coming up with any common cause of engine failure which isn't correlated.


They do put extensive precautions in place for over-water dual-engine flights: https://en.wikipedia.org/wiki/ETOPS

“Avoidance of multiple similar systems maintenance. Maintenance practices for the multiple similar systems requirement were designed to eliminate the possibility of introducing problems into both systems of a dual installation (e.g., engines and fuel systems) that could ultimately result in failure of both systems. The basic philosophy is that two similar systems should not be maintained or repaired during the same maintenance visit. Some operators may find this difficult to implement because all maintenance must be done at their home base.”

http://www.boeing.com/commercial/aeromagazine/aero_07/etops....


A manufacturing defect, either material or user error, maybe?


Many manufacturing defects affect whole batches of units.

People serious about preserving data with redundant arrays tend to be careful to avoid using multiple drives from the same batch.

I vaguely recall a cloud backup provider losing customer data because they failed to do this. Annoyingly, I can't find it on Google.


Even in this case, after a failure of one engine, the other engine(s) are set to a higher thrust, which increases the likelihood of their failure.


Unless someone can explain or argue that a system with such a high degree of availability (say >= 99.9%) has other properties beyond the merely statistical ones, the resulting availability should be 99.8001%.


I suppose it depends on whether the components are needed in serial or in parallel.


Interesting that security is one of the five pillars and hardly gets a mention in this paper. Who wrote this, and why?


It was written by the AWS Well-Architected team, and is based on curating best practices from Solutions Architects working with customers and engineering teams. It's one of six main papers: the main framework plus one per pillar. https://aws.amazon.com/architecture/well-architected/

There is also a free tool available to review architectures. https://aws.amazon.com/well-architected-tool/

The aim is to help customers learn best practices for building and operating in the cloud. (I work in the AWS Well-Architected team)



Thank you




Of course AWS wants you to use them. They would never market for a competitor; that just makes zero sense.

In terms of reliability, AWS and GCP have solid track records for multi-region availability (I can't say the same about Azure... they seem to be years behind still). Cost absolutely goes up when you're building out regional fault tolerance, but that's true whether you use a mixture of cloud providers or any specific provider. If you need high availability, you need to waste some level of resources (and in turn, money); it's part of the design.

I'd argue almost anyone would be fine running out of a single AWS region. There are specific cases where extreme availability matters, but in that case you're making the tradeoff of cost and simplicity for reliability.

It would be interesting to see something like a study on whether using multiple regions in one cloud provider makes more sense over having your load spread across providers in terms of complexity versus fault tolerance.


This is not the intent at all. I work in the AWS Well-Architected team, and we are engineers trying to help other engineers. If you follow the best practices you will spend less (there are performance and cost optimization pillars).




I think the way forward-looking CFOs approach this (in a way that keeps development velocity high without the 20 layers of paperwork needed to get CFO approval under the old system) is as follows.

When a new feature is coming online, the team sits with the cost/accounting side and says: we expect our budget needs to go up by $X/month.

Budgets for a given tag/project/account, either via a third party or now via AWS Budgets, are prepared daily.

If the daily/hourly rate exceeds expectations, inquiries are made as to why, with discipline if needed (rare).
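
For what it's worth, that kind of per-tag guardrail can be wired up with AWS Budgets. A rough boto3 sketch, assuming workloads carry a `project` cost-allocation tag; the account ID, amounts, and addresses are placeholders, and the exact cost-filter syntax should be checked against the current Budgets API documentation:

```python
# Rough sketch: a daily cost budget scoped to one project tag, with an
# email alert when actual spend crosses 80% of the expected daily rate.
# All identifiers and amounts are placeholders.
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",  # placeholder account id
    Budget={
        "BudgetName": "project-x-daily",
        "BudgetLimit": {"Amount": "100", "Unit": "USD"},  # expected daily rate
        "CostFilters": {"TagKeyValue": ["user:project$project-x"]},
        "TimeUnit": "DAILY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "finance@example.com"}
            ],
        }
    ],
)
```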

Many CFOs are so happy to be out of the CapEx game and out of the Oracle audit game that they will probably put up with a fair bit of tradeoff there.

Those same CFOs have had a lot of trouble when a project doesn't work out under the old approach. Now they just spin everything down in AWS over a few hours.

On the government side in particular, the datacenter buildout costs are CRAZY and the utilization is often terrible.


I think this really hits it on the nose. CapEx is “cheap” in the long run but often insanely difficult. It only makes sense if you have the scale (1) to control variance and (2) to amortize planning costs, and (3) it provides a competitive advantage. Meanwhile, sysadmins with no knowledge of accounting will complain that on-prem is cheaper. This is why cloud wins.


Thanks for the clarification. I agree we can do more around cost; we do offer advice here that should help: https://d1.awsstatic.com/whitepapers/architecture/AWS-Cost-O...

As an engineer I would prefer to see a workload work well on a single vendor before adding the complexity of running it across multiple vendors. Complexity is rarely your friend in reliability or security.


AWS has not had a global outage in a very long time (I think there was one S3 outage that was global because at the time there were only two regions).

The overlap between services that need very high availability and services with such low traffic that distributing resources across two or three AZs is a waste of money should be pretty small.

Multi-cloud for an individual service will never make sense, as all the providers know how to price network transit to make this infeasible.


Yep. Amazon (GCP, Azure) are making bank off the idea that you can just pay opex to run in multiple regions and lay off your SREs, but if you really wanted high availability (and not just some service credits when things go wrong) you would spend the engineering effort to go multi-provider. At that point on-prem looks better, but at least with multi-provider you’re no longer in the business of ordering hardware and power.

The idea that failures between different systems are uncorrelated doesn’t even work at a hardware level (e.g. two different sticks of RAM); it’s pure fantasy when you’re talking about software stacks, especially when you have things like the blow-up-the-world button called “BGP.” I’m deeply uncomfortable reading formulas for availability that add or subtract nines; if there’s real money on the line, you can afford to do some better math.

Disclosure: Work at a cloud provider, opinions are my own.


I can't speak for your employer, but at AWS this is not our intent. Our job is to help customers build better architectures, and if you read the framework you will see it's vendor-agnostic.

You can't approach reliability with a "the sky is falling, so there's no point" attitude. You are actually going to have to think about component failure, blast radius, and what happens after a failure.

Rather than spreading FUD, why not help customers by showing them the better math?


> I can't speak for your employer, but at AWS this is not our intent.

Lost the antecedent here. What is not your intent?

Granted my understanding of cloud economics is somewhat murky. My general impression is that the big selling point of cloud is the move from capex to opex, and the second selling point is that you don’t need the same level of operations expertise to run cloud compared to on-prem. My hot take is that for high reliability you still need tons of operations expertise, and in these scenarios, solutions like (partial) on-prem and multi-provider become much more favorable.

> Rather than spreading FUD, why not help customers by showing them the better math?

I feel like I’m being accused of spreading FUD, and I want to know why? All I am really trying to say here is that you can’t just do simple arithmetic on published #nines and end up with something that approximates the truth for your service in a useful way. Depending on the operation of your business this approximation may be acceptable or it may not be.

Just to recap, the bad math is to come up with some threshold for acceptable performance in all components, model each as an independent Bernoulli variable, and then plug them into some boolean formula. This is the math published in the guide here, and you can do it with arithmetic on #nines. The reason this is bad is that it leads you toward a shallow understanding of your system and creates pressure to inflate estimates of availability beyond actual availability, sometimes by a lot.

Unfortunately, if you are really interested in calculating availability you need to come up with a model for your particular system. This is a complex subject that involves striking some balance between accuracy, simplicity (so it can be understood and used to inform strategy), usefulness (to customers / downstream), and supportability (to engineers working on the service). I can't tell you how to do that; my best guess is to hire someone who knows enough statistics to be dangerous and lock them in a room for three months with a terminal and access to your metrics.

(As a side note, the tools for this kind of analysis are much better for systems like electronic circuits. For example, a part might be labeled as "1% tolerance" but when you are running simulations you use a probability distribution.)

When you actually do this, you often find out that your system is much less available (reliable, durable) than it was designed to be. It’s common that your cloud provider could stay within SLO and your service would still be “down” as far as your definition went. So then you have the engineering problem of figuring out how to improve things, which may involve multi-region, multi-cloud, on-prem, redesigning parts of your system, changing utilization targets, etc. The model helps because it can reveal key insights like “if you improve latency in this part of the system, it improves availability in this other part of the system”, so you can decide where to spend engineering resources.

I can guarantee you that the folks who are in charge of, say, EC2 are not just doing math with #nines of the systems underlying EC2. They are measuring and modeling. If I really wanted to figure out how to calculate availability of my systems, I would want to read case studies of real-world systems.
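
To make the "better math" point concrete, a minimal Monte Carlo sketch (illustrative numbers only): two notionally redundant components, plus a shared failure mode - an operator error, a bad deploy, a network event - that takes both out at once. Even a small correlated term dominates the naive independent-failure result.

```python
# Monte Carlo sketch of two redundant components with a shared failure mode.
# Probabilities are illustrative only.
import random

random.seed(42)

P_INDEPENDENT = 0.001   # each component's own failure probability per interval
P_SHARED = 0.0005       # probability of an event that takes out both at once

TRIALS = 1_000_000
down = 0
for _ in range(TRIALS):
    shared = random.random() < P_SHARED
    a_up = not shared and random.random() >= P_INDEPENDENT
    b_up = not shared and random.random() >= P_INDEPENDENT
    if not (a_up or b_up):
        down += 1

print(f"naive independent model:  {1 - P_INDEPENDENT ** 2:.6%}")  # ~99.999900%
print(f"with shared failure mode: {1 - down / TRIALS:.6%}")       # ~99.95%, dominated by P_SHARED
```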


On the intent question, you said:

> Yep. Amazon (GCP, Azure) are making bank off the idea that you can just pay opex to run in multiple regions and lay off your SREs, but if you really wanted high availability (and not just some service credits when things go wrong) you would spend the engineering effort to go multi-provider.

That is not AWS's intent.

And it's FUD to suggest that somehow cloud vendors want to make money out of reliability. For example, in AWS, a web app using a single Region with multiple AZs will give you higher availability, but at similar cost to a single-AZ approach.


> For example, in AWS, a web app using a single Region with multiple AZs will give you higher availability, but at similar cost to a single-AZ approach.

Whether your costs are similar or not depends on the amount of inter-AZ traffic, which is charged for.


> That is not AWS's intent.

To clarify, I’m not talking about intent or “wants” in any way.

(If you are speaking of AWS’s intent, do you have some kind of privileged knowledge of what AWS “wants” to do?)

> And it's FUD to suggest that somehow cloud vendors want to make money out of reliability.

Let’s forget about what AWS “wants”.

I think that AWS deserves to make money for making reliable services, and that it is right to pay them more money to get more reliability. This is not FUD, this is how business works. When I am working with a cloud provider to host my services, I am not trying to extract the most favorable terms possible—I want a deal where both parties are making money. If AWS is not making money from reliability, then that means, conversely, that I can’t buy it from them (more or less—this is a simplification).

And you can see that reliability is all over AWS’s marketing materials, because reliability is important to AWS’s customers.

> For example, in AWS, a web app using a single Region with multiple AZs will give you higher availability, but at similar cost to a single-AZ approach.

Reliability is more than just a collection of IaaS products wired together. The fact that two particular configurations have similar price points is not informative.

Single-region, multi-AZ is great for a large percentage of customers. When that’s not enough, you pay more money, and you should do some modeling of system availability because there are plenty of categories of errors that will now prevent you from reaching those higher availability targets.


> the idea that you can just pay opex to run in multiple regions and lay off your SREs,

> That is not AWS's intent.

Yet that's what all their produced literature describes.

> it's FUD to suggest that somehow cloud vendors want to make money out of reliability.

It's literally a selling point. Not sure how mentioning it is FUD.

It's FUD (and rightly so) to point out that your cross-platform solution is likely less production-ready and reliable than AWS (in total) as a single point of failure.


> Yet that's what all their produced literature describes.

No, it's not. I've read it. The Well Architected stuff is actually really good about not being AWS-only. These are generally good principles to keep in mind regardless of which provider you go with. And that's how it's written, and that's how AWS teaches it in person (I've actually been through their Well Architected course). Yes, they use AWS tooling to teach it, but nothing about it requires AWS.

So, no. You are completely wrong.

> It's literally a selling point. Not sure how mentioning it is FUD.

Because AWS's pricing doesn't necessarily increase cost with increased reliability. Suggesting that you need to spend more money to increase reliability is FUD.

> It's FUD (and rightly so) to point out that your cross-platform solution is likely less production-ready and reliable than AWS (in total) as a single point of failure.

That's not what was pointed out. Rather, it was pointed out that you can get high reliability without resorting to multi-cloud and higher costs. For you to continue suggesting otherwise, you first have to start off by explaining why you think high availability can't be obtained using a single cloud provider.


> The Well Architected stuff

While that is the topic, that is not what I was referencing. "Produced literature" is a bit more expansive than that and casually accessible (e.g. https://media.amazonwebservices.com/architecturecenter/AWS_a...). If you're going to reply about how someone is wrong, please have the courtesy to digest the statements made. Lashing out with non sequiturs is not constructive when you could take a statement in good faith and consider that you misinterpreted it.


> Because AWS's pricing doesn't necessarily increase cost with increased reliability. Suggesting that you need to spend more money to increase reliability is FUD.

I do not understand this statement. In general, it costs more to increase the reliability of a system. This includes both infrastructure cost and engineering cost. I also do not understand why this could be FUD. I don’t have any fear, uncertainty, or doubt when I pay more for better services. This is normal and expected. Conversely this means that I can save money if I identify components of my system with lower demands for infrastructure reliability.


You appear to be speaking on behalf of Amazon Web Services but don’t mention this relationship or your company contact information in your HN profile.

Is this normal and allowed at AWS, or do you just not have a Social Media Policy?

You have said you work on the AWS Well-Architected team in your other comments, but it feels like strong statements about intent of a company should be attributable to an employee directly, and not an anonymous internet handle.

Others from AWS like @jeffbarr and @_msw_ seem to be much more transparent when making statements about Amazon.


I generally keep my work and personal profiles separate on social media. When I cross that division I disclose if I have some interest that would influence my statements. My work handle on Twitter is @WellArchitected, and LinkedIn is https://www.linkedin.com/in/philipfitzsimons/ - I'm not in the superstar league of @jeffbarr and @_msw_.


Something that often goes ignored is that in high availability situations like that, humans are the biggest risk by a huge margin. Computers (almost always) don't accidentally wipe out the prod database thinking it was beta; that's usually a human who made a simple mistake.

Dumping money into IaaS while neglecting to recognize that high availability requires extreme quality in both engineering and testing/QA is a super common mistake these days. It means slowing down, which most upper management doesn't want to do.

Things like executing DR plans regularly to test that they work are extremely costly to the business, and are one of those "IT" things that you hopefully never have to actually do in production. But without things like that, what's the point of 3x-ing the cost of your fleet and adding complexity (another great vector for failure) in the name of "availability"?

I think the answer often comes down to the fact that people are hard to hire. Computers are easy to provision.


Yes, humans are really important in these situations; that's why "Operational Excellence" is the first pillar of AWS Well-Architected - you can only get so far in terms of reliability and security if you don't consider people and process. https://d1.awsstatic.com/whitepapers/architecture/AWS-Operat... (I work on the AWS Well-Architected team)




Uh, don't you need to HIRE some SREs to get that multi-everything setup engineered correctly?


Depends on the complexity of your service. The question is also not about what is “true” (you need to hire SREs if you want high reliability) but what is believed. Beliefs are what drive sales. Cloud is so pervasive that customer beliefs about what cloud can/cannot do are different from both reality and from marketing materials.



