“DROP TABLE customer”
“Oh crap! Better failover to DR. Oh crap! It’s dropped there too?!”
Same with replicas delayed at 1, 2, 4, 8, etc. minutes: ten replicas let you step back up to roughly 17 hours.
Much more time to detect and correct a catastrophic error.
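A minimal sketch of the delayed-replica idea above (the function name and one-minute starting lag are my illustrative choices): with lags doubling from one minute, the n-th replica trails the primary by 2^(n-1) minutes, so a catastrophic statement like that DROP TABLE can be halted before it replays on the deeper replicas.

```python
# Illustrative sketch of delayed replicas with doubling lags.
# A bad statement reaches each replica only after its lag elapses,
# leaving a widening window in which to stop replay.

def replica_lags_minutes(n: int, first_lag: int = 1) -> list[int]:
    """Lag of each of n replicas, doubling from first_lag minutes."""
    return [first_lag * 2 ** i for i in range(n)]

lags = replica_lags_minutes(10)
print(lags)  # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
print(f"deepest replica: {lags[-1] / 60:.1f} hours behind")
```

Under these assumptions ten replicas reach about 8.5 hours; one more doubling gets to the ~17-hour mark mentioned above.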
"Because 99.999% availability provides for less than 5 minutes of downtime per year, every operation performed to the system in production will need to be automated and tested with the same level of care. With a 5 minute per year budget, human judgement and action is completely off the table for failure recovery"
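The quoted five-minute figure follows from straightforward arithmetic; a quick sanity check (function and constant names are mine):

```python
# Downtime budget implied by an availability target.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960

def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year allowed at a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR

for nines, a in (("three", 0.999), ("four", 0.9999), ("five", 0.99999)):
    print(f"{nines} nines: {downtime_budget_minutes(a):8.2f} min/year")
# Five nines works out to about 5.26 minutes per year.
```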
For example, a relatively common occurrence is a BGP link getting saturated. Get that situation bad enough and the BGP session will go down, which will redirect all that traffic to another link with another BGP session, which then proceeds to go down. Meanwhile the original session comes back up and ... And then the failures synchronize and cause a third link to go down (each time taking more traffic with it and therefore causing failures faster).
The second issue discussed is that the math in the statistics only works if the failures never synchronize. The same assumption underlies a lot of statistical analyses, and mostly people ... just don't care. Yes, that makes those analyses wrong. But we don't have a better way of doing them.
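The synchronization point can be made concrete with a small Monte Carlo sketch (the probabilities here are invented for illustration): add a common-cause event that takes both links down, and the joint failure rate dwarfs what the independence assumption predicts.

```python
import random

random.seed(0)  # deterministic for reproducibility

p = 0.001   # each link's independent failure probability (made up)
q = 0.0005  # chance of a shared event (e.g. a traffic shift) hitting both

TRIALS = 1_000_000
both_down = 0
for _ in range(TRIALS):
    shared = random.random() < q
    link_a_down = shared or random.random() < p
    link_b_down = shared or random.random() < p
    both_down += link_a_down and link_b_down

print(f"independence predicts: {p * p:.1e}")  # 1.0e-06
print(f"with common cause:     {both_down / TRIALS:.1e}")
```

The analytic value is q + (1 - q) * p^2, roughly 5e-4 here, about 500x the independent estimate, which is exactly why "failures never synchronize" is doing all the work in the nines arithmetic.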
I make no comment around acceptance of risk, “fatal” or otherwise. :)
TL;DR: stop telling me you're doing better architectures in the name of selling me something.
This does not seem correct.
If the system depends on just one of the components working, then availability is 1 - 0.000001 = 0.999999 (99.9999%).
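Spelled out as code (one plausible reading of the arithmetic: two components, each 99.9% available, in parallel; this assumes independent failures, which is the contested assumption elsewhere in the thread):

```python
def parallel_availability(component: float, n: int) -> float:
    """Availability when the system needs at least one of n identical,
    independently failing components to be up."""
    return 1 - (1 - component) ** n

# Two 99.9% components: 1 - 0.001**2
print(parallel_availability(0.999, 2))  # ~0.999999
```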
In fact, I'm having trouble coming up with any common cause of engine failure which isn't correlated.
“Avoidance of multiple similar systems maintenance.
Maintenance practices for the multiple similar systems requirement were designed to eliminate the possibility of introducing problems into both systems of a dual installation (e.g., engines and fuel systems) that could ultimately result in failure of both systems. The basic philosophy is that two similar systems should not be maintained or repaired during the same maintenance visit. Some operators may find this difficult to implement because all maintenance must be done at their home base.”
People serious about preserving data with redundant arrays tend to be careful to avoid using multiple drives from the same batch.
I vaguely recall a cloud backup provider losing customer data because they failed to do this. Annoyingly, I can't find it on Google.
There is also a free tool available to review architectures. https://aws.amazon.com/well-architected-tool/
The aim is to help customers learn best practices for building and operating in the cloud.
(I work in the AWS Well-Architected team)
In terms of reliability, AWS and GCP have solid multi-region track records (I can't say the same about Azure... they still seem to be years behind). Cost absolutely goes up when you're building out regional fault tolerance, but that's true whether you use a mixture of cloud providers or any single provider. If you need high availability, you need to waste some level of resources (and in turn, money); it's part of the design.
I'd argue almost anyone would be fine running out of a single AWS region. There are specific cases where extreme availability matters, but in those cases you're trading cost and simplicity for reliability.
It would be interesting to see something like a study on whether using multiple regions in one cloud provider makes more sense than spreading your load across providers, in terms of complexity versus fault tolerance.
New feature coming online, team sits with cost/acctg side and says, we expect our budget needs to go up by $X / month.
Either third party or now AWS budgets for a given tag/project/account are prepared daily.
If the daily / hourly rate > expected, inquiries as to why are made, discipline if needed (rare).
Many CFOs are so happy to be out of the CapEx game, and out of the Oracle audit game, that they will probably put up with a fair bit of a tradeoff there.
Those same CFOs have had a lot of trouble when a project doesn't work out under the old approach. Now they just spin down everything in AWS over a few hours.
Govt side in particular, the datacenter buildout costs are CRAZY and the utilization often terrible.
As an engineer I would prefer to see a workload work well on a single vendor before adding the complexity of running it across multiple vendors. Complexity is rarely your friend in Reliability or Security.
The Venn diagram between services that need very high availability, and yet have such low traffic that distributing resources across 2 or 3 AZs is a waste of money, should be pretty small.
Multi-cloud for an individual service will never make sense, as all the providers know how to price network transit to make it infeasible.
The idea that failures between different systems are uncorrelated doesn’t even work at a hardware level (e.g. two different sticks of RAM), and it’s pure fantasy when you’re talking about software stacks, especially when you have things like the blow-up-the-world button called “BGP.” I’m deeply uncomfortable reading formulas for availability that add or subtract nines; if there’s real money on the line, you can afford to do some better math.
Disclosure: Work at a cloud provider, opinions are my own.
You can't approach reliability with a "the sky is falling, so there's no point" attitude. You actually have to think about component failure, blast radius, and what happens after a failure.
rather than spreading FUD, why not help customers by showing them the better math?
Lost the antecedent, here. What is not your intent?
Granted my understanding of cloud economics is somewhat murky. My general impression is that the big selling point of cloud is the move from capex to opex, and the second selling point is that you don’t need the same level of operations expertise to run cloud compared to on-prem. My hot take is that for high reliability you still need tons of operations expertise, and in these scenarios, solutions like (partial) on-prem and multi-provider become much more favorable.
> rather than spreading FUD, why not help customers by showing them the better math?
I feel like I’m being accused of spreading FUD, and I want to know why. All I am really trying to say here is that you can’t just do simple arithmetic on published #nines and end up with something that approximates the truth for your service in a useful way. Depending on the operation of your business, this approximation may be acceptable or it may not be.
Just to recap, the bad math is to come up with some threshold for acceptable performance in all components, model each as an independent Bernoulli variable, and then plug them into some boolean formula. This is the math published in the guide here, and you can do it with arithmetic on #nines. The reason this is bad is that it leads you toward a shallow understanding of your system and creates pressure to inflate estimates of availability beyond actual availability, sometimes by a lot.
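For concreteness, here is the pattern being criticized, sketched as code (the component numbers are invented). It is easy to compute and, per the point above, easy to trust too much:

```python
def series(*avail: float) -> float:
    """AND: every component must be up. Assumes independence."""
    out = 1.0
    for a in avail:
        out *= a
    return out

def parallel(*avail: float) -> float:
    """OR: at least one component must be up. Assumes independence."""
    down = 1.0
    for a in avail:
        down *= (1 - a)
    return 1 - down

# A load balancer in front of two redundant app servers and a database:
lb, app, db = 0.9999, 0.999, 0.9995
system = series(lb, parallel(app, app), db)
print(f"{system:.6f}")  # ~0.999399
```

The formulas are fine as far as they go; the trouble is the independence assumption baked into every multiplication.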
Unfortunately, if you are really interested in calculating availability you need to come up with a model for your particular system. This is a complex subject that involves coming up with a model which strikes some balance between accuracy, simplicity (so it can be understood and used to inform strategy), usefulness (to customers / downstream), and supportability (to engineers working on the service). I can't tell you how to do that, my best guess is to hire someone who knows enough statistics to be dangerous and lock them in a room for three months with a terminal and access to your metrics.
(As a side note, the tools for this kind of analysis are much better for systems like electronic circuits. For example, a part might be labeled as "1% tolerance" but when you are running simulations you use a probability distribution.)
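A toy version of that circuit-style analysis (the component values and the uniform distribution are my choices, not from the comment above): treat each "1% tolerance" resistor as a distribution and Monte Carlo the output of a voltage divider.

```python
import random

random.seed(1)  # deterministic for reproducibility

def sample_resistor(nominal: float, tolerance: float = 0.01) -> float:
    # Uniform within the tolerance band; real flows also use normal
    # or measured distributions.
    return random.uniform(nominal * (1 - tolerance),
                          nominal * (1 + tolerance))

outputs = []
for _ in range(100_000):
    r1 = sample_resistor(10_000.0)
    r2 = sample_resistor(10_000.0)
    outputs.append(5.0 * r2 / (r1 + r2))  # divider from a 5 V rail

print(min(outputs), max(outputs))  # output spans roughly 2.475-2.525 V
```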
When you actually do this, you often find out that your system is much less available (reliable, durable) than it was designed to be. It’s common that your cloud provider could stay within SLO and your service would still be “down” as far as your definition went. So then you have the engineering problem of figuring out how to improve things, which may involve multi-region, multi-cloud, on-prem, redesigning parts of your system, changing utilization targets, etc. The model helps because it can reveal key insights like “if you improve latency in this part of the system, it improves availability in this other part of the system”, so you can decide where to spend engineering resources.
I can guarantee you that the folks who are in charge of, say, EC2 are not just doing math with #nines of the systems underlying EC2. They are measuring and modeling. If I really wanted to figure out how to calculate availability of my systems, I would want to read case studies of real-world systems.
> Yep. Amazon (GCP, Azure) are making bank off the idea that you can just pay opex to run in multiple regions and lay off your SREs, but if you really wanted high availability (and not just some service credits when things go wrong) you would spend the engineering effort to go multi-provider.
That is not AWS's intent.
And it's FUD to suggest that somehow cloud vendors want to make money out of reliability. For example, in AWS, a webapp with a single-Region approach using multiple AZs will give you higher availability, but similar costs to a single-AZ approach.
Whether your costs are similar or not depends on the amount of inter-AZ traffic, which is charged for.
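A back-of-the-envelope for that point (the per-GB rate below is an assumption for illustration, not a quoted price; check the current AWS price list): cross-AZ transfer is typically metered in both directions, so chatty architectures pay twice.

```python
# Rough cross-AZ transfer cost model. PER_GB_EACH_WAY is an assumed
# rate -- verify against current AWS pricing before relying on it.
PER_GB_EACH_WAY = 0.01  # USD/GB, assumed

def monthly_cross_az_cost(gb_per_month: float) -> float:
    """Transfer billed on both the sending and receiving side."""
    return gb_per_month * PER_GB_EACH_WAY * 2

for gb in (100, 10_000, 1_000_000):
    print(f"{gb:>9,} GB/month -> ${monthly_cross_az_cost(gb):>12,.2f}")
```

At low volumes this is noise next to instance costs; at high volumes it can dominate, which is when "similar costs" stops being true.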
To clarify, I’m not talking about intent or “wants” in any way.
(If you are speaking of AWS’s intent, do you have some kind of privileged knowledge of what AWS “wants” to do?)
> And it's FUD to suggest that somehow cloud vendors want to make money out of reliability.
Let’s forget about what AWS “wants”.
I think that AWS deserves to make money for making reliable services, and that it is right to pay them more money to get more reliability. This is not FUD, this is how business works. When I am working with a cloud provider to host my services, I am not trying to extract the most favorable terms possible—I want a deal where both parties are making money. If AWS is not making money from reliability, then that means, conversely, that I can’t buy it from them (more or less—this is a simplification).
And you can see that reliability is all over AWS’s marketing materials, because reliability is important to AWS’s customers.
> For example, in AWS, a webapp with a single Region approach using multiple AZs will give you higher availablility, but similar costs to a single AZ approach.
Reliability is more than just a collection of IaaS products wired together. The fact that two particular configurations have similar price points is not informative.
Single-region, multi-AZ is great for a large percentage of customers. When that’s not enough, you pay more money, and you should do some modeling of system availability because there are plenty of categories of errors that will now prevent you from reaching those higher availability targets.
> That is not AWS's intent.
Yet that's what all their produced literature describes.
> it's FUD to suggest that somehow cloud vendors want to make money out of reliability.
It's literally a selling point. Not sure how mentioning it is FUD.
It's FUD (and rightly so) to point out that your cross-platform solution is likely less production-ready and reliable than AWS (in total) as a single point of failure.
No, it's not. I've read it. The Well Architected stuff is actually really good about not being AWS-only. These are generally good principles to keep in mind regardless of which provider you go with. And that's how it's written, and that's how AWS teaches it in person (I've actually been through their Well Architected course). Yes, they use AWS tooling to teach it, but nothing about it requires AWS.
So, no. You are completely wrong.
> It's literally a selling point. Not sure how mentioning it is FUD.
Because AWS's pricing doesn't necessarily increase cost with increased reliability. Suggesting that you need to spend more money to increase reliability is FUD.
> It's FUD (and rightly so) to point out that your cross-platform solution is likely less production-ready and reliable than AWS (in total) as a single point of failure.
That's not what was pointed out. Rather, it was pointed out that you can get high reliability without resorting to multi-cloud and higher costs. For you to continue suggesting otherwise, you first have to start off by explaining why you think high availability can't be obtained using a single cloud provider.
While that is the topic, that is not what I was referencing. "Produced literature" is a bit more expansive than that and casually accessible (e.g. https://media.amazonwebservices.com/architecturecenter/AWS_a...). If you're going to reply about how someone is wrong, please have the courtesy to digest the statements made. Lashing out with non sequiturs is not constructive when you could take a statement in good faith and consider that you misinterpreted it.
I do not understand this statement. In general, it costs more to increase the reliability of a system. This includes both infrastructure cost and engineering cost. I also do not understand why this could be FUD. I don’t have any fear, uncertainty, or doubt when I pay more for better services. This is normal and expected. Conversely this means that I can save money if I identify components of my system with lower demands for infrastructure reliability.
Is this normal and allowed at AWS, or do you just not have a Social Media Policy?
You have said you work on the AWS Well-Architected team in your other comments, but it feels like strong statements about intent of a company should be attributable to an employee directly, and not an anonymous internet handle.
Others from AWS like @jeffbarr and @_msw_ seem to be much more transparent when making statements about Amazon.
Dumping money into IaaS while failing to recognize that high availability requires extreme quality in both engineering and testing/QA is a super common mistake these days. It means slowing down, which most upper management doesn't want to do.
Things like executing DR plans regularly to test that they work are extremely costly to the business, and are one of those "IT" things that you hopefully never have to actually do in production. But without them, what's the point of 3x-ing the cost of your fleet and adding complexity (another great vector for failure) for "availability"?
I think the answer often comes down to the fact that people are hard to hire. Computers are easy to provision.