If you haven't seen it yet, the news is that it was a power loss:
> 5:01 AM PST We can confirm a loss of power within a single data center within a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. This is affecting availability and connectivity to EC2 instances that are part of the affected data center within the affected Availability Zone. We are also experiencing elevated RunInstance API error rates for launches within the affected Availability Zone. Connectivity and power to other data centers within the affected Availability Zone, or other Availability Zones within the US-EAST-1 Region are not affected by this issue, but we would recommend failing away from the affected Availability Zone (USE1-AZ4) if you are able to do so. We continue to work to address the issue and restore power within the affected data center.
This is quite interesting, as they claim their datacenter design does better than the Uptime Institute's Tier 3+ design requirements, which require redundant power supply paths [https://aws.amazon.com/compliance/uptimeinstitute/]. I really hope they publish a thorough RCA for this incident.
"Electrical power systems are designed to be fully redundant so that in the event of a disruption, uninterruptible power supply units can be engaged for certain functions, while generators can provide backup power for the entire facility."https://aws.amazon.com/compliance/data-center/infrastructure...
So they have two different sources of power coming in, plus generators. They do mention the UPS is only for "certain functions", so I guess it's not enough to handle full load while the generators spin up if both primaries go out. Or perhaps there was a failure in the source-switching equipment (typically called a "static transfer switch").
Usually when someone claims T3+ they mean they have UPS clusters in a 3+1 (or similar) configuration, with two different UPS clusters powering the two power strips in a rack. They would also have incoming grid power from two different HV substations with non-intersecting cable paths, plus diesel power generators in 3+1 or 5+2 configurations with automatic startup times measured in seconds. The UPS's energy storage (chemical or potential-energy based devices) can hold enough energy to handle full load for several minutes. If these are designed and maintained correctly, then even while concurrent scheduled maintenance is ongoing, an unexpected component failure should not cause a catastrophic outage.
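To put rough, purely illustrative numbers on that (nothing here comes from AWS; the load, UPS capacity, and generator start time are all assumptions): the UPS only has to bridge the generator start window, ideally with plenty of margin.

```python
# Back-of-envelope UPS hold-up vs. generator start time (all numbers assumed).
it_load_kw = 1500        # assumed critical IT load for one data hall, in kW
ups_energy_kwh = 250     # assumed usable energy in the UPS battery bank, in kWh

hold_up_minutes = ups_energy_kwh / it_load_kw * 60
print(f"UPS hold-up at full load: {hold_up_minutes:.1f} minutes")   # 10.0 minutes

generator_start_s = 30   # assumed time for the gensets to start and accept load
margin_s = hold_up_minutes * 60 - generator_start_s
print(f"Margin before the UPS runs dry: {margin_s:.0f} seconds")    # 570 seconds
```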
At each layer (grid incomers, generator incomers, UPS power incomers) there are switches to switch over whenever there's a need (maintenance or failure).
If they claim Tier 4, then they basically have everything in an N+N configuration.
Though that doesn't match very well with "uninterruptible power supply units can be engaged for certain functions". The wording seems chosen to convey that the UPS is limited in some way. An interesting old summary of their 2012 us-east-1 incident involving power/generators/UPS/switching: https://aws.amazon.com/message/67457/
The generators should be powering up as soon as one of the two sources goes down. It takes generators a few minutes to power up and get "warmed up". If they don't start this process until both mains sources are down, then oops, there's a power outage.
I used to work next door to a "major" cable TV station's broadcast location. They had multiple generators on-site, and one of them was running 24/7 (they rotated which one was hot). A major power outage hit, and there was a thunderous roar as all of the generators fired up. The channel never went off the air.
There are setups where the UPS is designed to last long enough for generator spin up as well. I believe it's the most common setup if you have both. I assume spinning up the generators for very short-lived line power blips might be undesirable.
I was in a Bell Labs facility that had notoriously bad power. We occupied the building before the second main feed had been fully approved by the state and run to the building.
Our main computer lab had a serial UPS that was online 100% of the time, though the inverters were under a very light load. If the mains even acted 'weird' (dips, bad power factor, spikes) the UPS jumped full on, and didn't revert to main power until the main was stable for some duration of time. The UPS was able to carry the full lab (which was quite large) for about two hours, allowing plenty of time for the generator to fire up.
The UPS ran a lot, and because the main was 'weird' the outages were often short; the generator wouldn't even start during the first ten minutes of UPS coverage. Of course, the rest of the building would be dark, other than emergency lighting.
I was an embedded firmware engineer, and our development lab was directly on the wall behind the UPS. When it fired into 100% mode, it roared, mostly from cooling. It was sort of a heads up that the power was likely to fail soon.
Are you sure about the few minutes part? The standby generators I've seen take seconds to go from off to full load. We have an 80 kW model, but I've also seen videos of load tests of much larger generators and they also take only seconds to go to full load.
It might depend on when the backup system was built. No company updates their system every year.
A few minutes seems correct for one place I worked.
This was back in the 90's, before UPS technology got really interesting. Our system was two large rooms with racks and racks and racks of car batteries wired together. When the power went out, the batteries took over until the diesel generator could come online.
I saw it work during several hurricanes and other flood events.
I always found the idea of running an entire building off of car batteries amusing. The engineers didn't share my mirth.
Was a generator technician before I got into programming. Even the 2 megawatt systems could start up and take full load in 10-20 seconds. It sounds basically like starting your car with your foot on the gas.
The "when" shouldn't really matter- Diesel engines aren't a new thing. Warming them up isn't really a thing either- they'll have electric warmers hooked up to the building power to keep them ready to go.
Lead acid batteries in that form factor were the staple for many UPS systems, and the thing most people didn't really appreciate was how expensive they were to maintain. If you didn't do regular maintenance, you'd find out that one of the cells in one of the batts was dead causing the whole thing to be unable to deliver power at precisely the worst time. Financially strapped companies cut maintenance contracts at the first sign of trouble.
Edit to add: I was at a place that took over a company that had one of these. With all of the dead batteries, it was just a really really large inverter taking the 3-phase AC to DC back to AC with a really nice and clean sine wave.
Lead acid batteries are still industry standard in many applications where you are OK with doing regular maintenance and you just need them to work, full stop. I think you'd be surprised how much of your power generation infrastructure, for example, has a 125VDC battery system for blackouts.
I think it depends on the type of generator. I know one datacenter I worked with had turbine generators that took a few minutes to get spun up. They were started and spun up by essentially a truck engine. Those generators were quite old, though.
Has datacenter power redundancy undergone any sort of revolution with grid storage becoming industrial scale?
I wonder if a lot of AWS DC design in this area predates the grid battery storage revolution, which (as I understand it) offers far faster switchover than a generator spin-up, possibly with software systems that detect failures and switch over quickly?
AWS can claim it will be best of breed, but they aren't going to throw out a DC power-redundancy investment (or risk downtime) while they can still wring more ROI out of it.
I'd be surprised. Data centers eat a lot of energy, and it's hard to beat the energy density of diesel (~45 MJ/kg vs ~1 MJ/kg for batteries) and the ability to have nearby tanks or scheduled trucks.
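A back-of-envelope comparison (with assumed round numbers, including a guessed generator efficiency) shows why: for a day of backup at 1 MW, the diesel weighs a few tonnes while an equivalent battery bank weighs over a hundred.

```python
# Rough mass comparison for 24 h of backup at 1 MW (illustrative numbers only).
load_mw = 1.0
hours = 24
energy_mj = load_mw * hours * 3600        # 1 MW for 1 h = 3600 MJ -> 86,400 MJ

diesel_mj_per_kg = 45      # approximate heating value of diesel fuel
genset_efficiency = 0.40   # assumed fuel-to-electricity efficiency of the genset
battery_mj_per_kg = 0.7    # roughly 200 Wh/kg for lithium-ion packs

diesel_kg = energy_mj / (diesel_mj_per_kg * genset_efficiency)
battery_kg = energy_mj / battery_mj_per_kg

print(f"Diesel needed: ~{diesel_kg / 1000:.1f} tonnes")    # ~4.8 tonnes
print(f"Batteries:     ~{battery_kg / 1000:.0f} tonnes")   # ~123 tonnes
```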
Likely the UPS can't run HVAC, and you are in an overheat condition in about two minutes with a fully loaded data center without cooling. Proportionately longer as load is reduced.
Amazon is claiming the failure is limited to a single AZ. Are you seeing failures for instances outside of that AZ? If not, how has this rendered "the entire region almost unusable"?
Yes, I've seen issues that affected the entire region. In my specific case, I happened to have an ElastiCache cluster in the affected AZ that became unreachable (my fault for single AZ). But even now, I'm unable to create any new ElastiCache clusters in different AZs (which I wanted to use for manual failover). And there were a lot of errors on the AWS console during the outage.
"almost unusable" is maybe exaggerating, but there were definitely issues affecting more than just the single AZ.
Probably because you aren’t the only one trying to do that. The folks who successfully fail over a zone are the ones who have already automated the process and are running active/active configurations so everything is set up and ready to go.
We've had alerts for packet loss and had issues in recovering region-spanning services (both AWS and 3rd party).
Yes, some of these we should be better at handling ourselves, but... it's all very well to say "expect to lose an AZ" but during this outage it's not been physically possible to remove the broken AZ instances from multi-AZ services because we cannot physically get them to respond to or acknowledge commands.
edit: just to short circuit any "well, why aren't you running redundant regions" - we run redundant regions at all times. But for reasons of latency, many customers will bind to their closest region, and the nature of our technology is highly location-bound. It is not possible for us to move active sessions to an alternate region. So something like this is... unpleasant.
> it's all very well to say "expect to lose an AZ" but during this outage it's not been physically possible to remove the broken AZ instances from multi-AZ services because we cannot physically get them to respond to or acknowledge commands
"Expect to lose an AZ" includes not being able to make any changes to existing instances in the affected AZ.
If you had instances across multiple AZs behind an ELB with health checks, then the ELB should automatically remove the affected instances.
If you have a different architecture, you would want to:
* Have another mechanism that automatically stops sending traffic to impaired instances (ideal), or
* Have a means to manually remove the instances from service without being able to interact with or modify those instances in any way (a minimal sketch follows below)
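For example, a minimal sketch of that manual-removal path using boto3, assuming the instances sit behind an ALB/NLB target group (the ARN and instance IDs below are placeholders):

```python
# Hypothetical sketch: pull impaired instances out of an ALB/NLB target group
# without touching the instances themselves (boto3 assumed to be installed).
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/example/abc123"  # placeholder
IMPAIRED_INSTANCE_IDS = ["i-0123456789abcdef0"]  # placeholder IDs in the broken AZ

# Deregistering is a control-plane call against the load balancer, so it works
# even when the instances themselves won't respond to commands.
elbv2.deregister_targets(
    TargetGroupArn=TARGET_GROUP_ARN,
    Targets=[{"Id": instance_id} for instance_id in IMPAIRED_INSTANCE_IDS],
)
```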
Does that help, or have I misunderstood your problem?
A lot of people will automatically fail jobs over to other AZs. That often involves spinning up lots more EC2 instances and moving PBs of data. The end result is that capacity in the other AZs gets used up and networks fill to capacity, so even if those other zones are technically working, practically they aren't really usable.
While there may be more machines provisioned, many orgs run active setups for failover so they aren’t as affected. In terms of data transfer, it should already be there. Where would it come from? Certainly not the dead AZ.
Perspective, I would guess. Unless you spend a lot of time on retry/timeout/fail logic around AWS APIs, your app could be stuck/blocked in the RunInstances() API, for example.
So dumb question from someone who hasn't maintained large public infrastructure:
Isn't the whole point of availability zones that you deploy to more than one and support failing over if one fails?
I.e., why are we (consumers) hearing about this or being obviously impacted (e.g. Epic Games Store is very broken right now)? Is my assessment wrong, or are all these apps that are failing built wrong? Or something in between?
IME people rarely test and drill failovers; it's just a checkbox in a high-level plan. Maybe they have a todo item for it somewhere, but it never seems very important since AZ failures are usually quite rare. After ignoring the issue for a while it starts to seem risky to test, because you might get an outage from the bugs the test is likely to uncover.
Replying to myself: also, in this case people are reporting that the load-balancing service provided by AWS failed, so it doesn't necessarily help if your own stuff is tested and working.
> or are all these apps that are failing built wrong
Deploying to multiple places is more expensive, it's not wrong to choose not to, it's trading off reliability for cost.
It's also unclear to me how often things fail in a way that actually only affect one AZ, but I haven't seen any good statistics either way on that one.
As I understand it, for something like SQS, Lambda, etc., AWS should automatically tolerate an AZ going down; they're responsible for making the service highly available. For something like EC2, though, where a customer is just running a node on AWS, there's no automatic failover. It's a lot more complicated to replicate a running, stateful virtual machine and have it seamlessly fail over to a different host. So typically it's up to the developers to use EC2 in a way that makes it easy to relaunch the nodes in a different AZ.
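As an illustrative sketch of that "easy to relaunch" approach (the group name, launch template, and subnet IDs below are placeholders, not anything specific to this incident): an Auto Scaling group spread over subnets in several AZs will replace lost instances wherever capacity exists.

```python
# Hypothetical sketch: an Auto Scaling group spanning subnets in multiple AZs,
# so replacement instances can launch in a healthy AZ if one AZ goes down.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-fleet",          # placeholder name
    LaunchTemplate={
        "LaunchTemplateName": "web-template",  # placeholder launch template
        "Version": "$Latest",
    },
    MinSize=3,
    MaxSize=9,
    # One subnet per AZ; unhealthy instances are replaced wherever capacity exists.
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
)
```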
That's the theory but in practice very few companies bother because it's expensive, complicated and most workloads or customers can tolerate less than 100% uptime.
I thought I was multi-AZ, but something failed. I am mostly running EC2 + RDS, both across two availability zones. I will have to dig into the problem, but I think the issue is that my RDS setup is one writer instance and one reader instance, each in a different AZ. However, I guess there was nothing for it to fail over to since my other instance was the writer, so I guess I need to keep a third instance available, preferably in a third AZ?
You're supposed to build your app across multiple AZs, but I know a lot of companies that don't do this and shove everything in a single AZ. It's not just about deploying an instance there but about ensuring the consistency of data and state across the AZs.
This region in general is a clusterfuck. If companies by now do not have a disaster recovery and resiliency strategy in place, they are just shooting themselves in the foot.
In today's world of stitching together dozens of services, who each probably do the same thing, how is one to avoid a dependency on us-east-1? Add yet another bullet to the vendor questionnaire (ugh) about whether they are singly-homed / have a failover plan?
It's turtles all the way down, and underneath all the turtles is us-east-1.
We are being told that there are still issues in USE1-AZ4 and some of the instances are stuck in the wrong state as of 16:15 EST. There's no ETA for resolution.
Because US-East-1 is:
(1) topologically closer to certain customers than other regions (this applies to all regions for different customers),
(2) consistently in the first set of regions to get new features,
(3) usually in the lowest price tier for features whose pricing varies by region,
(4) where certain global (notionally region agnostic) services are effectively hosted and certain interactions with them in region-specific services need to be done.
#4 is a unique feature of US-East-1; #2 and #3 are factors in region selection that can also favor other regions (e.g., for users in the western US, US-West-2 beats US-West-1 on them), which is why some users topologically closer to US-West-1 favor US-West-2.
Most likely inertia. us-east-1 was the first AWS region, gets new features released there first, and is the largest in the USA, so many companies have been running there for many years, and the cost of moving to us-east-2 > the cost of occasional AWS-created downtime.
I think it's predicated on a misunderstanding of what "fail-safe" actually means.
For example, in railway signaling, drivers are trained to interpret a signal with no light as the most restrictive aspect (e.g. "danger"). That way, any failure of a bulb in a colored light signal, or a failure of the signal as a whole, results in a safe outcome (albeit that the train might be delayed while the driver calls up the signaler).
Or, in another example from the railways, the air brake system on a train is configured such that a loss of air pressure causes emergency brake activation.
Fail-safe doesn't mean "able to continue operation in the presence of failures"; it means "systematically safe in the presence of failure".
Systems which require "liveness" (e.g. fly-by-wire for a relaxed stability aircraft) need different safety mechanisms because failure of the control law is never safe.
> "systematically safe in the presence of failure".
And even then, you still need to define "safe". Imagine a lock powered by an electromagnet. What happens if you lose power?
The safety-first approach is almost always for the unpowered lock to default to the open state — allow people to escape in case of emergency.
Conversely, the security-first approach is to keep the door locked — nothing goes in or out until the situation is under control.
A more complex solution is to design the lock to be bistable. During operating hours when the door is unlocked, failure keeps it unlocked. Outside operating hours, when the door is set to locked, it stays locked.
The common factor with all these scenarios is that you have a failure mode (power outage), and a design for how the system ensures a reasonable outcome in the face of said failure.
Or nuclear reactors that fail safe by dropping all the control rods into the core to stop all activity. The reactor may be permanently ruined after that (with a cost of hundreds of millions or billions to revert) but there will be no risk of meltdown.
Sort of. A fail-safe reactor design can include things like:
* Negative temperature coefficient of reactivity: as temperature increases, the neutron flux is reduced, which both makes it more controllable, and tends to prevent runaway reactions.
* Negative void coefficient of reactivity: as voids (steam pockets) increase, the neutron flux is reduced.
* Control rods constructed solely of neutron absorber. The RBMK reactor (Chernobyl) in particular used graphite followers (tips), which _increased_ reactivity initially when being lowered.
It's also worth noting that nuclear reactors are designed to be operated within certain limits. The RBMK reactor would have been fine had it been operated as designed.
Source: was a nuclear reactor operator on a submarine.
I don't know enough about reactor control systems to be sure on that one. The idea of a fail-safe system is not that there's an easy way to shut them down, but more that the ways we expect the component parts of a system to fail result in the safe state.
e.g. consider a railway track circuit - this is the way that a signaling system knows whether a particular block of a track is occupied by a train or not. The wheels and axle are conductive so you can measure this electrically by determining whether there's a circuit between the rails or not.
The naive way to do this would be to say something like "OK, we'll apply a voltage to one rail, and if we see a current flowing between the rails we'll say the block is occupied." This is not fail-safe. If the rail has a small break, or power is interrupted, no current will flow, so the track always looks unoccupied even if there's a train.
The better way is to say "We'll apply a voltage to one rail, but we'll have the rails connected together in a circuit during normal operation. That will energize a relay which will cause the track to indicate clear. If a train is on the track, then we'll get a short circuit, which will cause the relay to de-energize, indicating the track is occupied."
If the power fails, it shows the track occupied because the relay opens. If the rail develops a crack, the circuit opens, again causing the relay to open and indicate the track is occupied. If the relay fails, then as long as it fails open (which is the predominant failure mode of relays) the track is also indicated as occupied.
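A toy model of that logic (purely illustrative, not real signaling code) shows how every failure mode collapses to the safe "occupied" indication:

```python
# Toy model of a fail-safe track circuit: the block only reads "clear" when
# power is present, the rail is intact, no train shorts the rails, and the
# relay actually works. Every failure mode collapses to "occupied".

def track_clear(power_ok: bool, rail_intact: bool, train_present: bool,
                relay_working: bool) -> bool:
    # Current reaches and energizes the relay only if there is power, an
    # unbroken rail, no train shorting the circuit, and a functioning relay.
    relay_energized = power_ok and rail_intact and not train_present and relay_working
    return relay_energized   # energized relay -> "clear"; anything else -> "occupied"

assert track_clear(True, True, False, True) is True    # normal, empty block
assert track_clear(True, True, True, True) is False    # train shorts the rails
assert track_clear(False, True, False, True) is False  # power failure -> occupied
assert track_clear(True, False, False, True) is False  # cracked rail -> occupied
assert track_clear(True, True, False, False) is False  # relay fails open -> occupied
```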
No. For example, train signalling, which controls whether a train can go onto a section of track, operates in a fail-safe manner: if something goes wrong, the signal fails into a safe "closed" state rather than an unsafe "open" state. This means trains are sometimes incorrectly told to stop even though technically the tracks are clear, rather than incorrectly told to go even though there is another train ahead.
"fail-safe" doesn't mean "doesn't fail", it means that the failure mode chooses false negatives or false positives (depending on the context) to be on the safe side.
Or are you asking whether it's a lesson about how real systems operate? Because yes, it's a very serious lesson about how real systems operate.
Anyway, you seem out of your depth on systems engineering. Your reply downthread isn't applicable (of course fail-safes can fail; anything can fail). If you want to learn more in this area (not everybody wants to, and that's OK), following that link to the systems theory books on the wiki may be a good idea. Or maybe start at the root:
"Notice that there is a huge amount of handwaving in system engineering. I don't think this is good, but I don't think it's avoidable either."
In my experience, you can be specific, but then you get the problem that people think that if they just 'what if' a narrow solution to the particular problem you're presenting they've invalidated the example, when the point was: 1. this is a representative problem, not this specific problem; 2. in real life you don't get a big arrow pointing at the exact problem; 3. in real life you don't have one of these problems, your entire system is made out of these problems, because you can't help but have them; and 4. availability bias: the fact that I'm pointing an arrow at this problem for demonstration purposes makes it very easy to see, but in real life you wouldn't have a guarantee that the problem you see is the most important one.
There's a certain mindset that can only be acquired through experience. Then you can talk systems engineering to other systems engineers and it makes sense. But prior to that it just sounds like people making excuses or telling silly stories or something.
"(of course fail-safes can fail, anything can fail)"
Another way to think of it is the correlation between failure. In principle, you want all your failures to be uncorrelated, so you can do analysis assuming they're all independent events, which means you can use high school statistics on them. Unfortunately, in real life there's a long tail (but a completely real tail) of correlation you can't get rid of. If nothing else, things are physically correlated by virtue of existing in the same physical location... if a server catches fire, you're going to experience all sorts of highly correlated failures in that location. And "just don't let things catch fire" isn't terribly practical, unfortunately.
Which reiterates the theme that in real life, you generally have very incomplete data to be operating on. I don't have a machine that I can take into my data center and point at my servers and get a "fire will start in this server in 89 hours" readout. I don't get a heads up that the world's largest DDOS is about to be fired at my system in ten minutes. I don't get a heads up that a catastrophic security vulnerability is about to come out in the largest logging library for the largest language and I'm going to have a never-before-seen random rolling restart on half the services in my company with who knows what consequences. All the little sample problems I can give in order to demonstrate systems engineering problems imply a degree of visibility you don't get in real life.
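To put a toy number on the correlation point above (every value here is invented for illustration): independence makes the arithmetic easy, and a shared cause quietly wrecks it.

```python
# Toy illustration (invented numbers): two "redundant" components that each
# fail 1% of the time. Independence predicts both fail 0.01% of the time; a
# common-mode event (the shared rack catching fire) dominates that estimate.
p_single = 0.01           # assumed per-component failure probability
p_both_independent = p_single ** 2                 # 0.0001 -> 1 in 10,000

p_common_cause = 0.005    # assumed chance of an event that takes out both at once
p_both_correlated = p_common_cause + (1 - p_common_cause) * p_both_independent

print(f"Both fail, independence assumed: {p_both_independent:.6f}")  # 0.000100
print(f"Both fail, with a common cause:  {p_both_correlated:.6f}")   # ~0.0051
```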
I mean it has to be a play on words or tongue in cheek, simply because the assumption of a fail-safe system failing is already contradictory. So you cannot say anything smart about that beyond: there are no fail-safe systems that fail.
Some datacenter failures aren't related to redundancy. Some examples: 1) a transfer switch failure where you can't switch over to backup generators and the UPS runs out, 2) someone accidentally hits the EPO (emergency power off), 3) maintenance work makes a mistake such as turning off the wrong circuits, 4) cooling doesn't switch over fully to backups, and while your systems have power, it's too hot to run. The list can go on and on.
I'm not sure why this is a big deal though; this is why Amazon has multiple AZs. If you're in one AZ, you take your chances.
It was not a total power loss. Out of 40 instances we had running at the time of the incident, only 5 appeared to be lost to the power outage. The bigger issue for us was that the EC2 API to stop/start these instances appeared to be unavailable (probably because the rack these instances were in had no power). The other issue that was impactful to us was that many of the remaining running instances in the zone had intermittent connectivity out to the internet. Additionally, the incident was made worse by many of our supporting vendors being impacted as well...
IMO it was handled rather well and fast by AWS... not saying we shouldn't beat them up (for a discount) but being honest this wasn't that bad.
If the rack your instances are running in is totally offline, then the EC2 API unfortunately can't talk to the dom0 and tell the instances to stop/start, so you get annoying "stuck instances" and really can't do anything until the rack is back online and able to respond to API calls.
Sometimes, you have a component which fails in such a way that your redundancies can't really help.
I once had to prepare for a total blackout scenario in a datacenter because there was a fault in the power supply system that required bypassing major systems to fix. Had some mistake or fault happened during those critical moments, all power would've been lost.
Well-designed redundancy makes high-impact incidents less likely, but you're not immune to Murphy's law.
To my mind, among the more frustrating aspects of implementing protection against failure is that the added mechanisms can themselves cause failure.
You need to pick your battles and choose what you want to protect against to mitigate risk and enable day-to-day operations.
For example, too often people will set up clustered databases and whatnot because "they need HA" without much thought about all the other potential effects of using a cluster, such as much more complicated recovery scenarios.
In the vast majority of cases, an active-passive replicated database with manual failover is likely to have fewer pitfalls and gives you the same operational HA a clustered database would, even though in the case of a (rare) real failure it would not automatically recover like a cluster might.
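As an illustration of how simple the manual-failover step can be (assuming, say, PostgreSQL streaming replication on version 12+; the connection details are placeholders and psycopg2 is assumed to be installed):

```python
# Hypothetical sketch: manually promoting a PostgreSQL streaming-replication
# standby as part of a failover runbook (psycopg2 and PostgreSQL 12+ assumed).
import psycopg2

# Connect to the standby that should take over (placeholder DSN).
conn = psycopg2.connect("host=standby.internal dbname=app user=admin")
conn.autocommit = True

with conn.cursor() as cur:
    # pg_promote() asks the standby to exit recovery and become the new primary.
    cur.execute("SELECT pg_promote(wait => true);")

# Application traffic is then repointed at the new primary (DNS, config, etc.),
# which is a deliberate human step rather than an automatic cluster election.
```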
The battery backups (called uninterruptible power supplies) are only meant to bridge the gap between the power going out and the generator turning on, which is a few minutes. Did they say power was the issue this time? I suspect it’s actually something else (ahem network)
Their datacenter(s) aren’t magic because they are AWS. That facility is probably a decade old and like anything else as it ages the technical and maintenance debt makes management more challenging.
“This is exactly the sort of design that lets me sleep like a baby,” said DeSantis. “And indeed, this new design is getting even better availability” – better than “seven nines” or 99.99999 percent uptime, DeSantis said.