If you haven't seen it yet, the news is that it was a power loss:
> 5:01 AM PST We can confirm a loss of power within a single data center within a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. This is affecting availability and connectivity to EC2 instances that are part of the affected data center within the affected Availability Zone. We are also experiencing elevated RunInstance API error rates for launches within the affected Availability Zone. Connectivity and power to other data centers within the affected Availability Zone, or other Availability Zones within the US-EAST-1 Region are not affected by this issue, but we would recommend failing away from the affected Availability Zone (USE1-AZ4) if you are able to do so. We continue to work to address the issue and restore power within the affected data center.
This is quite interesting, as they claim their datacenter design does better than the Uptime Institute's Tier 3+ design requirements, which require redundant power supply paths [https://aws.amazon.com/compliance/uptimeinstitute/]. I really hope they publish a thorough RCA for this incident.
"Electrical power systems are designed to be fully redundant so that in the event of a disruption, uninterruptible power supply units can be engaged for certain functions, while generators can provide backup power for the entire facility."https://aws.amazon.com/compliance/data-center/infrastructure...
So they have two different sources of power coming in, plus generators. They do mention the UPS is only for "certain functions", so I guess it's not enough to handle full load while the generators spin up if both primaries go out. Or perhaps there was a failure in the source-switching equipment (typically called a "static transfer switch").
Usually when someone claims T3+ they mean they have UPS clusters in a 3+1 (or similar) configuration, with two different UPS clusters powering the two power strips in a rack. They would also have incoming grid power from two different HV substations with non-intersecting cable paths, plus diesel power generators in 3+1 or 5+2 configurations with automatic startup times measured in seconds. The UPS's energy storage (chemical or potential-energy based devices) can hold enough energy to handle full load for several minutes. If these are designed and maintained correctly, then even while concurrent scheduled maintenance is ongoing, an unexpected component failure should not cause a catastrophic outage.
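To put rough, purely illustrative numbers on that (nothing here comes from AWS; the load, UPS capacity, and generator start time are all assumptions): the UPS only has to bridge the generator start window, ideally with plenty of margin.

```python
# Back-of-envelope UPS hold-up vs. generator start time (all numbers assumed).
it_load_kw = 1500        # assumed critical IT load for one data hall, in kW
ups_energy_kwh = 250     # assumed usable energy in the UPS battery bank, in kWh

hold_up_minutes = ups_energy_kwh / it_load_kw * 60
print(f"UPS hold-up at full load: {hold_up_minutes:.1f} minutes")   # 10.0 minutes

generator_start_s = 30   # assumed time for the gensets to start and accept load
margin_s = hold_up_minutes * 60 - generator_start_s
print(f"Margin before the UPS runs dry: {margin_s:.0f} seconds")    # 570 seconds
```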
At each layer (grid incomers, generator incomers, UPS power incomers) there are switches to switch over whenever there's a need (maintenance or failure).
If they claim Tier 4, then they basically have everything in an N+N configuration.
Though that doesn't match very well with "uninterruptible power supply units can be engaged for certain functions". The wording seems chosen to convey that the UPS is limited in some way. An interesting old summary of their 2012 us-east-1 incident involving power/generators/UPS/switching: https://aws.amazon.com/message/67457/
The generators should be powering up as soon as one of the two sources goes down. It takes generators a few minutes to power up and get "warmed up". If they don't start this process until both mains sources are down, then oops, there's a power outage.
I used to work next door to a "major" cable TV station's broadcast location. They had multiple generators on-site, and one of them was running 24/7 (they rotated which one was hot). A major power outage hit, and there was a thunderous roar as all of the generators fired up. The channel never went off the air.
There are setups where the UPS is designed to last long enough for generator spin up as well. I believe it's the most common setup if you have both. I assume spinning up the generators for very short-lived line power blips might be undesirable.
I was in a Bell Labs facility that had notoriously bad power. We occupied the building before the second main feed had been fully approved by the state and run to the building.
Our main computer lab had a serial UPS that was online 100% of the time, though the inverters were under a very light load. If the mains even acted 'weird' (dips, bad power factor, spikes) the UPS jumped full on, and didn't revert to main power until the main was stable for some duration of time. The UPS was able to carry the full lab (which was quite large) for about two hours, allowing plenty of time for the generator to fire up.
The UPS ran a lot, and because the main was 'weird' the outages were often short; the generator wouldn't even start during the first ten minutes of UPS coverage. Of course, the rest of the building would be dark, other than emergency lighting.
I was an embedded firmware engineer, and our development lab was directly on the wall behind the UPS. When it fired into 100% mode, it roared, mostly from cooling. It was sort of a heads up that the power was likely to fail soon.
Are you sure about the few minutes part? The standby generators I've seen take seconds to go from off to full load. We have an 80 kW model, but I've also seen videos of load tests of much larger generators and they also take only seconds to go to full load.
It might depend on when the backup system was built. No company updates their system every year.
A few minutes seems correct for one place I worked.
This was back in the 90's, before UPS technology got really interesting. Our system was two large rooms with racks and racks and racks of car batteries wired together. When the power went out, the batteries took over until the diesel generator could come online.
I saw it work during several hurricanes and other flood events.
I always found the idea of running an entire building off of car batteries amusing. The engineers didn't share my mirth.
Was a generator technician before I got into programming. Even the 2 megawatt systems could start up and take full load in 10-20 seconds. It sounds basically like starting your car with your foot on the gas.
The "when" shouldn't really matter- Diesel engines aren't a new thing. Warming them up isn't really a thing either- they'll have electric warmers hooked up to the building power to keep them ready to go.
Lead acid batteries in that form factor were the staple for many UPS systems, and the thing most people didn't really appreciate was how expensive they were to maintain. If you didn't do regular maintenance, you'd find out that one of the cells in one of the batts was dead causing the whole thing to be unable to deliver power at precisely the worst time. Financially strapped companies cut maintenance contracts at the first sign of trouble.
Edit to add: I was at a place that took over a company that had one of these. With all of the dead batteries, it was just a really really large inverter taking the 3-phase AC to DC back to AC with a really nice and clean sine wave.
Lead acid batteries are still industry standard in many applications where you are OK with doing regular maintenance and you just need them to work, full stop. I think you'd be surprised how much of your power generation infrastructure, for example, has a 125VDC battery system for blackouts.
I think it depends on the type of generator. I know one datacenter I worked with had turbine generators that took a few minutes to get spun up. They were started and spun up by essentially a truck engine. Those generators were quite old, though.
Has datacenter power redundancy undergone any sort of revolution with grid storage becoming industrial scale?
I wonder if a lot of AWS DC design in this area predates the grid battery storage revolution, which (as I understand it) offers far faster switchover than a generator spin-up, possibly with software systems that detect failures and switch over quickly?
AWS can claim it will be best of breed, but they aren't going to throw out a DC power-redundancy investment (or risk downtime) while they can still wring more ROI out of it.
I'd be surprised. Data centers eat a lot of energy, and it's hard to beat the energy density of diesel (~45 MJ/kg vs ~1 MJ/kg for batteries) and the ability to have nearby tanks or scheduled trucks.
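A back-of-envelope comparison (with assumed round numbers, including a guessed generator efficiency) shows why: for a day of backup at 1 MW, the diesel weighs a few tonnes while an equivalent battery bank weighs over a hundred.

```python
# Rough mass comparison for 24 h of backup at 1 MW (illustrative numbers only).
load_mw = 1.0
hours = 24
energy_mj = load_mw * hours * 3600        # 1 MW for 1 h = 3600 MJ -> 86,400 MJ

diesel_mj_per_kg = 45      # approximate heating value of diesel fuel
genset_efficiency = 0.40   # assumed fuel-to-electricity efficiency of the genset
battery_mj_per_kg = 0.7    # roughly 200 Wh/kg for lithium-ion packs

diesel_kg = energy_mj / (diesel_mj_per_kg * genset_efficiency)
battery_kg = energy_mj / battery_mj_per_kg

print(f"Diesel needed: ~{diesel_kg / 1000:.1f} tonnes")    # ~4.8 tonnes
print(f"Batteries:     ~{battery_kg / 1000:.0f} tonnes")   # ~123 tonnes
```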
Likely the UPS can't run HVAC, and you are in an overheat condition in about two minutes with a fully loaded data center without cooling. Proportionately longer as load is reduced.
Amazon is claiming the failure is limited to a single AZ. Are you seeing failures for instances outside of that AZ? If not, how has this rendered "the entire region almost unusable"?
Yes, I've seen issues that affected the entire region. In my specific case, I happened to have an ElastiCache cluster in the affected AZ that became unreachable (my fault for single AZ). But even now, I'm unable to create any new ElastiCache clusters in different AZs (which I wanted to use for manual failover). And there were a lot of errors on the AWS console during the outage.
"almost unusable" is maybe exaggerating, but there were definitely issues affecting more than just the single AZ.
Probably because you aren’t the only one trying to do that. The folks who successfully fail over a zone are the ones who have already automated the process and are running active/active configurations so everything is set up and ready to go.
We've had alerts for packet loss and had issues in recovering region-spanning services (both AWS and 3rd party).
Yes, some of these we should be better at handling ourselves, but... it's all very well to say "expect to lose an AZ" but during this outage it's not been physically possible to remove the broken AZ instances from multi-AZ services because we cannot physically get them to respond to or acknowledge commands.
edit: just to short circuit any "well, why aren't you running redundant regions" - we run redundant regions at all times. But for reasons of latency, many customers will bind to their closest region, and the nature of our technology is highly location-bound. It is not possible for us to move active sessions to an alternate region. So something like this is... unpleasant.
> it's all very well to say "expect to lose an AZ" but during this outage it's not been physically possible to remove the broken AZ instances from multi-AZ services because we cannot physically get them to respond to or acknowledge commands
"Expect to lose an AZ" includes not being able to make any changes to existing instances in the affected AZ.
If you had instances across multiple AZs behind an ELB with health checks, then the ELB should automatically remove the affected instances.
If you have a different architecture, you would want to:
* Have another mechanism that automatically stops sending traffic to impaired instances (ideal), or
* Have a means to manually remove the instances from service without being able to interact with or modify those instances in any way (a minimal sketch follows below)
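For example, a minimal sketch of that manual-removal path using boto3, assuming the instances sit behind an ALB/NLB target group (the ARN and instance IDs below are placeholders):

```python
# Hypothetical sketch: pull impaired instances out of an ALB/NLB target group
# without touching the instances themselves (boto3 assumed to be installed).
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/example/abc123"  # placeholder
IMPAIRED_INSTANCE_IDS = ["i-0123456789abcdef0"]  # placeholder IDs in the broken AZ

# Deregistering is a control-plane call against the load balancer, so it works
# even when the instances themselves won't respond to commands.
elbv2.deregister_targets(
    TargetGroupArn=TARGET_GROUP_ARN,
    Targets=[{"Id": instance_id} for instance_id in IMPAIRED_INSTANCE_IDS],
)
```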
Does that help, or have I misunderstood your problem?
A lot of people will automatically fail jobs over to other AZs. That often involves spinning up lots more EC2 instances and moving PBs of data. The end result is that capacity in the other AZs gets used up and networks fill to capacity, so even if those other zones are technically working, practically they aren't really usable.
While there may be more machines provisioned, many orgs run active setups for failover so they aren’t as affected. In terms of data transfer, it should already be there. Where would it come from? Certainly not the dead AZ.
Perspective, I would guess. Unless you spend a lot of time on retry/timeout/fail logic around AWS APIs, your app could be stuck/blocked in the RunInstances() API, for example.
So dumb question from someone who hasn't maintained large public infrastructure:
Isn't the whole point of availability zones that you deploy to more than one and support failing over if one fails?
I.e., why are we (consumers) hearing about this or being obviously impacted (e.g. Epic Games Store is very broken right now)? Is my assessment wrong, or are all these apps that are failing built wrong? Or something in between?
IME people rarely test and drill failovers; it's just a checkbox in a high-level plan. Maybe they have a todo item for it somewhere, but it never seems very important since AZ failures are usually quite rare. After ignoring the issue for a while it starts to seem risky to test, because you might get an outage from the bugs the test is likely to uncover.
Replying to myself: also, in this case people are reporting that the load-balancing service provided by AWS failed, so it doesn't necessarily help if your own stuff is tested and working.
> or are all these apps that are failing built wrong
Deploying to multiple places is more expensive, it's not wrong to choose not to, it's trading off reliability for cost.
It's also unclear to me how often things fail in a way that actually only affect one AZ, but I haven't seen any good statistics either way on that one.
As I understand it, for something like SQS, Lambda, etc., AWS should automatically tolerate an AZ going down; they're responsible for making the service highly available. For something like EC2, though, where a customer is just running a node on AWS, there's no automatic failover. It's a lot more complicated to replicate a running, stateful virtual machine and have it seamlessly fail over to a different host. So typically it's up to the developers to use EC2 in a way that makes it easy to relaunch the nodes in a different AZ.
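As an illustrative sketch of that "easy to relaunch" approach (the group name, launch template, and subnet IDs below are placeholders, not anything specific to this incident): an Auto Scaling group spread over subnets in several AZs will replace lost instances wherever capacity exists.

```python
# Hypothetical sketch: an Auto Scaling group spanning subnets in multiple AZs,
# so replacement instances can launch in a healthy AZ if one AZ goes down.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-fleet",          # placeholder name
    LaunchTemplate={
        "LaunchTemplateName": "web-template",  # placeholder launch template
        "Version": "$Latest",
    },
    MinSize=3,
    MaxSize=9,
    # One subnet per AZ; unhealthy instances are replaced wherever capacity exists.
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
)
```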
That's the theory but in practice very few companies bother because it's expensive, complicated and most workloads or customers can tolerate less than 100% uptime.
I thought I was multi-AZ, but something failed. I am mostly running EC2 + RDS, both across two availability zones. I will have to dig into the problem, but I think the issue is that my RDS setup is one writer instance and one reader instance, each in a different AZ. However, I guess there was nothing for it to fail over to since my other instance was the writer, so I guess I need to keep a third instance available, preferably in a third AZ?
You're supposed to build your app across multiple AZs, but I know a lot of companies that don't do this and shove everything in a single AZ. It's not just about deploying an instance there but about ensuring the consistency of data and state across the AZs.
This region in general is a clusterfuck. If companies by now do not have a disaster recovery and resiliency strategy in place, they are just shooting themselves in the foot.
In today's world of stitching together dozens of services, who each probably do the same thing, how is one to avoid a dependency on us-east-1? Add yet another bullet to the vendor questionnaire (ugh) about whether they are singly-homed / have a failover plan?
It's turtles all the way down, and underneath all the turtles is us-east-1.
We are being told that there are still issues in USE1-AZ4 and some of the instances are stuck in the wrong state as of 16:15 EST. There's no ETA for resolution.
Because US-East-1 is:
(1) topologically closer to certain customers than other regions (this applies to all regions for different customers),
(2) consistently in the first set of regions to get new features,
(3) usually in the lowest price tier for features whose pricing varies by region,
(4) where certain global (notionally region agnostic) services are effectively hosted and certain interactions with them in region-specific services need to be done.
#4 is a unique feature of US-East-1; #2 and #3 are factors in region selection that can also favor other regions (e.g., for users in the western US, US-West-2 beats US-West-1 on them), which is why some users topologically closer to US-West-1 favor US-West-2.
Most likely inertia. us-east-1 was the first AWS region, gets new features released there first, and is the largest in the USA, so many companies have been running there for many years, and the cost of moving to us-east-2 > the cost of occasional AWS-created downtime.
I think it's predicated on a misunderstanding of what "fail-safe" actually means.
For example, in railway signaling, drivers are trained to interpret a signal with no light as the most restrictive aspect (e.g. "danger"). That way, any failure of a bulb in a colored light signal, or a failure of the signal as a whole, results in a safe outcome (albeit that the train might be delayed while the driver calls up the signaler).
Or, in another example from the railways, the air brake system on a train is configured such that a loss of air pressure causes emergency brake activation.
Fail-safe doesn't mean "able to continue operation in the presence of failures"; it means "systematically safe in the presence of failure".
Systems which require "liveness" (e.g. fly-by-wire for a relaxed stability aircraft) need different safety mechanisms because failure of the control law is never safe.
> "systematically safe in the presence of failure".
And even then, you still need to define "safe". Imagine a lock powered by an electromagnet. What happens if you lose power?
The safety-first approach is almost always for the unpowered lock to default to the open state — allow people to escape in case of emergency.
Conversely, the security-first approach is to keep the door locked — nothing goes in or out until the situation is under control.
A more complex solution is to design the lock to be bistable. During operating hours when the door is unlocked, failure keeps it unlocked. Outside operating hours, when the door is set to locked, it stays locked.
The common factor with all these scenarios is that you have a failure mode (power outage), and a design for how the system ensures a reasonable outcome in the face of said failure.
Or nuclear reactors that fail safe by dropping all the control rods into the core to stop all activity. The reactor may be permanently ruined after that (with a cost of hundreds of millions or billions to revert) but there will be no risk of meltdown.
Sort of. A fail-safe reactor design can include things like:
* Negative temperature coefficient of reactivity: as temperature increases, the neutron flux is reduced, which both makes it more controllable, and tends to prevent runaway reactions.
* Negative void coefficient of reactivity: as voids (steam pockets) increase, the neutron flux is reduced.
* Control rods constructed solely of neutron absorber. The RBMK reactor (Chernobyl) in particular used graphite followers (tips), which _increased_ reactivity initially when being lowered.
It's also worth noting that nuclear reactors are designed to be operated within certain limits. The RBMK reactor would have been fine had it been operated as designed.
Source: was a nuclear reactor operator on a submarine.
I don't know enough about reactor control systems to be sure on that one. The idea of a fail-safe system is not that there's an easy way to shut them down, but more that the ways we expect the component parts of a system to fail result in the safe state.
e.g. consider a railway track circuit - this is the way that a signaling system knows whether a particular block of a track is occupied by a train or not. The wheels and axle are conductive so you can measure this electrically by determining whether there's a circuit between the rails or not.
The naive way to do this would be to say something like "OK, we'll apply a voltage to one rail, and if we see a current flowing between the rails we'll say the block is occupied." This is not fail-safe. If the rail has a small break, or power is interrupted, no current will flow, so the track always looks unoccupied even if there's a train.
The better way is to say "We'll apply a voltage to one rail, but we'll have the rails connected together in a circuit during normal operation. That will energize a relay which will cause the track to indicate clear. If a train is on the track, then we'll get a short circuit, which will cause the relay to de-energize, indicating the track is occupied."
If the power fails, it shows the track occupied because the relay opens. If the rail develops a crack, the circuit opens, again causing the relay to open and indicate the track is occupied. If the relay fails, then as long as it fails open (which is the predominant failure mode of relays) the track is also indicated as occupied.
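A toy model of that logic (purely illustrative, not real signaling code) shows how every failure mode collapses to the safe "occupied" indication:

```python
# Toy model of a fail-safe track circuit: the block only reads "clear" when
# power is present, the rail is intact, no train shorts the rails, and the
# relay actually works. Every failure mode collapses to "occupied".

def track_clear(power_ok: bool, rail_intact: bool, train_present: bool,
                relay_working: bool) -> bool:
    # Current reaches and energizes the relay only if there is power, an
    # unbroken rail, no train shorting the circuit, and a functioning relay.
    relay_energized = power_ok and rail_intact and not train_present and relay_working
    return relay_energized   # energized relay -> "clear"; anything else -> "occupied"

assert track_clear(True, True, False, True) is True    # normal, empty block
assert track_clear(True, True, True, True) is False    # train shorts the rails
assert track_clear(False, True, False, True) is False  # power failure -> occupied
assert track_clear(True, False, False, True) is False  # cracked rail -> occupied
assert track_clear(True, True, False, False) is False  # relay fails open -> occupied
```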
No. For example, train signalling, which controls whether a train can go onto a section of track, operates in a fail-safe manner: if something goes wrong, the signal fails into a safe "closed" state rather than an unsafe "open" state. This means trains are sometimes incorrectly told to stop even though technically the tracks are clear, rather than incorrectly told to go even though there is another train ahead.
"fail-safe" doesn't mean "doesn't fail", it means that the failure mode chooses false negatives or false positives (depending on the context) to be on the safe side.
Or are you asking whether it's a lesson about how real systems operate? Because yes, it's a very serious lesson about how real systems operate.
Anyway, you seem out of your depth on systems engineering. Your reply downthread isn't applicable (of course fail-safes can fail; anything can fail). If you want to learn more in this area (not everybody wants to, and that's OK), following that link to the systems theory books on the wiki may be a good idea. Or maybe start at the root:
"Notice that there is a huge amount of handwaving in system engineering. I don't think this is good, but I don't think it's avoidable either."
In my experience, you can be specific, but then you get the problem that people think that if they just 'what if' a narrow solution to the particular problem you're presenting they've invalidated the example, when the point was: 1. this is a representative problem, not this specific problem; 2. in real life you don't get a big arrow pointing at the exact problem; 3. in real life you don't have one of these problems, your entire system is made out of these problems, because you can't help but have them; and 4. availability bias: the fact that I'm pointing an arrow at this problem for demonstration purposes makes it very easy to see, but in real life you wouldn't have a guarantee that the problem you see is the most important one.
There's a certain mindset that can only be acquired through experience. Then you can talk systems engineering to other systems engineers and it makes sense. But prior to that it just sounds like people making excuses or telling silly stories or something.
"(of course fail-safes can fail, anything can fail)"
Another way to think of it is the correlation between failure. In principle, you want all your failures to be uncorrelated, so you can do analysis assuming they're all independent events, which means you can use high school statistics on them. Unfortunately, in real life there's a long tail (but a completely real tail) of correlation you can't get rid of. If nothing else, things are physically correlated by virtue of existing in the same physical location... if a server catches fire, you're going to experience all sorts of highly correlated failures in that location. And "just don't let things catch fire" isn't terribly practical, unfortunately.
Which reiterates the theme that in real life, you generally have very incomplete data to be operating on. I don't have a machine that I can take into my data center and point at my servers and get a "fire will start in this server in 89 hours" readout. I don't get a heads up that the world's largest DDOS is about to be fired at my system in ten minutes. I don't get a heads up that a catastrophic security vulnerability is about to come out in the largest logging library for the largest language and I'm going to have a never-before-seen random rolling restart on half the services in my company with who knows what consequences. All the little sample problems I can give in order to demonstrate systems engineering problems imply a degree of visibility you don't get in real life.
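To put a toy number on the correlation point above (every value here is invented for illustration): independence makes the arithmetic easy, and a shared cause quietly wrecks it.

```python
# Toy illustration (invented numbers): two "redundant" components that each
# fail 1% of the time. Independence predicts both fail 0.01% of the time; a
# common-mode event (the shared rack catching fire) dominates that estimate.
p_single = 0.01           # assumed per-component failure probability
p_both_independent = p_single ** 2                 # 0.0001 -> 1 in 10,000

p_common_cause = 0.005    # assumed chance of an event that takes out both at once
p_both_correlated = p_common_cause + (1 - p_common_cause) * p_both_independent

print(f"Both fail, independence assumed: {p_both_independent:.6f}")  # 0.000100
print(f"Both fail, with a common cause:  {p_both_correlated:.6f}")   # ~0.0051
```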
I mean it has to be a play on words or tongue in cheek, simply because the assumption of a fail-safe system failing is already contradictory. So you cannot say anything smart about that beyond: there are no fail-safe systems that fail.
Some datacenter failures aren't related to redundancy. Some examples: 1) a transfer switch failure where you can't switch over to backup generators and the UPS runs out, 2) someone accidentally hits the EPO (emergency power off), 3) maintenance work makes a mistake such as turning off the wrong circuits, 4) cooling doesn't switch over fully to backups, and while your systems have power, it's too hot to run. The list can go on and on.
I'm not sure why this is a big deal though; this is why Amazon has multiple AZs. If you're in one AZ, you take your chances.
It was not a total power loss. Out of 40 instances we had running at the time of the incident, only 5 appeared to be lost to the power outage. The bigger issue for us was that the EC2 API to stop/start these instances appeared to be unavailable (probably because the rack these instances were in had no power). The other issue that was impactful to us was that many of the remaining running instances in the zone had intermittent connectivity out to the internet. Additionally, the incident was made worse by many of our supporting vendors being impacted as well...
IMO it was handled rather well and fast by AWS... not saying we shouldn't beat them up (for a discount) but being honest this wasn't that bad.
If the rack your instances are running in is totally offline, then the EC2 API unfortunately can't talk to the dom0 and tell the instances to stop/start, so you get annoying "stuck instances" and really can't do anything until the rack is back online and able to respond to API calls.
Sometimes, you have a component which fails in such a way that your redundancies can't really help.
I once had to prepare for a total blackout scenario in a datacenter because there was a fault in the power supply system that required bypassing major systems to fix. Had some mistake or fault happened during those critical moments, all power would've been lost.
Well-designed redundancy makes high-impact incidents less likely, but you're not immune to Murphy's law.
To my mind, among the more frustrating aspects of implementing protection against failure is that the added mechanisms can themselves cause failure.
You need to pick your battles and choose what you want to protect against to mitigate risk and enable day-to-day operations.
For example, too often people will set up clustered databases and whatnot because "they need HA" without much thought about all the other potential effects of using a cluster, such as much more complicated recovery scenarios.
In the vast majority of cases, an active-passive replicated database with manual failover is likely to have fewer pitfalls and gives you the same operational HA a clustered database would, even though in the case of a (rare) real failure it would not automatically recover like a cluster might.
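As an illustration of how simple the manual-failover step can be (assuming, say, PostgreSQL streaming replication on version 12+; the connection details are placeholders and psycopg2 is assumed to be installed):

```python
# Hypothetical sketch: manually promoting a PostgreSQL streaming-replication
# standby as part of a failover runbook (psycopg2 and PostgreSQL 12+ assumed).
import psycopg2

# Connect to the standby that should take over (placeholder DSN).
conn = psycopg2.connect("host=standby.internal dbname=app user=admin")
conn.autocommit = True

with conn.cursor() as cur:
    # pg_promote() asks the standby to exit recovery and become the new primary.
    cur.execute("SELECT pg_promote(wait => true);")

# Application traffic is then repointed at the new primary (DNS, config, etc.),
# which is a deliberate human step rather than an automatic cluster election.
```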
The battery backups (called uninterruptible power supplies) are only meant to bridge the gap between the power going out and the generator turning on, which is a few minutes. Did they say power was the issue this time? I suspect it’s actually something else (ahem network)
Their datacenter(s) aren’t magic because they are AWS. That facility is probably a decade old and like anything else as it ages the technical and maintenance debt makes management more challenging.
“This is exactly the sort of design that lets me sleep like a baby,” said DeSantis. “And indeed, this new design is getting even better availability” – better than “seven nines” or 99.99999 percent uptime, DeSantis said.