We went to the manufacturer and tried to get them to make a firmware change; we wanted the generators to sacrifice themselves under most every circumstance (short of danger to humans). Generators are an irrelevant cost compared to the cost of an outage. The manufacturer refused, even when we offered to buy them without warranties.
I don't know if they still do, but at that point AWS started buying basically the same generators straight from China and writing their own firmware.
There is also a great story about an AWS authored firmware update on a set of louver microcontrollers causing a partial outage.
AWS really likes to own the entire stack.
*It has been years since I left; details are a bit fuzzy, but I think I've got these right.
The rare failure mode that can cost $100m looks like this: when the utility power fails, the switch gear detects a voltage anomaly sufficiently large to indicate a high probability of a ground fault within the data center. A generator brought online into a direct short could be damaged. With expensive equipment possibly at risk, the switch gear locks out the generator. Five to ten minutes after that decision, the UPS will discharge and row after row of servers will start blinking out.
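In rough, hypothetical pseudo-Python, the decision described above might look something like this (the threshold, names, and structure are invented for illustration; none of this is from any real switchgear firmware):

    GROUND_FAULT_VOLTAGE_RATIO = 0.7  # assumed: a sag below 70% of nominal suggests a fault

    def on_utility_failure(nominal_volts, measured_volts, prefer_uptime_over_generator):
        """Decide what to do when utility power is lost (illustrative only)."""
        likely_ground_fault = measured_volts < nominal_volts * GROUND_FAULT_VOLTAGE_RATIO
        if likely_ground_fault and not prefer_uptime_over_generator:
            # Default vendor behaviour in the story: protect the generator and
            # let the UPS drain, so the servers go dark a few minutes later.
            return "LOCKOUT"
        # The behaviour the operator wanted: bring the generator online anyway
        # and accept that it may be sacrificed into a short.
        return "START_GENERATOR"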
This same fault mode caused the 34-minute outage at the 2012 Super Bowl: The Power Failure Seen Around the World.
Backup generators run around three-quarters of a million dollars, so I understand the switch gear engineering decision to lock out and protect an expensive component. And, while I suspect that some customers would want it that way, I've never worked for one of those customers, and the airline hit by this fault last summer certainly isn't one of them either.
Edit, and reading further there it is:
>"If there was a ground fault in the facility, the impacted branch circuit breaker would open and the rest of the facility would continue to operate on generator and the servers downstream of the open breaker would switch to secondary power and also continue to operate normally. No customer impact."
Most downtime I've seen in datacenters has been due to irregular power, not power loss. Case in point.
Dummy loads are good for testing, but they are not typically variable - if you had 2.5 MVA generators and a 1 MW dummy load, you wouldn't be able to run more than 1 MW of critical IT load.
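The headroom math, as a quick sketch (the 0.8 power factor is my assumption, not something stated above):

    generator_mva = 2.5
    power_factor = 0.8                                  # assumed
    real_capacity_mw = generator_mva * power_factor     # ~2.0 MW of usable real power
    dummy_load_mw = 1.0                                 # fixed, non-variable load bank
    it_headroom_mw = real_capacity_mw - dummy_load_mw   # ~1.0 MW left for critical IT load
    print(it_headroom_mw)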
I'll say this is the first I've heard of not having enough load to start a generator. They will happily start up and idle.
Good point about the lack of variability. Not sure if the generators would take kindly to suddenly switching out a 1 MW dummy load to free capacity for real load.
On the other hand, it's not like these engines have fixed RPM throttles, and they should be able to handle a wide range of loads, albeit not at peak efficiency. Something doesn't fully make sense about that story. Maybe the generators were sized for their ultimate planned capacity and thus way too much for their deployment at the time.
Or just crank up the AC :)
Perhaps a more financially responsible solution would be to spin up a bunch of instances that mine some sort of cryptocurrency. I doubt it'd cover electricity costs, but it could offset it some.
I had not realized it could get pretty hot there but it made sense given the energy they were dumping out.
* Plasma arc garbage incineration. Not smelly if the input arrives in a ceramic sealed container.
* Glass cullet smelter producing glass foam insulation bricks. Turn excess power into additional DC modular insulation. My personal favorite, because this is a giant resistive load that directly supports the DC’s bottom line.
* Aluminum recycling smelter. Build additional heat sinks for increasing rack efficiency.
* Distilled water generation. Route it back through water chillers, cut down on mineral scaling damage over time.
(You could run the generator at its favorite speed and sink excess power into a dummy load. This'd waste a ton of fuel, though, enough so that I'm not actually sure whether it'd be cheaper than eating an engine. That and I'm not sure what a data-center-sized dummy load would look like.)
I think you'll find fuel is extremely cheap, even when you're burning many gallons per minute.
A randomly-googled 2 MW diesel generator consumes 50 gallons/hour at quarter load and 160 gallons/hour at full load. So let's say it's 100 gallons/hour more to run a 2 MW generator at full load instead of quarter load. The OVH incident report [OP] says that their data center has two cables in, each carrying 10 MVA (megavolt-amperes, roughly equivalent to megawatts), giving us 20 MW as roughly the maximum power consumption of the data center. Divide, multiply, and it'd cost 1000 gallons/hour to run at full load, orrrr $3000/hour at the current price of diesel.
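The same arithmetic as a quick script, for anyone who wants to check it (the fuel figures and feed sizes are the ones quoted above; the $3/gallon diesel price is my approximation):

    gal_per_hr_quarter_load = 50
    gal_per_hr_full_load = 160
    extra_per_generator = gal_per_hr_full_load - gal_per_hr_quarter_load  # 110, rounded to ~100 above
    datacenter_mw = 20                        # two 10 MVA feeds, per the OVH post-mortem
    generators_needed = datacenter_mw / 2     # ten 2 MW generators
    extra_gal_per_hr = 100 * generators_needed            # ~1000 gallons/hour, full vs quarter load
    diesel_usd_per_gal = 3                    # rough price assumption
    print(extra_gal_per_hr * diesel_usd_per_gal)          # ~$3000/hour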
Yeah, that's pretty cheap, okay. You might run these generators for a few hundred hours over the lifetime of the installation and they probably cost millions of dollars, so you're not going to spend more on fuel than you'd spend having to replace the generator early. Why don't they have a dummy load for that scenario?
edit: Now that I look at the math I did I realize that I didn't need to look at OVH in particular at all, just compare the cost of fuel to the purchase price of one of the diesel generators I was looking at. >_< Meh, math still works, I'll leave it.
Basically a big hairdryer. :-)
1. Typically these are just big fan heaters https://en.m.wikipedia.org/wiki/Load_bank
That's how you create business opportunities for others, heh
Although diesel can get microbial contamination if not used within 6-12 months.
Do tell :-)
The actual solution to that problem, of course, is to use smaller generators?
I wish you and I could own the full stack on our hardware, too. Imagine free drivers, firmware and microcode, full schematics and specification of all the parts. That would approach my atheist's heaven.
Doing high voltage electronics without a license is a crime in a lot of places. I'm pretty sure a person with the right specialty can add a load so you won't need to hack up the firmware to violate its safe operating ranges.
If someone told me this in person, I'd consider contacting the authorities over their negligent behaviour. The manufacturer was willing to lose business over a safety matter, even though allowing the firmware hacks would have been really easy.
They now make their own if I remember correctly.
The reason was a problem with one of Sprint's cables. It was carrying a lot of Nordunet traffic. Nordunet is/was the Nordic joint university network infrastructure. At the time, for historical reasons, Nordunet had more network capacity than most of the rest of Europe combined (consider that Norway was the second international connection to Arpanet in the early 70's, due to the seismic array at Kjeller outside Oslo that provided essential data on Soviet nuclear tests; this made money flow to network research in Norway, and a joint Nordic effort led to significant investment in the other Nordic countries too; couple that with early work in Sweden, where they ended up hosting D-Gix - for many years the largest internet exchange point outside the US).
So Sprint's cable went down, and everyone started re-routing. Problem #1: Everyone had high capacity to D-Gix or another exchange point in the Nordic countries. These connections were now flooded by traffic from the Nordic networks, where people were used to being able to saturate their 100 Mbps network connections (it was such a downer starting my ISP and going from 100 Mbps at university to 512 kbps aggregate outbound capacity at our offices). Except most of the links connecting to D-Gix were 34 Mbps or less, and there were not that many of them, and they were connecting entire universities or entire countries...
So connectivity within Europe started slowing down.
Problem #2: Most of the links from elsewhere in Europe to the US were 8 Mbps or less, with maybe a couple faster than that, and there were not that many of them. Certainly not enough to compensate for several hundred Mbps of lost capacity.
Problem #3: Many of those links got so saturated and slow that failsafes were tripped and traffic started getting routed to their expensive backup links (e.g. people paying for a port and then paying for bursts on a 95th-percentile basis and the like, with the expectation of normally not using them). Except, as it turns out, many of said backup links had been purchased from Sprint. Over that cable.
Today the number of cables and overall capacity is vastly higher, with many more providers, so I don't think the same could happen again, and D-Gix is no longer as important as it was (though it's still one of the largest international exchange points). But as we sat there pinging and trying to figure out what was up, it was a vivid lesson in ensuring your backups are actually sufficiently separate from your main systems that they stand a chance of working.
Mostly, when this happened, west coast users couldn't contact east coast servers (and vice versa), but occasionally, BGP would work out, and packets would get back and forth, but go around the world.
However, I am really frustrated by the lack of communication during this incident. All I had was the tweets from the founder, while their status page was down and I had no other place to reach out to them (phone + ticket system down).
What is worse is that they treat the individual SBGs (SBG1, SBG2, etc.) as individual availability zones, when they truly are not. I had an application using their vRack across different SBG availability zones and they all went down.
Anyway, glad they are fixing it, and glad for the clarification that they don't have real availability zones. Will design my applications better next time.
It's easy to handle when something is completely unavailable (i.e. instant error), but when something, be it a database or some endpoint, is available but horribly slow, that's a whole different thing.
In your house, you have the circuit-breaker pattern implemented in hardware. And upstream of your house there are many more layers of circuit breakers that generally increase in size until they reach a point where there is redundancy. A circuit-breaker going off in your house protects the other circuits in your house, a circuit-breaker going off on your street protects the rest of your neighborhood.
Industrial circuit-breakers are commonly a combination of hardware and software. Personally, I've lost more equipment due to brown-outs than any other cause. If you have equipment you don't want to experience a brown-out, program the breaker to cut off power completely.
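For the software side of that analogy (the "available but horribly slow" case a couple of comments up), a minimal circuit-breaker sketch might look like the following; the class name and thresholds are made up for illustration, not taken from any particular library:

    import time

    class CircuitBreaker:
        """Fail fast once a dependency has failed or timed out too often."""
        def __init__(self, failure_threshold=5, reset_timeout_s=30.0):
            self.failure_threshold = failure_threshold
            self.reset_timeout_s = reset_timeout_s
            self.failures = 0
            self.opened_at = None

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_timeout_s:
                    raise RuntimeError("circuit open: failing fast instead of waiting")
                self.opened_at = None  # half-open: let one request probe the dependency
            try:
                result = fn(*args, **kwargs)  # fn should enforce its own timeout
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()  # trip the breaker
                raise
            self.failures = 0
            return result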
Is it just a (non)-survivorship bias where we most commonly talk about the failure cases and they actually have a stellar 99.99999% record that doesn't make headlines?
My best story was as a young student turning up for my helpdesk shift at about 5:50am to a phalanx of fire engines and the Hazmat team.
The generator in the basement was maintained and tested. This time it had started when the power went out, but there was a pump that filled a holding tank from a 10,000-ish litre tank under the building (specced to run for a week, apparently). A pipe had cracked, and so it just kept pumping fuel onto the floor, the carpark, and into the local stream ...
The only person in the building overnight was the operator (tape changer and report runner) some 5 stories above all this, who eventually smelt it and investigated. It was a massive nightmare, and the place stank for weeks.
Then there was the time the halon system went off unnecessarily...
This is what anti-siphon valves are used to prevent: they are valves that open only when sufficient suction is present in the fuel line leaving the tank. Thus, when the line breaks or the pump fails, the valve will close.
> or it can be a few floors above the tank, too high to suck fuel from the tank so it has to be pushed up from below.
True, it has to be pushed up then. Still, you can have a sequence of suction pumps. Note that otherwise you'd need to have high-pressure fuel piping, which might cause more problems than the possibility of feeding a fire indefinitely.
Switching packet-based networks is often much easier because you can just hold on to the packet and retransmit if a line fails; that's why total network failures are much rarer.
Generators themselves are very reliable machines. The engines are cast iron (instead of the cast aluminium found in nearly all cars for the last two decades or so) and run nowhere near as close to the edge of their material limits as car engines do. Electrical generators are essentially just a big blob of metal; they are simple, efficient, and very reliable technology, giving many decades of service. Generator controllers are designed to be reliable and aren't terribly complicated, either.
If generators really are just a giant blob of metal and very simple (and I have no reason to disbelieve the GP comment)... well... it could be kind of interesting to build an open source software stack to handle switchover. Because, disclaimers notwithstanding, the code would ostensibly be super simple too. So even if it couldn't officially be used directly, it would certainly provide a good base for engineers to copy-paste, either literally or ideologically (and then thoroughly verify, of course!).
Okay... thinking about it, I'm probably wrong - either generator control has some fundamental intricacies that make it not-completely-simple, or all sites have edge cases that have to be baked into the firmware by a system integrator/electrical engineer.
I say this because I'm (genuinely) trying to figure out why the PLCs failed - both in OVH's case, and in AWS's case (see elsewhere in this thread, https://news.ycombinator.com/item?id=15676189).
It's obvious there are crazy but legit reasons for this kind of thing to happen. I'm very curious what the complexity scale here is.
I've seen what switching 20kV looks like in videos - yeah, that kind of thing requires very careful design, and is invariably going to come entangled with a PLC-style controller, as is the norm for industrial equipment.
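To make the "the core logic would be super simple" idea above concrete, here is roughly what a naive transfer state machine could look like; this is purely an illustrative sketch of the happy path, with the hard, site-specific parts (breaker interlocks, load shedding, paralleling, ground-fault handling, retry limits) deliberately left out:

    from enum import Enum, auto

    class State(Enum):
        ON_UTILITY = auto()
        STARTING_GENERATOR = auto()
        ON_GENERATOR = auto()

    def step(state, utility_ok, generator_ready):
        if state is State.ON_UTILITY and not utility_ok:
            return State.STARTING_GENERATOR   # crank the engine, ride through on UPS
        if state is State.STARTING_GENERATOR and generator_ready:
            return State.ON_GENERATOR         # close the transfer switch onto the generator
        if state is State.ON_GENERATOR and utility_ok:
            return State.ON_UTILITY           # retransfer (stabilisation delay omitted)
        return state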
Disaster backups are hard to test, because doing it properly means integration testing, and that might break production. Sure, your generator might power on for 5 minutes a month and make people happy, but there are many unforeseen things that are hard to test for.
In our case, we had a generator run out of fuel. Another time, one had its battery die (they're started by batteries, just like a car engine).
Every day. All of them (automated).
It would be interesting to run a datacenter completely on batteries that are charged from solar + the grid + gas generators. You could run the generators at optimal power output and use them daily. The batteries required for ~24-48 hours might still be too expensive to make this possible. Maybe some kind of super low power datacenter could pull this off. One day it shall be mine.
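Rough battery math for that idea, with a hypothetical 1 MW "super low power" facility and a storage price that is only my ballpark assumption:

    load_mw = 1.0                     # assumed facility load
    hours_of_autonomy = 36            # middle of the 24-48 hour range above
    battery_mwh = load_mw * hours_of_autonomy        # 36 MWh of storage
    usd_per_kwh = 300                 # assumed installed cost of stationary storage
    battery_cost_usd = battery_mwh * 1000 * usd_per_kwh
    print(battery_cost_usd)           # ~$10.8M for the batteries alone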
Like others said, yes they are big and with lots of moving parts. But if you take care of them (every month by the book) no one should have any unexpected problems.
For me it's mind-blowing that OVH did the last power failure test in MAY. Why so rare? IMO that is the 1st mistake.
They do seem to test the generators every month.
Even if we lost power on one power channel, we would still be operational. Even if the backup power failed. That's the reason you use redundant power.
If your "cloud provider" is charging you a lot but not providing your server redundant power from two grids to redundant PSUs, I don't think you're getting the best available design.
In this case, it looks like OVH just threw all the power onto a single channel, which allowed a failover bug to bring down the datacenter.
Unless the two grids are fully isolated (e.g., from two different countries), there is always going to be some overlap in the two power sources.
When you have an A/B isolated power datacenter, you can (and my datacenter regularly does) shut off each channel to test that the backup systems are working, without worrying that a failed test will bring down literally every customer in the datacenter. That testing would have likely caught this problem at OVH if they were able to do it, but they instead decided to just toss all the power into a single channel, making it impossible to do testing like this without a decent chance of widespread outages. So they just had to assume it would work, until it didn't.
Because our power channels are isolated, there's almost no difference between turning each one off separately, and then turning both of them off at the same time. So if we were to lose external power on both (which would be a history books blackout for how this DC gets its power, including an almost direct connection to a hydro-electric dam), you would have two well-tested backup power channels ready to go. Nothing is ever a guarantee, but I like my chances.
We had a problem on the optical network that connects RBX with the interconnection points we have in Paris, Frankfurt, Amsterdam, London, and Brussels. The origin of the problem is a software bug on the optical equipment, which caused the configuration to be lost and the connection from our RBX site to be cut. We restored the software configuration from backup as soon as we diagnosed the source of the problem, and the DC can be reached again. The incident on RBX is fixed. With the manufacturer, we are looking for the origin of the software bug and also looking at how to avoid this kind of critical incident.
Zero problems in that 5 year period due to power...
Moreover, the power lines were not really redundant in SBG, and regarding the network downtime in RBX, it looks like the network core is not really redundant either.
I'd be curious to know how competitors like AWS and Google Cloud mitigate this kind of problem.
I remember having read somewhere about some company (Facebook/Google/Amazon/Twitter/Dropbox or something at the same scale) that regularly simulates a whole datacenter failure, which made me believe it is possible to automatically recover from this.
Are you saying that even the companies I mentioned have the same issues as OVH when they recover from a complete power failure?
There is little to test about the introduction of a hard fault, but the service resumption in the other DC is full of data to analyze. Also, in such a setup, getting the fault location running again is not on a hard clock, since it is about restoring redundancy instead of the service.
It was Google: http://queue.acm.org/detail.cfm?id=2371516 ("Weathering the Unexpected" )
Kudos to Netflix, but restarting a virtual server and restarting a physical one or a whole data center are different things.
I think every company that cares a bit about high availability knocks out stuff randomly, or at least in different ways, and does things like introducing packet loss, etc. But testing that at the service/virtual-server layer is a different thing from testing close to the physical layers.
Of course, one should test that too and it's nowhere near impossible, but Chaosmonkey is for a somewhat different use case.
Also the "article" mentions that tests are done.
My comment was about servers managed by OVH. The post-mortem specifically says: "Since then, we have been working on restarting services. Powering the site with energy allows the servers to be restarted, but the services running on the servers still need to be restarted. That's why each service has been coming back gradually since 10:30 am. Our monitoring system allows us to know the list of servers that have successfully started up and those that still have a problem. We intervene on each of these servers to identify and solve the problem that prevents it from restarting."
The language used suggests that these are servers monitored and operated by OVH, not by random customers.
Source: am an OVH customer
They can restart all the servers they want if they manage not to kill any of my processes, break network connections, or change IP addresses. If they can't (it's probably hard), it's up to me to implement a redundant architecture. Then maybe I'll also implement some random kill switch to prove that the system is resilient to failures.