OVH outage explained (ovh.net)
341 points by pmontra 10 months ago | 123 comments



When I was at AWS we were using generators from a large commercial supplier. We were constantly having issues with them refusing to take over if there wasn't sufficient load. Doing so puts lots of stress on a generator and can significantly shorten its life.

We went to the manufacturer and tried to get them to make a firmware change; we wanted the generators to sacrifice themselves under most every circumstance (short of danger to humans). Generators are an irrelevant cost compared to the cost of an outage. The manufacturer refused, even when we offered to buy them without warranties.

I don't know if they still do, but at that point AWS started buying basically the same generators straight from China and writing their own firmware.

There is also a great story about an AWS authored firmware update on a set of louver microcontrollers causing a partial outage.

AWS really likes to own the entire stack.

*It has been years since I left, details are a bit fuzzy but I think I've got these right.


> we wanted the generators to sacrifice themselves under most every circumstance

The rare failure mode that can cost $100m looks like this: when the utility power fails, the switch gear detects a voltage anomaly sufficiently large to indicate a high probability of a ground fault within the data center. A generator brought online into a direct short could be damaged. With expensive equipment possibly at risk, the switch gear locks out the generator. Five to ten minutes after that decision, the UPS will discharge and row after row of servers will start blinking out. This same fault mode caused the 34-minute outage at the 2012 Super Bowl: The Power Failure Seen Around the World. Backup generators run around 3/4 of a million dollars, so I understand the switch gear engineering decision to lock out and protect an expensive component. And, while I suspect that some customers would want it that way, I've never worked for one of those customers, and the airline hit by this fault last summer certainly isn't one of them either.

http://perspectives.mvdirona.com/2017/04/at-scale-rare-event...


This sounds nice, but makes zero sense. A direct short will be fused and isolated; otherwise the UPS banks would get damaged (maybe even explode).

Edit: and reading further, there it is:

>"If there was a ground fault in the facility, the impacted branch circuit breaker would open and the rest of the facility would continue to operate on generator and the servers downstream of the open breaker would switch to secondary power and also continue to operate normally. No customer impact."


The company that maintains my generators has a dummy load that is basically a large resistor bank and fan on a trailer. It seems like Amazon could just build something like that on a larger scale to load the generators - short term they could even rent the dummy load trailers and hook them up in the parking lot, right? For a permanent solution, have logic to switch in the appropriate amount of resistance to maintain the minimum load required.
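
For a rough idea of what that switching logic could look like, here is a minimal sketch in Python. The generator rating, minimum-load threshold and resistor-bank step size are all made-up illustrative values, not figures from any real controller:

    # Hypothetical sketch: choose how many resistor-bank steps to switch in so
    # the generator stays above a minimum load. Step sizes and thresholds are
    # invented illustrative values, not real controller parameters.

    GENERATOR_RATING_KW = 2000   # assumed generator capacity
    MIN_LOAD_KW = 600            # assumed ~30% minimum load to avoid wet stacking
    BANK_STEP_KW = 250           # assumed size of one switchable resistor bank
    NUM_BANK_STEPS = 4

    def banks_to_engage(it_load_kw: float) -> int:
        """Return how many resistor-bank steps to switch in for a given IT load."""
        if it_load_kw >= MIN_LOAD_KW:
            return 0                                # real load alone is enough
        shortfall_kw = MIN_LOAD_KW - it_load_kw
        steps = int(min(-(-shortfall_kw // BANK_STEP_KW), NUM_BANK_STEPS))
        # never push the generator past its rating
        while steps > 0 and it_load_kw + steps * BANK_STEP_KW > GENERATOR_RATING_KW:
            steps -= 1
        return steps

    if __name__ == "__main__":
        for load in (100, 400, 700):
            print(load, "kW IT load ->", banks_to_engage(load), "bank step(s)")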


There are, unfortunately, a lot of things between on and off when it comes to utility power: overvoltage, voltage drop (which can cause fuses to blow since amperage goes up), phase outage, phase imbalance, harmonic noise (causing ground/neutral feedback), micro outages (a few cycles; <50ms). All of which can confuse systems such as a transfer switch, UPS, generators or paralleling gear.

Most downtime I've seen in datacenters has been due to irregular power, not power loss. Case in point.

Dummy loads are good for testing but they are not typically variable - if you had 2.5 MVA generators and a 1 MW dummy load you wouldn't be able to run more than 1 MW of critical IT load.

I'll say this is the first I've heard of not having enough load to start a generator. They will happily start up and idle.


I would say the automatic transfer switch should be configured conservatively and go to backup power at the first hint of instability. If irregular power is the usual cause of downtime then I would expect the ATS to cope with those conditions, or it gets replaced with one that will.

Good point about the lack of variability. Not sure if the generators would take kindly to suddenly switching out a 1 MW dummy load to free capacity for real load.

On the other hand, it's not like these engines have fixed RPM throttles, and they should be able to handle a wide range of loads, albeit not at peak efficiency. Something doesn't fully make sense about that story. Maybe the generators were sized for their ultimate planned capacity and thus way too much for their deployment at the time.


I thought large diesel generators generally did operate at a fixed speed in order to output 60Hz. I know inverter/generators are common these days for small gas gensets, but have the big diesel ones switched over as well?


Can't you just have a smaller generator for low load? Literally wasting fuel so you can run a large generator for a low load should be criminal.


You might not realize how much fuel it takes just to maintain generators. They have to be exercised, load tested, overhauled, all on top of whatever time they spend providing backup power. Which is just as well, because an engine that sits without ever running won't operate when it is most needed, and the fuel can stagnate, or even attract water and grow algae.


It seems like excessively low load would be a relatively easy problem to solve. One could heat tanks of water to boiling and vent the steam, that could probably absorb as much energy as needed.
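
As a rough back-of-the-envelope check, boiling water really can absorb this kind of power; the 20 MW figure below is an assumption for illustration, not a number from this thread:

    # Back-of-the-envelope sketch: how much water would a 20 MW dummy load boil off?
    # The 20 MW load figure is an assumption for illustration only.

    LOAD_W = 20e6                        # assumed surplus generator output, watts
    SPECIFIC_HEAT = 4186                 # J/(kg*K), water
    LATENT_HEAT_VAPORIZATION = 2.26e6    # J/kg, water at 100 C
    DELTA_T = 80                         # K, heating from ~20 C to boiling

    energy_per_kg = SPECIFIC_HEAT * DELTA_T + LATENT_HEAT_VAPORIZATION  # ~2.6 MJ/kg
    kg_per_second = LOAD_W / energy_per_kg

    print(f"~{kg_per_second:.1f} kg of water boiled off per second")
    print(f"~{kg_per_second * 3600 / 1000:.0f} tonnes per hour")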

Or just crank up the AC :)


Huge DCs like Amazon's, and ours at Google, are very efficient; cooling uses at most 10% of total power, so cranking up the AC wouldn't help. Nor would it usually be possible anyway, as these facilities typically use evaporative cooling, which can't really be cranked up like traditional compression cooling. And keeping a huge tank and heater ready would introduce another point of failure. The easiest is just to have your servers do some heavy calculations.


Compared to heating a water tank, it would be a much more responsible use of power to fire up some folding@home images.

Perhaps a more financially responsible solution would be to spin up a bunch of instances that mine some sort of cryptocurrency. I doubt it'd cover electricity costs, but it could offset it some.


There are load banks made for generators, just large resistive grids. A facility that I work at has one so that we can exercise the generator at operating load without actually switching over our power. So I would think a good solution for a DC might be to have load banks you could switch in when necessary.


Somewhere I have a picture of a car that was parked in the no parking area next to the dummy load for a data center generator. All of the plastic parts on the outside are melted off and the paint is a different color on the side that got the bulk of the hot air coming off the dummy load.

I had not realized it could get pretty hot there but it made sense given the energy they were dumping out.


Apart from the other ideas for picking up load, if you have land for it, there are many productive ways to use excess generation capacity that can still be reliable, especially if a conventional resistive load is available on standby. Some methods can even bolster the DC's own opex. None are cheaper than a resistive load, but I wonder: if the resistive load were modeled as requiring high availability, what would the pricing of the excess capacity be?

* Plasma arc garbage incineration. Not smelly if the input arrives in a ceramic sealed container.

* Glass cullet smelter producing glass foam insulation bricks. Turn excess power into additional DC modular insulation. My personal favorite, because this is a giant resistive load that directly supports the DC’s bottom line.

* Aluminum recycling smelter. Build additional heat sinks for increasing rack efficiency.

* Distilled water generation. Route it back through water chillers, cut down on mineral scaling damage over time.


The generator is not going to run unless there's an outage.


I forgot to mention that you only need to set up for very small-scale production, almost maker-scale, so not a whole lot of land is required. With automation, any outage will create a steady trickle of usable goods. The DC's I work with run generators once a month to test (and generally, most engines don't like sitting still all the time). Certainly sufficient for use by nearby residents, so co-generation would benefit third parties if not the DC itself.


Do you know why it's harder for a generator to power an unloaded circuit? Just curious.


The generator's mechanical parts will perform more efficiently at some speeds/loads than at others. Everything happening at a different temperature and pressure will change fluid flows and forces and put different stresses in different places. Combustion won't happen cleanly, oil and gases won't flow the way they're designed to and the engine will get dirty, etc. If you're running outside of the region the engine was designed to operate in you're stressing it in ways it was not designed to handle, lowering efficiency and shortening its lifespan.

(You could run the generator at its favorite speed and sink excess power into a dummy load. This'd waste a ton of fuel, though, enough so that I'm not actually sure whether it'd be cheaper than eating an engine. That and I'm not sure what a data-center-sized dummy load would look like.)

https://en.wikipedia.org/wiki/Power_band


> This'd waste a ton of fuel, though, enough so that I'm not actually sure whether it'd be cheaper than eating an engine.

I think you'll find fuel is extremely cheap, even when you're burning many gallons per minute.


Hmmm. I didn't actually do the numbers. Let's see...

A randomly-googled 2 MW diesel generator consumes 50 gallons/hour at quarter load and 160 gallons/hour at full load [1]. So call it an extra 100 gallons/hour to run a 2 MW generator at full load instead of quarter load. The OVH incident report [OP] says that their data center has two incoming cables, each carrying 10 MVA (megavolt-amperes, roughly watts), giving us 20 MW as roughly the maximum power consumption of the data center. Divide, multiply: it'd cost an extra 1000 gallons/hour to run at full load, or about $3000/hour at the current price of diesel [2].
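
The same arithmetic as a quick script, using the unrounded datasheet numbers from above (the ~$3/gallon diesel price is the one implied by the rounded figures):

    # Quick check of the fuel-cost arithmetic above. Figures come from the
    # comment itself (datasheet fuel burn, 20 MW site, ~$3/gallon diesel).

    GALLONS_PER_HOUR_QUARTER_LOAD = 50    # 2 MW genset at 25% load [1]
    GALLONS_PER_HOUR_FULL_LOAD = 160      # 2 MW genset at 100% load [1]
    SITE_POWER_MW = 20                    # two 10 MVA feeds
    GENSET_MW = 2
    DIESEL_PRICE_USD_PER_GALLON = 3.0     # rough 2017 US price [2]

    gensets = SITE_POWER_MW // GENSET_MW
    extra_gal_per_hour = (GALLONS_PER_HOUR_FULL_LOAD
                          - GALLONS_PER_HOUR_QUARTER_LOAD) * gensets
    print(f"Extra fuel at full vs quarter load: {extra_gal_per_hour} gal/hour")
    print(f"Extra cost: ${extra_gal_per_hour * DIESEL_PRICE_USD_PER_GALLON:,.0f}/hour")
    # ~1100 gal/hour and ~$3,300/hour -- consistent with the rounded ~$3000/hour above.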

Yeah, that's pretty cheap, okay. You might run these generators for a few hundred hours over the lifetime of the installation and they probably cost millions of dollars, so you're not going to spend more on fuel than you'd spend having to replace the generator early. Why don't they have a dummy load for that scenario?

[1] http://www.dieselserviceandsupply.com/Diesel_Fuel_Consumptio... [2] https://www.eia.gov/petroleum/gasdiesel/

edit: Now that I look at the math I did I realize that I didn't need to look at OVH in particular at all, just compare the cost of fuel to the purchase price of one of the diesel generators I was looking at. >_< Meh, math still works, I'll leave it.


Also, if you have one power outage, it's more likely you'll have another soon, so you don't want to burn up your generator (likely a long lead time to replace) during the first outage, just to have nothing during the next outage.


Fuel is much more expensive in Europe. A liter of diesel is about €1.40, so roughly $6.50 per gallon.


Does Europe tax diesel the same way no matter where it's used? I know in the US diesel for generators is exempt from a lot of taxes resulting in it being significantly cheaper in some states.


I only know of Finland, but here diesel for work machines is much cheaper than diesel for driving.


In the UK, diesel is taxed differently for agricultural usage; it typically has a red dye added to it to indicate it's for agricultural use only. I honestly don't know about generators though.


You can use red diesel in a generator, I think the legality comes down to whether it's being used to power a vehicle on a public road.


You're right, some countries have exceptions. Netherlands where I live used to, but no longer does. Not sure about France.


There's the externality of pollution you didn't consider.


That's true, but how much pollution does it cost to replace the generator?


Since the generators apparently run less efficiently at partial load, that's not a given either. They'll almost certainly produce more pollution per unit fuel. Hm, is that a few percent more, or a few times more?


Dummy load looks like this: http://www.simplexdirect.com/images/pic-l-atlasTrailer.jpg It's "only" a megawatt but there are bigger ones as well.

Basically a big hairdryer. :-)


My understanding is that it's the diesel engines that don't like it (the generator head itself is fine). Basically, some of the fuel in the combustion chamber doesn't get burned, and comes out the exhaust. This builds up in the exhaust system, forming an ooze in the pipes. I think it's worse on a turbo diesel, because now the ooze is coming into the turbocharger.

https://en.wikipedia.org/wiki/Wet_stacking


Adding a heat-bank / load-bank[1] solves this problem.

1. Typically these are just big fan heaters https://en.m.wikipedia.org/wiki/Load_bank


As far as I understood, it's mostly because it won't reach optimal temperatures and pressure, which then leads to incomplete combustion and carbon buildups in the cylinders.


And at AWS scale you can always have a human team present to provide a manual backup; the cost of having 2 or 3 people on an overnight shift is trivial compared to the downtime costs.


I wonder why they couldn't do that at OVH. The data centres are always manned and they had 8 minutes on batteries. Why didn't they detect after 20 seconds that the generators hadn't started despite the power failure, and trigger a manual failover? Sounds like something the staff should be able to do?


> The manufacturer refused

That's how you create business opportunities for others, heh


Why Diesel generators? Natural gas or gasoline should avoid wet stacking.


Diesel tends to be used more because diesel engines tend to be much lower maintenance than other fuel engines, and the cost of the fuel itself is also lower due to the higher efficiency.


I think also storing diesel is easier. It's not very flammable compared to petrol (gasoline) or LPG / LNG.

Although diesel can get microbial contamination if not used in 6 - 12 months.[1]

1. https://en.m.wikipedia.org/wiki/Microbial_contamination_of_d...


Why are they lower maintenance in this situation (i.e. power failsafe generators that are used for only a few hundred hours over their life)? I appreciate that diesel car/truck engines are lower maintenance, but they are used in a very different way.


Easier to generate a fixed-frequency AC output. Diesel engines can run at a fixed RPM and adjust fueling to match load, where gasoline and natural gas need a narrow AFR and have to vary rotation speed for power output. Maybe less of an issue these days as inverter/generators are more common, though I dunno if that's the case for the big generators.
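
For reference, the fixed speed follows from the synchronous-machine relation f = poles x RPM / 120; a minimal illustration (the pole counts and speeds here are just common textbook values):

    # Synchronous generator frequency: f = poles * RPM / 120.
    # A 4-pole alternator must spin at 1800 RPM for 60 Hz (1500 RPM for 50 Hz),
    # which is why the engine governor holds a fixed speed and varies fueling.

    def output_frequency_hz(poles: int, rpm: float) -> float:
        return poles * rpm / 120

    for poles, rpm in [(2, 3600), (4, 1800), (4, 1500)]:
        print(f"{poles}-pole at {rpm} RPM -> {output_frequency_hz(poles, rpm):.0f} Hz")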


> There is also a great story about an AWS authored firmware update on a set of louver microcontrollers causing a partial outage.

Do tell :-)


Seems like you would have to disable emissions systems to get what you described, which would be illegal ;)

The actual solution to that problem, of course, is to use smaller generators?


Or you can fudge your emissions testing.


> AWS really likes to own the entire stack.

I wish you and I could own the full stack on our hardware, too. Imagine free drivers, firmware and microcode, full schematics and specification of all the parts. That would approach my atheist's heaven.


I'm not an expert, but a high-power device's safe load means that it won't place a lot of the load on itself. When it places the load on itself it gets really hot, so hot that parts may melt, and things that were supposed to be insulated may now be conductive.

Doing high voltage electronics without a license is a crime in a lot of places. I'm pretty sure a person with the right specialty can add a load so you won't need to hack up the firmware to violate its safe operating ranges.

If someone told me this in person I'd consider contacting authorities over their negligent behaviour. The manufacturer was willing to lose business for a safety matter even though allowing the firmware hacks would be really easy.


Amazon spends billions on energy and has hundreds of hardware and electrical engineers. They are one of the best if not the best in data center engineering. They know what they are doing.


AWS had an identical failure with their ATS (automatic transfer switch) system, which took out an availability zone. After some forensics and a code review of the PLC software, they realized it was an omission in that system.

They now make their own if I remember correctly.

https://image.slidesharecdn.com/cpn208failuresatscale-121129...


Very transparent, kudos for that. A reminder to everyone, though, that the brochure doesn't always match reality. OVH pitches itself as a global data center leader, with 20 facilities, 2 of them amongst the largest in the world. Not to harp on them, just to say that having some backup plan on a separate provider is always a good idea, no matter who your primary is.


Also, when you decide on a backup, make sure your backup is not hosting their servers in the same DCs as your primary, or, worse yet, renting their servers from the same people you rent your primary from.


Your backup site should be in a different region - e.g. if your primary is eastern US your backup should be central or western US. Otherwise you are vulnerable to various natural and man-made disasters.


My lesson in this was not servers, but networks. Back in the mid 90's at some point, Europe largely lost network connectivity to the US for about a day (this is all from recollection; I ran an ISP at the time, and we were affected, but I'm sure I've gotten the occasional detail wrong). Not entirely - there was significant capacity that was technically operational, but in practical terms large subsets of European networks were cut off.

The reason was a problem with one of Sprint's cables. It was carrying a lot of Nordunet traffic. Nordunet is/was the Nordic joint university network infrastructure. At the time, for historical reasons, Nordunet had more network capacity than most of the rest of Europe combined (consider that Norway was the second international connection to Arpanet in the early 70's due to the seismic array at Kjeller outside Oslo that provided essential data on Soviet nuclear tests; this made money flow to network research in Norway, and a joint Nordic effort led to significant investment in the other Nordic countries too; couple it with early work in Sweden, where they ended up hosting D-Gix - for many years the largest internet exchange point outside the US).

So Sprint's cable went down, and everyone started re-routing. Problem #1: Everyone had high capacity to D-Gix or other exchange points in the Nordic countries. These connections were now flooded by traffic from the Nordic networks, where people were used to being able to saturate their 100Mbps network connections (it was such a downer starting my ISP, and going from 100Mbps at university to 512kbps aggregate outbound capacity at our offices). Except most of the links connecting to D-Gix were 34Mbps or less, and there were not that many of them, and they were connecting entire universities or entire countries...

So connectivity within Europe started slowing down.

Problem #2: Most of the links from elsewhere in Europe to the US were 8Mbps or less, with maybe a couple faster than that, and there were not that many of them. Certainly not enough to compensate for several hundred Mbps of lost capacity.

Problem #3: Many of those links got so saturated and slow that failsafes were tripped and traffic started getting routed to their expensive backup links (e.g. people paying for a port and then paying for bursts on a 95th percentile basis and the like, with the expectation of normally not using them). Except, as it turns out, many of said backup links had been purchased from Sprint. Over that cable.

Today the number of cables and overall capacity is vastly higher, with many more providers, so I don't think the same could happen again, and D-Gix is no longer as important as it was (though it's still one of the largest international interchange points), but as we sat there pinging and trying to figure out what was up, it was a vivid lesson in ensuring your backups are actually sufficiently separate from your main systems that they stand a chance of working.


This type of outage was common in the US in the 90s as well. Many large (tier 1) ISPs found out that their redundant fiber links were not in redundant bundles, that other tier 1s also had important links in the same bundles, and that a backhoe had severed the bundle.

Mostly, when this happened, west coast users couldn't contact east coast servers (and vice versa), but occasionally, BGP would work out, and packets would get back and forth, but go around the world.


I have been a big OVH advocate (cheap prices + free DDoS protection + good hardware) and have been using them for years.

However, I am really frustrated by the lack of communication during this incident. All I had was the tweets from the founder, while their status page was down and I had no other place to reach out to them (phone + ticket system down).

What is worse is that they treat the individual SBGs (SBG1, SBG2, etc) as individual availability zones, when they truly are not. I had an application using their vrack on different SBGs availability zones and they all went down.

Anyway, glad they are fixing it, and glad for the clarification that they don't have real availability zones. Will design my applications better next time.


Can I ask if vrack itself stayed online with non-SBG instances inside? I'm just curious if this also affects their network-wide services like vrack or public cloud instances outside SBG.


If power to network gear was down, then it probably affected vrack connectivity, likely on the entirety of sbg.


I wonder how easy it is to really test these failover systems. Even if you disconnect the external power (which is risky, since you lose redundancy), it will be a "clean" disconnection, with the power going neatly to zero. I've found a few post-mortems of high-voltage faults, and the waveforms go crazy during the fault, which could lead to failure modes in the transfer switch which wouldn't be found on a "clean" disconnection.


When we had the building UPS replaced several years ago, the contractors accidentally left one of the neutral bars disconnected... everything was up and working fine, the switch was flipped to cut off mains and ensure the backup system could take over, and suddenly there were pops and smoke all over the place as hundreds of desktops, servers, network gear, etc. got a dose of ~430 V instead of 240 V. That was a very expensive test.


That applies to other kinds of reliability testing too, especially everything that goes through a network.

It's easy to handle when something is completely unavailable (i.e. instant error), but when something, be it a database or some endpoint, is available but horribly slow, that's a whole different thing.


Seems like we need systems designed with "suicide" mechanisms built in, so that if they detect that they have a poor quality of life (err, I mean, that they're providing a poor quality of service) they'll shut down completely.


The circuit-breaker pattern is commonly used to make the affected system appear offline to its upstream dependents - you don't need to actually shut the server down completely, and will probably want to log into it to see what's happening. If the dependent system can't function without the affected system, it should also indicate an error and further upstream systems should detect it.
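
A minimal sketch of that software circuit-breaker pattern in Python; the failure threshold and reset timeout are arbitrary illustrative values:

    import time

    # Minimal circuit-breaker sketch: after too many consecutive failures the
    # breaker "opens" and callers fail fast instead of hammering a sick dependency.

    class CircuitBreaker:
        def __init__(self, failure_threshold=5, reset_timeout_s=30):
            self.failure_threshold = failure_threshold
            self.reset_timeout_s = reset_timeout_s
            self.failures = 0
            self.opened_at = None      # None means the breaker is closed

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_timeout_s:
                    raise RuntimeError("circuit open: failing fast")
                self.opened_at = None  # half-open: allow one trial call
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()
                raise
            self.failures = 0          # success closes the breaker again
            return result

    # usage sketch: breaker = CircuitBreaker(); breaker.call(some_flaky_dependency)
    # (some_flaky_dependency is a hypothetical callable)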

In your house, you have the circuit-breaker pattern implemented in hardware. And upstream of your house there are many more layers of circuit breakers that generally increase in size until they reach a point where there is redundancy. A circuit-breaker going off in your house protects the other circuits in your house, a circuit-breaker going off on your street protects the rest of your neighborhood.

Industrial circuit-breakers are commonly a combination of hardware and software. Personally, I've lost more equipment due to brown-outs than any other cause. If you have equipment you don't want to experience a brown-out, program the breaker to cut off power completely.


You'd think that somebody'd start selling some kind of generator that you could hook up the systems to that would generate various forms of unclean power on purpose, precisely so that you could test those rare failure modes.


Why are backup generators so unreliable? And that is unreliable by the standard of a non-critical system, let alone that of an emergency system whose sole purpose is reliability.

Is it just a (non)-survivorship bias where we most commonly talk about the failure cases and they actually have a stellar 99.99999% record that doesn't make headlines?


I think they're just large, complicated physical things where stuff goes wrong, even if you're maintaining and testing.

My best story was as a young student turning up for my helpdesk shift at about 5:50am to a phalanx of fire engines and the Hazmat team.

The generator in the basement was maintained and tested. This time it had started when power went out, but there was a pump that filled a holding tank from a 10,000ish litre tank under the building (specced to run for a week, apparently). A pipe had cracked, and so it just kept pumping fuel onto the floor, the carpark, and into the local stream ...

The only person in the building overnight was the operator (tape changer and report runner) some 5 stories above all this, who eventually smelt it and investigated. It was a massive nightmare, and the place stank for weeks.

Then there was the time the halon system went off unnecessarily...


The particular failure mode you're talking about is reasonably easy to reliably remove: all fuel piping should be under lower than atmospheric pressure. You accomplish this by installing suction pumps next to consumers (and installing additional pumps if the height difference is too large). That way, if a pipe breaks (except for the short pieces of pipe between the pump and the consumer), there will be no sustained fuel leak.


Building constraints don't always make it that easy -- the generator could be below the tank level, so siphon action can still siphon the tank empty, or it can be a few floors above the tank, too high to suck fuel from the tank so it has to be pushed up from below.


> the generator could be below the tank level

This is what anti-siphon valves are used to prevent: they are valves that open only when sufficient suction is present in the fuel line leaving the tank. Thus, when the line breaks or the pump fails, the valve will close.

> or it can be a few floors above the tank, too high to suck fuel from the tank so it has to be pushed up from below.

True, it has to be pushed up then. Still, you can have a sequence of suction pumps. Note that otherwise you'd need to have high-pressure fuel piping, which might cause more problems than the possibility of feeding a fire indefinitely.


So now the system fails due to frothy diesel. Or due to failures of the additional pumps (adding mechanical complexity to a system often decreases overall reliability)


Yes. This is a tradeoff between a chance of catastrophic failure (fire fed by fuel from the storage tank) and "normal" failure.


I have a similar story of a smallish colocation site located in the cellar of a high school building in Switzerland where we hosted two servers for a small company ten years ago. The operator company had a generator at the site. During a thunderstorm there was a power outage, so at first all was fine. However, a nearby lightning strike destroyed some critical circuitry of the generator. I additionally made the mistake of putting the watchdog system in the same colocation, so I didn't even get a notification about the outage. But after all, I would have been powerless anyway.


Most of the time, it's not the actual generator that fails. It's often either the part that detects power failure (fluctuations, brownouts, ... are harder to detect than total failure) or the switchover hardware to the generator. If you want to switch 20 kV, you're really handling an awful lot of energy. Physics kinda gets in the way there.

Switching packet-based networks is often much easier because you can just hold on to the packet and retransmit if a line fails; that's why it's much rarer to have total network failures.


To add to this, backup power systems are in fairly widespread deployment and bigger installations are tested and maintained on a schedule. You'll never see "hospital transferred emergency-available circuits to backup system; no malfunction" in the news, yet it happens every few weeks in every installation.

Generators themselves are very reliable machines. The engines are cast iron (instead of the cast aluminium found in most cars for the last two decades or so) and are nowhere near as close to the edge of material capacity as car engines. Electrical generators are essentially just a big blob of metal; they are simple, efficient and very reliable technology giving many decades of service. Generator controllers are designed to be reliable and aren't terribly complicated, either.


Apparently OVH did test their generator system but the outage detector or controller failed.


Hmm. Makes me wonder.

If generators really are just a giant blob of metal and very simple (and I have no reason to disbelieve the GP comment)... well... it could be kind of interesting to build an open source software stack to handle switchover. Because, disclaimers notwithstanding, the code would ostensibly be super simple too. So even if it couldn't officially be used directly, it would certainly provide a good base for engineers to copy-paste, either literally or ideologically (and then thoroughly verify, of course!).

Okay... thinking about it, I'm probably wrong - either generator control has some fundamental intricacies that make it not-completely-simple, or all sites have edge cases that have to be baked into the firmware by a system integrator/electrical engineer.
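
For a sense of scale, here is a toy sketch of the kind of decision loop such firmware might contain. All thresholds and timings are invented for illustration; real switchgear has to cope with the messy fault waveforms described elsewhere in the thread, which is presumably where the complexity lives:

    # Toy automatic-transfer-switch decision loop. All thresholds and timings are
    # invented for illustration; real gear must cope with brownouts, phase loss,
    # harmonics and sub-50ms glitches.

    NOMINAL_V = 230.0
    NOMINAL_HZ = 50.0
    V_TOLERANCE = 0.10          # +/-10% voltage window (assumed)
    HZ_TOLERANCE = 0.02         # +/-2% frequency window (assumed)
    CONFIRM_SECONDS = 5         # how long the anomaly must persist before acting

    def utility_healthy(volts: float, hz: float) -> bool:
        return (abs(volts - NOMINAL_V) <= NOMINAL_V * V_TOLERANCE
                and abs(hz - NOMINAL_HZ) <= NOMINAL_HZ * HZ_TOLERANCE)

    def evaluate(samples_per_second):
        """samples_per_second: list of (volts, hz) tuples, one per second."""
        unhealthy_run = 0
        for volts, hz in samples_per_second:
            if utility_healthy(volts, hz):
                unhealthy_run = 0
            else:
                unhealthy_run += 1
                if unhealthy_run >= CONFIRM_SECONDS:
                    return "start generator, transfer load"
        return "stay on utility"

    print(evaluate([(230, 50)] * 10))                   # healthy feed
    print(evaluate([(230, 50)] * 3 + [(150, 48)] * 6))  # sustained sag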

I say this because I'm (genuinely) trying to figure out why the PLCs failed - both in OVH's case, and in AWS's case (see elsewhere in this thread, https://news.ycombinator.com/item?id=15676189).

It's obvious there are crazy but legit reasons for this kind of thing to happen. I'm very curious what the complexity scale here is.


The actual diesel engines and generators are simple metal. Not the switching circuits. If the electric controllers can't handle the load/switching speed/whatever, no clever software and/or reliability in the actual generators will help.


Ah.

I've seen what switching 20kV looks like in videos - yeah, that kind of thing requires very careful design, and is invariably going to come entangled with a PLC-style controller, as is the norm for industrial equipment.


We've had this happen in our small onsite datacenter. A single phase out of the 3 changed waveform after a lightning strike on a utility pole. The power monitor that turns on the generator didn't see this as a problem, but the UPS did. So the UPS was on for several minutes and I noticed some of the ceiling lights out that depended on that phase. Fortunately I and another employee were there and turned on the manual transfer to put us on generator power.

Disaster backups are hard to test as to do it properly they have to be integration tested, and doing so might break production. Sure your generator might power on for 5 minutes a month and make people happy, but there are many unforeseen things that are hard to test for.


>> Most of the time, it's not the actual generator that fails.

In our case, we had a generator run out of fuel. Another time we had one have the battery die (they're started by batteries just like a car engine)


Similarly, many of us make backups in IT, but how many actually try them out and restore?


waves hand

Every day. All of them (automated).


Can you turn/expand that into a somewhat longer answer, a list of links that you perused to build that or .. a post on its own?


Brains and know-how my good man, not copypasta.


We have yearly recovery exercises for randomly chosen subsets of servers and databases.


In this case the generators themselves worked, just not the mechanism to cut over to them.


Generally called a "static switch", and they are prone to issues. Here's one: https://www.aegps.com/en/products/ups-accessories/static-tra...


Are they automatically a single point of failure or can you install several of those on one site?


I include that bit of the system in "backup generator" to distinguish it from "generator".


Because it's "backup" hardware. Like a lot of insurance, it's often just a false sense of security.

It would be interesting to run a datacenter completely on batteries that are charged from solar + the grid + gas generators. You could run the generators at optimal power output and use them daily. The batteries required for ~24-48 hours might still be too expensive to make this possible. Maybe some kind of super low power datacenter could pull this off. One day it shall be mine.


People. In my experience faulty generators come from bad maintenance. We have one in every site and 2 generators in locations where there is no electricity (mostly Caterpillar). They mostly fail due to lack of/bad maintenance.

Like others said, yes they are big and with lots of moving parts. But if you take care of them (every month by the book) no one should have any unexpected problems.

For me it is mind blowing that OVH did the last power failure test in MAY. WHY so rare? IMO that is their 1st mistake.


> SBG's latest test for backup recovery were at the end of May 2017. During this last test, we powered SBG only from the generators for 8 hours without any issues and every month we test the backup generators empty.

They do seem to test the generators every month.


My datacenter has 2 power channels, coming from two different power grids. They have never had a full outage, even when a substation two blocks away started on fire.

Even if we lost power on one power channel, we would still be operational. Even if the backup power failed. That's the reason you use redundant power.

If your "cloud provider" is charging you a lot but not providing your server redundant power from two grids to redundant PSUs, I don't think you're getting the best available design.

In this case, it looks like OVH just threw all the power onto a single channel, which allowed a failover bug to bring down the datacenter.


The power grid in and of itself is a highly interconnected and redundant system. The fact that a substation fire didn't bring down your datacenter doesn't necessarily mean that the "redundant" setup was responsible.

Unless the two grids are fully isolated (e.g., from two different countries), there is always going to be some overlap in the two power sources.


The trick to building good datacenters is you make as many things redundant as possible, and it reduces the chance of one thing taking the entire datacenter down. Being connected to two external power providers is not the reason our datacenter stayed up, it's one of the many carefully designed redundancies that all add up to higher availability.

When you have an A/B isolated power datacenter, you can (and my datacenter regularly does) shut off each channel to test that the backup systems are working, without worrying that a failed test will bring down literally every customer in the datacenter. That testing would have likely caught this problem at OVH if they were able to do it, but they instead decided to just toss all the power into a single channel, making it impossible to do testing like this without a decent chance of widespread outages. So they just had to assume it would work, until it didn't.

Because our power channels are isolated, there's almost no difference between turning each one off separately, and then turning both of them off at the same time. So if we were to lose external power on both (which would be a history books blackout for how this DC gets its power, including an almost direct connection to a hydro-electric dam), you would have two well-tested backup power channels ready to go. Nothing is ever a guarantee, but I like my chances.


Even two different countries is no guarantee of isolation. The 2009 blackout affected both Paraguay and a large portion of Brazil. If two countries are near enough that it's feasible to get power from both, they're near enough that their power networks might be interconnected.


Great summary. That's for the SBG outage, not the RBX one though.


Details about the RBX outage were already posted earlier: http://travaux.ovh.net/?do=details&id=28244#comment35217


According to their maintenance postings, RBX was a software problem on the optical networking equipment:

"RBX: We had a problem on the optical network that allows RBX to be connected with the interconnection points we have in Paris, Frankfurt, Amsterdam, London, Brussels. The origin of the problem is a software bug on the optical equipment, which caused the configuration to be lost and the connection to be cut from our site in RBX. We handed over the backup of the software configuration as soon as we diagnosed the source of the problem and the DC can be reached again. The incident on RBX is fixed. With the manufacturer, we are looking for the origin of the software bug and also looking to avoid this kind of critical incident. "


They gave more details in the French version. The equipment is Cisco NCS2000.


Let's hope that the second equipment they ordered is not Cisco so that they have full redundancy.


Datacenter I was using for 5 years: once per month, employees on site, power is cut/disrupted somehow, UPS kicks in, within 60 seconds the 750 kW diesel genset is supposed to be running and taking load.

Zero problems in that 5 year period due to power...


Every small datacenter has a fake diesel generator that doesn't work. I worked for one datacenter and they purposely bought a non-working diesel generator and moved it into a room with the sole purpose to pitch it to investors that they had a working secondary power source.


How on earth did you go from "I saw this thing once" to "everyone does it"?


"Every"? I know for a fact that we don't because we've fallen back to the redundant generator before and it's worked flawlessly. (pity the same couldn't have been said for some of our other redundant systems).


I have a hard time understanding why they needed humans babysitting the restart of servers and services. Servers and services are supposed to automatically restart after the power came back. This implies OVH doesn't regularly test server restarts with something like Netflix Chaos Monkey.

Moreover, the power lines were not really redundant in SBG, and regarding the network downtime in RBX, it looks like the network core is not really redundant either.

I'd be curious to know how competitors like AWS and Google Cloud mitigate this kind of problem.


Restarting a whole datacenter from scratch is different from restarting individual servers. If a service needs another service running on another server, it may fail to start. For example, many servers may need the DHCP server. Some servers may need DNS. Some others may need Kafka to be up and running, etc. You may have some documentation on which servers to start first, but things are moving so fast that it's not unlikely you'll need to rely on monitoring to know which services were able to start correctly.
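
One way to express that ordering problem: a startup plan is essentially a topological sort of the service dependency graph. A small sketch (the service names and dependencies are made up for illustration; needs Python 3.9+ for graphlib):

    from graphlib import TopologicalSorter  # Python 3.9+

    # Sketch of dependency-ordered datacenter startup. The service names and
    # dependencies are invented for illustration; a real plan would be far larger.
    dependencies = {
        "dhcp": set(),
        "dns": {"dhcp"},
        "ntp": {"dns"},
        "kafka": {"dns", "ntp"},
        "monitoring": {"dns"},
        "app-servers": {"dns", "kafka"},
    }

    ts = TopologicalSorter(dependencies)
    print("start order:", list(ts.static_order()))
    # e.g. start order: ['dhcp', 'dns', 'ntp', 'monitoring', 'kafka', 'app-servers']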


I don't buy this argument. If a service needs another service running on another server which is not already available, it should retry with an exponential backoff. Doing anything else is a bug.
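
A minimal sketch of that retry-with-exponential-backoff behavior; the base delay, cap and jitter are arbitrary illustrative values:

    import random
    import time

    # Minimal retry-with-exponential-backoff sketch. Jitter is added so that
    # thousands of servers coming back at once don't retry in lockstep.

    def call_with_backoff(fn, base_delay_s=1.0, max_delay_s=60.0):
        attempt = 0
        while True:
            try:
                return fn()
            except ConnectionError:
                delay = min(max_delay_s, base_delay_s * 2 ** attempt)
                delay *= random.uniform(0.5, 1.0)   # jitter: avoid thundering herd
                time.sleep(delay)
                attempt += 1

    # Usage sketch: keep trying a dependency until it comes back after an outage.
    # call_with_backoff(lambda: connect_to("kafka:9092"))  # connect_to is hypothetical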


I agree it's a bug, but this is far, far more difficult to test against with a chaos monkey than one might think. With a whole datacenter down, not to speak of multiple availability zones, problems and dependencies will crop up that are extremely difficult to anticipate. Maybe you expected one other AZ to be up so you could provision servers from there, but now you have to somehow get one of them bootstrapped. Or maybe during the outage power was cycled on some equipment one too many times, causing some failsafe to trip, or maybe some of the core infrastructure like your switches and routers expect some core service to be up, but that core service can't get online without working network.


Thanks for the great explanation. I agree that a Chaos Monkey tests something a lot easier to recover from than a whole datacenter going down.

I remember having read somewhere about some company (Facebook/Google/Amazon/Twitter/Dropbox or something at the same scale) that regularly simulates a whole datacenter failure, which made me believe it is possible to automatically recover from this.

Are you saying that even the companies I mentioned have the same issues as OVH when they recover from a complete power failure?


Simulated DC failure is more often than not just traffic flow engineering. It is more about testing the DC that takes over the traffic than it is about testing service restart in the inactive DC.

There is little to test about the introduction of a hard fault, but the service resumption in the other DC is full of data to analyze. Also, in such a setup, getting the fault location running again is not on a hard clock, since it is about restoring redundancy instead of the service.


> I remember having read somewhere about some company (Facebook/Google/Amazon/Twitter/Dropbox or something at the same scale) that regularly simulates a whole datacenter failure

It was Google: http://queue.acm.org/detail.cfm?id=2371516 ("Weathering the Unexpected" [2012])


Thanks! This is the article I had in mind.


OVH runs servers for their clients, do you really expect them to randomly restart their clients' servers? Netflix is not a hosting provider, the situations are not comparable.


The comment in the OVH post-mortem suggests it is about internal servers and services. You're right about customer bare-metal servers. They can't restart them randomly.


I would be pissed if my hosting provider randomly restarted my servers, taking them offline for minutes each time.


So would I, but that actually makes me realize that I should be doing more of what netflix is doing too and ensure I can randomly restart servers without affecting service.


Isn't Chaos Monkey just randomly restarting virtual servers on Amazon?

Kudos to Netflix, but restarting a virtual server vs a physical or a whole data center are different things.

I think every company that cares a bit about high availability knocks out stuff randomly, or at least in different ways, and does stuff like introducing packet loss, etc. It's one thing to test that at the service/virtual server layer and another thing at layers close to the physical hardware.

Of course, one should test that too and it's nowhere near impossible, but Chaos Monkey is for a somewhat different use case.

Also the "article" mentions that tests are done.


Some of you are commenting that asking OVH to randomly restart customer servers would be silly. I agree.

My comment was about servers managed by OVH. The post-mortem specifically says: "Since then, we have been working on restarting services. Powering the site with energy allows the servers to be restarted, but the services running on the servers still need to be restarted. That's why each service has been coming back gradually since 10:30 am. Our monitoring system allows us to know the list of servers that have successfully started up and those that still have a problem. We intervene on each of these servers to identify and solve the problem that prevents it from restarting."

The language used suggests that these are servers monitored and operated by OVH, not by random customers.


OVH monitors customers' servers and investigates if they stop responding to pings, unless you opt out of this. I'm guessing they were looking at this system to make sure all their customers' machines came online again after the incident.

Source: am an OVH customer


Yes, I'm aware of this monitoring service. Is OVH able to log in to your server to investigate what the issue is?


I have some VPSes there. All of them run independent services, no redundancy.

They can restart all the servers they want if they manage not to kill any of my processes, break network connections, or change IP addresses. If they can't (it's probably hard), it's up to me to implement a redundant architecture. Then maybe I'll also implement some random kill switch to prove that the system is resilient to failures.


Meanwhile in the real world...



