
OVH outage explained - pmontra
http://status.ovh.net/?do=details&id=15162#comment18119
======
tgtweak
AWS had an identical failure with their ATS (automatic transfer switch)
system, which took out an availability zone. After some forensics and a code
review of the PLC software, they realized it was an omission in that system.

They now make their own if I remember correctly.

[https://image.slidesharecdn.com/cpn208failuresatscale-121129...](https://image.slidesharecdn.com/cpn208failuresatscale-121129215220-phpapp01/95/cpn208-failures-at-scale-how-to-ride-through-them-aws-re-invent-2012-19-638.jpg?cb=1434493281)

------
patrickg_zill
Datacenter I was using for 5 years: once per month, with employees on site,
power is cut/disrupted somehow, the UPS kicks in, and within 60 seconds the
750kW diesel genset is supposed to be running and taking load.

Zero problems in that 5 year period due to power...

------
rmdoss
I have been a big OVH advocate (cheap prices + free DDoS protection + good
hardware) and have been using them for years.

However, I am really frustrated by the lack of communication during this
incident. All I had were the tweets from the founder, while their status page
was down and I had no other way to reach them (phone and ticket system were
also down).

What is worse is that they treat the individual SBGs (SBG1, SBG2, etc.) as
individual availability zones, when they truly are not. I had an application
using their vrack spread across different SBG availability zones, and they
all went down.

Anyway, glad they are fixing it, and glad for the clarification that they
don't have real availability zones. Will design my applications better next
time.

~~~
therealmarv
Can I ask if vrack itself stayed online with non-SBG instances inside? I'm
just curious if this also affects their network-wide services like vrack or
public cloud instances outside SBG.

~~~
tgtweak
If power to network gear was down, then it probably affected vrack
connectivity, likely across the entirety of SBG.

------
jijji
Every small datacenter has a fake diesel generator that doesn't work. I
worked for one datacenter, and they purposely bought a non-working diesel
generator and moved it into a room for the sole purpose of pitching to
investors that they had a working secondary power source.

~~~
otterley
How on earth did you go from "I saw this thing once" to "everyone does it"?

------
kyledrake
My datacenter has 2 power channels, coming from two different power grids.
They have never had a full outage, even when a substation two blocks away
caught fire.

Even if we lost power on one power channel, we would still be operational.
Even if the backup power failed. That's the reason you use redundant power.

If your "cloud provider" is charging you a lot but not providing your server
redundant power from two grids to redundant PSUs, I don't think you're getting
the best available design.

In this case, it looks like OVH just threw all the power onto a single
channel, which allowed a failover bug to bring down the datacenter.

~~~
Cyph0n
The power grid in and of itself is a highly interconnected and redundant
system. The fact that a substation fire didn't bring down your datacenter
doesn't necessarily mean that the "redundant" setup was responsible.

Unless the two grids are fully isolated (e.g., from two different countries),
there is always going to be some overlap in the two power sources.

~~~
kyledrake
The trick to building good datacenters is you make as many things redundant as
possible, and it reduces the chance of one thing taking the entire datacenter
down. Being connected to two external power providers is not _the_ reason our
datacenter stayed up, it's one of the many carefully designed redundancies
that all add up to higher availability.

When you have an A/B isolated power datacenter, you can (and my datacenter
regularly does) shut off each channel to test that the backup systems are
working, without worrying that a failed test will bring down literally every
customer in the datacenter. That testing would have likely caught this problem
at OVH if they were able to do it, but they instead decided to just toss all
the power into a single channel, making it impossible to do testing like this
without a decent chance of widespread outages. So they just had to assume it
would work, until it didn't.

Because our power channels are isolated, there's almost no difference between
turning each one off separately and turning both of them off at the same
time. So if we were to lose external power on both (which would be a blackout
for the history books, given how this DC gets its power, including an almost
direct connection to a hydroelectric dam), you would have two well-tested
backup power channels ready to go. Nothing is ever a guarantee, but I like my
chances.

------
tyingq
Very transparent, kudos for that. A reminder to everyone, though, that the
brochure doesn't always match reality. OVH pitches itself as a global data
center leader, with 20 facilities, 2 of them amongst the largest in the world.
Not to harp on them, just to say that having some backup plan on a separate
provider is always a good idea, no matter who your primary is.

~~~
bringtheaction
Also, when you decide on a backup, make sure your backup is not hosting
their servers in the same DCs as your primary, or worse yet, that your backup
is renting their servers from the same people you rent your primary from.

~~~
vidarh
My lesson in this was not servers, but networks. Back in the mid 90's at some
point, Europe largely lost network connectivity to the US for about a day
(this is all from recollection; I ran an ISP at the time, and we were
affected, but I'm sure I've gotten the occasional detail wrong). Not entirely
- there was significant capacity that was technically operational, but in
practical terms large subsets of European networks were cut off.

The reason was a problem with one of Sprint's cables. It was carrying a lot
of Nordunet traffic. Nordunet is/was the Nordic joint university network
infrastructure. At the time, for historical reasons, Nordunet had more
network capacity than most of the rest of Europe combined (consider that
Norway was the second international connection to Arpanet in the early 70's,
due to the seismic array at Kjeller outside Oslo that provided essential data
on Soviet nuclear tests; this made money flow to network research in Norway,
and a joint Nordic effort led to significant investment in the other Nordic
countries too; couple that with early work in Sweden, where they ended up
hosting D-Gix - for many years the largest internet exchange point outside
the US).

So Sprint's cable went down, and everyone started re-routing. Problem #1:
Everyone had high capacity to D-Gix or other exchange points in the Nordic
countries. These connections were now flooded by traffic from the Nordic
networks, where people were used to being able to saturate their 100Mbps
network connections (it was such a downer starting my ISP and going from
100Mbps at university to 512kbps aggregate outbound capacity at our offices).
Except most of the links connecting to D-Gix were 34Mbps or less, and there
were not that many of them, and they were connecting entire universities or
entire countries...

So connectivity within Europe started slowing down.

Problem #2: Most of the links from elsewhere in Europe to the US were 8Mbps
or less, with maybe a couple faster than that, and there were not that many
of them. Certainly not enough to compensate for several hundred Mbps of lost
capacity.

Problem #3: Many of those links got so saturated and slow that failsafes were
tripped and traffic started getting routed to their expensive backup links
(e.g. people paying for a port and then paying for burst on a 95th percentile
basis and the like, with the expectation of normally not using them). Except,
as it turns out, many of said backup links had been purchased from Sprint.
Over _that_ cable.

Today the number of cables and overall capacity is vastly higher, with many
more providers, so I don't think the same could happen again, and D-Gix is no
longer as important as it was (though it's still one of the largest
international interchange points), but as we sat there pinging and trying to
figure out what was up, it was a vivid lesson in ensuring your backups are
actually sufficiently separate from your main systems that they stand a chance
of working.

~~~
toast0
This type of outage was common in the US in the 90s as well. Many large (tier
1) ISPs found out that their redundant fiber links were not in redundant
bundles, that other tier 1s also had important links in the same bundles, and
that a backhoe had severed the bundle.

Mostly, when this happened, west coast users couldn't contact east coast
servers (and vice versa), but occasionally, BGP would work out, and packets
would get back and forth, but go around the world.

------
cesarb
I wonder how easy it is to really test these failover systems. Even if you
disconnect the external power (which is risky, since you lose redundancy), it
will be a "clean" disconnection, with the power going neatly to zero. I've
found a few post-mortems of high-voltage faults, and the waveforms go crazy
during the fault, which could lead to failure modes in the transfer switch
which wouldn't be found on a "clean" disconnection.

~~~
gedrap
That applies to other kinds of reliability testing too, especially everything
that goes through a network.

It's easy to handle when something is completely unavailable (i.e. instant
error), but when something, be it a database or some endpoint, is available
but horribly slow, that's a whole different thing.

~~~
cperciva
Seems like we need systems designed with "suicide" mechanisms built in, so
that if they detect that they have a poor quality of life (err, I mean, that
they're providing a poor quality of service) they'll shut down completely.

~~~
smoyer
The circuit-breaker pattern is commonly used to make the affected system
appear off-line to its upstream dependents - you don't need to actually shut
the server down completely, and you will probably want to log into it to see
what's happening. If the dependent system can't function without the affected
system, it should also indicate an error, and further upstream systems should
detect it.

In your house, you have the circuit-breaker pattern implemented in hardware.
And upstream of your house there are many more layers of circuit breakers that
generally increase in size until they reach a point where there is redundancy.
A circuit-breaker going off in your house protects the other circuits in your
house, a circuit-breaker going off on your street protects the rest of your
neighborhood.

Industrial circuit-breakers are commonly a combination of hardware and
software. Personally, I've lost more equipment due to brown-outs than any
other cause. If you have equipment you don't want to experience a brown-out,
program the breaker to cut off power completely.
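
For illustration, a minimal sketch of the software side of this pattern in
Python; the class, method names, and thresholds here are made up for the
example rather than taken from any particular library:

    import time

    class CircuitBreaker:
        """Toy circuit breaker: after too many consecutive failures, calls
        are rejected immediately ("open") until a cool-down period passes,
        then a single trial call is allowed through ("half-open")."""

        def __init__(self, max_failures=5, reset_after=30.0):
            self.max_failures = max_failures
            self.reset_after = reset_after
            self.failures = 0
            self.opened_at = None  # None means the breaker is closed

        def call(self, func, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_after:
                    # Open: fail fast, so callers see the dependency as
                    # "offline" instead of hanging on a slow upstream.
                    raise RuntimeError("circuit open - dependency treated as down")
                # Cool-down elapsed: allow one trial call through (half-open).
            try:
                result = func(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()
                raise
            else:
                self.failures = 0
                self.opened_at = None
                return result

Combined with a timeout on the wrapped call, this also turns the "available
but horribly slow" case mentioned above into a fast, explicit failure rather
than a hang.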

------
jedisct1
Great summary. That's for the SBG outage, not the RBX one though.

~~~
praseodym
Details about the RBX outage were already posted earlier:
[http://travaux.ovh.net/?do=details&id=28244#comment35217](http://travaux.ovh.net/?do=details&id=28244#comment35217)

~~~
folbec
According to their maintenance postings, the RBX problem was a software bug
in the optical fiber equipment:

"RBX: We had a problem on the optical network that allows RBX to be connected
with the interconnection points we have in Paris, Frankfurt, Amsterdam,
London, Brussels. The origin of the problem is a software bug on the optical
equipment, which caused the configuration to be lost and the connection to be
cut from our site in RBX. We handed over the backup of the software
configuration as soon as we diagnosed the source of the problem and the DC can
be reached again. The incident on RBX is fixed. With the manufacturer, we are
looking for the origin of the software bug and also looking to avoid this kind
of critical incident. "

~~~
jedisct1
They gave more details in the French version. The equipment is Cisco NCS2000.

~~~
dx034
Let's hope that the second equipment they ordered is not Cisco so that they
have full redundancy.

------
ngrilly
I have a hard time understanding why they needed humans babysitting the
restart of servers and services. Servers and services are supposed to restart
automatically after the power comes back. This implies OVH doesn't regularly
test server restarts with something like Netflix's Chaos Monkey.

Moreover, the power lines were not really redundant in SBG, and regarding the
network downtime in RBX, it looks like the network core is not really
redundant either.

I'd be curious to know how competitors like AWS and Google Cloud mitigate
this kind of problem.

~~~
icebraining
OVH runs servers for their clients, do you really expect them to randomly
restart their clients' servers? Netflix is not a hosting provider, the
situations are not comparable.

~~~
wereHamster
I would be _pissed_ if my hosting provider randomly restarted my servers,
taking them offline for minutes each time.

~~~
drdebug
So would I, but that actually makes me realize that I should be doing more of
what Netflix is doing too, and ensure I can randomly restart servers without
affecting service.

~~~
tete
Isn't choasmonkey just randomly restarting virtual servers on Amazon?

Kudos to Netflix, but restarting a virtual server vs a physical or a whole
data center are different things.

I think every company that cares a bit about high availability knocks out
stuff randomly or at least in different ways, does stuff like introducing
packet loss, etc. It's another layer and another thing to test that on
service/virtual server layer than on close to physical layers.

Of course, one should test that too and it's nowhere near impossible, but
Chaosmonkey is for a somewhat different use case.

Also the "article" mentions that tests are done.

------
duncanawoods
Why are backup generators so unreliable? And that is unreliable by the
standards of a non-critical system, let alone those of an emergency system
whose sole purpose is reliability.

Is it just a (non)-survivorship bias where we most commonly talk about the
failure cases and they actually have a stellar 99.99999% record that doesn't
make headlines?

~~~
Xylakant
Most of the time, it's not the actual generator that fails. It's often either
the part that detects power failure (fluctuations, brownouts, ... are harder
to detect than total failure) or the switchover hardware to the generator. If
you want to switch 20kV, you're handling an awful lot of energy. Physics kind
of gets in the way there.

Switching packet-based networks is often much easier, because you can just
hold on to the packet and retransmit if a line fails; that's why total
network failures are much rarer.

~~~
blattimwind
To add to this, backup power systems are in fairly widespread deployment and
bigger installations are tested and maintained on a schedule. You'll never see
"hospital transferred emergency-available circuits to backup system; no
malfunction" in the news, yet it happens every few weeks in every
installation.

Generators themselves are very reliable machines. The engines are cast iron
(instead of the cast aluminium found in all cars for the last two decades or
so) and are nowhere near as close to the edge of their material limits as car
engines. Electrical generators are essentially just a big blob of metal; they
are simple, efficient and very reliable technology, giving many decades of
service. Generator controllers are designed to be reliable and aren't
terribly complicated, either.

~~~
AstralStorm
Apparently OVH did test their generator system but the outage detector or
controller failed.

~~~
exikyut
Hmm. Makes me wonder.

If generators really are just a giant blob of metal and very simple (and I
have no reason to disbelieve the GP comment)... well... it could be kind of
interesting to build an open source software stack to handle switchover.
Because, disclaimers notwithstanding, the code would ostensibly be super
simple too. So even if it couldn't officially be used directly, it would
certainly provide a good base for engineers to copy-paste, either literally or
ideologically (and then thoroughly verify, of course!).

Okay... thinking about it, I'm probably wrong - either generator control has
some fundamental intricacies that make it not-completely-simple, or all sites
have edge cases that have to be baked into the firmware by a system
integrator/electrical engineer.

I say this because I'm (genuinely) trying to figure out why the PLCs failed -
both in OVH's case, and in AWS's case (see elsewhere in this thread,
[https://news.ycombinator.com/item?id=15676189](https://news.ycombinator.com/item?id=15676189)).

It's obvious there are crazy but legit reasons for this kind of thing to
happen. I'm very curious what the complexity scale here is.
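
For what it's worth, the happy-path switchover logic really does look trivial
on paper, which is probably why it feels buildable. Here is a deliberately
naive sketch (not how a real ATS/PLC is programmed - all the hard parts hide
behind the placeholder functions, and that is presumably where both the OVH
and AWS controllers failed):

    import time

    # Placeholder "hardware" functions; a real controller reads voltage and
    # frequency waveforms from both sources and drives interlocked breakers.
    def utility_power_ok() -> bool:
        return True    # stand-in: would sample the utility feed

    def generator_ready() -> bool:
        return False   # stand-in: would check genset speed/voltage

    def start_generator() -> None:
        pass           # stand-in: would signal the genset controller

    def transfer_load(source: str) -> None:
        pass           # stand-in: would operate the transfer switch

    def naive_ats_loop(poll_interval=0.1, confirm_outage_after=3.0):
        """Happy-path ATS logic: if utility power is gone for longer than a
        debounce window, start the generator, wait for it, transfer the load;
        transfer back once utility power returns."""
        outage_since = None
        on_generator = False
        while True:
            if utility_power_ok():
                outage_since = None
                if on_generator:
                    transfer_load("utility")  # real systems re-sync first
                    on_generator = False
            else:
                if outage_since is None:
                    outage_since = time.monotonic()
                elif (not on_generator
                      and time.monotonic() - outage_since >= confirm_outage_after):
                    start_generator()
                    while not generator_ready():  # real systems bound this wait
                        time.sleep(poll_interval)
                    transfer_load("generator")
                    on_generator = True
            time.sleep(poll_interval)

Detecting brownouts and distorted waveforms, interlocking breakers, ramping
the genset, and re-synchronizing to the grid all live behind those four
stubs, and none of that is simple.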

~~~
jotm
The actual diesel engines and generators are simple metal. Not the switching
circuits. If the electric controllers can't handle the load/switching
speed/whatever, no clever software and/or reliability in the actual generators
will help.

~~~
exikyut
Ah.

I've seen what switching 20kV looks like in videos - yeah, that kind of thing
requires very careful design, and is invariably going to come entangled with a
PLC-style controller, as is the norm for industrial equipment.

------
shrike
When I was at AWS we were using generators from a large commercial supplier.
We were constantly having issues with them refusing to take over if there
wasn't sufficient load. Doing so puts lots of stress on a generator and can
significantly shorten its life.

We went to the manufacturer and tried to get them to make a firmware change;
we wanted the generators to sacrifice themselves under most every circumstance
(short of danger to humans). Generators are an irrelevant cost compared to the
cost of an outage. The manufacturer refused, even when we offered to buy them
without warranties.

I don't know if they still do, but at that point AWS started to buy basically
the same generators straight from China and write their own firmware.

There is also a great story about an AWS authored firmware update on a set of
louver microcontrollers causing a partial outage.

AWS really likes to own the entire stack.

*It has been years since I left, details are a bit fuzzy but I think I've got these right.

~~~
melq
Do you know why it's harder for a generator to power an unloaded circuit?
Just curious.

~~~
saulrh
The generator's mechanical parts will perform more efficiently at some
speeds/loads than at others. Everything happening at a different temperature
and pressure will change fluid flows and forces and put different stresses in
different places. Combustion won't happen cleanly, oil and gases won't flow
the way they're designed to and the engine will get dirty, etc. If you're
running outside of the region the engine was designed to operate in you're
stressing it in ways it was not designed to handle, lowering efficiency and
shortening its lifespan.

(You _could_ run the generator at its favorite speed and sink excess power
into a dummy load. This'd waste a ton of fuel, though, enough so that I'm not
actually sure whether it'd be cheaper than eating an engine. That and I'm not
sure what a data-center-sized dummy load would look like.)

[https://en.wikipedia.org/wiki/Power_band](https://en.wikipedia.org/wiki/Power_band)

~~~
grecy
> _This'd waste a ton of fuel, though, enough so that I'm not actually sure
> whether it'd be cheaper than eating an engine._

I think you'll find fuel is extremely cheap, even when you're burning many
gallons per minute.

~~~
saulrh
Hmmm. I didn't actually do the numbers. Let's see...

A randomly-googled 2MW diesel generator consumes 50 gallons/hour at quarter
load and 160 gallons/hour at full load [1]. So let's say it's an extra 100
gallons/hour to run a 2 MW generator at full load instead of quarter load.
The OVH incident report [OP] says that their data center has two cables
coming in, each carrying 10 MVA (megavolt-amperes, roughly equivalent to
megawatts), giving us 20 MW as roughly the maximum power consumption of the
data center. Divide, multiply, and it'd cost 1000 gallons/hour to run at full
load, orrrr $3000/hour at the current price of diesel [2].

Yeah, that's pretty cheap, okay. You might run these generators for a few
hundred hours over the lifetime of the installation and they probably cost
millions of dollars, so you're not going to spend more on fuel than you'd
spend having to replace the generators early. Why _don't_ they have a dummy
load for that scenario?

[1]
[http://www.dieselserviceandsupply.com/Diesel_Fuel_Consumptio...](http://www.dieselserviceandsupply.com/Diesel_Fuel_Consumption.aspx)
[2]
[https://www.eia.gov/petroleum/gasdiesel/](https://www.eia.gov/petroleum/gasdiesel/)

edit: Now that I look at the math I did I realize that I didn't need to look
at OVH in particular at all, just compare the cost of fuel to the purchase
price of one of the diesel generators I was looking at. >_< Meh, math still
works, I'll leave it.
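
The back-of-the-envelope is easy to re-run; the figures below are only the
ones quoted above (the spec-sheet fuel consumption, the two 10 MVA feeds, and
a rough $3/gallon diesel price), not anything from OVH:

    # Re-running the back-of-the-envelope with the figures quoted above.
    GEN_CAPACITY_MW = 2.0           # one genset
    GAL_PER_H_QUARTER_LOAD = 50.0   # spec-sheet figure [1]
    GAL_PER_H_FULL_LOAD = 160.0     # spec-sheet figure [1]
    SITE_LOAD_MW = 20.0             # two 10 MVA feeds, treated as ~20 MW
    DIESEL_USD_PER_GAL = 3.0        # rough US price [2]

    gensets = SITE_LOAD_MW / GEN_CAPACITY_MW                        # 10
    extra_gal_per_h = GAL_PER_H_FULL_LOAD - GAL_PER_H_QUARTER_LOAD  # 110 each
    extra_cost_per_h = gensets * extra_gal_per_h * DIESEL_USD_PER_GAL

    print(f"{gensets:.0f} gensets, {gensets * extra_gal_per_h:.0f} extra gal/h, "
          f"${extra_cost_per_h:,.0f}/h to keep them fully loaded")
    # -> 10 gensets, 1100 extra gal/h, $3,300/h
    #    (the comment above rounds this to ~1000 gal/h and ~$3000/h)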

~~~
joelhaasnoot
Fuel is much more expensive in Europe. A liter of diesel is about €1.40, so
roughly $6.50 per gallon.

~~~
gaadd33
Does Europe tax diesel the same way no matter where it's used? I know in the
US diesel for generators is exempt from a lot of taxes resulting in it being
significantly cheaper in some states.

~~~
dageshi
In the UK, diesel is taxed differently for agricultural usage; it typically
has a red dye added to indicate it's for agricultural use only. I honestly
don't know about generators, though.

~~~
JoeMalt
You can use red diesel in a generator; I think the legality comes down to
whether it's being used to power a vehicle on a public road.

