
OVH Incident in Strasbourg - fvv
http://status.ovh.com
======
lode
More info on Twitter from OVH's CEO:
[https://twitter.com/olesovhcom](https://twitter.com/olesovhcom)

and on
[https://twitter.com/ovh_support_en](https://twitter.com/ovh_support_en)

"SBG: ERDF is trying to find out the default. 2 separated 20kV lines are down.
We are trying to restart 2 generators A+B for SBG1/SG4. 2 others generators
A+B work in SBG2. 1 routing room is in SBG1, the second in SBG2. Both are
down. "

"An incident is ongoing impacting our network. We are all on the problem.
Sorry for the inconvenience."

"SBG: 1 gen restarted."

"RBX: all optical links 100G from RBX to TH2, GSW, LDN, BRU, FRA, AMS are
down."

~~~
jakub_g
BTW this seems to be a better status page than the one submitted to HN (which
is 404ing)

[http://status.ovh.com/](http://status.ovh.com/)

~~~
roblabla
The status page was down during the outage.

~~~
jakub_g
If so, then it's just like Amazon's status page during the AWS outage [1].

Pro-tip: self-hosting status page is maybe not the best idea.

[1]
[https://twitter.com/awscloud/status/836656664635846656?lang=...](https://twitter.com/awscloud/status/836656664635846656?lang=en)

~~~
dspillett
Like the old Red Dwarf episode:

Lister: What's the damage Hol?

Holly: I don't know Dave. The damage report machine has been damaged.

------
qwerty69
It started with all our SBG servers going down simultaneously. Approximately
1h later all our RBX servers went down as well, including the OVH status page
and all other OVH web applications. Either their SBG and RBX data centers are
somehow connected, or those are indeed two independent incidents.

~~~
dx034
SBG was a power failure, followed by a generator failure. Not sure if they set
up their network infrastructure in a way that this could spread but I find it
hard to imagine that the outage in SBG triggered RBX going completely black.
Unless of course they store their configuration files in SBG.

~~~
blattimwind
It seems RBX was an unrelated failure corrupting the router configuration
(@olesovhcom).

SBG going down due to a quadruple power failure (both grid connections and
both generators) is quite spectacular.

------
arekkas
I moved away from OVH after I paid 3 months in advance (~$300) for a server
which burned down after a month and a half. They did not issue any refund
(data, blood, sweat and tears were lost that day). I had been an OVH customer
for 12 years.

Today, I'm glad to have moved away all my production environments as well.

~~~
dx034
I'm still at OVH (support is reasonable and prices are cheap) but I would
never trust one provider with all my infrastructure. In the end it always
turns out that there is a single point of failure, even if it's just the
billing department. Using two providers protects you from that, and if you
choose ones with good peering and free traffic, keeping both in sync is
relatively easy.

That's exactly the problem with services like AWS: traffic is too costly to
keep production live with another provider as well.

~~~
cjsuk
Spot on.

Having one AWS account scares the crap out of me as well. It’s never a good
thing if all your eggs are in one basket.

My money is on stuff spread across Bytemark, Linode and DigitalOcean with a DR
plan involving mostly automatic recovery.

AWS doesn’t get a look-in, as it is extremely costly to port away from
anything that isn’t bare metal and pipes.

~~~
z3t4
How do you do fail-over?

~~~
cjsuk
Mirror your data onto another provider continuously (log shipping/rsync),
then switch DNS.

Ansible works for this stuff as it allows you to can the task of "quick, get
me a production environment up on Linode!"

If you can afford some downtime you don't need a hot standby; just roll
everything out onto the new provider and you're done.

I've done this on very large scale environments and on small ones, and it's
achievable even for small organisations. The killer is to avoid anything you
can't run on bare metal servers.
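A minimal sketch of that setup, assuming a warm standby at a second provider
(the hostname, paths and playbook names are illustrative):

    # Continuously mirror application data to the standby box
    # (run from cron; host and paths are placeholders):
    rsync -az --delete /var/lib/app/ deploy@standby.example.net:/var/lib/app/

    # On failover, rebuild the production environment at the second
    # provider from the same playbooks used for the primary:
    ansible-playbook -i inventories/standby production.yml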

~~~
therealmarv
I take the same approach. Get a decent DNS provider that is independent of
your infrastructure, and take care of your Ansible scripts. That way, in an
emergency, you run a one-liner, have a new production server running, and
change the DNS settings.
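The DNS switch can be scripted as well. Purely illustrative, against a
hypothetical DNS provider API (the endpoint, token and record are invented):

    # Repoint www at the freshly provisioned standby IP; keep the TTL
    # low ahead of time so the change propagates quickly.
    curl -X PUT "https://api.dns-provider.example/zones/example.com/records/www" \
         -H "Authorization: Bearer $DNS_API_TOKEN" \
         -d '{"type": "A", "content": "203.0.113.7", "ttl": 60}'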

~~~
mschuster91
In addition I'd suggest getting a second domain in a TLD operated by a
different company in a different country than the primary TLD, and teaching
your customers/users that both are valid. This protects you from three things:

1) your DNS provider having issues (even Route53 sometimes has them,
[https://mwork.io/2017/03/14/aws-route53-dns-outage-
impacts-l...](https://mwork.io/2017/03/14/aws-route53-dns-outage-impacts-last-
almost-a-full-day/))

2) legal issues, when one of your domains gets seized or the provider gets
pressured into cancelling, as has happened with The Pirate Bay, Sci-Hub and
friends, with gambling sites, with sites hosting user-generated content that
may be illegal or frowned upon in some countries, or, most recently, with the
Nazi site Daily Stormer (although I'm glad it's down, it's a perfect example
of what can happen in a very short time frame)

3) (edit, after suggestion below) the entire TLD going down because the TLD
DNS provider has issues, which also happens from time to time.
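A quick way to sanity-check such a setup is to confirm that both domains
resolve independently (the domains here are placeholders):

    # Each domain should resolve on its own; if one TLD or DNS
    # provider is having a bad day, the other keeps working.
    for d in example.com example.io; do
        ip=$(dig +short "$d" | head -1)
        [ -n "$ip" ] && echo "$d -> $ip" || echo "WARNING: $d did not resolve"
    done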

~~~
cjsuk
Good advice. We had some serious trouble when .io went down a few weeks back.

------
wiz21c
Damn, every emergency power supply I have encountered (the big ones with fuel
and hundreds of batteries) has failed to start when it had to... Why is that?

~~~
gyaru
People not actually testing emergency equipment.

~~~
_wmd
While at $bigco we halted testing of the generation equipment because it was
sending DCs offline more often than it kept them up. Lawyers were involved;
things got ugly.

~~~
PuffinBlue
I'm _completely_ unfamiliar with electrical generators/power generation, so
take this question in the spirit of ignorance:

Is there not a way to test generators without actually having them power the
live datacenter infrastructure? I mean, simulate the exact generation and load
requirements that the generators will face?

I don't know if it's feasible to dump all that power to ground or whatever,
but that way you could test the generator under full load at will and identify
issues without impacting the (operationally) live datacenter itself.

This equates a bit in my head with the 'verifying the backup' bit you might
get in software, whereas actually using the generator to power the live
operational datacenter equipment would be more like 'restore from backup'.

I don't know if it's possible though.

~~~
sdfafhlska
You can if you have to. But then you're really only doing a fancy simulation.

I accompanied my dad (a power engineer) to a water purification plant where
they were testing new equipment for the backup generator. Their weekly tests
involved moving the entire plant to the diesel generator and running it off
backup power for a couple of hours (once you start a big generator you have
to let it run, or it won't last long).

Potential problems for your generator that a resistor bank won't capture
include power factor (phase shift from motors or switching supplies),
harmonics (from switching power supplies), and startup transients (from the
capacitors in every power supply).

All these things can trip the generator or, worse, burn it out.

So if you can't test with the real load, supersize it!

P.S. Every test is a simulation of reality. At Fukushima the diesel generators
flooded. Lesson: the unknown event that knocks out your grid can also knock
out your backup.

P.P.S. If you can, turning the load back on gently is very beneficial. Don't
flip the master switch that controls all your load; flip on part of it, wait
a while for the system to stabilize, then flip on the next part.
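The same staging idea applies when powering a datacenter back on after an
outage like this one; a sketch with invented PDU-control commands:

    # Bring racks back one power strip at a time so the inrush never
    # exceeds what the generators can absorb ("pductl" and the PDU
    # names are hypothetical):
    for pdu in rack-a1 rack-a2 rack-b1 rack-b2; do
        pductl on "$pdu"
        sleep 120   # let the generator stabilize before adding more load
    done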

~~~
merb
> P.S. Every test is a simulation of reality. At Fukushima the diesel
> generators flooded. Lesson: the unknown event that knocks out your grid can
> also knock out your backup.

Well, the lesson there was more that it's stupid to put your diesel
generators deep in the ground when they are supposed to survive sudden rises
in sea level. (I think it's never a good idea to do that; I've seen special
places for them even deep inside Germany, just because some panicky people
thought they could still be flooded by groundwater, etc.)

~~~
dredmorbius
The generators were placed low to ensure they would not be disabled by a major
earthquake, as shaking intensity increases with height above ground.

The basement was, according to plan, protected by a seawall.

The height of the seawall failed to take into account the fact that on a
subduction zone, as pressure builds, the land-side plate rises, and when the
earthquake relieving that pressure strikes, the land falls -- by as much as
several meters.

That, among other elements, proved sufficient to kick off the Fukushima
disaster.

In hindsight, placing the generators at ground level in an elevated location
might have been a better bet. Or locating the entire generating plant further
upslope.

~~~
merb
> In hindsight, placing the generators at ground level in an elevated location
> might have been a better bet. Or locating the entire generating plant
> further upslope.

Yeah, well, that's what I meant. They were buried deep... There were even
studies saying this was dumb:

\- [https://news.usc.edu/86362/fukushima-disaster-was-preventabl...](https://news.usc.edu/86362/fukushima-disaster-was-preventable-new-study-finds/)

\- [https://en.wikipedia.org/wiki/Fukushima_Daiichi_Nuclear_Powe...](https://en.wikipedia.org/wiki/Fukushima_Daiichi_Nuclear_Power_Plant#Power_plant_information)
(section end)

\- [http://carnegieendowment.org/2012/03/06/why-fukushima-was-pr...](http://carnegieendowment.org/2012/03/06/why-fukushima-was-preventable-pub-47361)

And basically TEPCO knew that. They were just too lazy (it was probably too
cost-intensive) to do something about it (i.e. placing the generators higher
or building watertight bunkers, submarine-style).

Besides the generators there was also the human failure part. (Most of the
time it's human failure. If I do not automate something, I might do it right
three times, and the fourth time I usually fail hard...)

------
fvv
UPDATE: not all datacenters are down. It just looks that way from Europe
because OVH routing hasn't been updated, so from our point of view everything
seems down, but it really isn't :)

~~~
dx034
I think they don't know themselves what works. The CEO said GRA is down while
I can access it without issues, but that could depend on where you connect
from.

~~~
qeternity
Where did he say GRA was down?

~~~
dx034
Looks like he deleted the tweet, but he wrote earlier that GRA was down along
with the others while BHS was up. I'd guess he is in RBX, couldn't reach GRA,
and assumed it was down until they realised that the RBX-GRA line was down.

EDIT: He did indeed delete the tweet; [1] is the URL in case anyone knows a
website that archives them quickly enough.

[1]
[https://twitter.com/olesovhcom/status/928536233311076353](https://twitter.com/olesovhcom/status/928536233311076353)

~~~
onestone
I have several servers in GRA which were unreachable for two hours. Might have
been due to the routing or DDoS-protection infrastructure.

------
fxaguessy
Network and RBX are UP again:
[https://twitter.com/olesovhcom/status/928556358353539072](https://twitter.com/olesovhcom/status/928556358353539072)
(but SBG's datacenters are still being restarted)

~~~
gerardnll
No it isn't.

~~~
dx034
Servers appear up; it's just the website that's struggling (probably everyone
logging in at the same time to file a ticket and complain).

------
therealmarv
Wow, yesterday I was playing with their public cloud because I was
considering choosing them. I had some connection problems with my private
networking there (deleted it more than once) and opened a ticket. If it was
me... sorry, haha. Not good advertising, but it can happen to anyone.

~~~
PuffinBlue
Huh, I was just looking at them too. Contrary to popular opinion, I kinda
prefer it when these things happen before I sign up, because in the
post-mortem whatever architectural failure led to the outage usually gets
corrected and you end up with a stronger service.

Usually.

~~~
cm2187
Not saying this is the case here, but that's also sometimes where you spot
amateurism and should run away. I remember a hosting provider a long time ago
(15 years) who was storing its backups on the same machine as the main data.
Guess how I found out!

~~~
PuffinBlue
Oh man, that sounds bad! You're right, sometimes these events do expose an
inability to handle failure, and that means walking away.

Thankfully, many times we're reminded that there are good people out there
working hard against difficult constraints, and they finally get their chance
to do things 'correctly' in the wake of the SHTF.

------
NiklasMort
Can't wait to read the detailed followup on this in a few days; it is always
interesting to see how such major outages happen.

------
jedisct1
Not "all datacenters". Only 2 of them. They have 22, not counting all the
POPs.

~~~
dx034
9 by their count (7xRBX+2xSBG). When this was posted, the CEO wrote that
everything except BHS (Canadian DC) was offline (tweet now deleted).
Presumably he was in RBX and noticed that they couldn't reach any of the other
DCs. So it looked like all DCs were down for a while.

~~~
jedisct1
9 buildings, 2 locations.

------
drchaos
This affects DNS as well, since domaindiscount24 (a rather large registrar in
Germany) happens to host all three of their nameservers with OVH.

Just in case you wonder why your sites don't work, even if you host them
somewhere else.
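You can check whether all of a domain's nameservers sit in one network; a
quick sketch (the domain is a placeholder):

    # List the nameservers and the addresses they resolve to; if every
    # NS lands in the same provider's address space, a single outage
    # there takes your DNS down with it.
    for ns in $(dig +short NS example.de); do
        echo "$ns -> $(dig +short "$ns" | head -1)"
    done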

~~~
notwedtm
This seems like really poor planning on domaindiscount24's part.

~~~
vultour
This seems like something I'd expect from someone called domaindiscount24.

------
pmontra
The status page is up again [http://status.ovh.net/](http://status.ovh.net/)

I'm pasting the report so far:

\-------------

FS#15162 — SBG

Attached to Project— Network

Task Type: Incident

Category: Strasbourg

Status: In progress

Percent Complete: 0%

Details

We are experiencing an electrical outage on Strasbourg site.

We are investigating.

Comments (2)

Comment by OVH - Thursday, 09 November 2017, 10:55AM

SBG: ERDF repaired 1 line 20KV. The second is still down. All gens are UP. 2
routing rooms coming UP. SBG2 will be UP in 15-20min (boot time). SBG1/SBG4:
1h-2h

Comment by OVH - Thursday, 09 November 2017, 12:04PM

Traffic is getting back up. About 30% of the IP are now UP and running.

\-------------

VPSes are still marked as down in the dashboard. I can't access mine.

~~~
pmontra
More:

Comment by OVH - Thursday, 09 November 2017, 12:44PM

Everything is back up electrically. We are checking that everything is OK and
we are identifying still impacted services/customers.

Comment by OVH - Thursday, 09 November 2017, 13:25PM

Hello, Two pieces of information,

This morning we had 2 separate incidents that have nothing to do with each
other. The first incident impacted our Strasbourg site (SBG) and the 2nd
Roubaix (RBX). In SBG we have 3 datacentres in operation and 1 under
construction. In RBX, we have 7 datacentres in operation.

SBG: In SBG we had an electrical problem. Power has been restored and services
are being restarted. Some customers are UP and others not yet. If your service
is not UP yet, the recovery time is between 5 minutes and 3-4 hours. Our
monitoring system allows us to know which customers are still impacted and we
are working to fix it.

RBX: We had a problem on the optical network that allows RBX to be connected
with the interconnection points we have in Paris, Frankfurt, Amsterdam,
London, Brussels. The origin of the problem is a software bug on the optical
equipment, which caused the configuration to be lost and the connection to be
cut from our site in RBX. We handed over the backup of the software
configuration as soon as we diagnosed the source of the problem and the DC can
be reached again. The incident on RBX is fixed. With the manufacturer, we are
looking for the origin of the software bug and also looking to avoid this kind
of critical incident.

We are in the process of retrieving the details to provide you with
information on the SBG recovery time for all services/customers. Also, we will
give all the technical details on the origin of these 2 incidents.

We are sincerely sorry. We have just experienced 2 simultaneous and
independent events that impacted all RBX customers between 8:15 am and 10:37
am and all SBG customers between 7:15 am and 11:15 am. We are still working on
customers who are not UP yet in SBG. Best, Octave

------
oelmekki
Btw, note for those who use OVH as their ISP like me (this is a thing in
France): your connection works, only the DNS servers do not.

Fix (debian-like):

    sudo apt-get install bind9

Then put in /etc/resolv.conf, if it's not already there:

    nameserver 127.0.1.1

This runs a local nameserver that you use directly for resolving.
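A quick sanity check that the local resolver actually answers (bind listens
on localhost by default):

    dig +short example.com @127.0.0.1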

Oh, obviously, you need working resolution to install the resolver :) Hope
you have a 4G connection available.

Alternatively, you can just use Google DNS:

    nameserver 8.8.8.8
    nameserver 8.8.4.4

------
tyingq
My OVH dedicated servers seem fine. Webservers, ssh, all working. They're all
in Canada, though.

~~~
rkachowski
My vserver also seems fine, uptime 528 days. The control panel seems to be
down however.

------
qeternity
All of our dozen or so bare metal boxes are up in GRA as well as all of our
cloud instances. However object storage is down.

------
dx034
They now posted their explanation [1] but I don't buy it. I find it hard to
believe that the RBX incident happened shortly after the SBG incident without
any connection between the two. They should have redundant networking (at
least that's what they say), so one corrupted DB in RBX shouldn't have brought
down the whole DC (or 7 DCs according to their system). Maybe they pulled
corrupt data from SBG because it was down, but I don't believe that at the
same time as a power failure, two redundant network nodes got corrupted
without any notice. Otherwise, wouldn't that mean that one hardware issue can
also bring down a whole region?

[1]
[http://status.ovh.net/?do=details&id=15162&PHPSESSID=7220be2...](http://status.ovh.net/?do=details&id=15162&PHPSESSID=7220be21848b5db440d2cb66c5ee7e14)

------
dx034
Some servers in GRA still appear to work, if that's of any help. All data
centres offline at once sounds more like an attack than a power failure in one
location. According to them, there was a power failure in SBG, but I don't see
how that should affect routing in data centres several hundred miles away.

[https://twitter.com/olesovhcom/status/928541667283623936](https://twitter.com/olesovhcom/status/928541667283623936)

EDIT: Maybe related to the Cisco issue?

[https://blogs.cisco.com/security/cisco-psirt-mitigating-
and-...](https://blogs.cisco.com/security/cisco-psirt-mitigating-and-
detecting-potential-abuse-of-cisco-smart-install-feature)

~~~
pfg
It seems more likely that their data centers aren't quite as isolated as they
thought they'd be. The outage also appears to be limited to their locations in
Europe.

~~~
dx034
I always wondered why they promote having 6 datacentres in Roubaix when
Google Maps shows that they're all within 50m. Can't be too much redundancy
there.

~~~
pyrale
It's their original site. The 6 DCs in Roubaix are probably there for more
storage capacity, not for redundancy.

~~~
dx034
Then they could call it one DC. A DC can have more than one building. But by
saying you have several data centres you imply redundancy, similar to AZs at
AWS. And (at least according to Amazon), two AZs are far enough away from
each other that one building could blow up without affecting the other.

------
jedisct1
Details here:
[http://travaux.ovh.net/?do=details&id=28244](http://travaux.ovh.net/?do=details&id=28244)

Apparently, the root cause of that issue is a critical software bug in Cisco
NCS 2000 transponders.

~~~
peterwwillis
> "Diagnosis: All the transponder cards we use, ncs2k-400g-lk9,
> ncs2k-200g-cklc, are in "standby" state. One of the possible origins of such
> a state is the loss of configuration. So we recovered the backup and put
> back the configuration, which allowed the system to reconfigure all the
> transponder cards."

Their interfaces lost their configuration, they re-applied the configuration,
and state came back. That does not equal a critical software bug.

> "One of the solutions is to create 2 optical node systems instead of one. 2
> systems, that means 2 databases and so in case of loss of configuration,
> only one system is down. If 50% of the links go through one of the systems,
> today we would have lost 50% of the capacity but not 100% of links."

This is a crap mitigation. They're still depending on the same hardware and
process that led to the first outage, only now there's more of it, so there
are more chances to fail.

If they had continuous configuration automation they would have detected when
the router's state changed, identified the missing bits, and applied
configuration.

"New" routers (as in, since 2011) have APIs and can even run code directly on
the router in order to fulfill these requirements. Cisco has multiple white
papers, and even provides complete products to manage and certify
configuration is applied as desired, even in cloud-agnostic multi-tier
networks. Even on old routers, practically all config management solutions out
there have plugins to manage Cisco routers.
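A naive sketch of the kind of drift check described above, assuming CLI
access over SSH (the host, user, paths and show command are illustrative; a
real deployment would use the vendor's management API rather than
screen-scraping):

    # Compare the device's running config against a known-good copy
    # and raise an alert on any drift:
    ssh netops@mgmt-node.example.net 'show running-config' > /tmp/running.cfg
    if ! diff -q /etc/netops/golden.cfg /tmp/running.cfg >/dev/null; then
        logger -t driftwatch "running config differs from golden copy"
        diff -u /etc/netops/golden.cfg /tmp/running.cfg
    fi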

It's also ridiculous that they had no access to remote hands. This is IT 101.

------
dorfsmay
Not "all"!

Maybe their main DCs, or their largest, but not all of them. I have virtual
servers in their Quebec DC (BHS) and it hasn't gone down since the last time I
rebooted it.

------
ashitlerferad
I have 30+ servers on OVH. All are online.

------
xmichael99
This happens to Internap almost weekly... I always wondered why they never
make it into the news.

------
dredmorbius
How Complex Systems Fail

[http://web.mit.edu/2.75/resources/random/How%20Complex%20Sys...](http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf)

------
ever1
Detailed report
[https://twitter.com/olesovhcom/status/928904373949919232](https://twitter.com/olesovhcom/status/928904373949919232)

------
perlgeek
A website that I host on ovh is up:
[https://sudokugarden.de/](https://sudokugarden.de/)

ovh.com looks down for me too.

You can check that it's hosted by OVH:

    $ whois $(dig sudokugarden.de +short)

~~~
seszett
It's in the Gravelines datacenter, which is not actually down contrary to what
the initial reports said (only Strasbourg and Roubaix are, and for two
different reasons).

------
fapjacks
Huh. I have services active on two dedicated machines from OVH in Canada, and
I was logged into both via SSH all night, and didn't have any interruption at
all.

------
r1ch
Looks like only their routing / network was down. My servers just came back up
and haven't experienced any power outage.

~~~
zimpenfish
Same here - my server was merely unreachable, not down.

------
nstricevic
I just moved 2 apps to OVH, so this was totally unexpected. My apps have been
unavailable for more than 7 hours.

Does this happen often with OVH?

------
askmike
My server hosted on OVH had some problems (DNS lookups) but has stayed up and
works fine right now.

EDIT: Hosted in EU.

------
gizzlon
Now this status page is down as well. Sucks to be them right now =/ (I'm in
Europe)

------
pavlakoos
I'm trying to find an ETA for solving the issue, but they didn't post one on
Twitter.

Does anybody know the ETA?

~~~
gyaru
They haven't posted any ETA yet.

~~~
pavlakoos
They're coming back up.

------
stevenh
My OVH servers in Canada and Australia are running fine.

My OVH servers in France are all inaccessible.

~~~
throw2016
Seems to be back up. Quite a large disruption for OVH. Hope we get a
postmortem.

------
jagermo
This has to be one of the least informative status pages I have ever seen.

------
treo
Looks like they are starting to come back up. My VPS is accessible again.

------
aerovistae
I had never heard of this company till I saw this post. Shrugged and thought,
"huh, wonder who that's affecting."

Opened up Age of Empires II... no connection. Went to the website for the
game servers... "Our provider, OVH, is down...."

Go figure.

~~~
zaarn
OVH is rather popular in Europe at least (alongside Hetzner and 1&1).

~~~
tyingq
The Canadian data center is also a very low-cost way to serve the US. I'm an
OVH fan. They have their quirks, but the pricing is great; you just make sure
you compensate for the quirks with backups and DR plans.

~~~
zaarn
I also like them from a privacy viewpoint.

My data is with me in Europe, and the company that has my data is in Europe
with me too. If I were using DO or AWS, my data might be in Europe, but
control over the data would be in the US, free for the three-letter agencies
to access given the lacking privacy laws there.

------
oron
Not all of them; I have some servers in Canada working OK.

------
thejosh
Sydney is fine.

------
KeitIG
I imagine Mr Good Guy at OVH telling some others:

"guys we have a single point of failure in our architecture with SBG, maybe we
should...

\- naaah it's fine, we do not have time nor resources"

Then shit happens.

 _edit: I have no idea what is happening exactly, but OVH being what it is, it
seems extremely weird that all datacenters can go down at the same time, and
it looks like a serious architecture problem to me (or backup systems, like
generators, not being correctly tested... whatever). I am really curious
about the eventual explanation of what happened exactly_

 _edit2: Why all the downvotes? Even the status page of OVH is down; do not
tell me that is good design. We are not here to be charitable, but realistic._

~~~
vabene1111
It's OVH: the hardware is good, the DDoS protection is good, the prices are
high.

but

support/administration does not work well. I have a lot of really weird
stories with them, from them plugging a keyboard into our server to reboot it
(without any reason) to taking down a server for a requested maintenance,
only to notice after 4 hours of downtime that they had not asked their bosses
whether they were even allowed to perform the maintenance requested (and then
not getting permission to do so after another 2 hours...).

For me it feels like there are some really deep issues somewhere in the whole
administration that make incidents like this no real surprise.

Problem is, most other providers don't work any better, so...

Everyone makes mistakes; let's just hope they learn from it.

~~~
tyingq
> It's OVH: the hardware is good, the DDoS protection is good, the prices are high.

The prices are high? Compared to what? Cheap is their raison d'être.

~~~
vabene1111
Compared to other dedicated server hardware; I'm not talking about business
cloud infrastructure, no idea about that.

Sorry if that caused confusion.

~~~
dx034
Who's cheaper on the dedicated side? Hetzner can be a bit cheaper than
soyoustart, but I'm not aware of anyone else (with reasonable quality).

~~~
zaarn
I've tried Hetzner, but the network peering to Telekom is subpar compared to
OVH. On Hetzner I got about 40Mbps up/down to my local computer, while on OVH
I can easily load my DSL to 100% without issues.

~~~
dx034
OVH is one of the few with direct peering to Telekom; most don't want to pay
for that. Otherwise Hetzner is good: their own network is limited but peering
works reasonably well. But apart from them I'm not aware of anyone with
cheaper dedicated prices than soyoustart.

------
contingencies
_To make error is human. To propagate error to all server in automatic way is
#devops._ \- @devopsborat

~~~
chii
just in case you followed the wrong devops borat, it's @DEVOPS_BORAT (the
other one is a spam bot).

~~~
waz0wski
Azamat suggest to follow for your Kazakh tech need

@DNS_BORAT [https://twitter.com/DNS_BORAT](https://twitter.com/DNS_BORAT)

@InfoSecBorat
[https://twitter.com/InfoSecBorat](https://twitter.com/InfoSecBorat)

@KanbanBorat
[https://twitter.com/KanbanBorat](https://twitter.com/KanbanBorat)

@mysqlborat [https://twitter.com/mysqlborat](https://twitter.com/mysqlborat)

@NetEng_Borat
[https://twitter.com/NetEng_Borat](https://twitter.com/NetEng_Borat)

@secure_borat
[https://twitter.com/secure_borat](https://twitter.com/secure_borat)

@SecurityBorat
[https://twitter.com/SecurityBorat](https://twitter.com/SecurityBorat)

@Sysadm_Borat
[https://twitter.com/Sysadm_Borat](https://twitter.com/Sysadm_Borat)

~~~
IgorPartola
Anyone here remember
[https://en.m.wikipedia.org/wiki/Bastard_Operator_From_Hell](https://en.m.wikipedia.org/wiki/Bastard_Operator_From_Hell)

~~~
dredmorbius
[https://www.theregister.co.uk/data_centre/bofh/](https://www.theregister.co.uk/data_centre/bofh/)

------
Sami_Lehtinen
Title is misleading. Only RBX and SBG were affected.

06:15 UTC SBG servers failed.

OVH network weathermap: [http://weathermap.ovh.net](http://weathermap.ovh.net)

Btw. First post:
[https://news.ycombinator.com/item?id=15660524](https://news.ycombinator.com/item?id=15660524)

~~~
tmikaeld
Sorry, I didn't see the timestamps! You were first! :D

~~~
Sami_Lehtinen
We're nerds, so how about checking the facts? Please explain to me if there
has been some kind of time anomaly lately.

My submit timestamp: 07:21:25

Your submit timestamp: 07:28:23

About the title: I did consider it for a while, because I wasn't sure how bad
the situation was. But from my own independent monitoring system I did see
that RBX and SBG servers were unavailable. Of course I also did some basic
troubleshooting and confirmation work before posting.

Btw, right now there's some network traffic present on the SBG network. Let's
hope the systems are soon up'n'running.

~~~
tmikaeld
Sorry, I didn't know how to check the timestamps.

Yeah, I expected this to clear up in a matter of minutes.

Now it seems to be a shitstorm of historic proportions...

~~~
Sami_Lehtinen
I'm waiting for the postmortem. I'm very curious to see, what was the root
cause of all this mess. As usual, there probably were several overlapping
causes.

------
metafunctor
Someone with access might wish to update the title of this post, because all
OVH datacenters are definitely not down.

~~~
dx034
But no one knows which DCs and services are down. They lost their internal
network and have no idea themselves.

~~~
metafunctor
I'm pretty sure they know exactly which services are down.

Even if they didn't, clearly many services are up and running normally, so
saying "all datacenters are down" is just a lie.

------
Hates_
Trending on Twitter with the hashtag #OVHGATE

[https://twitter.com/hashtag/OVHGATE?src=hash](https://twitter.com/hashtag/OVHGATE?src=hash)

~~~
api
We selected their three data center EU region precisely because they were
three separate data centers, so not happy. This is clearly bad design.

I think we're now going to have to look into multi-provider options. The only
way to be solidly _up_ is to be hosted by more than one company at more than
one data center.

I've also heard stories of billing nightmares where you get locked out of a
cloud provider account, so that's another thing.

~~~
vim_wannabe
>locked out of a cloud provider account

I guess this is already a reason on its own. It, among other problems, is what
happens when we go from small "local" providers you can actually call, to
automated global providers that cannot provide immediate support even if they
tried.

~~~
dx034
That's the problem with this trade-off. The small providers tend to have good
support and someone you can reach if anything goes wrong, but they won't have
experts on site 24/7. The large ones have dozens of them on call at all times
but lack in support (unless you pay a lot).

