

Network Instability in NYC2 on July 29, 2014 - retrodict
https://www.digitalocean.com/company/blog/network-instability-in-nyc2-on-july-29-2014/

======
donavanm
So an RE failed and flipped to the secondary. They run mlag from the agg layer
to their tors. Said mlag had a grey failure. The time to recovery was
predominantly spent on fault detection. After detection they upgraded the OS
on the known-good switch. Presumably this bounced, affecting remaining
traffic. After the upgrade they downed the failed switch, shifting traffic to
the known-good one.

Moderately interesting that they run everything off a single agg pair. Also
that they use mlag instead of routing/mpls/etc for availability.

The key finding is the lack of visibility into layer 2 availability and
performance. It would be interesting to see if they try ECMP at layer 3 or use
existing LACP frames for fault detection in the future.
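
For illustration only, here's a minimal Python sketch of the ECMP-at-layer-3
idea (nothing DO-specific; the addresses and helper are made up): each flow's
5-tuple hashes onto one of the equal-cost next hops, so a dead uplink can
simply be dropped from the candidate set, with no shared MLAG state to
reconcile.

    import hashlib

    def pick_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
        # Hash the flow's 5-tuple onto one of the available equal-cost next
        # hops. Real routers do this in hardware with vendor-specific hash
        # functions; the point is that per-flow path choice needs no state.
        key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
        digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
        return next_hops[digest % len(next_hops)]

    # If one uplink fails, remove it from the list and flows re-hash onto
    # the survivors.
    uplinks = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
    flow = ("192.0.2.10", "198.51.100.7", 49152, 443, "tcp")
    print(pick_next_hop(*flow, uplinks))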

~~~
insaneirish
They're probably learning that MLAG should be avoided unless absolutely
necessary. But providing common L2 domains across cabinets is probably
something they "need."

There's much less to go wrong with ECMP at L3. Stateful networking components
frighten me.

~~~
donavanm
Not knowing anything about their infrastructure, I'd guess that they're using
VLAN tags for customer isolation. They'd want customer instances spread among
racks, based on instance type etc. Going to in-house/VXLAN/NVGRE encap
certainly looks better suited, but it still has a high bar to entry.
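
To make the scale difference concrete, here's a rough Python sketch of the
VXLAN header from RFC 7348 (illustrative only, not anything DO-specific): the
24-bit VNI gives roughly 16 million segments for tenant isolation, versus the
4094 usable IDs of an 802.1Q VLAN tag, which is the main draw of overlay
encapsulation despite the higher bar to entry.

    import struct

    VXLAN_FLAG_VNI_VALID = 0x08  # "I" bit: the VNI field is valid

    def vxlan_header(vni: int) -> bytes:
        # Build the 8-byte VXLAN header: flags (1 byte), reserved (3),
        # VNI (3), reserved (1). The 24-bit VNI allows ~16M segments.
        if not 0 <= vni < 2 ** 24:
            raise ValueError("VNI must fit in 24 bits")
        return struct.pack("!B3s3sB", VXLAN_FLAG_VNI_VALID,
                           b"\x00" * 3, vni.to_bytes(3, "big"), 0)

    print(vxlan_header(5000).hex())  # one hypothetical customer's segment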

~~~
contingencies
You guys sound like you know a lot about these types of networking protocols.
Can you recommend a decent summary of the currently available approaches to
relatively dynamic, private link-layer network topology provisioning, suitable
for cross-cabinet (or even cross-site) virtualized infrastructure, and their
drawbacks? For instance, I've been seeing Open vSwitch gaining popularity.

------
akg_67
The network problems seem to have started earlier. I received a penny credit
for my NYC2 droplet just before this major incident. Luckily, I decided to
shut down and move the droplet to SFO before this incident. Over the last
couple of years of dealing with cloud providers, I have learned to dust off
the contingency plan as soon as there is a whiff of issues at a location.

Though the recap is good, there are red flags that show gaps in DO's processes
and incident management.

Upgrading software before the original problem was diagnosed and resolved is a
big no-no: never introduce a new variable into an existing problem, even if
the service provider or lab testing shows the chances of failure are minimal.
I have worked with technology infrastructure long enough to have experienced
situations where a service provider insisted on upgrading software during an
unrelated incident and made the situation worse.

A better approach would have been to fail the network over to the good switch
first; then, once the bad switch was fixed, upgrade the software on the bad
switch, fail over to the upgraded switch, and finally upgrade the software on
the good switch. This time you got lucky, but sooner or later your luck will
run out.
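
Spelled out as a sequence, purely as an illustrative Python sketch
(fail_over_to and upgrade are hypothetical stand-ins for whatever procedures
DO actually uses):

    def conservative_maintenance(good_switch, bad_switch,
                                 fail_over_to, upgrade):
        # Never upgrade the switch currently carrying traffic, and only move
        # traffic onto a switch after it has been repaired and upgraded.
        fail_over_to(good_switch)   # 1. move traffic off the faulty switch
        upgrade(bad_switch)         # 2. upgrade the idle, repaired switch
        fail_over_to(bad_switch)    # 3. shift traffic onto the upgraded switch
        upgrade(good_switch)        # 4. only then upgrade the remaining switch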

~~~
imbriaco
Generally speaking I agree with you. However, in this specific case there were
a couple of reasons we chose the path we did:

1. We had experienced bugs with the currently running release, which we were
fairly sure would manifest when we removed the damaged core from the network.
These were primarily around MAC learning.

2. We had performed testing ourselves in a non-production environment and were
already planning a network maintenance window to update these switches in the
next few days.

Given those factors, we judged that introducing another variable by moving to
the new version was less risky than proceeding with the defects we already
knew about in the existing version.

------
Donzo
While the outage was upsetting, the response from DO was reassuring. Not only
did they jump right on the problem, they maintained communication during the
affected period. Then, they refunded me $160 (a month of service).

Every host has an occasional problem. It's how they handle the problem that
matters to me. This is a night-and-day contrast with the service I've received
from the other hosts I've dealt with.

~~~
pnmahoney
Was this other people's experience, e.g. getting a refund?

~~~
nieve
I only keep three machines with them, but two are in NYC2. For the July 24th
outage they issued a combined 4¢ SLA credit for a claimed two hours of
downtime. For this much worse problem? Nada. I suspect that DO is prioritizing
refunds for more lucrative customers, and if you're not spending enough you're
not going to get anything. I may need to look into moving some stuff over to
Ramnode if they've got a NYC location now (Wall Street clients).

------
snewman
A good postmortem... and a clear example of a Type 1 Outage. There are pretty
much only two kinds of outage in a decently-run system:

1. Two or three things go wrong at once. Some of the problems are spontaneous,
and some were always broken but it took the other problem(s) to uncover them.
(Here, the SSD failure was presumably spontaneous, but the "not completely
successful" routing engine failover sounds like a case of
rarely-used-hence-poorly-tested.)

2. A systemic failure hits all redundant components at once (DDoS,
fat-fingered global configuration change, calendar bug, etc.).

------
korzun
Good post.

This is essentially the risk you face when dealing with new providers.

I guarantee that other providers had to go through the same set of issues and
phases before achieving a truly redundant infrastructure.

Would love to hear a follow-up on the audit.

Edit: Misread a part of the post. I thought they were doing failover at a
different network layer.

~~~
sard420
It's pretty common for router/switch manufacturers to include just one storage
device (SSD/CF/whatever). That's one reason why you buy a second router/switch
for redundancy.

~~~
korzun
Thanks.

Corrected my post. For some reason I thought the failure point was outside of
the actual switch appliance.

------
bogomipz
I find it alarming that they don't have a competent network engineer of their
own on staff. Note the following:

"We are working very closely with our networking partner to understand the
nature of the failure, assess the chances of a repeat event, and to begin
planning architectural changes for the future."

"Our initial focus was on verifying the configuration so we initiated a line-
by-line configuration review by engineers at our network partner"

Yikes. You get what you pay for.

~~~
imbriaco
We have very competent network engineers, but it never hurts to have a second
opinion. Verifying that there are no subtle interactions that we've missed is
simply prudent.

At our scale, partnering with our vendors is a necessity. We often operate in
areas that, while they are technically within specification, are toward the
higher end of the range.

~~~
jwatte
Which vendor? I have experience with two brands in similar condos, one good,
one not so much...

~~~
imbriaco
I prefer not to publicly name the vendor. This is about problems on our
network, and it's ultimately our responsibility.

