
Block Storage Issues Across All Regions: Incident Report for DigitalOcean - juliand
http://status.digitalocean.com/incidents/g76kgjxqrzxs?
======
redredrobot
That's not seriously the real engineering postmortem, is it? That looks more
like a 'Resolved the issue' update - it is way too shallow and vague.

If this was sent out at AWS as a COE (postmortem), it would be ripped apart -
it does nothing to give anyone reading it confidence that this class of
failure isn't going to happen again. It looks like they haven't even
identified the root cause(s) of the failure...

~~~
mike_d
Yeah this is just a status update. Here is the TLDR:

> Efforts continue to find the incompatibility in the networking configuration
> change. Additionally, we are exploring improvements to our tools and
> processes to facilitate a finer grained, more incremental deployment method
> for wide, system-level changes.

------
unilynx
Unfortunately, this doesn't explain whether this change was applied to 5
datacenters at the same time or, if it wasn't, why it still affected all 5 of
them.

I would have liked to hear more about how they are going to reduce the blast
radius of such a change, because it sounds like something that could have been
deployed to a single datacenter first.
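
For what it's worth, "one datacenter first" is cheap to build into tooling.
Below is a purely hypothetical Python sketch - apply_change, health_check,
and the rest are placeholder names, not anything DO actually runs - showing a
staged rollout with a bake period and rollback to cap the blast radius:

    import time

    # Illustrative only; none of this is DO's real tooling.
    DATACENTERS = ["nyc1", "nyc3", "ams3", "sfo2", "sgp1"]

    def apply_change(dc):
        print(f"applying network config change to {dc}")

    def rollback(dc):
        print(f"rolling back {dc}")

    def health_check(dc):
        # e.g. probe block-storage latency and error rates in this DC
        return True

    def staged_rollout(datacenters, bake_seconds=3600):
        done = []
        for dc in datacenters:
            apply_change(dc)
            time.sleep(bake_seconds)  # let problems surface before widening
            if not health_check(dc):
                # worst case, the blast radius is only the DCs touched so far
                for bad in [dc] + done:
                    rollback(bad)
                return False
            done.append(dc)
        return True

The point isn't the code, it's the shape: each step risks one datacenter, and
a failed health check stops the rollout instead of taking out all five at
once.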

~~~
Pick-A-Hill2019
Agreed. The closest it gets is this: '...we are exploring improvements to our
tools and processes _to facilitate a finer grained, more incremental
deployment method for wide, system-level changes_.'

------
ebcode
Adding my voice to the chorus here as a DO customer. This 188-word
"postmortem" gives postmortems a bad name. I would like to know the details of
the "network configuration change", _why_ it "caused incompatibilities", and
_how_ you will ensure that this particular failure will not recur.

Trust and transparency are the currencies of the internet, in the same way
that cigarettes and contraband are the currencies in prison. This post is
worth approximately a half-smoked cigarette.

------
caiobegotti
It's because of "reports" like these that I didn't feel like staying on as
their customer. Whoever is in charge of [limiting the scope and wording of]
these reports should listen to a few things in private at their HQ.

~~~
o-__-o
Given that, would you trust the engineers from DO to work on your systems?

~~~
caiobegotti
Of course. Engineers are not individually responsible for large outages like
that; I trust they always do their best (I'm an SRE myself).

~~~
o-__-o
Even if it’s the culture at DO to not question existing designs? I’m saying I
would look at this example and question DO engineers who come to work for me
rather than blindly trust that they won’t bring bad habits with them...

------
notacoward
This happened, and was apparently resolved, on Monday. Am I the only one who
wonders if it was released on a Saturday to minimize the amount of
attention/commentary it would get (e.g. here)?

------
lucb1e
Aside from a timeline and some promises, this is the full post-mortem analysis
of what happened:

> The outage was triggered as a result of a networking configuration change on
> the Block Storage clusters to improve handling packet loss scenarios. The
> new setting caused incompatibilities

So that doesn't tell us very much about the cause ("a networking configuration
change") nor the effect ("incompatibilities").

------
mlthoughts2018
> “The outage was triggered as a result of a networking configuration change
> on the Block Storage clusters to improve handling packet loss scenarios. The
> new setting caused incompatibilities, which led to network interfaces
> becoming unavailable.”

I wonder if this just means someone changed an MTU configuration and it led
to tons of fragmentation in different components of the network, with any
large file transfer in particular timing out constantly, enough to render an
outage. Just a wild guess, but I’ve seen this happen with in-house datacenters
before, so perhaps.
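
If it was an MTU change, the symptom is easy to check for. Here's a minimal
Python sketch of probing the effective path MTU on Linux (the two IP_*
constants are real values from <linux/in.h> that Python's socket module
doesn't export; the target address is a placeholder):

    import socket

    # Linux-only constants from <linux/in.h>
    IP_MTU_DISCOVER = 10
    IP_PMTUDISC_DO = 2   # set Don't Fragment; refuse to fragment locally

    def probe_path_mtu(host, port=9, start=1472, floor=1200):
        """Shrink a DF-flagged UDP payload until the kernel accepts it.
        Returns payload size + 28 bytes of IP/UDP headers, i.e. the
        path MTU as the kernel currently knows it."""
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
        s.connect((host, port))
        size = start
        while size >= floor:
            try:
                s.send(b"x" * size)
                return size + 28  # 20-byte IP + 8-byte UDP header
            except OSError:       # EMSGSIZE: too big for some hop
                size -= 8
        return None

    # probe_path_mtu("192.0.2.1") -> 1500 on a healthy Ethernet path

Small packets that fit the old MTU keep working, which is exactly why this
failure mode shows up as "large transfers time out" rather than a clean,
obvious break.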

------
exabrial
Nice details in the report. Come on DO, this was the most annoyingly PC,
lawyer-sanitized, non-scientific RCA anyone here has ever read. Here's the
BLUF line: Don't cause issues. That causes problems.

------
adreamingsoul
Sigh, I really like DO, but this just shows how much they still need to learn
about operations. AWS does “raise” the bar for that, but unfortunately you can
only really see that from the inside.

