
Update on the April 5th, 2017 Outage - janvdberg
https://www.digitalocean.com/company/blog/update-on-the-april-5th-2017-outage/
======
lloydde
> The root cause of this incident was an engineer-driven configuration error. A
> process performing automated testing was misconfigured using production
> credentials. As such, we will be drastically reducing access to the primary
> system for certain actions to ensure this does not happen again.

Classic. Someone manually ran the test automation pointed at production,
including its cleanup routine that issues a DROP. Very basic access controls
were not in place.
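
For what it's worth, a guard as simple as the sketch below would have caught
this. Purely illustrative -- the env var names and host pattern are my own
assumptions, not anything from DO's post:

    import os
    import re
    import sys

    # Refuse destructive cleanup unless the target looks like a test database.
    ALLOWED_TEST_HOST = re.compile(r"^(localhost|127\.0\.0\.1|.*\.test\.internal)$")

    def assert_safe_to_drop(db_host):
        """Abort before cleanup if the configured host is not a known test target."""
        if os.environ.get("APP_ENV") == "production":
            sys.exit("refusing destructive cleanup: APP_ENV=production")
        if not ALLOWED_TEST_HOST.match(db_host):
            sys.exit("refusing destructive cleanup against %r" % db_host)

    # Call this at the top of the test suite's teardown, before any DROP runs.
    assert_safe_to_drop(os.environ.get("DB_HOST", "localhost"))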

------
sb8244
Does this mean their testing servers can reach the production DB over the
network? That's something we specifically lock down, so that even a
misconfiguration can't cause this. I'm surprised an audit didn't catch it.
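
The real fix is at the network layer (firewall rules / separate VPCs), but a
belt-and-suspenders check in the test harness itself is cheap. Rough Python
sketch -- the production subnet here is made up:

    import ipaddress
    import socket

    PRODUCTION_NET = ipaddress.ip_network("10.10.0.0/16")  # hypothetical prod range

    def refuse_production_hosts(db_host):
        """Raise if the configured DB host resolves into the production network."""
        addr = ipaddress.ip_address(socket.gethostbyname(db_host))
        if addr in PRODUCTION_NET:
            raise RuntimeError("%s resolves to %s inside the production subnet"
                               % (db_host, addr))

    refuse_production_hosts("localhost")  # passes; a prod hostname would raise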

------
ulysses
I like DO, and appreciate them being open with the outage timeline. Two
things really jump out at me, though:

1.) If they have a time-delayed replica, why not recover it to the point in
time just before the issue and then switch the master over to it? (Rough
sketch of what I mean below.)

2.) They took a backup of the replica, moved the backup to the master, and
then restored from it. Do they not have direct backups of the master? A time-
delayed replica is no replacement for valid backups.
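
For 1.), assuming MySQL-style replication (which I'm guessing at -- the post
doesn't say), the promotion could look roughly like this, with host,
credentials and binlog coordinates as placeholders:

    import pymysql

    conn = pymysql.connect(host="delayed-replica.internal",
                           user="admin", password="...")
    with conn.cursor() as cur:
        # 1. Freeze the delayed replica so the bad statement never applies locally.
        cur.execute("STOP SLAVE SQL_THREAD")
        # 2. Replay only up to the binlog position just before the bad event
        #    (found by inspecting the master's binlog with mysqlbinlog).
        cur.execute("START SLAVE SQL_THREAD UNTIL "
                    "MASTER_LOG_FILE='mysql-bin.000123', MASTER_LOG_POS=4567")
        # 3. Once caught up to that point, stop replication entirely and point
        #    the application at this host as the new primary.
        cur.execute("STOP SLAVE")
        cur.execute("RESET SLAVE ALL")
    conn.close()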

------
cjbprime
It's interesting that there's a common theme of multi-hour outages being
gated on waiting for backups to restore -- the GitLab outage was similar in
cause and lasted even longer.

I wonder if there's anything to be done about reducing the duration of these
outages. I suppose copying TBs of data around the world is simply always going
to take hours.

~~~
danappelxx
Not an expert, but can't you just temporarily make the replica the master?

~~~
onozz
Not if a DROP DATABASE was done - the replica would have replayed the
statement and dropped it too.
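
A delayed replica is the exception -- if the apply is lagged, the DROP hasn't
run on it yet by the time you notice, which is presumably what DO's
time-delayed replica is for. Assuming MySQL again (placeholders throughout),
the delay itself is one statement:

    import pymysql

    conn = pymysql.connect(host="delayed-replica.internal",
                           user="admin", password="...")
    with conn.cursor() as cur:
        cur.execute("STOP SLAVE")
        # Apply events from the master no sooner than an hour after they happen,
        # leaving a window to pause replication before a DROP replays here.
        cur.execute("CHANGE MASTER TO MASTER_DELAY = 3600")
        cur.execute("START SLAVE")
    conn.close()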

------
spotman
Imagine that! Working backups! Nice job not losing data, DigitalOcean.

