
FRA1 Block Storage Issue - nik736
https://status.digitalocean.com/incidents/8sk3mbgp6jgl
======
qertoip
Can confirm.

We run 26 services in production on DigitalOcean. Every single VPS in our
setup uses the block storage feature as a persistence layer (system logs, app
logs, databases, etc.).

Thanks to this architecture we can rebuild machines at will. The _function_
and _state_ are nicely separated.
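
As a sketch of what this looks like in practice (the device path, volume name,
and mount point below are hypothetical, not our exact setup):

    import os
    import subprocess
    import time

    # Hypothetical names: DigitalOcean volumes show up under
    # /dev/disk/by-id; adjust for your own volume and mount point.
    DEVICE = "/dev/disk/by-id/scsi-0DO_Volume_app-data"
    MOUNT_POINT = "/mnt/app-data"

    def mount_state_volume(timeout=60):
        """Wait for the attached volume to appear, then mount it.

        All state (logs, databases) lives under MOUNT_POINT, so the
        droplet itself stays disposable.
        """
        deadline = time.time() + timeout
        while not os.path.exists(DEVICE):
            if time.time() > deadline:
                raise RuntimeError("volume never appeared: " + DEVICE)
            time.sleep(1)
        os.makedirs(MOUNT_POINT, exist_ok=True)
        if not os.path.ismount(MOUNT_POINT):
            subprocess.run(["mount", DEVICE, MOUNT_POINT], check=True)

    if __name__ == "__main__":
        mount_state_volume()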

The downside is that we are now fucked.

~~~
mattbillenstein
You should re-architect without network block storage as a requirement imho.
Ever since that big AWS EBS outage in 2012 or whatever, I avoid it like the
plague.

Databases like low-latency local storage (the newer NVMe instances on AWS are
very good), and logs and whatnot can be aggregated with other systems (fluent,
logstash, etc). I do not actually miss EBS much at all -- if I have a problem
with a VM, they're disposable and redundant, and I've designed out most of the
SPOFs.
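
As a minimal sketch of the aggregation side (the collector host and tag are
made up; this assumes the fluent-logger Python package and a fluentd listening
on its usual port):

    # pip install fluent-logger
    from fluent import sender

    # Hypothetical collector; point this at your fluentd/fluent-bit host.
    logger = sender.FluentSender("app", host="logs.internal", port=24224)

    # Log events ship over the network to the aggregator, so the VM
    # itself keeps no state worth preserving on a block device.
    if not logger.emit("request", {"path": "/checkout", "status": 200}):
        print(logger.last_error)
    logger.close()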

~~~
boulos
Disclosure: I work on Google Cloud.

I wouldn't throw the baby out with the bathwater. Across Google, we rely
almost exclusively on networked storage (Colossus) unless we need extremely
high-performance local flash. Being able to separate compute from storage is a
huge part of our ability both to scale out and to do live migration for GCE
(where Persistent Disk, our equivalent of EBS, is built on Colossus).

Persistent Disk has never had an outage like the EBS one, but I attribute that
to the Colossus and underlying teams having run this at Google for a really
long time. Fwiw, the AWS folks have also massively improved EBS over the
years. You can still worry, but I prefer to think about overall MTBF rather
than treat networked storage in particular as the plague :).

~~~
butwhythough
Disclosure: I use both gce and ec2

Unfortunately, unlike EBS, persistent disks (Colossus) in Google Cloud share a
network plane with the VM. To quote the docs:

"Each persistent disk write operation contributes to your virtual machine
instance's cumulative network egress cap."
[https://cloud.google.com/compute/docs/disks/performance](https://cloud.google.com/compute/docs/disks/performance)

If you haven't noticed, the MTU in Google Cloud is < 1500, vs. AWS where you
can get jumbo frames (9k). I have no reason to believe persistent disk traffic
is any different.

Enabling live migrations? You mean the choice of having an instance terminated
with zero notice, or migrated with a 60-second warning if you subscribe to the
right API? Oh yeah, and this is happening constantly.
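
(For the curious, the "right API" is the instance metadata server; a minimal
polling sketch, assuming the documented maintenance-event endpoint:)

    import urllib.request

    # The metadata server blocks on wait_for_change until the value
    # flips, e.g. from NONE to MIGRATE_ON_HOST_MAINTENANCE.
    URL = ("http://metadata.google.internal/computeMetadata/v1/"
           "instance/maintenance-event?wait_for_change=true")

    def wait_for_maintenance():
        req = urllib.request.Request(URL, headers={"Metadata-Flavor": "Google"})
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode()

    event = wait_for_maintenance()
    if event != "NONE":
        # You now have roughly 60 seconds to drain the box.
        print("maintenance event:", event)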

Live migrations (and by connection VM attrition), persistent disks, and
network performance are my least favorite aspects of google cloud today.

That being said, Google Cloud does have a lot of advantages over EC2. These
just aren't among them.

~~~
boulos
First, I'm sorry you've had a bad time. We track our VM MTBF closely, so
hearing "this is happening constantly" is really worrying. Feel free to reach
out to me (email in profile), or Support so we can dig into your experience.
If something is wrong, we should diagnose and fix it.

Can you say why you prefer explicitly separated egress caps? We let our
networking egress be shared between all sources of traffic on purpose, because
it lets you go full throttle rather than hard-capping by "flavor". That is,
why restrict someone who doesn't write to disk much by "stealing" several Gbps
for the PD/EBS they aren't going to use?
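
To make that concrete with made-up numbers (the 10 Gbps cap and write rate
below are purely illustrative):

    # Shared-cap model: disk writes and network traffic draw from one pool.
    cap_gbps = 10                      # hypothetical egress cap
    pd_write_gbps = 250 * 8 / 1000     # 250 MB/s of PD writes = 2.0 Gbps
    print(cap_gbps - pd_write_gbps)    # 8.0 Gbps still free for the network;
                                       # a no-disk-writes instance gets all 10.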

Finally, it's true that our MTU is too damn low. But that also isn't
particularly material for PD: when the guest issues a write, we handle it all
behind the scenes (it's not like your guest sees the write get fragmented into
packets).

~~~
butwhythough
Thank you for the fresh perspective and the thoughtful reply. I also
appreciate the offer to reach out. Rest assured we are actively engaged during
the events and have found support incredibly responsive. At a high level none
of these are reasons for us to stop investing in our google cloud stack, nor
have they caused any major outages. Think of them as quality of life comments.

Live migrations

To clarify "this is happening constantly": I meant we see live migrations
happen frequently throughout the day. Going back over the last 24 hours, I see
"hundreds" of migrations. We do have days where we will see 3 or 4x this
number. The majority of these were successful, and our logging, probes, and
graphs show nothing exciting.

On the positive side of things, we have noticed a marked improvement in
migration times, probe failures, and instance fatalities in the last 6 months.
Where before we would regularly see live migrations take upwards of 15 minutes
or longer, they are now at or under 2 minutes (with only a handful of
exceptions barely worth mentioning).

I do appreciate the facility of live migration and the proactive approach
Google takes to host maintenance. The 60-second notification window is just
too damn short for some of our services to properly drain themselves. So
instead, we hold on to our butts and hope for the best on those boxes.

If there was one improvement to live migration it would be to have the option
of a 15 minute (or even 30 minutes.. am I being greedy?) notice.

Networking egress caps shared between instance and persistent disk

The edge case that hurts here is when you have a high-bandwidth service that
also writes a lot of data to a PD disk, combined with the Comcast-style burst
bandwidth throttling that happens on the instances (this is pure speculation
and may have improved since we last investigated; the observation at the time
was that the throttling is a bit too efficient and hits the instance
disproportionately). We have since migrated to either local SSD or tmpfs for
these types of hosts in GCE. Their sister services are still running fine in
EC2 on EBS-backed instances.

MTU

Yay! 1500 would be better, 9k would be great. This hurts when connectivity
starts to see an increase in packet loss and the corresponding increase in
packet retransmission (and latency). Not to mention the overhead these extra
packets incur (those 20-60+ bytes of headers per packet add up quickly).
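
Back-of-the-envelope on that header tax (assuming ~40 bytes of IPv4+TCP
headers per packet; the 1 GiB transfer size is just for illustration):

    HEADERS = 40          # approx. IPv4 (20) + TCP (20) bytes per packet
    DATA = 1 << 30        # 1 GiB to transfer

    for mtu in (1460, 9000):
        payload = mtu - HEADERS
        packets = -(-DATA // payload)   # ceiling division
        print(mtu, packets, round(packets * HEADERS / 1e6, 1), "MB of headers")
    # ~30 MB of headers at MTU 1460 vs ~4.8 MB at MTU 9000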

TLDR wishlist

- Longer notification window before live migration actually starts
- PD-optimized instances
- 1500/9k MTU

------
eropple
As somebody who's been looking really hard at a project/side business that'd
use Spaces (DO's object storage system), this makes me super, super nervous.
To say nothing of _block_ storage--yikes.

Can anyone speak to the quality/reliability of other object storage providers
that have S3-compatible (including presigned URL) APIs? S3's pricing is
absolutely ridiculous by comparison, but they have the reliability argument on
their side...
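
FWIW, the presigned-URL piece tends to be portable across S3-compatible
providers via boto3's endpoint_url; a sketch with placeholder endpoint,
bucket, and credentials:

    import boto3

    # Placeholder endpoint/credentials; any S3-compatible provider that
    # supports Signature v4 presigning should work the same way.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://fra1.digitaloceanspaces.com",
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-bucket", "Key": "report.pdf"},
        ExpiresIn=3600,  # link valid for one hour
    )
    print(url)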

~~~
pinewurst
I can't speak for the quality of other object storage providers, but being in
the storage business I can say that if someone is running Ceph, find another
provider.

~~~
polskibus
What's a better open source alternative to Ceph?

~~~
AFNobody
There isn't a better open source alternative to Ceph.

However, if it's improperly designed/architected, you will end up with
serious scaling issues.

~~~
scurvy
Testing testing testing!

It's important to actually test things.

~~~
AFNobody
Yes it is, and not just initial testing, but at scale with full volume. ;)

------
teilo
I have heard of a number of Ceph nightmares like this. A few years back, Logos
Bible Software, which has a Ceph-based content platform (a huge library of
e-books with a massive amount of metadata), was down for a week because of a
cascading Ceph cluster failure.

It really doesn't speak well of the Ceph architecture. It is highly
performant, but at what cost? Failures on this scale can ruin a business.

~~~
zzzcpan
Well, you can always partition a large cluster into many small clusters to
keep cascading failures and other issues from affecting everyone, or from
taking too long to recover from. This is a very basic reliability technique
everyone should know.
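
A minimal sketch of the idea (cluster names are hypothetical): hash each key
to one of N independent small clusters, so a failure is contained to ~1/N of
the data:

    import hashlib

    # Each entry is a fully independent cluster; a cascading failure in
    # one stays contained to its share of the keyspace.
    CLUSTERS = ["ceph-a.internal", "ceph-b.internal", "ceph-c.internal"]

    def cluster_for(key: str) -> str:
        digest = hashlib.sha256(key.encode()).digest()
        return CLUSTERS[int.from_bytes(digest[:8], "big") % len(CLUSTERS)]

    print(cluster_for("user-42/avatar.png"))

(Plain modulo placement remaps keys when the cluster count changes; consistent
hashing avoids that, but the isolation argument is the same.)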

------
unilynx
They've just started sending out SLA credit notices:

-----

Hello,

On 2018-04-01 at 7:08 UTC, one of several storage clusters in our FRA1 region
suffered a cascading failure. As a result, multiple redundant hosts in the
storage cluster suffered an Out Of Memory (OOM) condition and crashed nearly
simultaneously.

We have identified that you, or your team account, were impacted by this
incident and will grant an SLA credit equal to 30% of your entire Block
Storage spend for April, not just usage in FRA1. This credit will appear on
your account at the end of April, and will be reflected on your April 2018
invoice.

We apologize for the incident and recognize the impact this outage had on your
work and business. You can read the full detail of our public post-mortem
here:
[http://status.digitalocean.com/incidents/8sk3mbgp6jgl](http://status.digitalocean.com/incidents/8sk3mbgp6jgl)

Thank you, Team DigitalOcean

------
pstrateman
100% of the downtime for MomentoVPS was from Ceph cluster failures...

