
DigitalOcean block storage is down - kaendfinger
https://status.digitalocean.com/incidents/g76kgjxqrzxs
======
CaliforniaKarl
Personally, if DO don’t have anything new in a status post, I’d prefer seeing
an update that says something like “We are continuing to work on the issue.
Nothing new to report. Next update in X minutes.” That is a lot easier for me
to parse than the text that someone seems to be copy/pasting in each update.

~~~
iamsb
It would be great if statuspage.io had a button that, when pushed, published a
message similar to your suggestion.

------
kyledrake
What unholy thing did they do that broke it across 12 different datacenters,
good lord.

~~~
alexeldeib
This does seem to indicate a notable lack of isolation for the blast radius
between DO datacenters. Would be interesting to see the post mortem.

~~~
protomyth
I get the feeling that whoever writes the post-mortem is going to have a bit
of pressure to assure folks that there is isolation going forward.

~~~
klodolph
That would be a bad sign that there’s something wrong with the culture. I
would hope for a postmortem that identified flaws that genuinely needed to be
fixed.

~~~
viraptor
Those are not mutually exclusive and actually a good idea. You want to fix
this specific issue, but also ensure that whatever process took down one DC
doesn't affect other DCs. That's scaling and redundancy 101 - not sure why it
would be something wrong.

~~~
klodolph
> Those are not mutually exclusive and actually a good idea.

The goals “assuring folks that there is isolation” and “identifying flaws that
need to be fixed” are somewhat contrary to each other.

The post-mortem should identify flaws in systems, processes, and thinking. It
should not try to assure people that there is isolation when there is evidence
to the contrary.

> You want to fix this specific issue, but also ensure that whatever process
> took down one DC doesn't affect other DCs.

This was a multi-regional failure. So, this specific issue is also an
isolation problem, among other things. You will want to _ensure_ that this
problem doesn’t happen again but you shouldn’t _assure_ that it won’t.

------
pastrami_panda
This is OT, but I have a droplet on DO and I'm amazed at the amount of
malicious traffic it gets. Is it normal for a very private vps to receive
thousands of ssh attempts per hour? I have fail2ban installed and the jail is
so busy it's quite astounding. Anyone with more web hosting experience that
can weigh in?

~~~
zeta0134
I work for a web hosting company in Texas, and this is ridiculously common.
Any public IP with any public service at all will be poked, prodded, and
generally made uncomfortable by every bot and crawler you can think of, trying
common password combinations and scanning for common vulnerabilities in
popular software. This catches _so many_ of our customers by surprise, who
tend to mistakenly believe they're being targeted in some kind of attack.
Generally they're not, unless they're running something vulnerable and one of
the bots _noticed._

Fail2ban is great to at least stem the tide. It's good at slowing down SSH
brute forcing, and can be set up to throttle poorly behaved scrapers so your
site isn't getting hammered constantly. If you can deal with the
inconvenience, it's even better to put services that don't need to be truly
public behind an IP whitelist. That stops the vast majority of malicious
traffic, most of which is going after the low hanging fruit anyway.

Otherwise, it's kinda just a fact of life. With the good traffic also comes
the bad.
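
To make the two ideas above concrete, here is a minimal sketch of banning
after repeated failures plus an IP whitelist check. It is not fail2ban's
actual code; the class, function names, thresholds, and example addresses are
all hypothetical and chosen only for illustration.

```python
import ipaddress
from collections import defaultdict, deque

# Hypothetical values for illustration. fail2ban has similar jail settings
# (maxretry, findtime), but this is not how fail2ban itself is implemented.
MAX_RETRY = 5        # failures tolerated inside the window
FIND_TIME = 600.0    # sliding window length in seconds
ALLOWLIST = [ipaddress.ip_network("203.0.113.0/24")]  # documentation range


class FailureTracker:
    """Tracks failed logins per IP and reports when an IP should be banned."""

    def __init__(self, max_retry=MAX_RETRY, find_time=FIND_TIME):
        self.max_retry = max_retry
        self.find_time = find_time
        self.failures = defaultdict(deque)  # ip -> timestamps of failures

    def record_failure(self, ip, now):
        """Record one failed attempt; return True if the IP should be banned."""
        window = self.failures[ip]
        window.append(now)
        # Drop failures that have fallen out of the sliding window.
        while window and now - window[0] > self.find_time:
            window.popleft()
        return len(window) >= self.max_retry


def is_allowed(ip, allowlist=ALLOWLIST):
    """Return True if the address falls inside any allowlisted network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in allowlist)
```

In practice the whitelist is better enforced at the firewall, so that
unlisted addresses never reach the service at all.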

~~~
davrosthedalek
I always switch my outward-facing ssh servers to key-only. Is there any
advantage to running fail2ban as well?

~~~
ac2u
for my DO droplet I also changed the ssh port to a silly-high random port and
the last time I checked it reduced the amount of nosy bots knocking at the
door to zero.

~~~
davrosthedalek
I used to do so too, but sometimes had problems with very restrictive
firewalls killing connections to high/unknown ports when traveling. They would
only allow vpns or ssh to connect.

------
hartator
Not sure why the previous incident page got flagged. This is the new one.

It's affecting us for real, taking almost our whole service (serpapi.com)
down, as we store database files on block storage volumes.

~~~
dang
I took a look at the flags on these stories and am pretty sure they're from
users who are tired of "X is down" submissions, which tend to get posted a lot
and often to be a little on the trivial side.

However, since several HN users are expressing that this issue is genuinely
affecting them, I've turned off flags on the OP about this and merged the
comments here.

~~~
tyingq
The "across all regions" part makes this one different for me, and interesting
even though I'm not a customer of their block storage. I'm curious about the
sequence of events, or design choices, that would cause that.

------
louwrentius
Isn't Digital Ocean running Ceph for their block storage?

I wonder, as others suggested, whether they may have stretched the cluster
across datacenters?!

Would be interested in the post-mortem.

~~~
ngrilly
Yes, DO uses Ceph:
[https://blog.digitalocean.com/why-we-chose-ceph-to-build-blo...](https://blog.digitalocean.com/why-we-chose-ceph-to-build-block-storage/)

------
jacquesm
Thank you Digital Ocean for once again proving that 'The Cloud' is not a
backup.

~~~
vinw
'The Cloud' is _a_ backup. Just don't let it be your only backup!

~~~
pnutjam
It's not backed up if you don't have 3 copies.

~~~
lunchables
The saying we always use is "If it is not in 3 places, it doesn't exist."

And another: "3 copies, at least one offsite"
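
As a rough sketch, the "3 copies, at least one offsite" rule can be written as
a check over an inventory of copies. The `BackupCopy` type, the
`satisfies_rule` helper, and the location labels are hypothetical names made
up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str   # hypothetical label, e.g. "office-nas"
    offsite: bool   # True if stored away from the primary site


def satisfies_rule(copies, min_copies=3, min_offsite=1):
    """Check "3 copies, at least one offsite" over distinct locations."""
    # Two copies in the same location count once: they fail together.
    distinct = {c.location: c for c in copies}.values()
    offsite = sum(1 for c in distinct if c.offsite)
    return len(distinct) >= min_copies and offsite >= min_offsite
```

Under this check, a block storage volume plus a snapshot stored with the same
provider would count as a single location if you label them that way, which is
the point of the rule.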

------
stephenr
This is your weekly reminder that anything you want to be reasonably “HA”
should span multiple vendors in multiple DCs.

~~~
dc352
That would be pretty cool, but to have that you need a solution that tolerates
high network latency, i.e., pretty much a cold backup. For some time I thought
it was a pretty last-century option, but having experimented with it for a
while now, it's the option with the lowest impact on system performance. More
importantly, it's reasonably resilient.

~~~
stephenr
I've read your comment now about 4 times and all I have come up with is "huh?"

Literally thousands if not millions of organisations operate multi-DC
infrastructure across the planet.

Is it harder than setting up a single box in one DC? Yes. Is it harder than
setting up a mini-cluster of boxes in one DC? Yes. Is it rocket science? No.

------
simplehuman
Anyone have a review of using DO k8s or DO managed DB in production?

------
unilynx
DigitalOcean just posted a post-mortem on
[http://status.digitalocean.com/incidents/g76kgjxqrzxs](http://status.digitalocean.com/incidents/g76kgjxqrzxs)

(the same url)

------
privateSFacct
Higher latency (per status) is not the end of the world, especially if it’s
just “may experience” higher latency.

~~~
erikrothoff
That wording kinda ticked me off because our volume was completely
inaccessible. Rebooting did not mount it at all.

------
imglorp
Hrm, Atlassian BitBucket is also down. Just a coincidence? Does BB use DO?

[https://bitbucket.status.atlassian.com/incidents/4t1pkwrdtl8...](https://bitbucket.status.atlassian.com/incidents/4t1pkwrdtl8b)

~~~
iamaelephant
BitBucket definitely doesn't use DO.

------
seaghost
Their block storage is such a failure. I’ve been going back and forth with
support about automatically deleting files with lifecycles for over 2 months
now and it’s still not resolved.

~~~
ngrilly
Since you're trying to "delete files with lifecycles", I'm quite sure your
problem is with their object storage (called Spaces), and not their block
storage.

------
sb8244
It looks like they have just updated it as resolved and monitoring.

------
sunasra
I was always wondering how I could find out proactively when something like
this breaks or some service has an outage. As a result, I built this tool
([http://incidentok.com](http://incidentok.com)).

~~~
sunsetMurk
great idea - going to give it a whirl this week.

i'm curious about the slack integration. can you provide some more info on
what that looks like? eg. just a message in real-time when it goes down? a
daily message of statuses? etc. Any sort of customization w/ it?

I currently use a soup of zapier zaps to take care of this problem.

~~~
sunasra
Hey, thanks. IncidentOK sends a message to Slack using a webhook as soon as an
incident is reported by any product. I hadn't thought of sending a daily
status, but I am open to suggestions.

Message looks like this [https://imgur.com/jjbMKj8](https://imgur.com/jjbMKj8)
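
For anyone wanting to roll their own, the general idea can be sketched as
below. This is not IncidentOK's implementation. It assumes the status page is
hosted on Statuspage and exposes the public `/api/v2/status.json` summary, and
that you have a Slack incoming webhook URL of your own. The function names are
hypothetical.

```python
import json
import urllib.request

# Assumption: the status page is hosted on Statuspage and serves its public
# summary at /api/v2/status.json. Verify this for the page you want to watch.
STATUS_URL = "https://status.digitalocean.com/api/v2/status.json"


def build_alert(status_doc):
    """Return a Slack-style payload dict, or None if nothing is wrong."""
    status = status_doc.get("status", {})
    indicator = status.get("indicator", "none")
    if indicator == "none":
        return None
    name = status_doc.get("page", {}).get("name", "unknown service")
    text = f"{name}: {status.get('description', 'incident reported')} ({indicator})"
    return {"text": text}


def fetch_status(url=STATUS_URL, timeout=10):
    """Download and decode the status document."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def post_to_slack(webhook_url, payload, timeout=10):
    """POST the payload as JSON to a Slack incoming webhook URL."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Run from cron or a loop, this would call `fetch_status()`, pass the result to
`build_alert()`, and call `post_to_slack()` only when the result is not
`None`. A real tool would also remember the last alert so it doesn't repeat
itself on every poll.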

------
sodosopa
So that’s why bot attacks and spam traffic were lower.

------
golanggeek
This is really down for more than 2 hours!!!

~~~
sondh
Last night I was testing a DO managed Kubernetes cluster with a persistent
volume claim, and the volume took 15 minutes to reattach after the pod was
rescheduled to another host. I thought it was just some weird hiccup and went
to bed.

The incident report indicated the problem started 4 hours ago (around 9pm GMT)
but I was having problems around 4pm. It's definitely not a 2-hour incident.

~~~
dc352
Our disks in London went down at about 8:45pm UTC (a 10-minute 100% disk
utilization alert triggered at 5 to 9), and DO's recovery message was sent out
at about 2am UTC. We switched our service (keychest.net) back on at 3:15am.
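
As a quick check of the durations implied by those times: the calendar date
below is an arbitrary placeholder, since only times of day are given, and the
3:15am figure is assumed to be UTC as well.

```python
from datetime import datetime, timedelta, timezone

# Placeholder date: only the differences between the times are meaningful.
DAY = datetime(2000, 1, 1, tzinfo=timezone.utc)
NEXT_DAY = DAY + timedelta(days=1)

disks_down = DAY.replace(hour=20, minute=45)               # about 8:45pm UTC
recovery_notice = NEXT_DAY.replace(hour=2)                 # about 2am UTC
service_restored = NEXT_DAY.replace(hour=3, minute=15)     # 3:15am, assumed UTC

storage_outage = recovery_notice - disks_down
service_outage = service_restored - disks_down
print(storage_outage, service_outage)
```

That comes to about 5 hours 15 minutes until the recovery message and about
6 hours 30 minutes until the service was back on.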

------
irfanbaigse
DigitalOcean: bad experience

------
jbverschoor
Their ad was “you’ve been developing like a beast and your app is ready to go
live”

DO is a nice thing to play around with and maybe launch something, but I
wouldn’t run full production on it.

