
Digital Ocean needs to start showing outages on their status page - jonny_eh
https://medium.com/p/f9b57df6d938
======
minaguib
In any infrastructure beyond the tiny/small stage, there are always pieces
that are failing or have failed. In some cases these failures are observed by
customers. In other cases they are not.

It would be unreasonable to have a company-wide status page that constantly
lists "some customers are experiencing some problems". That's not the point of
the status page - the status page, as the author suggested, is there to
highlight issues that are affecting a significant section of the customer
base.

The right thing for Digital Ocean to do in cases like this is to allow you, in
your private dashboard, to see the problem and follow up on a master ticket
for escalation and resolution.

~~~
mattbee
As an infrastructure provider, I disagree. Customers expect to find the answer
to the question "did I do something wrong, or are you having problems?" on the
status page.

While we've listed every outage affecting >1 customer since 2004 on our status
page, the issue is always expressing each outage in a way that allows a
customer to identify that _their_ server is affected by a particular entry.

That sometimes involves knowing that their server is in a particular data
centre, or that it is connected to a particular switch, etc. and we do our
best to make sure that people can identify their problem, if nothing else than
by timing, i.e. making sure we list something ASAP.

But status pages are still very useful once people start calling in - support
can positively identify that yes, you are affected by this problem and you can
track progress at that URL. If we keep updating as we promise, bang, that's
one support call (at most) per affected customer.

So even for a minor problem - broken switch, VM host machine, power failure in
a rack, that's worth listing from my point of view.

As I said above I'm looking to tie our databases and some basic network
monitoring into it this year. That way we can proactively notify people
affected by a particular problem, as well as continuing to list even small
problems publicly.

~~~
nixgeek
At the scale of DigitalOcean (~10k physical nodes), Amazon (>250k physical
nodes) or Google this seems wholly unreasonable. There's a definite
signal-to-noise issue, because there will be hundreds of failures per day in any
large infrastructure, some of which will have customer impact (ranging from
minor to major). It's a statistical reality.

A lot of the other suggestions seem to centre around extending your status
page to be more personal to the end user, this is (to some extent) the route
Amazon AWS has taken in allowing you to see which instances are scheduled for
retirement (because they lack live migration capability, a la GCE), etc.

I note that Amazon now sends out maintenance e-mails advising when certain
IPsec connections will go down for them to perform upgrades, this is also
great, and others should copy.

This leaves the 'globally visible' bit (i.e.
[http://status.aws.amazon.com/](http://status.aws.amazon.com/)) for critical
outages affecting a large proportion of your customers.

~~~
neom
I was just discussing how we could provide this level of granularity without
overwhelming the status page as we scale. What if we provided an api endpoint
for you to check the health of the node?

~~~
count
Status needs to be hosted not on the infrastructure being reported on. If the
issue is an API endpoint outage, having an API endpoint for status reporting
is...counterproductive :)

~~~
neom
If there was an API outage we would report it on our status page, the blog
post is about the health of individual physical nodes on our network that
bring down clusters of VMs. If there was an API endpoint outage we'd post it
to the status page. :)

------
zapt02
I had 48 hours of downtime on a DigitalOcean node last week. All events on the
node were stalled, so I could not shut it down or take an image to spin up a new
instance. I had to hammer their support with a dozen tickets before someone
gave me something other than a canned reply. Of course they did not acknowledge
this very long outage on their status page. I like DO, but stuff like this just
can't happen without anyone checking on it stat. I have become very wary about
using them for critical infrastructure since then.

~~~
meddlepal
Using a budget IaaS for "critical infrastructure" is your mistake.

~~~
yeukhon
You seem to think every business has a budget for a big contract with DO or
Amazon or Google.

Some people run their email server on these budget IaaS. Some prefer to host
outside of Google or Amazon's power. So where else should they host their own
server? Home?

Ideally, if we have continuous streaming backup of a node, then when the host
machine fails a second machine can pick up and serve the last backup. This is
of course expensive for any provider to do for every customer. But asking DO to
actually report the status of the node, its host machine and the region is the
right thing to do.

Customers don't need to know the full technical detail but even a nice
friendly message (email, sms or even on the status page) will ease the
conflict: "Your host now appears offline because the host machine is offline.
Don't worry! Your data is safe with our backup! If you have any concern,
please contact XXXXX@digitalocean.com or at xxx-xxx-xxx."

Conclusion:

* report the status of the droplet on personal dashboard

* for non-isolated incident, report it on both droplet personal dashboard and public dashboard.

~~~
JohnTHaller
Digital Ocean is absolutely not meant for critical infrastructure, nor is it
meant for running a production mail server (there's a good chance the IP has
already been flagged somewhere for spam in any shared cloud server IP space).
You're paying for a low-budget VPS with no phone support. Yes, they have a
99.9% SLA, but the penalty to them if they exceed that is minimal.
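
For scale, a rough sketch of what a 99.9% SLA permits, assuming a 30-day
month (the month length and the function names are illustrative assumptions;
the 48-hour figure is the outage reported upthread):

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` days at the given SLA."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


def achieved_uptime_percent(downtime_hours: float, days: int = 30) -> float:
    """Uptime percentage over `days` days given the hours of downtime."""
    total_hours = days * 24
    return 100.0 * (total_hours - downtime_hours) / total_hours


# 99.9% over 30 days allows roughly 43 minutes of downtime.
budget = allowed_downtime_minutes(99.9)
# A 48-hour outage in the same month works out to roughly 93.3% uptime.
uptime = achieved_uptime_percent(48)
print(f"budget: {budget:.1f} min, uptime after 48 h down: {uptime:.1f}%")
```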

~~~
nknighthb
The right way to defend a company is not by contradicting their own marketing.

------
andmarios
Many people here claim “it's a $5 box, what do you expect”.

First it may be a $10 or a $20 box. But more importantly, every VPS provider
of the few I've tried myself (e.g. Linode, XenCon) sends an email and opens
a support ticket every time a $10 VPS goes off.

So please, do not try to change the norm. A server is a server no matter the
cost and should be reliable. I like digital ocean but they won't get better by
petting them.

~~~
aroch
Neither Leaseweb nor OVH sends me emails when my $1000's in Server/VPS/Compute
appliances go down unless I sign up for a paid SLA (I do). Rackspace
_sometimes_ emails me about downtime, but usually only after the fact to let
me know I'm getting $40 back. Amazon doesn't really let me know either, their
status board remains in the green barring natural disaster or the AWS region
literally catching on fire.

Unless you're paying for an SLA which defines uptime and notification time,
you're at the beck and call of Best Effort support, which may mean no
notification and no reimbursement for downtime. Notification within minutes of
your VPS going down isn't necessarily the norm.

~~~
sirdogealot
OVH doesn't inform you because you're paying commodity pricing for unmanaged
servers. It's actually your job to keep track of the downtime, not theirs.

DigitalOcean stands to make ~$10,000 monthly off of a "$1,000 server" because
it is somewhat managed, and thus has the responsibility of informing the 100
or 1,000 odd customers affected because of their mistake imho.

~~~
prebrov
Agree totally. Physical servers are fully managed by DO and they absolutely
must inform customers when they fail at their job.

------
scootklein
Founder of statuspage.io here.

This is 100% a tool problem, and a problem we're actively working on for
customers of ours that plan on having thousands of customized "views" for what
they normally consider a "status page". Per-user functionality is one use
case, but it can and will go deeper than that. If they could post an incident
such that only you can see it, or such that only you are notified, they most
certainly would.

I disagree that posting everything to be globally viewable is the right course
of action, as this outage doesn't necessarily implicate fault on DO as a
provider, but it also doesn't mean that you as an individual customer
shouldn't have access to your specific view of a status page as it relates to
exactly what infrastructure you live on.

You'd be surprised how prevalent this issue is, and how much inaction it
creates on the provider end.

~~~
teepo
Or the OP could use a cloud IaaS provider that creates tickets on his behalf.
The fact he submitted a ticket to DO and got a response indicating what the
issue is, and that they are working on it is fine IMO since you get what you
pay for. He's embarrassed and angry that he didn't plan for HA and had down
time.

~~~
kbar13
I disagree.

The whole point of IaaS is that the hardware/hypervisor/network is the
provider's responsibility, and everything inside of your VM is yours. If the
provider isn't doing due diligence to monitor their infra, then what infra are
you selling as a part of your IaaS?

~~~
Ecio78
How do you know that they're not monitoring their infra? The OP is complaining
that they didn't post an update on the status page. I see no evidence that
they were not aware of the issue at the physical hypervisor level before he
filed the ticket.

~~~
kbar13
What's the point of monitoring your infra if the report never reaches the
party who cares?

~~~
Ecio78
So that you can proactively fix an issue before some/many/most of the users
become aware of the issue.

------
kordless
I was thinking about a feature providing callbacks when the hypervisor running
your instance fails.

I think it's completely unreasonable to expect a status update on every single
thing that might go wrong in Digital Ocean's infrastructure. If a single
hypervisor/server fails, that could be ANYTHING. Bad drives, flaky memory,
failed fans, etc., etc. This stuff happens ALL THE TIME and does not warrant a
system wide update that something is wrong with the service, simply because
there is nothing wrong with the service. All the other bits are functioning
normally.

A loss of a box is expected and shit happens. Architect for it, or expect it
to fail at some point. Everything dies eventually.

Also, $5 a month.

~~~
canadev
Just playing devil's advocate,

So, I'll get a callback from my provider when my server is down...

Unfortunately, that callback is POSTed to a server that is already running on
my provider, so I never see it :)

(Assuming that you mean that the callback is an external thing, i.e. provider
-> customer. Also, I suppose I could host the callback receiving server on
another provider. But I don't want to.)

~~~
kordless
Something like AppEngine would be a good callback handling mechanism.
Something that could start more boxes, for example.
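
A minimal sketch of such a callback receiver, meant to run somewhere other
than the provider being monitored. The payload shape, the `host_down` event
name and the function names are assumptions for illustration; no provider in
this thread documents a callback like this.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import List

# Hypothetical payload, invented for this sketch:
# {"event": "host_down", "host_id": "hv-123", "instances": ["web-1", "web-2"]}


def plan_actions(payload) -> List[str]:
    """Return the instance names that need a replacement after a host failure."""
    if not isinstance(payload, dict) or payload.get("event") != "host_down":
        return []
    return sorted(payload.get("instances", []))


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_response(400)
            self.end_headers()
            return
        for name in plan_actions(payload):
            # A real handler would call a second provider's API here.
            print(f"would start a replacement for {name}")
        self.send_response(204)
        self.end_headers()


def make_server(port: int = 8080) -> HTTPServer:
    """Build (but do not start) the receiver; call .serve_forever() to run it."""
    return HTTPServer(("", port), CallbackHandler)
```

Keeping the decision logic in `plan_actions` separate from the HTTP handler
makes it easy to test without a network.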

------
silas
Linode opens a ticket for all affected users and emails them proactively, I
think this is the right way to message this type of outage.

~~~
noir_lord
Yep and it's one of the main reasons I pay $20 an instance per month instead
of $5.

Sometimes you actually do get what you pay for.

------
kapilvt
So an individual vm instance in any cloud provider is unreliable. Welcome to
the cloud. It's foolish to assume otherwise, keep backups and restore to
another instance or go HA. What's unreasonable is to assume they should
publish a status update saying that one of their thousands of machines is
having an issue. Fwiw, I've had the same 'hypervisor problem' response from
DO one time when I couldn't terminate a droplet.

Otoh when they have whole api outages for creating or destroying vms (like
also happened last week) i'd expect and did find a status update. What their
threshold is for reporting isn't clear.. but a 1 machine issue isn't something
any provider would report on.

~~~
bwb
Plus this isn't cloud, just normal VPS with marketing.

~~~
josephcooney
What's the difference?

~~~
devicenull
Some people feel that cloud means high availability vm's. So that when a host
fails the servers are transparently migrated away. IMO this is the exact
opposite of what you want.

The people that want HA like this generally custom build every server and
don't use configuration management.

------
mattbee
[http://forum.bytemark.co.uk/category/outage-notifications](http://forum.bytemark.co.uk/category/outage-notifications) -
every single one of Bytemark's outages since June 2004. It's a great sales
tool, at least when someone asks "so how reliable are you?". I see a few using
statuspage.io but without having an entire network map & customer list
uploaded (so it knows who to notify), it's not going to be better, and I'm not
sure it's geared to presenting historical outage data.

I'm hoping to finally build a tool this year incorporating some network
monitoring (i.e. "unconfirmed reports") and a copy of all our internal
rack/network database, so that we get to the holy grail: 1) customers get
notified by email/SMS of stuff that definitely affects them, 2) it's easy for
an engineer (or the whole team) to write notes on outages as they happen, and
have them presented in a way that doesn't confuse customers.
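
A rough sketch of the lookup such a tool would need: given a failed rack or
switch, find which customers to notify. The `Server` record, its fields and
the sample inventory are hypothetical and do not reflect Bytemark's actual
database.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Server:
    name: str
    customer: str
    rack: str
    switch: str


def affected_customers(
    servers: Iterable[Server],
    *,
    rack: Optional[str] = None,
    switch: Optional[str] = None,
) -> Dict[str, List[str]]:
    """Map each affected customer to their servers behind the failed component."""
    hits: Dict[str, List[str]] = {}
    for s in servers:
        in_rack = rack is not None and s.rack == rack
        on_switch = switch is not None and s.switch == switch
        if in_rack or on_switch:
            hits.setdefault(s.customer, []).append(s.name)
    return hits


inventory = [
    Server("web-1", "alice", "rack-a", "sw-2"),
    Server("db-1", "bob", "rack-b", "sw-2"),
    Server("web-2", "carol", "rack-b", "sw-3"),
]
# Everyone connected to switch sw-2 gets an email/SMS; carol does not.
print(affected_customers(inventory, switch="sw-2"))
```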

------
BuildTheRobots
> IMO a status page should be a public record of all the times your service
> has experienced a catastrophic failure, even for a small number of
> customers, if not also small hiccups like packets loss or lag.

This is the most important point. Status page should be a _log_ rather than a
transient message as with so many providers.

edit: as other people have already namedropped, I'll point out that OVH do a
pretty good job with the status page[1] and network maps[2].

[1] [http://status.ovh.net/](http://status.ovh.net/) [2]
[http://weathermap.ovh.net/](http://weathermap.ovh.net/)

------
teepo
What level of support are you expecting for $5 a month? If uptime and support
of your website/ application is important to you it's likely time you invest
more in your cloud infrastructure.

~~~
mjolk
What if I'm giving them $10 a month? $20? At what point have you as a customer
"bought the right" to updates?

------
skolos
Sounds like what Digital Ocean is doing with respect to status is common
industry practice.

One startup I know was using Heroku to host their website. One day the website
had problems and was unavailable for an extended period of time. Heroku team
worked hard to resolve the problem, but it still took them several hours.
Heroku did not update their status. They stated that if 99% of customers do
not have problems, they consider any problems to be local and not reflecting
status of the whole infrastructure.

------
mkal_tsr
With the outages I've had at DigitalOcean and the AWS price cuts, I'm prepping
to move over to AWS unless DO can make a good case against it ... I'm tired of
seeing timeouts and droplets hanging which affects my business.

~~~
wc-
You will definitely have downtime due to hardware issues with AWS as well, and
every other host out there... Maybe it would be better to spend the time you
were going to spend switching to AWS investigating a load balancing setup and
other things you would need to remove the dependency on one instance always
working correctly?

The flexibility of the cloud is awesome but sometimes it does get cloudy up
there. Redundancy is a must with digitalocean, aws, or anyone else.

~~~
gabemart
Can anyone recommend a very simple load-balancing setup for a bunch of tiny
cloud servers serving a static app?

I'm currently using round-robin DNS for load-distribution as the simplest
thing that could possibly work, but it obviously doesn't actually balance load
and it doesn't remove dead servers from the pool. What's the next step up that
doesn't cost an arm and a leg to implement?

~~~
robszumski
Choose your favorite provider's load balancer service and add all the
machines. Configure the health check to listen for a 200 on /. This will leave
you with a SPOF at the LB, but you can round-robin DNS between two LBs on
different providers if you wanted to.
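
To illustrate what that health check buys you over plain round-robin DNS,
here is a small sketch of a rotation that skips backends failing a "200 on /"
check. It is not any provider's load balancer; the `RoundRobin` class and the
`returns_200` helper are names made up for this example.

```python
import http.client
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def returns_200(url: str, timeout: float = 2.0) -> bool:
    """True if a GET on the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError, http.client.HTTPException):
        return False


class RoundRobin:
    """Rotate through backends, skipping any that fail the health check."""

    def __init__(self, backends: Iterable[str]):
        self._backends = list(backends)
        self._next = 0

    def pick(self, check: Callable[[str], bool] = returns_200) -> Optional[str]:
        # Try each backend at most once per call, starting where we left off.
        for _ in range(len(self._backends)):
            candidate = self._backends[self._next % len(self._backends)]
            self._next += 1
            if check(candidate):
                return candidate
        return None  # every backend failed its check
```

A real load balancer would run the checks on a timer rather than on every
request; checking inline just keeps the sketch short.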

------
larsmak
When the cloud was first being introduced one of the core principles was that
your instances should be considered ephemeral - and that they can terminate at
any time, without any notice or further explanation. I think some people
sometimes forget that this still holds today.

Don't expect your instance to run forever. Design your architecture with this
in mind. There is a reason e.g. aws provides multiple availability zones.

~~~
devicenull
It's interesting, because these days there are people that define cloud as
"high availability VMs". They expect that in the case of a host failure that
everything can be transparently migrated to another host.

~~~
larsmak
I'd say that depends on what kind of service the provider is offering. Later
generation cloud providers might specialize in providing such VMs - Heroku
and GAE come to mind. But on providers that offer more "bare metal" VMs one
should not expect such migration. Indeed it would be almost impossible to do,
since it would require a lot of information about the application you are
running. Heroku and GAE solve this by limiting the capabilities of the VMs so
that they have more control/information about the running applications.

------
brryant
I think this just goes to show that the DO platform is not suitable for sites
that require uptime. The price point justifies the simplicity, but sometimes
simple isn't enough. AWS still has everything a startup needs to run their
stack.

~~~
xur17
Just to provide another data point.

I've been with Digital Ocean for over a year, and I'm fairly certain that I
haven't had any downtime at all. The site is used by thousands of users a day,
and I've never had any complaints. Pingdom is set to a resolution of 1 minute,
and hasn't reported any outages either (that weren't caused by me).

~~~
brryant
We run our forums on DO as well and it hasn't caused a hiccup at all. Uptime
has been great for me, but at this point there isn't enough tooling around DO
for me to be comfortable with doing more than just simple web sites

------
jafaku
I was thinking about switching to Digital Ocean. Thanks for letting us know
they aren't as professional as everyone claimed.

~~~
pekk
Budget VPS is budget VPS, there's nothing unprofessional about it, if you want
excellent support you are going to pay for the privilege.

~~~
oafitupa
That's not support, that's them failing at providing what they advertised. I
have used OVH for years, and they always fixed everything on their end very
quickly. And that was for unmanaged dedicated servers, so I never had
"excellent support", I had 0 support, but they provided what they advertised.

------
jonny_eh
It's been at least 5 hours and my droplet is still unresponsive, and their
status page still says everything is all good.

~~~
jafaku
Welcome to the cloud!

------
oomkiller
OVH does a much better job at this I think. It's not the prettiest status
page, but it lists a lot more things.
[http://status.ovh.com](http://status.ovh.com)

~~~
silas
Yeah, this is why spamming your general status page with lots of information
is a bad idea. I have no idea what's going on there. At first glance it looks
like half their services are having issues, which makes me think it always
looks like that, which means it tells me nothing without digging into the
detailed issues.

~~~
dylz
OVH is a multinational broadband ISP in addition to colo, cloud/vserver,
dedicated, voip, domain registrar, etc.

Looking at the columns at a glance, there are almost no events "in progress"
(most are Closed) and the vast majority of the open events are early warning
for maintenance windows affecting very specific services.

------
tzakrajs
You are asking a bit much. Your site is so insignificant in the grand scheme
of goings-on at Digital Ocean that it would be far more misleading to mention
it on their status page.

------
digitalabyss
I agree that you get what you pay for. Just because you don’t have a budget
for a $1000/mo dedicated server does not mean you have the right to that
service level at a lower price.

As fellow geeks I am sure we have all been asked for advice on buying
computers from our family and friends. I’m a mac person and use a $3000 laptop
but I know most of them will not be willing to spend this much money. I
normally quote them a $600(ish) computer that has at least an i-Series
processor and 6-8GB of RAM. I tell them to avoid a couple vendors that I
consider to be bottom of the barrel. It never fails though, that for all the
reasoning and advice I give them, if they find a computer for $400 in the
Sunday paper all that gets thrown out the window. They don’t want a computer
that fits their needs; they just want the cheapest computer. It also never
fails that the moment it starts having problems they come to me for help and
if I give them any flak for buying a cheap POS I am suddenly considered
an a-hole for not helping since that’s all they can afford.

The way I look at it is I spend my hard earned money on a reliable machine so
I am not put out by the type of issues you get with cheap hardware or cheap
services. You did not heed my advice to avoid the same issues, and thus
asking me to give up my weekend or even a couple hours to help you makes you
the a-hole, not me.

I have used Rackspace, Linode, DO, AWS, and Google Apps among others and none
of their status pages are ever very helpful. It’s really a problem with
Google Apps since my users know to check there and then claim the problem is
not with Google even though it is. I frequently have issues with their IMAP
servers failing where a user can connect via the web interface but not through
an IMAP client. This is never shown on their status page. Of course I am going
to check the status page.

The only hosting provider I have no complaints about is Rackspace but those
servers are almost $1000/mo. On the other hand they do open up tickets for me
faster than I can log into the management portal to do it myself. Even still I
have had hour-long outages. If you don’t have HA you WILL have outages no
matter who you use or how much you pay. Ironically the worst service was from
The Planet even though those were still $800/mo dedicated servers. Their
status page was a twitter account that they did not advertise on their
homepage.

I have debated moving our average SMB size clients with no HA from linode to
DO just because tools like Packer can interface with them easier. Still don’t
know about that.

------
shirro
This is very true. It is the nature of cheap unmanaged services that the user
bears the bulk of the burden of maintaining their server instance. But this
makes it even more important that providers fess up when they have issues.

I have only been with DO for a short time but I have already encountered
several of these silent outages where I can't query the status of my server
through their tools and there is no status update. It is very different to
Linode. They are a fantastic service for personal sites but I wouldn't host
anything professional with them until they do more to gain my confidence.

------
cdelsolar
I love DO but their infrastructure is not stable. I get random downtimes very
often (get error emails at least once a week with connection errors). It's
definitely cheap, and I'm using it for a side project web app, but I don't
think I would use it for an actual business yet. I hope they keep growing
though until they can completely gain our trust.

------
larrys
This is really very simple.

Low prices and popularity bring growth.

Growth brings more customers than you can support as well as infrastructure
problems that you can't handle. Because your company has not scaled up over a
long time it has to hire people and shove things through the pipeline quickly
which means mistakes (both in process and people) will inevitably be made.
Rome wasn't built in a day as the saying goes but startups are. And they end
up growing quicker than they should. Someone has to lose and it's the customer
(not all of them but some of them). [1]

Not to mention the fact that if you are charging very little ($5 per month is
pretty cheap obviously for the base service) it gives you less profit to
handle things in a way that is perhaps more robust, and less ability to
paper over problems by building in redundancy.

The saying "price, quality, speed: pick any two" applies here.

DO will get better of course but it will take time as they iron out and
encounter the various issues that they face.

[1] I've observed this since 1982 when PC clones came out and competed with
IBM. The clones shoved things into the channel and all of a sudden hardware
problems were shoved on the customers. Previously IBM charged enough that
those things were handled by IBM not their customers. Because they had the
profits and took the time to pay attention to details.

------
scriptle
I had HTTPD and MySQL server running on my $5 droplet, and I had no SWAP
partition. I recorded the outages using pingdom. It was interesting to see
that the droplet went down too frequently and sometimes never came up. Had to
manually restart the droplet from the Digital Ocean Interface.

------
_asciiker_
The choice of showing the status of our hypervisors permanently was not an
easy one amongst the team. But in the end, we figured customers have a right
to know.

[http://www.tailoredclouds.com/cloud-status.html](http://www.tailoredclouds.com/cloud-status.html)

~~~
kbar13
To be honest, 9 host machines is nothing like displaying status for hundreds
if not thousands of host machines in a large-scale deployment in major IaaS
providers.

~~~
_asciiker_
I understand your point. We decided to put up only the 9 most crucial
hypervisors. We are a startup after all, and a self-funded one. More will come.

------
Fizzadar
I suspect this is because it looks bad to display the individual status of
every node, because they'll have more common failures; keeping it limited to
DC-wide issues is good for business.

They should have a private/client-only per-hypervisor status page.

------
blueskin_
DO need to do a _lot_ of things. They don't even have IPv6 yet, which is,
TBFH, beyond a joke, even for a $5/mo provider. Then the censorship guarantees
I will never use them even past that.

------
DiabloD3
I don't get why more HN-type startups don't look at using dedicated machines
instead. Every dollar counts.

Exelion.net is running a special for HN users. 50% off for life for any
server. Or more than one server. Or quite a few servers. Just use HN50 when
checking out.

Exelion does top of the line E3-1230v3 Haswell (8 thread) + 16gb + 2x240GB SSD
RAID 1 + Gigabit port + 33TB/mo bandwidth for $115/mo.

DO does 8 thread + 16gb + 160gb SSD + 6tb/transfer for $160/mo plus another
$1350/mo for the full 33TB/mo bandwidth.

And if you wanted unmetered gigabit? Exelion does it for $200/mo extra flat
rate. DO does it for an equivalent of $16,120/mo for that plan (based on 5
cents per gigabyte and 6TB already included with the plan).

It just doesn't make sense financially to keep using them.
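
The bandwidth figures above can be rechecked with a short sketch. It uses the
comment's own numbers (5 cents per gigabyte, 6TB included) and assumes
1 TB = 1000 GB and a 30-day month; the function names are illustrative.

```python
def overage_cost(total_tb: float, included_tb: float = 6.0,
                 usd_per_gb: float = 0.05, gb_per_tb: int = 1000) -> float:
    """Cost in USD of the transfer beyond what the plan includes."""
    extra_gb = max(total_tb - included_tb, 0) * gb_per_tb
    return extra_gb * usd_per_gb


def saturated_link_tb(gbit_per_s: float = 1.0, days: int = 30) -> float:
    """Terabytes moved by a link running flat out for `days` days."""
    gb_per_second = gbit_per_s / 8  # 8 bits per byte
    return gb_per_second * days * 86400 / 1000


# 33 TB with 6 TB included: 27,000 GB of overage, about $1,350.
print(round(overage_cost(33)))
# A saturated gigabit port moves about 324 TB in 30 days, about $15,900 of overage.
print(round(overage_cost(saturated_link_tb())))
```

The second result lands in the same range as the $16,120 quoted above; the
exact figure depends on the month length assumed.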

------
beachstartup
two is one and one is none. doesn't matter where it's hosted.

servers fail, and you need redundancy. you might as well get angry at the sky
for being blue.

~~~
jonny_eh
Where did I complain about them failing?

------
puredemo
This is why you use a more substantial, long-standing service, like
[http://prgmr.com](http://prgmr.com)

~~~
insertnickname
I don't know anything about their reliability, but their prices are quite high:
[http://prgmr.com/xen/plans.html](http://prgmr.com/xen/plans.html)

------
nilved
This is a little crazy. Linode doesn't put up a security advisory when a
single node has hardware problems.

------
philip1209
I receive New Relic alerts about Digital Ocean outages way too frequently.

------
halayli
For those interested in monitoring their websites and services and creating
custom status pages, I run webmon.com for this exact reason.

