
Don't ever use Digital Ocean as your production system. - coolboykl
Received this email from DigitalOcean:

------------------------------------

We've had to unfortunately reboot your Droplet xxxdb due to an issue on the
underlying physical node where the Droplet runs.

We are investigating the health of the physical node to determine whether this
was a single incident or systemic.

If you have any questions related to this issue, please send us a ticket.
https://cloud.digitalocean.com/support

Happy coding,
DigitalOcean

-----------------------------------

What I'm not so happy about:

a. They didn't give me advance notice, so I couldn't prepare a cutover to
other droplets.

b. Although they claim they rebooted my droplet, they didn't check whether the
droplet booted correctly; in fact, this particular instance was down for more
than 3 hours.

c. Worse, somehow while rebooting the droplet onto new hardware, the system
reverted the droplet to its original kernel, which left a lot of underlying
services broken. I had to spend another 2-3 hours setting up a new droplet.

d. After I complained to their support people and explained that this is
totally unacceptable (the whole server was set up for a very important client
of mine, and for unplanned downtime we have to pay a huge fine), their
director credited me USD 20. That ignores the fine I have to pay and the
additional time spent setting up the droplet.

------------

I asked friends who have used DigitalOcean before; all had bad experiences
with DO. Some droplets rebooted into read-only mode; some droplets were lost
forever.

Conclusion: use DO for your development/testing servers. Never, never use it
for your production server, at least not yet.
======
patio11
Hosts requiring a reboot due to issues with the underlying physical hardware
are inevitable for VMs. It happens to me approximately once per instance-year
on Rackspace (and before that, Slicehost). [Edit: Slightly better
approximation: ~5 incidents in the last ~20 instance-years.]

The general level of seriousness with regards to operations is not constant
across all VM providers. Let's put it this way: there exist companies where
the heart-and-soul of the business is supporting $5 a month hobbyists and
there are companies where the heart-and-soul of the business is business-grade
services. Choose appropriately.

~~~
benguild
This. Some VPS providers don't care at all about downtime since the service is
so cheap. And scaling that cheap service up to hundreds of dollars a month for
more resources doesn't necessarily guarantee any increase in quality...
maybe just an increase in attention if you make a fuss.

------
neom
This does suck, seriously. And honestly, I don't think we do enough to let you
prevent stuff like this from taking your application down. Sure, you should be
building clustered, highly available applications with good failover, but we
really don't provide you with the best tools to do real failover. We're
working hard right now on a shared IP so you can fail over onto another IP,
and we're also in the early stages of building out a really nice load
balancer. People forget DO is still a pretty small team, 15 engineers now, and
we're doing our best. :)

Again, this really sucks and I'm sorry it happened. I'd love for you to email
me the IP of the droplet you had issues with (je@digitalocean.com).

j.

CTE, DigitalOcean.

------
thejosh
This will happen with any VPS provider. DO isn't "the cloud", any more than
Linode/RAMNode/any other VPS provider is.

If you really had a "huge fine" hanging over you, why didn't you spring for
multiple servers to prevent this from happening? Redundancy will hopefully be
your lesson here, rather than blaming budget VPS providers for your
incompetence.

------
PaulHoule
This happens in AWS too. In the cloud you have to assume nodes will go down
and not have it be a disaster if one fails.

~~~
mnem
AWS generally, although not always, shows an alert before this happens.

However, I totally agree - if you are running any sort of service where you
have a financial penalty if that service goes down, then it's your
responsibility to ensure your service's architecture supports catastrophic
failure of nodes. 1 machine running all the things isn't high availability and
shouldn't be sold as such.

~~~
coolboykl
I do agree. We did set up for HA, but due to some bugs on our end, the cutover
from our DB slave to master didn't happen. We'll be more careful next time.

------
stevekemp
> as the whole server was set up for a very important client of mine, and for
> unplanned downtime, we have to pay a huge fine.

If your availability requirements are such that you get fined for downtime
then you can never ever rely upon a single machine, no matter who hosts it.

Even if you have your own dedicated server it will eventually fail. (Be it a
dead drive, dead NIC, or blown PSU/PSUs.)

You need to be setting up a cluster for high-availability, without a single
point of failure. Even then you're at the whims of the network between your
clients and the location where you're served - some ISPs will have issues at
any given moment, and will have broken routes.

Really your problem here seems to boil down to three things:

1. Your server was rebooted and you had no monitoring in place to detect the
downtime; pingdom, etc., would have alerted you.

2. You seem to think a single guest/droplet/host will be 100% available.

3. You have a spare host/hosts to fail over to, but that process is manual, so
without advance notice you didn't know to do it. See point 1.

------
SEJeff
People often misunderstand the difference between failover, high availability,
and load balancing.

Failover - n+1: a standby node waiting to take over in the event of the
primary node falling over.

High availability - n+2: requires a minimum of 3 nodes to establish quorum
and elect a master node. This often involves hardware-level fencing and
STONITH (google the name if you're not aware).

Load balancing - distributing load among multiple nodes to scale
horizontally.

In a perfect world, you are using a data store that supports master/master
replication. Then you just front your data store with a load balancer like
haproxy. You can keep the load balancer itself up by running multiple
instances on different nodes and setting up failover between them with
something like keepalived. It sucks that this happened at DO, but perhaps it
will help you build more robust infrastructure in the future.

Always design for each individual component to fail.
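Fronting MySQL with haproxy as described might look roughly like this (the
addresses, VIP, and check user are all made up for illustration); MySQL is
plain TCP, so the proxy runs in `mode tcp` with haproxy's built-in
`mysql-check`:

```
# /etc/haproxy/haproxy.cfg (fragment) -- hypothetical addresses
listen mysql-cluster
    bind 10.0.0.100:3306        # the keepalived-managed VIP
    mode tcp
    balance roundrobin
    # requires a passwordless check user created in MySQL, e.g.:
    #   CREATE USER 'haproxy_check'@'10.0.0.%';
    option mysql-check user haproxy_check
    server db1 10.0.0.11:3306 check
    server db2 10.0.0.12:3306 check
```

A plain TCP check would also work, but `mysql-check` confirms the server
actually completes a MySQL handshake rather than merely accepting connections.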

~~~
coolboykl
So in this case, do I need to set up two nodes for HAProxy as well, to load
balance the MySQL cluster?

~~~
SEJeff
Yes, and then set up keepalived on them so you have a stable VIP that is
guaranteed to move between them. I did a writeup of how we did it for
Ticketmaster (when I worked there) here:

[http://www.digitalprognosis.com/opensource/scripts/keepalive...](http://www.digitalprognosis.com/opensource/scripts/keepalived/HOWTO)

The gist is that you have a dummy interface, e.g. dummy0, and when you down
that interface, the VIP flips to the backup node with the highest VRRP
priority.

------
reitanqild
Or take the approach Netflix did:
[http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html](http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html)

------
anderspetersson
I got the same message yesterday, but I'm in a slightly different situation.

Pingdom shows about 36 minutes of downtime, but when DO fixed the issue my
sites started working again. I guess that's acceptable for the $5/month I'm
paying, and if I need HA I can remove the SPOF.

One thing that does bother me, however, is that there is no comment about the
issue on the status page, which makes me wonder how often these types of
incidents happen.

------
tmikaeld
Automatic failover is not the same as high availability.

If you are running critical services, you should have a completely redundant
node standing by to take over if the first one fails, although few clients
seem willing to pay the roughly double cost of having it.

It's even possible to keep a node in sync in a different datacenter, although
the bandwidth costs can become high.

~~~
coolboykl
Yes, lesson learned for us. We do have a script to automatically promote our
slave to master once it detects that the master DB has died.

Guess it's best to set up master-to-master replication.

~~~
stevekemp
> Guess it's best to setup Master to Master replication..

Master-master has issues of its own if you use shoddy applications that
assume next-id = max(id)+1.

In the worst case you can find data on one master and not the other, and vice
versa. So be prepared to reconcile things if you get into a split-brain
scenario.
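The id-collision half of the problem can at least be mitigated at the database
layer. For MySQL master-master pairs, the usual trick is to interleave
auto-increment values so the two masters can never generate the same id (a
sketch for a two-node pair):

```
# my.cnf on master 1
auto_increment_increment = 2
auto_increment_offset    = 1   # generates ids 1, 3, 5, ...

# my.cnf on master 2
auto_increment_increment = 2
auto_increment_offset    = 2   # generates ids 2, 4, 6, ...
```

Note this only prevents duplicate auto-increment keys; it does nothing about
the split-brain divergence described above, which still needs manual
reconciliation.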

------
mechiland
I had exactly the same experience, as logged on Twitter:
[https://twitter.com/mechiland/status/441199476502323200](https://twitter.com/mechiland/status/441199476502323200)
Actually, I stopped using DO after that.

------
propercoil
I have the same issue now. If the kernel was changed, I'm screwed. Their
support doesn't seem to understand the issue. I've linked this page, so
hopefully it will help.

------
lazylizard
There are various degrees of 'important': HA? Failover (auto? manual?)?
Backup? None of the above (so it's just a charade, eh?)? And I do believe SLAs
usually compensate you based on how much you pay; it's not insurance, it's a
'money back guarantee'.

