Don't ever use Digital Ocean as your production System.
16 points by coolboykl on May 11, 2014 | hide | past | favorite | 25 comments
Received this email from DigitalOcean ------------------------------------

We've had to unfortunately reboot your Droplet xxxdb due to an issue on the underlying physical node where the Droplet runs.

We are investigating the health of the physical node to determine whether this was a single incident or systemic.

If you have any questions related to this issue, please send us a ticket. https://cloud.digitalocean.com/support

Happy coding, DigitalOcean -----------------------------------

What I'm not happy about:

a. They didn't give me advance notice, so I could have prepared a cutover to other droplets.

b. Although they claim they rebooted my droplet, they didn't check whether it booted correctly; in fact, this particular instance was down for more than 3 hours.

c. Worse, somehow during the reboot of the droplet onto new hardware, the system reverted the droplet to its original kernel, which left a lot of underlying services broken... I had to spend another 2-3 hours setting up a new droplet.

d. When I complained about this to their support people and explained that it is totally unacceptable, since the whole server was set up for a very important client of mine and we have to pay a huge fine for unplanned downtime, their Director credited me USD 20. That doesn't come close to covering the fine, let alone the additional time to set up the droplet..


I asked around among friends who have used DigitalOcean before, and all had bad experiences with DO: some droplets rebooted into read-only mode, some droplets were lost forever..

Conclusion: use DO for your development/testing servers... Never, never use it for your production server, at least not yet.

Hosts requiring a reboot due to issues with the underlying physical hardware are inevitable for VMs. It happens to me approximately once per instance-year on Rackspace (and before that, Slicehost). [Edit: Slightly better approximation: ~5 incidents in the last ~20 instance-years.]

The general level of seriousness with regards to operations is not constant across all VM providers. Let's put it this way: there exist companies where the heart-and-soul of the business is supporting $5 a month hobbyists and there are companies where the heart-and-soul of the business is business-grade services. Choose appropriately.

This. Some VPS providers don't care at all about downtime, since the service is so cheap. And scaling that cheap service up to hundreds of dollars a month for more resources doesn't necessarily guarantee any increase in quality... maybe just an increase in attention if you make a fuss.

This does suck, seriously. And honestly I don't think we do enough to let you prevent stuff like this from taking your application down. Sure, you should be building clustered, highly available applications that have good failover, but really we don't provide you with the best tools to do real failover. We're working really hard right now to provide a shared IP so you can fail over onto another IP, and we're also in the early stages of building out a really nice load balancer. People forget DO is still a pretty small team, 15 engineers now, and we're doing our best. :)

Again, this really sucks and I'm sorry it happened - I'd love for you to email me the IP of the droplet you had issues with (je@digitalocean.com).


CTE, DigitalOcean.

This will happen on any VPS provider. DO isn't the "cloud", same as Linode/RAMNode/any other VPS provider.

If you really had a "huge fine" weighing on you, why didn't you spring for multiple servers to prevent these things from happening? Redundancy will hopefully be your lesson here, rather than blaming budget VPS providers for your incompetence.

> as the whole server been setup for a very important client of me, and for unplanned downtime, we have to pay huge fine.

If your availability requirements are such that you get fined for downtime then you can never ever rely upon a single machine, no matter who hosts it.

Even if you have your own dedicated server it will eventually fail. (Be it a dead drive, dead NIC, or blown PSU/PSUs.)

You need to be setting up a cluster for high-availability, without a single point of failure. Even then you're at the whims of the network between your clients and the location where you're served - some ISPs will have issues at any given moment, and will have broken routes.

Really your problem here seems to boil down to three things:

1. Your server was rebooted and you had no monitoring in place to detect the downtime - Pingdom, etc., would have alerted you.

2. You seem to think a single guest/droplet/host will be 100% available.

3. You have a spare host or hosts to fail over to, but that process is manual, so without advance notice you didn't know to do it. See point 1.
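Point 1 is cheap to fix even without a paid service: probe the site yourself on a schedule and alert only after several consecutive failures, so a single dropped check doesn't page anyone. A minimal sketch of that threshold logic (the function name and default threshold are my own, not from the thread):

```python
def should_alert(probe_results, threshold=3):
    """Return True only once the last `threshold` probes have all
    failed, so one flaky check doesn't trigger an alert.
    probe_results is a list of booleans, True = probe succeeded,
    oldest first."""
    if len(probe_results) < threshold:
        return False
    return not any(probe_results[-threshold:])
```

An external probe (e.g. a cron job curling the site and appending a success/failure flag) would feed this and send an email or SMS when it flips to True.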

This happens in AWS too. In the cloud you have to assume nodes will go down and not have it be a disaster if one fails.

Ditto. It happens with physical servers too. If you can't survive unscheduled degradation or termination of a node, you aren't running a high availability service. (Note: not every production service needs to be HA.)

I agree. When you don't run your servers yourself, you take a calculated risk that your server sometimes goes down. Of course, if this happens often, you should consider moving to another host, but no one can guarantee that a single server has 100% uptime with perfect recovery.

If the website of one of my clients goes down, it's not a disaster, and it's fine if it's up and running again in a few hours, maybe a day. I understand it's not nice when this happens, but it's the risk you take when you essentially outsource your hosting.

AWS generally, although not always, shows an alert before this happens.

However, I totally agree - if you are running any sort of service where you have a financial penalty if that service goes down, then it's your responsibility to ensure your service's architecture supports catastrophic failure of nodes. 1 machine running all the things isn't high availability and shouldn't be sold as such.

I do agree. We do set up for HA, but due to some bugs on our end, the cutover from our DB slave to master didn't happen. Will be more careful next time.

Except AWS is an actual cloud, and DO is just a few VPS with barely private networking available in half the locations.

What makes a cloud a cloud and not a VPS?

That's interesting. So, from that, the thing that stops DO being a cloud service (and linode and so on) is that you can't say "I used 2 CPUs at 100% for 12 hours"? It only allows for the granularity of saying "the node was on for 12 hours"?

People often misunderstand the difference between failover, high availability, and load balancing.

Failover - n + 1: a node waiting to take over in the event of the primary node falling over.

High availability - n + 2: requires a minimum of 3 nodes to establish "quorum" and "elect" a master node. This often involves hardware-level fencing and STONITH (Google the name if you're not aware of it).

Load balancing - distributing load among multiple nodes to scale horizontally.

In a perfect world, you are using a data store that supports master/master replication. Then you just front your data store with a load balancer like HAProxy. You can keep the load balancer itself up by running multiple instances of it on different nodes and setting up failover between them with something like keepalived. Sucks that this happened at DO, but perhaps it will help you build more robust infrastructure in the future.
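For the "front your data store with HAProxy" part, a sketch of what that could look like (addresses, server names, and the check user are assumptions, not from the thread):

```
# Hypothetical haproxy.cfg fragment: TCP-mode balancing in front of
# two MySQL masters. `option mysql-check` makes HAProxy log in as the
# given user to verify each backend is actually answering MySQL, not
# just accepting TCP connections.
listen mysql
    bind *:3306
    mode tcp
    balance leastconn
    option mysql-check user haproxy_check   # user must exist on both masters
    server db1 10.0.0.11:3306 check
    server db2 10.0.0.12:3306 check
```

The `haproxy_check` user needs no privileges beyond being allowed to connect from the load balancer's address.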

Always design for each individual component to fail.

So in this case, do I need to set up two nodes for HAProxy as well, to load balance the MySQL cluster?

Yes, and then set up keepalived on them so you have a stable VIP that is guaranteed to move between them. I did a writeup of how we did it for Ticketmaster (when I worked there) here:


The gist is that you have a dummy interface, e.g. dummy0, and when you down that interface, the VIP flips to the backup node with the highest VRRP priority.
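The dummy0 trick described above can be expressed in keepalived as a tracked interface that subtracts from the VRRP priority. A sketch (IPs, interface names, and numbers are assumptions, not from the writeup):

```
# Hypothetical keepalived.conf fragment. The node with the highest
# effective VRRP priority holds the VIP; downing dummy0 subtracts the
# tracking weight, so the VIP flips to the backup node.
vrrp_instance VI_1 {
    state MASTER
    interface eth0            # interface VRRP advertisements go out on
    virtual_router_id 51
    priority 101              # backup node uses a lower value, e.g. 100
    track_interface {
        dummy0 weight -20     # downing dummy0 drops priority to 81
    }
    virtual_ipaddress {
        10.0.0.100/24         # the stable VIP clients connect to
    }
}
```

The backup node runs the same config with `state BACKUP` and a lower priority; `ip link set dummy0 down` on the master then forces a controlled failover.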

I did get the same message yesterday, but I'm in a little different situation.

Pingdom shows about 36 minutes of downtime, but when DO fixed the issue my sites started to work again. I guess that's acceptable for the $5/month I'm paying and if I need HA I could remove the SPOF.

One thing that does bother me, however, is that there is no comment about the issue on the status page, which makes me wonder how often these types of incidents happen.

Automatic failover is not the same as High Availability.

You should have a completely redundant node standing by to take over if the first one fails if you are running critical services - although few clients seem willing to pay the roughly double cost of having it.

It is even possible to keep a node in sync in a different datacenter, although the bandwidth costs can become high.

Yes, lesson learned for us.. We do have a script to automatically cut over our slave to master DB once it detects that our master DB is "dead"..

Guess it's best to set up master-master replication..

> Guess it's best to set up master-master replication..

Master-master has issues of its own if you use shoddy applications that assume next_id = max(id) + 1.

In the worst case situation you can find data on one master, and not the other, and vice-versa. So be prepared to reconcile things if you get into a split-brain scenario.
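One standard mitigation for the ID-collision half of this, if you do go master-master with MySQL: give each master a disjoint auto-increment sequence. A sketch of the usual two-master settings (a config convention, not something from the thread):

```
# Hypothetical my.cnf fragments: interleave auto-increment IDs so the
# two masters can never hand out the same primary key, even during a
# split-brain.

# On master A:
auto_increment_increment = 2   # step by 2...
auto_increment_offset    = 1   # ...starting at 1 -> 1, 3, 5, ...

# On master B:
auto_increment_increment = 2
auto_increment_offset    = 2   # -> 2, 4, 6, ...
```

This prevents duplicate-key conflicts on inserts, but it does nothing for conflicting updates to the same row - you still need a reconciliation plan for those.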

There are various degrees of "important": HA? Failover (auto? manual?)? Backup? None of the above (so it's just a charade, eh?)? And I do believe SLAs usually compensate you based on how much you pay; it's not insurance, it's a "money back guarantee"..

I had exactly the same experience - logged it on Twitter. https://twitter.com/mechiland/status/441199476502323200 Actually, I stopped using DO after that.

Having the same issue now. If the kernel was changed, I'm screwed. Their support doesn't seem to understand the issue. I've linked this page, so hopefully it will help.
