In any infrastructure beyond the tiny/small stage, there are always pieces that are failing or have failed. In some cases these failures are observed by customers. In other cases they are not.
It would be unreasonable to have a company-wide status page that constantly lists "some customers are experiencing some problems". That's not the point of the status page - the status page, as the author suggested, is there to highlight issues that are affecting a significant section of the customer base.
The right thing for Digital Ocean to do in cases like this is to allow you, in your private dashboard, to see the problem and follow up on a master ticket for escalation and resolution.
As an infrastructure provider, I disagree. Customers expect to find the answer to the question "did I do something wrong, or are you having problems?" on the status page.
While we've listed every outage affecting >1 customer since 2004 on our status page, the issue is always expressing each outage in a way that allows a customer to identify that _their_ server is affected by a particular entry.
That sometimes involves knowing that their server is in a particular data centre, or that it is connected to a particular switch, etc., and we do our best to make sure that people can identify their problem - if nothing else, then by timing, i.e. making sure we list something ASAP.
But status pages are still very useful once people start calling in - support can positively identify that yes, you are affected by this problem and you can track progress at that URL. If we keep updating as we promise, bang, that's one support call (at most) per affected customer.
So even a minor problem - a broken switch, a VM host machine, a power failure in a rack - is worth listing, from my point of view.
As I said above I'm looking to tie our databases and some basic network monitoring into it this year. That way we can proactively notify people affected by a particular problem, as well as continuing to list even small problems publicly.
At the scale of DigitalOcean (~10k physical nodes), Amazon (>250k physical nodes) or Google, this seems wholly unreasonable: there's a definite signal-to-noise problem, because any large infrastructure will see hundreds of failures per day, some of which will have customer impact (ranging from minor to major). It's a statistical reality.
A lot of the other suggestions seem to centre around extending your status page to be more personal to the end user. This is (to some extent) the route Amazon AWS has taken in allowing you to see which instances are scheduled for retirement (because they lack live migration capability, a la GCE), etc.
I note that Amazon now sends out maintenance e-mails advising when certain IPsec connections will go down for them to perform upgrades. This is also great, and others should copy it.
This leaves the 'globally visible' bit (i.e. http://status.aws.amazon.com/) for critical outages affecting a large proportion of your customers.
I was just discussing how we could provide this level of granularity without overwhelming the status page as we scale. What if we provided an API endpoint for you to check the health of the node?
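Very roughly, something like this from the customer's side (a sketch only - the health route and the response fields here are hypothetical, not an existing DO API):

```python
# Hypothetical sketch: poll a per-droplet health endpoint.
# The /v2/droplets/{id}/health route and its JSON shape are
# invented for illustration; they are not part of DO's real v2 API.
import requests

API = "https://api.digitalocean.com/v2"
TOKEN = "your-api-token"  # placeholder

def node_health(droplet_id):
    resp = requests.get(
        f"{API}/droplets/{droplet_id}/health",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()  # e.g. {"host": "degraded", "incident_url": "..."}

print(node_health(123456))
```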
A status page needs to be hosted somewhere other than the infrastructure it reports on. If the issue is an API endpoint outage, having an API endpoint for status reporting is...counterproductive :)
If there were an API outage, we would report it on our status page. The blog post is about the health of individual physical nodes on our network that bring down clusters of VMs. :)
I had 48 hours of downtime on a DigitalOcean node last week. All events on the node were stalled, so I could not shut it down or take an image to spin up a new instance. I had to hammer their support with a dozen tickets before someone gave me more than a canned reply. Of course they did not acknowledge this very long outage on their status page. I like DO, but stuff like this just can't happen without anyone checking on it stat. I have become very wary of using them for critical infrastructure since then.
You seem to think every business has a budget for a big contract with DO or Amazon or Google.
Some people run their email server on these budget IaaS providers. Some prefer to host outside of Google's or Amazon's power. So where else should they host their own server? At home?
Ideally, if we have a continuous streaming backup of a node, then when the host machine fails a second machine can pick up and serve the last backup. This is of course expensive for any provider to do for every customer. But asking DO to actually report the status of the node, its host machine and the region is the right thing to do.
Customers don't need to know the full technical detail, but even a nice friendly message (email, SMS or even on the status page) will ease the conflict: "Your droplet now appears offline because its host machine is down. Don't worry! Your data is safe with our backup! If you have any concerns, please contact XXXXX@digitalocean.com or xxx-xxx-xxx."
Conclusion:
* report the status of the droplet on the personal dashboard
* for non-isolated incidents, report them on both the droplet's personal dashboard and the public dashboard.
Digital Ocean is absolutely not meant for critical infrastructure, nor is it meant for running a production mail server (there's a good chance the IP has already been flagged somewhere for spam in any shared cloud server IP space). You're paying for a low-budget VPS with no phone support. Yes, they have a 99.9% SLA, but the penalty to them if they fall short of that is minimal.
Whether this is for critical infrastructure or not, the provider should tell the customer about the problem automatically via the dashboard.
It would take some engineering work, but not a whole lot.
How is that asking too much? Should we discard that demand because DO is a low-budget VPS? If you truly value your customers, you take that suggestion seriously. I don't have millions to employ someone to manage an AWS farm for me. Instead of me asking DO why my nodes are down every time it happens, I want DO to tell me when it happens. It's a simple customer demand.
As others have pointed out, that $5 box might actually be a $20 or $40 box, so I disagree with your notion of not putting critical infrastructure on DO.
That being said, their penalties are ridiculous. I got a $10 "SLA Credit" for 2 days of downtime. So I agree their SLA is useless.
There are many other hosting solutions that offer Virtual Private Servers with a high percentage of guaranteed uptime.
For example, with Dreamhost you can get a VPS for ~$15 a month. If your e-mail isn't worth that much to you, then why are you bothering to self-host your e-mail anyways?
On this note, what is the easiest way to transfer my free Google Apps account from a .co.cc domain to a real domain and keep my free status? Can I change domains within Google Apps itself?
> All events on the node were stalled, so I could not shut it down or take an image to spin up a new instance.
This should not surprise you. This is a common failure mode of VMs. So, let's say a host goes down. Depending on the provider's storage setup, this means that all the images on that host's disks are inaccessible. That means you can't interact with your VM, which is why you couldn't take snapshots.
> I had to hammer their support with a dozen tickets before someone gave me more than a canned reply.
Abusing support is never the right answer; I'm not surprised you only got canned replies.
> I like DO but stuff like this just can't happen without anyone checking on it stat.
It's almost like every problem cannot be solved instantly.
This wasn't a very nice or helpful response. The OP relayed their experience and you picked it apart as if they were uninformed and incompetent. Pointing out a common failure doesn't mean that the service should not handle it quickly and keep their system status updated. Filing a new support ticket after getting a canned response that does not address an issue is reasonable. Finally, 48 hours is a long time. The post sounds like they weren't looking for instant service -- just some kind of service.
After the first 24 hours I started to get a bit twitchy.
Most of all I was worrying about data loss. (RAID degradation turned out to be the core issue, but I managed to boot the server up after 48 hours and copy everything over before spinning up a new instance, so no data loss occurred.)
Many people here claim “it's a $5 box, what do you expect”.
First, it may be a $10 or a $20 box. But more importantly, every VPS provider I've tried myself (e.g. Linode, XenCon) sends an email and opens a support ticket every time a $10 VPS goes down.
So please, do not try to change the norm. A server is a server no matter the cost, and it should be reliable. I like Digital Ocean, but they won't get better by coddling them.
Neither Leaseweb nor OVH sends me emails when my thousands of dollars' worth of server/VPS/compute appliances go down, unless I sign up for a paid SLA (I do). Rackspace sometimes emails me about downtime, but usually only after the fact, to let me know I'm getting $40 back. Amazon doesn't really let me know either; their status board remains in the green barring natural disaster or the AWS region literally catching on fire.
Unless you're paying for an SLA which defines uptime and notification time, you're at the beck and call of Best Effort support, which may mean no notification and no reimbursement for downtime. Notification within minutes of your VPS going down isn't necessarily the norm.
It isn't about notification or SLAs. It is about making your customers feel at ease, giving your team less work and lowering your costs.
If my VPS is down and my hosting provider does not acknowledge it, it is only logical that I will have to create a ticket myself. As a customer I both get anxious and lose time. On the provider's end, they now have an open ticket that they have to read, possibly hand to an employee other than the one providing first-level support, and answer properly.
An automated ticket that would probably be semi-automatically closed too is best for both of us.
For notification purposes there are excellent services (uptimerobot, pingdom, monitor.us). An SLA would be nice to have, but most good lower-cost VPS providers are usually over 99.9%, which is nice, barring the occasional outlier.
OVH doesn't inform you because you're paying commodity pricing for unmanaged servers. It's actually your job to keep track of the downtime, not theirs.
DigitalOcean stands to make ~$10,000 monthly off of a "$1,000 server" because it is somewhat managed, and thus has the responsibility of informing the 100 or 1,000-odd customers affected by their mistake, imho.
I received an email from OVH every time I managed to bork the [network | iptables | kernel | other stuff I should not mess with] so badly that the server stopped being responsive. I always figured it was part of their default monitoring; I will check whether I have subscribed to a paid option or not.
As a very long time Linode user (recently switched to DO for all my servers) I never received emails when my instance would go down (and being in the Fremont DC that happened quite a few times). Maybe that is new but it certainly has not been the norm.
I have had a single downtime, and I got an email and an auto-created ticket. This was about a month ago, so maybe it's relatively recent, or maybe it depends on the DC.
This is 100% a tool problem, and a problem we're actively working on for customers of ours that plan on having thousands of customized "views" for what they normally consider a "status page". Per-user functionality is one use case, but it can and will go deeper than that. If they could post an incident such that only you can see it, or such that only you are notified, they most certainly would.
I disagree that posting everything to be globally viewable is the right course of action, as this outage doesn't necessarily implicate fault on DO as a provider. But that doesn't mean that you as an individual customer shouldn't have access to your specific view of a status page as it relates to exactly what infrastructure you live on.
You'd be surprised how prevalent this issue is, and how much inaction it creates on the provider end.
Or the OP could use a cloud IaaS provider that creates tickets on his behalf. The fact that he submitted a ticket to DO and got a response indicating what the issue was, and that they were working on it, is fine IMO, since you get what you pay for. He's embarrassed and angry that he didn't plan for HA and had downtime.
The whole point of IaaS is that the hardware/hypervisor/network is the provider's responsibility, and everything inside of your VM is yours. If the provider isn't doing due diligence to monitor their infra, then what infra are you selling as a part of your IaaS?
How do you know that they're not monitoring their infra?
The OP is complaining that they didn't post an update on the status page. I see no evidence that they were not aware of the issue at the physical hypervisor level before he filed the ticket.
Customized views of relevant infra are probably the most important part of a good status page. I'm surprised they aren't more common yet.
Most people don't go to a status page to check out how their provider as a whole is running. They just care about stuff that is directly relevant to them.
Once they're already a customer, you're correct. The OP did mention the other major factor of a status page - a sales tool. Prospective customers want to see incident history, who was affected, response modes; metrics like uptime and latency are always good, etc.
I was thinking about a feature providing callbacks when the hypervisor running your instance fails.
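On the receiving end that could be as small as this (a sketch; the payload fields are invented, since no provider offers this, and as the replies below note, you'd want to host the receiver somewhere other than the provider being monitored):

```python
# Minimal receiver for hypothetical "hypervisor failed" callbacks.
# The JSON payload shown is invented for illustration.
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        event = json.loads(self.rfile.read(length))
        # e.g. {"droplet_id": 123456, "event": "hypervisor_failed"}
        print("provider callback:", event)
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), CallbackHandler).serve_forever()
```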
I think it's completely unreasonable to expect a status update on every single thing that might go wrong in Digital Ocean's infrastructure. If a single hypervisor/server fails, that could be ANYTHING. Bad drives, flaky memory, failed fans, etc., etc. This stuff happens ALL THE TIME and does not warrant a system wide update that something is wrong with the service, simply because there is nothing wrong with the service. All the other bits are functioning normally.
A loss of a box is expected and shit happens. Architect for it, or expect it to fail at some point. Everything dies eventually.
This is precisely what everyone with a virtual machine (or a physical machine, I guess) should be doing. Set up your puppet/chef/salt/ansible configs and make sure they are up to date. Use those to deploy your application, so that when something goes sideways you can use the same configuration to bring up a new VM immediately.
THIS is the power of virtualization, and if you're not doing this, you're doing it wrong.
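For DO specifically, the "bring up a new VM immediately" step can itself be scripted against their v2 API (a sketch, assuming you keep a current snapshot; double-check the field names against the API docs before relying on it):

```python
# Sketch: recreate a droplet from a pre-built snapshot via DO's v2
# API, then let puppet/chef/salt/ansible converge it.
import requests

TOKEN = "your-api-token"  # placeholder
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def respawn(name, snapshot_id, region="nyc2", size="512mb"):
    resp = requests.post(
        "https://api.digitalocean.com/v2/droplets",
        headers=HEADERS,
        json={"name": name, "region": region,
              "size": size, "image": snapshot_id},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["droplet"]["id"]

# once the new droplet boots, run your config management against it,
# e.g. ansible-playbook site.yml
```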
So, I'll get a callback from my provider when my server is down...
Unfortunately, that callback is POSTed to a server that is already running on my provider, so I never see it :)
(Assuming that you mean that the callback is an external thing, i.e. provider -> customer. Also, I suppose I could host the callback receiving server on another provider. But I don't want to.)
I think this is mostly a marketing problem. They could provide simple statistics on the possible types of failures and their causes. They could show average response times for each type of failure. You don't have to update this all the time. No one expects a full-fledged service for $5 clouds, but there are many ways of letting customers know what they're getting beforehand. Simply dismissing every minor glitch "because it's cheap" doesn't give customers a good impression.
So an individual VM instance in any cloud provider is unreliable. Welcome to the cloud. It's foolish to assume otherwise; keep backups and restore to another instance, or go HA. What's unreasonable is to assume they should publish a status update saying that one of their thousands of machines is having an issue. Fwiw, I've had the same "hypervisor problem" response from DO one time when I couldn't terminate a droplet.
Otoh, when they have whole API outages for creating or destroying VMs (like also happened last week) I'd expect, and did find, a status update. What their threshold is for reporting isn't clear... but a one-machine issue isn't something any provider would report on.
Some people feel that cloud means high-availability VMs, such that when a host fails the servers are transparently migrated away. IMO this is the exact opposite of what you want.
The people that want HA like this generally custom build every server and don't use configuration management.
http://forum.bytemark.co.uk/category/outage-notifications - every single one of Bytemark's outages since June 2004. It's a great sales tool, at least when someone asks "so how reliable are you?". I've seen a few using statuspage.io, but without having an entire network map & customer list uploaded (so it knows whom to notify), it's not going to be better, and I'm not sure it's geared to presenting historical outage data.
I'm hoping to finally build a tool this year incorporating some network monitoring (i.e. "unconfirmed reports") and a copy of all our internal rack/network database, so that we get to the holy grail: 1) customers get notified by email/SMS of stuff that definitely affects them, 2) it's easy for an engineer (or the whole team) to write notes on outages as they happen, and have them presented in a way that doesn't confuse customers.
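At its core that tool is just a join between a monitoring event and the rack/network database, something like this toy sketch (the schema here is invented for illustration; real databases will differ):

```python
# Toy sketch of the "who is affected?" join: map a failed switch to
# the customers whose servers hang off it. Schema invented here.
SERVERS = {
    "vm-101": {"customer": "alice@example.com", "switch": "sw-a3"},
    "vm-102": {"customer": "bob@example.com",   "switch": "sw-b1"},
    "vm-103": {"customer": "carol@example.com", "switch": "sw-a3"},
}

def affected_customers(failed_switch):
    return {s["customer"] for s in SERVERS.values()
            if s["switch"] == failed_switch}

# feed the result into the email/SMS notifier
print(affected_customers("sw-a3"))
```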
> IMO a status page should be a public record of all the times your service has experienced a catastrophic failure, even for a small number of customers, if not also small hiccups like packets loss or lag.
This is the most important point. A status page should be a _log_ rather than a transient message, as it is with so many providers.
edit: as other people have already namedropped, I'll point out that OVH do a pretty good job with the status page[1] and network maps[2].
What level of support are you expecting for $5 a month? If uptime and support of your website/application are important to you, it's likely time to invest more in your cloud infrastructure.
Sounds like what Digital Ocean is doing with respect to status is common industry practice.
One startup I know was using Heroku to host their website. One day the website had problems and was unavailable for an extended period of time. The Heroku team worked hard to resolve the problem, but it still took them several hours. Heroku did not update their status page. They stated that if 99% of customers do not have problems, they consider any problems to be local and not a reflection of the status of the whole infrastructure.
With the outages I've had at DigitalOcean and the AWS price cuts, I'm prepping to move over to AWS unless DO can make a good case against it ... I'm tired of seeing timeouts and droplets hanging which affects my business.
You will definitely have downtime due to hardware issues with AWS as well, and every other host out there... Maybe it would be better to spend the time you were going to spend switching to AWS investigating a load balancing setup and other things you would need to remove the dependency on one instance always working correctly?
The flexibility of the cloud is awesome but sometimes it does get cloudy up there. Redundancy is a must with digitalocean, aws, or anyone else.
Oh for sure, redundancy for the load balancers, app servers, caches, and database servers is a must ... if it can fail, it will, at the worst time ;-)
My point was more that the frequency at which I and others are having DO issues seems to be higher than AWS recently (I can't independently verify it until I've used AWS myself) and truth be told it's a subjective/opinionated statement. I do remember a time when AWS was extremely flaky when they first started, so I'd like to think it's merely growing pains for DO, but at the same time ... I have my own growing pains to worry about.
Can anyone recommend a very simple load-balancing setup for a bunch of tiny cloud servers serving a static app?
I'm currently using round-robin DNS for load distribution as the simplest thing that could possibly work, but it obviously doesn't actually balance load, and it doesn't remove dead servers from the pool. What's the next step up that doesn't cost an arm and a leg to implement?
Choose your favorite provider's load balancer service and add all the machines. Configure the health check to listen for a 200 on /. This will leave you with a SPOF at the LB, but you can round-robin DNS between two LBs on different providers if you wanted to.
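If it helps, the health-check half is only a few lines (a sketch; the backend addresses are placeholders, and pushing the live list into your DNS or LB provider is specific to them and omitted):

```python
# Sketch: keep only backends that answer 200 on /.
import requests

BACKENDS = ["http://10.0.0.1", "http://10.0.0.2"]  # placeholder IPs

def live_backends():
    live = []
    for url in BACKENDS:
        try:
            if requests.get(url + "/", timeout=5).status_code == 200:
                live.append(url)
        except requests.RequestException:
            pass  # unreachable: leave it out of the pool
    return live

print(live_backends())
```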
When the cloud was first being introduced, one of the core principles was that your instances should be considered ephemeral - that they could terminate at any time, without any notice or further explanation. I think some people forget that this still holds today.
Don't expect your instance to run forever. Design your architecture with this in mind. There is a reason e.g. AWS provides multiple availability zones.
It's interesting, because these days there are people that define cloud as "high availability VMs". They expect that in the case of a host failure that everything can be transparently migrated to another host.
I'd say that depends on what kind of service the provider is offering. Later-generation cloud providers might specialize in providing such VMs - Heroku and GAE come to mind. But on providers that offer more "bare metal" VMs one should not expect such migration. Indeed, it would be almost impossible to do, since it would require a lot of information about the application you are running. Heroku and GAE solve this by limiting the capabilities of the VMs so that they have more control over, and information about, the running applications.
I think this just goes to show that the DO platform is not suitable for sites that require uptime. The price point justifies the simplicity, but sometimes simple isn't enough. AWS still has everything a startup needs to run their stack.
I've been with Digital Ocean for over a year, and I'm fairly certain that I haven't had any downtime at all. The site is used by thousands of users a day, and I've never had any complaints. Pingdom is set to a resolution of 1 minute, and hasn't reported any outages either (that weren't caused by me).
We run our forums on DO as well and it hasn't caused a hiccup at all. Uptime has been great for me, but at this point there isn't enough tooling around DO for me to be comfortable with doing more than just simple web sites
AWS recently had a 30% price drop for EC2 and a host of other things. They're still not as cheap as DO, but they're in a similar ballpark now. DO is starting to lose its 'killer feature', it seems.
DO droplets do have more disk space for your dollar, but AWS's answer to bulk data storage is S3, which is very cheap itself.
Amazon S3 charges $0.12 per GiB for egress to the Internet. That's $120 per TiB. A $5 DO droplet comes with 1 TiB of upload per month (if they've even started enforcing these caps, which, last time I checked, they hadn't). Running my tiny ambient noise streaming website from S3 would bankrupt me.
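The arithmetic, using the prices quoted above:

```python
# S3 egress at $0.12/GiB vs. a $5 droplet with 1 TiB included.
s3_per_tib = 0.12 * 1024            # ~$122.88 per TiB out of S3
droplet_per_tib = 5.00              # $5 droplet, 1 TiB transfer included
print(s3_per_tib / droplet_per_tib) # ~24.6x the price per TiB
```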
If less than $120/month would actually bankrupt you, then yes, you have to go for the absolute cheapest option available. Sure, your particular use case very much favours DO's pricing, but if that kind of money would bankrupt you, then you have such little money that it's irrelevant what anyone but the cheapest provider costs.
I didn't mention anything about how much bandwidth my site uses. I just pointed out you can serve > 1 TiB from a DO droplet for $5 (or 10 TiB from 10 droplets for $50, or 100 TiB for $500, and so on).
The main point is that S3 egress costs are substantial for bandwidth-intensive applications.
I may have been misled by 'tiny' - I wouldn't have characterised something that throws out multiple terabytes of data per month as 'tiny'.
I take your point, though this being said, it again comes down to use case. A couple of hundred dollars a month for bandwidth is nothing for a business above a certain size, but it's tons for personal use. Depending on your use case, it may be also far more effective to use S3 and simply pay the data bill, rather than architect a system distributed across a ton of droplets.
That's not support; that's them failing to provide what they advertised. I have used OVH for years, and they always fixed everything on their end very quickly. And that was for unmanaged dedicated servers, so I never had "excellent support" - I had zero support - but they provided what they advertised.
Yeah, this is why spamming your general status page with lots of information is a bad idea. I have no idea what's going on there. At first glance it looks like half their services are having issues, which makes me think it always looks like that, which means it tells me nothing without digging into the detailed issues.
OVH is a multinational broadband ISP in addition to colo, cloud/vserver, dedicated, voip, domain registrar, etc.
Looking at the columns at a glance, there are almost no events "in progress" (most are Closed) and the vast majority of the open events are early warning for maintenance windows affecting very specific services.
See also https://status.aa.net.uk/ from a company we're friendly with - 100% openness but in common with most good status pages, doesn't help a customer to know whether they're affected by a particular issue or not.
You are asking a bit much. Your site is so insignificant in the grand scheme of goings-on at Digital Ocean that it would be far more misleading to mention it on their status page.
I agree that you get what you pay for. Just because you don’t have a budget for a $1000/mo dedicated server does not mean you have the right to that service level at a lower price.
As fellow geeks I am sure we have all been asked for advice on buying computers from our family and friends. I’m a mac person and use a $3000 laptop but I know most of them will not be willing to spend this much money. I normally quote them a $600(ish) computer that has at least an i-Series processor and 6-8GB of RAM. I tell them to avoid a couple vendors that I consider to be bottom of the barrel. It never fails though, that for all the reasoning and advice I give them, if they find a computer for $400 in the Sunday paper all that gets thrown out the window. They don’t want a computer that fits their needs they just want the cheapest computer. It also never fails that the moment it starts having problems they come to me for help and if I give them any slack for buying a cheap POS that I am suddenly considered an a-hole for not helping since that’s all they can afford.
The way I look at it is: I spend my hard-earned money on a reliable machine so I am not put out by the type of issues you get with cheap hardware or cheap services. You did not heed my advice to avoid the same issues, so asking me to give up my weekend, or even a couple of hours, to help you makes you the a-hole, not me.
I have used Rackspace, Linode, DO, AWS, and Google Apps among others, and none of their status pages are ever very helpful. It's a real problem with Google Apps, since my users know to check there and then claim the problem is not with Google even though it is. I frequently have issues with their IMAP servers failing, where a user can connect via the web interface but not through an IMAP client. This is never shown on their status page. Of course I am going to check the status page.
The only hosting provider I have no complaints about is Rackspace, but those servers are almost $1000/mo. On the other hand, they do open up tickets for me faster than I can log into the management portal to do it myself. Even still, I have had hour-long outages. If you don't have HA you WILL have outages no matter who you use or how much you pay. Ironically, the worst service was from The Planet, even though those were still $800/mo dedicated servers. Their status page was a Twitter account that they did not advertise on their homepage.
I have debated moving our average SMB-size clients with no HA from Linode to DO, just because tools like Packer can interface with them more easily. Still don't know about that.
This is very true. It is the nature of cheap unmanaged services that the user bears the bulk of the burden of maintaining their server instance. But this makes it even more important that providers fess up when they have issues.
I have only been with DO for a short time but I have already encountered several of these silent outages where I can't query the status of my server through their tools and there is no status update. It is very different to Linode. They are a fantastic service for personal sites but I wouldn't host anything professional with them until they do more to gain my confidence.
I love DO but their infrastructure is not stable. I get random downtimes very often (get error emails at least once a week with connection errors). It's definitely cheap, and I'm using it for a side project web app, but I don't think I would use it for an actual business yet. I hope they keep growing though until they can completely gain our trust.
Growth brings more customers than you can support, as well as infrastructure problems that you can't handle. Because your company has not scaled up over a long time, it has to hire people and shove things through the pipeline quickly, which means mistakes (both in process and in people) will inevitably be made. Rome wasn't built in a day, as the saying goes, but startups are. And they end up growing quicker than they should. Someone has to lose, and it's the customer (not all of them, but some of them). [1]
Not to mention the fact that if you are charging very little ($5 per month is obviously pretty cheap for the base service), you have less profit with which to handle things more robustly, and less ability to paper over problems by building in redundancy.
The saying "price, quality, speed: pick any two" applies here.
DO will get better of course but it will take time as they iron out and encounter the various issues that they face.
[1] I've observed this since 1982, when PC clones came out and competed with IBM. The clones shoved things into the channel and all of a sudden hardware problems were shoved onto the customers. Previously, IBM charged enough that those things were handled by IBM, not their customers, because they had the profits and took the time to pay attention to details.
I had httpd and a MySQL server running on my $5 droplet, and I had no swap partition. I recorded the outages using Pingdom. It was interesting to see that the droplet went down quite frequently and sometimes never came back up. I had to manually restart the droplet from the Digital Ocean interface.
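The manual-restart part, at least, can be scripted against DO's v2 API, so a failed Pingdom check can trigger it (a sketch; "power_cycle" is a documented droplet action type, but verify against the current docs):

```python
# Sketch: power-cycle a droplet via DO's v2 API instead of clicking
# through the web interface.
import requests

TOKEN = "your-api-token"  # placeholder

def power_cycle(droplet_id):
    resp = requests.post(
        f"https://api.digitalocean.com/v2/droplets/{droplet_id}/actions",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"type": "power_cycle"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["action"]["id"]
```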
The choice of showing the status of our hypervisors permanently was not an easy one for the team. But in the end, we figured customers have a right to know.
To be honest, showing status for 9 host machines is nothing like displaying status for the hundreds if not thousands of host machines in a large-scale deployment at a major IaaS provider.
I suspect this is because it looks bad to display the individual status of every node, because they'll have more common failures; keeping it limited to DC-wide issues is good for business.
They should have a private/client-only per-hypervisor status page.
DO needs to do a lot of things. They don't even have IPv6 yet, which is, TBFH, beyond a joke, even for a $5/mo provider. And then there's the censorship, which guarantees I will never use them even past that.
I don't get why more HN-type startups don't look at using dedicated machines instead. Every dollar counts.
Exelion.net is running a special for HN users. 50% off for life for any server. Or more than one server. Or quite a few servers. Just use HN50 when checking out.
Exelion does a top-of-the-line E3-1230v3 Haswell (8 threads) + 16GB + 2x240GB SSD in RAID 1 + gigabit port + 33TB/mo bandwidth for $115/mo.
DO does 8 threads + 16GB + 160GB SSD + 6TB transfer for $160/mo, plus another $1,350/mo for the full 33TB/mo of bandwidth.
And if you wanted unmetered gigabit? Exelion does it for a flat $200/mo extra. DO does it for an equivalent of $16,120/mo on that plan (based on 5 cents per gigabyte and 6TB already included with the plan).
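Back-of-envelope math behind that figure (assuming a saturated 1 Gbps port over an average month):

```python
# 1 Gbps saturated for an average month, billed at $0.05/GB
# with 6TB already included in the plan.
seconds = 365 * 24 * 3600 / 12   # average month, ~2.63M seconds
gb = 0.125 * seconds             # 1 Gbps = 0.125 GB/s -> ~328,500 GB
print((gb - 6000) * 0.05)        # ~16125.0, matching the quote
```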
It just doesn't make sense financially to keep using them.