"DigitalOcean Killed Our Company" (twitter.com/w3nicolas)
1343 points by sergiomattei on May 31, 2019 | 596 comments

As DigitalOcean's CTO, I'm very sorry for this situation and how it was handled. The account is now fully restored and we are doing an investigation of the incident. We are planning to post a public postmortem to provide full transparency for our customers and the community.

This situation occurred due to false positives triggered by our internal fraud and abuse systems. While these situations are rare, they do happen, and we take every effort to get customers back online as quickly as possible. In this particular scenario, we were slow to respond and had missteps in handling the false positive. This led the user to be locked out for an extended period of time. We apologize for our mistake and will share more details in our public postmortem.

Thanks for the replies. Let me try to address a few of the things I have seen here. We haven't completed our investigation yet, which will include details on the timeline, decisions made by our systems, our people, and our plans to address where we fell short. That said, I want to provide some information now rather than waiting for our full post-mortem analysis. A combination of factors, not just the usage patterns, led to the initial flag. We recognize and embrace our customers' ability to spin up highly variable workloads, which would normally not lead to any issues. Clearly we messed up in this case.

Additionally, the steps taken in our response to the false positive did not follow our typical process. As part of our investigation, we are looking into our process and how we responded so we can improve upon this moving forward.

With all due respect I think you’ve missed the point. The larger point from my perspective is that you denied your client the ability to move their data off your platform. This would be akin to someone breaking the terms of their lease and you confiscating all their belongings with the intent of burning them. You should provide some sort of grace period for users to move their data off your platform. For everyone else reading this, it should be a wake-up call about why you should never trust your data to a single entity. Even if they have 99.9999% uptime you never know when they’ll decide to deny you access to your data.

Thank you for jumping in personally to clarify what happened.

As a business owner with much of our infrastructure depending on DigitalOcean, the incident is concerning. It affects the reputation of DO as well as its customers.

The demographics on Twitter and especially here on HN represent a sizable crowd with decision-making influence on DO's bottom line. I hope to see some effort being made to prevent situations like this in the future, and to regain the trust.

As a (so far) satisfied customer, it's great to hear that:

> A combination of factors, not just the usage patterns, led to the initial flag.

> We recognize and embrace our customers' ability to spin up highly variable workloads, which would normally not lead to any issues.

> we are looking into our process and how we responded so we can improve upon this

I’ll be awaiting the post-mortem and, depending on that and the procedures proposed to stop this from happening again, will hold off moving everything I have from DO.

The real “mess up” here was the bit where you blocked the account with no reason given and no further communication - other than the one-liner your intern wrote for the email.

I’m expecting you to sit down with your legal team and rewrite your TOS to be more customer-focused and less robotic.

I wanted to provide you all with an update on the postmortem I promised on Friday. Our analysis has been completed. We will be sharing the full document soon and will publish a link in this thread for those wanting to read it. We promised Raisup a first look and we have provided the draft document to them this afternoon. Because some information in the document could be considered sensitive we wanted to give Raisup a chance to review the document before sharing with the public.

As a long term customer here is a small suggestion to make this fail-safe: by default do trust your customers, and just ask them first instead of shooting them down first.

Considering you have been marketing yourself as the platform for developer oriented cloud, you should be aware that surge provisioning can and will always be happening.

This is the right move. Sure, if an account has repeated violations after clear communications, then action must be taken.

But it doesn't make sense to shut it down before discussion!

I have always been a happy DO customer. So glad to see you step into this lion's den and reply to the community.

Looking forward to the write up!

What do you recommend your clients to do if that kind of mistake happens to them? Is Twitter-shaming the only way out?

I know people cite legal reasons for closing you down without saying anything, but this is the worst scenario ever. I'd rather be accused of something I didn't do than just get an "oops, we can't tell you anything, your account has been shut down."

This is important. I hate how it has become standard for companies to screw their customers unless they are online-shamed.

The response email even read like a giant polite FUCK YOU (we locked your account, no further action required by you)

You bet I will have further action!

And it is after the shaming that you get an "I am sorry for this situation". Which sounds more like saying "I'm sorry we got caught".

My frustration is not with DO specifically, as they do exactly what every other company does.

But, what of the other thousands of people that got screwed and did not put it on twitter?

It is the equivalent of when you are in a restaurant and get screwed: it is the person who complains the loudest who gets the reward, while all the others silently swallow the injustice.

It's most likely because the people who can act upon the process itself, not just follow it, inevitably see the issue and do truly want to help.

Getting your message into the right hands is what matters, not the platform it's on.

Mistakes happen, and algorithms are sometimes a necessary part of scale/efficiency. Everyone understands that.

That said, what's highly troubling as a DO customer (and someone who is planning to deploy startup infrastructure of my own with DO) is:

1) The discrepancy between this customer's experience and clear assurances made on this very forum by high-level DO employees that:

a. warnings are ALWAYS issued before suspensions.

b. even in the event of a suspension, services remain accessible (though dashboard access and/or the ability to spin up NEW services may be impacted), i.e. the affected customer could still retrieve data or SSH in to droplets.

2) The relatively trivial nature of the customer's offending usage (temporarily spinning up 10 droplets). What happens if, for example, a startup gets a press mention somewhere that leads to a massive traffic spike, necessitating a sudden and significant spin-up of new droplets (especially if this is done programmatically versus by hand in the dashboard)?

3) The apparent lack of consideration of the customer's history, or investigation into their usage. It seems the threshold for suspending services of longstanding customers who are verifiably engaging in commerce (taking a moment to look at their website and general online presence for indicators of legitimacy), should be SUBSTANTIALLY higher than for, say, an account who signed up a week ago. Context matters.

I'm no longer able to edit the above comment, so to elaborate on #1:

Following is a comment[1] by Moisey Uretsky in another thread[2]:

> Depending on which items are flagged the account is put into a locked state, which means that access is limited. However, the droplets for that account and other services are not affected at all. The account is also notified about the action and a dialogue is opened, to determine what the situation is. There is no sudden loss of service. There is no loss of service without communication. If after multiple rounds of communication it is determined that the account is fraudulent, even then there is no loss of service that isn't communicated well in advance of the situation.

1. https://news.ycombinator.com/item?id=18296344

2. https://news.ycombinator.com/item?id=18294940

This is why I'm so confused by the case under discussion, because the customer appeared to have been completely locked out without warning.

What he said in the other thread, and in this thread, is a press release, marketing. Don't trust what he says to save his business. You have absolutely no reason to.

I prefer to give them the benefit of the doubt, though a clear explanation of why the above policy was not followed seems warranted. (It also doesn't appear to have been followed in several other instances reported by other former customers in various HN threads.)

If DO reserves the right to cut off services and access to your own data permanently and without warning (outside of a court order or confirmed illegal activity), that needs to be unequivocally stated, and the triggering factors should be made known. Otherwise, DO is not fit for production systems.

Additionally, it would be nice to see the creation of a transparent, high-level appeal process for customers affected by suspensions. Truly malicious customers wouldn't use it (what would they hope to successfully argue to an actual human reviewer?), but it would greatly benefit legitimate customers to have an outlet other than social media by which to "get something done" in the event of an inappropriate suspension followed by a breakdown in the standard review process.

Not sure why you’re being downvoted. Point 2 is very relevant. Scaling instances due to sudden peaks should be totally safe. Even when automated. Guess AWS is still lonely at the top.

AWS has default instance limits too, though, which you need to open a ticket to increase.

Which is a much better policy than suspending the account. 10 instances is nothing!
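To illustrate the difference between the two policies, here is a minimal Python sketch. It is not AWS's or DO's actual API: the `Account` fields, the `QuotaExceeded` error, and the default limit of 20 are all made up for illustration. The point is only that an over-limit request gets refused with a clear message while the account and its existing instances stay untouched.

```python
from dataclasses import dataclass


class QuotaExceeded(Exception):
    """Raised when a launch request would exceed the account's limit."""


@dataclass
class Account:
    instance_limit: int = 20  # hypothetical default, raised via support ticket
    running: int = 0
    active: bool = True       # a quota rejection never flips this


def launch(account: Account, count: int) -> int:
    """Start `count` instances, or refuse the request without touching the account."""
    if account.running + count > account.instance_limit:
        raise QuotaExceeded(
            f"limit is {account.instance_limit}, "
            f"{account.running} running, {count} requested; "
            "open a ticket to raise the limit"
        )
    account.running += count
    return account.running
```

The customer who hits the limit knows exactly which rule applied and what to do next, and nothing already in production goes down.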

It really is a trivial amount of resources to have it triggering such a reaction. It's almost like DigitalOcean doesn't like being in the cloud hosting business. One of the fundamental, desirable points to the shift to such cloud hosting services is that you can quickly spin up a bunch of resources when needed and then dump it.

You've got an additional problem though, which is that this tells us you have two support channels: one that doesn't work (i.e. yours, the one you built), and one that does (Twitter-shaming). The first channel represents how you act when no one's watching; the second, how you act when they are. Most people prefer to deal with people for whom those two are the same.

As a DO user who was planning on ramping up usage in the coming weeks and months, this is what scares me and what is making me seriously reconsider.

Do not use DO. The very fact that their default response to suspected spam is to cause prod downtime is so bizarre and unacceptable that it does not make any sense whatsoever for a business to rely on them.

Thanks, I’ll stick with AWS then.



Running anything business or privacy critical on DO is madness.

Indeed, this was bad. I assume they were trying to extend SSD lifetime by reducing writes.

It's fair to note that scrubbing is now the default behavior when a droplet is destroyed, so they did listen to the feedback.


The SSD thing is a red herring.

You do not need to scrub or write anything to not provide user A’s data to user B in a multi-tenant environment. Sparse allocation can easily return nulls to a reader even while the underlying block storage still contains the old data.
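A minimal Python sketch of that idea, assuming a toy thin-provisioned layer (the `PhysicalDisk` and `ThinVolume` names are invented here and say nothing about how DO's storage is actually built): a volume only maps a logical block to a physical one on first write, and reads of unmapped blocks return zeros without ever touching the disk.

```python
BLOCK_SIZE = 4096


class PhysicalDisk:
    """Shared backing store. Freed blocks are NOT scrubbed."""

    def __init__(self, num_blocks: int):
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]
        self.free = list(range(num_blocks))


class ThinVolume:
    """One tenant's volume: logical blocks are mapped on first write only."""

    def __init__(self, disk: PhysicalDisk):
        self.disk = disk
        self.mapping = {}  # logical block number -> physical block number

    def write(self, lba: int, data: bytes) -> None:
        if len(data) != BLOCK_SIZE:
            raise ValueError("this sketch only supports whole-block writes")
        if lba not in self.mapping:
            self.mapping[lba] = self.disk.free.pop()
        # A whole-block write fully replaces whatever was there before.
        self.disk.blocks[self.mapping[lba]] = data

    def read(self, lba: int) -> bytes:
        phys = self.mapping.get(lba)
        if phys is None:
            # Never written by this tenant: return zeros, do not read the disk.
            return bytes(BLOCK_SIZE)
        return self.disk.blocks[phys]

    def destroy(self) -> None:
        # Return blocks to the pool without scrubbing them.
        self.disk.free.extend(self.mapping.values())
        self.mapping.clear()
```

If user A writes a block and destroys the volume, the old bytes are still sitting on the physical disk, yet user B's fresh volume reads back zeros everywhere until B writes something of their own.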

They were just incompetent.

On top of all of that, when I pointed out that what they were doing was absolute amateur hour clownshoes, they oscillated between telling me it was a design decision working as intended (and that it was fine for me to publicize it), and that I was an irresponsible discloser by sharing a vulnerability.

Then they made a blog post lying about how they hadn’t leaked data when they had.


As someone who has been blown off by DO support, you hit the nail on the head.

So well written. This is exactly what's so scary about this whole thing.

I think it says a lot that this CTO joker flew in, regurgitated the standard-issue "we will endeavor to do better" apology and left without answering any of the very legitimate follow-up questions. I would never deal with an organisation that behaves like these guys.

Is there any response that would satisfy you?


"This will not happen again, ever".

People's livelihoods are at stake in DO's hosting. Canned responses and brutal account lockouts should have NEVER been on the table to begin with.

That’d be unrealistic for any company to claim, and if any company I worked with did claim that I would run for the hills.

That’s akin to saying “we’ll never ship a bug”, or “we have an SLO of 100%”. That’s impossible for anyone to claim. Same goes for the response handling. There is clearly a lot of room for improvement there, but if you’re insisting on not getting canned response, that means a human needs to be involved at some point. Humans will at times be slow to respond. Humans will at times make mistakes. This is just an unavoidable reality.

I get that mob mentality is strong when shit hits the fan publicly, but have a bit of empathy and think about what reasonable solutions you may come up with if you were to be in their situation, rather than asking for a “magic bullet”.

I could see a good response here being an overhaul of their incident response policy, especially in terms of L1 support. Probably by beefing up the L2 staffing, and escalating issues more often and more quickly. L2 support is generally product engineers rather than dedicated support staff/contractors, so it’s more expensive to do for sure, but having engineers closer to the “front line” in responding to issues closes the loop better for integrating fixes into the product, and identifying erroneous behavior more quickly.

Sure, me and a lot of others react rather strongly in these situations. I agree with that but you already seem to understand the reasons.

However, can you say with a straight face that the very generic message left here by DO's CTO instills confidence in you about how they will handle such situations in the future?

Techies hate lawyer/corporate weasel talk. Least that person could do was do their best to speak plainly without promising the sky and the moon.

I would prefer a generic message and a promise for follow up once all the facts are known over a rushed response that may be incorrect.

I’m an engineering manager in an infrastructure team (not at all affiliated with Digital Ocean, tho full disclosure, I do have one droplet for my personal website). I know how postmortems generally work, and it’s messy enough to track down root cause even when it’s not some complex algorithm like fraud detection going off the rails.

I’d rather get slow information than misinformation, but I understand the frustration in not being able to see the inner working of how an incident is being handled.

I applaud people like you. Seriously.

And I agree with your premise. However, in my experience postmortems are, many times, watered-down evasive PR talk.

If you look at this through the eyes of a potential startup CTO, wouldn't you be worried about the lack of transparency?

And finally, why is such an abrupt account lockdown even on the table, at all? You can't claim you are doing your best when it's very obvious that you are just leaving your customers at the mercy of very crude algorithms -- and those, let's be clear on that, could have been created without ever locking down an account without a human approval at the final step.
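As a rough Python sketch of what I mean (all names here are hypothetical, and this is not a description of DO's real system): the automated detector is only allowed to move an account into a review queue, and the only code path that can reach the locked state requires a named human reviewer.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    PENDING_REVIEW = "pending_review"
    LOCKED = "locked"


@dataclass
class Account:
    name: str
    state: State = State.ACTIVE
    audit_log: list = field(default_factory=list)


def flag(account: Account, reason: str) -> None:
    """Called by the automated detector. It can queue a review, never lock."""
    if account.state is State.ACTIVE:
        account.state = State.PENDING_REVIEW
    account.audit_log.append(f"flagged by detector: {reason}")


def review(account: Account, reviewer: str, approve_lock: bool, note: str) -> None:
    """Called by a human. This is the only path to State.LOCKED."""
    if account.state is not State.PENDING_REVIEW:
        raise ValueError("account is not awaiting review")
    if not reviewer:
        raise ValueError("a named human reviewer is required")
    account.state = State.LOCKED if approve_lock else State.ACTIVE
    decision = "locked" if approve_lock else "cleared"
    account.audit_log.append(f"{decision} by {reviewer}: {note}")
```

While the account sits in `PENDING_REVIEW` its services keep running, and the audit log gives support a concrete reason to share with the customer instead of a one-line email.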

What I'm saying is that even at this early stage when we know almost nothing, it's evident that this CTO here is not being sincere. It seems DO just wants to maximally automate away support, to their customers' detriment.

Whatever the postmortem ends up being it still won't change the above.

Our line so far has been to change service providers if we start getting copy-paste answers from support. We always make sure we can get hold of a human on the phone even without a big uptime contract. This has so far led us to small companies that are not overrun by free accounts used as spam or SEO accounts. That means they have no need for automatic shutdown of accounts, and instead you get a phone call if something goes wrong.

This is how I would go about it as well. But I imagine that's a big expense for non-small companies, not only in money but in the time of valuable professionals who could have spent that time improving the bottom line.

I too value less known providers. The human factor in support is priceless.

> Is there any response that would satisfy you?

Do you believe that a PR response made in damage control mode that actually changes nothing is something that's satisfactory?

I mean, apparently this screwup was so damaging that it killed a company. What part of the PR statement addresses that precedent?

How about "we will reimburse the company for any damages if we found it was our fault"?

It's been 7 hours, some follow up answers would be nice...

7 hours, on a Friday night in the headquarters time zone. This issue is resolved and is clearly not widespread, so does getting a response on Monday or Tuesday vs right now make any difference?

Companies are made of people. Let the people have a life. Their night is shitty enough as is after this, I guarantee you.

The thing is, my business doesn't want to deal with people. It wants to deal with a business made of multiple people to guarantee service availability. If he cannot answer, surely someone else in DigitalOcean can?

You are being unreasonable here. He promised a postmortem. I’d much rather wait a few days to get a clearly written, comprehensive analysis of the problems than to get an immediate stream of confusing and contradictory raw data.

If you have ever been involved in post facto analysis of a process breakdown like this you know how hard it is to get the full picture immediately. Rushing something out does no one any favors.

Sure, but the email he received basically said "your account is locked. No other info. Thank You". That to me is a much scarier thing than anything else in the thread. How can anyone trust in your infrastructure if your standard protocol is literally just shutting down their entire operation without any form of review or communication?

We have a relatively large spend ($5k+) @ DO, for a unique client (most of our other clients can be served by our colocated facility), and I'm going to second this. Or with any other provider. They should always explain exactly which rule was broken. If the customer is legit + genuine, they will promptly fix the issue and won't be a further problem. Being vague makes it super troublesome to rely on any service that takes that tactic. (Like Google, for example) If they continue to re-offend, and find other ways to skirt the rules, that's when you move on to account termination.

You can't, obviously. Even though I've used them before I really doubt I'll ever use DigitalOcean again. I can almost understand terminating customers (with notice) via automated heuristics for suspicious behavior, especially on the low end of the hosting market, but locking out a legitimate paying customer from backups with no notice or recourse is terrifying.

"In this particular scenario, we were slow to respond and had missteps in handling the false positive. This led the user to be locked out for an extended period of time."

This didn't seem like a case of being "too slow" - the customer in question went through your review process (which was slow, yes), and the only response he got was "We have decided not to reactivate your account, have a nice day".

That just seems like a lack of interest in supporting your customers that are falsely flagged.

Last week ended on a real low note for many of us at DO. We took a perfectly good customer and gave them an experience no one should have to go through (all while he was trying to leave on vacation no less). We can and must do better. To do better we need to learn from our mistakes. To that end, we also think sharing the information about this incident openly is the best way to help all our customers understand what happened and what we are doing to prevent it in the future.

Yesterday we completed our postmortem analysis of the incident involving Nicolas (@w3Nicolas) and his company Raisup (@raisupcom). With their permission we are sharing the full report on our blog here:


No offense, as I'm sure this has been hard, but a screwup like this publicly demonstrates DO is not ready for prime time competition against AWS, Azure, GCP and the like.

I'd gladly do whatever it takes to KYC, send you my business license, tax returns, EIN, invoice billing, etc so you know there is someone behind my account.

We spend thousands of hours eliminating single points of failure. If an automated system can undermine that work, DO is not an option for us to host anymore.

A year of data backups lost. Do you realize how that alone may cause clients to dump a company, and do you realize that startups may never recover from fiascos like this? I understand that it was a false positive triggered by internal systems. But how do you explain the delay in restoring the services, and the reflagging within hours after the services were restored?

At the end it was not even a matter of delay. It was more like, we locked your account, we don't want to hear you anymore until the end of times.

Public post mortem? Brilliant.

Hope you can share what you learnt from this incident and hopefully you'll take a hard look at your processes.

I'd hate to be caught in the same issue, especially since we are already customers, and I'm not sure I'll have as much clout as Nicolas here to get your attention.

> and I'm not sure I'll have as much clout as Nicolas here to get your attention.

It's occurring to me now that while I've successfully ignored twitter for years, I should probably rectify that just so I have somewhere to type my hopes and prayers when this eventually happens to me, and hope for a miracle. It sure seems like the only place they're listened to.

> I'm not sure I'll have as much clout as Nicolas here to get your attention.

Maybe keeping a twitter (and other social media) account with at least a certain number of followers should be considered a part of a company's security strategy? You'd also need to post something interesting periodically, to keep your followers, so that you have their attention when you need it.

Will the customer be compensated for business losses?

I hope they sue and win. This bull###t needs to be fought.

IANAL, but DO's ToS is loaded with weasel words.[0] So if they can sue in some jurisdiction where the binding arbitration and liability limitations don't apply, maybe they could at least get a fair settlement.

0) https://www.digitalocean.com/legal/terms-of-service-agreemen...

Are you dreaming?

It would probably be worth it to restore trust: Refund all the money they've taken from this company for the last year, and apply a credit to their account for 3x that amount, say.

Sadly enough, yes. I'm sure that it's covered in DO's ToS.

But the DO CTO did basically admit fault in a public forum.

It's not the false positive that is the issue here. The issue is that a. it took way too long to get the business back up and running, and b. the second response gave no explanation and no recourse for the business to become operational again.

The very fact that this can happen from an automated script with no oversight should give every one of your customers pause as to whether they continue with your service.

I'd say the issue is that DO is shutting down servers for any reason at all (legal issues aside). If DO sells a product with a particular capacity, why should they intervene at all if a user is using all of the capacity they're paying for?

GCP bans mining. Most providers frown on running Tor exit nodes.

So unless a person is popular enough to get enough people talking about it on twitter or hacker news, someone whose account is flagged by your bad script is going to lose his business.

That doesn't sound good to me.

Are you aware that Viasat has blacklisted a huge number of DigitalOcean /24 subnets? I can't access many of my servers when I'm on a satellite connection, in addition to other websites hosted on DigitalOcean. I've talked with the Viasat NOC and they told me they were blocking DigitalOcean subnets due to malware.

This is probably worth its own post; it would be very interesting to see more detail. I'm also fairly certain that this is not exclusive to DO.

DO is famously bad at dealing with abuse reports; in a lot of cases they simply do nothing.

I block their netblocks for a lot of things, too.

Should we be concerned about our 40+ droplets with DO now? We built our business on DO; we really could go bankrupt, as could our 30+ clients, if anything like this happens to us. Please change your support system ASAP, otherwise we will be switching to another platform. We are expecting a very serious response from you.

Do not use DO. The very fact that their default automated response to spam is prod downtime is unacceptable.

It requires so many failures in understanding the service being provided across the company for this decision making process to have ever actualized that there is no reasonable expectation of safety or trust from DO at this point.

Every cloud company has anti-abuse systems that will limit your access to their APIs or take down your machines if abuse is suspected, for example if it looks like you're mining bitcoin. Your prod isn't any different from your staging for them.

Clearly you should be doing regular backups of everything, and not on DO. And make sure to test your backups. And make sure you have a fast migration plan into another cloud.
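For the "test your backups" part, a minimal Python sketch using only the standard library. The function names are made up for illustration, and a real setup would ship the archive to a different provider rather than a local path. The idea is simply to restore the archive into a scratch directory and compare file hashes against the source, instead of trusting that the archive was written correctly.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def tree_digest(root: Path) -> dict:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def make_backup(source: Path, archive: Path) -> None:
    """Write a gzip-compressed tarball of `source`. Keep `archive` outside `source`."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=".")


def verify_backup(source: Path, archive: Path) -> bool:
    """Restore into a scratch directory and compare against the live data."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            # Only extract archives you created yourself; on recent Python
            # versions consider passing filter="data" as well.
            tar.extractall(scratch)
        return tree_digest(Path(scratch)) == tree_digest(source)
```

Running `verify_backup` right after `make_backup` catches a broken archive immediately; a mismatch later just means the live data has changed since the backup was taken.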

Ideally you should be cloud-agnostic, but that's quite hard to achieve.

That’s all well and good. But how do you plan to reimburse your customer for this gross negligence? I have never heard of such incompetence or lack of communications from anyone on AWS’s business support plan. Why should anyone trust DO over AWS or Azure?

Curious how you will compensate them?

I'm genuinely curious. What type of fraud or abuse are you trying to prevent? Maybe cover that in the postmortem.

If your DO (or other cloud provider) credentials are compromised, it's usually a matter of seconds before someone fires up the largest possible number of instances to start crypto mining.

Yup. LeonM, you are correct. In this case that was the cryptocurrency mining detector that was triggered. More details in the postmortem.

Unacceptable. I just instructed my team to begin a transition to AWS.

You ruined your brand

Do you realize that by abusing this thread to make a single PR-focused comment with no intention of participating in the conversation, you've disrespected the community here and the few remaining DO customers within said community?

>> and we take every effort to get customers back online as quickly as possible. In this particular scenario, we were slow to respond and had missteps in handling the false positive.

You clearly don't make every effort, and did not -- so why waste the extra verbiage and switch from active to passive voice?

Based on your cliche response I have zero confidence that DO will do anything substantial to address the root causes of the issue.

I've found DO's public posts to be particularly grating in the "we are listening to YOU, our customer. we take feedback extremely seriously" department.

Some people on HN hate Linode because of their past security screwups (which is valid), but having used both DO and Linode quite a lot, the support on Linode is way, way, way better than DO's.

DO's tier 1 support is almost useless. I set up a new account with them recently for a droplet that needed to be well separated from the rest of my infrastructure, and ran into a confusing error message that was preventing it from getting set up. I sent out a support request, and a while later, over an hour I think, I got an equally unhelpful support response back.

Things got cleared up by taking it to Twitter, where their social media support folks have got a great ground game going, but I really don't want to have to rely on Twitter support for critical stuff.

DO seems to have gone with the "hire cheap overseas support that almost but doesn't quite understand English" strategy, whereas the tier 1 guys at Linode have on occasion demonstrated more Linux systems administration expertise than I've got.

I have interviewed with DO and they tried diverting me towards a support position.

They told me that on a single day a support engineer was supposed to help/advise customers on pretty much whatever the customer was having issues with and also handle something between 80-120 tickets per day.

It's nice to see that DO is willing to help on pretty much anything they (read: their team) have knowledge about, but with 80-120 tickets per day I cannot expect to give meaningful help.

Needed EDIT: it seems to me that this comment is receiving more attention than it probably deserves, and I feel it's worth clarifying some things:

1. I decided not to move forward with the interview as I was not interested in that support position, so I have not verified that's the volume of tickets.

2. From their description of tickets, such tickets can be anything from "I cannot get apache2 to run" to "how can I get this linucs thing to run Outlook?" (/s) to "my whole company that runs on DO is stuck because you locked my account".

I once worked for eBay a long time ago, and support consisted of 4 concurrent chats, offering pre-programmed macros often pointing to terribly written documentation the person had already read and was confused about. If you took the time to actually assist somebody you were chastised in a weekly review where they went over your chat support. The person doing mine told me I had the highest satisfaction record in the entire company, and a 'unique gift of clear and concise conversation, like you're actually talking to them face to face' then said I'd be fired next week because my coworkers were knocking off hundreds of tickets a day just using automated responses, leaving their customers fuming in anger with low satisfaction ratings, as people are very aware of being fed automated responses but the goal was not real support, it was just clearing the tickets by any means possible. I decided to try half and half, so if the support question was written by somebody who obviously would not understand the documentation (grandma trying to sell a car), I would help them but just provide shit support to everybody else in the form of macros like my coworkers. Of course this was unacceptable and I got canned the next week as promised. Was an interesting experience, I can imagine DO having an insane scope to their support requests like 'what is postgresql'.

Anyway imho you should have taken the support position and schemed your way into development internally. This was my plan at eBay before they fired me, though they shut down the branch here a few months later and moved to the Philippines anyway so I wouldn't have lasted long regardless.

I'm fortunate that my own company (Rackspace) at least has a level head about this sort of thing. My direct manager looks at my numbers (~60-80 interactions per month) and my colleagues (many hundreds of interactions per month) and correctly observes that we have different strengths, and that's the end of the discussion. I have a tendency to take my time and go deep on issues, and my coworkers will send me tickets that need that sort of investigative troubleshooting. My coworker meanwhile will rapidly run through the queue and look for simple tickets to knock out. He sweeps the quick-fix work away, but also knows his limits and will escalate the stuff he's not familiar with.

Let me stress here, this is not nearly as easy a problem to solve as it appears to be on the surface. We're struggling as a company right now because after our recent merger, a lot of our good talent has left and we're having to rebuild a lot of our teams. Even so, I'm still happy with our general approach. Management understands that employees will often have wildly different problem solving approaches and matching metrics, and that's perfectly OK as long as folks aren't genuinely slacking off and we as a team are still getting our customers taken care of. I think that's important to keep in mind no matter how big or small your support floor gets.

+1 for Rack support. A previous company I worked for was heavily invested in Rackspace infrastructure and while I often lamented not getting the equivalent experience with AWS for the resume, I was regularly floored by the quality of their support. Whenever I had the pleasure of needing to open a ticket they solved my problems and usually taught me something new in the process. The Linux guys were very clearly battle hardened admins.

I couldn't imagine getting that level of support from DO, let alone Amazon.

I have the opposite experience with Rackspace. The low end stuff (hosted Exchange etc.) is basically useless, people who are obviously on multiple chats, they let tickets sit for days...

Even when we had a small handful of physical servers with them, they seemed inept. They actually lost our servers one time and couldn’t get someone out to reset power on our firewall.

My experiences were all with their "dedicated" or "managed" cloud services. Although I did notice that their marketing seemed to shift in the last months I was working with them for that employer from "let us help you build things on Rackspace" to "let us help you move what you built on Rackspace to AWS"

Yes, the Public Cloud, which houses most of the smaller Managed Infrastructure accounts (minimal support) is one of the bigger ... I believe the polite word is "opportunities?" It's a very pretty UI on top of a somewhat fragile OpenStack deployment, which needs a significant amount of work to patch around noisy infrastructure problems. That turns into a support floor burden, and it shows in ticket latency. Critiques directed at that particular product suite are, frankly, quite valid. I think Rackspace tried to compete with AWS, realized very quickly that they do not have Amazon's ability to rapidly scale, and very nearly collapsed under their own weight.

That said, our FAWS team are a good bunch, and what AWS lacks in support they more than make up for in well engineered, stable infrastructure. Since Rackspace's whole focus is support, I think the pairing works well on paper and it should scale effectively, but we'll have to see how it plays out in practice.

So is the business plan now to provide premium support for AWS customers?

Oh, absolutely:


This is a big push, internally and externally. I don't know too much about the details (I don't work directly with that team) but it's been one of our bigger talking points for a while now.

Support should be looked at as a profit center, but almost everyone tries to run it like a cost center.

It's crazy that companies spend $$ on marketing and sales, then cheap out on an interaction with someone who is already interested in / using their product.

Running profit centers requires comparatively rarer leadership resources while running cost centers only requires easy-to-hire management resources. You don't want your best leaders whipping your support center into shape while letting the company's competitive edge fritter away.

Alternatively, I'd ask if you want easy-to-hire management resources as your primary touchpoint with paying customers.

Crazy from the individual perspective; natural from the group's.

"We have a mark, let's suck it dry until we can throw it away and find new and better marks."

Sustainability tends not to be a concern until after a group's leadership jumps ship and parasitizes other hosts.

It remains weird to me that this even _can_ work as a business strategy. Customers know this isn't right, so they are only staying with a business that does this for so long as it is the absolutely cheapest/only way to achieve what they want. That's super high risk, because if a competitor undercuts you, or an alternative appears, you are going to lose all those customers pretty much instantly.

Almost invariably the high ticket rates are also driven by bad product elsewhere. Money is being spent on customer "services" sending out useless cut-and-paste answers to tickets to make up for money not spent on UX and software engineering that would prevent many of those tickets being raised. Over time that's the same money, but now the customer is also unhappy. Go ask your marketing people what customer acquisition costs. That's what you're throwing away every time you make a customer angry enough to quit. Ouch.

"Abuse your customers and get away with it because they aren't good at punishing you" is the status quo in a depressing fraction of companies.

> imho you should have taken the support position and schemed your way into development internally

> This was my plan at eBay before they fired me

I guess you answered yourself.

You are very much like Bob Parr (Mr Incredible) at the Insurance company.

The reality is that his boss is right...

Take 7 tireless hours of work (with a lunch break) and 15 minutes to listen, understand and resolve an issue. Assuming perfect knowledge, a lot of luck and normal human speed, that would still amount to fewer than 30 resolutions a day.

with no data or experience to back this up, i'd expect this to be kind of a long tail, where some will take an hour, and the majority will take 30 sec

Yep, this is spot on - I used to work on a webhosting help desk and could bang out about 100 tickets a shift, because so many were small queries that required no depth work.
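A rough way to see how a long tail changes the arithmetic is to average over a mix of ticket types. This is only a sketch: the 95%/5% split below is an assumption made up for illustration, while the 30-second and one-hour figures and the 7-hour shift come from the comments above.

```python
# Hypothetical ticket mix: (share of all tickets, minutes per ticket).
# The 95/5 split is an assumption; 30 seconds and one hour are the
# two extremes mentioned in the thread.
TICKET_MIX = [
    (0.95, 0.5),   # quick queries, about 30 seconds each
    (0.05, 60.0),  # deep investigations, about an hour each
]
SHIFT_MINUTES = 7 * 60  # 7 working hours, as assumed upthread


def mean_minutes_per_ticket(mix):
    """Weighted average handling time across all ticket types."""
    return sum(share * minutes for share, minutes in mix)


def tickets_per_shift(mix, shift_minutes=SHIFT_MINUTES):
    """Whole tickets that fit into one shift at the average handling time."""
    return int(shift_minutes // mean_minutes_per_ticket(mix))


if __name__ == "__main__":
    print(f"mean minutes per ticket: {mean_minutes_per_ticket(TICKET_MIX):.2f}")
    print(f"tickets per shift: {tickets_per_shift(TICKET_MIX)}")
```

Under that assumed mix the average works out to roughly 3.5 minutes per ticket, or about 120 tickets in a shift, which is the same order of magnitude as the "about 100 tickets a shift" figure. A different split would of course give a different number.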

Old MSFT rule of thumb was 2 bugs per day during bug crunch mode. Sounds crazy, but when you consider the number of "this text is wrong" and "that text box is too short" bugs that existed after a year of furious development, it wasn't too hard to achieve.

Gotta hit that ZBB!

Brought back memories. I think it might be a little Stockholm syndrome but there was just something about the pressure of getting a release out when you know it only happens once every few years. Bug triage definitely improved my persuasion technique.

Now it's just "meh, we'll fix it in next month's release".

Agile bug fix workflow ;)

Per day? Or per hour?

Day. Of course some bugs would be: “this entire feature is done wrong and doesn’t work”.

having data and experience to back this up --

you are entirely correct.

something's not clear: are you talking from direct experience?

He's using mathematics to compute the number of tickets an employee can handle per day, given certain assumptions. Given the data from znpy above, we see that nurettin's assumption that the time per ticket is 15 minutes is inconsistent with DO's expectations; instead, the average time spent per ticket should be about 5 minutes.

30 tasks x 0.25 hours per task is 7.5 hours. Looks simply like a calculation.
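The same calculation can be written out in a few lines of Python. This is just a sketch of the arithmetic in this subthread; the function names are made up for illustration, and the 80-120/day range is the target mentioned elsewhere in the thread.

```python
def tickets_per_day(work_hours, minutes_per_ticket):
    """Whole tickets one person can resolve at a fixed time per ticket."""
    return int(work_hours * 60 // minutes_per_ticket)


def minutes_available_per_ticket(work_hours, daily_target):
    """Average minutes a person can spend per ticket to hit a daily target."""
    return work_hours * 60 / daily_target


if __name__ == "__main__":
    # 7 hours at 15 minutes per ticket: fewer than 30 resolutions a day.
    print(tickets_per_day(7, 15))
    # 30 tickets at 15 minutes each needs 7.5 hours.
    print(tickets_per_day(7.5, 15))
    # Time budget per ticket at the low and high end of an 80-120/day target.
    print(minutes_available_per_ticket(7, 80))
    print(minutes_available_per_ticket(7, 120))
```

With a 7-hour day, an 80/day target leaves 5.25 minutes per ticket and a 120/day target leaves 3.5 minutes, which is where the "about 5 minutes" estimate comes from.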

I have no prior experience working at DO tech support.

This is appalling. I worked as an L1 ticket tech for an old LAMP host back in the day where probably half of the tickets required nothing more than a password reset or an IP removal from our firewall, very easy stuff, and was proud if I got over 60 responses out in an 8-hour shift. And that time was spent mostly just typing a response to the customer. I really expected higher standards out of DO.

Indeed. If that's their average ticket-per-engineer rate, their support team is way undersized to handle the load.

Yikes. Those metrics are entirely unreasonable for providing any support besides canned responses. Rough seas at DO.

Disclaimer: Based on providing support myself and coaching a support team at both a web hosting company and ISP I used to own years ago.

Yikes. I'd rather work manual labor vs hit 80-120/day ticket targets!

Linode is definitely in the minority here. Most companies, in tech and outside of it, seem to follow the DO model. Twitter provides decent service, while the official help channels provide canned responses and template emails.

I somewhat blame people in tech, actually. More than one company is creating products that "cut customer service costs via machine learning", which is code for "pick keywords from incoming tickets and autoreply with a template"

ISP here: The margins in bulk hosting services are incredibly thin, and companies have resorted to automation tools. If somebody asked me to run backend infrastructure for something like DigitalOcean or Linode, I would run away screaming. It would literally be my own personal hell. I would rather run any other sort of ISP services on the planet than a bulk hosting service where anybody with a pulse and $10 to $20/month can sign up for a VPS.

I truly feel sorry for their first and second tier customer support people. I imagine the staff churn rate is incredible.

People who work for these sorts of low-end hosting companies inevitably quit and try to work for an ISP that has more clueful customers. When you have people paying $250/month to colocate a few 1RU servers, the level of clue of the customer and amount of hassle you will get from the customer is a great deal less than a $15/month VPS customer.

This race to the bottom has reached a point that it's harming customers. It's okay to be more expensive than the competition if you provide a better service.

Personal opinion, it's really important in the ISP/hosting world to identify what market categories are a race to the bottom, and if at all possible, refuse to participate in them.

I look at companies selling $5 to $15/month VPS services and try to figure out how many customers they need to be set up for monthly recurring services, in order to pay for reasonably reliable and redundant infrastructure, and the math just doesn't pencil out without:

a) massive oversubscription

b) near full automation of support, neglect of actual customer issues, callous indifference caused by overworked first tier support
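The "does the math pencil out" question can be framed as a simple break-even calculation. The sketch below is purely illustrative: every dollar figure in it is a hypothetical placeholder, not data from any real provider, and the function name is made up for this example.

```python
import math


def customers_to_break_even(monthly_fixed_costs, price, variable_cost):
    """Customers needed so per-customer margin covers fixed monthly costs.

    All inputs are in dollars per month. Raises ValueError if each
    customer costs at least as much to serve as they pay.
    """
    margin = price - variable_cost
    if margin <= 0:
        raise ValueError("no positive margin per customer; cannot break even")
    return math.ceil(monthly_fixed_costs / margin)


if __name__ == "__main__":
    # Hypothetical numbers: $50,000/month for redundant infrastructure
    # and staff, a $10/month plan, $6/month of hardware and bandwidth.
    print(customers_to_break_even(50_000, 10, 6))
    # Same plan, but with an assumed extra $1/month per customer of
    # human support time: the required customer count jumps.
    print(customers_to_break_even(50_000, 10, 7))
```

With these made-up inputs the first case needs 12,500 customers and the second 16,667, which shows why oversubscription and automated support are the levers that get pulled when the price is fixed at the low end.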

Conversely, as a customer, you should be suspicious when some company is offering a ridiculous amount of RAM, disk space and "unlimited 1000 Mbps!" at a cheap price. You should expect that there will be basically no support, it might have two nines of uptime, you're responsible for doing all your own offsite backups, etc.

If you use such a service for anything that you would consider "production", you need to design your entire configuration of the OS and daemons/software on the VM with one thing in mind: The VM might disappear completely at any time, arbitrarily, and not come back, and any efforts to resolve a situation through customer support will be futile.

"you're responsible for doing all your own offsite backups"

That's going to be true no matter which cloud provider you choose.

Their ToS almost certainly include terms which allow them to kick you off and refund any monies for any reason whatsoever.

Good luck if you bought into their entire ecosystem and can't move elsewhere on a whim.

> That's going to be true no matter which cloud provider you choose.

My company provides services to Fortune 100 companies, and we host literally petabytes of data on their behalf in Amazon S3, but we don't have offsite backups. We (and they) rely on Amazon's durability promise.

We do offer the option of replicating their data to another cloud provider, but few customers use that service -- few companies want to pay over twice the cost of storage for a backup they should never need to use when the provider promises 99.999999999% durability.
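For a sense of scale, here is a rough sketch of what an eleven-nines figure implies if you read it as an average annual loss probability per object. That reading is an assumption on my part, the function names are made up for illustration, and the number says nothing about losing access to an account or about user error.

```python
from fractions import Fraction

# 99.999999999% durability, read here (as an assumption) as the chance
# that a given object survives one year. Fractions keep the math exact.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)
ANNUAL_LOSS_PROBABILITY = 1 - DURABILITY  # 1 in 100 billion


def expected_objects_lost_per_year(object_count):
    """Average number of objects lost per year under the stated durability."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_expected_loss(object_count):
    """Average number of years between single-object losses."""
    return 1 / expected_objects_lost_per_year(object_count)


if __name__ == "__main__":
    objects = 10_000_000  # hypothetical object count
    print(float(expected_objects_lost_per_year(objects)))
    print(int(years_per_expected_loss(objects)))
```

For ten million objects that is an expected loss of 0.0001 objects per year, or one object every 10,000 years on average, which is why few customers pay double for a second copy on durability grounds alone.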

I don't know the data you're holding. If it is sensitive data, like customer anything, would it in fact make sense not to have offsite backup?

Reasoning: Your contract with Amazon promises durability and I'm sure there's a service level agreement with penalty/liability clauses. By implementing a redundant backup, you're replicating something that you don't legally need to have, taking on double-or-more due diligence for the offsite backup's security/credentials, and, in case of a failure of Amazon, creating a grey area with clients: "Do you have the data, or do you not?"

In short, there could be a very good business reason not to do offsite backups.

Regardless of durability, if you lose your customers' data are you sure you will have customers paying you to keep you in business while you figure out liability?

In this case, it was not losing data, but losing access to data. The data was eventually restored. Losing customers' data could also mean losing the backup:

"We're sorry, the tape that we didn't need to keep has been lost/zero-dayed/secondary service provider has gone bankrupt/Billy's house that we left it at got robbed." These must be disclosed to a customer immediately.

Minimising attack/liability surface is not only a technical problem, but a business one too.

For AWS it doesn't make a lot of sense to protect against AWS itself losing data since you're paying them a premium for that. Backups in this model would be logically separated so a user/programmer error can't wipe out the only copy of your production dataset.

It's just greed, not some wisdom. When data is lost - it's just lost. Maybe AWS will pay some compensation because of their promises, but money can't always solve the problem of missing data.

Ten years ago, I had a pretty reasonable $10/mo account, that I eventually moved to a $20/mo account because I needed more resources to keep up with traffic.

I'd expect that, as more transistors have been packed blah blah blah, that such a $10/mo account would have gotten better, not worse, since then.

Support and staffing costs are not subject to Moore's law.

Neither is bandwidth.

Bandwidth does drop in price, and increase in capacity, at a fairly rapid rate. Look at what an ISP might pay for a 10GbE IP transit circuit in 2008 vs what you can get a 100GbE circuit for today. But people's bandwidth needs and traffic also grow rapidly.

> I'd expect that, as more transistors have been packed blah blah blah, that such a $10/mo account would have gotten better, not worse, since then.

This is what Linode do - they keep you on the same payment level, but raise what you get for it.

There is the same risk of being kicked off when using Amazon AWS. The rules are different, but there will be situations where you lose everything (imagine you become visible for a political reason and the landscape shifts a bit).

Unfortunately there is no guarantee that I receive any better support at the more expensive provider though

At a certain price point, yes there is, if you're paying $800/month for hosting services to a mid sized regional ISP with presence at major IX points. That ISP cares about its reputation, and cares about the revenue it's getting from you.

I can tell you that as a person whose job title includes "network engineer", we have a number of customers who have critical server/VM functions similar to these people who had the DigitalOcean disaster. If something goes wrong, an actual live human being with at least a moderate degree of linux+neteng clue is going to take a look at their ticket, personally address it, and go through our escalation path if needed.

Having paid substantial amounts for various services over the years, paying hundreds of dollars per month doesn't automatically make you into a priority.

There seems to be a sweet spot for company size here. Too small companies can't support you even though they really want to. Large companies are busy chasing millions in big contracts, and don't really care about your $800 per month at all.

Very good points. What I would recommend is to use a mid sized ISP in your local area where you can meet with people in person. At higher dollar figures there should be some sales person and network engineer you can meet in their local office, meet for coffee, discuss your requirements, and have something of a real business relationship with. You and your company should be personally known to them.

If you are just some semi anonymous faceless person ordering services off a credit card payment form on a website, all bets are off...

Imho, that point was reached in hosting over 15 years ago (which is why I sold the hosting company I had back then). We’ve seen some short lived upticks periodically since then, but they all end up going back to shit as they tried to scale.

I like prgmr.com’s motto, “expect about $5/month in support”

srn from prgmr.com here. Our tagline originally came from us being a low cost service, but I like to think of it as a customer support philosophy. One meaning is we want you to be able to fix the problem yourself by giving you instructions instead of logging into your system. Another is we try to give you the benefit of the doubt that it could be our problem and not assume it's yours when there's an issue.

which goes in line with “we don’t assume you’re stupid”

Linode was more of a bootstrapped business, it grew slowly and steadily. Digital Ocean was always built to grow fast from the beginning.

I think that the size of the company or how fast they grow is a good proxy for having poor customer support. What we should be doing is finding the slow growing businesses or the mid-tier (not too small, not too big) businesses to take our business to.

There’s nothing wrong with that approach, the person raising the support ticket likely hasn’t read through all the documentation of the product they’re using.

If implemented well, sure -- sometimes, maybe often, you can point a customer to a support document that directly answers their specific question and relieves some of the load on your staff. That's great.

But the execution matters a lot, and DO's is currently not great. IIRC, it takes clicking through a few screens of "are you sure your question isn't in our generic documentation? How about this page? No? This one then? Still no? You're really sure you need to talk to someone about this error? sigh Okay, fine then."

These systems should not be implemented as a barrier to reaching human support, but they often are.

In all my experience with support, I have been referred to a document that helped me with my problem literally zero times, because if something went wrong the first thing I did was Google it and so I already saw the unhelpful document.

> DO seems to have gone with the "hire cheap overseas support that almost but doesn't quite understand English" strategy, whereas the tier 1 guys at Linode have on occasion demonstrated more Linux systems administration expertise than I've got.

That's simply not true. There are support engineers hired around the world, and depending on when your ticket is posted, someone awake at that time will answer. DO is super remote friendly and as a result, has employees (and support folks) everywhere on the planet. Not "cheap overseas support" at all. There's a lot of support folks in the US, Canada, Europe, and in India where they have a datacenter.

disclosure: ex-employee

> That's simply not true. There's support engineers hired around the world, and depending on when your ticket is posted, someone awake at that time will answer. DO is super remote friendly and as a result, has employees (and support folks) everywhere on the planet. Not "cheap oversea support" at all. There's a lot of support folks in the US, Canada, Europe, and in India where they have a datacenter.

Going to guess from this tweet https://twitter.com/AntoineGrondin/status/113096281882239385... that you're currently staffed at DO. That's fine; I know two people who do great work on DO's security team. But it would be helpful if you could disclose this when you comment about your employer publicly so that readers don't have to dig up your keybase and then your twitter account to understand it.

I don't work at DO anymore, but you're right I should disclose that I used to do so. Will edit my original post.

Much appreciated, thanks!

But isn’t hiring all over the world exactly because it is cheaper for the same kind of talent? I’m sure the company doesn’t do it out of the goodness of their heart.

Then of course, there is no guarantee these people speak and understand English perfectly.

I would say it's not. There are many advantages to hiring remote workers; they've been discussed at length elsewhere. One advantage is not having to pay for office space, which indeed lowers cost. However, DO has a nice office in Manhattan, so really... they're not saving much money. And then in terms of compensation, for some reason DO pays its remote employees really well. I don't know how this changed in recent years, but people in NA and EU are all paid handsomely despite being remote. I don't know about other locales.

The reality is that top talent, even if remote, is competitive whether they are in NYC/SFO/SEA or not. And DO has some pretty talented people on staff.

And then, having people in all timezones is definitely an advantage for 24/7 support. I'd say it's not negligible, and not an afterthought.

Now about English fluency: it's only that important to native English locations. And really, most of tech does not necessarily have English as a first language - I certainly don't. So I'd say that encountering support engineers with imperfect English shouldn't be a problem to anyone, and definitely not a sign of cheap labor. In fact, I'd say bitching about someone's English proficiency in tech is kind of counterproductive and I find it discriminatory.

Anyways. DO doesn't hire international employees to get cheap labor, that's a preposterous proposition. And with datacenters around the world and a large presence and customer base, it makes sense to have staff on board from many of these areas. And that staff might answer your tickets at night when they're on shift. Shouldn't they?

I can concur: at a company that is not DO, we hire workers in locations around the globe specifically to have people awake in their normal time zones, not because it's cheaper, because it's not always cheaper. There are many countries that have a large portion of very intelligent and multilingual people, especially when it comes to English. Just because someone isn't a native English speaker, or speaks a different dialect, doesn't mean that they are any less capable.

> isn't a native English speaker, or speaks a different dialect, doesn't mean that they are any less capable

It's not just about being less capable... Being less understandable can trump capability.

Not least England, which is literally full of actual native English speakers, and 8 hours ahead of the west coast of the USA.

I probably should have left the word "overseas" out of my initial comment, it gave it a flavor that doesn't match my left-wing multicultural globalist ideals.

That said, I disagree wholeheartedly that it's okay for support staff to not be completely fluent in the language they're providing support in, regardless of the language.

There is functionally no difference between trying to interact with talented support staff who aren't fluent in your language, and trying to interact with illiterate support staff. The end results are identical.

There are people who are very talented and very fluent in more than one language. Those people tend to be more expensive. So, many companies forego hiring those workers and instead hire others who are cheaper and "about as good". My multiple experiences with DO support have suggested that that's what they're doing.

As other commenters are suggesting, it may just be instead that DO is expecting its support staff to meet metrics that are causing them to spend only a minute or two per ticket and send out scripted replies.

I know many engineers who are not that fluent in English whom you would never contemplate qualifying as illiterate; you would quickly see that (1) they're encumbered by English and (2) are obviously extremely proficient technically, and literate.

People who would make gratuitous grammatical mistakes but have read more classics than the average American college graduate. I can easily count many just thinking of it.

You're arguing here against something I didn't say. You took one word from my statement -- "illiterate" -- and built a whole new argument around it which was never mine to begin with. I don't think you're doing it intentionally, I suspect it's just because you have a particular sensitivity on this subject. Either way I don't think I can say anything here that'll get a fair treatment from you.

Not that person, but you said this:

> There is functionally no difference between trying to interact with talented support staff who aren't fluent in your language, and trying to interact with illiterate support staff.

That statement reeks of ignorance. It seems you have almost no experience with languages other than your own, or you would know that communicating while being non-fluent or with a non-fluent speaker works just fine most of the time. Sometimes misunderstandings happen and it can be a bit slower but that is all.

> Sometimes misunderstandings happen and it can be a bit slower but that is all.

So your position is that support that's a bit slower, with some misunderstandings, is exactly as good as fast support without misunderstandings, even in downtime-sensitive applications.

Well, okay then.

this statement is downright offensive and something i wouldn't expect to read here.

fluency is a high barrier to clear. it took me 5 years of speaking/reading/writing english daily to come even close to "fluency".

before that, i had a really good advanced english, but i wasn't fluent. and it didn't mean i was "illiterate".

And Linode has always had faster CPU, I/O, and network. And there are lots of small things that DO has only caught up on in recent years, like pooled bandwidth.

Although these days I tend to go with UpCloud, it is very similar to Linode and DO, except you can do custom instances like 20x vCPU with 1GB Memory, spin it up for $0.23 an hour. Compared to standard plan on Linode and DO, 20 vCPU with 96GB Memory would be $0.72/Hr.

I love Linode, their support is awesome, their CPUs on "Dedicated CPU Plans" are great (by benchmarks - similar to GCP n1 CPU), and their disk I/O bandwidth is just amazing. But I had to leave them because their network is not so reliable. It was a difficult decision and I tried to return a few weeks ago (because I really love them), but again the network in London was just not so great.

I'd love to see the benchmarks you used. I've done a bunch over the years, mostly I/O focused because that's the most common bottleneck for the kind of work I do. While Linode does pretty well, especially compared to the cloud giants, DO has pretty much always come out on top.

Really depends on the underlying hardware of the Linode and DO VPS servers, as they can vary greatly, especially on DO, depending on the datacenter and region you end up in. Newer DO datacenters get newer hardware, so the difference compared to older DO datacenters is huge - benchmarks of the same DO droplet plan on different hardware: https://community.centminmod.com/threads/digitalocean-us-15-...

I've been on Linode for 8+ years now (moved there from Slicehost when Rackspace swallowed them up) and their service (not necessarily customer support) has significantly degraded. Not sure I blame them though. They've become far more popular since I started with them and are probably doing their best to grow... but I no longer recommend them as I used to.

Just my experience though.

That's how I feel about Scaleway. Scaling customer service is no easy task, especially for technical companies that require agents to have some understanding of the product.

I drop all traffic from Scaleway, they do not care about abuse coming from their network: https://badpackets.net/ongoing-large-scale-sip-attack-campai...

Call me cynical but for those prices I don't expect amazing service or stability, so I don't run any production stuff there.

Could you give some concrete examples on how their service has degraded? I've been using their service for years for light stuff, and I haven't had any problems.

So we host scores for a sport, as well as inputting these scores. Every year we have a few high traffic events (much higher than normal), and I scale up our servers to support it. However for the last two years there have been outages in their Newark data centre during both of these events. One time it was DNS, all other times it's been data centre wide.

I suspect as they've increased in popularity they've become a bigger target for DDOS attacks.

I've also noticed that in the past year there's been a lot of data centre outages... like every couple months. Hasn't been a deal breaker for us, since our traffic is generally fairly low outside of season, but the ones during seasons really hurt.

Also I'd like to add that they do give you the heads up when there are issues, which is a big plus in my book compared to some other hosts.

I really do think it's just growing pains, and I don't mean to disparage them. Just being honest that I wouldn't recommend them for high availability services. Since I consider them a low budget host that's probably unfair though. We've just outgrown them is all.

This is all anecdotal of course.

So who would you recommend?

If you are looking for high availability - Google Cloud Platform network is the best. In some other things AWS is better, but GCP network quality is awesome.

I don't really have a low budget alternative. I know that for our service we're evaluating both Google and Amazon cloud offerings, but only for our high availability services. I figure DO is in the same boat if not worse.

I've been using Hetzner Cloud and have been quite happy with it.

I'll throw in Prgmr.com. One of their owners, Alyn Post, is on Lobsters with us. They even donate hosting to the site. He's been a super-nice guy over the years. Given how cost-competitive the market is, they mainly differentiate on straightforward offerings with good service. So, I tell folks about them if they're concerned about good service or more ethical providers.


The main potential caveat is they're Xen-based hosting. That may or may not matter depending on what one wants to run. They support the major Linux distributions.

Luke used to post here quite a bit too

Linode support helped me diagnose some rare and obscure Linux OS error. They went above their call of duty and it’s why I’m still a customer.

>the support on Linode is way, way, way better than DO's.

Can't comment about support, but DO's tutorials on linux server random task 101 are fantastic.

I was surprised how much of my ubuntu server setup googling ended up on DO pages.

most of them are well written. It’s the community though, not DO

So true! Well written, to the point and kept up to date.

I've been with Linode since 2009, and only got bitten once when their entire Atlanta DC went down near Christmas.

I was eager to try the new DO managed Postgres service, but I guess I won't after this blunder.

+1 I have used both as well. Linode support has always been good and DO support has been a blocking barrier.

In response to my first support request DO sent me instructions for closing my account. I followed up and they refunded me a random few dollars.

Second this. Linode for 15+ years here, tried DO a few times (but never left Linode), now 100% back with Linode.

Linode even has an IRC channel (you can use a browser to access it). I rarely need support, but when I really need it, it is always fast, to the point, available.

Long-time Linode customer here. What were the security screwups? I tried a couple searches but didn't turn up anything.

The two major ones were:

A Bitcoin theft via the Linode Manager interface in 2012: https://news.ycombinator.com/item?id=3654110

and a second Linode Manager compromise in 2016: https://news.ycombinator.com/item?id=10845170

In both cases, if you dig into the context a bit, the story turned out that Linode wasn't fully disclosing the breaches to their customers until they were forced to do so when the news about them reached a certain volume. They also may have been -- almost certainly were -- dishonest about the extent of the damage and how it may have impacted their other customers.

At the time, their Manager interface was a ColdFusion application, which tends to be a big pile of bad juju. They started writing a new one from scratch after, I think, the second compromise.

The really bad thing here is that they got soundly spanked for being less than truthful the first time, and then four years later -- when they'd had ample time to learn from that mistake -- they did it again.

So there's a nonzero chance at any given time that Linode's infrastructure has been compromised and they know it and have decided not to tell you about it.

That's what prompted me to start exploring DigitalOcean more. Unfortunately, I've found that there's a far greater chance that I'll experience actual trouble exacerbated by poor support than that I'll be impacted by an upstream breach, so about half my stuff still lives on in Linode.

> So there's a nonzero chance at any given time that Linode's infrastructure has been compromised and they know it and have decided not to tell you about it.

This is entirely the roots of my distrust of them right now. Mistakes happen. Companies I trust demonstrate that they've learned from their mistakes. My tolerance for mistakes is pretty low when it comes to security related things, though. If something has gone wrong, let me know so I can take remedial steps. Their handling of both of those incidents suggested I can't trust them to tell me in sufficient time to protect myself.

Here's a search that will help, look no further than the site you're already on:


They've had several high-profile breaches over the years.

Vultr is underrated too. I've had nothing but positive tech support experiences with them. Their weird branding turns people off but they do not seem fly by night. We have used them for various things for years with very few problems.

I've dealt with many Vultr instances on behalf of my clients, and I've had nothing but negative experiences with them. Unstable performance even on top-tier plans. Internal network issues that support keeps trying to blame their customer for. Nowadays when I find that a new or prospective client has been using Vultr, the first thing I recommend is to move off of Vultr.

When there's an issue with Linode's platform, they discover it before I do and open a ticket to let me know they're working on it. When there's an issue with Vultr, the burden of proof seems to be on me to convince them that it's their problem not mine.

My only issue is the provisioning weirdness.

I've had interesting issues automating deployment to the point that my current build script provisions 9 VMs, benchmarks them and shuts down the worst performing 6. Some of their co-location is CPU stressed.
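As a rough illustration of the "provision extra, keep the best" step described above, here is a minimal Python sketch. The instance names and benchmark scores are invented for the example, and the provider-specific provisioning, benchmarking, and teardown calls are deliberately left out.

```python
from typing import Dict, List, Tuple


def split_by_benchmark(
    scores: Dict[str, float], keep: int
) -> Tuple[List[str], List[str]]:
    """Return (keep_ids, destroy_ids), keeping the `keep` highest scores.

    A higher score means better performance. Ties are broken by instance
    id so the result is deterministic.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    ranked = sorted(scores, key=lambda vm: (-scores[vm], vm))
    return ranked[:keep], ranked[keep:]


# Hypothetical benchmark results for 9 freshly provisioned VMs.
raw_scores = [910.0, 455.5, 870.2, 300.1, 905.3, 610.0, 420.9, 880.7, 515.4]
scores = {f"vm-{i}": s for i, s in enumerate(raw_scores, start=1)}

keep_ids, destroy_ids = split_by_benchmark(scores, keep=3)
print("keep:", keep_ids)
print("destroy:", destroy_ids)
```

The six ids in `destroy_ids` would then be handed to whatever teardown call the provider's API offers.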

My current employer uses Linode, and yeah, they have pretty good support. However, I've been using DO for the last 5 years, and haven't needed to reach out to support once. But I've had to contact Linode support about 5 times in the last year.

Does DO have different levels of support that you can pay for like AWS? I like that system. You pay when you need it. You pay more if you need more support.

The difference in support DO vs Linode is probably due to DO being cheaper.

> The difference in support DO vs Linode is probably due to DO being cheaper.

What? Most DO and Linode plans have the exact same specs and cost exactly the same, and IIRC it was DO matching Linode. Although DO didn't seem to enforce their egress budget while Linode does.

(I've been a customer of both for many years, and only dropped DO a few months ago.)

EDIT: Also, according to some benchmarks (IIRC), for the exact same specs Linode usually has an edge in performance.

AWS support is fixed contract. There is basic, business and enterprise. What do you mean by “you pay when you need it”?

I can cancel the fixed contract any time. e.g. when I need support I buy a month of basic or business.

Having used both for years, I’d probably recommend DO. None of the big security issues, but also I’ve found more downtime with Linode, don’t know if they’re upgrading their infrastructure a lot for some reason.

This headline is grossly misleading and very clickbaity "Killed our company". It's not exactly big business that you scale up to 10 droplets for short bursts, I am willing to bet their spend on DigitalOcean is less than $500 a month, yet the author is expecting enterprise support.

DigitalOcean should go the route of AWS and kill off free support completely and offer paid support plans. Something like $49 a month or 10% of the account's monthly spend.

If you are a serious company with paying Fortune 500 customers, you need to act serious and pay up for premium support and stop expecting free.

Well, not locking you out of your account for false positives, and unlocking it when you ask, should be in the free support plan of any company. No one pays to get locked out, support plan or not.

They’re working on that and were sending out surveys a couple of weeks ago whether customers would be interested. They had a slightly higher minimum amount in mind for premier support

You already get bumped to a higher tier of support once you hit $500/month spend for no additional fees, but that just gives you promises of lower response times.

I'm done using DO.

It's usually a bad sign when a company can't meet current needs/expectations and then decides to try and productize their failures.

Looks like Moisey Uretsky personally intervened fairly quickly: https://twitter.com/moiseyuretsky/status/1134547532149854208

That said, any company, especially one working with Fortune 500's, should have DB backups in at least two places. If they'd had the data, they could have spun up their service on a different hosting provider relatively easily.

Probably because of publicity. How many of those companies went bankrupt silently, because their case did not cause much attention in news?

Accidents can happen. Don't really blame Digital Ocean for the accidental locking but this response is insane: https://pbs.twimg.com/media/D76ocofXoAY_xB5.png

DO has shown that their service is simply not suitable for some use cases: those that impose an "unreasonable" load on their infrastructure.

Even worse: they don't explicitly state what is considered "unreasonable". So, if your business is serious, you have to assume the worst-case scenario: DO can't be used for anything

Conclusion: Digital Ocean is just for testing, playing around, not suitable for production.

> Conclusion: Digital Ocean is just for testing, playing around, not suitable for production.

I think that's always been the standard position most people take. DO, Linode, etc are for personal side projects, hosting community websites, forums etc. They are not for running a real business on. Some people do, sure but if hosting cost is really that big a portion of your total budget you probably don't have a real business model yet anyway.

I am of the impression that people rent cloud services because they can expense the cost to someone else or because of an inability to plan long term or a need of low latency.

That's the kind of response you only send when you're convinced the customer is actually nefarious and you don't care about losing them. I wonder if there is any missing backstory here or if it really is just a case of mistaken analysis.

> Don't really blame Digital Ocean for the accidental locking

I find it concerning that they have such a low threshold for triggering a lock. 10 droplets is hardly cloud scale.

Used to work at Linode, let's flip this on its head:

When the majority of what abuse support dealt with was people angrily calling and asking about the fraudulent charges on their cards for dozens of Lie-nodes, you consider putting caps in place to reduce support burden and reduce chargebacks.

At the time at Linode, if it was a known customer, we could easily and quickly raise that limit and life is good.

I've always wondered how Amazon dealt with fraud/abuse at their scale.

I don't think DO was wrong here to have a lock, but the post lock procedure seemed to be the problem.

That response is ice cold. Reminds me of suspension emails Amazon sends out to their FBA sellers.

That's the sort of response you get from Twitter or Facebook, where you are not paying for the service.

Agreed. I actually had to reread that a few times because I could not believe that someone actually approved of that text.

That text suggests larger organizational problems within the company.

If you explain why, you give the actual abusers clues as to how to avoid detection. It's commonplace behaviour for companies to _not_ reveal details.

You can provide a helpful message with options for recourse without giving abusers "clues." These are not somehow mutually exclusive. By your logic it makes sense to punish a marginal element at the expense of the majority.

It's a spectrum; being opaque and stone-cold to that degree is shitty customer service.

I think the major issue there is process and management related. The account should have been reviewed by someone with the authority to activate it, and it definitely shouldn't have been flagged a second time. But looks like DO thought the user was malicious, and issues raised by malicious users don't get much information. The response was horrible though.

To be quite honest and thoughtful about it - probably extremely few companies went bankrupt silently because of issues like this. Let's be realistic.

Sure. Hopefully it results in a change of policy, or at least a public statement of some kind. Everyone can't depend on the cofounder to come in and save them from bad automation.

From the looks of it they will be posting a public postmortem to their status page.

Agreed, but it's not like the original poster had a huge platform, he just posted about it on Twitter. I may despise Twitter for a bunch of different reasons, but I can't deny it's a great tool for raising issues to companies.

He has 4300 followers on twitter. What if he had 50?


> Account should be re-activated - need to look deeper into the way this was handled. It shouldn't have taken this long to get the account back up, and also for it not be flagged a second time.

So... he doesn't address what is the scariest part to me, the message that just says "Nope, we've decided never to give your account back, it's gone, the end."

How would you like that to have been addressed?

I think it's entirely reasonable for companies to have that option. "You are doing something malicious and against the rules, you have been permanently removed". In this case, that option was misused, but I don't think the existence of that possibility is inherently surprising.

Access to your data should never be denied. Ever. It was not DigitalOcean's data. If you are a hosting provider, you can't ever hold customer data hostage or deny them access to it in any way.

Again, I must disagree. If DO genuinely believed that you were doing something malicious and that data was harmful or evil for you to own (e.g. other people's SSN, etc) then they are in the "right" to deny access to it. DO should not be forced to aid bad actors.

And, regardless of what DO should or should not do, they can do whatever they want with their own hard drives. You should structure your business accordingly.

If DO believed that there was criminal activity (notice I am not using the word "malicious"), they should have reported it to the police, and in that case they might be justified in securing a copy of the data. Blocking access would be justified only in the most extreme cases (such as if the data could be harmful to others, e.g. pictures of minors).

If there is no police report, then they are trying to act as police themselves, which I think is unacceptable. It is not their data.

Your argument that they can do whatever they want with their hard drives is indeed something I will take care to remember — I definitely would not want to host anything with DO.

> If DO genuinely believed that you were doing something malicious and that data was harmful or evil for you to own (e.g. other people's SSN, etc) then they are in the "right" to deny access to it.

The observant will note the particular corner you're backing into here -- that a business might be justified in denying access to code/data being used in literally criminal behavior -- is notably distinct from the general and likely much more common case.

> they can do whatever they want with their own hard drives.

Sure. But to the extent they take that approach, Digital Ocean or any other service is publicly declaring that however affordable they may be for prototyping, they're unsuitable for reliable applications.

Businesses that can be relied on generally instead offer terms of service and processes that don't really allow them to act arbitrarily.

> ... a business might be justified in denying access to code/data being used in literally criminal behavior...

I agree. Look at the absolutism of the comment I am replying to. My whole point is that there might be some nuance to the situation.

> ...Digital Ocean or any other service is publicly declaring that however affordable they may be for prototyping, they're unsuitable for reliable applications.

Again, I agree. Considering how cheap AWS, Backblaze, and Google Drive are, it is completely ridiculous to depend on any one single hosting service to hold all your data forever and never err.

At no point did DO ever believe this. This happened purely and simply because of usage patterns changing. It was done automatically and a bot locked them out. They should not be locking out data based on an automated script.

You seem to be accusing the aggrieved party of being a bad actor, when that is not the case.

The change in usage patterns does not appear to be the only flag.


No, it was one of a number of factors. Usage pattern was definitely a factor. It’s still pretty awful.

For some practical, if extreme, examples: if a customer were to host a phishing site, or a site hosting CP, it would be grossly irresponsible (and likely even illegal) for the hosting provider to retain the customer's data after account suspension and allow them to download it.

When this happens they should contact law enforcement, not play god.

> they should contact law enforcement

And do what in the meantime? The legal system acts slowly. In the age of social media outrage, would you allow the headline "Digital Ocean knew they were serving criminals, and they didn't stop them" if you were CEO?

It's easy to be outraged when these systems and procedures are used against the innocent. That does not mean we should stop using rational thought. If someone is using DO to cause harm, then DO should (be allowed to) stop the harmful actions.

> Your account has been temporarily locked pending the result of an ongoing investigation.

You lock down the image, and let law enforcement do their thing. If law enforcement clear them, you then give the customer access to their data, perhaps for a short time before you cut them off as they seem to be a risky customer to have.

You don't unilaterally make the decision, you offload your responsibility onto the legal process.

I agree that this was probably the most reasonable decision for them to make.

The fact that there are hundreds of comments on HN condemning them for this action proves my point.

>would you allow the headline "Digital Ocean knew they were serving criminals, and they didn't stop them" if you were CEO?

Seems to work just fine for AWS, Google and Cloudflare. In fact, counter to your argument, Cloudflare got in massive shit when they did decide to play God.

Lol, only a judge has such rights (to decide if data is illegal or not), not some DO algorithms.

Exactly. Are all images of children illegal? I have a photo of me as an infant. What kind of algorithmic nonsense absolutism are they talking about?

Reasonable to have the shutdown part of the option, yes.

At the very least, they should also provide ALL, as in every last byte, of data, schemas, code, setup etc. to the defenestrated customers. As in: "sorry, we cannot restart your account, but you can download a full backup of your system as of its last running configuration here: -location xyz-, and all previous backups are available here: -location pdq-".

Anything less is simply malicious destruction of a customer's property.

If you violate a lease and get evicted, they don't keep your furniture & equipment unless you abandon it.

Sure, a backup would have been a significant improvement, but still – a backup only protects against data loss and not against downtime.

That's probably a reason to use containerization / other technologies so that you can spin up your services in a couple minutes on a different cloud provider.

You don't need to use containers for that.. all you have to do is set up a warm replica of the service with another provider. The fail over doesn't even have to be automatic, but that is the minimum amount of redundancy any production SaaS should have.

A "warm replica" is going to cost money though, while containerization allows you to not have anything spun up until the moment you need it, and then have it ready to go minutes / an hour later.

That is patently false, unless you plan on starting from a clean slate on the new environment. Anyone who proposed such a solution as a business continuity practice to me would be immediately fired.

Containers solve the easy problem, which is how to make sure the dev environment matches the production environment. That is it.

Replicating TBs worth of data and making sure the replica is relatively up to date is the hard part. So is fail over and fail back. Basically everything but running the code/service/app, which is the part containers solve.

I was responding to this comment:

> Sure, a backup would have been a significant improvement, but still – a backup only protects against data loss and not against downtime.

Assuming you have data backup / recovery good to go, the downtime issue needs to be solved by getting your actual web application / logic up and running again. With something like docker-compose, you can do this on practically any provider with a couple of commands. Frontend, backend, load-balancer -- you name it, all in one command.

> Containers solve the easy problem, which is how to make sure the dev environment matches the production environment. That is it.

Speaking of "patently false"...

>That said, any company, especially one working with Fortune 500's, should have DB backups in at least two places.

They should have, at the very least, one DR site on a different provider in a different region that is replicated in real-time and ready to go live after an outage is confirmed by the IT Operations team (or automatically depending on what services are being run).

>That said, any company, especially one working with Fortune 500's, should have DB backups in at least two places

Yes they should.

How many 2-man shops do you think follow all the proper backup and security procedures?

I feel for these guys, but that's not "all the proper backup procedures". I'm part of a three-man shop and storing backups in another place is the second thing you do immediately after having backups in the first place. Never mind being locked out by the company - what happens if the data centre burns to the ground?

It could literally be a cron job that dumps your DB to a desktop computer once a week. Not exactly CIA-level stuff.

More realistically they would have done backups inside DO and would still be locked out. Not many people actually do complete offsite backups to a completely different hosting provider, getting locked out of your account is usually just not a consideration. It’s unrealistic to expect this of a tiny startup.

>getting locked out of your account is usually just not a consideration

How many horror stories need to reach the front page of HN before people stop believing this? Getting locked out of your cloud provider is a very common failure mode, with catastrophic effects if you haven't planned for it. To my mind, it should be the first scenario in your disaster recovery plan.

Dumping everything to B2 is trivially easy, trivially cheap and gives you substantial protection against total data loss. It also gives you a workable plan for scenarios that might cause a major outage like "we got cut off because of a billing snafu" or "the CTO lost his YubiKey".

> How many horror stories need to reach the front page of HN before people stop believing this

Sounds like the opposite of the survivor bias. I don't believe it's any sort of common (though it does happen), even less that "it should be the first scenario in your disaster recovery plan"

Even if the stories we hear of account lockouts aren't typical, the absolute number of them that we see -- especially those (like this one) that appear to be locked (and re-locked) by automated processes -- should be cause for concern when setting up a new business on someone else's infrastructure.

If you plan for the "all of our cloud infrastructure has failed simultaneously and irreparably" scenario, you get a whole bunch of other disaster scenarios bundled in for free.

Whether it's normally a consideration or not, there are no meaningful barriers in terms of cost or effort, so it's totally realistic to expect it of a tiny startup.

Every week there's another article on HN about a tiny business being squished in the gears of a giant, automated platform. In some cases like app stores this is unavoidable, but there are plenty of hosting providers to choose from. People need to learn that this is something that can happen to you in today's world, and take reasonable steps to prepare for it.

And there are a million stories of startups who build the wrong thing, don't achieve product-market fit, etc.

You can't dot every I, cross every t and also build a compelling product as a 2 person shop.

Backups aren't "dotting i's and crossing t's", they're fundamental. FFS, just rsync your database directory somewhere.

Then maybe you shouldn't be building that product with a 2 person shop.

Sounds good in theory.

I don't know, it seems simple enough to me. I have a server on DO hosting some toy-level projects, and IIRC it took me 15-30 min to set up a daily Cron job to dump the DB, tar it, and send it to S3, with a minimum-privilege account created for the purpose, so that any hacker that got in couldn't corrupt the backups. I'm not a CLI or Linux automation whiz, others could probably do it faster.
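The "dump the DB, tar it, send it elsewhere" job described above might look roughly like the Python sketch below. It only covers the packaging step with the standard library. The file name `appdb.sql` and the timestamp format are assumptions for illustration, and the dump and upload steps (the database's own dump tool, then an upload with write-only credentials to a different provider) are left out because they depend on the setup.

```python
import tarfile
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def archive_dump(dump_path: Path, out_dir: Path, now: datetime) -> Path:
    """Pack a database dump into a timestamped .tar.gz and return its path."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    archive = out_dir / f"{dump_path.stem}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store only the file name, not the full path on the server.
        tar.add(dump_path, arcname=dump_path.name)
    return archive


# Demo with a stand-in dump file; a real job would create it with the
# database's dump tool and then upload the archive off-provider.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    dump = tmp_dir / "appdb.sql"
    dump.write_text("-- placeholder dump contents\n")
    run_time = datetime(2019, 5, 31, 3, 0, tzinfo=timezone.utc)
    result = archive_dump(dump, tmp_dir, run_time)
    archive_name = result.name
    with tarfile.open(result) as tar:
        members = tar.getnames()
    print(archive_name, members)
```

Timestamped archive names keep older backups from being overwritten, which matters if a bad dump goes unnoticed for a few days.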

> It’s unrealistic to expect this of a tiny startup.

I could not disagree more. There's a right way and a wrong way to do this, it's trivial to do it right, and the risks of doing it wrong are enormous.

> It’s unrealistic to expect this of a tiny startup.

Then it's unrealistic to trust them with your business.

That's better than nothing, but still not great.

We don't know the structure of their DB and whether failover is important or not, so we don't know if the DB can be reliably pulled as a flat file backup and still have consistent data.

We also don't know how big the dataset is or how often it changes. Sometimes "backup over your home cable connection" just isn't practical.

Cron jobs can (and do) silently fail in all kinds of annoying and idiotic ways.

And as most of us are all too painfully aware, sometimes you make less-than-ideal decisions when faced with a long pipeline of customer bug reports and feature requests, vs. addressing the potential situation that could sink you but has like a 1 in 10,000 chance of happening any given day.

But yes, granted that as a quick stop-gap solution it's better than nothing.
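On the point that cron jobs can fail silently: one common mitigation is a separate freshness check that alerts when the newest successful backup is too old. A minimal Python sketch follows; the seven-day threshold and the example dates are assumptions, and how the last-success time is obtained (file mtime, object-store listing, a status file) is left open.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_is_stale(
    last_success: Optional[datetime], now: datetime, max_age: timedelta
) -> bool:
    """True if no backup ever succeeded or the newest one is older than max_age."""
    if last_success is None:
        return True
    return now - last_success > max_age


now = datetime(2019, 5, 31, 12, 0, tzinfo=timezone.utc)
weekly = timedelta(days=7)

fresh = backup_is_stale(now - timedelta(days=2), now, weekly)
two_months_old = backup_is_stale(now - timedelta(days=60), now, weekly)
never_ran = backup_is_stale(None, now, weekly)
print(fresh, two_months_old, never_ran)
```

The check should run somewhere other than the machine that takes the backups, so that one failure does not hide the other.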

> We also don't know how big the dataset is or how often it changes.

I'm going to take a stab at small and infrequently.

Every 2-3 months we had to execute a python script that takes 1s on all our data (500k rows), to make it faster we execute it in parallel on multiple droplets ~10 that we set up only for this pipeline and shut down once it’s done.

Yeah, probably. But we shouldn't be calling these guys out for not taking the "obvious and simple" solution when we aren't 100% certain that it would actually work. That happens too often on HN, and then sometimes the people involved pop in to explain why it's not so simple, and everyone goes "...oh." Seems like we should learn something from that. I've gone with "don't assume it's as simple as your ego would lead you to believe."

I suggested that solution because everyone is saying "they're only a two-man shop so they don't have the time and money to do things properly". Anyone has the time and money to do the above, and there's a 90% chance that it would save them in a situation like this.

Even if they lost some data, even if the backup silently failed and hadn't been running for two months, it's the difference between a large inconvenience and literally your whole business disappearing.

Sure it could be. Still not enough companies actually do this though...

"2-man teams generally don't prioritize backups" isn't an excuse for not prioritizing backups.

> "2-man teams generally don't prioritize backups" isn't an excuse for not prioritizing backups.

They had backups, but being arbitrarily cut-off from their hosting provider wasn't part of their threat model.

Isn't a big part of cloud marketing the idea that they're so good at redundancy, etc. that you don't need to attempt that stuff on your own? The idea that you have to spread your infrastructure across multiple cloud hosting providers, while smart, removes a lot of the appeal of using them at all. In any case, it's also probably too much infrastructure cost for a 2-man company.

> In any case, it's also probably too much infrastructure cost for a 2-man company.

keeping your production and your backups in the same cloud provider is the equivalent of keeping your backup tapes right next to the computer they're backing up. you're exposing them both to strongly correlated risks. you've just changed those risks from "fire, water, theft" to "provider, incompetence, security breach"

So what is the purpose of the massive level of redundancy that you are already paying for when you store a file on S3? I don’t think it’s terribly common for even medium sized companies to have a multi tier1 cloud backup strategy.

Back in the day, we used to talk a lot about how RAID is not a backup strategy. The modern version of that is that S3 is not a backup strategy.

> So what is the purpose of the massive level of redundancy that you are already paying for when you store a file on S3?

You're paying to try and ensure you don't need to restore from backups. Our data lives in an RDS cluster (where we pay for read replicas to try and make sure we don't need to restore from backups) and in S3 (where we pay for durable storage to try and make sure we don't need to restore from backups), but none of that is a backup!

If you're not on the AWS cloud S3 is a decent place to store your backups of course, but storing your backups on S3 when you're already on AWS is, at best, negligent, while treating the durability of S3 as a form of backups is simply absurd.

> I don’t think it’s terribly common for even medium sized companies to have a multi tier1 cloud backup strategy.

The company I work for is on the AWS cloud, so we store our backups on B2 instead. It's no more work than storing them on S3, and it means we still have our data in the event that we, for whatever reason, lose access to the data we have in S3. Who the hell doesn't have offsite backups?

> Back in the day, we used to talk a lot about how RAID is not a backup strategy. The modern version of that is that S3 is not a backup strategy.

This is not remotely the same thing. A RAID offers no protection against logical corruption from an erroneous script or even something as simple as running a truncate on the wrong table. Having a backup of your database in a different storage medium on the same cloud provider protects from vastly more failure modes.

> Who the hell doesn't have offsite backups?

No one. But S3 is already storing your data in three different data centers even if you have a single bucket in one region, and you also have SQL log replication to another region. Multi-region is as easy as enabling replication but that is only available within a single cloud provider (I can't replicate RDS to Google Cloud SQL, only to another RDS region). I would guess that a lot of people use that rather than using a different cloud provider.

> This is not remotely the same thing. A RAID offers no protection against logical corruption from an erroneous script [...] But S3 is already storing your data in three different data centers

That sounds like...the same argument?

A RAID array stores your data on multiple physical drives in the machine, but offers no protection against logical corruption (where you store the same bad data on every drive), destruction of the machine, or loss of access to the machine.

S3 stores your data in multiple physical data centres in the region, but offers no protection against logical corruption, downtime of the entire region, or loss of access to the cloud.

You can't count replicas as providing durability against any threat that will apply equally to all the replicas.

> So what is the purpose of the massive level of redundancy that you are already paying for when you store a file on S3?


> > "fire, water, theft"

i'm sure you could add a few more things to the list.

> I don’t think it’s terribly common for even medium sized companies to have a multi tier1 cloud backup strategy.

not terribly common to understand risk.

Storing a file on two tier1s would surely protect you from fire, water, theft no? Yet you will also be paying for all the extra copies Amazon and Google each make. I'm not disagreeing that this is the right strategy, just pointing out that the market offerings and trends don't support it.

> being arbitrarily cut-off from their hosting provider wasn't part of their threat model

Let's be fair: The threat model here is "lose access to our data".

This can happen in a number of ways, lost (or worse, leaked) password to the cloud provider, provider goes bankrupt, developer gets hacked, and a thousand other things.

Even if you trust your provider to have good uptime, there's really no excuse for not having any backups. Especially not if you're doing business with Fortune 500's.

Yeah I think this is what people are not getting. Redundant backups might mean "don't worry, in addition to backups on the instance, I have them going to a S3 bucket in region 1 and then also region 2 in case that region goes down," which of course doesn't protect from malicious activity from the provider. You certainly _should_ make sure you have backups locally available or in a secondary cloud provider but this is some hindsight.

Which is exactly why I won’t sign contracts with bootstrapped startups in an enterprise context.

What Fortune 500 company is doing business with 2-man shops that aren’t?

Yeah there might be some unpleasant internal conversations following that event

An unpleasant conversation with his pillow. It's a two-person company and he is the only technical person.

I'm talking about DO, not the affected company

Having backups in two places could easily triple the hosting costs. The question is what costs more, e.g. losing data vs. backup costs.

As a startup, generally your secondary backup could literally be an external hard drive from best buy, or an infrequent access S3 bucket (or hell, even Glacier). No excuse, especially when "dealing with Fortune 500 companies".

Triple hosting costs?

Literally just push a postgres dump to S3 (or any other storage provider) once a night as a "just in case something stupid happens with my primary cloud provider". It'd take a couple hours tops to set up and cost next to nothing.

Most of the costs aren't from storage space, but compute power. We aren't talking about duplicating the whole infrastructure, just backing up the data. Disk space is dirt cheap.

Also, by "two places" I meant the live DB and one backup that's somewhere completely different. My wording may have been confusing.

They did have backups. That's why I assumed you meant double backups. If you do cold storage you should have 3 copies due to possible corruption. Sure, tape drives are cheap, but someone also has to run and check the backups.

That seems like an odd cost increase. How do you figure it would lead to a tripling of op costs?

Let's say you have 100 TB of data, plus two backups; you are now paying for 300 TB of data.

I would say that it doubles the cost of backups, but using this math, we start with one copy plus one backup, and add a second backup; that means only a 50% increase.

Also a secondary backup doesn't need to be in hot storage. Coldline or Glacier or similar can easily be a quarter of the price per GB.
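The arithmetic in the 100 TB example above can be checked with a few lines of Python. Prices are in arbitrary units per TB; the cold tier at a quarter of the hot price is the figure suggested in the thread, not a quoted rate from any provider.

```python
def storage_cost(
    live_tb: float,
    hot_backups: int,
    cold_backups: int,
    hot_price: float = 1.0,
    cold_ratio: float = 0.25,
) -> float:
    """Cost of one live copy plus hot and cold backup copies of the same data."""
    hot_tb = live_tb * (1 + hot_backups)
    cold_tb = live_tb * cold_backups
    return hot_tb * hot_price + cold_tb * hot_price * cold_ratio


# Baseline: live data plus one backup, both in hot storage.
baseline = storage_cost(100, hot_backups=1, cold_backups=0)
# Adding a second backup in hot storage vs. in cold storage.
second_hot = storage_cost(100, hot_backups=2, cold_backups=0)
second_cold = storage_cost(100, hot_backups=1, cold_backups=1)

increase_hot = (second_hot - baseline) / baseline
increase_cold = (second_cold - baseline) / baseline
print(f"second hot backup:  +{increase_hot:.1%}")
print(f"second cold backup: +{increase_cold:.1%}")
```

Under these assumptions a second hot copy adds 50% to storage cost, and a cold one adds 12.5%, well short of tripling.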
