As DigitalOcean's CTO, I'm very sorry for this situation and how it was handled. The account is now fully restored and we are doing an investigation of the incident. We are planning to post a public postmortem to provide full transparency for our customers and the community.
This situation occurred due to false positives triggered by our internal fraud and abuse systems. While these situations are rare, they do happen, and we take every effort to get customers back online as quickly as possible. In this particular scenario, we were slow to respond and had missteps in handling the false positive. This led the user to be locked out for an extended period of time. We apologize for our mistake and will share more details in our public postmortem.
Thanks for the replies.
Let me try to address a few of the things I have seen here. We haven't completed our investigation yet, which will include details on the timeline, decisions made by our systems and our people, and our plans to address where we fell short. That said, I want to provide some information now rather than waiting for our full post-mortem analysis. A combination of factors, not just the usage patterns, led to the initial flag. We recognize and embrace our customers' ability to spin up highly variable workloads, which would normally not lead to any issues. Clearly we messed up in this case.
Additionally, the steps taken in our response to the false positive did not follow our typical process. As part of our investigation, we are looking into our process and how we responded so we can improve upon this moving forward.
With all due respect, I think you've missed the point. The larger point, from my perspective, is that you denied your client the ability to move their data off your platform. This would be akin to someone breaking the terms of their lease and you confiscating all their belongings with the intent of burning them. You should provide some sort of grace period for users to move their data off your platform. For everyone else reading this, it should be a wake-up call as to why you should never trust your data to a single entity. Even if they have 99.9999% uptime, you never know when they'll decide to deny you access to your data.
Thank you for jumping in personally to clarify what happened.
As a business owner with much of our infrastructure depending on DigitalOcean, the incident is concerning. It affects the reputation of DO as well as its customers.
The demographic on Twitter, and especially here on HN, represents a sizable crowd with decision-making influence over DO's bottom line. I hope to see some effort made to prevent situations like this in the future, and to regain that trust.
As a (so far) satisfied customer, it's great to hear that:
> A combination of factors, not just the usage patterns, led to the initial flag.
> We recognize and embrace our customers' ability to spin up highly variable workloads, which would normally not lead to any issues.
> we are looking into our process and how we responded so we can improve upon this
I'll be awaiting the post-mortem and, depending on it and the procedures proposed to stop this from happening again, will hold off on moving everything I have off DO.
The real “mess up” here was the bit where you blocked the account with no reason given and no further communication - other than the one-liner your intern wrote for the email.
I’m expecting you to sit down with your legal team and rewrite your TOS to be more customer-focused and less robotic.
I wanted to provide you all with an update on the postmortem I promised on Friday. Our analysis has been completed. We will be sharing the full document soon and will publish a link in this thread for those wanting to read it. We promised Raisup a first look, and we provided the draft document to them this afternoon. Because some information in the document could be considered sensitive, we wanted to give Raisup a chance to review it before sharing with the public.
As a long-term customer, here is a small suggestion to make this fail-safe: trust your customers by default, and just ask them first instead of shooting them down first.
Considering you have been marketing yourself as the platform for the developer-oriented cloud, you should be aware that surge provisioning can and will always happen.
What do you recommend your clients do if that kind of mistake happens to them? Is Twitter-shaming the only way out?
I know people cite legal arguments for why they shut you down and won't say anything, but this is the worst scenario ever. I'd rather be accused of something I didn't do than get "oops, we can't tell you anything, your account has been shut down."
This is important. I hate how it has become standard for companies to screw their customers unless they are online-shamed.
The response email even read like a giant polite FUCK YOU (we locked your account, no further action required by you)
You bet I will have further action!
And it is only after the shaming that you get an "I am sorry for this situation", which sounds more like "I'm sorry we got caught".
My frustration is not with DO specifically, as they do exactly what every other company does.
But, what of the other thousands of people that got screwed and did not put it on twitter?
It is the equivalent of getting screwed in a restaurant: it is the loudest complainer who gets the reward, while all the others silently swallow the injustice.
It's most likely due to the fact that the people who can act upon the process itself, not just follow it, inevitably see the issue and do truly want to help.
Getting your message into the right hands is what matters, not the platform it's on.
Mistakes happen, and algorithms are sometimes a necessary part of scale/efficiency. Everyone understands that.
That said, what's highly troubling as a DO customer (and someone who is planning to deploy startup infrastructure of my own with DO) is:
1) The discrepancy between this customer's experience and clear assurances made on this very forum by high-level DO employees that:
a. warnings are ALWAYS issued before suspensions.
b. even in the event of a suspension, services remain accessible (though dashboard access and/or the ability to spin up NEW services may be impacted), i.e. the affected customer could still retrieve data or SSH into droplets.
2) The relatively trivial nature of the customer's offending usage (temporarily spinning up 10 droplets). What happens if, for example, a startup gets a press mention somewhere that leads to a massive traffic spike, necessitating a sudden and significant spin-up of new droplets (especially if this is done programmatically versus by hand in the dashboard)?
3) The apparent lack of consideration of the customer's history, or investigation into their usage. It seems the threshold for suspending services of longstanding customers who are verifiably engaging in commerce (taking a moment to look at their website and general online presence for indicators of legitimacy) should be SUBSTANTIALLY higher than for, say, an account that signed up a week ago. Context matters.
I'm no longer able to edit the above comment, so to elaborate on #1:
Following is a comment[1] by Moisey Uretsky in another thread[2]:
> Depending on which items are flagged the account is put into a locked state, which means that access is limited. However, the droplets for that account and other services are not affected at all. The account is also notified about the action and a dialogue is opened, to determine what the situation is. There is no sudden loss of service. There is no loss of service without communication. If after multiple rounds of communication it is determined that the account is fraudulent, even then there is no loss of service that isn't communicated well in advance of the situation.
What he said in another thread, and in this thread, is press-release marketing. Don't trust what he says to save his business. You have absolutely no reason to.
I prefer to give them the benefit of the doubt, though a clear explanation of why the above policy was not followed seems warranted. (It also doesn't appear to have been followed in several other instances reported by other former customers in various HN threads.)
If DO reserves the right to cut off services and access to your own data permanently and without warning (outside of a court order or confirmed illegal activity), that needs to be unequivocally stated, and the triggering factors should be made known. Otherwise, DO is not fit for production systems.
Additionally, it would be nice to see the creation of a transparent, high-level appeal process for customers affected by suspensions. Truly malicious customers wouldn't use it (what would they hope to successfully argue to an actual human reviewer?), but it would greatly benefit legitimate customers to have an outlet other than social media by which to "get something done" in the event of an inappropriate suspension followed by a breakdown in the standard review process.
Not sure why you’re being downvoted. Point 2 is very relevant. Scaling instances due to sudden peaks should be totally safe. Even when automated. Guess AWS is still lonely at the top.
It really is a trivial amount of resources to be triggering such a reaction. It's almost like DigitalOcean doesn't like being in the cloud hosting business. One of the fundamental, desirable points of the shift to such cloud hosting services is that you can quickly spin up a bunch of resources when needed and then dump them.
You've got an additional problem though, which is that this tells us you have two support channels: one that doesn't work (i.e. yours, the one you built), and one that does (Twitter-shaming). The first channel represents how you act when no one's watching; the second, how you act when they are. Most people prefer to deal with people for whom those two are the same.
Do not use DO. The very fact that their default response to suspected spam is to cause prod downtime is so bizarre and unacceptable that it does not make any sense whatsoever for a business to rely on them.
You do not need to scrub or write anything to not provide user A’s data to user B in a multi-tenant environment. Sparse allocation can easily return nulls to a reader even while the underlying block storage still contains the old data.
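Here's a minimal sketch of that principle at the filesystem level (a sparse file's "hole" reads back as zeros even though nothing was ever scrubbed). It's an illustration of sparse allocation in general, not a claim about DO's actual storage stack; the file path is just a throwaway example.

    # Sparse allocation demo (POSIX): the unwritten "hole" reads as zeros,
    # yet almost no blocks are allocated and nothing was ever scrubbed.
    import os

    path = "sparse_demo.img"
    with open(path, "wb") as f:
        f.seek(10 * 1024 * 1024 - 1)   # seek ~10 MiB in without writing
        f.write(b"\0")                 # one byte allocates almost nothing

    st = os.stat(path)
    print("logical size:", st.st_size)           # ~10 MiB as seen by a reader
    print("allocated   :", st.st_blocks * 512)   # far less actually on disk

    with open(path, "rb") as f:
        first_chunk = f.read(4096)
    print(first_chunk == b"\0" * 4096)           # the hole reads back as zeros

    os.remove(path)

Thin-provisioned block storage behaves the same way: unmapped extents read as zeros to the new tenant regardless of what the physical media underneath still contains.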
They were just incompetent.
On top of all of that, when I pointed out that what they were doing was absolute amateur hour clownshoes, they oscillated between telling me it was a design decision working as intended (and that it was fine for me to publicize it), and that I was an irresponsible discloser by sharing a vulnerability.
Then they made a blog post lying about how they hadn’t leaked data when they had.
I think it says a lot that this CTO joker flew in, regurgitated the standard-issue "we will endeavor to do better" apology and left without answering any of the very legitimate follow-up questions. I would never deal with an organisation that behaves like these guys.
That’d be unrealistic for any company to claim, and if any company I worked with did claim that I would run for the hills.
That's akin to saying "we'll never ship a bug", or "we have an SLO of 100%". That's impossible for anyone to claim. Same goes for the response handling. There is clearly a lot of room for improvement there, but if you're insisting on not getting a canned response, that means a human needs to be involved at some point. Humans will at times be slow to respond. Humans will at times make mistakes. This is just an unavoidable reality.
I get that mob mentality is strong when shit hits the fan publicly, but have a bit of empathy and think about what reasonable solutions you may come up with if you were to be in their situation, rather than asking for a “magic bullet”.
I could see a good response here being an overhaul of their incident response policy, especially in terms of L1 support. Probably by beefing up the L2 staffing, and escalating issues more often and more quickly. L2 support is generally product engineers rather than dedicated support staff/contractors, so it’s more expensive to do for sure, but having engineers closer to the “front line” in responding to issues closes the loop better for integrating fixes into the product, and identifying erroneous behavior more quickly.
Sure, I and a lot of others react rather strongly in these situations. I agree with that, but you already seem to understand the reasons.
However, can you say with a straight face that the very generic message left here by DO's CTO instills confidence in you about how they will handle such situations in the future?
Techies hate lawyer/corporate weasel talk. The least that person could do was speak plainly without promising the sky and the moon.
I would prefer a generic message and a promise for follow up once all the facts are known over a rushed response that may be incorrect.
I'm an engineering manager on an infrastructure team (not at all affiliated with Digital Ocean, though, full disclosure, I do have one droplet for my personal website). I know how postmortems generally work, and it's messy enough to track down a root cause even when it's not some complex algorithm like fraud detection going off the rails.
I'd rather get slow information than misinformation, but I understand the frustration of not being able to see the inner workings of how an incident is being handled.
And I agree with your premise. However, my experience has shown that postmortems are, many times, watered-down, evasive PR talk.
If you look at this through the eyes of a potential startup CTO, wouldn't you be worried about the lack of transparency?
And finally, why is such an abrupt account lockdown even on the table at all? You can't claim you are doing your best when it's very obvious that you are just leaving your customers at the mercy of very crude algorithms -- and those, let's be clear, could have been designed to never lock down an account without human approval as the final step.
What I'm saying is that even at this early stage when we know almost nothing, it's evident that this CTO here is not being sincere. It seems DO just wants to maximally automate away support, to their customers' detriment.
Whatever the postmortem ends up being it still won't change the above.
Our line so far has been to change service providers if we start getting copy-paste answers from support. We always make sure we can get hold of a human on the phone, even without a big uptime contract. This has so far led us to small companies that are not overrun by free accounts used for spam or SEO. That means they have no need for automatic shutdown of accounts, and instead you get a phone call if something goes wrong.
This is how I would go about it as well. But I imagine that's a big expense for non-small companies, and not only in money but in the time of valuable professionals who could have spent it improving the bottom line.
I too value lesser-known providers. The human factor in support is priceless.
7 hours, on a Friday night in the headquarters time zone. This issue is resolved and is clearly not wide spread, so does getting a response on Monday or Tuesday vs right now make any difference?
Companies are made of people. Let the people have a life. Their night is shitty enough as is after this, I guarantee you.
The thing is, my business doesn't want to deal with people. It wants to deal with a business made up of multiple people to guarantee service availability. If he cannot answer, surely someone else at DigitalOcean can?
You are being unreasonable here. He promised a postmortem. I’d much rather wait a few days to get a clearly written, comprehensive analysis of the problems than to get an immediate stream of confusing and contradictory raw data.
If you have ever been involved in post facto analysis of a process breakdown like this you know how hard it is to get the full picture immediately. Rushing something out does no one any favors.
Sure, but the email he received basically said "your account is locked. No other info. Thank You". That to me is a much scarier thing than anything else in the thread. How can anyone trust in your infrastructure if your standard protocol is literally just shutting down their entire operation without any form of review or communication?
We have a relatively large spend ($5k+) at DO, for a unique client (most of our other clients can be served by our colocated facility), and I'm going to second this. Or with any other provider. They should always explain exactly which rule was broken. If the customer is legit and genuine, they will promptly fix the issue and won't be a further problem. Being vague makes it super troublesome to rely on any service that takes that tactic (like Google, for example). If they continue to re-offend, and find other ways to skirt the rules, that's when you move on to account termination.
You can't, obviously. Even though I've used them before I really doubt I'll ever use DigitalOcean again. I can almost understand terminating customers (with notice) via automated heuristics for suspicious behavior, especially on the low end of the hosting market, but locking out a legitimate paying customer from backups with no notice or recourse is terrifying.
"In this particular scenario, we were slow to respond and had missteps in handling the false positive. This led the user to be locked out for an extended period of time."
This didn't seem like a case of being "too slow" - the customer in question went through your review process (which was slow, yes), and the only response he got was "We have decided not to reactivate your account, have a nice day".
That just seems like a lack of interest in supporting your customers that are falsely flagged.
Last week ended on a real low note for many of us at DO. We took a perfectly good customer and gave them an experience no one should have to go through (all while he was trying to leave on vacation no less). We can and must do better. To do better we need to learn from our mistakes. To that end, we also think sharing the information about this incident openly is the best way to help all our customers understand what happened and what we are doing to prevent it in the future.
Yesterday we completed our postmortem analysis of the incident involving Nicolas (@w3Nicolas) and his company Raisup (@raisupcom). With their permission we are sharing the full report on our blog here:
No offense, as I'm sure this has been hard, but a screwup like this publicly demonstrates DO is not ready for prime time competition against AWS, Azure, GCP and the like.
I'd gladly do whatever it takes to KYC, send you my business license, tax returns, EIN, invoice billing, etc so you know there is someone behind my account.
We spend thousands of hours eliminating single points of failure. If an automated system can undermine that work, DO is not an option for us to host anymore.
A year of data backups lost. Do you realize how that alone may cause clients to dump a company, and do you realize that startups may never recover from fiascos like these? I understand that it was a false positive triggered by internal systems. But how do you explain the delay in restoring the services, and the re-flagging within hours after the services were restored?
Hope you can share what you learnt from this incident and hopefully you'll take a hard look at your processes.
I'd hate to be caught in the same issue, especially that we are already customers, and I'm not sure I'll have as much clout as Nicolas here to get your attention.
> and I'm not sure I'll have as much clout as Nicolas here to get your attention.
It's occurring to me now that while I've successfully ignored twitter for years, I should probably rectify that just so I have somewhere to type my hopes and prayers when this eventually happens to me, and hope for a miracle. It sure seems like the only place they're listened to.
> I'm not sure I'll have as much clout as Nicolas here to get your attention.
Maybe keeping a Twitter (and other social media) account with at least a certain number of followers should be considered part of a company's security strategy? You'd also need to post something interesting periodically, to keep your followers, so that you have their attention when you need it.
IANAL, but DO's ToS is loaded with weasel words.[0] So if they can sue in some jurisdiction where the binding arbitration and liability limitations don't apply, maybe they could at least get a fair settlement.
It would probably be worth it to restore trust: Refund all the money they've taken from this company for the last year, and apply a credit to their account for 3x that amount, say.
It's not the false positive that is the issue here. The issue is that a. it took way too long to get the business back up and running, and b. the second response gave no explanation and no recourse for the business to become operational again.
The very fact that this can happen from an automated script with no oversight should give every one of your customers pause as to whether they continue with your service.
I'd say the issue is that DO is shutting down servers for any reason at all (legal issues aside). If DO sells a product with a particular capacity, why should they intervene at all if a user is using all of the capacity they're paying for?
So unless a person is popular enough to get enough people talking about it on twitter or hacker news, someone whose account is flagged by your bad script is going to lose his business.
Are you aware that Viasat has blacklisted a huge number of DigitalOcean /24 subnets? I can't access many of my servers when I'm on a satellite connection, in addition to other websites hosted on DigitalOcean. I've talked with the Viasat NOC and they told me they were blocking DigitalOcean subnets due to malware.
This is probably worth its own post; it would be very interesting to see more detail. I'm also fairly certain that this is not exclusive to DO.
Should we be concerned about our 40+ droplets with DO now? We built our business on DO; we, as well as our 30+ clients, really could go bankrupt if anything like this happens to us. Please change your support system ASAP, otherwise we will be switching to another platform. We are expecting a very serious response from you.
Do not use DO. The very fact that their default automated response to spam is prod downtime is unacceptable.
It requires so many failures in understanding the service being provided, across the company, for this decision-making process to have ever actualized, that there is no reasonable expectation of safety or trust from DO at this point.
Every cloud company has anti-abuse systems that will limit your access to their APIs or take down your machines if abuse is suspected - for example, if it looks like you're mining bitcoin. Your prod isn't any different from your staging to them.
Clearly you should be doing regular backups of everything, and not on DO. And make sure to test your backups. And make sure you have a fast migration plan into another cloud.
Ideally you should be cloud-agnostic, but that's quite hard to achieve.
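For a rough idea of what "backups not on DO" can look like in practice, here's a hedged sketch that mirrors objects from a DigitalOcean Spaces bucket (S3-compatible API) to a bucket at a second provider using boto3. The bucket names, the Spaces region in the endpoint, and the credential placeholders are all assumptions for illustration, not anyone's real setup.

    # Sketch: copy every object from a Spaces bucket to a second provider's
    # S3-compatible bucket. Names, endpoint, and credentials are placeholders.
    import io
    import boto3

    src = boto3.client(
        "s3",
        endpoint_url="https://nyc3.digitaloceanspaces.com",  # assumed region
        aws_access_key_id="SPACES_KEY",                       # placeholder
        aws_secret_access_key="SPACES_SECRET",                # placeholder
    )
    dst = boto3.client("s3")  # e.g. AWS S3, credentials from the environment

    SRC_BUCKET, DST_BUCKET = "prod-backups", "prod-backups-mirror"  # hypothetical

    token = None
    while True:
        kwargs = {"Bucket": SRC_BUCKET}
        if token:
            kwargs["ContinuationToken"] = token
        page = src.list_objects_v2(**kwargs)
        for obj in page.get("Contents", []):
            buf = io.BytesIO()
            src.download_fileobj(SRC_BUCKET, obj["Key"], buf)  # pull from provider A
            buf.seek(0)
            dst.upload_fileobj(buf, DST_BUCKET, obj["Key"])    # push to provider B
        if not page.get("IsTruncated"):
            break
        token = page.get("NextContinuationToken")

Run something like this from a machine that lives at neither provider, and actually restore from the mirror now and then, or it isn't a tested backup.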
That’s all well and good. But how do you plan to reimburse your customer for this gross negligence? I have never heard of such incompetence or lack of communications from anyone on AWS’s business support plan. Why should anyone trust DO over AWS or Azure?
If your DO (or other cloud provider) credentials are compromised, it's usually a matter of seconds before someone fires up the largest possible number of instances to start crypto mining.
Do you realize that by abusing this thread to make a single PR-focused comment, with no intention of participating in the conversation, you've disrespected the community here and the few remaining DO customers within said community?
>> and we take every effort to get customers back online as quickly as possible. In this particular scenario, we were slow to respond and had missteps in handling the false positive.
You clearly don't make every effort, and did not -- so why waste the extra verbiage and switch from active to passive voice?
Based on your cliche response I have zero confidence that DO will do anything substantial to address the root causes of the issue.
I've found DO's public posts to be particularly grating in the "we are listening to YOU, our customer. we take feedback extremely seriously" department.
Some people on HN hate Linode because of their past security screwups (which is valid), but having used both DO and Linode quite a lot, the support on Linode is way, way, way better than DO's.
DO's tier 1 support is almost useless. I set up a new account with them recently for a droplet that needed to be well separated from the rest of my infrastructure, and ran into a confusing error message that was preventing it from getting set up. I sent out a support request, and a while later, over an hour I think, I got an equally unhelpful support response back.
Things got cleared up by taking it to Twitter, where their social media support folks have got a great ground game going, but I really don't want to have to rely on Twitter support for critical stuff.
DO seems to have gone with the "hire cheap overseas support that almost but doesn't quite understand English" strategy, whereas the tier 1 guys at Linode have on occasion demonstrated more Linux systems administration expertise than I've got.
I have interviewed with DO and they tried diverting me towards a support position.
They told me that on a single day a support engineer was supposed to help/advise customers on pretty much whatever the customer was having issues with, and also handle somewhere between 80-120 tickets per day.
It's nice to see that DO is willing to help with pretty much anything they (read: their team) have knowledge about, but with 80-120 tickets per day I cannot expect to give meaningful help.
Needed EDIT: it seems to me that this comment is receiving more attention than it probably deserves, and I feel it's worth clarifying some things:
1. I decided not to move forward with the interview as I was not interested in that support position, so I have not verified that's the volume of tickets.
2. From their description of tickets, such tickets can be anything from "I cannot get apache2 to run" to "how can I get this linucs thing to run Outlook?" (/s) to "my whole company that runs on DO is stuck because you locked my account".
I once worked for eBay a long time ago, and support consisted of 4 concurrent chats, offering pre-programmed macros that often pointed to terribly written documentation the person had already read and was confused about. If you took the time to actually assist somebody, you were chastised in a weekly review where they went over your chat support. The person doing mine told me I had the highest satisfaction record in the entire company and a 'unique gift of clear and concise conversation, like you're actually talking to them face to face', then said I'd be fired the next week because my coworkers were knocking off hundreds of tickets a day just using automated responses, leaving their customers fuming in anger with low satisfaction ratings. People are very aware of being fed automated responses, but the goal was not real support, it was just clearing the tickets by any means possible.
I decided to try half and half: if the support question was written by somebody who obviously would not understand the documentation (grandma trying to sell a car), I would help them, but I'd just provide shit support to everybody else in the form of macros, like my coworkers. Of course this was unacceptable and I got canned the next week as promised. It was an interesting experience. I can imagine DO having an insane scope to their support requests, like 'what is postgresql'.
Anyway, imho you should have taken the support position and schemed your way into development internally. This was my plan at eBay before they fired me, though they shut down the branch here a few months later and moved to the Philippines anyway, so I wouldn't have lasted long regardless.
I'm fortunate that my own company (Rackspace) at least has a level head about this sort of thing. My direct manager looks at my numbers (~60-80 interactions per month) and my colleagues (many hundreds of interactions per month) and correctly observes that we have different strengths, and that's the end of the discussion. I have a tendency to take my time and go deep on issues, and my coworkers will send me tickets that need that sort of investigative troubleshooting. My coworker meanwhile will rapidly run through the queue and look for simple tickets to knock out. He sweeps the quick-fix work away, but also knows his limits and will escalate the stuff he's not familiar with.
Let me stress here, this is not nearly as easy a problem to solve as it appears to be on the surface. We're struggling as a company right now because, after our recent merger, a lot of our good talent has left and we're having to rebuild a lot of our teams. Even so, I'm still happy with our general approach. Management understands that employees will often have wildly different problem-solving approaches and matching metrics, and that's perfectly OK as long as folks aren't genuinely slacking off and we as a team are still getting our customers taken care of. I think that's important to keep in mind no matter how big or small your support floor gets.
+1 for Rack support. A previous company I worked for was heavily invested in Rackspace infrastructure, and while I often lamented not getting the equivalent AWS experience for the resume, I was regularly floored by the quality of their support. Whenever I had the pleasure of needing to open a ticket, they solved my problems and usually taught me something new in the process. The Linux guys were very clearly battle-hardened admins.
I couldn't imagine getting that level of support from DO, let alone Amazon.
I have the opposite experience with Rackspace. The low-end stuff (hosted Exchange etc.) is basically useless: people who are obviously on multiple chats, tickets left to sit for days...
Even when we had a small handful of physical servers with them, they seemed inept. They actually lost our servers one time and couldn't get someone out to reset power on our firewall.
My experiences were all with their "dedicated" or "managed" cloud services. Although I did notice that their marketing seemed to shift, in the last months I was working with them for that employer, from "let us help you build things on Rackspace" to "let us help you move what you built on Rackspace to AWS".
Yes, the Public Cloud, which houses most of the smaller Managed Infrastructure accounts (minimal support), is one of the bigger... I believe the polite word is "opportunities"? It's a very pretty UI on top of a somewhat fragile OpenStack deployment, which needs a significant amount of work to patch around noisy infrastructure problems. That turns into a support floor burden, and it shows in ticket latency. Critiques directed at that particular product suite are, frankly, quite valid. I think Rackspace tried to compete with AWS, realized very quickly that they do not have Amazon's ability to rapidly scale, and very nearly collapsed under their own weight.
That said, our FAWS team are a good bunch, and what AWS lacks in support they more than make up for in well engineered, stable infrastructure. Since Rackspace's whole focus is support, I think the pairing works well on paper and it should scale effectively, but we'll have to see how it plays out in practice.
This is a big push, internally and externally. I don't know too much about the details (I don't work directly with that team) but it's been one of our bigger talking points for a while now.
Support should be looked at as a profit center, but almost everyone tries to run it like a cost center.
It's crazy that companies spend $$ on marketing and sales, then cheap out on an interaction with someone who is already interested in / using their product.
Running profit centers requires comparatively rarer leadership resources, while running cost centers only requires easy-to-hire management resources. You don't want your best leaders whipping your support center into shape while the company's competitive edge fritters away.
It remains weird to me that this even _can_ work as a business strategy. Customers know this isn't right, so they are only staying with a business that does this for so long as it is the absolutely cheapest/ only way to achieve what they want. That's super high risk, because if a competitor undercuts you, or an alternative appears, you are going to lose all those customers pretty much instantly.
Almost invariably the high ticket rates are also driven by bad product elsewhere. Money is being spent on customer "services" sending out useless cut-and-paste answers to tickets to make up for money not spent on UX and software engineering that would prevent many of those tickets being raised. Over time that's the same money, but now the customer is also unhappy. Go ask your marketing people what customer acquisition costs. That's what you're throwing away every time you make a customer angry enough to quit. Ouch.
Seven tireless hours of work (with a lunch break), 15 minutes to listen, understand, and resolve an issue: assuming perfect knowledge, a lot of luck, and normal human speed, that would still amount to fewer than 30 resolutions a day.
Yep, this is spot on - I used to work on a webhosting help desk and could bang out about 100 tickets a shift, because so many were small queries that required no depth work.
Old MSFT rule of thumb was 2 bugs per day during bug-crunch mode. Sounds crazy, but when you consider the number of "this text is wrong" and "that text box is too short" bugs that existed after a year of furious development, it wasn't too hard to achieve.
Brought back memories. I think it might be a little Stockholm syndrome but there was just something about the pressure of getting a release out when you know it only happens once every few years. Bug triage definitely improved my persuasion technique.
Now it's just "meh, we'll fix it in next month's release".
He's using mathematics to compute the number of tickets an employee can handle per day, given certain assumptions. Given the data from znpy above, we see that nurettin's assumption that the time per ticket is 15 minutes is inconsistent with DO's expectations; instead, the average time spent per ticket should be about 5 minutes.
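A back-of-envelope version of that arithmetic, using only the figures quoted in this thread:

    # 7 working hours vs. 80-120 tickets/day vs. 15 minutes per real resolution
    minutes_on_shift = 7 * 60

    for tickets_per_day in (80, 120):
        print(tickets_per_day, "tickets/day ->",
              round(minutes_on_shift / tickets_per_day, 1), "min per ticket")

    print("at 15 min each ->", minutes_on_shift // 15, "resolutions per day")

At 80-120 tickets a day that's roughly 3.5 to 5 minutes per ticket, versus 28 resolutions a day if each one genuinely takes 15 minutes.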
This is appalling. I worked as an L1 ticket tech for an old LAMP host back in the day, where probably half of the tickets required nothing more than a password reset or an IP removal from our firewall, very easy stuff, and I was proud if I got over 60 responses out in an 8-hour shift. And that time was spent mostly just typing a response to the customer. I really expected higher standards from DO.
Linode is definitely in the minority here. Most companies, in tech and outside of it, seem to follow the DO model. Twitter provides decent service, and the official help channels provide canned responses and template emails.
I somewhat blame people in tech, actually. More than one company is creating products that "cut customer service costs via machine learning", which is code for "pick keywords from incoming tickets and autoreply with a template".
ISP here: The margins in bulk hosting services are incredibly thin, and companies have resorted to automation tools. If somebody asked me to run backend infrastructure for something like DigitalOcean or Linode, I would run away screaming. It would literally be my own personal hell. I would rather run any other sort of ISP services on the planet than a bulkhosting service where anybody with a pulse and $10 to $20/month can sign up for a VPS.
I truly feel sorry for their first and second tier customer support people. I imagine the staff churn rate is incredible.
People who work for these sorts of low-end hosting companies inevitably quit and try to work for an ISP that has more clueful customers. When you have people paying $250/month to colocate a few 1RU servers, the level of clue of the customer and amount of hassle you will get from the customer is a great deal less than a $15/month VPS customer.
This race to the bottom has reached a point that it's harming customers. It's okay to be more expensive than the competition if you provide a better service.
Personal opinion, it's really important in the ISP/hosting world to identify what market categories are a race to the bottom, and if at all possible, refuse to participate in them.
I look at companies selling $5 to $15/month VPS services and try to figure out how many customers they need signed up for monthly recurring services in order to pay for reasonably reliable and redundant infrastructure, and the math just doesn't pencil out (see the rough sketch after this list) without:
a) massive oversubscription
b) near full automation of support, neglect of actual customer issues, callous indifference caused by overworked first tier support
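A rough, purely illustrative version of that math; every figure below is an assumption I'm making for the sake of the sketch, not data from any provider.

    # Rough sketch of why $10/month VPS economics force the choices above.
    plan_price = 10.0        # $/month per VPS
    vms_per_host = 100       # only reachable with heavy oversubscription
    host_cost = 600.0        # $/month: amortized server, power, rack, transit
    ticket_cost = 8.0        # ~15 min of a support engineer's loaded cost
    tickets_per_vm = 0.5     # support tickets per VM per month

    revenue = plan_price * vms_per_host
    after_infra = revenue - host_cost
    support = ticket_cost * tickets_per_vm * vms_per_host

    print("revenue per host :", revenue)                  # 1000.0
    print("after infra      :", after_infra)              # 400.0
    print("support burden   :", support)                  # 400.0
    print("left over        :", after_infra - support)    # roughly nothing

Even with aggressive packing, honest human support eats whatever margin is left, which is exactly why it gets automated away.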
Conversely, as a customer, you should be suspicious when some company is offering a ridiculous amount of RAM, disk space and "unlimited 1000 Mbps!" at a cheap price. You should expect that there will be basically no support, it might have two nines of uptime, you're responsible for doing all your own offsite backups, etc.
If you use such a service for anything that you would consider "production", you need to design your entire configuration of the OS and daemons/software on the VM with one thing in mind: The VM might disappear completely at any time, arbitrarily, and not come back, and any efforts to resolve a situation through customer support will be futile.
That's going to be true no matter which cloud provider you choose.
My company provides services to fortune 100 companies, and we host literally petabytes of data on their behalf in Amazon S3, but we don't have offsite backups. We (and they) rely on Amazon's durability promise.
We do offer the option of replicating their data to another cloud provider, but few customers use that service -- few companies want to pay over twice the cost of storage for a backup they should never need to use when the provider promises 99.999999999% durability.
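For context, here's what that durability figure works out to in expected losses; the object counts below are made-up examples, not our actual workload:

    # Eleven nines of durability, expressed as expected object losses per year.
    durability = 0.99999999999
    annual_loss_rate = 1 - durability

    for objects in (10_000_000, 1_000_000_000):
        print(objects, "objects -> about",
              objects * annual_loss_rate, "expected losses per year")

That's on the order of one lost object per ten billion object-years, which is why few customers pay double for a second copy.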
I don't know what data you're holding. If it is sensitive data, like anything customer-related, would it in fact make sense not to have an offsite backup?
Reasoning: your contract with Amazon promises durability, and I'm sure there's a service level agreement with penalty/liability clauses. By implementing a redundant backup, you're replicating something you don't legally need to have, doubling (or more) the due diligence on the offsite backup's security/credentials, and, in case of a failure at Amazon, creating a grey area with clients: "Do you have the data, or do you not?"
In short, there could be a very good business reason not to do offsite backups.
Regardless of durability, if you lose your customers' data, are you sure you will have customers paying you to keep you in business while you figure out liability?
In this case, it was not losing data, but losing access to data. The data was eventually restored. Losing customers' data could also mean losing the backup:
"We're sorry, the tape that we didn't need to keep has been lost/zero-dayed/the secondary service provider has gone bankrupt/Billy's house that we left it at got robbed." These things must be disclosed to a customer immediately.
Minimising attack/liability surface is not only a technical problem, but a business one too.
For AWS it doesn't make a lot of sense to protect against AWS itself losing data since you're paying them a premium for that. Backups in this model would be logically separated so a user/programmer error can't wipe out the only copy of your production dataset.
It's just greed, not some wisdom. When data is lost, it's just lost. Maybe AWS will pay some compensation because of their promises, but money can't always solve the problem of missing data.
Ten years ago, I had a pretty reasonable $10/mo account, that I eventually moved to a $20/mo account because I needed more resources to keep up with traffic.
I'd expect that, as more transistors have been packed blah blah blah, such a $10/mo account would have gotten better, not worse, since then.
Bandwidth does drop in price, and increase in capacity, at a fairly rapid rate. Look at what an ISP might have paid for a 10GbE IP transit circuit in 2008 vs what you can get a 100GbE circuit for today. But people's bandwidth needs and traffic also grow rapidly.
There is the same risk of being kicked off when using Amazon AWS. The rules are different, but there will be situations where you lose everything (imagine you become visible for a political reason and the landscape shifts a bit).
At a certain price point, yes there is, if you're paying $800/month for hosting services to a mid sized regional ISP with presence at major IX points. That ISP cares about its reputation, and cares about the revenue it's getting from you.
I can tell you that as a person whose job title includes "network engineer", we have a number of customers who have critical server/VM functions similar to these people who had the DigitalOcean disaster. If something goes wrong, an actual live human being with at least a moderate degree of linux+neteng clue is going to take a look at their ticket, personally address it, and go through our escalation path if needed.
Having paid substantial amounts for various services over the years, paying hundreds of dollars per month doesn't automatically make you into a priority.
There seems to be a sweet spot for company size here. Companies that are too small can't support you even though they really want to. Large companies are busy chasing millions in big contracts and don't really care about your $800 per month at all.
Very good points. What I would recommend is to use a mid sized ISP in your local area where you can meet with people in person. At higher dollar figures there should be some sales person and network engineer you can meet in their local office, meet for coffee, discuss your requirements, and have something of a real business relationship with. You and your company should be personally known to them.
If you are just some semi anonymous faceless person ordering services off a credit card payment form on a website, all bets are off...
Imho, that point was reached in hosting over 15 years ago (which is why I sold the hosting company I had back then). We've seen some short-lived upticks periodically since then, but they all end up going back to shit as they try to scale.
srn from prgmr.com here. Our tagline originally came from us being a low-cost service, but I like to think of it as a customer support philosophy. One meaning is that we want you to be able to fix the problem yourself, by giving you instructions instead of logging into your system. Another is that we try to give you the benefit of the doubt that it could be our problem, and not assume it's yours, when there's an issue.
Linode was more of a bootstrapped business, it grew slowly and steadily. Digital Ocean was always built to grow fast from the beginning.
I think that the size of a company, or how fast it grows, is a good proxy for poor customer support. What we should be doing is finding the slow-growing or mid-tier (not too small, not too big) businesses to take our business to.
There’s nothing wrong with that approach, the person raising the support ticket likely hasn’t read through all the documentation of the product they’re using.
If implemented well, sure -- sometimes, maybe often, you can point a customer to a support document that directly answers their specific question and relieves some of the load on your staff. That's great.
But the execution matters a lot, and DO's is currently not great. IIRC, it takes clicking through a few screens of "are you sure your question isn't in our generic documentation? How about this page? No? This one then? Still no? You're really sure you need to talk to someone about this error? sigh Okay, fine then."
These systems should not be implemented as a barrier to reaching human support, but they often are.
In all my experience with support, I have been referred to a document that helped me with my problem literally zero times, because if something went wrong the first thing I did was Google it and so I already saw the unhelpful document.
> DO seems to have gone with the "hire cheap overseas support that almost but doesn't quite understand English" strategy, whereas the tier 1 guys at Linode have on occasion demonstrated more Linux systems administration expertise than I've got.
That's simply not true. There are support engineers hired around the world, and depending on when your ticket is posted, someone awake at that time will answer. DO is super remote-friendly and as a result has employees (and support folks) everywhere on the planet. Not "cheap overseas support" at all. There are a lot of support folks in the US, Canada, Europe, and in India, where they have a datacenter.
> That's simply not true. There are support engineers hired around the world, and depending on when your ticket is posted, someone awake at that time will answer. [...]
Going to guess from this tweet https://twitter.com/AntoineGrondin/status/113096281882239385... that you're currently staffed at DO. That's fine; I know two people who do great work on DO's security team. But it would be helpful if you could disclose this when you comment about your employer publicly so that readers don't have to dig up your keybase and then your twitter account to understand it.
But isn't hiring all over the world done exactly because it is cheaper for the same kind of talent? I'm sure the company doesn't do it out of the goodness of its heart.
Then, of course, there is no guarantee these people speak and understand English perfectly.
I would say it's not. There are many advantages to hiring remote workers; they've been discussed at length elsewhere. One advantage is not having to pay for office space, which indeed lowers cost. However, DO has a nice office in Manhattan, so really... they're not saving much money. And then in terms of compensation, for some reason DO pays its remote employees really well. I don't know how this has changed in recent years, but people in NA and EU are all paid handsomely despite being remote. I don't know about other locales.
The reality is that top talent, even if remote, is competitive whether they are in NYC/SFO/SEA or not. And DO has some pretty talented people on staff.
And then, having people in all timezones is definitely an advantage for 24/7 support. I'd say it's not negligible, and not an afterthought.
Now, about English fluency: it's only that important in English-native locations. And really, much of tech does not have English as a first language -- I certainly don't. So I'd say that encountering support engineers with imperfect English shouldn't be a problem for anyone, and it's definitely not a sign of cheap labor. In fact, I'd say bitching about someone's English proficiency in tech is kind of counterproductive, and I find it discriminatory.
Anyway, DO doesn't hire international employees to get cheap labor; that's a preposterous proposition. And with datacenters around the world and a large presence and customer base, it makes sense to have staff on board from many of these areas. And that staff might answer your tickets at night when they're on shift. Shouldn't they?
I can concur: at a company that is not DO, we hire workers in locations around the globe specifically to have people awake in their normal time zones, not because it's cheaper, because it's not always cheaper. There are many countries that have a large portion of very intelligent and multilingual people, especially when it comes to English. Just because someone isn't a native English speaker, or speaks in a different dialect, doesn't mean that they are any less capable.
I probably should have left the word "overseas" out of my initial comment, it gave it a flavor that doesn't match my left-wing multicultural globalist ideals.
That said, I disagree wholeheartedly that it's okay for support staff to not be completely fluent in the language they're providing support in, regardless of the language.
There is functionally no difference between trying to interact with talented support staff who aren't fluent in your language, and trying to interact with illiterate support staff. The end results are identical.
There are people who are very talented and very fluent in more than one language. Those people tend to be more expensive. So, many companies forego hiring those workers and instead hire others who are cheaper and "about as good". My multiple experiences with DO support have suggested that that's what they're doing.
As other commenters are suggesting, it may just be instead that DO is expecting its support staff to meet metrics that are causing them to spend only a minute or two per ticket and send out scripted replies.
I know many engineers who are not that fluent in English whom you would never contemplate qualifying as illiterate; you would quickly see that (1) they're encumbered by English and (2) are obviously extremely proficient technically, and literate.
People who would make gratuitous grammatical mistakes but have read more classics than the average American college graduate. I can easily count many just thinking of it.
You're arguing here against something I didn't say. You took one word from my statement -- "illiterate" -- and built a whole new argument around it which was never mine to begin with. I don't think you're doing it intentionally, I suspect it's just because you have a particular sensitivity on this subject. Either way I don't think I can say anything here that'll get a fair treatment from you.
> There is functionally no difference between trying to interact with talented support staff who aren't fluent in your language, and trying to interact with illiterate support staff.
That statement reeks of ignorance. It seems you have almost no experience with languages other than your own, or you would know that communicating while being non-fluent, or with a non-fluent speaker, works just fine most of the time. Sometimes misunderstandings happen and it can be a bit slower, but that is all.
> Sometimes misunderstandings happen and it can be a bit slower, but that is all.
So your position is that support that's a bit slower, with some misunderstandings, is exactly as good as fast support without misunderstandings, even in downtime-sensitive applications.
And Linode has always had faster CPU, I/O, and network, plus lots of small things that DO has only caught up on in recent years, like pooled bandwidth.
Although these days I tend to go with UpCloud; it is very similar to Linode and DO, except you can do custom instances, like 20 vCPUs with 1 GB of memory, spun up for $0.23 an hour. Compare that to the standard plans on Linode and DO, where 20 vCPUs with 96 GB of memory would be $0.72/hr.
I love Linode: their support is awesome, their CPUs on the "Dedicated CPU Plans" are great (by benchmarks, similar to GCP n1 CPUs), and their disk I/O bandwidth is just amazing. But I had to leave them because their network is not so reliable. It was a difficult decision, and I tried to return a few weeks ago (because I really love them), but again the network in London was just not so great.
I'd love to see the benchmarks you used. I've done a bunch over the years, mostly I/O focused because that's the most common bottleneck for the kind of work I do. While Linode does pretty well, especially compared to the cloud giants, DO has pretty much always come out on top.
It really depends on the underlying hardware of the Linode and DO VPS servers, as it can vary greatly, especially on DO, depending on the datacenter and region you end up in. Newer DO datacenters get newer hardware, so the difference compared to older DO datacenters is huge - benchmarks of the same DO droplet plan on different hardware: https://community.centminmod.com/threads/digitalocean-us-15-...
I've been on Linode for 8+ years now (moved there from Slicehost when Rackspace swallowed them up) and their service (not necessarily customer support) has significantly degraded. Not sure I blame them though. They've become far more popular since I started with them and are probably doing their best to grow... but I no longer recommend them as I used to.
That's how I feel about Scaleway. Scaling customer service is no easy task, especially for technical companies that require agents to have some understanding of the product.
Could you give some concrete examples on how their service has degraded? I've been using their service for years for light stuff, and I haven't had any problems.
So we host scores for a sport, as well as inputting those scores. Every year we have a few high-traffic events (much higher than normal), and I scale up our servers to support them. However, for the last two years there have been outages in their Newark data centre during both of these events. One time it was DNS; all the other times it's been data-centre wide.
I suspect as they've increased in popularity they've become a bigger target for DDOS attacks.
I've also noticed that in the past year there have been a lot of data centre outages... like every couple of months. It hasn't been a deal breaker for us, since our traffic is generally fairly low outside of the season, but the ones during the season really hurt.
Also I'd like to add that they do give you the heads up when there are issues, which is a big plus in my book compared to some other hosts.
I really do think it's just growing pains, and I don't mean to disparage them. Just being honest that I wouldn't recommend them for high availability services. Since I consider them a low budget host that's probably unfair though. We've just outgrown them is all.
If you are looking for high availability - Google Cloud Platform network is the best. In some other things AWS is better, but GCP network quality is awesome.
I don't really have a low budget alternative. I know that for our service we're evaluating both google and amazon cloud offerings, but only for our high availability services. I figure DO is in the same boat if not worse.
I'll throw in Prgmr.com. One of their owners, Alyn Post, is on Lobsters with us. They even donate hosting to the site. He's been a super-nice guy over the years. Given how cost-competitive market is, they mainly differentiate on straight-forward offerings with good service. So, I tell folks about them if concerned about good service or more ethical providers.
Second this: Linode for 15+ years here. I tried DO a few times (but never left Linode), and am now 100% back with Linode.
Linode even has an IRC channel (you can use a browser to access it). I rarely need support, but when I really need it, it is always fast, to the point, and available.
In both cases, if you dig into the context a bit, the story turned out that Linode wasn't fully disclosing the breaches to their customers until they were forced to do so when the news about them reached a certain volume. They also may have been -- almost certainly were -- dishonest about the extent of the damage and how it may have impacted their other customers.
At the time, their Manager interface was a ColdFusion application, which tends to be a big pile of bad juju. They started writing a new one from scratch after, I think, the second compromise.
The really bad thing here is that they got soundly spanked for being less than truthful the first time, and then four years later -- when they'd had ample time to learn from that mistake -- they did it again.
So there's a nonzero chance at any given time that Linode's infrastructure has been compromised and they know it and have decided not to tell you about it.
That's what prompted me to start exploring DigitalOcean more. Unfortunately, I've found that there's a far greater chance that I'll experience actual trouble exacerbated by poor support than that I'll be impacted by an upstream breach, so about half my stuff still lives on in Linode.
> So there's a nonzero chance at any given time that Linode's infrastructure has been compromised and they know it and have decided not to tell you about it.
This is entirely the roots of my distrust of them right now. Mistakes happen. Companies I trust demonstrate that they've learned from their mistakes. My tolerance for mistakes is pretty low when it comes to security related things, though. If something has gone wrong, let me know so I can take remedial steps. Their handling of both of those incidents suggested I can't trust them to tell me in sufficient time to protect myself.
Vultr is underrated too. I've had nothing but positive tech support experiences with them. Their weird branding turns people off but they do not seem fly by night. We have used them for various things for years with very few problems.
I've dealt with many Vultr instances on behalf of my clients, and I've had nothing but negative experiences with them. Unstable performance even on top-tier plans. Internal network issues that support keeps trying to blame their customer for. Nowadays when I find that a new or prospective client has been using Vultr, the first thing I recommend is to move off of Vultr.
When there's an issue with Linode's platform, they discover it before I do and open a ticket to let me know they're working on it. When there's an issue with Vultr, the burden of proof seems to be on me to convince them that it's their problem not mine.
I've had interesting issues automating deployment, to the point that my current build script provisions 9 VMs, benchmarks them, and shuts down the worst-performing 6. Some of their colocation facilities are CPU-stressed.
My current employer uses Linode, and yeah, they have pretty good support. However, I've been using DO for the last 5 years, and haven't needed to reach out to support once. But I've had to contact Linode support about 5 times in the last year.
Does DO have different levels of support that you can pay for like AWS? I like that system. You pay when you need it. You pay more if you need more support.
The difference in support DO vs Linode is probably due to DO being cheaper.
> The difference in support DO vs Linode is probably due to DO being cheaper.
What? Most DO and Linode plans have the exact same specs and cost exactly the same, and IIRC it was DO matching Linode. Although DO didn't seem to enforce their egress budget, while Linode does.
(I've been a customer of both for many years, and only dropped DO a few months ago.)
EDIT: Also, according to some benchmarks (IIRC), for the exact same specs Linode usually has an edge in performance.
Having used both for years, I’d probably recommend DO. None of the big security issues, but also I’ve found more downtime with Linode, don’t know if they’re upgrading their infrastructure a lot for some reason.
This headline is grossly misleading and very clickbaity ("Killed our company"). It's not exactly a big business if you only scale up to 10 droplets for short bursts; I am willing to bet their spend on DigitalOcean is less than $500 a month, yet the author is expecting enterprise support.
DigitalOcean should go the route of AWS, kill off free support completely, and offer paid support plans. Something like $49 a month or 10% of the account's monthly spend.
If you are a serious company with paying Fortune 500 customers, you need to act serious, pay up for premium support, and stop expecting it for free.
Well, not locking your account on false positives, and unlocking it when you ask, should be part of the free support plan of any company. No one pays to get locked out, support plan or not.
They’re working on that and were sending out surveys a couple of weeks ago whether customers would be interested. They had a slightly higher minimum amount in mind for premier support
You already get bumped to a higher tier of support once you hit $500/month in spend, at no additional cost, but that just gets you promises of lower response times.
That said, any company, especially one working with Fortune 500's, should have DB backups in at least two places. If they'd had the data, they could have spun up their service on a different hosting provider relatively easily.
DO has shown that their service is simply not suitable for some use cases: those that impose an "unreasonable" load on their infrastructure.
Even worse: they don't explicitly state what is considered "unreasonable". So, if your business is serious, you have to assume the worst-case scenario: DO can't be used for anything.
Conclusion: Digital Ocean is just for testing, playing around, not suitable for production.
> Conclusion: Digital Ocean is just for testing, playing around, not suitable for production.
I think that's always been the standard position most people take. DO, Linode, etc are for personal side projects, hosting community websites, forums etc. They are not for running a real business on. Some people do, sure but if hosting cost is really that big a portion of your total budget you probably don't have a real business model yet anyway.
I am of the impression that people rent cloud services because they can expense the cost to someone else, because of an inability to plan long term, or because of a need for low latency.
That's the kind of response you only send when you're convinced the customer is actually nefarious and you don't care about losing them. I wonder if there is any missing backstory here or if it really is just a case of mistaken analysis.
Used to work at Linode, so let's flip this on its head:
When the majority of the abuse work was people angrily calling and asking about fraudulent charges on their cards for dozens of Lie-nodes, you start putting caps in place to reduce the support burden and reduce chargebacks.
At the time at Linode, if it was a known customer, we could easily and quickly raise that limit and life is good.
I've always wondered how Amazon dealt with fraud/abuse at their scale.
I don't think DO was wrong here to have a lock, but the post lock procedure seemed to be the problem.
You can provide a helpful message with options for recourse without giving abusers "clues." These are not somehow mutually exclusive. By your logic it makes sense to punish a marginal element at the expense of the majority.
I think the major issue there is process and management related. The account should have been reviewed by someone with the authority to activate it, and it definitely shouldn't have been flagged a second time. But looks like DO thought the user was malicious, and issues raised by malicious users don't get much information. The response was horrible though.
Sure. Hopefully it results in a change of policy, or at least a public statement of some kind. Everyone can't depend on the cofounder to come in and save them from bad automation.
Agreed, but it's not like the original poster had a huge platform, he just posted about it on Twitter. I may despise Twitter for a bunch of different reasons, but I can't deny it's a great tool for raising issues to companies.
> Account should be re-activated - need to look deeper into the way this was handled. It shouldn't have taken this long to get the account back up, and also for it not be flagged a second time.
So... he doesn't address what is the scariest part to me, the message that just says "Nope, we've decided never to give your account back, it's gone, the end."
I think it's entirely reasonable for companies to have that option. "You are doing something malicious and against the rules, you have been permanently removed". In this case, that option was misused, but I don't think the existence of that possibility is inherently surprising.
Access to your data should never be denied. Ever. It was not DigitalOcean's data. If you are a hosting provider, you can't ever hold customer data hostage or deny them access to it in any way.
Again, I must disagree. If DO genuinely believed that you were doing something malicious and that data was harmful or evil for you to own (e.g. other people's SSN, etc) then they are in the "right" to deny access to it. DO should not be forced to aid bad actors.
And, regardless of what DO should or should not do, they can do whatever they want with their own hard drives. You should structure your business accordingly.
If DO believed that there was criminal activity (notice I am not using the word "malicious"), they should have reported it to the police, and in that case they might be justified in securing a copy of the data. Blocking access would be justified only in the most extreme cases (such as if the data could be harmful to others, e.g. pictures of minors).
If there is no police report, then they are trying to act as police themselves, which I think is unacceptable. It is not their data.
Your argument that they can do whatever they want with their hard drives is indeed something I will take care to remember — I definitely would not want to host anything with DO.
> If DO genuinely believed that you were doing something malicious and that data was harmful or evil for you to own (e.g. other people's SSN, etc) then they are in the "right" to deny access to it.
The observant will note the particular corner you're backing into here -- that a business might be justified in denying access to code/data being used in literally criminal behavior -- is notably distinct from the general and likely much more common case.
> they can do whatever they want with their own hard drives.
Sure. But to the extent they take that approach, Digital Ocean or any other service is publicly declaring that however affordable they may be for prototyping, they're unsuitable for reliable applications.
Businesses that can be relied on generally instead offer terms of service and processes that don't really allow them to act arbitrarily.
> ... a business might be justified in denying access to code/data being used in literally criminal behavior...
I agree. Look at the absolutism of the comment I am replying to. My whole point is that there might be some nuance to the situation.
> ...Digital Ocean or any other service is publicly declaring that however affordable they may be for prototyping, they're unsuitable for reliable applications.
Again, I agree. Considering how cheap AWS, Backblaze, and Google Drive are, it is completely ridiculous to depend on any one hosting service to hold all your data forever and never err.
At no point did DO ever believe this. This happened purely and simply because of usage patterns changing. It was done automatically and a bot locked them out. They should not be locking out data based on an automated script.
You seem to be accusing the aggrieved party of being a bad actor, when that is not the case.
For some practical, if extreme, examples: if a customer were to host a phishing site, or a site hosting CP, it would be grossly irresponsible (and likely even illegal) for the hosting provider to retain the customer's data after account suspension and allow them to download it.
And do what in the mean time? The legal system acts slowly. In the age of social media outrage, would you allow the headline "Digital Ocean knew they were serving criminals, and they didn't stop them" if you were CEO?
It's easy to be outraged when these systems and procedures are used against the innocent. That does not mean we should stop using rational thought. If someone is using DO to cause harm, then DO should (be allowed to) stop the harmful actions.
> Your account has been temporarily locked pending the result of an ongoing investigation.
You lock down the image, and let law enforcement do their thing. If law enforcement clear them, you then give the customer access to their data, perhaps for a short time before you cut them off as they seem to be a risky customer to have.
You don't unilaterally make the decision, you offload your responsibility onto the legal process.
>would you allow the headline "Digital Ocean knew they were serving criminals, and they didn't stop them" if you were CEO?
Seems to work just fine for AWS, Google and Cloudflare. In fact, counter to your argument, Cloudflare got in massive shit when they did decide to play God.
Reasonable to have the shutdown as part of the options, yes.
At the very least, they should also provide ALL, as in every last byte, of the data, schemas, code, setup, etc. to the defenestrated customers. As in: "Sorry, we cannot restart your account, but you can download a full backup of your system as of its last running configuration here: -location xyz-, and all previous backups are available here: -location pdq-".
Anything less is simply malicious destruction of a customer's property.
If you violate a lease and get evicted, they don't keep your furniture & equipment unless you abandon it.
That's probably a reason to use containerization / other technologies so that you can spin up your services in a couple minutes on a different cloud provider.
You don't need to use containers for that... all you have to do is set up a warm replica of the service with another provider. The failover doesn't even have to be automatic, but that is the minimum amount of redundancy any production SaaS should have.
A "warm replica" is going to cost money though, while containerization allows you to not have anything spun up until the moment you need it, and then have it ready to go minutes / an hour later.
That is patently false, unless you plan on starting from a clean slate in the new environment. Anyone who proposed such a solution as a business continuity practice to me would be immediately fired.
Containers solve the easy problem, which is how to make sure the dev environment matches the production environment. That is it.
Replicating TBs worth of data and making sure the replica is relatively up to date is the hard part. So is fail over and fail back. Basically everything but running the code/service/app, which is the part containers solve.
> Sure, a backup would have been a significant improvement, but still – a backup only protects against data loss and not against downtime.
Assuming you have data backup / recovery good to go, the downtime issue needs to be solved by getting your actual web application / logic up and running again. With something like docker-compose, you can do this on practically any provider with a couple of commands. Frontend, backend, load-balancer -- you name it, all in one command.
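To make that concrete, here's a rough sketch of the kind of compose file I mean (the service layout, image names, and credentials are placeholders, not anything from the original post); on a fresh VM with Docker installed, running docker-compose up -d brings the whole stack back:

    # docker-compose.yml -- hypothetical three-tier stack, adapt to taste
    version: "3.7"
    services:
      lb:
        image: nginx:stable
        ports: ["80:80"]
        depends_on: [frontend, backend]
      frontend:
        image: registry.example.com/myapp-frontend:latest   # placeholder image
      backend:
        image: registry.example.com/myapp-backend:latest    # placeholder image
        environment:
          DATABASE_URL: postgres://app:changeme@db:5432/app  # placeholder credentials
        depends_on: [db]
      db:
        image: postgres:11
        volumes: ["dbdata:/var/lib/postgresql/data"]
    volumes:
      dbdata: {}

Getting the data back into that db container is of course the part compose doesn't solve, which is where the offsite backups come in.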
> Containers solve the easy problem, which is how to make sure the dev environment matches the production environment. That is it.
>That said, any company, especially one working with Fortune 500's, should have DB backups in at least two places.
They should have, at the very least, one DR site on a different provider in a different region that is replicated in real-time and ready to go live after an outage is confirmed by the IT Operations team (or automatically depending on what services are being run).
I feel for these guys, but that's not "all the proper backup procedures". I'm part of a three-man shop and storing backups in another place is the second thing you do immediately after having backups in the first place. Never mind being locked out by the company - what happens if the data centre burns to the ground?
More realistically they would have done backups inside DO and would still be locked out. Not many people actually do complete offsite backups to a completely different hosting provider, getting locked out of your account is usually just not a consideration. It’s unrealistic to expect this of a tiny startup.
>getting locked out of your account is usually just not a consideration
How many horror stories need to reach the front page of HN before people stop believing this? Getting locked out of your cloud provider is a very common failure mode, with catastrophic effects if you haven't planned for it. To my mind, it should be the first scenario in your disaster recovery plan.
Dumping everything to B2 is trivially easy, trivially cheap and gives you substantial protection against total data loss. It also gives you a workable plan for scenarios that might cause a major outage like "we got cut off because of a billing snafu" or "the CTO lost his YubiKey".
> How many horror stories need to reach the front page of HN before people stop believing this
Sounds like the opposite of the survivor bias. I don't believe it's any sort of common (though it does happen), even less that "it should be the first scenario in your disaster recovery plan"
Even if the stories we hear of account lockouts isn't typical, the absolute number of them that we see -- especially those (like this one) that appear to be locked (and re-locked) by automated processes -- should be cause for concern when setting up a new business on someone else's infrastructure.
If you plan for the "all of our cloud infrastructure has failed simultaneously and irreparably" scenario, you get a whole bunch of other disaster scenarios bundled in for free.
Whether it's normally a consideration or not, there are no meaningful barriers in terms of cost or effort, so it's totally realistic to expect it of a tiny startup.
Every week there's another article on HN about a tiny business being squished in the gears of a giant, automated platform. In some cases like app stores this is unavoidable, but there are plenty of hosting providers to choose from. People need to learn that this is something that can happen to you in today's world, and take reasonable steps to prepare for it.
I don't know, it seems simple enough to me. I have a server on DO hosting some toy-level projects, and IIRC it took me 15-30 min to set up a daily Cron job to dump the DB, tar it, and send it to S3, with a minimum-privilege account created for the purpose, so that any hacker that got in couldn't corrupt the backups. I'm not a CLI or Linux automation whiz, others could probably do it faster.
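For anyone who wants a starting point, here's a minimal sketch of that kind of job in Python (the bucket, database name, and schedule are placeholders; it assumes boto3 is installed, pg_dump is on the PATH, and the AWS credentials belong to a write-only backup user):

    #!/usr/bin/env python3
    # Nightly backup: pg_dump -> gzip -> S3. Run from cron, e.g.
    #   15 3 * * * /usr/local/bin/backup_db.py
    import datetime
    import gzip
    import subprocess

    import boto3  # credentials come from the environment or ~/.aws

    BUCKET = "my-offsite-backups"   # placeholder bucket
    DB_NAME = "appdb"               # placeholder database

    def main():
        stamp = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M%S")
        dump_path = f"/tmp/{DB_NAME}-{stamp}.sql.gz"

        # Dump the database; buffering in memory is fine for a small DB like this.
        dump = subprocess.run(["pg_dump", DB_NAME], check=True, stdout=subprocess.PIPE)
        with gzip.open(dump_path, "wb") as f:
            f.write(dump.stdout)

        # Ship it offsite under a key that sorts by date.
        boto3.client("s3").upload_file(dump_path, BUCKET, f"db/{DB_NAME}-{stamp}.sql.gz")

    if __name__ == "__main__":
        main()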
We don't know the structure of their DB and whether failover is important or not, so we don't know if the DB can be reliably pulled as a flat file backup and still have consistent data.
We also don't know how big the dataset is or how often it changes. Sometimes "backup over your home cable connection" just isn't practical.
Cron jobs can (and do) silently fail in all kinds of annoying and idiotic ways.
And as most of us are all too painfully aware, sometimes you make less-than-ideal decisions when faced with a long pipeline of customer bug reports and feature requests, vs. addressing the potential situation that could sink you but has like a 1 in 10,000 chance of happening any given day.
But yes, granted that as a quick stop-gap solution it's better than nothing.
> We also don't know how big the dataset is or how often it changes.
I'm going to take a stab at small and infrequently.
Every 2-3 months we had to execute a python script that takes 1s on all our data (500k rows), to make it faster we execute it in parallel on multiple droplets ~10 that we set up only for this pipeline and shut down once it’s done.
Yeah, probably. But we shouldn't be calling these guys out for not taking the "obvious and simple" solution when we aren't 100% certain that it would actually work. That happens too often on HN, and then sometimes the people involved pop in to explain why it's not so simple, and everyone goes "...oh." Seems like we should learn something from that. I've gone with "don't assume it's as simple as your ego would lead you to believe."
I suggested that solution because everyone is saying "they're only a two-man shop so they don't have the time and money to do things properly". Anyone has the time and money to do the above, and there's a 90% chance that it would save them in a situation like this.
Even if they lost some data, even if the backup silently failed and hadn't been running for two months, it's the difference between a large inconvenience and literally your whole business disappearing.
> "2-man teams generally don't prioritize backups" isn't an excuse for not prioritizing backups.
They had backups, but being arbitrarily cut-off from their hosting provider wasn't part of their threat model.
Isn't a big part of cloud marketing the idea that they're so good at redundancy, etc. that you don't need to attempt that stuff on your own? The idea that you have to spread your infrastructure across multiple cloud hosting providers, while smart, removes a lot of the appeal of using them at all. In any case, it's also probably too much infrastructure cost for a 2-man company.
> In any case, it's also probably too much infrastructure cost for a 2-man company.
keeping your production and your backups in the same cloud provider is the equivalent of keeping your backup tapes right next to the computer they're backing up. you're exposing them both to strongly correlated risks. you've just changed those risks from "fire, water, theft" to "provider, incompetence, security breach"
So what is the purpose of the massive level of redundancy that you are already paying for when you store a file on S3? I don’t think it’s terribly common for even medium sized companies to have a multi tier1 cloud backup strategy.
Back in the day, we used to talk a lot about how RAID is not a backup strategy. The modern version of that is that S3 is not a backup strategy.
> So what is the purpose of the massive level of redundancy that you are already paying for when you store a file on S3?
You're paying to try and ensure you don't need to restore from backups. Our data lives in an RDS cluster (where we pay for read replicas to try and make sure we don't need to restore from backups) and in S3 (where we pay for durable storage to try and make sure we don't need to restore from backups), but none of that is a backup!
If you're not on the AWS cloud S3 is a decent place to store your backups of course, but storing your backups on S3 when you're already on AWS is, at best, negligent, while treating the durability of S3 as a form of backups is simply absurd.
> I don’t think it’s terribly common for even medium sized companies to have a multi tier1 cloud backup strategy.
The company I work for is on the AWS cloud, so we store our backups on B2 instead. It's no more work than storing them on S3, and it means we still have our data in the event that we, for whatever reason, lose access to the data we have in S3. Who the hell doesn't have offsite backups?
> Back in the day, we used to talk a lot about how RAID is not a backup strategy. The modern version of that is that S3 is not a backup strategy.
This is not remotely the same thing. A RAID offers no protection against logical corruption from an erroneous script or even something as simple as running a truncate on the wrong table. Having a backup of your database in a different storage medium on the same cloud provider protects from vastly more failure modes.
> Who the hell doesn't have offsite backups?
No one. But S3 is already storing your data in three different data centers even if you have a single bucket in one region, and you also have SQL log replication to another region. Multi-region is as easy as enabling replication but that is only available within a single cloud provider (I can't replicate RDS to Google Cloud SQL, only to another RDS region). I would guess that a lot of people use that rather than using a different cloud provider.
> This is not remotely the same thing. A RAID offers no protection against logical corruption from an erroneous script [...] But S3 is already storing your data in three different data centers
That sounds like...the same argument?
A RAID array stores your data on multiple physical drives in the machine, but offers no protection against logical corruption (where you store the same bad data on every drive), destruction of the machine, or loss of access to the machine.
S3 stores your data in multiple physical data centres in the region, but offers no protection against logical corruption, downtime of the entire region, or loss of access to the cloud.
You can't count replicas as providing durability against any threat that will apply equally to all the replicas.
Storing a file on two tier-1s would surely protect you from fire, water, and theft, no? Yet you will also be paying for all the extra copies Amazon and Google each make. I'm not disagreeing that this is the right strategy, just pointing out that the market offerings and trends don't support it.
> being arbitrarily cut-off from their hosting provider wasn't part of their threat model
Let's be fair: The threat model here is "lose access to our data".
This can happen in a number of ways, lost (or worse, leaked) password to the cloud provider, provider goes bankrupt, developer gets hacked, and a thousand other things.
Even if you trust your provider to have good uptime, there's really no excuse for not having any backups. Especially not if you're doing business with Fortune 500's.
Yeah I think this is what people are not getting. Redundant backups might mean "don't worry, in addition to backups on the instance, I have them going to a S3 bucket in region 1 and then also region 2 in case that region goes down," which of course doesn't protect from malicious activity from the provider. You certainly _should_ make sure you have backups locally available or in a secondary cloud provider but this is some hindsight.
As a startup, generally your secondary backup could literally be an external hard drive from best buy, or an infrequent access S3 bucket (or hell, even Glacier). No excuse, especially when "dealing with Fortune 500 companies".
Literally just push a postgres dump to S3 (or any other storage provider) once a night as a "just in case something stupid happens with my primary cloud provider". It'd take a couple hours tops to set up and cost next to nothing.
Most of the costs aren't from storage space, but compute power. We aren't talking about duplicating the whole infrastructure, just backing up the data. Disk space is dirt cheap.
Also, by "two places" I meant the live DB and one backup that's somewhere completely different. My wording may have been confusing.
They did have backups. That's why I assumed you meant double backups. If you do cold storage you should have 3 copies due to possible corruption. Sure, tape drives are cheap, but someone also has to run and check the backups.
I would say that it doubles the cost of backups, but using this math, we start with one copy plus one backup, and add a second backup; that means only a 50% increase.
This exact same thing happened to me last year. I accessed my account abroad and they perma banned me.
Support was useless and even with evidence did not believe who I was.
I then somehow convinced them to give me temp access, which in my opinion is even worse. They didn't believe me about who I was and then gave me temporary access to an account. DO can't be trusted when their support team could so easily be socially engineered.
Okay, if this is real, I am concerned. We have 40+ droplets with many clients. If anything like this happens we will lose our entire operation, as well as all of our clients' confidential ecommerce data.
Obviously concerning. Care to provide more details on this? I respect your desire to stay anon, but any details to add some color would be great. For example, did you completely leave DigitalOcean, and if so, where did you go?
This is probably going to get buried in the replies, but I had a similar experience with DigitalOcean about a year ago with my account getting permanently locked with very little explanation and no way of getting it back. It's still locked to this day. I was just a student using my Github student package credit, but I was pretty appalled by the service from DO and vowed never to buy from them.
Unfortunately, my ticket no longer appears under closed tickets. I was still able to dig up my original ticket message and all the responses their support made to me through my email though. Here they are:
Between the replies I asked about what I could do to verify my account. As you can see, they didn't even give me a single chance to do so. They told me to hang on twice then just permanently closed it up. I'm not sure how I even got flagged. All I did was turn on a droplet and delete it. I checked the audit logs, and there was nothing suspicious there either. It was just me logging in and out.
I thought about making a big deal of it on Twitter, but I didn't bother because I don't have any followers and it wasn't a huge loss to me either. Maybe that's the only way?
The emails from their "Trust and Safety" team are extremely tone-deaf...
"We've locked you out, no explanation"
"Sorry for any inconvenience"
Seriously? That last line is like a slap in the face.
No one should talk to a customer like that in this situation, if only because (a) if this is real abuse, you don't need to be "fake nice" and (b) if it's a false-positive, you've just come across as extremely smug when you're in the wrong.
If you're reading this and concerned for your own backup story, fret not! In 2019 secure off-premise backups are super easy to implement, even for a 1 person shop. Get something like Restic or Borg or any one of the enumerated options here: https://github.com/restic/others
I've recently implemented backups with Restic, the static binary and plethora of supported storage backends was extremely appealing. The easiest seems to be to just point it at a S3 bucket, but given most people have infrastructure on AWS (off-premise means off-premise) having other options supported out of the box is pretty handy.
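For anyone curious what that looks like in practice, a minimal restic setup against an S3-compatible backend is roughly this (the bucket name, passphrase, and paths are placeholders):

    export RESTIC_REPOSITORY=s3:s3.amazonaws.com/my-offsite-backups   # any S3-compatible endpoint works
    export RESTIC_PASSWORD=some-long-passphrase                       # placeholder; store it somewhere safe
    export AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=...            # keys for a backup-only user

    restic init                                    # one-time: create the encrypted repository
    restic backup /var/backups /etc/myapp          # run from cron; only changed data is uploaded
    restic forget --keep-daily 7 --keep-weekly 8 --prune   # retention
    restic check --read-data-subset=1/10           # periodically verify a subset of the stored data

The check with --read-data-subset lets you spot-check the remote repository without downloading the whole thing.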
> The easiest seems to be to just point it at a S3 bucket, but given most people have infrastructure on AWS (off-premise means off-premise) having other options supported out of the box is pretty handy.
Minio is great! I use it! But spinning up and managing yet another service when you're already a small shop adds more barriers to entry. Maybe find an already s3-compatible store (like Wasabi) or find something cheap and easy to spin up that's supported by the tool (like https://www.hetzner.com/storage/storage-box)
How do you verify your off-site backups? Do you periodically download the entire backup set and bit compare to the originals? That sounds like a lot of bandwidth usage, not to mention costs.
Given that the author was quite vague about the nature of this “pipeline” and that their product is an “AI-powered Startup Selection engine”, I have a suspicion they were probably crawling and scraping a whole bunch of pages for new startups. It’s possible that this was totally legit and it just looked like a ddos attack, or that it was something else entirely, but everyone here seems to have taken him at his word that what they were doing was actually above board.
Having been on the receiving end of terribly broken "pipelines" at startups wanting to hammer away at my resources, the right response is terminate first discuss later.
I know of a company that explicitly had a "call us to discuss first" clause in their contract with a smaller cloud provider. Everyone was on holiday and not answering the phones while their hacked account was being used to spin up dozens of boxes launching a DoS attack against a crypto scam site. Guess who had to eat the bill on that one?
In cases where something looks like a DDoS, they tend to disable network access to the droplets until you manually 'resolve' it, instead of destroying the whole account.
Or even better, have contracts with the companies. Maybe unlikely for them, but I think “scraping” is too often assumed to be “bad” in some way. The company I work for does a lot of web scraping, but we have contracts with our partners to scrape their websites. They may still have robots.txt that ask users not to scrape some areas, but we are allowed to bypass those.
They locked my account, without refunding the ~$200 balance, with no reason given except "We reviewed the account and found it matches unusual patterns associated with violations of our Terms of Service and Acceptable Use Policy." When asked, they would not reveal which terms were violated.
No warning was given, and no way to retrieve any data. Fortunately nothing essential was lost.
"It's interesting how many companies simply shut down service rather than say give a warning and wait for a response (or at least start a clock)."
I'm sure many people have started their companies firmly convinced that they'll give plenty of warnings and never automatically shut anything down.
The problem is, you rapidly discover that doesn't scale, not even on a human level. You send your notice. 48 hours later, you've gotten no response. If you act now, it isn't materially different from your point of view as if you simply acted right away.
Also, in a cloud environment, even Digital Ocean, as many people have learned the hard way with leaked credentials, you can rack up charges faster than the relevant humans can even conceivably be notified. As the hosting company, you can't just let abusive or accidental usage go. You can refund their money, but that's still resources of yours that went to something that failed to produce revenue rather than something that did; you can't absorb that indefinitely.
I'm pretty sure you'll inevitably discover that you have no choice but to put automation in.
This is exactly why AWS has relatively low default account limits, and you have to open a support ticket to raise them. It's largely to prevent run-away costs from surprising the customer.
I accidentally left a 24xlarge instance running for a month without realizing it, and they looked at the activity and were totally cool about zeroing the bill for that instance for the month. They basically gave us a $2000 credit.
It does probably help that I said I would be careful not to do that again and had already put in a CloudWatch Alarm to automatically power-off the instance after a set period of idleness before filing the ticket.
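For reference, that kind of guard can be a single CLI call; the instance ID, region, and thresholds below are placeholders (this version stops the instance after roughly six hours under 5% average CPU):

    aws cloudwatch put-metric-alarm \
      --alarm-name stop-if-idle \
      --namespace AWS/EC2 --metric-name CPUUtilization \
      --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
      --statistic Average --period 3600 \
      --evaluation-periods 6 --threshold 5 \
      --comparison-operator LessThanThreshold \
      --alarm-actions arn:aws:automate:us-east-1:ec2:stop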
There have been so many stories of AWS accounts being “hacked” (actually they weren’t. someone posted their keys to a public github repo), the person panicking, then sending a ticket to AWS and then getting a refund. AWS support is excellent - especially on the business tier and above.
I will gladly pay the extra money for AWS than to even think about DO or even GCP for a money making project.
But more on topic: with Aurora/MySQL you can have an on-site hosted read replica from an AWS hosted database. That would be a cheap, easy real time backup solution if I were really worried about AWS screwing me over.
The good will generated by the stream of customer testimonials of this process we hear about is priceless.
The proposition seems to go something like this: it's a new thing, mistakes are statistically expected, you make an honest one and plead "oops!" and we refund you, no doubt pointing you to resources on best practices and account throttling. As long as the customer takes the lesson to heart, everyone wins.
This. It's fairly easy to set up from the provider side and easily solves this problem. Rate limits are straightforward and can be automated based on criteria like account age, payment history, and abuse incidents. I think disabling the account is a little heavy-handed unless it's brand new.
For the customer, there is a big difference. Hitting an API to send an email and a text 48hrs before shutting down services is a common courtesy and easily automatable.
The host should throttle resources in the interim if it's at risk of running up massive bills or adversely impacting other clients. None of this is breaking new ground; there isn't a good reason for large hosts to act like shit.
Your point is reasonable in many cases. But in this particular case, the charges would not have been substantially bigger than what this company paid before, and an automated system ought to take that into account.
If I’m a $10/month customer, kill my account early, and it’ll save me more often than not. If I’m a big spender, maybe wait a bit longer.
DigitalOcean is operating worse than a fly by night host (like AlphaRacks, GreenValueHost, etc). The reasonable course of action would've been to email the customer and throttle their API access to prevent load spikes, but DO instead locked their entire account (not just the service that DO felt was being abused).
A fly by night will often only suspend the VM or database that is in question, not other services on the account (having been in that position before).
I saw a great deal on LEB for a KVM VPS from Alpharacks and signed up for a 2 year plan (my first mistake).
When SSHing in to the VPS it didn't have the advertised specs, and when I raised the issue with their support, they eventually fixed it..
Then I realized the second problem, they gave me the same IP address as someone else. You could still use the web VNC console, and as soon as you made an outbound network connection, inbound connections would work... for a few seconds... then SSH would drop. Reconnecting by SSH says "host key changed" i.e. you hit someone else's server sharing the same IP address. Using the web VNC console, works again for a few seconds, drops again.
It took about 7 days of arguing with their support to explain these two problems to them... by which time, the 3-day refund window had expired...
Lesson learned about low-end boxes. I have had about 50/50 good and bad experiences across half a dozen providers like this, which doesn't really make financial sense overall.
I recommend everyone stay far far away from AlphaRacks. If anything remains of them after this week.
My point was focused on how shithosts handle abuse, they'd suspend your container, but any other products or services you have would remain available, your account generally wouldn't be locked (short of disputing a payment).
Not in my experience they don't. I use OVH and nfoservers and I've had an issue like this exactly once on both hosts.
On OVH one of my servers was hacked and running typical scripts that are run once that happens (port checking, common admin credentials, brute force attempts, etc)
They cut off all internet access to and from the server and sent me an alert stating what was happening and that I needed to VNC into the server, resolve the issue, and let them know how/why it happened and how I resolved the issue. Once that was done they just removed all the blocks on the server and we all went on our merry way.
Edit: To clarify the VNC console is on their site, not a remote connection.
I run the IPFS daemon on Hetzner, and it was trying to connect to local IPs because of some misconfiguration. They sent me an email saying my server was portscanning their LAN, and I should fix it and email them how I fixed it.
I didn't know what they were talking about so I replied saying that, they helped me shut down the local port connections and I never heard another complaint from them. There was no downtime or banning at any point.
Hmm, I've done lots of stuff on OVH that other hosts consider abusive, with nary a word of complaint from OVH.
OVH seems to have this whole automation thing down cold, monitoring boxes for dying components and letting me run hog wild on their VMs without limit...
The one time I used their phone support, the guy I got was fairly helpful. They seem like a hands off company overall.
Far from flawless. They did it to me and it was dropping half of the DNS requests I was receiving, so my websites were down. And you practically have to beg them to take you out of DDoS protection.
This doesn't seem unreasonable, OVH is not advertising DDOS protection (nor do they seem to be structured to offer it). Some hosts will can your VM and throw out its data, which is a much worse outcome.
A machine on a GCP project got compromised. GCP emailed us right away and we fixed the issues promptly. No outages, no arbitrary suspension, and no grief apart from the compromise.
Developer has a Python script that takes 1 second per record to execute and 500,000 records to process, so he spins up 10 distinct VMs each running the same Python script to parallelize the task.
The provider shuts him down and cites a section of the EULA that says "You shall not take any action that imposes an unreasonable... load on our infrastructure." Basically saying "Hey whatever you did, don't do that."
Developer gets his account restored and then proceeds to do exactly what he did to get it locked out the first time around.
Also Developer has all his eggs in one basket.
Shitty customer/product service aside, someone explain to me why DigitalOcean is at fault here?
Developer receives "k, we've restored your account", which sounds an awful lot like "what you're doing is fine".
Developer gets shut down again despite having explained the behavior as requested (i.e. the explanation is on file) and despite the explanation being considered sufficient by DO.
The "with the details you provided, we've removed the hold" e-mail is hard to interpret differently than "sorry for the misunderstanding, this is fine", especially as the initial e-mail asked them to explain to "ensure your account is not subjected to additional scrutiny or placed on billing hold". If they meant "ok but don't do that again" they should have stated so.
My guess is that the account suspensions are an automated monitoring process and not someone sitting in a NOC monitoring activity.
Think of that automated monitoring like a circuit breaker in an electrical panel. If you plug an air compressor into a 15 amp circuit and it pops the breaker, once you flip the breaker back on, would you run the compressor again? No? Then why would you think it okay to perform the same activity that tripped up the automated system?
The OP seemed to be aware that spooling up 10 VMs to do whatever he did was what got him booted, so... why didn't he reach out to DO to find out exactly what alarms he tripped and then take action to either get his account whitelisted or modify his process to ensure it didn't trip the automated alarms again?
He just got his account unbanned (flipped the circuit breaker) and fired his job back up (turned the compressor back on).
Why would DO be upset about spinning up 10 VMs then spinning them down again? Isn't this exactly the point of cloud providers? This is what they bill me for, right?
Smaller VPS providers like Linode or DO oversubscribe like crazy. Last time I used Linode, they would email us telling us we're using too much CPU or memory, and we'd need to move to a larger tier VM.
I think you misunderstood those emails. They are just there to help you if you didn't realize some process was stuck or something, they specifically say "This is not meant as a warning or a representation that you are misusing your resources." and you can also change the value that triggers those emails or disable them completely.
My gut feeling agrees with "I don't trust the Dev's explanation tbh." Companies and developers often try to get away with doing Bad Things, and cry wolf publicly without offering up specifics about what they were trying to get away with. "Spin up 10 VMs for 500k rows of data" offers no explanation to just what those 10 VMs are doing. There is a big difference between "using memory and cpu" and "saturating the network in abusive ways".
Random speculation of one possibility: each of those 10 instances was suddenly doing something unexpected and spammy with the network. Maybe sending 500k+ emails (one per row of data claimed by the developer) over SMTP in a very short period, or jumping to massive spikes in torrent traffic, or crawling sites to scrape data (maybe each of the 500k rows is just a top-level domain name, and they crawl every URL on those domains, possibly turning 500k rows into hundreds of millions of HTTP requests).
The postmortem will be interesting. If DO is truly at fault here, that email after the second lockout saying the account is locked after review, no further details required... bad.
The developer is a 2 person team. Why would they use multiple clouds at that stage?
Additionally, if 10 spun-up VMs are considered an “unreasonable” load on DigitalOcean infrastructure, I shudder at the thought of building anything on the service. Does DigitalOcean even define “unreasonable” in their terms, or is it kept vague?
My understanding is that it isn't the 10 VMs so much as the resource usage (my suspicion is that DO runs a lot closer to the margin than larger providers, so they police this more). So they probably pegged all the CPUs at 100%. (Perhaps a message-queue approach would have been easier on the resources.)
`unreasonable load` sounds pretty vague. What counts as unreasonable? 10 VMs doesn't sound like much, and I believe if I'm renting a VM with XYZ specs, I should be allowed to use up-to max capacity it says so in specs. What am I missing here?
Except it wasn’t “whatever you did, don’t do that.” It was, “this looks weird, can you explain?” Followed by “OK, sounds good, carry on.” Then they got shut down.
I am a software developer and yes software is eating the world. One of the side effects of software eating the world is out of control software.
* Autobans in facebook
* Cheated instacart drivers
* $10000 stolen from thousands of bank accounts (and returned hopefully) on Etsy
* Tesla cars literally killing people (it now feels like it's once a month right?)
Now the software that runs software is running amok.
The interesting thing about software is that it runs very quickly and it acts as a giant lever that affects the entire world.
You can think of it like a giant airport suddenly being installed in your back yard and just start having planes take off, changing your $300k investment into a $120k valued house overnight. That's how quickly software is changing the real world.
I know there is at least one HN reader writing a book on it. But I would love to see more books on how the internet, and software, is messing our world up.
This reminds me of recent talk by Jonathan Blow [1], where he talks about how we've made very little progress in the field of software and anything that appears to be progress is just software leveraging better hardware.
It's quite scary how low our standards have gotten.
I agree with you, and I came here to find similar-thinking folks in the comment section; the most worrying thing is that I haven't found any except you. Most people seem to take it as a normal thing that software simply ruins lives and businesses. Today's developers simply have no ethics. Also, most developers seem to think that they can do whatever they want and get away with it, because y'know, "reasons". Sigh. :(
I really don't understand this sheeple thinking. For example, most people simply don't understand that an automated fraud detection system is not a technologically important thing; it's effectively an economically important cost management system.
Just like companies externalize the cost of helpdesk personnel by operating an automated call center (and by proxy making the customer bear the cost), the goal is the same with fraud systems. But we cannot just simply throw our arms up in the air or shrug our shoulders when the companies leverage our lives this way.
Take Facebook for example, they acted like they had no responsibility or any power to review and take down or prevent toxic and/or hateful comments by employing human reviewers until they were forced to do so in some countries. And guess what, they had no trouble doing so, their profit might have reduced somewhat, but not that much.
So all in all, anyone who thinks he has integrity as a developer should take a look into himself when he justifies systems like fraud systems (or any other unnecessary cost reducing actions) as necessary. They are economically beneficial, yes; necessary, no.
Not even surprised by the way Digital Ocean have handled this. They pulled something similar on me back in 2014 at a previous company I used to work in. They essentially shut down my account and did not even let me get my backups out.
It seems like there should be a middle-ground between all-on and all-off. If I'm paying customer, I should be able to access my account in some capacity even if some abuse related issue closes off server access.
Why is killing accounts part of the way they do business in any way, EVER?
That's what destroys company reputation.
I may be wrong but my understanding is that the gold standard - Amazon Web Services - will only ever suspend your account until an issue is sorted out.
Whoever runs Digital Ocean needs to stand up and say very loud and clear to this community that they will never, ever delete accounts - if he doesn't, then he can live with the business-destroying reputation of Digital Ocean being "one of those account-killing companies".
What company would ever host on "an account killer" - the risk is way too high.
> Why is killing accounts part of the way they do business in any way, EVER?
Because fraud and abuse exist.
Sometimes customers really are doing malicious things which need to be stopped immediately, and the only thing that will make that happen is disabling their account. Trying to make accommodations for those users is a fast path to getting sued, getting blacklisted by mail providers, and/or losing your upstream connection entirely.
I thought Fortune 500 companies (at least the enterprise companies I dealt with) had checklists for their SaaS vendors that required things like Disaster Recovery readiness to be checked off.
At the very least this company learned a hard lesson about Disaster Recovery best practices. Hopefully all the up and coming companies reading this story learns as well. Also please remember that a backup that isn't tested IS NOT A BACKUP! I've been in so many situations where backups were corrupted, so part of the disaster recovery is to test the backups and make sure you can really recover.
There are a million different scenarios where their data could be lost and it not be Digital Ocean's fault. It's the company's responsibility to have protected their customers from this.
I worked at a company with no real Disaster Recovery plan. I was told that "we can get the servers up and running within 18 hrs if we had an outage", which not only was absurdly slow but probably an underestimate. Only by the grace of God did we not suffer a real outage but if we did, it was totally the VP's fault for not addressing my concerns.
My personal experience with such questionnaires is that the questions can be vague, redundant, or not really ask anything at all. But even when the questions are concise and precise, it's all too easy to "word-smith" an answer that seems to give the desired result without ever actually saying anything. And the person administering said checklist may not even know enough about what the checklist is checking to vet the answers.
Not even giving the account a warning is honestly enough for me not to go with DigitalOcean for even small projects in the future. Their prices are not that good for fly by night VPSs if that's how they're going to act.
This reinforces my gut feeling that companies like DigitalOcean and Scaleway are fine for hobby usage, but for any serious business operations, I'd go with AWS.
Well if anything else this case just made that abundantly clear. Locking a customer resources without warning, and not allowing access to their data, is unacceptable, I'd argue even for hobby services. There's a lot of VPS providers cheaper than DigitalOcean, I don't believe they have room to act like this.
There's a pretty hard cap on the level of redundancy you can do with a two-man company, as I assume a two-man company does not bring in a lot of money.
The number of employees shouldn't be the deciding factor when you are a tech company that apparently has fortune 500 companies as customers.
I'm not talking 5 9's redundancy. I'm talking grab a backup once a week or something, anything, to help mitigate a scenario like this. According to the thread, they lost ~1 year of data. That should be unfeasible to a company serving customers, let alone Fortune 500 customers.
Disaster recovery planning is key for a technical company to succeed. It is clear they never considered a scenario where their DO account would be closed/compromised/down.
>It is clear they never considered a scenario where their DO account would be closed/compromised/down.
I don't think the chance of DigitalOcean automatically freezing your account to a point where only a co-founder can do something about it has been well publicised.
In all practicality, DO freezing your account has the same effect of DO being down (or closing, etc.), or your account being compromised and you being locked out of it.
A contingency plan should ideally have been in place for a scenario where, regardless of root cause, you have lost access to your DO account.
Sure, but them closing combined with the chance of them freezing your account (feasible, considering the topic here) and the chance of account compromise, and the chance they go down for extended maintenance... It is inexcusable not to have a disaster recovery plan for the scenario where you cannot access your DO account.
Imagine you are a customer of this company. Would you be rallying to their defense, "backups aren't needed because the scenarios are unlikely", or would you be angry that the company had zero contingency planning and lost all of your data (or the data you rely upon)?
If you can honestly say, as a (hypothetical) customer of the company in the thread, that you wouldn't care if a company you relied upon has no disaster recovery planning, more power to you. I, however, like to make sure that the companies I'm relying on have some sort of contingency that protects me as a customer.
Plenty of one-man companies have Fortune 500 customers. Most are run like the startup in question, as basic disaster mitigation is overhead. Don't expect the world from one guy keeping a company afloat, unless you enjoy disappointment.
From my experience, you don't have Fortune 500 companies as customers unless you are a >50-person, >$5M-revenue company.
You may have employees from such companies paying you via their business credit card, but will never get through real procurement without documented SLA procedures which are required to prevent a scenario exactly like this.
Yeah maybe, but not having an offsite backup is asking for trouble. They just failed to consider that "their site" == "digital ocean's cloud", and that since it's someone else's computers, they could easily be locked out at any time for any dumb reason including this one.
An expensive lesson in taking and checking backups regularly.
There are certainly a lot of limits on what you can do with such limited resources, but a reasonable backup with a different provider is certainly doable at that small scale.
It won't be entirely up to date when the worst case happens, you'll be unavailable and you'll probably have lost a day of data or so, but you won't have lost everything.
Rclone to AWS or Google or whatever is easy to set up, add a daily dump of your database to the folder you back up. Unless you handle a lot of data, costs are probably not a big factor.
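A sketch of what that crontab might look like, assuming Postgres and an rclone remote already set up with rclone config (the database name, paths, and remote/bucket names are placeholders):

    # Nightly: dump the DB into the backup folder, then mirror the folder off-provider.
    30 2 * * * pg_dump appdb | gzip > /var/backups/appdb-$(date +\%F).sql.gz
    45 2 * * * rclone sync /var/backups offsite:my-backup-bucket/db

Note that rclone sync mirrors deletions too, so use rclone copy or add some retention on the remote if you want older dumps kept around.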
True, but it isn’t hard to add S3 as an additional backup destination. And ideally, if they had their infrastructure defined as code, recovery would have been possible.
Are there guides to this that take you through good configs for smaller environments like this? i.e. you have a postgres DB and two web servers. What's the simplest backup process and how do you replicate to another DB VM over on another VPS provider securely?
Stuff like this is relatively simple when you are trying to learn it, but keeping it operational is hard at a small scale. Is the lesson really to use PaaS until it's viable for you to be running a small K8s cluster, or an equivalent fleet? That seems really expensive compared to VPSes, but having better guides might help.
I think a real takeaway here is to avoid going all in on one hosting provider. You should always have "off-site" backups for mission critical data.
Yes, it sucks that DO did this. But this is hardly the first time someone got screwed over by some poor AI automated security. Backing up your data to backblaze or AWS would be the cheapest insurance policy you could buy.
I know this doesn't help you now, but you may want to consider distributing your site or setting up DR sites across multiple VPS providers. If your application supports it, you may even want to consider using a DNS provider that can do health checks and fail over the site for you.
Heh, that reminds me of when a director of one unit of a "world's top 10 brand" I was working at on big data architecture told me that they always knew when I was running anything on the cloud, because I brought it down within a few seconds of starting my heavily multi-threaded processing scripts. I guess DigitalOcean decided that instead of fixing their infrastructure they would just kill off their own smart clients.
I find this enforced mediocrity pretty appalling. With barely functional "anomaly detection" Deep Learning models with dubious decision making (I did some so I am familiar with the "landscape") it's gonna be a lot of fun for anything slightly deviating from whatever vague norm that can't be explained nor tested against.
> Every 2-3 months we had to execute a python script that takes 1s on all our data (500k rows), to make it faster we execute it in parallel on multiple droplets ~10 that we set up only for this pipeline and shut down once it’s done.
Wait, a whole distributed computing sub-system to make a 1s process faster?
> I got their final message right after arriving in Portugal.
Did he initiate the script from an IP address outside France?
Though that might explain why DO's fraud detection was triggered, it doesn't excuse their actions. Send an email first, jeez.
I have happily used DO for small things, like hosting a Ghost blog, or running an Algo VPN, but I am a little surprised to see people doing bigger infrastructure with them - not that they deserve to lose everything for making that choice, but it seems like it would have been clearly riskier than AWS/GC/Azure?
I'm surprised that people put significantly more faith into GCP, AWS, Azure etc... similar things have happened in lots of scenarios. Not to mention when registrars have taken down sites.
This is a shame, and IMHO it sucks a lot. One of my biggest points of paranoia is backups, and scripting the restoration of a site on another provider should the worst happen.
Getting ready to launch something barely more than a hobby and was planning on DO because the hosted Postgres and a small K8s cluster is significantly less for what's there than the alternatives. Frankly, I don't want to go from ~$100/month to over $200 for another provider for something that likely won't lead anywhere.
>We lost everything, our servers, and more importantly 1 year of database backups. We now have to explain to our clients, Fortune 500 companies why we can’t restore their account.
I think what you have to explain is why there wasn't a contingency plan, with your own servers, colocation, another cloud offering, etc...
When AWS has gone down in the past, it's severely impacted massive tech companies like Netflix and Spotify.
Why would there be an expectation that a 2-man shop have "another cloud offering" as a contingency plan when some of the biggest and best tech companies do not?
People use services like AWS or DO because they are the contingency plan - they have the size and scale that smaller companies cannot afford or implement.
The difference is that when AWS goes down, Netflix/Spotify still have backups and could adapt infrastructure if the outage involved permanent data-loss. You're talking about the people who built https://github.com/Netflix/chaosmonkey
I'd argue that it should be _easier_ for a 2-man company to adapt to cloud service outages, as they likely don't have to keep up with nearly as many backups or moving parts.
Another practical difference: When AWS goes down, it makes the news outside the tech bubble too. Customers are much more likely to forgive you, if you send a link to NY Times that says "Amazon is suffering a global outage, affecting tens of thousands of companies"
So pretend they had offsite backups. That's a separate issue from an entire contingency plan. The ability to adapt is not the same thing. This company could certainly adapt to a new host if they had an extra backup.
The ability to adapt is the definition of a contingency plan. It's essentially, "If this person/service/database/customer/etc vanished off the face of the earth, what do?"
Their entire business was completely reliant on DO droplets. It doesn't take much foresight to think, "hey, I should probably make a backup in case this VPS goes down."
Nothing in this comment thread, or the OP twitter thread mentions anything about the rest of this imaginary contingency plan of theirs.
Coldtea said they need to explain their lack of contingency plan wrt "servers, colocation, another cloud offering, etc...".
PostPost said they didn't, that even huge companies don't have contingency plans.
I agree with PostPost, and I'm trying to figure out which one you agree with.
If you define being able to adapt as a contingency plan, well, I have confidence that this company is fully able to adapt! Their architecture is small and pretty easy to move. The only problem is a lack of external backup, which will be remedied very soon, and once that happens they could easily shift to another service even if DO re-disabled their account.
So that would mean you agree with PostPost. But you don't seem to agree at all.
I'm struggling to reconcile "The ability to adapt is the definition of a contingency plan." and "this imaginary contingency plan of theirs". If you demand a preexisting written plan then that means you're not accepting "the ability to adapt" as a valid answer at all.
Wow. well, I will avoid DO at all costs for the rest of my career. There is a reason Enterprises trust AWS, GCP, Azure, and pay for premium support (and startups should, to the best of their financial abilities). He needs to sue or negotiate to recover material damages caused by this.
Best Twitter comment (grammar errors and all): "And if your full business relies on one tech partner (no offsite backups) your not doing your tech job right."
This is conflating two different things. One point is valid, the other is not.
- No offsite backups? Agreed. Even for a two person team it is sloppy.
- "Relies on one tech partner?" Strongly disagree. Even large enterprises often have a hard dependency on AWS, Azure, Rackspace, or similar. To suggest that a two person team should have deployment plans for multiple independent cloud vendors is just fantastical thinking with no basis in reality.
Plus if they did over-engineer it by making it cloud agnostic, setting up accounts to sit dormant, cold instances elsewhere, etc, people would just criticize them for that inefficiency/wasting time.
Some may have an availability dependency on those services, but if they don't have a full BC and DR plan ready to go within a few hours of losing those services, they're not going to be a big enterprise for long.
Are you talking about a small WordPress site or something?
Very, very few tech companies could simply move everything to a new cloud provider in a few hours. I would even hazard a guess that almost none can.
I have all my infrastructure as code and can break it all down and spin it back up in kubernetes clusters in minutes. But due to the quirks of each cloud provider, there are tons of little fixes that would inevitably need to be made.
Not to mention that many companies have way more data than could even be copied over in a few hours.
The topic is having a single tech partner. If the partner is large enough you can have a high level of redundancy by utilizing their cross-zone/geographically distributed services (e.g. US East, EU West, Asia Pacific, etc).
How you got from "single tech partner" to "have no disaster recovery plan" I don't know.
Maybe. But don't forget this is a small company of just two people. Yes the backups should have been off site but relying on just one digital partner at that scale isn't the worst (they're unlikely to have the money or time to federate out to other services).
Yes, ideally they would have already tested their backup solution, the backups would be offsite and, if something like this happened, they could stand up on another provider. But that ignores the reality of them being a super tiny business. Almost no one at that size is going to do that.
I disagree. Many 1 and 2 person shops implement proper off-site backups because they understand that losing their data is a death sentence for their business. Proper off-site backups are neither expensive nor time consuming to establish these days. There is no good excuse for even the tiniest of companies to not implement them.
Actually, you can implement cheap external backups even in a 1-man company.
But I still maintain that DO is 100% liable toward their client here. The liability between said client and their own clients is another matter.
Indeed. I'd urge everyone here to pay attention to how many sites/companies are dead in the water the next time there is a full-blown AWS outage, and see if they are as quick to level the same criticisms at those Fortune 500 companies as they are at this two-man operation.
What happened with Digital Ocean is inexcusable, and has potentially dire consequences for two individuals' livelihoods. In the immediate aftermath of such an event, focusing on the devs' perceived lack of disaster preparedness seems petty.
I agree. Who are all these people that pull backups, which may be GBs or TBs in size, for offline storage? How does that even work in practical terms in disaster scenarios like this, where resolution times are expected within hours and not days?
It doesn't. I would hazard a guess that there are zero medium to large companies out there right now that could swap to a new cloud provider in a few hours.
On the other hand: Get customers and traction before you build a multi-site, fault-tolerant, self-healing, webscale platform that Google would be proud to have.
I think we needlessly shame one-person operations for focusing on actual customer needs instead of ops busywork and yak shaving.
No one that I've seen so far is saying they should have a system that is "multi-site, fault-tolerant, self-healing, webscale that Google would be proud to have".
They are saying run a simple backup and keep it literally anywhere else.
I think we needlessly hyperbolize "do a backup once a month and keep it somewhere else" into some sort of NSA operation.
It's backups. It's 2019. It's dead easy and very affordable.
I agree. Putting all your eggs in one basket these days indicates poor decisions or risk assessment.
> Digital Ocean "Trust and Safety"
Does this phrase give anyone else the heebie-jeebies? They deliberately locked his account without looking at his metrics over the previous months, which would have shown this was obviously not a concern. And why?
Things that have crossed my mind:
Was their automated system poorly tuned?
Was this deliberately initiated? If so, why?
Ideally, partners should be trusted (and trustworthy); in practice, they aren't.
Though trusting DO/AWS/GCP, etc. is much more reliable than, say, betting your whole business on somebody's proprietary API (an FB game, a LinkedIn API integration, etc.).
You're correct. But there are also unicorns that don't follow this rule. It's just that their compute activity would never trigger a false-positive, so everyone (except their ops team) is blissfully unaware of their fragility.
That was the stupidest comment to me. This guy is desperate for help and this fart-sniffer is finger-wagging at someone he doesn't know online with unnecessary platitudes.
The timeline is interesting in itself: This story was posted 4 hours ago on Twitter, 3 hours ago on HN.
The cofounder picked it up 3 hours ago. DO responded and apologized from the official account 2 hours ago, claiming it was fixed, and has been actively responding to people tweeting at them, doing damage control, for about an hour, promising a public postmortem.
While it's sad that a social media escalation was needed (and it confirms that getting attention on social media is the only effective way to resolve hard issues like this), the response after that was quite fast. Let's see how well and how quickly they deliver on the postmortem.
A few years ago I used to run a VPS on DO with a mail server, VPN and some code I was writing. Once I was done with everything, I used their snapshots feature to backup my VPS and shut it down.
Two years later I wanted to restore the VPS, but it turned out my snapshot had become "outdated" and they'd stopped supporting the format for restoration... Support was completely useless and wouldn't even let me download the snapshot; they said at most they could mount it into a new VPS and I could recover the data myself.
That seems reasonable. Booting outdated, security-bug-ridden software seems dangerous, so mounting it into a new VPS so you can copy off the important files seems an entirely reasonable access method.
That last response from them is pretty damning. It's like what happens when your customer success team turns off its brain and just applies blanket rules to everything.
Yikes, just when I thought their Kubernetes and managed DBs were looking attractive...
Strong anti-recommendation for this advice. The cost in dev time plus keeping things running is not worth it. This is a rare event. Just use a more well-established cloud service.
That's generally good advice, but I think it's important to keep backups and other disaster-recovery stuff somewhere else. Heck, just buy a NAS box and sync things nightly.
We do a fair amount of business with DO to the tune of about 40+ droplets used together and separately for various tasks.
While we could certainly survive the loss of these assets, the recovery would be long and costly.
So I would certainly say that this story gives me a great deal of pause and will take up some mental space this weekend as I think about future dealings with DO.
We lost everything, our servers, and more importantly 1 year of database backups. We now have to explain to our clients, Fortune 500 companies why we can’t restore their account.
And yet, the explanation is very simple:
Because you neglected basic principles and elected to put all of your backup eggs in one basket.
It's always sad to see these stories because they always seem like a preventable tragedy. DigitalOcean, Azure, and AWS seem like the go-to for start-ups these days, instead of self-hosting your stuff at home or even in a colocation space. Even though it's a "dirty word" these days, on-prem does have its huge benefits.
Professional stuff is one thing, but that's not to mention anything personal - anything I care about I won't put exclusively on someone else's computer. I want to have absolute control over as much of my stack as I can. Seems really scary that some company has control of your entire infrastructure and can ban-hammer you without notice, permanently, at any time of day or night (or while you're on vacation).
It's not just the volume of usage that can indicate fraud, but the pattern. In addition, relying on quotas creates a system that is easily gamed by perpetrators of fraud.
Caps or quotas are not sufficient to deal with this problem.
If a cloud service is supposed to be ready for production, then customers should be safe to assume that they will not simply be shut down, especially not without warning. Otherwise, the provider must make clear that the service is only for hobby use and not for commercial use.
What kind of fraud do you have in mind and do you know of a case in which, for example, Azure switched off an enterprise customer due to unusual usage patterns?
Yes! My college roommate was shut down in Las Vegas after eating for 3 hours. That was thirty years ago and I still can't stop laughing after witnessing such a debacle.
The quality of DigitalOcean's communications in this case, especially the email responses, looks like the worst examples from some pretty terrible corporations. I thought DO was (is?) trying to be different from them.
I was evaluating DO the past two months, but reading this, I'm staying at Linode - their poor security incident handling in the past is a theoretical concern, this is a more immediate one.
It's not this anecdote in itself, but that it corroborates my experience during trial that I ignored and dismissed as support incompetence (which should have been a warning sign in itself). After setting up the account, adding a payment card, I wasn't able to enter our VAT ID as part of billing details, with some nondescript error. So I asked support.
Two days(!) later, they responded by asking for incorporation documents, which was frankly bizarre (and a first in ~10 years of running a business): they're not exactly a bank with KYC requirements. When I responded with, basically, WTF?, and told them to check the billing data in VIES, they eventually fixed it.
But what I got from it was a distinct impression that their default assumption is that the customer is trying to defraud them, even when it makes no sense. To this day, I have no idea what kind of fraud they could possibly have been anticipating there (they allowed the card).
This story is on the same general subject, and so are others surfaced here and on twitter in reaction: the customer is presumed scumbag.
I see a lot of people saying "They should have had more than one provider" and "They should have had better backups".
What I'm saying is: If you ARE going to put your business at the mercy of one company from top to bottom, would it not be wise to try to get some kind of account rep? Or have some sort of communication with the company as to the nature of your operation?
And if that isn't an option, should you do business with them?
Question: so what do you do besides host it yourself? Do you set up a backup on a different cloud service so that at least you can fail over without too much downtime? You then have to pay for at least some of the resources even if there is no traffic. Or back up locally but have a process set up with the other service so you can provision and get back up quickly? Or is there a better solution?
At least have a plan (which you test out once) for how you will migrate to another provider in case your current one screws you like this. All you then need to do continuously is dump a backup copy of your data into that provider's storage.
It's reckless to trust a single company with backups worth that much money. I don't trust Google and Amazon with backups of my wedding video that nobody cares about, so I use my home NAS with RAID as a third additional copy (along with Glacier and Coldline). And they just trusted DigitalOcean with all of their data and assets? Crazy. Don't put all your eggs in one basket. Register your domain with one company, host DNS with another, host your servers with a third, keep your backups in multiple places owned by multiple companies, test your backups periodically, and have a migration plan in case one company decides to ban you.
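Testing them doesn't need to be fancy either; a monthly restore drill along these lines catches most silent failures (every name below, the "offsite:my-backups/db" remote, the scratch database "restore_test", the "clients" table, is a placeholder):

  import subprocess, tempfile, pathlib

  with tempfile.TemporaryDirectory() as tmp:
      # 1. pull the newest dump from the off-site bucket
      subprocess.run(["rclone", "copy", "offsite:my-backups/db", tmp,
                      "--max-age", "48h"], check=True)
      dumps = sorted(pathlib.Path(tmp).glob("*.sql.gz"))
      assert dumps, "no recent dump found off-site -- backups are broken!"

      # 2. restore it into a throwaway database
      subprocess.run(["dropdb", "--if-exists", "restore_test"], check=True)
      subprocess.run(["createdb", "restore_test"], check=True)
      subprocess.run(f"gunzip -c {dumps[-1]} | psql restore_test",
                     shell=True, check=True)

      # 3. basic sanity check: a key table should have data in it
      out = subprocess.run(["psql", "-tAc", "SELECT count(*) FROM clients",
                            "restore_test"], capture_output=True, text=True,
                           check=True)
      assert int(out.stdout.strip()) > 0, "restore produced an empty table"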
That's another reason I'm very skeptical of Amazon's and Google's proprietary API offerings. Renting virtual servers is no problem; you can rent those from thousands of providers. But if you're using their proprietary APIs, you will have to rewrite your software to migrate, and you'll have a very tight time frame in which to do it.
This is why I always use many providers. I'm talking about $10/month or so packages, this is pretty cheap. They're also not all on the same card in case one gets locked
I use 4 for my current company and have redundancy spread over them so that if, say, AWS goes down or I get an account locked or whatever, nothing is lost, things continue to be operational, slight degradation happens and that's it.
This really isn't that hard to set up. Under a day or so and then just do stuff in ssh config and the shell rc to act as helpers so you remember how to do things.
It's super cheap, pretty easy, and robust.
It's awful what happened to this guy but it's kinda like the person who backs up to the same harddrive as the originals. Awful to lose stuff, it shouldn't happen, but also don't do that.
While DO is not without fault in this story, any Fortune 500 company would have to be pretty stupid to work with this company after they admit to not storing backups in other places. I feel for you, but a backup is not a backup if it’s subject to the same weaknesses as the master copy.
I can understand why a cloud platform would proactively disable workloads that appear to be acting dangerously. At the same time it seems quite unreasonable that there's no mechanism for a customer to get their data back when their account is unilaterally shut down.
Are there any best practices I should follow when using AWS, GCP, Azure, DO, and others to avoid these sorts of situations?
I have heard although can’t confirm that using on account billing rather than using a credit card makes them less likely to just disable your account for billing issues. Things like that would be helpful to know. Should I be letting them know more contact info about us, asking for an account manager, etc?
Does it make a difference in how they react to you if you are spending low thousands a month vs tens of thousands a month vs hundreds of thousands etc? Or is everything always automated to death?
You know, there is usually an easy way to mitigate this. When you create an account, be sure to supply business information e.g EIN in the USA and see if you can do invoice billing. Once you get set-up as a business, most providers assume you know what you are doing.
These sorts of things tend to slip through the cracks. If you are running is a business capacity, make sure you treat everything like any other business would.
Even as a two-man company, you can't afford the cost and potential liability of not operating as a registered business. It also shows that this is a serious business and not a hobby.
The same thing happened to me 6 months ago, and I seem to read about it online every month or so. No response from them for weeks; without warning they locked and deleted all instances, data, backups, everything. It wasn't until I posted on HN that I got a response. They said it was a mistake and apologized, but... yeah... "whoops, we deleted everything and killed your startup, want a free month of service?" So no matter how slick the UI, I will go out of my way to never use this trigger-happy company that is killing people's startups on a regular basis.
> After sending multiple emails and DM on Twitter they unlocked our account, we got 12h of downtime and got a nice
Wait, their entire business was effectively shut down and all they did was send an email? Granted, DO's handling of a possible abuse situation is shocking, but allowing your business to go down for 12 hours without trying to call any and every actual human being at DO seems negligent on their part. While us technical types love interfacing via digital means, some situations benefit from actually talking to a live human.
Maybe they should have their own physical server in a datacenter. In addition to more flexibility, it would also be cheaper. Cloud providers try to convince people it is easier to use them, but in the end, if your cloud footprint grows a lot, you are still going to need a team dedicated to managing it. They try to convince you it is cheaper too, but I can easily get 32GB of RAM or far more on a single node at a fraction of the cost of the similar virtual offering (if they even offer VMs that big).
It won't be easier, and it won't be cheaper. When you're a small team trying to build a business, there are a lot of business functions where you won't personally have the expertise or the time to do it yourself efficiently, and your requirements won't be large enough to justify hiring somebody full-time.
In the case of these business functions, the standard and correct approach is to outsource them to a 3rd-party service provider. You do this with accountancy, legal representation, facilities, office management, recruitment, etc. If you try to bring all these things in-house from the get-go you'll never get around to building a product, and it's financially and logistically sensible to do it with IT as well. This calculation may change over time as your business grows, but if you can't comprehend that the correct strategy for a fledgling business may not be the same as an established one, then you're simply not suited to run a business in the first place.
Of course, in every case, you're taking a calculated risk by relying on a service provider: they may go bust, they may be incompetent or malicious, they might ramp up their prices. Your job as the manager of a company is to accept and manage these risks as best you can. Risks cannot be eliminated, only managed, and attempting to eliminate them is a fool's errand. If things do go wrong, you'll always have people lining up to tell you how you could have avoided the problem, usually by pointing to a decision that only looks right with the benefit of hindsight. You should ignore these people. The only question is: did you make the correct decision at the time, based on the facts at hand?
Or just copy backups away from your cloud provider, to local storage, another cloud such as AWS or dropbox. No point making backups if you can't access them.
This is not the first time DO has done things like this. I would not trust DO with anything critical, and have advised people in the past to not use DO. Nothing here surprises me.
* The company quickly creates and starts 10 VMs and triggers an automated lock-down, with a message mentioning a sudden spike of activity.
* The company gets the account unlocked.
* The company again quickly creates and starts 10 VMs, and triggers the auto-lockdown again.
Note to self: when something damaging happens as a result of a seemingly normal action, avoid doing that same seemingly normal action immediately again, lest the damaging consequences hit again.
Who is good about not locking accounts or taking similar actions? Apple and Google are both notorious for blocking things for no reason, is AWS or Azure any better?
They are all going to have some form of automated protection against malicious activity, but I suspect AWS and Google’s algorithms are better than the others. My experience with AWS and Google in general is that your treatment varies with your support plan. With business or enterprise level, you have dedicated resources within the company that are going to be aware of such issues or can escalate and sort it out quickly. I understand not wanting to shell out the base cost for enterprise if you are a small company on a budget, but paying at least for business support is a good idea if you are actually running a business. I have never actually heard of this happening to such a customer though, so perhaps they have extra processes in place?
I've only had one instance where Linode contacted me about suspicious activity. I responded promptly, it was obvious they understood my answer, and nothing came of it. Another comment thread on this story has people saying Linode's support is better.
Not excusing DO here, but there's also a lesson to be learned about backups: make sure they're kept off site in case the whole site itself becomes compromised. Usually when we say this we mean a physical disaster (e.g. fire), but in this case it's also a logical disaster (account getting shut down).
So the lesson is: don't leave your backups with the same cloud provider that hosts your database. You should pull local copies as well.
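And keep an eye on whether those copies actually keep arriving; a rough sketch of a pull-and-staleness check (the remote name and paths are assumptions):

  import json, subprocess, datetime

  REMOTE = "offsite:my-backups/db"
  LOCAL = "/mnt/nas/backups/db"

  # mirror the off-site bucket down to local/NAS storage
  subprocess.run(["rclone", "sync", REMOTE, LOCAL], check=True)

  # then alert if the newest object is older than ~36 hours (timestamp parsing is
  # kept deliberately rough: the date/time part of rclone's ModTime is close enough
  # for a staleness alarm)
  listing = subprocess.run(["rclone", "lsjson", REMOTE],
                           capture_output=True, text=True, check=True)
  entries = json.loads(listing.stdout)
  newest = max(datetime.datetime.strptime(e["ModTime"][:19], "%Y-%m-%dT%H:%M:%S")
               for e in entries)
  age = datetime.datetime.utcnow() - newest
  if age > datetime.timedelta(hours=36):
      print(f"WARNING: newest off-site backup is {age} old")  # wire up email/Slack here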
This isn't reassuring to read at all. I'm still building and relying heavily on DO for what is quite quickly becoming thousands in profit. Frankly, I'm very uninterested in having to deal with a hosting problem, partly because it's boring, but also because DO seems nice to use in every other way.
On the other hand - I'm becoming increasingly aware of the inevitable Twitter-social-media-pile-on for any company ever.
I am sorry they had this problem. Stories like this about edge cases with cloud deployments make my conviction stronger to always: 1) have a data backup plan that uses a different cloud provider unless you have so much data that you can’t afford the extra bandwidth and storage costs. 2) if possible don’t rely on unique cloud platform services and be ready to bring a system up quickly on another service provider.
How about a support emergency button. Clicking it will cost you $1,000, but would immediately put you in contact with a human in escalated support mode.
I think an important thing you can learn from this story is that you should keep your backup on a different host(s) or better even have replication enabled.
These days, most apps can generally be migrated to a new host in seconds as long as you have the data source alive.
If they had access to their data, they probably would have been able to spin up a similar EC2 instance in minutes and say goodbye to DO forever.
It's easy to think your data is safe when your cloud provider advertises 11 9's of durability. But three replicas of your data doesn't protect against things like this. Even in the cloud age, offsite backups are important: Different provider, different region, different payment method.
Unfortunately it doesn't help that cloud egress bandwidth is criminally expensive.
I really liked https://hyper.sh, they provided hosting for running Docker containers. They handled this same case by having a limit on every account by default. When I requested an increase to that limit, they asked about my workload, I explained it briefly, they allowed the increase, and all was good.
I agree with the other comments about not having backups in the same place, and ensuring that you distribute your assets (domain, DNS, compute, backups, etc.) across as many providers as possible.
One thing I will add is that, especially for a small shop or project, assume from the get go that by renting infrastructure from DO (or any provider) that user-hostile actions can and will be taken when it comes to any issue regarding TOS violations that you are unaware of.
This assumption helps to build redundancy in your mindset. Have a production website or app in DO for example? Droplet backup, periodic snapshots, B2 server backup, S3 tar backups, containerize apps if possible, have equally provisioned (smaller, idle VMs) infra on another DO account or another provider if possible, and so on. I know this is overkill but paranoid sysadmins/devops are always rewarded.
Just to add some context for DO specifically, they're a great provider in my anecdotal experience and they are constantly rolling out services aimed at medium to large scale workloads, such as managed databases and k8s.
That being said, it's entirely possible to transfer snapshots [0] to another DigitalOcean user account or teams account. So at the very least, create an entirely new DO account just for holding snapshots, outside of the native droplet backups and third-party backups you're doing on an application level.
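On top of the native backups, it's easy enough to drive droplet snapshots on your own schedule via the DO v2 API; a rough sketch (the droplet ID is made up, and note the snapshots still live inside DO, so this complements rather than replaces off-site copies):

  import os, datetime, requests

  DO_TOKEN = os.environ["DO_TOKEN"]   # API token, e.g. pulled from a secrets store
  DROPLET_ID = 123456789              # hypothetical droplet ID

  # kick off a named snapshot action for the droplet
  resp = requests.post(
      f"https://api.digitalocean.com/v2/droplets/{DROPLET_ID}/actions",
      headers={"Authorization": f"Bearer {DO_TOKEN}"},
      json={"type": "snapshot",
            "name": f"weekly-{datetime.date.today().isoformat()}"},
      timeout=30,
  )
  resp.raise_for_status()
  print(resp.json()["action"]["status"])  # "in-progress" on success

Run it from cron on a box outside the account if you want to be extra paranoid about the token.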
Out of curiosity, does anyone have some resources on building a homegrown DO/AWS-style server space?
Say I have a friend on every continent (including Antarctica, for hypothetical fun-ness), and they are all willing to allocate some square meters to plunk down some servers for whatever is needed.
How would I go about it and build this small(ish) infrastructure myself?
If you run mission-critical apps for F500 companies, you HAVE to have a backup/DR policy in place, using a different IT provider than your main one (and/or a different infrastructure site).
DO in my experience is not the greatest company to work with; but at the same time, your "incompetence" killed your company, not DO.
As someone who has spent the last 8 years in big enterprise it amuses me slightly that small businesses are discovering this now. Because where I work our customers often have contracts that make us accountable for every single hour of downtime. Requiring us to pay customers penalty fees if their service level is affected.
You cannot rely on one company. Let me repeat that. You. Can. Not. Rely. On. One. Company. If you do, you're negligent and you deserve what you get. You need your backups to be good and to be with a second provider. This is NOT rocket science. Treat your company seriously if you expect to sell services.
I think, going forward, a data backup is not going to be enough. You'll need a full devops plan B where you just redeploy to another hosting provider.
Even better: run your infrastructure across multiple hosting providers with something like Consul. A DO outage might then mean slower service, but not a death sentence.
A corollary is keep your infrastructure simple, and know how it works, so you can stand it up anywhere. Web server/db/memcached can run anywhere easily and is enough for almost every company out there. Are you sure you need {cool new service}? Are you really really sure? Your database can probably do it.
DO didn't handle this well, but a company that wants to serve "Fortune 500" customers ought to have a more mature process for handling an outage like this. The fact that they didn't makes it hard to view them as a serious, credible business.
This is why you need, at minimum, a nightly mirror. Ideally you stack on top of that a load-balancing device or service to redirect traffic in the event of an outage from your primary farm.
Never go on holiday again until you have backups and failover.
Over the years I have moved most of my VPSes to Vultr, and now only one droplet is left on DO. This story reinforces my mistrust of them. I don't like AWS because it's too complicated for my use, but now I think we'd better keep some backups elsewhere.
Amazon recently paid $200M or so for CloudEndure - I don’t know if they support DO, but they do let you easily move between and among cloud providers, which is something every cloud based company must plan for.
When will the wheel turn again and we learn it is best to have at least one copy of your data in your own hands? Not on a cloud VM or a server rented somewhere, but on an HDD on your own premises.
Ran a Minecraft server in college via DigitalOcean. Was my first experience with a VPS. They locked it down after looking at the console logs and deciding it wasn’t “educational” enough for them.
We had a similar incident with DO 8 days ago. It didn't kill our company, but we got hit hard.
Our business is Dynalist, an online outliner app. Many of our users store all their notes on Dynalist, so uptime is really important.
Starting 7 PM last Tuesday, we saw a slowdown in request handling. We filed a ticket with DO 2 hours after that (we also posted our initial tweet to keep our users informed: https://twitter.com/DynalistHQ/status/1131087411797270529).
A few hours later, we started to experience full downtime. Still no reply from DO. We filed another ticket with the prefix "[URGENT]". Still no reply.
We waited for 24 hours for their reply. We took turns taking naps because we're only a 2-person team.
After 24 hours, we tweeted @ DO (https://twitter.com/DynalistHQ/status/1131397013306847232). 2 hours later we finally got a support person working on our ticket. We didn't want to take it to the social media, but there doesn't seem to be any other way at that point. DO doesn't have phone support, and us "bumping" our support ticket didn't work either.
After 2 hours going back and forth on the support ticket and providing logs, DO's support person identified the issue and offered to move us to a less crowded server. They asked us what's a good time to do a manual migration if a live migration fails, and we replied immediately saying whenever is fine (we're experiencing downtime anyway).
We thought it was over, but we were so wrong.
They didn't reply for another 4 hours. That was 4 more hours of downtime. Sometimes CPU steal would drop a bit and our server could catch up on some requests, although it would still take 10 seconds for our users to open Dynalist. But most of the time, our web app was totally inaccessible. Watching the charts on our dashboard go up and down felt like some of the hardest hours of my life... mainly because there was nothing we could do.
4 hours in, I realized we had to post another angry tweet to get a solution. There was nothing else to do other than try to stay awake anyway. So I posted another tweet: https://twitter.com/DynalistHQ/status/1131497962184564737
This tweet didn't seem to work. Nothing happened in the next 3.5 hours and things started to feel surreal. I didn't know how much longer the downtime was going to last, and I didn't know what we were going to do about it.
At that time it was 9:30 AM EDT and people were starting their day. We were getting more and more emails and tweets asking what was going on and where their notes were. A few customers were angry, but most were understanding and supportive.
At 9:55 AM EDT, DO finally did the live migration a few minutes before the time limit we gave them, which was 10 AM. That was the end of the incident; CPU steal was down to < 1% and Dynalist was finally up again.
However, we couldn't trust DO any more. This weekend we're migrating to a dedicated server provider which has phone and live chat support. DO is pretty good for spinning up a $5 box quickly to test something, but we learned the hard way we shouldn't rely on it.
This is very sad, and why we all need to make the Web decentralized and own our own data. Here, several Fortune 500 companies were relying on a two person team, which itself was relying on a hosting provider with full discretion to shut it all down. What could possibly go wrong?
https://qbix.com/platform is one of many projects working to tackle this. Tim Berners-Lee’s SOLID project and others are, also.
Bookmarking this for next time there's a HN comment "why is Netflix using AWS, they should just get some cheap VPS's from DO"
The rise of the public cloud is pretty fortunate for companies like DO: they can make the same assumptions that legacy VPS companies did (most VMs sit idle, so oversell them massively; most customers will create a single VM and nothing else) while branding themselves as having the same strong infrastructure as AWS.
Has anyone ever actually said Netflix should switch from AWS to "some cheap VPS's from DO"? AWS offers a lot that DO doesn't. They're hardly even comparable.
How disappointing of DO. Clearly shows their hypocrisy and lack of care for customers with private tickets versus a public forum. Will not be recommending DO.
To everyone justifying the lack of a backup strategy by saying they’re a two man show:
As far as sympathy goes, you’re not wrong. But you’re also justifying every pain in the ass procurement process you’ve ever dealt with. Your attitude is why so many companies won’t go near a two man shop.
I would agree if they had been reckless, but they did have a backup strategy; the risk of a complete, sudden shutdown by their provider just wasn't factored into it. This is a risk that must now be taken into account, and I'm not sure procurement would have caught it either. Also, the pain in procurement is usually about irrelevant, arbitrary administrative things rather than real risks like this.
Soooo... an automated script flagged them. They got themselves unflagged, then proceeded to do the same thing immediately, without any real confirmation that they wouldn't be flagged again.
They've basically been flagged as abusing the system multiple times, and they're surprised they had to kick up a storm to get themselves reactivated again?
Not to mention that process they need to run every couple of months, which takes 1s but which they still need to parallelize over a bunch of VMs; that's weird and sounds like something that needs to be rearchitected at the very least.
My interpretation was that the batch job takes 1s per row or something like that, rather than 1s total. At 1s per row, 500k rows is roughly 140 hours of compute, so spreading it over ~10 droplets brings it down to about half a day. It obviously wouldn't make sense to spin up 10 nodes to turn a 1s job into 10 0.1s jobs.
Not at all trying to blame the original posters and victims:
While they seem great for hobbyist and small business sites, there’s no way I’d trust Fortune 500 client business to something like DigitalOcean. I just don’t see the benefits over a more established operation like AWS, Azure or GCP. Saving $50 here and there isn’t worth it.
DigitalOcean is established. They're newer than GCP and about the same age as Azure, and IIRC at one point were the second largest VPS provider, second only to AWS.
And unfortunately this stuff happens to your "established" examples as well. Here [1] is a particular example of Google shutting down an entire GCP account with no explanation. Some comments report the same on AWS as well. Ironically, people in that HN thread are actually suggesting Vultr (kinda like DO, but even smaller) as a good alternative.
We rely almost exclusively on DigitalOcean, and the stories about Google shutting down accounts really give me pause. It's not an isolated incident either, and reaching support is next to impossible, so I hear.
I'm lucky enough that my spend with DO is high enough to qualify for support, so if this ever happened to me at least I know I'd get a couple chances to make things right
Were this to happen on GCP I'm fairly certain they'd just black hole my account since I'm spare change to them.
I think it's the opposite. The cofounder intervened and the access was restored. This would never happen with Google or Amazon -- once they lock you out, your entire business is permabanned, and you won't be able to reach any human with authority to help you.
Maybe - AWS at least, you'll have an account rep to bang on whose job it is to remove obstacles to your spending more. I'm not sure how big exactly you have to be to get an account rep, but my small company with only 50 instances has one.
Your "established operations" are more expensive and GCP is newer than DO. Plus, getting the attention of someone to restore your account is probably easier in DO than in a faceless giant company like Amazon, Microsoft or Google.
My AWS support tickets usually get answered in two minutes or less and I have a dedicated rep who I can call whenever I want and we have regular check-ins anyway.
You get what you pay for. We're even upgrading from this support plan to an Enterprise account.
Startups can usually get enough in AWS credits that they probably could have their entire first year of service _for free_.
Yes, this is a bad look for DO. But the way they're able to beat AWS on price includes things like "worse support." And if AWS goes out of business, you'll know in advance. DO isn't the same story. You should be planning for redundancy if DO is truly business critical for you.
AWS support is a lot more responsive even if you are not a big source of revenue. One time when I had an issue, I reached out directly to our startup program point of contact and they made sure everything was resolved by constantly following up with the internal team responsible and keeping me in the loop.
To be blunt I've heard of GCP doing this to other people before. AWS and Azure though both understand that customer service is extremely important, and shutting down services destroys confidence. In the case of them suspecting malicious activity they'd have actual security people look at what's happening, and then maybe blackhole their traffic while they start calling people.
After even a single incident like this, no sane company would rely on DigitalOcean. This is the kind of crap you expect from a shared web host overselling resources, not from a company that wants to provide infrastructure to tech companies.
That was true a long time ago, but recently DO has also proven to be a professional cloud hosting provider in the same league. Frankly, I was expecting the same level of support/service as AWS from them.
BTW, I don't even see a downvote button on posts, there is only an upvote button! Is it because I'm new here?
You can only downvote once your own comments have been upvoted a bit.
Keep in mind though that you should only downvote for abusive comments (which also deserve a flag), gross misinformation, or other things like that. Disagreeing with someone is not a good reason to downvote.
Conversely, I don't see how a company like DO can afford to offload their customer support to automation for these "we'll shut down your business" kinds of tickets. Tweets like these generate a lot of press.
See, you totally can run big stuff DO (or Vultr or whatever). It just takes 20 minutes on the phone to make sure they know they can be paid when the time comes.
I can't speak to Digital Ocean, and I guess it depends how you define "viable", but I've been running my upper-six-figure-revenue business on Heroku with tremendous success over the past 6 years. Customer support is super responsive and very helpful. The minimal downtime I've faced has largely been the fault of AWS.
Couldn't agree more! We've been using Heroku at ReadMe for ~5 years and it is easily my favorite piece of technology we use. Off the top of my head I can't think of a single issue that entire time that has been their fault.
Heroku is expensive but fantastic. Unless you have complicated needs and the price point doesn’t shock you, it’s a fantastic way to use AWS while sparing yourself a lot of headaches.
I see no need for DO to shut down accounts and active long established nodes for a fraud check. Disable the offending node, disable making new ones, and limit editing existing ones to off/on controls.
Regardless of this example of a false positive, locking whole accounts over that is unwise.
Rather unfortunate customer service experience that you can't get help from actual support and you can't get help from the official Twitter account. You just need to pray that your Twitter thread gets enough attention for an actual co-founder to notice it so they can make the call that saves your company.
Also, even the co-founder doesn't seem to know exactly why the service was suspended, even though he clearly managed to arrange things.
Right? I hate that justice in these cases relies on the person tweeting and then that tweet catching on and getting popular on Hacker News. So depressing.
The response (and its timing) will determine whether we continue with DO as a host for the (admittedly tiny) bit of infrastructure we host there.
And there had better be a reasonable response on HN if they value their HN-reading customers; it is where we got the first recommendations for them years ago.
>We haven't completed our investigation yet which will include details on the timeline, decisions made by our systems, our people, and our plans to address where we fell short.
I'm referring to _this_ response, i.e., the one where they explain why they did what they did.