This situation occurred due to false positives triggered by our internal fraud and abuse systems. While these situations are rare, they do happen, and we take every effort to get customers back online as quickly as possible. In this particular scenario, we were slow to respond and had missteps in handling the false positive. This led the user to be locked out for an extended period of time. We apologize for our mistake and will share more details in our public postmortem.
Additionally, the steps taken in our response to the false positive did not follow our typical process. As part of our investigation, we are looking into our process and how we responded so we can improve upon this moving forward.
As a business owner with much of our infrastructure depending on DigitalOcean, the incident is concerning. It affects the reputation of DO as well as its customers.
The demographics on Twitter and especially here on HN represent a sizable crowd with decision-making influence on DO's bottom line. I hope to see some effort being made to prevent situations like this in the future, and to regain the trust.
As a (so far) satisfied customer, it's great to hear that:
> A combination of factors, not just the usage patterns, led to the initial flag.
> We recognize and embrace our customers ability to spin up highly variable workloads, which would normally not lead to any issues.
> we are looking into our process and how we responded so we can improve upon this
The real “mess up” here was the bit where you blocked the account with no reason given and no further communication - other than the one-liner your intern wrote for the email.
I’m expecting you to sit down with your legal team and rewrite your TOS to be more customer-focused and less robotic.
Considering you have been marketing yourself as the platform for developer oriented cloud, you should be aware that surge provisioning can and will always be happening.
But it doesn't make sense to shut it down before discussion!
Looking forward to the write up!
I know people cite legal arguments for why they shut you down and won't say anything, but this is the worst scenario ever. I'd be better off accused of something I didn't do than just told "ooops, we can't tell you anything, your account has been shut down".
The response email even read like a giant polite FUCK YOU (we locked your account, no further action required by you)
You bet I will have further action!
And it is after the shaming that you get an "I am sorry for this situation". Which sounds more like saying "I'm sorry we got caught".
My frustration is not with DO specifically, as they do exactly what every other company does.
But, what of the other thousands of people that got screwed and did not put it on twitter?
It is the equivalent of when you are in a restaurant and get screwed: it is the person who complains the loudest who gets the reward, while all the others silently swallow the injustice.
Getting your message into the right hands is what matters, not the platform it's on.
That said, what's highly troubling as a DO customer (and someone who is planning to deploy startup infrastructure of my own with DO) is:
1) The discrepancy between this customer's experience and clear assurances made on this very forum by high-level DO employees that:
a. warnings are ALWAYS issued before suspensions.
b. even in the event of a suspension, services remain accessible (though dashboard access and/or the ability to spin up NEW services may be impacted), i.e. the affected customer could still retrieve data or SSH in to droplets.
2) The relatively trivial nature of the customer's offending usage (temporarily spinning up 10 droplets). What happens if, for example, a startup gets a press mention somewhere that leads to a massive traffic spike, necessitating a sudden and significant spin-up of new droplets (especially if this is done programmatically versus by hand in the dashboard)?
3) The apparent lack of consideration of the customer's history, or investigation into their usage. It seems the threshold for suspending services of longstanding customers who are verifiably engaging in commerce (taking a moment to look at their website and general online presence for indicators of legitimacy), should be SUBSTANTIALLY higher than for, say, an account who signed up a week ago. Context matters.
Following is a comment by Moisey Uretsky in another thread:
> Depending on which items are flagged the account is put into a locked state, which means that access is limited. However, the droplets for that account and other services are not affected at all. The account is also notified about the action and a dialogue is opened, to determine what the situation is. There is no sudden loss of service. There is no loss of service without communication. If after multiple rounds of communication it is determined that the account is fraudulent, even then there is no loss of service that isn't communicated well in advance of the situation.
This is why I'm so confused by the case under discussion, because the customer appeared to have been completely locked out without warning.
If DO reserves the right to cut off services and access to your own data permanently and without warning (outside of a court order or confirmed illegal activity), that needs to be unequivocally stated, and the triggering factors should be made known. Otherwise, DO is not fit for production systems.
Additionally, it would be nice to see the creation of a transparent, high-level appeal process for customers affected by suspensions. Truly malicious customers wouldn't use it (what would they hope to successfully argue to an actual human reviewer?), but it would greatly benefit legitimate customers to have an outlet other than social media by which to "get something done" in the event of an inappropriate suspension followed by a breakdown in the standard review process.
Running anything business or privacy critical on DO is madness.
It's fair to note that scrubbing is now the default behavior when a droplet is destroyed, so they did listen to the feedback.
You do not need to scrub or write anything to not provide user A’s data to user B in a multi-tenant environment. Sparse allocation can easily return nulls to a reader even while the underlying block storage still contains the old data.
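To make the sparse-allocation point concrete, here is a minimal toy sketch. It assumes a hypothetical `SparseVolume` class (not any provider's actual implementation) that tracks which blocks the current tenant has written and returns zeros for everything else, even though the shared backing store still holds the previous tenant's bytes.

```python
class SparseVolume:
    """Toy thin-provisioned volume handed to a new tenant.

    Hypothetical model: `backing` stands in for shared block storage
    that may still contain a previous tenant's data.
    """

    def __init__(self, backing: bytearray, block_size: int = 4):
        self.backing = backing
        self.block_size = block_size
        self.written = set()  # indices of blocks this tenant has written

    def write_block(self, index: int, data: bytes) -> None:
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block")
        start = index * self.block_size
        self.backing[start:start + self.block_size] = data
        self.written.add(index)

    def read_block(self, index: int) -> bytes:
        # Blocks this tenant never wrote read back as zeros,
        # regardless of what the backing store physically holds.
        if index not in self.written:
            return bytes(self.block_size)
        start = index * self.block_size
        return bytes(self.backing[start:start + self.block_size])


def demo():
    backing = bytearray(b"OLD!DATA")  # leftovers from a previous tenant
    vol = SparseVolume(backing)
    before = vol.read_block(0)        # never written by the new tenant
    vol.write_block(1, b"new!")
    after = vol.read_block(1)
    return before, after, bytes(backing[:4])
```

In this sketch nothing is ever scrubbed: block 0 of the backing store still physically holds the old bytes, but the new tenant has no way to read them through the volume.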
They were just incompetent.
On top of all of that, when I pointed out that what they were doing was absolute amateur hour clownshoes, they oscillated between telling me it was a design decision working as intended (and that it was fine for me to publicize it), and that I was an irresponsible discloser by sharing a vulnerability.
Then they made a blog post lying about how they hadn’t leaked data when they had.
"This will not happen again, ever".
People's livelihoods are at stake in DO's hosting. Canned responses and brutal account lockouts should have NEVER been on the table to begin with.
That’s akin to saying “we’ll never ship a bug”, or “we have an SLO of 100%”. That’s impossible for anyone to claim. Same goes for the response handling. There is clearly a lot of room for improvement there, but if you’re insisting on not getting a canned response, that means a human needs to be involved at some point. Humans will at times be slow to respond. Humans will at times make mistakes. This is just an unavoidable reality.
I get that mob mentality is strong when shit hits the fan publicly, but have a bit of empathy and think about what reasonable solutions you may come up with if you were to be in their situation, rather than asking for a “magic bullet”.
I could see a good response here being an overhaul of their incident response policy, especially in terms of L1 support. Probably by beefing up the L2 staffing, and escalating issues more often and more quickly. L2 support is generally product engineers rather than dedicated support staff/contractors, so it’s more expensive to do for sure, but having engineers closer to the “front line” in responding to issues closes the loop better for integrating fixes into the product, and identifying erroneous behavior more quickly.
However, can you say with a straight face that the very generic message left here by DO's CTO instills confidence in you about how they will handle such situations in the future?
Techies hate lawyer/corporate weasel talk. The least that person could do was do their best to speak plainly without promising the sky and the moon.
I’m an engineering manager in an infrastructure team (not at all affiliated with Digital Ocean, tho full disclosure, I do have one droplet for my personal website). I know how postmortems generally work, and it’s messy enough to track down root cause even when it’s not some complex algorithm like fraud detection going off the rails.
I’d rather get slow information than misinformation, but I understand the frustration in not being able to see the inner working of how an incident is being handled.
And I agree with your premise. However, in my experience postmortems are often watered-down, evasive PR talk.
If you look at this through the eyes of a potential startup CTO, wouldn't you be worried about the lack of transparency?
And finally, why is such an abrupt account lockdown even on the table, at all? You can't claim you are doing your best when it's very obvious that you are just leaving your customers at the mercy of very crude algorithms -- and those, let's be clear on that, could have been designed so that they never lock down an account without human approval at the final step.
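A minimal sketch of what "human approval at the final step" could look like, assuming a hypothetical three-state account model. The names `AccountState`, `flag_by_algorithm` and `review` are invented for illustration and say nothing about how DO's systems actually work. The point is only that the automated path can queue an account for review but has no way to reach the locked state on its own.

```python
from dataclasses import dataclass, field
from enum import Enum


class AccountState(Enum):
    ACTIVE = "active"
    PENDING_REVIEW = "pending_review"
    LOCKED = "locked"


@dataclass
class Account:
    account_id: str
    state: AccountState = AccountState.ACTIVE
    audit_log: list = field(default_factory=list)


def flag_by_algorithm(account: Account, reason: str) -> None:
    """Automated detection may only queue an account for human review."""
    if account.state is AccountState.ACTIVE:
        account.state = AccountState.PENDING_REVIEW
    account.audit_log.append(f"flagged: {reason}")


def review(account: Account, reviewer: str, confirm_abuse: bool) -> None:
    """Only a named human reviewer can lock or clear a flagged account."""
    if not reviewer:
        raise ValueError("a human reviewer is required")
    if account.state is not AccountState.PENDING_REVIEW:
        raise ValueError("account is not awaiting review")
    account.state = AccountState.LOCKED if confirm_abuse else AccountState.ACTIVE
    account.audit_log.append(f"reviewed by {reviewer}: abuse={confirm_abuse}")
```

Under this design a false positive costs the customer a review, not their service, and every lock has a reviewer's name attached to it in the audit log.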
What I'm saying is that even at this early stage when we know almost nothing, it's evident that this CTO here is not being sincere. It seems DO just wants to maximally automate away support, to their customers' detriment.
Whatever the postmortem ends up being it still won't change the above.
I too value less known providers. The human factor in support is priceless.
Do you believe that a PR response made in damage control mode that actually changes nothing is something that's satisfactory?
I mean, apparently this screwup was so damaging that it killed a company. What part of the PR statement addresses that precedent?
Companies are made of people. Let the people have a life. Their night is shitty enough as is after this, I guarantee you.
If you have ever been involved in post facto analysis of a process breakdown like this you know how hard it is to get the full picture immediately. Rushing something out does no one any favors.
This didn't seem like a case of being "too slow" - the customer in question went through your review process (which was slow, yes), and the only response he got was "We have decided not to reactivate your account, have a nice day".
That just seems like a lack of interest in supporting your customers that are falsely flagged.
Yesterday we completed our postmortem analysis of the incident involving Nicolas (@w3Nicolas) and his company Raisup (@raisupcom). With their permission we are sharing the full report on our blog here:
I'd gladly do whatever it takes to KYC, send you my business license, tax returns, EIN, invoice billing, etc so you know there is someone behind my account.
We spend thousands of hours eliminating single points of failure. If an automated system can undermine that work, DO is not an option for us to host anymore.
Hope you can share what you learnt from this incident and hopefully you'll take a hard look at your processes.
I'd hate to be caught in the same issue, especially since we are already customers, and I'm not sure I'll have as much clout as Nicolas here to get your attention.
It's occurring to me now that while I've successfully ignored twitter for years, I should probably rectify that just so I have somewhere to type my hopes and prayers when this eventually happens to me, and hope for a miracle. It sure seems like the only place they're listened to.
Maybe keeping a twitter (and other social media) account with at least a certain number of followers should be considered a part of a company's security strategy? You'd also need to post something interesting periodically, to keep your followers, so that you have their attention when you need it.
But the DO CTO did basically admit fault in a public forum.
The very fact that this can happen from an automated script with no oversight should give every one of your customers pause as to whether they continue with your service.
You clearly don't make every effort, and did not -- so why waste the extra verbiage and switch from active to passive voice?
Based on your cliche response I have zero confidence that DO will do anything substantial to address the root causes of the issue.
That doesn't sound good to me.
I block their netblocks for a lot of things, too.
It requires so many failures in understanding the service being provided across the company for this decision making process to have ever actualized that there is no reasonable expectation of safety or trust from DO at this point.
Ideally you should be cloud-agnostic, but that's quite hard to achieve.
DO's tier 1 support is almost useless. I set up a new account with them recently for a droplet that needed to be well separated from the rest of my infrastructure, and ran into a confusing error message that was preventing it from getting set up. I sent out a support request, and a while later, over an hour I think, I got an equally unhelpful support response back.
Things got cleared up by taking it to Twitter, where their social media support folks have got a great ground game going, but I really don't want to have to rely on Twitter support for critical stuff.
DO seems to have gone with the "hire cheap overseas support that almost but doesn't quite understand English" strategy, whereas the tier 1 guys at Linode have on occasion demonstrated more Linux systems administration expertise than I've got.
They told me that on a single day a support engineer was supposed to help/advise customers on pretty much whatever the customer was having an issue with and also handle somewhere between 80 and 120 tickets per day.
It's nice to see that DO is willing to help on pretty much anything they (read: their team) have knowledge about, but with 80-120 tickets per day I cannot expect to give meaningful help.
Needed EDIT: it seems to me that this comment is receiving more attention than it probably deserves, and I feel it's worth clarifying some things:
1. I decided not to move forward with the interview as I was not interested in that support position, so I have not verified that's the volume of tickets.
2. From their description of tickets, such tickets can be anything from "I cannot get apache2 to run" to "how can I get this linucs thing to run Outlook?" (/s) to "my whole company that runs on DO is stuck because you locked my account".
Anyway imho you should have taken the support position and schemed your way into development internally. This was my plan at eBay before they fired me, though they shut down the branch here a few months later and moved to the Philippines anyway so I wouldn't have lasted long regardless.
Let me stress here, this is not nearly as easy of a problem to solve as it appears to be on the surface. We're struggling as a company right now because after our recent merge, a lot of our good talent has left and we're having to rebuild a lot of our teams. Even so, I'm still happy with our general approach. Management understands that employees will often have wildly different problem solving approaches and matching metrics, and that's perfectly OK as long as folks aren't genuinely slacking off and we as a team are still getting our customers taken care of. I think that's important to keep in mind no matter how big or small your support floor gets.
I couldn't imagine getting that level of support from DO, let alone Amazon.
Even when we had a small handful of physical servers with them, they seemed inept. They actually lost our servers one time and couldn’t get someone out to reset power on our firewall.
That said, our FAWS team are a good bunch, and what AWS lacks in support they more than make up for in well engineered, stable infrastructure. Since Rackspace's whole focus is support, I think the pairing works well on paper and it should scale effectively, but we'll have to see how it plays out in practice.
This is a big push, internally and externally. I don't know too much about the details (I don't work directly with that team) but it's been one of our bigger talking points for a while now.
It's crazy that companies spend $$ on marketing and sales, then cheap out on an interaction with someone who is already interested in / using their product.
"We have a mark, let's suck it dry until we can throw it away and find new and better marks."
Sustainability tends not to be a concern until after a group's leadership jumps ship and parasitizes other hosts.
Almost invariably the high ticket rates are also driven by bad product elsewhere. Money is being spent on customer "services" sending out useless cut-and-paste answers to tickets to make up for money not spent on UX and software engineering that would prevent many of those tickets being raised. Over time that's the same money, but now the customer is also unhappy. Go ask your marketing people what customer acquisition costs. That's what you're throwing away every time you make a customer angry enough to quit. Ouch.
> This was my plan at eBay before they fired me
I guess you answered yourself.
The reality is that his boss is right...
Gotta hit that ZBB!
Now it's just "meh, we'll fix it in next month's release".
you are entirely correct.
Disclaimer: Based on providing support myself and coaching a support team at both a web hosting company and ISP I used to own years ago.
I somewhat blame people in tech, actually. More than one company is creating products that "cut customer service costs via machine learning", which is code for "pick keywords from incoming tickets and autoreply with a template"
I truly feel sorry for their first and second tier customer support people. I imagine the staff churn rate is incredible.
People who work for these sorts of low-end hosting companies inevitably quit and try to work for an ISP that has more clueful customers. When you have people paying $250/month to colocate a few 1RU servers, the level of clue of the customer and amount of hassle you will get from the customer is a great deal less than a $15/month VPS customer.
I look at companies selling $5 to $15/month VPS services and try to figure out how many customers they need to be set up for monthly recurring services, in order to pay for reasonably reliable and redundant infrastructure, and the math just doesn't pencil out without:
a) massive oversubscription
b) near full automation of support, neglect of actual customer issues, callous indifference caused by overworked first tier support
Conversely, as a customer, you should be suspicious when some company is offering a ridiculous amount of RAM, disk space and "unlimited 1000 Mbps!" at a cheap price. You should expect that there will be basically no support, it might have two nines of uptime, you're responsible for doing all your own offsite backups, etc.
If you use such a service for anything that you would consider "production", you need to design your entire configuration of the OS and daemons/software on the VM with one thing in mind: The VM might disappear completely at any time, arbitrarily, and not come back, and any efforts to resolve a situation through customer support will be futile.
That's going to be true no matter which cloud provider you choose.
Their ToS almost certainly include terms which allow them to kick you off and refund any monies for any reason whatsoever.
Good luck if you bought into their entire ecosystem and can't move elsewhere on a whim.
My company provides services to fortune 100 companies, and we host literally petabytes of data on their behalf in Amazon S3, but we don't have offsite backups. We (and they) rely on Amazon's durability promise.
We do offer the option of replicating their data to another cloud provider, but few customers use that service -- few companies want to pay over twice the cost of storage for a backup they should never need to use when the provider promises 99.999999999% durability.
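For a sense of scale, here is a back-of-the-envelope calculation of what that figure implies. It assumes the 99.999999999% number is an annual, per-object durability and that object losses are independent. Both are assumptions made for illustration, not terms quoted from any contract in this thread.

```python
from decimal import Decimal

# Assumed annual per-object durability (eleven nines), as quoted above.
DURABILITY = Decimal("0.99999999999")


def expected_annual_losses(object_count: int) -> Decimal:
    """Expected number of objects lost per year under the assumptions above."""
    return Decimal(object_count) * (Decimal(1) - DURABILITY)


def years_per_single_loss(object_count: int) -> Decimal:
    """Average number of years between single-object losses."""
    return Decimal(1) / expected_annual_losses(object_count)
```

With ten million stored objects, `years_per_single_loss(10_000_000)` works out to 10,000 years per lost object. Note that this says nothing about the risk discussed elsewhere in the thread, which is losing access to the account rather than losing the bits.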
Reasoning: Your contract with Amazon promises durability and I'm sure there's a service level agreement with penalty/liability clauses. By implementing a redundant backup, you're replicating something that you don't legally need to have, doubling (or more) the due diligence required on the offsite backup's security/credentials, and, in case of a failure at Amazon, creating a grey area with clients: "Do you have the data, or do you not?"
In short, there could be a very good business reason not to do offsite backups.
"We're sorry, the tape that we didn't need to keep has been lost/zero-dayed/secondary service provider has gone bankrupt/Billy's house that we left it at got robbed." These must be disclosed to a customer immediately.
Minimising attack/liability surface is not only a technical problem, but a business one too.
I'd expect that, as more transistors have been packed blah blah blah, that such a $10/mo account would have gotten better, not worse, since then.
This is what Linode do - they keep you on the same payment level, but raise what you get for it.
I can tell you that as a person whose job title includes "network engineer", we have a number of customers who have critical server/VM functions similar to these people who had the DigitalOcean disaster. If something goes wrong, an actual live human being with at least a moderate degree of linux+neteng clue is going to take a look at their ticket, personally address it, and go through our escalation path if needed.
There seems to be a sweet spot for company size here. Companies that are too small can't support you even though they really want to. Large companies are busy chasing millions in big contracts, and don't really care about your $800 per month at all.
If you are just some semi anonymous faceless person ordering services off a credit card payment form on a website, all bets are off...
I think that the size of the company or how fast they grow is a good proxy for having poor customer support. What we should be doing is finding the slow growing businesses or the mid-tier (not too small, not too big) businesses to take our business to.
But the execution matters a lot, and DO's is currently not great. IIRC, it takes clicking through a few screens of "are you sure your question isn't in our generic documentation? How about this page? No? This one then? Still no? You're really sure you need to talk to someone about this error? sigh Okay, fine then."
These systems should not be implemented as a barrier to reaching human support, but they often are.
That's simply not true. There are support engineers hired around the world, and depending on when your ticket is posted, someone awake at that time will answer. DO is super remote friendly and as a result, has employees (and support folks) everywhere on the planet. Not "cheap overseas support" at all. There are a lot of support folks in the US, Canada, Europe, and in India where they have a datacenter.
Going to guess from this tweet https://twitter.com/AntoineGrondin/status/113096281882239385... that you're currently staffed at DO. That's fine; I know two people who do great work on DO's security team. But it would be helpful if you could disclose this when you comment about your employer publicly so that readers don't have to dig up your keybase and then your twitter account to understand it.
Then of course, there is no guarantee these people speak and understand English perfectly.
The reality is that top talent, even if remote, is competitive whether they are in NYC/SFO/SEA or not. And DO has some pretty talented people on staff.
And then, having people in all timezones is definitely an advantage for 24/7 support. I'd say it's not negligible, and not an afterthought.
Now about English fluency: it's only that important in native English-speaking locations. And really, most of tech does not necessarily have English as a first language - I certainly don't. So I'd say that encountering support engineers with imperfect English shouldn't be a problem to anyone, and is definitely not a sign of cheap labor. In fact, I'd say bitching about someone's English proficiency in tech is kind of counterproductive and I find it discriminatory.
Anyways. DO doesn't hire international employees to get cheap labor, that's a preposterous proposition. And with datacenters around the world and a large presence and customer base, it makes sense to have staff on board from many of these areas. And that staff might answer your tickets at night when they're on shift. Shouldn't they?
Just because someone isn't a native English speaker, or speaks in a different dialect, doesn't mean that they are any less capable.
That said, I disagree wholeheartedly that it's okay for support staff to not be completely fluent in the language they're providing support in, regardless of the language.
There is functionally no difference between trying to interact with talented support staff who aren't fluent in your language, and trying to interact with illiterate support staff. The end results are identical.
There are people who are very talented and very fluent in more than one language. Those people tend to be more expensive. So, many companies forego hiring those workers and instead hire others who are cheaper and "about as good". My multiple experiences with DO support have suggested that that's what they're doing.
As other commenters are suggesting, it may just be instead that DO is expecting its support staff to meet metrics that are causing them to spend only a minute or two per ticket and send out scripted replies.
People who would make gratuitous grammatical mistakes but have read more classics than the average American college graduate. I can easily count many just thinking of it.
> There is functionally no difference between trying to interact with talented support staff who aren't fluent in your language, and trying to interact with illiterate support staff.
That statement reeks of ignorance. It seems you have almost no experience with languages other than your own, or you would know that communicating while being non-fluent or with a non-fluent speaker works just fine most of the time. Sometimes misunderstandings happen and it can be a bit slower but that is all.
So your position is that support that's a bit slower, with some misunderstandings, is exactly as good as fast support without misunderstandings, even in downtime-sensitive applications.
Well, okay then.
fluency is a high barrier to clear. it took me 5 years of speaking/reading/writing english daily to come even close to "fluency".
before that, i had a really good advanced english, but i wasn't fluent. and it didn't mean i was "illiterate".
These days, though, I tend to go with UpCloud. It is very similar to Linode and DO, except you can do custom instances, like 20 vCPUs with 1GB of memory, and spin one up for $0.23 an hour. By comparison, on the standard plans at Linode and DO, 20 vCPUs with 96GB of memory would be $0.72/hr.
Just my experience though.
I suspect as they've increased in popularity they've become a bigger target for DDOS attacks.
I've also noticed that in the past year there's been a lot of data centre outages... like every couple months. Hasn't been a deal breaker for us, since our traffic is generally fairly low outside of season, but the ones during seasons really hurt.
Also I'd like to add that they do give you the heads up when there are issues, which is a big plus in my book compared to some other hosts.
I really do think it's just growing pains, and I don't mean to disparage them. Just being honest that I wouldn't recommend them for high availability services. Since I consider them a low budget host that's probably unfair though. We've just outgrown them is all.
This is all anecdotal of course.
The main potential caveat is that they're Xen-based hosting. That may or may not matter depending on what one wants to run. They support the major Linux distributions.
Can't comment about support, but DO's tutorials on linux server random task 101 are fantastic.
I was surprised how much of my ubuntu server setup googling ended up on DO pages.
I was eager to try the new DO managed Postgres service, but I guess I won't after this blunder.
Linode even has an IRC channel (you can use a browser to access it). I rarely need support, but when I really need it, it is always fast, to the point, and available.
A Bitcoin theft via the Linode Manager interface in 2012: https://news.ycombinator.com/item?id=3654110
and a second Linode Manager compromise in 2016: https://news.ycombinator.com/item?id=10845170
In both cases, if you dig into the context a bit, the story turned out that Linode wasn't fully disclosing the breaches to their customers until they were forced to do so when the news about them reached a certain volume. They also may have been -- almost certainly were -- dishonest about the extent of the damage and how it may have impacted their other customers.
At the time, their Manager interface was a ColdFusion application, which tends to be a big pile of bad juju. They started writing a new one from scratch after, I think, the second compromise.
The really bad thing here is that they got soundly spanked for being less than truthful the first time, and then four years later -- when they'd had ample time to learn from that mistake -- they did it again.
So there's a nonzero chance at any given time that Linode's infrastructure has been compromised and they know it and have decided not to tell you about it.
That's what prompted me to start exploring DigitalOcean more. Unfortunately, I've found that there's a far greater chance that I'll experience actual trouble exacerbated by poor support than that I'll be impacted by an upstream breach, so about half my stuff still lives on in Linode.
This is entirely the root of my distrust of them right now. Mistakes happen. Companies I trust demonstrate that they've learned from their mistakes. My tolerance for mistakes is pretty low when it comes to security related things, though. If something has gone wrong, let me know so I can take remedial steps. Their handling of both of those incidents suggested I can't trust them to tell me in sufficient time to protect myself.
They've had several high-profile breaches over the years.
When there's an issue with Linode's platform, they discover it before I do and open a ticket to let me know they're working on it. When there's an issue with Vultr, the burden of proof seems to be on me to convince them that it's their problem not mine.
I've had interesting issues automating deployment to the point that my current build script provisions 9 VMs, benchmarks them and shuts down the worst performing 6. Some of their co-location is CPU stressed.
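The over-provision-and-cull approach described here can be sketched roughly as follows. `benchmark` and `destroy` are hypothetical callables standing in for whatever benchmark and provider API calls the real build script uses. Nothing below is any provider's actual client library.

```python
from typing import Callable, Iterable, List


def keep_best(
    vm_ids: Iterable[str],
    benchmark: Callable[[str], float],
    destroy: Callable[[str], None],
    keep: int = 3,
) -> List[str]:
    """Benchmark every VM, keep the `keep` best scorers, destroy the rest.

    Assumes a higher benchmark score means a better-performing VM.
    """
    # sorted() calls the key function once per VM, so each VM is
    # benchmarked exactly once.
    ranked = sorted(vm_ids, key=benchmark, reverse=True)
    kept, dropped = ranked[:keep], ranked[keep:]
    for vm_id in dropped:
        destroy(vm_id)
    return kept
```

With nine provisioned VMs and `keep=3`, this destroys the six worst performers, matching the 9-to-3 ratio described above.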
The difference in support DO vs Linode is probably due to DO being cheaper.
What? Most DO and Linode plans have the exact same specs and cost exactly the same, and IIRC it was DO matching Linode. Although DO didn't seem to enforce their egress budget while Linode does.
(I've been a customer of both for many years, and only dropped DO a few months ago.)
EDIT: Also, according to some benchmarks (IIRC), for the exact same specs Linode usually has an edge in performance.
DigitalOcean should go the route of AWS and kill off free support completely and offer paid support plans. Something like $49 a month or 10% of the account's monthly spend.
If you are a serious company with paying Fortune 500 customers, you need to act serious and pay up for premium support and stop expecting free.
You already get bumped to a higher tier of support once you hit 500/month spend for no additional fees but that just gives you promises for lower response times
It's usually a bad sign when a company can't meet current needs/expectations and then decides to try and productize their failures.
That said, any company, especially one working with Fortune 500's, should have DB backups in at least two places. If they'd had the data, they could have spun up their service on a different hosting provider relatively easily.
Even worse: they don't explicitly state what is considered "unreasonable". So, if your business is serious, you have to assume the worst-case scenario: DO can't be used for anything.
Conclusion: Digital Ocean is just for testing, playing around, not suitable for production.
I think that's always been the standard position most people take. DO, Linode, etc. are for personal side projects, hosting community websites, forums etc. They are not for running a real business on. Some people do, sure, but if hosting cost is really that big a portion of your total budget you probably don't have a real business model yet anyway.
I find it concerning that they have such a low threshold for triggering a lock. 10 droplets is hardly cloud scale.
When the majority of what abuse support dealt with was people angrily calling and asking about the fraudulent charges on their cards for dozens of Lie-nodes, you consider putting caps in place to reduce support burden and reduce chargebacks.
At the time at Linode, if it was a known customer, we could easily and quickly raise that limit and life is good.
I've always wondered how Amazon dealt with fraud/abuse at their scale.
I don't think DO was wrong here to have a lock, but the post-lock procedure seemed to be the problem.
That text suggests larger organizational problems within the company.
> Account should be re-activated - need to look deeper into the way this was handled. It shouldn't have taken this long to get the account back up, and also for it not be flagged a second time.
So... he doesn't address what is the scariest part to me, the message that just says "Nope, we've decided never to give your account back, it's gone, the end."
I think it's entirely reasonable for companies to have that option. "You are doing something malicious and against the rules, you have been permanently removed". In this case, that option was misused, but I don't think the existence of that possibility is inherently surprising.
And, regardless of what DO should or should not do, they can do whatever they want with their own hard drives. You should structure your business accordingly.
If there is no police report, then they are trying to act as police themselves, which I think is unacceptable. It is not their data.
Your argument that they can do whatever they want with their hard drives is indeed something I will take care to remember — I definitely would not want to host anything with DO.
The observant will note the particular corner you're backing into here -- that a business might be justified in denying access to code/data being used in literally criminal behavior -- is notably distinct from the general and likely much more common case.
> they can do whatever they want with their own hard drives.
Sure. But to the extent they take that approach, Digital Ocean or any other service is publicly declaring that however affordable they may be for prototyping, they're unsuitable for reliable applications.
Businesses that can be relied on generally instead offer terms of service and processes that don't really allow them to act arbitrarily.
I agree. Look at the absolutism of the comment I am replying to. My whole point is that there might be some nuance to the situation.
> ...Digital Ocean or any other service is publicly declaring that however affordable they may be for prototyping, they're unsuitable for reliable applications.
Again, I agree. Considering how cheap AWS, Backblaze, and Google Drive are, it is completely ridiculous to depend on any one single hosting service to hold all your data forever and never err.
You seem to be accusing the aggrieved party of being a bad actor, when that is not the case.
And do what in the meantime? The legal system acts slowly. In the age of social media outrage, would you allow the headline "Digital Ocean knew they were serving criminals, and they didn't stop them" if you were CEO?
It's easy to be outraged when these systems and procedures are used against the innocent. That does not mean we should stop using rational thought. If someone is using DO to cause harm, then DO should (be allowed to) stop the harmful actions.
You lock down the image, and let law enforcement do their thing. If law enforcement clear them, you then give the customer access to their data, perhaps for a short time before you cut them off as they seem to be a risky customer to have.
You don't unilaterally make the decision, you offload your responsibility onto the legal process.
The fact that there are hundreds of comments on HN condemning them for this action proves my point.
Seems to work just fine for AWS, Google and Cloudflare. In fact, counter to your argument, Cloudflare got in massive shit when they did decide to play God.
At the very least, they should also provide ALL, as in every last byte, of data, schemas, code, setup etc. to the defenestrated customers. As in: "sorry, we cannot restart your account, but you can download a full backup of your system as of its last running configuration here: -location xyz-, and all previous backups are available here: -location pdq-".
Anything less is simply malicious destruction of a customer's property.
If you violate a lease and get evicted, they don't keep your furniture & equipment unless you abandon it.
Containers solve the easy problem, which is how to make sure the dev environment matches the production environment. That is it.
Replicating TBs worth of data and making sure the replica is relatively up to date is the hard part. So is fail over and fail back. Basically everything but running the code/service/app, which is the part containers solve.
> Sure, a backup would have been a significant improvement, but still – a backup only protects against data loss and not against downtime.
Assuming you have data backup / recovery good to go, the downtime issue needs to be solved by getting your actual web application / logic up and running again. With something like docker-compose, you can do this on practically any provider with a couple of commands. Frontend, backend, load-balancer -- you name it, all in one command.
> Containers solve the easy problem, which is how to make sure the dev environment matches the production environment. That is it.
Speaking of "patently false"...
They should have, at the very least, one DR site on a different provider in a different region that is replicated in real-time and ready to go live after an outage is confirmed by the IT Operations team (or automatically depending on what services are being run).
Yes they should.
How many 2-man shops do you think follow all the proper backup and security procedures?
How many horror stories need to reach the front page of HN before people stop believing this? Getting locked out of your cloud provider is a very common failure mode, with catastrophic effects if you haven't planned for it. To my mind, it should be the first scenario in your disaster recovery plan.
Dumping everything to B2 is trivially easy, trivially cheap and gives you substantial protection against total data loss. It also gives you a workable plan for scenarios that might cause a major outage like "we got cut off because of a billing snafu" or "the CTO lost his YubiKey".
Sounds like the opposite of the survivor bias. I don't believe it's any sort of common (though it does happen), even less that "it should be the first scenario in your disaster recovery plan"
Every week there's another article on HN about a tiny business being squished in the gears of a giant, automated platform. In some cases like app stores this is unavoidable, but there are plenty of hosting providers to choose from. People need to learn that this is something that can happen to you in today's world, and take reasonable steps to prepare for it.
You can't dot every I, cross every t and also build a compelling product as a 2 person shop.
I could not disagree more. There's a right way and a wrong way to do this, it's trivial to do it right, and the risks of doing it wrong are enormous.
Then it's unrealistic to trust them with your business.
We don't know the structure of their DB and whether failover is important or not, so we don't know if the DB can be reliably pulled as a flat file backup and still have consistent data.
We also don't know how big the dataset is or how often it changes. Sometimes "backup over your home cable connection" just isn't practical.
Cron jobs can (and do) silently fail in all kinds of annoying and idiotic ways.
And as most of us are all too painfully aware, sometimes you make less-than-ideal decisions when faced with a long pipeline of customer bug reports and feature requests, vs. addressing the potential situation that could sink you but has like a 1 in 10,000 chance of happening any given day.
But yes, granted that as a quick stop-gap solution it's better than nothing.
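On the point about cron jobs failing silently: one way to notice is to check, from somewhere other than the backup job itself, that the newest backup is recent enough. A rough sketch follows; the function name and the 26-hour threshold are assumptions chosen for a nightly job, and how you list backup timestamps from your storage is deliberately left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable


def backup_is_fresh(
    backup_times: Iterable[datetime],
    now: datetime,
    max_age: timedelta = timedelta(hours=26),  # assumption: nightly job plus some slack
) -> bool:
    """Return True only if at least one backup is newer than max_age.

    An empty listing counts as stale, which is exactly the silent-failure case.
    """
    times = list(backup_times)
    if not times:
        return False
    return now - max(times) <= max_age


# Hypothetical timestamps, as if read from a bucket listing.
check_time = datetime(2019, 6, 1, 12, 0, tzinfo=timezone.utc)
recent_listing = [check_time - timedelta(days=1, hours=9), check_time - timedelta(hours=9)]
stale_listing = [check_time - timedelta(days=60)]
```

Alerting when `backup_is_fresh(...)` comes back False turns "the backup hasn't run for two months" into something you hear about the next morning.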
I'm going to take a stab at small and infrequently.
Every 2-3 months we had to execute a Python script that takes 1s on all our data (500k rows). To make it faster, we execute it in parallel on multiple droplets (~10) that we set up only for this pipeline and shut down once it's done.
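To put rough numbers on that workload, here is a back-of-the-envelope sketch. It reads "takes 1s" as one second per row, which is an assumption; the quote does not say so explicitly. The function names are hypothetical and the split is a plain even partition, not necessarily how they distributed the work.

```python
from typing import List


def split_rows(total_rows: int, workers: int) -> List[range]:
    """Split row indices 0..total_rows-1 into contiguous, near-equal chunks."""
    base, extra = divmod(total_rows, workers)
    chunks = []
    start = 0
    for i in range(workers):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first chunks
        chunks.append(range(start, start + size))
        start += size
    return chunks


def wall_clock_hours(total_rows: int, workers: int, seconds_per_row: float) -> float:
    """Elapsed time is set by the largest chunk, since workers run side by side."""
    largest = max(len(chunk) for chunk in split_rows(total_rows, workers))
    return largest * seconds_per_row / 3600


serial_hours = wall_clock_hours(500_000, 1, 1.0)     # one machine
parallel_hours = wall_clock_hours(500_000, 10, 1.0)  # ~10 droplets
```

Under that assumption a single machine needs close to six days, while ten droplets finish in about 14 hours, which would explain the short-lived burst of droplets.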
Even if they lost some data, even if the backup silently failed and hadn't been running for two months, it's the difference between a large inconvenience and literally your whole business disappearing.
They had backups, but being arbitrarily cut-off from their hosting provider wasn't part of their threat model.
Isn't a big part of cloud marketing the idea that they're so good at redundancy, etc. that you don't need to attempt that stuff on your own? The idea that you have to spread your infrastructure across multiple cloud hosting providers, while smart, removes a lot of the appeal of using them at all. In any case, it's also probably too much infrastructure cost for a 2-man company.
keeping your production and your backups in the same cloud provider is the equivalent of keeping your backup tapes right next to the computer they're backing up. you're exposing them both to strongly correlated risks. you've just changed those risks from "fire, water, theft" to "provider, incompetence, security breach"
> So what is the purpose of the massive level of redundancy that you are already paying for when you store a file on S3?
You're paying to try and ensure you don't need to restore from backups. Our data lives in an RDS cluster (where we pay for read replicas to try and make sure we don't need to restore from backups) and in S3 (where we pay for durable storage to try and make sure we don't need to restore from backups), but none of that is a backup!
If you're not on the AWS cloud, S3 is a decent place to store your backups of course, but storing your backups on S3 when you're already on AWS is, at best, negligent, while treating the durability of S3 as a form of backups is simply absurd.
> I don’t think it’s terribly common for even medium sized companies to have a multi tier1 cloud backup strategy.
The company I work for is on the AWS cloud, so we store our backups on B2 instead. It's no more work than storing them on S3, and it means we still have our data in the event that we, for whatever reason, lose access to the data we have in S3. Who the hell doesn't have offsite backups?
This is not remotely the same thing. A RAID offers no protection against logical corruption from an erroneous script or even something as simple as running a truncate on the wrong table. Having a backup of your database in a different storage medium on the same cloud provider protects from vastly more failure modes.
> Who the hell doesn't have offsite backups?
No one. But S3 is already storing your data in three different data centers even if you have a single bucket in one region, and you also have SQL log replication to another region. Multi-region is as easy as enabling replication but that is only available within a single cloud provider (I can't replicate RDS to Google Cloud SQL, only to another RDS region). I would guess that a lot of people use that rather than using a different cloud provider.
That sounds like...the same argument?
A RAID array stores your data on multiple physical drives in the machine, but offers no protection against logical corruption (where you store the same bad data on every drive), destruction of the machine, or loss of access to the machine.
S3 stores your data in multiple physical data centres in the region, but offers no protection against logical corruption, downtime of the entire region, or loss of access to the cloud.
You can't count replicas as providing durability against any threat that will apply equally to all the replicas.
> > "fire, water, theft"
i'm sure you could add a few more things to the list.
not terribly common to understand risk.
Let's be fair: The threat model here is "lose access to our data".
This can happen in a number of ways: a lost (or worse, leaked) password to the cloud provider, the provider going bankrupt, a developer getting hacked, and a thousand other things.
Even if you trust your provider to have good uptime, there's really no excuse for not having any backups. Especially not if you're doing business with Fortune 500's.
Literally just push a postgres dump to S3 (or any other storage provider) once a night as a "just in case something stupid happens with my primary cloud provider". It'd take a couple hours tops to set up and cost next to nothing.
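A minimal sketch of such a nightly job, under stated assumptions: `pg_dump` and the `aws` CLI are installed and already have credentials, the database name and bucket are placeholders, and the bucket lives with a different provider or account than production. The helper names are hypothetical; schedule `run_backup` from cron or similar.

```python
import subprocess
from datetime import date
from typing import List


def backup_filename(db_name: str, day: date) -> str:
    """One dump per day, named by date so old dumps are not overwritten."""
    return f"{db_name}-{day.isoformat()}.dump"


def build_commands(db_name: str, bucket: str, day: date, workdir: str = "/tmp") -> List[List[str]]:
    """Build the dump and upload commands without running them."""
    name = backup_filename(db_name, day)
    local_path = f"{workdir}/{name}"
    dump_cmd = ["pg_dump", "--format=custom", f"--file={local_path}", db_name]
    upload_cmd = ["aws", "s3", "cp", local_path, f"s3://{bucket}/{name}"]
    return [dump_cmd, upload_cmd]


def run_backup(db_name: str, bucket: str, day: date) -> None:
    """Run dump then upload; check=True raises if either command fails."""
    for cmd in build_commands(db_name, bucket, day):
        subprocess.run(cmd, check=True)


# Placeholder names, for illustration only.
example_commands = build_commands("appdb", "example-offsite-backups", date(2019, 6, 1))
```

Because `check=True` raises on a non-zero exit, a failed dump or upload makes the job exit with an error instead of passing quietly, though you still want something watching for that.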
Also, by "two places" I meant the live DB and one backup that's somewhere completely different. My wording may have been confusing.