An update on last week's customer shutdown incident (digitalocean.com)
578 points by grzm 45 days ago | hide | past | web | favorite | 265 comments

That’s a really fair and reasonable response. Not sure what else people really expect here.

> The template used for response in account denial will be removed entirely. If account access is denied during an appeal, which often is the case as most appeals are true bad actors, the agent must create a reasoned response.

Glad this is seen as an issue and corrected.

IMO, this whole thing probably would never have escalated if a better response had been in place for everyone.

Accidents, shotty support, whatever — all expected these days unless you have big cash money agreements in place.

But to kill the account of a responsive person with a gigantic middle-finger email, without reasoning, was a pretty dumb process to have in place. You can see the email in the Twitter thread somewhere.

Glad it’s fixed! Still a DO fan here

Edit: TALKING ABOUT THIS: https://pbs.twimg.com/media/D76ocofXoAY_xB5.png

> That’s a really fair and reasonable response. Not sure what else people really expect here.

The root cause of suspension is incomprehensible to me. They were suspended because they launched a set of instances and these were using 100% CPU. How is that unreasonable and cause for suspension?

I'm not a Digital Ocean customer, but if I were, I'd expect to be able to use the resources I bought without risk of being suspended. This is the root cause. It was compounded by incompetent customer support, but I really do not understand the suspension cause.

The response tackles all secondary factors, but does not talk about the root cause. I'd expect it to.

Agreed. They say in the postmortem that it was protection against crypto mining, but what kind of weird reason is that?! If I want to pay for 10 instances mining crypto, why the hell wouldn't I be allowed to do that? I don't see why they should block any workload as long as the credit card details are valid.

It's a customer protection method. Most cryptominers are not using accounts they pay for. They compromise customer accounts and spin up resources. If you aren't proactive about communicating this to customers or blocking it, it can be quite some time before the customer notices and almost all customers will request a refund - even when the attack is a compromised password / successful phish on the customer's side.

Additionally, all cloud providers operate on various models of over-subscription. It is not in anyone's (customer / provider) interest to allow the full consumption of resources when the activity is fraudulent.

As you can see in the post-mortem, they are fine with the usage. They have a process and flag to allow legitimate customers to use their resources. However, based on previous experience at another cloud provider, I would bet that over 90% of those automated hits are correct.

This was bad support. They know that and they seem to be making the right moves to fix it. Fraud is bad for everyone and has to be combated. Not doing so can raise prices and kill a business like DO. I'm sure they feel awful that a customer was so poorly impacted, but the error wasn't in the first ban, it was everything after that.

Part of the whole issue here revolves around shared hosting, in my opinion. Host hardware is so oversold that one customer utilizing 100% CPU is so impactful to a handful of other customers that it's not allowed at all. I have seen providers that have terminated services for less than 100% CPU usage; a constant 90% is enough on some of them.

But due to the profit margins in shared hosting, providers are able to charge incredibly low prices per instance and oversell their hardware, sometimes by as much as 10 to 20 times. That's as many as four hundred customers on a box that should maybe have 20 if it weren't oversold at all. In this case it really is an instance of "you get what you pay for."

The service we provide is all dedicated plans with no oversold hardware. Some people are initially very turned off by the pricing, but the ability to let customers mine if they want to without affecting a single other customer on the platform, giving each customer the same experience regardless of any other instance's resource utilization, leads to much happier customers, even if it means smaller profit margins for us. At the end of the day, customer experience and the support provided are two of the most important factors in running a hosting provider.

While I disagree with aspects of Digital Ocean's business model as a shared hosting provider, I do think the response to this was more than appropriate, and better than would be expected of a lot of shared hosting providers, provided they actually implement the things talked about in the response.
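A quick sketch of the oversell arithmetic in this comment (the 10-20x ratio and the 20-customer baseline are the comment's own figures, not any provider's published numbers):

```python
# Toy oversubscription math using the parent comment's example numbers.
physical_capacity = 20   # customers a box could serve without overselling
oversell_ratio = 20      # upper end of the claimed 10-20x range

customers_per_box = physical_capacity * oversell_ratio
print(customers_per_box)  # 400, matching the "four hundred customers" figure

# The flip side: if every customer pegged their CPU at once, each would
# get only a fraction of the capacity they nominally bought.
worst_case_share = 1 / oversell_ratio
print(f"{worst_case_share:.0%}")  # 5%
```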

When you say we/us as a more expensive, but dedicated alternative, what is the cost difference as a percentage for say a small project?

Edit: found your site, looks like you’re cheaper than aws at a glance

If it was just the first point, the customer should be able to confirm that the activity was intended without even going through human review. It should be like when your bank texts you to confirm an odd transaction. They don't simply lock your account.

It sounds much more like it was the second point, which is unsettling. It's one thing to plan your pricing based on the assumption that most customers won't maximally-utilize. It's another thing to enforce a soft-limit that's vague and below what was advertised. I'd much rather have a lowered, known limit than whatever this is.

I totally agree, but unfortunately my bank (a major U.S. bank) does block a transaction and sometimes lock my credit card completely when they think the transaction is suspicious. There’s no confirmation mechanism, I have to call them to get the card unlocked. Of course, this usually gets resolved within five minutes (except that one time when I had to renew a .ng domain, and the Nigeria-originated transaction got auto-blocked three times in a row, and eventually the case had to be escalated to override their security mechanism entirely), not 29 hours.

> They don't simply lock your account.

Capital One did this to me once, and refused to restore the use of the blocked account even after I immediately called them and confirmed that the transaction that triggered the block was not fraudulent.

It was in combination with a lack of payment history. So if they had been paying it would not have triggered but they had been working off of credits instead. I think this point addresses your concern that paying customers should be allowed to mine.

I ran some really long compute jobs on GCP (100% CPU for weeks across many vCPUs) with credits without getting flagged. I was evaluating FFTW performance for a project. Perhaps GCP could tell I was calling into FFTW and not mining so they decided it wasn't fraud?

It makes sense for a company like DO to not allow crypto miners to use credits. Or else they would develop elaborate systems to create fake accounts and spin them up to mine.

Google can afford to eat the cost and perhaps has better heuristics to detect mining. And they definitely have better data to detect a single user signing up for multiple accounts.

Perhaps, or they viewed credits as payment history? I'm not defending the algorithm, as even DO has said it was a false positive. I just wanted to point out that this wasn't an attack by DO on paying for crypto; it was specifically trying to look at non-payers.

Having your instances run at 100% CPU pretty much raises a red flag at any cloud provider. Depending on your plan it either gets shut off (like in this case) or you get a notice about "suspicious" behavior and a bit of time to fix the "issue".

What's next? Having your disks use too much I/O causes the same response? Or actually using the RAM you pay for?

I run my own iron, with cloud only for elastic loads. Every time I launch a cloud instance, it will be using 100% CPU, otherwise I wouldn't launch it. It's unacceptable to label that profile as "suspicious". It never happened to me on AWS or Azure.

> ...you pay for

The major indicator here was the lack of payment history, so they hadn’t paid for it but were working off of credit. I think it’s a nuance that’s very important.

I'm sorry to dig in my heels, but that's no excuse. If the credit they were given allowed them to use the resources, it follows that using the resources is not a breach of contract.

From the description I imagine Digital Ocean offers a free period or tier, to reduce friction in customer acquisition. This is a marketing tool, and must not, in any way, cause situations like the one described.

If a marketing tool induces service failure, it has no place in a professional setting.

Credit and promo codes are also used extensively for fraud. If a business had been in operation for a while solely on credit, it may well generate a false positive in a fraud detection algorithm if it scaled dramatically.

But it is important to disconnect monetary spending from coupons or vouchers as they are not equivalent.

You mention free tier but that’s not what was at issue here. Also, 10 additional instances isn’t in the free tier of any cloud service I’ve used.

I’m not saying that DO is correct, but I believe the parent argument was a simplification of the events in question. Also, DO's handling of it via support was far worse than the initial algorithm, imo.

> But it is important to disconnect monetary spending from coupons or vouchers as they are not equivalent.

They must be. If they are not, then you've entered the territory I referred to, where marketing actions are impacting service availability. This impact is not acceptable in professional services.

In this specific case, if voucher giveaways produce ingress of resource leeches (cryptominers that will never result in real customers), and if it is impossible to prevent this undesired ingress without impacting existing customers (which it is), then that marketing action must stop. This is the conclusion I expected from the post-mortem.

Money is fungible and fiat, while vouchers are vendor-locked and not fiat; that's why they can't be evaluated the same.

I won't try to argue whether they should be removed in their entirety, that's not even an option I had even considered until now.

This is confusing though, since Digital Ocean credit can mean like a referral, or by prepaying your account - something I do to prevent billing overages.

Hardly the point.

Using what you've rightfully obtained shouldn't be regarded with suspicion.

That seems even more hyperbolic. Are you suggesting that no service should attempt to detect fraud?

Of course not.

Are you suggesting that 100% usage implies fraud?

There's a difference between suspecting fraud from high resource usage and equating high resource usage with fraud.

The latter is what is happening here, and it's outrageous.

That's a simplification of what was happening. It was a combination of indicators that they list:

- A large increase in number of nodes

- All nodes using 100% of CPU

- AND a lack of payment history

I'm merely saying that the lack of payment history is an important indicator of suspicious activity. 100% usage by-itself was not the primary indicator that their article discusses.
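As a rough illustration, combining those three indicators might look something like the sketch below. The structure and thresholds are my own guess at the shape of such a rule; DO hasn't published its actual algorithm.

```python
# Hypothetical fraud heuristic combining the three indicators above.
# Thresholds are invented for illustration; DO has not disclosed theirs.
def flag_for_review(new_nodes: int, avg_cpu_pct: float, paid_invoices: int) -> bool:
    """Flag only when ALL three signals fire, as the postmortem describes."""
    sudden_scale_up = new_nodes >= 10
    pegged_cpu = avg_cpu_pct >= 95.0
    no_payment_history = paid_invoices == 0
    return sudden_scale_up and pegged_cpu and no_payment_history

# A paying customer at 100% CPU is not flagged:
print(flag_for_review(new_nodes=10, avg_cpu_pct=100.0, paid_invoices=6))  # False
# A credit-only account suddenly spinning up 10 pegged nodes is:
print(flag_for_review(new_nodes=10, avg_cpu_pct=100.0, paid_invoices=0))  # True
```

Note that under a rule shaped like this, 100% CPU alone is never sufficient to trigger a flag, which matches the distinction the comment is drawing.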

I can assure you we run AWS instances at 100% with no problem at all. (Well, no problem from AWS; sometimes it's caused by a software bug.)

That's not true. A proper cloud provider (AWS or Azure) would not bat an eye because the CPU is frequently pegged at 100%.

That sounds odd to me. Especially given that digitalocean is the default dynamic provider for Gitlab CI builds which _will_ run droplets at 100% CPU.

From what I understand, they’re not saying you’re not allowed to use 100% for (what their user agreements define as) legitimate uses. They’re saying several droplets suddenly created and immediately going to 100% flags them as suspicious activity for human review. Looks like after such review, they would flag them as legitimate and all would be fine, 100% CPU or not.

They’ve botched that second step though.

That doesn't make sense to me. You pay for the time you have the droplets running, so it seems kind of silly to have them sit idle for a bit before you give them work to do.

I don't work at a cloud provider, but I think the reasoning is:

It's a common pattern in malicious actors to immediately spin up several droplets and immediately peg the CPU on each one.

There are, obviously, non-malicious actors who do the same, but it's a bit like wearing a balaclava in public: Likely to raise some suspicion just because it's associated with malicious actors.

Not sure what the materialization of that suspicion might look like -- competitors trying to crush DO's business? mass account creation or mass fraudulent logins? "mining crypto"? What I could come up with felt like quote-unquote legit grounds for a timed suspension, but only instinctively so.

> I'd expect to be able to use the resources I bought without risk of being suspended

The resources weren't bought at the time; they were on credit. In this case, a false positive for sure.

In the case of an actual cryptominer it's more likely they'll just ditch the account when it comes to billing time. Even more likely is that it's a compromised account that someone else has to pay for

I can't really fault their postmortem or their response on HN. The corrections are all good, but the very fact that these things need to be corrected (automatically locking the entire account when there is a compute spike, having such a casual review process before permanently denying access to an account, not having 24/7 support after locking an account, etc.) makes you question their overall maturity as a B2B infrastructure provider.

Sure, it's better to never make a mistake, but so long as they don't make a habit of things like this, I'm not going to think anything of it until I see more cracks in the wall.

A screw-up is inevitable. A mature response is not. So the fact that they gave a mature response goes a long way. Although it's unfortunate that social media seems to be their emergency support channel...

> Sure it's better to never make a mistake but so long as they don't make a habit of things like this

This is the thing - the customer that got locked out managed to get attention on HN, Reddit and other media - this seems to have prompted action from Digital Ocean.

How many have silently fallen victim before this? We don't really know if this is a habit or not; we only know this one customer's situation was corrected.

Based on this post, Digital Ocean is taking specific measures company-wide to prevent similar issues from affecting any customer in the future. So they did not just correct the situation for this one customer.

> declined to activate it

Except they were declining to unlock it, right? I’m always shocked to see support that’s so pitiful they don’t even bother to have a correctly worded template for a common event.

The real problem is support reps that aren’t trained properly and don’t even care enough to apply a bit of common sense. Getting rid of a response template doesn’t automatically make the support reps care enough to apply common sense.

How about a “don’t fuck me” support tier where I can pay a one time $100-$250 fee for the sole purpose of getting a phone call before my account gets banned?

The real problem is most definitely not the support rep. They don’t really go off book. This is the process as designed and approved by higher management, not by low-paid first-level support (unless you assume they have some top-level engineer doing this stuff).

And going off process could make it better... yay, self pat on the back. But it could make it worse in which case I see unemployment in the support rep’s future. So they won’t go that way very often.

Anyone who ever had such a low positioned job knows how it works. At that level your only freedom is to do what you’re told and follow company process.

No, this is the fault of the manager who asked for this process and their manager who approved it. Management isn’t just about picking up a higher paycheck, it’s also to take the accountability for the decisions made under your watch.

> That’s a really fair and reasonable response. Not sure what else people really expect here.

If you nuke VMs, under no circumstances do you also nuke access to data, backups, etc.

Because if it wasn't for "social escalation" (aka: mob justice via HN and Twitter), this 2 person company would have lost everything.

If you terminate a customer for $reasons, the data still belongs to the user, and not the company. And the company should still be legally required to provide the data on a reasonable timescale, like FTP access for 7 days.

While you're swinging that legal word around, have an armchair lawyer skim DO's Terms of Service.


Summary: Do offsite backups n'all you dinguses

You're right. It's good to have some expectations of the company, but customers really need to take the TOS seriously.

It's yet again a case of the ToS hiding crap that goes against direct expectations, like "Backups".

Sure the ToS needs legalese crap for the lawyers, but a plain version also needs to be made. I'm certainly no lawyer, and nor are most people.

It’s not really hidden and they do have an easy to read non-lawyer summary underneath the part I quoted

> In other words, we trust that you’ll be responsible and back up your own data. Things happen!

It doesn't take a college degree, in law or otherwise, to understand that data in one place - whether that's physically or under the umbrella of a single service provider - is subject to unpredictable, unexpected, total loss.

I agree it's a generally good response. There are a few more things I'd like to see more clearly addressed:

* While the removal of the account termination template is good, in conjunction with additional hiring to support more attention to any individual ticket, I can't tell by whose standards the "reasoned response" is gauged, or if the response is reviewable at all. I did note that they now want two human reviewers, but that's distinct from specifying a process in which a reasoned response is articulated and reviewed.

* More importantly, if the reasoned response doesn't pass muster with the customer, what's their resort? Still Twitter-shaming? I suppose that's legit if they'd rather their mistakes were public like this.

* The question of whether an account-wide lockout w/ no data retrieval is a necessary/proper consequence for those flagged for CPU abuse needs addressing -- ideally they should have a different policy that allows for data egress (with bandwidth fees, if necessary), but if not, a rationale and clear policy might be acceptable.

Back in the days before Twitter, folks wrote to the CEO or other senior executives as a last resort. Might still be effective in some cases.

> shotty support

"shoddy", for what it's worth.

Woop. TIL - Thanks!

How tremendously forgiving.

Their apparent conclusion that high CPU% for a few hours or half day or whatever means "cryptocurrency miner - ban ASAP!" is naive and flawed.

Compute offload is an ancient and fairly common use case for the public cloud; my VPS (or ten..) should be able to burn 100% CPU for many hours compiling a large project, even if it means they make less profit than they would have had I instead run a static web server that sleeps on IO, imposing nearly no CPU load.

At the very least they should provide some objective, quantitative guidance on exactly how many CPU-seconds-per-hour they consider acceptable/not-abuse (or, if not CPU-seconds, then increased host power consumption, or whatever they are ultimately trying to limit to ensure they can pack a few hundred near-zero-load servers onto the same host to make glorious truly massive profits all the time).

Don't make customers guess at whether their workload will trigger some opaque but hyper-aggressive abuse automation or not.
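For example, the kind of objective limit being asked for here could be expressed as CPU-seconds per hour. The 1800-second cap below is an invented number, purely to show the shape of such a rule:

```python
# Hypothetical published limit: CPU-seconds per hour per droplet.
# The 1800 figure (a sustained 50% of one core) is made up for illustration.
ALLOWED_CPU_SECONDS_PER_HOUR = 1800

def within_limit(avg_utilization_fraction: float, cores: int) -> bool:
    """Compare actual CPU-seconds consumed in an hour against the cap."""
    used = avg_utilization_fraction * cores * 3600
    return used <= ALLOWED_CPU_SECONDS_PER_HOUR

print(within_limit(0.45, 1))  # True  -- 1620 CPU-seconds used
print(within_limit(1.0, 1))   # False -- 3600 CPU-seconds used
```

A rule like this would let a customer check their own workload against the limit before deploying, instead of discovering it via an abuse lock.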

I think the heuristic they use is spike of high cpu + non-established billing history, not just CPU. That seems to me much more indicative of potential fraud, though by no means foolproof.

Indeed, but AFAICT there are apparently still some opaque, undefined CPU% limits for people paying with a CC instead of free credits. They also mentioned elsewhere that customers paying via PO are exempt from the automated miner murderer, but that was news to me and I guess just furthers my point: we shouldn't have to trawl HN threads to understand your CPU% abuse limits; they should be spelled out specifically and quantitatively in the main TOS, for each type of payment method and any other factor(s) that affect them.

They point out that automatically terminating compute is a bad idea that they will no longer be doing in most cases.

"Services that result in the power down of resources will no longer automatically take action on any account, regardless of lack of payment history, for accounts that were engaged more than 90 days prior. These cases will be escalated for manual review"

That's a good idea, though I would still prefer to understand their detailed CPU% abuse criteria pre-deployment rather than via just-try-it-and-see-what-happens. Secret rules are a problem, no matter if the enforcement is automated, manual, or some hybrid of the two.

I think you are missing the point. They don’t care if you use 100% of CPU. What they don’t want you to do is use 100% CPU and not pay the bill.

Not that I am defending their actions and perma-ban.

> They don’t care if you use 100% of CPU

They clearly do, at least for some subset of customers meeting various quasi-secret criteria.

> they don’t want you to do is use 100% CPU and not pay the bill.

The account here was fully paid up, albeit via credits that they issued rather than via USD. Regardless, it was not past due, so, the high CPU% was the mortal sin.

Reading this response, it seems that crypto-mining is not allowed on Digital Ocean, as they have checks against it. The TOS doesn't say so explicitly but does note that:

>violation of any of these Terms of Service or any law, or if you misuse system resources, such as, by employing programs that consume excessive network capacity, CPU cycles, or disk IO

By my reading that seems to mean that you're not allowed to use your VMs to their full capacity due to them being over-provisioned. This is in contrast to AWS who are more explicit on which instances (T instances) are over-provisioned and exactly how they're throttled.

If you want to do cryptocurrency mining on DO, that is actually okay with us. Some of the other respondents are correct: the behavior we were looking for was really around fraudulent accounts being created and performing cryptocurrency mining. This is why the trigger that flagged this account used payment history as a key factor.

The thing that has me scratching my head is how this chain of events unfolded.

I get that your fraud algorithm flagged it because of a lack of established payment history. But how is that possible, given what the tweet referred to as "locking us out of all of our backups and work"? Surely an account history of any significance would have an established payment record. From their tweets, they mention that they had 5 droplets, storage with a not-insignificant number of records (~500k), and a script that has to be run every 2-3 months to process some data, spinning up 10 droplets while it runs. It seems like it would take about 13 hours to process the data based on row count and per-record time.[0] I am struggling to see how they didn't have payment history. Can you elaborate?

In addition another thing I'd think would help assuage fears of a complete lockout is some process where you can request and download the db or a snapshot of the virtual machine.

[0] https://twitter.com/w3Nicolas/status/1134529322902007809
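The 13-hour estimate only works out under an assumed per-record cost. A back-of-envelope check (the ~0.94 s/record figure is my inference, not something stated in the tweets):

```python
# Back-of-envelope check of the ~13-hour processing estimate.
# seconds_per_record is an assumption chosen to match the quoted figure;
# the thread gives only the row count (~500k) and droplet count (10).
records = 500_000
droplets = 10
seconds_per_record = 0.94  # assumed, not stated in the thread

total_seconds = records * seconds_per_record / droplets
print(total_seconds / 3600)  # roughly 13 hours
```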

> If you want to do cryptocurrency mining on DO that is actually okay with us

Do you disclose this anywhere? Are there any special steps one could take to avoid issues while doing legitimate mining?

Your post-mortem implies this is not allowed at all.

> Your post-mortem implies this is not allowed at all.

Not sure why you were downvoted, I had the same impression, after reading:

...an automated service that monitors for cryptocurrency mining activity (Droplet CPU loads and Droplet create behaviors). These signals, coupled with a number of account-level signals (including payment history and current run rate compared to total payments) are used to determine if automated action is warranted to minimize the impact of potential fraudulent high-cpu-loads on other customers

This sounds like they don't permit extended high CPU loads due to the impact it can have on other customers.

Cryptojacking is a well-known, major problem for cloud compute providers. Catching and squashing new exploits that allow people to create a fresh account, run up compute bills and then abandon the account without paying is very important.

My guess would be that this is such a well-known problem (within the field of cloud compute at least) that they just didn't think they had to state that normal crypto mining by paying customers is completely fine.

Is ‘normal’ crypto mining in the cloud even profitable, compared to custom designed hardware?

Depending on the cryptocurrency's proof-of-work algorithm and new-ness, it can be profitable to mine in the cloud. I've done it briefly in the past. But generally it's not profitable.

In every cryptocurrency (the popular and functional ones anyway), there's a set global rate of mining rewards. All miners compete for a slice of that reward, so as more people mine, each individual miner gets less reward. (This causes an equilibrium to be reached where more people mine until it's no longer profitable for more people to start mining. If mining becomes unprofitable, some miners will drop out, and the remaining miners will each make a little more.) If masses of people realize that cloud mining for a particular cryptocurrency is profitable, then what generally happens is that lots of people pounce on cloud providers to mine, it becomes barely profitable, and then people operating their own hardware that's cheaper than cloud providers come in and push the mining rewards down to where it's no longer profitable for people to cloud mine.

Because cloud mining is never profitable in the long run, most cloud mining that happens is fraudulent activity using stolen cloud accounts or payment info. (If you're not paying for it, then making any amount of money from it is profitable.)
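The equilibrium argument above can be put into a toy model (all figures below are invented; no real coin's parameters are used):

```python
# Toy model of the mining equilibrium described above.
# All numbers are made up purely for illustration.
def hourly_profit(my_hashrate, network_hashrate, reward_per_hour_usd, cost_per_hour_usd):
    """A miner earns a share of the fixed global reward proportional to
    their share of total hashrate, minus their operating costs."""
    share = my_hashrate / network_hashrate
    return share * reward_per_hour_usd - cost_per_hour_usd

# With little competition, cloud mining looks profitable:
print(hourly_profit(100, 1_000, reward_per_hour_usd=50, cost_per_hour_usd=2))  # ~3.0
# As more hashrate piles in chasing that profit, the same rig's share shrinks:
print(hourly_profit(100, 5_000, reward_per_hour_usd=50, cost_per_hour_usd=2))  # ~-1.0
# ...and for a fraudster whose cost is zero, any positive share is "profit":
print(hourly_profit(100, 5_000, reward_per_hour_usd=50, cost_per_hour_usd=0))  # ~1.0
```

The zero-cost case is exactly why stolen accounts dominate cloud mining: the reward can fall below everyone's real costs and the fraudster still comes out ahead.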

It depends on what crypto is going to be mined and how many accounts can be stolen, given that there is already a plethora of bots that look all over GitHub for accidentally committed credentials. Heck, just a year ago people did scans for outdated WordPress installations to inject, among other things, some JavaScript (!) Monero miners [0]…

[0] https://arstechnica.com/information-technology/2018/01/more-...

No. Cryptomining represents an arbitrage opportunity such that the spot price of the instances should be adjusted. In the long run it should not be profitable.

No, that quite clearly states that they treat high CPU loads as suspect on accounts without an established good payment history or if it significantly deviates from previous usage patterns.

The keyword here is 'fraudulent'. High-cpu-loads is allowed, but an automated service monitors for fraudulent activity.

What would "fraud" mean in this context? Are they talking about customers who don't pay their bill to DO? (If so, seems like the account should just be temporarily suspended until the bill is paid.) Or are they talking about fraud to other parties, like phishing sites? (If so, I don't see the connection to crypto mining.)

My understanding is that they're trying to prevent users from creating new accounts, running 100% CPU until it's time to pay the bill and then just not paying, moving on to another new account.

edit: from elsewhere ITT it seems they're doing this with stolen credit cards.

Obviously they don't verify if the load is fraudulent. Otherwise this whole debacle couldn't have happened.

Blog post did mention that accounts with high CPU usage and payment history won't be flagged.

Here was my key takeaway:

"Cryptocurrency mining mitigation detects suspicious behavior, including very high CPU utilization on an account with no payment history, which results in an account lock"

Lots might have been done wrong here, but it sounds like they had an account with trial or promotional credits - I can see how this could easily be abused.

Completely. At the same time, those promotional credits are going to be used by guys like me who will have to decide if their services are worth having to spend an extended period of time explaining why "we're not recommending AWS/Azure/Goober"

An account shutdown, or enough complaints from verifiable sources, and I'm not going to the trouble. Not to pull out an old trope, but "nobody got fired for picking IBM" really is the case here: a big name pulls a move like this and the customer, who likely came in with AWS in mind (and in some cases was advised against it and insisted on it), is going to shrug their shoulders. Pick a provider that the customer hasn't heard of and I'm going to get a phone call that goes something like "You're the one that said we should use that basement operation!" with raised voices. Heck, the last time there was an Azure outage, we didn't hear from most of our customers. It was so impactful that even customers well outside of software development/technology read news articles and connected the dots. I had one customer tell me he thought it was just their corporate internet connection; they assumed Azure was working.[1]

[0] Plus, too lazy to put in the research; sorry.

[1] They were a customer who insisted on doing the app monitoring, themselves -- that guy was getting the alerts and similarly assumed it was the network since that happened regularly with another application they developed -- the monitoring server was on-prem.

Sounds like it's designed to counter stolen credit cards.

An attacker might load a stolen credit card number into an account and only use enough resources to generate a few dollars worth of billing. The owner of the credit card might not notice the small charge.

Then after a few months of low billing (to bypass a previous heuristic), they ramp up the utilization, mine a bunch of coins then the holder gets a massive bill.

The holder does a charge back and DO is left holding the bill.

It's also designed to keep everyone relatively happy in a shared-CPU environment. "Standard" droplets share CPU with others on the same node, so one droplet pegging the CPU 24/7 can be problematic.

AWS doesn't have this problem because either your instance is allowed to use all the CPU that's allocated to it, or else (t2 & t3) the platform will automatically limit your CPU usage. You don't have to care about how your usage affects other people. It's one more thing that AWS abstracts away. DO's abstraction, on the other hand, is rather leaky in this area. That's a problem in and of itself, in addition to the matter of credit card fraud that every company has to deal with.
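For context, the t2/t3 throttling mentioned here works on a CPU-credit budget. A simplified sketch, using t3.micro-like numbers (12 credits/hour, 2 vCPUs) that should be treated as assumptions rather than authoritative values:

```python
# Simplified sketch of AWS's CPU-credit model for burstable instances.
# Figures resemble a t3.micro but are assumptions; the real mechanism
# (launch credits, unlimited mode, etc.) has more moving parts.
CREDITS_EARNED_PER_HOUR = 12  # accrued whether the instance is busy or idle
MAX_CREDIT_BALANCE = 288      # balance is capped (here, 24 hours of accrual)

def balance_after(balance_credits, vcpus_at_full_load, hours):
    """One credit = one vCPU at 100% for one minute."""
    earned = CREDITS_EARNED_PER_HOUR * hours
    spent = vcpus_at_full_load * 60 * hours
    return min(MAX_CREDIT_BALANCE, balance_credits + earned - spent)

# Pegging both vCPUs burns 120 credits/hour while earning only 12, so a
# full balance drains in a few hours -- at which point AWS throttles the
# instance to its baseline rather than suspending the account.
print(balance_after(288, vcpus_at_full_load=2, hours=2))  # 72
```

The design point being made above is visible in the model: the worst case for the customer is throttling, which is predictable, not an account lock.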

I don't think it's that mining isn't allowed, but that they have seen that fraudulently acquired resources are generally used for crypto-mining, so they felt this was a good signal to look for.

Right. Reading between the lines a bit, it's not the activity itself that DO is worried about, but a pattern of usage that suggests that the account may have been created fraudulently or compromised.

This is correct. This was the primary thing we were attempting to solve for in this case and the bug in the algorithm started the chain of events documented in the postmortem.

Cryptocurrency mining is a favourite of fraudsters. Get a fake or stolen credit card number / identity, sign up for an account on a cloud provider, and spin up instances to do your mining. Depending on how quickly the provider reacts to the pattern of behaviour, or identifies the account as fraudulent etc, you can have earned a reasonable pay-off.

If the provider isn't paying attention and/or doesn't have good fraud detection in place, it may be a few months before your account and resources are terminated (assuming the payments eventually bounce, and the provider gives you a chance to fix it)

I believe the purpose of these checks is for when people's outdated wordpress install or whatever gets compromised by script kiddies. Generally, the scripts install crypto miners to mine for the hacker until your account gets shut down, running you up a huge bill in the process.

I think it's moreso people continually creating new accounts to get free credits. Then bitcoin mining turns the credits into actual money.

Anecdotally, I had a python script unintentionally using 100% for a month or two and never heard about it or noticed until I happened to look at top.

Counter-anecdote slash venting:

I recently logged into my long-dormant DO account to kick the tires on their now-GA managed Kubernetes service and to contribute some CPU cycles to a distributed computing project I'm interested in (not cryptocurrency, I swear).

I first requested and was approved for a droplet limit increase (to 25). I started 20 nodes, deployed my very CPU intensive workload, and 12 hours later my account was locked and my nodes all went NotReady.

I immediately replied to the abuse ticket explaining my usage. 4 days later, my account was unlocked and received the "allow high CPU" flag... but they billed me for the nodes as if they had been running that whole time. I asked for a credit (5 days ago) and they haven't replied yet. Probably a little busy over there right now.

So... I'm not too thrilled with DO. I get that cryptocurrency ruined everything, but this has been a frustrating experience and I'm glad I was only using it for a pet project.

Hey xnxn - Thanks for raising this issue. I'm Zach, Director of Support at DO. Please send me an email with your account details (first name at) and I'll take care of this.

Customers should not have to resort to shaming you publicly to get timely support responses.

Two points establish a line.

Hey Sneak - I totally agree :) We've already started efforts between Support and Security leadership to leverage the 24/7 structure of Support. Our goal is that no one will ever need to use social as an escalation path, and our new Support Engineers who are joining in mid-June and early July will be part of making this a reality for our customers.

Fyi, I'm planning to migrate off of DO due to the crummy payment options. With every other recurring service I can have a PayPal subscription, DO doesn't support PayPal properly though.

Also, your billing system is incredibly spammy (1 to 2 emails a day at times). The bills I get are always for partial months despite the VMs on the account running for a full month.

Hey metildaa - Zach here from DO support. It is accurate that we do not currently support PayPal subscriptions, but this certainly isn't the overall experience that you should be having. Can you shoot me an email (first name at) and I'll look into this for you?

Thank you! Zach

Thanks Zach, will do.

Email received, followed up on, and account credited.


Really? What is so bad about your support that you needed to hear about this from HN before you did anything?

The postmortem...they admit that they need to hire more support staff.

That's my point. Their support got so bad that they had to be called out on in publicly before they did anything. That's hardly what you want to hear from a company like DO.

> That's my point. Their support got so bad that they had to be called out on in publicly before they did anything. That's hardly what you want to hear from a company like DO.

That's what you get with most providers these days. Ever tried reaching Google or Amazon?

Actually I have complained to Amazon... and they have always responded within a couple days and in one case where I complained about Prime Video, they called me on a Sunday, which shocked me because everything is otherwise closed in Austria, except gas stations, restaurants and hospitals.

Google on the other hand, I make complaints or suggestions once every couple months and 99% of the time I don't even get a boilerplate response.

I don't believe that Google or Amazon have ever done anything like what DO did before. However, if you notice their support guy Zach was responding to a completely different incident that is not related to the one being addressed with their post-mortem.

If you pay for enterprise support with Amazon you can open a case with a less than 15 minute SLA [1]. With business that goes down to an hour SLA. With both plans you can create chats or calls that typically get answered within a matter of minute with and assigned to an engineer with a background in the service.

[1] https://aws.amazon.com/premiumsupport/plans/enterprise/

Amazon and AWS. I've always been happy with the support I've received from them. It's always been easy.

Amazon support is from everything I've heard and experienced great.

So don't lump everyone together in an attempt to paint DO in a better light when it's not true.

I have no relation to DO, but I'm surprised by the negative responses in this thread. I can't think of any other major company conducting a public postmortem for a customer service failure (as opposed to networking/ops failure). Not only are they changing their policies across the board, taking on more risk to improve customer experience, but they are hiring extra people so it does not happen again. Kudos for that!

And of course DO will still retain the ability to suspend your account for suspected fraud - that is the case with any cloud services company, and any online business in general (check your ToS). Again, I can't think of any business that will en masse promise to never react to any fraudulent users. It's how this process is performed that matters and that's what they are improving.

> I'm surprised by the negative responses in this thread.

That is actually a general pattern. Negative, dismissive responses come fast because they're reflexive. They don't require processing significant information, nor reflection, nor thoughtful writing—all of which take time. Therefore they are the first to appear.

After that, a second wave of responses gets triggered because people read the first wave and are dismayed by how negative it was. That is the contrarian dynamic: https://hn.algolia.com/?query=%22contrarian%20dynamic%22&sor....

Yep. I haven't had the best experience with DO previously (not terrible, just not great), and was vocal in the original thread about this, and I think this is one of the best reactions to a customer service failure that I've seen from a tech company.

This wasn't a failure that impacted thousands of customers [#]. DO could've just fixed things up for this customer and said not a word more, and changed nothing, and everyone would've forgotten about it by about ... now.

Instead they dedicated a nontrivial amount of resources to understanding what went wrong -- identifying not just a single cause, but several -- and publicly explained what happened, without a lot of weasel words, and what they're going to do to fix it.

It's an awesome response.

[#]: ...at that specific time. Yes, others have probably previously been impacted.

This happens every time a post-mortem is posted on HN. It is often dismissed as a PR move, or ass-covering, or lies.

I haven't been around the block enough to know, but I'll take your word for it.

What I don't understand is why anyone would complain about a post-mortem as PR. If that's what DO was up to here, I'm sold—it shows transparency, thoughtful problem analysis, and swift execution.

More than what this means to me as a customer, I could surely stand to learn a lot in my own work from the way they approached this situation.

A PR move isn't worth anything unless they actually follow through. I suppose many people are skeptical that things will actually improve as PR is often all talk and no actual action. Not saying that's the case here, but it certainly is difficult to trust a company purely by what they say for PR these days.

There's no amount of money you can pay them to get decent support, that I'm aware; at least, if you could, you can't just pay them upfront. They screwed up their weird bespoke network configuration system (which they inexplicably use in place of DHCP) on FreeBSD, and their only support output was: lol, blow away your instance and start a fresh one.

You tell me why a company like theirs, which should be mature, ditched their proprietary support and ticketing system (which actually worked) for a shoddy, misconfigured off-the-shelf product which has equally little excuse for being that bad.

I think it's customary here to be unfair to companies, even when they're doing "the right thing" as DO is right now; but DO has spent their customer patience budget elsewhere, so I'm not going to be surprised if people view the post mortem as an inadequate replacement for getting it right the first time, rather than a followup to an honest mistake they will actually try hard to avoid in the future.

> There's no amount of money you can pay them to get decent support, that I'm aware;

We at NodeBB have a high enough spend to have access to level 2 agents (their responses are around 1 hour turnaround, if not sooner).

We don't actually spend that much either, compared to what some of our clients pay AWS, etc...

Now, we certainly don't qualify for their highest tier with the dedicated support manager and slack access, but that's ok. DO's been amazing to us as a host.

Yeah, I'd just like to be able to pay for the support, without having so much volume. I can't really consider sending them enough business that we'd qualify for level 2 agents if I run into regular technical issues with the services before I even get that chance.

Is it bad for image if they just have a "send a couple hundred bucks because you need to speak with someone real bad" button?

What is the high enough spend average to get Level 2 agents if you don't mind sharing ?

I wonder this too :-)

Their pricing page reads: "world-class technical support to all of our customers—around the clock" (https://www.digitalocean.com/pricing/#Included_services ), by the way, which doesn't seem totally accurate to me given the 12h and 29h response times in this account-ban case.

We qualify for their "Business Support" tier: https://www.digitalocean.com/support/#BusinessSupport

Hey, sorry for the delay -- the info sheet they provided us lists $500 monthly spend.


DO has been amazing for us too.

I don’t understand how the account did not have payment history. Hadn’t it been in operation for quite some time?

I think the account used startup credit.


The tweet thread suggested they'd been using DO for years. So either DO is handing out long-running credits to startups without knowing who they are or what they're doing, or they messed up simple logic about the customer's history.

Similarly, as I have skin in the game and didn't want to get blocked by DO, I was expecting a muddy response and ready to make a throwaway account and complain about "keeping processes opaque to give carte blanche to take whatever arbitrary action they like" in the normal complaint about deplatforming and tech censorship, but then I read their document and was surprised and encouraged by how transparent DO were. Congratulations DO!

> I can't think of any other major company conducting a public postmortem for a customer service failure (as opposed to networking/ops failure).

There are many companies that have done this in the past. They are not doing this out of the goodness of their hearts; this is lip service prompted by the fact that their mishap blew up in their faces on Twitter. Do you really think they would have gone at length to highlight this incident to the public had it not gone viral?

> Not only are they changing their policies across the board, taking on more risk to improve customer experience, but they are hiring extra people so it does not happen again. Kudos for that!

There is no telling that they are actually going to follow through with anything. Mere lip service.

The bottom line is, people host their businesses and livelihoods on cloud providers and they (the cloud providers) should take the necessary care and precautions when taking destructive actions. Maybe err on the side of the customer instead of shutting down someone's entire business because of some automated heuristic. Maybe have a better response time than 29 hours. Maybe teach basic communication and develop processes so that care agents can see and react appropriately to recent activity on the account. These are not revolutionary concepts, they are simple things that demonstrate customer care, something DigitalOcean is sorely lacking.

> precautions when taking destructive actions.

No data was lost; it is not destructive in any way.

> because of some automated heuristic.

If the customer had "payment history" none of this would have happened. Probably it was being used under "startup credits"

> people host their businesses and livelihoods on cloud providers

people shouldn't run their entire operation on credits and then blame DO on Twitter.

The only issue is that DO took 29 hours; apart from that, I see no problem with DO.

> people shouldn't run their entire operation on credits

Why not? Until now, I wouldn't have considered that using credits might make me a second-class customer. They should at least be upfront about that.

> No data was lost; it is not destructive in any way.

Except in the way that the guy may[1] have lost customers or revenue due to the downtime. Being offline, even without data loss, is very destructive for many businesses.

[1] I don't know anything about his business.

> No data was lost; it is not destructive in any way.

Tell that to the owner who was begging DO for their data back on Twitter. Again, had this not blown up on twitter nothing would have been done.

> If the customer had "payment history" none of this would have happened. Probably it was being used under "startup credits"

What's your point in saying this?? The fact is that the customer faced downtime because of a bug in DO's code.

> people shouldn't run their entire operation on credits and then blame DO on Twitter.

Are you saying that customers on credits aren't subject to SLAs?

> The only issue is that DO took 29 hours; apart from that, I see no problem with DO.

I think you seem very biased.

It should be pretty hard to shit down a legit biz. Seemed automatic in this case.

I mean it is. Unless your business is wholly dependent on a service from my business.

Edit: shut

What? We use a few million dollars in GCP credits every month.

"There are many companies that have done this in the past."

Haven't seen those, can you point me to them? I love companies doing this.

There is also no telling if they're not going to follow through, and it is not mere lip service.

What cloud business do you run that does better, according to your standards?

Key takeaway for me: DigitalOcean might still kill your account at any time, revoking your access to your data.

Refusing service is fine, but holding my data hostage and refusing access to it is not, so I am making a note to not consider DO for any kind of hosting.

I am curious which alternative service providers you use that offer guarantees to never kill your account, and to provide copies of data stored on their infrastructure.

(As noted by many observers during the initial event, anyone keeping their data exclusively within one organisation's walls is making a profound mistake.)

I don't know if it's a promise, but I know from experience that AWS won't kill your account when they automatically detect fraud-like increased usage. They send warnings that the account will be shut down in several days.

Because shutting down running servers without warning is completely unacceptable for a B2B infrastructure provider.

> I don't know if it's a promise ...

I suspect that it almost definitely is not.

As IT professionals we should do better than use words like 'kill' to describe system and account changes.

As I understand it, Digital Ocean suspended the account, and because the (perceived) problem was related to excessive / potentially fraudulent CPU usage, they suspended the machines. The data contained on them was not deleted. Does that match your reading?

As someone who suffered from extremely noisy neighbours in AWS (in the very early days), risking significant damage to the performance guarantees to our customers, I'm actually cautiously happy with automated protection systems. Naturally I'd rather noisy neighbours were throttled so that I never heard them, and I expect that's closer to what happens these days.

I think 'kill' is appropriate here. They shut down access to his company's account along with all running infrastructure and after escalating multiple times, their final official response (after 30hrs of silence) was that his company was permanently denied access to the account (DO calls it 'account termination' in their postmortem). The first tweet of the thread that went viral was 'How Digital Ocean just killed our company' [https://twitter.com/w3Nicolas/status/1134529316904153089].

I don't want to spent too much time dumping on them since they clearly know how badly this situation was handled, but this is an example of terrible automated protection leading to a company that's not enterprise-ready. As you say, AWS probably doesn't publicly promise not to terminate your account, but this would never happen because they understand that availability and security matter more than anything else when running B2B infrastructure.

On review, the wording is quite vague.

From TFA:

> Shortly thereafter, DigitalOcean investigated the issue and the Raisup account was unlocked and powered back on.

But it's not clear if any data was deleted by DigitalOcean.

The suggestion the account was unlocked rather than re-created suggests it was not, but OTOH there's no reference to erase, delete, restore, or indeed current state of customer data in that post mortem.

The fact that data didn't get deleted is irrelevant if the customer can't actually get at that data.

That the customer got unlocked is of course a good thing, but for at least 30 hours the customer couldn't access their data, that's highly problematic.

Nothing was deleted or removed. The droplets were powered off and access to them locked. Once the unlock happened (way too long later), the customer had full control and access again.

Good to hear - that's how I'd assumed things went, but thank you for clarifying Barry.

>> I don't know if it's a promise ...

>I suspect that it almost definitely is not.

It's not a promise. But they don't shut it down without letting you know they're going to be shutting it down first.

Because I reinstalled my droplet with a different filesystem manually, the snapshot restore doesn't work. Support tells me they can't do anything, so my 2~ of chat logs are sitting in a disk image that they can't restore because they need to mount it for some reason...

Hey there! I would love to follow up on the issue you're describing here. It looks like you tried to bring a disk image over from a provider in a format that we don't support, and unfortunately there's nothing trivial we could do to get a working Droplet out of it (which is a requirement for us to expose the volume within the systems we have).

I can't promise a super fast resolution - but I'd be happy to work internally to see if there's any outside-the-ordinary workarounds we can supply here if you're willing to follow back up on the ticket.

I replied to my ticket (#2710287). Thank you so much for giving it a shot by the way.

If you have illegal content such as hacked / stolen data or child pornography on your account, they absolutely should revoke access to your data.

I appreciate that they did this.

It's sad to me that your only chance in hell of getting huge companies to listen to you is by shamespamming across social media.

That, coupled with the clear issues following procedure from support, paint a clear picture: customer service is an area to skimp on for big tech.

Hey Nkozyra - Zach here from DO Support. One of our remediation efforts, that is already underway, is that Support and Security Operations leadership will create new workflows to allow abuse-related events to leverage the 24/7 structure of Support.

On Support, we have additional Support Engineers joining our Developer Experience team in mid-June and early July. We will continually assess our ability to provide high-quality responses as fast as possible to all tickets. Our customer feedback will continue to be the measure of how well we're doing, but our goal is that no one will ever need to use social as an escalation path.

I've been using DO Spaces for about a year now, and for the latter half of that time, my experience has been pretty terrible.

- Spaces throwing up errors that magically fix itself a couple of days later.

- Asking about the credits we were promised when Spaces lost our files results in the question being ignored. Still haven't received the promised credits after 6+ months. I can't even look back at the tickets now, as the support system has deleted all tickets older than a month.

It's gotten to the point where we have started work on migrating off DO, which is unfortunate because DO's offerings looked very attractive.

Hi Sladey - Zach here from DO. I'd like to help you out and investigate what might be going on. Can you send me an email (first name at) and I'll investigate right away?

Thank you, Zach

In what way is DigitalOcean a "huge company"? At ~300 employees, it's closer to the SMB definition of a small business (<250 employees) than a mid-size one (<500 employees).

In all fairness, having worked at a few tech startups, it can be hard to scale customer service to keep up with demand—you don't control how many support tickets come in, and it takes a lot more time to hire and train new customer service agents than it does to spin up new servers, and if you over-hire, it's a lot more costly than shutting down some servers.

DO is claimed to be "third-largest hosting company in the world in terms of web-facing computers", so that should give you an idea of how many customers they have.

How do they manage so many boxes with so few people? Do they rent metal and resell it with value add software?

By skimping out on support, clearly.

Oh interesting, hadn't seen that claim before

I feel like a better measure for the "size" of a tech company is number of customers, rather than number of employees. Considering software doesn't need a linearly increasing number of people to produce/support it as your share of the market gets bigger.

It doesn’t scale linearly, but we’re all here discussing in this thread because scaling support is hard.

I had wondered if they were a large company or not. I have used them in the past and there was a specific reason that we picked them over AWS/Azure/Goober, but those details escape me.

Based on the other comments, they don't sound like a huge company. Someone mentioned about 300 employees. I am not sure their revenue and chuckled at the reference that they're "third largest" given some specific criteria (of course, "largest" doesn't mean anything, either -- what we're really looking for is "third best" by some criteria that we've defined in our head)

I found this: https://www.canalys.com/newsroom/cloud-market-share-q4-2018-...

That jibes more with other things I've read and with anecdotal experience.

I think the size of the company is less relevant than your last point about the clear issues around training/support. At a company that size, CSR training might be less formal than it needs to be. Handling a one-off, like a customer who is legitimate but in every other way appears to be fraud, might involve Slack messages to people with the wrong information rather than clear guidance followed up with formal training.

It's difficult for a smaller (average cash-flow for their size) company to succeed in highly competitive markets that are, effectively, commodities. A larger company can handle being a loss leader to knock smaller competitors out of the market, providing excellent customer service. They don't. But it'd be easier for them to do so. :)

Digital Ocean: allow me to do some extended verification so you know exactly who I am and reduce your risk. In exchange, there is no automated locking, rather we are contacted and have 24 hours to mitigate the issue.


* 1 year continuous ontime payments at $250+/mo usage

* automatic billing is set up

* billing limits are set up and have been reviewed within the last year

* copy of our business insurance and license

* u2f on all accounts


The first few items on your list are actually a part of what we meant by "having billing history with us". There are a number of things we look at in that bucket. We use these items as a part of validating users before taking any action (yes, we clearly failed on this account due to the credits, which is a clear bug). As for offering things like a copy of your business license or other means of verification, that isn't a bad idea. As an example, people paying with POs today are already excluded from the algorithm.
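A rough sketch of that kind of pre-action validation, and of the credits bug described in the postmortem (field names, thresholds, and the fix direction are all my assumptions, not DO's actual code):

```python
from dataclasses import dataclass

@dataclass
class Account:
    """Hypothetical account record; field names are assumptions."""
    card_payments: float = 0.0       # lifetime spend billed to a card
    credit_spend: float = 0.0        # spend covered by promo/startup credits
    pays_by_purchase_order: bool = False

def exempt_from_auto_lock(acct: Account, buggy: bool = False) -> bool:
    """Accounts with real billing history (or PO billing) skip the
    automated lock. The 'buggy' branch reproduces the failure mode
    described here: spend funded entirely by credits was not counted
    as billing history, so a long-standing customer looked brand new."""
    if acct.pays_by_purchase_order:
        return True  # POs are already excluded from the algorithm
    if buggy:
        history = acct.card_payments  # credits ignored: the bug
    else:
        history = acct.card_payments + acct.credit_spend
    return history > 0
```

Under the buggy branch, an account like Raisup's (years of usage, all on credits) fails the exemption check and falls through to the automated lock, which matches the chain of events in the postmortem.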

Please make it official, so that people can have peace of mind knowing that they've got that "verified" badge. People hate having to wonder whether they're at risk of crossing an invisible, inscrutable, and constantly changing threshold. See: PayPal and AdSense account forfeitures. You could do so much better than that.

Not sure why you’re being downvoted so hard - while I think the requirements on your list perhaps aren’t the right ones, I like the idea in general. Have a process where I can make assurances of some form that we’re good guys, and treat me accordingly.

What is a "business license"?

Various businesses in the usa require licensure to operate. These are often done at state or even federal levels. Not every business has one, but even an EIN provides them more information about who's paying for the service.

Not every business has an EIN. I'm kind of surprised at the number of regulatory hoops you're willing to jump through to negotiate a minimal level of service.

Looking at it from the point of operations of the affected company, this doesn't sound like a minimal level of service. Instead, DO was the entire business structure of the company. As a company, if your entire business plan depends on the services of a 3rd party, it is not unusual to go through extra steps to ensure that 3rd party can't end your business. Running an internet-driven company on a consumer service from ATT/Spectrum etc. is suicidal. If your network goes down, they'll fix it when they get to it. With a business-level account, they have much stronger guarantees to keep your signal hot. Running a POC off of a free tier of AWS/DO/etc is fine, but these guys were well past POC.

As happy as I am to see this post itself, the mistakes made here are pretty appalling.

Killing customer accounts by automated action without any human check just seems like a recipe for disaster. Even if you can respond faster to crypto issues, the effects of a false positive are just unacceptable.

Though apparently the human checks at Digital Ocean don’t work either.

According to the post, that's not what happened? The customer account wasn't terminated by the automated system, but rather by the second Abuse Ops agent.

Upon a second review by a different Abuse Operations agent [...] the agent fully denied access back into the account. This action triggered the final “access denied” communication to the customer.

That was after the automated process locked access to the account and powered off all associated machines.

Read the post?

> The initial account lock and resource power down resulted from an automated service that monitors for cryptocurrency mining activity (Droplet CPU loads and Droplet create behaviors).

You're looking at the one false positive instead of the potentially thousands of true positives. Those true positives are people using free DO credits to cryptomine on DO. And probably 99.99% of the time the system works as intended. Dealing with bad actors and abusers is not pleasant or easy.

This sort of 0.01% case is exactly what developers and sysadmins deal with on a regular basis -- some crazy scenario that lead to a bug you've never experienced before. The correct and only response is to fix the bug (whether it be software or process), offer whatever compensation to the injured parties, and move on.

How much do you think this one instance will hurt DO’s bottom line?

I’m prepared to say the lost income potential is massive.

Who would ever use them for production if they can arbitrarily decide that your servers need to be powered down?

Digital Ocean's follow-up to "DigitalOcean Killed Our Company" https://news.ycombinator.com/item?id=20064169

We dropped DO from our company usage after similar issues, though honestly DO probably wasn't the right place for our product at that stage of development. What was meant as a POC became technical debt, and an outage forced us to come to terms with the fact that by the time the issue happened we had more than enough of our own capacity to run on our own metal.

Kudos to DO for the open incident management. As someone who does this myself, these are often really painful and hard to get right.

Good response from DO, but this line jumps out at me.

> Responses to account locks were not prioritized differently from a ticket management standpoint to be above less severe tickets.

That's arguably the biggest failure, IMO. The fact that an action which locks/terminates an account is not prioritized any different than a general ticket is pretty jaw-dropping, and I'm glad they're going to change that.

Yeah... that one was painful and we are fixing it. At least if the priority placed this at the top of the queue we could have acted faster. Probably the same outcome due to the other issues involved in this incident though.

I appreciate all your transparency and engagement on this issue. It probably would have had the same outcome, yes, but potentially resolved much more quickly. Regardless, the fact that you're fixing it is music to my ears.

Here's what's really wrong. This is a B2B service with B2C-grade terms of service. You don't want to base your business on one of those. Not one with a "sole discretion" termination clause. Those are for low-value consumer facing services only.

Compare, say, these terms of service from a major dedicated server hosting company.[1]

Either of the parties may terminate this Agreement (including all existing Orders) if: The other party breaches any material obligation under it (other than our obligations covered by an SLA), and fails to begin to cure such a breach within ten days of written notice of such a breach from the non-breaching party, or fails to completely cure such a breach within thirty days of the original written notice; OR a force majeure event continues for more than thirty days.

Now that's what a reasonable B2B contract looks like. That seems to be fairly standard for dedicated server hosting.

[1] https://info.codero.com/hubfs/Linked%20Assets/Legal%20Docume...

The startup claimed they had all their backups on DigitalOcean, which contained data of Fortune 500 companies.

A startup that has Fortune 500 clients must have history. I don't get, then, why DigitalOcean says they do not have payment history. Either the startup moved a few weeks ago -- but then why don't they have offsite backups if they just moved -- or, because they're French, they did have payment history but didn't have an American credit card or similar. Not sure what's up.

Their payment history was with DO credits according to the post.

Maybe one of the problems is DO's views of credits. Maybe things would work out better if they would treat credits like real money instead of phony money and tighten up how the credits are handed out.

Yes, DO have stated that how they viewed credits was a mistake and that they will be addressing it.

AWS is an 8,000-pound gorilla. This means throwing massive amounts of promotional credit at new accounts is often the only viable way to get them into the funnel.

They were running on DO account credit, so they may have been using the service for a while but haven't needed to pay for it yet.

Looks like a clear indicator for possible abuse to me. Apart from the long response times, I can't really blame DO.

They were paying customers before, then got credits as startup. It's mentioned by DO in comments.

I really cannot blame DO for this incident. These kinds of things must be handled in a learn-as-you-go way, and when I consider what one can do with a DO VPS (or 10), it's astonishing. I would expect them to automatically flag some uses.

A business "relationship" is a two way thing. You call and talk to people, and tell them what you want to do, and ask if it's ok.

When I've called and talked to DO reps about what I've wanted to do they have been very accommodating.

Nothing in the statement from Digital Ocean indicates that they won't kill your account or shutdown your systems - that's not the sort of cloud host any company can afford to use.

Cloud providers that kill accounts - or SAY they will kill accounts - must be dropped and not used.

The worst thing that should be possible is for your account to be suspended.

AWS, if there is a billing issue, prevents you making changes to your infrastructure via the console until the billing issue is sorted out - this is good and reasonable.

> Peer review of account terminations. For any account appealing a lock, two agents will be required to review the submission prior to issuing a final deny.

I can imagine how this plays out:

(service agent 1 turns to next service agent along) 'This looks like a bad account - I think I should shut it down, what do you think buddy?'

(service agent 2) - 'Yep I trust you, shut it down.'

> Nothing in the statement from Digital Ocean indicates that they won't kill your account or shutdown your systems - that's not the sort of cloud host any company can afford to use.

There is no cloud that will issue such a promise. If your criterion is that a cloud has to promise that you won't get shut down, you just can't use cloud hosting providers.

The article discusses violating TOS, not a billing issue. I don't think it's unreasonable to disable an account in that circumstance but I agree that deleting images/resources without allowing customers to defend themselves and backup the systems would be unfair.

> The article discusses violating TOS, not a billing issue.

That's not the impression I got. It sounds like the issue was that an account with misinterpreted payment history was showing bitcoin-mining-like usage patterns. Mining is not against the terms of use; they just erroneously convinced themselves that the customer was not going to pay for it.

I'd like to apologize for the typos in my previous comment that I neglected to notice before the edit window expired.


I have read and written similar RCA's in the past, this one is very good IMHO.

Barry Cooks did a phenomenal job with this after-action. He not only publicly accepted fault on DO's behalf (+1), not only stated the incident timeline clearly and without bias (+2), but also showed mitigation steps and procedural changes to avoid this in the future and prioritize customer business interests (+3). Many medium and larger sized companies should take note of this handling style (looking at you, Google and Facebook). I love that there was no generic PR "we're very sorry". Succinct, accurate, and without spin (+4).

I agree, the incident report was well done. The combination of factors that led to the issue was described in clear detail, and I was glad to see a concrete plan to improve various aspects to avoid future cases like this. It certainly helped to regain trust.

I had been expecting a short blip of an update denying anything of consequence (a twitter post promised a status update, but well, you know...) but this transparency significantly exceeds expectations. Nicely done DO.

You may want to explain service credits in some light detail though, for those that are unfamiliar with them.

That is everything I'd hoped to see as a developer and digital ocean customer. Good response.

Totally. I'm fairly new to DO and after seeing what happened was re-thinking my decision. But this is a solid followup, "we made a mistake" post so I think I can rest easy.

They hoped so.

One wonders how many others didn't get enough Twitter cred, before. That some low-level ticket stamper (even a high-level ticket-stamper) had authority to deep-six a customer on no more say-so than high CPU usage tells us more about the company than an incident report massaged by marketing communication specialists. Simply, the latter sounds good because it has been made to sound good by sounds-good experts, and could say anything; but the event itself is ground truth.

They will need a lot more time and good behavior to live this down.

I agree on the Twitter cred point. Personally, I think the fact that this happened is, in the end, a good thing, as it highlighted a weakness we must fix.

We trust our people, high-level, low-level, whatever, to make important decisions every day. That's why they are here.

The "marketing communications specialists" are getting slammed a lot here, so I will just point out that they spend most of their time rolling their eyes at my crappy grammar, spelling, and ludicrous number of comma splices. I don't think our goal was to sound like anything. We just wanted to lay out our investigation and the follow-on work we are undertaking.

Totally agree with your point that trust is earned, and we lost many people's trust in the last few days. That will take time and, as you say, good behavior to earn back, but that is what we are committed to doing.

I talk about mktg comms because I have worked at places where angry customers got earnest letters promising changes, but the manager expected to implement the changes said "No, we're not doing that!" Or "OK" but nothing happened. So I don't give much credit for promises, even when it was the right thing to promise.

Giving your ticket punchers authority is good when they are authorized to do what customers need to get or keep going. Giving them authority to eliminate customers, not so much.

I have to agree with the commenters who say it was an exemplary postmortem.

Hospitals have been doing formal postmortems for many years, but the number of errors didn't start going down until they instituted checklists.

I think this response is really good.

Now that we're all nit-picking it, however, I think they should remove the "People" section. They did a good job of adjusting process instead of blaming people. The people section, however, might lean toward blaming people. They didn't, in this case, but it could.

Hey there. Thanks for this feedback. I think it is important to be open and honest but not blame-oriented in our review of the situation. People make mistakes and that is okay, so long as they aren't willful or due to incompetence, neither of which was the case here. The key thing is not to create a situation where a mistake is an individual's fault. My general view is that if people are making mistakes, then we have done something wrong as a company and need to understand and fix the tools/training/process that led to the mistake.

I'm involved in work around reviewing medical care.

Generally, a "People" section that mentions processes not being followed is an incomplete root cause analysis.

Why was it possible for the process not to be followed?

There's obviously a limit to how far it makes sense to drill down with why why why, but stopping at "someone didn't follow guidance" is too early.

I disagree. People are often (perhaps usually) a significant contributor to incidents and I like seeing that called out explicitly.

Internally, we want to know exactly who did what, when, and why they thought that was the right approach. We're pretty ruthless about getting those facts down; in exchange, we're beyond lenient on anything that resembles punishment of people (barring multiply repeated cases of extraordinarily poor judgment).

We don't publish post-mortems publicly (though we do internally); still, we generally elide names from the published docs (replacing with role names such as "Operator1", "SquadLead1", etc.), but internal to the teams, we really value understanding exactly what happened and don't shy away from understanding the specific people involved. It's not in any way a black mark on someone's record to have downed prod or made a problem worse. It happens; better we understand and accept that.

Huh. Well, that's how you handle a post mortem! You outline what you did wrong, and then you outline how you're going to fix it. And it looks like the proposed fixes are appropriate, so...

DO, like (nearly?) all companies (not to mention most people), is obviously greedy and self-interested, and yes, I'm sure a major driver of the quality of response was the twitter storm that erupted, and I don't want to excuse the underlying mistake which was significant, but...

...at least they responded well eventually!

Agreed, and it was more or less the response I was looking for a couple of days ago.

We'll be staying with DO.

We already use other VPS services as backup, and will probably add one or two more. But because of their well-documented response (and at least being able to identify what went wrong, and hopefully to fix it), we aren't going to drop DO.

Thanks for the response, and congrats on standing out in a very small crowd of companies who can own up to their customer service problems.

A very, very small crowd indeed.

I don't know that they responded well enough. A person can still see their account locked, all their data locked, and be locked out of all communications.

Digital Ocean notoriously doesn’t invest well in data science or machine learning, even having some key data science people leave recently.

I interviewed for a data science job there & the team of engineers seemed really unhappy. They reported into the director of operations, which is a weird place for data science to report, and the managers I met definitely viewed it as a paranoid cost center kind of thing.

Also in the interview process I recall that Digital Ocean made a very low offer and refused to discuss negotiating it. Seemed clear that cheap hires were mandatory for data science / machine learning.

I wouldn't be surprised if this lack of investment meant that some data science intern or bootcamp grad designed this automatic fraud shutdown system, and that there's a glaring lack of investment in professional usability for a system like that.

Sorry to hear that you had a bad experience and left with a bad impression of that team. We have a number of data science efforts, including in the core R&D group, where we are growing and working to improve models in support of a number of fleet monitoring tasks.

I don't quite get the "running on credits w/ no payment history" and "ruined our business" combo. How can they run a business and never pay?

Many (all?) of the cloud providers offer credits to startups (credits as in free $ to spend on their services). So if they hadn’t burned through that, there would be no payment yet. (The startup I work at got $20k in credits and didn’t pay a dime for the first year)

I knew they gave credits, but I didn't realize it was to the level of $20k worth of credits. I think I got a couple hundred from DO.

Cloud/Hosting providers have startup programs that give you credit in the hope that you will stick around.

Did anyone notice DO leaked customer financials in this post? If I were a startup running on credit, I definitely wouldn't want to advertise that. WTF?

I was very worried about that specific detail, and we reached out to the customer before posting this postmortem to expressly get his permission to share those details. If he had said no, we would have worked around the detail but not been able to explain as clearly what went wrong. He gave us his permission to share the information.

I don't understand what you think the issue here is?

My interpretation of this is that the customer had pre-paid credit on their account. Meaning they had not been through the typical bill cycle yet (hitting an external payment method).

How are you interpreting that they are running on credit? As in their account is in debt and they haven't paid yet?

Perception is everything. Many companies, especially Fortune 500s, will do deep research before doing business with anyone. I've been through these reviews, where we have been disqualified specifically due to our infancy and lack of proof of long standing. If someone read/mis-read a post that gave them the idea that the company didn't have enough runway, they might move on to the next potential suitor.

I'm not sure I understand why running on credit would be an issue for a startup? That said, I hope they asked the user before posting this info.

IIRC it had to blow up on Twitter before DO paid any attention. At the end of the day that's why it's an issue, because they didn't sort it out until it went public.

I suppose the moral of the story is - have offsite backups, so you can switch VPS providers in an emergency.

When I clicked on this I had assumed DO had gone down last week. I was surprised when I finally realized what they were talking about. I think it is cool and commendable to offer this level of transparency on an issue like this.

Anecdotally, I use Digital Ocean for a few miscellaneous services, on an account I’ve had forever. I have never had any issues with it. I used to use lesser known low-end VPSes, but stopped when I lost a bunch of data on an incident involving a provider’s failed RAID controller. (It was my fault for not backing up, but I was young and foolish; they mostly served me well, but I do prefer the assurances of bigger providers nowadays.)

Depending on the severity and length, it could still have a long-term impact on that business. Also a bit unsettling that seemingly basic safety controls failed. But it is good to see DO being open and thorough about this incident.

That's the only way to get confidence back. I especially like the two peer review policy.

If it were not for public shaming on Twitter, the guy would still be turned off.

Dear DO: From your RCA, it appears this is a type of fraud where a stolen credit card is used to create a new cloud account and run up a huge charge in a short amount of time. Nowadays it could be for cryptocurrency mining (a few years ago, and maybe still today, it could have been to run spambots or botnets or whatever).

I suggest your trust and safety team handle payment fraud as one issue (using payment network intelligence) and resource abuse (spam or botnets) as a separate issue (by monitoring abuse reports and external underground intelligence, NOT resource monitoring or traffic monitoring).

It seems like in this case, weak muddied signals were combined to draw false-positive conclusions.

Also, it is equally important to build reputation score for good users and use that as a backstop to prevent them from getting shot by misbehaving fraud detection algorithms.

Since your business might be a lot of small customers, it is important you find a good way to easily trust a small customer with little usage and little spend. One way you could do this is by having a reasonable default cap on the resources for a new or small account. You could ratchet up this cap after verifying the payment instrument trustworthiness (through automated checks or manual verification process).
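The cap-and-ratchet suggestion above could look something like the following minimal sketch. Every name, weight, and threshold here is a made-up illustration of the commenter's idea, not anything DO actually runs: positive trust signals raise a coarse reputation score, the score raises the resource cap, and a CPU alert alone escalates to a human instead of auto-locking.

```python
# Hypothetical sketch: cap resources for new/unverified accounts and ratchet
# the cap up as trust signals accrue. All names and numbers are illustrative.
from dataclasses import dataclass

@dataclass
class Account:
    age_days: int
    verified_payment: bool   # payment instrument passed automated checks
    months_paid: int         # successful billing cycles (promo credits excluded)
    abuse_reports: int       # external abuse/spam reports on record

def reputation_score(acct: Account) -> int:
    """Combine positive trust signals into a coarse score; abuse subtracts."""
    score = min(acct.age_days // 30, 12)        # up to 12 points for account age
    score += 10 if acct.verified_payment else 0
    score += 5 * min(acct.months_paid, 6)       # real payment history weighs heavily
    score -= 20 * acct.abuse_reports
    return score

def max_droplets(acct: Account) -> int:
    """The resource cap ratchets up with reputation instead of hard-locking."""
    score = reputation_score(acct)
    if score < 5:
        return 3       # brand-new, unverified: small default cap
    if score < 20:
        return 10
    return 100         # established, paying customers: effectively uncapped

def should_escalate_to_human(acct: Account, cpu_alert: bool) -> bool:
    """A CPU-usage alert alone never auto-locks; good reputation is a backstop."""
    return cpu_alert and reputation_score(acct) < 20
```

The point of the sketch is the last function: the weak "100% CPU" signal only ever produces a ticket for a human, and an account with genuine payment history never even produces that.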

Hope this helps.

One of the problems is the credits initiative for startups (I'm assuming that's how this customer ended up running on credits.)

Companies have to grow to quite a big size before they consider offering various discounts and programs. By that time, the systems and processes are plentiful in number, complexity, and interaction. Management decides to implement a startup credits program and because it's not an instant money maker, it doesn't get treated carefully enough and causes various edge cases for the program's users (and hopefully none for the standard type of user).

In DO's case, startups should be vetted before being gifted credits and therefore excluded from the crypto checks and shutdown potential.

Sometimes it ends up in the customer's favor:

For an example of how poorly one-off programs can end up being implemented, my company is receiving special consideration from Stripe. No fees are being charged at the moment. Well, a customer asked for a refund. I issued the refund. Stripe paid out $25 to the bank account but took back only $23 for the refund because the code that does refunds doesn't know about the fee exemption. Good guy me emailed them about their bug but not much came of it.
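The mismatch described above reduces to a few lines: the exemption lives in the charge path but not the refund path. A hypothetical sketch in integer cents (the rates and function names are illustrative, not Stripe's actual implementation):

```python
# Illustrative sketch of the fee-exemption bug described above. The charge
# path honors the exemption; the refund path does not, so a refund claws
# back less than was paid out and the merchant keeps the difference.
FEE_BPS = 290            # 2.9% expressed in basis points (hypothetical rate)
FEE_FIXED_CENTS = 30     # plus a $0.30 fixed fee

def fee_cents(amount_cents: int) -> int:
    """Standard processing fee on a charge."""
    return amount_cents * FEE_BPS // 10_000 + FEE_FIXED_CENTS

def payout_cents(amount_cents: int, fee_exempt: bool) -> int:
    """What the merchant receives: the exemption is honored here..."""
    if fee_exempt:
        return amount_cents
    return amount_cents - fee_cents(amount_cents)

def refund_clawback_cents(amount_cents: int) -> int:
    """...but the refund path forgets the exemption and deducts a fee
    that was never charged."""
    return amount_cents - fee_cents(amount_cents)
```

For a $25.00 charge under the exemption, this sketch pays out 2500 cents but claws back only 2398, the same shape of mismatch the commenter saw.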

Automated systems that are infallible are great. Most are not infallible, and they should just provide notices to humans.

Humans should (be adequately trained to) review and handle considerations where termination of service is involved.

From reading the DO response, it does sound like humans were involved (eventually). However, unless a customer's use of the service is posing an immediate and severe threat (security, DoS, whatever), service should not be stopped until AFTER a human has adequately reviewed the situation.

Stories like this remind me why sometimes it's better to use smaller providers who are less automated...

The communication regarding denial of access to the account creates a sense of helplessness; the finality without explanation requires correcting.

Lots of big companies who deal with small partners (developers, sellers, etc.) could learn from this, including Apple, Amazon, and Google. Lack of explanations, vague explanations, or confusing explanations for account shutdowns or other penalizations are the norm. And for some of these companies it's nearly impossible to talk with someone who can clarify what's wrong.

I like it when a company conducts a full failure analysis and takes responsibility. Doesn't happen often. Hope DO meaningfully improves its service as a result.

It's catastrophic to get locked out like that.

“The communication regarding denial of access to the account creates a sense of helplessness; the finality without explanation requires correcting.”

Right there, in one sentence, DO has figured out the one thing that all those millions of CPU cores at Google have failed to grasp.

So you cannot use your VPS to do whatever you want with it? I am having trouble understanding what's wrong with crypto mining such that you get denied access to the VPS you paid for. Or was that some sort of free plan he was running on? Still, why not introduce CPU quotas rather than blocking?

As mentioned elsewhere in the thread, accounts with high cpu usage and no billing history were locked - symptomatic of cryptominers creating accounts, using free credit or stolen CCNs, and ditching.

I think it's great they wrote the post, but the tone still leaves a sour taste.

I think they could have worded it better.

I think this is a great response.

While from my own experience I don't see myself using DO again (see my previous posts, I had a similar experience except I didn't complain externally), the points in future measures look like they'll go a long way.

Best of luck to all future customers.

> The account owner leveraged Twitter as an avenue to call attention to the mistake

What arrogance! He did that because the official channel was silent, which you forgot to mention.

> Shortly thereafter, DigitalOcean investigated the issue and the Raisup account was unlocked

2 days is "shortly thereafter" for you?

It's a good response; it clearly outlines causes and future fixes. But I'd be very cautious about dealing with a company that found such a lazy detection mechanism to be adequate, considering the potential cost to real clients.

Something tells me making their abuse department 24/7 won't help. I love DO, but I had to make a support ticket recently and it took them about 2-3 days to respond.

Additional hiring has been approved for both Support and AbuseOps to reduce ticket queue wait times.

So this is the way they determine their support department is under-resourced? By Twitter shaming?

Barry Cooks for President, or any political role where explaining things in a balanced way seems impossible to the current individuals.

It's good to see a company pay attention and take action.

That said I would never use them, Amazon AWS is just a smarter solution all round.

How are the customer's damages compensated? With just "sorry, we ruined years of your work with a couple of bad clicks"?

I don't think there is a single cloud provider that accepts unlimited liability and wholly compensates customers for lost data, lost sales during downtime that was the provider's fault, etc. Their liability is generally strictly limited to the cost of service, so something like a $1 credit for the 6 days that your $5/mo VPS was down. Unless, of course, you are a very large customer that credibly threatens to switch, at which point you may get some special treatment above and beyond the TOS, though even then, rarely enough to fully recover your actual loss.

Ultimately if there is a lot of money on the line you need to do the work and pay the money to be multi-vendor, automatically failing over to AMZN or MS or DO or whatever when there is some massive screwup that takes down GOOG for half a day.

It's better than Amazon or Google. I have had accounts shut down on both with absolutely no recourse.

You have exactly the same recourse with those that this person had with DigitalOcean: Be very loud and as public as possible with the problem and it'll get escalated to someone who can make a rational, reasonable choice (and we've seen the same sort of things happen to other big companies on here and twitter), or simply override prior choices purely for PR purposes. This is how many businesses operate now, with zero mechanisms to escalate outside of getting a torch-wielding mob riled up. It seems horribly counterproductive (I mean, my end impression of this whole incident is certainly not more positive about DO -- it's something that should never have happened), but it's how it's done now.

This was a few years back. I don't really care about those accounts anymore. Support was just automated responses and when I tried to call them directly, I was directed to the email accounts with automated responses.

My Amazon account was banned because a buyer (I believe a competitor) purchased an item and claimed it was a fake. I had proof it wasn't, but it didn't really matter. Other than this, I had nearly 100% positive feedback. What's funny is that I now buy thousands of dollars per month for my business through Amazon and keep getting pestered to sign up for a business account.

Google banned an AdSense account when they somehow thought I was faking clicks. I still have no idea where they got this from. My site wasn't even live yet, and I wasn't clicking on anything or even displaying ads beyond a simple test page with no traffic.

Feels like we're not hearing the entire story. Care to share more?

Excellent postmortem. Great work DO team, and hope these problems get resolved quickly.

Great response and ownership.

Kudos for this clear summary and planned improvements. Really good job folks.

I do a lot with cloud providers for my customers' products and have worked with DigitalOcean's products once before. I didn't have a particular opinion on them[0], and there are some things that seem off about the Twitter thread when placed against the incident update report this thread links to. So, all of that to say, I'm giving DigitalOcean the benefit of the doubt.

There are, however, a few things that could be improved about this process:

> Peer review of account terminations. For any account appealing a lock, two agents will be required to review the submission prior to issuing a final deny.

The devil is in the details. Do this in a manner that the person confirming that the account is committing fraud is unaware that they are confirming another's denial; otherwise "dude, can you approve that termination I just did? I want it out of my queue/that guy was a dick." is a risk.

Off the top of my head, I'd probably generate two support tickets (linked, but without that link presented to the account-termination CSR team member), assigned directly to this person, hidden from others ("hiding", along with training/process improvements, is likely enough). If one person disagrees with the termination, close out the other person's ticket. If the CSR sub-org for this is global, place them with staff in opposing time zones to minimize unnecessary confirmations (though you lose a valuable measurement of how consistent your staff is).
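A minimal sketch of that blinded two-ticket flow, with every name and data shape hypothetical: an appeal spawns two review tickets whose link to each other is held internally rather than shown to the agents, and termination proceeds only if both independently deny.

```python
# Hypothetical sketch of blinded peer review for terminations: two agents
# review independently; the link between their tickets is kept internal.
import itertools
import random

_ticket_ids = itertools.count(1)

def open_blind_reviews(account_id: str, agents: list) -> dict:
    """Create two tickets for distinct agents; the 'link' field is internal
    bookkeeping and would not be shown in either agent's queue."""
    first, second = random.sample(agents, 2)   # never the same agent twice
    return {
        "link": next(_ticket_ids),
        "tickets": [
            {"id": next(_ticket_ids), "account": account_id, "assignee": first},
            {"id": next(_ticket_ids), "account": account_id, "assignee": second},
        ],
    }

def resolve(reviews: dict, verdicts: dict) -> str:
    """Terminate only if every blinded reviewer independently said 'deny';
    any single disagreement unlocks the account."""
    decisions = [verdicts[t["assignee"]] for t in reviews["tickets"]]
    return "terminate" if all(d == "deny" for d in decisions) else "unlock"
```

The design choice being sketched: because neither reviewer knows the other ticket exists, "can you approve the termination I just did?" has no target, and a single dissent is enough to keep the account alive.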

> Services that result in the power down of resources will no longer automatically take action on any account, regardless of lack of payment history, for accounts that were engaged more than 90 days prior. These cases will be escalated for manual review.

I can't count how many services I've deployed that started under 90 days ago where the customer failed to add their payment information to the service. I can't count them because I don't know. Usually our customer creates their account with instructions from us, and creates an account for us to use which doesn't have permissions over payment details. I wouldn't be surprised if I've had an app go past production where the customer simply forgot that important Day-1 step, or procrastinated until production, etc. We ask, but I've been lied to about stupider things (thankfully rare, but surprising from people who otherwise look like "grown-ups").

Minimally, it sounds like the whole process here is missing a "Hey, WTF is that thing you're running? Call us or we'll need to turn it all off" alert at least a little while before it ... turns it all off. At login, put a clear notice "We want you to love our services, so we let you try them without asking you for payment information. Unfortunately, we have to have monitoring in place to prevent hostile actors from loving us, too. Because of this, accounts newer than 90-days might have services shut off in error. If you want a notification an hour before action will be taken, provide your mobile phone number and we'll send you a text. Or you can enter credit card information/confirm your identity (not sure what options are available here) and we'll keep the bots from bothering you"

Of course, all of this costs money. And based on the incident response times, an explanation other than "failure to prioritize correctly" might very well be "failure to staff properly/have the tooling in place to handle the volume". Considering the competition in this market, I wouldn't be terribly surprised if "we can't afford it" plays into some of that.

[0] A little less awful than AWS in a lot of ways for the task I had to do.
