> The template used for response in account denial will be removed entirely. If account access is denied during an appeal, which often is the case as most appeals are true bad actors, the agent must create a reasoned response.
Glad this is seen as an issue and corrected.
IMO, this whole thing probably never would have escalated if a better response had already been in place for everyone.
Accidents, shotty support, whatever — all expected these days unless you have big cash money agreements in place.
But killing the account of a responsive person with a gigantic middle-finger email, with no reasoning given, was a pretty dumb process to have in place. You can see the email on the Twitter thread somewhere.
Glad it’s fixed! Still a DO fan here
Edit: TALKING ABOUT THIS: https://pbs.twimg.com/media/D76ocofXoAY_xB5.png
The root cause of suspension is incomprehensible to me. They were suspended because they launched a set of instances and these were using 100% CPU. How is that unreasonable and cause for suspension?
I'm not a Digital Ocean customer, but if I were, I'd expect to be able to use the resources I bought without risk of being suspended. This is the root cause. It was compounded by incompetent customer support, but I really do not understand the suspension cause.
The response tackles all secondary factors, but does not talk about the root cause. I'd expect it to.
Additionally, all cloud providers operate on various models of over-subscription. It is not in anyone's (customer / provider) interest to allow the full consumption of resources when the activity is fraudulent.
As you can see in the post-mortem, they are fine with the usage. They have a process and flag to allow legitimate customers to use their resources. However, based on previous experience at another cloud provider, I would bet that over 90% of those automated hits are correct.
This was bad support. They know that and they seem to be making the right moves to fix it. Fraud is bad for everyone and has to be combated. Not doing so can raise prices and kill a business like DO. I'm sure they feel awful that a customer was so poorly impacted, but the error wasn't in the first ban, it was everything after that.
Edit: found your site, looks like you’re cheaper than aws at a glance
It sounds much more like it was the second point, which is unsettling. It's one thing to plan your pricing based on the assumption that most customers won't maximally-utilize. It's another thing to enforce a soft-limit that's vague and below what was advertised. I'd much rather have a lowered, known limit than whatever this is.
Capital One did this to me once, and refused to restore the use of the blocked account even after I immediately called them and confirmed that the transaction that triggered the block was not fraudulent.
Google can afford to eat the cost and perhaps has better heuristics to detect mining. And they definitely have better data to detect a single user signing up for multiple accounts.
I run my own iron, with cloud only for elastic loads. Every time I launch a cloud instance, it will be using 100% CPU, otherwise I wouldn't launch it. It's unacceptable to label that profile as "suspicious". It never happened to me on AWS or Azure.
The major indicator here was the lack of payment history, so they hadn’t paid for it but were working off of credit. I think it’s a nuance that’s very important.
From the description I imagine Digital Ocean offers a free period or tier, to reduce friction in customer acquisition. This is a marketing tool, and must not, in any way, cause situations like the one described.
If a marketing tool induces service failure, it has no place in a professional setting.
But it is important to disconnect monetary spending from coupons or vouchers as they are not equivalent.
You mention free tier but that’s not what was at issue here. Also, 10 additional instances isn’t in the free tier of any cloud service I’ve used.
I'm not saying that DO is correct, but I believe the parent argument was a simplification of the events in question. Also, DO's handling of it via support was far worse than the initial algorithm, imo.
They must be. If they are not, then you've entered the territory I referred, where marketing actions are impacting service availability. This impact is not acceptable in professional services.
In this specific case, if voucher giveaways produce ingress of resource leeches (cryptominers that will never result in real customers), and if it is impossible to prevent this undesired ingress without impacting existing customers (which it is), then that marketing action must stop. This is the conclusion I expected from the post-mortem.
I won't try to argue whether they should be removed in their entirety, that's not even an option I had even considered until now.
Using what you've rightfully obtained shouldn't be regarded with suspicion.
Are you suggesting that 100% usage implies fraud?
There's a difference between suspecting fraud from high resource usage and equating high resource usage with fraud.
The latter is what is happening here, and it's outrageous.
- A large increase in number of nodes
- All nodes using 100% of CPU
- AND a lack of payment history
I'm merely saying that the lack of payment history is an important indicator of suspicious activity. 100% usage by itself was not the primary indicator that their article discusses.
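To make that concrete, here's a rough sketch of how a check combining those three signals might look; this is my own illustration with made-up thresholds and field names, not DO's actual system:

```python
from dataclasses import dataclass

@dataclass
class AccountSnapshot:
    droplets_created_last_hour: int   # burst of new nodes
    avg_cpu_utilization: float        # 0.0-1.0 across the account's droplets
    lifetime_payments_usd: float      # 0 means "no payment history"
    current_run_rate_usd: float       # projected spend at current usage

def looks_like_mining_fraud(acct: AccountSnapshot) -> bool:
    """Illustrative only: flags accounts that burst-create droplets,
    peg the CPU, and have never actually paid anything."""
    burst = acct.droplets_created_last_hour >= 10
    pegged = acct.avg_cpu_utilization >= 0.95
    no_history = acct.lifetime_payments_usd == 0
    run_rate_exceeds_payments = acct.current_run_rate_usd > acct.lifetime_payments_usd
    # All of the signals have to fire together; 100% CPU alone never does it.
    return burst and pegged and no_history and run_rate_exceeds_payments
```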
They’ve botched that second step though.
It's a common pattern in malicious actors to immediately spin up several droplets and immediately peg the CPU on each one.
There are, obviously, non-malicious actors who do the same, but it's a bit like wearing a balaclava in public: Likely to raise some suspicion just because it's associated with malicious actors.
They weren't bought resources at the time, they were on credit. In this case a false positive for sure.
In the case of an actual cryptominer it's more likely they'll just ditch the account when it comes to billing time. Even more likely is that it's a compromised account that someone else has to pay for
A screw up is inevitable. A mature response is not. So the fact they gave mature response goes a long way. Although it's unfortunate that social media seems to be their emergency support channel...
This is the thing - the customer that got locked out managed to get attention on HN, Reddit and other media - this seems to have prompted action from Digital Ocean.
How many have silently fallen victim before this? We don't really know if this is a habit or not; we only know this one customer's case was corrected.
Except they were declining to unlock it, right? I’m always shocked to see support that’s so pitiful they don’t even bother to have a correctly worded template for a common event.
The real problem is support reps that aren’t trained properly and don’t even care enough to apply a bit of common sense. Getting rid of a response template doesn’t automatically make the support reps care enough to apply common sense.
How about a “don’t fuck me” support tier where I can pay a one time $100-$250 fee for the sole purpose of getting a phone call before my account gets banned?
And going off process could make it better... yay, self pat on the back. But it could make it worse in which case I see unemployment in the support rep’s future. So they won’t go that way very often.
Anyone who ever had such a low positioned job knows how it works. At that level your only freedom is to do what you’re told and follow company process.
No, this is the fault of the manager who asked for this process and their manager who approved it. Management isn’t just about picking up a higher paycheck, it’s also to take the accountability for the decisions made under your watch.
If you nuke VMs, under no circumstances do you also nuke access to data, backups, etc.
Because if it wasn't for "social escalation" (aka: mob justice via HN and Twitter), this 2 person company would have lost everything.
If you terminate a customer for $reasons, the data still belongs to the user, and not the company. And the company should still be legally required to provide the data on a reasonable timescale, like FTP access for 7 days.
> 9.1 Subscriber is solely responsible for the preservation of Subscriber's data which Subscriber saves onto its virtual server (the "Data"). EVEN WITH RESPECT TO DATA AS TO WHICH SUBSCRIBER CONTRACTS FOR BACKUP SERVICES PROVIDED BY DIGITALOCEAN, TO THE EXTENT PERMITTED BY APPLICABLE LAW, DIGITALOCEAN SHALL HAVE NO RESPONSIBILITY TO PRESERVE DATA. DIGITALOCEAN SHALL HAVE NO LIABILITY FOR ANY DATA THAT MAY BE LOST, OR UNRECOVERABLE, BY REASON OF SUBSCRIBER'S FAILURE TO BACKUP ITS DATA OR FOR ANY OTHER REASON.
Summary: Do offsite backups n'all you dinguses
Sure the ToS needs legalese crap for the lawyers, but a plain version also needs to be made. I'm certainly no lawyer, and nor are most people.
> In other words, we trust that you’ll be responsible and back up your own data. Things happen!
* While the removal of the account termination template is good, in conjunction with additional hiring to support more attention to any individual ticket, I can't tell by whose standards the "reasoned response" is gauged, or if the response is reviewable at all. I did note that they now want two human reviewers, but that's distinct from specifying a process in which a reasoned response is articulated and reviewed.
* More importantly, if the reasoned response doesn't pass muster with the customer, what's their resort? Still Twitter-shaming? I suppose that's legit if they'd rather their mistakes were public like this.
* The question of whether an account-wide lockout w/ no data retrieval is a necessary/proper consequence for those flagged for CPU abuse needs addressing -- ideally they should have a different policy that allows for data egress (with bandwidth fees, if necessary), but if not, a rationale and clear policy might be acceptable.
"shoddy", for what it's worth.
Compute offload is an ancient and fairly common use case for the public cloud; my VPS (or ten..) should be able to burn 100% CPU for many hours compiling a large project, even if it means they make less profit than they would have had I instead run a static web server that sleeps on IO, imposing nearly no CPU load.
At the very least they should provide some objective, quantitative guidance on exactly how many CPU-seconds-per-hour they consider acceptable/not-abuse (or, if not CPU-seconds, then increased host power consumption, or whatever they are ultimately trying to limit to ensure they can pack a few hundred near-zero-load servers onto the same host to make glorious truly massive profits all the time).
Don't make customers guess at whether their workload will trigger some opaque but hyper-aggressive abuse automation or not.
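For example (the threshold below is entirely made up, just to show the kind of published, checkable number I mean):

```python
# Hypothetical published limit: "no more than 3,000 CPU-seconds per vCPU per
# hour, sustained" -- a number a customer could check against before launching.
CPU_SECONDS_PER_VCPU_PER_HOUR_LIMIT = 3000  # illustrative, not a real DO figure

def cpu_seconds_per_hour(avg_utilization, vcpus):
    """avg_utilization is 0.0-1.0 averaged over the hour."""
    return avg_utilization * vcpus * 3600

# A 4-vCPU droplet compiling at 100% CPU burns 4 * 3600 = 14,400 CPU-seconds
# per hour, which would blow past the illustrative limit above -- exactly the
# kind of thing I'd like to be able to look up instead of guessing.
print(cpu_seconds_per_hour(1.0, 4))  # 14400.0
```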
"Services that result in the power down of resources will no longer automatically take action on any account, regardless of lack of payment history, for accounts that were engaged more than 90 days prior. These cases will be escalated for manual review"
Not that I am defending their actions and perma-ban.
They clearly do, at least for some subset of customers meeting various quasi-secret criteria.
> they don’t want you to do is use 100% CPU and not pay the bill.
The account here was fully paid up, albeit via credits that they issued rather than via USD. Regardless, it was not past due, so, the high CPU% was the mortal sin.
>violation of any of these Terms of Service or any law, or if you misuse system resources, such as, by employing programs that consume excessive network capacity, CPU cycles, or disk IO
By my reading that seems to mean that you're not allowed to use your VMs to their full capacity due to them being over-provisioned. This is in contrast to AWS who are more explicit on which instances (T instances) are over-provisioned and exactly how they're throttled.
I get that your fraud algorithm flagged it because of a lack of established payment history. But how is that possible, given the tweet refers to "locking us out of all of our backups and work"? Surely an account with any significant history would have an established payment record. From their tweets they mention that they had 5 droplets, storage holding a not-insignificant number of records (~500k), and a script that has to run every 2-3 months to process some data, spinning up 10 droplets while it does. Based on the row count and per-record time, it seems like processing the data would take about 13 hours. I am struggling to see how they didn't have payment history. Can you elaborate?
In addition another thing I'd think would help assuage fears of a complete lockout is some process where you can request and download the db or a snapshot of the virtual machine.
Do you disclose this anywhere? Are there any special steps one could take to avoid issues while doing legitimate mining?
Not sure why you were downvoted, I had the same impression, after reading:
...an automated service that monitors for cryptocurrency mining activity (Droplet CPU loads and Droplet create behaviors). These signals, coupled with a number of account-level signals (including payment history and current run rate compared to total payments) are used to determine if automated action is warranted to minimize the impact of potential fraudulent high-cpu-loads on other customers
This sounds like they don't permit extended high CPU loads due to the impact it can have on other customers.
My guess would be that this is such a well-known problem (within the field of cloud compute at least) that they just didn't think they had to state that normal crypto mining by paying customers is completely fine.
In every cryptocurrency (the popular and functional ones anyway), there's a set global rate of mining rewards. All miners compete for a slice of that reward, so as more people mine, each individual miner gets less reward. (This causes an equilibrium to be reached where more people mine until it's no longer profitable for more people to start mining. If mining becomes unprofitable, some miners will drop out, and the remaining miners will each make a little more.) If masses of people realize that cloud mining for a particular cryptocurrency is profitable, then what generally happens is that lots of people pounce on cloud providers to mine, it becomes barely profitable, and then people operating their own hardware that's cheaper than cloud providers come in and push the mining rewards down to where it's no longer profitable for people to cloud mine.
Because cloud mining is never profitable in the long run, most cloud mining that happens is fraudulent activity using stolen cloud accounts or payment info. (If you're not paying for it, then making any amount of money from it is profitable.)
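To put rough numbers on that equilibrium (every value below is invented for illustration):

```python
def expected_reward_per_hour(my_hashrate, network_hashrate,
                             coins_issued_per_hour, coin_price):
    """Your expected slice of a fixed global issuance rate."""
    return (my_hashrate / network_hashrate) * coins_issued_per_hour * coin_price

cloud_cost_per_hour = 0.50            # what you pay the provider per instance-hour
my_hashrate = 1_000                   # arbitrary units
coins_per_hour, price = 75.0, 100.0   # fixed issuance * market price

# As more miners pile in, network hashrate rises and each slice shrinks.
for network in (1_000_000, 5_000_000, 15_000_000):
    revenue = expected_reward_per_hour(my_hashrate, network, coins_per_hour, price)
    print(f"network={network:>10}: revenue/hr=${revenue:.2f}, "
          f"profit/hr=${revenue - cloud_cost_per_hour:.2f}")
# Entry continues until profit/hr is roughly $0, and renting cloud CPU
# (expensive per hash) is the first strategy that stops being worth it.
```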
edit: from elsewhere ITT it seems they're doing this with stolen credit cards.
"Cryptocurrency mining mitigation detects suspicious behavior, including very high CPU utilization on an account with no payment history, which results in an account lock"
Lots might have been done wrong here, but it sounds like they had an account with trial or promotional credits - I can see how this could easily be abused.
An account shutdown, or enough complaints from verifiable sources, and I'm not going to the trouble. Not to pull out an old trope, but "Nobody got fired for picking IBM". And that's the case here: AWS pulls a move like this and the customer, who likely came in with AWS in mind (and in some cases was advised against it and insisted on it), is going to shrug their shoulders. Pick a provider the customer hasn't heard of and I'm going to get a phone call that goes something like "You're the one that said we should use that basement operation!" with raised voices. Heck, the last time there was an Azure outage, we didn't hear from most of our customers. It was so impactful that even customers well outside of software development/technology read news articles and connected the dots. I had one customer tell me he thought it was just their corporate internet connection; they'd assumed Azure was working.
 Plus, too lazy to put in the research; sorry.
 They were a customer who insisted on doing the app monitoring, themselves -- that guy was getting the alerts and similarly assumed it was the network since that happened regularly with another application they developed -- the monitoring server was on-prem.
An attacker might load a stolen credit card number into an account and only use enough resources to generate a few dollars worth of billing. The owner of the credit card might not notice the small charge.
Then after a few months of low billing (to bypass a previous heuristic), they ramp up the utilization, mine a bunch of coins then the holder gets a massive bill.
The holder does a charge back and DO is left holding the bill.
AWS doesn't have this problem because either your instance is allowed to use all the CPU that's allocated to it, or else (t2 & t3) the platform will automatically limit your CPU usage. You don't have to care about how your usage affects other people. It's one more thing that AWS abstracts away. DO's abstraction, on the other hand, is rather leaky in this area. That's a problem in and of itself, in addition to the matter of credit card fraud that every company has to deal with.
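Roughly, the burstable model works like this (a minimal sketch; the rates are illustrative, not AWS's actual published numbers):

```python
def simulate_burstable(hours, demand_utilization,
                       baseline=0.10,        # fraction of a vCPU earned back per hour
                       credits=120.0,        # starting balance, in vCPU-minutes
                       max_credits=144.0):
    """One credit = one vCPU-minute at 100%. Above the baseline you spend
    credits; when they run out, the platform throttles you to the baseline
    instead of suspending the account."""
    for hour in range(hours):
        earned = baseline * 60                   # vCPU-minutes earned this hour
        wanted = demand_utilization * 60         # vCPU-minutes demanded this hour
        net_spend = wanted - earned
        if net_spend <= credits:
            actual = demand_utilization          # burst allowed
            credits = min(max_credits, credits - net_spend)
        else:
            actual = baseline                    # throttled (simplified: leftover
            credits = 0.0                        # fractional credits ignored)
        print(f"hour {hour}: utilization={actual:.0%}, credits left={credits:.0f}")

simulate_burstable(hours=5, demand_utilization=1.0)
# Bursts at 100% for a couple of hours, then quietly drops to 10% -- no suspension.
```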
If the provider isn't paying attention and/or doesn't have good fraud detection in place, it may be a few months before your account and resources are terminated (assuming the payments eventually bounce, and the provider gives you a chance to fix it)
I recently logged into my long-dormant DO account to kick the tires on their now-GA managed Kubernetes service and to contribute some CPU cycles to a distributed computing project I'm interested in (not cryptocurrency, I swear).
I first requested and was approved for a droplet limit increase (to 25). I started 20 nodes, deployed my very CPU intensive workload, and 12 hours later my account was locked and my nodes all went NotReady.
I immediately replied to the abuse ticket explaining my usage. 4 days later, my account was unlocked and received the "allow high CPU" flag... but they billed me for the nodes as if they had been running that whole time. I asked for a credit (5 days ago) and they haven't replied yet. Probably a little busy over there right now.
So... I'm not too thrilled with DO. I get that cryptocurrency ruined everything, but this has been a frustrating experience and I'm glad I was only using it for a pet project.
Two points establish a line.
Also, your billing system is incredibly spammy (1 to 2 emails a day at times). The bills I get are always for partial months despite the VMs on the account running for a full month.
That's what you get with most providers these days. Ever tried reaching Google or Amazon?
Google on the other hand, I make complaints or suggestions once every couple months and 99% of the time I don't even get a boilerplate response.
So don't lump everyone together in an attempt to paint DO in a better light when it's not true.
And of course DO will still retain the ability to suspend your account for suspected fraud - that is the case with any cloud services company, and any online business in general (check your ToS). Again, I can't think of any business that will en masse promise to never react to any fraudulent users. It's how this process is performed that matters and that's what they are improving.
That is actually a general pattern. Negative, dismissive responses come fast because they're reflexive. They don't require processing significant information, nor reflection, nor thoughtful writing—all of which take time. Therefore they are the first to appear.
After that, a second wave of responses gets triggered because people read the first wave and are dismayed by how negative it was. That is the contrarian dynamic: https://hn.algolia.com/?query=%22contrarian%20dynamic%22&sor....
This wasn't a failure that impacted thousands of customers [#]. DO could've just fixed things up for this customer and said not a word more, and changed nothing, and everyone would've forgotten about it by about ... now.
Instead they dedicated a nontrivial amount of resources to understanding what went wrong -- identifying not just a single cause, but several -- and publicly explained what happened, without a lot of weasel words, and what they're going to do to fix it.
It's an awesome response.
[#]: ...at that specific time. Yes, others have probably previously been impacted.
What I don't understand is why anyone would complain about a post-mortem as PR. If that's what DO was up to here, I'm sold—it shows transparency, thoughtful problem analysis, and swift execution.
More than what this means to me as a customer, I could surely stand to learn a lot in my own work from the way they approached this situation.
You tell me why a company like theirs, which should be mature, ditched their proprietary support and ticketing system (which actually worked) for a shoddy, misconfigured off-the-shelf product which has equally little excuse for being that bad.
I think it's customary here to be unfair to companies, even when they're doing "the right thing" as DO is right now; but DO has spent their customer patience budget elsewhere, so I'm not going to be surprised if people view the post mortem as an inadequate replacement for getting it right the first time, rather than a followup to an honest mistake they will actually try hard to avoid in the future.
We at NodeBB have a high enough spend to have access to level 2 agents (their responses are around 1 hour turnaround, if not sooner).
We don't actually spend that much either, compared to what some of our clients pay AWS, etc...
Now, we certainly don't qualify for their highest tier with the dedicated support manager and slack access, but that's ok. DO's been amazing to us as a host.
Is it bad for image if they just have a "send a couple hundred bucks because you need to speak with someone real bad" button?
Their pricing page reads: "world-class technical support to all of our customers—around the clock" (https://www.digitalocean.com/pricing/#Included_services), by the way, which doesn't seem totally accurate to me given the 12h and 29h response times in this account-ban case.
There are many companies that have done this in the past. They are not doing this out of the goodness of their hearts, this is lip service for the fact that their mishap blew up in their faces on twitter. Do you really think they would have gone at length to highlight to the public this incident had it not gone viral?
> Not only are they changing their policies across the board, taking on more risk to improve customer experience, but they are hiring extra people so it does not happen again. Kudos for that!
There is no telling that they are actually going to follow through with anything. Mere lip service.
The bottom line is, people host their businesses and livelihoods on cloud providers and they (the cloud providers) should take the necessary care and precautions when taking destructive actions. Maybe err on the side of the customer instead of shutting down someone's entire business because of some automated heuristic. Maybe have a better response time than 29 hours. Maybe teach basic communication and develop processes so that care agents can see and react appropriately to recent activity on the account. These are not revolutionary concepts, they are simple things that demonstrate customer care, something DigitalOcean is sorely lacking.
No data was lost; it is not destructive in any way.
> because of some automated heuristic.
If the customer had "payment history" none of this would have happened. Probably it was being used under "startup credits"
> people host their businesses and livelihoods on cloud providers
People shouldn't run their entire operation on credits and then blame DO on Twitter.
The only issue is that DO took 29 hours; apart from that, I see no problem with DO.
Why not? Until now, I wouldn't have considered that using credits might make me a second-class customer. They should at least be upfront about that.
Except in the way that the guy may have lost customers or revenue due to the downtime. Being offline, even without data loss, is very destructive for many businesses.
 I don't know anything about his business.
Tell that to the owner who was begging DO for their data back on Twitter. Again, had this not blown up on twitter nothing would have been done.
> If the customer had "payment history" none of this would have happened. Probably it was being used under "startup credits"
What's your point in saying this?? The fact is that the customer faced downtime because of a bug in DO's code.
> people shouldn't run entire operation on credits and blame DO in twitter.
Are you saying that customers on credits aren't subject to SLA's?
> Only issue is that DO took 29 hours, apart from that i see no problem with DO.
I think you seem very biased.
Haven't seen those, can you point me to them? I love companies doing this.
What cloud business do you run that does better, according to your standards?
Refusing service is fine, but holding my data hostage and refusing access to it is not, so I am making a note to not consider DO for any kind of hosting.
(As noted by many observers during the initial event, anyone keeping their data exclusively within one organisation's walls is a profound mistake.)
Because shutting down running servers without warning is completely unacceptable for a B2B infrastructure provider.
I suspect that it almost definitely is not.
As IT professionals we should do better than use words like 'kill' to describe system and account changes.
As I understand it, Digital Ocean suspended the account, and because the (perceived) problem was related to excessive / potentially fraudulent CPU usage, they suspended the machines. The data contained on them was not deleted. Does that match your reading?
As someone who suffered from extremely noisy neighbours in AWS (in the very early days), risking significant damage to the performance guarantees to our customers, I'm actually cautiously happy with automated protection systems. Naturally I'd rather noisy neighbours were throttled so that I never heard them, and I expect that's closer to what happens these days
I don't want to spent too much time dumping on them since they clearly know how badly this situation was handled, but this is an example of terrible automated protection leading to a company that's not enterprise-ready. As you say, AWS probably doesn't publicly promise not to terminate your account, but this would never happen because they understand that availability and security matter more than anything else when running B2B infrastructure.
> Shortly thereafter, DigitalOcean investigated the issue and the Raisup account was unlocked and powered back on.
But it's not clear if any data was deleted by DigitalOcean.
The suggestion the account was unlocked rather than re-created suggests it was not, but OTOH there's no reference to erase, delete, restore, or indeed current state of customer data in that post mortem.
That the customer got unlocked is of course a good thing, but for at least 30 hours the customer couldn't access their data, that's highly problematic.
>I suspect that it almost definitely is not.
It's not a promise. But they don't shut it down without letting you know they're going to be shutting it down first.
I can't promise a super fast resolution - but I'd be happy to work internally to see if there's any outside-the-ordinary workarounds we can supply here if you're willing to follow back up on the ticket.
It's sad to me that your only chance in hell of getting huge companies to listen to you is by shamespamming across social media.
That, coupled with the clear issues following procedure from support, paint a clear picture: customer service is an area to skimp on for big tech.
On Support, we have additional Support Engineers joining our Developer Experience team in mid-June and early July. We will continually assess our ability to provide high-quality responses as fast as possible to all tickets. Our customer feedback will continue to be the measure of how well we're doing, but our goal is that no one will ever need to use social as an escalation path.
- Spaces throwing up errors that magically fix itself a couple of days later.
- Asking about the credits we were promised for when Spaces lost our files results in the question being ignored. Still haven't received the promised credits for 6+ months. I can't even look back at the tickets now as the support system has deleted all the tickets older than a month.
It's gotten to the point where we have started work on migrating off DO, which is unfortunate because DO's offerings looked very attractive.
In all fairness, having worked at a few tech startups, it can be hard to scale customer service to keep up with demand—you don't control how many support tickets come in, and it takes a lot more time to hire and train new customer service agents than it does to spin up new servers, and if you over-hire, it's a lot more costly than shutting down some servers.
Based on the other comments, they don't sound like a huge company. Someone mentioned about 300 employees. I am not sure of their revenue, and I chuckled at the reference to them being "third largest" by some specific criteria (of course, "largest" doesn't mean anything either -- what we're really looking for is "third best" by some criteria we've defined in our heads).
I found this: https://www.canalys.com/newsroom/cloud-market-share-q4-2018-...
That jibes more with other things I've read and with anecdotal experience.
I think the size of the company is less relevant than the last point that you make about the clear issues around training/support. At a company that size, CSR training might be less formal than it needs to be. A one-off, like a customer who is legitimate but in every other way appears to be fraudulent, might get handled via Slack messages to people with the wrong information, rather than with clear guidance followed up by formal training.
It's difficult for a smaller (average cash-flow for their size) company to succeed in highly competitive markets that are, effectively, commodities. A larger company can handle being a loss leader to knock smaller competitors out of the market, providing excellent customer service. They don't. But it'd be easier for them to do so. :)
* 1 year continuous ontime payments at $250+/mo usage
* automatic billing is set up
* billing limits are set up and have been reviewed within the last year
* copy of our business insurance and license
* u2f on all accounts
Killing customer accounts by automated action without any human check just seems like a recipe for disaster. Even if you can respond faster to crypto issues, the effects of a false positive are just unacceptable.
Though apparently the human checks at Digital Ocean don’t work either.
Upon a second review by a different Abuse Operations agent [...] the agent fully denied access back into the account. This action triggered the final “access denied” communication to the customer.
> The initial account lock and resource power down resulted from an automated service that monitors for cryptocurrency mining activity (Droplet CPU loads and Droplet create behaviors).
This sort of 0.01% case is exactly what developers and sysadmins deal with on a regular basis -- some crazy scenario that led to a bug you've never experienced before. The correct and only response is to fix the bug (whether it be software or process), offer whatever compensation to the injured parties, and move on.
I’m prepared to say the lost income potential is massive.
Who would ever use them for production if they can arbitrarily decide that your servers need to be powered down?
Kudos to DO for the open incident management. As someone who does this myself, these are often really painful and hard to get right.
> Responses to account locks were not prioritized differently from a ticket management standpoint to be above less severe tickets.
That's arguably the biggest failure, IMO. The fact that an action which locks/terminates an account is not prioritized any different than a general ticket is pretty jaw-dropping, and I'm glad they're going to change that.
Compare, say, these terms of service from a major dedicated server hosting company.
Either of the parties may terminate this Agreement (including all existing Orders) if: The other party breaches any material obligation under it (other than our obligations covered by an SLA), and fails to begin to cure such a breach within ten days of written notice of such a breach from the non-breaching party, or fails to completely cure such a breach within thirty days of the original written notice; OR a force majeure event continues for more than thirty days.
Now that's what a reasonable B2B contract looks like. That seems to be fairly standard for dedicated server hosting.
A startup that has Fortune 500 clients must have history. I don't get, then, why DigitalOcean says they do not have payment history. Either the startup moved over a few weeks ago (but then why don't they have offsite backups if they just moved?), or, because they're French, they did have payment history but didn't have an American credit card or similar. Not sure what's up.
A business "relationship" is a two way thing. You call and talk to people, and tell them what you want to do, and ask if it's ok.
When I've called and talked to DO reps about what I've wanted to do they have been very accommodating.
Cloud providers that kill accounts - or SAY they kill accounts, must be dropped and not used.
The worst thing that should be possible is for your account to be suspended.
AWS, if there is a billing issue, prevents you making changes to your infrastructure via the console until the billing issue is sorted out - this is good and reasonable.
""Peer review of account terminations. For any account appealing a lock, two agents will be required to review the submission prior to issuing a final deny.""
- I can imagine how this plays out:
(service agent 1 turns to next service agent along) 'This looks like a bad account - I think I should shut it down, what do you think buddy?'
(service agent 2) - 'Yep I trust you, shut it down.'
There is no cloud that will issue such a promise. If your criterion is that a cloud has to promise that you won't get shut down, you just can't use cloud hosting providers.
You may want to explain service credits in some light detail though, for those that are unfamiliar with them.
One wonders how many others didn't get enough Twitter cred, before. That some low-level ticket stamper (even a high-level ticket-stamper) had authority to deep-six a customer on no more say-so than high CPU usage tells us more about the company than an incident report massaged by marketing communication specialists. Simply, the latter sounds good because it has been made to sound good by sounds-good experts, and could say anything; but the event itself is ground truth.
They will need a lot more time and good behavior to live this down.
We trust our people, high-level, low-level, whatever, to make important decisions every day. That's why they are here.
The "marketing communications specialists" are getting slammed a lot here, so I will just point out that they spend most of their time rolling their eyes at my crappy grammar, spelling and ludicrous number of comma splices. I don't think our goal was to sound like anything. We just wanted to lay out our investigation and the follow on work we are undertaking.
Totally agree with your point that trust is earned, and we lost many people's trust in the last few days. That will take time and, as you say, good behavior to earn back, but that is what we are committed to doing.
Giving your ticket punchers authority is good when they are authorized to do what customers need to get or keep going. Giving them authority to eliminate customers, not so much.
I have to agree with the commenters who say it was an exemplary postmortem.
Hospitals have been doing formal postmortems for many years, but the numbers didn't start going down until they instituted checklists.
Now that we're all nit-picking it, however, I think they should remove the "People" section. They did a good job of adjusting process instead of blaming people. The people section, however, might lean toward blaming people. They didn't, in this case, but it could.
Generally, a "People" section that mentions processes not being followed is an incomplete root cause analysis.
Why was it possible for the process not to be followed?
There's obviously a limit to how far it makes sense to drill down with why why why, but stopping at "someone didn't follow guidance" is too early.
Internally, we want to know exactly who did what, when, and why they thought that was the right approach. We're pretty ruthless about getting those facts down; in exchange, we're beyond lenient on anything that resembles punishment of people (barring multiply repeated cases of extraordinarily poor judgment).
We don't publish post-mortems publicly (though we do internally); still, we generally elide names from the published docs (replacing with role names such as "Operator1", "SquadLead1", etc.), but internal to the teams, we really value understanding exactly what happened and don't shy away from understanding the specific people involved. It's not in any way a black mark on someone's record to have downed prod or made a problem worse. It happens; better we understand and accept that.
DO, like (nearly?) all companies (not to mention most people), is obviously greedy and self-interested, and yes, I'm sure a major driver of the quality of response was the twitter storm that erupted, and I don't want to excuse the underlying mistake which was significant, but...
...at least they responded well eventually!
We'll be staying with DO.
We already use other VPS services as backup, and will probably add one or two more. But because of their well-documented response (and at least being able to identify what went wrong, and hopefully to fix it), we aren't going to drop DO.
Thanks for the response, and congrats on standing out in a very small crowd of companies who can own up to their customer service problems.
A very, very small crowd indeed.
I interviewed for a data science job there & the team of engineers seemed really unhappy. They reported into the director of operations, which is a weird place for data science to report, and the managers I met definitely viewed it as a paranoid cost center kind of thing.
Also in the interview process I recall that Digital Ocean made a very low offer and refused to discuss negotiating it. Seemed clear that cheap hires were mandatory for data science / machine learning.
I wouldn't be surprised if this lack of investment meant that some data science intern or bootcamp grad is designing this automatic fraud shutdown system, and that there's a glaring lack of investment in professional usability for a system like that.
My interpretation of this is that the customer had pre-paid credit on their account. Meaning they had not been through the typical bill cycle yet (hitting an external payment method).
How are you interpreting that they are running on credit? As in their account is in debt and they haven't paid yet?
I suppose the moral of the story is - have offsite backups, so you can switch VPS providers in an emergency.
Anecdotally, I use Digital Ocean for a few miscellaneous services, on an account I’ve had forever. I have never had any issues with it. I used to use lesser known low-end VPSes, but stopped when I lost a bunch of data on an incident involving a provider’s failed RAID controller. (It was my fault for not backing up, but I was young and foolish; they mostly served me well, but I do prefer the assurances of bigger providers nowadays.)
I suggest your trust and safety team handle the payment fraud as a separate issue (using payment network intelligence) and resource abuse (spam or botnet) as a separate issue (by monitoring abuse reports, external underground intelligence; NOT resource monitoring or traffic monitoring).
It seems like in this case, weak muddied signals were combined to draw false-positive conclusions.
Also, it is equally important to build reputation score for good users and use that as a backstop to prevent them from getting shot by misbehaving fraud detection algorithms.
Since your business might be a lot of small customers, it is important you find a good way to easily trust a small customer with little usage and little spend. One way you could do this is by having a reasonable default cap on the resources for a new or small account. You could ratchet up this cap after verifying the payment instrument trustworthiness (through automated checks or manual verification process).
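As a rough sketch of that cap-and-ratchet idea (tier thresholds and limits below are invented for illustration, not anything DO actually does):

```python
from dataclasses import dataclass

@dataclass
class Account:
    months_of_paid_history: int
    lifetime_payments_usd: float
    payment_method_verified: bool
    chargebacks: int

def droplet_cap(acct: Account) -> int:
    """Cap resources by demonstrated trust; raise the cap as trust grows."""
    if acct.chargebacks > 0:
        return 2      # fraud signal: keep the blast radius tiny
    if not acct.payment_method_verified:
        return 3      # credits-only / unverified accounts
    if acct.months_of_paid_history >= 12 and acct.lifetime_payments_usd >= 1000:
        return 100    # long, clean paid history
    if acct.months_of_paid_history >= 3:
        return 25
    return 10

print(droplet_cap(Account(0, 0.0, False, 0)))     # 3: brand-new, credits-only
print(droplet_cap(Account(14, 4200.0, True, 0)))  # 100: established payer
```

The point of the cap is that a fraudulent account can only burn a bounded amount of resources, so pegged CPU inside the cap never needs to trigger an automatic power-down.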
Hope this helps.
Companies have to grow to quite a big size before they consider offering various discounts and programs. By that time, the systems and processes are plentiful in number, complexity, and interaction. Management decides to implement a startup credits program and because it's not an instant money maker, it doesn't get treated carefully enough and causes various edge cases for the program's users (and hopefully none for the standard type of user).
In DO's case, startups should be vetted before being gifted credits and therefore excluded from the crypto checks and shutdown potential.
Sometimes it ends up in the customer's favor:
For an example of how poorly one-off programs can end up being implemented, my company is receiving special consideration from Stripe. No fees are being charged at the moment. Well, a customer asked for a refund. I issued the refund. Stripe paid out $25 to the bank account but took back only $23 for the refund because the code that does refunds doesn't know about the fee exemption. Good guy me emailed them about their bug but not much came of it.
Humans should (be adequately trained to) review and handle considerations where termination of service is involved.
From reading the DO response, it does sound like humans were involved (eventually). However, unless a customer's use of the service is posing an immediate and severe threat (security, DoS, whatever), service should not be stopped until AFTER a human has adequately reviewed the situation.
Stories like this remind me why sometimes it's better to use smaller providers who are less automated...
Lots of big companies who deal with small partners (developers, sellers, etc.) could learn from this, including Apple, Amazon, and Google. Lack of explanations, vague explanations, or confusing explanations for account shutdowns or other penalizations are the norm. And for some of these companies it's nearly impossible to talk with someone who can clarify what's wrong.
It's catastrophic to get locked out like that.
Right there, in one sentence, DO has figured out the one thing that all those millions of CPU cores at Google have failed to grasp.
I think they could have worded it better.
While from my own experience I don't see myself using DO again (see my previous posts, I had a similar experience except I didn't complain externally), the points in future measures look like they'll go a long way.
Best of luck to all future customers.
What arrogance! You forgot to mention that the official channel was silent.
> Shortly thereafter, DigitalOcean investigated the issue and the Raisup account was unlocked
2 days is "shortly thereafter" for you?
So this is the way they determine their support department is underresourced? By twitter shaming?
That said I would never use them, Amazon AWS is just a smarter solution all round.
Ultimately if there is a lot of money on the line you need to do the work and pay the money to be multi-vendor, automatically failing over to AMZN or MS or DO or whatever when there is some massive screwup that takes down GOOG for half a day.
My Amazon account was banned because A buyer (I believe was a competitor) purchased an item and claimed it was a fake. I had proof it wasn't, but it didn't really matter. Other than this I had nearly 100% positive feedback. What's funny is that I now buy thousands of dollars per month for my business through Amazon and I keep getting pestered to sign up with a business account.
Google banned an Adsense account when they somehow thought I was faking clicks. I still have no idea where they got this from. My site wasn't even live yet and I wasn't clicking on anything or even displaying ads beyond a simple test page with no traffic.
There are, however, a few things that could be improved about this process:
> Peer review of account terminations. For any account appealing a lock, two agents will be required to review the submission prior to issuing a final deny.
The devil is in the details. Do this in a manner that the person confirming that the account is committing fraud is unaware that they are confirming another's denial; otherwise "dude, can you approve that termination I just did? I want it out of my queue/that guy was a dick." is a risk.
Off the top of my head, I'd probably generate two support tickets (linked, but without that link presented to the account-termination CSR team member), assigned directly to this person, hidden from others ("hiding", along with training/process improvements, is likely enough). If one person disagrees with the termination, close out the other person's ticket. If the CSR sub-org for this is global, place them with staff in opposing time zones to minimize unnecessary confirmations (though you lose a valuable measurement of how consistent your staff is).
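A toy sketch of that blind-review flow (all names and structures here are hypothetical, not DO's real ticketing system):

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class ReviewTicket:
    appeal_id: str
    reviewer: str
    decision: str = "pending"        # "uphold" or "overturn"
    ticket_id: str = field(default_factory=lambda: str(uuid.uuid4()))

def open_blind_reviews(appeal_id, reviewers):
    # Each reviewer sees only their own ticket; the appeal-to-ticket link
    # lives in a table the CSR UI never displays.
    return [ReviewTicket(appeal_id, r) for r in reviewers]

def final_decision(tickets):
    # Termination only proceeds if both reviewers independently uphold it.
    if all(t.decision == "uphold" for t in tickets):
        return "terminate"
    return "unlock"

tickets = open_blind_reviews("appeal-123", ("agent-us-east", "agent-apac"))
tickets[0].decision = "uphold"
tickets[1].decision = "overturn"
print(final_decision(tickets))  # unlock -- one dissent is enough to stop it
```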
> Services that result in the power down of resources will no longer automatically take action on any account, regardless of lack of payment history, for accounts that were engaged more than 90 days prior. These cases will be escalated for manual review.
I can't count how many services I've deployed that started under 90 days ago where the customer failed to add their account information to the service. I can't count them because I don't know. Usually our customer creates their account with instructions from us, and creates an account for us to use which doesn't have permissions over payment details. I wouldn't be surprised if I've had an app go to production where the customer simply forgot to do that important step on day one, or where the customer procrastinated until production, etc. We ask, but I've been lied to about stupider things (thankfully rare, but surprising from people who otherwise look like "grown-ups").
Minimally, it sounds like the whole process here is missing a "Hey, WTF is that thing you're running? Call us or we'll need to turn it all off" alert at least a little while before it ... turns it all off. At login, put a clear notice "We want you to love our services, so we let you try them without asking you for payment information. Unfortunately, we have to have monitoring in place to prevent hostile actors from loving us, too. Because of this, accounts newer than 90-days might have services shut off in error. If you want a notification an hour before action will be taken, provide your mobile phone number and we'll send you a text. Or you can enter credit card information/confirm your identity (not sure what options are available here) and we'll keep the bots from bothering you"
Of course, all of this costs money. And based on the incident response times, an explanation other than "failure to prioritize correctly" might very well be "failure to staff properly/have the tooling in place to handle the volume". Considering the competition in this market, I wouldn't be terribly surprised if "we can't afford it" plays into some of that.
 A little less awful than AWS in a lot of ways for the task I had to do.