
An update on last week's customer shutdown incident - grzm
https://blog.digitalocean.com/an-update-on-last-weeks-customer-shutdown-incident/
======
scotchio
That’s a really fair and reasonable response. Not sure what else people really
expect here.

> The template used for response in account denial will be removed entirely.
> If account access is denied during an appeal, which often is the case as
> most appeals are true bad actors, the agent must create a reasoned response.

Glad this is seen as an issue and corrected.

IMO, this probably would have made this whole thing never escalate if a better
response was previously in place for everyone.

Accidents, shoddy support, whatever: all expected these days unless you have
big cash money agreements in place.

But to kill an account of a responsive person with a gigantic middle finger
email without reasoning was a pretty dumb process in place. You can see the
email on the Twitter thread somewhere.

Glad it’s fixed! Still a DO fan here

Edit: TALKING ABOUT THIS:
[https://pbs.twimg.com/media/D76ocofXoAY_xB5.png](https://pbs.twimg.com/media/D76ocofXoAY_xB5.png)

~~~
sergiosgc
> That’s a really fair and reasonable response. Not sure what else people
> really expect here.

The root cause of suspension is incomprehensible to me. They were suspended
because they launched a set of instances and these were using 100% CPU. How is
that unreasonable and cause for suspension?

I'm not a Digital Ocean customer, but if I were, I'd expect to be able to use
the resources I bought without risk of being suspended. This is the root
cause. It was compounded by incompetent customer support, but I really do not
understand the suspension cause.

The response tackles all secondary factors, but does not talk about the root
cause. I'd expect it to.

~~~
numlock86
Having your instances run at 100% CPU pretty much raises a red flag at any
cloud provider. Depending on your plan it either gets shut off (like in this
case) or you get a notice about "suspicious" behavior and a bit of time to fix
the "issue".

~~~
sergiosgc
What's next? Having your disks use too much I/O causes the same response? Or
actually using the RAM you pay for?

I run my own iron, with cloud only for elastic loads. Every time I launch a
cloud instance, it will be using 100% CPU, otherwise I wouldn't launch it.
It's unacceptable to label that profile as "suspicious". It never happened to
me on AWS or Azure.

~~~
chewbacha
> ...you pay for

The major indicator here was the lack of payment history, so they hadn’t paid
for it but were working off of credit. I think it’s a nuance that’s very
important.

~~~
sergiosgc
I'm sorry to dig heels, but that's no excuse. If the credit they were given
allowed them to use the resources, it follows that using the resources is not
a breach of contract.

From the description I imagine Digital Ocean offers a free period or tier, to
reduce friction in customer acquisition. This is a marketing tool, and must
not, in any way, cause situations like the one described.

If a marketing tool induces service failure, it has no place in a professional
setting.

~~~
chewbacha
Credit and promo codes are also used extensively for fraud. If a business had
been in operation for a while solely on credit, it may well generate a false
positive in a fraud detection algorithm if it scaled dramatically.

But it is important to disconnect monetary spending from coupons or vouchers
as they are not equivalent.

You mention free tier but that’s not what was at issue here. Also, 10
additional instances isn’t in the free tier of any cloud service I’ve used.

I’m not saying that DO is correct, but I believe the parent argument was a
simplification of the events in question. Also, DO's handling of it via support
was far worse than the initial algorithm, imo.

~~~
sergiosgc
> But it is important to disconnect monetary spending from coupons or vouchers
> as they are not equivalent.

They must be. If they are not, then you've entered the territory I referred to,
where marketing actions are impacting service availability. This impact is not
acceptable in professional services.

In this specific case, if voucher giveaways produce ingress of resource
leeches (cryptominers that will never result in real customers), and if it is
impossible to prevent this undesired ingress without impacting existing
customers (which it is), then that marketing action must stop. This is the
conclusion I expected from the post-mortem.

~~~
chewbacha
Money is fungible and fiat while vouchers are vendor-locked and not fiat,
that's why they can't be evaluated the same.

I won't try to argue whether they should be removed in their entirety, that's
not even an option I had even considered until now.

------
rphlx
Their apparent conclusion that high CPU% for a few hours or half day or
whatever means "cryptocurrency miner - ban ASAP!" is naive and flawed.

Compute offload is an ancient and fairly common use case for the public cloud;
my VPS (or ten..) should be able to burn 100% CPU for many hours compiling a
large project, even if it means they make less profit than they would have had
I instead run a static web server that sleeps on IO, imposing nearly no CPU
load.

At the very least they should provide some objective, quantitative guidance on
exactly how many CPU-seconds-per-hour they consider acceptable/not-abuse (or,
if not CPU-seconds, then increased host power consumption, or whatever they
are ultimately trying to limit to ensure they can pack a few hundred near-
zero-load servers onto the same host to make glorious truly massive profits
all the time).

Don't make customers guess at whether their workload will trigger some opaque
but hyper-aggressive abuse automation or not.

~~~
hamandcheese
I think the heuristic they use is spike of high cpu + non-established billing
history, not just CPU. That seems to me much more indicative of potential
fraud, though by no means foolproof.
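
A rough sketch of what a combined check like that might look like. To be clear, the field names, thresholds, and the function itself are made up for illustration; DO hasn't published the actual rules:

```python
from dataclasses import dataclass


@dataclass
class Account:
    # All fields are hypothetical, not DO's real account model.
    total_payments_usd: float   # real money collected so far (credits excluded)
    peak_cpu_fraction: float    # recent CPU load across droplets, 0.0 to 1.0
    new_droplets_last_day: int  # droplets created in the last 24 hours


def should_flag(acct: Account, cpu_threshold: float = 0.9,
                burst_threshold: int = 5) -> bool:
    """Flag only when ALL signals line up, never on CPU load alone."""
    high_cpu = acct.peak_cpu_fraction >= cpu_threshold
    burst = acct.new_droplets_last_day >= burst_threshold
    no_history = acct.total_payments_usd <= 0
    return high_cpu and burst and no_history
```

Note that an account running purely on credits looks identical to a brand-new one here, which is presumably the false positive in this incident.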

~~~
rphlx
Indeed, but AFAICT there are apparently still some opaque, undefined CPU%
limits for people paying with CC instead of free credits. They also mentioned
elsewhere that customers paying via PO are exempt from the automated miner
murderer, but that was news to me and I guess just furthers my point: we
shouldn't have to trawl HN threads to understand your CPU% abuse limits; they
should be spelled out specifically and quantitatively in the main TOS, for
each type of payment method, and any other factor(s) that affect them.

------
marcinzm
Reading this response it seems that crypto-mining is not allowed on digital
ocean as they have checks against it. The TOS doesn't say so explicitly but
does note that:

> violation of any of these Terms of Service or any law, or if you misuse
> system resources, such as, by employing programs that consume excessive
> network capacity, CPU cycles, or disk IO

By my reading that seems to mean that you're not allowed to use your VMs to
their full capacity due to them being over-provisioned. This is in contrast to
AWS who are more explicit on which instances (T instances) are over-
provisioned and exactly how they're throttled.

~~~
bcooks
If you want to do cryptocurrency mining on DO that is actually okay with us.
Some of the other respondents are correct: the behavior we were looking for was
really around fraudulent accounts being created and performing cryptocurrency
mining. This is why the trigger that flagged this account was using payment
history as a key factor in the triggering.

~~~
chris_wot
Your post-mortem implies this is not allowed at all.

~~~
Johnny555
_Your post-mortem implies this is not allowed at all._

Not sure why you were downvoted, I had the same impression, after reading:

 _...an automated service that monitors for cryptocurrency mining activity
(Droplet CPU loads and Droplet create behaviors). These signals, coupled with
a number of account-level signals (including payment history and current run
rate compared to total payments) are used to determine if automated action is
warranted to minimize the impact of potential fraudulent high-cpu-loads on
other customers_

This sounds like they don't permit extended high CPU loads due to the impact
it can have on other customers.

~~~
bgirard
The keyword here is 'fraudulent'. High-cpu-loads is allowed, but an automated
service monitors for fraudulent activity.

~~~
dlubarov
What would "fraud" mean in this context? Are they talking about customers who
don't pay their bill to DO? (If so, seems like the account should just be
temporarily suspended until the bill is paid.) Or are they talking about fraud
to other parties, like phishing sites? (If so, I don't see the connection to
crypto mining.)

~~~
hoseja
My understanding is that they're trying to prevent users from creating new
accounts, running 100%CPU until it's time to pay the bill and then just not
paying, moving on to another new account.

edit: from elsewhere ITT it seems they're doing this with stolen credit cards.

------
andr
I have no relation to DO, but I'm surprised by the negative responses in this
thread. I can't think of any other major company conducting a public
postmortem for a customer service failure (as opposed to networking/ops
failure). Not only are they changing their policies across the board, taking
on more risk to improve customer experience, but they are hiring extra people
so it does not happen again. Kudos for that!

And of course DO will still retain the ability to suspend your account for
suspected fraud - that is the case with any cloud services company, and any
online business in general (check your ToS). Again, I can't think of any
business that will en masse promise to never react to any fraudulent users.
It's how this process is performed that matters and that's what they are
improving.

~~~
computerex
> I can't think of any other major company conducting a public postmortem for
> a customer service failure (as opposed to networking/ops failure).

There are many companies that have done this in the past. They are not doing
this out of the goodness of their hearts, this is lip service for the fact
that their mishap blew up in their faces on twitter. Do you really think they
would have gone at length to highlight to the public this incident had it not
gone viral?

> Not only are they changing their policies across the board, taking on more
> risk to improve customer experience, but they are hiring extra people so it
> does not happen again. Kudos for that!

There is no telling that they are actually going to follow through with
anything. Mere lip service.

The bottom line is, people host their businesses and livelihoods on cloud
providers and they (the cloud providers) should take the necessary care and
precautions when taking destructive actions. Maybe err on the side of the
customer instead of shutting down someone's entire business because of some
automated heuristic. Maybe have a better response time than _29 hours_.
Maybe teach basic communication and develop processes so that care agents can
see and react appropriately to recent activity on the account. These are not
revolutionary concepts, they are simple things that demonstrate customer care,
something DigitalOcean is sorely lacking.

~~~
rajaganesh87
> precautions when taking destructive actions.

No data was lost, it is not destructive in any way.

> because of some automated heuristic.

If the customer had "payment history" none of this would have happened.
Probably it was being used under "startup credits"

> people host their businesses and livelihoods on cloud providers

People shouldn't run an entire operation on credits and blame DO on Twitter.

The only issue is that DO took 29 hours; apart from that I see no problem with DO.

~~~
pbreit
It should be pretty hard to shut down a legit biz. Seemed automatic in this
case.

~~~
gavindean90
I mean it is. Unless your business is wholly dependent on a service from my
business.

------
jwr
Key takeaway for me: DigitalOcean might still kill your account at any time,
revoking your access to your data.

Refusing service is fine, but holding my data hostage and refusing access to
it is not, so I am making a note to not consider DO for any kind of hosting.

~~~
bepvte
Because I reinstalled my droplet with a different filesystem manually, the
snapshot restore doesn't work. Support tells me they can't do anything, so my 2~
of chat logs are sitting in a disk image that they can't restore bc they need
to mount it for some reason...

~~~
Sebguer
Hey there! I would love to follow up on the issue you're describing here. It
looks like you tried to bring a disk image over from a provider in a format
that we don't support, and unfortunately there's nothing trivial we could do
to get a working Droplet out of it (which is a requirement for us to expose
the volume within the systems we have).

I can't promise a super fast resolution - but I'd be happy to work internally
to see if there's any outside-the-ordinary workarounds we can supply here if
you're willing to follow back up on the ticket.

~~~
bepvte
I replied to my ticket (#2710287). Thank you so much for giving it a shot by
the way.

------
nkozyra
I appreciate that they did this.

It's sad to me that your only chance in hell of getting huge companies to
listen to you is by shamespamming across social media.

That, coupled with the clear issues following procedure from support, paints a
clear picture: customer service is an area to skimp on for big tech.

~~~
laughinghan
In what way is DigitalOcean a "huge company"? At ~300 employees, it's closer
to the SMB's definition of a small business (<250 employees) than mid-size
(<500 employees).

In all fairness, having worked at a few tech startups, it can be hard to scale
customer service to keep up with demand—you don't control how many support
tickets come in, and it takes a lot more time to hire and train new customer
service agents than it does to spin up new servers, and if you _over_-hire,
it's a lot more costly than shutting down some servers.

~~~
omeid2
DO is claimed to be "third-largest hosting company in the world in terms of
web-facing computers", so that should give you an idea of how many customers
they have.

~~~
im3w1l
How do they manage so many boxes with so few people? Do they rent metal and
resell it with value add software?

~~~
breakingcups
By skimping out on support, clearly.

------
exabrial
Digital Ocean: allow me to do some extended verification so you know exactly
who I am and reduce your risk. In exchange, there is no automated locking,
rather we are contacted and have 24 hours to mitigate the issue.

Requirements:

* 1 year continuous ontime payments at $250+/mo usage

* automatic billing is set up

* billing limits are set up and have been reviewed within the last year

* copy of our business insurance and license

* u2f on all accounts

Fair?

~~~
bcooks
The first few items on your list are actually a part of what we meant by
"having billing history with us". There are a number of things we look at in
that bucket. We use these items as a part of validating users before taking
any action (yes, we clearly failed on this account due to the credits which is
a clear bug). As far as offering things like a copy of your business license
or other means of verification that isn't a bad idea. As an example people
paying with POs today are excluded from the algorithm already.

~~~
kijin
Please make it official, so that people can have peace of mind knowing that
they've got that "verified" badge. People hate having to wonder whether
they're at risk of crossing an invisible, inscrutable, and constantly changing
threshold. See: PayPal and AdSense account forfeitures. You could do so much
better than that.

------
Aeolun
As happy as I am to see this post itself, the mistakes made here are pretty
appalling.

Killing customer accounts by automated action without any human check just
seems like a recipe for disaster. Even if you can respond faster to crypto
issues, the effects of a false positive are just unacceptable.

Though apparently the human checks at Digital Ocean don’t work either.

~~~
laughinghan
According to the post, that's not what happened? The customer account wasn't
terminated by the automated system, but rather by the second Abuse Ops agent.

 _Upon a second review by a different Abuse Operations agent [...] the agent
fully denied access back into the account. This action triggered the final
“access denied” communication to the customer._

~~~
mintplant
That was after the automated process locked access to the account and _powered
off_ all associated machines.

------
grzm
Digital Ocean's follow-up to "DigitalOcean Killed Our Company"
[https://news.ycombinator.com/item?id=20064169](https://news.ycombinator.com/item?id=20064169)

------
cannonedhamster
We dropped DO from our company usage after similar issues, though honestly DO
probably wasn't the right place for our product at that stage of development.
What was meant as a POC became technical debt and an outage forced us to come
to terms with the fact that by the time the issue happened we had more than
enough of our own capacity to run on our own metal.

Kudos to DO for the open incident management. As someone who does this myself,
these are often really painful and hard to get right.

------
caffeinewriter
Good response from DO, but this line jumps out at me.

> Responses to account locks were not prioritized differently from a ticket
> management standpoint to be above less severe tickets.

That's arguably the biggest failure, IMO. The fact that an action which
locks/terminates an account is not prioritized any differently than a general
ticket is pretty jaw-dropping, and I'm glad they're going to change that.
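
Conceptually the fix is small: a priority queue keyed on severity. A minimal sketch, where the ticket categories and their ranking are my own assumptions and not DO's actual ticketing setup:

```python
import heapq
import itertools

# Hypothetical severity ranking; lower number is handled first.
SEVERITY = {"account_lock": 0, "outage": 1, "general": 2}

# Tie-breaker so tickets of equal severity stay in arrival order.
_arrival = itertools.count()


def push(queue: list, kind: str, ticket_id: str) -> None:
    """Add a ticket, ordered by severity and then by arrival."""
    heapq.heappush(queue, (SEVERITY[kind], next(_arrival), ticket_id))


def pop(queue: list) -> str:
    """Return the id of the most urgent ticket."""
    return heapq.heappop(queue)[2]
```

With this, a lock appeal filed after a pile of general tickets still gets looked at first.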

~~~
bcooks
Yeah... that one was painful and we are fixing it. At least if the priority
placed this at the top of the queue we could have acted faster. Probably the
same outcome due to the other issues involved in this incident though.

~~~
caffeinewriter
I appreciate all your transparency and engagement on this issue. It probably
would have had the same outcome, yes, but potentially resolved much more
quickly. Regardless, the fact that you're fixing it is music to my ears.

------
mtw
The startup claimed they had all their backups on digitalocean, which
contained data of Fortune 500 companies.

A startup who has fortune 500 clients must have history. I don't get then why
digitalocean says they do not have payment history. Either the startup moved a
few weeks ago - but then why don't they have offsite backups if they just
moved. Or because they're french they did have payment history but did not
have an American credit card or similar.. not sure what's up

~~~
heartbreak
Their payment history was with DO credits according to the post.

~~~
heliodor
Maybe one of the problems is DO's views of credits. Maybe things would work
out better if they would treat credits like real money instead of phony money
and tighten up how the credits are handed out.

~~~
__HYde
Yes, DO have stated that how they viewed credits was a mistake and that they
will be addressing it.

------
oblib
I really cannot blame DO for this incident. These kinds of things must be
handled in a learn as you go way and when I consider what one can do with a DO
VPS (or 10) it's astonishing. I would expect them to automatically flag some
uses.

A business "relationship" is a two way thing. You call and talk to people, and
tell them what you want to do, and ask if it's ok.

When I've called and talked to DO reps about what I've wanted to do they have
been very accommodating.

------
Animats
Here's what's really wrong. This is a B2B service with B2C-grade terms of
service. You don't want to base your business on one of those. Not one with a
"sole discretion" termination clause. Those are for low-value consumer facing
services only.

Compare, say, these terms of service from a major dedicated server hosting
company.[1]

 _Either of the parties may terminate this Agreement (including all existing
Orders) if: The other party breaches any material obligation under it (other
than our obligations covered by an SLA), and fails to begin to cure such a
breach within ten days of written notice of such a breach from the non-
breaching party, or fails to completely cure such a breach within thirty days
of the original written notice; OR a force majeure event continues for more
than thirty days._

Now that's what a reasonable B2B contract looks like. That seems to be fairly
standard for dedicated server hosting.

[1]
[https://info.codero.com/hubfs/Linked%20Assets/Legal%20Docume...](https://info.codero.com/hubfs/Linked%20Assets/Legal%20Documents/Codero_TOS.pdf)

------
andrewstuart
Nothing in the statement from Digital Ocean indicates that they won't kill
your account or shutdown your systems - that's not the sort of cloud host any
company can afford to use.

Cloud providers that kill accounts - or SAY they kill accounts, must be
dropped and not used.

The worst thing that should be possible is for your account to be suspended.

AWS, if there is a billing issue, prevents you making changes to your
infrastructure via the console until the billing issue is sorted out - this is
good and reasonable.

"Peer review of account terminations. For any account appealing a lock, two
agents will be required to review the submission prior to issuing a final
deny."

- I can imagine how this plays out:

(service agent 1 turns to next service agent along) 'This looks like a bad
account - I think I should shut it down, what do you think buddy?'

(service agent 2) - 'Yep I trust you, shut it down.'

~~~
nothal
The article discusses violating TOS, not a billing issue. I don't think it's
unreasonable to disable an account in that circumstance but I agree that
deleting images/resources without allowing customers to defend themselves and
backup the systems would be unfair.

~~~
ben0x539
> The article discusses violating TOS, not a billing issue.

That's not the impression I got. It sounds like the issue was that a account
with misinterpreted payment history was showing bitcoin-mining-like usage
patterns. Mining is not against the terms of use, they were just erroneously
convinced themselves that the customer was not going to pay for it.

~~~
ben0x539
I'd like to apologize for the typos in my previous comment that I neglected to
notice before the edit window expired.

------
newsoul2019
I have read and written similar RCA's in the past, this one is very good IMHO.

------
craftinator
Barry Cooks did a phenomenal job with this after-action. He not only publicly
accepted fault on DO's behalf (+1), not only stated the incident timeline
clearly and without bias (+2), but also showed mitigation steps and procedural
changes to avoid this in the future and prioritize customer business interests
(+3). Many medium and larger sized companies should take note of this handling
style (looking at you, Google and Facebook). I love that there was no generic
PR "we're very sorry". Succinct, accurate, and without spin (+4).

~~~
lioeters
I agree, the incident report was well done. The combination of factors that
led to the issue was described in clear detail, and I was glad to see a
concrete plan to improve various aspects to avoid future cases like this. It
certainly helped to regain trust.

------
silversconfused
I had been expecting a short blip of an update denying anything of consequence
(a twitter post promised a status update, but well, you know...) but this
transparency significantly exceeds expectations. Nicely done DO.

You may want to explain service credits in some light detail though, for those
that are unfamiliar with them.

------
Ill_ban_myself
That is everything I'd hoped to see as a developer and digital ocean customer.
Good response.

~~~
muppetman
Totally. I'm fairly new to DO and after seeing what happened was re-thinking
my decision. But this is a solid followup, "we made a mistake" post so I think
I can rest easy.

~~~
ncmncm
They hoped so.

One wonders how many others didn't get enough Twitter cred, before. That some
low-level ticket stamper (even a high-level ticket-stamper) had authority to
deep-six a customer on no more say-so than high CPU usage tells us more about
the company than an incident report massaged by marketing communication
specialists. Simply, the latter sounds good because it has been made to sound
good by sounds-good experts, and could say anything; but the event itself is
ground truth.

They will need a lot more time and good behavior to live this down.

~~~
bcooks
I agree on the twitter cred point. The fact that this happened in the end,
personally I think it is a good thing as it highlighted a weakness we must
fix.

We trust our people high-level, low-level whatever to make important decisions
every day. That's why they are here.

The "marketing communications specialists" are getting slammed a lot here, so
I will just point out that they spend most of their time rolling their eyes at
my crappy grammar, spelling and ludicrous number of comma splices. I don't
think our goal was to sound like anything. We just wanted to lay out our
investigation and the follow on work we are undertaking.

Totally agree with your point that trust is earned and we lost many people's in
the last few days. That will take time and as you say good behavior to earn
back, but that is what we are committed to doing.

~~~
ncmncm
I talk about mktg comms because I have worked at places where angry customers
got earnest letters promising changes, but the manager expected to implement
the changes said "No, we're not doing that!" Or "OK" but nothing happened. So
I don't give much credit for promises, even when it was the right thing to
promise.

Giving your ticket punchers authority is good when they are authorized to do
what customers need to get or keep going. Giving them authority to eliminate
customers, not so much.

I have to agree with the commenters who say it was an exemplary postmortem.

Hospitals have been doing formal postmortems for many years, but the number of
them didn't start down until they instituted checklists.

------
codazoda
I think this response is really good.

Now that we're all nitpicking it, however, I think they should remove the
"People" section. They did a good job of adjusting process instead of blaming
people. The people section, however, might lean toward blaming people. They
didn't, in this case, but it could.

~~~
bcooks
Hey there. Thanks for this feedback. I think it is important to be open and honest
but not blame-oriented in our review of the situation. People make mistakes
and that is okay, so long as they aren't willful or due to incompetence.
Neither of which was the case here. The key thing is not to create a situation
where a mistake is an individual's fault. My general view is if people are
making mistakes then we have done something wrong as a company and need to
understand and fix the tools/training/process that led to the mistake.

~~~
Dayshine
I'm involved in work around reviewing medical care.

Generally, a "People" section that mentions processes not being followed is an
incomplete root cause analysis.

Why was it possible for the process not to be followed?

There's obviously a limit to how far it makes sense to drill down with why why
why, but stopping at "someone didn't follow guidance" is too early.

------
Lazare
Huh. Well, that's how you handle a post mortem! You outline what you did
wrong, and then you outline how you're going to fix it. And it looks like the
proposed fixes are appropriate, so...

DO, like (nearly?) all companies (not to mention most people), is obviously
greedy and self-interested, and yes, I'm sure a major driver of the quality of
response was the twitter storm that erupted, and I don't want to excuse the
underlying mistake which was significant, but...

...at least they responded well eventually!

~~~
creeble
Agreed, and it was more or less the response I was looking for a couple of
days ago.

We'll be staying with DO.

We already use other VPS services as backup, and will probably add one or two
more. But because of their well-documented response (and at least being able
to identify what went wrong, and hopefully to fix it), we aren't going to drop
DO.

Thanks for the response, and congrats on standing out in a very small crowd of
companies who can own up to their customer service problems.

A very, very small crowd indeed.

------
mlthoughts2018
Digital Ocean notoriously doesn’t invest well in data science or machine
learning, even having some key data science people leave recently.

I interviewed for a data science job there & the team of engineers seemed
really unhappy. They reported into the director of operations, which is a
weird place for data science to report, and the managers I met definitely
viewed it as a paranoid cost center kind of thing.

Also in the interview process I recall that Digital Ocean made a very low
offer and refused to discuss negotiating it. Seemed clear that cheap hires
were mandatory for data science / machine learning.

I wouldn’t be surprised if this lack of investment meant that some data
science intern or bootcamp grad is designing this automatic fraud shut down
system, and that there’s a glaring lack of investment in professional
usability for a system like that.

~~~
bcooks
Sorry to hear that you had a bad experience and left with a bad impression of
that team. We have a number of data sciences efforts including in the core R&D
group where we are growing and working to improve models in support of a
number of fleet monitoring tasks.

------
treis
I don't quite get the "running on credits w/ no payment history" and "ruined
our business" combo. How can they run a business and never pay?

~~~
vlahmot
Many (all?) of the cloud providers offer credits to startups (credits as in
free $ to spend on their services). So if they hadn’t burned through that,
there would be no payment yet. (The startup I work at got $20k in credits and
didn’t pay a dime for the first year)

~~~
treis
I knew they gave credits, but I didn't realize it was to the level of $20k
worth of credits. I think I got a couple hundred from DO.

------
woofie11
Did anyone notice DO leaked customer financials in this post? If I were a
startup running on credit, I definitely wouldn't want to advertise that. WTF?

~~~
cstrat
I don't understand what you think the issue here is?

My interpretation of this is that the customer had pre-paid credit on their
account. Meaning they had not been through the typical bill cycle yet (hitting
an external payment method).

How are you interpreting that they are running on credit? As in their account
is in debt and they haven't paid yet?

~~~
dylan604
Perception is everything. Many companies, especially Fortune 500, will do deep
research before doing business with anyone. I've been through them where we have
been disqualified specifically due to our infancy and lack of proof of long
standing. If someone read/mis-read someone's post that gave them the idea that
the company didn't have enough runway, they might move to the next potential
suitor.

------
lacampbell
IIRC it had to blow up on Twitter before DO paid any attention. At the end of
the day that's why it's an issue, because they didn't sort it out until it
went public.

I suppose the moral of the story is - have offsite backups, so you can switch
VPS providers in an emergency.

------
jchw
When I clicked on this I had assumed DO had gone down last week. I was
surprised when I finally realized what they were talking about. I think it is
cool and commendable to offer this level of transparency on an issue like
this.

Anecdotally, I use Digital Ocean for a few miscellaneous services, on an
account I’ve had forever. I have never had any issues with it. I used to use
lesser known low-end VPSes, but stopped when I lost a bunch of data on an
incident involving a provider’s failed RAID controller. (It was my fault for
not backing up, but I was young and foolish; they mostly served me well, but I
do prefer the assurances of bigger providers nowadays.)

------
UseStrict
Depending on the severity and length, it could still have a long-term impact
on that business. Also a bit unsettling that seemingly basic safety controls
failed. But it is good to see DO being open and thorough about this incident.

~~~
newsoul2019
That's the only way to get confidence back. I especially like the two peer
review policy.

------
Animats
If it were not for public shaming on Twitter, the guy would still be turned
off.

------
vinay_ys
Dear DO, From your RCA it appears this is a type of fraud where a stolen credit
card is used to create a new cloud account and run up a huge charge in a short
amount of time. Nowadays it could be for cryptocurrency mining (a few years
ago and maybe still today, it could have been to run spambots or botnets or
whatever).

I suggest your trust and safety team handle the payment fraud as a separate
issue (using payment network intelligence) and resource abuse (spam or botnet)
as a separate issue (by monitoring abuse reports, external underground
intelligence; NOT resource monitoring or traffic monitoring).

It seems like in this case, weak muddied signals were combined to draw false-
positive conclusions.

Also, it is equally important to build a reputation score for good users and use
that as a backstop to prevent them from getting shot by misbehaving fraud
detection algorithms.

Since your business might be a lot of small customers, it is important you
find a good way to easily trust a small customer with little usage and little
spend. One way you could do this is by having a reasonable default cap on the
resources for a new or small account. You could ratchet up this cap after
verifying the payment instrument trustworthiness (through automated checks or
manual verification process).
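
A minimal sketch of that capped-then-ratcheted idea, in Python. The tier names
and the limits are made up for illustration and are not anything DO is known
to use; the point is only that a new account gets refused at the cap instead
of being shut down after the fact.

```python
from dataclasses import dataclass

# Hypothetical trust tiers and instance caps, invented for this example.
CAPS = {"unverified": 2, "payment_verified": 10, "manually_reviewed": 50}


@dataclass
class Account:
    trust_level: str = "unverified"
    running_instances: int = 0


def can_launch(account: Account, requested: int) -> bool:
    """Allow a launch only if it keeps the account within its current cap."""
    cap = CAPS[account.trust_level]
    return account.running_instances + requested <= cap


def promote(account: Account, new_level: str) -> None:
    """Ratchet the cap upward after verification; never lower it here."""
    if CAPS[new_level] > CAPS[account.trust_level]:
        account.trust_level = new_level
```

A refused launch is a recoverable event for a legitimate customer (they verify
their payment method and try again), which is the whole appeal over locking.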

Hope this helps.

------
heliodor
One of the problems is the credits initiative for startups (I'm assuming
that's how this customer ended up running on credits.)

Companies have to grow to quite a big size before they consider offering
various discounts and programs. By that time, the systems and processes are
plentiful in number, complexity, and interaction. Management decides to
implement a startup credits program and because it's not an instant money
maker, it doesn't get treated carefully enough and causes various edge cases
for the program's users (and hopefully none for the standard type of user).

In DO's case, startups should be vetted before being gifted credits and
therefore excluded from the crypto checks and shutdown potential.

Sometimes it ends up in the customer's favor:

For an example of how poorly one-off programs can end up being implemented, my
company is receiving special consideration from Stripe. No fees are being
charged at the moment. Well, a customer asked for a refund. I issued the
refund. Stripe paid out $25 to the bank account but took back only $23 for the
refund because the code that does refunds doesn't know about the fee
exemption. Good guy me emailed them about their bug but not much came of it.

------
blunte
Automated systems that are infallible are great. Most are not infallible, and
they should just provide notices to humans.

Humans should (be adequately trained to) review and handle considerations
where termination of service is involved.

From reading the DO response, it does sound like humans were involved
(eventually). However, unless a customer's use of the service is posing an
immediate and severe threat (security, DoS, whatever), service should not be
stopped until AFTER a human has adequately reviewed the situation.

Stories like this remind me why sometimes it's better to use smaller providers
who are less automated...

------
vjust
I like it when a company conducts a full failure analysis and takes
responsibility. Doesn't happen often. Hope DO meaningfully improves its
service as a result.

It's catastrophic to get locked out like that.

------
ilamont
_The communication regarding denial of access to the account creates a sense
of helplessness; the finality without explanation requires correcting._

Lots of big companies who deal with small partners (developers, sellers, etc.)
could learn from this, including Apple, Amazon, and Google. Lack of
explanations, vague explanations, or confusing explanations for account
shutdowns or other penalizations are the norm. And for some of these companies
it's nearly impossible to talk with someone who can clarify what's wrong.

------
ramtatatam
So you cannot use your VPS to do whatever you want with it? I am having
trouble understanding what's wrong with crypto mining, such that access to the
VPS you paid for gets denied. Or was that some sort of free plan he was
running on? Still... why not introduce CPU quotas rather than blocking?

~~~
larkeith
As mentioned elsewhere in the thread, accounts with high cpu usage _and no
billing history_ were locked - symptomatic of cryptominers creating accounts,
using free credit or stolen CCNs, and ditching.
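
Roughly, the rule as described is a conjunction of two signals rather than CPU
usage alone. A sketch in Python, where the 0.95 threshold and the parameter
names are assumptions for illustration, not DO's actual values:

```python
def should_flag(cpu_utilization: float, paid_invoices: int,
                cpu_threshold: float = 0.95) -> bool:
    """Flag only when sustained CPU is high AND there is no billing history.

    cpu_threshold is a hypothetical cutoff chosen for this example.
    """
    return cpu_utilization >= cpu_threshold and paid_invoices == 0
```

So a long-standing paying customer pegging their CPUs would not trip it, while
a brand-new account running on credits would.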

------
tmaly
I think it's great they wrote the post, but the tone still leaves a sour taste.

I think they could have worded it better.

------
zxcvbn4038
“The communication regarding denial of access to the account creates a sense
of helplessness; the finality without explanation requires correcting.”

Right there, in one sentence, DO has figured out the one thing that all those
millions of CPU cores at Google have failed to grasp.

------
janjanson
I think this is a great response.

While from my own experience I don't see myself using DO again (see my
previous posts, I had a similar experience except I didn't complain
externally), the points in future measures look like they'll go a long way.

Best of luck to all future customers.

------
EugeneOZ
> The account owner leveraged Twitter as an avenue to call attention to the
> mistake

What arrogance! Because the official channel was silent, you forgot to mention.

> Shortly thereafter, DigitalOcean investigated the issue and the Raisup
> account was unlocked

2 days is "shortly thereafter" for you?

------
lbj
It's a good response, clearly outlines causes and future effects. But I'd be
very cautious to deal with a company which found such a lazy detection
mechanism to be adequate, considering the potential cost to real clients.

------
chris_wot
_Additional hiring has been approved for both Support and AbuseOps to reduce
ticket queue wait times._

So this is the way they determine their support department is underresourced?
By twitter shaming?

------
Gelob
Something tells me making their abuse department 24/7 won't help. I love DO,
but I had to make a support ticket recently and it took them about 2-3 days to
respond.

------
steveharman
Barry Cooks for President, or any political role where explaining things in a
balanced way seems impossible to the current individuals.

------
PatrolX
It's good to see a company pay attention and take action.

That said I would never use them, Amazon AWS is just a smarter solution all
round.

------
jaakl
How are the customer damages compensated? With just "sorry, we ruined years of
your work with a couple of bad clicks"?

~~~
rphlx
I don't think there is a single cloud provider that accepts unlimited
liability and wholly compensates customers for lost data, lost sales during
downtime that was the provider's fault, etc. Their liability is generally
strictly limited to the cost of service... so something like a $1 credit for
the 6 days that your $5/mo VPS was down. Unless of course you are a very large
customer that credibly threatens to switch, at which point you may get some
special treatment above and beyond the TOS, though even then, rarely enough to
fully recover your actual loss.
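
The arithmetic behind that $1 figure, as a small Python sketch. It assumes a
simple pro-rata credit over a 30-day month; actual TOS formulas vary by
provider and this is not any specific provider's rule.

```python
def prorated_credit(monthly_price: float, days_down: int,
                    days_in_month: int = 30) -> float:
    """Credit limited to the cost of service for the days it was down."""
    return round(monthly_price * days_down / days_in_month, 2)


print(prorated_credit(5.00, 6))  # 1.0
```

Note that nothing in the calculation depends on what the outage cost the
customer, which is exactly the gap being described.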

Ultimately if there is a lot of money on the line you need to do the work and
pay the money to be multi-vendor, automatically failing over to AMZN or MS or
DO or whatever when there is some massive screwup that takes down GOOG for
half a day.

------
lostmymind66
It's better than Amazon or Google. I have had accounts shut down on both with
absolutely no recourse.

~~~
endorphone
You have exactly the same recourse with those that this person had with
DigitalOcean: Be very loud and as public as possible with the problem and
it'll get escalated to someone who can make a rational, reasonable choice (and
we've seen the same sort of things happen to other big companies on here and
twitter), or simply override prior choices purely for PR purposes. This is how
many businesses operate now, with zero mechanisms to escalate outside of
getting a torch-wielding mob riled up. It seems horribly counterproductive (I
mean, my end impression of this whole incident is certainly not _more_
positive about DO -- it's something that should never have happened), but it's
how it's done now.

~~~
lostmymind66
This was a few years back. I don't really care about those accounts anymore.
Support was just automated responses and when I tried to call them directly, I
was directed to the email accounts with automated responses.

My Amazon account was banned because a buyer (I believe was a competitor)
purchased an item and claimed it was a fake. I had proof it wasn't, but it
didn't really matter. Other than this I had nearly 100% positive feedback.
What's funny is that I now buy thousands of dollars per month for my business
through Amazon and I keep getting pestered to sign up with a business account.

Google banned an Adsense account when they somehow thought I was faking
clicks. I still have no idea where they got this from. My site wasn't even live
yet and I wasn't clicking on anything or even displaying ads beyond a simple
test page with no traffic.

------
sergiomattei
Excellent postmortem. Great work DO team, and hope these problems get resolved
quickly.

------
runjake
Great response and ownership.

------
LyalinDotCom
Kudos for this clear summary and planned improvements. Really good job folks.

------
mdip
I do a lot with cloud providers for my customer's products and have worked
with Digital Ocean's products once before. I didn't have a particular opinion
on them[0], and there's some things that seem off about the twitter thread
when placed against the incident update report that this thread is linked to.
So, all of that to say, I'm giving Digital Ocean the benefit of the doubt.

There are, however, a few things that could be improved about this process:

> Peer review of account terminations. For any account appealing a lock, two
> agents will be required to review the submission prior to issuing a final
> deny.

The devil is in the details. Do this in a manner that the person confirming
that the account is committing fraud is unaware that they are confirming
another's denial; otherwise "dude, can you approve that termination I just
did? I want it out of my queue/that guy was a dick." is a risk.

Off the top of my head, I'd probably generate two support tickets (linked, but
without that link presented to the account termination CSR team member),
assigned directly to this person, hidden from others ("hiding", along with
training/process improvements is likely enough). If one person disagrees with
the termination, close out the other person's ticket. If the CSR sub-org for
this is global, place them with staff in opposing time-zones to optimize
unnecessary confirmations (though you lose a valuable measurement on how
consistent your staff is)
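
A rough sketch of how the two hidden tickets could resolve, in Python. The
function and the vote encoding are hypothetical, just one way to express "any
dissent wins, and neither reviewer sees the other's vote":

```python
from typing import Optional


def final_decision(first_vote: Optional[bool],
                   second_vote: Optional[bool]) -> str:
    """Combine two independent reviews of an appealed lock.

    Each vote is True (uphold termination), False (reject it),
    or None (that reviewer has not decided yet).
    """
    if first_vote is False or second_vote is False:
        # One dissent is enough: close the other ticket and restore access.
        return "restore"
    if first_vote is True and second_vote is True:
        return "terminate"
    return "pending"
```

Because a single "reject" short-circuits, the second reviewer never needs to
be told there was a first one, which is what keeps the review from turning
into a rubber stamp.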

> Services that result in the power down of resources will no longer
> automatically take action on any account, regardless of lack of payment
> history, for accounts that were engaged more than 90 days prior. These cases
> will be escalated for manual review.

I can't count how many services I've deployed that started under 90-days ago
where the customer failed to add their account information to the service. I
can't count them because I don't know. Usually our customer creates their
account with instructions from us, and creates an account for us to use which
doesn't have permissions over payment details. I wouldn't be surprised if I've
had an app go past production that the customer simply forgot to do that
important step on Day-1, or if the customer procrastinated until production,
etc. We ask, but I've been lied to about stupider things (thankfully rare, but
surprising from people who otherwise look like "grown-ups").

Minimally, it sounds like the whole process here is missing a "Hey, WTF _is_
that thing you're running? Call us or we'll need to turn it all off" alert at
least a little while before it ... turns it all off. At login, put a clear
notice "We want you to love our services, so we let you try them without
asking you for payment information. Unfortunately, we have to have monitoring
in place to prevent hostile actors from loving us, too. Because of this,
accounts newer than 90-days might have services shut off in error. If you want
a notification an hour before action will be taken, provide your mobile phone
number and we'll send you a text. Or you can enter credit card
information/confirm your identity (not sure what options are available here)
and we'll keep the bots from bothering you"

Of course, all of this costs money. And based on the incident response times,
an explanation other than "failure to prioritize correctly" might very well be
"failure to staff properly/have the tooling in place to handle the volume".
Considering the competition in this market, I wouldn't be terribly surprised
if "we can't afford it" plays into some of that.

[0] A little less awful than AWS in a lot of ways for the task I had to do.

------
_Codemonkeyism
I don't like this.

"[...] cites the link to an older account, connected through a shared SSH key,
as additional justification for making the decision to deny access."

------
GreaterFool
> fraudulent high-cpu-loads

That's enough for me to never use DO again. One pays for a server but then
using it is fraudulent?!

~~~
ivalm
They were using credits so DO feared that this was a fake account created for
the credits to mine crypto...

~~~
serf
that's not really relevant -- no customers are really allowed to use a VPS as
if it were bare-metal.

This is a pretty routine issue for people (myself included) who have
mistakenly considered using a few temporarily spun droplets for a few hours of
intense number crunching.

The 'nicest' complaint I've received was from Linode, which was less of a
complaint and more of a warning like 'Do you know what your VPS is doing?'
rather than 'Don't do this with your VPS.'. They never really told me to stop
-- just wanted to make sure that it was intentional.

~~~
GreaterFool
> no customers are really allowed to use a VPS as if it were bare-metal

I don't agree. What I'm paying for are vCPUs not actual CPUs. So I know I'm
not getting bare metal and that the compute power is already managed by VPS
provider.

Are you saying that on top of that I also have to monitor and be responsible
for my own CPU usage or risk getting banned?

What are the rules?

That's why I wouldn't do anything serious on DO, just use it for low-cost side
projects.

AWS has explicit CPU credits on their cheaper instances. They built their
system to allow bursts of activity. If Digital Ocean doesn't have that their
offering is simply weak.

Also I recall a few years ago Digital Ocean sent me a message that referenced
specific processes running on my VM. I know that they have access to that if
they want to but looking into specific processes was stepping over the line in
my book.

I moved my servers from Digital Ocean shortly after that.

------
paukiatwee
Rule 1: Never shut down customer services at the first machine/AI detected
potential issue. Always contact customer to verify, always.

DO was never suitable for business use cases.

------
arthurcolle
At every turn of this story, there is a resounding "we fucked up". Shameful
behavior on the part of an "infrastructure provider."

It doesn't matter that they create a post-mortem. All it means is that I can't
trust them to get things done correctly.

------
cprayingmantis
Like I commented in the last post they've entirely missed the point. You
shouldn't be allowed to deny a client access to their data unless there's some
law being broken. Data is property and you shouldn't be able to suddenly deny
a client access to their property.

------
MagicPropmaker
I don't like the weaselly passive voice. And it's still not clear if a person
who doesn't have the clout on social media can get attention.

~~~
naniwaduni
[http://www.lel.ed.ac.uk/~gpullum/passive_loathing.pdf](http://www.lel.ed.ac.uk/~gpullum/passive_loathing.pdf)

~~~
bcooks
Thanks for the pointer. I'm going to blame my dad/upbringing for my overuse
of passive voice. He will be deeply amused by this. I will read the paper and
attempt to improve.

~~~
BobbyVsTheDevil
Excuse me, but that is not how it's done. Repeat after me:

The paper will be read and attempts to improve will be made.

Dad will be blamed.

------
paulie_a
And at the end of the day: post-mortem or no post-mortem, no one really cares.
The technical details are always too irrelevant to be useful, and in 6 months
no one remembers anyways.

Hell you could write a post mortem: "shit hit the fan" and it wouldn't be any
more or less insightful.

~~~
grey-area
I'm a customer of Digital Ocean and I care. I'm pleased with this response and
it is exactly what I want to see when a company makes a mistake.

