But it wasn't, because every night I run a PG backup and copy it to AWS S3. I just had to download the backup from the other cloud vendor and restore it on my DO server.
Did DO fuck up? Yes. Did it cost me downtime? Yes. Was I mad? Yes, and I still am. But I still do business with DO because it costs ~half the price of a comparable EC2 instance, and writing a 10 line bash script to move my database backups to another cloud vendor isn't that hard. Storing that backup on S3 costs pennies per month.
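For the curious, that nightly script really is about ten lines. A sketch of mine, with the database name, bucket, and paths swapped for placeholders (assumes `pg_dump` and the AWS CLI are installed and credentials are configured):

```sh
#!/usr/bin/env bash
# Nightly Postgres dump, shipped off-site to an S3 bucket at another vendor.
# "mydb" and the bucket name are placeholders.
set -euo pipefail

STAMP=$(date +%F)
DUMP="/var/backups/mydb-${STAMP}.sql.gz"

# Dump and compress the database.
pg_dump mydb | gzip > "$DUMP"

# Copy off-site; credentials come from the usual AWS env vars or ~/.aws.
aws s3 cp "$DUMP" "s3://my-offsite-backups/pg/mydb-${STAMP}.sql.gz"

# Keep a week of local copies.
find /var/backups -name 'mydb-*.sql.gz' -mtime +7 -delete
```

Run it from cron at 3am and you're done; a few GB on S3 really is pennies per month.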
I don't see "I've never run a business before" as a valid excuse, nor do I see "the cloud vendor is better equipped to handle backups" as a valid excuse. When it comes down to it, you are solely responsible to your customers. They're not going to care whose fault it was, because it was your fault.
Don't trust any of your vendors.
I'm curious why people who don't have massive scaling and variability issues choose DO or AWS for their hosting.
For 34€/month, Hetzner will rent you a physical server (i7-6700, 64GB RAM, 2x512GB SSD, 1Gbit/s networking). That's a monster of a machine and can run most sites out there. And if you need more oomph and better reliability, rent three.
I looked at this many times and still check the pricing regularly, and it just doesn't make sense for me to switch from these physical servers to virtualized offerings, not in any foreseeable future.
That's something worth paying for. It would be more comparable to two servers at Hetzner with rapid failover. But even that is more involved, since you have to set up the logic of when to fail over.
Auto-scaling is a benefit of "cloud" services. But it isn't the only selling point. Hardware abstraction is perhaps bigger.
If there's state in that virtual machine, it's probably stored either on the physical host or in a SAN. If it's on the physical host, it has to be fished out of that machine or restored from a backup. If it's in a SAN, you can lose your virtual machine if the SAN goes down.
I've seen both happen.
Actually, a single machine with RAID is surprisingly stable. A provider like Hetzner can switch out a faulty disk in less than five minutes, or switch a faulty motherboard/power supply in less than half an hour.
Virtualization on top of this does not increase stability, in my experience.
Now, some cloud providers do have more sophisticated distributed systems that do not have single points of failure, and that's a completely different story.
Of course, the software itself is a source of correlated failures, so even there you should never rely on a single cloud vendor.
There's a poster here on this site who commented some months ago that he had for years rented three servers, each on a different continent, each from a different provider, and never had downtime. That's engineering.
This works in part because both Xen and KVM hypervisors (and possibly others) support live migrations, so it's altogether false that virtualization does not increase stability. Both DigitalOcean and Linode use KVM behind the scenes.
So, at lower cost than rented hardware, customers get staff whose job it is to constantly monitor systems for hardware failures and deal with them proactively in a way that minimizes downtime.
I think the point is to write your application in a way that there's not state in the VM. My VMs are disposable and in fact the way I do deployments is to spin up a new VM in DO and then assign it the floating IP for production. If things go haywire I can easily swap back the IP address to the known good VM.
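Roughly, that swap is two `doctl` calls (droplet IDs, names, and the image below are placeholders; this is a sketch of my flow, not a full deploy script):

```sh
# Bring up the replacement droplet from a prebuilt image/snapshot.
doctl compute droplet create web-v2 \
    --image my-app-snapshot --size s-2vcpu-4gb --region nyc3 --wait

# Point production traffic at the new VM.
doctl compute floating-ip-action assign 203.0.113.10 <new-droplet-id>

# If things go haywire, swap back to the known-good VM.
doctl compute floating-ip-action assign 203.0.113.10 <old-droplet-id>
```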
GKE has given me the infrastructure of a 10 million dollar company, with me as the IT guy doing 1h or less a week of IT work.
I would not be able to do this on a $35 Hetzner machine.
In fact, it happens to instances all the time, probably including ones you have.
But if you're running your application on AWS EC2 instances, then it's not more reliable than a single chunk of hardware, in my experience (there are special cases; some hardware is just crappy).
Things like automatic failover and high availability in general are cool, but your SLOs should be set according to your business model, e.g. they need to make business sense.
I do understand how people like VPSs at $5 or $10/month (I've used one like that myself for years), and I do understand why companies that need massive scalability use AWS. But I do not understand why people in the middle (which I think includes a significant fraction of this audience) assume that a provider like AWS is the best solution.
Anyway, those are just my thoughts. I've been running a business on Hetzner servers for years now. And this is not necessarily a recommendation to use Hetzner: there are many providers which will offer low-cost physical servers. I think in spite of the cloud hype, these make sense for many applications.
Many ways of doing it. Virtualization is basically all direct hardware access for the stuff that counts.
But I digress. I am guilty of overbuilding things. In part for fun. In part hubris. In part I like sleeping at night.
Plus you get access to backups and all the other tools and products that are constantly being released. Not having to hire an IT guy at $1200/mo to maintain your physical server and tend to its networking rules is a significant savings over a developer-managed aws/gcp instance. Plus all the cloud interfaces are standardized, so you can just hire someone to come through and fix/upgrade your infrastructure.
Using a home-spun physical server means months if not years of undocumented tech debt for the next guy who comes along to have to maintain whatever kludges were installed long before he/she ever showed up. I just got done converting a bunch of physical servers over to the cloud and spent a year unwinding seven years of technical debt and now the dev/qa teams can actually spin up a new test environment in under 30 days (closer to 3 minutes).
One was an IT management/automation SAAS company (I think they were $450/mo), another was backoffice SAAS company (I think they were $1000/mo but they were also hosting for an independent dealer of their SAAS so it was two production systems, and 10 years of tech debt), third is a live video consultancy SAAS that outsourced their video to a third party api service. None of those are actually $200/mo but realistically in the ballpark of under $1000/mo.
When I converted the company (which included 40 engineers and probably 12 qa engineers) to the cloud I got my hands slapped for converting the QA/test datacenter (not production) and the cost went from $3200/mo on bare metal to $6500/mo but we were able to squeeze that down to $4200/mo running about 10 always-on test/pre-prod environments and 2-3 on-demand test environments.
This is all enterprise space stuff where the utilization is low but the value of the service is high. If you're doing facebook for dogs and trying to turn a profit on ad revenue from high utilization it's probably not as effective/realistic.
I got an e-mail that my account was flagged, but they didn't even give me a chance to tell them they had my name correct.
I responded to their email within 30 minutes, but everything had already been erased.
Luckily I had some stuff in github, but I will never use them again.
We had zero serving costs and would have continued to do so for years.
It makes perfect sense for Google/Amazon to incentivise startups to design themselves for their cloud.
Can you share how?
Hetzner charges €3.00/month for a VM with 2GB RAM.
There are lots of "managed" features still missing, but that's acceptable for a still-young product.
I am not a lawyer, but the advice I have been given is murky. It's much easier to stick with DO which, at my usage, costs about the same number of $CAD as Lightsail would in $USD.
Hetzner is only available in the EU, specifically Germany. And my bet is most of these people want it within the US.
The two listed aren't really well known.
I'd never again run anything business-critical on Hetzner metal. I'm not even running my irc bouncer there, tbh. With their cloud stuff I'm really happy though: higher uptimes than on most physical servers there.
Oh, and remind me again of the time where we regularly called them because of outages because our monitoring was better than theirs and the customer support people hadn't yet learned of the network outage...
Why people continue to think you can get everything on the cheap with services like this and not understand you're putting your business at risk by trying to cut corners on hosting and other services is beyond me.
I pay around $20-$25/month for Azure hosting for several e-commerce and mobile app clients I have. I told them up front you don't want to cut corners on this stuff, it will come back to haunt you later.
Sure, you'll save a few bucks year over year, but what happens in an outage? What happens when your clients can't order their stuff and you lose revenue for several days straight? Now the idea of saving a few hundred dollars a year evaporates as thousands of dollars are lost when your service provider takes a dump on you. Suddenly, it's not such a good deal anymore is it?
You can then keep telling yourself you have a stable, managed infrastructure, that you're cutting operational expenses and able to move fast, while the engineers in your organization are working harder to make things work with generic services and their shortcomings, making your product actually work in less-than-optimal ways.
Managed services feel like they are helping with operational costs but they have a cost when you're building your product.
I worked on projects that would have ended up being simpler and cheaper to operate on physical dedicated servers, but instead they are running on "ASG's that autoscale and have zero downtime" with "ALBs that send traffic to any and all hosts when all origins report unhealthy".
Things in the managed world are far from ideal, because one size doesn't fit all.
So I've never deployed a lambda before and I'm sure at low loads it's fine, but I would be afraid of taking a dependency on lambda/functions in an application architecture.
There's a whole host of services that make dev and deployment a lot easier.
Though instance costs are more expensive ... that's not the issue, the issue is 'TCO' (total cost of ownership) not 'server cost'.
If your team can move at 2x the speed on AWS, well that's worth a lot.
Cloud services are very useful, so the article raises a very legit concern/paradox.
> - Why were you only hosting on JUST ONE cloud provider?
> - Why didn't you have backups outside of JUST ONE cloud provider?
Telling them to just use two cloud hosting providers sounds easy on paper but when you're a cash-strapped startup it's a significant ask. Especially if you need to have the same architecture replicated across both, in order not to have extended downtime.
This is a much bigger issue than having external backups (which every startup should still do).
I can't imagine doing less for a business. In my opinion, the business model and investment plan should account for all of this for at least the first five years.
Losing your backups because you're on one site is always dangerous. Getting completely kicked off your VPS systems with no warning or help is what's new and scary about this DO story.
That's true whether the computer(s) is your own or someone else's.
It's not unreasonable to be annoyed when a company you are paying specifically to do the things you're not familiar with, catastrophically fails to do them correctly.
Again, your customer doesn't care what your excuse is, just like the dead company's founder doesn't care what DigitalOcean's excuse was. "Our processes failed" translates directly into "I failed".
(Also their backup offering is so much more cumbersome to use than borg backup that it wasn't a great loss to manage these myself).
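For comparison, the whole borg routine is a one-time init plus two commands on a daily cron (the repo host and backed-up paths are placeholders):

```sh
# One-time setup: encrypted repo on an unrelated host.
export BORG_REPO=ssh://backup@some-unrelated-host/./myserver
borg init --encryption=repokey

# Daily: create a deduplicated, compressed archive, then prune old ones.
borg create --stats --compression zstd ::'{hostname}-{now:%Y-%m-%d}' /etc /srv /var/lib
borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6
```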
Still amazed that they didn't notice no one's backups had been running for so long.
All that said in close to a decade (actually might be a decade) I've still yet to have any issue with linode (tempting the computer gods here) or their backups.
I remember when DO was first starting out and they used to knock their own stuff offline updating router tables.
Ever since I have a jaundiced view of their capabilities.
Linode backups start to fail when you have more than 3 million files on a drive.
When debugging this, they told me the reason. I did a snapshot that worked, and then they turned off backups because they were failing. They didn't highlight that backups were now off.
Restores on the larger machines can take 5+ hours, and they will often report that the restore failed if you are restoring a drive that contains docker’s pipe files.
Ymmv, but I’m trying to get off linode.
If you're paying someone specifically to make backups for you, you should be able to trust that they've taken every reasonable measure to ensure that backups are actually being made and preserved.
I'm not sure what's so confusing about "don't trust your vendors" but I've had to make this exact same reply way too many times.
Don't trust your vendors!
Which is probably much easier on DO than AWS. For whatever reason, AWS seems to go out of their way to make it as hard as possible to back up your RDS data to a separate AWS account.
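The dance, as I understand it: automated snapshots can't be shared across accounts, so you need a manual snapshot (created fresh or copied from an automated one), share it, and then the other account copies it to own it (encrypted snapshots additionally need the KMS key shared). Identifiers and account IDs below are placeholders:

```sh
# From the source account: make a manual snapshot and share it.
aws rds create-db-snapshot \
    --db-instance-identifier mydb --db-snapshot-identifier mydb-manual-1
aws rds modify-db-snapshot-attribute \
    --db-snapshot-identifier mydb-manual-1 \
    --attribute-name restore --values-to-add 222233334444

# From account 222233334444: copy the shared snapshot so you own a copy.
aws rds copy-db-snapshot \
    --source-db-snapshot-identifier arn:aws:rds:us-east-1:111122223333:snapshot:mydb-manual-1 \
    --target-db-snapshot-identifier mydb-copy-1
```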
You should hope they deliver the service you've paid for, but you should never trust that they will. Always plan for failure. That's the entire point of this whole "DO killed my company" saga.
It might take me a few hours to get back online and piece my docker containers back together but my customers can absorb a few hours of downtime as long as it's a very rare occasion. If they couldn't, I'd charge them enough to afford a real HA solution.
It was such a relief when times changed and regardless of who you were, you could start exchanging money for things and services.
This story and many others like it remind me of those times - it's back to who you know to get your account unlocked. I've never gotten a story on the front page of HN and probably have 5 twitter followers. What's my avenue of getting my data back?
This isn't just about DO either. There are many similar stories to be found about Google and other services.
They want to automate that all away as much as possible.
The future is everyone who isn't somebody chatting about how they run their application on X... because they're mysteriously banned from Y and Z, and the next guy talks about how he was banned on X and is on Z now... simply for that reason.
I have often wondered about that for Google. Sure, they have bucket loads of cash, but could they even feasibly stand up a large enough support staff to handle all their platforms? They have so many services, across dozens of languages, and serve hundreds of millions of customers in different timezones.
Also, why would they want to? They're saving untold amounts of money by pushing the problem onto the consumer, and if it works 99% of the time then it's probably good enough.
One of these days, they'll fuck up big time for many customers and get sued. They'll survive, but it'll cost a lot of money.
Especially for Google (and other life-or-death services for many people), the solution seems kind of simple: charge for support. Google terminated your GMail-Account because you logged in from Turkey? Pay $50 to get somebody to listen to your story and work with you on proving your identity. Would you rather change your email on all your accounts or pay $50?
I understand that support is expensive, and especially for ad financed or cheap products, having an agent look into something can quickly cost more than you'll ever make off that customer. If the customer pays for support, support becomes a product and the company doesn't have to treat support as a profit-killer (that is, automate it, make it annoying for the customer so he avoids it, and staff it with the cheapest available labor creating a high fluctuation because of bad working conditions).
Best as I can tell, they are an early-stage startup surviving off startup credits from DO. Also, $5 doesn't go far towards support costs. After a minute or two of troubleshooting, they are already losing money.
It was a pain.
Personally, I would understand this as extortion.
This is completely not OK. If the idea is paid support for random problems, that's fine; if it's higher prices that include support, that's also fine. But if any company caused me large damage and then asked for money just to reevaluate their own actions, I'd go to the police.
The crazy thing is that it's really not that expensive. AWS, for example, offers 24x7 support w/ <1 hour response time for production issues starting at $100 a month. At worst it's 10% of your bill.
I fear it is something very doable, but I'm not sure there are many in leadership who can pull it off.
Here's some more information for anyone who's interested - https://www.digitalocean.com/support/#PremierSupport
Zach, DigitalOcean Support
It's not about money. I worked at a unicorn that paid very high wages for support staff AND allowed them to work remotely and asynchronously from anywhere in the world. If you lived in Southeast Asia or Eastern Europe you'd make more than a local doctor just answering emails.
We still simply couldn't hire enough halfway intelligent people fast enough to keep up with the user growth. For each support person we'd hire, there'd be 10,000 new customers joining the same week. "Automating that all away" was the only tractable way to respond to people at all in a reasonable time frame. Obviously the support quality was awful.
I suspect, though, that due to the general trend of looking at support as a "cost", most skilled leadership has moved on or just settled for poor support practices, etc.
Interestingly in my experience support teams that operate outside the home country of the company OFTEN have massive turnover issues, more than say domestic (wherever domestic is). There's a gap there that just never seems to fill in completely.
There is also something to be said for managing support in the sense that you don't have to talk to every 10,000 customers ;)
Then again, it's not just support for services. Even abuse reports seem to be taken a lot more seriously when it's some online 'influencer' at the receiving end rather than an average Joe.
Of course, people who don't have money are forced to deal with greater hardships, yet that's always been a footnote to "regardless of who you were, you could start exchanging money".
Personally, I blame the decoupling of the dollar from the gold standard and the distribution of newly minted dollars from the Federal Reserve to the well connected, via banks and corporations that are controlled by a small social segment that all attend the same schools.
Once money became a thing that costs nothing but the changing of zeros, growth became a question not of how to produce something to get money, but of how to get both the connections and the pedigree needed to receive cheap dollars.
This has created money silos, where the US aristocracy will take hundreds of billions in losses to capture a market and then extract value in monopolistic ways.
You can see this with google, facebook, amazon, etc.
It wasn't always the best companies that won. It was the best companies that had access to the vast capitals pools created out of thin air and who could promise to operate at the monopolistic scales the monied classes were aiming for from the beginning. That is why Ivy leaguers (whether drop outs or not) were chosen as the princelings. They are people who have a lot committed into the system and wouldn't dare betray it: they can be counted on to take things to their logical extreme.
I think also pertinent is the locking out of the middle and lower classes from the growth phases of company creation (incentivizing angel investors and delaying IPOs + legally enforced discrimination against investors based on social class). Oh, and the pooling of legally stolen funds (pensions) into 'safe' stocks. Not to mention the legalization of bribery, which has further accelerated our current state of legislative capture.
</completely my theory>
I was curious about the above sentence. Who is the aristocracy in the context of startups? Are you referring to the VCs? If so don't the VCs not care about the "class" of the founders of a startup as long as they think there's money to be made from their company?
But an Ivy graduated person is a class of person that is highly committed to the system. There's a massive effort a whole family has to make for someone to be there.
If success is based on capturing the most market share the quickest so you can corner the market and extract value via monopolistic methods, then why go with anyone who could be a risk?
Edit: Re who is the aristocracy? There is no exact line. Society works on a gradient of privilege, from those who have access to resources with the least effort, to those who have access to resources with the most effort. This always happens. But I believe it is aggravated when money doesn't cost anything to produce because risk is then more important than reward.
I feel like HN comments are often sanctimonious to the point of ignorance. A solo dev, scrambling to get his project to the point where there is even the tiniest chance it will succeed, is likely to have a complex network of hardcoded filepaths, hostnames, and other magic numbers, strings, and config files that would make it very difficult to make portable without significant time and effort.
Or maybe the guy was an idiot. But possibly entertain the thought that maybe his situation wasn't exactly comparable to yours, eh?
I have never encountered a business that does not require backups. Not having a DR site as early in the game as they were might be excusable, but even then suggesting they get one is sound advice. It's an additional expense but it has a cheaper implementation cost while your infrastructure is small. Having a good DR plan is a selling point as well, this is the first thing I ask SaaS providers when I am evaluating them.
What is a total failure is not having backups on another provider. Before anyone says it's just a 2-person startup... surely that makes it easier? I can't imagine their DBs are all that big. Even just a three line cronjob and a Backblaze B2 ($0.005/GB) account could have secured their business continuity in the event their DO account didn't come back up.
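The whole thing can literally be a cron file (bucket and remote names are placeholders; rclone speaks B2 natively and needs a one-time `rclone config`):

```sh
# /etc/cron.d/offsite-db-backup -- a tiny startup's entire off-site story.
# Dump nightly, ship to Backblaze B2, expire old local copies.
0 2 * * *  postgres  pg_dump mydb | gzip > /var/backups/mydb-$(date +\%F).sql.gz
30 2 * * * root      rclone copy /var/backups b2:my-offsite-bucket/db
0 3 * * 0  root      find /var/backups -name 'mydb-*.sql.gz' -mtime +30 -delete
```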
>But also DO Spaces for our object storage with ~500k media and our database backups.
Based on that comment (from @w3nicolas on Twitter) they actually could have used the same script they used for DO Spaces to back up their DBs to S3 (both use the S3 protocol). The 500k image files should be relatively small and cheap to back up too, appears they are just logo thumbnails for the tracked companies (500k image files, copy says they track 450k companies).
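Pointing the same tooling at a different vendor is just an endpoint switch; for example (bucket names and region are placeholders):

```sh
# Same command, two vendors: the S3 protocol is the common denominator.
aws s3 cp db-backup.sql.gz s3://my-space/backups/ \
    --endpoint-url https://nyc3.digitaloceanspaces.com   # DO Spaces
aws s3 cp db-backup.sql.gz s3://my-bucket/backups/        # actual AWS S3
```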
If you are building infrastructure and you don't have IT Ops experience consult somebody who does because they will save you from yourself in scenarios like this. If you are a dev you could probably throw a rock and hit 5 friends with enough ops experience to tell you the importance of redundant backups.
I think this comment might read a bit more abrasively than I meant it, but I don't mean to kick the guy after he was down. If you are in a similar boat to RaisUp with your backups go fix it now. If you rely on any one vendor for business continuity that is a problem. This PSA is sponsored by salty sysadmins everywhere.
Me neither. But unfortunately, most I've encountered either don't have truly full backups (stuff is missing), don't have an actually working backup process (some issue with backup media, the process, etc.), or have no idea how to actually restore if something does happen.
Almost no one does drills to restore backups to some test system.
I can't talk about the other cases, but I can tell you about one small startup I joined long ago. The worst one. They did backups on a 3-disk RAID-5 server... with one disk broken and one with SMART warnings. Their backup process also failed to back up anything with too-long path names, so in reality almost half of the data was actually missing! There was also some Unicode issue that lost files with characters above ASCII 128.
My first days went to just actually ensuring their data is backed up...
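And to the restore-drill point above: a drill doesn't have to be elaborate. Something like this, run monthly against a scratch Postgres instance, would have caught every problem I listed (bucket, database, and table names are placeholders):

```sh
#!/usr/bin/env bash
# Monthly restore drill: fetch the newest off-site dump, load it into a
# scratch database, and fail loudly if the data looks empty.
set -euo pipefail

LATEST=$(aws s3 ls s3://my-offsite-backups/pg/ | sort | tail -n1 | awk '{print $4}')
aws s3 cp "s3://my-offsite-backups/pg/${LATEST}" /tmp/drill.sql.gz

dropdb --if-exists drill && createdb drill
gunzip -c /tmp/drill.sql.gz | psql drill

COUNT=$(psql -tA drill -c "SELECT count(*) FROM customers;")
[ "$COUNT" -gt 0 ] || { echo "restore drill FAILED"; exit 1; }
```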
No, no, no. That's the "well, it's never happened before, so why should I have planned for it?" attitude. If you are in a solo business, it is the _most_ important time to pull out your SWOT chart and be honest about threats and weaknesses. A solo company is literally the worst time not to consider weaknesses, as you have the least amount of assets to cover calamity. His primary concern right now is to be able to reproduce his work on AWS or Azure. Period.
If you are just a side project running locally, the plan may be as simple as "If my laptop falls in a river, get a new one and clone the repo again." As you start having production systems, it may still be "clone the repo again", but this time to actual servers.
As you start having customers rely on you, your plan should get more robust. Maybe you don't have a hot DR site set up where you can just flip a switch. But you should know who your backup host would be, and have an account ready with them. You should know what steps would be needed to go from repo -> new provider.
None of this needs to be set up and tested ahead of time if you are just getting started. But if you have paying customers, you have to have thought about it. DR starts tiny and scales up, just like everything else.
This is really the takeaway from this whole incident. Even if the account doesn't get banned, in most cases you're just one accidental action away from your database and backups disappearing simultaneously. And if you are using a blob store, there may not be any backup at all if the original is deleted (it's like RAID, not backups).
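One partial mitigation on the blob-store side is turning on object versioning, so a deleted original stays recoverable (AWS CLI shown; the bucket name is a placeholder). It still lives in the same account, though, so it does nothing against a lockout:

```sh
# Keep prior versions of every object so an accidental delete is recoverable.
aws s3api put-bucket-versioning \
    --bucket my-bucket \
    --versioning-configuration Status=Enabled
```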
If your cloud provider does your DNS and you switch to a different cloud provider...
...then just set your registrar to point to the new DNS provider. You've still got total control.
Changing A records, on the other hand, can take as little time as you want depending on the TTL value.
It's not common to use those two words as synonyms. Although it is common to use the registrar as a cloud provider, it may not be a good practice either (depends on the registrar).
By having an alternate provider you spread your risk. And in this scenario you could update your DNS from the locked DO account to AWS/Linode/GCP, whatever.
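If the zone lives somewhere independent of the hosting account, that repoint is a single API call. A sketch against Cloudflare's API (zone/record IDs, token, hostname, and address are placeholders):

```sh
# Repoint the A record at a replacement server on another provider.
curl -X PUT "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"type":"A","name":"app.example.com","content":"198.51.100.7","ttl":60}'
```

With a low TTL already in place, traffic moves within minutes.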
Lots of support related issues here.
As someone who worked in support for a long time, it isn't surprising to see this play out. The whole "Support and Security Operations leadership will create new workflows to allow abuse-related events to leverage the 24/7 structure of Support." bit, while probably a good action to take, is one of those things you see in support time and again: rarely do you see an appropriate response so much as a patch so that "this one thing won't happen again".
Nobody cares to staff, fund, and support the support teams until something goes wrong, and then it is usually "new workflows" for support.
Another thought: That was also back when support responded in a reasonable time frame, and they actually spoke on IRC and everything too.
Vultr ========================== Linode
Los Angeles ==================== Dallas
Mastercard 1234 ============= Visa 4567
VMs with custom Linux (ROOT/ZFS Debian)
Replicating snapshots every 5 minutes
CloudFlare + HAProxy Load Balancing
Edit: backups to B2 and Wasabi. ~$10 month
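The replication piece is ordinary incremental ZFS send/recv on a timer; roughly (dataset and host names are placeholders, and the first run needs a full send, not shown):

```sh
#!/usr/bin/env bash
# Every 5 minutes via cron: snapshot, send the increment to the standby,
# record the new baseline, drop the old local snapshot.
set -euo pipefail

PREV=$(cat /var/run/last-repl-snap)      # last successfully replicated snapshot
NOW="tank/data@repl-$(date +%s)"

zfs snapshot "$NOW"
zfs send -i "$PREV" "$NOW" | ssh standby zfs recv -F tank/data

echo "$NOW" > /var/run/last-repl-snap
zfs destroy "$PREV"
```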
And good work on the credit cards -- too many people forget that point of failure.
Also, for as cool as Digital Ocean is, their primary focus is on low-use, shared cloud resources. From my experience, they over subscribe CPU resources so "noisy neighbors" was a problem sometimes. They do not provide or work with people as if they have production services, and they don't seem to like people who really want to use their system. I would never use them for production unless the work was ephemeral.
I have had friends who have had their Twitter or Youtube accounts shut down with no explanation from the companies involved. Just recently, MailChimp shut down a brand new account for my SaaS (we had only 4 subscribers added, and hadn't sent out any email campaigns yet!) with a refusal to give us an explanation why. We set up another account and did everything exactly the same way (including tediously replicating lots of email templates manually) without the same thing happening. Why was the first account shut down? We will never know, it seems.
I am all for using AI to detect dark patterns and raising a flag, but I think these companies need to bring a more human aspect into the after effects. Surely a legitimate operator will always email in with a "WTF?" whereas a nefarious actor will simply automatically move on to the next thing?
1. On your vendors don't say why they took action point. I think a lot of vendors, us included, worry about the bad actors knowing the _how_ of our algorithms because they will then have intelligence on what to try and work around to avoid detection. It's a constant adversarial chess match we adjust, they adjust.
2. As for the human element being needed for the after effects, I totally agree. That said, I looked at our numbers, and while many of the nefarious actors simply move to the next thing, a very large percentage of the "WTF" replies are actually from bad actors hoping that is enough to be re-enabled and to keep going for a while. Our goal is to bring more people into the after effects and give them better information to make decisions, and to work with customers to avoid this kind of mess in the future.
If your business can be "killed" by your provider losing your data you have set yourself up to fail. I am not saying the provider is not to blame, but the burden should also be on the customer for not protecting themselves, what if DO burnt down?
We have yearly DR exercises, we have to bring up about 40 of our 120 production instances, it is a difficult task, but we know we can do it.
Never depend on your provider to save you; always make sure you have planned this out.
So despite the fact that you can get the same or more resources at other providers who advertise their shared CPU resources for less than half that charged by DO, I'm still buying a shared resource?
I think this is something that needs to be much more front and center. With AWS I can spin up as many “boxes” as I want on a moment’s notice. As far as anything on DO’s site seems to advertise I can do the same.
Turns out that’s not so. If they wanted to shut down an account because they suspected it was compromised then that’s one thing. Shutting it down simply because of suspected cryptocurrency activity... not so much. If I’m willing to pay for it then that’s what I’m willing to pay for and there should be no limitations.
Now that becomes a little more clear as you read...
> determine if automated action is warranted to minimize the impact of potential fraudulent high-cpu-loads on other customers.
So clearly we are not talking the same resources that were expected.
The account that got locked down triggered automatic checks that used payment history as a "these people are okay" check. They didn't have a payment history, they were running solely on credits, so got flagged. IOW, they hadn't paid for anything yet.
> With AWS I can spin up as many “boxes” as I want on a moment’s notice
You can, but they can also shut you down for "abuse".
If he were using AWS or Azure and he had the same set of issues, I wouldn't blame him at all. But he is using a third-rate cloud provider because it was cheaper, and then he acts surprised that they don't have the same level of competence and support as AWS or Azure.
Yeah I purposefully left out GCP. I wouldn’t trust their support anymore than DO.
Is this how we're referring to running on servers we have contractual control over these days? Part of managing risk is being sure your infrastructure will stay there, and if you don't have a contract, your risk is significantly higher.
Are there any good resources for deploying multi cloud architectures?
Pick one of the strong public cloud providers such as AWS, Azure, or GCP and then be judicious about which PaaS offerings you use to avoid unnecessary vendor lock in unless it adds enough value to justify the lock in.
I’ve heard GCP (or was it AWS?) is less likely to shut you down if you bill on account vs using a credit card.
Are there any other things we can do to reduce problems?
Would asking for an account manager help? Would buying reserved instances help?
Is there a minimum spend that gets you more attention and priority? Eg 100/month vs 1k/month vs 10k/month etc?
Any other advice?
- Keep a steady balance.
- Have luck (sad but true).
Minimum spend and similar might help with customer service (if they can see your spend). But most problems start with automated services.
Don't switch cloud providers every half year because another one is slightly cheaper now (i.e. have longer running/existing accounts), might help.
But most important: make sure that you don't get locked in by the cloud provider.
Even a permanent ban from the cloud provider should only cause some limited short-term (money) loss. This means:
- Do not ~use~ rely on the cloud provider's proprietary tech (sure, use their management console, etc., but don't integrate it into your business too tightly).
- Have control over your DNS names so that you can map them to different IP addresses if needed.
- Make sure you can migrate to a different cloud provider in a matter of hours/days/weeks (depending on what you use it for) at any time.
- Have backups outside of the cloud provider for all your stuff.
- Don't get trapped by doing backups on a different service of the same provider (e.g. Amazon) ;=)
In the end, how much this costs you depends a lot on what you do and how you do it. And yes, deciding against any continuous short-term investment and taking the risk of losing everything is fine; you just should make that decision consciously.
The thing that has me scratching my head is how this chain of events unfolded.
I get that your fraud algorithm flagged it because of the lack of an established payment record, but how is that possible, given the tweet referred to "locking us out of all of our backups and work"? Surely an account history of any significance would have an established payment record. From their tweets, they mention that they had 5 droplets, storage for a not-insignificant number of records (~500k), and a script that has to run every 2-3 months to process some data, spinning up 10 droplets while it does. Seems like it will take 13 hours to process the data based on row count and per-record time. I am struggling to see how they didn't have payment history. Can you elaborate?
In addition, another thing I'd think would help assuage fears of a complete lockout is some process where you can request and download the DB or a snapshot of the virtual machine.
The account had been live for some time and in that sense had history but because of credits it didn't have payment history. As some others have commented lots of startups use credits to get their business going and depending on your usage they can last you for quite a while. Payment history indicates a willingness and capability to make payments.
Part of the issue here was the set of triggers the algorithm used when looking at remaining credits, payment history (none), workload deltas (the new spin-ups), and effective run rate (think of that as the amount of money they would be charged for the workload they were spinning up). The bug in this case was both simple and super impactful. Raisup did nothing wrong; everything right, in fact. We just blew it.
Thanks for the comment on request for download of backups or snapshot. That is a great idea, I guess we just never expected to actually go shoot a real customer and the fraudsters don't ask for their data.
That is the starting place for many folks.
What the hell was their cost then?
GCP, AWS, DO, Azure... they all bank on that sweet combination of huge clients and clients who do not really need them but who, due to opportunity costs, can't be bothered to find viable, self-reliant alternatives.
Test moving your environment between service providers in production.
You should be able to failover from one provider to another, in a predictable way. You should actually do so every few months so that you're well versed in what needs to happen. If you practice this, you will be prepared in case the absolute worst happens.
They finally gave me temporary access even though they didn't believe who I was. This was the icing on the cake, they gave me access to a cluster, they could've given it to anyone.
It's cheap but seriously not worth it. They also have many more outages than other larger providers.
That is, excluding the stories about the same account being used to sell things on Amazon.com and run an AWS setup. We know those are risky.
Why rely on a single party? You are putting all your eggs in a single basket again.
Is that really the case?
It doesn't explain any detail behind why there was radio silence until social media support stepped in, which is what I'm very curious about, but it does have a timeline of events and an apology.
If you want to "play" startup, feel free to put all your eggs in one basket. If you actually want to build something meaningful and long-lasting, and you have Fortune 500 customers, take a breath and invest time in an infrastructure that can't be destroyed by a single rogue algorithm.
After all of the horror stories we've heard over the years, how is that time investment (and ability to sleep well at night) not worth it?
As in? You can have your own cage, but somebody will have a cable that connects to the hardware you own. If they pull that cable because a single rogue algorithm told them that you're most likely abusing their service, that's it.
If this is considered “abuse” Digital Ocean is not a platform you should be running a business on.
If you can’t afford to pay for reliable big boy infrastructure, you might need to rethink your business plan or how much you charge.
What I did not expect to see a few days ago was the bill for the stopped droplets. Not just a little for the dormant disks I had, but seemingly full price for the entire droplets, as if they were all still "active". Nonetheless, I deleted all my droplets, discarding whatever I had set up a few months ago, out of pure frustration. I never plan to return.
AWS may take me a little more time to setup and their interface may not be as pretty, but they will at least bill me fairly.
If you clicked okay without actually reading the message, that is entirely your own fault.