Not to mention the lack of visibility in changes - it seems like everything is constantly running at multiple versions that can change suddenly with no notice, and if that breaks your use case they don't really seem to know or care. It feels like there's miles of difference between the SRE book and how their cloud teams operate in practice.
I had an issue with my servers 2 days ago and I got a reply to my ticket within 1 minute. Follow-up replies were also very fast.
The person I was talking to was a system administrator who understood what I was talking about and could actually solve problems on the spot. He is actually the same person who answered my support requests last year. I don't know if that's a happy accident or if they try to keep the same support staff answering for the same clients. He was answering my requests consistently for 2 days this time.
I am not a big budget customer. AWS and GCP wouldn't think anything of me.
Thank you Vultr for supporting your product properly. And thanks Eric. You are very helpful!
Vultr looks like they provide more traditional services with a few extra niceties on top.
Within Google's infrastructure, I can deploy a new HTTPS REST endpoint with a .js file and 1 console command.
Could I set up an ecosystem on a Vultr VM to do the same? Sure, it isn't magic. But GCP's entire value prop is that they've already done a lot of that work for you, and as someone running a startup, I was able to go from "I have this idea" to "I have this REST endpoint" in a couple of days, without worrying about managing infrastructure.
That said, articles like this always worry me. I've never seen an article that says "Wow Google's support team really helped out!"
There are ways to make the hosting relatively agnostic, but choosing a pub/sub solution (for example), that operates at 'web scale' will have a distinct impact on your solutions and force you into their ecosystem to maximize value. Why bother with BigCorps UltraResistant services if you're only going to use some small percentage of the capabilities?
I've made systems that abstract away the difference entirely, but I think the 'goldilocks zone' is abstracted core domain logic that will run on anything, and then going whole-hog on each individual environment. Accept that "cloud" is vendor lockin, and mitigate that threat at the level of deployment (multi-cloud, multi-stack), rather than just the application.
The best thing to do is to simply ignore them as asking about downvotes just invites more of them.
But it is telling that there have been at least 5 downvotes but no one is willing to comment as to why.
Edit: since I see you have a downvote (surprise, surprise) I'll clarify that it wasn't me.
Seems like a recipe for breeding unfairness. "Don't talk about the system", :sigh:
Downvotes can irritate, but there are at least two balms that don't involve adding noise to the threads. One is to remember that people sometimes simply misclick. The other is that unfairly downvoted comments mostly end up getting compensatory upvotes from fair-minded community members (as happened with yours here).
I'm happy for you to leave your comment. It might help someone else in future.
Their sales staff is arrogant and has no idea how to sell into F500 type companies.
Source: 10+ meetings, with different clients, I attended where the Google sales pitch was basically "we are smarter than you, and you will succumb". The Borg approach. Someone needs to revamp the G sales and support approach if they want to grow in the cloud space.
The other funny thing is the package had a neoprene sleeve for a Chromebook. Eventually a sales person reached out via email assuming I owned a Chromebook and acted like I owed them a phone call because they gave me a neoprene sleeve I couldn’t use.
The entire package ended up going in the trash, which was an unfortunate waste of unrecyclable materials.
"People who will look through every bit of advertising crap company x sends", vs those who don't.
Something, somewhere is probably making stats on that. ;)
Well, that seems to be the approach at Google. Starting with hiring.
Not surprising they end up with a hivemind that can't see past their mistakes.
And I've seen it cause them to lose at least 10 potentially good sales.
They have advantages but they're so arrogant that it puts people off.
More than 10 times, people have told me they prefer Google's solution to Microsoft's or Amazon's but they're going with a competitor because they can't stand Google's arrogant attitude. It's close to laughable: Google is throwing money away just because they won't back off.
It blows my mind that GCloud, with arguably superior tech and performance compared to AWS/Azure, can't handle support. I have my own horror stories from 2 years ago, but still they haven't fixed it.
Google just doesn't seem to be able to focus on products that require service and customer support. Maybe they just don't care about it while they have an infinite revenue stream from search and advertising. Whatever it is, they should be humiliated.
I love the tech, and the ux details like in browser SSH (AWS hasn't improved UX EVER) but they can't get support right? Amazing.
That's literally any product that people pay for (instead of viewing ads).
Customer support isn't and never has been in their DNA. It's often rage-inducing how hard it is to contact a human at Google.
They seem to think they can engineer products that don't need humans behind them.
I'm actually going to take this back to my company as a principle: "Treat locked-in/subscribing customers as well as our salespeople treat prospects."
This means that you can run whatever development tools you want on the EC2 instance, rather than the very limited code editor that Cloud9 provides. You can easily run a full copy of Visual Studio on a Workspace, and get the full resources of an EC2 instance with SSD drives.
Anecdotally: I've been an MS gold partner in a bunch of different contexts for years. The experiences I had as 'small fish' techie with AWS were on par or better. YMMV, of course, but I'd be more comfortable putting my Enterprise in the hands of AWS support than MS's (despite MS being really good in that space).
Prices seem high though: https://cloud.google.com/support/?options=premium-support#op...
So maybe think about which hosting provider to go with. Don't get me wrong, I like their tech. But their moderation does need a more human element; to be frank, all their products do. Simply ceding control to algorithmic judgement just won't work in the short term, if ever.
Still works out cheaper for workloads than AWS does even factoring staff in at this point.
AWS always turns into cost and administrative chaos as well unless it is tightly controlled which in itself is costly and difficult the moment you have more than one actor. GCP probably the same but I have no experience with that. Very much more difficult to do this when you have physical constraints.
Two man startup, perhaps but I think the transition should go:
VPS (linode etc) for MVP, colo half rack, active/active racks two sites then scale out however your workload requires.
One of the failure modes I see a lot is failing to factor in latency in distributed systems. Mainly because most systems don’t benefit at all from distribution and do benefit from simplification.
The assumption on here is that a product is going to service GitHub or stackoverflow class loads at least, but literally most aren’t. Even high profile sites and web applications I have worked on tend to run on much smaller workloads than people expect. Latency optimisation by flattening distribution and consolidating has higher benefits than adopting fleet management in the mid term of a product.
Kubernetes is one of those things you pick when you need it not before you need it. And then only if you can afford to burn time and money on it with a guaranteed ROI.
Basically, the LB decides which kubernetes cluster will serve your request and once you're in a k8s cluster, you stay there.
You don't have the control-plane that the federation provides and a bit of overhead managing clusters independently, but we have automated the majority of the process. On the other hand, debugging is way easier and we don't suffer from weird latencies between clusters (weird because sometimes a request will go to a different cluster without any apparent reason <-- I'm sure there's one, but none that you could see/expect, hence debugging).
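A minimal sketch of that "pick a cluster once, then stay there" routing, assuming the load balancer keys on some stable client identifier. The cluster names and the `pick_cluster` helper are hypothetical; the comment above doesn't say how their LB actually makes the decision.

```python
import hashlib
from typing import Sequence

# Hypothetical cluster names; the original comment does not name any.
CLUSTERS = ("cluster-a", "cluster-b", "cluster-c")


def pick_cluster(client_id: str, clusters: Sequence[str] = CLUSTERS) -> str:
    """Deterministically map a client to one cluster.

    The same client_id always hashes to the same cluster, so a request
    never hops between clusters once it has been assigned.
    """
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(clusters)
    return clusters[index]


# Repeat requests from the same client land on the same cluster.
first = pick_cluster("user-42")
second = pick_cluster("user-42")
print(first == second)
```

The trade-off is the one described above: no shared control plane, but a request's path is predictable, which is what makes debugging easier.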
My people's time is more important than your complex system.
If you run into an issue, send me a note and I will get someone to reply to your issue.
Maybe support just needed to satisfy their quota of kicked out companies for the month, who knows?
Can you elaborate on that? What do they monitor with the moderation bot?
Imagine you're running a cosplay community, and all of a sudden all your content is being deleted because the SESTA/FOSTA bill gets passed in a country where your "cloud" happens to reside in: https://hardware.slashdot.org/story/18/03/25/0614209/sex-wor...
"told us straight out that it would and we should move"
Sounds shady. I bet this would make more sense if OP explained what his company actually does.
Not everything is outright "likely to get banned" (eg pron things). ;)
I assumed they could tell that via CPU usage, which they already monitor for quotas.
By chance are you located out of the United States? These are not downtime issues, but anti-fraud prevention and finance issues.
I hope there is a possibility to put a backup contact person / credit card so organisations can deal with people going on vacation or being sick or whatever.
IMHO this should be nicely documented as any other technical material you get to learn about the cloud product when you create an account (e.g. important steps to ensure your account remains open even in case of important security breaches, yadda yadda it's possible we'll need a way to prove that you are you yadda yadda, this can happen when yadda yadda, be prepared, do yadda yadda)
AFAIK, Google Cloud credit card payments are processed through Google Pay, which supports multiple credit cards, debit cards, bank accounts, etc.
Ideally, in this case the company shouldn't be using the CFO's credit card, but should have entered into a payments agreement with Google, receiving POs, invoices and so on, including a credit line.
Never set up a crucial service like you'd set up a consumer service.
In many situations the "right thing" must be explained, otherwise when people fail to get it they can argue that wasn't really the right thing after all (sure that's ultimately because they just want to deflect the blame from themselves; so don't let them! clearly explain the assumptions under which anti-fraud measures are operating so people cannot claim they didn't know)
The nature of the tech in question seems important in this story.
Erm ... no, evidently not?
A prudent person might consider a cloud provider to be a domain of failure and choose a multi-cloud option, which would probably be the correct way to address this resiliency issue. However, that's not really an appropriate approach for an early stage startup, where availability is generally not that much of a concern.
Is an exploding car safe because it is built by an early stage startup?
Just because you decide that implementing resiliency isn't a good business decision for some early stage startup, doesn't magically make the product resilient, it just isn't and that may be OK.
There are many options to choose from for implementing resiliency, it could be having multiple providers concurrently, it could be having a plan for restoring service with a different provider in case one provider fails, it could be by setting up a contract with a sufficiently solvent provider that they pay for your damages if they fail to implement the resiliency that you need, whatever. But if you fail to consider an obvious failure mode of a central component of your system in your planning, then you are obviously not building a resilient system.
Edit: One more thing:
> There's no indication anywhere from GCP themselves that a project could be a domain of failure. If asked, I doubt they would consider it as such.
Then you are asking wrong, which still is your failure if you are responsible for designing a resilient system.
If you ask them "Is a complete project expected to fail at once?", of course they will say "no".
That's why you ask them "Will you pay me 10 million bucks if my complete project goes offline with less than one month advance warning?", and you can be sure you will get the response to the problem that you are actually trying to solve.
If you replace "multi-cloud" with "multi-datacenter" (in the pre-cloud days), this premise is fairly unassailable. In those same days, applying it to "multi-ISP", it becomes more arguable.
Today, though, the incremental cost (money and cognitive) of the multi-cloud solution, even for an early startup, doesn't seem like it would be high enough to make the notion downright inappropriate to consider.
I'd even argue that if a cloud provider makes the lock-in so attractive or multi-cloud so difficult, that's a sign not to depend on those exclusive services.
The economics don’t work out if you are trying to do this with just vanilla VMs across AWS, GCP and Azure and managing yourself. You either do it the old fashioned way renting rack space and putting your own kit in, or you make full use of the managed services at which point - by design - you are locked in.
Usually this can get handled after a few days of aggravating emails back and forth, we get our client to ban the affiliate in question, and move on with our days with no downtime. But a few weeks ago my coworker came in to find our server taken offline, because AWS emailed him about a spam complaint on a Friday night, and they hadn't gotten a response by Sunday. It'd been down for hours before he realized.
They'd just null-routed the IP of the server, so he updated IPs in DNS real quick, but he then spent half a day both resolving the complaint, and then getting someone at AWS to say it wouldn't happen again. They supposedly put a flag on his account requiring upper management approval to disable something again, but we'll see if that works when it comes up again.
If and when you do, give serious consideration to how you handle DNS.
For that, you should consider setting up multiple accounts to isolate those services from the portable ones.
Also plays really nicely with Terraform.
Ansible helps deploy software, but deploying software is the smallest problem of going multi-cloud.
You're absolutely right that running them at the same time, data syncing, traffic flow, etc is much more complicated.
Disclosure: I'm one of the founders.
Mist.io is a cloud management platform that uses Apache libcloud under the hood. It provides a REST API & a Web UI that can be used for creating, rebooting & destroying machines, but also for tagging, monitoring, alerting, running scripts, orchestrating complex deployments, visualizing spending, configuring access policies, auditing & more.
See: Route53 and Dyn outages in the past couple years.
Forgive my ignorance but that seems like a weird choice rather than cutting access to the servers or in some more formal ways for copyright...
Also kinda concerning that multiple departments can take enforcement type action and others not know it. That seems way disorganized / recipe for disaster.
Why not Azure? They have a solid platform and (at least for a MSFT partner) their support is top-notch.
Yet their support didn't ever solve a problem within their SLA's and sometimes critical level tickets were hanging for months.
Plus my impression is that whereas AWS (and possibly Google) clouds are built by engineers using best practices and logic, Azure products felt always very much marketing driven e.g. marketing gave engineering a list of features to launch and engineering did the minimum effort possible to have the corresponding box ticked. I absolutely hated working on Azure and now won't accept any contract on it.
Documentation is horrible or non-existent, things just don't work, have weird limitations or transient deployment errors, super weird architectural and implementation choices + you never escape the clunkiness of the MS legacy with for example AD.
What does this mean?
GP is saying that even though they had a contract to resolve issues within X hours/days the issues were not being solved within X hours/days.
Cynically: most SLAs with the 'Big Boys' tend to give guarantees about getting an answer, not a solution. "We are looking into the problem" may satisfy the terms of a contract, but they don't satisfy engineers in trouble.
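To make the response-vs-resolution distinction concrete, here's a rough sketch. The `Ticket` fields and the two check functions are made up for illustration and don't reflect any provider's actual contract terms.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Ticket:
    opened: datetime
    first_response: Optional[datetime] = None  # "We are looking into it"
    resolved: Optional[datetime] = None        # the problem is actually fixed


def response_sla_met(ticket: Ticket, limit: timedelta) -> bool:
    """True if somebody answered within the limit, fixed or not."""
    return (ticket.first_response is not None
            and ticket.first_response - ticket.opened <= limit)


def resolution_sla_met(ticket: Ticket, limit: timedelta) -> bool:
    """True only if the issue was actually resolved within the limit."""
    return (ticket.resolved is not None
            and ticket.resolved - ticket.opened <= limit)


opened = datetime(2018, 6, 1, 9, 0)
ticket = Ticket(
    opened=opened,
    first_response=opened + timedelta(minutes=10),
    resolved=opened + timedelta(days=3),
)
print(response_sla_met(ticket, timedelta(minutes=15)))  # answered in time
print(resolution_sla_met(ticket, timedelta(hours=4)))   # but not fixed in time
```

A contract that only measures the first function can be fully honoured while the second one fails for days.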
This is not true at all. Once you start spending real money on GCP or AWS, they will reach out to you. You will probably sign a support contract and have an account manager at that point. Or you might go with enterprise support where you have dedicated technical assets within the company that can help with case escalation, architecture review, billing optimization, etc.
> What if the card holder is on leave and is unreachable for three days? We would have lost everything — years of work — millions of dollars in lost revenue.
Cards get skimmed all the time. When a card gets skimmed, the issuer informs everyone who is making recurring purchases with that card "Hey, this card was skimmed, it's dead".
If someone has a recurring charge attached to that account, the recurring charge will go bad. If this is an appreciable number of cloud services which are billed by the second, this can happen very, very quickly and without you knowing. Remember, sometimes the issuer informs you that the card was skimmed, which you will receive after all the automated systems have been told.
So, the cloud provider gets the cancel, and terminates the card. It then looks around sees the recurring charge, takes a look at your servers racking up $$ they can't recoup and the system goes "we don't know this person, they buy stuff from us, but we haven't analysed their credit. Are they good for the debt? We've never given them credit before. Better cut them off until they get in touch."
If only they had signed an enterprise agreement and gotten credit terms. It could still be paid with a credit card, but the supplier would say "They're good for $X, let it ride and tell them they'll be cut off soon". They can even attach multiple methods of payment to the account, where, for example, a second card with a different bank is used as a backup. Having a single card is a single point of failure in the system!
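The "second card as backup" idea boils down to trying payment methods in order until one succeeds. A hedged sketch: the charge callables here are stand-ins, not any real billing API.

```python
from typing import Callable, List, Optional, Tuple

# (label, charge function) pairs. Each charge function is a hypothetical
# stand-in that returns True when the charge goes through.
PaymentMethod = Tuple[str, Callable[[float], bool]]


def charge_with_fallback(methods: List[PaymentMethod], amount: float) -> Optional[str]:
    """Try each payment method in order; return the label of the first that works.

    Returns None if every method fails, which is the point where a
    provider without credit terms on file would start cutting service.
    """
    for label, charge in methods:
        if charge(amount):
            return label
    return None


methods = [
    ("primary-card", lambda amount: False),  # e.g. card reported as skimmed
    ("backup-card", lambda amount: True),    # different bank, still valid
]
print(charge_with_fallback(methods, 1200.0))
```

With only one entry in `methods`, a single cancelled card takes you straight to `None`.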
In closing, imagine you're a cryptocoin miner who uses stolen cards to mine on cloud services. What does that look like to the cloud provider?
Yep, someone signs up for cloud services, starts racking up large bills and then the card is flagged as stolen.
Would be interested to see what would've happened if they would've used a business account.
We're a health-care startup, and this past Saturday I got an email saying that due to the nature of our business, we were prohibited from using their payment platform (Credit Card companies have different risk profiles and charge accordingly--see Patreon v Adult Content Creators).
Rather than pull the plug immediately, they offered us a 5-day wind down period, and provided information on a competitor that takes on high-risk services.
Fortunately, the classification of our business was incorrect (we do not offer treatment nor prescription/pharma services), and after contacting their support via Email & Twitter, we resolved the issue in less than 24 hours.
So major kudos to Stripe for protecting their platform, WHILE also trying to do the right thing for the customers who run astray from the service agreement.
This seems like a billing issue. If they had offline billing and monthly invoicing (enterprise agreement) I do not believe this issue would have happened.
If you are running an enterprise business and do not have enterprise support and an enterprise relationship with the provider, you may be doing something wrong on your end. It sounds like the author of this post does not have an account team and hasn't taken the appropriate steps to establish an enterprise relationship with their provider. They are running a consumer account which is fine in many many cases, but may not be fine for a company that requires absolutely no service interruptions.
IMO, the time this issue was resolved by the automated process (20 mins) is not too bad for consumer cloud services. Most likely this issue could have been avoided if the customer had an enterprise relationship (offline billing/invoicing, support, TAM, account flagging, etc, etc) with Google Cloud.
Paying bills offline via invoice establishes an enterprise agreement with cloud providers. It does in fact change everything with the issue discussed here. They wouldn't be taken offline due to an issue with the credit card payment.
We have significant spend on CC with GCP and we’re not “consumers”. Our account manager has no issue with this. If they did we’d move somewhere else.
> What if the card holder is on leave and is unreachable for three days? We would have lost everything — years of work — millions of dollars in lost revenue.
You should never be in this position. If this were to happen to us we would be able to create a new project with a different payment instrument, and provision it from the ground up with terraform, puppet and helm scripts. The only thing we would have to fix up manually are some DNS records and we could probably have everything back up in a few hours. Eventually when we have moved all of our services to k8s I would expect to be able to do this even on a different cloud provider if that were necessary.
This is one of the reasons I always implore people to have a backup of their data with another provider or at least under a different account. That protects against all kinds of accidents but also against malice.
If what happened to OP happened to me, sure, yes, I could have my entire infra on AWS/Azure/whatever else Terraform supports in an hour, maybe more to replace some of the tiny cloud-specific features we use. But if it takes me a day just to move the data into Azure, that's an entire lost business day of productivity.
It's a simple truth that even if you are at the millions of dollars point, there is a data size at which you are basically all-in with whatever solution you've chosen, and having a secondary site even for a billion dollar company can be exceptionally difficult and cost prohibitive to move that sort of data around, again especially when you're heavily dependent on a specific service provider.
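Some back-of-the-envelope arithmetic for why data size dominates here. The link speed and the 80% efficiency factor are assumptions for illustration, not measurements from anyone in this thread.

```python
def transfer_hours(size_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Estimate hours needed to copy size_tb terabytes over a link_gbps link.

    efficiency is an assumed fraction of the nominal bandwidth you really
    get after protocol overhead and contention.
    """
    bits = size_tb * 1e12 * 8                        # decimal terabytes to bits
    effective_bps = link_gbps * 1e9 * efficiency     # usable bits per second
    return bits / effective_bps / 3600


# 100 TB over a dedicated 10 Gbps link at 80% efficiency: roughly 28 hours.
print(round(transfer_hours(100, 10), 1))
```

At that rate a few hundred terabytes is most of a week, and that's before any egress throttling on the source side.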
Yes, the blame in part lies with making the decision to rely on such a provider. At the same time, there are compelling arguments for using an existing infrastructure instead of working on the upkeep of your own for data and compute time at that scale. Redundancy is built into such infrastructures, and perhaps it should take a little more evidence for the provider to decide to kill access to everything without hard and reviewed evidence.
So then you're going to have to look into realtime replication to a completely different infrastructure and if you ever lose either one then you're immediately on very thin ice.
It's like dealing with RAID5 on arrays with lots of very large hard drives.
Backup is a bit of an art in itself, everyone has a different type of backup requirement for their application, some solutions might not be even financially feasible. You might never end up using your backup ever at all, but all it needs is one very bad day. And if your data is important enough, you will need to do everything possible to avoid that possible bad day.
> Backup is a bit of an art in itself
Fully agreed on that, and what is also an art is to spot those nasty little single points of failure that can kill an otherwise viable business. Just thinking about contingency planning makes you look at a business with different eyes.
If you are making money or if the data is too important for you to lose, then you should have a backup; anything else is faulty planning.
Well, you felt wrong. Of course you should back up those 100s of terabytes. In fact, the sheer amount of information is an excellent reason, on top of all the other ones, to back it up: re-creating it is going to be next to impossible.
It's just that the companies I look at - not all, but definitely some - seem to be under the impression that the cloud (or their cloud provider) can be trusted. Which is wrong for many reasons, not just this article.
What bugs me about it is that there are some companies that give serious pushback because their cloud providers keep on hammering in to them how reliable their cloud is and that any back-up will surely be less reliable than their cloud solution and oh by the way we also have a backup feature that you can use.
They don't realize that even then they still have all their eggs in the one basket: their cloud account.
I only saw that part of the comment much much later.
Often the effects are more subtle, reading what you think something said rather than what it actually said, or missing a negation or some sub-clause that materially alters the meaning of a sentence.
Even in proofreading we find stuff that is so dead obvious it is embarrassing. On the whole visual input for data is rather unreliable, even when reading stuff you wrote yourself, which I find the most surprising bit of all.
Studying this is interesting, and to some extent important to us due to the nature of our business, missing critical info supplied by a party we are looking at could cause real problems so we have tried to build a process to minimize the incidence of such faults, even so I'm 100% sure that with every job we will always miss something, and I live in perpetual fear of that something being something important.
That's true, but they can do something to avoid making things worse, see the linked article.
The comment suggests they are using a personal GCP account instead of an enterprise account.
Millions of dollars worth of work + imply no backup + non-enterprise account (but expecting enterprise support) + not having multiple forms of payment available.
Combining all these together, it seems like all sorts of things are going wrong here.
I have never used GCP (or any of the big three cloud providers), so I don't know how they are in general, but in this specific case there seems to be faulty planning on the user end.
They suspect you are a fraudster because you are using a stolen card.
Either sign a proper SLA agreement with Google (which gives you 30 days to pay their bills by any form, and therefore you get 30 days notice before they pull the plug), or have two forms of payment on file. Preferably, don't use your GCP credit card at dodgy online retailers too...
While you make sense from Google's PoV, it doesn't from the customer's PoV. As Google is a big corp, it's IMHO better to side with the customer here, as next time it might be you who's getting screwed over by Google/other corp.
Or at gas stations, ATMs, or any other place where someone can install a skimmer.
How do you do that?
It seemed very Kafkaesque to me, getting tried and convicted without any mention of the crime or charge. I think the author is justified in his disapproval.
The worst part is that when we did a post-mortem and asked Google why the support resolution was so slow despite us being "the privileged" customer, their answer was that the P1 SLA was only to respond within 15 minutes; there is no SLA for resolution. Most of the "responses" we were getting were that a new support guy had taken over in a new time zone, which is the most useless information for us.
We are seriously thinking of moving to another cloud vendor.
AWS would never admit that anything is wrong from their side.
Grace periods that respect your business should be a standard that all service providers hold themselves to.
As far as I know, Mozilla has no business relationship with extension developers, so I would actually be very concerned if their first action wasn't to cut you off.
There is nothing dodgy about the extension. Mozilla was just being ridiculous.
Thank you for judging my business without even knowing it.
I don't know anything about your business or the extension, I'm just pointing out that you're in a space that makes you suspicious by association.
Edit: I should add that after 2 weeks of back and forth emails the dude was finally able to build it then blamed me for not mentioning he needed to run "npm run build", even though I did mention it AND it's in package.json AND it's mentioned in the (very short and concise) readme.txt.
So after this exasperating experience he just took down the extension without warning and said it's because it contains Google Analytics.
I would have happily removed Google Analytics from the extension. The dude had my source for 2 weeks and could have told me about that at any time, but decided to tell me after 2 weeks of mucking around, after he had already removed the extension.
It was me that decided it was not worth the hassle to have the extension on their store. I just left it off.
The extension is currently a proof of concept that I plan to revisit later.
And had they converted their project to monthly invoicing: https://cloud.google.com/billing/docs/how-to/invoiced-billin...
* When you have invoicing setup, the above shouldn't happen. You need to keep a payment method in good standing, but you have something like 10 days to pay your bill. -- They do a little bit more vetting (KYC) on the invoice path, and that effectively gets you out of dodge.
* Without paying for premium support, there's effectively no support.
I think if someone didn't pay their bill on time, you might shut off their service too, wouldn't you?
What does that have to do with anything? The account was not shut down for non-payment, it was shut down because of unspecified "suspicious activity."
But even in case of non-payment I would not shut down the account without any warning. Not if I wanted to keep my customers.
Then they unplug the Ethernet cable and wait a week or two.
But as you said, this isn’t about non-payment.
A hypothetical but not unheard of scenario in which immediate shutdown might be warranted.
It's a rough world and different providers have optimised for different threat models. AWS wants to keep customers hooked; GCP wants to prevent abuse, Digital Ocean wants to show it's as capable as anyone else.
If you can afford it, you build resilient multicloud infrastructure. If you can't yet do that; at the very least ensure that you have off-site backups of critical data. Cloud providers are not magic; they can fail in bizarre ways that are difficult to remedy. If you value your company you will ensure that your eggs are replicated to more than one basket and you will test your failover operations regularly. Having every deploy include failing over from one provider to another may or may not fit your comfort level; but it can be done.
Not without warning, no. It is possible that the customer intended to start a CPU-intensive process and fully intended to pay for it.
Send a warning first with a specific description of the "suspicious activity" and give the customer a chance to do something about it. Don't just pull the plug with no warning.
Yes, there's nothing wrong with that. You have their credit card and can even authorize certain amounts ahead of time to make sure it can be charged.
There's a degree of complexity that comes with multi-cloud that's ill-suited for most early stage companies. Especially in the age of "serverless" that has folks thinking they don't need people to worry about infrastructure.
My point is that the calculus has more to it than just money. The prudent response, of course, is to do as you described. Have a plan for your provider to go away.
Offsite backups and the necessary config management to bring up similar infra in another region/provider is likely sufficient for most.
Perhaps we'll start seeing a new crop of post-mortems from the "fail fast" type of startups failing due to cloud over-dependency issues. They're (presumably rare) edge cases, but easily fatal to an early enough startup.
I just heard a dozen founders sit up and think "Market Opportunity" in glowing letters.
CockroachDB has a strong offering.
But multi-cloud need not be complicated in implementation.
A few Ansible scripts and some fancy footwork with static filesystem synchronization and you too can be moving services from place to place with a clear chain of data custody.
Everything I have runs in Kubernetes. The only difficulty I have to deal with is figuring out how to deploy a Kubernetes cluster in each provider.
From there, I write a single piece of orchestration that will drop my app stack in any cloud provider. I'm using a custom piece of software and event-driven automation to handle the creation and migration of services.
Migrating data across providers is hard, as Kubernetes doesn't have snapshots yet.
There are already a lot of startups in this space doing exactly the kind of thing that I just described. Most aim to provide a CD platform for k8s.
Rather, it would likely be enough to have a cloud-agnostic infrastructure with replication to a warm (or even mostly-cold to save on cost) standby at the alternate provider with a manual failover mechanism.
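To make the "warm standby with a manual failover mechanism" idea concrete, here is a rough sketch of the control logic, assuming some external health check feeds it results. The class name, the provider labels, and the threshold of three failed checks are all made up for illustration. The point is only that automation pages a human, and the switch itself requires explicit confirmation:

```python
from dataclasses import dataclass


@dataclass
class FailoverController:
    """Tracks which provider is active; never switches on its own."""

    active: str = "primary"        # hypothetical provider labels
    standby: str = "standby"
    failure_threshold: int = 3     # assumed: failed checks in a row before paging
    consecutive_failures: int = 0

    def record_check(self, healthy: bool) -> bool:
        """Record one health check; return True when an operator should be paged."""
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold

    def fail_over(self, operator_confirmed: bool) -> str:
        """Swap active and standby only after explicit human confirmation."""
        if not operator_confirmed:
            return self.active
        self.active, self.standby = self.standby, self.active
        self.consecutive_failures = 0
        return self.active
```

Keeping the switch manual trades a slower recovery for not having to trust automated cross-provider failover, which is where a lot of the HA complexity tends to live.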
I disagree. More specifically, I think many folks just don't make that assessment/estimate in the first place.
They just follow what they perceive to be industry best practices. In many ways, this is more about social proof than a cargo cult, even though the results can resemble the latter, such as elsewhere in this thread with a comment complaining they had a "resilient" setup in a single cloud that was shut down by the provider.
> There are distinct benefits that come with avoiding "HA" setups. Namely simplicity and speed.
Indeed, and, perhaps more importantly, being possible at all, given time ("speed") and money ("if you can afford it").
The same could be said of "scalability" setups, which can overlap in functionality (though I would argue that in cases of overlap the dual functionality makes the cost more likely to be worth it).
None of this is to say, though, that "HA" is synonymous with "business continuity". It's much like the conceptual difference between RAID and backups, and even that's not always well understood.
I won't go so far as to say "most" because that would be a made-up statistic on my part.
A clever man once said, "you own your availability".
An exercise in BC planning can really pay off. If infra is code, and it and the data are backed up reasonably well, then a good MTTR can obviate the need for a lot of HA complexity.
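The MTTR point can be put in numbers with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). A small sketch follows; the function names are mine and any inputs you feed it are your own estimates, not measurements from any provider:

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected yearly downtime implied by the same two estimates."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR
```

For a fixed failure rate, expected downtime scales roughly linearly with MTTR, so cutting recovery time from a day to an hour buys about as much as making failures 24 times rarer. That is the sense in which tested backups plus infrastructure-as-code can stand in for a lot of HA machinery.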
I assume I'm missing some meaning here, particularly since the premise of much of the discussion in the thread is that there can be high availability at one layer, but it can be rendered irrelevant by a SPoF at another (especially when the "layer" is the provider of all of one's infrastructure).
Do you consider that a version of "none"? Or are you pointing out that, despite the middle ground under discussion, the "binary" approach is more common, if not more sensible?
And if it was that critical, it should have support and an SLA contract, and, you know, backups.
The costs of Google may be comparable to or lower than those of other services, but they don't seem to get that risk is a cost. Risk can be your biggest cost. And they've amplified that risk unnecessarily and shifted it to the customer. Fatal, as I said.
You might never see this happen to your GCP account in its lifetime.
But even if you assign blame to the OP for not expecting this, it doesn't look good, because the lesson here is "you shouldn't use google and if you do, expect them to fuck you over, for no reason, at any time".
It reminds me of AWS's opacity-as-antidote-to-worry with respect to hardware failures. If the underlying hardware fails, the EC2 instance on it just disappears (I've heard GCP handles this better, and AWS might now, as well). I like to point out that this doesn't differ much from the situation of running physical hardware (while ignoring the hardware monitoring), both from a "worry" burden perspective and from a "downtime from hardware failure" perspective.
We've gone through several account teams of our own that seemed eager to help, only to go radio silent once we actually needed something. We have already moved mission-critical services to AWS and Azure, with GCP only running K8S and VMs for better pricing and performance.
GCP has good leadership now, but unfortunately it's clearly taking longer than it should to improve.