As someone who is currently struggling with Google Cloud's mediocre support, this is not surprising. We pay lots of money for support and have multiple points of contact but all tickets are routed through front-line support who have no context and completely isolate you from what's going on. For highly technical users the worst support is to get fed through the standard playbook ("have you tried turning it off and on again?") when you're dealing with an outage. Especially since the best case is your support person playing go-between with the many, siloed teams trying to troubleshoot an issue while they apparently try to pass the buck.
Not to mention the lack of visibility in changes - it seems like everything is constantly running at multiple versions that can change suddenly with no notice, and if that breaks your use case they don't really seem to know or care. It feels like there's miles of difference between the SRE book and how their cloud teams operate in practice.
I'd just like to take this opportunity to praise Vultr. I've been using them for years and their support has always been good, and contrary to every other growing company, has been getting better over time.
I had an issue with my servers 2 days ago and I got a reply to my ticket within 1 minute. Follow-up replies were also very fast.
The person I was talking to was a system administrator who understood what I was talking about and could actually solve problems on the spot. He is actually the same person who answered my support requests last year. I don't know if that's a happy accident or if they try to keep the same support staff answering for the same clients. He was answering my requests consistently for 2 days this time.
I am not a big budget customer. AWS and GCP wouldn't think anything of me.
Thank you Vultr for supporting your product properly. And thanks Eric. You are very helpful!
Google Cloud provides more than just VMs and containers. It has a bunch of services baked in, from a variety of databases such as Firebase (which has powerful built-in subscription and eventing systems), to fully baked-in Auth (Google will even handle two-factor for you!), to assistance with certain types of machine learning.
Vultr looks like they provide more traditional services with a few extra niceties on top.
Within Google's infrastructure, I can deploy a new HTTPS REST endpoint with a .js file and 1 console command.
Could I set up an ecosystem on a Vultr VM to do the same? Sure, it isn't magic. But GCP's entire value prop is that they've already done a lot of that work for you, and as someone running a startup, I was able to go from "I have this idea" to "I have this REST endpoint" in a couple of days, without worrying about managing infrastructure.
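To give a concrete sense of that workflow, here is a minimal sketch of an HTTP Cloud Function; it uses a Python handler rather than the .js file mentioned above, and the function name, payload and deploy flags are illustrative assumptions, not the actual setup.

    # main.py -- minimal HTTP endpoint for Google Cloud Functions (illustrative sketch).
    # Function name, payload and deploy flags are assumptions, not the commenter's setup.
    import json

    def hello(request):
        """HTTP Cloud Function: return a small JSON payload for any request."""
        body = {"message": "hello from the cloud"}
        return (json.dumps(body), 200, {"Content-Type": "application/json"})

    # Deployed with a single console command, for example:
    #   gcloud functions deploy hello --runtime python39 --trigger-http --allow-unauthenticated

The deploy command hands back an HTTPS URL that is callable right away, which is the "idea to REST endpoint in days" experience described above.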
That said, articles like this always worry me. I've never seen an article that says "Wow Google's support team really helped out!"
Using such proprietary features sounds like a great way to subject yourself to vendor lock-in and leave yourself vulnerable to your cloud provider's every whim. I understand that using ready-made features is alluring, but at what point are you too dependent on somebody else? All these cloud services remind me a bit of left-pad: how many external dependencies can you afford? Maybe I'm too suspicious and cynical, but then I read articles like this one from time to time...
The difference, IMO, is that you're generally leveraging the cloud provider's platform in addition to using their hosting.
There are ways to make the hosting relatively agnostic, but choosing a pub/sub solution (for example) that operates at 'web scale' will have a distinct impact on your solutions and force you into their ecosystem to maximize value. Why bother with BigCorp's UltraResistant services if you're only going to use some small percentage of the capabilities?
I've made systems that abstract away the difference entirely, but I think the 'goldilocks zone' is abstracted core domain logic that will run on anything, and then going whole-hog on each individual environment. Accept that "cloud" is vendor lockin, and mitigate that threat at the level of deployment (multi-cloud, multi-stack), rather than just the application.
You're not alone. I worry the same about many things, but everyone just thinks I'm a negative nancy for discounting THIS AWESOME SERVICE with these awesome, 100% non evil people behind it!
Thanks for the info. You may be right about the downvote reason (though it's a pretty ridiculous reason), but I don't think that matters since they are in the same industry providing a similar service and there's no reason why GCP can't provide the same service as Vultr, especially since they charge a lot more for their instances than Vultr does.
Well, I don't really care about the precious internet points disappearing. I'd much rather hear from someone what their reasoning is since I might actually learn something.
But it is telling that there have been at least 5 downvotes but no one is willing to comment as to why.
Edit: since I see you have a downvote (surprise, surprise) I'll clarify that it wasn't me.
Please don't break the site guidelines by going on about downvotes. That's a strict reduction in signal/noise ratio, which mars your otherwise fine comment. We're trying for the opposite here.
Downvotes can irritate, but there are at least two balms that don't involve adding noise to the threads. One is to remember that people sometimes simply misclick. The other is that unfairly downvoted comments mostly end up getting corrective upvotes from fair-minded community members (as happened with yours here).
I’ve seen other companies walk away from Google Cloud for similar reasons. "Automate everything to scale" doesn't work for the Fortune 500. They should absolutely own this market.
This is why AWS and Azure continue to gain market share in cloud, while Google remains relatively stagnant, despite (in many cases) superior technology.
Their sales staff is arrogant and has no idea how to sell into F500 type companies.
Source: 10+ meetings, with different clients, I attended where the Google sales pitch was basically "we are smarter than you, and you will succumb". The Borg approach. Someone needs to revamp the G sales and support approach if they want to grow in the cloud space.
Even for small businesses, their sales approach is pretty bad. I once got a package in the mail from them with a URL (containing a tracking code) printed on it to contact them, which was so obviously Google being Google and treating people as part of a funnel. There was no phone number to be found and nothing personalized.
The other funny thing is the package had a neoprene sleeve for a Chromebook. Eventually a sales person reached out via email assuming I owned a Chromebook and acted like I owed them a phone call because they gave me a neoprene sleeve I couldn’t use.
The entire package ended up going in the trash, which was an unfortunate waste of unrecyclable materials.
If you filled in a form at the link provided on one of the bits of paper in the box, they would have sent you a Chromebook for the sleeve. I've got one here gathering dust. My boss threw away the same package, but I was curious and looked through it carefully.
Yes, this is our experience as well, and the root cause of many of the problems with GCP. The tech is nice, but it matters little if the account team just ignores us.
Reminds me of a thread I saw on the Google Inbox mobile app a while back. Brilliant app, but no 'unread message counter'. There were a huge number of people on the thread begging for that feature, going so far as to say that it was the one thing that prevented them from using the app. The developers' thinking was apparently that you should have filters for everything and it should all fall neatly into little boxes, but for people who have been using email 10 times longer than those developers have been out of college, that's not very practical. One G dev chimed in, said 'But that's not how I use email', and closed off the discussion.
That's interesting. I was under the impression that everything at Google's offices tries to de-stress you and un-distract you. I thought that would result in people being calmer and more empathetic.
Yes, that's exactly what I'm talking about: they are super arrogant and unwilling to discuss things at a practical level.
And I've seen it cause them to lose at least 10 potentially good sales.
They have advantages but they're so arrogant that it puts people off.
At least 10 times, people have told me they prefer Google's solution to Microsoft's or Amazon's, but that they're going with a competitor because they can't stand Google's arrogant attitude. It's almost laughable: they're throwing money away just because they won't back off.
It blows my mind that GCloud, with arguably superior tech and performance compared to AWS/Azure, can't handle support. I have my own horror stories from 2 years ago, but still they haven't fixed it.
Google just doesn't seem to be able to focus on products that require service and customer support. Maybe they just don't care about it while they have an infinite revenue stream from search and advertising. Whatever it is, they should be embarrassed.
I love the tech, and the UX details like in-browser SSH (AWS hasn't improved its UX, EVER), but they can't get support right? Amazing.
That's the meme, but my experience with the business support for G Suite doesn't match it at all: I can easily call the phone support, get a competent human quickly, and they are very helpful.
I didn't write that article, but last week I came to the same conclusion and began my migration from GCP to AWS. I admire Google's tech but Cloud Platform lacks fit and finish. It's not fully productized. It's not even fully documented. (Is it indolence or arrogance to publish a link to the source code as the only explanation of an important API?) I'm sorry, Google, you ignored me when I was crushing on you. Now I have Amazon.
I think they still are mainly focused on their ad business as the core of the company and cloud is something they 'do on the side'. For Microsoft, Azure is core business, it's the future of the company. If they fuck it up, they're dead. Google apparently doesn't see their cloud offering as their core business and therefore doesn't get the attention it needs.
In my limited experience, Google has worse support than Facebook (when it comes to advertising agencies). They simply don't care, because you are a tiny multimillion-euro company and they are THE GOOGLE.
Yeah, Cloud9 is billed as an IDE, but it's really more useful as a terminal inside your cloud environment that happens to have a text editor. Workspaces has been great for a cloud-based development environment, and the new Linux Workspaces will be more useful than the Web-based "cloud IDEs".
They are very different things: Workspaces runs a full desktop environment (Windows or Linux) on an EC2 instance, and enables you to remotely access it through client software. The client software uses Teradici PCoIP, rather than VNC or RDP, and Teradici is amazing: it is so fast that the desktop feels like it is running on your local computer.
This means that you can run whatever development tools you want on the EC2 instance, rather than the very limited code editor that Cloud9 provides. You can easily run a full copy of Visual Studio on a Workspace, and get the full resources of an EC2 instance with SSD drives.
AWS sends you emails 9 months in advance of needing to restart individual EC2 instances (with calm, helpful reminders all the way through). IME, they're also really good about pro-active customer outreach and meaningful product newsletters... Even for tiny installations (ie less than $10K yearly).
Anecdotally: I've been an MS gold partner in a bunch of different contexts for years. The experiences I had as 'small fish' techie with AWS were on par or better. YMMV, of course, but I'd be more comfortable putting my Enterprise in the hands of AWS support than MS's (despite MS being really good in that space).
It costs a pretty penny, but I’m very happy with AWS enterprise support. When we had a ticket that we didn’t escalate get a crappy answer, our TAM escalated on his own initiative to get us a better answer.
So, piggybacking on this, I have a similar story to tell. We had a nice young startup, with infra built entirely on Google Cloud. Nicely, resiliently built, good solid stuff. Because of a keyword picked up by their auto-moderation bot, our entire project was shut down immediately, and we weren't able to bring it up for several hours. Thank god we hadn't gone live yet, because we were then told by support that, given the grey area of our tech, they couldn't guarantee this wouldn't keep happening. In fact they told us straight out that it would, and that we should move.
So maybe think about which hosting provider to go with. Don't get me wrong, I like their tech. But their moderation does need a more human element; to be frank, all their products do. Simply ceding control to algorithmic judgement just won't work in the short term, if ever.
I’m starting to favour buying physical rack space again and running everything 2005-style with a lightweight Ansible layer. As long as your workload is predictable, the lock-in, unpredictability, maze of billing, weird rules and what-the-fuckery you have to deal with on a daily basis mean the cloud is merely trading one vendor-specific hell for another. Your knowledge isn't transferable between cloud vendors either, so I'd rather have a hell I'm totally in control of, where the knowledge has some retention value and will move between vendors no problem. You can also span vendors then, thus avoiding the whole all-eggs-in-one-basket problem.
Hybrid is what you are looking for. Have a rack or two for your core and rent everything else from multiple cloud vendors, integrated with whatever orchestration you are running on your own racks (K8s? DC/OS? Ansible?).
It still works out cheaper than AWS for our workloads, even factoring staff in at this point.
AWS always turns into cost and administrative chaos as well unless it is tightly controlled, which is itself costly and difficult the moment you have more than one actor. GCP is probably the same, but I have no experience with that. It's much harder to drift into that kind of chaos when you have physical constraints.
For a two-man startup, perhaps, but I think the transition should go:
VPS (Linode etc.) for the MVP, then a colo half rack, then active/active racks at two sites, then scale out however your workload requires.
More importantly, there is a wealth of competent labor in the relatively stable area of maintaining physical servers (both on the hardware and software side). The modern cloud services move fast and break things, leading to a general shortage of resources and competent people. As a business, even if slightly more expensive initially, it makes more sense to start lower and work up to the cloud services as the need presents itself.
You can but that’s another costly layer of complexity and distribution to worry about.
One of the failure modes I see a lot is failing to factor in latency in distributed systems. Mainly because most systems don’t benefit at all from distribution and do benefit from simplification.
The assumption on here is that a product is going to service GitHub or stackoverflow class loads at least, but literally most aren’t. Even high profile sites and web applications I have worked on tend to run on much smaller workloads than people expect. Latency optimisation by flattening distribution and consolidating has higher benefits than adopting fleet management in the mid term of a product.
Kubernetes is one of those things you pick when you need it, not before you need it. And then only if you can afford to burn time and money on it without a guaranteed ROI.
Sure. The idea is that you get the benefits of public cloud and cost savings of BYO hardware for extra capacity at lower cost. Of course, you're now absorbing hardware maintenance costs as well. I haven't seen a cost breakdown really making a strong case one way or the other, but my company is doing it anyway.
Have you actually done this, or are you repeating stuff off the website? Because everyone I've talked with about kubernetes federation says it's really not ready for production use.
The approach we have taken is to create independent clusters with a common LoadBalancer.
Basically, the LB decides which kubernetes cluster will serve your request and once you're in a k8s cluster, you stay there.
You don't have the control plane that federation provides, and there's a bit of overhead in managing clusters independently, but we have automated the majority of the process. On the other hand, debugging is way easier and we don't suffer from weird latencies between clusters (weird because sometimes a request will go to a different cluster without any apparent reason <-- I'm sure there is one, but none that you could see/expect, hence the debugging pain).
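To make the "once you're in a cluster, you stay there" idea concrete, here is a minimal sketch of sticky routing by hashing a client identifier to a cluster; the endpoints and the client-ID scheme are made up for illustration, not the actual load-balancer configuration described above.

    # sticky_clusters.py -- sketch of pinning each client to a single Kubernetes cluster.
    # Cluster endpoints and the client-ID scheme are hypothetical.
    import hashlib

    CLUSTER_ENDPOINTS = [
        "https://k8s-eu.example.com",
        "https://k8s-us.example.com",
    ]

    def cluster_for(client_id: str) -> str:
        """Map the same client to the same cluster every time, so a session never hops."""
        digest = hashlib.sha256(client_id.encode("utf-8")).hexdigest()
        return CLUSTER_ENDPOINTS[int(digest, 16) % len(CLUSTER_ENDPOINTS)]

    print(cluster_for("customer-42"))  # stable output for a given client

A real load balancer would typically do the equivalent with a consistent-hash or cookie-based affinity rule; the point is only that the routing decision is deterministic per client.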
My people's time is more important than your complex system.
That's exactly what we are thinking too. We've looked HARD into AWS/GCP/Azure, but for all the reasons you mentioned we don't want to go that route. Owning the entire stack is so much cheaper, both money and time wise.
I can tell a similar story with Amazon MWS, where even when we had access to "human support", it felt like talking to some bad ML that didn't understand what we were saying. Ultimately that startup was disbanded. It never violated any rule they had, but it was flagged because of a false positive, and we couldn't even prove we hadn't violated anything because we hadn't even gone live yet. It felt Kafkaesque: being punished for one of a myriad of possible intents by malfunctioning ML, with no recourse.
Maybe support just needed to satisfy their quota of kicked out companies for the month, who knows?
Is it only me, or does it seem that if you are not a "famous" person with a lot of public visibility who can create pressure through a tweet or blog post, you are lost: no number to call, no email to write to. Over the years I've seen a lot of similar stories, YouTube or general "Google accounts" blocked for no clear reason and no way to contact somebody to solve the issue... kinda scary...
The point is that anyone could fall into that category when laws change.
Imagine you're running a cosplay community, and all of a sudden all your content is being deleted because the SESTA/FOSTA bill gets passed in a country where your "cloud" happens to reside in: https://hardware.slashdot.org/story/18/03/25/0614209/sex-wor...
I've gotta wholeheartedly disagree. I've never encountered this on GCE. I run a DevOps consulting company, and for standard EC2-style machines I much prefer GCP. It's not even close. AWS for the most part has little or no user experience testing on its UIs and developer interfaces. AWS region-specific resources are a nightmare, and billing on GCP with sustained use and custom machine types is vastly superior. Disks are much easier to grok: no provisioned IOPS, EBS-optimized, enhanced-networking hoopla.
By chance are you located out of the United States? These are not downtime issues, but anti-fraud prevention and finance issues.
I've noticed that over the last few years it's become increasingly difficult to do things with US based services (especially banking) if you are outside of the US. And this goes double if you are a US citizen with no ties to the States other than citizenship. Americans as a general rule have never been terribly adept at anything international; banking, languages, or even base geography. We have offices in Cambodia and Laos and I have been told by more than one US based-service/company that Laos is not a real country. I suppose they think the .la domain stands for Los Angeles :) We are looking to set up an office in Hong Kong or Singapore and use that to deal with Western countries. But we're a small not-for-profit operation and HK and Singapore are EXPENSIVE.
Stopped, yes; but deleting the project if a photo ID of the credit card account holder cannot be produced within 3 days might be an over-reaction though.
I hope there is a possibility to put a backup contact person / credit card so organisations can deal with people going on vacation or being sick or whatever.
IMHO this should be documented as nicely as any other technical material you get to learn about the cloud product when you create an account (e.g. important steps to ensure your account remains open even in case of important security breaches: yadda yadda, it's possible we'll need a way to prove that you are you; yadda yadda, this can happen when yadda yadda; be prepared; do yadda yadda).
I agree that it seems like an over-reaction. But on an account with intense usage, a single credit card on file, no backup, and a fraud warning it does seem very suspicious.
AFAIK, Google Cloud credit card payments are processed through Google Pay, which supports multiple credit cards, debit cards, bank accounts, etc.
Ideally, in this case the company shouldn't have been using the CFO's credit card, but should have entered into a payments agreement with Google, with POs, invoices and so on, including a credit line.
Never set up a crucial service like you'd set up a consumer service.
Yes, that's a very good description of best practices that, sadly, many companies are not really following.
In many situations the "right thing" must be explained; otherwise, when people fail to get it, they can argue that it wasn't really the right thing after all (sure, that's ultimately because they just want to deflect the blame from themselves, so don't let them! Clearly explain the assumptions under which anti-fraud measures operate so people cannot claim they didn't know).
Projects in the context of GCP can encompass all the necessary infrastructure to build a highly available service using standard practices. There's no indication anywhere from GCP themselves that a project could be a domain of failure. If asked, I doubt they would consider it as such.
A prudent person might consider a cloud provider to be a domain of failure and choose a multi-cloud option, which would probably be the correct way to address this resiliency issue. However, that's not really an appropriate approach for an early stage startup, where availability is generally not that much of a concern.
In other words: It wasn't resiliently built stuff.
Is an exploding car safe because it is built by an early stage startup?
Just because you decide that implementing resiliency isn't a good business decision for some early stage startup, doesn't magically make the product resilient, it just isn't and that may be OK.
There are many options to choose from for implementing resiliency, it could be having multiple providers concurrently, it could be having a plan for restoring service with a different provider in case one provider fails, it could be by setting up a contract with a sufficiently solvent provider that they pay for your damages if they fail to implement the resiliency that you need, whatever. But if you fail to consider an obvious failure mode of a central component of your system in your planning, then you are obviously not building a resilient system.
Edit: One more thing:
> There's no indication anywhere from GCP themselves that a project could be a domain of failure. If asked, I doubt they would consider it as such.
Then you are asking the wrong question, which is still your failure if you are responsible for designing a resilient system.
If you ask them "Is a complete project expected to fail at once?", of course they will say "no".
That's why you ask them "Will you pay me 10 million bucks if my complete project goes offline with less than one month advance warning?", and you can be sure you will get the response to the problem that you are actually trying to solve.
> A prudent person might consider a cloud provider to be a domain of failure and choose a multi-cloud option, which would probably be the correct way to address this resiliency issue. However, that's not really an appropriate approach for an early stage startup, where availability is generally not that much of a concern.
If you replace "multi-cloud" with "multi-datacenter" (in the pre-cloud days), this premise is fairly unassailable. In those same days, applying it to "multi-ISP", it becomes more arguable.
Today, though, the incremental cost (money and cognitive) of the multi-cloud solution, even for an early startup, doesn't seem like it would be high enough to make the notion downright inappropriate to consider.
I'd even argue that if a cloud provider makes the lock-in so attractive or multi-cloud so difficult that that's a sign not to depend on those exclusive services.
The economics don’t work out if you are trying to do this with just vanilla VMs across AWS, GCP and Azure and managing yourself. You either do it the old fashioned way renting rack space and putting your own kit in, or you make full use of the managed services at which point - by design - you are locked in.
This is very concerning, but it can happen on AWS as well. July 4th last year at about 4 PM PST, Amazon silently shut down our primary load balancer (ALB) due to some copyright complaint. This took out our main API and several dependent apps. We were able to get a tech support agent on the phone, but he wasn't able to determine why this had happened for several hours. Eventually we figured out that another department within Amazon was responsible for pulling down the ALB in an undetectable way. Ironically, we are now in the process of moving from AWS to GCP.
My coworker is running a hosted affiliate tracking system on AWS as part of our company. He regularly has to deal with AWS wanting to pull our servers because of email spam -- not because we're sending spam emails, but because some affiliate link is in a spam email that resolves to our server, and Spamhaus complained to AWS.
Usually this can get handled after a few days of aggravating emails back and forth, we get our client to ban the affiliate in question, and move on with our days with no downtime. But a few weeks ago my coworker came in to find our server taken offline, because AWS emailed him about a spam complaint on a Friday night, and they hadn't gotten a response by Sunday. It'd been down for hours before he realized.
They'd just null-routed the IP of the server, so he updated the IPs in DNS real quick, but he then spent half a day both resolving the complaint and getting someone at AWS to say it wouldn't happen again. They supposedly put a flag on his account requiring upper-management approval to disable anything again, but we'll see if that works when it comes up again.
Fwiw, Ansible makes the multicloud thing pretty straightforward as long as you aren’t married to services that only work for a specific cloud provider.
For that, you should consider setting up multiple accounts to isolate those services from the portable ones.
Wouldn't that be Terraform (perfect for setting up cloud infrastructure) vs. Ansible (can do all, but more geared to provisioning servers you already have)?
Ansible uses Apache Libcloud to run just about anything you need on any cloud provider in terms of provisioning. Once provisioned, it will handle all of your various configuration and deployment on those.
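For reference, the Libcloud layer underneath looks roughly like this; the credentials, project name and machine parameters below are placeholders, and the same get_driver pattern applies to EC2 and other providers.

    # provision.py -- minimal Apache Libcloud sketch; all credentials/names are placeholders.
    from libcloud.compute.types import Provider
    from libcloud.compute.providers import get_driver

    # The same pattern works for other clouds, e.g. get_driver(Provider.EC2) with AWS keys.
    ComputeEngine = get_driver(Provider.GCE)
    driver = ComputeEngine(
        "svc-account@my-project.iam.gserviceaccount.com",  # service account email (placeholder)
        "/path/to/key.json",                                # service account key file (placeholder)
        project="my-project",
        datacenter="us-central1-a",
    )

    # Pick a size and image by name, then boot a node.
    size = [s for s in driver.list_sizes() if s.name == "n1-standard-1"][0]
    image = [i for i in driver.list_images() if "debian-9" in i.name][0]
    node = driver.create_node(name="demo-node", size=size, image=image)
    print(node.name, node.public_ips)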
How does ansible make it straightforward? As far as I know, it helps with neither networking failover, load balancing, data consistency, nor other aspects of distributed systems, and running one application across clouds is certainly a distributed-systems problem, not a deployment problem.
Ansible helps deploy software, but deploying software is the smallest problem of going multi-cloud.
I know what ansible is and can do.
Your other comment is about how it can provision and deploy things. While true, it's unrelated to my point that that's the least of your problems in a multi-cloud world.
A lot of that depends on scale too. I was mostly talking about the ability to standardize configuration so that you could replicate your infrastructure on multiple providers. Essentially just making sure that you have a backup plan/redundancy in case something happens and you find yourself needing to spin things up elsewhere on short notice.
You're absolutely right that running them at the same time, data syncing, traffic flow, etc is much more complicated.
What are popular multi cloud solutions if you use AWS or GCP services that have proprietary APIs? Are there frameworks that paper over the API differences?
Apache libcloud is a Python library that's used primarily to create, reboot & destroy machines in any supported cloud.
Mist.io is a cloud management platform that uses Apache libcloud under the hood. It provides a REST API & a Web UI that can be used for creating, rebooting & destroying machines, but also for tagging, monitoring, alerting, running scripts, orchestrating complex deployments, visualizing spending, configuring access policies, auditing & more.
Forgive my ignorance, but that seems like a weird choice compared to cutting access to the servers or handling the copyright complaint in some more formal way...
Also kinda concerning that multiple departments can take enforcement-type action without the others knowing about it. That seems way disorganized / a recipe for disaster.
I respectfully disagree. I have worked on two Azure projects, both with big accounts, one even so big that we had senior Azure people sitting in our teams. Both had the highest possible support contract.
Yet their support never solved a problem within their SLAs, and sometimes critical-level tickets were hanging for months.
Plus my impression is that whereas AWS (and possibly Google) clouds are built by engineers using best practices and logic, Azure products always felt very much marketing-driven, e.g. marketing gave engineering a list of features to launch and engineering did the minimum effort possible to get the corresponding box ticked. I absolutely hated working on Azure and now won't accept any contract on it.
Documentation is horrible or non-existent, things just don't work or have weird limitations or transient deployment errors, there are super weird architectural and implementation choices, and you never escape the clunkiness of the MS legacy, with AD for example.
We did have the same issues back in beta and were forced to build chaos-monkey degrees of robustness into our platform. Was this experience of yours a while back? There are now a few people at work who even run VMs on it as their daily driver.
Service Level Agreements dictate the quality, availability, and responsibilities with the client. They put bounds on how long things will take to get answered, and sometimes fixed.
GP is saying that even though they had a contract to resolve issues within X hours/days the issues were not being solved within X hours/days.
Cynically: most SLAs with the 'Big Boys' tend to give guarantees about getting an answer, not a solution. "We are looking into the problem" may satisfy the terms of a contract, but they don't satisfy engineers in trouble.
I know what an SLA means, but I have never seen an SLA from Azure dictating a guaranteed resolution time. They only give an SLA for time until initial reply, as far as I know. I suspect the person I replied to misunderstood what it is they purchased. Maybe in some cases, if you pay them some obscene amount of money, you can purchase an SLA for time to resolution, but I don't think that's the case here.
In this case the complaint was against some image(s) we were publicly hosting. We've taken steps to isolate our file hosting from the rest of the system in case this were to happen again. We only host images for fashion blog posts written by staff so I imagine other aws customers have had a much worse time in this regard.
No truly production-critical, and especially revenue-critical, dependency should go on a credit card. Have your lawyer/licensing person sign an agreement with them with an actual SLA and customer support. If it's not worth your time, you shouldn't complain when you lose it.
That's a great point. These cloud hosting companies don't make this a natural evolution though, because there's no human to talk to, you start tiny and increase your usage over time. But every company depending on something and paying serious money should have a specific agreement. I wonder if this could still happen though, even if you have a separate contract.
> These cloud hosting companies don't make this a natural evolution though, because there's no human to talk to
This is not true at all. Once you start spending real money on GCP or AWS, they will reach out to you. You will probably sign a support contract and have an account manager at that point. Or you might go with enterprise support where you have dedicated technical assets within the company that can help with case escalation, architecture review, billing optimization, etc.
It makes sense that would happen. So they just didn't have the contact info for the people here? Maybe they just were spending a little, but their whole business still depended on it.
There's a mismatch between how much you spend and how much business value is there. The spend for management systems of physical infrastructure like wind turbines is tiny relative to revenue compared to the typical pure software company, especially freemium or ad-driven stuff where revenue-to-compute ratio is very low. Calibrating for this wouldn't really be in Google's DNA.
Amen to that. Once you reach 1,000 USD monthly you can switch to a regular invoiced account (subject to verification) and you get a dedicated account manager.
> What if the card holder is on leave and is unreachable for three days? We would have lost everything — years of work — millions of dollars in lost revenue.
Indeed, presumably they were then also at the mercy of the credit card company cancelling or declining the card at the critical billing renewal moment.
This is a standard risk with any attempt to remain anonymous with a supplier. The supplier, since they don't know you, and therefore can't trust you, will not offer much credit.
Cards get skimmed all the time. When a card gets skimmed, the issuer informs everyone who is making recurring purchases with that card "Hey, this card was skimmed, it's dead".
If someone has a recurring charge attached to that account, the recurring charge will go bad. If this is an appreciable number of cloud services billed by the second, this can happen very, very quickly and without you knowing. Remember, sometimes the issuer only informs you that the card was skimmed after all the automated systems have been told.
So, the cloud provider gets the cancel and terminates the card. It then looks around, sees the recurring charge, takes a look at your servers racking up $$ they can't recoup, and the system goes "we don't know this person, they buy stuff from us, but we haven't analysed their credit. Are they good for the debt? We've never given them credit before. Better cut them off until they get in touch."
If only they had signed an enterprise agreement and gotten credit terms. It could still be paid with a credit card, but the supplier would say "They're good for $X, let it ride and tell them they'll be cut off soon". They can even attach multiple methods of payment to the account, where, for example, a second card with a different bank is used as a backup. Having a single card is a single point of failure in the system!
In closing, imagine you're a cryptocoin miner who uses stolen cards to mine on cloud services. What does that look like to the cloud provider?
Yep, someone signs up for cloud services, starts racking up large bills and then the card is flagged as stolen.
While not a cloud platform, I had an experience in the same vein with Stripe.
We're a health-care startup, and this past Saturday I got an email saying that due to the nature of our business, we were prohibited from using their payment platform (credit card companies have different risk profiles and charge accordingly; see Patreon v. adult content creators).
Rather than pull the plug immediately, they offered us a 5-day wind down period, and provided information on a competitor that takes on high-risk services.
Fortunately, the classification of our business was incorrect (we do not offer treatment nor prescription/pharma services), and after contacting their support via email and Twitter, we resolved the issue in less than 24 hours.
So major kudos to Stripe for protecting their platform WHILE also trying to do the right thing for customers who run afoul of the service agreement.
Please remember Google Cloud is a multi-tenant public cloud, and in order to run a multi-tenant environment providers have to monitor usage, users, and billing, and take precautionary measures when usage or account activity is sensed to be irregular. Some of this management is done automatically by systems preserving QoS and monitoring for fraud or abuse.
This seems like a billing issue. If they had offline billing and monthly invoicing (enterprise agreement) I do not believe this issue would have happened.
If you are running an enterprise business and do not have enterprise support and an enterprise relationship with the provider, you may be doing something wrong on your end. It sounds like the author of this post does not have an account team and hasn't taken the appropriate steps to establish an enterprise relationship with their provider. They are running a consumer account, which is fine in many, many cases, but may not be fine for a company that requires absolutely no service interruptions.
IMO, the time it took for this issue to be resolved by the automated process (20 mins) is not too bad for consumer cloud services. Most likely this issue could have been avoided if the customer had an enterprise relationship (offline billing/invoicing, support, TAM, account flagging, etc.) with Google Cloud.
A "consumer account"? I don't know what you're talking about. This is Google Cloud, not Spotify. I don't know a lot of "consumers" spending hundreds of dollars, thousands of dollars or more per month on Google Cloud. And paying bills by wire transfer instead of credit card doesn't change anything to the issue discussed here.
I'm a hobby programmer who runs a few small projects on GCP. My personal spending is smaller than the OP's, but as mentioned elsewhere, once you hit a certain threshold they will contact you to offer switching to a business account. Obviously they're not gonna force you to switch if you don't want to, but then don't complain about not getting business-level support.
Hundreds or thousands is easily spent by 'consumers' on any public cloud. Think about it.
Paying bills offline via invoice establishes an enterprise agreement with cloud providers. It does in fact change everything with the issue discussed here: they wouldn't be taken offline due to an issue with a credit card payment.
In Germany, maybe even all of Europe, you need a tax ID, which (at least as far as the type they require is concerned) you only get as a business, not a consumer. I actually tried, because it's a relatively easy way to get a fancy, reliable network (I kind of admire their global SDN, which can push within 5% of line rate with no meaningful added packet loss, apart from the minimal amount due to random cosmic rays and similar baseline effects).
That is correct. CC billing and individual projects could be consumer level usage. Anyone can signup for this level account and use for whatever purpose. Offline invoicing and G Suite / Organization setup / domain validation / enterprise support could be thought of as enterprise and would come with assurances such as billing payment stability.
I can't speak to the specific incident. We've been running almost 400 servers (instances and k8s cluster nodes) for over a year on GCP and we've been quite happy with the performance and reliability, as well as the support response when we have needed it. I did want to address this comment...
> What if the card holder is on leave and is unreachable for three days? We would have lost everything — years of work — millions of dollars in lost revenue.
You should never be in this position. If this were to happen to us we would be able to create a new project with a different payment instrument, and provision it from the ground up with terraform, puppet and helm scripts. The only thing we would have to fix up manually are some DNS records and we could probably have everything back up in a few hours. Eventually when we have moved all of our services to k8s I would expect to be able to do this even on a different cloud provider if that were necessary.
Restarting the service from scratch is one thing, but what about all your data? Some of these services have hundreds of terabytes of data hanging off them, and if Google were to delete that because of some perceived violation of their terms, that is not something you can recover from in a couple of hours, if at all.
This is one of the reasons I always implore people to have a backup of their data with another provider or at least under a different account. That protects against all kinds of accidents but also against malice.
Backup is a thing. If your company is making millions of dollars off your business you should have a redundant backup of everything including (especially) your data.
Yes, we have backups. The problem with data is not whether I have backups; it's that at a certain scale I will have so much data that "moving providers" could take on the order of weeks.
If what happened to the OP happened to me, sure, I could have my entire infra on AWS/Azure/whatever else Terraform supports in an hour, maybe a bit more to replace some of the tiny cloud-specific features we use. But if it takes me a day just to move the data into Azure, that's an entire lost business day of productivity.
If it takes weeks then you should choose a second provider where you can show up with your backup hard drives or whatever you use and plug them in. Moving data physically is an option.
Note that I did not imply that restoring the service to a different project or provider would always be easy or fast (certainly in the case of very large data volumes it would be neither of those things). I was addressing the prospect of losing "years of work" as was stated in the OP. That sort of implies that most or all of what they did over that time is recorded only in the current state of the GCP project that was disabled, and that is a really terrifying position to be in.
People usually only end up using backups or moving infrastructure on short notice in catastrophic situations, which is presumably rare. Days' worth of work to bring your business back from catastrophic downtime doesn't seem like a bad thing at all to me. If anything, it sounds like a very well-organized development flow with a very optimistic time frame.
This vastly oversimplifies the situation, especially when the cloud is involved. Having a backup, much less a replica, of such data requires an enormous infrastructure cost, whether it's your own or someone else's infrastructure. The time to bring that data back to a live and stable state again is also quite costly (note the stable part).
It's a simple truth that even if you are at the millions of dollars point, there is a data size at which you are basically all-in with whatever solution you've chosen, and having a secondary site even for a billion dollar company can be exceptionally difficult and cost prohibitive to move that sort of data around, again especially when you're heavily dependent on a specific service provider.
Yes, the blame in part lies with making the decision to rely on such a provider. At the same time, there are compelling arguments for using an existing infrastructure instead of working on the upkeep of your own for data and compute time at that scale. Redundancy is built into such infrastructures, and perhaps it should take a little more evidence for the provider to decide to kill access to everything without hard and reviewed evidence.
It might be too expensive for some people. But really there is no other solution other than full backup of everything. Relying on a single point of failure, even on an infrastructure with a stellar record, is just a dead man walking.
And then of course there is the important bit that, from a regulatory perspective, 'just a backup' may be enough to make some statements about the past, but it won't get you out of the situation where, due to your systems being down, you weren't ingesting real-time data during the gap. And for many purposes that makes your carefully made backup not quite worthless, but close to it.
So then you're going to have to look into realtime replication to a completely different infrastructure and if you ever lose either one then you're immediately on very thin ice.
It's like dealing with RAID5 on arrays with lots of very large hard drives.
About ~6 years ago, I was involved in a project where data would increase by 100gb per day and the database would also significantly change every day. I vaguely remember having some kind of cron bash script with mysqldump and rsync that would have a near identical offsite backup of data (also had daily, monthly snapshots). We also had a near identical staging setup of our original production application which we would use to restore our application from the near-realtime backup we had running. We had to test this setup every other month - it was an annoying thing to do at first. But we were exceedingly good at it over time. Thankfully we never had to use our backup, but we slept at night peacefully.
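As a rough sketch of that kind of scheme (the hosts, paths and database name below are made up, and the original was a plain cron bash script rather than Python):

    # nightly_backup.py -- illustrative sketch of a mysqldump + rsync offsite backup.
    # Hostnames, credentials and paths are placeholders, not the setup described above.
    import datetime
    import subprocess

    DUMP_DIR = "/var/backups/mysql"
    OFFSITE = "backup@offsite.example.com:/srv/backups/mysql/"

    def nightly_backup(database: str = "appdb") -> None:
        stamp = datetime.date.today().isoformat()
        dump_path = f"{DUMP_DIR}/{database}-{stamp}.sql.gz"

        # Dump the database and gzip it in one shell pipeline.
        subprocess.run(
            f"mysqldump --single-transaction {database} | gzip > {dump_path}",
            shell=True, check=True,
        )

        # Push the dump (and any older snapshots) to the offsite host.
        subprocess.run(["rsync", "-az", DUMP_DIR + "/", OFFSITE], check=True)

    if __name__ == "__main__":
        nightly_backup()  # typically invoked from cron, e.g. once per night

The important part is less the tooling and more the discipline of regularly restoring from the offsite copy onto a staging environment, as described above.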
Backup is a bit of an art in itself, everyone has a different type of backup requirement for their application, some solutions might not be even financially feasible. You might never end up using your backup ever at all, but all it needs is one very bad day. And if your data is important enough, you will need to do everything possible to avoid that possible bad day.
That's a good scheme. Note how things like GCP make it harder rather than easier to set something like that up, you'd almost have to stream your data in real time to two locations rather than to bring it in to GCP first and then to stream it back out to your backup location.
> Backup is a bit of an art in itself
Fully agreed on that, and what is also an art is to spot those nasty little single points of failure that can kill an otherwise viable business. Just thinking about contingency planning makes you look at a business with different eyes.
Yes, I'm aware of that. But you'd be surprised how many businesses are under the impression that using 'the cloud' obviates the needs for backups. Especially if their data is in the 100's of terabytes.
Non-technical owners making faulty assumptions is not the fault of "Cloud" providers. It's probably common (I faced it myself personally, in a non-cloud situation), but there is nothing the providers can do about unprepared users.
While true, I was specifically referring to this part:
> What if the card holder is on leave and is unreachable for three days? We would have lost everything — years of work — millions of dollars in lost revenue.
The comment suggests they are using personal GCP account instead of enterprise account.
Millions of dollars' worth of work + an implied lack of backups + a non-enterprise account (but expecting enterprise support) + not having multiple forms of payment available.
Combining all these together, it seems like all sorts of things are going wrong here.
I have never used GCP (or any of the big three cloud providers), so I don't know how they are in general, but in this specific case there seems to be faulty planning on the user end.
Agreed, that wasn't smart. But, to their defense, this is how these things start out, small enough to be useful, and by the time they get business critical nobody realizes the silly credit card is all that stands between them and unemployment.
Why does it matter whether you're making millions of dollars? If you have any information which you would like to not lose for any reason, back it up in as many formats and locations as is feasible.
> I felt the person I replied to implied that 100s of terabytes of data are too expensive to backup.
Well, you felt wrong. Of course you should back up those 100s of terabytes, in fact that it is that much information is an excellent reason on top of all the other ones to back it up, re-creating it is going to be next to impossible.
It's just that the companies I look at - not all, but definitely some - seem to be under the impression that the cloud (or their cloud provider) can be trusted. Which is wrong for many reasons, not just this article.
I forget where I first saw this quoted, but it's relevant here: "There is no 'cloud', only someone else's computer". That's part of why I store very little data online, compared to most people (or the data I actually have/want). Anything I'm not okay with someone else having on their computer is backed up and stored on hard physical media. No cloud provider can be trusted - the moment the government wants in, they'll get in; and the moment it's considered more profitable for the provider to quietly snoop in your stored data, rest assured that they will.
No problem, it's just that with 'This is one of the reasons I always implore people to have a backup of their data with another provider or at least under a different account.' that passage I thought I had the backup angle more than covered.
What bugs me about it is that there are some companies that give serious pushback, because their cloud providers keep hammering into them how reliable their cloud is, that any backup will surely be less reliable than their cloud solution, and oh, by the way, we also have a backup feature you can use.
They don't realize that even then they still have all their eggs in the one basket: their cloud account.
It's strange, but I completely missed the last part about backups in your comment. I have no idea how I missed it. Had I seen it, my comment would have been redundant and I would never have replied at all.
I only saw that part of the comment much much later.
Well, it definitely wasn't added in a later edit, or at least not that I'm aware of, though I do have a tendency to write my comments out in bits submitted piece-by-piece. Even so, I wouldn't worry about it. I tend to miss whole blocks of text with alarming regularity while reading through stacks of PDFs, and when comparing notes with colleagues we always wonder whether we've been reading the same documents (they have the same problem...). Reading in parallel is our way of trying to ensure we don't miss anything, and unfortunately it is a necessity rather than a luxury.
Often the effects are more subtle, reading what you think something said rather than what it actually said, or missing a negation or some sub-clause that materially alters the meaning of a sentence.
Even in proofreading we find stuff that is so dead obvious it is embarrassing. On the whole visual input for data is rather unreliable, even when reading stuff you wrote yourself, which I find the most surprising bit of all.
Studying this is interesting, and to some extent important to us due to the nature of our business: missing critical info supplied by a party we are looking at could cause real problems, so we have tried to build a process to minimize the incidence of such faults. Even so, I'm 100% sure that with every job we will always miss something, and I live in perpetual fear of that something being important.
This fraud flag is caused by your credit card being found in a leaked list of card numbers somewhere.
They suspect you are a fraudster because you are using a stolen card.
Either sign a proper SLA agreement with Google (which gives you 30 days to pay their bills by any form, and therefore you get 30 days notice before they pull the plug), or have two forms of payment on file. Preferably, don't use your GCP credit card at dodgy online retailers too...
Or you know, Google could have emailed them, told them exactly that and waited for a response before pulling the plug on the servers.
While you make sense from Google's PoV, it doesn't from the customer's PoV. As google is a big corp, it's IMHO better to side with the customer here, as next time it might be you who's getting screwed over by Google/other corp.
I may be missing something, so help me out here... I get the impression that the author was not told the precise reason why the activity was suspicious. Wouldn't a precise error message, if not actually a human interface, been helpful? Why the generic "suspicious activity" warning?
It seemed very Kafkaesque to me, getting tried and convicted without any mention of the crime or charge. I think the author is justified in his disapproval.
I can echo the sentiment here. There have been a few times when they have broken backward compatibility, resulting in a production outage for us without even a new deployment. For example, the BigQuery client library suddenly started breaking because they had rolled out some changes to the API contract the library was calling. When we reached out to support, they took it very lightly, asking why we were even using "the ancient version of the library". OK, fair enough, we upgraded the library to the recommended version, but alas! the Dataflow library started breaking due to this new upgrade. For the next few hours, support just kept playing binary search for a version that was compatible with both BigQuery and Dataflow while production was down.
The worst part is that when we did the post-mortem and asked Google why the support resolution was so slow despite our being "the privileged" customer, their answer was that the P1 SLA was only to respond within 15 minutes; there is no SLA for resolution. Most of the "responses" we were getting were that a new support guy had taken over in a new time zone, which is the most useless information for us.
We are seriously thinking of moving to another cloud vendor.
I wonder how prevalent this behavior is. Mozilla behaves the same way towards browser extensions, which our business depends on. They removed our extension multiple times, each time before asking for something different, be it uncompressed source code, instructions for how to build it, a second privacy policy separate from our site's policy, and more. Each time we would have happily responded to a request promptly, but instead you only find out once you've already been shut down.
Grace periods that respect your business should be a standard that all service providers hold themselves to
It sounds to me like Mozilla identified your extension as potentially malicious and, prioritizing user safety, shut you down first.
As far as I know, Mozilla has no business relationship with extension developers, so I would actually be very concerned if their first action wasn't to cut you off.
I can confirm Mozilla handles this very poorly. I had the exact same experience with them. It was so bad that I actually just left the extension off their store and now focus on Chrome.
There is nothing dodgy about the extension. Mozilla was just being ridiculous.
Browser extensions that say they help with comparison shopping are a very common type of "Potentially Unwanted Application" (PUA - aka malware with a legal team). The infamous Superfish is an example of this type of thing, and there are many others.
I don't know anything about your business or the extension, I'm just pointing out that you're in a space that makes you suspicious by association.
Fair enough. But this has nothing to do with Mozilla's actions. It was as GP said. It includes things like their incompetence in dealing with a build process that produces transpiled/minified code. Even when I gave them all the source and the build instructions (npm run build), they still couldn't comprehend what was going on. Yes, I know it's strange, since Mozilla makes a browser with a JavaScript engine.
Edit: I should add that after 2 weeks of back and forth emails the dude was finally able to build it then blamed me for not mentioning he needed to run "npm run build", even though I did mention it AND it's in package.json AND it's mentioned in the (very short and concise) readme.txt.
So after this exasperating experience he just took down the extension without warning and said it's because it contains Google Analytics.
I would have happily removed Google Analytics from the extension. The dude had my source for 2 weeks and could have told me about that at any time, but decided to tell me after 2 weeks of mucking around, after he had already removed the extension.
It was me that decided it was not worth the hassle to have the extension on their store. I just left it off.
Nah, not that keen personally (I don't even use Chrome). I was just pointing out that it would have been useful to have the URL to reduce confusion. :)
* When you have invoicing set up, the above shouldn't happen. You need to keep a payment method in good standing, but you have something like 10 days to pay your bill. -- They do a bit more vetting (KYC) on the invoicing path, and that effectively keeps you out of this situation.
* Without paying for premium support, there's effectively no support.
I think if someone didn't pay their bill on time, you might shut off their service too, wouldn't you?
"Oh hey, it looks like $customer suddenly started a bunch of coinminers on their account at 10x their usual usage rate. Perfectly fine. Let them rack up a months billing in a weekend; why not?"
A hypothetical but not unheard of scenario in which immediate shutdown might be warranted.
It's a rough world, and different providers have optimised for different threat models. AWS wants to keep customers hooked; GCP wants to prevent abuse; DigitalOcean wants to show it's as capable as anyone else.
If you can afford it, you build resilient multi-cloud infrastructure. If you can't yet do that, at the very least ensure that you have off-site backups of critical data. Cloud providers are not magic; they can fail in bizarre ways that are difficult to remedy. If you value your company, you will ensure that your eggs are replicated to more than one basket and you will test your failover operations regularly. Having every deploy include failing over from one provider to another may or may not fit your comfort level, but it can be done.
> A hypothetical but not unheard of scenario in which immediate shutdown might be warranted.
Not without warning, no. It is possible that the customer intended to start a CPU-intensive process and fully intended to pay for it.
Send a warning first with a specific description of the "suspicious activity" and give the customer a chance to do something about it. Don't just pull the plug with no warning.
There's a degree of complexity that comes with multi-cloud that's ill-suited for most early stage companies. Especially in the age of "serverless" that has folks thinking they don't need people to worry about infrastructure.
My point is that the calculus has more to it than just money. The prudent response, of course, is to do as you described. Have a plan for your provider to go away.
Offsite backups and the necessary config management to bring up similar infra in another region/provider is likely sufficient for most.
> There's a degree of complexity that comes with multi-cloud that's ill-suited for most early stage companies. Especially in the age of "serverless" that has folks thinking they don't need people to worry about infrastructure.
Perhaps we'll start seeing a new crop of post-mortems from the "fail fast" type of startups failing due to cloud over-dependency issues. They're (presumably rare) edge cases, but easily fatal to an early enough startup.
> There's a degree of complexity that comes with multi-cloud that's ill-suited for most early stage companies. Especially in the age of "serverless" that has folks thinking they don't need people to worry about infrastructure.
I just heard a dozen founders sit up and think "Market Opportunity" in glowing letters.
CockroachDB has a strong offering.
But multi-cloud need not be complicated in implementation.
A few ansible scripts and some fancy footwork with static filesystem synchronization and you too can be moving services from place to place with a clear chain of data custody.
Everything I have runs in kubernetes. The only difficulty I have to deal with is figuring out how to deploy a kubernetes cluster in each provider.
From there, I write a single piece of orchestration that will drop my app stack in any cloud provider. I'm using a custom piece of software and event-driven automation to handle the creation and migration of services.
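Not the commenter's actual tooling, but a minimal sketch of the "one orchestration step, any provider" idea, assuming each provider's cluster is already registered as a kubectl context (the context names and the manifests/ directory here are made up):

    import subprocess

    # Hypothetical kubectl contexts, one per provider's managed Kubernetes cluster.
    CONTEXTS = ["gke-prod", "eks-prod", "do-prod"]

    def deploy(manifest_dir: str = "manifests/") -> None:
        """Apply the same app stack to every cluster, regardless of cloud provider."""
        for ctx in CONTEXTS:
            subprocess.run(
                ["kubectl", "--context", ctx, "apply", "-f", manifest_dir],
                check=True,  # stop if any cluster rejects the manifests
            )

    if __name__ == "__main__":
        deploy()

The same idea extends to teardown or migration: because every cluster exposes the same Kubernetes API, the per-provider differences are confined to how the cluster itself gets created.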
Migrating data across providers is hard as kubernetes doesn't have snapshots yet.
There are already a lot of startups in this space doing exactly the kind of thing that I just described. Most aim to provide a CD platform for k8s.
For an early startup, though, I would think it's not necessary to be "fully" multi-cloud.
Rather, it would likely be enough to have a cloud-agnostic infrastructure with replication to a warm (or even mostly-cold to save on cost) standby at the alternate provider with a manual failover mechanism.
Most folks overestimate their need for availability and lack a willingness to accept risk. There are distinct benefits that come with avoiding "HA" setups. Namely simplicity and speed.
> Most folks overestimate their need for availability and lack a willingness to accept risk.
I disagree. More specifically, I think, instead, many [1] folks just don't make that assessment/estimate in the first place.
They just follow what they perceive to be industry best practices. In many ways, this is more about social proof than a cargo cult, even though the results can resemble the latter, such as elsewhere in this thread with a comment complaining they had a "resilient" setup in a single cloud that was shut down by the provider.
> There are distinct benefits that come with avoiding "HA" setups. Namely simplicity and speed.
Indeed, and, perhaps more importantly, being possible at all, given time ("speed") and money ("if you can afford it").
The same could be said of "scalability" setups, which can overlap in functionality (though I would argue that in cases of overlap the dual functionality makes the cost more likely to be worth it).
None of this is to say, though, that "HA" is synonymous with "business continuity". It's much like the conceptual difference between RAID and backups, and even that's not always well understood.
[1] I won't go so far as to say "most" because that would be a made up statistic on my part
Agreed for the most part. Availability for very many is a binary operation. They either do none of it or all of it.
A clever man once said, "you own your availability".
An exercise in BC planning can really pay off. If infra is code, and it and the data are backed up reasonably well, then a good MTTR can obviate the need for a lot of HA complexity.
> Availability for very many is a binary operation. They either do none of it or all of it.
I assume I'm missing some meaning here, particularly since the premise of much of the discussion in the thread is that there can be high availability at one layer, but it can be rendered irrelevant by a SPoF at another (especially when the "layer" is the provider of all of one's infrastructure).
Do you consider that a version of "none"? Or are you pointing out that, despite the middle ground under discussion, the "binary" approach is more common, if not more sensible?
The binary approach is that it either isn't considered or people opt in for all of it without consideration for what is actually needed. The Google SRE book goes into this at length. For each service, they define SLOs and make a considered decision about how to meet them.
Oh, so what you're saying is that they're not considering the notion that there may be a medium-availability (for lack of a better term) solution, which could be perfectly adequate/appropriate?
Yes, there is or they wouldn't turn it off. Companies aren't in the habit of trying not to take your money for services without a pretty damn good reason.
And if it was that critical, it should have had support and an SLA contract, and, you know, backups.
Right. Because big companies never ever do anything unjustified. Particularly when they put automatic processes in place with no humans in the loop, because we all know that computers never make mistakes.
This is fatal. I have a small pilot project on Google Cloud. Considering putting up a much larger system. Not now.
The costs of Google may be comparable or lower than other services, but they don't seem to get that risk is a cost. Risk can be your biggest cost. And they've amplified that risk unnecessarily and shifted it to the customer. Fatal, as I said.
Making a decision purely based upon some posts on HN and the original article isn't a good idea either, as there is little data on how often this happens (and pulling the plug could happen with another IaaS). You need to weigh up your options for risk management based upon how critical your project is and the amount of time/money you have to solve the issues.
You might never see this happen to your GCP account in its lifetime.
This is a hallmark of Google's lack of customer service. They used to use the same filtering algorithm on customer search feeds as on public ones. The system was a grey list of some sort, and the client was worth about $1M a day in ads to them. Nevertheless, once a month it would get blocked, sometimes for over a day before someone read the email complaint and fixed it. We had no phone, chat, or any other access to them. They have no clue how to run a business, nor do they care. Never partner with them.
There's quite a lot of people talking about how this is their own fault, that they should have expected it, that they should have been prepared. Victim blaming, some would say, even.
But even if you assign blame to the OP for not expecting this, it doesn't look good, because the lesson here is "you shouldn't use google and if you do, expect them to fuck you over, for no reason, at any time".
Exactly. The whole point of using AWS, Google Cloud, etc, is that you get to stop thinking about certain classes of problems. An infrastructure provider that is unreliable cancels most of the value of using them for infrastructure.
Worse, they can potentially more than cancel it out, if they merely remove the "worrying about hardware" (yes, and network and load balancers and everything else) aspects, which are, at least, well understood by some us out on the market, and replace it with "worrying about the provider" where a failure scenario is, not only more opaque, but potentially catastrophic, since it's a single vendor with all the infrastructure.
It reminds me of AWS's opacity-as-antidote-to-worry with respect to hardware failures. If the underlying hardware fails, the EC2 instance on it just disappears (I've heard GCP handles this better, and AWS might now, as well). I like to point out that this doesn't differ much from the situation of running physical hardware (while ignoring the hardware monitoring), both from a "worry" burden perspective and from a "downtime from hardware failure" perspective.
Google just doesn't have the talent, skills, or knowledge for dealing with business customers. They don't have competition in adtech and so never learned, but that doesn't work with GCP. They have great technical features but don't realize that's not what matters to a customer who wants their business to run smoothly.
We've gone through several account teams of our own that seem to be eager to help only to turn into radio silence once we actually need something. We have already moved mission-critical services to AWS and Azure, with GCP only running K8S and VMs for better pricing and performance.
GCP has good leadership now but it's clearly taking longer than it should to improve unfortunately.
I generally agree with you, but there is one exception: Google Fi has amazing support. I am surprised GCP wouldn't have similar support, considering the obvious cost differences, though.
This is the problem with being excessively metrics-driven. They have a fraud problem, and there's some low-dollar customer that their algorithm determines has, say, a 20% chance of fraud. They know that 80% of the non-fraudulent people will just upload their ID or whatever immediately, and they shut down all the fraud right away. Their fraud metrics look great, and the 20% of customers that had a problem have low CLV, so who cares? It's not worth the CSR time to sort it out, and anyhow, the CSR could just get socially engineered. The problem is that the 20% are going to talk to other people about their nightmare experiences.
It may not be expensive for Google to lose the business, but it's very expensive for the customer. Google's offering is now non-competitive if you aren't doing things at enterprise scale. Of course, many of Google's best clients will start out as small ones. The metrics won't capture the long-term reputational damage that's being done by these policies.
This exactly describes Google's policy on small fish. By neglecting their concerns, Google gets a lot of negative reputation from small startups who spread the word on forums like this, making its technical innovation largely irrelevant to its future success.
We seem to hear a lot of bad google customer support stories. I guess it really shouldn't be surprising. Amazon grew as a company that put customers first. Google is kind of known for not doing that. They shut down services all the time. They don't really put an emphasis on customer support.
I had the company Visa blocked temporarily for suspicious activity twice in the last 6 years, and no one shut their service off, but I got a lot of warnings. Seems like a really shitty thing for Google to do.
Maybe for critical accounts you need to have a backup Visa on file with Google Cloud in case the first dies for security reasons.
A single Visa is a single point of failure in an otherwise redundant system.
If you use cloud services, a crucial scenario in your disaster recovery planning is "what if a cloud provider suddenly cuts us off?". It's a single point of failure akin to "what if a DC gets demolished by a hurricane?" or "what if a sysadmin gets hit by a bus?". If you don't have a plan for those scenarios, you're playing with fire.
I’ve used AWS support many times and it’s actually really awesome. You can ask them basically anything and they have experts on everything. Really impressive. Yes you pay every month for it but it’s really good.
I have a similar story. I submitted a ticket to increase my GPU quota. Then my account was suspended, because a CSR thought the account was committing fraud. At that point, I had a valid payment method and had been using GCP for a couple of weeks. Only after I prepaid $200 and uploaded a bunch of documents, including credit card pictures and ID pictures, was my account restored.
You heard me right: I prepaid them so that my account could be restored.
This story is evidence of something I have seen from Google, and why I refuse to ever pay them cash for a service again, much less a critical one: Google customer service is bad by design. I have never seen a company as arrogant and opaque as they are.
This happened to me in a project too. Everything went down due to their bogus fraud detection. I had a Kubernetes cluster down for over a day. Very unfortunate as I loved GCP :(
This highlights one of the challenges Google has going into the cloud market - they don’t have a history of serving enterprise customers and the organizational structure and processes to do that well. I think one of the reasons Microsoft has gotten cloud market share so quickly with Azure (in addition to bundling licenses with other products!) is that they have the experience with enterprise customers and the organization to serve them well (regardless of how Azure compares to GCP as a product). Supporting enterprise customers is what they have always done - not so with Google (and Amazon had to learn that as well).
I'm curious though, did you have multiple credit cards on file with your Google billing account? I'm under the impression that this is part of their intended strategy for avoiding service interruption, but I'd like to know if it actually works that way.
(I took this as a reminder to add a second card to my account)
This has happened to me. Google's billing system had a glitch, and all of a sudden, an old bill which was paid years ago became unpaid. Google immediately tore down everything in my account without notice due to non-payment.
If something like this ever happens in AWS, they email you, call you, give you a grace period, and generally, do their best to avoid affecting your infrastructure.
GCP is getting better, but it's not ready for anything other than science fair experiments.
That's pretty bad of Google. I just looked in detail at a company using Google Cloud exclusively for their infra and the application is somewhat similar to what these guys are doing. I'll pass the article on to them. Thanks for posting this.
If maximum uptime is critical to the business, your infrastructure should be cross-provider.
I've been running three providers as peers (DO, Linode, Vultr) as a one-man shop for years, and I sleep better at night knowing that no one intern can fatfinger code that takes me offline.
At that point, wouldn't hosting it yourself on something like VMware vSphere be a simpler option? At least you would have a nice hardware abstraction and a consistent API to build your tooling on.
I hear you. For me, the abstraction is the Linux distro. Build scripts abstract out creating a clean, secure box before installing any custom software, so regardless of the provider, every machine is exactly the same.
Surely! Our business is mainly an API that B-to-B customers consume. The strategy is to create identical "pods" in different cities across different providers on identical distros. Between Vultr, Linode, and DO, that's 20+ cities you could place a pod in the States alone. Each pod has a proxy up front, a database slave, and a pair of app server and cache machines.
Ignoring tweaks for international customers, each proxy is in an A record round-robin with health checks via Route 53. US-based requests get forwarded to one of the pods, and the proxy either handles the request with local servers or points to servers in another pod if any of its local servers are down. If any pod has a power outage or goes down for any reason, Route 53 automatically pulls the entire pod out of the rotation. If an entire provider goes dark, all of that provider's pods get pulled out of the rotation, but all the others keep running.
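For the curious, a rough sketch of the Route 53 side of that pattern (not the commenter's actual setup; the hosted zone ID, IP, and hostnames are placeholders), assuming boto3: each pod's front proxy gets a health check and a weighted A record, so an unhealthy pod simply drops out of DNS rotation.

    import boto3

    route53 = boto3.client("route53")

    # Health check against one pod's front proxy (placeholder IP and path).
    hc = route53.create_health_check(
        CallerReference="pod-chicago-1",
        HealthCheckConfig={
            "IPAddress": "203.0.113.10",
            "Port": 443,
            "Type": "HTTPS",
            "ResourcePath": "/healthz",
            "RequestInterval": 30,
            "FailureThreshold": 3,
        },
    )

    # Weighted A record: equal weights across pods approximate round-robin,
    # and Route 53 withholds any record whose health check is failing.
    route53.change_resource_record_sets(
        HostedZoneId="ZEXAMPLE12345",
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "api.example.com",
                "Type": "A",
                "SetIdentifier": "pod-chicago-1",
                "Weight": 10,
                "TTL": 60,
                "ResourceRecords": [{"Value": "203.0.113.10"}],
                "HealthCheckId": hc["HealthCheck"]["Id"],
            },
        }]},
    )

Repeat per pod; when a pod's proxy stops answering /healthz, its record stops being returned until it recovers.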
This is very cool. Where can I learn more about stuff like this, and what are the prerequisites for learning something like this? I have a BSc in CS and understand OSes and programming languages pretty well.
Very true! I have a script that automatically commits any DNS changes to source control, so if Route 53 bites it I could quickly move somewhere else, but you're right, on the off chance my registrar decides to vanish, there'll be some panicking.
You really can't compare switching between generic VPS providers with switching between the big cloud providers that offer lots of additional useful services.
It is a very different model, but my take is that you shouldn't be building anything on one provider that you couldn't easily move over to another. Provider lock-in is scary.
Google's customer service is ridiculous.
They once suddenly emailed me that my merchant earnings (accumulated over about three years) would be escheated to a national government unless I provided valid payment information within one month. However, for various reasons, it would take me more than one month to obtain valid payment information.
Then, after 30 days, they really did escheat my earnings to some national government. However, when I asked them for the escheatment ID so that I could contact that government to get my money back, they said they didn't have the ID! They escheated my money without keeping any record, which is almost the same as throwing my money into the ocean.
I think this is another instance of GCP just being really, really terrible at interacting with customers. I'm biased a bit (in favor of GCP), I suppose in that I have a fair bit of infra in GCP and I really like it. I'll share a couple anecdotes.
Last year my team migrated most of our infrastructure out of AWS into GCP. We'd been running k8s with kops in AWS and really liked using GKE. We also developed a bizdev arrangement.
As I was scaling up my spend in GCP, I began purchasing committed use discounts, roughly similar to reserved instances in AWS. I'd already made one purchase for around $5k a month; these are tied to a specific region and can't be moved. I went to purchase a second $5k block and typo'd the request, ending up with $5k worth of infra in us-central rather than us-west. The purchase doesn't go into effect until the following day and showed as "pending" in the console. No big deal, I thought, I'll just contact support and they'll fix it right away, I'm sure. I had this preconceived notion based on my experiences with AWS.
I open a support request and about an hour later I get a response that basically tells me that once I've clicked the thing, there's no undoing it and have a nice day.
I've literally just erroneously spent $5k for infra in us-central that I can't use and their response was basically, "tough". $5k is a sufficiently large loss that I'd be inclined to burn a few hours of my legal team's time dealing with this issue, something I shared with the support person. After much hemming and hawing over the course of a few days, they eventually fixed the issue.
More recently, I've been dealing with an issue that is apparently called a "stockout".
Unlike AWS, GCP does not randomize its AZ names for each customer. This means that for any given region, the "A" availability zone is where most of the new infrastructure is going to land by default. Some time in May, we started seeing resource requests for persistent SSD failing in us-west1-a. Our assumption was that it would clear up within an hour or two, but it persisted. After about a day of this, we opened a support case asking what was going on and explaining the need for some kind of visibility or metrics for this kind of issue. The response we received was that this issue was "exceedingly rare", which was why there was no visibility, and that it would be rectified shortly, but we couldn't be given any specific timeline.
I followed up with my account rep, and he escalated to a "customer engineer" who read the support engineer's notes and elaborated on how "rare" this event was and how unlikely it was to recur. Again, I contacted my account rep and explained my unhappiness with the qualitative responses from the "engineers" and that I needed quite a bit more information on which to act. He was sympathetic throughout this whole process, escalating inside the organization as needed, and shared with me some fairly candid information that I couldn't get from anyone else.
Apparently, the issue is called "stockout" and us-west1-a had a long history of them. The issues had been getting progressively worse from the beginning of the year and at this time, this AZ was the most notorious of any in any region. Basically, the support engineer either patently lied or was just making stuff up. Also, I shared with my GCP rep how AWS load-balances across AZs. He promised to pass that along.
The moral of the story is that if you want to be in us-west, then maybe try us-west1-c. Also, GCP is a relatively young arm of a well-established company that has a terrible reputation for communicating with customers. They'll eventually figure it out; it will just take some time.
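Since zone choice is the only workaround on offer here, a hedged sketch of what that can look like in tooling (the disk name, size, and zone ordering are illustrative only; this mitigates stockouts, it doesn't fix them): try the request against a prioritized list of zones and fall back when one is out of capacity.

    import subprocess

    # Hypothetical fallback order: the less "default" zones first, to dodge
    # stockouts in the zone where everyone lands by default.
    ZONES = ["us-west1-c", "us-west1-b", "us-west1-a"]

    def create_ssd(name: str, size_gb: int = 200) -> str:
        """Try to create a persistent SSD, falling back across zones on failure."""
        for zone in ZONES:
            result = subprocess.run(
                ["gcloud", "compute", "disks", "create", name,
                 f"--zone={zone}", f"--size={size_gb}GB", "--type=pd-ssd"],
                capture_output=True, text=True,
            )
            if result.returncode == 0:
                return zone  # created successfully in this zone
            print(f"{zone} failed (possible stockout): {result.stderr.strip()}")
        raise RuntimeError("all zones exhausted")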
Am I alone in thinking critical infrastructure monitoring like this should be run on the metal in your own data center? Sure, offload some data for processing and reporting to cloud providers. But I'm slightly worried that electrical grid technology is using something it cannot control, and freaking Uptime Robot (I use it also) instead of proper IT in a controlled facility.
Usually there are a bunch of comments about power plants being connected to the internet. I doubt the connections from the control rooms or cloud back to the machines are read-only unless they have protocol-level filters to remove write commands from the WAN to plant networks.
That's just the way it is, unless it's a nuclear plant, probably.
This is a rookie mistake - creating and hosting a mission-critical part of your business (not just your backend infrastructure, your entire business) with a vendor you have no real, vetted contract in place with?
Total clown show.
Take this as a lesson in risk management and don't fuck it up again.
I lost my Gmail account (locked due to phone verification; I had lost my SIM card on holiday). All my photos from the vacation and important emails are gone (I had done a backup the day before). That was about 3 years ago, and I'm still waiting for an explanation from Google: 20 emails sent, no response.
Google once shut down my employer's app store account for 3 days for some routine review. That didn't just mean we didn't get any money (that, arguably, could be tolerated). It meant we were simply off the store, because someone needed to see a TPS report.
No doubt this is horrible practice and customer service on Google's side. But as for the hypothetical question ("what if the card holder is unreachable, we lose years of work", etc.): if you are not professional enough to back up your system regularly, you deserve to lose everything, and it's your fault to begin with. For real, Google is not the only one who should step up their game. Just because it's in the cloud now doesn't mean you should ignore decades-old best practices. Servers die, hard drives die, people with passwords die; handling stuff like that is part of your job.
First, we sincerely apologize for the inconvenience caused by this issue. Protecting our customers and systems is top priority. This incident is a good example where we didn’t do a good job. We know we must do better. And to answer OP’s final message: GCP Team is listening.
Our team has been in touch with OP over what happened and will continue digging into the issue. We will be doing a full review in the coming days and make improvements not only to detection but to communications for when these incidents do occur.
I think it starts with ensuring that you have staff that actually review an action before it takes place. Relying on automation can be catastrophic for someone like OP.
Horrible. This reminds me of traveling in third-world countries where sometimes the electricity just dropped out. Ha, I never would have thought that the one and only Google would occasionally become like this...
We are currently conducting a study on behalf of the European Commission in order to better understand the nature, scope, and scale of contract-related problems encountered by Small and Medium-Sized Enterprises in the EU when using cloud computing services.
The purpose is to identify the problems that SMEs in the EU are encountering in order to reflect on possible ways to address them. To assist the European Commission with this task you are kindly invited to contact: eu.study.cloudcomputing@ro.ey.com
This sounds interesting! I have founded an SME and I have experienced issues w/ cloud computing, mainly data storage and problems with the contracts from providers. I will write you an email! Thanks!
Whether AWS or Google Cloud, if you’re running a real business with real downtime costs, you need to pay for enterprise support. You’ll get much faster response times so that you can actually meet your SLA targets, and you’ll get a wealth of information and resources from live human beings that can help even into the design phases.
Feel free to budget that into your choices about where to host, but getting into any arrangement where you rely on whatever free tier of support exists is nonsensical once you're making any kind of money.
Totally agree. There are so many missing redundancies in this project. The CTO / Dir. Eng should have their own card. They should have some contact with a Google account rep, and at least the basic support package.
We're not a big Google customer, a couple thousand a month, but when we migrated we immediately reached out to account reps and have regular quarterly check-ins.
This has nothing to do with enterprise support, which is terrible, and everything to do with the fact that Google has automated as much decision making as possible and is terrible at interacting with customers.
I'm not sure if it's the culture of secrecy or sheer cluelessness, but it's pretty bad.
I'm not trying to slam them either, I still really like their products and use them.
Paying for Enterprise Support simply so that they don’t fuck you over sounds a lot like hidden costs to me. The service should be enterprise-grade even if you don’t.
If you run an enterprise, pay for enterprise support. If you aren’t (r&d accounts, startups that aren’t monetized yet, etc.) then don’t pay for it. Flexibility is the model here, and it’s utilized by the big players, so being naive to the business model will burn you on many public clouds.
I wouldn't trust Google with anything anymore. Their customer support (enterprise or not) has always sucked. They're always "right". You can only suffer the damage silently unless you're worth millions to them.
People should come to realize that, for years now, Google has NOT been your friend.
The weird thing here is that inside google, support is great. Meaning that when people are building products for other google engineers, they go way above and beyond what is needed to help each other out. Somehow, they are not able to transition that to external customers.
What really underlines this blog and thread is that as of the moment of writing this comment, there's no official answer from Google, or even a non official one from an employee. The feeling I get as a customer is that they just don't care.
I did make a post below actually a few hours back and will copy and paste it below for reference. Rest assured that we are working on this one and are in contact directly with the customer. We hope to have more of an official response soon.
from below:
I work here in Google Cloud Platform Support.
First, we sincerely apologize for the inconvenience caused by this issue. Protecting our customers and systems is top priority. This incident is a good example where we didn’t do a good job. We know we must do better. And to answer OP’s final message: GCP Team is listening.
Our team has been in touch with OP over what happened and will continue digging into the issue. We will be doing a full review in the coming days and make improvements not only to detection but to communications for when these incidents do occur.
Sad to hear about such events. Can anyone comment on what share of such events (%) would reach a tipping point and move people to avoid SaaS altogether? Has anyone experienced such tipping points that led to an overall change in trend, especially in other industries?
On top of "this sucks": such a "fully automated" response is a violation of the GDPR, which requires the "right to obtain human intervention on the part of the controller".
Stunning that Google cloud can get away with this.
The first requirement of any production system is redundancy. It sounds like, on top of the layers for CPU, storage, network, and application, the new requirement is redundant cloud platform providers.
Does Google allow you to preauthorize your purchasing card? It seems like part of this is that they all of a sudden get suspicious about your billing information.
I experienced exactly the same thing earlier this month. Due to "suspicious activity" on YouTube, Google suspended not only my YouTube account, but also Gmail, Google Chat, and ALL other services they provide, including their Cloud services.
They provide no explanation as to why the accounts are closed, and provide only a single form to appeal the situation.
In my case they refused to release the locks on their services so all information, all contacts, all files, all histories, all YouTube data, and everything else they store is now effectively lost with no means of getting the data back.
This was done without warning, and the account was locked due to "suspicious activity on YouTube".
Through experimentation with a new account, I found that the "suspicious activity" they were referring to was criticism of the Trump administration's policy of kidnapping children from their parents.
Posting such criticism to the threads that follow stories by MSNBC and other news sources triggered Google to block YouTube and all other services they provide and to do so without warning or any explanation.
Sorry, but this happened because Google and AWS lack the vision to operate as true enterprise companies. I have seen this situation on Azure many times, and in the end they try to track down the commercial people behind the customer. It is a lack of enterprise vision.
I'd say that really depends on what you need. If you want a full platform with a lot of integrated services, there's really only GCP, AWS, Azure, Tencent, Alibaba, and maybe something else I forgot. So I wouldn't say there are a ton of better options available. Sure, if you only need a bunch of VMs you have a lot of options.
I had an AliCloud experience similar to OP's, but they gave me a 24-hour deadline.
"We have temporarily closed your Alibaba Cloud account due to suspicious activity. Please provide the following information within [24 hours] by email to compliance_support@aliyun.com in order to reopen your account:
...
If you fail to provide this information your account will be permanently closed, and we may take other appropriate and lawful measurers. Best regards, Alibaba Cloud Customer Service Center
"
I provided the documents in ~30 hours because that's when I saw the email. There was no further communication from Alibaba. I assumed everything was OK, but two weeks later my account was terminated.
Wow as a CTO this is a nightmare scenario. Is this common? I guess this means google.com does not use Google Cloud because I’m sure they have uptime targets. They cannot handle incidents like this and expect people to take them seriously as a cloud provider.
I think the reference is to this book, "Site Reliability Engineering" [1]. As you will see on the main page [2], it is part of Google's effort to describe "How Google Runs Production Systems", which is basically what the parent comment was asking about.
You should have been able to deploy your git repository on another system pretty quickly, as well as have your own backups of your database.
The most time-consuming thing should be setting up the environment variables.
Let me see, what else would be tricky: if you are using Google Analytics, that data might be gone, but your other metrics package should have had many snapshots of that data too.
From a Google shareholder’s perspective, this approach is unacceptable:
The only way to use their product safely is to engineer your entire business so that cloud providers are completely interchangeable.
Forcing the entire industry to pay the cost of transparently switching upfront completely commoditizes cloud providers, which means they’ll no longer be able to charge a sustainable markup for their offerings.
This is fiscally negligent. Upper management should be fired.
However, it's great for the rest of the industry: Google nukes a few random startups from orbit, some VCs take a bath, early and mid-range adopters bleed money engineering open source workarounds, and everyone else's cloud costs drop to the marginal costs of electricity and silicon.
Not OP, but according to his business case, being down for a few days could bankrupt the company. Re-deploying from git doesn't solve the use-case of your public cloud provider pulling the plug on your machines.
"Why you should not use Google Cloud" if you're a small business. A large business will have contacts, either on their own or through a consulting firm, that can call Google employees and get help. As a small company, you're at the mercy of what Google thinks is adequate support for the masses.