Hacker News

I am currently evaluating GCP for two separate projects. I want to see if I understand this correctly:

1) For three whole days, it was questionable whether or not a user would be able to launch a node pool (according to the official blog statement). It was also questionable whether a user would be able to launch a simple compute instance (according to statements here on HN).

2) This issue was global in scope, affecting all of Google's regions. Therefore, in consideration of item 1 above, it was questionable/unpredictable whether or not a user could launch a node pool or even a simple node anywhere in GCP at all.

3) The sum total of information about this incident can be found as a few one or two sentence blurbs on Google's blog. No explanation nor outline of scope for affected regions and services has been provided.

4) Some users here are reporting that other GCP services not mentioned by Google's blog are experiencing problems.

5) Some users here are reporting that they have received no response from GCP support, even over a time span of 40+ hours since the support request was submitted.

6) Google says they'll provide some information when the next business day rolls around, roughly 4 days after the start of the problem.

I really do want to make sure I'm understanding this situation. Please do correct me if I got something wrong in this summary.




When everything works, GCP is the best. Stable, fast, simple, reliable.

When things stop working, GCP is the worst. Slow communications and they require way too much work before escalating issues or attempting to find a solution.

They already have the tools and access so most issues should take minutes for them to gather diagnostics, but instead they keep sending tickets back for "more info", inevitably followed by a hand-off to another team in a different time zone. We have spent days trying to convince them there was an issue before, which just seems unacceptable.

I can understand support costs but there should be a test (with all vendors) where I can officially certify that I know what I'm talking about and don't need to go through the "prove it's actually a problem" phase every time.


As someone who works for Government and Enterprise - all I care about sometimes is how a company behaves when everything goes wrong.

The issue with outages for the Government organizations I have dealt with is rarely the outage itself, but rather the lack of strong communication about what is occurring, realistic approximate ETAs, or options for mitigation.

Being able to tell the Directors/Senior managers that issues have been "escalated" and providing regular updates are critical.

If all I could say was a "support ticket" was logged, and we are waiting on a reply (hours later) - I guarantee the conversation after the outage is going to be about moving to another solution provider with strong SLAs.


Very similar thing at our office. Considering the scale at which we run things, any outage could be a potential loss of millions _every minute_.

Sure, we use support tickets with vendors for small things. Console button bugging out, etc. But for large incidents, every vendor has a representative within an hour driving distance and will be called into a room with our engineers to fix the problem. This kind of outage, with zero communication, means the dropping of a contract.

Communication is critical for trust, especially if we're running a business off it.


Going single cloud on that scale is simply irresponsible though.

You need failovers to different providers, and hopefully you also have your own hardware for general workloads.

And suddenly the CEO doesn't care anymore if one of your potential failovers is behaving flaky in specific circumstances.

Not saying it's good as it is. Communication as a SaaS provider is, as you said, one of the most important things. But this specific issue was not as bad as some people insinuate in this thread.


Agree, if we are really talking about millions per minute (woah), then you can afford to failover to AWS.


As a government or large enterprise, you should get a support contract with the provider and have a dedicated support to contact.

Don't get it wrong. AWS is the exact same thing as Google. All you will do is log a ticket and receive an automated ack by the next day.


You are incorrect about AWS. If you pay for business support, and something is happening to your production environment, they are on a call with you in less than an hour.


How could I be incorrect when that's exactly what I said? You gotta pay for a support contract to have any meaningful support.


You also said that all you would get was an "automated ack". This seems to not be the case if aws provides an on-call support engineer.


I think the point is that that only happens if you have a contract. With GCP you can also get an oncall support engineer if you're large enough.


Use AWS and government "region".


"Support costs" calculation often doesn't include the costs of not having support.

When I worked at GoDaddy, around 2/3 of the company was customer support.

At the current company I'm at, a cryptocurrency exchange, our support agents frequently hear that customers prefer our service over others because of our fast support response times (crypto exchanges are notorious for really poor support).

All of my interactions with Amazon support have been resolved to my satisfaction within 10 minutes or less.

Companies really ought to do the math on the value that comes from providing fast, timely, and easy (don't have to fight with them) customer support.

Google hasn't learned this lesson.


> Google hasn't learned this lesson.

They have though; they've just drawn the conclusion that they'd rather put massive amounts of effort into building services that users can use without needing support. This approach works well once the problems have been ironed out, but it's horrible until that's the case. Google's mature products like Ads, Docs, GMail, etc are amazing. Their new products ... aren't.


There's a big difference between SaaS applications and compute infrastructure for your business.

Google Ads and such also have a terrible support reputation, even with clients spending 8 figures.


> Google's mature products like Ads, Docs, GMail, etc are amazing.

Until something goes wrong and the only recourse is to post an angry Hacker News thread or call up people you personally know at Google to get it fixed. For example https://techcrunch.com/2017/12/22/that-time-i-got-locked-out....


I've seen Google projects where the project lead explained (actually responded) that they don't want to provide support, end of story. Google puts folks in charge but does not give them enough in the way of accountability objectives.


With Dell you can certify with them so you can get replacement parts and such without the BS back and forth with some guy in India. Saves everyone time and money.


I did this many years ago and it was great.

We actually got to a point where we had a couple of spare parts onsite (sticks of RAM, HD, etc) so we could repair immediately and then request the replacement. This was on a large HPC cluster so we had almost daily failures of some kind (most commonly we'd get a stick of RAM that would fail ECC checks repeatedly).


To say "when it works it's stable and reliable" implies that it is neither...


60% of the time, it works every time...


> instead they keep sending tickets back for "more info"

Isn't that the case with basically every support request, no matter the company or severity? The first couple of emails from 1st & even 2nd level support are mostly about answering the same questions about the environment over and over again. We've had this ping-pong situation with production outages (which we eventually analysed and worked around by ourselves) and fairly small issues like requesting more information about an undocumented behavior which didn't even affect us much. No matter how important or urgent the initial issue was, eventually most requests end up being closed unresolved.


I've definitely had interactions with smaller companies where you can effectively bypass first and second line by demonstrating you know what you're doing, mostly just saying the right things for them to accept that you've done basic troubleshooting steps already and really do need to talk to someone beyond that point.


Yes, same experience here, support at smaller companies can be more dedicated when talking to "knowledgeable" customers. It's generally easier to get to their 3rd level, sometimes just because there is no 1st or 2nd level at all. But at "big" enterprises - not so much.


> I've definitely had interactions with smaller companies where you can effectively bypass first and second line by demonstrating you know what you're doing, mostly just saying the right things for them to accept that you've done basic troubleshooting steps already and really do need to talk to someone beyond that point.

"Shibboleet" https://www.xkcd.com/806/


Smaller companies or personalized support structures (like named engineers) are very different. You can build up a relationship and usually bypass many questions to get to the main issue, and even get it resolved before you could even open a case at larger organizations.

GCP does have role-based support models with a flat-rate plan, which is really great, but the overall quality of the responses leaves much to be desired.


Heh, your "test" reminds me of an old Hanselman article:

https://www.hanselman.com/blog/FizzBinTheTechnicalSupportSec...


We had an issue a few weeks back where all nodes in west1-a could not pull Docker images. Google support was pinballing the P1 issue around the globe and across multiple teams for a few days until I root-caused it for them: it turned out to be GCE service account issues affecting the entire zone. 2 days to roll back (no status page update). I know nobody gives a fuck but can’t help but feel vindicated as an ex-Google SRE.


I think a lot of people give a fuck here; I do, at least. Thanks for outlining it, these things are fascinating (to me anyway, who has never worked in IT/ops).


We have been GCP customers for the last couple of years. We use other cloud platforms (AWS, IBM, Oracle, OrionVM) too. We don't use GKE but use a Rancher/Kubernetes combo on their standard platform.

So far GCP is the best, hands down, in terms of stability. We have never had a single outage or maintenance downtime notification until now. We are power users, but our monitoring didn't pick up any anomaly, so I don't think this issue had rampant impact on other services.

But I find it concerning that they provided very little update on what went wrong. I also think it's better to expect nil support out of any big cloud provider if you don't have paid support. Funny how all these big cloud providers think you are not eligible for support by default. Sigh.


I agree with this. Compared to AWS, when Google says it's down, it's down, and that's rare. When they say it's up, it's up.


I use the AWS free tier and get customer support through email, but that's not the case with GCP. Do they provide free email support?

If you are an early stage startup, can you afford their $200/month support when your entire GCP bill is under $1? However, that doesn't mean you don't have to support them.


If 200/month is an issue, then you aren’t an “early stage startup”. You’re running a hobby project.


You know there's 1-man "start ups"/companies out there that serve more users than 99.5% of VC startups ever will, which earn their owner a very livable wage, but who still can't afford a 500 bucks business support plan on all of the 20 services they use.

If you've got VC money to blow so you can pretend your SaaS toy can feed 500 people while having money left to throw at things, that's cool. Just remember that other people might be running sustainable businesses.


> You know there's 1-man "start ups"...can't afford a 500 bucks business support plan on all of the 20 services they use.

And just like that you turned a $200/month bill into a $10k/month strawman.

> Just remember that other people might be running sustainable businesses.

Why are you pretending that a startup that can't afford $200/month is a "sustainable businesses"?


There are literally millions of small 1-2 person businesses that make less than 10k in profits.

I mean sure, they could go and probably afford to waste $200 extra on something random that will be useless to them most of the time, but that money is going straight out of their paycheck.

You don't remain profitable though by repeatedly making bad decisions like that. Which was my point.

Running a (small) profitable business is about making the right decisions consistently, and if you're likely to waste money on one thing, you're also likely to waste it on the 19 other similar things.

Maybe speak to literally anyone you know who is running a small business if you want to know more. Yes, that includes the small local stores on your street.

At the end of the day you probably pissed off quite a few people on here when you called their livelihood a hobby project.


The scenario being discussed was building a company on top of GCP and being unable to afford $200/month for support costs. Tell me what “mom and pop” shops are pulling in $10k/month from GCP-based software but are unwilling/unable to pay $200/month for the very thing that underpins their livelihoods?

This is akin to saying that a mom and pop laundromat can’t afford insurance, or shouldn’t because they won’t frequently need it.

You’re trying to equate small businesses with hobbies. You’ve now resorted to straw men, slippery slopes, and false equivalency. Maybe consider that if you have to distort the situation this much to make your point, you might just be wrong.

> At the end of the day you probably pissed off quite a few people on here when you called their livelihood a hobby project.

I didn’t say anything about anyone’s livelihood. You’re the one pretending that small businesses bringing home $120k/year can’t afford a $200 monthly support bill.

I bet the guy who started this thread about GCP’s support cost has made a sum total of <$1000 from his “startup”. Likely <$10. Hobby.

I don’t care if “quite a few people” got pissed about my comment. People with egos that delicate shouldn’t use social media.


> The scenario being discussed was building a company on top of GCP and being unable to afford $200/month for support costs. Tell me what “mom and pop” shops are pulling in $10k/month from GCP-based software but are unwilling/unable to pay $200/month for the very thing that underpins their livelihoods?

I was trying to tell you that most small businesses can't go around spending hundreds of bucks on things that provide little value, whether that's a business support plan on services they use or something else. It's true regardless of whether you're a brick and mortar store or some online service.

> This is akin to saying that a mom and pop laundromat can’t afford insurance, or shouldn’t because they won’t frequently need it.

Speaking about false equivalencies...

> You’re the one pretending that small businesses bringing home $120k/year can’t afford a $200 monthly support bill.

First off, I spoke of businesses making generally less than that.

Also (I already said this, good job ignoring that!) paying $200 bucks on a single useless thing is survivable for even a small business - but you know what's better than only making one bad business decision? Making no bad ones at all. Making too many will quickly break the camel's back.

Which was my whole argument and it's also what people generally refer to when they say they can't afford something.

For instance you may say "I can't afford to go to this restaurant", even though you'd have enough money to do it without going immediately bankrupt. But it'd be a bad decision, too many of which quickly add up.


> I was trying to tell you that most small businesses can't go around spending hundreds of bucks on things that provide little value

And I'm telling you that if you built your business on top of GCP, a support contract is probably not "low value". You'd happily pay $200 for support on your critical infrastructure, just as you'd happily pay $200 for a repairman to fix your washing machine if you owned a laundromat.

If you don't need support, then sure, don't pay for the plan. If you do need support, $200 seems pretty reasonable.

> Speaking about false equivalencies...

Signing up for a monthly recurring support plan in case you need it is literally insuring your business.

> For instance you may say "I can't afford to go to this restaurant", even though you'd have enough money to do it without going immediately bankrupt. But it'd be a bad decision, too many of which quickly add up.

A support plan for your critical infrastructure probably isn't "useless". Which is the point. If your need for support is that low, then either you've built your own redundant systems to protect you or more likely you aren't running a real business.


If a comment reflects the median, it doesn’t address outliers.


"Startups" in the YC sense _are_ the outliers.


Okay. Let's say it's a hobby project, but do you understand that today's hobby projects are tomorrow's mature startups?


No, I understand that most hobby projects are just hobbies and Google/Amazon/etc are under no obligation to provide support for hobbies that are literally a net cost to them.

I'm glad AWS's free tier is working for you, but complaining that Google doesn't want to give you free capacity for your business and then also provide you free support for that business is pretty absurd.


Yes, they provide it through the public issue tracker. We have used it with success.


Thanks. I am aware of that issue tracker. It's not an actual support portal (or at least not the support I am expecting).


I'm curious, how much support DO you expect when using the free tier only?


I don't understand why someone would choose to deploy anything mission critical without having a support contract with the ISP, the manufacturer of the software, etc.


Simple, the cost of an outage is less than the cost of a support contract. Very few things are really mission critical as in they can never go down. Rather they simply have a cost to going down and you can choose to pay that one way or another.
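To put rough numbers on that trade-off, here is a minimal sketch. Every figure except the $200/month plan price mentioned elsewhere in this thread is made up for illustration, and the model generously assumes a support contract would prevent the entire outage cost, which overstates its benefit.

```python
def expected_annual_outage_cost(outages_per_year: float,
                                hours_per_outage: float,
                                cost_per_hour: float) -> float:
    """Expected yearly loss from outages (simple product of the three inputs)."""
    return outages_per_year * hours_per_outage * cost_per_hour


def support_is_worth_it(annual_support_cost: float,
                        annual_outage_cost: float) -> bool:
    """True when the avoided outage cost exceeds the price of the contract."""
    return annual_outage_cost > annual_support_cost


# $200/month plan as discussed in the thread; outage figures are hypothetical.
support = 200 * 12
small_shop = expected_annual_outage_cost(2, 3, 50)     # low hourly loss
large_shop = expected_annual_outage_cost(2, 3, 5000)   # high hourly loss

print(support_is_worth_it(support, small_shop))
print(support_is_worth_it(support, large_shop))
```

With these made-up inputs the contract does not pay for itself for the small shop but does for the large one, which is the "choose how you pay" point in code form.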


And it's not like having a support contract precludes you from downtime.


I transitioned from colocation to a self-managed remote server farm and then on to self-managed remote VMs. All these providers provided support by default whether we opted for it or not. You can go to their portal and raise a ticket.

I am not saying it's feasible with vast numbers of customers, but big cloud providers don't even give you the opportunity to raise a ticket if it's their fault. There is an extra price you pay when you opt for any one of them, but many don't realize it. Having said that, almost all the time our own expertise is better than their first two levels of support staff. We realized this early, so we handle it by going over the documentation and making our code resilient, since all cloud platforms have some limit or another and overselling in a region is something they can't avoid. Handling these exceptions and going across multiple regions is the only way through.


Or why choose an error prone technology?


You're doing me a scare. I'm in the evaluation phase with them. Maybe I'm missing something here, but this is not at all what the linked post says.

"We are investigating an issue with Google Kubernetes Engine node pool creation through Cloud Console UI."

So, it's a UI console issue, it appears you can still manage

"Affected customers can use gcloud command [1] in order to create new Node Pools. [1]"

Similarly, it actually was resolved on Friday, but they forgot to mark it as such.

"The issue with Google Kubernetes Engine Node Pool creation through the Cloud Console UI had been resolved as of Friday, 2018-11-09 14:30 US/Pacific."
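For anyone who wants to script the gcloud workaround quoted above, a minimal sketch follows. It only builds the argument list and does not run anything. The pool, cluster, and zone names are placeholders rather than values from the incident report, and the flags shown are the ones I believe `gcloud container node-pools create` accepts, so check them against the command's own help output before relying on this.

```python
import shlex
from typing import List


def node_pool_create_command(pool: str, cluster: str, zone: str,
                             num_nodes: int = 3) -> List[str]:
    """Build (but do not run) a gcloud invocation that creates a GKE node pool."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return [
        "gcloud", "container", "node-pools", "create", pool,
        f"--cluster={cluster}",
        f"--zone={zone}",
        f"--num-nodes={num_nodes}",
    ]


# Placeholder names, chosen only for illustration.
cmd = node_pool_create_command("pool-2", "my-cluster", "us-central1-a")
print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` on a machine where gcloud is installed and authenticated.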


You are right about the Google blog content itself not indicating three days of outage. Turns out they just forgot to mark that particular issue as resolved on Friday, as you point out. This is my mistake. I would update my comment to reflect this, but it doesn't seem to allow an edit at this point.

The items I put down in my comment are based largely on user reports, though (there isn't much else to go on). And I mean these items as questions (i.e. "is this accurate?"). Folks here on HN have definitely been reporting ongoing problems and seem to be suggesting that they are not resolved and are actually larger in scope than the Google blog post addressed.

Someone from Google commented here a few hours ago indicating Google was looking into it. And other folks here are reporting that they don't have the same problems. So it's kind of an open question what's going on.

I'm in the evaluation phase too. And I've found a lot to like about GCP. I'm hoping the problems are understandable.


I've been failing all weekend to create nodes in a GKE cluster through either the UI console or gcloud. Even right now I can't get any nodes to spin up.

Edit: I finally got my cluster up and running by removing all nodes, letting it process for a few minutes, then adding new nodes.


I can't comment regarding GKE as we don't use that particular service, however we are very heavy users of many other GCP services, including Compute, Datastore, BigQuery, Pub/Sub, Storage, Functions, Speech, and others. Zero issues this weekend, everything is running 100% as any normal day.


We've had no issues deleting and creating node pools this weekend (on asia-east1-a). No other problems noticed either.


As of this morning, I am still unable to reliably start my docker+machine autoscaling instances. In all cases the error is "Error: The zone <my project> does not have enough resources available to fulfill the request". An instance in us-central1-a has refused to start since last Thursday or Friday.

I created a new instance in us-west2-c, which worked briefly but began to fail midday Friday, and kept failing through the weekend.

On Saturday I created yet another clone in northamerica-northeast1-b. That worked Saturday and Sunday, but this morning, it is failing to start. Fortunately my us-west2-c instance has begun to work again, but I'm having doubts about continuing to use GCE as we scale up.

And yet, the status page says all services are available.
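The manual zone-hopping described above could be automated along these lines. This is a rough sketch only: `ZoneExhausted` and `fake_launch` are hypothetical stand-ins, not part of any Google client library, and a real launcher would call the provider's API and translate its "not enough resources" error into the exception.

```python
from typing import Callable, Iterable, List, Tuple


class ZoneExhausted(Exception):
    """Hypothetical error standing in for a 'not enough resources' response."""


def launch_with_fallback(zones: Iterable[str],
                         launch: Callable[[str], str]) -> Tuple[str, List[str]]:
    """Try each zone in order; return (instance_id, zones_that_failed)."""
    failed: List[str] = []
    for zone in zones:
        try:
            return launch(zone), failed
        except ZoneExhausted:
            failed.append(zone)  # remember the zone and move on to the next
    raise ZoneExhausted(f"no capacity in any of: {', '.join(failed)}")


# Fake launcher for demonstration: only one zone has capacity.
def fake_launch(zone: str) -> str:
    if zone != "northamerica-northeast1-b":
        raise ZoneExhausted(zone)
    return f"instance-in-{zone}"


instance, skipped = launch_with_fallback(
    ["us-central1-a", "us-west2-c", "northamerica-northeast1-b"], fake_launch)
print(instance, skipped)
```

Keeping the list of skipped zones makes it easy to log which zones were out of capacity, which is exactly the information missing from the status page.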


If you run your own k8s on GCP, you are not going to be affected by GKE.


> For three whole days, it was questionable whether or not a user would be able to launch a node pool (according to the official blog statement)

What blog statement are you referring to? I don't see any such statement. Can you provide a link?

The OP incident status issue says "We are investigating an issue with Google Kubernetes Engine node pool creation through Cloud Console UI". It also says "Affected customers can use gcloud command in order to create new Node Pools."

So it sounds like a web interface problem, not a severely limiting, backend systems problem with global scope.

Also, the report says "The issue with Google Kubernetes Engine Node Pool creation through the Cloud Console UI had been resolved as of Friday, 2018-11-09 14:30 US/Pacific". So the whole issue lasted about 10 hours, not three whole days.

> Some users here are reporting that other GCP services not mentioned by Google's blog are experiencing problems

I don't see much of that.


I believe the OP was referring to the very same blog (web log) you cited.

https://status.cloud.google.com/incident/container-engine/18...

"We are investigating an issue with Google Kubernetes Engine node pool creation through Cloud Console UI."

> So it sounds like a web interface problem, not a severely limiting

Depends who you ask as to whether this is "severely" limiting, but yes, there is a workaround by using an alternate interface.


Right now we don't know. It's one of two possibilities from what I can tell:

a) Google had a global service disruption that impacted Kubernetes node pool creation and possibly other services since Friday. They had a largely separate issue for a web UI disruption (what this thread links to) which they forgot to close on Friday. They still have not provided any issue tracker for the service disruption and it's possible they only learned about it from this Hacker News thread.

b) People are having various unrelated issues with services that they're mis-attributing to a global service disruption.


This is why GCP has no hope of ever taking significant market share from AWS. Google thinks they can treat their cloud customers like they treat users of their free services. Customer support and communication are essential.


As if something like this has never happened to AWS?


"like this" -- a failure of the service, or a failure of communication and customer support?


Remember that time S3 went down and the only updates were on Twitter because the status page was hosted on S3?


Yes. When the AWS status page failed to accurately inform their customers for several hours, AWS used Twitter to ensure that there was communication with their customers.

What exactly is your point?


The S3 outage duration was only 4 hours.


I don't. But thanks for sharing, that's hilarious (for an unaffected person at least :D)


Lol, is this real? If so, hilarious.


Both, I suppose.


Not affecting all regions, no.


I'm not sure about the market share, but I agree with the last two sentences.

...and I'm a happy GCP customer.


I recently removed my hosting from GCP. The pricing is confusing and unbelievable. Their customer service is a joke. I don't trust Google for long-term consistency due to the way they shut down their own apps, but I let that slide as I doubt they will do that with their cloud services. I have experience with AWS (rock solid, world-class support but also costly), DigitalOcean (improving fast), Heroku (good for beginners but also expensive and not as full-featured as AWS) and finally Hetzner (too early to judge).


I think you're missing the portion about how it only appears to be the console ui, no?


“2) This issue was global in scope, affecting all of Google's regions. Therefore, in consideration of item 1 above, it was questionable/unpredictable whether or not a user could launch a node pool or even a simple node anywhere in GCP at all.”

Ok. So on aws we were* paying for putting systems across regions, but, honestly I don’t get the point. When an entire region is down what I have noticed is that all things are fucked globally on aws. Feel free to pay double - but it seems* if you are paying that much just pay for an additional cloud provider. Looks like it’s the same deal on GCP.


> When an entire region is down what I have noticed is that all things are fucked globally on aws.

Do you have an example on this?


On 17 October, there was a multi-AZ network failure at us-east-1. It only lasted 3m35s, but it was enough that our customers were calling about our site being down.


That's still just one region, unless you were also hosted outside us-east-1.


Just grabbed first article. Example: In this case capitalone went down. I don’t work at capitalone - but I imagine they had their data copied across every region 30 times.

https://www.geekwire.com/2018/widespread-outage-amazon-web-s...


I think you're much too optimistic about capitalone. They probably had a single point of failure, possibly one they didn't realize they had.


CapitalOne is one of the few financial markets firms with open source cloud projects on github. I respect their tech org for that.



