1) For three whole days, it was questionable whether or not a user would be able to launch a node pool (according to the official blog statement). It was also questionable whether a user would be able to launch a simple compute instance (according to statements here on HN).
2) This issue was global in scope, affecting all of Google's regions. Therefore, in consideration of item 1 above, it was questionable/unpredictable whether or not a user could launch a node pool or even a simple node anywhere in GCP at all.
3) The sum total of information about this incident consists of a few one- or two-sentence blurbs on Google's blog. No explanation or outline of the scope of affected regions and services has been provided.
4) Some users here are reporting that other GCP services not mentioned by Google's blog are experiencing problems.
5) Some users here are reporting that they have received no response from GCP support, even over a time span of 40+ hours since the support request was submitted.
6) Google says they'll provide some information when the next business day rolls around, roughly 4 days after the start of the problem.
I really do want to make sure I'm understanding this situation. Please do correct me if I got something wrong in this summary.
When things stop working, GCP is the worst. Slow communications and they require way too much work before escalating issues or attempting to find a solution.
They already have the tools and access, so most issues should take them minutes to gather diagnostics, but instead they keep sending tickets back for "more info", inevitably followed by a hand-off to another team in a different time zone. In the past we have spent days just trying to convince them there was an issue at all, which seems unacceptable.
I can understand support costs, but there should be a test (with all vendors) where I can officially certify that I know what I'm talking about and don't need to go through the "prove it's actually a problem" phase every time.
The issue with outages for the Government organizations I have dealt with is rarely the outage itself - it's the lack of strong communication about what is occurring, realistic approximate ETAs, and options for mitigation.
Being able to tell the Directors/Senior managers that issues have been "escalated" and providing regular updates are critical.
If all I could say was that a "support ticket" had been logged and we were waiting on a reply (hours later), I guarantee the conversation after the outage would be about moving to another solution provider with strong SLAs.
Sure, we use support tickets with vendors for small things. Console button bugging out, etc. But for large incidents, every vendor has a representative within an hour driving distance and will be called into a room with our engineers to fix the problem. This kind of outage, with zero communication, means the dropping of a contract.
Communication is critical for trust, especially if we're running a business off it.
You need failovers to different providers, and hopefully you also have your own hardware for general workloads.
And suddenly the CEO doesn't care anymore if one of your potential failovers behaves flakily in specific circumstances.
Not saying it's good as it is. Communication as a SaaS provider is, as you said, one of the most important things. But this specific issue was not as bad as some people in this thread insinuate.
Don't get it wrong. AWS is exactly the same as Google. All you will do is log a ticket and receive an automated ack by the next day.
When I worked at GoDaddy, around 2/3 of the company was customer support.
At the company I'm at now, a cryptocurrency exchange, our support agents frequently hear that customers prefer our service over others because of our fast support response times (crypto exchanges are notorious for really poor support).
All of my interactions with Amazon support have been resolved to my satisfaction in 10 minutes or less.
Companies really ought to do the math on the value that comes from providing fast, timely, and easy (don't have to fight with them) customer support.
Google hasn't learned this lesson.
They have though; they've just drawn the conclusion that they'd rather put massive amounts of effort in to building services that users can use without needing support. This approach works well once the problems have been ironed out, but it's horrible until that's the case. Google's mature products like Ads, Docs, GMail, etc are amazing. Their new products ... aren't.
Google Ads and such also have a terrible support reputation, even with clients spending 8 figures.
Until something goes wrong and the only recourse is to post an angry Hacker News thread or call up people you personally know at Google to get it fixed. For example https://techcrunch.com/2017/12/22/that-time-i-got-locked-out....
We actually got to a point where we kept a couple of spare parts onsite (sticks of RAM, HDs, etc.) so we could repair immediately and then request the replacement. This was on a large HPC cluster, so we had almost daily failures of some kind (most commonly a stick of RAM that would repeatedly fail ECC checks).
Isn't that the case with basically every support request, no matter the company or severity? The first couple of emails from 1st and even 2nd level support are mostly about answering the same questions about the environment over and over again. We've had this ping-pong situation both with production outages (which we eventually analysed and worked around ourselves) and with fairly small issues, like requesting more information about an undocumented behavior which didn't even affect us much. No matter how important or urgent the initial issue was, eventually most requests end up closed unresolved.
GCP does have role-based support models with a flat-rate plan, which is really great, but the overall quality of the responses leaves much to be desired.
So far GCP is the best, hands down, in terms of stability. We have never had a single outage or maintenance downtime notification until now. We are power users, but our monitoring didn't pick up any anomaly, so I don't think this issue had rampant impact on other services.
But I find it concerning that they provided very little update on what went wrong. I also think it's better to expect nil support out of any big cloud provider if you don't have paid support. Funny how all these big cloud providers consider you ineligible for support by default. Sigh.
If you are an early-stage startup, can you afford their $200/month support when your entire GCP bill is under $1? However, that doesn't mean they don't have to support you.
If you've got VC money to blow so you can pretend your SaaS toy can feed 500 people while having money left to throw at things, that's cool. Just remember that other people might be running sustainable businesses.
And just like that you turned a $200/month bill into a $10k/month strawman.
> Just remember that other people might be running sustainable businesses.
Why are you pretending that a startup that can't afford $200/month is a "sustainable businesses"?
I mean sure, they could go and probably afford to waste $200 extra on something random that will be useless to them most of the time, but that money is going straight out of their paycheck.
You don't remain profitable though by repeatedly making bad decisions like that. Which was my point.
Running a (small) profitable business is about making the right decisions consistently, and if you're likely to waste money on one thing, you're also likely to waste it on the 19 other similar things.
Maybe speak to literally anyone you know who is running a small business if you want to know more. Yes, that includes the local small stores on your street.
At the end of the day you probably pissed off quite a few people on here when you called their livelihood a hobby project.
This is akin to saying that a mom and pop laundromat can’t afford insurance, or shouldn’t because they won’t frequently need it.
You’re trying to equate small businesses with hobbies. You’ve now resorted to straw men, slippery slopes, and false equivalency. Maybe consider that if you have to distort the situation this much to make your point, you might just be wrong.
> At the end of the day you probably pissed off quite a few people on here when you called their livelihood a hobby project.
I didn’t say anything about anyone’s livelihood. You’re the one pretending that small businesses bringing home $120k/year can’t afford a $200 monthly support bill.
I bet the guy who started this thread about GCP’s support cost has made a sum total of <$1000 from his “startup”. Likely <$10. Hobby.
I don’t care if “quite a few people” got pissed about my comment. People with egos that delicate shouldn’t use social media.
I was trying to tell you that most small businesses can't go around spending hundreds of bucks on things that provide little value, whether that's a business support plan on services they use or something else. It's true regardless of whether you're a brick-and-mortar store or some online service.
> This is akin to saying that a mom and pop laundromat can’t afford insurance, or shouldn’t because they won’t frequently need it.
Speaking of false equivalencies...
> You’re the one pretending that small businesses bringing home $120k/year can’t afford a $200 monthly support bill.
First off, I spoke of businesses making generally less than that.
Also (I already said this, good job ignoring that!), paying $200 on a single useless thing is survivable even for a small business - but you know what's better than making only one bad business decision? Making no bad ones at all. Making too many will quickly break the camel's back.
Which was my whole argument and it's also what people generally refer to when they say they can't afford something.
For instance you may say "I can't afford to go to this restaurant", even though you'd have enough money to do it without going immediately bankrupt. But it'd be a bad decision, too many of which quickly add up.
And I'm telling you that if you built your business on top of GCP, a support contract is probably not "low value". You'd happily pay $200 for support on your critical infrastructure, just as you'd happily pay $200 for a repairman to fix your washing machine if you owned a laundromat.
If you don't need support, then sure, don't pay for the plan. If you do need support, $200 seems pretty reasonable.
> Speaking of false equivalencies...
Signing up for a monthly recurring support plan in case you need it is literally insuring your business.
> For instance you may say "I can't afford to go to this restaurant", even though you'd have enough money to do it without going immediately bankrupt. But it'd be a bad decision, too many of which quickly add up.
A support plan for your critical infrastructure probably isn't "useless". Which is the point. If your need for support is that low, then either you've built your own redundant systems to protect you or more likely you aren't running a real business.
I'm glad AWS's free tier is working for you, but complaining that Google doesn't want to give you free capacity for your business and then also provide you free support for that business is pretty absurd.
I am not saying it's feasible at vast scale, but big cloud providers don't even give you the opportunity to raise a ticket when it's their fault. There is an extra price you pay when you opt for any one of them, but many don't realize it. Having said that, almost all the time our skilled in-house expertise is better than their first two levels of support staff. We realized this early, so we handle it better by going over the documentation and making our code resilient, since all cloud platforms have some limit or another - overselling in a region is something they can't avoid. Handling these exceptions and going multi-region is the only way through.
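The "handle the exception and fail over to another region" pattern described above can be sketched in a few lines. Everything here is illustrative: `create_instance` stands in for whatever provider SDK call you actually use, and `ResourceExhausted` is a hypothetical error for an oversold region.

```python
class ResourceExhausted(Exception):
    """Hypothetical error: a region cannot satisfy the request (e.g. oversold capacity)."""

def create_with_failover(create_instance, regions):
    """Try each region in order; return (region, result) for the first that succeeds.

    `create_instance` is a stand-in for a real cloud SDK call that takes a
    region name and either returns a resource or raises ResourceExhausted.
    """
    errors = {}
    for region in regions:
        try:
            return region, create_instance(region)
        except ResourceExhausted as exc:
            errors[region] = exc  # remember why this region failed, then try the next
    raise RuntimeError(f"all regions failed: {errors}")
```

The point is simply that the region list, not any single region, is the unit of reliability: a capacity error becomes a routine, handled event instead of an outage.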
"We are investigating an issue with Google Kubernetes Engine node pool creation through Cloud Console UI."
So it's a Cloud Console UI issue; it appears you can still manage node pools from the command line:
"Affected customers can use gcloud command  in order to create new Node Pools. "
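For anyone hitting the Console bug, the CLI workaround would look roughly like this (the cluster name, pool name, and zone below are placeholders, not values from the incident report):

```shell
# Create a new node pool with gcloud instead of the Cloud Console UI.
# "my-cluster", "my-pool", and the zone are placeholder values.
gcloud container node-pools create my-pool \
  --cluster=my-cluster \
  --zone=us-central1-a \
  --num-nodes=3
```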
Similarly, it actually was resolved on Friday, but they forgot to mark it as such.
"The issue with Google Kubernetes Engine Node Pool creation through the Cloud Console UI had been resolved as of Friday, 2018-11-09 14:30 US/Pacific."
The items I put down in my comment are based largely on user reports, though (there isn't much else to go on). And I mean these items as questions (i.e. "is this accurate?"). Folks here on HN have definitely been reporting ongoing problems and seem to be suggesting that they are not resolved and are actually larger in scope than the Google blog post addressed.
Someone from Google commented here a few hours ago indicating Google was looking into it. And other folks here are reporting that they don't have the same problems. So it's kind of an open question what's going on.
I'm in the evaluation phase too. And I've found a lot to like about GCP. I'm hoping the problems are understandable.
Edit: I finally got my cluster up and running by removing all nodes, letting it process for a few minutes, then adding new nodes.
I created a new instance in us-west2-c, which worked briefly but began to fail midday Friday, and kept failing through the weekend.
On Saturday I created yet another clone in northamerica-northeast1-b. That worked Saturday and Sunday, but this morning, it is failing to start. Fortunately my us-west2-c instance has begun to work again, but I'm having doubts about continuing to use GCE as we scale up.
And yet, the status page says all services are available.
What blog statement are you referring to? I don't see any such statement. Can you provide a link?
The OP incident status issue says "We are investigating an issue with Google Kubernetes Engine node pool creation through Cloud Console UI". It also says "Affected customers can use gcloud command in order to create new Node Pools."
So it sounds like a web interface problem, not a severely limiting backend-systems problem with global scope.
Also, the report says "The issue with Google Kubernetes Engine Node Pool creation through the Cloud Console UI had been resolved as of Friday, 2018-11-09 14:30 US/Pacific". So the whole issue lasted about 10 hours, not three whole days.
> Some users here are reporting that other GCP services not mentioned by Google's blog are experiencing problems
I don't see much of that.
> So it sounds like a web interface problem, not a severely limiting
Depends who you ask as to whether this is "severely" limiting, but yes, there is a workaround via an alternate interface.
a) Google has had a global service disruption since Friday that impacted Kubernetes node pool creation and possibly other services. They had a largely separate issue for a web UI disruption (what this thread links to) which they forgot to close on Friday. They still have not provided any issue tracker for the service disruption, and it's possible they only learned about it from this Hacker News thread.
b) People are having various unrelated issues with services that they're mis-attributing to a global service disruption.
What exactly is your point?
...and I'm a happy GCP customer.
Ok. So on AWS we were paying for putting systems across regions, but honestly I don't get the point. When an entire region is down, what I have noticed is that everything is fucked globally on AWS. Feel free to pay double - but if you are paying that much, it seems you should just pay for an additional cloud provider instead. Looks like it's the same deal on GCP.
Do you have an example on this?