Google Kubernetes Engine's third consecutive day of service disruption

shareometry · on Nov 12, 2018

I am currently evaluating GCP for two separate projects. I want to see if I understand this correctly:

1) For three whole days, it was questionable whether or not a user would be able to launch a node pool (according to the official blog statement). It was also questionable whether a user would be able to launch a simple compute instance (according to statements here on HN).

2) This issue was global in scope, affecting all of Google's regions. Therefore, in consideration of item 1 above, it was questionable/unpredictable whether or not a user could launch a node pool or even a simple node anywhere in GCP at all.

3) The sum total of information about this incident can be found as a few one or two sentence blurbs on Google's blog. No explanation nor outline of scope for affected regions and services has been provided.

4) Some users here are reporting that other GCP services not mentioned by Google's blog are experiencing problems.

5) Some users here are reporting that they have received no response from GCP support, even over a time span of 40+ hours since the support request was submitted.

6) Google says they'll provide some information when the next business day rolls around, roughly 4 days after the start of the problem.

I really do want to make sure I'm understanding this situation. Please do correct me if I got something wrong in this summary.

manigandham · on Nov 12, 2018

When everything works, GCP is the best. Stable, fast, simple, reliable.

When things stop working, GCP is the worst. Slow communications and they require way too much work before escalating issues or attempting to find a solution.

They already have the tools and access so most issues should take minutes for them to gather diagnostics, but instead they keep sending tickets back for "more info", inevitably followed by a hand-off to another team in a different time zone. We have spent days trying to convince them there was an issue before, which just seems unacceptable.

I can understand support costs but there should be a test (with all vendors) where I can officially certify that I know what I'm talking about and don't need to go through the "prove its actually a problem" phase every time.

laurencei · on Nov 12, 2018

As someone who works for Government and Enterprise - all I care about sometimes is how a company behaves when everything goes wrong.

The issue with outages for the Government organizations I have dealt with is rarely the outage itself - but strong communication about what is occurring and realistic approximate ETAs, or options around mitigation.

Being able to tell the Directors/Senior managers that issues have been "escalated" and providing regular updates are critical.

If all I could say was a "support ticket" was logged, and we are waiting on a reply (hours later) - I guarantee the conversation after the outage is going to be about moving to another solution provider with strong SLAs.

totallyashill · on Nov 12, 2018

Very similar thing at our office. Considering the scale of which we run things, any outage could be a potential loss of millions _every minute_.

Sure, we use support tickets with vendors for small things. Console button bugging out, etc. But for large incidents, every vendor has a representative within an hour driving distance and will be called into a room with our engineers to fix the problem. This kind of outage, with zero communication, means the dropping of a contract.

Communication is critical for trust, especially if we're running a business off it.

y4mi · on Nov 12, 2018

Going single cloud on that scale is simply irresponsible though.

You need failovers to different providers and hopefully also have your hardware for general workloads

And suddenly the CEO doesn't care anymore if one of your potential failovers is behaving flaky in specific circumstances

Not saying it's good as it is.. communication as a saas provider is - as you said- one is the most important things... But this specific issue was not as bad as some people insinuate in this thread

softawre · on Nov 12, 2018

Agree, if we are really talking about millions per minute (woah), then you can afford to failover to AWS.

user5994461 · on Nov 12, 2018

As a government or large enterprise, you should get a support contract with the provider and have a dedicated support to contact.

Don't get it wrong. AWS is the exact same thing as Google. All you will is log a ticket and receive an automated ack by the next day.

Cort3z · on Nov 12, 2018

You are incorrect about aws. If your pay for business support, and something is happening to your production environment, they are on a call with you in less than an hour.

user5994461 · on Nov 12, 2018

How could I be incorrect when that's exactly what I said? You gotta pay for a support contract to have any meaningful support.

Scriptor · on Nov 12, 2018

You also said that all you would get was an "automated ack". This seems to not be the case if aws provides an on-call support engineer.

joshuamorton · on Nov 12, 2018

I think the point is that that only happens if you have a contract. With GCP you can also get an oncall support engineer if you're large enough.

qbaqbaqba · on Nov 12, 2018

Use AWS and government "region".

Osiris · on Nov 12, 2018

"Support costs" calculation often doesn't include the costs of not having support.

When I worked at GoDaddy, there were around 2/3 of the company was customer support.

At the current company I'm at, a cryptocurrency exchange, our support agents frequently hear they prefer our service over others because of our fast support response times (crypto exchanges are notorious for really poor support).

All of my interactions with Amazon support have been resolved to my satisfaction within 10 minutes or less.

Companies really ought to do the math on the value that comes from providing fast, timely, and easy (don't have to fight with them) customer support.

Google hasn't learned this lesson.

onion2k · on Nov 12, 2018

Google hasn't learned this lesson.

They have though; they've just drawn the conclusion that they'd rather put massive amounts of effort in to building services that users can use without needing support. This approach works well once the problems have been ironed out, but it's horrible until that's the case. Google's mature products like Ads, Docs, GMail, etc are amazing. Their new products ... aren't.

manigandham · on Nov 12, 2018

There's a big difference between SaaS applications and compute infrastructure for your business.

Google Ads and such also have a terrible support reputation, even with clients spending 8 figures.

marcinzm · on Nov 12, 2018

>Google's mature products like Ads, Docs, GMail, etc are amazing.

Until something goes wrong and the only recourse is to post an angry Hacker News thread or call up people you personally know at Google to get it fixed. For example https://techcrunch.com/2017/12/22/that-time-i-got-locked-out....

QuantumGood · on Nov 12, 2018

I've seen Google projects where the project lead explained (actually responded) that they don't want to provide support, end of story. Google puts folks in charge but does not give them enough in the way accountability objectives.

jgalentine007 · on Nov 12, 2018

With Dell you can certify with them so you can get replacement parts and such without the BS back and forth with some guy in india. Saves everyone time and money.

phamilton · on Nov 12, 2018

I did this many years ago and it was great.

We actually got to a point where we had a couple of spare parts onsite (sticks of RAM, HD, etc) and so we repair immediately and then request the replacement. This was on a large HPC cluster so we had almost daily failures of some kind (most commonly we'd get a stick of RAM that would fail ECC checks repeatedly).

dvdgsng · on Nov 12, 2018

> instead they keep sending tickets back for "more info"

Isn't that the case with basically every support request, no matter the company or severity? The first couple of emails from 1st & even 2nd level support are mostly about answering the same questions about the environment over and over again. We've had this ping-pong situation with production outages (which we eventually analysed and worked around by ourselves) and fairly small issues like requesting more information of an undocumented behavior which didn't even effect us much. No matter how important or urgent the initial issue was, eventually most requests end up being closed unresolved.

jon-wood · on Nov 12, 2018

I've definitely had interactions with smaller companies where you can effectively bypass first and second line by demonstrating you know what you're doing, mostly just saying the right things for them to accept that you've done basic troubleshooting steps already and really do need to talk to someone beyond that point.

dvdgsng · on Nov 12, 2018

Yes, same experience here, support at smaller companies can be more dedicated when talking to "knowledgeable" customers. It's generally easier to get to their 3rd level, sometimes just because there is no 1st or 2nd level at all. But at "big" enterprises - not so much.

rainbowzootsuit · on Nov 12, 2018

> I've definitely had interactions with smaller companies where you can effectively bypass first and second line by demonstrating you know what you're doing, mostly just saying the right things for them to accept that you've done basic troubleshooting steps already and really do need to talk to someone beyond that point.

"Shibboleet" https://www.xkcd.com/806/

manigandham · on Nov 12, 2018

Smaller companies or personalized support structures (like named engineers) is very different. You can build up a relationship and usually bypass many questions to get to main issue, and even get it resolved before you can even open a case at larger organizations.

GCP does have role-based support models with a flat-rate plan, which is really great, but the overall quality of the responses leaves much to be desired.

softawre · on Nov 12, 2018

Heh, your "test" reminds me of an old Hanselman article:

https://www.hanselman.com/blog/FizzBinTheTechnicalSupportSec...

ElBarto · on Nov 12, 2018

To say "when it works it's stable and reliable" implies that it is neither...

llampx · on Nov 12, 2018

60% of the time, it works every time...

dilyevsky · on Nov 12, 2018

We had an issue a few weeks back where all nodes in west1-a could not pull docker images. Google support was pinballing P1 issue around the globe and across multiple teams for a few days untill I root caused it for them - turned out to be gce service account issues affecting entire zone. 2 days to rollback (no status page update). I know nobody gives a fuck but can’t help but feel vindicated as an ex google sre.

icelancer · on Nov 12, 2018

I think a lot of people give a fuck here; I do, at least. Thanks for outlining it, these things are fascinating (to me anyway, who has never worked in IT/ops).

navinsylvester · on Nov 12, 2018

We are GCP customers for the last couple of years. We use other cloud platforms(AWS, IBM, Oracle, OrionVM) too. We don't use GKE but use rancher/kubernetes combo on their standard platform.

So far GCP is the best, hands down in terms of stability. We never had a single outage or maintenance downtime notification till now. We are power users but our monitoring didn't pick any anomaly so i don't think this issue had rampant impact on other services.

But i find it concerning that they provided very little update on what went wrong. I also think its better to expect nil support out of any big cloud provider if you don't have paid support. Funny how all these big cloud providers think you are not eligible for support de-facto. Sigh.

rogerkirkness · on Nov 12, 2018

I agree with this. Compared to AWS, when Google says it's down, it's down, and that's rare. When they say it's up, it's up.

vira28 · on Nov 12, 2018

I use AWS free tier and get customer support through email, but thats not the case with GCP. Do they provide free email support?

If you are an early stage startup can you afford their 200/Month support, when your entire GCP bill is under $1. However, that doesn't mean you don't have to support them.

dpark · on Nov 12, 2018

If 200/month is an issue, then you aren’t an “early stage startup”. You’re running a hobby project.

chmod775 · on Nov 12, 2018

You know there's 1-man "start ups"/companies out there that serve more users than 99.5% of VC startups ever will, which earn their owner a very livable wage, but who still can't afford a 500 bucks business support plan on all of the 20 services they use.

If you've got VC money to blow so you can pretend your SaaS toy can feed 500 people while having money left to throw at things, that's cool. Just remember that other people might be running sustainable businesses.

dpark · on Nov 14, 2018

> You know there's 1-man "start ups"...can't afford a 500 bucks business support plan on all of the 20 services they use.

And just like that you turned a $200/month bill into a $10k/month strawman.

> Just remember that other people might be running sustainable businesses.

Why are you pretending that a startup that can't afford $200/month is a "sustainable businesses"?

chmod775 · on Nov 15, 2018

There are literally millions of small 1-2 people business that make less than 10k in profits.

I mean sure, they could go and probably afford to waste $200 extra on something random that will be useless to them most of the time, but that money is going straight out of their paycheck.

You don't remain profitable though by repeatedly making bad decisions like that. Which was my point.

Running a (small) profitable business is about making the right decisions consistently, and if you're likely to waste money on one thing, you're also likely to waste it on the 19 other similar things.

Maybe speak to literally anyone you know who is running a small businesses if you want to know more. Yes that includes your local small stores on your street.

At the end of the day you probably pissed off quite a few people on here when you called their livelihood a hobby project.

dpark · on Nov 16, 2018

The scenario being discussed was building a company on top of GCP and being unable to afford $200/month for support costs. Tell me what “mom and pop” shops are pulling in $10k/month from GCP-based software but are unwilling/unable to pay $200/month for the very thing that underpins their livelihoods?

This is akin to saying that a mom and pop laundromat can’t afford insurance, or shouldn’t because they won’t frequently need it.

You’re trying to equate small businesses with hobbies. You’ve now resorted to straw men, slippery slopes, and false equivalency. Maybe consider that if you have to distort the situation this much to make your point, you might just be wrong.

> At the end of the day you probably pissed off quite a few people on here when you called their livelihood a hobby project.

I didn’t say anything about anyone’s livelihood. You’re the one pretending that small businesses bringing home $120k/year can’t afford a $200 monthly support bill.

I bet the guy who started this thread about GCP’s support cost has made a sum total of <$1000 from his “startup”. Likely <$10. Hobby.

I don’t care if “quite a few people” got pissed about my comment. People with egos that delicate shouldn’t use social media.

chmod775 · on Nov 16, 2018

> The scenario being discussed was building a company on top of GCP and being unable to afford $200/month for support costs. Tell me what “mom and pop” shops are pulling in $10k/month from GCP-based software but are unwilling/unable to pay $200/month for the very thing that underpins their livelihoods?

I was trying to tell you that most small businesses can't go around spending hundreds of bucks of things that provide little value, whether that's a business support plans on services they use or something else. It's true regardless of whether you're a brick and mortar store or some online service.

> This is akin to saying that a mom and pop laundromat can’t afford insurance, or shouldn’t because they won’t frequently need it.

Speaking about about false equivalencies...

> You’re the one pretending that small businesses bringing home $120k/year can’t afford a $200 monthly support bill.

First off, I spoke of businesses making generally less than that.

Also (I already said this, good job ignoring that!) paying $200 bucks on a single useless thing is survivable for even a small business - but you know what's better than only making one bad business decision? Making no bad ones at all. Making too many will quickly break the camel's back.

Which was my whole argument and it's also what people generally refer to when they say they can't afford something.

For instance you may say "I can't afford to go to this restaurant", even though you'd have enough money to do it without going immediately bankrupt. But it'd be a bad decision, too many of which quickly add up.

dpark · on Nov 16, 2018

> I was trying to tell you that most small businesses can't go around spending hundreds of bucks of things that provide little value

And I'm telling you that if you built your business on top of GCP, a support contract is probably not "low value". You'd happily pay $200 for support on your critical infrastructure, just as you'd happily pay $200 for a repairman to fix your washing machine if you owned a laundromat.

If you don't need support, then sure, don't pay for the plan. If you do need support, $200 seems pretty reasonable.

> Speaking about about false equivalencies...

Signing up for a monthly recurring support plan in case you need it is literally insuring your business.

> For instance you may say "I can't afford to go to this restaurant", even though you'd have enough money to do it without going immediately bankrupt. But it'd be a bad decision, too many of which quickly add up.

A support plan for your critical infrastructure probably isn't "useless". Which is the point. If your need for support is that low, then either you've built your own redundant systems to protect you or more likely you aren't running a real business.

marmaduke · on Nov 12, 2018

If a comment reflects the median, it doesn’t address outliers.

randomsearch · on Nov 12, 2018

"Startups" in the YC sense _are_ the outliers.

vira28 · on Nov 12, 2018

Okay. Lets say its an hobby project, but do you understand those today's hobby projects are tomorrow's mature startup?

dpark · on Nov 14, 2018

No, I understand that most hobby projects are just hobbies and Google/Amazon/etc are under no obligation to provide support for hobbies that are literally a net cost to them.

I'm glad AWS's free tier is working for you, but complaining that Google doesn't want to give you free capacity for your business and then also provide you free support for that business is pretty absurd.

pentium10 · on Nov 12, 2018

Yes, they provide using the public issue tracker. We have been used it with success.

vira28 · on Nov 12, 2018

Thanks. I am aware of that issue tracker. Its not an actual support portal (or at least the support I am expecting)

dvdgsng · on Nov 12, 2018

I'm curious, how much support DO you expect when using the free tier only?

TanakaTarou · on Nov 12, 2018

I don't understand why someone would choose to deploy anything mission critical without having an support contract with the ISP, the manufacturer of the the software etc.

marcinzm · on Nov 12, 2018

Simple, the cost of an outage is less than the cost of a support contract. Very few things are really mission critical as in they can never go down. Rather they simply have a cost to going down and you can choose to pay that one way or another.

jjeaff · on Nov 12, 2018

And it's not like having a support contract precludes you from downtime.

navinsylvester · on Nov 12, 2018

I transitioned from collocation to self managed remote server farm and then onto self managed remote vms. All these providers provided de-facto support whether we opted for one or not. You can go to their portal and raise a ticket.

I am not saying with vast numbers its feasible but big cloud providers don't even give you the opportunity to raise a ticket if its their fault. There is a price you pay extra when you opt for any one of them but many don't realize. Having said that - almost all the time, our skilled expertise is better than their initial two level of support staff. We realized it early so we handle it better by going over the documentation and making our code resilient since all cloud platforms have some limit or another since overselling in a region is something they can't avoid. Going multiple regions across when you handle these exceptions is the only way through.

StreamBright · on Nov 12, 2018

Or why chose an error prone technology?

lilbobbytables · on Nov 12, 2018

You're doing me a scare. I'm in the evaluation phase with them. Maybe I'm missing something here, but this is not at all what the linked post says.

"We are investigating an issue with Google Kubernetes Engine node pool creation through Cloud Console UI."

So, it's a UI console issue, it appears you can still manage

"Affected customers can use gcloud command [1] in order to create new Node Pools. [1]"

Similarly, it actually was resolved in Friday, but they forgot to mark it as so.

"The issue with Google Kubernetes Engine Node Pool creation through the Cloud Console UI had been resolved as of Friday, 2018-11-09 14:30 US/Pacific."

shareometry · on Nov 12, 2018

You are right about the Google blog content itself not indicating three days of outage. Turns out they just forgot to mark that particular issue as resolved on Friday, as you point out. This is my mistake. I would update my comment to reflect this, but it doesn't seem to allow an edit at this point.

The items I put down in my comment are based largely on user reports, though (there isn't much else to go on). And I mean these items as questions (i.e. "is this accurate?"). Folks here on HN have definitely been reporting ongoing problems and seem to be suggesting that they are not resolved and are actually larger in scope than the Google blog post addressed.

Someone from Google commented here a few hours ago indicating Google was looking into it. And other folks here are reporting that they don't have the same problems. So it's kind of an open question what's going on.

I'm in the evaluation phase too. And I've found a lot to like about GCP. I'm hoping the problems are understandable.

haldora · on Nov 12, 2018

I've been failing all weekend to create nodes in a GKE cluster through either the UI console or gcloud. Even right now I can't get any nodes to spin up.

Edit: I finally got my cluster up and running by removing all nodes, letting it process for a few minutes, then adding new nodes.

timdumol · on Nov 12, 2018

We've had no issues deleting and creating node pools this weekend (on asia-east1-a). No other problems noticed either.

fizzledbits · on Nov 12, 2018

As of this morning, I am still unable to reliably start my docker+machine autoscaling instances. In all cases the error is "Error: The zone <my project> does not have enough resources available to fulfill the request" An instance in us-central1-a has refused to start since last Thursday or Friday.

I created a new instance in us-west2-c, which worked briefly but began to fail midday Friday, and kept failing through the weekend.

On Saturday I created yet another clone in northamerica-northeast1-b. That worked Saturday and Sunday, but this morning, it is failing to start. Fortunately my us-west2-c instance has begun to work again, but I'm having doubts about continuing to use GCE as we scale up.

And yet, the status page says all services are available.

raincom · on Nov 12, 2018

If you run your own k8s on GCP, you are not going to be affected by GKE.

aviv · on Nov 12, 2018

I can't comment regarding GKE as we don't use that particular service, however we are very heavy users of many other GCP services, including Compute, Datastore, BigQuery, Pub/Sub, Storage, Functions, Speech, and others. Zero issues this weekend, everything is running 100% as any normal day.

tejohnso · on Nov 12, 2018

> For three whole days, it was questionable whether or not a user would be able to launch a node pool (according to the official blog statement)

What blog statement are you referring to? I don't see any such statement. Can you provide a link?

The OP incident status issue says "We are investigating an issue with Google Kubernetes Engine node pool creation through Cloud Console UI". It also says "Affected customers can use gcloud command in order to create new Node Pools."

So it sounds like a web interface problem, not a severely limiting, backend systems problem with global scope.

Also, the report says "The issue with Google Kubernetes Engine Node Pool creation through the Cloud Console UI had been resolved as of Friday, 2018-11-09 14:30 US/Pacific". So the whole issue lasted about 10 hours, not three whole days.

> Some users here are reporting that other GCP services not mentioned by Google's blog are experiencing problems

I don't see much of that.

paulddraper · on Nov 12, 2018

I believe the OP was referring to the very same blog (web log) you cited.

https://status.cloud.google.com/incident/container-engine/18...

"We are investigating an issue with Google Kubernetes Engine node pool creation through Cloud Console UI."

> So it sounds like a web interface problem, not a severely limiting

Depends who you as to whether this is "severely" limiting, but yes there is a workaround by using an alternate interface.

marcinzm · on Nov 12, 2018

Right now we don't know. It's one of two possibilities from what I can tell:

a) Google had a global service disruption that impacted Kubernetes node pool creation and possible other services since Friday. They had a largely separate issue for a web UI disruption (what this thread links to) which they forgot to close on Friday. They still have not provided any issue tracker for the service distribution and it's possibly they only learned about it from this hacker news thread.

b) People are having various unrelated issues with services that they're mis-attributing to a global service disruption.

johnpython · on Nov 12, 2018

This is why GCP has no hope of ever taking significant market share from AWS. Google thinks they can treat their cloud customers like they treat users of their free services. Customer support and communication are essential.

ernsheong · on Nov 12, 2018

As if something like this has never happened to AWS?

wumpus · on Nov 12, 2018

"like this" -- a failure of the service, or a failure of communication and customer support?

ty_a · on Nov 12, 2018

Remember that time S3 went down and the only updates were on Twitter because the status page was hosted on S3?

paulddraper · on Nov 12, 2018

Yes. When the AWS status page failed to accurately inform their customers for several hours, AWS used Twitter to ensure that there was communication with their customers.

What exactly is your point?

pzh · on Nov 12, 2018

The S3 outage duration was only 4 hours.

ascar · on Nov 12, 2018

I don't. But thanks for sharing, that's hilarious (for an unaffected person at least :D)

topicseed · on Nov 12, 2018

Lol, is this real? If so, hilarious.

ernsheong · on Nov 12, 2018

Both, I suppose.

swedish_mafia · on Nov 12, 2018

Not effecting all regions no.

halbritt · on Nov 12, 2018

I'm not sure about the market share, but I agree with the last two sentences.

...and I'm a happy GCP customer.

rorykoehler · on Nov 12, 2018

I recently removed my hosting from GCP. The pricing is confusing and unbelievable. Their customer service is a joke. I don't trust Google for longterm consistency due to the way they shut their own apps but I let that slide as I doubt they will do that on their cloud services. I have experience with AWS (rock solid, world class support but also costly), digital ocean (improving fast), heroku (good for beginners but also expensive and not as full featured as AWS) and finally Hetzner (too early to judge).

meow_mix · on Nov 12, 2018

I think you're missing the portion about how it only appears to be the console ui, no?

ransom1538 · on Nov 12, 2018

“2) This issue was global in scope, affecting all of Google's regions. Therefore, in consideration of item 1 above, it was questionable/unpredictable whether or not a user could launch a node pool or even a simple node anywhere in GCP at all.”

Ok. So on aws we were* paying for putting systems across regions, but, honestly I don’t get the point. When an entire region is down what I have noticed is that all things are fucked globally on aws. Feel free to pay double - but it seems* if you are paying that much just pay for an additional cloud provider. Looks like it’s the same deal on GCP.

human_error · on Nov 12, 2018

> When an entire region is down what I have noticed is that all things are fucked globally on aws.

Do you have an example on this?

excalq · on Nov 12, 2018

On 17 October, there was a multi-AZ network failure at us-east-1. It only lasted 3m35s, but it was enough that our customers were calling about our site being down.

ghusbands · on Nov 12, 2018

That's still just one region, unless you were also hosted outside us-east-1.

ransom1538 · on Nov 12, 2018

Just grabbed first article. Example: In this case capitalone went down. I don’t work at capitalone - but I imagine they had their data copied across every region 30 times.

https://www.geekwire.com/2018/widespread-outage-amazon-web-s...

jefftk · on Nov 12, 2018

I think you're much too optimistic about capitalone. They probably had a single point of failure, possibly one they didn't realize they had.

carlsborg · on Nov 12, 2018

CapitalOne is one of the few financial markets firms with open source cloud projects on github. I respect their tech org for that.

usmannk · on Nov 11, 2018

We had an issue a few weeks ago where the google front-end servers were mangling responses from Pub/Sub and returning 502 responses, making the service completely unusable and knocking over a number of things we have running in production. Despite paying for enterprise support and having in a P1 ticket, we had to spend Friday to Sunday gathering evidence to prove to the support staff that there was indeed a problem, because their monitoring wasn't detecting it. Right now I'm doing something similar (and since Friday!) but for TLS issues they're having. Again, because their support reps don't believe there's a problem. There are so many more problems than they ever show on their status page...

Jedi72 · on Nov 12, 2018

They work for Google so obviously they are much smarter than you. If theres a problem its probably the customers fault. /sarcasm

mbrumlow · on Nov 12, 2018

I was so mad to read that until you said /sarcasm :p

That being said I really do think there is a difference between who is working at google today and the google we all fell in love with pre-2008.

I am sure there are a amazing people still working at google, but nowhere near like it was.

The way I like to think about google is that some amazing people mad ea awesome train that builds tracks in front of it -- you can call them gods maybe -- but those people are gone -- or a least the critical mass required to build such a train has dwindled to just dust. What we have left is a awesome train full of people pulling the many levers left behind.

To make things even worse my last interview as a SRE left me wondering if even the people who are there know this as well, and they are actually working hard to keep out those who might expose light on to this. I don't say that because I did not get the job -- I am actually happy I did not get extended a offer.

I say this with one exception, the old-timer who was my last interview. I could tell he was dripping in knowledge and eager to share it with any that would listen. I came out of his 45 min session learning many things -- I wold actually pay to work with a guy like that.

I would also like to point out that the work ethic was not what I expected. I was told that when on call, my duty was to figure out the root cause was in the segment I was responsible for. I don't know about you, but if my phone rings at night I am going to see through to a resolution and understand the problem in full -- even if it is not on the segment that I was assigned.

/end rant

blub · on Nov 12, 2018

You've managed to roll the myth of 10x engineers, start-up geniuses, nostalgia and gut feelings into one message of very dubious veracity.

At the same time you ignored the massive complexity and size of Google compared to what they were at the beginning.

This is voodoo organisational analysis.

engineeringwoke · on Nov 12, 2018

10x engineers are real. Start-up geniuses are real. The large majority of people have their heads up their asses. Wake up and smell the coffee

mav3rick · on Nov 12, 2018

You can't find and hire geniuses for every component. Systems should scale with the average engineer in mind (so should code). We have all smelt the coffee and it smells even better when your team is efficient and well rested.

engineeringwoke · on Nov 13, 2018

I don't disagree. Building around 10x'ers in this way creates god complexes and unhappy "senior" engineers (since they don't do any of the cool work). Having one very early still juices your productivity

mav3rick · on Nov 12, 2018

The work ethic is in tact. It is not fair to load people with stress and ask them to drop everything. You're conflating poor resource allocation with "work ethic". Burning the midnight oil when it can be avoided is not work ethic. The correct way is to load balance outage resolution.

mbrumlow · on Nov 12, 2018

I really don't know how to reply to you as you have set up a bunch of windmills you assumed from my previous post. Who said anything about poor resource allocation? Who said we need to load people with stress?

That being said -- when you are on call -- dropping everything is exactly what is expected.

mav3rick · on Nov 12, 2018

Many many AWS people have left citing on call as the worst part of their job. Also, it is really a far fetched allegation that interviewers try to fail interviewees to hide their own incompetence. I get that you may not think much of Google SREs but to allege that is just in bad spirit. I hope you do get to see that the people inside Google are one of the best perks of working here. They are smart and motivated and willing to help each other.

antt · on Nov 12, 2018

Were they that motivated to help their customers.

gmueckl · on Nov 12, 2018

But limiting responsibilities for on call employees is a way of limiting the workload they have during that time - ultimately benefiting the employees. They are on call, not working. I don't see the windmills here.

mav3rick · on Nov 12, 2018

Yes this was my point.

raincom · on Nov 12, 2018

Even though you have troubleshooting skills and other skills that can help Google, Google assumes that these skills are derivative of what they are looking for in interview. So, they are looking for those primary skills.

jkaplowitz · on Nov 12, 2018

Many of the tier 1 GCP support reps work for external vendors nowadays, which is probably part of the problem.

During my time on the GCE team (note I don't work at Google now) I knew multiple full-time Google employee support reps, including some still at the company. They have the good attitude and deep knowledge you'd hope for.

The problem is simply about how Google scales their GCP support org. To be completely clear, AWS support is by and large not great either.

If you're a big or strategically important customer, of course, you can get a good response from either company.

halbritt · on Nov 12, 2018

80% of my support experiences are laughably bad.

20% of my support experiences are amazing.

Fortunately, I don't require decent support to keep my service running. My sales rep tells me that he's aware of the problem.

I speculate it's simply the result of GCP trying to grow the org very quickly.

lstamour · on Nov 12, 2018

I totally heard that when trying to get engineer attention to YouTube Premium’s frequent “download errors” due to their transcoding-on-the-fly (or something). I was telling a support rep I had evidence that if I switched the setting from standard to high def (or vice versa) that the error would go away, but I could reproduce it with the same video, and thought it was a CDN/transcode issue. They kept marking the ticket as “unable to reproduce” and I had to wonder — as a paying customer, don’t they have analytics on my phone that would show exactly the request I was making which was failing in their logs? And if they saw it succeed, why not tell me the problem was my ISP? I’d have been happy to follow up... but nothing’s ever wrong in Google-land. :/

reaperducer · on Nov 12, 2018

Again, because their support reps don't believe there's a problem.

Perhaps if you explained it on a whiteboard...

navinsylvester · on Nov 12, 2018

In general GCP has quota limits so its expected of customers to catch 5xx and do exponential back off. But this info is not explicitly stated.

From my personal experience - i think all big cloud providers first two level support staff is no good if it isn't an obvious dumb one on your part. I always prefer to forgo support and try to go through every bit of their documentation to figure it out on our own. This helps to save huge amount of time. But if you have developer support - it can help to expedite things little faster though.

halbritt · on Nov 12, 2018

Did they ask you for a screenshot?

That's my favorite.

StreamBright · on Nov 12, 2018

I just took a screenshot of this and any time somebody asks my why AWS i will forward them. Thanks!

Jedi72 · on Nov 11, 2018

"The data says engagement is down 46%, I think its time we drop the product."

- Someone at Google right now, probably.

justinsb · on Nov 11, 2018

I can assure you that's not the case! Also, while people like to repeat this meme, Google Cloud does have a formal deprecation policy (https://cloud.google.com/terms/), whose intent is to give you some assurances.

(I work at Google, on GKE, though I am not a lawyer and thus don't work on the deprecation policy)

chrisseaton · on Nov 12, 2018

> Google may discontinue any Services or any portion or feature for any reason at any time without liability to Customer

for any reason

at any time

uluyol · on Nov 12, 2018

Nice job cherry picking text.

> 7.1 Discontinuance of Services. Subject to Section 7.2, Google may discontinue > any Services or any portion or feature for any reason at any time without > liability to Customer.

Let's take a look at Section 7.2:

> 7.2 Deprecation Policy. Google will announce if it intends to discontinue or > make backwards incompatible changes to the Services specified at the URL in > the next sentence. Google will use commercially reasonable efforts to continue > to operate those Services versions and features identified at > https://cloud.google.com/terms/deprecation without these changes for at least > one year after that announcement, unless (as Google determines in its > reasonable good faith judgment): > > (i) required by law or third party relationship (including if there is a change > in applicable law or relationship), or > > (ii) doing so could create a security risk or substantial economic or material > technical burden. > > The above policy is the "Deprecation Policy."

To me that looks like a reasonable deprecation policy.

scoot · on Nov 12, 2018

> To me that looks like a reasonable deprecation policy.

It might be, until they jack up the prices 15X with limited notice (looking at you, Google maps [1]). No deprecation needed, just force users off the platform unless they're willing to pay a massive premium.

[1] https://www.google.com/search?q=google+maps+price+increase

jkaplowitz · on Nov 12, 2018

Google Maps has never been subjected to that policy, unlike GCP services. These org chart divisions are real but only clear to Googlers, Xooglers (I'm in this category), and people who pay extremely close attention.

The fact that they're all Google makes reputation damage bleed across meaningfully different parts of what's in truth now a conglomerate under the umbrella name Google.

jjeaff · on Nov 12, 2018

Except all the Google maps setup and API keys are generated from the gcp UI and the billing happens on the cloud platform as well. While maps didn't start as a gcp product, they seem to have rolled it in to gcp fully.

jkaplowitz · on Nov 12, 2018

Not fully. Really what happened is they did a re-org that gave them Google Cloud as an umbrella brand including GCP, Google Maps Platform (this new version of Google Maps as a commercial service), Chrome, Android, G Suite...

The bit of Maps Platform integration for management of the billing and API layer was called out in the announcement blog as an integration with the console specifically, and the docs and other branding around Maps Platform remain distinct from GCP still in excessively subtle ways that Googlers pay more attention to than everyone else, like hosting the docs on developers.google.com instead of cloud.google.com and having Platform in its name separately from Cloud Platform.

This stuff makes sense to Googlers not only because of the org chart but also because Google has a pretty unified API layer technology and because Google put in a lot of work to unify billing tech & management. Reusing that is efficient but not always clear.

But you're right to be confused. Their branding is a mess and always has been. This is the same company that thought Google Play Books makes sense as a product name.

Google's product / PR / comms / exec people are very bad at understanding how external people who don't know Google's org chart and internal tech will perceive these things, or at least bad at prioritizing those concerns.

They live and breathe their corporate internals too much to realize this. Some Google engineers and tech writers realize the confusion but pick other battles to fight instead (like making good quality products).

They do at least document which services are subjected to the GCP Deprecation Policy (Maps is not there): https://cloud.google.com/terms/deprecation

As for what products are actually part of GCP, it's the parts of this page that aren't an external partner's brand name, aren't called out separately like G Suite or Cloud Identity or Cloud Search, and aren't purely open source projects like Knative and Istio (as opposed to the productized versions within GCP), with the caveat that the level so far of integration into GCP of Google acquisitions like Apigee, Firebase, and Stackdriver varies depending on per-company specifics: https://cloud.google.com/products/

G Suite and Cloud Identity accounts can be used with GCP, just like any other Google accounts. They are part of Google Cloud but not Google Cloud Platform.

Hope I waded through the mess correctly for you. :)

matwood · on Nov 12, 2018

The maps price gouge is yet another reason I will not use google services for anything but ancillary services.

omeid2 · on Nov 12, 2018

Let's not forget that Google can change the terms as they please with a 90 Days notice as per Section 7.1 Modifications of the terms. So any promise that is longer than 90 days, even without a escape hatch like Section 7.2, would be legally weak and subject to change at any time without much recourse.

Jedi72 · on Nov 12, 2018

It's ok I guess but still lets them turn it off if, in their judgement, its an economic burden i.e. costing them money.

If they ever do deprecate something people have built on though they're gonna get absolutely crucified. That's probably better protection than any terms of service.

Fizzer · on Nov 12, 2018

> If they ever do deprecate something people have built on though they're gonna get absolutely crucified.

They do this all the time, and they get crucified every time. I built a Google Hangout App and a Chrome App, both of which were platforms eventually shut down.

This is where the meme came from, and it's why I personally stopped building on top of Google products. A 1-year deprecation policy is no assurance to me if I plan for my app to live longer than that.

jkaplowitz · on Nov 12, 2018

Their approach to things like GCP is very different than their approach to those other areas of Google. But they don't separate their branding or unify their deprecation attitudes enough to avoid cross-org-chart reputation damage like this.

BurnGpuBurn · on Nov 12, 2018

So basically they screw you over on either medium or hard, and according to you the real problem is them not telling us clearly whether or not we receive the medium or hard screw over option.

By the way that GCP is so full of loopholes where Google can get out of its obligations its laughable. So it's not even that clear cut that the GCP is really a better alternative.

And even when it turns out to be legally sound, when stuff like this happens, who's going to sue google over it? Nobody, and they know it.

jkaplowitz · on Nov 12, 2018

Oh, courts routinely give binding weight to words like Google's deprecation policy uses, and any large megacorp who is sufficiently badly impacted by a legalese violation (though SLA issues and deprecation issues are two separate things) wouldn't be scared away from a lawsuit by Google being big. I can imagine EU regulatory action or a class action lawsuit as other possible mechanisms.

But as I say in another comment, the contract is less important than both trust and reality. Keep in mind nobody focuses on how AWS doesn't even have a public deprecation policy.

I'm right there with many people in this thread in agreeing that Google has a trust problem, due mostly to real perception issues stemming from Google's habits outside GCP, which can and do impact people's perceptions of what they'll do with GCP.

The reality of what Google has done and will do with GCP, though, is pretty good. Sure they do sometimes deprecate things in ways Amazon never would. But not nearly as often or as abruptly as they do on the consumer side - that would be commercial suicide - and they do other things better than Amazon. Tradeoffs.

BurnGpuBurn · on Nov 12, 2018

> The reality of what Google has done and will do with GCP, though, is pretty good.

No. It's just words. Actions speak louder than words. Googles' actions in the last couple of days spoke pretty loud. No amount of words will change that.

Are you working for Google PR or something?

jkaplowitz · on Nov 12, 2018

I haven't worked for Google since 2015, and I never worked for their PR department. I was just a rank-and-file engineer (and a rank-and-file tech lead for one small team near the end of my time there). If I worked for Google PR, my comments throughout this thread would have far less criticism of the company's messaging and branding than they do. :)

I'm still a fan of GCP as a suite of products and services, as much as I recognize many of Google's organizational failings and disagree with plenty of their product decisions in other areas of Google.

Google (including GCP) has been bad at external communication as long as I've paid attention, and that includes external communications around incidents. What actions are you referring to, beyond poor and confusing communication (i.e. words) around what is or isn't broken or fixed at what points during the incident? That's most of the problem I'm aware of from this incident.

With that said, part of the reason people notice GCP's outages more than AWS's is that GCP publicly notes their outages way more than AWS does. In other words, among the outages that either cloud has, Google much more often creates an incident on their public status page and Amazon much more often fails to.

My "reality of [...] GCP" comment was about the bigger picture of the cloud platform offering, not any one specific incident.

kuyan · on Nov 12, 2018

> It's ok I guess but still lets them turn it off if, in their judgement, its an economic burden i.e. costing them money.

If a service Google runs is losing money, what reason would they have to not shut it down?

lovich · on Nov 12, 2018

With this terms of service, none. Which is why people don't trust them.

If I pay you for a service that would take time to migrate off of, and you are making money off me now, I am going to be ripshit if you decide to just turn it off because it's suddenly not making money for you in the short term. Google's done this a lot, and the fact that don't provide concrete time lines in their contract gives even less reason to trust them

jkaplowitz · on Nov 12, 2018

It's not about the contract. AWS doesn't even have a deprecation policy in the contract - seriously, GCP provides more legally binding guarantees than AWS. It's about trust.

People look at AWS's track record, and trust that. People look at Google's track record, overlook what to an inside-the-company Googler perspective are dramatically significant organizational boundaries or product lifecycle definitions that are very poorly communicated outside the company, mentally apply reputational damage from one part of Google (or from a preview-stage GCP product) to a different part of the company (or to a generally available GCP product), and don't trust that.

Google has always been worse at externally facing PR than at the internal reality, even when I worked there (2011-2015). Major company weakness.

But the internal reality inside GCP, perceptions aside, is pretty good even now.

yardie · on Nov 12, 2018

This is the subtle, but important, difference between SaaS and PaaS/IaaS. Services are to use. Platforms are built upon. Flickr is a service. If they shutdown I'll get another one. If they shutdown I'll just move to another. GCP is a platform, if they shutdown I have to re-architect the entire thing from scratch.

If it's costing them money they haven't figured out a model, yet, that works in their favour.

dodobirdlord · on Nov 12, 2018

Customers won't pay money in the first place to use a service if it may vanish out from under them? I expect a cloud service provider not to offer a service unless they think it is going to be profitable, and I expect them to continue to offer it even if it turns out not to be profitable, because otherwise I will take my business to a cloud service provider that will give me that guarantee.

8ytecoder · on Nov 12, 2018

>Subject to Section 7.2

Which is the deprecation policy. (I mean I share your frustration with Google's what-appears-to-be-at-least haphazard policy of shutting down services instead of trying to gain traction. But, let's not misrepresent what they say).

chrisseaton · on Nov 12, 2018

I thought that 'subject to 7.2' meant that they can use the escape-hatch there 'substantial economic or material technical burden'? They can list anything under that.

I don't think it's wrong - they can deprecate any service they want to do whenever they want, unless people have paid for and signed a contract that says otherwise which I guess people aren't doing.

But the policy doesn't really guarantee anything at all does it, due to the reference escape-hatches? It might as well not exist?

joshuamorton · on Nov 12, 2018

The preceding line is the imprtant part. Which is essentially

"Subject to the deprecation policy [which says that Google will give at least 1 year notice before cancelling services], Google may discontinue..."

In other words, at any time, google can give you a years notice.

(I work at Google, but am not a lawyer and this isn't official in any capacity).

Please don't selectively quote things out of context to give a misleading impression.

chrisseaton · on Nov 12, 2018

But what do these things mean?

> commercially reasonable

> substantial economic or material technical burden

Is one engineer working on an old service to keep it alive commercially reasonable or a substantial burden? I don't know. Do you?

In practice this policy lets them shut off anything they want any time they want. Again it's their playground they can do what they want unless they signed a contract saying they'd do something else for you so I don't have a problem with it.

joshuamorton · on Nov 12, 2018

I think you're ascribing an unreasonable amount of bad faith here, and, to rephrase what I had here before, you're approaching this from an engineering perspective, not a legal one. And that's not how those things work.

To be clear, that policy is a contract. And those things would be decided by a jury. And if my understanding is correct, the reasonable person standard applies. So you can answer this yourself, do you think a reasonable person would believe that your interpretation is valid?

If not, why mention it?

lovich · on Nov 12, 2018

>If not, why mention it?

Because it makes more people feel comfortable enough to use your services and pay you, without actually binding you towards any sort of behavior that would cost you money. There's a direct financial incentive here to use legalese to give the semblance of reliability without having to deliver on it

housingpost · on Nov 12, 2018

I’d say google earned that bad faith this year alone with what they did with maps.

joshuamorton · on Nov 12, 2018

As another user mentioned, maps isn't cloud.

dx034 · on Nov 12, 2018

But it's Google, in the end the same CEO has to sign off on these changes (I'm sure the Maps price hike was approved by C-level and not just the Maps department).

mempko · on Nov 12, 2018

Google lost our faith after shutting down countless services.

xchaotic · on Nov 12, 2018

By the way, that's probably more than $100k/year if he does. Probably triple that with the overhead of the office space, HR and middle management.

chongli · on Nov 12, 2018

But hey, it's all spelled out in the policy, so don't say we didn't warn you!

Caveat emptor, folks.

brian-armstrong · on Nov 12, 2018

What happens when they suddenly deprecate the deprecation policy?

omeid2 · on Nov 12, 2018

Nothing, sort of. Subject to "Section 1.7 Modifications" of Terms:

b. To the Agreement:

Google may make changes to this Agreement, including pricing (and any linked documents) from time to time. .... Google will provide at least 90 days’ advance notice for materially adverse changes to any SLAs by either: (i) sending an email to Customer’s primary point of contact; (ii) posting a notice in the Admin Console; or (iii) posting a notice to the applicable SLA webpage. If Customer does not agree to the revised Agreement, please stop using the Services. Google will post any modification to this Agreement to the Terms URL.

skrebbel · on Nov 12, 2018

So the deprecation policy is "you got 90 days".

whydoineedthis · on Nov 11, 2018

im pretty sure he just forget the /s (sarcasm) on his post, but this was pretty cool information anyway, so thanks!

davemp · on Nov 12, 2018

I think it’s telling of Google’s culture that the corporate arm felt the need to formalize this in law. I won’t pretend to know what it’s telling. Just suggest that you listen for yourself. Look at rule of law versus the ideas of liberty if you’d like a stronger nudge.

jeffcore · on Nov 12, 2018

AWS' Customer Agreement[1] essentially has the same language. I wouldn't be surprised to see similar language from other cloud providers as well. Seems rather prudent on their part.

[1] https://aws.amazon.com/agreement/

davemp · on Nov 12, 2018

I would suggest the inference I was alluding to would also apply amazon.

Note how I never stated the inference. This is because I wanted to share a way of thinking without feeling the responsibility to reply to people attempting to force me to prove some prescriptive, arbitrary inference rule by exhaustion. I do not participate in such practices casually. I also consider it rude to subject people to such practices without consent. I also believe it is a practice that kills online discussion platforms. See this community’s thought provoking guidelines :)

> Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.

justinsb · on Nov 11, 2018

Hi - I work at Google on GKE - sorry about the problems you're experiencing. There's a lot of people inside Google looking into this right now!

It looks like the UI issue was actually fixed, and that we just didn't update the status dashboard correctly. But we're double checking that and looking into some of the additional things you all have reported here.

antpls · on Nov 11, 2018

The status dashboard is inaccurate and/or a lie. It only tells about the GKE incident, while in fact the problem also impacts Google Compute Engine users. I was unable to create any google compute instance today, not even a basic 1vcpu, on NA and Europe-west.

As another comment pointed out, what's the point of having so many zones and redundancy around the globe if such global failure can still happen? I thought the "cloud" was supposed to make this kind of failure impossible

stevehawk · on Nov 12, 2018

This is unfortunately the norm. Like when AWS S3 went down (but couldn't update its own status images because they're in S3 and we all laughed) and along with it went Alexa, lambda, and every other service dependent on S3.

jorblumesea · on Nov 12, 2018

S3 is really the one of the few services on aws that can do that unfortunately. It has no concept of zone/region, it's truly global. To me it seems like a serious design flaw, as everything else in aws is striped by region, but not sure why exactly it was built like that.

edit:

nvm s3 has regions, it's the bucket names that are global.

dodobirdlord · on Nov 12, 2018

Buckets are globally addressable because they planned for each S3 bucket + object key to have an associated URL (actually several), and URLs are a global namespace.

http(s)://<bucket>.s3.amazonaws.com/<object> http(s)://s3.amazonaws.com/<bucket>/<object>

jorblumesea · on Nov 12, 2018

urls would have to be global, but why buckets themselves? It seems like a many to one relationship would easily be possible.

kevan · on Nov 12, 2018

Given the historical context (S3 launched 12 years ago, 5 months before EC2 launched with the us-east-1 region), it's reasonable that S3 buckets were global because regions didn't really exist yet as a concept.

If you look at the docs now[1], new buckets are regionalized and the region is in the URL for non-us-east-1 regions.

[1] https://docs.aws.amazon.com/general/latest/gr/rande.html#s3_...

jorblumesea · on Nov 12, 2018

Got it, makes sense. So the only way you'd have a global bucket is if you created it before a certain date (whenever they were regional-ized, I assume). Thanks!

carbocation · on Nov 11, 2018

> I was unable to create any google compute instance today, not even a basic 1vcpu, on NA and Europe-west.

I've been creating GCP instances in us-central1-a and us-central1-c today without issue. Which zone were you using in NA?

I have been noticing unusual restarts, but I haven't been able to pin down the cause yet (may be my software and not GCP itself).

antpls · on Nov 11, 2018

Tried on us-east, us-north, europe-west, also tried asia, with different instance sizes and with both UI and CLI. None worked for me.

pfd1986 · on Nov 12, 2018

Same here.

aviv · on Nov 11, 2018

Have not seen any restarts this weekend, and we have several hundred instances on GCE.

carbocation · on Nov 12, 2018

Thanks! I'm running Skylake 96 core instances but I haven't given up to try the 64 core instances for comparison yet. If I get another restart, I'll do a 96 vs 64 to try to narrow down the cause. Most likely, of course, this is a software issue on my end, not Google's.

0xbadcafebee · on Nov 12, 2018

> I thought the "cloud" was supposed to make this kind of failure impossible

You have to remember that you're trying to have access to backend platforms and infrastructure at all times, which almost no public utility does (assuming "the cloud" is "public utility computing"). Power plants go into partial shutdown, water treatment plants stop processing, etc. Utilities are only designed to provide constant reliability for the last mile.

If there's a problem with your power company, they can redirect power from another part of the grid to service customers. But some part of your power company is just... down. Luckily you have no need to operate on all parts of the grid at all times, so you don't notice it's down. But failure will still happen.

Your main concern should be the reliability of the last mile. Getting away from managing infrastructure yourself is the first step in that equation. AppEngine and FaaS should be the only computing resources you use, and only object storage and databases for managing data. This will get you closer to public utility-like computing.

But there's no way to get truly reliable computing today. We would all need to use edge computing, and that means leaning heavily on ISPs and content provider networks. Every cloud computing provider is looking into this right now, but considering who actually owns the last mile, I don't think we're going to see edge computing "take over" for at least a decade.

aiisjustanif · on Nov 12, 2018

> I thought the "cloud" was supposed to make this kind of failure impossible

If set up properly to be utilized correctly, yeah. But, it's not a perfect world though.

davemp · on Nov 12, 2018

I’ll suggest considering whether entities enamored with centralizing ideals are more likely to fail to properly realize the robustness of a distributed system.

aviv · on Nov 11, 2018

We have created GCE instances in several US regions without any issue today. Last one was 10 minutes ago in west2.

marcinzm · on Nov 11, 2018

I appreciate all the effort you're putting in and I understand such situations can be stressful but user's having to depend on someone responding on hacker news for status updates seems really amateur for an organization the size of google.

NicoJuicy · on Nov 11, 2018

The default is : https://status.cloud.google.com/incident/container-engine/18...

People who respond here could be employees of Google, caring about it and respond here because they know it.

What he can mention ( a lot of people are working on it) is what you can suspect when something is going down. All other cloud providers do the same.

marcinzm · on Nov 11, 2018

The default you linked to has not been updated in 2 days... which is my whole point regarding having to rely on hacker news for any status updates.

edit: The default is also only about the UI issue and there's no issue tracker for the broader non-UI disruptions going on since Friday.

Waterluvian · on Nov 11, 2018

Even an update of "no change" is tremendously valuable.

trhway · on Nov 11, 2018

>really amateur for an organization the size of google.

There is a reason while Google have been having hard time making inroads in the enterprise cloud. Kind of impedance mismatch between enterprise and the Google style. That 2 stories like high "We heart API" sign on the Google Enterprise building facing 237 just screams about it :)

rdtsc · on Nov 12, 2018

Strangely and sadly with gmail account blocking and other such issues HN and Twitter is often better way to get Google's support than to contact support.

ben_jones · on Nov 11, 2018

As much as I love bashing big corps I see HN as a supplementary communication channel for products like GCP - its a luxury we get to access alongside normal customer support channels in the GCP console, twitter, etc.

yeukhon · on Nov 11, 2018

Let me put it this way. HackerNews, or in fact, any news outlets are not official. Customers should be getting emails from Google and be informed on its official webpage to explain what's going on. You don't want your neighbor to tell you you owe taxes. You want the government to send a notice to you.

thwy12321 · on Nov 11, 2018

Critical service is failing, minimal information about why, but we should be so happy someone says a few sentences on here? For all of the engineering elitism coming out of google, Amazon is way more on their game across a number of products.

tomcam · on Nov 12, 2018

Thanks for jumping in here on your own time. The following question is not meant to be hostile, it is merely curiosity. Isn’t this supposed to be the kind of thing that monitoring and diagnostics software should find automatically? Serious question, not meant to embarrass you.

rlancer · on Nov 11, 2018

Creating clusters via the UI is still not working for me.

rlancer · on Nov 11, 2018

UPDATE: Created a Cluster successfully in Australia... Still not able to do so in the US.

nexus7556 · on Nov 12, 2018

Have you tried via the gcloud command?

fizzledbits · on Nov 12, 2018

As of this morning, I am still unable to reliably start my docker+machine autoscaling instances. In all cases the error is "Error: The zone <my project> does not have enough resources available to fulfill the request" An instance in us-central1-a has refused to start since last Thursday or Friday.

I created a new instance in us-west2-c, which worked briefly but began to fail midday Friday, and kept failing through the weekend.

On Saturday I created yet another clone in northamerica-northeast1-b. That worked Saturday and Sunday, but this morning, it is failing to start. Fortunately my us-west2-c instance has begun to work again, but I'm having doubts about continuing to use GCE as we scale up.

And yet, the status page says all services are available.

dilyevsky · on Nov 12, 2018

So, given that i filed this months ago via official support and it’s still not fixed, can you look into misleading container memory reporting ui bug. It reports memory_total but should be working_set

hacknat · on Nov 11, 2018

Question to Google employees:

Why do you guys suffer global outages? This is your 2nd major global outage in less than 5 years. I’m sorry to say this, but it is the equivalent of going bankrupt from a trust perspective. I need to see some blog posts about how you guys are rethinking whatever design can lead to this - twice - or you are never getting a cent of money under my control. You have the most feature rich cloud (particularly your networking products), but down time like this is unacceptable.

illumin8 · on Nov 12, 2018

Google has a global SDN (software-defined network) that gives them some unique and beneficial capabilities, like being able to onboard traffic in the closest CDN POP and letting it ride over the Google backbone to the region your systems are running in.

The problem is that running a global SDN like this means if you do something wrong, you can have outages that impact multiple regions simultaneously.

This is why AWS has strict regional isolation and will never create cross-region dependencies (outside of some truly global services like IAM and Route 53 that have sufficient redundancy that they should (hopefully) never go down).

Disclaimer: I work for AWS, but my opinions are my own.

tehlike · on Nov 11, 2018

2 outage in 5 years sounds pretty low, to be honest.

Disclaimer: google employee in ads, who worked on many many fires throughout the years, but talking from my personal perspective and not from my employer. I am sure we are striving to have 0, but realistically, i have seen many that says things happen. Learn, and improve.

noselasd · on Nov 11, 2018

The issue people have with it is that it's global, not regional, indicating that there are dependencies in the entire architecture that people does not expect to be there.

B-Con · on Nov 12, 2018

There are many other possible causes for global outages, that specific one is not high on my list of likely culprits.

blazespin · on Nov 11, 2018

Yes, hello, canaries anyone

tehlike · on Nov 11, 2018

Plenty of bugs happen despite canaries.

hexchain · on Nov 12, 2018

Just like YouTube a few weeks ago?

tehlike · on Nov 12, 2018

Yt, ads, a bunch of services. Sure.

yeukhon · on Nov 11, 2018

5 years? I remember a major outage maybe in the past year.

kenhwang · on Nov 11, 2018

I believe there was a multi-hour global YouTube/Bigtable/Cloud SQL/Datastore outage in October.

Then there was the global load balancer outage in July.

Looking though the incident history, there were essentially monthly multi-region or global service disruptions of various services.

origami777 · on Nov 12, 2018

Most feature rich cloud? I think that title belongs to AWS.

jkaplowitz · on Nov 12, 2018

You're right in terms of breadth officially covered. But if you look at the features where they both officially have support, there are many examples where the GCP version is more reliable and usable than the AWS version. Even GKE is an example of this, despite the outage in node pool creation that we're discussing here. Way better than EKS.

(Disclosure: I worked for Google, including GCP, for a few years ending in 2015. I don't work or speak for them now and have no inside info on this outage.)

illumin8 · on Nov 12, 2018

I think you're going to have to back up a claim like this with some facts.

GKE being the exception, since it was launched a couple years before EKS. AWS clearly has way more services, and the features are way deeper than GCP.

Just compare virtual machines and managed databases, AWS has about 2-3x more types of VMs (VMs with more than 4TB of RAM, FPGAs, AMD Epyc, etc.), and in databases, more than just MySQL and PostgreSQL. When you start looking at features you get features that you just can't get in GCP, like 16 read-replicas, point in time recovery, backtrack, etc.

Disclaimer: I work for AWS but my opinions are my own.

jkaplowitz · on Nov 12, 2018

Each platform has features the other platform doesn't, even though AWS has more.

Some of GCP's unique compelling features include live VM migration that makes it less relevant when a host has to reboot, the new life that has recently been put into Google App Engine (both flexible environment and the second generation standard environment runtimes), the global load balancer with a single IP and no pre-warming, and Cloud Spanner.

In terms of feature coverage breadth I started my previous comment by agreeing that AWS was ahead, and I still reaffirm that. But if you randomly select a feature that they both have to a level which purports to meet a given customer requirement, the GCP offering will frequently have advantages over the AWS equivalent.

Examples besides GKE: BigQuery is better regarded than Amazon Redshift, with less maintenance hassle. And EC2 instance, disk, and network performance is way more variable than GCE which generally delivers what it promises.

One bit of praise for AWS: when Amazon does document something, the doc is easier to find and understand, and one is less likely to find something out of date in a way that doesn't work. But GCP is more likely to have documented the thing in the first place, especially in the case of system-imposed limits.

To be clear, I want there to be three or four competitive and widely used cloud options. I just think GCP is now often the best of the major players in the cases where its scope meets customer needs.

illumin8 · on Nov 12, 2018

Redshift is not a direct competitor with BigQuery. It's a relational data warehouse. BigQuery more directly competes with Athena, which is a managed version of Apache Presto, and my personal opinion is that Athena is way better than BigQuery because I can query data that is in S3 (object storage) without having to import it into BigQuery first.

Disk and network performance is extremely consistent with AWS so long as you use newer instance types and storage types. You can't reasonably compare the old EBS magnetic storage to the newer general purpose SSD and provisioned IOPS volume types, and likewise, newer instances get consistent non-blocking 25gbps network performance.

I'm not so sure I would praise our documentation; it is one of the areas that I wish we were better at. Some of the less used services and features don't have excellent documentation, and in some cases you really have to figure it out on your own.

GCP is a pretty nice system overall, but most of the time when I see comparisons, when GCP looks better its because the person making the comparison is comparing the AWS they remember from 5-6 years ago with the GCP of today, which would be like comparing GAE from 2012 with today.

jkaplowitz · on Nov 12, 2018

The comments I made about Redshift vs BigQuery and about disk/network/etc reflect current opinions of colleagues who use AWS currently (or recently in some cases) and extensively, not 5-6 year old opinions. Even my own last use of AWS was maybe 2-3 years ago, when Redshift was AWS's closest competitor to BigQuery and when I saw disk/network issues directly.

You're right that Athena seems like the current competitor to BigQuery. This is one of those things that are easy to overlook when people made the comparison as recently as a couple of years ago (before Athena was introduced) and Redshift vs BigQuery is still often the comparison people make. This is where Amazon's branding is confusing to the customer: so many similar but slightly different product niches, filled at different times by entirely different products with entirely unrelated names.

When adding features, GCP would usually fill adjacent niches like "serverless Redshift" by adding a serverless mode to Redshift, or something like that, and behavior would be mostly similar. Harder to overlook and less risky to try.

Meanwhile, when Athena was introduced, people who had compared Redshift and BigQuery and ruled out the former as too much hassle said "ah, GCP made Amazon introduce a serverless Redshift. But it's built on totally different technology. I wonder if it will be one of the good AWS products instead of the bad ones." (Yes, bad ones exist. Amazon WorkMail is under the AWS umbrella but basically ignored, to give one example.)

And then they go back to the rest of their day, since moving products (whether from Redshift or BigQuery) to Athena would not be worth the transition cost, and forget about Athena entirely.

On the disk/network question, no I didn't see performance problems with provisioned IOPS volume types, but that doesn't matter: for GCE's equivalent of EBS magnetic storage, they do indeed give what they promise, at way less cost than their premium disk types. There's no reason it isn't a fair comparison.

And for the "instance" part of my EC2 performance comment, I was referring to a noisy neighbor problem where sometimes a newly created instance would have much worse CPU performance than promised and so sometimes delete and recreate was the solution. GCE does a much better job at ensuring the promised CPUs.

I'm glad AWS and GCP have lots of features, improve all the time, and copy each other when warranted. But I don't think the general thrust of my comparison has gone invalid, even if my recent data is more skewed toward GCP and my AWS data is skewed toward 2-3 years old. Only the specifics have changed (and the feature gap narrowed with respect to important features).

electrum · on Nov 14, 2018

Presto is not an Apache project (although it is open source under the Apache License).

hacknat · on Nov 12, 2018

Yeah. Perhaps feature rich was an overstatement. I meant that when GCP does do a product it works like I’d expect it to work and has the features I need. Not always the case with a AWS, particularly around ELBs and VPCs.

_wmd · on Nov 11, 2018

It is a natural effect of building massive yet flat homogeneous systems, failures tend to be greatly amplified.

Most of what you can read of Google's approach will teach you their ideal computing environment is a single planetary resource, pushing any natural segmentation and partitioning out of view.

toomuchtodo · on Nov 11, 2018

> I’m sorry to say this, but it is the equivalent of going bankrupt from a trust perspective.

It's the opposite really: the expectation that service providers have no unexpected downtime is unrealistic, and it's strange this idea persists.

Twirrim · on Nov 11, 2018

(disclaimer: I work for another cloud provider)

I agree, in general, outages are almost inevitable, but global outages shouldn't occur. It suggests at least a couple of things:

1) Bad software deployments, without proper validation. A message elsewhere in this post on HN suggest that problems have been occurring for at least 5 days, which makes me think this is the most likely situation. If this is the case, presumably given this is multiple days in to the issue, rolling back isn't an option. That doesn't say good things about their testing or deployment stories, and possibly their monitoring of the product? Even if the deployment validation processes failed to catch it, you'd really hope alarming would have caught it.

or:

2) Regions aren't isolated from each other. Cross-region dependencies are bad, for all sorts of obvious reasons.

toomuchtodo · on Nov 11, 2018

That shouldn't, but they do. S3 goes down [1]. The AWS global console goes down, right after Prime Day outages [2]. Lots of Google Cloud services go down [3, current thread]. Tens of Azure services go down hard [4].

Are software development and release processes improving to mitigate these outages? We don't know. You have to trust the marketing. Will regions ever be fully isolated? We don't know. Will AWS IAM and console ever not be global services? We don't know.

Blah blah blah "We'll do better in the future". Right. Sure. Some service credits will get handed out and everyone will forget until the next outage.

Disclaimer: Not a software engineer, but have worked in ops most of my career. You will have downtime, I assure you. It is unavoidable, even at global scale. You will never abstract and silo everything per region.

[1] https://www.theregister.co.uk/2017/03/01/aws_s3_outage/

[2] https://www.cnbc.com/2018/07/16/aws-hits-snag-after-amazon-p...

[3] https://www.cnet.com/news/google-cloud-issues-causes-outages...

[4] https://www.datacenterknowledge.com/uptime/microsoft-blames-...

ignoramous · on Nov 12, 2018

Can't speak for Google, but Facebook and Salesforce chose Cells for HA.

http://highscalability.com/blog/2012/5/9/cell-architectures....

toomuchtodo · on Nov 12, 2018

Doesn't look like it was all that helpful to Facebook (as of 1542038976). Facebook.com errors out currently.

> Facebook Platform Appears to be down

> A check of https://developers.facebook.com/status/dashboard/ returns an error and I'm unable to login with facebook to some of my mobile apps.

https://news.ycombinator.com/item?id=18434262