Google OAuth Is Failing with 500 Error Code (cloud.google.com)
163 points by adige01can 32 days ago | 54 comments

Disclosure: Former Googler

I didn't think I would laugh that hard coming in here and reading these comments, but here we are. I can virtually guarantee that the following things are not the root of the problem:

* Lack of compute/networking/storage

* Incompetence of employees

* Back to school traffic spikes

* Just about anything else here

My $0.02: these are almost always due to bad roll outs, usually configuration changes. But I've been wrong countless times before!

"these are almost always due to bad roll outs, usually configuration changes." Some would consider that falling under #2. (Not me though)

Given that "87.623% of all outages are caused by changes" is generally accepted wisdom in running large-scale services, I would tend to disagree with those people. Incompetence: no. Opportunity for improvement: yes.

Edit: made-up number has large margin of error

Google often has an outage or two around this time of the year when all the US schools come back and millions of students log in at the same time.

I thought that's because this is the time when interns' code gets deployed

Sounds pretty anecdotal. Can you back up this claim?

We run an app making millions of API calls to Google every day, so they show up in our monitoring, even if they don't make the status page. It's a pattern I've noticed from getting paged at 3am (NZ time) this time of year going back the last 4 years; I don't have hard data at hand for it though. It's not the same thing every year either - this time OAuth got hit hard, previously we've mostly seen slowdowns and higher error rates on the Drive APIs.
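For readers wondering how a client-side monitor can catch incidents that never reach a status page, here is a minimal sketch of a sliding-window error-rate check. The class name, window length, threshold, and minimum sample count are all assumptions made for illustration; the commenter's actual monitoring setup is not described.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the share of 5xx responses seen in a sliding time window.

    Hypothetical helper: all defaults below are illustrative assumptions.
    """

    def __init__(self, window_seconds=300, threshold=0.05, min_samples=20):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_samples = min_samples
        self._events = deque()  # (timestamp, is_error) pairs, oldest first
        self._errors = 0

    def record(self, timestamp, status_code):
        """Record one API response and drop events older than the window."""
        is_error = status_code >= 500
        self._events.append((timestamp, is_error))
        self._errors += is_error
        self._evict(timestamp)

    def _evict(self, now):
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, was_error = self._events.popleft()
            self._errors -= was_error

    def error_rate(self):
        if not self._events:
            return 0.0
        return self._errors / len(self._events)

    def should_alert(self):
        # Require a minimum number of samples so one failed call at a
        # quiet hour does not page anyone.
        enough = len(self._events) >= self.min_samples
        return enough and self.error_rate() > self.threshold
```

Feeding every response into `record(timestamp, status_code)` and paging when `should_alert()` returns true would surface a burst of 500s from an upstream API whether or not the provider has acknowledged an incident.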

Says something that they know the what, where, when, and why, but they still don't (or won't) prevent it.

It's a matter of cost.

It's a temporary spike of traffic far outside of average use.

In order to handle it, they'd need a lot of resources that just aren't justifiable for the rest of the year.

You see similar things when places sell highly desired items in online stores.

They're a bleeding cloud provider. I think they can temporarily scale up a bit.

Temporarily build a data center?

Temporarily shut down other people's services to favor their own?

I guess you're not familiar with how cloud services work. There's (generally) plenty of excess capacity.

Within Google, there is plenty of wasted capacity, but getting capacity for a service to scale up is a fiendishly difficult task. They even employ hundreds of people whose main role is to try to allocate all the various types of resources to teams. "Oh - you want 1700 GB of bigtable in Reliability zone 3 in Atlanta? I'm afraid your department doesn't own any there - you can trade with the ads department who aren't using theirs, if you give them 2Tbits of network bandwidth between Peru and Brazil? That trade will only be till the end of the quarter though, because then they need it back."

It's the (generally) that's the issue here, now isn't it?

I didn't see any notifications that GCP was out of capacity. Did I miss one?

GCP capacity and capacity for Google services are fairly isolated. They never coexist on the same racks.

Doubtful. I put that in since absolute statements are always dangerous. I'd say at any given time, there's an excess of 20-50%. There's virtually no chance they were lacking capacity.

Just as I was implementing and testing high-priority Google SSO changes at work... it goes down :-)

You broke Google. I hope you feel good about yourself.

Better now when you can test your error handling than at 3:30am when your high priority service goes down :)
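As a rough illustration of the kind of error handling worth testing during an outage like this, here is a minimal sketch of retrying transient failures (HTTP 5xx and timeouts) with capped exponential backoff and jitter. `call_with_backoff`, `TransientError`, and the `request` callable are hypothetical names invented for this example, not part of any Google client library, and the delay values are assumptions.

```python
import random
import time


class TransientError(Exception):
    """Raised when every attempt ended in a 5xx response or a timeout."""


def call_with_backoff(request, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Call `request()` until it returns a non-5xx status or attempts run out.

    `request` is a hypothetical zero-argument callable that returns a
    (status_code, body) tuple and may raise TimeoutError.
    """
    last_failure = None
    for attempt in range(max_attempts):
        try:
            status, body = request()
        except TimeoutError as exc:
            last_failure = f"timeout: {exc}"
        else:
            if status < 500:
                # Success or a 4xx client error: retrying will not help.
                return status, body
            last_failure = f"HTTP {status}"
        if attempt < max_attempts - 1:
            # Exponential backoff with full jitter, capped at max_delay,
            # so many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
    raise TransientError(
        f"gave up after {max_attempts} attempts ({last_failure})"
    )
```

The `sleep` parameter exists only so the waits can be stubbed out in tests. Once `TransientError` is raised, the caller still has to decide what the user sees, which is the part an outage actually exercises.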

I would!

same. and also thought that i broke it.

btw never saw 500 from G before, usually just timeouts during outages

Same :-/

It sure feels like there have been quite a few big outages this summer (Google in particular). I wonder if they are getting sloppy or this is just bad luck?

My gut reaction is that many companies make the majority of their income/deals during the fall/winter/spring (think back to school shopping, Christmas, re-signing contracts for the next year, tax season, etc.) Thus, many companies try to make major infrastructure changes during the summer, when there will be a slightly smaller negative impact on the business's bottom line in case something does go wrong.

Another way to think about that is that PMs spend three quarters of the year distracted by meetings with external clients, and only have summer to actually ride their teams on implementing the things they've negotiated.

My take is that many people are out on vacation during the summer months and sometimes things break (or break harder than usual) when certain knowledgeable people aren't available.

Also, summer interns. A friend who worked at FB said outages there go up markedly when interns start pushing to prod.

> interns start pushing to prod

hey everybody, I spotted the problem! I'll keep an eye on my mailbox for the giant consulting fee that I assume is en route.

You really think interns are allowed to push code to production at FB?

I worked at FB for four years, and they are in fact allowed to do so.

Google only hires the best of the best.

This is one of our favorite jokes internally, often invoked when new signage is put up to clarify how trashcans work or the like.

The best of the best of the best, Sir! With Honours!

Maybe memorising algorithms from a textbook does not translate to great system engineering and SRE skills? Who would have thought.

I'm glad you have a good handle on the data about the predictive power Google interviews have about the individuals.

Can you share the data?

I have a good handle on how the customer base for cloud computing services perceives cloud vendors, and AWS is way ahead of everyone else by a mile.

This is because they have a far better and more complete platform, and it appears to be better engineered too, with fewer of the downtime incidents we’ve been seeing at the other vendors.

Why do you suspect this is? Let me tell you. Amazon had better people. Full stop.

That's a whole lot of words to say "no"

It’s random. You might as well read chicken entrails; at least then there’d be less racial and gender bias in the process.

This would also require killing a lot of chickens, which feels less pleasant.

I can't log in to my business Gmail account using Google Chrome, but I can log in successfully using Internet Explorer and Firefox. Duh. :-)

Chrome in incognito mode seems to work as well.

I like the majority of GCP products I work with, but judging by the number of issues in the past year, GCP feels amateurish compared to AWS, which we continue to use in order to host most mission-critical operations.

I can't access any of my Airflow clusters atm. :/

To whoever downvoted me: censoring the fact that customers have issues doesn't make the product better.

Login from incognito works for me so I assumed one of my extensions was causing an issue somehow and started disabling them. I'm sorry I doubted you, extensions.

Link to incident on Google Cloud status:


My users are reporting that it's back up, or at least intermittently working now.

When you use Google OAuth, you give Google the power to turn you off. Think about it.

Unless you own the complete path from server to user (and no one does) there is always someone who can “turn you off”.

Remember when you could route around failures? Have connections to two different backbone providers? Does anyone still do that?

Googler not on Cloud team, but using Google Cloud Platform for an internal project. I've encountered my share of bugs and other flaws while using this platform. I think all these platforms are just too damn complex and brittle. It's easy for even a smart SWE or SRE to overlook one little thing that will bring down a bigger part of the system.

Is this just my perception or is GCP really down a lot this year?


Eh, I was stating a fact, don't understand why it got downvoted.
