Google outage – resolved
2316 points by abluecloud on Dec 14, 2020 | 827 comments
various services are broken

- youtube returning error

- gmail returning 502

- docs returning 500

- drive not working

status page now reflecting outage: https://www.google.com/appsstatus

--------------

services look to be restored.




If you pay for Google Services, they have an SLA (service level agreement) of 99.9% [1]. If their services are down more than 43 minutes this month[2], you can request “free days” of usage.

Edit: Services were down from ~12:55pm to ~1:52pm, which is 57 minutes. Thanks hiby007

[1] https://workspace.google.com/intl/en/terms/sla.html

[2] https://en.wikipedia.org/wiki/High_availability#Percentage_c...
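
For reference, the 43-minute figure falls straight out of the 99.9% number; a quick back-of-the-envelope check (plain Python, assuming a 30-day month):

    # 99.9% monthly uptime leaves roughly 43 minutes of allowed downtime.
    minutes_per_month = 30 * 24 * 60                 # 43,200 minutes in a 30-day month
    print(round(minutes_per_month * (1 - 0.999), 1)) # 43.2 minutes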


If all 6 million G Suite customers, averaging 25 users per account and each paying the $20/user fee, requested the three-day credit for this breach of the SLA contract, it'd cost Google about 300 million dollars.

Which is 0.22% of their COH this quarter...
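
A rough check of that figure, using the assumptions above (6M customers, 25 users each, $20/user/month, a 3-day credit out of a 30-day month):

    customers, users_per_customer, fee = 6_000_000, 25, 20
    monthly_revenue = customers * users_per_customer * fee   # $3,000,000,000 per month
    credit = monthly_revenue * 3 / 30                        # three free days
    print(f"${credit:,.0f}")                                 # $300,000,000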


Or, on a regular basis, they'd get $300M every other month (excluding any fees).


Your nines or their nines?

I bet if you personally can't use it, but their overall reliability meets the bar, then they're within SLA.

Don't ask why I know this.


You know, there are companies that build crawlers or health-check agents exactly for this purpose, so that they know precisely from when to when a service they pay for, or need for their business, was down and out of SLA. I think it's brilliant and the only way to make any company pay. I believe you can sometimes get away with a couple of Pingdom checks/jobs.
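
A minimal sketch of such an independent availability probe (the URL and interval are placeholders; real monitoring products do this with far more rigor):

    # Hypothetical external probe: keep your own timestamped evidence of
    # failures for later SLA claims.
    import time
    import urllib.request

    URL = "https://mail.google.com/"        # assumption: a service you pay for
    while True:
        try:
            urllib.request.urlopen(URL, timeout=5)
        except Exception as exc:
            stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
            print(f"{stamp} DOWN {exc}")
        time.sleep(60)                      # one check per minute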


Like they have any bargaining power against Google.


I don't know how it works in such cases; I mean, it's publicly known that there was an outage.

What I believe is that customers will probably get free GCP credits and that's it, everything is good as before.


This person service-provides


In [1] it says: "Customer Must Request Service Credit." Do you know how to request it?


Admins can go to admin.google.com and click the help button to start a support request.



Is this an official announcement? I can't find any link to Google in the URL you posted.


not an announcement, seems like just a statement. I assume an official statement will come with the post mortem blog post in the next few days.


If you click on the red status dots, there's a report with timings.


SLAs are largely bullshit.


why?


This topic came up recently on a podcast I was on, where someone said a large service was down for days and the outage tanked his entire business the whole time. But he was compensated in hosting credits for the exact amount of downtime of the one service that caused the issue. It took so long to resolve because it took support a while to figure out that the problem was their service, not his site.

So then I jokingly responded with that being like going to a restaurant, getting massive food poisoning, almost dying, ending up with a $150,000 hospital bill and then the restaurant emails you with "Dear valued customer, we're sorry for the inconvenience and have decided to award you a $50 gift card for any of our restaurants, thanks!".

If your SLA only provides for precisely calculated credits, that's not really going to help in the grand scheme of things.


I like your anecdote, I might steal that one.

IANAL, but I negotiate a lot of enterprise SaaS agreements. When considering the SLA, it is important to remember it is a legal document, not an engineering one. It has engineering impact and is up to engineering to satisfy, but the actual contents of it are better considered when wearing your lawyer hat, not your engineering one.

e.g., What you're referring to is related to the limitation of liability clauses and especially "special" or "consequential" damages -- a category of damages that are not 'direct' damages but secondary. [1]

Accepting _any_ liability for special or consequential damages is always a point of negotiation. As a service provider, you always try to avoid it because it is so hard to estimate the magnitude, and thus judge how much insurance coverage you need.

Related, those paragraphs also contain a limitation of liability clause, often capped at X times annual cost. It doesn't make much sense to sign up a client for $10k per year but accept $10M+ of liability exposure for them.

This is just scratching the surface -- tons of color and depth here that is nuanced for every company and situation. It's why you employ attorneys!

1 - https://www.lexisnexis.com/lexis-practical-guidance/the-jour...


> Doesn't make much sense to sign up a client for $10k per year but accept $10M+ liability exposure for them.

Businesses do this all the time, this is how they make money. And they use a combination of insurance and not %@$#@*! up.


I've never seen an SLA that compensates for anything more than credit off your bill. I can't imagine a service that pays for loss of productivity, one outage and the whole company could be bankrupt. If your business depends on a cloud service for productivity you should have a backup plan if that service goes down.


I haven't seen one (at least from a SaaS company) that will compensate for loss of productivity/revenue etc., but something like Slack's SLA [0] seems like it's moving in the right direction. They guarantee 99.99% uptime (a maximum of about 4 minutes 22 seconds of downtime per month) and give 10x credits for any downtime.

Granted, there probably aren't many businesses losing major revenue because Slack's down for half an hour, but it's nice to at least see them acknowledge that 1 minute down deserves more than 1 minute of refunds!

[0] https://slack.com/terms/service-level-agreement
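
The 4-minute figure is the same arithmetic as the 99.9% case above, just with one more nine (roughly 4.3-4.4 minutes depending on how you count the length of a month):

    minutes_per_month = 30 * 24 * 60                   # 43,200
    print(round(minutes_per_month * (1 - 0.9999), 2))  # ~4.32 minutes of allowed downtime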


> I haven't seen one (at least for a SaaS company) that will compensate for loss of productivity/revenue

They won't show up on automated systems aimed at SMEs, but anybody taking out an "enterprise plan" with tailored pricing from a SaaS, will likely ask for tailored SLA conditions too (or rather should ask for them).


It's hard to compensate for lost profit, as you would then have to know the customer's profit beforehand and price that risk in. It's almost like insurance!


In financial markets I have seen SLAs where you make people whole on the losses they incur due to downtime you inflict on them.


Seems like you want insurance. As with the hospital bill you'd generally be paying a bunch of extra money for your health insurance plan to not get stuck with the bill.

Not sure that exists for businesses, but I'd expect you'd need to go shopping separately if you want that.

Seems like a good business idea if it doesn't exist.


I think the idea here is that if the payment for an SLA breach is just "don't pay for the time we were down" or (as I've seen in other SLAs) "half off during the time we were down", that doesn't feel like much of an incentive for the service provider.

They have other incentives, obviously, like if everyone talks about how Google is down then that's bad for future business. But when thinking of SLAs I'm always surprised when they're not more drastic. Like "over 0.1% downtime: free service for a month".


Independent 'a service was down' insurance isn't the same though. It is important for the cost to come out of the provider's pocket, thus giving them a huge financial incentive to not be down. Having that incentive in place is the most important part of an SLA.


Even with insurance, some of the cost will come out of the provider's pockets - as increased premiums at renewal (or even immediately, in some cases). Insurers might also force other onerous conditions on the provider as a prerequisite for continued coverage.


I hear you, but there's going to be a cost for that. For the sake of argument, say Google changes the SLA as you wish and ups the cost of their offering accordingly.

Would they gain or lose market share?

I don't think it's obvious one way or the other.


> . . . like going to a restaurant, getting massive food poisoning, almost dying, ending up with a $150,000 hospital bill and then the restaurant emails you with "Dear valued customer, we're sorry for the inconvenience and have decided to award you a $50 gift card for any of our restaurants, thanks!".

It's even slightly worse than that. SLAs generally refund you for the prorated portion of your monthly fee the service was out, so it's more like "here's a gift card for the exact value of the single dish we've determined caused your food poisoning." Hehe.


You're right and the funny thing is that's exactly what he said after I chimed in.


Completely agree with your analogy but have you ever seen any SLA that provides any additional liability? I haven't seen them - you are stuck either relying on hosting services SLA or DIY.


I cannot share details for the obvious reason, but yes - there are SLAs signed into contracts directly behind the scenes which result in automatic payouts if a condition isn't met, and it's not a simple credit.

Enterprise level SLAs are crafted by lawyers in negotiations behind the scenes and are not the same as what you see on random public services. Our customers have them with us, and we have them with our vendors. Contract negotiations take months at the $$$$ level.


That is a fair point. Is this a situation where you are asymmetrically powerful? I have to imagine you would need considerable clout, representing a fair bit of their revenue, in order to dictate terms. When I wrote my comment it was in the vein of a smaller organization.


I am but a technical cog in the machine, my friend; while I know about what goes on in business and contract negotiations, I cannot comment on power dynamics. I would assume it's like any other negotiation - whoever has the greatest leverage has the power; I doubt it's ever fairly balanced.


Or purchasing business continuity insurance.


Not OP, but how do you measure them? Let's say, for example, you can send and receive email, but attaching files does not work. Is the service up or down?

What if the majority of your users can access the service, but one of your BGP peers is not routing properly and some of your users are unable to access?


Google do a very good job of defining their SLAs.

In answer to your question, they'll accept evidence from your own monitoring system when you claim on the SLA. They pair that up with their own knowledge about how the system was performing, then make the grant.

Google are exceptionally good at this, from my experience. Far better than most other companies, who aim to provide as little detail as possible while getting away with 'providing an SLA'.


The SLA itself should specify the way availability is measured.


Down, because email attachments are base64-encoded files written as plain text into the body. So if those are not working, email itself is not working.
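
That's easy to see with Python's standard library; the attachment ends up as a base64-encoded MIME part inside the same message text:

    # An "attachment" is just another base64-encoded part of the message body.
    from email.message import EmailMessage

    msg = EmailMessage()
    msg["Subject"] = "demo"
    msg.set_content("See attached.")
    msg.add_attachment(b"\x00\x01 some binary payload",
                       maintype="application", subtype="octet-stream",
                       filename="blob.bin")
    # Prints the whole message, including "Content-Transfer-Encoding: base64"
    # and the encoded payload inline.
    print(msg.as_string())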


that was a bad example. i guess the comment was trying to say "how do you account for partial service degradations".

(i dont think SLAs are BS btw)


In this example: you get free days. Which depending on your business might be worthless if you have suffered more monetary loss due to the downtime than the free days are worth.


But still better than nothing. And for some (most?) people/businesses, probably worth more than any resulting monetary loss


Exactly; downtime doesn't cost a cloud service much. At worst it causes reputation damage, with possibly large companies deciding to go for a competitor, losing a contract worth tens or hundreds of millions.


Given the blast radius of this (all regions appear to be impacted) along with the fact that services that don't rely on auth are working as normal, it must be a global authN/Z issue. I do not envy Google engineers right now.


> I do not envy Google engineers right now.

A few years ago I released a bug in production that prevented users from logging into our desktop app. It affected about 1k users before we found out and rolled back the release.

I still remember a very cold feeling in my belly, barely could sleep that night. It is difficult to imagine what the people responsible for this are feeling right now.


When I was interviewing at Morgan Stanley, I asked "how do you do this job if a mistake can cost people money?".

The answer was "well, if you don't do anything, you make NO money".


I'm reminded of the quote from Thomas J. Watson:

> Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. No, I replied, I just spent $600,000 training him. Why would I want somebody to hire his experience?


Agreed, and also it's worth noting that we're talking about companies here. Yes, for any individual the amount of money lost is insane, but that's the risk for the company. If one individual can accidentally nearly bankrupt the company, then the company did not have proper risk management in place.

That isn't to say that it wouldn't also affect my sleep quality.


Sadly a lot of managers don't see it this way, they'd rather assign blame.


Depending on the company's culture, that calculus can vary. If management start firing subordinates for making mistakes, then what should be done to management if they fail to account for human error, resulting in multimillion-dollar losses?


Well, why, good Sir - don't they usually get bonuses?


They'd rather avoid the blame hitting them


That is because they are managers, most got where they are by assigning rewards to themselves.


Welp, as a new grad there, I had brought down one very important database server on a Sunday night (a series of really unfortunate events). Multiple senior DBAs had to be involved to resuscitate it. It started functioning normally just a few hours before market open in HK. If it was any later, it would have been some serious monetary loss. Needless to say, I was sweating bullets. Couldn't eat anything the entire day lol. Took me like 2 days to calm down. And this was after I was fully shielded cuz I was a junior. God knows what would've happened if someone more experienced had done that.


I brought down the order management system at a large bank during the middle of the trading day. The backup kicked in after about a minute but it was not fun on the trading floor.


I'm so glad I'm not the only one feeling deployment anxiety. The project I'm involved in doesn't really have serious money involved, but when there's a regression found only after production deployment my stress levels go up a notch.


When I was working at a pretty big IT provider in the electronic banking sector, we (management and senior devs) made it an unspoken rule that:

- Juniors shall also handle production deployments regularly.

- A senior person is always on call (even if only unofficially / off the clock).

- Junior devs are never blamed for fuckups, irrespective of the damage they caused.

That was the only way to help people develop routine regarding big production deployments.


Same thing -- used to work at a very large hosting provider. One of our big internal infra management teams wouldn't consider new hires fully "part of the team" until they had caused a significant outage. It was genuinely a rite of passage; as one person put it, "to cause a measurable part of the internet to disappear".

I got to see a lot of people pass through this rite of passage, and it was always fun to watch. Everyone would take it incredibly seriously, some VP would invariably yell at them, but at the end of the day their managers and all their peers were smiling and clapping them on the back.


Sounds like hazing.


as a new grad there, it wasn't your fault. There should be guardrails to protect you.


Yep. It was supposed to be a very small change. I blundered. My team understood that and was super supportive about it all too. But this was after it was all fixed.

During the outage though, no one (obviously) had time for me. This was a very important server. The tension and anxiety on the remediation call was through the roof. Every passing hour someone even more important in the chain of command was joining the call. At that time I thought I was done for...


I work for an extremely famous hospital in the American midwest. We're divided into three sections, one for clinical work, one for research, and one for education. I always tell people that I'm pretty content being in research (which is less sexy than clinical), because if I screw something up, some PI's project takes ten months and one week instead of ten months. In clinical, if you screw something up, somebody dies! I just don't think I could handle that level of stress.


Same.

At AWS, I once took an entire AZ of a public-facing production service down (with a mistyped command), but that was nothing compared to when I accidentally deleted an entire region via an internal console (too many browser tabs). Thank goodness it turned out to be an unused / unlaunched, non-production stack. I felt horrible for hours despite zero impact (in both cases).


Jesus. One would think you'd have some safeguards for that. Even Dropbox will give you an alert if you try to nuke over 1,000 files. More reasons to COLOR CODE your work environments, if possible.


Yes but that was eons ago. The safeguards are well and truly in-place now. Not just one, several in fact.


Apart from the ones that they haven't worked out yet :)


When I meet the engineer who can design for the unknown unknowns, I will bow to them.


The trick is to be paranoid. You literally sit down and think exclusively about what COULD go wrong.


Anxiety is a bitch.


Formal methods for your formal methods. And never shipping on Friday.


Colorblind (red/green) person here - 5% of the male population just don't see color well enough for it to be an important visual cue.

So sure, color-code your environments, but if you find someone about to do something to a red environment that they clearly should only be doing to a green environment, just check if they're seeing what you're seeing before you sack them ;)


My primary customer right now color codes both with "actual color", and with words - ie the RED environment is tagged with red color bars, and also big [black] letters in the [red] bars reading "RED"


> At AWS, I once took an entire AZ of a public-facing production service down (with a mistyped command), but that was nothing compared to when I accidentally deleted an entire region via an internal console (too many browser tabs). Thank goodness it turned out to be an unused / unlaunched, non-production stack. I felt horrible for hours despite zero impact (in both cases).

It seems like a design flaw for actions like that to be so easy. E.g.

> Hey, we detected you want to delete an AWS region. Please have an authorized coworker enter their credentials to second your command.


If it was indeed in-production, I'd never in a million years have had access rights to delete the stack. Those are gated just the way one would imagine they should be.

The service stack for the region (and not an entire region itself) looked like prod, but wasn't. It made me feel like shit anyway.


It reminds me of this: https://www.youtube.com/watch?v=30jNsCVLpAE -- "GOTO 2017 • Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill"


Irrelevant to the discussion, but I just wanted to say thank you for the categorized list of users I can follow on your profile!


Which tool do you use to follow users on HN?


There used to be hystry.com [0] but it isn't functional anymore.

Another workflow, though cumbersome, is: Search for a username on hn.algolia, select "comments" and "past months" as filters, then press enter.

Ex: https://hn.algolia.com/?dateRange=pastMonth&query=nostrademo...

[0] https://news.ycombinator.com/item?id=71827


Thanks for sharing those tips!


wow, that list is crazy. thanks OP.


Who automates the automators? :)


Doesn't AWS (and every big cloud/enterprise) follow best-practices for production operation like FIT-ACER? https://pythian.com/uncategorized/fit-acer-dba-checklist/

That's even more surprising to me.


Several years back when I was working at Google I made a mistake that caused some of the special results in the knowledge cards to become unclickable for a small subset of queries for about an hour. As part of the postmortem I had to calculate how many people likely tried to interact with it while it was broken. It was a lot and really made me realize the magnitude of an otherwise seemingly small production failure. My boss didn't give me a hard time, just pointed me toward documentation about how to write the report. And crunching the numbers is what really made me feel the weight of it. It was a good process.

I feel for the engineer who has to calculate the cost of this bug.


This sounds like a good practice and hopefully something they still do. Calculating the exact numbers would definitely help cement the experience and its consequences into your mind.


Presumably there were more failures than a single engineer could've been responsible for here.


It's absolutely possible; the worst AWS outage was caused by one engineer running the wrong command [0].

"This past Tuesday morning Pacific Time an Amazon Web Services engineer was debugging an issue with the billing system for the company’s popular cloud storage service S3 and accidentally mistyped a command. What followed was a several hours’ long cloud outage that wreaked havoc across the internet and resulted in hundreds of millions of dollars in losses for AWS customers and others who rely on third-party services hosted by AWS."

[0] - https://www.datacenterknowledge.com/archives/2017/03/02/aws-...


If you alone were able to do it, then the system was designed badly. The bigger the impact, the more robust it has to be to prevent accidents.


The big mistake in the system is that everyone in the world is relying on Google services... These problems would have less impact with a more diverse ecosystem.


Would they have less impact? Or would it have the same impact, just distributed across many more outages?

You can rely on Google outages being very few and far between, and recovering pretty fast. For the benefits you get from such a connected ecosystem, I'm not sure anyone is net positive from using a variety of different tools rather than Google supplying many of them.


Compare closing down one road for repair a day per year to closing down all roads one day a year.


I'm not sure I see that as a fair comparison. I think it's best to use the same durations for this, as an entire day changes the level of impact. <1hr outages have a fraction of the impact that an entire day's outage might have.

It's obviously subjective, but even with our entire work leaning on Google - from GMail, GDrive and Google Docs, through to all our infrastructure being in GCP - today's outage just meant everyone took an hour break. History suggests we won't see another of these for another year, so everyone taking a collective 60m break has been minimally impactful vs many smaller, isolated outages spread over the year.


Just think if Google actually decided to "take the day off"

...like I did a dozen+ years ago: https://antipaucity.com/2008/01/09/what-if-google-took-the-d...


I remember how one of our engineers had his docker daemon connected to production instead of his local one and casually did a docker rm -f $(docker ps -aq).

Same thing happened to me but with CI, which felt bad enough already.
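
One cheap mitigation is a wrapper that refuses destructive commands whenever the client is pointed at a remote daemon; a minimal sketch (the DOCKER_HOST convention is real, the policy itself is just an illustrative example):

    # Refuse destructive docker commands unless the client targets the local daemon.
    import os
    import subprocess
    import sys

    DESTRUCTIVE = {"rm", "rmi", "system", "volume", "container"}

    def docker(*args):
        host = os.environ.get("DOCKER_HOST", "")
        cmd = " ".join(args)
        if args and args[0] in DESTRUCTIVE and host and not host.startswith("unix://"):
            sys.exit(f"refusing 'docker {cmd}' against remote host {host}")
        subprocess.run(["docker", *args], check=True)

    # Example: docker("ps", "-a") runs anywhere, while docker("rm", "-f", "abc123")
    # exits early if DOCKER_HOST points at a remote daemon.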


"Hey let's make developers do two very different jobs, development, and operations. We'll call it DevOps. We'll save money. Everything will be fine."


No Engineer should have production access from their workstation. Period.

source: am Engineer =).


Engineers shouldn’t deploy to prod directly, but sometimes it’s necessary to SSH into an instance for logs, stack dumps, etc. Source: worked for 2 big to very big tech cos.


For a large or v large tech co you should probably be aggregating logs to a centralised location that doesn't require access to production systems in this way. Stack dumps should also be collected safely off-system if necessary.

Perhaps my industry is a little more security conscious (I don't know which industry you're talking about), but this doesn't seem like good practice.


Let me be clear, I agree it should not be normal to SSH into a prod box. Our logs are centrally aggregated. But it’s one thing to say it’s not normal, but quite another to say engineers shouldn't have access, because I totally disagree with that.


What normally (should) happens in that unusual case is that the engineer is issued a special short-lifetime credential to do what needs to be done. An audit trail is kept of when and to whom the credential was issued, for what purpose, when it was revoked, etc.


Who fixes the centralised log system when that needs debugging?

Unless prohibited in something like banking, following best practice to the letter is sometimes unacceptably slow for most industries.


There should be tools that allow the team to gather such logs. Direct prod access is a recipe for disaster.


Not having those things centralized is also a huge operational failure regardless of company size.


Why not? (I think I can find some cases where production access from an engineer's workstation is a good idea)


It can be efficient, particularly in smaller companies, but that's where exactly this rule should be applied.

In some industries, security and customer requirements will at times mandate that developer workstations have no access to production. Deployments must even be carried out using different accounts than those used to access internal services, for security and auditing purposes.

There are of course good reasons for this; accidents, malicious engineers, overzealous engineers, lost/stolen equipment, risk avoidance, etc.

When you apply this rule, it makes for more process and perhaps slower response times to problems, but accidents or other internal-related issues mentioned above drop to zero.

Given how easy it is to destroy things these days with a single misplaced Kubernetes or Docker command, safeguards need to be put in place.

Let me tell you a little story from my experience;

I built myself a custom keyboard from a Numpad kit. I had gotten tired of typing so many docker commands in every day and I had the desire to build something. I built this little numpad into a full blown Docker control centre using QMK. A single key-press could deploy or destroy entire systems.

One day, something slid off of something else on my desk, onto said keyboard, pressing several of the keys while I happened to have an SSH session to a remote server in focus.

Suffice it to say, that little keyboard has never been seen since. On an unrelated topic, I don't have SSH access to production systems.


This exactly. I have deleted database records from a production DB thinking I was executing on my development DB. I've kept separate credentials and revoked dev-machine access to the prod environment ever since.


Congratulations - you found a counterexample yourself: engineers in small companies.


Well, it's something that can happen to anyone; take it easy. When I made the transition from developer to manager and became responsible for these situations, at first every problem made me feel as you describe. Eventually what freed me was the understanding that how we feel about a fact does not change anything about that fact.


Don't be too hard on yourself; no dev works in a silo. There is usually user acceptance testing and product owner sign-off involved, so they have to wear some of this too.


Nope, especially considering the implications of this, with the number of people working remotely. Google Meet, Classroom, etc. are down. This is probably literally costing billions every minute just in lost productivity.


Total world economic output is ~$150M / minute, so billions every minute is off by few orders of magnitude.


You are assuming that a minute of disruption can not cause more than a minute's loss of productivity. I don't think that assumption is justified.

Consider an exactly one minute outage that affects multiple things I use for work.

First, I may not immediately recognize that the outage is actually with some single service provider. If several things are out I'm probably going to suspect it is something on my end, or maybe with my ISP. I might spend several minutes thoroughly checking that possibility out, before noticing that whatever it was seems to have been resolved.

Second, even if I immediately recognize it for what it is and immediately notice when it ends, it might take me several minutes to get back to where I was. Not everything is designed to automatically and transparently recover from disruptions, and so I might have had things in progress when the outage struck that will need manual cleanup and restarting.


I'm also assuming most of the world doesn't grind to a halt when gmail is down. Crops keep growing and factories keep running.


Even software engineers who are in a state of flow keep working :)


That figure seems way too low, what are your sources on it?


Simple math says:

World GDP (via Google) $80,934,771,028,340

Minutes per year 365 * 24 * 60 = 525,600

Divide and you get 153,985,485


World GDP is $80 trillion per year.


World GDP was ~$90B last year (https://databank.worldbank.org/data/download/GDP.pdf), which averages to ~$150M/minute


That's trillion not billion


https://en.wikipedia.org/wiki/Billion

A billion is a number with two distinct definitions:

- 1,000,000,000, i.e. one thousand million, or 10^9, as defined on the short scale. This is now the meaning in both British and American English.

- 1,000,000,000,000, i.e. one million million, or 10^12, as defined on the long scale. This is one thousand times larger than the short scale billion, and equivalent to the short scale trillion. This is the historical meaning in English and the current use in many non-English-speaking countries where billion and trillion 10^18 maintain their long scale definitions.

Nevertheless almost everyone uses 1B = 10^9 for technical discussions


This is a financial discussion though so:

https://www.worldometers.info/gdp/gdp-by-country/

World's GDP is $80,934,771,028,340 (nominal, 2017).

https://www.wolframalpha.com/input/?i=%2480%2C934%2C771%2C02...

$80.93477102834 trillion

Nobody would argue world GDP is anything billion, that's crazy.


https://fr.wikipedia.org/wiki/Liste_des_pays_par_PIB_nominal

In France, they use milliard and billion.


Sorry, language mistake. The result is the same: GDP is ~$150M/minute


That depends where in the world you are!


Indeed. Also, Google’s revenue is about $300K per minute. The value they provide is likely higher than that, but as you said, being able to send an email an hour later than you hoped it’s fine in most cases. Also, Google Search was fine, and that’s their highest impact product.

I’d guess actual losses to the world economy were more on the order of about $100K per minute, or about 1/3 of Google’s revenue. MAYBE a few hundred thousand per minute, though that seems unlikely with Search being unaffected, and everything else coming back. Certainly a far cry from billions per minute :)


I never understood this type of calculation, as it implies that time is directly converted into money. However, I struggle to come up with an example of this. Even the most trivial labor cases, like producing paperclips, don't seem to directly convert time into profit: even if you make 10k units instead of 100k this hour, you don't sell them immediately. They bring revenue to the firm via a long chain of convoluted contracts (both legal and "transactional") which are very loosely coupled to the immediate output.

Nothing is operating at minute margins unless it's explicitly priced on a per-minute basis, like a cloud service. Even if a worker on a conveyor belt can't produce paperclips without looking at a Google Docs sheet all the time, this will be absorbed by the buffers down the line. And only if the worker fails to meet her monthly target due to this might a loss of revenue occur. But in that case the service has to be down for weeks.

In the case of more complex conversions of time into money, as in most intellectual work, it is even less obvious that short downtimes cause any measurable harm.


Besides the exaggerated figure, I always find these claims bizarre. Sure, there was some momentary loss, but aggregated over a month this will not even register.


I was unable to watch the Mogwai - Autorock music video. :-(


In a previous lifetime I removed an "unused" TLS certificate. It turns out that it was a production cert that was being used to secure a whole state's worth of computers.

In my defence, the cert was not labeled properly, nor was it used properly, and there was no documentation. It took us 2 days to create a new cert, apply it to our software, and deliver it to the customer. Those were 2 days I'll never get back. However, by the time I was finished the process was documented and the cert was labeled, so I guess it's a win.


Coincidentally, Google Authenticator was finally just updated on iOS after many years without update.


I am not sure why they are allowing it. Meaning, why aren't services completely isolated? Isn't it obvious that in an intertwined environment these things are bound to happen (as in "a question of when, not if")? I understand that in smaller companies limited in resources (access to good developers, pressure to get the product to market as soon as possible) we have single points of failure all over the place. But "the smartest developers on the planet"? What is it if not short-sighted disregard for risk management theory and practice? I mean, Calendar and YouTube, say, should be completely separate services hosted in different places; their teams should not even talk to each other. Yes, they can use the same software components, frameworks and technologies. Standardization is very welcome. But decentralization should be an imperative.

Edit: again downvotes started! Thanks to everyone “supporting freedom of expression” :)


I've been in that situation before at one of my previous jobs, where some important IT infrastructure went down for the whole company. Nowhere near as big a scale as this, but it was easily one of the most stressful moments of my life.


If this does not improve soon, we're looking at one of the most significant outages in recent internet history, at least in terms of the number of people impacted.


Several others have shared their 'I broke things' experiences, and so I feel compelled to weigh in.

Many years ago, I was directly responsible for causing a substantial percentage of all credit/debit/EBT authorizations from every WalMart store world-wide to time out, and this went on for several days straight.

On the ground, this kind of timeout was basically a long delay at the register. Back then, most authorizations would take four or five seconds. The timeout would add more than 15 seconds to that.

In other words, I gave many tens of millions of people a pretty bad checkout experience.

This stat (authorization time) was and remains something WalMart focuses quite heavily on, in real time and historically, so it was known right away that something was wrong. Yet it took us (Network Engineering) days to figure it out. The root cause summary: I had written a program to scan (parallelized) all of the store networks for network devices. Some of the addresses scanned were broadcast and network addresses, which caused a massive amplification of return traffic which flooded the satellite networks. Info about why it took so long to discover is below.

Back in the 1990s, when this happened, all of the stores were connected to the home office via two way Hughes satellite links. This was a relatively bandwidth limited resource that was managed very carefully for obvious reasons.

I had just started and co-created the Network Management team with one other engineer. Basically prior to my arrival, there had been little systematic management of the network and network devices.

I realized that there was nothing like a robust inventory of either the networks or the routers and hubs (not switches!) that made up those networks.

We did have some notion of store numbers and what network ranges were assigned to them, but that was inaccurate in many cases.

Given that there were tens of thousands of network ranges in question, I wrote a program creatively called 'psychoping' that would ICMP scan all of those ranges with adjustable parallelism.
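
For flavor, here is a minimal modern sketch of an adjustable-parallelism sweep of that kind (not the original psychoping; notably, Python's ipaddress.hosts() skips the network and broadcast addresses whose replies caused the amplification described above):

    # Hypothetical ping sweep with adjustable parallelism. Uses the system
    # ping binary with Linux-style flags (-c count, -W timeout in seconds).
    import ipaddress
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    def alive(ip):
        r = subprocess.run(["ping", "-c", "1", "-W", "1", str(ip)],
                           stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return str(ip), r.returncode == 0

    def sweep(cidr, parallelism=32):
        # .hosts() deliberately excludes the network and broadcast addresses.
        hosts = ipaddress.ip_network(cidr).hosts()
        with ThreadPoolExecutor(max_workers=parallelism) as pool:
            return dict(pool.map(alive, hosts))

    print(sweep("10.0.0.0/28"))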

I ran it against the test store networks, talked it over with the senior engineers, and was cleared for takeoff.

Thing is, I didn't start it right away; some other things came up that I had to deal with. I ended up starting it over a week after the review.

Why didn't this get caught right away? Well, when timeouts started to skyrocket across the network, many engineers started working on the problem. None of the normal, typical problems were applicable. More troubling, none of the existing monitoring programs looked for ICMP at all, which is what I was using exclusively.

So of course they immediately plugged a sniffer into the network and did data captures to see what was actually going on. And nothing unusual showed up, except a lot of drops.

We're talking > 20 years ago, so know that "sniffing" wasn't the trivial thing it is now. Network Engineering had a few extremely expensive Data General hardware sniffers.

And to these expensive sniffers, the traffic I was generating was invisible.

Two things: the program I wrote to generate the traffic had a small bug and was generating very slightly invalid packets. I don't remember the details, but it had something to do with the IP header.

These packets were correct enough to route through all of the relevant networks, but incorrect enough for the Data General sniffer to not see them.

So...there was a lot of 'intense' discussions between Network Engineering and all of the relevant vendors. (Hughes, ACC for the routers, Synoptics and ODS for the hubs)

In the end, a different kind of sniffer was brought in, which was able to see the packets I was generating. I had helpfully put my userid and desk phone number in the packet data, just in case someone needed to track raw packets back to me.

Though the impact was great, and it scared me to death, there were absolutely no negative consequences. WalMart Information Systems was, in the late 1990s, a very healthy organization.


Makes sense; at work we have an application running on Google Cloud and everything seems to be working. So the outage is probably not at the network or infrastructure level.


Went to reply, then saw the username. My guess was lb layer


yeah, not working in Europe


4:41AM PT, Google services have been restored to my accounts (free & gsuite).

And I have never seen them load so fast before - the gmail progress bar was barely visible for a fraction of a second, whereas I am used to seeing it for multiple seconds (2-3 sec) until it loads.

I observe the same anecdotal speedup for other sites... drive, youtube, calendar. I wonder if they are throwing all the hardware they have at their services or I am encountering underutilized servers since it is not fixed for everyone.

It is nice to experience (even if it is short-lived) the snappiness Google services would have if they weren't so multi-tenanted.


If this phenomenon is actually real instead of just perception, then I'd guess it is down to reduced demand of some sort. Some possibilities:

a) users haven't all come back yet

b) Google is throttling how fast users can access services again to prevent further outages

c) to reduce load, apps have features turned off (which might make things directly faster on the user's end or just reduce load on the server side)


At Google's scale, I'd expect it to be all of the above.

I hope they make their learnings, post-mortem, etc. public so that we can all learn from it.

My engineer hat is saying - "damn, I wish I was part of fixing this outage at their scale."

My product owner hat is saying - "Aaaaaaaaaaaaaaa......Aaaaaaaaaaaaaaa...."

:D


Everything is snappier for a while if you turn it off and then on again


Except when there is no cache warming when you turn it on


Except when there is cache warming when you turn it on?


I would guess autoscaling kicked in (RPC error rate caused higher CPU usage?) and now things will scale back down again.


Perhaps they rebooted their clusters and it flushed the memory ???.


Oh man, you're right. Bloated gmail loaded instantly. What's going on? It's loading almost 2x to 3x faster.


Isn't this a good indication that the performance problem of gmail may not be related to the "bloat" of the frontend itself?


It might suggest that the frontend isn't the only issue, at least - and maybe this explains why it's usually so slow, if the frontend can be fast on a fast enough backend. On the other hand, the speed of the "basic HTML" version implies that the frontend can be the issue.


Entirely possible as well that the "basic HTML" version uses different API services in the background that are snappier due to a comparative lack of users.


I always thought gmail being slow is because of me using firefox but now it's surprisingly snappy. What the hell is going on?


Wow, it's faster in Firefox than it used to be in Chrome... while in Chrome it's almost instantaneous.


I wish it was always like this. I hate how slow YouTube, Gmail, etc. often are normally.


I wonder if they just killed the affected service so it's loading faster-than-usual now


So, anybody still feel like arguing that 'the cloud' is a viable back-up? Or is that a sore point right now? Just for a moment imagine: what if it never comes back again?

Of course it will, - at least, it better - but what if it doesn't? And if it does, are you going to take countermeasures in case it happens again or is it just going to be 'back to normal' again?


I guess a lot of people are fine with the risk.

Everybody uses it, so if, like, Gmail loses all the emails, we are then in such a state that the consequences will be more bearable and socially normal.

Most people are fine with accepting that whatever future thing will happen to most people will also happen to them. Because then the consequences will also be normal.

If the apocalypse comes, it comes for almost all of us and that's consolation enough.


This sounds like the good old 1970-80s "No one ever got fired for buying IBM" argument.


Still haven't heard of anyone being fired for buying IBM. Have you?


Yeah but for the people making that argument, it was a good one!


The way I see it, backups are a strategy to reduce risk of ruin.

For me, backing up to the Cloud is fine, because I find the risk of my home being broken into and everything stolen AND the cloud goes down AND the cloud services are completely unrecoverable is a small enough risk to tolerate.

I don't think it's possible to have permanently indestructible files in existence over a given time period.


Different failure mode. If the cloud goes down, many more people are affected. If your self-hosted thing goes down, only you are affected. If everybody self-hosted, would the overall downtime be lower? Even if it were, would it be worth the effort of self-hosting?


For baby pictures yes, for everything else, no


Most of the things I backed up for myself are either gone forever or irretrievably lost.

Most of the things I backed up with google remain largely accessible, except for an occasion like this.

It's rare that any service I operate solo comes back this quickly after an issue takes it down.


I have the opposite experience, at least with regard to your first two paragraphs. Most of the things that I have backed up on other people's computers over the past 3-4 decades are irretrievably lost. But most of the things that I have taken care to make backups of on personal equipment over the years, are still with me.

Cloud storage is still useful of course, but I prefer to view it as a cache rather than as a dependable backup.


Of course it's viable as a backup. Availability != reliability. My data is still reliably saved in the cloud even if there is an outage for a few hours. The key point is backup, e.g. Dropbox. When you use Google Docs, it becomes a single source of truth and a SPOF.


This depends on the circumstances. If your personal photos are inaccessible then maybe it doesn't matter, but if it's your documentation for a mission-critical bit of infrastructure then a few hours could be very significant. Somebody in that situation probably wouldn't agree with your assertion that "availability != reliability". If I can't access it when I need it, then I wouldn't consider it reliable.


Whatever data I have backed up in the cloud is synced across multiple devices that I use. Even if the cloud disappeared altogether, I still have it. The cloud allows me to keep an up to date copy across various devices.


Both Google Drive/Photos and OneDrive have an option to only keep recently used files on your local device, and even periodically suggest they automatically remove local copies of unused files to "free up space".

I highly suggest everyone disable this setting on their own, but also on their (perhaps less technical) friends' and relatives' devices. Otherwise, if anything happens to your account or - less likely - the storage provider or their hardware, your data could very well be gone forever. I can't believe anyone would want that.


You don't need 'the cloud' to do that. Look into Syncthing. It does depend on an outside "discovery server" by default to enable syncing outside of your LAN, but you can run your own.

https://syncthing.net/


What's annoying is that synchronisation doesn't work for Google Slides or Google Docs. They are just synchronized as links to the webpage on my computer.


If you use Insync you have the option of converting to DOCX or ODT. Insync has other issues though, my "sourcetreeconfig" is being downloaded as "sourcetreeconfig.xml".


Not 100% on that, but I think you can save these documents on Google Drive, and then they're treated (and synced) just like any other files.


>So, anybody still feel like arguing that 'the cloud' is a viable back-up? Or is that a sore point right now? Just for a moment imagine: what if it never comes back again?

Much less chance of that happening than my local backups getting borked...


But much higher than the chances of both of them getting borked.


Of course the cloud is a viable backup, similar to physical drives.

Both have vastly different failure modes and typical backup should use both of them.

This way, if all my backups are gone, I likely have far more important issues than the loss of files.

(and yes, my backups are encrypted)


Just a few moments ago, I wrote in my company's group chat that this is the time we buy a NAS. We have a lot of documents in Google Drive that are not accessible right now.


Clever.


What worries me the most is email. I basically don't use any other Google services other than Gmail and YouTube, but for email I really don't know of an alternative.

Sure, you can argue "move to Fastmail/Protonmail/Hey/whatever", but those can also go down on you just like Google is down now. And self-hosting email is apparently not a thing due to the complexity and having to forever fight with being marked as spam (note: not my personal experience, I never tried self-hosting, just relaying what I read here on HN when the topic comes up).

So, yeah, what do we do about email? I feel like we should have a solution to this by now, but somehow we don't.


I've been happy using Hosted Exchange on Microsoft. I own the domain, so ultimately I can point the DNS to some other provider. Outlook stores the mail locally so I have a backup. I think the most important thing about email is to receive future emails, not to look at historical ones. In the end you can always ask the recipient to send you a copy of the email conversation - if you don't own the domain, it gets much harder to convince people you actually own the email address.


For < $100/year, Microsoft will sell you hosted Exchange (and you can use it with up to 900 domains [0]), 1 TB of storage, 2 installable copies of Office, and Office 365 on-line.

That's _much_ better than trying to host my own email server.

[0] https://docs.microsoft.com/en-us/office365/servicedescriptio...


The point is that you shouldn't put all your eggs in one basket. All services go down. If you're worried about someone else handling it when it goes down then host your own [1], otherwise you can use something different for each thing you need. Don't rely on Google for everything.

1- https://github.com/awesome-selfhosted/awesome-selfhosted#ema...


Yes, I know what the point is. But how do you avoid putting all your eggs in one basket? You can't host your email on more than one "provider" (including self-hosting), and the vast majority of important services that you link your email to (bank, digital identity, taxes and other government services) do not allow you to link more than one; which means that if that one goes down, you don't have one. Sure, I can give my accountant and my lawyer a second email address, hosted on a different provider, but that poses two problems:

1. How are they going to know when one is working and one isn't? It's not like the sender gets a notification if the email didn't go through; most of the time it just drops.

2. If you always send all emails to both addresses, now two providers have my data instead of one (excluding, of course, the case where one is self-hosted).

You also need to keep in mind that for all things important, one is none and two is one; so you should really have 3 addresses on 3 different providers, which brings us back to the problems above. (And I'm not even mentioning the confusion it would generate if you don't manage to get the same name with every provider: "Wait, was it beshrkayali@gmail.com, or was it alibeshrkay@gmail.com? Or was that Fastmail?")

As I said (literally in the second sentence), I don't rely on Google for everything, as you mention. I don't actually rely on Google for anything other than Gmail, and even with that I am unhappy. The point I was trying to make is that there aren't really alternatives, and I was hoping someone might come up with a suggestion for how to overcome that problem.


You shouldn't use you@company.com as your main email; you should have your own domain. That way `something@yourdomain.com` will always be yours, no matter whether you self-host or use a 3rd party. I currently use Fastmail and I've been very happy with them. If they fail or turn evil, I'll switch to something else while keeping the same address. Emails themselves should be downloaded/backed up regularly, kind of like important documents you'd have on your disk.


> You can't host your email on more than one "provider"

You can do split delivery and have your email be delivered to two different destinations. It's less common than it used to be but it's trivial.


I'm running my own mail server, and I think anyone who has some experience with Linux should be able to do the same in a day or two. Once it's set up it just works.

You can still use Gmail and fall back to connecting directly to your server if Gmail is down.

Some mails might be flagged as spam if the IP/domain has no reputation, but that quickly passes, at least that's my experience.


I specifically use Gsuite so that I don't have to deal with managing a spam filter or dealing with IP reputation issues. I'd be willing to self-host almost anything else.


I guess something highly resilient would be - say - a mailserver on a rented VM replicating to two cloud providers via a service mesh.

Nice and simple! :D


A lot of domain registrars will host/relay mail for you if you don't want to think about it. Otherwise it's not too hard to host yourself. The sucky part is when it breaks because you can't really just put off fixing it.


I've been using mailu.io to host my own email server. Makes it real easy to manage yourself.

I haven't had any issues with new domains being marked as spam, but I always make sure the SPF, DKIM and DMARC records are set up.
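
For anyone setting these up for the first time, all three are plain DNS TXT records; an illustrative set for a hypothetical example.com (placeholder values, not anything Mailu generates for you):

    example.com.                  TXT  "v=spf1 mx -all"
    mail._domainkey.example.com.  TXT  "v=DKIM1; k=rsa; p=<your public key>"
    _dmarc.example.com.           TXT  "v=DMARC1; p=quarantine; rua=mailto:postmaster@example.com"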


I’ve heard multiple founders argue that it’s safe to have downtime because of a cloud outage, because you’re not likely to be the highest importance service that your customers use that also had downtime.


Well yeah; I don't trust myself enough to own & operate my own servers, and I cannot give myself any uptime guarantees - let alone at the scale that a cloud provider can offer me.


"The Cloud" is vague, and if you don't specify what it means then the answer to your question can only be "it depends".

If the question is "anybody still feel like arguing that 'a single provider' is a viable back-up" then it's yes for most cases. A better strategy is of course to use multiple providers. The chances that it never comes back again is much lower.


People would probably argue a “multi cloud” solution. Have your infrastructure be “cloud agnostic” and this sort of problem would be avoided.

There was actually a project called “spinnaker” that was supposed to solve this problem.

Whether the cost of paying 2 or more cloud providers is worth it for most companies is up in the air.


"Multi-cloud" only works if you stick with the basics. Like disk storage, compute, and a well-supported database. Once you tie in into a cloud's specific offerings.....


It's getting convoluted now that cloud providers seem to realize there's demand for this

https://aws.amazon.com/hybrid

https://azure.microsoft.com/en-us/services/azure-arc

https://cloud.google.com/anthos

Full disclosure: I work for Azure. Don't work on Arc tho. Don't have experience being a customer for these products


I find the Anthos docs and talks so confusing. Half of them say Anthos is for hybrid between on-prem and GCP. The other half say it's for multicloud and hybrid.


Well, since a viable backup strategy requires at least 3 storage locations (eg the in-use primary, an on-site or off-site backup, and a secondary off-site backup) "the cloud" is fine as an off-site backup or secondary off-site backup.


Let's hope tailscale swoops in and creates a no-gimmicks, highly usable, private internet for everyone.

They seemed to have figured out the hard parts already.


You mean like using Google GSuite for SSO? In the context of this authentication-related outage, it's a funny suggestion.

https://tailscale.com/kb/1013/sso-providers


> Of course it will, - at least, it better - but what if it doesn't?

Same question for non-cloud.


Back to normal. I can live without email for an hour...


Help! My Waymo taxi won't open the doors without me logging into the app. Its driving around in circles on route 500 and won't stop.

/s - for now ;)


I know you’re joking, but I am curious if Waymo’s fleet was affected. Shouldn’t be right? But I’m also surprised every time I fresh login to Gmail and YouTube shows up in the intermediary redirect chain.


how's the converted offshore oil platform treating you?


Gmail said my account was "temporarily" unavailable... had a moment considering if it wasn't temporary. Good reminder to remove my reliance on gmail especially.


For a second I legit thought Google banned my account for some reason. And I don't like it that I feel relieved..


Underrated comment. If the government freezes your bank account, at least you can take them to court. If Google mistakenly disables your account, there does not seem to be any legal recourse. Considering the extent to which people depend on their services, perhaps there should be a more elaborate appeal process.


They can freeze the account but I own the domain so I can point it anywhere I want and I have copies locally via imap.

At this point the only reason I use it is because I grandfathered in on an old plan it's still free, if that changes I'll go elsewhere.


Google doesn't disable ordinary rulefollowing accounts by mistake. They are aware of the false positive rate and don't care.


I agree, but the problem is that "ordinary rule-following" is not clearly defined by Google. Perhaps it was that innocent park frisbee party video with a Metallica song playing in the background, uploaded by some other party there... and not only your video, but also your YouTube account, and worse, your email account gets blocked. Maybe this scenario is dystopian - the point is that it is a black-box, no-appeal/limited-appeal system when such an event happens.


Google account terminations typically only affect the one service. If your YouTube account gets terminated for ToS violations, your email account will still work perfectly fine.


Same for me - and then I tried logging in to gmail in an incognito window and got "Couldn't find your Google account" which really scared me.


Time to stop postponing that large Google Takeout for me!


Clicking the stupid download links is so tedious...


Thunderbird prompting me to login, and getting "Google does not recognize this email address" as a reply was a nice adrenaline spike, until I checked the status on HN!


This happened to me with a work email. For a moment I thought, "Well, I had a good run at this company.." My next thought was, "Ha, Google must be down - better check HackerNews."


I am still experiencing this problem on all my synced-to-Thunderbird Gmail accounts, so it either hasn't been completely fixed yet, or there's another ongoing issue.


I started this[1] the last time something similarly scary happened.

It's very comforting to have a local copy of everything important in situations like this one.

[1] https://wiki.emilburzo.com/backing-up-your-google-digital-li...


Yes! It takes some work to switch, but it's worth it. Buy your own domain name, and link it to an existing service if you don't want to host yourself. You'll always be able to switch your mail alias when having issues with your email host.


Our entire library of video tutorials disappeared for a while. I was not happy, and the thought of losing our email.. Currently working with the team to make backups of absolutely everything off-site and off-Google. A good wake up call for us.


https://console.cloud.google.com/ is down as well. Seems very much an identity issue.


I really should try to get my stuff moved to my new address. Not self-hosted, but a non-profit that has been around for a while... Finally paid the membership fee.


The status page has proven itself useless again. According to that page, everything is working perfectly fine.


What if it is a static website of corporate propaganda?


It works 99.9% of the time, good enough eh?


better than a stopped clock


It works fine. It's only logging in that you can't do.


When I tried to login it said: 'this account does not exist'. So my first thought was some algorithm made a mistake and my account got deleted for no reason.

I already imagined the only solution now was to write a medium post and hope it gets some traction on hackernews and google support steps in. Thinking to myself I was an idiot for knowing all this and still thinking it wouldn't happen to me.

And even though it turns out to be an outage, it gave me a bad enough feeling to start using a domain name I own for my email.


Those hosting with their own domains on Google mail had an identical experience, if it makes you feel differently.


The advantage of a custom domain name is: As long as you can update DNS records, you can have your mail easily hosted elsewhere.

Obviously not relevant for this kind of outage, but in the scenario outlined by GP - Google randomly kills you off, and there is nothing you can do - this is at least an emergency strategy.


Yep, but losing access to my emails isn't that big of a deal, losing access to my email address is much worse. Especially since it's also my login on a lot of other services.


This topic is so hot it's crashing HN! Super long server response times, and I quite often get this error: "We're having some trouble serving your request. Sorry!"


Just checked https://www.google.com/appsstatus

all green, which does not reflect reality for me (e.g. Gmail is down)

edit: shows how incredibly difficult introspection is


HN is my goto status page when these things happen, never failed to provide up-to-date reliable information.


Seems like everyone going on HN now. The site is so slow


Interesting cascade effect of sites going down! I wonder where people will go next to check if HN is down?

Also wondering if this is perhaps the fastest upvoted HN post ever? 8 mins -> ~350 votes, 15 mins -> ~750 votes. I wonder if @dang could chime in with some stats?

Update: looks like it hit 1000 upvotes in ~25 mins!

Update: 1500 in ~40 mins

Update: 2000 in ~1 hour 20 mins (used the HN API for the timestamp)


The recent youtube-dl takedown had a similar number of votes but still slightly slower, I think. I think it was >500 in 30min and >1000 in 60min if I remember correctly.


Some politics topics have shot up pretty quickly in the past but user and/or mod flags send them back to page 2 about as quickly.


Funny this post is already #7. Are we seeing many reports again?

Stats right now: 1985 points | 1 hour ago | 597 comments


Maybe, could have just seen a dropoff in upvotes once the issue was resolved also


reddit, nanog, 4chan, twitter — pick your poison


"poison" is a particularly apt description for all of the above


> I wonder where people will go next to check if HN is down?

Probably: https://downforeveryoneorjustme.com/


HN was having problem showing anything for the past 10 min or so. I was going to ask if HN was down, then I saw this.


HN uses Firebase, maybe that is the reason.


Does this site also use Firebase, or do they only feed Firebase's Database so that other developers can build upon it?


Any source for that? Last information I have is that it's written in Arc Lisp and uses files rather than a database for storage.

There is a public API on Firebase, but AFAIR it's just a mirror rather than the main storage.


TIL. What for?


The very least public APIs are exposed via firebase.

https://github.com/HackerNews/API


Accessing the site anonymously fixes it (logged-in sessions are not cached)


I, and probably many others here, am like a status page for friends and family too... My wife thought the internet was broken; she tried tethering to her phone and things still didn't work. Then she showed me, and when I saw the status code errors I said, "it's actually parts of Google that are down".

I love (and am deeply scared by) the dependence on Google and the way it gets confused with the entire internet.


It gets even worse: https://twitter.com/joemfbrown/status/1338452107419148290

>I’m sitting here in the dark in my toddler’s room because the light is controlled by @Google Home. Rethinking... a lot right now.

Some people are compiling more relevant events: https://twitter.com/internetofshit


I still very much like my physical switch on the wall to turn the light on and off... for me it's hard to beat. I can even turn it on/off semi-remotely with, e.g., a tennis ball.


If my Alexa fails, then I can also just turn the light off with the switch and when flipped back "on" again, then it will be on.

Fallback to 'classical mode' works for me.


If your home automation system requires a network connection to work then something is very wrong.


Right? Home automation is perfectly fine and might even expose a secure, authenticated API to the internet so you can, say, check and adjust the temperature remotely, but it should never ever go down for local use if the internet connectivity or the remote service backend goes down.


Home Assistant rulez with non-cloud sensors. I never buy sensors, cameras, or switches that require a connection to an external host. But I'm sure most people will keep doing it even after outages like this.



Google should write an AI application that checks on HN if their systems are working and display the result live on their status page. That would be more reliable than the current system they have in place (which is obviously not working at all).


It would probably just end up banning Googles own account and leave them with no human escalation option.


won't help when the ai service itself is down.


There are other sites that can check for you


https://downdetector.com/ is remarkably good at catching this too.

But funnily enough, a lot of the votes come from traffic that searches for "is ____ down?" on Google. XD


https://downdetector.com/search/?q=google shows all the Google services and seems like everything other than search is reporting errors


Even search seems to have issues. It can't seem to find rare pages at all, and ranking seems subpar for the past 15 minutes or so.


I think that's normal. ;)


Amusingly, downdetector.com is now down. Gotta love that cascade effect :-)


Found this relatively recently, seems to do a good job.

Do love how most consumer services (ISPs &c) always have some report of being down somewhere, but it means nothing unless there's a big spike.


What's the point of having a status page when it can't reliably tell you the factual status?


Helps Google meet a 99.999% SLA!


Not really since now it's up, and the status page still says it's down.


right after I shamed them on twitter. I take responsibility despite it unlikely having any impact.


Especially one that breaks the back button.


Customer confidence. Same as how everyone reports they have 99.99% uptime.


Then their status page is not unlike target reporting in the USSR: completely fictional.


From personal experience and other Ops people's stories, I'd assume that could be the case for many status pages ;)


It's red already.


No -- it needs to be red when it's having the outage for people to have confidence in it as an indicator. The reality is Google have no real incentive to provide an actual external status page that is accurate -- to do so is an admission of not upholding SLAs. These status pages are updated retrospectively. Use a third party one like DownDetector.


It is all green if you do not need to be logged in.

If you are logged in, the page crashes with an error.

You can still browse all services from Incognito (which for some is not an option).


Are you saying the Google status page is correct then? Being logged in is part of being green I’d think.

Also, you can’t use many parts of Gmail, Drive, Photos, etc, without being logged in.


I think it just means the status page itself is not logged in. Apparently it should be...


Try YouTube, which is totally broken.

But I guess it's technically not part of /appsstatus


YouTube is down for me if I am logged in. I can watch videos through DuckDuckGo search or an incognito window.


YT seems to be working in Incognito / Private mode.


Ah, thanks!

Kinda weird it would totally break if the auth failed, unlike other services like Search.


Youtube Tv too


youtube is working in incognito mode


Yeah for me too: youtube and gmail works in incognito mode


> gmail works in incognito mode

I mean, that must be a generous definition of "works"! :)


Yeah xD


Is HN having issues too?


drive too.


Google made everyone get an account trying to force Google+ on to everyone. They don't get to pretend like that didn't happen by not factoring it into their status.


This is not true. i am logged in to gmail and getting 500 err

edit:parent updated comment


That's what the GP is saying.


the parent comment to mine at time of writing was just:

> It is all green if you do not need to be log in.

Giving the impression if you were logged in/didn't need to log in, that everything would be ok.


I apologize for my ambiguous phrasing, and I feel bad that you were downvoted (I reupvoted you, but that’s only one vote).

I wrote too fast because I thought it could help people work around the problem.


Oh, I see. Looks like their auth (state) servers are down.


It reflects the reality for them.

The small print says: The problem with Gmail should be resolved for the vast majority of affected users. We will continue to work towards restoring service for the remaining affected users...

At Google scale, "remaining affected users" probably number in tens of millions. Sucks to be one of them, tho.

But hey, it happens. As a SaaS maintainer, I can sympathise with any SREs involved in handling this, and know that no service can be up 100% of the time.


Everything went red just now.


Someone here on HN said that at Amazon AWS, direct CTO approval is required to reflect negative realities. Maybe that's the case for everyone these days.


Not true. Only for written things like postmortems. Automated systems need to be vetted but can update things like status pages on their own.

The issue isn't "negative" realities, it is saying something while investigating that might break contracts only to find later that it wasn't true.


Mistakes happen... that page is probably still correct 99.9999% of the time!

/s


The uptime numbers cloud providers give out are nowhere near the actual uptime.


Solid red now. Everything down


I definitely agree with you. Even though the statuses get updated later, it can cause many problems when they're not instantaneous.


It's correct for me now


Pretty funny. I would normally assume those status indicators are connected to some test harness exercising what real users actually interface with, rather than internal APIs or however Google is doing this, given they're still green.


Everything red now: clean sweep


It started showing red now


Now it's all red


everything is red now.


its not green anymore


I don't think it's difficult, I think they just lie.

Monitoring is very simple, I even learned this from a document released by the Google devops team many years ago.

Always alert from the end-user perspective. In other words, have an external server test logging in to Gmail. Simple as that.

They manually update that status page to not scare away stockholders.
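
For anyone curious what "alert from the end-user perspective" can look like, here is a minimal sketch of an external probe; the URL, error codes, and alerting hook are placeholders, and a real check would log in with a test account and assert on the inbox rather than just look at status codes:

    import sys
    import urllib.error
    import urllib.request

    CHECK_URL = "https://mail.google.com/mail/u/0/"  # placeholder endpoint
    TIMEOUT_S = 10

    def probe(url):
        """Return the HTTP status code, or 0 on a connection-level failure."""
        req = urllib.request.Request(url, headers={"User-Agent": "health-check/1.0"})
        try:
            with urllib.request.urlopen(req, timeout=TIMEOUT_S) as resp:
                return resp.status
        except urllib.error.HTTPError as e:
            return e.code   # a 4xx/5xx still tells us something
        except Exception:
            return 0        # timeout, DNS failure, connection reset, ...

    if __name__ == "__main__":
        status = probe(CHECK_URL)
        if status in (0, 500, 502, 503):
            print("ALERT: %s unhealthy (status=%s)" % (CHECK_URL, status), file=sys.stderr)
            sys.exit(1)     # hook your pager/notification of choice here
        print("OK: %s returned %s" % (CHECK_URL, status))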


Out of curiosity, what response time do you expect on a page like that? And what level of detail? I'd much rather have their team focus on fixing this as fast as possible than trying to update that dashboard in the first 5 minutes.


> what response time do you expect on a page like that?

Faster than a free third-party website’s response time. Google should know they are down and tell people about it before Hacker News, Twitter, etc. Google should be the source of truth for Google service status, not social media.

> And what level of detail?

Enough to not tell people that there are “No issues” with services.

> I'd much rather have their team focus on fixing this as fast as possible than trying to update that dashboard in the first 5 minutes.

Google employs enough people to do both.


I'd expect that status page to be automated based on a number of metrics / health checks. Our equivalent is.


I'd expect within seconds that Google is alerted of a very large number of issues with their servers and that the status page would be updated (the green light going to red) within seconds. It's now quite some time after the start of the outage and everything is still green on that status page.


How many employees do you think Google has? Do you think they're all working on the same task?


Well, probably a lot of them are sitting around doing nothing because Google's down right now.


It's a big group of teams... one team is responsible for monitoring (And status reporting) alone


> I'd much rather have their team focus on fixing this as fast as possible than trying to update that dashboard in the first 5 minutes.

It's not like they would be working on the status page right now, that work should have been done a long time ago...


If they know it's broken, one of the _many_ engineers/support people across YouTube+Gmail+etc that are all known to be down due to this bug should be able to update it in the first few minutes. Especially if this isn't a 5 min fix.


It should be faster than it takes TechCrunch to write an article about it


TechCrunch doesn't need to verify anything.


There should be enough people on that to communicate out as well as resolve.


As many comments in this thread have mentioned, the crash seems to happen only when logged in; YouTube and other services seem to work when you don't send the session cookie. This suggests something very fundamental related to user accounts/sessions is failing...

Did Maria Christensen make a mistake when adding "return true;"?

(This is a joke referencing Tom Scott's 2014 parable[1] about the danger of designing a system with a single point of failure. Tom tells the fictional tale of a high-level employee at Google adding "return true;" to the global "handleLogin()" function.)

[1] https://www.youtube.com/watch?v=y4GB_NDU43Q (you might need to open this link in a private window...)


1. Phone Gmail won't load. 2. Turn off wifi and use cell network. Still won't load. 3. Check hacker news.


Yeah I couldn’t get the hours of operation for a store near me on Google maps. Thought it was wifi at first.


Don’t blame people, blame systems and processes. I assume Google has a blameless process too. If an engineer can bring down huge swaths of google that’s not a human problem. As an eng your you should heavily invest in a sane test -> deploy -> monitoring process, and reward reliability.

Give people a bonus when things didn’t break, not only when there is a superhero that fixes broken things. Then you’re rewarding fragile systems that need superheroes.


It's super interesting that all Google services that I've tried are down _except_ for Google Search. What would isolate Search from the rest of Google's products such that it wouldn't be affected by a mass outage like this?


I’d guess Search is pretty well segregated from basically everything else because of how valuable it is - I’m logged into Google on Search and it works fine (unlike everything else)


Search works for me but it thinks I’m logged out - it just prompted me with the 'please agree to our T+Cs' dialog.


It looks like it is related to authentication, landing pages seem to be working, so going incognito you can still use youtube.

I would guess the search page does not actually use your credentials for anything since fingerprinting is separate from login?


I would've thought Google remembers what you search while logged in and sells it off. Maybe they do it in a different way - never really thought about that. I'm logged into Google when I search but my profile fails to load at the top right.


They probably do, but it is not essential to the functionality so it can work even if auth fails.

Fingerprinting happens separately because then they can identify you even if you are not using your own computer or are in incognito. Login is just a convenient way for you to tell them who you are.


It is something related to authentication. Even YouTube works without being logged in


When logging into any of my Gmail personal/free and corporate paid-for accounts I get "Couldn't find your Google account" in red writing.


Google Search Images doesn't work either: https://www.google.com/search?q=random&tbm=isch&source=lnms


Google search was slow for me during the outage (5-10s per request)

I imagine Google Search only has optional dependencies on login and correctly falls back if those fail.
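
In code, that kind of optional dependency might look roughly like this; the names and behavior are purely hypothetical, just to illustrate degrading to logged-out results instead of throwing a 500:

    class AuthUnavailable(Exception):
        pass

    def lookup_session(cookie):
        """Placeholder for a call to the auth backend; may raise AuthUnavailable."""
        raise AuthUnavailable("auth backend timed out")

    def handle_search(query, cookie=None):
        user = None
        if cookie:
            try:
                user = lookup_session(cookie)
            except AuthUnavailable:
                user = None   # degrade: lose personalization, still serve results
        suffix = " (personalized)" if user else " (logged-out results)"
        return "results for %r%s" % (query, suffix)

    print(handle_search("google outage", cookie="SID=abc123"))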


Probably something related to users authentication then


The fact it seems to be related to login reminds me of the Tom Scott doomsday fictional lecture which I would link to but you know...



opening it with Incognito mode works too


Nice, very subtle reference to our dependence on Google products. You can still link the YT lecture from a private session.


Oh yeah that talk was crazy!!


It's been ages (literally) since I felt so helplessly out of a conversation not being able to check a reference instantaneously.

Weird feeling...


Just got a text from my kids' school saying GMail is down and to use the school's direct email. Immediate reaction was "Google isn't down you idiots, the problem is on your end", go to check GMail, yep it's down.


yeah this is like saying it must be a compiler bug


I had a few of these this year, e.g. thinking I have local connectivity issues when it turned out the entire backbone is in a bad shape[0].

Or that time when 80% of the apps on my iPhone crashed on launch[1] — must be an issue with the device, right?

[0] Level 3 Global Outage - https://news.ycombinator.com/item?id=24322861

[1] Facebook iOS SDK Remotely Crashing Spotify, TikTok, Pinterest, Winno and More - https://news.ycombinator.com/item?id=23097459


Unless I'm misunderstanding something, they are going to have an immense SLA claim issue. 99.9% SLA on Workspace services, so any business paying for Google for Business (now known as Workspace) is going to have a credit claim (assuming the outage is longer than 43m 49s which feels like it will be).

Edit: As I comment it looks like things are coming back! Timing or what...


57 minutes for full Gmail restoration according to https://www.google.com/appsstatus#hl=en-GB&v=issue&sid=1&iid...


Looks like the outage was very close to that duration. Over what time period is that?


0.1% of a year is 8h45m. Or did you mean 99.99% SLA?


99.9% but calculated on a monthly basis
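
Back-of-the-envelope, for anyone checking the numbers (the 43m 49s figure corresponds to an average month of about 30.44 days):

    def allowed_downtime_minutes(sla, period_hours):
        # downtime budget = (1 - SLA) * period length
        return (1.0 - sla) * period_hours * 60

    print(allowed_downtime_minutes(0.999, 30.44 * 24))   # ~43.8 min/month (~43m 49s)
    print(allowed_downtime_minutes(0.999, 365.25 * 24))  # ~526 min/year (~8h 46m)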


Gmail and Youtube are back online for me. Hangouts is beach balling. There is a warning at the top of my screen though.

"Gmail is temporarily unable to access your Contacts. You may experience issues while this persists."


Everything is affected. It's a Gaia (Google Auth) outage, most probably.

Disclaimer - Googler here whose workplace chat is not working.


The nature of this downtime is quite severe. How are Googlers resolving this if their internal communications are down?


There are other ways to deliver packets... https://en.wikipedia.org/wiki/IP_over_Avian_Carriers


There is IRC as backup


I recall that their SREs use IRC?


Yep. At least some of the teams do.


not hangouts? allo? duo?


I don't think Allo and Duo are products designed for use in teams/organizations, or marketing themselves as such.


Good ole phone calls is my guess


I can't login to anything via Google Auth this morning.

It reports error that my Google Account can't be found. I thought my account had been deleted!


This is bad opsec


Atlassian's #HugOps page[1] collates messages of support for Devops as a public embedded Google Map and it really encapsulates the current feeling!

https://imgur.com/a/3RurTYf

Don't forget to #HugOps all the people who've been woken up on a monday morning with this! Hope this gets resolved soon :)

[1] https://www.atlassian.com/software/statuspage/hugops


Google is also rejecting SMTP ingress traffic.

(delivery temporarily suspended: lost connection with aspmx.l.google.com[2a00:1450:400c:c04::1b] while sending DATA command)


That is Really Bad™.

People are going to miss critical emails with no recovery.


Fortunately email was designed back in a day when mail servers could fail. Don’t worry - proper SMTP servers will retry patiently.


These are temporary deferrals, a properly configured outbound mailserver should retry for days.


Emails will be delayed but sending SMTP servers have a retry mechanism up to a few days.
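
Roughly the behavior the comments above describe, as a sketch; real MTAs (Postfix, Exim, etc.) handle the queue and the schedule for you, and the numbers below are made up:

    import time

    RETRY_SCHEDULE_S = [300, 900, 1800, 3600, 14400]  # made-up backoff steps
    GIVE_UP_AFTER_S = 5 * 24 * 3600                   # bounce after ~5 days

    def deliver_with_retries(send_once, started_at):
        # send_once() returns (ok, transient):
        #   250 accepted -> (True, _), 4xx deferral -> (False, True), 5xx -> (False, False)
        attempt = 0
        while time.time() - started_at < GIVE_UP_AFTER_S:
            ok, transient = send_once()
            if ok:
                return True           # accepted by the receiving server
            if not transient:
                return False          # permanent failure: bounce immediately
            delay = RETRY_SCHEDULE_S[min(attempt, len(RETRY_SCHEDULE_S) - 1)]
            attempt += 1
            time.sleep(delay)         # deferral: keep the message in the queue
        return False                  # retries exhausted: bounce back to sender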


First the two Gmail accounts I had in Thunderbird logged me out and asked to relogin.

Then Google said my accounts did NOT exist which made me feel very uneasy. Banned? Lost all my data? OMG.

Then I got error "502 That's all we know" after entering my passwords.

Finally I realized it couldn't be just me, so I opened HN and confirmed my suspicions.

It's all up and running now but I was really scared.

I'm really curious what happened because it surely looks like Google has a single point of failure in their authentication mechanism which is just horribly wrong. The company has over two billion users - a situation/configuration like this just shouldn't be even theoretically possible.


Google Pay contactless transaction not working in UK

news.google.com returning 500
finance.google.com returning 502


oof i didn't even think of google pay


I’m seeing poor internet performance overall, even accessing HN...too many relying on Google DNS maybe? Even ISPs?


Probably just people checking non-Google sites whether it's everyone or just them. 8.8.8.8 at least seems to be working, as does Google DNS for domains hosted on domains.google.com.


The Google outage is one of the top BBC headlines now, so it has become mainstream news: https://www.bbc.co.uk/news/technology-55299779

"The outage started shortly before noon UK time, with Google sites returning server errors when visited.

Users around the world reported problems with Gmail, Google Drive, the Android Play Store, Maps, and more.

...

Despite the widespread outage, Google's service dashboard for its services reported no errors."


"Google has been contacted for comment, but one spokesperson said they were unable to access their email."


It has become regional news in local regional newspapers in South India. It's that big.


This puts into sharp relief how outages and downtime can hit ANYONE.

Cheers to all the small SaaS businesses out there keeping their services up and running without much of a hiccup all year round.


210 points in 5 minutes. 500 in 10. Seems everyone sees the status page and is like "hey, time to check HN" ;-)

EDIT: You can permanently refresh and see the score increase by 2-3 points each second. Wow.

EDIT 2: 1800 after one hour, but a few points were probably lost due to HN downtime. Being quick to post is important, guys;-)


I feel like HN's server aren't very happy with everybody doing this though.

Stop bullying the servers :(


Yep, HN getting its very own HN hug of death from Google


Haha, the irony!


This wouldn't be a problem if HN migrated to a blockchain written in go for performance


Wait. I thought Rust is the new kid on the block


Would never work without a powerful AI engine in the middle.


This is probably like a user flood wave, crushing all platforms in its wake. Give it 10 minutes and Reddit will be down as well :)


HN was actually down for me when I first checked. I was wondering what possible infrastructure HN and youtube could have in common before it clicked


I came to HN to see if there was news to calm me down. We just started the business day here, and as soon as I tried to log in some 30 minutes ago, Google claimed my paid business account didn't exist, so I checked whether there was some mass banning or something like that.


I noticed Gmail and of course the google doc I was working on were unavailable. I didn't come check HN until I said to my wife "Google's down" when she complained her email didn't work and she said "check the news to see if Google's really down"


Who would have thought a lot of folks here also use Google products.

For real, though, Hacker News is getting slower every second since this thread went up. I won't be shocked if it goes down due to the surge of traffic from the Google outage.


YouTube shows a "something's broken" error. Similar errors when trying to access Gmail. Tried on multiple browsers and connections, got the same result. The search engine works fine, as of now.


Youtube works when not logged in though.


You are right, the YT homepage opens up in private tabs. It seems strange; I wonder what might be causing this.


Services are not restored. Some came up again, some not. My Gsuite business mail is still completely down, while youtube started working again.

I'm pretty sure there will be some internal conferences at Google after this to make sure infrastructure problems can't propagate across the entire company and world at this rate even in the event of a sysop fatfingering a console...


If you are still down, try restarting the app... this worked for me.


Google Home and Chromecast also rendered useless. _Hey Google, how are you?_ leads to some very passive aggressive replies.


In my house, it leads to a hilarious echo of multiple devices responding simultaneously with an error message. Having only one device respond seems to be cloud-mediated too.


"Hey Google!"

"Something went wrong. Try again in a few seconds."


So happy that I didn't even notice this, because I stopped using Google's services :)


Kudos! What do you use instead of YouTube? That’s the main product keeping me on the franchise


You mean as a consumer or as a creator?


Consumer


Is Google Auth down because of the SolarWinds Orion exploit?

https://cyber.dhs.gov/ed/21-01/


Wow, I'm glad there are a few things that I do that don't require Google auth.


Pretty much anything that requires login is broken.


YouTube Music failed in an interesting way. My music has been playing, but I started hearing ads (I'm on Premium), which prompted me to check what was going on.


That supports the claim about it being an auth issue.


I am still listening to youtube bot on discord... Maybe it will crash soon.


edit: added details
edit: redacted my phone number
edit: big mistake to add phone number
edit: I think illic is right, probably not me
edit: removed details


If it was you, why the hell on earth would you bother exposing your personal phone number on the internet and asking google to call you on this post? Like, seriously...

Wouldn't you rather call them directly on the hotline...?

That's a phone number from Bulgaria, and it looks like it should be part of the Vivacom GSM network, so I guess it's his personal mobile phone number or a scam.


> why the hell on earth would you bother exposing your personal phone number on the internet

Why not? It's not like you're gonna receive death threats exactly. I've had my personal phone number on my public website and in the footer of every outgoing email for 15+ years, never had any problems, spam or otherwise.


yes personal phone number.

Had to do what it takes...

To save the company I work for and all of our customers data. Which all is in Google Cloud!


Or the more realistic scenario is that the outage happened before you did getIamPolicy which is what caused the garbage data.


That makes sense


That reminds me of the time when I plugged a network cable back into an Active Directory domain controller. At the exact same time as the RJ-45 plug would have made the little "click", a door slammed shut and a polished steel tanker truck drove by the window, shining a bright light into my eyes.

I had just plugged a cable into the most important server in the organisation, and I saw a bright flash and heard a bang.

All was well, it was just a coincidence, and a good reminder that sometimes shit happens and it's not us, it's just timing.

Relax.


This isn't true, but considering every Google outage is a one in a billion rube goldberg domino machine, it could be true. Put this comment in the post mortem!


It’s probably not you.


Although since it seems to be back up now I suppose there is one way we could find out for sure.


If you are right, congrats! You just got a few Googlers fired!


Google doesn't fire people who cause outages.


What exactly did you do that makes you so confident?


Exactly after my setIamPolicy API request to Google Cloud was the exact moment everything went down.


Probably dozens of other people executed comparable requests in the instant you did.


Indeed. I deleted a gcp project at the same time.

However, it would be fun if it had a UUID clash with a google service :)


I'd wager a guess that you set up some weird 'expression'? coupled with some bug in the IAM service, maybe some stale resources that you were deleting at the same time?

I'd then assume once expression is evaluated the services end up busy looping / proxies throwing out internal errors and taking out capacity.

Still, you shouldn't be able to cause downtime to more than a few servers in the extremely unlikely case I am anywhere close.

PS: - I haven't used Google's IAM, so I'm guessing after a few minutes of reading the docs.

- You are incredibly unlikely to have triggered this at Google's scale.


Do you have an exact time?


What are IAM permissions?


Just permissions. The "IAM" can be safely dropped. It's exactly what you think it'd be: restrictions and privileges.

"IAM" is basically the name for a specific model of doing it.

Unless something really crazy happened, this user is unlikely to be correct. Accounts are supposed to be firewalled/sandboxed in a way that contagion can't spread to someone else's account, let alone systemwide.

It's possible (some sweeping script on a powerful connection that smashes just the right things or some exploit to break the sandboxing), just probably not likely - especially unintentionally.

But crazier things have happened https://books.google.com/books?id=rRp7DkTegMEC&newbks=0&prin...
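
If you're wondering what the getIamPolicy / setIamPolicy calls mentioned upthread actually do: it's a read-modify-write cycle over a project's policy. A rough sketch driving the gcloud CLI (project ID, member, and role are made up):

    import json
    import subprocess

    PROJECT_ID = "my-example-project"        # hypothetical
    MEMBER = "user:alice@example.com"        # hypothetical
    ROLE = "roles/viewer"

    def run(*args):
        return subprocess.run(args, check=True, capture_output=True, text=True).stdout

    # getIamPolicy: fetch the current bindings of members to roles.
    policy = json.loads(run("gcloud", "projects", "get-iam-policy", PROJECT_ID, "--format=json"))

    # Modify the policy locally: add MEMBER to ROLE.
    binding = next((b for b in policy.get("bindings", []) if b["role"] == ROLE), None)
    if binding is None:
        policy.setdefault("bindings", []).append({"role": ROLE, "members": [MEMBER]})
    elif MEMBER not in binding["members"]:
        binding["members"].append(MEMBER)

    # setIamPolicy: write the whole policy back; the "etag" field in it guards
    # against clobbering a concurrent edit.
    with open("policy.json", "w") as f:
        json.dump(policy, f)
    run("gcloud", "projects", "set-iam-policy", PROJECT_ID, "policy.json")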


It's what Amazon calls your cloud login account

Identity and Access Management (IAM)


Kole, is that you? (Bulgarian: "Коле, ти ли си?")


Hahahahaha this is gold


It was obviously not you.


Babashka is dangerous...


I highly doubt it.


Ummm.. no.


I am not Google, but if I were to start looking for idiots to throw under a bus for this outage, you would be at the top of the list.

Besides, Google won't exactly be calling you because you made that comment here.

Other commenters are right that you should not expose your personal number on the internet, regardless of whether it was you or not.


> Last thing you want is a conversation with google's lawyer

Google more or less wants people to find their weaknesses so they can patch and secure them. A person accidentally triggering a global outage is not something that would cause that person to get lawyers on them. Especially not something that only affects his or her own GCP project.


obviously it was an accident


You say "outage", Google will probably say "sunsetting" or "deprecated."


But why is HN so slow? Pages take like 30+ seconds to load for me (German vantage point, other sites are fine). Does it timeout on some Google dependency or try to use a Google submarine cable or something? Or is everyone just posting to HN about it?


I think that is simply because all HN readers using some google service jump over here to check, because google's own statuspage is useless.


They aren't prepared for such big news; you are part of the involuntary DDoS.


It is called the HN hug of death (or Reddit hug of death). The traffic due to the Google outage is much higher than usual.


I guess it's the demand of everyone checking HN to see if Google is down or not.


If memory serves, HN is hosted on a relatively lightweight server configuration.


Google login also seems to be affected, meaning all services using google account login are inaccessible if you aren't logged in.


Gmail, Calendar, Google Accounts down in Germany


That's why my Thunderbird constantly popped up to enter my Google Mail credentials…


Same, my work mail does not work. I was supposed to be in a meeting right now.


As a consequence, many websites depending on Google services are stuck too. Like the nytimes.com.

Ah btw HN is being crushed by the load lol.


Seems like services do indeed work in incognito.

https://news.ycombinator.com/item?id=20132520 Maybe this beast kicked the bucket.


I happened to use Google Takeout to download everything I had last weekend (~200 Gb)... at first I thought my account was banned and that was the only solace I had, but this definitely is killing my productivity.


I really hope this is just a temporary issue and no accounts are going to be lost. If that happens, way too many people are going to feel the pain of thinking Google makes backups unnecessary - including me.


Google has now updated the status page; every single service has an outage: https://www.google.com/appsstatus


And somehow Pokemon Go is down, too. They use GCP for hosting, but I'm not seeing other outages on GCP. Maybe they have deeper integration with Google's Cloud than publicly disclosed?


Discord was sort of down (wouldn't load), but I could log in again a few minutes later and everything was OK. My company's product is hosted on GCP and there are no problems that I can detect without administrative access (which I can't obtain because it's behind a Google login); services are returning correct results.


Google Maps for sure, maybe some GCP as well (the console is down, and Discord and PlayStation Network are also having issues, which points to some GCP service).


It works if you log in with Facebook or Apple ID, so GCP itself is fine, but I guess admins cannot log in to the management console.


Discord is GCP and down also.


I feel kinda bad - I'm pretty sure I broke Google as I was uploading quite a few large files via Google Drive and it was chugging pretty hard.

And now because of that it seems HN is also hurting.

I'm sorry internet.


Maybe it’s time to go outside?


What is this bright thing in the sky?


let's not make panicked drastic decisions just yet


back inside we go guys:(


This outage is an idea from Calico then? (just kidding)


Let's not get ahead of ourselves :)


Seems like it's the auth system that is down. Every service works in incognito/private. Shared drives in Docs/Drive do not work. Probably connected to the auth service?


Gmail service has already been restored for some users, and we expect a resolution for all users in the near future. Please note this time frame is an estimate and may change.


We've got about 3,000+ people a minute hitting Downforeveryoneorjustme.com, pretty crazy.


I can access youtube if I'm not logged in (Incognito)


Thanks for the tip! Several people said it's related to logins/auth, so this helps with getting access to videos/info.

Maybe their fingerprinting/cookies/tracking server-side logic has some bug that is failing?


The moment you realize that the only way you notice Google is down is through HN.

Cheers all you selfhosted-FOSS-alternatives! Time to bump up those Patreon contributions...


Hackernews is also having some kind of stability issues, either everyone is on here complaining about the google outages, or somehow they share infra.


Word from inside google is their auth is down, so anything that doesn't require auth should work fine. Log out of youtube, works fine.


Even the Google Sign-In service is down. Sites like Airbnb, Duolingo, etc. that use Google Sign-In can't log in their users.


How are things like Stadia? Or is anybody using more exotic Google stuff? Like, how are their self-driving cars doing?


Stadia is down.


Stadia is down for me


HN is also down because everyone is checking whether Google is down


Turns out hacker news has an indirect dependency on google!


An HN outage is soon to be the next one. The HN server is overwhelmed by the traffic due to Google's outage. :/


Google is scaring the shit out of us. We were already trying to move off Google after some past bizarre outages, and now, just before they went down, they outright claimed our business account didn't exist anymore; our business Drive and all our IMAP e-mail clients got kicked out.


This is a great and interesting thread, but I believe nobody here has discussed the opposite, positive effects of this kind of global outage. There are funny calculations, for example, about fewer spam email opens and clicks, and things like that. Think about what such a massive outage means for millions of users and automated processes: less CPU, RAM, and I/O used, and as a consequence less power and cooling, so less CO2 and pollution... It's huge! In terms of the Earth's climate, it's like a trillion-sized advance on a bill we'll otherwise have to pay in the future, and not only with money... (economic degrowth)


Bummed to have not received an email about what's happening and why from the Google team.

Has anyone on a "platinum"/"enterprise" google workspace plan received any relevant communication about what's happening and ETA on uptime?


Looking at the other comments, it seems highly likely zanzibar[1] is down. Their auth service.

[1] PDF: https://research.google/pubs/pub48190.pdf


Zanzibar is an authorization service, and we are looking at auth as in authentication here.

So no, not zanzibar


The GCP console is not working either: https://console.cloud.google.com/. However, at least the services are still running :) phew


It's a mixed bag. Cloud storage is down but at least cloud logging is up to stream all the errors.


Yup, Ops Genie is BLOWING UP with alerts... but can I get into GCP to see what's going on? Nope.


Yep.

Launched YouTube app on the Roku and was prompted to sign in. Opened browser on PC, entered "activation code" from YouTube app, after entering my Google account username on the login page, it presented me with the reassuring "Couldn't find your Google Account" message.

Tried logging in to Gmail directly with the same effect.

Thanks to the Twitters, I realized that Google hadn't canceled me, specifically (apparently they decided to temporarily cancel everyone!).

FWIW, I typically get directed to their Chicago datacenter(s).

(Note: 7 A.M. on the U.S. east coast on a Monday morning; at least they have impeccable timing!)


What do you think, how much money are they losing right now every second?


I was thinking the same thing. Also this funny bit that’s relevant: https://entertainment.theonion.com/google-shuts-down-gmail-f... “Google Shuts Down Gmail For Two Hours To Show Its Immense Power”


Do tech companies post that kind of data?

When I worked in the automotive industry, the Detroit Big 3 American auto makers had figured out their cost per minute of downtime. They would even charge that to their suppliers and anyone who shut down one of their plants.

It was something like $10,000 per minute since they’re able to roll a complete vehicle off the assembly line about every minute


We use Google Cloud and Firebase at work and our application and servers are still up and running. Can't say the same about the Cloud Admin Console, tho.


Somebody threw a Spanner in the works is my guess. Global-scale database underpinning everything (or maybe the giant that is colossus is having an off-day).


"all green, which does not reflect reality for me (e.g. Gmail is down)"

This isn't true. Which makes it more complicated as we try to figure out what systems are down -- since their status board is usually faked.

gcloud compute instance-groups list-instances [removed] --zone=us-central1-c | awk '{print $1}' | grep -v NAME

ERROR: (gcloud.compute.instance-groups.list-instances) Some requests did not succeed:

- Internal Error


I'm very interested in HN' crowd back-of-the-envelope estimations on how much this outage costs the world (when we know how long it lasted).


It's impressive how fast they are at fixing outages, barely a coffee break required. I'm almost starting to believe in this DevOps thing...


Still down for me. I must not be in the "vast majority of users"; a bit of a premature "green" on that.


Interesting that I have been watching a long video (over an hour long) and the buffer is still loading while YouTube in general is down for me


Amazing how many products are clearly getting affected due to this. The top 18 products on downdetector.com are all Google-owned or related


Is it just me, or does Google Search rarely go down? Is it somewhat isolated, or a fundamentally different architecture?


youtube-dl is still able to download videos, which is quite interesting ^^


Singalong: She broke the whoooole world, with her change, he broke the whole wide world, with his change, they broke the whole world with their chaaaaange!

I'm sure its stressful right now. But someday, these engineers will look back and retell the stories about how it happened and the lessons they learned. Hopefully with a laugh.


GOOGLE UPDATE YOUR GCP STATUS PAGE. Stop making us run around trying to figure out what systems are down.

Running: timeout 60 gcloud compute instance-groups list-instances [removed] --zone=us-central1-c | awk '{print $1}' | grep -v NAME

ERROR: (gcloud.compute.instance-groups.list-instances) Some requests did not succeed:

- Internal Error


I’m seeing poor internet performance overall, even accessing HN...too many relying on Google DNS maybe? Even ISPs?


Well, HN's auto-hug of death. Also, probably most sites can't load either analytics nor ads...


FYI, if you have cron jobs using Google APIs, disable them or delay them.

A bot is flagging IPs as abusive. It should clear up later.
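
One cheap way to "delay them" is to add random jitter and back off on errors, so a whole fleet of cron jobs doesn't hammer the recovering API in lockstep. Just a sketch with arbitrary numbers:

    import random
    import time

    def run_with_jitter(job, max_jitter_s=300, max_retries=3):
        time.sleep(random.uniform(0, max_jitter_s))   # spread load across the fleet
        for attempt in range(max_retries):
            try:
                return job()
            except Exception:
                # exponential backoff, capped; give up and let the next cron run retry
                time.sleep(min(60 * 2 ** attempt, 900))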


Gcloud/GKE command line utils are also having problems.

Was deploying some test applications and kubectl started complaining about the gcloud auth helper exiting with non-zero errors. Tried launching Cloud Shell from the website and nothing happened.

The web application which is a site that does not rely on any external API is running fine.


Google Meet and YouTube also 500 in Brazil


Not just Google: Hacker News was also displaying "We're having some trouble serving your request. Sorry!" Does it have something to do with Google infrastructure?

Btw, it's funny that you listed all the Google services in exactly the order I found them unavailable: first YouTube, then Gmail, then Docs.


Can I just say I'm remarkably relieved I have an excuse to kick back for the next hour and do nothing?


I hope it lasts a bit, not just five minutes. We might finally question our dependency on Google services.


It looks like the auth service is broken. Everything works when logged out. Attempting to log in gives "Account not found".

EDIT: Looks like they logged everyone out while they fix the auth issue, presumably so people can use YouTube and other stuff that doesn't necessarily need login?


Was wondering why YouTube can't show videos from the search page; it even reverted my language settings.

South Africa


Seems like the incident lasted from 04:00 AM PST to 04:30 AM PST, at least for BigQuery (sorry, image timestamps are in JST, since I'm in Japan): https://imgur.com/a/kiHED6v


Definitely started before the hour.


Can confirm Incognito Mode and Private Tabs are still working on YT. Seems to be something auth based.


All services are still down, https://www.google.com/appsstatus#hl=en&v=status

Crisis averted, was not locked out. Gotta take out all the data though when it gets back up.


Thank God! Google Cloud Print is still working! I'd be destroyed if that ever stopped working!


The worst outage I've ever seen with Gmail. I can't log in, getting "Couldn't find your Google account" message. Also, with the already logged in account, I can't browse anything but whatever is in cache. I'm definitely worried.


It's time to decentralize Gmail, YouTube, and Google Docs, and maybe their search engine too.


That's what it was 15 years back. I was slightly concerned when Google Mail and Chat were integrated. Look where we are now.

Right from "The battle of the red cliffs", we will never learn the disadvantage of lashing too many boats together.


But our tech right now is far more advanced than 15 years ago.

We have IPFS, blockchain, and the Dat protocol today. I think it's possible to kill the giants.

Even Tim Berners-Lee wants to decentralize the web: https://techcrunch.com/2018/10/09/tim-berners-lee-is-on-a-mi...


Even? Tim Berners-Lee tried since day 0 to make the web decentralized. One of the initial requirements in the 1989 proposal for a global hypertext system included "Non-Centralisation" where "new system must allow existing systems to be linked together without requiring any central control or coordination". See https://www.w3.org/History/1989/proposal.html for the rest.

While it went OK-ish for the internet, we massively screwed it up for the web.


No we didn't. The web is very nicely decentralised, especially with the general malaise and decline of Facebook.

What isn't decentralised is web apps but TBL never intended the web to be used for full blown desktop app replacements in the first place so no surprise it doesn't meet its design goals for that.


Anyone experiencing lost emails? I'm not sure if it's actually me messing things up or the Gmail outage, but I couldn't find an extremely important email, and I've tried the archive, trash, deleted, spam, and sent boxes...


I did not realize how dependent I was on Google and its services until 30 minutes ago.


This affects GCP as well as Google's consumer services. I'm here because I was paged about low traffic for a service. Turns out to just be that I can't access any metrics at all from Google Cloud Monitoring (i.e. Stackdriver).


To catch such downtime easily, I have alerts set up for my website, server, and API using https://simpleops.io/

After that I realised how frequently you can actually get downtime.


These rare events help make extremely visible how much of our general basic infrastructure we've farmed out to a small handful of companies, centralizing something that was supposed to avoid the problems of centralization.


Look, the status page still works: https://google.com/appsstatus#hl=da&v=status but it says everything is broken.


Luckily nobody is using Google services for anything critical. Or what, are you?


I feel like Google is having more outages recently. I worked there for 4 years and even in such a short time you really noticed the shift from engineering focus to business focus. But maybe it is just a coincidence.


I interviewed for Google many years ago and recently. The expectations fell through the floor.

I've seen plots of the number of employees too. I'm not sure how Google will dig itself out of that one.


They made you interview again even after having worked there before? That's odd. It used to be that they didn't do that.


I never worked for Google. They wanted to hire me after the first set of interviews, but I took a different opportunity at the time. The interview process was intense, and the interviewers were sharp. I came out even more impressed with Google. This was way back in the day -- early '00s. I would totally be excited to work for that Google. It's just that, well, there are lots of awesome things to do, and I had (what seemed like at the time) a more interesting option. I sometimes had regrets and sometimes not.

Second process was maybe 2-3 years ago. It didn't get to a full onsite, since after a few conversations with the team, it was clear there wasn't a fit. The old arrogance was still there, but without the same sharpness or cleverness. I spoke to a team working on a new product (under NDA) in a field I had a lot of experience in, and:

1) There were computer scientists without a clear understanding of the target market domain.

2) They believed they were the best-and-brightest, and didn't need to consult experts in the field.

3) There was a lot of hype and salesmanship.

I like to be surrounded by smart people, and it felt like I'd be the sharpest guy in that room. I was a better computer scientist, AND better domain expert in the field the product was in than anyone on that team. That's not a position I want to find myself in.

And I have a job which pays a lot less than Google, but I'm surrounded by smart people I learn a heck of a lot from, where I'm working on meaningful things, and having fun. I'm also continuing to build my personal brand.

Now a few years later, the product they were working on never shipped, so if I worked for Google, I would have likely been on a failed project. On the other hand, my mortgage would likely be paid off.


Switzerland's down too. It works if you access it without session though.


Seems to also affect Google Cloud console to some extent. Dashboard data is not loading at the moment. Not sure what services exactly are affected there. Google Cloud SQL seems to be up and running so far.


YouTube Greece is down, returning an "oops" error, but I can still play back a 2-hour video that I was already watching before it went down. Trying to open any other video or channel returns the same error.


At this point Amazon and Google should just have a static page with green buttons for their status page.

They NEVER show red, or oddly, even go down themselves during outages (the last time Amazon went down springs to mind).


Yup, suffering it right now.


At times like this I am thankful of running a Google takeout periodically.

Not because I use it as a live failover for Google services, but because these outages are good reminders of the mortality of online services.


How long did your takeout take? It’s been 3 weeks and mine either says not ready yet, or when it is it will say unable to download and force me to start again...


When I request a full takeout it usually takes 2-4 weeks to generate.

This time around it was missing a .tgz which had error'd. A subsequent full takeout request took a couple of days.

The .tgz files, I find, can only be reliably extracted on Linux... the Gmail export for me is 20GB, which in the last few exports has been a single file in the last .tgz.



Google employees can't login to update their status board on https://www.google.com/appsstatus ?


Seems to be down in a bunch of places https://istheservicedown.co.uk/status/gmail


Good. Now at least a few people will search for Google alternatives on DDG.


Search is still working. It’s Drive, Meet, and other stuff that millions of people depend on for work, so maybe stow your glee for a moment.


Google is still the best when it comes to search. But it's also important to know the alternatives. You'd be surprised how many people don't know any Google alternatives, lol.


How could all of Google's websites be down this long, though? You would at least think Google, being the company they are, would have some sort of backup servers or backup plan type of thing. Idk.


Not a fan, or even a user of theirs, but kudos to Google for admitting the failure. Amazon (at least AWS) customers would probably have been shown a green dashboard and told to review their code.


It's back


Apparently Google "can't find" my account.

I was so scared for a while, then I found that all of my family members have the same issue. Then checked HN and breathed a sigh of relief :)


A good explainer video here: https://www.youtube.com/watch?v=5p8wTOr8AbU


YouTube is working in India, but I cannot log in to any Google accounts. Gmail and Drive are not accessible via the web. Drive sync is working, though. Appstatus is still showing green.


The cool thing is that newpipe is working while youtube is down.


Hey, are you updating newpipe from F-Droid? I do, and it stopped working for me weeks ago.


Strangely, everything seems to be a little faster in incognito mode.


Now we'll have a bit of outrage, talking about how dangerous it is to have the whole world impacted by one company's processes and forget about it in 2 days


HN is faster and more accurate than status pages of major service providers themselves. Special thanks to people working at those service providers and posting here!


Waiting for the post-mortem to see it was all because someone forgot to renew an SSL cert that nobody knew was a single point of failure for literally everything.

:D


Seems that Google has no budget to pay their GCP instances


Google Search seemed to be also down: https://serpapi.com/status


Time to mention the alternative, a decentralized internet, e.g. https://zeronet.io


How much do you think Google can learn from this incident's postmortem? Writing down all the internal details probably amounts to a small book.


Gmail is currently returning "550 5.1.1 The email account that you tried to reach does not exist" for many valid gmail addresses.


In an incognito window (without Adblock) it works for me!


Google's page is right: the servers are up and running, just not reachable. Could be an infrastructure attack (DNS?) or someone just broke a cable :)


I wonder why services like Uber (which rely on Google Maps for everything) are still working, with things like routing being completely okay.


Yeah weird it’s all green. I just noticed things down as well. Came to HN to see how quickly people posted. This is already #1 at 6:55 am EST


Yeah, that status page is really frustrating. Didn't anyone consider building it to REALLY check whether services were working? Not just returning a 200 status... but actually DOING things?


Appears to be only affecting pages that have an authenticated session; Youtube works in a private window, but not in window with cookies.


Also Google Voice. I ctrl+F'd for that one, and it's one thing I haven't found a way to easily degooglefy from. Also down.


I wonder if this means Google Fi is affected too?

And I also wonder how many Google engineers might not even be getting paged right now as a result.


The appstatus page from Google is a good indicator for me of which products are currently live from Google. Uhm, more of a sidenote :)


"Couldn't find your Google Account"


Seems to be auth related. Login does not work, and Youtube is also throwing Internal Server Errors, but only if I'm signed in.


Seems to work when you're not signed in. Won't help with gmail and docs, but maybe a workaround for YouTube and search.


"Sign in with Google" also fails. So this outage is taking quite a few otherwise unrelated third party apps with it.


Google Cloud is also down. Can not access either Cloud Console or any of our GKE clusters. Stackdriver, Cloud Auth also down.


Even my personal e-mail (using google workspace) is not accessible

Did some asshole forget to renew a certificate or run the cron job to renew?


Gmail, Chat and Meet down from Italy as well, was in a Meet call and got disconnected, only getting 500s across all three now


Alright, what did Gilfoyle do this time? ...


Jarring reminder of how addicted I am to YouTube... getting childishly irritated that I can't continue to binge lol


Use duckduckgo to search for videos and you can watch most of them right on the results page.


Discord also seems to have gone down for me. Also (probably unrelated?) HN is really slow compared to normal right now.


Youtube, Gmail and Drive pages open in incognito mode but can't sign in. The issue seems with the account manager.


HN also seems to be loading slowly with the sudden inflow of traffic. It would be funny if HN went down because of the Google outage.


Migrated out of G infrastructure a long time ago. G Search is working fine, G Maps is down (continental EU). HN is very laggy.


Seems the issue is related to the google authentication system since if logged out you can visit youtube without error


At least Google DNS (8.8.8.8) still works. So, the people who use it, still have access to the rest of the Internet.


Seems like some core service crashed since all of these sites are unavailable. However, google.com seems to be fine.


I think it's just Accounts that is down, so anything where you are logged in isn't working.


UK here. YouTube and Gmail are inaccessible, search is working fine. The Workspace Status Dashboard is showing all green.


Drive too. Status board still green though.


All down, but all green.

New norm: let's have a status page, but make it green all the time :-). Surprisingly consistent practice.


Looks like the problem is with Accounts. Most Google apps work when not logged in, i.e. can view YouTube videos.


Same here in Germany. Gmail says that my account cannot be found. Plus Google's support page is also down.


Who's laughing at the self-hosters now, huh?


I think they are out of hard drive space.


My Gmail app on iOS keeps displaying a “No Connection” error even though my phone is connected to the Internet


The problems seem to have been (mostly) resolved over here, at least the authentication seems to work again.


The page title is also weird: "Error 500 (Server Error)!!1" - why the !!1 at the end of it?


Internet jargon; when a commenter is overly excited, they tend to overuse exclamation points. In the rush to add many exclamation points, the shift key might not be held down for all of them.

So you end up with things like "OMG!!!!111"


It's probably a running joke internally; anecdotally, I've come across it a few times on different Google properties over the years.


Someone forgot to hold down the shift key when typing the third mark?

A secret status code to differentiate types of server errors?

A joke?


Everything down in Canada. At least everything that seems to require auth. Gmail, Gsuite, Youtube and more.


All G Suite down for me, says "Couldn't find your Google Account". Broke youtube mid-stream.


How does this incident affect IoT devices? I'd expect many of them to be using Firebase just like HN.


People were reporting not being able to turn their lights on: https://twitter.com/joemfbrown/status/1338452107419148290


Looks like if Google services are down for more than 45 minutes, Workspace customers get 3 days added.


I wonder how on earth googlers are supposed to fix this when they can't log in to make changes...


YouTube works for me in Firefox just fine.

But I have it configured with some of the recommendations from Privacytools.io


That's interesting, Youtube never works for me in Firefox just fine. Video playback gets stuck somewhat regularly, requiring FF restart.

But yeah, the page loads for me as well :)


Just use alternatives:

* DuckDuckGo instead of Google Search

* Bitchute / LBRY / Odyssee instead of YouTube

* Nextcloud instead of Google Drive


2000 points in an hour on HN. 120K upvotes on down detector. Trending worldwide with 500K tweets.


Google Cloud Platform is down in US.


Maps could be the scariest one! All those drivers wondering how to get where they're going.


This is a good time to think about what would happen if you lost access to your Google account.


Maps is partially affected: page reviews return 500, and now even location pages don't open.


It is possible to reach youtube in incognito mode, so it seems to be a google accounts issue.


exactly right


Transitioned to Google Workspace from Microsoft Office 365 last week and it’s a nice welcome.


All services are down in Sénégal as well. I first noticed YouTube, then everything else followed.


Yes, I can confirm this. I got an error when accessing Google Docs, Google Meet, and Gmail.


Calculating cost of downtime ==>

World economic output is about $150mn/minute.

Current world population is 7.8bn.

Google has 4.8bn users. That's approx 61% of the population.

Let's assume about 50% of the users were impacted = 2.4bn (which is 30% of the population).

So, the loss could have been about $50mn/minute.

This is without taking the SLA into consideration. There will be losses incurred on that too, won't there?
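
Working the commenter's own numbers through (every input below is their assumption, not a measured figure), a quick Python check:

    # Back-of-envelope estimate; all inputs are assumptions from the comment above.
    world_output_per_min = 150e6   # $150mn of world economic output per minute
    world_population = 7.8e9
    google_users = 4.8e9
    impacted_fraction = 0.5        # assume half of Google's users were affected

    impacted_users = google_users * impacted_fraction      # 2.4bn people
    impacted_share = impacted_users / world_population     # ~31% of population
    loss_per_min = world_output_per_min * impacted_share   # ~$46mn per minute

    print(f"{impacted_share:.0%} of population -> ~${loss_per_min / 1e6:.0f}mn/minute")

That works out to roughly $46mn/minute, in line with the ~$50mn/minute figure above.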


Looks like the authentication services are having problems. I get an "account not found" error.


Gmail and Drive down in Thailand.


Blogger is also down, even for public site access, so it's not only auth related.


Curious that minutes ago I was listening to music on YT. BTW, not all services are down.


Me too. But it started playing ads because apparently I wasn't logged in anymore... So it works but without "using" your account.


Hungary - gmail down, youtube down if logged in, search works with 5s response time


Sweden. The same problem. Authorization doesn't work. Gmail doesn't work.


All down for me, Serbia. YouTube works for me when logged out (i.e. in Incognito).


Yep. Gmail redirects to Gsuite Status Control and shows everything as fine there.


And just on the day that electoral college certifies the election. Great timing!


Hey, can someone explain to me how all Google apps are down? I don't know a lot, but I do know that a kid DDoSing these websites is not a possibility, so what kind of attack is going on here? And how come Google doesn't have, like, backup servers to run their websites on, or is that not a thing?


Woah


I hate the fact that the status page is not reflecting anything related to it.


In Japan Youtube is down too.


Everything is back and working fine now, here.


Google Cloud outage? Looks like Discord is down too. Might be a coincidence.


Back up for me, almost. Some warnings on gmail but the inbox is loading up.


Whataya know Youtube didn’t work and I came to HN to check what happened.


Down in India as well: G Suite, GCP IAM, YouTube, Calendar, Drive et al.


Google Workspace (EU) is down, and also Google Cloud (the login at least).


Same here, can't access gmail (currently on a trip in Poland).


This morning Microsoft O365 also had problems with authentication.


My service accounts for the play store are all returning err 500.


Anyone else has this in the technical details? Numeric Code: 7444


7498


Is google's status page actually linked to anything?

Or is it ahem airgapped?


I'm able to access GMail and Maps but not Youtube or Drive.


HN is loading very slowly too for me. Related to Google outage?


Sort of; HN is slow because after the outage a LOT of people are on HN. The servers seem overloaded.


HN is the fastest way to know about outages.

Youtube & Gmail are down for me.


Seems to be working again for me, except Gmail says "Gmail is temporarily unable to access your Contacts. You may experience issues while this persists."


Seems to be back up. Most services are working outside of incognito.


Can verify - India. Gmail, Drive, Docs, Youtube not working.


As another user mentioned, they work in incognito. Looks like there is a problem with Google auth, which is causing the outage.


Well, search queries are taking up to 3 seconds here in France.


Hungary - gmail loads, can't send email (error #2014)


Has google.com ever suffered such outages in the recent past?


This is google's way of saying do not work from home


Confirmed from Spain. Google Workspace seems to be down.


Everything down in HK


Everything seems to be back to normal, for me at least.


Looks like some Google Cloud customers are offline too.


Can verify - Finland


>drive not working

Something something someone else's computer.


Can confirm from the Netherlands, Hangouts also down.


Only thing working for me is the Google search engine


Guys, chill out! We're DOS'ing HN already!


Regular text search is working for me right now, but Gmail and Maps are totally broken. Google Drive/Docs are also totally non-functional.


Yup, all down in Vietnam too. This page took some time to appear as well, so I suspect a lot of issues in the pipe (mostly via China, I gather).


It would be so interesting to see the postmortem...


The console for Google Cloud Platform is also down.


Drive and Gmail seem to be back for me (12:31 GMT)


Can't access my files on Google Drive, shit.


Looks like Hacker News is having trouble as well


Can confirm, gmail, gdrive and youtube are down.


On the Reddit thread there are some really good jokes about this [1], related to their asinine interview questions. Like:

> Did they try to fix them by inverting a binary tree?

>> Yeah maybe implementing a quick LRU cache on the nearest whiteboard will help them out here

>> Did they try checking what shape their manhole cover is?

>> Dev ops was too busy out counting all the street lights in the United States

[1] https://www.reddit.com/r/programming/comments/kcwqij/every_s...


AFAIK (as someone who does not work at Google), "brain teaser" questions have not been asked at Google for almost a decade.

Every time I see complaints like this I can't help but think the posters have a chip on their shoulder from being rejected by Google or a similarly selective company. It's never a commentary on any fundamental issues with these types of interview questions, and no one ever comes up with a better process that is equally scalable and effective for hiring generalists. In my experience interviewing with dozens of small, medium and large companies, the vast majority of technical roles require these types of questions now. These come off as criticisms against Google specifically for asking hard variants of these questions.

I don't personally have any issue with companies asking questions like these as long as they don't simply look for "a correct and optimal solution coded up perfectly whilst under stress in under 30 minutes", but rather the process of solving the problem.


If we're talking about FAANG, I don't believe that they really care about "the process of solving the problem". If you don't come up with a close-to-optimal solution in about 5 minutes, you're in trouble - there will be competitors who are better prepared and thus can. Solving algorithmic questions is a very trainable skill, after all.


Funnily enough, this response is exactly what the parent seems to be talking about. FAANG originated as a stock term and not as a grouping of companies with similar hiring processes. i.e. Interviews at Netflix are generally domain specific. Likewise, Amazon isn't the same as Apple isn't the same as Google.

> If you don't come up with a close-to-optimal solution in about 5 minutes

Disingenuously hyperbolic. I was only given a single problem to solve in each of my 45 minute "coding rounds" at Google and Facebook. If you approach these interviews like a competitive coding competition you will absolutely not get the job. By that I mean arriving at a non-optimal solution as fast as possible while writing messy code and not explaining your thought process. Not to mention that senior roles involve an increasing number of interviews that focus on design rather than algorithmic questions.

This all seems like nonsense coming from individuals who are (understandably or not) frustrated at an interview process that doesn't align with their strengths.


But their hiring practices are pretty similar (perhaps except Netflix, don't know anything about them).

In one of my interview rounds at google I was under huge time pressure, despite coming up with an optimal solution in seconds. That was an outlier as the interviewer was not aware of this solution, so explaining it took a long time. But still.

The valuable part from the competitive coding background is not the coding, but "inventing" solutions. In quotes because it's mostly a pattern-matching process. And it's a _huge_ advantage.

Personally I'm perfectly happy with the current process, as I took about half a year to become good at it, and the payoff is nice. :) But I'm convinced that's what this process selects for first and foremost - amount of preparation.


Funnily enough, this response is exactly what someone who succeeded under this model would say.

This all seems like nonsense coming from individuals who are (understandably or not) frustrated that others would suggest that their success was only due to an interview process that aligned with their strengths, rather than their obviously superior skill and intellect.


I don't think I would be considered a "success" under this model, as I never passed a tough interview and currently work at a company you will never hear of. I simply have no desire to justify my skills by talking others down and pretending I'm too good for companies that reject me.

I'm not sure how this is a valid rebuttal to my comment, since it addresses none of the points I made. Do you really think the secret sauce to getting into highly competitive jobs is incoherently speed-running through algorithm questions?


I just took your first and last sentences and had some fun with it.

That way of interviewing is good because it weeds out people who don't prepare and thus might be lazy or just looking for an easy paycheck. But it also tends to mistake good memory for technical and abstract questions for competence, and if you're not careful you'll end up with a lot of copy-pasted humans that are really good at inverting binary trees but suck at everything else. Also, people who have already passed it tend to hold on to it as a point of pride, and might fail an otherwise amazing candidate just because of a failure on a question that has no practical use outside of a textbook. Those are my thoughts on it, and I'd wager the thoughts of many others who are critical of the model.

> Do you really think the secret sauce to getting into highly competitive jobs is by incoherently speed running through algorithm questions?

No, the real secret sauce is having a friend or relative on the inside. If you don't have that it's a gamble most of the time imo.


On the flipside, you can say something similar about your response regardless of your actual position.

What you've basically told him is: "Your experience is false, you're just biased and whatever your success was is a lie."


Yes. Thank you for explaining my comment to me, and you're completely right, but you missed the part where my comment was based on his first and last sentences only going the other way around. It did seem a bit dismissive and condescending at the time so I had some fun with it.


> I can't help but think the posters have a chip on their shoulder from being rejected by Google

I think so too, look at this angry (it says "fucking" 3 times) comment over at Reddit:

> It’s a waste of fucking time. I’m a really talented programmer [...] I’ve been eliminated based solely on messing up some dumbass arbitrary puzzle.

https://www.reddit.com/r/programming/comments/kcwqij/every_s...

And the only reply that, to me, somewhat seems to understand what's going on is getting lots of downvotes, so it got collapsed and you cannot easily find it:

> They just play a numbers game, and are able to discard perfectly capable candidates with this kind of questions, and still get a bunch of great candidates

https://www.reddit.com/r/programming/comments/kcwqij/every_s...

(Although it started in a bit of a weird way with "Lmao"; then again, the other replies weren't any more polite, were they?)


None of these are asked in G interviews. The commenters are asinine.


Not necessarily true. Some of these most certainly have been.



This story seems like a bad example of a false negative. There were claims by Google insiders that Max was given a particularly easy interview as a formality. I am by no means a brilliant programmer and also rusty with algorithms, but given the structure and question, I would be surprised if most people couldn't figure out how to invert a binary tree within a matter of minutes.

Max himself later opened up[0] and admitted to being difficult to work with. It is entirely likely that he was rejected based on his personality and not his ability. As someone who contributed to Homebrew many years ago I would not be surprised if this was the case. In their own words: "I am often a dick, I am often difficult, I often don’t know computer science". I am not sure why any company would want to hire someone like that and put the culture of the team in jeopardy.

[0] - https://www.quora.com/Whats-the-logic-behind-Google-rejectin...


[flagged]


I love your response because I have a PhD in Computer Science. I do LeetCode and CodeWars problems in the mornings for fun (I like coding and it helps me avoid getting rusty) and I have been a hiring manager for over 10 years.

But yeah, other than the "sexiness" of CS concepts it's only because of the "narrative" we want to push.


So you do think LeetCode is useful? What's up with the flex? Congratulations on the PhD and job, I guess? Looks like you could definitely invert a binary tree.


LeetCode is good for some things, but in my experience it is not a good filter for software engineers (it is a good metric to filter candidates to form a team for the IOI or similar).

This is a response I wrote elsewhere:

I've been growing engineering teams for the last 10 years as a hiring manager in different startups. At some point in our startup we had those kinds of HackerRank questions as filters.

The thing we realized is that those sorts of interviews optimize for hiring a specific type of very junior engineer who has recently graduated or is graduating from CS. That is because those are the people that have the time to churn through these types of "puzzle" problems. In particular, there are 3 types of recent graduates from CS or related fields: the ones that don't know crap, the ones that focus on these sorts of problems, and the ones that are "generalists" because they dove into all sorts of subjects during their degree.

I found out that the junior people who excel at those sorts of problems have a huge learning curve to climb before they are productive in a "production", real-life environment. By contrast, the "generalists" work better.

We stopped doing those sorts of algorithm-puzzle interviews after that realization, and we started getting really good engineers with great real-life experience.


Great, let me know when this scales to 100k people.


Okay? No one's discounting his achievements. What's wrong with knowing stuff about a binary tree? Is ignorance a badge of honor now?


Yeah, nothing wrong with not having memorized how to do that - it's certainly not often useful in practice. But if you're given the definitions and a couple minutes to figure it out (which, in an interview, you are), and you actually can't come up with the 10 lines of code to do it, maybe that question is working as intended after all...
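
For what it's worth, the inversion really is roughly ten lines once the node type is given; a minimal Python sketch (the class and field names are illustrative, not from any particular interview):

    # Mirror a binary tree by recursively swapping each node's children.
    class Node:
        def __init__(self, value, left=None, right=None):
            self.value, self.left, self.right = value, left, right

    def invert(node):
        """Return the same tree with every node's left/right children swapped."""
        if node is None:
            return None
        node.left, node.right = invert(node.right), invert(node.left)
        return node

    # Example: invert a small tree and read the new root's children.
    root = invert(Node(1, Node(2), Node(3)))
    print(root.left.value, root.right.value)  # 3 2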


> But if you're given the definitions and a couple minutes to figure it out (which, in an interview, you are), and you actually can't come up with the 10 lines of code to do it, maybe that question is working as intended after all...

Or maybe the person struggles to work things out in an extremely stressful situation.

I speak as somebody who finds interviews Extremely stressful, and if working there regularly involves my interview levels of stress then please do reject me.


Don't you think actors audition for parts? Is that stressful? Does preparation help? Jobs that pay 300k+ will be in demand. With that comes pressure.


None of you have taken interviews at G. There's a strict edict to discard these questions.

"Dev ops was too busy out counting all the street lights in the United States"

If it were a consulting interview, anyone would have gotten in. Thankfully no one asks these at G.


I was kicked out of a Google Meet call just now


Google meet/hangouts, appliances, the lot.


YouTube is up for me, but I CANNOT log in to Gmail.


Yeah, drive and youtube seem to be down as well


Gmail, Docs and YT down for me too in Romania.