Google outage – resolved
2316 points by abluecloud on Dec 14, 2020 | 827 comments
various services are broken

- youtube returning error

- gmail returning 502

- docs returning 500

- drive not working

status page now reflecting outage: https://www.google.com/appsstatus


services look to be restored.

If you pay for Google Services, they have an SLA (service level agreement) of 99.9% [1]. If their services are down more than 43 minutes this month[2], you can request “free days” of usage.

Edit: Services were down from ~12:55pm to ~1:52pm; that's 57 minutes. Thanks hiby007

[1] https://workspace.google.com/intl/en/terms/sla.html

[2] https://en.wikipedia.org/wiki/High_availability#Percentage_c...
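A quick sanity check of the 43-minute figure above (a sketch assuming a 30-day month):

```python
# Allowed downtime per month at a given availability target,
# assuming a 30-day month (43,200 minutes).
MONTH_MINUTES = 30 * 24 * 60

for target in (0.999, 0.9999):
    allowed = MONTH_MINUTES * (1 - target)
    print(f"{target:.2%} SLA -> {allowed:.1f} minutes of allowed downtime per month")
```

At 99.9% this comes out to 43.2 minutes, matching the figure in the comment; the 57-minute outage therefore exceeds it.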

If all 6 million G Suite customers, averaging 25 users per account and paying the $20/user fee, requested the three-day credit for this breach of the SLA contract, it'd cost Google about 300 million dollars.

Which is 0.22% of their COH (cash on hand) this quarter...

Or, on a regular basis, they'd get $300M every other month (excluding any fees).
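The $300M estimate above can be reconstructed like this (all inputs are the commenter's assumptions, not Google's actual figures):

```python
# Commenter's assumptions, not official Google numbers:
customers = 6_000_000        # G Suite customers
users_per_customer = 25      # average users per account
fee_per_user = 20            # $/user/month
credit_days = 3              # SLA credit for the outage

monthly_revenue = customers * users_per_customer * fee_per_user   # $3.0B/month
credit = monthly_revenue * credit_days / 30   # three days of a 30-day month
print(f"total credit: ${credit / 1e6:.0f}M")  # → total credit: $300M
```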

Your nines or their nines?

I bet if you personally can't use it, but their overall reliability meets the bar, then they're within SLA.

Don't ask why I know this.

You know, there are companies that build crawlers or health-check agents exactly for this purpose, so they know precisely from when to when the service they pay for (or that their business needs) didn't work and went out of SLA. I think it's brilliant, and the only way to make any company pay. I believe you can sometimes get away with a couple of Pingdom checks/jobs.
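A minimal sketch of such a health-check agent: record the result of periodic probes, then compare measured availability against the SLA target. The one-minute interval and in-memory log are assumptions for illustration:

```python
from datetime import datetime, timedelta

SLA_TARGET = 0.999                     # 99.9% monthly availability
CHECK_INTERVAL = timedelta(minutes=1)  # assumed probe interval

def availability(results):
    """results: list of (timestamp, ok) tuples from periodic probes."""
    if not results:
        return 1.0
    return sum(1 for _, ok in results if ok) / len(results)

def out_of_sla(results, target=SLA_TARGET):
    return availability(results) < target

# Example: the first 57 one-minute checks of a 30-day month fail
# (roughly this outage's duration), the rest succeed.
start = datetime(2020, 12, 1)
checks = [(start + i * CHECK_INTERVAL, i >= 57) for i in range(30 * 24 * 60)]
print(f"measured availability: {availability(checks):.5f}")
print("out of SLA:", out_of_sla(checks))
```

With your own timestamps on record, you can show the provider exactly which window fell outside the SLA rather than relying on their status page.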

Like they have any bargaining power against Google.

I don't know how it works in such cases; I mean, it's publicly known that there was an outage.

What I believe is that customers will probably get free GCP credits and that's it; everything is good as before.

This person service-provides

In [1] it says: "Customer Must Request Service Credit." Do you know how to request it?

Admins can go to admin.google.com and click the help button to start a support request.

Is this an official announcement? I don't find any link to Google in your posted URL.

not an announcement, seems like just a statement. I assume an official statement will come with the post mortem blog post in the next few days.

If you click on the red status dots, it has a report with timings.

SLAs are largely bullshit.


This topic came up recently on a podcast I was on. Someone said a large service was down for days, and the outage tanked his entire business. But he was compensated in hosting credits for the exact amount of downtime of the one service that caused the issue. It took so long to resolve because it took support a while to figure out that it was their service, not his site.

So then I jokingly responded with that being like going to a restaurant, getting massive food poisoning, almost dying, ending up with a $150,000 hospital bill and then the restaurant emails you with "Dear valued customer, we're sorry for the inconvenience and have decided to award you a $50 gift card for any of our restaurants, thanks!".

If your SLA only provides precisely calculated credits, that's not really going to help in the grand scheme of things.

I like your anecdote, I might steal that one.

IANAL, but I negotiate a lot of enterprise SaaS agreements. When considering the SLA, it is important to remember it is a legal document, not an engineering one. It has engineering impact and is up to engineering to satisfy, but the actual contents of it are better considered when wearing your lawyer hat, not your engineering one.

e.g., What you're referring to is related to the limitation of liability clauses and especially "special" or "consequential" damages -- a category of damages that are not 'direct' damages but secondary. [1]

Accepting _any_ liability for special or consequential damages is always a point of negotiation. As a service provider, you always try to avoid it because it is so hard to estimate the magnitude, and thus judge how much insurance coverage you need.

Related, those paragraphs also contain a limitation of liability clause, often capped at X times annual cost. Doesn't make much sense to sign up a client for $10k per year but accept $10M+ liability exposure for them.

This is just scratching the surface -- tons of color and depth here that are nuanced for every company and situation. It's why you employ attorneys!

1 - https://www.lexisnexis.com/lexis-practical-guidance/the-jour...

> Doesn't make much sense to sign up a client for $10k per year but accept $10M+ liability exposure for them.

Businesses do this all the time, this is how they make money. And they use a combination of insurance and not %@$#@*! up.

I've never seen an SLA that compensates for anything more than credit off your bill. I can't imagine a service that pays for loss of productivity, one outage and the whole company could be bankrupt. If your business depends on a cloud service for productivity you should have a backup plan if that service goes down.

I haven't seen one (at least for a SaaS company) that will compensate for loss of productivity/revenue etc, but something like Slack's SLA[0] seems like it's moving in the right direction. They guarantee 99.99% uptime (max downtime of 4 minutes 22 seconds per month) and give 10x credits for any downtime.

Granted, there's probably not many businesses that are losing major revenue because slack's down for half an hour, but it's nice to at least see them acknowledge that 1 minute down deserves more than 1 minute of refunds!

[0] https://slack.com/terms/service-level-agreement
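The 10x idea above can be sketched as a simple formula (illustrative only, not Slack's actual billing math):

```python
def downtime_credit(monthly_fee, downtime_minutes, multiplier=10,
                    month_minutes=30 * 24 * 60):
    """Credit = multiplier x the pro-rated fee for the downtime window."""
    return multiplier * (monthly_fee / month_minutes) * downtime_minutes

# A half-hour outage on a $1,000/month plan:
print(f"credit: ${downtime_credit(1000, 30):.2f}")
```

Even at 10x, a short outage yields a small credit relative to the fee, which is the parent's point about credits versus actual business impact.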

> I haven't seen one (at least for a SaaS company) that will compensate for loss of productivity/revenue

They won't show up on automated systems aimed at SMEs, but anybody taking out an "enterprise plan" with tailored pricing from a SaaS, will likely ask for tailored SLA conditions too (or rather should ask for them).

It's hard to give compensation for profit loss, as then you would have to know the customer's profit beforehand and set pricing that adequately includes that risk. It's almost like insurance!

In financial markets I have seen SLAs where you will make people whole on the losses they incur due to downtime you inflict on them.

Seems like you want insurance. As with the hospital bill you'd generally be paying a bunch of extra money for your health insurance plan to not get stuck with the bill.

Not sure that exists for businesses, but I'd expect you'd need to go shopping separately if you want that.

Seems like a good business idea if it doesn't exist.

I think the idea here is that if the payment for SLA breach is just "don't pay for the time we were down" or (as I've seen in other SLAs) "half off during the time we were down" that doesn't feel like much of an incentive on the service provider.

They have other incentives, obviously, like if everyone talks about how Google is down then that's bad for future business. But when thinking of SLAs I'm always surprised when they're not more drastic. Like "over 0.1% downtime: free service for a month".

Independent 'a service was down' insurance isn't the same though. It is important for the cost to come out of the provider's pocket, thus giving them a huge financial incentive to not be down. Having that incentive in place is the most important part of an SLA.

Even with insurance, some of the cost will come out of the provider's pockets - as increased premiums at renewal (or even immediately, in some cases). Insurers might also force other onerous conditions on the provider as a prerequisite for continued coverage.

I hear you, but there's going to be a cost for that. For the sake of argument, say Google changes the SLA as you wish and ups the cost of their offering accordingly.

Would they gain or lose market share?

I don't think it's obvious one way or the other.

> . . . like going to a restaurant, getting massive food poisoning, almost dying, ending up with a $150,000 hospital bill and then the restaurant emails you with "Dear valued customer, we're sorry for the inconvenience and have decided to award you a $50 gift card for any of our restaurants, thanks!".

It's even slightly worse than that. SLAs generally refund you for the prorated portion of your monthly fee the service was out, so it's more like "here's a gift card for the exact value of the single dish we've determined caused your food poisoning." Hehe.

You're right and the funny thing is that's exactly what he said after I chimed in.

Completely agree with your analogy, but have you ever seen any SLA that provides any additional liability? I haven't - you are stuck either relying on the hosting service's SLA or DIY.

I cannot share details for the obvious reason, but yes - there are SLAs signed into contracts directly behind the scenes which result in automatic payouts if a condition isn't met, and it's not a simple credit.

Enterprise level SLAs are crafted by lawyers in negotiations behind the scenes and are not the same as what you see on random public services. Our customers have them with us, and we have them with our vendors. Contract negotiations take months at the $$$$ level.

That is a fair point. Is this a situation where you are asymmetrically powerful? I have to imagine you would need to represent a fair bit of their revenue to have the clout to dictate terms. When I wrote my comment it was in the vein of a smaller organization.

I am but a technical cog in the machine, my friend; while I know about what goes on in business and contract negotiations, I cannot comment on power dynamics. I would assume it's like any other negotiation - whoever has the greatest leverage has the power. I doubt it's ever fairly balanced.

Or purchasing business continuity insurance.

Not OP, but how do you measure them? Let's say, for example, you can send and receive email, but attaching files does not work. Is the service up or down?

What if the majority of your users can access the service, but one of your BGP peers is not routing properly and some of your users are unable to access?

Google do a very good job of defining their SLAs.

In answer to your question, they'll accept evidence from your own monitoring system when you claim on the SLA. They pair that up to their own knowledge about how the system was performing, then make the grant.

Google are exceptionally good at this, from my experience. Far better than most other companies, who aim to provide as little detail as possible while getting away with 'providing an SLA'.

The SLA itself should specify the way availability is measured.

Down, because email attachments are base64-encoded files written as plain text into the body. So if those are not working, email itself is not working.
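To illustrate the point above, here is how an attachment actually travels as base64 text inside the message body (using Python's standard email library):

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "report"
msg.set_content("see attached")
msg.add_attachment(b"hello world", maintype="application",
                   subtype="octet-stream", filename="report.bin")

raw = msg.as_string()
# The "attachment" is just a base64-encoded MIME part in the message text:
assert "Content-Transfer-Encoding: base64" in raw
assert "aGVsbG8gd29ybGQ=" in raw   # base64 of b"hello world"
```

There is no separate channel for attachments; a broken message body means broken attachments too.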

that was a bad example. i guess the comment was trying to say "how do you account for partial service degradations".

(i dont think SLAs are BS btw)

In this example: you get free days. Which depending on your business might be worthless if you have suffered more monetary loss due to the downtime than the free days are worth.

But still better than nothing. And for some (most?) people/businesses, probably worth more than any resulting monetary loss

Exactly; downtime doesn't cost a cloud service much. At worst it causes reputation damage, with possibly large companies deciding to go for a competitor, losing a contract worth tens or hundreds of millions.

Given the blast radius of this (all regions appear to be impacted) along with the fact that services that don't rely on auth are working as normal, it must be a global authN/Z issue. I do not envy Google engineers right now.

> I do not envy Google engineers right now.

A few years ago I released a bug to production that prevented users from logging into our desktop app. It affected ~1k users before we found out and rolled back the release.

I still remember a very cold feeling in my belly; I could barely sleep that night. It is difficult to imagine what the people responsible for this are feeling right now.

When I was interviewing at Morgan Stanley, I asked "how do you do this job if a mistake can cost people money?".

The answer was "well, if you don't do anything, you make NO money".

I'm reminded of the quote from Thomas J. Watson:

> Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. No, I replied, I just spent $600,000 training him. Why would I want somebody to hire his experience?

Agreed, and also it's worth noting that we're talking about companies here. Yes, for any individual the amount of money lost is insane, but that's the risk for the company. If one individual can accidentally nearly bankrupt the company, then the company did not have proper risk management in place.

That isn't to say that it wouldn't also affect my sleep quality.

Sadly a lot of managers don't see it this way, they'd rather assign blame.

Depending on the company's culture, that calculus can vary. If management start firing subordinates for making mistakes, then what should be done to management if they fail to account for human error, resulting in multimillion-dollar losses?

Well, why, good Sir - don't they usually get bonuses?

They'd rather avoid the blame hitting them

That is because they are managers, most got where they are by assigning rewards to themselves.

Welp, as a new grad there, I had brought down one very important database server on a Sunday night (a series of really unfortunate events). Multiple senior DBAs had to be involved to resuscitate it. It started functioning normally just a few hours before market open in HK. If it was any later, it would have been some serious monetary loss. Needless to say, I was sweating bullets. Couldn't eat anything the entire day lol. Took me like 2 days to calm down. And this was after I was fully shielded cuz I was a junior. God knows what would've happened if someone more experienced had done that.

I brought down the order management system at a large bank during the middle of the trading day. The backup kicked in after about a minute but it was not fun on the trading floor.

I'm so glad I'm not the only one feeling deployment anxiety. The project I'm involved in doesn't really have serious money involved, but when there's a regression found only after production deployment my stress levels go up a notch.

When I was working at a pretty big IT provider in the electronic banking sector, we (management and senior devs) made it an unspoken rule that:

- Juniors shall also handle production deployments regularly.

- A senior person is always on call (even if only unofficially / off the clock).

- Junior devs are never blamed for fuckups, irrespective of the damage they caused.

That was the only way to help people develop routine regarding big production deployments.

Same thing -- used to work at a very large hosting provider. One of our big internal infra management teams wouldn't consider new hires fully "part of the team" until they had caused a significant outage. It was genuinely a rite of passage, as one person put it, "to cause a measurable part of the internet to disappear".

I got to see a lot of people pass through this rite of passage, and it was always fun to watch. Everyone would take it incredibly seriously, some VP would invariably yell at them, but at the end of the day their managers and all their peers were smiling and clapping them on the back.

Sounds like hazing.

as a new grad there, it wasn't your fault. There should be guardrails to protect you.

Yep. It was supposed to be a very small change. I blundered. My team understood that and was super supportive about it all too. But this was after it was all fixed.

During the outage though, no one (obviously) had time for me. This was a very important server. The tension and anxiety on the remediation call was through the roof. Every passing hour someone even more important in the chain of command was joining the call. At that time I thought I was done for...

I work for an extremely famous hospital in the American midwest. We're divided into three sections, one for clinical work, one for research, and one for education. I always tell people that I'm pretty content being in research (which is less sexy than clinical), because if I screw something up, some PI's project takes ten months and one week instead of ten months. In clinical, if you screw something up, somebody dies! I just don't think I could handle that level of stress.


At AWS, I once took down an entire AZ of a public-facing production service (with a mistyped command), but that was nothing compared to when I accidentally deleted an entire region via an internal console (too many browser tabs). Thank goodness it turned out to be an unused / unlaunched, non-production stack. I felt horrible for hours despite zero impact (in both cases).

Jesus. One would think you'd have some safeguards for that. Even Dropbox will give you an alert if you try to nuke over 1,000 files. More reasons to COLOR CODE your work environments, if possible.

Yes but that was eons ago. The safeguards are well and truly in-place now. Not just one, several in fact.

Apart from the ones that they haven't worked out yet :)

When I meet the engineer who can design for the unknown unknowns, I will bow to them.

The trick is to be paranoid. You literally sit down and think exclusively about what COULD go wrong.

Anxiety is a bitch.

Formal methods for your formal methods. And never shipping on Friday.

colorblind (red/green) person here - 5% of the male population just don't see color well enough for it to be an important visual cue.

So sure, color-code your environments, but if you find someone about to do something to a red environment that they clearly should only be doing to a green environment, just check if they're seeing what you're seeing before you sack them ;)

My primary customer right now color codes both with "actual color" and with words - i.e. the RED environment is tagged with red color bars, and also big [black] letters in the [red] bars reading "RED".

> At AWS, I once took an entire AZ down of a public-facing production service (with a mis-typed command), but that was nothing compared to when I accidentally deleted an entire region via internal console (too many browser tabs). Thank goodness turned out to be unused / unlaunched, non-production stack. I felt horrible for hours despite zero impact (in both the cases).

It seems like a design flaw for actions like that to be so easy. E.g.

> Hey, we detected you want to delete an AWS region. Please have an authorized coworker enter their credentials to second your command.

If it was indeed in-production, I'd never in a million years have had access rights to delete the stack. Those are gated just the way one would imagine they should be.

The service stack for the region (and not an entire region itself) looked like prod, but wasn't. It made me feel like shit anyway.

It reminds me of this: https://www.youtube.com/watch?v=30jNsCVLpAE -- "GOTO 2017 • Debugging Under Fire: Keep your Head when Systems have Lost their Mind • Bryan Cantrill"

Irrelevant to the discussion, but I just wanted to say thank you for the categorized list of users I can follow on your profile!

Which tool do you use to follow users on HN?

There used to be hystry.com [0] but it isn't functional anymore.

Another workflow, though cumbersome, is: Search for a username on hn.algolia, select "comments" and "past months" as filters, then press enter.

Ex: https://hn.algolia.com/?dateRange=pastMonth&query=nostrademo...

[0] https://news.ycombinator.com/item?id=71827

Thanks for sharing those tips!

wow, that list is crazy. thanks OP.

Who automates the automators? :)

Doesn't AWS (and every big cloud/enterprise) follow best-practices for production operation like FIT-ACER? https://pythian.com/uncategorized/fit-acer-dba-checklist/

That's even more surprising to me.

Several years back when I was working at Google I made a mistake that caused some of the special results in the knowledge cards to become unclickable for a small subset of queries for about an hour. As part of the postmortem I had to calculate how many people likely tried to interact with it while it was broken. It was a lot and really made me realize the magnitude of an otherwise seemingly small production failure. My boss didn't give me a hard time, just pointed me toward documentation about how to write the report. And crunching the numbers is what really made me feel the weight of it. It was a good process.

I feel for the engineer who has to calculate the cost of this bug.

This sounds like a good practice and hopefully something they still do. Calculating the exact numbers would definitely help cement the experience and its consequences into your mind.

Presumably there were more failures than a single engineer could've been responsible for here.

It's absolutely possible; the worst AWS outage was caused by one engineer running the wrong command [0].

"This past Tuesday morning Pacific Time an Amazon Web Services engineer was debugging an issue with the billing system for the company’s popular cloud storage service S3 and accidentally mistyped a command. What followed was a several hours’ long cloud outage that wreaked havoc across the internet and resulted in hundreds of millions of dollars in losses for AWS customers and others who rely on third-party services hosted by AWS."

[0] - https://www.datacenterknowledge.com/archives/2017/03/02/aws-...

If you alone were able to do it, then the system was designed badly. The bigger the impact, the more robust it has to be to prevent accidents.

The big mistake in the system is that everyone in the world is relying on Google services... These problems would have less impact with a more diverse ecosystem.

Would they have less impact? Or would it have the same impact, just distributed across many more outages?

You can rely on Google outages being very few and far between, and recovering pretty fast. For the benefits you get from such a connected ecosystem, I'm not sure anyone is net positive from using a variety of different tools rather than Google supplying many of them.

Compare closing down one road for repair a day per year to closing down all roads one day a year.

I'm not sure I see that as a fair comparison. I think it's best to use the same durations for this, as an entire day changes the level of impact. <1hr outages have a fraction of the impact that an entire day's outage might have.

It's obviously subjective, but even with our entire work leaning on Google (GMail, GDrive and Google Docs, through to all our infrastructure being in GCP), today's outage just meant everyone took an hour break. History suggests we won't see another of these for a year, so everyone taking a collective 60m break has been minimally impactful vs many smaller, isolated outages spread over the year.

Just think if Google actually decided to "take the day off"

...like I did a dozen+ years ago: https://antipaucity.com/2008/01/09/what-if-google-took-the-d...

I remember how one of our engineers had his docker daemon connected to production instead of his local one and casually ran docker rm -f $(docker ps -aq).
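One cheap safeguard against exactly this accident is checking where the Docker client is pointed before running anything destructive. DOCKER_HOST is the real environment variable the docker CLI honors; the allow-list and wrapper below are assumptions for illustration:

```python
import os

# Daemon addresses considered "local" -- adjust for your setup (assumption).
LOCAL_HOSTS = {"", "unix:///var/run/docker.sock", "tcp://localhost:2375"}

def assert_local_daemon(env=None):
    """Raise before a destructive command if the client targets a remote daemon."""
    env = os.environ if env is None else env
    host = env.get("DOCKER_HOST", "")
    if host not in LOCAL_HOSTS:
        raise RuntimeError(f"refusing destructive command: DOCKER_HOST={host!r}")

# Call this before anything like `docker rm -f $(docker ps -aq)`:
assert_local_daemon({"DOCKER_HOST": ""})   # local daemon: passes silently
try:
    assert_local_daemon({"DOCKER_HOST": "tcp://prod.example.com:2376"})
except RuntimeError as err:
    print("blocked:", err)
```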

Same thing happened to me but with CI, which felt bad enough already.

"Hey let's make developers do two very different jobs, development, and operations. We'll call it DevOps. We'll save money. Everything will be fine."

No Engineer should have production access from their workstation. Period.

source: am Engineer =).

Engineers shouldn’t deploy to prod directly, but sometimes it’s necessary to SSH into an instance for logs, stack dumps, etc. Source: worked for 2 big to very big tech cos.

For a large or v large tech co you should probably be aggregating logs to a centralised location that doesn't require access to production systems in this way. Stack dumps should also be collected safely off-system if necessary.

Perhaps my industry is a little more security conscious (I don't know which industry you're talking about), but this doesn't seem like good practice.

Let me be clear, I agree it should not be normal to SSH into a prod box. Our logs are centrally aggregated. But it’s one thing to say it’s not normal, but quite another to say engineers shouldn't have access, because I totally disagree with that.

What normally (should) happens in that unusual case is that the engineer is issued a special short-lifetime credential to do what needs to be done. An audit trail is kept of when and to whom the credential was issued, for what purpose, when it was revoked, etc.

Who fixes the centralised log system when that needs debugging?

Unless prohibited in something like banking, following best practice to the letter is sometimes unacceptably slow for most industries.

There should be tools that allow the team to gather such logs. Direct prod access is a recipe for disaster.

Not having those things centralized is also a huge operational failure regardless of company size.

Why not? (I think I can find some cases where production access from an engineer's workstation is a good idea)

It can be efficient, particularly in smaller companies, but that's where exactly this rule should be applied.

In some industries, security and customer requirements will at times mandate that developer workstations have no access to production. Deployments must even be carried out using different accounts than those used to access internal services, for security and auditing purposes.

There are of course good reasons for this; accidents, malicious engineers, overzealous engineers, lost/stolen equipment, risk avoidance, etc.

When you apply this rule, it makes for more process and perhaps slower response times to problems, but accidents or other internal-related issues mentioned above drop to zero.

Given how easy it is to destroy things these days with a single misplaced Kubernetes or Docker command, safeguards need to be put in place.

Let me tell you a little story from my experience;

I built myself a custom keyboard from a numpad kit. I had gotten tired of typing so many docker commands every day, and I had the desire to build something. I turned this little numpad into a full-blown Docker control centre using QMK. A single key-press could deploy or destroy entire systems.

One day, something slid off of something else on my desk, onto said keyboard, pressing several of the keys while I happened to have an SSH session to a remote server in focus.

Suffice it to say, that little keyboard has never been seen since. On an unrelated topic, I don't have SSH access to production systems.

This exactly. I have deleted database records from a production DB thinking I was executing on my development DB. I've kept separate credentials and revoked dev-machine access to the prod environment ever since.

Congratulations - you found a counterexample yourself: engineers in small companies.

Well, it's something that can happen to anyone; take it easy. When I made the transition from developer to manager and became responsible for these situations, at first every problem made me feel as you describe. Eventually, what set me free was the understanding that how we feel about a fact does not change anything about that fact.

Don't be too hard on yourself, no dev works in a silo, there is usually user acceptance testing and product owner sign offs involved so they also have to wear some of this too.

Nope, especially considering the implications of this, with the amount of people working remotely. Google Meet, Classroom, etc. are down. This is probably literally costing billions every minute just in loss of productivity.

Total world economic output is ~$150M / minute, so billions every minute is off by few orders of magnitude.

You are assuming that a minute of disruption can not cause more than a minute's loss of productivity. I don't think that assumption is justified.

Consider an exactly one minute outage that affects multiple things I use for work.

First, I may not immediately recognize that the outage is actually with some single service provider. If several things are out I'm probably going to suspect it is something on my end, or maybe with my ISP. I might spend several minutes thoroughly checking that possibility out, before noticing that whatever it was seems to have been resolved.

Second, even if I immediately recognize it for what it is and immediately notice when it ends, it might take me several minutes to get back to where I was. Not everything is designed to automatically and transparently recover from disruptions, so I might have had things in progress when the outage struck that will need manual cleanup and restarting.

I'm also assuming most of the world doesn't grind to a halt when gmail is down. Crops keep growing and factories keep running.

Even software engineers who are in a state of flow keep working :)

That figure seems way too low, what are your sources on it?

Simple math says:

World GDP (via Google) $80,934,771,028,340

Minutes per year 365 * 24 * 60 = 525,600

Divide and you get 153,985,485

World GDP is $80 trillion per year.

World GDP was ~$90B last year (https://databank.worldbank.org/data/download/GDP.pdf), which averages to ~$150M/minute

That's trillion not billion


A billion is a number with two distinct definitions:

- 1,000,000,000, i.e. one thousand million, or 10^9, as defined on the short scale. This is now the meaning in both British and American English.

- 1,000,000,000,000, i.e. one million million, or 10^12, as defined on the long scale. This is one thousand times larger than the short scale billion, and equivalent to the short scale trillion. This is the historical meaning in English and the current use in many non-English-speaking countries, where billion (10^12) and trillion (10^18) maintain their long scale definitions.

Nevertheless almost everyone uses 1B = 10^9 for technical discussions

This is a financial discussion though so:


World's GDP is $80,934,771,028,340 (nominal, 2017).


$80.93477102834 trillion

Nobody would argue world GDP is anything billion, that's crazy.


In France, they use milliard and billion.

Sorry, language mistake. The result is the same: GDP is ~$150M/minute

That depends where in the world you are!

Indeed. Also, Google's revenue is about $300K per minute. The value they provide is likely higher than that, but as you said, being able to send an email an hour later than you hoped is fine in most cases. Also, Google Search was fine, and that's their highest-impact product.

I’d guess actual losses to the world economy were more on the order of about $100K per minute, or about 1/3 of Google’s revenue. MAYBE a few hundred thousand per minute, though that seems unlikely with Search being unaffected, and everything else coming back. Certainly a far cry from billions per minute :)

I never understood this type of calculation, as it implies that time is directly converted into money. However, I struggle to come up with an example of this. Even the most trivial labor cases, like producing paperclips, don't seem to directly convert time into profit: even if you make 10k units instead of 100k this hour, you don't sell them immediately. They bring revenue to the firm via a long chain of convoluted contracts (both legal and "transactional") which are very loosely coupled to the immediate output.

Nothing is operating at minute margins unless it's explicitly priced on a minutely basis, like a cloud service. Even if a worker on a conveyor belt can't produce paperclips without looking at a Google Docs sheet all the time, this will be absorbed by the buffers down the line. And only if the worker fails to meet her monthly target due to this might loss of revenue occur. But in that case the service has to be down for weeks.

In the case of more complex conversions of time into money, as in most intellectual work, it is even less obvious that short downtimes cause any measurable harm.

Besides the exaggerated figure, I always find these claims bizarre. Sure, there was some momentary loss, but aggregated over a month this will not even register.

I was unable to watch the Mogwai - Autorock music video. :-(

In a previous lifetime I removed an "unused" TLS certificate. It turns out that it was a production cert that was being used to secure a whole state's worth of computers.

In my defence, the cert was not labeled properly, nor was it used properly, and there was no documentation. It took us 2 days to create a new cert, apply it to our software, and deliver it to the customer. Those were 2 days I'll never get back. However, when I was finished the process was documented and the cert was labeled, so I guess it's a win.

Coincidentally, Google Authenticator was finally just updated on iOS after many years without update.

I am not sure why they are allowing it. Meaning, why aren't services completely isolated? Isn't it obvious that in an intertwined environment those things are bound to happen (as in "question of when, not if")? I understand that in smaller companies that are limited in resources (access to good developers and pressure to get the product to market as soon as possible), we have single points of failure all over the place. But "the smartest developers on the planet"? What is it if not short-sighted disregard for risk management theories and practices? I mean, Calendar and YouTube, say, should be completely separate services hosted in different places; their teams should not even talk to each other. Yes, they can use the same software components, frameworks and technologies. Standardization is very welcome. But decentralization should be an imperative.

Edit: again downvotes started! Thanks to everyone “supporting freedom of expression” :)

I've been in that situation before at one of my previous jobs, where some important IT infrastructure went down for the whole company. Nowhere near as big a scale as this, but it was easily one of the most stressful moments of my life.

If this does not improve soon, we're looking at one of the most significant outages in recent internet history, at least from the number of people impacted.

Several others have shared their 'I broke things' experiences, and so I feel compelled to weigh in.

Many years ago, I was directly responsible for causing a substantial percentage of all credit/debit/EBT authorizations from every WalMart store world-wide to time out, and this went on for several days straight.

On the ground, this kind of timeout was basically a long delay at the register. Back then, most authorizations would take four or five seconds. The timeout would add more than 15 seconds to that.

In other words, I gave many tens of millions of people a pretty bad checkout experience.

This stat (authorization time) was and remains something WalMart focuses quite heavily on, in real time and historically, so it was known right away that something was wrong. Yet it took us (Network Engineering) days to figure it out. The root cause summary: I had written a program to scan (parallelized) all of the store networks for network devices. Some of the addresses scanned were broadcast and network addresses, which caused a massive amplification of return traffic which flooded the satellite networks. Info about why it took so long to discover is below.

Back in the 1990s, when this happened, all of the stores were connected to the home office via two way Hughes satellite links. This was a relatively bandwidth limited resource that was managed very carefully for obvious reasons.

I had just started and co-created the Network Management team with one other engineer. Basically prior to my arrival, there had been little systematic management of the network and network devices.

I realized that there was nothing like a robust inventory of either the networks or the routers and hubs (not switches!) that made up those networks.

We did have some notion of store numbers and what network ranges were assigned to them, but that was inaccurate in many cases.

Given that there were tens of thousands of networks ranges in question, I wrote a program creatively called 'psychoping' that would ICMP scan all of those network ranges with adjustable parallelism.
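For comparison, here is a minimal modern sketch of such a scanner (hypothetical, in Python; not the original 1990s program) showing the safeguard the story implies: enumerating only host addresses, since `ipaddress.ip_network(...).hosts()` deliberately skips the network and broadcast addresses that caused the amplification, and bounding the parallelism.

```python
import ipaddress
import subprocess
from concurrent.futures import ThreadPoolExecutor

def scan_targets(cidr: str) -> list[str]:
    """Enumerate only host addresses. .hosts() skips the network and
    broadcast addresses whose inclusion caused the amplification."""
    return [str(ip) for ip in ipaddress.ip_network(cidr).hosts()]

def ping(host: str) -> bool:
    """One ICMP echo via the system ping utility (flags assume Linux ping)."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

def sweep(cidr: str, parallelism: int = 32) -> list[str]:
    """Ping every host address in the range with bounded parallelism."""
    targets = scan_targets(cidr)
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        alive = list(pool.map(ping, targets))
    return [t for t, ok in zip(targets, alive) if ok]
```

For a /30 like 10.0.0.0/30, `scan_targets` yields only 10.0.0.1 and 10.0.0.2, never the .0 network or .3 broadcast address.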

I ran it against the test store networks, talked it over with the senior engineers, and was cleared for takeoff.

Thing is, I didn't start it right away; some other things came up that I had to deal with. I ended up starting it over a week after the review.

Why didn't this get caught right away? Well, when timeouts started to skyrocket across the network, many engineers started working on the problem. None of the normal, typical problems were applicable. More troubling, none of the existing monitoring programs looked for ICMP at all, which is what I was using exclusively.

So of course they immediately plugged a sniffer into the network and did data captures to see what was actually going on. And nothing unusual showed up, except a lot of drops.

We're talking > 20 years ago, so know that "sniffing" wasn't the trivial thing it is now. Network Engineering had a few extremely expensive Data General hardware sniffers.

And to these expensive sniffers, the traffic I was generating was invisible.

Two things: the program I wrote to generate the traffic had a small bug and was generating very slightly invalid packets. I don't remember the details, but it had something to do with the IP header.

These packets were correct enough to route through all of the relevant networks, but incorrect enough for the Data General sniffer to not see them.

So...there was a lot of 'intense' discussions between Network Engineering and all of the relevant vendors. (Hughes, ACC for the routers, Synoptics and ODS for the hubs)

In the end, a different kind of sniffer was brought in, which was able to see the packets I was generating. I had helpfully put my userid and desk phone number in the packet data, just in case someone needed to track raw packets back to me.

Though the impact was great, and it scared me to death, there were absolutely no negative consequences. WalMart Information Systems was, in the late 1990s, a very healthy organization.

Makes sense, at work we have an application running on Google Cloud and everything seems to be working. So the outage is probably not at network or infrastructure level.

Went to reply, then saw the username. My guess was lb layer

yeah, not working in Europe

4:41AM PT, Google services have been restored to my accounts (free & gsuite).

And I have never seen them load so fast before: the Gmail progress bar was barely visible for a fraction of a second, whereas I am used to seeing it for multiple seconds (2-3 sec) until it loads.

I observe the same anecdotal speedup for other sites... Drive, YouTube, Calendar. I wonder if they are throwing all the hardware they have at their services, or if I am hitting underutilized servers since it is not fixed for everyone.

It is nice to experience (even if it is short-lived) the snappiness Google services would have if they weren't so multi-tenanted.

If this phenomenon is actually real instead of just perception, then I'd guess it is down to reduced demand of some sort. Some possibilities:

a) users haven't all come back yet

b) Google is throttling how fast users can access services again to prevent further outages

c) to reduce load, apps have features turned off (which might make things directly faster on the user's end or just reduce load on the server side)

At Google's scale, I'd expect it to be all of the above.

I hope they make their learnings, post-mortem, etc. public so that we can all learn from it.

My engineer hat is saying - "damn, I wish I was part of fixing this outage at their scale."

My product owner hat is saying - "Aaaaaaaaaaaaaaa......Aaaaaaaaaaaaaaa...."


Everything is snappier for a while if you turn it off and then on again

Except when there is no cache warming when you turn it on

Except when there is cache warming when you turn it on?

I would guess autoscaling kicked in (RPC error rate caused higher CPU usage?) and now things will scale back down again.

Perhaps they rebooted their clusters and it flushed the memory?

Oh man, you're right. Bloated gmail loaded instantly. What's going on? It's loading almost 2x to 3x faster.

Isn't this a good indication that the performance problem of Gmail may not be related to the "bloat" of the frontend itself?

It might suggest that the frontend isn't the only issue, at least - and maybe this explains why it's usually so slow, if the frontend can be fast on a fast enough backend. On the other hand, the speed of the "basic HTML" version implies that the frontend can be the issue.

Entirely possible as well that the "basic HTML" version uses different API services in the background that are snappier due to a comparative lack of users.

I always thought Gmail being slow was because of me using Firefox, but now it's surprisingly snappy. What the hell is going on?

Wow, it's faster in Firefox than it used to be in Chrome... while in Chrome it's almost instantaneous.

I wish it was always like this. I hate how slow YouTube, Gmail, etc. often are normally.

I wonder if they just killed the affected service so it's loading faster-than-usual now

So, anybody still feel like arguing that 'the cloud' is a viable back-up? Or is that a sore point right now? Just for a moment imagine: what if it never comes back again?

Of course it will (at least, it had better), but what if it doesn't? And if it does, are you going to take countermeasures in case it happens again, or is it just going to be 'back to normal' again?

I guess a lot of people are fine with the risk.

Everybody uses it, so if, like, Gmail loses all the emails, we are then in such a state that the consequences will be more bearable and socially normal.

Most people are fine with accepting that whatever future thing will happen to most people will also happen to them. Because then the consequences will also be normal.

If the apocalypse comes, it comes for almost all of us and that's consolation enough.

This sounds like the good old 1970-80s "No one ever got fired for buying IBM" argument.

Still haven't heard of anyone being fired for buying IBM. Have you?

Yeah but for the people making that argument, it was a good one!

The way I see it, backups are a strategy to reduce risk of ruin.

For me, backing up to the Cloud is fine, because I find the risk of my home being broken into and everything stolen AND the cloud goes down AND the cloud services are completely unrecoverable is a small enough risk to tolerate.

I don't think it's possible to have permanently indestructible files in existence over a given time period.

Different failure mode. If the cloud goes down, many more people are affected. If your self-hosted thing goes down, only you are affected. If everybody self-hosted, would the overall downtime be lower? Even if it were, would it be worth the effort of self-hosting?

For baby pictures yes, for everything else, no

Most of the things I backed up for myself are either gone forever or irretrievably lost.

Most of the things I backed up with google remain largely accessible, except for an occasion like this.

It's rare that any services I operate solo come back this quickly after an outage.

I have the opposite experience, at least with regard to your first two paragraphs. Most of the things that I have backed up on other people's computers over the past 3-4 decades are irretrievably lost. But most of the things that I have taken care to make backups of on personal equipment over the years, are still with me.

Cloud storage is still useful of course, but I prefer to view it as a cache rather than as a dependable backup.

Of course it's viable as a backup. Availability != reliability. My data is still reliably saved in the cloud even if there is an outage for a few hours. The key point is backup, e.g. Dropbox. When you use Google Docs, it becomes a single source of truth and a SPOF.

This depends on the circumstances. If your personal photos are inaccessible then maybe it doesn't matter, but if it's your documentation for a mission-critical bit of infrastructure then a few hours could be very significant. Somebody in that situation probably wouldn't agree with your assertion that "availability != reliability". If I can't access it when I need it, then I wouldn't consider it reliable.

Whatever data I have backed up in the cloud is synced across multiple devices that I use. Even if the cloud disappeared altogether, I still have it. The cloud allows me to keep an up to date copy across various devices.

Both Google Drive/Photos and OneDrive have an option to only keep recently used files on your local device, and even periodically suggest they automatically remove local copies of unused files to "free up space".

I highly suggest everyone disable this setting on their own, but also on their (perhaps less technical) friends' and relatives' devices. Otherwise, if anything happens to your account or - less likely - the storage provider or their hardware, your data could very well be gone forever. I can't believe anyone would want that.

You don't need 'the cloud' to do that. Look into Syncthing. It does depend on an outside "discovery server" by default to enable syncing outside of your LAN, but you can run your own.


What's annoying is that synchronisation doesn't work for Google Slides or Google Docs. They are just synchronized as links to the webpage on my computer.

If you use Insync you have the option of converting to DOCX or ODT. Insync has other issues though, my "sourcetreeconfig" is being downloaded as "sourcetreeconfig.xml".

Not 100% on that, but I think you can save these documents on Google Drive, and then they're treated (and synced) just like any other files.

>So, anybody still feel like arguing that 'the cloud' is a viable back-up? Or is that a sore point right now? Just for a moment imagine: what if it never comes back again?

Much less chance of that happening than my local backups getting borked...

But much higher than the chances of both of them getting borked.

Of course the cloud is a viable backup, just like physical drives.

Both have vastly different failure modes and typical backup should use both of them.

This way, if all my backups are gone, I likely have way more important issues than the loss of files.

(and yes, my backups are encrypted)

Just a few moments ago, I wrote in my company's group chat that this is the time we buy a NAS. We have a lot of documents that are not accessible right now in Google Drive.


What worries me the most is email. I basically don't use any Google services other than Gmail and YouTube, but for email I really don't know of an alternative.

Sure, you can argue "move to Fastmail/Protonmail/Hey/whatever", but those can also go down on you just like Google is down now. And self-hosting email is apparently not a thing due to complexity and having to forever fight with being marked as spam (note: not my personal experience, I never tried self-hosting, just relaying what I read here on HN when the topic comes up).

So, yeah, what do we do about email? I feel like we should have a solution to this by now, but somehow we don't.

I've been happy using hosted Exchange from Microsoft. I own the domain, so ultimately I can point the DNS to some other provider. Outlook stores the mails locally, so I have a backup. I think the most important thing about email is to receive future emails, not to look at historical ones. In the end you can always ask the recipient to send you a copy of the email conversation; if you don't own the domain, it gets much harder to prove you actually own the email address.

For < $100/year, Microsoft will sell you hosted Exchange (and you can use it with up to 900 domains [0]), 1 TB of storage, 2 installable copies of Office, and Office 365 on-line.

That's _much_ better than trying to host my own email server.

[0] https://docs.microsoft.com/en-us/office365/servicedescriptio...

The point is that you shouldn't put all your eggs in one basket. All services go down. If you're worried about someone else handling it when it goes down then host your own [1], otherwise you can use something different for each thing you need. Don't rely on Google for everything.

1- https://github.com/awesome-selfhosted/awesome-selfhosted#ema...

Yes, I know what the point is. But how do you avoid putting all your eggs in one basket? You can't host your email with more than one "provider" (including self-hosting), and the vast majority of important services that you link your email to (bank, digital identity, taxes and other government services) do not allow you to have more than one linked; which means that if that one goes down, you don't have one. Sure, I can give my accountant and my lawyer a second email address, hosted on a different provider, but that poses two problems:

1. How are they going to know when one is working and one isn't? It's not like you get a notification when your email didn't get through, most of the time it just drops.

2. If you always send all emails to both addresses, now two providers have my data instead of one (of course excluding if one is self-hosted).

And you also need to keep in mind that for all things important, one is none and two is one; so according to that you should really have 3 addresses on 3 different providers, which brings us back to the problems above. (And I'm not even mentioning the confusion it would generate if you don't manage to get the same name with every provider: "Wait, was it beshrkayali@gmail.com, or was it alibeshrkay@gmail.com? Or was that Fastmail?")

As I said (literally in the second sentence), I don't rely on Google for everything, as you mention. I don't actually rely on Google for anything other than Gmail, and even with that I am unhappy. The point I was trying to make is that there aren't really alternatives, and I was hoping someone might come out with a suggestion about how to overcome that problem.

You shouldn't use you@company.com as your main email; you should have your own domain. So `something@yourdomain.com` will always be yours, no matter if you self-host or use a 3rd party. I currently use Fastmail and I've been very happy with them. If they fail or turn evil, I'll switch to something else while maintaining the same address. Emails themselves should be downloaded/backed up regularly, kind of like important documents you'd have on your disk.

> You can't host your email on more than one "provider"

You can do split delivery and have your email be delivered to two different destinations. It's less common than it used to be but it's trivial.
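For illustration, a hypothetical zone fragment (domain and hostnames made up) with two MX records. Note that plain MX preferences give you failover, not true split delivery; receivers try the lowest preference value first and fall back to the higher one only when the primary is unreachable.

```
; yourdomain.example is a placeholder; TTL 3600 seconds
yourdomain.example.  3600  IN  MX  10  mx.provider-a.example.
yourdomain.example.  3600  IN  MX  20  mx.provider-b.example.
```

True split delivery, where both destinations keep a copy of every message, is typically configured on the primary server itself, e.g. via forwarding or aliasing rules.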

I'm running my own mail server, and I think anyone who has some experience with Linux should be able to do the same in a day or two. Once it's set up it just works.

You can still use Gmail and fall back to connecting directly to your server if Gmail is down.

Some mails might be flagged as spam if the IP/domain has no reputation, but that quickly passes, at least that's my experience.

I specifically use Gsuite so that I don't have to deal with managing a spam filter or dealing with IP reputation issues. I'd be willing to self-host almost anything else.

I guess something highly resilient would be - say - a mailserver on a rented VM replicating to two cloud providers via a service mesh.

Nice and simple! :D

A lot of domain registrars will host/relay mail for you if you don't want to think about it. Otherwise it's not too hard to host yourself. The sucky part is when it breaks because you can't really just put off fixing it.

I've been using mailu.io to host my own email server. Makes it real easy to manage yourself.

I haven't had any issues with new domains being marked as spam, but I always make sure the SPF, DKIM and DMARC records are set up.
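As an illustration (example.com, the selector name, and the report address are placeholders), the three records in question are plain DNS TXT entries along these lines:

```
example.com.                       IN TXT "v=spf1 mx -all"
selector1._domainkey.example.com.  IN TXT "v=DKIM1; k=rsa; p=<base64 public key>"
_dmarc.example.com.                IN TXT "v=DMARC1; p=quarantine; rua=mailto:postmaster@example.com"
```

Here the SPF record authorizes only the domain's MX hosts to send, the DKIM public key must match the private key the mail server signs with, and the DMARC policy tells receivers what to do on failure and where to send aggregate reports.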

I’ve heard multiple founders argue that it’s safe to have downtime because of a cloud outage, because you’re not likely to be the highest importance service that your customers use that also had downtime.

Well yeah; I don't trust myself enough to own & operate my own servers, and I cannot give myself any uptime guarantees - let alone at the scale that a cloud provider can offer me.

"The Cloud" is vague, and if you don't specify what it means then the answer to your question can only be "it depends".

If the question is "anybody still feel like arguing that 'a single provider' is a viable back-up", then it's yes for most cases. A better strategy is of course to use multiple providers. The chance that it never comes back again is much lower.

People would probably argue a “multi cloud” solution. Have your infrastructure be “cloud agnostic” and this sort of problem would be avoided.

There was actually a project called "Spinnaker" that was supposed to solve this problem.

Whether the cost of paying 2 or more cloud providers is worth it for most companies is up in the air.

"Multi-cloud" only works if you stick with the basics. Like disk storage, compute, and a well-supported database. Once you tie in into a cloud's specific offerings.....

It's getting convoluted now that cloud providers seem to realize there's demand for this

https://aws.amazon.com/hybrid https://azure.microsoft.com/en-us/services/azure-arc https://cloud.google.com/anthos

Full disclosure: I work for Azure. Don't work on Arc tho. Don't have experience being a customer for these products

I find the Anthos docs and talks so confusing. Half of them say Anthos is for hybrid between on-prem and GCP. The other half say it's for multicloud and hybrid.

Well, since a viable backup strategy requires at least 3 storage locations (e.g. the in-use primary, an on-site or off-site backup, and a secondary off-site backup), "the cloud" is fine as an off-site backup or secondary off-site backup.

Let's hope tailscale swoops in and creates a no-gimmicks, highly usable, private internet for everyone.

They seem to have figured out the hard parts already.

You mean like using Google GSuite for SSO? In the context of this authentication-related outage, it's a funny suggestion.


> Of course it will, - at least, it better - but what if it doesn't?

Same question for non-cloud.

Back to normal. I can live without email for an hour...

Help! My Waymo taxi won't open the doors without me logging into the app. It's driving around in circles on route 500 and won't stop.

/s - for now ;)

I know you're joking, but I am curious if Waymo's fleet was affected. It shouldn't be, right? But I'm also surprised every time I do a fresh login to Gmail and YouTube shows up in the intermediary redirect chain.

how's the converted offshore oil platform treating you?

Gmail said my account was "temporarily" unavailable... had a moment considering if it wasn't temporary. Good reminder to remove my reliance on gmail especially.

For a second I legitimately thought Google had banned my account for some reason. And I don't like that I felt relieved...

Underrated comment. If the government freezes your bank account, at least you can take them to court. If Google mistakenly disables your account, there does not seem to be any legal recourse. Considering the extent to which people depend on their services, perhaps there should be a more elaborate appeal process.

They can freeze the account, but I own the domain, so I can point it anywhere I want, and I have copies locally via IMAP.

At this point the only reason I use it is that I'm grandfathered in on an old plan where it's still free; if that changes, I'll go elsewhere.

Google doesn't disable ordinary rule-following accounts by mistake. They are aware of the false positive rate and don't care.

I agree, but the problem is that "ordinary rule-following" is not clearly defined by Google. Perhaps it was that innocent park frisbee party video with a Metallica song playing in the background by some other party there... and not only your video, but your whole YouTube account, and worse, your email account gets blocked. Maybe this scenario is dystopian; the point is that it is a black-box, no-appeal/limited-appeal system when such an event happens.

Google account terminations typically only affect the one service. If your YouTube account gets terminated for ToS violations, your email account will still work perfectly fine.

Same for me - and then I tried logging in to gmail in an incognito window and got "Couldn't find your Google account" which really scared me.

Time to stop postponing that large Google Takeout for me!

Clicking the stupid download links is so tedious...

Thunderbird prompting me to login, and getting "Google does not recognize this email address" as a reply was a nice adrenaline spike, until I checked the status on HN!

This happened to me with a work email. For a moment I thought, "Well, I had a good run at this company.." My next thought was, "Ha, Google must be down - better check HackerNews."

I am still experiencing this problem on all my synced-to-Thunderbird Gmail accounts, so it either hasn't been completely fixed yet, or there's another ongoing issue.

I started this[1] the last time something similarly scary happened.

It's very comforting to have a local copy of everything important in situations like this one.

[1] https://wiki.emilburzo.com/backing-up-your-google-digital-li...

Yes! It takes some work to switch, but it's worth it. Buy your own domain name, and link it to an existing service if you don't want to host yourself. You'll always be able to switch your mail alias when having issues with your email host.

Our entire library of video tutorials disappeared for a while. I was not happy, and the thought of losing our email.. Currently working with the team to make backups of absolutely everything off-site and off-Google. A good wake up call for us.

https://console.cloud.google.com/ is down as well. Seems very much an identity issue.

I really should try to get my stuff moved to my new address. Not self-hosted, but a non-profit that has been around for a while... Finally paid the membership fee.

The status page has proven itself useless again. According to that page, everything is working perfectly fine.

What if it is a static website of corporate propaganda?

It works 99.9% of the time, good enough eh?

better than a stopped clock

It works fine; it's only logging in that doesn't.

When I tried to log in it said: 'this account does not exist'. So my first thought was that some algorithm had made a mistake and my account had been deleted for no reason.

I already imagined that the only solution now was to write a Medium post and hope it got some traction on Hacker News so that Google support would step in, thinking to myself that I was an idiot for knowing all this and still believing it wouldn't happen to me.

And even though it turns out to be an outage, it gave me a bad enough feeling to start using a domain name I own for my email.

Those hosting with their own domains on Google mail had an identical experience, if it makes you feel differently.

The advantage of a custom domain name is: As long as you can update DNS records, you can have your mail easily hosted elsewhere.

Obviously not relevant for this kind of outage, but in the scenario outlined by GP - Google randomly kills you off, and there is nothing you can do - this is at least an emergency strategy.

Yep, but losing access to my emails isn't that big of a deal, losing access to my email address is much worse. Especially since it's also my login on a lot of other services.

This topic is so hot it's crashing HN! Super long server response times, and I get this error quite often: "We're having some trouble serving your request. Sorry!"

Just checked https://www.google.com/appsstatus

all green, which does not reflect reality for me (e.g. Gmail is down)

edit: shows how incredibly difficult introspection is

HN is my goto status page when these things happen, never failed to provide up-to-date reliable information.

Seems like everyone is going on HN now. The site is so slow.

Interesting cascade effect of sites going down! I wonder where people will go next to check if HN is down?

Also wondering if this is perhaps the fastest upvoted HN post ever? 8 mins -> ~350 votes, 15 mins -> ~750 votes. I wonder if @dang could chime in with some stats?

Update: looks like it hit 1000 upvotes in ~25 mins!

Update: 1500 in ~40 mins

Update: 2000 in ~1 hour 20 mins (used the HN API for the timestamp)

The recent youtube-dl takedown had a similar number of votes but still slightly slower, I think. I think it was >500 in 30min and >1000 in 60min if I remember correctly.

Some politics topics have shot up pretty quickly in the past but user and/or mod flags send them back to page 2 about as quickly.

Funny this post is already #7. Are we seeing many reports again?

Stats right now: 1985 points | 1 hour ago | 597 comments

Maybe; it could also just have been a dropoff in upvotes once the issue was resolved.

reddit, nanog, 4chan, twitter — pick your poison

"poison" is a particularly apt description for all of the above

> I wonder where people will go next to check if HN is down?

Probably: https://downforeveryoneorjustme.com/

HN was having problems showing anything for the past 10 minutes or so. I was going to ask if HN was down, then I saw this.

HN uses Firebase, maybe that is the reason.

Does this site also use Firebase, or do they only feed Firebase's Database so that other developers can build upon it?

Any source for that? Last information I have is that it's written in Arc Lisp and uses files rather than a database for storage.

There is a public API on Firebase, but AFAIR it's just a mirror rather than the main storage.

TIL. What for?

At the very least, the public APIs are exposed via Firebase.


Accessing the site anonymously fixes it (logged-in sessions are not cached).

I, and probably many others here, are like a status page for friends and family too... My wife thought the internet was broken, tried tethering to her phone and such, and it still didn't work; then she showed me, and I saw the status code errors and said "it's actually parts of Google that are down".

I love (and am deeply scared by) the dependence on Google and the confusion of it with the entire internet.

It gets even worse: https://twitter.com/joemfbrown/status/1338452107419148290

>I’m sitting here in the dark in my toddler’s room because the light is controlled by @Google Home. Rethinking... a lot right now.

Some people are compiling more relevant events: https://twitter.com/internetofshit

I still like my physical switch on the wall to turn the light on and off very much.. for me it's hard to beat. I can even turn it on/off semi-remotely with e.g. a tennis ball.

If my Alexa fails, then I can also just turn the light off with the switch, and when it's flipped back "on" again, the light will be on.

Fallback to 'classical mode' works for me.

If your home automation system requires a network connection to work then something is very wrong.

Right? Home automation is perfectly fine and might even expose a secure, authenticated API to the internet so you can, say, check and adjust the temperature remotely, but it should never, ever go down for local use if the internet connectivity or the remote service backend goes down.

Home Assistant rulez with non-cloud sensors. I never buy sensors, cameras, or switches that require a connection to an external host. But I'm sure most people will keep doing it even after outages like this.

Google should write an AI application that checks on HN if their systems are working and display the result live on their status page. That would be more reliable than the current system they have in place (which is obviously not working at all).

It would probably just end up banning Google's own account and leaving them with no human escalation option.

Won't help when the AI service itself is down.

There are other sites that can check for you

https://downdetector.com/ is remarkably good at catching this too.

But funnily enough, a lot of the votes come from traffic that searches for "is ____ down?" on Google. XD

https://downdetector.com/search/?q=google shows all the Google services and seems like everything other than search is reporting errors

Even search seems to have issues. It can't seem to find rare pages at all, and ranking seems subpar for the past 15 minutes or so.

I think that's normal. ;)

Amusingly, downdetector.com is now down. Gotta love that cascade effect :-)

Found this relatively recently, seems to do a good job.

Do love how most consumer services (ISPs &c) always have some report of being down somewhere, but it means nothing unless there's a big spike.

What's the point of having a status page when it can't reliably tell you the factual status?

Helps Google meet a 99.999% SLA!
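
As a back-of-the-envelope check on what those nines actually allow, here's a quick sketch (assuming a 30-day month; the 43.2-minute result matches the ~43-minute 99.9% SLA budget mentioned upthread):

```python
def downtime_budget_minutes(availability_percent: float, days: int = 30) -> float:
    """Minutes of allowed downtime per period for a given availability level."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - availability_percent / 100)

# Three nines leaves ~43 minutes per month; five nines only ~26 seconds.
print(round(downtime_budget_minutes(99.9), 1))    # 43.2
print(round(downtime_budget_minutes(99.999), 2))  # 0.43
```

So today's ~57-minute outage blows through even the three-nines monthly budget, let alone five nines.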

Not really since now it's up, and the status page still says it's down.

Right after I shamed them on Twitter. I take responsibility, despite it almost certainly having had no impact.

Especially one that breaks the back button.

Customer confidence. Same as how everyone reports they have 99.99% uptime.

Then their status page is not unlike target reporting in the USSR: completely fictional.

From personal experience and other Ops people's stories, I'd assume that could be the case for many status pages ;)

It's red already.

No -- it needs to be red when it's having the outage for people to have confidence in it as an indicator. The reality is Google have no real incentive to provide an actual external status page that is accurate -- to do so is an admission of not upholding SLAs. These status pages are updated retrospectively. Use a third party one like DownDetector.

It is all green if you do not need to be logged in.

If you are logged in, the page crashes with an error.

You can still browse all services from Incognito (which for some is not an option).

Are you saying the Google status page is correct then? Being logged in is part of being green I’d think.

Also, you can’t use many parts of Gmail, Drive, Photos, etc, without being logged in.

I think it just means the status page itself is not logged in. Apparently it should be...

Try YouTube, which is totally broken.

But I guess it's technically not part of /appsstatus

YouTube is down for me if I am logged in. I can watch videos through DuckDuckGo search or an incognito window.

YT seems to be working in Incognito / Private mode.

Ah, thanks!

Kinda weird it would totally break if the auth failed, unlike other services like Search.

Youtube Tv too

youtube is working in incognito mode

Yeah, for me too: YouTube and Gmail work in incognito mode

> gmail works in incognito mode

I mean, that must be a generous definition of "works"! :)

Yeah xD

Is HN having issues too?

drive too.

Google made everyone get an account trying to force Google+ on to everyone. They don't get to pretend like that didn't happen by not factoring it into their status.

This is not true. I am logged in to Gmail and getting a 500 error.

edit:parent updated comment

That's what the GP is saying.

the parent comment to mine at time of writing was just:

> It is all green if you do not need to be log in.

Giving the impression that if you were logged in / didn't need to log in, everything would be OK.

I apologize for my ambiguous phrasing, and I feel bad that you were downvoted (I reupvoted you, but that’s only one vote).

I wrote too fast because I thought it could help people work around the problem.

Oh, I see. Looks like their auth (state) servers are down.

It reflects the reality for them.

The small print says: The problem with Gmail should be resolved for the vast majority of affected users. We will continue to work towards restoring service for the remaining affected users...

At Google scale, "remaining affected users" probably number in tens of millions. Sucks to be one of them, tho.

But hey, it happens. As a SaaS maintainer, I can sympathise with any SREs involved in handling this, and know that no service can be up 100% of the time.

Everything went red just now.

Someone here on HN said that at Amazon AWS, direct CTO approval is required to reflect negative realities. Maybe that's the case for everyone these days.

Not true. Only for written things like postmortems. Automated systems need to be vetted but can update things like status pages on their own.

The issue isn't "negative" realities, it is saying something while investigating that might break contracts only to find later that it wasn't true.

Mistakes happen... that page is probably still correct 99.9999% of the time!


The uptime numbers cloud providers give out can't be anywhere near actual uptime.

Solid red now. Everything down

I definitely agree with you. Even if the statuses are updated later, a status page that isn't instantaneous can cause many problems.

It's correct for me now

Pretty funny. I would normally assume the statuses are driven by probes hitting the same endpoints real users do, not internal APIs or however Google is doing this that leaves everything green.

Everything red now: clean sweep

It started showing red now

Now it's all red

everything is red now.

its not green anymore

I don't think it's difficult, I think they just lie.

Monitoring is very simple; I even learned this from a document released by a Google DevOps team many years ago.

Always alert from the end user perspective. So in other words have an external server test login to Gmail. Simple as that.
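
A minimal sketch of such an end-user-perspective probe (the URLs here are illustrative; a real check would also exercise the login flow and run from multiple external regions):

```python
import urllib.error
import urllib.request

def looks_healthy(status_code: int) -> bool:
    # Any 2xx counts as healthy; a 500/502 like today's outage does not.
    return 200 <= status_code < 300

def probe(url: str, timeout: float = 10.0) -> bool:
    """Fetch a URL roughly the way a browser would, and report up/down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return looks_healthy(resp.status)
    except (urllib.error.URLError, OSError):
        # DNS failure, timeout, connection refused, or an HTTP error
        # status (urlopen raises HTTPError for 4xx/5xx) all count as down.
        return False

if __name__ == "__main__":
    for name, url in [("gmail", "https://mail.google.com/"),
                      ("drive", "https://drive.google.com/")]:
        print(name, "UP" if probe(url) else "DOWN")
```

The key point is that the probe runs from outside Google's network, so it sees exactly what a user sees rather than what internal health checks report.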

They manually update that status page to not scare away stockholders.

Out of curiosity, what response time do you expect on a page like that? And what level of detail? I'd much rather have their team focus on fixing this as fast as possible than trying to update that dashboard in the first 5 minutes.

> what response time do you expect on a page like that?

Faster than a free third-party website’s response time. Google should know they are down and tell people about it before Hacker News, Twitter, etc. Google should be the source of truth for Google service status, not social media.

> And what level of detail?

Enough to not tell people that there are “No issues” with services.

> I'd much rather have their team focus on fixing this as fast as possible than trying to update that dashboard in the first 5 minutes.

Google employs enough people to do both.

I'd expect that status page to be automated based on a number of metrics / health checks. Our equivalent is.

I'd expect that within seconds Google is alerted to a very large number of issues with their servers, and that the status page would be updated (the green light going to red) within seconds. It's now quite some time after the start of the outage and everything is still green on that status page.

How many employees do you think Google has? Do you think they're all working on the same task?

Well, probably a lot of them are sitting around doing nothing because Google's down right now.

It's a big group of teams... one team is responsible for monitoring (And status reporting) alone

> I'd much rather have their team focus on fixing this as fast as possible than trying to update that dashboard in the first 5 minutes.

It's not like they would be working on the status page right now, that work should have been done a long time ago...

If they know it's broken, one of the _many_ engineers/support staff across YouTube+Gmail+etc, which are all known to be down due to this bug, should be able to update it in the first few minutes. Especially if this isn't a 5-minute fix.

It should be faster than it takes TechCrunch to write an article about it

TechCrunch don't need to verify anything.

There should be enough people on that to communicate out as well as resolve.
