- YouTube returning errors
- Gmail returning 502
- Docs returning 500
- Drive not working
Status page now reflecting the outage: https://www.google.com/appsstatus
Services look to be restored.
Services were down from ~12:55pm to ~1:52pm, so about 57 minutes. Thanks hiby007
Which is 0.22% of their COH this quarter...
I bet if you personally can't use it, but their overall reliability meets the bar, then they're within SLA.
Don't ask why I know this.
What I believe is that customers will probably get free GCP credits and that's it; everything goes back to how it was before.
So then I jokingly responded with that being like going to a restaurant, getting massive food poisoning, almost dying, ending up with a $150,000 hospital bill and then the restaurant emails you with "Dear valued customer, we're sorry for the inconvenience and have decided to award you a $50 gift card for any of our restaurants, thanks!".
If your SLA agreement is only for precisely calculated credits, that's not really going to help in the grand scheme of things.
IANAL, but I negotiate a lot of enterprise SaaS agreements. When considering the SLA, it is important to remember it is a legal document, not an engineering one. It has engineering impact and is up to engineering to satisfy, but the actual contents of it are better considered when wearing your lawyer hat, not your engineering one.
e.g., What you're referring to is related to the limitation of liability clauses and especially "special" or "consequential" damages -- a category of damages that are not 'direct' damages but secondary. 
Accepting _any_ liability for special or consequential damages is always a point of negotiation. As a service provider, you always try to avoid it because it is so hard to estimate the magnitude, and thus judge how much insurance coverage you need.
Related, those paragraphs also contain a limitation of liability clause, often capped at X times annual cost. Doesn't make much sense to sign up a client for $10k per year but accept $10M+ liability exposure for them.
This is just scratching the surface -- tons of color and depth here, nuanced for every company and situation. It's why you employ attorneys!
1 - https://www.lexisnexis.com/lexis-practical-guidance/the-jour...
Businesses do this all the time, this is how they make money. And they use a combination of insurance and not %@$#@*! up.
Granted, there probably aren't many businesses losing major revenue because Slack's down for half an hour, but it's nice to at least see them acknowledge that one minute down deserves more than one minute of refunds!
They won't show up in automated systems aimed at SMEs, but anybody taking out an "enterprise plan" with tailored pricing from a SaaS will likely ask for tailored SLA conditions too (or rather, should ask for them).
Not sure that exists for businesses, but I'd expect you'd need to go shopping separately if you want that.
Seems like a good business idea if it doesn't exist.
They have other incentives, obviously, like if everyone talks about how Google is down then that's bad for future business. But when thinking of SLAs I'm always surprised when they're not more drastic. Like "over 0.1% downtime: free service for a month".
Would they gain or lose market share?
I don't think it's obvious one way or the other.
It's even slightly worse than that. SLAs generally refund you the prorated portion of your monthly fee for the time the service was out, so it's more like "here's a gift card for the exact value of the single dish we've determined caused your food poisoning." Hehe.
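To make the proration concrete, here's a back-of-the-envelope sketch in Python (the fee is a made-up figure; the 57 minutes is from upthread):

    # Hypothetical prorated SLA credit for a ~57-minute outage.
    monthly_fee = 12.00              # USD/user/month -- made-up figure
    minutes_in_month = 30 * 24 * 60  # 43,200
    outage_minutes = 57

    credit = monthly_fee * outage_minutes / minutes_in_month
    print(f"credit = ${credit:.2f}")  # ~$0.02 -- even the $50 gift card looks generous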
Enterprise level SLAs are crafted by lawyers in negotiations behind the scenes and are not the same as what you see on random public services. Our customers have them with us, and we have them with our vendors. Contract negotiations take months at the $$$$ level.
What if the majority of your users can access the service, but one of your BGP peers is not routing properly and some of your users are unable to access?
In answer to your question, they'll accept evidence from your own monitoring system when you claim on the SLA. They pair that up to their own knowledge about how the system was performing, then make the grant.
Google are exceptionally good at this, from my experience. Far better than most other companies, who aim to provide as little detail as possible while getting away with 'providing an SLA'.
(I don't think SLAs are BS, btw)
A few years ago I shipped a bug to production that prevented users from logging into our desktop app. It affected ~1k users before we found out and rolled back the release.
I still remember a very cold feeling in my belly, barely could sleep that night. It is difficult to imagine what the people responsible for this are feeling right now.
The answer was "well, if you don't do anything, you make NO money".
> Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. No, I replied, I just spent $600,000 training him. Why would I want somebody to hire his experience?
That isn't to say it wouldn't also affect my sleep quality.
That was the only way to help people develop routine regarding big production deployments.
I got to see a lot of people pass through this rite of passage, and it was always fun to watch. Everyone would take it incredibly seriously, some VP would invariably yell at them, but at the end of the day their managers and all their peers were smiling and clapping them on the back.
During the outage though, no one (obviously) had time for me. This was a very important server. The tension and anxiety on the remediation call was through the roof. Every passing hour someone even more important in the chain of command was joining the call. At that time I thought I was done for...
At AWS, I once took down an entire AZ of a public-facing production service (with a mistyped command), but that was nothing compared to when I accidentally deleted an entire region via an internal console (too many browser tabs). Thank goodness it turned out to be an unused/unlaunched, non-production stack. I felt horrible for hours despite zero impact (in both cases).
So sure, color-code your environments, but if you find someone about to do something to a red environment that they clearly should only be doing to a green environment, just check if they're seeing what you're seeing before you sack them ;)
It seems like a design flaw for actions like that to be so easy. E.g.
> Hey, we detected you want to delete an AWS region. Please have an authorized coworker enter their credentials to second your command.
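That kind of guard is cheap to build. A minimal sketch in Python of the "retype the name, get a second approver" pattern (all names are hypothetical; real tooling would actually verify the approver's credentials):

    # Sketch: block destructive ops behind a retyped name and a second approver.
    def confirm_destructive(action: str, resource: str) -> bool:
        print(f"You are about to {action} {resource!r}.")
        if input("Type the resource name to confirm: ") != resource:
            print("Name mismatch; aborting.")
            return False
        approver = input("Second approver username: ")
        # A real implementation would check the approver's credentials/2FA here.
        return bool(approver)

    if confirm_destructive("delete", "region-stack-us-east-1"):
        print("Proceeding...")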
The service stack for the region (and not an entire region itself) looked like prod, but wasn't. It made me feel like shit anyway.
Another workflow, though cumbersome, is: Search for a username on hn.algolia, select "comments" and "past months" as filters, then press enter.
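(If you'd rather script it, the same search is exposed by the public hn.algolia.com API; the username below is a placeholder:)

    # Fetch a user's recent comments via the public HN Algolia API.
    import requests

    resp = requests.get(
        "https://hn.algolia.com/api/v1/search_by_date",
        params={"tags": "comment,author_someuser", "hitsPerPage": 10},
        timeout=10,
    )
    for hit in resp.json()["hits"]:
        print(hit["created_at"], (hit.get("comment_text") or "")[:80])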
That's even more surprising to me.
I feel for the engineer who has to calculate the cost of this bug.
"This past Tuesday morning Pacific Time an Amazon Web Services engineer was debugging an issue with the billing system for the company’s popular cloud storage service S3 and accidentally mistyped a command. What followed was a several hours’ long cloud outage that wreaked havoc across the internet and resulted in hundreds of millions of dollars in losses for AWS customers and others who rely on third-party services hosted by AWS."
 - https://www.datacenterknowledge.com/archives/2017/03/02/aws-...
You can rely on Google outages being very few and far between, and recovering pretty fast. For the benefits you get from such a connected ecosystem, I'm not sure anyone is net positive from using a variety of different tools rather than Google supplying many of them.
It's obviously subjective, but even with our entire work leaning on Google (from Gmail, Drive and Google Docs through to all our infrastructure being in GCP), today's outage just meant everyone took an hour break. History suggests we won't see another of these for a year, so everyone taking a collective 60m break has been minimally impactful vs many smaller, isolated outages spread over the year.
...like I did a dozen+ years ago: https://antipaucity.com/2008/01/09/what-if-google-took-the-d...
Same thing happened to me but with CI, which felt bad enough already.
source: am Engineer =).
Perhaps my industry is a little more security conscious (I don't know which industry you're talking about), but this doesn't seem like good practice.
Unless it's mandated, as in something like banking, following best practice to the letter is unacceptably slow for most industries.
In some industries, security and customer requirements will at times mandate that developer workstations have no access to production. Deployments must even be carried out using different accounts than those used to access internal services, for security and auditing purposes.
There are of course good reasons for this: accidents, malicious engineers, overzealous engineers, lost/stolen equipment, risk avoidance, etc.
When you apply this rule it makes for more process, and perhaps slower response times to problems, but the accidents and other internal issues mentioned above drop to zero.
Given how easy it is to destroy things these days with a single misplaced Kubernetes or Docker command, safeguards need to be put in place.
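One cheap safeguard along those lines is a wrapper that refuses destructive verbs whenever the active context looks production-like. A rough Python sketch (the context-naming convention is an assumption):

    # Sketch: kubectl wrapper that blocks destructive verbs on prod-looking contexts.
    import subprocess, sys

    DESTRUCTIVE = {"delete", "drain", "replace"}
    PROD_MARKERS = ("prod", "production")  # assumes contexts are named this way

    ctx = subprocess.check_output(
        ["kubectl", "config", "current-context"], text=True
    ).strip()
    args = sys.argv[1:]
    if args and args[0] in DESTRUCTIVE and any(m in ctx for m in PROD_MARKERS):
        sys.exit(f"Refusing '{args[0]}' against {ctx!r}; run kubectl directly if you really mean it.")
    subprocess.run(["kubectl", *args])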
Let me tell you a little story from my experience:
I built myself a custom keyboard from a numpad kit. I had gotten tired of typing so many Docker commands every day and I had the desire to build something. I built this little numpad into a full-blown Docker control centre using QMK. A single key-press could deploy or destroy entire systems.
One day, something slid off of something else on my desk, onto said keyboard, pressing several of the keys while I happened to have an SSH session to a remote server in focus.
Suffice it to say, that little keyboard has never been seen since. On an unrelated topic, I don't have SSH access to production systems.
Consider an exactly one minute outage that affects multiple things I use for work.
First, I may not immediately recognize that the outage is actually with some single service provider. If several things are out I'm probably going to suspect it is something on my end, or maybe with my ISP. I might spend several minutes thoroughly checking that possibility out, before noticing that whatever it was seems to have been resolved.
Second, even if I immediately recognize it for what it is and immediately notice when it ends, it might take me several minutes to get back to where I was. Not everything is designed to automatically and transparently recover from disruptions, so I might have had things in progress when the outage struck that need manual cleanup and restarting.
World GDP (via Google): $80,934,771,028,340
Minutes per year: 365 * 24 * 60 = 525,600
Divide and you get $153,985,485 per minute.
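For anyone checking the arithmetic:

    world_gdp = 80_934_771_028_340        # USD, nominal 2017, per the figure above
    minutes_per_year = 365 * 24 * 60      # 525,600
    print(world_gdp // minutes_per_year)  # 153985485 -> ~$154M of world GDP per minute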
A billion is a number with two distinct definitions:
- 1,000,000,000, i.e. one thousand million, or 10^9, as defined on the short scale. This is now the meaning in both British and American English.
- 1,000,000,000,000, i.e. one million million, or 10^12, as defined on the long scale. This is one thousand times larger than the short scale billion, and equivalent to the short scale trillion. This is the historical meaning in English and the current use in many non-English-speaking countries, where billion (10^12) and trillion (10^18) maintain their long scale definitions.
Nevertheless, almost everyone uses 1B = 10^9 in technical discussions.
World's GDP is $80,934,771,028,340 (nominal, 2017).
Nobody would argue world GDP is anything "billion"; that's crazy.
In France, they use milliard and billion.
I’d guess actual losses to the world economy were more on the order of about $100K per minute, or about 1/3 of Google’s revenue. MAYBE a few hundred thousand per minute, though that seems unlikely with Search being unaffected, and everything else coming back. Certainly a far cry from billions per minute :)
Nothing is operating at minute margins unless it's explicitly priced on a minutely basis, like a cloud service. Even if a worker on a conveyor belt can't produce paperclips without looking at a Google Docs sheet all the time, this will be absorbed by the buffers down the line. And only if the worker fails to meet her monthly target because of it might a loss of revenue occur. But in that case the service has to be down for weeks.
For more complex conversions of time into money, as in most intellectual work, it is even less obvious that short downtimes cause any measurable harm.
In my defence, the cert was not labeled properly, nor was it used properly, and there was no documentation. It took us 2 days to create a new cert, apply it to our software and deliver it to the customer. Those were 2 days I'll never get back. However, by the time I was finished the process was documented and the cert was labeled, so I guess it's a win.
Edit: the downvotes have started again! Thanks to everyone "supporting freedom of expression" :)
Many years ago, I was directly responsible for causing a substantial percentage of all credit/debit/EBT authorizations from every WalMart store world-wide to time out, and this went on for several days straight.
On the ground, this kind of timeout was basically a long delay at the register. Back then, most authorizations would take four or five seconds. The timeout would add more than 15 seconds to that.
In other words, I gave many tens of millions of people a pretty bad checkout experience.
This stat (authorization time) was and remains something WalMart focuses quite heavily on, in real time and historically, so it was known right away that something was wrong. Yet it took us (Network Engineering) days to figure it out. The root cause summary: I had written a program to scan (parallelized) all of the store networks for network devices. Some of the addresses scanned were broadcast and network addresses, which caused a massive amplification of return traffic which flooded the satellite networks. Info about why it took so long to discover is below.
Back in the 1990s, when this happened, all of the stores were connected to the home office via two way Hughes satellite links. This was a relatively bandwidth limited resource that was managed very carefully for obvious reasons.
I had just started and co-created the Network Management team with one other engineer. Basically prior to my arrival, there had been little systematic management of the network and network devices.
I realized that there was nothing like a robust inventory of either the networks or the routers and hubs (not switches!) that made up those networks.
We did have some notion of store numbers and what network ranges were assigned to them, but that was inaccurate in many cases.
Given that there were tens of thousands of network ranges in question, I wrote a program creatively called 'psychoping' that would ICMP scan all of those ranges with adjustable parallelism.
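(In hindsight, the dangerous part was enumerating the targets. Today, Python's ipaddress module makes it trivial to skip exactly the network and broadcast addresses that caused the amplification; the prefixes below are made up:)

    # Enumerate scan targets, skipping network and broadcast addresses.
    import ipaddress

    for prefix in ["10.12.0.0/24", "10.13.0.0/24"]:        # made-up store ranges
        for host in ipaddress.ip_network(prefix).hosts():  # excludes .0 and .255
            pass  # ping(host) here, with bounded parallelism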
I ran it against the test store networks, talked it over with the senior engineers, and was cleared for takeoff.
Thing is, I didn't start it right away; some other things came up that I had to deal with. I ended up starting it over a week after the review.
Why didn't this get caught right away? Well, when timeouts started to skyrocket across the network, many engineers started working on the problem. None of the normal, typical problems were applicable. More troubling, none of the existing monitoring programs looked for ICMP at all, which is what I was using exclusively.
So of course they immediately plugged a sniffer into the network and did data captures to see what was actually going on. And nothing unusual showed up, except a lot of drops.
We're talking > 20 years ago, so know that "sniffing" wasn't the trivial thing it is now. Network Engineering had a few extremely expensive Data General hardware sniffers.
And to these expensive sniffers, the traffic I was generating was invisible.
Two things: the program I wrote to generate the traffic had a small bug and was generating very slightly invalid packets. I don't remember the details, but it had something to do with the IP header.
These packets were correct enough to route through all of the relevant networks, but incorrect enough for the Data General sniffer to not see them.
So...there was a lot of 'intense' discussions between Network Engineering and all of the relevant vendors. (Hughes, ACC for the routers, Synoptics and ODS for the hubs)
In the end, a different kind of sniffer was brought in, which was able to see the packets I was generating. I had helpfully put my userid and desk phone number in the packet data, just in case someone needed to track raw packets back to me.
Though the impact was great, and it scared me to death, there were absolutely no negative consequences. WalMart Information Systems was, in the late 1990s, a very healthy organization.
And I have never seen them load so fast before: the Gmail progress bar was visible for barely a fraction of a second, whereas I'm used to seeing it for multiple seconds (2-3 sec) until it loads.
I observe the same anecdotal speedup on other sites... Drive, YouTube, Calendar. I wonder if they are throwing all the hardware they have at their services, or if I am hitting underutilized servers, since it is not fixed for everyone.
It is nice to experience (even if it is short-lived) the snappiness Google services would have if they weren't so multi-tenanted.
a) users haven't all come back yet
b) Google is throttling how fast users can access services again to prevent further outages
c) to reduce load, apps have features turned off (which might make things directly faster on the user's end or just reduce load on the server side)
I hope they make their learnings, post-mortem, etc. public so that we can all learn from it.
My engineer hat is saying - "damn, I wish I was part of fixing this outage at their scale."
My product owner hat is saying - "Aaaaaaaaaaaaaaa......Aaaaaaaaaaaaaaa...."
Of course it will (at least, it had better), but what if it doesn't? And if it does, are you going to take countermeasures in case it happens again, or is it just going to be 'back to normal' again?
Everybody uses it, so if, say, Gmail loses all the emails, we're all in the same state, and the consequences will be more bearable and socially normal.
Most people are fine with accepting that whatever future thing will happen to most people will also happen to them. Because then the consequences will also be normal.
If the apocalypse comes, it comes for almost all of us and that's consolation enough.
For me, backing up to the Cloud is fine, because I find the risk of my home being broken into and everything stolen AND the cloud goes down AND the cloud services are completely unrecoverable is a small enough risk to tolerate.
I don't think it's possible to have permanently indestructible files in existence over a given time period.
Most of the things I backed up with Google remain largely accessible, except on an occasion like this.
It's rare that any services I operate solo come back this quickly after an outage.
Cloud storage is still useful of course, but I prefer to view it as a cache rather than as a dependable backup.
I highly suggest everyone disable this setting on their own, but also on their (perhaps less technical) friends' and relatives' devices. Otherwise, if anything happens to your account or - less likely - the storage provider or their hardware, your data could very well be gone forever. I can't believe anyone would want that.
Much less chance of that happening than my local backups getting borked...
Both have vastly different failure modes, and a typical backup strategy should use both of them.
This way, if all my backups are gone, I likely have far more important issues than the loss of files.
(and yes, my backups are encrypted)
Sure, you can argue "move to Fastmail/Protonmail/Hey/whatever", but those can also go down on you just like Google is down now. And self-hosting email is apparently not a thing, due to complexity and having to forever fight with being marked as spam (note: not my personal experience, I never tried self-hosting; I'm just relaying what I read here on HN when the topic comes up).
So, yeah, what do we do about email? I feel like we should have a solution to this by now, but somehow we don't.
That's _much_ better than trying to host my own email server.
As I said (literally in the second sentence), I don't rely on Google for everything, as you mention. I don't actually rely on Google for anything other than Gmail, and even with that I am unhappy. The point I was trying to make is that there aren't really alternatives, and I was hoping someone might come up with a suggestion about how to overcome that problem.
You can do split delivery and have your email be delivered to two different destinations. It's less common than it used to be but it's trivial.
You can still use Gmail and fall back to connecting directly to your server if Gmail is down.
Some mails might be flagged as spam if the IP/domain has no reputation, but that quickly passes, at least that's my experience.
Nice and simple! :D
I haven't had any issues with new domains being marked as spam, but I always make sure the SPF, DKIM and DMARC records are set up.
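For anyone setting this up, they're just DNS TXT records, roughly like these (example.com, the selector, and the key are placeholders):

    example.com.                  TXT "v=spf1 mx include:_spf.mailhost.example ~all"
    sel1._domainkey.example.com.  TXT "v=DKIM1; k=rsa; p=<public-key>"
    _dmarc.example.com.           TXT "v=DMARC1; p=quarantine; rua=mailto:dmarc@example.com"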
If the question is "does anybody still feel like arguing that a single provider is a viable back-up?", then it's yes for most cases. A better strategy is of course to use multiple providers; the chance that it never comes back is much lower.
There was actually a project called "Spinnaker" that was supposed to solve this problem.
Whether the cost of paying 2 or more cloud providers is worth it for most companies is up in the air.
Full disclosure: I work for Azure. Don't work on Arc tho. Don't have experience being a customer for these products
They seem to have figured out the hard parts already.
Same question for non-cloud.
/s - for now ;)
At this point the only reason I use it is that I'm grandfathered in on an old plan and it's still free; if that changes I'll go elsewhere.
It's very comforting to have a local copy of everything important in situations like this one.
I had already imagined the only solution left was to write a Medium post and hope it gets some traction on Hacker News so that Google support steps in.
Thinking to myself I was an idiot for knowing all this and still thinking it wouldn't happen to me.
And even though it turned out to be an outage, it gave me a bad enough feeling to start using a domain name I own for my email.
Obviously not relevant for this kind of outage, but in the scenario outlined by GP - Google randomly kills you off, and there is nothing you can do - this is at least an emergency strategy.
all green, which does not reflect reality for me (e.g. Gmail is down)
edit: shows how incredibly difficult introspection is
Also wondering if this is perhaps the fastest upvoted HN post ever? 8 mins -> ~350 votes, 15 mins -> ~750 votes. I wonder if @dang could chime in with some stats?
Update: looks like it hit 1000 upvotes in ~25 mins!
Update: 1500 in ~40 mins
Update: 2000 in ~1 hour 20 mins (used the HN API for the timestamp)
Stats right now: 1985 points | 1 hour ago | 597 comments
There is a public API on Firebase, but AFAIR it's just a mirror rather than the main storage.
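The Firebase mirror is easy to poll for exactly these stats, e.g. (the item ID below is a placeholder):

    # Poll the official HN Firebase API for a story's score and comment count.
    import requests

    item_id = 123456  # placeholder story ID
    item = requests.get(
        f"https://hacker-news.firebaseio.com/v0/item/{item_id}.json", timeout=10
    ).json()
    print(item["score"], item["descendants"], item["time"])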
I love (and am deeply scared by) the dependence on Google and the conflation of it with the entire internet.
>I’m sitting here in the dark in my toddler’s room because the light is controlled by @Google Home. Rethinking... a lot right now.
Some people are compiling more relevant events: https://twitter.com/internetofshit
Fallback to 'classical mode' works for me.
But funnily enough, a lot of the votes come from traffic that searches for "is ____ down?" on Google. XD
Do love how consumer services (ISPs &c) always have some report of being down somewhere, but it means nothing unless there's a big spike.
If you are logged in, the page crashes with an error.
You can still browse all services from Incognito (which for some is not an option).
Also, you can’t use many parts of Gmail, Drive, Photos, etc, without being logged in.
But I guess it's technically not part of /appsstatus
Kinda weird it would totally break if the auth failed, unlike other services like Search.
I mean, that must be a generous definition of "works"! :)
edit: parent updated their comment
> It is all green if you do not need to be log in.
Giving the impression that if you were logged in, or didn't need to log in, everything would be OK.
I wrote too fast because I thought it could help people work around the problem.
The small print says: The problem with Gmail should be resolved for the vast majority of affected users. We will continue to work towards restoring service for the remaining affected users...
At Google scale, the "remaining affected users" probably number in the tens of millions. Sucks to be one of them, tho.
But hey, it happens. As a SaaS maintainer, I can sympathise with any SREs involved in handling this, and know that no service can be up 100% of the time.
The issue isn't "negative" realities; it's saying something mid-investigation that might break contracts, only to find out later that it wasn't true.
Monitoring is very simple; I even learned this from a document released by the Google devops team many years ago.
Always alert from the end user's perspective. In other words: have an external server test logging in to Gmail. Simple as that.
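A minimal version of that external probe might look like this (the URL, threshold, and paging hook are placeholders):

    # Black-box probe: exercise the path a real user takes, from outside.
    import time, requests

    PROBE_URL = "https://mail.example.com/login"  # placeholder endpoint

    def probe() -> bool:
        start = time.monotonic()
        try:
            ok = requests.get(PROBE_URL, timeout=5).status_code == 200
        except requests.RequestException:
            ok = False
        print(f"ok={ok} latency={time.monotonic() - start:.2f}s")
        return ok

    if not probe():
        pass  # page the on-call / flip the status page here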
They manually update that status page to not scare away stockholders.
Faster than a free third-party website’s response time. Google should know they are down and tell people about it before Hacker News, Twitter, etc. Google should be the source of truth for Google service status, not social media.
> And what level of detail?
Enough to not tell people that there are “No issues” with services.
> I'd much rather have their team focus on fixing this as fast as possible than trying to update that dashboard in the first 5 minutes.
Google employs enough people to do both.
It's not like they would be working on the status page right now; that work should have been done a long time ago...