As the updates to the status page say, we're working to resolve a networking issue. The Region isn't (and wasn't) "down", but obviously a spike in network latency for external connectivity is bad.
We are currently experiencing an issue with a subset of the fiber paths that supply the region. We're working on getting that restored. In the meantime, we've moved almost all Google.com traffic out of the Region to prefer GCP customers. That's why the latency increase is subsiding: we're freeing up the fiber paths by shedding our own traffic.
Edit (since it came up): that also means that if you're using GCLB and have other healthy Regions, it will rebalance to avoid this congestion/slowdown automatically. That seemed the better trade-off given the reduced network capacity during this outage.
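For the curious, here's a toy sketch of that idea (this is not GCLB's actual algorithm, and the region names and ordering are made up): a global load balancer routes each request to the most-preferred region that still passes health checks, so dropping an unhealthy region rebalances traffic automatically.

    # Toy illustration of health-based regional failover.
    # Region order is a hypothetical latency/preference ranking.
    REGIONS = ["us-east1", "us-central1", "us-west2"]

    def pick_region(healthy):
        """Return the most-preferred region that is currently healthy."""
        for region in REGIONS:
            if region in healthy:
                return region
        raise RuntimeError("no healthy regions")

    # During an incident, health checks drop us-east1 from the set,
    # and new requests rebalance to the next-preferred healthy region.
    print(pick_region({"us-central1", "us-west2"}))  # -> us-central1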
As one of my old bosses said: I don't care that the site/service is technically running, if the customers can't reach it, then IT'S DOWN.
My customers don't care that the network is down, the servers are down, or aliens have landed. The severity is the same and our infrastructure, regardless of the cause, was down.
During the impacted time period, we did a full DR failover to App Engine instances we spun up in west2. This was not a minor hiccup.
But the people who have to fix it desperately care about which specific part is down. That's just about the highest-priority information they need. Homing in on where the problem is, is one of the few ways to get to fixing it. Having a boss shout that "everything is down, it's all broken" is the opposite of identifying the problem.
I find the idea that it was "a ridiculous time to nitpick" hilarious.
What? You lost critical business functionality for 5 hours, and you'd rather have the boss shouting at the workers because the wording used doesn't accurately reflect the boss's understanding, instead of the workers working on solving the problem?
"OK, we have databases up, load balancers responding, DNS records check out, last change/deployment was at this time, all these services are up, and the latest test suite is running all green, this narrows down the places where a failure might be with some useful differential diagnosis, now we can move attention to.."
"I DON'T CARE THAT YOU THINK THINGS ARE WORKING, IF THE CUSTOMER CANNOT GET TO IT, IT'S DOWN"
"Thanks for that helpful input, let's divert troubleshooting attention from this P1 incident, and have a discussion about what "DOWN" means. You want me to treat the working databases as down because the customer can't get to them? Even though they're working?
It's like the hatred for "works on my machine". "WELL, I'M NOT RUNNING ON YOUR MACHINE". No, you aren't, but it demonstrates that the current build works and that the commands you're using are coherent and sensible; it excludes many possible causes of failure and adds useful information to the situation.
For troubleshooting and internal use of course you want to describe the outage in precise terms (while being very sure you are not downplaying the impact).
For talking to customers, a sufficiently slow response is the same as no response, and nothing is more irritating than being told 'it's not really down' when they can't use the service.
In my case, Cloud PR knows me, but I also knowingly risk my job (I clearly believe I have good enough judgment in what I post). If Urs and Ben think I should be fired, I'm okay with that, as it would represent a significant enough difference of opinion that I wouldn't want to continue working here anyway.
Finally, for what it's worth, I have been reported before for "leaking internal secrets" here on HN! It turned out to be a totally hilarious discussion with the person tasked with questioning me. Still not fired, gotta try harder :).
Whenever I talk about the inner workings of Google, I try to reference external talks, books, or white papers to go along with my comments. Luckily, a lot has already been said externally about how Google works.
If you haven't read them, you have to!
I would love to understand the thought process of someone going out of their way to remove someone's livelihood from them because of a comment on HN (when applied in the normal circumstance of adding additional information or correcting a misconception — I'm clearly not saying that bonehead comments shouldn't have consequences).
Maybe the person making the report said "Hey, I found some internal details on this external site. I'm not sure if this is allowed. Maybe someone who knows more should take a look at it, here's the link to the page."
Submitting a complaint to an internal review because “you’re not sure it’s allowed” is really petty.
In my opinion, and experience, folks who have good intentions usually pull you to the side to get a feel for a situation before filing a formal complaint.
This is not so difficult, though. You just need to adjust your starting point to someone who doesn't like boulos in the first place. That's not so hard IMO; it's a large org, and boulos seems to be a fairly prolific commenter here.
He certainly shares stuff I wouldn't be comfortable sharing, but then again he's a lot better connected and in the know than I am.
On the other hand, to anonymously submit a complaint feels, to me, like a personal attack: someone who simply doesn't like them for whatever reason. To me, that action seems petty.
One of the things I really like about working at Google is that they place a lot of trust in the judgement of the individual employees. I generally make it clear when I'm stating my personal opinion versus the "official" one (for whatever that means given how informal the project is), but I don't have to carefully go through an approved list of talking points, run my HN comments by the legal department, etc.
Obviously, in certain situations, things get more official and formal. For example, when I went to Google IO to give a talk, we did have some documentation and coaching beforehand about how to handle various questions we might get about non-public stuff, other projects related to ours, etc. We are also expected to run any slides by legal before being publicly shown in a venue with a wide audience like IO. But, even then, the legal folks I've worked with have been a pleasure to talk to.
The company's culture is basically "We hired you because you're smart. We trust you to use your brain." It would be squandering resources to not let their employees use their own intelligence and judgement.
I work at another FANG with a roughly equal engineering community and I don’t see my kind commenting as much at all!
It's probably okay to say that we know the problem and here are the steps we're taking to mitigate it. It would not be okay to say something with large-scale stock price implications for Google or another publicly traded corporation. For instance, a Google employee shouldn't say something like "faulty solar panels fried Google's 10 largest data centers and twelve others have been lost to rebel drone strikes", even if false, since it could have a drastic impact on the earnings and future value of Google, Google's customers, and Google's competitors.
Even less obvious things like Google's plans for adding privacy features to the Chromium open source project can have a serious impact (see https://www.barrons.com/articles/google-chrome-privacy-quest...).
Do companies realize how absurd this is?
ETA: It seems someone at Google had a change of heart, and most of what boulos posted in this thread has been added as updates to the official google status page. Better late than never, I guess, especially if this is the start of a trend in outage reporting.
I mostly responded because there was confusion downthread (and in the title) about being “down”. During an outage is a tricky time for comms, so short corrections are best until a full postmortem can be done.
We test our disaster alarms on a known schedule. And just a couple of years ago, during the peak vacation time in the summer, the alarm went off, off schedule.
This made the entire country panic. Were we being attacked? The agency that is supposed to let people know through public channels like TV, radio, etc. was silent. They were probably on vacation themselves. The websites and apps they'd set up were ridiculously underpowered and were basically DDoS'ed by the spike in traffic they were getting.
News outlets were also struggling, but did way better.
The only things that withstood the sudden burst in traffic without a hitch were Facebook and Twitter.
The official statement, I think, was that the alarm was triggered by accident (which had never happened before, I think). But it goes to show how badly our emergency response is set up.
Back when I worked there, the AWS status board was (and probably still is) terrible b/c Service teams owned that communication channel, not AWS Support. That really ought to have been changed. Service teams don't have the time or incentive to give real-time updates. Why not just let the people who know the customers best deal with parsing the TT and giving updates?
> Service teams don't have the time or incentive to give real-time updates. Why not just let the people who know the customers best deal with parsing the TT and giving updates?
The escalation team inside PS now drafts customer messaging within ~5 minutes of the impact being identified (usually about 5 minutes into an event), and if the impact is significant enough to post to the public dashboard, that may take another 5 minutes. Depending on the type of impact, affected customers will be notified via the personal health dashboard.
PS owns the tooling that does this, and is responsible for driving the process, but the service org's (e.g. EC2, S3, etc.) representative often makes the call on whether to post to the public status page or not (depending on the scale of the impact; e.g. a 20% API failure rate for 5% of customers probably won't make the status page, but affected customers will get notices).
TT is almost out ... but the PS tooling supports it and its replacement, and provides easy access and summaries for internal teams (so you don't need to refresh TT or subscribe to the ticket just to see what the status is).
Sadly, the closer you are to the action of a thing like this (for example, I'm on NetInfra SRE and we were part of the group that put in place the current mitigations you're seeing work now), the less you can say without fear of subtle inaccuracy or releasing non-public information.
Knowing an asteroid took out the entire continent tells you something about the repairability and the resources required to fix the problem, and generally provides context for later updates, as opposed to other causes like a cut fiber line, a burning datacenter, or a bad power supply.
"No" vs "No, I already have plans with X"
The first case gives you all the information needed (denial); in the second case, however, I understand the situation much better. I wouldn't call the text on the status page meaningless, though: it's pretty nice and concise already (which is what you want in a "crisis"). Just some brief description of the problem would be good, even if technically unnecessary.
You're right, there's no additional actionable information there; the status page contains everything I actually need to know. But a bit more information makes me feel better. I guess the difference is that your comment reassures me you actually know what's going on. The status page text (prior to the 14:31 update) could equally mean "we've got this under control" or "shit's broken and we don't know why".
Not sure why they closed that one at 9:12 just to open a new one at 10:25. We didn't see any traffic coming to us-east1 during that time period so I would assume the original issue is still the root cause.
Sorry for the confusion, and yes, the fiber link issue is the root cause. Draining the Google.com traffic presumably resolved the issue for you, though you may still be seeing elevated latency as the updates suggest.
Edit: added this to the top level comment so more folks see it.
I've been saying this repeatedly (and been downvoted for it repeatedly): if you want truly reliable systems, use simple, boring technology, don't fuck with it after it's set up, and run it yourself. 99.99% of all these outages are due to screwing up something that already works, something that if it was in your own rack you could just leave alone and not touch at all.
Fiber optic cables are a great technology, but they don't react well to being cut in half by a backhoe. Is the solution you are recommending that we stop using fiber optic cables, or that we stop using backhoes?
You can't create "technical debt" if you don't change anything in the first place.
> You can’t create “technical debt” if you don’t change anything in the first place.
Rubbish. The bits really do rot, and if you don’t do _something_ on occasion you end up with an entire data center no one wants to touch because the dust in the servers might be structural at this point.
I’m not saying go rewrite your apps against the Kafka instance your junior devs are fucking with, but you have to do something to fight the entropy.
The counter-story to yours is running that database on MongoDB in the cloud on a cluster. Instead you'd be having crazy MongoDB issues, data inconsistencies, connectivity issues when the cloud is down, etc etc.
The solution is somewhere in the middle. You can have modern, supported hardware running an LTS Linux, and that counts as boring.
But over time boring IT turns into legacy, and without some tension to the system pushing it forward your standards end up locking you into legacy forever.
Stuff breaks. So you fix it. Boring old stuff needs fixing too sometimes. Problem is, old stuff gets obsolete and you can't get replacement parts, because of progress (or something). It's the same story since the first looms were made centuries ago.
What you can't fix you can't really depend on. Our time scales are just compressed to ridiculousness because the pace of change is off the charts these days. So basically, you can't really depend on anything working more than a few months before falling over. Sucks.
Important things go onto clusters, or at least have a (hot or cold) standby server.
Or it may simply not meet the needs of users anymore.
I would hardly hold the air traffic control system up as a model to aspire to, for example. The only reason we run the old one is that the upgrade attempts all failed.
Tell that to your security team.
Don't screw with it and it will have security issues after a few months?
I've definitely seen this where I work - the "old guard" setup the system that put the company in a prime market position, the newer people are just doing API calls and scratching their heads if it doesn't work.
Here's a reddit link because YouTube is blocked here.
1) Don't fuck with it?
2) Make a mitigating code change. Patch / fix it (fuck with it)?
If you must fix it, the correct solution is to replace the affected software with the same (or almost the same) version of the software with the fix. No API changes, no other fixes.
Once an attacker is in your organization, he will look for exactly that kind of internal-only backend where exploits are already available and the attack vector is known.
There is no such thing as an internal-only backend where security is concerned.
Let's assume the attacker used social engineering to get credentials from an unprivileged user and uses these to log in to a remote desktop. (I know there are ways to prevent that, but I think enough examples have shown that a public-facing remote desktop is not too unrealistic.)
Once he is inside your company, he can reach the "internal-only" backend and use the privilege escalation bug you thought wasn't worth fixing to get root.
Are they really showing that? None of the major cloud providers, even constrained to a single region (or even AZ), seems on average less reliable than the on-prem datacenters I've seen.
> not allowing customers to do so shows how little you care about them.
While the solutions may not be as complete for all use cases as public-cloud-only ones, are any of the major cloud providers not working to enable hybrid-cloud deployments and selling their capacity to support them?
But it's true, it's much cheaper if you can find a way to replicate those or do without.
For small hobby projects I simply use a 3rd-party secondary DNS service.
The more "the cloud" replaces many, many servers at lots of different places, the more the outages (which once happened all the time, but to many different organizations at different times) will become big enough to notice.
So, yeah, not just your imagination.
This is just for the last few months...?
But Google's vendors might have less. One would hope that Google is auditing claims of independence from vendors at least somewhat, but at some level they have to rely on vendor representation and SLAs if they aren't going to do it all themselves.
Accidents happen. Regularly. :D
There are only 3 things I can say about this situation.
1) These issues are currently unrelated.
2) We learn a lot from these situations.
3) A lot of these types of issues can be mitigated by running in more than one region.
I really can't promise that today's situation will never happen again. There are a lot of moving pieces in our system, and sometimes there are things outside of Google's control.
And, not to be snarky, but many of the other responses that are along the lines of "It's not really that difficult to run in multiple clouds" - let's just say I have trouble believing these commenters have real world experience actually doing this. I'm not saying it's impossible, but it is extremely difficult for any system of reasonable complexity with a dev team of, say, 10 or more people.
And even if you can stomach the cost, you give up the ability to really use any of the proprietary (and oftentimes awesome) functionality of a particular provider, which can put your dev velocity at a big disadvantage.
Once you have deployed your stack on Kubernetes, you can pretty much run it on any cloud or infrastructure with minor tweaks at most.
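To illustrate that portability (a minimal sketch, assuming the official kubernetes Python client, clusters that already exist on each provider, and hypothetical kubeconfig context names): the same Deployment spec can be applied unchanged to clusters on different clouds.

    from kubernetes import client, config

    # Hypothetical kubeconfig contexts for clusters on two providers.
    CONTEXTS = ["gke-prod", "eks-prod"]

    # One provider-agnostic Deployment spec, reused everywhere.
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "web"},
        "spec": {
            "replicas": 3,
            "selector": {"matchLabels": {"app": "web"}},
            "template": {
                "metadata": {"labels": {"app": "web"}},
                "spec": {"containers": [{
                    "name": "web",
                    "image": "nginx:1.25",
                    "ports": [{"containerPort": 80}],
                }]},
            },
        },
    }

    for ctx in CONTEXTS:
        api = client.AppsV1Api(config.new_client_from_config(context=ctx))
        api.create_namespaced_deployment(namespace="default", body=deployment)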
This is not to excuse the downtime in any way.
Redundant Array of Independent Data Clouds.
I guess for RAID 5 I would need a minimum of 3 regions or 3 separate cloud providers.
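Napkin math for the analogy (a sketch; the numbers are assumed, and real regions don't fail independently): a RAID-5-style layout across 3 regions tolerates any single-region outage, so you're only down when 2 or more regions fail at once.

    from math import comb

    p = 0.001            # assumed per-region unavailability (99.9% each)
    n, tolerated = 3, 1  # 3 "disks" (regions), survives any 1 failure

    # Probability that more regions fail simultaneously than we tolerate.
    p_outage = sum(comb(n, k) * p**k * (1 - p) ** (n - k)
                   for k in range(tolerated + 1, n + 1))
    print(f"{1 - p_outage:.7f}")  # ~0.9999970, vs 0.999 for one region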
Plus the fact that without serious investment, you're probably more liable to decrease availability by going multi-cloud thanks to the increased system complexity.
I can get a lot of work done while Outlook is down. Hell, probably more work done.
If our build server is down I can work for a couple hours (unless we’ve done something very bad). Same for git or our bug database or wiki or or or. When I get stuck on one thing I can swap to something else every couple of hours. And there is always documentation (writing or consuming).
But if some idiot, hypothetically speaking of course, puts most of these services into the same SAN, then we are truly and utterly screwed if there is a hardware failure.
Similarly if you make one giant app that handles your whole business, if that app goes down and there are no manual backups you might as well send everybody home.
I went to get a drink the other day and the place looked funny. They’d tripped a circuit breaker and the whole kitchen lost power. But the registers and the beverage machines were on a separate circuit. And since they sold drinks and food in that order, they stayed open and just apologized a lot. Whoever wired that place knew what they were doing.
It's happened more than once with Azure and GCP. I think it happened once with AWS, but I'm not positive there.
I recall Azure had some sort of multi-region database failover disaster that took several regions offline, and GCP has had several global elevated latency/error rate events, but I don't think that any cloud provider has been "down" in the sense that the word is usually used.
Here’s one that’s on Azure. Not a 100% total outage like above, but bad enough most I know in the industry would call it being down:
If I get a free moment, I’ll dig up other examples, but those were ones that were easy to find.
IMO, you’re better off with a private data center or colo and separate integrations with cloud.
The only S3 event here was limited to us-east-1:
Some APIs were impacted because they are global by nature (e.g. create-bucket). But S3 was working fine in all other regions for existing buckets.
However, many websites were affected because they didn't use any of the existing S3 features that allow for regional redundancy; S3 had been so reliable that they didn't know (or think) they needed critical assets in a second-region bucket they could fail over to.
Admittedly, even the AWS status page was impacted, because it also relied on S3 in us-east-1.
S3 has done a lot of work to improve matters since, and mechanisms have been put in place to ensure that all AWS services don't have inter-region dependencies for "static" operation.
However, it is still incorrect to claim that it was all of S3. Many customers who use S3 only in other regions were totally unaffected.
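The kind of fallback those sites were missing isn't much code, either. A minimal sketch, assuming boto3 and hypothetical bucket names that are already kept in sync (e.g. via S3 Cross-Region Replication):

    import boto3
    from botocore.exceptions import ClientError, EndpointConnectionError

    # Hypothetical, pre-replicated buckets, in preference order.
    BUCKETS = [("us-east-1", "assets-us-east-1"),
               ("us-west-2", "assets-us-west-2")]

    def fetch_asset(key):
        """Read an object, falling back to the next region on failure."""
        last_err = None
        for region, bucket in BUCKETS:
            s3 = boto3.client("s3", region_name=region)
            try:
                return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            except (ClientError, EndpointConnectionError) as err:
                last_err = err  # region unavailable; try the next one
        raise last_err

    data = fetch_asset("img/logo.png")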
* DevOps teams can go multi-cloud relatively easily when using infrastructure-as-code tooling (Terraform, Packer, etc.) and traditional DevOps practices
* Why manage a fleet of vanilla boxes when you can use vanilla boxes with Kubernetes and not get gouged by cloud providers in the first place?
You don't need to jump off the hype train if you never got on in the first place.
Using the providers path isn’t necessarily gouging, but it isn’t cost optimized either. The answer depends on you.
That said, cloud is like any tenant/landlord relationship. Your rights are linked to time and are whatever your contract provides. If you didn’t like Office 2007, you didn’t buy it. If you don’t like Office 365, 2021 edition, too bad.
Of course that only works as long as you're swapping out largely replaceable parts. If you built everything around some proprietary service then yeah, you've tied yourself to that anchor.
Cost+speed of scalability, and managed services. If you rarely need to scale, your workloads are all predictable, and you don't need managed services/support, you should just buy some VPSes or dedicated boxes.
You are really limiting your tech stack by using standardized things like Jenkins, Docker, K8s, MQTT, Kafka.
"Outsourcing" those functions to cloud services can be big win for a small team. Like all engineering, it's a trade off.
With multiple regions, as long as your provider offers all of the services, you can have a carbon copy. Much easier.
It depends on your needs, your architecture, your risk tolerance, etc. I think for most people "Use multiple regions" is the answer that strikes the correct balance. It probably isn't the correct answer for everyone.
Certain terms and conditions may apply :) Carbon copy of a static website or one whose data is only a one-way flow from some off-cloud source of truth? Sure! Multi-master or primary-secondary with failover? Stray too far from the narrow path of specialized managed solutions and things get very complex, very quickly. That being said - it's mostly just the nature of the beast. If you're not able to tolerate a regional outage, multi-region is a pill you're going to have to swallow, no buts about it.
For the large majority of businesses investing in infrastructure-as-code far outweighs any crazy HA, redundant, multi-provider, whizzbang whatever setup you may have.
But the degree of independence provided by AZs is not constant across providers, despite similar terminology.
Are you implying that the cause of this outage is not Google's fault? If so, can you go into more details about that?
From the dashboard. Looks like this can be blamed on an Act of Backhoe.
Datacenters also sometimes have other single points of failure such as DNS, but those are within the company's control.
As mentioned in another thread, in this case, Google has rerouted google.com traffic out of the region to try to mitigate the congestion.
For some customers it is the right thing; for other customers it may not be.
Every provider will have failures. So the question mostly boils down to: does paying for more than one region cost more or less than the lost productivity or revenue of an outage like this?
For some places, the most costly thing they spend money on is employees. If your whole company comes to a stop for even one hour, it may cost more than the engineering effort for multi-zone, multi-region, or multi-cloud for your critical environments.
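Back-of-the-envelope version of that trade-off (every number here is made up; plug in your own):

    # Hypothetical numbers: does multi-region pay for itself?
    employees = 200
    loaded_cost_per_hour = 75.0     # $/employee-hour, assumed
    outage_hours_per_year = 4       # e.g. one incident like today's

    outage_cost = employees * loaded_cost_per_hour * outage_hours_per_year
    multi_region_overhead = 40_000  # assumed annual eng + infra cost

    print(outage_cost)                          # 60000.0
    print(outage_cost > multi_region_overhead)  # True -> worth it here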
AFAIK Amazon are running a lot of actual production loads on AWS. Dogfooding can be extremely valuable, especially if a massive portion of your staff have the same profession as your target market.
I've been using Google Cloud in a new role I started recently. There's definitely some parts of GCP I like, but whenever I use the Web Console I get the distinct impression nobody at Google actually uses it. If they did, I'm fairly sure all the annoying little warts I encounter would not exist.
EC2 was released in 2006.
Amazon.com's last non-EC2 server was retired in 2012.
But a lot of features of amazon.com still don't run on the main AWS offerings.
GCP has not been out for that long. Also, it's quite a bit easier to run an e-commerce site than to run the web's largest search engine, as well as the largest email provider and the largest maps provider. Each of these has an order of magnitude more traffic than amazon.com.
I'm sure they'll get there though, just not the same scale. Not even close.
For what it's worth, the internal-only systems also have warts ;)
* filtering traces by services has been broken in App Engine flex environments for more than a year.
* copy/pasting identifiers between places is a nightmare
* their IAM design is somehow worse than AWS's. It's so impressively bad I can't even be mad. My favourite part of their IAM approach is how they have consolidated the majority of the IAM controls in the IAM page, but then random services like GCS have theirs defined elsewhere.
* not being able to do basic time zooming of metric graphs on the App Engine dashboard.
* multi-account paper cuts. Almost everyone on my team has their personal and work google accounts logged in. Whenever I send them a link to a dashboard or whatever, they end up getting a permission denied, without fail.
These are all just off the top of my head. Many of them seem silly and minor (and they are!) but there’s enough of them that I kinda dread doing anything in the Cloud console now. I need to take more time to get productive in the gcloud CLI I guess...
Google multi-account support within a single browser is a pain. It kinda works until it doesn't. I'm sidestepping this issue by using distinct Chrome profiles for work and personal.
On the other hand, I've not found Amazon's multi-account situation to be cozy either. IIRC you literally have to log out and log in again, or use assume-role, and the switch applies to all open tabs.
I always considered the Google Cloud approach of a "single account, multiple projects" a lot cleaner than the AWS "hundreds of accounts" approach. Do you not find this the case?
The UI was maddeningly obtuse. This is from the second time I tried. They did fix it eventually.
Very complex system for distributing new keys taking payments.
When you really care about high availability and security you really don't want all your systems run with the same software, hardware, and coded by the same teams.
What do Google (or Amazon/MSFT) do to ensure software echo chambers are not created within their infrastructure that could potentially cause mass-scale outages by way of the same bug (or bugs) propagating through their systems?
GCP, AWS, and Azure are the great decentralization of the internet.
If you want heterogeneous environments you have to cobble it together yourself by using multiple services.
I recently left Google to start a startup and now everything is falling apart.
Glass house and all that... but I also share the same glass house as you... I don't want bad luck.
... and it's only a fluke that this happened to Google in us-east1 and not AWS in X region, and then you (and I) would be having a hell of a time! :/
Their last one was laughable in its lack of self-awareness.
Can you explain what's better about the AWS one? They both do, approximately, the same thing: provide a few paragraphs of background, approximately one paragraph describing the actual issue, and a few paragraphs describing concrete followups. The AWS one has more timestamps.
You aren't confusing this with the postmortem, are you?
Nobody really wanted to be [enterprise software]-certified, but it was a way to get their employers to pay for them to go to the conference with cool talks and perks and such.
We delayed the training most of the day, and couldn't say it was AWS' fault because they were sitting in the audience, waiting to get certified.
People were about to riot, that was not a fun day.
The whole point when something like this happens is for you to ensure that a region going down will not impact you - not to laugh at people who use another cloud or to assume that X is better than Y. That being said, there have been several Google-related failures lately that don't help build confidence in the GCP offering; if you're just starting in the cloud space, this may actually impact the choices you make when picking your cloud provider.
So my point was to _not_ to laugh at those at google (or those using their services), because AWS might be next.
The whole 'I share the same glass house' was a sort of karma thing: if someone who uses AWS is laughing at Google and karma came round and took out AWS, it wouldn't only affect the guy laughing at Google; I'd be affected as well, along with a multitude of other people... and the tables could easily be turned.
The point is that AZs are higher level than DCs, so that they provide pretty decent independence guarantees (though you can further derisk with multi-region.)
Well, in AWS. Google's zones have weaker independence assurances (actually, as I read it, no assurances), stating only that a zone “usually has power, cooling, networking, and control planes that are isolated from other zones”  as opposed to AWS’s “Availability Zones are physically separated within a typical metropolitan region” and “In addition to discrete uninterruptable power supply (UPS) and onsite backup generation facilities, they are each fed via different grids from independent utilities to further reduce single points of failure. Availability Zones are all redundantly connected to multiple tier-1 transit providers.” 
The same decisions that make regions fail as a unit also make intra-region traffic cheaper. This is true for all large cloud providers. If you are okay paying more for internal network traffic, you can go multi-regional. But multi-AZ is still better than single-AZ. It's up to you to decide if it's worth it. For that you need good SLAs and (IMO) support contracts.
Please, be kind and decent to each other, especially when things are hard.
I wish these guys and gals luck on getting things working.
"This is a frustrating outage for us, a huge part of the attraction in Google Cloud has been the premise that we get the underlying reliability of Google's infrastructure. If we'd known what the reliability of Google in practice this year would look like, we might have stayed with AWS."
"Why are the stupid SRE's at Google even paid such absurd numbers if they can't even go a whole month without multiple hours of downtime."
Criticizing companies is fine; just please remember there are real people there.
"Kind and Decent" doesn't seem like a high bar.
If "please be kind and decent" is too much of an ask, I pray we never work together.
If this statement you quoted is something you're not comfortable with, I have a hard time believing you have ever encountered criticism in your life.
Criticizing Google is fine, but sometimes, the best deployments to production can go wrong.
> “Our testing processes were insufficient in this case and we are reviewing and making changes to our testing and deployment process to avoid incidents like this in the future.”
They're going after the definition of "good" for their deployments.
If you're a paying customer, you should be free to criticize as you damn well please.
Downvoters, please link here the yelling you have seen.