Nov 16 GCP Load Balancing Incident Report (cloud.google.com)
172 points by joshma 6 months ago | 76 comments



"Additionally, even though patch B did protect against the kind of input errors observed during testing, the actual race condition produced a different form of error in the configuration, which the completed rollout of patch B did not prevent from being accepted."

This reminds everyone that even the top-notch engineers that work at Google are still humans. A bugfix that didn't really fix the bug is one of the more human things that can happen. I surely make many more mistakes than the average Google engineer, and my overall output quality is lower, and yet I feel a bit better about myself today.


> "even the top-notch engineers that work at Google are still humans."

A top result in a Google search tells me: "According to 2016 annual report as of December 31, 2016 there was 27,169 employees in research and development and 14,287 in operations."

With that many people, it is unreasonable to flat-out assume that everyone who works at Google is top-notch. This kind of stereotyping is insidious in the labor-market for people who are otherwise excellent but do not have the magic fairy dust of Google sprinkled on them.

It is important to remember that no matter how impressive the machinery looks from the outside, everything eventually traces back to a human being typing some text in some editor with an imperfect model of how some lego pieces fit together.


This exactly. I work at FAANG and am so tired of the stereotype. Lots of very mediocre people everywhere


I beg to disagree. I have been through a couple of processes with FB/Google and the bar is insanely high. I should say that I have no college degree, just years of work experience, and I'm not the type to study in preparation for an interview; I think I should know everything requested by heart because I'm familiar with it or used to doing it. I guess that maybe there are people who prepare for this and then, once they are in... relax.


2016 was a very long time ago.


I didn't dig deep. It doesn't matter if the correct number is 40,000 or 80,000. The argument remains the same.


Strange that the race condition existed for 6 months, and yet manifested during the last 30 minutes of completing the patch to fix it, only four days after discovery.

I’m not good with statistics but what are the chances?


Race conditions are weird.

I had a service that ran fine for years if not decades on Java. One day, a minor update came in to the GNU core utils, which were not at all used by the service itself, and this somehow triggered the race every time in less than 5 minutes, taking down our production cluster. The same update didn't do anything to preproduction, even under much higher load than prod had.

There was a clear bug to fix and a clear root cause. Even so, I never understood what exactly pushed it over the edge.


As someone who has read hundreds of incident reports and postmortems for incidents I was personally involved in on the "fixing" side, and thousands on the receiving side, I'm always amazed that otherwise intelligent people believe the details shared in them. The art of writing a postmortem is the art of feeding hungry hyenas in a zoo without blowing the budget: the details are bunk used to convince the hyenas to keep eating the food rations.

Here's what this postmortem actually says:

* There was an undeniable, user observable issue between 10:04 and 11:28 PT as the customers could not change configuration.

* There was some root cause issue that we will say ran between time X and time Y, we do not acknowledge that your specific service was impacted in that window, unless specified separately.

* At some point we worked around/fixed the underlying issue.

* At 11:28 we fixed the user observable issue.

* The following is the number of minutes we acknowledge to be down for SLA purposes. Remember to pay your bill.


i think they are higher than you expect, because usually what causes the bug to be known is a worsening state of the system that makes the bug more likely to be hit.

i would ask how the engineer found the race condition, and whether that doesn’t imply a much greater risk.


This. As the state continues to worsen, the chance grows that someone observing will go "huh, that looks off" and look into it, all while your system hasn't toppled over yet. No notice or write-up would be necessary, but you now definitely know what the problem is. And then, while you are working on a patch, the system finally topples over and causes an incident/outage.


There likely was monitoring for various "problems" in production - error rates, validation failures etc, or even just good old crash counts.

An alert may have fired that led to someone debugging the issue in detail.

I can totally imagine a slow creeping Metric Of Death that has slowly slowly slowly been creeping up for ages and then suddenly breaches some threshold and then becomes a problem.


Load balancers and database servers are great candidates for this type of bug.

You can live with something for a long time, but once you hit a critical mass or trigger a particular condition, failures cascade.


Race conditions aren't random, but chaotic. It's very probable that the reason the race condition wasn't caught in the first place is that it was probably "impossible" to trigger until some butterfly-patch flapped its wings halfway across the server farm to cause cascading millisecond changes in timing to ripple out.


Off-hand, the odds seem pretty low. But maybe some seemingly-unrelated performance change in the release before made the race more likely to go badly. If so, it may not be just a coincidence that an engineer found the problem and there actually was a production outage so close together. I've seen things like that before.


Pretty high with enough bugs.


The chances are relatively low, but this is survivorship bias, no? The thousands or tens of thousands of times the problem was fixed before it manifested are invisible to us.


imagine the following:

If Service B returns before Service A, an error occurs. Service A is lightning fast, and Service B is a slug. Service A incurs an unexpected performance penalty for every new user added to the system. This slight incremental performance degradation adds up, and eventually additional system load, such as a periodic virus scan on System A, has a chance to push it over the edge.
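
A minimal sketch of that scenario, with entirely made-up numbers: Service A's latency grows a little with every user, Service B stays slow but roughly constant, and both have random jitter. The function name and all latency figures are hypothetical; the point is only that the failure rate can sit at zero for a long time and then climb quickly.

```python
import random

def race_failure_rate(users, trials=10_000, seed=0):
    """Estimate how often slow Service B returns before fast Service A.

    All numbers are illustrative assumptions, not measurements.
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        # Service A: 10 ms base plus 0.01 ms per user, with Gaussian jitter.
        a_latency = rng.gauss(10 + 0.01 * users, 5)
        # Service B: the "slug", about 100 ms regardless of user count.
        b_latency = rng.gauss(100, 5)
        if b_latency < a_latency:
            failures += 1  # B won the race: the bug manifests
    return failures / trials

for users in (0, 5_000, 8_000, 9_000, 10_000):
    print(users, race_failure_rate(users))
```

Under these assumptions the race is effectively impossible at low user counts and becomes a coin flip once A's average latency catches up with B's.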


I don't know the rollout process but perhaps it involves taking servers offline, putting more load on the still live unpatched servers, increasing the probability of the race condition occurring?


I could imagine that the mitigations they had put in place were perhaps just in the process of being removed, perhaps by some engineer who was slightly ahead of the rollout finishing...

It's the same as me seeing apt on my machine is 88% done installing some package and deciding that's probably enough to make it runnable in a new tab...


Bingo, if I was being paranoid, I would say someone leaked knowledge of this exploit after it was discovered.


Being a Googler privy to the internal postmortem: there was no way to trigger this externally (the faulty server is in the control plane) AND triggering this by a Google engineer would require some determination and leaving a ton of audit trail.


>This incident was caused by a bug in the configuration pipeline that propagates customer configuration rules to GCLB.

This line suggested it could be triggered from a customer. Is this inaccurate?


Hi. I helped write some of the internal postmortem and manage the data plane side of the team that responded to this.

Please allow me to reassure you: No. Absolutely not in this case. Not even slightly.

Any engineer can tell you customer configuration contents can cause bugs in configuration pipelines, but that's multiple layers away from this issue in our particular case.


Google runs microservices, so when the public postmortem mentions pipeline, it is a series of servers talking to each other. The problem happened towards the end of the pipeline, after multiple processing steps of the original user input. Furthermore, it was caused by a race condition, not mishandling invalid input.


Hard to know without access to the postmortem, but I can think of two general ways to take advantage: 1) make config changes very quickly (very likely to have mitigations here), 2) make the configuration extremely large (what is valid but too large?), or 3) both.

Inflict an off by one error? Joke.


It’s much more likely that other factors increased the chances of hitting the bug. Maybe the race condition was more likely to be hit if the amount of configuration data increased or the frequency with which configuration changes were compiled went up? The component with the bug doesn’t exist in a vacuum and its behaviour could likely be influenced by external systems.


?


Did Roblox ever release the incident report from their outage?


I haven’t been able to locate anything since the Halloween announcement

https://blog.roblox.com/2021/10/update-recent-service-outage...

Maybe they are hoping most people forget?


Haha, it was down for like 2-3 days. Prob waiting to announce a major security incident.


Not sure if this is my own personal bias, but I could have sworn this issue was affecting traffic for longer.

My company wasn’t affected, so I wasn’t paying close attention to it. I was surprised to read it was only ~90 min that services were unreachable.

Anyone else have stabilizing anecdata?


As a Googler privy to the internal postmortem: as stated in the public postmortem, all traffic was unaffected within 33 minutes of the problem appearing. The bug was very on/off: at 09:35PT a corrupted configuration stopped ~immediately (usually double digit seconds of propagation delay) all traffic. At 10:08PT it was verified that the whole service is running the configuration from before the corruption.

The >1h duration was for inability to change your load balancing configuration.


Maybe you're thinking of this incident? https://status.cloud.google.com/incidents/1xkAB1KmLrh5g3v9ZE.... It was a few days earlier and took almost 2 hours.


We received errors at least 45 minutes before their stated time. :-/


Then you have been hit by some other issue.


It was definitely more than 404's they are claiming. Go playground was 503'd.


Which it could easily have been because it itself received a 404 from something and couldn't handle that.


This is my experience of the outage: My DNS servers stopped working but HTTP was operational if I used the IP, so something is rotten with this report.

Lesson learned: I will switch to AWS in Asia and only use GCP in central US, with GCP as backup in Asia and IONOS in central US.

Europe is a non-issue for hosting because it's where I live and services are plentiful.

I'm going to pay for a fixed IP on the fiber I can get that on and host the first DNS on my own hardware with lead-acid backup.

Enough of this external dependency crap!


> I'm going to pay for a fixed IP on the fiber

This is nice for backup, but I would expect more downtime from your ISP than the big cloud platforms. Also, you might want a platform with anycast DNS if you care about (initial page load) latency.


Sure you get more downtime, that's why I have 2x fibers with my 100% read uptime database between them, that way both fibers have to go down at the same time for existing customers to be unable to login.

I noticed DNS was a bit slow on first lookup, it's not a big deal for my product and well worth the extra control.

I looked up anycast, and it's unclear how you enable that if you have your own DNS servers, I have 3, one in each continent but I'm pretty sure the DNS provider I use does not use the DNS in the right region!

Is that something you tell the root DNS servers about through your domainname registrar?

You would think this had been built into the root servers ages ago? They can clearly see where my DNS servers are!?


> I noticed DNS was a bit slow on first lookup

Have you measured this from another continent? I noticed it could add quite a bit of latency, especially when the remote client has a relatively slow internet connection.

More specifically, I noticed that when I was using a CNAME to a domain with DNS in the US.

To use anycast, you need the same IP addresses in multiple locations. Realistically, you can only do that if you peer with local ISPs and can advertise a route.

I never dug enough to start my own ISP, so it's a bit fuzzy for me, but I think you need to control your own AS (or partner with one), and announce your routes over BGP from multiple areas.

Most CDN or cloud providers probably offer anycast as an option, and it is likely the default configuration for their DNS as well as static websites.


Aha, ok thx!

I'm going to add geolocation lookup on my own DNS eventually.

But my product will connect directly to each region and measure latency and the number of players so anycast would not help a great deal for the complexity.

I wish the DNS roundrobin used the order of the replies in the DNS packet as priority instead of randomly picking one IP... that way my DNS servers could direct people to the correct region without losing the backup!

As to why the root servers are not doing geolocation lookups in 2021 I'm just baffled by the laziness of monopoly owners, but then again the priority ordering would be needed first!


Anecdotally, I've had 100% uptime on my ISP for the past 3 years and have read many a cloud provider's post mortem in that time.

My company hosts a large portion co-located in a datacenter and has the same uptime as my ISP. Clouds seem to be more complex which invites more opportunity for things to go wrong.


What I would not give for a comprehensive leak of Google's major internal post-mortems.


I find the post mortem really humanizing. As a customer of GCP there’s no love lost on my end.


Why?

If you read past post mortem, you should notice that configuration induced outages have been the sole category of all large-scale outages.

GCP is repeating the same mistake on a similar cycle. (Don't quote me on this, that's just my impression)

That means they are not improving the situation.


> If you read past post mortem, you should notice that configuration induced outages have been the sole category of all large-scale outages.

Is it really that surprising? GCP's services are designed to be fault tolerant, and can easily deal with node and equipment failures.

Bugs and configuration errors are much more difficult to deal with, because the computer is doing what it's been programmed to do, even if that isn't necessarily what they wanted or intended. Correctness-checking tools can catch trivial configuration errors, but problems can still slip through, especially if they only manifest themselves under a production load.

If GCP were repeating literally the same failure over and over again, I could understand the frustration, but I don't think that's the case here. Demanding that GCP avoid all configuration-related outages seems unreasonable -- they would either have to stop any further development (since after all, any change has the potential to cause an outage), or they'd need some type of mechanism for the computer to do what the developers meant rather than what they said, which is well beyond any current or foreseeable technology and would essentially require a Star Trek-level sentient computer.


I said they are not improving, not that config-induced outages aren't nasty...


It might be a business decision.

More reliability means slower development speed. If you are in the same ballpark as your competition, it's better to invest in development speed than in being 10x more reliable.


And perhaps counter-intuitively: slower development speed often means reduced reliability.

If your development and deploy cadence is slower, you end up batching up more changes in any given deployment. Larger changes => higher likelihood of something in them being wrong => harder to debug due to delta size => wider effective blast radius.

Fast testing and robust build validation are some of the more important guard rails that allow you to move fast and be more reliable at the same time.
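
The batching argument can be made concrete with a back-of-the-envelope sketch. It assumes, purely for illustration, that every change independently has the same small chance of being bad (the 1% figure is invented); real changes are neither independent nor equally risky.

```python
def p_bad_deploy(changes_per_deploy, p_bad_change=0.01):
    """Chance that a deploy contains at least one bad change.

    Assumes independent changes with an identical, made-up failure
    probability; this is an illustration, not a measured model.
    """
    return 1 - (1 - p_bad_change) ** changes_per_deploy

for n in (1, 10, 50, 200):
    print(f"{n:>3} changes per deploy -> {p_bad_deploy(n):.1%} chance of a bad deploy")
```

A bad deploy of 200 changes also leaves 200 suspects to sift through, compared with one suspect for a single-change deploy.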


Configuration change being the most likely cause of outage is true across all post mortems, not solely Google's. It feels like you're blaming them for not solving something that no-one else knows how to solve either. Facebook outage, Salesforce, Azure, it's all configs.


Well, the OP said he becomes more confident in GCP, that's why I said this... My personal experience working at Amazon and Google told me that Google's engineering culture and practice are set up for a slightly different scenario than AWS. So if we assume cloud is more of an AWS-defined business, then GCP seems to be catching up, but not as fast as I would have hoped for.


This text has been rewritten for public consumption in quite a positive light... There are far more details and contributing factors, and only the best narrative will have been selected for publication here.


Companies sugar-coating their outage reports is a pet peeve of mine and a real trust-buster. “Some users may experience delays” typically means the whole thing is completely dead. Companies that are really open and honest about such things are rare these days but really deserve praise and support for doing so.


Most companies don't release outage reports period (exhibit A: Roblox). The cloud hyperscalers kind of need to though, because it's not just their own business on the line.

In the early days of GCP, major outage reports were written and signed by SRE VP Ben Treynor:

https://status.cloud.google.com/incident/compute/16007


"5% of users faced difficulty logging in" typically means that the whole service was down, but that only 5% of users attempted to use the service during the downtime. They also count accounts that have been dormant since 2004... so it looks like a smaller number...


one bug fixed, two bugs introduced...


> customers affected by the outage _may have_ encountered 404 errors

> for the inconvenience this service outage _may have_ caused

Not a fan of this language guys/gals. You've done a doo-doo, and you know exactly what percentage (if not how many exactly) of the requests were 404s and for which customers. Why the weasel language? Own it.


if I had to guess, not a Googler...

Someone in a tech role wrote something like "because of the limitations of XYZ system we can't get a crisp measurement of the number of 404 errors customers experienced", failed to add a ballpark estimate because they thought everyone was on the same page about severity, and someone polishing the language saw it and interpreted it as "I mean, who can really say whether there were 404s?"

And the latter one would have been originally written as something more normal, then someone else read it and objected, "Most customers were outside of the blast impact!" (or somesuch) so then because the purpose of the post was informational to all customers, instead of scoping the apology to the customers who were impacted they came up with that language.

Committee communications are a painful mess, and the more important everyone thinks an issue is the more likely they are to mangle it.


Yeah we saw 100% of requests fail for a 20 minute timeframe for our production service, nothing made it through. Definitely a lot more than “may”.


Is there any possibility that data POSTed during that outage would have leaked some pretty sensitive data?

For example, I enter my credit card info on Etsy prior to the issue and just as I hit send the payload now gets sent to Google?

At that scale there has to be many examples of similar issues, no?


Why do you think there might be? They just described how the error was their system returning 404s.


This to me shows Google hasn't put sufficient monitoring in place to know the scale of problems and the correct scale of response.

For example, if a service has an outage affecting 1% of users in some corner case, it perhaps makes sense to do an urgent rolling restart of the service, perhaps taking 15 minutes. (On top of diagnosis and response times)

Whereas if there is a 100% outage, it makes sense to do an "insta-nuke-and-restart-everything", taking perhaps 15 seconds.

Obviously the latter is a really large load on all surrounding infrastructure, so needs to be tested properly. But doing so can reduce a 25 minute outage down to just 10.5 minutes (10 minutes to identify the likely part of the service responsible, 30 seconds to do a nuke-everything rollback)


The 15 seconds figure may be very wishful thinking. Often a service startup is a short burst of severe resource consumption. Doing it with 100% of the fleet at once may stall everything in an uncontrollable overloaded state.


Is infrastructure at this scale typically unable to do a cold start? I can believe that this is very difficult to design for, but being unable to do it sounds dangerous to me.

(Edit for the downvoters: I was genuinely curious how these kinds of things work at Google’s scale. Asking stupid questions is sometimes necessary for learning.)


I guess it depends what "infrastructure" means.

If you mean "all of Google" then a cold restart would probably be very hard. At Facebook a cold restart/network cutoff of a datacenter region (a test we did periodically) took considerable planning. There is a lot to coordinate — many components and teams involved, lots of capacity planning, and so on. Over time this process got faster but it is still far from just pulling out the power cord and plugging it in again.

If you mean a single backend component then cold starting it may or may not be easy. Easy if it's a stateless service that's not in the critical path. But it seems this GCP outage was in the load balancing layer and likely harder to handle. A parent comment suggested this could be restarted in 15s, which is probably far from the truth. If it takes 5s to get an individual node restarted and serving traffic you'd need to take down a third of capacity at a time, almost certainly overloading the rest.
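
The arithmetic behind that "third of capacity" figure, as a rough sketch. It assumes equal-sized batches restarted one after another and a fixed per-node restart time; the function name is made up for illustration.

```python
import math

def capacity_offline_fraction(total_window_s, node_restart_s):
    """Fraction of the fleet that must be down at once to restart
    every node within the window, assuming equal-sized sequential
    batches and a fixed per-node restart time (both assumptions)."""
    batches = max(1, math.floor(total_window_s / node_restart_s))
    return 1 / batches

print(capacity_offline_fraction(15, 5))    # 15 s window, 5 s per node
print(capacity_offline_fraction(900, 5))   # 15 min rolling restart
```

With a 15 s window there is room for only three batches, so a third of the fleet is offline at any moment; stretching the window to 15 minutes brings that down to well under 1%.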

In some cases the component may also have state that needs to be kept or refilled. Again, at FB, cold starting the cache systems was a fairly tricky process. Just turning them off and on again would leave cold caches and overload all the systems behind them.

Lastly, needing to be able to quickly cold restart something is probably a design smell. In the case of this GCP outage rather than building infra that can handle all the load balancers restarting in 15s it would probably be easier and safer to add the capability of keeping the last known good configuration in memory and exposing a mechanism to roll back to it quickly. This wouldn't avoid needing to restart for code bugs in the service but it would provide some safety from configuration-related issues.
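
A minimal sketch of that "last known good" idea, to show how little machinery the basic mechanism needs. The class and method names are hypothetical, and this is not how GCLB is actually built; a real system would also have to decide when a config has proven itself (here that is an explicit `mark_good` call).

```python
class ConfigHolder:
    """Keeps the active config in memory alongside the last one
    that was explicitly marked as good."""

    def __init__(self, initial):
        self.active = initial
        self.last_known_good = initial

    def apply(self, new_config):
        # Start serving with the new config; it is not yet trusted.
        self.active = new_config

    def mark_good(self):
        # Promote the active config once it has proven itself,
        # e.g. after serving traffic with healthy error rates.
        self.last_known_good = self.active

    def rollback(self):
        # Revert in memory, with no process restart required.
        self.active = self.last_known_good
        return self.active

holder = ConfigHolder({"routes": {"/": "backend-a"}})
holder.apply({"routes": {}})   # corrupted config: no routes at all
restored = holder.rollback()
print(restored)
```

The design choice is that passing validation is not what makes a config "good", since a corrupted config can still be accepted; only observed healthy behaviour promotes it.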


> Lastly, needing to be able to quickly cold restart something is probably a design smell.

For everyone not at a scale to afford their own transoceanic fiber cables, a major internet service disruption is equivalent to a cold start. And as long as hackers or governments are able to push utter bullshit to the global BGP tables with a single mouse click, this threat remains present.


The comment I was replying to mentions "at [Google] scale", so my answer was with that in mind.


When Amazon S3 in us-east-1 failed a few years ago, the reason for the long outage (6 hours? 8 hours? I don't recall) was that they needed to restart the metadata service, and it took a long time for it to come back with the mind-boggling amount of data on S3. Cold starts are hard to plan for precisely at this type of scale.


It can be done. It takes a heck of a lot longer than 15s though.


Everyone flushing the toilet at the same time to clean the pipes


'Whereas if there is a 100% outage, it makes sense to do an "insta-nuke-and-restart-everything", taking perhaps 15 seconds.'

Take some time to consider what a restart means, across many data centers on machines which have no memory of the world before the start of their present job...


> rollback to the last known good configuration

could very much be the "fast" option. 15s restart, or anything close to it, across the entirety of it sounds quite unlikely.


15 second rollbacks don't exist at scale.



