Hacker News new | past | comments | ask | show | jobs | submit login

Strange that the race condition existed for 6 months, and yet manifested during the last 30 minutes of completing the patch to fix it, only four days after discovery.

I’m not good with statistics but what are the chances?

Race conditions are weird.

I had a service that ran fine for years if not decades on Java. One day, a minor update came in to the GNU core utils, which were not at all used by the service itself, and this somehow triggered the race every time in less than 5 minutes, taking down our production cluster. The same update didn't do anything to preproduction, even under much higher load than prod had.

There was a clear bug to fix and a clear root cause. Even so, I never understood what exactly pushed it over the edge.


An alternative view is you got very very lucky that a developer spotted the issue ahead of time, so when the problem occurred you were already aware of the root cause.

Otherwise you might have been down for a much longer period of time while you identified the issue.

Being someone who read hundreds of incident reports and postmortems that I was involved in personally in some capacity on a "fixing" side and thousands on a receiving side, I'm always amazed that otherwise intelligent people believe the details shared in them. The art of writing a postmortem is the art of feeding hungry hyenas in a zoo without blowing a budget: the details are bunk used to convince the hyenas to continue to eat the food rations.

Here's what this postmortem actually says :

* There was an undeniable, user observable issue between 10:04 and 11:28 PT as the customers could not change configuration.

* There was some root cause issue that we will say ran between time X and time Y, we do not acknowledge that your specific service was impacted in that window, unless specified separately.

* At some point we worked around/fixed the underlying issue.

* At 11:28 we fixed the user observable issue.

* The following is the number of minutes we acknowledge to be down for SLA purposes. Remember to pay your bill.

i think they are higher than you expect, because usually what causes the bug to be known is a worsening state of the system that makes the bug more likely to be hit.

i would ask how the engineer found the race condition, and whether that doesn’t imply a much greater risk.

This, as the state continues to worsen, the higher the chance that someone observing will go "huh that looks off" and then look into it, all while your system hasn't toppled over yet, no notice or write up would be necessary, but you definitely know now what the problem is. And then following that while you are working on a patch the system finally topples over and causes an incident/outage.

There likely was monitoring for various "problems" in production - error rates, validation failures etc, or even just good old crash counts.

An alert may have fired that lead to someone debugging the issue in detail.

I can totally imagine a slow creeping Metric Of Death that has slowly slowly slowly been creeping up for ages and then suddenly breaches some threshold and then becomes a problem.

Load balancers and database servers are great candidates for this type of bug.

You can live with something for a long time, but once you hit a critical mass or trigger a particular condition, failures cascade.

Race conditions aren't random, but chaotic. It's very probable that the reason the race condition wasn't caught in the first place is that it was probably "impossible" to trigger until some butterfly-patch flapped its wings halfway across the server farm to cause cascading millisecond changes in timing to ripple out.

Off-hand, the odds seem pretty low. But maybe some seemingly-unrelated performance change in the release before made the race more likely to go badly. If so, it may not be just a coincidence that an engineer found the problem and there actually was a production outage so close together. I've seen things like that before.

Pretty high with enough bugs.

The chances are relatively low, but this is survivorship bias, no? The thousands or tens of thousands of times the problem was fixed before it manifested are invisible to us.

imagine the following:

If service B returns before Service A an error occurs. Service A is lightening fast, and Service B is a slug. Service A incurs an unexpected performance penalty for every new user added to the system. This incremental slight performance degradation adds up, eventually additional system load such as a periodic Virus Scan on System A has a chance to push it over the edge.

I don't know the rollout process but perhaps it involves taking servers offline, putting more load on the still live unpatched servers, increasing the probability of the race condition occurring?

I could imagine that the mitigations they had put in place were perhaps just in the process of being removed, perhaps by some engineer who was slightly ahead of the rollout finishing...

It's the same as me seeing apt on my machine is 88% done installing some package and deciding that's probably enough to make it runnable in a new tab...

Bingo, if I was being paranoid, I would say someone leaked knowledge of this exploit after it was discovered.

Being a Googler privy to the internal postmortem: there was no way to trigger this externally (the faulty server is in the control plane) AND triggering this by a Google engineer would require some determination and leaving a ton of audit trail.

>This incident was caused by a bug in the configuration pipeline that propagates customer configuration rules to GCLB.

This line suggested it could be triggered from a customer. Is this inaccurate?

Hi. I helped write some of the internal postmortem and manage the data plane side of the team that responded to this.

Please allow me to reassure you: No. Absolutely not in this case. Not even slightly.

Any engineer can tell you customer configuration contents can cause bugs in configuration pipelines, but that's multiple layers away from this issue in our particular case.

Google runs microservices, so when the public postmortem mentions pipeline, it is a series of servers talking to each other. The problem happened towards the end of the pipeline, after multiple processing steps of the original user input. Furthermore, it was caused by a race condition, not mishandling invalid input.

Hard to know without access to the postmortem, but without it, I can think or two generalization possibilities to take advantage: 1) make config changes very quickly (very likely to have mitigations here), 2) make the configuration extremely large (what is valid but too large?), 3) both.

Inflict an off by one error? Joke.

It’s much more likely that other factors increased the chances of hitting the bug. Maybe the race condition was more likely to be hit if the amount of configuration data increased or the frequency with which configuration changes were compiled went up? The component with the bug doesn’t exist in a vacuum and its behaviour could likely be influenced by external systems.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact