This reminds everyone that even the top-notch engineers who work at Google are still human. A bugfix that didn't really fix the bug is one of the more human things that can happen.
I surely make far more mistakes than the average Google engineer, and my overall output quality is lower, and yet I feel a bit better about myself today.
A top result in a Google search tells me: "According to 2016 annual report as of December 31, 2016 there was 27,169 employees in research and development and 14,287 in operations."
With that many people, it is unreasonable to flat-out assume that everyone who works at Google is top-notch. This kind of stereotyping is insidious in the labor-market for people who are otherwise excellent but do not have the magic fairy dust of Google sprinkled on them.
It is important to remember that no matter how impressive the machinery looks from the outside, everything eventually traces back to a human being typing some text in some editor with an imperfect model of how some lego pieces fit together.
I’m not good with statistics but what are the chances?
I had a service that ran fine for years, if not decades, on Java. One day, a minor update came in to the GNU core utils, which were not used by the service itself at all, and this somehow triggered a latent race condition every time in less than 5 minutes, taking down our production cluster. The same update didn't do anything to preproduction, even under much higher load than prod had.
There was a clear bug to fix and a clear root cause. Even so, I never understood what exactly pushed it over the edge.
Here's what this postmortem actually says:
* There was an undeniable, user observable issue between 10:04 and 11:28 PT as the customers could not change configuration.
* There was some root-cause issue that we will say ran between time X and time Y; we do not acknowledge that your specific service was impacted in that window unless specified separately.
* At some point we worked around/fixed the underlying issue.
* At 11:28 we fixed the user observable issue.
* The following is the number of minutes we acknowledge to be down for SLA purposes. Remember to pay your bill.
I would ask how the engineer found the race condition, and whether that doesn't imply a much greater risk.
An alert may have fired that led to someone debugging the issue in detail.
I can totally imagine a slow creeping Metric Of Death that has slowly slowly slowly been creeping up for ages and then suddenly breaches some threshold and then becomes a problem.
You can live with something for a long time, but once you hit a critical mass or trigger a particular condition, failures cascade.
If Service B returns before Service A, an error occurs. Service A is lightning fast, and Service B is a slug. Service A incurs an unexpected performance penalty for every new user added to the system.
This slight incremental performance degradation adds up; eventually, additional system load, such as a periodic virus scan on System A, has a chance to push it over the edge.
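To make that concrete: here's a toy sketch (the service names, latencies, and shared-result pattern are all invented, not taken from the postmortem) of an ordering assumption that holds only as long as A stays faster than B:

```
import threading
import time

results = []  # filled in completion order -- this is the hidden assumption

def call_service(name, latency):
    time.sleep(latency)
    results.append(name)

def handle_request(a_latency, b_latency):
    results.clear()
    threads = [
        threading.Thread(target=call_service, args=("A", a_latency)),
        threading.Thread(target=call_service, args=("B", b_latency)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Works for years while A is lightning fast and B is a slug...
    assert results[0] == "A", "race lost: B returned before A"

# Fine today: A answers in ~10 ms, B in ~50 ms.
handle_request(a_latency=0.01, b_latency=0.05)

# After enough per-user degradation (or a virus scan on A's host), A's
# latency creeps past B's and the assertion above starts failing:
# handle_request(a_latency=0.06, b_latency=0.05)
```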
It's the same as me seeing apt on my machine is 88% done installing some package and deciding that's probably enough to make it runnable in a new tab...
This line suggested it could be triggered by a customer. Is this inaccurate?
Please allow me to reassure you: No. Absolutely not in this case. Not even slightly.
Any engineer can tell you customer configuration contents can cause bugs in configuration pipelines, but that's multiple layers away from this issue in our particular case.
Inflict an off-by-one error? Joke.
Maybe they are hoping most people forget?
My company wasn't affected, so I wasn't paying close attention to it. I was surprised to read it was only ~90 min that services were unreachable.
Anyone else have stabilizing anecdata?
The >1h duration was for inability to change your load balancing configuration.
Lesson learned I will switch to AWS in Asia and only use GCP in central US, with GCP as backup in Asia and IONOS in central US.
Europe is a non-issue for hosting because it's where I live and services are plentiful.
I'm going to pay for a fixed IP on the fiber connection where I can get one, and host the primary DNS server on my own hardware with lead-acid battery backup.
Enough of this external dependency crap!
This is nice for backup, but I would expect more downtime from your ISP than the big cloud platforms. Also, you might want a platform with anycast DNS if you care about (initial page load) latency.
I noticed DNS was a bit slow on first lookup, it's not a big deal for my product and well worth the extra control.
I looked up anycast, and it's unclear how you enable that if you have your own DNS servers. I have three, one on each of three continents, but I'm pretty sure the DNS provider I use does not direct clients to the DNS server in the right region!
Is that something you tell the root DNS servers about through your domainname registrar?
You would think this had been built into the root servers ages ago? They can clearly see where my DNS servers are!?
Have you measured this from another continent? I noticed it could add quite a bit of latency, especially when the remote client has a relatively slow internet connection.
More specifically, I noticed that when I was using a CNAME to a domain with DNS in the US.
To use anycast, you need the same IP addresses in multiple locations. Realistically, you can only do that if you peer with local ISPs and can advertise a route.
I never dug enough to start my own ISP, so it's a bit fuzzy for me, but I think you need to control your own AS (or partner with one), and announce your routes over BGP from multiple areas.
Most CDN or cloud providers probably offer anycast as an option, and it is likely the default configuration for their DNS as well as static websites.
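To answer the "how do you enable that" question above in the simplest possible terms: anycast is a routing trick rather than a DNS feature. A deliberately oversimplified model (made-up site names and path costs, nothing like real BGP policy) of what the routers end up doing:

```
# One address announced from several sites; each client reaches whichever
# announcement looks "closest" (shortest path, roughly).
ANYCAST_ADDR = "203.0.113.53"
SITES = ["site-us", "site-eu", "site-asia"]

def route(client_path_costs):
    """Pick the announcing site with the lowest path cost from this client."""
    return min(SITES, key=lambda site: client_path_costs[site])

# A client in Asia reaches the Asian instance, a client in Europe the
# European one -- same IP address in both cases.
print(ANYCAST_ADDR, "->", route({"site-us": 8, "site-eu": 6, "site-asia": 1}))
print(ANYCAST_ADDR, "->", route({"site-us": 5, "site-eu": 1, "site-asia": 7}))
```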
I'm going to add geolocation lookup on my own DNS eventually.
But my product will connect directly to each region and measure latency and the number of players, so anycast would not help a great deal relative to the complexity.
I wish the DNS round-robin used the order of the replies in the DNS packet as priority instead of randomly picking one IP... that way my DNS servers could direct people to the correct region without losing the backup!
As to why the root servers are not doing geolocation lookups in 2021, I'm just baffled by the laziness of monopoly owners, but then again the priority ordering would be needed first!
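A rough sketch of what that geolocation ordering could look like on my own servers (region names and addresses are made up; a real version would use a GeoIP database, and resolvers are of course still free to shuffle the order they receive):

```
# Map each serving region to its address, and order the full answer set so
# the region nearest the client comes first while the others stay as backups.
REGION_IPS = {
    "eu":   "192.0.2.10",
    "us":   "198.51.100.10",
    "asia": "203.0.113.10",
}

# Crude "distance" between the client's inferred region and each serving
# region; a real implementation would derive this from GeoIP data.
DISTANCE = {
    "eu":   {"eu": 0, "us": 1, "asia": 2},
    "us":   {"us": 0, "eu": 1, "asia": 2},
    "asia": {"asia": 0, "eu": 1, "us": 2},
}

def ordered_answers(client_region):
    """Return every region's IP, nearest first, so no backup is ever lost."""
    order = sorted(REGION_IPS, key=lambda region: DISTANCE[client_region][region])
    return [REGION_IPS[region] for region in order]

print(ordered_answers("asia"))
# ['203.0.113.10', '192.0.2.10', '198.51.100.10']
```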
My company hosts a large portion of its infrastructure co-located in a datacenter and has the same uptime as my ISP. Clouds seem to be more complex, which invites more opportunity for things to go wrong.
If you read past postmortems, you'll notice that configuration-induced outages have been the sole category of large-scale outages.
GCP is repeating the same mistake on a similar cycle. (Don't quote me on this; that's just my impression.)
That means they are not improving the situation.
Is it really that surprising? GCP's services are designed to be fault tolerant, and can easily deal with node and equipment failures.
Bugs and configuration errors are much more difficult to deal with, because the computer is doing what it's been programmed to do, even if that isn't necessarily what they wanted or intended. Correctness-checking tools can catch trivial configuration errors, but problems can still slip through, especially if they only manifest themselves under a production load.
If GCP were repeating literally the same failure over and over again, I could understand the frustration, but I don't think that's the case here. Demanding that GCP avoid all configuration-related outages seems unreasonable -- they would either have to stop any further development (since, after all, any change has the potential to cause an outage), or they'd need some type of mechanism for the computer to do what the developers meant rather than what they said, which is well beyond any current or foreseeable technology and would essentially require a Star Trek-level sentient computer.
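To illustrate the limits of that kind of checking, here is a minimal sketch, assuming a completely made-up load-balancer config format: it catches trivial mistakes (missing fields, impossible values) but says nothing about how the configuration behaves under production load.

```
# Pre-rollout sanity check for a hypothetical load-balancer config.
def validate_lb_config(config):
    errors = []
    for field in ("backends", "health_check_interval_s", "max_connections"):
        if field not in config:
            errors.append("missing required field: " + field)
    if not config.get("backends"):
        errors.append("backends list must not be empty")
    if config.get("health_check_interval_s", 1) <= 0:
        errors.append("health_check_interval_s must be positive")
    if config.get("max_connections", 1) <= 0:
        errors.append("max_connections must be positive")
    return errors

bad_config = {"backends": [], "health_check_interval_s": -5}
print(validate_lb_config(bad_config))
# ['missing required field: max_connections',
#  'backends list must not be empty',
#  'health_check_interval_s must be positive']
```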
More reliability means slower development speed. If you are in the same ballpark as your competition, it's better to invest in development speed than in being 10x more reliable.
If your development and deploy cadence is slower, you end up batching up more changes in any given deployment. Larger changes => higher likelihood of something in them being wrong => harder to debug due to delta size => wider effective blast radius.
Fast testing and robust build validation are some of the more important guard rails that allow you to move fast and be more reliable at the same time.
In the early days of GCP, major outage reports were written and signed by SRE VP Ben Treynor:
> for the inconvenience this service outage _may have_ caused
Not a fan of this language guys/gals. You've done a doo-doo, and you know exactly what percentage (if not how many exactly) of the requests were 404s and for which customers. Why the weasel language? Own it.
Someone in a tech role wrote something like "because of the limitations of XYZ system we can't get a crisp measurement of the number of 404 errors customers experienced", failed to add a ballpark estimate because they thought everyone was on the same page about severity, and someone polishing the language saw and interpreted as "I mean, who can really say whether there were 404s?"
And the latter one would have been originally written as something more normal, then someone else read it and objected, "Most customers were outside of the blast impact!" (or somesuch) so then because the purpose of the post was informational to all customers, instead of scoping the apology to the customers who were impacted they came up with that language.
Committee communications are a painful mess, and the more important everyone thinks an issue is the more likely they are to mangle it.
For example, say I enter my credit card info on Etsy just prior to the issue, and just as I hit send, does the payload now get sent to Google?
At that scale there has to be many examples of similar issues, no?
For example, if a service has an outage affecting 1% of users in some corner case, it perhaps makes sense to do an urgent rolling restart of the service, perhaps taking 15 minutes. (On top of diagnosis and response times)
Whereas if there is a 100% outage, it makes sense to do an "insta-nuke-and-restart-everything", taking perhaps 15 seconds.
Obviously the latter is a really large load on all surrounding infrastructure, so needs to be tested properly. But doing so can reduce a 25 minute outage down to just 10.5 minutes (10 minutes to identify the likely part of the service responsible, 30 seconds to do a nuke-everything rollback)
(Edit for the downvoters: I was genuinely curious how these kinds of things work at Google’s scale. Asking stupid questions is sometimes necessary for learning.)
If you mean "all of Google" then a cold restart would probably be very hard. At Facebook a cold restart/network cutoff of a datacenter region (a test we did periodically) took considerable planning. There is a lot to coordinate — many components and teams involved, lots of capacity planning, and so on. Over time this process got faster but it is still far from just pulling out the power cord and plugging it in again.
If you mean a single backend component then cold starting it may or may not be easy. Easy if it's a stateless service that's not in the critical path. But it seems this GCP outage was in the load balancing layer and likely harder to handle. A parent comment suggested this could be restarted in 15s, which is probably far from the truth. If it takes 5s to get an individual node restarted and serving traffic you'd need to take down a third of capacity at a time, almost certainly overloading the rest.
In some cases the component may also have state that needs to be kept or refilled. Again, at FB, cold starting the cache systems was a fairly tricky process. Just turning them off and on again would leave cold caches and overload all the systems behind them.
Lastly, needing to be able to quickly cold restart something is probably a design smell. In the case of this GCP outage rather than building infra that can handle all the load balancers restarting in 15s it would probably be easier and safer to add the capability of keeping the last known good configuration in memory and exposing a mechanism to roll back to it quickly. This wouldn't avoid needing to restart for code bugs in the service but it would provide some safety from configuration-related issues.
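Roughly the pattern I mean (not how GCP's load balancers actually work; just a sketch of the idea): keep the last configuration that passed health checks in memory, so reverting to it is a cheap in-process operation rather than a restart.

```
import threading

class ConfigHolder:
    """Keeps the active config plus the last one known to be good."""

    def __init__(self, initial_config):
        self._lock = threading.Lock()
        self._active = initial_config
        self._last_known_good = initial_config

    def apply(self, new_config, health_check):
        """Activate new_config; promote it only if the health check passes,
        otherwise roll back to the last known good config immediately."""
        with self._lock:
            self._active = new_config
        if health_check(new_config):
            with self._lock:
                self._last_known_good = new_config
            return True
        self.rollback()
        return False

    def rollback(self):
        """Revert in place -- no process restart, no cache warm-up."""
        with self._lock:
            self._active = self._last_known_good

    @property
    def active(self):
        with self._lock:
            return self._active

holder = ConfigHolder({"routes": ["/", "/api"]})
holder.apply({"routes": []}, health_check=lambda cfg: bool(cfg["routes"]))
print(holder.active)  # {'routes': ['/', '/api']} -- rolled back to known good
```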
For everyone not at a scale to afford their own transoceanic fiber cables, a major internet service disruption is equivalent to a cold start. And as long as hackers or governments are able to push utter bullshit to the global BGP tables with a single mouse click, this threat remains present.
Take some time to consider what a restart means across many data centers, on machines which have no memory of the world before the start of their present job... 15 minutes could very much be the "fast" option. A 15s restart, or anything close to it, across the entirety of it sounds quite unlikely.