I wish they had elaborated more on what type of bug it was that wasn't caught by testing or the initial rollout. Either the tests must be poorly written or the bug must be very subtle.
On an unrelated note: kudos to Google for publishing this postmortem; I hope this becomes an industry-wide practice. I also wish they would publish a (belated) one about Google+ and their throng of messaging apps over the years.
A useful PM includes a summary, an impact analysis, a root cause analysis, and a comprehensive (and realistic) set of measures that will prevent recurrence.
After reading what they provided, I have a reasonable understanding of what went wrong (sufficient for me to plan my own safeguards if necessary), a useful measure of the team's response and remediation capabilities, and enough information to judge my comfort with their preventative measures.
Anything else is just cleverly-disguised marketing.
All postmortems are ads, but not all ads are effective.
It's easier to disclose details of a physical infrastructure outage or a bug in someone else's code/product than a bug in your own code.
« These features had been introduced into the second layer GFE code base but not yet put into service. One of the features contained a bug which would cause the GFE to restart; this bug had not been detected in either of testing and initial rollout. »
...but yeah, they definitely could have done better.
In this case, the "configuration change" could be a feature flag in the L2 GFEs, something in the L1 GFEs that changed L1->L2 requests in minor ways, or maybe something else entirely (since it's partially security-related: a dynamic LOAS handshake change to use different cyphers? So many possibilities). At the end of the day, though, it's still a specific permutation of all possible features and flags that hadn't been vetted before. Given how large Google's monorepo is, it's not impossible for one of your many dependencies, even indirect ones, which might be configured by another team entirely, to have subtle time bombs that only trigger well after you have built and deployed the code.
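As a purely illustrative sketch (flag names and behaviour invented, not anything from the actual GFE), dormant code behind flags can ship in a binary and only misbehave when a later configuration push enables a particular combination:

    // Illustrative only: a dormant code path shipped in the binary that is
    // harmless until a configuration push enables a specific combination of
    // flags. Flag names are made up.
    package main

    import (
        "flag"
        "log"
    )

    var (
        newHandshake = flag.Bool("experimental_handshake", false, "use the new handshake path")
        strictCheck  = flag.Bool("strict_cipher_check", false, "reject legacy ciphers")
    )

    func handleRequest() {
        if *newHandshake && *strictCheck {
            // The buggy path: this exact flag combination was never vetted
            // together in testing, so the crash only shows up after rollout.
            log.Fatal("unrecoverable handshake state; process restarting")
        }
        log.Println("request served")
    }

    func main() {
        flag.Parse()
        handleRequest()
    }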
Having been on the other side, I know that, for every detail added, a bunch more questions come up.
GFE (Google Front End) is what you connect to when you access any Google service through your browser. Think nginx or even ELB. It's a load-balancing reverse proxy as well as a WAF. It's mentioned here https://cloud.google.com/security/infrastructure/design/ and probably in the SRE book. This report might be the first time Google mentions in public that there can be two levels of GFEs, but I remember at least one service using such a setup many, many years ago.
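For a very loose analogy (hostnames and ports invented, standard-library Go, nothing like the real GFE internals), a two-layer reverse proxy looks roughly like this:

    // Sketch of a two-layer reverse proxy: an edge ("first layer") proxy
    // forwards to a second-layer proxy, which forwards to the backend.
    // Hostnames and ports are illustrative only.
    package main

    import (
        "log"
        "net/http"
        "net/http/httputil"
        "net/url"
    )

    func mustProxy(target string) *httputil.ReverseProxy {
        u, err := url.Parse(target)
        if err != nil {
            log.Fatal(err)
        }
        return httputil.NewSingleHostReverseProxy(u)
    }

    func main() {
        // Second layer: forwards to the actual service
        // (here, a single backend for simplicity).
        go func() {
            log.Fatal(http.ListenAndServe(":8081", mustProxy("http://backend.internal:9000")))
        }()

        // First layer: what clients actually connect to.
        log.Fatal(http.ListenAndServe(":8080", mustProxy("http://127.0.0.1:8081")))
    }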
Also, if I gave a similar root cause in my environment, I'd be laughed off. We absolutely need to state what went wrong, what the immediate fix was, what the permanent fix is, and whether it's done (or whether it requires downtime / a restart).
They have just chosen to only post a brief summary.
That last bit would be more marketing except that it's true - I've used Google Appengine for years, and the rare outages are always unique issues.
I'd love to see comparative numbers, but my impression is that this focus on improvement has led to Google's uptime being a lot better than AWS's or Azure's.
I think most already do this. I have seen AWS also publish detailed postmortems for outages like this. Ex: https://aws.amazon.com/message/41926/
Amazon's and Microsoft's postmortems are much more to the point; one can actually learn from them and avoid making the same or similar mistakes.
I actually think it looks pretty bad that one of the action items in this report is to make a feature dashboard for GFEs. They've been saying they will do that for years, and the team that operates GFEs is considered the most elite of all SRE teams. Most famous outages of Google products have been caused by bogus configurations pushed to GFEs or network devices in front of them.
Compared to a single SSD, the performance improvement doesn't really show (for desktop loads)...
However, when things come to a rare (but somewhat inevitable) screeching halt and one of the mirror's copies shatters beyond recognition... that's when the doubled price proves it was worth it.
Such a dashboard would invariably also add load and complexity (both failure points) to the system, but outwardly most users would be unaware of its existence.
With highly redundant systems such as this, you generally need multiple layers of things going wrong all at once to notice an issue. This was the case here as well.
> Either tests must be poorly written or the bug must be very subtle.
For example, you must have different database credentials between test and production, and you must limit who can read the production credentials. If the production credentials are malformed, a service that worked in test will fail in production.
And the same applies to your SSL certificates, your settings for enabled/disabled features, your flashy markers that stop people mistaking production for test....
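A contrived sketch of that failure mode (the env var name and checks are invented): when credentials come from the environment, tests only ever exercise the test secret, so a malformed production value fails only at production startup:

    // Contrived example: the DSN comes from the environment, so tests that
    // run with a well-formed test value prove nothing about the production
    // secret, which may be malformed or missing.
    package main

    import (
        "log"
        "net/url"
        "os"
    )

    func main() {
        dsn := os.Getenv("DB_DSN") // injected from a secret store in production
        u, err := url.Parse(dsn)
        if err != nil || u.Scheme == "" || u.Host == "" {
            // Only reachable with the production value; tests never see it.
            log.Fatalf("malformed DB_DSN %q: %v", dsn, err)
        }
        log.Println("credentials look syntactically valid; connecting...")
    }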
It makes me so happy that the big smart corps have the same problems that us plebs have.
I tell people that testing can never precisely duplicate the production environment but do they believe me?
Also, this is an argument for feature switches vs staging environments.
 But not in a schadenfreude sense.
Meanwhile... waiting for the Amazon Prime Day postmortem.
It depends on the Amazon retail site in this case. Netflix has published postmortems in the past, but then again Netflix is very good at blogging (and also open-sourcing) their engineering efforts.
Have you not used Amazon? It's where half the country does their shopping. Many businesses live and die on Amazon just as much as they do on AWS.
I know businesses that depend on Amazon.com as a primary sales channel were affected, but they don't pay Amazon to sell their product (maybe Pro merchants are the exception). I think they owe us Prime members an outage report even more on that basis.
In any case, I think it would be a good idea for them to write an incident report, but don't think it's comparable.
edit: I give. Like I said, I think it's different from AWS/GCP outages, but I think your point is great and it would be a good idea for them to publish a report on it. I look forward to seeing it someday.
They certainly do.
As with AWS, there are various services offered, each with their own pricing structure.
* Payment mechanism (Amazon Pay)
* E-commerce interface (Sell on Amazon)
* Services marketplace (Selling Services on Amazon)
* Advertising (Advertise on Amazon)
* Warehouse/logistics provider (Fulfillment by Amazon)
Amazon, eBay, or your own ecommerce website problems can cost exactly as much business as AWS, GCP, or your own data center problems.
If the latter, I wonder whether such customers would like to see Amazon introduce some sort of lower-level "Amazon purchasing API" that would continue to function even when the website doesn't, and which doesn't include any of the features that could topple the site (mostly, no paginated browse/search result API—you would have to already know the ID of the product you're buying.)
The outage was way over-hyped.
Why are you comparing a cloud provider with a retail site?
There was an issue in AWS Frankfurt yesterday; I'm waiting for the post-mortem on that.
As users of their service, our engineers were notified within <30 secs of the issue starting. Given that GCP had a large population impacted, how is it that it took them so much longer to acknowledge it?
> The GFE development team was in the process of adding features to GFE to improve security and performance. These features had been introduced into the second layer GFE code base but not yet put into service. One of the features contained a bug which would cause the GFE to restart; this bug had not been detected in either of testing and initial rollout.
Something going down after a deployment is the most common source of issues. Monitoring for KPI abnormalities after a rollout, and performing an almost instant auto-rollback when they appear, should be common practice. Also, doesn't GCP perform dark launches and partial launches? Launch to 1%, watch the KPIs, increase to 5%, and so on?
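A toy sketch of that staged-rollout idea (the stages, threshold, and monitoring stub are all invented, not anything GCP documents): widen the rollout only while a KPI stays healthy, otherwise roll back immediately:

    // Toy staged rollout with automatic rollback: push to a small percentage,
    // watch a KPI (here, error rate), and only widen while it stays healthy.
    // Stages and thresholds are invented for illustration.
    package main

    import (
        "fmt"
        "log"
    )

    const maxErrorRate = 0.01 // 1% errors triggers an immediate rollback

    // errorRateAt stands in for querying monitoring for traffic served by the
    // new version at the given rollout percentage.
    func errorRateAt(percent int) float64 {
        return 0.002 // stubbed out for the sketch
    }

    func main() {
        for _, p := range []int{1, 5, 25, 50, 100} {
            fmt.Printf("rolling out to %d%% of traffic\n", p)
            if rate := errorRateAt(p); rate > maxErrorRate {
                log.Fatalf("error rate %.3f at %d%%: rolling back", rate, p)
            }
        }
        fmt.Println("rollout complete")
    }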
In terms of quality of life for people on call, it's an entirely different world. And in a setup where your oncall engineers are extremely highly skilled and have all the choice in the world in terms of where to work, that little bit of respect of their time is a necessary investment.
While the postmortem is appreciated, I'd rather they just didn't roll out changes globally.
Edit: Reading further down, they actually admit that it was just a rollback.
> At 12:44 PDT, the team discovered the root cause, the configuration change was promptly reverted, and the affected GFEs ceased their restarts
Edit: Wow, downvotes because I like transparency from my cloud hoster, super interesting...
Not saying Amazon is perfect by any means either, but there's a lot of room for improvement. Good postmortems give everyone ideas on how to solidify their own processes and prevent other issues. This was just fluff.
When I was first introduced to the concept of incident reports, it was under the name "postmortem"; I worked for a mainly English-speaking company then and didn't think twice about it. But earlier this week, when I mentioned it to a colleague, he found it a rather macabre term for something like an incident report. When you think about it, nothing really died (maybe some engineers died a little inside because their design was not as 100% reliable as they thought). But for the rest it was just a temporary state, nothing permanent like death. All other uses of this word (e.g. medical) seem to relate strictly to death.
Maybe it's because "incident report" just sounds too formal, or is there an etymology for this term in the IT world?
It might make a little more sense in the world of shipping software in retail boxes, where products/projects had a 'done' date. The project is dead; what contributed to its demise? Or you might generalize death into failure, and that's why we use the term instead of "post-incident".
It's really fun to read.
edit: Wow, there hasn't been one since 2014. I wonder why this died out. There are 10 pages of them going back to 2007.
I've seen the act of analyzing project failures by this name in software engineering/management ever since.
Maybe it was a bit more relevant in the days of large waterfall software project management, where failure often meant the end of a project with no product launched. Sometimes after a "death march": https://en.wikipedia.org/wiki/Death_march_(project_managemen...
But it does seem natural to me that it has been carried over to current days, and applied to analyzing failures in the context most relevant to modern software development.
I never stopped to think about the weirdness of the term's application; it feels so natural to me. I'd suggest we all start calling a fixed service "resurrected". As in: "Google Cloud Global Load Balancers were resurrected at 13:19." (-:
If we want to be technical, a post-mortem in the tech world is commonly used to outline failures that occurred during a normal event (i.e. a software release) not a random production issue.
I've met many people who hate that abdication of responsibility, and would prefer to be heroically hacking solutions at 3am when their database replication fails. Maybe they feel they have to be punished for failing their customers.
My advice is keep testing GC, because in my experience it is very reliable. And once you realise that, the peace of mind is awesome.
I then made a comment about signing in with "Bobby Drop Tables" as a user name. The silence in the room quickly reminded me I was not in the company of other developers. Such a waste.
I remember a configuration change rolled out by an automated system causing a problem on GCP a few years ago; it's an interesting area that's probably quite hard to fix.
As for whether it's useful for serious business. Well. Proof by example? This has sixteen pages of case studies: https://cloud.google.com/customers/. That is by no means all of GCP's serious customers.
I think I had over 30 touches with their support and key account managers regarding everything from billing, minor issues with services and just asking for advice regarding stuff. They have always delivered.
The expectation of us has been that we pay the $150/month support package fee.