The Terraform provider hard-codes delay_hours to zero. We can't know if this was TF-initiated (I agree with the general sentiment here re: the usefulness of forensic analysis). But if it was, the TF provider bypasses an important safety mechanism.
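For what it's worth, newer versions of the google provider appear to expose the grace period as a resource argument (the underlying API's delete call takes a delayHours parameter). A hedged sketch of what an explicit opt-in looks like — argument and resource names are from my reading of the provider docs, so treat them as assumptions:

```hcl
# Sketch only: field names assumed from provider docs, not verified.
resource "google_vmwareengine_private_cloud" "example" {
  name     = "example-private-cloud" # hypothetical
  location = "us-central1-a"

  # The VMware Engine API supports a deletion grace period (delayHours).
  # A provider that pins this to 0 makes every destroy immediate;
  # surfacing it restores the safety window.
  deletion_delay_hours = 3

  network_config {
    management_cidr = "192.168.0.0/24"
  }
}
```

The point stands either way: if the provider silently sends 0, the operator never gets the grace period the API was designed to provide.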
You're welcome! Your article taught me exactly what to go looking for.
While we're at it, it also looks like the provider couldn't provision stretched clusters at all until mid-April. I'm not sure what that means for the theory presented in the article. Maybe UniSuper was new to TF (or even actively onboarding) and paid the beginner's tax? TF is great at turning beginner mistakes into "you deleted your infra." It's an uncomfortable amount of speculation, but it's plausible.
I don't see any value in Google being more transparent here if it was either 1) legitimately a unique bug, or 2) customer error with a Terraform script.
Fwiw -- and I know there are plenty of googlers here -- the OMG (the internal incident tracker) isn't locked down, and it was legitimately a unique bug. I appreciate the forensic analysis of public statements that the author of this post strung together, but it doesn't really advance the conversation.
I used to work on GCE, and as the article mentions, there are a lot of safeguards built in over there to prevent you from accidentally deleting things; account wipeout has a grace period so that a lapse in billing doesn't delete anything.
How many other classes of such bugs exist? Without a public COE, one can't know that it was systemically fixed. One can't know whether security boundaries were bypassed by the bug. One can't know what sorts of extra protections were introduced for this type of systemic error. So what confidence would you have in the product with no public explanation? Are you saying, "oh, they're Google, I'm sure they're doing the right thing"?
There is value, even if it was legitimately a unique bug, in demonstrating to customers that it truly was unique. It's not unreasonable to be skeptical of such a claim.
> I don't see any value in Google being more transparent here if it was either 1) legitimately a unique bug, or 2) customer error with a Terraform script.
It could be valuable in the sense that scenario 2 has me shrugging my shoulders, but scenario 1 goes on the ever growing list of "reasons to avoid Google _anything_".
A little context for anyone not familiar with Australia's financial sector: a superannuation fund (or 'super', colloquially) is the equivalent of a private pension in the UK or a 501(k) in the US. Employers are legally required to make payments into an employee's nominated super fund; employers may choose a super fund on behalf of the employee (to make administration easier) but better employers allow employees to freely choose their fund.
Superannuation funds are often industry-aligned, and some are managed by trade unions. UniSuper is the industry super fund for the university sector.
EDIT: As others below have pointed out, yes, I meant a 401(k).
I'm in agreement with the author: without more detail or pushback from the Google Cloud leaders, this is a really bad look for future customers.
Will this well-publicized event materially alter cloud spend (e.g., cross-cloud replication or backups)?
This seems like an amazing time for AWS and Azure to come out with statements about how they prevent accidents like this and why single staff members aren't capable of nuking a large company's cloud account.
Then issue a technical statement saying that such-and-such API was called with such-and-such parameter, and it did the following bad thing, and here’s why, and here’s what we’re doing to make it harder to do it by accident in the future.
In Google’s case, Google already has a reputation for poor-to-miserable support and for arbitrarily removing its users’ access to their data. (Heck, another incident of this sort was on HN today.) GCP gets considerable revenue from very large users, and those users and their decision makers will do just fine moving from GCP (which is not #1!) to AWS. [0]
Imagine an aircraft manufacturer or regulator in a similar situation. A plane crashed, and it behaved as documented or intended given pilot inputs. But it still crashed, and the factors causing it should be identified and appropriate improvements, if any, should be implemented.
[0] Some of them will grumble about how AWS’s user experience is dramatically worse than GCP’s in a lot of very obvious ways, but the overall comparison tilts strongly toward AWS here. Sure, AWS makes it miserable to configure Organizations and Accounts in line with best practices, but Google might arbitrarily delete a project/account/whatever! Google should get out ahead of this.
Absolutely. Not only should they get ahead of it, they should be providing detailed technical explanations of what went wrong and how they are ensuring that it won’t happen again. The lack of detail here just screams unseriousness.
> putting out a competing statement blaming or contradicting your customer is a bad look with that customer and with all future customers.
I don't buy it. Google could explain what the customer did to shoot itself in the foot and how GCE will modify the UX to hide the footgun. Instead, it just says that the customer had a unique configuration. That sounds like a GCE bug.
As I've said many times before, Google doesn't use GCE for critical applications, so you shouldn't either. Amazon and Microsoft use AWS and Azure respectively, but you should carefully consider which cloud services they offer are also used internally.
I think that now we know what actually happened, it's essentially false to say "My guesses were close, but not quite right." The summary of the article is that google did everything right, the dumb customer did something to themselves, and google was just too nice to publicly say that. "It was something to do with automation or terraform or a script" is almost irrelevant to what was actually conveyed in this article.
Even if it was operator error, some sort of public COE would help others avoid the pitfall by design, e.g. restricting terraform's permissions so that it can only affect resources for the system and availability zone (or better still, cell) under deployment: if you're running a deployment to system X, you shouldn't be able to destroy your backup buckets. Essentially, minimize the blast radius of any configuration operation. You'd also want to one-box the terraform change after testing it in preprod, ideally through a pipeline with monitoring. "The power to modify is the power to destroy." Finally, I wonder if there is some way to tell terraform: don't delete more than X resources, start very slowly, and only delete leaf resources, never the top-level resource.
At the end of the day terraform can have a bug. You really want to control blast radius with permissions. Makes me wonder if the GCP VMWare integration is a boundary that doesn't expose granular permissions.
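Within Terraform itself, the `lifecycle` meta-argument at least lets you mark resources that must never be destroyed by a plan — a minimal sketch, with an illustrative bucket name:

```hcl
# Illustrative: flag backup storage so that any plan which would
# destroy it makes terraform error out instead of executing.
resource "google_storage_bucket" "backups" {
  name     = "example-backup-bucket" # hypothetical name
  location = "AU"

  lifecycle {
    prevent_destroy = true
  }
}
```

Note this only guards the resources you remember to flag, and it can't help if the deletion happens below Terraform (a provider or platform bug), which is why IAM-level blast-radius limits still matter.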
If it was operator error with terraform, that should set off alarm bells through the industry. Who else is one fat finger away from total annihilation?
I used to ~own all internal terraform usage at a very large software company. I could talk for hours about all the ways in which the "wall of text" is ineffective at scale. A surprisingly large amount of our technical investment in TF was around improving plan review.
I usually only look deeper if the summary (X created, Y updated, Z deleted) shows some unexpected numbers, especially if the number deleted is not 0. However, if nothing's being deleted, I usually assume it's safe enough.
Is the VMWare integration tied to the GCP marketplace? Maybe Terraform modifying a fundamental VMWare resource (like renaming a cluster, etc) where the entire account is tied to a reseller or integration account could cause it to think it needs to delete and re-create.
Reading about Terraform in that doc made me recall the mistakes I have made with Terraform too. Terraform's declarative model makes it hard to recover from errors, especially destroy operations. I once had a four-day outage of a production cluster while trying to upgrade certificate manager on a Kubernetes cluster managed with Terraform.
Apparently the error is in the managed VMware integration. They have my sympathies: it can be really hard to get products from external parties to work just as well as the “native” product, i.e. GCE.
Even if they do fix this particular issue, I see this as a warning to be very very reluctant to adopt these integrations. A product level integration (eg internally hosted github) seems ok. But infra level integrations can have ugly failure modes.
Tl;dr: the author spent some time investigating, found little except that the company used Google Cloud VMware Engine, and doubts that there was a bug on Google's side.
Sorry but what else can someone outside these orgs do in this case? There’s really no way to know unless the companies involved give us more info, and the author seems to have done a great job in presenting a possible scenario.
>The Core unit is responsible for building the technical foundation behind the company's flagship products and for protecting users' online safety, according to Google's website. Core teams include key technical units from information technology, its Python developer team, technical infrastructure, security foundation, app platforms, core developers, and various engineering roles.
Basically, Sundar Pichai is taking the McDonnell Douglas approach to engineering and just deciding to coast on Google’s previous engineering.
The danger is that while you may get short-term stock returns, you destroy the engineering culture, and it is only a matter of time before the doors blow off mid-flight, like a 737.
https://github.com/hashicorp/terraform-provider-google/blob/...