Could other major cloud providers boast this? It seems like Google's brand is benefiting tremendously from Project Zero in all of this. On the other hand, it feels, worryingly, like a clear step towards a world where running your own mainstream hardware is too hard for the little guys.
Is that not how it's meant to work? They're paying to run a top notch security team which is actively discovering vulnerabilities and creating fixes for them. Should they not benefit from this?
One example of that benefit might be patching your own hardware first, while everyone else is still waiting for the same fixes to be offered to them.
My wife is an amazing and talented baker. But... any number of bakeries are better at rolling out a quality consistent product at scale.
Ditto with datacenters.
But if you are running your own hardware, you probably aren't sharing CPU time with strangers as GCP is.
It all comes down to your threat model though. Some people are rightly worried about insider risk. If a malicious employee can go run a binary on your shared computing infrastructure to get root credentials out of a machine, that's actually a real problem. Then again, there are lots of ways for a rogue employee to do bad things, so this is "just" another one. But don't take this as "only applicable to shared cloud environments", because it's not.
On a personal computer, though, if malware gets to execute in user mode, access to some machine private key is the least of our worries. The sensitive information is the documents stored on the machine; whether the aim is to leak them or to encrypt them, if the code can already read and write them, the party is over.
The VM is no longer considered a sufficient security boundary; tenancy controls are now in place to guard against as-yet-undiscovered attacks similar to Meltdown and Spectre.
Actually, we have our test infrastructure separated from the rest, so basically only that part runs untrusted code.
The amount of platform complexity abstracted by the GCP services is staggering. This is the job of taking a (hopefully) understandable piece of computer code that represents some real-world logic and installing it into reality, in a sense. It's clear that it's a very messy world out there for a program.
To be like insurance companies, this is how their cloud business would have to work: you pay them $20,000 over ten years while receiving absolutely nothing tangible in return, until one day, ten years in, you need to use $25,000 worth of cloud services in a short burst. Then afterward your bill goes up significantly, while the industry as a whole reduces what you get for that money.
Besides that, there are exceptionally few things at which the little guy stands a chance when competing head-on with a giant. The giant benefits from bargaining power at scale that is usually rather dramatic on almost everything.
E.g., Diapers.com was a well-funded, well-run e-commerce start-up. Amazon was going to destroy their business through scale leverage, applying plausibly anti-competitive below-cost pricing on core products like diapers, to put them under if they didn't agree to be acquired (Amazon had already begun applying that tactic in the lead-up to the acquisition, scaring the hell out of Diapers.com).
Not sure. AWS sent out alerts in early December that many instances needed a reboot and would be force-rebooted on 1/8, IIRC. I'm guessing it was to handle patches that were put in place for these issues.
We were naturally first because we took action internally, but also because we didn't need to reboot guests to ship new kernels to hosts. We update kernels on our hosts frequently, and so nobody would have particularly noticed that their VMs underwent migrations (we do it all the time!).
I do not believe that HVM instances (which would be closer to your KVM infrastructure) needed to be rebooted.
Sometimes I do not think they get the credit they deserve.
If it were a little guy who discovered it, then everyone, except maybe a fraction of a percent of the world (or no one at all, if the discoverer runs no servers), gets the same late start.
Project Zero disclosed this to Intel, AMD and ARM on June 1, as stated in the blog post. I believe an independent researcher working somewhere else (or even at home) would have done the same, as indeed the parallel independent discoverers ultimately did.
Jann disclosed the issue to Intel in June. The subsequent activity around the KAISER patch was the reason we started investigating this issue.
Still, as they say in this article, without a tinfoil hat: "There’s no reason someone couldn’t have found this years ago instead of today."
I think it's also less statistically unlikely than you'd expect. Thousands of security issues are discovered every year, so it makes sense that some small number will be discovered in parallel. It looks like a big coincidence because you are focusing on the one parallel discovery, but if you consider it against the broad landscape of all discoveries, it's statistically expected that there would be at least one parallel discovery.
No tinfoil hat: because the body of research to date connected enough dots, a threshold was crossed and lightbulbs went off all over.
Funding and time/people for the kernel come from the big players (IBM, Intel, et al.), and distribution decisions have created a monoculture, making the big Linux distros feel like part of the corporate/national tech culture and less community-centered.
DragonFly recently landed a patch in master. OpenBSD just had a rant on their mailing list.
Somebody correct me if I'm wrong, but it looks like zero.
It's my understanding that Google disclosed this vulnerability to ARM, Intel, and AMD, and no one else, and left it to the hardware manufacturers to disclose to the various software vendors. So they gave 6 months to the people they disclosed the vulnerability to; those people did or did not pass it on to others.
As you say, the vulnerability was disclosed to the others on the 1st of June, which puts us right at the 6-month disclosure deadline. If, 6 months in, vendors aren't ready with patches, I doubt those few days would have made a significant difference.
The fact that hundreds maybe thousands of people knew and worked on this ahead of the press reports/rumors and subsequent information release speaks to how seriously everyone involved took this.
Of course, it was probably unwise to discuss the code changes implementing mitigations in public anyways, but I don't have first-hand knowledge of how these things work in the Linux world.
But you are right, it is very impressive that there were almost no public leaks (and the one leak was more of a slip-up, which nobody took too seriously until the patches were made public). It would not have surprised me if someone had posted them and they had gotten picked up; it would have made things a lot worse.
Edit: ... which is why I now received downvotes...
Based on my area of work and job title, I was notified of the software that was vulnerable and told to contact them with further info if we used it. I wasn't allowed to talk to others without explicit consent, so I talked to my central contact to find the right point of contact in my neighboring teams, so I'd know who I was allowed to pull into a room for any joint strategies.
To everyone else, I simply said "sorry. On call is busy this week. I'm working on mitigating an undisclosed vulnerability." No questions asked. To this day I'm still not 100% sure the vulnerability is public so none of my peers know anything more than the time window in which I was distracted.
Someone on SO trying to explain it another way: https://stackoverflow.com/a/48099456
Some interesting discussion about how this patch isn't a 100% fix for Skylake processors (at least that's my understanding): https://firstname.lastname@example.org/ms...
Thanks for the Stack Overflow link!
Except Apple: Apple's papers are authored by the "Apple ___ team".
>Requires microcode update on current CPUs. Perf hit vs. retpolines on older CPUs
The Google Cloud blog post does not mention that retpolines are insufficient on newer CPUs. That omission is dangerous because readers will believe they are protected by recompiling with retpolines when in fact they aren't.
Perhaps the blog post can be updated to clarify the limits of retpolines so people don't get the wrong impression and end up with vulnerable systems.
On the other hand, "opinion" isn't enough. Everything depends on the microarchitecture and only Intel can give assurance on whether retpoline actually works in Skylake or not. I hope they will release information about it.
I am confused: doesn't this mean that retpoline needs to sit in the compiler and won't protect already-compiled binaries?
We deploy new code binaries all the time so it's not a big deal.
Something compiled with retpoline is resilient against its own execution being hijacked. Once KVM is recompiled with retpoline, guests cannot attack it via Spectre variant 2, and as such cannot attack other guests.
The hypervisor being compiled with retpoline, however, does nothing to protect against intra-guest attacks: if you have untrusted code running in your VM, and you don't have IBRS and the other microcode features on, or your sensitive apps compiled with retpoline, you are still vulnerable internally, just not to other guests.
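To make "recompiled with retpoline" a bit more concrete, here is a minimal sketch. The flag names are my understanding of what recent gcc (7.3+) and clang expose for this; treat them as assumptions and check your own toolchain's documentation:

    /* indirect.c - a tiny indirect call site that a retpoline-aware compiler
     * will rewrite into a call/ret thunk instead of a raw "call *%reg".
     *
     * Assumed invocations (verify against your toolchain):
     *   gcc   -O2 -mindirect-branch=thunk -mindirect-branch-register indirect.c
     *   clang -O2 -mretpoline indirect.c
     *
     * Compile with and without the flags and diff the -S output: with them,
     * the indirect call goes through a __x86_indirect_thunk_* helper.
     */
    #include <stdio.h>

    static int add_one(int x) { return x + 1; }

    int main(void) {
        int (*volatile op)(int) = add_one; /* volatile keeps the call indirect at -O2 */
        printf("%d\n", op(41));
        return 0;
    }

The same idea applies to anything sensitive inside your guest: only binaries (and kernels) actually rebuilt with such a compiler get the variant 2 protection; everything else keeps its original indirect branches.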
I am not super familiar with Google's offering, but I suspect they don't just offer VMs. Anything that runs on shared infrastructure (serverless designs, website hosting) runs on top of a Linux box, I presume. Google would need to get those Linux binaries recompiled too, not just the hypervisor.
We've already patched the host (for a long while now). You can also recompile your binaries in your guest, and/or get an updated kernel. Retpoline performs much better than the brute-force hammers you've seen reported.
VMs and security sandboxing techniques are used to run external software in Google App Engine (GAE) and Google Compute Engine (GCE). We run each hosted VM in a KVM process that runs as a Borg task.
There might be a few machines here and there running actual VMs on bare metal, but those are going to be very special cases. Using Infrastore, it's easy to find which teams are running which containers whose binaries were built before the new compiler defaults.
If the question is 'How do they guarantee their customers aren't using bad binaries', the answer is that they don't, and it has nothing to do with VMs vs. VMs in containers, etc.
Running a VM in a Borg task (container) does not add extra security in this case, but it does help with auditing and catching the snowflakes and stragglers that invariably end up taking a disproportionate amount of time with changes like this. That kind of auditing can be easily done organization-wide, instead of having every team reinvent the wheel their own way.
KVM virtual machines making use of Extended Page Tables or Nested Page Tables do not share an address space with their host or other guests; they are, however, still susceptible to Spectre variant 2, for example (this was the Project Zero PoC for variant 2).
It follows from what a cache is. You have a large set of numbers (memory addresses) mapping into a smaller set of numbers (cache lines); by necessity there will be collisions.
The mapping is deterministic and you can recover information by looking at these clashes.
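For anyone who wants to see the timing half of that story, here is a minimal flush+reload-style sketch. It is purely illustrative (not anyone's actual PoC) and assumes an x86 CPU with clflush/rdtscp and a gcc/clang-compatible compiler:

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>

    static uint8_t probe[4096];

    /* Time a single load of *p in cycles; rdtscp keeps the reads ordered
     * well enough for a coarse illustration. */
    static uint64_t time_access(volatile uint8_t *p) {
        unsigned aux;
        uint64_t start = __rdtscp(&aux);
        (void)*p;                       /* the load we are timing */
        uint64_t end = __rdtscp(&aux);
        return end - start;
    }

    int main(void) {
        _mm_clflush(probe);                  /* evict the line from the cache */
        uint64_t miss = time_access(probe);  /* first access: memory latency  */
        uint64_t hit  = time_access(probe);  /* second access: cache hit      */
        printf("miss ~%llu cycles, hit ~%llu cycles\n",
               (unsigned long long)miss, (unsigned long long)hit);
        return 0;
    }

The whole side channel is that gap: if an attacker can arrange to share cache sets with a victim, which lines come back "fast" afterwards tells them something about what the victim touched.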
Some high-end Intel CPUs have a feature called CAT (Cache Allocation Technology), which was designed to keep different workloads from interfering too much by stomping on each other's cache lines, but I think it's meant mostly as a performance feature, not a security feature (although there is some research on using it to defend against cache-based side channels, see http://palms.ee.princeton.edu/system/files/CATalyst_vfinal_c... and https://arxiv.org/pdf/1708.09538.pdf).
Another possible approach is to make it harder to predict the mapping between a physical address (or an address delta) and a cache line.
I don't know whether using a cryptographic-grade mapping for a hardware cache would be even remotely feasible, nor whether it would actually solve the information-leak problem.
(If it brings you any amusement: imagine speculative execution as an overly energetic 7-year-old that we must now build a warehouse of trampolines around.)
The relevant quote from the article:
> For months, hundreds of engineers across Google and other companies worked continuously to understand these new vulnerabilities and find mitigations for them.
I fear this is the medium.com effect of content on the web now. Simply having content for content's sake is now seen as a missed "growth hacking" opportunity.
Performance gains are even higher now than they were then.
Not particularly (if you read Paul's post, the branch to the retpoline predicts perfectly for obvious reasons), and especially not compared to the brute force flushes as an alternative.
Edit: I phrased that backwards. The return predicts, so that the whole thing is about as bad as an unpredicted indirect call:
> This has the particularly nice property that the RSB entry and on-stack target installed by (1) is both valid and used. This allows the return to be correctly predicted so that our simulated indirect jump is the only introduced overhead.
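If it helps to see the sequence being described, here is a hand-written, stand-alone version of a retpoline-style thunk. This is illustrative only: the real thunks are emitted by the compiler (as __x86_indirect_thunk_* symbols) and differ in detail. It assumes x86-64 with GNU-style asm:

    #include <stdio.h>

    /* my_indirect_thunk_rax: jump to the address held in %rax without issuing
     * an indirect branch that the branch target buffer could steer. */
    asm(
        ".text\n"
        ".globl my_indirect_thunk_rax\n"
        "my_indirect_thunk_rax:\n"
        "  call 1f\n"            /* (1) pushes the address of 2: and fills an RSB entry */
        "2:\n"
        "  pause\n"              /* speculation that follows the predicted return path  */
        "  lfence\n"             /* is trapped in this harmless spin                    */
        "  jmp 2b\n"
        "1:\n"
        "  mov %rax, (%rsp)\n"   /* overwrite the pushed address with the real target   */
        "  ret\n"                /* this return is predicted via the RSB entry from (1) */
    );

    static void hello(void) { puts("reached the target through the thunk"); }

    int main(void) {
        /* What retpoline-compiled code does for "call *%rax": put the target in
         * %rax and call the thunk instead of issuing the raw indirect call. */
        register void (*target)(void) asm("rax") = hello;
        asm volatile("call my_indirect_thunk_rax"
                     : "+r"(target)
                     :
                     : "rcx", "rdx", "rsi", "rdi", "r8", "r9", "r10", "r11",
                       "memory", "cc");
        return 0;
    }

The point of the quoted passage is visible in the listing: the "call 1f" both creates the on-stack slot that gets overwritten with the real target and seeds the return stack buffer, so the final "ret" is the one transfer the hardware still predicts correctly.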
I must be misunderstanding Paul's post.
Isn't it specifically preventing any sort of prediction?
"Naturally, protecting an indirect branch means that no prediction can occur. This is intentional, as we are “isolating” the prediction above to prevent its abuse."
Of course, you can then go and manually add direct branch hints, as is noted in the post, but unless I'm misunderstanding things, there's not an obvious reason why these branches predict perfectly.
Not that it means performance is impacted in a significant way, since that same section also says
"Microbenchmarking on Intel x86 architectures shows that our converted sequences are within cycles of an native indirect branch (with branch prediction hardware explicitly disabled)."
(which also confuses this issue - how is it predicting perfectly if prediction hardware is disabled?)
Patches for both LLVM (the infrastructure behind clang) and gcc are available. You choose what you compile your kernel and applications with, and others are actively looking at retpoline and retpoline-inspired techniques for other code generators (e.g., various JIT compilers). That's why Paul and the folks made this public.
Because you are kind of claiming that the patches the Linux kernel used to fix this (with a 10% drop in performance) are no longer needed. I am wondering whether this was submitted to be mainlined into the Linux kernel? If yes, why was it not used there?
If no, then it means it is being used in some Google Cloud-specific way that cannot be mainlined.
EDIT - found this comment which seems to suggest that retpoline is not bulletproof and the kernel's performance-killing patches are still needed
Does this mean that Google Cloud is doing this only on non-Skylake CPU instances? It is a very interesting stance: it would mean that it is suddenly more cost-effective to use Google's OLDER machines than the newer Skylakes, because the newer machines suffer a performance degradation that the older machines do not.
As mentioned elsewhere, we recompile everything at Google all the time. I'm not sure which things we've rebuilt with retpoline enabled. As Paul mentions in the article, the point is for software you believe needs to be protected, which may not be everything we build.
That thread on retpoline on Skylake has a lot of confusion in it. Some folks aren't 100% certain it works (it relies on understanding internal details of Intel CPUs), and they argue that IBRS on Skylake is cheap enough that you might as well always use IBRS and not bother. That's the gist of this comment:
> personally I am comfortable with retpoline on Skylake, but I would like to have IBRS as an opt in for the paranoid.
I want to highlight that Paul and the team have had a lot more time to think about this issue than the folks just joining the discussion. Could our folks be missing something? Sure, and that's the point of public discussion and code review. We hope that over the coming weeks and months it's decided one way or another, but we believe retpoline to be correct and a good optimization (especially for older hardware).
With all due respect, I'm not sure about the answer you are giving me. I don't see why we are using the word "belief". If there is some secret sauce that Google is unwilling to talk about, it would be good to have that declared, because it is extremely weird to claim to have compiler tech that fixes this issue (which otherwise needed kernel-level fixes) yet not be able to state categorically what was compiled with retpoline to give this benefit.
Possibly it's the hypervisor you are using... which would explain the zero-downtime migration.
P.S. I'm a Google Cloud customer as well.
> By December, all Google Cloud Platform (GCP) services had protections in place for all known variants of the vulnerability.
which means all the pieces of the stack across all our GCP products (i.e., not just our host kernels and GCE's hypervisor, but also, say, App Engine sandboxes).
KPTI for kernels, plus recompiling anything sensitive (kernel, hypervisor, etc.) with a retpoline-capable compiler. We're not holding back secrets, just trying to get the patches out while keeping blog posts relatively parseable by the general public.
Note that no matter how disruptive kernel or other changes are, live migration always makes them “zero downtime migration”. The key here is that using retpoline to protect systems software against Variant 2 results in much lower overhead than other proposed mitigation strategies, particularly on older hardware but even a bit on Skylake.
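For anyone who wants to verify this from inside their own guest: kernels from 4.15 onwards report per-vulnerability status under sysfs. Here is a small sketch that just dumps those files (the paths are the kernel's documented ABI; the exact wording of each line varies by kernel version):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *files[] = {
            "/sys/devices/system/cpu/vulnerabilities/meltdown",
            "/sys/devices/system/cpu/vulnerabilities/spectre_v1",
            "/sys/devices/system/cpu/vulnerabilities/spectre_v2",
        };
        for (int i = 0; i < 3; i++) {
            char line[256];
            FILE *f = fopen(files[i], "r");
            if (f && fgets(line, sizeof line, f)) {
                line[strcspn(line, "\n")] = '\0';   /* strip trailing newline */
            } else {
                snprintf(line, sizeof line,
                         "not reported (kernel predates this interface?)");
            }
            if (f) fclose(f);
            printf("%s: %s\n", files[i], line);
        }
        return 0;
    }

On a patched guest, the spectre_v2 line is where you'd see whether the kernel is relying on retpolines, IBRS, or both.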
But actually showing search results on page 4 of Google Search or YouTube, when it said there were 22 million results for my search, seems to be too hard for them.