
Protecting Google Cloud customers without impacting performance - theDoug
https://www.blog.google/topics/google-cloud/protecting-our-google-cloud-customers-new-vulnerabilities-without-impacting-performance/
======
tzar
> _By December, all Google Cloud Platform (GCP) services had protections in
> place for all known variants of the vulnerability._

Could other major cloud providers boast this? It seems like Google's brand is
benefiting tremendously from Project Zero in all of this. On the other hand,
it feels worryingly like a clear step toward running one's own mainstream
hardware becoming too hard for the little guys.

~~~
notatoad
I feel like even without this, there's a fair argument to be made that Google
is better at running hardware than most little guys are.

~~~
andrewstuart2
And it's not _necessarily_ a bad thing, either. Economies of scale mean that
everybody can win by putting all their chips in one basket. The obvious caveats
apply when said basket stands to gain from serving its own interests once all
the chips are there.

One example of that might include patching your own hardware while waiting to
offer the same capabilities to others.

------
yRetsyM
I've been impressed with how long it took for this information to be
"leaked"/"declassified"/released, given the sheer number of people who knew.

The fact that hundreds, maybe thousands, of people knew and worked on this ahead
of the press reports/rumors and the subsequent information release speaks to how
seriously everyone involved took this.

~~~
masterleep
We can safely assume that black hats have moles on these lists.

~~~
throwaway2048
Yeah, the idea that you can keep something like this away from malicious
parties across _dozens of huge companies_ is a ridiculous farce.

~~~
inlined
I've worked to mitigate other vulnerabilities within Google. At least with the
vulnerability I helped on, I was surprised at how good the information
containment was.

Based on my area of work and job title, I was notified of the software that
was vulnerable and told to contact them with further info if we used it. I wasn't
allowed to talk to others without explicit consent, so I talked to my central
contact to find the right point of contact on neighboring teams, to know who I
was allowed to pull into a room for any joint strategies.

To everyone else, I simply said, "Sorry, on-call is busy this week. I'm working
on mitigating an undisclosed vulnerability." No questions asked. To this day
I'm still not 100% sure the vulnerability is public, so none of my peers know
anything more than the time window in which I was distracted.

------
kyrra
Direct link to a description of the fix:
[https://support.google.com/faqs/answer/7625886](https://support.google.com/faqs/answer/7625886)

Someone on SO trying to explain it another way:
[https://stackoverflow.com/a/48099456](https://stackoverflow.com/a/48099456)

Some interesting discussion about how this patch isn't a 100% fix for Skylake
processors (at least that's my understanding): [https://www.mail-
archive.com/linux-kernel@vger.kernel.org/ms...](https://www.mail-
archive.com/linux-kernel@vger.kernel.org/msg1577670.html)

~~~
x0x0
I don't understand how this fix can possibly be so performant. Speculative
execution _must_ be doing something positive, right? It seems like it should
never hurt (ignoring security, obviously). So how can disabling it not hurt?

Thanks for the Stack Overflow link!

------
clon
Good for them, giving the credit to Paul Turner. Unusual for an enterprise to
allow cracks in the corporate "We".

~~~
azinman2
Google is better than most in this regard. For one, it allows its employees to
publish papers with their names on them. It also frequently names key team
members in press releases.

~~~
hueving
Every company that publishes research allows employees to put their names on
it. It's not really an indicator of a forward-thinking company so much as of
one that wants prestige in the academic world.

~~~
sangnoir
> Every company that publishes research allows employees to put names on it.

Except Apple: Apple's papers are authored by the "Apple ___ team".

------
stefanha
According to this page retpoline is "insufficient on Skylake and newer CPUs,
where even ret may predict from the indirect branch predictor as a fallback;
those need IBRS".

[https://github.com/marcan/speculation-
bugs/blob/master/READM...](https://github.com/marcan/speculation-
bugs/blob/master/README.md#bti-linuxgccllvm-retpolines)

What gives?

~~~
tw04
The way I read that, Skylake and newer don't see a performance hit from IBRS.
So it's retpoline on older CPUs = no performance hit; IBRS + microcode update
on newer CPUs = no hit.

>Requires microcode update on current CPUs. Perf hit vs. retpolines on older
CPUs

~~~
stefanha
You're missing the point. It says "insufficient on Skylake and newer CPUs".

The Google Cloud blog post does not mention that retpolines are insufficient
on newer CPUs. That omission is dangerous because readers will believe they
are protected by recompiling with retpolines when in fact they aren't.

Perhaps the blog post can be updated to clarify the limits of retpolines so
people don't get the wrong impression and end up with vulnerable systems.

~~~
boulos
That's incorrect (in our opinion). See my comment at
[https://news.ycombinator.com/item?id=16130871](https://news.ycombinator.com/item?id=16130871)

~~~
stefanha
Okay, the link I posted is just a writeup summarizing publicly available
information, some of which may be misleading or incorrect.

On the other hand, "opinion" isn't enough. Everything depends on the
microarchitecture, and only Intel can give assurance on whether retpoline
actually works on Skylake. I hope they will release information about it.

------
jaflo
"Retpoline ... modifies programs to ensure that execution cannot be influenced
by an attacker. With Retpoline, we could protect our infrastructure at
compile-time, with no source-code modifications"

I am confused: doesn't this mean that Retpoline needs to sit in the compiler
and won't protect already-compiled binaries?

~~~
hhw
That was my question also. How could they guarantee that nobody else using a
VM on the same physical machine is running binaries compiled elsewhere?

~~~
cthalupa
They don't, but it doesn't matter.

Something compiled with retpoline is resilient to having its execution
influenced. Once KVM is recompiled with retpoline, guests cannot attack it via
Spectre variant 2, and as such cannot attack other guests.

The hypervisor being compiled with retpoline, however, does nothing to protect
against intra-guest attacks: if you have untrusted code running in your VM, and
you don't have IBRS and the other microcode features on, or your sensitive
apps compiled with retpoline, you are still vulnerable internally. Just not
from other guests.

~~~
cm2187
Effectively both Linux and Windows Server need to be fully recompiled. Hope
nobody has lost any source code.

I am not super familiar with Google's offering, but I suspect they don't just
offer VMs. Anything that runs on shared infrastructure (serverless
designs/website hosting) runs on top of a Linux box, I presume. Google would
need to get those Linux binaries recompiled too, not just the hypervisor.

~~~
dx034
I'm pretty sure that Microsoft can easily recompile Windows. Probably even
Windows 95 with some prep time.

~~~
cm2187
A hack they used in a recent update (I believe for the equation editor)
suggests that's probably not the case.

~~~
joshuamorton
I believe the equation editor was third-party code to which MS never had the
source.

~~~
cm2187
The point is they can’t recompile it.

------
Abishek_Muthian
It's good to see Google crediting Retpoline to Paul Turner. As a Senior Staff
Engineer, Technical Infrastructure, I wonder whether he was actually tasked
with working on mitigations for these vulnerabilities or came up with this in
his free time.

~~~
boulos
As the article notes, lots of people worked basically round the clock on this
as a top priority. That isn't to say they did nothing else, but Paul et al.
definitely didn't just happen to do this in their spare time :).

The relevant quote from the article:

> For months, hundreds of engineers across Google and other companies worked
> continuously to understand these new vulnerabilities and find mitigations
> for them.

------
bogomipz
A bit off topic but the amount of real estate taken up by the header, side nav
and "related articles" footer on this is just obnoxious. Obnoxious to the
point of making reading this a really rotten experience.

I fear this is the medium.com effect of content on the web now. Simply having
content for content's sake is now seen as a missed "growth hacking"
opportunity.

~~~
aroman
Sad that this was collapsed by default. I realize it's not relevant to the
article, but it likely never will be addressed otherwise. If we complain
enough, hopefully someone can actually fix it.

------
edf13
A very good reason why people are going to start moving to GCP over AWS....
Project Zero is a big win here.

------
deepnotderp
Wait... doesn't Retpoline have some irritating performance penalties?

~~~
boulos
Disclosure: I work on Google Cloud.

Not particularly (if you read Paul's post, the branch to the retpoline
predicts perfectly for obvious reasons), and especially not compared to the
brute force flushes as an alternative.

Edit: I phrased that _backwards_. The _return_ predicts, so that the whole
thing is about as bad as an _unpredicted_ indirect call:

> This has the particularly nice property that the RSB entry and on-stack
> target installed by (1) is both valid and used. This allows the return to be
> correctly predicted so that our simulated indirect jump is the only
> introduced overhead.
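
For the curious, the construction for an indirect jump through %r11 looks
roughly like this (my sketch from memory of the sequence in the Google FAQ
linked above; exact details, such as the placement of the lfence, may differ):

```asm
    call set_up_target      ; pushes the capture_spec address onto the stack
                            ; and into the return stack buffer (RSB)
capture_spec:               ; speculation predicts the ret via the RSB and
    pause                   ; lands here, spinning harmlessly in this loop
    lfence
    jmp capture_spec
set_up_target:
    mov %r11, (%rsp)        ; overwrite the on-stack return address with the
    ret                     ; real target, so the ret architecturally "jumps"
                            ; to *%r11
```

The indirect branch predictor never gets a say: the ret goes to the value just
written at (%rsp), while any speculation is penned into the capture_spec loop.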

~~~
cthalupa
Edit: My post was prior to the parent edit, and now is largely unnecessary.
Keeping for posterity, I suppose!

I must be misunderstanding Paul's post.

Isn't it specifically preventing any sort of prediction?

"Naturally, protecting an indirect branch means that no prediction can occur.
This is intentional, as we are “isolating” the prediction above to prevent its
abuse."

Of course, you can then go and manually add direct branch hints, as is noted
in the post, but unless I'm misunderstanding things, there's not an obvious
reason why these branches predict perfectly.

Not that it means performance is impacted in a significant way, since that
same section also says "Microbenchmarking on Intel x86 architectures shows
that our converted sequences are within cycles of an native indirect branch
(with branch prediction hardware explicitly disabled)."

(which also confuses this issue - how is it predicting perfectly if prediction
hardware is disabled?)

------
sandGorgon
Does anybody know if Retpoline will make it into the compiler that the Linux
kernel is compiled with? The paper doesn't specifically say, so I'm not able
to figure out what was actually compiled using Retpoline: userland or the
Linux kernel itself?

~~~
boulos
Disclosure: I work on Google Cloud.

Patches for both LLVM (the infrastructure behind clang) and gcc are available.
You choose what you compile your kernel and applications with, and others are
actively looking at retpoline and retpoline-inspired techniques for other code
generators (e.g., various JIT compilers). That's why Paul and the folks made
this public.
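
For the curious, the flag spellings that eventually landed (to my
understanding; treat these as assumptions and check your compiler's
documentation) look like this, where foo.c is a placeholder file:

```sh
# GCC 8+: emit retpoline thunks for all indirect calls/jumps
gcc -O2 -mindirect-branch=thunk -mindirect-branch-register -c foo.c

# Clang/LLVM equivalent
clang -O2 -mretpoline -c foo.c
```

The kernel build wires flags like these in via its own configuration rather
than requiring you to pass them by hand.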

~~~
sandGorgon
Thanks for your reply. In Google Cloud's specific case, what was compiled
using Retpoline? Was it the kernel or userland?

Because you are kind of claiming that the patches the Linux kernel used to fix
this (with a 10% drop in performance) are no longer needed. I am kind of
wondering whether this was submitted to the core Linux kernel to be mainlined.
If yes, why was it not used there?

If no, then it means it is being used in some Google Cloud-specific way that
is not mainlineable.

EDIT - found this comment which seems to suggest that retpoline is not
bulletproof and the kernel's performance-killing patches are still needed

[https://github.com/marcan/speculation-
bugs/blob/master/READM...](https://github.com/marcan/speculation-
bugs/blob/master/README.md#bti-linuxgccllvm-retpolines)

[http://lkml.iu.edu/hypermail/linux/kernel/1801.0/03137.html](http://lkml.iu.edu/hypermail/linux/kernel/1801.0/03137.html)

Does it mean that Google Cloud is doing this only on non-Skylake CPU
instances? That is a very interesting stance: it would mean it is suddenly
more cost-effective to use Google's _OLDER_ machines than the newer Skylakes,
because the newer machines suffer a performance degradation that the older
machines do not.

~~~
boulos
Disclaimer: I'm not Paul :).

As mentioned elsewhere, we recompile everything at Google all the time. I'm
not sure which things we've rebuilt with retpoline enabled. As Paul mentions
in the article, the point is for software you believe needs to be protected,
which may not be everything we build.

That thread on retpoline on Skylake has a lot of confusion. Some folks aren't
100% certain it works (it relies on understanding internal details of Intel
CPUs), _and_ they argue that IBRS on Skylake is cheap enough, so "why not just
always use IBRS and not bother?". That's the gist of this comment:

> personally I am comfortable with retpoline on Skylake, but I would like to
> have IBRS as an opt in for the paranoid.

I want to highlight that Paul and the team have had a lot more time to think
about this issue than the folks just joining the discussion. Could our folks
be missing something? Sure, and that's the point of public discussion and code
review. We hope that over the coming weeks and months it's decided one way or
another, but _we_ believe retpoline to be correct and a good optimization
(especially for older hardware).

~~~
sandGorgon
> I'm not sure which things we've rebuilt with retpoline enabled. As Paul
> mentions in the article, the point is for software you believe needs to be
> protected, which may not be everything we build.

With all due respect, I'm not sure about the answer you are giving me. I don't
see why we are using the word "belief". If there is some secret sauce that
Google is unwilling to talk about, then it would be good to have that
declared. Because it is extremely weird to claim to have compiler tech that
fixes this issue (which otherwise needed kernel-level fixes) and not be able
to categorically state what was compiled with Retpoline to deliver this
benefit.

Possibly it's the hypervisor you are using... which explains the zero-downtime
migration.

P.S. I'm a Google Cloud customer as well.

~~~
boulos
I meant that only in the “I’m not sure all code at Google has been rebuilt and
had retpoline enabled” sense. We said in the article:

> By December, all Google Cloud Platform (GCP) services had protections in
> place for all known variants of the vulnerability.

which means all the bits of the stacks across all our GCP products (i.e., not
just our host kernels and GCE’s hypervisor but also say App Engine sandboxes).

The mitigations are KPTI for kernels, plus recompiling anything sensitive
(kernel, hypervisor, etc.) with a retpoline-capable compiler. We're not
holding back secrets, just trying to get the patches out while keeping blog
posts relatively parseable by the general public.

Note that no matter how disruptive kernel or other changes are, live migration
always makes them “zero downtime migration”. The key here is that using
retpoline to protect systems software against Variant 2 results in much lower
overhead than other proposed mitigation strategies, particularly on older
hardware but even a bit on Skylake.

------
bfrog
Has this at all caused Google to reconsider the homogeneous nature of the
cloud in terms of hardware? It seems like Google the company is constantly
fixing/redoing various Intel problems, such as ME and now this. Google is part
of OpenPOWER after all; it would be interesting to see another architecture
being pushed.

------
cm2187
The post seems to suggest not all CPUs are affected by Variant 2. Is it
Haswell and earlier only?

------
Jerry2
[censored]

~~~
wmf
That's not unusual; if a story gets one or two upvotes very soon after being
submitted the HN ranking algorithm will put it on the front page. The rank
then drops very quickly unless it gets more votes.

------
thinkMOAR
Interesting that Google has time, money, etc. for this.

But actually showing search results on page 4 of Google Search or YouTube,
when it said there were 22 million results for my search, seems too hard for
them.

