More details about mitigations for the CPU Speculative Execution issue (googleblog.com)
327 points by el_duderino on Jan 4, 2018 | 90 comments

OK, so this might be only partially related, but earlier today I checked out the PoC code for Spectre and saw that it uses that instruction to discard a cache line. That reminded me of a performance issue Ryzen had after launch in some game I don't remember the name of, which was tracked down to the same cache-line-flush instruction. In that game, the instruction was for some reason emitted by the VC++ compiler in a tight loop for a variable that was accessed within it. On Intel CPUs that did no harm, because they don't actually evict the line from cache but just mark it as least recently used, while Ryzen actually invalidated the cache line and had to repeatedly load the data back from RAM.

Back then I already wondered how many valid use cases there are for this kind of instruction, apart from synthetic benchmarks. And even if you can use it to make an algorithm faster, how often is it used incorrectly because someone just thought it would be faster without benchmarking properly (or because some compiler messed up)?

So my (naive) thought back then was "why not just no-op all those cache-related instructions in the CPU itself?" Then during 2017 I saw a talk about using these very same instructions to create a side channel between two VMs in Amazon's cloud, and now this. I am aware you could still make sure the data is not cached by just thrashing the cache (which would make this attack much slower than it already is), but really, what have these flush and prefetch instructions ever done for us?

> what have these flush and prefetch instructions ever done for us?

Flushing the cache is important when the data must be in the main memory, for instance when an I/O device does DMA to it and is not coherent with the caches. As for prefetch instructions, they are hard to use correctly, but in some cases can make an algorithm much faster.

Good point, so could we make it "just write back but don't evict", something like fsync() for memory...?

That works only for the CPU->device direction. For the device->CPU direction, you have to evict so the next load will get the data the device wrote.

It occurs to me that a better solution, rather than having the CPU know when something else is going to write to the memory, is to have the memory tell the CPU when something else wrote to it. That would take a whole new memory protocol, though, and a new generation of CPU chips to use it...

We used "Computer Organization and Design: the Hardware/Software Interface" (also Hennessy and Patterson) as our comp arch textbook.

I'd recommend it (any moderately recent version) for simply learning as well.

Learning the internals of everything between memory and functional CPU subunits is fun (and useful).

And the ideas don't change that quickly (e.g. the TLB and ROBs that were attacked here have been around for decades in similar form). There's only so many clever things you can do when a one bit buffer is expensive.

Which edition? I think we used the 3rd edition in our class 5 or so years ago. It would not be something I'd "recommend", but I haven't looked elsewhere (just assuming there is something better out there).

I couldn't say. I do remember at the time searching for a few topics on the net -- and sadly it was one of those "black holes" where there simply isn't equivalent information out there.

I imagine there might be some stuff in research papers, but that seems like an even worse learning experience.

For example: https://www.google.com/search?q=branch+prediction+scoreboard...

If you are interested in this sort of thing, I suggest reading "Computer Architecture: A Quantitative Approach" by Hennessy and Patterson.

I don't think that is needed on x86, since cache coherence is guaranteed in a very strong way. Maybe needed on ARM, but hey, if you're running ARM we already know you don't care about performance ;)

All in all I think the only legit use may be in benchmarks.

Aren't most modern CPUs cache coherent? I know the recent PowerPC SoC I was working with was. All of the DMA cache sync functions were no-ops in the Linux kernel for that architecture.

For many architectures the coherency is turned off and managed in software by the device driver, to achieve better performance.

Any examples of this in the Linux kernel? Not doubting you, I'm just curious. I would think this would be device driver specific rather than architecture specific.

Most drivers I've seen just use the kernel's DMA API (dma_alloc_coherent), which is guaranteed to provide a buffer that doesn't require any explicit cache management. If the architecture is coherent, then it leaves caching turned on for those pages. If it's non-coherent, then it turns it off.

On mobile SoCs the GPU is commonly not cache coherent with the CPU. Same with many of the other cores in the SoC.

I think the various flush and prefetch instructions are mainly a vestigial artifact from before modern CPUs got good at cache management.

In certain specific conditions, they can be used to increase performance, but these days this is increasingly rare because the hardware cache management is surprisingly intelligent.

These instructions are just a more convenient (and faster) way to manipulate the cache. An attacker can still manipulate the cache without these instructions by manipulating their access patterns.

It might seem silly to have them with, say, a single core on a single processor--but, consider the case where multiple cores (or dies, or even sockets!) are sharing the same main memory.

Wouldn't that be something the OS scheduler had to do, since the process doesn't know when it's moving from one CPU to another? So it could be a privileged instruction.

> In our own testing, we have found that microbenchmarks can show an exaggerated impact.

This. The talk about a 30% (or even 300%!) impact, based on a graph without any details or methodology, often itself based on a microbenchmark or a tweet that went viral, is really not helping.

> a tweet that went viral is really not helping.

I especially liked the AWS forum post being breathlessly circulated which is from someone claiming that their business critical workload no longer fits on a single m1.medium (1 vCPU, <4GB RAM), joined by someone else who was actually swapping and hitting the OOM killer.

There is a real impact but this is going to be the lazy IT worker's new plausible excuse of the day for months…

That particular AWS forum post doesn't support your argument. Especially when Matt the AWS rep specifically acknowledged in that very thread that the update AWS had applied had had an effect on performance.

It matters not what the workload was running on before or accusations of laziness, the key takeaway from that post was that there was an impact on performance caused by the Meltdown/Spectre mitigations.

The fact another commenter barged into the thread with an unrelated issue simply shows that the AWS forum operates within the Law of Forums (which states all threads are to be cross-pollinated with unrelated issues from at least one other user).

EDIT: I should say that if this fellow is the only affected user from all the computer users of the world, then the mitigations applied can be called successful. That seems unlikely :-)

I think you missed my point, which was simply that any service which you care about should be running on n > 1 servers and with at least enough capacity planning so that a small percentage change in workload doesn't cause user-visible failures. That's especially true in an environment like EC2 where hosts fail and noisy neighbors can cause service degradation.

In this case, the thread made it clear that the real problem was that they didn't have a good deployment story — notice how the original poster was mentioning needing to move everything to an m3.medium manually? I would be quite surprised if they haven't had other problems in the past (e.g. what happens when a system update kicks off if that server is already running at 90+% utilization? or when they get more users and/or the existing users start doing slightly more) but hadn't wanted to deal with the hassle of migrating.

> There is a real impact but this is going to be the lazy IT worker's new plausible excuse of the day for months…


For those whose workers try to use such an excuse, ask them for historical graphs showing that they designed and operated the system to use 99% of provisionable RAM, and ask them what emergency response planning they included with that design in the case that a fluctuation in available RAM occurred due to unforeseen circumstances.

> This.

Sometimes this, sometimes that, it depends really.

> The talk about a 30% (or even 300%!) impact, based on a graph without any details or methodology, often itself based on a microbenchmark or a tweet that went viral, is really not helping.

Of course it's helping: everyone learned about CPU architecture, speculative execution, and caches, and started trading AMD and INTC all of a sudden.

But to be serious, doesn't "details and methodology" apply to any benchmark we see on Twitter? Or do people automatically start buying more servers when they read that tweet? Did anyone here do that?

The point of that number is to warn people to go and measure their own workload. Google measured, and yay, not much of an impact. That's great news. Someone who does telephony and maybe is sending lots of tiny little UDP packets back and forth might be impacted. Should they start crying and running around? No. They should measure, because it might affect them.

I don't care about incompetent corporate IT departments listening to tweets. But "regular people" will read articles quoting the 30% figure and disable Windows Update because they don't want their computer to get 30% slower. I saw this happen on /r/pcgaming.

My app slows 10%, on an OS that slows 5%, running in a virtualization host that slows 10%. The numbers add up.

Do they? If your usage is 50% userspace, 50% kernel, then having each slice slow by 10% does not result in a 20% slowdown.

They do... because they are measured in a weird way.

If the usage is split 50/50 kernel/userspace, and the userspace changes slow the app down 10%, then the userspace changes slowed the userspace code down 20%. Likewise, if the kernel changes slow the app down 10%, they slowed the kernel code down by 20%.

So the total impact of the changes is 20% in your example.

That is indeed a weird way to measure and report. Without the associated userspace/kernel usage ratio those numbers are meaningless.

They aren't meaningless; they are measuring exactly what you care about: how much more computing power you need to do the job.

Well, it's not a slowdown really.

The high performance of the physical CPU was achieved through means that compromise security. The top of the speed range was bullshit, basically. Now that the top is being lopped off, you experience the true speed of that CPU while operating in a secure manner.

I hear Intel's PR department is hiring, if you're interested.

> Now that the top is being lopped off, you experience the true speed of that CPU while operating in a secure manner.

So the proper execution is slower than the secure execution, right? Ergo, it is 'slower'.

I posted this elsewhere, but there's performance data from Red Hat Performance Engineering in this blog post. (Numbers for Linux obviously.)


> Subscriber exclusive content

Can you share more?

Sorry about that. I confirmed earlier that I could get to it from an incognito window. The sibling post has a relevant excerpt. I'll look into it.

[ADDED: It's back. Was pulled back to make some changes.]

It used to not be paywalled:

> Measurable: 8-12% - Highly cached random memory, with buffered I/O, OLTP database workloads, and benchmarks with high kernel-to-user space transitions are impacted between 8-12%. Examples include Oracle OLTP (tpm), MariaDB (sysbench), Postgres (pgbench), netperf (< 256 byte), fio (random IO to NvME).

> Modest: 3-7% - Database analytics, Decision Support System (DSS), and Java VMs are impacted less than the "Measurable" category. These applications may have significant sequential disk or network traffic, but kernel/device drivers are able to aggregate requests to a moderate level of kernel-to-user transitions. Examples include SPECjbb2005 w/ucode and SQLserver, and MongoDB.

For me that post is behind a RedHat subscriber paywall, is there a free link somewhere?

FYI: You can see it with a free developer account.

It talks about a significant slowdown: 8%-19% for OLTP workloads (tpc), sysbench, pgbench, netperf (< 256 byte), and fio (random I/O to NvME); 3%-7% for database analytics, Decision Support Systems (DSS), and Java VMs.

It was pulled back to make a few changes. It's public again.

What do you think the actual average impact will be?

Some people really are seeing those slowdowns, though; it all depends on your workload. If you're one of the people slowed down by 50%, I imagine the fact that you're an outlier is cold comfort.

It also depends on the platform you are on, the CPU vendor/model, the vulnerability you fix, and the patch you used.

As mentioned in the article, there are 3 different (identified) vulns, each with its own set of patches, and sometimes multiple different ways to patch. Each of those patches will have its own impact, and work is currently in progress to find new ways to patch that will have a lower impact on performance.

Also, this means we will need to get used to patching, as new vulns of this category will definitely be found, and new changes to kernels/CPUs/compilers will be designed and will need to be applied.

Besides that one forum post yesterday from someone whose m1.medium instance slowed down, what other cases do we know of? If it's just a single person who has observed a slowdown, I'd say this worked out quite well overall.

This post seems to contradict Intel's recent release:

> Intel has developed and is rapidly issuing updates for all types of Intel-based computer systems — including personal computers and servers — that render those systems immune from both exploits (referred to as “Spectre” and “Meltdown”)

That states that Meltdown is specifically addressed and mitigated. However, this post by Google does not indicate that "Variant 3", or "Meltdown", can be addressed by a microcode update:

> Mitigating this attack variant requires patching the operating system. For Linux, the patchset that mitigates Variant 3 is called Kernel Page Table Isolation (KPTI). Other operating systems/providers should implement similar mitigations.

The option of applying a microcode update is explicitly called out for their mitigation of "Variant 2":

> Mitigating this attack variant requires either installing and enabling a CPU microcode update from the CPU vendor (e.g., Intel's IBRS microcode), ...

Am I misreading one of these statements?

Reading through all this stuff I had similar thoughts. Statements by Google are the most trustworthy at this moment IMO, and they don't add up with what AMD and especially Intel state as far as I can tell.

IIUC, you're not misreading. KPTI solves the problem of leaking data from the kernel address space (besides a bare minimum). But this was an issue because Intel was speculatively accessing kernel addresses in the first place.

I'm curious whether the Retpoline mitigation will still be necessary/recommended for user applications (that don't operate as a JIT or interpreter) once the kernel and CPU mitigations for Spectre that are currently in the works have been applied.

After the Intel microcode update it shouldn't be. At least according to Intel.

My jihad against JIT compilation finally finds its foothold!

And now I have to wonder about negligible performance implications.

Does it mean Google applied point-and-shoot type of fixes in several areas, e.g. OS kernel and hypervisor? Most likely.

Or. The speculative execution did not provide meaningful performance benefits in the first place? At least for Google workloads? If so - why all this extra complexity?

The mitigations don't involve disabling speculative execution entirely, only adding safeguards to how it's done.

Speculative execution definitely does provide meaningful performance benefits, but like many things it only really shows itself clearly in a few algorithms. Branch prediction failure is a fairly well known source of performance issues.

This list linked at the bottom is pretty insightful for Google products & relevant mitigations:

> You can learn more about mitigations that have been applied to Google’s infrastructure, products, and services here[1].

Confirms the `chrome://flags/#enable-site-per-process` flag is useful here & sure enough, the 2018-01-05 SPL was waiting for me on my Pixel when I looked.

1. https://support.google.com/faqs/answer/7622138

So I have read the meltdown.pdf paper, but I don't quite understand why/how it bypasses the kernel/usermode checks of the page table; does anyone have a good explanation?

Furthermore, the paper mentions it was unable to reproduce the attack on AMD and ARM, but has some suggestions of things to try to make the exploit work. Other sources, including AMD itself, claim not to be vulnerable to Meltdown. Does anyone know the technical reasons why, and what is different compared to Intel CPUs?

Basically, you have some code that says "go read a byte from kernel memory. if the high bit of that byte is true, then access page X of memory".

Normally, that code will just error out right away.

But if you add a new branch before the code (such that the branch is taken to avoid the code, but the CPU predicts the branch to NOT be taken), the CPU will speculatively execute the above code just past your branch. The speculative execution doesn't check for memory violations (because that takes time).

Normally, that's cool: if the new branch IS taken, there is no harm because the result of the (bad) kernel access will be thrown away. If the new branch IS NOT taken, the CPU notices the bad access and complains.

But if you are extra devious, you can ensure that page X is NOT cached when running this code. After, you check if page X suddenly got cached. That tells you the value of the high-bit of your kernel memory. Keep scanning all the bits and you can read out all of kernel memory.

An excellent, succinct explanation!

What it does is create a branch in code that is never actually followed, but is executed speculatively by the CPU, which then discards the result. The problem is that even though the result is discarded, computation of the result causes a measurable side effect (specifically, it populates caches accessible to the code).

In short, the speculatively executed branch retrieves a value from a kernel page (that the code shouldn't have access to) - Intel CPUs allow this, but AMD CPUs do not. It then &s the value with 0x01, and then uses the result to calculate and access an address in a page that it DOES have access to.

This results in that address's data being loaded into the cache (and its translation into the TLB, the cache that stores mappings from virtual to physical memory).

The result of the computation is then discarded, but the address it calculated is still cached. Since there are two possible addresses that could have been cached, depending on the value of the bit in the kernel page, all the code has to do is time the access to each address, and it can discern the value of the bit from the relative timings.

> Intel CPUs allow this, but AMD CPUs do not.

Do we know this for sure? I see now that the Meltdown paper has a suggestion in chapter 7.1 that the CPU should perform the permission check on the page table before data is fetched into the cache, but dismisses that suggestion on the grounds that it would have a very high performance impact.

Are you saying AMD does the check first anyway, and either takes the performance hit or has found a way to not take a performance hit ?

My understanding isn't especially deep, but I believe AMD prevents access to unloaded pages that the code doesn't have permission to access, even for speculative access. Their specific quote is:

"AMD processors are not subject to the types of attacks that the kernel page table isolation feature protects against. The AMD microarchitecture does not allow memory references, including speculative references, that access higher privileged data when running in a lesser privileged mode when that access would result in a page fault."

I wrote up this analogy, but basically the CPU only checks permissions after it has done the work of resolving a conditional, and the result of that conditional is to cache something you can check the access time to.

Let's say you want to know if your boss is away on vacation next week, so you call their admin and say "you need to double-check my contact info if the boss is going to be out next week". They load up the boss's calendar to check and, based on his presence next week, then load up your info. Only once done do they take the time to remember that the boss didn't want you to know whether they are in or out. So you hear back, "sorry, can't tell you that", but you follow up with "OK, well can you still double-check that my phone number is...". If they respond quickly with a yes, then your file is still on their screen and the boss is in fact out next week. If there is a short pause while they look it up, then the opposite.

Related question about Spectre and the cross-process(!?) memory read possibility: the attacker "influences the victim to execute the gadget speculatively", and then the "careful selection of a gadget" (i.e. in the victim process) can be used to read arbitrary victim memory.

How is that possible? I can understand an attacker reading its own memory, but foreign memory via a commonly mapped gadget section... that is just incredible!

How are GNU Hurd, Minix, and other microkernel systems affected by these issues? I would expect that they would have less sensitive information in kernel memory, and so exploits to read kernel memory would not be as dangerous as on systems with a monolithic kernel.

It's not so much about sensitive info residing in the kernel, but that the kernel has an identity mapping of the entire physical memory. Thus, if you can read kernel memory, you can dump all RAM, where secrets from other processes or virtual guests reside.

I don't know if Minix or Hurd maps all of RAM into the kernel address space, though (or if they add the kernel address space to each user-space process, as the exploit also requires).

A bit off topic, but I asked the same question in a thread that got delisted as [dupe], so I'm reposting here in the hope someone can enlighten me:

I've been reading up a bit on these attacks and I was wondering if there are any particular requirements to implement them in an arbitrary language?

For example, can you implement the attack with Java but without JNI? i.e. are syscalls required to be able to leverage the exploit?

The requirements are easy to meet - you just need to be able to time operations at a suitable precision. The exploits are possible in JavaScript running in a browser.

Yeah, I saw the PoC in the Spectre paper, but I was wondering if a JVM language could meet those reqs.

I have absolutely no idea if using the JVM would, for example, mess with the required precision, since I'm guessing one would need to use JNI/JNA to get the timing, and that could maybe not be suitable.

Java provides access to high resolution timers without JNI and optimises them very well.

I suspect a better solution than KPTI would be to evict all user-space pages from the cache when an invalid page access happens, if the fault was caused by reading/writing kernel-space pages. My kernel days were so long ago that I don't know if that is possible.

Massive performance hit, but only on misbehaved software. Well-behaved software would not have the performance hit of KPTI.

The kernel could even switch dynamically to KPTI if there were too many read/write attempts from user space.

Implementations of meltdown do not need to trigger a page fault (because the instruction which would fault can be made to execute speculatively - in addition to the instruction which leaks information into the cache executing speculatively). Accordingly, there would be nothing for the kernel to observe or respond to.

I thought that:

mov rax, [Somekerneladdress]

would trigger an interrupt even on speculative execution as described on https://cyber.wtf/2017/07/28/negative-result-reading-kernel-...

ADDED: So in the interrupt handler the kernel could evict all user-space pages from the cache before returning control to user space, so it could not use the timing attack on the cache left by the speculative execution of mov rbx, [rax+someusermodeaddress].

And what if it was preceded with:

    cmp qword [some_readable_but_uncached_addr_containing_zero], 0
    je some_safe_location
    ; now the exploit
    mov rax, [somekerneladdr]
    ; ...the rest of it...
The CPU may speculatively execute past the je and speculatively do the load. No fault is generated.

So it is game over here, unless Intel can change the microcode to force a page fault in this case.

It doesn't make sense for speculatively executed code to throw architecturally visible exceptions. The appropriate behavior would be to not perform speculative loads across protection domains (i.e. the behavior of AMD implementations).

It would make sense if it was the only alternative as the kernel can handle it. The appropriate behavior is to remove all traces of the speculative execution including cache hits.

Is that even possible? The data that would need to be removed from the cache has already evicted other cache lines, and re-fetching those might have observable effects, like the timing.

Concretely, https://twitter.com/corsix/status/948670437432659970 can be used to get both `movzx rax, byte [somekerneladdress]` and `movzx rax, byte [rax+someusermodeaddress]` executed speculatively (the idea behind this is the same as a retpoline - exploit the fact that `ret` is predicted to return to just after the "matching" `call` instruction). If the first load is executed speculatively, it won't cause a page fault.

Even if a fault occurred (others are correct in pointing out that it doesn't necessarily), I think this would be too late. I could have already observed the effects in the cache before the instruction causing it was (non-speculatively) reached and the fault occurred.

That sounds like the kind of proposal for which Linus would scream at you.

I would never send to him :)

How do you go from knowing the location of memory to actually doing an attack if the memory is read protected? That's the part I don't get.

So kind of left field, but what would the theoretical effects of these exploits be on the Mill?

Would it be vulnerable in the same way?

See https://millcomputing.com/topic/security-on-mmooo-processors... and https://millcomputing.com/topic/meltdown-and-spectre/

The Mill has portals (possibly) gating everything and making microkernel design easy and fast, so Meltdown would not work. Spectre, not so sure, there are claims the NaR (not a result) option might prevent it.

I bet this will come up in the next talk, looking forward to it.

Honestly I'm just thinking if there was a moment to break into the CPU market...I'd say this is it. Half the planet is going to be looking for architecture updates which can be proven to be immune to this type of vulnerability.

I'm sure not going to be buying any new CPUs till some new arch's come out which remove this (or make the mitigations performant).

Google Security Blog ... requires javascript enabled to read ...

Works fine for me with JS disabled. In fact it loads even faster...


Alas, content blockers blocking content won't put a site into noscript mode which shows <noscript> tags because they only block requests.

The site breaks if you block their JS.

That makes sense, maybe content blockers could be configured to also "run" noscript tags.

On mobile it's completely unusable since they hijack the horizontal scrolling that would allow an unresponsive page like this to work in the first place. Bizarre.

