- Unlike previous speculative execution attacks against SGX, this extracts memory "in parallel" to SGX rather than attacking the code running in SGX directly. It always works: it doesn't require the SGX code to run, and it doesn't require that code to have any particular speculative execution vulnerability. This also means existing mitigations like retpolines don't help.
- It lets you extract the sealing key and the remote attestation keys. That's about as bad as it gets. Because SGX is primarily about encrypting RAM, anything that pops the L1 cache is game over, and this is a stark reminder of that fact.
- The second attack that fell out of this allows you to read arbitrary L1 cache memory, across the kernel/userspace boundary or even across VMs.
The good news here is that the mitigation is somewhat straightforward. It's a pure L1d attack: flush L1d (or prevent things from accessing the same L1d via e.g. core pinning) and you're fine.
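For reference: on CPUs with the updated microcode, the L1d flush can be requested explicitly through the IA32_FLUSH_CMD MSR (0x10B), bit 0 (L1D_FLUSH), per Intel's documentation. A minimal sketch using Linux's msr character device (the device path and the root requirement are assumptions about your setup; the kernel's own L1TF mitigation issues this write from ring 0 on VM entry):

```python
import os
import struct

IA32_FLUSH_CMD = 0x10B   # MSR number (Intel SDM)
L1D_FLUSH = 1 << 0       # bit 0: writeback-and-invalidate the L1 data cache

def flush_l1d(cpu=0, msr_dev=None):
    """Request an L1d flush via the Linux 'msr' driver.

    Needs root and `modprobe msr`; msr_dev overrides the device path
    (useful for testing against a plain file)."""
    path = msr_dev if msr_dev is not None else "/dev/cpu/%d/msr" % cpu
    fd = os.open(path, os.O_WRONLY)
    try:
        # The msr driver maps the file offset to the MSR number and
        # takes the 64-bit value as 8 little-endian bytes.
        os.pwrite(fd, struct.pack("<Q", L1D_FLUSH), IA32_FLUSH_CMD)
    finally:
        os.close(fd)
```

Against the real device this is a privileged operation, of course, and it only helps if the flush actually happens before anyone else gets to look.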
If there was any doubt left that speculative execution bugs were an entire new class and not just a one-off gimmick...
It could definitely have been worse: a leak of the fused secrets, or a breach of the integrity of the microcode (the two things that together constitute the TCB, which, put simply, is the only part of the system you assume will never be broken).
All in all, assuming a microcode update can counter the attack as Intel claims, sealing and attestation secrets will be rekeyed via the KDF rooted in the fused keys, so that you can start afresh.
Of course, operationally speaking, that is a total pain but it is frankly remarkable to see this kind of deep recovery strategy finally built into consumer devices (and yes, I know DRM is unfortunately the main driver, but there are still some very legitimate use cases).
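The rekeying story can be sketched in the abstract. This is not Intel's real derivation (SGX derives keys on-die with an AES-CMAC-based KDF over inputs like CPUSVN, MRSIGNER and OWNEREPOCH, none of which is modeled here); an HMAC merely stands in to show how bumping the security version number yields fresh sealing keys from the same untouched fused root:

```python
import hashlib
import hmac

def derive_sealing_key(fused_root_key: bytes, cpu_svn: bytes) -> bytes:
    # Toy stand-in for the on-die KDF: the fused root never leaves
    # the "CPU"; only keys derived from (root, security version) do.
    return hmac.new(fused_root_key, b"seal|" + cpu_svn, hashlib.sha256).digest()

root = b"\x00" * 16                       # hypothetical fused secret
old = derive_sealing_key(root, b"svn-1")  # pre-microcode-update key
new = derive_sealing_key(root, b"svn-2")  # post-update key
assert old != new  # the update rekeys everything downstream of the root
```

As long as the fused root itself never leaked, keys derived under the old security version become worthless and the new ones start clean.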
>> flush L1d (or prevent things from accessing the same L1d via e.g. core pinning) and you're fine
No, you are not fine.
As the paper explains, an adversary (which is by definition more privileged than you are) can operate between the moment you use secrets in L1 and the moment you flush them out.
Only the CPU (silicon or microcode) can assist you in the flushing of L1 when you exit enclave mode.
It gets worse: if the adversary has root, they can force all the enclave's data into the L1 cache. This allows them to read all memory pages of an executed and even a non-executed enclave.
There is nothing that can be done against this from the enclave's point of view.
> Only the CPU (silicon or microcode) can assist you in the flushing of L1 when you exit enclave mode.
This seems correct, upon double-checking. The interrupt process within SGX is called Asynchronous Enclave Exit (AEX) and does not give the enclave an opportunity to run any code upon interrupt, though it is possible to run code on every transition (via code placed at the Asynchronous Exit Pointer). I'm not sure that would help with any speculation-based exploits, however.
- Google Cloud's protections against this new vulnerability:
- GCE Related information:
- GKE Related information:
(Disclaimer: I work at AWS, but I am not linking this in any sort of official capacity. I don't know any more details beyond what is listed in that bulletin, and can't answer any questions related to this, unfortunately.)
"Meanwhile, we suggest using the stronger security and isolation properties of EC2 instances to separate any untrusted workloads."
"Google Compute Engine employs host isolation features which ensure that an individual core is never concurrently shared between distinct virtual machines. This isolation also ensures that, in the case that different virtual machines are scheduled sequentially, the L1 data cache is completely flushed to ensure that no vulnerable state remains."
The former does not inspire confidence. Given the hypervisor on EC2 is opaque to me, I'm not sure how I'm supposed to avoid co-tenanting in a risky fashion.
"Meanwhile, we suggest using the stronger security and isolation properties of EC2 instances to separate any untrusted workloads." As I read it, it is talking about running code within your instance - if you have untrusted workloads, rather run them in a separate instance, so as not to encounter issues like this cross-process.
The "strong isolation of EC2 instances" refers to the properties of isolation provided by EC2's virtualization compared to processes within an operating system. It is challenging to safely and securely run untrusted code within sandboxes and processes using general purpose software. However, the hypervisor and hardware based virtualization of EC2 instances is engineered to provide isolation between mutually untrusted instances.
There are several reasons that customers may want to use dedicated instances or dedicated hosts, so we provide those tenancy options as well. The most common reason customers use dedicated hosts is so that they can bring their own software licenses, which are often tied to a physical host.
Disclosure: I work for AWS.
In Linux, at least, the Xen hypervisor on EC2 exposes some information about itself at /sys/hypervisor. In particular, I think /sys/hypervisor/uuid would allow you to detect co-tenanting (between two VMs of your own).
Not saying that I think you should do that, or that I'd want to — it'd be a PITA to coordinate amongst VMs, and I'm not sure it would matter (what if you're co-tenant with a malicious VM? even if you detect it, how do you get out of it?). That is, inside the VM seems like wholly the wrong place to attempt to deal with this. But I don't think many people realize /sys/hypervisor exists.
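For anyone who wants to poke at it, reading that interface is trivial. A sketch (the path is the standard Xen sysfs location; it returns None where the interface isn't present, e.g. on KVM or bare metal):

```python
def hypervisor_uuid(path="/sys/hypervisor/uuid"):
    """Return the Xen-reported UUID string, or None if unavailable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None
```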
https://aws.amazon.com/ec2/instance-types/ includes a note that provides some additional information about when vCPUs utilize Intel HT Technology:
Each vCPU is a hyperthread of an Intel Xeon core except for T2 and m3.medium. (emphasis added)
https://aws.amazon.com/ec2/virtualcores/ provides a table of the number of cores allocated per instance. Notice that an instance like m5.large has 2 vCPUs but only 1 core, while an instance like t2.small has 1 vCPU and 1 core.
Disclosure: I work for AWS
> If it’s a timesliced core, then it’s not what they say they’re selling (a hyperthread).
> A core can be sequentially time sliced between customer instances when you use a fractional (m3.medium) or burst (T2 instances) CPU instance type.
AWS instance types page ( https://aws.amazon.com/ec2/instance-types/ ):
> Each vCPU is a hyperthread of an Intel Xeon core except for T2 and m3.medium.
Who cares how Google defines things?
So those cores won't be used optimally. No big deal.
However, GCE does offer shared-core machine types (f1-micro and g1-small) with 0.2 and 0.5 vCPUs respectively. This seems to contradict their statement (unless the cores are not shared after all, but that doesn't make sense from an economic standpoint).
Also, they offer machines with one vCPU, but since a vCPU is only a single hyper-thread and not a full core, this still allows for the core to be shared over multiple VMs. If this means that Google will stop using hyperthreading and instead give everyone a full CPU core per vCPU, that will likely give noticeable performance benefits (but cost more for them).
I work for Google Cloud, but not related to security or OS development, so am not aware how it is actually being done.
GCE uses KVM, which defaults to the linux scheduler with time slices from 0.75ms to 6ms, so the extra impact should be negligible. It's possible they tuned it weirdly, but I can't think of any reason to do so.
Flushes that occur from hypervisor calls could possibly have an impact, but those will happen whether you share a CPU or not.
Is this true?
They can give both hyperthreads to one machine for 50ms then both hyperthreads to another machine for 50ms.
For VMs with less than 2 vCPUs I suspect they give them two virtual threads and just schedule them for half the time they would otherwise.
2 months ago thread on OpenBSD and hyper-threading: https://news.ycombinator.com/item?id=17350278
As a sysadmin (who admittedly doesn't deal with hardware much), I find that these issues with Intel chips (whose mitigations can seriously decrease performance), and the relative ease with which AMD has come through the problems, make me wonder whether we would be better off with AMD.
I'm reminded of this experiment:
"People respond slower (or not at all) to emergency situations in the presence of passive others."
I think the only reason this is the case is the difference in market share. There are many more Intel processors out there, so finding an Intel vulnerability makes for a much better academic paper (and is much more lucrative for bad actors).
AMD might have made some inroads into the enthusiast market lately, but with cloud providers they are basically non-existent. These guys buy (or manufacture) servers by the tens of thousands. They hate rebooting servers, even if they have VM live-migration worked out (as everybody does these days). The BIOS and RAM problems that people mention here make AMD a complete no-go.
I'm sure that now there are people at Amazon/Google/Microsoft thinking hard about reducing their dependence on Intel, but I doubt we will see any difference for years.
We introduced some AMD servers in January, and the setup had its ups and downs. First we had to go through multiple BIOS updates for stability reasons, and kernel updates kept improving the overall experience with the CCX architecture... I mean, they work, but Intel systems have almost always delivered a plug-and-play experience for the last decade or so.
At which point do we agree the performance increases over the last 20 years have been built on sand and move elsewhere?
However, unlike meltdown it cannot access data that is not already in the L1 cache.
Yes, deep down they happen for the same reason, but then so does Spectre as well.
They're both around how page faults are asynchronous at a uArch level on Intel and not on the other vendors' parts. This and Meltdown don't apply to AMD or ARM.
The closest alternative would be ARM. In any case, it's a massive undertaking.
On the contrary there's SPARC, MIPS, PA-RISC, POWER and a whole heap of others that perhaps were written off prematurely. Need to move quickly tho' while some vestiges of expertise still remain.
That's why alternatives like Open POWER are important.
This isn't as simple as "Intel/x86 sucks, let's go use SPARC". The causes run much deeper and the necessary fixes may or may not be architecturally elegant or simple.
It's also time for a computer system with one and only one general purpose processor (no tiny CPUs in storage or "system management" or every other device)
Probably something like a programming language/OS/computer system written new with a CPU based on current GPU designs.
Unless you're willing to run on the equivalent of a Cortex-M0, you have to live with it.
VLIW only removes the logic to detect data dependencies - it doesn't work around the actual need to wait for data to be ready.
None of this has much to do with speculative execution, which is executing instructions past a predicted branch before the outcome is known. You simply can't have what would be considered a modern computer without it.
The legacy parts have either been disabled in 64-bit mode, or they are implemented in microcode. Other architectures are not simple either, ARM64 has incredibly complicated paging for example.
That is far more of a constraint than you think. It would probably be quite hard to have even a gigabit Ethernet subsystem without it.
Of course no offloading, but you wouldn't notice any performance drops, if the ring buffers are large enough.
If said SGX application wasn't built around this model, then it's probably not a valid use case of SGX.
Also, the more major spectre-related microcode updates have to be applied very early (in the BIOS) probably for technical reasons. For this latest microcode update, for example, Intel didn't even include it in their downloadable microcode package as you linked to. On my v6 Xeons, I was able to get to revision 0x84 with the latest OS microcode package, but 0x8e with a BIOS upgrade.
Breaking anything that enables DRM is a win in my book.
-10%? -20%? -30%?
Have we gone back 3 CPU generations?
I'm still running an i7-3770k on my desktop at home. I was considering upgrading when the 9th gen comes out in October, but if the Spectre/Meltdown/Foreshadow fixes have a significant performance impact, it won't be worth spending the money. As it is, I'll already need a new motherboard and RAM, since I'm still on DDR3.
Speculative execution is what allows Meltdown to work. You make the processor speculatively execute an access to kernel memory, then access a memory address based on the value of the data read from kernel memory. Intel processors perform the speculative execution without first checking whether the memory access is allowed, while AMD processors check before speculating. This is why AMD processors aren't vulnerable to Meltdown.
SGX was thought not to be vulnerable to a speculative execution attack because attempts to access SGX memory without the necessary permissions just yield -1 for reads, and writes are ignored, as opposed to causing an exception as with accesses to kernel memory. However, if the SGX memory is marked as not-present, then attempts to read the memory will trigger a page fault exception. The page fault circumvents the normal SGX protection and allows the memory to be read by speculatively executed instructions.
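The transient gadget described above can be illustrated with a toy Flush+Reload simulation. No real speculation happens here — a Python set stands in for the cache, and "timing" a probe load becomes a membership test — but the structure (secret-dependent index, one page per byte value, then scan for the hot line) is the real one:

```python
PROBE_LINES = 256   # one probe line per possible byte value
STRIDE = 4096       # lines a page apart, to sidestep the prefetcher

def transient_access(secret_byte, cache):
    # Architecturally the faulting load returns nothing to the attacker,
    # but the dependent probe access leaves a footprint in the cache:
    cache.add(secret_byte * STRIDE)

def reload_and_decode(cache):
    # The attacker reloads every probe line; the one that is "fast"
    # (already cached) reveals the secret byte.
    for b in range(PROBE_LINES):
        if b * STRIDE in cache:
            return b
    return None

cache = set()
transient_access(0x42, cache)  # stands in for the speculated window
assert reload_and_decode(cache) == 0x42
```

The real attack repeats this per byte and has to contend with noise, eviction, and winning the race before the fault is handled; none of that is modeled here.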
Amazon Linux bulletin: https://alas.aws.amazon.com/ALAS-2018-1058.html
RHEL patches are out. CentOS after delay, presumably. Nothing yet for Debian/Ubuntu.
TL;DR: AWS is patched. Go update your kernel (especially if you run other people's code).