I often notice that when coverage of these side channels starts spreading, Intel takes a hit and AMD thrives. While the majority of these side-channel attacks are tuned to Intel chips, the theory behind them, along with a decent amount of work, can often make them applicable to AMD as well, and in some cases even to other architectures (ARM, MIPS, SPARC, etc.).
I don’t know why, but there’s something really surprising to me about how far-reaching the issues are when you pull them back to first principles.
A big group of the issues that turned this into an "Intel fail" was due to various errors on Intel's part, like moving verification of memory accesses to instruction retirement and similar things that, even to an uneducated person like me, look like "tricks to make single-core performance shine".
We can also talk about the store-to-load forwarding that uses a partially checked physical address, exploited by the Fallout attack, an optimization Intel has had patented since 2006 [1].
All these Intel optimizations are old, and sometimes patented (which reduces the chance that IBM, ARM or AMD will do the same), and they are totally valid and safe as long as it is assumed that data used during speculative execution cannot be recovered.
Recovering that data only became possible once the Flush+Reload side channel was discovered in 2014 [2] (and it originally didn't target speculative execution at all; a minimal sketch of the probe appears after the references below). It was only three years later that the Meltdown attack used Flush+Reload as a covert channel to exfiltrate data during speculative execution.
It seems to me that this timeline shows that it was not possible in 2006 to anticipate this issue. So it's not just a lack of education.
Perhaps a lack of security research at Intel to continuously challenge their optimizations?
But please note that there is no theory that would let you simply determine whether these optimizations are a good idea or not.
Criticism is easier in hindsight, once you know how the attacks are constructed. But in 2017 this type of attack was something new.
That said, some of Intel's responses to the flaws (partial correction of vulnerabilities, lack of dialogue with researchers, etc.) are very worrying and I think these criticisms are much more legitimate.
[1]: https://patents.google.com/patent/US20080082765A1
[2]: https://eprint.iacr.org/2013/448.pdf
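For readers who haven't seen Flush+Reload, the core probe is tiny. Here is a minimal, illustrative sketch in C using x86 intrinsics; the hit/miss threshold is machine-specific, and everything around the probe (sharing a page with the victim, choosing what to monitor) is omitted.

    #include <stdint.h>
    #include <x86intrin.h>  /* _mm_clflush, _mm_mfence, __rdtscp */

    /* Time one reload of a (previously flushed) shared cache line, then
     * flush it again for the next round. A reload latency well below the
     * DRAM-access threshold means someone else touched the line since the
     * last flush -- that is the whole Flush+Reload signal. */
    static uint64_t flush_reload_probe(volatile const uint8_t *line)
    {
        unsigned aux;
        _mm_mfence();
        uint64_t start = __rdtscp(&aux);
        (void)*line;                         /* the timed reload */
        uint64_t end = __rdtscp(&aux);
        _mm_mfence();
        _mm_clflush((const void *)line);     /* evict for the next probe */
        return end - start;
    }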
I will admit that I'm not an expert in microarchitecture design, so to me the expected flaw would be "we do something wrong in speculative execution and end up bypassing the checks in a destructive way", not through side channels. I didn't know about the patents involved. And of course timing attacks make everything worse for all of us :/
What Intel did was move all memory-access permission checks to retirement, so appropriately crafted code could bypass the protection bits in the TLB, not just snoop the contents of the cache (as on AMD, ARM, POWER).
Intel's implementation of checking permissions in parallel with speculative loads was vulnerable because the transient effects could be observed through cache timing information.
All of these attacks, meltdown included, use "timing side channels from speculative execution."
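To make the mechanism concrete, the textbook Meltdown gadget boils down to something like the following sketch. It is illustrative only: as written, the kernel-address load simply faults; a real exploit also needs fault suppression (TSX or a signal handler) plus a Flush+Reload pass over probe[] to recover the byte.

    #include <stddef.h>
    #include <stdint.h>

    #define STRIDE 4096                      /* one page per byte value, to dodge the prefetcher */
    static uint8_t probe[256 * STRIDE];      /* attacker array scanned later with Flush+Reload */

    static void transient_gadget(const volatile uint8_t *kernel_addr)
    {
        /* On parts that defer the permission check to retirement, this load
         * executes transiently, and the dependent access below leaves a
         * value-dependent line of probe[] in the cache before the fault is
         * finally raised. Architecturally, the load just faults. */
        uint8_t secret = *kernel_addr;
        (void)probe[(size_t)secret * STRIDE];
    }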
POWER9 had the same problem (and therefore was similarly vulnerable to meltdown) for memory accesses that hit in the L1. i.e. userspace accesses to kernel addresses cached in the L1 could produce observable data-dependent effects because the data was available for further speculative execution before the permission exception was raised. This is why the Linux kernel put in place a mitigation for meltdown on POWER CPUs that involves flushing the L1-D cache on transitions to/from kernel space ("RFI flush").
To be clear, the vulnerability that Intel processors have because they "move memory access verification to instruction retirement" is called "meltdown." When I said that meltdown was present on IBM and some Arm designs, I meant that these designs are vulnerable to the same exploit because they have comparable problems with regard to memory permission checks. I was not referring to the various other spectre-type vulnerabilities, which are even more widespread.
So to summarize:
Ok, but "Intel fail" suggests meltdown was specific to Intel processors, when it was also present on IBM and some Arm designs.
I know a significant number of people who have disabled spectre and meltdown protections in favor of higher performance because they are not concerned at all about side channel attacks.
Single-tenant workloads that run no untrusted code and are not public facing are not exactly a unicorn.
That doesn't answer whether they were less optimized because someone on the Red team realized that was a good idea.
After checking a few sites, it looks like AMD may have between 20 and 37% of PC market share. I do not consider that "tiny."
This attack as it stands does not seem to apply to Intel server hardware from Skylake onwards due to a different interconnect architecture, though the researchers indicate they believe the attack may be portable, albeit in a more limited form.
Complete rectal estimation here, but I'd imagine executive laptops are probably the next juiciest target, as the end-user machines most likely to contain "interesting" data. While AMD has been making incredible inroads in the laptop market with Zen 2 and especially Zen 3 (as opposed to the Bulldozer era, when an AMD sticker on a laptop told you it was the cheapest one on the shelf), they still haven't made it into the models commonly bought by businesses.
AMD is absolutely killing it in the DIY desktop space, deservedly so (this post typed from a 3900X that I love), but on the OEM side of things you still generally have to go out of your way to find them in anything not marketed at gamers.
Fact is, Intel takes the brunt of these attacks because their hardware is more available, especially in semi-standardised environments like universities. Also, Intel funds researchers (this paper was partly funded by Intel), and their technical documentation is better than AMD's, at least in my experience. That makes it easier to understand the CPU internals, which is a big part of these papers.
The TL;DR is that the L3 cache (LLC: Last Level Cache) is shared between all cores on the chip, but the L3 is composed of a number of slices colocated with each core: the L3 is actually CC-NUMA! There is contention when reading/writing the L3 via the ring interconnect, which can be used to identify memory access patterns. There's a bit more covering how activity in the L1/L2 caches, private to each core, is revealed via the inclusivity property of Intel's cache design.
In monitoring the LLC it is similar to https://eprint.iacr.org/2015/898.pdf, though it doesn't reference that work.
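In code terms, the receiver side of this kind of contention probe is roughly the following sketch. It's illustrative only: building the eviction sets that keep the monitored line out of the private L1/L2, and targeting a specific LLC slice, are the hard parts the paper deals with, and they're omitted here.

    #include <stdint.h>
    #include <x86intrin.h>  /* __rdtscp */

    #define SAMPLES 100000

    /* Repeatedly time a load that is meant to miss the private L1/L2 but
     * hit the LLC, so every sample crosses the ring interconnect. Latency
     * spikes in trace[] then hint at another core's traffic contending on
     * the same ring segment. */
    void monitor_ring(volatile const uint8_t *llc_line, uint64_t trace[SAMPLES])
    {
        for (int i = 0; i < SAMPLES; i++) {
            unsigned aux;
            uint64_t t0 = __rdtscp(&aux);
            (void)*llc_line;                 /* the timed, ring-crossing load */
            uint64_t t1 = __rdtscp(&aux);
            trace[i] = t1 - t0;
        }
    }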
It isn't a game-over attack, but it reinforces how sharing resources (cache, memory, cores) across a security boundary is a bad idea. VM-exclusive cache regions (Intel CAT cache partitioning) and cache locking should help, but cache-slice partitioning is required too.
In the long term though? Who knows. Separate cpu+cache+memory hardware for every process seems a little crazy, but the alternatives might be worse.
Indeed, you can always spawn a helper thread that does nothing but increment a counter in shared memory and use that as a clock substitute. That's why browsers disabled shared arrays after Spectre was revealed.
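A minimal sketch of that trick in C with pthreads (names are made up; in a browser the same idea uses a Web Worker plus a SharedArrayBuffer):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdint.h>

    /* The "clock": one thread does nothing but bump a shared counter. */
    static atomic_uint_fast64_t ticks;

    static void *counter_thread(void *arg)
    {
        (void)arg;
        for (;;)
            atomic_fetch_add_explicit(&ticks, 1, memory_order_relaxed);
        return NULL;
    }

    /* Readers bracket the operation they want to time with two samples of
     * the counter, just as they would with rdtsc or performance.now(). */
    static uint64_t now(void)
    {
        return (uint64_t)atomic_load_explicit(&ticks, memory_order_relaxed);
    }

    /* Usage sketch:
     *   pthread_t t;
     *   pthread_create(&t, NULL, counter_thread, NULL);
     *   uint64_t before = now();  ... operation ...  uint64_t after = now();
     */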
So this isn't practical for any programs that need fine-grained parallelism.
If the future provides us with large but thermally limited dies, I could see a scenario where CPU components are duplicated but not all running at once. A thread or set of threads could own a hierarchy of caches, for example. Caches are die-intensive, so I don't see this happening soon, but maybe at some point in the future.
a) route all hardware interrupts to CPU 0 only
b) when a user thread does a syscall, immediately task switch to another thread, marking the current thread as entering/in the kernel. Only service threads entering/in the kernel on CPU 0.
Technically, you need a little bit of ring 0 time to task switch. But if you make all of the syscalls amazingly slow, timing attacks are a lot harder.
With all the cache flushes on context switch, making syscalls slow should be no problem at all.
There's already a large slowdown because of communication and NUMA. The syscall arguments have to be transferred to CPU 0 memory as well.
Note that in this instance the shared region is the bus used to access the cache. Reserving a cache region won't help with that. One would have to reserve time slices on the bus.
Also, time slicing might not be sufficient due to the queues in each LLC slice. If you can fill the queue, the bus access will retry, I guess.
Firstly, this new technique is probably not exploitable against 'real' cryptographic software. A very common move in these papers is to attack obsolete versions of libgcrypt, because old versions of this relatively obscure crypto library don't use constant-time code. All the modern, patched crypto implementations that people actually use would not leak through this technique, as the paper admits at the end. And even libgcrypt was patched years ago. The version number cited in these papers never makes this clear - you have to look up the release history to realise it.
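As an aside, the constant-time point boils down to roughly the difference between these two made-up helpers: the first branches on a secret bit, which is the pattern old, unpatched libgcrypt-style code exposes to cache and ring probes, while the second computes the same result with no secret-dependent branch or memory access.

    #include <stdint.h>

    /* Leaky: control flow, and therefore instruction/cache activity,
     * depends on the secret bit. */
    uint32_t select_leaky(uint32_t secret_bit, uint32_t a, uint32_t b)
    {
        return secret_bit ? a : b;
    }

    /* Constant time: the same instructions execute whatever the bit is. */
    uint32_t select_ct(uint32_t secret_bit, uint32_t a, uint32_t b)
    {
        uint32_t mask = (uint32_t)0 - (secret_bit & 1);  /* all ones or all zeros */
        return (a & mask) | (b & ~mask);
    }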
Secondly, this paper is a bit unusual in the sheer number of steps that are simulated or otherwise simplified by e.g. requiring root. It's got some work to do before it's usable outside of lab condition demonstrations.
OK, so the crypto attack is theoretical and wouldn't work in real life, but what about the ability to extract passwords from typing patterns? Well, if we read that part of the paper carefully we see a rather massive caveat: two threads doing work in the background are enough to drown the signal in noise, and four threads cause it to be lost entirely. So there seems to be a simple mitigation, and it's unclear whether this would work at all on a server under even moderate load.
Moreover, whilst a casual reader may get the impression they can extract passwords, they don't actually demonstrate that. Rather, they demonstrate what they claim is "a very distinguishable pattern" in ring contention triggered by keystrokes, with "zero false positives and zero false negatives". That sounds impressive, but no information is given on what they're comparing against: what were the other events being tested here? The obvious question in my mind is how much of the signal comes from the actual keystroke versus the output being printed to a terminal emulator, in which case I'd expect potential false positives from any printing to the terminal. Their victim program doesn't merely monitor keystrokes; it also echoes them to the terminal - which is not what happens during password input, where characters aren't visible. As a consequence, this mock victim program is not much like a real password-input program.
Overall it's a clever paper, but I find myself increasingly fatigued by this genre of research. They seem to have settled into a template:
• Attack Intel, ignore everything else.
• Make grand and scary-sounding claims.
• Only demonstrate them against deliberately crippled victim programs in very artificial conditions.
It's been a few years now and I don't think any Spectre attack has ever been spotted in the wild, mounted by real attackers. That's true despite a huge state-sponsored attack having just been detected, one that Microsoft claims might have had over 1,000 developers working on it. Uarch side-channel attacks sound cool, but it seems real attackers either can't make them work or have easier ways to get what they want. I find myself losing interest as a consequence.
Just a few days ago there was some news about the discovery of the first real malware that had exploited a Spectre variant.
Unfortunately I do not remember where I saw this, but it was said that the Spectre-based malware might have spread for a few months before being identified recently.
If so, I agree this is a good example, although how 'in the wild' a red team exploit kit can be considered to be is perhaps up for debate.
However, I agree on your points. I just hope it makes Intel's share price dip so I can make more money.