* The branch target injection variant of Spectre if you want to get a sense of how amazing this vulnerability is: you can spoof the branch predictor to trick a target process into running arbitrary code in its address space! This is crazy!
* The misprediction variant of Spectre if you want to get a hopeless feeling in the pit of your stomach, since the implication of misprediction is that certain kinds of programs are riddled with a new kind of side channel we didn't really grok until last week, and no microcode update seems to be in the offing.
You could probably use the same Python conceit to illustrate the other two attacks; someone might take a crack at that.
(I'm not disputing that the R-Pis aren't vulnerable to Spectre.)
> Both vulnerabilities exploit performance features (caching and speculative execution) common to many modern processors to leak data via a so-called side-channel attack. Happily, the Raspberry Pi isn’t susceptible to these vulnerabilities, because of the particular ARM cores that we use.
The reason Spectre is not a problem is that there is no branch predictor in these simpler ARM cores. Instructions are processed in parallel when possible, but not before their dependencies, including branch decisions, have resolved.
EDIT: under "What is speculation?" branch prediction is described. Then in the conclusion: "The lack of speculation in the ARM1176, Cortex-A7, and Cortex-A53 cores used in Raspberry Pi render us immune to attacks of the sort."
Branch prediction isn't new, you're right. But VLIW instructions are equally unmodern and are entirely orthogonal to speculative execution. Sufficiently smart compilers are also no substitute for runtime analysis.
rpi 1: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0301h/dd...
rpi 2: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0464d/DD...
rpi 3: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0500d/DD...
As others have indicated further down, though, it won't open up much of a vulnerability unless they are speculatively fetching memory.
I need to re-read the papers but I think the real problem isn't even speculative execution but allowing speculative cache changes.
The notion that "gadgets" didn't even need to return properly was both amusing and eye opening for me. It doesn't matter because the result will be flushed anyway! ;-)
In practice, advanced in-order designs contain more local reordering mechanisms, e.g. in the load/store unit, but they lack the unified global abstraction of a reorder buffer. The most successful timing attacks involve a mis-speculated load, so they wouldn't apply to these mechanisms, but it's not completely out of the question that they are also an effective side-channel.
Not quite. Branch prediction is typically used on non-speculative architectures in order to avoid pipeline bubbles. (You could argue that pipelining is a form of speculation)
Here is the branch prediction documentation for one of the processors they claim is not vulnerable. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....
Whether or not they're vulnerable has more to do with how their pipeline is structured. It's possible for an architecture to be vulnerable if a request to the load/store unit can be issued within the window between post-branch instruction fetch/exec and branch resolution. Eyeballing the pipeline diagram from the above docs, it looks like you can maybe get a request to the LSU off before the branch resolves. *dramatic music*
(Simplest cores have only static branch prediction though)
Simply put: instead of MOV AX, [something that depends on a secret value] as in the original Meltdown paper, you'd use JMP [something that depends on a secret value] to trigger the memory fetch on a path that's never actually taken.
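For illustration, here's that idea in the article's Python conceit (a toy model, not real exploit code; the cache set and the 0x000/0x100 addresses are made up): the secret selects which of two jump targets gets speculatively fetched, and the fetched target's cache line betrays the bit.

```python
# Toy model of the JMP-based variant: the *target* of a speculated
# indirect jump, not a data load, depends on the secret. Fetching the
# target's instructions warms its cache line even though the jump is
# later squashed.

cache = set()          # addresses whose cache lines are currently "hot"
SECRET_BIT = 1         # the bit the attacker wants to learn

def speculative_indirect_jump(secret_bit):
    # The front end fetches from the computed target before the branch
    # resolves; the fetch itself is the observable side effect.
    target = 0x100 if secret_bit else 0x000
    cache.add(target)

speculative_indirect_jump(SECRET_BIT)

# Probe phase: a fast (cached) access to 0x100 means the bit was 1.
leaked = 1 if 0x100 in cache else 0
assert leaked == SECRET_BIT
```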
In the comments on the article the author argues: "Why don’t speculative instruction (and data) fetches introduce a vulnerability? Because unlike speculative execution they don’t lead to a separation between a read instruction and the process (whether a hardware page fault or a software bounds check) that determines whether that read instruction is allowed."
but that would seem to be confusing the details of Spectre with Meltdown (which is happening a lot right now). Spectre doesn't depend on unauthorized reads succeeding.
ARM also has another trick for that: every opcode in the (full, non-Thumb) instruction set has a condition code that lets you execute an instruction conditionally based on the flags state without requiring an explicit branch. From the CPU's perspective the flow of the code is linear; only late in the pipeline is the condition code evaluated and the instruction discarded if it doesn't match the flags. This way you're sure you'll never have a bubble no matter what, although the downside is that you end up fetching instructions that may never be executed, so it's only worth it for "short" branches.
Here's what I'm trying to figure out. Let's say there's a JIT-generated instruction that I, an attacker, am interested in learning but cannot directly read from my position in the sandbox. If I can influence the instruction fetch speculator to issue a load for that instruction, then AFAICT it doesn't matter that it never makes it as far as the execute stage -- merely the act of fetching it for decode will have had a side-effect I can probably exploit into determining what it was.
I can't really imagine how you can construct an attack based on that, but maybe I lack imagination.
Is it just used for getting instructions into the L1i cache then?
It does... it just doesn't have to fully describe Spectre to show why the RasPi is not affected.
Actually that seems to be rather different, but still similar-ish. They both use threads to exploit memory caches in an unexpected way.
I thought he deserved a mention since no one really took it seriously back then. https://it.slashdot.org/story/05/05/13/0520214/hyperthreadin...
"The recent Hyper-Threading vulnerability announcement has generated a fair amount of discussion since it was released. KernelTrap has an interesting article quoting Linux creator Linus Torvalds who recently compared the vulnerability to similar issues with early SMP and direct-mapped caches suggesting, "it doesn't seem all that worrying in real life." Colin Percival, who published a recent paper on the vulnerability, strongly disagreed with Linus' assessment saying, "it is at times like this that Linux really suffers from having a single dictator in charge; when Linus doesn't understand a problem, he won't fix it, even if all the cryptographers in the world are standing against him."
Always found that amusing.
"Cache timing goes back to 2005 with Percival. I published a couple weeks before them. :-)" (cpercival)
Cache timing goes back to the VAX Security Kernel (early 1990s), designed for the A1 certification requirements that tptacek calls useless "red tape." One of the mandated techniques was covert-channel analysis of the whole system. They found lots of side channels in the hardware and software that they tried to mitigate. Hu found one in CPU caches and followed up with a design to mitigate it. That was presented in 1992.
Since Hu is paywalled, see (b) "cache-type covert timing channels" in his patent:
So, one of INFOSEC's founders (Paul Karger) did the first secure VMM for non-mainframe machines. The team followed the security certification procedures, discovering a pile of threats that required fixes ranging from microcode for clean virtualization to mitigation of cache timing channels. They published that. Most security professionals outside the high-assurance sector and CompSci ignored and/or talked crap about their work, presumably without reading it. Those same folks later reported on virtualization stacks hit in the 2000's with the attack from 1992, on new software with weaker security than KVM/370 had in 1978. Now, another attack making waves uses the 1992 weakness combined with another problem they discovered by looking at what interacts with it. That might have been discovered earlier if anyone had done that with x86, like high-assurance security (aka "red tape") did in 1995 for B3/A1 requirements, spotting the potential for SMM and cache issues:
Note: high-assurance security avoided x86 wherever the market allowed, for stuff like what's in that report. As the report notes with exemplar systems, the market often forced it to the detriment of security. Their identification of SMM as a potential attack vector preempted Invisible Things by quite a bit of lead time. That was typical in this kind of work since TCSEC required thoroughness.
In 2016, one team surveyed the components of a modern CPU plus the research on them. Researchers had spotted branching as a potential timing channel pretty quickly after CPUs got mainstream attention.
So, a team following B3 or A1 requirements of TCSEC for hardware like in 1990-1995 would've identified the cache channels (as done in 1992) plus other risky components. They'd have applied a temporal or non-interference analysis like they did with secure TCB's in 1990's to early 2000's. A combination of human eyeballs plus model-checking or proving for interference might have found recent attacks, too, given prior problems found in ordering or information flow violations. This is a maybe but I say focusing on interactions with known-risks would've sped discovery with high probability. Far as resources, it would be one team doing this on one grant using standard, certification techniques from mid-1980's on a CPU others analyzed in mid-1990's say it was bad for security due to the cache leaking secrets, too many privileged modes, various components implemented poorly, and so on.
I keep posting this stuff on HN, Lobste.rs, and so on since it's apparently (a) unknown to most new people in the security field for who knows what reason or (b) dismissed by some of them based on the recommendations of popular, security professionals who have clearly never read any of it or built a secure, hardware/software system. I'm assuming you were unaware of the prior work given your scrypt work brilliantly addressed a problem at root cause quite like Karger et al did when they approached security problems. The old work's importance is clear as I see yet again well-known, security professionals are citing attack vectors discovered, mitigated, and published in the 1990's like it was a 2005 thing. How much you want to bet there's more problems they already solved in their security techniques and certifications that we'd profit from applying instead of ignoring?
I encourage all security professionals to read up on prior work in tagged/capability machines, MLS kernels, covert channel analysis, secure virtualization, trusted paths, hardware security analysis, and so on. History keeps repeating. It's why I stay beating this drum on every forum.
My paper was the first to demonstrate that microarchitectural side channels could be used to steal cryptographically significant information from another process, as opposed to using a covert channel to deliberately transmit information.
Prior and current work usually models secure operation as a superset of safe/correct operation. Schell, Karger, and others prioritized defeating deliberate penetration with their mechanisms since (a) you had to design for malice from the beginning and (b) defeating one takes care of the other as a side effect. They'd consider the ability for any Sender to leak to any Receiver to be a vulnerability if that flow violates the security policy. That's something they might not have spelled out since they habitually avoided accidental leaks with mechanisms. Then again, you might be right where they never thought of it while working on the superset model. It's possible. I'm leaning toward they already considered side channels to be covert channels given descriptions from the time:
"A covert channel is typically a side effect of the proper functioning of software in the trusted computing base (TCB) of a multilevel system... Also, as we explain later, malicious users can exploit some special kinds of covert channels directly without using any Trojan horse at all."
"Avoiding all covert channels in multilevel processors would require static, delayed, or manual allocation of all the following resources: processor time, space in physical memory, service time from the memory bus, kernel service time, service time from all multilevel processes, and all storage within the address spaces of the kernel and the multilevel processes. We doubt that this can be achieved in a practical, general purpose processor. "
The description is that it's an incidental problem of normal software functioning that can be maliciously exploited with or without a Trojan horse. They focused on penetration attempts since that was the culture of the time (rightly so!) but knew it could be incidental. They also knew, per the second quote, just how bad the problem was, with later work finding covert channels in all of those resources. Hu did the timing channels in caches that same year. Wray made an SRM replacement for timing channels the year before. They were all over this area, but without a clear solution that wouldn't kill performance or pricing. We may never find one, whether we're talking timing channels or just secure sharing of physical resources.
Now far as your work, I just read it for refresher. It seems to assume, not prove, that the prior research never considered incidental disclosure. Past that, you do a great job identifying and demonstrating the problem. I want to be extra clear here I'm not claiming you didn't independently discover this or do something of value: I give researchers like you plenty credit elsewhere on researching practical problems, identifying solutions, and sharing them. I'm also grateful for those like you who deploy alternatives to common tech like scrypt and tarsnap. Much respect.
My counter is directed at the misinformation rather than you personally. My usual activity. I'm showing this was a well-known problem with potential mitigations presented at security conferences; one product was actually built to avoid it; it was highly cited, with subsequent work in high-security imitating some of its ideas; this prior work/research is not reaching newcomers concerned about similar problems; some people in the security field are also discouraging or misrepresenting it on top of that; and I'm giving the forerunners their due credit plus raising awareness of that research to potentially speed up development of the next, new ideas. My theory is that people like you might build even greater things if you know about prior discoveries in problems and solutions, especially on root causes behind multiple problems. That I keep seeing prior problems re-identified makes me think it's true.
So, I just wanted to make that clear, as I was mainly debunking this recent myth of cache-based timing channels being a 2005 problem. It was rediscovered in 2005, perhaps under a new focus on incidental leaks, in a field where the majority of breakers or professionals either didn't read much prior work or went out of their way to avoid it, depending on who they are. Others and I studying such work have also posted that specific project in many forums for around a decade. You'd think people would've checked out or tried to imitate something in early secure VMMs or OSs by now when trying to figure out how to secure VMMs or OSs. For some reason, the majority of industry and FOSS don't. Your own conclusion echoes that problem of apathy:
"Sadly, in the six months since this work was first quietly circulated within the operating system security community, and the four months since it was first publicly disclosed, some vendors failed to provide any response."
In case you wondered, that was also true in the past. Only the vendors intending to certify under the higher levels of TCSEC looked for or mitigated covert channels. The general market didn't care. There's a reason: the regulations for acquisition said they wouldn't get paid their five-to-six-digit licensing fees unless they proved to evaluators that they applied the security techniques (e.g., covert-channel analysis). They also knew the evaluators would re-run what they could of the analyses and tests to look for bullshit. It's why I'm in favor of security regulations and certifications, since they worked under TCSEC. Just gotta keep what worked while ditching the bullshit like excess paperwork, overly prescriptive rules, and so on. DO-178B/DO-178C has been really good, too.
As for why FOSS doesn't give a shit, I'm not sure. My hypothesis is cultural attitudes, how security knowledge disseminates in the groups, and rigorous analysis of simplified software not being fun to most developers versus piles of features they can quickly throw together in a favorite language. Curious what your thoughts are on the FOSS side of it, given the FOSS model always had the highest potential for high-security thanks to its labor advantage. As far as high-security goes, FOSS never delivered it even once, with all the strong FOSS made by private parties (esp. in academia) or companies that open-sourced it after the fact. Proprietary has them beat, from kernels to usable languages, several to nothing.
So it's possible that some of the Raspberry Pi versions are in fact vulnerable to (much weaker versions of) Meltdown and Spectre.
* https://lists.opensuse.org/opensuse-security-announce/2018-0... (https://news.ycombinator.com/item?id=16081366)
Future microcode updates mentioned:
* https://newsroom.intel.com/wp-content/uploads/sites/11/2018/... (https://news.ycombinator.com/item?id=16079910)
Perhaps this will finally provide enough incentives to model data sensitivity in the type systems of practical programming languages.
What I'm saying with the type system comment is this: the cache side effects of speculation are desirable most of the time but not always. We should find a way to model data sensitivity in the type system so that a compiler can automatically choose to generate a side-effect-free code sequence where the side effects must be avoided (this assumes a future with ISA extensions that allow telling the CPU to prevent such side effects by blocking the speculative execution).
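As a rough illustration of what that could look like at the language level (a hypothetical sketch, not any real proposal's API; the `Secret` wrapper and `declassify` name are made up), a sensitivity-aware type could refuse to let secret data flow into branches or array indices until it passes an explicit declassification point where a compiler could emit a speculation barrier:

```python
from typing import Generic, TypeVar

T = TypeVar("T")

class Secret(Generic[T]):
    """Wrapper marking a value as sensitive: it may not feed control
    flow or address computation, the two channels speculation leaks."""

    def __init__(self, value: T):
        self._value = value

    def __bool__(self):
        raise TypeError("secret values may not influence control flow")

    def __index__(self):
        raise TypeError("secret values may not be used as array indices")

    def declassify(self) -> T:
        # The single audited point where a real compiler would insert a
        # speculation barrier before releasing the value.
        return self._value

key = Secret(0x2A)

try:
    taken = bool(key)          # branching on a secret is rejected
except TypeError:
    taken = None
assert taken is None
assert key.declassify() == 0x2A
```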
Let's ignore Meltdown, which seems solvable in hardware with no obvious performance loss (assuming AMD's existence proof is correct), and concentrate on Spectre, which everyone thinks means the world is ending.
One thing to note is that compilers have been safely speculating instructions for years. Shocking, I know :)
Processors could too.
They just weren't.
One of the cardinal rules of safe speculation is that you can't speculate an instruction unless you can prove it has no observable side effects (i.e., possibly faulting, or in this case, ending up in cache), or that the side effects are precisely the same whether you speculate it or not (where "precisely the same" generally includes some notion of the ordering in which side effects occur; I'm not going to get into all the various intricacies).
a+b -> safe to speculate, has no side effects
load a -> generally unsafe, has possible observable side effects
if (c) { t = load a } else { u = load a } -> safe to speculate load a to right above the if: it must always be executed, and in this example, the side effects will be the same
if (c) { t = load a } -> not safe to speculate load a above the if: it may not execute
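The rule behind those examples can be stated as a tiny predicate (a toy model of the compiler's reasoning; the function name and arguments are hypothetical):

```python
def safe_to_speculate(has_side_effects: bool,
                      executes_on_all_paths: bool) -> bool:
    """An instruction may be hoisted above a branch only if it is
    side-effect-free, or if every path through the branch executes it
    anyway (so the side effects happen regardless)."""
    return (not has_side_effects) or executes_on_all_paths

# a+b: no side effects, safe even if only one path uses it
assert safe_to_speculate(False, False)

# load guarded by a single if-arm: unsafe to hoist
assert not safe_to_speculate(True, False)

# load appearing on both arms of the if: safe to hoist above it
assert safe_to_speculate(True, True)
```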
So processors were/are assuming that speculation of loads, calls, etc, had no side effects because they could "throw away the results". As the processor controlled the faulting, it could just throw away the fault and pretend it never happened.
These two attacks are just proving that is not true, and the side-effects of computation themselves are observable.
The end result should be the same. Processors speculate more like compilers do: only in safe situations.
The idea that you have to give up all speculative execution seems very wrong.
You only have to give it up in cases where there are possibly observable side-effects and it's not guaranteed they will always happen.
The upshot is the most likely outcome is that both processors and compilers will work harder to speculate.
Compilers will get called upon to do more safe speculation.
Processors will have to grow the logic to determine when speculation is safe (or figure out a way to actually undo all side effects, which is fairly hard).
Now, the downside is the most useful speculation is obviously to hide load/store latency, and those things are the hardest to reason about safety.
But like I said, compilers have been doing it for many years at this point.
Our hardware brethren just found out the hard way that they are likely to start having to do it too, and that there are more observable side-effects than they thought.
 Some JITs do it and catch the result in a fault handler if it turns out badly. I expect someone is going to discover ways to exploit all of these basically instantly. These will not be fixed by processor related fixes because they are not processor directed.
Not being able to fetch new cachelines when under speculation would be a huge blow, as exposing memory level parallelism is one of the most important features of OoO CPUs.
Edit: also fetching a cacheline is not really undoable as it is observable from other CPUs via the coherency protocol: i.e. while it might be possible to hide to the local cpu that a cacheline was loaded under a failed speculation, the effect can be still observed by another core by noticing that an exclusive line is now shared (by timing for example the latency of an atomic instruction)
Yes, it is.
Remember, again, that the vast majority of those instructions can still be speculated, because the vast majority of instructions are not loads or stores.
Now, certainly, the expensive ones are loads and stores, but i'm just pointing out that the hundreds of instructions you are talking about in the buffer are mostly not loads and stores.
It's true that lowering memory level parallelism would be a huge blow, as the vast majority of time in well-tuned cpu bound apps is usually spent in stalls waiting for memory (otherwise, if it's really just arithmetic bound, it may make more sense to run it on a GPU or something), and this would just increase it.
The real question is what percent can you prove are safe to speculate, and at what point can you prove that safety (ie assuming it must be dynamically speculated, can you prove safety with enough cycles left that it matters). If you have 5 instructions, yeah, no, probably not.. But it may also be the case that the execution environment can prove it safe for you as the program executes and tell you.
I expect getting back this performance is going to be done using a variety of methods, some cooperation between jits/compilers and processors, and possibly some weird abstractions around marking memory you want to protect or not (IE not speculate around).
I mean, in the absolute worst case, you could make loads/stores take constant time and speculate as much as you like :)
It's just that this has a much higher performance cost right now than not speculating at all (by a few orders of magnitude)
I think realistically the only safe and not completely performance crippling workaround will be, at the very least, to run any untrusted code under a separate address space (assuming that the cpu is immune from meltdown). That doesn't necessarily require a full blown separate process, but something like memory protection keys might work.
The alternative is full static analysis and source level annotations, which realistically is only going to be done for very few programs and will still be error prone.
Otherwise, they can observe it before you roll it back due to the way the coherency protocols work.
> Compilers will get called upon to do more safe speculation
Compilers can also emit code that minimizes those situations. Also, the ISA could be extended with an instruction modifier that signals that an otherwise-innocent instruction has observable side effects (ideally the processor should know that, but if the compiler already knows it, a runtime check can be skipped).
> Processors will have to grow the logic to determine when speculation is safe
In many cases, this could be as simple as flagging the micro-op as unsafe and pausing speculation at that point.
For Meltdown, the just-add-silicon approach could be to never share cache between privileged and unprivileged code. To extend that to Spectre, never share cache across different PIDs (but then the ISA would have to know what a PID is). Since that would reduce the cache's effectiveness, caches would have to grow.
Fun times ahead.
Yesterday we could only guess it (https://news.ycombinator.com/item?id=16069740) based on ARM's CPU list.

The RPi 1-3 CPUs:
ARM11, Cortex-A7, Cortex-A53

Affected ARM cores:
Cortex-R7, Cortex-R8, Cortex-A8, Cortex-A9, Cortex-A15,
Cortex-A17, Cortex-A57, Cortex-A72, Cortex-A73, Cortex-A75
(I tried to post it 3 hours ago, but HN is rate-limiting my posts, oh well)
> However, suppose we flush our cache before executing the code, and arrange a, b, c, and d so that v is zero. Now, the speculative load in the third cycle:
> v, y_ = u+d, user_mem[x_]
> will read from either address 0x000 or address 0x100 depending on the eighth bit of the result of the illegal read. Because v is zero, the results of the speculative instructions will be discarded, and execution will continue. If we time a subsequent access to one of those addresses, we can determine which address is in the cache. Congratulations: you’ve just read a single bit from the kernel’s address space!
To my understanding it is that saying that by...
1) ...flushing the cache so you have a 'clean' state, you can get...
2) ...the speculative execution to 'pull in' to cache the address user_mem[x_] but...
3) ...the particular address that's pulled into cache, 0x000 or 0x100, is determined by whether...
4) ...the illegal read of kern_mem[address] 8th bit was a 1 or 0...
5) ...which you can then subsequently determine the value of by...
6) ...timing how long it takes to access that user_mem[x] address once again and...
7) ...thereby leaking the value of kern_mem[address]...
So you still have to perform some logic on how fast the access to the secondary address read was, right?
If the read of 0x000 is slow, you know the bit of kern_mem[address] was a 1, and if fast, a 0; and if the read of 0x100 is slow, you know the bit was a 0, and if fast, a 1?
Is that correct?
If it is it seems that timing is the key right, and actually the clever leap of creativity in completing the exploit, at least to my untrained mind.
Please do correct anything I've got wrong, I'm not an engineer/developer!
What does that do, besides turn the exfiltration problem from an immediate one into a statistical one?
Maybe there will end up being a new Jumping Around Kernel Address Space System (JAKASS, a cousin of Linux's FUCKWIT patch) that periodically resets kernel ASLR to make it fully impossible.
The paper they linked to references this one: https://www.usenix.org/system/files/conference/usenixsecurit...
I think this is what all sandboxes have to do: set the TSC disable flag, restrict system timer precision (make it configurable per sandbox: web servers generally don't need more than 1ms precision), make system timer report fuzzy (randomized) time. Heck, why not also make the CPU run at randomized frequency to mess with busy loop timers.
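A minimal sketch of the timer-coarsening idea, assuming a per-sandbox granularity knob (the function name and defaults are made up): the clock is rounded down to a coarse grain and given random jitter, so the few-hundred-cycle difference between a cache hit and a miss disappears below the noise floor.

```python
import random
import time

def sandbox_time(granularity_s: float = 1e-3, jitter_s: float = 5e-4) -> float:
    """Fuzzy clock for untrusted code: quantize to `granularity_s`
    (e.g. 1 ms for a web server sandbox) and add random jitter so
    fine-grained cache-timing measurements become unmeasurable."""
    t = time.monotonic()
    quantized = (t // granularity_s) * granularity_s  # drop fine precision
    return quantized + random.uniform(0, jitter_s)    # add fuzz

# With zero jitter the result is an exact multiple of the granularity,
# which is what defeats timing a single memory access.
v = sandbox_time(granularity_s=1.0, jitter_s=0.0)
assert v == float(int(v))
```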
Consider an Olympic 100 metre sprinter. Today we time this event very accurately, I think it's to one hundredth of a second, using sophisticated technology.
But even if the judges used a much less accurate mechanical stopwatch, Usain Bolt wouldn't actually be slower, we'd just be less confident of how ridiculously fast he is.
It's also quite fun to think that the little Pi I have chugging away in a tiny corner doing a variety of background tasks, which was already the most trouble-free machine I own, may also be the safest (OK I know that's an oversimplification, but I'm feeling affectionate towards it).
Instead, the read-ahead/speculative logic causes one of two addresses in user space to be read, and thus placed in the cache. So, by reading both of them, and checking the time it took, the exploit can indirectly determine one bit (0 or 1) of kernel memory. Scary!
There was a comment below the article that explained this part a little further:
> Imagine the value at the kernel address, which gets loaded into _w, was 0xabde3167. Then the value of _x is 0x100, and address user_mem[0x100] will end up in the cache. A subsequent load of user_mem[0x100] will be fast.
> Now imagine the value at the kernel address, which gets loaded into _w, was 0xabde3067. Then the value of _x is 0x000, and address user_mem[0x000] will end up in the cache. A subsequent load of user_mem[0x100] will be slow.
> So we can use the speed of a read from user_mem[0x100] to discriminate between the two options. Information has leaked, via a side channel, from kernel to user.
The remaining part is to iterate the process over all of the bits in the word, using different bitmasks. The resultant set of 0 or 1 results for each bit yields the complete word.
Then one iterates that whole process over all (useful) words in (mapped) kernel memory.
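Putting those steps together, here's a toy end-to-end simulation in the article's Python conceit (the simulated cache, the 0x000/0x100 probe addresses, and the "kernel" byte are all stand-ins, not real exploit code):

```python
KERNEL_SECRET = 0xA5   # the byte we pretend lives at a kernel address

def speculate(bit_index: int, cache: set) -> None:
    # The illegal read happens speculatively; the dependent load's
    # address (0x000 or 0x100) encodes one bit of the secret before
    # the speculated results are squashed.
    bit = (KERNEL_SECRET >> bit_index) & 1
    cache.add(0x100 if bit else 0x000)

def probe(cache: set) -> int:
    # A fast (cached) access to user_mem[0x100] means the bit was 1.
    return 1 if 0x100 in cache else 0

recovered = 0
for i in range(8):
    cache = set()                     # 1) flush to a clean cache state
    speculate(i, cache)               # 2-4) speculative load keyed on bit i
    recovered |= probe(cache) << i    # 5-7) timing probe reveals the bit

assert recovered == KERNEL_SECRET
```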
I got my computer engineering degree in 1999 and ended up going the computer science route making CRUD apps all day. I feel in my gut that some engineer, somewhere, MUST have asked this question at one of the big chip manufacturers.
Am I missing something fundamental? Is the access check too expensive? If it isn't, then can the microcode be updated to do this, or is caching/accessibility checking happening at a level above microcode? If that's the case then it would seem that pretty much all processors everywhere that do speculation without protected memory access checks are now obsolete.
From what I understand, it does happen on AMD, which is why AMD CPUs are not vulnerable to the more dangerous Meltdown attack (any code reading kernel / hypervisor host memory).
Intel and ARM delay the checks until later, to the time when the speculated instructions are actually finalised and their results made available. This is faster, and loading some memory into the cache is normally invisible to unprivileged code. The checks would still be done when actually reading that memory. But nobody spent enough time considering the timing side effects of the cache.
Going forward, we may have to assume that security is only possible with true process isolation. For example this might put pressure on OSs to fix their slow context switching implementations to encourage the use of processes instead of threads. Beyond that, I can't see any easy way to fix the situation and am highly skeptical of things like compiler fixes, because there will likely always be another way to abuse various instructions to read outside memory boundaries.
Most 32-bit operating systems ignored the segmentation system, basically just running everything in what the old timers would call "small model".
As long as the sandboxed code cannot change the segment registers, this would prevent it from generating an address outside the sandboxed portion of the processes' virtual address space.
I don't recall if the x86 segment system provides a way to trap attempts to change a segment register. If I recall correctly, it does support more than just the two level kernel/user protection system, and I think it supports not allowing loading a segment register with a selector that refers to a segment belonging to a higher level, so maybe if user mode was split into two levels, so sandboxed code could be run at a less privileged level than the main process it could work.
In general, I think processor designers need to take into account the need for processes to run sandboxed code, and provide some kind of mechanism the processes can use to protect themselves from malicious code in the sandbox.
I just now understood the impact of Spectre. It is not just that all existing attempts to execute code in a sandbox are vulnerable. For CPUs with this problem, it is literally impossible to create a secure sandbox.
We certainly live in interesting times...
(hoping for someone to correct me if I'm wrong)
If that happened, Meltdown wouldn't be possible on the processors on which it is possible.
Those checks in no way mitigate Spectre. Spectre is both simpler and in many ways more profoundly devastating -- as long as an attacker can influence the statistics that the CPU uses for branch prediction, the very act of trying to "work ahead" in another process can be influenced by the attacker, and will produce some side-effects even if you throw out the result, and those side-effects suffice to infer information about the work you did but threw out.
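To make that concrete, here is a toy Python simulation of the bounds-check-bypass idea (all names are hypothetical, and the "cache" and "predictor" are crude stand-ins for the real hardware): the predictor is trained on in-bounds accesses, then a mispredicted out-of-bounds access is architecturally squashed, yet its cache footprint survives.

```python
# Toy simulation of Spectre-style bounds-check bypass (illustrative only;
# real attacks exploit the hardware branch predictor and data cache).

ARRAY = [1, 2, 3, 4]                 # in-bounds data the victim may read
SECRET = 42                          # value sitting "past" the array

def victim_load(index):
    """Pretend memory: reads past ARRAY return the secret."""
    return ARRAY[index] if index < len(ARRAY) else SECRET

class ToyCPU:
    def __init__(self):
        self.predict_taken = True    # trained "bounds check passes" predictor
        self.cache = set()           # which lines have been touched

    def run_victim(self, index):
        # The architectural bounds check...
        in_bounds = index < len(ARRAY)
        # ...but speculation follows the *prediction*, not the check.
        if self.predict_taken:
            value = victim_load(index)      # speculative load
            self.cache.add(value)           # side effect: cache line touched
        if not in_bounds:
            return None                     # result squashed architecturally
        return victim_load(index)

cpu = ToyCPU()
# Train: in-bounds accesses keep the predictor saying "taken".
for i in range(4):
    cpu.run_victim(i)
# Attack: out-of-bounds index; result is squashed, side effect survives.
assert cpu.run_victim(100) is None
leaked = (cpu.cache - set(ARRAY)).pop()
print(leaked)  # → 42
```

The squashed load never returns a value to the program, but the attacker can still recover it from which cache line is warm.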
There is no memory protection violation because it happens in the kernel; obviously, the kernel can read its own memory.
Sure, they were about to commit career suicide but then they learned to love the bomb and went on with their day. Maybe they even tried to explain the problem to management but somehow it got lost in translation.
When you consider a theoretical model of the CPU, then it's not leaking - the speculative execution, cache, and other parts of the CPU are designed carefully so that no data can "escape" and be read by processes that don't have permission. Speculative execution can happen, but before any results from it are released, the permissions are checked, and if they fail, the results are discarded.
What people did not consider is the timing attacks that do leak information. It's only "clear" to us now after the attack has been demonstrated, even if it has been present in CPU design for the last 20 years.
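A crude Python model of such a timing channel (the latencies are invented numbers, not real measurements): the victim's single access leaves a cache footprint, and the attacker recovers the secret purely by timing its own accesses.

```python
# Toy flush+reload-style timing channel (illustrative only).

CACHE_HIT_NS, CACHE_MISS_NS = 10, 200   # assumed, made-up latencies

cache = set()

def load(addr):
    """Return a pretend access latency and pull the line into the cache."""
    latency = CACHE_HIT_NS if addr in cache else CACHE_MISS_NS
    cache.add(addr)
    return latency

# Victim: touches one probe line chosen by a secret byte.
secret = 7
cache.clear()                 # attacker "flushes" the probe lines
load(secret)                  # victim runs, leaving a footprint

# Attacker: times every probe line; the fast one reveals the secret.
timings = {addr: load(addr) for addr in range(256)}
recovered = min(timings, key=timings.get)
print(recovered)  # → 7
```

No data value ever crosses the protection boundary; only the *time* each access takes does, which is exactly why the theoretical model missed it.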
There are probably many more of these side channel data extraction paths possible. For example, in the recent years attacks on cryptographic algorithms have looked at similar timing measurements, and in some cases power consumption measurements.
No one in the CPU architecture design arena put 2 and 2 together to realize that the same side channel that was devastating for cryptography work would also be quite devastating for bypassing memory permission protections in the CPUs they were designing.
Probably because the intersection of "cryptographers who can mount timing side channel attacks" and "CPU architecture designers" is very close to zero.
Keep in mind that for a very long time, PCs and x86 CPUs were used in environments where either there was a single user with full access, or multiple users that aren't completely adversarial (look at Win95/98's multiuser security model, for example.) Memory protection and other security features served as a barrier to accidents and to "keep the honest honest", not determined adversaries.
They still are today, but this is very different from the shared servers/cloud computing environments which have now become common --- completely mutually untrusting users with possibly adversarial relationships are sharing the same hardware.
This is the reason why all the CPU manufacturers have had some variant of "as designed" in their public comments --- speculative execution was designed with the former model in mind, not the latter.
Until this news broke, a CPU designer would tell you that speculative execution is a well-understood, proven approach to gaining a lot of performance. Branch predictors are really good at figuring out where the next instruction is going to come from, and are a really important tool for avoiding stalling out the whole machine while you wait for the next instruction to come back from memory.
And intuitively, it seems really "safe." All you're doing is having the CPU get ready to perform "future" calculations more quickly. It gets to guess where the program is going to go, and start fetching resources that it thinks will be needed. As long as nothing architecturally visible is changed by these preparations, everything is functionally the same, so what could go wrong?
And intuitively, you wouldn't think that something like a cache, which the program has no way to access directly, should be architecturally visible. Even putting on a security-minded hat, you would think that it doesn't matter what's in the cache, because if a program tries to access kernel memory, the access still has to undergo a permissions check.
The attack is pretty damn clever. And disheartening.
IIRC, one of my professors at one point explained modern CPU behaviour (branch prediction/OOOE etc.) as, roughly: the CPU can do whatever it likes under the hood, because the program never sees what happened there - it just has to retire results in order and make sure they are correct.
The leak is only clear in retrospect. Many, many things are only clear after you see how they were done.
It has been twenty years since processors with this vulnerability started appearing. Over those two decades, thousands of very smart engineers (including state-sponsored ones) have collectively spent millions of hours of analysis trying to find security flaws.
No one has found such a clever timing attack until now. So "clearly leaking" wasn't "clearly leaking" until this week.
However, even at the time of the design it would have been obvious that deferring security checks is a risky design choice.
One question I still have that gets glossed over is how timing of instructions is captured.
In both cases, the (Spectre) attacks can be prevented by browser updates, so any performance impact is not system-wide.
This is different from Meltdown, which (only?) affects Intel. That one requires kernel changes which cause system-wide performance degradation.
* make such attacks more difficult.
Particle physicists are experts at this, extracting tiny signals from huge piles of noisy data.
Updated for those of μs confμsed by this μngainly grammar
I wouldn't write off the ability to get a useful side-effect signal. The variants widely documented are not the only possible methods of inducing speculative side-effects.
I didn't check, but these will almost certainly have branch prediction. What they probably lack is a predictor advanced enough to speculate on indirect branches, which AIUI is the primary vector of Spectre.
The Cortex-A53 branch predictor does prefetching to keep the core fed. This ensures that the instructions are ready for decoding, but has no architectural effects beyond the L1 instruction cache, which is already a well-studied timing sidechannel.
Here's the Cortex-A53 pipeline: https://www.anandtech.com/show/11441/dynamiq-and-arms-new-cp...
It's an in-order CPU, so that "issue" phase (pipeline step 5) stalls until the instruction pointer is resolved. Instructions must be issued to the "AGU Load" functional unit, which is what actually performs the read and pulls data into the cache hierarchy.
Note also that a single speculative memory load is insufficient for Spectre. You need two speculative memory loads.
I tried doing that on RPi 3, but the IO seemed not up to the job -- the CPU appeared to be just about tolerable, but using micro SD as a disk was too slow and prone to failure (I'd have tried an external USB disk but I believe the problems were in part because of poor I/O bandwidth). Other single board machines seemed to have better provision for disks that are up to the task I had in mind, but lack software support, so that I had little confidence in security updates, for example.
If somebody sold this I think they'd have my money tomorrow:
* An ARM mini-PC
* With a decent security update team behind it (probably the hard part?)
* That will let me run some basics: for me, a Unixy OS with Chrome/Chromium, emacs, ledger and python, without a big effort to install those and keep them up to date
* Ideally without too much anti-commodification BS (from my customer perspective) so that hardware can be swapped out if needed
Does anything like that exist?
arm-powered chromeboxes don't exist (yet?), so perhaps an arm-powered chromebook?
(chromeboxes tend to be more upgradeable than their chromebook counterparts)
Linux OS, kept up to date for you (with google backporting security fixes to their kernel, and of course - updating their browser), running on arm.
Can be used as a simple browser-only machine, or if you are decently comfortable with linux, you can unlock its potential and use it as a fully-fledged linux machine. Your choice.
If you want to avoid being part of the google foodchain, you could try dual-booting into another arm distro of your choice.
Best I can think of, at the moment ...
> User enters password
> PW gets hashed
> Hash gets compared to the DB
I don't know of any sane system that would allow you to compare a hash to a hash? Unless you have access to something you shouldn't, in which case it doesn't matter anyway, because you can probably just read the hashes.
If, say, you submit a password whose hash starts "b94", and the database doesn't use a constant-time comparison, you can use the timing to figure out that the stored hash also starts with "b94" (statistically, given the network etc. delays involved), meaning you can pre-filter your submitted guesses (i.e. bruteforce offline and only submit guesses whose hashes start "b94").
It's definitely an edge case, though (and probably not worth worrying about unless you don't salt or rate-limit requests). I also don't know if the number of requests needed to determine the timing would actually be less than just making random guesses outright (intuitively it seems so, because even if it takes a lot of requests, it shrinks the search space at each step).
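To show the leak without flaky wall-clock measurements, here is a sketch (helper names are hypothetical) that counts comparison steps instead of time; Python's `hmac.compare_digest` is the standard constant-time alternative.

```python
import hmac

def naive_equal(a, b):
    """Early-exit comparison: work done depends on the matching prefix."""
    steps = 0
    if len(a) != len(b):
        return False, steps
    for x, y in zip(a, b):
        steps += 1
        if x != y:
            return False, steps   # leaks the position of the first mismatch
    return True, steps

stored = "b94d27b9934d3e08"
# A guess sharing the "b94" prefix does measurably more work...
_, close = naive_equal(stored, "b94aaaaaaaaaaaaa")
# ...than one that mismatches on the first character.
_, far = naive_equal(stored, "aaaaaaaaaaaaaaaa")
print(close, far)  # → 4 1

# The fix: a constant-time comparison whose running time does not
# depend on where the strings differ.
print(hmac.compare_digest(stored, "b94aaaaaaaaaaaaa"))  # → False
```

In the real attack the step count shows up as a tiny timing difference, which is why it has to be averaged over many requests to beat network jitter.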
In real-world terms, what's the fastest processor we could build today whose execution speed is reasonably matched to its main memory access speed (so it doesn't need caches, etc.)?
I could imagine that a processor, with a simple design that closely matches a naive model of how CPUs work, would be very useful for high-security applications. It would be much easier to reason about up-front.
SRAM can go fast since that’s what caches are made with, but that’s expensive. Also it would need to be close to the CPU as wire latency is non-trivial at high clock speeds.
In fact on the Beeb, spiritual ancestor of the RPi, memory ran at 4MHz and the CPU at only 2MHz...
Edit: this is what I'm talking about: https://www.bigmessowires.com/2011/08/25/68000-interleaved-m...:
> The Atari ST and Amiga appear to have both used a more aggressive scheme where video circuitry access occurred during known dead time in the 68000 bus cycle, so the CPU never had to wait
See for starters this discussion about reinventing the AS/400: https://news.ycombinator.com/item?id=16053518
To me it would make a lot more sense to use a special value to indicate the read did not succeed and propagate this value until it is time to crash. I guess this introduces some overhead (e.g. reserve a special value); but are there any other drawbacks?
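A sketch of that poison-value idea in Python (all names hypothetical): the sentinel propagates through dependent operations like a NaN, and the fault fires only when the value would become architecturally visible.

```python
class Poison:
    """Sentinel for a denied privileged read; propagates like NaN."""
    def _taint(self, other):
        return self
    __add__ = __radd__ = __and__ = __rand__ = __mul__ = __rmul__ = _taint

POISON = Poison()

def privileged_read(addr, allowed):
    # Instead of silently loading data (and touching the cache),
    # a denied read yields the poison value.
    return 0x5A if allowed else POISON

def retire(value):
    # The fault is raised only when the value becomes architecturally visible.
    if isinstance(value, Poison):
        raise PermissionError("poisoned value reached retirement")
    return value

v = privileged_read(0xFFFF0000, allowed=False)
v = (v & 0x100) + 1          # poison propagates through the dependent ops
try:
    retire(v)
except PermissionError as e:
    print(e)  # → poisoned value reached retirement
```

The overhead the comment mentions is visible here: every data path has to check for (or tag) the reserved value, which in hardware means an extra bit carried through the pipeline.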
But the actual read from memory is allowed to occur, even after the "access denied" signal is given, which allows the read to affect the state of the data caches. This was likely done as a performance booster, because it allows speculative instructions to also perform cache pre-fetching during their speculation window.
That seems to be why AMD CPUs are immune to Meltdown. AMD's design prevents the read from occurring when the "access denied" signal appears, so the cache state is not affected, and there is no side channel to detect.
Why is this? Is it because the CPU doesn't know ahead of time what is valid (because it depends on the "outcome" of instructions in flight), or is there something I'm overlooking?
In the end the result is a confluence of several different topics (speculative execution, data caching, high-resolution timers [although these can be simulated on a multi-CPU system]) that in isolation are each all but harmless, but together produce emergent behavior that was not immediately apparent from each one viewed individually. I.e., without caches there's no side channel to monitor. Without speculative execution there's no way to trick the CPU into reading a bad address while avoiding a memory access fault. Without high enough resolution timers it becomes very hard to detect the time difference between a cache hit and a miss.
A reasonably safe assumption: most CPU architecture designers are not cryptographers, most cryptographers are not CPU architecture designers, and most timing side-channel attacks have historically been against crypto algorithm implementations.
t, w_ = a+b, kern_mem[address]
u, x_ = t+c, w_&0x100
v, y_ = u+d, user_mem[x_]
w, x, y = w_, x_, y_ # we never get here
"One almost wishes that they’d stuck with the original name for the KPTI patchset: Forcefully Unmap Complete Kernel With Interrupt Trampolines."
Now that's funny!!!
Really awesome explanation
These do have dynamic branch prediction/folding AFAIK, and may be affected?
Does somebody have a spectre.c tuned for generic armv5tel for example?
Current versions of spectre.c, like this one https://gist.github.com/LionsAd/5116c9cd37f5805c797ed16fafbe... still contain "_mm_clflush" and therefore do not compile on ARM at all.
I assume there is some way to tell the CPU "when memory location X is read, store the current time in register Y" or some such thing. Could anyone share what that mechanism is?
Instead of measuring the literal time interval between instructions, the number of cycles between two points is measured (using the RDTSCP instruction).
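For reference, the usual pattern is: read the cycle counter, perform the load, read the counter again, and compare the delta to a threshold. A minimal sketch of that classification step, using made-up cycle counts rather than real RDTSCP readings:

```python
# Sketch of how cycle-counter deltas (e.g. from x86 RDTSCP) are used.
# The samples below are invented, just to show the thresholding step.

CACHE_THRESHOLD = 80  # assumed cycle count separating a hit from a miss

def classify(cycles):
    """Classify one timed load as a cache hit or miss."""
    return "hit" if cycles < CACHE_THRESHOLD else "miss"

samples = [38, 41, 212, 35, 198]        # hypothetical RDTSCP deltas
print([classify(c) for c in samples])   # → ['hit', 'hit', 'miss', 'hit', 'miss']
```

The threshold itself is found empirically, by timing known-cached and known-flushed loads and picking a value between the two clusters.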
The RPi may mitigate risk of these attacks simply in the way it is used.
There's also this...
And I'll offer that if you're not capable of demonstrating it after reading Eben's description of how it works, then there is no good reason for you to have an example handed to you.
If you think you are capable I'll offer your time would be better spent working on fixes.
Future vulnerabilities that I could imagine being "worse" would be either encryption vulnerabilities or signals level vulnerabilities.
I think it's interesting: you are not just paying for speed, you are paying for a compromise, because the speed is gained through complexity, which not only increases the chance of error (in design or implementation) but, in the case of a high degree of speculative execution, can translate into worse performance per watt. In short, it's the whole "more is less" thing.
Very good point. Apparently including at least one compromise that most people (probably including the engineers who designed the CPUs) didn't know they were making.
:P yes I stole it from mywittyname
I didn't downvote you (can't even do it), but I suspect you are getting downvoted because your analogy is so off the mark that it can't even be called "Apples vs Oranges".
edit: replying to the "why the downvotes" which has since been edited away
Actually, if it's true that a car gets better mileage than a big airplane (more in one vehicle = more efficient, people seem to believe), I would find that interesting. Similarly, I can see how OP thinks it's ironic that a very cheap machine is not vulnerable whereas a quite expensive piece of equipment is, making it seem less well engineered.
Jet fuel is 37.4 MJ/L; compare to gasoline at 34.2:
An electric car will have up to 150 miles-per-gallon equivalent (aka 150 miles per the same amount of energy that's in one gallon of gasoline):
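Working through the quoted figures (a rough sketch; the 3.785 L per US gallon conversion is an added assumption, the rest are the numbers from the thread):

```python
# Energy figures quoted in the thread.
GASOLINE_MJ_PER_L = 34.2
JET_FUEL_MJ_PER_L = 37.4
LITRES_PER_US_GALLON = 3.785   # assumed conversion factor
EV_MPGE = 150                   # quoted miles-per-gallon-equivalent

gallon_mj = GASOLINE_MJ_PER_L * LITRES_PER_US_GALLON  # energy in one gallon
ev_mj_per_mile = gallon_mj / EV_MPGE                  # EV energy per mile
print(round(gallon_mj, 1), round(ev_mj_per_mile, 2))  # → 129.4 0.86
```

So the 150 MPGe figure works out to roughly 0.86 MJ per mile for the EV; comparing against an airplane would additionally need a passenger-miles-per-litre figure, which the thread doesn't supply.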
So an electric car with just the driver is more efficient energy-wise than an airplane, while an average airplane is better than a gas car even with three people in it.
For instance, many EV proponents note the total cost of ownership difference due to significant differences in maintenance costs. Playing the other side of the argument: they tend to forget that money spent upfront is worth more than money spent later.
It's still possible to achieve some sort of quantitative comparison by discounting each cost by an interest rate based on how far in the future it falls, before summing them. If you applied this to a regular petrol car, you might be able to make a better comparison to the ticket price of a flight.
Of course a 747 in a domestic Japan configuration will have more passengers than a typical BA 747 tatl flight with 100+ beds.
It doesn't even matter though, I said technical details aside...
I'm not stupid, I know a Pi Zero is slow, but it's a cheap computer that happens to be invulnerable to three really bad side-channel attacks that plague all the big shiny expensive ones. How is that not a little amusing? Not any more, now that I've had to argue for it.