The client (a now-vanished startup) had a small 8-bit CPU design that they wanted validated using the technique of executing random sequences of instructions and comparing the results against an emulator. We wrote the emulator independently, working only from their architecture description. Given that most instructions were a single byte plus arguments, and most byte values were valid instructions, the test coverage was pretty thorough. All looked fine until I added support for interrupts, at which point we discovered that an interrupt during a branch would not return to the correct point in execution.
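A minimal sketch of that differential-testing loop, with hypothetical run_on_design()/run_on_emulator() hooks standing in for the real harness and simulator:

    /* Generate a random instruction stream, run it on both the design
       under test and the emulator, and compare the final CPU state. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct { uint8_t regs[4]; uint8_t pc; uint8_t flags; } CpuState;

    /* Assumed hooks: one drives the actual design (e.g. via an RTL
       simulator), the other runs the independently written emulator. */
    extern CpuState run_on_design(const uint8_t *prog, size_t len);
    extern CpuState run_on_emulator(const uint8_t *prog, size_t len);

    bool differential_test(unsigned seed, size_t prog_len) {
        uint8_t prog[256];
        if (prog_len > sizeof prog) prog_len = sizeof prog;
        srand(seed);
        for (size_t i = 0; i < prog_len; i++)
            prog[i] = (uint8_t)rand(); /* most byte values are valid opcodes */

        CpuState a = run_on_design(prog, prog_len);
        CpuState b = run_on_emulator(prog, prog_len);
        if (memcmp(&a, &b, sizeof a) != 0) {
            fprintf(stderr, "state mismatch for seed %u\n", seed);
            return false;
        }
        return true;
    }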
Verifying security properties of processors is really hard; you can go looking for specific types of failure, but I'm not aware of a general way of proving no information leakage.
I think there is no answer that doesn't involve sign-offs on certain things.
In the situation you described, we resorted to fully randomizing everything and then adding constraints to keep the stimulus legal. This is a tedious process, and one that leads to many false-positive failing tests.
However, the opposite approach, writing constrained stimulus from the beginning, leads to gaps where you didn't think to randomize something, or where you over-constrained it.
It's similar to a top-down vs. bottom-up approach, except one is difficult, tedious, and riddled with false positives, while the other is easier and simpler but can hide bugs.
In the face of issues like the one you describe, we implemented randomized interface behavior. Interrupts are always meant to be invisible to program order, so you can randomize them (imagine 100 interrupts back to back, or one every N cycles, where N is random but smaller than 5, or than 500, etc.). Pure randomness isn't interesting on its own; you need controlled randomness. From that point you have to start adding constraints for illegal/impossible situations, and you have to be extremely careful about these. We found a bug a week before tapeout after a senior architect forced a junior engineer (wonder who that was) to add a constraint. He used his seniority to sign it off without going through a group review, and it was a mistake.
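A minimal sketch of that controlled randomness, assuming hypothetical step_cycles()/fire_interrupt() testbench hooks:

    #include <stdlib.h>

    /* Assumed testbench hooks: advance simulation time and pulse the
       DUT's interrupt line. */
    extern void step_cycles(unsigned n);
    extern void fire_interrupt(void);

    /* Fire `count` interrupts separated by random gaps of 1..max_gap
       cycles. A small max_gap (say 5) gives back-to-back storms; a
       large one (say 500) gives sparse, occasional interrupts. */
    void random_interrupts(unsigned count, unsigned max_gap) {
        if (max_gap == 0) max_gap = 1;
        for (unsigned i = 0; i < count; i++) {
            step_cycles(1u + (unsigned)rand() % max_gap);
            fire_interrupt();
        }
    }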
As I understand it, the current consensus is to use a tagged architecture, then show that there are no observable differences when tags mark secret data and the values of that tagged data change.
There are a few rubs here. One is that this is pretty hard to do. Another is that no one wants to pay the overhead of using a tagged architecture. Yet another is that deciding what "observable difference" means is challenging (too little, and you probably miss attacks; too much, and you will probably discover information leaks).
Further, this kind of rigorous system hasn't been rigorously empirically evaluated by a third party (as far as I know, perhaps in part because it's so hard to create these systems). "Empirically evaluated?" you scoff, "there's a proof, what's the point of testing?" Well...
For a glimpse of what this looks like, consider the SAFE architecture: http://www.crash-safe.org/assets/verified-ifc-long-draft-201... and lowRISC: http://www.lowrisc.org/downloads/lowRISC-memo-2014-001.pdf
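To make the tagging idea from above concrete, here's a toy sketch (not any real architecture's scheme): tags propagate through every operation, and the property to verify is that nothing observable changes when the values of secret-tagged inputs change.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint32_t value; bool secret; } Tagged;

    /* Tags propagate: the result is secret if either input was. */
    Tagged tagged_add(Tagged a, Tagged b) {
        return (Tagged){ a.value + b.value, a.secret || b.secret };
    }

    /* The policy: secret-tagged data produces no observable output,
       so varying a secret value cannot cause an observable difference. */
    bool observe(Tagged t) {
        if (t.secret)
            return false;
        printf("%u\n", t.value);
        return true;
    }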
Cadence has a tool to do this, but I'm not sure it would work for timing leaks.
Formal verification of the specs, microcode, and security policy, all of which they hold in ACL2. Part of that is embedding a separation kernel a la the MILS or SKPP model. It also has triplicated registers for fault tolerance. Then they integrate high-level languages like SPARK and microCryptol with it using certified compilers or equivalence checking.
Here are some examples from CompSci work:
I also just found a super-old one that tries to do it with capability-based mechanisms.
Most hardware work on information-flow analysis and control is currently focused on making hardware that controls bad software. I don't think most of it considers a CPU failure. If anything, such designs might be more vulnerable because of the practical strategy of reusing existing cores and running info-flow components side by side with them. Here's an example from that area, though, since those designs could be modified to address the recent concerns.
I'd never thought about it until I saw the above line in the article, and the thought went something like this:
"Assembler instructions which might never have the conditions met for their execution during a program's runtime might be speculatively executed nonetheless, and this, depending on the nature of the instruction executed, might have huge ramifications up to and including program incorrectness and even program failure."
In other words, your absolutely 100% deterministic Turing machine (or programs that you write on it that you deem to be 100% deterministic) -- may not be quite so deterministic when viewed against these understandings...
It adds a whole new dimension to what must be thought about when writing assembler code...
Anyway, it's a really great article!
There is lots of hardware with strange bugs out there, and in some settings (embedded devices, mission-critical systems, etc.) it can be easier to work around them than to try to fix the hardware. I know of a medical systems manufacturer that still sticks to some incredibly old CPUs, a home-grown operating system, and gcc 2.95, because after 15 years they think they at least know about all the problems and how to work around them. Fixing the issues would pretty much require them to re-validate the whole design and repeat all the testing, which would take several years.
One could argue that the introduction of cloud computing (using virtualization) has converted the machines we run on from deterministic to sorta-deterministic. It just happens that until now we haven't been hurt by it much. Now everyone's paying for it, to the tune of a sizable performance hit on syscalls.
But still no. Deterministic means the input maps consistently to the output; the hardware, in this sense, is doing the mapping. Deterministic does not mean that for all x the only mapping is f(x). It means that when f(x) is the mapping, f(x) does not change to f'(x). However, g(x) is a perfectly deterministic possibility for a mapping too. The point is that two pieces of hardware, f(x) and g(x), can independently and deterministically map x; the existence of two pieces of hardware does not affect whether it's possible to deterministically map x. The reality (and I think your point) is that if you are unaware that a different piece of hardware is performing the mapping, the results can appear nondeterministic. But the original point of the medical-company anecdote is that this is only an appearance, and in some domains people go to great lengths to make sure they intimately understand their f(x) and aren't using a g(x) they don't understand.
So the game was no longer "using the xdcbt instruction", but the branch predictor still caused issues because they put a jmp instruction in front of it instead of removing the instruction entirely?
There were other ways xdcbt could cause problems (what if the block of memory was freed before being purged from L1?), so who knows; I don't remember for sure.
if (flags & PREFETCH_EX)
An untrained branch predictor could still take the "wrong" path even if none of the code that reaches that if statement takes the bad path. That's presumably what happened here: the calling program removed all instances of PREFETCH_EX, but the CPU would still sometimes speculatively execute that branch, such as on the first run through the function.
This is a lot like the problem of "don't think of a pink elephant". The mere existence of the "evil opcode" in the library means the CPU might execute it under certain circumstances.
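For context, a hypothetical reconstruction of the kind of helper being discussed (identifiers invented; __dcbt/__xdcbt stand for the Xenon prefetch intrinsics, so this won't compile outside that toolchain):

    /* Even if every caller stops passing PREFETCH_EX, an untrained or
       aliased branch predictor can still speculatively steer execution
       into the taken path and issue the dangerous xdcbt. */
    void prefetch_line(const void *addr, unsigned flags) {
        if (flags & PREFETCH_EX)
            __xdcbt(addr); /* extended prefetch, the "evil opcode" */
        else
            __dcbt(addr);  /* ordinary cache-block-touch prefetch */
    }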
Unless there's some variable-width mode like ARM's Thumb that I'm not aware of, but even then you're limited to 2-byte boundaries.
It's certainly possible that a CPU could delay the same sort of exception processing for "NX" flags until later in the pipeline, but still before retirement, allowing a similar sort of problem as Meltdown.
But I think for it to happen randomly, you'd need an invalid pointer that just so happens to jump to what "becomes" an instruction with a bug, plus a branch. That seems pretty rare at first glance.
The UNIX fix for this was to use a raw device or O_DIRECT to bypass the buffer cache.
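For example, on Linux (note that O_DIRECT requires I/O buffers aligned to the device's logical block size):

    #define _GNU_SOURCE /* exposes O_DIRECT in glibc headers */
    #include <fcntl.h>

    int open_uncached(const char *path) {
        /* Bypass the kernel's buffer cache entirely. */
        return open(path, O_RDWR | O_DIRECT);
    }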
Maybe Intel's new cache-partitioning feature (Cache Allocation Technology) offers a similar fix, see:
Actually, in the comments someone mentions using cache partitioning for security. Maybe the threads used by JIT code could be placed in their own cache partition to avoid some of Spectre.
- Meltdown, on affected processors (basically anything from Intel), gets you the ability to read anywhere in memory, whether you could legally map it or not (e.g. you can read kernel, user, or whatever memory as an unprivileged user, albeit on the order of bytes per second)
- Spectre affects approximately anything with a cache and speculative execution, but only buys you reads from memory you could legally map; e.g. you can read anything the process you're in could legally read, but you can't, say, read kernel memory with it as an unprivileged user.
> If the kernel's BPF JIT is enabled (non-default configuration), it also works on the AMD PRO CPU. On the Intel Haswell Xeon CPU, kernel virtual memory can be read at a rate of around 2000 bytes per second after around 4 seconds of startup time.
The problem with the X360 is that it can't run arbitrary native code; it can only run code digitally signed by MS. The hypervisor refuses to run unsigned code, and the key never leaves the CPU, i.e. it's very hard to tamper with.
Spectre might make it possible to dump the public keys, but they're not very useful to us.
> [insert X here] was no longer guaranteed, but hey, we’re video game programmers, we know what we’re doing, it will be fine.
but I'm not sure.
How would you know it’s not the green stuff? I don’t know, perhaps by size. I would imagine the green stuff is some other kind of cache or memory (op cache? TLB buffers? patchy SRAM).
This is what I know from seeing these shots online for years. I’m no expert by any means.
But I seem to remember being told that the green areas were the VMX pipelines. I don't remember clearly, and the hardware people I asked at the time were uncertain - they were too far removed from that aspect of the hardware.
The other reason to assume the blue areas are the caches is that they're closer to the crossbar, and the crossbar is what connects the cores to the L2. Incoming data from the L2 goes first to the L1 caches, so they would logically be on that side.
Another solution would be to split the existing function into safe_function() and unsafe_function() and remove the PREFETCH_EX flag entirely, as sketched below. But then you have to explain to callers that they must not guard calls to unsafe_function() with any kind of conditional statement.
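A sketch of that split, using the names from the comment (the __dcbt/__xdcbt intrinsics are Xenon-specific); the caller-side caveat is the important part:

    /* The split itself is trivial... */
    void safe_function(const void *addr)   { __dcbt(addr);  }
    void unsafe_function(const void *addr) { __xdcbt(addr); }

    /* ...but this caller-side pattern reintroduces the bug: the branch
       predictor can speculatively take the call and issue xdcbt anyway. */
    void caller(const void *addr, int use_extended) {
        if (use_extended) /* exactly the guard callers must avoid */
            unsafe_function(addr);
        else
            safe_function(addr);
    }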
The author hinted at possible solutions but it was probably a case of "twice bitten" by that point.
Keep in mind that it's possible for a branch predictor to predict a branch it's never seen before with "high confidence", because many branches can alias to the same entry in the predictor's tables. The article touches on this.
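A toy illustration of that aliasing (a simple bimodal-style predictor indexed by low PC bits; the real predictor's layout isn't public, so this is purely illustrative):

    #include <stdint.h>

    #define TABLE_BITS 12
    /* 2-bit saturating counters: 0-1 predict not-taken, 2-3 taken. */
    static uint8_t counters[1u << TABLE_BITS];

    static unsigned predictor_index(uint64_t pc) {
        /* Any two branches whose addresses agree in these low bits share
           a counter, so a branch the CPU has never seen can inherit
           another branch's "strongly taken" state. */
        return (unsigned)(pc >> 2) & ((1u << TABLE_BITS) - 1);
    }

    static int predict_taken(uint64_t pc) {
        return counters[predictor_index(pc)] >= 2;
    }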
Sooo, out you go, G5. Any suggestions what to use for online banking? A 486 can't handle all the jQuery and tracking scripts.
In this specific instance, the xdcbt instruction accidentally does a bad thing in the Spectre Zone that leaks back into the real world through a side-channel in the L1 cache. Crafted malicious code could do another bad thing in the Spectre Zone that ordinarily isn't allowed in the real world, then leak the results back to the real world through any side-channel that works.
On the Xbox 360 specifically, it could even use xdcbt vs. dcbt for that side channel. If you set code up to always execute xdcbt on core 2, and always dcbt on cores 0 and 1, you might be able to do timing-based attacks between core 0 (closer to L2) and core 1 (further from L2), comparing against core 2 (bypassing L2).
A different version of the same PowerPC chip family would have its own side-channels. All those side-channels are hardware-specific.
The 360 CPU (and the closely related PS3 CPU) is the perfect example of this. Despite being 100% in-order, it has a very long instruction pipeline (up to 50 instructions long) and branch prediction. And that's all Spectre needs: a long pipeline plus branch prediction.
I'm not even 100% sure the Intel Atom is safe from Spectre and Meltdown. Its pipeline is much shorter, only 16 stages long, but it still speculatively executes up to 32 instructions (2 instructions per cycle).
It just makes things harder: you have to find a Spectre gadget that's short enough. It's piss-easy for Meltdown, though, because you can just hand-assemble short code.