
Finding a CPU Design Bug in the Xbox 360 - nikbackm
https://randomascii.wordpress.com/2018/01/07/finding-a-cpu-design-bug-in-the-xbox-360/
======
alexkoeh
Reading about the architecture of that chip (3-core PowerPC), I am amazed at
how smoothly GTA V could run on it.

------
shaklee3
Great read!

------
golergka
I have seen _so many_ bugs and crashes created exactly because of this kind of
thinking:

> [insert X here] was no longer guaranteed, but hey, we’re video game
> programmers, we know what we’re doing, it will be fine.

~~~
pjc50
The games industry pretty much valorises doing weird things with the hardware,
all the way from the Atari 2600 through the demoscene.

------
rzzzt
Would gcc's "__builtin_expect" construct help in these cases?

~~~
syncsynchalt
Yes, it might have, assuming it's available in the 360 compilation
environment.
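
As a rough illustration, a sketch of what the hint might look like in C,
assuming the flag name described in the article and a toolchain that honours
GCC builtins (the prefetch bodies are placeholders). Note that the hint mostly
biases code layout and static prediction; a dynamic branch predictor may still
speculate down the "unlikely" path:

    /* Sketch only: PREFETCH_EX and the prefetch bodies follow the article's
       description; whether the 360 toolchain supported this builtin is an
       assumption. */
    if (__builtin_expect(!!(flags & PREFETCH_EX), 0)) {
        /* xdcbt-based prefetch (dangerous if executed speculatively) */
    } else {
        /* ordinary dcbt prefetch */
    }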

Another solution would be to split the existing function into safe_function()
and unsafe_function(), and remove use of the PREFETCH_EX flag. But then you
have to explain to callers that they must not protect calls to
unsafe_function() with any kind of conditional statement.

The author hinted at possible solutions but it was probably a case of "twice
bitten" by that point.

------
amenghra
Very well written. Thanks for sharing!

------
jhallenworld
The issue that xdcbt is supposed to solve also shows up at a completely
different level: performing a full disk-to-tape backup would slow systems down
because the entire disk would be copied through the OS buffer cache, evicting
data used by other processes.

The UNIX fix for this was to use a raw device or O_DIRECT to bypass the buffer
cache.
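
A minimal sketch of that approach on Linux, assuming a filesystem that supports
O_DIRECT (the path and buffer size are placeholders; O_DIRECT requires suitably
aligned buffers):

    /* Sketch: read a file while bypassing the page cache, so a bulk copy
       doesn't evict data used by other processes. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    ssize_t drain_without_caching(const char *path)
    {
        int fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0)
            return -1;

        void *buf = NULL;
        /* O_DIRECT needs aligned buffers; 4096 bytes is a common requirement. */
        if (posix_memalign(&buf, 4096, 1 << 20) != 0) {
            close(fd);
            return -1;
        }

        ssize_t total = 0, n;
        while ((n = read(fd, buf, 1 << 20)) > 0)
            total += n;   /* hand each chunk to the tape/destination here */

        free(buf);
        close(fd);
        return total;
    }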

Maybe Intel's new cache partitioning feature offers a similar fix, see:

[https://lwn.net/Articles/694800/](https://lwn.net/Articles/694800/)

Actually, in the comments someone mentions using cache partitioning for
security. Maybe the threads used by JIT code could be placed in their own cache
partition to mitigate some Spectre variants.
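
On Linux that feature (CAT) is exposed through the resctrl filesystem; a rough
sketch of confining a JIT thread to a small slice of L3 might look like the
following. This assumes a kernel with resctrl mounted at /sys/fs/resctrl, and
the group name and cache-way bitmask are purely illustrative:

    /* Sketch: confine a thread to a restricted L3 partition via resctrl.
       Assumes CAT support and resctrl mounted at /sys/fs/resctrl. */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    int confine_jit_thread(pid_t tid)
    {
        mkdir("/sys/fs/resctrl/jit", 0755);            /* new resource group */

        FILE *f = fopen("/sys/fs/resctrl/jit/schemata", "w");
        if (!f) return -1;
        fprintf(f, "L3:0=3\n");                        /* two L3 ways on domain 0 */
        fclose(f);

        f = fopen("/sys/fs/resctrl/jit/tasks", "w");   /* move thread into group */
        if (!f) return -1;
        fprintf(f, "%d\n", (int)tid);
        fclose(f);
        return 0;
    }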

------
peter_d_sherman
Article Excerpt: "So a speculatively-executed xdcbt was identical to a real
xdcbt!"

I'd never thought about it until I saw the above line in the article, and the
thought went something like this:

"Assembler instructions which might never have the conditions met for their
execution during a program's runtime might be speculatively executed
nonetheless, and this, depending on the nature of the instruction executed,
might have huge ramifications up to and including program incorrectness and
even program failure."

In other words, your absolutely 100% deterministic Turing machine (or the
programs you write on it that you deem 100% deterministic) may not be quite so
deterministic when viewed against these understandings...

It adds a whole new dimension to what must be thought about when writing
assembler code...

Anyway, it's a really great article!

~~~
johannesburgel
The machine is still deterministic, it just doesn't conform to your
assumptions. It's a bug. You can only account for the bug when writing assembly
code if you know about it.

There is lots of hardware with strange bugs out there, and in some setups
(embedded devices, mission critical systems etc.) it can be easier to work
around them instead of trying to fix the hardware. I know of a medical systems
manufacturer which still sticks to some incredibly old CPUs, a home-grown
operating system and gcc 2.95 because after 15 years they think they at least
know about all the problems and how to work around them. Fixing the issues
would pretty much require them to re-validate the whole design and repeat all
the testing, which would take several years.

~~~
kevingadd
In production though the machine will not behave deterministically from your
perspective, because you do not have all the inputs to the machine. Some of
the inputs are controlled by a third party (the hypervisor, or worse, another
VM running on your EC2 host that is actively manipulating branch predictor
data to control the speculative execution for your application.) Even if you
capture every operation that runs inside your VM, including register states,
timing, etc, you won't be able to reproduce the actual behavior that occurred
on the host because you don't know what happened outside the VM.

One could argue that the introduction of cloud computing (using
virtualization) has converted the machines we run on from deterministic to
sorta-deterministic. It just happens that until now we haven't been hurt by it
much. Now everyone's paying for it, to the tune of a sizable performance hit
on syscalls.

~~~
dcow
A medical company using old processors and gcc2 for their system is not
running on AWS.

But still no. Deterministic means the input maps consistently to the output;
the hardware, in this sense, is doing the mapping. Deterministic does not mean
that for all x the only possible mapping is f(x). It means that when f(x) is
the mapping, f(x) does not change to f'(x). However, g(x) is a perfectly
deterministic possibility for a mapping too. The point being: two pieces of
hardware, f(x) and g(x), can each deterministically map x, and the existence of
two pieces of hardware does not affect whether it's possible to
deterministically map x. The reality, and I think your point, is that if you
are unaware that a different piece of hardware is performing the mapping, the
result can appear nondeterministic. But the original point of the
medical-company anecdote is that this is only an appearance, and in some
domains people go to great lengths to make sure they intimately understand
their f(x) and are not using a g(x) they don't understand.

~~~
ddingus
In fact, they aren't running anything they don't know about.

------
6d6b73
Correct me if I'm wrong, but won't the Meltdown/Spectre bugs allow people to
jailbreak pretty much any OS/device whose CPU "supports" these bugs? This could
potentially open up a lot of currently closed devices.

~~~
Const-me
Potentially.

The problem with the X360 is that it can't run arbitrary native code; it can
only run code digitally signed by MS. The hypervisor refuses to run unsigned
code, and the key never leaves the CPU, i.e. it is very hard to tamper with.

The X360 can execute JavaScript in the web browser, and .NET in the (now
discontinued?) XNA. I'm not sure it's possible to exploit the bug if you can
only run JavaScript or .NET; both are very high level and have a strong
security model on top of what's in the hardware/OS/hypervisor.

~~~
jsheard
Surely they use asymmetric crypto, so the private key needed to sign code
isn't present on the console at all?

Spectre might make it possible to dump the public keys but they're not very
useful to us.

------
pjc50
I'm now wondering if I have enough material to do an interesting writeup of my
time as a CPU bug-hunter in verification.

The client (a now vanished startup) had a small 8-bit CPU design which they
wanted validation for, using the technique of executing random sequences of
instructions and comparing the result against an emulator. We wrote the
emulator independently from their architecture description. Given that most
instructions were a single byte plus arguments and most of those were valid,
the test coverage was pretty thorough. All looked fine until I added support
for interrupts, at which point we discovered that an interrupt during a branch
would not return to the correct point in execution.
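
For a sense of what that loop looked like, here is a rough sketch of
random-instruction co-simulation; run_on_device(), run_on_emulator() and
states_equal() are hypothetical stand-ins for the device harness and the
independently written emulator, not a real API:

    /* Sketch of co-simulation: feed the same random instruction stream to
       the hardware under test and to an independent emulator, then compare
       the resulting architectural state. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { uint8_t regs[16]; uint8_t flags; } cpu_state;

    /* Hypothetical harness functions, declared here only for the sketch. */
    cpu_state run_on_device(const uint8_t *prog, size_t len);
    cpu_state run_on_emulator(const uint8_t *prog, size_t len);
    int states_equal(cpu_state a, cpu_state b);

    int main(void)
    {
        uint8_t prog[64];
        for (int trial = 0; trial < 100000; trial++) {
            for (size_t i = 0; i < sizeof prog; i++)
                prog[i] = (uint8_t)rand();        /* random instruction bytes */

            cpu_state hw = run_on_device(prog, sizeof prog);
            cpu_state em = run_on_emulator(prog, sizeof prog);
            if (!states_equal(hw, em)) {
                fprintf(stderr, "divergence on trial %d\n", trial);
                return 1;
            }
        }
        return 0;
    }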

Verifying security properties of processors is _really hard_; you can go
looking for specific types of failure, but I'm not aware of a general way of
proving no information leakage.

~~~
munin
> but I'm not aware of a general way of proving no information leakage.

As I understand it, the current consensus is to use a tagged architecture,
then show that there are no observable differences when the tags involve
secret data and the value of the tagged data changes.

There are a few rubs here. One is that this is pretty hard to do. The other is
that no one wants to pay the overhead of using a tagged architecture. Yet
another is that deciding what "observable difference" means is challenging
(too little, and you probably miss attacks, too much, and you will probably
discover information leaks).

Further, this kind of rigorous system hasn't been rigorously empirically
evaluated by a third party (as far as I know, perhaps in part because it's so
hard to create these systems). "Empirically evaluated?" you scoff, "there's a
proof, what's the point of testing?" Well...

For a glimpse of what this looks like, consider the SAFE architecture:
[http://www.crash-safe.org/assets/verified-ifc-long-draft-2013-11-10.pdf](http://www.crash-safe.org/assets/verified-ifc-long-draft-2013-11-10.pdf)
and lowRISC:
[http://www.lowrisc.org/downloads/lowRISC-memo-2014-001.pdf](http://www.lowrisc.org/downloads/lowRISC-memo-2014-001.pdf)

~~~
dfox
The overhead of tagged architectures is to some extent a myth, caused by the
abysmal performance of certain implementations (e.g. the iAPX, whose
performance problems are AFAIK caused mainly by funky instruction encoding) and
by performance issues with the "straightforward" way of running C code on such
architectures.

------
SamPutnam
Where are the L1 caches?

~~~
saagarjha
I believe each core had its own L1 cache, which is why it isn't visible on the
diagram.

~~~
brucedawson
I think the L1 caches are the per-core blue blobs visible in this die shot:

[https://randomascii.files.wordpress.com/2018/01/xbox360_processor_die-fixed.jpg](https://randomascii.files.wordpress.com/2018/01/xbox360_processor_die-fixed.jpg)

but I'm not sure.

~~~
readittwice
Is there a way to "recognize" this? For example why the blue blobs and not the
green ones?

~~~
MBCook
Cache/RAM always looks like basically solid colors because it’s so densely
packed with a repeating structure.

How would you know it’s not the green stuff? I don’t know, perhaps by size. I
would imagine the green stuff is some other kind of cache or memory (op cache?
TLB buffers? patchy SRAM).

This is what I know from seeing these shots online for years. I’m no expert by
any means.

~~~
brucedawson
The green stuff might have been the VMX register files - those things were
pretty huge (128 registers per thread, each register 16 bytes, so 2 _128_ 16=4
KiB per core). While the L1 cache was higher capacity (32 KiB) as a general
rule of thumb faster memory uses more die area per bit.

But I seem to remember being told that the green areas were the VMX pipelines.
I don't remember clearly, and the hardware people I asked at the time were
uncertain - they were too far removed from that aspect of the hardware.

The other reason to assume that the blue areas are the caches is that they are
closer to the crossbar, and the crossbar is what connects the cores to the L2.
Incoming data from the L2 goes first to the L1 caches, so they would logically
be on that side.

------
rwmj
Would it be fair to say that having any sequence of bytes in memory which
looks like the _xdcbt_ instruction (even if those bytes are just data) is
unsafe? Given that a stale entry in the branch prediction table might end up
pointing at those bytes.

~~~
SolarNet
Data and code are not quite so interchangeable (thankfully). Typically
programs have code segments and data segments, and processors actually have
support for specifying the difference in behavior. A branch predictor would
probably not run against pointers to memory segments without the execute bit
set.

~~~
SomeHacker44
Actually, this is exactly the assumption that led to the Meltdown bug. The
Intel CPU's speculative (non-retired) pipeline executes instructions that
access memory and load it into the cache, delaying the check that the memory is
accessible in the current context until later (just before instruction
retirement).

It's certainly possible that a CPU could delay the same sort of exception
processing for the "NX" flag until later in the pipeline, but still before
retirement, allowing a problem similar to Meltdown.
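
In condensed form, the core of the Meltdown idea looks something like the
sketch below: the faulting load and the access that depends on it both execute
transiently, leaving a cache footprint that a later timing pass can recover.
Exception suppression, cache flushing and the timing loop are omitted, and
kernel_addr is a placeholder:

    #include <stdint.h>

    /* 256 probe lines, one per possible byte value, each on its own page. */
    volatile uint8_t probe[256 * 4096];

    void transient_leak(const uint8_t *kernel_addr)
    {
        /* This load faults at retirement, but may still execute speculatively... */
        uint8_t secret = *kernel_addr;
        /* ...and this dependent load pulls one probe line into the cache,
           encoding the secret byte as *which* line is now hot. */
        (void)probe[secret * 4096];
    }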

------
jonny_eh
Awesome story! I'm curious though why the branch predictor was running the
xdcbt instruction if "The symptoms were identical. Except that the game was no
longer using the xdcbt instruction".

Was the game no longer "using the xdcbt instruction", yet the branch predictor
still caused issues because they put a jump instruction in front of it instead
of removing the instruction entirely?

~~~
luc157
As I understand it, the function that used the xdcbt instruction had an
oversight that led to the first set of crashes, so it was rewritten to fix
that issue. Even with the fix, the game programmer decided to remove the
PREFETCH_EX flag, which is what caused the copy routine to use xdcbt in the
first place. So xdcbt ended up in the compiled code, but the flag that caused
it to be used wasn't there. The branch predictor ran it anyway, in an unsafe
way, leading to the next set of crashes.

~~~
brucedawson
I think that the developer didn't yet have the fixed version. They thought
they would be safe because they weren't passing PREFETCH_EX, but the branch
predictor decided otherwise.

There were other ways that xdcbt could cause problems (what if the block of
memory was freed before being purged from L1) so who knows, and I don't
remember for sure.

~~~
Gibbon1
I'm going to assume that somewhere in the code was something like

    
    
       if (flags & PREFETCH_EX)   /* test the flag bit, not bitwise-OR */
       {
         /* xdcbt prefetch */
       }
       else
       {
         /* safe dcbt prefetch */
       }
    

And the branch predictor would decide the xdcbt branch was most likely,
execute xdcbt, then realize the mistake and take the safe path. Which would be
okay, except xdcbt is evil: it breaks cache coherency.

~~~
djhworld
Wow, interesting to see feature flags causing this undesired effect!

~~~
syncsynchalt
Any if statement could cause this, doesn't have to be feature flags.

An untrained branch predictor could still take the "wrong" path even if none
of the code that calls that if statement takes the bad path. That's presumably
what happened here: the calling program removed all uses of PREFETCH_EX, but
the CPU would still sometimes speculatively execute that branch, such as on the
first run through the function.

This is a lot like the problem of "don't think of a pink elephant". The mere
existence of the "evil opcode" in the library means the CPU might execute it
under certain circumstances.

------
fps_doug
_Scratches PowerPC off the list of trustworthy CPUs_

Sooo, out you go, G5. Any suggestions what to use for online banking? A 486
can't handle all the jQuery and tracking scripts.

~~~
LeoPanthera
Of all the possible risks to online banking (and why is it always online
banking in these hypothetical world-is-ending scenarios?), CPU bugs are way
down the list.

~~~
exikyut
It's the stereotypical "why to use SSL" that everyone thinks of first.

