Bruce Dawson taught a 100-level class at the college I attended. It's been a while, but I think the class title was something like "image processing." The first or second assignment was to copy one image region into another; little did we know that we had implemented bit blit.
After we turned in the assignment, in the next class he grabbed one random submission, added some quick benchmark instrumentation, and, if I remember right, asked the class how fast we thought it could run. The next 30 minutes were a total geek-out on cache lines, prefetching, alignment, and all the other dark arts of performance optimization. I don't remember how much faster he got it to go than the naive for loop, but it was many orders of magnitude. There are a few classes that really stick with you, and that was definitely one of them.
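For anyone who hasn't done it, the naive version of that assignment looks roughly like the sketch below (the struct fields and function name are my own, purely illustrative). The geek-out version replaces the inner loop with row-wide copies, keeps accesses cache-line aligned, and prefetches upcoming rows.

    #include <cstdint>

    // Naive per-pixel region copy (bit blit): copy a w*h rectangle from src to dst.
    struct Image { uint32_t* pixels; int stride; };  // stride = pixels per row

    void Blit(Image dst, int dx, int dy, Image src, int sx, int sy, int w, int h) {
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x)
                dst.pixels[(dy + y) * dst.stride + (dx + x)] =
                    src.pixels[(sy + y) * src.stride + (sx + x)];
    }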
The only things that stand out to me from college are the 100-level classes at the beginning and the 400-level classes in the middle that weren't in my major.
Maybe it's the imposter syndrome, coupled with actually being unqualified and incompetent, that makes those classes stand out.
But we can just pretend it's only imposter syndrome.
Angrave was very memorable for me for CS241 Systems Programming (2016?). To this day my threading understanding is unmatched among my colleagues, thanks to him and that class.
Looks like it still exists https://cs241.cs.illinois.edu/
and I can imagine it was a difficult course; judging from the descriptions, you have challenging assignments on a weekly basis.
It sounds like a mispredicted xdcbt would actually need to invalidate, or otherwise make coherent, the cache lines that it mistakenly made incoherent, which also affects the instructions that read the incorrect data, so effectively a full pipeline flush. Even if they got that right, I suspect it would still cause some interesting performance anomalies whenever a mispredicted xdcbt was speculatively executed and then "cancelled".
It's notable that in 2005 it was already near the end for the P4/NetBurst, with its insanely long 31-stage pipeline, and CPU designs were moving towards increasing IPC rather than clock frequency.
The question of how to properly implement an instruction like xdcbt is interesting. Undoing the damage would be both tricky and expensive. Only doing the L2-skipping when the instruction is executed (as opposed to speculatively executed) would probably be way too late. It seems that such an instruction is probably not practical to implement correctly.
I never asked any of the IBM CPU designers this question (it was too late to make changes so it wasn't relevant) and now I regret that.
Re-reading the post, it sounds like the conclusion was just “don’t use it” / label it as dangerous. Why didn’t they end up marking the instruction as “can’t speculate this”?
(I can imagine wanting to keep it for straight-line unrolled copies that don't involve branch prediction, but it still seems dicey given that you'd have to write any such code with knowledge of the speculative fetches.)
Making the instruction not speculatable would indeed be a hardware change, which there was not time for. So that was not an option.
And, let's say they did that. All other loads/prefetches are issued in the early stages of the pipeline, while execution is still speculative. I think they would need new logic at a later stage of the pipeline, just for this instruction, in order to initiate a "late prefetch". That is potentially a lot of extra transistors and wires. And at that point you have a prefetch instruction that doesn't start prefetching until potentially dozens (or more) of cycles later, so using xdcbt instead of dcbt might just make your code run slower.
What about, then, an xdcbt which is seen in a context where it is known early on that it will definitely be executed, i.e. a context where it is not speculative? Well, there really is no such context. Practically speaking there are so many branches that when an instruction is decoded there is almost always a conditional branch ahead of it in the pipeline. And, architecturally speaking, any earlier instruction could trigger an exception which would stop execution flow from ever reaching the xdcbt. Pipelines are really, really deep.
TL;DR: On heavily pipelined CPUs (even in-order ones) you don't know for sure that an instruction is "real" until it is time to commit its results, and that is way too late for a "prefetch".
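To illustrate why a late prefetch is nearly useless, here's a minimal sketch of a copy loop that prefetches a fixed distance ahead, using GCC's __builtin_prefetch as a stand-in for dcbt (the distance is an illustrative value, not tuned for Xenon). The whole point is that the fetch for a line several iterations away is started while the current iteration is still being worked on, so memory latency overlaps with useful work; a prefetch that only fires once the instruction is known to be non-speculative would land long after the loads that needed the data.

    #include <cstddef>
    #include <cstdint>

    constexpr size_t kLineBytes = 128;     // 128-byte cache lines, as on Xenon
    constexpr size_t kPrefetchAhead = 8;   // lines ahead; illustrative value

    void CopyWithPrefetch(uint8_t* dst, const uint8_t* src, size_t n) {
        for (size_t i = 0; i < n; i += kLineBytes) {
            // Start fetching a line we won't touch for several iterations.
            if (i + kPrefetchAhead * kLineBytes < n)
                __builtin_prefetch(src + i + kPrefetchAhead * kLineBytes, 0, 0);
            size_t end = (i + kLineBytes < n) ? i + kLineBytes : n;
            for (size_t j = i; j < end; ++j)
                dst[j] = src[j];
        }
    }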
I've been working on verifying the memory coherency units of modern out-of-order CPUs for a few years now. Nowadays this would be a huge miss if it were to escape to silicon; you'd have a dead-on-arrival product.
I think the specific application of the CPU here makes it more palatable. It was a nice idea, but it didn't work out, so scan for the opcode and don't publish games that contain it; if possible, issue a microcode update that makes it either a no-op or an illegal instruction.
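Scanning is straightforward on PowerPC because instructions are fixed-width 32-bit words; a rough sketch of such a checker follows. The mask/match values are placeholders built from the ordinary dcbt encoding (primary opcode 31, extended opcode 278), since the thread doesn't give xdcbt's actual bits; a real checker would substitute the documented encoding that distinguishes xdcbt.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    constexpr uint32_t kMask  = 0xFC0007FEu;  // primary + extended opcode fields (X-form)
    constexpr uint32_t kMatch = 0x7C00022Cu;  // placeholder: dcbt's encoding, NOT xdcbt's

    // Walk the (big-endian) executable bytes one 32-bit word at a time and
    // record the offset of every instruction whose opcode fields match.
    std::vector<size_t> FindSuspectOpcodes(const uint8_t* text, size_t size) {
        std::vector<size_t> hits;
        for (size_t off = 0; off + 4 <= size; off += 4) {
            uint32_t insn = (uint32_t(text[off]) << 24) | (uint32_t(text[off + 1]) << 16) |
                            (uint32_t(text[off + 2]) << 8) | uint32_t(text[off + 3]);
            if ((insn & kMask) == kMatch)
                hits.push_back(off);
        }
        return hits;
    }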
Game console CPUs get away with all sorts of brokenness. The Wii U CPU also has broken coherency, and needs workaround flush instructions in the core mutex primitives. You can't run standard PowerPC Linux on it multicore for this reason.
I agree that console silicon can be rushed and bugs slip through. But then again, we may not even be aware of all the bugs in Intel or AMD products that have been fixed post-silicon via mechanisms such as microcode patches.
If Intel ships a CPU with a bug in it then that is an expensive mistake. If they produce a bunch of CPUs with a bug (escape to silicon) that can also be an expensive mistake.
That said:
1) I deal with CPU bugs pretty regularly on Chrome. Some old CPUs behave unreliably with certain sequences of instructions and we get bursts of crashes on these CPUs with some Chrome versions.
2) Intel regularly "fixes" CPU bugs with microcode updates.
3) The Spectre family of vulnerabilities are arguably CPU bugs.
4) The Pentium fdiv bug was definitely a CPU bug.
So, CPU bugs escape to silicon all the time. Way more than I would have guessed just a few years ago. Our industry is built on a pile of wobbly sand.
If the PREFETCH_EX flag was never passed, why did the branch predictor speculatively execute xdcbt? That branch would never be taken, so it ought never to be predicted taken.
Was it a static branch prediction hint? I know PowerPC has those. If so, could it have been fixed by just flipping the hint bit?
Branch predictors generally don't track each branch individually. Instead, simple branch predictors typically squish together a bunch of address bits, maybe some branch-history bits as well, and index into an array of two-bit entries. Thus the prediction for one branch is affected by other, unrelated branches, which sometimes leads to spurious predictions.
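A minimal sketch of the kind of predictor being described (purely illustrative, not the Xenon's actual design): a small table of two-bit saturating counters indexed by a hash of the branch address. Two unrelated branches whose addresses collide on the same slot share a counter.

    #include <array>
    #include <cstddef>
    #include <cstdint>

    struct TwoBitPredictor {
        std::array<uint8_t, 1024> counters{};  // each entry 0..3; >= 2 means "predict taken"

        static size_t Index(uint64_t branch_addr) {
            return (branch_addr >> 2) & 1023;  // squish address bits into a table index
        }
        bool Predict(uint64_t branch_addr) const {
            return counters[Index(branch_addr)] >= 2;
        }
        void Update(uint64_t branch_addr, bool taken) {
            uint8_t& c = counters[Index(branch_addr)];
            if (taken) { if (c < 3) ++c; } else { if (c > 0) --c; }
        }
    };

An unrelated hot branch at an aliasing address can push a shared counter to "taken", and from then on the never-taken branch guarding the xdcbt path gets mispredicted.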
> And that was the problem – the branch predictor would sometimes cause xdcbt instructions to be speculatively executed and that was just as bad as really executing them. One of my coworkers (thanks Tracy!) suggested a clever test to verify this – replace every xdcbt in the game with a breakpoint. This achieved two things:
>
> 1. The breakpoints were not hit, thus proving that the game was not executing xdcbt instructions.
> 2. The crashes went away.
I love the simplicity and the genius behind this idea.
There's a "branch hint" in PowerPC, I wonder if that's acted on by the Xenon CPU in question.
edit: it's discussed in the comments as well but they don't know either. The author responds: "I can’t remember how PowerPC branch hints work but if the branch hint overrode the branch predictor then it could have avoided the bug."
No, because the branch predictor is just "on" or "off". What you'd really want is a way to make it not speculatively execute specific instructions. I'm going to infer from the article that there wasn't a way to do that, and it was far too late to spin a new CPU revision to prohibit speculative execution of xdcbt.
I assume one could have put a serializing instruction inside the "dangerous" path, which should then lead to the speculative execution path being rolled back before reaching the dangerous instruction. Obviously it'd also be expensive...
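A rough sketch of that idea, assuming (I haven't verified this for the Xenon core) that isync acts as an execution barrier that speculation cannot pass; the flag value and routine names are made up for illustration:

    #include <cstddef>

    constexpr unsigned PREFETCH_EX = 0x1;            // illustrative flag value
    void CopyWithDcbt(void*, const void*, size_t);   // hypothetical safe routine
    void CopyWithXdcbt(void*, const void*, size_t);  // hypothetical routine containing xdcbt

    void GuardedCopy(void* dst, const void* src, size_t n, unsigned flags) {
        if (flags & PREFETCH_EX) {
            // Serializing instruction inside the "dangerous" path: nothing after it
            // should execute until every earlier instruction, including the branch
            // guarding this path, has resolved, so a misprediction never reaches xdcbt.
            asm volatile("isync" ::: "memory");
            CopyWithXdcbt(dst, src, n);
        } else {
            CopyWithDcbt(dst, src, n);
        }
    }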
I wonder why they didn't just make two separate routines without any branch except for the loop.
Then you would just need to allow another safety offset for the prediction of the last loop iteration, since the loop branch can mispredict and speculatively run one iteration past the end.
That would _probably_ be safe, but you have to be sure that the xdcbt instructions in the "special" function are far enough into the function that speculative execution can never reach there. Pipelines are way deeper than most people realize so this might require a lot of instructions before the first xdcbt.
And then, for maximum performance you want prefetch instructions to be as early as possible. So, you immediately have a contradiction.
And, assuming that you resolve this there is still the risk that a mispredicted indirect branch could end up triggering an xdcbt. So, you end up with no guarantees anyway.
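Roughly the pattern under discussion, as a sketch (the routine names and the PREFETCH_EX value are illustrative, not the game's actual code). The thread's point is that isolating the xdcbt copy in its own routine doesn't remove the hazard: whether the selection happens via a branch or a function pointer, the predictor can guess wrong and speculatively start executing the xdcbt path even when the flag was never passed.

    #include <cstddef>
    #include <cstring>

    constexpr unsigned PREFETCH_EX = 0x1;   // illustrative flag value

    void CopyWithDcbt(void* dst, const void* src, size_t n) {
        std::memcpy(dst, src, n);           // stand-in; the real routine would dcbt ahead
    }
    void CopyWithXdcbt(void* dst, const void* src, size_t n) {
        std::memcpy(dst, src, n);           // stand-in; the real routine would xdcbt ahead
    }

    void Copy(void* dst, const void* src, size_t n, unsigned flags) {
        // Dispatch through a function pointer (or a branch); either way the
        // (indirect) branch predictor can speculate into CopyWithXdcbt.
        auto fn = (flags & PREFETCH_EX) ? CopyWithXdcbt : CopyWithDcbt;
        fn(dst, src, n);
    }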