Interesting. Difficult to tell how complex the bugs are, though. Some of them seem to be triggered just by accessing non-existent CSRs, which suggests those cores haven't been verified very thoroughly in the first place?
Also:
> Cascade discovered 3 inaccurate performance counter bugs (Perfcnts) in Kronos, VexRiscv and BOOM (K4, V13, B2). They incur an offset in the retired instruction counters when written by software.
They used to be a bit better than that. IMHO the models released shortly after the FDIV bug was discovered would probably have the fewest errata. Of course, x86 also has a relatively large existing body of software that can exercise obscure instruction sequences: demoscene productions. I don't know whether those are still used for testing, however.
As for RISC-V, wasn't one of the biggest selling points of RISC that it would avoid bugs like these due to its reduced complexity in comparison to other architectures? Then again, given that almost all of the bugs found seem to be in the FPU, maybe that's just an inherently complex piece of a CPU.
...and that's before even getting into things like marginal signal integrity/timing, which was the cause of the early 386's 32-bit multiply bug.
>As for RISC-V, wasn't one of the biggest selling points of RISC that it would avoid bugs like these due to its reduced complexity in comparison to other architectures?
Not really; RISC is about doing what's convenient for HW in HW and leaving other things to SW. It has nothing to do with ease of verification.
As the other person mentioned, these are all open-source cores, many of which have seen fairly limited DV effort beyond simple lockstep verifiers. The findings here are unfortunately quite shallow, easy-to-detect issues, so they don't give a good sense of the maximum capability of the tool.
That being said, it's cool research and better stimulus is always wonderful. Finding deep CPU bugs is extremely hard and verification typically eats up a huge amount of the engineering budget for real chips.
>>...and that's before even getting into things like marginal signal integrity/timing, which was the cause of the early 386's 32-bit multiply bug.
>I'd hope modern fab libraries have checks to prevent these.
They do - but the end user overclocks the f*ck out of them anyway claiming that it's "stable" because it happens to run one workload OK, then complains about software being unstable.
Unrelated but see also 'stress' by Amos Waterland, an excellent workload generator to impose CPU, memory, I/O or disk stress on POSIX systems and report any errors that come up [1,2]. The tool unfortunately hasn't been maintained for some time, but it's being resurrected [2].
The complexity of an OoO CPU is such that you have to scrutinize the correctness of every instruction you implement. Also, OoO behavior depends on the instructions preceding and following the one under test, so you need more indirect tests for an OoO implementation of an instruction set.
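A rough sketch of what that context-dependence means in practice: the same operation checked in two different instruction neighborhoods. Everything here is illustrative (a real generator like Cascade randomizes the surrounding context far more aggressively), and it assumes a RISC-V C toolchain with inline asm:

```c
#include <assert.h>
#include <stdint.h>

/* Same fdiv.s, two contexts: in isolation, and with a dependent integer
   chain in flight right before it. A uarch bug may only fire in one. */
static float div_isolated(float a, float b) {
    float q;
    asm volatile("fdiv.s %0, %1, %2" : "=f"(q) : "f"(a), "f"(b));
    return q;
}

static float div_after_chain(float a, float b) {
    float q;
    unsigned long x = 1;  /* feeds a dependency chain, result unused */
    asm volatile(
        "add %1, %1, %1\n\t"
        "add %1, %1, %1\n\t"
        "add %1, %1, %1\n\t"
        "fdiv.s %0, %2, %3"
        : "=f"(q), "+r"(x)
        : "f"(a), "f"(b));
    return q;
}

int main(void) {
    /* On a correct core the surrounding instructions are irrelevant. */
    assert(div_isolated(3.0f, 7.0f) == div_after_chain(3.0f, 7.0f));
    return 0;
}
```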
From Appendix D, it looks like only 2 bugs were found in BOOM:
> 1. Inaccurate instruction count when minstret is written by software
I don't know what that means, but having minstret written by software was definitely not something I ever tested. In general, perf counters are likely to be undertested.
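For what it's worth, the bug class is roughly this: software writes a value into the minstret CSR and the counter ends up off by a constant afterwards. A minimal sketch of the directed check that would catch it, assuming an RV64 machine-mode environment where minstret is writable (all names here are illustrative):

```c
#include <stdint.h>

/* Seed minstret, read it straight back, and compare. On a correct core
   the read-back is the seed plus however many instructions retired in
   between (here, roughly the csrw/csrr pair themselves); the buggy cores
   return the seed plus a spurious constant offset. */
uint64_t minstret_offset_probe(void) {
    uint64_t seed = 1000, readback;
    asm volatile("csrw minstret, %0" :: "r"(seed));
    asm volatile("csrr %0, minstret" : "=r"(readback));
    return readback - seed;  /* expect ~1, not some fixed offset */
}
```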
> 2. Static rounding is ignored for fdiv.s and fsqrt.s
A mistake was made in only listening to the dynamic rounding mode for the fdiv/fsqrt unit. This is one of those bugs that is trivially found if you test for it, but it turns out that no benchmark ever cared about it, and of all the fuzzers I used when I worked on BOOM, NONE of them hit it (including commercial ones...). Oops.
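A hedged sketch of what such a directed test could look like (assumes RV32F/RV64F and GNU-as inline asm syntax; the function names are mine): set the dynamic rounding mode (frm) to round-down, then issue fdiv.s with a static round-up rm field. For an inexact quotient the two results must differ on a correct core; a core with this bug returns the dynamically rounded value both times.

```c
/* fdiv.s with a static round-up (rup) rounding mode in the rm field. */
static float fdiv_static_rup(float a, float b) {
    float q;
    asm volatile("fdiv.s %0, %1, %2, rup" : "=f"(q) : "f"(a), "f"(b));
    return q;
}

/* fdiv.s deferring to the dynamic rounding mode in the frm CSR. */
static float fdiv_dynamic(float a, float b) {
    float q;
    asm volatile("fdiv.s %0, %1, %2, dyn" : "=f"(q) : "f"(a), "f"(b));
    return q;
}

int static_rm_honored(void) {
    asm volatile("csrwi frm, 2");  /* frm = 2 encodes RDN (round down) */
    /* 1/3 is inexact in binary32, so RUP and RDN must give different
       results on a correct core; equal results indicate the bug. */
    return fdiv_static_rup(1.0f, 3.0f) != fdiv_dynamic(1.0f, 3.0f);
}
```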
Pretty cool stuff. I built a dumber version of this a few years ago that just did differential fuzzing between actual cores from different vendors, but without feedback on microarchitectural state it didn't get very far. Good to see people demonstrating how public descriptions of your parts yield concrete security benefits.
> Cascade discovered 12 bugs that produce wrong computations under certain microarchitectural conditions (Uarchvals) in Kronos and VexRiscv
> Cascade discovered 3 bugs that cause hangs in Kronos, PicoRV32 and VexRiscv
These are the hard bugs to find in the implementation of a CPU.
> Cascade discovered 4 bugs in BOOM and CVA6 that produce wrong output values regardless of the microarchitectural state
These are unacceptable bugs, showing a lack of architectural tests. It means no one ever ran those instructions and checked the result. The community should be able to fix this.
> It means no one ever ran those instructions and checked the result.
That's not really true. It means they produced wrong values for some input. They may have tested but not hit the pathological input case. No one is going to sweep the entirety of the domain. You really need a formal proof to avoid these bugs.
Correct. One beneficial thing about Cascade's long programs with dependencies is that they will tend to produce problematic values (values around NaNs, around zero, etc.).
Formal methods won't be replaced by Cascade :) but they're a whole different amount of work.
(Note: That one is kind of a marketing piece, too.)
In high-assurance systems, both formal methods and exhaustive testing are supposed to be used. I'd be curious to see the results of the submitted tech on designs such as the VAMP processor and Rockwell Collins AAMP7G.
Exhaustive testing is used? Only in some very special cases can that be done... Take the PMULLD instruction. It takes two 128-bit inputs. If you assume you can verify one input combination in 1 ns, an exhaustive test will take 3.67 × 10^60 years. Orders of magnitude longer than the age of the universe.
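The arithmetic checks out; a quick back-of-the-envelope in C (doubles, since 2^256 overflows any integer type):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double cases   = pow(2.0, 256);            /* two 128-bit operands */
    double seconds = cases * 1e-9;             /* 1 ns per case */
    double years   = seconds / (365.25 * 24 * 3600);
    printf("%.2e years\n", years);             /* prints ~3.67e+60 */
    return 0;
}
```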
I meant it figuratively. Safety and security certification at high levels often required that every module and function be tested. That was on top of black box testing and pen testing. Some practitioners added fuzz testing, path-based, symbolic, combinatorial, etc.
So, not literally exhaustive. They just covered every function, feature, interface, and combination that might trigger a bug.
Although, some rare folks did do literally exhaustive testing of algorithms and FSMs on 8-bit MCUs, since they had only 256 values per variable. Others and I put timers on loops to force them to terminate and just analyzed those cases. One could use exhaustive testing in the small to justify modeling a function with a specific formula (see the sketch below). If you have several of those, then methods like Cleanroom Software Engineering let you semiformally verify the correctness of upper layers.
So, there’s some stuff like that which was used on an ad hoc basis. Most just tested values and combinations likely to trigger bugs since they were in ranges that cause effects.
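To make that "exhaustive in the small" idea concrete, here's a minimal sketch: sweep all 256 × 256 input pairs of an 8-bit function and check each result against the model formula. The saturating_add routine is just a hypothetical stand-in for whatever is under test.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical routine under test: 8-bit saturating addition. */
static uint8_t saturating_add(uint8_t a, uint8_t b) {
    uint16_t s = (uint16_t)a + b;
    return s > 255 ? 255 : (uint8_t)s;
}

int main(void) {
    /* Literally exhaustive: every one of the 65,536 input pairs. */
    for (int a = 0; a <= 255; a++)
        for (int b = 0; b <= 255; b++) {
            int expect = (a + b > 255) ? 255 : a + b;  /* model formula */
            assert(saturating_add((uint8_t)a, (uint8_t)b) == expect);
        }
    return 0;  /* the formula is now a verified model of the function */
}
```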
> > Cascade discovered 4 bugs in BOOM and CVA6 that produce wrong output values regardless of the microarchitectural state
> These are unacceptable bugs, showing a lack of architectural tests. It means no one ever ran those instructions and checked the result. The community should be able to fix this.
For BOOM it looks like the only 2 bugs found were miscounting on the inst-retired perf counter if software overwrote it, and fdiv.s/fsqrt.s always listening to the dynamic rounding mode instead of the statically provided one, when specified. Not great, but recoverable.
Also:
> Cascade discovered 3 inaccurate performance counter bugs (Perfcnts) in Kronos, VexRiscv and BOOM (K4, V13, B2). They incur an offset in the retired instruction counters when written by software.
Funnily enough the Sail model had this bug too! https://github.com/riscv/sail-riscv/issues/256