Interesting. Difficult to tell how complex the bugs are, though. Some of them seem to be triggered just by accessing non-existent CSRs, which suggests those cores haven't been verified very thoroughly in the first place?
Also:
> Cascade discovered 3 inaccurate performance counter bugs (Perfcnts) in Kronos, VexRiscv and BOOM (K4, V13, B2). They incur an offset in the retired instruction counters when written by software.
They used to be a bit better than that. IMHO the models released shortly after the FDIV bug was discovered would probably have the fewest errata. Of course, x86 also has a relatively large existing body of software that can exercise obscure instruction sequences: demoscene productions. I don't know whether those are still used for testing, however.
As for RISC-V, wasn't one of the biggest selling points of RISC that it would avoid bugs like these due to its reduced complexity in comparison to other architectures? Then again, given that almost all of the bugs found seem to be in the FPU, maybe that's just an inherently complex piece of a CPU.
...and that's before even getting into things like marginal signal integrity/timing, which was the cause of the early 386's 32-bit multiply bug.
>As for RISC-V, wasn't one of the biggest selling points of RISC that it would avoid bugs like these due to its reduced complexity in comparison to other architectures?
Not really; RISC is about doing what's convenient for HW in HW and leaving other things to SW. It has nothing to do with ease of verification.
As the other person mentioned, these are all open-source cores, many of which have seen fairly limited DV effort beyond simple lockstep verifiers. The findings here are unfortunately quite shallow, easy-to-detect issues, so they don't give a good sense of the maximum capability of the tool.
That being said, it's cool research and better stimulus is always wonderful. Finding deep CPU bugs is extremely hard and verification typically eats up a huge amount of the engineering budget for real chips.
>>...and that's before even getting into things like marginal signal integrity/timing, which was the cause of the early 386's 32-bit multiply bug.
>I'd hope modern fab libraries have checks to prevent these.
They do - but the end user overclocks the f*ck out of them anyway claiming that it's "stable" because it happens to run one workload OK, then complains about software being unstable.
Unrelated but see also 'stress' by Amos Waterland, an excellent workload generator to impose CPU, memory, I/O or disk stress on POSIX systems and report any errors that come up [1,2]. The tool unfortunately hasn't been maintained for some time, but it's being resurrected [2].
The complexity of an OoO CPU is such that you have to scrutinize the correctness of every instruction you implement. Also, OoO behavior depends on the instructions preceding and following the one under test, so you need more indirect tests for an OoO implementation of an instruction set.
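A rough sketch of what that context-dependence means in practice: the same operation checked in two different instruction neighborhoods. Everything here is illustrative (a real generator like Cascade randomizes the surrounding context far more aggressively), and it assumes a RISC-V C toolchain with inline asm:

```c
#include <assert.h>
#include <stdint.h>

/* Same fdiv.s, two contexts: in isolation, and with a dependent integer
   chain in flight right before it. A uarch bug may only fire in one. */
static float div_isolated(float a, float b) {
    float q;
    asm volatile("fdiv.s %0, %1, %2" : "=f"(q) : "f"(a), "f"(b));
    return q;
}

static float div_after_chain(float a, float b) {
    float q;
    unsigned long x = 1;  /* feeds a dependency chain, result unused */
    asm volatile(
        "add %1, %1, %1\n\t"
        "add %1, %1, %1\n\t"
        "add %1, %1, %1\n\t"
        "fdiv.s %0, %2, %3"
        : "=f"(q), "+r"(x)
        : "f"(a), "f"(b));
    return q;
}

int main(void) {
    /* On a correct core the surrounding instructions are irrelevant. */
    assert(div_isolated(3.0f, 7.0f) == div_after_chain(3.0f, 7.0f));
    return 0;
}
```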
From Appendix D, it looks like only 2 bugs were found in BOOM:
> 1. Inaccurate instruction count when minstret is written by software
I don't know what that means, but having minstret written by software was definitely not something I ever tested. In general, perf counters are likely to be undertested.
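For what it's worth, the bug class is roughly this: software writes a value into the minstret CSR and the counter ends up off by a constant afterwards. A minimal sketch of the directed check that would catch it, assuming an RV64 machine-mode environment where minstret is writable (all names here are illustrative):

```c
#include <stdint.h>

/* Seed minstret, read it straight back, and compare. On a correct core
   the read-back is the seed plus however many instructions retired in
   between (here, roughly the csrw/csrr pair themselves); the buggy cores
   return the seed plus a spurious constant offset. */
uint64_t minstret_offset_probe(void) {
    uint64_t seed = 1000, readback;
    asm volatile("csrw minstret, %0" :: "r"(seed));
    asm volatile("csrr %0, minstret" : "=r"(readback));
    return readback - seed;  /* expect ~1, not some fixed offset */
}
```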
> 2. Static rounding is ignored for fdiv.s and fsqrt.s
A mistake was made in only listening to the dynamic rounding mode for the fdiv/fsqrt unit. This is one of those bugs that is trivially found if you test for it, but it turns out that no benchmark ever cared about it, and of all the fuzzers I used when I worked on BOOM, NONE of them hit it (including commercial ones...). Oops.
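A hedged sketch of what such a directed test could look like (assumes RV32F/RV64F and GNU-as inline asm syntax; the function names are mine): set the dynamic rounding mode (frm) to round-down, then issue fdiv.s with a static round-up rm field. For an inexact quotient the two results must differ on a correct core; a core with this bug returns the dynamically rounded value both times.

```c
/* fdiv.s with a static round-up (rup) rounding mode in the rm field. */
static float fdiv_static_rup(float a, float b) {
    float q;
    asm volatile("fdiv.s %0, %1, %2, rup" : "=f"(q) : "f"(a), "f"(b));
    return q;
}

/* fdiv.s deferring to the dynamic rounding mode in the frm CSR. */
static float fdiv_dynamic(float a, float b) {
    float q;
    asm volatile("fdiv.s %0, %1, %2, dyn" : "=f"(q) : "f"(a), "f"(b));
    return q;
}

int static_rm_honored(void) {
    asm volatile("csrwi frm, 2");  /* frm = 2 encodes RDN (round down) */
    /* 1/3 is inexact in binary32, so RUP and RDN must give different
       results on a correct core; equal results indicate the bug. */
    return fdiv_static_rup(1.0f, 3.0f) != fdiv_dynamic(1.0f, 3.0f);
}
```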
Pretty cool stuff. I built a dumber version of this a few years ago that just did differential fuzzing between actual cores from different vendors, but without feedback on microarchitectural state it didn't get very far. Good to see people demonstrating how public descriptions of your parts yield concrete security benefits.
> Cascade discovered 12 bugs that produce wrong computations under certain microarchitectural conditions (Uarchvals) in Kronos and VexRiscv
> Cascade discovered 3 bugs that cause hangs in Kronos, PicoRV32 and VexRiscv
These are the hard bugs to find in the implementation of a CPU.
> Cascade discovered 4 bugs in BOOM and CVA6 that produce wrong output values regardless of the microarchitectural state
These are unacceptable bugs, showing a lack of architectural tests. It means no one ever ran those instructions and checked the result. The community should be able to fix this.
> It means no one ever ran those instructions and checked the result.
That's not really true. It means they produced wrong values for some input. They may have tested but not hit the pathological input case. No one is going to sweep the entirety of the domain. You really need a formal proof to avoid these bugs.
Correct. One beneficial thing about Cascade's long programs with dependencies is that they will tend to produce problematic values (values around NaNs, around zero, etc.).
Formal methods won't be replaced by Cascade :) but they're a whole different amount of work.
(Note: That one is kind of a marketing piece, too.)
In high-assurance systems, both formal methods and exhaustive testing are supposed to be used. I'd be curious to see the results of the submitted tech on designs such as the VAMP processor and Rockwell Collins AAMP7G.
Exhaustive testing is used? Only in some very special cases can that be done... Take the PMULLD instruction. It takes two 128-bit inputs. If you assume you can verify one input combination in 1 ns, an exhaustive test will take 3.67 × 10^60 years. Orders of magnitude longer than the age of the universe.
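The arithmetic checks out; a quick back-of-the-envelope in C (doubles, since 2^256 overflows any integer type):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double cases   = pow(2.0, 256);            /* two 128-bit operands */
    double seconds = cases * 1e-9;             /* 1 ns per case */
    double years   = seconds / (365.25 * 24 * 3600);
    printf("%.2e years\n", years);             /* prints ~3.67e+60 */
    return 0;
}
```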
I meant it figuratively. Safety and security certification at high levels often required that every module and function be tested. That was on top of black box testing and pen testing. Some practitioners added fuzz testing, path-based, symbolic, combinatorial, etc.
So, not literally exhaustive. They just covered every function, feature, interface, and combination that might trigger a bug.
Although, some rare folks did do literally exhaustive testing of algorithms and FSMs on 8-bit MCUs, since they had only 256 values per variable. Others and I put timers on loops to force them to terminate and just analyzed those cases. One could use exhaustive testing in the small to justify modeling a function with a specific formula (see the sketch below). If you have several of those, then methods like Cleanroom Software Engineering let you semiformally verify the correctness of upper layers.
So, there’s some stuff like that which was used on an ad hoc basis. Most just tested values and combinations likely to trigger bugs since they were in ranges that cause effects.
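To make that "exhaustive in the small" idea concrete, here's a minimal sketch: sweep all 256 × 256 input pairs of an 8-bit function and check each result against the model formula. The saturating_add routine is just a hypothetical stand-in for whatever is under test.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical routine under test: 8-bit saturating addition. */
static uint8_t saturating_add(uint8_t a, uint8_t b) {
    uint16_t s = (uint16_t)a + b;
    return s > 255 ? 255 : (uint8_t)s;
}

int main(void) {
    /* Literally exhaustive: every one of the 65,536 input pairs. */
    for (int a = 0; a <= 255; a++)
        for (int b = 0; b <= 255; b++) {
            int expect = (a + b > 255) ? 255 : a + b;  /* model formula */
            assert(saturating_add((uint8_t)a, (uint8_t)b) == expect);
        }
    return 0;  /* the formula is now a verified model of the function */
}
```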
> > Cascade discovered 4 bugs in BOOM and CVA6 that produce wrong output values regardless of the microarchitectural state
> These are unacceptable bugs, showing a lack of architectural tests. It means no one ever ran those instructions and checked the result. The community should be able to fix this.
For BOOM it looks like the only 2 bugs found were miscounting on the inst-retired perf counter if software overwrote it, and fdiv.s/fsqrt.s always listening to the dynamic rounding mode instead of the statically provided one, when specified. Not great, but recoverable.
Also:
> Cascade discovered 3 inaccurate performance counter bugs (Perfcnts) in Kronos, VexRiscv and BOOM (K4, V13, B2). They incur an offset in the retired instruction counters when written by software.
Funnily enough the Sail model had this bug too! https://github.com/riscv/sail-riscv/issues/256