
The end of the related work section cites both wasm-smith and the Binaryen fuzzer (https://github.com/WebAssembly/binaryen/wiki/Fuzzing) and says, "They both provide a fuzzer that turns a stream of bytes into a WebAssembly module in order to test implementations. Their fuzzers always generate semantically valid test cases, but lack the targeting and tuning that Xsmith provides."

I look forward to reading more about how they do the targeting and tuning.




Hi! Author here. Xsmith provides a lot of ways to tune the choices it makes during program generation. By far the easiest, and the one used a lot in Wasmlike (the fuzzer in this thesis), is adjusting the weight of each AST node. It doesn’t just have to be a static weight though! It can be a function of any number of attributes present in the AST so far when that choice is made!

For example, to get a nice spread of function sizes, Wasmlike limits the AST depth of new functions based on how many have been generated so far. If the maximum depth isn’t limited, program size and generation time explode. If the depth limit were just a simple continuation from the point where the function was first called, the resulting program would have a ridiculous number of one-liner functions without any medium-sized ones.
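To make the idea concrete, here's a minimal sketch in Python (not Xsmith's actual Racket API; the node kinds and the state dictionary are made up for illustration) of a weighted node choice where one weight is a function of generation state rather than a constant:

```python
import random

# Hypothetical sketch: pick the next AST node kind using weights that
# can depend on how much of the program has been generated already.
def choose_node(state):
    # Each weight is a function of the state, not just a static number.
    weights = {
        "add":      lambda s: 10,
        "call":     lambda s: 5,
        # Make new function definitions rarer as more accumulate, so
        # function count (and program size) doesn't explode.
        "func_def": lambda s: max(1, 8 - s["num_funcs"]),
    }
    kinds = list(weights)
    w = [weights[k](state) for k in kinds]
    return random.choices(kinds, weights=w, k=1)[0]

state = {"num_funcs": 0}
kind = choose_node(state)  # one of "add", "call", "func_def"
```

The same shape works for depth limits: the maximum depth allowed for a new function body can be computed from the state at the moment the choice is made.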


Do you do swarm testing, where you randomly disable some fraction of the kinds of choices for generating program fragments before generating the AST? If so, how did that help?


Come to think of it, there is something similar that Xsmith can do with parametric randomness, where it can change just one small choice in the chain of decisions that made a random program. It’s a library developed by a previous master's student at the research lab, and it’s called Clotho: https://docs.racket-lang.org/clotho/index.html.

The idea is to enable feedback-directed fuzzing for a semantically valid random program generator. I believe that adding this to the Wasm fuzzer is the ‘next step’ in the ongoing research.

So not quite swarm testing, but a bit closer in terms of focused fuzzing instead of a shotgun approach.
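A rough sketch of the parametric-randomness idea (this is the concept behind Clotho, not its actual Racket API; the `ChoiceStream` class and the toy opcode list are made up): generation consumes an explicit sequence of random values, so replaying the sequence reproduces the same program, and perturbing a single entry yields a "nearby" program.

```python
import random

# Hypothetical sketch: random decisions are drawn from an explicit,
# recorded stream so they can be replayed or individually mutated.
class ChoiceStream:
    def __init__(self, seedvals=None):
        self.values = list(seedvals) if seedvals else []
        self.pos = 0

    def next(self, hi):
        if self.pos < len(self.values):
            v = self.values[self.pos] % hi   # replay a recorded choice
        else:
            v = random.randrange(hi)         # fresh choice, recorded
            self.values.append(v)
        self.pos += 1
        return v

def generate(stream):
    # Toy "program": a fixed-length list of opcodes picked via the stream.
    ops = ["nop", "add", "mul", "call"]
    return [ops[stream.next(len(ops))] for _ in range(5)]

s = ChoiceStream()
prog = generate(s)
# Mutate exactly one recorded decision and regenerate: everything else
# in the program stays the same, so a feedback loop can explore nearby
# programs one choice at a time.
mutated = list(s.values)
mutated[2] = (mutated[2] + 1) % 4
prog2 = generate(ChoiceStream(mutated))
```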


Short answer: no. The focus was on always generating semantically valid programs. That said, there is a lot of work on avoiding nondeterministic or undefined behavior, like division by zero or negative square roots.
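For flavor, here's one common way a generator avoids that kind of undefined or nondeterministic behavior (a hedged sketch, not Wasmlike's actual implementation; the helper names are made up): emit guarded versions of risky operations instead of the raw ones.

```python
# Hypothetical guards a generator might wrap around risky operations
# so the resulting program is always semantically well-defined.
def safe_div(numerator, denominator):
    # Never trap on divide-by-zero: fall back to 0, mirroring how some
    # generators substitute a defined result for the undefined case.
    return numerator // denominator if denominator != 0 else 0

def safe_sqrt(x):
    # Clamp negative inputs so there is no NaN / domain error.
    return max(x, 0) ** 0.5
```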

There are also a few ‘feature’ flags to enable/disable things like floating-point operations in case they would affect results, but that never actually came up in testing, and the test runs we did used roughly the same configurations.



