Random Testing of WebAssembly Implementations Using Semantically Valid Programs

phickey · 2023-08-04T23:20:43

Related: one of my colleagues created [wasm-smith](https://github.com/bytecodealliance/wasm-tools/tree/main/cra...) for fuzzing wasmparser and Wasmtime.

hardwaresofton · 2023-08-05T02:41:44

The attention to detail and quality of the wasmtime tooling and runtime is amazing.

I don’t think I’ve ever seen standards work be so high quality.

tlively · 2023-08-05T00:13:10

The end of the related work section cites both wasm-smith and the Binaryen fuzzer (https://github.com/WebAssembly/binaryen/wiki/Fuzzing) and says, "They both provide a fuzzer that turns a stream of bytes into a WebAssembly module in order to test implementations. Their fuzzers always generate semantically valid test cases, but lack the targeting and tuning that Xsmith provides."

I look forward to reading more about how they do the targeting and tuning.

HeliosPanoptes · 2023-08-05T00:50:31

Hi! Author here. Xsmith provides a lot of ways to tune the choices it makes during program generation. By far the easiest, and the one used a lot in Wasmlike (the fuzzer in this thesis), is adjusting the weight of each AST node. It doesn’t just have to be a static weight though! It can be a function of any number of attributes present in the AST so far when that choice is made!

For example, to get a nice spread of function sizes, Wasmlike limits the AST depth of new functions based on how many have been generated so far. If the maximum depth isn’t limited, program size and generation time explode. If the depth is just a simple continuation of where the function was first called from, the resulting program will have a ridiculous number of one-liner functions without any medium-sized ones.
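To make the idea concrete, here is a rough Python sketch of state-dependent weights plus a per-function depth budget. It is not Xsmith's or Wasmlike's actual code (Xsmith is a Racket library); the node kinds, the function names, and the "one less level per function already generated" policy are all assumptions made up for illustration.

```python
import random

def max_depth_for_new_function(functions_so_far, base_depth=8, min_depth=2):
    # Hypothetical policy: every function already generated lowers the
    # depth budget of the next one, down to a floor of min_depth.
    return max(min_depth, base_depth - functions_so_far)

def node_weights(depth, max_depth):
    # Weights are a function of the current state, not constants.
    # Leaves are always allowed; recursive nodes fade out near the limit.
    remaining = max(0, max_depth - depth)
    return {
        "literal": 1,
        "binary_op": remaining,
        "call_new_function": remaining // 2,
    }

def choose_node(rng, depth, max_depth):
    weights = node_weights(depth, max_depth)
    kinds = list(weights)
    return rng.choices(kinds, weights=[weights[k] for k in kinds], k=1)[0]
```

At the depth limit only `literal` has a nonzero weight, so generation is forced to terminate, while early functions still get room to grow.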

pfdietz · 2023-08-05T02:15:50

Do you do swarm testing, where you randomly disable some fraction of the kinds of choices for generating program fragments before generating the AST? If so, how did that help?
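For anyone unfamiliar with the term, a minimal Python sketch of the swarm idea might look like the following. The feature names and the 50% keep probability are placeholders I made up, not anything taken from a real fuzzer.

```python
import random

# Hypothetical kinds of program fragments a generator could emit.
FEATURES = ["loops", "calls", "floats", "memory_ops", "globals"]

def swarm_config(rng, keep_probability=0.5):
    # Before generating a program, switch each feature on or off at random.
    enabled = {f for f in FEATURES if rng.random() < keep_probability}
    if not enabled:
        # Keep at least one feature so the generator has something to emit.
        enabled.add(rng.choice(FEATURES))
    return enabled

def generate_fragment_kinds(rng, enabled, length=10):
    # The generator only ever picks from the features left enabled.
    pool = sorted(enabled)
    return [rng.choice(pool) for _ in range(length)]
```

Each test program is then built from a different random subset of features, rather than from all features at once.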

HeliosPanoptes · 2023-08-05T05:00:46

Come to think of it, there is something similar that Xsmith can do with parametric randomness, where it can change just one small choice in the chain of decisions that made a random program. It’s a library developed by a previous master’s student at the research lab, and it’s called Clotho: https://docs.racket-lang.org/clotho/index.html.

The idea is to enable feedback-directed fuzzing for a semantically valid random program generator. I believe that adding this to the Wasm fuzzer is the ‘next step’ in the ongoing research.

So not quite swarm testing, but a bit closer in terms of focused fuzzing instead of a shotgun approach.
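A toy Python sketch of the general idea (record every random choice, then replay the recording with one entry changed) could look like this. It is not Clotho's API; `ChoiceTape` and `gen_expr` are names invented for the example.

```python
import random

class ChoiceTape:
    """Records random choices, or replays a previously recorded list."""

    def __init__(self, rng, recorded=None):
        self.rng = rng
        self.recorded = list(recorded or [])
        self.pos = 0

    def choose(self, n):
        if self.pos < len(self.recorded):
            # Replay: reuse the recorded choice (wrapped into range).
            value = self.recorded[self.pos] % n
        else:
            # Record: draw a fresh choice and remember it.
            value = self.rng.randrange(n)
            self.recorded.append(value)
        self.pos += 1
        return value

def gen_expr(tape, depth=0):
    # Kind 0 is a literal; kinds 1 and 2 are binary operators.
    kind = tape.choose(3) if depth < 3 else 0
    if kind == 0:
        return str(tape.choose(10))
    op = "+" if kind == 1 else "*"
    return f"({gen_expr(tape, depth + 1)} {op} {gen_expr(tape, depth + 1)})"
```

Replaying `tape.recorded` unchanged reproduces the same expression, and editing a single entry yields a nearby program instead of a completely unrelated one, which is what a feedback loop would want to exploit.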

HeliosPanoptes · 2023-08-05T02:34:45

Short answer: no. The focus was on always generating semantically valid programs. That said, there is a lot of work on avoiding nondeterministic or undefined behavior, like division by zero or negative square roots.
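As a rough illustration of what that avoidance can look like in a generator, here is a small Python sketch that emits guarded expressions as text. The wrapper names and the "substitute 1 for a zero divisor" policy are assumptions for the example, not a description of what Wasmlike emits.

```python
import math

def safe_div_expr(numerator, denominator):
    # Emit a division whose divisor is replaced by 1 when it would be 0.
    # Assumes the operand expressions are side-effect free, since the
    # divisor text appears twice in the output.
    return f"(({numerator}) / (({denominator}) if ({denominator}) != 0 else 1))"

def safe_sqrt_expr(operand):
    # Emit a square root that never sees a negative argument.
    return f"math.sqrt(abs({operand}))"
```

The generated program stays well defined for every input, so any difference between two implementations points at a bug rather than at permitted nondeterminism.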

There are also a few ‘feature’ flags to enable/disable things like floating point operations in case they would affect results, but that never actually came up in testing, and the test runs we did used roughly the same configurations.

fasterik · 2023-08-05T11:11:53

I think random/fuzz testing is far more effective than unit testing. I pretty much don't write individual test cases anymore. I write a test that throws random data at a system or does random operations on it. Combined with a liberal use of assertions, it usually catches bugs very quickly.
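A minimal Python sketch of that style, assuming a made-up `SortedBag` as the system under test and a plain list as the reference model:

```python
import bisect
import random

class SortedBag:
    """Toy system under test: a multiset kept in sorted order."""

    def __init__(self):
        self.items = []

    def add(self, x):
        bisect.insort(self.items, x)

    def remove_min(self):
        return self.items.pop(0)

    def __len__(self):
        return len(self.items)

def random_ops_test(seed, steps=1000):
    rng = random.Random(seed)  # seeded, so any failure can be replayed
    bag, model = SortedBag(), []
    for _ in range(steps):
        if model and rng.random() < 0.4:
            expected = min(model)
            model.remove(expected)
            assert bag.remove_min() == expected
        else:
            x = rng.randint(-50, 50)
            model.append(x)
            bag.add(x)
        # Invariants checked after every single operation.
        assert len(bag) == len(model)
        assert bag.items == sorted(model)
    return True
```

Running `random_ops_test` over a range of seeds exercises far more operation sequences than a handful of hand-written cases would.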

pfdietz · 2023-08-05T13:45:35

The key point is that a program can generate inputs far faster than you can. So a testing system based on generation of inputs can create and execute far more cases than manual creation of individual unit tests.

The problem becomes determining if the test has failed. You should be creating properties to test, not inputs.
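As a small Python illustration of testing a property instead of listing inputs, here is a round-trip check on a toy run-length encoder. The encoder and the choice of property are just an example I picked.

```python
import random

def rle_encode(s):
    # Collapse runs of equal characters into (char, count) pairs.
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

def rle_decode(pairs):
    return "".join(ch * n for ch, n in pairs)

def check_round_trip(seed, cases=500):
    rng = random.Random(seed)
    for _ in range(cases):
        # A two-letter alphabet makes long runs likely.
        s = "".join(rng.choice("ab") for _ in range(rng.randint(0, 20)))
        # The property: decoding an encoding gives back the input.
        assert rle_decode(rle_encode(s)) == s, s
    return True
```

No expected output is written down for any individual input; the property itself decides whether a case passed.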

There used to be, back in the stone age of computing, this notion that random generation of test inputs was bad because it would waste valuable computing time. It was thought to be better to manually create carefully crafted individual tests. But the relative cost of people vs. the computer has changed by many orders of magnitude since then.