
Learning Input Tokens for Effective Fuzzing - memexy
https://publications.cispa.saarland/3098/
======
> Modern fuzzing tools like AFL operate at a lexical level: They explore the
> input space of tested programs one byte after another. For inputs with
> complex syntactical properties, this is very inefficient, as keywords and
> other tokens have to be composed one character at a time. Fuzzers thus allow
> users to specify dictionaries listing possible tokens the input can be composed
> from; such dictionaries speed up fuzzers dramatically. Also, fuzzers make
> use of dynamic tainting to track input tokens and infer values that are
> expected in the input validation phase. Unfortunately, such tokens are
> usually implicitly converted to program-specific values, which causes a loss
> of the taints attached to the input data in the lexical phase. In this paper,
> we present a technique to extend dynamic tainting to not only track explicit
> data flows but also taint implicitly converted data without suffering from
> taint explosion. This extension makes it possible to augment existing
> techniques and automatically infer a set of tokens and seed inputs for the
> input language of a program given nothing but the source code. Specifically
> targeting the lexical analysis of an input processor, our lFuzzer test
> generator systematically explores branches of the lexical analysis,
> producing a set of tokens that fully cover all decisions seen. The resulting
> set of tokens can be directly used as a dictionary for fuzzing. Along with
> the token extraction, seed inputs are generated, which give further fuzzing
> processes a head start. In our experiments, the lFuzzer-AFL combination
> achieves up to 17% more coverage on complex input formats like JSON, LISP,
> tinyC, and JavaScript compared to AFL.
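
To make the dictionary idea concrete, here is a minimal Python sketch of how comparisons against tainted input could be turned into a fuzzing dictionary. It is not lFuzzer's implementation: `TaintedStr`, `classify`, and `to_afl_dictionary` are hypothetical names invented for this note, the toy lexer stands in for a real input processor, and the output assumes AFL's `name="value"` dictionary line format.

```python
observed = set()  # string constants the program compared its input against


class TaintedStr(str):
    """A str that records every string it is compared with (hypothetical helper)."""

    def __eq__(self, other):
        if isinstance(other, str) and other:
            observed.add(str(other))
        return str.__eq__(self, other)

    def __ne__(self, other):
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    # Defining __eq__ would otherwise make the class unhashable.
    __hash__ = str.__hash__


def classify(word):
    """Toy lexer step (assumed for illustration): map a word to a token kind."""
    if word == "while":
        return "WHILE"
    if word == "if":
        return "IF"
    if word == "else":
        return "ELSE"
    if word == "return":
        return "RETURN"
    return "IDENT"


def to_afl_dictionary(tokens):
    """Render tokens as name="value" lines, hex-escaping unsafe bytes."""
    lines = []
    for i, tok in enumerate(sorted(tokens)):
        escaped = "".join(
            # Keep printable ASCII except the double quote and backslash.
            chr(b) if 32 <= b < 127 and b not in (0x22, 0x5C) else "\\x%02x" % b
            for b in tok.encode("utf-8")
        )
        lines.append('token_%d="%s"' % (i, escaped))
    return "\n".join(lines)


# A meaningless input still reveals every keyword the lexer checks for.
kind = classify(TaintedStr("foo"))
dictionary = to_afl_dictionary(observed)
print(dictionary)
```

Note that the taint here survives only as long as the value stays a `TaintedStr`; an operation such as `split()` returns plain `str` objects and drops it, which is a small-scale version of the taint loss through conversion that the abstract describes.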

