
Regex: badly needs fuzzing - infogulch
https://svn.boost.org/trac/boost/ticket/12818
======
saywatnow
Badly needs _static analysis_. Coverity is free for open-source projects, will
find many of the same issues, and produces reports that directly identify the
buggy code, rather than the crash, which might come thousands of instructions
later in a completely different layer, requiring extensive reverse engineering
to identify the source. And then when you've fixed one, the other test cases
all need to be re-tested because they might have come from the same root
cause, despite crashing in different locations.

Particularly when the hit rate is very high, fuzzing is a stupidly inefficient
way to find bugs.

~~~
MaulingMonkey
Needs both.

> rather than the crash which might [...]

Assert like mad. Break out your debug allocators, your asan, your checked
iterators. Fuzzing can be as bad as you say, but it can also give you
_extremely easy_ to diagnose repro cases - and catch an incredibly wide range
of issues for incredibly little work, in my experience.

Static analysis generally won't tell you "Hey, if I almost-but-not-quite
exhaust your address space while still successfully parsing this file, I can
get your next closed-source API call to segfault!" - useful for tracking down
each and every possible allocation related to network deserialization so you
can configure sane quotas (such as "array size probably shouldn't be larger
than bytes sent; bail out safely instead of trying to allocate that").

With most sane setups that come to mind, no false positives either.

> And then when you've fixed one, the other test cases all need to be
> re-tested because they might have come from the same root cause, despite
> crashing in different locations.

Your tooling should do this automatically for you. If it doesn't, get better
tooling. AFL. SDL MiniFuzz. Whatever floats your boat.

> Particularly when the hit rate is very high, fuzzing is a stupidly
> inefficient way to find bugs.

I've had it be stupidly _efficient_ in terms of programmer time once set up.
Occasionally leaving my computer on overnight? Worth.

------
CJefferson
Any Rust lovers out there: could I ask you to do a benchmark comparison and a
fuzz comparison? I'd be genuinely interested in the result, and if (as you
might hope) Rust's regex is as fast as boost::regex and never crashes, that
would persuade at least me to finally learn some Rust!

~~~
stuckagain
If you were interested in performance you probably would not have been using
boost::regex to begin with. RE2 is often an order of magnitude faster. You
might choose boost if you require backtracking, but that's crazy anyway due to
exponential time.

~~~
0xFFC
What is backtracking?

~~~
carussell
Looks like there's (a lot of) confusion in these comments about the difference
between _backtracking_ and _backreferences_.

The `\2` in Phritzy's snippet is a backreference.

Backtracking is an implementation strategy for writing a regular expression
engine.

I don't know why anyone choosing an engine would "require backtracking". It's
an implementation detail, not a feature. (Although the fact that Thompson NFAs
avoid exponential time complexity inherent to backtracking is something that I
suppose could be considered a feature.)

Here's a link to some real literature:

[https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html)

~~~
geofft
From the RE2 wiki:

 _" As a matter of principle, RE2 does not support constructs for which only
backtracking solutions are known to exist. Thus, backreferences and look-
around assertions are not supported."_

From your link:

 _One common regular expression extension that does provide additional power
is called_ backreferences. _A backreference like \1 or \2 matches the string
matched by a previous parenthesized expression, and only that string:_
(cat|dog)\1 _matches_ catcat _and_ dogdog _but not_ catdog _nor_ dogcat. _As
far as the theoretical term is concerned, regular expressions with
backreferences are not regular expressions. The power that backreferences add
comes at great cost: in the worst case, the best known implementations require
exponential search algorithms, like the one Perl uses. Perl (and the other
languages) could not now remove backreference support, of course, but they
could employ much faster algorithms when presented with regular expressions
that don't have backreferences, like the ones considered above. This article
is about those faster algorithms._

If your regex has _backreferences_, or look-ahead/look-behind assertions, or
similar things, then your regex engine must support a _backtracking_ strategy.

If you do not need backreferences or similar things, then your regex engine
may still choose to use backtracking, but does not need to.

RE2 is a regex engine that prioritizes never using backtracking as an
implementation detail, and therefore does not support backreferences as a
matter of external API.

RE2 is not able to evaluate Phritzy's regex.

~~~
carussell
> If your regex has backreferences, or look-ahead/look-behind assertions, or
> similar things, then your regex engine must support a backtracking strategy.

There's a morphism that maps Phritzy's pattern to an equivalent pattern that
can be matched with a Thompson NFA. (It's a similar strategy to the one that
allows you to take an NFA and turn it into a DFA.) You can perform a sort of
"static decomposition" on a subset of patterns containing backreferences, such
that /(cat|dog)\1/ for example decomposes into /catcat|dogdog/. The caveat is
that this isn't fully generalizable for all patterns containing
backreferences.
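
To make the decomposition concrete, here's a minimal sketch (illustrative
Rust, not code from any real regex library) for the restricted case where the
referenced group is a finite alternation of literals:

    // Hypothetical helper: rewrite (w1|w2|...)\1 into a backreference-free
    // alternation by duplicating each literal alternative, so that
    // (cat|dog)\1 becomes catcat|dogdog. Only valid when the group matches
    // a finite set of literal strings.
    fn expand_backref(alternatives: &[&str]) -> String {
        alternatives
            .iter()
            .map(|w| format!("{w}{w}"))
            .collect::<Vec<_>>()
            .join("|")
    }

    fn main() {
        assert_eq!(expand_backref(&["cat", "dog"]), "catcat|dogdog");
    }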

But saying that backreferences cannot be supported by a pattern-matching
engine that uses something other than backtracking is like saying that because
there is no general algorithm to prove any given program P halts, _no_ program
can be proven to halt. But that's trivially demonstrated to be untrue. ∃ ≠ ∀

RE2 chooses not to support any PCRE-style backreferences, because RE2 is a
pattern matching engine meant to support patterns that are actually regular,
and backreferences are not regular.

~~~
geofft
> _But saying that backreferences cannot be supported by a pattern-matching
> engine that uses something other than backtracking is like saying that
> because there is no general algorithm to prove any given program P halts,
> then no program can be proven to halt. But that's trivially demonstrated to
> be untrue. ∃ ≠ ∀_

Yeah, sorry, I mixed up the sense of the quantifiers a bit. More accurately:
"If you are interested in regexes that use _backreferences_, or
look-ahead/look-behind assertions, or similar things, then your regex engine
must support a _backtracking_ strategy, or you need to rewrite all the regexes
you care about to not use any of these."

Is it possible to automate that morphism, i.e., is it possible to build a
wrapper around RE2 that converts (in suitably bounded time) a regex using
backreferences into an RE2-compatible one when the language described is in
fact regular, and prints an error otherwise?

~~~
wahern
I did something similar for word-boundary zero-width assertions (i.e. \b and
\B); that is, rewriting the PCRE by expanding permutations inline. The purpose
was to compile as many PCREs as possible to DFAs using Ragel.

Fortunately, Ragel supported a limited form of assertions called semantic
conditions, which it implements by expanding the code space of each state,
allowing you to associate a C boolean expression as a predicate for entering
that state. For various reasons the semantic condition needed to be attached
to a terminal, not to subexpressions (i.e. groupings and alternations).

Also, expressions like (cat?|dog?)\b needed to be permuted similarly to the
backreference problem, resulting in something that might look like
(ca\b|cat\b|do\b|dog\b). This strategy didn't work for Kleene stars (*), but
usually a semantic condition could be attached to trailing terminals instead
of leading terminals.
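
A minimal sketch of that inline permutation (illustrative Rust; the real
transformation was Lua operating on an annotated AST) for literal alternatives
with at most one optional trailing character:

    // Distribute a trailing \b over each alternative, expanding a final `?`
    // first, so (cat?|dog?)\b becomes (ca\b|cat\b|do\b|dog\b).
    // Assumes ASCII literals for the byte-slicing below.
    fn permute_word_boundary(alternatives: &[&str]) -> String {
        let mut out = Vec::new();
        for alt in alternatives {
            if let Some(stem) = alt.strip_suffix('?') {
                // "cat?" expands to "ca" and "cat", each carrying the assertion
                out.push(format!("{}\\b", &stem[..stem.len() - 1]));
                out.push(format!("{stem}\\b"));
            } else {
                out.push(format!("{alt}\\b"));
            }
        }
        format!("({})", out.join("|"))
    }

    fn main() {
        assert_eq!(
            permute_word_boundary(&["cat?", "dog?"]),
            r"(ca\b|cat\b|do\b|dog\b)"
        );
    }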

The actual code to perform these AST transformations was surprisingly simple.
Excluding parsing of PCREs into an annotated AST and generation of the code,
less than 200 lines of Lua code, maybe.

The PCREs weren't simple, but most were only moderately complex. For a corpus
of about 10,000 PCREs about 1/2 used word boundary assertions. The vast
majority of those (possibly even all of them... I can't remember) could be
successfully transformed. With nested subexpressions (groups, alternations)
the number of permutations rapidly grows, but I don't think our production
corpus ever caused a problem in that regard, and only the most insane regular
expressions nest subexpressions more than a few deep.

Before we tackled backreferences and other gnarly PCRE features I think we
managed to coax better than 90% of our PCRE corpus to transform into
semantically equivalent Ragel-based DFAs. In other words, backreferences were
rare. With 90%+ compiled completely to native machine code and the remainder
simplified into prefilters we saw greater than 10x performance improvement (>
1000%) over libpcre in terms of production throughput, not just benchmarks.
(Ragel is awesome!) RE2 was usually faster than libpcre, but RE2's performance
wasn't even remotely in the same league as the native machine code DFA Ragel
produced, so I quickly ditched the idea of using RE2 anywhere in the pipeline.

That last 10% or so took much longer to tackle, and eventually most of my
original code was dumped. But getting to about 90% was surprisingly easy. We
did eventually get (and continue to maintain) 100% transformation to native
machine code, but a small percentage still require backtracking. All-in-all I
_think_ we're at 20x to 30x over baseline libpcre; that is, tackling the last
10% put us 2x or 3x atop the 10x. It wasn't a simple case of diminishing
returns as the most difficult expressions to transform also tended to be the
most costly at runtime, but if we had stopped early on at 90% it still would
have been a resounding success.

Regarding RE2, it's important to note that once expressions were transformed
into proper regular expressions, they could easily be joined into a single
union. Setting aside the bytecode vs. machine code differences, RE2 just isn't
capable
of compiling huge unions of expressions (on the order of 100s or 1000s),
whereas for Ragel it was a relative breeze. Another team was using re2c (not
RE2) for a project, and it turned out that what literally took re2c DAYS to
compile took only SECONDS for Ragel. GCC and clang also became
bottlenecks compiling the Ragel generated code. And interestingly they both,
coincidentally, exhibited (and still exhibit, AFAIK) quadratic complexity when
parsing certain C constructs, such as initializer lists. So that necessitated
some workarounds.

~~~
carussell
Neat! What was the application? Or was it pure research?

~~~
wahern
For a content filtering/scanning service.

By the time we wrapped up this project, Intel had finally released Hyperscan
as open source. Hyperscan implements these and other transformation tricks,
not to mention the SIMD optimizations.

However, Hyperscan doesn't have as strong PCRE compatibility as what we ended
up with--effectively 100%--and would have necessitated keeping libpcre as a
failover. Hyperscan is substantially faster than Ragel for small-to-medium-
length expressions thanks to its SIMD optimizations, but with huge, unioned
expressions (larger than what Hyperscan can handle) Ragel-generated code is on
par.

Starting from scratch today, and presuming Hyperscan is still well-maintained,
it would be most practical to build a solution around Hyperscan, especially if
you need capturing, as capturing expressions can't be unioned effectively.
Ragel, however, makes a ton of sense for many tasks other than merely speeding
up PCRE matching.

What makes Ragel unique is:

1) The way it integrates with the host language (C, C++, Java, etc). It allows
executing host-language expressions at state machine transitions. It doesn't
directly tie into the host language typing model (e.g. like Boost's template-
based regular expressions), but for various reasons it's hardly a liability
and often an asset.

2) It allows yielding and resumption of matching at any and all input stream
boundaries, with runtime machine state being a simple integer. (Application
keeps any additional state however it wants.) Most regular expression tools
require a complete string to match against, while a Ragel machine can be
iteratively applied to an infinite stream of data. That capability wasn't
useful in this scenario, but it's extremely useful for concisely,
performantly, and securely implementing network protocols, for example. In a
low-level language like C, and especially with asynchronous I/O, this can
drastically simplify the task of I/O and buffer management. You could parse
(if you wanted) an entire HTTP stream in a single logical pass, without having
to deal with line buffering, header continuations, etc. as separate concerns,
greatly simplifying the networking code. In some sense it can be thought of as
a complement, if not an alternative, to callback schemes, futures/promises, or
green threading (a la Go).
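
To illustrate the resumability (a toy hand-written machine in Rust, not Ragel
output): the entire matcher state is one integer, so input can arrive in
arbitrary chunks and matching simply continues where it left off.

    // Toy DFA that counts occurrences of "cat" in a byte stream.
    fn step(state: u32, b: u8) -> (u32, bool) {
        match (state, b) {
            (1, b'a') => (2, false),
            (2, b't') => (0, true),  // completed "cat"
            (_, b'c') => (1, false), // (re)start on 'c'
            _ => (0, false),
        }
    }

    fn main() {
        // Chunk boundaries are arbitrary; only `state` survives between them.
        let chunks: &[&[u8]] = &[b"xxca", b"t yy c", b"at"];
        let (mut state, mut matches) = (0u32, 0);
        for chunk in chunks {
            for &b in *chunk {
                let (next, hit) = step(state, b);
                state = next;
                if hit { matches += 1; }
            }
        }
        assert_eq!(matches, 2);
    }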

~~~
glangdale
(Hyperscan team lead here)

Interesting. I should point out that Hyperscan is not abandonware; it is still
maintained by Intel.

Streaming is not unique to Ragel.

You're not wrong about libpcre compatibility. We have very tight _syntactic_
compatibility with libpcre (that is, we won't misidentify an unsupported
construct and supply some erroneous semantics) but we make no secret of our
inability to handle general lookarounds and backreferences.

I'm interested in how you went about handling backreferences in Ragel. They
have been in our 'too hard' basket for years, although many instances of
backrefs are tractable in our framework. It's always seemed easier to not do
any of them rather than handle just a few.

Ragel certainly ties into the host language differently, and more tightly,
than Hyperscan. We use it ourselves, to lex regexes on the way into Hyperscan,
as it happens.

~~~
wahern
As I alluded to earlier, the 100% (possibly +/- edge cases) solution was
finished primarily by a contractor.[1] So the following only reflects the
state of things while my Lua parser, analyzer, and transformation code was
still in use. Also, my grasp of this stuff is nowhere near as strong as yours
or the contractor's. I'm a pretender ;)

Backreferences are implemented with prefilters, NFA machines, and what are
effectively zero-width assertions. It's not sexy, and the rest is easy to
sketch out from there as a conceptual matter. What's more interesting are the
Ragel-specific details.

Before we tackled backreferences we had a problem with unacceptable state
blow-up from unioned DFAs, repetitions, semantic conditions, and especially
their interactions. This was the biggest
barrier for most of the final 10%. To help solve that the contractor added an
NFA construct to Ragel, along with the magic needed for the generated Ragel
state machines to switch from DFA to NFA modes mid-expression. The NFA
construct expands on the notion of embedded host-language code in transition
actions and semantic conditions, which allows embedding inline C code to track
and control the matching behavior of each subexpression, as well as to handle
bookkeeping, including state stack management.

Most simple captures can be accomplished using Ragel's regular action
features. But IIRC even captures that could be done from a DFA machine would
sometimes blow up state too much, so I think the captures necessary for
backreferences are always done in NFA mode. Similarly, the code is quick to
fall back to NFA mode for expressions without backreferences, as that
simplified grouping some of the complex expressions into secondary unions,
which allowed compiling their prefixes as a DFA, as opposed to immediately
falling back to prefilters and a linear match of each expression that couldn't
be added to the primary or secondary unions.

Matching of backreferences is implemented similarly to lookarounds.
Lookarounds, at least in my early code, are implemented using semantic
conditions. In my later code I think lookbehinds still had to be a fixed size,
but it looks like (bugs notwithstanding) backreferences could be variable
sized. The Ragel match is anchored on the empty string at the relevant
position using the NFA construct, auxiliary code handles matching against the
capture, and on a match the `p' index variable is updated to skip over the
length of the match. The NFA construct saves and restores `p' as it trials
each subexpression. Of course this approach doesn't work well in streaming
mode. As with Ragel scanners, I think the host application could theoretically
arrange for windowing, but I don't know if everything is in place to make that
feasible. In our case we didn't need the streaming.

And at least in my code, anything with backreferences automatically gets a
prefilter and is only run conditionally in a second phase. Presumably that's
so we don't need to do captures for stuff that will never match, whether or
not it involves an NFA. I'm sure that's how it's still implemented.

Regarding streaming, I totally forgot about Hyperscan's streaming mode.
Streaming isn't something that (IME) can be readily bolted on after the fact,
so it's a testament to its careful design and forethought.

But in C, especially in a complex networking application doing asynchronous
I/O, callbacks are so extremely costly--they break code locality and compound
the normal headaches with ownership and lifetime. Managing that complexity is
of utmost importance in a language like C. Using Hyperscan in streaming mode,
especially in combination with DPDK... I imagine the project failure rate is
really high. I know of at least one skillful team that bombed precisely as I
predicted because of the excessive complexity and the resulting inability to
stabilize the product. The callback/push/interrupt model is too difficult in
C. In languages with lambdas, Hyperscan's streaming mode has no handicap. But
especially in C, Ragel's idiosyncratic host-language interface, with its
embedded action handling and minimal state, is just so incredibly important
for reducing complexity and for sticking as closely as possible to a pull
model for data consumption without sacrificing performance. If developers
spend most of their time working on and fixing the I/O code and related
interface boundaries (as is typical), there's much less time for implementing
the functional features that matter. I don't doubt people have built a ton of
amazing things with DPDK and Hyperscan's streaming mode, but the teams that
can do that _correctly_ are few and far between. Honestly, I
wouldn't want a solution like that protecting my data unless I knew the team
had the requisite skill.

On second thought... I should reel that criticism in a little bit. ;) For just
matching and cataloging "hits", rather than performing complex operations on
them, it's much less of a problem. The costs will be a function of how much
work is done on the other side of the callback. But it's just so much more
difficult to balance efficiency and complexity, and to constrain complexity in
general. I've been writing asynchronous I/O services since before libevent,
and I've just become constitutionally averse to callbacks. I've never used
libevent's buffering API because, before it even debuted, I had internalized
how poorly suited that sort of interface is in terms of interface composition.
Callbacks are necessary--e.g. at the interface boundary of the event loop--but
I try so hard to avoid them when possible, including switching to languages
other than C when possible. It's just a huge consideration for me, perhaps
unreasonably. :)

Also, I understand that Hyperscan's callback interface is largely dictated by
other considerations, like the perfectly sensible choice for a runtime library
interface. In streaming mode a generic queuing and retrieval interface would
be awkward and suboptimal for most cases, and there's no good substitute for
pushing that implementation detail to the application. DPDK, OTOH, wasn't
forced to adopt an asynchronous interrupt model; it was just the easiest.

[1] I don't like to mention names of people not participating in a thread,
even though in this case I'm sure you're familiar and he probably would
appreciate the plug. You may have conversed with him at some point because
your SIMD work is awesome, Hyperscan in general is amazing, and when it was
finally open sourced we had to take a really hard look at adopting it or,
alternatively, pursuing SIMD optimizations on our own. And
IIRC at some point he may have chatted with someone on the Hyperscan team.

~~~
glangdale
Thanks for the kind words about Hyperscan. If you'd like to talk more, please
get in contact (via the list, via our support alias, or our emails if you have
them already).

I'm curious about the design issues w.r.t. callbacks. We certainly haven't had
complaints from most of our users about this, but maybe there would have been
better models. Note that callbacks are partly a design pattern for _our_
convenience - we need those big, complex stack frames for many of our engines.
It's nice to throw a match out of the middle of a big unrolled loop and know
that we can come back to the middle of the loop - not into the tedious
'preconditioning and warmup' part of the loop.

An alternate design is to fill a buffer of matches, but that has its own
issues, especially with timeliness. Some users want to stop matching quickly
after the first match - a problem if we go off and scan another megabyte to
get the second through Nth matches.
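
To sketch the contrast (illustrative Rust, not Hyperscan's actual C API): a
per-match callback can request early termination mid-scan, which a
fill-a-buffer design can't easily offer without scanning further than the
caller wanted.

    #[allow(dead_code)]
    enum Scan { Continue, Stop }

    // Report each match position as it is found; the callback decides
    // whether scanning continues.
    fn scan(haystack: &[u8], needle: u8, mut on_match: impl FnMut(usize) -> Scan) {
        for (i, &b) in haystack.iter().enumerate() {
            if b == needle {
                if let Scan::Stop = on_match(i) {
                    return; // timely: stop right after the first match
                }
            }
        }
    }

    fn main() {
        let mut first = None;
        scan(b"xxaxxaxx", b'a', |i| {
            first = Some(i);
            Scan::Stop
        });
        assert_eq!(first, Some(2));
    }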

In any case, we're always interested to hear about cases where we aren't the
best solution, as this is always the most productive type of feedback to get.
"Your library is great and you guys are really smart" is gratifying, but it
doesn't lend itself well to continuous improvement.

------
nix0n
Google's cache of this bug (loads quite slowly for some reason):

[https://webcache.googleusercontent.com/search?q=cache:mVrrFL...](https://webcache.googleusercontent.com/search?q=cache:mVrrFLejZEwJ:https://svn.boost.org/trac/boost/ticket/12818+&cd=1&hl=en&ct=clnk&gl=us)

~~~
zao
The Trac instance (and legacy SVN server) isn't really designed for Slashdot
(well, HN) effect traffic; it runs on some jiggly piece of rust at some uni
somewhere.

~~~
nix0n
I was commenting on the load time of the cached page (amalag found the
solution).

------
pcwalton
Another counterexample to the idea that modern C++ written by experts is free
of memory safety issues.

~~~
mememachine
If the solution doesn't make it any easier to avoid memory issues (it just
forces you to avoid them), it's not an attractive solution.

~~~
amalag
Avoiding bugs is a pretty effective way to not have bugs.

~~~
mlmlmasd
Avoiding bugs is the default in software development in any language, no? If
you have a pointer DSL that can lead to various bugs, replacing it with a more
complex pointer DSL is not a good way to avoid bugs. And I don't think C++
'smart pointers' and other crap are very simple (relative to C pointers) or
much less error-prone. That's just my opinion though.

~~~
marvy
Avoiding bugs is what you try to do by default in any language, yes. But
success is not guaranteed :)

I agree that in general, if you have a pointer DSL that can lead to bugs,
replacing it with a more complex one just makes things worse, assuming all
else stays the same.

But! Maybe the extra complexity buys you something. For instance, maybe the
complexity allows the compiler to prove that certain classes of bugs are
impossible. Now we have a tradeoff. Let's say we want to get a correct program
as fast as possible. Then you can quickly write a buggy program in the simple
DSL, but you will need to spend some time debugging it. Or you can spend a
long time wrestling with the complicated one, but you don't need to debug
pointer issues anymore (you still have to fix other bugs of course.)
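
For instance (a hypothetical Rust sketch of "the compiler proves a bug class
impossible"): a dangling reference is rejected at compile time, so it never
becomes a debugging session.

    // This version does not compile: the compiler rejects returning a
    // reference to a local that is freed when the function returns.
    //
    //     fn dangling() -> &'static str {
    //         let s = String::from("local");
    //         &s // error[E0515]: cannot return reference to local variable
    //     }
    //
    // The accepted version moves ownership out instead of borrowing:
    fn not_dangling() -> String {
        String::from("ownership moves to the caller")
    }

    fn main() {
        println!("{}", not_dangling());
    }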

Then the question is: do you save more time in debugging than you pay in
"fighting" a complex language? And the answer is always the same: maybe! Maybe
you make fewer pointer bugs in C than most people, and maybe you're more
allergic to smart pointers than most people, so for you it's pointless.
Certainly pointer bugs are not very rare in the world, though. Worse, they're
not always noticed right away.

------
jimrandomh
Practical upshot: do not feed untrusted regexes into boost::regex.

~~~
Groxx
Untrusted regexes to _anything_, where possible. There are fairly well-known
ways to use those to run a DoS:
[https://en.wikipedia.org/wiki/ReDoS](https://en.wikipedia.org/wiki/ReDoS)
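
The textbook case (sketched in Rust, assuming the `regex` crate as a
dependency): a backtracking engine takes exponential time on a pattern like
(a+)+$ with almost-matching input, while an automaton-based engine (RE2,
Rust's regex) answers in linear time.

    use regex::Regex;

    fn main() {
        // Classic ReDoS pattern: nested quantifiers plus a failing suffix.
        let re = Regex::new(r"^(a+)+$").unwrap();
        let input = "a".repeat(50) + "!"; // never matches
        // A naive backtracker explores ~2^50 ways to split the a's;
        // the regex crate's automaton rejects this input instantly.
        assert!(!re.is_match(&input));
    }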

~~~
stuckagain
I take it you don't use a web browser?

~~~
lomnakkus
Not exactly sure what you're getting at, but web browsers these days have a
_lot_ of protection from malicious pages, including the "JavaScript on this
page is using 100% cpu; terminate? Yes/No?" dialogs.

------
donovanm
Hmm, makes me wonder what the result of running this fuzzing test on other
regex libraries would be.

~~~
moosingin3space
Rust's regex crate has fuzzing running against it, and that has led to a lot
of useful small fixes:
[https://github.com/rust-lang/regex/pull/262](https://github.com/rust-lang/regex/pull/262)
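
For flavor, a minimal fuzz target in the cargo-fuzz / libfuzzer-sys style (an
illustrative sketch, not the crate's actual harness) looks like this:

    #![no_main]
    use libfuzzer_sys::fuzz_target;

    fuzz_target!(|data: &[u8]| {
        if let Ok(pattern) = std::str::from_utf8(data) {
            // Compiling an arbitrary pattern must never crash;
            // returning an error is fine.
            if let Ok(re) = regex::Regex::new(pattern) {
                let _ = re.is_match("some fixed haystack");
            }
        }
    });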

------
darfs
Am I the only one who gets slowly angry about these "use $language, but not
this one" comments? I don't see much value in such comments, srsly.

Why not accept the fact that someone decided to write $it, and move on with
useful comments? Enough HN for today.

~~~
davb
Yes, I had a similar thought.

The top two comment chains right now are (to paraphrase) "See, modern C++
_isn't_ free of memory issues" and "Maybe we should rewrite it in Rust and
compare".

~~~
burntsushi
> "Maybe we should rewrite it in Rust and compare"

That already happened, almost three years ago.

It's an interesting comparison point. The OP contains memory corruption bugs
in C++'s standard regex library. If Rust claims to prevent these kinds of
bugs, does it actually hold up to scrutiny? One way of testing that is
throwing a fuzzer against a regex library written in Rust.

~~~
blub
Boost is not the "C++ standard regex library"; it's one of the best-known C++
libraries.

------
revelation
They seem to be fuzzing the regex, not just the input it is applied to. This
may or may not change the results, but if you're allowing users to input
arbitrary regex patterns you have a whole lot of other problems.

~~~
AgentME
I've seen plenty of places where you may want to accept an arbitrary regex
from the user. An app could allow the user to set up a filter for messages or
usernames by putting in a regex. Or an interpreter for a sandboxed language
could provide regex support.

~~~
revelation
All manner of problems in the programmer's mind become trivial if only we
allow users to input what is essentially code expressing exactly what they
want. Of course this is basically never a good solution.

The issue with allowing arbitrary regex patterns is DoS through exponential
blowup. But if you allow running code anyway, you might not care much about
that.

~~~
kbenson
Assuming untrue things about the mathematical properties of your regular
expression engine which are not supported by the documentation and allowing
that to be exploited by user input is a different beast entirely than a _bug_
in that library. The first, with careful examination of the properties may be
preventable through sanitation of the input to exclude certain edge cases. You
can't assume you'll be able to sanitize input ahead of time for bugs nobody
knows about yet, which might be hidden in normally safe features.

------
luckystartup
Is it true that most of these bugs wouldn't be possible in Rust? I think Rust
might have solved:

> heap-buffer-overflows, stack overflow, use of uninitialized data, SIGSEGV,
> undefined shift, invalid enum value, memory leaks

I don't think Rust can protect you from infinite loops. Nor, of course, from
actual logic bugs where the output is incorrect (including "assert failures").
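
For the buffer-overflow class specifically, the difference is that safe Rust
turns the bug into a deterministic, local failure (a minimal sketch):

    fn main() {
        let buf = vec![0u8; 4];
        let i = 10;

        // Checked access returns None instead of reading past the allocation.
        assert_eq!(buf.get(i), None);

        // Indexing (`buf[i]`) would panic right here with
        // "index out of bounds: the len is 4 but the index is 10" --
        // a crash at the faulting line, not memory corruption that
        // surfaces thousands of instructions later.
    }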

~~~
amalag
>I don't think Rust can protect you from infinite loops.

That is the halting problem!

------
foota
This is your reminder to treat a regex with the same scrutiny as eval.

