Hacker News new | past | comments | ask | show | jobs | submit login
ReDoS: Regular Expression Denial of Service (gitconnected.com)
83 points by davisjam314 on June 29, 2020 | hide | past | favorite | 37 comments

"Your regex engine uses a slow regex matching algorithm. This is the case for the regex engines built into most programming language runtimes — this includes Java, JavaScript (various flavors) and Node.js, C#/.NET, Ruby, Python, Perl, and PHP. On the safe side, Go and Rust both use fast regex engines, and Google’s RE2 library is fast as well."

I don't feel like "slow" and "fast" are the right adjectives here. Perhaps some form of "backtracking capable" instead of "slow".

Edit: For example, the jit version of pcre2 is often faster than Rust's re engine. But it still supports backtracking and can be "dos-ed".

I don't know if it has a particular name, but it is a common problem: fast average case, slow worst case.

A common example is the quicksort algorithm. It is often considered the fastest sorting algorithm in the general case but it is still O(n²) in the worst case. According to statistics, it should never happen in practice, however, with cleverly ordered data, it can. That's why some libraries prefer to use a slightly slower algorithm (ex: heap sort) but with a O(n.log(n)) guarantee, so that it can't be DDoSed.

Unbalanced binary trees can have the same problem, on random data, they work fine, but improperly used, they degenerate into much slower linked lists. Hash tables can also be exploited if they don't used a cryptographically secure hash function, and these are rarely used because they can be so slow as to negate the advantage of using a hash table in the first place.

>cryptographically secure hash function

I doubt that would help you. Hash tables take the hash and mod it by the size of the array. That will lead to collisions after the mod even if it's essentially impossible to generate collisions before the mod.

What you actually need is a secret hash function, so attackers don't know what the values hash to. This is done using a keyed hash function with a secret key.

As to sorting, a lot of libraries use timsort these days since it's more optimal than quicksort in the real world and has nlogn worst case. Or b trees for big storage and sorting, but that's generally a different use case (databases et al).

If your threat model is "users who create hash collisions to slowly encumber data operations", you've either royally pissed someone off or you have a very determined competitor. No one who cares about performance would deploy a crypto hash function for a dictionary unless they absolutely had to. You can always add more buckets to reduce collision probability. There's an entire class of hash functions dedicated to non crypto purposes (xxhash, murmur, etc).

The name for this in the security literature is "algorithmic complexity attacks". They were first proposed by Crosby & Wallach in 2003, with a running example of degraded hash tables.

Yes, those terms are definitely imprecise.

My target audience for that blog post was "anyone vaguely technical, from programmers to managers". I therefore attempted to minimize the amount of technical detail and to focus on the core algorithmic concepts.

You could have just used one sentence to explain that regular expressions in (list of languages) have a feature called backtracking which can be exploited to slow down and DoS a site. I mean, even that sentence, right there. Instead, you boil it down so the reader knows Go and Rust are good and other languages are bad.

Maybe instead of "a slow regex matching algorithm" you could just say "a backtracking algorithm".

And instead of "a fast regex engine" you could say "a linear algorithm".

That way it's both more concise and more technically correct.

I suggest using the term "regular language" in contrast to "regex" or "regular expression".


"Regular language" is an accurate mathematical term that's not "corrupted" by programming. I think you could also use "automata-based" but it's probably less clear and refers more to the implementation [1]

If you scroll down about 3 pages here, there is table explaining regexes vs. regular languages:


It's the same stuff that's in RSC's articles but those are too dense for a general programming audience (they're more for regex engine implementers). And the existence of the OP's article shows that people haven't gotten it :)

You could also use the terms:

- regex engine (Perl/Python/PCRE/Java/etc.)

- regular language engine (libc, RE2, rust/regex)

Or "hybrid engine" may be more accurate for some of those. That is, they may use either automata-based techniques or backtracking depending on the situation.


BTW, the Oil shell has a new syntax Eggex for both regular languages and regexes that makes the difference clear. In other words, you should be able to see from the syntax if a construct may cause exponential backtracking.


In other words, it's a static property that's easily detected (if you have a parser for the regex language itself, which is not hard at all in theory, but maybe it is a little in practice :) )


[1] Aside: The terms NFA and DFA were also corrupted by Friedl in a popular O'Reilly book. I have this book and remember being super confused by it, since I read it long after I learned what a DFA and NFA were.


What little text it devotes to implementation issues perpetuates the widespread belief that recursive backtracking is the only way to simulate an NFA. Friedl makes it clear that he neither understands nor respects the underlying theory.


The "corrupted" term is better here because it describes what the article is about: how to avoid security issues with real-world software tools, some of which match true regular expressions and some which match extended pseudo-"regular expressions".

See where the article says, "Faster engines may not support advanced features, most notably backreferences and lookaround assertions."

In a discussion about sucrose, fructose, aspartame, and stevia, you wouldn't tell someone to stop using the term "sweetener" and instead use the narrower and more precise term "sugar". If you did, you'd be advising them to use a word with the wrong meaning.

I'm replying to the parent, so the article could be called:

Regular Expression Denial of Service Can be Avoided by Using Regular Languages

In other words, accepting regular expressions / regexes exposes you to a DoS. Accepting regular languages does not (as long as you're using an appropriate engine.)

This seems like a quibble, but putting the emphasis on "using an appropriate engine" in parentheses at the end is burying the lede. Sticking to regular languages will not, in itself, help - you can have catastrophic backtracking in a backtracking engine even with a regular expression that uses only "regular language" concepts.

It's also at least theoretically possible that an engine could implement a subset of the non-regular-language "regex" constructs and still avoid backtracking. We considered this in Hyperscan - there are a number of backreference cases that are tractable in practice:

First, sometimes the backreferences just don't contain that many different possibilities - sometimes they are used to match quotes (or absence of quotes) so the three alternatives are ("|'|epsilon). This can be done by expansion.

Second, a number of backreference uses are structured so that you can only have a small - fixed size - number of back-references "alive" at once. For example, /\s+(\d+)\s+\1/ in libpcre notation matches whitespace, then a number, then more whitespace, then that number again. The thing being referred to by the backref can't ever have multiple live simultaneous values - an automata could note the boundary-crossings in and out of this particular backref and make a note of the value, then do a simple compare later.

There are quite a number of backreferences out there that work this way.

No-one, to my knowledge, does this, but it could be done.

The point is - it's the engine. Not the language. Fully general support for nasty regular expression constructs does lead in practice to ugly blowouts but it doesn't have to.

Notably, received wisdom that matching backreferences is exponential is based on a false assumption. The proof that it's in NP requires you to arbitrarily increase the number of backreferences, which is counter to the way most people would imagine regex matching or string matching is analyzed - you either assume that the regex or string is a fixed workload or you give it another parameter 'm' and talk about O(nm) things.

Fair point about the parenthetical, though I would say both the engine and the language are crucial, depending on what role you're playing in the system.

The setup I think of: you have "domain experts" writing policies with regexes, software engineers who embed the engine in an application, and operators of the entire service. My usages of regexes for "big data" followed that pattern, and I imagine the DPI applications of Hyperscan do too.

A lot of those rules are accumulated over time, are not easy to rewrite, and require testing. If you have a thousand regexes in one dialect painfully crafted over years, you can't just change the engine. It would break the whole application.

So that is why I emphasize the language. Backtracking engines are more common, and the language and the "code" written in the language has to be changed before you can change the engine. That happened at least partially at Google 10+ years ago, migrating from PCRE to RE2.

But yes, if you're an operator, you will be looking at it from the engine perspective. Obviously it is not enough to change the patterns themselves.

(The backreference point is interesting, although I'll probably leave it out of my mental model until someone implements it :) )


Also the point of Eggex is to treat regexes more abstractly, like a real language. Just like I don't write C++ code for GCC or Clang, I don't write regexes for a specific engine. I have ported regexes between engines many times.

The typical application I've worked on uses multiple engines. Hell a one line shell script using grep and awk uses multiple engines!!!

Tangent: reminds me of this paper: https://www3.cs.stonybrook.edu/~dongyoon/papers/FSE-19-Lingu...

This is very true. It was something I admit I found more than a bit irritating; a number of Hyperscan DPI customers just started regarding Hyperscan as a "fact of nature" and would simply start coding around weaknesses or avoiding patterns that didn't work as well. I would much have preferred that they complain and have us fix things for them.

IMO there's still a lot of scope for a better regex engine; backtracking engines have a lot of expressiveness for better or worse and I'm not convinced that they are as bad as people say. Systems like Hyperscan are somewhat niche in that they throw a lot of stuff overboard in pursuit of scale (lots of regexes at once), performance and "streaming" (the ability to pause matching with a fixed-sized amount of context and restart). Despite its hype, RE2 is not that great either - it's a couple engines taped together with some fairly opinionated design policies that has become big enough to be a "fact of nature" at a big influential company ("I guess that's how a regex engine works").

Hm I don't really think RE2 is even that influential (unfortunately IMO). Rust is the only new language whose engine borrows its design AFAIK. For example, I was disappointed to learn that Julia uses PCRE.


I think backtracking is firmly in the past because it causes too many problems in practice [1], but I think there's scope for a text matching engine that's built on top of regular languages, but more expressive.

As mentioned here and on lobste.rs recently, regular languages aren't expressive enough for syntax highlighting (or lexing most programming languages in their entirety).

But it doesn't take much to make them more expressive, WHILE meaning these properties:

1. linear-time and constant space

2. Composition by OR

Backtracking fails at both properties. Backtracking is bad. Like you say with backreferences, there should be something more expressive than regular languages that maintains both properties (and it's likely to be useful).

Example of tiny mechanism on top of regular languages:

When Are Lexer Modes Useful? http://www.oilshell.org/blog/2017/12/17.html



Here's an implementation idea for such a text matching engine:

(1) Start with re2c, because it has some different scaling properties than other engines https://news.ycombinator.com/item?id=23665569

The compile time and code size are linear with respect to the number of patterns. What was very surprising to me is that the compiled C code starts off slower than interpreters, but scales better as you increase the number of patterns.

(2) Write a WebAssembly backend for it. Compiling its C output to WebAssembly is not likely to be great, because it uses gotos on every line, and gotos in WebAssembly need the Relooper algorithm last time I checked.

(3) Now add a linear-space constant-time VM on top. As a test of this property, it should be able to express LL(1) parsing and LALR(1) parsing. Something like "lexer modes" should also fall out trivially.

I believe this strategy saves a bunch of work, and has some desirable properties:

- an entire RE front end, handling captures and such in DFAs: https://arxiv.org/abs/1907.08837 . I'm sure re2c is slower than Hyperscan, but as you hint, I think it's also more general and widely used.

- wasm could be more palatable to embed than other VMs. There are multiple implementations of it.

- wasm avoids dealing with C code and assembly code. It has a few high quality JITs.

- The first version of wasm has a ton of limitations (lack of GC, rich interfaces to the host), but this application hits none of them. It's a perfect problem for it.

(I'm also pretty sure this hasn't been done, but I could be wrong. I googled "re2c wasm" and found that it is used in some wasm tools, but I think that's unrelated: https://chromium.googlesource.com/external/github.com/WebAss... )

And it's also significantly different than RE2 or rust/regex, if that's your complaint :)

If any of that is interesting feel free to contact me at any method listed here: http://www.oilshell.org/ (email, zulip, etc.)


[1] https://lobste.rs/s/m24zv1/xi_editor_retrospective#c_pyfpv1

Actually one more thing I realized about this idea that would be wild...

Since re2c itself is written in C, it can be easily compiled to wasm (even if it has a few gotos). So then you get an engine that can compile user patterns at runtime, rather than having it be an offline process.

(I don't really think there is that much difference in constraints between the offline and online processes, as some of have suggested. For example I observed that re2c compile time is linear with respect # constants strings in a large range in that benchmark.)

I have been thinking of embedding wasm in Oil, mainly to do reasonably fast string transformations. Because one goal of Oil is to provide an alternative to extremely tortured string transformation syntax like: https://github.com/dylanaraps/pure-bash-bible

And Oil has the eggex language, that compiles to egrep syntax, and uses libc's engine.

But it would be cool if it could compile to DFAs with an embedded re2c compiler.

Then again, I was surprised that the compiled C code was not faster than an intepreter for small cases, so maybe this whole thing is too complex for the benefit... I guess I would have to do some up front benchmarks to see if this is even worth it.

It could be that you need an extremely fast wasm engine to make it worthwhile, and all of those are tied to big JS engines and their very non-portable JITs.

[1] https://johnwickerson.wordpress.com/2020/05/21/diagrams-for-...


edit: found some related comments:


Surprise, the "totally shocking" result there is due to BACKTRACKING :) Automata-based regex in wasm beats native browser regex because it doesn't backtrack.


Tangent: I am the author of the Lingua Franca paper. I hope you enjoyed it!

Yes, it was very interesting, and related to my Eggex subproject (of the Oil shell, "our upgrade path from bash"):


Basically the idea is to compile a new syntax to multiple engines. It's embedded in a shell and has a seamless upgrade path:


I'd be interested in your feedback on it! I think it's very related to Lingua Franca but if you disagree I'm also interested in that :)

I think most "new language" projects are rightly met with skepticism, but the difference with Oil is that is emphasizes a transition path, and it's embedded in a significant piece of work (i.e. there are other projects like this, including ones from myself, but ironically I didn't want the extra dependency)

Although I guess the obvious criticism is that Eggex almost entirely syntactic. If two regex engines have the same syntax but different semantics, it doesn't help you. But I have not found that to be a big issue in decades of using multiple engines, although admittely it relates to the subset of regexes I use. Eggex is more like a union of engines than an intersection, so maybe that's something to change ...

> This seems like a quibble, but putting the emphasis on "using an appropriate engine" in parentheses at the end is burying the lede. Sticking to regular languages will not, in itself, help.

Yes. I attempted to make this point at the end of my article, while avoiding phrases like "regular languages".

> There are quite a number of backreferences out there that work this way

Yes, this appears to be a common usage of backreferences. In the regex corpus from the FSE'19 Lingua Franca paper, many backreference-using regexes are of this form.

I believe a loose bound on the worst-case time complexity for regexes with backreferences is n^{mX} where X is the number of backreferenced capture groups. The goal is to describe the number of distinct values the capture groups can take on. If a capture group can take on arbitrary sub-strings (e.g. .(.)\1) then m=2; if it is fixed but ambiguous (e.g. .*(')\1) then m=1; if it is unambiguous then m=0. Often still a polynomial though.

The exception proves the rule; I've seen regexes with 20 backreferences, many of them ambiguous, but the vast majority cover simple use cases and are quite sane.

[ disclosure: I designed Hyperscan and worked on it 2006-2017 ]

This is a rather old concept, but as long as people keep using backtracking regular expression matchers, I suppose it's relevant.

What would be intriguing would be ReDoS against robust matchers like RE2 and Hyperscan. Both have their own potential performance corners that could be exploited, especially when matching large numbers of patterns (more Hyperscan's bailiwick than RE2).

For a sufficiently large pattern (or, rather, one that builds a sufficiently large DFA), RE2 could be forced to stay in its slow path by an input that prioritizes trying to head for an unbuilt DFA state. An attacker could attempt to tune this to stay on the edge of RE2s heuristics to fall back to NFA matching.

Alternately, NFA matching in RE2 is quite slow, and there are presumably ways to ensure that the slowest paths (probably involving large numbers of states being on) are traversed.

Hyperscan has its own performance corners - the dumps produced by Hyperscan's compiler could be examined to discover which "literal factors" it's using, then inputs crafted to blast those literal factors. There's a good deal of trickery to try to avoid getting caught by these kind of "overly frequent literal factors" so it's probably harder than one might think - but I don't doubt an attacker could make life difficult for Hyperscan.

Once into the NFA/DFA engines in Hyperscan, an attacker could ensure that the engines can't ever use their fast acceleration paths.

Note these aren't exponential blowups in performance. However, hitting one of these systems (RE2 or Hyperscan) with a 'constant factor' of extra work could be quite significant.

This is a "technical two-pager" of my doctoral research about regex denial of service.

I looked through this. I'm curious as to how you managed to slot in a small reference to Hyperscan (my baby for many years) while writing a thesis about regular expression performance problems.

You cite us twice; once as a source on what "dominated" means in a graph theoretic sense (?) and once to falsely suggest that Hyperscan is somehow "domain specific". To the extent that Hyperscan is focused on scaling up to larger-scale pattern matching (unlike RE2) it is suitable for a domain that other regular expression matchers are not suitable for, but there's no reason that you can't use Hyperscan in a general purpose domain.

The fact of the matter is that large scale regular expression matching, where it occurs in domains that are highly performance-sensitive, is a extremely difficult problem that is largely solved - in a limited way. No-one sensible builds an IPS or a NGFW and expects to be able to do backreferences and the usual menagerie of backtracking-only features (recursive subpatterns!).

The various outages that have happened since - people tripping over performance problems that were widely understood in the industry in at least 2006, if not earlier - are examples of people wilfully using the wrong tools for the job and being surprised when they don't work.

There's a niche for faster backtracking matchers, but most of the usage of regular expressions where anyone cares about performance is already squarely in the "automata" world and is done by either Hyperscan, hand-rolled software at network shops or acceleration hardware like the cards from TitanIC (now Mellanox). The backtracking formulation is not only subject to all the superlinear guff, it's also incredibly poorly suited to matching large numbers of regular expressions at once.

(This is a repetition of my replies to your thread on Twitter).

My goal was not to prove that super-linear behavior is possible, but rather to understand questions like:

1. How common are such regexes in practice? (~10%)

2. Do developers know it? (<50%)

3. Do they use specialized regex engines like RE2 or HyperScan? (Not often)

These are not theoretical advances. But I believe that my work improved our understanding of practice, and that it will let us ground future decisions in large-scale data instead of opinion.

Are there other regex questions you think are worth a look? I'm all ears!

I apologize if my treatment of HyperScan was offensive to you. My understanding from surveys is that developers typically use the regex engine built into their programming languages. I therefore focused on those engines, not on standalones like RE2 and HyperScan.

Once RE2 and HyperScan reach widespread use (I hope my research moves folks in that direction!), understanding their weaknesses will be an interesting next step.

In the automata world there is also re2c and ragel

Edit: Never mind, clicked on the article, lots of sources!

Is your paper available for reading? I'm very interested in it.

A quick Google search turned up http://hdl.handle.net/10919/98593.

In one place, you say that programmers should pre-validate that inputs are not long. I would suggest quantifying that somehow. I suspect some developers will not realize that a 30 character input is "long".

At the same time some sites/systems impose too strict size limits which makes impossible to enter e. g. a long name or a long address. And such problems are almost inevitable if a programmer makes assumptions about input data based on a limited personal experience.

https://github.com/kdeldycke/awesome-falsehood tries to address this.

Thank you for the suggestion. I added notes about typical sizes depending on the complexity of the regex.

TLDR: ~30 chars is too long for exponential regexes. 1K-5K chars are too long for quadratic regexes. Higher polynomials blow up somewhere between 100 and 1000 characters.

The paper Regular Expression Matching Can Be Simple And Fast [1] has a time complexity plot that clearly shows how backtracking engines are more exploitable than finite-automata engines.

[1]: https://swtch.com/~rsc/regexp/regexp1.html

Interestingly, the backtracking engines can also be viewed as using automata. The trouble is actually in how they simulate the automaton. They use backtracking -- like a depth-first search. Since they don't remember where they've been, this leads them to a lot of redundancy. Engines like RE2 use a lockstep approach instead -- like a breadth-first search. They work their way down the search tree level-by-level and so don't have to repeat themselves.

Practically speaking, the slow regex engines don't always build NFAs. Instead they may follow a grammar-directed approach. But the NFA is a valid model of that simulation.

I learned about ReDoS through a PostgreSQL bug. On an old pg version(9.4.5), the following regex expression made pg hang in an infinite loop.

  select 'anystring' ~ '($^)+';
Of course that was patched a long time ago, but since then I've avoided exposing regex functionality to end users.

This isn't a cheat-sheet. I expected a PDF with one or two pages of very condensed key information, not a long web article.

Ok, we've de-cheat-sheeted the title above.

Great. Much better now.

Thanks, folks!

It's great to see more introductory ReDoS material! I took a deep-dive on ReDoS myself recently and found the material available to be somewhat lacking, especially for beginners. Over the course of my investigation I found some interesting bugs in big name projects and created a few blog posts as an introduction to the bug class:

* https://blog.r2c.dev/2020/finding-python-redos-bugs-at-scale...

* https://blog.r2c.dev/2020/improving-redos-detection-with-dli...

The culmination of this work was a Python regex linter that could automatically detect ReDoS expressions with fairly high accuracy - Dlint's DUO138 rule: https://github.com/dlint-py/dlint/blob/master/docs/linters/D....

In my opinion, the best solution, as this article mentions, is to avoid the bug class altogether by using something like RE2 when possible. Nevertheless, I found ReDoS to be a really cool bug class at the intersection of computer science and software engineering.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact