
ReDoS: Regular Expression Denial of Service - davisjam314
https://levelup.gitconnected.com/the-regular-expression-denial-of-service-redos-cheat-sheet-a78d0ed7d865
======
tyingq
_" Your regex engine uses a slow regex matching algorithm. This is the case
for the regex engines built into most programming language runtimes — this
includes Java, JavaScript (various flavors) and Node.js, C#/.NET, Ruby,
Python, Perl, and PHP. On the safe side, Go and Rust both use fast regex
engines, and Google’s RE2 library is fast as well."_

I don't feel like "slow" and "fast" are the right adjectives here. Perhaps
some form of "backtracking capable" instead of "slow".

Edit: For example, the JIT version of PCRE2 is often faster than Rust's regex
engine. But it still supports backtracking and can be DoS-ed.
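
To make the point concrete, here is a minimal sketch of catastrophic
backtracking using Python's built-in `re` (one of the backtracking engines in
question); the pattern and input sizes are only illustrative:

    import re
    import time

    # '(a+)+$' nests quantifiers: a run of 'a's can be split between the inner
    # and outer '+' in exponentially many ways, and a backtracking engine
    # explores them all once the final match fails.
    pattern = re.compile(r'(a+)+$')

    for n in (16, 18, 20, 22):
        s = 'a' * n + '!'  # almost matches, forcing the full backtracking search
        start = time.perf_counter()
        pattern.match(s)
        # runtime roughly quadruples with each step of two characters
        print(n, round(time.perf_counter() - start, 3))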

~~~
GuB-42
I don't know if it has a particular name, but it is a common problem: fast
average case, slow worst case.

A common example is the quicksort algorithm. It is often considered the
fastest sorting algorithm in the general case, but it is still O(n²) in the
worst case. Statistically, that worst case should never happen in practice;
with cleverly ordered data, however, it can. That's why some libraries prefer
a slightly slower algorithm (e.g. heapsort) with an O(n log n) guarantee, so
that it can't be DoSed.
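
A minimal sketch of the effect in Python, assuming a deliberately naive
quicksort that always picks the first element as pivot (library
implementations choose pivots more carefully):

    import sys
    import time

    sys.setrecursionlimit(10_000)  # the degenerate case recurses n levels deep

    def naive_quicksort(xs):
        # First element as pivot: fine on random data, but on already-sorted
        # input every partition is maximally unbalanced, giving O(n^2) work.
        if len(xs) <= 1:
            return xs
        pivot, rest = xs[0], xs[1:]
        return (naive_quicksort([x for x in rest if x < pivot])
                + [pivot]
                + naive_quicksort([x for x in rest if x >= pivot]))

    for n in (1000, 2000, 4000):
        data = list(range(n))  # sorted input is the adversarial case here
        start = time.perf_counter()
        naive_quicksort(data)
        print(n, round(time.perf_counter() - start, 3))  # ~4x per doubling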

Unbalanced binary trees can have the same problem: on random data they work
fine, but improperly used they degenerate into much slower linked lists. Hash
tables can also be exploited if they don't use a cryptographically secure
hash function; such functions are rarely used in hash tables because they can
be slow enough to negate the advantage of using a hash table in the first
place.

~~~
Thorrez
>cryptographically secure hash function

I doubt that would help you. Hash tables take the hash and mod it by the size
of the array. That will lead to collisions after the mod even if it's
essentially impossible to generate collisions before the mod.

What you actually need is a secret hash function, so attackers don't know what
the values hash to. This is done using a keyed hash function with a secret
key.
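
A sketch of the difference in Python; the table size and brute-force bound are
arbitrary, and `blake2b`'s keyed mode stands in for the SipHash-style keyed
hashes that real runtimes use:

    import hashlib
    import os

    TABLE_SIZE = 1024

    def public_bucket(data: bytes) -> int:
        # SHA-256 is collision-resistant, but the bucket is digest % size, so
        # anyone who knows the function can search for same-bucket inputs.
        h = int.from_bytes(hashlib.sha256(data).digest()[:8], 'big')
        return h % TABLE_SIZE

    # An attacker precomputes inputs that all land in bucket 0.
    colliding = [b'%d' % i for i in range(100_000)
                 if public_bucket(b'%d' % i) == 0]
    print(len(colliding))  # ~100 mod-level collisions, strong hash or not

    SECRET = os.urandom(16)

    def keyed_bucket(data: bytes) -> int:
        # With a secret key, bucket placement is unpredictable to the
        # attacker, so collision sets can't be precomputed offline.
        h = hashlib.blake2b(data, key=SECRET, digest_size=8).digest()
        return int.from_bytes(h, 'big') % TABLE_SIZE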

------
glangdale
[ disclosure: I designed Hyperscan and worked on it 2006-2017 ]

This is a rather old concept, but as long as people keep using backtracking
regular expression matchers, I suppose it's relevant.

What would be intriguing would be ReDoS against robust matchers like RE2 and
Hyperscan. Both have their own potential performance corners that could be
exploited, especially when matching large numbers of patterns (more
Hyperscan's bailiwick than RE2).

For a sufficiently large pattern (or, rather, one that builds a sufficiently
large DFA), RE2 could be forced to stay in its slow path by an input that
keeps steering toward unbuilt DFA states. An attacker could attempt to tune
this to stay just on the edge of RE2's heuristics for falling back to NFA
matching.

Alternatively, NFA matching in RE2 is quite slow, and there are presumably
ways to ensure that the slowest paths (probably involving large numbers of
simultaneously active states) are traversed.

Hyperscan has its own performance corners - the dumps produced by Hyperscan's
compiler could be examined to discover which "literal factors" it's using,
then inputs crafted to blast those literal factors. There's a good deal of
trickery to avoid getting caught by these kinds of "overly frequent literal
factors", so it's probably harder than one might think - but I don't doubt an
attacker could make life difficult for Hyperscan.

Once into the NFA/DFA engines in Hyperscan, an attacker could ensure that the
engines can't ever use their fast acceleration paths.

Note these aren't exponential blowups in performance. However, hitting one of
these systems (RE2 or Hyperscan) with a 'constant factor' of extra work could
be quite significant.

------
davisjam314
This is a "technical two-pager" of my doctoral research about regex denial of
service.

~~~
glangdale
I looked through this. I'm curious as to how you managed to slot in a small
reference to Hyperscan (my baby for many years) while writing a thesis about
regular expression performance problems.

You cite us twice; once as a source on what "dominated" means in a
graph-theoretic sense (?) and once to falsely suggest that Hyperscan is
somehow "domain specific". To the extent that Hyperscan is focused on scaling
up to larger-scale pattern matching (unlike RE2), it is suitable for a domain
that other regular expression matchers are not, but there's no reason you
can't use Hyperscan in a general-purpose domain.

The fact of the matter is that large-scale regular expression matching, where
it occurs in domains that are highly performance-sensitive, is an extremely
difficult problem that is largely solved - in a limited way. No one sensible
builds an IPS or an NGFW and expects to be able to do backreferences and the
usual menagerie of backtracking-only features (recursive subpatterns!).

The various outages that have happened since - people tripping over
performance problems that were widely understood in the industry by at least
2006, if not earlier - are examples of people wilfully using the wrong tools
for the job and being surprised when they don't work.

There's a niche for faster backtracking matchers, but most of the usage of
regular expressions where anyone cares about performance is already squarely
in the "automata" world and is done by either Hyperscan, hand-rolled software
at network shops or acceleration hardware like the cards from TitanIC (now
Mellanox). The backtracking formulation is not only subject to all the
superlinear guff, it's also incredibly poorly suited to matching large numbers
of regular expressions at once.

~~~
davisjam314
(This is a repetition of my replies to your thread on Twitter).

My goal was not to prove that super-linear behavior is possible, but rather to
understand questions like:

1. How common are such regexes in practice? (~10%)

2. Do developers know about this? (<50%)

3. Do they use specialized regex engines like RE2 or Hyperscan? (Not often)

These are not theoretical advances. But I believe that my work improved our
understanding of practice, and that it will let us ground future decisions in
large-scale data instead of opinion.

Are there other regex questions you think are worth a look? I'm all ears!

I apologize if my treatment of Hyperscan was offensive to you. My
understanding from surveys is that developers typically use the regex engine
built into their programming language. I therefore focused on those engines,
not on standalone engines like RE2 and Hyperscan.

Once RE2 and Hyperscan reach widespread use (I hope my research moves folks in
that direction!), understanding their weaknesses will be an interesting next
step.

------
hyperpape
In one place, you say that programmers should pre-validate that inputs are not
long. I would suggest quantifying that somehow. I suspect some developers will
not realize that a 30-character input is "long".
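
For instance, a sketch of the kind of pre-validation meant here; the bound is
illustrative and should be tuned to the patterns in question:

    import re

    MAX_REGEX_INPUT = 64  # illustrative; "long" can be ~30 chars for a bad regex

    def guarded_search(pattern: re.Pattern, text: str):
        # Reject over-long inputs before they reach a backtracking engine.
        if len(text) > MAX_REGEX_INPUT:
            raise ValueError(f'input of {len(text)} chars exceeds regex limit')
        return pattern.search(text)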

~~~
citrin_ru
At the same time, some sites/systems impose size limits that are too strict,
which makes it impossible to enter e.g. a long name or a long address. Such
problems are almost inevitable when a programmer makes assumptions about
input data based on limited personal experience.

[https://github.com/kdeldycke/awesome-falsehood](https://github.com/kdeldycke/awesome-falsehood)
tries to address this.

------
cbarrick
The paper _Regular Expression Matching Can Be Simple And Fast_ [1] has a time
complexity plot that clearly shows how backtracking engines are more
exploitable than finite-automata engines.

[1]: [https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html)

~~~
davisjam314
Interestingly, the backtracking engines can also be viewed as using automata.
The trouble is actually in how they _simulate_ the automaton. They use
backtracking -- like a depth-first search. Since they don't remember where
they've been, they end up doing a lot of redundant work. Engines like RE2 use
a lockstep approach instead -- like a breadth-first search. They work their
way down the search tree level by level and so don't have to repeat
themselves.

Practically speaking, the slow regex engines don't always build NFAs. Instead
they may follow a grammar-directed approach. But the NFA is a valid model of
that simulation.
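
A toy illustration of the lockstep idea, assuming a hand-built NFA for the
ambiguous regex `(a|a)*b` (exactly the sort of pattern that sends a
backtracking engine exponential):

    # NFA for (a|a)*b: state 0 loops on 'a' via two redundant transitions
    # (the ambiguous alternation) and moves to accepting state 1 on 'b'.
    NFA = {0: [('a', 0), ('a', 0), ('b', 1)]}
    ACCEPT = {1}

    def lockstep_match(s: str) -> bool:
        # Breadth-first simulation: keep the *set* of live states and advance
        # them all one character at a time. Duplicates collapse in the set, so
        # the ambiguity costs nothing: O(|states|) work per character.
        states = {0}
        for ch in s:
            states = {nxt for st in states
                          for sym, nxt in NFA.get(st, ())
                          if sym == ch}
            if not states:
                return False
        return bool(states & ACCEPT)

    print(lockstep_match('a' * 50 + 'b'))  # True, instantly
    print(lockstep_match('a' * 50))        # False, instantly; compare
                                           # re.match(r'(a|a)*b', 'a' * 50)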

------
steve-chavez
I learned about ReDoS through a PostgreSQL bug. On an old pg version (9.4.5),
the following regex made pg hang in an infinite loop:

    select 'anystring' ~ '($^)+';

Of course that was patched a long time ago, but since then I've avoided
exposing regex functionality to end users.

------
snvzz
This isn't a cheat-sheet. I expected a PDF with one or two pages of very
condensed key information, not a long web article.

~~~
dang
Ok, we've de-cheat-sheeted the title above.

~~~
snvzz
Great. Much better now.

~~~
davisjam314
Thanks, folks!

------
schwag09
It's great to see more introductory ReDoS material! I took a deep dive into
ReDoS myself recently and found the available material somewhat lacking,
especially for beginners. Over the course of my investigation I found some
interesting bugs in big name projects and created a few blog posts as an
introduction to the bug class:

* [https://blog.r2c.dev/2020/finding-python-redos-bugs-at-scale...](https://blog.r2c.dev/2020/finding-python-redos-bugs-at-scale-using-dlint-and-r2c/)

* [https://blog.r2c.dev/2020/improving-redos-detection-with-dli...](https://blog.r2c.dev/2020/improving-redos-detection-with-dlint-and-r2c/)

The culmination of this work was a Python regex linter that could
automatically detect ReDoS expressions with fairly high accuracy - Dlint's
DUO138 rule:
[https://github.com/dlint-py/dlint/blob/master/docs/linters/DUO138.md](https://github.com/dlint-py/dlint/blob/master/docs/linters/DUO138.md).

In my opinion, the best solution, as this article mentions, is to avoid the
bug class altogether by using something like RE2 when possible. Nevertheless,
I found ReDoS to be a really cool bug class at the intersection of computer
science and software engineering.
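
To give a flavor of the bug class such a rule flags (a generic example, not
taken from Dlint's docs):

    import re

    # Nested quantifier over an ambiguous group: '\w+\s?' can split a run of
    # word characters in many ways, so a failing match triggers an exponential
    # backtracking search in engines like Python's built-in `re`.
    UNSAFE = re.compile(r'^(\w+\s?)*$')

    # Don't actually run this; it effectively hangs a backtracking engine:
    # UNSAFE.match('a' * 40 + '!')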

