Hacker News new | past | comments | ask | show | jobs | submit login

No, your understanding and the parent's isn't correct.

The NFA is linear in the size of the regex. That is, a regex like a+b+c+ will have an NFA that's 3x the size of the NFA for a+.

Interpreting the NFA is done in time LINEAR with respect to the input data. Code is given in Cox's articles. Yes worst case linear.

You're probably thinking of NFA -> DFA conversion, i.e. the "subset construction".

First of all, RE2 doesn't always do that. You can always operate on NFAs. (and it does that because of capturing is hard with DFAs, IIRC)

Second of all, "exponential in size of input pattern" is NOT a problem. Exponential in the size of the DATA is a problem, and that doesn't happen. When you are measuring the runtime of regexes, the size of the pattern is a small constant. (Think matching a+b+c+ against gigabytes of data.)

The magic of regexes is that you can match any regular language in linear time and constant space. You can't do that with "regexes". That's the WHOLE POINT of these articles.

I have wanted to write a summary of Russ Cox's articles on my blog for awhile. They are very good, but they are very dense, and most people still aren't getting the message.

The message is: "Use regular languages and not regexes". (e.g. constructs like backreferences and lookaround assertions make the language nonregular)

[1] Side note: I tested re2c and its regex compilation time is linear up to matching 47,000 fixed strings in parallel (from /usr/share/dict/words)

https://github.com/oilshell/blog-code/blob/master/fgrep-prob...

re2c ALWAYS builds a DFA, unlike RE2. (Note: they are completely unrelated projects, despite their similar names.)

Even taking into the account that exponential in the size of the pattern is not a problem, it's still not exponential in the size of the pattern (for the fgrep problem). If the DFA were exponentially sized, then it would take exponential time to construct it, but it doesn't in this case.




There's definitely room for a helpful blog post laying out these principles in a less dense way. Go for it!

There's also a lot of room for discussion of the choices around implementation techniques different libraries use (things like whether or not DFAs are created on the fly). I'm interested in this area, so I should probably work up to writing up some of these examples, as a way of forcing myself to learn them better.

Also, I think you're not painting the full picture about the issue of DFA conversion being exponential. You're right that it's much less of an issue than being exponential in the input size, as it affects fewer use cases. However, it's still a potential DOS issue, so it definitely matters for any application that takes user controlled input and produces a DFA. I think it also matters for getting optimal performance out of an engine even in non-adversarial conditions.

For an example of how easy this is to trigger: (a|b)*a(a|b)^(n-1) produces a DFA of size exponential in n. With n = 15, re2c generates a 4mb file.


RE2 was used in Google Code Search in an adversarial context, and it handles the DFA issue by simply putting an 8MB limit on the DFA for a regex:

https://github.com/google/re2/blob/master/re2/re2.h#L601

I learned a few months ago that Rust's regex crate, which is based heavily on RE2, does the same thing.

There is perhaps something worth exploring there, but after reading the Cox articles, I can see that there are so many issues that come up in regex implementation, and this seems to be one of the smaller ones.

I don't think the "regex as untrusted data" use case is very common. That is, there are probably 10,000 programs that use regexes for every one that accepts regexes from users.

-----

Although, one thing I've wanted to explore is Regex derivatives. Supposedly that technique constructs an optimal DFA directly. You don't have to make an NFA, convert it to a DFA, and then optimize the DFA.

Cox describes it as a "win-win" here (in contrast to parsing with derivatives, which is asymptotically worse than traditional parsing algorithms)

https://research.swtch.com/yaccalive

There a bunch of non-production implementations floating around like

https://github.com/google/redgrep

https://github.com/MichaelPaddon/epsilon

(This technique was revived by the 2009 paper cited)

-----

Regarding the blog post, I have some material here. I gave about 10% of this as a 5-minute talk, and realized it's about 2 hours worth of material.

https://www.oilshell.org/share/05-31-pres.html

I still think the table toward the end is a good summary and perhaps worth publishing with some more context / explanation.

----

One interesting test of whether the message from Cox's articles got through is what style of regex engine newer languages are using.

I just noticed Julia uses PCRE, which is unfortunate IMO. I like RE2 -- it has a nice API, great performance, and a lot of features.

https://docs.julialang.org/en/v1/manual/strings/index.html#R...

Swift also has Perl-style regexes, but maybe they inherited it from Objective C:

https://developer.apple.com/documentation/foundation/nsregul...

So basically Rust is the only newer language that appears to have gotten the message.

The Oil shell will of course use "regular languages" rather than regexes, but that's natural because shell never had Perl-style regexes.

One thing I think Cox didn't explore enough is that most libc's appear to be using the "fast" good method, because they don't need Perl constructs.

I guess that would be a good follow-up article. To test GNU libc and musl libc on those pathological cases. I looked at their code a few years ago and they looked "right" (I've forgotten the details), but I don't think I ever did the test.


Lookaround is still regular


Citation? How do you implement it with automata?


The various kinds of lookaround are essentially intersections plus some trickery to extract the correct spans.

E.g. in Python re.match('(?=(a.)+$)(.b.|..b)+', 'abacabax') matches 'abacab', while applying it to 'abacabxa' doesn't match. If you only focus on the question of whether there's a match starting at a given position, this is equivalent to there being a match for '(a.)+$' as well as '(.b.|..b)+' from the same position, i.e. the intersection of the languages matched by the two expressions.

If you have two automata with n and m states for the two expressions, then you can match the intersection by creating an automaton with nm states corresponding to all possible combinations of states of the smaller automata, with a transition from (i,j) to (k,l) iff there are transitions from i to k and j to l. (i,j) is an accepting state iff both i and j are. This essentially simulates both automata in parallel.

Once you've determined whether or not there's a match by using that automaton, you could extract the span that matched by re-running only the part of the regular expression that's not part of a lookaround assertion. I guess actual implementations just do some additional bookkeeping to do both matching and span extraction in one pass.


Thanks! I don't understand this yet but I bookmarked it for later :) I saw some similar answers via Google.




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: