
The 1960's elegance behind Go's regexp - Dawny33
https://docs.google.com/presentation/d/1CwgyzSoz5lVFrgTWb67_ar5kkidW2crOItBOnxRH9uI/mobilepresent
======
harpocrates
The article by Russ Cox [0] (from which the initial Perl vs. Thompson NFA
graph is taken) is a much more informative read. Also, I'm not sure why this
always gets brought up as being a big deal: CS is all about making the right
tradeoffs and I agree that sometimes there are cool tricks that let you
_almost_ have your cake and eat it too. But here, there is no such trick:
backtracking has always been exponential and finite state automata have always
been linear.

No surprises.

[0]
[https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html)

~~~
glangdale
Full disclosure - I work on the Hyperscan project at Intel
github.com/01org/hyperscan

Regular expression implementation is fun and many interesting things have been
done in this area. We are partial to the work of Gonzalo Navarro (in terms of
summary papers) and the Glushkov construction, which predates the Thompson NFA
construction and IMO is better in a number of ways for fast implementation.

I quite enjoyed the Russ Cox posts, but they present a very partial and
idiosyncratic picture of regular expression implementation. RE2 is one point
in the automata-style regex implementation space; Hyperscan is another, and
there are a bunch of other distinct and interesting approaches (e.g. the work from
the Parabix guys, various hardware and GPGPU implementations, etc).

~~~
harpocrates
See, now _this_ is something interesting. When I get some time, I'll take a
look at hyperscan. Looks pretty neat.

~~~
burntsushi
Speaking as the author of Rust's regex crate: Hyperscan is a work of art, both in terms
of the algorithms it employs and its performance. (I don't think Hyperscan has
been part of any publicized benchmark yet, so I'm only speaking with some
limited experience I've had poking at it.)

------
cestith
I stopped reading when a post in 2017 compared anything to Perl 5.8.7, which is
older than what CentOS 5 shipped.

Perl 5.8.x was EOL in 2008. 5.10 was EOL in 2009.

The currently maintained versions are 5.22.3 and 5.24.1, and major redesign,
refactoring, and rework has been done in the system in the meantime, including
the regex engine. It has gone from being a primarily recursive, backtracking
NFA to being primarily iterative, using a DFA when possible and recursing only
when necessary.

Anyone intentionally picking a more than decade old version whose major
version line was end-of-life nearly a decade ago to compare to their new
hotness has completely invalidated the evidence for their arguments. All the
great things the slides might say may still be true, but they are unsupported
by these charts.

~~~
lexpar
I don't think the author 'intentionally' picked it. Like all the other
graphics in the presentation, it looks to have been sourced from somewhere
else.

Edit:
[https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html)

All the graphics come from the above, with no mention of the original article
they were taken from. Nice.

~~~
cestith
Plagiarism v. cherry-picking (or is that pit-picking in this case)... It's
difficult to say which is the bigger ethical issue. Both are extremely
dishonest and misleading.

------
stymaar
Slightly off-topic: BurntSushi wrote a Go binding[1] for the Rust regex
engine[2] (he is also the author) which shares the same approach (finite
automata) but has a more polished, highly optimized implementation. It could
be useful in some scenarios for people facing performance issues with Go's
regex engine.

[1] [https://github.com/BurntSushi/rure-go](https://github.com/BurntSushi/rure-go)

[2] [https://github.com/rust-lang/regex](https://github.com/rust-lang/regex)

------
ggambetta
_" [A regex is] a style of describing character strings. If a string
successfully describes a regex, then it is called a match"_

Shouldn't that be "if a regex successfully describes a string"? I'm confused.

------
jhallenworld
I think the Thompson NFA matcher could be pushed further if someone would
bother to do it. One example is that since it finds all solutions to a given
regular expression, it could be used within a larger backtracking matcher to
support back-references.

Consider:

    (a+)(a+)=\1

With an input like: aaaaa=aa

The matcher will find these strings by the time it gets to the '=':

    (a)(aaaa)=
    (aa)(aaa)=
    (aaa)(aa)=
    (aaaa)(a)=


Each match will end up as a separate still-existing thread in the Thompson
matcher. Instead of pruning them as soon as possible, keep them and
recursively match the rest of the string with the back-reference set in turn
to each until a match is found.
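The enumerate-then-verify idea can be sketched for this specific pattern. This is not a real Thompson simulation, just a brute-force illustration (with a hypothetical helper name) of "each surviving thread is one candidate capture set; check the back-reference for each":

```go
package main

import (
	"fmt"
	"strings"
)

// matchWithBackref sketches the idea for the pattern (a+)(a+)=\1:
// enumerate every way to split the run of a's before '=' (each split
// corresponds to one surviving NFA thread), then verify the
// back-reference \1 against the rest of the input.
func matchWithBackref(input string) (g1, g2 string, ok bool) {
	eq := strings.Index(input, "=")
	if eq < 2 {
		return "", "", false // need at least one 'a' per group
	}
	prefix, rest := input[:eq], input[eq+1:]
	if strings.Count(prefix, "a") != len(prefix) {
		return "", "", false // prefix must be all a's
	}
	// Each split point is one candidate capture set.
	for i := 1; i < len(prefix); i++ {
		g1, g2 = prefix[:i], prefix[i:]
		if rest == g1 { // check the back-reference
			return g1, g2, true
		}
	}
	return "", "", false
}

func main() {
	g1, g2, ok := matchWithBackref("aaaaa=aa")
	fmt.Println(g1, g2, ok) // aa aaa true
}
```

For "aaaaa=aa" this settles on the capture set `(aa)(aaa)` from the list above, since it is the first split whose first group equals the text after '='.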

~~~
burntsushi
If I understand your idea right, then this is actually being done!
[https://github.com/google/fancy-regex](https://github.com/google/fancy-regex)

------
gumby
TL;DR we use a subset of today's common regexps which allowed us to go
lightning fast (as long as you don't need lookahead). Oh, and I'll neglect to
be clear about that.

Nothing wrong with making the tradeoff, but I didn't see it acknowledged in the text.

------
legulere
I do not get why there is still so much hype for regular expressions.
They are hard to read and maintain. They have a relatively large syntax.
Edge cases are easy to get wrong with them.

------
davidw
Looks like Tcl has been doing it right for a while...

------
moomin
Seriously, can we stop fetishising old approaches and papers? Yes, there are
some tricks we've missed along the way, but there's a thread of alchemical
thinking in our community that is just plain unproductive.

~~~
kaosjester
It would be easier to take this comment seriously if Go weren't a modern
re-implementation of ALGOL 68.

~~~
Ericson2314
60s regex with 60s language. Hmm, I wonder which area (regular language
parsing or programming theory) has seen more research since then :D.

