
Regular Expression Matching Can Be Simple and Fast (2007) - Toast_25
https://swtch.com/~rsc/regexp/regexp1.html
======
twic
There's probably someone who posts this comment every time this article comes
up, but:

    
    
        #![feature(test)]
        
        extern crate regex;
        extern crate test;
        
        use regex::Regex;
        
        pub fn check(pattern: &Regex, input: &str) -> bool {
            pattern.is_match(input)
        }
        
        #[cfg(test)]
        mod tests {
            use super::*;
            use test::Bencher;
        
            #[bench]
            fn bench_check(b: &mut Bencher) {
                let n = 29;
                let pattern = Regex::new(&format!("{}{}", "a?".repeat(n), "a".repeat(n))).unwrap();
                let input = "a".repeat(n);
                b.iter(|| check(&pattern, &input));
            }
        }
    

And then:

    
    
        $ cat rust-toolchain
        nightly-2018-02-09
        $ cargo bench
           Compiling void v1.0.2
           Compiling lazy_static v1.0.0
           Compiling regex-syntax v0.4.2
           Compiling libc v0.2.36
           Compiling utf8-ranges v1.0.0
           Compiling unreachable v1.0.0
           Compiling thread_local v0.3.5
           Compiling memchr v2.0.1
           Compiling aho-corasick v0.6.4
           Compiling regex v0.2.6
           Compiling regexps v0.1.0 (file:///home/twic/Code/Regexps)
            Finished release [optimized] target(s) in 20.11 secs
             Running target/release/deps/regexps-9c084e63d7d31ac3
        
        running 1 test
        test tests::bench_check ... bench:         212 ns/iter (+/- 8)
        
        test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 0 filtered out
    

Rust's standard regex library isn't the fastest in the world, and the lack of
backtracking can be limiting, but it is highly optimised for Rust's main use
case of posting showoff comments on Hacker News.

I also tried this in Java (OpenJDK 1.8.0_161-b14), and it took about 24
seconds per iteration at 29 characters.
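
For anyone wondering why the Rust number is in nanoseconds: below is a minimal sketch of the lockstep state-set simulation the article describes, hard-coded for the pattern `a?` repeated n times followed by `a` repeated n times. It is not the regex crate's implementation; the function name and state numbering are my own, and unlike `is_match` above it checks a match against the whole input rather than searching.

```rust
/// Returns true if `a?` repeated `n` times followed by `a` repeated `n` times
/// matches the whole of `input`. (Hypothetical helper, for illustration only.)
pub fn matches_pathological(n: usize, input: &str) -> bool {
    // NFA states are numbered 0..=2n.
    //   State i < n:       either skip to i+1 (empty transition) or consume 'a' to i+1.
    //   State n <= i < 2n: must consume 'a' to reach i+1.
    //   State 2n:          accepting.
    let total = 2 * n;
    let mut current = vec![false; total + 1];

    // The start set is every state reachable from 0 by skipping optional 'a's.
    for s in current.iter_mut().take(n + 1) {
        *s = true;
    }

    for ch in input.chars() {
        let mut next = vec![false; total + 1];
        if ch == 'a' {
            // Advance every live state by one consumed 'a'.
            for i in 0..total {
                if current[i] {
                    next[i + 1] = true;
                }
            }
        }
        // Follow empty transitions; increasing order propagates them fully.
        for i in 0..n {
            if next[i] {
                next[i + 1] = true;
            }
        }
        current = next;
    }
    current[total]
}
```

Each input character touches at most 2n + 1 states, so the work is proportional to n times the input length, with no exponential blow-up however the optional `a`s are chosen.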

~~~
burntsushi
For those that don't know, Rust's regex crate descends from RE2 and Go's
regexp library, both of which were written by Russ Cox, the author of the OP.

The principal difference between Rust's regex crate and, say, RE2 is its focus
on literal optimizations. This speeds up a very large number of regexes, and
it is one of the reasons why it does tend to be quite fast.

~~~
kbenson
How good is ecosystem/crate support for more complex regex features? Speed is
good, especially when you need it, but man, positive/negative
lookahead/lookbehind and other advanced features are pretty damn wonderful
once you know how to use them. The ability to choose what you want when you
want it would be nice.

~~~
burntsushi
The regex crate doesn't support it, and it's not something I'm personally
interested in working on. My goal at this point is to make the regex crate as
good as possible at what it does, across the board. There's still a long way
to go.

Other than that though, there is a WIP crate that actually builds on the
`regex` crate and provides fancier features:
[https://crates.io/crates/fancy-regex](https://crates.io/crates/fancy-regex)
(Created by the same author as Xi.)

You can also use bindings to a more established library. The Oniguruma
bindings seem well developed:
[https://github.com/rust-onig/rust-onig](https://github.com/rust-onig/rust-onig)

There are also bindings to PCRE, but they are of varying quality. The regex
crate's benchmark suite binds to several other libraries (including C++ and D
libraries), but they are very minimal bindings sufficient for rough
microbenchmarking:
[https://github.com/rust-lang/regex/tree/master/bench/src/ffi](https://github.com/rust-lang/regex/tree/master/bench/src/ffi)

~~~
kbenson
Thanks. The info on the other crates is what I was looking for. I wouldn't
expect this one to do more than what it aims for, which seems to be a specific
type of regex engine that isn't amenable to all features but gets extra speed
in that trade-off.

------
thedirt0115
This is an excellent read. I would suggest reading this also (maybe even
before), as it provides one of the simplest (incomplete, but still useful)
regex implementations you'll ever see:
[https://www.cs.princeton.edu/courses/archive/spr09/cos333/be...](https://www.cs.princeton.edu/courses/archive/spr09/cos333/beautiful.html)
It suffers from the problems remedied by the original submission (backtracking
instead of DFA), but the code is shorter and simpler. They're great to
compare.
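
To make the comparison concrete, here is a rough Rust sketch of that style of backtracking matcher. It is my own adaptation rather than the code from the linked page; the function names are assumptions, and it only handles literal bytes, `.`, `*` after a single byte, and the `^` and `$` anchors.

```rust
/// Searches for `re` anywhere in `text`.
/// Supported syntax: literal bytes, `.`, `c*` (zero or more of one byte), `^`, `$`.
pub fn is_match(re: &str, text: &str) -> bool {
    let (re, text) = (re.as_bytes(), text.as_bytes());
    // A leading `^` pins the match to the start of the text.
    if let Some((&b'^', rest)) = re.split_first() {
        return match_here(rest, text);
    }
    // Otherwise try every starting offset, including the empty suffix.
    (0..=text.len()).any(|i| match_here(re, &text[i..]))
}

/// Matches `re` at the beginning of `text`.
fn match_here(re: &[u8], text: &[u8]) -> bool {
    match re {
        // An exhausted pattern matches anything that remains.
        [] => true,
        [c, b'*', rest @ ..] => match_star(*c, rest, text),
        // A trailing `$` matches only at the end of the text.
        [b'$'] => text.is_empty(),
        [c, rest @ ..] => match text.split_first() {
            Some((&t, tail)) if *c == b'.' || *c == t => match_here(rest, tail),
            _ => false,
        },
    }
}

/// Matches `c*` followed by `re` at the beginning of `text`.
fn match_star(c: u8, re: &[u8], text: &[u8]) -> bool {
    let mut text = text;
    loop {
        // Try the rest of the pattern first, then consume one more `c` and retry.
        if match_here(re, text) {
            return true;
        }
        match text.split_first() {
            Some((&t, tail)) if c == b'.' || c == t => text = tail,
            _ => return false,
        }
    }
}
```

The retry loop in `match_star` is where the backtracking happens: each `*` can re-run the rest of the pattern once per character it consumes, and nested retries are what the state-set approach in the submission avoids.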

------
glangdale
The biggest gap in the references/history section is the failure to mention V.
Glushkov, whose NFA construction has much appeal and certainly predates
Thompson's. Reading the references, one might also draw the conclusion that
regex implementation almost ceased between 1968 and 2007.

In our experience, regex matching _can_ be simple and fast, but only
sometimes. RE2's approach is straightforward but naive; we built a
considerably faster (on some metrics) and more scalable system in Hyperscan.
The cost of this turned out to be considerable complexity - and there are still
many regexes and inputs that don't perform particularly well in either system.

------
patrickmay
I thought this was going to be about compiling regular expressions, which is
how CL-PPCRE
([https://edicl.github.io/cl-ppcre/](https://edicl.github.io/cl-ppcre/))
manages to be faster than C. Better algorithms beat compilation, though.

------
carapace
Slightly off-topic, _derivatives of regular expressions_ is fun and
interesting:

[https://en.wikipedia.org/wiki/Brzozowski_derivative](https://en.wikipedia.org/wiki/Brzozowski_derivative)

[http://lambda-the-ultimate.org/node/2293/](http://lambda-the-ultimate.org/node/2293/)

------
harpocrates
The comments from the original post[1] of this are still quite relevant. I
feel like I've seen this link posted here several times (much more recently
than 2009), but I can't find the links.

    
    
      [1]: https://news.ycombinator.com/item?id=466845

~~~
kbenson
There's been good discussion quite a few of the times this has been posted.
Check the "past" link below the submission info at the top for a quick search
of all the submissions.

------
anilshanbhag
There are modules that implement Thompson's algorithm for Python:
[https://github.com/xysun/regex](https://github.com/xysun/regex)

------
u801e
The vim text editor actually can use the NFA regular expression engine [1] for
some searches. I'm not sure if that's an option for some of the languages
mentioned in the article.

[1]
[http://vimhelp.appspot.com/options.txt.html#%27regexpengine%...](http://vimhelp.appspot.com/options.txt.html#%27regexpengine%27)

------
gumby
This should be labeled (2007) and has been on HN several times before. It's a
good article.

The article (and Wikipedia) say Thompson put regexps into a version of QED
for CTSS, but CTSS was only used at MIT. He must have traveled between MIT and
Bell Labs as part of the Multics project. The CTSS reference is pretty
obscure.

~~~
Toast_25
Thanks, I'll keep that in mind for next time. Is there a way to edit the title
now?

~~~
gumby
The HN admins do it sometimes.

------
lerax
Yes, I know. But regex for everything is bad. What about Parsec? Anyone?
[http://book.realworldhaskell.org/read/using-parsec.html](http://book.realworldhaskell.org/read/using-parsec.html)

~~~
kccqzy
Without the syntactic sugar provided by Haskell, most parser combinators in
other languages tend to be ugly, difficult to read, and difficult to write.

