
Rejit: a work-in-progress JIT-powered regex engine - adamnemecek
https://github.com/kirbyfan64/rejit
======
Twirrim
FWIW PCRE also includes a JIT.

[http://sljit.sourceforge.net/pcre.html](http://sljit.sourceforge.net/pcre.html)
[http://www.pcre.org/readme.tx](http://www.pcre.org/readme.tx)

------
ptspts
FYI: ReJit, as of now, can match only on null-terminated strings. Most regexp
engines support matching arbitrary byte sequences.
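
To make the null-termination point concrete, here is a small Python sketch. It does not use ReJit at all: `c_string_view` is a made-up helper that mimics what a NUL-terminated API would see, and Python's built-in `re` stands in for an engine that accepts arbitrary bytes.

```python
import re

# Arbitrary bytes with an embedded NUL, e.g. a binary record.
data = b"key\x00value=42"


def c_string_view(buf):
    """Hypothetical helper: the part of `buf` a NUL-terminated API would see."""
    end = buf.find(b"\x00")
    return buf if end == -1 else buf[:end]


pattern = re.compile(rb"value=(\d+)")

# Engine that takes (pointer, length): sees the whole buffer.
full = pattern.search(data)

# Engine that stops at the first NUL: never sees "value=42".
truncated = pattern.search(c_string_view(data))
```

Here `full` finds the match, while `truncated` is `None`, because everything after the first NUL is invisible to the matcher.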

What's the benefit of using ReJit over RE2? I can think of two: no C++
dependencies, and support for backreferences.

Benchmarks from 2-3 years ago:
[http://coreperf.com/update/rejit/2013/09/01/rejit_pcre_bench...](http://coreperf.com/update/rejit/2013/09/01/rejit_pcre_benchmarks.html)
It looks much faster than RE2 for various regexps.

The documentation doesn't say if its worst-case matching time is exponential.
RE2 is always polynomial.
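
For anyone wondering what exponential vs. polynomial looks like in practice, here is a rough Python sketch using the classic `a?^n a^n` pattern against `a^n`. It is not ReJit's or RE2's actual code; the token format (`(char, optional)` pairs) and both function names are made up for illustration, and the step counters are only a crude proxy for work done.

```python
def backtrack_match(tokens, text):
    """Recursive backtracking over (char, optional) tokens; counts calls."""
    steps = 0

    def go(i, j):
        nonlocal steps
        steps += 1
        if i == len(tokens):
            return j == len(text)
        ch, optional = tokens[i]
        # Greedy: try to consume a character first.
        if j < len(text) and text[j] == ch and go(i + 1, j + 1):
            return True
        # Then try skipping the token if it is optional.
        return optional and go(i + 1, j)

    return go(0, 0), steps


def stateset_match(tokens, text):
    """Automaton-style simulation: track the set of live token positions."""
    steps = 0

    def closure(states):
        # Add every position reachable by skipping optional tokens.
        out, stack = set(), list(states)
        while stack:
            i = stack.pop()
            if i in out:
                continue
            out.add(i)
            if i < len(tokens) and tokens[i][1]:
                stack.append(i + 1)
        return out

    current = closure({0})
    for c in text:
        nxt = set()
        for i in current:
            steps += 1
            if i < len(tokens) and tokens[i][0] == c:
                nxt.add(i + 1)
        current = closure(nxt)
    return len(tokens) in current, steps


n = 12
tokens = [("a", True)] * n + [("a", False)] * n  # a?{n} followed by a{n}
text = "a" * n
bt_ok, bt_steps = backtrack_match(tokens, text)
ss_ok, ss_steps = stateset_match(tokens, text)
print(bt_ok, bt_steps, ss_ok, ss_steps)
```

Both report a match, but the backtracker has to try every combination of "consume or skip" for the optional tokens before it finds the one that works, while the state-set version does at most (number of positions) x (input length) steps.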

~~~
pcwalton
These benchmarks point to the reason I don't like it when the "Regular
Expression Matching Can Be Simple and Fast" article is linked to as though
it's the last word on regex implementation. Unless you have pathological
regexes, the re2 approach frequently loses to a simple recursive backtracking
JIT.

~~~
burntsushi
Rust's very own regexes should be besting PCRE now, even when PCRE's JIT is
enabled. There are some benchmarks on the PR:
[https://github.com/rust-lang-nursery/regex/pull/164#issuecom...](https://github.com/rust-lang-nursery/regex/pull/164#issuecomment-184025147)

Also, FYI, the rejit in the GP's benchmark link is not the same rejit as the
one in the OP. The rejit in the GP's benchmark is actually based on Russ
Cox's articles. (It uses automata and doesn't support backreferences.)

The places where PCRE is still winning come down to these things (I think):

* Use of SIMD instructions with literal prefixes that have a small number of common leading bytes (greater than 1). I suspect this is what's going on with the "medium" benchmarks. I think the `jetscii` crate would help here, but it's only usable on the nightlies.

* PCRE appears to be detecting a literal prefix of `\n` when the regex starts with `^` in multi-line mode. (The sherlock::line_boundary_sherlock_holmes benchmark.) We could do this too---I just haven't yet.

* The sherlock::word_ending_n benchmark has a word boundary, which the lazy DFA in Rust's regex library doesn't support. (It will be hard to improve here unfortunately.)

* In general, I think PCRE probably has less overhead per match, which I haven't worked on too much. (For one, every search requires acquiring and quickly releasing a mutex in Rust's library. All I really need to do is push/pop from a stack, so I bet we could avoid the mutex.)
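
The literal-prefix trick from the first two bullets can be sketched roughly like this in Python. This is not how PCRE or Rust's regex library actually implement it: `find_with_prefix` is a hypothetical helper, `str.find` stands in for a memchr/SIMD-style scan, and the caller is assumed to know that every match of the pattern must begin with the given literal.

```python
import re


def find_with_prefix(prefix, pattern, haystack):
    """Find the first match of `pattern`, assuming every match starts with `prefix`.

    The cheap literal scan skips ahead to candidate positions, and the
    full regex only runs at those positions.
    """
    compiled = re.compile(pattern)
    pos = 0
    while True:
        # Fast literal scan (stand-in for memchr or a SIMD search).
        pos = haystack.find(prefix, pos)
        if pos == -1:
            return None
        # Anchored attempt at the candidate position only.
        m = compiled.match(haystack, pos)
        if m:
            return m
        pos += 1


haystack = "Mr. Sherwood met Sherlock Holmes"
m = find_with_prefix("Sher", r"Sherlock H\w+", haystack)
```

The first candidate ("Sherwood") is rejected by the regex, and the scan resumes from the next character until it hits the real match. The win comes from the regex engine proper never touching the bytes between candidates.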

On the rest of the regexes, Rust's library is either competitive or completely
smashes PCRE in throughput. There are even a few benchmarks that I had to omit
because PCRE was too slow.

These are strong words, so my goodness, do I hope I'm benchmarking PCRE right!
:P

And this was all work done based on what Russ Cox wrote.

Actual benchmark code is here:
[https://github.com/rust-lang-nursery/regex/blob/master/bench...](https://github.com/rust-lang-nursery/regex/blob/master/benches/bench_pcre.rs)

------
DannyBee
re2c is probably just as fast in practice, i'd imagine. Also note that rejit
has already been used as a name for this purpose :)

[https://github.com/coreperf/rejit](https://github.com/coreperf/rejit)

~~~
haberman
> re2c is probably just as fast in practice, i'd imagine.

Probably, but re2c requires every pattern to be run through a C compiler. Not
useful if you want to use it as the embedded regex engine inside a language
interpreter.

Also recent iterations of Intel processors contain specialized string-scanning
instructions -- I would love to see engines that can take advantage of these.
It would be hard for C compilers to recognize these patterns, I think, without
explicit use of intrinsics:
[http://blog.reverberate.org/2009/07/gazelle-is-going-to-love...](http://blog.reverberate.org/2009/07/gazelle-is-going-to-love-sse-42.html)

~~~
sitkack
libtcc, [http://bellard.org/tcc/](http://bellard.org/tcc/)

------
Chris2048
Was this submitted by the author?

If not, given that it's described as 'WIP', they might not be happy about it
appearing here...

~~~
newday
From the readme:

> Note that ReJit is NOT complete! See the issue tracker for a list of open
> issues.

~~~
Chris2048
I'm not sure why you posted this.

