
Go bindings to Rust's regex engine - burntsushi
https://github.com/BurntSushi/rure-go
======
exDM69
This right here is why I think Rust really stands out from other languages.
You can call it from other languages just as you can call native code written
in C. It's not married to a complex runtime system.

Most other languages are "dead ends", because code written in them is not
easily reusable from other languages.

~~~
dangerlibrary
It's easy to call any code from any language - all you need is the power of
HTTP and micro services! ;)

~~~
dikaiosune
Microservices don't help polyglots make regexes faster. If you're sticking a
100ns operation behind an endpoint with JSON serialization/deserialization,
network round-trip, etc., then I'm not sure what to say to that.

~~~
cderwin
Actually, this microservices approach has existed for decades in Erlang/OTP and
can maintain excellent speed. You're not going to beat raw C speed-wise, but
typically you will beat comparable dynamic languages -- and you don't have to
worry about decoding/encoding JSON or any of that bs. I'd personally be
absolutely thrilled if something like GenServer and some of the other OTP
protocols were ported to other dynamic languages -- but alas, real processes
aren't nearly efficient enough.

~~~
vvanders
You can also delegate to C with Erlang via NIFs

~~~
steveklabnik
Then there's
[https://github.com/hansihe/Rustler](https://github.com/hansihe/Rustler)

~~~
vvanders
Yes, that's been on my (long) list of things to check out.

------
glangdale
We're interested in how the Hyperscan library (full disclosure: I am an Intel
employee and Hyperscan designer) stacks up in this comparison. From what I
understand there are Go and Rust bindings for Hyperscan out there as 3rd party
projects (although I don't know much about them).

We find Rust intriguing, and the fine-grained control would be a good match
for what our run-time needs; it would be an interesting project to
re-implement parts of Hyperscan in Rust.

~~~
burntsushi
I'm interested too. I'm looking into adding hyperscan to Rust's benchmark
harness (which already has PCRE1, PCRE2, RE2, Tcl and Oniguruma). I'm trying
to figure out how best to wrap the API. The callback function in `hs_scan` is
something I haven't seen in a regex library before (although I bet I can
understand why it's used), but for example, Rust's wrapper cripples the
callback in comparison to C by dropping the `void *context` parameter:
[http://flier.github.io/rust-hyperscan/doc/hyperscan/type.Mat...](http://flier.github.io/rust-hyperscan/doc/hyperscan/type.MatchEventCallback.html)

I'm happy to wrap the C directly (in fact, that's what I've done for most of
the other regex engines I benchmark against). Before I try to wrangle the
callback-style API, is there an alternate API with a more explicit iterator-
like interface?
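
For what it's worth, the usual trick for keeping the `void *context` parameter
useful from Rust is a trampoline: hand a pointer to your own state through
`context` and cast it back inside an `extern "C"` callback. A rough sketch
(the callback signature here is my reading of the C API, so treat the exact
types as an assumption):

    use std::os::raw::{c_int, c_uint, c_void};

    // Hyperscan's match callback signature, as I read it from the C API
    // docs; treat the exact types as an assumption.
    #[allow(dead_code)]
    type MatchEventCallback = unsafe extern "C" fn(
        id: c_uint,
        from: u64,
        to: u64,
        flags: c_uint,
        context: *mut c_void,
    ) -> c_int;

    // The state we want to thread through the C callback.
    struct MatchCount {
        count: u64,
    }

    // Trampoline: C calls this; we cast `context` back to our Rust state.
    unsafe extern "C" fn on_match(
        _id: c_uint,
        _from: u64,
        _to: u64,
        _flags: c_uint,
        context: *mut c_void,
    ) -> c_int {
        let state = &mut *(context as *mut MatchCount);
        state.count += 1;
        0 // returning non-zero tells hs_scan to stop scanning early
    }

    // The call site would then look something like (hs_scan itself elided):
    //   hs_scan(db, data.as_ptr() as *const _, data.len() as c_uint, 0,
    //           scratch, Some(on_match),
    //           &mut state as *mut MatchCount as *mut c_void)

    fn main() {
        // Exercise the trampoline without Hyperscan, just to show the cast.
        let mut state = MatchCount { count: 0 };
        unsafe {
            on_match(0, 0, 6, 0, &mut state as *mut MatchCount as *mut c_void);
        }
        assert_eq!(state.count, 1);
    }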

~~~
glangdale
Thanks for looking into this; we like to see benchmarks run by other people to
avoid the prospect that we're just benchmarking the things we already know
about and like doing (which are often also the things which we've optimized).

There are no other interfaces outside of the callback API. It's an interesting
prospect to have an iterator-like API but a naive implementation of an
iterator API would involve us having to save out all of our state somehow (in
an explicit fashion, as opposed to a bunch of stack frames). Our system is
pretty complex and so it's not just a case of saving out a couple DFA
states...

~~~
burntsushi
I've done a _tiny_ bit of hacking around on the command line. I'm essentially
comparing your `simplegrep` utility with my own utility that operates
similarly using `time`.

The first thing I noticed was that hyperscan wasn't lining up with match
counts from the rest of the regex engines. I dug into your docs a bit more,
and indeed, your engine reports _all_ matches, whereas traditionally, most
regex engines offer an iterator over successive non-overlapping leftmost-first
matches (including start+end). Of course, that's not to say you're wrong or
anything---it's just going to be a real bear to accurately benchmark them.
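
To make the counting difference concrete, here's a crude simulation (not how
either engine actually computes matches; it just contrasts the two ways of
counting):

    // "All matches" reports every end offset at which some match ends, vs.
    // the usual iterator over successive non-overlapping leftmost matches.
    use regex::Regex;

    fn main() {
        let text = "foo bar";

        let leftmost = Regex::new(r"\w+").unwrap();
        // 2 matches: "foo" and "bar".
        println!("leftmost, non-overlapping: {}",
                 leftmost.find_iter(text).count());

        // Simulate all-matches semantics by asking, at every end offset,
        // whether some \w+ ends exactly there (quadratic, but fine for a demo).
        let ends_here = Regex::new(r"\w+$").unwrap();
        let all = (1..=text.len())
            .filter(|&i| ends_here.is_match(&text[..i]))
            .count();
        // 6 matches: one ending after each of f, fo, foo, b, ba, bar.
        println!("all matches (every end offset): {}", all);
    }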

All in all, hyperscan looks pretty fast. Some of your docs suggest you're
doing even smarter literal detection than I am. Prefix literals and literals
"near the front" are easy enough to figure out (the latter is on my list of
things to do), but literals at the end or middle of a regex seem much
trickier. In particular, it seems very easy to fall into worst case quadratic
behavior. If you felt like expanding on some implementation details there,
that would be awesome. :-)

In any case, I'll share some timings. I ran all of the following regexes on a
~2GB file containing a concatenation of some large sample of Project
Gutenberg. It's my go-to for "feeling" out regexes and collecting large samples
for profiling. I modified simplegrep.c to not print every match and instead
increment a single static `count`. I also modified it to mmap the file, so
that it is comparable to my utility in Rust which also mmaps the file. The
count of total matches is printed at the end. I tried to limit the regexes to
ones that had the same number of matches in all regex engines, but on some of
them, hyperscan reported slightly more. For the hs-start-end column, I also
modified simplegrep.c to use the HS_FLAG_SOM_LEFTMOST flag, since reporting
the leftmost start of each match is what Rust's regex engine does (and so do
most others). Given the differing match semantics, the benchmark is flawed.
But knowing this...
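
For reference, the Rust side of that comparison boils down to something like
this (a sketch, not my exact code; it assumes the `memmap2` and `regex`
crates):

    use memmap2::Mmap;
    use regex::bytes::Regex;
    use std::env;
    use std::fs::File;

    fn main() {
        let mut args = env::args().skip(1);
        let pattern = args.next().expect("usage: count <pattern> <file>");
        let path = args.next().expect("usage: count <pattern> <file>");

        // mmap the haystack so it lines up with the mmap'd simplegrep.c.
        let file = File::open(&path).unwrap();
        let haystack = unsafe { Mmap::map(&file).unwrap() };

        // Count successive non-overlapping leftmost matches and print the
        // total at the end, as in the table below.
        let re = Regex::new(&pattern).unwrap();
        println!("{}", re.find_iter(&haystack).count());
    }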

Benchmarks are reported in seconds. I tried to take the best of 3 where
appropriate, but overall, I saw very little variation run to run.

    
    
        pattern                          rust   pcre2-jit   hs     hs-start-end
        \w+\s+Holmes                     4.4    17.0        0.5    3.4
        [1]                              4.4     4.0        0.5    0.5
        (?i)Sherlock|Holmes|Watson       4.6     7.5        0.5    0.5
        Sher[a-z]+|Hol[a-z]+             0.9     0.6        0.6    0.6
        (?i)Sher[a-z]+|Hol[a-z]+         4.6     7.0        0.6    0.7
        [2]                              0.9     2.0        0.5    0.5
        [3]                              1.0     ---        0.5    0.5
        [a-zA-Z]+ing                     5.1     7.2        1.3    2.2
        \s[a-zA-Z]{0,12}ing\s            4.9    20.0        1.6    2.6
        \w{7}\s\w{7}                     5.0    32.6        7.0    7.1
        [0-9]+[a-z]+[0-9]+[a-z]+[0-9]+   2.1     2.1        0.5    0.5
        
        [1] Sherlock|Holmes|Watson|Irene|Adler|John|Baker
        [2] Holmes.{0,25}Watson|Watson.{0,25}Holmes
        [3] Holmes(?:\s*.+\s*){0,10}Watson|Watson(?:\s*.+\s*){0,10}Holmes
    

That's a pretty impressive showing! I'll have to learn your secrets. :-)

~~~
glangdale
Your observation about match counts and "all-matches" vs "leftmost-longest" or
"pcre-locally-maximizing" semantics is accurate and Hyperscan is meant to be
that way. All-matches is the only semantics we can implement efficiently when
you take into account streaming and multiple regex matching. Block mode scans
of single regexes are something we like to do well at but they are not our
primary use case; more typical is hundreds of regexes in streaming mode.
Similarly, accurate Start of Match (SoM) tracking is a pain and is a work in
progress. You can see the hit we are taking for it (perhaps unnecessarily,
given that these SoM cases don't look like the hardest SoM calculations we
have to do). This will improve, but SoM is typically hard for us (given the
streaming and multi-pattern requirements).

~~~
glangdale
As for our secrets - they are not really secrets; anyone who wants to learn
them can simply read our code. I will outline the literal discovery approach,
though, as I'm inordinately proud of it (possibly for the wrong reasons):

We find 'factor' literals in an NFA graph (a Glushkov automaton embedded in a
BGL graph; Glushkov automata are both our internal intermediate representation
of regexes and an occasional implementation strategy) as follows: we explore
the local neighborhood of a graph node and calculate 'if I wanted to cut the
graph here, which literal strings would I need to see'. For example, if we
have paths "abc" and "def" to a node, we would give this a score corresponding
to "2 3-character literals" (pretty good). On the other hand, we could have a
single path "abcdefghij" to the node (which would be a really good score - a
low score) or we could have multiple short literals (say, corresponding to the
expansion of \d - 10 1-character literals - really bad - a high score). If we
can't find anything reasonable at all, the score is effectively infinite. This
allows us to score each edge in the graph. We then wire up our graph to a
source and sink (at the start and end of the pattern) and do network flow and
grab a min-cut. More or less - there are some wrinkles, but that's basically
it.
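
To make the scoring concrete, a toy version might look like the following
(this is not our actual heuristic, just something that reproduces the ordering
in the examples above):

    // Toy score for a candidate cut: the set of literal strings you would
    // have to see to "cross" the cut. Lower is better. Short or numerous
    // literals get penalized; a long single literal scores very well.
    fn cut_score(literals: &[&str]) -> f64 {
        if literals.is_empty() {
            return f64::INFINITY; // nothing usable to cut on
        }
        literals.iter().map(|lit| 4f64.powi(-(lit.len() as i32))).sum()
    }

    fn main() {
        // Two 3-character literals ("abc" | "def" paths into the node):
        // pretty good.
        println!("{:.6}", cut_score(&["abc", "def"]));
        // One long literal: really good (a very low score).
        println!("{:.6}", cut_score(&["abcdefghij"]));
        // The expansion of \d: ten 1-character literals: really bad.
        let digits: Vec<String> = (0..10).map(|d| d.to_string()).collect();
        let digit_refs: Vec<&str> = digits.iter().map(|s| s.as_str()).collect();
        println!("{:.6}", cut_score(&digit_refs));
    }

Scores like these become the edge capacities that the min-cut runs over.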

This may or may not be the world's best idea, but I had hankered to find some
weird use of netflow ever since a Graph Theory class in 1991, where a
professor of mathematics showed us an algorithm to solve the stable marriages
problem by reducing it to network flow. This was made more memorable by the
fact that he, probably unintentionally, posited a slightly obscene
formulation: all the boys were wired to a mysterious 'source', the girls to a
mysterious 'sink', and some unknown substance flowed from the former group to
the latter (!).

The min-cut trick has its problems, as it does things like scoring a graph
where we cut with 5 different copies of the same literal "abc" as if they were
distinct literals (having thrown away the way the scores were found when we do
min-cut, which just treats the scores as a magic floating point number). But
it's been fairly practical. Someone with a better grasp of abstract algebra
and graph theory would be welcome to come along and tell us whether the thing
that implements our score in min-cut could be more complex (some sort of class
that cleverly tracks some or all of the 'sums of literals' and implements
operations like + and min) and still allow the min-cut algorithms to work.

Cutting is also more complex for us given the fact that a 'late' cut isn't
nearly as good for us as an 'early' cut if we are streaming, as the leftover
bits of an NFA graph on the left of such a cut need to be evaluated all the
time in streaming mode, while this isn't true of the parts on the right of
a cut. So in practice we don't really do the netflow algorithm in the pure
form above (at least, not since version 2.something) but it still gets used.
The fact that we are actually decomposing our problem, which means that some
cuts are better than others at leaving a 'clean' set of engines, also messes
things up.

We'd love help working on this problem (hey, we're open source) but Building
Your Own regex engine seems to be the thing everyone must do. It's probably a
lot more fun than trying to grab a tractable bit to work on in a 100+Kloc
source base.

~~~
burntsushi
Very interesting. That is much more sophisticated than what I'm doing. My
literal extraction is a simple greedy approach on the AST that only finds
prefixes. My "scoring" is just a silly heuristic at this point. I had been
doing it on the NFA itself, but found the AST easier to work with.

I think my problem is that I don't know how to take interior literals or
suffix literals and apply them to search without provoking worst case
quadratic behavior. For example, take the regex `\w+\s+Holmes`. It's easy
enough to find `Holmes` and do a fast literal search for it. Then you can
match `\w+\s+` in reverse from the start of the literal hit. But whoops,
`Holmes` can occur inside strings matched by `\w+\s+`, which means your
automaton can start re-scanning input you've already seen. In
pathological cases, you get quadratic behavior. (Similar to a poorly written
Boyer-Moore implementation.)
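
Here's a toy way to see the blow-up; the hand-rolled reverse loop is a
stand-in for a reverse `\w+\s+` automaton, not real engine code:

    // For each literal hit, the reverse scan can crawl back over bytes that
    // earlier hits already scanned. We just count how many bytes the reverse
    // scans touch in total.
    fn main() {
        // Pathological input: every "Holmes" hit sits inside one long run of
        // word characters and spaces.
        let haystack = "Holmes ".repeat(2_000);
        let bytes = haystack.as_bytes();

        let mut scanned: u64 = 0;
        let mut start = 0;
        while let Some(pos) = haystack[start..].find("Holmes").map(|i| i + start) {
            // Reverse-scan from the literal hit back toward the beginning.
            // Everything before `pos` is a word character or a space, so the
            // scan runs all the way back to offset 0 on every single hit.
            let mut i = pos;
            while i > 0 && (bytes[i - 1].is_ascii_alphanumeric() || bytes[i - 1] == b' ') {
                i -= 1;
                scanned += 1;
            }
            start = pos + "Holmes".len();
        }
        // Total work grows roughly quadratically with the input length.
        println!("bytes reverse-scanned: {}", scanned);
    }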

I wonder if this gets easier using hyperscan's "all-matches" semantics. I
haven't given alternative matching semantics much thought yet.

> We'd love help working on this problem (hey, we're open source) but Building
> Your Own regex engine seems to be the thing everyone must do. It's probably
> a lot more fun than trying to grab a tractable bit to work on in a 100+Kloc
> source base.

Yeah, I started on Rust's regex engine almost 2 years ago. It's been a labor
of love.

------
jerf
Something I've wondered for a while... in the long term (that is, I'm more
asking about architecture than what exists today), is it possible to integrate
Rust more deeply into the calling convention of another language, so that
rather than generating "C" functions, it might generate true "Go" functions?

I'm also asking beyond Go, so if the answer is "M:N really screws up Go w.r.t.
Rust", I'm also curious about whether, for instance, you could write native
Rust extensions to Perl or Python or $WHATEVER by changing out the runtime or
something. If it works better for some languages than others, which ones might
they be?

It's nice to be able to generate true C functions, and it is of course
perfectly appropriate and correct to start there, given the real-world state
of the ecosystem. But a lot of languages take various forms of penalties to
call into C, in order to ensure guarantees that are not necessarily required
by languages that aren't C, and, simultaneously, miss out on guarantees that
Rust may be able to provide that C can't express and thus the C bindings can't
count on.

~~~
pcwalton
Possibly. You'd have to add support for the nonstandard Plan 9 calling
conventions, stack growth checks, and GC stack map information. LLVM broadly
has support for all of these features (contrary to what some Go team members
have said), but the specific binary formats for these features are
incompatible with the Go runtime, so it'd be necessary to contribute upstream
changes to LLVM or fork it.

~~~
geodel
Did they say segmented stacks / stack maps are not available in LLVM, or that
they were not available when they started working on Go around 2007? Because I
notice that segmented stack support was added around 2012 and stack maps in
2013.

------
niersi
May I ask why? From my understanding, Rust has just revamped its regex[0]
using the same (or a similar) scheme as RE2, which is what Go's implementation
now uses. Am I misinformed?

[0]: [https://github.com/rust-lang-nursery/regex/commit/e5a5198eee...](https://github.com/rust-lang-nursery/regex/commit/e5a5198eeea6de2a10a2244fece8a1cc206faf22)

~~~
dbaupp
The Rust regex library does use various techniques that RE2 and Go's library
also use (and has from the start, that patch is just adding another feature
that was missing), but it is generally faster than both of those libraries:
take a look at the benchmarks in the link.

~~~
burntsushi
@jerf - In response to your deleted comment, the README explains some of the
numbers. The really crazy numbers are silly, and I say so in the README.

However, there are plenty of benchmarks well over 10x that are not silly at
all. Look at the second block. The BeforeHolmes benchmark in particular is a
good litmus test, since it defeats most optimizations.

Those particular speed ups are due to algorithms used. RE2 would have a
similar speedup when compared with Go's regexp package.

Some of the other benchmarks in the 100x-400x range are likely due to smarter
use of literal scanning.

------
Animats
The usual way to use regular expressions in Rust is to use the Rust module
that compiles them at compile time. That can't be called from Go.

~~~
burntsushi
Unfortunately, no, that is not the usual way any more. Two primary reasons:

1. Compiling regexes at compile time requires a compiler plugin, which only
works on nightly.

2. The compile-time regexes are dramatically slower (think orders of
magnitude) than the normal Regex::new method. The normal Regex::new method is
what's shown here in these bindings.

The compiler plugin was code I wrote two years ago, and really hasn't changed
since then. Regex::new on the other hand has gotten a lot of love.

The plugin's performance may one day get better, but it won't be soon unless
someone else works on it.
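
For anyone who hasn't seen the two styles side by side, they look roughly like
this (a sketch; the plugin incantation is from memory and needs nightly):

    // Compile-time (nightly-only plugin; the pattern is compiled into Rust
    // code, but only the Pike VM is available at match time):
    //     #![feature(plugin)]
    //     #![plugin(regex_macros)]
    //     let re = regex!(r"\w+\s+Holmes");

    // Run-time Regex::new (stable; what these Go bindings use):
    use regex::Regex;

    fn main() {
        let re = Regex::new(r"\w+\s+Holmes").unwrap();
        assert!(re.is_match("Sherlock Holmes"));
    }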

~~~
AYBABTME
Curious: why is the compile-time regex slower? Is it just a less efficient
implementation?

~~~
burntsushi
Regex::new has these tools at its disposal: Pike VM (NFA), backtracking (NFA),
DFA (online), DFA (offline), Boyer-Moore

The plugin has these tools at its disposal: Pike VM.

But yeah, killercup is correct. The DFA (online) is one of the bigger reasons.
But the offline DFA and Boyer-Moore are probably just as important.

You might glean more information from my hacking guide:
[https://github.com/rust-lang-nursery/regex/blob/master/HACKI...](https://github.com/rust-lang-nursery/regex/blob/master/HACKING.md)

