

Syntax extensions and regular expressions for Rust - dbaupp
http://blog.burntsushi.net/rust-regex-syntax-extensions

======
sixbrx
Very impressive, especially the native regexes, which fail at compile time
if they are invalid and compile to a custom matcher machine (without
having to express them in nasty template syntax). Rust has more macro chops
than I suspected.

------
pohl
Aside from Servo itself, this is the coolest and most ambitious piece of Rust
code I've yet seen.

I haven't had the pleasure of using a language with a good macro facility.
What other sorts of applications do macros have, aside from this and type-safe
println? Could one make a JSON or XML parsing library with this
technique? Or a driver for a database? How does one go about acquiring an
intuition for when to use or not use them?

~~~
burntsushi
Thanks for your kind comments. :)

> How does one go about acquiring an intuition for when to use or not use
> them?

Truthfully, I don't know. In this particular case, the benefits seem really
clear: we get safety _and_ performance increases.

But here's a caveat: the particular reason why this works well for regexes is
because the common use case is to write the string literal corresponding to
the regex into your program. This is what allows the macro facilities to kick
in. For example, if the regex is derived from user input at runtime, then the
`regex!` macro can't be used.

i.e., you can't do this:

    let restr = "a*";
    let re = regex!(restr);

The `regex!` macro must accept a string literal (or another macro invocation
that produces a string literal).
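
To make the restriction concrete, here is a minimal sketch using a plain `macro_rules!` macro. It is not the real `regex!` (which is a syntax extension that parses the whole pattern); the names `checked_pattern!` and `parens_balanced` are hypothetical, and the only "validation" is a toy parenthesis check that ignores escapes. The point is that the macro receives tokens, so it can inspect a literal during compilation but has nothing to inspect when handed a variable name.

```rust
/// Toy compile-time check (an assumption for illustration): are the
/// parentheses in the pattern balanced? Escapes like `\(` are ignored.
const fn parens_balanced(pattern: &str) -> bool {
    let bytes = pattern.as_bytes();
    let mut depth: usize = 0;
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'(' {
            depth += 1;
        } else if bytes[i] == b')' {
            if depth == 0 {
                return false;
            }
            depth -= 1;
        }
        i += 1;
    }
    depth == 0
}

/// Hypothetical stand-in for `regex!`: accepts only a literal token and
/// rejects an unbalanced pattern while the program is being compiled.
macro_rules! checked_pattern {
    ($pattern:literal) => {{
        const PATTERN: &str = $pattern;
        // Evaluated by the compiler; a failed assertion is a build error.
        const _: () = assert!(parens_balanced(PATTERN));
        PATTERN
    }};
}

/// Compiles: the argument is a string literal.
pub fn from_literal() -> &'static str {
    checked_pattern!("(a|b)*")
}

// Neither of these would compile:
//     checked_pattern!("(a|b");        // assertion fails during compilation
//     let restr = "a*";
//     checked_pattern!(restr);         // `restr` is an identifier, not a literal
```

A pattern that only exists at runtime has to go through an ordinary constructor that parses it when the program runs, which is why user-supplied regexes cannot use the macro route.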

~~~
pohl
I see. So maybe if you had a fixed XSD or XSLT at compile time you could emit
a faster but specialized parser or transformer. But maybe that wouldn't work
out so well in practice.

I did find an interesting slide deck from the Scala universe that mentions
something like F# type providers (among other things) as possible
applications.

[http://scalamacros.org/paperstalks/2014-02-04-WhatAreMacrosG...](http://scalamacros.org/paperstalks/2014-02-04-WhatAreMacrosGoodFor.pdf)

------
CyberShadow
Would be interesting to see a benchmark between D's ctRegex and Rust's
"regex!". IIRC, when ctRegex was announced, it beat all other contestants
(including V8's JIT-ed REs) in the author's benchmarks.

There will be a talk at this year's D conference on D's implementation of
regular expressions:

[http://dconf.org/2014/talks/olshansky.html](http://dconf.org/2014/talks/olshansky.html)

------
jeffdavis
Impressive.

I like how the Rust community is offering so many good examples of real
utility for the language. Some languages are a little heavy on theory but have
trouble connecting with potential users; and other languages are popular but
have a lot of problems due to fundamental weaknesses.

Rust and its community seem to have a good balance: a good theoretical
foundation, but a lot of people always working to show some practical reason
to care about the theory.

------
thristian
The author asks about other languages with "native" regex implementations, and
I'm not entirely sure whether PyPy counts. Like regular Python, it compiles
regexes to bytecode for a virtual machine, but just like sprintf-style
string formatting and the rest of Python in general, the regex VM is
JIT-compiled into native code.

~~~
burntsushi
Author here. I don't know enough about PyPy to really answer your question
with any certainty, but I have some guesses.

Firstly, it's my understanding that JITing pays a runtime cost. That's not the
case with Rust compiled regexes. (Admittedly, this cost is likely negligible.)

Secondly, wouldn't PyPy JIT the _general matching_ algorithm? This seems like
a substantively different process than doing code generation specifically for
a regex whose structure you know at compile time. (Whether I'm currently
taking advantage of any optimizations that PyPy's JIT isn't is of course a
different story.)
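
As a rough illustration of the difference, this is the kind of specialized code a generator could emit for one fixed pattern, `a*b`, matched against the whole input. It is a hand-written sketch, not the actual output of `regex!`, and the function name `match_a_star_b` is made up for this example. There is no pattern data structure left to interpret at runtime; the regex's structure has become the control flow.

```rust
/// Hypothetical specialized matcher for the single pattern `a*b`
/// (whole-string match). The states and transitions are fixed in the code.
pub fn match_a_star_b(input: &str) -> bool {
    #[derive(Clone, Copy)]
    enum State {
        Start, // still consuming the leading run of 'a'
        Done,  // the final 'b' has been seen
    }

    let mut state = State::Start;
    for c in input.chars() {
        state = match (state, c) {
            (State::Start, 'a') => State::Start,
            (State::Start, 'b') => State::Done,
            // Any other character, or any input after the 'b', is a mismatch.
            _ => return false,
        };
    }
    matches!(state, State::Done)
}
```

A general matcher would instead take the pattern as a runtime value and walk it for every input character.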

~~~
thristian
PyPy uses a tracing JIT, so it observes the general matching algorithm for a
few iterations and then emits code that implements what the general algorithm
actually did, i.e. the specific matching steps for that specific regex.

JITting a regex isn't really the same thing as your Rust native regexes; for
example, if somebody wrote a regex library with a JIT and linked it to a C
program, I wouldn't count it. However, in PyPy's case regexes are exactly as
native as the rest of the codebase rather than being a nested VM, so in that
sense they're 'native'.

------
Ygg2
How does this compare to other regex parsing libraries, performance wise?

~~~
charlieflowers
He gives a link to this at the bottom:
[https://github.com/BurntSushi/regexp/tree/master/benchmark/r...](https://github.com/BurntSushi/regexp/tree/master/benchmark/regex-dna)

(You have to scroll down to the code block sections to see the numbers).

Rust beats Go pretty soundly (on this specific benchmark), but C beats Rust
pretty soundly.

Why is C so much faster?

~~~
innguest
> Why is C so much faster?

Because processors are made with C in mind, as it's the lingua franca, and so
they end up being optimized for it.

~~~
Ygg2
Ok, but why is Python faster than the Rust implementation? (And yes, I assume
those are C libs behind Python).

NOTE: Rust is supposed to have zero-cost abstractions. I think this is due more
to a lack of optimization in the regexp library than to some fault of the language.

~~~
dbaupp
The Rust implementation hasn't been extensively microoptimised.

------
Dewie
Is this implementation close to the theoretical concept of regular
expressions? I have read that some implementations of regular expressions are
quite a bit more powerful than regular expressions from CS, in that they can
recognize languages that are not regular.

~~~
burntsushi
Yes. It's a reasonably faithful port of RE2, which elides features like
backreferences.

~~~
pjscott
This is usually a small price to pay for guaranteed O(n) matching speed.
(Kudos to the author for doing the Right Thing here.)

~~~
burntsushi
Thanks :-) The implementation is basically the Pike VM as described by Russ
Cox. Recursion depth in particular has an upper bound corresponding to the
number of instructions in the regex. In practice, this means it's safe to run
a regex on untrusted data.

(Creating a regex from untrusted data still needs a bit of work, but is
fixable!)
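
For readers unfamiliar with the approach, below is a minimal sketch of a thread-list simulation in the style of the Pike VM. It is not the library's code: the instruction set (`Char`, `Split`, `Jmp`, `Match`), the hand-assembled program for `a*b`, and the whole-string matching semantics are all simplifying assumptions, and capture groups are left out. It does show the two properties mentioned above: each input character is processed against at most one thread per instruction, and the recursion in `add_thread` cannot go deeper than the number of instructions because every instruction is visited at most once per step.

```rust
/// Assumed, simplified instruction set for this sketch.
#[derive(Clone, Copy)]
pub enum Inst {
    Char(char),          // consume one matching character
    Split(usize, usize), // try both targets
    Jmp(usize),          // unconditional jump
    Match,               // accept
}

/// Follow Jmp/Split edges from `pc`, recording each reachable Char/Match
/// instruction once. `seen` caps the recursion depth at `prog.len()`.
fn add_thread(prog: &[Inst], list: &mut Vec<usize>, seen: &mut [bool], pc: usize) {
    if seen[pc] {
        return;
    }
    seen[pc] = true;
    match prog[pc] {
        Inst::Jmp(target) => add_thread(prog, list, seen, target),
        Inst::Split(a, b) => {
            add_thread(prog, list, seen, a);
            add_thread(prog, list, seen, b);
        }
        Inst::Char(_) | Inst::Match => list.push(pc),
    }
}

/// Whole-string match: O(input length * program length) time.
pub fn is_match(prog: &[Inst], input: &str) -> bool {
    let mut current = Vec::new();
    let mut next = Vec::new();
    let mut seen = vec![false; prog.len()];
    add_thread(prog, &mut current, &mut seen, 0);

    for c in input.chars() {
        seen.iter_mut().for_each(|s| *s = false);
        for &pc in &current {
            // Only threads waiting on this exact character survive the step.
            if let Inst::Char(expected) = prog[pc] {
                if expected == c {
                    add_thread(prog, &mut next, &mut seen, pc + 1);
                }
            }
        }
        std::mem::swap(&mut current, &mut next);
        next.clear();
    }
    current.iter().any(|&pc| matches!(prog[pc], Inst::Match))
}

/// Hand-assembled program for `a*b` (a real library would compile this).
pub fn a_star_b() -> Vec<Inst> {
    vec![
        Inst::Split(1, 3), // 0: either take another 'a' or move on to 'b'
        Inst::Char('a'),   // 1
        Inst::Jmp(0),      // 2
        Inst::Char('b'),   // 3
        Inst::Match,       // 4
    ]
}
```

With this program, `is_match(&a_star_b(), "aaab")` is true and `is_match(&a_star_b(), "aaa")` is false. Because the work per character is bounded by the program size, a pathological input cannot trigger the exponential blowup that backtracking matchers are prone to.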

