
A regular expression matcher in Scheme using derivatives - andars
http://matt.might.net/articles/implementation-of-regular-expression-matching-in-scheme-with-derivatives/
======
todd8
As to why using RE derivatives might be interesting, this is from Owens, Reppy
and Turon's "Regular-expression derivatives re-examined", which is cited in
the originally referenced web site:

    
    
       Specifically, RE derivatives have the following advantages:
       • They provide a direct RE to DFA translation that is well 
         suited to implementation in functional languages.
       • They support extended REs almost for free.
       • The generated scanners are often optimal in the number of
         states and are uniformly better than those produced by 
         previous tools.
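
For anyone who hasn't seen the technique, here is a minimal sketch of a derivative-based matcher in Python (not the article's Scheme code). The tuple encoding and the names `deriv`, `nullable`, `matches` and `neg` are my own assumptions for illustration. Complement is included to show what "extended REs almost for free" looks like: it needs one extra line in each function.

```python
# Regexes are plain tuples; this encoding is an assumption of the sketch.
EMPTY, EPS = ("empty",), ("eps",)   # the empty language, and the empty string

def char(c): return ("chr", c)
def star(a): return ("star", a)
def neg(a): return ("not", a)       # complement, an "extended" operator

def seq(a, b):
    # Smart constructor: simplify as we build so derivatives stay small.
    if EMPTY in (a, b): return EMPTY
    if a == EPS: return b
    if b == EPS: return a
    return ("seq", a, b)

def alt(a, b):
    if a == EMPTY: return b
    if b == EMPTY or a == b: return a
    return ("alt", a, b)

def nullable(r):
    """True if r accepts the empty string."""
    tag = r[0]
    if tag in ("eps", "star"): return True
    if tag == "seq": return nullable(r[1]) and nullable(r[2])
    if tag == "alt": return nullable(r[1]) or nullable(r[2])
    if tag == "not": return not nullable(r[1])
    return False                    # "empty" and "chr"

def deriv(r, c):
    """The regex matching every s such that r matches c + s."""
    tag = r[0]
    if tag == "chr": return EPS if r[1] == c else EMPTY
    if tag == "alt": return alt(deriv(r[1], c), deriv(r[2], c))
    if tag == "star": return seq(deriv(r[1], c), r)
    if tag == "not": return neg(deriv(r[1], c))
    if tag == "seq":
        first = seq(deriv(r[1], c), r[2])
        return alt(first, deriv(r[2], c)) if nullable(r[1]) else first
    return EMPTY                    # "empty" and "eps"

def matches(r, text):
    for c in text:
        r = deriv(r, c)
    return nullable(r)

ab_star = star(seq(char("a"), char("b")))   # (ab)*
print(matches(ab_star, "abab"), matches(neg(ab_star), "aba"))
```

Each distinct derivative reached this way corresponds to a DFA state, which is where the direct RE to DFA translation comes from.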

~~~
wolfgke
Unfortunately, there are regular languages for which you can prove that the
number of states a DFA requires is exponential in the length of the regular
expression describing it. That's why one should use NFAs for matching REs; see
[https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html)

For an example of such a regular expression, consider

[ab]*a[ab][ab]...[ab][ab]

where the number of [ab] factors after the a is n-1. Then any DFA for this
regular expression needs at least 2^n states, while there exists an NFA using
only n+1 states (the proof is an exercise that any Computer Science student
should be able to do on a whiteboard ;-) ).
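
If you would rather check this than prove it, here is a small Python sketch that runs the subset construction on the (n+1)-state NFA and counts the DFA states it reaches. The function name `dfa_state_count` and the state numbering (state 0 loops on [ab], state n accepts) are assumptions of the sketch. It counts reachable subsets, which is an upper bound on the minimal DFA in general; for this particular language the two coincide.

```python
def dfa_state_count(n):
    """Count subset-construction states for [ab]*a followed by n-1 copies of [ab].

    NFA states are 0..n: state 0 loops on a and b, moves to 1 on a;
    state i (1 <= i < n) moves to i+1 on either symbol; state n accepts.
    """
    start = frozenset({0})
    seen = {start}
    todo = [start]
    while todo:
        current = todo.pop()
        for symbol in "ab":
            nxt = set()
            for q in current:
                if q == 0:
                    nxt.add(0)              # stay in the [ab]* loop
                    if symbol == "a":
                        nxt.add(1)          # guess that this a is the marked one
                elif q < n:
                    nxt.add(q + 1)          # consume one trailing [ab]
            nxt = frozenset(nxt)
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)

for n in range(1, 8):
    print(n, n + 1, dfa_state_count(n))     # n, NFA states, DFA states
```

The intuition is that the DFA has to remember which of the last n symbols were a, and there are 2^n such patterns.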

------
kazinator
A regular expression matcher in _C_ using derivatives:

[http://www.kylheku.com/cgit/txr/tree/regex.c](http://www.kylheku.com/cgit/txr/tree/regex.c)

Starting at the function "dv_compile_regex". (The source file contains an NFA
and derivatives implementation; the top level regex compiler will choose one
or the other based on the absence or presence of exotic operators in the
abstract syntax.)

------
jhallenworld
Can we do submatch addressing with derivatives?

I think the answer is yes: extend the definition of an expression to include a
recorder R(re,recorded-string) such that D(R(re,),c) -> R(D(re,c),c) ->
R(re1,c). D(R(re1,c),d) -> R(D(re1,d),cd) -> R(re2,cd). Eventually you have
R(,string), the recorded sub-matched string.

Checking each of the rules with it...

    
    
      D(R(∅,s),c) -> ∅.
      D(R(ε,s),c) -> ∅.
      D(R(c,s),c) -> R(ε,sc).
      D(R(c,s),d) -> ∅.
      D(R(re,s),c) -> R(D(re,c),sc):
    
        D(R(re1,s1)R(re2,s2),c) -> δ(R(re1,s1))R(D(re2,c),s2c) | R(D(re1,c),s1c)R(re2,s2).
    
        D(R(re1,s1)|R(re2,s2),c) -> D(R(re1,s1),c)|D(R(re2,s2),c).
    
        D(R(re*,s),c) -> R(D(re*,c),sc).
    
      R(∅,s) -> ∅.
      δ(R(∅,s)) -> ∅.
      δ(R(ε,s)) -> R(ε,s).
      δ(R(re*,s)) -> R(ε,s).
      δ(R(re1,s1)R(re2,s2)) -> δ(R(re1,s1))δ(R(re2,s2)).
      δ(R(re1,s1)|R(re2,s2)) -> δ(R(re1,s1))|δ(R(re2,s2)).
    

I think it's good: when the input is done, you have a match only if the AST is
only made up of R(ε,s1)R(ε,s2)...
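
A rough Python sketch of the recorder idea, to make the rules above concrete. Everything here is an assumption for illustration: the tuple encoding, the names `rec`, `delta`, `captures` and `submatch`, and the choice to take the left branch first when an alternation is ambiguous. Recorders nested under a star are not handled in any meaningful way.

```python
EMPTY, EPS = ("empty",), ("eps",)

def char(c): return ("chr", c)
def star(a): return ("star", a)
def rec(a, s=""):
    # R(re, s); R(EMPTY, s) collapses to EMPTY as in the rules above.
    return EMPTY if a == EMPTY else ("rec", a, s)

def seq(a, b):
    if EMPTY in (a, b): return EMPTY
    if a == EPS: return b
    if b == EPS: return a
    return ("seq", a, b)

def alt(a, b):
    if a == EMPTY: return b
    if b == EMPTY or a == b: return a
    return ("alt", a, b)

def nullable(r):
    tag = r[0]
    if tag in ("eps", "star"): return True
    if tag == "rec": return nullable(r[1])
    if tag == "seq": return nullable(r[1]) and nullable(r[2])
    if tag == "alt": return nullable(r[1]) or nullable(r[2])
    return False

def delta(r):
    """Like nullability, but keeps finished recorders instead of dropping them."""
    tag = r[0]
    if tag == "rec": return rec(EPS, r[2]) if nullable(r[1]) else EMPTY
    if tag == "seq": return seq(delta(r[1]), delta(r[2]))
    if tag == "alt": return alt(delta(r[1]), delta(r[2]))
    return EPS if nullable(r) else EMPTY

def deriv(r, c):
    tag = r[0]
    if tag == "chr": return EPS if r[1] == c else EMPTY
    if tag == "rec": return rec(deriv(r[1], c), r[2] + c)   # record c
    if tag == "alt": return alt(deriv(r[1], c), deriv(r[2], c))
    if tag == "star": return seq(deriv(r[1], c), r)
    if tag == "seq":
        return alt(seq(deriv(r[1], c), r[2]),
                   seq(delta(r[1]), deriv(r[2], c)))
    return EMPTY                                            # "empty", "eps"

def captures(r):
    """Recorded strings along one accepting path, or None if r rejects."""
    tag = r[0]
    if tag == "rec": return [r[2]] if nullable(r[1]) else None
    if tag == "seq":
        left, right = captures(r[1]), captures(r[2])
        return None if left is None or right is None else left + right
    if tag == "alt":
        left = captures(r[1])
        return left if left is not None else captures(r[2])
    return [] if nullable(r) else None

def submatch(r, text):
    for c in text:
        r = deriv(r, c)
    return captures(r)

# x*(yy*)z with the parenthesised part recorded
pattern = seq(star(char("x")),
              seq(rec(seq(char("y"), star(char("y")))), char("z")))
print(submatch(pattern, "xxyyyz"))
```

At the end of the input the surviving tree for this pattern is a single R(ε,s) node, which is the shape described above.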

------
taeric
Benchmarks would be appreciated. As things are, this is neat and I'm curious
to read more of the technique. However, I don't see this being used in either
rapid prototyping or in hardened implementations. The first because you would
just use a language that implemented regexes anyway; the second because
performance is important.

------
zeckalpha
Wow. I've always enjoyed Russ Cox's articles on regexes, but this is something
else entirely.

~~~
brandonbloom
The follow-up work on _parsing_ with derivatives is even more fantastic:
[http://matt.might.net/articles/parsing-with-derivatives/](http://matt.might.net/articles/parsing-with-derivatives/)

~~~
eru
I implemented a parser combinator library for regular languages in Haskell:

[https://github.com/matthiasgoergens/redgrep](https://github.com/matthiasgoergens/redgrep)

(Still work in progress.)

------
jhallenworld
This is very interesting and I think the performance is pretty good with the
right AST. Consider the regular expression "hello".

There is no recursion to take the derivative of this, so O(n):

    (, h (, e (, l (, l o))))

but with this AST, you have O(n!):

    (, (, (, (, h e) l) l) o)
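
The cost difference can be measured rather than guessed. Below is a Python sketch that counts calls to the derivative function for both tree shapes. The tuple encoding, the simplifying `seq` constructor and the `stats` counter are assumptions of this sketch, and an implementation without the simplifications may well behave differently.

```python
EMPTY, EPS = ("empty",), ("eps",)

def char(c): return ("chr", c)

def seq(a, b):
    # Simplifying constructor: drops EPS and propagates EMPTY.
    if EMPTY in (a, b): return EMPTY
    if a == EPS: return b
    if b == EPS: return a
    return ("seq", a, b)

def nullable(r):
    tag = r[0]
    if tag == "eps": return True
    if tag == "seq": return nullable(r[1]) and nullable(r[2])
    return False

def deriv(r, c, stats):
    stats["calls"] += 1
    tag = r[0]
    if tag == "chr": return EPS if r[1] == c else EMPTY
    if tag == "seq":
        first = seq(deriv(r[1], c, stats), r[2])
        if not nullable(r[1]):
            return first            # no need to look at the right side
        second = deriv(r[2], c, stats)
        if first == EMPTY: return second
        if second == EMPTY: return first
        return ("alt", first, second)   # not reached by the patterns below
    return EMPTY

def count_calls(r, text):
    """Return (matched, number of deriv calls) for matching text against r."""
    stats = {"calls": 0}
    for c in text:
        r = deriv(r, c, stats)
    return nullable(r), stats["calls"]

h, e, l, o = (char(c) for c in "helo")
right_nested = seq(h, seq(e, seq(l, seq(l, o))))    # (, h (, e (, l (, l o))))
left_nested = seq(seq(seq(seq(h, e), l), l), o)     # (, (, (, (, h e) l) l) o)

print(count_calls(right_nested, "hello"))
print(count_calls(left_nested, "hello"))
```

With these definitions the two calls report 9 and 15 derivative calls for "hello": the left-nested tree has to walk down its whole left spine on every character, while the right-nested one only touches the head.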

