
Derivatives of Regular Expressions (2007) - tosh
http://lambda-the-ultimate.org/node/2293/
======
neel_k
Eleven years on, this remains one of my favorite papers!

There's a pretty variant of this idea introduced in 1995 by Valentin Antimirov
called partial derivatives of regular expressions. His idea was to replace the
_single_ derivative with a _set_ of partial derivatives that add up to the
whole derivative. This simplifies the derivative algorithm quite a bit, and
makes it possible to write amazingly concise and efficient regular expression
matchers.

E.g., here's an implementation of a regular expression compiler that uses
Antimirov derivatives to take a regexp and build a table-driven matcher in
about 50 lines of code:

[http://semantic-domain.blogspot.com/2013/11/antimirov-derivatives-for-regular.html](http://semantic-domain.blogspot.com/2013/11/antimirov-derivatives-for-regular.html)
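To make the idea concrete, here's a minimal Python sketch of Antimirov partial derivatives (my own illustration, not the compiler from the blog post; the tuple encoding and function names are made up). The partial derivative of a regex with respect to a character is a _set_ of regexes whose union denotes the Brzozowski derivative, so a matcher just carries a set of states:

```python
# Regexes as tuples: ('eps',), ('chr', c), ('seq', r, s), ('alt', r, s), ('star', r).

def nullable(r):
    # True iff r matches the empty string.
    tag = r[0]
    if tag == 'eps':  return True
    if tag == 'chr':  return False
    if tag == 'seq':  return nullable(r[1]) and nullable(r[2])
    if tag == 'alt':  return nullable(r[1]) or nullable(r[2])
    if tag == 'star': return True

def pderiv(r, c):
    # Antimirov partial derivative: a set of regexes whose union is the
    # Brzozowski derivative of r with respect to c.
    tag = r[0]
    if tag == 'eps':  return set()
    if tag == 'chr':  return {('eps',)} if r[1] == c else set()
    if tag == 'alt':  return pderiv(r[1], c) | pderiv(r[2], c)
    if tag == 'seq':
        ds = {('seq', d, r[2]) for d in pderiv(r[1], c)}
        return ds | pderiv(r[2], c) if nullable(r[1]) else ds
    if tag == 'star': return {('seq', d, r) for d in pderiv(r[1], c)}

def matches(r, s):
    # NFA-style matching: step a set of states through the string.
    states = {r}
    for ch in s:
        states = set().union(*(pderiv(q, ch) for q in states))
    return any(nullable(q) for q in states)
```

Because every partial derivative is built from subterms of the original regex, the set of reachable states is finite and small, which is what makes the table-driven compiler in the post possible.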

~~~
abecedarius
Curiously, in Ken Thompson's 1968 paper on his regex matcher
[https://www.fing.edu.uy/inco/cursos/intropln/material/p419-thompson.pdf](https://www.fing.edu.uy/inco/cursos/intropln/material/p419-thompson.pdf)
he says it works by taking Brzozowski derivatives. This was a headscratcher to
me since his code seems completely different from the derivative-based regex
matchers I'd seen. The answer is, it takes Antimirov derivatives, which didn't
have a name yet.

Here's something like your code in Python:
[https://github.com/darius/regexercise_solutions/blob/master/star.py](https://github.com/darius/regexercise_solutions/blob/master/star.py)
and transformed into Thompson's algorithm:
[https://github.com/darius/regexercise_solutions/blob/master/star_thompsonlike.py](https://github.com/darius/regexercise_solutions/blob/master/star_thompsonlike.py)

(I've never read Antimirov's paper past the first page or two; I wrote these
things while digging into Thompson.)

~~~
chubot
Woah, I think you just answered the exact question I asked 1 minute ago? :)

[https://news.ycombinator.com/item?id=18434228](https://news.ycombinator.com/item?id=18434228)

I don't know what an Antimirov derivative is, will check it out ... Thanks for
the links to code!

~~~
abecedarius
Yep! As mentioned in the comments, the logic you get this way isn't exactly
the same as Thompson's, but it's very close. You're welcome.

FWIW I drafted a lot of variations while messing with this:
[https://github.com/darius/sketchbook/tree/master/regex](https://github.com/darius/sketchbook/tree/master/regex)
(some of this code is incorrect) and
[https://github.com/darius/sketchbook/tree/master/lex](https://github.com/darius/sketchbook/tree/master/lex).

------
yesenadam
Brzozowski's paper is from 1964. Download link:

[http://maveric.uwaterloo.ca/reports/1964_JACM_Brzozowski.pdf](http://maveric.uwaterloo.ca/reports/1964_JACM_Brzozowski.pdf)

~~~
chubot
This paper is the first citation of Ken Thompson's 1968 paper, which I
archived here:

[http://www.oilshell.org/archive/Thompson-1968.pdf](http://www.oilshell.org/archive/Thompson-1968.pdf)

One thing that I just learned that is SUPER WEIRD is that Thompson's paper
contains the word "derivative" but it doesn't contain the words NFA, DFA, or
even "state" !!!

First paragraph:

 _In the terms of Brzozowski [1], this algorithm continually takes the left
derivative of the given regular expression with respect to the text to be
searched._

From:
[https://research.swtch.com/yaccalive](https://research.swtch.com/yaccalive)

 _Thompson learned regular expressions from Brzozowski himself while both were
at Berkeley, and he credits the method in the 1968 paper. The now-standard
framing of Thompson's paper as an NFA construction and the addition of
caching states to produce a DFA were both introduced later, by Aho and Ullman
in their various textbooks. Like many primary sources, the original is
different from the textbook presentations of it. (I myself am also guilty of
this particular misrepresentation.)_

However, figure 5 in Thompson's paper is CLEARLY an NFA for the regex a(b|c)d.

What's the connection between derivatives and NFAs? I don't understand why
Thompson says his method is derivative-based and not NFA-based.

(The 2007 "Re-examined" paper linked here comments on the Thompson line in
section 6. They qualify Thompson's statement a bit, but I don't understand it
fully.)

(This came up on lobste.rs after a discussion of Thompson's construction:
[https://lobste.rs/s/fq8uil/aho_corasick#c_4xkm7z](https://lobste.rs/s/fq8uil/aho_corasick#c_4xkm7z))

------
man-and-laptop
If regexps are understood as a kind of "baby" programming language, and the
techniques for implementing them are understood as "baby" implementation
techniques for more complete languages, does "differentiation" have any
applications in implementing more complete programming languages?

As an aside, why do I think of regexps as a kind of "baby" programming
language? Because:

\- They can be converted to NFAs or DFAs. These NFAs and DFAs resemble
assembly languages. The "assembly language" for NFA is actually quite exotic.
Reminds me of compilation.

\- NFAs can be understood as a kind of Virtual Machine. DFAs are like a subset
of a physical machine.

\- Just In Time compilation was first applied to regexps before anything else.
This was done by Ken Thompson.

\- The interpreter->compiler algorithm in RPython (the program that is used to
produce PyPy) works quite well on regexps. This essentially automates the work
that Ken Thompson did.

~~~
CuriousSkeptic
Zipper is pretty interesting
[http://okmij.org/ftp/continuations/zipper.html](http://okmij.org/ftp/continuations/zipper.html)

~~~
theoh
The idea of taking the derivative of an algebraic data type is also discussed
in "Categories and Computer Science" from 1992:
[https://www.cambridge.org/core/books/categories-and-computer-science/203EBBEE29BEADB035C9DD80191E67B1](https://www.cambridge.org/core/books/categories-and-computer-science/203EBBEE29BEADB035C9DD80191E67B1)

That discussion covers some of the same ground as the Taylor series idea
elsewhere in this discussion. It doesn't get as far as inventing the zipper
concept.

------
mehrdadn
Reminds me of what my TA showed us in undergrad. If I remember correctly, it
went like this:

Consider the regex A: Aa | ϵ, where ϵ denotes the empty string. What strings
does it generate?

To solve, rewrite it as A = Aa + ϵ, then find its Taylor series and you're
done:

A = ϵ / (ϵ - a) = ϵ + a + aa + aaa + ...
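One way to see the expansion concretely (a throwaway sketch of my own, not from the class): treat A = Aa + ϵ as a fixed-point equation over sets of strings and iterate it; the iterates converge to the geometric series ϵ + a + aa + aaa + ...

```python
# Iterate the fixed-point equation A = Aa + eps over sets of strings.
# Each pass appends 'a' to everything seen so far and re-adds the empty string.
def iterate_language(n):
    A = set()
    for _ in range(n):
        A = {s + 'a' for s in A} | {''}
    return A

print(sorted(iterate_language(4), key=len))  # ['', 'a', 'aa', 'aaa']
```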

~~~
emilfihlman
A=Aa+e

A-Aa=e

A(1-a)=e

A=e/(1-a) not e/(e-a)

~~~
mehrdadn
Nope, I'm pretty sure it's not wrong. I believe ϵ is the multiplicative
identity ("1") here. Note that ϵ is the empty string, and concatenating it
with anything results in the same thing, so Aϵ = A. Hence we have Aϵ - Aa =
A(ϵ - a) = ϵ.

------
carapace
I wrote a Jupyter notebook with an implementation of Derivatives of Regular
Expressions in Python[1].

The Prolog version is pretty much just an encoding of the rules. This is an
implementation of a two-symbol alphabet _{01}_ (without "compaction" IIRC):

    
    
        % Brzozowski's Derivatives of Regular Expressions
        :- use_module(library(ordsets)).
        :- use_module(library(tabling)).
    
        dre(_, [], []).
        dre(_, [""], []).
        dre(0, ["1"], []).
        dre(0, ["0"], [""]).
        dre(1, ["0"], []).
        dre(1, ["1"], [""]).
        dre(C,   kstar(R), cons(Rd, kstar(R))) :- dre(C, R, Rd).
        dre(C,    not_(R), not_(Rd)   ) :- dre(C, R, Rd).
        dre(C,  and(R, S), and(Rd, Sd)) :- dre(C, R, Rd), dre(C, S, Sd).
        dre(C,   or(R, S),  or(Rd, Sd)) :- dre(C, R, Rd), dre(C, S, Sd).
        dre(C, cons(R, S),    cons(Rd, S)     ) :- nully(R, []  ), dre(C, R, Rd).
        dre(C, cons(R, S), or(cons(Rd, S), Sd)) :- nully(R, [""]), dre(C, R, Rd), dre(C, S, Sd).
    
        nully([], []).
        nully([""], [""]).
        nully([_], []).
        nully(kstar(_), [""]).
        nully(not_(R), [""]) :- nully(R, []).
        nully(not_(R), []) :- nully(R, [""]).
        nully( and(R, S), N) :- nully(R, Rn), nully(S, Sn), ord_intersect(Rn, Sn, N).
        nully(cons(R, S), N) :- nully(R, Rn), nully(S, Sn), ord_intersect(Rn, Sn, N).
        nully(  or(R, S), N) :- nully(R, Rn), nully(S, Sn), ord_union(Rn, Sn, N).
    
    

[1] "∂RE: Brzozowski’s Derivatives of Regular Expressions"
[http://joypy.osdn.io/notebooks/Derivatives_of_Regular_Expressions.html](http://joypy.osdn.io/notebooks/Derivatives_of_Regular_Expressions.html)

------
saagarjha
One of my programming assignments for a programming languages class happened
to be creating a DFA for regular expressions by taking successive derivatives,
except slightly modified to use ranges because the naïve method of taking
derivatives isn't great when you have to take 2^(2^16) of them when your
alphabet is UTF-16 characters.

Actually, now that I think about it, the entire class was strangely centered
around taking derivatives of regular expressions…
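The range trick can be sketched in a few lines (my own guess at the idea, not the actual assignment): split the alphabet at every boundary of the character ranges appearing in the regex, so every character inside one resulting interval is in exactly the same set of ranges and the whole interval needs only one derivative.

```python
# Partition a 16-bit alphabet at the boundaries of the given inclusive ranges.
# Characters within one resulting interval are indistinguishable to the regex,
# so one derivative per interval suffices instead of one per character.
def range_partition(ranges, alpha_max=0xFFFF):
    cuts = sorted({0, alpha_max + 1}
                  | {lo for lo, _ in ranges}
                  | {hi + 1 for _, hi in ranges})
    return [(a, b - 1) for a, b in zip(cuts, cuts[1:])]

# [a-z] needs only 3 derivatives, not 65536:
print(range_partition([(ord('a'), ord('z'))]))
# [(0, 96), (97, 122), (123, 65535)]
```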

~~~
arethuza
I think university classes, especially more advanced courses, are often
strongly influenced by the research areas of the people doing the teaching.

------
kazinator
I used derivatives in the regex implementation in TXR. The derivative back-end
is only used for regexes that contain the "exotic" intersection and complement
operators (or non-greedy repetition, which transforms to intersection and
complement at the AST level).

[http://www.kylheku.com/cgit/txr/tree/regex.c](http://www.kylheku.com/cgit/txr/tree/regex.c)

The derivatives implementation starts with the appearance of the function
_reg_expand_nongreedy_.

Wow, I see that in _reg_derivative_ where it handles compound forms, I'm
testing for some vanishingly improbable internal-error cases upfront: presence
of uncompiled character classes. Fixed that.

------
Patient0
See also: [http://matt.might.net/articles/parsing-with-derivatives/](http://matt.might.net/articles/parsing-with-derivatives/)

~~~
chubot
Note that parsing with derivatives is different from matching regexes with
derivatives.

Russ Cox likes regexes with derivatives -- they are a "win-win". He doesn't
like parsing with derivatives because of the computational complexity. Linked
from that article:

[https://research.swtch.com/yaccalive](https://research.swtch.com/yaccalive)

Your link doesn't really convince me they addressed the concerns ... There
seems to be some hedging in the language, like _that it seems to be efficient
in practice; and that it should eventually be efficient in theory too._

------
hyperpape
Here's an argument that derivatives make the pumping lemma for regular
languages much more understandable:
[https://bosker.wordpress.com/2013/08/18/i-hate-the-pumping-lemma/](https://bosker.wordpress.com/2013/08/18/i-hate-the-pumping-lemma/)

------
gpvos
The actual article is behind a paywall, but Wikipedia gives some more details:
[https://en.wikipedia.org/wiki/Brzozowski_derivative](https://en.wikipedia.org/wiki/Brzozowski_derivative)

------
porphyrogene
I am seeing this error printed directly to the DOM:

Warning: mysql_connect() [function.mysql-connect]: Unknown MySQL server host
'mysql' (1) in /home/ltu/www/includes/database.mysql.inc on line 31 Unknown
MySQL server host 'mysql' (1)

------
tannhaeuser
(1964)

