Derivatives of Regular Expressions (2007) (lambda-the-ultimate.org)
147 points by tosh on Nov 12, 2018 | 49 comments



Eleven years on, this remains one of my favorite papers!

There's a pretty variant of this idea introduced in 1995 by Valentin Antimirov called partial derivatives of regular expressions. His idea was to replace the single derivative with a set of partial derivatives that add up to the whole derivative. This simplifies the derivative algorithm quite a bit, and makes it possible to write amazingly concise and efficient regular expression matchers.

E.g., here's an implementation of a regular expression compiler that uses Antimirov derivatives to take a regexp and build a table-driven matcher in about 50 lines of code:

http://semantic-domain.blogspot.com/2013/11/antimirov-deriva...
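
For a sense of what this buys you without clicking through, here's a rough Python sketch of Antimirov-style matching (the tuple encoding of regexes and all the names are my own, not the code from the post):

    # Regexes as tuples: ('empty',), ('eps',), ('chr', c),
    # ('alt', r, s), ('cat', r, s), ('star', r).

    def nullable(r):
        """Does r accept the empty string?"""
        tag = r[0]
        if tag in ('eps', 'star'): return True
        if tag in ('empty', 'chr'): return False
        if tag == 'cat': return nullable(r[1]) and nullable(r[2])
        return nullable(r[1]) or nullable(r[2])          # 'alt'

    def pderiv(c, r):
        """Antimirov: the *set* of partial derivatives of r w.r.t. character c."""
        tag = r[0]
        if tag in ('empty', 'eps'):
            return set()
        if tag == 'chr':
            return {('eps',)} if r[1] == c else set()
        if tag == 'alt':
            return pderiv(c, r[1]) | pderiv(c, r[2])
        if tag == 'star':
            return {('cat', d, r) for d in pderiv(c, r[1])}
        # 'cat': derive the head; also derive the tail if the head is nullable
        ds = {('cat', d, r[2]) for d in pderiv(c, r[1])}
        return ds | pderiv(c, r[2]) if nullable(r[1]) else ds

    def matches(r, s):
        states = {r}            # a set of regexes ~ a set of NFA states
        for c in s:
            states = set().union(*[pderiv(c, q) for q in states])
        return any(nullable(q) for q in states)

    # a(b|c)d
    r = ('cat', ('chr', 'a'), ('cat', ('alt', ('chr', 'b'), ('chr', 'c')), ('chr', 'd')))
    print(matches(r, 'acd'), matches(r, 'ad'))   # True False

The set of regexes reachable this way from a given regex is finite, which is what makes the table-driven compiler in the linked post possible.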


Curiously, in Ken Thompson's 1968 paper on his regex matcher https://www.fing.edu.uy/inco/cursos/intropln/material/p419-t... he says it works by taking Brzozowski derivatives. This was a headscratcher to me since his code seems completely different from the derivative-based regex matchers I'd seen. The answer is, it takes Antimirov derivatives, which didn't have a name yet.

Here's something like your code in Python: https://github.com/darius/regexercise_solutions/blob/master/... and transformed into Thompson's algorithm: https://github.com/darius/regexercise_solutions/blob/master/...

(I've never read Antimirov's paper past the first page or two; I wrote these things while digging into Thompson.)


Woah, I think you just answered the exact question I asked 1 minute ago? :)

https://news.ycombinator.com/item?id=18434228

I don't know what an Antimirov derivative is, will check it out ... Thanks for the links to code!


Yep! As mentioned in the comments, the logic you get this way isn't exactly the same as Thompson's, but it's very close. You're welcome.

FWIW I drafted a lot of variations while messing with this: https://github.com/darius/sketchbook/tree/master/regex (some of this code is incorrect) and https://github.com/darius/sketchbook/tree/master/lex.


Brzozowski's paper is from 1964. Download link:

http://maveric.uwaterloo.ca/reports/1964_JACM_Brzozowski.pdf


This paper is the first citation of Ken Thompson's 1968 paper, which I archived here:

http://www.oilshell.org/archive/Thompson-1968.pdf

One thing I just learned that is SUPER WEIRD is that Thompson's paper contains the word "derivative" but doesn't contain the words NFA, DFA, or even "state"!!!

First paragraph:

In the terms of Brzozowski [1], this algorithm continually takes the left derivative of the given regular expression with respect to the text to be searched.

From: https://research.swtch.com/yaccalive

Thompson learned regular expressions from Brzozowski himself while both were at Berkeley, and he credits the method in the 1968 paper. The now-standard framing of Thompson's paper as an NFA construction and the addition of caching states to produce a DFA were both introduced later, by Aho and Ullman in their various textbooks. Like many primary sources, the original is different from the textbook presentations of it. (I myself am also guilty of this particular misrepresentation.)

However, figure 5 in Thompson's paper is CLEARLY an NFA for the regex a(b|c)d.

What's the connection between derivatives and NFAs? I don't understand why Thompson says his method is derivative-based and not NFA-based.

(The 2007 "Re-examined" paper linked here comments upon the Thompson line in section 6. They qualify Thompson's statement a bit, but I don't understand it fully. )

(This came up on lobste.rs after a discussion of Thompson's construction: https://lobste.rs/s/fq8uil/aho_corasick#c_4xkm7z)


Annoyingly, now that it's really needed, Wikipedia gives no pronunciation guide. Apparently it's Bzho-ZOFFskyi in Poland (where zh is like the si in vision) or Briz-OWski in the US.


Reminds me of what my TA showed us in undergrad. If I remember correctly, it went like this:

Consider the regex A: Aa | ϵ, where ϵ denotes the empty string. What strings does it generate?

To solve, rewrite it as A = Aa + ϵ, then find its Taylor series and you're done:

A = ϵ / (ϵ - a) = ϵ + a + aa + aaa + ...


A=Aa+e

A-Aa=e

A(1-a)=e

A=e/(1-a) not e/(e-a)


Nope, I'm pretty sure it's not wrong. I believe ϵ is the multiplicative identity ("1") here. Note that ϵ is the empty string, and concatenating it with anything results in the same thing, so Aϵ = A. Hence we have Aϵ - Aa = A(ϵ - a) = ϵ.


Nitpick: A is a production, not a regex, which would be something like a*.

Edit: I meant "A: Aa | ε" instead of "A"


No... the production is the entire rule, so "A: Aa" and "A: ϵ" are the productions.


My formal automata class might have used non-standard terminology, but what I was taught was that while A: Aa and A: ε are productions, A: Aa | ε is just a shorthand to combine the two. Since there is only one start variable, A: Aa | ε is the entire grammar as well. a* is the regular expression representing the language for this grammar.


> but what I was taught was that while A: Aa and A: ε are productions

Huh? This is exactly what I also just said, in response to the different claim you made above, which was that the production is merely "A" rather than "A: Aa" or "A: ε". If you're claiming they are somehow equal (which they clearly are not... one is a symbol and the other is a rule mapping that symbol to a sentential form) then I'm confused how you think A and a* were any less equal to begin with...


Oops, I just realized that I forgot to write the entire right side of the production…


I guess this is what happens when you focus on nitpicking (to use your own word).


If regexps are understood as a kind of "baby" programming language, and the techniques for implementing it are understood as "baby" implementation techniques for more complete languages, does "differentiation" have any applications in implementing more complete programming languages?

As an aside, why do I think of regexps as a kind of "baby" programming language? Because:

- They can be converted to NFAs or DFAs. These NFAs and DFAs resemble assembly languages. The "assembly language" for an NFA is actually quite exotic. Reminds me of compilation.

- NFAs can be understood as a kind of Virtual Machine (see the sketch after this list). DFAs are like a subset of a physical machine.

- Just In Time compilation was first applied to regexps before anything else. This was done by Ken Thompson.

- The interpreter->compiler algorithm in RPython (the program that is used to produce PyPy) works quite well on regexps. This essentially automates the work that Ken Thompson did.
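
To make the VM analogy concrete, here's a rough sketch of that style of matcher (in the spirit of Russ Cox's regexp articles; the instruction set and the hand-compiled program are my own illustration, not Thompson's actual code):

    # Instructions: ('char', c) consumes c; ('split', x, y) forks; ('jmp', x); ('match',).

    def epsilon_closure(prog, pcs):
        """Chase jmp/split instructions without consuming input."""
        stack, closure = list(pcs), set()
        while stack:
            pc = stack.pop()
            if pc in closure:
                continue
            closure.add(pc)
            op = prog[pc]
            if op[0] == 'jmp':
                stack.append(op[1])
            elif op[0] == 'split':
                stack.extend(op[1:])
        return closure

    def run(prog, text):
        # A "thread" is just a program counter; all threads advance in lockstep.
        threads = epsilon_closure(prog, {0})
        for ch in text:
            moved = {pc + 1 for pc in threads
                     if prog[pc][0] == 'char' and prog[pc][1] == ch}
            threads = epsilon_closure(prog, moved)
        return any(prog[pc][0] == 'match' for pc in threads)

    # Hand-compiled program for a(b|c)d:
    prog = [('char', 'a'), ('split', 2, 4), ('char', 'b'), ('jmp', 5),
            ('char', 'c'), ('char', 'd'), ('match',)]
    print(run(prog, 'acd'), run(prog, 'ad'))   # True False

The set of live program counters is exactly the "NFA can be in many states at once" idea discussed elsewhere in this thread.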


> does "differentiation" have any applications in implementing more complete programming languages?

I'm not sure about implementing per se, but you can extend some notion of differentiation to richer programming languages: https://www.sciencedirect.com/science/article/pii/S030439750...

The differential of a term does give you some information about how a term behaves under reduction, so I guess the answer is "maybe" - sounds like a fun project to work on. :)

More generally, there are models of linear logic built on a notion of differentiation (mentioned in the same paper as above), which might translate to a compilation of linear lambda calculus/classical processes/pi calculus/etc.


> does "differentiation" have any applications in implementing more complete programming languages?

It seems to: http://matt.might.net/articles/parsing-with-derivatives/


To clarify, the programming language in your link is BNF. Clearly, BNF is a more expressive language than regexp.

I was actually thinking of differentiating whole programs. But still interesting.


A Theory of Changes for Higher-Order Languages - Incrementalizing λ-Calculi by Static Differentiation

> [...] We present a program transformation taking programs to their derivatives, which is fully static and automatic, supports first-class functions, and produces derivatives amenable to standard optimization.

https://arxiv.org/abs/1312.0658


> the programming language in your link is BNF

The programming language follows a context-free grammar (CFG), which has been written in Backus-Naur Form (BNF). It is technically nonsensical to say "the programming language is BNF". It's like if I were to say "English is a Roman language" because we happen to use the Roman letters for our orthography. Just a technical nitpick.

> BNF is a more expressive language than regexp

The context-free languages are more expressive than the regular languages. BNF is a notation for writing a context-free grammar, and regular expressions are a notation for writing regular languages. (Again, a nitpick of terminology.)


> Just In Time compilation was first applied to regexps before anything else. This was done by Ken Thompson.

Citation for this? I'd be surprised if Thompson was first. Runtime optimization was long the domain of Lisp programmers. Blurring the distinction between compile-time and run-time is one of PG's things that "made Lisp different".

https://eli.thegreenplace.net/2013/11/05/how-to-jit-an-intro... references a paper by Aycock, which says McCarthy was first, in his original Lisp paper, nearly a decade before Thompson.


I think Rob Pike mentioned it in a recent talk on the history of Unix https://youtu.be/_2NI6t2r_Hs?t=1901


He says this is the first case of regular expressions on a computer, which I believe, but I'm not hearing anything about runtime optimization or compilation.


Less than one minute later he mentions "compiling on the fly" some assembly code: https://youtu.be/_2NI6t2r_Hs?t=1944


Sure, but he never claims that's the first instance of runtime compilation, just that the implementation is "pretty interesting".


NFAs are not a "kind of virtual machine". The "assembly" language for an actual NFA is no more complex than the one for a DFA.

They are state machines that can be in a lot of states at once; that's all. There is a debasement of the term NFA, popularized by Friedl, that "NFA" means "any thingy that implements regex that isn't a DFA". This isn't true and should not be promulgated.


They are a virtual machine because the physical hardware can't be in more than one state at the same time. Representing an NFA as an assembly-type program makes it resemble something like the JVM, in that it looks like an assembly language for a non-physical machine. Virtual=non-physical

I think you're being pedantic.


It's a strained metaphor which casts no light on actual NFA implementation or the true distinction of NFAs and DFAs.

In this case, instead of the distinction between "virtual" and "physical" we have, considerably less pretentiously, a "state id" vs a "set of state ids" (in the case of an actual NFA implementation, said set might well be represented by a bit-vector, whether in a general-purpose register or a SIMD register).

Making tortured analogies to JVMs and virtual machines doesn't actually help explain anything about what's going on. It certainly fails to provide the insight that you might get from comparing the whole "id vs set-of-ids" (one can instantly intuit the risk of an exponential blowout in NFA->DFA conversion, which does in fact exist).
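
For what it's worth, here's a minimal sketch of the "set of state ids as a bit-vector" point (the NFA for a(b|c)d is hand-built here purely for illustration):

    # One bit per NFA state; a step ORs together the successor masks of every set bit.
    # States: 0 --a--> 1, 1 --b--> 2, 1 --c--> 2, 2 --d--> 3 (accepting).
    delta = {('a', 0): 1 << 1, ('b', 1): 1 << 2, ('c', 1): 1 << 2, ('d', 2): 1 << 3}
    ACCEPT = 1 << 3

    def step(states, ch):
        out, s = 0, 0
        while states:
            if states & 1:
                out |= delta.get((ch, s), 0)
            states >>= 1
            s += 1
        return out

    def match(text):
        states = 1                      # start: only state 0 is set
        for ch in text:
            states = step(states, ch)
        return bool(states & ACCEPT)

    print(match('abd'), match('ad'))    # True False

A DFA state would be one of these bit patterns; enumerating them all up front is the subset construction, and the 2^n possible patterns are where the exponential blowout comes from.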


This is where I got the VM analogy from: https://swtch.com/~rsc/regexp/regexp2.html

My whole comment was about tenuous but interesting analogies, and whether they can lead to somewhere interesting.


Fair enough. I think that implementation is a dreadful straight NFA implementation - and, IIRC, is not used in RE2 for pure NFA code.

It's not a dreadful way of implementing regex as it can do things a straightforward bitset NFA implementation can't.

To be fair, I think your original post does stand, but I guess I am a bit pedantic about the terminology after spending a little bit of time on regex implementation.

In fact, a number of strategies in regex implementation turned out to generalize to broader areas. The bitwise automata are helpful elsewhere, as are many of the strategies one might use to make them fast. The 4 Russians technique ( https://en.wikipedia.org/wiki/Method_of_Four_Russians ) has also cropped up all over the place.


Regarding the strained metaphor, the "virtual machine" is metaphorical, and it can be simply ignored because any possible implementation of an NFA is actually a DFA. Proof: a finite amount of memory can be considered a DFA state and modifying that memory between consuming successive input symbols can be considered updating the state.


Yes, I think "virtual machine" is too loaded of a term now and is just confusing the point?

An NFA implementation can be "virtual" in the sense of "virtual memory" being virtual. It is a pretense of a practically unbounded set of states while really being bounded like any other practical implementation. It grows on demand and can handle many practical inputs while possibly hitting its implementation limits on a worst-case input.



The idea of taking the derivative of an algebraic data type is also discussed in "Categories and Computer Science" from 1992: https://www.cambridge.org/core/books/categories-and-computer...

That discussion covers some of the same ground as the Taylor series idea elsewhere in this discussion. It doesn't get as far as inventing the zipper concept.


You can see all of these as data structures instead of programs and then derivatives transfer and generalize nicely: see Conor McBride’s work on zippers.


> NFAs can be understood as a kind of Virtual Machine. DFAs are like a subset of a physical machine.

One of the projects for the class I mentioned in another comment was actually implementing a stack-based virtual machine for regular expressions!


I wrote a Jupyter notebook with an implementation of Derivatives of Regular Expressions in Python[1].

The Prolog version is pretty much just an encoding of the rules. This is an implementation for a two-symbol alphabet {0,1} (without "compaction", IIRC):

    % Brzozowski's Derivatives of Regular Expressions
    :- use_module(library(ordsets)).
    :- use_module(library(tabling)).

    % dre(Symbol, Regexp, Derivative): derivative of Regexp w.r.t. Symbol.
    % [] is the empty set, [""] is the empty string, ["0"]/["1"] are literals.
    dre(_, [], []).
    dre(_, [""], []).
    dre(0, ["1"], []).
    dre(0, ["0"], [""]).
    dre(1, ["0"], []).
    dre(1, ["1"], [""]).
    dre(C,   kstar(R), cons(Rd, kstar(R))) :- dre(C, R, Rd).
    dre(C,    not_(R), not_(Rd)   ) :- dre(C, R, Rd).
    dre(C,  and(R, S), and(Rd, Sd)) :- dre(C, R, Rd), dre(C, S, Sd).
    dre(C,   or(R, S),  or(Rd, Sd)) :- dre(C, R, Rd), dre(C, S, Sd).
    dre(C, cons(R, S),    cons(Rd, S)     ) :- nully(R, []  ), dre(C, R, Rd).
    dre(C, cons(R, S), or(cons(Rd, S), Sd)) :- nully(R, [""]), dre(C, R, Rd), dre(C, S, Sd).

    % nully(Regexp, N): N = [""] if Regexp accepts the empty string, else [].
    nully([], []).
    nully([""], [""]).
    nully(["0"], []).
    nully(["1"], []).
    nully(kstar(_), [""]).
    nully(not_(R), [""]) :- nully(R, []).
    nully(not_(R), []) :- nully(R, [""]).
    nully( and(R, S), N) :- nully(R, Rn), nully(S, Sn), ord_intersection(Rn, Sn, N).
    nully(cons(R, S), N) :- nully(R, Rn), nully(S, Sn), ord_intersection(Rn, Sn, N).
    nully(  or(R, S), N) :- nully(R, Rn), nully(S, Sn), ord_union(Rn, Sn, N).

[1] "∂RE: Brzozowski’s Derivatives of Regular Expressions" http://joypy.osdn.io/notebooks/Derivatives_of_Regular_Expres...


One of my programming assignments for a programming languages class happened to be creating a DFA for regular expressions by taking successive derivatives, except slightly modified to use ranges because the naïve method of taking derivatives isn't great when you have to take 2^(2^16) of them when your alphabet is UTF-16 characters.

Actually, now that I think about it, the entire class was strangely centered around taking derivatives of regular expressions…


I think university classes, especially more advanced courses, are often strongly influenced by the research areas of the people doing the teaching.


> except slightly modified to use ranges because the naïve method of taking derivatives isn't great when you have to take 2^(2^16) of them when your alphabet is UTF-16 characters.

The paper we read described how instead of deriving by a single character, one could also derive by sets of characters. I then improvised the range approach by myself - it was very cool being able to handle unicode!
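
Something like this, maybe (a rough sketch with my own representation, not the class assignment: literals carry an inclusive codepoint range, and the ranges we derive by are assumed to be pre-split so they never straddle a literal's boundary):

    EMPTY, EPS = ('empty',), ('eps',)

    def nullable(r):
        t = r[0]
        if t in ('eps', 'star'): return True
        if t in ('empty', 'range'): return False
        if t == 'cat': return nullable(r[1]) and nullable(r[2])
        return nullable(r[1]) or nullable(r[2])          # 'alt'

    def deriv_range(r, lo, hi):
        """Brzozowski derivative of r w.r.t. any character in [lo, hi]."""
        t = r[0]
        if t in ('empty', 'eps'):
            return EMPTY
        if t == 'range':
            return EPS if r[1] <= lo and hi <= r[2] else EMPTY
        if t == 'alt':
            return ('alt', deriv_range(r[1], lo, hi), deriv_range(r[2], lo, hi))
        if t == 'star':
            return ('cat', deriv_range(r[1], lo, hi), r)
        # 'cat'
        d = ('cat', deriv_range(r[1], lo, hi), r[2])
        return ('alt', d, deriv_range(r[2], lo, hi)) if nullable(r[1]) else d

    # One derivative of [a-z]* covers all 26 characters at once,
    # instead of one derivative per character of the alphabet.
    r = ('star', ('range', ord('a'), ord('z')))
    print(deriv_range(r, ord('a'), ord('z')))    # ('cat', ('eps',), ('star', ...))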


I used derivatives in the regex implementation in TXR. The derivative back-end is only used for regexes that contain the "exotic" intersection and complement operators. (Or the non-greedy repetition which transforms to intersection and complement at the AST level).

http://www.kylheku.com/cgit/txr/tree/regex.c

The derivatives implementation starts with the appearance of the function reg_expand_nongreedy.

Wow, I see that in reg_derivative where it handles compound forms, I'm testing for some vanishingly improbable internal-error cases upfront: presence of uncompiled character classes. Fixed that.



Note that parsing with derivatives is different from matching regexes with derivatives.

Russ Cox likes regexes with derivatives -- they are a "win-win". He doesn't like parsing with derivatives because of the computational complexity. Linked from that article:

https://research.swtch.com/yaccalive

Your link doesn't really convince me they addressed the concerns ... There seems to be some hedging in the language, like that it seems to be efficient in practice; and that it should eventually be efficient in theory too.


Here's an argument that derivatives make the pumping lemma for regular languages much more understandable: https://bosker.wordpress.com/2013/08/18/i-hate-the-pumping-l...


The actual article is behind a paywall, but Wikipedia gives some more details: https://en.wikipedia.org/wiki/Brzozowski_derivative


I am seeing this error printed directly to the DOM:

Warning: mysql_connect() [function.mysql-connect]: Unknown MySQL server host 'mysql' (1) in /home/ltu/www/includes/database.mysql.inc on line 31 Unknown MySQL server host 'mysql' (1)


(1964)



