
A regular expression matcher By Rob Pike and Brian Kernighan (2007) - sid6376
http://www.cs.princeton.edu/courses/archive/spr09/cos333/beautiful.html
======
stiff
There is a lovely description of the classic Thompson algorithm of building an
NFA from a regular expression and then simulating the NFA with a DFA using two
stacks :) in the "Compilers" book by Aho et al. There is also a great article
from Russ Cox on it: <http://swtch.com/~rsc/regexp/regexp1.html>

I recommend implementing this as an exercise, rarely do you see that much
classic CS theory put to a practical use in one place.
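A minimal sketch of that state-set simulation for the same literal/`.`/`*` subset (anchored at the start; `^`/`$` handling omitted, and `nfamatch`/`addstate` are names invented here, not from the article): each pattern position is an NFA state, and a set of active positions is advanced over the text in lockstep, so no backtracking is ever needed.

```c
#include <string.h>

/* Each index into the pattern is an NFA state; active[i] means "the
   first i pattern elements have been matched".  A starred element can
   be entered zero times, hence the epsilon-skip in addstate. */
static void addstate(const char *re, char *active, int i)
{
    if (active[i])
        return;
    active[i] = 1;
    if (re[i] != '\0' && re[i + 1] == '*')
        addstate(re, active, i + 2);    /* zero matches of re[i] */
}

/* Match re (literals, '.', 'c*') against a prefix of text, anchored at
   the start, by simulating all pattern positions in lockstep. */
int nfamatch(const char *re, const char *text)
{
    int i, n = strlen(re);
    char active[n + 1], next[n + 1];

    memset(active, 0, n + 1);
    addstate(re, active, 0);
    for (;; text++) {
        if (active[n])                  /* pattern exhausted: match */
            return 1;
        if (*text == '\0')              /* text exhausted: no match */
            return 0;
        memset(next, 0, n + 1);
        for (i = 0; i < n; i++) {
            if (!active[i] || re[i] == '*')
                continue;
            if (re[i] == '.' || re[i] == *text)
                addstate(re, next, re[i + 1] == '*' ? i : i + 1);
        }
        memcpy(active, next, n + 1);
    }
}
```

Because the active set can never hold more than one entry per pattern position, the running time is bounded by pattern length times text length, regardless of input.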

------
chollida1
Can someone explain to me why the matchstar function takes its first param as
an integer? To me it should be a char.

    
    
        int matchstar(int c, char *regexp, char *text)
        {
            do {    /* a * matches zero or more instances */
                if (matchhere(regexp, text))
                    return 1;
            } while (*text != '\0' && (*text++ == c || c == '.'));
            return 0;
        }

~~~
abecedarius
The usual reason is to allow c to be EOF, but I don't see why anyone would in
this case.
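For context, the idiom comes from C's I/O functions: getchar() returns int precisely so that EOF (typically -1) stays distinguishable from every valid character value. A toy illustration (`next_or_eof` is a made-up helper, not from the article):

```c
#include <stdio.h>

/* getchar() returns int so that EOF (usually -1) is distinguishable
   from all 256 possible byte values; a char-typed result would either
   never compare equal to EOF or would collide with a legitimate 0xFF
   byte.  This made-up helper mimics that convention over a string. */
int next_or_eof(const char **s)
{
    if (**s == '\0')
        return EOF;
    return (unsigned char)*(*s)++;
}
```

A function declared with an int "character" parameter can therefore accept either a real character or EOF without ambiguity, though, as noted, matchstar never actually receives EOF here.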

------
nrbafna
On an unrelated note:

Adding a few lines of CSS (setting the body width, font-size to 16px,
line-height to 1.4, Georgia as the body font, and a larger size for headings)
makes for a decent reading experience.

<http://imgur.com/a/KtaEC/>

~~~
nodesocket
My recommendations, for a more pleasurable reading experience:

    
    
        body {
          width: 70%;
          font-size: 20px;
          margin: auto;
          font-family: sans-serif;
          line-height: 30px;
        }

~~~
huhtenberg
_Pedantic nitpick incoming_

Sans-serifs are generally harder to read. Note, for example, how virtually all
books are set in serifs. The primary reason sans-serifs are so widely used on
computers is that displays have lacked the resolution required to render
actual serifs. So if you are using a larger font size to read something off
the screen, there is no reason not to use a serifed typeface.

~~~
gliptic
This might very well be a myth. See e.g. [http://alexpoole.info/blog/which-
are-more-legible-serif-or-s...](http://alexpoole.info/blog/which-are-more-
legible-serif-or-sans-serif-typefaces/#part2).

~~~
huhtenberg
Everything that has to do with the perception might very well be a myth :)

However, in typography and type-design circles it is commonly accepted that
serifs are superior for consuming large quantities of text. In layman's
terms: serifs make the glyphs more distinct and easier to recognize at a
glance, in contrast to sans-serifs, where several glyph pairs look virtually
the same.

~~~
dpark
"Commonly accepted" isn't worth much when it conflicts with actual science.
Lots of things commonly accepted are simply incorrect.

~~~
huhtenberg
Please don't pull a Wikipedia on me here. As an old physicist joke goes -

    
    
      An experimental physicist comes to a theoretical physicist's office,
      brings a graph from a recent experiment, and asks for help with
      interpreting the results. 
    
      - Well, it's all rather obvious. Here's a peak, here's a dip,
        because of this, that and third.
    
      - Hold it, hold it... you are looking at it upside down.
    
      - Ah, right, right. *Rotates the graph*. Oh, it's now even more 
        obvious than before.
    

In other words, the "actual science" you are referring to frequently ends up
being nothing more than a matter of interpretation, subject to all sorts of
biases. Especially when it concerns something as unquantifiable as "comfort
of reading". Just pick up a couple of fiction books, one set in serif and
another in sans-serif, go through a pageful of text and see for yourself.

~~~
dpark
You can't just dismiss science and fact when it's inconvenient. Well, you can,
but then your beliefs have no more backing than those of astrologers, and
you're guilty of willfully spreading misinformation. You asserted a
falsifiable statement as fact ("Sans-serifs are generally harder to read") and
in the face of fairly compelling evidence that it is in fact a false
statement, you immediately appealed to "common knowledge". It's also "common
knowledge" that Einstein failed math as a kid and that eating before swimming
will give you cramps, but these are both untrue statements.

Comfort of reading is very quantifiable. You can present text to a bunch of
people in serif and sans serif fonts (double-blind and randomized) and ask
them to rate how pleasant the text was. You can be clever and ask questions
that measure understanding and retention, or you can ask them how much they
enjoyed reading the passage, and these will give you indirect measures of
reading comfort. Or you can be blunt and ask if they enjoy reading in the
font, though this will pick up biases more. Either way is significantly more
scientific than an appeal to common knowledge, though.

> _Just pick up a couple of fiction books, one set in serif and another in
> sans-serif, go through a pageful of text and see for yourself._

This isn't science. This is just bias confirmation. ("Wow, I prefer the one I
expected to prefer!")

But for the record, my e-reader font is set to Gill Sans.

~~~
huhtenberg
I have no immediate way of verifying that what you call compelling evidence
referenced in that blog post is in fact compelling and accurate. I _could_ go
through all the linked sources, but I don't presently have time for that. On
the other hand, I have read several books on typography and I am partial to
type design. From that exposure I do know that the common consensus among the
people involved in creating said serifs and sans-serifs is that sans-serifs
_are_ generally harder to read. So when I am presented with evidence to the
contrary, you can be damn sure I will doubt it.

Throwing around power words like "science" and "compelling evidence" based on
a couple of references plucked from a blog post - sorry, but you are well into
meta territory, preaching about the general subject matter without any regard
to the context. You are not the only one who's aware of what science and the
scientific method entail, but then you should also be well aware of the junk
that gets published in the format of scientific research, gets quoted and
re-quoted, and eventually accumulates notable status even though it hasn't
been peer-reviewed even once. This happens all the time and it's a part of
"science", so being skeptical is part of the package. And the more obscure
the area of research, the more skepticism is warranted. You surely know that,
being so well-versed in all things science.

\--

Your e-reader doesn't have the resolution required for good quality rendering
of serif fonts. Hence the Gill Sans.

~~~
dpark
My complaint is not that you disagree with the blog or that you don't have the
time to dig through all the references (neither do I). My complaint is that
you've presented something as fact with no evidence and continue to assert its
unquestionable truth in the face of contrary evidence. You've made no actual
attempt to defend the claim except to say that others agree with you. That
isn't knowledge. I'm not even sure it qualifies as an appeal to authority.
It's just dogma.

You're right to be skeptical. You're not right to be dismissive.

\--

My phone also uses sans-serif fonts almost exclusively and I read a ton of
stuff on it. It definitely has a high enough resolution (>300ppi) for serif
fonts.

~~~
tadfisher
I have a suspicion that 90% of the functional effectiveness of any visual
aesthetic is learned experience. Witness the outcry over 48fps feature-length
films, for example: people complain that they have a "cheap" or "soap-opera"
feel, because cheap productions and soap operas are the dominant media shot
at high frame rates. This affects the enjoyment that movie-goers take from
the experience, and thus the overall value of the product.

The inferences we make about functional differences in the aesthetics of type
are clearly biased by this phenomenon. It may well be possible for controlled
observations to reveal that reading speeds are faster for serif type; but
this might be because readers are used to it, as most long works are already
printed in serif type - in part due to book designers' assumption that serif
type is faster to read!

------
nilkn
I translated this into Haskell just for fun.

    
    
        -- search for the pattern anywhere in the text ('^' anchors it)
        match :: String -> String -> Bool
        match ('^':rest) text     = matchLocal rest text
        match rex        []       = matchLocal rex []
        match rex        (c:rest) = matchLocal rex (c:rest) || match rex rest
     
        -- match the pattern at the start of the text
        matchLocal :: String -> String -> Bool
        matchLocal []            _         = True
        matchLocal (c:'*':restR) text      = matchStar c restR text
        matchLocal "$"           text      = null text
        matchLocal (r:restR)     (t:restT) = (r == '.' || r == t) && matchLocal restR restT
        matchLocal _             _         = False
     
        -- zero or more c's, then the rest of the pattern (with backtracking)
        matchStar :: Char -> String -> String -> Bool
        matchStar c restR text = matchLocal restR text ||
            case text of
              (t:rest) -> (c == '.' || c == t) && matchStar c restR rest
              []       -> False

------
erre
For those who don't know, this is Rob Pike's chapter of Beautiful Code, a
thoroughly enjoyable book: [http://www.amazon.co.uk/Beautiful-Code-Leading-
Programmers-P...](http://www.amazon.co.uk/Beautiful-Code-Leading-Programmers-
Practice/dp/0596510047) . Every now and then I read a different chapter, and
it's always interesting. I particularly liked Rob Pike's one, and the one
about Python's dictionary implementation.

~~~
bcantrill
This code is beautiful, but it's also busted [1] -- albeit only for
extraordinarily pathological input. Whether this ultimately dents its beauty
is up to the beholder, presumably; personally, I confess to suffering from a
utilitarian sense of beauty when it comes to software -- for me, broken code
can't achieve absolute beauty.

[1] <http://dtrace.org/blogs/bmc/2007/07/11/beautiful-code/>

~~~
agentS
I believe Kernighan acknowledged that: "[snip] The depth of recursion is no
more than the length of the pattern, which in normal use is quite short, so
there is no danger of running out of space. [/snip]".

If you're going to judge this code on not just pathological haystacks but
also pathological needle regexes, then you have to admit that this code,
whose space usage is linear in the pattern length, is far better than PCRE,
which takes exponential time in the pattern length[1].

Considering its succinctness, I find that pretty impressive.

[1] <http://swtch.com/~rsc/regexp/regexp1.html>

------
b0rsuk
Great code is often deceptively simple. It can lead you to believe it was
simple to write.

------
sid6376
For those unfamiliar with pointer arithmetic, I wrote a Python port here:
<https://gist.github.com/siddharthsarda/5538928>

~~~
mixedbit
Be aware that this isn't an equivalent of the C code. Replacing pointer
arithmetic with string copies increases the big-O complexity of the
algorithm.

An equivalent would need to pass indexes to the sub-functions to avoid the
additional copies.
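The index-passing version keeps each recursive step O(1); in C terms it might look like this (a sketch: the `*_at` names are invented here, but the logic mirrors the article's matcher):

```c
static int matchhere_at(const char *re, int ri, const char *text, int ti);

/* '*' loop: try the rest of the pattern at every tail of the text */
static int matchstar_at(int c, const char *re, int ri, const char *text, int ti)
{
    do {
        if (matchhere_at(re, ri, text, ti))
            return 1;
    } while (text[ti] != '\0' && (text[ti++] == c || c == '.'));
    return 0;
}

/* match the pattern starting at re[ri] against the text at text[ti];
   integer offsets replace the pointer arithmetic (or Python slicing) */
static int matchhere_at(const char *re, int ri, const char *text, int ti)
{
    if (re[ri] == '\0')
        return 1;
    if (re[ri + 1] == '*')
        return matchstar_at(re[ri], re, ri + 2, text, ti);
    if (re[ri] == '$' && re[ri + 1] == '\0')
        return text[ti] == '\0';
    if (text[ti] != '\0' && (re[ri] == '.' || re[ri] == text[ti]))
        return matchhere_at(re, ri + 1, text, ti + 1);
    return 0;
}

/* search for re anywhere in text, as in the original match() */
int match_at(const char *re, const char *text)
{
    int ti = 0;
    if (re[0] == '^')
        return matchhere_at(re, 1, text, 0);
    do {
        if (matchhere_at(re, 0, text, ti))
            return 1;
    } while (text[ti++] != '\0');
    return 0;
}
```

The same transformation carries over directly to a Python port: pass `ri` and `ti` down the recursion instead of slicing `regexp[1:]` and `text[1:]`, which each allocate a fresh string.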

------
minikomi
On a related note, Rob Pike's "Lexical Scanning in Go" talk is also worth a
look <https://cuddle.googlecode.com/hg/talk/lex.html#landing-slide>

------
porlw
It's interesting that this is in a very functional style: no global state is
used.

------
nilkn
I was once asked in an interview to write a pattern matcher in C for the exact
same class of regular expressions as in this post. I'm pretty sure the
interviewer was looking for exactly the approach used here as well.

------
jonny_eh
I wonder if this would be a good test to give a software developer candidate.
Develop a regex matcher that matches a subset of the regex rules in your
language of choice (without using regex abilities, of course).

~~~
dreen
I don't know - regexes have a lot of rules, so you'd need to supply a
reference; requiring candidates to remember them all is overkill, not to
mention that some of those rules are pretty arcane.

~~~
barrkel
A simple implementation of ?, +, *, [] and non-capturing, precedence-only ()
hardly needs a reference. Whether greedy vs. non-greedy + / * are implemented
doesn't really matter.

I will say that the problem is much easier if you have a good grounding in
the basics of compilers, though that knowledge isn't useful very often. Any
positive value as a test will come from its correlation with interest in CS,
IMO.

------
pdog
Although performance isn't really part of this exercise, beautiful code is
also likely to be fast -- efficiently doing the work needed to solve the
problem and no more.

~~~
lmm
Not always true, e.g. brute force searching can often look more beautiful than
more sophisticated techniques. The article mentions how performance could be
improved by separating compilation and execution of the regular expression (as
most "real" regex libraries do) - but this would almost certainly make it less
beautiful.
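For what it's worth, the compile/execute split can stay fairly small for this subset. A sketch (`struct op`, `compile`, and `run` are names invented here; `^`/`$` are omitted and the match is anchored): the pattern syntax is scanned once into (char, starred) pairs, and only `run` executes per input text.

```c
/* one "compiled" pattern element: a literal or '.', optionally starred */
struct op { char c; int star; };

/* scan the pattern once into ops[]; returns the op count, or -1 if the
   pattern is longer than max elements */
int compile(const char *re, struct op *ops, int max)
{
    int n = 0;
    for (; *re != '\0'; re++) {
        if (n >= max)
            return -1;
        ops[n].c = *re;
        ops[n].star = (re[1] == '*');
        if (ops[n].star)
            re++;                  /* skip the '*' */
        n++;
    }
    return n;
}

/* anchored match of the compiled ops against text; no pattern-syntax
   decisions are made here, only table lookups */
int run(const struct op *ops, int n, const char *text)
{
    if (n == 0)
        return 1;
    if (ops[0].star)               /* zero matches, or one more char */
        return run(ops + 1, n - 1, text) ||
               (*text != '\0' && (ops[0].c == '.' || ops[0].c == *text)
                && run(ops, n, text + 1));
    return *text != '\0' && (ops[0].c == '.' || ops[0].c == *text)
           && run(ops + 1, n - 1, text + 1);
}
```

Whether factoring out the one-time syntax decisions is worth the extra struct and function is exactly the beauty trade-off being discussed.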

------
toolslive
I saw this in a Dr. Dobb's article in the '90s as well.

~~~
toolslive
April 1999: [http://www.drdobbs.com/architecture-and-design/regular-
expre...](http://www.drdobbs.com/architecture-and-design/regular-
expressions/228700838?queryText=pike%2Bkernighan%2Bregular)

~~~
McUsr
Thanks for posting this, and thanks to the OP for posting the link as well. :)
I must shamefully admit that I haven't read the book, but it is now on top of
my reading list!

This was really a most pleasurable and insightful read. True elegance!

