Hacker News new | past | comments | ask | show | jobs | submit login
A Regular Expression Matcher (2007) (princeton.edu)
109 points by ville on Feb 13, 2020 | hide | past | favorite | 22 comments

A thread from 2016: https://news.ycombinator.com/item?id=12199836

2013: https://news.ycombinator.com/item?id=5672875

2011: https://news.ycombinator.com/item?id=2723366

Maybe others?

p.s. these links are just for curious readers; reposts are ok after a year or so—see https://news.ycombinator.com/newsfaq.html.

This is chapter 1 of "Beautiful Code" (http://shop.oreilly.com/product/9780596510046.do)

I thought I had seen it before. Verified you can actually see the whole chapter in a nicer (book vs web) layout in Amazon's "Look inside" feature since it's at the beginning of the book.


These posts are great for history, but regular expression implementation has moved on considerably from early Thompson implementations whether backtracking or NFAs. There is a considerable body of literature about regex implementation, including many quite convincing implementations and alternate formulations, some of which weren't even done by people at, or from, Bell Labs!

We seem to have something of an Eternal September of regex implementation knowledge (abetted by Russ Cox's amazingly selective bibliography in his posts introducing RE2).

I saw this in the repo of single file language implementations that was posted yesterday. Glad to see it on the front page now.

Here’s the repo if anyone is interested in checking out the others:


Skimming over it, the implementation looks inefficient:

    int matchstar(int c, char *regexp, char *text)
        do {    /* a * matches zero or more instances */
            if (matchhere(regexp, text))
                return 1;
        } while (*text != '\0' && (*text++ == c || c == '.'));
        return 0;
That can be used to match short strings, but not to grep through a filesystem. A good regex matcher has runtime O(N*M) where N is the length of the regex (typically very short) and M is the length of the scanned text.

A surprising number of regex tools used in the wild employ a backtracking algorithm like this one. Some popular features such as lookaheada, lookbehinds, and backreferences can't always be matched using a finite automaton.

An FSA matcher will be O(M) - one state transition per input character.

If you're looking at a specific regex (~FSA), then N is constant... If you have a function that runs an arbitrary regex, then N will vary.

Seeing code like this I can understand why the pioneers though C code could be pretty.

But now all I see there is something more akin to a demonstration of ice carving with chainsaws than something that should be used in a production system.

This would be a very good technical interview question — i.e. "implement this simple regex specification."

We software types must be masochists. It took Rob Pike an hour in the comfort of his private office to write this piece of code. How is this _at all_ like a typical software interview?

You're right. At the same time, could it be interesting to see how someone approaches (not solves) an intimidating problem?

Also known as hazing.

There's a common question, "convert an RE into an NFA" (https://en.wikipedia.org/wiki/Thompson%27s_construction)

Have you seen that used for interviewing? Yes, some people can do it in a 1-hour interview, but wouldn't you rather ask something that can't be done from memory?

I know engineers who have asked it, yes. It really seems to be a test for "who was paying attention to that one lecture in CS205 near the end of the semester"

Wow. That's quite a tricky test, too -- if you study Thompson's original paper you can see plenty of pitfalls. E.g. his main code would loop forever on a star of a star, and he solves this in an appendix with a preprocessor to rewrite regexes of that kind, only it looked like that rewriting would blow up the size of the regex exponentially in the length of the stack of stars. No hire!

it doesn't any worse than what Huffman did for his CS classes at UCSC.

I have seen this one go by, which, from the perspective of someone who has been doing regex implementation for 14 years, more or less, is fairly amusing. It seems unfair, as someone aware of potential pitfalls of backtracking would presumably feel obligated to build a automata-based solution; someone aware also of the potential pitfalls of DFAs (failure to determinize quickly on some nasty patterns) might also feel obligated to build a Thompson or Glushkov NFA solution as well.

It's quite possible you might turn someone competent into a gibbering wreck (they are asking me to implement RE2 or Hyperscan in an hour!) :-)

I used to use this exact problem years ago when interviewing candidates at Google. At that time I thought it was a great question, but now I'm not that sure anymore if this approach to interviewing is the best way to assess candidates.

Super interesting. Could you say more? Perhaps it depends a bit on the role?

Applications are open for YC Winter 2022

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact