Hacker News new | past | comments | ask | show | jobs | submit login
A Regular Expression Matcher (2007) (princeton.edu)
85 points by synergy20 on Aug 25, 2022 | hide | past | favorite | 25 comments



Related:

Rob Pike's simple C regex matcher in Go - https://news.ycombinator.com/item?id=32434412 - Aug 2022 (75 comments)

A Regular Expression Matcher (2007) - https://news.ycombinator.com/item?id=22317138 - Feb 2020 (22 comments)

A Regular Expression Matcher (2007) - https://news.ycombinator.com/item?id=12199836 - Aug 2016 (34 comments)

A Regular Expression Matcher (2007) - https://news.ycombinator.com/item?id=10654150 - Dec 2015 (1 comment)

A regular expression matcher By Rob Pike and Brian Kernighan (2007) - https://news.ycombinator.com/item?id=5672875 - May 2013 (45 comments)

A Regular Expression Matcher in 30 lines of C - https://news.ycombinator.com/item?id=2723366 - July 2011 (9 comments)


Rob Pike's simple C regex matcher in Go - https://news.ycombinator.com/item?id=32434412 - Aug 2022 (74 comments)


Oh jeez, and that was just a couple weeks ago. Thanks a ton! I've added it to the GP.


Why in matchstar does c have type int instead of char? Is this some holdover from K&R C?


Interesting. I see elsewhere that "int" is used where "char" should be fine, specifically, "int sepc" on page 97 of https://archive.org/details/practiceofprogra0000kern/page/96... , used only for a (char*)[0] and a test against ','.

While the C++ code on page 101 at https://archive.org/details/practiceofprogra0000kern/page/10... uses a "char c" parameter.

There's also some "unsigned char c" at pages 169 and 170, at https://archive.org/details/practiceofprogra0000kern/page/16... , in C code.

I'm going to guess it's from habit of using "int c" with getchar() to prevent confusion between EOF (-1) and 255. Discussed at https://archive.org/details/practiceofprogra0000kern/page/19... .


Only Rob could pull this feat. Not for the lesser mortals :-)


It is for lesser mortals, too. I think everyone with half a CS degree should be able to pull it off.

That said: it's an implementation which doesn't behave like normal regexps (it's a left-shortest match first instead of the common left-longest) and it's a back-tracker. Because this implementation doesn't allow much choice, expensive examples will look contrived, but extending this implementation with character sets, groups and choices (i.e. a|b) will make its exponential nature felt very quickly.


The book, two pages later, gives a left-longest implementation of matchstar. See https://archive.org/details/practiceofprogra0000kern/page/22... .

> extending this implementation

The exercises at the end of the chapter ask the student to extension the implementation along these lines. Including support for utf8.

> exponential nature

Both the essay and the book highlight this issue. The latter comments that some commercial greps (at the time) also had exponential behavior. https://archive.org/details/practiceofprogra0000kern/page/22...


It's a natural approach to anyone familiar with the pattern matchers explained in old Lisp books. This is not just hindsight because when I first read this chapter in Beautiful Code I stopped after the problem statement and wrote something similar before reading the rest.

(I didn't choose exactly the same grep sublanguage to implement -- it was something like supporting the most basic escaping instead of ^ and $.)


My "rules" to have simple code that works:

- Implement the concrete, simplest use case.

- Parameterize first step with function.

- Try to add another use case with another function

- Try to compose 1st and 2nd use case.

Anyone try to bring OOP + Inheritance breaks the simple rules. So composition is the key here.


35 lines, no? Since 'line' only means something to humans I think you have to count comments and whitespace.


"Lines" is short for "lines of code" I would say so comments and whitespace would not count. But holistically speaking you are right of course.


There's other whitespace in there that does get counted - for example, LOC is different depending on whether you prefer K&R braces. And since comments compensate for things that code doesn't directly express, it probably makes sense to include them too.

If we're trying to measure code complexity I think we have to ask "complexity to whom". If it's to humans, we should measure the code as it appears to humans. If it's to machines, then "line" is meaningless and we should look at tokens, the way PG suggested years ago.


Gosh dang, I have to remind myself that you're more than entitled to technical opinions as well!

A tight definition of SLOC isn't something we have. We do have sloccount, or we can install it, different tool from `wc -l` (we can sed the blank lines out first).

Really we should be calculating cyclomatic complexity like adults. I'm told this isn't straightforward! But what is the use of heaven if a man's reach cannot exceed his grasp.


Bob Nystrom describes his language Wren through number of semicolons, which I find interesting.

Doesn't work for every language, though =P


The title of this article is misleading since it only matches a small subset of what's possible with regular expressions. It's like saying "A novel programming language linker and compiler in 30 lines of code" while it can only do simple arithmetic operations and is lightyears away from being turing complete.


I guess we can address this objection by reverting to the article's actual title, which is what the site guidelines call for anyhow. (Submitted title was 'A Regular Expression in 30 Lines of C')


I didn't say the title of the post is misleading ;)


This beautiful code has horrible performance if your regex is "a*a*a*a*a*"...


He writes about that in the article? And mentions how grep has the same problem while egrep solves it...


Good prompt for a beautiful matcher with better complexity. Have a favorite?

https://www.oilshell.org/archive/Thompson-1968.pdf is longer and doesn't quite fix the problem (see third-to-last paragraph) but it is interesting and very tight.


I don't know about beautiful, but I just came up with this small NFA matcher for the same problem:

    #include <stdlib.h>
    #include <string.h>

    int match(char* regex, char* text) {
        int n = strlen(regex);
        char* active = calloc(n + 1, 1);
        char* next_active = calloc(n + 1, 1);
        active[0] = 1;
        int found = 0;
        int anchor = regex[0] == '^';
        regex += anchor;
        do {
            for (int i = 0; i <= n; i++) {
                if (!active[i])
                    continue;
                if (regex[i] == '\0') {
                    found = 1;
                    goto DONE;
                } else if (regex[i + 1] == '*') {
                    next_active[i + 2] = 1;
                    if (regex[i] == '.' || regex[i] == *text)
                        next_active[i] = 1;
                } else if (regex[i] == '$' && regex[i + 1] == '\0') {
                    found = *text == '\0';
                    goto DONE;
                } else {
                    next_active[i + 1] = regex[i] == *text;
                }
            }
            char* tmp = next_active;
            next_active = active;
            active = tmp;
            active[0] = !anchor;
            memset(next_active, 0, n + 1);
        } while (*text++ != '\0');
    DONE:;
        free(active);
        free(next_active);
        return found;
    }


Nice! I should've expected this could be that short, but I didn't.

Here's my favorite approach to "rediscovering" efficient regex matching: https://semantic-domain.blogspot.com/2013/11/antimirov-deriv... It's implicit in the Thompson paper I linked above, but imo not obvious from it.


This just illustrates the power of recursion. With it, the call stack becomes your invisible data structure - it keeps all saved program state deep down. The same approach is used to write recursive-descent parsers for languages with an inherently recursive definition.

You could also devise an iterative solution that wouldn't be more complex - then you simply have to maintain explicit stacks for operators and operands.


> Beautitful code

definitely a draft




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: