Hacker News new | past | comments | ask | show | jobs | submit login
A Regular Expression Matcher (2007) (princeton.edu)
132 points by _acme on Aug 1, 2016 | hide | past | web | favorite | 34 comments



I have a re.c file in my best-of collection of code examples locally - but I can't seem to find the original source quickly - nevertheless it was written by Russ Cox[0] and he's spent a lot of time making regular expressions accessible and elegant, his articles are a great resource.

[0] https://swtch.com/~rsc/regexp/


It's a nice piece of code. I like to use it to explain how to turn interpreters into compilers through staging/specialization: https://scala-lms.github.io/tutorials/regex.html


I got the guided tour of this code from bwk back in... 2006?[1]

I wasn't sure I wanted to deal with computers on a day to day basis when I took the course. The lecture blew my mind, and I started looking for programming jobs the next day.

[1] COS333. Memories... man, I'm getting old.


I'm impressed.

The code lives up to its goal of being a good example of programming. As Kernighan says, it shows the value of handling special cases early and of recursion. (It's also a good example of how pointer-arithmetic and nul-terminated strings should be used, but the less said about that, the better).

Students are likely to not notice the tail recursion, so converting this to an iterative version is probably a nice exercise.

But notice that explaining regexps is a non-goal of this example. There is an underlying state-machine here (encoded in a combination of the instruction pointer and the `text` variable), but it is not very obvious. For that, Russ Cox' articles are the go-to.


> (It's also a good example of how pointer-arithmetic and nul-terminated strings should be used, but the less said about that, the better).

Why the less said the better?

I'm always disappointed when people say C is bad at string manipulation. The "character at a time, no allocations needed, stop when you hit the end" style demonstrated here and elsewhere have always resonated deeply with me.


Roghly speaking, I don't want to start a flame war by defending pointer-arith. The haters are right that languages could use other, safer features (like array slicing).

But if it is how your language does things, then use it, and use it well -- as in this example. Also I find counted strings preferable for most tasks, but nul-terminated have the advantage of working so nicely with the ptr++ operation.

There was real insight behind C's minimalistic string handling. And C's minimalism is has always been its best feature. But today's minimal languages should support strings explicitly.


> Also I find counted strings preferable for most tasks, but nul-terminated have the advantage of working so nicely with the ptr++ operation.

Is this:

    while (foo != '\0') {
        *bar++ = *foo++;
    }
really that much nicer than this:

    while (i != len) {
        bar[i++] = foo[i++];
    }
?

Or even this:

    for (; i != len; i++) {
        bar[i] = foo[i];
    }
?

Don't do this, though:

    while (i != len) {
        bar[i] = foo[i++];
    }
(that's the road to undefined behavior... better watch out for them sequence points!)

> There was real insight behind C's minimalistic string handling.

There really wasn't, though. The simplicity of C's string handling is deceptive. It's full of nuances and inadequacies when used the way the Good Book* describes, and that's the way it's usually used. You're really better off rolling your own string abstraction, or using some library such as bstrings, or even compiling as C++ and using std::string instead (if your C happens to otherwise compile as C++, that is).

* K&R


"Is this: while (foo != '\0') { bar++ = foo++; } really that much nicer than this: while (i != len) { bar[i++] = foo[i++]; }"

Not visually, but the latter increments i twice in every loop :-). It also typically uses an extra register and requires an explicit 'compare with zero'. Both mattered more 40+ years ago.


> the latter increments i twice in every loop :-).

D'oh! Good catch. That's what I get for coding without testing at 3am... Just realized `foo != '\0'` should be `* foo != '\0'`, too. Derp.

> It also typically uses an extra register and requires an explicit 'compare with zero'.

Okay, I see the extra register, but I'm lost on any extra comparisons versus the pointer-arithmetic version. The pointer version still has to test for the nul character anyway, right?

> Both mattered more 40+ years ago.

Indeed, but I'm talking about today. We can now afford the safety that comes with using explicit lengths rather than sentinels. (I'm inclined to think Dennis Ritchie could've afforded it, too, considering that Pascal (among others) was using explicit lengths for strings (and indeed all arrays) around the same time C was invented, and did so on similar hardware to boot. Bounds checking is pretty damn cheap, all things considered, even on a PDP-11 (the PDP-10, while not exactly a predecessor to the PDP-11, ran TeX written in WEB / Pascal, and was favored by some Lisp hackers as practically a Lisp machine (in fact, Lisp was the first thing DEC brought up on the PDP-10, or so I'm told...), so I think it could afford bounds checking, at least).


"The pointer version still has to test for the nul character anyway, right?"

Some architectures set the zero flag on memory loads. I think the PDP is one of them. Google gives me https://groups.google.com/forum/m/#!topic/net.lang.c/LKAYbzX..., which claims that

    while (*ra++ = *rb++);
    
compiles to

    L16:
        movb        (r10)+,(r11)+
        jeql        L17
        jbr         L16
    L17:
If you do i != len, you need an extra comparison instruction.


Ah! Interesting. Thank you :)


> I'm always disappointed when people say C is bad at string manipulation. The "character at a time, no allocations needed, stop when you hit the end" style demonstrated here and elsewhere have always resonated deeply with me.

The main reason is that C doesn't really offer many primitives for string manipulation beyond pointer manipulation. All you really get is the stuff in `string.h`, which isn't much, and most of those functions are deeply flawed in one way or another:

* strcpy() can easily lead to buffer overflows if the destination array is not large enough to hold the entire string

* strncpy() will not nul-terminate the destination string if the size of the source string exceeds n

* ditto for strcat()/strncat()

* strtok() is not re-entrant, and therefore not thread-safe. The C11 standard is explicit that strtok() is not required to avoid data races with other calls to the function.

* You're not allowed to modify the result of strerror(), yet the function returns a regular ol' char* and, as such, the type system can't prevent anyone from doing so.

And perhaps the biggest problem of them all:

* strlen() potentially diverges (i.e., it can loop indefinitely or overflow the buffer, which causes undefined behavior[0]), making it literally impossible to reliably validate a C string in standard C without knowing the length of that string a priori. It is also the case that the length of a string is not necessarily equal to the size of the buffer it is stored in, which means you now potentially have a third thing you need to know a priori (the buffer size).

The small size of C's string library means you get to roll a lot of your own string functions, including the kinds of primitives you'd expect. Even QBasic has a larger (and arguably better) string library. Alternatively, you can just forgo the functions and write loops through strings everywhere, but then as soon as you want to support some other character encoding (or multiple character encodings), you're in for a boatload of fixes throughout your codebase.

The nature of C strings being represented as arrays of characters rather than an abstract string type means that it brings you the joy of manual memory management everywhere. You need to constantly be allocating space for strings, deallocating buffers, keeping track of ownership, copying strings between buffers, etc. Of course, since the length of the string isn't included in the string itself, if you want to use the safe string-handling functions, you need to keep track of the lengths separately. This usually means returning a `size_t` for the length of the string, while the output of the string functions goes to a pointer ("sorta pass-by-reference") "out" parameter.

Strings that rely on sentinel values are themselves pretty flawed. Continuing the note above, it's safer and more convenient to use strings that are somehow bundled with their length. With sentinels only, reasonable things like nul-padding the strings for alignment purposes can also be a pain in the ass, since C considers the end of the string to be the first nul character it encounters. An abstract string type would allow you to store the length of the string and the size of the buffer in the same package as the pointer to the buffer (and possibly even the buffer itself). This can have advantages other than easier handling of padding as well, allowing you to grow or shrink the string within the same buffer without (re)allocating memory and can potentially save some copying. If you can sacrifice a byte, you can retain interoperability with functions expecting C-style strings while enabling better string handling by including the nul-terminator in addition to bundled the length (Lua does this, for example). A common complaint is that this places an upper bound on the length of a string, whereas a sole sentinel value does not. This does not have to be the case if a varint[1] is used to hold the length.

C strings are also woefully inadequate for dealing with character encodings larger than 8 bits, but that's another can of worms.

In short, C's standard strings are fine, really, but they could be so much better with a touch of abstraction. Fortunately, it's possible to make some improvements from within C itself, but unfortunately that's often a mild pain in the ass, and not all programmers bother to do it.

---

[0]: Of course, once your program enters undefined-behavior territory, it's off the chain and can do anything it wants, regardless of what you intended it to do, even if your program is 100% correct in cases where there is no undefined behavior.

If you're lucky, a buffer overflow will cause the program to crash and die immediately, due to segfault or something. If you're unlucky, it'll go haywire and run amok. If you're really unlucky, it'll continue to operate with no outward signs of problems, silently doing something horrible and nasty without any indication to the contrary.

[1]: https://en.wikipedia.org/wiki/Varint


strncat() is not like strncpy() - it always nul-terminates the result.

I think your strlen() issue is a bit overblown. Either you get the string from a function that explicitly produces valid strings, in which case you know that it's a valid string already; or you get it from a raw bunch of bytes which tends to come with a size attached, giving you all the information you need to either check if it's a valid string or append a nul-terminator yourself.


> strncat() is not like strncpy() - it always nul-terminates the result.

Yes, you're right. That's what I get for trying to speed-read the C standard at 3am...

> I think your strlen() issue is a bit overblown. Either you get the string from a function that explicitly produces valid strings, in which case you know that it's a valid string already; or you get it from a raw bunch of bytes which tends to come with a size attached, giving you all the information you need to either check if it's a valid string or append a nul-terminator yourself.

Usually, yes, but neither of those things are guaranteed, unfortunately, and thus we wind up with the perennial buffer overflow exploit.


The issue of strtok() being horrible hack is easily overcame by using str(c)spn() (and optionally bunch of pointer arithmetic).


Of course. All of the issues with C's string library can be overcome rather easily, if only because it's so very basic.


The problems you describe are all well understood to C programmers and hence not really a huge issue. You are severely overstating their difficulty.


I don't mean to imply that they're difficult to overcome. I'm just trying to explain why people often say that C has poor string handling. Many modern languages provide more comprehensive string libraries, and the ones that are provided tend not to suffer from such problems. It's not that C is incapable of doing string handling (it's capable, yes), but when you constantly find yourself tossing out C's string library altogether and rolling your own instead, it becomes obvious that C's provided string handling functionality is inadequate for many use cases.


I found I didn't hate, loathe and detest unbraced blocks any less when it's Rob Pike code

    if(x) {
        do_y();} 
2 characters, no vertical space difference. Unbraced blocks only feature is they introduce bugs - they have no legitimate use. But man do you feel like a tough, macho-man "I'm such a hard man I leave my blocks un-braced."

Is there a list of languages that repeated this C idiocy (an optimization of syntax for the parser rather than the programmers?) I sure hope those languages that would like to replace C, ie D, go, Rust etc. Haven't repeated it this far into the 21st century.


I keep one-line blocks unbraced when possible and my reasoning is definitely not to feel like a tough macho man. It's more like a mild form of OCD where any syntax that's not necessary just bothers me if I don't omit it.


Reductio ad absurdum: I assume your variable names are single-letter, your Python is indented only one space per block level and every C program is a one-liner? You should check out the Code Golf stackexchange; you'd do well.

Kidding aside, I can appreciate your feeling of OCD (I fight that urge too) but remember that most syntax is for humans, not machines. Using braces communicates intent to the reader; omitting them offers an opportunity to stumble.


Nah, humans go by indentation anyway. Just use whatever brace style you like, but make sure you have a linter / formatter that makes sure your indentation agrees with your braces.


I'm with you on that.

The one line version isn't prone to goto fail, and to me reads very nicely.

    if (x == 1) return;
The two line version, however, I avoid like the plague.

    if (x == 1)
        return;


In Rust the {} are mandatory, but you don't need the (), so you can just write this:

    if a == b {
        stuff();
    }


Thank $whatever for that. Go rust. Sanity prevailing.


I use unbraced blocks as an unwise habit brought over from Pascal, which I learned before C.

I think of these things not as "blocks" but as "statements" -- those can be sandwiched between begin/end or else be one-liners. Pretending that one-liners aren't statements doesn't make any sense at all.

The rules of C are the same -- except it doesn't work as well in practice. Because (1) explicit `{...}` in a one-liner is far less painful than explicit `begin...end`. (2) errors caused by a missing `{`, `}` are harder to spot than errors caused by missing `begin`, `end`.

That said, the best syntax I know of for this is indentation based -- then the whole issue just goes away.


I hate to be that person, but your extra two characters version is only slightly less prone to errors than the one that you are replacing. At the very least, if you are going to insist that these be required, the closing has to be on its own line, or you lose much of the benefit.

And, while I do agree it is easy to vilify this style of statement, I question hating on it. I have found reading code like this is somewhat easy to read if you have internalized reading it to be "if this, then do a single thing." Where the "a single thing" is different from the normal reading of "if this, then that."

Still, I support style guides that prohibit this.


Much to easy to read over, much less likely to get syntax errors on a merge tool mistake, internalise all you like to make fewer mistakes and still don't do it and there will be fewer mistakes. Even if you're so amazing you could never make that mistake yourself a future maintainer might, a tool might and your benefit for taking that risk is zero.

I hate it because it's just so silly. Do we use a feature that has no benefit and carries risk? Do we really need a statistical analysis for this one?


Syntactic noise matters. You don't notice how much it costs until you're used to a lots-of-small-functions style in a language where that's possible.

I prefer consistency in the opposite direction, like I have in Scala: all constructs can be written as expressions. Function body? Expression. Try/catch body? Expression. Loop body? Expression. If you want to put a bunch of statements in rather than an expression, then you put them in braces. (But most of the time there's a better way to express what you wanted).


Automata make for good regular expression matchers. Another interesting approach is via derivatives: https://www.mpi-sws.org/~turon/re-deriv.pdf



I appreciate that he acknowledges me at the end. Thanks Rob!


it's brian who acknowledged you. rob only wrote the code.

exegesis is a nice word to have in one's vocabulary. brian is a good exegesist (or is that exegesisticist?)


Exegete.




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: