
A Regular Expression Matcher (2007) - _acme
http://www.cs.princeton.edu/courses/archive/spr09/cos333/beautiful.html
======
forgotpwtomain
I have a re.c file in my best-of collection of code examples locally, but I
can't seem to find the original source quickly. Nevertheless, it was written
by Russ Cox[0], who has spent a lot of time making regular expressions
accessible and elegant; his articles are a great resource.

[0] [https://swtch.com/~rsc/regexp/](https://swtch.com/~rsc/regexp/)

------
tiarkrompf
It's a nice piece of code. I like to use it to explain how to turn
interpreters into compilers through staging/specialization:
[https://scala-lms.github.io/tutorials/regex.html](https://scala-lms.github.io/tutorials/regex.html)

------
kchoudhu
I got the guided tour of this code from bwk back in... 2006?[1]

I wasn't sure I wanted to deal with computers on a day to day basis when I
took the course. The lecture blew my mind, and I started looking for
programming jobs the next day.

[1] COS333. Memories... man, I'm getting old.

------
adrianratnapala
I'm impressed.

The code lives up to its goal of being a good example of programming. As
Kernighan says, it shows the value of handling special cases early and of
recursion. (It's also a good example of how pointer-arithmetic and
nul-terminated strings should be used, but the less said about that, the better.)

Students are likely to not notice the tail recursion, so converting this to an
iterative version is probably a nice exercise.
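
As a rough sketch of that exercise (not the article's own code): the version below assumes the three-function layout of the linked article, `match`, `matchhere`, and `matchstar`, and turns only the tail call at the end of `matchhere` into a loop. The `const` qualifiers, `static` linkage, and the `assert` are additions made here. The call from `matchstar` back into `matchhere` is not a tail call, because the loop has to try again after a failed attempt, so that recursion stays.

```c
#include <assert.h>
#include <stddef.h>

static int matchhere(const char *regexp, const char *text);

/* matchstar: search for c*regexp at the beginning of text */
static int matchstar(int c, const char *regexp, const char *text)
{
    do {    /* a * matches zero or more instances */
        if (matchhere(regexp, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}

/* matchhere: search for regexp at the beginning of text.
 * The recursive step "matchhere(regexp+1, text+1)" is replaced by
 * advancing both pointers and going around the loop again. */
static int matchhere(const char *regexp, const char *text)
{
    for (;;) {
        if (regexp[0] == '\0')
            return 1;
        if (regexp[1] == '*')
            return matchstar(regexp[0], regexp + 2, text);
        if (regexp[0] == '$' && regexp[1] == '\0')
            return *text == '\0';
        if (*text == '\0' || (regexp[0] != '.' && regexp[0] != *text))
            return 0;
        regexp++;
        text++;
    }
}

/* match: search for regexp anywhere in text */
int match(const char *regexp, const char *text)
{
    assert(regexp != NULL && text != NULL);
    if (regexp[0] == '^')
        return matchhere(regexp + 1, text);
    do {    /* must look even if the string is empty */
        if (matchhere(regexp, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}
```

The pair (`regexp`, `text`) inside the loop is the explicit form of the state that the recursive version keeps in its arguments.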

But notice that explaining _regexps_ is a non-goal of this example. There is
an underlying state-machine here (encoded in a combination of the instruction
pointer and the `text` variable), but it is not very obvious. For that, Russ
Cox's articles are the go-to.

~~~
asveikau
> (It's also a good example of how pointer-arithmetic and nul-terminated
> strings should be used, but the less said about that, the better).

Why the less said the better?

I'm always disappointed when people say C is bad at string manipulation. The
"character at a time, no allocations needed, stop when you hit the end" style
demonstrated here and elsewhere has always resonated deeply with me.
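
For anyone who hasn't seen that style, here is a minimal sketch of it. The function name `squeeze` and its behaviour (deleting every occurrence of one character in place) are chosen here for illustration only and do not come from the article.

```c
#include <assert.h>
#include <stddef.h>

/* squeeze: remove every occurrence of c from s, in place.
 * One pass, one character at a time, no allocation, and the loop
 * stops at the nul terminator. */
void squeeze(char *s, int c)
{
    assert(s != NULL);
    char *out = s;              /* next position to write to */
    for (; *s != '\0'; s++) {
        if (*s != c) {
            *out++ = *s;
        }
    }
    *out = '\0';                /* re-terminate the shortened string */
}
```

The result can only be shorter than the input, so writing through `out` never runs past the original terminator.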

~~~
groovy2shoes
> _I'm always disappointed when people say C is bad at string manipulation.
> The "character at a time, no allocations needed, stop when you hit the end"
> style demonstrated here and elsewhere has always resonated deeply with me._

The main reason is that C doesn't really offer many primitives for string
manipulation beyond pointer manipulation. All you really get is the stuff in
`string.h`, which isn't much, and most of those functions are deeply flawed in
one way or another:

* strcpy() can easily lead to buffer overflows if the destination array is not large enough to hold the entire string

* strncpy() will not nul-terminate the destination string if the length of the source string is at least _n_

* ditto for strcat()/strncat()

* strtok() is not re-entrant, and therefore not thread-safe. The C11 standard is explicit that strtok() is not required to avoid data races with other calls to the function.

* You're not allowed to modify the result of strerror(), yet the function returns a regular ol' char* and, as such, the type system can't prevent anyone from doing so.

And perhaps the biggest problem of them all:

* strlen() potentially diverges (i.e., it can loop indefinitely or overflow the buffer, which causes undefined behavior[0]), making it _literally impossible_ to reliably validate a C string in standard C without knowing the length of that string _a priori_. It is also the case that the length of a string is not necessarily equal to the size of the buffer it is stored in, which means you now potentially have a _third_ thing you need to know _a priori_ (the buffer size).
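
To make the strncpy() item concrete: `strncpy(buf, "abcd", 4)` fills all four bytes and writes no terminator. A common workaround is a small wrapper along the lines below. This is only a sketch; the name `copy_truncating` and its return convention are invented here and are not part of the standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* copy_truncating (hypothetical helper): copy src into dst, which has
 * room for dstsize bytes. The result is always nul-terminated when
 * dstsize > 0. Returns 1 if all of src fit, 0 if it was truncated. */
int copy_truncating(char *dst, size_t dstsize, const char *src)
{
    assert(dst != NULL && src != NULL);
    if (dstsize == 0)
        return 0;               /* no room even for the terminator */
    size_t n = strlen(src);     /* src must already be a valid C string */
    int fits = n < dstsize;
    if (!fits)
        n = dstsize - 1;        /* leave one byte for the terminator */
    memcpy(dst, src, n);
    dst[n] = '\0';
    return fits;
}
```

Note that the helper still calls `strlen()` on the source, so it inherits the problem described in the last bullet: it is only safe when `src` is known to be terminated.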

The small size of C's string library means you get to roll a lot of your own
string functions, including the kinds of primitives you'd expect. Even QBasic
has a larger (and arguably better) string library. Alternatively, you can just
forgo the functions and write loops through strings everywhere, but then as
soon as you want to support some other character encoding (or multiple
character encodings), you're in for a boatload of fixes throughout your
codebase.

The nature of C strings being represented as arrays of characters rather than
an abstract string type means that it brings you the joy of manual memory
management _everywhere_. You need to constantly be allocating space for
strings, deallocating buffers, keeping track of ownership, copying strings
between buffers, etc. Of course, since the length of the string isn't included
in the string itself, if you want to use the safe string-handling functions,
you need to keep track of the lengths separately. This usually means returning
a `size_t` for the length of the string, while the output of the string
functions goes to a pointer ("sorta pass-by-reference") "out" parameter.

Strings that rely on sentinel values are themselves pretty flawed. Continuing
the note above, it's safer and more convenient to use strings that are somehow
bundled with their length. With sentinels only, reasonable things like
nul-padding the strings for alignment purposes can also be a pain in the ass,
since C considers the end of the string to be the _first_ nul character it
encounters. An abstract string type would allow you to store the length of the
string _and_ the size of the buffer in the same package as the pointer to the
buffer (and possibly even the buffer itself). This can have advantages other
than easier handling of padding as well, allowing you to grow or shrink the
string within the same buffer without (re)allocating memory and can
potentially save some copying. If you can sacrifice a byte, you can retain
interoperability with functions expecting C-style strings while enabling
better string handling by including the nul-terminator _in addition to_
bundling the length (Lua does this, for example). A common complaint is that
this places an upper bound on the length of a string, whereas a sole sentinel
value does not. This does not have to be the case if a varint[1] is used to
hold the length.
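
A minimal sketch of what such a bundled string could look like in C follows. The type `str_buf` and its functions are hypothetical names made up for this illustration, not any particular library's API, and the length is a plain `size_t` rather than a varint.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted string: length and capacity travel with the buffer. */
typedef struct {
    size_t len;   /* bytes in use, not counting the trailing nul */
    size_t cap;   /* bytes allocated, including room for the nul */
    char  *data;  /* kept nul-terminated so it can be passed to C APIs */
} str_buf;

/* Create an empty string. Returns 0 on success, -1 if allocation fails. */
int str_buf_init(str_buf *s, size_t cap)
{
    assert(s != NULL);
    if (cap == 0)
        cap = 1;                      /* always leave room for the nul */
    s->data = malloc(cap);
    if (s->data == NULL)
        return -1;
    s->data[0] = '\0';
    s->len = 0;
    s->cap = cap;
    return 0;
}

/* Append n bytes from src, reallocating only when the buffer is too small. */
int str_buf_append(str_buf *s, const char *src, size_t n)
{
    assert(s != NULL && src != NULL);
    if (n > SIZE_MAX - s->len - 1)
        return -1;                    /* total size would overflow */
    size_t need = s->len + n + 1;
    if (need > s->cap) {
        size_t new_cap = s->cap * 2 > need ? s->cap * 2 : need;
        char *p = realloc(s->data, new_cap);
        if (p == NULL)
            return -1;
        s->data = p;
        s->cap = new_cap;
    }
    memcpy(s->data + s->len, src, n); /* embedded nul bytes are fine here */
    s->len += n;
    s->data[s->len] = '\0';
    return 0;
}

/* Release the buffer and reset the fields. */
void str_buf_free(str_buf *s)
{
    assert(s != NULL);
    free(s->data);
    s->data = NULL;
    s->len = s->cap = 0;
}
```

Because `len` is stored, getting the length is a field read instead of a scan, and `data` can still be handed to any function that expects a C string.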

C strings are also woefully inadequate for dealing with character encodings
larger than 8 bits, but that's another can of worms.

In short, C's standard strings are fine, really, but they _could_ be _so much
better_ with a touch of abstraction. Fortunately, it's possible to make some
improvements from within C itself, but unfortunately that's often a mild pain
in the ass, and not all programmers bother to do it.

---

[0]: Of course, once your program enters undefined-behavior territory, it's
off the chain and can do anything it wants, regardless of what you intended it
to do, _even if your program is 100% correct in cases where there is no
undefined behavior_.

_If you're lucky_, a buffer overflow will cause the program to crash and die
_immediately_, due to segfault or something. If you're unlucky, it'll go
haywire and run amok. If you're _really_ unlucky, it'll continue to operate
with no outward signs of problems, silently doing something horrible and nasty
without any indication to the contrary.

[1]:
[https://en.wikipedia.org/wiki/Varint](https://en.wikipedia.org/wiki/Varint)

~~~
caf
strncat() is not like strncpy() - it _always_ nul-terminates the result.

I think your strlen() issue is a bit overblown. Either you get the string from
a function that explicitly produces valid strings, in which case you know that
it's a valid string already; or you get it from a raw bunch of bytes which
tends to come with a size attached, giving you all the information you need to
either check if it's a valid string or append a nul-terminator yourself.
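
A sketch of that second case, assuming the bytes arrive as a pointer plus a size: the standard `memchr()` can look for the terminator without reading past the end. The function name `bounded_length` and the -1 return convention are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* bounded_length (hypothetical helper): if buf[0..size) contains a nul,
 * return the length of the string that starts at buf; otherwise return -1.
 * Never reads beyond buf + size. */
ptrdiff_t bounded_length(const char *buf, size_t size)
{
    assert(buf != NULL);
    const char *end = memchr(buf, '\0', size);
    return end != NULL ? end - buf : -1;
}
```

A caller that gets -1 knows the buffer is not a valid string yet and can either reject it or terminate it explicitly before passing it on.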

~~~
groovy2shoes
> _strncat() is not like strncpy() - it always nul-terminates the result._

Yes, you're right. That's what I get for trying to speed-read the C standard
at 3am...

> _I think your strlen() issue is a bit overblown. Either you get the string
> from a function that explicitly produces valid strings, in which case you
> know that it's a valid string already; or you get it from a raw bunch of
> bytes which tends to come with a size attached, giving you all the
> information you need to either check if it's a valid string or append a
> nul-terminator yourself._

Usually, yes, but neither of those things is guaranteed, unfortunately, and
thus we wind up with the perennial buffer overflow exploit.

------
harry8
I found I didn't hate, loathe and detest unbraced blocks any less when it's
Rob Pike code:

    
    
        if(x) {
            do_y();} 
    

Two characters, no difference in vertical space. Unbraced blocks' only feature
is that they introduce bugs; they have no legitimate use. But man, do you feel
like a tough, macho-_man_: "I'm such a hard man I leave my blocks un-braced."

Is there a list of languages that repeated this C idiocy (an optimization of
syntax for the parser rather than the programmers)? I sure hope the
languages that would like to replace C, such as D, Go, and Rust, haven't
repeated it this far into the 21st century.

~~~
gfody
I keep one-line blocks unbraced when possible and my reasoning is definitely
not to feel like a tough macho man. It's more like a mild form of OCD where
any syntax that's not necessary just bothers me if I don't omit it.

~~~
jake-low
Reductio ad absurdum: I assume your variable names are single-letter, your
Python is indented only one space per block level and every C program is a
one-liner? You should check out the Code Golf stackexchange; you'd do well.

Kidding aside, I can appreciate your feeling of OCD (I fight that urge too)
but remember that most syntax is for humans, not machines. Using braces
communicates intent to the reader; omitting them offers an opportunity to
stumble.

~~~
eru
Nah, humans go by indentation anyway. Just use whatever brace style you like,
but make sure you have a linter / formatter that makes sure your indentation
agrees with your braces.

------
eru
Automata make for good regular expression matchers. Another interesting
approach is via derivatives:
[https://www.mpi-sws.org/~turon/re-deriv.pdf](https://www.mpi-sws.org/~turon/re-deriv.pdf)

------
andyzweb
[http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf](http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf)

------
NotQuantum
I appreciate that he acknowledges me at the end. Thanks Rob!

~~~
f2f
it's brian who acknowledged you. rob only wrote the code.

exegesis is a nice word to have in one's vocabulary. brian is a good exegesist
(or is that exegesisticist?)

~~~
gpvos
Exegete.

