
Rob Pike on regular expressions in lexing and parsing - jeffreymcmanus
http://commandcenter.blogspot.com/2011/08/regular-expressions-in-lexing-and.html
======
SiVal
Regular expression systems are designed for a very wide variety of text
processing tasks. Lexing and parsing systems are optimized for a much more
limited range of tasks--lexing and parsing. When you are writing software that
is intended as a production system, not a one-off quickie, you are almost
always better off using a more specialized system optimized for your specific
task, if one is available. The unoptimized, general purpose system would be
Plan B. I think that was Pike's point.

Unfortunately, it will probably be misinterpreted as a general criticism of
regular expression systems. There are so many cases where my knowledge of a
powerful, general-purpose system has saved me from having to learn a new
software library for each random task that I intend to keep using such things
as regular expressions, scripting languages, unix command line pipelines,
Excel, etc. Writing my own code or learning to use a new, specialized system
is Plan B for simple, non-production tasks.

Only when I can amortize the cost (time/effort) over a lot of use (as in a
production system), or when the advantage of the specialized system is
noticeable to the user (including me), does the specialized system become my
Plan A.

General purpose tools, special purpose tools, and custom coding all have
different costs and benefits, and there are circumstances under which each one
is the best choice.

~~~
simcop2387
One thing I've found is that a small subset of regular expressions works very
nicely for many lexing tasks. Things like ^[a-zA-Z][a-zA-Z0-9]+ (or whatever
shorthand your RE system has for those classes) can make a lexer nicer to
work with. I wouldn't try to do much with the other regex features in a
lexer, but the subset ends up being a compact yet easy-to-understand way of
describing the input the lexer is looking at when it pulls tokens out.
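
A minimal sketch of that approach (my own illustration, not from the comment),
using named groups in Python's stdlib re module:

    import re

    # One alternation of named groups, one name per token class. Only a
    # small regex subset is used: classes, alternation, and repetition.
    TOKEN_RE = re.compile(r'''
        (?P<IDENT>  [a-zA-Z][a-zA-Z0-9]*) |
        (?P<NUMBER> [0-9]+)               |
        (?P<OP>     [+\-*/=])             |
        (?P<SKIP>   [ \t]+)
    ''', re.VERBOSE)

    def tokens(text):
        pos = 0
        while pos < len(text):
            m = TOKEN_RE.match(text, pos)
            if m is None:
                raise SyntaxError('bad character at position %d' % pos)
            pos = m.end()
            if m.lastgroup != 'SKIP':
                yield m.lastgroup, m.group()

    # list(tokens('x1 = y + 42')) ==>
    # [('IDENT', 'x1'), ('OP', '='), ('IDENT', 'y'),
    #  ('OP', '+'), ('NUMBER', '42')]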

------
brianpane
I've found Ragel ( <http://www.complang.org/ragel/> ) to be a good compromise:
a Ragel grammar is less error-prone and easier to maintain than a handwritten
lexer, but Ragel still lets you use regular expressions for all the
little places near the leaves of a grammar where it's easy to represent token
rules as regular expressions. In contrast to most regex APIs, it does the
state machine compilation at build time rather than runtime, and the generated
code can be quite fast (although you have to make a speed-vs-code-size
tradeoff).

------
ecounysis
_Some people, when confronted with a problem, think "I know, I'll use regular
expressions." Now they have two problems._

~jwz

~~~
6ren
I thought you were just taking a cheap shot, but then I read the story, and
that's exactly his sentiment.

------
chwahoo
Interesting that he claims that it's difficult to change a regular expression
to cope with new requirements but easy to change a hand-coded loop.

In my experience, the opposite is true (up to the point when your grammar
becomes non-regular).

~~~
lukesandberg
Many regular expression implementations don't deal properly with Unicode
character classes and collation. So adding something like Unicode identifiers
to Go may be impossible or impractical with regexes, whereas with a loop you
can rely on more standard collation support.

~~~
fooyc
Some libs such as PCRE can match by character properties; for example, \pL
matches any Unicode letter.

Doing that in a loop seems much more difficult. And unless you work on UTF-32
data, you have to handle the variable-size encoding of characters.
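
For instance, a tiny sketch using the third-party Python regex module (not
the stdlib re), which supports \p{L}-style property classes:

    import regex  # third-party module; stdlib re lacks \p{L}

    # Match identifiers made of a letter from any script, followed by
    # letters or digits, classified by Unicode character property.
    ident = regex.compile(r'\p{L}[\p{L}\p{N}]*')

    print(ident.findall('mañana x1 日本語'))  # ['mañana', 'x1', '日本語']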

------
etanol
I'm a bit confused; aren't lexers implemented using similar techniques to
regular expressions (i.e., NFA-to-DFA conversion and DFA minimization)?

Or maybe I'm also confusing lexers with lexer generators.

~~~
chalst
The term "regex" usually means not "regular expressions" in the formal
languages sense but rather the trickier operational notion that makes use of
backtracking.

Look at
<http://cstheory.stackexchange.com/questions/448/regular-expressions-arent>

------
nraynaud
Amen, stop putting regexes everywhere; they are way too error-prone and
difficult to read!

Use parser combinators where you can. There you can put sub-expressions in
variables, have meaningful identifiers in your grammar, get explorable and
debuggable parsing, etc.
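
A minimal sketch of the idea in Python (hypothetical helpers, not a real
combinator library): each parser is a function taking text and returning
(value, rest) on success or None on failure.

    import re

    def token(pattern):
        rx = re.compile(pattern)
        def parse(text):
            m = rx.match(text)
            return (m.group(), text[m.end():]) if m else None
        return parse

    def seq(*parsers):
        # Run each parser in order, threading the remaining text through.
        def parse(text):
            values = []
            for p in parsers:
                r = p(text)
                if r is None:
                    return None
                value, text = r
                values.append(value)
            return values, text
        return parse

    def alt(*parsers):
        # Return the result of the first parser that succeeds.
        def parse(text):
            for p in parsers:
                r = p(text)
                if r is not None:
                    return r
            return None
        return parse

    # Sub-expressions live in ordinary variables with meaningful names:
    number   = token(r'[0-9]+')
    plus     = token(r'\+')
    addition = seq(number, plus, number)

    print(addition('2+40'))  # (['2', '+', '40'], '')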

------
cpfohl
Does anyone else have a little trouble with the following sentence: "A regular
expression library is a big thing. Using one to parse identifiers is like
using a Ferrari to go to the store for milk." I for one would rather read that
as "A regular expression library is freaking awesome!" But I think the author
meant to say it was overkill! ;)

------
fooyc
> And when we want to adjust our lexer to admit other character types, such
> as Unicode identifiers

Actually, regex libraries such as PCRE have good Unicode support and are
better than I am when it comes to things like matching character properties
(letters, uppercase letters, numbers, punctuation, etc.).

------
lukesandberg
I understand this point on lexers and parsers and I think it makes a lot of
sense. And regular expressions definitely have a sweet spot in short,
off-the-cuff applications (sed, vim, grep). But there's a large range of
potential use cases in between. I wouldn't want to write a lexer/parser to
validate random user-supplied data in a web application (phone numbers,
email addresses).

So I guess the point is that since the whole point of a parser/lexer is to
look at and validate text, you shouldn't 'outsource' that job; but maybe in
applications where minor parsing/text-validation tasks are peripheral,
regexes are more appropriate?

~~~
randomdata
To be fair, I wouldn't want to write a regex to validate an email address
either. <http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html>

------
supersillyus
Maybe I'm thinking of something different, but I've never considered regular
expressions hard to write or hard to write well. Considering that they may be
compiled to machine code (either AOT or JIT), they're not _necessarily_
inefficient, and a good implementation is likely to handle Unicode better
than your average programmer. So, while I agree that they aren't a panacea,
they certainly are a powerful and useful tool, and for simple pattern
matching and/or extraction, they are often easier to visually verify than the
hand-written equivalent, in my experience.

~~~
kragen
Have you read Friedl's book? I didn't think they were hard to write or hard to
write well until I did, and then I regretted my past life of sin.

~~~
apag
Then again, reading said book made me believe they're fairly easy to write
well. You need to keep in mind what a quantifier really does (“this will
gobble up the whole string and then yield bits until the pattern matches”),
but in the end I find it not fundamentally any more taxing than having in my
head a rough idea of the behaviour of a few nested loops or recursions.

~~~
kragen
I feel like it's more error-prone. This regexp lexer took me several minutes
to get right, and I'm still not totally sure it's bug-free:

        >>> import re
        >>> replace = lambda text, env: ''.join(env[item[1:]] if item.startswith('$') else item[1:] if item.startswith('\\') else item for item in re.findall(r'[^\\$]+|\\.|\$\w+|\\$', text))
        >>> print replace(r'This $line has \stuff \\in it that costs \$50 and some $variables.', {'line': 'LINE', 'variables': 'apples'})
        This LINE has stuff \in it that costs $50 and some apples.

If I were to write out an explicit loop over the characters of the string, I
would be a lot more sure that I wasn't accidentally dropping characters due to
an inadvertent failure to make the regexp exhaustive (I originally forgot the
\\\$ case! Although that reduces to the empty string anyway) and I wouldn't
have to forget and rediscover which lexical category each token belonged to.
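
For comparison, a sketch of the explicit-loop version of the same replace()
(my rewrite, not from the thread): every character is consumed exactly once,
so no input can be silently dropped by a non-exhaustive pattern.

    def replace(text, env):
        out, i = [], 0
        while i < len(text):
            c = text[i]
            if c == '\\':              # backslash escapes the next character
                if i + 1 < len(text):
                    out.append(text[i + 1])
                i += 2
            elif c == '$':             # $name is looked up in env
                j = i + 1
                while j < len(text) and (text[j].isalnum() or text[j] == '_'):
                    j += 1
                out.append(env[text[i + 1:j]])
                i = j
            else:                      # ordinary character, copied through
                out.append(c)
                i += 1
        return ''.join(out)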

And, although it's not present in this case or in all regexp engines, it's a
lot easier to accidentally write an exponential-time algorithm in a regexp
than in a nested loop. And my experience has been that it's harder to debug
it, too.
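
The classic trap is a nested quantifier, which a backtracking engine like
Python's stdlib re handles in exponential time on failure (a hedged sketch;
exact timings vary by machine and interpreter):

    import re
    import time

    pattern = re.compile(r'(a+)+$')    # nested quantifier: trouble
    for n in (18, 20, 22, 24):
        text = 'a' * n + 'b'           # the trailing 'b' forces failure
        start = time.time()
        pattern.match(text)
        print(n, round(time.time() - start, 3), 'seconds')
    # Each extra 'a' roughly doubles the running time.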

~~~
St-Clock
I agree and disagree:

"If I were to write out an explicit loop over the characters of the string, I
would be a lot more sure that I wasn't accidentally dropping characters due to
an inadvertent failure to make the regexp exhaustive"

This is why regex comments exist. For any non-trivial regex (more than two or
three characters), you should break it down and document it. Otherwise, it's
worse than a 1000-character-long Perl one-liner.
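
For what it's worth, a sketch of that breakdown using Python's re.VERBOSE
flag, which ignores insignificant whitespace and allows inline comments
(hypothetical pattern, just for illustration):

    import re

    price = re.compile(r'''
        \$                              # literal dollar sign
        (?P<dollars> \d+ )              # whole-dollar amount
        (?: \. (?P<cents> \d{2} ) )?    # optional cents
    ''', re.VERBOSE)

    m = price.search('lunch cost $12.50 today')
    print(m.group('dollars'), m.group('cents'))  # 12 50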

"it's a lot easier to accidentally write an exponential-time algorithm in a
regexp than in a nested loop"

So true. I did not realize it was possible until I made that mistake myself.
Debugging these cases is extremely difficult: for two strings that look
similar, the same regex can go crazy on one of them. But this has happened to
me only once in the three years since I started to rely heavily on regular
expressions for a project.

~~~
kragen
> This is why regex comments exist.

Regex comments don't help much with inadvertently writing a non-exhaustive
regex (i.e. one for which some possible input could fail to match), or a few
other kinds of regexp bugs. Or, how would you write the regexp in the above
code with comments so that it would be obvious if you left out the \\\$ case?

------
swah
Bonus: Pike explains a regexp matcher:
<http://www.cs.princeton.edu/courses/archive/spr09/cos333/beautiful.html>

------
carsongross
You haven't lived until you've written a hand-rolled lexer and recursive
descent parser.

~~~
onan_barbarian
... or at least, you haven't done second-year CS. Pick one.

~~~
carsongross
Oh, that I could have been so lucky! I was stuck using lex and yacc. I had to
work on an actual production parser before I saw the light.

------
fooyc
> They also result in much faster, safer, and compact implementations.

If you are using a scripting language, regular expressions are often faster
than a hand-written parser, because the regular expression library itself
works at a lower level: the regex engine is optimized C, while your parser
loop runs in the interpreter.
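
A rough micro-benchmark of that claim (my own sketch; numbers vary by
interpreter and workload), extracting identifiers with the C-backed re
engine versus a character loop in interpreted Python:

    import re
    import timeit

    text = 'alpha beta42 _gamma ' * 10000

    def loop_idents(s):
        # Hand-written scan: accumulate runs of word characters.
        out, word = [], []
        for c in s:
            if c.isalnum() or c == '_':
                word.append(c)
            elif word:
                out.append(''.join(word))
                word = []
        if word:
            out.append(''.join(word))
        return out

    rx = re.compile(r'\w+')
    print(timeit.timeit(lambda: rx.findall(text), number=10))   # regex
    print(timeit.timeit(lambda: loop_idents(text), number=10))  # loop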

------
alexyoung
I don't understand: "Standard lexing and parsing techniques are so easy to
write, so general, and so adaptable there's no reason to use regular
expressions".

Isn't using the lex family of generators pretty 'standard' in terms of writing
lexers? Aren't they generally considered fast and in most cases better than
hand-written code?

Aren't bad regular expressions just a consequence of someone not properly
learning regular expressions? Couldn't the same argument be applied to every
programming language?

Also: <http://swtch.com/~rsc/regexp/regexp1.html>

------
EGreg
Ten years ago I wrote a lexer myself for my own language, Q, because I didn't
know about regular expressions.

It was actually quite powerful. But I literally just implemented a state
machine. To be honest, I think there were a couple things it did that a
regular expression wouldn't be able to do.

However, for the most part, the "efficiency" argument is off base. You can
COMPILE regular expressions into tight code.
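
For illustration, here's what "literally a state machine" can look like: a
hedged sketch that hand-compiles the regular expression [A-Za-z][A-Za-z0-9]*
into an explicit DFA (my example, not Q's actual lexer):

    START, IN_IDENT, DONE = 0, 1, 2

    def next_state(state, c):
        # Transition function of the DFA for [A-Za-z][A-Za-z0-9]*.
        if state == START:
            return IN_IDENT if c.isalpha() else DONE
        if state == IN_IDENT:
            return IN_IDENT if c.isalnum() else DONE
        return DONE

    def lex_ident(text, pos=0):
        # Run the DFA until it stops, then report the token it accepted.
        state, start = START, pos
        while pos < len(text):
            new = next_state(state, text[pos])
            if new == DONE:
                break
            state, pos = new, pos + 1
        return text[start:pos] if state == IN_IDENT else None

    print(lex_ident('foo42 bar'))  # 'foo42'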

~~~
nandemo
You don't have to implement a state machine yourself. You can use a lexer
generator such as flex.

~~~
psykotic
The lex program and its descendants are generally regarded poorly by Ken and
Rob and others from their tribe of Unix systems programmers. The blog post
suggests why. It is often faster and easier to write the lexer in C by hand.
But they do tend to put great value on yacc. Once you wrap your head around
LALR shift-reduce parsing, yacc is expressive in ways that are hard to
replicate with a hand-written parser.

------
berntb
Pike's argument is relevant for compiled system level languages (C, C++, Java,
etc).

The regular expression libraries used in scripting languages today (Perl's and
PCRE) are optimized C code. A lexer in interpreted code is hardly going to
beat them in speed.

Note that a large part of the reason to use scripting languages is the
development speed. Regexps FTW, etc.

Edit: I should add -- if I have a problem where I need to parse something
complex like a "real" programming language, I go to the LALR libs of course.
The right tool for the job. (On second thought -- this is probably what Pike
talks about; he doesn't go around solving simple problems, like I do. But no
complaints, I've got a job interview tomorrow... :-) )

~~~
kragen
He has several arguments; performance is only one of them.

~~~
berntb
>>He has several arguments; performance is only one of them.

First -- I noted that I limited my comment to scripting languages.

I did discuss most of the relevant arguments in the article, so I really don't
see what your point is?

It doesn't matter for the scripting languages that a regexp lib is large (it
is linked in and used anyway), so I ignored that. I also ignored the Unicode
point, since it is generally supported in scripting languages' regexp engines.

I did touch on speed and development speed. Shorter code is also generally
easier to understand; a simple regexp can often replace 10-20 lines.

In my edit, 10++ hours before your comment, I discussed when full grammars
were a better alternative -- so I handled Pike's accusation of treating
regexps as a "panacea for all text processing". (I have used regexps as
lexers for grammars, but only for parsers of less than a few hundred lines.)

(I am not going to touch Pike's argument about whether I and others _really_
grok regular expressions, because I thought I did understand them until I
read "Mastering Regular Expressions". Maybe there are some more satori
insights waiting? I do consider myself a little bit familiar with them from
automata theory, implementing state engines, and usage [Edit: and my
considered opinion is that re:s are often a good solution for scripting
languages.])

