
Writing Compilers Quickly - r11t
http://www.links.org/?p=877
======
tptacek
Could not disagree more about lexer generators. Every device we support comes
with 1-2 parsers, and we started out thinking that Ruby's built-in regex
support made flex unnecessary. Bzzzt. Lexers are fiddly whether you write
them in Flex or Ruby, but the Ruby version is harder to write.

We've settled on Ragel for lexers; Craig wrote "ralex", which I don't use or
understand but which he insists we're about to open source. So I recommend
Ragel. Or Flex. Or anything but just hand-hacking your own tokenizer.

~~~
barrkel
I don't agree, for a simple reason - most programming language lexical
analysis is extremely simple and a lexer can be hand-written in less than an
hour. Here's the outline of a NextToken routine (or whatever you want to call
it) that does the lexical analysis:

* Skip whitespace and comments, keeping track of newlines as needed

* Check for end of file

* Determine token type from first character (typically a switch statement or equivalent)

* Scan in remainder of token's characters (depending on token type - and possibly more discrimination for common prefixes, such as 10 vs 10.25, ints vs floats)

* Do any conversions, such as string -> number, as necessary

* Look up identifiers for keywords

A simple hand-written scanner for a language like C (excluding the
preprocessor etc.) written in C itself shouldn't exceed a couple of hundred
lines or so. If you're willing to sacrifice some readability and performance,
I guess you could squish it down close to 60.
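
Putting the outline together, here's a minimal sketch in C of what such a
NextToken routine might look like, for a toy language with identifiers,
keywords, ints, and floats. All the names and the keyword list are
illustrative, and bounds/error checking is omitted for brevity:

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef enum { TOK_EOF, TOK_IDENT, TOK_KEYWORD,
                   TOK_INT, TOK_FLOAT, TOK_PUNCT } TokKind;

    typedef struct {
        TokKind kind;
        char text[64];   /* lexeme (no overflow checks, for brevity) */
        double num;      /* value for TOK_INT / TOK_FLOAT */
        int line;
    } Token;

    static const char *src;   /* cursor into a NUL-terminated buffer */
    static int line = 1;

    static int is_keyword(const char *s) {
        static const char *kw[] = { "if", "else", "while", "return", 0 };
        for (int i = 0; kw[i]; i++)
            if (strcmp(s, kw[i]) == 0) return 1;
        return 0;
    }

    static Token next_token(void) {
        Token t = {0};

        /* 1. Skip whitespace and // comments, tracking newlines. */
        for (;;) {
            while (*src == ' ' || *src == '\t' || *src == '\n') {
                if (*src == '\n') line++;
                src++;
            }
            if (src[0] == '/' && src[1] == '/')
                while (*src && *src != '\n') src++;
            else
                break;
        }
        t.line = line;

        /* 2. Check for end of input. */
        if (*src == '\0') { t.kind = TOK_EOF; return t; }

        size_t n = 0;

        /* 3. Dispatch on the first character. */
        if (isalpha((unsigned char)*src) || *src == '_') {
            /* 4. Scan the rest of the identifier. */
            while (isalnum((unsigned char)*src) || *src == '_')
                t.text[n++] = *src++;
            t.text[n] = '\0';
            /* 6. Keyword lookup. */
            t.kind = is_keyword(t.text) ? TOK_KEYWORD : TOK_IDENT;
        } else if (isdigit((unsigned char)*src)) {
            /* 4. Discriminate 10 vs 10.25: float only on ". digit". */
            while (isdigit((unsigned char)*src)) t.text[n++] = *src++;
            if (src[0] == '.' && isdigit((unsigned char)src[1])) {
                t.text[n++] = *src++;
                while (isdigit((unsigned char)*src)) t.text[n++] = *src++;
                t.kind = TOK_FLOAT;
            } else {
                t.kind = TOK_INT;
            }
            t.text[n] = '\0';
            /* 5. String -> number conversion. */
            t.num = strtod(t.text, 0);
        } else {
            /* Single-character punctuation; a real lexer also handles
               multi-character operators here. */
            t.text[0] = *src++; t.text[1] = '\0';
            t.kind = TOK_PUNCT;
        }
        return t;
    }

    int main(void) {
        src = "if x1 > 10.25 // compare\n  y = 7";
        for (Token t = next_token(); t.kind != TOK_EOF; t = next_token())
            printf("line %d kind %d \"%s\"\n", t.line, (int)t.kind, t.text);
        return 0;
    }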

With languages that need more complicated state transition mechanics than C
or similar, using a generator may make sense. But fully general regexes are
ill-suited for creating lexical analysers. With lexical analysis, you're
looking to classify a pattern of characters at the head of a stream; language
and library regex support will generally only tell you whether a particular
string matches or starts with a regex, or let you search a string for the
regex. A lexer effectively wants to annotate multiple acceptance states in
the FSA and return a different value depending on which acceptance state was
reached. Regexes from languages and libraries generally only pop out a
boolean, having merged the acceptance states together in the DFA (never mind
non-regular regular expressions, such as those with backrefs etc.). So using
regexes to write a lexer will be inefficient, and won't give you hints when
the language defined by one regex is actually a subset of the language
defined by another.
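
To make the mismatch concrete, here's a sketch using POSIX regcomp/regexec:
every token class needs its own anchored pattern, all of them have to be run
against the head of the input, and longest-match plus rule priority must be
reimplemented by hand - the work a generated lexer's single DFA with
annotated acceptance states does in one pass. The patterns and token names
are made up for the example:

    #include <regex.h>
    #include <stdio.h>

    static const struct { const char *name, *pattern; } rules[] = {
        /* Order encodes priority: "if" also matches the IDENT rule,
           i.e. one regex's language is a subset of another's. */
        { "KEYWORD_IF", "^if" },
        { "IDENT",      "^[a-zA-Z_][a-zA-Z0-9_]*" },
        { "FLOAT",      "^[0-9]+\\.[0-9]+" },
        { "INT",        "^[0-9]+" },
    };

    int main(void) {
        const char *input = "ifx";   /* an identifier, not the keyword */
        int best = -1;
        regoff_t best_len = 0;

        for (int i = 0; i < 4; i++) {
            regex_t re;
            regmatch_t m;
            regcomp(&re, rules[i].pattern, REG_EXTENDED); /* checks omitted */
            if (regexec(&re, input, 1, &m, 0) == 0) {
                regoff_t len = m.rm_eo - m.rm_so;
                /* Longest match wins; the earlier rule wins ties. */
                if (len > best_len) { best = i; best_len = len; }
            }
            regfree(&re);
        }
        /* "ifx" matches KEYWORD_IF (length 2) and IDENT (length 3);
           only the hand-rolled tie-breaking reports IDENT. */
        if (best >= 0)
            printf("%s \"%.*s\"\n", rules[best].name, (int)best_len, input);
        return 0;
    }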

~~~
camccann
As an addendum: In some languages (which will remain unnamed) lexical
is actually a pain due to funny whitespace rules or syntax-dependent
tokenizing. For instance, something like:

* whitespace around operators is optional, except that sometimes a symbol adjacent to a variable means something completely different

* "foo" is a keyword in type signatures, but a normal identifier otherwise

* indentation is meaningful, and must generate INDENT or DEDENT tokens depending on the indentation of some arbitrary number of prior lines

* indentation is meaningful based on whether the first non-whitespace token on a line is aligned in the _same column_ as a _specific token_ in the prior line, based on parsed _syntax_, oh god what were these people smoking

...and so on. It's easy to get into situations where merely lexing requires
keeping some sort of context, requires interleaving with parsing such that
parse errors may cause the _lexer_ to backtrack and reinterpret things, or
possibly even worse.
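
For the indentation bullets above, the usual workaround is exactly that kind
of context: an explicit stack of indent widths carried across lines, roughly
what Python's tokenizer does. A minimal sketch, with illustrative names and
printf standing in for token emission:

    #include <stdio.h>

    static int indents[64] = {0};   /* stack of open indent widths */
    static int top = 0;             /* indents[0] = 0 = column zero */

    /* Call at the start of each logical line with its indent width. */
    static void handle_indent(int width) {
        if (width > indents[top]) {
            indents[++top] = width;   /* deeper: one INDENT */
            puts("INDENT");
        } else {
            while (top > 0 && width < indents[top]) {
                top--;                /* shallower: one DEDENT per level */
                puts("DEDENT");
            }
            if (width != indents[top])
                puts("ERROR: inconsistent indentation");
        }
    }

    int main(void) {
        /* Indent widths of four successive lines; the final 0 pops
           two levels, so it emits two DEDENT tokens. */
        int widths[] = {0, 4, 8, 0};
        for (int i = 0; i < 4; i++) handle_indent(widths[i]);
        return 0;
    }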

In fact, I wouldn't be surprised if it were possible to construct an
especially pathological bit of perl that couldn't be lexically analyzed
without _running the program_.

~~~
statictype
>* "foo" is a keyword in type signatures, but a normal identifier otherwise

Isn't this decision something that should be punted out of the tokenizer and
into the parser?

~~~
camccann
You can punt as much into the parsing stage as you like, of course. It's not
like there's even any requirement that the two stages be separated.

All of the stuff I mentioned gives you the choice of either more work during
parsing (which is usually hard enough already), or building a crazy Rube
Goldberg-esque tokenizer--neither of which is appealing.

------
hga
Read the linked <http://en.wikipedia.org/wiki/Monkey_patch> if you want to
find out what monkey patching and duck punching are ^_^.

------
pmjordan
I'm wondering if he's tried to do real work with ANTLR, and ANTLRWorks in
particular. I found them extremely temperamental and fiddly to work with, but
this was more than a year ago.

~~~
gchpaco
They haven't gotten any less temperamental or brittle. The default tokens
have a rude tendency to store the entire input stream as it is being parsed,
which was a nasty surprise for us (we had to rip it out and do LL(1) by hand,
with minimal memory demands). I like a lot of features of ANTLR, but it is not
the sort of tool I like to use and abuse.
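
For reference, "LL(1) by hand" here just means recursive descent with a
single token of lookahead and nothing buffered, so memory use stays constant
however large the input is. A minimal sketch for a toy expression grammar
(the grammar and all the names are invented for the example, not taken from
the project described above):

    /* expr -> term (('+'|'-') term)* ;  term -> DIGIT | '(' expr ')' */
    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    static const char *src;
    static int lookahead;           /* the single LL(1) lookahead token */

    static void advance(void) {     /* crude lexer: one char = one token */
        while (*src == ' ') src++;
        lookahead = *src ? *src++ : EOF;
    }

    static void expect(int tok) {
        if (lookahead != tok) { fprintf(stderr, "parse error\n"); exit(1); }
        advance();
    }

    static long expr(void);

    static long term(void) {
        if (isdigit(lookahead)) {   /* single-digit numbers, for brevity */
            long v = lookahead - '0';
            advance();
            return v;
        }
        expect('(');
        long v = expr();
        expect(')');
        return v;
    }

    static long expr(void) {
        long v = term();
        while (lookahead == '+' || lookahead == '-') {
            int op = lookahead;
            advance();
            v = (op == '+') ? v + term() : v - term();
        }
        return v;
    }

    int main(void) {
        src = "1 + (2 - 3) + 4";
        advance();                  /* prime the lookahead */
        printf("%ld\n", expr());    /* prints 4 */
        return 0;
    }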

------
kingkongreveng_
A perl programmer talking about compilers in 2010 doesn't mention Parrot?
Parrot has to be the easiest and fastest way to get a compiler up and running
these days.

~~~
tptacek
He's not targeting a VM or a runtime, is he? His compiler looks more like a
code generator, a la yacc.

~~~
kingkongreveng_
Well the CaPerl project he mentions just translates perl5 with his special
keywords to standard perl5 code, I think. Who knows what he's doing here.

~~~
tptacek
He's writing yacc for crypto protocols. You write an algorithm in his Stupid
language, you and others verify it Stupidly, and then compile it to your real
language.

~~~
statictype
So the language's compiler does analysis of the algorithm for common crypto
flaws (like buffer overruns and timeable iterations...)?

~~~
tptacek
Nope. It does nothing at all except to standardize the notation.

------
marshallp
Use Lisp macros or Prolog DCGs.

