
Standalone lexers with lex: synopsis, examples, and pitfalls - g3orge
http://matt.might.net/articles/standalone-lexers-with-lex/
======
psykotic
Not much of a lex fan. Here's a simple way to implement the indentation-to-
braces transformation in a hand-written lexer:

    
    
        def tokenize(stream):
            indent = stream.indent
            while stream.current() and stream.indent >= indent:
                stream.indent = 0
                while stream.current() == ' ':
                    stream.indent += 1
                    stream.consume()
    
                if stream.current() == '\n':
                    stream.indent = indent
                    yield stream.consume()
                elif stream.indent == indent:
                    while stream.current() and stream.current() != '\n':
                        yield stream.consume()
                    if stream.current() == '\n':
                        yield stream.consume()
                else:
                    stream.rewind(stream.indent)
                    if stream.indent > indent:
                        yield '{'
                        for token in tokenize(stream):
                            yield token
                        yield '}'
    

A tokenize() call returns only once it has either reached the end of the
stream or consumed all contiguous lines at its initial indentation level or
greater. Empty lines (with or without leading indentation) don't count towards
the brace structure. It doesn't specially handle tabs, backslash line
continuations, or indentation within parentheses, but those would be easy to
add.
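The stream object is left unspecified above. For completeness, here is one way
it could be implemented: a minimal sketch in which the class name Stream and
its internals are my assumptions, and only the current/consume/rewind methods
and the mutable indent field are dictated by the tokenizer. The tokenizer
itself is repeated verbatim so the demo runs end to end:

```python
class Stream:
    """Minimal character stream exposing the interface tokenize() assumes:
    current() peeks, consume() advances, rewind(n) backs up, and indent is
    a scratch field the tokenizer reads and writes."""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.indent = 0

    def current(self):
        # Current character, or '' (falsy) at end of stream.
        return self.text[self.pos] if self.pos < len(self.text) else ''

    def consume(self):
        ch = self.current()
        self.pos += 1
        return ch

    def rewind(self, n):
        # Un-read n characters (used to push back counted indentation).
        self.pos -= n


# The tokenizer from the comment above, unchanged.
def tokenize(stream):
    indent = stream.indent
    while stream.current() and stream.indent >= indent:
        stream.indent = 0
        while stream.current() == ' ':
            stream.indent += 1
            stream.consume()

        if stream.current() == '\n':
            stream.indent = indent
            yield stream.consume()
        elif stream.indent == indent:
            while stream.current() and stream.current() != '\n':
                yield stream.consume()
            if stream.current() == '\n':
                yield stream.consume()
        else:
            stream.rewind(stream.indent)
            if stream.indent > indent:
                yield '{'
                for token in tokenize(stream):
                    yield token
                yield '}'


print(''.join(tokenize(Stream("if x:\n    y\nz\n"))))
# prints:
# if x:
# {y
# }z
```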

For a bit of fun, run it on its own source code and you get this:

    
    
        def tokenize(stream):
        {indent = stream.indent
        while stream.current() and stream.indent >= indent:
        {stream.indent = 0
        while stream.current() == ' ':
        {stream.indent += 1
        stream.consume()
    
        }if stream.current() == '\n':
        {stream.indent = indent
        yield stream.consume()
        }elif stream.indent == indent:
        {while stream.current() and stream.current() != '\n':
        {yield stream.consume()
        }if stream.current() == '\n':
        {yield stream.consume()
        }}else:
        {stream.rewind(stream.indent)
        if stream.indent > indent:
        {yield '{'
        for token in tokenize(stream):
        {yield token
        }yield '}'
        }}}}

------
evmar
In a recent project I examined the alternatives and ended up using re2c
(<http://re2c.org/>).

re2c's advantage over flex is that it is easy to embed _within_ existing code,
as opposed to a generated yylex function that your code must interface with.

re2c's advantage over a hand-written lexer is that it gives a more succinct
description of the fundamental rules you're implementing, and the generated
code (table-driven, optionally using computed gotos) is faster.
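For a sense of the embedding style, here is a sketch of a re2c input file (my
own illustration, not from the project; the function and token names are made
up). Running re2c over it replaces the special comment with a generated DFA in
plain C, right inside your own function:

```c
enum token { TOK_NUMBER, TOK_IDENT, TOK_ERROR };

/* Not compilable as-is: re2c rewrites the block comment below into C code. */
enum token scan(const char *YYCURSOR)
{
    const char *YYMARKER;  /* re2c uses this for backtracking if needed */
    /*!re2c
        re2c:define:YYCTYPE = char;
        re2c:yyfill:enable  = 0;

        [0-9]+                 { return TOK_NUMBER; }
        [a-zA-Z_][a-zA-Z_0-9]* { return TOK_IDENT; }
        *                      { return TOK_ERROR; }
    */
}
```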

Switching my project over made it maybe 5% faster (and that was coming from
hand-written code that already used a lookup table in a particularly tight
loop), but most importantly the result is much easier to read and modify.

