

Tries and Lexers - aberatiu
http://blog.ircmaxell.com/2015/05/tries-and-lexers.html

======
haberman
I would recommend that the author read up on NFAs and DFAs -- they are a
formalism better suited to lexers than tries.

At a high level, if compiling a lexer for a run-of-the-mill language like
JavaScript takes 5 minutes and 2.5GB of RAM, you are most likely doing it
wrong. By "doing it wrong," I mean that there is almost certainly a better
approach that is far superior in every measurable aspect (CPU, memory, code
size, etc).

I don't fully understand what kind of algorithm the author was using, so I
can't comment in detail on it, but in general lexers are better thought of as
finite automata (NFAs and DFAs) than tries. The two are related, but unlike
tries NFAs and DFAs can have cycles, which are essential for representing a
lexing state machine efficiently.
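To make the cycles point concrete, here is a minimal sketch in Python (token names are made up, and this is not the article's code): a hand-rolled scanner in the spirit of a DFA, where the identifier and number states loop back on themselves -- exactly the structure a trie cannot express.

```python
# Minimal illustrative sketch: a hand-rolled DFA-style scanner.
# The IDENT and NUMBER states cycle on themselves, unlike any trie branch.

def lex(src):
    tokens = []
    i, n = 0, len(src)
    while i < n:
        c = src[i]
        if c.isspace():
            i += 1
        elif c.isalpha() or c == '_':        # IDENT state: cycles on [A-Za-z0-9_]
            j = i + 1
            while j < n and (src[j].isalnum() or src[j] == '_'):
                j += 1
            tokens.append(('IDENT', src[i:j]))
            i = j
        elif c.isdigit():                    # NUMBER state: cycles on [0-9]
            j = i + 1
            while j < n and src[j].isdigit():
                j += 1
            tokens.append(('NUMBER', src[i:j]))
            i = j
        else:                                # everything else: one-character token
            tokens.append(('PUNCT', c))
            i += 1
    return tokens

print(lex("foo = bar2 + 42"))
# [('IDENT', 'foo'), ('PUNCT', '='), ('IDENT', 'bar2'), ('PUNCT', '+'), ('NUMBER', '42')]
```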

Another observation: it's not too terribly surprising that you could beat
preg_match_all() with the latter being given a huge regular expression. Most
regex engines are backtracking (RE2 being a notable counterexample), which
means that a regular expression with high-cardinality alternation (i.e.
A|B|C|D|E|F|G, etc.) is one of their worst cases. This isn't what they are
designed for. A backtracking regex engine will basically try each alternative
in sequence until that alternative no longer matches, then back up and try the
next one. NFA/DFA based parsers will be much faster for this case.

The right tool for this job, as another commenter mentioned, is Ragel. It's
basically designed for exactly this. It doesn't appear to support PHP
though...

~~~
ircmaxell
> I would recommend that the author read up on NFAs and DFAs -- they are a
> formalism better suited to lexers than tries.

Author here. The actual regex implementation uses an NFA. It started out as a
Trie, but moved away from that.

The main thing I wanted to get across here was the use of a minimal structure
(single-character nodes).

The next step was a maximal radix implementation (as long a prefix as possible
per node). Then finally, throwing all of it away and parsing directly with a
state machine.
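For illustration, a minimal Python sketch of the "maximal radix" step (PHP-style token names, a hand-built structure, not the article's actual code): each edge consumes as long a prefix as possible instead of a single character.

```python
# Minimal illustrative sketch of a radix (prefix-compressed) trie walk.

class Node:
    def __init__(self, value=None, edges=None):
        self.value = value          # token type if a keyword ends here
        self.edges = edges or {}    # edge label (multi-char prefix) -> child node

# "for"/"foreach" share the whole edge "for"; "class"/"clone" share "cl".
keywords = Node(edges={
    "for": Node(value="T_FOR", edges={"each": Node(value="T_FOREACH")}),
    "cl":  Node(edges={"ass": Node(value="T_CLASS"),
                       "one": Node(value="T_CLONE")}),
})

def longest_keyword(text):
    # Returns (token, length) of the longest keyword prefix of `text`.
    # (A real lexer would also check the next character can't extend an identifier.)
    node, i, best = keywords, 0, None
    while True:
        if node.value is not None:
            best = (node.value, i)
        for label, child in node.edges.items():
            if text.startswith(label, i):       # consume a whole prefix at once
                node, i = child, i + len(label)
                break
        else:
            return best

print(longest_keyword("foreach ($items as $item)"))   # ('T_FOREACH', 7)
print(longest_keyword("fortune"))                     # ('T_FOR', 3)
```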

------
djoldman
Want to build an ultra-fast lexer? Ragel is the way.

[http://en.wikipedia.org/wiki/Ragel](http://en.wikipedia.org/wiki/Ragel)

[http://www.colm.net/open-source/ragel/](http://www.colm.net/open-source/ragel/)

~~~
JadeNB
I may be mixing this language up with another text-manipulation language
starting with 'R' (not being coy—I genuinely don't remember its name), but I
seem to remember being uncertain of the reach of this project.

The Wikipedia page links two publications by (I suppose) the creator of Ragel,
and two publications that do not seem to be specifically about Ragel; and the
product page mentions nothing in the way of publications or references to uses
in industry. Is Ragel actually widely used?

~~~
djoldman
I'm not sure how widely it is used. However the documentation is pretty
comprehensive[1], the interface is intuitive, and the compiled code is fast.

[1] [http://www.colm.net/files/ragel/ragel-guide-6.9.pdf](http://www.colm.net/files/ragel/ragel-guide-6.9.pdf)

------
bane
Tries are amazing data structures, simple and extraordinarily fast -- O(m) on
lookups. But they also eat memory at an extraordinary rate. They're a classic
speed vs. memory trade-off.

However, most people use naive Tries, just adding elements down a branch until
they exhaust the string they're inserting.

One easy optimization to make with Tries is to set a maximum branch length,
based on some statistical analysis of lexeme usage (for example, make 90% of
your lookups reachable under that length). Any lexeme longer than that length
simply gets hung off the end of the branch in a more space-efficient data
structure (like a hash table).

Your lookup is then still O(m) for anything under the maximum branch length,
and things longer are still just O(m)+O(n) or whatever.

But your memory usage will shrink dramatically. And you can improve it further
by fiddling with your branch length and choosing, say, 80% reachable without
hashing.
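A minimal sketch of the idea in Python (the cutoff and names are purely illustrative): walk the trie for at most MAX_DEPTH characters, and hang anything longer off the last node in an ordinary hash-based set.

```python
# Minimal illustrative sketch: capped-depth trie with a hash-table overflow.

MAX_DEPTH = 4  # hypothetical cutoff chosen so most lookups resolve in the trie

class Node:
    __slots__ = ("children", "terminal", "overflow")
    def __init__(self):
        self.children = {}      # char -> Node
        self.terminal = False   # a key ends exactly here
        self.overflow = set()   # suffixes of keys longer than MAX_DEPTH

def insert(root, key):
    node = root
    for depth, ch in enumerate(key):
        if depth == MAX_DEPTH:
            node.overflow.add(key[depth:])   # rest of the key goes in the hash table
            return
        node = node.children.setdefault(ch, Node())
    node.terminal = True

def contains(root, key):
    node = root
    for depth, ch in enumerate(key):
        if depth == MAX_DEPTH:
            return key[depth:] in node.overflow   # O(m) walk + one hash probe
        if ch not in node.children:
            return False
        node = node.children[ch]
    return node.terminal

root = Node()
for word in ["for", "foreach", "function"]:
    insert(root, word)
print(contains(root, "foreach"), contains(root, "fore"))   # True False
```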

~~~
FreakLegion
Indeed there are a lot of good optimization options for tries.

At a little bit of a speed penalty you can use a bitmap for navigation instead
of pointers, which obliterates the memory footprint. In some use cases the
result is actually smaller than the equivalent Bloom filter with a sane FP
rate (though generally not smaller than a Golomb-compressed set or cuckoo
filter).
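Roughly, the bitmap trick looks like this (a minimal Python sketch, not any particular library): each node keeps a 256-bit bitmap of which children exist plus a densely packed child list, and a popcount of the lower bits gives the child's index.

```python
# Minimal illustrative sketch of bitmap navigation instead of 256 child pointers.

class BitmapNode:
    __slots__ = ("bitmap", "children", "terminal")
    def __init__(self):
        self.bitmap = 0        # bit b set <=> a child exists for byte value b
        self.children = []     # packed children, in byte-value order
        self.terminal = False

def child(node, byte):
    bit = 1 << byte
    if not node.bitmap & bit:
        return None
    # number of set bits below `byte` = index into the packed child list
    index = bin(node.bitmap & (bit - 1)).count("1")
    return node.children[index]

def add_child(node, byte):
    existing = child(node, byte)
    if existing is not None:
        return existing
    bit = 1 << byte
    index = bin(node.bitmap & (bit - 1)).count("1")
    node.bitmap |= bit
    new = BitmapNode()
    node.children.insert(index, new)
    return new
```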

A compromise between the speed and memory bloat of pointers is to use
Radix/Patricia tries, which are a solid go-to option. In some cases they'll be
slower than naive tries, but in others they'll be faster.

The fastest _and_ most memory-efficient implementation I've seen, though,
boiled each pattern in the trie down to its most unique dword or qword, and
only attempted a full match after an initial hit. Locality of reference can
work wonders here -- we got a 20x speedup over an already very fast naive trie
this way.
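Something like this, as a rough Python sketch of the prefilter idea (the corpus and patterns are made up): reduce each pattern to its rarest 4-byte window, probe the input with a cheap dict lookup, and only run the full comparison on a hit.

```python
# Minimal illustrative sketch: "most unique dword" prefilter before a full match.
# Patterns are assumed to be at least 4 characters long.

from collections import Counter

def pick_rare_dword(pattern, background_counts):
    # choose the 4-byte window of the pattern that is rarest in typical input
    windows = [pattern[i:i+4] for i in range(len(pattern) - 3)]
    return min(windows, key=lambda w: background_counts[w])

def build_prefilter(patterns, background_counts):
    table = {}
    for p in patterns:
        w = pick_rare_dword(p, background_counts)
        table.setdefault(w, []).append((p, p.index(w)))   # remember window offset
    return table

def scan(text, table):
    hits = set()
    for i in range(len(text) - 3):
        for p, off in table.get(text[i:i+4], ()):
            start = i - off
            if start >= 0 and text.startswith(p, start):  # full match only on a hit
                hits.add(p)
    return hits

corpus = "typical background text the scanner will usually see"
bg = Counter(corpus[i:i+4] for i in range(len(corpus) - 3))
table = build_prefilter(["keyword", "pattern"], bg)
print(scan("some text with a keyword in it", table))   # {'keyword'}
```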

~~~
bane
> boiled each pattern in the trie down to its most unique dword or qword, and
> only attempted a full match after an initial hit

Oh that's interesting - like a very fast cmp before bothering to check the
rest of the branches.

When you say pattern, are you talking about specific letters/morphemes or
subsequences? And were you pulling the [dq]words from the initial pattern or
generating them some other way (through some kind of hash or encoding
function)?

~~~
FreakLegion
Sorry, should've given more context. In this case the trie was for Aho-
Corasick, so we built it from the most unique subsequences. Since there was no
need to add or remove patterns at runtime, the subsequences were precomputed
and the trie built server-side, then sent serialized down to the application.

On a different occasion I tried a few kinds of entropy coding, but their value
was pretty limited, and obviously not applicable if you need to operate on a
stream with Aho. In that case I only needed memory-optimized set membership
testing, though, so I ended up going with a Golomb-compressed set from pp. 7-8
of [0].

[0] [http://algo2.iti.kit.edu/singler/publications/cacheefficient...](http://algo2.iti.kit.edu/singler/publications/cacheefficientbloomfilters-wea2007.pdf)

------
bluetech
[For all the bad things I hear about PHP, the code is very readable without
any previous experience - nice.]

Here are some things a lexer for a programming language might have to deal
with:

1\. Comments (some languages even allow nesting - which means regular
expressions are out for that).

2\. Continuation lines.

3\. Includes (if done at the lexical level).

4\. Filename/line/column number for nice error messages (can really hurt with
branch mispredictions).

5\. Evaluation of literals: decimal/hex/octal/binary integers, floats, strings
(with escapes), etc.

6\. Identifiers.

So matching keywords is mostly the straightforward part. However, I have found
that matching many keywords is the perfect (and in my experience so far, the
only) use case for a perfect hashing tool like gperf - it would normally be
much faster than any pointer-chasing trie. gperf mostly eliminated keyword
matching from the profile of any lexer I've done.
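For a sense of what gperf is doing under the hood, here is a rough Python sketch (not gperf itself, and the keyword set is made up): brute-force a table size and a couple of key character positions until no two keywords collide, then answer lookups with one hash and one string compare.

```python
# Minimal illustrative sketch of a gperf-style perfect hash for keywords.

from itertools import combinations

KEYWORDS = ["if", "else", "while", "for", "return", "function", "class"]

def build_perfect_hash(keywords):
    max_positions = min(len(k) for k in keywords)
    for size in range(len(keywords), 4 * len(keywords)):
        for positions in combinations(range(max_positions), 2):
            def h(word, positions=positions, size=size):
                return (len(word) + sum(ord(word[p]) for p in positions)) % size
            slots = {}
            for k in keywords:
                slot = h(k)
                if slot in slots:          # collision: try other parameters
                    break
                slots[slot] = k
            else:
                return h, slots            # collision-free: a perfect hash
    raise ValueError("no perfect hash found with these parameters")

h, table = build_perfect_hash(KEYWORDS)

def is_keyword(word):
    if len(word) < 2:                      # this sketch hashes the first two chars
        return False
    candidate = table.get(h(word))
    return candidate == word               # one compare, no probing

print(is_keyword("while"), is_keyword("whale"))   # True False
```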

~~~
TheLoneWolfling
Another thing:

Some languages process escapes before everything else. Looking at you, Java. So
either you need to do a pass beforehand to unescape them, or unescape
characters (and do error handling, etc., in the process) on the fly.

------
lindig
A lexer for a language with a lot of keywords leads to a large representation
as an automaton, as the author has experienced. One way to deal with this is
to recognise all identifiers, including keywords, in one rule (something like
"[a-z_][a-zA-Z0-9_]*") and to use a hash table of keywords to check whether a
match is a keyword (and which one) or an identifier.

Edit: fixed the regexp to allow for single-char identifiers.
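Concretely, a minimal Python sketch of that rule (the keyword set is invented for illustration):

```python
# One rule matches every identifier-shaped token; a hash lookup decides
# whether it is a keyword or a plain identifier.

import re

IDENT = re.compile(r"[a-z_][a-zA-Z0-9_]*")
KEYWORDS = {"if", "else", "while", "for", "return", "function"}

def classify(lexeme):
    return ("KEYWORD", lexeme) if lexeme in KEYWORDS else ("IDENT", lexeme)

print([classify(m.group()) for m in IDENT.finditer("while x do_thing")])
# [('KEYWORD', 'while'), ('IDENT', 'x'), ('IDENT', 'do_thing')]
```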

~~~
munificent
> One way to deals with this is to only recognise in one rule all identifiers
> including keywords (something like "[a-z_][a-zA-Z0-9_]+" and to use a hash
> table of keywords to check whether a match is a keyword or an identifier
> (and which one).

A lot of production parsers I've seen do this with two state machines. They'll
have a top-level state machine (usually hand-implemented using a giant switch
statement) that handles all of the main different token types—numbers,
strings, punctuators, etc. That has a single type for all identifiers,
including keywords.

Then, after it determines a token is an identifier, it runs another tiny
optimized state machine to see if it's a reserved word or not.
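A rough Python sketch of that second machine (the keyword set is made up; real implementations usually hard-code this as a switch on the first character and length):

```python
# Minimal illustrative sketch: after the main lexer scans an identifier,
# dispatch on its first character and only compare against the few candidates.

KEYWORDS_BY_FIRST_CHAR = {
    "e": ["else"],
    "f": ["for", "function"],
    "i": ["if"],
    "r": ["return"],
    "w": ["while"],
}

def keyword_or_ident(lexeme):
    for kw in KEYWORDS_BY_FIRST_CHAR.get(lexeme[0], ()):
        if len(kw) == len(lexeme) and kw == lexeme:
            return ("KEYWORD", lexeme)
    return ("IDENT", lexeme)

print(keyword_or_ident("for"), keyword_or_ident("forest"))
# ('KEYWORD', 'for') ('IDENT', 'forest')
```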

------
jakobegger
So I get that this optimized giant trie might be faster than a regex. But what
about a normal lexer, either handwritten or generated? Shouldn't that be
faster still than the trie? I mean, that giant amount of memory alone must
cause lots of performance issues...

------
lolptdr
"Parsers work at the grammatical level, lexers work at the word level."

Is this correct to say?

~~~
munificent
Yes, as long as you include punctuation in your definition of "word". I like
to think of it like this:

A lexer takes a sequence of characters and produces a sequence of tokens. List
in, list out. It just chunks runs of characters together.

A parser takes in a sequence of tokens and produces a tree. It produces a
nested data structure from a flat input.
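A tiny Python sketch of those two shapes (a toy parenthesis grammar, just for illustration):

```python
# Lexer: flat characters in, flat token list out.
# Parser: flat token list in, nested tree out.

def lex(src):
    return src.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    def parse_group(i):
        node = []
        while i < len(tokens) and tokens[i] != ")":
            if tokens[i] == "(":
                child, i = parse_group(i + 1)
                node.append(child)
            else:
                node.append(tokens[i])
                i += 1
        return node, i + 1
    tree, _ = parse_group(0)
    return tree

print(lex("(a (b c))"))         # ['(', 'a', '(', 'b', 'c', ')', ')']
print(parse(lex("(a (b c))")))  # [['a', ['b', 'c']]]
```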

~~~
ch
Right, a lexer is a mapping function. A parser is a fold.

~~~
nostrademons
A lexer is a fold as well - there are usually fewer elements in the output
list than in the input list. A mapping function implies a 1:1 correspondence,
because the recursion is abstracted out into map() and the map function can't
accumulate any state during the computation. (At least in traditional FP
technology; the Mapper in MapReduce isn't actually a true map function, it's
more like an unfold.)
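For what it's worth, a toy Python sketch of a lexer written literally as a fold (whitespace-separated words only):

```python
# The accumulator carries (finished tokens, current partial token), so many
# input characters collapse into fewer output tokens -- a fold, not a map.

from functools import reduce

def step(acc, ch):
    tokens, current = acc
    if ch.isspace():
        return (tokens + [current], "") if current else (tokens, "")
    return (tokens, current + ch)

def lex(src):
    tokens, current = reduce(step, src, ([], ""))
    return tokens + [current] if current else tokens

print(lex("two words  here"))   # ['two', 'words', 'here']
```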

~~~
ch
Ah, true, I was incorrect about the lexer as a map. Had I thought on that for
half a second more I would have saved myself some embarrassment. :)

------
TheLoneWolfling
At least optimize the DFA before you run it...

The equivalent of a DAWG versus a Trie.
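As a rough illustration (a Python sketch, not production code), the trie-to-DAWG step is just merging structurally identical subtrees bottom-up -- the same idea as minimizing the lexer's DFA before running it:

```python
# Build a plain trie, then hash-cons nodes bottom-up so identical subtrees
# (shared suffixes) are stored once.

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = {}                      # end-of-word marker
    return root

def minimize(node, registry):
    # Replace each child with its canonical representative, then register this
    # node by its (edge, child identity) signature so identical subtrees merge.
    for ch, child in list(node.items()):
        node[ch] = minimize(child, registry)
    signature = tuple(sorted((ch, id(child)) for ch, child in node.items()))
    return registry.setdefault(signature, node)

def count_nodes(node, seen):
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(count_nodes(child, seen) for child in node.values())

words = ["walking", "talking", "walked", "talked"]
trie = build_trie(words)
print(count_nodes(trie, set()))     # every suffix stored separately
dawg = minimize(trie, {})
print(count_nodes(dawg, set()))     # noticeably smaller: shared suffixes stored once
```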

