
Re2c: A free and open-source lexer generator for C and C++ - fanf2
http://re2c.org/
======
burntsushi
For any other state machine nuts such as myself, you won't want to miss their
paper on tagged DFAs:
[http://re2c.org/2017_trofimovich_tagged_deterministic_finite...](http://re2c.org/2017_trofimovich_tagged_deterministic_finite_automata_with_lookahead.pdf)
It is a somewhat recent follow on from Laurikari's 2000 paper and Kuklewicz
2007 paper on the same topic. Specifically, tagged DFAs let you do submatch
extraction via the DFA itself. Other finite state machine regex libraries
either don't do this at all, or fall back to slower engines in the general
case (such as RE2). I think the downside of tagged DFAs is that they can be
quite large!

~~~
jhallenworld
It's interesting that it's using Laurikari's algorithm- I looked at the 2000
paper years ago and could not get a working algorithm out of it. Eventually I
found another way to do this that I use for JOE's regular expression library.
So now I'm curious which one is faster...

[https://sourceforge.net/p/joe-
editor/mercurial/ci/default/tr...](https://sourceforge.net/p/joe-
editor/mercurial/ci/default/tree/docs/hacking.md#regular-expressions)

It uses a variant a Thompson's matcher, see:

[https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html)

"However, Thompson-style algorithms can be adapted to track submatch
boundaries without giving up efficient performance. The Eighth Edition Unix
regexp(3) library implemented such an algorithm as early as 1985, though as
explained below, it was not very widely used or even noticed. "

So in JOE's matcher, when simulating the NFA with the multithreaded machine,
each succeeding thread will correspond to one of the sub-match possibilities.
Augment this machine to keep the ones you want (the ones with the longest
matching strings starting from the left usually).

~~~
burntsushi
> So in JOE's matcher, when simulating the NFA with the multithreaded machine,
> each succeeding thread will correspond to one of the sub-match
> possibilities. Augment this machine to keep the ones you want (the ones with
> the longest matching strings starting from the left usually).

Yeah, this is the standard way to do it. It's what I meant by "fall back to a
slower engine," since simulating the NFA is much slower than a DFA.

------
chubot
FWIW I encountered re2c in the Ninja lexer

[https://github.com/ninja-
build/ninja/blob/master/src/lexer.i...](https://github.com/ninja-
build/ninja/blob/master/src/lexer.in.cc#L123)

and then used it for my own huge shell lexer:

 _The Oil Lexer: Introduction and Recap_
[http://www.oilshell.org/blog/2017/12/15.html](http://www.oilshell.org/blog/2017/12/15.html)

The nice thing about it is that it doesn't impose any constraints on the
structure of your code.

lex/flex probably have options not generate piles of messy C code with global
variables at this point, but IMO there's no reason not to use re2c instead, so
it's moot.

~~~
chubot
update: Also re2c gives you "semantic actions" like Ragel or yacc, which is
better IMO. Not sure if lex/flex does that. I think it has a hard-coded notion
of tokens.

~~~
saurik
lex/flex definitely supports semantic actions and definitely does not have a
hardcoded notion of tokens.

------
cjohansson
I have ported the PHP re2c lexer logic to elisp in my in-progress attempt at a
Semantic PHP mode for Emacs. Using the lexer tokens as foundation for syntax-
coloring and indentation

[https://github.com/cjohansson/emacs-phps-
mode/blob/master/ph...](https://github.com/cjohansson/emacs-phps-
mode/blob/master/phps-mode-lexer.el)

------
quotemstr
I usually reach for Ragel [1] for this kind of work. It has a very steep
learning curve and is a bit more complex than a straight-up translation of a
conventional regular expression, but the state machine combination algebra is
incredibly powerful and flexible. I've used the tool in numerous projects over
the years.

[1] [http://www.colm.net/open-source/ragel/](http://www.colm.net/open-
source/ragel/)

------
megous
I use re2c a lot. It's a great tool for safely working with text in C
programs, without any runtime dependencies. It's a secret sauce in many of my
programs.

For example, my JSON parser that doesn't use intermediate object tree and
works directly on strings:
[https://megous.com/git/sjson/tree/sjson.c](https://megous.com/git/sjson/tree/sjson.c)

------
openbasic
re2c is used by PHP:
[https://externals.io/message/102796](https://externals.io/message/102796)

[https://github.com/php/php-
src/blob/master/Zend/zend_languag...](https://github.com/php/php-
src/blob/master/Zend/zend_language_scanner.l)

------
mncharity
Hyperscan was recently posted.[1] A fast multi-pattern regex matcher. As in
1000's of patterns. Though optimized for deep packet inspection, I'm curious
to try it for tokenizing.

[1]
[https://news.ycombinator.com/item?id=19270199](https://news.ycombinator.com/item?id=19270199)

------
cleeus
I replaced a handwritten lexer of a proprietary PDF library with a re2c
generated lexer. Besides fixing a bunch of dormant bugs, the resulting lexer
was much faster. Also the state of c++ std regex implementations is pretty sad
(at least it was the last time I checked) and if you just need to match with a
few compile-time-known regexes, re2c is there for you to solve the problems.

------
sovande
re2c is a hidden gem and it is used in libzdb to parse URL's
[https://bitbucket.org/tildeslash/libzdb/src/9402bd5757d1db34...](https://bitbucket.org/tildeslash/libzdb/src/9402bd5757d1db346b71489439dfc005dd464971/src/net/URL.re?at=prefetch&fileviewer=file-
view-default#URL.re-106) and SQL Date strings
[https://bitbucket.org/tildeslash/libzdb/src/9402bd5757d1db34...](https://bitbucket.org/tildeslash/libzdb/src/9402bd5757d1db346b71489439dfc005dd464971/src/system/Time.re?at=prefetch&fileviewer=file-
view-default#Time.re-168)

------
CUViper
SystemTap derived its regex functionality from re2c:
[http://sourceware.org/git/gitweb.cgi?p=systemtap.git;a=blob;...](http://sourceware.org/git/gitweb.cgi?p=systemtap.git;a=blob;f=README.stapregex)

------
unsatchmo
Safari used to use re2c to identify Unicode Top Level Domains in order to
avoid cross-site-scripting attacks. The Unicode support has no parallel that
I’m aware of.

------
pubby
Lexer generators seem like a useful algorithm wrapped up in heaps of framework
bullshit. Wouldn't it be better to have a simple program that inputs regex and
outputs a DFA than one that handles the entire lexing process?

~~~
jules
Lexers run multiple regexes in tandem to determine the next token type. To do
that with your method you'd have to run multiple DFAs on the same string.
Lexers can do it with 1 DFA by having multiple different final states (non-
final + 1 per token type vs non-final + final).

~~~
ori_b
You can do that from a regex api fairly easily. Something like:

    
    
        re = regex.compilemulti([
             (".*@(.*)\\.(com|org|net)", TagEmail),
             ("http://.*, TagURL),
             ("[a-z]*", TagWord)
        ])
    
        match regex.tagmatch(re, pat)
        | (TagEmail, submatches):    domain=submatches[1]
        | (TagURL, submatches):      url=submatches[0]
        | (TagWord, _):              /* nothing */
        ;;
    

I haven't had time to do it in Myrddin's libregex
([https://myrlang.org/doc/libregex/](https://myrlang.org/doc/libregex/)) yet,
but I've got plans to do something like this.

~~~
jules
Sure, and that's exactly what re2c does too. It lets you do a match on
multiple regexes:

    
    
      regex1 { code that executes if regex1 matches }
      regex2 { code that executes if regex2 matches }
      regex3 { code that executes if regex3 matches }

~~~
ori_b
Yes, compiled into source code, which doesn't let you change the set of
regexes at runtime.

------
clishem
Alternative: [https://github.com/fbb-git/flexcpp](https://github.com/fbb-
git/flexcpp)

(nice with [https://github.com/fbb-git/bisoncpp](https://github.com/fbb-
git/bisoncpp) as parser generator)

Very well written with good documentation.

