
Cool! Yeah, someone else mentioned Ragel too. I'll have to check it out and see if it's also lazy (to prevent exponential blow-up of the DFA).



Unfortunately[1] it's not lazy. The DFAs are static, and you can get blow-ups. I've used Ragel for many different projects. The only time that became an issue was a project that involved transforming ~10,000 PCREs into Ragelable syntax and combining them into a giant union. A handful[2], when placed into a union, would (directly or indirectly) produce DFAs that were either too large or simply took too long to compile.[3] We developed heuristics to split them up, so we had one giant union and a few dozen smaller ones. Later the Ragel author implemented a cost function that let us do that more consistently and efficiently.

Normally, if you encounter undesirable blow-ups, you can use Ragel constructs to split your machines and dynamically jump between them.
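
For anyone who hasn't seen that technique: a rough, untested sketch (machine and variable names invented). A named instantiation (":=") gets its own entry point, and fgoto hands control to it at run time, so the two DFAs are compiled independently instead of being unioned:

    #include <string.h>

    %%{
        machine handoff;

        # Compiled as its own DFA because it's a separate instantiation.
        tail := [a-z]+ '\n' @{ res = 1; fbreak; };

        # Recognize the prefix, then jump into `tail` at run time;
        # neither machine inflates the other at compile time.
        main := 'BEGIN ' @{ fgoto tail; };
    }%%

    %% write data;

    /* Returns 1 if str is "BEGIN " followed by a lowercase word
       and a newline. */
    int parse(const char *str)
    {
        const char *p = str, *pe = str + strlen(str);
        int cs, res = 0;

        %% write init;
        %% write exec;

        return res;
    }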

I've also worked with RE2 and consulted on a project using re2c. Those projects run into trouble with DFA complexity much, much, much earlier than Ragel. For example, re2c took days to compile a union that Ragel compiled in seconds, and while the generated code was large, it wasn't even noticeable in terms of performance. (And by "not noticeable" I mean that the run time wasn't noticeably different from other Ragel-using projects. In terms of performance there's simply no comparison between Ragel and re2c, or Ragel and RE2 for that matter.)

[1] On the plus side, if you compile the machine into a dynamically linked library, the generated code and data will be shared across processes and with the file system buffer cache. Even multi-gigabyte machines don't necessarily cause as much memory pressure as you'd think, because most of the states won't be visited and so won't need to be resident. Sometimes a little change to a machine doubles or triples its size; I had to learn to leave it be and move on, because it was usually inconsequential and my time was better spent elsewhere rather than obsessing over what intuitively seemed inefficient and costly.

[2] EDIT: This originally said 5%, but now I remember that 5% was the number of PCREs that contained problematic constructs that required special treatment to emulate. (In other words, 95% readily transformed to equivalent DFAs.) The ones that truly blew up the DFA into something unusable were quite rare. And none were so large that they couldn't be compiled into their own, single machine.

[3] Part of that was our heavy use of `when` conditionals to translate zero-width assertions (e.g. "\B"). Because of how they work, they can easily inflate the machine size when part of a union.
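
For the curious, a rough, untested sketch (names invented) of the idea; `prev_word` is a host variable initialized to 0, and the host file includes <ctype.h>:

    %%{
        machine nwb;

        # Condition action: the body is a boolean expression, evaluated
        # before the current character is consumed.
        action after_word { prev_word }

        # Ordinary action: runs as each character is consumed.
        action track { prev_word = (fc == '_' || isalnum((unsigned char)fc)); }

        # Roughly /\Bcat/: the 'c' transition may only be taken when the
        # preceding character was a word character. Conditions split
        # states per combination of condition outcomes, which is where
        # the inflation comes from once many of these land in one union.
        main := ( any @track )* :>> ( ('c' when after_word) 'at' );
    }%%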


Interesting, thanks for the info! :-)



