
Regex Parser in C Using Continuation Passing - tinkersleep
http://www.theiling.de/cnc/date/2017-12-19.html
======
burntsushi
> One drawback of the CPS technique in C is its stack usage, especially
> because it is quite hard to understand at all what is going on, which makes
> it very hard to predict how much stack is used. Usually, I get SIGSEGV from
> pointer misuses (typically NULL), but in debugging this program, instead I
> got an infinite recursion causing SIGSEGV a couple of times.

I think the key drawback not stated here is that even if you fix all of
your logical bugs, then unless every function call is tail-call optimized by
the compiler, your parser isn't quite safe for untrusted regular
expressions. E.g., a user could write `(((((...a...)))))`, and if you don't
guard against it with some hard limit on nesting depth, then you're
vulnerable to stack overflows. So let's say you implement nesting limits. What
limit should be set? /shrug

This is why I write my regex parsers without recursion and instead use an
explicit stack on the heap. It is then a bit easier to more robustly manage
nesting limits.

~~~
kindfellow92
Parser(s)? As in more than one? Hopefully you'll only ever need to write one
parser (e.g. for Rust), but practically zero, since regexes are a solved
problem.

~~~
zodiac
Regexes can't be used for parsing (e.g. the language of balanced parentheses)

~~~
ThePadawan
I think your parent post referred to using a parser to parse regexes, not
using regexes to parse another language.

------
galapago
Why was this flagged?

------
rurban
I don't understand what he is talking about. A very simple one-pass recursive
regex matcher is trivial to write. I wrote the one for Asterisk in 2003. No
fancy CPS needed.

[http://asteriskguru.com/archives/asterisk-dev-better-pattern...](http://asteriskguru.com/archives/asterisk-dev-better-pattern-matcher-vt42988.html)

~~~
tinkersleep
Hmm, I have yet to see a trivial implementation for full-featured regex
syntax, which was my goal.

The idea was not to parse a simple dialect of regex, but a full-featured
dialect, including things like \d, \\), [)] and multi-level (), like
(a|(b|c\\)d[e)f])*g|h)+. So, e.g., the simple approach of scanning for the
closing ) when you find a ( does not really work. Just parsing such regexes
requires a recursive scanner, and then a recursive data structure to store
the result of parsing the regex, and this runs counter to matching at the
same time.

The linked code you provide assumes that you can first parse the whole pattern
(like '(...)'), interpret that pattern, and then recurse to match it. There do
not seem to be test cases in your code for the complex regexes cited above. So
I don't think the approach of trivial recursion works with a full-featured
standard regex grammar.

~~~
rurban
I see, thanks. But in practice such a parser explodes, and should not really
be used. A simple one is still linear, and much faster than the current
regcomp/regexec overkill.

------
onli
For dang or another mod: This was probably flagged by accident.

------
kazinator
Do C compilers do tail calls through a pointer?

(How does that work if the target function has a separate entry point for
accepting a tail call?)

If you pile up stack frames while matching regex, you will blow up on long
matches. Eg. a+ matching string with thousands or millions of a's.

This could be made not to rely on tail call support by using a:

    typedef struct cont {
      void (*fn)(void *data);
      void *data;
    } cont_t;

which is returned from each function. A dispatch loop invokes the tail call by
calling fn in the returned structure, passing it data. Needless to say, data
can't be the address of some local variable in the function which returns this
structure. Obviously, non-tail invocations of the continuations sometimes
occur, and those have to be handled differently.

~~~
tinkersleep
gcc generates 'jmp *%eax' (and other computed branch instructions) for this
code, so yes, it uses tail-calls for function pointers.

------
fao_
Interesting! Does anyone know how much less efficient this is in practice?

~~~
burntsushi
Typically regex parsing isn't benchmarked thoroughly, so you probably won't
get any solid answers to your query. The reason why it isn't benchmarked
thoroughly is because it's generally fast enough, and is typically dwarfed by
other aspects of the regex compilation phase. (Which in turn varies with the
regex engine, and depends on how much analysis it does up front. Some do very
little, some do a lot.)

~~~
kmill
This is both a regex parser and a regex matcher, and I believe fao_ meant the
matching part.

------
phaedrus
This is very close to logic programming, even down to the use of the return
codes for "exception" handling (you can re-re-interpret the "return 0 means
exception" as meaning logical negation-as-failure). Compare the Castor C++
logic programming library, which defines logic relations as functions which
return true if they bind variables or false if they failed.

The use of continuation passing style is also apropos: compare the LC++ logic
programming library which uses CPS to implement lazy streams for query
evaluation. In LC++ an "OR" relation is implemented by calling the first
argument with the continuation, and then the second argument with the
continuation. An "AND" is similar, except it chains the continuations.

My personal interest in this topic is that I believe it should be possible to
use continuation passing style and callbacks together to make an evaluation
engine which "reorganizes" the C-stack to aid in debugging. It would work like
this: the user code would utilize CPS, but the runtime engine, not the user
code, has final say over where the continuation actually jumps. By priming the
linked-list of continuations + context data the right way, the runtime could
"stack the deck" so that when the user functions run, it creates a "fake"
callstack that looks like a certain series of operations ran, which in reality
may have actually been run a long time ago or may be composed of pieces of
different chains of computation. In this way, the engine could communicate to
the programmer + debugger a semantically meaningful summary of an otherwise
convoluted / too detailed original computation. Or restart a computation which
had to be previously abandoned due to needing data that wasn't available yet.
I call this technique "stacking the stack" (in analogy to stacking the deck to
cheat at a card game).

Also, although your technique of calling the continuation twice as in "return
cont(a * a, data) && cont(a * a * a, data);" is perfectly valid (in fact, this
code is also a syntactically valid Castor logic expression), here's another
way it could be done: convert the nondeterministic choice into a deterministic
one by lifting the information as to which branch is being executed into a
variable which can be passed in. It would then look like:

    int foo(int *next_part, int a, cont_t cont, data_t *data)
    {
        switch ((*next_part)++)
        {
        case 0:  return cont(a * a, data);
        case 1:  return cont(a * a * a, data);
        default: return 0;  /* fail when exhausted */
        }
    }

You're already doing this, sort of, in the regex engine when you do "switch
(*p)". The difference is, there, you're choosing between alternatives based on
a concrete property of the input character. If you like, think of the
"next_part" variable as an intangible or out-of-band property of the current
input. Once you do that, you could even turn all of your "switch (*p)"
statements into "switch ((*next_part)++)" and simply _assume_ *p matched
that case (we'll just fail inside the step if it didn't match, and backtrack).
Congratulations, now you're doing logic programming!!

