
Show HN: Using Parsing Expression Grammars to rewrite source code - sebcat
https://github.com/sebcat/move-literals
======
leafo
Parsing expression grammars are my favorite way to parse text.

I also use Lua and LPeg. Here's a tutorial I wrote on them:
[http://leafo.net/guides/parsing-expression-grammars.html](http://leafo.net/guides/parsing-expression-grammars.html)

Additionally, I've written an entire programming language with them:
[http://moonscript.org/](http://moonscript.org/)

Here's the grammar:

[https://github.com/leafo/moonscript/blob/master/moonscript/p...](https://github.com/leafo/moonscript/blob/master/moonscript/parse.moon#L106)

My favorite part is that you express the grammar in code, not some simplified
language designed to write parsers. So you get all the nice things a
programming language normally gives you. (e.g. you can use functions instead of
having support for macros)
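To illustrate the "functions instead of macros" point, here is a minimal sketch in Python rather than Lua/LPeg. All combinator names (`lit`, `seq`, `choice`, `many`, `sep_by`) are hypothetical and are not LPeg's API; the idea is only that a reusable grammar fragment can be an ordinary function.

```python
# Minimal PEG-style combinators (hypothetical names, not LPeg's API).
# A parser is a function (text, pos) -> new position, or None on failure.

def lit(s):
    """Match the exact string s."""
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def seq(*parsers):
    """Match every parser in order; fail if any one fails."""
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def choice(*parsers):
    """Ordered choice: the first alternative that matches wins."""
    def parse(text, pos):
        for p in parsers:
            end = p(text, pos)
            if end is not None:
                return end
        return None
    return parse

def many(p):
    """Match p zero or more times, greedily."""
    def parse(text, pos):
        while True:
            end = p(text, pos)
            if end is None or end == pos:
                return pos
            pos = end
    return parse

# An ordinary function does the job a macro would do in a grammar DSL.
def sep_by(item, sep):
    """item (sep item)*"""
    return seq(item, many(seq(sep, item)))

digit = choice(*[lit(c) for c in "0123456789"])
number = seq(digit, many(digit))
number_list = seq(lit("["), sep_by(number, lit(",")), lit("]"))

def matches(parser, text):
    """True if parser consumes the whole text."""
    return parser(text, 0) == len(text)
```

With these definitions, `matches(number_list, "[1,22,333]")` succeeds while `matches(number_list, "[1,,2]")` does not, and `sep_by` can be reused for any other separated list.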

I've also started experimenting with compiling LPeg style grammars to C, for
even more speed. This uses the peg library:
[https://github.com/leafo/moonparse](https://github.com/leafo/moonparse)

You can see how the grammar looks here:
[https://github.com/leafo/moonparse/blob/master/parse.peg.moo...](https://github.com/leafo/moonparse/blob/master/parse.peg.moon#L115)

~~~
david-given
Last time I used LPeg for a big project I ran into difficulties with error
detection and recovery --- writing the parser for correct input was
beautifully easy, but writing a parser that would gracefully handle incorrect
input was very hard. (I'm reminded of the apocryphal Prolog compiler which, if
you gave it a program with a syntax error, would just reply 'No.'.)

When I asked about the best way to handle errors it was suggested to me that I
should add alternatives to my rules that would catch errors and return an
error token from there; but of course, that short circuits any alternatives
further up the grammar tree, so it wasn't very satisfactory. A quick look at
your grammar doesn't show anything like this --- how are you dealing with
errors?
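A small sketch of the short-circuit problem described above, in Python rather than LPeg. The grammar, the `error` combinator, and all other names are made up for illustration; the assumption is that an "error alternative" always succeeds and yields an error token.

```python
import re

# A parser is a function (text, pos) -> (new_pos, value), or None on failure.

def token(pattern):
    """Match a regular expression at the current position."""
    rx = re.compile(pattern)
    def parse(text, pos):
        m = rx.match(text, pos)
        return (m.end(), m.group()) if m else None
    return parse

def seq(*parsers):
    """Match every parser in order and collect the values."""
    def parse(text, pos):
        values = []
        for p in parsers:
            r = p(text, pos)
            if r is None:
                return None
            pos, v = r
            values.append(v)
        return pos, values
    return parse

def choice(*parsers):
    """Ordered choice: the first alternative that matches wins."""
    def parse(text, pos):
        for p in parsers:
            r = p(text, pos)
            if r is not None:
                return r
        return None
    return parse

def error(message):
    """Hypothetical error alternative: always succeeds, swallows the rest."""
    def parse(text, pos):
        return len(text), ("ERROR", message, pos)
    return parse

name = token(r"[a-z]+")
number = token(r"[0-9]+")
call = seq(name, token(r"\("), token(r"\)"))

# assign <- name '=' number
assign_strict = seq(name, token(r"="), number)
# assign <- name ('=' number / ERROR)
assign_lenient = seq(
    name, choice(seq(token(r"="), number), error("expected '= number'"))
)

# stmt <- assign / call
stmt_strict = choice(assign_strict, call)
stmt_lenient = choice(assign_lenient, call)
```

On the input `"f()"`, `stmt_strict` fails in `assign`, falls through to `call`, and matches. `stmt_lenient` never reaches `call`: the error alternative inside `assign` succeeds first, so a perfectly valid call is reported as a broken assignment.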

~~~
YeGoblynQueenne
>> (I'm reminded of the apocryphal Prolog compiler which, if you gave it a
program with a syntax error, would just reply 'No.'.)

I don't believe you remember this correctly.

The Prolog _interpreter_ will raise an error for syntax errors. If you misspell
something but don't cause a syntax error then you may get an unexpected "no"
(or "false") but a syntax error causes compilation to fail, in Prolog as in
any language.

In any case, a "no" ("false") is the proof procedure failing to prove your
query true (or, more accurately, finding a way to prove it false). It's not an
error and it's not a failure of the interpreter.

~~~
david-given
The expanded version of the saying, which I think I got from my Prolog
lecturer at university, is:

Writing an optimising compiler in Prolog (note, _in_ Prolog, not _for_ Prolog)
which will accept valid programs is trivially easy. Writing an optimising
compiler in Prolog which will say anything other than 'No.' for invalid
programs is intractably hard.

~~~
YeGoblynQueenne
Ah, I see: you meant a compiler written in Prolog, not a Prolog compiler.

Who was your lecturer? I haven't heard of that problem with optimising
compilers in Prolog. I don't know why it should be harder to do in Prolog than
in any other language.

------
ulrikrasmussen
Minor technical nitpick for the README: You're not moving up the Chomsky
hierarchy but rather moving sideways ;). PEGs are non-Chomskyan parsing
formalisms, in the sense that the family of languages describable by PEGs is
not to be found in the Chomsky hierarchy.

For example, they are not regular, as they can match patterns such as a^n b^n.
On the other hand, they are also not context-free, as they can also match non-
context-free patterns such as a^n b^n c^n. Finally, since PEG can be parsed in
linear time, there must also be some CFG which cannot be recognized by any PEG
- this follows due to a result by Lee [1].

[1] L. Lee. Fast Context Free Grammar Parsing Requires Fast Boolean Matrix
Multiplication.
[http://arxiv.org/pdf/cs/0112018.pdf](http://arxiv.org/pdf/cs/0112018.pdf)
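For the a^n b^n c^n claim, the well-known PEG is `S <- &(A 'c') 'a'+ B !.` with `A <- 'a' A? 'b'` and `B <- 'b' B? 'c'`. Below is a hand-written Python sketch of that grammar as a recursive recognizer; the function names are just for illustration.

```python
def match_A(text, pos):
    """A <- 'a' A? 'b'   (matches a^n b^n for n >= 1)"""
    if not text.startswith("a", pos):
        return None
    pos += 1
    inner = match_A(text, pos)      # A? is greedy and never reconsidered
    if inner is not None:
        pos = inner
    return pos + 1 if text.startswith("b", pos) else None

def match_B(text, pos):
    """B <- 'b' B? 'c'   (matches b^n c^n for n >= 1)"""
    if not text.startswith("b", pos):
        return None
    pos += 1
    inner = match_B(text, pos)
    if inner is not None:
        pos = inner
    return pos + 1 if text.startswith("c", pos) else None

def match_S(text):
    """S <- &(A 'c') 'a'+ B !."""
    # &(A 'c'): and-predicate, checks a^n b^n followed by 'c', consumes nothing
    a_end = match_A(text, 0)
    if a_end is None or not text.startswith("c", a_end):
        return False
    # 'a'+: skip the run of a's
    pos = 0
    while text.startswith("a", pos):
        pos += 1
    if pos == 0:
        return False
    # B: the remainder must be b^m c^m
    pos = match_B(text, pos)
    if pos is None:
        return False
    # !. : nothing may follow
    return pos == len(text)
```

The and-predicate checks that the number of a's equals the number of b's without consuming input, and `B` then checks that the number of b's equals the number of c's. So `match_S("aabbcc")` succeeds, while `match_S("aabbc")` and `match_S("aabcc")` fail.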

~~~
versteegen
Interesting. I see that the Wikipedia PEG article still states that the
existence of CFG- but not PEG-parsable languages is an open problem.

~~~
ulrikrasmussen
It is! It is surprisingly difficult to come up with a grammar and a proof that
it cannot be recognized by a PEG. Part of the reason is probably that PEGs and
CFGs are quite different: A CFG is a set of generative rules which specify a
set of strings, whereas a PEG is actually more like a recursive program.
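A tiny illustration of that difference, sketched in Python with made-up function names. The same-looking rule `('a' / 'ab') 'c'` accepts different strings depending on whether the choice is read as PEG ordered choice or as CFG alternation.

```python
def peg_match(text):
    """PEG reading of  S <- ('a' / 'ab') 'c' !.  (ordered, committed choice)."""
    pos = None
    for alt in ("a", "ab"):
        if text.startswith(alt, 0):
            pos = len(alt)      # the first alternative that matches wins
            break
    if pos is None:
        return False
    # Once the choice has succeeded it is never revisited, even if 'c' fails.
    return text.startswith("c", pos) and pos + 1 == len(text)

def cfg_generates(text):
    """CFG reading of  S -> ('a' | 'ab') 'c'  (either alternative may be used)."""
    return any(text == alt + "c" for alt in ("a", "ab"))
```

Both readings accept `"ac"`, but only the CFG reading accepts `"abc"`: the PEG commits to `'a'`, then fails on `'c'` and never tries `'ab'`.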

There's an open question on Stack Exchange for anyone who feels like taking a
stab at the problem:
[http://cstheory.stackexchange.com/questions/34792/does-peg-c...](http://cstheory.stackexchange.com/questions/34792/does-peg-contain-cfg)

~~~
versteegen
OK. I thought you meant that a non-constructive proof existed that there is
such a CFG. That's how I interpreted "Finally, since PEG can be parsed in
linear time, there must also be some CFG which cannot be recognized by any
PEG". But I see that on cstheory you wrote instead (being more precise?) "most
probably ... PEG does not contain CFG".

~~~
ulrikrasmussen
Ah, yes, that's because the non-linear complexity of Boolean matrix
multiplication (which CFG parsing can be reduced to) is _strongly_ suspected
to be a lower bound, but I don't think that there is a formal proof of it (I
am not an expert in complexity theory, so I am just basing this on what some
authors seem to suggest). So, the non-constructive argument is a "proof" in
the same sense that some results can be proved by first assuming that P /= NP.

------
akavel
I really wish the expressions were better documented/explained in the source.
They're dense, similarly to regexps, and regexps also benefit immensely from
explanation when used as a demo/teaching tool.

~~~
sebcat
Good point. I added a new branch where I broke the "literal" rule into
multiple rules and added some comments where I try to explain what's going on
in English. I also added an anchor to it in the master README.md.

I find it a bit hard to accurately describe the operations with words though.
If there are any ambiguities in the wording, let me know!

[https://github.com/sebcat/move-literals/blob/commented_expre...](https://github.com/sebcat/move-literals/blob/commented_expressions/move-literals.lua#L17)

~~~
akavel
Awesome, thanks! It's the first time I've seen example PEGs explained so
nicely. I've taken the liberty of adding a few small comments with some
additional suggestions.

Some additional comments I didn't know how to add on github:

- in load_data(), you should most probably use `local` on l.99 and 103;

- in load_data() l.99, you could consider using the idiom:

    local f = assert(io.open(file, "rb"))

though arguably it's clearer as it is now.

Other than that, it would be even cooler if you could also add some examples
in the repo of how to deal with parsing errors, as you showed in another
comment!

~~~
sebcat
Thank you for the feedback. The commented_expressions branch looks a lot
better than master now, I'll probably merge it. I added the code from the
comment to README.md.

------
VertexRed
I viewed this on my phone and it was annoying and difficult to read due to the
typing effect in the header which caused the content below to move up and
down.

~~~
chrismcb
It is a readme on GitHub. What typing effect are you talking about?

~~~
VertexRed
Wrong tab. Comment was meant for
[https://news.ycombinator.com/item?id=12498976](https://news.ycombinator.com/item?id=12498976)

------
nightcracker
A while back I made a PEG parser in Python in ~400 LOC, that can parse itself.
It's actually quite straightforward:
[https://gist.github.com/orlp/e880287287985ccd9288def8f6741b4...](https://gist.github.com/orlp/e880287287985ccd9288def8f6741b47).

I've also added a 'skipper' feature, that makes handling whitespace a breeze.

