PEG.js - parser generator for JavaScript (majda.cz)
74 points by dmajda on Mar 17, 2010 | 43 comments



There are a couple other parser generators for JS:

* http://zaa.ch/2k - Jison

* http://inimino.org/~inimino/blog/peg_first_release - another Packrat (has an es5 parser as an example)

Plus a few more that work but produce generated code that's too slow to be useful.


Worth noting that, last I heard, CoffeeScript is using Zaach's Jison: http://zaa.ch/2o


I had to do some digging for myself to verify that, yes, PEGs are a recent concept distinct from the kind of grammars I'm already familiar with. See Bryan Ford's paper, Parsing Expression Grammars: A Recognition-Based Syntactic Foundation, from POPL 2004: http://portal.acm.org/citation.cfm?id=964001.964011


There's a really good VM-based PEG parsing system called LPEG (http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html). Paper: http://www.inf.puc-rio.br/~roberto/docs/peg.pdf

It's for Lua, but there isn't anything in the VM design that requires Lua per se - porting it to another language's C API wouldn't be difficult. In my experience, LPEG has been quite easy to use for the sort of tasks one would normally apply regular expressions to (and then some, as it handles recursive structures far better), it's quite fast, and it's easy to tune incrementally. In my (not yet ready for release) Lua/LPEG webserver, I'm handling 1000+ requests per second in about 2MB RAM total*, so LPEG's design has tamed the usual space issues caused by PEG memoization.

* Only a tiny amount of that time is due to request parsing, it's mostly because I'm still working the kinks out of the event loop.

One strike against it is that there isn't a whole lot of casual documentation, though the documentation that does exist (the research paper and a library reference) is quite thorough. Also, it takes advantage of Lua's operator overloading to make the grammars concise, which takes getting used to. Still, having it so thoroughly integrated into the language is extremely convenient, and it's quite a bit more expressive than regular expressions.

Also, here's a link to the PEG paper on the author's site, rather than behind the ACM's paywall: http://www.bford.info/pub/lang/peg.pdf . Gotta love the ACM.


Memoization to get linear performance out of backtracking, which is exponential without it, is certainly a nice sleight of hand, but I wouldn't be under any illusions that the constant factor is competitive with say an LALR recognizer. Also, with a scheme that maps so closely to recursive descent, it's easy to miss what amount to grammar ambiguities, where nested productions preferentially eat tokens that may be in the follow set (i.e. the if/else problem).


PEGs don't have grammar ambiguities. Every PEG is deterministic.

Grammar ambiguities are resolved by ordered choice. If there are two ways to parse something, the first way will be tried first. If it succeeds, the ordered choice short-circuits and the second way is ignored. So you can resolve, say, the classic dangling else problem by ordering your choices correctly.
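
As a sketch (hypothetical rule names, in the same grammar notation used elsewhere in this thread), the dangling else is resolved by putting the "if ... else" alternative first:

```
Statement : "if" Condition Statement "else" Statement
          / "if" Condition Statement
          / OtherStatement
```

With this ordering, an "else" always binds to the nearest unmatched "if": the first alternative greedily consumes the "else" in the innermost statement before the parser ever falls back to the else-less form.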


I know all of that. My point is that the if/else problem really is an ambiguity (i.e. it is a language bug, not just a problem you have to get around in parsing), and problems similar to it crop up when your tool doesn't alert you to first/follow conflicts, in LL(1) parlance. That PEGs encode the implementation strategy inside the declarative grammar is a flaw, in my opinion, not a benefit.


In programming, ambiguity is bad! Explicitly encoding disambiguation rules in the grammar is a good thing, just like it's a good thing in functional pattern matching, in Prolog... I'd be interested to hear any reason why you would want to not disambiguate a grammar in any language designed to be computer-recognizable.


Here, read this:

http://compilers.iecc.com/comparch/article/05-09-114

In particular:

"Yes, this is "the" problem with PEG's. / implies ordering of the search (parsing) space. You need to order your / operators so that special cases (e.g. longer matches), so that they appear first. Unfortunately, if you don't do this, nothing will tell you you have a problem with your grammar, it will simply not parse some inputs. To me this implies that if one wants to use a PEG to parse some input, then one must exhaustively test the parser."

Hopefully the thread therein will describe my issue better than I have thus far.
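
The failure mode described in that quote can be sketched with a few lines of JavaScript (hypothetical combinator helpers, not the PEG.js API): with the shorter alternative first, the ordered choice commits to it and the longer match is silently unreachable.

```javascript
// A literal matcher: succeeds if the input at pos starts with s.
function literal(s) {
  return (input, pos) =>
    input.startsWith(s, pos) ? { value: s, pos: pos + s.length } : null;
}

// Ordered choice: tries alternatives in order, commits to the first success.
function choice(...alts) {
  return (input, pos) => {
    for (const alt of alts) {
      const r = alt(input, pos);
      if (r) return r; // later alternatives are never tried
    }
    return null;
  };
}

// Wrong order: "<" matches first, so "<=" is never recognized as one token.
const badOp = choice(literal("<"), literal("<="));
// Right order: longer match first.
const goodOp = choice(literal("<="), literal("<"));

console.log(badOp("<=", 0).value);  // "<"  -- silently the shorter match
console.log(goodOp("<=", 0).value); // "<="
```

Nothing errors out in the first case; the grammar simply fails to parse some inputs downstream, which is why exhaustive testing is needed.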


Indeed it does, and Chris Clark very nicely summarizes the advantages and disadvantages of PEG vs other parsers. Nice find.


Right, and PEGs only have one level of backtracking, so you don't have potentially massive slowdown due to backtracking several levels and re-trying everything along the way.

While the backtracking behavior is different, choice order affects PEG rules similarly to Prolog.


I love LPEG for Lua. It usually takes me some time to write a decent grammar, but once it works it is rather straightforward to read. (http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html)

Now, seeing this for JavaScript makes me wonder if I should actually do more work with JS.


Amen on the easy to read part. I'm the author of clj-peg ( http://www.lithinos.com/clj-peg/ ), and the main reason I wanted to write it was to have grammars that were human readable.


I'm really interested in human-readable grammars. Thanks for your work.


A parser generator? Why oh why? It sounds so 80s. JavaScript is strong enough to express parser combinators, e.g. see Chris Double's implementation (also based on the PEG formalism):

http://www.bluishcoder.co.nz/2007/10/javascript-parser-combi...

I did the same thing in ActionScript. It's easy and fun.


It was pointed out to me that my git repository hosting this code was down. I've now moved it to github if anyone was looking for it:

http://github.com/doublec/jsparse


I believe the classical generated parser generally has better performance potential: there is less function calling, less string passing (though that can be avoided in combinators too), and more opportunity for optimization when you have the whole grammar AST in hand.


Are there PEGs available for languages like Javascript? Given the recent discussion of Perl parsing being undecidable, I assume there isn't one for Perl. Has anyone tackled such a thing for Ruby?

Update: Thanks for the links! That being said, I didn't mean parsers written in Javascript or Ruby, I mean PEGs or CFGs that parse Javascript and/or Ruby programs :-)


You might look at OMeta/JS: http://tinlizzie.org/ometa-js/#JavaScript_Compiler

OMeta is based upon PEGs and is worth a look because it has some great meta-meta ideas from the 60's that were ignored.

(Also, there are several ports of OMeta to other languages)


ECMAScript 5 PEG parser: http://boshi.inimino.org/3box/PEG/build/ES5.html

(Hmm, I must definitely try to port it to PEG.js :-)

As for Ruby, I am not sure it is even possible to create a PEG grammar. Its lexer and parser are heavily interconnected and there is a lot of state involved. If such a grammar were created, I don't think it would be pretty.


http://treetop.rubyforge.org/

I'm writing a programming language in Ruby at the moment, and after playing with that for a few days I've gone back to a hand-coded lexer plus Racc for the grammar. It's probably personal preference, but I find that a little more tunable.


What exactly didn't you like?


Just personal preference: I prefer the grammar structure for parsers like Racc


They're native in Perl 6, and people above have linked to libraries for Lua and Clojure. I think there's a library for Python as well somewhere or other.

Also peg/leg for C is intended (iirc) as a direct competitor to flex/bison et al.


I'm confused about the grammar example.

How does: multiplicative : primary "* " multiplicative { return $1 * $3; }

return the product of two integers?


HN formatting messed up your asterisks. The whole rule is:

    multiplicative    : primary "*" multiplicative { return $1 * $3; }
                      / primary
"primary" refers to another rule in the grammar, which is defined as integer or a parenthesized expression. This snippet defines "multiplicative", which is either a primary, or a primary followed by an asterisk followed by another multiplicative. (This recursively expands to allow any number of primaries, separated by asterisks.)

The value of the multiplicative expression is the value of the first term times the value of the third term. (The second term is just the literal string "*".)
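
Hand-translating those two alternatives into a plain recursive-descent function (a sketch, not actual PEG.js output; parsePrimary here only matches integer literals, for brevity) shows how $1 * $3 yields the product:

```javascript
// multiplicative : primary "*" multiplicative { return $1 * $3; }
//                / primary
function parseMultiplicative(input, pos) {
  const p = parsePrimary(input, pos);
  if (!p) return null;
  if (input[p.pos] === "*") {
    const rest = parseMultiplicative(input, p.pos + 1);
    if (rest) return { value: p.value * rest.value, pos: rest.pos }; // $1 * $3
  }
  return p; // second alternative: just a primary
}

function parsePrimary(input, pos) {
  // primary: here just an integer literal
  const m = /^\d+/.exec(input.slice(pos));
  return m ? { value: parseInt(m[0], 10), pos: pos + m[0].length } : null;
}

console.log(parseMultiplicative("2*3*4", 0).value); // 24
```

Each recursive call handles the remainder after one "*", so the recursion bottoms out at the last primary and the products multiply up on the way back.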


On the face of it, however, the grammar is problematic. There's a reason it uses "additive", "multiplicative" and "primary" rather than the more traditional expression, term and factor: by using commutative operators like + and *, and leaving out - and /, it disguises the fact that the grammar actually evaluates the expression from right to left, rather than the expected left to right.


I knew somebody would raise this point :-) You are right that the evaluation order would be wrong for "-" and "/".

I will probably implement support for left recursion in PEG.js - it is possible (see e.g. http://www.vpri.org/pdf/tr2007002_packrat.pdf). After that, the grammar could be rewritten to evaluate in the correct order.

(Another alternative - which works right now - is to change the parsing expressions to something like "additive ([+-] additive)*" and deal with the whole chain of operations with the same priority at once. I didn't use this in the example as I wanted it to be as simple as possible.)
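
The "chain of same-priority operations" fix can be sketched in JavaScript (hypothetical action code, not PEG.js syntax): once the rule has matched a first operand and a list of (operator, operand) pairs, the action folds them left to right, which gives the correct order for - and /.

```javascript
// Fold a matched chain like: primary (("-" / "/") primary)*
// pairs: array of [op, value] in source order.
function evalChain(first, pairs) {
  return pairs.reduce(
    (acc, [op, v]) => (op === "-" ? acc - v : acc / v),
    first
  );
}

console.log(evalChain(8, [["-", 3], ["-", 1]])); // 4, i.e. (8 - 3) - 1
// Right-recursive evaluation would instead give 8 - (3 - 1) === 6.
```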


Don't use that paper! I tried to use the parsing technique in that paper while working on a parser for use at the CME, and it caused me weeks of headaches.

First, you'll note the algorithm is extremely complicated--nothing like the simple top-down algorithm that makes PEGs so attractive. Not only is it complicated, it misses basic refactoring issues--some logic is duplicated across functions, and the functions interact in ugly ways.

Second, it doesn't even handle left recursion correctly. Throw a ruleset like this at their parser

A -> B "a"

B -> C "b"

C -> B / A / "c"

and it will explode into a million little pieces, because the authors did not account for any recursive rule having multiple recursion points. Don't even try something like

S -> A / B

A -> A "a" / B / a

B -> B "b" / A / "b"


Interesting, thanks for the warning. I only skimmed through the paper today and noted that the algorithm seems complex, but I didn't attempt to understand it in detail.

What was your final result? Did you implement left recursion the way the paper describes, invent or find some other way, or abandon the whole idea?


The approach I'm working on uses the same "growing the seed" idea, but in a different way.

It involves the memo entries being able to remember which left-recursive results they are dependent on. This way, when a left-recursive rule produces a result that is dependent on itself, it knows that this match can possibly be "grown" through repeated iterations. That's a basic sketch of the idea. Performance properties remain the same in the case of left-recursive rules that are not interdependent. I don't really know what they are like for large numbers of interdependent left-recursive rules--but if you have a language like that, better to use an Earley or GLR parser.

I'm still working on it. It passes a battery of test cases, two of which I posted above, but I'm not 100% confident in it just yet. Also, as posted above, I'm trying to get permission from the higher-ups to release the code into the wild.


Neat, I'm onsite at a company right down the block from CME's Clearing devs. Buy you coffee sometime?


Sure--why don't you e-mail me: cynicalkane at gmail dot com


That's not an issue for associative operators, so long as you keep track of precedence. x * (x * x) == (x * x) * x. But it may become an issue for more general grammars that need to handle left-recursion in a non-associative fashion.

The issue is that a top-down grammar cannot handle left-recursion, since it naturally leads to infinite recursion when being parsed from the top down.

I actually had to handle the left-recursion problem for a PEG I wrote in-house--right now I'm trying to convince my boss to let me open-source it :)


The way I would normally handle what would otherwise be left recursion in a top-down parser is to introduce loops.

    term ::= factor ( ('*' | '/') factor )* ;
If you introduce regex-like closure operators to model loops, life gets much easier, and you can build your trees the right way around.

How applicable this is to PEGs is another matter though. I generally write my parsers by hand, and I parse expressions with operator precedence.
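
That loop rewrite can be sketched in JavaScript (a hypothetical token-array input of numbers and operator strings, standing in for a real lexer): the while-loop replaces the left recursion and builds a left-leaning tree.

```javascript
// term ::= factor ( ('*' | '/') factor )* ;
function parseTerm(tokens) {
  let i = 0;
  let node = tokens[i++]; // factor: a number token, for brevity
  while (tokens[i] === "*" || tokens[i] === "/") {
    const op = tokens[i++];
    const right = tokens[i++];
    node = { op, left: node, right }; // tree grows to the left
  }
  return node;
}

const tree = parseTerm([8, "/", 2, "/", 2]);
// tree is { op: "/", left: { op: "/", left: 8, right: 2 }, right: 2 },
// i.e. (8 / 2) / 2 -- left-associative, as expected.
```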


This can be done in PEGs, using term rewriting or simple tree rewriting to create a PEG construct like the loop you described. But doing so subverts the elegance of the PEG parser, and I'm unaware of a way to make it suffice as a general solution.


The thing is, LL(1) is almost always all you need in practice. Left recursion doesn't buy you a lot more in practice than the loops I describe. But I also don't think PEGs are particularly elegant - I think the translation of LL(1) (including loops) to recursive descent is perfectly straightforward, and I think backtracking combined with memoization is a hack.

And for stuff that LL(1) can't handle, I think GLR is a better approach.


You misunderstand PEGs.

In the simplest version, each rule can be seen as a function that calls other rules, passing along a memo table as it goes. It's entirely stateless and functional, except for the memo table. In addition, each function is only a few lines.

Backtracking isn't a hack--it's the natural result of statelessness. There's no cursor that needs to be backtracked.


I think you don't understand my objection. It's that backtracking is very seldom necessary at all in the first place, whether or not you get rid of the time cost of it. It's like doing something inefficient in the name of source code elegance, then patting yourself on the back when you got rid of the big-O inefficiency. My point is that you needn't have been so inefficient to begin with, and that the more efficient approaches don't suffer unduly from inelegance, and to top it all off, they have a lower constant factor.


The advantage of PEGs is that they are easily understood, easy to code, and easy to code for. In 99% of applications nowadays, these are the only considerations. Performance is a secondary concern or not a concern at all. Given this, the user shouldn't have to worry about backtracking, ambiguity, or lookahead.

Asking the user to have to worry about this stuff, when he doesn't have to, is not good software design. Not all of us are writing high performance parsers for complicated languages.


Thanks. What I don't get from reading the grammar is how, as you say, multiplicative is either a primary or a primary followed by a multiplicative. Clearly I understand that it needs to be defined this way, but is that "either" part inherent to PEGs, or to this implementation? I don't get it from the syntax.


The "/" at the start of the second line is a choice operator. For example, this is how you would define a term a that can match either b or c:

   a: b / c
(The asterisks in my post were escaped because they were in a code block, or because there were not two of them in the same line.)


btw, how did you escape that asterisk in quotes?



