Hacker News new | past | comments | ask | show | jobs | submit login
Language.js - A fast PEG parser written in JavaScript (github.com/tolmasky)
101 points by ryannielsen on June 11, 2011 | hide | past | favorite | 26 comments



This just got pushed literally today as I gave a presentation on it at CappCon, wasn't expecting it to get posted here, README and so forth are coming.

EDIT: OK, there is a README up, but again, this is not officially "launched" or anything, I was planning on putting the docs and website together this week, so check back later.


Sorry for the pre-announce, then! I was intrigued by the tweets of others at CappCon, and this is the first I've ever heard of a PEG... it felt like something others on HN would like to know about. I'm very interested in language.js and Obj-J 2.0, and look forward to hearing more about both.


Nice. Look forward to hearing more. Currently part of a team that is using node.js with peg.js (http://pegjs.majda.cz/documentation) for a service that defines a grammar for a query language, which is translated into one or more internal service API calls for an enterprise backend. Needs to handle hundreds of req/s, so any performance improvements over peg.js will definitely make your solution more appealing as well.


The "naughty OR" operator, and placing syntax error nodes in the syntax tree, are awesome ideas. I hope someone steals them and adds them to Parsec :)


Can someone summarize what PEG is? I read through the Wikipedia page but can't make heads or tails of it. What is this used for?


PEG is a different way of specifying grammars and an different process for parsing them. If you're used to Bison/Yacc or LALR parsers modeled on Yacc (like Racc), the big things you notice with any PEG parser are:

* You automatically get "EBNF"-style operators like "zero-or-more" or "one-or-more" instead of having to specify recursive nonterminals and epsilons.

* You typically don't need a separate lexer; PEG parser productions resemble regular expressions.

* PEGs use prioritized choice to deal with ambiguities such as dangling-else. PEG grammars are unambiguous.

PEG parsers are much easier to build and work with than Yacc-style parsers. The learning curve on PEG is also way, way shorter than for shift-reduce parsers. You should probably be using PEG parsers whenever possible now.


I've long been a PEG skeptic. All three of these benefits have issues:

(1) LL parser generators have had zero-or-more and one-or-more operators for a long time (see ANTLR for example [1], as well as the venerable Parse::RecDescent [2]). They're very easy to implement in a recursive-descent parser.

(2) In practice, lexer-free grammars are very difficult to maintain in the presence of things like comments and whitespace rules. You sacrifice the ability to reason about layers of the grammar in isolation by binding the two layers together. Try to write a PEG grammar for, say, Java and you'll see what I mean; you end up wishing you had a sort of preprocessor that could get rid of comments and such in advance, which is precisely what a lexer is.

(3) PEGs are unambiguous in the sense that the system specifies how to resolve ambiguities (by taking the first option). But backtracking LL grammars and LR grammars provide this too. LL parser generators typically accept the first rule that matches: what could be simpler? LR parser generators are a little more clever and resolve shift/reduce conflicts in favor of the shift, which is a bit more magical but basically means that productions act "greedy". The key is that grammars always specify how to resolve ambiguities in some way, and this is really no different from what PEGs provide.

[1]: http://www.antlr.org/wiki/display/ANTLR3/ANTLR+Cheat+Sheet [2]: http://search.cpan.org/~dconway/Parse-RecDescent-1.965001/li...


I don't care enough about this stuff to have strong opinions on (1) and (3). Until this year, I wrote my parsers (I end up doing 2-3 a year) in Lex/Yacc.

But (2) is a red herring. Nothing about PEG tools force you to do all your tokenizing in the grammar. If you want to strip /++/ comments out, you can use exactly the same code you'd have used in your handrolled recursive descent parser to do that, and then parse the real stuff in PEG.

But that aside, we use a PEG grammar for Javascript, which has (I assume) all the annoying comment syntax you're referring to, and it doesn't seem like a big deal.


> we use a PEG grammar for Javascript [...]

Anything open-source, or that you're in the mood to share?



For a counterpoint:

* you can get "EBNF"-style operators using traditional LL parsers like ANTLR.

* Likewise, top-down parsers can have integrated lexing also.

* The downside of prioritized choice is that you don't get any insight into what inherent ambiguities your grammar has. The dangling-else ambiguity in C (for example) is a real ambiguity that you have to tell your users about. Prioritized choice hides ambiguities and gives you no hint when you develop your grammars where the ambiguities are.

* PEG-based parsers are less efficient than LL parsers. LL parsers are O(n) in time and O(d) in space (where n is the length of the input and d is the depth of the nesting). To parse a PEG in O(n) time the space complexity becomes O(n) (this is a packrat parser). To parse with O(d) space, time complexity becomes O(n^d) (as with lpeg). And the constant factor for LL parsing is much lower than with packrat parsing.

I think the future is LL. When the tooling gets good enough with LL, I think that the primary motivation for PEGs (that they are easy to use) will evaporate.


I knew someone would call me on this and chose my words carefully --- "LALR parsers modeled on Yacc" --- but am glad for the response. ANTLR is a pain in the ass to integrate; you can get a totally respectable PEG parser in 2k lines of C. What would you use for LL parsing?


I'm working on making my ideal tool: http://www.reverberate.org/gazelle/

It's been a long time coming, but I believe very strongly in it.

"pain in the ass to integrate" -- I can sympathize. I'm working tirelessly to make this as easy to integrate as regexes are in Perl, Python, Ruby, etc.


The name of the pdf linked to on the language.js page is "Packrat Parsers Can Handle Practical Grammars in Mostly Constant Space" (www.ialab.cs.tsukuba.ac.jp/~mizusima/publications/paste513-mizushima.pdf).

They achieve this by adding a Prolog-style Cut operator to the PEG language.


PEG parsers get rid of the dangling else problem with recursive descent parsers where it's unclear in the following statement whether

   if e then s1 else if e then s2 else s3
   
the else s3 should be with the first if or the second if - usually you have to code special rules into your parser. It also allows for a worst-case linear time parse, and there are some language rules you can express with a PEG parser that you can't with a recursive descent parser.


Wouldn't you just have an end symbol for if statements, "end" or "}" or something? That's what I did in my little recursive parser.

Maybe what I should ask is, how does a PEG parser resolve it?


In a PEG you operate under the expectation that all rules will be tried exhaustively, so if one fails you just move on to the next, and if they all fail you drop back a level of recursion.

Thus you can have rules like:

"if <if-expr>" where if-expr can contain "<statements> else" or "<statements> if" or "then <statements>", where statements is a set of predefined keywords, another set of PEG rules, or a regular expression matcher.

And then you can just expect that they'll all be tried at some point.

The main issue with this, of course, is the one you raised, with endings and creating rules for delimiting new blocks. Although it's possible to define a good set of rules, it's still hard to distinguish between a continuation of a statement and the start of a new one, and in most cases you won't get a useful syntax error message using a naive PEG system like the one in PEG.js, because it'll just drop all the way to the top level and then fail at the first character without telling you how far/deep it got before the syntax broke.

So in practice having end symbols is still useful to make the grammar operate sanely. PEG can also be used to execute a relatively flat pass that mostly produces tokens and simple expressions, while other methods are applied to the resulting tree to get the complete structure. This is an approach which I find a little less mind-bending than trying to cram everything into PEG rules, since you can use varying methods of parsing as the situation dictates.


Great idea, but... ungoogleable name choice. Definitely look forward to hearing more about this though.


Nondescript names are de rigueur for JavaScript-related projects. See also: prototype, node, underscore, backbone, processing, reflection, glow...


It's actually the second result for 'languagejs' already, so it's not too bad.


From Google Chrome after following the link on GitHub:

"The website at languagejs.com has been reported as a “phishing” site. Phishing sites trick users into disclosing personal or financial information, often by pretending to represent trusted institutions, such as banks."


Odd, it's just a parked GoDaddy page.


Could really use a readme. Languagejs.com goes to a GoDaddy landing page.

I guess this is what PEG is: http://en.wikipedia.org/wiki/Parsing_expression_grammar


Just a note: Francisco didn't release this. He just pushed it to Github and others have posted it here. I don't think he mentioned it on twitter even, it's a glimpse of something he's hacking on.


Pretty cool stuff. Shame I don't have much of a use for it. :(


No README?




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: