Hacker News new | past | comments | ask | show | jobs | submit login
Colm programming language released: best parser-writer ever (complang.org)
87 points by edwardog on May 18, 2011 | hide | past | favorite | 42 comments

Thurston claims that no previous grammar system supports his three requirements of generalized parsing, grammar-dependent scanning, and context-dependent parsing.

I would argue that Prolog Definite Clause Grammars, which date back to the early 1970s, have all three of these properties. Furthermore, since the context is maintained functionally, by threading additional values through the productions, no "undo actions" are required; Prolog's built-in backtracking is all that's needed.

Of course, the problem with DCGs is performance: they're exponential in the worst case. But I think they deserve mention in a dissertation like this anyway. Also, any backtracking parser risks exponential worst-case performance; it will be interesting to see how Colm avoids this fate (I've only read the first few pages yet).

I remember, back in 1985, trying to write a parser in Prolog on a VAX 11/750. I didn't know what I was doing and my code fell into the exponential case trap. Even minimalist examples appeared to crash as they took unreasonably long, so I wasn't getting clues to the problem, and just gave up.

Consequently I want to rephrase the first sentence of your third paragraph. The problem with DCGs is performance: they're exponential in the naive case. That might not sound too bad today, but back in 1985 the hype was that Prolog let you program declaratively. Just declare what a parse looks like. Reasonable performance in the naive case was the popular selling point for Prolog.

Where is the grammar-dependent scanning?

Note that threading the context through the parse tree while maintaining fully generalized parsing requires keeping all versions of the parsing context in memory. Consider making a C++ parser in that way ... ie every time you modify the structures you build in memory you make a copy of them first.

If you know that every subfield of your state object is immutable, and you're using references instead of "local" copies (inconvenient unless you have garbage collection) then your copying costs are limited to the size of the overall state object plus whichever field changed.

Obviously this makes no sense for C++, but Clojure and OCaml get away with it and in Haskell it's the standard way of implementing almost every stateful computation.

If your data structures are persistent data structures you don't incur the costs of copying.

Indeed I did not follow what you meant. I missed the "persistent" and saw ... "don't incur the cost of copying." I took that to mean that you were referring to not maintaining any history and using a single, mutable global state. It is with that understanding that I made the previous comment.

To address what you meant ... it's more a matter of what's practical. For example, to parse C++ one needs to do quite a bit of semantic analysis during parsing. A full implementation of the type system is necessary. Then the data structures change with every statement/clause parsed. Achieving this with persistent data structures is far too costly.

Then you forgo generalized parsing.

Edit: indeed I did not follow what you meant.

I don't follow.

To have both generalized parsing and context dependent parsing you need to somehow revert changes to the global state when you backtrack. You can do this by restoring it to old versions, which requires copying the state and keeping the history. The Colm approach is to keep only one global state, but store instructions for programmatically reverting the state while you backtrack.

The point of functional data structures is that if you modify, you don't copy the whole structure, you only copy the path to the things that changed.

A simple example would be parsing a sequence of characters and storing that in a linked list. When you get an additional character you don't destructively modify the list, instead you create a new list node that points to the existing list.

    def process(char, state):
      return Node(char, state)
The process function takes a character and a state, and returns a new state. The old state is still usable, as nothing was mutated. On the other hand no large data structure had to be copied.

Your method is probably more efficient. Because you don't need access to all historical versions simultaneously, you only have to keep a stack of undo operations. So you only have O(1) slowdown per operation. With functional data structures you have in the worst case a O(log n) slowdown per operation.

For example if you implement a bit vector, then update and lookup can both be O(log n), e.g. a red-black tree. If you want update to be O(1) then lookup becomes O(n), e.g. a linked list. If you want lookup to be O(1) then update becomes O(n), e.g. copying the entire vector on update.

(please correct me if I'm wrong and if you can do better than O(n) for those cases)

Yes I understand these techniques, in fact they are heavily in use in Colm, just not for maintaining the global state that is used for the parsing feedback loop.

I use the term "copy" in the general sense to make discussion easier. Conceptually, persistent data structures are a copy, even if they are optimized heavily to incur very little cost.

Then you misunderstood my point, backtracking and keeping history with persistent data structures is very cheap.

Notice that the DNS example is parsing a binary DNS request, not a text file.

:) If I had my way this comment would be closer to the top. Not many grammar-based parsing systems can claim raw DNS parsing.

From my quick scan of the thesis, the basic design seems to be a programming language in which you write both the parser and any transformations you want to perform. It's not clear whether there is an easily-accessible parse tree serialization that you can use to load the output into another language, or whether you'd have to invent that yourself.

I think it's generally a hard sell if you try to convince people that they need to write their algorithms in your special language. Parsing tools deliver value because grammars are easier to write than the imperative code that implements those grammars. That value offsets the cost of having to learn a new special-purpose language. But imperative programming languages are already pretty good at tree traversal and transformation, so there's little benefit to using a special-purpose language for this.

I think that the next big thing in parsing will be a runtime that easily integrates into other languages so that the parsing framework can handle only the parsing and all of the tree traversal and transformation can be performed using whatever language the programmer was already using. This requires much less buy-in to a special-purpose language.

Colm has built-in serialization. There is still some work to do in this area though. Colm will preserve whitespace for minimal disruption of untransformed text, but figuring out what to do at the boundaries between modified and unmodified trees can be tricky.

You are right, people want to use general purpose languages for the more complex algorithms. I agree a means of embedding is necessary and I have kept this in mind, though not yet achieved it. I would very much like to be able to parse, transform, then have the option to import the data into another environment and carry on there.

Thanks for the info. What is the built-in serialization format?

Just plain old text as it came in. I see now that is not what you were referring to. You're talking about JSON, XML, etc I now think.

There is also a print_xml function, which puts the tree into XML, but it's mostly used for debugging at this point, not export to other systems. I'm hoping that with time these kinds of features will crop up.

AntLR can do this, although it does not work that well. I used the C backend, which is pretty directly ported from the Java backend. C-in-Java-style is pretty awkward.

ANTLR is not the same as what I am describing. ANTLR generates code in each target language: I am talking about a common runtime that all languages call into. Think of it as a "parsing VM." Using this scheme, there would be no need to have separate backends for C and Java, the only thing you'd need to port is the bindings.

Why would that be a good idea? What advantage would a just-in-time parser generator have that a static parser generator does not?

Fast parsing from any language (even slow languages like Ruby) without having to compile and link generated C code for every grammar into the interpreter.

With this approach, you could have a C extension that could load any grammar at runtime and parse it extremely fast.

C/Assembly code is orders of magnitude faster at parsing than generating eg. Ruby that does the parsing.

What you're describing sounds like the Gold parser: http://www.devincook.com/goldparser/

Yes, or my own project, Gazelle: http://www.gazelle-parser.org/

Quote "Colm does not yet have any documentation".

Then I would hazard that it is not yet a language as without documentation it has no "grammar". At best it is a patois.

TXL, its apparently predecessor, is very well documented (http://www.txl.ca/). TXL is a very interesting approach to parsing and worth reading up on if you're interested in the area (or are waiting for documentation for Colm :)

There seems to be a PhD thesis behind, so you should check it for the grammar.

It would be interesting to see someone who understood both this and Perl 6's grammars to do a comparison. Based on Colm's quick description and my rough understanding of Perl 6 grammars, they sound like they are roughly equally powerful. But I admit I'm not sure I understand what "transformation language" means...

Although similar in expressive power, Colm offers instruction logging to auto-reverse global state changes upon backtracking, something Perl 6 grammars does not (yet) support; at the moment we need to manually manage them with embedded blocks.

Re: "reverse global state changes upon backtracking": this sounds similar to the (manual) "undo actions" supported by the Kelbt parser [1], perhaps unsurprisingly as it was developed by the same author :-)

[1] http://www.complang.org/kelbt/


If they are then I don't deserve to be called "Dr. Thurston!"


Interestingly, I parsed thurston's response not as an appeal to authority, but as self-mocking humor (i.e. "wow, if Perl 6 does that too, maybe my PhD was unwarranted!"). I guess English parsing is non-deterministic too... :-)

Good grief. Have a sense of humor; the grandparent response was obviously made with tongue firmly in cheek.

Adrian Thurston (the creator of Colm) is also responsible for the fantastic Ragel state machine generator.

I am more interested in DSNP, how come this project has not received more fame than the infamous Disapora? http://www.complang.org/dsnp/

There are some difficult problems in that space. I've posted to HN and reddit a few times, but mostly I've been working on it quietly so I can focus. Lately, that's starting to change. I'll be talking about it at FSW 11 in Berlin in a few weeks.

My google sense failed me this time around to find information on this FSW 11 in Berlin. Care to explain? Is it a conference, can anybody come?

I am really interested in DSNP and am fairly well versed in GNU/Linux and can do some programming, Java and Python mostly. I work as a web-frontend developer guy. Can I be of some help? Do you need testers, peers, documenters?

Ya it's currently hard to find. http://d-cent.org/fsw2011/

I need help from people like you actually. What I've done.

1. defined the protocol

2. implmented it in a C++ daemon that

   a) talks to other daemons

   b) serves the content managers (frontend UIs)
3. written a (crappy) example content manager.

What needs to happen next is step 3 needs to be repeated by other people who know what they are doing. They don't need to understand the details of the protocol, they just need to understand the basic model, which is just message broadcast, distributed agreement, etc.

Email me for more details, will get back to you later tonight.

This kind of sounds like Semantic Design's DMS software Reengineering toolkit and the Parlanse language.

They are related systems. DMS is much more mature.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact