

Colm programming language released: best parser-writer ever - edwardog
http://www.complang.org/colm/

======
ScottBurson
Thurston claims that no previous grammar system supports his three
requirements of generalized parsing, grammar-dependent scanning, and context-
dependent parsing.

I would argue that Prolog Definite Clause Grammars, which date back to the
early 1970s, have all three of these properties. Furthermore, since the
context is maintained functionally, by threading additional values through the
productions, no "undo actions" are required; Prolog's built-in backtracking is
all that's needed.
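
As a rough illustration of context threading (in Python rather than Prolog, and not a real DCG), here is a minimal sketch of a backtracking parser whose context is an immutable value passed from production to production. The combinator names `token`, `seq`, and `alt` and the toy grammar are hypothetical, chosen only for this example.

```python
def token(word, update=lambda ctx: ctx):
    """Match `word`; `update` maps the old context to a new one."""
    def parse(text, pos, ctx):
        if text.startswith(word, pos):
            yield pos + len(word), update(ctx)
    return parse

def seq(first, second):
    """Run `first`, then `second` on every result `first` produces."""
    def parse(text, pos, ctx):
        for mid, ctx1 in first(text, pos, ctx):
            yield from second(text, mid, ctx1)
    return parse

def alt(*choices):
    """Try each alternative from the same position and the same context."""
    def parse(text, pos, ctx):
        for choice in choices:
            yield from choice(text, pos, ctx)
    return parse

# The context is an immutable frozenset of declared names (an assumption
# made for this sketch).
declare_x = token("x", lambda ctx: ctx | {"x"})
grammar = seq(alt(seq(declare_x, token("!")), token("x")), token("y"))

# The first alternative adds "x" to its context and then fails on "!".
# The second alternative starts from the untouched original context.
results = list(grammar("xy", 0, frozenset()))
```

The failed alternative's context is simply dropped; nothing has to be reverted because the original value was never modified.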

Of course, the problem with DCGs is performance: they're exponential in the
worst case. But I think they deserve mention in a dissertation like this
anyway. Also, any backtracking parser risks exponential worst-case
performance; it will be interesting to see how Colm avoids this fate (I've
only read the first few pages so far).

~~~
thurston
Where is the grammar-dependent scanning?

Note that threading the context through the parse tree while maintaining fully
generalized parsing requires keeping all versions of the parsing context in
memory. Consider making a C++ parser in that way, i.e. every time you modify
the structures you build in memory, you make a copy of them first.

~~~
swannodette
If your data structures are persistent data structures you don't incur the
costs of copying.

~~~
thurston
Then you forgo generalized parsing.

Edit: indeed I did not follow what you meant.

~~~
swannodette
I don't follow.

~~~
thurston
To have both generalized parsing and context dependent parsing you need to
somehow revert changes to the global state when you backtrack. You can do this
by restoring it to old versions, which requires copying the state and keeping
the history. The Colm approach is to keep only one global state, but store
instructions for programmatically reverting the state while you backtrack.
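
A minimal sketch of that idea, assuming a dictionary as the global state: each write pushes an inverse operation onto a log, and backtracking replays the log in reverse down to a checkpoint. The class `UndoLog` and its methods are hypothetical names for illustration only; this is not Colm's actual implementation.

```python
class UndoLog:
    """One mutable global state plus a log of inverse operations."""

    def __init__(self):
        self.state = {}   # the single global parsing context
        self.log = []     # stack of closures, each reverting one change

    def set(self, key, value):
        # Record how to undo this write before performing it.
        if key in self.state:
            old = self.state[key]
            self.log.append(lambda: self.state.__setitem__(key, old))
        else:
            self.log.append(lambda: self.state.pop(key, None))
        self.state[key] = value

    def mark(self):
        # A checkpoint is just the current length of the log.
        return len(self.log)

    def rollback(self, mark):
        # Run inverse operations in reverse order back to the checkpoint.
        while len(self.log) > mark:
            self.log.pop()()


ctx = UndoLog()
ctx.set("typedef:T", True)      # made before the speculative parse
checkpoint = ctx.mark()
ctx.set("typedef:U", True)      # made while trying one alternative
ctx.set("typedef:T", False)
ctx.rollback(checkpoint)        # that alternative failed: backtrack
```

Only one version of the state exists at any time; the cost of backtracking is proportional to the number of changes being reverted.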

~~~
jules
The point of functional data structures is that if you modify, you don't copy
the whole structure, you only copy the path to the things that changed.

A simple example would be parsing a sequence of characters and storing that in
a linked list. When you get an additional character you don't destructively
modify the list, instead you create a new list node that points to the
existing list.

    
    
        # Node is assumed to be an immutable (char, rest) cell.
        def process(char, state):
            return Node(char, state)
    

The process function takes a character and a state, and returns a new state.
The old state is still usable, as nothing was mutated. On the other hand no
large data structure had to be copied.
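
To make the sharing concrete, here is a small runnable sketch. It repeats `process` from above and assumes `Node` is an immutable two-field cell; the `namedtuple` definition and the helper `to_string` are additions for illustration.

```python
from collections import namedtuple

# Hypothetical immutable list cell; `rest` is the previous state (or None).
Node = namedtuple("Node", ["char", "rest"])

def process(char, state):
    return Node(char, state)

def to_string(state):
    # Walk from the newest cell back to the oldest.
    chars = []
    while state is not None:
        chars.append(state.char)
        state = state.rest
    return "".join(reversed(chars))

s1 = process("a", None)
s2 = process("b", s1)      # one alternative
s3 = process("c", s1)      # a different alternative from the same point
```

Here `s2` and `s3` both point at the same `s1` cell, so trying a second alternative costs one new node rather than a copy of the list.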

Your method is probably more efficient. Because you don't need access to all
historical versions simultaneously, you only have to keep a stack of undo
operations. So you only have O(1) slowdown per operation. With functional data
structures you have, in the worst case, an O(log n) slowdown per operation.

For example if you implement a bit vector, then update and lookup can both be
O(log n), e.g. a red-black tree. If you want update to be O(1) then lookup
becomes O(n), e.g. a linked list. If you want lookup to be O(1) then update
becomes O(n), e.g. copying the entire vector on update.
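
The O(log n) case can be sketched with path copying. This uses a fixed-shape binary tree over a vector whose length is a power of two, not a red-black tree, and the function names `build`, `lookup`, and `update` are hypothetical.

```python
def build(n):
    # Tree over n = 2**k zero bits. Leaves are ints, inner nodes are
    # (left, right) tuples. Sharing subtrees is safe: nothing is mutated.
    if n == 1:
        return 0
    half = build(n // 2)
    return (half, half)

def lookup(tree, n, i):
    # Descend one level per iteration: O(log n).
    while n > 1:
        n //= 2
        if i < n:
            tree = tree[0]
        else:
            tree = tree[1]
            i -= n
    return tree

def update(tree, n, i, bit):
    # Rebuild only the nodes on the path to bit i: O(log n) new tuples.
    if n == 1:
        return bit
    half = n // 2
    left, right = tree
    if i < half:
        return (update(left, half, i, bit), right)
    return (left, update(right, half, i - half, bit))

v0 = build(8)
v1 = update(v0, 8, 5, 1)   # v0 is still a valid all-zero vector
```

After the update, the untouched left half of `v1` is the very same object as the left half of `v0`; only the three nodes on the path to bit 5 are new.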

(please correct me if I'm wrong and if you can do better than O(n) for those
cases)

~~~
thurston
Yes, I understand these techniques; in fact they are heavily used in Colm,
just not for maintaining the global state that is used for the parsing
feedback loop.

I use the term "copy" in the general sense to make discussion easier.
Conceptually, persistent data structures are a copy, even if they are
optimized heavily to incur very little cost.

------
jws
Notice that the DNS example is parsing a binary DNS request, not a text file.

~~~
thurston
:) If I had my way this comment would be closer to the top. Not many grammar-
based parsing systems can claim raw DNS parsing.

------
haberman
From my quick scan of the thesis, the basic design seems to be a programming
language in which you write _both_ the parser _and_ any transformations you
want to perform. It's not clear whether there is an easily-accessible parse
tree serialization that you can use to load the output into another language,
or whether you'd have to invent that yourself.

I think it's generally a hard sell if you try to convince people that they
need to write their algorithms in your special language. Parsing tools deliver
value because grammars are easier to write than the imperative code that
implements those grammars. That value offsets the cost of having to learn a
new special-purpose language. But imperative programming languages are already
pretty good at tree traversal and transformation, so there's little benefit to
using a special-purpose language for this.

I think that the next big thing in parsing will be a runtime that easily
integrates into other languages so that the parsing framework can handle
_only_ the parsing and all of the tree traversal and transformation can be
performed using whatever language the programmer was already using. This
requires much less buy-in to a special-purpose language.

~~~
thurston
Colm has built-in serialization. There is still some work to do in this area
though. Colm will preserve whitespace for minimal disruption of untransformed
text, but figuring out what to do at the boundaries between modified and
unmodified trees can be tricky.

You are right, people want to use general purpose languages for the more
complex algorithms. I agree a means of embedding is necessary and I have kept
this in mind, though not yet achieved it. I would very much like to be able to
parse, transform, then have the option to import the data into another
environment and carry on there.

~~~
haberman
Thanks for the info. What is the built-in serialization format?

~~~
thurston
Just plain old text as it came in. I see now that is not what you were
referring to. I now think you're talking about JSON, XML, etc.

There is also a print_xml function, which puts the tree into XML, but it's
mostly used for debugging at this point, not export to other systems. I'm
hoping that with time these kinds of features will crop up.

------
bdfh42
Quote "Colm does not yet have any documentation".

Then I would hazard that it is not yet a language as without documentation it
has no "grammar". At best it is a patois.

~~~
thurston
Grammar: <http://svn.complang.org/colm/trunk/colm/lmparse.kl>

------
colomon
It would be interesting to see someone who understands both this and Perl 6's
grammars do a comparison. Based on Colm's quick description and my rough
understanding of Perl 6 grammars, they sound roughly equally powerful. But I
admit I'm not sure I understand what "transformation language" means...

~~~
audreyt
Although similar in expressive power, Colm offers instruction logging to auto-
reverse global state changes upon backtracking, something Perl 6 grammars do
not (yet) support; at the moment we need to manually manage them with embedded
blocks.

~~~
chocolateboy
Re: "reverse global state changes upon backtracking": this sounds similar to
the (manual) "undo actions" supported by the Kelbt parser [1], perhaps
unsurprisingly as it was developed by the same author :-)

[1] <http://www.complang.org/kelbt/>

------
Twisol
Adrian Thurston (the creator of Colm) is also responsible for the fantastic
Ragel state machine generator.

------
DrCatbox
I am more interested in DSNP. How come this project has not received more fame
than the infamous Diaspora? <http://www.complang.org/dsnp/>

~~~
thurston
There are some difficult problems in that space. I've posted to HN and reddit
a few times, but mostly I've been working on it quietly so I can focus.
Lately, that's starting to change. I'll be talking about it at FSW 11 in
Berlin in a few weeks.

~~~
DrCatbox
My google sense failed me this time around to find information on this FSW 11
in Berlin. Care to explain? Is it a conference, can anybody come?

I am really interested in DSNP and am fairly well versed in GNU/Linux and can
do some programming, Java and Python mostly. I work as a web-frontend
developer guy. Can I be of some help? Do you need testers, peers, documenters?

~~~
thurston
Ya it's currently hard to find. <http://d-cent.org/fsw2011/>

I need help from people like you, actually. What I've done:

1. defined the protocol

2. implemented it in a C++ daemon that

   a) talks to other daemons

   b) serves the content managers (frontend UIs)

3. written a (crappy) example content manager.

What needs to happen next is step 3 needs to be repeated by other people who
know what they are doing. They don't need to understand the details of the
protocol, they just need to understand the basic model, which is just message
broadcast, distributed agreement, etc.

Email me for more details; I will get back to you later tonight.

------
Barrasmara
This kind of sounds like Semantic Designs' DMS Software Reengineering Toolkit
and the PARLANSE language.

~~~
thurston
They are related systems. DMS is much more mature.

