Hacker News new | past | comments | ask | show | jobs | submit login
Parser combinator library in C (github.com/wbhart)
91 points by wbhart on Aug 7, 2012 | hide | past | favorite | 28 comments

I don't know why it uses a garbage collector when simple reference counting would suffice. Also I hate libraries with global states/variables like this one ...

I will likely remove the global state at some point, but the garbage collector is not something I plan to implement myself.

How would you implement refcounting in the presence of longjmp? I don't see a graceful way to do it.

I am not sure I grok the code, but I think something along the following lines would work:

  - a 'recovery' list of (object pointer, old refcount) pairs
  - whenever you allocate something, add (object pointer, 0) to the list
  - when you update a reference count, check whether the object is in the list.
    If it is not, add it to the list, thus remembering the original refcount
  - whenever the code decreases reference count to zero, add the object to a
   'to be deleted' list (this could possibly be a linked list chained through
    the 'refcount' fields; the old refcount will be in the recovery list)
  - when the parser reaches toplevel without longjmp, clear the first list,
    and delete any objects in the 'to be deleted' list.
  - when a longjmp occurs, walk the list and reset reference counts. For
    zero 'old refcounts', delete the object
Elegant? Not really, but with proper macros/functions, it should not be much less elegant than reference counting on its own.

That's pretty close to Deutsch-Bobrow deferred referenced counting. If you decide to implement it, you may want to read their paper about the technique.

I don't like C libraries that use longjmp for error handling. It's like killing a fly with a bazooka. There are some cases when longjmp is useful of course, but If you want proper exceptions, use a language that supports them.

The OP is doing awesome stuff and sharing it with the world, and you want to criticize an implementation detail?

Bikeshedding like this isn't useful to anyone.

I appreciate the OP's efforts, certainly sharing code is a good thing. And one of the benefits of the sharing is to get a lot of feedback. Criticizing implementation details is important to spot, discuss and hopefully fix potential problems.

But the only actionable advice you can take away from the criticism is "rewrite it in C++". I don't think that's in any way useful and I don't think it should be made.

Right - "Don't use longjmp, do X in C instead" would have been a useful comment (particularly with example code). "Don't use longjmp, use C++ instead" is much less useful.

I agree that my comment was a bit harsh, but that's not what I said. I said that if he wanted exceptions he shouldn't have used C.

As for the alternatives, well return values never killed anyone, did they? You can use gotos to simplify the control flow within the functions. Other than that, there really isn't much to explain.

I don't like longjmp because I never expect them in C. I never think "hey, the control flow might jump to some point 20 stack frames above at any moment when I enter this library". And then I use mutexes. Cue the drama. You can't even protect yourself from the stack unwinding with handler-case or try ... catch/except.

This. The headline was a parser combinator in C. The author is working with what he's got. Can't argue with the hangups that come with it, unless he makes some claim that his language choice is somehow superior to another, which I don't see happening here.

If I release code, having it criticized is a benefit. I want to learn from others, hear what they think of the design decisions, what they would have done differently, and what they think I did stupidly.

If you criticize my work, you're doing me a favor.

could anyone explain the use for this library ?

Basically, the general idea of parser combinators is that you want to be able to express your parsing grammar in the sort of high-level description that people actually talk in, not in the low-level description that you get when you are constantly looking at the element string[i] in some switch statement in some function. Here is a simple example:

    parseAddress = let
            hexStr2Int = Prelude.read . ("0x" ++)
        in do
            start <- many1 hexDigit
            char '-'
            end <- many1 hexDigit
            return $ Address (hexStr2Int start) (hexStr2Int end)
(taken from http://therning.org/magnus/archives/289 .)

In actual words, "here is the function for converting a hex string to an integer, now define the variable `start` as the parse of many (or one) hex digits, then there is a hyphen, then the variable `end` is another parse of many (or one) hex digits. Convert `start` and `end` into numbers and put them into an Address object."

There are three further blog posts on Haskell+Parsec from the above blog: http://therning.org/magnus/archives/290 , http://therning.org/magnus/archives/295 , http://therning.org/magnus/archives/296 .

If you are interested about how such things are generally implemented, this Strange Loop talk is interesting: http://www.infoq.com/presentations/Parser-Combinators

What's probably the most interesting thing to speak of here is simply the fact that it's written in C, rather than in a "normal" functional language. You'll see in the latter presentation for example that there is a sort of "circuit wiring" approach which you can do with Lisp functions, essentially using function calls a(x) in order to describe a directed graph vertex from x to a.

I knew that you could pass pointers to functions in C, but I was not aware that it was sophisticated enough to build a parser combinator library, so I might brush up on my rusty C skills to see if I can understand the details of the implementation here.

    parseAddress = 
       let hexStr2Int = Prelude.read . ("0x" ++)
           in do start <- many1 hexDigit
                 char '-'
                 end <- many1 hexDigit
                 return $ Address (hexStr2Int start) (hexStr2Int end)

Or in applicative style (using attoparsec), which I like a lot more than the monadic style, because you can read it quite literally from the left to the right:

    parseAddress = Address <$> hexDigits <*> (dash *> hexDigits)
          hexDigits = string "0x" *> takeWhile1 hexadecimal
          dash      = char '-'

Back to Parsec using applicative style, and parsing the same thing as the original.

    parseAddress = Address <$> hexDigits <*> (dash *> hexDigits)
        hexDigits = read . ("0x" ++) <$> many1 hexDigit
        dash = char '-'

It's not bad, but the "0x" is part of the conversion into an integer, not part of the parsed address. That is, the more-complicated-looking combinator converts the string "03-c0" to `Address 3 192`, while it looks like yours converts the string "0x03-0xc0" to to `Address "0x03" "0xc0"`. (I'm not totally sure about that; I'm still very new to Haskell. Thanks for the heads-up on attoparsec though.)

The (* >) operator [1] (the space in the middle is just to circumvent HN's rather dumb comment formatting) sequences its two arguments, ignoring the one on the left. So

    string "0x" *> takeWhile1 hexadecimal
parses a literal string "0x", then one or more hexadecimal digits, and the result is just the hexadecimal digits.

[1] http://hackage.haskell.org/packages/archive/base/latest/doc/...

    {-# LANGUAGE OverloadedStrings #-}
    import Control.Applicative
    import qualified Data.Attoparsec.Text as P
    import qualified Data.Text as T
    data Address = Address {start :: Int, end :: Int} deriving Show
    address = Address <$> hexDigits <*> (dash *> hexDigits)
          hexDigits = P.string "0x" *> P.hexadecimal
          dash      = P.char '-'
    parse parser str = P.feed (P.parse parser $ T.pack str) T.empty

Put the above into a file like 'parse.hs'.

    ~> ghci parse.hs

    *Main> parse address "0x1-0x1"
    Done "" Address {start = 1, end = 1}

You might need to install attoparsec beforehand:

    cabal install attoparsec

The major parser libraries that I could find don't allow adding syntax at runtime or to add operators to the expression hierarchy at runtime. Many (though not all) of them also make it difficult to do custom error reporting. These were my major motivations.

Ah right, interesting use-case. I usually think of combinator parsing's advantages versus table-driven parsing (e.g. LALR) being primarily its expressiveness (and amenability to a functional-programming style), but it has a nice runtime-modifiable aspect as well.

I wonder how hard it'd be to build a runtime-modifiable table-driven parser, though. Assuming table compilation is fast enough, the most straightforward approach might be to just link in the grammar-compilation code so it's available at runtime, and rebuild the table each time a modification is made. I don't think you'd want to be doing complex incremental surgery on a table-driven parser, but you might not need to.

Not C .. but are you familiar with Ometa (and its variations, Ometa/JS, Ometa#, PyMeta, etc?) http://tinlizzie.org/ometa/

It's a turing-complete parser specification language with an interpreter that is very close to being a packrat parser, which embraces adding and extending syntax (in a more general way than your library does, if I understand correctly).

Implementing the same idea in C should be possible. I have no idea why you want to add syntax at runtime, but the Ometa model might fit your plans better.

FParsec's OperatorPrecedenceParser component[1] does support adding (and removing) operators to the expression grammar at runtime.

[1] http://www.quanttec.com/fparsec/reference/operatorprecedence...

Doesn't Boost.Spirit [1] allow adding syntax and operators at runtime? (Seriously asking; have never used Spirit)

[1] http://boost-spirit.com/home/

nice, but would be better to have something simpler that doesn't rely on various libraries and features that are typically forbidden in high performance environments.

incidentally almost every memory leak I deal with comes from not being allowed to allocate my own memory and having to fiddle around with gc and refcount rubbish.

I have written whole non-trivial games with no detectable leaks.

Could you provide some more information about "features that are typically forbidden in high performance environments".

Probably you want a parser expression grammar with a packrat parser if you really want speed anyway.

At the moment I am not smart enough to see how to implement combinators in C in an elegant way without GC of some form. But I'll think about it. Maybe something will occur to me.

Patches are welcome of course.

(without stopping to check for leaks half way through)

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact