Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> Do you think the use of parser generators is practical also for performance-critical binary data?

That's not a question I consider very important, because prioritizing performance over correctness (safety) is irresponsible.

That said, using a parser generator shouldn't have a significant impact on performance. When a network is involved, the parser isn't going to be the bottleneck anyway.

> video

I am not very familiar with the details of video formats, but I would suggest trying the parser combinator library Meredith mentions in the talk in my previous [1], known as "hammer"[2]. It's very easy to use, and it has specific support for bit-level parsing of binary formats.

> numPersons 2

I highly recommend not using formats like this, because it's more complex than deterministic context-free. Using a length field requires that the parser remember state. It's also redundant information, creating the possibility of a miss-match between the number specified in "numPersons" and the actual number of rows.

A format that uses s-expressions would be safer and more easier to implement:

    (ancestry
      (ancestor
        (person john "John Doe")
        (person jane "Jane Doe")
      )
    )
The closing tag of the s-expression provides a similar benefit as the num* fields by simply marking the end of a complete record/field.

> That's so trivial that I would never want to depend on a parser generator

Is it really "trivial"? Or is your parse that you write in 10 minutes accepting ambiguous or invalid data? Is it actually checking for all of the error conditions? For most real-world grammars, it's unlikely you will catch all of the subtle ambiguities, interactions, and edge cases.

The benefit of using a parser generator is that those entire classes of bugs are impossible.

> check (relational) integrity

In the trivial example you gave, you shouldn't have a grammar that requires checking relational integrity. In a more complex grammar, such checks would be the job of whatever is walking the AST that the parser returns. Parsers interpret syntax, not semantics.

> If I leave away the num* fields above, the parser would need one token of lookahead.

You need at least one token of lookahead anyway. Including the num* fields means the parser needs to remember and manage state.

[2] https://github.com/UpstandingHackers/hammer



>> numPersons 2

> I highly recommend not using formats like this, because it's more complex than deterministic context-free.

Barely. It avoids the complexity of look-ahead and reallocating growing buffers.

> A format that uses s-expressions would be safer and more easier to implement:

Trees are unsuitable for almost anything computationally, at least as authoritative representation. Prefer flat. (Widget hierarchies are one of a few counterexamples where the tradeoffs are different).

This s-expression you gave is much more complex to parse. Granted, you get parsing as a tree for free. But then you have to go inside the tree and see what's there. That there are two persons per ancestor, etc. Navigating trees is hard.

> Is it really "trivial"? Is it actually checking for all of the error conditions?

Yes. Yes. Of course. It's the most boring code in the world. Parse numbers, allocate arrays correspondingly, parse each following table in a loop corresponding to the numbers parsed. You can do it all using scanf. Bail out as soon as a bad line comes in. 20 lines.

> The benefit of using a parser generator is that those entire classes of bugs are impossible.

Yeah. Well, they are not impossible. They are only possible in the parser generator. And interfacing a parser generator is itself not necessarily trivial. Plus, dependencies are a maintenance burden.

The other approach is to make the design so simple (like the relational example) to make them (the classes) nonexistent.

> You need at least one token of lookahead anyway.

Not in the example I gave.

> Including the num* fields means the parser needs to remember and manage state.

I would hardly call that state. It's an immutable number, initialized in sequential code without real data dependencies (the data dependencies are resolved in the format -- you can not go wrong parsing it).




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: