
A parser combinator library. I'm writing a tool that will do static analysis of SQL (in a very limited fashion; it's a build tool rather than a full static analyzer, but I need to understand the dependency relationships between statements). I started out using `nom`, but found it an imperfect match for my needs: underpowered in the areas I wanted and overpowered in areas my project didn't need. `nom 8` came out with some interesting simplifications, but it happened to break my code in a way that would have been awkward to fix. So I bit the bullet and started writing my own library.

My library is specialized for parsing text. That has enabled some cool capabilities.

It comes with a `Span` primitive, which tracks where in a file a token came from, for implementing error messages. A `Span` can be the input or the output of a parser. At the start a `Span` is an entire file, and as you slice and dice it, each piece keeps the metadata of where it came from.
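
Roughly, the idea looks like this (a heavily simplified sketch; the real fields and signatures may differ):

    // Heavily simplified illustration of the span idea.
    #[derive(Clone, Copy, Debug)]
    struct Span<'a> {
        text: &'a str, // the slice of the file this span covers
        offset: usize, // byte offset of `text` within the original file
        line: u32,     // 1-based line number of the first byte
        column: u32,   // 1-based column number of the first byte
    }

    impl<'a> Span<'a> {
        // A span starts out covering the whole file.
        fn whole_file(file: &'a str) -> Self {
            Span { text: file, offset: 0, line: 1, column: 1 }
        }

        // Slicing narrows `text` while keeping track of where the piece came from.
        fn slice(&self, start: usize, end: usize) -> Span<'a> {
            let skipped = &self.text[..start];
            let line = self.line + skipped.matches('\n').count() as u32;
            let column = match skipped.rfind('\n') {
                Some(i) => (skipped.len() - i) as u32,
                None => self.column + skipped.len() as u32,
            };
            Span { text: &self.text[start..end], offset: self.offset + start, line, column }
        }
    }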

Along with the standard `Sequence` (combining parsers in a set order) and `Choice` (branching between many parsers) operations that parser combinators are built around, I have come up with two operations that are very handy. I suspect that others have invented them before; they are both patterns I used in `nom`. (I've also only skimmed the original paper; they could be in there and I just didn't see them.)

One of them is called `Compose`. As an alternative to a `Sequence`, instead of a group of parsers consuming the input in order, the first parser consumes the input and each subsequent parser consumes the output of the previous one. This is useful, for instance, when implementing escapable strings: the first parser grabs the entire string, and the second one transforms the escape sequences. (There is a mechanism for transforming the content of a `Span` while retaining its metadata.)
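
In terms of plain closures over `&str` (just an illustration of the shape, not my actual signatures), it's something like:

    // Illustration only: the Compose idea in terms of plain closures over &str.
    // `outer` consumes from the real input and yields (rest, matched);
    // `inner` then parses the *matched text* rather than the remaining input.
    fn compose<'a, B>(
        outer: impl Fn(&'a str) -> Option<(&'a str, &'a str)>,
        inner: impl Fn(&'a str) -> Option<B>,
    ) -> impl Fn(&'a str) -> Option<(&'a str, B)> {
        move |input: &'a str| {
            let (rest, matched) = outer(input)?;
            let value = inner(matched)?; // runs on outer's output
            Some((rest, value))
        }
    }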

The other is called `Fuse`. This is a small twist on `Sequence`: after matching the parsers in order, the results are concatenated into a single token. This is useful for a "pattern matching" primitive, where you want to find a series of tokens in order but don't want them split into separate tokens; you want them all together.
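
Illustratively (same caveat as above, this isn't the real API), with sub-parsers that just report how many bytes they matched:

    // Illustration only: Fuse over two sub-parsers that report bytes consumed.
    // Instead of handing back ("tok_a", "tok_b"), hand back one token covering
    // everything the sequence matched.
    fn fuse<'a>(
        a: impl Fn(&str) -> Option<usize>,
        b: impl Fn(&str) -> Option<usize>,
    ) -> impl Fn(&'a str) -> Option<(&'a str, &'a str)> {
        move |input: &'a str| {
            let n = a(input)?;
            let m = b(&input[n..])?;
            Some((&input[n + m..], &input[..n + m])) // (rest, fused token)
        }
    }

So matching a sign, some digits, a dot, and more digits as a `Fuse` hands back "-12.5" as one token instead of four pieces.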

It's been a wild ride; there have been a lot of thorny issues. I often think I should've just stuck with `nom 7` instead of shaving this yak. But I've learned a whole lot about writing especially abstract/DSL-y Rust by combining tuples, traits, and declarative macros. There are also other programming language projects I'd like to pursue, and it will be nice to have a tailor-fit tool for parsing text.

Special thanks to dtolnay's `paste`, the real MVP.






I did something like this some time ago [0] and am still using it in prod - feel free to take inspiration from it in case you find it useful.

[0] https://github.com/preludejs/parser


>A parser combinator library.

Cool. I got interested in this subject recently and have been checking out some articles and videos about it. Unfortunately there is not much info available (and some of what exists is advanced), or at least I couldn't find much so far.

I am working on a library for use in other projects; it's not exactly a parser combinator library, but it borrows some of those ideas.

>One of them is called `Compose`.

About the escapable strings example: can you not just rescan the string for the escape sequences, after grabbing the full string?


> Unfortunately there is not much info available [.]

Parser combinators are a bit hard to get into; the most helpful resource for me was `nom`'s "Choosing a Combinator" document [1], which is dense but gives you an overview of all the Lego bricks that you can then start imagining how to fit together.

I've not really read it, but there's also the original paper on the subject [2] (as linked by the `parsec` documentation [3]), which describes the nuts-and-bolts theory behind it.

> [Can] you not just rescan the string for the escape sequences, after grabbing the full string?

Absolutely; this is just a convenience around that pattern, which lets you express it like:

    let string = quoted_string.then(escaped(json_string_escapes)).parse(&input)?;

Where `escaped` does the rescanning using the parser `json_string_escapes`, which either consumes all the input up to the next escape (if the input doesn't start with an escape sequence) or consumes one escape sequence and returns the transformed text. (This API is a little awkward; it may change.)
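
The rescanning itself is nothing fancy; stripped of the `Span` machinery it's roughly this kind of loop (a sketch, not the actual implementation):

    // Sketch only (not the real code): rescan a matched string, replacing
    // escape sequences and copying literal runs through unchanged.
    fn unescape(raw: &str) -> Option<String> {
        let mut out = String::new();
        let mut rest = raw;
        while !rest.is_empty() {
            if let Some(tail) = rest.strip_prefix("\\n") {
                out.push('\n'); // one example escape; a real parser handles more
                rest = tail;
            } else if rest.starts_with('\\') {
                return None; // unknown escape sequence: fail
            } else {
                let end = rest.find('\\').unwrap_or(rest.len());
                out.push_str(&rest[..end]); // literal run up to the next escape
                rest = &rest[end..];
            }
        }
        Some(out)
    }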

And also, more generally, for any parsers `foo`, `bar`, and `baz`:

    let quux = (foo, bar, baz).map().parse(&input)?;

[1] https://github.com/rust-bakery/nom/blob/main/doc/choosing_a_...

[2] https://web.archive.org/web/20140528151730/http://legacy.cs....

[3] https://hackage.haskell.org/package/parsec


Thanks for the reply. I will check out those links.


