
Learning Parser Combinators with Rust - chwolfe
https://bodil.lol/parser-combinators/
======
YeGoblynQueenne
>> The novice programmer will ask, "what is a parser?"

>> The intermediate programmer will say, "that's easy, I'll write a regular
expression."

>> The master programmer will say, "stand back, I know lex and yacc."

The Prolog programmer will write a Definite Clause Grammar [1], which is both
a grammar and a parser, two-in-one. So you only have to do the easy and fun
bit of writing a parser, which is defining the grammar.

Leaves plenty of time to get online and brag about the awesome power of Prolog
or get embroiled in flamewars with functional programming folks [2].

______________

[1]
[https://en.wikipedia.org/wiki/Definite_clause_grammar](https://en.wikipedia.org/wiki/Definite_clause_grammar)

[2] Actually, DCGs are kiiind of like parser combinators. Ish. In the sense
that they're executable grammars. But in Prolog you can run your programs
backwards so your DCG is both a recogniser _and_ a generator.

~~~
fwip
This is beside your point, but I've found PEG to be a nice step up in
usability from lex&yacc, as again, you only have to write one grammar
definition.

Wiki:
[https://en.wikipedia.org/wiki/Parsing_expression_grammar](https://en.wikipedia.org/wiki/Parsing_expression_grammar)
Example Implementation: [https://pegjs.org/](https://pegjs.org/)

~~~
minxomat
Yep, PEGs are seriously underutilized. Here's a nice introduction using LPeg:
[http://leafo.net/guides/parsing-expression-grammars.html](http://leafo.net/guides/parsing-expression-grammars.html)

------
intertextuality
On the reddit discussion of this [0], someone mentioned using a type of

    fn parse(&self, input: &mut &str) -> Option<Output>

instead of the article's

    fn parse(&self, input: &'a str) -> Result<(&'a str, Output), &'a str>

for composability. I found the article fascinating and plan on going back to
see what an xml parsing implementation based on the former would act like.

[0]:
[https://www.reddit.com/r/rust/comments/bepi63/learning_parse...](https://www.reddit.com/r/rust/comments/bepi63/learning_parser_combinators_with_rust/)

~~~
steveklabnik
This might be the first time I’ve seen a good use for &mut &T, very cool!

For those of you not well-versed in Rust, this is a mutable pointer to an
immutable string. This means that you can change the part of the string you’re
pointing at, but you can’t change the underlying string itself.

~~~
e12e
What does that mean? Is it equivalent to the pointer starting out (say)
pointing to the first letter of the string, but being able to "walk"/iterate
along the length of the string?

~~~
steveklabnik
EDIT: adjusted the diagram to make it more clear that we're not mutating the
&str

Here's some string data, in memory somewhere:

    Hello world

A `&str` is a pointer, length combo. The pointer points to the start of the
string, the length says how many bytes there are:

    (p, 11)
     |
      \
       |
       V
       Hello world

A `&mut T` is a pointer, so a `&mut &str` looks like this:

     p
     |
     V
    (p, 11)
     |
      \
       |
       V
       Hello world

Since it's mutable, this means we can re-assign what it points to, and create
a new `&str` pointing to a later part of the string:

     p
     \------\
            V
    (p, 11) (p, 5)
     |       |
      \      |
       |     |
       V     V
       Hello world

Since the inner &str is immutable, we can't change the underlying string.

Hope that helps!

~~~
tigershark
So it’s basically like a Span in C# if I understood correctly?

~~~
steveklabnik
I am not familiar with the in-memory representation of Slice, but conceptually
at least, yes. We call &str a "string slice" and &[T] more generally a
"slice".

EDIT: apparently you just edited from Slice to Span, reading
[https://msdn.microsoft.com/en-us/magazine/mt814808.aspx](https://msdn.microsoft.com/en-us/magazine/mt814808.aspx)

> First, Span<T> is a value type containing a ref and a length, defined
> approximately as follows:

Yep, exactly the same as Span<T> then.

~~~
tigershark
Sorry, I wrote Slice instead of Span in the original comment... Slice is the
method that returns a Span, which is the equivalent of this Rust concept as
far as I can tell. At some point in the future I'll need to delve into Rust;
it's quite fascinating, but sadly I don't have much time nowadays :(

~~~
steveklabnik
It's all good, on both counts! We'll still be around when you find the time :)

------
xymostech
This was such a wonderful read! I've been getting into Rust recently, and the
sections on dealing with challenges that are specific to Rust were
particularly useful. The way they created a new trait to turn `Fn(&str) ->
Result<(&str, T), &str>` into `Parser<T>` was insightful, and the discussion
of how they dealt with the growing sizes of types was something that I can
imagine myself running into in the future.

Most importantly though, when they started writing `and_then`, my eyes lit up
and I said "It's a Monad!" I think this is the first time I've really
identified a Monad out in the wild, so I enjoyed that immensely.

------
louthy
It doesn't _feel_ very declarative in Rust. Personally, I'm finding it hard to
see the intent (I haven't written a line of Rust in my life, so take that with
a pinch of salt, but I am a polyglot programmer).

Really, Haskell's do notation is the big winner when it comes to parser
combinators, as the direction of the flow of the parser is easy to follow, but
also you can capture variables mid-flight for use later in the expression
without obvious nested scope blocks.

It's possible to capture variables with `and_then` by the looks of it, but any
suitably complex parser will start to end up quite an ugly mess of nested
scopes.

I ported Haskell's Parsec to C# [1], it has LINQ which is similar to Haskell's
Do notation. Simple parsers [2] are beautifully declarative, and even complex
ones, like this floating point number parser [3], are trivial to follow.

[1] [https://github.com/louthy/language-ext](https://github.com/louthy/language-ext)

[2] [https://github.com/louthy/language-ext/blob/master/LanguageE...](https://github.com/louthy/language-ext/blob/master/LanguageExt.Parsec/Parsers/Prim.cs#L452)

[3] [https://github.com/louthy/language-ext/blob/master/LanguageE...](https://github.com/louthy/language-ext/blob/master/LanguageExt.Parsec/Parsers/Token.cs#L287)

~~~
pornel
There are libraries like nom:
[https://lib.rs/crates/nom](https://lib.rs/crates/nom) or combine
[https://github.com/Marwes/combine/blob/master/examples/date....](https://github.com/Marwes/combine/blob/master/examples/date.rs)
that have more declarative-looking syntax. In TFA you intentionally get "raw
Rust" to avoid syntax sugar obscuring what's going on.

------
xixixao
Nice article. I finally gave Rust a try recently. It's really interesting how
new languages evolve, and what "deficiencies" they exhibit. The article, for
example, uses closures, but it's currently impossible in stable Rust to
accept a closure that itself accepts a closure as an argument (though you can
easily rewrite the same pattern with structs). The borrow checker could also
do better at suggesting fixes for common problems (otherwise it's actually
quite elegant).

What struck me while reading this was the use of assert_eq!(expected,
actual), as I've mostly seen the other order. Sure enough, I checked, and the
macro does not define an order. That's unfortunate: testing against a fixed
"expected" outcome is very common, and a standard order would make for a
friendlier testing experience (which, while supported out of the box, in
general isn't great).

On the other hand, Rust's IDE support, built-in linting, is seriously
impressive.

~~~
steveklabnik
We looked at the order for assert_eq, and couldn’t find any real consensus.
Large testing frameworks in a variety of languages use both orderings. So, we
decided to not enforce a particular one.

~~~
skybrian
That seems like an odd decision? It's like saying there is no consensus for
which side of the road to drive on, so let's not pick one. Standardization is
arbitrary but useful here.

~~~
coldtea
How is it useful, since it has no observable effect? Equals is equals!

~~~
skybrian
As many people pointed out, an expected value is not the same as the actual
value. This affects error messages, which are important for ergonomics.

~~~
eridius
The downside to prescribing an order is that the compiler cannot enforce it,
so people absolutely will get it wrong and not notice until much later, when
the test breaks and they get really confused because the message is telling
them the expected value is the actual value and vice versa.

~~~
tomjakubowski
We could have a macro like:

    
    
        test_eq!(expected=(<expr>), actual=(<expr>));
    

Where those "assignments" could be specified in either order.

Playground example: [https://play.rust-lang.org/?version=stable&mode=debug&editio...](https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=a4142e215dfb46ebfdab447ec22b6b95)

~~~
xixixao
I really like this, not that I would want to write this every time, but
because you could have a macro rule that doesn't specify expected/actual, for
those rare cases where both sides are computed ("actual").

------
norswap
If someone wants to have a look at the code of a cutting-edge parser
combinator framework with focus on features + usability, I'll plug this here
(it's in Java)

[https://github.com/norswap/autumn4](https://github.com/norswap/autumn4)

WIP, but 1.0.0 will land sometime within the next two months, with a full
user guide (half of it is already written and available).

Constructive feedback welcome!

~~~
tempguy9999
Honest question, why use parser combinators?

If I need to do it by hand I can write a recursive descent parser, but I never
do it by hand any more, just break out lex/yacc or antlr.

So, what have I or anyone to gain by learning/using PCs? (I'm never against
learning new stuff but I have so much to learn that adding something
unnecessary to that would be just dumb).

TIA

~~~
fooker
Parser combinators aid in writing recursive descent parsers; they are not
something alien. You can think of them as a design pattern that eliminates
redundancy and makes it easier to construct an AST.

------
tiuPapa
Okay, I am interested in this topic. Does anyone know of any good resources
for exploring parser combinators further?

~~~
vmchale
I like
[http://www.cs.nott.ac.uk/~pszgmh/monparsing.pdf](http://www.cs.nott.ac.uk/~pszgmh/monparsing.pdf)

------
amelius
What is the class of languages that can be parsed with such parsers, in the
sense of [1]?

[1] [https://en.wikipedia.org/wiki/Context-free_grammar#Subclasse...](https://en.wikipedia.org/wiki/Context-free_grammar#Subclasses)

~~~
jcranmer
It's basically a recursive-descent parser, which means LL(k)-ish. The "-ish"
is because you can use other tricks instead of straightforward recursive-
descent (expression parsing is a common example of where you might want to do
so), but the basic combinator concept itself is LL(k).

------
lelf
[https://news.ycombinator.com/item?id=19694793](https://news.ycombinator.com/item?id=19694793)

------
k0t0n0
Nice read! I also wrote a SQL dump parser in Rust; here's the code:

> [https://github.com/ooooak/sql-split](https://github.com/ooooak/sql-split)

------
vmchale
I don't like Rust for this purpose. It doesn't have higher-kinded types and
thus no applicatives or monads, which sort of misses the point.

I also object to the idea that parser combinators are an alternative to parser
generators. They're each useful in different scenarios. But for something like
XML the parser combinators will be slower.

I'd also be curious to see how the efficiency of parser combinators is
affected by the absence of laziness in Rust. I seem to recall that laziness
makes the analysis more complicated than you'd expect, but I need to find a
source...

~~~
thramp
> I don't like Rust for this purposes. It doesn't have higher-kinded types and
> thus no applicatives or monads, which sort of misses the point.

Having used parser combinators in Rust and Haskell (combine and attoparsec,
respectively), I've found that even without applicatives, parser combinators
are pretty handy—they're a step above regexes, but a step below a parser
generator.

> I'd also be curious to see how the efficiency of parser combinators is
> affected by the absence of laziness in Rust. I seem to recall that laziness
> makes the analysis more complicated than you'd expect, but I need to find a
> source...

Same here, but I suspect that aggressive inlining might be pretty helpful.

