Hacker News new | past | comments | ask | show | jobs | submit login
What the heck is a parser-combinator? (kimpel.com)
160 points by lihaoyi on June 21, 2017 | hide | past | web | favorite | 72 comments



Consider the most basic parsers you might want. For instance, a parser that only succeeds if it matches a string exactly, a parser that matches any single character and always succeeds, a parser that matches nothing and returns some constant, a parser that always fails. They're all simple and stupid and let's give them names:

    string("foobar") : Parser<String>
    char : Parser<Character>
    always<A>(x: A): Parser<A>
    never<A>: Parser<A>
These are parser combinators. They form the atoms at the foundation of a language for constructing parsers. For instance, we can imagine two parsers happening in sequence

    then<A, B>(a: Parser<A>, b: Parser<B>): Parser<(A, B)>

    then(char, char): Parser<(Character, Character)>
Or more fancily, a two parsers happening in sequence, but the second parser being defined using the output of the first. This creates context sensitivity

    thenDependent<A, B>(a: Parser<A>, b: A -> Parser<B>): Parser<(A, B)>
We're constructing a fundamental language for building parsers up from our atoms to larger and larger things. All we need now is a way to "run" them

    run<A>(p: Parser<A>): String -> Option<A>
Tada---parser combinators!


I didn't really 'get' parser combinators until I watched the following introduction (showing how to build a parser library from scratch in F#): https://skillsmatter.com/skillscasts/9731-understanding-pars.... Highly recommended.


Or watch on youtube: https://www.youtube.com/watch?v=RDalzi7mhdY

No need to create an account or login


This is a great talk. I especially like his comment here [0] about obfuscated infix operators since he opens his talk with "hopefully this talk will get you over the confusion about what `.>>.` and company mean" and then 20 minutes later he just has the line `let ( .>>. ) = andThen`. That was easy.

https://youtu.be/RDalzi7mhdY?t=1898


Great find


Is that really how the term is used? I would have naively thought that among your examples only 'then' and 'thenDependent' are parser combinators, and that 'string("foobar")', 'char', 'always(x)' and 'never' are simply called parsers. (Maybe 'string', as opposed to the evaluated 'string("foobar")', should be called a combinator even though it doesn't take any parsers as arguments...)

Just like I'd call '+' and '*' arithmetic operators but I'd call '3' a number.


Combinator is a pretty overloaded term, it's difficult to pin down. In practice, something like `never` would be called a combinator, though.

Also, in theory a combinator is just an expression of no free variables. S-K-I calculus is the canonical set of "combinators" and all three of them could be considered "functions" but also could be considered "atoms". So perhaps it's just vague.


In my experience (limited) yes, this is how the term is used.

"Combinator" != "combiner". A combinator is a thing which is combined with other combinators. When you combine two combinators, you end up with yet another combinator which can be combined with other combinators. Very composable, in the functional spirit of things.


I'm not so sure, and however wrote the Wikipedia page seems to agree with me. The first sentence is "In computer programming, a parser combinator is a higher-order function that accepts several parsers as input and returns a new parser as its output." Also people name the type something like Parser<A>, not ParserCombinator<A>.


But what is the class of grammar they support?

I suspect these combinators are not very powerful compared to e.g. LR(k) parsers, and provide a false sense of modularity. (E.g. a minor grammar change leading to a large scale rewrite).


Roughly, context sensitive can be covered by monadic parsers, while applicative parsers are context free. LR(k) also corresponds to context free parsers. So, at least for monadic combinators they are more powerful. I also think they're fairly modular. The primitive parsers are fairly easy to write and once you have a small library (or use an existing one like Parsec) of them it is fairly easy to put them together (in haskell do notation helps too). The only other kind of parsers I have personal experience coding is recursive descent which I had more trouble trying to keep track of.


Ok, but what about the efficiency of LR(k) parsers implemented using monads?

Further, do monads warn the user when there is an ambiguity in the "grammar"?


> monadic combinators are more powerful [than context free parsers]

Monadic parsers can do context-sensitive things that context free parsers can't, but they can't have unordered options which context free parsers can. So really monadic parsers have different powers to context-free.

> do monads warn the user when there is an ambiguity in the "grammar"?

You don't have ambiguity in the grammar because monadic parsers only provide ordered options.


> You don't have ambiguity in the grammar because monadic parsers only provide ordered options.

Oh, but you do have ambiguity in the grammar, except that parser returns one of the possible parse trees deterministically and thus doesn't warn you that the input could be parsed differently. This leads to the false impression that your grammar is unambiguous.


The parser implements an unambiguous grammar. But it may not be the grammar you were thinking it was.

Specifically, it makes it impossible for you to even write down that ambiguous grammar you wanted.


> Further, do monads warn the user when there is an ambiguity in the "grammar"?

I'm not aware of a parser combinator library that does this. The way you build parser combinators in Haskell doesn't exactly preclude it from being done but it wouldn't be totally trivial either.

Googling for it didn't turn up much that wasn't a veiled reference to this paper:

http://richard.myweb.cs.uwindsor.ca/PUBLICATIONS/PADL_08.pdf


If you mean big O efficiency, it doesn't make any difference if you are creating a monad, applicative, or an state machine. Complexity comes explicitly in the parser construction, so if you write a monadic parser using only LR(k) combinators, you will get an LR(k) parser. If you write some backtracking, you will get worse complexity.


They're pretty damn efficient and nicely modular if you're okay with not having context sensitivity. Most of the time you're okay with not have context sensitivity, imo.

Context sensitivity as I wrote it here is the most powerful but least optimizable version. There are plenty of tricks to improve optimization. You can even write a "cut" combinatory to break backtracking.

It's a huge design space. The answer to all of your questions will probably be: it depends.


In languages that allow abstraction over generic types you can also reuse a lot of infrastructure to write parsers. For instance you can parse a string by chaining a bunch of char parsers:

    string :: String -> Parser String
    string s = mapM char s
On the other hand this probably degrades from c-like performance to python levels so some higher level builtins that are easy to optimize and compile into efficient assembly might be a better idea.


Monadic parser combinators!

There is a nice paper on the subject by Graham Hutton and Erik Meijer. It's a very good introduction to both parser combinators and monads, and it's very readable even for beginners.


Notably, you hardly ever need Monadic parser combinators. Applicative parser combinators work fine in almost all cases.

FastParse lets you use parser combinators monadically, but that's only necessary in uncommon cases. The three I've come across are length + data-of-length constructs (more common in binary parsing than text), having your parser validate matched XML tags/closing-tags, and indentation-delimited-block parser (e.g. when parsing Python).

For the vast majority of programming-language-like things, Applicative parser combinators are enough, including languages with complex syntax such as Scala.


Hm? I don't see why these would have to be monadic. A simple

    type Parser t = Input -> Maybe (t, Input)
or similar would be enough? Simple function composition gets you the rest of the way. (I grant you, it might be a bit tedious to write parsers this way, and monadic notation certainly makes it more pleasant in most cases.)

For anyone following along at home: think function which takes input + current position and may return: "Nothing" if the parser doesn't match, or "a value of type t along with the rest of the unparsed input".

(Obviously it'd have to be expanded to support errors properly, etc.)

EDIT: Of course, I guess you could argue that they just are monadic by definition since chaining using function composition and parameter passing is the essence of a monad... Maybe that's what you were driving at?


Yeah that type is enough to make a monad. In slightly fancy language the type you just wrote is just StateT Input Maybe. Monadic mostly makes me think of bind which is just about chaining (technically also return is needed, but here return is just always succeed and give that value) and parsers have a natural notion of run one after the other.


It's only a monad if you define it to be. It's also "just" an applicative if you like. The monadic part arises when people actually _use_ bind and then you lose some information that might be nice for optimizations.


> In slightly fancy language the type you just wrote is just StateT Input Maybe.

Except the newtype -- which was kind of my point :).


I realize I misread parent comment functions:

  always<A>(x: A): Parser<A>
  thenDependent<A, B>(a: Parser<A>, b: A -> Parser<B>): Parser<(A, B)>
as:

  always<A>(x: A): Parser<A>
  thenDependent<A, B>(a: Parser<A>, b: A -> Parser<B>): Parser<B>
which would make a monad provided the monad laws are satisfied.

Of course you don't have to write parser combinators that way, but I found it pretty practical compared to dealing with nested tuples for the sequence parser. Check out Hutton and Meijer's article for more details.


The problem with monadic parsers is that they are too powerful. If you have two options

    string "foo" <|> string "for"
It would be nice to automatically rewrite this as

    string "fo" >> (char 'o' <|> char 'r')
But if we try to analyze monadic parsers like that we run into the halting problem.

Applicative parsers are less powerful so we can left factor them automatically. There were some attempts to combine applicative and monadic parsers which resulted in arrow syntax but that kind of feels like the worst of both worlds for many cases.


It sounds vaguely like refactoring a recursive-descent parser.


It can certainly help to look at a parser-combinatory as a more rigorous, less artisanal/home-cooked recursive-descent parser. Good parser-combinators offer a lot more than just recursive-descent ways of looking at the parsing world, but if you are familiar with how you might do things in a recursive-descent way I think that familiarity can help in learning a parser-combinatory library.

For similar reasons, too, I think PEG-style grammars are more often conceptually closer than BNF grammars for early learning/thinking in parser-combinators. But again a good parser-combinator library will have power beyond what you might consider the possibility space afforded by just PEG-style grammars, especially as you start to get into higher order combinators.


What's an example of something that can be recognized with parser combinator but not PEGs?

Do parser combinators have the notion of arbitrary lookahead or backtracking?


Parser combinators can do look-ahead and backtracking, yes. You definitely pay for it in efficiency however (especially look-ahead).


I have the overall feeling that most people don’t realize that parser combinators are nothing more that disguized recursive descent parsers. Personally I love both.


They're not, though. They're a formal language for talking about all kinds of parsers. You can implement them as recursive descent if you want. You can also compile them into formal FSAs if you want.


Parser combinators are a quite general technique which has been applied to PEG parsing, Earley parsing, and many other specific algorithms. Recursive descent is common because it is easy to implement and allows the resulting combinators to form a Monad.


Gotta love Parsec for Haskell and pyparsing for Python. Some other awesome parsing libs:

PEGTL - C++ Parsing Expression Grammar Template Library https://github.com/taocpp/PEGTL

Parboiled - Java & Scala PEG Library https://github.com/sirthias/parboiled

Nom - Rust parser combinator framework https://github.com/Geal/nom

Nearley - JavaScript parser toolkit https://github.com/Hardmath123/nearley

Neotoma - Erlang library and packrat parser-generator for PEGs https://github.com/seancribbs/neotoma


I've used Nom to parse a couple RFC-based protocols now, and I'm amazed at how easy and performant the result ends up being.

There's a great walk-through of using it here, including fuzzing your parsers to make sure they're solid. https://github.com/Geal/langsec-2017-hackathon-code


Geal is very helpful too, he pops up around the place in the Rust community and is responsive on the nom Gitter channel.


Instaparse - Clojure/Clojurescript https://github.com/Engelberg/instaparse


Nearly is pretty slow.

check out Chevrotain

http://sap.github.io/chevrotain/performance/


angstrom - OCaml parser combinators built for speed and memory efficiency https://github.com/inhabitedtype/angstrom


Add fparsec for F#/.Net http://www.quanttec.com/fparsec/

I used that 2 times and found it pretty easy to use, once I had grasped all the operators (like .>>, .>>., <|>, etc.)


Pegged - PEG generator implemented in the D programming language https://github.com/PhilippeSigaud/Pegged , works in run-time or compile-time


I haven't tried pyparsing, but I enjoyed implementing a simple DSL with funcparserlib having no previous experience with parser combinators: https://github.com/vlasovskikh/funcparserlib

The library has a well-written tutorial that doesn't require previous knowledge about parser combinators.


This is great, but if you're going to use a parser combinator in .NET, you might as well switch to F# and use FParsec instead of hacking LINQ in C#. The F# code is so much easier to read and understand.


Slightly off-topic but please stop this js smooth scroll nonsense. If I wanted to use smooth scroll I would have enabled it in my browser.


^ Someone talking sense right here, I thought some of my browser settings had changed for a minute.


Sing it! I wonder what purpose it is even supposed to serve?

The weird thing is that all people have to do to make this stuff work is: nothing at all. But for some reason that's just too much effort.


And while we're at it, get rid of the annoying sticky header.


I want an adblock for smooth scroll js


Install NoScript. Run in default deny JS mode.

No smooth scroll js. In fact, I read the whole article and never even realized there was any smooth scroll js anywhere.


Here's a JSON parser in LPeg, the excellent PEG parser from one of the Lua authors. The function "decode" returns the final, complete tree data structure.

  local lpeg = require"lpeg"
  local P, S, R, V = lpeg.P, lpeg.S, lpeg.R, lpeg.V
  local C, Cc, Cf, Cg, Ct = lpeg.C, lpeg.Cc, lpeg.Cf, lpeg.Cg, lpeg.Ct

  local function to8(n)
    ... -- Lua code to normalize UTF-16 to UTF-8
  end

  local unicode = P"u" * (R("09", "AF", "af")^4 / to8)
  local named = C'"' + C"\\" + C"/" + (P"b" * Cc"\b") + (P"f" * Cc"\f") + (P"n" * Cc"\n") + (P"r" * Cc"\r") + (P"t" * Cc"\t")
  local escaped = P"\\" * (named + unicode)
  local unescaped = C((P(1) - S'\\"')^1)
  local qstring = Ct(P'"' * (unescaped + escaped)^0 * P'"') / table.concat

  local exp = S"Ee" * S"-+"^-1 * R"09"^1
  local frac = P"." * R"09"^1
  local number = (S"-+"^-1 * R"09"^1 * frac^-1 * exp^-1) / tonumber

  local boolean = (P"true" * Cc(true)) + (P"false" * Cc(false))
  local null = P"null" * Cc(nil)
  local space = S" \t\r\n"^0

  local JSON = { "Value",
    Value = space * (V"Object" + V"Array" + V"Simple") * space,
    Object = Cf(Ct"{" * space * Cg(qstring * space * P":" * V"Value" * P","^-1 * space)^0 * P"}", rawset),
    Array = Ct(P"[" * space * (V"Value" * P","^-1 * space)^0 * P"]"),
    Simple = number + boolean + null + qstring,
  }

  local function decode(txt)
    return lpeg.match(JSON, txt)
  end



type Parser a :: String -> [(a, String)]

  "A Parser for Things
  Is a function from Strings
  To Lists of Pairs
  Of Things and Strings!"
  - Fritz Ruehr


Reminds me of SuperCombinators: a parser combinator framework written in Swift https://github.com/snipsco/SuperCombinators



Here is a great explanation of parser combinators, not tied to any specific implementation:

https://qntm.org/combinators


I implemented a basic set of primitive parsers and combinators here in K, along with a usage example- a complete parser for the bittorrent "Bencode" data interchange format:

https://github.com/JohnEarnest/ok/blob/gh-pages/examples/par...

Not particularly efficient, but fairly concise.


Parser combinator library in Go: https://github.com/prataprc/goparsec


A tiny, elemental parser in a universe where parsers are closed under certain parser composition operators. Such tiny parsers can be combined using such operators to produce a bigger parser which could theoretically parse anything.


Wanted to leave a comment on the site, but it seems comments are broken, so I'll stick this here just in case the author sees it:

You've linked to my csharp-monad library for C# parser combinators. This has been superseded by my language-ext project: https://github.com/louthy/language-ext/

It is a much more advanced and efficient port of the Haskell Parsec library. Would you mind linking to that instead?

Your Sprache examples would look like this in language-ext:

    public static readonly Parser<JCLCommand> JCLText =
        from open    in ch('$') 
        from ws1     in spaces
        from command in asString(many(noneOf(' ')))
        from ws2     in spaces
        from content in asString(many(noneOf('"')))
        select new JCLCommand(command, content);
     
    public static readonly Parser<JCLCommand> GlobalText =
        from variablename in asString(many(noneOf('=')))
        from ws2          in ch('=')
        from openbrack    in ch('(')
        from filepath     in asString(many(noneOf(')')))
        from closebrack   in ch(')')
        select new JCLCommand(variablename, filepath);
But the power of parser combinators are their reusable nature. So I'd break that down to a set of tools:

    static Parser<A> token<A>(Parser<A> p) =>
        from x in p
        from _ in either(spaces, eof)
        select x;

    static Parser<string> symbol(string x) =>
        token(str(x));

    static Parser<string> identifier =
        token(asString(many1(alphaNum)));

    static Parser<A> quotes<A>(Parser<A> p) =>
        between(symbol("\""), symbol("\""), p);

    static Parser<A> parens<A>(Parser<A> p) =>
        between(symbol("("), symbol(")"), p);

    static Parser<string> quoteText =
        token(quotes(asString(many(satisfy(x => x != '"')))));

    static Parser<string> parensText =
        token(parens(asString(many(satisfy(x => x != ')')))));
Then your final parsers would look like this:

    static readonly Parser<JCLCommand> JCLText =
        from open    in symbol("$")
        from command in identifier
        from content in quoteText
        select new JCLCommand(command, content);

    static readonly Parser<JCLCommand> GlobalText =
        from variablename in identifier
        from ws2          in symbol("=")
        from filepath     in parensText
        select new JCLCommand(variablename, filepath);

    static readonly Parser<Seq<JCLCommand>> Commands =
        from _        in spaces
        from commands in many1(either(JCLText, GlobalText))
        select commands;

Which is much easier to understand I think. It's not exactly the same as it defines what an identifier is, but it's much more tolerant of rogue spaces because of the token parser. This is definitely the most compelling aspect of parser combinators for me, the way they compose so elegantly.


[flagged]


This doesn't seem like such an egregious example to me - I tried it in a colour contrast checker and it passes WCAG AA but fails WCAG AAA for normal size text (but passes for larger text): https://snook.ca/technical/colour_contrast/colour.html#fg=5A...

I will admit I don't find the letter spacing on that font to be particularly pleasant, and if you turn the "Cantarell" declaration off, it does become easier to read (better letter spacing and thicker strokes, which latter contributes positively to the contrast).


> why ... can't you use some darker color for your text?

Just add the "zap colors" bookmarklet from this page: https://www.squarefree.com/bookmarklets/zap.html to a tool-bar menu in your browser (add some of the others for other annoyances if you like). Then when you find a page like this with a non-black color, just poke the "zap colors" bookmarklet and the color will change to black (a few others will change too, such as those pages that like redefining the colors for links, etc.).


Not sure what you mean? Its nearly black?


It's #5a5a5a. That's not "nearly black" is my book.


Discount Y - Combinator


basically a parser inception, takes parser and returns parser, woah


A few years ago I worked on a similar kind of project as part of my bachelors. We were converting SQL for Oracle to SQL for Microsoft SQL Server. One problem is dealing with features which language A supports and language B doesn't. Another problem is that it is a huge amount of work to match all syntax elements as well as all library usages. In fact, I guess it's not a good thing to say on HackerNews, but isn't it better to just buy a tool which does it for you? It seems like they are available for the author's use case: http://www.ispirer.com/application-conversion/jcl-to-powersh...


> We were converting SQL for Oracle to SQL for Microsoft SQL Server.

I had a similar problem a few months ago with a Golang application that uses Postgres in production, but SQLite for unit tests. Since Go's SQL support has pluggable driver backends, I made a generic proxy driver [1] that can rewrite the incoming query, and used that in my application to rewrite from Postgres to SQLite syntax [2].

[1] https://godoc.org/github.com/majewsky/sqlproxy

[2] https://github.com/sapcc/limes/blob/205f9980a41d75fc0315e0ac...


Why just not use a local postgres instead of sqlite, your case seems like overengineering to me.


>One problem is dealing with features which language A supports and language B doesn't.

That's not a problem with parsers or parser-combinators though. That's a problem with converting from one DB to another.


> That's a problem with converting from one DB to another.

The article is about converting JCL to PowerShell, so I'd say it's related.


Kinda, that's the example the article uses, but the core is about the use of parser-combinators for the job (even the title is "What the heck is a parser-combinator?").




Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: