Consider the most basic parsers you might want. For instance: a parser that succeeds only if it matches a string exactly, a parser that matches any single character and always succeeds, a parser that consumes nothing and returns some constant, and a parser that always fails. They're all simple and stupid, so let's give them names: string, char, always, and never.
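A minimal sketch of what those atoms could look like in Haskell (the Maybe-based representation here is an assumption, one common choice among several, not the article's actual code):

import Data.List (stripPrefix)

-- A parser consumes a string and either fails (Nothing)
-- or returns a value together with the remaining input.
newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

-- Succeeds only if the input starts with exactly s.
string :: String -> Parser String
string s = Parser $ \inp -> (\rest -> (s, rest)) <$> stripPrefix s inp

-- Matches any single character.
char :: Parser Char
char = Parser $ \inp -> case inp of
  (c:rest) -> Just (c, rest)
  []       -> Nothing

-- Consumes nothing and returns a constant.
always :: a -> Parser a
always x = Parser $ \inp -> Just (x, inp)

-- Always fails.
never :: Parser a
never = Parser $ \_ -> Nothing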
These are parser combinators. They form the atoms at the foundation of a language for constructing parsers. For instance, we can imagine two parsers happening in sequence:

then<A, B>(a: Parser<A>, b: Parser<B>): Parser<(A, B)>
Or, more fancily, two parsers happening in sequence, but with the second parser defined using the output of the first. This creates context sensitivity:
thenDependent<A, B>(a: Parser<A>, b: A -> Parser<B>): Parser<(A, B)>
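In the Haskell sketch above, these two look like this (thenP rather than then, since then is a Haskell keyword; again a sketch, not the article's code):

-- Sequence two parsers, pairing their results.
thenP :: Parser a -> Parser b -> Parser (a, b)
thenP pa pb = Parser $ \inp -> do
  (a, rest)  <- runParser pa inp
  (b, rest') <- runParser pb rest
  Just ((a, b), rest')

-- Sequence two parsers, building the second from the first's output.
thenDependent :: Parser a -> (a -> Parser b) -> Parser (a, b)
thenDependent pa f = Parser $ \inp -> do
  (a, rest)  <- runParser pa inp
  (b, rest') <- runParser (f a) rest
  Just ((a, b), rest')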
We're constructing a fundamental language for building parsers up from our atoms to larger and larger things. All we need now is a way to "run" them:
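Continuing the sketch, running a parser can be as simple as unwrapping the function and insisting that all input is consumed (the consume-everything convention is an assumption here):

-- Run a parser over an input string; succeed only if nothing is left over.
parse :: Parser a -> String -> Maybe a
parse p inp = case runParser p inp of
  Just (x, "") -> Just x
  _            -> Nothing

-- e.g. parse (string "foobar") "foobar"  ==  Just "foobar"
-- e.g. parse (string "foobar") "foo"     ==  Nothing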
This is a great talk. I especially like his comment here [0] about obfuscated infix operators since he opens his talk with "hopefully this talk will get you over the confusion about what `.>>.` and company mean" and then 20 minutes later he just has the line `let ( .>>. ) = andThen`. That was easy.
Is that really how the term is used? I would have naively thought that among your examples only 'then' and 'thenDependent' are parser combinators, and that 'string("foobar")', 'char', 'always(x)' and 'never' are simply called parsers. (Maybe 'string', as opposed to the evaluated 'string("foobar")', should be called a combinator even though it doesn't take any parsers as arguments...)
Just like I'd call '+' and '*' arithmetic operators but I'd call '3' a number.
"Combinator" is a pretty overloaded term; it's difficult to pin down. In practice, something like `never` would be called a combinator, though.
Also, in theory a combinator is just an expression with no free variables. The S-K-I calculus is the canonical set of "combinators", and all three of them could be considered "functions" but could also be considered "atoms". So perhaps it's just vague.
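For reference, the three S-K-I combinators written as ordinary Haskell functions (i is just id and k is const), which makes the function-versus-atom double reading concrete:

-- I: the identity combinator.
i :: a -> a
i x = x

-- K: keep the first argument, discard the second.
k :: a -> b -> a
k x _ = x

-- S: apply f to x and to (g x).
s :: (a -> b -> c) -> (a -> b) -> a -> c
s f g x = f x (g x)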
In my experience (limited) yes, this is how the term is used.
"Combinator" != "combiner". A combinator is a thing which is combined with other combinators. When you combine two combinators, you end up with yet another combinator which can be combined with other combinators. Very composable, in the functional spirit of things.
I'm not so sure, and whoever wrote the Wikipedia page seems to agree with me. The first sentence is "In computer programming, a parser combinator is a higher-order function that accepts several parsers as input and returns a new parser as its output." Also, people name the type something like Parser<A>, not ParserCombinator<A>.
I suspect these combinators are not very powerful compared to e.g. LR(k) parsers, and provide a false sense of modularity (e.g. a minor grammar change leading to a large-scale rewrite).
Roughly, context-sensitive grammars can be covered by monadic parsers, while applicative parsers are context-free. LR(k) also corresponds to context-free parsing. So, at least for monadic combinators, they are more powerful. I also think they're fairly modular. The primitive parsers are fairly easy to write, and once you have a small library of them (or use an existing one like Parsec) it is fairly easy to put them together (in Haskell, do-notation helps too). The only other kind of parser I have personal experience coding is recursive descent, which I had more trouble trying to keep track of.
> monadic combinators are more powerful [than context free parsers]
Monadic parsers can do context-sensitive things that context-free parsers can't, but they can't have unordered options, which context-free parsers can. So really monadic parsers have different powers from context-free ones.
> do monads warn the user when there is an ambiguity in the "grammar"?
You don't have ambiguity in the grammar because monadic parsers only provide ordered options.
> You don't have ambiguity in the grammar because monadic parsers only provide ordered options.
Oh, but you do have ambiguity in the grammar, except that the parser returns one of the possible parse trees deterministically and thus doesn't warn you that the input could be parsed differently. This leads to the false impression that your grammar is unambiguous.
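A concrete illustration in Parsec-style Haskell (the combinator names are Parsec's, the example itself is mine): both branches below derive the same input "for", so the grammar is ambiguous, but ordered choice deterministically commits to the first branch and never reports that the second would also have matched:

import Text.Parsec
import Text.Parsec.String (Parser)

-- Two derivations of the input "for"; only the first is ever returned.
ambiguous :: Parser String
ambiguous = try ((++) <$> string "fo" <*> string "r") <|> string "for"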
> Further, do monads warn the user when there is an ambiguity in the "grammar"?
I'm not aware of a parser combinator library that does this. The way you build parser combinators in Haskell doesn't exactly preclude it from being done but it wouldn't be totally trivial either.
Googling for it didn't turn up much that wasn't a veiled reference to this paper:
If you mean big-O efficiency, it doesn't make any difference whether you are creating a monad, an applicative, or a state machine. Complexity comes explicitly from the parser construction, so if you write a monadic parser using only LR(k) combinators, you will get an LR(k) parser. If you write some backtracking, you will get worse complexity.
They're pretty damn efficient and nicely modular if you're okay with not having context sensitivity. Most of the time you're okay with not having context sensitivity, imo.
Context sensitivity as I wrote it here is the most powerful but least optimizable version. There are plenty of tricks to improve optimization. You can even write a "cut" combinator to break backtracking.
It's a huge design space. The answer to all of your questions will probably be: it depends.
In languages that allow abstraction over generic types you can also reuse a lot of infrastructure to write parsers. For instance you can parse a string by chaining a bunch of char parsers:
string :: String -> Parser String
string s = mapM char s  -- run one char parser per character of s, in sequence, collecting the results
On the other hand, this probably degrades from C-like performance to Python levels, so some higher-level builtins that are easy to optimize and compile into efficient assembly might be a better idea.
There is a nice paper on the subject by Graham Hutton and Erik Meijer. It's a very good introduction to both parser combinators and monads, and it's very readable even for beginners.
Notably, you hardly ever need Monadic parser combinators. Applicative parser combinators work fine in almost all cases.
FastParse lets you use parser combinators monadically, but that's only necessary in uncommon cases. The three I've come across are length + data-of-length constructs (more common in binary parsing than text), having your parser validate matched XML opening/closing tags, and indentation-delimited-block parsing (e.g. when parsing Python).
For the vast majority of programming-language-like things, Applicative parser combinators are enough, including languages with complex syntax such as Scala.
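To make the exception concrete: the length + data-of-length case is inherently monadic, because the payload parser is built from a value the length parser produced, which applicative composition can't express. A sketch in Parsec-style Haskell (the ':' separator is a netstring-style assumption, not FastParse's API):

import Text.Parsec
import Text.Parsec.String (Parser)

-- Parse a decimal length, a ':' separator, then exactly that many characters.
-- 'count n anyChar' depends on the parsed n; that dependency is the monadic bind.
lengthPrefixed :: Parser String
lengthPrefixed = do
  n <- read <$> many1 digit
  _ <- char ':'
  count n anyChar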
Hm? I don't see why these would have to be monadic. A simple
type Parser t = Input -> Maybe (t, Input)
or similar would be enough? Simple function composition gets you the rest of the way. (I grant you, it might be a bit tedious to write parsers this way, and monadic notation certainly makes it more pleasant in most cases.)
For anyone following along at home: think of a function which takes the input (from the current position) and may return "Nothing" if the parser doesn't match, or a value of type t along with the rest of the unparsed input.
(Obviously it'd have to be expanded to support errors properly, etc.)
EDIT: Of course, I guess you could argue that they just are monadic by definition since chaining using function composition and parameter passing is the essence of a monad... Maybe that's what you were driving at?
Yeah, that type is enough to make a monad. In slightly fancy language, the type you just wrote is just StateT Input Maybe. "Monadic" mostly makes me think of bind, which is just about chaining (technically return is also needed, but here return just always succeeds with the given value), and parsers have a natural notion of running one after the other.
It's only a monad if you define it to be. It's also "just" an applicative if you like. The monadic part arises when people actually _use_ bind, and then you lose some information that might be nice for optimizations.
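Concretely, for the Parser t = Input -> Maybe (t, Input) type above, the two operations can be sketched like this (the names ret and bind are assumed, not from the original comment):

type Input = String
type Parser t = Input -> Maybe (t, Input)

-- Succeed without consuming any input.
ret :: t -> Parser t
ret x inp = Just (x, inp)

-- Run p, then feed its result to f, threading the remaining input through.
bind :: Parser a -> (a -> Parser b) -> Parser b
bind p f inp = case p inp of
  Nothing        -> Nothing
  Just (x, rest) -> f x rest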
which would make a monad provided the monad laws are satisfied.
Of course you don't have to write parser combinators that way, but I found it pretty practical compared to dealing with nested tuples for the sequence parser. Check out Hutton and Meijer's article for more details.
The problem with monadic parsers is that they are too powerful. If you have two options
string "foo" <|> string "for"
It would be nice to automatically rewrite this as
string "fo" >> (char 'o' <|> char 'r')
But if we try to analyze monadic parsers like that, we run into the halting problem: the parser on the right of a bind is computed by an arbitrary function of the earlier result, so there is no general way to inspect the rest of the grammar statically.
Applicative parsers are less powerful, so we can left-factor them automatically. There were some attempts to combine applicative and monadic parsers, which resulted in arrow syntax, but that kind of feels like the worst of both worlds in many cases.
It can certainly help to look at a parser combinator as a more rigorous, less artisanal/home-cooked recursive-descent parser. Good parser-combinator libraries offer a lot more than just recursive-descent ways of looking at the parsing world, but if you are familiar with how you might do things in a recursive-descent way, I think that familiarity can help in learning a parser-combinator library.
For similar reasons, I think PEG-style grammars are often conceptually closer than BNF grammars for early learning/thinking in parser combinators. But again, a good parser-combinator library will have power beyond the possibility space afforded by just PEG-style grammars, especially as you start to get into higher-order combinators.
I have the overall feeling that most people don't realize that parser combinators are nothing more than disguised recursive-descent parsers. Personally I love both.
They're not, though. They're a formal language for talking about all kinds of parsers. You can implement them as recursive descent if you want. You can also compile them into formal FSAs if you want.
Parser combinators are a quite general technique which has been applied to PEG parsing, Earley parsing, and many other specific algorithms. Recursive descent is common because it is easy to implement and allows the resulting combinators to form a Monad.
I haven't tried pyparsing, but I enjoyed implementing a simple DSL with funcparserlib having no previous experience with parser combinators: https://github.com/vlasovskikh/funcparserlib
The library has a well-written tutorial that doesn't require previous knowledge about parser combinators.
This is great, but if you're going to use a parser combinator in .NET, you might as well switch to F# and use FParsec instead of hacking LINQ in C#. The F# code is so much easier to read and understand.
Here's a JSON parser in LPeg, the excellent PEG parsing library from one of the Lua authors. The function "decode" returns the final, complete tree data structure.
local lpeg = require"lpeg"
local P, S, R, V = lpeg.P, lpeg.S, lpeg.R, lpeg.V
local C, Cc, Cf, Cg, Ct = lpeg.C, lpeg.Cc, lpeg.Cf, lpeg.Cg, lpeg.Ct
local function to8(n)
... -- Lua code to normalize UTF-16 to UTF-8
end
local unicode = P"u" * (R("09", "AF", "af")^4 / to8)
local named = C'"' + C"\\" + C"/" + (P"b" * Cc"\b") + (P"f" * Cc"\f") + (P"n" * Cc"\n") + (P"r" * Cc"\r") + (P"t" * Cc"\t")
local escaped = P"\\" * (named + unicode)
local unescaped = C((P(1) - S'\\"')^1)
local qstring = Ct(P'"' * (unescaped + escaped)^0 * P'"') / table.concat
local exp = S"Ee" * S"-+"^-1 * R"09"^1
local frac = P"." * R"09"^1
local number = (S"-+"^-1 * R"09"^1 * frac^-1 * exp^-1) / tonumber
local boolean = (P"true" * Cc(true)) + (P"false" * Cc(false))
local null = P"null" * Cc(nil)
local space = S" \t\r\n"^0
local JSON = { "Value",
Value = space * (V"Object" + V"Array" + V"Simple") * space,
Object = Cf(Ct"{" * space * Cg(qstring * space * P":" * V"Value" * P","^-1 * space)^0 * P"}", rawset),
Array = Ct(P"[" * space * (V"Value" * P","^-1 * space)^0 * P"]"),
Simple = number + boolean + null + qstring,
}
local function decode(txt)
return lpeg.match(JSON, txt)
end
I implemented a basic set of primitive parsers and combinators here in K, along with a usage example, a complete parser for the BitTorrent "Bencode" data interchange format:
A tiny, elemental parser in a universe where parsers are closed under certain composition operators. Such tiny parsers can be combined with those operators to produce bigger and bigger parsers which could, in principle, parse anything.
Wanted to leave a comment on the site, but it seems comments are broken, so I'll stick this here just in case the author sees it:
You've linked to my csharp-monad library for C# parser combinators. This has been superseded by my language-ext project: https://github.com/louthy/language-ext/
It is a much more advanced and efficient port of the Haskell Parsec library. Would you mind linking to that instead?
Your Sprache examples would look like this in language-ext:
public static readonly Parser<JCLCommand> JCLText =
from open in ch('$')
from ws1 in spaces
from command in asString(many(noneOf(' ')))
from ws2 in spaces
from content in asString(many(noneOf('"')))
select new JCLCommand(command, content);
public static readonly Parser<JCLCommand> GlobalText =
from variablename in asString(many(noneOf('=')))
from ws2 in ch('=')
from openbrack in ch('(')
from filepath in asString(many(noneOf(')')))
from closebrack in ch(')')
select new JCLCommand(variablename, filepath);
But the power of parser combinators is their reusable nature. So I'd break that down into a set of tools:
static Parser<A> token<A>(Parser<A> p) =>
from x in p
from _ in either(spaces, eof)
select x;
static Parser<string> symbol(string x) =>
token(str(x));
static Parser<string> identifier =
token(asString(many1(alphaNum)));
static Parser<A> quotes<A>(Parser<A> p) =>
between(symbol("\""), symbol("\""), p);
static Parser<A> parens<A>(Parser<A> p) =>
between(symbol("("), symbol(")"), p);
static Parser<string> quoteText =
token(quotes(asString(many(satisfy(x => x != '"')))));
static Parser<string> parensText =
token(parens(asString(many(satisfy(x => x != ')')))));
Then your final parsers would look like this:
static readonly Parser<JCLCommand> JCLText =
from open in symbol("$")
from command in identifier
from content in quoteText
select new JCLCommand(command, content);
static readonly Parser<JCLCommand> GlobalText =
from variablename in identifier
from ws2 in symbol("=")
from filepath in parensText
select new JCLCommand(variablename, filepath);
static readonly Parser<Seq<JCLCommand>> Commands =
from _ in spaces
from commands in many1(either(JCLText, GlobalText))
select commands;
Which is much easier to understand, I think. It's not exactly the same, as it defines what an identifier is, but it's much more tolerant of rogue spaces because of the token parser. This is definitely the most compelling aspect of parser combinators for me: the way they compose so elegantly.
This doesn't seem like such an egregious example to me - I tried it in a colour contrast checker and it passes WCAG AA but fails WCAG AAA for normal size text (but passes for larger text): https://snook.ca/technical/colour_contrast/colour.html#fg=5A...
I will admit I don't find the letter spacing on that font to be particularly pleasant, and if you turn the "Cantarell" declaration off, it does become easier to read (better letter spacing and thicker strokes, the latter of which contributes positively to the contrast).
> why ... can't you use some darker color for your text?
Just add the "zap colors" bookmarklet from this page: https://www.squarefree.com/bookmarklets/zap.html to a toolbar menu in your browser (add some of the others for other annoyances if you like). Then when you find a page like this with a non-black text color, just poke the "zap colors" bookmarklet and the color will change to black (a few other things will change too, such as on pages that like redefining the colors for links, etc.).
A few years ago I worked on a similar kind of project as part of my bachelor's. We were converting SQL for Oracle to SQL for Microsoft SQL Server. One problem is dealing with features which language A supports and language B doesn't. Another problem is that it is a huge amount of work to match all syntax elements as well as all library usages. In fact, I guess it's not a good thing to say on Hacker News, but isn't it better to just buy a tool which does it for you? It seems like they are available for the author's use case: http://www.ispirer.com/application-conversion/jcl-to-powersh...
> We were converting SQL for Oracle to SQL for Microsoft SQL Server.
I had a similar problem a few months ago with a Golang application that uses Postgres in production, but SQLite for unit tests. Since Go's SQL support has pluggable driver backends, I made a generic proxy driver [1] that can rewrite the incoming query, and used that in my application to rewrite from Postgres to SQLite syntax [2].
Kinda, that's the example the article uses, but the core is about the use of parser-combinators for the job (even the title is "What the heck is a parser-combinator?").