
InstaParse: Context-Free Grammars as Easy as Regular Expressions - brudgers
https://github.com/Engelberg/instaparse/blob/master/README.md
======
hbbio
Having parsers at hand in a language is an important feature.

In Opa ([http://opalang.org](http://opalang.org)), we decided to have parsers
as first-class objects: cf. [https://github.com/MLstate/opalang/wiki/The-core-language#parser](https://github.com/MLstate/opalang/wiki/The-core-language#parser)

And this is used at least 35 times in a real project such as PEPS:
[https://github.com/MLstate/PEPS](https://github.com/MLstate/PEPS)

(according to `git grep "= parser" | wc -l`)

~~~
hugofirth
Are you suggesting that parsing capability should be part of a language's
standard library, or just that it is useful to have good open source libraries
available for the purpose?

Anecdotally, my experience with Scala's parser combinators has led me to the
latter conclusion but not the former. Parsers and parser generators are very
specific tools which (as useful as they are) don't see use in most
applications. This leads you to a situation like the one in Scala, where open
source libs such as Parboiled ( [http://git.io/xpg3](http://git.io/xpg3) ) are
_far_ better than the std lib offering because they receive more attention.

~~~
amelius
I think the language should support the creation of parser generators such
that the output of the generators can be used in a straightforward way (as
opposed to the Yacc way, where you have to write complicated makefiles to
include the generator output in the build process).

------
panic
The algorithm used is Generalized LL parsing
([http://dotat.at/tmp/gll.pdf](http://dotat.at/tmp/gll.pdf)), which promises
linear-time parsing of LL grammars and cubic-time parsing of arbitrary
grammars. For some reason this isn't mentioned until the very end.

------
jared314
The biggest news here is the existence of a current ClojureScript port:

[https://github.com/lbradstreet/instaparse-cljs](https://github.com/lbradstreet/instaparse-cljs)

~~~
lbradstreet
I'm the author of the ClojureScript port. I'd really love to hear from anyone
who ends up using it, as I'm getting it into shape so it can eventually be
merged upstream.

All the tests pass and it works for my use case, but it'd be great to get some
more data points.

------
ar7hur
I've used InstaParse with great success for an IMAP client. It's well
documented and delivers as promised. Kudos to the author.

------
skatenerd
Parsing declaratively (by e.g. declaring EBNF) is really interesting.

It sort of mushes together the two steps ("lexing" and "parsing"), which, as
far as I can tell, are traditionally distinguished from one another.

I guess that the "token" definitions really just reside within the EBNF
rules.

~~~
panic
Separating "lexing" and "parsing" can still be good for performance, since
"lexing" needs no auxiliary storage (not even a stack) and cannot backtrack.
The author points this out in the performance documentation:
[https://github.com/Engelberg/instaparse/blob/master/docs/Per...](https://github.com/Engelberg/instaparse/blob/master/docs/Performance.md)
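
The performance point above can be sketched concretely. Below is a minimal
Python illustration (nothing to do with Instaparse's internals): the lexer is
a single non-backtracking pass that needs no stack, and the parser then works
over a flat token list.

```python
import re

# A standalone lexer: one pass over the input with no stack and no
# backtracking, which is what makes a separate lexing phase cheap.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|(.))")

def lex(text):
    tokens = []
    for number, op in TOKEN_RE.findall(text):
        tokens.append(("NUM", int(number)) if number else ("OP", op))
    return tokens

# A tiny recursive-descent parser/evaluator over the token stream:
#   expr = NUM (('+' | '-') NUM)*
def parse(tokens):
    pos = 0
    kind, value = tokens[pos]
    assert kind == "NUM"
    result = value
    pos += 1
    while pos < len(tokens):
        op = tokens[pos][1]
        num = tokens[pos + 1][1]
        result = result + num if op == "+" else result - num
        pos += 2
    return result

print(parse(lex("1 + 2 - 3")))  # 0
```
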

------
grandalf
This is the best library of its kind that I've seen. Shockingly intuitive
compared to something like Treetop for Ruby.

~~~
hugofirth
I have to say - for parsers I _really_ like the Scala library parboiled2 (
[http://git.io/xpg3](http://git.io/xpg3) ). Its DSL for defining PEGs is very
intuitive and should be easy to pick up for anyone.

It's also supposed to be pretty performant, although I have never seen a
side-by-side comparison with ANTLR and friends.

------
samatman
I love instaparse. It's been an inspiration for some of the work I've been
doing recently with lpeg.

I also contributed the viz hooks, back in the day, finding it difficult to
work with parsers without an ANTLR-style tree printer. It's a fun time to be
into parsers; there's a lot going on in the space.

Try decompiling a grammar into combinators!

------
ccleve
This looks really cool, but I'm having a tough time using it from within a
Java app.

My knowledge of Clojure is very limited, but it appears to run on the JVM and
should work. It installs via Maven just fine. Running it, however, is
currently beyond my ken.

A little sample Java code would open this fine piece of software to a much
wider audience.

~~~
jared314
Calling the Clojure api directly from Java would look something like the
following:

    import clojure.java.api.Clojure;
    import clojure.lang.IFn;
    import clojure.lang.PersistentVector;

    public class Core {

        public static void main(String[] args) {
            String grammar1 = "S = AB*\n" +
                              "AB = A B\n" +
                              "A = 'a'+\n" +
                              "B = 'b'+";

            // Load instaparse.core, then build a parser from the grammar string.
            Clojure.var("clojure.core", "require").invoke(Clojure.read("instaparse.core"));
            IFn parser = Clojure.var("instaparse.core", "parser");
            IFn asandbs = (IFn) parser.invoke(grammar1);
            PersistentVector result = (PersistentVector) asandbs.invoke("aaaaabbbaaaabb");

            System.out.println(result.get(0).toString());
        }
    }

But, my suggestion would be to build a Java-friendly API in Clojure, with
gen-class, and then consume that.

------
tomp
One thing I really like about LALR and LR parsers is that they warn you about
all ambiguities. IMO that's one of the most useful things when designing a
syntax.

People (and Wikipedia) like to pretend that e.g. PEG grammars cannot be
ambiguous, because the PEG "choice" operator is ordered. In reality, that
doesn't make them unambiguous, it just hides the ambiguity. By the same logic,
you could say that LR grammars cannot be ambiguous either; the parser
generator will always produce a working parser. The difference is, LR parser
generators warn you about the parts of the grammar where the generator had to
make a choice (and their choices are well-documented and not arbitrary), while
PEG parsers don't.
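
The "hidden ambiguity" described above is easy to reproduce with a few
hand-rolled PEG-style combinators (a Python sketch, not any real PEG library):
two grammars for the same language, differing only in the order of a choice,
silently accept different strings, and neither produces a warning.

```python
# PEG-style combinators: each parser takes (text, pos) and returns the
# new position on success or None on failure.
def lit(s):
    def p(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return p

def seq(*parsers):
    def p(text, pos):
        for parser in parsers:
            pos = parser(text, pos)
            if pos is None:
                return None
        return pos
    return p

def choice(*parsers):
    # Ordered choice: commit to the first alternative that succeeds,
    # never revisiting the others -- the defining PEG behavior.
    def p(text, pos):
        for parser in parsers:
            result = parser(text, pos)
            if result is not None:
                return result
        return None
    return p

# Two grammars differing only in choice order:
#   S <- A "c" ; A <- "ab" / "a"     vs.     A <- "a" / "ab"
s1 = seq(choice(lit("ab"), lit("a")), lit("c"))
s2 = seq(choice(lit("a"), lit("ab")), lit("c"))

print(s1("abc", 0))  # 3 -- parses
print(s2("abc", 0))  # None -- A commits to "a", then "c" fails against "b"
```
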

~~~
wfunction
Er, no, they don't. Detecting ambiguity is undecidable [1] -- LALR and LR
parser generators don't tell you about ambiguities. They only tell you about
shift/reduce and reduce/reduce conflicts, which do not necessarily imply the
existence of ambiguities.
[1]
[https://en.wikipedia.org/wiki/Ambiguous_grammar#Recognizing_...](https://en.wikipedia.org/wiki/Ambiguous_grammar#Recognizing_ambiguous_grammars)

~~~
tomp
It's undecidable for context-free grammars, but LR grammars are a
deterministic subset of context-free grammars. If I understand "deterministic"
correctly, it implies unambiguous.

~~~
wfunction
No, conflicts in LR grammars imply nothing about the ambiguity of the grammar.
Only the absence of conflicts implies the grammar is unambiguous.

Remember that ambiguity only refers to whether or not there can be multiple
derivations for the same string, _not_ whether the parser action is ambiguous.
For example, consider:

    
    
      S: "a" "b" "c" | A "b" "d"
      A: "a"
    

Viewed as an LR(1) grammar, this has a shift/reduce conflict, but the grammar
itself is unambiguous.

You could even make it worse by:

    
    
      S: "a" B "c" | A B "d"
      A: "a"
      B: "b" | B "b"
    

in which interpreting this as an LR(k) grammar for all k < ∞ results in
conflicts even though the grammar is still unambiguous.
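
For the first grammar, unambiguity can also be checked empirically with a
naive enumerate-all-derivations parser (an exponential-time Python sketch for
illustration only; it would loop forever on the left-recursive B rule of the
second grammar): every accepted string has exactly one derivation, even though
an LR(1) table for the grammar has a conflict.

```python
def parses(grammar, symbol, text, pos):
    """Yield one end position per distinct derivation of `symbol`
    starting at text[pos:]. Terminals are strings not in the grammar."""
    if symbol not in grammar:  # terminal
        if text.startswith(symbol, pos):
            yield pos + len(symbol)
        return
    for alternative in grammar[symbol]:
        positions = [pos]
        for part in alternative:
            positions = [q for p in positions
                           for q in parses(grammar, part, text, p)]
        yield from positions

def count_parses(grammar, text):
    # Number of full derivations of the whole input from S.
    return sum(1 for end in parses(grammar, "S", text, 0)
                 if end == len(text))

# S: "a" "b" "c" | A "b" "d" ;  A: "a"
g = {"S": [["a", "b", "c"], ["A", "b", "d"]], "A": [["a"]]}
print(count_parses(g, "abc"))  # 1 -- exactly one derivation
print(count_parses(g, "abd"))  # 1
```
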

------
daviddaviddavid
This reminds me of Prolog's definite clause grammars.

[http://en.wikipedia.org/wiki/Definite_clause_grammar](http://en.wikipedia.org/wiki/Definite_clause_grammar)

------
wfunction
A couple questions:

1. Can it parse its input using an LR(k) grammar in time linear in its input
when k > 1?

2. Is there any other parser known to be able to achieve the above in
practice?

------
rtpg
I am a fan of parser generators, but isn't a good DSL (think Parsec) a nicer
answer than passing in a string?

EDIT: this is still way cool in itself.

~~~
jared314
> but isn't a good DSL (think Parsec) a nicer answer than passing in a string.

EBNF is a DSL.

I happen to like it more than many APIs because of its history, stability,
and my experience with it. But, that's an opinion.

~~~
rtpg
But this isn't a DSL in the "in the language" sense; it's just a string. I
like EBNF a lot too, and I think writing (| A B) is not that much worse than
writing "A | B" in Clojure, but you get the advantages of manipulating a
language-level object instead of an opaque string.

Let's not forget that string-based programming is the source of many, many
bugs.

~~~
freshhawk
I also think this string format is the right default choice, since most of the
time I'm building parsers that don't change or need to be built
programmatically.

But Instaparse has exposed all the functionality as an "in the language" DSL
as well using parser combinators in a map structure (near the bottom of the
README).

You can also turn a string specification into a grammar map/combinator
specification, which I have used when a part of my parser was generated at run
time.

------
ehurrell
Instaparse is great. I've used it just enough to know how powerful it is, in a
weekend project to write an org-file parser.

------
mwj
Is there anything like this for python?

------
Vektorweg
I usually don't point this out, but:

 _Come, let us go down and confuse their language so they will not understand
each other._

As with the issue with RegEx, it's probably better to use nice English words
in code to describe a problem solution, so that other people can understand
it faster (before they get demotivated).

~~~
brudgers
English words don't solve the real difficulty with regexes, mathematical
regular expressions, or parsers. Getting a state machine to do accurately and
precisely what we want, though easier than writing an optimizing compiler, is
usually non-trivial.

The amazing thing about regexes is that they make describing some state
machines easy.
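
As a small illustration of that point (a Python sketch, not from the thread):
the one-line regex a+b+ describes the same recognizer as this explicit
hand-written state machine, which is what we would otherwise have to spell
out.

```python
import re

# The regex a+b+ compactly describes a three-state machine.
pattern = re.compile(r"a+b+")

# The same recognizer written out as an explicit state machine.
def matches(text):
    state = "start"
    for ch in text:
        if state == "start" and ch == "a":
            state = "in_a"
        elif state == "in_a" and ch == "a":
            pass                     # stay: more a's
        elif state == "in_a" and ch == "b":
            state = "in_b"
        elif state == "in_b" and ch == "b":
            pass                     # stay: more b's
        else:
            return False             # any other transition rejects
    return state == "in_b"           # must end having seen at least one b

for s in ["aab", "abbb", "b", "aba"]:
    # The two recognizers agree on every input, e.g. "aab True True".
    print(s, bool(pattern.fullmatch(s)), matches(s))
```
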

~~~
Vektorweg
English is a common base. Not everyone understands regexes or enough of the
math. If you use a regex, some people might not be able to understand and
change it when necessary.

Well, most of the time, no one else needs to understand it ...

~~~
brudgers
Is calling '*' "Kleene_Star" a marked improvement?

The hard part of regular expressions is envisioning the languages they
accept. It's not primarily a difficulty with the notation so much as with the
concepts the notation encodes.

~~~
Vektorweg
The * or Kleene star describes a repeated sequence of values from a specific
set. A nice function name would be "many", as in Haskell's Parsec. Comparing
*, "Kleene_Star", and "many": when the function is named "many", its meaning
is easy to infer without consulting the docs.

------
zer0defex
"...as Easy as Regular Expressions"

I see you too.. like to live dangerously. _tips hat_

~~~
vidarh
Yes, my reaction to that was "so it's really, really hard to use for anything
but trivial use cases?"

