
ANTLR Mega Tutorial - ftomassetti
https://tomassetti.me/antlr-mega-tutorial
======
jasode
I last played around with ANTLR in 2012 (when it was version 3) and I
discovered that there's a "bigger picture" to the parser generator universe
that most tutorials don't talk about:

1) ANTLR is a good tool for generating "happy path" parsers. With a grammar
specification, it easily generates a parser that accepts or rejects a piece of
source code. However, it's not easy to use its hooks to generate
high-quality diagnostic error messages.

2) ANTLR was not good for speculative or probabilistic parsing, which is the
basis of today's generation of tools such as "Intellisense": they don't give
up on parsing when there's an unclosed brace or a missing variable
declaration.

The common theme to the 2 bullet points above is that a high quality compiler
written by hand will hold multiple "states" of information and an ANTLR
grammar file doesn't really have an obvious way to express that knowledge. A
pathological example would be the numerous "broken HTML" pages being
successfully parsed by browsers. It would be very hard to replicate how
Chrome/Firefox/Safari/IE avoid choking on broken HTML by using ANTLR to
generate an HTML parser.

In short, ANTLR is great for _prototyping_ a parser but any industrial-grade
parser released into the wild with programmers' expectations of helpful error
messages would require a hand-written parser.

Lastly, lexing (creating the tokens) and parsing (creating the AST) are a
_very tiny percentage_ of the total development of a quality compiler.
Therefore, ANTLR doesn't save as much time as one might think.

I welcome any comments about v4 that make those findings obsolete.

~~~
corndoge
The vast majority of the time it is totally okay to quit on a single syntax
error. Most of us do not need "speculative parsing"; we are not all designing
HTML parsers and IDE assists. ANTLR / Bison are fantastic for parsing
grammars, and parsers written in them are a million times more maintainable
than a 1000-line hand-rolled parser in C that's ten years old and has been
touched by 20 hands. You really can't outperform a parser generator unless
you live in an ivory tower where no one will ever modify your code and you
have unlimited time to reinvent every wheel that Bison / ANTLR have built
in.

 _In short, ANTLR is great for prototyping a parser but any industrial-grade
parser released into the wild with programmers' expectations of helpful error
messages would require a hand-written parser._

What an odd statement, in light of the innumerable deployments of Bison /
ANTLR parsers you certainly use at least once a day (if you spend any time at
all in a terminal).

~~~
chubot
_What an odd statement, in light of the innumerable deployments of Bison /
ANTLR parsers you certainly use at least once a day (if you spend any time at
all in a terminal)._

Which ones? Awk is one; in fact Awk was almost co-developed with yacc. But it
gives generally bad error messages. (There are multiple implementations, but
most of them use a yacc-style LR(1) grammar and give bad error messages.)

I don't know of any others. Certainly ANTLR generates bad C/C++ code, so I
would be very surprised if it's used in anything in a typical Unix/Linux
terminal.

I have been blogging about Unix and parsing here, after writing a very
complete bash parser by hand:
[http://www.oilshell.org/blog/](http://www.oilshell.org/blog/)

My conclusion is also that generic parser generators are not good enough for
production quality parsers. This is borne out by evidence in the wild.

In other words, "real" languages don't use parser generators. Look at the top
10 languages, as well as emerging languages. Which of them use parser
generators?

Clang, GCC, v8, PHP, C# / Roslyn, Go, Perl, TypeScript, Dart, etc. all use
hand-written parsers. Not sure about Rust and Swift, but I think they are
hand-written. Java is an exception, but interestingly it has a "real" grammar,
and then a LALR(1) optimized for parser generators:

[http://trevorjim.com/is-java-context-free/](http://trevorjim.com/is-java-context-free/)

Python uses ITS OWN parser generator, not a generic one, which is a very
important distinction.

I'm not sure about Ruby, I think it might be a yacc core with A LOT of ad-hoc
parsing, so it may not count. Just like bash uses yacc for about 1/4 of the
language, and ad hoc hand-written parsers for the other 3/4.

Anyway, I don't think parser generators are as widely used as you think, but I
would be happy to be corrected.

~~~
parrt
Howdy! It is definitely the case that most production languages use
hand-built parsers. In talking with these compiler developers, they are
control (speed, error reporting, ...) and want very specific data structures
built during the parse. They also tend to use existing lexer infrastructure
that is baked into their development environments. I commented more on this
topic here: [http://stackoverflow.com/questions/16503217/antlr-for-commer...](http://stackoverflow.com/questions/16503217/antlr-for-commercial-compilers-why-not/16503455#16503455)

The thing to remember is that the vast majority of parsers out there are not
for these production language compilers. Compare the number of people you know
that have built parsers for a DSL, data, documents, or whatever to the number
of people you know that built compilers. ANTLR's niche is for your everyday
parsing needs. It generates fast ALL(*) parsers in a multitude of languages
and accepts all grammars without complaint, with the minor constraint that it
cannot handle indirect left recursion. (Direct left recursion, as in
expression rules, is totally okay.) For a speed shootout with other tools,
see the OOPSLA paper:
[http://www.antlr.org/papers/allstar-techreport.pdf](http://www.antlr.org/papers/allstar-techreport.pdf)
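
For the curious, the direct left recursion parrt describes looks like this in
an ANTLR 4 grammar (an illustrative sketch; rule and token names are
invented, not from the tutorial):

```antlr
grammar Expr;

// ANTLR 4 rewrites direct left recursion automatically; precedence
// follows the order of the alternatives (earlier binds tighter).
expr : expr '*' expr
     | expr '+' expr
     | '(' expr ')'
     | INT
     ;

INT : [0-9]+ ;
WS  : [ \t\r\n]+ -> skip ;
```

An indirect cycle such as `expr -> term -> expr` is the case ANTLR 4 cannot
rewrite and will reject.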

Excellent discussion!

~~~
chubot
Thanks for the response. Yes I think we are agreeing -- I have 2 of your
books, and have read many of your ANTLR-related papers, and they were
definitely the thing that taught me the most about top-down parsing. (I also
used to share an office with Guido van Rossum and I remember he had your books
too.)

I ported the POSIX shell grammar to both ANTLR v3 and v4 (which was basically
changing yacc-style BNF to EBNF). But as mentioned, I discovered that the
grammar only covers about 1/4 of the language. bash generates code with the
same grammar using yacc, but fills in the rest with hand-written code. Every
other shell I've encountered uses a hand-written parser. bash says they regret
using yacc here:

[http://www.aosabook.org/en/bash.html](http://www.aosabook.org/en/bash.html)

I agree with you that there is a Pareto or long tail distribution in parser
use cases. Most languages CAN use something like ANTLR or bison. But the
parent was making a different claim:

 _What an odd statement, in light of the innumerable deployments of Bison /
ANTLR parsers you certainly use at least once a day (if you spend any time at
all in a terminal)._

I would say that is FALSE, because most parsers that your fingers pass
through are HAND-WRITTEN, because of the Pareto distribution. 99% of anyone's
usage probably goes through a dozen or so parsers, and they are either
hand-written or generated by custom code generators, not general-purpose code
generators like ANTLR or yacc.

\-----

As feedback from a user of parsing tools, you might also be interested in my
article here:

[https://news.ycombinator.com/item?id=13628412](https://news.ycombinator.com/item?id=13628412)

Someone is asking if there are any parsing tools that generate a "lossless
syntax tree".

Also, based on my experience with ANTLR v3 vs. v4, I ask the question why use
a concrete syntax tree at all? Nobody answered that question in the comments.
I don't understand why that is a good representation, other than the fact that
you might not want to clutter your grammar with semantic actions ("pure
declarative syntax").

To me the parse tree / CST seems to be resource-heavy while containing
unnecessary information, and also lacking some crucial information like where
there's whitespace and comments.

To summarize my article, I'm researching code representations in the wild for
both style-preserving source translation (like go fix, lib2to3 in Python) and
auto-formatting (like go fmt).

It's definitely possible I misunderstood something since my experience was
relatively limited, but I have read a lot of the docs and bought the books.

~~~
parrt
> _why use a concrete syntax tree at all?_

Do you mean instead of an AST? I find the syntax tree better for non-compiler
applications like translators.

~~~
chubot
My claim is that neither the parse tree/CST nor the AST is good for
applications like translators. Instead I defined another term, Lossless
Syntax Tree, which is the data structure I want:

[http://www.oilshell.org/blog/2017/02/11.html](http://www.oilshell.org/blog/2017/02/11.html)

I researched "production" implementations, and found that they use something
like a Lossless Syntax Tree (not an AST or CST):

[https://github.com/oilshell/oil/wiki/Lossless-Syntax-Tree-Pa...](https://github.com/oilshell/oil/wiki/Lossless-Syntax-Tree-Pattern)

Examples: Clang, Microsoft's Roslyn platforms, RedBaron/lib2to3 for Python,
scalameta, and Go. The defining property of the LST is that it can be
round-tripped back to the source. This is called out in this C# design doc, along
with some conventions for associating whitespace with syntax tree nodes:

[https://github.com/dotnet/roslyn/wiki/Roslyn-Overview](https://github.com/dotnet/roslyn/wiki/Roslyn-Overview)
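
The round-trip property can be sketched in a few lines: attach each token's
preceding "trivia" (whitespace and comments) to the token itself, so that
concatenation reproduces the source exactly. A toy illustration (the token
set and names are invented, not taken from any of the linked projects):

```python
# Toy sketch of the lossless-syntax-tree idea: every token carries the
# trivia (whitespace, comments) that precedes it, so concatenating
# trivia + text round-trips to the exact source.
import re

# Trivia = runs of whitespace or '#' comments; tokens = words, = + ;
TOKEN = re.compile(r'(?P<trivia>(?:\s|#[^\n]*)*)(?P<text>\w+|[=+;]|$)')

def lex_lossless(src):
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m or m.end() == pos:
            break                      # unrecognized input: stop
        tokens.append((m.group('trivia'), m.group('text')))
        pos = m.end()
    return tokens

def to_source(tokens):
    # The round-trip property: nothing was thrown away during lexing.
    return ''.join(trivia + text for trivia, text in tokens)

src = 'x =  1 +2  # trailing comment\n'
assert to_source(lex_lossless(src)) == src
```

A plain AST would discard the double spaces and the comment; keeping them as
trivia is what makes style-preserving translation possible.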

What do you think of that claim? (If you prefer not to use this deep comment
thread, feel free to contact me by e-mail instead at andychup@gmail.com.)

------
CalChris
I switched over to ANTLR 4. It is strictly superior to ANTLR 3. The listener
approach, rather than embedding code in the grammar, is very natural:
separating the two leads to clean grammars and clean action code. Odd thing
is that I was stuck on 3 because 4 didn't support C yet, and then I just
switched to the Java target in an anti-C pique. Shoulda done that awhile ago.

TParr's _The Definitive ANTLR 4 Reference_ is quite good. And so's this mega
tutorial.

[https://pragprog.com/book/tpantlr2/the-definitive-antlr-4-re...](https://pragprog.com/book/tpantlr2/the-definitive-antlr-4-reference)

ANTLR is my go-to tool for DSLs.

~~~
n00b101
ANTLR4 has a C++ target now (finally).

------
ttd
I think everyone should manually implement a simple recursive descent parser
at least once in their careers. It's surprisingly easy, and really (in my
experience) helps to break through the mental barrier of parsers being magical
black boxes.

Plus, once you have an understanding of recursive descent parsing, it's a
relatively small leap to recursive descent code generation. And once you're
there, you have a pretty good high-level understanding of the entire
compilation pipeline (minus optimization).

Then all of a sudden, compilers are a whole lot less impenetrable.
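
For anyone curious what that looks like, here is a minimal hand-written
recursive descent parser-evaluator for arithmetic (an illustrative sketch,
not from the article): one function per grammar rule, with precedence
encoded by which rule calls which.

```python
# Minimal recursive descent parser-evaluator for +, *, parentheses,
# and integers. Grammar:
#   expr   -> term ('+' term)*
#   term   -> factor ('*' factor)*
#   factor -> NUMBER | '(' expr ')'
import re

def tokenize(src):
    return re.findall(r'\d+|[()+*]', src)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, tok=None):
        cur = self.peek()
        if cur is None or (tok is not None and cur != tok):
            raise SyntaxError(f'expected {tok!r}, got {cur!r}')
        self.pos += 1
        return cur

    def expr(self):                    # expr -> term ('+' term)*
        value = self.term()
        while self.peek() == '+':
            self.eat('+')
            value += self.term()
        return value

    def term(self):                    # term -> factor ('*' factor)*
        value = self.factor()
        while self.peek() == '*':
            self.eat('*')
            value *= self.factor()
        return value

    def factor(self):                  # factor -> NUMBER | '(' expr ')'
        if self.peek() == '(':
            self.eat('(')
            value = self.expr()
            self.eat(')')
            return value
        return int(self.eat())

def evaluate(src):
    return Parser(tokenize(src)).expr()
```

Because `term` calls `factor` and `expr` calls `term`, `*` binds tighter
than `+` with no precedence table at all; that correspondence between call
structure and grammar is what demystifies the black box.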

~~~
Terr_
On that note, Parr also has a nice little book called "Language Implementation
Patterns" [0] which introduces things on a level of "this kind of language
motif maps to this kind of parsing code".

[0] [https://pragprog.com/book/tpdsl/language-implementation-patt...](https://pragprog.com/book/tpdsl/language-implementation-patterns)

~~~
parrt
Thanks for the ptr. :) I just wish I had time to rewrite that book in ANTLR 4
(it's in ANTLR 3).

~~~
ftomassetti
I also loved that book! It is in my list of best books on building DSLs
([https://tomassetti.me/domain-specific-languages#books](https://tomassetti.me/domain-specific-languages#books))

------
raverbashing
Just a note on "Why not to use regular expressions": because it's
_impossible_, depending on the language's complexity.

REs are Type-3 (regular) in the Chomsky hierarchy:
[https://en.wikipedia.org/wiki/Chomsky_hierarchy](https://en.wikipedia.org/wiki/Chomsky_hierarchy)
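
A quick way to see the limitation (an illustrative sketch): a regular
expression has no counter, so any fixed pattern caps the nesting depth it
can match, while a trivial loop with a depth counter handles arbitrary
nesting.

```python
# A regex matches exactly one brace level; a counter handles any depth.
import re

flat = re.compile(r'\{[^{}]*\}')      # matches one level of braces only

def braces_balanced(src):
    depth = 0
    for ch in src:
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth < 0:             # closing brace with no opener
                return False
    return depth == 0

print(flat.fullmatch('{a}') is not None)    # True: depth 1 is fine
print(flat.fullmatch('{{a}}') is not None)  # False: depth 2 defeats it
print(braces_balanced('{{a}}'))             # True: any depth works
```

You can keep nesting the regex by hand for depth 2, 3, 4, ..., but no single
finite pattern covers all depths, which is exactly the Type-3 limitation.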

~~~
farresito
How is it usually done in most high performance compilers, if I may ask?

~~~
lallysingh
C++ is a context-sensitive language, so (at least some) hand parsing is
necessary.

~~~
munificent
You can use a generated parser for C/C++, I believe. You just need to insert a
little shim between the lexer and parser to look up identifiers in the symbol
table and mark which name tokens refer to types versus functions or variables.

~~~
joshuata
You are (essentially) correct. The language itself is context-free, and
therefore parse-able as a regular language. Looking up identifiers in symbol
tables is part of the type-checking phase of compilation. It is a pretty
common misconception that type constraints affect the parsing complexity of a
language. Language type only relates to the process of building an abstract
syntax tree from an input, not validating it.

~~~
munificent
> therefore parse-able as a regular language.

You mean "context-free" not "regular" here, but either way, I believe that C
is actually context-sensitive:

    
    
        (a)(b);
    

Is this:

1. Evaluating "b" and casting it to type "a"?

2. Evaluating the parenthesized expression "a" which yields a function, and
then calling it, passing in "b"?

The only way to distinguish the two is by knowing whether "a" is a type or
not.

You could argue that as long as the parser produces a suitably vague AST, _it_
doesn't need to know the distinction and can pass it on to later phases. I
_think_ that might work because I don't know of cases where the syntax
actually diverges based on the type of a symbol.

But in practice, I think most parsers want to produce a more precise AST that
distinguishes cast expressions from function calls.
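
The shim munificent describes (often called the "lexer hack") can be
sketched like this; the symbol table and function names are invented for
illustration:

```python
# Hedged sketch of the "lexer hack": a shim between lexer and parser
# reclassifies identifier tokens using symbol-table knowledge, which is
# all the parser needs to pick the right production for '(a)(b)'.

def classify(ident, typedefs):
    return 'TYPE_NAME' if ident in typedefs else 'IDENTIFIER'

def parse_paren_paren(first, second, typedefs):
    # '(first)(second);' is a cast if 'first' names a type,
    # otherwise a call of 'first' with argument 'second'.
    if classify(first, typedefs) == 'TYPE_NAME':
        return ('cast', first, second)
    return ('call', first, second)

typedefs = {'a'}   # pretend a prior 'typedef ... a;' was seen
print(parse_paren_paren('a', 'b', typedefs))   # ('cast', 'a', 'b')
print(parse_paren_paren('f', 'b', typedefs))   # ('call', 'f', 'b')
```

The grammar itself stays context-free; only the token stream is enriched
with context before the parser sees it.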

~~~
joshuata
You are right, the regular/context-free comment was a typo. You are also
correct about C being non-context-free [0]. I have seen so many explanations
of language hierarchy online mistaking type checking for language complexity
that it has become a pet peeve of mine.

[0] [http://eli.thegreenplace.net/2007/11/24/the-context-sensitiv...](http://eli.thegreenplace.net/2007/11/24/the-context-sensitivity-of-cs-grammar)

------
nradov
I used ANTLR to write a fuzz testing tool which parses an ABNF grammar (like
in an IETF RFC) and then generates random output which matches the grammar.
Worked great!

[https://github.com/nradov/abnffuzzer](https://github.com/nradov/abnffuzzer)
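
The core idea behind that kind of grammar-driven fuzzing can be sketched in
a few lines: pick a random alternative for each rule and recurse until only
terminals remain (a toy sketch with an invented grammar, not nradov's code):

```python
# Toy sketch of grammar-driven fuzzing: expand a start symbol by
# picking random alternatives until only terminals remain.
import random

# Invented toy grammar: rule name -> alternatives, each a symbol list.
# Uppercase names are rules, everything else is a terminal.
GRAMMAR = {
    'EXPR': [['NUM'], ['NUM', '+', 'EXPR'], ['(', 'EXPR', ')']],
    'NUM':  [['0'], ['1'], ['2']],
}

def generate(symbol, rng, depth=0):
    if symbol not in GRAMMAR:
        return symbol                  # terminal: emit literally
    # Past a depth limit, always take the first (non-recursive)
    # alternative so the expansion is guaranteed to terminate.
    alts = GRAMMAR[symbol]
    alt = alts[0] if depth > 8 else rng.choice(alts)
    return ''.join(generate(s, rng, depth + 1) for s in alt)

print(generate('EXPR', random.Random(42)))
```

By construction every output is a sentence of the grammar, which is what
makes it useful as fuzz input for a parser under test.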

~~~
parrt
Heh, that's cool. I tried to do a random phrase generator at one point but
it's hard!

~~~
nradov
If you're referring to human language phrases, I understand that Markov Chains
are the usual approach.

------
pjmlp
Great tutorial, ANTLR is one of the best tools for prototyping languages and
compilers.

I wasn't aware it supports JavaScript nowadays.

In any case, good selection of languages.

~~~
ftomassetti
Thanks! Yes, the fact that it supports so many languages is a big pro, at
least until you need to write semantic predicates.

------
intrasight
For .Net projects, I've used Irony. From the CodePlex site:

"Unlike most existing yacc/lex-style solutions Irony does not employ any
scanner or parser code generation from grammar specifications written in a
specialized meta-language. In Irony the target language grammar is coded
directly in c# using operator overloading to express grammar constructs."

~~~
Matthias247
I used Irony in a very successful project too. It worked beautifully and got
its job done. The downside is that there is less documentation available than
for Antlr.

------
musesum
I used Antlr v3 to create an NLP parser for calendar events for iOS and
Android. It took longer than expected. The iOS + C runtime was opaque, so I
had to write a tool for debugging. The Android + Java runtime overran memory,
so I had to break the grammar into separate grammars. Of course, NLP is not a
natural fit. Don't know what problems are fixed by v4.

> The most obvious is the lack of recursion: you can’t find a (regular)
> expression inside another one ...

PCRE has some recursion. Here is an example for parsing anything between
{ }, with counting of inner brackets:

    (?>\{(?:[^{}]*|(?R))*\})|\w+

A C++11 constexpr can make hand-coded parsers a lot more readable by
allowing token names in case statements. For example, search for "str2int"
in the following island parser:
[https://github.com/musesum/par](https://github.com/musesum/par)

------
betenoire
These types of tutorials always start out explaining the problems with
regular expressions and why not to use them... then immediately proceed into
lexing via regular expressions.

Perhaps the tutorials should start with the strengths of regular expressions,
and how we can harness that for getting started with a lexer.

------
destructaball
What are the advantages of ANTLR over something like Haskell's Parsec?

[https://github.com/aslatter/parsec](https://github.com/aslatter/parsec)

~~~
danidiaz
Speaking of "parsec", "megaparsec" is a modern fork with a few more bells and
whistles:
[http://hackage.haskell.org/package/megaparsec](http://hackage.haskell.org/package/megaparsec)

------
closed
This tutorial looks great. I picked up Antlr4 a few months ago, and hadn't
done any parsing before then. The first week was basically me, The Definitive
Antlr4 Reference, and extreme confusion with how different targets worked.
Compounding the problem was the fact that a lot of the antlr4 example grammars
only work for a specific target. The use of different language implementations
as part of this tutorial seems really useful!

(Antlr4 is awesome :)

------
poppingtonic
I used ANTLR to write a Python parser for the SNOMED expression language last
year, and testing it was one of the weirder parts of the experience. I was up
and running in a few days, which was largely thanks to the ANTLR book. I love
this project. It made doing what I did a lot more fun than I thought it would
be. Hand-rolling an ABNF parser from scratch would be a nice hobby project,
but not when one has a deadline.

