
The language of languages - Adrock
http://matt.might.net/articles/grammars-bnf-ebnf/
======
mcguire
Good introduction to Backus-Naur Form (BNF), Extended BNF (EBNF), Augmented
BNF (ABNF, a la RFC 5234) and the typical extensions. Not earth shattering,
but a good reference.

------
stiff
If you want to learn to apply grammars and parsers in practice, I recommend the
ANTLR parser generator:

<http://www.antlr.org/>

It can not only do the basic text->tree parsing from a file describing the
grammar, but also lets you specify additional grammars for _traversing_
the generated tree and executing arbitrary code in your language of choice as
particular nodes are recognized. I built a little compiler in it some years
ago. I had an Xpl.g grammar file for parsing the program text and creating the
abstract syntax tree, a SemanticAnalysis.g grammar file for doing a first pass
through the tree (annotating it with additional information, filling the
symbol table, checking semantic correctness), and finally CodeGeneration.g
for emitting JVM bytecode using the annotated tree. The code for this is still
on GitHub:

[https://github.com/jaroslawr/xpl/tree/509120e66e23aac8493414...](https://github.com/jaroslawr/xpl/tree/509120e66e23aac8493414ce5aa7e8079f6ff7ce)
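For readers who haven't seen this kind of multi-pass design, here is a rough sketch of the two tree passes in Python rather than ANTLR grammar files. The tuple-based AST, the `analyze` and `generate` functions, and the stack-machine instruction names are all made up for illustration; none of it is taken from the xpl repository, and it emits toy instructions, not JVM bytecode.

```python
# Hypothetical tuple-based AST for a tiny language (not the xpl AST):
#   stmt: ("assign", name, expr) | ("print", expr)
#   expr: ("num", value) | ("var", name) | ("add", left, right)

def analyze(program):
    """First tree pass: fill a symbol table and reject undefined variables."""
    symbols = {}  # variable name -> slot index

    def check(expr):
        kind = expr[0]
        if kind == "var" and expr[1] not in symbols:
            raise NameError(f"undefined variable: {expr[1]}")
        if kind == "add":
            check(expr[1])
            check(expr[2])

    for stmt in program:
        if stmt[0] == "assign":
            check(stmt[2])
            symbols.setdefault(stmt[1], len(symbols))
        else:  # "print"
            check(stmt[1])
    return symbols


def generate(program, symbols):
    """Second tree pass: emit instructions for an imaginary stack machine."""
    code = []

    def emit(expr):
        kind = expr[0]
        if kind == "num":
            code.append(("PUSH", expr[1]))
        elif kind == "var":
            code.append(("LOAD", symbols[expr[1]]))
        else:  # "add"
            emit(expr[1])
            emit(expr[2])
            code.append(("ADD",))

    for stmt in program:
        if stmt[0] == "assign":
            emit(stmt[2])
            code.append(("STORE", symbols[stmt[1]]))
        else:  # "print"
            emit(stmt[1])
            code.append(("PRINT",))
    return code


program = [
    ("assign", "x", ("num", 2)),
    ("assign", "y", ("add", ("var", "x"), ("num", 3))),
    ("print", ("var", "y")),
]
symbols = analyze(program)
code = generate(program, symbols)
```

The point of the split is that `generate` can assume the tree is already known to be well-formed, because `analyze` has rejected anything that uses a variable before it is assigned.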

There is a simpler example using the same functionality described on the
ANTLR wiki:

[http://www.antlr.org/wiki/display/ANTLR3/Simple+tree-based+i...](http://www.antlr.org/wiki/display/ANTLR3/Simple+tree-based+interpeter)

I used the ANTLR reference book when learning it, but now there is also a real
introductory manual:

[http://pragprog.com/book/tpantlr2/the-definitive-antlr-4-ref...](http://pragprog.com/book/tpantlr2/the-definitive-antlr-4-reference)

[http://pragprog.com/book/tpdsl/language-implementation-patte...](http://pragprog.com/book/tpdsl/language-implementation-patterns)

I also used the Dragon Book and "Programming Language Pragmatics" for theory.
Both are great books, and PLP certainly deserves to be better known.

~~~
Cyranix
Thanks for the additional information on ANTLR. I stumbled upon it last month
when doing some prep work for long-term planning at my company. I immediately
wished I could rip out most of the custom code my team had written for
handling ARFF files (for the Weka machine learning toolkit) and replace it with
the ANTLR grammar file attached to the Weka wiki.

------
JCraig
I was thrown off by Might stating that "a grammar defines a language," which
is not nearly as useful or factual as saying that it "describes" a language,
the wording that he relies on throughout the rest of the article. That is the
difference between my being able to make a dog and my being able to identify a
dog based on a set of characteristics.

Grammars are only one part of understanding a language, hardly the "language
of languages". In natural languages, grammars are one subset of linguistics.
It would be just as valid to say vocabularies or phonology are the language of
languages as it would be to say grammars are.

Other than these overly broad arguments and attempts to define natural
languages in the same way that formal languages can be defined, this is a nice
general introduction to some specific notation techniques for computer
languages.

Of course, I might not have read it at all if it were titled "An Introduction
to Backus-Naur Form, Extended BNF, and Augmented BNF Notation Techniques".

~~~
seanmcdirmid
Ya, grammar only describes the syntax of a language. You still have the
semantics and pragmatics to consider!

I'm not even sure if an XBNF is the best way to describe or reason about
language syntax. Precedence grammars (with hacks to handle braces) are quite
interesting for robust error-tolerant parsing, and might more closely mirror
how we internalize grammars in our heads.
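To give a flavor of precedence-driven parsing, here is a small precedence-climbing expression parser in Python. Precedence climbing is a relative of operator-precedence grammars rather than the same thing, and this sketch has no error recovery at all. The `OPS` table, the function names, and the nested-tuple output are assumptions for illustration.

```python
import re

# Hypothetical operator table: symbol -> (binding power, right-associative?)
OPS = {
    "+": (1, False),
    "-": (1, False),
    "*": (2, False),
    "/": (2, False),
    "^": (3, True),
}

def tokenize(text):
    """Split into integer literals, operators, and parentheses."""
    return re.findall(r"\d+|[-+*/^()]", text)

def parse(tokens, min_prec=1):
    """Parse tokens (consumed from the front) into nested tuples."""
    left = parse_atom(tokens)
    # Keep absorbing operators that bind at least as tightly as min_prec.
    while tokens and tokens[0] in OPS and OPS[tokens[0]][0] >= min_prec:
        op = tokens.pop(0)
        prec, right_assoc = OPS[op]
        # A left-associative operator needs strictly tighter binding on its right.
        right = parse(tokens, prec if right_assoc else prec + 1)
        left = (op, left, right)
    return left

def parse_atom(tokens):
    """An atom is an integer or a parenthesized expression."""
    tok = tokens.pop(0)
    if tok == "(":
        inner = parse(tokens)
        tokens.pop(0)  # discard the closing ")"
        return inner
    return int(tok)
```

With this table, `parse(tokenize("1 + 2 * 3"))` yields `("+", 1, ("*", 2, 3))`. The whole syntax of binary expressions lives in a table of numbers rather than in a stack of grammar rules, which is part of the appeal.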

~~~
aaronblohowiak
Technically, the grammar describes the language of the language, where the
"language of the language" means the formal language: the set of characters
and strings that are valid. (The union of all of the characters allowed
in any char or string that is valid in the language is the language's
_alphabet_, but not all members of the alphabet may be allowed to stand alone
as a token in a given language.)

~~~
seanmcdirmid
My PhD in PL tells me you're right, but I'm always on the lookout for a more
intuitive vs. technical definition of language, even for programming.

~~~
aaronblohowiak
You could always use META II which defines the grammar and the semantics :D

------
aaronblohowiak
Cool tutorial and introduction. In my toy languages, I prefer PEGs to CFGs
because I believe in ordered choice (alternatives are tried in priority
order) -- this means that your grammar can NOT be ambiguous. I also enjoy that
you can skip tokenization.
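A minimal sketch of ordered choice in Python, assuming a hand-rolled combinator style. The helper names `lit`, `seq`, and `choice` and the two toy rules are invented for illustration and don't come from any particular PEG library.

```python
# Each parser takes (text, pos) and returns (value, new_pos), or None on failure.

def lit(s):
    """Match the literal string s directly against the characters (no tokenizer)."""
    def p(text, pos):
        return (s, pos + len(s)) if text.startswith(s, pos) else None
    return p

def seq(*parsers):
    """Match every parser in turn; fail if any one of them fails."""
    def p(text, pos):
        values = []
        for parser in parsers:
            result = parser(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return p

def choice(*parsers):
    """Ordered choice: the first alternative that matches wins."""
    def p(text, pos):
        for parser in parsers:
            result = parser(text, pos)
            if result is not None:
                return result  # later alternatives are never consulted
        return None
    return p

# Hypothetical rules, written as PEG:
#   keyword <- "if" / "ifdef"
#   good    <- "ifdef" / "if"
keyword = choice(lit("if"), lit("ifdef"))
good = choice(lit("ifdef"), lit("if"))
```

Here `keyword("ifdef", 0)` returns `("if", 2)` while `good("ifdef", 0)` returns `("ifdef", 5)`. There is never more than one parse, but the flip side is that the order of alternatives is now part of the grammar's meaning.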

~~~
pacaro
I too am a fan of PEG (Parsing Expression Grammar [1]). In particular, I like
that it is pretty easy to write a PEG parser, whereas there is a lot more work
and housekeeping (IMHO) in implementing an LALR [2] parser.

I find it interesting that the parser generators are so closely bound up with
code generation. I like the model where you can specify at runtime: when this
non-terminal is parsed, execute that function.

[1] <http://en.wikipedia.org/wiki/Parsing_expression_grammar>

[2] <http://en.wikipedia.org/wiki/LALR_parser>
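One way the "when this non-terminal is parsed, execute that function" model could look, sketched in Python. The `GRAMMAR` dictionary format, the `parse` function, and the `actions` mapping are all hypothetical. The grammar is plain data and the actions are an ordinary dict handed in at call time, so nothing is generated ahead of time.

```python
import re

# Hypothetical grammar as data: each non-terminal maps to one sequence of
# symbols. Keys of GRAMMAR are non-terminals; anything else is a regex.
GRAMMAR = {
    "pair": ["key", "=", "value"],
    "key": [r"[a-z]+"],
    "value": [r"\d+"],
}

def parse(symbol, text, pos, actions):
    """Return (result, new_pos); raise SyntaxError if the input doesn't match."""
    if symbol not in GRAMMAR:  # terminal: match the regex at pos
        m = re.compile(symbol).match(text, pos)
        if m is None:
            raise SyntaxError(f"expected {symbol!r} at position {pos}")
        return m.group(), m.end()
    children = []
    for part in GRAMMAR[symbol]:
        child, pos = parse(part, text, pos, actions)
        children.append(child)
    # The action is looked up while parsing; the default returns the children.
    action = actions.get(symbol, lambda kids: kids)
    return action(children), pos

# Actions supplied at runtime, keyed by non-terminal name.
actions = {
    "value": lambda kids: int(kids[0]),
    "pair": lambda kids: {kids[0][0]: kids[2]},
}
result, end = parse("pair", "width=42", 0, actions)
```

Passing a different `actions` dict (or an empty one) to the same grammar gives a different result without touching the parser, which is the decoupling from code generation that the comment is after.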

------
ilaksh
Why is there no machine-processable language of languages?

Like a KR or some system that could read BNF or whatever and translate
directly into another level, like operations defined by the operating system
or something.

Why can't we describe a language declaratively and semantically so that its
low level details wouldn't have to be specified manually and so that it could
be related to other languages and reasoning could be done about its effects?

The lowest levels of the system would probably have to be described as part of
the same representation system.

~~~
Yttrill
Felix is built on an extensible parser which uses EBNF for the productions.
The Felix "language" is defined in the library by such productions. The action
codes of the grammar are arbitrary Scheme functions returning non-arbitrary
s-expressions which map to Felix AST nodes.

The grammar is here:

<http://felix-lang.org/lib/grammar>

Close examination reveals even the regexps for literals are defined in the
grammar. The parser is built on top of the excellent extensible GLR+ parsing
tool Dypgen.

In principle the Felix parsing system is independent of Felix. All you need to
do is replace the s-expression to Felix AST translator with some kind of
pretty printer for s-expressions, even XML, and you can target anything.

------
tlarkworthy
I just worked with a mathematician who understood grammars, and could write
one, but could not use the theory in any practical way. Parsing is _not_ just
recognition. You need to pass info up the parse tree so that after parsing,
you have a new data structure with everything nicely structured ready to
__ACT UPON__. People write in specific languages to achieve something, not to
simply create syntactically correct bodies of text. So understanding grammars
is 50% of being able to use them as tools.

The other thing you need in order to put them to use is a mental model of how
a bottom-up parser processes the tokens. You need to be able to insert actions
for each grammar rule at appropriate places, which allows the information to
flow bottom-up too, and be processed appropriately at the same time. I have
found this second bit of language processing is actually the harder bit, and
it's worth rewriting your grammar to make it simpler.
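To make that mental model concrete, here is a deliberately naive shift-reduce loop in Python with an action attached to each rule. It is only a sketch: a real LR or LALR parser decides when to reduce from generated tables and lookahead, whereas this one just tries the rules in order, which is only good enough for this toy grammar. The `RULES` table and function names are made up for illustration.

```python
import re

# Hypothetical grammar, listed so longer right-hand sides are tried first:
#   expr -> expr "+" term | expr "-" term | term
#   term -> NUM
# Each rule carries an action that computes the value of the new node from
# the values of the symbols it replaces.
RULES = [
    ("expr", ["expr", "+", "term"], lambda a, _, b: a + b),
    ("expr", ["expr", "-", "term"], lambda a, _, b: a - b),
    ("expr", ["term"], lambda t: t),
    ("term", ["NUM"], lambda n: int(n)),
]

def reduce_once(stack):
    """Replace a right-hand side on top of the stack; return True if one matched."""
    for lhs, rhs, action in RULES:
        n = len(rhs)
        if [sym for sym, _ in stack[-n:]] == rhs:
            values = [val for _, val in stack[-n:]]
            del stack[-n:]
            stack.append((lhs, action(*values)))  # the value flows up to the parent
            return True
    return False

def parse(text):
    stack = []  # entries are (grammar symbol, semantic value)
    for tok in re.findall(r"\d+|[-+]", text):
        stack.append(("NUM" if tok.isdigit() else tok, tok))  # shift
        while reduce_once(stack):  # reduce as far as possible
            pass
    if len(stack) != 1 or stack[0][0] != "expr":
        raise SyntaxError(f"could not reduce input to expr: {stack}")
    return stack[0][1]
```

For example, `parse("10 - 3 + 5")` returns `12`. Each reduction pops the children's values and pushes the parent's, so by the time the stack holds a single `expr` the information has already flowed all the way up.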

~~~
aaronblohowiak
>You need to pass info up the parse tree so that after parsing, you have a new
data structure with everything nicely structured ready to ACT UPON.

I (and many others) prefer to keep these passes relatively distinct. Parse
tree -> AST -> Internal Representation -> (Generation of Output /
Interpretation)
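A small sketch of what keeping those passes distinct might look like, in Python. The parse tree is written out by hand (as if a parser had produced it for `1 + 2 * 3`), and the node shapes, function names, and stack-machine instructions are all assumptions for illustration.

```python
# Hypothetical concrete parse tree for "1 + 2 * 3": every grammar rule applied
# shows up as a (rule_name, children...) node; leaves are token strings.
parse_tree = (
    "expr",
    ("term", ("factor", "1")),
    "+",
    ("term", ("term", ("factor", "2")), "*", ("factor", "3")),
)

def to_ast(node):
    """Parse tree -> AST: drop single-child wrapper nodes, keep operators and numbers."""
    if isinstance(node, str):
        return int(node)
    children = node[1:]
    if len(children) == 1:
        return to_ast(children[0])
    left, op, right = children
    return (op, to_ast(left), to_ast(right))

def to_ir(ast, code=None):
    """AST -> IR: a flat list of instructions for an imaginary stack machine."""
    code = [] if code is None else code
    if isinstance(ast, int):
        code.append(("push", ast))
    else:
        op, left, right = ast
        to_ir(left, code)
        to_ir(right, code)
        code.append(("add",) if op == "+" else ("mul",))
    return code

def interpret(code):
    """IR -> result: run the stack-machine instructions."""
    stack = []
    for instr in code:
        if instr[0] == "push":
            stack.append(instr[1])
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if instr[0] == "add" else a * b)
    return stack.pop()

ast = to_ast(parse_tree)
ir = to_ir(ast)
value = interpret(ir)
```

Each stage only knows about the data structure handed to it, so `interpret` could be swapped for a code emitter without touching the grammar or the tree conversion.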

------
worldsayshi
BNFC is a compiler front end generator that can generate parsers and
traversers for various languages from a Labeled BNF grammar. I'm not sure how
that differs from an ordinary BNF grammar, since I've only used this one and
it was a while ago. Some added expressiveness anyway:

<http://bnfc.digitalgrammars.com/>

------
dasil003
Lisp is the language of languages.

