
The difference between top-down parsing and bottom-up parsing - zniperr
http://qntm.org/top
======
carsongross
I really didn't understand grammars until I started doing hand-rolled top-down
recursive descent parsing for Gosu.

It's a shame that parsing in CS schools is taught using theory and lex/yacc-type
tools when a basic lexer and recursive descent parser can be written from
first principles in a few months. It is more incremental, and it gives you a
much deeper feel for the concepts, plus it lets you learn a bit about software
engineering and organization as well.
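
To give a flavor of what I mean, here's a minimal sketch (mine, not Gosu's;
Python, toy expression grammar, made-up names):

    import re

    # Grammar:  expr   -> term ('+' term)*
    #           term   -> factor ('*' factor)*
    #           factor -> NUMBER | '(' expr ')'
    TOKEN_RE = re.compile(r'\s*(?:(\d+)|(.))')

    def tokenize(text):
        # Yield (kind, value) pairs; one-char operators are their own kind.
        for number, op in TOKEN_RE.findall(text):
            yield ('NUMBER', int(number)) if number else (op, op)
        yield ('EOF', None)

    class Parser:
        def __init__(self, tokens):
            self.tokens = list(tokens)
            self.pos = 0

        def peek(self):
            return self.tokens[self.pos][0]

        def next(self):
            tok = self.tokens[self.pos]
            self.pos += 1
            return tok

        def expr(self):                   # expr -> term ('+' term)*
            node = self.term()
            while self.peek() == '+':
                self.next()
                node = ('+', node, self.term())
            return node

        def term(self):                   # term -> factor ('*' factor)*
            node = self.factor()
            while self.peek() == '*':
                self.next()
                node = ('*', node, self.factor())
            return node

        def factor(self):                 # factor -> NUMBER | '(' expr ')'
            kind, value = self.next()
            if kind == 'NUMBER':
                return value
            if kind == '(':
                node = self.expr()
                assert self.next()[0] == ')', "expected ')'"
                return node
            raise SyntaxError('unexpected token: %r' % kind)

    print(Parser(tokenize('1 + 2 * (3 + 4)')).expr())
    # ('+', 1, ('*', 2, ('+', 3, 4)))

Each grammar rule becomes one method, which is exactly why it teaches you
grammars: you can read the grammar straight out of the code.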

~~~
ternaryoperator
It is a shame that Gosu has not become more popular as a JVM language. You
guys were doing things that took other languages years to catch up on.

~~~
zura
Indeed. Also Groovy++ (aka Groovy with static typing) is worth mentioning.

~~~
vorg
Or even Beanshell. In fact, the ideas behind Beanshell (i.e. closures on the
JVM) and Groovy++ (i.e. static typing) were scooped up and copied into Groovy,
that kitchen-sink language, which was bundled as part of Grails, that
knock-off of Rails for the JVM.

The software industry is divided into "Makers" and "Takers". The people behind
Gosu are obviously Makers, whereas those presently behind Groovy are Takers.

~~~
ternaryoperator
> "...copied into Groovy, that kitchen sink language, which was bundled as
> part of Grails, that knock-off of Rails for the JVM. The software industry
> is divided into 'Makers' and 'Takers'. The people behind Gosu are obviously
> Makers, whereas those presently behind Groovy are Takers."

This is all wrong. Groovy predates Grails, as Grails was written in Groovy.
Gosu and Groovy came out at roughly the same time, and IIRC, Groovy had
closures and features like the Elvis operator before Gosu did.

You apparently don't like Groovy's current owners, but slamming Groovy for
implementing static typing, and branding them as "Takers" for doing so, makes
no sense at all.

~~~
vorg
I've gotten the impression over the years that Grails is written in Java and
only bundles Groovy for user scripting, but I haven't checked recently. I
certainly looked through the Gradle codebase when its version 2.0 came out,
and there was hardly a line of Groovy code in there, only Java. It, too, only
uses Groovy for user scripting. If I'm right about the Grails codebase,
statically typed Groovy isn't actually used to build any systems of note.
Perhaps Cedric Champeau, who wrote the statically typed Groovy codebase, will
use it to build the Groovy for Android he's presently working on. It'll be
interesting to see whether he uses it or uses Java, because that will show
whether he has enough faith in what he built to actually use it. Until
statically typed Groovy is used to build something, Groovy will remain a
dynamically typed scripting language for testing, Gradle builds, webpages, etc.

------
Guthur
(facetious-comment "It's a tragedy how much brilliance is wasted on grammar
parsing when it's a solved problem; just use a Lisp")

~~~
skybrian
JSON or XML would also work. Except that few people like languages based on
XML, and I haven't seen anyone seriously try JSON.

Perhaps someone should try to build a JSON-like language that's close to how
most programmers like to write their code?

~~~
kyllo
Yeah, this is why no one likes languages based on XML:
[http://thedailywtf.com/articles/We-Use-BobX](http://thedailywtf.com/articles/We-Use-BobX)

If you were going to do it, I'm certain you'd want to use a more human-
writable format such as YAML or TOML instead of JSON or XML.

But doing so means you're living out Greenspun's Tenth Rule. Again, just use a
Lisp.
[http://www.defmacro.org/ramblings/lisp.html](http://www.defmacro.org/ramblings/lisp.html)

~~~
green7ea
I know BobX is supposed to be a troll language, but it bothers me to no end
that the example code uses <xbobendif> instead of the XML-spirited </xbobif>.
I believe this confirms that I suffer from OCD.

------
nachivpn
Bottom-up parsing - "If not, it's necessary to backtrack and try combining
tokens in different ways" - I feel that putting it this way alongside
shift-reduce parsing is misleading. Backtracking is precisely the thing
shift-reduce parsing avoids (more like solves); the two don't go together in
bottom-up parsing. Shift-reduce parsers can predict which handle to use by
looking at the contents on top of the stack, plus a token or two of lookahead.
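
To illustrate (my toy sketch in Python, not from the article): for the grammar
E -> E '+' NUM | NUM, the top of the stack plus one token of lookahead always
determines the next move, so the parser never backtracks:

    def parse(tokens):
        stack = []                        # terminals and the nonterminal 'E'
        tokens = tokens + ['$']           # end-of-input marker
        i = 0
        while True:
            la = tokens[i]                # one token of lookahead
            if stack[-3:] == ['E', '+', 'NUM']:
                stack[-3:] = ['E']        # reduce by E -> E '+' NUM
            elif stack[-1:] == ['NUM']:
                stack[-1:] = ['E']        # reduce by E -> NUM
            elif la == '$':
                return stack == ['E']     # accept iff all reduced to one E
            else:
                stack.append(la)          # shift
                i += 1

    print(parse(['NUM', '+', 'NUM', '+', 'NUM']))  # True
    print(parse(['NUM', '+', '+']))                # False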

Good job BTW, there are very few people who write about compiler/language
theory :)

~~~
canjobear
You need backtracking if you have an ambiguous grammar. This comes up in
natural language parsing; I'd guess that it is avoided in programming
languages.

~~~
aardvark179
Actually, many languages which started with hand-written parsers do have
ambiguous grammars, or have an unambiguous grammar so hideously unwieldy that
it is best ignored.

This sort of thing can be fixed up with predicates added to the grammar, but
it always feels like a bodge.
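
For example (a made-up sketch): in a C-like language, "foo * bar;" is a
declaration if foo names a type and a multiplication otherwise, and the
grammar alone can't tell. The usual bodge is a predicate that peeks at a
symbol table before the parser commits to an alternative:

    # Hypothetical predicate in a hand-written parser; names are invented.
    known_types = {'int', 'float', 'matrix'}

    def parse_statement(tokens):
        # tokens like ['matrix', '*', 'm', ';'] or ['a', '*', 'b', ';']
        if tokens[0] in known_types:       # the predicate
            return ('declaration', tokens[0], tokens[2])
        return ('expression', ('*', tokens[0], tokens[2]))

    print(parse_statement(['matrix', '*', 'm', ';']))  # pointer declaration
    print(parse_statement(['a', '*', 'b', ';']))       # multiplication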

------
zura
One more difference: top-down parsing is European and bottom-up - American :)

~~~
twic
I'm not sure why this was downvoted - it's true(ish), and interesting.
Historically, American computer scientists preferred LR parsers, and Europeans
preferred LL parsers. That influenced the languages they designed.

I think I read about this in Sedgewick's 'Algorithms in C', although I could
be wrong. I struggle to find any online citation. This was mentioned in the
Wikipedia article at one point, but disappeared in an edit described as
"Removed heresay":

[http://en.wikipedia.org/w/index.php?title=LL_parser&directio...](http://en.wikipedia.org/w/index.php?title=LL_parser&direction=prev&oldid=450637859)

~~~
rdc12
That is quite interesting. At one point the main language in AI for Europe was
Prolog and in America it was LISP, or so I read. I wonder in what other
instances this kind of cultural difference has occurred, and whether the
internet has had any effect on this.

~~~
lispm
> AI for Europe was Prolog and in America it was LISP,

I wouldn't say 'main language', but Prolog was maybe a little more popular in
Europe. The Japanese, though, based their 'fifth generation' project on logic
programming. I once saw a Prolog machine, a computer with an architecture
optimized for Prolog and its main software written in Prolog:

[http://museum.ipsj.or.jp/en/computer/other/0009.html](http://museum.ipsj.or.jp/en/computer/other/0009.html)

$119,000 apiece...

------
slashnull
I was so happy when I finally grokked LR parsers: it's just a big state
machine! _If_ that token is found, push it on the stack and go to _that_ other
state. Consume a token, check the next state transition.
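
Something like this (my toy sketch; tables written out by hand for the grammar
E -> E '+' n | n, where a real generator would emit them):

    ACTION = {                            # (state, lookahead) -> action
        (0, 'n'): ('shift', 2),
        (1, '+'): ('shift', 3), (1, '$'): ('accept',),
        (2, '+'): ('reduce', 2), (2, '$'): ('reduce', 2),
        (3, 'n'): ('shift', 4),
        (4, '+'): ('reduce', 1), (4, '$'): ('reduce', 1),
    }
    GOTO = {(0, 'E'): 1}                  # (state, nonterminal) -> state
    RULES = {1: ('E', 3), 2: ('E', 1)}    # rule -> (lhs, length of rhs)

    def parse(tokens):
        stack = [0]                       # a stack of states
        tokens = tokens + ['$']
        i = 0
        while True:
            action = ACTION.get((stack[-1], tokens[i]))
            if action is None:
                return False              # syntax error
            if action[0] == 'accept':
                return True
            if action[0] == 'shift':      # consume the token, enter new state
                stack.append(action[1])
                i += 1
            else:                         # reduce: pop the rhs, GOTO on lhs
                lhs, n = RULES[action[1]]
                del stack[-n:]
                stack.append(GOTO[(stack[-1], lhs)])

    print(parse(['n', '+', 'n', '+', 'n']))  # True
    print(parse(['n', 'n']))                 # False

The driver loop is tiny; every bit of the grammar lives in the tables.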

But while fun, it was pretty much useless, because recursive descent and
combinators are so much easier.

~~~
phyzome
That moment when you finally get it that CFG parsers can be implemented using
push-down automata... :-)

------
vbit
Another good write-up on this topic is
[http://blog.reverberate.org/2013/07/ll-and-lr-parsing-demyst...](http://blog.reverberate.org/2013/07/ll-and-lr-parsing-demystified.html?m=1)

------
vorg
I think I read somewhere that incremental parsers are better off being written
as bottom-up parsers rather than top-down parsers. The reason given was that
when a small edit is made to the code being parsed, the artifact from a
bottom-up parser often needs only a minor change that ripples only as far as
it must, whereas a top-down parser needs to be completely rerun because it
can't tell whether the effect of one small edit is large or localized. Anyone
out there who can verify or refute this?

~~~
seanmcdirmid
I've written parsers both top-down and bottom-up. Note that neither is
naturally incremental; some memoization is required. For bottom-up, you check
whether the parent you want to create already exists; for top-down, the child.
It is sort of a wash which one is better, but both are pretty trivial.

I use top-down recursive descent these days, backed by memoization. When a
change to the token stream comes in, you simply damage the innermost tree that
contains the token, or, if the change is on a boundary, damage multiple trees.

~~~
vorg
Thanks. I've been writing and using recursive descent parsers a little lately,
both Parsec-style and Packrat-style. If it's possible to do decent incremental
parsing without wrapping my head around LR parsing, then I'll give it a try
sometime.

~~~
seanmcdirmid
It is actually quite easy if you have a memoized token stream to work on. Just
store your trees keyed by their first token. When you do a parse, check
whether a tree of that type already exists for that token. If it does, just
reuse it; and if there is any context to consider, dirty the tree whenever the
context of its creation has changed.
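
Roughly like this (a made-up sketch of the shape, with a deliberately trivial
grammar):

    class Tree:
        def __init__(self, rule, children, token_count):
            self.rule, self.children = rule, children
            self.token_count = token_count
            self.dirty = False

    memo = {}   # (id(first token), rule name) -> Tree

    def parse_list(tokens, pos=0):
        # Toy grammar: list -> word+ ; each word is one token.
        children = []
        while pos < len(tokens):
            first = tokens[pos]
            cached = memo.get((id(first), 'word'))
            if cached is not None and not cached.dirty:
                tree = cached                      # reuse the old subtree
            else:
                tree = Tree('word', [first], 1)    # "parse" one word
                memo[(id(first), 'word')] = tree
            children.append(tree)
            pos += tree.token_count                # skip past the subtree
        return Tree('list', children, pos)

    tokens = ['a', 'b', 'c']
    old = parse_list(tokens)
    tokens.insert(1, 'x')                          # the user types a token
    new = parse_list(tokens)
    print(old.children[0] is new.children[0])      # True: 'a' tree reused
    print(old.children[2] is new.children[3])      # True: 'c' tree reused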

There are more advanced incremental parsing algorithms that do not require
memoization, but they really aren't worth it (the point of parsing
incrementally is not really performance, but keeping state associated with
the tree around and doing better error recovery).

~~~
skybrian
Could you expand on this? What's a memoized token stream?

~~~
seanmcdirmid
A token stream that preserves token identities. Say you have a token stream of
[t0, t1] and add a new token t2 between t0 and t1 to get [t0, t2, t1]. What
you want is to be able to identify t0 and t1 in the new stream as being the
same tokens as in the old stream. If you simply re-lex to get new tokens, that
won't happen, and if you use character offsets as your token identities, t1 in
the new and old stream will have different identities.

Incremental lexing is pretty easy: just convert the tokens around the damage
back into text, re-lex that text, and then work from the front and back of the
new partial stream, replacing new tokens with old existing tokens wherever
their lexical categories match (e.g. an old identifier's text can be updated
with a new identifier's contents; that's fine because parsing only depends on
the fact that the token is an identifier). You might not win any performance
awards, but those reused old tokens make very good keys into various data
structures for incremental parsing, incremental type checking, and even
incremental re-execution (if you are into live programming).
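
In code, roughly (my own sketch, with a trivial lexer; a real implementation
would re-lex only the damaged region rather than the whole string):

    import re

    class Token:
        def __init__(self, cat, text):
            self.cat, self.text = cat, text

    def lex(text):
        # Trivial lexer: identifiers and one-char operators.
        return [Token('ID' if t[0].isalpha() else t, t)
                for t in re.findall(r'[A-Za-z]\w*|\S', text)]

    def relex(old_tokens, new_text):
        new_tokens = lex(new_text)
        # Walk in from the front, reusing old tokens while categories match...
        i = 0
        while (i < min(len(old_tokens), len(new_tokens))
               and old_tokens[i].cat == new_tokens[i].cat):
            old_tokens[i].text = new_tokens[i].text  # update contents in place
            new_tokens[i] = old_tokens[i]            # ...keeping old identity
            i += 1
        # ...then in from the back, stopping before the reused front part.
        j = 0
        while (j < min(len(old_tokens), len(new_tokens)) - i
               and old_tokens[-1 - j].cat == new_tokens[-1 - j].cat):
            old_tokens[-1 - j].text = new_tokens[-1 - j].text
            new_tokens[-1 - j] = old_tokens[-1 - j]
            j += 1
        return new_tokens

    old = lex('foo + bar')
    new = relex(old, 'foo * quux + bar')          # edit in the middle
    print(new[0] is old[0], new[-1] is old[-1])   # True True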

~~~
skybrian
Thanks. To figure out what's damaged, it seems like you have to do a diff
somewhere? It sounds like this is done at the character level rather than the
token level?

~~~
seanmcdirmid
No diff; just do the re-lex and damage the tree on the boundaries of where the
user typed. My techniques (also, see Glitch) usually just assume a change has
happened, since the reprocessing is incremental anyway and won't cascade if
nothing has really changed (when reprocessing a tree, the parent is damaged
only if the end token changed; I guess you could call that a diff).

------
jimmaswell
What do you call it if you're just using a big list of regexes? I've seen that
used for a simple dialog scripting language in a game.

~~~
olavk
If the language is simple enough to be parsed with regular expressions alone,
then the language is regular and doesn't need a context-free grammar, so the
bottom-up/top-down distinction does not apply. This also means the language
cannot have recursive productions, e.g. it cannot support arbitrarily nested
expressions.

But it depends on how the list of regexes is used. If it is used as part of a
recursive parsing routine, then it is a recursive descent parser in which
lexing and parsing happen in the same pass.
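
For example, something like this (a made-up sketch) is still a recursive
descent parser even though every token is recognized by a regex; the recursion
is what buys you the nesting that the regexes alone can't express:

    import re

    NUMBER = re.compile(r'\s*(\d+)')
    LPAREN = re.compile(r'\s*\(')
    RPAREN = re.compile(r'\s*\)')
    PLUS   = re.compile(r'\s*\+')

    def parse_expr(text, pos=0):
        # expr -> atom ('+' atom)*
        node, pos = parse_atom(text, pos)
        m = PLUS.match(text, pos)
        while m:
            rhs, pos = parse_atom(text, m.end())
            node = ('+', node, rhs)
            m = PLUS.match(text, pos)
        return node, pos

    def parse_atom(text, pos):
        # atom -> NUMBER | '(' expr ')'
        m = NUMBER.match(text, pos)
        if m:
            return int(m.group(1)), m.end()
        m = LPAREN.match(text, pos)
        if m:
            node, pos = parse_expr(text, m.end())
            m = RPAREN.match(text, pos)
            if not m:
                raise SyntaxError("expected ')'")
            return node, m.end()
        raise SyntaxError('expected a number or (')

    print(parse_expr('1 + (2 + 3)')[0])   # ('+', 1, ('+', 2, 3))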

