
How to Parse Ruby - steveklabnik
http://programmingisterrible.com/post/42432568185/how-to-parse-ruby
======
jayferd
Even lexing ruby for highlighting is really hard. I had to use so many
lookahead hacks to get the lexer in rouge working:
[https://github.com/jayferd/rouge/blob/master/lib/rouge/lexer...](https://github.com/jayferd/rouge/blob/master/lib/rouge/lexers/ruby.rb)

I learned a ton about ruby when implementing that. Did you know that this
totally works:

    
    
        my_method <<-ONE, <<-TWO, <<-THREE
           text of one
        ONE
           text of two
        TWO
           text of three
        THREE
    

And all the % delimited strings are super hard.

(also note that pygments chokes on my perfectly valid ruby file, at
`%r(\\\\\\\\)`)

~~~
Udo
For non-rubyists, what does this example do? Are those labels?

~~~
venus
That's ruby's heredoc syntax: <http://en.wikipedia.org/wiki/Here_document>

I had no idea it could be nested like that and I've been using ruby for 8
years. Crazy!
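
For anyone who wants to try it, here is a small runnable sketch of the example above. `collect` and `heredoc_demo` are made-up names for illustration; the only real feature on display is that each `<<-NAME` on the call line becomes one string argument, and the bodies are read in order from the lines that follow.

```ruby
# Hypothetical helper: returns its arguments so they can be inspected.
def collect(*strings)
  strings
end

# Three heredocs opened on one line. Each body runs up to its own
# terminator; `<<-` lets the terminator itself be indented.
def heredoc_demo
  collect <<-ONE, <<-TWO, <<-THREE
     text of one
  ONE
     text of two
  TWO
     text of three
  THREE
end

heredoc_demo.each { |body| puts body.strip }
```

The method receives three separate strings, each keeping its leading whitespace and trailing newline.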

~~~
mitchty
Holy cow, 10 years for me and I've never even thought of nesting heredocs.

Some days I wish Matz would formalize the grammar so it is sane.

------
koenigdavidmj
Well, even C has this requirement. Code like

    
    
      foo * bar;
    

Could either mean "multiply foo by bar and drop it on the floor", or "declare
bar as a pointer to a foo". C++ makes it worse, of course. Either way, you
need to be aware of what symbols are in scope at any given point.

~~~
masklinn
Another fun one:

    
    
        (a) - (b)
    

which in statically typed C-like languages can mean either "cast the value
-(b) to type a" or "subtract the value (b) from the value (a)"

~~~
epidemian
These two are great examples of why i think "overloading" syntax symbols
sucks. Problem is, nearly _every_ language does that, and in many cases it
makes some expressions ambiguous (not to the parsers, as they have those
corner-cases defined, but ambiguous to the programmers):

- in JavaScript braces are used both as block delimiters and object literal
  delimiters
- in CoffeeScript, which prevents that particular grammar ambiguity from JS,
  whitespace is used for many different things: scope/block delimiting, object
  literals, even function calls (optional parentheses)
- Python uses parentheses both for expression grouping (precedence), function
  calls AND tuples
- square brackets are used in many languages to both denote literal
  lists/arrays and element access ([] operator)
- etc, etc, etc

There are very few languages that i know about that don't do this kind of
lexical-token overloading. Smalltalk is an example: square brackets are always blocks,
parentheses are only used to group expressions, dots are always statement
terminators, etc. I think Haskell is another example of a straightforward
syntax, but don't quote me on that.

Is there any formal name for these kinds of non-overloaded/simple syntaxes? If
there is a formalism for those, why is it that "overloaded" syntaxes are
preferred most of the time, or why are they so prevalent?

~~~
masklinn
Note that having "overloaded" operators does not mean the grammar is
context-sensitive (let alone ambiguous). For instance, the parens as grouping and
tuple operators in Python are indeed ambiguous (you have to parse to the first
comma or end parens to know which is which), but they're not ambiguous with
funcalls: a function call's opening paren is an infix, it can't be the start of a
new expression. Same with square brackets for both array literal and element
access, one is a prefix, the other one is an infix. Similarly, there is no
ambiguity to e.g. the `+` or `-` operator being both infix (addition and
subtraction) and prefix (positive and negate).
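
The prefix/infix split can be seen in a parser's output. The sketch below uses Ruby's Ripper rather than Python, so it only illustrates the idea for these two specific inputs; `node_types` is a made-up helper that lists the symbols appearing in the S-expression.

```ruby
require 'ripper'

# Hypothetical helper: every Symbol that appears anywhere in the
# S-expression Ripper builds for `source`.
def node_types(source)
  Ripper.sexp(source).flatten.select { |item| item.is_a?(Symbol) }
end

# `[` in prefix position starts an array literal; after an expression
# it is an element access.
p node_types("[1, 2][0]")

# `-` in prefix position is negation; between two operands it is
# subtraction.
p node_types("-a - b")
```

The first input should contain both an `:array` and an `:aref` node, the second both a `:unary` and a `:binary` node, even though each pair is spelled with the same character.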

~~~
epidemian
Totally right. I was aware that this "overloading" does not imply ambiguity
for the parser; it's just that i find it confusing to read in some cases (like
the tuples in Python, or the {} in JS), and i think it's curious that language
designers deliberately choose to use the same symbol for totally different
things when they could use something else, or keep the grammar simpler.

I guess it's always a trade-off between simplicity at the grammar level and
ease of use, but i'm not so sure about that either. Smalltalk has a very
simple grammar, yet that doesn't make it harder to use than other languages
with similar semantics; e.g. Ruby, which prefers to be "programmer friendly"
by having multiple syntactic forms for denoting blocks, even a special syntax
for calling function whose last parameter is a block, and much much more.

~~~
masklinn
> i think it's curious that language designers deliberately choose to use the
> same symbol for totally different things when they could use something else,
> or keep the grammar simpler.

The problem is that there's only a limited number of symbols available in
ASCII, especially "matching"/"paired" symbols where you've got all of three
pairs to work with (`{}`, `[]` and `()`), with one pair having a lot of
immutable baggage (`()`) and one being coopted by C's inheritance (`{}`). (I'm
not counting `<`/`>` as they have even more historical baggage as comparison
operators)

Now of course we could use non-paired symbols even for paired situations, but
I guess these symbol pairs look... right? Especially for situations where
we're defining a "section" or "grouping" rather than a coherent block (such as
a string)

------
cllns
I'm interested in how this compares to other languages with reference
implementations. Is ruby the odd one out here, not providing a formal grammar,
or is that the norm?

Ruby has a stated design goal of making developers happy. As far as I'm aware,
it hasn't been designed to be easily parsed.

I, as an end user (i.e. programmer), prefer it this way. If ease of parsing is
important for you, maybe you should use something like LISP.

~~~
lucian1900
Python, Haskell, Java, Go have formal and even context-free grammars. It makes
me as a developer happy since it's much easier to predict how the language
will parse something.

~~~
peterwoo
Python is not context free.

~~~
Scaevolus
Indentation is handled by the lexer, not the parser.

The parser is context free-- LL(1), actually.

The lexer maintains one extra stack to track indentations, meaning it's more
of a PDA than a DFA, but the actual complication is mild.
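
A rough sketch of that extra stack, written in Ruby for illustration and not taken from CPython. It assumes spaces only (no tabs), ignores blank lines, and uses `DEDENT`, which is Python's name for what this thread calls OUTDENT; `indent_tokens` and the `:LINE` token are made-up names.

```ruby
# Turn leading whitespace into INDENT / DEDENT tokens using a stack of
# indentation widths. The parser downstream then sees them like { and }.
def indent_tokens(source)
  stack = [0]                       # widths of the currently open blocks
  tokens = []
  source.each_line do |line|
    next if line.strip.empty?       # blank lines do not change indentation
    width = line[/\A */].size       # count leading spaces (tabs not handled)
    if width > stack.last
      stack.push(width)
      tokens << :INDENT
    else
      while width < stack.last      # one DEDENT per block being closed
        stack.pop
        tokens << :DEDENT
      end
      raise ArgumentError, "inconsistent dedent" unless width == stack.last
    end
    tokens << [:LINE, line.strip]
  end
  (stack.size - 1).times { tokens << :DEDENT }  # close blocks open at EOF
  tokens
end

p indent_tokens("if a:\n  b\n  if c:\n    d\ne\n")
```

The stack is the only state carried between lines, which is why the lexer stops being a plain finite automaton while the token stream it emits stays easy to parse.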

~~~
peterwoo
The _language_ is not context free, although I agree there is a parsing
solution which is not terribly complex.

~~~
lucian1900
It _is_ context free. INDENT and OUTDENT are directly equivalent to { and } in
other grammars.

~~~
peterwoo
The difference is that INDENT and OUTDENT have to be defined by context-
sensitive productions. _Even if this occurs in what would otherwise be the
"lexing" stage of the parser_.

------
michaelfeathers
Use Ripper in 1.9.3. It's part of the distribution and it works well.

[http://www.rubyinside.com/using-ripper-to-see-how-ruby-is-pa...](http://www.rubyinside.com/using-ripper-to-see-how-ruby-is-parsing-your-code-5270.html)

<https://gist.github.com/michaelfeathers/4704833>
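
A minimal sketch of what Ripper gives you, using only `Ripper.tokenize` and `Ripper.sexp` from the standard library. The two helper names are made up for illustration, and the sketch assumes the input parses (`Ripper.sexp` returns nil on a syntax error).

```ruby
require 'ripper'

# Hypothetical helper: the raw token strings, whitespace included.
def token_strings(source)
  Ripper.tokenize(source)
end

# Hypothetical helper: the node type of the last top-level statement in
# the S-expression, which has the shape [:program, [stmt, stmt, ...]].
def last_statement_type(source)
  Ripper.sexp(source)[1].last[0]
end

p token_strings("1 + 2")
p last_statement_type("x +3")          # x is not a known local variable
p last_statement_type("x = 1; x +3")   # x was assigned earlier
```

On MRI the same text `x +3` should come back as a `:command` (a call to `x` with the argument `+3`) in the first case and as a `:binary` addition in the second, which is the kind of context dependence the article is about.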

------
quadhome
_"Nobody really knows what the Bourne shell’s grammar is. Even examination of
the source code is little help."_

\-- Tom Duff, "Rc — A Shell for Plan 9 and UNIX Systems"[1]

[1]
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.41....](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.41.3287&rep=rep1&type=pdf)

------
endlessvoid94
I found the Rubinius source code to be quite readable. For example, here are
the AST node definitions:
[https://github.com/rubinius/rubinius/tree/master/lib/compile...](https://github.com/rubinius/rubinius/tree/master/lib/compiler/ast)

------
stib
A different approach, which has served me well, is to study and understand
Perl's parser/lexer, which are more intertwined than your grandmother's Friday
night challah. Once you get that, understanding the MRI parser is a snap.

~~~
rictic
Perl cannot be parsed in the traditional sense of the word. The parse tree is
unknowable until you're actually running.

<http://www.perlmonks.org/?node_id=663393>

~~~
tantalor
This one is also a good read,

<http://www.perlmonks.org/index.pl?node_id=44722>

------
ww520
Remember that a yacc file builds an AST, and paying attention to the AST
goes a long way in understanding a yacc file.

Language ambiguity due to context is pretty common. The way to deal with it
is with nested-scope symbol tables: determine the semantics of a symbol
usage based on the previous declaration of the symbol, tracked in the symbol
tables.

In the article's example, x + 3, it confuses the stages of the lexer and the
parser. The lexer only knows x as a symbol. Its output is symbol('x') op('+')
num(3). It doesn't know whether it's a function call or a variable reference.
It is the parser's job to figure out the semantics of x based on context and to
build the AST.

In this case if x has been declared as a variable before, its symbol and type
info are recorded in the symbol table. When x + 3 is encountered, it's a
simple lookup on the symbol table to see if it's a declared variable and
generate the AST node with op('+', var('x'), num(3)).

If x has been declared as a function in the symbol table, the lookup will
generate the funcall AST node.

For a special declaration rule like usage before declaration (as in Ruby), the
lack of a definition in the symbol table can default to a funcall.
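
A toy version of that lookup, as a sketch only. `Scope` and `plus_node` are made-up names, the node shapes mirror the op/var/num notation above, and none of this is how MRI is actually structured.

```ruby
# Hypothetical nested-scope symbol table: each scope maps a name to :var
# or :func and falls back to its parent scope on lookup.
class Scope
  def initialize(parent = nil)
    @parent = parent
    @symbols = {}
  end

  def declare(name, kind)
    @symbols[name] = kind
  end

  # Returns :var, :func, or nil when the name is declared nowhere.
  def lookup(name)
    @symbols.fetch(name) { @parent&.lookup(name) }
  end
end

# Build the AST node for `name + number`. A declared variable becomes a
# var node; anything else, including an undeclared name, is a funcall.
def plus_node(scope, name, number)
  operand = scope.lookup(name) == :var ? [:var, name] : [:funcall, name]
  [:op, :+, operand, [:num, number]]
end

outer = Scope.new
outer.declare("x", :var)
inner = Scope.new(outer)             # nested scope still sees outer's x

p plus_node(inner, "x", 3)
p plus_node(Scope.new, "x", 3)       # x never declared: treated as a call
```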

------
joeroot
I came across this issue when looking to build a stripped down Ruby
interpreter in JS last summer. A few attempts have been made at defining its
grammar, though with each new version the task becomes more and more
challenging.

If anyone's interested, this definition of Ruby 1.4 is pretty good:
<http://www.cse.buffalo.edu/~regan/cse305/RubyBNF.pdf>

~~~
masklinn
> If anyone's interested, this definition of Ruby 1.4 is pretty good

The question is whether what it defines corresponds to Ruby. It's easy enough
to define an unambiguous subset of the language and declare you're done, but
it's irrelevant if the grammar does not match what the language actually is.

~~~
joeroot
You'll never reduce Ruby to a pure grammar, there are too many ambiguous
cases - my point was simply that it covers a subset which incorporates almost
all common idioms. As far as I am aware it is one of the few openly available,
non-trivial attempts to do so and thought it might be of interest.

------
c3d
How to parse XL:
[http://xlr.git.sourceforge.net/git/gitweb.cgi?p=xlr/xlr;a=bl...](http://xlr.git.sourceforge.net/git/gitweb.cgi?p=xlr/xlr;a=blob;f=xlr/scanner.cpp)
and
[http://xlr.git.sourceforge.net/git/gitweb.cgi?p=xlr/xlr;a=bl...](http://xlr.git.sourceforge.net/git/gitweb.cgi?p=xlr/xlr;a=blob;f=xlr/parser.cpp).
676 + 747 lines of non-hacky C++ code.

------
akkartik
Does it come with decent tests?

~~~
mcphage
<http://rubyspec.org>

------
markov_twain
fwiw, I find the syntax definition in the ruby_parser gem at
[https://github.com/seattlerb/ruby_parser/blob/master/lib/rub...](https://github.com/seattlerb/ruby_parser/blob/master/lib/ruby19_parser.y)
to be much easier reading, as it's written for racc (the ruby equivalent of
yacc), although as the readme warns, it's not perfect.

------
sanxiyn
Here I note that while JRuby and IronRuby ported parse.y, CPython, Jython, and
IronPython use totally different parsing (CPython rolls its own parser
generator, Jython uses ANTLR, IronPython does handwritten recursive descent
parsing) but are still compatible with each other.

------
TallboyOne
I wrote something similar here:
[http://pineapple.io/discussion/just-learned-about-ripper-in-...](http://pineapple.io/discussion/just-learned-about-ripper-in-ruby)

------
mbillie1
"By luck and various cabal like connections" - wasn't the Topaz link on
hackernews or lobsters like 2 days ago?

------
softbuilder
You can also gaze at <http://rubyspec.org/>

------
skurmedel
[http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_...](http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=59579)
?

~~~
steveklabnik
The ISO standard is based on Ruby 1.8.7, by the way.

That said, 1.8.7 still has these issues. You absolutely _can_ parse Ruby; that
doesn't mean it's not incredibly difficult.

