How to Parse Ruby (programmingisterrible.com)
133 points by steveklabnik on Feb 7, 2013 | 84 comments



Even lexing ruby for highlighting is really hard. I had to use so many lookahead hacks to get the lexer in rouge working: https://github.com/jayferd/rouge/blob/master/lib/rouge/lexer...

I learned a ton about ruby when implementing that. Did you know that this totally works:

    my_method <<-ONE, <<-TWO, <<-THREE
       text of one
    ONE
       text of two
    TWO
       text of three
    THREE
And all the %-delimited strings are super hard.
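For reference, a quick sampler of the %-literal forms (each can use nearly any punctuation as its delimiter, which is part of what makes them hard to lex):

```ruby
# Ruby's %-literals: a one-character prefix picks the kind, and nearly any
# punctuation character can serve as the (possibly paired) delimiter.
p %w[a b c]              # => ["a", "b", "c"]   array of words
p %i(x y z)              # => [:x, :y, :z]      array of symbols (Ruby 2.0+)
p %q{it's 'quoted'}      # => "it's 'quoted'"   single-quoted string
p %Q<two is #{1 + 1}>    # => "two is 2"        double-quoted, interpolated
p(%r!\d+! =~ "abc123")   # => 3                 regexp, with ! delimiters
```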

(also note that pygments chokes on my perfectly valid ruby file, at `%r(\\\\)`)


> (also note that pygments chokes on my perfectly valid ruby file, at `%r(\\\\)`)

Pygments was actually better at lexing ruby at one point, but the regular expressions were too complex and some bug fixes broke other things. I think at the moment it's good enough.

But yeah, I was in the same boat. I learned Ruby for Pygments.

//EDIT: vim does considerably worse on that file btw.


Yep. It's surprising how many things can be missing from a lexer and still be "good enough" for most people. I don't know anybody who's actually used that heredoc-stacking thing.

:D I guess you're the author of all those frustrated "wtf ruby" comments I found in the pygments lexer, then :). I seriously doubt I could have done it at all without pygments as a reference.

My vim does fine on that file: http://imgur.com/VSYr4aa . Maybe I'm using a more recent version...? It misses some of the `end` keywords, but that's just a quirk that it has ;)

Another example that bit me - this took the longest, I think:

    foo = 10
    foo %(2) # call method foo with string "2"
    foo % 2  # the value of foo modulo 2
    % 2      # the string "2"
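A runnable sketch of the first two cases (the method `foo` is made up; Ruby may print an "ambiguous argument" warning):

```ruby
# Before any assignment to "foo" has been seen, the lexer treats it as a
# method name, so %(2) lexes as a string-literal argument.
def foo(s); s; end

a = foo %(2)   # call foo with the string "2" (may warn "ambiguous")
foo = 10       # from here on, "foo" is a local variable
b = foo % 2    # now "%" is the modulo operator: 10 % 2
```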


These are the kinds of things that make me appreciate s-expressions so much more. It's crazy how programming languages like Ruby or C can have ridiculously complex syntax, while LISP can express all of that and then some with just three or four simple grammar rules.


For non-rubyists, what does this example do? Are those labels?


It's a way of writing strings without having to escape things or having line breaks mess things up (a heredoc, if you're familiar with that terminology). For instance,

    <<-END
      this is some text
    END
will give you a string containing " this is some text\n". END can be anything.

It's particularly useful in metaprogramming, since you can maintain formatting, and you don't have to escape " or ' if they happen to show up in your code.

What's interesting in the OP's example is that you end up passing three strings " text of one\n", " text of two\n", " text of three\n". I had no idea that would happen, though I have seen things like

    define_method <<-RUBY, __FILE__, __LINE__ + 1
      some_code
    RUBY
so it makes sense. Ruby treats anything on the same line after the comma as separate arguments, so even though __FILE__ comes after <<-RUBY and before RUBY, it really appears after the final RUBY to the interpreter.
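Here's a runnable sketch of that stacking behavior (the method `show` is made up):

```ruby
# Three arguments on one line, two of them heredocs. The heredoc bodies
# stack after the line, in order, and each becomes a separate argument.
def show(a, b, c)
  [a, b, c]
end

result = show(<<-ONE, "between", <<-TWO)
  first body
ONE
  second body
TWO

p result  # => ["  first body\n", "between", "  second body\n"]
```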


That's ruby's heredoc syntax: http://en.wikipedia.org/wiki/Here_document

I had no idea it could be nested like that and I've been using ruby for 8 years. Crazy!


Holy cow, 10 years for me and I've never even thought of nesting heredocs.

Some days I wish Matz would formalize the grammar so it is sane.


As another non-rubyist with too much perl and bash under my belt, I'm guessing it's a heredoc - the <<-FOO syntax means "grab the lines following until you see FOO alone on a line, shove it all in a string, and pass it as this argument" (in bash it passes a filename that represents half of a pipe which has been sent the content).


Well, even C has this requirement. Code like

  foo * bar;
Could either mean "multiply foo by bar and drop it on the floor", or "declare bar as a pointer to a foo". C++ makes it worse, of course. Either way, you need to be aware of what symbols are in scope at any given point.


Another fun one:

    (a) - (b)
which in statically typed C-like languages can mean either "cast the value -(b) to type a" or "subtract the value (b) from the value (a)"


These two are great examples of why i think "overloading" syntax symbols sucks. Problem is, nearly every language does that, and in many cases it makes some expressions ambiguous (not to the parsers, as they have those corner-cases defined, but ambiguous to the programmers):

- in JavaScript, braces are used both as block delimiters and object-literal delimiters
- in CoffeeScript, which prevents that particular grammar ambiguity from JS, whitespace is used for many different things: scope/block delimiting, object literals, even function calls (optional parentheses)
- Python uses parentheses for expression grouping (precedence), function calls AND tuples
- square brackets are used in many languages both to denote literal lists/arrays and for element access (the [] operator)
- etc, etc, etc

There are very few languages that i know of that don't do this kind of lexical-token overloading. Smalltalk is an example: square brackets are always blocks, parentheses are only used to group expressions, dots are always statement terminators, etc. I think Haskell is another example of a straightforward syntax, but don't quote me on that.

Is there any formal name for these kinds of non-overloaded/simple syntaxes? If there is a formalism for those, why is it that "overloaded" syntaxes are preferred most of the time, or why are they so prevalent?


Note that "overloaded" operators do not mean the grammar is context-sensitive (let alone ambiguous). For instance, the parens as grouping and tuple operators in Python are indeed ambiguous (you have to parse to the first comma or the closing paren to know which is which), but they're not ambiguous with funcalls: a function call's opening paren is an infix, so it can't be the start of a new expression. Same with square brackets for both array literals and element access: one is a prefix, the other an infix. Similarly, there is no ambiguity in e.g. the `+` or `-` operator being both infix (addition and subtraction) and prefix (positive and negate).


Totally right. I was aware that this "overloading" does not imply ambiguity for the parser; it's just that i find it confusing to read in some cases (like the tuples in Python, or the {} in JS), and i think it's curious that language designers deliberately choose to use the same symbol for totally different things when they could use something else, or keep the grammar simpler.

I guess it's always a trade-off between simplicity at the grammar level and ease of use, but i'm not so sure about that either. Smalltalk has a very simple grammar, yet that doesn't make it harder to use than other languages with similar semantics; e.g. Ruby, which prefers to be "programmer friendly" by having multiple syntactic forms for denoting blocks, even a special syntax for calling a method whose last parameter is a block, and much much more.


> i think it's curious that language designers deliberately choose to use the same symbol for totally different things when they could use something else, or keep the grammar simpler.

The problem is that there's only a limited number of symbols available in ASCII, especially "matching"/"paired" symbols, where you've got all of three pairs to work with (`{}`, `[]` and `()`), with one pair having a lot of immutable baggage (`()`) and one being coopted by C's inheritance (`{}`). (I'm not counting `<`/`>` as they have even more historical baggage as comparison operators)

Now of course we could use non-paired symbols even for paired situations, but I guess these symbol pairs look... right? Especially for situations where we're defining a "section" or "grouping" rather than a coherent block (such as a string)


If you express your grammar in a Pratt parser (http://en.wikipedia.org/wiki/Pratt_parser), which associates parsing actions with tokens, the amount of "token overloading" in the syntax becomes obvious.

As for why tokens are overloaded--you just run out of good tokens if you don't reuse them. Consider the tokens '(' and '['. It's quite common to use '(' in the prefix context for grouping, and '(' in the infix context to mean a function call. How do you eliminate the overloading? You can make function calls use whitespace as an infix operator, but that creates a host of other problems. Also, overloading is useful for creating parallelism in the syntax. You might use '[' in the prefix context to signify literal arrays and '[' as an infix operator to signify array dereferencing. In that situation, overloading is synergistic.
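To make that concrete, here's a toy Pratt parser (a sketch in Ruby; class and token names are made up) where "(" gets one parse action in prefix position (grouping) and another in infix position (a call):

```ruby
# Minimal Pratt-style expression parser over pre-split string tokens.
class Pratt
  POWER = { "+" => 10, "*" => 20, "(" => 30 }  # infix binding powers

  def initialize(tokens)
    @toks = tokens.dup
  end

  def parse(min_power = 0)
    left = prefix(@toks.shift)
    while POWER.fetch(@toks.first, 0) > min_power
      left = infix(left, @toks.shift)
    end
    left
  end

  private

  def prefix(tok)
    if tok == "("        # prefix "(": grouping
      inner = parse(0)
      @toks.shift        # consume ")"
      inner
    else
      tok                # identifier or literal
    end
  end

  def infix(left, tok)
    if tok == "("        # infix "(": function call
      arg = parse(0)
      @toks.shift        # consume ")"
      [:call, left, arg]
    else                 # ordinary binary operator
      [tok.to_sym, left, parse(POWER[tok])]
    end
  end
end

p Pratt.new(%w[( a + b ) * c]).parse  # => [:*, [:+, "a", "b"], "c"]
p Pratt.new(%w[f ( x )]).parse        # => [:call, "f", "x"]
```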


> Is there any formal name for these kinds of non-overloaded/simple syntaxes? If there is a formalism for those, why is it that "overloaded" syntaxes are preferred most of the time, or why are they so prevalent?

Someone might be able to say with more certainty, but I believe this is covered by the formalism of "Context Free Grammars" versus "Context Sensitive Grammars"[1]. That is, a language is context free if you don't need information from elsewhere in the code to determine the meaning of a symbol.

There's a good discussion of what makes C context sensitive on Eli Bendersky's blog[2].

Edit:

Thinking about this some more, I'm realizing that this has little to do with the overloading of symbols like block delimiters, since these can be tokenized and parsed by a context free grammar fairly easily in an "overloaded" fashion.

Maybe what you're describing is related to the concept of "purely functional," or even some more vague notion of purity generally...

[1] http://en.wikipedia.org/wiki/Context-free_grammar

[2] http://eli.thegreenplace.net/2007/11/24/the-context-sensitiv...


I don't think "context free" is the concept i was referring to. AFAIK, a grammar can be context free but still have these kinds of "syntactic overload" annoyances. E.g. in Python this expression will always be parsed the same no matter its context: "(foo(bar, baz) + (a,))"; it bothers me, however, that each of those three pairs of parentheses has a different syntactic meaning (one is for grouping, one is for a function call, and the last one is for a tuple). Not ambiguous for the parser, but quite inconvenient for my brain's parser, which has to distinguish between three possible meanings for the same syntactic symbol.


Such an unambiguous grammar would only have as many state transitions in its FSM as there are characters in the grammar, which would limit the language considerably, so you would have to raise the number of characters.

Also, ( means something different inside a string, so in that sense every programming language with strings is hard to parse locally.

Also note that ambiguity formally means that more than one parse tree can represent the same string.


Haskell's syntax has a lot of special cases. I'm not sure it's more than typical languages, but I'd be surprised if it was much less.


C (and most C family languages) require you to know the names of types. If that is given, then it is unambiguous. Yes, it's bad design, but it's not the end of the world.


I'm interested in how this compares to other languages with reference implementations. Is ruby the odd one out here, not providing a formal grammar, or is that the norm?

Ruby has a stated design goal of making developers happy. As far as I'm aware, it hasn't been designed to be easily parsed.

I, as an end user (i.e. programmer), prefer it this way. If ease of parsing is important for you, maybe you should use something like LISP.


> Ruby has a stated design goal of making developers happy. As far as I'm aware, it hasn't been designed to be easily parsed.

That does not mean they're opposite goals. Having parsing ambiguities means insufficient thought has been given to parsing, or the language has been defined as "as implemented" with an ad-hoc and organically grown parser (other examples of such case: Perl, PHP)

> I, as an end user (i.e. programmer), prefer it this way.

You prefer that languages have broken, inane or completely missing grammars? So you like PHP even more than Ruby?


> That does not mean they're opposite goals. Having parsing ambiguities means insufficient thought has been given to parsing, or the language has been defined as "as implemented" with an ad-hoc and organically grown parser (other examples of such case: Perl, PHP)

Of course they're not opposite, but you have to choose what to focus on.

> You prefer that languages have broken, inane or completely missing grammars? So you like PHP even more than Ruby?

Sorry if it wasn't clear, I was comparing easy parsing to developer happiness. That is, I prefer that a language try to make me happy rather than be easy to parse. (this goes back to your point above)


A language which is hard to parse for a computer is often hard to parse for a human. For instance, computers have fewer problems with the amount of short-term context memory. 'Magic' is often considered bad because it's inexplicable, while clarity is a principal virtue of a programming language in the eyes of many.

This is the key problem with DWIM interfaces in general. When it does what you mean, it's so nice. But sometimes you're stuck with ambiguity and lack of a precise means of expression; then you have to jump through hoops to push the 'intuitive' thing out of your way. (Curiously, both MS Word and Perl, of all things, manifest this problem.)


"A language which is hard to parse for a computer is often hard to parse for a human."

Whether it often is the case isn't relevant. It may be possible, or even necessary, to make a language hard to parse in order to make it better for the programmer.

Otherwise, by your logic LISP is by definition the best programming language. That may be true, but I'm not sure you would follow your own logic to its logical conclusion.


"LISP is by definition the best programming language"

Well, that's exactly right, it is the best! Not sure for the GP but I would follow this logic to this conclusion happily :)

Um, er... Sorry, I just recently wrote my first program in Lisp (in Racket exactly) that was something more than a few tens of lines of code and am very happy because of this and I couldn't resist posting this here :)


There was no claim that a language that is easy to parse for a computer is always easy to parse for a human. And you don't present any evidence that being easier for humans to parse makes a language better, other than that you like it.

I like Ruby too, it's actually my favorite language because it's so easy to do metaprogramming, but sometimes I wish it were clearer what the execution will be even if that comes at the expense of some clarity elsewhere.


> Sorry if it wasn't clear, I was comparing easy parsing to developer happiness.

The problem is that this is a statement which does not make sense. You can have both. And as nine_k notes, such a language can also be harder to read: an ambiguous syntax is also ambiguous for a human reader.


Maybe it'd be better to think about this as a list of priorities, where there has to be ordinality. For example, developer happiness is the first priority, and ease of parsing falls somewhere further down. When Matz was creating the language, when he came across a problem with multiple solutions, he picked the one that would result in highest developer happiness, not which one is easier for the parser.

Is that a better way of wording it?


I believe we understand the dichotomy you're trying to point out, and calling it a false one.

It is possible (and often correlated) to maximize developer happiness with an easy to parse syntax.


> Having parsing ambiguities means insufficient thought has been given to parsing, or the language has been defined as "as implemented" with an ad-hoc and organically grown parser (other examples of such case: Perl, PHP)

Also C, Java, really just about every language in common usage.


Not sure about C, but definitely not true for Java. Or C#. Or Python. Or Pascal.


Python, Haskell, Java, Go have formal and even context-free grammars. It makes me as a developer happy since it's much easier to predict how the language will parse something.


I've done a lot of java and a lot of ruby. Anyone that states that Java makes them happier as a developer is viewing the world from a very different perspective than I am.

That right there is why I disagree with your point. I'm far happier programming in ruby than java, even though there is more ambiguity to the syntax.


Java is indeed pretty nasty, but I would consider it easier to parse visually than Ruby. Less ambiguity.


I think your definition of "parse visually" is very different than mine. The java version of the below would be much harder to parse visually for me.

https://gist.github.com/jaydonnell/4735159


I really hate this "my language is better than your X" talk. Remember that the reason you find a language predictable in usage is often that you have experience using it; claiming that its syntax will be clearer from the start is misleading.

Also, you could have structured the "supreme readable ruby code" much more cleanly:

https://gist.github.com/cronin101/7024c1c0dfa721ec1dab


My point isn't "language x is better than language y". My point is that "unambiguous" syntax can be harder to visually parse than a more ambiguous syntax. Java has a lot of noise that makes it hard to visually follow the intent of the code, even though it is less ambiguous.

Admittedly I could have come up with a better code example :) I wanted to show code that transforms a collection a couple of times.


That is because of lack of features, particularly in libraries.

While it takes longer to read the Java version, it is unambiguous what it does. With Ruby (as nice as the language is) that is not always the case, at least for me.


I just want to point out that unambiguous is not the same thing as easy to visually parse. Java often is very noisy, which makes it hard to visually see the intent of the code, even if it is unambiguous.


Perhaps you should consider a different field of employment. https://gist.github.com/anonymous/4736156

(Pardon the tabbing, I don't have that much time to waste on uneducated comments like the above)


Unlike you, I'll avoid the ad hominem. The commenter I was replying to said that java is EASIER to visually parse. Your java code is harder to visually parse than my ruby code. It's not hard to visually parse, but it's harder than the ruby code which was my point.

btw, I've added a sort to mine, what would that look like for yours?

https://gist.github.com/jaydonnell/4735159

Edit: I see you don't have the courage to use your real account.


My name is J. Chan, and I forgot the password to my other account. Thanks for making fun of my name.


Why didn't you add the sort?


Use a Comparator. http://docs.oracle.com/javase/6/docs/api/java/util/Comparato...

Since it seems like you're talking about visual activity now instead of visual clarity, I agree that Java will be more "hard to read" under your definition. But aside from the two lines for the class declaration & method declaration and the other two for the ending brackets, there really isn't much bloat.

All that you have to implement for a simple numerical sort is some logic: if a > b return 1, else if a == b return 0, else return -1.


I'm absolutely talking about visual clarity. In the ruby code it is significantly more visually clear what is going on because the java code has a lot of incidental noise. If you would show a java version that would be clear to everyone, which is probably why you didn't add the sort. The code side-by-side will speak for itself.


Wow guys really.

  Iterable<String> s = Splitter.on(" ").split("Hello World");
  Multiset<String> counts = HashMultiset.create(s);
  Multiset<String> sorted = Multisets.copyHighestCountFirst(counts);
Or, using a `TreeMultiset` (note: it sorts by element, not by count):

  TreeMultiset.create(Splitter.on(" ").split("Hello World"))

Granted this uses guava, but there is nothing really more readable about your ruby code than this guy's java code. To say he 'lacks courage' ... jesus I'm still laughing. "Why didn't you add the sort!" You're too much man.


Your code doesn't do the same thing as mine. You need to start with an array of sentences.

Show the java code that does the same thing and it will be clear that the ruby is more readable.


It will not be clear: You can replace the first line with the following two

  String[] sents = {"the quick", "the slow", "the blue"};
  Iterable<String> s = Splitter.on(" ").split(Joiner.on(" ").join(sents));


Guava is nice, it makes java almost bearable.


That's orthogonal to the point, though - when I write ruby, I don't care about how it will get parsed, I care about how it will run.


This is sort of an odd statement since the way a piece of code is run is directly related to how it is parsed. If an alternative implementation like Topaz parses something differently than MRI then you'll have code that does two different things on two different interpreters. Having a complicated and informally specified grammar makes this a lot harder to avoid.


Ah okay. Cool, thanks!


Python is not context free.


Indentation is handled by the lexer, not the parser.

The parser is context free-- LL(1), actually.

The lexer maintains one extra stack to track indentations, meaning it's more of a PDA than a DFA, but the actual complication is mild.
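That extra stack fits in a few lines; here's a sketch (in Ruby, to match the thread; token names are illustrative):

```ruby
# The indentation-stack trick used by Python's lexer: emit INDENT/DEDENT
# tokens that the (context-free) parser can then treat like "{" and "}".
def indent_tokens(lines)
  stack = [0]   # indentation widths of currently open blocks
  out = []
  lines.each do |line|
    width = line[/\A */].size
    if width > stack.last
      stack.push(width)
      out << :INDENT
    end
    while width < stack.last
      stack.pop
      out << :DEDENT
    end
    out << line.strip
  end
  out << :DEDENT while stack.pop > 0  # close any blocks still open at EOF
  out
end

p indent_tokens(["if x:", "  a", "  b", "c"])
# => ["if x:", :INDENT, "a", "b", :DEDENT, "c"]
```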


The language is not context free, although I agree there is a parsing solution which is not terribly complex.


It is context free. INDENT and OUTDENT are directly equivalent to { and } in other grammars.


The difference is that INDENT and OUTDENT have to be defined by context-sensitive productions. Even if this occurs in what would otherwise be the "lexing" stage of the parser.


And there's no way java is. Not sure about Go.


The Java language specification includes a description of the language as a context free grammar.

That alone does not make it a nice language, of course.


> The Java language specification includes a description of the language as a context free grammar.

I can only assume that it's lying its pants off. As I noted in another subthread, `(a) - (b)` can't be parsed (to an AST) without knowing the identity of `a` in the current scope: if it's a type, the expression is a cast of `-(b)` to `a`; if it's a value, it's a subtraction of `(b)` from `(a)`.


It can be parsed to an AST -- as you note, it can be parsed into two ASTs. It's called an ambiguous grammar, which can still be context free.


> It can be parsed to an AST -- as you note, it can be parsed into two ASTs.

By "it can't be parsed to an AST" I meant "it can't be parsed to a useful AST used to run or generate code" on its own, it needs to be disambiguated through contextual information before anything can be done with the prospective subtree. Sorry for the lack of clarity.

> It's called an ambiguous grammar, which can still be context free.

The only way to disambiguate is to provide expression context (namely visible name bindings at this point), it's essentially the same problem Eli Bendersky described when talking about C grammar's context-sensitivity: http://eli.thegreenplace.net/2007/11/24/the-context-sensitiv...



"Nobody really knows what the Bourne shell’s grammar is. Even examination of the source code is little help."

-- Tom Duff, "Rc — A Shell for Plan 9 and UNIX Systems"[1]

[1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.41....


I found the Rubinius source code to be quite readable. For example, here are the AST node definitions: https://github.com/rubinius/rubinius/tree/master/lib/compile...


A different approach, which has served me well, is to study and understand Perl's parser/lexer, which are more intertwined than your grandmother's Friday night challah. Once you get that, understanding the MRI parser is a snap.


Perl cannot be parsed in the traditional sense of the word. The parse tree is unknowable until you're actually running.

http://www.perlmonks.org/?node_id=663393



Remember that a yacc file builds an AST, and paying attention to the AST goes a long way toward understanding the yacc file.

Language ambiguity due to context is pretty common. The way to deal with it is nested-scope symbol tables: determine the semantics of a symbol usage based on previous declarations of that symbol, tracked in the symbol tables.

In the article's example, x + 3, the confusion is between the stages of the lexer and the parser. The lexer only knows x as a symbol; its output is symbol('x') op('+') num(3). It doesn't know whether x is a function call or a variable reference. It is the parser's job to figure out the semantics of x based on context and to build the AST.

In this case, if x has been declared as a variable before, its symbol and type info are recorded in the symbol table. When x + 3 is encountered, a simple lookup in the symbol table shows it's a declared variable, and the parser generates the AST node op('+', var('x'), num(3)).

If x has been declared as a function in the symbol table, the lookup will generate the funcall AST node instead.

For a special declaration rule like usage before declaration (as in Ruby), the lack of an entry in the symbol table can be defaulted to a funcall.
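A sketch of that lookup (Ruby; class and method names are illustrative, not MRI's):

```ruby
# Nested-scope symbol table: bare identifiers default to funcalls unless a
# prior declaration marked them as variables (the "usage before declaration"
# rule described above).
class SymbolTable
  def initialize
    @scopes = [{}]            # innermost scope last
  end

  def push_scope; @scopes.push({}); end
  def pop_scope;  @scopes.pop;      end

  def declare(name, kind)     # kind is :variable or :function
    @scopes.last[name] = kind
  end

  def lookup(name)            # search scopes, innermost first
    @scopes.reverse_each { |s| return s[name] if s.key?(name) }
    nil
  end

  # Build the AST node for a bare identifier in an expression like "x + 3".
  def resolve(name)
    lookup(name) == :variable ? [:var, name] : [:funcall, name]
  end
end

t = SymbolTable.new
p t.resolve("x")              # => [:funcall, "x"]  (never declared)
t.declare("x", :variable)
p t.resolve("x")              # => [:var, "x"]
```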


I came across this issue when looking to build a stripped down Ruby interpreter in JS last summer. A few attempts have been made at defining its grammar, though with each new version the task becomes more and more challenging.

If anyone's interested, this definition of Ruby 1.4 is pretty good: http://www.cse.buffalo.edu/~regan/cse305/RubyBNF.pdf


> If anyone's interested, this definition of Ruby 1.4 is pretty good

The question is whether what it defines corresponds to Ruby. It's easy enough to define an unambiguous subset of the language and declare you're done, but it's irrelevant if the grammar does not match what the language actually is.


You'll never reduce Ruby to a pure grammar; there are too many ambiguous cases. My point was simply that it covers a subset which incorporates almost all common idioms. As far as I am aware it is one of the few openly available, non-trivial attempts to do so, and I thought it might be of interest.



Does it come with decent tests?



fwiw, I find the syntax definition in the ruby_parser gem at https://github.com/seattlerb/ruby_parser/blob/master/lib/rub... to be much easier reading, as it's written for racc (the ruby equivalent of yacc), although as the readme warns, it's not perfect.


Here I note that while JRuby and IronRuby ported parse.y, CPython, Jython, and IronPython use totally different parsing (CPython rolls its own parser generator, Jython uses ANTLR, IronPython does handwritten recursive descent parsing) but are still compatible with each other.



"By luck and various cabal like connections" - wasn't the Topaz link on hackernews or lobsters like 2 days ago?


You can also gaze at http://rubyspec.org/



The ISO standard is based on Ruby 1.8.7, by the way.

That said, 1.8.7 still has these issues. You absolutely _can_ parse Ruby; that doesn't mean it's not incredibly difficult.



