Building an extensible syntax (breuleux.net)
39 points by gozzoo on Oct 24, 2014 | 24 comments



Didn't expect to see this here (I am the author). For the record, I implemented this syntax as a Racket #lang called LISO[1], and I also used a variant of that syntax for a language that compiles to JavaScript, Earl Grey[2]. It's fun to play with.

[1] http://breuleux.net/blog/liso.html

[2] http://breuleux.github.io/earl-grey/repl/

Edit: I forgot to mention I also based a markup language on operator precedence parsing: Quaint (http://breuleux.net/blog/quaint.html). It benefits from similar ease of extension.


Here's an interesting way to do operator priority:

Specify priority relative to other operators, including associativity, and let the language figure it out. If something is ambiguous or inconsistent, emit an error.

The nice thing about this is that it removes the entire concept of "first-class" operators. If you want to inter-operate with something, specify the relative priorities as needed and you're done.

The downside is that the worst-case time to detect cycles is O(n^2) in the number of operators, at least if you allow '!=' (unequal, but in no particular order) as a relative priority. That said, I'm not sure it's a good idea to include '!=' as a relative priority at all.

It also means that your parsing becomes... less than amusing. In particular, it suddenly becomes rather context-dependent. This breaks one of the "laws" described here. Still something interesting to muse about, however.
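
To sketch what that could look like (a toy of my own, not anything from the article): relative declarations form a directed graph, and an operator that transitively outranks itself means the declarations are inconsistent.

    # Toy sketch: operators declare relative precedence as graph edges.
    from collections import defaultdict

    class PrecedenceGraph:
        def __init__(self):
            self.tighter = defaultdict(set)   # op -> ops it binds tighter than

        def declare(self, a, b):
            """Declare that a binds tighter than b."""
            self.tighter[a].add(b)

        def binds_tighter(self, a, b):
            """True if a transitively binds tighter than b."""
            seen, stack = set(), list(self.tighter[a])
            while stack:
                op = stack.pop()
                if op == b:
                    return True
                if op not in seen:
                    seen.add(op)
                    stack.extend(self.tighter[op])
            return False

        def check_consistent(self):
            """An operator that outranks itself means a cycle: error out."""
            for op in list(self.tighter):
                if self.binds_tighter(op, op):
                    raise ValueError("cyclic precedence involving %r" % op)

    g = PrecedenceGraph()
    g.declare('*', '+')        # * binds tighter than +
    g.declare('+', '==')
    g.check_consistent()       # would raise if we also declared ('==', '*')
    assert g.binds_tighter('*', '==')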


That's more or less what I meant by "priority graph" in the article: priority is defined using a directed acyclic graph. In principle, yes, you could have a way to add new nodes in the graph; you don't really expect n to be large in this case.

Parsing wouldn't necessarily be very complicated. An operator precedence grammar[1] can be fully implemented in about 50 lines of code. If you run a simple sequential algorithm, completed expressions could be executed immediately and could modify the precedence matrix dynamically. You do lose, however, the option of parallel parsing.
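
For a sense of scale, here's a minimal precedence-climbing parser (a close cousin of the operator-precedence technique; the token format and table contents are mine, purely illustrative). The table can be mutated between calls, which is what makes the sequential scheme work:

    # Minimal precedence-climbing sketch over a mutable precedence table.
    PREC = {'==': 5, '+': 10, '-': 10, '*': 20, '/': 20}

    def parse(tokens, min_prec=0):
        """tokens: a list like ['1', '+', '2', '*', '3'], consumed in place."""
        lhs = tokens.pop(0)                    # assume atoms only, for brevity
        while tokens and PREC.get(tokens[0], -1) >= min_prec:
            op = tokens.pop(0)
            # left-associative: the right side must bind strictly tighter
            rhs = parse(tokens, PREC[op] + 1)
            lhs = (op, lhs, rhs)
        return lhs

    print(parse(['1', '+', '2', '*', '3']))    # ('+', '1', ('*', '2', '3'))
    PREC['@'] = 30                             # define a new operator on the fly
    print(parse(['1', '@', '2', '+', '3']))    # ('+', ('@', '1', '2'), '3')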

There's an idea I am toying with, though, which is to split source files into "zones". A separator, say "=====", would act as a hard delimiter between parts of the file. The first part could define the syntax of the second part, and so on.

[1] http://en.wikipedia.org/wiki/Operator-precedence_grammar
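
On the zones idea, a rough sketch (the separator and the "operator" directive are placeholders of mine), reusing PREC and parse from the sketch above:

    # Each zone is parsed with whatever the earlier zones declared.
    def parse_zones(source):
        asts = []
        for zone in source.split('\n=====\n'):
            for line in zone.splitlines():
                if line.startswith('operator '):   # e.g. "operator @ 40"
                    _, op, prec = line.split()
                    PREC[op] = int(prec)           # later zones see this
                elif line.strip():
                    asts.append(parse(line.split()))
        return asts

    print(parse_zones('operator @ 40\n=====\n1 @ 2 * 3'))
    # [('*', ('@', '1', '2'), '3')]  -- @ now binds tighter than *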


Again, you introduce context-dependence, which is something that you explicitly say you shouldn't have. What AST "a <foo> b <bar> c" is turned into depends on what priorities <foo> and <bar> are defined as.

This starts to get nightmarish very quickly, at least if you're trying to leave open the possibility of compilation as opposed to interpretation.

Zones are an interesting possibility for solving this, although they lack full generality: there's no way to express "zone C depends on zone B and zone A, but zone B doesn't depend on zone A, nor vice versa", for instance.


Well, the idea of zones is that the second zone depends on the first zone, the third depends on the second, and so on. It is meant to allow definition and use of a language within the same file, so there is little point in having a different dependency structure, I think.


I think things get a little lost here when you start treating infix operators as special. When talking about creating uniform syntaxes, I think people don't look hard enough at the basic structure of Smalltalk, where operators work mostly like you'd expect, are infix, and are uniform with all other kinds of calls, because the convention is always (subject verb object) instead of the lispy (verb subject object). You don't get operator precedence, but I've largely come around to the idea that maybe we should be questioning whether it's something we actually need.

Of course that leaves you with the problem of when a verb lacks a subject.


Honestly, most of the time I think we would be better off without operator precedence. It just adds another thing you need to keep in your head when reading code.

Unfortunately, nearly everyone -- even non-programmers -- learned the operator precedence for arithmetic operators, which means they might be confused if those operators in a language didn't use it.


A limited hierarchy like Go's could be OK:

    parens (...)
    paths  . [] and args (x,y)
    unary  ! -
    binary * / &
    binary + - |
    binary == != <= >= < >
    binary &&
    binary ||
    comma  ,
    assign = += -= *= /= &= |= :=


Smalltalk does have some operator precedence: unary operators bind tightest, followed by binary, keyword, and assignment.

In general, what precedence buys you is that it lets you omit grouping brackets in situations where you would almost always have to write them. Consider assignment, for instance. Without operator precedence, you will often write "x := (y + z)", but you will almost never write "(x := y) + z". By giving := lower precedence than +, you can now omit the brackets 99% of the time.

This being said, I don't think operator precedence is particularly useful for arithmetic. It's deeply ingrained in expectations, though.


To clarify, I meant arithmetic operator precedence. Obviously there is still a grammatical precedence below that.


Sure, but the same principles that justify this grammatical precedence could be used to justify arithmetic precedence. I mean, why does assignment have lower priority? It could be left associative with arithmetic.

Presumably it has low priority because it never makes sense as a sub-expression, but you could use a similar argument to give lower priority to equality as compared to arithmetic, since you typically don't do arithmetic on the result of a logical operation.


I don't think they're really the same thing. Arithmetic precedence comes from the meaning of the operators themselves, while the other layers of grammatical precedence simply provide consistent structure. The reason this is a problem when operators can be overloaded is that the structural aspects of the operators may no longer apply to their meaning. See for example the iostream operators << and >> in C++, which actually have a very awkward precedence that forces you to parenthesize things in unexpected ways sometimes.


Operators were first introduced in programming languages for arithmetic, e.g. Fortran, but they've evolved beyond that. As an example, Scala gives precedence to operators based on the first token in the op, so *>> has higher precedence than +~< because * "beats" +. Golang treats * / and & as having the same precedence, and ditto with + - and |.
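
Scala's rule is simple to emulate; a sketch (the numeric levels are my approximation of Scala's ordering, not the exact spec):

    # Precedence keyed off the operator's first character, Scala-style.
    FIRST_CHAR_PREC = {'|': 1, '^': 2, '&': 3, '=': 4, '!': 4,
                       '<': 5, '>': 5, ':': 6, '+': 7, '-': 7,
                       '*': 8, '/': 8, '%': 8}

    def prec(op):
        # letters lowest, listed symbols in between, other symbols tightest
        return 0 if op[0].isalpha() else FIRST_CHAR_PREC.get(op[0], 9)

    assert prec('*>>') > prec('+~<')   # * "beats" +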


I think it's even wrong to think about syntax at all when thinking about Lisp. It's a symbolic computational system: it can be about anything. Other languages are DSLs with aesthetics, whereas Lisp is about semantics. Don't even think about syntax; think about meaning:

(<idea> <data>...)

That's it. Now you can focus on ideas instead of looks. (But yeah, the added bonus of self-similar representation is a pretty large one too, both in expressiveness and in concept reduction.)


There's a library for Lisp called Readable Lisp (http://readable.sourceforge.net/). It has three features:

1: Indentation instead of parentheses

  (defn add [a b]
    (+ a b))
Becomes

  defn add [a b]
    + a b
2: Function names before the parens instead of after

  (add 3 5)
Becomes

  add(3 5)
3: Infix operations

  (+ 3 5 6)
Becomes

  {3 + 5 + 6}
(Note the curly braces instead of parens)

These combined make Lisp code a whole lot more readable:

  (define (factorial n)
    (if (<= n 1)
      1
      (* n (factorial (- n 1)))))
Becomes

  define factorial(n)
    if {n <= 1}
      1
      {n * factorial{n - 1}}
To me, this made Common Lisp Python-like in readability. Now to get it working in Clojure...


> Now to get it working in Clojure

Perhaps you've seen the macro for infix operators in Clojure: http://data-sorcery.org/2010/05/14/infix-math/

Use that and all you'd need to implement for infix ops is a parser that translates {...} to ($= ...) for any sequence ... of tokens.
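
As a toy illustration of that translation (a naive text rewrite, not a proper reader macro, so strings and comments containing braces would break it):

    import re

    def braces_to_infix(src):
        # Rewrite innermost {...} groups first so nesting unwraps correctly.
        pat = re.compile(r'\{([^{}]*)\}')
        while pat.search(src):
            src = pat.sub(r'($= \1)', src)
        return src

    print(braces_to_infix('(def x {3 + {4 * 5}})'))
    # (def x ($= 3 + ($= 4 * 5)))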


Except it doesn't stop at 3. Then you have (copying from http://sourceforge.net/p/readable/wiki/Solution):

4. A \\ (aka SPLIT) starts a new line at the current indentation. If it's immediately after indentation (aka GROUP in that case), it represents no symbol at all (at that indentation) - this is useful for lists of lists.

5. A $ (aka SUBLIST) in the middle of list restarts list processing; the right-hand-side (including its sub-blocks) is the last parameter of the left-hand side (of just that line). If there's no left-hand-side, the right-hand-side is put in a list. This useful abbreviation from Haskell simplifies many common cases.

6. A leading traditional abbreviation (quote, comma, backquote, or comma-at) or datum comment "#;", at the beginning of a sweet-expression line, and followed by space or tab or end-of-line, is that operator applied to the following sweet-expression. Otherwise, it applies to the next neoteric-expression.

7. The markers “<∗” and “∗>” surround a collecting list, and MUST accept a list of 0 or more un-indented sweet-expressions. The “<∗” and “∗>” represent opening and closing parentheses, but restart indentation processing at the beginning of the line. A collecting list is only terminated by its matching “∗>” and not by a blank line. These are valuable for defining libraries and for short let-style constructs.

8. The marker “$$$” is reserved for future use.

Got all that? I sure didn't. I spent some time on the mailing list a couple of years ago and grew frustrated with this gradual creep of operators, which led to absurd code like this (from http://srfi.schemers.org/srfi-110/srfi-110.html):

  let
    \\
      c $ cos a
      s $ sin a
    body...
I mean, would it hurt to add a couple of parens here and there?! Even Java has a few of them.

(My competing proposal, which wasn't hampered by all the bending-over-backwards to preserve backwards compatibility and get merged upstream that did them in: http://akkartik.name/post/wart. The final straw that pushed me to find my own syntax was this example from SICP: http://arclanguage.org/item?id=16699)


I'm going to have to reread this to really understand it, but it prompted me to imagine the possibility of a simpler antidote to parentheses than described here. If the tokenizer generated INDENT and DEDENT tokens as in python, then the combination of those tokens and the spaces and parens that lisp already has could be translated to other paren patterns and then interpreted as in normal lisp. (Of course strictly speaking even parens are already translated into hierarchical cons-pairs, but that just proves that this sort of translation is not foreign to lisp.) Hmmm, something to think about.
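
Something along those lines is easy to prototype. A strawman sketch (entirely my own; single-token lines stay bare atoms, and lines already wrapped in parens pass through unchanged):

    # Turn indentation into parentheses: each multi-token line opens a
    # form, and a dedent closes every form it steps out of.
    def indent_to_parens(src):
        out, levels = [], []
        for line in src.splitlines():
            stripped = line.strip()
            if not stripped:
                continue
            indent = len(line) - len(line.lstrip())
            while levels and indent <= levels[-1]:
                out.append(')')
                levels.pop()
            if len(stripped.split()) == 1 or stripped.startswith('('):
                out.append(stripped)          # bare atom or explicit form
            else:
                out.append('(' + stripped)    # open a new form
                levels.append(indent)
        out.extend(')' for _ in levels)
        return ' '.join(out)

    print(indent_to_parens(
        "define (factorial n)\n"
        "  if (<= n 1)\n"
        "    1\n"
        "    (* n (factorial (- n 1)))"))
    # (define (factorial n) (if (<= n 1) 1 (* n (factorial (- n 1))) ) )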


I did this last year: http://akkartik.name/post/wart


Perhaps I'm missing something: In Iteration 1, it says juxtaposition has higher priority than any operator. In that case, how is the lambda declaration (x y -> (...)) not interpreted as a function call to the function x with the argument y?

Or is the author simply saying that this will unambiguously create an AST of ((x y) -> (...)) regardless of what -> does?


Post it to Lambda the Ultimate if you want the serious language designers to take a look at the ideas and make suggestions.


That's such a good idea that he did it 9 months ago: http://lambda-the-ultimate.org/node/4879.


Why are parser extensions opposed to AST transforms? They work well together; see an example implementation here: https://github.com/combinatorylogic/mbase


I've had very similar ideas in my language Ivy: https://github.com/jhallen/joes-sandbox/tree/master/lang/ivy



