Regular, Recursive, Restricted (matklad.github.io)
56 points by todsacerdoti 7 months ago | 17 comments



The question raised at the end is about the undecidability of ambiguity for context-free grammars, and whether, under certain restrictions, the problem becomes decidable.

As the author points out, Regular Language ambiguity (among many other properties) is decidable.

One of my favorite language facts that I always keep in my back pocket is the existence of Visibly Pushdown Languages (VPLs) and the related Finite Tree Automata, which reside between Regular Languages / FA and CFLs / PDA in the inclusion hierarchy. Whereas Regular Languages give you a limited, tail-recursive or looping structure, VPLs / FTA give you the kind of full recursion you'd expect as a programmer without compromising decidability of all the essential questions (intersection, emptiness, negation, ...).

VPLs are very expressive, and many languages you might have thought were general CFLs turn out to be VPLs!

The key distinguishing feature of VPLs as a restriction of CFLs is that you can always tell where the recursive term boundaries are, with bounded lookahead, by detecting sentinel "call" and "return" tokens, such as open/close parentheses or other hints. This makes all kinds of questions decidable because parsing algorithms don't have to indefinitely defer decisions about where to place term boundaries.
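The "call"/"return" distinction is easy to make concrete. Below is a minimal sketch of my own (not from the thread or the article) showing the input-driven property: the alphabet is partitioned up front, and the stack action is dictated solely by which class the current symbol belongs to, never by parser state:

```python
# Sketch of the "input-driven" property of visibly pushdown automata:
# the alphabet is partitioned into call, return, and internal symbols,
# and the stack action is forced by the symbol's class alone.

CALLS = {"("}      # always push on these
RETURNS = {")"}    # always pop on these
INTERNAL = {"a"}   # these never touch the stack

def vpa_accepts(word):
    """Accept words whose calls and returns nest properly."""
    stack = []
    for sym in word:
        if sym in CALLS:
            stack.append(sym)      # action determined by symbol class
        elif sym in RETURNS:
            if not stack:
                return False       # unmatched return
            stack.pop()
        elif sym in INTERNAL:
            pass                   # internal symbols leave the stack alone
        else:
            return False           # symbol outside the alphabet
    return not stack               # no unmatched calls left

print(vpa_accepts("(a(a)a)"))  # True
print(vpa_accepts("(a"))       # False
```

Because the automaton never chooses its stack action, two such automata can be run in lockstep on the same input, which is what makes intersection and the other boolean questions tractable.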

I'm not sure whether VPLs can assist the author in this particular application because 1) I haven't determined whether an expression grammar with operator precedence is a VPL (though I suspect it is... when reading left to right you are never held in suspense about the implicit grouping for very long, and the explicit grouping is, well, explicit.) and 2) I don't know for a fact that ambiguity is one of the advertised decidable questions for VPLs (but again, I suspect it is, as all these properties seem to stand and fall together).

Edit: I was a bit sloppy when referring to ambiguity. Of course, it's a property of a grammar, that is, one particular representation of a language as an explicit machine of sorts, not a property of the language itself.


Whoa. I've wondered about that exact question for a while, whether a proper class exists between RL and CFL. Relevant Wikipedia link, for posterity: https://en.wikipedia.org/wiki/Nested_word

Thanks for sharing!


Another "proper class" is the operator precedence languages, AKA Floyd's languages (R. W. Floyd). They include the "visibly pushdown languages", which are really just a rediscovery of the input-driven languages. Floyd's languages are the languages parsed using the shunting yard algorithm.
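For reference, the shunting yard algorithm mentioned here fits in a few lines; the precedence table and token format below are my own illustration, not anything from Floyd's papers:

```python
# A bare-bones sketch of Dijkstra's shunting yard algorithm, converting
# infix tokens to reverse Polish notation. Handles only left-associative
# binary operators and parentheses, assuming well-formed input.

PREC = {"+": 1, "-": 1, "*": 2, "/": 2}  # higher binds tighter

def to_rpn(tokens):
    out, ops = [], []
    for tok in tokens:
        if tok in PREC:
            # pop operators that bind at least as tightly (left assoc)
            while ops and ops[-1] in PREC and PREC[ops[-1]] >= PREC[tok]:
                out.append(ops.pop())
            ops.append(tok)
        elif tok == "(":
            ops.append(tok)
        elif tok == ")":
            while ops[-1] != "(":
                out.append(ops.pop())
            ops.pop()                    # discard the '('
        else:
            out.append(tok)              # operand goes straight to output
    while ops:
        out.append(ops.pop())
    return out

print(to_rpn(["1", "+", "2", "*", "3"]))  # ['1', '2', '3', '*', '+']
```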


If I could be a researcher, this would be one of my main topics. I have personal reasons to be interested in languages whose alphabets have "very special symbols" such as the parenthesis.

In general, I'm curious about how, as you progress up the language hierarchy, it becomes necessary to keep adding specialized symbols to keep up. Most interestingly, I wonder how one pair of symbols (open and close) adds a lot of expressive power, while I suspect that adding more distinct types of paired symbols won't significantly augment the expressiveness, though I don't know for sure.

This is also reminiscent of Dyck languages and their corresponding automata, which recognize matched open and close parentheses. I did not know a visibly pushdown automaton was sufficient to recognize parentheses; I thought a full stack was necessary. But given that a stack is more powerful than the input-driven "stack" of a VPA, it all makes sense.


"Visibly pushdown automata" are just a rediscovery of input-driven pushdown automata. It's sad that the authors of the VPL papers never acknowledged this AFAIK, even while writing more papers on the same topic.


It's a reflection on the state of the field, I think, that there is such a volume of lesser known good work hiding among an even larger volume of noise and fluff that neither the authors nor the reviewers caught such an oversight. (And reviewers are usually pretty aggressive about suggesting related work.)


This is sensible, but I don't fully understand why there is such pressure to support inherently ambiguous infix operations. I get that 1 + 2 * 3 might be marginally nicer to type out but 1 + (2 * 3) is completely unambiguous no matter what language I'm using. This, for me, improves readability, especially because I don't have to guess what the intention was if I'm ever confused by the result. Also, it's one thing in a math or science paper where the reader can correct the result if there is an ambiguity, but if a compiler handles ambiguity wrong, that can throw the entire program off quite significantly.

The ratio of utility to time spent on precedence parsing in compilers feels very low to me. That being said, I do appreciate the satisfaction of being able to implement it declaratively, as the author does here.


Lisps, Smalltalks, and APLs are among the language families that haven't bothered with precedence.

(that said, I think it's worth it: an afternoon spent providing precedence is easily paid back —in an expression-oriented language, anyway— in avoiding myriad pairs of parens)


Smalltalk does have precedence, just not among binary (infix) operators. It has keyword, binary, and unary messages and needs to specify the relative precedence among those. In other words, the language has to decide if:

   a b: c - d e
Is parsed like:

   a b: (c - (d e))
   a b: ((c - d) e)
   (a b: c) - (d e)
   (a b: (c - d)) e
   ((a b: c) - d) e


I spend far more time reading code than typing it, so the time to type () is in the noise compared to the time to remember precedence rules while reading.

If there were only a couple of operators, maybe precedence would be manageable, but there aren't.


IMO the "not equals" notation is easy to misinterpret as permitting the tree.

I think something like this would be clearer:

    Expr =
        'number'
      | '(' Expr ')'
      | Expr '+' Expr  (bind-order: 2, assoc: left)
      | Expr '-' Expr  (bind-order: 2, assoc: left)
      | Expr '*' Expr  (bind-order: 1, assoc: left)
      | Expr '/' Expr  (bind-order: 1, assoc: left)
or if you want relative precedences:

    Expr =
        'number'
      | '(' Expr ')'
      | Expr '+' Expr  (add, assoc: left)
      | Expr '-' Expr  (sub, assoc: left)
      | Expr '*' Expr  (mul, assoc: left)
      | Expr '/' Expr  (div, assoc: left)
    
    bind-order:
      mul equals div
      add equals sub
      mul before add
or if you want to completely separate conflict resolution (and shorter bind-order syntax, and raise a syntax error if certain operators are mixed...this one I like the most):

    Expr =
        'number'
      | '(' Expr ')'
      | Expr '+' Expr  (add)
      | Expr '-' Expr  (sub)
      | Expr '*' Expr  (mul)
      | Expr '/' Expr  (div)
      | Expr '&&' Expr (land)
      | Expr '||' Expr (lor)
      | Expr '==' Expr (eq)
      | Expr '!=' Expr (ne)
      | Expr '.' Expr (haskellDot)
      | Expr '..=' Expr (range)
    
    assoc:
      left: add, sub, mul, div, land, lor, eq, ne
      right: haskellDot
      forbid: range

    bind-order:
      mul | div
      add | sub
      range
      eq | ne
      land | lor
      haskellDot

    forbid:
      eq | ne
      land | lor
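Declarative tables like these map naturally onto a Pratt-style parser, where each operator's bind-order becomes a binding power and the associativity decides whether the right operand must bind strictly tighter. A minimal sketch of my own (numbers and binary operators only; parentheses and the forbid rules are omitted for brevity):

```python
# Sketch of a Pratt parser driven by declarative (bind-order, assoc)
# tables like the ones proposed above. Table contents are illustrative.

BIND = {"+": 10, "-": 10, "*": 20, "/": 20}   # higher binds tighter
ASSOC = {"+": "left", "-": "left", "*": "left", "/": "left"}

def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def expr(min_bp):
        nonlocal pos
        lhs = tokens[pos]; pos += 1           # assume a number token
        while peek() in BIND and BIND[peek()] >= min_bp:
            op = tokens[pos]; pos += 1
            # left-assoc: the right operand must bind strictly tighter
            bp = BIND[op] + 1 if ASSOC[op] == "left" else BIND[op]
            lhs = (op, lhs, expr(bp))         # tree as nested tuples
        return lhs
    return expr(0)

print(parse(["1", "+", "2", "*", "3"]))  # ('+', '1', ('*', '2', '3'))
```

The forbid rules would be a small extra check here: when `op` and the operator at the root of `lhs` are in the same forbidden group, raise a syntax error instead of building the node.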


This approach works very well, as far as I can tell so far:

https://practal.com/press/aflap/1/ (mainly starting at "Syntactic Categories")

I got it right only after finishing the draft of the article, so you need to make it through to "Post Scriptum" for the right method.


> Problem statement: it’s hard to describe arithmetic expressions in a way that:

> - declaratively captures the overall shape of expression, and

> - has a clear precedence semantics

I'm pretty puzzled here. The answer to that question is either postfix notation or mandatory parentheses. Precedence is only ever used with mathematical operators anyway...

> ...it doesn’t tell whether * or + binds tighter, and their associativity.

Associativity isn't defined by grammar. It's defined by mathematical proof. You can write 2+4 or 4+2, and predict that they will evaluate to the same 6, but the grammar sees two entirely different expressions. Just because you know two syntax trees are equivalent doesn't mean they will ever be the same. Parsing is not arithmetic.


Associativity in this context doesn't mean "The associative property," it means "left-associative or right-associative" https://en.wikipedia.org/wiki/Operator_associativity


This technique has a combinatorial-explosion problem: N operators need on the order of N(N-1) exclusion rules, plus there is a high risk of writing them incorrectly.

Consider the two-operator example near the end:

  Expr =
    'number'
  | '(' Expr ')'
  | Expr '+' Expr
  | Expr '*' Expr

  Expr !=
               Expr '+' [Expr '+' Expr]
  |            Expr '*' [Expr '*' Expr]
  |            Expr '*' [Expr '+' Expr]
  | [Expr '+' Expr] '*' Expr
There is one exclusion per operator to enforce associativity (not problematic) and two exclusions (out of the four trees containing both operators) to enforce precedence. These scale quadratically: if we add, say, exponentiation ('^'), we need to add both

  Expr !=
               Expr '^' [Expr '+' Expr]
  | [Expr '+' Expr] '^' Expr
and

  Expr !=
               Expr '^' [Expr '*' Expr]
  | [Expr '*' Expr] '^' Expr
And so on.

There is a simpler but less elegant technique to constrain associativity and precedence: different types used asymmetrically.

  Expr_base = 'number' | '(' Expr ')'
  Expr_exp  = Expr_base '^' Expr_exp | Expr_base
  Expr_mul  = Expr_mul '*' Expr_exp | Expr_exp
  Expr_plus = Expr_plus '+' Expr_mul | Expr_mul
  Expr      = Expr_plus
Expr_mul contains no unparenthesized '+', Expr_exp contains no unparenthesized '+' and no '*', Expr_base contains no operator at all. (Note the right recursion on '^', which makes exponentiation right-associative, and that each level builds only on tighter-binding levels, which is what encodes the precedence.)
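This layering translates directly into a recursive-descent parser: one function per precedence level. The sketch below is my own illustration, using the conventional ordering ('^' tightest and right-associative, then '*', then '+') and assuming well-formed input:

```python
# Recursive-descent parser mirroring the layered grammar: each function
# handles one precedence level and delegates to the next-tighter level.

def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def base():                       # 'number' | '(' Expr ')'
        nonlocal pos
        if peek() == "(":
            pos += 1
            node = plus()
            pos += 1                  # consume ')', assuming it is there
            return node
        node = tokens[pos]; pos += 1  # a number token
        return node
    def power():                      # right recursion => right-assoc '^'
        nonlocal pos
        lhs = base()
        if peek() == "^":
            pos += 1
            return ("^", lhs, power())
        return lhs
    def times():                      # iteration => left-assoc '*'
        nonlocal pos
        lhs = power()
        while peek() == "*":
            pos += 1
            lhs = ("*", lhs, power())
        return lhs
    def plus():                       # iteration => left-assoc '+'
        nonlocal pos
        lhs = times()
        while peek() == "+":
            pos += 1
            lhs = ("+", lhs, times())
        return lhs
    return plus()

print(parse(["1", "+", "2", "*", "3"]))  # ('+', '1', ('*', '2', '3'))
```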


> But at this point we lose decorativeness.

I don't understand this sentence. The left-recursive rules are pretty much standard. Many parsing algorithms deal with such left-recursive rules well; you can feed them directly to, e.g., an Earley parser and it will work correctly without producing ambiguous parses. And of course, as the author says, they capture precedence well. So what is this lack of decorativeness that makes it non-ideal? Decorate what?


I'm actually doing some research on operator precedence and syntax at the moment, so it's fun to see such a relevant post! One of my favorite syntaxes for custom operators comes from Agda, which allows easy definition of mixfix operators. Agda uses underscores to stand in for operands in the name of an arbitrary function (e.g. `_+_` for the sum). When fully qualified with all underscores, it acts like a normal function (`_+_ 3 4 == 7`). But, uniquely among the languages I've seen, it allows partial application with some operands in place, making it much easier to create new curried functions.

Here's an `if_then_else` example, modified from [0]:

    -- declare an operator function with underscores:
    if_then_else_ : {A : Set} -> Bool -> A -> A -> A
    -- define the function:
    if true then x else y = x
    if false then x else y = y
    -- call it:
    if_then_else_ cond first second
    -- or use it infix:
    if cond then first else second
    -- or partially apply "sections" of the operator:
    -- this is really cool
    (if_then first else second) cond
    (if cond then_else second) first
    if cond then first else_ second
    if cond then_else_ first second
    if_then first else_ cond second
    (if_then_else second) cond first
Of course, the paper by Agda authors Danielsson and Norell, Parsing Mixfix Operators [1], is a classic discussion of the theory of mixfix operators in expression grammars (they treat precedence as a DAG). But I rarely see cited the earlier and similarly titled dissertation by Jacob Wieland, Parsing Mixfix Expressions [2], which goes further: it defines classes of ambiguity in both operator expressions and overloaded operators, expresses these in an Earley-style parser, and in my view is far more practical for real language designers.

[0] Agda Mixfix docs: https://agda.readthedocs.io/en/v2.6.4.3/language/mixfix-oper...

[1] Danielsson/Norell: https://scholar.google.com/scholar?q=parsing+mixfix+operator...

[2] Wieland: https://scholar.google.com/scholar?q=parsing+mixfix+expressi...



