Parsing: The Solved Problem That Isn't (2011) (tratt.net)
117 points by contr-error 10 months ago | 84 comments



There was an interesting discussion two years ago regarding nonobvious issues with PEGs:

https://news.ycombinator.com/item?id=30414683

https://news.ycombinator.com/item?id=30414879

I spent a year or two working with PEGs, and ran into similar issues multiple times. Adding a new production could totally screw up seemingly unrelated parses that worked fine before.

As the author points out, Earley parsing with some disambiguation rules (production precedence, etc.) has been much less finicky/annoying to work with. It's also reasonably fast for small parses even with a dumb implementation. I'd suggest it for prototyping, or for settings where runtime ambiguity is not a showstopper, despite the remaining issues described in the article re: having a separate lexer.


> I spent a year or two working with PEGs

> Earley parsing with some disambiguation rules

Any idea why GLR always gets ignored?


At least in my personal case: GLR sounds great in theory, but I like to implement things 'from scratch' when possible. Both PEGs and the basic Earley algorithm are incredibly simple to write up and hack on in a few hundred lines of insert-favorite-language-here.

GLR would probably have (much) better performance but I'm usually not parsing huge files (or would hand-roll one if I were). I've not yet found an explanation of GLR (or even LR for that matter) that's quite as simple as PEGs or Earley (suggestions welcome tho!).


Gotcha. GLR is painful to learn, you have my sympathy there. I feel like a big chunk of that is due to overly complex formalism and nomenclature around it, but it's definitely still tough, especially if you're trying to implement it efficiently. If you're interested in doing so, I recommend making sure you understand each of the following parsing algorithms in order: regexes via NFA, regexes via DFA, unambiguous LR(0) without empty productions and while ignoring time complexity, LR(0) with time complexity optimizations and with empty productions, LR(1), and then finally GLR (to handle ambiguity). The hardest part will probably be the LR(0) stage.

If you get stuck on LR(0), here's the idea: (a) you use a state machine with a stack, and (b) you treat nonterminal symbols like any other input tokens. How? Like this. Say you had the rules:

  #1: Add: Add '+' NUMBER;
  #2: Add: NUMBER;
Initially, you could be in the beginning of any of these rules: either in rule #1 offset 0 (before the Add), or in rule #2 offset 0 (before the NUMBER). Push ({(#1, 0), (#2, 0)}, <null>) onto your stack, where the second list is the symbols you've seen so far (nothing so far). Now consume the first token in the input; let's say it was a NUMBER (say, "55"). Go through every possible new location in your stack, and update where it could be now, and push the new candidate locations along with the symbol you just saw, filtering out any locations where you went past the end of the rule. In this case, that means pushing ({(#1, 1), (#2, 1)}, NUMBER). Now, carefully examine your stack. You've seen [<null>, NUMBER] so far (i.e., just [NUMBER]), and you're either in the middle of rule #1, or at the end of rule #2. Well, it can't be #1, because the last symbol you saw was NUMBER, not Add. Therefore it's #2, and you've finished rule #2, which means you've recognized an Add. Therefore, pop all the symbols in the production of #2 from the stack (which leaves the stack empty), then push the output back (i.e. "reduce" it to the Add symbol). This leaves you with a stack that has just one item, ({(#1, 1)}, Add). Since you're no longer at the end of a production, consume the next input symbol, then go repeat this process, reducing as much as you can every time before consuming an input. [1]

Once you understand this intuitively (and if you stare at it long enough, you'll realize the process I just explained resembles that of a regex NFA), then learn how to optimize this by preprocessing the set of possible locations into simple integers instead of entire sets, so you don't have to do O(|rules|) work every iteration. Then move onto LR(1), which is merely about disambiguating the rules using a lookahead. After that, GLR is "just" keeping a DAG instead of a stack (and it reduces to a linked-list version of a stack if there is no ambiguity)... best of luck.

[1] I lied a little here, in that the process starts with reducing, not shifting. Because you might need to reduce (possibly multiple times) before you consume any inputs. But that's easier to explain here after the fact, than when you're initially learning it.
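
If it helps to see the walkthrough above as running code, here's a rough Python sketch of the item-sets-on-a-stack idea (names and simplifications are mine; it hard-codes the toy Add grammar, filters out items whose expected symbol doesn't match, assumes no empty productions, and skips the closure step a real LR(0) parser needs, which this particular grammar happens not to require):

    RULES = [
        ("Add", ("Add", "+", "NUMBER")),   # rule #1
        ("Add", ("NUMBER",)),              # rule #2
    ]

    def advance(items, symbol):
        # All items from `items` that expect `symbol`, moved one step past it.
        return frozenset((r, off + 1) for (r, off) in items
                         if off < len(RULES[r][1]) and RULES[r][1][off] == symbol)

    START = frozenset((r, 0) for r in range(len(RULES)))

    def parse(tokens, goal="Add"):
        stack = [(START, None)]               # entries are (item set, symbol just seen)

        def reduce_all():
            while True:
                items, _ = stack[-1]
                done = [r for (r, off) in items if off == len(RULES[r][1])]
                if not done:                  # nothing complete on top: stop reducing
                    return
                head, body = RULES[done[0]]   # toy grammar: no reduce/reduce conflicts
                del stack[-len(body):]        # pop the rule's symbols
                stack.append((advance(stack[-1][0], head), head))  # "shift" the head

        reduce_all()                          # footnote [1]: reduce before the first shift
        for tok in tokens:
            items = advance(stack[-1][0], tok)
            if not items:
                return False                  # syntax error
            stack.append((items, tok))
            reduce_all()
        return len(stack) == 2 and stack[-1][1] == goal

    print(parse(["NUMBER", "+", "NUMBER", "+", "NUMBER"]))   # True
    print(parse(["NUMBER", "+", "+"]))                       # False

The table-driven version is "just" this with the possible item sets precomputed into numbered states.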


Parsing computer languages is an entirely self-inflicted problem. You can easily design a language so it doesn't require any parsing techniques that were not known and practical in 1965, and it will greatly benefit the readability also.


This is entirely the case. Given a sensible grammar stated in a sensible way, it's very easy to write a nice recursive descent parser. They're fast and easy to maintain, and they don't limit the expressiveness of your grammar unduly.

Both GCC and LLVM implement recursive descent parsers for their C compilers.

Parser generators are an abomination inflicted upon us by academia, solving a non-problem, and poorly.


Agree completely. Having used a bunch of parser generators (Antlr and bison most extensively) and written a parser combinator library, I came to the conclusion that they're a complete waste of time for practical applications.

A hand-written recursive descent parser (with an embedded Pratt parser to handle expressions/operators) solves all the problems that parser generators struggle with. The big/tricky "issue" mentioned in the article - composing or embedding one parser in another - is a complete non-issue with recursive descent: it's just a function call. Other basic features of parsing (informative/useful error messages, recovery - i.e. don't just blow up at the first error - and seamless integration with the rest of the host language) remain issues with all parser generators but are simply not issues with recursive descent.

And that's before you consider non-functional advantages of recursive-descent: debuggability, no change/additions to the build system, fewer/zero dependencies, no requirement to learn a complex DSL, etc.
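
For anyone who hasn't written one, here's roughly what the skeleton looks like in Python (toy grammar and names invented for this comment; the expression parser is a stub you'd swap for a Pratt/precedence-climbing routine). Note that "composing" or "embedding" another parser really is just a function call, and error messages are just strings you write:

    import re

    def tokenize(src):
        # crude lexer: identifiers, numbers, and single-character punctuation
        return re.findall(r"[A-Za-z_]\w*|\d+|[=;]", src)

    class Parser:
        def __init__(self, tokens):
            self.toks, self.pos = tokens, 0

        def peek(self):
            return self.toks[self.pos] if self.pos < len(self.toks) else None

        def expect(self, tok):
            if self.peek() != tok:
                raise SyntaxError(f"expected {tok!r}, got {self.peek()!r} at token {self.pos}")
            self.pos += 1

        def parse_program(self):          # program: stmt*
            stmts = []
            while self.peek() is not None:
                stmts.append(self.parse_stmt())
            return stmts

        def parse_stmt(self):             # stmt: 'let' NAME '=' expr ';'
            self.expect("let")
            name = self.peek(); self.pos += 1
            self.expect("=")
            value = self.parse_expr()     # embedding another parser = calling it
            self.expect(";")
            return ("let", name, value)

        def parse_expr(self):             # stub: a single number or name
            tok = self.peek(); self.pos += 1
            return ("atom", tok)

    print(Parser(tokenize("let x = 42; let y = x;")).parse_program())
    # [('let', 'x', ('atom', '42')), ('let', 'y', ('atom', 'x'))]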


The main advantage of recursive descent is that since parser rules correspond to functions in the host language, you can put a breakpoint on them in the debugger and understand how you got there. You can't do that with a table-driven LALR(1) parser like Bison.

I think you cover that with the "debuggability" remark.

Here is something else. Yacc/Bison parsing uses its own stack for the symbols, whereas recursive descent uses the regular run-time stack.

Pop quiz: which stack is visible to your garbage collector, and which isn't?

In TXR Lisp, which uses a Bison/Byacc generated parser, GC is suspended while yyparse() executes. Otherwise it would prematurely collect anything that is only stored in the Yacc stack, and so there would have to be a GC hook to traverse that stack (which would depend on undocumented features of the generated code, and likely have to have some separate coding for Bison versus Byacc.)


> handle expressions/operators

Precedence climbing is a natural and efficient way to parse expressions in a recursive descent parser. It's used by Clang in LLVM.

https://eli.thegreenplace.net/2012/08/02/parsing-expressions...
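
To show how small it is, here's a toy precedence-climbing sketch in Python (not Clang's code, obviously: made-up token format, numbers-only atoms, no parentheses, all operators left-associative):

    PREC = {"+": 1, "-": 1, "*": 2, "/": 2}   # higher binds tighter

    def parse_expr(tokens, pos=0, min_prec=1):
        lhs, pos = tokens[pos], pos + 1       # atom
        while pos < len(tokens) and tokens[pos] in PREC and PREC[tokens[pos]] >= min_prec:
            op = tokens[pos]
            # left-associative: the right operand may only use strictly tighter operators
            rhs, pos = parse_expr(tokens, pos + 1, PREC[op] + 1)
            lhs = (op, lhs, rhs)
        return lhs, pos

    tree, _ = parse_expr("2 + 3 * 4 - 5".split())
    print(tree)   # ('-', ('+', '2', ('*', '3', '4')), '5')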


Making them all s-exprs with prefix notation solves even that generally, correct?


> A hand-written recursive descent parser (with an embedded Pratt parser to handle expressions/operators)

Exactly the recipe I swear by as well!


Just like email addresses. The specification/RFC/whatever could have defined a regex that determines a valid address, instead of the essentially impossible-to-validate format we have today.


> and it will greatly benefit the readability also

This is the controversial part, Lisp aficionados notwithstanding.


People are misunderstanding my original comment.

You can parse quite a lot more than Lisp with techniques from 1965.


People would misunderstand you less if you made yourself better understood.

Like using a hyperlink to a code example.


You can search for parsing techniques that were available and practical in 1965: recursive descent, the shunting yard algorithm, and operator precedence parsing[0]. LR(k) parsing was described by Knuth in 1965, but I don't think it was considered practical yet due to the memory required by LR(1) tables.

[0]: 1963 operator precedence description: https://dl.acm.org/doi/10.1145/321172.321179
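
Of those, the shunting-yard algorithm is probably the quickest to sketch from memory. A toy Python version (my own simplifications: left-associative binary operators only, no functions or unary minus), turning infix tokens into postfix:

    PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

    def shunting_yard(tokens):
        output, ops = [], []
        for tok in tokens:
            if tok in PREC:
                while ops and ops[-1] != "(" and PREC[ops[-1]] >= PREC[tok]:
                    output.append(ops.pop())
                ops.append(tok)
            elif tok == "(":
                ops.append(tok)
            elif tok == ")":
                while ops[-1] != "(":
                    output.append(ops.pop())
                ops.pop()                     # discard the "("
            else:                             # operand
                output.append(tok)
        while ops:
            output.append(ops.pop())
        return output

    print(shunting_yard("2 + 3 * ( 4 - 1 )".split()))
    # ['2', '3', '4', '1', '-', '*', '+']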


Do you mean Lisp? If yes, I agree.


Including, but not limited to.


Many an argument has been won by the response: "but consider Lisp and Forth"


Smalltalk is another.


End-comment


)


But I don't want to be able to parse only highly restricted languages. I want to be able to parse anything, including natural language or even non-languages like raw audio.

My brain can do it, why can't my computer?


Yes, never do humans misunderstand each other, or instructions are not clear to everyone and totally unambiguous, and luckily no language has pure differentiation of meaning by intonation, and, and.. and...


Yeah, natural languages don't have a specification or canonical parser implementation, so they cannot be reliably parsed.


They don't have a short parser. They can be parsed, it just requires a huge amount of priors and world knowledge. Rule-based parsers are too simple.


No, they cannot be reliably parsed. There is no unambiguously correct parsing for many (or, arguably, any) strings. Two people could say the same thing in the same context and mean different things by it. You can't even definitively say whether what they said/wrote is valid English. Sure, there are strings most would agree are and strings most would agree aren't, but even taking consensus opinion as the source of truth, most isn't all, and there's no universally agreed upon threshold for acceptance.


That's okay - that just means your parser needs to model what the speaker was thinking when they said it. That's extra information that's required to decode the message. It is not necessary for the same text to always mean the same thing.


If you need to already know what the speaker meant in order to understand them, then there is no point in communication.

Human language has a pretty clear distinction between syntax and semantics. This is how we recognize that "colorless green ideas sleep furiously" is perfectly well formed, if meaningless. In contrast, "I is happy" is meaningful and unambiguous, but grammatically incorrect.

In terms of syntax, English (like most, if not all, languages) is literally ambiguous.

Consider the sentence structure:

Subject Verb Object Prepositional-Phrase.

This can be either:

(Subject Verb (Object Prepositional-Phrase))

Or

((Subject Verb Object) Prepositional-Phrase)

For instance, consider the sentence "I saw a man with binoculars".

In any sense of the word, this example is structurally ambiguous.


If I already knew what the speaker was thinking, there would be no need to parse his words at all.


Fine, a parser that is a perfect oracle for authorial intent can reliably parse English. But no real parser can. And anyways, that effectively extends the English grammar to include the entire world state, which isn't really what people mean when they talk about English as a language or parsing strings—a fact which perhaps helps to illustrate the problem.


>that effectively extends the English grammar to include the entire world state

Exactly! So glad we're on the same page.

Language is created by and intended for big brains with huge amounts of knowledge about each other and about the world. Relying on external knowledge makes it extremely compact and flexible, but also means your parser needs a similar level of knowledge to function.


>So glad we're on the same page

We're really not. I'm saying you've incorrectly defined the English language. By that definition, no piece of text is English.


> There is no unambiguously correct parsing for many (or, arguably, any) strings.

My favorite funny phenomenon in this area is when a sentence has two unambiguously different parses that mean exactly the same thing.


My favorite example, due to Douglas Hofstadter:

Politicians lie.

Cast iron sinks.

Politicians lie in cast iron sinks.

It's not actually ambiguous, but I think it's a lovely illustration of the subtleties of the problem.

An actually ambiguous example: I saw a politician lying in a cast iron sink.


Ambiguous parses aren't even the worst of it -- the worst are the ones that require real world knowledge.

"I couldn't fit the trophy in my suitcase because it was too big."

"I couldn't fit the trophy in my suitcase because it was too small."


Those seem like easier to resolve ambiguities than in the parent. One is almost certainly never going to be correct because it's nonsense, and that's a determination that requires a semantic understanding of the sentence, but no further context. The ambiguities in the parent are such that both parses are semantically reasonable. There are even contexts in which either could be intended. Suppose someone asks "What materials are heavier than water, and what are everyday things made of them that could float anyways?". If someone answers "cast iron sinks", it's unclear whether they're answering only the first, only the second, or even both, punningly.


My personal favourite example is the proof that two positives can make a negative: 'yeah, right'.


Your computer can do it, if it has a beefy enough Nvidia card. That's what LLMs are for!

Grammar-based parsing for natural language isn't anywhere close to working, sadly, and may never be.


You've never had an LLM misinterpret what you correctly wrote?

(… I have.)


Related:

Parsing: The Solved Problem That Isn't (2011) - https://news.ycombinator.com/item?id=8505382 - Oct 2014 (70 comments)

Parsing: the solved problem that isn't - https://news.ycombinator.com/item?id=2327313 - March 2011 (47 comments)


After using Instaparse, it at least felt like a solved problem: https://github.com/Engelberg/instaparse


What I find annoying about using parser generators is that it always feels messy integrating the resulting parser into your application. So you write a file that contains the grammar and generate a parser out of that. Now you build it into your app and call into it to parse some input file, but that ends up giving you some poorly typed AST that is cluttered/hard to work with.

Certain parser generators make life easier by supporting actions on parser/lexer rules. This is great and all, but it has the downside that the grammar you provide is no longer reusable. There's no way for others to import that grammar and provide custom actions for them.

I don't know. In my opinion parsing theory is already solved. Whether it's PEG, LL, LR, LALR, whatever. One of those is certainly good enough for the kind of data you're trying to parse. I think the biggest annoyance is the tooling.


Parser combinators is what I've been loving in the last few years.

Pros:

* They're just a technique/library that you can use in your own language, without the separate generation step.

* They're simple enough that I often roll my own rather than using an existing library.

* They let you stick code into your parsing steps - logging, extra information, constructing your own results directly, etc.

* The same technique works for lexing and parsing - just write a parser from bytes to tokens, and a second parser from tokens to objects.

* Depending on your language's syntax, you can get your parser code looking a lot like the BNF grammar you're trying to implement.

Cons:

* You will eventually run into left-recursion problems. It can be nightmarish trying to change the code so it 'just works'. You really need to step back and grok left recursion itself - no handholding from parser combinators.

* Same thing with precedence - you just gotta learn how to do it. Fixing left recursion didn't click for me until I learned how to do precedence.
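
To make the "they're just functions" point concrete, here's a minimal home-rolled combinator sketch in Python (my own toy, nothing to do with any particular library): a parser is a function (text, pos) -> (value, new_pos) or None. Note how the left-recursion limitation shows up as having to phrase addition as a repetition rather than `expr: expr '+' number`:

    def lit(s):
        return lambda text, pos: (s, pos + len(s)) if text.startswith(s, pos) else None

    def seq(*parsers):                  # run parsers in order, collect their results
        def p(text, pos):
            out = []
            for sub in parsers:
                r = sub(text, pos)
                if r is None:
                    return None
                val, pos = r
                out.append(val)
            return out, pos
        return p

    def alt(*parsers):                  # first alternative that succeeds wins (PEG-style)
        def p(text, pos):
            for sub in parsers:
                r = sub(text, pos)
                if r is not None:
                    return r
            return None
        return p

    def many(parser):                   # zero or more repetitions
        def p(text, pos):
            out = []
            while (r := parser(text, pos)) is not None:
                val, pos = r
                out.append(val)
            return out, pos
        return p

    def mapv(parser, f):                # "stick code into your parsing steps"
        def p(text, pos):
            r = parser(text, pos)
            return None if r is None else (f(r[0]), r[1])
        return p

    digit  = alt(*(lit(c) for c in "0123456789"))
    number = mapv(seq(digit, many(digit)), lambda v: int(v[0] + "".join(v[1])))
    addend = mapv(seq(lit("+"), number), lambda v: v[1])
    expr   = mapv(seq(number, many(addend)), lambda v: sum([v[0]] + v[1]))

    print(expr("1+20+300", 0))   # (321, 8)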


Have there been any notable innovations in parsing since this was written?


Aycock & Horspool came up with a 'practical' method for implementing Earley parsing (conversion to a state machine) that gives a humorously large performance improvement over "naive" Earley, and is still reasonable to implement. Joop Leo figured out how to make Earley parsing O(n) on right-recursive, non-ambiguous grammars, where plain Earley is O(n^2); left-recursive, non-ambiguous grammars were already O(n). That leaves O(n^3) behaviour only for ambiguous grammars; and, if you're doing that, you're holding your language wrong.

A somewhat breathless description of all of this is in the Marpa parser documentation:

    https://jeffreykegler.github.io/Marpa-web-site/
In practice, I've found that computers are so fast that, with just the Joop Leo optimizations, 'naive' Earley parsing is Good Enough™:

    https://loup-vaillant.fr/tutorials/earley-parsing/
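For a sense of scale, here's a bare-bones Earley recognizer in Python (no Leo optimization, no parse-tree construction; the grammar and the pre-lexed "digit"/"+" tokens are made up for this comment):

    GRAMMAR = {
        "Sum": [["Sum", "+", "Num"], ["Num"]],
        "Num": [["digit"], ["digit", "Num"]],
    }
    TERMINALS = {"+", "digit"}

    def earley(tokens, start="Sum"):
        # An item is (head, body, dot, origin).
        chart = [set() for _ in range(len(tokens) + 1)]
        chart[0] = {(start, tuple(body), 0, 0) for body in GRAMMAR[start]}
        for i in range(len(tokens) + 1):
            worklist = list(chart[i])
            while worklist:
                head, body, dot, origin = worklist.pop()
                if dot == len(body):                      # COMPLETE
                    for h2, b2, d2, o2 in list(chart[origin]):
                        if d2 < len(b2) and b2[d2] == head:
                            new = (h2, b2, d2 + 1, o2)
                            if new not in chart[i]:
                                chart[i].add(new); worklist.append(new)
                elif body[dot] in TERMINALS:              # SCAN
                    if i < len(tokens) and tokens[i] == body[dot]:
                        chart[i + 1].add((head, body, dot + 1, origin))
                else:                                     # PREDICT
                    for b2 in GRAMMAR[body[dot]]:
                        new = (body[dot], tuple(b2), 0, i)
                        if new not in chart[i]:
                            chart[i].add(new); worklist.append(new)
        return any(h == start and d == len(b) and o == 0
                   for h, b, d, o in chart[len(tokens)])

    print(earley(["digit", "+", "digit", "digit"]))   # True
    print(earley(["digit", "+", "+"]))                # False

(It also punts on empty productions, which need a little extra care in the completer.)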


An extremely layman answer is that the most interesting innovation in parsing in relatively modern times seems to have happened in the context of IDEs. I.e. incremental, high-performance parsing to support syntax highlighting, refactoring, etc.

(I may be talking out of my ass here.)


Actually the most important part of a parser (since even non-incremental, slow (or rather: not fast) parsers are fast enough) is error recovery (error resilience) for syntax errors, mostly half-written or half-deleted code. What is time-consuming is e.g. type checking, and semantic checking in general, like exhaustiveness checks of pattern matches; syntax checking is fast.


In the days of punched cards error recovery was important.


Not sure, but I at least am certainly aware of possibilities that such writeups exclude.

In particular, you can do (a subset of) the following in sequence:

* write your own grammar in whatever bespoke language you want

* compose those grammars into a single grammar

* generate a Bison grammar from that grammar

* run `bison --xml` instead of actually generating code

* read the XML file and implement your own (trivial) runtime so you can easily handle ownership issues

In particular, I am vehemently opposed to the idea of implementing parsers separately using some non-proven tool/theory, since that way leads to subtle grammar incompatibilities later.



I'm not super familiar with the space, but tree-sitter seems to take an interesting approach in that it's an incremental parser. So instead of re-parsing the entire document on every change, it only re-parses the affected text, thereby making it much more efficient for text editors.

I don't know if that's specific to tree-sitter though; I'm sure there are other incremental parsers. I have to say that I've tried ANTLR and tree-sitter, and I absolutely love tree-sitter. It's a joy to work with.


In my experience incremental parsing doesn't really make much sense. Non-incremental parsing can easily parse huge documents in milliseconds.

Also Tree-Sitter only does half the parsing job - you get a tree of nodes, but you have to do your own parse of that tree to get useful structures out.

I prefer Chumsky or Nom which go all the way.


Ah interesting, yeah I did spend quite a bit of time parsing their AST, which turned out to be harder than writing the grammar itself. I’ll look into those two projects.


What do you mean by “parse of that tree to get useful structures out”? Can you provide some concrete examples?


Yeah suppose you write a simple config language like:

  let a = 12;
  let b = a + 5;
  ...

Tree-Sitter will give you a tree like

   Node(type="file", range=..., children=[
     Node(name="let_item", range=... children=[
       Node(name="identifier", range=...)
       Node(name="expression", range=..., children=[
         Node(name="integer_literal", range=...)
   ...
Whereas Nom/Chumsky will give you:

    struct File {
      let_items: Vec<LetItem>,
      ..
    };
    struct LetItem {
      name: String,
      expression: Expression,
    };
    ...
Essentially Tree-Sitter's output is untyped and ad hoc, whereas Nom/Chumsky's is fully validated and statically typed.

In some cases Tree-Sitter's output is totally fine (e.g. for syntax highlighting, or rough code intelligence). But if you're going to want to do stuff with the data, like actually process/compile it or provide 100% accurate code intelligence, then I think Nom/Chumsky make more sense.

The downsides of Nom/Chumsky are: pretty advanced Rust with lots of generics (error messages can be quite something!), and keeping track of source code spans (where did the `LetItem` come from?) can be a bit of a pain, whereas Tree-Sitter does that automatically.


Ok, understood. I was confused by the phrase "parse of that tree".

Tree-sitter's output is closer to being "dynamic" than "untyped", though.

It's not too hard to build a layer on top of tree-sitter (outside of the core lib) to generate statically typed APIs. I haven't felt the need for that yet, but it may be worth exploring.

> actually process/compile it

At work, I built a custom embedded DSL, using tree-sitter for parsing. It has worked well enough so far. The dynamically-typed nature of tree-sitter actually made it easier to port the DSL to multiple runtimes.

> provide 100% accurate code intelligence

Totally agree that tree-sitter cannot be used for this, if we are aiming for 100%.


Not the person you’re asking, but basically anything that needs to happen after the initial parsing stage. So you convert your raw text into an AST, but there’s usually some processing you need to do after that.

Maybe you need to optimize the data, maybe you need to do some error checking. Lots of code is syntactically valid but not semantically valid, and usually those semantic errors will persist into the AST (in my limited experience).


> [incremental parsing] I don't know if that's specific to tree-sitter though

No, it isn't. And incremental parsing is older than 2011 too (like at least the 70s).

For example: https://dl.acm.org/doi/pdf/10.1145/357062.357066


Yes. You roll your own manual parser. It's not as difficult as people make it out to be.


These are not new, but my takeaways from https://tratt.net/laurie/blog/2020/which_parsing_approach.ht... and https://rust-analyzer.github.io/blog/2020/09/16/challeging-L... are to embrace various forms of LR parsing. https://github.com/igordejanovic/parglare is a very capable GLR parser, and I've been keeping a close eye on it for use in my projects.


Yes, we redesigned our programming languages to be easy to parse with limited lookahead.


I feel that most of the time the two options are presented as either write a handwritten parser or use a parser generator. A nice third way is to write a custom parser generator for the language you wish to parse. Handwritten parsers do tend to get unwieldy and general purpose parser generators can have inscrutable behavior for any specific language.

Because the grammar for a parser generator is usually much simpler than most general purpose programming languages, it is typically relatively straightforward to handwrite a parser for it.


You could write a language with which to specify that custom parser. Oh wait...


A common example of the complications of two grammars being combined: C code and character strings.

Double quotes in C code mean begin and end of a string. But strings contain quotes too. And newlines. Etc.

So we got the cumbersome invention of escape codes, and so character strings in source code (itself a character string) are not literally the strings they represent.


at no point in my life have I ever considered escape codes to be problematic.

ugly, yes. problematic? no.


Not problematic. Just a little cumbersome. (And ugly, agreed.)

You can't always just copy and paste some text into code, without adding escape encodings.

Now write code that generates C code with strings, that generates C code with strings, and ... (ahem!)

It's not a big deal, but it isn't zero friction either. Relevant here because it might be the most prevalent example of what happens when even two simple grammars collide.
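
A tiny illustration of the friction (Python standing in for the code generator; the example strings are made up): each layer of C-style quoting roughly doubles the backslashes.

    def c_quote(s):
        # simplified C string literal: only backslash and double quote handled
        return '"' + s.replace("\\", "\\\\").replace('"', '\\"') + '"'

    msg = 'say "hi"'
    level1 = "puts(%s);" % c_quote(msg)       # C code that prints the message
    level2 = "puts(%s);" % c_quote(level1)    # C code that prints that C code
    print(level1)   # puts("say \"hi\"");
    print(level2)   # puts("puts(\"say \\\"hi\\\"\");");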


until you need to get your string through several levels of escape. how many backslashes to add? depends on how deep your pipe is and how each of those layers is defined


The only alternative is extracting them to other files or designing specialized string formats.


There is one obvious "specialized string format" that solves 99% of all escaping issues: use «balanced quotes». The real problem isn't escaping, it is that the same character is used both to open and close strings.


Or old Fortran-style Hollerith constants. They consist of the string length, an "H", and the string itself. Like 13HHello, world!


Qwerty has ` and ', but for some reason we've decided that '' isn't " and " is the one true quotation symbol.


Won't you still have to escape the closing bracket if it occurs inside the string?


Only if you want to refer to it literally as a closing quote rather than having it act as a closing quote. That case is extremely rare.


It's definitely rarer than double or single quotes occurring in a string. But I was wondering about the parent comment's concern of passing a string through multiple levels of escaping.

> until you need to get your string through several levels of escape. how many backslashes to add? depends on how deep your pipe is and how each of those layers is defined


This is the beauty of balanced quotes: they completely eliminate the need for multiple escapes.

When the same character is used as both an open and close delimiter, you have to disambiguate between three possibilities: opening a new string, closing the current string (which may or may not be embedded) and a literal character as a constituent of the current string. By convention, an unescaped double-quote inside a string indicates closing that string, so you need different escapes to indicate opening embedded strings and constituents.

You could have done that by using two different escape characters, but for historical reasons there is only one escape character: the backslash. So that one character has to do double-duty to disambiguate two different cases. But in fact it's even worse than that because string parsers have a very shallow understanding of backslashes. To a string parser, a backslash followed by another character means only that the following character should be treated as a constituent. So you still need to disambiguate between actual constituents and opening an embedded string, and the only way to do that, because all you have is the backslash, is with more backslashes. The whole mess is just a stupid historical accident.

If you used balanced quotes you only have one case that needs to be escaped: constituents. So you never need multiple escapes.

Note that I made a mistake when I wrote:

> Only if you want to refer to [a close-quote character] literally as a closing quote rather than having it act as a closing quote.

You have to escape both open and close quotes to refer to them as constituents. In other words you would need to write something like this:

«Here is an example of a «nested string». The start of a nested string is denoted by a \« character. The end of a nested string is denoted by a \» character.»

Note that it doesn't matter how many levels deep you are:

«Even when you write «a nested string that refers to \« or \» characters» you only need one level of escape.»

Note that when you refer to quote characters as balanced pairs, as in the examples above, you don't actually need the escapes. The above strings will parse just fine even without the backslashes, and they will print out exactly as you expect. The only "problem" will be that they will contain embedded strings that you probably did not intend. The only time escapes are actually required is when referring to quote characters as constituents without balancing them. This will always be the case if you refer to a close-quote without a corresponding preceding open-quote, which is the reason I got it wrong: escaping close-quotes will be more common than escaping open-quotes, but both will be needed occasionally.
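
A toy scanner for this scheme (my own sketch in Python, using backslash as the escape character as above) shows that nested strings need no escapes at all, and a single backslash is always enough for an unbalanced quote character:

    def read_string(text, pos=0):
        assert text[pos] == "«"
        out, depth, pos = [], 1, pos + 1
        while depth:
            ch = text[pos]
            if ch == "\\" and text[pos + 1] in "«»":   # literal, non-delimiting quote char
                out.append(text[pos + 1]); pos += 2
                continue
            if ch == "«": depth += 1
            if ch == "»": depth -= 1
            if depth: out.append(ch)
            pos += 1
        return "".join(out), pos

    s = "«Even when you write «a nested string that refers to \\« or \\» characters» you only need one level of escape.»"
    print(read_string(s)[0])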


I totally agree with the idea that balanced quotes are needed to make quoting sane. If the quotes in a string are balanced then it should be possible to quote it with no changes.

I would also advocate the principle that you don't escape the escape character by doubling it. There are two problems with replacing \ with \\: firstly the length of the string doubles with each nested quotation; secondly you can't tell at a glance whether \\\\\\\\\\\\\\\\\\\n contains a newline character or an n because it depends on whether the number of backslashes is odd or even.

Another useful principle is to escape a quote character with a sequence that does not contain that character: then it is much easier to check whether the quotes are balanced because you don't need to check whether any of them are escaped.

So here's a possible algorithm for quoting a string: first identify the top-level quote characters that don't match (this is not totally trivial but it isn't difficult or computationally expensive); then, in parts of the string that are not inside nested quotes, but only there, replace « with \<, » with \>, and \ with \_ (say). Does that work?


It might work but I think it misses the point. The current mess is in no small measure the result of assuming that we have to constrain ourselves to ascii. We don't. Once you accept balanced quotes you have implicitly accepted using unicode characters, at which point a whole host of new possibilities (and new problems -- see below) opens up. For example, the only reason you need \n (and \r and \t etc.) is because you want a way to use non-whitespace characters to represent whitespace. But this feature is already built in to unicode, which has dedicated non-whitespace versions of all of the ascii whitespace and control characters (␍, ␊, ␉ , ␀, etc.) so there is no need to escape any of these.

That leaves only the problem of escaping the escape character, and here again there is no need to constrain ourselves to ascii. There is no reason that the escape character needs to be backslash. In fact, that is a particularly poor choice because backslash, being an ascii character, is extremely precious real estate. In fact, it is doubly precious because it actually has a balanced partner in the forward slash, so if you are going to use backslash for any special purpose it should be partnered with forward slash as a balanced set (which opens up the problem of what to use for the directory delimiter in your operating system, but that's another can o' worms).

I think the Right Answer is simply to choose a different character to serve as the escape character inside balanced strings. My first pick would probably be ␛, but there are obviously a lot of other possibilities.

This points to a potential danger of this approach: there are a lot of unicode characters that render very similarly, like U and ᑌ. You would need to choose the unicode characters with special meanings very judiciously, and make sure that when you are writing code you have an editor that renders them in some distinctive way so you can be sure you're typing what you think you're typing. But that seems doable.


I also once had to pick up around the house and it was oh so terrible.


> problematic?

Unless it's a regex....


My current view of what makes parsing so difficult is that people want to jump straight over a ton of intermediate things from parsing to execution. That is, we often know what we want to happen at the end. And we know what we are given. It is hoped that it is a trivially mechanical problem to go from one to the other.

But this ignores all sorts of other steps you can take. Targeting multiple execution environments is an obvious step. Optimization is another. Trivial local optimizations like shifts over multiplications by 2 and fusing operations to take advantage of the machine that is executing it. Less trivial full program optimizations that can propagate constants across source files.

And preemptive execution is a huge consideration, of course. Very little code runs in a way that can't be interrupted for some other code to run in the meantime. To the point that we don't even think of what this implies anymore. Despite accumulators being a very basic execution unit on most every computer. (Though, I think I'm thankful that reentrancy is the norm nowadays in functions.)


What have those other things got to do with parsing though? Granted, they rely on parsing having already happened, but I don't see how there's much feedback from those considerations to the way that parsers work, or are written, or - as the article discussed - can be combined?


You can easily view it as having nothing to do with it. My push is that it is the point of parsing. You don't parse directly into understanding/execution. You parse into another representation, one that we never directly talk about, so that you can then move it into the next level.

Even English can be parsed first into the sounds. This is why puns work. Consider the joke, "why should you wear glasses to math class? It helps with division." That only works if you go to the sounds first. And you will have optionality in where to go from there.

So, for parsing programs, we often first decide on primitives for execution. For teaching, this is often basic math operations. But in reality, you have far more than the basic math operations. And, as I was saying, you can do more with the intermediate representation than you probably realize at the outset.


I agree, I think syntax could be defined as "the whole problem" if you see it holistically. Approaches that avoid scaling an intermediary grammar from lower-level tokens like the Lisp or Forth construction of "the parse directly maps to a data structure, and the data structure has the semantics" are robust. The reason why they aesthetically offend comes down to familiarity and tooling: infix expressions "look like math" - they please someone with prior training - but often act to hide important machine-level details. And the grammar helps to spread around the architecture of the language so that a syntax error compiles as a different semantic. Versus an approach like APL with a richer token set, or a ColorForth that packs in more semantic value by assigning tags to each word and making that part of the presentation.

I've moved towards designing languages now that operate over CSV source. That adds an extra dimension while still enabling convenient editing - just turn off all the parsing behavior in the spreadsheet and you can edit it as "plain text". Although, column alignment isn't always desirable in this case.



