compiler writers write compilers in programming languages, not in syntax trees. ...

catnaroek · on March 29, 2016

> it's still all lexical.

I have no idea what you mean. Are you talking about lexical scope? If anything, determining the lexical scope of a variable is much easier from a syntax tree than from the original program text.

> A syntax tree is not syntax, it's a tree.

Pretty much all the literature on programming languages identifies “syntax” with “abstract syntax (tree)”. It is typically understood that textual concrete syntax exists primarily as a convenience for the programmer, because it's easier to edit, but the very first thing a compiler does is recover the abstract syntax tree, and from then onwards proceed as if the concrete syntax had never existed (except perhaps for error reporting purposes).

betenoire · on March 29, 2016

Lexical means "as it appears 'written' on the 'page' of code". So lexical scope allows you to look and see what the binding is. Contrasted to dynamic scope where you must run the program to find the binding.

If you will, syntax is the vocabulary of the language. Also called "lexicon" and the syntactical tokens are sometimes called "lexemes". It's all about the text. Syntax is text. Or more technically, syntax is the format of the input.

Syntax trees are data structures that represent syntax, but they are not syntax. There are different types of syntax trees, too. The trees are built by the compiler, and never written by hand unless for exercise.

Syntax trees have already passed semantic analysis and contain context. They are much more than just syntax.

catnaroek · on March 29, 2016

> If you will, syntax is the vocabulary of the language. Also called "lexicon" and the syntactical tokens are sometimes called "lexemes".

I'm pretty sure that most programming languages have their syntax defined in terms of two components: (0) a lexical specification [which lexemes are valid], (1) a grammar [how to arrange lexemes into well-formed programs]. And the latter tends to be richer and more complex than the former. So it isn't just, or even primarily about the individual lexemes.

> It's all about the text. Syntax is text. Or more technically, syntax is the format of the input.

Syntax is the (tree!) structure of utterings (declarations, statements, expressions, whatever). Textual representations of syntax just happen to be convenient to manipulate in a text editor.

> Syntax trees have already passed semantic analysis and contain context.

This is wrong. Syntax trees are the input for semantic analysis.

betenoire · on March 30, 2016

You must be fucking with me by now. Parsing IS context also, and it's the parsing of said syntax that _results_ in trees.

That's why we have syntax highlighting. Have you ever seen a tree in your syntax highlighting? I didn't think so. Do you have a tree when you have a syntax error? No, because it never gets that far. Syntax is independent and prior to your trees, the IR in a compiler doesn't HAVE to use trees, it can use whatever it wants. I see there is no point in continuing this, we are arguing over definitions. Not even sure what we are talking about anymore.

catnaroek · on March 30, 2016

> You must be fucking with me by now.

I'm honestly not. Also, let's try to keep this civilized.

> and it's the parsing of said syntax that _results_ in trees.

When did I say otherwise? As you say, parsing concrete syntax indeed results in abstract syntax trees. Then those trees are used as input for semantic analysis (type checking, data flow analysis, etc.). Parsing is not semantic analysis.

> Have you ever seen a tree in your syntax highlighting?

The whole point to syntax highlighting, as well as good code formatting practices, is to make it easy to see the abstract syntax tree from the concrete program text.

> Do you have a tree when you have a syntax error?

If you're using a modern production compiler, yes. Compilers don't stop at the first syntax error. They add the error to a list, and then continue trying to build the syntax tree.

> Syntax is independent and prior to your trees,

I'd argue otherwise. When designing a programming language, abstract syntax comes first. You can define the semantics of a programming language purely in terms of its abstract syntax, and only later come up with a concrete syntax for it.

> the IR in a compiler doesn't HAVE to use trees

IR is what the compiler's frontend hands out to the backend. Syntax trees are an internal data structure used by the compiler's frontend.

groovy2shoes · on March 31, 2016

"Syntax highlighting" is often a misnomer. More often than not, it's actually "lexis highlighting" (though there are some actual syntax highlighters, they're few and far between in my experience).

Take Vim or jEdit for instance (among the vast majority of others). When you want to make a color scheme, you do it by assigning colors to linguistic elements like "string literal", "character literal", "numeric literal", "boolean literal", "identifier", "comment", "keyword", etc. These elements are the sort of thing a lexical analyzer produces: a token type (in practice, lexers typically identify the associated lexeme as well). This is no coincidence, as most text editors with "syntax" highlighting use regular expressions to categorize and colorize the text. If you've studied any automata theory, programming language theory, and/or compiler construction, then you know that regular expressions are equal in power to finite automata and are therefore incapable of analyzing any syntax beyond the lexical for a great majority of programming languages. In fact, without a grammar of some sort beyond regular expressions, the "syntax" highlighting engine is provably incapable of actually understanding syntax.

I've heard that Scintilla's highlighter is fed parsing expression grammars to describe the syntaces of various languages (using Lua LPeg). Now that's a syntax highlighter -- but (perhaps unfortunately) it's the exception, not the rule.

So that's why you don't (often) see trees in your "syntax" highlighting, but instead see a string of tokens.

catnaroek is correct: source code as stored in text files is merely a linear string of characters. That is not syntax, per se; it's the parser's job to determine the (ostensibly tree-like) syntactical structure according to a collection of syntactic rules. The first result of parsing is the concrete syntax tree (also called a parse tree) deduced from the source text. This can be, and usually is, transformed into an abstract syntax tree (AST)[1] prior to semantic analysis. This is the case for every language; it's only so apparent in the Lisps because they belong to the tiny set of languages that explicitly specify their AST[2] -- for most languages, each translator has its own, possibly unique notion of abstract syntax for the source language.

So, back to your original question: what kinds of syntax exist beside "surface" syntax? Abstract syntax is one kind. You can think of the notion of compilers' internal representation (IR) as another kind, too.

[1]: Even in parsers that produce the AST directly, the concrete syntax tree is still constructed, but it's implicit in the process's state rather than explicit in a data structure.

[2]: See, for example, "The Revised^n Report on the Algorithmic Language Scheme" (RnRS). It specifies not only the concrete syntax (called the external representation in the Report), but, to the extent to which it's reasonable to do so, it specifies the abstract syntax as well (the internal representation per the Report, or even just data in some cases). It's this specification of the abstract syntax that enables the syntactic macros for which the Lisps are so renowned -- syntactic macros are essentially an API for manipulating the AST at compile-time.