When compiling, the AST sits in the middle of the whole process. The first part deals with parsing your language into the AST; the second part deals with transforming the AST to the target language (including possible optimizations of the tree).
Lisp is effectively the raw AST, which is where its power comes from. The use of parentheses is the cleanest way to directly represent a raw tree that you can interact with. This means that you can use Lisp to cut the language design process in half from either direction:
On the one hand, if you are worried about parsing your language, then you can simply transform it into lisp, which is virtually identical to creating the AST, and then you can use a lisp interpreter/compiler for the second half.
On the other hand, if you're interested in writing the interpreter/compiler for a language and don't want to stress about parsing, you can write it for lisp and not have to worry about parsing a complex language. If you follow the Norvig code you can extend that example to a language devoid of parenthesis by writing the parser for it.
Even if you want to do both, it's not a terrible idea to prototype both halves using Lisp and then do the minimal work to pull out the Lisp code and replace it with some other implementation of the AST.
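To give a sense of why the parsing half is nearly trivial for Lisp, here's a minimal s-expression tokenizer and reader in Python, roughly in the spirit of Norvig's lis.py (names and details are illustrative, not his exact code):

```python
def tokenize(src):
    # Pad parens with spaces so split() separates them into tokens.
    return src.replace('(', ' ( ').replace(')', ' ) ').split()

def read(tokens):
    # Recursively build nested lists (the AST) from the token stream.
    token = tokens.pop(0)
    if token == '(':
        node = []
        while tokens[0] != ')':
            node.append(read(tokens))
        tokens.pop(0)  # discard the closing ')'
        return node
    try:
        return int(token)
    except ValueError:
        return token  # anything non-numeric is a symbol

ast = read(tokenize("(* n (fact (- n 1)))"))
# ast == ['*', 'n', ['fact', ['-', 'n', 1]]]
```

The nested Python lists that come out *are* the AST; there is no separate tree-building phase, which is the point the comment above is making.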
Given the current state of machine learning, could an AST be generated to accommodate a given language?
And more importantly, can the form of an AST cause a difference in performance of the overall program?
No machine learning required. It is certainly possible to automatically generate an AST based on a grammar.
The more interesting question is whether you can automatically generate a grammar based on samples of the language - https://en.wikipedia.org/wiki/Grammar_induction - although it is unclear why you'd bother trying for a programming language (as opposed to natural language).
> And more importantly, can the form of an AST cause a difference in performance of the overall program?
However, if you are doing tree-based interpretation: Yes, differences in choice of AST can make a significant difference to runtime performance.
Without making this post long I could "tempt" or encourage you by suggesting that if you take the outermost paren pair off an s-expression, you sortof have a language of function application:
defun fact n = if (< n 2) 1 (* n (fact (- n 1)))
I think there is an SRFI for Scheme that proposes something like that plus offside rule to get rid of parens.
You could get rid of more parens by adding operator precedence to the syntax (and that offside rule would help too), but now you're making the parsing interesting instead of making the execution interesting.
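As a sketch of what adding operator precedence costs you, here's a toy precedence-climbing parser in Python that turns infix input into s-expression-style tuples (the precedence table and tokenization are made up for illustration, not from any particular language):

```python
import re

# Hypothetical precedence levels; higher binds tighter.
PREC = {'+': 1, '-': 1, '*': 2, '/': 2}

def parse(tokens, min_prec=1):
    # Consume tokens left to right; pop(0) is fine at toy scale.
    lhs = tokens.pop(0)
    lhs = int(lhs) if lhs.isdigit() else lhs
    while tokens and tokens[0] in PREC and PREC[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Parse the right side at a higher minimum precedence,
        # which makes same-precedence operators left-associative.
        rhs = parse(tokens, PREC[op] + 1)
        lhs = (op, lhs, rhs)  # build the s-expression / AST node
    return lhs

ast = parse(re.findall(r'\d+|[-+*/]', '1 + 2 * 3 - 4'))
# ast == ('-', ('+', 1, ('*', 2, 3)), 4)
```

Even this small amount of machinery is more "interesting parsing" than the Lisp reader needs, which is the trade-off being described.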
Not even the smarter side of syntax (abstract syntax), but rather character-and-token level syntactic sugaring.
Recent GCC commit:
GCC has some Lisp-like data structures inside, and those give the commit comment a language too, making it easy to comprehend what is going on with the abstract syntax.
Just swap [] with (), then they're a lot easier to hit. It's weird that parentheses are shifted but square brackets are not: the former are a lot more common in English text than the latter, as well as in many programming languages.
> Is there something inherently easier to implementing a functional language instead of something more imperative?
Other answers have focused on the parsing of the language (that is, the production of an AST) which is much easier to cover instructionally for a Lisp because it's basically the AST already.
To my mind the semantics of imperative languages is the real issue. In particular, when defining and/or implementing the semantics for an imperative language, eventually store (memory) management comes up and everything gets much more complicated instantly. And to go along with the store, there are often more forms for which you need to define the semantics (statements, expressions, classes, etc.).
In contrast functional languages can frequently be implemented using term rewriting which can deal directly with the AST itself.
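As a sketch of what term rewriting over the AST can look like, here's a toy bottom-up rewriter in Python with a made-up rule set (the rules are illustrative, not from the comment above):

```python
# Terms are nested tuples: ('op', arg1, arg2, ...). Rules reduce a
# term to a simpler term until no rule applies.

def rewrite(term):
    """Rewrite a term bottom-up until it is fully reduced."""
    if not isinstance(term, tuple):
        return term  # atoms (ints, symbols) are already reduced
    op, *args = term
    args = [rewrite(a) for a in args]
    if op == '+' and all(isinstance(a, int) for a in args):
        return sum(args)
    if op == '*' and all(isinstance(a, int) for a in args):
        out = 1
        for a in args:
            out *= a
        return out
    if op == 'if':
        cond, then, alt = args
        return then if cond else alt
    return (op, *args)  # no rule applies; leave the term as-is

result = rewrite(('+', 1, ('*', 2, ('if', True, 3, 4))))
# result == 7
```

Note that the rewriter works directly on the tree; there is no separate store to model, which is the contrast being drawn with imperative languages.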
More broadly, this is why I wish students were required to implement an interpreter of an imperative language. The act of debugging programs becomes more difficult for the same reason the semantics is more difficult to define and implement: it's more complex and there are more nuts and bolts to consider.
Why? Isn't C-like "call malloc" simpler than the GC which is required for most (all?) functional languages?
For most more complex languages you'd use something like lex and yacc, which would add quite a bit more overhead to the tutorial.
The danger of these tutorials is the whole "grow your own parser" method doesn't scale to more complex languages (you need better tooling for that and/or a lot firmer grasp of the concepts) and I would be concerned that people would try to extend this tutorial and go down a rabbit hole.
Edit: I know this from experience because when I was in high school I wrote a compiler this way (let's call it the naive way). And then in college I learned how to do it with BNF and code generators and the difference in maintainability and code complexity between the two is ridiculous. I would never do it the old way again, even for a toy language.
Very few production compilers use lex/yacc type parsers, because error handling and recovery is painful. Most end up with hand-written recursive-descent parsers.
I mean let's say you evaluate
> (+ 2 3)
So, that's a reason why it is said. (edit: also, if the course implements scheme IN scheme, then you don't need a parser because the host scheme can do it for you.)
Yeah it is obviously a lot easier to parse than traditional, imperative languages like C. I just do not like the phrase, as it gives this "magic" vibe to Scheme. I mean its syntax is beautiful and simple, but there is definitely no magic there.
It is true for toy compilers of language subsets, and those are often done. And few people ever explore real scheme compiler internals.
The readers in real scheme compilers are generally a large pile of ugly code. A decade or so ago, I recall Bigloo's was the cleanest I'd found, with a regular expression based parser framework. I can't quickly find the files to link. But after wading through many other implementations, it was a breath of fresh air. But not simple, and not small.
Here's a quick quote from a Chicken changelog: "2.7 Fix several bugs (expt, integer?, rational?, <=, >=, eqv?, -) found by importing the number tests from Gauche and writing an extensive test for number syntax edge cases. Complete rewrite of number parser (it should fully conform to the R7RS superset of R5RS number syntax now)."
Sure, it needs a "parser" in the sense of a FSA + a stack.
Presumably because there's no real nesting of terms in assembly code? Personally, I find BNF so simple that I'd probably still use it for describing the grammar, but lots of parser generator technology is heavier machinery than required (though a recursive descent parser built from a regular grammar is essentially a DFA if your language/compiler offers tail call elimination).
If there is an ambiguity in the grammar that takes some time to resolve, a recursive descent parser can easily take exponential time. What tail call elimination does is keep you from having a giant call stack, but does not fix the potential (though rare) performance disaster.
I did use a lexer. I wrote an assembler before without using a lexer; using the lexer isn't that much simpler IMO, but it makes it more robust and makes it much easier to systematically catch syntax errors and produce meaningful errors.
Keep in mind it's a personal project; the assembler is written in idiosyncratic, not-the-prettiest Python.
It comes with very few special forms and makes no difference between a variable, function or macro. Everything is same-y.
That's true of Scheme, not Lisp (Lisp has rather more special forms, and definitely makes distinctions between variables, functions and macros), but it is indeed why Scheme is so popular in compiler classes.
Not quite sure what you mean there. If you are talking about the Lexer, it's also rather simple since LISP doesn't need lookahead. Everything is in prefix notation.
Which can be incredibly simple since you only have the persistent global scope and the list local scope, so you only discard data that goes out of scope. There are no references.
Heck, you could probably just write the Garbage Collector in LISP too.
LISP-family languages usually aren't called LISP for being Hash Table Processors. There are only lists.
>Vectors, Object System, Stack Unwinding, Closures
Not needed either, you can macro Vectors, you can macro the Object System, you have lists, not a stack and closures are a basic property of the language last I checked.
LISP is a rather simple language since you can write a macro for almost everything, a minimal implementation only needs a couple keywords. Object Systems are usually implemented via macros.
It is special because after implementing a minimal set, you can program the language itself to provide the rest.
And I'd like you to name another language that is not assembler or original BASIC where you would not have to implement all this stuff while remaining as simple and expressive as LISP.
I don't think you really need any of this complicated 1960s and later stuff. LISP is the second oldest programming language, it's really not complex.
The Reader is the thing which reads the many Lisp data structures: lists, symbols, floats, integers, strings, characters, vectors, arrays, structures, ...
> Which can be incredibly simple since you only have the persistent global scope and the list local scope, so you only discard data that goes out of scope. There are no references
Lisp has references.
> Heck, you could probably just write the Garbage Collector in LISP too.
Basically you can't.
> LISP-family languages usually aren't called LISP for being Hash Table Processors. There are only lists.
All Lisp have hash-tables since decades. Internally the symbol table is a hash-table in most Lisp implementations.
> Not needed either, you can macro Vectors, you can macro the Object System, you have lists, not a stack and closures are a basic property of the language last I checked.
You can't. You don't seem to know what a macro is. If you only have linked lists, you can't make vectors. No macro will help you with that.
> Object Systems are usually implemented via macros.
They aren't. Object systems are a mix of new data structures, functions and some macros. In Common Lisp, the CLOS system is a layering of data structures, functions and only at the top are macros. Parts of CLOS then are defined in itself.
> It is special because after implementing a minimal set, you can program the language itself to provide the rest.
But it is not done that way. You won't implement arithmetic as a macro.
> while remaining as simple and expressive as LISP.
Once you have all that, Lisp is no longer simple.
> LISP is the second oldest programming language, it's really not complex.
Check the recent Scheme standard for some entertainment.
You only need about 9 primitives and can build the rest from there.
You can implement arithmetic as macros and functions, from scratch.
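For what it's worth, the "arithmetic from scratch" claim is usually understood via Church encoding from lambda calculus, sketched here in Python for brevity (purely illustrative; no real Lisp implements numbers this way):

```python
# Church numerals: the number n is "apply f to x, n times".
zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))
add  = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))
mul  = lambda m: lambda n: lambda f: m(n(f))

def to_int(n):
    # Convert a Church numeral back to a native integer for inspection.
    return n(lambda k: k + 1)(0)

two   = succ(succ(zero))
three = succ(two)
assert to_int(add(two)(three)) == 5
assert to_int(mul(two)(three)) == 6
```

Note that this only needs function definition and application as primitives, which is the sense in which a minimal core can "provide the rest".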
>Internally the symbol table is a hash-table in most Lisp implementations.
Point me at a language that won't require a symbol table that is not assembler.
Really do it.
Point me at a language that is simpler to implement than LISP and is not assembler. (and maybe not brainfuck)
>If you only have linked lists, you can't make vectors.
If you have a turing complete language, you can do everything. I don't think you know what a turing-complete language is.
>In Common Lisp, the CLOS system is a layering of data structures, functions and only at the top are macros. Parts of CLOS then are defined in itself.
>Lisp is no longer simple.
Compared to most other popular languages, it is.
You can't practically. No actual Lisp is developed that way. It's mostly without any practical use. It's a theoretical idea from lambda calculus. It is not an implementation technique for Lisp. It's also not special to Lisp, any Functional Programming language can do that. Note also that one does not need macros for that.
In a real Lisp implementation, numbers are not implemented based on functions. Any real-world Lisp implementation will use the facilities of a processor to represent numbers (integers, floats, ...) and to do computation with them. Thus numbers are a separate data type with their own operations, and both the data type and the operations will need to be implemented - the basic operations you mentioned won't help. So for a real Lisp, you need to implement numbers using the machine facilities, and you need to understand those facilities. For example, floating point operations in Lisp are no simpler than in C, C++, Ada, Java, or any other language which provides floating point operations.
Any language which does not provide floating point operations (like an imaginary primitive Lisp) is simpler than a language with floating point operations (like most existing Lisp languages).
So, an imaginary primitive Lisp is simple. But that has no practical implications, since actual Lisp programming languages are not primitive, and thus not simple. See the definitions of popular Lisp dialects.
> You can implement arithmetic as macros and functions, from scratch.
You don't need macros. And no, nobody does that, because it is not practical.
> Point me at a language that won't require a symbol table that is not assembler.
> Point me at a language that is simpler to implement than LISP and is not assembler. (and maybe not brainfuck)
Forth. Basic. Logo.
> If you have a turing complete language, you can do everything. I don't think you know what a turing-complete language is.
I know what that is. I think you don't know what algorithmic complexity is and what the difference between a linked list and a vector is.
> How nitpicky.
No, it shows you that macros are neither necessary nor sufficient to implement an object system.
> Compared to most other popular languages, it is.
Basic is simpler. Forth is simpler. Logo is simpler.
I would assume that your class targeted Lisp because SICP uses it. While that is a great resource, it is light on details when it comes to the end game: runtime. If you are developing a language today and you aren't considering runtime, then you are just writing macros.
Lexing and parsing are not trivial tasks, but powerful tooling already exists for these. Compiling means knowing how to translate your language into something that executes, and I would say "must execute well". I'm in favor of hobbies (fun) and exploration (growth), but if you are "perfecting" steps 1&2 and haven't yet considered step 3, then maybe it is time to support another language community rather than develop something on your own.
With steps 1 and 2, you're just translating between different program representations. You could hand-write your programs as syntax trees directly if you really wanted to, and you wouldn't even need those steps. It would be a bit more work, but not insanely so; Lisp syntax looks a lot like a syntax tree already, and some people are perfectly happy writing Lisp.
Step 3 is where you go from 'code' in one form or another (usually AST), to an actual process that's executing on the machine, either through interpretation or compilation. It's where most of the magic happens. You have to turn your AST into a series of machine instructions that will i) do what the code says ii) in an efficient manner.
Sometimes achieving i) at all is hard; it's not immediately clear how to implement certain high level operations using machine instructions. Sometimes there are obvious ways to do it, but there are also much less obvious ways that are much faster (more efficient). Finding those ways is 'optimization'.
Unlike steps 1 and 2, which can be done by hand relatively easily, trying to execute your program by hand, for any non-trivial program, would be an absolute nightmare.
(The exception is very low level languages, like assembly, where your language maps almost 1-to-1 with machine instructions. In that case step 3 is trivial. But if you're designing a language today it's probably not an assembly language.)
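To make step 3 a bit more concrete, here's a toy sketch in Python (with made-up instruction names) that compiles an arithmetic AST into stack-machine "bytecode" and runs it; real code generation to machine instructions adds register allocation, calling conventions, and far more:

```python
def compile_expr(node, code):
    """Flatten a nested-tuple AST into postfix stack-machine instructions."""
    if isinstance(node, int):
        code.append(('PUSH', node))
        return code
    op, lhs, rhs = node
    compile_expr(lhs, code)   # operands are emitted first...
    compile_expr(rhs, code)
    code.append(('ADD' if op == '+' else 'MUL', None))  # ...then the operator
    return code

def run(code):
    # A trivial stack-machine interpreter for the bytecode above.
    stack = []
    for instr, arg in code:
        if instr == 'PUSH':
            stack.append(arg)
        elif instr == 'ADD':
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif instr == 'MUL':
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack.pop()

bytecode = compile_expr(('+', 2, ('*', 3, 4)), [])
assert run(bytecode) == 14
```

Even in this tiny form, the shape of the problem is visible: the tree is linearized into instructions, and a machine (here simulated) executes them.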
If you are targeting a lower level language like assembly language then I think an imperative language is easier to do though. You need to worry a lot about memory layouts, register allocation and calling conventions and in this paradigm a simple imperative language will be an easier fit.
TECS provides a skeleton of a recursive descent parser, with about a dozen different methods to write. The Java-like language has many different kinds of expressions and statements, and this is reflected in the complexity of the parser.
At first the output is an XML tree. Code generation gets an entire chapter. Some of the chapters are straightforward, less than 6 hours of work. But the harder chapters have taken me around 15 hours each. YMMV, but this gives one some idea of the differences between functional languages and procedural languages.
Using Lisp syntax lets the class have more time for semantics.
Besides, when writing a compiler, it's always great if you can ignore the possibility of some value changing. It makes the implementation much simpler, even more so if you go into optimizations. And while you can't completely ignore value mutation in Lisp, it is always explicit, which makes an implementer's life easier.
x := y + z
f(y + z)
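Presumably the point of the two lines above: if `y` and `z` cannot have been mutated between the two expressions, the compiler can compute `y + z` once and reuse it (common subexpression elimination). A toy sketch over tuple ASTs, assuming all expressions are pure (a naive assumption I'm adding for illustration, not from the original comment):

```python
# Naive common-subexpression elimination: hoist every pure
# subexpression into a temp, reusing temps for repeated terms.
def cse(stmts):
    seen = {}   # maps an expression tuple to the temp holding its value
    out = []
    def visit(node):
        if not isinstance(node, tuple):
            return node
        node = tuple(visit(c) for c in node)
        if node in seen:
            return seen[node]       # reuse the already-computed temp
        tmp = f"t{len(seen)}"
        seen[node] = tmp
        out.append(('assign', tmp, node))
        return tmp
    for target, expr in stmts:
        out.append(('assign', target, visit(expr)))
    return out

prog = [('x', ('+', 'y', 'z')),
        ('w', ('call', 'f', ('+', 'y', 'z')))]
optimized = cse(prog)
# 'y + z' is now computed once, as t0, and reused in the call.
```

With unrestricted mutation between the two statements, this transformation would be unsound, which is exactly why explicit mutation makes the implementer's life easier.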
Though Wirth-ian languages, like Pascal, Modula-2 and Oberon, are good examples of languages that are very simple to parse despite not being functional. Though they are slightly more complex than some functional languages, you can prototype a hand-written parser for them in a day if you've got a bit of experience.
E.g. the Oberon-07 grammar takes up 1.5 pages of BNF of which the first third of a page or so represent the lexer symbols (you don't need a separate lexer module for most of these languages).
Pages 16 and 17 of the language report (which incidentally only spends 17 pages to describe the entire language...) https://www.inf.ethz.ch/personal/wirth/Oberon/Oberon07.Repor...
Jeremy Ashkenas created CoffeeScript after reading that book. I can't recommend it enough for someone going down this road.
Are there any resources on the code generation side of things? Even getting from a high level language down to SSA seems like a big leap (nevermind going from SSA to assembly).
Cooper and Torczon's "Engineering a Compiler" is mostly focused on the back end. I liked it a lot.
This fascination with "how" to build something, without considering "what" and "why", seems to be an issue that gets repeated time and time again.
Lexing/parsing is important of course, but it's been done to death, and for most of the simple types of languages people tend to use for teaching, it's a simple problem.
Code generation, on the other hand, is still pretty poorly covered, in my opinion, and something people tend to struggle with a lot more, even if you resort to tools like LLVM (and that's fine if that's what you want, but I'd argue you should try a lower-level approach at least once to understand some of the challenges).
Exactly. I'm in pursuit of building one. My first questions? How to make a REPL and a debugger, how to implement pattern matching and type checking, whether an interpreter can be fast enough, whether it's possible to avoid building a whole VM for it, etc.
There exist a LOT of practical questions that are left as "an exercise for the reader" and that need clarification.
- Do Lisp/Forth if you want the most minimal parsing
- Use a parser generator if you don't care much about the quality of the result
- Use top-down parsing by hand if you want some control
ANY OTHER OPTION is non-optimal and will divert you from the task, EXCEPT if you want to do something novel.
If we set the parsing stuff aside, we can put more effort into the harder and more rewarding aspects.
Though in retrospect I think it should have been two separate series (and I need to write a few more parts; I'm very close to having it compile itself as of last night).
I think it'd have been better to evolve the initial simple language into an equivalently simple language to parse, and kept the long slog towards compiling Ruby as a separate thing.
Especially as that has complicated things enough to be in severe need of various refactoring (which I'm avoiding until it will compile itself, at which point I'll start cleaning it up while insisting on keeping it able to compile itself..).
The parser itself started out conceptually quite clean, for example, but the Ruby grammar is horribly complex, and I keep having to add exceptions that have turned it quite convoluted. I don't doubt it can be simplified with some effort once I have the full picture, but it's not great for teaching.
> You are only allowed to store the AST and variable names found by parser. The input to the lexer is not allowed to be persistent.
To do this, your lexer would need to be in some sense invertible, capable of both producing a source-code-representation of an AST given some naming metadata, as well as converting that source-code-representation back to names+AST.
I think that would make the lexing + parsing task worthy of an 80% article.
The Red Dragon book has the same issue, except they spend 800 pages on basic parsing and only end up making it sound terrifying and mathy. Definitely recommend against reading it.
Maybe a different place to start would be making your own bytecode?
(Beavis and Butthead laugh)