Writing Your Own Programming Language (github.com)
202 points by ingve on Nov 10, 2016 | 91 comments



Peter Norvig's "(How to Write a (Lisp) Interpreter (in Python))" (http://norvig.com/lispy.html) covers a superset of this material and makes more sense, and actually has a portable implementation you can run yourself. If you're going to do this, use Norvig as a guide.


You don't need to knock the project to praise Norvig. Thank you for the link though.


When I think about writing my own language, I think of something with as few parentheses as possible, yet all the examples use Lisp.


At a high level, compilers just translate one programming language to another. A key part of this translation is the Abstract Syntax Tree (AST), which represents the program transformed into a tree of computation independent of the syntax of the language. Once you have the AST, you can then step through it and translate the tree to another language like Java bytecode, ASM, CIL, etc.

When compiling, the AST sits in the middle of the whole process: the first part deals with parsing your language into the AST; the second part deals with transforming the AST into the target language (including possible optimizations of the tree).

Lisp is effectively the raw AST, which is where its power comes from. The use of parentheses is the cleanest way to directly represent a raw tree that you can interact with. This means that you can use Lisp to cut the language design process in half from either direction:

On the one hand, if you are worried about parsing your language, then you can simply transform it into lisp, which is virtually identical to creating the AST, and then you can use a lisp interpreter/compiler for the second half.

On the other hand, if you're interested in writing the interpreter/compiler for a language and don't want to stress about parsing, you can write it for Lisp and not have to worry about parsing a complex language. If you follow the Norvig code you can extend that example to a language devoid of parentheses by writing the parser for it.

Even if you want to do both it's not a terrible idea to prototype both halves using Lisp and then perform the minimal work to pull out the Lisp code and replace it with some other implementation of the AST.
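
To make the second half concrete, here is a minimal sketch (in Python; the nested-list AST shape and the infix target are just assumptions for illustration) of walking such a tree and emitting another language:

  # Hypothetical sketch: translate a Lisp-like AST (nested Python lists)
  # into an infix expression string.
  OPS = {'+', '-', '*', '/', '<', '>'}

  def emit(node):
      if isinstance(node, list):
          head, *args = node
          if head in OPS:
              # ['+', 1, 2, 3] style nodes become "(1 + 2 + 3)"
              return '(' + (' %s ' % head).join(emit(a) for a in args) + ')'
          # ['f', 'x', 'y'] style nodes become "f(x, y)"
          return '%s(%s)' % (head, ', '.join(emit(a) for a in args))
      return str(node)

  print(emit(['+', 1, ['*', 'n', ['fact', ['-', 'n', 1]]]]))
  # -> (1 + (n * fact((n - 1))))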


Surely a left-field type of question but here goes.

Given the current state of machine learning, could an AST be generated to accommodate a given language?

And more importantly, can the form of an AST cause a difference in performance of the overall program?


> Given the current state of machine learning, could an AST be generated to accommodate a given language?

No machine learning required. It is certainly possible to automatically generate an AST based on a grammar.

The more interesting question is whether you can automatically generate a grammar based on samples of the language - https://en.wikipedia.org/wiki/Grammar_induction - although it is unclear why you'd bother trying for a programming language (as opposed to natural language).

> And more importantly, can the form of an AST cause a difference in performance of the overall program?

If you are compiling to machine code, or to bytecode for some VM, or transpiling to another source language (e.g. JavaScript): not really. The form of an AST could make a difference to compilation speed or the ease of implementing later stages of the compiler, but I can't see how it would directly make any difference to the performance of the compiled program.

However, if you are doing tree-based interpretation: Yes, differences in choice of AST can make a significant difference to runtime performance.


The reason for LISP or SCHEME as the tutorial language is that parsing is easy and doesn't call for Flex/Bison. The semantics are also at a level where, if you don't strain yourself on edge cases, you can do a decent interpreter or compiler pretty quickly.

Without making this post long I could "tempt" or encourage you by suggesting that if you take the outermost paren pair off an s-expression, you sort of have a language of function application:

  defun fact n = if (< n 2) 1 (* n (fact (- n 1)))

I think there is an SRFI for Scheme that proposes something like that plus an offside rule to get rid of parens.

You could get rid of more parens by adding operator precedence to the syntax (and that offside rule would help too), but now you're making the parsing interesting instead of making the execution interesting.
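
(If you do go down the precedence route, precedence climbing is the usual trick. A rough Python sketch, assuming a pre-tokenized input and a made-up precedence table:)

  # Sketch of precedence climbing over a pre-tokenized expression.
  # Produces an s-expression-style nested list as the AST.
  PREC = {'+': 1, '-': 1, '*': 2, '/': 2, '<': 0}

  def parse_expr(tokens, min_prec=0):
      node = tokens.pop(0)                 # atom (no parens/unary here)
      while tokens and tokens[0] in PREC and PREC[tokens[0]] >= min_prec:
          op = tokens.pop(0)
          # operators are left-associative: parse the right side at the
          # op's precedence + 1 so equal-precedence ops don't nest rightwards
          rhs = parse_expr(tokens, PREC[op] + 1)
          node = [op, node, rhs]
      return node

  print(parse_expr(['n', '*', '3', '+', '1']))   # ['+', ['*', 'n', '3'], '1']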


And the reason you don't want to drag in Flex and Yacc/Bison is that then the bulk of the course will revolve around the banal issues surrounding syntax, rather than semantics: high in heat, low in light.

Not even the smarter side of syntax (abstract syntax), but rather character-and-token level syntactic sugaring.


A recursive descent parser for a prefix language is not significantly complicated by using C style function calls.


You can make Lisp use something other than parentheses. An early example was the Dylan language. You can also use Lisp to power the compiler, as both intermediate and executable language, like Julia did. Outwardly, it's a powerful, pleasant language many people would like. On the inside, its power came from the fact that the syntax was immediately translated to and manipulated in femtolisp. My favorite use of Lisp to bootstrap a language is this presentation [which could also be done without parentheses]:

https://news.ycombinator.com/item?id=9699065


Re: early example, CGOL was two decades before Dylan.

https://en.wikipedia.org/wiki/CGOL


It's a little weird in syntax but definitely an improvement in the side-by-side examples. Thanks for telling me about it.


Lisp representations can be used internally even if you have such a syntax.

Recent GCC commit:

https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=28251d450e034f...

GCC has some Lisp-like data structures inside, and those also give the commit message a language for describing what is going on with the abstract syntax, making it easy to comprehend.


> When I think about writing my own language, I think of something with as few parentheses as possible

Just swap [] with (), then they're a lot easier to hit. It's weird that parentheses are shifted but square brackets are not: the former are a lot more common in English text than the latter, as well as in many programming languages.


Have you considered that the two might be correlated?


Could you elaborate?


Simple lisps have very little, very regular syntax making them popular choices for "write your first interpreter" tutorials.


Thanks for the link! One nice thing about the original post is that it runs inside of a Swift playground, which, for those with Xcode (or an iPad) can be a fast way of learning and experimenting. I could see someone starting with this project and then easily expanding their knowledge by modifying it using resources like the one you linked to.


Honest question: why do tutorials on this topic always seem to use functional languages/syntax as examples? Our compilers class at MST had us re-implement a Lisp compiler, but didn't touch on why we used Lisp specifically (other than the professor liking it; we were a largely C++ school). Do they think functional languages are simpler / less complex / easier to understand? Is there something inherently easier about implementing a functional language instead of something more imperative?


I've built interpreters for both a subset of Java and a full Lisp. Here's my take.

> Is there something inherently easier about implementing a functional language instead of something more imperative?

Other answers have focused on the parsing of the language (that is, the production of an AST) which is much easier to cover instructionally for a Lisp because it's basically the AST already.

To my mind the semantics of imperative languages is the real issue. In particular when defining and/or implementing the semantics for an imperative language, eventually store (memory) management comes up and everything gets much more complicated instantly.

[edit] And to go along with the store there are often more forms for which you need to define the semantics (statements, expressions, classes, etc).

In contrast functional languages can frequently be implemented using term rewriting which can deal directly with the AST itself.

More broadly, this is why I wish students were required to implement an interpreter of an imperative language. The act of debugging programs becomes more difficult for the same reason the semantics is more difficult to define and implement: it's more complex and there are more nuts and bolts to consider.


I agree, but would add I wish students had to build compilers/interpreters for 3 languages: Forth, Lisp and something like C. As a student we went right into C: BNFs, lex, yacc, emitting machine code, optimizations, etc. Very interesting but the beauty of simplicity is lost in it all. Forth at first is baffling - there is no syntax! It's amazing how little you need to get going. Lisp, to just get right to the parse tree. And you already have intuition on the interpreter part. And finally C or Java for all the upfront complexity required.


> In particular when defining and/or implementing the semantics for an imperative language, eventually store (memory) management comes up and everything gets much more complicated instantly.

Why? Isn't C-like "call malloc" simpler than the GC which is required for most (all?) functional languages?


It's not really being functional that calls for GC -- if you have data structures (whether they're closures, records, or whatever) escaping up from the scope in which they were created, you have a tricky lifetime question to deal with. In both the imperative and functional cases, you could require explicit allocation and deallocation (like malloc/free), and you could make it automatic (garbage collection). Functional programming tends to use GC because passing/returning closures is a common thing to do. These days, I would say imperative programmers also tend to use GC.
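
A tiny illustrative example (Python, purely hypothetical) of a value escaping its creating scope: the local count below has to outlive the call that created it, so something (a GC, reference counting, or the programmer) must manage its lifetime:

  def make_counter():
      count = 0                  # local to this call...
      def bump():
          nonlocal count
          count += 1
          return count
      return bump                # ...but it escapes inside the closure

  c = make_counter()
  print(c(), c(), c())           # 1 2 3: 'count' is still alive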


Because languages like this map so closely to the AST (as someone else pointed out) you don't need to deal with complexities such as lookahead, regular expressions, and formal syntax definitions (BNFs). I suppose this makes it a good starting point.

For most more complex languages you'd use something like lex and yacc, which would add quite a bit more overhead to the tutorial.

The danger of these tutorials is the whole "grow your own parser" method doesn't scale to more complex languages (you need better tooling for that and/or a much firmer grasp of the concepts) and I would be concerned that people would try to extend this tutorial and go down a rabbit hole.

Edit: I know this from experience because when I was in high school I wrote a compiler this way (let's call it the naive way). And then in college I learned how to do it with BNF and code generators and the difference in maintainability and code complexity between the two is ridiculous. I would never do it the old way again, even for a toy language.


> For most more complex languages you'd use something like lex and yacc, which would add quite a bit more overhead to the tutorial.

Very few production compilers use lex/yacc type parsers, because error handling and recovery is painful. Most end up with hand-written recursive-descent parsers.


But most compilers have a parser. Scheme doesn't need one ;)

http://calculist.org/blog/2012/04/17/homoiconicity-isnt-the-...


Why is it so often stated that Scheme doesn't need a parser? I mean the AST is simpler than in other languages, but at the end of the day, you have to parse SEXPs, don't you?

I mean let's say you evaluate

  > (+ 2 3)
You have to parse the string "(+ 2 3)" and will then evaluate it to 5?


... reading sexp's is technically parsing but it is so far removed from something like a C parser that it's really not the same animal. I mean, you can write a function to build a list from a sexp in a few minutes, while even a basic hand-written C parser will take days/weeks/months depending on your familiarity.

So, that's a reason why it is said. (edit: also, if the course implements scheme IN scheme, then you don't need a parser because the host scheme can do it for you.)
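
For what it's worth, such a reader really is only a few lines. A sketch along the lines of Norvig's lispy (function names here are mine, not his):

  def tokenize(src):
      # "(+ 2 3)" -> ['(', '+', '2', '3', ')']
      return src.replace('(', ' ( ').replace(')', ' ) ').split()

  def read(tokens):
      tok = tokens.pop(0)
      if tok == '(':
          lst = []
          while tokens[0] != ')':
              lst.append(read(tokens))
          tokens.pop(0)          # drop the closing ')'
          return lst
      return tok                 # atom (left as a string here)

  print(read(tokenize("(+ 2 (* 3 4))")))   # ['+', '2', ['*', '3', '4']]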


> ... reading sexp's is technically parsing but it is so far removed from something like a C parser that it's really not the same animal.

Yeah it is obviously a lot easier to parse than traditional, imperative languages like C. I just do not like the phrase, as it gives this "magic" vibe to Scheme. I mean its syntax is beautiful and simple, but there is definitely no magic there.


McCarthy's basic evaluator for LISP fits in a page or two in a high-level language (esp. LISP). That's not magic, but good design gets plenty close. ;)


"It's easy to parse scheme" is one of those pretty stories that is often repeated but untrue. A similar one is "it's easy to implement scheme in scheme".

It is true for toy compilers of language subsets, and those are often done. And few people ever explore real scheme compiler internals.

The readers in real scheme compilers are generally a large pile of ugly code. A decade or so ago, I recall Bigloo's was the cleanest I'd found, with a regular expression based parser framework. I can't quickly find the files to link. But after wading through many other implementations, it was a breath of fresh air. But not simple, and not small.

Here's a quick quote from a Chicken changelog: "2.7 Fix several bugs (expt, integer?, rational?, <=, >=, eqv?, -) found by importing the number tests from Gauche and writing an extensive test for number syntax edge cases. Complete rewrite of number parser (it should fully conform to the R7RS superset of R5RS number syntax now)."


We mean it doesn't need a parser in the traditional sense of an LR or LL grammar (or more complicated) with a generated table, or even a hand-rolled parser with lookahead.

Sure, it needs a "parser" in the sense of an FSA + a stack.


The read function is the parser.


I was recently reading http://book.realworldhaskell.org/read/using-parsec.html, which made me think about this very thing. I never liked working with parser generators because they felt so unwieldy, but I wondered if a system like parsec might be extensible enough to simplify the entire process greatly.


Yes, libraries/frameworks like that are certainly easier to work with than generators.


I'm writing an assembler for a virtual 16 bit CPU I built, and I hesitated on using BNF. It seemed overkill so I ended up hand-rolling a parser; for an assembler it's straightforward enough, but I can easily see how it could get ridiculous to implement a compiler this way.


> I hesitated on using BNF. It seemed overkill

Presumably because there's no real nesting of terms in assembly code? Personally, I find BNF so simple that I'd probably still use it for describing the grammar, but lots of parser generator technology is heavier machinery than required (though a recursive descent parser built from a regular grammar is essentially a DFA if your language/compiler offers tail call elimination).
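
(A rough Python sketch of that idea: each function below plays the role of one automaton state and each call is a transition. Python doesn't actually eliminate tail calls, so this is only illustrative.)

  # States of a tiny recognizer for  letter (letter | digit)*
  def start(s, i=0):
      return i < len(s) and s[i].isalpha() and rest(s, i + 1)

  def rest(s, i):
      if i == len(s):
          return True            # accepting state
      if s[i].isalpha() or s[i].isdigit():
          return rest(s, i + 1)  # self-transition (a tail call)
      return False

  print(start("x42"), start("42x"))   # True False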


Don't you mean an NFA?

If there is an ambiguity in the grammar that takes some time to resolve, a recursive descent parser can easily take exponential time. What tail call elimination does is keep you from having a giant call stack, but does not fix the potential (though rare) performance disaster.


Yes, NFA! DFA only if it's LL(1).


Right, with the syntax I chose there ends up being, IIRC, 6 different forms (<op>, <op> <arg1>, etc.), which can't be nested. Some of those forms share patterns which could be described as a nested grammar, but I just abstracted them in the parser code as functions (i.e. functions to parse patterns that show up in multiple forms).

I did use a lexer. I wrote an assembler before without using a lexer; using the lexer isn't that much simpler IMO, but it makes it more robust and makes it much easier to systematically catch syntax errors and produce meaningful errors.


Is a regular grammar the same as a context-free grammar?


It is not - see https://en.wikipedia.org/wiki/Chomsky_hierarchy for where these terms come from. Context-free grammars are a strict superset of regular grammars and cover the vast majority of programming language constructs.


Is your codebase public? I would like to take a look.


Sure. It's a work in progress, my latest changes are not on Github yet, I'll probably push tonight.

Keep in mind it's a personal project, the assembler is written in not the prettiest idiosyncratic Python.

https://github.com/jbchouinard/sixteen/


I can probably answer the LISP question; the core LISP language is incredibly simple. It has only 8 or so keywords you really need and 5 syntax forms (probably wrong, I don't have the exact numbers in my head).

It comes with very few special forms and makes no distinction between a variable, function or macro. Everything is same-y.


> It comes with very few special forms and makes no distinction between a variable, function or macro. Everything is same-y.

That's true of Scheme, not Lisp (Lisp has rather more special forms, and definitely makes distinctions between variables, functions and macros), but it is indeed why Scheme is so popular in compiler classes.


Scheme is a Lisp. Lisp is nowadays a language family, built upon a common core: both programs and data are nested lists of symbols; the interpreter executes by recursively calling eval and apply functions.


I think you mean lisp-1 (or something). Lisp as a whole is a, erm, genus of languages.


You're right, I always confuse the many forms of LISP to some certain degree...


Yeah, if you ignore the reader, garbage collector, hash table implementation, vectors, object system, stack unwinding, representation of closures, etc, etc. all you're left with is a handful of special forms. Unicode support? Why, practically just a footnote; leave that to the summer students.


>Reader

Not quite sure what you mean there. If you are talking about the Lexer, it's also rather simple since LISP doesn't need lookahead. Everything is in prefix notation.

>Garbage Collector

Which can be incredibly simple since you only have the persistent global scope and the list local scope, so you only discard data that goes out of scope. There are no references

Heck, you could probably just write the Garbage Collector in LISP too.

>Hash Table

LISP-family languages usually aren't called LISP for being Hash Table Processors. There are only lists.

>Vectors, Object System, Stack Unwinding, Closures

Not needed either, you can macro Vectors, you can macro the Object System, you have lists, not a stack and closures are a basic property of the language last I checked.

>Unicode

---------

LISP is a rather simple language since you can write a macro for almost everything; a minimal implementation only needs a couple of keywords. Object Systems are usually implemented via macros.

It is special because after implementing a minimal set, you can program the language itself to provide the rest.

And I'd like you to name another language that is not assembler or original BASIC where you would not have to implement all this stuff while remaining as simple and expressive as LISP.

I don't think you really need any of this complicated 1960s and later stuff. LISP is the second oldest programming language, it's really not complex.


> Not quite sure what you mean there.

The Reader is the thing which reads the many Lisp data structures: lists, symbols, floats, integers, strings, characters, vectors, arrays, structures, ...

> Which can be incredibly simple since you only have the persistent global scope and the list local scope, so you only discard data that goes out of scope. There are no references

Lisp has references.

> Heck, you could probably just write the Garbage Collector in LISP too.

Basically you can't.

> LISP-family languages usually aren't called LISP for being Hash Table Processors. There are only lists.

All Lisps have had hash tables for decades. Internally the symbol table is a hash table in most Lisp implementations.

> Not needed either, you can macro Vectors, you can macro the Object System, you have lists, not a stack and closures are a basic property of the language last I checked.

You can't. You don't seem to know what a macro is. If you only have linked lists, you can't make vectors. No macro will help you with that.

> Object Systems are usually implemented via macros.

They aren't. Object systems are a mix of new data structures, functions and some macros. In Common Lisp, the CLOS system is a layering of data structures, functions and only at the top are macros. Parts of CLOS then are defined in itself.

> It is special because after implementing a minimal set, you can program the language itself to provide the rest.

But it is not done that way. You won't implement arithmetic as a macro.

> while remaining as simple and expressive as LISP.

Once you have all that, Lisp is no longer simple.

> LISP is the second oldest programming language, it's really not complex.

Check the recent Scheme standard for some entertainment.


I think you're completely failing to understand my point.

https://stackoverflow.com/questions/3482389/how-many-primiti...

You only need about 9 primitives and can build the rest from there.

You can implement arithmetic as macros and functions, from scratch.

>Internally the symbol table is a hash-table in most Lisp implementations.

Point me at a language that won't require a symbol table that is not assembler.

Really do it.

Point me at a language that is simpler to implement than LISP and is not assembler. (and maybe not brainfuck)

>If you only have linked lists, you can't make vectors.

If you have a turing complete language, you can do everything. I don't think you know what a turing-complete language is.

>In Common Lisp, the CLOS system is a layering of data structures, functions and only at the top are macros. Parts of CLOS then are defined in itself.

How nitpicky.

>Lisp is no longer simple.

Compared to most other popular languages, it is.


Forth has a minimal set of primitives as well, and an even simpler parser (it's literally "read until next space" and "search a linked list to see if this matches an existing word, otherwise, is this a number?, otherwise, yell")
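
A toy sketch of that outer loop (Python, using a dict instead of Forth's linked-list dictionary; the words defined here are just examples):

  # Toy Forth outer interpreter: split on whitespace, look each word up,
  # otherwise try to read it as a number, otherwise complain.
  stack = []
  words = {
      '+':    lambda: stack.append(stack.pop() + stack.pop()),
      'dup':  lambda: stack.append(stack[-1]),
      '.':    lambda: print(stack.pop()),
  }

  def interpret(src):
      for word in src.split():           # "read until next space"
          if word in words:
              words[word]()
          else:
              try:
                  stack.append(int(word))
              except ValueError:
                  raise SyntaxError('unknown word: ' + word)

  interpret("2 3 + dup + .")             # prints 10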


And, likewise, the Turing completeness of those primitives doesn't mean what some naive new computer scientists think it means. Turing completeness doesn't mean that those primitives constitute a language which can be extended to become any language; it means that for any given Turing machine, those primitives can be used to write a program which computes the same thing as that Turing machine. That program might be, for instance, a small state machine dispatcher accompanied by a big table with some initial contents. VirtualBox can run an entire PC, and that PC can run an OS, under which we have a C++ compiler. That doesn't mean VirtualBox has been extended to be a C++ language implementation.


> You only need about 9 primitives and can build the rest from there.

You can't practically. No actual Lisp is developed that way. It's mostly without any practical use. It's a theoretical idea from lambda calculus. It is not an implementation technique for Lisp. It's also not special to Lisp, any Functional Programming language can do that. Note also that one does not need macros for that.

In a real Lisp implementation, numbers are not implemented based on functions. Any real-world Lisp implementation will use the facilities of a processor to represent numbers (integers, floats, ...) and to do computation with numbers. Thus numbers are a separate data type with their own operations. Both the data type and the operations will need to be implemented - the basic operations you mentioned won't help. Thus for a real Lisp, you will need to implement numbers using the machine facilities, and you will need to understand them. For example, floating point operations are not simpler in Lisp than in C, C++, Ada, Java, or any other language which provides floating point operations.

Any language which does not provide floating point operations (like an imaginary primitive Lisp) is simpler than a language with floating point operations (like most existing Lisp languages).

So, an imaginary primitive Lisp is simple. But that has no practical implications, since actual Lisp programming languages are not primitive, and thus not simple. See the definitions of popular Lisp dialects.

> You can implement arithmetic as macros and functions, from scratch.

You don't need macros. And no, nobody does that, because it is not practical.

> Point me at a language that won't require a symbol table that is not assembler.

Why?

> Point me at a language that is simpler to implement than LISP and is not assembler. (and maybe not brainfuck)

Forth. Basic. Logo.

> If you have a turing complete language, you can do everything. I don't think you know what a turing-complete language is.

I know what that is. I think you don't know what algorithmic complexity is and what the difference between a linked list and a vector is.

> How nitpicky.

No, it shows you that macros are neither necessary nor sufficient to implement an object system.

> Compared to most other popular languages, it is.

Basic is simpler. Forth is simpler. Logo is simpler.


Lisp, or any prefix or postfix language, is relatively easy to parse and execute (in the case of an interpreter) or translate (in the case of a compiler/translator) because its syntactic structure mirrors the abstract syntax tree (AST) that the semantic analysis phase (i.e. the last of the 3 phases identified in the post) will traverse as it either executes (interpreter) or translates (compiler). In other words, the parse tree is nearly identical to the AST. That isn't true of Algol-like languages, and so the syntactic analysis and semantic analysis phases of a compiler/translator of Algol-like languages is more complex. That's why Lisp, or really any prefix or postfix language, is a good first language to implement, as opposed to something like C.


I think you're asking the right questions.

I would assume that your class targeted Lisp because SICP uses it. While that is a great resource, it is light on details when it comes to the end game: runtime. If you are developing a language today and you aren't considering runtime, then you are just writing macros.

Lexing and parsing are not trivial tasks, but powerful tooling already exists for these. Compiling means knowing how to translate your language into something that executes, and I would say "must execute well". I'm in favor of hobbies (fun) and exploration (growth), but if you are "perfecting" steps 1&2 and haven't yet considered step 3, then maybe it is time to support another language community rather than develop something on your own.


I'm not sure what you mean by "runtime" here and I'm having a little trouble seeing what you're getting at.


Runtime means program execution, basically.

With steps 1 and 2, you're just translating between different program representations. You could hand-write your programs as syntax trees directly if you really wanted to, and you wouldn't even need those steps. It would be a bit more work, but not insanely so; Lisp syntax looks a lot like a syntax tree already, and some people are perfectly happy writing Lisp.

Step 3 is where you go from 'code' in one form or another (usually AST), to an actual process that's executing on the machine, either through interpretation or compilation. It's where most of the magic happens. You have to turn your AST into a series of machine instructions that will i) do what the code says ii) in an efficient manner.

Sometimes achieving i) at all is hard; it's not immediately clear how to implement certain high level operations using machine instructions. Sometimes there are obvious ways to do it, but there are also much less obvious ways that are much faster (more efficient). Finding those ways is 'optimization'.

Unlike steps 1 and 2, which can be done by hand relatively easily, trying to execute your program by hand, for any non-trivial program, would be an absolute nightmare.

(The exception is very low level languages, like assembly, where your language maps almost 1-to-1 with machine instructions. In that case step 3 is trivial. But if you're designing a language today it's probably not an assembly language.)
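
(As a toy illustration of step 3, and assuming the nested-list AST shape from upthread, compiling an expression to instructions for a made-up stack machine might look like this:)

  # Compile ['+', 1, ['*', 2, 3]] to instructions for a toy stack machine.
  def compile_expr(node, code):
      if isinstance(node, list):
          op, a, b = node
          compile_expr(a, code)
          compile_expr(b, code)
          code.append(('OP', op))       # pops two values, pushes the result
      else:
          code.append(('PUSH', node))
      return code

  print(compile_expr(['+', 1, ['*', 2, 3]], []))
  # [('PUSH', 1), ('PUSH', 2), ('PUSH', 3), ('OP', '*'), ('OP', '+')]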


It depends on how you are writing your compiler. If you are coding in a high level language, then writing an AST interpreter for a functional language is super simple. You just need to mix first-class functions and an unsurprising expression evaluator and you end up with something really expressive.

If you are targeting a lower level language like assembly language then I think an imperative language is easier to do though. You need to worry a lot about memory layouts, register allocation and calling conventions and in this paradigm a simple imperative language will be an easier fit.
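
A minimal sketch of that first approach (Python; the AST shape and special forms are assumptions, not anything from the article):

  # Minimal tree-walking evaluator: closures in the host language
  # represent closures in the interpreted language.
  import operator

  GLOBAL = {'+': operator.add, '*': operator.mul, '<': operator.lt}

  def evaluate(node, env):
      if isinstance(node, str):               # variable reference
          return env[node]
      if not isinstance(node, list):           # literal
          return node
      head = node[0]
      if head == 'if':                         # ['if', test, then, else]
          _, test, then, alt = node
          return evaluate(then if evaluate(test, env) else alt, env)
      if head == 'lambda':                     # ['lambda', [params], body]
          _, params, body = node
          return lambda *args: evaluate(body, {**env, **dict(zip(params, args))})
      fn, *args = [evaluate(n, env) for n in node]   # application
      return fn(*args)

  square = ['lambda', ['x'], ['*', 'x', 'x']]
  print(evaluate([square, 5], GLOBAL))         # 25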


That's only because the hardware architecture's designed that way. If your target is a stack machine (with no general purpose registers), a functional language would be just as easy if not easier to generate code for.


TECS (The Elements of Computing Systems) devotes 2 chapters out of 12 to writing a compiler for a Java-like language. Lexical analysis is more complex, with operators and identifiers and comments in addition to parentheses, numbers, and the letter 's'.

TECS provides a skeleton of a recursive descent parser, with about a dozen different methods to write. The Java-like language has many different kinds of expressions and statements, and this is reflected in the complexity of the parser.

At first the output is an XML tree. Code generation gets an entire chapter. Some of the chapters are straightforward, less than 6 hours of work. But the harder chapters have taken me around 15 hours each. YMMV, but this gives one some idea of the differences between functional languages and procedural languages.


TECS is a wonderful book! I think it was one of the most enjoyable books about CS I have ever read.


If the language supports assignment, which allows variables and parts of data structures to be treated as mutable memory locations, then it isn't functional.

Using Lisp syntax lets the class have more time for semantics.


The simplest languages around are all functional. And most of them are Lisp dialects.

Besides, when writing a compiler, it's always great if you can ignore the possibility of some value changing. It makes the implementation much simpler, even more so if you go into optimizations. And while you can't completely ignore value mutation in Lisp, it is always explicit, which makes an implementer's life easier.


It isn't really meaningful to call a language like this functional or imperative. It's far too simple. If you're asking about the s-expressions specifically, they are very easy to parse for the same reason they are easy to read: you don't need any precedence rules. Writing interpreters in Lisp languages for Lisp languages also has a long history.


If you were trying to implement a language with assignment statements like:

  x := y + z
or

  f(y + z)
or

  g(f(x, y))
you'd need to be able to evaluate expressions first. Maybe it's just simpler to illustrate a smaller language first?


Modeling a pipeline of transformations of a well-understood data type is sort of whatever comes right after "hello world" in functional programming. A compiler is a very scaled up version of this, but it certainly fits naturally with the standard ideas of FP.


Not having to deal with a statement/expression distinction makes it a lot easier. Functional languages often have a simpler syntax with fewer special cases. (Though there are other languages like this e.g. Smalltalk)


I agree in general.

Though Wirth-ian languages, like Pascal, Modula-2 and Oberon, are good examples of languages that are very simple to parse despite not being functional. Though they are slightly more complex than some functional languages, you can prototype a hand-written parser for them in a day if you've got a bit of experience.

E.g. the Oberon-07 grammar takes up 1.5 pages of BNF [1], of which the first third of a page or so represents the lexer symbols (you don't need a separate lexer module for most of these languages).

[1] Pages 16 and 17 of the language report (which incidentally only spends 17 pages to describe the entire language...) https://www.inf.ethz.ch/personal/wirth/Oberon/Oberon07.Repor...


Prefix notation (i.e. + 1 1) instead of infix notation (i.e. 1 + 1) is much easier to parse, as there are no precedence rules and you don't require different conventions for expressions and function calls.


I think it's because the text representation is super close to the AST


I have a friend who wrote this book: http://createyourproglang.com/

Jeremy Ashkenas created CoffeeScript after reading that book. I can't recommend it enough for someone going down this road.


Ah, I searched this link for a while, thanks.


There are a TON of resources like this which focus on lexing and parsing, which is all fine and dandy, but interpreting the resulting Abstract Syntax Tree will be extremely slow.

Are there any resources on the code generation side of things? Even getting from a high level language down to SSA seems like a big leap (nevermind going from SSA to assembly).


> Are there any resources on the code generation side of things?

Cooper and Torczon's "Engineering a Compiler" is mostly focused on the back end. I liked it a lot.


I took a university course where we built an interpreter for MicroScheme. It was a difficult project but was also really awesome and rewarding. I'd like to go back and do it again without deadlines to really understand it better. I think functional programming languages can be a great choice for implementing an interpreter since metaprogramming is their forte, and Racket's match function helped a lot; it's like the most powerful thing I've ever seen [1]

[1] https://docs.racket-lang.org/reference/match.html


No. First, understand what the language is for (if nothing, stop here...). Second, make the main design choices (expressiveness of type system, level of control of mutability or lack thereof, etc.). Third, design the type system, with a view to type inference. Fourth, design and define the semantics. Do those last two in a way that will let you test your implementation against these definitions automatically. Fifth, think about sufficiently efficient implementation strategies. Sixth, pick a syntactic style that will be familiar to most of your users. Seventh, design the actual syntax. Eighth, implement it. Ninth, try it out on users and go back to the start. Tenth, rest.


I'm confused, the article says the language uses a postfix operator and then shows examples like `(s 2 3)`. Isn't that a prefix operator?


Why do bloggers focus on lexing and parsing? These things should not take up 80% of the article about creating a programming language.

This fascination with "how" to build something, without considering "what" and "why", seems to be an issue that gets repeated time after time again.


I agree, which is why I started my own series, as lmm mentioned, with showing simple ways of (ab)using gcc to figure out how to do simple/primitive code-generation, and built up from that, instead of building down from lexing/parsing.

Lexing/parsing is important of course, but it's been done to death, and for most of the simple types of languages people tend to use for teaching, it's a simple problem.

Code generation, on the other hand, is still pretty poorly covered, in my opinion, and something people tend to struggle with a lot more, even if you resort to tools like LLVM (and that's fine if that's what you want, but I'd argue you should try a lower level approach at least once to understand some of the challenges)


This is a fair complaint. I have read dozens of articles on the matter and honestly, parsing and lexing are the least significant aspect of this:

https://news.ycombinator.com/item?id=10793054

https://www.reddit.com/r/coding/comments/2ocw2r/how_to_creat...

    Exactly. I'm in pursuit of building one. My first questions? How to make a REPL, a debugger, how to implement pattern matching, type checking, whether an interpreter can be fast enough, whether it's possible to avoid building a whole VM for it, etc...
    There exist a LOT of practical questions that are left as "an exercise for the reader" and need clarification.
In the end, the parsing steps can be summarized as:

- Do Lisp/Forth if you want to do the most minimal parsing and build a Lisp

Or:

- Use a parser generator, if you don't care about the quality of the parser

Or:

- Write a top-down parser by hand, if you want some control

ANY OTHER OPTION is non-optimal, and will divert you from the task, EXCEPT if you want to do something novel.

If we leave the parsing stuff aside, we can put more into the hard and more rewarding aspects.


I found http://hokstad.com/compiler a lot easier to understand than any other compiler tutorial I've seen. Writing a compiler the way you'd write any other program. It does end up with e.g. a distinct parser, but only when the reasons that might be a good idea become apparent.


Thanks :)

Though in retrospect I think it should have been two separate series (and I need to write a few more parts; I'm very close to having it compile itself as of last night).

I think it'd have been better to evolve the initial simple language into an equivalently simple language to parse, and kept the long slog towards compiling Ruby as a separate thing.

Especially as that has complicated things enough to be in severe need of various refactoring (which I'm avoiding until it will compile itself, at which point I'll start cleaning it up while insisting on keeping it able to compile itself..).

The parser itself started out conceptually quite clean, for example, but the Ruby grammar is horribly complex, and I keep having to add exceptions that have turned it quite convoluted. I don't doubt it can be simplified with some effort once I have the full picture, but it's not great for teaching.


If I were going to write a programming language for myself, I would lay out the following challenge to myself:

> You are only allowed to store the AST and the variable names found by the parser. The input to the lexer is not allowed to be persistent.

To do this, your lexer would need to be in some sense invertible, capable of both producing a source-code-representation of an AST given some naming metadata, as well as converting that source-code-representation back to names+AST.

I think that would make the lexing + parsing task worthy of an 80% article.
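
(For the s-expression case the inverse direction is nearly free; a small Python sketch, reusing the nested-list AST shape from other comments, where unparse is the part that would carry your naming metadata in a richer syntax:)

  # Round trip: source -> AST -> source.  For richer syntaxes the
  # unparse step is where the real work (and the stored metadata) goes.
  def unparse(node):
      if isinstance(node, list):
          return '(' + ' '.join(unparse(n) for n in node) + ')'
      return str(node)

  ast = ['defun', 'fact', ['n'],
         ['if', ['<', 'n', 2], 1, ['*', 'n', ['fact', ['-', 'n', 1]]]]]
  print(unparse(ast))
  # (defun fact (n) (if (< n 2) 1 (* n (fact (- n 1)))))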


How about making your comment useful by enlightening the community as to the relative unimportance of these aspects of contemporary PL design, and suggest some other aspects for would-be designers to focus on instead? Just imagine: you could even link to resources for the latter!


I think it's just because they run out of steam after doing the "first part", and there are backends like LLVM available.

The Red Dragon book has the same issue, except they spend 800 pages on basic parsing and only end up making it sound terrifying and mathy. Definitely recommend against reading it.

Maybe a different place to start would be making your own bytecode?


What is the problem there? I'm confused by your comment


One problem is that syntax is just one (rather shallow) part of the design of a programming language, but it's one that gets a lot of attention because everyone who's used two programming languages can tell that it's a point where languages differ. Semantics (rules about what blobs of syntax mean) is a much more interesting way for languages to differ, but most "build a language" tutorials I (and probably GP) have seen don't seem aware that there are even decisions to be made there. The "your own" bit in the title is also a bit upsetting for a post that hands the reader a language and its implementation instead of talking about something of the reader's own design.




