Ask HN: What's a starting point for learning how to write programming languages?
213 points by da_murvel on Sept 13, 2018 | 95 comments
I've been thinking about writing my own programming language for some time, to gain a deeper understanding of how programming languages, and by extension, computers work. I've bought the Flex and Bison book by John Levine to have a resource on parsing and lexing. I have come to realize, however, that writing a complete language like Java, Python, or Ruby from the bottom up is of course no small task. So I started to think of instead writing a compiler for an existing language, or a "simple" language like SQL or Markdown, to begin with.

What are your experiences when writing a new programming language, where and how did you start?

Others have already mentioned various resources for learning the details of implementing a language. As someone who does programming language development as a hobby and PL technology as a career, I have a higher-level bit of advice:

Take a top-down approach, and focus on semantics over syntax.

If you want to design a new language, think of a moderately sized representative example of a problem you have day-to-day, and write a sketch of a program in the magical pseudocode you wish you were able to write. It doesn’t have to be a big problem, or a big language, just a concrete problem domain.

Then, piece by piece, investigate the details of how to make that notation work in reality: how do you parse, typecheck, optimise, compile, and implement a runtime for it? Figure out the smallest set of core primitive operations needed to express the semantics of your language—that’s your core calculus, which you can prove things about or use as an intermediate representation in your compiler.

In addition, a productive strategy for producing an actual implementation, followed in some compiler courses, is to build your compiler (or interpreter) in such a way that you always have a working implementation at every step—of a language that starts small and grows over time.

For example, write a program that takes in a “hello world” or other primitive program written in your language, and just produces the expected output without any analysis—the result of the program, or a binary (machine code, .NET/JVM bytecode, &c.) with the right behaviour. Then, incrementally add features—outputting other messages, doing other things than simple input/output, evaluating complex terms, rejecting invalid programs, and ultimately taking advantage of the semantic features that make your language unique. This style of development helps encourage you to write a test suite of increasingly complex example programs, which are an essential part of a language implementation—you don’t need to follow strict rules like TDD, but you do need to test everything.
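A sketch of what such a deliberately tiny first step might look like, assuming a made-up language whose only statement so far is `print "..."` (everything here, including the grammar, is an illustrative assumption, not any particular language):

```python
# A hypothetical first increment: an "interpreter" that only handles
# programs of the form `print "..."` and rejects everything else.
# Later increments would widen what the pattern (and the language) accepts.
import re

def run(source):
    """Execute a one-statement toy program, or reject it."""
    m = re.fullmatch(r'print "([^"]*)"', source.strip())
    if m is None:
        raise SyntaxError('only `print "..."` is supported so far')
    return m.group(1)

print(run('print "hello world"'))  # hello world
```

Each new feature then gets its own example program in the test suite before the implementation grows to accept it.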

Thanks for your reply! Although I think I usually go about programming by starting small and then adding features as they are needed, it never occurred to me, for some reason, that I could do the same in this case. Reading the other replies, I think a combination of starting small and using a couple of "how to" links and books as references will be my best bet.

For many people, and I include myself in that group, the class in compiler and language design is where all of the concepts being taught in CS came together into a bigger and richer picture. A common observation is that it was like learning "how" music is written rather than just "playing" music.

That said, everyone who learns to program usually learns more than one language. For me, it was after about the fifth language that the syntax of a particular language fell away and what remained was the algorithm that was being expressed. It was this ability to see the algorithm distinct from the syntactic construction of the language that lets me see where the language is "helping" and where it is "hurting" in expressing the algorithm.

The advice then for starting points for developing new languages would be to take some algorithms you know well and implement them in as many languages as you can find. Ideally you'll have at least one imperative language (BASIC, C, Fortran, and COBOL all qualify), one functional language (Haskell/APL), one lambda language (Lisp-ish), and one semantic language (Ada, Mesa, Modula). Typically sorting, string handling, and asynchronous execution (or schedulers) all make for good algorithms to try. See what is "easy" and what is "hard" to implement based on the features the language brings to the table.

Then optimize for 'easy'.

That will give you the verbs and structure that support your needs; once you have those, giving them names and a syntactic structure is fairly straightforward.

Thanks for your reply! As a musician myself I like your analogy. It was a huge revelation for me when I learned how to write my own music, and I also started to "consume" music in a completely different way. I also sort of hope that the same will happen with my programming, taking it to the next level.

I absolutely love your music analogy. That is brilliant and accurate!

A Forth or a Lisp is a simple task for an experienced programmer, and even basic ones are really satisfying, imho. Seeing simple constructs work after an hour of coding really feels great. I have implemented more complex languages (with types) for work, which were more like DSLs but in languages less suited to the type of DSL we required, but I find the most pleasure in doing little Forths (like [1]) or Lisp-likes when I am waiting for hours (on a plane, for instance). It is like playing games for me. I write them on my phone or iPad.

I would recommend a lot of books here as well, for a solid base and for the background needed to write more complex languages including static types, compilers, JITs, and VMs. But for me, because I really do not have that much time, that would stretch over too long a time and I'd lose interest, so I stick to things I know I can finish (or rather, get working while achieving something a previous incarnation did not: smallest, simplest, fastest, etc.).

[1] https://gist.github.com/tluyben/16ee2645c4c8aed813005d51488d...

Someone wrote an amazing write-your-own-Lisp-using-Python tutorial; I found it very consumable and it only took a weekend:


There's also Make a Lisp: https://github.com/kanaka/mal/blob/master/process/guide.md

It's available in a variety of languages and has a nice guide (linked) to various steps.

I'll agree and expand on their benefits to more complex things:

The vast majority of languages are parsed into ASTs of some form, and Lisp is basically directly programming in AST data structures that are executable. Seeing that source code is basically just another data structure to be manipulated and converted was the major "Ah hah!" moment in demystifying compilers for me.
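That "source code is just a data structure" point can be shown in a few lines, treating nested Python lists as the AST of a toy Lisp (the two-operator language here is purely an illustrative assumption):

```python
# Sketch: a Lisp expression is just a nested list, i.e. the AST itself.
# Evaluating it is a small recursive walk over that data structure.
def evaluate(expr):
    if isinstance(expr, (int, float)):
        return expr                      # atoms evaluate to themselves
    op, *args = expr                     # head is the operator
    vals = [evaluate(a) for a in args]   # evaluate sub-trees first
    if op == "+":
        return sum(vals)
    if op == "*":
        out = 1
        for v in vals:
            out *= v
        return out
    raise ValueError(f"unknown operator {op!r}")

# (+ 1 (* 2 3))  is literally the list  ["+", 1, ["*", 2, 3]]
print(evaluate(["+", 1, ["*", 2, 3]]))  # 7
```

Since the "AST" is an ordinary list, a macro is nothing more exotic than a function that rewrites one list into another before evaluation.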

Forth is a great lower-level representation of nested expression evaluation, and sheds a lot of light on how to organize that evaluation in your code generation, converting foo(bar(baz(val),blort(val2))) into a linear stream of operations. Even for register machines, Forth's stack represents the intermediate values that must be retained between expressions in either registers or memory.
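As a rough illustration of that flattening (the functions in FUNCS are made-up placeholders, and tuples stand in for a call tree), the nested call can be linearized post-order and run with an explicit value stack:

```python
# Sketch: flatten a nested call tree into a linear, Forth-style
# stream of operations, then run it with an explicit value stack.
def linearize(node, out):
    """Post-order walk: emit arguments first, then the call itself."""
    if isinstance(node, tuple):          # (fname, arg1, arg2, ...)
        name, *args = node
        for a in args:
            linearize(a, out)
        out.append(("call", name, len(args)))
    else:
        out.append(("push", node))
    return out

# Placeholder semantics, purely for demonstration.
FUNCS = {"baz": lambda v: v + 1, "blort": lambda v: v * 2,
         "bar": lambda a, b: a - b, "foo": lambda x: x}

def run(ops):
    stack = []
    for op in ops:
        if op[0] == "push":
            stack.append(op[1])
        else:                            # ("call", name, argcount)
            _, name, n = op
            args = [stack.pop() for _ in range(n)][::-1]
            stack.append(FUNCS[name](*args))
    return stack.pop()

# foo(bar(baz(10), blort(3))) becomes push/call ops in evaluation order.
prog = linearize(("foo", ("bar", ("baz", 10), ("blort", 3))), [])
print(run(prog))  # baz(10)=11, blort(3)=6, bar(11,6)=5, foo(5)=5 -> 5
```

The values sitting on `stack` between operations are exactly the intermediates a register-machine backend would have to keep in registers or spill to memory.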

Once this basis of how programming languages work is established, then the ideas of new programming language design from the source code perspective can be explored in practical terms.

I totally agree here. If you want low level, I think the J1 Forth (http://www.excamera.com/sphinx/fpga-j1.html) is a good choice, and for Lisp I really liked http://www.buildyourownlisp.com, which uses modern C libraries, so it polishes your C knowledge too.

Beautiful Racket[0] is a good starting point that covers a lot of nice small examples that incrementally build up your understanding of the different components involved in making languages. And it uses Racket[1], which incorporates the idea of crafting many small "domain-specific" languages to solve different problems into its entire philosophy[2].

[0] https://beautifulracket.com/

[1] https://racket-lang.org/

[2] http://felleisen.org/matthias/manifesto/

I'm just finishing up the last tutorial, and I highly recommend this as well. I made one traditional C compiler back in college with flex and bison, which I consider an invaluable experience, but I've always wanted to tap in to the power of Scheme for developing new languages. I had some false starts on my own and with mediocre tutorials, but in the end Beautiful Racket was exactly the kind of pedagogy I needed.

Second Beautiful Racket. Good writing and approachable content.

Although it's still in development, http://craftinginterpreters.com/ (previously on HN: https://news.ycombinator.com/item?id=13406081) seems promising, and the existing content is good.

I'm not super far in but I'm very happy with it so far. I recommend it very highly and am very skeptical of anyone recommending reading the dragon book without doing something more hands on like this first.

Maybe that book is more practical than I realize though.

Crafting interpreters is a great place to start! Takes you from basically nothing to implementing a whole language surprisingly quickly.

I also recommend Crafting Interpreters.

It helped me write the scripting language that I use in one of my projects.

I would suggest looking at Niklaus Wirth's Compiler Construction book and even the Oberon compiler. This is one of the best ways to learn about compilers and writing programming languages. See: https://www.inf.ethz.ch/personal/wirth/

Another approach is to look at SICP, it is Scheme oriented and more oriented towards macros. See: https://mitpress.mit.edu/sites/default/files/sicp/index.html

SQL is a deceptively complex language; the optimisations a SQL engine performs are very involved. Markdown, on the other hand, is more a matter of transpiling to HTML and is not Turing-complete.

Write a parser for a complicated file format.

Write a simple interpreter for a C-like language by parsing it into a syntax tree, constructing a stack of symbol tables, and evaluating the nodes.

Once that works, write a version that emits code for a stack-based virtual machine (a simple stack machine written concisely might eat just a few screenfuls of code).
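For a sense of scale, a minimal stack machine really can be tiny; here is a hypothetical sketch with just four opcodes (the instruction set is an assumption, not any standard bytecode):

```python
# A minimal stack-based virtual machine. Bytecode is a flat list of
# opcodes with inline operands; execution is one dispatch loop.
def execute(bytecode):
    stack, pc = [], 0
    while pc < len(bytecode):
        op = bytecode[pc]
        pc += 1
        if op == "PUSH":
            stack.append(bytecode[pc])   # inline operand
            pc += 1
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "HALT":
            break
    return stack.pop()

# (2 + 3) * 4
print(execute(["PUSH", 2, "PUSH", 3, "ADD", "PUSH", 4, "MUL", "HALT"]))  # 20
```

A real VM adds loads/stores, jumps, and calls, but they are all just more `elif` arms on the same loop.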

When that works, symbolically evaluate the stack bytecode into variable references, pick a simple architecture (I like MSP430 for this, but ARM will do), write a trivial register allocator, implement register spill to the stack, and emit real code.

Then read Lisp In Small Pieces.

Worked OK for me!

A surprising amount of the complexity you might perceive in a "real" programming language just comes down to how expressions are parsed.

What's the purpose of the file format bit?

Declarative thingies are easier to parse? You can skip a lot of these steps, I guess --- or just stop at a working interpreter, which will get you as far as MRI Ruby got before RoR took off.

Oh, I see. I thought it was something like 'writing an ad-hoc parser for something convoluted will make you understand and appreciate systematic approaches to parsing better' but I guess that's just me being a terrible person.

I pretty much just use yaccs.

Put that flex & bison book away. Don't think about compilers. Design your own little interpreted language. Write a lexer by hand (not that hard). Write a recursive descent parser by hand. It's a bit tricky to get infix operator parsing right, but it's an important learning experience. Once you get the basic parsing algorithm right, you can probably add lots of fun stuff to your language with just general programming skills. When you feel ready to move on, read the flex & bison book, and explore more advanced topics knowing you don't need to write lexers and parsers by hand again.
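A minimal sketch of that hand-rolled pair, assuming a toy grammar of integers and four operators, with the tricky infix part handled by precedence climbing (the grammar and token set are assumptions for illustration):

```python
# Hand-written lexer (a regex sweep) plus a recursive-descent parser
# that uses precedence climbing for left-associative infix operators.
import re

def lex(src):
    return re.findall(r"\d+|[-+*/()]", src)

PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def parse(tokens):
    def atom():
        tok = tokens.pop(0)
        if tok == "(":
            node = expr(1)
            tokens.pop(0)            # consume ")"
            return node
        return int(tok)

    def expr(min_prec):
        left = atom()
        while tokens and tokens[0] in PREC and PREC[tokens[0]] >= min_prec:
            op = tokens.pop(0)
            right = expr(PREC[op] + 1)   # +1 makes it left-associative
            left = (op, left, right)
        return left

    return expr(1)

print(parse(lex("1 + 2 * 3")))  # ('+', 1, ('*', 2, 3))
```

The entire precedence table lives in one dict, which is what makes growing the operator set painless.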

+1. This is the approach I'm currently taking. I tried starting with books and tutorials but felt I wasn't learning much just being spoon-fed code and copy+pasting. I think it's much more illuminating to attack the problem with a beginner's mindset and work it out for yourself, at least up to a point (to start I'm working on implementing a small subset of Scheme). If I occasionally get stuck I'll peek at a book just enough to get un-stuck. I'm having a blast with it and learning a ton. Once I get tired of implementing language features I'll probably start from scratch and follow a book the second time through. For me it's often easier to absorb problem-solving concepts if I've attempted the problem myself before.

The first couple chapters of crafting interpreters (free) cover lexing and parsing and precedence well and it's very readable and practical rather than theoretical.

There is a really simple algorithm to convert infix to prefix or postfix. The pre/postfix form is then interpreted/compiled easily. This was one of the first things my comp-sci teacher in HS made us do after teaching linked lists/stacks/FIFOs.

Of course, if you pick a language like Forth (mentioned below), the whole problem is sort of moot.

I highly recommend Norvig's (How to Write a (Lisp) Interpreter (in Python)) as a very short, practical introduction to implementing a language. http://norvig.com/lispy.html

There's also a followup article implementing additional data types and tail call optimization. http://norvig.com/lispy2.html

Learn the mechanics of compiler construction first - that'll inform everything about your language, and is a prerequisite to actually building something other people can use. I'd recommend The Dragon Book [1], SICP [2], and Modern Compiler Implementation in ML [3] for this.

After that, it's a question of where you want to go with your language, and what distinguishing features you'd like it to have relative to other languages. Languages which just change some syntax and keywords but otherwise look like C are common, but they tend not to get usage beyond their creators. Much of the academic research lately is about type systems (TAPL has already been mentioned) - there's research into dependent types, effect typing, total functional programming, linear types already made it into Rust's borrow checker, etc.

However, many of the worst problems in industry - dependencies, versioning, smoothly migrating from a prototype to robust production code, code navigation, security, parallelization & distribution, the memory hierarchy - have gotten relatively little attention in academic research, and are desperately awaiting solutions. If you have some industry experience, it may be worth trying to imagine how you might fix these problems if you were allowed to change the programming language at will.

[1] https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniq...

[2] http://web.mit.edu/alexmv/6.037/sicp.pdf

[3] https://www.amazon.com/Modern-Compiler-Implementation-Andrew...

> Languages which just change some syntax and keywords but otherwise look like C are common, but they tend not to get usage beyond their creators

Considering C++ is one of the most widely used languages out there, I would question this.

C++ has a bunch of new semantics on top of C - classes, access modifiers, virtual functions, constructors/destructors, operator overloading, different casting rules, exceptions, namespaces, templates, etc. People who know both of them well consider them very distinct languages, because the programming style you use in a C++ program is very different from a C program.

Check out "Writing An Interpreter In Go" https://interpreterbook.com/ It uses Golang to make an interpreter for a simple C-Style language and it's "only" 200 pages long. I'm about a third in and my head is smoking.

That looks pretty awesome. Thanks for the suggestion. Looks like it gets to a good amount of functionality too, with first class functions and closures and the like.

Very much so, and once you've completed the book there are a lot of extensions various people have made.

For example here's a version with classes / OO-behaviour:


Or here's my version which "just" adds some more standard-library facilities, allowing you to open/read/write files:


I found the language fun to play with, and I used a virtually identical lexer and parser-idea in a recent project of my own - rather than "parsing" an input-file with regular-expressions, which is how I would have previously approached the implementation of simple DSLs. I guess that showed I'd learned something useful!

This is a great read! The code is very clean and easy to follow, and the progression of ideas is well thought out. Even if you don't know Go you could learn it along the way without too much trouble. He also just released a companion volume on writing a compiler for Monkey as well.

The jovial writing style makes the content easily consumable

I found Jack Crenshaw's "Let's Build a Compiler" series in a magazine, quite a while ago: https://compilers.iecc.com/crenshaw/

There's more than a few ways to do it.

Ohm [0] is a good place to start. It's a parser generator. If you can understand the math example [1], you can probably realize how things get built upwards. Then you can focus on what you want your language to do, how you'll interpret it, what problems it would express, &c.

[0] https://ohmlang.github.io/

[1] https://github.com/harc/ohm/blob/master/examples/math/index....

Also consider reading about OMeta, if Ohm isn't your thing.

Ohm’s web playground is an awesome way to learn how to build up a language from scratch. I’m in the process of porting Postscript to it and it’s amazingly easy.

FWIW Prolog DCGs (Definite Clause Grammar) are very similar to Ohm/OMeta.

I'd like to offer a slightly different approach. First, start off by making a simple calculator program that uses RPN syntax. This is dirt simple to implement: read a line of input; if a token is a number, push it onto a stack; if it is a math operator, call the function handling that operator (addition, multiplication, etc.) on the last two items on the stack and push the result back on the stack.

Now add error handling (i.e., unrecognized operators, or stack underflow condition) -- figure out what you want to do, such as crash the interpreter with an error message, or return an error and let the user continue.
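The RPN evaluator described above, with both error cases surfaced as exceptions (just one of the error-handling choices mentioned), might look roughly like this:

```python
# RPN ("stack machine") evaluator: numbers push, operators pop two
# and push one. Errors raise so the caller can decide what to do.
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b}

def rpn_eval(line, stack):
    for tok in line.split():
        if tok in OPS:
            if len(stack) < 2:
                raise ValueError("stack underflow")
            b, a = stack.pop(), stack.pop()
            stack.append(OPS[tok](a, b))
        else:
            try:
                stack.append(float(tok))
            except ValueError:
                raise ValueError(f"unrecognized token {tok!r}")
    return stack

print(rpn_eval("5 1 2 + *", [])[-1])  # 5 * (1 + 2) = 15.0
```

Passing the stack in explicitly lets a REPL keep it alive between lines, which is half the fun of an RPN calculator.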

Now, read up on the Shunting Yard algorithm. This lets you take a regular (infix) math expression, and convert it to RPN (postfix) notation, which you can then feed directly into the RPN interpreter ("stack machine") that you just wrote.

Now look at handling parentheses, which should be covered in the same explanation of the Shunting Yard algorithm.
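A sketch of the shunting-yard step, parentheses included, whose output feeds straight into an RPN evaluator (the naive whitespace tokenisation is an assumption to keep the example short):

```python
# Shunting-yard: convert infix tokens to postfix (RPN) using an
# operator stack. Parentheses are handled as stack fences.
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def to_rpn(tokens):
    out, ops = [], []
    for tok in tokens:
        if tok in PREC:
            # pop operators of equal-or-higher precedence first
            while ops and ops[-1] != "(" and PREC[ops[-1]] >= PREC[tok]:
                out.append(ops.pop())
            ops.append(tok)
        elif tok == "(":
            ops.append(tok)
        elif tok == ")":
            while ops[-1] != "(":
                out.append(ops.pop())
            ops.pop()                # discard the "("
        else:
            out.append(tok)          # a number
    while ops:
        out.append(ops.pop())
    return out

print(to_rpn("5 * ( 1 + 2 )".split()))  # ['5', '1', '2', '+', '*']
```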

Next, add in variables in the original stack language, and figure out what minimal changes to the second piece (the algebraic expression parser) you need to do to get it to handle variables.

Finally, add in functions, and you have a simple language. I followed this exact formula when putting together a toy language many years ago, after which I learned how much I didn't know that I didn't know about language design. But I know enough now for a second shot.

Oh, and some other resources to study that may help: look for the reference books for the PostScript language. The "Blue" book teaches you PostScript, and I believe it is the "Red" book that tells you enough of the internals to write your own PostScript interpreter (or at least to get inspiration for creating your own version of a stack-machine interpreter).

I suggest "Programming Languages: Application and Interpretation", available free here: http://cs.brown.edu/courses/cs173/2012/book/ It's an undergraduate-level programming languages class textbook.

Other books burn a lot of time on things like lexing and parsing; this one does not. In the lisp tradition, it assumes s-expressions and gets to the juicy bits of programming language semantics immediately: lexical scope, first class functions, objects, etc.

Second this. I'm writing a compiler now (https://github.com/kitlang/kit) and this book has been invaluable.

A few other tips:

* Use a parser generator. Parsing is more or less a solved problem; get it out of the way as fast as possible.

* Writing a language that compiles to another language is very helpful as you can often borrow the underlying language's implementation of a feature. Doubly helpful if you can leverage the underlying language's libraries for things like syscalls.

* Make a todo list (I like Trello for this) full of self-contained tasks. Pick one at a time, think through how you would implement that feature, and go to work. Compilers look like a lot of work in the beginning, but if you take one step at a time, one day you'll look back and have accomplished more than you realize. The list can also be helpful to avoid scope creep; add any new ideas to your backlog and deal with them later.

* Write unit tests, as you'll very likely want to refactor something down the line. Once your compiler becomes complex enough, add functional tests that verify entire programs can compile and run and produce the expected output.

If you like books, this one is great:


If you like excellent free university lectures online, this is great:


In reality, programming languages are most likely what you think they are, if you simply start with the basics: what's a variable? What's a function? How do I call functions inside other functions? Do I want variables to have types? What kinds of types?

If you start with the very very basics - global variables, global functions, you can't call functions within other functions - and implement that, you'll get the feel for it, which is most likely as far as you'll want to go, because it gets very tedious, very quickly, as soon as you want something as 'simple' as proper unicode support, garbage collection, generics, inheritance, foreign-function-interface and the list goes on and on.

To get your feet wet regarding what it's all about - those video lectures above are top notch.

The book "Lisp in Small Pieces" takes you through the journey of creating a few compilers of increasing difficulty, from a simple language's interpreter written in a complex language to a complex language's compiler written in a simple language.

By using a Lisp-y language you can skip the tedious and uninteresting task of parsing and get to the real meat of compiling and (some) optimizing.

Love Lisp in Small Pieces! Beautiful book. Also enjoyed Essentials of Programming Languages by Friedman and Wand: a very SICP-ish intro to programming language implementation, in that it weaves together a narrative with uniting underlying themes rather than bombarding the reader with seemingly disparate concepts and leaving it an exercise to unite them, as many "classic" texts in this discipline do.

I maintain a programming language called Ferret [1]. You can read through the manual, it is a literate program, covers everything you need to implement your own lisp. Compiler, optimizations, runtime, garbage collection etc. [2] overlaps with [1] covers how to compile a restricted subset of Clojure to C++. If you are interested in interpreters [3] has an implementation of the "A Micro-Manual for Lisp - not the whole Truth" in C.

[1] https://ferret-lang.org/ [2] https://nakkaya.com/2011/06/29/ferret-an-experimental-clojur... [3] https://nakkaya.com/2010/08/24/a-micro-manual-for-lisp-imple...

What worked for me: read Norvig's tutorials ((How to Write a (Lisp) Interpreter (in Python))). They give a basic understanding.

Then continue enhancing that tiny language, add more features.

Start with an interpreter, they are easier to grasp. You can always add a compiler later.

I grabbed Scheme R5RS standard and started implementing it in Kotlin: https://github.com/kovrik/scheme-in-kotlin

Then added more Clojure-like and Racket-like features, plus some Kotlin features.

For me the starting point was Structure and Interpretation of Computer Programs, particularly the '80s lecture recordings.

Also, I would recommend staying away from lexer and parser generators; just use a hand-built recursive descent parser and some set of ad hoc lexical analysis subroutines.

Chris Lattner, the creator of Swift has a nice walkthrough for making a small language called Kaleidoscope http://web.cs.ucla.edu/classes/spring08/cs259/llvm-2.2/docs/...

This is a better link: https://llvm.org/docs/tutorial/ That page uses a 10+ year old version of LLVM.

I loved going through that, definitely second this.

Types and Programming Languages (TAPL) is required reading, in my opinion: https://www.cis.upenn.edu/~bcpierce/tapl/

Absolutely not for a markdown parser, which is what the author is asking about to get started

Parsing is but 1/3 of compiling.

Markup languages are not generally thought of as programming languages, but maybe that's an archaic view.

At any rate, 90% of the material in this thread is way overkill if all one wants to do is parse Markdown.

I bet this is what you are looking for: https://www.amazon.com/Compiler-Construction-Using-Java-Java...

This book taught me how to write a compiler and build a programming language of my own.

To answer your question "where and how did you start?": this is where I started learning about compilers and programming language implementation.

Here is its description from its website:

* Comprehensive treatment of compiler construction.

* JavaCC and Yacc coverage optional.

* Entire book is Java oriented.

* Powerful software package available to students that tests and evaluates their compilers.

* Fully defines many projects so students can learn how to put the theory into practice.

* Includes supplements on theory so that the book can be used in a course that combines compiler construction with formal languages, automata theory, and computability theory.

What I promise you with this book: you'll learn how to write your own programming language, and not only about compilers and the language but also about computer architecture. This book is very easy to read and understand. It starts with very basic topics, then slowly builds on them until you grok implementing a language from its specification. You'll have the chance to build a compiler from scratch or by using JavaCC and Yacc.

> a "simple" language like SQL, Markdown

If you're happy to consider Markdown a language (which it is, just not a programming language), you could also consider a (subset of) XML parser, or JSON parser. It's a nice place to start, because it's very small scope, a subset of PL (parsing), and the completed thing is actually usable and useful.

Plus, it gets you a step on the path to a programming language - arguably, the fundamental step.

I recommend a recursive descent parser, and this dragon compiler book (https://en.m.wikipedia.org/wiki/Principles_of_Compiler_Desig... Your local uni library has it.). An early chapter gives a complete calculator (arithmetic parser), in very simple and clear C code.

NB the most recent dragon book uses horrific over-complex architecture-astronaut, design-patterns OO java code. OOP has its place, but not like this. Not like this.

There is much good advice given by others here regarding the implementation of languages. However, if you really want to get into language design, there is one area that you should/need to understand (as far as I am concerned) and that is how to specify the meaning of your language and its constructs. If this is not done, you end up with too many gotchas and special cases in how you will do things.

I have found too many language designers have not thought about this clearly enough. When you go to actually use the language, you will suddenly find all sorts of oddities in how you construct the program you are trying to write. Undefined behaviour is a curse, implementation defined behaviour is a curse.

If your language allows you to write something and it passes the compiler, then make sure that it is completely defined so that no matter what there are no subtle surprises.

> that is how to specify the meaning of your language and its constructs

You pose the question but don't seem to answer it - this area is called semantics, and there are several approaches and notations you can use. Types and Programming Languages by Pierce is the most common starting point.

I am not posing a question. I am suggesting he look into the subject. The word "semantics" doesn't hold much meaning for many people, they understand the word "meaning" better. I could have dived straight in and suggested that he look at "denotational semantics", but I considered that just a little too much to start him off with.

By the way, how much are you doing on Katahdin these days?

I haven't touched Katahdin in the last decade!

Some of the ideas live on in Truffle and Graal.


Write lexers and parsers for data formats, JSON and EDN are two good starting points.

Write a Forth or Lisp lexer, parser, and interpreter.

Then write a small language that compiles to a target, like JS; make sure your language is close in semantics to the target language to avoid headaches.
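A toy sketch of the compile-to-a-target idea: "compiling" a tiny prefix-expression language into JavaScript source strings, so JS supplies the arithmetic semantics for free (the nested-list encoding is an assumption):

```python
# Transpiler sketch: each node of a prefix-expression "AST" becomes
# a parenthesized JavaScript infix expression.
def compile_to_js(expr):
    if isinstance(expr, (int, float)):
        return str(expr)
    op, left, right = expr
    return f"({compile_to_js(left)} {op} {compile_to_js(right)})"

# (+ 1 (* 2 3))
print(compile_to_js(["+", 1, ["*", 2, 3]]))  # (1 + (2 * 3))
```

Because the output is just JS, you also inherit its functions, numbers, and runtime instead of implementing your own.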

You might have a look at "Write Your Own Compiler", which builds a full compiler (emitting ELF) for a block-structured language in less than 1600 lines of code. It is a very short introduction (slightly more than 100 pages) and yet covers many aspects of compiler writing. The compiler source code is in the public domain.

Yes, I'm the author! There are other, more in-depth, compiler books on my homepage: http://t3x.org

The home page also contains the source code to many compilers and interpreters at different levels of complexity, from trivial to real-world.

There's a ton of great responses in this thread. Allow me to go a bit meta: this is a big topic, and as you enter it, you'll eventually find out which parts of it appeal to you. Do you care about parsers? Type theory? Garbage collection? Code generation/optimization? Notation as a tool for thought?

Ever since I started programming, the idea of writing my own language was a sort of "white whale". I have one idea for a language that I think could be usable in a very particular domain. For that, I wrote a very primitive tree-walking interpreter (terrible performance, very technically uninteresting). The idea for the language was what I wanted to explore. I recently started writing a regex->jvm bytecode compiler. That's much lower level, and if I stick with it, performance will be what I optimize for. In neither case did I do anything interesting with parsers, though I used to love learning about new parsing techniques.

In that vein, see dfox's suggestion to "just use hand built recursive descent parser and some set of ad-hoc lexical analysis subroutines". That's the fastest way to get started, whether or not it'll be the best in the long term.

So, my advice is to explore and figure out which things you're interested in, and what you want to accomplish. That will tell you where you want to invest your time.

You're right that writing a complete language like Java or Python is not a small task, but that's because these languages have redundant ways of doing the same thing. If you simplify them by eliminating redundancy while retaining power, you'll have a much simpler implementation task. Here's a language I designed for this: https://kartick-log.blogspot.com/2016/04/why-arent-there-sim... If you want to learn language implementation rather than design, you can implement this language as is.

Another option is to build a VM that accepts as input bytecode in a text (not binary) form. Less efficient, but easier to implement, and you can write your bytecode by hand instead of also building a compiler to do so. Like object-oriented assembly language, if you will. If you're interested, I'll scan and share my dissertation report with you.
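A hypothetical sketch of such a text-form VM, with a deliberately tiny instruction set so programs can be written by hand (the opcode names are made up for illustration):

```python
# A VM whose "bytecode" is plain text, one instruction per line.
# Hand-written programs are easy to read, diff, and debug.
def run_vm(text):
    stack = []
    for line in text.strip().splitlines():
        parts = line.split()
        op = parts[0]
        if op == "push":
            stack.append(int(parts[1]))
        elif op == "add":
            stack.append(stack.pop() + stack.pop())
        elif op == "print":
            return stack[-1]     # stand-in for real output
    return stack[-1] if stack else None

program = """
push 40
push 2
add
print
"""
print(run_vm(program))  # 42
```

Once the text form works, switching to a packed binary encoding is a mechanical change to the reader, not to the VM loop.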

You can also implement a domain-specific language. Did you, in the past, feel that a task you were doing would be simpler if there was a custom language for it? If so, maybe you can try building that DSL and then using that to make your main task easier.

Speaking from daydreams and not experience, I’ve always thought the combination of http://leafo.net/guides/parsing-expression-grammars.html and http://terralang.org would be a great toolset on which to build a language.

Implementing an evaluator for infix parenthesized arithmetic expressions, like "5*(1+2)+60", is a good starting point.

You can begin with writing a tiny lexer and a recursive descent parser that simply writes out the value of an expression that the user typed in. Then you can make it generate assembly for computing the expression instead (hardcoding some further assembly to print the result out at the end of the execution of the program), turning it into either a traditional compiler or a JIT compiler. Once you do that, it should not be too hard to see how to add variables and simple loops. This will already demystify things quite a bit and make further learning easier.
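Before the assembly-generation step, the interpreter version described above might look like this in Python — a minimal sketch (tokenizer plus recursive descent over the usual precedence grammar):

```python
# Minimal lexer + recursive descent evaluator for +, -, *, / and parentheses.
# Grammar: expr   -> term (('+'|'-') term)*
#          term   -> factor (('*'|'/') factor)*
#          factor -> NUMBER | '(' expr ')'
import re

def tokenize(s):
    return re.findall(r"\d+|[-+*/()]", s)

def evaluate(s):
    toks = tokenize(s)
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def eat():
        nonlocal pos
        t = toks[pos]
        pos += 1
        return t

    def factor():
        if peek() == "(":
            eat()            # consume '('
            v = expr()
            eat()            # consume ')'
            return v
        return int(eat())

    def term():
        v = factor()
        while peek() in ("*", "/"):
            if eat() == "*":
                v *= factor()
            else:
                v //= factor()   # integer division, to keep everything an int
        return v

    def expr():
        v = term()
        while peek() in ("+", "-"):
            if eat() == "+":
                v += term()
            else:
                v -= term()
        return v

    return expr()

print(evaluate("5*(1+2)+60"))  # 75
```

Each grammar rule becomes one function; turning this into a compiler is then mostly a matter of having `expr`/`term`/`factor` emit stack-machine instructions instead of computing values.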

I specifically recommend implementing your own parser and generating assembly on your own, at least if your primary goal is learning. You could instead use an existing parser generator and e.g. a library for generating JVM bytecode, but those ready-made components are high level enough to turn it all into the usual exercise in plumbing libraries together, rather than a learning experience across many layers of the computing stack.

Build Your Own Lisp (http://buildyourownlisp.com/) is as good a starting point as any. You learn to implement a high-level dynamic programming language in C, which is a common choice of implementation language for interpreters (e.g., CPython, Ruby). I really can't recommend the book enough.

In phases, this is roughly how I learned:

1. Cute hacks like using the eval function in Python and rewriting the string before passing it in. This is basically metaprogramming by macro-processing and a good stepping stone to thinking about what a programming language really accomplishes.

2. Writing various Forth-like and Lisp-like languages. This let me explore a working system without spending too much time on parsing - I could try interpreting the AST directly or compiling it to an intermediate language.

3. Gradually learned that a lot of the features and technologies in general-purpose programming languages aren't things I want to care about (how to implement arithmetic, perform variable assignment, perform typechecking, etc.). Focused on data languages like JSON and small supplemental languages for e.g. scheduling concurrency.

4. Found a useful model for prototyping semantics before adding a syntax, by envisioning the whole language in the form of a state machine that builds expressions one API call at a time, then compiles/executes the result with a "commit" call or similar. This is a practical technique for many domain specific problems, and makes the necessary syntax emerge naturally.
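That last technique can be sketched as a tiny builder object in Python; all names here (lit, add, mul, commit) are invented for illustration. Each API call adds to an expression tree, and nothing is evaluated until the "commit" call:

```python
# Prototyping semantics before syntax: a builder accumulates an AST one
# API call at a time (a stack-machine-style state machine), and commit()
# "compiles and executes" the accumulated expression.
class ExprBuilder:
    def __init__(self):
        self.stack = []

    def lit(self, n):
        self.stack.append(("lit", n))
        return self

    def add(self):
        b, a = self.stack.pop(), self.stack.pop()
        self.stack.append(("add", a, b))
        return self

    def mul(self):
        b, a = self.stack.pop(), self.stack.pop()
        self.stack.append(("mul", a, b))
        return self

    def commit(self):
        def ev(node):
            tag = node[0]
            if tag == "lit":
                return node[1]
            if tag == "add":
                return ev(node[1]) + ev(node[2])
            return ev(node[1]) * ev(node[2])
        return ev(self.stack.pop())

# (1 + 2) * 5, with no surface syntax at all:
result = ExprBuilder().lit(1).lit(2).add().lit(5).mul().commit()
print(result)  # 15
```

Once the API feels right, a concrete syntax is just a thin parser that drives these same calls.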

A lot of what goes into making a useful production language isn't really on the theoretical side, but on finish-and-polish stuff: fast builds, helpful docs and error messages, well-rounded libraries, debugging tools, ease of deployment and so on. Learning this through the process of writing small PLs helped change how I viewed language selection to be more like picking any other application, vs a point of religion: choosing something with approximately the right features and tooling so that the groundwork is easy, and then magnifying that leverage by adding a bit of custom syntax, static checking, code generation, what-have-you on top. Turning the tech into an airtight abstraction generalized across a whole language is a ton of work, but getting 80% of it in for the specific app takes a fraction of the time in comparison.

I'm going through this book right now; I find it to be pretty good! https://www.amazon.ca/Writing-Interpreter-Go-Thorsten-Ball-e...

I have ADHD that makes it nearly impossible for me to learn how to code, and thus I've never been able to perfect it. I know bits and pieces here and there, but I can tell you what usually works best for me for learning complex tasks.

1. You need to actively set aside time each day for this task.

2. Leave time for this to last a long time, two months, minimum.

3. Plan every step out.

4. Repeat, repeat, REPEAT!! As in, repeat your training tasks. People have to repeat something seven times to burn it into permanent memory, and you need to remind yourself of it at least once a month for a year, then once every six months after this.

If you do those four things, which are surely no easy task, you should be an excellent programmer in no time. Good luck :)

I'm writing a programming language from scratch called Basecode. It's very early days yet. The bootstrap compiler (written in C++) is about three months away from being complete. Then I'll rewrite the compiler in Basecode itself to make it self-hosted. After the language is self-hosted, there's a second level of features I'll be adding that I think are quite exciting.

I've captured the entire process on video: https://www.youtube.com/watch?v=0_PX2Ue6F_A&list=PLWMUVtnFsZ...

The YouTube playlist is an archive of my daily streams on Twitch.

I spent 3 years writing como-lang: https://github.com/analang/ana I didn't have a compilers course at my college. I think the first thing you should understand is how a language works. The first part is the syntax and grammar. It would also be very useful to know how assembly works, and from there, stack machines, which are very simple yet powerful machines capable of representing expressions in a linear way. Feel free to email me if you have any questions! I'd love to help you get on the right track.
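To illustrate why stack machines map so naturally onto expressions, here's a hedged sketch in Python (the tuple AST and instruction names are my own) that flattens an expression tree into a linear instruction sequence:

```python
# Flatten an expression tree into linear stack-machine code via a
# post-order walk: operands are emitted before the operator that
# consumes them, which is exactly postfix notation.
def compile_expr(node, out):
    if isinstance(node, int):
        out.append(("PUSH", node))
    else:
        op, left, right = node
        compile_expr(left, out)
        compile_expr(right, out)
        out.append((op, None))
    return out

# (1 + 2) * 5 as a nested tuple AST:
code = compile_expr(("MUL", ("ADD", 1, 2), 5), [])
print(code)
# [('PUSH', 1), ('PUSH', 2), ('ADD', None), ('PUSH', 5), ('MUL', None)]
```

The nesting of the tree disappears entirely; evaluation order is encoded purely by instruction order, which is why stack-based bytecode (JVM, CPython) is such a common compilation target.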

When I was learning, my only resources were books, google, and PHP-src and CPython.

I've been working for a while on a sort of language for writing highly optimized cryptographic code, mostly elliptic curve cryptography where there's a very elegant mathematical description, but when you try to optimize every single detail manually it's easy to get very dirty code.

The idea was then to have a language that maintains the elegance and clarity, and a compiler that generates in my case C code. This let me focus on the "core" of a compiler, without too much emphasis on parsing, tons of optimizations, corner cases.

So this would be my advice. Pick a domain, maybe focus on getting highly optimized code, design a small language and build your first compiler.

Algorithms + Data Structures = Programs by Wirth

Specifically Chapter 5 - Language Structures and Compilers

Write a DSL. This is really easy in languages like Ruby that don't need parentheses around function parameters. You can write a Ruby DSL in a few hours and expand on syntax, features as you like.

The next step is to actually parse code into an abstract syntax tree. The easy path here is actually JavaScript. There are many transpilers for JS, so things like Bluebird/Harmony can work.

Next step is to do things "the right way"(tm) with yacc/bison and build a compilable language. For that, I suggest you read a few books on languages and understand concepts such as Backus-Naur form before continuing.

If at any point you find yourself looking for parser generators, Nearley JS is a fun tool for experimenting with lightweight grammars. (Emphasis on lightweight - it's a javascript library.) It has good docs as well. If you're interested in the Earley algorithm, you can run it in a nodejs shell and look under the hood at the data structure, step-by-step as it scans the string.


Lots of great advice here. In addition to "use prolog" and "play with ometa/ohm" - you might want to consider looking at a ML - like ocaml (used for bootstrapping rust) or standard ml, sml.

Another approach is to look at rpy - the restricted python subset used for pypy, see for example "other languages" on:


I am working on my first programming language. I decided to use Racket and I am using the book Beautiful Racket and miscellaneous blog articles to guide me.

I started out just working on a set of linked/cooperating libraries and then decided to offer both the libraries and a custom language. I am writing something for my own use but when I release the MVL, I obviously also hope to get users and pull requests.

Languages.. are everywhere. You keep making them even if you do not notice. Say, a function with 3 arguments is a small language.. just missing all the fuss, checks and generality. A repeating data structure that feeds some switch-decision tree is another language.. an implicit one; essentially the code is an interpreter.

But it's much better to make languages explicit. And there's no need to invent new grammars and parsers.. keep those for real extreme cases.

I express my languages in Python; it's the easiest to use as a base and build semantics on top of (and even syntax). One can just use function/class-based declarations of domain-specific notions, and then interpret them any way one wants. Or one can even turn function calls into "statements", and the "with" construct into "blocks of statements", etc. This gives you a strict, debuggable proto-syntax on which to build your own "syntax"/semantics/error-checking/reporting.
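A hedged sketch of that style, with every name invented for illustration: function calls act as "statements", and a with-block acts as a "block of statements" delimiting one declaration:

```python
# Plain Python as the "proto-syntax" for a tiny embedded DSL: a context
# manager opens a "block", and calls inside it are the "statements" that
# configure a Pipeline object. All names here are hypothetical.
from contextlib import contextmanager

class Pipeline:
    def __init__(self, name):
        self.name = name
        self.steps = []

    def step(self, fn):
        self.steps.append(fn)

    def run(self, value):
        for fn in self.steps:
            value = fn(value)
        return value

@contextmanager
def pipeline(name):
    p = Pipeline(name)
    yield p
    # on exit you could validate or "compile" the declared pipeline here

with pipeline("demo") as p:      # "block of statements"
    p.step(lambda x: x + 1)      # each call is a "statement"
    p.step(lambda x: x * 10)

print(p.run(4))  # 50
```

You get Python's parser, tracebacks, and debugger for free, and can graduate to a real surface syntax later if the notation proves itself.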

But anyway, try to see/feel the Language behind the syntax/implementation. Have fun.

Probably the greatest amount of growth I've had in my lifelong programming journey was during the years when I was working on making my own scripting language. I think the skills I learned have shaped how I approach other programming projects. Think debugging is hard? Try debugging why your script interpreter isn't running a given script correctly.

Jack Crenshaw, all the way. Parser for sure: go recursive descent. Don't mind 386 assembler - generate stack machine code.

For my capstone class for my CS BS, we had to write a compiler that compiled to machine code that the virtual machines we wrote could run. I can't remember the name of the book, but we called it "The Dragon Book." Looking in the other comments, it looks like someone already linked it.

There is a nice tool called "The Gold Parser Builder" which is free and can help you design a programming language.


A language's apparent simplicity doesn't mean it will be easy to write a parser for; the grammar of SQL is actually quite lengthy. As others have suggested, a 'harder' language like Lisp might be a better place to start.

I found this amazing site (craftinginterpreters.com), and I made a (barebone) language in a little more than a week. I used Instaparse instead of a custom scanner and parser to save time.

I think there is value in just writing an interpreter and figuring things out as you go along. It won't be optimal, but it will bring to mind the questions you need to ask to make it better.

If you haven't yet, you may enjoy reading "Concepts, Techniques, Models of Computer Programming" [1]. The reason is that it is chock-full of interesting ideas:

- kernel language, which is not a practical language, but contains all the concepts for building one of a given model

- growing said kernel language one concept at a time, when its expressiveness is stretched

- the case for multi paradigm languages, with a real example throughout the book (Oz)

- and I mean multiparadigm (although the author dislikes this word): first declarative, then with concurrency, then with ports, then with explicit state, then object oriented with and without concurrency (which in fact isn't really an extension of the kernel language), then relational ... all being simple but very careful extensions to that kernel language

- the case for each concept, which results in an extension to the language, is very carefully explained and many alternatives are considered, and much justification is given to why it would improve the language; if you are considering designing your own language, you could do worse than learning from the author's carefully described approach

- each concept results in an addition to the grammar, the semantics of which are described in detail; they are also described formally in an appendix

- whenever a new programming model is described (which is basically a set of concepts), an archetypal example is discussed: Haskell (concurrent declarative), Erlang (message passing), Java (concurrent OO), Prolog (relational)

- techniques for programming with each concept are introduced

- each chapter has exercises at the end, several of which consist of doing your own extensions to the language; some of these are research level

- as an example of how deep but accessible this book is, check the "Principal programming paradigms" poster [pdf] [2]

- the extensive bibliography at the end is fantastic

And since I believe that it hasn't been mentioned in the comments and in case you don't know yet, check out "Lambda the Ultimate", "the programming languages weblog" [3]

[1] https://www.info.ucl.ac.be/~pvr/book.html

[2] https://www.info.ucl.ac.be/~pvr/paradigmsDIAGRAMeng108.pdf

[3] http://lambda-the-ultimate.org/

Decide first whether your language will be case-sensitive or case-insensitive. Then regret your choice forever.

This is going to sound "out of left field" perhaps, but use Prolog.

I've been exploring exactly this, compiling with Prolog, this past month or so and I'm simply blown away. It's so simple, elegant, and easy. It will be faster and easier for you to learn the basics of Prolog and then learn how to write compilers using it, than to learn to write compilers in another language (even a language you already know.) Prolog is an extremely simple language despite its incredible power. It's very different from imperative/functional languages so it might take a minute to wrap your head around it, but then you're operating on a higher plane.

Read "Logic Programming and Compiler Writing" by David Warren: http://sovietov.com/tmp/warren1980.pdf But beware: it's dated. Or, if you can get a copy of "The Art of Prolog", you can read Chapter 24, which covers the same concepts but has more modern code. Specifically, you're going to want to use Definite Clause Grammars (DCGs):

https://en.wikipedia.org/wiki/Definite_clause_grammar http://www.pathwayslms.com/swipltuts/dcg/index.html https://www.metalevel.at/prolog/dcg http://www.amzi.com/manuals/amzi/pro/ref_dcg.htm

Study "Toy compiler in Prolog" http://faculty.cooper.edu/smyth/cs225/ch7/prolog.htm And maybe this: https://github.com/fusiongyro/dailyprogrammer/tree/master/20...

I can vouch for this. I wrote a parser for Oberon in two days. I wrote an assembler for the RISC CPU used in Project Oberon in another two days. Both of them are no longer than two pages and the code is essentially just grammar rules. I haven't (yet) written the bit in-between that translates the AST from the parser into assembly code for the chip, but only because I have other fish to fry. But I can see how to do it and it would be very easy. The compiler is described in the Project Oberon book if you want to attempt that language.

I'm implementing a compiler for Joy on top of the assembler and it's been like a dream. Because the code is so simple, and the editor I'm using has an extension that uses the Prolog interpreter to lint as I go, it's actually hard to introduce bugs. I write the asm code then abstract patterns into the compiler to refactor and (lightly) abstract the raw asm into what are effectively AST nodes. But there is no language above this. I think I have to write a blog post called "Compiler for No Language" or something. Eventually, I am going to abstract my way to "atomic" nodes that represent Joy functions and then I'll have a Joy compiler. But in the meantime, I'm floating around in a semantic sea between parsing and assembly, and it's warm and tropical and you would not believe the fish I've seen.

I have to conclude that anyone who is writing any sort of compiler and NOT using Prolog is working way too hard.

I've been programming in Python for twenty years, happy as a clam, and now you couldn't pay me enough to NOT use Prolog.

"These are the words of VAL. They are true. Hear them."

In fact, it would actually be easier to implement Prolog then implement a compiler in it, than to implement a compiler directly in e.g. C, Java, Lisp, etc.

Understanding how computers work is precisely why compilers is the capstone class in many CS programs: it unites automata with algorithms and machine code, bringing everything together in one program. I'm pretty sure I wouldn't have gotten through this exercise myself outside college.

A lot of the resources out there are going to assume you have one or two sides of this triangle already and are ready to bring it all together. If you don't then you'll probably feel like you're in over your head.

I would recommend checking out "Let's Build a Compiler". It's a bit old-fashioned but at least things are pretty simple. https://compilers.iecc.com/crenshaw/

Different flavors of Markdown are not a bad place to start. Try and find something with a real syntax. As I recall, the original/official version is defined by a Perl program and has no true grammar. But other variants probably do.

SQL is not simple. Just the parser for PostgreSQL's dialect was around 1500 lines of code, last I checked. Although, if you did a small subset, maybe it wouldn't be so bad.

I really like the book Semantics with Applications, which discusses the meaning of different programming languages. When I was starting out, it seemed to me like there was a lot riding on the syntactic differences between languages like C and Pascal. Now, having seen more of what's out there, a lot of those differences are not as significant to me as they were before. This book helps you drill into what makes some languages different below the surface, and how to model those differences formally.

I think it's eminently possible to make a small language, like "While", in a few pages of code. I would start with a very simple unambiguous grammar, and a little interpreter for it, but if you really want the full experience, generate assembler. This paper sort of presents the language I'm talking about, but there is a lot here about type theory and semantics that may not be interesting to you, so just ignore that. https://arxiv.org/pdf/1603.08949.pdf When I took compilers ~15 years ago, we generated code for nasm because gas syntax was too weird for us. Anyway.
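A While-style interpreter really can fit in a few pages. Here's a hedged sketch in Python with the AST written directly as nested tuples (the node names are my own), skipping the parser entirely; "While" only needs assignment, sequencing, if, and while:

```python
# A tree-walking interpreter for a While-style toy language. Statements
# mutate an environment dict; expressions are ints, variable names, or
# (op, lhs, rhs) tuples.
def exec_stmt(stmt, env):
    tag = stmt[0]
    if tag == "assign":
        _, name, e = stmt
        env[name] = eval_expr(e, env)
    elif tag == "seq":
        for s in stmt[1:]:
            exec_stmt(s, env)
    elif tag == "if":
        _, cond, then_s, else_s = stmt
        exec_stmt(then_s if eval_expr(cond, env) else else_s, env)
    elif tag == "while":
        _, cond, body = stmt
        while eval_expr(cond, env):
            exec_stmt(body, env)

def eval_expr(e, env):
    if isinstance(e, int):
        return e
    if isinstance(e, str):
        return env[e]            # variable lookup
    op, a, b = e
    a, b = eval_expr(a, env), eval_expr(b, env)
    return {"+": a + b, "-": a - b, "*": a * b, "<": a < b}[op]

# factorial of 5:  n := 5; r := 1; while 0 < n do (r := r*n; n := n-1)
prog = ("seq",
        ("assign", "n", 5),
        ("assign", "r", 1),
        ("while", ("<", 0, "n"),
         ("seq",
          ("assign", "r", ("*", "r", "n")),
          ("assign", "n", ("-", "n", 1)))))
env = {}
exec_stmt(prog, env)
print(env["r"])  # 120
```

Swapping the recursive `exec_stmt`/`eval_expr` calls for code emission is then the step up to the "generate assembler" version of the exercise.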

Another approach that looks interesting to me is "many small steps": http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf The idea here is to start with a program that only knows about integers, but handle everything end-to-end. Then you slowly add features one-by-one, but in an end-to-end fashion. I think this would be a lot easier to iterate and it hearkens back to Crenshaw's approach.

A lot of compiler construction material focuses on Lisp and Scheme. This is because their syntax sort of hides or ignores the parsing problem. You will probably have to learn some other languages to build a compiler; it wouldn't hurt to know some Lisp or Scheme to be able to access these resources. A lot of people build a Lisp-like language as their first for this reason too. Just something I'm pointing out.

Anyway, it's a noble goal and best of luck!

IMO, The best example of designing and implementing a "small" language can be seen in Chapter 8 of "The Unix Programming Environment by Kernighan & Pike" where they walk you through the construction of "hoc" (higher order calculator) using the standard compiler construction tools. The example is realistic, not too small nor too big and every step is explained. The exposition is very clear and you really can "get" how languages are designed and implemented. This should be the first thing you should read :-)

If you have a specific problem, you can then write a small DSL for it. I once had the opportunity to write a small "test language" for the sub-protocols of H.323. These protocols are specified using SDL (https://en.wikipedia.org/wiki/Specification_and_Description_...) and I just converted the tokens in the spec directly into my language and put a simple markdown syntax around it. The tests simply exercise the protocol FSM.


            Var1=Val1                  // these are assignments
            MsgFromNetwork=<msg name>  // trigger to FSM
            Var1=Val1.1                // these are assertions
Though simple, this worked nicely, allowing me to push off the actual writing of the test cases to non-programmers who just had to convert the SDL diagrams from the spec. to text as given above.
