
Resources on creating programming languages - zephyrfalcon
https://tomassetti.me/resources-create-programming-languages/
======
justinpombrio
I was cynically expecting only a bunch of resources about parsing and
compilation, but was pleasantly surprised that the first section was about
language design!

Here's a shameless plug for Brown's PL course last year. The videos and
assignments are all online. The assignments involve exploring mystery
languages, and aim to teach a bit about the consequences of language design.
Next year we might do a MOOC.

[http://cs.brown.edu/courses/cs173/2016/index.html](http://cs.brown.edu/courses/cs173/2016/index.html)

~~~
dyarosla
I liked the fact that it started with language design too. One of the links
points to a Paul Graham essay which touches on the idea that "You Need an
Application to Drive the Design of a Language": that writing a language for
the sake of writing a language, without an application in mind, isn't as
effective[0]. Things like that, i.e. thinking about the reason for creating
the language in the first place, are a fundamental starting point imo.

[0]
[http://www.paulgraham.com/langdes.html](http://www.paulgraham.com/langdes.html)
(Number 3)

~~~
aaron-lebo
So I checked out that link and:

 _This may not be an absolute rule, but it seems like the best languages all
evolved together with some application they were being used to write. C was
written by people who needed it for systems programming. Lisp was developed
partly to do symbolic differentiation, and McCarthy was so eager to get
started that he was writing differentiation programs even in the first paper
on Lisp, in 1960._

...but can that claim really be supported? Are C and Lisp "the best"
languages? And if that "rule" is true, why does the claim have to be tempered
for both examples? (C was written by people who needed to do programming at
the time; was PHP not written for needs, too? Lisp was only kind of developed
for symbolic differentiation; and was Lisp 1.6 one of the best languages, or
do we prefer Scheme, Common Lisp... Arc?)

He could have at least tried to make us believe it! :)

~~~
fmap
The truth seems to be that we don't know what makes a good language. In
academic research we have criteria for what makes a language expressive, and
tools for analyzing existing languages to check whether they allow for modular
development or are fundamentally incapable of building sound abstractions.

What we _don't have_ are answers to the "soft" questions. What makes a
language pleasant to use (ignoring familiarity)? Which abstractions are easy
to communicate and why or why not? What's the best way to communicate intent
to a computer? It seems very unlikely that the answer should be "a text
editor", but so far all alternatives have turned out to be terrible...

What's especially vexing about this is that it looks like a solvable problem.
People in the social sciences have been measuring far more ephemeral
phenomena; the only problems are funding and finding the right people to run
such big and extensive experiments.

~~~
loup-vaillant
> _The truth seems to be that we don't know what makes a good language._

On the other hand, we do know what makes a _bad_ language. We can spot flaws
in a language, and fixing them automatically makes the language better, at
least in the aspects one is interested in. C, for instance, has a number of
flaws and questionable tradeoffs that are widely known by now:

Contextual grammar. While not too bad (the context is easy to maintain in the
parser), unknown identifiers can turn into syntax errors, and error messages
get slightly worse.
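
A minimal sketch of the classic case (names made up): the same tokens parse
differently depending on whether an identifier names a type.

```c
#include <stddef.h>

typedef int T;

void example(void) {
    /* Because T names a type here, this line declares p as a pointer.
       If T were a variable, the very same tokens would parse as a
       multiplication. The parser has to track typedefs to decide, and
       a misspelled type name surfaces as a baffling syntax error. */
    T * p = NULL;
    (void)p;
}
```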

Complicated syntax for types. Matching declaration and use may have sounded
neat at the time, but doing so kills the separation between an entity's name
and its type. The result is not very readable. An ML-like syntax would be much
better.
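
A sketch of the readability problem; the ML-style rendering in the comment
below is hypothetical.

```c
/* C buries the name inside the type: handlers is an array of 8
   pointers to functions taking (const char *, int) and returning int. */
int (*handlers[8])(const char *, int);

/* A hypothetical ML-like "name : type" spelling of the same thing:
   handlers : (string * int -> int) array[8] */
```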

Switch that falls through by default. The switch statement was clearly
designed from an implementer's point of view: it maps nicely to a jump table.
In practice, however, over 95% of switch statements do not fall through.
Switch should break by default. (Having multiple cases branch to the same code
is more common, but that is not incompatible with breaking by default; see how
ML-style pattern matching does it.)
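
A minimal sketch of both points (names made up): the silent fall-through bug,
and the legitimate shared-body case that break-by-default would still allow.

```c
#include <stdio.h>

void describe(int n) {
    switch (n) {
    case 0:
        printf("zero\n");
        /* forgotten break: falls through, so describe(0)
           also prints "one" */
    case 1:
        printf("one\n");
        break;
    case 2:  /* multiple labels sharing one body: the common case, */
    case 3:  /* perfectly compatible with break-by-default */
        printf("two or three\n");
        break;
    default:
        printf("something else\n");
    }
}
```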

No direct support for sum types (tagged unions). We have structs and unions,
and with them we can emulate algebraic data types. It's a pain in the butt,
however. Automating this somewhat would be nice, since sum types are so widely
useful. Even if they were used _just_ for error handling, that would be a nice
bonus.
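
A minimal sketch of the emulation (names made up). Everything a compiler with
real sum types would enforce is left to convention here: nothing stops you
from reading the wrong member.

```c
#include <stdio.h>

typedef struct {
    enum { OK, ERR } tag;
    union {
        int value;          /* meaningful only when tag == OK  */
        const char *msg;    /* meaningful only when tag == ERR */
    } as;
} Result;

Result divide(int a, int b) {
    if (b == 0)
        return (Result){ .tag = ERR, .as.msg = "division by zero" };
    return (Result){ .tag = OK, .as.value = a / b };
}

int main(void) {
    Result r = divide(10, 0);
    if (r.tag == OK)
        printf("%d\n", r.as.value);
    else
        printf("error: %s\n", r.as.msg);
    return 0;
}
```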

No generics. Makes it much harder to abstract away common code. Want to write
a generic hash table? Good luck with pre-processor magic.
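
A sketch of the usual workaround (names made up): stamp out one copy of the
code per type with the preprocessor. It compiles, but errors point into the
macro and nothing checks the instantiations for you.

```c
#define DEFINE_PAIR(NAME, T)                     \
    typedef struct { T first; T second; } NAME;  \
    static NAME NAME##_make(T a, T b) {          \
        NAME p = { a, b };                       \
        return p;                                \
    }

DEFINE_PAIR(IntPair, int)
DEFINE_PAIR(DoublePair, double)

/* Usage: IntPair p = IntPair_make(1, 2); */
```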

Textual macros. We can do, and have done, better than that.

Too many undefined behaviours. In the name of portability, many things that
would have worked fine on most architectures are now undefined because _some_
architectures couldn't handle them. And now we have silly stuff such as
undefined signed integer overflow. But we can't remove them because compiler
writers justify this madness with optimizations! (For the record, I have seen
Chandler Carruth's talk on the subject, and I disagree: when compilers remove
security checks because of undefined behaviour, it is just as bad as nasal
demons.)
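
A minimal sketch of the kind of removed check in question (names made up).

```c
#include <limits.h>

/* Intended overflow check. Because signed overflow is undefined,
   the compiler may assume x + 1 never wraps and fold this whole
   function to "return 0;" under optimization. */
int will_overflow(int x) {
    return x + 1 < x;
}

/* The well-defined way to ask the same question: */
int will_overflow_checked(int x) {
    return x == INT_MAX;
}
```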

Silly `for` loop. That damn thing is fully general; we don't need that, we
have the `while` loop. A simpler, less general syntax would have allowed the
optimisations that currently rely on exploiting undefined signed integer
overflow, and then some.

---

Of course, it's easy to criticise a language over 40 years after the fact. But
that's kind of the point: we have learned a good deal since the 70's. A
language written now could not justify most of the flaws above. This is why so
many people (me included) dismissed Go out of hand: not providing generics in
a new statically typed language is just silly.

Sure, designing a good language is still very hard. But we can avoid more
mistakes now than we could some decades ago.

~~~
msangi
This is a good point.

At least I find it encouraging that some of the newer languages are taking
errors from the past into account.

They're not going to be perfect, and they surely have their own flaws, but
that's something that someone will fix in a couple of decades, once we know
which new ideas are actually good and which are bad.

------
rattray
For anyone thinking about building a language with a JavaScript target, I
would strongly recommend building on Babel. I've found the tooling to be
terrific.

In my case, I wanted to build a (rough) superset of JS, adding the syntax and
language features I was interested in but otherwise starting from a working
(tested!) language. This made the task dramatically easier.

Forking babylon, Babel's parser, worked well for this, but you could also
write your own parser from scratch and use babel-types to build your AST.

In the end, I have a ~1,000-LOC babel plugin "compiler"[0] with maybe
1,000-2,000 LOC of mods to the parser[1], resulting in a language that looks
fairly different from JS but feels pretty solid. Users of the language benefit
from Babel's tooling: it works with webpack, ESLint (decently), etc.

[0] [https://github.com/lightscript/babel-plugin-lightscript/](https://github.com/lightscript/babel-plugin-lightscript/)

[1] [https://github.com/lightscript/babylon-lightscript](https://github.com/lightscript/babylon-lightscript)

~~~
mark_l_watson
Thanks for the reference to your project; I might use it as a guide for
writing a language. I like the book "Build Your Own Lisp" because the simple
implementation is easy enough to understand and hack. Using your example as a
guide would be a second step toward building something more practical. Very
cool stuff.

~~~
rattray
Thanks! Would be happy to help – email is in my profile if you're interested.

------
mhh__
Appel's Modern Compiler Implementation in ML is really good, both as an
introduction to compilers and for its detail on implementing the runtime, i.e.
garbage collection. There are versions for C and Java, but ML is much easier
to read in a book.

If you want a (slightly ugly) encyclopedia of compiler optimisations and IR
designs, then Steven Muchnick's Advanced Compiler Design and Implementation is
great.

Also, on the functional front, SPJ's book ([https://www.microsoft.com/en-us/research/publication/the-implementation-of-functional-programming-languages/](https://www.microsoft.com/en-us/research/publication/the-implementation-of-functional-programming-languages/)) is really good.

While unfinished, Write You a Haskell is also a really interesting read.

~~~
Gracana
I have the C book... It is quite antiquated and has a lot of ugly pointer math
in the tokenizer/lexer beginning parts. It does sound like the ML book is a
better choice.

~~~
nickpsecurity
"It is quite antiquated and has a lot of ugly pointer math in the
tokenizer/lexer beginning parts."

It's possibly C code written by a guy who mainly does ML, with a whole career
of kicking ass in ML languages and provers. That was my hypothesis when I saw
both books available, so I got a copy of the ML one instead. I appreciate the
confirmation that it was a good choice. ;)

~~~
mhh__
Yeah. AFAIK the C version only exists because ML is too niche for mainstream
publishing. ML actually works as decent pseudocode (or is at least easier to
mentally transpile to other languages, unlike e.g. Haskell), so the C/Java
versions are basically pointless.

~~~
nickpsecurity
That's what I was guessing. Especially after reading the article below and
seeing lots of compiler/prover authors go with ML's.

[https://news.ycombinator.com/item?id=14123100](https://news.ycombinator.com/item?id=14123100)

------
tormeh
What I've come to realize is that what we need is a language that lets you
manage your level of technical debt, as in "langc mysource.lang
--technical-debt=5". You generally want to start out writing PHP at the
beginning of a project. At some point, however, the technical debt will force
you into maintenance mode. At that point it would be nice if you could turn
down the compiler's tolerance from PHP-level to Rust-level, via Python and
Java. You usually start caring about performance around the same time, so you
also need to move from interpretation to a non-GC'd binary. Wouldn't it be
nice if you could scale the strictness and compiledness of the language
according to how messy and performance-critical your module or program has
become? Granted, we'd need several backends for the same language and the
compiler would become really quite messy, but it's interesting to consider.

~~~
wcrichton
I can't agree more. I think a lot of the pieces to enable such a compiler
stack are there (gradual typing, extensible parsers, and so on), but nobody
has put them all together in one place.

~~~
TeMPOraL
Common Lisp? Except that the typing could use some extension (probably could
be provided through macros).

------
velox_io
This is one of the most overlooked topics in computing.

I've developed an in-memory database, naturally with its own programming
language.

I disagree that you have to have an application in mind to design your
language (sorry GP), but you definitely need use-cases in mind, as every
decision has trade-offs.

Some discoveries I've made along the way:

1. High-level - I remember having a for-each loop zoomed in on my TV; I may
have looked at it for a couple of hours, thinking "how can I make it better",
and it did pay off!

2. Parsers/lexers - I couldn't find one that worked the way I needed (i.e. one
that allowed me to remove lexemes that would normally be considered
essential), so I started writing my own.

3. Abstract syntax trees (ASTs) - I don't think there's a guide on compilers
that doesn't mention ASTs. I found them to be more of a hindrance than a help,
so I removed them: yes, the parser generates the program code directly! All
that is needed is a small stack of starting points. This works better than
anticipated; it's still a single-pass compiler, and you rarely need to know
what comes next (before delegating). There are checks at every step, and the
program can also regenerate source code (reverse-engineer itself), so moving
away from ASTs doesn't mean the result is harder to test. An unexpected result
of going down this path is that the compiler 'understands' the code better
than most and can supply actual suggestions rather than cryptic error
messages; in fact it will reject code that won't compile or that breaks
dependencies. (A rough sketch of the general idea follows below.)
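
Rough sketch of the general technique (syntax-directed translation), not my
actual code: a recursive-descent parser over arithmetic expressions that emits
stack-machine instructions (made-up names) as it parses, with no AST anywhere.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

static const char *src;   /* cursor into the source text */

static void expr(void);

static void skip(void) {
    while (isspace((unsigned char)*src)) src++;
}

static void factor(void) {
    skip();
    if (*src == '(') {
        src++;
        expr();
        skip();
        if (*src != ')') { fprintf(stderr, "expected ')'\n"); exit(1); }
        src++;
    } else if (isdigit((unsigned char)*src)) {
        char *end;
        long n = strtol(src, &end, 10);
        src = end;
        printf("PUSH %ld\n", n);   /* code is emitted immediately */
    } else {
        fprintf(stderr, "unexpected '%c'\n", *src);
        exit(1);
    }
}

static void term(void) {
    factor();
    for (skip(); *src == '*' || *src == '/'; skip()) {
        char op = *src++;
        factor();
        puts(op == '*' ? "MUL" : "DIV");
    }
}

static void expr(void) {
    term();
    for (skip(); *src == '+' || *src == '-'; skip()) {
        char op = *src++;
        term();
        puts(op == '+' ? "ADD" : "SUB");
    }
}

int main(void) {
    src = "1 + 2 * (3 - 4)";
    expr();   /* emits: PUSH 1, PUSH 2, PUSH 3, PUSH 4, SUB, MUL, ADD */
    return 0;
}
```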

The lesson here is: don't be afraid to experiment. Yes, it has taken some
time, with improvements in each iteration (I've done some housekeeping
recently, swapping out ~30% of the parser, and it hasn't been as daunting as
you might imagine).

------
joshmarinacci
For those just getting started on building languages, here is a beginner's
tutorial I created using JS and Ohm.
[https://www.google.com/amp/s/www.pubnub.com/blog/2016-09-26-...](https://www.google.com/amp/s/www.pubnub.com/blog/2016-09-26-4-tutorials-create-language-in-less-than-200-lines-of-code/amp/)

~~~
whitten
Is there a tutorial for building languages that uses an Earley parser? I've
always had a dream of using a controlled natural language for coding, or as a
view of existing code.

------
ternaryoperator
Surprised not to see reference to Walter Bright's excellent article from Dr.
Dobb's, "So you want to write your own language?" [1]

[1] [http://www.drdobbs.com/240165488](http://www.drdobbs.com/240165488)

~~~
ftomassetti
Right, but the list had to be limited in scope, and I focused more on longer
pieces of content.

------
fizixer
I've been meaning to understand the idea of programming languages removed from
the syntactic details, removed from the issue of converting a program to a
machine-executable form, and even removed from optimizing a program.

I have a hunch that Krishnamurthi's book might be a good read in that regard
(though I'm not sure).

However, I wonder what's left after you remove the issues I mentioned above.
There are two cases: typed language and untyped language. In the first case
you're left with Type Theory. In the second case you're left with either a
Turing Machine or a Lambda Calculus (e.g., implemented in a syntax-free
language like Scheme). Is this the correct line of thinking? Or is there more
to PLT than what I described?

To continue the thought further: in the case of Scheme, you have the base
language on top of which you can build any PL feature or any PL paradigm you
want. That means what's left is a survey/exploration of all possible PL
features. Here I'm not interested in syntactic features, so what's left is
really a survey/exploration of all possible PL paradigms, or programming
paradigms [1]. So I guess what I'm really after is a massive study of the
equivalent of all programming languages, but in a syntax-free way, i.e., in
Scheme syntax! (And I'm looking for a book that can provide that.)

[1]
[https://en.wikipedia.org/wiki/Programming_paradigm](https://en.wikipedia.org/wiki/Programming_paradigm)

~~~
kd0amg
> _However, I wonder what's left after you remove the issues I mentioned above. There are two cases: typed language and untyped language. In the first case you're left with Type Theory. In the second case you're left with either a Turing Machine or a Lambda Calculus (e.g., implemented in a syntax-free language like Scheme). Is this the correct line of thinking? Or is there more to PLT than what I described?_

Well, there's still what you describe in the rest of your post. Even something
minimalist like Scheme has a load of features above and beyond plain lambda
calculus: dynamic parameters, continuations, dynamic-wind, mutable variables,
boxes, exceptions, the whole macro system, and so on. Not every programming
language feature that can exist has already been invented. We also don't fully
understand the implications of every language feature that has already been
invented -- there's plenty of work to do in designing type systems that can
cope with more language features. You're not going to find _one_ book that
covers _every_ feature people have come up with, but PLAI at least gives a
fairly broad guided tour. I would add a book on semantics (probably Winskel,
maybe Gunter) to develop a mental toolkit for carefully defining and reasoning
about new language features.

~~~
fizixer
I guess you have a point. Thanks for your comment (and recommendations).

------
itsmemattchung
Awesome--I've been knee-deep in learning how C code translates into assembly
instructions, learning more about the compiler's roles and responsibilities;
but in the back of my mind, I always wonder: why is the C language (or any
language, really) designed the way it is? A more concrete example: instead of
"how do closures work in Python?", I wonder "WHY are closures designed to work
this way in Python?"

~~~
nickpsecurity
It mostly wasn't designed, if you're talking about C. It was almost totally
the result of terrible hardware, plus personal preference derived from BCPL,
which was itself a product of terrible hardware. The Vimeo link below goes
from one paper and computer to another, from CPL to BCPL to B to proto-C's
(IIRC) to C itself. It also shows the interplay between C and C++ development.
I'm also including my rant that summarizes the video in bullet points with
some commentary. I still need to update it, though, with critiques from the
last time it was on HN.

[https://pastebin.com/UAQaWuWG](https://pastebin.com/UAQaWuWG)

[https://vimeo.com/132192250](https://vimeo.com/132192250)

Now, for the other languages, who knows. I agree it's always fascinating to
learn how they were done. I didn't get that feeling learning C's history. Not
like when I studied Wirth's Lilith and Modula-2:

[https://en.wikipedia.org/wiki/Lilith_(computer)](https://en.wikipedia.org/wiki/Lilith_(computer))

Note: Full description is in references under "Personal Computer Lilith."

------
rendall
Is it just me, or do the scroll bars disappear on that site? In the latest
versions of both Chrome and Firefox, I cannot scroll down.

~~~
camgunz
Yeah they're gone for me too. Space, Page Down or Down Arrow do nothing.

------
kushti
I would like to read anything on cost-analysis, especially efficient
cost-analysis on ASTs.

