
Designing a Programming Language - xigency
http://ducklang.org/designing-a-programming-language-i
======
toolslive
"A program written in C can be deployed on virtually any operating system".
Let me kill that illusion: While it is possible to write C that can be
deployed on virtually any operating system, doing so is quite hard.

Operating systems have different system calls, and even within the same
family (Linux vs. Solaris, for example) the semantics can differ. Another
issue is that the widths of the basic primitive types vary: 'long' is not
equally wide on all platforms; heck, even 'char' can be 16 bits on some exotic
systems. Anyway, writing portable C is an extra level of complexity. Just
write up a little socket server in C and try to get it running on Windows,
Linux and Solaris, maybe mix in 32- and 64-bit, and see if you still have the
same opinion afterwards.

~~~
ectoplasm
Love it or hate it, that's what Autoconf and friends are for. I get to use the
same CLI on every computer I have thanks to Autoconf. Also, how many of the
compilers and VMs for the languages you love were written in something other
than C/C++?

~~~
toolslive
Most of them end up self-hosting, but plenty of them are initially written in
a language that's well geared to building compilers. Rust, for example, was
bootstrapped using OCaml to write the first compilers. Smart choice.

~~~
jwdunne
I believe many lisps do this too. This is where I learned about self-hosting
compilers before realising they're pretty much everywhere. I remember looking
at the source of a CL impl and thinking "what? It's all in Common Lisp?!"

Despite my ignorance, it's pretty common. C compilers tend to be self-hosting
from my understanding. It seems like madness to maintain it all in x86 without
some justification.

~~~
jwdunne
Taking another look at SBCL, even the assembler and its code are written in
Lisp:

[https://github.com/sbcl/sbcl/blob/master/src/assembly/x86/ar...](https://github.com/sbcl/sbcl/blob/master/src/assembly/x86/arith.lisp)

Very interesting stuff.

In the same repo, you can see that the compiler itself is also written in SBCL.

~~~
gmfawcett
It's written in ANSI Common Lisp, not in an SBCL dialect: you can bootstrap
SBCL with an ANSI CL implementation other than SBCL. IIRC, that was one of the
key reasons for the original SBCL/CMUCL split.

------
jrapdx3
The article starts out with a simple, incremental approach; it's pretty easy
to grasp the concepts all the way through the discussion of the lexer and the
digesting of the program input.

Then all of a sudden the density of the material accelerates dramatically. The
discussion of the parser, and parsing as it applies to the language, becomes
very opaque to readers not already familiar with the technologies being
described.

It's perfectly OK to proceed that way for a sophisticated audience, but the
first part of the article is written at a much more introductory level. An
unsuspecting beginner may very likely feel overwhelmed about half-way through,
and probably give up at that point.

When I got to the explanation of the AST, no surprise, it was exactly
representable as an sexpr, which anyone with a few hours experience with Lisp
or Scheme would recognize. Not sure, _maybe_ it would have made more sense to
begin with the parsed goal (the AST) and work backwards to explicate how
parsing of the original syntax was done.

I guess parsing is a really hard subject to teach, but it's the core idea that
the reader needs to grasp to be able to understand PL construction as the
author laid it out. A still gentler introduction might be possible, though I
know that's easier said than done.

~~~
oskarth
I noticed this phenomenon in my own writing and I think I know where it comes
from. In Polya's list of heuristics (from _How to Solve It_) he writes about
the two rules of teaching:

> The first rule of teaching is to know what you are supposed to teach. The
> second rule of teaching is to know a little more than what you are supposed
> to teach.

When we learn something and then write about it to teach others, oftentimes we
are excited about the things we just learned, so we write about that, even if
we don't fully understand it yet. Then we add an introduction to the "meat"
(i.e. the limits of our current understanding) which is very clear, precisely
because we are "overqualified" to teach it. Another way to state Polya's rule
for teaching is as follows:

 _Being overqualified in knowledge is a prerequisite for being qualified as a
teacher of that knowledge._

One way I like to picture knowledge is as a dark room. You might step into
one, but that doesn't mean you know every corner of it. For example, imagine a
proof or a technique. Unless you know by heart why each assumption is
there, or when the technique fails, you don't fully understand it. If you are
guiding someone through a dark house, as you go from room to room, you are
only qualified to give directions about the rooms you have already been in,
not the one you are in right now - even if that's what is on your mind.

Note that this isn't unique to people writing technical blogs; it happens in
academia a lot too. There's a reason Feynman's lectures on "undergraduate
physics" attracted the attention of graduate students.

~~~
agumonkey
Your comment resonates very weirdly with my experience.

College teachers are guilty of the exponential wall: an easy, lengthy intro,
then steep but short core material (often cut short because the bell is
ringing). Impossibly annoying.

The dark-room exhaustive search is exactly how I ended up learning about
learning. Through music, though: as a self-taught musician I had to visit
every possible spot, even if most of them were holes. And most of the time,
things only 'make sense' because the other things just don't work. Now if you
teach a topic to someone by listing only what 'makes sense', unless that
person is very good at remembering or extremely creative, it will be a
meaningless burden to rote-learn. On the other hand, knowing all the wrong
parts makes you very good at teaching: when the student asks why their
intuition is failing, you know how it feels and you know what to do about it,
how to try other ideas and patterns, and how to revisit things they might
have overlooked.

Douglas Crockford said that once you understand monads, you suffer the curse
of not being able to explain them anymore. I really wonder how often this
happens: how many students learned shortened and polished material from books
without the whole story, the whole experience of the actual researcher? This
tiny offset in understanding, repeated generation after generation, might
lead to a large divergence. Saying this after watching (too many?) Alan Kay
talks about how people ignore the past; they ignore it because it's not
taught, that's all.

------
pearle
Jonathan Blow has a great series [1] on YouTube where he's designing and
implementing an alternative to C/C++ specifically aimed at meeting the needs
of modern game programmers.

It's an interesting watch.

[1]
[https://www.youtube.com/watch?v=TH9VCN6UkyQ&list=PLmV5I2fxai...](https://www.youtube.com/watch?v=TH9VCN6UkyQ&list=PLmV5I2fxaiCKfxMBrNsU1kgKJXD3PkyxO)

~~~
xigency
This is similar to a talk that Tim Sweeney (Epic Games) gave about creating a
new language for games, that I thought was interesting. In particular, he
talks about the tradeoffs of using C# over C++ for game development, and then
talks about ways that ideas from Haskell or purely functional languages can be
brought over that might improve performance and productivity for developers.

"The Next Mainstream Programming Language"

([https://www.st.cs.uni-saarland.de/edu/seminare/2005/advanced...](https://www.st.cs.uni-saarland.de/edu/seminare/2005/advanced-fp/docs/sweeny.pdf))

~~~
shadowfox
I remember seeing this a long time back. Did he (or anyone else) make progress
on this?

~~~
xigency
Um, I sort of doubt it. Unreal Engine 4, while I haven't dug around with it,
is presumably still C++ and low-level stuff.

The way the game industry is moving is still along the same path. With a lot
of amateur and professional developers moving towards platforms like Unity,
it's sort of a backslide in my opinion. Then again, it's almost as if the
target platforms have become too powerful for people to really care what
they're developing on.

(My opinion, not trying to be controversial.)

------
kctess5
Great read - concise yet decently in-depth. Something cool about parsers is
how many practical uses there are, and how easy they are to play with.

I decided to learn Go recently, so I first wrote a little parser combinator
library (which also returns concrete syntax trees). Then I implemented a
little grammar for a shorthand to define parsers, and did the cst -> parser
step to make it spit out working parsers. Example here [1]. I got it to work
for recursively defined context-free grammars with some fun [2] deferred
function reference trickery.
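To illustrate the general idea (a minimal Python sketch of the combinator concept, not the Go library linked below; all names are my own): each parser is a function from (input, position) to either a (value, new-position) pair or None on failure, and combinators build bigger parsers out of smaller ones.

```python
# Minimal parser combinators: a parser takes (text, position) and returns
# (matched_value, new_position) on success, or None on failure.

def char(c):
    """Match exactly one occurrence of the character c."""
    def parse(s, i):
        if i < len(s) and s[i] == c:
            return s[i], i + 1
        return None
    return parse

def seq(*parsers):
    """Match all parsers in order; fail if any of them fails."""
    def parse(s, i):
        results = []
        for p in parsers:
            r = p(s, i)
            if r is None:
                return None
            value, i = r
            results.append(value)
        return results, i
    return parse

def alt(*parsers):
    """Try each parser in turn, returning the first success."""
    def parse(s, i):
        for p in parsers:
            r = p(s, i)
            if r is not None:
                return r
        return None
    return parse

def many(p):
    """Match p zero or more times, collecting the results."""
    def parse(s, i):
        results = []
        while True:
            r = p(s, i)
            if r is None:
                return results, i
            value, i = r
            results.append(value)
    return parse

# 'a', then 'b', then zero or more 'c's
parser = seq(char('a'), char('b'), many(char('c')))
print(parser("abccc", 0))  # (['a', 'b', ['c', 'c', 'c']], 5)
print(parser("ax", 0))     # None
```

Recursive grammars are where the deferred-reference trickery comes in: a rule can't refer to itself while it is still being defined, so the self-reference has to be wrapped in a function that looks the parser up at parse time instead.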

I haven't decided what to do next, but I've been kicking around the idea of
writing a notation to describe cst -> ast conversion. That way one could
easily define parsers that generate abstract syntax trees (which can be much
easier to perform high-level graph manipulations on than the underlying cst).
This, paired with a lexer, could be used for quite a few practical
applications, including writing special-purpose languages/compilers.

I'm also interested in investigating the efficiency/time complexity of the
library I've made and seeing what can be done to speed things up. I'm fairly
sure a few tricks could speed things up significantly. Could be interesting to
do some profiling to see how Go is doing with the functional code.

I did this all as a learning exercise/toy project, so I know that there are
similar things already out there and that this is a solved problem. I might
write something about all this once I do something more interesting with it.
Eventually I might try to write an optimizing compiler...

[1] [https://github.com/kctess5/Go-lexer-parser/blob/master/main....](https://github.com/kctess5/Go-lexer-parser/blob/master/main.go)

[2] [https://github.com/kctess5/Go-lexer-parser/blob/master/parse...](https://github.com/kctess5/Go-lexer-parser/blob/master/parse/shorthand.go#L44)

------
keedot
I accept that some people don't know how, or don't want to dedicate the time,
to make a responsive site. I get that; I have sites that don't make sense on
mobile, so we don't bother with a mobile design. But don't take away the
ability to zoom on mobile if you don't have one. You lose a good portion of
your audience.

~~~
xigency
Try refreshing the page.

The site is hosted on GitHub pages and I used a template. I found this tag in
the HTML

<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no">

And changed it to

<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=yes">

~~~
sunilkumarc
The page is not responsive. Hosting on GitHub Pages doesn't mean the web page
will be responsive. It depends on the template you have used, which apparently
is not responsive.

~~~
xigency
Yeah, I'm not a UX guy, I'm a programmer. It's plaintext rendered with HTML.
The <meta> tag above prevented scaling on WebKit phones. I don't really
believe in using templates or overlays because I'm afraid of them interfering
with usability, disabling the ability of a viewer to read the text, or being
distracting.

If you are interested in the content I've posted, you would be best off
printing it out and reading it. It's probably easier on the eyes.

~~~
acqq
You still have widths expressed in pixels, which limits the chance of your
text being readable on some smaller devices. At least I can report that
your page is fully readable using Safari on iPhones in the "reader mode"
(which avoids all the original formatting of the page).

------
arundelo
"Additionally, there are certain types of circular references that may never
be freed under this scheme, although that's not a huge concern to look at from
the start."

Early versions of PHP were released in the late 90s, with PHP 4, the first
Zend Engine version, released in 2000. The first PHP that didn't leak cycles
(PHP 5.3.0) was released in 2009. Beware of how long "not a huge concern to
look at from the start" can last.

[http://php.net/manual/en/features.gc.collecting-cycles.php](http://php.net/manual/en/features.gc.collecting-cycles.php)

[http://php.net/ChangeLog-5.php#5.3.0](http://php.net/ChangeLog-5.php#5.3.0)
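The cycle problem is easy to demonstrate concretely in CPython, which (like PHP) uses reference counting plus a separate cycle collector; this is a small sketch of the general issue, not of PHP's implementation:

```python
import gc

class Node:
    def __init__(self):
        self.ref = None

gc.disable()          # turn off the cycle collector; refcounting still runs
gc.collect()          # clear out any pre-existing garbage first

a, b = Node(), Node()
a.ref, b.ref = b, a   # two-node cycle: a -> b -> a
del a, b              # each node is still referenced by the other, so pure
                      # reference counting never frees them: a leak

found = gc.collect()  # the separate cycle detector is what reclaims them
print(found >= 2)     # True: at least the two Node objects were unreachable
gc.enable()
```

A runtime that ships with only reference counting has exactly this leak for every cycle its programs create, which is what PHP lived with until 5.3.0.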

~~~
xigency
Yes, that is a good point. I had an adviser in college who specialized in
garbage collection algorithms and I unfortunately didn't have the chance to
take his course or learn from him, but there are definitely tradeoffs to all
of the methods out there.

------
charriu
I think the first section confuses a couple of things.

First, a program can be compiled (static) or interpreted (dynamic) as stated
in the article. However, that does not mean that you can't have a dynamic type
system in a compiled language, or vice versa.

Also, if you add type inference, the examples given for variables in dynamic
languages are perfectly valid examples for variables in a language with a
static type system (the type would just be defined on first assignment).
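To make the contrast concrete, here is the dynamic side in Python, with the inferred-static behavior described in comments (since Python has no built-in static checker):

```python
x = 5          # dynamic typing: the *value* has a type, the variable doesn't
assert type(x) is int

x = "hello"    # legal in a dynamically typed language: x may be rebound
assert type(x) is str  # to a value of a completely different type

# In a statically typed language with inference (e.g. an ML-family language
# or Rust), the same first line would fix x's type to an integer at first
# assignment, and the reassignment to a string would be a compile-time error,
# even though no type was ever written out.
```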

~~~
oxymoron
Static and dynamic are theoretically unrelated to compiled or interpreted. The
reason why it seems like these two are the same is that it's much harder to
implement a compiled version of a dynamic language than an interpreted dynamic
language, although I'm sure it's been done. There are plenty of interpreted
static languages though.

I'm missing a discussion of weak vs strong in the article though. Visual Basic
might be statically typed but it does have implicit type coercion which gives
it a very different feel from — say — C#.

~~~
KC8ZKF
Common Lisp is an example of a compiled dynamic language.

~~~
xigency
And a rather opaque one at that. I've never heard of anyone discussing the
implementation details of a Scheme or Lisp compiler, leading me to believe
that they must have really complicated internal logic.

I might be better served applying my time towards development on Chicken
(Scheme).

~~~
lispm
There are books about Lisp implementations and papers about various Lisp
compilers. Actually there is a whole bunch of literature about it.

Personally I find that the article needs a bunch of improvements:

static vs. dynamic vs. statically typed and dynamically typed

These are different things. Either we are talking about runtime behavior (a
dynamic language can change itself or the program at runtime) or we are
talking about typing.

For example Java is a statically typed language, but it allows the use of
class loaders which can reload classes at runtime - which provides some
dynamic features. On the other end of the spectrum are many compiled Lisp
implementations which allow various changes to the programming language at
runtime, for example via an extensive meta-object protocol.

> Attempting to access a value that has not been named in a relevant scope
> leads to a syntax error issued at compile time.

A syntax error? Really? Syntax?

> So called dynamically typed languages are sometimes referred to as duck-
> typed or duck languages.

Definitely not. Duck typing is only a special form of dynamic typing,
usually in an object-oriented language. Non-object-oriented dynamically typed
languages don't provide duck typing, because they lack OOP objects.
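A quick Python illustration of that special form (class and function names are my own): duck typing means dispatch depends only on whether the object has the method, not on any declared type or shared base class.

```python
class Duck:
    def speak(self):
        return "quack"

class Robot:
    # No relation to Duck: no common base class, no declared interface.
    def speak(self):
        return "beep"

def greet(thing):
    # No type check at all: anything with a .speak() method works.
    # "If it quacks like a duck..."
    return thing.speak()

print(greet(Duck()))   # quack
print(greet(Robot()))  # beep
```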

Does compiled Lisp look different?

Maybe... Let's look at SBCL on a 64bit Intel processor:

    
    
        * (defun calc (a b c) (+ (* 2 a) (* 3 b) (* 4 c)))
    
        CALC
    
        * (disassemble #'calc)
    
        ; disassembly for CALC
        ; Size: 118 bytes. Origin: #x1002EC2599
        ; 599:       48895DE8         MOV [RBP-24], RBX               ; no-arg-parsing entry point
        ; 59D:       BF04000000       MOV EDI, 4
        ; 5A2:       488BD1           MOV RDX, RCX
        ; 5A5:       41BBD0020020     MOV R11D, 536871632             ; GENERIC-*
        ; 5AB:       41FFD3           CALL R11
        ; 5AE:       488BF2           MOV RSI, RDX
        ; 5B1:       488B5DE8         MOV RBX, [RBP-24]
        ; 5B5:       488975F0         MOV [RBP-16], RSI
        ; 5B9:       BF06000000       MOV EDI, 6
        ; 5BE:       488BD3           MOV RDX, RBX
        ; 5C1:       41BBD0020020     MOV R11D, 536871632             ; GENERIC-*
        ; 5C7:       41FFD3           CALL R11
        ; 5CA:       488BFA           MOV RDI, RDX
        ; 5CD:       488B75F0         MOV RSI, [RBP-16]
        ; 5D1:       488BD6           MOV RDX, RSI
        ; 5D4:       41BBD0010020     MOV R11D, 536871376             ; GENERIC-+
        ; 5DA:       41FFD3           CALL R11
        ; 5DD:       488BDA           MOV RBX, RDX
        ; 5E0:       48895DF0         MOV [RBP-16], RBX
        ; 5E4:       488B55F8         MOV RDX, [RBP-8]
        ; 5E8:       BF08000000       MOV EDI, 8
        ; 5ED:       41BBD0020020     MOV R11D, 536871632             ; GENERIC-*
        ; 5F3:       41FFD3           CALL R11
        ; 5F6:       488BFA           MOV RDI, RDX
        ; 5F9:       488B5DF0         MOV RBX, [RBP-16]
        ; 5FD:       488BD3           MOV RDX, RBX
        ; 600:       41BBD0010020     MOV R11D, 536871376             ; GENERIC-+
        ; 606:       41FFD3           CALL R11
        ; 609:       488BE5           MOV RSP, RBP
        ; 60C:       F8               CLC
        ; 60D:       5D               POP RBP
        ; 60E:       C3               RET
        NIL
    

Now we can even define the types in Common Lisp:

    
    
        * (defun calc (a b c)
          (declare (fixnum a b c))
          (the fixnum
               (+ (the fixnum (* 2 a))
                  (the fixnum (* 3 b))
                  (the fixnum (* 4 c)))))
        STYLE-WARNING: redefining COMMON-LISP-USER::CALC in DEFUN
    
        CALC
        * (disassemble #'calc)
    
        ; disassembly for CALC
        ; Size: 38 bytes. Origin: #x1002F4B802
        ; 02:       48D1E1           SHL RCX, 1                       ; no-arg-parsing entry point
        ; 05:       488D047F         LEA RAX, [RDI+RDI*2]
        ; 09:       48D1F9           SAR RCX, 1
        ; 0C:       48D1F8           SAR RAX, 1
        ; 0F:       4801C1           ADD RCX, RAX
        ; 12:       48C1E602         SHL RSI, 2
        ; 16:       48D1FE           SAR RSI, 1
        ; 19:       4801F1           ADD RCX, RSI
        ; 1C:       48D1E1           SHL RCX, 1
        ; 1F:       488BD1           MOV RDX, RCX
        ; 22:       488BE5           MOV RSP, RBP
        ; 25:       F8               CLC
        ; 26:       5D               POP RBP
        ; 27:       C3               RET
        NIL
        *

~~~
xigency
Overkill.

This article isn't a dictionary entry, an encyclopedia, or a research paper.
It describes what's necessary to build a programming language for someone who
has the necessary skills, which is someone who programs.

The entire premise of this programming language isn't based on ideas as
rigorous as the ones you're criticizing, and that's been the reaction since
the first time I posted about it.

As for what I was talking about, I'm only interested in dynamically typed
languages that are compiled and provide support for 'eval in this case,
because I don't believe that this is possible without providing a complete
library for an interpreter with a compiled executable. For every other case,
who really cares? Being able to replace every variable declaration with the
keyword `auto' does not a new (or useful) programming language make.
Additionally, all of these other gray areas in-between are not something I am
interested in.

For this purpose, when speaking of the Duck programming language, dynamism
refers to being able to literally manipulate types and data in any way
imaginable, both in terms of runtime behavior AND typing. In any case, it is
designed to be _the most_ dynamic language, as the union of all of these
features, and as such that invalidates a huge number of complaints.

>> Attempting to access a value that has not been named in a relevant scope
leads to a syntax error issued at compile time.

> A syntax error? Really? Syntax?

This is a very minor complaint and mirrors 99% of the criticism I've received.

I wish I could get more interesting feedback for the content of my writing
rather than the semantics.

------
sklogic
Nice. But, as usual, a bit too much emphasis on syntax and parsing - the least
important part of the language. Semantics and intermediate representations are
much more important. Also, not quite convinced about the choice of C vs. LLVM
IR.

~~~
qznc
Why intermediate representation? Build a straightforward AST then convert to
LLVM.

Semantics is important. Before you start writing the frontend, I would advise
to write some programs in your language like a tutorial for newbies.

~~~
sklogic
> Why intermediate representation?

Because you'd need not just one, but many of them, in order to keep things
simple and manageable - for the philosophical background on this, read about
the Nanopass approach: [http://www.cs.indiana.edu/~dyb/pubs/nano-jfp.pdf](http://www.cs.indiana.edu/~dyb/pubs/nano-jfp.pdf)

> Build a straightforward AST then convert to LLVM.

It's only possible for the very low level languages without any interesting
features.

Firstly, you'd need some type system, even if it's mostly dynamic. Secondly,
your frontend may provide quite a bit of syntax sugar that will need further
lowering before you can easily turn it into IR.

Even before typing, you'd have to handle the lexical scoping, which may also
involve an additional intermediate representation (one distinct IR per each
pass is normal).

With this approach you can keep your entire compiler pipeline as a sequence of
trivial, fully declarative, pattern-based tree rewrites. With less fine-
grained approaches you'd end up with a barely maintainable mess, where the
functionality of multiple passes is combined into a single bloated, twisted
pass.
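As a toy sketch of that style in Python (my own illustration, far cruder than the real Nanopass framework): each pass is a tiny tree rewrite over tuple-shaped ASTs, and the compiler is just the passes run in sequence.

```python
# AST nodes are tuples: ('num', n), ('add', l, r), and sugar ('inc', e).

def desugar(node):
    """Pass 1: lower the 'inc' sugar into plain addition."""
    if node[0] == 'inc':
        return ('add', desugar(node[1]), ('num', 1))
    if node[0] == 'add':
        return ('add', desugar(node[1]), desugar(node[2]))
    return node

def fold_constants(node):
    """Pass 2: evaluate additions whose operands are both literals."""
    if node[0] == 'add':
        l, r = fold_constants(node[1]), fold_constants(node[2])
        if l[0] == 'num' and r[0] == 'num':
            return ('num', l[1] + r[1])
        return ('add', l, r)
    return node

def compile_expr(ast):
    # The whole "compiler" is a sequence of small tree rewrites.
    for compiler_pass in (desugar, fold_constants):
        ast = compiler_pass(ast)
    return ast

# inc(inc(40))  ->  (40 + 1) + 1  ->  42
print(compile_expr(('inc', ('inc', ('num', 40)))))  # ('num', 42)
```

Each pass does one job on one representation, which is what keeps the individual rewrites trivial even as the pipeline grows.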

> Semantics is important.

Of course it's the only thing that is important. Unlike syntax.

~~~
davorb
> one distinct IR per each pass is normal

Source?

~~~
sklogic
> Source?

1) Personal experience

2) The link I provided above (and all the other relevant Nanopass papers too)

------
bigtunacan
Designing a programming language is something I've found very interesting for
the past few years now, but this is the one area of programming that really
still eludes me and seems unapproachable. I've been through Understanding
Computation, but at a certain point I just no longer understood what the hell
was going on; I was just typing in the examples.

I have the Principles of Compiler Design book and I've made a run at that a
couple of times, but it seems too theoretical for me to do anything useful
with it.

I've been thinking about taking a go at creating something with ANTLR, along
with the books The Definitive ANTLR 4 Reference and Language Implementation
Patterns, but I'm not sure if I'll have much more luck there.

If anyone knows of a better path to get from here (can't design my own
language) to there (clarity and understanding) I would love to hear your
ideas.

I'm not trying to create the next Rust, JavaScript, etc... I just believe that
being able to implement my own language would give me a deeper understanding.
In fact I'm very much interested in the idea of reimplementing an interpreter
or compiler for languages I already work with as a way of learning this rather
than trying to create something new.

~~~
rspivak
I started a series of articles called "Let’s Build A Simple Interpreter" which
might be what you're looking for:

\- [http://ruslanspivak.com/lsbasi-part1/](http://ruslanspivak.com/lsbasi-part1/)

\- [http://ruslanspivak.com/lsbasi-part2/](http://ruslanspivak.com/lsbasi-part2/)

~~~
xigency
+1

------
amelius
Anybody know the difference between the "green" and the "red" dragon book?

~~~
pdpi
green dragon book is
[https://en.wikipedia.org/wiki/Principles_of_Compiler_Design](https://en.wikipedia.org/wiki/Principles_of_Compiler_Design)

red dragon book is the first edition of
[https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniq...](https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools)

------
dummy7953
I really wish more input from human-computer interaction and cognitive
psychology researchers were used in designing programming languages.

The programming language is (to me) the most expensive part of a system,
because it determines how the coder works & interacts with the system. And the
human resources involved in building the system are the most expensive
(besides those used to maintain and run it daily).

There must be symbols used to write code that are more easy to cognitively
process than others. There must be keywords that are easier to memorize than
others. There must be ways to write a statement that are less prone to error
than others. Let's start working on this important and neglected problem.

~~~
tormeh
That's the "language by committee" approach that people complain about. Ada
was made that way, if I remember correctly.

