
So You Want to Write Your Own Language (2014) - rspivak
http://www.drdobbs.com/architecture-and-design/so-you-want-to-write-your-own-language/240165488
======
KirinDave
Alternatively, just write software in a Lisp variant (with good macros;
Clojure is quite meh here) or OCaml or Haskell.

These languages put you at the level of writing a language for each problem
domain immediately and with a comprehensive and useful toolkit. These
languages also do this while directly competing with all but the most
carefully tuned of C++/C environments. Steel Bank Common Lisp and the Glasgow
Haskell Compiler are good references here.

These languages are often considered strange because of this, and feel
somewhat alien. But once you realize you're building a language to model your
problems, suddenly tons of stuff makes more sense. Previously alien concepts
like macros and monads are outside the typical language's experience precisely
because the language authors created those contexts and put you inside them.

The article is fundamentally wrong in claiming that it's difficult to write
small, purpose-built languages. It's not, and even outside the functional and
homoiconic meta-syntactic world we have seen code generators deployed
regularly. Successful products and libraries are built using these techniques
all the time, and with a modern toolchain they deliver excellent results. It's
just that other, more restricted and guided approaches are often introduced
earlier in people's learning curve and set their expectations for everything
that follows.

~~~
treebog
I'm approaching an intermediate level in Haskell, but I don't really see how
writing a program in Haskell is any more like writing a language than writing
a program in, say, python. Could you elaborate or point to some DSLs written
on top of Haskell? I'm very curious.

~~~
louthy
How about BASIC in Haskell?

[http://augustss.blogspot.co.uk/search/label/BASIC](http://augustss.blogspot.co.uk/search/label/BASIC)

~~~
wruza
(From article comments)

    
    
        #include <stdio.h>
        
        int
        main(int argc, char **argv)
        {
          double i, s;
          s = 0;
          for (i = 1; i < 100000000; i++)
            s += 1/i;
          printf("Almost infinity is %g\n", s);
        }
    
        Lennarts-Computer% gcc -O3 inf.c -o inf
        Lennarts-Computer% time ./inf
        Almost infinity is 18.9979
        1.585u 0.009s 0:01.62 97.5%     0+0k 0+0io 0pf+0w
    

And now the Haskell code:

    
    
        import BASIC
        main = runBASIC' $ do
            10 LET I =: 1
            20 LET S =: 0
            30 LET S =: S + 1/I
            40 LET I =: I + 1
            50 IF I <> 100000000 THEN 30
            60 PRINT "Almost infinity is"
            70 PRINT S
            80 END
    

And running it:

    
    
        Lennarts-Computer% ghc --make Main.hs
        [4 of 4] Compiling Main             ( Main.hs, Main.o )
        Linking Main ...
        Lennarts-Computer% ./Main
        Almost infinity is
        18.9979
        CPU time:   1.57s
    

As you can see it's about the same time. In fact, the assembly code for the
loops looks pretty much the same. Here's the Haskell one:

    
    
        LBB1_1: ## _L4
                movsd   LCPI1_0, %xmm2
                movapd  %xmm1, %xmm3
                addsd   %xmm2, %xmm3
                ucomisd LCPI1_1, %xmm3
                divsd   %xmm1, %xmm2
                addsd   %xmm2, %xmm0
                movapd  %xmm3, %xmm1
                jne     LBB1_1  ## _L4
    

---

 _OMG that is astonishing_

------
geekpowa
Another reason for writing your own:

I have a sprawling codebase written in an 80s/90s-era closed 4GL. Crossroads:
a) keep kicking this can down the road, using what I have in spite of its
obvious and increasingly imposing limitations; b) discard it and rewrite or
code in something else, in spite of losing the hard-won business rules, logic,
behaviours and feel of the code accumulated over the years; or c) develop a
compiler/runtime and evolve it from there.

Went with c). In hindsight it was incredibly ambitious, but I'm glad I did it.
I've been evolving the compiler and runtime steadily. I'm now at the point
where I'm starting to think about how to bend the language to discourage
certain anti-patterns it encourages by design, and to move the codebase away
from them (excessive use of global variables, for one).

There are a lot of consequences I have to accept with this approach, though.
One obvious one is that it is hard to find and hire programmers who want to
work on the Frankenstein's monster I've created here. Programming seems easy
when it's all new and shiny and you are minting things for the first time. Old
code is challenging; there are no obvious solutions forward, none I've found
at least.

------
mannykannot
There are so many languages available these days that the first thing to ask
is "what is unique and beneficial about my language?" (Valid answers might be
that it is a unique combination of features seen elsewhere, that it addresses
a specific domain's problems better than any general-purpose language, or that
it is an experiment.)

Someone made an interesting point about Kotlin recently: as one might hope for
a language coming from an IDE maker, it has excellent tools. I have heard Alan
Kay make a similar comment about Smalltalk. There are a couple of semi-popular
languages that I feel are significantly disadvantaged by their lack of tools
support (primarily in debugging and documentation.)

~~~
psyc
I have long wanted to see programmers experiment with taking this IDE thing
much further. Almost every other authoring tool has an opaque file format,
manipulated via one or more views. E.g. when a digital artist wants to create
and manage a complex 3D scene, they use a tool like Maya. They're not fussed
about whether they can read the file format!

So why not try letting the program exist as an embellished AST, that you edit
in multiple IDE views? You could still have text views that present the
program in one or more programming languages. But also views like circuit
diagrams, graphical call graphs, other standard programmer diagram types. A
run-time view with a time axis and time controls, with graphical
representations of the program, ability to rewind, fiddle with state, travel
visually down all the code paths.

Can we really do no better than text + our imaginations and whiteboards? I'm
almost certain we can do better.

Or, if you find those ideas a bit eccentric, then start with just having views
of source code. For example, my team requires verbose symbols, but I like very
short symbols. I wish I could just have a simple map from the long symbol
names to short ones, and toggle my view of the source code without affecting
the source code. But since no IDE I know of has the concept of a view, I don't
know how to do this today with existing tools.

~~~
jsymolon
AST = Abstract Syntax Tree.

" ... manipulate AST ... "

Interesting thought, but that's not a property of a language per se but a
function of the IDE.

~~~
hashkb
Except for editing Lisp, which is manipulating the AST directly in code.

~~~
aninhumer
Not really. The AST in lisps is much more obvious, but you're still editing a
text serialisation format, and any features that operate on the AST itself are
provided by the IDE on top of that.

~~~
TeMPOraL
There is no "concrete" AST that's not a "serialization format". A binary AST
is also a serialization of the AST.

Lisp source is pretty much as close to AST as you can get while still staying
in text-land. That said, experienced Lisp developers often use tools like
Paredit mode that let them navigate and edit the code in terms of tree nodes,
not characters.
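To make the "close to the AST" point concrete, here's a minimal s-expression reader in Python (an illustrative sketch, not any particular Lisp's reader; no strings, quotes, or comments): the nesting of the source text maps directly onto the nested lists that a structural editor like Paredit operates on.

```python
# A minimal s-expression reader: Lisp source tokenizes almost directly
# into the nested-list "AST" it denotes.

def parse_sexp(src):
    # Pad parentheses with spaces so split() yields one token per atom.
    tokens = src.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        if tokens[pos] == "(":
            node, pos = [], pos + 1
            while tokens[pos] != ")":
                child, pos = read(pos)
                node.append(child)
            return node, pos + 1  # skip the closing ")"
        return tokens[pos], pos + 1  # an atom

    tree, _ = read(0)
    return tree

print(parse_sexp("(define (square x) (* x x))"))
# -> ['define', ['square', 'x'], ['*', 'x', 'x']]
```

The printed nested list is, for practical purposes, the tree a Lisp editor lets you select, transpose, and splice as whole nodes.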

------
jakub_g
> These are false gods:

> Minimizing keystrokes. Maybe this mattered when programmers used paper tape,
> and it matters for small languages like bash or awk. For larger
> applications, much more programming time is spent reading than writing, so
> reducing keystrokes shouldn't be a goal in itself. Of course, I'm not
> suggesting that large amounts of boilerplate is a good idea.

Couldn't agree more! No one needs another perl or bash. Languages should be as
concise as possible, but not more.

Also good points about familiarity and helpful error messages.

~~~
tyingq
> No one needs another perl or bash. Languages should be as concise as
> possible, but not more

I've seen you lump perl and bash together before, but around the idea of
sigils like $ and @. This sounds more like a complaint about "default"
operations. Like Perl's $_ perhaps?

Other than that sort of thing, I don't see where Perl reaches that far in
being concise. Maybe regular expressions being a first class thing? Like "if
($foo =~ /bar/)" ? Though that seems straightforward to me.

------
nailer
> Redundancy. Yes, the grammar should be redundant. You've all heard people
> say that statement terminating ; are not necessary because the compiler can
> figure it out. That's true — but such non-redundancy makes for
> incomprehensible error messages. Consider a syntax with no redundancy: Any
> random sequence of characters would then be a valid program. No error
> messages are even possible. A good syntax needs redundancy in order to
> diagnose errors and give good error messages.

This makes no sense. A terminator is required; multiple terminators (i.e.,
redundancy) are not. Various modern languages eliminate redundancy, and a
random sequence of characters is still not a valid program in those languages.

Disappointed Dr Dobbs would publish something like this.

More subjectively:

> Tried and true. Absent a very strong reason, it's best to stick with tried
> and true grammatical forms for familiar constructs. It really cuts the
> learning curve for the language and will increase adoption rates.

More people will program in the future than do at present. If you aim to
reach this audience, things like using '=' to set values rather than to test
equality make no sense to the vast majority of those people.

~~~
CJefferson
The lack of a terminator in Python, and the optional terminator in
JavaScript, has often produced weird errors for me, when the compiler and I
disagree on whether something should be interpreted as one line or two. With a
semicolon, it's clearer.

Of course, missing-semicolon errors are still annoyingly bad; I wish GCC and
clang could produce better error messages in this case.
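A concrete case of this disagreement in Python (a minimal, hypothetical example): splitting an expression across lines without a continuation doesn't necessarily fail, it can silently parse as two separate statements.

```python
# The second line is a valid statement on its own (unary plus applied
# to 2), so Python raises no error -- it just evaluates and discards it.
total = 1
+ 2

# total is still 1; the "+ 2" never took part in the assignment.
print(total)  # -> 1
```

With a mandatory terminator, the truncated `total = 1` would instead be a visible syntax error at the point of the mistake.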

~~~
ben-schaaf
I've had similarly weird errors in Python, but it's never occurred to me in
Ruby in the same situations. Maybe it's just the way the grammar is defined;
for example, stuff like the following works perfectly fine in Ruby, while
Python just says "Invalid Syntax":

    
    
      foo = 1 +
            2

~~~
opportune
You just need to add an escape character, "\". This is Python 101.

This is properly interpreted:

    
    
        foo = 1 + \
            2

~~~
Too
Python will automatically continue lines in some contexts. Another way to
accomplish the above is to wrap the expression in parentheses.

    
    
        foo = (1 +
               2)

------
Koshkin
Trying your hand at inventing a new programming language is both fun and an
excellent intellectual exercise, regardless of whether you want to make your
language "better" than the one(s) that you are familiar with.

It also opens a path to humility and a true appreciation of the work of
others.

~~~
andreareina
It's fun taking an existing feature in one language, and implementing it in
another language that doesn't have it. Often you'll come up against a
simplicity/power tradeoff; making that decision (after thinking hard) and then
comparing it to how the others did it is usually an interesting revelation.
Lisps are the canonical language for doing this in, but sweet.js macros could
enable some interesting results as well.

------
dronemallone
Related :) -
[http://colinm.org/language_checklist.html](http://colinm.org/language_checklist.html)

------
WalterBright
Author here. So Ask Me Anything!

~~~
chubot
_A context-free grammar, besides making things a lot simpler, means that IDEs
can do syntax highlighting without integrating most of a compiler front end.
As a result, third-party tools become much more likely to exist._

This statement feels imprecise in a couple ways. It seems to imply that some
IDEs actually use context-free grammars for syntax highlighting? Which ones?

As far as I can tell, Vim and Textmate bundles (i.e. what Github uses for
syntax highlighting) don't use anything close to a context-free grammar for
their syntax highlighting models. They are more like ad hoc lexers -- a
collection of rules and regular expressions.

Certainly an editor doesn't want to parse the entire file to highlight text,
because it has to potentially re-highlight at every keystroke. Also, you want
to be able to highlight malformed programs (i.e. code with syntax errors). As
far as I understand, that's generally why they don't use grammars.
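That "collection of rules and regular expressions" model can be sketched in a few lines of Python (a toy illustration, not any real editor's engine; the rules here are hypothetical ones for a C-like language): an ordered rule list, first match wins, and anything unmatched is treated as plain text, so malformed or partial code still highlights.

```python
import re

# Ordered, grammar-free highlighting rules: first match at the current
# position wins, scanning left to right.
RULES = [
    ("comment", re.compile(r"//[^\n]*")),
    ("string",  re.compile(r'"(?:\\.|[^"\\])*"')),
    ("keyword", re.compile(r"\b(?:if|else|for|return|int|double)\b")),
    ("number",  re.compile(r"\b\d+(?:\.\d+)?\b")),
]

def highlight(line):
    spans, pos = [], 0
    while pos < len(line):
        for name, rx in RULES:
            m = rx.match(line, pos)
            if m:
                spans.append((name, m.group()))
                pos = m.end()
                break
        else:
            pos += 1  # no rule matched: plain text, skip one character
    return spans

print(highlight('for (i = 1; i < 100000000; i++) // sum'))
# -> [('keyword', 'for'), ('number', '1'), ('number', '100000000'),
#     ('comment', '// sum')]
```

Nothing here knows what a statement or declaration is, which is exactly why such highlighters survive syntax errors, and why they get confused by constructs that need semantic context.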

I think you might mean that a language should be designed with a concise
grammar so that somebody else can reimplement it by hand more easily? That is,
you want multiple independent implementations.

If that's true, "context-free grammars" is the wrong term to express that
notion. Context-free grammars don't handle many real languages: not just C and
C++, but also Python/Perl/Ruby, and even JavaScript and Go (semicolon
insertion).

Also, a lot of tools are enabled not by grammars but by producing specific
data structures for your front end. (e.g. here is the way I think about it:
[http://www.oilshell.org/blog/2017/02/11.html](http://www.oilshell.org/blog/2017/02/11.html)
)

The difference between Clang and GCC is that Clang is a library that enables
a whole ecosystem of fantastic tools, including IDE support. But Clang doesn't
do much at all with CFGs. The real difference is that the front end is a
library that produces a very rich representation of the code.

The C# front end does similar things. More links here:
[https://github.com/oilshell/oil/wiki/Lossless-Syntax-Tree-
Pa...](https://github.com/oilshell/oil/wiki/Lossless-Syntax-Tree-Pattern)

In other words, I would say that integrating the compiler front end into the
IDE became the more popular and successful approach. And of course the
compiler very much has to be designed with this use case in mind.

~~~
WalterBright
> integrating the compiler front end into the IDE became the more popular and
> successful approach

C and C++ syntax highlighting suffered for decades because it was not possible
to do a complete job of it without integrating a compiler front end into the
IDE.

~~~
chubot
I think you're talking about Intellisense-like functionality, not syntax
highlighting. Syntax highlighting works just fine in C++ in many editors.

The red squigglies in Visual Studio are a completely different technology
from syntax highlighting, which is lexical.

Intellisense has only a rough relation to context-free grammars. Some of the
red squigglies are errors in semantic analysis, not parsing. That is why the
more fruitful approach was to integrate the actual compiler and the IDE -- not
have two separate parsers/compilers, as was the case with Java IDEs.

Also, I think it was always "possible" to integrate a C++ front end and an IDE
-- just nobody did it in an open source compiler until Clang.

~~~
WalterBright
No, I'm not talking about Intellisense. Just highlighting the code correctly.
The C++ editors that don't integrate a compiler front end tend to get confused
when you do tricks with macros, backslash line splicing, and trigraphs, for
example. They'll also have trouble with the >> thing and templates, and
preprocessor metaprogramming stuff like:

    
    
        #define BEGIN {
        #define END }
    

They do work fine with conventional code, but if one knows the darker corners
of the Standard, they can be broken.

One could argue "don't write code like that", but as a tool developer there is
always someone that does. When designing a language, though, one can design
out all those problems.

~~~
chubot
OK, but highlighting the code correctly has essentially nothing to do with
context free grammars.

This happens in Vim and Emacs with languages other than C and C++ -- here docs
in shell, multiline strings in Python -- and Python _does_ use a CFG, etc.

I agree it's annoying although I think most people view it as a minor thing.
They stick with Vim and Emacs for other reasons.

I'm not sure anyone has based their language design around Vim/Emacs syntax
highlighting, although ironically that is one of my criteria for language
design. I was just confused by the advice to use a CFG, since it's not the
relevant issue.

I would say the relevant issue is that your lexer shouldn't be too clever and
have too many modes. And to avoid mixing languages in the same file, or have a
very obvious lexical construct to mix languages.

The C preprocessor is an entirely separate language than C or C++, so that is
the core of the issue in your example. Likewise, it is usually hard to
highlight CSS and JavaScript embedded within HTML.

~~~
WalterBright
> nothing to do with context free grammars

This is incorrect. Some languages have user-defined tokens. Some have
contextual keywords. Both require a semantic understanding of the code to
highlight them correctly.

And it isn't just the preprocessor with C++. There's the >> problem. It's not
just me talking through my hat: tools for C++, such as pretty-printers and
refactoring tools, have been very slow to appear, and fragile. But with a
language like Java the tooling is quick & easy to write.

You don't have to believe me. Write a tool that reads C++ source code and
inserts boilerplate at the beginning and end of each function, and works 100%
of the time.

~~~
chubot
I know exactly what problem you are talking about. It's exactly the problem
that Clang solves.

With the Clang front end, you can write a tool to read C++ source code and
insert boilerplate at the beginning and end of each function, and it will work
100% of the time. There are dozens of such tools in active use at Google and
I'm sure many other places.

But it has nothing to do with context free grammars -- _really_. Clang uses a
recursive descent parser. GCC used to use a yacc-style grammar (which BTW is
only context free-ish because of semantic actions), but it could NOT perform
the task you are talking about. In fact that was largely the motivation for
Clang.

It also doesn't really have to do with syntax highlighting as practiced by any
editor or IDE I know of. Even though Clang has the power that you want
("semantic understanding"), I don't know any editor that uses it for syntax
highlighting.

Instead they use approximate lexical hacks. This is probably because of the
need to highlight partial files and malformed files, as I mentioned. You don't
want your syntax highlighting to turn off in the middle of typing a code
fragment.

But editors DO use Clang for semantic understanding, e.g. the YCM plugin for
Vim.

But they use CFGs for NEITHER problem. You're conflating two different issues
and suggesting the wrong solution for both of them.

There are a lot of links about this issue with regard to languages like C#,
Scala, Go, JavaScript, etc. in the wiki page I linked.

I agree with your general point about language design, but the terminology
you're using is wrong and confusing.

~~~
WalterBright
> It's exactly the problem that Clang solves.

Yes, and clang appeared on the scene 20 years after C++ did. It's a long wait.
If you create a new language, are you willing to wait 20 years for tooling?

~~~
chubot
I agree C++ is too hard to parse, and you should design something simpler.
Simpler isn't the same thing as a context-free grammar. The issues you are
pointing out are lexical (Python has a CFG but still has imprecise syntax
highlighting in editors).

 _> A context-free grammar, besides making things a lot simpler, means that
IDEs can do syntax highlighting_

I disagree with this because it's wrong. People don't use Clang or context-
free grammars to syntax highlight code. Java has a CFG -- who uses it to
syntax highlight code?

This conversation isn't very interesting because it's just me explaining the
same thing to you over and over again. Your head is stuck in the mode of
"expert" and not somebody who is curious and wants to learn something.

------
Cacti
If you haven't done this before and want to explore a bit, I'd recommend using
ANTLR 3. You can get an interpreter up and running pretty easily and it
targets the major platforms, so you can typically use the produced grammars
and generators in your language of choice. Recursive descent is usually the
easiest to implement.

This is really only 5 or 10% of the work involved in creating your own
language, but it's enough to experiment and give you an idea if you want to
proceed with the long and hard work of developing it further.

~~~
zaphar
The author actually recommends against using a lexer/parser generator. If you
are just playing around they can be fine, but he's right that when you are
ready to get serious they often end up being a hindrance.

Additionally, it's not usually that hard to create your own tokenizer and
parser by hand.
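As a sketch of how little that can take, here is a hand-written tokenizer and recursive-descent evaluator for arithmetic in Python (illustrative only; a real front end would build an AST and report errors with positions rather than assert):

```python
import re

# One regex splits the input into numbers and single-character operators,
# skipping whitespace. Each grammar rule below is one method.
TOKEN = re.compile(r"\s*(?:(\d+)|(.))")

def tokenize(src):
    return [num or op for num, op in TOKEN.findall(src)]

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):          # expr := term (('+' | '-') term)*
        value = self.term()
        while self.peek() in ("+", "-"):
            op, rhs = self.next(), None
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):          # term := factor (('*' | '/') factor)*
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.next()
            rhs = self.factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor(self):        # factor := NUMBER | '(' expr ')'
        tok = self.next()
        if tok == "(":
            value = self.expr()
            assert self.next() == ")", "expected ')'"
            return value
        return int(tok)

def evaluate(src):
    return Parser(tokenize(src)).expr()

print(evaluate("2 * (3 + 4) - 5"))  # -> 9
```

The structure mirrors the grammar one rule per method, which is why hand-rolled recursive descent stays readable as the language grows, and why error messages can be tailored at each decision point.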

------
jokoon
Honestly, if I were to make a language, I'd take most of the things that make
C and Python popular, to make something as simple as possible and as similar
to C as possible. All I could wish for is C with native features of Python
(dicts, list comprehensions, libs like sqlite and xml...)

What I think is the most important thing for a language is that it must be
easy to learn and read for beginners. There is no need to have advanced
features for advanced programmers. Those guys will use other languages and
their specific tools to solve their advanced problems.

A language is something that is "talked" by many people, so the easier it is
to learn, the more people will use it. It is all about a low barrier to entry.

Just stop focusing on particular features or what you like about language X
or Y. Just imagine CS students and beginners, and make a language that lets
them write code and do their assignments.

~~~
doubleunplussed
Cython is a fantastic and underrated language. It basically allows you to
write C using Python syntax, and to also write Python in the same source file.
Basically if you use a dict or some other nice python functionality, your code
will run at Python speed, but if you add a few type annotations and whatnot it
runs like C. Not to mention that you can import and use C libraries. Mix and
match Python and C as required for the ultimate balance of speed and
convenience. It's great.

~~~
dom0
I disagree (strongly).

Cython is nice for saving some typing when generating wrappers. It is a very
tough language to work with for actually implementing things that you'd want
to do
in a low-level language, because it's severely underdocumented and
underspecified, but extremely complex at the same time. To actually see
whether the code as written is correct, one must almost always cythonize it
and dig through the generated, verbose and hard to decipher C code. Both the
language and the compiler have a ton of things I'd call bugs, induced by the
complexities and mismatches in the type system. E.g. it's awkwardly easy to
have Cython call some PyObject function on something that isn't a Python
object.

No thank you.

PS: Also avoid using setup.py Extension, including Cython's variant, if you
can at all.

(Source: I've been maintaining software containing Cython stuff for some
while, written and reviewed quite some of it, too.)

~~~
doubleunplussed
I suppose I almost solely use Cython for numerics, which seems to be reliable.
Anything that's not numerics is not usually the bottleneck for my code and so
is left as Python. So I can't comment on problems trying to use the C end of
the Cython continuum as a general purpose replacement for C.

For my numerics at least I can't say I've had to dig through generated C. I
generate the annotated html to ensure that what I thought would translate to
pure C without python API calls indeed has, but I've never had to actually
read the C it generates.

~~~
dom0
The annotated mode is indeed very nice to browse the generated C code,
although I still needed to manually read the .c in some cases. The annotated
code elides things like the (many thousand lines of) Cython support code; but
since few things are documented I sometimes even needed to dig through those,
deciphering nested #ifs and whatnot just to see whether the code would be
correct.

E.g. what does `cdef char *something = somethingelse` give me. Even if you
know the type of somethingelse it's at best a guess. (Bonus question: Say you
know somethingelse is going to be a Python bytes object. Does something point
to a copy?)

------
bobsgame
For anyone who hasn't yet seen it, check out Jonathan Blow's Jai language, it
is fantastic!

[https://www.youtube.com/watch?v=gWv_vUgbmug](https://www.youtube.com/watch?v=gWv_vUgbmug)

~~~
papaf
That's cool. Is there somewhere I can download the language to play with?

------
dadvocate
Nice article, although I clicked on the link thinking that I had finally
found something I could follow to get started with writing my own language.

~~~
kozhevnikov
[https://news.ycombinator.com/item?id=9718472](https://news.ycombinator.com/item?id=9718472)

------
rzzzwilson
Is it just me being in my dotage, or does anyone else feel that websites that
override my "open in new tab" action to force their tab to the top of my view
plus tout their app should be taken out behind the shed and shot?

~~~
andreareina
I've said the same thing about applications that steal focus.

~~~
cryptoz
It's remarkably hard to find or configure an OS to not steal focus. Everything
I've tried for my whole life has failed, with the exception of not using a GUI
and just staying on the linux command line without letting startx run.

iOS? Steals focus all the time.

Android? Steals focus all the time.

Windows? Steals focus all the time.

Mac OS X? Steals focus all the time.

Ubuntu? Steals focus all the time.

It's outrageous and annoying. Does anyone have a simple OS configuration that
will prevent all focus-stealing system dialogs from actually interrupting
focus?

Edit: To continue the rant, Windows used to pop OS update dialogs that
interrupted keyboard focus, took keystrokes like spacebar and enter to mean
"lose all work and reboot". I remember a long time ago, typing and looking
away from the screen, only to find that my computer had turned off.
Infuriating.

~~~
user5994461
A Windows application can't steal the focus unless it is the top window
already.

This is behavior built deep into the system that cannot be overridden. It is
very effective against numerous stupid apps.

~~~
TeMPOraL
> _A Windows application can't steal the focus unless it is the top window
> already._

Or unless it decides to throw up a modal dialog, AFAIR. Unless they changed
that at some point.

Also, Windows Update "rebooting in 5 minutes, do you want to reboot now
instead?" popups in pre-Win8.

------
jlebrech
What about forking an existing language and adding one or two features you
want to see in it?

I'd like to make a subset of Ruby minus blocks and indentation (yes, I'm
crazy).

~~~
rurban
see tinyrb based on potion then.

------
tanilama
I don't think so. I feel like inventing a language is more and more useless
nowadays.

