Hacker News new | past | comments | ask | show | jobs | submit login
Make Your Own Programming Language (mattias-.github.io)
225 points by Pirate-of-SV on Jan 3, 2015 | hide | past | favorite | 78 comments

Great post. Recently I wrote an interpreter for Scheme using Haskell using the guide "Write Yourself a Scheme" [1]. It was great fun, and it seems that it will be fairly straightforward to write a new domain-specific language using Haskell, using the same principles. That said, it was never clear to me how something like that would work in a non-purely-functional language, and so the fact that the OP used Rust is very exciting.

[1] http://en.wikibooks.org/wiki/Write_Yourself_a_Scheme_in_48_H...

This is a great tutorial, although the one complaint I had was the errors in the pdf version (can't remember the specifics).. took me a while to realise the html version was far superior.

Regardless of my complaints, it was an interesting experience.

Yes, I came to the same realization. One of the problems I had with the pdf version was that the flags given for compiling the program did not work correctly for my version of GHC.

That rings bells, but I think there were some others too - sigh my memory :/

I started making a language a few months ago. I found Douglas Crockford's parser from his Top Down Operator Precedence paper pretty useful, which is here: http://javascript.crockford.com/tdop/index.html

When looking for resources on making toy languages, most of them said not to bother with a parser and use a pre-built solution. I think they're probably wrong - the parser is actually pretty easy (relative to the rest of the process of getting a language working) and at a stretch, fun.

I agree. The parsing step is distracting. I have TONS of questions and all the tutorials do the parsing stuff, and stop at doing integer +/*-, maybe vars, maybe if, maybe basic functions!

I'm doing one, and I'm only working on the AST.

I take the idea from this blog:


And from http://www.itu.dk/people/sestoft/plc/. So I get rid of the parsing, and go directly at "how do type checking?", "How I map vars to the environment?", "How make the debugger???", "So, I can have a AST for INTS, BOOLS... now how let the user create your own types???", "How make possible to do macros? I'm not a LISP!", "I'm on F#. How lift the stuff F# already have to do not doing it again???", "What to do: Erlang actors or GO CSP? and how?", "How do pattern matching", "Auto-instrument the code interpreter with dtrace or similar, and why I think this is good?", "I wanna be a reactive language?", "I decide the language be relational. I strongly believe that is amazing. I don't know yet why and how prove it" and a huge list like this...

I don't have a clean answer yet for some of this questions!

Also, I get rid of the idea of make a compiler, and do a interpreter instead. That is another distraction. I suspect that if later on I turn the interpreter to a bytecode, the step to full compilation will be simple, and without solve several questions about the full language the task of rewrite both the parsing and the compiler is not cool.

I have written a few languages here and there. Hope this helps.

They are all separate parts, and you can do them in any order you want.

I like to think of the language as two parts - syntax and features. How do I want the language to look and what features do I want the language to have? The parser and lexer handle the syntax, and the AST handles the features.

This way, I can play around with the syntax of the language a little later.

I would start with a really simple language:

- add expressions, proper infix support for arithmetic

- variables

- if / else / else if

- scoping / functions

- loops if you don't want to just use recursion

then, you can get more advanced:

- lambda expressions / first class functions / closures

- namespaces or types

- macros. if you can make nice looking lisp-style macros for an infix language i would love you

write it as an interpreter, add an llvm backend.

It should be fun. do a little at a time.

> - add expressions, proper infix support for arithmetic

I'd suggest ignoring precedence rules completely here. It generally simplifies the grammar and parser considerably. It doesn't make the language unworkable, either, as long as you allow for grouping with parentheses - Smalltalk does this and it work rather well.

I'd also try working with libraries like PyParsing first before attempting to use mammoth size parser generators. In my experience, especially for simple grammars, this way you can be done with parsing stage with minimum effort and concentrate on the "features" side.

I should have mentioned that precedence rules are pretty wacky.

I wish that in the languages I have worked on that I could just ignore precedence and use parens, but operator precedence is expected.

More or less my exact plan. Plus modules (like in ADA) or components, a way to do parallel/concurrent programing (CSP, Actors, Coroutines?).

w.r.t. macros, have you looked at Elixir, Rust, Sweet.js?

And Nimrod, and (Open)Dylan (which I feel really deserves more attention right now, it's far from being dead!).

I don't know about Rust, but Elixir, Dylan and Sweet.js have Scheme-style, pattern matching based macro systems while Nimrod is a bit unique in having procedural macros conceptually closer to defmacro from some Lisps.

There are also languages where functions can decide whether to evaluate their arguments or not, in the latter case making them work like macros. Io is an example of such a language.

Anyway, infix languages with nice macro systems do exist, it's just that none of them became popular enough (yet?). Also, programmers tend to fear macros for some reason (probably because of they are in C and similar languages) which makes having macros rather low-priority feature for language designers. But it's perfectly possible to create a (very nice!) macro system for infix language and it's been done.

Yep, I have look at nimrod, julia, and barely at dylan. My target syntax is more like python/julia/elixir, and all with relational programing as first-class.

The thing I'm now is how implement macros (and how deeply), not the syntax, is kinda easy to see the way with lisp, but still not with more "normal" syntax.

Also, I have tough (in a way to provide linq-like capabilities) if could be nice to have this = that (evaluate that and put on this) and this ::= that (get the AST on this and later evalualte this) and let decorate the functions with "fun(test:AST.LogicalOp)", so macros at runtime, not just compile time..

Yes! I write my own recursive descent parsers because my programming languages tend to have incremental execution stories to them (you can see an example at http://research.microsoft.com/en-us/people/smcdirm/managedti...). And writing a recursive descent parser really isn't that hard, plus it gives you more flexibility in tweaking error recovery. It also turns out that in industry, auto generated parsers are rarely used (Java is an exception) given performance and error recovery concerns.

On the topic of precedence parsing, they have some really good properties in an incremental context: you can start parsing anywhere and the results will be the same. The problem is expressiveness, and while I came up with some hacks to push them beyond operators, pure recursive descent is much easier to work with (as I learned from Martin Odersky's scalac compiler).

For me the big advantage of using a parser generator is that they let you write things a bit more declaratively. Bottom up parser generators also have the advantage of being able to recognize some grammars that top down parsers can't and they also detect grammar ambiguities at compile time.

That said, handwritten top down parsers are very tweakable and flexible which can be a big plus. In particular, error recovery and things like semicolon insertion. You also get more control over code size and parser performance.

I can see why people do use a parser generator. For me though, the language I'm making has a fairly simplistic syntax, and I value code size and parser performance more than I should.

I have used lex and yacc over almost 30 years numerous, numerous times. I always regretted the decision later. It's a myopic concession for short-term gain.

The tools generate a monstrous, ugly mess that nobody understands, which is linked into your executable.

When it gets complicated, debugging it is a massage session, where you make changes that you don't entirely understand in hopes that a desired behavioral change comes about.

Here is a debugging fail: you cannot just put a break point on a rule, because it's not a function. The state machine is traversing multiple rules at the same time.

In a recursive-descent parser, you can break on a rule and get a call stack.

If you value parser performance you should use a parser generator. Handwritten parsers rarely approach the performance of a fast table-based generated parser.

A hand-written lexer should beat most table-based lexers. Hand-written lexers encode the automaton state in the program counter and encode the transitions in code branches, whereas table-based need to use a separate variable and encode the transitions through an indirection followed by a loop.

Parsers are a different kettle of fish entirely though. You hypothetically compare a handwritten parser with a "fast" generated parser. Presumably a handwritten parser will be faster than a "slow" generated parser?

If the language to be parsed is simple and has no ambiguities that need semantic information to resolve (this is not the case for C, for example, never mind C++), a generated parser may save time. Would it be faster than a handwritten parser? I have my doubts, for similar reasons as lexers. The automaton state is stored implicitly via the program counter, and state transitions are in code: similar arguments apply. If the handwritten parser is doing more busywork, such as a lot of recursion to handle precedence in expressions, then it may be slower. But if it uses precedence parsing just for operators, it can get that back, while not losing the benefits of recursive descent style.

OTOH, if you're trying to parse a fairly ambiguous language, you can save a substantial amount of effort with a parallelized LR parser, something that builds a forest of parse trees and discards options that are no longer feasible when more information comes in: a Tomita or GLR parser. Parsing such languages with recursive descent typically involves a lot of extra work building up semantic info as you go, or parsing a simpler form of the language and rewriting with disambiguation later, after further analysis. But GLR parsing is O(n^3) in the worst case. Hand-written may be a lot more work, but it might also be faster. It depends.

The example I was thinking of when I mentioned performance was Lua. Instead of generating a syntax tree the recursive descent parser directly spits out bytecode as it goes, which ends up being a big performance win on large source files.

Taking the other side of the argument in my sibling comment to yours: using a parser generator does not preclude directly spitting out bytecode. You can do that in the parse actions instead of building an AST.

A bottom-up parser like that produced by LALR parser generator could make this slightly more awkward, depending on what you need to generate the bytecode, but an LL(k) parser generator (e.g. Antlr) should let you do most of what you'd do with recursive descent, and in a similar way.

I wonder if exists other ways, and find a obscure reference to "slim binaries":

(Asked here)


Also, I wonder how NOT lost the info that already have the AST in the bytecode conversion, I even have the crazy idea to store everything in a sqlite database, but I suspect will be slow (to read later in the VM...)

Evaluate code for a stack machine symbolically, and you'll end up with an expression tree. RPN bytecode is a post-order traversal of the underlying expression.

Things get a bit more wrinkly with control flow, but you can design bytecode to make it easier to rehydrate that too.

I agree. It's not as hard as one might think. I implemented my own handwritten JS parser using TDOP: https://github.com/higgsjs/Higgs/blob/master/source/parser/p...

I think it's more flexible and maintainable this way. I didn't have to wrestle with odd grammar rules. Implementing the JS "automatic semicolon insertion" with parser generator tools seems like it would have been a cluster headache.

I have to admit I also wrote my own parser because I was keen on implementing some pretty advanced macros in it (which I've almost done already). A lot of those parser generator tools assume you have a nice C-like language syntax.

The more you do yourself, the more you will learn. If you're just learning and not trying to build the next big programming language, try and do some of the low level stuff yourself. Write a parser, write a garbage collector and a virtual machine, re-write the compiler in the language itself. Go hog wild.

> The more you do yourself, the more you will learn.

This is true once you complete the project, but it seems like it can be dangerous: if you start off your mission to make a programming language by soldering components on a breadboard, say, then, you will learn something, to be sure, but you will probably get so embroiled in preliminaries that you will never learn what you intended to learn. (Of course, sometimes learning something other than what you intended is a good thing ….)

that's a great suggestion and seems to be useful tip to get started to writing a new language on our own.

The author appears to have added a CNAME since this was posted to HN. The updated URL is http://blog.ppelgren.se/2015-01-03/DIY-Make-Your-Own-Program...

Server not found error when connecting. Tried removing the "-" but didn't work neither.

In glibc's libresolv, this is caused by res_hnok in resolv/res_name.c returning 0:

     * Verify that a domain name uses an acceptable character set.

     * Note the conspicuous absence of ctype macros in these definitions.  On
     * non-ASCII hosts, we can't depend on string literals or ctype macros to
     * tell us anything about network-format data.  The rest of the BIND system
     * is not careful about this, but for some reason, we're doing it right here.
    #define PERIOD 0x2e
    #define	hyphenchar(c) ((c) == 0x2d)
    #define	underscorechar(c) ((c) == 0x5f)
    #define bslashchar(c) ((c) == 0x5c)
    #define periodchar(c) ((c) == PERIOD)
    #define asterchar(c) ((c) == 0x2a)
    #define alphachar(c) (((c) >= 0x41 && (c) <= 0x5a) \
		       || ((c) >= 0x61 && (c) <= 0x7a))
    #define digitchar(c) ((c) >= 0x30 && (c) <= 0x39)

    #define borderchar(c) (alphachar(c) || digitchar(c))
    #define middlechar(c) (borderchar(c) || hyphenchar(c) || underscorechar(c))
    #define	domainchar(c) ((c) > 0x20 && (c) < 0x7f)

    res_hnok(const char *dn) {
	    int pch = PERIOD, ch = *dn++;

	    while (ch != '\0') {
		    int nch = *dn++;

		    if (periodchar(ch)) {
		    } else if (periodchar(pch)) {
			    if (!borderchar(ch))
				    return (0);
		    } else if (periodchar(nch) || nch == '\0') {
			    if (!borderchar(ch))
				    return (0);
		    } else {
			    if (!middlechar(ch))
				    return (0);
		    pch = ch, ch = nch;
	    return (1);

And the hack fix:

    hnokpre.c: LD_PRELOAD shim to bypass valid hostname checking in glibc
    gcc -fPIC -shared hnokpre.c -o libhnokpre.so
    LD_PRELOAD=./libhnokpre.so curl http://mattias-.github.io/

    int res_hnok(const char* dn)
        return 1;

    int __res_hnok(const char* dn)
        return 1;

I'm kinda wondering what's the reason for that that check to begin with. Just being pedantic for pedentrys sake?

     * Note the conspicuous absence of ctype macros in these definitions.  On
     * non-ASCII hosts, we can't depend on string literals or ctype macros to
     * tell us anything about network-format data.  The rest of the BIND system
     * is not careful about this, but for some reason, we're doing it right here.
Well that is only a problem if the charset is converted to something else. Are there still systems out there without a C compiler that can be told to use ASCII?

According to RFC 1034[1] hyphens can only be interior to a label; it appears many tools simply follow these recommendations strictly.

The exact wording is: "They must start with a letter, end with a letter or digit, and have as interior characters only letters, digits, and hyphen."

[1]: http://www.ietf.org/rfc/rfc1034.txt

The wording “must start with” doesn’t sound like a “recommendation”, and I would think that these tools are right to interpret the standard strictly, and that Github is wrong to allow such domain names.

Whatever happened to "be lenient in what you accept, conservative in what you send"?

Github ignores that rule when they emit such domain names. One might make the point that Firefox and others who in turn reject such domain names are not being lenient in what they accept, but since domain names on the edge of acceptability often are a security issue, I think that a strict interpretation of the standard is the safe choice to make here.

It's a terrible idea that just causes all sorts of intro l interop problems. Web development, for instance, is a mess because of all the various attempts to interpret stuff in some way, versus just failing.

Author here: This is slightly embarrassing.

I just added a custom domain/CNAME to my github pages so it should be reachable at http://blog.ppelgren.se

Could happen to anyone. Nice job solving it fast.

Apparently the hyphen isn't there by mistake. Here's the blog source repo: https://github.com/Mattias-/Mattias-.github.io

Are you perhaps using a browser or a DNS stack that rejects this type of domain name?

I had the same problem, but you can read the page from the Markdown-parsed source. https://github.com/Mattias-/Mattias-.github.io/blob/master/_...

The question should probably be: which OS are you using?

It's quite interesting... dig seems to resolve the domain just fine, but neither chrome, curl nor firefox seem to be able to open the page. (tested on Fedora/CentOS/Android)

Same here. My dnsmasq server responds with the domain name when I try it with 'host', but curl and firefox fail on Debian and Ubuntu with only that 1 DNS server in resolv.conf. Curl says "curl: (6) Could not resolve host: mattias-.github.io".

However, curl and firefox seem to be RFC-compliant. RFC952 states: "The last character must not be a minus sign or period."

Android, Google chrome, also have the resolver problem.

Ubuntu 14.04 / Chrome + Firefox... same behaviour :( DNS resolves, curl + both browsers fail.

randunel@18:~$ curl Mattias-.github.io

curl: (6) Could not resolve host: Mattias-.github.io

randunel@18:~$ dig +short Mattias-.github.io


Interesting... I'm on Ubuntu 14.04 too, and Firefox fails for me, but Chrome succeeds.

Firefox on Debian here, also get a "Problem loading page"

Firefox on Arch, same stuff.

Edit: Google Chrome works.

I got it with curl -I --header "Host: mattias-.github.io"

I am also unable to visit the page. I'm on Android 4.2, using Firefox and Chrome.

Nice article, inspired me to invest time in the future to learn Rust.

Also working on JavaScript native compiler + developing a new language on top as subset for JavaScript: https://github.com/InfiniteFoundation/dopple (front page info is outdated though).

After writing a lexer, parser, compiler and virtual machine in C++ and ported parts to Python, Java and FreeBASIC - I think this is a good starting point.

Most of the concepts are pretty simple - it's when you get into if/while/for statements. Especially if you have nested if/while/for statements. Conceptually easy to implement using recursion but during the compilation phase keeping track of where you are at can be...difficult. I'll be the first to admit my implementation sucked but it worked.

Let's add the next thing I won't have time to play with. Anyway, this is a great post and I look forward to the whole guide. Keep it up!

Thank you! If I can at least make one person interested I'm happy to continue.


I've been working on Python-like systems language with a compiler in Python targeting LLVM IR, and it's been cool (though so much work to get a halfway usable language!).

https://github.com/djc/runa if anyone's interested.

Did you create an interpreter for the language before you started on the compiler? I think it's not much overhead and you can run tests with it to make sure you get all the lexing and parsing done correctly. Then you just replace the evaluation with code generation and BAM! you've got a compiler.

No, I didn't. For my language, being compiled down to machine code has been one of the design goals. Also, it doesn't seem to me like building an interpreter is significantly simpler than building a compiler (at least given the fact that I've left a bunch of the hard stuff to LLVM magic). LLVM IR isn't that hard to write.

I would argue that this statement is a mistake that should have been foreseen: "negative integers isn't literals!" Specifically, twos complement signed integers will be a bitch of you continue down that path.

On the other hand, having negative integers as literals will cause even harder problems with subtraction (as in "2-2")

In a language like C the parser is going to try to parse Expr - Expr but not Expr Expr since there's no such construct in the language to put two expressions after each other.

Actually, it's an incredibly good idea, cause you can then just implement constant folding on the AST, and that (unary negate)1 turns into an actual -1, as well as a lot of other good optimizations.

Wow. I'm sure I've seen Rust before (in fact, I have it installed) but I'm just surprised how similar it is to Swift.

EDIT: s/is/looks/. I meant the syntax.

The design of Swift was influenced by Rust. However, most of the resemblances are superficial or relate only to areas where Rust in turn was heavily influenced by other languages. For example, Rust's memory management model could not work in Swift (because this is an area where Swift remains compatible with Objective-C).

> I'm just surprised how similar it is to Swift.

A lot of language enthusiasts just winced.

I started one a while ago, but I became fascinated by the "middle end" and bought a compiler textbook. Now I have SSA optmizations but no strings still.

Very interesting read. I am myself writing a series on the same subject and I was pretty inspired by your way of writing. Your article is paced a lot faster than what I have written so far and I really like the style.

I submitted my own posts to HN also, but I did not get the same attention as you did, congratulations!

I'm looking forward to the next iteration ;)

Great read! An effort on a DSL using PLY (Python LEX-YACC implementation): https://github.com/georgepsarakis/quick-bash .

It is actually a transcompilation to Bash from a functional language, using Python for the intermediate processing.

I started creating a language by myself a few days ago, this is a great read. Thanks!

Nice, what kind of language? I think nice beginner languages are Query languages (like SQL) or markup languages (like Markdown).

I'm aming for a scripting language that is a sort of mix between Lua and Go. The project is currently very overambitions, but it is fun! :)

This was a really cool read :). I'm working on my own DSL for generating code. I'll make a Show HN soon, but it was really cool to see someone else doing something so similar.

Nice what kind of code is it going to be used for to generate? A specific language or something more generic.

It would be cool with a DSL to generate low level code like X86 ASM, LLVM IR or JVM bytecode. That would be so meta!

Heh, not so complicated. It's just used to generate boiler plate code. Kind of like data object files in C#, Java, PHP, etc. That would be really cool though :).

whitespace sensitivity is a pretty big pain to deal with in the current toolkits available for building out a programming language, unfortunately.

The pain is compared to the absolute and total simplicity of defining a grammar when you're in a whitespace-insensitive setting (parser generators are easy to express and efficient in deterministic grammars).

We have a lot of good parsing tools out there, but we might be missing the right tools for building out good lexers

I'm getting started with LLVM, Just downloaded the source and now wondering how to go about things.

It would be cool if we could add a "Rust" tag to this entry. I almost didn't give this link a second look because I thought it was an older link. Glad I did give it a second look.

Applications are open for YC Winter 2022

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact