Hacker News new | comments | show | ask | jobs | submit login
How to implement a programming language in JavaScript (lisperator.net)
153 points by tambourine_man on June 29, 2016 | hide | past | web | favorite | 38 comments

Cool... but nitpick about the parsing part (only glanced at it): why do people always forget about parser combinators and don't present them as obviously 'the first thing to try' when making a new language?!

They seem the simplest and most intuitive tool for an unbiased newbie designing a new language, and you can even introduce them by skipping the theory with something like this: http://theorangeduck.com/page/you-could-have-invented-parser... .

If anyone is interested in making a language in JS I can recommend the "Nearley" parser/lexer:


Very easy parser/lexer to use, accepts a larger class of grammars than yacc/lex and is reasonably fast.

I use it for the language in my project: http://emunotes.com and Nearley is fast enough to parse stuff as the user is entering text.

It may be tempting to see if you can get away with not using a parser generator (as the author does here) but I found it is not worth the trouble it generates down the road.

Emunotes is really cool. A magical word processor. Is there a way to make functions that connect to expanded services?

Thanks; if you by expanded services mean integration with external pieces of user-defined javascript or data sources (such as a YQL query) it is a good and very potent idea. It is something I like to do a little further down the road. In the near future I am focusing on graphics and collaborative features.

This is also a really good resource: https://github.com/thejameskyle/the-super-tiny-compiler

Immediately thought about another awesome resource about how to write a language in JavaScript:


UNIX/Linux has had off the shelf tools for doing this for nearly 40 years. Two of them are called lexx and yacc.

Kind of reinventing thevwheel for those unfamilar with history.

Parsing is the smallest part of writing a compiler.

It's also one of the most conceptually difficult parts, which makes it a good learning opportunity.

One common criticism of compiler classes is that the first semester spends all of its time on parsing, so the students don't learn very much about compilers. And btw, there's an infinite number of conceptually difficult parts when it comes to optimization.

Conceptually? I'd say it can be tedious, if done by hand, but the concepts are pretty simple.

"It's also one of the most conceptually difficult parts"

... except all the others.

It would be nice to read some books like Language Implementation Patterns for deeper thoughts after following some easy tutorials. That book is good for designing both DSL and general-purpose languages and explains multiple patterns with pros and cons, tradeoffs, etc.

And also, it's better to check out some metasyntax notations, e.g. EBNF https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_F...

Cool! This is a great, clear tutorial.

I strongly recommend this activity to almost any programmer. No matter what, you'll end up with a deeper appreciation for great system design. You realize how almost every design decision boils down to a compromise between convenience and flexibility. For better or worse, you'll also get new respect for the cruel reasoning behind more arcane, but also more flexible APIs. Red pill blue pill kind of thing.

And, just to share: Last year I took a crack at creating a JavaScript-interpreted Lisp and this is the result - https://github.com/matthewtoast/runiq - I haven't added anything since then but I might pick it up again, as it was a helluva lot of fun and also quite mind-expanding.

Is there such a thing as a language that compiles to a value rather than an executable program?

I've done a really hacky version of something this where a user would input a succinct, high-level description of a sequence of colored lighting, and my program would parse* that and build the tedious, low-level JSON representation of that sequence for yet another program to perform. Is there a fancy or unfancy name for that?

*Ok really it was a little bit of munging and then eval.

Well, at some level every language compiles to a value rather than an executable program - the parse tree of a source module is a data structure describing the program, and the executable is a sequence of bytes that is interpreted by the processor.

There's a continuum between this and what we think of as a "value" in programming terms (eg. an int or a string), but the boundary is fuzzy. It was pretty common at Google to write DSLs that compiled to a protobuf that'd be loaded into the server as configuration, for example. CUDA compiles programs into a buffer that is then sent to the GPU. HTML compiles into the DOM, which is a value that can be manipulated with Javascript.

Consider a run-length encoded image. One possible implementation is to have a command, a length value, then a sequence of 1 or more pixels. A "run" command's sequence would just be the color to repeat, while a "dump" command's sequence would be a number of pixels to output directly into the image.

In a sense, that's a piece of source code that would "compile" into an image, if you take a loose interpretation of what it means to "compile".

For other examples that may count, LaTeX and Postscript are both Turing-complete languages, usually compiled into a typeset document and a vector graphic image, respectively.

You might call them Domain-Specific Languages, in general (the DSL mentioned by the sibling comment).

Sure. I would describe that as a domain-specific language (DSL), although not all domain-specific languages look like that. DSLs that don't encode any computation are sometimes described as "configuration" languages, but I'm not sure that term is particularly well-defined.

Domain-specific languages are one of the most powerful yet underused techniques in programming—especially if the language can be embedded inside a rich, general-purpose language. We get to bear all the abstractive capability of language at our domain while also having general programming language techniques at our disposal (static analysis, optimization, type checking... etc). If we're really doing a good job we can even make sure our DSL is easy to reason about and has good semantics, using the theoretical tools developed to analyze general-purpose programs.

One neat trick is that a lot of tools that are difficult to use on general-purpose languages like automatic verification or program synthesis can become far easier to apply and more practical when working with a small DSL. A lot of tools in this vein would be a multi-year research project to implement for a language like C but only a multi-week project for sufficiently constrained, well-behaved DSLs. And if you're designing your DSL, you can design it in tandem with this tooling, making the problem even more tractable.

The language in the article is just such a thing!

The difference is essentially between statements (AKA instructions) and expressions (AKA values).

Statements are executed one after another to produce effects(/actions). That is "imperative" programming.

In contrast, expressions represent values, and we evaluate(/reduce/simplify) them over and over. Sometimes they reach a "normal form" (i.e. a "final result"), other times they may end up in a loop. This may be called "value-oriented programming", and pure functional programming is one example. The language in the article is similarly "value-oriented", although it isn't "pure" since it allows effects in its expressions, which requires careful choice of which order to evaluate things in.

In Haskell, for example, "programs" are just expressions representing some sort of value. It just-so-happens that when a Haskell program is 'executed', what actually happens is we look for a value called "main", we evaluate it to get a value, and that value just-so-happens to have the type "imperative program" (technically it's called "IO a"). That "imperative program" value is then handed over to an interpreter to be executed. Well, conceptually at least; those tasks are actually interleaved so the code is executed "on demand", and the whole thing can get smushed together by an aggressive compiler. Still, it sounds very much like your two-stage language.

Macros in C, and template metaprogramming in C++? Some crazy people are writing their software entirely as C++ metaprograms and using the compiler to execute it. Presumably, the compiler's output is a very short `main` function that just prints the constant value that the compiler calculated.

There was an example of this on HN a while ago, where a guy wrote a Game of Life implementation entirely with C++ templates (https://news.ycombinator.com/item?id=8597443). Successive generations were nested templates. Compile times became prohibitive after a few generations. The whole thing started to read sort of like Haskell.

The dude writes good code. He's the author of Uglify.

So this is how magic happens

The new hotness seems to be transpiling, turning an enhanced language into Javascript. Any good resources for this?

It's the same as any other compiler, just generate javascript out of the backend.

It would be nice when saying things like "don't use regexps for parsing" that it is accompanied by an explanation (possibly I didn't read far enough to see the explanation...)

There are only a few actual computer science theory things that I think all programmers should know and this is one of them. There are classifications of grammars (1). Without going into detail, regular expressions can only be used to parse regular grammars.

The problem is that most grammars for programming languages, file formats, communication protocols, etc, are not regular grammars. With a regular grammar you can look at the current state and the next input symbol to determine the next state. With context free grammars you need to have a stack of states. With context sensitive grammars you need to have a stack of stacks states. With unrestricted grammars you are essentially screwed ;-)

So your first task when you are designing a language or file format or communication protocol or whatever is to choose a simple grammar. If you choose a regular grammar then you can (and probably should) use regular expressions to parse it. With context free or context sensitive grammars you can often do lexical analysis (creating symbols that you then pass to your parser) with regular expressions, but you need something more complex for parsing the stream of symbols (i.e., you need to be able to put parser states on a stack).

The problem that you often see is that people design things and have absolutely no idea what kind of grammar it is. They use regular expressions (or some ad-hoc code) and then try to keep track of state in global variables. If they happen to have a context sensitive grammar then chances are their parser simply will not work correctly no matter what they do.

You may be wondering why people choose to use more complex grammars if it is harder to parse. The main reason is that complex grammars give you more options for expression. Sometimes it is extremely difficult or even impossible to represent something with a regular grammar. Having said that, though, you should almost always try to keep your grammars at least context free. Once you get into context sensitive grammars, the difficulty of parsing will either make your parser very difficult to implement or buggy as hell (usually both). Usually it is better to remove functionality than it is to move from a context sensitive to context free grammar.

In the past I have often seen file formats that have unrestricted grammars (converting file formats used to be my job). People who do this should be replaced with programmers who know what they are doing :-P

(1) - https://en.wikipedia.org/wiki/Chomsky_hierarchy

This is all true, for sure, but I'd like to point out that PEGs are a formalism that isn't much harder to learn than regular expressions, and totally appropriate for parsing a fairly large class of languages. I've also seen PEG-based (pakrat) parser generators for just about every programming language (some better than others, of course). Again, these tend to be not much harder to learn than regular expressions, and because of the "ordered choice" operator, many programmers seem more comfortable reasoning about PEGs than a Yacc specificiation, for example.

PEGs aren't the only formalism that attempts to make writing certain classes of parsers easy. You can get a long way with parser combinators, and Matthew Might's "Yacc is Dead" paper discusses another (extremely general) approach, which I have unfortunately not seen many implementations of.

Anyway, enough of an addendum! Keep on spreading the Good Word ;)

Thanks for addending :-) I wasn't aware of PEGs. That definitely does look promising for a lot of cases.

This. OMeta and Ohm are awesome.

Just wondering if it is true that most of the modern language parsers are written in C?

I don't think this is true. Loads of FP languages are self-hosted.

Not only FP. Lots of languages are self-hosted. So actually most modern parsers are not written in C, since they are written in the language they parse.

Probably true, but only because the implementation language for them is also C. Some of them also frequently use a mix of the implementation language and grammar specification DSL like yacc.

Use of DSLs and tools like flex and bison makes the implementation a whole lot more reliable (but only if they're used, and some commonly run languages have a very implementation-specific grammar).


    input:    /* empty string */
            | input line
    line:     '\n'
            | exp '\n'  { printf ("\t%.10g\n", $1); }
    exp:      NUM                { $$ = $1;         }
            | exp '+' exp        { $$ = $1 + $3;    }
            | exp '-' exp        { $$ = $1 - $3;    }
            | exp '*' exp        { $$ = $1 * $3;    }
            | exp '/' exp        { $$ = $1 / $3;    }
            | '-' exp  %prec NEG { $$ = -$2;        }
            | exp '^' exp        { $$ = pow ($1, $3); }
            | '(' exp ')'        { $$ = $2;         }

Python, Ruby, PHP: C

JS: no standard parser, but V8(chrome), SpiderMonkey(firefox) and Chakra(edge) are all C++

Go, C, C++: self-hosted (written in the same language)

Java, Erlang: parser/compiler self-hosted, but VMs written in C.

I doubt it. If anything, C++, but even that doesn't seem likely. The vast majority of languages seem to be self-hosted. A good portion compilers or interpreters in C or C++ probably use (F)lex and Yacc/Bison so even saying C or C++ are the most popular choice (but not necessarily the majority) sounds like a stretch.

Go is now written in Go

It used to be written in C before bootstrapping.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact