A Simple Compiler in JavaScript (github.com)
126 points by mgechev on Jan 21, 2018 | 38 comments

Of course, the implemented language is simple enough to accommodate this simple implementation. Still, it looks nice, the code is very clear, and the comments are pretty good.

The syntax is of course very limited with a one-line lexer, so more sophistication here would somewhat expand the lexer, though if it remains centered around regexes (which it probably will), it can stay pretty small.
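To illustrate the point, a regex-driven lexer for a whitespace-separated prefix language really can stay tiny. This is my own sketch, not the repo's actual code; the token patterns are illustrative:

```javascript
// Hypothetical one-line regex lexer: grabs runs of digits, runs of
// lowercase letters (operator names), or any single non-space character.
const lex = src => src.match(/\d+|[a-z]+|\S/g) || [];

// lex("mul 3 sub 2") → ["mul", "3", "sub", "2"]
```

Adding a new token kind is then just another alternative in the regex, which is why the lexer can grow without much ceremony.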

When expanding the language, the tree-building code can probably stay pretty small as well, if the syntax of your new constructs is chosen wisely. (I foresee control structures as prefix operators, which will be interesting to say the least.)

The evaluator and transpiler would grow significantly more complex when adding control structures to the language: I think the most code-expanding task would be handling not-directly-evaluated code blocks. In the case of conditionals, you can probably get away with the ?: operator, especially since JS has the comma operator. But in the case of loops, you need a completely different structure in the output; whether that stays small would depend on the implementation.
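For the conditional case, a transpiler targeting JS expressions could stay expression-shaped along these lines (a hypothetical sketch for a prefix-style conditional; the emitter names are mine, not from the repo):

```javascript
// Emit a JS conditional expression for a hypothetical prefix (if cond then else)
// construct, so the output remains a single expression rather than a statement.
const emitIf = (cond, then_, else_) => `(${cond} ? ${then_} : ${else_})`;

// The comma operator lets a "block" of expressions collapse into one expression
// that evaluates each in order and yields the last.
const emitSeq = exprs => `(${exprs.join(", ")})`;

emitIf("x > 0", emitSeq(["log(x)", "x"]), "0");
// → "(x > 0 ? (log(x), x) : 0)"
```

Loops genuinely break this scheme, since JS has no loop *expression*; that is where the output structure has to change shape.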

Of course, adding control structures is only useful if there's some way of defining names, be it variables (producing an imperative language) or recursion-capable functions (producing a functional language). It might be an interesting exercise to see how far you can push this language while still keeping the compiler under, say, 100 lines. :P

Protip if you do that: ditch the evaluator. A compiler doesn't need an interpreter to compile. ;)

I think compiler implementation is a particular sweet spot for functional languages. Here's a similar tiny example in Haskell:


When I did my degree, the older students were forbidden to use any language from the Lisp, ML or Prolog families for the compiler design assignment as that would be too easy.

As Java had just been released by the time my class got to do the same assignment, we ended up using Java with JavaCC and JJTree.

While not as easy as Lisp, ML or Prolog, it was still much easier than the Jurassic yacc/lex that still prevails in some circles.

Something about that rule seems extremely backwards to me. Might as well implement a Prolog or Scheme in a lower-level language and then do your compiler assignments in that.

Reminds me of this quote "Any sufficiently complicated C or Fortran program contains an ad-hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp."

Source: https://en.m.wikipedia.org/wiki/Greenspun%27s_tenth_rule

You’d be surprised how dominant hand-written recursive descent is in industry, and to be honest, I think it is often easier than using generator magic.

Everyone who is into compilers and parsing should try recursive descent at least once.
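For anyone who hasn't tried it: the core of recursive descent really is just one function per grammar rule, recursing into each other. Here's a toy arithmetic parser (grammar, token format, and names are illustrative, not from any particular compiler, and error handling is omitted):

```javascript
// Recursive-descent parser for: expr := term (('+'|'-') term)*
//                               term := factor (('*'|'/') factor)*
//                               factor := number | '(' expr ')'
function parse(tokens) {
  let pos = 0;
  const peek = () => tokens[pos];
  const next = () => tokens[pos++];

  function factor() {
    if (peek() === "(") { next(); const e = expr(); next(); return e; } // consume '(' … ')'
    return { num: Number(next()) };
  }
  function term() {
    let node = factor();
    while (peek() === "*" || peek() === "/")
      node = { op: next(), left: node, right: factor() };
    return node;
  }
  function expr() {
    let node = term();
    while (peek() === "+" || peek() === "-")
      node = { op: next(), left: node, right: term() };
    return node;
  }
  return expr();
}

parse(["1", "+", "2", "*", "3"]);
// → { op: "+", left: { num: 1 }, right: { op: "*", left: { num: 2 }, right: { num: 3 } } }
```

Operator precedence falls out of the rule nesting for free, which is part of why the technique is so pleasant by hand.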

True, but it should only be done after getting a working compiler, to get good error handling and messages.

From my experience most people don't use parser generator tools due to lack of knowledge that they even exist.

People don't use parser generator tools because you wind up doing more work than needed to integrate it with the rest of your compiler. The reason to use a parser generator is not for ease of use or rapid development (because they are harder to use and slower to develop with) but for performance and reliability.

The first one isn't such a big deal for most programming languages, but the second one is. Throw in better error handling and tricks like sub-file incremental and/or heuristic compilation (for IDEs), and the generators don't make much sense anymore.

> People don't use parser generator tools because you wind up doing more work than needed to integrate it with the rest of your compiler.

I don't think you can make such a sweeping generalisation. Both JavaCC and Antlr have saved me a lot of time when doing DSLs in the past and both are well documented and easy to use. It would have been much more work for me to write the parsers by hand in Java. Both could also generate my own hand-rolled AST easily and so I do not see why integration would be a problem for most projects. I agree that generators do not scale to large compilers or projects with special needs, but that is not most projects.

That makes sense after having a language reaching 1.0 in production use.

Not so much when prototyping, when it isn't even clear if the language is going to get any users beyond the design team.

My last few languages have all been prototyped by hand, but they were all live languages (with continuous re-parsing and interpretation), so I didn’t have much choice :). I haven't bothered with ANTLR since grad school.

I think I'd have been confused as to why using a parser generator was also not similarly regarded as being too easy :)

Interestingly, most mature industrial languages (e.g. Java, GCC, clang) use hand-coded recursive descent parsers. And of course such parsers can be implemented very easily from first principles using parser combinators in FP.
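To show what "from first principles" means here, parser combinators can be built up in a few lines even in JS (this is a toy sketch, not a real library: a parser is a function from an input string to [value, rest] or null):

```javascript
// lit(s): parser that matches the literal string s at the start of the input.
const lit = s => input =>
  input.startsWith(s) ? [s, input.slice(s.length)] : null;

// seq(p, q): run p, then q on the remainder; succeed only if both do.
const seq = (p, q) => input => {
  const r1 = p(input);
  if (!r1) return null;
  const r2 = q(r1[1]);
  return r2 ? [[r1[0], r2[0]], r2[1]] : null;
};

// alt(p, q): try p; if it fails, try q on the same input.
const alt = (p, q) => input => p(input) || q(input);

seq(lit("ab"), alt(lit("c"), lit("d")))("abd");
// → [["ab", "d"], ""]
```

Each combinator mirrors a grammar operator (concatenation, alternation), so the parser ends up reading like the grammar itself.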

I've used both JavaCC and Antlr in the past for small DSLs, both are very good tools, when working with Java.

Yeah, I guess it comes down to the later compiler layers, when doing data structure manipulation for the actual code generation.

Pattern matching and FP/LP based algorithms are quite convenient, versus doing it in an imperative approach.

Hand coded parsers win on the error reporting front.

> I think compiler implementation is a particular sweet spot for functional languages.

Compilers often use graphs as intermediate representation, and functional languages often are not a good fit for dealing with them.

Modern compilers seem to work with trees more often than graphs. Trees would certainly be easier to work with in any language.

That said, functional languages can and do work with graphs when needed, but yes it is much more awkward than trees.

Not down in the lower levels of IR and IL, the trees have mostly disappeared by then. If you are just interpreting or doing unoptimized compiling, you can get by with just trees.

GHC Haskell does nearly all its optimisation using trees and a high-level functional IR, so I think you are understating the role of trees.

Yes, but I don't think they require a low-level IR in that case (purity, no re-assignment, no loops). LLVM does not use trees at that level, for example, because they really need to use stuff like SSA, basic blocks, and so on.

A tree is also a form of graph.

> Compilers often use graphs as intermediate representation, and functional languages often are not a good fit for dealing with them.

Why is that?

Wow, excellent work! These are always super interesting to read.

A similar project: https://github.com/thejameskyle/the-super-tiny-compiler

I thought it would be good to include the output of the example executions at the bottom of the page... then I remembered I could just paste that in a console - very cool!

A while back I started making a game in javascript called "BASIC Instincts" that was going to be about finding clues in the world and typing in BASIC programs from magazines to solve puzzles. The parser/interpreter was the most fun part (https://www.youtube.com/watch?v=GwBiJR_rj_w)

I did the same thing for a conference talk once, but also “golfed” each of the 4 components down to Tweet size (back in my day, Tweets were 140 characters): https://gist.github.com/tlrobinson/1257254

Every programmer should implement a simple Lisp from scratch at least once, just to understand compilers and interpreters aren’t magic.
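In that spirit, the core of a toy Lisp evaluator fits in a handful of lines. This sketch (mine, purely for illustration) uses JS arrays as s-expressions and supports only a few built-in operators:

```javascript
// Environment of built-in operators; a real Lisp would support user definitions.
const env = {
  "+": (a, b) => a + b,
  "-": (a, b) => a - b,
  "*": (a, b) => a * b,
};

// Evaluate an s-expression: a list means "apply head to evaluated arguments",
// anything else is a self-evaluating literal.
const evaluate = expr =>
  Array.isArray(expr)
    ? env[expr[0]](...expr.slice(1).map(evaluate))
    : expr;

evaluate(["+", 1, ["*", 2, 3]]); // → 7
```

Adding special forms (quote, if, lambda) is where it gets interesting, but the apply/eval skeleton above is the whole trick.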

Nicely done. It demonstrates the principles comprehensibly with no extraneous noise.

This is really cool, thanks for posting this. I've always been vaguely curious how compilers work, but have never set aside time to research it. This was a great introduction, I have a better beginner's notion of how they work now :)

If anyone found this interesting, I recommend reading http://lisperator.net/pltut/

It's called "How to implement a programming language", it's in JS, and it's made by the guy behind UglifyJS.

... but why?

Why not?

*interpreter/transpiler, not a compiler

The interpreter is certainly an interpreter, but a transpiler is arguably also a compiler, albeit with an easier target than most conventional compilers. Assembly language and machine code are also languages you can program in; in that sense you'd also have to call gcc a transpiler ;)

Indeed, every compiler is a transpiler: it transpiles into asm. But not every transpiler is a compiler in the truest sense of the term. IMO only transpiling to machine language qualifies as compiling.

But we're just arguing about semantics here :) My comment was more directed towards the interpreter part of things.

> Indeed, every compiler is a transpiler: it transpiles into asm. But not every transpiler is a compiler in the truest sense of the term.

I think you have it backwards. Every transpiler is a compiler, but the reverse is not necessarily true.

> IMO only transpiling to machine language qualifies as compiling.

It is not black and white, compiling to IL or bytecode qualifies as compiling. Perhaps compiling means compilation to a binary executable format?

Bytecode is machine language, especially on mainframes and computer architectures up to the early '80s.

It is all a matter of how the CPU executes it: direct mapping of machine code onto gates, or a micro-coded translation layer.

GCC compiles C to assembly, and then uses GAS to turn that into actual machine code... So is GCC a transpiler, and GAS the compiler?

Can you update the first sentence of this wikipedia article to point out that a transpiler is not a compiler? https://en.wikipedia.org/wiki/Source-to-source_compiler

It would help squelch the confusion for future forum bike shedding.

Someone seems to try and point this out every time it comes up like they're winning some sort of pedantry points, but I don't see how they could be. Please illuminate me so I understand the next time someone says this and possibly join la resistencia.

Aaw, this would have been great if it didn't use `eval`, but just executed `+`, `-` etc directly.

It doesn't use the built-in eval(). There's a local function in the code named "eval" that executes the operation -- not the best choice of names, really.

Oops, guess I read it a bit too fast.

