Writing a basic x86-64 JIT compiler from scratch in stock Python (csl.name)
325 points by csl on Nov 9, 2017 | 50 comments

To be clear about the `del block` thing: the only thing the `del` keyword does in Python is unbind a name from the local scope. It does not cause an object to be destroyed, call `__del__`, or anything else. It's morally equivalent to `block = None`, except that future references to `block` will raise a NameError instead of giving you None.

When used at the end of a function, where all locals are about to go out of scope anyway, it does nothing.

Unbinding a name does do one thing which might call `__del__`: It decreases the reference count by 1.

In CPython, yes. But this is an implementation detail.

I learned this with PyPy (or maybe it was Jython?). My usual bad habit in small scripts, depending on open files being closed when they go out of scope, suddenly exhausted all file descriptors. Ouch.
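The refcount point above can be watched directly in a small sketch. Note this assumes CPython semantics; PyPy and Jython defer finalization to their garbage collectors, which is exactly the file-descriptor trap described:

```python
class Tracker:
    """Records when __del__ fires, so we can watch refcount-driven finalization."""
    destroyed = []

    def __del__(self):
        Tracker.destroyed.append(True)

t = Tracker()
other = t                  # two references to the same object
del t                      # unbinds the name; the object survives via `other`
assert Tracker.destroyed == []
del other                  # last reference gone: CPython runs __del__ immediately
assert Tracker.destroyed == [True]
```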

Just what I love: a scientific and pedagogic CS article, in well-formulated English.

I would appreciate some introductory notes or links to prerequisites though. That would perhaps entice an audience who has no previous experience in writing compilers.

I really want to be able to wrap my head around compilers, but I'm not sure what a good "user friendly" resource is that won't stuff lots of fancy compiler buzzwords in my face and become white noise to me.

I would recommend "Compiler Construction" by Niklaus Wirth. This small book (< 200 pages) will teach you the basics and show you how to build a single-pass compiler for a subset of Oberon (a Pascal successor).

The book is freely available here: http://people.inf.ethz.ch/wirth/CompilerConstruction/index.h...

For a more comprehensive example, you can have a look at my compiler for the full Oberon language targeting the JVM. It is based on the principles of that book: https://github.com/lboasso/oberonc

I second the recommendation. Wirth's compiler book is a wonderful, concise and very readable introduction.

Start simple. If you haven't written an interpreter, do that first. Then figure out how to write a similar program that emits code for a virtual machine rather than evaluating the language.
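As a concrete illustration of that progression, here is a hypothetical toy expression language, first interpreted directly and then handled by the same recursive traversal re-targeted to emit code for a tiny stack machine (all names here are made up for the example):

```python
# A tiny language: nested tuples like ('+', 1, ('*', 2, 3)).

# Step 1: interpret the expression tree directly.
def interpret(expr):
    if isinstance(expr, int):
        return expr
    op, a, b = expr
    x, y = interpret(a), interpret(b)
    return x + y if op == '+' else x * y

# Step 2: the same traversal, but emitting stack-machine code instead of values.
def compile_expr(expr, code):
    if isinstance(expr, int):
        code.append(('PUSH', expr))
    else:
        op, a, b = expr
        compile_expr(a, code)
        compile_expr(b, code)
        code.append(('ADD',) if op == '+' else ('MUL',))
    return code

# Step 3: a trivial virtual machine that runs the emitted code.
def run_vm(code):
    stack = []
    for instr in code:
        if instr[0] == 'PUSH':
            stack.append(instr[1])
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if instr[0] == 'ADD' else a * b)
    return stack.pop()

prog = ('+', 1, ('*', 2, 3))
assert interpret(prog) == run_vm(compile_expr(prog, [])) == 7
```

The point of writing both is seeing that the compiler is structurally the same program as the interpreter, just producing instructions instead of answers.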

I'd keep an eye on http://www.craftinginterpreters.com/contents.html, although it hasn't gotten to compilation yet (I think).

I'm actually trying to write an easy-to-digest compiler and companion reading series. It's for a compilers independent study class. You can track my progress here: https://github.com/tekknolagi/tigress

At this point it's not well-documented, but the layers are numbered in order: l00, l01, l02... Each builds on top of the previous one.

EDIT: I just added a mini evaluator for debugging purposes and so people can see what the language is like.

Lisp in Small Pieces would be my recommendation. A relatively small book that takes you through writing 11 interpreters and 2 compilers.

Is it really that friendly when starting out, though? I found it brilliant, but quite the challenging read. I'd rather just read chapters four and five of SICP [1] and move on to Queinnec as a follow-up on more advanced topics.

[1]: The book is free as well: https://mitpress.mit.edu/sicp/full-text/book/book-Z-H-4.html...

To be honest, I found LiSP to be a lot easier than SICP, but that's probably just some learning styles stuff at play.

SICP being freely available both in book form and as lectures [0] makes it one of the best resources out there.

(I just never found its section on compilers that helpful. Again, probably learning styles.)

[0] https://www.youtube.com/playlist?list=PLB63C06FAF154F047

Maybe I need to reread LISP, then. I agree that SICP is laser-focused on Scheme, and thus leaves out general compilation techniques.

This isn't about wrapping one's head around a compiler, but I thought you might find it useful. https://medium.freecodecamp.org/the-programming-language-pip...

I enjoyed this read because it was a very user-friendly overview. After reading, I knew exactly what major pieces did what and therefore what to search for next on my own terms.

I'm writing my first compiler now, and it's been quite a learning curve. I've also realized how much nicer my language is for writing compilers than the language in which I'm implementing the compiler. That made me realize I could write the compiler in my language and then write an interpreter for my language which would only need to run once--to generate the compiler--after which point my compiler would be self-hosting. :p

Do you mind sharing some of the resources you're using to learn? Book titles or URLs?

I wrote https://codewords.recurse.com/issues/seven/dragon-taming-wit... about a tutorial-level compiler written in itself. Of course, a recommendation from someone who's learned from it instead of from the author would give a better signal.

I also recommended the same Wirth book as lboasso, and a couple other books, down at the bottom of the page.

I haven't looked at any resources in particular. Basically my compiler has 3 parts--parser -> type checker -> code generation. Parsing was tricky at first, but then I stumbled upon a pattern which is pretty close to parser-combinators. Type checking seemed daunting, but my language happened to be Hindley-Milner compatible, and there's lots of reference information for HM. Code generation is easy in the base case, but I'm still wrapping my mind around template expansion.
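The parser pattern described above might look something like this minimal sketch, where a parser is just a function from an input string to a (value, rest-of-input) pair, or None on failure. The names are illustrative, not from the commenter's compiler:

```python
def char(c):
    """Parser that matches a single expected character."""
    def parse(s):
        return (c, s[1:]) if s.startswith(c) else None
    return parse

def seq(p, q):
    """Combinator: run p, then q on the leftover input; fail if either fails."""
    def parse(s):
        r1 = p(s)
        if r1 is None:
            return None
        v1, rest = r1
        r2 = q(rest)
        if r2 is None:
            return None
        v2, rest2 = r2
        return ((v1, v2), rest2)
    return parse

def alt(p, q):
    """Combinator: try p, fall back to q on failure."""
    def parse(s):
        return p(s) or q(s)
    return parse

ab = seq(char('a'), char('b'))
assert ab('abc') == (('a', 'b'), 'c')
assert ab('xyz') is None
assert alt(char('x'), char('a'))('abc') == ('a', 'bc')
```

Once you have `seq` and `alt`, a recursive-descent grammar falls out almost mechanically.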

> that won't stuff lots of fancy compiler buzzwords in my face and become white noise to me.

Crenshaw's tutorial fits this perfectly: https://compilers.iecc.com/crenshaw/

x86 version here: http://www.pp4s.co.uk/main/tu-trans-comp-jc-intro.html

I think Wirth's compiler book is more approachable for a beginner. Once you are done with that book you can also read "Project Oberon" from the same author. This way you can understand the implementation of a full graphical OS written from scratch for a custom made RISC processor implemented on a FPGA. The book and source code are available at: http://www-oldurls.inf.ethz.ch/personal/wirth/ProjectOberon/

Stanford Lagunita has Stanford's Compilers course available for free, with the usual undergraduate prerequisites.

In my opinion, the most tedious part of writing this sort of thing from scratch is the x86/x86-64 instruction encoding. Grab the Intel/AMD manuals or an opcode database and have fun. Other projects to study: LLVM, dynasm, asmjit, peachpy, luajit jit, Intel XED, etc.

FWIW, the series concluding at https://eli.thegreenplace.net/2017/adventures-in-jit-compila... builds a Brainf*ck JIT in pure C++, then using LLVM, then in Python using PeachPy.

Yeah, it sucks. That's what turned me off when I first messed around with JITs. Wish I had known about GNU Lightning back then.

> dynasm, ... luajit jit

Aren't these the same thing?

Off-topic question: when do you find time to do such interesting things as these? I'd really like to dive deeper into this topic. As a CS student who also works as a dev for a living, I'm happy just to get 8 hours of sleep.

Almost always in evenings, whenever I have time and feel like it. If you're a student, you're probably hosed with deadlines, so pick very small things to try out — things like really simple interpreters.

This particular project luckily turned out to be quick to get working. Also, I kept the scope very small, calling it a day right before getting big ideas.

Why not take a class about compilers and implement something like that as a class project?

Where's the fun in that?

A nice trick for debugging code generated at runtime is temporarily adding an `int3` (0xCC) breakpoint instruction at the beginning of your assembled function, so you don't have to care where in memory your snippet is going to end up.

This is a very interesting and well-written article, but I am puzzled by the technique being described as "JIT compilation".

Isn't JIT compilation the process of compiling some sort of intermediate code into machine code at runtime? For example, compiling JVM bytecode into machine code during execution, and using the cached machine code transparently on subsequent calls to that segment of code.

Not to detract from the article, or the interesting techniques and explanation, but I didn't see any compilation other than by gcc, which IMHO makes this AOT rather than JIT compilation.

What am I missing?

I do see your point, but I tried to make it clear that this is compilation in an extremely restricted sense: It specializes and executes a function at runtime. And I wanted to cover the core technique of doing just that.

Perhaps I should have included a small example of compiling a linear list of abstract instructions into machine code. But that again would consist of compiling small templates in a coherent way, just like many actual compilers do.
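That core technique of specializing and executing a function at runtime can be sketched in a few lines of stock Python on x86-64 Linux. This is a minimal illustration without error handling, not the article's exact code: write machine code into an executable mmap'd buffer and call it through ctypes.

```python
import ctypes
import mmap
import struct

def make_const_fn(value):
    """JIT a function that returns `value`: mov rax, imm64 ; ret."""
    code = b'\x48\xb8' + struct.pack('<Q', value) + b'\xc3'
    buf = mmap.mmap(-1, mmap.PAGESIZE,
                    prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
    buf.write(code)
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    fn = ctypes.CFUNCTYPE(ctypes.c_uint64)(addr)
    return fn, buf  # keep `buf` alive for as long as you call `fn`

fn, buf = make_const_fn(42)
assert fn() == 42
```

Systems with strict W^X policies will refuse the read-write-execute mapping; the article's two-step mprotect dance is the more portable approach.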

Anyway, point taken, and maybe I'll expand the article or follow up on it.

My subsequent comment acknowledges that JIT means more than compilation at runtime. In .NET desktop, it means compilation of all CIL to native on every run, just before execution. To me, that's just executing the second pass of a compiler at launch time, but Anders Hejlsberg says it's JIT[1], and I'm not going to argue :)

[1]: https://stackoverflow.com/a/1255828/4158187

Anyway, I've made a follow-up post that shows how to JIT compile a subset of Python, directly based on the previous techniques: https://csl.name/post/python-compiler/

For me, the key concept for a JIT compiler is profiling at runtime to decide what to compile and how aggressively, which this article doesn't deal with. The other big challenge is keeping the profiling and compilation cheap enough that they don't negate their own benefit.
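The profile-then-compile idea can be caricatured in a few lines, with a call counter standing in for real profiling and a pluggable `compile_fn` standing in for actual code generation (all names here are hypothetical):

```python
def jit_when_hot(fn, threshold, compile_fn):
    """Run `fn` 'interpreted' until it gets hot, then swap in a compiled version."""
    state = {'count': 0, 'compiled': None}

    def wrapper(*args):
        if state['compiled'] is not None:
            return state['compiled'](*args)
        state['count'] += 1
        if state['count'] >= threshold:
            state['compiled'] = compile_fn(fn)   # compile once, reuse thereafter
        return fn(*args)

    return wrapper

# A fake "compiler" that just wraps the original and records its own calls.
compiled_calls = []
def fake_compile(f):
    def fast(x):
        compiled_calls.append(x)
        return f(x)
    return fast

hot_square = jit_when_hot(lambda x: x * x, threshold=3, compile_fn=fake_compile)
results = [hot_square(i) for i in range(5)]
assert results == [0, 1, 4, 9, 16]
assert compiled_calls == [3, 4]   # only calls after the threshold hit the compiled path
```

Real VMs like HotSpot do the same thing with per-method and per-loop counters, plus on-stack replacement so a hot loop can be swapped mid-execution.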

The article author seems to be using GCC not at runtime, but more as a generator for templates (at the machine code level) for the rest of the software to use. You don't really want to be assembling a template at runtime in this context.

Mixed-mode execution (interpreter+JIT when justified) isn't the only way to go.

.Net doesn't work that way, for instance. Unlike Java's HotSpot, .Net never interprets CIL instructions. They're always compiled down to native code for execution.

At least, that was true in 2009, according to Jon Skeet - https://stackoverflow.com/a/1255832/2307853

> .Net never interprets CIL instructions.

I wasn't aware of this. To be clear, it's not 100% accurate to say "never", because it depends on the .NET environment, but point taken.

I now understand that .NET (on desktop, anyway) compiles all intermediate code (CIL) to native code on every run. This is a way different kind of JIT than Java HotSpot, which executes intermediate code (bytecode) initially, profiles intermediate code execution, and only compiles the hot spots to native code after a while.

It seems like what "JIT" means has evolved since the time it meant "JVM HotSpot".

There's a very interesting snippet of an interview with Anders Hejlsberg at https://stackoverflow.com/a/1255828/4158187, which exposes his reasoning.

There's a rats' nest of terminology here.

".Net" and "Common Language Runtime" refer specifically to Microsoft's implementation. The spec is called the "Common Language Infrastructure".

Wikipedia seems to use "Common Language Runtime" in both ways - the Mono article uses it in the generic sense, whereas the CLR's own article explicitly states that it refers to .Net's implementation.

.Net also has ngen.exe which isn't really JIT at all, it's traditional 'offline' ahead-of-time compilation. In a sense that's rather like the traditional Unix compile-and-install model. One sees the oxymoron "pre-JIT" used to describe this.

This is another difference from HotSpot, which historically never cached native code to disk, though other JVMs have always been able to do this (Excelsior Jet, and GCJ, for instance).

Apparently AOT may be coming to Java though - it may even already be here, I'm not quite sure.

https://www.infoq.com/news/2016/10/AOT-HotSpot-OpenJDK-9 , http://openjdk.java.net/jeps/295

I recall once reading that one of the reasons HotSpot didn't cache native code, was the issue of securing that cache. Not sure what they've done to address that, nor what .Net does; it seems a valid concern.

That's an interesting read re. Hejlsberg. Worth noting that Mono takes them up on that and does feature an interpreter.

Speak of the devil - https://news.ycombinator.com/item?id=15686875

Edit: I'm wrong about the terminology! https://news.ycombinator.com/item?id=15688290

Love the idea of only using stock Python.

Congratulations on the article.

I am reminded of a most excellent piece of science fiction, "Coding Machines".


> good performance

That is awesome! Thanks!

Does anybody know the easiest way to compile to JVM bytecode? I have a Scheme interpreter written in Kotlin, what is the best way to compile it, instead of interpreting? Where do I start?

Examples of Lisp-2 compilers using Java and ASM:

Shen to JVM bytecode compiler in Java 6 https://github.com/otabat/shen-jvm

Shen to JVM bytecode compiler in Java 8 with indy https://github.com/hraberg/Shen.java

The most widely used library for dynamic bytecode generation is asm: http://asm.ow2.org/

I was poking around* with this yesterday; it has two different simple implementations of a Lispkit compiler:


*by "poking around" I mean I typed the code into a text editor for further study:

code: https://pastebin.com/6pAkqE2R

Another alternative to jcdavis's answer is to just output bytecode in text format and use Jasmin to assemble it.


Or have a look at Kawa.


Another possibility is using Truffle & Graal.

For one thing Clojure comes to mind: https://github.com/clojure/clojure

