A Complete Guide to LLVM for Programming Language Creators (mukulrathi.co.uk)
402 points by mseri 11 months ago | 44 comments

A question for the LLVM experts here:

In the past when I've looked at LLVM from a distance, the biggest stumbling block I found was that it's written in C++, which isn't the language I'm using for my frontend.

How important is the C++ API in practice? Are the bindings for other languages usable? Is it possible to have my frontend emit LLVM IR in a text format, similarly to how you can feed assembly language source code to an assembler? Or should one really just bite the bullet and use C++ to generate the IR? I noticed that the compiler in this tutorial has a frontend in OCaml and a backend in C++, with the communication between them being done via protobufs.

I emit LLVM IR as text in my compiler. It’s painless, almost. Some points off the top of my head…

- I started out just printing strings, but found I wanted a bit more structure so I wrote a dozen or so functions to do the printing. Things at the level of call(name, args). Each one makes a single line of IR. It helps keep the typos down.

- just run your IR through clang to compile it. It has good enough error messages that you won’t go mad.

- you will need to make a bunch of DI metadata to get debug information into your programs. It took me about 8 hours to get enough that I have line number information and LLDB will show me functions and source lines on a backtrace. I should have done this much earlier than I did. I was getting by with line number comments in my IR, which is simple, but useless if you give yourself an access violation.

- learn to write little C programs and use clang to emit the IR. This will let you sort out how to do things. The IR manual is good, but there are concepts which are omitted because they were obvious to the authors, but won’t be to you.

- local symbols have to be in strict, ascending, no skipped values numerical order. Screw that. Grab a dollar sign and make your own numeric local-ish symbols and don’t sweat reordering them.

- it doesn’t seem to care what order you emit things in, other than lines in a block of course, so just do what is easy for you.

- get a grip on getelementptr and phi. The first may not be what you think it is, learn what it is. The second is just part of the magic of SSA.
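To make a few of the points above concrete, here is a minimal sketch in Python of the "dozen or so functions" idea. The class and method names (`FunctionEmitter`, `assign`, `build_abs_diff`) are my own invention for illustration, not the commenter's code. It hands out `%$0`, `%$1`, ... as locals, which LLVM treats as ordinary named values, so they are not subject to the strict numbering rule for `%0`, `%1`, ..., and it ends with a `phi` that merges the two branches.

```python
import itertools


class FunctionEmitter:
    """Collects the text of one LLVM IR function, one instruction per call."""

    def __init__(self, name, ret_type, params):
        # params is a list of (type, name) pairs, e.g. [("i32", "a")]
        self.name, self.ret_type, self.params = name, ret_type, params
        self.lines = []
        self._ids = itertools.count()

    def fresh(self):
        # "%$0", "%$1", ... count as *named* values, so gaps and
        # reordering are fine, unlike the unnamed %0, %1, ... sequence.
        return f"%${next(self._ids)}"

    def label(self, name):
        self.lines.append(f"{name}:")

    def emit(self, text):
        self.lines.append(f"  {text}")

    def assign(self, rhs):
        dest = self.fresh()
        self.emit(f"{dest} = {rhs}")
        return dest

    def binop(self, op, ty, lhs, rhs):
        return self.assign(f"{op} {ty} {lhs}, {rhs}")

    def icmp(self, cond, ty, lhs, rhs):
        return self.assign(f"icmp {cond} {ty} {lhs}, {rhs}")

    def call(self, ret_type, name, args):
        arg_text = ", ".join(f"{ty} {val}" for ty, val in args)
        return self.assign(f"call {ret_type} @{name}({arg_text})")

    def phi(self, ty, incoming):
        # incoming is a list of (value, predecessor label) pairs
        pairs = ", ".join(f"[ {value}, %{label} ]" for value, label in incoming)
        return self.assign(f"phi {ty} {pairs}")

    def render(self):
        sig = ", ".join(f"{ty} %{name}" for ty, name in self.params)
        body = "\n".join(self.lines)
        return f"define {self.ret_type} @{self.name}({sig}) {{\n{body}\n}}\n"


def build_abs_diff():
    f = FunctionEmitter("abs_diff", "i32", [("i32", "a"), ("i32", "b")])
    f.label("entry")
    is_gt = f.icmp("sgt", "i32", "%a", "%b")
    f.emit(f"br i1 {is_gt}, label %gt, label %le")
    f.label("gt")
    a_minus_b = f.binop("sub", "i32", "%a", "%b")
    f.emit("br label %done")
    f.label("le")
    b_minus_a = f.binop("sub", "i32", "%b", "%a")
    f.emit("br label %done")
    f.label("done")
    result = f.phi("i32", [(a_minus_b, "gt"), (b_minus_a, "le")])
    f.emit(f"ret i32 {result}")
    return f.render()


if __name__ == "__main__":
    print(build_abs_diff())
```

Saving the printed text to a .ll file and passing it to clang, as suggested above, is one way to check that the output is accepted.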

> learn to write little C programs and use clang to emit the IR.

Highly recommend this. I used this to get an understanding of how to implement the IR for my language.

The command to use is `clang -S -emit-llvm -O1 foo.c`

It'll write it out to a foo.ll file.

(I use -O1 so it cleans up a bit of the messy parts of the IR).
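If you end up doing this a lot, it may be worth wrapping in a small script. A rough sketch in Python follows; the function names `emit_ir_command` and `emit_ir` are hypothetical, and the only clang flags used are the ones quoted above plus `-o`. It returns None when clang is not on the PATH rather than failing.

```python
import shutil
import subprocess
from pathlib import Path


def emit_ir_command(c_file, opt_level="-O1"):
    """Build the clang command line quoted above; output goes next to the source."""
    src = Path(c_file)
    out = src.with_suffix(".ll")
    return ["clang", "-S", "-emit-llvm", opt_level, str(src), "-o", str(out)]


def emit_ir(c_file):
    """Run clang if it is installed and return the IR text, otherwise None."""
    if shutil.which("clang") is None:
        return None
    cmd = emit_ir_command(c_file)
    subprocess.run(cmd, check=True)
    # The last element of the command is the .ll output path.
    return Path(cmd[-1]).read_text()
```

Calling `emit_ir("foo.c")` then gives you the contents of foo.ll as a string to read or diff.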

Can you print variables in lldb with the debug information?

One compiler that I use which emits LLVM IR added support for debug information recently, and it's now possible to set breakpoints in gdb, but you can't print out any stack variables or anything, so it's not useful other than for figuring out which code paths execute.

I'd like to learn more about this. Maybe contribute to the compiler and fix this issue.

> I'd like to learn more about this. Maybe contribute to the compiler and fix this issue.

You need to create a call to the `llvm.dbg.declare` intrinsic that connects the stack variable alloca to a DILocalVariable metadata node, and place the call after the stack variable alloca. The rest of LLVM will handle tracing the stack location through the compiler to the output, including the alloca being promoted.

See: https://llvm.org/docs/SourceLevelDebugging.html#debugger-int...
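For a compiler that prints IR as text, the pieces might look roughly like the sketch below (Python, purely for illustration). The helper names, the metadata IDs such as `!12`, and the `%x.addr` slot are all made up, and the rest of the debug metadata (compile unit, subprogram, types) is assumed to exist already. The exact spelling varies between LLVM versions, for example `i32*` versus `ptr` for the slot type, so compare against what your own clang prints with -g.

```python
# The intrinsic must be declared once per module.
DBG_DECLARE_DECL = "declare void @llvm.dbg.declare(metadata, metadata, metadata)"


def local_variable_md(md_id, name, scope, file, line, type_md):
    """Return the metadata line describing one source-level local variable."""
    return (
        f'{md_id} = !DILocalVariable(name: "{name}", scope: {scope}, '
        f"file: {file}, line: {line}, type: {type_md})"
    )


def declare_local(slot, slot_type, var_md, loc_md):
    """Return the call that ties a stack slot (an alloca) to its variable node.

    Emit this right after the alloca it refers to.
    """
    return (
        f"  call void @llvm.dbg.declare(metadata {slot_type} {slot}, "
        f"metadata {var_md}, metadata !DIExpression()), !dbg {loc_md}"
    )


if __name__ == "__main__":
    print("  %x.addr = alloca i32")
    print(declare_local("%x.addr", "i32*", "!12", "!13"))
    print(local_variable_md("!12", "x", "!4", "!1", 2, "!7"))
```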

Julia has an interesting split here: it does the lowering into SSA form in pure Julia and then has a codegen step that translates the SSA form into LLVM IR, but for that second step we do use the C++ API. We have very robust bindings to the C-API, but it forever feels just a bit incomplete and less cared for. The C-API is very stable, whereas the C++ API does change quite a bit.

But you cannot use the C-API for symbols/methods. You need a C++ callback for that.

I would avoid any text, but LLVM has mature bindings in OCaml and Haskell, for example. The textual representation isn't stable IIRC, and it adds a step in between you and your already lumbering backend.

Ultimately the C++ API isn't too difficult to use, but LLVM mandates a fairly hardcore level of C++ knowledge to play with its internals.

Quick Tip: if you're thinking "Holy shit how do I get from [complicated] all the way down to IR instructions", lower the big thing to something simpler so you can reuse the code that generates IR from that. For example, a foreach loop is expressible as a for loop under the hood, so now you only have to be able to compile for loops. This would usually be done in the AST itself.
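A toy version of that lowering, sketched in Python: the node shapes (`ForEach`, `For`, and the nested tuples used for statements and expressions) are invented for this example and not taken from any real compiler. The point is only that the foreach node is rewritten into a for node before code generation ever sees it.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class ForEach:
    item: str          # loop variable, e.g. "x"
    iterable: Any      # expression producing an array
    body: List[Any]


@dataclass
class For:
    init: Any
    cond: Any
    step: Any
    body: List[Any]


def lower_foreach(node, counter="__i"):
    """Rewrite `foreach item in iterable { body }` as an index-based for loop."""
    index = ("var", counter)
    return For(
        init=("let", counter, ("const", 0)),
        cond=("lt", index, ("len", node.iterable)),
        step=("assign", counter, ("add", index, ("const", 1))),
        # Bind the loop variable at the top of each iteration, then run the body.
        body=[("let", node.item, ("index", node.iterable, index))] + node.body,
    )


if __name__ == "__main__":
    loop = ForEach("x", ("var", "xs"), [("print", ("var", "x"))])
    print(lower_foreach(loop))
```

A real compiler would also need to pick a counter name that cannot clash with user variables; the fixed `__i` here is a shortcut.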

Regarding interface stability: Indeed, the textual representation is not stable, things like added types in the representation of some instructions can happen when upgrading to a new version. However, to be entirely honest, in the last few years of updating LLVM-based research tools to newer LLVM versions, changes in the C++ API that required me to (sometimes just slightly) change my code happened a lot more often than changes in the textual representation...

I'm not an expert, but there are C bindings: I was able to play around with a toy compiler[1] in Lua using lualvm[2]

I also know of at least one compiler[3] that actually emits textual IR, and then builds and links .obj files from that with the LLVM toolchain... but I think that's just a bunch of work, would be hard to debug, and is generally just a bad idea.

1: https://github.com/chc4/solar/blob/master/src/jit.lua
2: https://github.com/gilzoide/lualvm which I actually had to fork to https://github.com/chc4/lualvm for a small bugfix
3: https://github.com/FeepingCreature/fcc/blob/master/llvmfile.... by 'feepingcreature

Hey, author of the post here. Do I think the C++ API is important? For most languages no. The OCaml bindings in my case were almost sufficient, but I planned to do some memory fences and other operations in my language that the OCaml bindings didn't have.

In hindsight, it's probs better to choose OCaml bindings and then link in any special instructions you need from C++ if you need to.

Regarding this post in particular, I chose to document everything in terms of the C++ API as that's the native API. You can use any of the other bindings, and just translate the syntax across to your language.

> Are the bindings for other languages usable?

Yes, I've done several compiler projects using Haskell and LLVM.

That said, not all the bindings were always maintained and up to date and together with LLVM's lack of API stability, there was a significant amount of churn work related to updating from one LLVM version to another. I had to build and install an older version of LLVM that would work with the bindings, several times.

Note: this was years ago, situation may have improved.

I understand that the LLVM API doesn't change significantly between versions any more, so the work required to update the bindings to a newer version shouldn't be huge for the maintainers of the bindings. But for an end user like me there were quite a lot of manual steps to get my project and the dependencies building.

There's always the option of emitting LLVM IR by writing text, but that doesn't give you the ability to do JITting and a REPL and so on.

I'm currently taking a compiler construction course at university that uses LLVM.

Generating it in text form is really simple. I'm doing just that with Rust. Actually, writing it by hand is simple too.

You can use llvm-as to convert the text form to bitcode and then lli to interpret it (or use one of the other tools to compile it).
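To show how little is needed, here is a hand-written module and a small Python wrapper around the two tools. This is only a sketch: `run_with_lli` is a made-up helper name, and it returns None if llvm-as or lli is not installed. As far as I know lli exits with whatever `main` returns, so the exit status is used here as the program's result.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# A complete module written by hand: main computes 40 + 2 and returns it.
MODULE = """\
define i32 @main() {
entry:
  %sum = add i32 40, 2
  ret i32 %sum
}
"""


def run_with_lli(ir_text):
    """Assemble the text with llvm-as, run it with lli, return the exit status."""
    if shutil.which("llvm-as") is None or shutil.which("lli") is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        ll_path = Path(tmp) / "prog.ll"
        bc_path = Path(tmp) / "prog.bc"
        ll_path.write_text(ir_text)
        subprocess.run(["llvm-as", str(ll_path), "-o", str(bc_path)], check=True)
        return subprocess.run(["lli", str(bc_path)]).returncode


if __name__ == "__main__":
    print(run_with_lli(MODULE))
```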

I've had success with the llvmlite Python binding. It has a Python-native utility to help build the intermediate code, and then it internally emits text and uses the llvm C api for codegen.

Both the text format and the C API are allegedly more stable than the C++ API, so this may be a usable pattern in general. The text format is very well documented in my experience.

One downside to using text is an extra emit-and-parse-back step, but unless your code is huge, it's more than fast enough (and it falls away against optimization anyway).

As someone who writes a lot of toy languages, I made this scaffolding for a LLVM-based compiler: https://github.com/finiteloop/compiler

It uses Bison and Flex for parsing and lexing unlike this post, but may be a useful starting point for those building their own toy languages.

Reusable compiler components are really helpful. I made this example for parsing with ANTLR:


It uses a C++ port of CPython's ASDL to define the AST.

Doesn’t the LLVM API go through breaking changes fairly often? How do you track all that and keep your sanity?

Everyone says that but I have not experienced it. In practice, I think this impacts backend extension developers more than people targeting LLVM IR. My experience covers versions 7-11, but perhaps it used to be worse?

I've been updating a small codegen since the 3.x days. There have been many API changes with major versions. That said, I always found it pretty easy to implement the changes (something like a workday for my ~10k LLVM-interfacing LoC), and the changes tend to be such that once you get it to compile again, it just works as before.

I think LLVM is an excellent demonstration of how to design and implement big systems well in C++, and as a part of that, they also do breaking changes pretty well.

> Doesn’t the LLVM API go through breaking changes fairly often? How do you track all that and keep your sanity?

No, not very big ones. But they give no stability promises.

If you're using a language other than C++ via API bindings, you will experience some churn going from version to version as the bindings need to be updated too.

While we're here, let's not forget about the incredible Kaleidoscope tutorial. It really helped me get to grips with LLVM.


This is most welcome, as I recently started looking at LLVM, and I was not finding it easy to get a clear picture of the project's organization.

I see that your username resembles that of the entity hosting this guide - I'm guessing that's not a coincidence!

> LLVM IR looks like a more readable form of assembly

I think assembly is more readable than the dumped LLVM IR output of a real compiler…

Anyway, that was a good overview of how to build simple LLVM IR. :)

The IR is pretty messy but if you can see through the clutter the IR is much easier to read the flow of. There's a reason why we use SSA (although prior to mem2reg running you don't quite have it, but still)

Yes, exactly. Clang-emitted IR especially has a lot of (C/C++)-specific junk. If you look past that clutter, it's not too bad.

I think the best way to learn to read IR is to look at super-minimal examples, and then you'll be able to tell which parts of larger IR files are relevant.

I'd say dumped IR is about on par with dumped assembly from a compiler for readability, but hand-written LLVM IR has the potential to be quite a lot more readable than hand-written assembly.

It’s possible I missed this from the article, since I’m not actually following along, but why do I need to create the custom C++ adapter? The article mentions both the ll and bc formats, but I don’t see how they are used?

If I wanted to target LLVM, why wouldn’t I emit ll or bc files?

What do you mean by C++ adapter? All the C++ in this post is building the LL files in memory, which can then be persisted to disk or wherever.

Ah, I didn’t understand that aspect. So we write the parser in OCaml, then serialize it to protobuf, then consume that from C++ so we can call LLVM’s API to write the expected file format? That still seems needlessly complicated. Is the file format simply too arcane? Especially given [0].

[0] https://news.ycombinator.com/item?id=25540637

I see, in this case the author's compiler was written in C++, but you can use OCaml instead to build the IR. There is even a version of their great tutorial series using OCaml you can check out. All the concepts he outlines are the same, just convert the C++ bits to the corresponding OCaml version of the IRBuilder. Then you can go from your language to LLVM IR all in OCaml.


Ah, I understand now. Thanks for helping!

Is there a similar resource for liblldb? The only documentation I could find was the Doxygen-based one, and it doesn't provide enough hints to a newcomer to know where to get started if I want to write a debugger frontend.

When I studied compilers back at university, the subject consisted of reading, understanding and putting into practice the 'dragon book' (not the full book but a big part of it). We all used flex+yacc and a simplified ASM that was interpreted by some educational software whose name I can't remember. The project for the year was to implement a compiler for a basic language: it included if/for loops, function calls, recursion ... basic stuff, but enough of a starting point to build something.

Trying to get into LLVM, I'd love to find an example done in flex+yacc vs the same done in LLVM.

Flex+yacc and llvm are for separate steps in the process. The former are for parsing the source code into an AST and the latter for generating executable code from the AST.

That said, somewhere else on this comment thread user "finiteloop" posted the scaffolding for a toy compiler using yacc+llvm. You could check that out.
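To illustrate the split, here is a deliberately tiny sketch in Python with the two halves as separate functions: `parse` stands in for what flex+yacc would generate (a hand-rolled recursive-descent parser here, which is a substitution on my part), and `codegen` stands in for the LLVM side by printing textual IR. The grammar (integers, `+`, `*`, parentheses) and the function name `@eval` are invented for the example, and there is no error handling.

```python
import re


def parse(source):
    """Front end: turn source text into a nested-tuple AST."""
    tokens = re.findall(r"\d+|[+*()]", source)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def atom():
        tok = take()
        if tok == "(":
            node = expr()
            take()  # consume the closing ")"
            return node
        return ("num", int(tok))

    def term():
        node = atom()
        while peek() == "*":
            take()
            node = ("mul", node, atom())
        return node

    def expr():
        node = term()
        while peek() == "+":
            take()
            node = ("add", node, term())
        return node

    return expr()


def codegen(ast):
    """Back end: walk the AST and emit LLVM IR text for a function @eval."""
    lines = []
    counter = 0

    def emit(node):
        nonlocal counter
        if node[0] == "num":
            return str(node[1])
        lhs = emit(node[1])
        rhs = emit(node[2])
        dest = f"%t{counter}"
        counter += 1
        # The AST tags "add" and "mul" double as the IR opcodes.
        lines.append(f"  {dest} = {node[0]} i32 {lhs}, {rhs}")
        return dest

    result = emit(ast)
    lines.append(f"  ret i32 {result}")
    return "define i32 @eval() {\n" + "\n".join(lines) + "\n}\n"


if __name__ == "__main__":
    print(codegen(parse("1 + 2 * 3")))
```

Swapping the front end for a flex+yacc grammar would leave `codegen` untouched, as long as the AST shape stays the same.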

Really loved reading it. It made it so much easier for me to understand LLVM.

Just wanted to thank you for writing this awesome guide.

Hi, author of the post here. Thank you very much for your kind words!

Does anybody know a good resource like this but for writing backends for a new cpu ISA?

I would like to know this as well. I mostly work with Renesas RH850 cores and was interested to be able to compile rust for them, but there is no v850 backend, sadly. Was considering fiddling with it, but couldn't figure out how.

I also hate the proprietary compilers for this arch, since they are sooo slooow (maybe for good reason, but I have no comparison point) and their licensing method just annoys me.

I'm playing around trying to create a JIT execution engine for Rust code. I was struggling a bit with the codegen part, so the description of the function structure helped make more sense of what I'm trying to do.


Thanks for this. I used to have other resources too, about 3 years ago, when I was making my own toy blockchain project that takes in a programming language to create smart contracts.

Let me see if I can still find those.

LLVM: used to be “Low Level Virtual Machine”, but the acronym no longer applies

IR: intermediate representation
