In the past when I've looked at LLVM from a distance, the biggest stumbling block I found was that it's written in C++, which isn't the language I'm using for my frontend.
How important is the C++ API in practice? Are the bindings for other languages usable? Is it possible to have my frontend emit LLVM IR in a text format, similarly to how you can feed assembly language source code to an assembler? Or should one really just bite the bullet and use C++ to generate the IR? I noticed that the compiler in this tutorial has a frontend in OCaml and a backend in C++, with the communication between them being done via protobufs.
- I started out just printing strings, but found I wanted a bit more structure so I wrote a dozen or so functions to do the printing. Things at the level of call(name, args). Each makes a single line of IR. It helps keep the typos down.
- just run your IR through clang to compile it. It has good enough error messages that you won’t go mad.
- you will need to make a bunch of DI metadata to get debug information into your programs. It took me about 8 hours to get enough that I have line number information and LLDB will show me functions and source lines on a backtrace. I should have done this much earlier than I did. I was getting by with line number comments in my IR, which is simple, but useless if you give yourself an access violation.
- learn to write little C programs and use clang to emit the IR. This will let you sort out how to do things. The IR manual is good, but there are concepts which are omitted because they were obvious to the authors, but won’t be to you.
- numbered local symbols (%0, %1, ...) have to be in strict, ascending, no skipped values numerical order. Screw that. Grab a dollar sign and make your own numeric local-ish symbols and don’t sweat reordering them.
- it doesn’t seem to care what order you emit things in, other than lines in a block of course, so just do what is easy for you.
- get a grip on getelementptr and phi. The first may not be what you think it is, learn what it is. The second is just part of the magic of SSA.
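For anyone wondering what the "dozen or so functions" and the dollar-sign locals from the list above might look like, here's a rough sketch. It's in Python purely for illustration (the poster didn't say what they used), and `IRWriter`, `fresh`, `call`, `gep`, `phi` and `build_pick` are all names I made up. It assumes a reasonably recent LLVM that uses the opaque `ptr` type. Each helper emits one line of IR, and temporaries come out as `%$1`, `%$2`, and so on, which LLVM treats as named values.

```python
class IRWriter:
    """Collects lines of textual LLVM IR; every helper emits exactly one line."""

    def __init__(self):
        self.lines = []
        self.counter = 0

    def fresh(self):
        # "$" is a legal identifier character, so %$3 counts as a *named*
        # value and is not subject to the %0, %1, %2 ordering rule.
        self.counter += 1
        return f"%${self.counter}"

    def emit(self, line):
        self.lines.append("  " + line)

    def label(self, name):
        self.lines.append(f"{name}:")

    def call(self, ret_ty, name, args):
        # args: list of (type, value) pairs
        dst = self.fresh()
        arg_text = ", ".join(f"{ty} {val}" for ty, val in args)
        self.emit(f"{dst} = call {ret_ty} @{name}({arg_text})")
        return dst

    def gep(self, elem_ty, base, indices):
        # Address arithmetic only: getelementptr never touches memory.
        dst = self.fresh()
        idx_text = ", ".join(f"{ty} {val}" for ty, val in indices)
        self.emit(f"{dst} = getelementptr {elem_ty}, ptr {base}, {idx_text}")
        return dst

    def phi(self, ty, incoming):
        # incoming: list of (value, predecessor label) pairs
        dst = self.fresh()
        pairs = ", ".join(f"[ {val}, %{lbl} ]" for val, lbl in incoming)
        self.emit(f"{dst} = phi {ty} {pairs}")
        return dst


def build_pick():
    """Emit a tiny function: returns square(x) if c is true, else 0."""
    w = IRWriter()
    w.lines.append("declare i32 @square(i32)")
    w.lines.append("define i32 @pick(i1 %c, i32 %x) {")
    w.label("entry")
    w.emit("br i1 %c, label %then, label %else")
    w.label("then")
    sq = w.call("i32", "square", [("i32", "%x")])
    w.emit("br label %join")
    w.label("else")
    w.emit("br label %join")
    w.label("join")
    # The phi picks a value depending on which block we arrived from.
    result = w.phi("i32", [(sq, "then"), ("0", "else")])
    w.emit(f"ret i32 {result}")
    w.lines.append("}")
    return "\n".join(w.lines) + "\n"


if __name__ == "__main__":
    print(build_pick(), end="")
```

The `gep` helper isn't used in `build_pick`; it's there to show the one-line-per-helper shape. Something like `w.gep("i32", "%arr", [("i64", "3")])` gives the address of element 3 without loading it.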
Highly recommend this. I used this to get an understanding of how to implement the IR for my language.
It'll write it out to a foo.ll file.
(I use -O1 so it cleans up a bit of the messy parts of the IR).
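The actual command seems to have gone missing from the comment above. The usual way to get textual IR out of clang is `-S -emit-llvm`, so here's a small sketch that wraps it. Python is only for illustration, and `clang_ir_command` and `emit_ir` are names I made up. It assumes `clang` is on your PATH and does nothing if it isn't.

```python
import shutil
import subprocess
from pathlib import Path


def clang_ir_command(source, opt_level="-O1"):
    """Build the argv that asks clang to write textual IR next to `source`."""
    out = str(Path(source).with_suffix(".ll"))
    return ["clang", "-S", "-emit-llvm", opt_level, source, "-o", out]


def emit_ir(source, opt_level="-O1"):
    """Run clang if it is installed; return the .ll path, or None if not."""
    if shutil.which("clang") is None:
        return None
    cmd = clang_ir_command(source, opt_level)
    subprocess.run(cmd, check=True)
    return cmd[-1]


if __name__ == "__main__":
    # Show the command line without running it.
    print(" ".join(clang_ir_command("foo.c")))
```

Pass `opt_level="-O0"` if you want to see the unoptimized output with all the allocas left in.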
One compiler that I use which emits LLVM IR added support for debug information recently, and it's now possible to set breakpoints in gdb, but you can't print out any stack variables or anything, so it's not useful other than for figuring out which code paths execute.
I'd like to learn more about this. Maybe contribute to the compiler and fix this issue.
You need to create a call to the `llvm.dbg.declare` intrinsic that connects the stack variable alloca to a DILocalVariable metadata node, and place the call after the stack variable alloca. The rest of LLVM will handle tracing the stack location through the compiler to the output, including the alloca being promoted.
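To make that a bit more concrete, here's a rough sketch of the two pieces of text a frontend would print: the `!DILocalVariable` node and the `llvm.dbg.declare` call right after the alloca. It's Python only for illustration, the helper names and metadata ids are made up, and it assumes opaque pointers (`ptr`). The exact spelling varies between LLVM versions, so compare against what your own clang prints with `-g` before trusting it.

```python
def di_local_variable(node_id, name, scope_id, file_id, line, type_id):
    """One metadata line describing a source-level local variable."""
    return (
        f'!{node_id} = !DILocalVariable(name: "{name}", scope: !{scope_id}, '
        f"file: !{file_id}, line: {line}, type: !{type_id})"
    )


def alloca_with_debug(reg, ty, var_id, loc_id):
    """Two instruction lines: the alloca, then the dbg.declare that follows it."""
    return [
        f"{reg} = alloca {ty}",
        # The !dbg attachment points at a !DILocation node (id is hypothetical).
        f"call void @llvm.dbg.declare(metadata ptr {reg}, metadata !{var_id}, "
        f"metadata !DIExpression()), !dbg !{loc_id}",
    ]


if __name__ == "__main__":
    for line in alloca_with_debug("%x.addr", "i32", 12, 13):
        print("  " + line)
    print(di_local_variable(12, "x", 7, 1, 3, 10))
```

The module also needs `declare void @llvm.dbg.declare(metadata, metadata, metadata)` plus the scope, file, type and location nodes that the ids refer to; those are left out here.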
Ultimately the C++ API isn't too difficult to use but LLVM mandates a fairly hardcore level of C++ knowledge to play with its internals.
Quick Tip: if you're thinking "Holy shit how do I get from [complicated] all the way down to IR instructions", lower the big thing to something simpler so you can reuse the code that generates IR from that. For example, a foreach loop is expressible as a for loop under the hood, so now you only have to compile for loops. This would usually be done in the AST itself.
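A minimal sketch of that foreach-to-for lowering, done on the AST before any IR is generated. The node classes (`ForEach`, `For`, `Assign`) and the string-based expressions are toy stand-ins I made up, and it assumes the iterable is an indexable array with a known length.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Assign:
    target: str
    expr: str


@dataclass
class ForEach:
    var: str        # loop variable name
    iterable: str   # name of the array being walked
    body: list      # list of statement nodes


@dataclass
class For:
    init: Assign
    cond: str
    step: Assign
    body: list


_ids = itertools.count()


def lower_foreach(node):
    """Rewrite `foreach var in iterable` as an index-based For node."""
    idx = f"__i{next(_ids)}"  # hidden index variable, the name is arbitrary
    bind = Assign(node.var, f"{node.iterable}[{idx}]")
    return For(
        init=Assign(idx, "0"),
        cond=f"{idx} < len({node.iterable})",
        step=Assign(idx, f"{idx} + 1"),
        body=[bind] + [lower(stmt) for stmt in node.body],
    )


def lower(node):
    """Recursively desugar; anything that is not a loop passes through."""
    if isinstance(node, ForEach):
        return lower_foreach(node)
    if isinstance(node, For):
        return For(node.init, node.cond, node.step,
                   [lower(stmt) for stmt in node.body])
    return node
```

After `lower` runs, the code generator only ever sees `For` and `Assign`, so there's one loop shape to turn into basic blocks instead of two.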
I also know of at least one compiler that actually emits textual IR, and then builds and links .obj files from that with the LLVM toolchain...but I think that's just a bunch of work, would be hard to debug, and generally just a bad idea.
2: https://github.com/gilzoide/lualvm which I actually had to fork to https://github.com/chc4/lualvm for a small bugfix
3: https://github.com/FeepingCreature/fcc/blob/master/llvmfile.... by 'feepingcreature
In hindsight, it's probs better to choose OCaml bindings and then link in any special instructions you need from C++ if you need to.
Yes, I've done several compiler projects using Haskell and LLVM.
That said, not all the bindings were always maintained and up to date and together with LLVM's lack of API stability, there was a significant amount of churn work related to updating from one LLVM version to another. I had to build and install an older version of LLVM that would work with the bindings, several times.
Note: this was years ago, situation may have improved.
I understand that the LLVM API doesn't change significantly between versions any more, so the work required to update the bindings to a newer version shouldn't be huge for the maintainers of the bindings. But for an end user like me there were quite a lot of manual steps to get my project and the dependencies building.
There's always the option of emitting LLVM IR by writing text, but that doesn't give you the ability to do JITting and a REPL and so on.
Generating it in text form is really simple. Doing just that with Rust. Actually, writing it by hand is too.
You can use llvm-as to convert the text form to bitcode and then lli to interpret it (or use one of the other tools to compile it).
Both the text format and the C API are allegedly more stable than the C++ API, so this may be a usable pattern in general. The text format is very well documented in my experience.
One downside to using text is an extra emit-and-parse-back step, but unless your code is huge, it's more than fast enough (and it falls away against optimization anyway).
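The llvm-as and lli round trip described above could be driven from a frontend roughly like this. It's a sketch in Python just for illustration; `assemble_command`, `interpret_command` and `run_ir_text` are made-up names, and it assumes both tools are installed and on PATH (it returns None if they aren't).

```python
import shutil
import subprocess
from pathlib import Path


def assemble_command(ll_path):
    """argv for turning textual IR (.ll) into bitcode (.bc)."""
    bc_path = str(Path(ll_path).with_suffix(".bc"))
    return ["llvm-as", ll_path, "-o", bc_path]


def interpret_command(bc_path):
    """argv for running a bitcode file with LLVM's interpreter/JIT driver."""
    return ["lli", bc_path]


def run_ir_text(ir_text, ll_path="out.ll"):
    """Write IR text to disk, assemble it, run it; return lli's exit code.

    Returns None when the LLVM tools are not available.
    """
    if shutil.which("llvm-as") is None or shutil.which("lli") is None:
        return None
    Path(ll_path).write_text(ir_text)
    asm = assemble_command(ll_path)
    subprocess.run(asm, check=True)  # raises if the IR fails to parse
    return subprocess.run(interpret_command(asm[-1])).returncode
```

This is the extra emit-and-parse-back step: the frontend writes text, and llvm-as parses it again before anything runs.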
It uses Bison and Flex for parsing and lexing unlike this post, but may be a useful starting point for those building their own toy languages.
It uses a C++ port of CPython's ASDL to define the AST.
I think LLVM is an excellent demonstration of how to design and implement big systems well in C++, and as a part of that, they also do breaking changes pretty well.
No, not very big ones. But they give no stability promises.
If you're using a language other than C++ via API bindings, you will experience some churn going from version to version as the bindings need to be updated too.
I see that your username resembles that of the entity hosting this guide - I'm guessing that's not a coincidence!
I think assembly is more readable than the dumped LLVM IR output of a real compiler…
Anyway, that was a good overview of how to build simple LLVM IR. :)
I think the best way to learn to read IR is to look at super-minimal examples, and then you'll be able to tell which parts of larger IR files are relevant.
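As an idea of what "super-minimal" can mean, here's a hand-written two-function module held in a Python string, plus a tiny helper that strips it down to just the instruction lines. The module and the `instruction_lines` helper are made up for illustration, and the helper assumes instructions are indented by two spaces, which is how LLVM's own tools print them.

```python
MINIMAL_MODULE = """\
define i32 @add(i32 %a, i32 %b) {
entry:
  %sum = add i32 %a, %b
  ret i32 %sum
}

define i32 @main() {
entry:
  %r = call i32 @add(i32 40, i32 2)
  ret i32 %r
}
"""


def instruction_lines(ir_text):
    """Keep only indented lines, i.e. instructions; drop defines and labels."""
    return [ln.strip() for ln in ir_text.splitlines() if ln.startswith("  ")]


if __name__ == "__main__":
    for line in instruction_lines(MINIMAL_MODULE):
        print(line)
```

Real compiler output adds attributes, metadata and target info around this core, but the instructions inside the function bodies have the same shape.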
If I wanted to target LLVM, why wouldn’t I emit ll or bc files?
Trying to get into LLVM, I'd love to find an example done in flex+yacc vs the same done in LLVM.
That said, somewhere else on this comment thread user "finiteloop" posted the scaffolding for a toy compiler using yacc+llvm. You could check that out.
Just wanted to appreciate you for writing this awesome guide
I also hate the proprietary compilers for this arch, since they are sooo slooow (maybe for good reason, but I have no comparison point) and their licensing method just annoys me.
let me see if i can still find those
IR: intermediate representation