
LLVM for Grad Students - samps
http://adriansampson.net/blog/llvm.html
======
asb
This is (in my humble opinion) a fantastic introduction, and relevant to a
much wider audience than the title implies. If you want to keep up to date
with LLVM developments, you might also be interested in my LLVM Weekly
newsletter ([http://llvmweekly.org/](http://llvmweekly.org/)). I try to
highlight interesting commits, mailing list discussions, and blog posts (tips
and submissions always welcome!).

------
regehr
Adrian might have mentioned an important drawback of using LLVM: active effort
is required to keep up with it. Many an LLVM-based research project has gotten
stuck on 2.9 or 3.2 at which point it starts becoming less and less relevant.

It's not that big of a deal, but active effort is required. The amount of
effort depends on how many and which APIs your project uses; for a
small/medium project perhaps a couple of hours every couple of weeks.

~~~
samps
Great point, John. Can I have a comments section on my site where only you are
allowed to comment?

~~~
regehr
Sure :).

------
deanstag
For the past year, I had two projects that required me to work with LLVM.
Because of the scarcity of articles and documentation, I found it really hard
to get into. The API docs and a few articles by Eli Bendersky (thank you so
much!!!) were all I found useful.

But once I got past the basic hurdles, the LLVM project code was so well
written that I felt a certain pleasure working with it.

~~~
david-given
I've used LLVM a few times to generate code, and it's pretty easy to use (and
_moderately_ stable between revisions). e.g.
[http://cowlark.com/calculon](http://cowlark.com/calculon) is a function
evaluation language I wrote as a sort of cheap shader language for a ray
tracer; the whole thing is 4000 lines of C++ header.

I have _also_ attempted to port LLVM to new CPU architectures. This is a
totally different deal --- it's poorly documented, very hard to make progress
with, and peculiarly unfinished. (e.g. the pattern matcher language, which is
excellent, is weirdly unable to match certain types of pattern, which requires
you to write big chunks of C++ which manually look for particular DAG patterns
and convert them into other patterns.) Debugging is a pain; get a pattern
wrong and it'll just hang as the pattern matcher state machine goes insane, or
else produce weird, contorted and incomprehensible error messages.

I would say that it's probably about as painful as gcc, although the pain
points are in a very different place. And LLVM actually has people and
momentum behind it, while the gcc mailing lists are dead quiet.

I'm currently investigating libfirm, which is the open source C compiler
nobody's ever heard of. It looks really rather nice, and builds in less time
than LLVM takes to run its configure script...

~~~
lorenzhs
libFirm is a very interesting project, I know some of the people who are
working on it (we're in the same building). What they're doing is quite cool.

------
felixangell
A nice introduction. I'm currently working on a compiler which uses LLVM for
code generation... it was very difficult to get into at first, especially
since I was using C (now Go), so I had to work through two or three layers of
language and documentation. (It was just C to C++; now it's Go, to C, to C++.)

If you're interested in learning more about LLVM, there are some good open
source projects that use it. If you aren't using C++, people have also ported
the Kaleidoscope tutorial project to Haskell, Rust, C, etc. Additionally, a
lot of bigger compilers like Rust and Clang use it. Swift also uses LLVM, and
should be open source soon?

~~~
slimsag
Because I'm curious, could we have a link to your Go project if it's open-
source? :)

~~~
andars
[https://github.com/ark-lang/ark](https://github.com/ark-lang/ark)

~~~
felixangell
That's the one :)

------
tmostak
This is a great intro into the subject.

We use LLVM at MapD ([http://mapd.com](http://mapd.com)) to compile SQL
queries to CPU and GPU machine code - it has given us a major boost over an
interpreter-based approach. For more see here -
[http://devblogs.nvidia.com/parallelforall/mapd-massive-
throu...](http://devblogs.nvidia.com/parallelforall/mapd-massive-throughput-
database-queries-llvm-gpus/). If you have a background in LLVM or compilers in
general and are interested in tackling problems like this please reach out at
jobs@mapd.com.

~~~
raincom
Daytona
([http://www.research.att.com/projects/Daytona/index.html](http://www.research.att.com/projects/Daytona/index.html)),
mainly developed by Rick Greer in the 1990s, does the same thing. They
implemented a superset of SQL, and this SQL is compiled into machine code.
In those days Daytona was used to parse call records for AT&T. They were
processing terabytes of data when the average laptop had a 20GB hard drive.
Rick wrote a paper titled "Daytona And The Fourth-Generation Language Cymbal".

You should contact Rick if you guys want to pick his brain. He is a very
down-to-earth man. He is retired and lives in the Morristown, NJ area.

~~~
shepardrtc
That link is down, but I found a whitepaper on it:

[http://web2.research.att.com/export/sites/att_labs/projects/...](http://web2.research.att.com/export/sites/att_labs/projects/Daytona/AllAboutDaytona.pdf)

------
mtweak
If you're really interested in this stuff and looking for an opportunity,
we're looking for talented LLVM developers for our Austin location.

[http://www.bitfusion.io/jobs.php](http://www.bitfusion.io/jobs.php)

Mazhar Memon, Bitfusion.io

------
johntyree
> LLVM is nicely written: its architecture is way more modular than other
> compilers. Part of the reason for this niceness comes from its original
> implementor, who is one of us.

What on earth is that supposed to mean?

~~~
minimax
LLVM was started by Chris Lattner while he was a graduate student at UIUC.

------
fish2000
This is indeed a great intro article.

For those interested in seeing examples of LLVM hacking in action, I would
recommend reading the source for Halide –
[https://github.com/halide/Halide](https://github.com/halide/Halide) – which
is an image-processing DSL implemented in C++ and piggybacking on LLVM. I
myself learned a lot about LLVM this way.

------
jnordwick
Not just for grad students! I wish I had seen this a week ago. LLVM
passes are surprisingly readable too.

------
valgaze
This isn't germane to the main topic but this paper they cite about using a
compiler pass to verify OS security ("Protecting Applications from Hostile
Operating Systems") is pretty darn interesting:
[http://sva.cs.illinois.edu/pubs/VirtualGhost-
ASPLOS-2014.pdf](http://sva.cs.illinois.edu/pubs/VirtualGhost-ASPLOS-2014.pdf)

------
amelius
How difficult would it be to add a garbage collector to a language that you
have written a compiler for?

~~~
cfallin
If you want an imprecise (conservative) garbage collector, the GC and the
compiler are almost disjoint. The GC only needs to know where the heap and
stack are, and be able to capture register values at GC time. In other words,
it's a runtime-library issue, not a compiler issue. (See the Boehm GC for an
example that works with a stock C compiler.)
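
A minimal sketch of the conservative idea, in Python for brevity and over a simulated address space (a real collector such as Boehm scans raw machine words, which this does not attempt). Any root word whose value matches the address of an allocated object is treated as a pointer. The names `heap`, `roots`, and `conservative_mark`, and all the addresses, are made up for illustration.

```python
# Simulated heap: maps an object's start address to the words it contains.
# The addresses and layout are hypothetical.
heap = {
    0x1000: [0x1010, 7],   # object A holds a word equal to B's address
    0x1010: [42],          # object B
    0x1020: [0x1000],      # object C, not reachable from the roots below
}

def conservative_mark(roots, heap):
    """Return the set of object addresses that *might* be live.

    Every word is treated as a potential pointer: if its value matches the
    address of an allocated object, that object is kept, even if the word
    was really just an integer.
    """
    live = set()
    worklist = [w for w in roots if w in heap]
    while worklist:
        addr = worklist.pop()
        if addr in live:
            continue
        live.add(addr)
        worklist.extend(w for w in heap[addr] if w in heap)
    return live

# Stack/register words captured at GC time: one real pointer and one
# integer that does not look like an address.
roots = [0x1000, 12345]
live = conservative_mark(roots, heap)
```

If `roots` instead contained the plain integer `0x1020`, object C and everything it points to would be retained by mistake. That false retention is the price of not needing any help from the compiler.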

If you want a precise garbage collector, you need type information to know
which heap and stack slots and registers are pointers. This is much harder
because your compiler needs to emit metadata for each struct and then keep
stack frame info or somesuch at the GC safe points so the runtime knows what
the thread's roots are.

Most of the precise GCs I know of are in JIT'd or interpreted languages where
you have this type info anyway... AOT-compiled Lisp is the one counterexample
I can think of, but usually Lisps solve this in a different way by using
tagged pointers (so you know if any heap slot is a pointer by e.g. its LSB).
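
To make the tagging idea concrete, here is a rough sketch with words modelled as Python integers. The convention used (LSB 1 for a small integer, LSB 0 for an aligned pointer) is an assumption for illustration; real implementations differ in which tag means what and how many bits they use. All function names are hypothetical.

```python
TAG_BITS = 1      # one low bit used as the tag (assumed layout)
FIXNUM_TAG = 1    # LSB = 1: small integer; LSB = 0: pointer

def make_fixnum(n):
    """Encode an integer as a tagged word."""
    return (n << TAG_BITS) | FIXNUM_TAG

def make_pointer(addr):
    """Encode an address as a tagged word; alignment keeps the LSB free."""
    assert addr % 2 == 0, "addresses must be at least 2-byte aligned"
    return addr

def is_pointer(word):
    """A word is a pointer exactly when its low bit is clear."""
    return word & 1 == 0

def fixnum_value(word):
    """Decode a tagged small integer back to its value."""
    assert not is_pointer(word)
    return word >> TAG_BITS

# A heap object's slots: the collector can tell which ones to follow
# without any per-type metadata from the compiler.
slots = [make_fixnum(21), make_pointer(0x1000), make_fixnum(-3)]
pointers = [w for w in slots if is_pointer(w)]
```

The cost is one bit of integer range and a shift on arithmetic, in exchange for a collector that can classify any slot by looking at the word alone.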

~~~
mafribe

> Most of the precise GCs I know of are in JIT'd or interpreted languages

GC is orthogonal to JITing. OCaml and Haskell are two examples of languages
with non-JIT AOT compilers (ocamlopt and ghc, respectively) that do precise
GC. I'm sure there are plenty of others.

~~~
cfallin
Yep, you're right, it is. And I totally forgot about those compiled functional
languages. Thanks! I mentioned JITs and interpreters because those are the
obvious cases where type information still exists at runtime, but the compiler
can emit that metadata too (as I mentioned).

~~~
mafribe
GCing has nothing to do with a language being functional. All you need is
access to the roots and the ability always to determine which data is another
pointer. The main 'enemy' of this is the ability to cast integers to pointers
(as you may in C/C++).

Interpreters don't necessarily do GC: if the language that the interpreter is
written in has GC, then the interpreted language is automatically GCed. If the
interpreter is written in a non-GC language you have to manage the interpreted
program's memory manually too (which you may do using a GC for the interpreted
language). As an aside: this is comparable to evaluation order: if the
interpreter is written in a call-by-value language, then the interpreted
language is CBV, if the interpreter is written in a lazy language, the
interpreted language has lazy evaluation. (Assuming you are not doing
something fancy like translation to continuation-passing-style.)

~~~
cfallin
> GCing has nothing to do with a language being functional

Yep, I didn't say it did! You mentioned Haskell and OCaml as counterexamples.
I referred to them as "the compiled functional languages". This isn't a claim
that _only_ compiled functional languages do GC.

> Interpreters don't do GC

Sure they do! One example: Ruby (the original interpreter in C). It has a full
mark/sweep collector that traverses user data structures. Custom object types
implement their own 'mark' routines, so it's fully precise. (I've implemented
a C extension that uses this API.)
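
For anyone who hasn't seen the pattern, here is a toy model of mark/sweep with per-type mark routines. It is written in Python and is not Ruby's actual C API; the class names, `children()`, `mark`, and `sweep` are all hypothetical. The point is only that each object type reports its own references, which is what makes the traversal precise.

```python
class Obj:
    """Base class for collectable objects in this toy model."""
    def __init__(self):
        self.marked = False

    def children(self):
        # The per-type "mark routine": report every object this one references.
        return []

class Pair(Obj):
    def __init__(self, left=None, right=None):
        super().__init__()
        self.left, self.right = left, right

    def children(self):
        return [c for c in (self.left, self.right) if c is not None]

class Leaf(Obj):
    pass  # holds no references, so the default children() is enough

def mark(roots):
    """Mark phase: flag everything reachable from the roots."""
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if not obj.marked:
            obj.marked = True
            stack.extend(obj.children())

def sweep(heap):
    """Sweep phase: keep marked objects, drop the rest, reset the flags."""
    survivors = [o for o in heap if o.marked]
    for o in survivors:
        o.marked = False
    return survivors

a, b, garbage = Leaf(), Leaf(), Leaf()
root = Pair(a, Pair(b, None))
heap = [a, b, garbage, root, root.right]

mark([root])
heap = sweep(heap)
```

After the sweep, `garbage` is gone from `heap` and the four reachable objects remain.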

Another example: often, Java runtimes interpret Java bytecode on the first
pass, before JIT'ing, to give fast startup on cold code. The JRE would be
written in C or C++ (non-GC'd), and Java is GC'd.

In general an interpreter can add any additional or different semantics it
needs to on top of the host language's semantics. I don't think your call-by-
value claim is correct either, because function calls in the interpreted
program do not map one-to-one to function calls in the interpreter. (The
interpreter would generally manage a data structure representing the
interpreted program's stack.)

My goodness, this is going off the rails. Sorry to OP for derailing. Just
trying to fix things that are Wrong On The Internet...

~~~
mafribe

> One example: Ruby (the original interpreter in C).

OK, I should have said, and have now modified my original statement to
"Interpreters don't necessarily do GC".

    
    
> In general an interpreter can add any additional or different semantics

Yes, and that falls under "something fancy"! The naive way of interpreting
object-language application by meta-language application transfers the
latter's calling discipline to the former.
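
A small sketch of that naive approach, with Python standing in as the meta-language (the tuple encoding, the `trace` form, and `evaluate` are invented for this example). Object-language functions become Python functions, and since Python evaluates arguments before a call, the object language ends up call-by-value as well.

```python
# Tiny object language, encoded as tuples (hypothetical encoding):
#   ("lit", n)            integer literal
#   ("var", name)         variable reference
#   ("lam", name, body)   one-argument function
#   ("app", fn, arg)      application
#   ("trace", tag, expr)  records `tag` in `log` when evaluated

log = []

def evaluate(expr, env):
    kind = expr[0]
    if kind == "lit":
        return expr[1]
    if kind == "var":
        return env[expr[1]]
    if kind == "lam":
        _, name, body = expr
        # An object-language function becomes a Python function.
        return lambda arg: evaluate(body, {**env, name: arg})
    if kind == "app":
        _, fn, arg = expr
        # Python evaluates the argument before making the call, so the
        # object language inherits call-by-value.
        return evaluate(fn, env)(evaluate(arg, env))
    if kind == "trace":
        _, tag, inner = expr
        log.append(tag)
        return evaluate(inner, env)
    raise ValueError(f"unknown expression: {kind}")

# (\x. 1) applied to an argument that the body never uses.
program = ("app", ("lam", "x", ("lit", 1)),
           ("trace", "arg evaluated", ("lit", 2)))
result = evaluate(program, {})
```

The result is 1, yet `log` shows the unused argument was still evaluated. Getting any other evaluation order takes extra machinery in the interpreter, for example wrapping arguments in thunks.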

~~~
cfallin
> Interpreters don't necessarily do GC

Of course, but no one claimed that interpreters _necessarily do_ GC either, so
this statement is not very useful. You keep arguing against claims that aren't
there. The only claim ever made in response to OP was that precise GC requires
runtime information, and that this can sometimes be easier when it exists
anyway (as in JITs and real-world interpreters), but can also be implemented
when the compiler produces it AOT. I hope that's clear enough :-)

------
noreasonw
A long time ago I read a post by Mathew Flatt about LLVM and gcc and how mini
optimization and selling points were important. I'm not able to find the post,
but it was an interesting read.

------
PSeitz
"backed by the largest company on Earth."

Wal-Mart?

~~~
_delirium
By market cap (or enterprise value), Apple's the largest public company now,
worth about $700b, which is about 2x as much as the next biggest (Exxon,
Microsoft, etc.). Although Saudi Aramco is probably the world's most valuable
company if non-publicly-traded companies are included.

------
dadrian
This would have been great during my grad school compilers class.

