
LLVM Intermediate Representation is better than assembly - majke
https://idea.popcount.org/2013-07-24-ir-is-better-than-assembly/
======
munin
A few points:

- LLVM at the IR level still has platform-specific information: calling conventions, vector types, pointer widths, etc.

- What was true then is still true now: [http://lists.cs.uiuc.edu/pipermail/llvmdev/2011-October/0437...](http://lists.cs.uiuc.edu/pipermail/llvmdev/2011-October/043719.html)

- Typed assembly languages have existed for a while; no one uses them

~~~
majke
> platform specific stuff

Yes. If I were to write IR for a real project, I'd probably keep a few pretty
large functions to avoid overhead caused by calling conventions.

Very interesting link. Thank you! Though I find the arguments unconvincing.
More data needed...

> typed assembly

Yes. But now _every_ Mac user and plenty of other people have a typed assembly
compiler on their machines! IR is reasonably popular and quite well
documented. I'm not saying it's new, I'm saying it's cool :)

~~~
azakai
It's not just calling conventions. LLVM IR also bakes in various assumptions
about the target platform, from endianness to structure alignment to
miscellaneous facts like whether char is signed or unsigned.

For many of these there is no going back to the original representation; the
lowering is one-way.

If you have IR that you can compile to various archs and it works on them,
that is luck in that particular case. But it is not what LLVM IR was designed
for, nor should it be expected to work in general.

~~~
phaemon
I don't understand what this means. Could you please give an example of some
code that loses information in this way when compiled with LLVM?

~~~
azakai
Say you have a struct type X with fields int, double, int. The offset of the
last field depends on how alignment works on the target platform - it could be
12 or 16 on some common ones. LLVM IR can contain a read from offset 12,
hardcoded, whereas the C code contains X.propertyThree, which is more
portable.

~~~
nostrademons
But that's not how LLVM works, at least when I worked with it a couple years
ago. You would define a struct type in terms of primitive types (int64, ptr,
etc), and then use getelementptr with the offset of the field path you wanted.
Yes, it's a numeric offset, but it's a _field offset_ within the struct, not a
byte offset. LLVM handles packing, alignment, and pointer size issues for you
automatically.

~~~
vidarh
Once you have defined a struct in terms of primitive types, it is platform
dependent.

Consider C:

A C int can be 16 bits. Or 32. Or 64. Etc., as long as the constraints on its
relation to the other integer types are met.

The moment the frontend specifies a primitive type for a field in the struct,
that code is incompatible with a whole lot of platforms.

~~~
eropple
Your primitive types aren't LLVM's though, are they? I mean, I haven't looked
at LLVM thoroughly (just enough to be familiar with it, a friend is writing a
language he wanted some input on), but I would be surprised and disappointed
if they had a C "int" type as opposed to "signed 32-bit integer" or whatever.
At which point it's compatible with whatever else is throwing around a signed
32-bit integer.

~~~
vidarh
But that is exactly the point - that LLVM IR is not platform independent.

The frontend must choose which specific integer type C's "int" maps to. At
that point, the IR is no longer machine-independent - if you pick 32-bit
signed ints to represent C "int", your program will not match the C ABI on any
platform whose "int" is 16 bits, and you won't be able to directly make calls
to libraries on that platform, for example.

~~~
eropple
So use uint32_t?

~~~
vidarh
This misses the point. The point is that if you pass a C program that uses
"int" through a C-compiler that spits out LLVM IR, the resulting LLVM IR is
not portable.

You might not be able to change the C program - it might be using "int"
because the libraries it needs to interface with use "int" in their signatures
(and handle it appropriately) on the platforms you care about.

------
kwantam
This is interesting and pretty awesome. I can imagine having a lot of fun
making a toy compiler that emits LLVM IR directly.

Practically speaking, is there a good reason to do this in non-toy software
rather than the more-or-less standard practice of using C as an IR? That gives
you, in addition to compilation via LLVM, the ability (in principle) to use
gcc, icc, et al. (In practice, in many cases people end up using
compiler-specific extensions in the generated C, which does impose limitations
on portability to other compilers, but that's certainly avoidable.)

I suppose if you are dynamically generating things that need to be compiled as
quickly as possible, eliminating an unnecessary compilation step could be a
significant advantage. It doesn't seem like such a constraint would apply in
most cases, though.

~~~
pbsdp
C is a heavyweight intermediary language which brings with it a /lot/ of
baggage. There's no valid _direct_ technical reason _other_ than financial
resource constraints to justify compiling to C (... or JS, for that matter)
instead of a more suitable and expressive IR/bytecode.

It's impossible to implement a variety of useful abstractions in portable C --
for example, function trampolines that do not permute register state and do
not appear in the callstack after execution.

~~~
azakai
> C is a heavyweight intermediary language which brings with it a /lot/ of
> baggage. There's no valid direct technical reason other than financial
> resource constraints to justify compiling to C (... or JS, for that matter)
> instead of a more suitable and expressive IR/bytecode.

1. In theory. But what is that "more suitable and expressive IR/bytecode"? I
would argue such a bytecode should be portable, but LLVM IR is not that.

2. There are lots of reasons to target C. C can be compiled by many
compilers, while LLVM IR can only be compiled by LLVM to things LLVM can
compile to. For example, game consoles, various new embedded platforms, etc. -
they all have C compilers, but LLVM might not support them (and possibly
cannot support them).

~~~
foobarbazqux
All of your arguments about going through C would go away if you had enough
money to build your own ideal toolchain, including the development of your own
IR for your language.

~~~
azakai
In theory. But it would be _my_ own ideal toolchain. Someone else might make
different tradeoffs in its design.

~~~
foobarbazqux
I'm not sure I understand. The claim is that the only reason to target C is to
save money.

------
mjn
I'm intrigued by this idea, but it seems tricky to really use portably if your
goal for writing asm is to carefully manage register usage and SIMD
instructions. Sure, the syntax is portable, but to get the "right" result,
i.e. one that maps directly onto the fast machine code you had in mind, you
need to know something about the target architecture, e.g. how wide its SIMD
instructions are, and how many registers it has. And if you want it to be fast
on two architectures that differ considerably, you may have to write two
versions of your LLVM IR, at which point you're losing some of the benefit of
an IR. You do still get the advantage that it will at least compile to
_something_ on platforms you hadn't previously heard of, but most projects
will just use a straight-C fallback function for those platforms.

~~~
rinon
Managing register allocation by hand is unlikely to provide significant
speedups, considering that modern register allocators are "pretty good."
However, if you want to optimize for SIMD, that can easily be done with vector
instructions in IR, without worrying about register allocation at all.

~~~
pcwalton
haberman beat me to it, but register allocation algorithms fall down in very
branchy control flow like dispatch loops, basically for two reasons: (1) to
keep compile times reasonable they tend to approximate in this case; (2)
compilers don't know the hot paths in such code the way a human would, and
will likely make suboptimal spilling decisions as a result.

~~~
DannyBee
1. This is only true of JITs and similar things that strongly depend on
linear-time register allocation. LLVM now uses a greedy allocator that still
prioritizes compile time over allocation optimality.

2. This is just a bad compiler, then. Good static profiling is actually quite
good, and quite accurate, in most cases. This includes value profiling and
other forms of profiling, rather than just simple static edge estimation based
on heuristics. Usually, the thing that gets hurt is not spill placement but
inlining decisions.

For diamond-shaped switch-statement interpreter loops with simple bodies, the
real issue is that most greedy/linear allocators are not great at live-range
splitting. Compilers like LLVM (and to some degree GCC) move all the variable
allocations up to the beginning of the function to make life easy by removing
scopes (otherwise you hit really crazy edge cases when performing
hoisting/sinking optimizations), and then, for the variables that don't get
mem2reg'd, they can't prove the variables aren't all live at the same time
during the switch, due to the loop.

Then they make bad choices about which of these variables should stay in
registers because their value profiling infrastructures are non-existent.

Proper region analysis, allocation regions, and better optimistic live range
splitting would go a long way towards fixing this, but it's not worth it.
There is little to no sense in optimizing LLVM for the very uncommon case of
interpreter loops (particularly when one of the goals of LLVM is to ...
replace interpreters).

So the basic answer is: It's not really a problem anyone cares to solve, not
"it's a really hard problem to solve".

~~~
qznc
Nevertheless, register allocation is still an NP-complete problem, and even in
AOT compilation anything above O(n^3) is too slow (unless you just compile
kernels for embedded software).

Actually, it is not register allocation itself that is NP-complete [0];
avoiding spills and copies is the hard part.

[0] [https://pp.info.uni-karlsruhe.de/publication.php?id=buchwald...](https://pp.info.uni-karlsruhe.de/publication.php?id=buchwald11cc)

~~~
DannyBee
The fact that some subproblems people want to solve are NP-complete is mostly
irrelevant, and has been for many, many years. Nobody cares about _optimal_
copy coalescing, only "good enough". There are studies of optimal vs
non-optimal, and the answer is basically "you could eliminate 20% more copies".
That was against the best optimistic coalescers of the time, which have
already been superseded by PBQP and other approaches. Also remember that
people build architectures knowing what compilers can and can't do, and so
processors do a lot at runtime.

As I said, the real issue is that nobody has chosen to optimize for the
interpreter case, because it's uncommon and doing so does not help anything
else, but comes at great cost.

Choosing the one edge case everyone has said "we don't care about" and saying
"see, they suck at everything!" does not seem quite right to me.

For example, it is _completely_ irrelevant to whether register allocation is a
problem for tight SIMD intrinsic loops.

At least in GCC (which is what x264 was likely talking about), the issues are
GCC's architecture, and not some fundamental "register allocation is hard"
issue.

~~~
nkurz
_remember that people build architectures knowing what compilers can and
can't do, and so processors do a lot at runtime_

It sounds like you may be referring to MOV elimination by register renaming,
which should make the extra moves no more costly than NOPs. I read that Intel
does this post-Ivy Bridge, but haven't been able to find any real
documentation. Do you know if this is something one can now rely on, or what
the limits of it are (number per cycle, size differences, latency)?

~~~
nkurz
Answering myself: Yes, this is documented and can be depended on for Ivy
Bridge onward.

3.5.1.13 Zero-Latency MOV Instructions

In processors based on Intel microarchitecture code named Ivy Bridge, a subset
of register-to-register move operations are executed in the front end (similar
to zero idioms, see Section 3.5.1.8). This conserves scheduling/execution
resources in the out-of-order engine. Most forms of register-to-register MOV
instructions can benefit from zero-latency MOV. Example 3-23 lists the details
of those forms that qualify and a small set that do not.

Example 3-23. Zero-Latency MOV Instructions

MOV instructions latency that can be eliminated

    
    
      MOV reg32, reg32
      MOV reg64, reg64
      MOVUPD/MOVAPD xmm, xmm
      MOVUPD/MOVAPD ymm, ymm
      MOVUPS/MOVAPS xmm, xmm
      MOVUPS/MOVAPS ymm, ymm
      MOVDQA/MOVDQU xmm, xmm
      MOVDQA/MOVDQU ymm, ymm
      MOVZX reg32, reg8 (if not AH/BH/CH/DH)
      MOVZX reg64, reg8 (if not AH/BH/CH/DH)
    

MOV instructions latency that cannot be eliminated

    
    
      MOV reg8, reg8
      MOV reg16, reg16
      MOVZX reg32, reg8 (if AH/BH/CH/DH)
      MOVZX reg64, reg8 (if AH/BH/CH/DH)
      MOVSX
    
    

[http://www.intel.com/content/dam/doc/manual/64-ia-32-archite...](http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf)

------
ihnorton
Julia is a pretty cool platform for playing with this kind of thing in a REPL
environment - you can dump LLVM IR and corresponding disassembly for any
function, with source line annotation (some things are rearranged by the LLVM
optimizer, but it is usually close). No inline IR support yet, but it's a
great way to understand the relationship between high-level constructs and
machine code.

~~~
_delirium
Interesting, I didn't realize Julia had that. Fwiw, many Lisp systems also
have this available: if you (disassemble 'foo) in SBCL, it'll show you the asm
the function compiled to.

------
nullc
It would be nice if the substance of the article actually backed up the
title... it doesn't really show this. Yes, sure, it's more portable - but
plain C is way more portable still.

The places where assembly is justified these days tend to be SIMD inner loops
where intrinsic use isn't effective (e.g. due to compilers being pretty lame
at vector register allocation) and where a lot of gains can be had by bitops
level micro-optimization that depend very precisely on the instruction
behavior. I would be surprised if writing LLVM IR was actually better than
using intrinsics in these cases.

If it were so it would be neat to see it— an example of taking some well
optimized multimedia codec library and converting (some of) the SIMD asm into
LLVM IR and showing equal performance would be impressive.

~~~
majke
I personally think that IR is _more_ expressive than C. There are two points I
wanted to make in the blog post:

- LLVM IR is cool and in many cases it could be beneficial to use it instead
of assembler

- If LLVM is bad for any reason, writing an optimizer is relatively simple.
And, as opposed to hand-crafted optimizations, it does scale.

------
zqfm
Perhaps I am mistaken, but I was under the impression that the LLVM IR was
subject to change at a whim. That may be something to keep in mind if you
intend to target it directly.

~~~
octo_t
the text format changes, the API stays relatively static (afaik). Most tools
use something similar to:

    
    
      LLVMContext &Context = getGlobalContext();
      SMDiagnostic Err;
      Module *Mod = ParseIRFile(argv[1], Err, Context);

~~~
jevinskie
The C API is deemed stable. The C++ API changes every release. The textual IR
has so far been forwards compatible. Bitcode is also supposed to be forwards
compatible, but some bugs have prevented this in the past. Backwards
compatibility is right out, which can be frustrating.

------
gsg
The author invokes the idea that writing assembly and writing compiler passes
are similar enough activities that you can substitute one for the other, but
that is not the case.

Good thing, too - I don't even want to think about how buggy and slow
compilers would become if random people started jamming passes into them based
on nothing but expediency.

~~~
octo_t
The cool thing about LLVM is that "jamming passes in" is something an
individual team can do; adding a pass simply involves running
    
    
      opt -load path/to/pass.so < IR.bc
    

Which emits (hopefully) optimised IR. Neat huh?

------
syncopate
It should be noted that on the official LLVM IRC channel (#llvm@oftc.net)
people do not like LLVM's IR being called a "programming language". Rather,
it's intended to be a representation of LLVM's internal data structures. It
may look similar to a programming language, for sure, but that's not the
intent behind it. It's a representation that should help compiler engineers
debug their llvm-using compilers.

Also, if you've ever looked at clang's "-emit-llvm" output, you will notice
that some parts of the IR are hard for humans to generate, e.g. dbg metadata
referenced by sequential numbers.

See e.g.: [http://llvm.org/docs/SourceLevelDebugging.html#object-lifeti...](http://llvm.org/docs/SourceLevelDebugging.html#object-lifetimes-and-scoping)

~~~
0x09
To be fair to outsiders,
[http://llvm.org/docs/LangRef.html](http://llvm.org/docs/LangRef.html) is
linked from the main page where the human-readable IR format is referred to as
"LLVM assembly language", and with the in-memory representation listed as an
equivalent form but not with particular privilege.

------
waynecochran
The problem with writing LLVM IR code by hand is that it must be in SSA
(static single assignment) form, which is a pain for humans and requires graph
algorithms to insert phi nodes in the basic blocks. What we need is a non-SSA
LLVM IR assembler that converts to SSA - this would be nice for human authors.

BTW, I taught a compiler course where I gave a project that translated a
"turtle graphics" functional language into LLVM IR:
[http://ezekiel.vancouver.wsu.edu/~cs452/projects/turtlecomp/...](http://ezekiel.vancouver.wsu.edu/~cs452/projects/turtlecomp/turtlecomp.pdf).

------
josteink
To be fair x86 assembly is horrifying.

I've worked with other (mostly Motorola-based or some sort of embedded)
architectures, and while a bit cumbersome to work with, the assembly was clean
and understandable.

Back in the day assembly was my primary way of getting things done, in part
because C compilers cost money back then and I barely had money left after
buying a computer.

Once the "PC" (and thus Intel) had won the war and I set foot in x86 country,
I started looking at the assembly. It took me less than a week to decide to
give up assembly forever.

So yeah. It doesn't take much to be better than x86 assembly.

------
ezy
This needs better examples. I find:

    
    
      v4si multiply_four(v4si a, v4si b) { return a*b; }
    

Much more readable, and just as portable. There's a typedef for v4si, of
course, but you do that exactly once and then ignore it. The IR produced is
identical, and therefore just as portable.

------
cylinder714
On a related subject, Dan Bernstein of qmail fame created a portable assembly
language a few years ago, called qhasm:
[http://cr.yp.to/qhasm.html](http://cr.yp.to/qhasm.html) The idea is that
processors do pretty much the same things, but with different assembly
language syntaxes. With a portable subset, one could then use translators to
convert qhasm source into whatever dialect required by ARM, Intel, PowerPC, et
cetera.

The latest source code dates from 2007, and it's only a prototype, but it's a
clever idea.

------
acqq
In the examples given under "vectors" I don't see whether they are also
multiplatform -- not every platform has "process 4 floats at once this way."
Also, I don't see the encoding of alignment assumptions. Not every processor
has the same alignment rules, even when the "main" architecture is the same
(even x86 generations differ).

I remain unconvinced that IR is the only representation you need if you want
to do things at the assembly level. I welcome examples from somebody in the
know.

Another topic is how often IR is going to change.

~~~
majke
> ... not every platform has "process 4 floats at once this way"

Sure. The point is - you can write IR code that multiplies _any_ number of
floats and the backend "should" generate reasonable machine code for any
architecture.

> Also I don't see the encoding of assumptions of alignment

Yeah, no memory loads there, data passed in xmm0 and xmm1, plenty of
simplifications.

> Another topic is how often IR is going to change.

Fair question. I don't know, but given the number of projects that currently
use it I'm quite sure it'll be well maintained. Also, keep in mind that
adapting IR to a future version (if one is created) is still, IMO, simpler
than adapting assembler. Additionally, you can use IR to generate machine code
once (for every architecture); that way you won't depend on your users having
LLVM installed.

~~~
cliffbean
> you can write IR code that multiplies _any_ number of floats and the backend
> "should" generate reasonable machine code for any architecture.

Except that it doesn't. If you write your code with <2 x float> and codegen
for SSE, you'll get <4 x float> code, with two elements going unused. It's
functionally correct, but you're potentially missing out on half the
throughput. If you write your code with <8 x float>, you'll get two registers
for each value, but this can create extra register pressure without actually
giving you any increased throughput in return.

~~~
jevinskie
Other optimization passes will widen the vectorization width to 4 if possible
by unrolling the loop!

------
artagnon
I doubt it has any practical application outside llvm.git. Why would you want
to write assembly by hand in the first place?

1. To bootstrap. There's various architecture-specific bootstrap code to load
the kernel into memory in linux.git. The project largely depends on GNU as to
assemble reliably; llvm-mc doesn't work half as well.

2. To try out new processor extensions. When new instruction mnemonics come
out, assemblers have to catch up. I'm not sure why anyone would want to use
LLVM IR here, because those instructions are architecture-specific anyway.

------
astral303
The exciting thing would be to see applications compiled to LLVM IR get
JIT'ed - finally, C or C++ apps that could inline through function pointer or
virtual function calls at runtime.

~~~
monstrado
You might find this interesting: a database query engine that uses LLVM IR to
speed up query execution.

[http://blog.cloudera.com/blog/2013/02/inside-cloudera-impala...](http://blog.cloudera.com/blog/2013/02/inside-cloudera-impala-runtime-code-generation/)

------
dbecker
How many people still write assembly, and for what purposes?

~~~
stephencanon
I write assembly most days (I write system libraries--largely floating-point
and vector code--for a living). In general the assembly that I write is
between 1.5x and 3x faster than what the best compilers currently produce from
equivalent C code (obviously for completely trivial tasks, you usually can’t
beat the compiler, but those generally don’t need to be abstracted into a
library either; there are also outliers where the code I write is orders of
magnitude faster than what the compiler produces). For most programmers most
of the time, that’s not enough of a payoff to bother. For library functions
that are used by thousands of applications and millions of users, even small
performance wins can have huge impact.

Partially my advantage comes from the fact that I have more knowledge of
microarchitecture than compilers do, but the real key is that I am hired to do
a very different job than the compiler is: if your compiler took a day (or
even an hour) to compile a small program, you simply wouldn’t use it, no
matter how good the resulting code was. On the other hand, for a library
function with significant impact on system performance, I can easily justify
spending a week “compiling” it to be as efficient as possible.

~~~
mjn
_if your compiler took a day (or even an hour) to compile a small program, you
simply wouldn’t use it, no matter how good the resulting code was_

Some research has been moving in that direction, since there does seem to be a
demand for it. After all, if someone is willing to wait for you to spend a
week hand-tuning a function, maybe they'd also be willing to run a compiler in
a "take 10 hours to crunch on this" mode. Example:
[http://blog.regehr.org/archives/923](http://blog.regehr.org/archives/923)

~~~
stephencanon
Oh, absolutely.

That said, it’s also worth keeping in mind that the enormous differences in
how computers and expert humans currently approach the problem closely
parallel the differences in how computers and expert humans play chess:
quickly evaluate billions of possible “moves” vs. quickly identify the few
most promising “moves” and then slowly evaluate them to pick the best. I fully
expect to be regularly beaten by the compiler “someday”, but (a) I believe
that day is still several years off and (b) even then, I expect that expert
human + compiler will beat compiler alone, just as in chess.

~~~
zwegner
As a former chess engine author, and current compiler writer, with ideas in
search-based compiler algorithms/unbounded compilation times, I hope to
accelerate the arrival of that day :)

------
monstrado
Cloudera's Impala SQL query engine uses LLVM to maximize performance; here's a
recently published article about how it's used.

[http://blog.cloudera.com/blog/2013/02/inside-cloudera-impala...](http://blog.cloudera.com/blog/2013/02/inside-cloudera-impala-runtime-code-generation/)

------
carterschonwald
For those who want to play with LLVM as a library but would rather not deal
with C++, the llvm-general Haskell library gives you nearly all the power of
LLVM (aside from link-time optimization and writing new optimization passes)
in a nice high-level lib that's already being used by a few interesting
compilers out there.

~~~
iskander
Similarly, Siu from Continuum wrote some nice Python bindings
([http://www.llvmpy.org/](http://www.llvmpy.org/)) and LLVM already ships with
OCaml bindings
([http://llvm.org/docs/tutorial/OCamlLangImpl1.html](http://llvm.org/docs/tutorial/OCamlLangImpl1.html))

------
AYBABTME
This reminds me that I need to learn some assembler. I have a hard time
finding resources on conventions, project layout, best practices, and tooling.
If anybody's experienced in an assembler, I'd love some tips and pointers.

~~~
acuozzo
I can provide many tips and pointers for 6502 ASM.

------
swah
And C is better than LLVM, from a post a few days ago :)

------
crb002
glibc is nuts for not having an LLVM port.

