Hacker News new | past | comments | ask | show | jobs | submit login
Exploring Clang/LLVM optimization on programming horror (matthieud.me)
362 points by maattdd on Aug 17, 2021 | hide | past | favorite | 68 comments

Great writeup! This is a nice overview of how LLVM's optimization passes compose to produce nicely optimized assembly.

One nitpicky observation: mem2reg doesn't "move values from memory [...] to registers (directly inside the CPU)." It's an SSA expansion pass, taking the minimal SSA form produced by the C/C++ frontend to turning it into a more aggressive SSA form by eliminating as many `alloca`s as possible. SSA is characterized by the presence of infinitely many "fresh" abstract registers, none of which correspond to actual CPU registers -- LLVM is responsible at a later stage for performing efficient lowering and allocation of registers/stack slots for the SSA registers.

Not nitpicky at all - I was confused as to why one would want to do register allocation before control/data flow analysis. Thanks for the explanation!

Thanks for the nitpick, I should have been more precise and write move value from memory to abstract register otherwise it is indeed confusing. I will fix the article and mention your comment.

Could you give an example of an alloca mem2reg would eliminate?

The article has an example: in the unoptimized IR generated by the frontend, all the variables are stored in memory on the stack and accessed through pointers stored in registers, whereas after the mem2reg pass variables are stored directly in registers.

Memory is mutable whereas SSA registers are immutable, so the compiler frontend just put all the C variables in memory to express the semantics of the C program in the simplest way possible. The mem2reg pass then moved all the variables into SSA registers, by simply creating a new register for each assignment and adding phi nodes to track the flow of variables between basic blocks (like the individual iterations of the loop body).

You've very concisely explained the bits that were unclear to me. Thank you.

I don't think that would be helpful in learning what's going on.

Instead, I'll point you out to the keyword "Dominator Frontier", and "SSA Construction Algorithm".


You should read that only after you understand SSA and Phi functions in particular. Once you know how Phi works, the next question is the algorithm for how they're constructed.


This page may also help.


> -mem2reg: Promote Memory to Register

> This file promotes memory references to be register references. It promotes alloca instructions which only have loads and stores as uses. An alloca is transformed by using dominator frontiers to place phi nodes, then traversing the function in depth-first order to rewrite loads and stores as appropriate. This is just the standard SSA construction algorithm to construct “pruned” SSA form.



> The ‘alloca’ instruction allocates memory on the stack frame of the currently executing function, to be automatically released when this function returns to its caller. If the address space is not explicitly specified, the object is allocated in the alloca address space from the datalayout string.


Based on my reading and understanding of alloca and mem2reg above, I have to disagree with your assessment. It seems like alloca roughly corresponds to a "push" in your typical x86 assembly language, or maybe a "add esp, #ofBytes".

By removing alloca instructions, the mem2reg step is turning stack-memory into registers.

> By removing alloca instructions, the mem2reg step is turning stack-memory into registers.

This is true, but it's also misleading: the transformation of stack slots into machine registers is a beneficial side effect of mem2reg. Reducing stack use is an important optimization, but producing an optimal SSA form is even more important, since key passes like folding, DCE and SRoA rely heavily on the SSA form. The "reg" in "mem2reg" refers explicitly to the latter (SSA registers), as the text you've excerpted directly says.

You can also prove this to yourself by contriving an IR function that contains more SSA registers than machine registers: you'll see that, in addition to any ABI constraints, LLVM will be forced to spill any excess SSA registers back onto the stack.

Edit: but also yes, to confirm: `alloca` corresponds more or less directly to `add ESP, <adjust>` within the x86 stack model. But it doesn't have to! LLVM's semantics are target independent even when the IR isn't.

Gotcha. I think I see what you're talking about now.

I have crude familiarity with SSA (no personal experience but I've read a book once) but not with LLVM's terminology. I'm looking at the example in the blogpost, and I can see that its clearly adding in the important Phi functions needed

Strange that llvm names their aggressive SSA step in this manner. But yes, adding the phi functions in this manner is a very important step before we can think about optimizations.

The "inductive variable" is far easier to detect once its in a "Phi-form" so to speak.

It's an extremely confusing naming choice! I don't know why they did it, but to take a guess: a lot of compiler literature refers to SSA values as "registers," since the SSA model of a computer program directly corresponds to the concept of an (infinite) register machine in abstract computation.

The end result being that we have the same word "register" referring to two opposed resources (one infinite, one extremely finite) :-).

It's not the only place where this terminology confuses people. Another one is "register-based virtual machine", in the context of interpreters. In that case the registers belong to the VM and are actually stored in stack memory, not on actual CPU registers. However, people get that confused all the time!

At least historically LLVM has itself been inconsistent in the naming of SSA "values", and mem2reg is arguably just the most prominent example of the "register" name being used.

> You can also prove this to yourself by contriving an IR function that contains more SSA registers than machine registers: you'll see that, in addition to any ABI constraints, LLVM will be forced to spill any excess SSA registers back onto the stack.

Having more SSA register than machine registers does not mean there will be any spilling as unlike SSA registers, machine registers can be reused once the old value is no longer needed.

Sure. To be more precise: you can prove this to yourself by contriving a function that has more live variables/values than there are machine registers.

The core simplification is due to LLVM's induction-detection step. Even after doing induction-based proofs in school, until reading this I had the sense that induction is kinda magical and kinda not real.

I still cannot fathom how you can encode induction detection into an algorithm, any pointers welcome (keeping it simple, please, and ideally with working code).

The only case that makes sense to me is if you did numeric computation against some fixed number of cases and if that worked out then you assume it's right.

I guess this is what proof assistants do (among other things). Maybe I should look into how to write a basic proof assistant.

Induction detection is relatively simple in LLVM IR. The first thing to remember is that the compiler doesn't work on the same source code that the human does; it works on a different representation. The canonical loop structure would look something like this:

    ; This is the block that executes before the loop does.
    br label %loop.body;

    %i = phi i32 [0, %loop.preheader], [%i.next, %loop.body]
    %val = phi i1 [0, %loop.preheader], [%val.next, %loop.body]
    %val.next = xor i1 %val, -1
    %i.next = add i32 %i, 1
    %cmp = icmp slt i32 %i, %N
    br i1 %cmp, %loop.exit, %loop.body

    ; After the loop is done
Because of how phis work, in constructing this form, we immediately realize how to describe a value either in terms of its value in the first iteration (this is the value it takes coming from the preheader), or in terms solely of its value in the previous iteration. In other words, %val.next = f(%val) for some function f, which we can easily identify in the source code.

Now, if we can compute the repeated composition of f easily, we can replace the recurrence in the loop with repeated composition. Repeated addition of the same value is multiplication, and repeated multiplication of the same value is exponentiation--that's two very easy examples to do so. We recognize that there is no use of the value except of its final occurrence, and so instead of doing the loop, we can generate the composition itself. Thus, after N iterations of the loop, we can conclude that %val.next = f^N(0), and if we can identify the loop iteration count (that's %N in this case), we can simply write out f^N instead of having to compute every individual value.

How does LLVM know to simplify f^N ?

I've seen LLVM do the same optimization for triangular and pyramidal numbers. Is there a list of hardcoded functions it can handle? Does it try a specific algorithm to find the direct formula?

So if I'm understanding this, this only works for loops with a finite number of iterations knowable at compile time, right?

The number of iterations must be knowable, yes. It doesn't need to be known as a direct constant value, but it must be computable as a symbolic value as some kind (say, an argument to the function, maybe plus or minus a constant).

I haven't looked at the exact criteria in this transformation, but it's very likely that there are also a couple of constraints on how precisely the loop is structured.

Correct, if you don't ensure the necessary preconditions are met you quickly run into undecidable problems.

Well in this case it seems reasonable that they simply added a rule for the specific line:

    even = !even
and the line

resulting in

    even = (indvar & 1)
    numberCompare = indvar
You can then solve from the exit condition that number == numberCompare, which gives you the indvar which can then be substituted into 'even'.

I'm not saying it isn't magical, and certainly requires some symbolic solving, but it's doable.

Of course the real goal is to eventually get it to optimize

    while (number != 1) {
        if (number & 1)
            number = number * 3 + 1
        number >>= 1;

> Of course the real goal is to eventually get it to optimize `while (number != 1) { ... }`

Valid C++ programs are guaranteed to be able to make forward progress [1].

So if `number` is a normal stack allocated integer (and not volatile, etc), then the infinite looping case is undefined behavior here.

So it would be a valid optimization to transform this to `number = 1;`.

(And indeed: https://godbolt.org/z/eodhfWe6h )

[1] https://en.cppreference.com/w/cpp/language/memory_model

Amazing, but crazy.

icc does not make this astounding leap in logic.

This is an example of a transformation that is probably more expensive in compile-time than it actually wins in runtime performance. Gcc has explicitly rejected this kind of optimization in the past for precisely that reason.

In general, compilers face an iron triangle of optimization: do you want a fast compiler, a small binary, or fast output? Optimizations that help all three are rare (especially ones that aren't already implemented in -O1), and even two of the three is often difficult to obtain.

> This is an example of a transformation that is probably more expensive in compile-time

Why? Reasoning that the value must be 1 after the loop is not even a loop optimization, just a simple consequence of the branching instruction. This leaves the loop but nothing it computes is used anymore so since C++ guarantees that it exits you can simply remove it.

The transformation as a whole might not be useful often, but having a constant for the induction variable after the loop and being able to delete useless loops will both have runtime wins.

> In general, compilers face an iron triangle of optimization: do you want a fast compiler, a small binary, or fast output?

I often wish compilers provided more options to just brute force a bit more performance out of release builds. You can manually tune many parameters but then you need to know a lot more about how the passes work than just saying --compile-for=5h (or something like that, doesn't need to have a hard time limit).

Note Gcc does do this, just at -O2 and above, whereas Clang does it on -O1 and above. Gcc also only does it on C++ code whereas Clang does it on both C and C++

Maybe LLVM should add a Collatz[0] optimization pass that would replace your loop with

    number = 1
for at least up to 64bit types?

[0] https://en.m.wikipedia.org/wiki/Collatz_conjecture

Well, basically it's about converting iteratively computated state to a function of iteration count.

Consider indvar[n] = indvar[n-1] + k, indvar[0] = a. It follows that indvar[n] = a + n * k.

This is the type of indvar canonicalization that can zap loops such as the one in the article. Suddenly the final state is just a function of the loop trip count.

The return value is replaced with canonicalized indvar. That makes the loop itself dead (nobody observes any of its effects) and removed by subsequent dead code elimination pass.

My example transforms addition to multiplication. Another common one is multiplication into power. That's close to what's going on in the article example: xor is just a multiplication of signed one-bit integers.

This appears to be the source code: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Tran... It's long, but it looks well-commented! Some Googling also finds https://llvm.org/devmtg/2009-10/ScalarEvolutionAndLoopOptimi... which looks relevant.

I think "induction" here isn't exactly the sense of proofs by mathematical induction as you learn in school (as in proving something for all positive integers given a base case and a recursive case) - I think all they mean by it is "a loop, over a single integer." Quoting https://en.wikipedia.org/wiki/Induction_variable :

> In computer science, an induction variable is a variable that gets increased or decreased by a fixed amount on every iteration of a loop or is a linear function of another induction variable.

The description at https://www.cs.cornell.edu/courses/cs6120/2019fa/blog/ive/ makes it clearer:

> An induction variable is any variable whose value can be represented as a function of: loop invariants; the number of loop iterations that have executed; and other induction variables.

That is to say, the link to proofs-by-induction is that if you have a loop like "for (i = 0; i < 10; i++)", you can prove the statement "after N loops, i = N". Why? Because after 0 loops, i = 0 (the first part), and on each loop, i is one greater (the i++ part), and nothing else that happens in the loop changes i (which is what that post means by "loop invariants").

In other words, it's not that the compiler is detecting that the programmer is doing induction (the programmer is not, in fact the programmer is simply writing a for loop and not thinking about it), it's that the compiler is performing induction on the loop.

So a simple case is if you have a loop like

    int i;
    for (i = 0; i < 10; i++)
    printf("i = %d\n", i");
You can prove that at the end of the loop, i = 10, and you don't need to do actually increment i during each iteration to get there. And if you didn't have that printf (i.e., if the loop body was empty), you could optimize the loop out entirely and just print out 10.

So it's not terribly magical that LLVM is able to turn

    while (number != numberCompare)
        even = !even;
into, in pseudocode,

    numberCompare = number;
    even = !!!...!!!even (numberCompare times);
i.e., that it's able to get numberCompare out of the loop, to conclude that even in turn is only a function of numberCompare, and then that there's no loop left.

What's more magical to me, I think, is that it's able to make sense of "run the ! operation numberCompare times" and turn that into "if (numberCompare % 2 == 1) even = !even". I don't think that's too surprising either - a compiler should know that, for booleans, !!foo can be simplified to foo. But repeatedly applying that rule across an abstract number of calls to !!, plus possibly one more, seems pretty clever.

If I'm skimming the code right, that part of the simplification is a different optimization pass entirely. I don't know off hand where that's implemented. I'd be curious if someone can find it.

I'm also very interested in this part. Especially because I know clang will optimize the triangular and pyramidal numbers into constant time computations.

(The sum of the first N integers and first N squares respectively)

I really wonder what they use for this and how much it generalizes.

the induction step is pretty cool. it will remove the loop calculating the sum of an arithmetic progression and replace it with just multiplies/shifts/subtracts and adds.

These optimizations are on LLVM IR into LLVM IR. So basically every backend benefits. I don't think most backend engineers would even understand them. I don't.

> I don't think most backend engineers would even understand them. I don't.

Most engineers are intimidated by compilers, but they really shouldn't be! LLVM's source code is very approachable and well-isolated by concern, and most of the fundamental concepts in an optimizing compiler are understandable with a beginner's understanding of data structures, assembly, and resource allocation.

One slightly funny thing, though: LLVM's optimizations are on IR, so every frontend can theoretically benefit from them. However, many assume patterns originating from the C/C++ frontend or heavy canonicalization, so you need to be a little careful as a language frontend developer to take full advantage of all optimizations!

I'm actually working on an LLVM backend. I'd read the Dragon Book before, long before, but I think my real education was the LLVM Developer Meeting tutorials.

LLVM is pass oriented but broadly there are front ends (Clang, Rust, Fortran), a middle end (LLVM IR SSA optimizations) and multiple in-tree and out-of-tree backends. I was a little resentful of having to write all of 5 lines or so inside of Clang. Otherwise, I happily live in the backend. Doing a merge with a new release is an hour or so of conflict resolution.

The pass structure importantly means that you can test your backend code, even at the GlobalISel subpass level, without needing to understand Clang, LLVM, .... Pretty sweet if you ask me.

> so you need to be a little careful as a language frontend developer to take full advantage of all optimizations!

So does this mean that it’s possible to express exotic patterns in LLVM-IR that aren’t typically produced when compiling C or C++?

Are Rust’s optimizer passes significantly different than clang's? I would guess borrow checking has a special pass that wouldn’t be applicable to C.

This topic is really fascinating to me!

Rust uses basically the same optimizer passes as Clang.

Borrow checking is more like a type checking pass, not an optimization pass- it's pass/fail, and doesn't change the IR.

Probably the most interesting thing to look at here is Rust's use of LLVM noalias metadata. This captures the assumption that, while a reference (of type &T or &mut T) is live, the value it points to is not allowed to change via any other name. (The borrow checker ensures this assumption is valid for safe code, but the rule applies to unsafe code too- the important thing is the reference types.)

This metadata enables LLVM to perform more aggressive optimizations of pointer reads and writes. C can actually generate similar metadata via `restrict`, but this is less common because there's no borrow checker to verify that it's used correctly. And interestingly, LLVM has been very buggy in this area! Rust has had to disable its use of noalias metadata several times as those bugs have been found and fixed.

Notably, the Rust borrow checker has exposed a tonne of bugs in the LLVM noalias optimizations. I believe the latest versions of rustc now have noalias enabled since the LLVM 12 release, but it's been a long cycle of enabling it, discovering a bug, and disabling it to this point.

My experience is that LLVM can generally optimize the IR produced from clang frontend a bit better than the IR produced by myself. The advantage is small, but still exists.

I think the major reason is that TBAA optimization is only possible if you manually emit those metadata (clang does, but I didn't, due to time limitation and fear of aliasing-related bugs).

But that LLVM is designed to work best on the pattern emitted by Clang is definitely another reason. In fact, even the official documentation recommends users to emit IR in a way that looks similar to what is emitted by Clang, so you have a better chance to keep the LLVM optimizer happy.

A notable topic to throw in the bag here is GPU shading languages.

LLVM project is used to compile many different shading languages into a similar IL, then compiled into a bytecode for GPU. The project has been really useful for managing all the compiler infrastructure needed for graphics nowadays imo.

LLVM is approachable but some things are no better than GCC, I find. In particular I have noticed there seems to be very little high level documentation for the instruction scheduler.

I find GCC machine def files actually slightly easier to read than LLVM's

None of these steps are too hard to think about in SSA form (a specific transformation of code that LLVM does near the beginning of its analysis).

So the key is to first understand SSA form. After that, a lot of optimization passes become obvious to think about.


SSA has two bits to understand:

1. Single static assignment -- variables can be assigned, but once assigned they can never change. (This part is easy).

2. Graph-based, and therefore requires the 'Phi' function to understand -- variables may take on two different values. EX: "r0" may come from Node#1, while "r0" comes from Node#2. r0 may have been renamed %1 in #1, and %2 in #2. "Phi" is a "fake function" that resolves the ambiguity and kinda makes SSA code "real" again.

Hmm, maybe that's a bad explanation. But whatever, my point is that understanding "Phi" is the concept you want to focus on in SSA. Phi is the harder part, but its still not that hard to understand.

Once you get it, the whole form "clicks" in your brain and you suddenly understand how optimizations can become possible.

Another explanation of SSA:

SSA is static single assignment. Every variable has a single point of declaration. If you want to change the value of a variable (say, i = i + 1), you instead have to create a new variable. The advantage of this form is that tracking the definition of a variable is trivial; in LLVM IR, to use a variable, you literally pass in a pointer to the instruction that defined it. The inverse, finding all uses of a value, is done by building a use-list: whenever you create an instruction, it adds itself to the use-list of all the values it's using, which allows easy iteration of the users of a value.

In short, SSA gives you a trivially correct-by-construction to understand the dataflow of a program (at least, that data which lives in registers; memory is more complicated). There's no separate analysis that has to be maintained throughout optimizations, and hence a possible source of programs (as there is for failure to maintain the dominator tree through CFG transformations, a historically bug-prone step in LLVM).

Now, as you may notice, there are cases where it would seem hard to pin down a single definition of a variable: conditional assignment. SSA solves this by creating Phi values. Phi values can be better thought of as basic block arguments: when you enter a basic block, you must provide a value for each of the phi values in that basic block. The actual value of the phi is dependent on which edge you used to enter a basic block, or in other words, the control dependencies of the code.

In short, phis are the intersection of control flow in the code with the dataflow--and they are the only things in that intersection.

LLVMs IR has linear basic blocks inside a function graph, there are IRs which are graph all the way down as well.

EDIT - nvm, the CS241 UWaterloo course doesn't cover SSA.

In case anybody's interested in learning the basics, here's some course notes to a compilers class which I love!


Your notes, while honestly very good, don't cover data flow analysis or Single Static Assignment (SSA) which is what LLVM IR [1] uses and which are the tools that the article's optimizations are based on. Hell, LLVM backend engineers rarely even look at IR. We mostly look at Generic Machine IR [2].

[1] https://llvm.org/docs/LangRef.html

[2] https://llvm.org/docs/GlobalISel/GMIR.html

Ah, my bad. When you said "backend" I thought you mean "HTTP backend" hence my over-simplification. You're right -- SSA isn't covered in that course. Cheers.

Interesting, in my day SFU kept this til 3rd year, as CMPT379 IIRC, and only as an option for Theoretical or Systems paths.

Looks like it's still 379; not sure about the optional component.


Decades ago I participated in some group programming contest for universities (in person, not online). We did terribly, but mainly I remember being in awe of how quickly Waterloo was solving the challenges.

Nice that it even provides the right answer for negative numbers as undefined behavior but only when optimizations are enabled!

Nice catch. One of many instances where C/C++ compilers rely on integer wraparound being UB.

I'm very curious how it optimizes that recursive version at the end, because it is not tail recursive.

Does anyone know? Perhaps it becomes tail recursive after one round of inlining?

It seems the TailRecursionElimination.cpp pass is responsible for that. It has some smarts to know that the xor instruction between the call and return can be turned into an "accumulator", although what exactly that means I'm unsure.

Further reading about the relationship between recursion, iteration, and accumulators: https://blog.moertel.com/posts/2013-05-11-recursive-to-itera...

This transformation is one you can apply in your own source as well. An example from that link, the recursive

    def factorial(n):
        if n < 2:
            return 1
        return n * factorial(n - 1)
can be rewritten iteratively:

    def factorial1d(n, acc=1):
        while n > 1:
            (n, acc) = (n - 1, acc * n)
        return acc

Thanks for investigating that!

An accumulator is when we add an extra parameter to hold the "loop variable", like this:

    bool doit(int n, bool acc)
        if (n == 0) return acc;
        else return doit(n-1, !acc);

    bool isEven(int n)
        return doit(n, true);
The function is now tail-recursive and behaves similarly to this loop:

    bool acc = true;
    while (n != 0) {
        acc = !acc;
        n = n - 1;
    return acc;
According to the comment on the top of the file, apparently llvm can create this accumulator variable if the operation is associative and commutative. That's because after introducing the accumulator, the evaluation order is outside->inside instead of inside->outside.

you get a linear time O(n) isEven function

In complexity theory, the size of a problem is the number of bits to represent the input, so if the input integer is i, then the size is n = log_2(i) so the algorithm is actually exponential in the number of bits it takes to represent the input.

In a strict sense, you're right.

But in practice, you're not. Case in point: matrix multiplication is commonly quoted as O(n^3) (naive), when in fact, the amount of data used is O(n^2), and therefore should be quoted as O(size^2) (quadratic) with respect to data-size. (EDIT: was bad at math for a second).

But no, people mean "n-by-n matrix" has "O(n^3) naive" implementation. In practice, the "n" is just whatever is most convenient for a problem.

In many cases, n is proportional to the input size. But in other cases (such as the famous matrix multiplication examples), n is the sqrt(input-size).

> therefore should be quoted as O(size^2) (quadratic)

It's actually O(size^1.5), not O(size^2). Well, assuming you're talking about θ(n) when you mean O(n) as most people do.

[If you're not familiar with the difference, Θ(f(n)) means a function that grows as fast as f(n), up to a constant, whereas O(f(n)) is a function that grows no faster. So n^2 is O(n^3) but not Θ(n^3).]

But it's actually quite relevant here. For example with the informal notation you can find all the prime factors of a number N in "O(N)" time , if someone doesn't understand the distinction then they might believe that theres a polytime algorithm for factorization.

is complexity theory constrained to binary representations? Why should it?

Every base except unary is a constant-factor away from binary, and so is irrelevant asymptotically. We don't use unary because it's artificially slow.

That post is a joke, right?

  O(n) isEven function compared to the obvious constant time O(1) modulo algorithm
In the holy name of nitpickiness: the modulo operation is not always O(1), as div/mod CPU instructions have complex internal implementations and require varying number of cycles depending on what the input args are.

However, for x%2 it'll be probably O(1), both on the cpu level, and b/c it'll get optimized to something like x&1 by most compilers (at least for unsigned values)

I will nitpick your nitpick. In a language with arbitrary size integers you're right. This is a C++ int, therefore 32/64 bits (in all sane cases, ik the spec doesn't specify), so the maximum number of cycles is upper bounded. Therefore modulo on C++ ints is is O(1).

But having an upper bound doesn't make it O(1) automatically, no? It'd be like saying that Erastotenes' sieve is O(1) b/c for 64bit integers we know what upper exec time limit is? IMO id'd be rather creative use of current conventions around comp. complexity, but my knowledge of those topics is bad.

big O() notation is simply not precise enough (and not meant to be) to be used in such nitpicky discussions. Everything is O(1) for a large anough constant, for example O(live time of the universe).

There are other notations like small o(), Omega(), omega() or theta() to capture the asymtotic runtime behavior more or less precisely.

It states "constant time O(1) modulo _algorithm_" for implementing an isEven function.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact