What?
"Register allocators use heuristics that are “pretty good” for most usage, but sometimes fail spectacularly in edge cases with very high pressure. Register allocation is, after all, NP-complete. "
The specific problem he referred to was the heuristics failing spectacularly in high pressure.
In SSA form, they have no need to use heuristics, and if they don't want to, don't. They can optimally choose registers.
This will not fail.
Yes, there are other parts to register allocation. The most common NP complete part referred to, and referred to by the parent (AFAICT), was register choosing. This is not NP complete to do optimally in compilers using SSA form. Period.
There may be other parts to the register allocation pass that use heuristics, and fall down, but most of those are not NP complete on SSA either.
SSA-based register allocation does copy-coalescing afterwards [0] or integrated [1]. This is the NP-hard part. Of course, there are fast heuristics [2] which are "pretty good".
I'm not super familiar with all the recent literature on SSA, but it always seemed to me that while you could optimally allocate registers for the SSA representation, the SSA representation often contains dead Φs, and fully eliminating the Φs may not be doable in polynomial time.
There are other algorithms that may coalesce more phis, and can be done in linear time, but do not have such guarantee.
If your concern is dead phis, all dead phis (where dead is defined as unused, or only used by dead instructions) can be eliminated in linear time by performing a single linear time backwards-DCE pass on the graph.
This will get all phis and instructions that are dead, or only used by instructions that are themselves dead.
If your concern is useless phis and instructions (for lack of a better term, there are papers, and they use the word "dead" differently, so i'm defining a term here for you), where you define useless as "side-effects do not matter", such that in
*a = 5
c = a
*a = 5
c = a
the second store and assignment is useless,
This type of elimination is O(N^3) normally, and unproven bounds in SSA (AFAIK).
What exactly is "this type of elimination"? Within our compiler [0] your concrete example is optimized by an O(1) mechanism and then there is a more complex load-store-optimization which is O(n^2) I think.
Looking at your code, libfirm's DCE the standard SSA backwards DCE algorithm without control dependence computation to eliminate dead phis. It will be O(n).
The load store optimization looks like a modification of stuff i helped Thomas VanDrunen with. It is O(N^2), but will miss loads/stores depending entirely on the strength of your memory dependence calculation.
In fact, it will miss a lot, particularly around loop dependent store/load values that are equivalent.
It will catch my particular example, but GCC's testsuite has load/store examples (see gcc/testsuite/gcc.dg/tree-ssa) that it should fail at without help from other phase ordering/etc.
Thanks for the clarification. You are right, our load-store-opt cannot handle loops.
Dead code elimination technically [0] is O(n). However, it should be called "copying garbage collection" instead.
LibFirms SSA construction is minimal [1], it does not create dead phis. Or course, phis might die due to optimizations, in which case they become unreachable and get garbage collected at some point (see above). However, Firm does not model memory (global variables) as SSA, only registers (local variables).
Minor correction: LibFirm also generates pruned SSA, which is the property that guarantees lack of dead phis. Minimal SSA roughly means "no superfluous live phis".
Let me know if you still do this once you integrate a polyhedral code generator and try to do incremental SSA renaming on the nested loop outputs :)
Note that dead phis are actually useful in some contexts, particularly around loop updates
This is why things like loop-closed ssa (http://gcc.gnu.org/onlinedocs/gccint/LCSSA.html) exist, and phi nodes are generated even if the variable is never initially used outside the loop (makes sinking easier)
Thanks for that info; I have more reading to do. The actual work I get paid to do isn't on an SSA compiler, so keeping up-to-date on SSA is more of a hobby.
You can do dead code elimination before register allocation. Dead code elimination is doable in polynomial time. Alternatively, you can you use this relatively unknown construction algorithm [0] originally invented by Cliff Click.
True.
Let me expand a bit, I guess, to prevent confusion.
Your time bounds and what you get depends on what you mean by "dead code", and how you perform it.
The vast majority of DCE algorithms, in books, in compilers, etc, on SSA form, have the same properties:
1. The "bad" ones are forward DCE algorithms, and do not eliminate everything they could because they process in the wrong order.
2. The "good" ones, used in most compilers, are backwards DCE algorithms, and eliminate all dead scalar code in a straight line.
Both are O(n).
The standard backwards DCE algorithm has a number of deficiencies, that cannot be eliminated sanely in O(n) time.
1. These DCE algorithms do not consider memory affecting statements dead unless they can prove they are non-side effecting. Proving that is going to usually require at least some N^2 analysis somewhere.
This means they miss all dead scalar code that results from a useless memory reload (second order effects), and generally will not eliminate any dead stores. Even though this is code, and it's dead, it is not eliminated.
2. These DCE algorithms that are O(n) don't deal with control altering statements. Jumps to empty blocks, as well as code contained in unreachable blocks, will be considered non-dead unless you include control dependence and some form of branch evaluation. You can compute control dependence in O(n) time. In practice, the constant sucks :) It is also not possible to statically evaluate all branches in O(1) time, so even in the absence of memory statements you cannot guarantee you eliminate all dead code in O(n), only some of it.
3. These DCE algorithms do not handle partially dead code (that is, code that is dead if moved to a different place in the program, but does not otherwise affect semantics).
The canonical example is:
y = a + b
if (...) {
y = c + d
} else {
...
}
The first statement is partially dead (dead along some branches), and performing assignment sinking will make it fully dead (if you sink a copy into the else branch, the original is now fully dead).
Performing this type of elimination is N^3 or worse, i haven't seen better time bounds on SSA.
(It is not, AFAIK, completely subsumed by something like SSUPRE)
GCC (and I think LLVM now) actually performs the most trivial case of this, which catches a very large number of cases once other passes are run.
Well, actually, looking again, it looks like over the years it's grown.
I guess nobody's noticed it's basically equivalent to Cliff Click's GCM algorithm at this point, without the nice guarantees.
It looks like it could be trivially modified into the GCM algorithm now.
In my opinion the "good" DCE is fine. Your deficiencies should be fixed elsewhere. It largely depends on the definition of dead, though.
1. I would never consider a store dead in the sense of DCE. Those side-effect analysis should be handled elsewhere.
2. DCE should not alter control-flow.
3. Partial dead code elimination is not about dead code or elimination anything. Click understood it correctly, it is about code motion.
Effectively, I argue to push the problems somewhere else. This means that DCE is a very cheap optimization and can be run multiple times during compilation. This somewhat helps with the phase ordering problem.
Partial dead code elimination (and partial redundant code elimination) is still an open research question. There are multiple overlapping approaches, but none subsumes everything.
Sure, these are all matters of opinion, though i'll point out GCC started out thinking the same way WRT to keeping things cheap and running them multiple times.
This eventually led to horrendous compile times (that LLVM was much better at) because passes refused to clean up after themselves at all. After all, cheap DCE/cheap GVN/whatever would just get run later, who cared. Those passes became necessary, and even when cheap, if you run DCE 5 times, it starts to add up. Same with not touching control flow during DCE. Leave it to the CFG simplifier, which now gets run 5 or 6 times.
These are all tradeoffs, and especially in a production compiler, you have to make these tradeoffs sometimes, and do more than is academic-wise ideal to do in a single pass.
This is why LLVM's GVN prunes dead instructions, etc
The specific problem he referred to was the heuristics failing spectacularly in high pressure.
In SSA form, they have no need to use heuristics, and if they don't want to, don't. They can optimally choose registers.
This will not fail.
Yes, there are other parts to register allocation. The most common NP complete part referred to, and referred to by the parent (AFAICT), was register choosing. This is not NP complete to do optimally in compilers using SSA form. Period.
There may be other parts to the register allocation pass that use heuristics, and fall down, but most of those are not NP complete on SSA either.