A lot of these suboptimal examples come down to the complexity of the optimization problem. Compilers tend to use heuristics to come up with "generally good enough" solutions in their optimizers instead of making a longer and computationally more expensive foray into the solution space. Register allocation is a prime example. This is an NP-hard problem. Plenty of heuristics exist for finding "generally good enough" solutions, but without exhausting the search space, it typically isn't possible to select an optimal solution, or even to determine whether a given solution is optimal. Couple this with the tight execution times demanded of compilers, and issues like these become pretty common.
Even missed strength reduction opportunities, such as eliminating unneeded spills or initialization, can come down to poor heuristics. It's possible to write better optimizer code, but this can come at the cost of execution time for the compiler. Hence, faster is often chosen over better.
In first-tier platforms like ix86 and x86_64, enough examples and eyes have tweaked many of the heuristics so that "generally good enough" covers a pretty wide area. As someone who writes plenty of firmware, I can tell you that it's still pretty common to have to hand-optimize machine code in tight areas in order to get the best trade-off between size, speed, and specific timing requirements. A good firmware engineer knows when to trust the compiler and when not to. Some of this comes down to profiling, and some comes down to constraints and experience.
Then, there are areas in which compilers rarely produce better code than humans. Crypto is one example. Crypto code written in languages like C can break in subtle ways, from opening timing oracles and other side-channel attacks to sometimes getting the wrong result when the developer's assumptions and the optimizer's are at odds. In these cases, hand-written assembler -- even on first-tier platforms -- tends to be both faster and safer, if the developer knows what he/she is doing.
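To make the timing-oracle point concrete, here's a rough sketch (my own, not taken from any real library) of the classic pitfall: an early-exit comparison versus an intended constant-time one. Nothing in the C standard obliges the optimizer to preserve the constant-time property of the second version, which is exactly why this kind of routine so often ends up hand-written in assembly.

#include <stddef.h>
#include <stdint.h>

/* Early-exit compare: execution time depends on where the first mismatch
   occurs, which can leak information about the secret operand. */
int leaky_compare(const uint8_t *a, const uint8_t *b, size_t n) {
    size_t i;
    for (i = 0; i < n; i++) {
        if (a[i] != b[i])
            return 0;
    }
    return 1;
}

/* Intended constant-time compare: accumulate all differences, test once.
   The optimizer is still free to reintroduce data-dependent timing here,
   which is the subtle breakage described above. */
int intended_ct_compare(const uint8_t *a, const uint8_t *b, size_t n) {
    uint8_t diff = 0;
    size_t i;
    for (i = 0; i < n; i++)
        diff |= (uint8_t)(a[i] ^ b[i]);
    return diff == 0;
}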
> Register allocation is a prime example. This is an NP-hard problem. Plenty of heuristics exist for finding "generally good enough" solutions, but without exhausting the search space, it typically isn't possible to select an optimal solution, or even to determine whether a given solution is optimal.
That's true only if you're not using SSA-based dataflow analysis (an idea so amazingly simple and powerful that I've often wondered why it didn't become popular sooner); otherwise, the interference graph is chordal and optimal register allocation essentially becomes a very simple linear-time algorithm.
The graph coloring problem for chordal graphs is tractable, true.
But register allocation involves a lot more than just graph coloring. For example, where and how you insert spill code makes a huge difference.
Also, for SSA, you can split some variables to have smaller live ranges which are more likely to get registers. The moment you do that, the theoretical advantages of having a chordal graph go out of the window.
> Also, for SSA, you can split some variables to have smaller live ranges which are more likely to get registers. The moment you do that, the theoretical advantages of having a chordal graph go out of the window.
IMO, the most important insights of the work on SSA-based register allocation go beyond the idea that chordal interference graphs are easily colorable. In particular,
1) The idea that register allocation can be decoupled into separate allocation and assignment phases. Without this, you either have to view register allocation as an integrated problem, which makes it difficult to apply heuristics, or you have to have miniature register suballocators inside of different phases of your register allocator. The asymptotic horizon of the last possibility is GCC's reload.
Of course, the further your architecture deviates from the theoretical model of SSA register allocation, the less you can see the benefit. In particular, estimating liveness with excessive use of subregisters is difficult, and splitting live ranges to satisfy register constraints can push the majority of the difficult problems back into copy elimination.
2) The recognition of the role of parallel copy semantics of phi pseudofunctions. Even if you were writing a conventional non-SSA register allocator, it would be wise to eliminate SSA form by introducing parallel copies and preserving those as late as possible to avoid spurious interferences.
Both of these ideas are actually present in Lueh's work on fusion-based allocation (which decouples allocation by simply repeatedly checking whether graphs are still greedily k-colorable, rather than relying on chordality), but it is easier to see that in hindsight.
There is still the issue of compiling phi functions effectively, though. The result in the end is less efficient than approximate graph coloring, if I remember the work you cite correctly.
From "Copy coalescing by graph recoloring" PLDI'08, which compared several SSA-based copy coalescing techniques (including an expensive, but optimal, ILP-based formulation) to
iterated register coalescing (TOPLAS 1996):
> We ran iterated register coalescing [11] (the state of the art in safe coalescing) on these graphs and compared the remaining costs and unsatisfied affinities: Iterated coalescing left 1.28 times more costs non-eliminated and 1.79 times more affinities unsatisfied than the algorithm presented in this paper. To put it another way, from the cost iterated coalescing left, we were able to optimize 22.5% away (and 44.3% of the affinities). The optimum, as determined by the ILP, is given by 35.9% of the costs (and 51.9% of the affinities) iterated coalescing left over.
My understanding is that they are included. The paper talks about the problem it is addressing in Section 2.
The following text is taken from that section and makes me think that extra moves/exchanges introduced by out-of-SSA are included.
> Live-range splitting to handle register constraints and φ-functions make use of parallel copy instructions. These parallel copy instructions have of course to be implemented using real processor instructions (in the case of φ-functions this is called SSA destruction [5]). In contrast to traditional approaches, these parallel copies are implemented after register allocation and not before.
> [...]
> the goal of coalescing is to minimize the number of instructions to implement parallel copies by trying to give corresponding operands and results the same color.
The problem is that actual computers do not have parallel copy instructions. Did they count the cost of simulating them through moves? (I knew the people doing this research, but it was almost 10 years ago so I am a bit rusty).
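To be concrete about what "simulating" a parallel copy means, here is a toy sketch (mine, not from the paper): a phi-induced parallel copy like (x, y) := (y, x) forms a cycle, so it cannot be lowered as two sequential moves; you need a temporary or an exchange, and that is exactly where extra instructions can appear.

#include <stdio.h>

int main(void) {
    int x = 1, y = 2;

    /* Naive sequentialization of the parallel copy (x, y) := (y, x):
       the first move clobbers a value the second move still needs. */
    int wx = x, wy = y;
    wx = wy;            /* x := y */
    wy = wx;            /* y := x, but x was already overwritten */

    /* Correct lowering: break the cycle with a temporary; a register
       allocator would use a spare register or an exchange instruction. */
    int tmp = x;
    x = y;
    y = tmp;

    printf("naive: (%d, %d)  correct: (%d, %d)\n", wx, wy, x, y);
    return 0;
}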
I had a question a while back about potential spill strategies that may be appropriate to ask here, given that a bunch of compiler people are probably crawling this thread.
Do compilers ever do "computed spilling" (for lack of a better phrase)?
For example, if you have code such as:
int x = f();
int y = x & MASK;
use(y);
// ... bunch of code that uses x, but causes y to spill ...
use(y);
If we're spilling 'x' and 'y', the compiler could theoretically notice that 'y' is purely derived from 'x', and turn the "spill" of 'y' into a no-op (assuming 'x' is kept live), and the "load" of 'y' into re-computing y from x (in this case using a BITAND).
Mostly academic curiosity, but is this technique used by any major compiler?
> (assuming 'x' is kept live), and the "load" of 'y' into re-computing y from x
If I understand correctly, what you describe is called rematerialization (https://en.wikipedia.org/wiki/Rematerialization), and yes, it's standard. As with everything else in register allocation, it's difficult to decide when and how to do it.
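A hand-waved sketch of the difference, written as C for readability (real compilers do this on their IR, not on source code, and the function names here are made up):

/* Illustrative only. */
#define MASK 0xFF

extern int f(void);
extern void use(int);
extern void lots_of_code_using(int);

void spill_version(void) {
    int x = f();
    int y = x & MASK;
    use(y);
    int spill_slot = y;        /* "spill": store y to a stack slot */
    lots_of_code_using(x);     /* y's register is reused in here */
    use(spill_slot);           /* "reload": load y back from the stack */
}

void remat_version(void) {
    int x = f();
    use(x & MASK);             /* y computed once... */
    lots_of_code_using(x);     /* ...no stack slot needed for it... */
    use(x & MASK);             /* ...rematerialized from x at this use */
}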
> A lot of these suboptimal examples come down to the complexity of the optimization problem.
That and the rest of what you wrote is true in a general sense, for many examples of suboptimal code generation. I don't think it's true in the specific case of (most of) these suboptimal examples listed in the featured article. Almost all of these seem to be due to programmer oversight in formulating some rules or pattern matches, not due to heuristics for combinatorial optimization making unfortunate choices.
For example, of the eight issues listed for CompCert, only the two concerning spilling and coalescing seem to be due to imperfect heuristics. A bunch of the others are simply due to pattern matches that nobody had thought (or bothered) to add, as you can see from the linked patches. No combinatorial optimization, heuristic or otherwise, going on there.
Other missing pattern matches are Clang's vnmla instruction selection issue (again, there is a patch that simply adds a pattern somewhere) or the fact that it seems to lack some rules for computing the possible value range for expressions of the form "x % c" with c constant (or propagating such ranges to divisions by constants).
Even for some of the examples involving registers and spilling, I don't think it's unfortunate heuristic solvers. For GCC's "float to char conversion through memory" issue, a developer stated in the bug report that the cost model is wrong, implying that fixing the costs should make the solver find the "right" code. And the GCC "spill instead of register copy" example never even touches the r2 register anywhere in the function, so it would be available for allocation by even the most primitive heuristic solver. There, too, I would suspect a bug in the model rather than an unlucky solver.
So I think my point is, it's true that NP-hard problems are a curse in many subproblems of optimizing compilation, but there are many, many other possible reasons for missed optimizations.
That one is now optimized in GCC and Clang: (isnan(x) || isnan(y)) becomes a single instruction on x86.
Some of these are very tricky, but the isnan pairing one is simple, and quite useful for some math routines whose first order of business is finding NaNs.
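For reference, the pattern I mean looks roughly like this (a minimal sketch):

#include <math.h>
#include <stdbool.h>

/* Recent GCC and Clang can lower this to a single unordered floating-point
   compare of x against y (plus a flag test), instead of classifying each
   value separately. */
bool either_nan(double x, double y) {
    return isnan(x) || isnan(y);
}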
Among other things, I've seen lots of inefficiencies with bitfields. I tend to use them a lot in hardware register access and for packing data in embedded development. Imagine a struct of bitfields that fits into one word, with each bitfield being set to a constant. A good compiler should be able to set all the values with one load operation. Some compilers would break this into many separate loads. I think the ARM compilers were worse at this, while Clang would optimize it much better. Many times I had to forgo bitfields and use macro definitions and masking to get the best code generation, at the cost of readability.
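Something like this, to make it concrete (made-up register layout; bitfield packing order is implementation-defined anyway, which is yet another headache in this context):

#include <stdint.h>

/* Made-up one-word control-register layout. */
typedef union {
    struct {
        uint32_t enable    : 1;
        uint32_t mode      : 3;
        uint32_t prescaler : 8;
        uint32_t reserved  : 20;
    } bits;
    uint32_t word;
} ctrl_reg_t;

void init_bitfields(ctrl_reg_t *reg) {
    /* Ideally a single 32-bit store of a precomputed constant; weaker
       compilers emit a read-modify-write sequence per field. */
    reg->bits.enable    = 1;
    reg->bits.mode      = 5;
    reg->bits.prescaler = 42;
    reg->bits.reserved  = 0;
}

/* The macro-and-mask fallback, which reliably produces one store at the
   cost of readability. */
#define CTRL_ENABLE        (1u << 0)
#define CTRL_MODE(x)       ((uint32_t)(x) << 1)
#define CTRL_PRESCALER(x)  ((uint32_t)(x) << 4)

void init_masks(uint32_t *reg) {
    *reg = CTRL_ENABLE | CTRL_MODE(5) | CTRL_PRESCALER(42);
}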
> A good compiler should be able to set all the values with one load operation.
Hm? I believe that might actually be entirely wrong for many pieces of actual hardware that I've worked with. Setting a single bit may change the interpretation of a subsequently set bit, so you can't really "batch" the updates.
(I'm happy for you if this isn't an issue you've faced. We did. Feel sorry for us.)
That's what the keyword volatile is for. If something is not marked volatile, the compiler is allowed to reorder and collapse any memory accesses it deems unnecessary.
Indeed, but I don't think that was what the poster I was responding to was talking about?
I mean, he/she was talking about bitfields and similar things and how writes to those could be coalesced? In general, they cannot. (I dunno, it seemed appropriate to mention the caveats.)
This is totally right and shouldn't be downvoted, but the reply is also right that volatile should prevent optimization.
On some hardware, "writing" to "memory" is really sending a message to a device, and writing 0x80 followed by 0x08 is not the same thing as writing 0x88.
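A contrived sketch of that kind of device protocol (made-up address and commands):

#include <stdint.h>

/* Made-up memory-mapped command register for some peripheral. */
#define CMD_REG (*(volatile uint8_t *)0x40001000u)

void send_command(void) {
    /* The device treats each write as a separate message: say 0x80 starts
       a transaction and 0x08 selects a mode. A single write of 0x88 would
       be a different command entirely, so these stores must not be merged. */
    CMD_REG = 0x80;
    CMD_REG = 0x08;
    /* The volatile qualifier is what obliges the compiler to keep both
       stores, in this order. */
}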
Interesting... My AMD64 Programmer's Manual says, about "Rotate Left": "The processor masks the upper three bits of the count operand, thus restricting the count to a number between 0 and 31. When the destination is 64 bits wide, it masks the upper two bits of the count, providing a count in the range of 0 to 63."
Are there x86 platforms that don't do that? (I don't know if there's a more official document than the AMD64 manual. Technically Intel's stuff is made to be compatible with a standard that AMD first created.) If not, yeah, that's a nice, clear case of a missed optimization.
With a fixed n, GCC does get it right as well (kinda has to, because that's the only portable way to do fixed rotations in C). I'd sort it under range analysis.
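For anyone curious, the usual portable rotation idiom looks like this (a sketch; the masking keeps every shift count in range so there is no undefined behavior even for n == 0):

#include <stdint.h>

/* Portable rotate-left; GCC and Clang recognize this pattern and typically
   emit a single rotate instruction on x86. */
uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31;                                    /* count in 0..31 */
    return (x << n) | (x >> ((32 - n) & 31));   /* never shifts by 32 */
}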
I haven't worked on ARM much, but I am surprised how often I find sub-optimal code like this coming from production compilers. Here's one that I found several years ago in GCC/x86-64. Took a few years to fix:
As someone who has read a ton of compiler output over the years (mostly x86, and actually more embedded stuff like 8051 and PIC than ARM), I'm not surprised. Useless/extra moves, both between registers and to/from memory, are the most common "pessimisation" I've seen.
I believe it comes from the way compilers are traditionally designed --- as an extremely stupid code generation pass followed by multiple optimisation passes, which naturally will fail to remove 100% of the "stupidity". In other words, they're making a mess and trying to clean it up instead of avoiding a mess in the first place.
We could follow the latter idea, and create "naturally smarter/optimising" compilers --- ones which analyse the data and control flow of the source, and generate the minimum instructions necessary to implement it. This entails working "backwards", starting from the results/outputs and moving towards the inputs. I believe the whole category of useless data movement can be solved with such an algorithm, since it's "goal-seeking".
To use your bug as an example of how this could work: The compiler would first determine that func() may be called, and then realise that it finishes with calling bar(). It's a tail call, so we may jump instead of call and ret. (GCC was smart enough to figure that one out.) The two arguments come from a return of foo(), and that is (unfortunately) fixed by the calling convention to be rax/rdx. The inputs to bar() must be in rsi and rdi, so two moves are necessary. (If the input locations were the same as the output, then it wouldn't generate any moves.) We thus arrive at these 4 instructions --- and not one of them is unnecessary:
call foo
mov rsi, rdx
mov rdi, rax
jmp bar
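(For concreteness, the source I have in mind is roughly of this shape -- my reconstruction, not the actual code from the bug report:)

/* Reconstruction, not the actual test case: under the SysV x86-64 ABI,
   foo's two-member struct comes back in rax:rdx, and bar takes its
   arguments in rdi and rsi. */
struct pair { long a, b; };

struct pair foo(void);
void bar(long a, long b);

void func(void) {
    struct pair p = foo();
    bar(p.a, p.b);          /* tail call: two moves, then jmp bar */
}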
I'm not a huge compiler academic so I don't know if anyone has tried (and failed?) at making a compiler behave like this before. SSA sounds similar, but doesn't have that crucial "work backwards from the solution" idea.
I'd say that the fact that compilers are designed with a "stupid" IR-generation pass (e.g. the C++ AST to LLVM IR pass in clang) is part of a broader design strategy, namely, designing compilers to have many simple components that work together to do a complex thing.
There are always trade-offs, as you point out, but one reason it's beneficial to design compilers this way is that we generally apply a very high quality bar to compilers, because compiler bugs are expensive.
Having simple components with strict interfaces allows us to reason precisely about what each of our transformations is supposed to do. That makes it easier to write unit tests and to do code reviews, and ultimately, in my experience, helps us put correctness first.
I also don't think that this approach of dumb components necessarily leads to worse code. Indeed, for each of the LLVM missed optimizations in the list, I have a pretty good idea of which pass ought to be responsible for fixing the issue. And because the thing that each of these passes does is simple, I can have some confidence that my change won't break other passes.
To add onto that, my understanding is that just "knowing" which sequence of passes is required to perform a desired transformation is isomorphic to the halting problem. We might do better if we were more intelligent about our pass managers, but last I checked, LLVM uses simple ones.
Program dependency graphs 'work backwards' in that you have a node that performs a computation, and then you have a graph of edges backwards to information and effects in other nodes it needs in order to run. The 'first node' in a program dependency graph is the result that you want to achieve.
This doesn't really solve the problem that you are describing, though. I think that's more to do with the register allocator - that's where the moves come from.
Something interesting to note is that I think register to register moves are extremely cheap, because the processor is renaming all registers anyway. I think except for taking up instruction cache and a little work for decode and dispatch, these moves don't actually consume any execution slots so they're not a problem.
We are also finding that simpler register allocation algorithms, such as linear scan, don't really seem much worse in practice than very complicated algorithms that try to eliminate these moves. So that's more evidence that maybe they're not the problem you think they are.
When you read assembly it's frustrating to see redundant moves, but I'm not sure they actually cause as much slowdown as you think.
I think the problem with that idea is that there's no guarantee that the input to the compiler is going to be ideal. It might be that the programmer writes something in a suboptimal or naive way (possibly expecting the compiler to take care of it) or the source was mechanically generated or expanded via a preprocessor. Working strictly backwards from the source seems like it wouldn't be enough.
So a production optimizing compiler is going to need all those cleanup passes one way or another. Why not assume they're there and lean on them rather than duplicating that work?
> I believe it comes from the way compilers are traditionally designed --- as an extremely stupid code generation pass followed by multiple optimisation passes, which naturally will fail to remove 100% of the "stupidity". In other words, they're making a mess and trying to clean it up instead of avoiding a mess in the first place.
I think that this is, at best, a misleading view of how most modern compilers work.
> We could follow the latter idea, and create "naturally smarter/optimising" compilers --- ones which analyse the data and control flow of the source, and generate the minimum instructions necessary to implement it.
Many modern compilers (e.g. pretty much all that are based on LLVM) do this: the only subtlety is that analysing data/control flow on the program's AST isn't the best approach, as it doesn't compose, and is designed to be a representation of the human's view of the program, not one good for computers' understanding. So, first, the compiler converts the code into some form of intermediate representation (typically SSA or similar) and then all the control flow is "obvious" and the analysis is easy. Some compilers just use LLVM IR for this purpose, but most will layer on top another (or multiple, like Haskell's GHC) IR that contains more semantic information relevant to the language being compiled.
This IR can then be transformed into estimated-to-run-faster versions of the IR (that hopefully have the same behaviour), until finally instructions are chosen, registers allocated and machine code emitted. It is also easier to use an IR rather than AST to do optimisations like inlining, which is an absolutely critical optimisation for performance (it enables a whole pile of other optimisations to be more effective), something that purely minimising the instruction count of a function won't ever do.
In fact, this brings me to another important point: minimising instruction count within individual functions can result in slow code, since such a compiler wouldn't unroll or vectorise loops and would aggressively outline common code sections (even if they're 'hot'). Pure instruction count doesn't mean that much on modern computers; sure, it correlates (negatively) with performance, but weakly.
As a sibling points out, it doesn't seem to make sense to put a pile of effort into getting a "perfect" AST-to-IR converter, which would just be duplicating (in a framework worse for them) a lot of the optimisation and analysis passes that are desirable for the IR anyway (due to inlining opening up new optimisation opportunities there), meaning one would have two constant-propagators, two common-subexpression-eliminators, etc., rather than just leaning on, and beefing up, a single one.
How is that undefined behavior? The C spec says: (6.3.1.4)
> When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined.
char is an integer type and float is a real floating type, so the function should be well-defined for at least some input values, although the exact range (including whether or not it includes negatives, i.e. whether plain ‘char’ is signed) is implementation-defined.
That would be undefined as per the last sentence in the quote above, "If the value of the integral part cannot be represented by the integer type, the behavior is undefined." Same goes if p1 = 256.0 and your char type can store 0..255 only, although it would probably be easier to make educated guesses about the result in this case.
Note: the optimization “Missed simplification of multiplication by integer-valued floating-point constant”, that the article points out that Clang does, relies on this kind of overflow to be undefined and causes yet more unpredictable results in some contexts. If you expect the “a *= 10.0;” statement to be translated to a float-to-int conversion instruction, you may expect the behavior of that instruction to apply on overflow, but it won't because the instruction won't have been generated at all.
That case is undefined but the cast itself isn't, just certain executions of it. If the code never calls the function with bad values, there's no UB, just like how pretty much all cases of UB in C are only undefined for certain values.
Hi everyone. I'm the author of the featured post; thanks for the interest.
> If the code never calls the function with bad values, there's no UB
Exactly. Even if a function might have undefined behavior for some input values, compilers are still required to compile the code correctly for the other input values (if any). And I would still expect them to optimize for those other input values.
Anyway, would everyone be happily convinced if I updated the example as follows?
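Roughly this shape (a sketch; the exact bounds could be debated, but the point is that no input reaches the undefined conversion):

/* Guarded variant: the float-to-char conversion is only reached for values
   whose truncated integral part fits in unsigned char, so no input triggers
   undefined behavior (NaN also falls into the early return). */
unsigned char fn(float p1) {
    if (!(p1 >= 0.0f && p1 < 256.0f))
        return 0;
    return p1;
}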
I think that covers all that could possibly go wrong. And (unsurprisingly, to me at least) it still shows the same behavior with GCC of storing to the stack and reloading instead of zero-extending register-to-register.
No it is not. Casting a pointer to a different type can violate the strict aliasing rule, but this is just a regular float to integer conversion. Besides, the strict aliasing rule has an exception for char pointers. You can access any type through a char pointer, but the value is implementation defined.
While there may be some compiler misses, I started doubting the whole thing when I hit:
"Missed simplification of multiplication by integer-valued floating-point constant
Variant of the above code with the constant changed slightly:
int N;
int fn5(int p1, int p2) {
int a = p2;
if (N)
a *= 10.0;
return a;
}
GCC converts a to double and back as above, but the result must be the same as simply multiplying by the integer 10. Clang realizes this and generates an integer multiply, removing all floating-point operations."
A double or float literal multiply followed by an integer conversion is nowhere near the same as an integer literal multiply. If the coder wanted = 10 (or even = 10.0f), that was available. If = 10.0 was written, it should generally be compiled that way unless --superfast-wreck-floating-point was turned on...
This optimisation is actually fully correct if int is 32 bits wide. This is because doubles have a 52-bit mantissa, which means that they can exactly represent all integers up to 2^53 in magnitude. However, you are right that the optimisation would not be valid had int been 64 bits wide.
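A quick sanity check of the arithmetic (just a sketch, assuming a 32-bit int):

#include <assert.h>
#include <limits.h>

int main(void) {
    /* Worst case for 32-bit int: |INT_MIN| * 10 is about 2.1e10, far below
       2^53 (about 9.0e15), so the product is exact in double and the round
       trip through double cannot change the value. */
    long long worst = (long long)INT_MIN * 10;              /* -21474836480 */
    assert(-(double)worst < (double)(1LL << 53));
    assert((long long)((double)INT_MIN * 10.0) == worst);
    return 0;
}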
Since it's a *=, I'd expect the operation to occur in double, quantize accordingly, and handle overflow as a double-to-int conversion would, rather than as int multiplication would.
For 10.0 specifically, double is more than large enough to, as far as I know, result in the same output. For arbitrary double literals, I wouldn't expect this to be the case.
Note: overflow cast from double to int is undefined, last I looked, so one could argue that the numerical regions for which quantization would occur are outside of the practical range of this chunk of code.
It is still valid after constant lifting, or if the range of the long int in question can be proven to be representable in 53 bits or less. (E.g. using the polyhedral loop optimizer GCC has.)
I'd be more curious what Clang did for values that would result in different behaviors for some ranges of a. I understand that there are limits, and that there is range checking in literals for escalation to, for instance, 64-bit values, but it seems like an optimization that is:
A) fraught with edge-case peril
and
B) easily accounted for by the developer (e.g. write "a *= 10;")...
> I'd be more curious what Clang did for values that would result in different behaviors for some ranges of a.
I don't think I understand what you mean. Can you give an example?
As for "the developer should write what they mean", one standard answer from compiler developers is along the lines of "but this code might not be hand-written, it might come from a primitive code generator". You can decide whether that convinces you or not ;-)
Yeah. I'm partially unconvinced. That is to say that, if the compiler Dev wants to take on the completeness responsibility, cool. If not, I'm okay with that, too. Adding the optimization and screwing up the edge cases is the only way to end up on my bad side.
An example number might be 8388608.0 (or 2^23). Hell could have broken out more than four million integer values before that, but it's as nice a test as any (except maybe the exact edge).
There could also be interesting cases of compound arithmetic literal statements. And then, of course, there are other side-effect considerations (e.g. the x87 opcode register).
I'll stick to getting my literal types right in the first place, but, as stated above, it doesn't feel like a real miss to me. Either a dev or a code generator wrote ".0". I'm inclined to respect that to some degree.
Probably because I'm old.
I'll compile this in gcc and clang and see what shakes out. My bet is that clang does the right thing once the value is out of safe range.