
Missed optimizations in C compilers - ingve
https://github.com/gergo-/missed-optimizations
======
nanolith
These results don't surprise me.

A lot of these suboptimal examples come down to the complexity of the
optimization problem. Compilers tend to use heuristics to come up with
"generally good enough" solutions in the optimizers instead of using a longer
and computationally more expensive foray into the solution space. Register
allocation is a prime example. This is an NP-hard problem. Plenty of
heuristics exist for finding "generally good enough" solutions, but without
exhausting the search space, it typically isn't possible to select an optimal
solution, or even to determine whether a given solution is optimal.
Couple this with the tight execution times demanded for compilers, and issues
like these become pretty common.

Even missed strength reduction opportunities, such as eliminating unneeded
spills or initialization, can come down to poor heuristics. It's possible to
write better optimizer code, but this can come at the cost of execution time
for the compiler. Hence, faster is often chosen over better.

In first-tier platforms like ix86 and x86_64, enough examples and eyes have
tweaked many of the heuristics so that "generally good enough" covers a pretty
wide area. As someone who writes plenty of firmware, I can tell you that it's
still pretty common to have to hand-optimize machine code in tight areas in
order to get the best trade-off between size, speed, and specific timing
requirements. A good firmware engineer knows when to trust the compiler and
when not to. Some of this comes down to profiling, and some comes down to
constraints and experience.

Then, there are areas in which compilers rarely produce better code
than humans. Crypto is one example. Crypto code written in languages like C
can break in subtle ways, from opening timing oracles and other side-channel
attacks to sometimes getting the wrong result when assumptions made by the
developer and the optimizer are at odds. In these cases, hand-written
assembler -- even in first tier platforms -- tends to be both faster and
safer, if the developer knows what he/she is doing.
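One concrete instance of the timing-oracle problem is comparing secrets with an early-exit memcmp. Below is a minimal sketch of the usual branch-free alternative (the helper name `ct_memcmp` is invented here), with the caveat that an optimizer is still free to rewrite it, which is part of why crypto implementers drop to assembler:

```c
#include <stddef.h>
#include <stdint.h>

/* Constant-time-style comparison: accumulate differences with OR instead
 * of returning at the first mismatch, so the loop's running time does not
 * depend on where the buffers differ. Returns 0 if equal, 1 otherwise. */
static int ct_memcmp(const uint8_t *a, const uint8_t *b, size_t len)
{
    uint8_t diff = 0;
    for (size_t i = 0; i < len; i++)
        diff |= a[i] ^ b[i];   /* no data-dependent branch in the loop */
    return diff != 0;
}
```

Note that nothing in the C standard guarantees the compiled code is constant-time; that guarantee can only come from inspecting (or writing) the assembly.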

~~~
userbinator
_Register allocation is a prime example. This is an NP-hard problem. Plenty of
heuristics exist for finding "generally good enough" solutions, but without
exhausting the search space, it typically isn't possible to select an optimal
solution, or even to determine whether a given solution is optimal._

That's true only if you're not using SSA-based dataflow analysis (an idea so
amazingly simple and powerful that I've often wondered why it didn't become
popular sooner); otherwise, the interference graph is chordal and optimal
register allocation essentially becomes a very simple linear-time algorithm.

[http://compilers.cs.uni-saarland.de/projects/ssara/](http://compilers.cs.uni-saarland.de/projects/ssara/)

[http://www.cs.ucla.edu/~palsberg/paper/aplas05.pdf](http://www.cs.ucla.edu/~palsberg/paper/aplas05.pdf)
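For intuition about why chordality helps: interference graphs of live intervals over straight-line code are interval graphs (a subclass of chordal graphs), and scanning intervals in order of start point follows a perfect elimination order, so greedy assignment uses the minimum number of colors. A hypothetical sketch (quadratic for brevity; real allocators keep an active list):

```c
/* Greedy coloring of live intervals sorted by start point. For interval
 * (and more generally chordal) interference graphs, coloring along a
 * perfect elimination order is optimal. The sketch assumes NREGS is
 * enough; a real allocator would spill when no register is free. */
#define NREGS 4

typedef struct { int start, end, reg; } Interval;

static void color_intervals(Interval *iv, int n)
{
    for (int i = 0; i < n; i++) {
        int used[NREGS] = {0};
        for (int j = 0; j < i; j++)          /* intervals started earlier */
            if (iv[j].end > iv[i].start)     /* ...and still live here */
                used[iv[j].reg] = 1;
        iv[i].reg = -1;                      /* -1 = would need a spill */
        for (int r = 0; r < NREGS; r++)
            if (!used[r]) { iv[i].reg = r; break; }
    }
}
```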

~~~
fooker
The graph coloring problem for chordal graphs is tractable, true.

But register allocation involves a lot more than just graph coloring. For
example where and how you insert spill code makes a huge difference.

Also, for SSA, you can split some variables to have smaller live ranges which
are more likely to get registers. The moment you do that, the theoretical
advantages of having a chordal graph go out of the window.

~~~
cwzwarich
> Also, for SSA, you can split some variables to have smaller live ranges
> which are more likely to get registers. The moment you do that, the
> theoretical advantages of having a chordal graph go out of the window.

IMO, the most important insights of the work on SSA-based register allocation
go beyond the idea that chordal interference graphs are easily colorable. In
particular,

1) The idea that register allocation can be decoupled into separate allocation
and assignment phases. Without this, you either have to view register
allocation as an integrated problem, which makes it difficult to apply
heuristics, or you have to have miniature register suballocators inside of
different phases of your register allocator. The asymptotic horizon of the
last possibility is GCC's reload.

Of course, the further your architecture deviates from the theoretical model
of SSA register allocation, the less you can see the benefit. In particular,
estimating liveness with excessive use of subregisters is difficult, and
splitting live ranges to satisfy register constraints can push the majority of
the difficult problems back into copy elimination.

2) The recognition of the role of parallel copy semantics of phi
pseudofunctions. Even if you were writing a conventional non-SSA register
allocator, it would be wise to eliminate SSA form by introducing parallel
copies and preserving those as late as possible to avoid spurious
interferences.
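The parallel-copy point is worth making concrete: the semantics of the copies introduced for a phi are that every source is read before any destination is written. A hypothetical sketch of those semantics (lowering them to sequential moves, including breaking swap cycles with a temporary or an xchg, is exactly the part a naive allocator gets wrong):

```c
/* Parallel copy over a small register file: snapshot all sources, then
 * write all destinations. A naive sequential lowering of the swap
 * {r0 <- r1, r1 <- r0} would clobber r0 before reading it. */
#define NREGS 4

static void parallel_copy(int *regs, const int *dst, const int *src, int n)
{
    int snapshot[NREGS];
    for (int i = 0; i < NREGS; i++)
        snapshot[i] = regs[i];               /* read phase */
    for (int i = 0; i < n; i++)
        regs[dst[i]] = snapshot[src[i]];     /* write phase */
}
```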

Both of these ideas are actually present in Lueh's work on fusion-based
allocation (which decouples allocation by simply repeatedly checking whether
graphs are still greedily k-colorable, rather than relying on chordality), but
it is easier to see that in hindsight.

------
jzwinck
Most of the top questions I've asked on Stack Overflow are related to missed
optimizations in C (or C++, but I'll skip those here):

\- [https://stackoverflow.com/questions/45052282/why-is-memcmpa-...](https://stackoverflow.com/questions/45052282/why-is-memcmpa-b-4-only-sometimes-optimized-to-a-uint32-comparison)

\- [https://stackoverflow.com/questions/23055704/why-cant-gcc-op...](https://stackoverflow.com/questions/23055704/why-cant-gcc-optimize-the-logical-bitwise-and-pair-in-x-x-4242-to-x)

\- [https://stackoverflow.com/questions/26052640/why-does-gcc-im...](https://stackoverflow.com/questions/26052640/why-does-gcc-implement-isnan-more-efficiently-for-c-cmath-than-c-math-h)

\- [https://stackoverflow.com/questions/18951520/gcc-optimizatio...](https://stackoverflow.com/questions/18951520/gcc-optimization-missed-opportunity)

But my favorite is this one:

\- [https://stackoverflow.com/questions/26053934/is-it-feasible-...](https://stackoverflow.com/questions/26053934/is-it-feasible-for-gcc-to-optimize-isnanx-isnany-into-isunorderedx-y)

That one is now optimized in GCC and Clang: (isnan(x) || isnan(y)) becomes a
single instruction on x86.

Some of these are very tricky, but the isnan pairing one is simple, and quite
useful for some math routines whose first order of business is finding NaNs.
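For reference, the two forms involved (isunordered is the C99 classification macro; on x86 it maps onto the parity flag set by ucomisd when either operand is NaN):

```c
#include <math.h>
#include <stdbool.h>

/* As written in the question: two separate NaN checks. */
static bool either_nan_naive(double x, double y)
{
    return isnan(x) || isnan(y);
}

/* The fused form: a single unordered comparison, which current GCC and
 * Clang now also produce for the naive version above. */
static bool either_nan_fused(double x, double y)
{
    return isunordered(x, y);
}
```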

------
pkaye
Among other things, I've seen lots of inefficiencies with bitfields. I tend to
use them a lot for hardware register access and packing of data in embedded
development. Imagine a bitfield that fits into one word, with each field being
set to a constant. A good compiler should be able to set all the values with
one load operation. Some compilers would break this into many separate loads.
I think the ARM compilers were worse at this, while Clang would optimize it
much better. Many times I had to forgo bitfields and use macro definitions and
masking to get the best code generation, at the cost of readability.
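A hypothetical illustration of the trade-off described (the register layout is invented): the bitfield version is readable, while the mask version makes the single-constant store explicit.

```c
#include <stdint.h>

/* Invented control-register layout, once as a bitfield and once as
 * shifts and masks. Setting all three fields of the struct can, in
 * principle, be folded into one 32-bit store of a constant; some
 * embedded compilers emit a read-modify-write per field instead. */
struct ctrl {
    uint32_t enable : 1;   /* bit 0 */
    uint32_t mode   : 3;   /* bits 1..3 */
    uint32_t clkdiv : 8;   /* bits 4..11 */
    uint32_t        : 20;  /* padding */
};

/* The macro-and-mask fallback: the whole word as one constant. */
static uint32_t make_ctrl_masked(void)
{
    return (1u << 0) | (5u << 1) | (25u << 4);
}
```

(The exact bit placement of a bitfield is implementation-defined, which is one more reason embedded code often prefers explicit masks.)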

~~~
lomnakkus
> A good compiler should be able to set all the values with one load
> operation.

Hm? I believe that might actually be entirely wrong for many pieces of actual
hardware that I've worked with. Setting a single bit may change the
interpretation of a subsequently set bit, so you can't really "batch" the
updates.

(I'm happy for you if this isn't an issue you've faced. We did. Feel sorry for
us.)

~~~
morio
That's what the volatile keyword is for. If something is not marked volatile,
the compiler is allowed to reorder and collapse any memory accesses it deems
unnecessary.
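A minimal sketch of the distinction, using an ordinary variable as a stand-in for a real device address (the register and values are invented):

```c
#include <stdint.h>

static uint32_t fake_device_reg;   /* stand-in for a memory-mapped register */
#define REG (*(volatile uint32_t *)&fake_device_reg)

/* Because REG is accessed through a volatile-qualified lvalue, both
 * stores below must be emitted, in source order. Without volatile, the
 * compiler could legally collapse them into a single store of 0x3. */
static void start_device(void)
{
    REG = 0x1;   /* e.g. first write selects a mode */
    REG = 0x3;   /* second write must remain a separate store */
}
```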

~~~
lomnakkus
Indeed, but I don't think that's what the poster I was responding to was
talking about?

I mean, he/she was talking about bitfields and similar things and how writes
to those could be coalesced? In general, they cannot. (I dunno, it seemed
appropriate to mention the caveats.)

~~~
carlmr
Not coalescing writes on non-volatile bitfields probably isn't guaranteed by
the standard. If you depend on it, I hope you have a test catching that.

~~~
AstralStorm
Why use bit fields at all instead of straightforward bit math then? They're
not really convenient and have weird alignment rules already.

------
clarry
GCC can't optimize out the mask (which is required to avoid UB if n may be
equal or greater than 64):

    
    
        #include <stdint.h>
    
        uint64_t rol(uint64_t n, uint64_t val)
        {
            n &= 63;
            return (val << n) | (val >> (64 - n));
        }
    

Compiles to

    
    
        rol(unsigned long, unsigned long):
          mov rcx, rdi
          mov rax, rsi
          and ecx, 63
          rol rax, cl
          ret
    

Clang and ICC do get it right.
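Incidentally, the source above has its own latent problem: when n == 0, `val >> (64 - n)` shifts by 64, which is itself undefined. The `-n & 63` idiom below is defined for every n, and mainstream compilers recognize it as a rotate:

```c
#include <stdint.h>

/* Rotate left, well defined for all n: both shift counts are reduced
 * into 0..63 before shifting, so neither shift can reach the width of
 * the type. Current GCC, Clang, and ICC pattern-match this to a single
 * rol on x86-64. */
static uint64_t rol64(uint64_t val, uint64_t n)
{
    n &= 63;
    return (val << n) | (val >> (-n & 63));
}
```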

~~~
waterhouse
Interesting... My AMD64 Programmer's Manual says, about "Rotate Left": "The
processor masks the upper three bits of the count operand, thus restricting
the count to a number between 0 and 31. When the destination is 64 bits wide,
it masks the upper two bits of the count, providing a count in the range of 0
to 63."

Are there x86 platforms that don't do that? (I don't know if there's a more
official document than the AMD64 manual. Technically Intel's stuff is made to
be compatible with a standard that AMD first created.) If not, yeah, that's a
nice, clear case of a missed optimization.

~~~
userbinator
Intel's official reference says the count is masked too, with an interesting
note that the 8086 (and presumably '186) doesn't.

------
haberman
I haven't worked on ARM much, but I am surprised how often I find sub-optimal
code like this coming from production compilers. Here's one that I found
several years ago in GCC/x86-64. It took a few years to fix:

[https://gcc.gnu.org/bugzilla/show_bug.cgi?id=44194](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=44194)

~~~
userbinator
As someone who has read a _ton_ of compiler output over the years (mostly x86,
and actually more embedded stuff like 8051 and PIC than ARM), I'm not
surprised. Useless/extra moves, both between registers and to/from memory, are
the most common "pessimisation" I've seen.

I believe it comes from the way compilers are traditionally designed --- as an
_extremely_ stupid code generation pass followed by multiple optimisation
passes, which naturally will fail to remove 100% of the "stupidity". In other
words, they're making a mess and trying to clean it up instead of avoiding a
mess in the first place.

We could follow the latter idea, and create "naturally smarter/optimising"
compilers --- ones which analyse the data and control flow of the source, and
generate the minimum instructions necessary to implement it. This entails
working "backwards", starting from the results/outputs and moving towards the
inputs. I believe the whole category of useless data movement can be solved
with such an algorithm, since it's "goal-seeking".

To use your bug as an example of how this could work: The compiler would first
determine that func() may be called, and then realise that it finishes with
calling bar(). It's a tail call, so we may jump instead of call and ret. (GCC
was smart enough to figure that one out.) The two arguments come from a return
of foo(), and that is (unfortunately) fixed by the calling convention to be
rax/rdx. The inputs to bar() must be in rsi and rdi, so two moves are
necessary. (If the input locations were the same as the output, then it
wouldn't generate any moves.) We thus arrive at these 4 instructions --- and
not one of them is unnecessary:

    
    
        call foo
        mov rsi, rdx
        mov rdi, rax
        jmp bar
    

I'm not a huge compiler academic so I don't know if anyone has tried (and
failed?) at making a compiler behave like this before. SSA sounds similar, but
doesn't have that crucial "work backwards from the solution" idea.

~~~
jlebar
(I work on GPU compilers in LLVM.)

I'd say that the fact that compilers are designed with a "stupid" IR-
generation pass (e.g. the C++ AST to LLVM IR pass in clang) is part of a
broader design strategy, namely, designing compilers to have many simple
components that work together to do a complex thing.

There are always trade-offs, as you point out, but one of the reasons that
it's beneficial to design compilers this way stems from the fact that we
generally apply a very high quality bar to compilers, because bugs in the
compiler are expensive.

Having simple components with strict interfaces allows us to reason precisely
about what each of our transformations is supposed to do. That makes it easier
to write unit tests and to do code reviews, and ultimately, in my experience,
helps us put correctness first.

I also don't think that this approach of dumb components necessarily leads to
worse code. Indeed, for each of the LLVM missed optimizations in the list, I
have a pretty good idea of which pass ought to be responsible for fixing the
issue. And because the thing that each of these passes does is simple, I can
have some confidence that my change won't break other passes.

~~~
flamedoge
To add onto that, my understanding is that just "knowing" which sequence of
passes is required to perform a desired transformation is isomorphic to the
halting problem. We might do better if we were more intelligent about our pass
managers, but last I checked, LLVM uses simple ones.

------
Animats

        char fn1(float p1) {
          return (char) p1;
        }
    

That's undefined behavior. Don't complain about performance.

~~~
comex
How is that undefined behavior? The C spec says: (6.3.1.4)

> When a finite value of real floating type is converted to an integer type
> other than _Bool, the fractional part is discarded (i.e., the value is
> truncated toward zero). If the value of the integral part cannot be
> represented by the integer type, the behavior is undefined.

char is an integer type and float is a real floating type, so the function
should be well-defined for at least some input values, although the exact
range (including whether or not it includes negatives, i.e. whether plain
‘char’ is signed) is implementation-defined.
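A quick illustration of the defined part of that rule: truncation toward zero whenever the integral part fits in the target type.

```c
/* Defined by C 6.3.1.4: the fractional part is discarded (truncation
 * toward zero) when the integral part is representable in the target
 * type. Passing NAN or an out-of-range value like 1e10f here would be
 * the undefined case the quote describes. */
static int trunc_to_int(float f)
{
    return (int)f;
}
```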

~~~
flamedoge
Pedantic me thinks p1 = NaN

~~~
majewsky
That would be undefined as per the last sentence in the quote above, "If the
value of the integral part cannot be represented by the integer type, the
behavior is undefined." Same goes if p1 = 256.0 and your char type can store
0..255 only, although it would probably be easier to make educated guesses
about the result in this case.

~~~
pascal_cuoq
I would advise against making educated guesses about the result of overflow in
the conversion from floating-point to integer:

[http://blog.frama-c.com/index.php?post/2013/10/09/Overflow-f...](http://blog.frama-c.com/index.php?post/2013/10/09/Overflow-float-integer)

Note: the optimization “Missed simplification of multiplication by integer-
valued floating-point constant”, that the article points out that Clang does,
relies on this kind of overflow to be undefined and causes yet more
unpredictable results in some contexts. If you expect the “a *= 10.0;”
statement to be translated to a float-to-int conversion instruction, you may
expect the behavior of that instruction to apply on overflow, but it won't
because the instruction won't have been generated at all.

------
chaboud
While there may be some compiler misses, I started doubting the whole thing
when I hit:

"Missed simplification of multiplication by integer-valued floating-point
constant

Variant of the above code with the constant changed slightly:

    
        int N;
        int fn5(int p1, int p2) {
          int a = p2;
          if (N) a *= 10.0;
          return a;
        }
    

GCC converts a to double and back as above, but the result must be the same as
simply multiplying by the integer 10. Clang realizes this and generates an
integer multiply, removing all floating-point operations."

A double or float literal multiply followed by an integer conversion is
nowhere near the same as an integer literal multiply. If the coder wanted
*= 10 (or even *= 10.0f), that was available. If *= 10.0 was written, it
should generally be compiled that way unless --superfast-wreck-floating-point
was turned on...

~~~
chengsun
This optimisation is actually fully correct if int is 32 bits wide. This is
because doubles have a 52-bit mantissa, which means that they can exactly
represent all integers up to 2^53 in magnitude. However, you are right that
the optimisation would not be valid had int been 64 bits wide.
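Both halves of this can be checked directly (the helper names are invented): the int32 round trip through double is exact, and exactness fails right at 2^53.

```c
#include <stdint.h>

/* For 32-bit a, both a and a * 10 (when the product fits in int32) are
 * well under 2^53, so the double computation below is exact and equals
 * the integer product a * 10. */
static int32_t times10_via_double(int32_t a)
{
    return (int32_t)(a * 10.0);
}

/* Returns 1 if d and d + 1 are distinct doubles. This fails at 2^53,
 * where consecutive integers stop being representable. */
static int next_integer_distinct(double d)
{
    return d + 1.0 != d;
}
```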

~~~
AstralStorm
It is still valid after constant lifting, or if the range of the long int in
question can be proven to be representable in 53 bits or fewer (e.g. using the
polyhedral loop optimizer GCC has).

~~~
chaboud
Right, though some side effects (e.g. status registers) might be affected.
(One reason strict modes generally preclude floating-point optimization.)

