
GCC x86 Performance Hints (2012) - ZnZirconium
https://software.intel.com/content/www/us/en/develop/blogs/gcc-x86-performance-hints.html
======
Supersaiyan_IV
This article is way too outdated.

Also, who the heck uses -flto on GCC 4.7.1? LTO is experimental in GCC 4.7.1
and will make any application nigh-impossible to debug.

Here's something more relevant to modern times:
http://hubicka.blogspot.com/2019/05/gcc-9-link-time-and-inter-procedural.html

~~~
gnufx
Turning on LTO in Fedora hasn't been entirely painless despite SUSE's
pioneering.

------
glouwbug
-flto is great if your project is less than 15,000 lines. Otherwise, -Ofast in conjunction with -flto creates a single-threaded link-time step that takes longer than half a minute, and it only goes up from there.

~~~
vii
The gold linker and now lld can do LTO pretty fast across huge codebases: this
breaks down the techniques used
[https://archive.fosdem.org/2019/schedule/event/llvm_lld/](https://archive.fosdem.org/2019/schedule/event/llvm_lld/)

Curious that Intel seems to recommend dynamic linking to get architecture
specific libc function implementations. Dynamic linking has considerable
overhead.

~~~
btown
I’m curious why this is necessary nowadays. Why not just ship a multi-
architecture static libc that chooses an implementation at runtime? Branch
prediction should make that practically zero cost, and with cache sizes at all
levels I doubt binary size would have an impact. Are there other reasons?

~~~
nicoburns
> Are there other reasons?

I think some systems don't have a stable kernel interface. The system-provided
libc is that interface.

On Linux I believe that this is possible (and fairly common using the musl
libc), but glibc doesn't fully support static linking, and many people want to
use glibc for performance and compatibility reasons.

~~~
roblabla
> I think some systems don't have a stable kernel interface. The system-
> provided libc is that interface.

In fact, a lot of systems don't have a stable kernel interface; Linux is the
exception here. On Microsoft Windows, one must link against Kernel32.dll to
communicate with the kernel. On OpenBSD, the libc provides the stable syscall
interface (and in fact the kernel will refuse syscalls from outside the libc
as a security measure, see [0]). MacOS and Solaris are two other OSes where,
if I understand correctly, syscall ABI is not guaranteed and must go through a
common library. Go used to embed syscalls, and ran into a lot of problems
because of this.

[0]: https://marc.info/?l=openbsd-tech&m=157488907117170&w=2

~~~
voldacar
I get that openbsd is obsessed with security, but this still makes me a little
sad. C is very old, and it should be possible to create programs and
programming languages that have no idea what C is or what libc is. Forcing all
programs to communicate with the OS via libc seems wrong

~~~
roblabla
On one hand, from an idealistic point of view, I agree, and am a very, very
big believer of having actual, carefully designed ABIs for kernel (and inter-
process) communications.

On the other hand, the structures required to do syscalls via libc/kernel32
are generally simple enough that it's not a huge deal, and the difference
between doing a raw syscall and doing a function call is unlikely to actually
matter.

Fun fact/pet peeve: on android, if you want to communicate with most services,
you need to go through Binder. While the low-level Binder ABI is stable, the
services written on top of it aren't, and often change in backwards-
incompatible ways. This includes core services like SurfaceFlinger (necessary
to draw on the screen). This means that it's generally impossible to create
purely native software on android - you always have to call into Java to talk
to those services.

------
known
In my .bashrc I use

export CFLAGS='-O2 -march=native -mfpmath=sse -fopenmp -fomit-frame-pointer
-pipe -fno-unwind-tables -fno-asynchronous-unwind-tables'

~~~
kanox
Why would you put something like that in bashrc?!

------
vsskanth
I wonder if there is an equivalent for MSVC and Clang Compilers.

Would be nice if there was a good guide somewhere on flags to use (and their
trade-offs) for fast floating point performance for numerical intensive
programs.

------
ilaksh
It feels like they are actually trying to convince me to finally switch to
Clang or something.

I mean, I have actually still never used Clang, but even before this list, I
was aware that there were quite a few gcc defaults that were ridiculous.

But every gcc article is about how there are even more ways that gcc doesn't
actually work sensibly unless you know the trick.

I mean, why doesn't it say somewhere in the help next to the optimization
stuff that you might want to consider specifying the architecture?

~~~
simias
I'm not sure I understand your argument. The optimizations here are:

>"-Ofast" same as "-O3 -ffast-math"

-ffast-math makes the compiler violate IEEE 754 semantics; it would make a very bad default.

-O3 used to be potentially buggy and somewhat experimental, although I doubt that it's much of a practical problem these days. It still makes the code harder to debug though. Most build systems I'm aware of default to debug builds, so GCC is not really unique in that regard.

>"-flto" enable link time optimizations

Those optimizations are typically quite expensive and can backfire in some
scenarios. Link time optimizations are relatively novel, at least within the
timeframe of GCC. I'm sure they'll be enabled by default some day, but GCC has
to be conservative.

>"-mfpmath=sse" enables use of XMM registers in floating point instructions
(instead of stack in x87 mode)

So that means that you tell the compiler to assume that SSE is available on
the target. By default GCC outputs code that's compatible with the baseline,
which is perfectly reasonable IMO. Same reason why -march=native is also not
the default, it is assumed that you may want to ship your binaries to other
computers.

>"-funroll-loops" enables loop unrolling

From GCC's own docs:

"This option makes code larger, and may or may not make it run faster."

Loop unrolling is tricky: it gets rid of branches but also increases the size
of the code and therefore the pressure on the icache. In some scenarios the
rolled loop can actually run faster than the unrolled, linear version if it
saves on cache misses.

Given that there are tradeoffs involved, it's also reasonable to let the user
decide whether to enable this optimization.

I work with C a lot and IMO the only default in GCC that's truly bad is that
it doesn't have -Wall by default.

~~~
jcranmer
> So that means that you tell the compiler to assume that SSE is available on
> the target. By default GCC outputs code that's compatible with the baseline,
> which is perfectly reasonable IMO. Same reason why -march=native is also not
> the default, it is assumed that you may want to ship your binaries to other
> computers.

If your x86 computer was made after the US invaded Iraq, it supports SSE2
instructions.

~~~
cozzyd
It's also possible someone may prefer the 80-bit intermediate precision of the
x87 registers. I guess.

~~~
jcranmer
long double in either x87 or SSE mode will force the use of 80-bit fp values.

