Sure, you can mess up your performance by picking bad compiler options, but most of the time you're fine just enabling the default optimizations and letting the compiler do its thing. No need to understand the black magic behind it.
This is only really necessary if you want to squeeze the last bit of performance out of a piece of code. And honestly, how often does this occur in day-to-day coding unless you're writing a video or audio codec?
* -mtune/-march - a value of native optimizes for the machine you're building on, x86-64-v1/v2/v3/v4 target microarchitecture levels, or you can name a specific CPU (ARM has different naming conventions). Recommendation: use a level if you're distributing binaries, native if you build and run locally, unless you can get much more specific.
* -O2 / -O3 - turn on most optimizations for speed. Alternatively, -Os/-Oz optimize for smaller binaries (which are sometimes also faster, particularly on ARM).
* -flto=thin - get most of the benefits of LTO with minimal compile time overhead
* PGO - if you have a representative workload, you can use profile-guided optimization to replace compiler heuristics with real-world measurements. AutoFDO is the next evolution of this, making it easier to feed data from production environments back into compilation.
* math: -fno-math-errno and -fno-trapping-math are “safe” subsets of -ffast-math (i.e. they don’t alter numerical accuracy). -fno-signed-zeros can also be worth considering if it helps. A sketch of how these flags fit together follows this list.
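To make the list above concrete, here's a rough sketch of how these flags might be combined in a Clang invocation. The file name (main.c) and the workload input are hypothetical, and the PGO flag spellings are Clang's (GCC uses -fprofile-generate/-fprofile-use instead):

```
# Portable release build for distribution: target a microarchitecture level
# rather than the build machine, with ThinLTO and the "safe" math subset.
# (Linking ThinLTO objects may need lld: add -fuse-ld=lld on some systems.)
clang -O3 -march=x86-64-v3 -flto=thin \
      -fno-math-errno -fno-trapping-math \
      -o app main.c

# Local build: optimize for the machine you're compiling on.
clang -O3 -march=native -flto=thin -o app main.c

# Basic PGO workflow: instrument, run a representative workload, merge the
# raw profile, then rebuild using the collected data.
clang -O3 -fprofile-instr-generate -o app-instrumented main.c
./app-instrumented < representative-workload.txt
llvm-profdata merge -output=app.profdata default.profraw
clang -O3 -fprofile-instr-use=app.profdata -o app main.c
```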
Agreed. I like to compile most translation units with -O3 and then only compile the translation units that I want to debug with -O0. That way I can often end up with a binary that's reasonably fast but still debuggable for the parts that I care about.
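For example, a mixed build might look something like this sketch (the file names are made up; the point is just that object files compiled at different optimization levels link together as usual):

```
gcc -O3 -g -c parser.c codegen.c    # fast, but hard to step through
gcc -O0 -g -c scheduler.c           # the part I actually want to debug
gcc -o app parser.o codegen.o scheduler.o
```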
Yup, that’s what I’ve resorted to (in Rust I do it at the crate level instead of per translation unit). The only downside is forgetting about it and then wondering why stepping through isn’t working properly.
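In Cargo this can be done with per-package profile overrides in Cargo.toml. A minimal sketch, assuming a workspace crate hypothetically named scheduler is the one you want to keep debuggable:

```
# Optimize everything in dev builds by default...
[profile.dev]
opt-level = 3

# ...but keep the crate under investigation unoptimized and easy to step through.
[profile.dev.package.scheduler]
opt-level = 0
```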