Sure, you can mess up your performance by picking bad compiler options, but most of the time you're fine just enabling the default optimizations and letting the compiler do its thing. No need to understand the black magic behind it.
This is only really necessary if you want to squeeze the last bit of performance out of a piece of code. And honestly, how often does this occur in day-to-day coding unless you're writing a video or audio codec?
* -mtune/-march - a value of native optimizes for the machine you're building on, x86-64-v1/v2/v3/v4 target microarchitecture levels, or you can name a specific CPU (ARM has different naming conventions). Recommendation: use a level if you're distributing binaries, native if you build and run locally, unless you can get much more specific.
* -O2 / -O3 - turn on most optimizations for speed. Alternatively, -Os/-Oz optimize for smaller binaries (which are sometimes also faster, particularly on ARM).
* -flto=thin - get most of the benefits of LTO with minimal compile time overhead
* PGO - if you have a representative workload, you can use profile-guided optimization to replace compiler heuristics with real-world measurements. AutoFDO is the next evolution of this, making it easier to feed data from production environments back into compilation.
* math: -fno-math-errno and -fno-trapping-math are “safe” subsets of -ffast-math (i.e. they don’t alter numerical accuracy). -fno-signed-zeros can also be worth considering if it helps. A sketch of how these flags fit together follows this list.
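To make the list above concrete, here's a rough sketch of how these flags might be combined in a Clang invocation. The file name (main.c) and the workload input are hypothetical, and the PGO flag spellings are Clang's (GCC uses -fprofile-generate/-fprofile-use instead):

```
# Portable release build for distribution: target a microarchitecture level
# rather than the build machine, with ThinLTO and the "safe" math subset.
# (Linking ThinLTO objects may need lld: add -fuse-ld=lld on some systems.)
clang -O3 -march=x86-64-v3 -flto=thin \
      -fno-math-errno -fno-trapping-math \
      -o app main.c

# Local build: optimize for the machine you're compiling on.
clang -O3 -march=native -flto=thin -o app main.c

# Basic PGO workflow: instrument, run a representative workload, merge the
# raw profile, then rebuild using the collected data.
clang -O3 -fprofile-instr-generate -o app-instrumented main.c
./app-instrumented < representative-workload.txt
llvm-profdata merge -output=app.profdata default.profraw
clang -O3 -fprofile-instr-use=app.profdata -o app main.c
```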
Agreed. I like to compile most translation units with -O3 and then only compile the translation units that I want to debug with -O0. That way I can often end up with a binary that's reasonably fast but still debuggable for the parts that I care about.
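For example, a mixed build might look something like this sketch (the file names are made up; the point is just that object files compiled at different optimization levels link together as usual):

```
gcc -O3 -g -c parser.c codegen.c    # fast, but hard to step through
gcc -O0 -g -c scheduler.c           # the part I actually want to debug
gcc -o app parser.o codegen.o scheduler.o
```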
Yup, that’s what I’ve resorted to (in Rust I do it at the crate level instead of per translation unit). The only downside is forgetting about it and then wondering why stepping through isn’t working properly.
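In Cargo this can be done with per-package profile overrides in Cargo.toml. A minimal sketch, assuming a workspace crate hypothetically named scheduler is the one you want to keep debuggable:

```
# Optimize everything in dev builds by default...
[profile.dev]
opt-level = 3

# ...but keep the crate under investigation unoptimized and easy to step through.
[profile.dev.package.scheduler]
opt-level = 0
```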