-flto is great if your project is less than 15,000 lines. Otherwise, -Ofast in conjunction with -flto creates a single-threaded link-time step that takes longer than half a minute, and it only goes up from there.
Curious that Intel seems to recommend dynamic linking to get architecture-specific libc function implementations. Dynamic linking has considerable overhead.
I’m curious why this is necessary nowadays. Why not just ship a multi-architecture static libc that chooses an implementation at runtime? Branch prediction should make that practically zero cost, and with cache sizes at all levels I doubt binary size would have an impact. Are there other reasons?
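To illustrate what I mean, here's a minimal sketch of that kind of runtime dispatch, assuming x86 and the GCC/clang __builtin_cpu_supports builtin; the variant names are made up, and a real libc (glibc) does this more cleanly with IFUNC resolvers run at load time rather than a lazy branch:

    #include <stddef.h>
    #include <string.h>

    static void *memcpy_baseline(void *d, const void *s, size_t n) { return memcpy(d, s, n); }
    static void *memcpy_avx2(void *d, const void *s, size_t n)     { return memcpy(d, s, n); /* imagine a hand-tuned AVX2 body here */ }

    void *fast_memcpy(void *d, const void *s, size_t n)
    {
        /* Resolved on first call; afterwards the branch is perfectly predicted.
           Not thread-safe as written - a real implementation resolves at load time. */
        static void *(*impl)(void *, const void *, size_t);
        if (!impl) {
            __builtin_cpu_init();
            impl = __builtin_cpu_supports("avx2") ? memcpy_avx2 : memcpy_baseline;
        }
        return impl(d, s, n);
    }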
I think some systems don't have a stable kernel interface. The system-provided libc is that interface.
On Linux I believe that this is possible (and fairly common using the MUSL libc), but that glibc doesn't support static linking and many people want to use glibc for performance and compatibility reasons.
> I think some systems don't have a stable kernel interface. The system-provided libc is that interface.
In fact, a lot of systems don't have a stable kernel interface; Linux is the exception here. On Microsoft Windows, one must link against Kernel32.dll to communicate with the kernel. On OpenBSD, the libc provides the stable syscall interface (and in fact the kernel will refuse syscalls from outside the libc as a security measure, see [0]). macOS and Solaris are two other OSes where, if I understand correctly, the syscall ABI is not guaranteed and must go through a common library. Go used to embed syscalls, and ran into a lot of problems because of this.
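To make "embedding a syscall" concrete, here's a sketch for x86-64 Linux, where the raw syscall ABI is stable enough that this is supported; the inline-asm constraints follow the usual musl-style wrapper, and this is x86-64 only. On OpenBSD the kernel would kill a process doing this from outside libc.

    #include <stddef.h>

    /* write(2) invoked by raw syscall number, bypassing libc entirely. */
    static long raw_write(int fd, const void *buf, size_t len)
    {
        long ret;
        __asm__ volatile ("syscall"
                          : "=a"(ret)
                          : "a"(1L /* SYS_write */), "D"((long)fd), "S"(buf), "d"(len)
                          : "rcx", "r11", "memory");
        return ret;
    }

    int main(void)
    {
        raw_write(1, "hello from a raw syscall\n", 25);
        return 0;
    }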
I get that OpenBSD is obsessed with security, but this still makes me a little sad. C is very old, and it should be possible to create programs and programming languages that have no idea what C or libc is. Forcing all programs to communicate with the OS via libc seems wrong.
On one hand, from an idealistic point of view, I agree, and am a very, very big believer of having actual, carefully designed ABIs for kernel (and inter-process) communications.
On the other hand, the structures required to do syscalls via libc/kernel32 are generally simple enough that it's not a huge deal, and the difference between doing a raw syscall and doing a function call is unlikely to actually matter.
Fun fact/pet peeve: on android, if you want to communicate with most services, you need to go through Binder. While the low-level Binder ABI is stable, the services written on top of it aren't, and often change in backwards-incompatible ways. This includes core services like SurfaceFlinger (necessary to draw on the screen). This means that it's generally impossible to create purely native software on android - you always have to call into Java to talk to those services.
That's pretty meaningless. The effect of cache is very processor- and ISA-specific and has a long tail of value (as in, depending on the processor, X amount of cache may get you a 50% speedup, but 2 times X might only give you 60%, so is the cost/benefit trade-off worth it?). It's plausible that the i7 needs 3 times the cache to get the same benefit as the 64k in the 865. N.B. I don't know for sure that that is the case.
-flto is great at the other end of the spectrum too: tiny systems. With "-Os -flto", much unused code from newlib is deleted, allowing the rest to fit in a tiny Cortex-M0 system. In the old days, each function of the C library was in its own object file in the library, but these days this appears to not be the case. But -flto gives the same effect.
I wonder if there is an equivalent for the MSVC and Clang compilers.
Would be nice if there was a good guide somewhere on flags to use (and their trade-offs) for fast floating-point performance in numerically intensive programs.
I'm not sure I understand your argument. The optimizations here are:
>"-Ofast" same as "-O3 -ffast-math"
-ffast-math makes the compiler violate IEEE, it would make a very bad default.
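A small sketch of the kind of surprise it causes (exact behaviour varies by compiler version): -ffast-math implies -ffinite-math-only, so the compiler may assume NaN can never occur and fold the check away.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        volatile double zero = 0.0;  /* volatile so the division isn't folded at compile time */
        double x = zero / zero;      /* NaN at run time */
        if (isnan(x))
            puts("NaN detected");    /* what IEEE semantics require */
        else
            puts("compiler assumed NaN is impossible");
        return 0;
    }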
-O3 used to be potentially buggy and somewhat experimental, although I doubt that it's much of a practical problem these days. It still makes the code harder to debug though. Most build systems I'm aware of default to debug builds, so GCC is not really unique in that regard.
>"-flto" enable link time optimizations
Those optimizations are typically quite expensive and can backfire in some scenarios. Link-time optimization is also relatively new, at least on GCC's timescale. I'm sure it'll be enabled by default some day, but GCC has to be conservative.
>"-mfpmath=sse" enables use of XMM registers in floating point instructions (instead of stack in x87 mode)
So that means that you tell the compiler to assume that SSE is available on the target. By default GCC outputs code that's compatible with the baseline, which is perfectly reasonable IMO. Same reason why -march=native is also not the default, it is assumed that you may want to ship your binaries to other computers.
>"-funroll-loops" enables loop unrolling
From GCC's own docs:
"This option makes code larger, and may or may not make it run faster."
Loop unrolling is tricky, because it gets rid of branches but also increases the size of the code and therefore the pressure on the icache. In some scenarios it's possible that the looping code runs faster than the linear version if it saves on cache misses.
Given that there are tradeoffs involved, it's also reasonable to let the user decide to enable this optim.
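For a picture of the tradeoff, this is roughly the transformation -funroll-loops applies, hand-written here as a sketch (on newer GCCs you can also request it per loop with #pragma GCC unroll): fewer branches, but a bigger body.

    /* Rolled version: one compare-and-branch per element. */
    int sum_rolled(const int *a, int n)
    {
        int s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Roughly what unrolling by 4 produces: one branch per 4 elements,
       plus a remainder loop, at the cost of more code (icache pressure). */
    int sum_unrolled(const int *a, int n)
    {
        int s = 0;
        int i = 0;
        for (; i + 4 <= n; i += 4)
            s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        for (; i < n; i++)
            s += a[i];
        return s;
    }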
I work with C a lot and IMO the only default in GCC that's truly bad is that it doesn't have -Wall by default.
>That depends on the language and the GCC version.
I was quoting TFA verbatim. I actually never use -Ofast myself; beyond -O3 I tend to use the individual flags manually, checking with benchmarks that it makes a difference.
>Yes, but unfortunately it's the Intel default, which contributes to some of the compiler mythology.
I didn't know that. I guess Intel is extremely performance-oriented and doesn't really care for portability so it makes some sense for them to do that.
>Regarding unrolling, you usually do want it in numeric loops, and -O3 unrolls (and jams) as -fopt-info shows.
Sure, but that's my point: the default optims for -O3 are fairly aggressive already. Unless you're writing code where performance is absolutely critical you'll probably do just fine just remembering to pass "-Wall -O3" to GCC and that's it. Actually, for most of my code where I want good performance but I'm not counting individual clock cycles I tend to default to -O2, which gives you most of the performance benefits with more conservative and easier-to-debug optimizations.
And for cases where you need to go beyond -O3 you'll probably have to write some benchmarking code before you can decide which additional option to use. At least, in my experience.
> So that means that you tell the compiler to assume that SSE is available on the target. By default GCC outputs code that's compatible with the baseline, which is perfectly reasonable IMO. Same reason why -march=native is also not the default, it is assumed that you may want to ship your binaries to other computers.
If your x86 computer was made after the US invaded Iraq, it supports SSE2 instructions.
And I'm currently writing code for a 32-bit CPU that was first introduced in 2000. It's not x86 so SSE is irrelevant, but my point is that GCC is routinely used to build millions if not billions of lines of code; they can't just YOLO-deprecate things as if it were a JavaScript framework.
clang also doesn't assume -march=native - just like gcc it makes binaries that can run on machines of the same architecture rather than specialising to the current processor's features.
You can see the flags it's using by comparing
clang -E -v - </dev/null 2>&1
and
clang -march=native -E -v - </dev/null 2>&1
The second output will probably show a lot of important target features enabled.
Clang works much the same way by default. All compilers have to pick a baseline, and assuming it is arch_of(your_core) is a recipe for disaster if you are compiling on your high-end machine for release to a broad user base.
For many users of GCC the intended user base will be “the same as my current distribution” and so GCC can be configured using --with-{cpu,arch,schedule,tune} depending on the architecture to set a default in line with user/distro expectations.
Exactly. What you do not want to happen is that your program crashes with fancy "illegal instruction" errors on older CPUs that do not have the newest fancy features yet, but are still in use by a large chunk of your (paying) users.
Performance-critical programs usually deal with that by either providing multiple builds targeting different CPU features/CPUs, or letting the user compile from source with the right flags, or having runtime CPU detection and providing alternative versions of a few important performance-critical functions for different CPU feature sets (e.g. browsers, ffmpeg, glibc, and various VMs/runtimes like Java HotSpot or dotnet do that), while the majority of the code is still compiled for a lowest common subset of CPU features.
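On x86 with GCC (and newer clang), one low-effort way to get that per-function dispatch is function multiversioning via the target_clones attribute: the compiler emits one body per listed target and selects at load time through an IFUNC, while everything else stays at the baseline ISA. A sketch (the dot-product function itself is just an example):

    #include <stddef.h>

    /* GCC emits an AVX2 clone, an SSE4.2 clone and a baseline clone of this
       function and picks one at program load based on the running CPU. */
    __attribute__((target_clones("avx2", "sse4.2", "default")))
    long dot(const int *a, const int *b, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += (long)a[i] * b[i];
        return s;
    }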
Of course, languages that run on (usually) JIT-ed VMs/runtimes have a bit of an advantage here, as the actual machine code is generated from source code or byte code only at runtime, at which point it is clear what kind of CPU is underneath the program. They can - but not always do - implement optimized JITting depending on the CPU features. (of course, every language/VM/runtime comes with its own set of pros and cons and there is no silver bullet).
To make matters even more complicated: compiling code to use the newest CPU features or newest optimization techniques will not mean it will actually run faster.
E.g. AVX512 may actually slow down your code (when multi-threaded) on many CPUs[1]. Or heavily "optimized" code may become larger in machine code, to the point where your "unoptimized" code may run faster because it fits in the CPU cache(s) properly while the "optimized" version does not. "-Os" optimized code may run faster than "-Ofast" optimized code for this matter. Or it may not. Depending on the actual code.
I remember compiling ffmpeg and libx264 myself a bunch of years ago, with the "best" flags for my system, starting with "-march=" and "-Ofast" of course, thinking I am a tough skillful super geek now. Imagine my surprise when I tested the performance against a default ffmpeg build and my optimized build was 2-5% slower.
Also, who the heck uses -flto on GCC 4.7.1? LTO is experimental in GCC 4.7.1 and will make any application nigh-impossible to debug.
Here's something more relevant to modern times: http://hubicka.blogspot.com/2019/05/gcc-9-link-time-and-inte...