
Caches, memory hierarchies, out-of-order execution, etc. are hidden from assembly as well as C. One reason for this that isn't mentioned in your comment (or the ACM article) is not that everyone loves C, but that most software has to run on a variety of hardware, with differing cache sizes, power consumption targets, etc. Pushing all of that optimization and fine tuning off to the hardware means that software isn't forced to work only on the exact computer model it was designed for.

The author also mentions that alternative computation models would make parallel programming easier, but this neglects the numerous problems that aren't parallelizable. There's a reason why we haven't switched all of our computation to GPUs.




I don't think I completely agree with your sentiment. While we want software to run everywhere (at least across the x86 family, regardless of feature sets), we also want to make sure it performs well. This is especially important in areas where we (ab)use the hardware to the fullest (games, science, rendering, etc.).

To enable these performance optimizations, we taught our compilers tons of tricks, like the -march and -mtune flags. We also allow our compilers to generate reckless code with flags like -ffast-math, and we add tons of assembly or vectorization hints to libraries like Eigen.
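
To make the "vectorization hints" part concrete, here's a minimal C sketch (a toy kernel of my own, not Eigen's actual code): restrict plus GCC/Clang's __builtin_assume_aligned make the same kind of promises to the compiler that such libraries encode, so it can emit aligned vector loads without runtime checks.

    #include <stddef.h>

    /* Hypothetical kernel: dst[i] = k * src[i]. `restrict` rules out
       aliasing; __builtin_assume_aligned promises 32-byte (AVX)
       alignment, so GCC/Clang can vectorize the loop directly.
       Build with, e.g.: gcc -O3 -march=native -ffast-math */
    void scale(float *restrict dst, const float *restrict src,
               float k, size_t n) {
        float *d = __builtin_assume_aligned(dst, 32);
        const float *s = __builtin_assume_aligned(src, 32);
        for (size_t i = 0; i < n; i++)
            d[i] = k * s[i];
    }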

We write benchmarks like STREAM, or other tools which measure core-to-core latency, or measure execution time at different data lengths to detect cache sizes, associativity, and whatnot. Then we use this information to tune our code or compiler flags to maximize the speed of the software at hand.
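
A minimal sketch of that "different data lengths" trick, assuming a POSIX system with clock_gettime (the function name and sweep range are mine): chase a randomly shuffled pointer cycle over a growing working set, and the steps in ns-per-load roughly mark where each cache level ends.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Average latency of one dependent load over a working set of
       `bytes`. The pointers form a random cycle, so the hardware
       prefetcher can't predict the next address. */
    static double ns_per_load(size_t bytes, size_t iters) {
        size_t n = bytes / sizeof(void *);
        void **buf = malloc(n * sizeof(void *));
        size_t *idx = malloc(n * sizeof(size_t));
        for (size_t i = 0; i < n; i++) idx[i] = i;
        for (size_t i = n - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        for (size_t i = 0; i + 1 < n; i++)
            buf[idx[i]] = &buf[idx[i + 1]];
        buf[idx[n - 1]] = &buf[idx[0]];

        void **p = &buf[idx[0]];
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < iters; i++)
            p = (void **)*p;                      /* serialized loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        if (p == buf) putchar(' ');               /* keep `p` live */
        free(idx); free(buf);
        return ((t1.tv_sec - t0.tv_sec) * 1e9 +
                (t1.tv_nsec - t0.tv_nsec)) / (double)iters;
    }

    int main(void) {
        for (size_t kb = 4; kb <= 64 * 1024; kb *= 2)
            printf("%6zu KiB: %5.2f ns/load\n",
                   kb, ns_per_load(kb * 1024, 20 * 1000 * 1000));
        return 0;
    }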

If caches and other parts of the system were visible to assembly, we could ask the processor for their properties, optimize directly according to their merits, and even do data-allocation tricks or prefetching without guesswork (some architectures support this via programmable external prefetch engines). Instead, we tune in the dark via half-informative data sheets, undisclosed AVX frequency behavior, or techniques like running perf and looking at cache-miss percentages, IPC, and other counters to make educated guesses about how a processor behaves.
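
To be fair, a slice of this is already queryable on Linux: glibc forwards some of what CPUID reports through sysconf (a glibc extension, not POSIX). A small sketch:

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* glibc exposes CPUID-derived cache properties; 0 means unknown. */
        printf("L1d size:  %ld\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
        printf("L1d line:  %ld\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
        printf("L1d assoc: %ld\n", sysconf(_SC_LEVEL1_DCACHE_ASSOC));
        printf("L2 size:   %ld\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
        printf("L3 size:   %ld\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
        return 0;
    }

And __builtin_prefetch(addr, 0, 3) is the compiler-level version of the explicit prefetch hint; the programmable prefetch engines mentioned above go well beyond that.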

Yes, not everything can be run in parallel, and I don't want to move all computation to GPUs with FP16 half-precision math. But we can at least agree that these systems are designed to look like PDP-11s from a distance, and our compilers are the topmost layer of this "emulation," doing all kinds of tricks underneath. Pushing for this performance in an opaque way is why we have Spectre and Meltdown, for example: points where these abstractions and mirrors break down.

If our hardware were more transparent to us, we could arguably optimize our code selectively with less effort, if certain features had switches labeled "Auto / I know what I'm doing."

Intel tried to take this to the max (do all the optimization in the compiler) with Itanium. The architecture was so dense, it failed to float, it seems.



