> uses half the instructions generated by the C++ compiler
is there a tool that could profile/predict ahead of time, so that one does not attempt to hand-write assembly before knowing for sure it will beat the compiled version?
There was Intel VTune, which I heard was good, though I haven't used it myself. One difficulty is that there are many non-obvious and hard-to-predict factors that interact to produce pipeline stalls. Instructions had specified throughputs and latencies (throughput being the number of cycles before another independent instruction of that type could be initiated; latency being the number of cycles before its output could be used by another instruction), but that was only part of the story. Was that memory read from L1 cache? L2? Main memory? Is this conditional branch predictable? Which of the several applicable execution units will this micro-op get sent to? There were also occasional performance cliffs (alternating memory reads that were exactly some particular power of 2 apart would alias in the cache, leading to worst-case cache behaviour; tight loops that did not begin on a 16-byte boundary would confuse the instruction prefetcher on some CPUs...)
I may be getting x86 CPU generations mixed up. But having wrestled with all this, I can certainly see the appeal of hand-optimising for older, simpler CPUs like the 6510 used in the C64, where things were a lot more deterministic.
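To make that power-of-2 aliasing cliff concrete, here's a minimal sketch of my own (not from the comment above); the 4 KiB stride and the 8-way, 32 KiB, 64-byte-line L1 geometry are assumptions about a typical x86 core, so the exact cliff location will move around between CPUs:

```c
/* Touch 64 cache lines repeatedly. At a stride of exactly 4 KiB, every
 * line maps to the same L1 set on an assumed 32 KiB / 8-way / 64 B-line
 * cache, so only 8 of the 64 lines can stay resident and the loop
 * thrashes. Nudging the stride by one line spreads the accesses across
 * sets and everything fits. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINES 64          /* distinct cache lines touched per pass */
#define REPS  (1 << 20)   /* passes over those lines               */

static double walk(volatile char *buf, size_t stride) {
    struct timespec t0, t1;
    volatile char sink = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long r = 0; r < REPS; r++)
        for (size_t i = 0; i < LINES; i++)
            sink += buf[i * stride];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    char *buf = calloc(LINES, 4096 + 64);   /* room for the widest stride */
    if (!buf) return 1;
    printf("stride 4096 B (all lines alias to one set): %.2fs\n",
           walk(buf, 4096));
    printf("stride 4160 B (lines spread across sets):   %.2fs\n",
           walk(buf, 4096 + 64));
    free(buf);
    return 0;
}
```

On a cache with different geometry the slow stride may be harmless and some other power of 2 may be the bad one, which is exactly what makes these cliffs hard to predict from source alone.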
VTune still exists and has been free for a few years now. A neat thing about VTune is that it has support for a few runtimes, so it understands, for example, CPython internals to the point that stack traces can be a mixture of languages. That's something only now becoming available outside of VTune; Python 3.12, for instance, added some hooks for Linux perf.
Note that there's also the open-source uiCA [0], which similarly predicts scheduling and overall throughput for a basic block. Their benchmarks claim it to be more accurate than IACA and other tools for newer Intel CPUs, but I wouldn't be qualified to judge those claims.
Yup. Back in the day VTune was useful and good, but I haven't used it in more than 20 years. It might still be good, but given how much more complicated current CPU architectures are, and how much I've lost touch with low-level assembly, I don't know whether it would still be useful to me. These days I rely on profiling, and on the opinions of other programmers (or people on the web) who have become far better at this than me.
Most of the time, a good optimized library will do pretty well.
How hard this is varies with the architecture: it ranges from trivial, to very hard, to mostly data-dependent. llvm-mca might be of interest.
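A sketch of one way to drive it, with a toy kernel (the -mcpu and -timeline flags are real llvm-mca options; the skylake model is just an example):

```c
/* dot.c -- toy kernel to feed to llvm-mca, which statically schedules
 * the compiled instructions against a CPU model and predicts throughput.
 * One way to run it:
 *
 *   clang -O2 -S -o - dot.c | llvm-mca -mcpu=skylake -timeline
 *
 * -mcpu picks the machine model; -timeline prints the cycle-by-cycle
 * schedule. The prediction assumes every load hits cache and every
 * branch is predicted correctly, so it is a best-case number.
 */
float dot(const float *a, const float *b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];   /* one dependent add chain: latency-bound */
    return acc;
}
```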
One should be able to do a best-case calculation by hand, too, mostly assuming caches hit and branch prediction gets the answer right. Register renaming manages to stay out of the way.
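As a sketch of what that by-hand arithmetic looks like for a floating-point reduction (the 4-cycle add latency and 2-adds-per-cycle throughput are assumed figures for a typical recent x86 core, not measurements):

```c
/* Back-of-envelope best-case model: in a reduction, each add depends on
 * the previous one, so a single chain is bounded by latency; splitting
 * the sum across enough independent accumulators is bounded by
 * throughput instead. */
#include <stdio.h>

int main(void) {
    double n    = 1e6;   /* elements to reduce                 */
    double lat  = 4.0;   /* assumed FP add latency (cycles)    */
    double tput = 2.0;   /* assumed FP adds issued per cycle   */

    double chain    = n * lat;   /* single dependent chain            */
    double parallel = n / tput;  /* enough accumulators to hide latency */

    printf("1 accumulator   : %.0f cycles (latency-bound)\n", chain);
    printf(">=8 accumulators: %.0f cycles (throughput-bound)\n", parallel);
    printf("accumulators needed to hide latency: %.0f\n", lat * tput);
    return 0;
}
```

The gap between those two numbers is why unrolling a reduction into several independent accumulators, whether by hand or via vectorization, pays off.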
On shakier ground, one could imagine a statistical representation of program performance on unknown (or partially known) data. One might be able to estimate that usefully, though I haven't seen it done.