> uses half the instructions generated by the C++ compiler
is there a tool that could profile/predict ahead of time, so that one does not attempt to hand-write assembly before knowing for sure it will beat the compiled version?
There was Intel VTune, which I heard was good, though I haven't used it myself. One difficulty is that there are many non-obvious and hard-to-predict factors that interact to produce pipeline stalls. Instructions had specified throughputs and latencies (throughput being the number of cycles before another independent instruction of that type could be initiated; latency being the number of cycles before its output could be used by another instruction), but that was only part of the story. Was that memory read from L1 cache? L2? Main memory? Is this conditional branch predictable? Which of the several applicable execution units will this micro-op get sent to? There were also occasional performance cliffs (alternating memory reads that were exactly some particular power of 2 apart would alias in the cache, leading to worst-case cache behaviour; tight loops that did not begin on a 16-byte boundary would confuse the instruction prefetcher on some CPUs...)
I may be getting x86 CPU generations mixed up. But having wrestled with all this, I can certainly see the appeal of hand-optimising for older, simpler CPUs like the 6510 used in the C64, where things were a lot more deterministic.
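To make that power-of-2 aliasing cliff concrete, here's a minimal sketch of my own (not from the comment above); the 4 KiB stride and the 8-way, 32 KiB, 64-byte-line L1 geometry are assumptions about a typical x86 core, so the exact cliff location will move around between CPUs:

```c
/* Touch 64 cache lines repeatedly. At a stride of exactly 4 KiB, every
 * line maps to the same L1 set on an assumed 32 KiB / 8-way / 64 B-line
 * cache, so only 8 of the 64 lines can stay resident and the loop
 * thrashes. Nudging the stride by one line spreads the accesses across
 * sets and everything fits. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINES 64          /* distinct cache lines touched per pass */
#define REPS  (1 << 20)   /* passes over those lines               */

static double walk(volatile char *buf, size_t stride) {
    struct timespec t0, t1;
    volatile char sink = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long r = 0; r < REPS; r++)
        for (size_t i = 0; i < LINES; i++)
            sink += buf[i * stride];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    char *buf = calloc(LINES, 4096 + 64);   /* room for the widest stride */
    if (!buf) return 1;
    printf("stride 4096 B (all lines alias to one set): %.2fs\n",
           walk(buf, 4096));
    printf("stride 4160 B (lines spread across sets):   %.2fs\n",
           walk(buf, 4096 + 64));
    free(buf);
    return 0;
}
```

On a cache with different geometry the slow stride may be harmless and some other power of 2 may be the bad one, which is exactly what makes these cliffs hard to predict from source alone.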
VTune still exists and has been free for a few years now. A neat thing about VTune is that it has support for a few runtimes, so it understands, for example, CPython internals to the point that stack traces can be a mixture of languages. That's something only now becoming available outside of VTune; Python 3.12, for instance, added some hooks for Linux perf.
Note that there's also the open-source uiCA [0], which similarly predicts scheduling and overall throughput for a basic block. Their benchmarks claim it to be more accurate than IACA and other tools for newer Intel CPUs, but I wouldn't be qualified to judge those claims.
Yup. Back in the day VTune was useful and good, but I haven't used it in more than 20 years. It might still be good, but given how much more complicated current CPU architectures are, and how much I've lost touch with low-level assembly, I don't know whether it would still be useful to me. These days I rely on profiling, and on the opinions of other programmers (or people on the web) who have become far better at this than me.
Most of the time, a good optimized library will do pretty well.
How hard this is varies with the architecture: it ranges from trivial, to very hard, to mostly data-dependent. llvm-mca might be of interest.
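A sketch of one way to drive it, with a toy kernel (the -mcpu and -timeline flags are real llvm-mca options; the skylake model is just an example):

```c
/* dot.c -- toy kernel to feed to llvm-mca, which statically schedules
 * the compiled instructions against a CPU model and predicts throughput.
 * One way to run it:
 *
 *   clang -O2 -S -o - dot.c | llvm-mca -mcpu=skylake -timeline
 *
 * -mcpu picks the machine model; -timeline prints the cycle-by-cycle
 * schedule. The prediction assumes every load hits cache and every
 * branch is predicted correctly, so it is a best-case number.
 */
float dot(const float *a, const float *b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];   /* one dependent add chain: latency-bound */
    return acc;
}
```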
One should be able to do a best-case calculation by hand, too, mostly assuming caches hit and branch prediction gets the answer right. Register renaming manages to stay out of the way.
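As a sketch of what that by-hand arithmetic looks like for a floating-point reduction (the 4-cycle add latency and 2-adds-per-cycle throughput are assumed figures for a typical recent x86 core, not measurements):

```c
/* Back-of-envelope best-case model: in a reduction, each add depends on
 * the previous one, so a single chain is bounded by latency; splitting
 * the sum across enough independent accumulators is bounded by
 * throughput instead. */
#include <stdio.h>

int main(void) {
    double n    = 1e6;   /* elements to reduce                 */
    double lat  = 4.0;   /* assumed FP add latency (cycles)    */
    double tput = 2.0;   /* assumed FP adds issued per cycle   */

    double chain    = n * lat;   /* single dependent chain            */
    double parallel = n / tput;  /* enough accumulators to hide latency */

    printf("1 accumulator   : %.0f cycles (latency-bound)\n", chain);
    printf(">=8 accumulators: %.0f cycles (throughput-bound)\n", parallel);
    printf("accumulators needed to hide latency: %.0f\n", lat * tput);
    return 0;
}
```

The gap between those two numbers is why unrolling a reduction into several independent accumulators, whether by hand or via vectorization, pays off.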
On shakier ground, one could imagine a statistical representation of program performance on unknown (or partially known) data. One might be able to estimate that usefully, though I haven't seen it done.