All you need is an assembler (a language with a built-in assembler helps with th...

dkersten · on Aug 8, 2011

A timing loop will tell you how fast it runs on your particular configuration, but not how fast you can expect it to run in the general case. For that, you need to get very intimate with the processor manuals, especially the optimization guides. A good profiler, like Intel's VTune, is also great for getting low-level performance data such as cache misses.

Also, running the timing loop you suggest on a number of different hardware combinations would be useful, so you can compare how the same code behaves across processor architectures.
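
For illustration, a minimal sketch of the kind of timing loop being discussed, written in Python only to keep it short. The names `time_loop` and `work` are made up for this example, and `work` is a stand-in for whatever code you actually care about. It reports a per-call time for the machine it runs on and prints the platform string next to it, so results collected on different hardware can be compared side by side. It says nothing about cache misses or why the numbers differ; that is where the manuals and a profiler come in.

```python
import platform
import statistics
import time


def time_loop(fn, iterations=100_000, repeats=5):
    """Return the average time per call in nanoseconds, one sample per repeat."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        for _ in range(iterations):
            fn()
        elapsed = time.perf_counter_ns() - start
        samples.append(elapsed / iterations)
    return samples


def work():
    # Hypothetical stand-in for the code under test.
    return sum(range(32))


if __name__ == "__main__":
    samples = time_loop(work)
    # Record the hardware alongside the numbers so runs can be compared.
    print(platform.machine(), platform.processor())
    print(f"min {min(samples):.1f} ns, "
          f"median {statistics.median(samples):.1f} ns per call")
```

Taking the minimum over several repeats is a common way to reduce noise from other processes, but the figure still includes loop and call overhead and only describes this one configuration.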