While it might sound from the title like this article is hopelessly out of date, I think it's still highly relevant to the current generation of Intel processors. I came across it in a footnote to Agner Fog's excellent http://www.agner.org/optimize/microarchitecture.pdf[1].
The article details (what I think is) an otherwise undocumented 'replay' feature that describes how the processor deals with data dependencies that aren't resolved on the expected schedule: among others, L1 cache misses, TLB misses, and failed Store-Load forwards.
[1] Footnote to self: Read the rest of Agner's footnotes!
Do you have specific knowledge about how more modern processors handle these cases? I was particularly excited to find this article because it was the only source that explained the hardware counter activity I found here: http://fastcompression.blogspot.com/2014/09/counting-bytes-f...
Yann was trying to write a fast histogram to record the number of occurrences of each character. But the simple version of the program was much slower than expected. After a fair amount of digging, it was determined that "impossible" store forwarding was a factor. Checking the performance counters seemed to confirm that many of the loads were being replayed (executed many times before being retired). Is there another explanation for this?
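For anyone who hasn't read the linked post, here is a minimal sketch of the kind of loop in question. It is not Yann's actual code: the function names and the four-table layout are my own illustration. The naive version increments a single table, so when the input contains runs of the same byte, each increment has to load the counter that the previous iteration just stored. The striped version spreads consecutive bytes across separate tables so that adjacent increments rarely touch the same address, then sums the tables at the end.

```c
#include <stddef.h>
#include <stdint.h>

/* Naive histogram: one table. On runs of identical bytes, each
   load of counts[x] depends on the store from the previous iteration. */
void histogram_naive(const uint8_t *src, size_t len, uint32_t counts[256])
{
    for (size_t s = 0; s < 256; s++) counts[s] = 0;
    for (size_t i = 0; i < len; i++) counts[src[i]]++;
}

/* Striped histogram (illustrative): four tables, so that neighbouring
   input bytes update different memory locations even when they are equal. */
void histogram_striped(const uint8_t *src, size_t len, uint32_t counts[256])
{
    uint32_t c[4][256] = {{0}};
    size_t i = 0;

    /* Main loop: four bytes per iteration, one table per position. */
    for (; i + 4 <= len; i += 4) {
        c[0][src[i]]++;
        c[1][src[i + 1]]++;
        c[2][src[i + 2]]++;
        c[3][src[i + 3]]++;
    }

    /* Tail: the remaining 0 to 3 bytes go into the first table. */
    for (; i < len; i++) c[0][src[i]]++;

    /* Merge the partial tables into the caller's result. */
    for (size_t s = 0; s < 256; s++)
        counts[s] = c[0][s] + c[1][s] + c[2][s] + c[3][s];
}
```

Both functions produce the same counts; whether the striped one is actually faster depends on the input distribution and the processor, so it is worth checking with the performance counters rather than assuming.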
Yes, I think so too! But I'm almost surely biased. You probably didn't notice, but the "Nathan Kurz" who wrote that email and the 'nkurz' who posted this are both me. That's why I was excited to finally find some sort of external confirmation. :)