

Optimizing AMD Opteron Memory Bandwidth, Part 1: single-thread, read-only (2010) - adamnemecek
http://blogs.utexas.edu/jdm4372/2010/11/03/optimizing-amd-opteron-memory-bandwidth-part-1-single-thread-read-only/

======
agumonkey
Last part is [http://blogs.utexas.edu/jdm4372/2010/11/11/optimizing-amd-
op...](http://blogs.utexas.edu/jdm4372/2010/11/11/optimizing-amd-opteron-
memory-bandwidth-part-5-single-thread-read-only/)

I wish the author had kept writing articles like that, but even old material
like this has great, timeless value.

~~~
sitkack
If you love this kinda stuff, Inner Loops [0] is an excellent book that covers
low-level performance optimization. The author, Rick Booth, handles it in a
way that allows the reader to transfer the techniques to new platforms.

I like it so much I buy used copies and give them as gifts.

[0] [http://www.amazon.com/Inner-Loops-Sourcebook-Software-
Develo...](http://www.amazon.com/Inner-Loops-Sourcebook-Software-
Development/dp/0201479605)

~~~
nkurz
I've appreciated your comments elsewhere, but I'm really dubious that a book
that only covers the Pentium II as an addendum can offer any specific advice
that is still useful. You're sure? I'm intrigued enough to try to get a copy
from the library.

I did just read a great explanation of how the P6 generation, which the
Pentium II belongs to, differed from the previous one in that it was the first
Intel generation to have out-of-order execution:
[http://people.cs.clemson.edu/~mark/330/colwell/p6des.pdf](http://people.cs.clemson.edu/~mark/330/colwell/p6des.pdf)

I'd love to find a more up to date book on such topics. Currently I'm
struggling to understand how speculative execution works (or doesn't work) in
cases that are bound by the front end's constraint of issuing only 4
instructions per cycle. Do you give up all preloading when you are front end
bound? It would seem like the PC and speculative PC would be running together,
and you'd always bear the full brunt of latency.

~~~
sitkack
I would be surprised if you weren't pleasantly surprised.

I can't address your speculative execution question; I am not caught up on
current tech. The TSX [0] stuff looks really fun, but it is _off_ for now.

You might enjoy realworldtech [1] for hardware info.

[0]
[http://en.wikipedia.org/wiki/Transactional_Synchronization_E...](http://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions)

[1] [http://www.realworldtech.com/](http://www.realworldtech.com/)

PS Just ran across this while weeding for multiple issue architectures,
[http://mcg.cs.tau.ac.il/papers/](http://mcg.cs.tau.ac.il/papers/)

PPS
[http://transact2014.cse.lehigh.edu/wang.pdf](http://transact2014.cse.lehigh.edu/wang.pdf)

