Alternatively you can execute your code in a simulator with a cache model. Of course, this model won't actually match your hardware (unless you're using a serious cycle-accurate simulator), but it might be more insightful than performance counters alone. I don't know if any of the standard x86 emulators include a good cache model.
Just from the front page I see a presentation, "Cache Line Utilization with AMD CodeAnalyst Software", which touches on the subject.
Great tool for optimizing C/C++ code.
If you're using multiple threads, then you might have issues with false sharing, which is a bitch. Valgrind can help you there.
edit: also, i guess, it never hurts to plug Drepper's most excellent paper, "What Every Programmer Should Know About Memory".
Also informative reading: "A Memory Allocator" http://gee.cs.oswego.edu/dl/html/malloc.html
How fast can you access a bit of memory? The lower bound is the speed of light. How do you organize your memory such that access is the fastest? In a sphere with the processor in the center. There is no more overall efficient packing of bits assuming that each bit takes up some finite "bit" of space (pun intended).
This means that random access to memory can never be O(1) at the limit; it is actually O(n^(1/3)), the cube root, since the volume of a sphere is (4/3)πr^3, so the radius (and hence the worst-case distance a signal must travel) grows as the cube root of the number of bits. Yes, even for hash tables. So random-access memory always gets slower the more of it you have.
Conversely, the less memory you have to access, the faster it can be. By putting more frequently accessed bits closer to the center of the sphere, you have created a cache.
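The cube-root claim follows from simple geometry. A quick derivation, assuming the n bits are packed at some constant density ρ inside a sphere and signals travel at most at the speed of light c:

```latex
n = \frac{4}{3}\pi r^3 \rho
\quad\Longrightarrow\quad
r = \left(\frac{3n}{4\pi\rho}\right)^{1/3},
\qquad
t_{\text{access}} \;\ge\; \frac{r}{c} \;=\; O\!\left(n^{1/3}\right).
```

So the worst-case access time to any of the n bits is bounded below by a term proportional to the cube root of n.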
ducks before another HN mod title debate begins
When you need really fast algorithms and very low latencies, you have to get your hands dirty.
If you're interested in much more advanced writings on the subject I'd suggest googling for "How to do 100K+ TPS at less than 1ms latency".
It explains how LMAX's Disruptor pattern can now process 12 million events per second on a single core... in Java!
It covers, among other things, the sizes and speeds of the various caches and how to minimize cache misses.
Video (InfoQ): http://www.infoq.com/presentations/LMAX
Blogpost (2011): http://martinfowler.com/articles/lmax.html?t=1319912579
Discussion (2011): http://news.ycombinator.com/item?id=3173993
Funny that this was posted here as their whitepaper has been sitting in my Downloads folder for months waiting to be read.
The slides mention "Hotspot likes small compact methods". Does anyone happen to know a source or have more information on this?