on an intel sandy bridge processor, an L1 access takes 4 cycles, an L2 access takes 12 cycles, an L3 access takes about 30 cycles, and a main memory access takes about 65 ns (roughly 195 cycles at 3 GHz). on a 3 GHz processor, those latencies suggest you can do 750 MT/s from L1, 250 MT/s from L2, 100 MT/s from L3, and 15 MT/s from RAM.
now imagine 3 different data access patterns on a 1 GB array:
1) sequentially reading through an array
2) reading through an array with a random access pattern
3) reading through an array in a pattern where the next index is determined by the value at the current index.
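in code, the three patterns might look like this (a minimal sketch; the function names and the convention that the array stores its own indices for pattern 3 are illustrative assumptions, not the exact benchmark code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// 1) sequential: the next address is always the current one plus 4 bytes,
//    so the hardware can see exactly where you are going
uint64_t sum_sequential(const uint32_t *a, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

// 2) random: indices come from a precomputed shuffle, so the addresses
//    are unpredictable but independent of one another
uint64_t sum_random(const uint32_t *a, const uint32_t *idx, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++) s += a[idx[i]];
    return s;
}

// 3) data-dependent: each element holds the next index, so a load cannot
//    even start until the previous load has completed
uint32_t chase(const uint32_t *a, size_t steps) {
    uint32_t i = 0;
    for (size_t s = 0; s < steps; s++) i = a[i];
    return i;
}
```

for pattern 3 you would typically fill the array with a random permutation that forms a single cycle, so every element gets visited exactly once.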
if you benchmark these 3 access patterns, you will see that:
1) sequential access can do 3750 MT/s
2) random access can do 75 MT/s
3) data-dependent access can do 15 MT/s
you might guess that sequential access is fast because it is a very predictable access pattern, but it is still 5x faster than the speed indicated by the latency of L1. maybe you'd think it's prefetching into registers or something? but notice the difference between random access and data-dependent access. this is probably not what you expected at all! why is random access 5x faster than data-dependent access? because on sandy bridge, each core can keep 5 memory accesses in flight at the same time. random accesses are independent of one another, so 5 of them can overlap; in the data-dependent pattern, each load must complete before the next address is even known, so you pay the full 65 ns latency per access, which is exactly the 15 MT/s the raw latency predicts. the same 5-way parallelism also explains why sequential access seems to be 5x faster than the speed of L1.
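the overlap effect can be demonstrated by chasing several independent chains at once instead of one. a minimal sketch, where K = 5 matches the parallelism claimed above and the chain layout is an assumption for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { K = 5 };

// chase K independent chains simultaneously. the K loads inside each
// iteration do not depend on each other, so the core can have all K
// cache misses in flight at once, while a single chain serializes them.
void chase_parallel(const uint32_t *a, size_t n, const uint32_t start[K],
                    size_t steps, uint32_t out[K]) {
    (void)n;
    uint32_t cur[K];
    for (int k = 0; k < K; k++) cur[k] = start[k];
    for (size_t s = 0; s < steps; s++)
        for (int k = 0; k < K; k++) cur[k] = a[cur[k]];  // independent loads
    for (int k = 0; k < K; k++) out[k] = cur[k];
}
```

with a big enough array, this version should run roughly K times faster than a single chain, even though it performs K times as many loads per iteration.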
what does this mean in practice? that latency numbers alone are not enough: you also need to know how many accesses your processor can keep in flight and how parallelizable your access pattern is. the pure latency number only matters if you are limited to one access at a time.
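the back-of-the-envelope rule tying these numbers together is little's law: sustained throughput is roughly (accesses in flight) / (latency per access). a minimal sketch:

```c
// little's law applied to the memory system: throughput is bounded by
// (number of accesses in flight) divided by (latency of one access).
// latency is given in ns, result is in millions of transfers per second.
double throughput_mts(int in_flight, double latency_ns) {
    return in_flight * 1000.0 / latency_ns;
}
```

with the 65 ns main memory latency from above, `throughput_mts(1, 65.0)` gives about 15 MT/s, matching the data-dependent result, and `throughput_mts(5, 65.0)` gives about 77 MT/s, close to the measured 75 MT/s for random access.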