I don't really care if my real system is 2x faster or slower than the quoted number.
 Ulrich Drepper, What Every Programmer Should Know About Memory, http://www.cs.bgu.ac.il/~os142/wiki.files/drepper-2007.pdf
Extremely long tail latencies, reaching a second or more, are observed in those networks when interference causes loss.
It would be much better to react more quickly than the eternity of 200 ms (if that is indeed the default).
This was great at helping me develop a better intuition for the numbers. Thanks!
Send 2K bytes over 1 Gbps network ....... 20,000 ns = 20 µs
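As a sanity check on that line (assuming "2K" means 2,000 bytes and counting only raw serialization delay, with no protocol or framing overhead, which the table's figure may include):

```python
bits = 2_000 * 8        # "2K bytes", taking K = 1,000 for a round estimate
link_bps = 1e9          # 1 Gbps link
delay_ns = bits / link_bps * 1e9
print(delay_ns)         # 16000.0 ns, in the same ballpark as the quoted 20,000 ns
```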
It says that for a system in a stable state, "occupancy = latency x throughput". Let's apply this to one of the latencies in the table: main memory access. An obvious question might be "How many lookups from random locations in RAM can we do per second?" From the formula, it looks like we can calculate this (the 'throughput') if we knew both the 'latency' and the 'occupancy'.
We see from the table that the latency is 100 ns. In reality, it's going to vary from ~50 ns to 200 ns depending on whether we are reading from an open row, whether the TLB needs to be updated, and the offset of the desired data from the start of the cache line. But 100 ns is a fine estimate for 1600 MHz DDR3.
But what about the occupancy? It's essentially a measure of concurrency, and equal to the number of lookups that can be 'in flight' at a time. Knowing the limiting factor for this is essential to being able to calculate the throughput. But oddly, knowledge of what current CPUs are capable of in this department doesn't seem to be nearly as common as knowledge of the raw latency.
Happily, we don't need to know all the limits of concurrency for memory lookups, only the one that limits us first. This usually turns out to be the number of outstanding L1 misses, which in turn is limited by the number of Line Fill Buffers (LFBs) or Miss Handling Status Registers (MSHRs) (Could someone explain the difference between these two?).
Modern Intel chips have about 10 of these per core, which means that each core is limited to having about 10 requests for memory happening in parallel. Plugging that into Little's Law:
"occupancy = latency x throughput"
10 lookups = 100 ns x throughput
throughput = 10 lookups / 100 ns
throughput = 100,000,000 lookups/second
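The arithmetic above can be sketched as a quick calculation (the numbers are the estimates from this thread, not measured values):

```python
# Little's Law: occupancy = latency * throughput
latency_s = 100e-9   # ~100 ns main-memory access, per the table
occupancy = 10       # ~10 outstanding L1 misses per core (LFB estimate)

throughput = occupancy / latency_s
print(f"{throughput:,.0f} lookups/second")  # 100,000,000 lookups/second
```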
It's often difficult to sustain this rate, though, since it depends on having the full number of memory lookups in flight at all times. If you have any branch mispredictions, the speculative lookups in flight will be squashed and restarted, and your throughput will drop sharply. To achieve the full potential of 100,000,000 lookups per second per core, you either need to be branchless or perfectly predicted.
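To see how sensitive the rate is to achieved concurrency, the same formula can be re-run with fewer requests in flight; occupancy 1 is the hypothetical worst case of a fully serial dependency chain, as in naive pointer chasing:

```python
latency_s = 100e-9  # ~100 ns per main-memory access

for in_flight in (10, 4, 1):  # 1 = each load depends on the previous one
    throughput = in_flight / latency_s
    print(f"{in_flight:2d} in flight -> {throughput:,.0f} lookups/s")
# 10 in flight -> 100,000,000 lookups/s
#  4 in flight ->  40,000,000 lookups/s
#  1 in flight ->  10,000,000 lookups/s
```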