

Latency numbers - Aeolus98
https://gist.github.com/Secathor/8415846

======
nkurz
Knowing these latency numbers is essential for writing efficient code. But
with modern out-of-order processors, it's often difficult to gauge how much
the latency will hurt throughput without closer analysis. I'd love if the math
for this analysis and the associated hardware limits were also better known.
Little's Law is a fundamental result of queueing theory. The original article
is very readable:
[http://web.mit.edu/sgraves/www/papers/Little's%20Law-Published.pdf](http://web.mit.edu/sgraves/www/papers/Little's%20Law-Published.pdf)

It says that for a system in a stable state, "occupancy = latency x
throughput". Let's apply this to one of the latencies in the table: main
memory access. An obvious question might be "How many lookups from random
locations in RAM can we do per second?" From the formula, it looks like we can
calculate this (the 'throughput') if we knew both the 'latency' and the
'occupancy'.

We see from the table that the latency is 100 ns. In reality, it will vary
from ~50 ns to ~200 ns depending on whether we are reading from an open row,
whether the TLB needs to be updated, and the offset of the desired data within
the cache line. But 100 ns is a fine estimate for 1600 MHz DDR3.

But what about the occupancy? It's essentially a measure of concurrency, equal
to the number of lookups that can be 'in flight' at a time. Knowing the
limiting factor for this is essential to calculating the throughput. But
oddly, knowledge of what current CPUs are capable of in this department
doesn't seem to be nearly as common as knowledge of the raw latency.

Happily, we don't need to know all the limits of concurrency for memory
lookups, only the one that limits us first. This usually turns out to be the
number of outstanding L1 misses, which in turn is limited by the number of
Line Fill Buffers (LFBs) or Miss Status Holding Registers (MSHRs) (could
someone explain the difference between these two?).

Modern Intel chips have about 10 of these per core, which means that each core
is limited to having about 10 requests for memory happening in parallel.
Plugging that into Little's Law:

    occupancy = latency x throughput
    10 lookups = 100 ns x throughput
    throughput = 10 lookups / 100 ns
    throughput = 100,000,000 lookups/second

At 3.5 GHz, this means you have a budget of about 35 CPU cycles to spend on
each lookup. Along with the raw latency, this throughput ceiling is a good
number to keep in mind.
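The arithmetic above can be written out as a quick sketch (the 100 ns latency, ~10 buffers per core, and 3.5 GHz clock are the same assumed round numbers as in the comment, not measurements of any specific chip):

```python
# Back-of-the-envelope Little's Law calculation for random memory lookups.
# Assumed round numbers: ~100 ns DRAM latency, ~10 line fill buffers per core.

latency_s = 100e-9   # main memory access latency (~100 ns)
occupancy = 10       # outstanding misses a core can keep in flight

# occupancy = latency x throughput  =>  throughput = occupancy / latency
throughput = occupancy / latency_s
print(f"{throughput:,.0f} lookups/second")  # 100,000,000

clock_hz = 3.5e9     # assumed 3.5 GHz core clock
cycles_per_lookup = clock_hz / throughput
print(f"{cycles_per_lookup:.0f} cycle budget per lookup")  # 35
```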

It's often difficult to sustain this rate, though, since it depends on having
the full number of memory lookups in flight at all times. If you have any
failed branch predictions, the lookups in progress will be restarted, and your
throughput will drop a lot. To achieve the full potential of 100,000,000
lookups per second per core, you either need to be branchless or perfectly
predicted.

------
Klinky
Any document relating benchmarks or performance numbers should include the
exact make/model of the hardware involved. So often I see performance numbers
reported by developers without much detail on the actual hardware being used.
It should be painfully obvious that numbers will vary greatly depending on
hardware/platform.

~~~
e98cuenc
These numbers are gold even without knowing the exact model of the hardware. I
always use them when I want to get a rough estimate for a new system. Should I
expect milliseconds, seconds or minutes for a given task in a new system?

I don't really care if my real system is 2x faster or slower than the quoted
number.

------
amelius
Related and also very interesting: [1]

[1] Ulrich Drepper, What Every Programmer Should Know About Memory,
[http://www.cs.bgu.ac.il/~os142/wiki.files/drepper-2007.pdf](http://www.cs.bgu.ac.il/~os142/wiki.files/drepper-2007.pdf)

------
thrownaway2424
The "TCP packet retransmit" one is interesting, because it's a parameter you
can set in your socket library or kernel. On Linux the default minimum RTO is
200ms, even if the RTT of the connection is < 1ms. For local networking you
really, really want to reduce the minimum RTO to a much smaller number. If you
don't, random packet loss is going to dominate your tail latency.
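A rough sketch of why the minimum RTO dominates the tail (the 1 ms RTT and 0.2% loss rate here are illustrative assumptions, not measurements):

```python
# Sketch: effect of the 200 ms minimum RTO on tail latency for a fast
# local network, assuming independent per-request packet loss.

rtt_s = 1e-3     # assumed normal request latency on a local network
rto_s = 200e-3   # Linux default minimum retransmission timeout
loss = 2e-3      # assumed probability a request needs one retransmit

# Mean latency barely moves...
mean = (1 - loss) * rtt_s + loss * (rto_s + rtt_s)
print(f"mean: {mean * 1e3:.3f} ms")   # ~1.4 ms

# ...but any quantile beyond (1 - loss) pays the full RTO:
# with 0.2% loss, the p99.9 latency is ~201 ms instead of ~1 ms.
p999 = rto_s + rtt_s if 0.999 > (1 - loss) else rtt_s
print(f"p99.9: {p999 * 1e3:.0f} ms")  # 201 ms
```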

~~~
drdaeman
For local networking, packet loss is, in most cases, a sign that something
somewhere isn't doing well. So maybe it's better to notice it sooner than to
bump into unexpected problems later.

~~~
thrownaway2424
Not really. A busy network has a million little buffers and things where your
packet might get dropped but succeed 1ms later.

~~~
wtallis
But those buffers are also often huge and your packet could still come out the
other end of one of them 20ms (or 2s, if it's a cable or DSL modem) later. You
still want to treat it as evidence of congestion, though.

------
vowelless
> Lets multiply all these durations by a billion:

This was great at helping me develop a better intuition for the numbers.
Thanks!

------
Oculus
Earlier in the summer I decided to create a phone background with these
numbers so whenever I had free time I could work on memorizing them:
[https://twitter.com/EmilStolarsky/status/496298288325599233](https://twitter.com/EmilStolarsky/status/496298288325599233)

------
adsche

    Send 2K bytes over 1 Gbps network ....... 20,000 ns  =  20 µs

Any reason to (arbitrarily?) take 2K here and not 1K?

~~~
xxs
2 KiB is above the usual MTU (~1500 bytes), so it doesn't fit in a single
standard-size packet, which is probably suboptimal.
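For what it's worth, the table's figure checks out as raw serialization time plus some headroom; a quick arithmetic sketch (ignoring framing overhead, propagation delay, and queueing):

```python
# Raw wire time for 2 KB over a 1 Gbps link (serialization only;
# ignores framing overhead, propagation delay, and queueing).

bits = 2_000 * 8   # the table's "2K bytes", read as 2,000 bytes
rate_bps = 1e9     # 1 Gbps

t_s = bits / rate_bps
print(f"{t_s * 1e6:.0f} us")  # 16 us; the table rounds up to ~20 us
```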

