
Latency Comparisons every programmer should know (2012) - Remnant44
https://gist.github.com/hellerbarde/2843375
======
acqq
An alternative unit I like is "how many per second" (when done on a single
hardware unit, one after another, of course). The numbers are then easy to
grasp, since they relate to what we actually want to achieve (I haven't
checked them for accuracy):

In billions: L1 cache reference can be done 2 billion times per second.

In millions: Branch mispredict, L2 cache reference, Mutex lock/unlock and Main
memory reference can be done 200, 140, 40 and 10 million times per second.

In thousands: Compress 1K bytes with Zippy, Send 2K bytes over 1 Gbps network,
SSD random read, Read 1 MB sequentially from memory, Round trip within same
datacenter and Read 1 MB sequentially from SSD* can be done 300, 50, 7, 4, 2
and 1 thousand times per second.

And finally, Disk seek, Read 1 MB sequentially from disk and Send packet
CA->Netherlands->CA can be done 100, 50 and 7 times per second.

No graphs needed, and no trying to figure out whether a microsecond is a lot or not.
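
For anyone who wants to re-derive these, the conversion is just 1e9 divided by the latency in nanoseconds. A quick sketch (the latency values are the classic 2012 table numbers; the dictionary and its names are mine):

```python
# Converting the table's latencies (nanoseconds) into "how many per
# second" is just 1e9 / latency_ns. The values below are the classic
# 2012 numbers; check them against your own hardware.
latencies_ns = {
    "L1 cache reference": 0.5,
    "Branch mispredict": 5,
    "L2 cache reference": 7,
    "Mutex lock/unlock": 25,
    "Main memory reference": 100,
    "Read 1 MB sequentially from memory": 250_000,
    "Round trip within same datacenter": 500_000,
    "Disk seek": 10_000_000,
}
for name, ns in latencies_ns.items():
    print(f"{name}: {1e9 / ns:,.0f} per second")
```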

~~~
mjevans
The numbers in the chart better reflect an "add up the costs" accounting
methodology, which is useful for deadlines or cost/benefit comparisons. It had
also better be worth it to spend very costly human time to save comparatively
cheap computer time: things at, e.g., FANG scale, rather than the cron job
that does something overnight at a small business.

~~~
gameswithgo
If you save 1 second in a program used 1000 times each by 1 million
people....

~~~
mjevans
Right, that's FANG scale.

------
dang
Discussed in 2017:
[https://news.ycombinator.com/item?id=13530820](https://news.ycombinator.com/item?id=13530820)

------
pkroll
Similarly covered at Coding Horror: [https://blog.codinghorror.com/the-infinite-space-between-words/](https://blog.codinghorror.com/the-infinite-space-between-words/)

------
anonymousDan
Where does a context switch fall in this? A function call?

~~~
jcranmer
Function call overhead is on the order of several tens of clock cycles (round
to 100, not 10), or a few dozen nanoseconds. The actual impact of function
calls is usually not so much the overhead of the call itself, but rather the
fact that function calls tend to be barriers to analyses and optimizations.

Context switches are closer to about 1µs in the best case, with interprocess
context switches maybe reaching 5µs. The time here is much more variable
because the really painful thing is the effect of TLB flushing and other
microarchitectural state going completely haywire.
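
One classic way to put a rough number on the interprocess case is to ping-pong a byte between two processes over a pair of pipes and divide. A sketch, assuming a POSIX system with `os.fork` available (illustrative only; the result is an upper bound that also includes pipe and syscall overhead, and `ROUNDS` is arbitrary):

```python
# Rough estimate of interprocess context-switch cost: two processes
# ping-pong one byte over a pair of pipes. Each round trip forces at
# least two context switches, so the per-switch figure below is an
# upper bound that also folds in pipe/syscall overhead.
import os
import time

ROUNDS = 10_000  # enough round trips to average out scheduling noise

def pingpong():
    r1, w1 = os.pipe()  # parent -> child
    r2, w2 = os.pipe()  # child -> parent
    pid = os.fork()
    if pid == 0:  # child: echo every byte straight back
        for _ in range(ROUNDS):
            os.read(r1, 1)
            os.write(w2, b"x")
        os._exit(0)
    start = time.perf_counter()
    for _ in range(ROUNDS):
        os.write(w1, b"x")
        os.read(r2, 1)
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)
    return elapsed / ROUNDS / 2  # ~two switches per round trip

print(f"~{pingpong() * 1e6:.2f} us per context switch (upper bound)")
```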

~~~
yjftsjthsd-h
> The time here is much more variable because the really painful thing is the
> effect of TLB flushing and other microarchitectural state going completely
> haywire.

Isn't a context switch also where the Spectre/Meltdown mitigations hit the
hardest?

------
lvturner
I get it... but it would be really nice to include WHY these comparisons are
important somewhere

------
mirimir
> 25 s Making a coffee

Instant coffee?

Maybe French press, if the water is already boiling.

------
CalChris
I've seen this table many times and I'm always surprised it doesn't include
_register_.

~~~
tenebrisalietum
Registers should have zero latency. They're literally within the same part of
the CPU that decodes the instructions.

~~~
zenmaster10665
Well zero as in one or two clock cycles?

~~~
jcranmer
A fraction of a clock cycle.

An execution unit is basically an ALU with a set of muxes for its inputs,
where the muxes pull either from the registers or from later execution stages
that have yet to commit their results to registers. The entire path has to
settle by the time the clock signal reaches the latch at the edge of the
stage, which retains the value for the next clock cycle. This means that the
time it takes to read a register is a fraction of the single clock cycle that
1-clock-cycle µops take to run.

------
ladberg
The "2K over a network" latency really depends on the networking hardware in
the computers and switches and on how the software stack is set up. I wouldn't
be surprised if it's easily 10x that in most real cases.

~~~
ses1984
I think it's a case of "one of these is not like the others": this number
seems to refer to a slice of throughput. There's a separate line item for
"round trip within same datacenter".

~~~
wtallis
Yeah, it's "time taken to transmit", not round-trip time or even one-way
end-to-end latency. It's useful if you want to know, e.g., the minimum extra
latency being added to other network flows because your NIC is busy
transmitting this packet for the next n microseconds.
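
The arithmetic behind "time taken to transmit" is just payload bits divided by link rate; the table's ~20µs figure for 2 KB at 1 Gbps is in the same ballpark once framing and software overhead are added on top of the raw wire time. A quick sketch (the helper name is mine):

```python
# Serialization ("wire") time: how long a NIC is busy clocking a frame
# onto the link, independent of propagation delay.
def wire_time_us(payload_bytes: int, link_bps: float) -> float:
    return payload_bytes * 8 / link_bps * 1e6

print(wire_time_us(2048, 1e9))   # 2 KB at 1 Gbps: ~16.4 us on the wire
print(wire_time_us(2048, 10e9))  # same frame at 10 Gbps: ~1.6 us
```

Note how a 10x faster link shrinks only this component; propagation and software latency are unaffected.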

------
sprash
Any info on the difference in latency between register access and L1 cache on
x86_64?

When looking at gcc's compiler output, most of the time it puts the variables
within a scope on the stack and rarely populates all the registers, or even
any at all. Wouldn't using the full 12-14 available registers first be faster?

~~~
jcranmer
An L1 cache hit is 4 cycles of latency.

As for your second question, my guess is you're looking at unoptimized code.
Most compilers keep variables on the stack unless you're optimizing the code
to at least some degree.

