
Latency numbers every programmer should know - friggeri
https://gist.github.com/2841832
======
luckydude
Most of these latencies were measured and written up for a bunch of systems by
Carl Staelin and me back in the 1990s. There is a Usenix paper that describes
how it was done, and the benchmarks are open source; you can apt-get them.

<http://www.bitmover.com/lmbench/lmbench-usenix.pdf>

If you look at the memory latency results carefully, you can easily read off
L1, L2, L3, main memory, memory + TLB miss latencies.

If you look at them harder, you can read off cache sizes and associativity,
cache line sizes, and page size.

Here is a 3-D graph that Wayne Scott did at Intel from a tweaked version of
the memory latency test.

<http://www.bitmover.com/lmbench/mem_lat3.pdf>

His standard interview question is to show the candidate that graph and say
"tell me everything you can about this processor and memory system". It's
usually a 2 hour conversation if the candidate is good.

~~~
ajross
Pedantic quip: I have a hard time believing you guys were measuring half
nanosecond cache latencies on a machine with a 100MHz clock. :)

And actually the cache numbers seem optimistic, if anything. My memory is that
an L1 cache hit on SNB is 5 cycles, which is 2-3x as long as that table shows.

~~~
luckydude
We didn't believe it either until we put a logic analyzer on the bus and found
that the numbers were spot on with respect to the number of cycles. I don't
remember how far off they were, but it wasn't much; all the hardware dudes were
amazed that software could get that close.

tl;dr: the numbers were accurate to the # of cycles, might have been as much
as 1/2 of 1 cycle off.

Edit: I should add this was almost 20 years ago, I dunno how well it works
today. Sec, lemme go test on a local machine.

OK, I ran on an Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz (I think that's Sandy
Bridge), overclocked to 4289 MHz according to mhz, and it looks to me like that
machine takes 4 cycles to do an L1 load. Does that sound right? lmbench says
4.05 cycles.

I poked a little more and I get

L1: 4 cycles, ~48K
L2: 12 cycles, ~256K
L3: 16 cycles, ~6M

Off to google and see how far off I am. Whoops, work is calling, will check
back later.

------
larsberg
Honestly, I'd rather programmers know how to _measure_ these numbers than just
have them memorized.

I mean, if I told them that their machine had L3 cache now, what would they do
to find out how that changes things? (This comment is also a shameless plug for
the fantastic CS:APP book out of CMU.)

~~~
memset
Honest question: how would you measure an L2 cache lookup? (What program would
I need to write to ensure that a value is stored in L2, such that when I read
it later, I know the lookup time is indeed the L2 time?)

Beyond that, how would one measure this kind of thing at the nanosecond level?
Would C's clock_gettime() function be good enough? Is there any other facility
to count the cycles that have passed between two operations?

~~~
MichaelGG
What you could do is write a program that uses increasing amounts of memory.
Measure the time it takes to complete a few million iterations of some
function that accesses that memory. As the total usage increases, you'll see
steps in the timing. For example, you might see a jump at ~32K, then ~2MB,
then ~8MB.
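
Something like this minimal C sketch (hypothetical and untuned; names and
parameters are mine). You can't time a single load with clock_gettime(), but
averaging millions of dependent loads works fine, and chasing a randomly
shuffled pointer cycle keeps the prefetcher from hiding the latency:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Average ns per dependent load when chasing a random pointer
       cycle through a working set of 'size' bytes. Error checks and
       thread pinning omitted; rand()'s shuffle is crude but enough. */
    static double chase_ns(size_t size, long iters) {
        size_t n = size / sizeof(void *);
        void **buf = malloc(n * sizeof(void *));
        size_t *idx = malloc(n * sizeof(size_t));
        /* Random cyclic permutation: each load depends on the previous
           one, so latencies can't be overlapped or prefetched. */
        for (size_t i = 0; i < n; i++) idx[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = rand() % (i + 1);
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        for (size_t i = 0; i < n; i++)
            buf[idx[i]] = &buf[idx[(i + 1) % n]];

        void **p = &buf[idx[0]];
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
            p = (void **)*p;                /* serialized load chain */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        if (!p) puts("");                   /* keep the loop alive */
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        free(idx); free(buf);
        return ns / iters;
    }

    int main(void) {
        /* Sweep the working set from 4KB to 64MB; the latency steps up
           as the set spills out of L1, L2, L3, then into DRAM. */
        for (size_t size = 4096; size <= (64UL << 20); size *= 2)
            printf("%6zu KB: %5.2f ns/load\n", size / 1024,
                   chase_ns(size, 10 * 1000 * 1000L));
        return 0;
    }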

~~~
comatose_kid
I think you'd just need to access memory in increasingly larger increments
(e.g., sequential reads of 32k, 64k, ... up to 2x the size of the L2 cache).
But there's more to it, as the L2 cache is often shared amongst cores in a
multi-CPU system (whereas the L1 is per-core). Also, the organization of the
cache (e.g., n-way set associative) means that you'd want to vary the size of
the jump (stride) for your reads/writes and see how that also affects
throughput.
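
A hypothetical fragment in the same vein as the sketch above: instead of a
random cycle, link each element to the one 'stride' bytes ahead and time the
same p = *p loop while doubling the stride. Steps in the curve show up at the
cache-line size, and, with a working set just over a cache's size, power-of-two
strides expose its associativity:

    /* Fixed-stride chase: every load lands 'stride' bytes further on
       (wrapping), touching one pointer per stride. Sweep stride from
       8 bytes up past the page size and watch where latency jumps. */
    static void build_stride(void **buf, size_t n, size_t stride) {
        size_t step = stride / sizeof(void *);  /* stride >= 8 assumed */
        for (size_t i = 0; i < n; i++)
            buf[i] = &buf[(i + step) % n];
    }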

~~~
MichaelGG
Yea. All this and more is covered in the fantastic paper by Ulrich Drepper:
What Every Programmer Should Know About Memory[1]. Was very enlightening.

1: <http://www.akkadia.org/drepper/cpumemory.pdf>

------
kjhughes
Anyone who hasn't heard Rear Admiral Grace Murray Hopper describe a nanosecond
should check out her classic explanation:

<http://www.youtube.com/watch?v=JEpsKnWZrJ8>

------
xb95
This reminds me of a page that Google has internally which, very roughly,
breaks down the cost of various things so you can calculate equivalencies.

As an example of what I mean (i.e., these numbers and equivalencies are
completely pulled out of thin air and I am not asserting them in any way):

    
    
      * 1 Engineer-year = $100,000
      * 25T of RAM = 1 Engineer-week
      * 1ms of display latency = 1 Engineer-year
    

This allows engineers to calculate tradeoffs when they're building things and
to optimize their time for business impact. E.g.: it's not worth optimizing
memory usage by itself; latency is king; don't waste your time shaving yaks;
etc.

~~~
Morg
Yet everyone uses much slower RAM in servers and will likely continue to do
so, all the while caches swell, etc.

Optimizing memory usage is almost irrelevant today, until it starts being a
bandwidth problem, and that's still solvable but only through complex scaling
strategies that also cost several engineer-years.

------
zippie
These numbers from Jeff Dean are still roughly right but need to be refreshed
for modern DRAM modules and controllers. Specifically, the main memory latency
numbers are more applicable to DDR2 RAM than to the now widely deployed
DDR3/DDR4 RAM (more channels = more latency). This has been an industry trend
for a while and there's no change on the horizon. Additionally, memory access
becomes more expensive because of CPU cross-chatter when validating data loads
across caches.

A potential pitfall with these numbers is that they give engineers a false
sense of security. They serve as a great conceptual aid (network/disk I/O are
expensive, memory access is _relatively_ cheap), but engineers take that to an
extreme and get lackadaisical about memory access.

When using a massive index (a btree), our search engine failed to meet its SLA
because of memory access patterns. Our engineers tried things at the system
level (NUMA policy) and the application level (different userspace memory
managers, etc.).

Ultimately, it all came down to improving the efficiency of memory access. We
used llds (Low-Level Data Structure) to get a 2x improvement in memory latency:

<https://github.com/johnj/llds>

------
dsr_
Scaling up to human timeframes, one billion to one:

Pull the trigger on a drill in your hand: 0.5 s
Pick up a drill from where you put it down: 5 s
Find the right bit in the case: 7 s
Change bits: 25 s
Go get the toolkit from the truck: 100 s
Go to the store, buy a new tool: 3,000 s
Work from noon until 5:30: 20,000 s
Part won't be in for three days: 250,000 s
Part won't be in until next week: 500,000 s
Almost four months: 10,000,000 s
8 months: 20,000,000 s
Five years: 150,000,000 s

~~~
hellerbarde
I forked the gist and did something similar. Would you mind if I took some of
your suggestions into my fork?

<https://gist.github.com/2843375>

~~~
dsr_
Fine by me.

------
jgrahamc
I believe that this originally comes from Norvig's "Teach Yourself Programming
in Ten Years" article: <http://norvig.com/21-days.html>

~~~
alecco
That's from 2011. Amazon's James Hamilton wrote about it in 2009, and it comes
from a presentation that year by Google's Jeff Dean:

[http://perspectives.mvdirona.com/2009/10/17/JeffDeanDesignLe...](http://perspectives.mvdirona.com/2009/10/17/JeffDeanDesignLessonsAndAdviceFromBuildingLargeScaleDistributedSystems.aspx)

EDIT: this is wrong, Norvig's page pre-dates Dean's presentation
[http://wayback.archive.org/web/*/http://norvig.com/21-days.h...](http://wayback.archive.org/web/*/http://norvig.com/21-days.html)

~~~
tjr
I make no claim as to who first assembled that particular table of data, but
Norvig's article is dated 2001.

~~~
alecco
EDIT: I stand corrected.

The numbers seem to have been evolving and the original source seems to be
that page.

[http://wayback.archive.org/web/*/http://norvig.com/21-days.h...](http://wayback.archive.org/web/*/http://norvig.com/21-days.html)

~~~
luckydude
This is shameless self-promotion (well, me-and-Carl promotion), but we were
measuring these sorts of things in the late 80's and wrote a paper about it
that got best paper at Usenix in '95. I think we had most of those numbers,
though not in the same format.

Pissed off the BSD folks because it made them look bad. Oh, well.

Helped make Linux better, largely because while the BSD guys refused to
engage, Linus did. He and I spent many many hours discussing what was the
right thing to measure and what should not be measured. We both felt that
lmbench would influence OS design (and it's influenced processor design, see
all the cache prefetch stuff, I'm pretty convinced that's because all the
processor people used lmbench). Linus was already on the "OS should be cheap
path" but lmbench helped him make the case to other people who wanted to add
overhead because of their pet project.

The cool part about working with Linus was he was never about making Linux
look better, he was about measuring the right things. If Linux sucked, oh,
well, he'd fix it or get someone else to fix it. Awesome attitude, I feel the
same way.

The only published work that might predate lmbench for these sorts of numbers
is Hennessy and Patterson's _Computer Architecture_. They talked about memory
latency but, so far as I recall, didn't have a benchmark. That said, that book
is friggin awesome, and anyone who cares about this sort of thing and hasn't
carefully read it is missing out.

------
peteretep
Took me forever to find this, but:

[https://plus.google.com/112493031290529814667/posts/LvhVwngP...](https://plus.google.com/112493031290529814667/posts/LvhVwngPqSC)

~~~
hellerbarde
From that link, this brilliant visualisation: <http://i.imgur.com/X1Hi1.gif>

------
yuvadam
Is a single-text-file-github-gist the best way to disseminate this piece of
knowledge (originally by Peter Norvig, BTW)?

What about a comprehensive explanation as to why those numbers actually
matter?

Meh.

~~~
willvarfar
<http://www.infoq.com/presentations/Lock-free-Algorithms> gives good on-machine
numbers early on.

I like this visualisation too: <http://news.ycombinator.com/item?id=702713>

~~~
matthavener
For anyone thinking about watching: the presentation is incredibly good. The
name is kinda misleading -- they talk more about modern x86 architecture and
actual numbers for various algorithms than about the lock-free algorithms
themselves. Both of those guys place a strong emphasis on measuring and testing
to improve performance.

------
aristus
These are good rules of thumb, but need more context. Plugging an article I
wrote about this & other things a couple of years ago for FB engineering:
<https://www.facebook.com/note.php?note_id=461505383919>

The "DELETE FROM some_table" example is bogus, but the rest is still valid.

------
some1else
John Carmack recently used a camera to measure that it takes longer to paint
the screen in response to user input than to send a packet across the Atlantic:
[http://superuser.com/questions/419070/transatlantic-ping-fas...](http://superuser.com/questions/419070/transatlantic-ping-faster-than-sending-a-pixel-to-the-screen/419167#419167)

I came across the post when I was looking for USB HID latency (8ms).

------
EternalFury
Considering that so many programmers are currently enthralled with JavaScript,
Ruby, Python and other very very high level languages, the top half of this
chart must look very mysterious and unattainable.

------
sciurus
One of my favorite writeups on this topic is Gustavo Duarte's "What Your
Computer Does While You Wait"

[http://duartes.org/gustavo/blog/post/what-your-computer-does...](http://duartes.org/gustavo/blog/post/what-your-computer-does-while-you-wait)

------
Symmetry
In practice, any out-of-order processor worth its salt ought to be able to
hide L1 cache latencies entirely.

~~~
matthavener
Agreed. I think the real lesson is: reading is 10x cheaper than branching, so
if you can do something with 10 non-branching ops it'll be just as fast as a
single mispredicted branch.

~~~
seabee
Conversely: if you can do something with a branch that's correctly predicted
90% of the time it'll be just as fast as 10 non-branching ops.

Branch prediction is a tool like any other - don't neglect it when it can help
you.
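
As a rough expected-cost check (my numbers, not from the thread): if a
correctly predicted branch costs ~1 cycle and a mispredict ~20 cycles, a branch
predicted right 90% of the time averages 0.9*1 + 0.1*20 = 2.9 cycles, so it
really can compete with a short run of branch-free ops.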

~~~
marshray
As a coder, how can I predict the branch predictor?

~~~
CamperBob2
There are definite rules, documented by the CPU vendor. A forward branch is
assumed not taken, while a backward branch is assumed to come at the end of a
loop that will probably iterate more than once. See
[http://software.intel.com/en-us/articles/branch-and-loop-reo...](http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts/)
for example.

I'd assume the particulars will vary between CPU manufacturers and families,
but the idea that backward branches will probably be taken seems fairly
universal.
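
As a concrete (hypothetical) illustration of coding to those rules: keep the
common case on the fall-through path, or tell GCC/Clang which way a branch
usually goes with __builtin_expect so the compiler lays the code out that way.
This helps code layout and the I-cache even on chips that no longer predict
statically:

    /* Hint that the error path is rare: the compiler moves it out of
       line, the common case falls through, and the loop's backward
       branch is taken on every iteration but the last. */
    #define unlikely(x) __builtin_expect(!!(x), 0)

    int sum_nonnegative(const int *data, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {     /* backward branch: taken */
            if (unlikely(data[i] < 0))    /* forward branch: not taken */
                return -1;                /* rare error path */
            sum += data[i];
        }
        return sum;
    }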

~~~
haberman
This is outdated information; Intel chips have not used static prediction for
conditional branches since NetBurst (Pentium 4).

"Pentium M, Intel Core Solo and Intel Core Duo processors do not statically
predict conditional branches according to the jump direction. All conditional
branches are dynamically predicted, even at first appearance."

\--[http://www.intel.com/content/dam/doc/manual/64-ia-32-archite...](http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf)

------
balloot
The thing here that is eye-opening to me, and relevant to any web programmer,
is that accessing something in memory on another box in the same datacenter is
about 25x as fast as accessing something on disk locally. I would not have
guessed that!

~~~
hboon
That's why tools like memcached work like they do.

------
perlpimp
Such items are important to web developers and they can use them to justify
looking at one or other technology. Or perhaps attempt at least benchmarking
and have them as one of the guides in configuring and setting up services.
Comes to mind why is Redis can be better then mongodb and in what
configuration.

As well in discussion about this and that these can be of help too.

Adding misaligned memory penalties such as on word boundary and page boundary
can enhance such document. This might be a good cheatsheet if one inclined to
research and make one.

------
stiff
I present to you Grace Hopper handing people nanoseconds out:

<http://www.youtube.com/watch?v=JEpsKnWZrJ8>

:)

------
CookWithMe
Also, these numbers don't mean much on their own. E.g., L2 cache is faster than
main memory, but that doesn't help you if you don't know how big your L2 cache
is. Same for main memory vs. disk.

E.g., I optimized a computer vision algorithm to use the L2 and L3 caches
properly (trying to reuse images, or parts of images, still in the caches).
Started off with an Intel Xeon: 256KB L2 cache, 12MB L3 cache. Moved on to an
AMD Opteron: 512KB L2 cache (yay), 6MB L3 cache (damn).

Also, the concept of the L2 cache has changed. Before multi-core, it was bigger
and was the last-level cache. Now it has become smaller, and the L3 cache is
the last-level cache, with some extra issues due to being shared with other
cores.

The important concepts every programmer should know are memory hierarchy and
network latency. The individual numbers can be looked up on a case-by-case
basis.

------
lallysingh
If this is your cup of tea, have a look at Agner Fog's resources:
<http://agner.org/optimize/>

Also, I'd have a look at Intel's VTune or the 'perf' tool that ships with the
Linux kernel.

------
SeanLuke
How is a mutex lock less expensive than a memory access? Are such things done
only in registers nowadays? This doesn't sound right.

~~~
haberman
Less expensive than a _main_ memory access. It must be that an uncontended
lock/unlock can happen in cache.

~~~
JoeAltmaier
No, definitely not. A full memory fence surrounds lock/unlock.

~~~
scott_s
I actually don't think that's true. My understanding is that on x86, atomic
instructions carry an implicit LOCK prefix. (Or you can make some instructions
atomic by putting a LOCK prefix on them.) Such instructions lock the bus and
prevent other cores or SMT threads from accessing memory. In that way, you can
safely perform an atomic operation on a value in the cache.

Note that this implies that atomic operations slow down _other_ cores and SMT
threads.

~~~
JoeAltmaier
Locks are often implemented using an xchg instruction, which is implicitly
locked.

All processors' caches are committed/flushed for the affected cache line. So
it's correct to say other processors are slowed down. But in that sense it also
IS a main memory operation, just not yours.
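
For what it's worth, a minimal sketch of such a lock in C11 atomics (mine, not
from the thread): the exchange compiles to a locked xchg on x86, and as long as
one core takes the uncontended lock repeatedly, the line can stay in that
core's cache in the Modified state; only when another core touches it does the
line have to migrate via the coherence protocol:

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct { atomic_bool locked; } spinlock_t;

    static void spin_lock(spinlock_t *l) {
        /* Atomic exchange returns the previous value, so we keep
           spinning until we observe it was previously unlocked. */
        while (atomic_exchange_explicit(&l->locked, true,
                                        memory_order_acquire))
            ;
    }

    static void spin_unlock(spinlock_t *l) {
        atomic_store_explicit(&l->locked, false, memory_order_release);
    }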

~~~
scott_s
To be clear, then we agree that haberman was correct, and the value can be
changed in cache.

~~~
JoeAltmaier
Sure. It just isn't useful in cache. If it is a real lock, it has to be
shared. Either the caches have to reconcile, or it has to go to main memory.

~~~
haberman
Just because a lock is shared does not mean that it's contended.

For example, some multi-threading techniques attempt to access only CPU-local
data but use locks purely to guard against the case where a process is moved
across CPUs in the middle of an operation (thus defeating the best-effort CPU
locality).

~~~
JoeAltmaier
But it has to be available to those multiple CPUs, right? So it has to go down
to memory and back up to cache. Contended or not.

~~~
haberman
Maybe I'm missing something, but if the cache line is only in use by one CPU,
I don't see why the value would need to be immediately propagated to main
memory or to any other CPU's cache until it is written as part of the normal
cache write-back.

~~~
gvb
Correct. Typically the cache snoops the main memory bus. If a remote CPU
starts a read on a cached memory location, the caching CPU sends a "stall" or
"retry" signal to the reader, does a cache flush to main memory, and then lets
the remote CPU proceed with the (now correct) main memory read.

------
al_james
What stands out here is how long a disk read takes (especially compared to
network latency). Indeed, disk is the new tape.

------
dockd
Does anyone feel like this is sort of an apples-to-oranges table? It compares
reading one item from L1 cache with reading 1MB from memory, without adjusting
for the amount of data being read (10^6 times more). It looks like the data was
chosen to minimize the number of digits in the right column.

------
CookWithMe
What about L3 Cache?

What about Memory Access on another NUMA Node?

What about SSD?

Does a mobile phone programmer need to know the access time for disks?

Does an embedded system programmer need to know anything of these numbers?

Every programmer should know what memory hierarchy and network latency is. (If
you learn it by looking at these numbers, fine...)

~~~
jbooth
I'm not an expert, but:

L3 is generally on the order of the same time as main memory; its main purpose
is to reduce the total number of requests in order to conserve bandwidth.

SSDs are on the order of 0.1ms, so 100,000 ns, give or take a factor of 10.

Someone smarter than me will have to answer the NUMA node question.

~~~
MichaelGG
I think L3 is much faster; perhaps 1/3rd the latency of main memory, assuming
the line is available and not in another core. Here are some numbers for L3
cache, from Intel (probably specific to the 5500 series)[1]:

    
    
      L3 CACHE hit, line unshared                 ~40 cycles
      L3 CACHE hit, shared line in another core   ~65 cycles
      L3 CACHE hit, modified in another core      ~75 cycles
      remote L3 CACHE                             ~100-300 cycles
      Local DRAM                                  ~60 ns
      Remote DRAM                                 ~100 ns
    

60ns at 2.4GHz is ~144 cycles, right?

1:
[http://software.intel.com/sites/products/collateral/hpc/vtun...](http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf)

------
bunderbunder
I find myself thinking of figures like these every time I see results for
benchmarks that barely touch the main memory brought up in debates about the
relative merits of various programming languages.

------
patrickmay
When working on low-latency distributed systems, I more than once had to remind
a client that it's a minimum of 19 milliseconds from New York to London, no
matter how fast our software might be.
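
The physics backs that up (my arithmetic, not the thread's): the great-circle
distance from New York to London is about 5,570 km, and light in vacuum covers
that in roughly 5,570 / 299,792 km/s ≈ 18.6 ms one way, so ~19 ms is the hard
floor; in fiber, where signals propagate at about 2/3 c, the practical one-way
minimum is closer to 28 ms.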

------
ryandetzel
Interesting but unnecessary for most programmers today. I'd rather my
programmers know the latency of redis vs memcached vs mysql and data type
ranges.

~~~
snotrockets
Every programmer who understands (rather than merely "knows") the numbers
mentioned in the link would know the numbers you are looking for (or at least
the relationships between them).

I'm not sure the relation holds in the opposite direction.

------
JoeAltmaier
Network transmit time is almost irrelevant. It takes orders of magnitude more
time to call the kernel, copy data, and reschedule after the operation
completes than the wire time.

This paradox was the impetus behind InfiniBand, virtual adapters, and a host
of other paradigm changes that never caught on.

~~~
luckydude
Huh. Data please.

Part of the reason I wrote lmbench was to make sure that what you are saying is
not true. And it is not, on Linux: kernel entry and exit is well under 50
nanoseconds. Passing a token back and forth, round trip, over an AF_UNIX socket
is 30 usecs. A ping over gig ether is 120 usecs.

Unless I'm completely misunderstanding, you are saying that the OS overhead
should be "orders of magnitude" more than the network time, that's not at all
what I'm seeing on Linux.

I guess what you are saying is that given an infinitely fast network, the
overhead of actually doing something with the data is going to be the
dominating term. Yeah, true, but when do we care about the infinitely fast
network in a vacuum? We always want that data to do something so we have to
pay something to actually deliver it to a user process. Linux is hands down
the best at doing so, it's not free but it is way closer to free than any
other OS I've seen.
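
A rough sketch of measuring that kernel entry/exit cost (mine; lmbench's
lat_syscall does this far more carefully). It times a trivial syscall in a
loop; syscall(SYS_getpid) is used because glibc has at times cached getpid()
in userspace:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    int main(void) {
        const long iters = 1000 * 1000;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
            syscall(SYS_getpid);  /* cheapest round trip into the kernel */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per syscall round trip\n", ns / iters);
        return 0;
    }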

~~~
Getahobby
This may be ignorant, and please correct me if I am off base, but given the
same physical medium, isn't sending 2k across the network the same cost whether
you are on fastE or gigE? Given that the network is not saturated?

~~~
luckydude
fastE is 100Mbits/sec, gigE is 1000Mbits/sec, so given the same size packet,
gigE is in theory 10x faster.

However, to make things work over copper I believe that gigE has a larger
minimum packet size, so it's not quite apples to apples on pings (latency).

For bandwidth, the max packet size (w/o non-standard jumbo frames) is the same,
around 1500 bytes, and gigE is pretty much linear: you can do 120MB/sec over
gigE (and I have many times) but only 12MB/sec over fastE.

~~~
Getahobby
If I have 1000Mbits to send, then gigE is 10 times faster. But in the article
we are transferring only 2k across the network. We're mixing latency and
bandwidth here. The latency to send 2k across an empty network isn't 10 times
greater on a fastE versus a gigE network, right?

~~~
luckydude
2K is going to be 2 packets, a full-size packet and a short one, roughly 1.5K
and .5K.

For any transfer there is the per packet overhead (running it through the
software stack) plus the time to transfer the packet.

The first packet will, in practice, transfer very close to 10x faster, unless
your software stack really sucks.

The second packet is a 1/3rd-size packet, so the overhead of the stack will be
proportionally larger.

And it matters _a lot_ if you are counting the connect time for a TCP socket.
If this is a hot potato type of test then the TCP connection is hot. If it is
connect, send the data, disconnect, that's going to be a very different
answer.

Not sure if I'm helping or not here, ask again if I'm not.
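
For concreteness (my arithmetic): the wire time for a full 1500-byte frame is
1500 * 8 = 12,000 bits, i.e. 12 usecs at gigE versus 120 usecs at fastE, so for
full packets the 10x really is on the wire; it's the fixed per-packet software
overhead that erodes the ratio for short transfers.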

------
chmike
LZ4 is faster than Zippy and much easier to use. It's a single .h and .c file.

------
hobbyist
How is mutex lock/unlock different from any other memory access?

~~~
wmf
It requires more cache-coherence traffic.

------
Morg
Someone should add basic numbers like the nanosecond cost of a modulo (~63
cycles) and that type of stuff. That would help bad devs realize why putting
another useless cmp inside a loop is dumb, and why alternating rows in a table
should NEVER be implemented with a modulo, for example.

Yes, I know that's not latency per se, but in the end it is too.

~~~
teach
I think if you're worried about whether or not you use modulo to calculate
alternating table rows (and you _don't_ work for Facebook), then you're almost
certainly optimizing prematurely.

~~~
Morg
IT DOES NOT COST MORE TIME TO CODE CORRECTLY

Some approaches are NOT acceptable. It's not about optimizing prematurely, it's
about not coding obvious crap.

While you may be used to the usual "code crap, fix later" and "waste cycles,
there are plenty of them", that doesn't mean you're right.

Everyone says it, but you're still running on C (Linux, Unix), you're still
going nuts over scaling issues (lol, NoSQL for everyone) and you're still
paying your Amazon cloud bill.

~~~
teach
I know you're ranting to the world at large, but I am _not_ "going nuts over
scaling issues". All my websites are static HTML files. I regenerate them as
needed using custom Python code and my "databases", which are text files in
JSON.

I have several sites running on a single smallest Linode, and the CPU
utilization virtually never cracks 1%.

Also, note that I am not advocating "coding crap". I'm talking about not
berating coworkers over the nanosecond cost of an extra modulo inside a loop.

~~~
Morg
If said coworkers are actually trying to improve and can take the advice
peacefully, I will deliver it peacefully.

The others I will be pleased not to work with.

------
mmukhin
2kB over 1Gbps is actually 16,000 ns: 2,000 bytes x 8 = 16,000 bits, and
16,000 bits / 10^9 bits/sec = 16 usecs (I guess they rounded up to 20,000).

