
Latency Numbers Every Programmer Should Know - pwendell
http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
======
lars
A few years ago I made a (non-interactive) visualization similar to this,
which made its way to Hacker News [1]. It puts everything on a linear scale,
so you can get a feel for the baffling difference between your most local
caches and your hard drive:

<http://i.imgur.com/X1Hi1.gif>

Note: This is a huge gif, zoom in to the top.

[1] HN discussion: <http://news.ycombinator.com/item?id=702713>

~~~
yahelc
How does an SSD compare?

~~~
whalesalad
I had a crazy thought recently that I need to test:

If you set your swap partition to a RAID array of SSDs, like two inexpensive
64GB drives... I wonder how it would perform compared to RAM.

~~~
vidarh
The limitation with a decent RAID array of SSD drives will be your controller
and/or your PCI interface. We have servers at work that get ca. 1200MB/sec
reads without any real tuning, while we get ca. 6000MB/sec from a RAM disk
(which of course still costs us the overhead of the filesystem layer).
You might be able to beat that with a suitably designed setup, but it's likely
to be expensive.

Getting 500-600MB/sec is "cheap". Getting 1200MB/sec gets a bit pricey but
isn't hard (the setups we've tested "just" involved popping 4x OCZ Vertex
drives of various models into a server with SATA III, without much regard for
anything else).

------
jeffffff
in practice instruction level parallelism and data access patterns make a big
difference. if each memory access takes 100 ns but you can do 100 at a time,
most of the time there is no practical difference from each memory access
taking 1 ns where you can only do 1 at a time. an observer (either a human or
another computer over a network) will not notice 100 ns. the numbers listed in
this chart are pure latency numbers. what most people care about when talking
about cache/memory speed is bandwidth.

on an intel sandy bridge processor, each L1 access takes 4 cycles, each L2
access takes 12 cycles, each L3 access takes about 30 cycles, and each main
memory access takes about 65 ns. assuming you are using a 3 GHz processor,
this would make you think that you can do 750 MT/s from L1, 250 MT/s from L2,
100 MT/s from L3, and 15 MT/s from RAM.

now imagine 3 different data access patterns on a 1 GB array:

1) sequentially reading through an array

2) reading through an array with a random access pattern

3) reading through an array in a pattern where the next index is determined by
the value at the current index.

if you benchmark these 3 access patterns, you will see that:

1) sequential access can do 3750 MT/s

2) random access can do 75 MT/s

3) data-dependent access can do 15 MT/s

you might guess that sequential access is fast because it is a very
predictable access pattern, but it is still 5x faster than the speed indicated
by the latency of L1. maybe you'd think it's prefetching into registers or
something? but notice the difference between random access and data-dependent
access. this is probably not what you expected at all! why is random access 5x
faster than data-dependent access? because on sandy bridge, each hyper
threading core can do 5 memory accesses in parallel. this also explains why
sequential access seems to be 5x faster than the speed of L1.

what does this mean in practice? that to do anything practical with these
latency numbers, you also need to know the parallelism of your processor and
the parallelizability of your access patterns. the pure latency number only
matters if you are limited to one access at a time.
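
to make this concrete, here's a minimal sketch of the three access patterns
(not the exact benchmark i ran; the 1 GB array size matches the description
above, but the shuffle, hash constant, and timing harness are my own):

    /* sketch of the 3 access patterns; compile with: cc -O2 bench.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <time.h>

    #define N (1u << 27)                /* 128M 8-byte slots = 1 GB */

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void) {
        uint64_t *a = malloc((size_t)N * sizeof *a);
        uint64_t sum = 0, idx = 0;
        double t;
        if (!a) return 1;

        /* sattolo-style shuffle: a becomes one big cycle, so pattern 3
           visits every slot exactly once */
        for (uint64_t i = 0; i < N; i++) a[i] = i;
        srand(42);
        for (uint64_t i = N - 1; i > 0; i--) {
            uint64_t j = (uint64_t)rand() % i;
            uint64_t tmp = a[i]; a[i] = a[j]; a[j] = tmp;
        }

        /* 1) sequential: a prefetcher-friendly linear stream */
        t = now();
        for (uint64_t i = 0; i < N; i++) sum += a[i];
        printf("sequential:     %5.0f MT/s\n", N / (now() - t) / 1e6);

        /* 2) "random": scattered addresses, but each one is known up
           front, so several loads can be in flight at once */
        t = now();
        for (uint64_t i = 0; i < N; i++) sum += a[(i * 2654435761u) & (N - 1)];
        printf("random:         %5.0f MT/s\n", N / (now() - t) / 1e6);

        /* 3) data-dependent: the next address is the value just loaded,
           so every load waits on the previous one (pointer chasing) */
        t = now();
        for (uint64_t i = 0; i < N; i++) idx = a[idx];
        printf("data-dependent: %5.0f MT/s\n", N / (now() - t) / 1e6);

        printf("(checksum to defeat dead-code elimination: %llu)\n",
               (unsigned long long)(sum + idx));
        return 0;
    }

the single-cycle shuffle matters: it keeps the data-dependent walk from
accidentally falling into a short, cache-resident loop.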

------
kqr2
Also recommend Jeff Dean's tech talk _Building Software Systems At Google and
Lessons Learned_ which references those latency numbers.

Slides: <http://research.google.com/people/jeff/Stanford-DL-Nov-2010.pdf>

Video: <http://www.youtube.com/watch?v=modXC5IWTJI>

Also, a previous thread on latency numbers:

<http://news.ycombinator.com/item?id=4047623>

~~~
colin_scott
Yuppers, idea taken directly from Jeff's talk:
<http://colin-scott.github.com/blog/2012/12/24/latency-trends/>

------
patrickwiseman
With timings that differ by several orders of magnitude, I'd just ignore the
constants between factors, as they change too frequently. Also, there is a
difference between latency and bandwidth, and the chart mixes the two
inconsistently.

CPU Cycle ~ 1 time unit: anything you do at all. The cost of doing business.

CPU Cache Hit ~ 10 time units: something located close, in time or location,
to something else that was just accessed.

Memory Access ~ 100 time units: something that has most likely been accessed
recently, but not immediately previously in the code.

Disk Access ~ 1,000,000 time units: it's been paged out to disk because it's
accessed too infrequently or is too big to fit in memory.

Network Access ~ 100,000,000 time units: it's not even here. Damn. Go grab it
from that long series of tubes. Roughly the same amount of time it takes to
blink your eye.

~~~
jewel
It's probably worthwhile to differentiate between local and remote network
access, and also between spinning disks and SSDs.

Local gigabit network ping time is 200 μs and an SSD typically reads a 4K
block in 25 μs. I imagine that converts to 200,000 time units and 25,000 time
units, respectively.
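
(That conversion assumes 1 time unit ≈ 1 ns, consistent with the parent's
scale of a memory access at 100 time units ≈ 100 ns.)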

------
LnxPrgr3
A couple nits to pick:

(1) Mutex performance isn't constant across OSes. My own measurements tell me
FreeBSD and Linux are around 20x slower than Linux at locking and unlocking a
mutex in the uncontended case.

(2) Contention makes mutexes a lot slower. If a mutex is already locked, you
have to wait until it unlocks before you can proceed. It doesn't matter how
fast mutexes are if there's a thread locking the one you're waiting on for
seconds at a time. And even if other threads aren't holding the mutex long,
putting the waiting thread to sleep involves a syscall, which is relatively
expensive.
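
For reference, a minimal sketch of the kind of uncontended-case microbenchmark
I mean (not the original test; the iteration count and harness are arbitrary):

    /* time uncontended lock/unlock pairs; compile: cc -O2 -pthread mutex.c */
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    int main(void) {
        pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
        const long iters = 10000000;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++) {
            pthread_mutex_lock(&m);    /* single thread: never contended */
            pthread_mutex_unlock(&m);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per lock/unlock pair\n", ns / iters);
        return 0;
    }

On Linux the uncontended path stays entirely in userspace (an atomic op on a
futex); the syscall only happens once a waiter actually has to sleep, which is
part of why contention costs so much more.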

~~~
LnxPrgr3
Erm… I meant FreeBSD and Mac OS X have slower mutexes than Linux. Not sure how
that sentence slipped by me.

~~~
Nick_C
Is that still the case for FreeBSD? I thought they did some work this year to
help speed it up. Also, the Linux emulation under FreeBSD was always slow for
mutex (did you test under Linux emulation or native?), but they have
implemented futexes now and pushed uncontended mutex to userspace to speed it
up.

~~~
LnxPrgr3
I tested under native, but you raise an important point—IIRC I was testing
with FreeBSD 7.1, and seeing as 9.0 is out it's definitely worth a retest.

------
dgudkov
What implication for my RoR programming am I supposed to draw from knowing
that a mutex lock/unlock takes 17ns? Especially if the web site is hosted on
Amazon? Should every programmer really know it?

~~~
HoLyVieR
The idea is more about knowing the scale of time things take, to help compare
different solutions. With those numbers you can see that it's faster to get
data from the memory of a distant server than to read the same data from disk.
In a common application, that means using disk storage is less efficient than
storing the data in a database on another server, since a database usually
keeps almost everything in memory. It also tells you a lot about why CDNs are
needed if you are serving clients worldwide. A ping from North America to
Europe will always take 100+ ms, so having geographically distributed content
is a good idea.
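
To put rough numbers on that, using the chart's canonical values: a round trip
within the same datacenter is about 500,000 ns (0.5 ms), while a single disk
seek is about 10,000,000 ns (10 ms). So asking another machine for data it
already holds in memory can be roughly 20x faster than a single local seek,
before the disk even starts transferring data.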

~~~
AnthonyMouse
>In common application that means using disk storage is less efficient that
storing it in a database on an other server, since it usually keeps almost
everything in memory.

Depends on the usage pattern. Database servers generally have more memory than
end user machines, but not proportionally more: A database serving a million
users won't have a million times more memory. (This matters primarily when
each user stores unique or low popularity data, otherwise the shared database
has the advantage.) Moreover, data stored "on disk" on user machines will,
with modern long-uptime large-memory desktops (and before long mobile
devices), have a high probability of having the data cached and retrieved from
memory rather than disk.

Thus, if the data is accessed a relatively short period of time after it is
written (i.e. within a few hours), storing it locally on disk may be faster.
(And storing it nominally on disk even though it's all cached may be
preferable to storing it in strictly memory either because some minority of
users have insufficient memory to store all necessary data in memory, or
because the data should survive a loss of power or other program termination.)

Edit: It's also worth pointing out that latency isn't the sole performance
criterion. Local disk generally has more bandwidth than the network. If you're
bandwidth-constrained (at either endpoint, the database or the user device),
or either side is charged by the byte for network access, that can be an
important consideration.

------
TallboyOne
It amazes me that something happening locally (in the ~10ms range, say a
database query) is even on the same scale as something traveling halfway
around the world. The internet wouldn't really work without this fact, but I'm
very thankful for it. Simply incredible.

~~~
justincormack
The internet would still work if you had a faster database. 19ms is your hard
drive seek; SSD is much faster. That's the big change of the last 10 years.

~~~
TallboyOne
Sorry, I meant to imply that the data travelling far distances was the most
impressive part, given that it's on the same time-scale (milliseconds) as a
transaction happening 1 inch away on your own computer. To me that's what's
most incredible about the graph. It almost seems that the traveling portion
should take 100x longer than something happening locally, but obviously this
isn't true.

~~~
zanny
Speed of light is really fast. Once you have something on a wire traveling
close to it, you are in pretty good shape time wise.

The number on a round trip is actually kind of incorrect, because the number
of routing stations it must traverse will dramatically impact the round trip
time, so it isn't just a fixed value. It is a good ballpark though. I think
the real impressive thing is how ethernet hardware has been designed well
enough to be able to process billions of packets per second where they have to
read the IP destination header, compute the next node in the transaction, and
send it off. You can often end up doing that a half dozen times in relatively
small trips but the individual time commitments are so much smaller than disk
IO :P

------
pooriaazimi
I don't understand. Why is "reading sequentially from hard disk" getting
faster? Is 2012's 5400 rpm different from 2002's 5400 rpm?!

~~~
jcampbell1
As the linear bit density increases, the read rate must increase for a
constant RPM and platter size.
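
To put illustrative numbers on it: at 5400 rpm the platter makes 90
revolutions per second, so a track holding 1 MB streams past the head at
roughly 90 MB/s; double the bits per track at the same spindle speed and the
same drive streams roughly 180 MB/s. Seek time and rotational latency are set
by the mechanics rather than the density, which is why those stay flat.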

------
Bjoern
When we talk latency, nanoseconds etc. I am always fond to remember the things
Grace Hopper had to say about it to put it into perspective.

<https://www.youtube.com/watch?v=JEpsKnWZrJ8>

~~~
SilasX
Non-video summary?

~~~
lotharbot
Without a good feel for a billion, or a billionth, it's hard to get a feel for
a nanosecond.

Visualization for a nanosecond: a piece of wire 11.8 inches long, which is the
maximum distance light/electricity can travel in 1 ns. A microsecond is a coil
of wire 984 feet long. So if you're trying to understand why it takes so dang
long to send a message via satellite or whatever, just understand that there's
a whole bunch of nano/micro/milli-seconds between here and there. This also
explains why computers must be small to be fast.

(Background: Rear Admiral Grace Hopper started programming on the Harvard Mark
I in 1944, and wrote the first ever compiler in 1952 on the UNIVAC I. She
earned the nickname "Grandma Cobol". Info via
<https://en.wikipedia.org/wiki/Grace_Hopper> , which also summarizes this
video in the "anecdotes" section.)
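
For fun, a small sketch converting a few of the chart's canonical latencies
into Hopper-style wire lengths (the latency values are the chart's; the
conversion just uses light in vacuum, ~11.8 inches per ns):

    /* latency -> "nanosecond wire" length, a la Grace Hopper */
    #include <stdio.h>

    int main(void) {
        const double ft_per_ns = 0.9836;  /* light in vacuum, ~11.8 in/ns */
        const struct { const char *name; double ns; } items[] = {
            { "L1 cache reference",           0.5 },
            { "main memory reference",      100.0 },
            { "round trip in datacenter", 500000.0 },
        };
        for (int i = 0; i < 3; i++)
            printf("%-26s %10.1f ns = %9.0f ft of wire\n",
                   items[i].name, items[i].ns, items[i].ns * ft_per_ns);
        return 0;
    }

That datacenter round trip comes out to nearly 100 miles of wire.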

~~~
kragen
It's not quite clear what the first ever compiler was, since "compiler" is
kind of a fuzzy category, but Rear Admiral Hopper might have given that credit
to Betty Holberton.

~~~
lotharbot
Interesting.

<https://en.wikipedia.org/wiki/Betty_Holberton>

>> in 1997, she received the IEEE Computer Pioneer Award from the IEEE
Computer Society for developing the sort-merge generator which, according to
IEEE, "inspired the first ideas about compilation."

Whereas Hopper coined the term "compiler" and wrote the first one that's
recognized as such:

<https://en.wikipedia.org/wiki/History_of_compiler_construction>

<https://en.wikipedia.org/wiki/A-0_programming_language>

~~~
kragen
Indeed. But A-0 isn't what we'd call a "compiler" today. It was a gradual
evolution from Holberton's work (1948? 1949?) through A-0 in 1952, up to, say,
1957, when the first optimizing FORTRAN compiler was shipped, which is
definitely what we would call a compiler today.

<https://en.wikipedia.org/wiki/History_of_compiler_writing#First_compilers>

------
yxhuvud
I take it that the mutex is uncontested? A contested one would be a lot
slower.

------
markild
I know how this works, but it's quite depressing to see the "Packet roundtrip
CA to Netherlands" unchanged throughout the years.

~~~
revelation
As [1] suggests, the best possible time for that distance would still be _82
ms_.

[1]:
<http://www.wolframalpha.com/input/?i=as+the+crow+flies+from+netherlands+to+california>

~~~
abecedarius
54 ms for neutrinos taking the chord through the Earth (using 0.22 from your
result). (That's only 4ms faster than light through vacuum along the great
circle. Disappointing.)

<http://www.wolframalpha.com/input/?i=2+*+sin+%280.5+*+0.22+*+2+pi%29+*+%28earth+radius%29+%2F+%28speed+of+light%29>
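
Spelled out, the arithmetic behind that query (0.22 turns comes from the
parent's WolframAlpha result; Earth radius and c are standard values, and
doubling gives the round trip):

    /* chord through the Earth vs. great-circle arc; compile: cc -O2 -lm */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        const double R = 6371.0;               /* mean Earth radius, km */
        const double c = 299792.458;           /* speed of light, km/s  */
        const double theta = 0.22 * 2 * M_PI;  /* central angle, rad    */

        double chord = 2 * R * sin(theta / 2); /* through the Earth */
        double arc   = R * theta;              /* along the surface */

        printf("chord %5.0f km -> round trip %4.1f ms\n",
               chord, 2 * chord / c * 1000);
        printf("arc   %5.0f km -> round trip %4.1f ms\n",
               arc,   2 * arc   / c * 1000);
        return 0;
    }

That prints roughly 54 ms for the chord and 59 ms for the great circle.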

------
gilgoomesh
I think one important latency is left out here...

Latency to screen display: approximately 70ms (between 3 and 4 display frames
on a typical LCD).

<http://en.wikipedia.org/wiki/Display_lag>

Obviously, you only have 1 display latency in your pipeline but it's still
typically the biggest single latency.
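
(At 60 Hz a single frame is ~16.7 ms, so 3-4 frames of buffering plus panel
response time lands in that 50-70 ms range.)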

~~~
colin_scott
Interesting -- good to know!

------
Aardwolf
Very nice.

It would be nice if the source code of the page were not shown in a frame
below, though. That screen space would be better used to show the rest of the
actual diagram. You can already view the source of a page in your browser, so
why waste screen space and annoying extra scrollbars on it?

~~~
colin_scott
The intention was to make it really easy to play with the parameters (note
that if you make a change to the html, it will be immediately reflected in the
screen above).

If you have any suggestions, a pull request would be much appreciated!
<https://github.com/colin-scott/interactive_latencies>

~~~
StavrosK
Oh, very interesting. You could add a button to show the frame, though, or
just push it below the fold (increase the height of the visualization div).
Vertical space isn't a scarce resource, so there's no need to hide the source
or keep the main div short.

~~~
colin_scott
Good suggestion. So, similar to this? (second tab):
<http://worrydream.com/TenBrighterIdeas/>

~~~
StavrosK
That's a very nice UI, yes.

------
jlouis
I would beware of just extrapolating numbers on a curve. There are some
physical constraints which make it impossible to improve certain numbers by
much. And sometimes there are logical constraints doing the same.

We need somewhat accurate measurements on modern hardware as a base. I would
not be surprised if some values have gotten _worse_ over the years.

~~~
colin_scott
You're absolutely right -- DRAM latencies, for example, have actually been
getting worse.

See the html at the bottom of the page for explanations of how I extrapolated.

If you have any improvements to make, a pull request would be much
appreciated! <https://github.com/colin-scott/interactive_latencies>

------
Aardwolf
I can't wait for the year 2020! 0ns reads from SSD :D

~~~
colin_scott
Fixed :)

------
yatsyk
Very useful!

What is the license of the code? I'd like to play with the visualization some
day. Some of the bars are hard to read because of their height.

~~~
colin_scott
It's licensed under BSD: <https://github.com/colin-scott/interactive_latencies>

Pull requests would be much appreciated!

------
gems
Almost everybody who has studied computer architecture, even a little, is
aware of these numbers.

This almost seems like promotion.

~~~
AgentConundrum
As with most posts, this submission is useful to the rest of us who don't have
the same background as you.

~~~
npsimons
GP has a point, though: if you got a CS degree without understanding this
stuff, your CS department failed you. Still, it is nice to review, and as you
say, good for those who don't know.

~~~
AgentConundrum
My point was that not everyone - not even everyone on HN - has a CS degree. I
don't have one, but I still found the OP interesting.

His comment made me feel like maybe he thought I should feel bad for not
knowing the information already.

------
B-Con
I have to admit that I was not expecting mutexes to be anywhere near that
fast.

~~~
npsimons
Much like the speed of spawning new processes under Linux (and other UNIXen),
I think it's something that has been very well tuned _because_ it needed to be
(i.e., it is so widely used).

------
hmsimha
I think the conversion to ms in the rightmost column is off by a factor of 10

~~~
hmsimha
It's the ns timing given that's off by a factor of 10, actually. The RTT to
the Netherlands and back definitely isn't 1.5 s.
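
(For reference, the canonical figure for the CA to Netherlands round trip is
150,000,000 ns = 150 ms; a tenfold-too-high value would render as 1.5 s.)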

~~~
colin_scott
I think it should be fixed now. Thanks!

------
jahansafd
How can we decrease the packet round trip from CA to the Netherlands?

~~~
ComputerGuru
Short of tunneling _through_ the earth or quantum entanglement, nothing much.

~~~
TillE
I thought quantum entanglement was incapable of transmitting information, for
reasons beyond my basic undergrad comprehension of quantum mechanics.

~~~
colanderman
My (ever-so-slightly more than basic undergrad) understanding is that you
could send qubits to the Netherlands, have processing be performed by a
quantum computer there, and have the result returned faster-than-light (as
qubits), but no one in the Netherlands could read your data or the result (or
even the power meter) before at least a light-speed delay.

(If I'm wrong please correct me!)

------
pithon
1000 nanoseconds is only "approximately" equal to 1 microsecond?

------
joeywas
Has the latency between CA and the Netherlands always been constant, as the
chart displays?

~~~
Guillaume86
I have only basic theoretical knowledge about networks, but I guess not,
because:

1. Adding more nodes/routes = shorter physical paths

2. Processing speed of nodes has improved over the years (?)

