Latency numbers every programmer should know

luckydude · on May 31, 2012

Most of these latencies were measured and written up for a bunch of systems by Carl Staelin and I back in the 1990's. There is a usenix paper that describes how it was done and the benchmarks are open source, you can apt-get them.

http://www.bitmover.com/lmbench/lmbench-usenix.pdf

If you look at the memory latency results carefully, you can easily read off L1, L2, L3, main memory, memory + TLB miss latencies.

If you look at them harder, you can read off cache sizes and associativity, cache line sizes, and page size.

Here is a 3-D graph that Wayne Scott did at Intel from a tweaked version of the memory latency test.

http://www.bitmover.com/lmbench/mem_lat3.pdf

His standard interview question is to show the candidate that graph and say "tell me everything you can about this processor and memory system". It's usually a 2 hour conversation if the candidate is good.

luckydude · on May 31, 2012

One other thing that is worth mentioning:

lmbench was trying to give you both bandwidths and latencies of everything in a computer system, not just memory. This is one place where it is actually worth memorizing the rough numbers, they help immensely when you are sketching out a design. lmbench tried to give you insight into latency/bw of network, disks, memory, etc.

Do you know, to the nearest order of magnitude

round trip time over the network

bandwidth over your data center's network

seek time end to end (not the silly 1/3rd seek / no rot delay)

bandwidth at the fast end

bandwidth at the slow end

memory read / write / bcopy bandwidth

random read from memory

I've sat in hundreds of design meetings where someone was claiming they'd build an XYZ server that did $BIG_NUMBER of ops/sec and I'd just stare at the ceiling, think about the network, and go "Maybe" or "not possible" based on the numbers. There must have been some time that I was wrong but I don't remember it.

It's somewhat harder today because all the big machines have multiple cores so it's not as simple as knowing what 1 core, with 1 network interface can do, but you should at least know that, you can assume linear scaling as an upper bound and have some idea of the capacity of the machine.

It's always amazed me how easy it is to get a feel for these numbers and yet many people in the biz don't take the time to do so. I can only guess they think it is harder than it actually is.

Retric · on June 1, 2012

Latency does not prevent X opps/second. It's not that hard to build a server that does a simple transaction 100,000 times a second. Keeping synchronized across several machines at that throughput level can be next to impossible.

ajross · on May 31, 2012

Pedantic quip: I have a hard time believing you guys were measuring half nanosecond cache latencies on a machine with a 100MHz clock. :)

And actually the cache numbers seem optimistic, if anything. My memory is that a L1 cache hit on SNB is 5 cycles, which is 2-3x as long as that table shows.

luckydude · on May 31, 2012

We didn't believe it either until we put a logic analyzer on the bus and found that the numbers were spot with respect to the number of cycles. I don't remember how far off they were but it wasn't much, all the hardware dudes were amazed that software could get that close.

tl;dr: the numbers were accurate to the # of cycles, might have been as much as 1/2 of 1 cycle off.

Edit: I should add this was almost 20 years ago, I dunno how well it works today. Sec, lemme go test on a local machine.

OK, I ran on a Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz (I think that's a Sandy Bridge) that is overclocked to 4289 MHz according to mhz, and it looks to me like that machine takes 4 cycles to do a L1 load. That sound right? lmbench says 4.05 cycles.

I poked a little more and I get

L1 4 cycles, ~48K L2 12 cycles, ~256K L3 16 cycles, ~6M

Off to google and see how far off I am. Whoops, work is calling, will check back later.

WayneS · on June 1, 2012

You realize he is measuring averages right? Setup 1 million L1 cache hits, divide by total time...

notaddicted · on May 31, 2012

That mem_lat3 is pretty interesting, is there more info about it anywhere? Do you know what "Latency" refers to? Is that for Sandy Bridge consumer quad core? (hence ending at 8MB since that is L3$ size...)?

luckydude · on May 31, 2012

That data is old, it's when Wayne was still working at Intel, he's been working here since 2001, so I'm guessing that is a pentium pro or similar.

I'll ping him and he'll probably answer tomorrow, it's late in his day in Indiana.

WayneS · on June 1, 2012

This is the original Pentium Pro with the full speed L2. Don't forget the TLB when parsing the data.

alecco · on May 31, 2012

Interesting. What do you think of Agner Fog's work?

luckydude · on May 31, 2012

Wasn't aware of it, so thanks for the question. I took a quick look and it looks like Agner is using performance counters, which, while being more accurate, are platform specific. One of the goals of lmbench was to be accurate across different platforms and CPU architectures. So it's all userland software only.

You can get pretty far that way, Carl wrote a userland tool called mhz that has been working without modification since around 1998 (when the fastest clock rate was 600mhz on an alpha). Still pretty accurate to this day when CPUs have clock rates 7x faster and fairly different architectures.

Carl wrote up how it works, might be interesting to people who think about CPUs: http://www.bitmover.com/lmbench/mhz-usenix.pdf

larsberg · on May 31, 2012

Honestly, I'd rather programmers know how to _measure_ these numbers than just have them memorized.

I mean, if I told them that their machine had L3 cache now, what would they do find out how that changes things? (This comment is also a shameless plug for the fantastic CS:APP book out of CMU).

memset · on May 31, 2012

Honest question: how would you measure an L2 cache lookup? (What program would I need to write which ensures that a value is stored in L2, such that when I read it later, I would know the lookup time to indeed be the L2 time?)

At that, how would one measure this kind of thing at the nanoseconds level? Would using C's clock_gettime() functions be good enough? Is there any other facility to count cycles which have passed between two operations?

Swizec · on May 31, 2012

cracks knuckles Let's see if I remember enough of what I'm supposed to know in an exam in a couple of months or so.

You need to know the size of a page in your L1 cache. Then you can write a program that goes through a big enough chunk of memory that an "out of space" error occurs predictably in L1 cache, but not in L2 cache.

That way you can know when specifically your program went to L2 cache to swap out a page of L1 cache when it was needed.

The caveat being that you can probably only do this well enough in some sort of assembler code (for your architecture) and that you would have to be running a single-process system without interrupts enabled. Otherwise all sorts of things can mess up your cache lookups.

Tuna-Fish · on May 31, 2012

That's actually not good enough, because modern CPUs have prefetchers, so if you just access a 33kB chunk sequentially, it will still almost always be from L1 (the prefetcher proabably uses misses to train, so you'll have a few misses before it catches up), even if the L1 is just 32kB.

To predictably miss caches you need to have a random access pattern.

luckydude · on May 31, 2012

Well, back in the early 90's, HP RISC was the only cache controller we found that had cache line prefetch. If you keep increasing your stride you eventually get to a big enough one that you are skipping enough cache lines that they stopped fetching.

We switched to going backwards, that fooled them for a while but now most cache controllers seem to detect forwards, backwards, and uniform strides, which is actually pretty neat.

So yeah, today it would seem that random is what you need.

MichaelGG · on May 31, 2012

What you could do is write a program that uses increasing amounts of memory. Measure the time it takes to complete a few million iterations of some function that accesses that memory. As the total usage increases, you'll see steps in the timing. For example, you might see a jump at ~32K, then ~2MB, then ~8MB.

comatose_kid · on May 31, 2012

I think you'd just need to access memory in increasingly larger increments (eg, sequential reads of 32k, 64k, ...2x size of l2 cache). But there's more to it as the L2 cache is often shared amongst cores in a multi-cpu system (where as the L1 is per-core). Also, the organization of the cache (eg, n-way set associative) means that you'd want to vary the size of the jump (stride) for your reads/writes and see how that also affects throughput.

MichaelGG · on May 31, 2012

Yea. All this and more is covered in the fantastic paper by Ulrich Drepper: What Every Programmer Should Know About Memory[1]. Was very enlightening.

1: http://www.akkadia.org/drepper/cpumemory.pdf

cube13 · on May 31, 2012

>But there's more to it as the L2 cache is often shared amongst cores in a multi-cpu system (where as the L1 is per-core)

As comatose_kid notes, this is highly dependent on CPU architecture. Intel's architectures from Nehalem to the current Ivy Bridge have smaller per-core L2 caches, and a shared L3 cache across all cores.

WayneS · on June 1, 2012

You can't just access memory in that pattern because an out of order processor will just issuse the next N loads in parallel. Instead lmbench sets up a linked list in memory ans walks the list. That way the address for each load is not known until the previous load completes. -Wayne

ob · on May 31, 2012

> Honest question: how would you measure an L2 cache lookup?

Like this: https://github.com/ob/cache/blob/master/cache.c

AlisdairO · on May 31, 2012

I don't know how you actually measure how long a lookup takes, but just FYI if you're interested in the number of lookups you're doing from the various caches (and in things like number of missed branch predictions you're causing), you can use OProfile or Cachegrind. OProfile is a bit more complicated to set up because the flags you need to set are processor-specific, but it causes much less slowdown.

WayneS · on June 1, 2012

I love performance counters and good profile tools. But the whole point of knowning this stuff is to get a handle on expected performance before you have written a line of code. Then you can make good decisions about what to do. Later the profile tools can help you fix what is broken when your code doesn't match expectations.

nhebb · on May 31, 2012

+1 to CS: APP (Computer Systems: A Programmer's Perspective, by Bryant and O'Hallaron). I thought Computer Architecture: A Quantitative Approach by Hennessey was dry and better suited to EE's. CS: APP, on the other hand, really sucked me in - and I'm not the kind of guy that typically gets enthralled by CS texts.

kjhughes · on May 31, 2012

Measure vs memorize misses the point.

It's understanding the relative order of magnitude differences between the activities that's important.

larsberg · on May 31, 2012

My problem with the memorization approach is that the technology is still moving quickly enough that the orders of magnitude still "wiggle," particularly in the middle numbers.

For example, at a glance, the disk seek numbers are off by orders of magnitude for SSD drives. And the 1MB of RAM read numbers appear not to have used the SSE streaming load/store instructions. I don't know anything about datacenters, but I'd certainly want to measure that one myself before building a system assuming that number was in its correct place in the hierarchy.

apaprocki · on May 31, 2012

DTrace lets you measure L# cache hits/misses along with lots of other useful things.

adobriyan · on May 31, 2012

It lets you measure a number of cache hits, not the latency.

joe_the_user · on June 1, 2012

I've love to able to either measure or use all this lovely information in my running Linux desktop GUI app. But I happen to know that would probably not be worth my time except some rather specific cases. Plus the direction of PC architecture seems to be towards more and more levels of cache and that implies more and more work if you actually take these into account.

Just finding the time that library calls take is quite a pain in Linux.

Given all this, I'm inclined to think that I'd rather architect cache-oblivious data structures than root-around endless when background threads and other stuff suddenly get slow...

kjhughes · on May 31, 2012

Anyone who hasn't heard Rear Admiral Grace Murray Hopper describe a nanosecond should check out her classic explanation:

http://www.youtube.com/watch?v=JEpsKnWZrJ8

zorkian · on May 31, 2012

This reminds me of one of the pages that Google has internally that, very roughly, breaks down the cost of various things so you can calculate equivalencies.

As an example of what I mean (i.e., these numbers and equivalencies are completely pulled out of thin-air and I am not asserting these in any way):

  * 1 Engineer-year = $100,000
  * 25T of RAM = 1 Engineer-week
  * 1ms of display latency = 1 Engineer-year

This allows engineeers to calculate tradeoffs when they're building things and to optimize their time for business impact. E.g.: it's not worth optimizing memory usage by itself, Latency is king, Don't waste your time shaving yaks, etc etc.

Morg · on June 1, 2012

Yet everyone uses much slower RAM in servers and will likely continue to do so, all the while caches swell, etc.

Optimizing memory usage is almost irrelevant today, until it starts being a bandwidth problem, and that's still solvable but only through complex scaling strategies that also cost several engineer-years.

zippie · on May 31, 2012

These numbers by Jeff Dean are relatively true but need to be refreshed for modern DRAM modules & controllers. Specifically, the main memory latency numbers are more applicable to DDR2 RAM vs the now widely deployed DDR3/DDR4 RAM (more channels = more latency). This has been a industry trend for a while and theres no change on the horizon. Additionally, memory access becomes more expensive because of CPU cross chatter when validating data loads across caches.

A potential pitfall with these numbers is they give engineers a false sense of security. They serve as a great conceptual aid - network/disk I/O are expensive and memory access is relatively cheap but engineers take that to an extreme, and get lackadaisical about memory access.

When utilizing a massive index (btree) our search engine failed to meet SLA because of memory access patterns. Our engineers tried things at the system (numa policy) and application level (different userspace memory managers, etc.)

Ultimately, it all came down to improving the efficiency around memory access. We used Low-Level Data Structure to get the 2x improvements in memory latency:

https://github.com/johnj/llds

dsr_ · on May 31, 2012

Scaling up to human timeframes, one billion to one:

Pull the trigger on a drill in your hand 0.5s Pick up a drill from where you put it down 5s Find the right bit in the case 7s Change bits 25s Go get the toolkit from the truck 100s Go to the store, buy a new tool 3000s Work from noon until 5:30 20000s Part won't be in for three days 250000s Part won't be in until next week 500000s Almost four months 10000000s 8 months 20000000s Five years. 150000000s

hellerbarde · on May 31, 2012

I forked the gist and did something similar. Would you mind if I took some of your suggestions into my fork?

https://gist.github.com/2843375

dsr_ · on May 31, 2012

Fine by me.

CamperBob2 · on May 31, 2012

This is the same basic advice that I try to keep in mind when writing English text. A word that the reader already knows is in "L1." A word they have to stop and think about is an L1 miss. If they have to reach for the dictionary, it's an L2 miss -- and probably several more cycles for a line fill, as they get distracted reading the next few entries in the dictionary.

As programmers, Strunk & White might have made the list of the all-time greats.

jgrahamc · on May 31, 2012

I believe that this originally comes from Norvig's "Teach Yourself to Program in Ten Years" article: http://norvig.com/21-days.html

alecco · on May 31, 2012

That's from 2011. Amazon's James Hamilton wrote about it in 2009, and it comes from a presentation that year by Google's Jeff Dean:

http://perspectives.mvdirona.com/2009/10/17/JeffDeanDesignLe...

EDIT: this is wrong, Norvig's page pre-dates Dean's presentation http://wayback.archive.org/web/*/http://norvig.com/21-days.h...

tjr · on May 31, 2012

I make no claim as to who first assembled that particular table of data, but Norvig's article is dated 2001.

alecco · on May 31, 2012

EDIT: I stand corrected.

The numbers seem to have been evolving and the original source seems to be that page.

http://wayback.archive.org/web/*/http://norvig.com/21-days.h...

luckydude · on June 1, 2012

This is shameless self promotion (well, me and Carl promotion) but we were measuring these sorts of things in the late 80's and wrote a paper about it that got best paper at Usenix in 95. I think we had most of those numbers, not in the same format.

Pissed off the BSD folks because it made them look bad. Oh, well.

Helped make Linux better, largely because while the BSD guys refused to engage, Linus did. He and I spent many many hours discussing what was the right thing to measure and what should not be measured. We both felt that lmbench would influence OS design (and it's influenced processor design, see all the cache prefetch stuff, I'm pretty convinced that's because all the processor people used lmbench). Linus was already on the "OS should be cheap path" but lmbench helped him make the case to other people who wanted to add overhead because of their pet project.

The cool part about working with Linus was he was never about making Linux look better, he was about measuring the right things. If Linux sucked, oh, well, he'd fix it or get someone else to fix it. Awesome attitude, I feel the same way.

The only published work that might predate lmbench for these sorts of numbers is Hennessy and Patterson computer architecture. They talked about memory latency but so far as I recall, didn't have a benchmark. That said, that book is friggin awesome and anyone who cares about this sort of thing and hasn't carefully read that book is missing out.

peteretep · on May 31, 2012

Took me forever to find this, but:

https://plus.google.com/112493031290529814667/posts/LvhVwngP...

hellerbarde · on May 31, 2012

From that link, this brilliant visualisation: http://i.imgur.com/X1Hi1.gif

yuvadam · on May 31, 2012

Is a single-text-file-github-gist the best way to disseminate this piece of knowledge (originally by Peter Norvig, BTW)?

What about a comprehensive explanation as to why those numbers actually matter?

Meh.

willvarfar · on May 31, 2012

http://www.infoq.com/presentations/Lock-free-Algorithms early-on gives good numbers on-machine.

I like this visualisation too: http://news.ycombinator.com/item?id=702713

matthavener · on May 31, 2012

For anyone thinking about watching: the presentation is incredibly good. The name is kinda misleading -- they talk more about modern x86 architecture and actual numbers of various algorithms than the lock free algorithms themselves. Both of those guys have a high emphasis on measuring and testing to improve performance.

aristus · on May 31, 2012

These are good rules of thumb, but need more context. Plugging an article I wrote about this & other things a couple of years ago for FB engineering: https://www.facebook.com/note.php?note_id=461505383919

The "DELETE FROM some_table" example is bogus, but the rest is still valid.

some1else · on May 31, 2012

John Carmack recently used a camera to measure that it takes longer to paint the screen in response to user input, than send a packet accross the Atlantic: http://superuser.com/questions/419070/transatlantic-ping-fas...

I came across the post when I was looking for USB HID latency (8ms).

EternalFury · on May 31, 2012

Considering that so many programmers are currently enthralled with JavaScript, Ruby, Python and other very very high level languages, the top half of this chart must look very mysterious and unattainable.

sciurus · on May 31, 2012

One of my favorite writeups on this topic is Gustavo Duarte's "What Your Computer Does While You Wait"

http://duartes.org/gustavo/blog/post/what-your-computer-does...

Symmetry · on May 31, 2012

In practice any out of order processor worth its salt ought to be able to entirely hide L1 cache latencies.

matthavener · on May 31, 2012

Agreed, I think the real lesson is: reading is 10x cheaper than branching, so if you can do something with 10 non-branching ops it'll be just as fast as a single branch.

seabee · on May 31, 2012

Conversely: if you can do something with a branch that's correctly predicted 90% of the time it'll be just as fast as 10 non-branching ops.

Branch prediction is a tool like any other - don't neglect it when it can help you.

marshray · on May 31, 2012

As a coder, how can I predict the branch predictor?

CamperBob2 · on May 31, 2012

There are definite rules, documented by the CPU vendor. A forward branch is assumed not taken, while a backward branch is assumed to come at the end of a loop that will probably iterate more than once. See http://software.intel.com/en-us/articles/branch-and-loop-reo... for example.

I'd assume the particulars will vary between CPU manufacturers and families, but the idea that backward branches will probably be taken seems fairly universal.

haberman · on June 1, 2012

This is outdated information; Intel chips have not used static prediction for conditional branches since NetBurst (Pentium 4).

"Pentium M, Intel Core Solo and Intel Core Duo processors do not statically predict conditional branches according to the jump direction. All conditional branches are dynamically predicted, even at first appearance."

--http://www.intel.com/content/dam/doc/manual/64-ia-32-archite...

gcp · on May 31, 2012

Now extend that to making memory reads predictable and prefetchable, and RAM is suddenly also free.

luckydude · on May 31, 2012

The lmbench loop for measuring memory latency is

while (n--)p = *p;

Out of order does not help, it can't. That was the key insight in that benchmark.

jedbrown · on May 31, 2012

Depends on data dependencies and the number of registers, but yes, L1 latency is rarely the bottleneck (more likely to be load/store throughput, in-register shuffles, or arithmetic dependencies for such tight kernels).

balloot · on May 31, 2012

The thing here that is eye opening to me, and relevant to any web programmer, is that accessing something from memory from another box in the same datacenter is about 25x times as fast as accessing something from disk locally. I would not have guessed that!

hboon · on June 1, 2012

That's why tools like memcached work like they do.

perlpimp · on May 31, 2012

Such items are important to web developers and they can use them to justify looking at one or other technology. Or perhaps attempt at least benchmarking and have them as one of the guides in configuring and setting up services. Comes to mind why is Redis can be better then mongodb and in what configuration.

As well in discussion about this and that these can be of help too.

Adding misaligned memory penalties such as on word boundary and page boundary can enhance such document. This might be a good cheatsheet if one inclined to research and make one.

stiff · on May 31, 2012

I present to you Grace Hopper handing people nanoseconds out:

http://www.youtube.com/watch?v=JEpsKnWZrJ8

:)

CookWithMe · on May 31, 2012

Also, these numbers don't mean much on their own. E.g. L2 Cache is faster than main memory, but that doesn't help you if you don't know how big your L2 Cache is. Same for main memory vs. disc.

E.g. I optimized a computer vision algorithm for using L2 and L3 caches properly (trying to reuse images or parts of images still in the caches). Started off with an Intel Xeon: 256KB L2 Cache, 12MB L3 Cache. Moved on to an AMD Opteron: 512KB L2 Cache (yay), 6MB L3 Cache (damn).

Also, the concept of the L2 Cache has changed. Before multi-cores it was bigger and the last-level-cache. Now it has become smaller and the L3 Cache is the last-level-cache, but has some extra issues due to the sharing with other cores.

The important concepts every programmer should know are memory hierarchy and network latency. The individual numbers can be looked up on a case-by-case basis.

lallysingh · on May 31, 2012

If this is your cup of tea, have a look at Agner Fog's resources: http://agner.org/optimize/

Also, I'd have a look at Intel's VTune or the 'perf' tool that ships with the linux kernel.

SeanLuke · on May 31, 2012

How is a mutex lock less expensive than a memory access? Are such things done only in registers nowadays? This doesn't sound right.

a-priori · on May 31, 2012

I assume that the mutex's memory is warmed, so the processor doesn't have to go to RAM. But, it does have to synchronize with any other processors/cores in the machine to prevent them from locking the mutex at the same time.

In a Intel Nahalem or Sandy Bridge system, this goes over the QuickPath Interconnect which has a latency of ~20ns. HyperTransport fills the same role in AMD systems, and probably has a similar latency, but I don't have numbers for that.

I'm basing this on this presentation, especially the architecture diagram at 2m 40s:

http://www.infoq.com/presentations/Lock-free-Algorithms

haberman · on May 31, 2012

Less expensive than a main memory access. It must be that an uncontended lock/unlock can happen in cache.

JoeAltmaier · on May 31, 2012

No definitely not. A full memory fence surrounds lock/unlock.

scott_s · on May 31, 2012

I actually don't think that's true. My understanding is that on x86, atomic instructions have implicit lock instructions before them. (Or you can make some instructions atomic by putting a lock instruction before them.) Such instructions lock the bus and prevent other cores or SMT threads from accessing memory. In that way, you can safely perform an atomic operation on a value in the cache.

Note that this implies that atomic operations slow down others cores and SMT threads.

JoeAltmaier · on May 31, 2012

Locks are often implemented using an xchg instruction, which is implicitely locked.

All processor's caches are committed/flushed for the affected cache line. So its correct to say other processors are slowed down. But it also in that sense IS a main memory operation, just not yours.

scott_s · on May 31, 2012

To be clear, then we agree that haberman was correct, and the value can be changed in cache.

JoeAltmaier · on June 1, 2012

Sure. It just isn't useful in cache. If it is a real lock, it has to be shared. Either the caches have to reconcile, or it has to go to main memory.

haberman · on June 1, 2012

Just because a lock is shared does not mean that it's contended.

For example, some multi-threading techniques attempt to access only CPU-local data but use locks purely to guard against the case where a process is moved across CPUs in the middle of an operation (thus defeating the best-effort CPU-locality).

JoeAltmaier · on June 1, 2012

But it has to be available to those multiple cpus, right? So it has to go down to memory and back up to cache. Contended or not.

haberman · on June 1, 2012

Maybe I'm missing something, but if the cache line is only in use by one CPU, I don't see why the value would need to be immediately propagated to main memory or to any other CPU's cache until it is written as part of the normal cache write-back.

gvb · on June 1, 2012

Correct. Typically the cache snoops the main memory bus. If a remote CPU starts a read on a cached memory location, the caching CPU sends a "stall" or "retry" signal to the reader, does a cache flush to main memory, and then lets the remote CPU proceed with the (now correct) main memory read.

scott_s · on June 1, 2012

It's useful for fields (not just locks; I'm thinking of lock-free algorithms that do a compare-and-swap directly on a value) that have low contention.

JoeAltmaier · on June 1, 2012

Even uncontended, the value has to be reconciled between cpus/caches. So it has to be plumbed down thru all the caches, then back up.

scott_s · on June 1, 2012

That's my point: uncontended values may not be in any other caches.

al_james · on May 31, 2012

What stands out here is how long a disk read takes (especially compared to network latency). Indeed, disk is the new tape.

dockd · on May 31, 2012

Does anyone feel like this is sort of an apples to oranges table? It compares reading one item from L1 cache to reading 1M byte from memory, without adjusting for the amount of data being read (10^6 more). It looks like the data was chosen to minimize the number of digits in the right column.

CookWithMe · on May 31, 2012

What about L3 Cache?

What about Memory Access on another NUMA Node?

What about SSD?

Does a mobile phone programmer need to know the access time for disks?

Does an embedded system programmer need to know anything of these numbers?

Every programmer should know what memory hierarchy and network latency is. (If you learn it by looking at these numbers, fine...)

jbooth · on May 31, 2012

I'm not an expert, but:

L3 is generally on the order of the same time as main memory - it's main purpose is to reduce the total amount of requests in order to conserve bandwidth

SSDs are on the order of 0.1ms, so 100,000 ns, give or take a factor of 10.

Someone smarter than me will have to answer the NUMA node question.

MichaelGG · on May 31, 2012

I think L3 is much faster; perhaps 1/3rd of main memory, assuming the line is available and not in another core. Here are some numbers for L3 cache, from Intel (probably specific to the 5500 series)[1]:

  L3 CACHE hit, line unshared ~40 cycles 
  L3 CACHE hit, shared line in another core ~65 cycles 
  L3 CACHE hit, modified in another core ~75 cycles 
  remote L3 CACHE ~100-300 cycles 
  Local Dram ~60 ns 
  Remote Dram ~100 ns

60ns at 2.4GHz is ~144 cycles, right?

1: http://software.intel.com/sites/products/collateral/hpc/vtun...

CookWithMe · on May 31, 2012

Sorry, these were rhetorical questions :)

I was trying to make the point that these numbers are somewhat arbitrary (i.e. why do I need to know the access speed to disc, when I keep everything in memory on a NUMA system?) and don't apply to all programmers (e.g. embedded systems may not have discs, L2 Caches or internet access).

bunderbunder · on May 31, 2012

I find myself thinking of figures like these every time I see results for benchmarks that barely touch the main memory brought up in debates about the relative merits of various programming languages.

patrickmay · on May 31, 2012

When working on low latency distributed systems I more than once had to remind a client that it's a minimum of 19 milliseconds from New York to London, no matter how fast our software might be.

ryandetzel · on May 31, 2012

Interesting but unnecessary for most programmers today. I'd rather my programmers know the latency of redis vs memcached vs mysql and data type ranges.

snotrockets · on May 31, 2012

Every programmer that understands (rather than "knows") the number mentioned in the link would know the numbers you are looking for (or at least, the relationships between those.)

I'm not sure this relation holds for the opposite direction.

JoeAltmaier · on May 31, 2012

Network transmit time is almost irrelevant. It takes orders of magnitude more time to call the kernel, copy data, and reschedule after the operation completes than the wiretime.

This paradox was the impetus behind Infiniband, virtual adapters, and a host of other paradigm changes that never caught on.

luckydude · on May 31, 2012

Huh. Data please.

Part of the reason I wrote lmbench was to make sure that what you are saying is not true. And it is not in Linux, kernel entry and exit is well under 50 nanoseconds. Passing a token back and forth, round trip, in an AF_UNIX socket is 30 usecs. A ping over gig ether is 120 usecs.

Unless I'm completely misunderstanding, you are saying that the OS overhead should be "orders of magnitude" more than the network time, that's not at all what I'm seeing on Linux.

I guess what you are saying is that given an infinitely fast network, the overhead of actually doing something with the data is going to be the dominating term. Yeah, true, but when do we care about the infinitely fast network in a vacuum? We always want that data to do something so we have to pay something to actually deliver it to a user process. Linux is hands down the best at doing so, it's not free but it is way closer to free than any other OS I've seen.

JoeAltmaier · on June 1, 2012

Linux sounds fast. 50ns is blazingly fast. Benchmarks I surveyed show figures around 1-10usec but they are old and processors are faster now.

Kernel entry and exit are of course contributors to i/o overhead. Also include data copy time (2K times all those main memory flushes), ip stack time (well into scores of usec now) and driver overhead.

Add the latency of the result being signalled to your user-mode application. Interrupt latency, user-mode task scheduling time and if a receive then data copy time again.

Of course router delays are negligible on the backbone, but your local cable modem etc will add something.

I think we're over 16usec now, which if I did the math right is the wire time.

Another way to estimate all this is to benchmark achieved transfer rate peer-to-peer on an otherwise idle link. Folks report from 100mbit to 300mbit depending on other bottlenecks (disk speed, bus etc), but that's often using very large block sizes, not our 2K. Even so we see most of the Gigabit rate whittled away.

Getahobby · on May 31, 2012

This may be ignorant and please correct if I am off base but given the same physical medium isn't sending 2k across the network the same cost whether you are at fastE or gigE? Given the network is not saturated?

luckydude · on June 1, 2012

fastE is 100Mbits/sec, gigE is 1000Mbits/sec, so given the same size packet, gigE is in theory 10x faster.

However, to make things work over copper I believe that gigE has a larger minimum packet size so it's not quite apples to apples on pings (latency).

For bandwidth, the max size (w/o non-standard jumbo grams), is the same, around 1500 bytes, and gigE is pretty much linear, you can do 120MB/sec over gigE (and I have many times) but only 12MB/sec over fastE.

luckydude · on June 1, 2012

Some day I need to do a post about what I learned from SGI's numa interconnect. I used to think big packets are good, that interconnect taught me that bigger is not better.

Getahobby · on June 1, 2012

If I have 1000Mbits to send then gigE is 10 times faster. But in the article we are transferring only 2k across the network. Mixing latency and bandwidth here. The latency to send 2k across an empty network isn't 10 times greater on a fastE versus gigE network, right?

luckydude · on June 1, 2012

2K is going to be 2 packets, a full size and and a short packet, roughly 1.5K and .5K.

For any transfer there is the per packet overhead (running it through the software stack) plus the time to transfer the packet.

The first packet will, in practice, transfer very close to 10x faster, unless your software stack really sucks.

The second packet is a 1/3rd size packet so the overhead of the stack will be proportionally larger.

And it matters _a lot_ if you are counting the connect time for a TCP socket. If this is a hot potato type of test then the TCP connection is hot. If it is connect, send the data, disconnect, that's going to be a very different answer.

Not sure if I'm helping or not here, ask again if I'm not.

Peaker · on May 31, 2012

Infiniband has caught on, at least in the high-end.

chmike · on June 1, 2012

Lz4 is faster than zippy and much easier to use. It's a single .h .c file.

hobbyist · on May 31, 2012

How is mutex lock/unlock different from any other memory access?

wmf · on May 31, 2012

It requires more cache coherence.

Morg · on May 31, 2012

Someone should add basic numbers like ns count for 63 cycles modulo and that type of stuff - That'll help bad devs realize why putting another useless cmp inside a loop is dumb, and why alt rows in a table should NEVER be implemented by use of a modulo, for example.

Yes I know that's not latency per se but in the end it is too.

teach · on May 31, 2012

I think if you're worried about whether or not you use modulo to calculate alternating table rows (and you don't work for Facebook), then you're almost certainly optimizing prematurely.

jeltz · on May 31, 2012

Actually it does not matter if you are facebook or not. What really matters is how tight the loop is and how much time is spent in it.

EDIT: I agree with Morg. If coding right also results in faster code there is no reason not to do that.

Morg · on May 31, 2012

IT DOES NOT COST MORE TIME TO CODE CORRECTLY

Some approaches are NOT acceptable, it's not about optimizing prematurely, it's about coding obvious crap.

While you may be used to the usual "code crap, fix later" and "waste cycles, there are too many of it" , it doesn't mean you're right.

Everyone says it but you're still running on C (linux, unix), you're still going nuts over scaling issues (lol nosql for everyone) and you're still paying your amazon cloud bill.

teach · on May 31, 2012

I know you're ranting to the world at large, but I am not "going nuts over scaling issues". All my websites are static HTML files. I regenerate them as needed using custom Python code and my "databases", which are text files in JSON.

I have several sites running on a single smallest Linode, and the CPU utilization virtually never cracks 1%.

Also, note that I am not advocating "coding crap". I'm talking about not berating coworkers over the nanosecond cost of an extra modulo inside a loop.

Morg · on May 31, 2012

If said coworkers are actually trying to improve and can take the advice peacefully, I will deliver it peacefully.

The others I will be pleased not to work with.

hythloday · on May 31, 2012

Quite to the contrary, the optimization of using any particular method to colour rows is so tiny it can easily be outweighed over its lifetime by the 50 or so extra keystrokes it needs to type. That's how trivial this is (which is why people are reacting to your extremely aggressive tone).

Morg · on June 1, 2012

Indeed, I should drop the agression. However, the subject is not optimization but coding correctly in the first place.

And the anti-optimization argument would be correct if: -typing represented more than 1% of dev work -code was never reused -code was never massively used -code had a short lifespan

So let me help you see clearly: -I'm not a typist -Every bad code tutorial out there creates millions of code bits that contain the N times slower version, with an aggregate impact that actually matters -Any 10% opt mistake in a codebase like iptables would cause more carbon than you can imagine -Fortran is still in use because it's the fastest language there is with the best math libraries.

Those seem to be eternal so far, and C seems to remain the only other relevant language throughout the short history of coding.

Sure, there are much more problematic cases than the dumb even odd example, but I picked that one because many would recognize it.

mseebach · on May 31, 2012

It does cost considerable time and brain bandwidth to learn to "code correctly" if coding correctly means knowing how to avoid every excess few nanoseconds.

If your code is expressive, easy to reason about and fast enough, then less expressive, harder to reason about and even faster code isn't more correct.

njs12345 · on May 31, 2012

Or, you know, just get a decent compiler: http://publications.csail.mit.edu/lcs/pubs/pdf/MIT-LCS-TM-60...

Morg · on May 31, 2012

I suppose you are referring to the very particular case of the right shift, but as much as that's easily predictable, it's a corner case.

Who knows maybe the trend will be 3 colors instead of two. Or maybe it'll be another instruction that's wrongly abused. Or another compiler that actually sucks, like most JS interpreters.

The idea really is to use the simplest logical approach to the problem rather than the wrong one.

In the very well known case of the alt row table, it looks to me like we're alternating odd and even, why not just code that to start with, before any optimization ?

njs12345 · on May 31, 2012

No, this form of strength reduction can often eliminate modulo operations in a loop even when the modulus is not a constant. The example given is:

  for(t = 0; t < T; t++)
    for(i = 0; i < NN; i++)
      A[i%N] = 0;

which is optimised to this, without a modulo in sight:

  _invt = (NN-1)/N;
  for(t = 0; t <= T-1; t++) {
    for(_Mdi = 0; _Mdi <= _invt; _Mdi++) {
      _peeli = 0;
      for(i = N*_Mdi; i <= min(N*_Mdi+N-1,NN-1); i++) {
        A[_peeli] = 0;
        _peeli = _peeli + 1;
      }
    }
  }

I find the modulo easier to read in this case, but I guess that's a question of taste. It's certainly not 'wrong' to use a modulo, and probably worth the trade off in most cases if it makes your code clearer.

Morg · on May 31, 2012

Yes, sometimes the compiler can compensate bad decisions from the programmer, the jvm can collect your garbage etc. - none of these will save you from stupid data models and idiotic objects.

recursive · on May 31, 2012

How should they be implemented?

And per se should NEVER be spelled "per say".

Morg · on May 31, 2012

Indeed it should never be spelled wrong, as it means in itself in latin, my bad really.

Alt rows are a simple concept, the first row is odd, the next is even, etc.

A good step forward is an if/then/else or a switch or an unrolled loop - a huge step forward in terms of performance too, as a mod takes 63 cycles and a cmp takes almost nothing.

an example could be

rowClass='even'; loop if(rowClass=='odd'){ rowClass='even'; }else{ rowClass='odd'; } endloop

snotrockets · on May 31, 2012

Unrolling loops is premature optimization, and we all should know what that is the equivalent of.

Unroll your loop once they are tried, tested, and working correctly, and a profiler finds out you spend much too time in the specific parts that would be discarded when unrolling.

The liberties you took with your pseudocode above prove the point: as others have noted, you've chosen premature optimization over using a fitting data type. The first could be easily fixed before release. The latter is harder.

hythloday · on May 31, 2012

I think I must be misreading you. Are you suggesting doing a string comparison to avoid the performance hit of a mod?

Morg · on May 31, 2012

I did write it like that yes.

And it would still be faster than a mod, too, even though one byte might be better for registry usage, it won't affect cycles that much iirc.

recursive · on May 31, 2012

Hey, guess what? You're wrong. (at least in python, which is a reasonable guess for a language that's generating html)

    >>> import timeit
    >>> timeit.Timer(stmt="z=101%2").timeit()
    0.033080740708665485
    >>> timeit.Timer(stmt="z='even'=='odd'").timeit()
    0.05949918215862482

hythloday · on May 31, 2012

It does seem to be true for javascript though:

  > profile = function(fn) { var start = Date.now(); fn(); return Date.now() - start; }
  > cmp = function() { for (var i=0; i < 1000000000; i++) { var z = 'odd' === 'even'; } }
  > mod = function() { for (var i=0; i < 1000000000; i++) { var z = 101 % 2; } }
  > prof(cmp)
  20329
  > prof(mod)
  40792

Whether you think those 20 nanoseconds per test are worth saving is, I guess, an open question. :) I can imagine it being useful for game programming, for example.

Natsu · on May 31, 2012

If you know there are only two states, why not have a bool that you flip each time?

is_even = !is_even should be a lot cheaper than string comparison or modulus, assuming a reasonable language.

Morg · on June 1, 2012

I love fairy tail Indeed you can do much better, flipping boolean + cmp if you only need two states, and integer + array if you need three or more - main thing is, even the "stupid" string comparison is much cheaper than a modulus, and it's a very logical basic approach that says " if the previous row was odd, this one must be even ".

My idea with that is that it's extremely important to reach that conclusion, as it matches the problem perfectly and thus is much more efficient than our (often natural) standard approach of mod(x,2).

When you've reached that step, you can further improve by using a boolean instead of a short string, but that steps clearly into optimization, as it's not "formulating the problem correctly" but "finding a better way to implement the same solution".

There is major cost in not formulating the problem correctly (even odd is an approach to alternating colors, not the problem itself), and mod for table rows is a prime example of that.

seabee · on May 31, 2012

> Some approaches are NOT acceptable, it's not about optimizing prematurely, it's about coding obvious crap.

And yet we miss the simple, efficient answer?

    isEven = 1
    rowClasses[] = { 'odd', 'even' }
    loop
        isEven = 1 - isEven
        rowClass = rowClasses[isEven]
    endloop

It might only be an example but if you're going to complain about people's inefficient/wrong code, the least you can do is provide a good demonstration.

Morg · on June 1, 2012

As I said above, that's a better implementation of the same solution which is, pick two states and alternate by using the knowledge that if the previous is odd the next is even.

In essence your solution and the strcmp one follow the same logic, except yours is limited to two states as it is - but indeed the best n-state solution uses an array too.

What you posted here is a somewhat optimized implementation of the right solution, which is slightly better, like the boolean one (indeed you're using one bit that you flip ...).

But the BIG difference between the mod family of solutions and ours is that mod is over ten times slower because it does not correctly use the problem data.

I'm not the best coder there is, but I know it is much simpler to base yourself on something you already know (the state of the previous row) rather than doing additional computing because an analytical approach says alternating two colors is like having a color for even rows and one for odds.

By the way. your code is evil, if you're going to implement two-state logic, you're expected to use a boolean, and it will run faster with an if(b){str1}else{str2} than with an array that costs additional processing because of its nature (I'm talking straight out of my ass btw, but I still know it's inevitable that an array of two strings requires more bits and ops than two strings).

Also, the point of using an array for such an exercise would be to support n-state logic, yet your fake-boolean int approach makes it doubtful ;)

seabee · on June 15, 2012

> faster with an if(b){str1}else{str2}

I'm not aware of any compiler that optimises such a structure to avoid branch misprediction. An array of two strings might require more bits and ops in an unoptimised interpreted language, or one in which bounds checking is always enabled, but I can assure you that an indexed load is 1 instruction and a load for each string is... well, more than that. You are right, you are definitely talking straight out of your ass!

The n-state solution is a state machine, btw - but it is right to use a simple solution if that is all your problem requires.

cchurch · on May 31, 2012

For 99% of code, worrying about the number of cycles in an operation is an utter waste of time. It just doesn't matter. Correctness and readability trump machine efficiency.

Do you work on real-time systems, embedded code, or something similar?

Morg · on June 1, 2012

You are repeating a point of view you do not understand.

You don't know WHY people started saying that, WHEN they started and WHO started.

It was started by old people a while ago who told even older people that for the simple stuff they were writing for DESKTOP computers, it didn't matter anymore.

Indeed, if you have a 486dx4 and all you want to do is word processing, it didn't matter much wether it was optimized or microsoft word as the thing was way too powerful for that kind of stuff already.

Today, battery life is a concern, virtualization is a reality, scalability is a CORE issue, there are low power states etc.

Today, making your application 100 times more efficient gives you 10x more battery life,100x lower cloud hosting costs, 100x better scalability, etc.

Think that's unlikely ? You've been stacking inefficient blocks for a lifetime, sometimes with inefficiencies multiplying, where do you think you are today ?

Simple example, from a pgsql>jdbc>jboss>j2ee>hibernate>java report factory to a dumb php script that did simple SQL, you already have factors above 20 in favor of the simple solution.

That's before you make a better data model or even try using a fast language or a more suiting data store depending on your needs.

Besides, your argument is nonsense, the simplest most correct way IS the most efficient, that's the power of programming, there is absolutely NO compromise between reliability and efficiency in terms of code.

Readability is over rated, as long as you don't code crap, any COMPETENT coder will be able to read and understand fast enough, even without comments.

Do you work on overweight UIs that drain phone batteries or cloud-hosted applications or anything that needs scaling ?

debacle · on May 31, 2012

In general, there are very few things you actually need a modulo for. It's highly inefficient.

bitwize · on May 31, 2012

A lot of compilers are smart enough these days to optimize modulo by n, n a power of 2, to bitwise AND the complement of n-1.

mmukhin · on May 31, 2012

2kB over 1Gbps is actually 16ns (i guess they round up to 20)