Hacker News new | past | comments | ask | show | jobs | submit login
Why is one loop slower than 2 in C? (stackoverflow.com)
191 points by rkalla on Dec 18, 2011 | hide | past | favorite | 41 comments

The tldr here is that iterating through large arrays simultaneously with identical alignment can be slower than non-simultaneous access because of cache associativity. It's basically hash collisions in the cache due to alignment.

I've actually run into this in the "real world" in image processing applications. Given a "magical" width to an image, a pixel directly above (or below) another pixel can have the same alignment with respect to cpu cache. This can mean that things like (large) 2d convolutions are slower on images of very particular sizes.

I recently had a class project where we had similar issues. Being lazy students, we just let it perform slower for matrices of the wrong size; what is a good way to handle this sort of thing?

You pad the rows manually. Then you start loading bigger chunks of data at a time. Then you start issuing prefetch instructions with compiler intrinsics or inline assembly. Then you ask "why am I dealing with this at all?" and use Intel's math libraries. And finally you use Octave or Matlab.

Unless you are a nutter like me, in which case you hand-roll everything in assembly.

Usually you just pad the rows a bit.

And now I feel silly for not having thought of that :)

Happens to all of us. :)

I remember the first time I saw an image data struct that specified bits per pixel, image width and then also had bytes per row. I was a bit perplexed about why you'd store bytes per row if you could just compute it from bpp and width. Turns out you can fix/avoid a lot of performance issues if you can pad the data, so it's good practice to always save the number of bytes per row and use that in your mallocs/copies/iterators/etc.

For images, one of the main reasons to have separate bytes-per-row (usually called stride or pitch) is that it allows you to have sub-images that use the same data and stride as the parent image, but a different width and height.

Well, that too of course. :)

In scientific computing it's not unusual to get 2x-3x speedups by doing simple optimizations like this one. For desktop applications it's nothing, but when your simulation is taking a week to complete then 2x-speedup is a huge improvement.

This was especially the case on the old Pentium 4's. What we used to do was over-allocate the amount of memory needed, then offset the pointer into the allocated memory to avoid cache collisions.

It's not always nothing, particularly when you're dealing with video. :)

On a related node, this page also has an interesting visualisation of the effects of cache misses on program performance: http://igoro.com/archive/gallery-of-processor-cache-effects

I liked that too. That is the dominant factor here. In the first example the four arrays blow away 4 lines of the cache none of them stay resident, in the second example two of the four arrays stay in cache during the copy. The effects can be quite large.

At Network Appliance there is a guy there who understands cache effects inside and out. File systems it turns our are particularly sensitive to them because 'pointer chasing' (the essence of file system traversal) is often a very cache expensive event.

So with this level of complexity and opaqueness for a simple routine, does it ever get any better?

If your algorithm has ten intertwined loops, does one have any hope of systematically optimizing performance or would you be reduced to "intuition" and easter-egging. (http://www.catb.org/jargon/html/E/Easter-egging.html)

What about "cache oblivious" algorithms? http://en.wikipedia.org/wiki/Cache-oblivious_algorithm

What I would like to see is support in performance tools, like Instruments on Mac/iOS, for understanding and visualizing the effects of memory latency on your code. Sampling tools have been an incredible step forward for understanding why your code takes a long time, but I haven't seen anything remotely easy to use that helps understand memory latency. Considering how much slower our computers are at memory vs everything else, this could make a big difference!

Have you looked at Shark? It can show you where your program most frequently encounters L2 cache misses (or really any event your cpu has a hardware performance counter to track). Intel VTune and the "perf" tool from the Linux kernel also perform similarly.

Yes, I have. I wouldn't count the raw performance counters as easy to use, though. They're usable by experts familiar with the specific chip being profiled. Getting actual latency as opposed to misses and particularly visualizing it is maybe possible, but compare that to the profiling situation - hit a button, get data anyone can understand.

Have you tried Valgrind's Cachegrind (http://valgrind.org/docs/manual/cg-manual.html) and found it inadequate?

It's pretty slow. Alternatives that take advantage of hardware counters (like OProfile and Intel VTune) don't get in the way so much. With long-running simulations, or when trying to simulate performance in light of real-world loads you really don't want the VM overhead you get with Valgrind/Cachegrind if you can avoid it.

Another thing to consider for code like this is that all Intel CPU's since the Core 2 tend to suffer somewhat from loop-unrolling optimizations. They have limited fetch bandwidth and SSE instructions are pretty long. Instead, they've got a loop buffer after the instruction fetch unit (after the decoders in Nehalem), which can feed the decoders much faster than the fetch unit. However the buffer is limited in size, and if unrolling causes the loop to exceed the size of the buffer, it can result in a substantial performance penalty.

From the top reply, it looks like it's less locality of reference, and more specific details of the cache indexing; when the arrays are allocated differently so the arrays don't alias each others' cache lines, the one-loop version becomes faster again (for some sizes of arrays).

Why didn't the compiler split them between two cores/cpu's - would this have been quicker?

Cache associativity isn't necessarily the only problem. Also possibly relevant, from http://drdobbs.com/184405848 (Mike Abrash):

"The other cause of erratic performance—and by far the worst—was 64K aliasing. On most Pentium 4s, if two data items in the L1 cache are at the same address modulo 64K, then a massive slowdown often occurs. Intel's documentation explains this by saying that the instruction that accesses the second piece of 64K-aliasing data has to wait until the first piece is written from the cache, which is to say until the first instruction is completely finished. As you might expect, this can wreak havoc with out-of-order processing. It was 64K aliasing that caused the aforementioned four-times slowdown in our large-triangle drawing benchmark.

"The truly insidious thing about 64K aliasing is that it is very hard to reliably eliminate it. Sure, it helps to avoid obvious problems. For example, pixel buffers and 32-bit z buffers are accessed at the same intrabuffer offsets for any given pixel, so if they encounter 64K aliasing at all, it will happen constantly; to avoid this, Pixomatic makes sure the two buffers never start at the same address modulo 64K (and also not at the same address modulo 4K, to avoid cache contention). But time after time, we've tweaked our data to get rid of 64K aliasing, only to have it pop up again a few days later after more changes were made. About all you can do is run VTune periodically to see if the number of 64K-aliasing events has spiked and, if it has, start backing out changes to figure out what variables or buffers need to be tweaked so they're not 64K aligned.

"On newer Pentium 4s, 64K aliasing changes to 1M aliasing, and in the most recent versions to 4M aliasing, which should reduce the problem considerably. Older Pentium 4s that have 64K aliasing also have 1M aliasing, which introduces an additional penalty on top of the cost of 64K aliasing, so beware of 1M aliasing even more than 64K.

"By the way, on Pentium 4s that have 1M aliasing but not 64K aliasing, the aliasing event counter still fires on 64K-aliasing events, even though they don't affect performance on those chips, so VTune will report incorrect results in such cases. On newer P4s that only have 4M aliasing, VTune will correctly report only 4M-aliasing events."

Hmm this would be a great interview question when I hire my next round of web developers. Thanks hn!

It sounds like you want to be hiring ex kernel-hackers as your web dev team :)

I hope you're being sarcastic.

My school taught this sort of thing in the general CS classes, before everyone split off into their own branches. I think this sort of question would be a great thing to ask web developers. They should know it even if they don't always use it.

Really? They should be able to look at that code snippet plus performance results and say, "Oh, that's probably the result of cache hash collisions caused by the way the memory was allocated." That's harsh.

Edited to add: So you'd be ok with your next interviewer asking you to use the method of Frobenius to solve an ordinary differential equation, because, "hey, I had to learn that as part of my CS math prerequisite courses, so you should know it too, even though you're just interviewing for a web dev job?"

They don't need to know the exact cause, but they should certainly be able to look at that code and say, "well that's rather weird. my guess is cache weirdness". I don't think being able to identify the ballfield that the answer is in is too much to ask. Identifying that sort of problem is sophomore or maybe junior level stuff. Figuring out what exactly is causing it and therefore how to solve it I would consider bonus, and wouldn't expect it.

Differential equations was not a part of my CS curriculum, but I imagine it would be fair game for any engineer. I would not expect them to solve the answer for me on a whiteboard right there, but I don't know what the equivalent to "simply identify what the question is" would be in this case.

I hope all interviewers are not like you. Not every web developer comes with a rigorous CS background. Some very excellent web developers I know come from various backgrounds and are good at what they do because they understand web development itself really well even though they may not know low level compiler and OS stuff.

It's always better to align the interview with the job (and future tasks they may be involved in), not some idealized CS background that you think everyone should have.

I find it hard to believe that somebody with no knowledge of "low level compiler and OS stuff" that did know the basics of caching (generalized) and had strong problem solving skills (which all developers regardless of specialization should have) would have a hard time puzzling something approximating a correct diagnosis.

I think that you are not giving your average "web developer" enough credit.

Funny, I don't even have a high school diploma much less a degree of any kind, and I instantly knew it was probably cache weirdness. Do yourself a favor, drop the obsession with formal CS curriculums, they're not a hallmark of competence.

My point is that this isn't an unreasonable thing to expect people to know, as evidenced by it being taught to undergrads. That you know it without a high school diploma strengthens my assertion that it is a reasonable thing to expect people to know.

This isn't about "formal CS curriculum", this is about a passing familiarity of modern computer architecture.

Without looking at the answers, let me guess. It has to do with fitting more data in L1/L2 cache and filling up the CPU's speculative execution pipeline with the simpler code.

Nope! Man, there are a lot of people in the comments who didn't read it.

wait, I thought that was the reason? (the L1-cache misses)

It is due to CPU cache issues, but not because the cache is full.

CPU caches are indexed on the low bits of the address. If you have two addresses with the same low-order bits, then those addresses will collide in the cache, even though the cache isn't full.

Large allocations (e.g. huge arrays) tend to get their own memory pages allocated - which means that the same offset in two different arrays ends up having the same low-order bits.

No. It's cache misses, but because of cache-line hash collisions (multiple inputs consistently clash on the same cache line; when it's only 2 inputs, in that specific CPU, they all fit in L1, when it's 4, they don't -- regardless of the cache size; this case is tripped by the associativity)

tl;dr: At small scales, efficiency is dominated by loop overhead. At larger scales, issues of cache size and bus bandwidth come to dominate. (In particular, if the memory accessed in the single loop doesn't all fit in the cache at once, the two loops might be faster if their respective portions of memory each can fit in the cache.)

Actually, it turned out to be more interesting than that. False tldr! It had to do with the way the cache is addressed - just shifting one or the other array over in memory so all the parallel indexes weren't at the same offset from the beginning of the cache line would address it.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact