
Why is one loop slower than 2 in C? - rkalla
http://stackoverflow.com/questions/8547778/why-is-one-loop-so-much-slower-than-two-loops
======
lbrandy
The tldr here is that iterating through large arrays simultaneously with
identical alignment can be slower than non-simultaneous access because of
cache associativity. It's basically hash collisions in the cache due to
alignment.

I've actually run into this in the "real world" in image processing
applications. Given a "magical" image width, a pixel directly above (or
below) another pixel can have the same alignment with respect to the CPU
cache. This can mean that things like (large) 2D convolutions are slower on
images of very particular sizes.

~~~
tikhonj
I recently had a class project where we had similar issues. Being lazy
students, we just let it perform slower for matrices of the wrong size; what
is a good way to handle this sort of thing?

~~~
dpe82
Usually you just pad the rows a bit.

~~~
tikhonj
And now I feel silly for not having thought of that :)

~~~
dpe82
Happens to all of us. :)

I remember the first time I saw an image data struct that specified bits per
pixel, image width and then also had bytes per row. I was a bit perplexed
about why you'd store bytes per row if you could just compute it from bpp and
width. Turns out you can fix/avoid a lot of performance issues if you can pad
the data, so it's good practice to always save the number of bytes per row and
use that in your mallocs/copies/iterators/etc.
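
Something like this sketch (hypothetical struct and helper names, not any
real library's API):

```c
#include <stdlib.h>

/* Hypothetical image struct: bytes_per_row (the stride) is stored
 * explicitly rather than recomputed from width * bytes_per_pixel,
 * so rows can be padded freely. */
typedef struct {
    int width, height;        /* in pixels */
    int bytes_per_pixel;
    size_t bytes_per_row;     /* >= width * bytes_per_pixel */
    unsigned char *data;
} Image;

/* Allocate with `pad` extra bytes per row to dodge cache aliasing. */
Image image_alloc(int w, int h, int bpp, size_t pad) {
    Image im = { w, h, bpp, (size_t)w * bpp + pad, NULL };
    im.data = malloc(im.bytes_per_row * h);
    return im;
}

/* All addressing goes through bytes_per_row, never width * bpp. */
unsigned char *pixel_at(const Image *im, int x, int y) {
    return im->data + (size_t)y * im->bytes_per_row
                    + (size_t)x * im->bytes_per_pixel;
}
```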

~~~
ssp
For images, one of the main reasons to have separate bytes-per-row (usually
called stride or pitch) is that it allows you to have sub-images that use the
same data and stride as the parent image, but a different width and height.

~~~
dpe82
Well, that too of course. :)

------
kia
In scientific computing it's not unusual to get 2x-3x speedups by doing simple
optimizations like this one. For desktop applications it's nothing, but when
your simulation is taking a week to complete then 2x-speedup is a huge
improvement.

~~~
ahelwer
This was especially the case on the old Pentium 4s. What we used to do was
over-allocate the amount of memory needed, then offset the pointer into the
allocated memory to avoid cache collisions.

------
xtacy
On a related note, this page also has an interesting visualisation of the
effects of cache misses on program performance:
<http://igoro.com/archive/gallery-of-processor-cache-effects>

~~~
ChuckMcM
I liked that too. That is the dominant factor here. In the first example the
four arrays blow away four lines of the cache and none of them stay
resident; in the second example, two of the four arrays stay in cache during
the copy. The effects can be quite large.

At Network Appliance there is a guy who understands cache effects inside
and out. File systems, it turns out, are particularly sensitive to them
because 'pointer chasing' (the essence of file system traversal) is often a
very cache-expensive operation.

------
joe_the_user
So with this level of complexity and opaqueness for a simple routine, does it
ever get any better?

If your algorithm has ten intertwined loops, is there any hope of
systematically optimizing performance, or are you reduced to "intuition"
and easter-egging? (<http://www.catb.org/jargon/html/E/Easter-egging.html>)

What about "cache oblivious" algorithms?
<http://en.wikipedia.org/wiki/Cache-oblivious_algorithm>

~~~
kenferry
What I would like to see is support in performance tools, like Instruments on
Mac/iOS, for understanding and visualizing the effects of memory latency on
your code. Sampling tools have been an incredible step forward for
understanding why your code takes a long time, but I haven't seen anything
remotely easy to use that helps understand memory latency. Considering how
much slower our computers are at memory vs everything else, this could make a
big difference!

~~~
minimax
Have you looked at Shark? It can show you where your program most frequently
encounters L2 cache misses (or really any event your CPU has a hardware
performance counter for). Intel VTune and the "perf" tool from the Linux
kernel work similarly.

~~~
kenferry
Yes, I have. I wouldn't count the raw performance counters as easy to use,
though. They're usable by experts familiar with the specific chip being
profiled. Getting actual latency as opposed to misses and particularly
visualizing it is maybe possible, but compare that to the profiling situation
- hit a button, get data anyone can understand.

------
rayiner
Another thing to consider for code like this is that all Intel CPUs since
the Core 2 tend to suffer somewhat from loop-unrolling optimizations. They
have limited fetch bandwidth, and SSE instructions are pretty long. To
compensate, they've got a loop buffer after the instruction fetch unit
(after the decoders in Nehalem), which can feed the decoders much faster
than the fetch unit can. However, the buffer is limited in size, and if
unrolling causes the loop to exceed the size of the buffer, it can result
in a substantial performance penalty.

------
machrider
<http://en.wikipedia.org/wiki/Locality_of_reference>

~~~
_delirium
From the top reply, it looks like it's less locality of reference, and more
specific details of the cache indexing; when the arrays are allocated
differently so the arrays don't alias each others' cache lines, the one-loop
version becomes faster again (for some sizes of arrays).

------
Mordor
Why didn't the compiler split them between two cores/cpu's - would this have
been quicker?

------
CamperBob
Cache associativity isn't necessarily the only problem. Also possibly
relevant, from <http://drdobbs.com/184405848> (Mike Abrash):

"The other cause of erratic performance—and by far the worst—was 64K aliasing.
On most Pentium 4s, if two data items in the L1 cache are at the same address
modulo 64K, then a massive slowdown often occurs. Intel's documentation
explains this by saying that the instruction that accesses the second piece of
64K-aliasing data has to wait until the first piece is written from the cache,
which is to say until the first instruction is completely finished. As you
might expect, this can wreak havoc with out-of-order processing. It was 64K
aliasing that caused the aforementioned four-times slowdown in our large-
triangle drawing benchmark.

"The truly insidious thing about 64K aliasing is that it is very hard to
reliably eliminate it. Sure, it helps to avoid obvious problems. For example,
pixel buffers and 32-bit z buffers are accessed at the same intrabuffer
offsets for any given pixel, so if they encounter 64K aliasing at all, it will
happen constantly; to avoid this, Pixomatic makes sure the two buffers never
start at the same address modulo 64K (and also not at the same address modulo
4K, to avoid cache contention). But time after time, we've tweaked our data to
get rid of 64K aliasing, only to have it pop up again a few days later after
more changes were made. About all you can do is run VTune periodically to see
if the number of 64K-aliasing events has spiked and, if it has, start backing
out changes to figure out what variables or buffers need to be tweaked so
they're not 64K aligned.

"On newer Pentium 4s, 64K aliasing changes to 1M aliasing, and in the most
recent versions to 4M aliasing, which should reduce the problem considerably.
Older Pentium 4s that have 64K aliasing also have 1M aliasing, which
introduces an additional penalty on top of the cost of 64K aliasing, so beware
of 1M aliasing even more than 64K.

"By the way, on Pentium 4s that have 1M aliasing but not 64K aliasing, the
aliasing event counter still fires on 64K-aliasing events, even though they
don't affect performance on those chips, so VTune will report incorrect
results in such cases. On newer P4s that only have 4M aliasing, VTune will
correctly report only 4M-aliasing events."
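
The "never start at the same address modulo 64K (or 4K)" trick from the
quote might look something like this sketch (hypothetical helper name,
illustrative constants):

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate `bytes` with enough slack that the result can be nudged off
 * any aliasing boundary shared with `other`. Since 4K divides 64K,
 * making the addresses differ modulo 4K also makes them differ modulo
 * 64K. Hypothetical helper; constants are illustrative. The raw
 * pointer, needed later for free(), comes back through *raw. */
static void *alloc_avoiding_alias(size_t bytes, const void *other,
                                  void **raw) {
    *raw = malloc(bytes + 4096);
    if (*raw == NULL) return NULL;
    uintptr_t p = (uintptr_t)*raw;
    if (p % 4096 == (uintptr_t)other % 4096)
        p += 64;  /* shift off the aliasing boundary */
    return (void *)p;
}
```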

------
tocomment
Hmm this would be a great interview question when I hire my next round of web
developers. Thanks hn!

~~~
maximusprime
I hope you're being sarcastic.

~~~
burgerbrain
My school taught this sort of thing in the general CS classes, before everyone
split off into their own branches. I think this sort of question would be a
great thing to ask web developers. They should know it even if they don't
always use it.

~~~
amcintyre
Really? They should be able to look at that code snippet plus performance
results and say, "Oh, that's probably the result of cache hash collisions
caused by the way the memory was allocated." That's harsh.

 _Edited to add:_ So you'd be ok with your next interviewer asking you to use
the method of Frobenius to solve an ordinary differential equation, because,
"hey, I had to learn that as part of my CS math prerequisite courses, so you
should know it too, even though you're just interviewing for a web dev job?"

~~~
burgerbrain
They don't need to know the exact cause, but they should certainly be able to
look at that code and say, _"well that's rather weird. my guess is cache
weirdness"_. I don't think being able to identify the ballfield that the
answer is in is too much to ask. Identifying that sort of problem is sophomore
or maybe junior level stuff. Figuring out what exactly is causing it and
therefore how to solve it I would consider bonus, and wouldn't expect it.

Differential equations were not a part of my CS curriculum, but I imagine
they would be fair game for any engineer. I would not expect them to work
out the answer for me on a whiteboard right there, but I don't know what the
equivalent of "simply identify _what the question is_ " would be in this case.

~~~
alphakappa
I hope all interviewers are not like you. Not every web developer comes with a
rigorous CS background. Some very excellent web developers I know come from
various backgrounds and are good at what they do because they understand web
development itself really well even though they may not know low level
compiler and OS stuff.

It's always better to align the interview with the job (and future tasks they
may be involved in), not some idealized CS background that you think everyone
should have.

~~~
burgerbrain
I find it hard to believe that somebody with no knowledge of _"low level
compiler and OS stuff"_ who _did_ know the basics of caching (in general)
and had strong problem-solving skills (which all developers, regardless of
specialization, should have) would have a hard time puzzling out something
approximating a correct diagnosis.

I think that you are not giving your average "web developer" enough credit.

------
ww520
Without looking at the answers, let me guess. It has to do with fitting more
data in L1/L2 cache and filling up the CPU's speculative execution pipeline
with the simpler code.

~~~
kenferry
Nope! Man, there are a lot of people in the comments who didn't read it.

~~~
rkalla
wait, I thought that _was_ the reason? (the L1-cache misses)

~~~
jbri
It is due to CPU cache issues, but _not_ because the cache is full.

CPU caches are indexed on the low bits of the address. If you have two
addresses with the same low-order bits, then those addresses will collide in
the cache, even though the cache isn't full.

Large allocations (e.g. huge arrays) tend to get their own memory pages
allocated - which means that the same offset in two different arrays ends up
having the same low-order bits.

------
blahedo
tl;dr: At small scales, efficiency is dominated by loop overhead. At larger
scales, issues of cache size and bus bandwidth come to dominate. (In
particular, if the memory accessed in the single loop doesn't all fit in the
cache at once, the two loops might be faster if their respective portions of
memory each _can_ fit in the cache.)

~~~
kenferry
Actually, it turned out to be more interesting than that. False tldr! It had
to do with the way the cache is addressed: just shifting one or the other
array over in memory, so the parallel indexes weren't all at the same offset
relative to the cache's indexing, would address it.

