

Why is processing a sorted array faster than an unsorted array? - Ashuu
http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array/11227902#11227902

======
ColinWright
It's a wonderful piece. Well written, well constructed, and immensely
informative for those new to the details of the internals of CPUs. Perhaps it
deserves to be submitted every month or so. In some sense, seeing it surface
yet again gives me hope that there are still people on HN who crave
discussions about technical issues.

In case you think that the people here are intelligent and might have
something to add, here is the previous discussion of this:

[https://news.ycombinator.com/item?id=4637196](https://news.ycombinator.com/item?id=4637196)
(119 comments)

For those complaining that it seems to be posted regularly, here are some of
the previous submissions:

[https://news.ycombinator.com/item?id=4167834](https://news.ycombinator.com/item?id=4167834)
(366 days ago)

[https://news.ycombinator.com/item?id=4170972](https://news.ycombinator.com/item?id=4170972)
(365 days ago)

[https://news.ycombinator.com/item?id=4185226](https://news.ycombinator.com/item?id=4185226)
(362 days ago)

[https://news.ycombinator.com/item?id=4355548](https://news.ycombinator.com/item?id=4355548)
(324 days ago)

[https://news.ycombinator.com/item?id=4964931](https://news.ycombinator.com/item?id=4964931)
(185 days ago)

[https://news.ycombinator.com/item?id=5167935](https://news.ycombinator.com/item?id=5167935)
(143 days ago)

[https://news.ycombinator.com/item?id=5666751](https://news.ycombinator.com/item?id=5666751)
(52 days ago)

[https://news.ycombinator.com/item?id=5679080](https://news.ycombinator.com/item?id=5679080)
(50 days ago)

~~~
ancarda
I was under the impression HN prevented you from submitting the same URL. How
has this happened?

Edit: Looks like each URL is slightly different.

~~~
DanBC
Sometimes people deliberately add stuff to the end of the URL to get past the
dupe-check. Perhaps this needs to be penalized somehow?

([https://news.ycombinator.com/item?id=5679080](https://news.ycombinator.com/item?id=5679080))

And it's not helpful that people link to various places on the SE page (the
question, or one of the answers), rather than a canonical link to the page.

------
mtdewcmu
It's not obvious what the take-home lesson from this example should be. The
code was clearly designed to test the loop speed; in practice it would rarely
be worth sorting the array just to optimize the branches in the loop. The
branchless version somebody posted is effective, but it's hard to read and
brittle. The Intel compiler seemed like the best solution, if you have it.
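For reference, a branchless formulation in the spirit of the one posted there
(a sketch, not necessarily the exact posted code; it relies on arithmetic
right shift of negative ints, which C leaves implementation-defined but
gcc/clang provide):

```c
#include <assert.h>
#include <stddef.h>

/* Branch-free accumulation: build an all-ones or all-zeros mask from the
   comparison instead of branching on it. */
long sum_branchless(const int *data, size_t n) {
    long sum = 0;
    for (size_t c = 0; c < n; c++) {
        int t = (data[c] - 128) >> 31;  /* all ones if data[c] < 128, else 0 */
        sum += ~t & data[c];            /* adds data[c] only when >= 128 */
    }
    return sum;
}

/* The plain branchy version it replaces. */
long sum_branchy(const int *data, size_t n) {
    long sum = 0;
    for (size_t c = 0; c < n; c++)
        if (data[c] >= 128)
            sum += data[c];
    return sum;
}
```

The mask trick trades readability for a data dependency instead of a control
dependency, which is exactly why it's brittle: it silently assumes the value
range and the shift behavior.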

I've found architecture effects fairly hard to predict in real situations,
with my level of knowledge. I think it's hard to reduce these to
straightforward lessons. It's useful to know that things like branch
mispredictions and cache misses are important, but you should probably not try
to optimize for them until you've timed the loop and built up a good case for
what's occurring. Most of the time, when I think I can make something faster
that way, I rewrite the code and it's no faster.

One case where something did get faster involved two large arrays, one
containing indexes into the other, which contained data; I needed to permute
the data by the indexes and write it out to a file. Initially, the indexes
were arranged so that the output was sequential but the array accesses were
random. I replaced the array of indexes with its inverse, which accessed the
data sequentially but produced the output in random order; I captured that
output in a new array before writing it out. The second version was
measurably faster. The lesson seems to be that loads benefit more from
locality than stores. In the end, though, it only saved a fraction of a
second, even with 100 million elements in the array. Unoptimized, it was
already pretty fast, taking slightly over a second. It was nice, but I'm glad
I didn't spend too much time on it.
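The two layouts described above can be sketched like this (names are
hypothetical, not the original code):

```c
#include <assert.h>
#include <stddef.h>

/* Version 1: random loads, sequential stores.
   idx[i] says which element of data goes to output position i. */
void permute_gather(const int *data, const size_t *idx, int *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = data[idx[i]];   /* load jumps around, store is sequential */
}

/* Version 2: walk data sequentially, scatter via the inverse index.
   inv[j] is the output position of data[j], i.e. inv[idx[i]] == i. */
void permute_scatter(const int *data, const size_t *inv, int *out, size_t n) {
    for (size_t j = 0; j < n; j++)
        out[inv[j]] = data[j];   /* load is sequential, store jumps around */
}

/* Build the inverse permutation. */
void invert(const size_t *idx, size_t *inv, size_t n) {
    for (size_t i = 0; i < n; i++)
        inv[idx[i]] = i;
}
```

Both produce the same output; the claim in the comment is that the scatter
version wins because the random accesses land on the store side rather than
the load side.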

I think this kind of optimization is somewhat like saving money by doing your
own plumbing. It's usually not worth it unless you know what you're doing, or
you can afford to spend lots of time.

Edit: Reading more of the comments, I found something useful that I didn't
know. The ternary ?: operator can compile to a conditional move rather than a
branch. So that's a viable solution, although it came out a little slower
than the predictable branch.
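The ternary form of the loop body looks like this; whether it actually
becomes a cmov is up to the compiler and optimization level, not something
the language guarantees:

```c
#include <assert.h>
#include <stddef.h>

/* Ternary formulation: both arms are cheap and side-effect free, so the
   compiler is free to compute the addend unconditionally and select it
   with a conditional move instead of a branch. */
long sum_ternary(const int *data, size_t n) {
    long sum = 0;
    for (size_t c = 0; c < n; c++)
        sum += (data[c] >= 128) ? data[c] : 0;
    return sum;
}
```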

------
anonymous
Interestingly, when written in C, gcc compiles the inner loop to (comments
mine):

    
    
        .L11:
    	movl	(%r12,%rdx,4), %eax ; load value of data[c] in eax
    	leal	(%rsi,%rax), %ecx ; (sum is in rsi)
            ; add data[c] to sum, store in ecx
            ; yes, this is (ab)using the
            ; "load effective address" instruction as an
            ; "add a to b and store in c" instruction
    	cmpl	$128, %eax ; compare data[c] to 128
    	cmovge	%ecx, %esi ; if above comparison was true,
                               ; set rsi (sum) to ecx, computed above
    	addq	$1, %rdx ; c = c+1
        .L9:
    	cmpq	$134217727, %rdx ; this is the for end clause
    	jbe	.L11
    	jmp	.L12
    

In effect, it's adding data[c] to sum, storing it in a register, and storing
the result back in sum if data[c] was at least 128 (hence the cmovge), all
with no jumps except for the loop itself, and that jump is mispredicted
exactly once. I don't see why Java's JIT can't do the same.
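For reference, the C loop that produces code along these lines, and one way
to inspect it (exact flags and output vary with gcc version):

```c
#include <assert.h>
#include <stddef.h>

/* Source loop corresponding to the assembly above. Compile with something
   like `gcc -O3 -S sum.c` and look for cmovge in sum.s; the "ge" in cmovge
   matches the >= in the source. */
long sum_loop(const int *data, size_t n) {
    long sum = 0;
    for (size_t c = 0; c < n; c++)
        if (data[c] >= 128)
            sum += data[c];
    return sum;
}
```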

~~~
mtdewcmu
In the answers to the original question, somebody mentioned that gcc will
convert branches into conditional moves, but only at -O3. I never use -O3,
because it's frequently slower than -O2. In another answer, somebody pointed
out that the ternary operator typically generates a conditional move, which I
never knew.

------
ancarda
Is it possible to reduce the length of pipelines? Example:
[http://www.youtube.com/watch?v=w9VWRB07yqc](http://www.youtube.com/watch?v=w9VWRB07yqc)

~~~
microarchitect
Deeper pipelines have higher clock frequencies but this comes at the cost of
lower instructions per clock (IPC). Roughly speaking, asking for a shallower
pipe essentially boils down to asking for a lower clock frequency. Deeper
pipelines are more complex to build, tricky to validate, not to mention the
obvious fact that they consume more area and power, so architects who choose
deeper pipelines are doing so only because performance studies show that these
pipes are worthwhile.

The branch in this example is essentially random and so pretty much impossible
to predict. I would classify it as a pathological case. A modern predictor
such as perceptron [1] or even the older tournament predictor [2] is extremely
accurate for branches found in real benchmarks.

[1]
[http://www.cs.utexas.edu/~lin/papers/hpca01.pdf](http://www.cs.utexas.edu/~lin/papers/hpca01.pdf)
[2] [http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-TN-36.pdf](http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-TN-36.pdf)
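To see why a truly random branch defeats even good predictors, here is a toy
simulation (a plain 2-bit saturating counter, far simpler than the perceptron
or tournament predictors cited above, but the effect is the same): a fully
biased branch almost never mispredicts, while a random branch hovers around
50% no matter what the predictor remembers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy 2-bit saturating counter: states 0..3, predict "taken" when >= 2. */
typedef struct { int state; } predictor;

/* Returns 1 on a misprediction, and updates the counter. */
int predict_and_update(predictor *p, int taken) {
    int predicted = (p->state >= 2);
    if (taken  && p->state < 3) p->state++;
    if (!taken && p->state > 0) p->state--;
    return predicted != taken;
}

/* Count mispredictions over n branch outcomes produced by next(). */
size_t mispredicts(int (*next)(uint64_t *), uint64_t seed, size_t n) {
    predictor p = { 2 };           /* start weakly taken */
    uint64_t s = seed;
    size_t miss = 0;
    for (size_t i = 0; i < n; i++)
        miss += predict_and_update(&p, next(&s));
    return miss;
}

/* Outcome generators: a fully biased branch vs. a pseudorandom one. */
int always_taken(uint64_t *s) { (void)s; return 1; }
int random_taken(uint64_t *s) {            /* xorshift64, deterministic */
    *s ^= *s << 13; *s ^= *s >> 7; *s ^= *s << 17;
    return (int)(*s & 1);
}
```

Since each random outcome is independent of everything the predictor has
seen, no amount of history helps, which is why the unsorted-array case in the
article is pathological rather than typical.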

PS. Sidenote about the P4: it achieved the highest SpecINT CPU score among
its contemporary processors, so it wasn't the performance disaster some
people make it out to be. It's worth noting that SpecINT CPU has
traditionally been the most challenging benchmark in terms of branch
prediction requirements; SpecFP CPU, for example, is full of easy-to-predict
loops.

~~~
mtdewcmu
>> The branch in this example is essentially random and so pretty much
impossible to predict. I would classify it as a pathological case.

It's a highly contrived example, and it's a little bit misleading.

------
betterunix
Here I was, hoping this would be a discussion about the algorithmic
improvements that are possible with sorted arrays...

------
omniwired
Every six months this is posted here. Please stop. Thanks

~~~
glhaynes
Of the things that get posted to HN on an average day, this technical,
interesting, well-written piece that shows up every several months is the one
that really bothers you to see? I've read it before too, but its having become
part of the site's recurring culture seems… not so bad to me.

~~~
Ashuu
Hey, I am really sorry for posting this. I didn't know it had been posted
earlier! I thought this might be a good post to discuss on HN. Is there any
way of knowing whether content was posted earlier on HN?

~~~
glhaynes
I'm not sure if you misread my comment or perhaps just posted this as a reply
to the wrong comment, but I completely agree that it's a good HN post and I
(personally) don't think it's a problem for it to get posted again every few
months.

HN does have a dupe detector and won't let an identical URL be posted a second
time. But, as you can see in the above comments on this same page, even very
slight differences are sufficient for it to allow the post through.

Anyway, I hope you keep posting items like this! If they're not sufficiently
interesting and novel to enough people, they won't be upvoted. No harm.

