
Why is processing a sorted array faster than an unsorted array? - ashishgandhi
http://stackoverflow.com/q/11227809/399268
======
rayiner
Modern architectures really are quite complex, and it's important to keep your
internal machine model up to date with what's happening. Examples: it used to
be that on a 486, even correctly predicted taken branches had a two-cycle
penalty, so it mattered whether you arranged your code so that the most
commonly taken code was on the fall-through path or not. Today, correctly
predicted branches are free--the processor stitches together the current path
and the new path so there are no bubbles in the pipeline. Virtual function
calls also used to be more expensive back in the day, because CPUs didn't try
to predict the target address of the branch. Today, CPUs have a buffer that
maps a program counter to a target address, so if your virtual function call
predictably hits the same target each time, the only overhead is the extra
memory accesses to indirect through the v-table.
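
To make that concrete, here's a minimal sketch (hypothetical Shape/Square
types, nothing from the thread itself) of a virtual call site that always hits
the same target, which the branch target buffer can predict:

    #include <cstdio>
    #include <memory>
    #include <vector>

    // Hypothetical types for illustration only.
    struct Shape {
        virtual ~Shape() = default;
        virtual long area() const = 0;
    };

    struct Square : Shape {
        long side = 3;
        long area() const override { return side * side; }
    };

    int main() {
        std::vector<std::unique_ptr<Shape>> shapes;
        for (int i = 0; i < 1000000; ++i)
            shapes.push_back(std::make_unique<Square>());

        long sum = 0;
        for (const auto& s : shapes)
            sum += s->area();  // indirect call, but the target is always
                               // Square::area, so it predicts well
        std::printf("%ld\n", sum);
    }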

At the same time, things that are expensive today may not be so in the near
future. Haswell is supposed to make uncontended synchronization operations
almost free. It'll make a lot of algorithms, particularly lock-free
algorithms, much more practical than they are today. For example, C++ smart
pointers are slow in multi-threaded situations because the reference count
manipulation needs to be done under a lock. But the lock is almost never
needed (except when it is). Cheap synchronization should make it much more
practical to use reference counting more pervasively in C++ programs.

~~~
dan00
"For example, C++ smart pointers are slow in multi-threaded situations because
the reference count manipulation needs to be done under a lock."

Several architectures support atomic increment and decrement operations, so
there's no need for any lock. Libraries like OpenSceneGraph use them for their
reference counting.

Also, Haswell isn't magical in this regard. Its lock-free implementations will
also presumably just use safe atomic operations.
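
A minimal sketch of that idea (a hypothetical RefCounted type, not
OpenSceneGraph's actual API):

    #include <atomic>

    // Hypothetical lock-free reference count, for illustration only.
    struct RefCounted {
        std::atomic<int> refs{1};

        void ref() {
            // plain atomic increment; taking a reference needs no ordering
            refs.fetch_add(1, std::memory_order_relaxed);
        }

        void unref() {
            // whoever drops the last reference deletes the object;
            // acq_rel orders earlier writes before the delete
            if (refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
                delete this;
        }

        virtual ~RefCounted() = default;
    };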

~~~
rayiner
Atomic increment/decrement operations still generate a locked memory
transaction of some form or other at the processor level. Haswell is Intel's
new CPU that is supposed to implement some level of hardware transactional
memory, leveraging the infrastructure used for cache coherency. In theory,
this will make transactional operations (an atomic increment/decrement is just
a tiny transaction) as cheap as a regular memory operation so long as there is
no contention.
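
As a rough sketch of what that might look like, using the RTM intrinsics Intel
has documented for Haswell (unreleased hardware, so treat this as a guess
rather than a measurement; compile with -mrtm):

    #include <immintrin.h>  // RTM intrinsics: _xbegin/_xend
    #include <atomic>

    std::atomic<long> refcount{1};

    void increment() {
        unsigned status = _xbegin();    // start a hardware transaction
        if (status == _XBEGIN_STARTED) {
            // inside the transaction: plain loads/stores, no locked bus cycle
            long v = refcount.load(std::memory_order_relaxed);
            refcount.store(v + 1, std::memory_order_relaxed);
            _xend();                    // commit
        } else {
            // aborted (e.g. contention): fall back to an ordinary atomic RMW
            refcount.fetch_add(1, std::memory_order_relaxed);
        }
    }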

~~~
ajross
Sort of, but it's not really a "lock". If, as is very likely in performance-
sensitive-but-uncontended multithreaded code, the CPU already has the cache
line in "exclusive" (in the MESI sense) mode, then the atomic operation need
be no slower than a standard read/write sequence. On x86 it is serializing,
however, which can have performance impacts for some workloads.

~~~
rayiner
An atomic operation on a cache line owned exclusively by the local CPU avoids
any bus-lock operations, but still serializes the pipeline to avoid conflicts
caused by memory operations on the same CPU. This basically destroys your
memory parallelism if you use it in a situation where the atomic instruction
happens often (e.g. every time an object reference is loaded or stored).

~~~
ajross
Isn't that exactly what I said? Note that "destroy memory parallelism" is
often not a high penalty in typical workloads where all in-flight accesses are
hitting L1 cache only. It's not nearly as high as the "full round trip to
DRAM" latency implied by calling it a "lock".

~~~
rayiner
I'm not really disagreeing with you, but I think we're using the terminology
differently. To me, "memory" is the whole memory pipeline--everything from the
load/store units, through the load/store buffers, the caches, to DRAM.

When I said it was a "locked memory transaction of some sort" I was including
the effect of a serializing instruction which prevents concurrent load/store
operations even if they hit the cache. I wasn't trying to imply a full round
trip to DRAM.

------
lazydon
Ah, this... the highest-voted Q on SO. Not again plz -
[http://www.hnsearch.com/search#request/all&q=Why+is+proc...](http://www.hnsearch.com/search#request/all&q=Why+is+processing+a+sorted+array+faster+than+an+unsorted+array)
(interesting case of the same page with different URLs bypassing the HN
duplicate-URL check, though)

Go to <http://stackoverflow.com/questions>, sort by votes, and read from the
top if you like.

~~~
JepZ
I also saw it a few weeks ago here on HN and was wondering why it is here
again... At least I wasn't the only one who thought so ;-)

Sometimes I wonder how old 'news' can be and still climb to the top of HN. Is
there a search engine somewhere that can tell an author whether something has
already been on HN? Or, if it is about good websites and not about news, why
has nobody posted a link to google.com during the last weeks? ;-)

~~~
dllthomas
Wasn't there something from the 80's recently?

~~~
TeMPOraL
A week or two ago we had a text from the '70s about diamonds, AFAIR. :)

------
Bakkot
Several months old, but still worth a read - especially for the end of
Mysticial's comment, where he discusses the different optimizations performed
by different compilers. Particularly interesting is the Intel compiler, which
effectively optimizes out the benchmark itself: something to keep in mind when
testing your own code, if you want your results to make sense.

------
soup10
One takeaway point from this for non-C coders is that higher-level languages
will always be significantly slower than C, because it's far easier to make
these kinds of fiddly pipeline optimizations when you don't have another
abstraction layer in between mucking things up. There are few Java
programmers, for instance, who intuitively understand how both the machine
architecture and the JVM compiler will affect the code they write. There are
too many quirks. While in theory it's possible to write comparably fast Java
code (assuming you're willing to do sketchy things for memory-manipulation-
related tasks), in practice it's significantly more difficult to get all the
juice out with higher-level languages.

~~~
jwilliams
And assembly would be faster again?

Both the C compiler and the JVM JIT target machine code. I don't buy that one
is automatically & universally superior to the other. You can equally argue
that the JIT will always be superior because it's run-time aware.

~~~
krakensden
In the counterfactual world where the JVM's JIT commonly won benchmarks
against C, that would be a great point.

~~~
rayiner
What kind of benchmarks? The benchmarks C folks always want to use are the
ones that favor static compilation (matrix multiplication, etc.). The best any
language can do in that situation is tie--it's basically just an exercise in
how much money has gone into the compiler's low-level optimizer.

Consider a benchmark implementing a common situation in real programs--a
performance-intensive loop calling some code located in a plugin that isn't
known until runtime. The JVM will wipe the floor with a C compiler in that
situation, since the JVM can do inlining at runtime across module boundaries,
while a C compiler cannot do inlining at runtime at all.
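
A sketch of that scenario from the C/C++ side (hypothetical plugin.so and
transform symbol, using POSIX dlopen; link with -ldl). The compiler sees only
an opaque function pointer, so it cannot inline the call, whereas a JIT
compiling at runtime could:

    #include <dlfcn.h>
    #include <cstdio>

    using transform_fn = int (*)(int);

    int main() {
        // plugin chosen at runtime; the compiler never sees its code
        void* plugin = dlopen("./plugin.so", RTLD_NOW);
        if (!plugin) return 1;
        auto transform =
            reinterpret_cast<transform_fn>(dlsym(plugin, "transform"));

        long sum = 0;
        for (int i = 0; i < 100000000; ++i)
            sum += transform(i);  // opaque indirect call in the hot loop
        std::printf("%ld\n", sum);
        dlclose(plugin);
    }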

Or, when memory management is involved, they always want to test situations
where everything can easily be pool-allocated, instead of complex situations
involving lots of dynamic allocation. But in situations where code really is
allocating lots of short-lived objects (e.g. functional code manipulating
optimizing-compiler trees), the GC in a JVM is going to wipe the floor with
any malloc() implementation.
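
For illustration, the allocation pattern in question looks something like this
(hypothetical Node type): each pass produces a stream of short-lived objects,
which a generational GC reclaims almost for free, while malloc/free pays per
object.

    #include <memory>

    // Hypothetical expression-tree node, for illustration only.
    struct Node {
        int value;
        std::shared_ptr<Node> left, right;
    };

    // Each pass allocates a fresh tree and drops the old one, producing
    // lots of short-lived allocations.
    std::shared_ptr<Node> rewrite(const std::shared_ptr<Node>& n) {
        if (!n) return nullptr;
        return std::make_shared<Node>(
            Node{n->value + 1, rewrite(n->left), rewrite(n->right)});
    }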

~~~
jamwt
But, in "real programs", all this doesn't usually matter b/c clever JIT tricks
with loops and branches fail to offset 5-10x the memory use and the abysmal
corresponding CPU cache hit rate.

See "branch misprediction" vs. "fetch from main memory" on norvig's rough
timings list: <http://norvig.com/21-days.html#answers> .

~~~
pkolaczk
5-10x the memory use? Only if the coders smoked something. I agree that a
JITed runtime has some overhead, but in the great majority of cases it is far
below what you write here.

------
Permit
Mysticial mentions that Intel's compiler uses loop interchange to gain an
extraordinary speedup. There's actually a great answer further down the page
that details this; it was great, as I wasn't aware of how it worked:
<http://stackoverflow.com/a/11303693/300908>

~~~
DannyBee
Meh, GCC can do the interchange as well, it just isn't on by default.

------
ColinWright
Isn't HN amazing? Here is the same article submitted on no less than four
previous occasions:

* <http://news.ycombinator.com/item?id=4185226>

* <http://news.ycombinator.com/item?id=4170972>

* <http://news.ycombinator.com/item?id=4355548>

* <http://news.ycombinator.com/item?id=4167834>

Total upvotes: 29.

Total discussion: 0.

No wonder people try to game the system and find optimal times to submit
things.

No wonder people think HN is broken.

~~~
dfc
Colin, I find it amazing that you did not confess that you were one of the
people who submitted a duplicate of this story, your duplicate being the most
recent:

4167834: zmanji 105 days ago

4170972: VeXocide 104 days ago

4185226: moobirubi 101 days ago

4355548: ColinWright 63 days ago

Especially given your fight against duplicated stories:

 _"My intention in doing what I did was always to try to create value by
reducing duplication (and thereby reducing wasted effort)"_

Glass houses?

~~~
ColinWright
I was going to reply privately, but could find no contact details. I see you
espouse HN Notify, so I'm hoping you'll see this response.

My duplicate detector found the initial pair almost immediately. I did
nothing, because there was no discussion, so I didn't know which one to point
to. Then the third came. Still no discussion. I watched in considerable dismay
as a submission that I thought was absolutely excellent, and deserving of
significant discussion, sank without trace.

Was it just unlucky? Did no one read it? There were a few desultory upvotes -
a few had read it. But was I so completely wrong about what HN would find
interesting?

I couldn't believe it, so I submitted it again _over a month later._ Still no
upvotes, still no discussion. It seemed I was wrong about the audience here.

And now I can see that I was right about HN being interested in deeply
technical issues like this, and I'm happy. I'm less happy that a brilliant
item like this took four submissions, sorry, _five_ submissions before it got
noticed.

That's why now I generally try to point out duplicates in two circumstances.
One is when they come close together, and that's to help stop a split
discussion. The other is when there was a significant discussion on an earlier
occasion, and I want people to benefit from previous comments.

So no, I don't think it's a case of glass houses.

~~~
wglb
> But was I so completely wrong about what HN would find interesting?

I don't think so.

Whether a good submission gets noticed and upvoted early depends quite a bit
on what time of day it is submitted. I have formed the opinion that 0700
Eastern might be the best time: the East Coast sees it, the UK sees it, and if
something gets a few upvotes within a few minutes, it is likely to hit the
front page, at which point it is much more likely to be evaluated on its
merits.

Keep in mind that upvotes come from two sources. The first is someone viewing
an HN page, perhaps /newest, who likes the article and clicks the up arrow.
The other is another user submitting the same link, which counts as an
automatic upvote. So if a popular tweet is noticed and submitted, the chance
of a handful of submissions within a few minutes is higher.

And I have a suspicion that if one were to tally the number of times the up
arrow was hit on the front page as opposed to the /newest page, the front page
would be vastly higher.

------
Lagged2Death
_You are the victim of branch prediction fail._

"Failure" is a perfectly serviceable word and the cutesy "fail" is just
annoying in a technical discussion like this.

~~~
kahawe
> _"fail" is just annoying in a technical discussion_

I am sorry, but you know what is REALLY annoying? Nit-picking at what must be
the atomic level, considering the depth and detail of this perfectly brilliant
reply.

Technical discussions need a LOT more people like Alexander Yee aka
"Mysticial" - and technical discussions sure as hell do not need ANY people
like you complaining about the choice of words or slang, as long as it is
perfectly clear and understandable for the target audience. On that account he
more than delivered.

~~~
Lagged2Death
_Nit-picking at what must be the atomic level considering the depth and detail
of this perfectly brilliant reply._

Can't have it both ways. The more depth and detail a discussion tries to
present, the _more_ important it is to get the details right. The medium
(StackExchange) even recognizes this and has editing, correcting, and
collaborative fixing built into its DNA, for that very reason.

The more famous this response becomes, the more ESL readers there will be. Why
trip them up this way? There's just no reason to use "fail" like this here.

"Branch prediction fail" is not actually a very clear or precise phrase. It
could mean the failure of a single branch prediction attempt; it could mean
the failure of the technique in a particular case, and it could refer to an
author's opinion that it was a failure in general. Why muddy the waters?

~~~
kahawe
You are being pedantic to an extreme I haven't encountered even amongst the
worst "grammar nazis", and you are using nothing but FUD arguments. So his
reply was "muddied" up by that? The number of upvotes, comments, and people
linking to the reply very strongly begs to differ. It is a brilliant reply; I
am not a native speaker, and even I understood it perfectly well, so he must
have done something right.

You do not have a single reason to complain, and the sooner you realize that,
the better it will be for your own sake.

~~~
Lagged2Death
So, to recap: I voiced my irritation at a particular word choice, and the
conclusions are:

1) I am therefore _worse than the Nazis_

2) _I'm_ the one who's being unreasonable

3) If something is _popular_ it follows that it is also _perfect_ and cannot
be improved in any way.

Thanks for clearing that up.

------
cvursache
If you're looking for a good online course on how computer systems work, from
the AND-gate scale up to data-center size, the iTunes U course "Computer
Science 61C" (UC Berkeley, with Daniel Garcia) is really informative and fun
to watch.

[http://itunes.apple.com/de/itunes-u/computer-
science-61c-001...](http://itunes.apple.com/de/itunes-u/computer-
science-61c-001-fall/id461056991)

------
mrich
Not to hate on the great top-voted answer, but isn't it a bit strange that it
was posted a mere five minutes after the question?

~~~
masklinn
Not really, a very common pattern on SO is to jump on a question to which you
know you have a good answer, put something very short (but correct) and then
edit in a better, longer response. This avoids "losing" initial votes from
people looking at responses while you're fleshing out the description, adding
benchmarks, etc...

It's definitely gaming the system, but it's not really suspicious (in the
sense of "guy asks a question and immediately puts a response he had pre-
written in a text editor" suspicious)

~~~
m_myers
The "suspicious" case you mention is explicitly supported by the system; you
can actually write the question and answer at the same time if you want. The
normal content rules still apply, of course; if the question isn't about an
actual problem, you're more likely to win a bundle of downvotes.

------
juddlyon
The fact that someone would take the time to help someone in such depth
restores my faith in humanity. With a picture to boot.

~~~
kahawe
The fact that someone would complain about the use of the word "fail", on HN
no less, should nicely even out whatever faith was restored, unfortunately...

~~~
Evbn
Nope, you can look away from the troll and focus your attention on the good
stuff. You have the power to weigh good stuff more heavily.

------
zippie
The answer focuses on branch misprediction, but there is also another factor:
CPU data prefetch.

In order to offset increasing CAS latencies (largely due to the increasing
number of channels), modern CPUs are greedy (in a good sense) and fetch more
pages per memory-controller read request than are necessary.

If your data is laid out contiguously (in this case, sorted), the chances of
the pages you need next being primed via prefetch are greatly increased, and
further read requests to the memory controller may be reduced or avoided.

I benchmarked the effect of data prefetching on modern processors with large
data structures (usually limited to L1d cache lines):

[https://github.com/johnj/llds#l1-data-prefetch-
misses-200000...](https://github.com/johnj/llds#l1-data-prefetch-
misses-200000-per-hardware-sample)

A plausible analogy for CPU prefetching would be "readahead" mechanisms
implemented by modern filesystems and operating systems.
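
A minimal sketch of the prefetch effect described here (note this is not the
benchmark from the question; as the reply below points out, both versions
there walk the array in order):

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    int main() {
        const std::size_t n = 1 << 24;
        std::vector<int> data(n, 1);
        std::vector<std::size_t> order(n);
        std::iota(order.begin(), order.end(), std::size_t{0});  // 0, 1, 2...

        long sum = 0;
        for (std::size_t i : order)
            sum += data[i];  // sequential walk: the prefetcher stays ahead

        std::shuffle(order.begin(), order.end(), std::mt19937{42});
        for (std::size_t i : order)
            sum += data[i];  // shuffled walk: defeats the prefetcher
        std::printf("%ld\n", sum);
    }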

~~~
ww520
While prefetch is an important factor in general, it is not the cause here,
since the data array is accessed sequentially in both the sorted and unsorted
scenarios. Also, the initial number-generation loop already primes the data in
memory for either scenario.

------
dennisgorelik
Alexander J. Yee, who answered the question, is a student who calculated Pi to
5 trillion digits:

[http://www.geekosystem.com/pi-five-trillion-digits-
alexander...](http://www.geekosystem.com/pi-five-trillion-digits-alexander-
yee-shigeru-kondo/)

_The main challenge for a computation of such a size, is that both software
and hardware are pushed beyond their limits. For such a long computation and
with so much hardware, failure is not just a probability. It is a given._

------
fleitz
Sorted arrays make it easy for a simple branch predictor to guess correctly,
and if the branch predictor guesses correctly you avoid a pipeline stall.

Assuming the median value in the array is 128, then in the sorted case the
branch resolves the same way for the entire first half of the array and the
other way for the entire second half, so the predictor mispredicts essentially
once, at the crossover.

If the OP had optimized their code, sorted/filtered the array, and then summed
only the values of at least 128, it would be obvious why sorting is faster.
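
For reference, here is the loop under discussion plus one well-known
branch-free rewrite (a sketch, not the OP's code; the mask trick assumes the
values are non-negative, as they are in the benchmark):

    #include <vector>

    long sum_branchy(const std::vector<int>& data) {
        long sum = 0;
        for (int x : data)
            if (x >= 128)  // ~50% mispredicted on random data, ~0% on sorted
                sum += x;
        return sum;
    }

    long sum_branchless(const std::vector<int>& data) {
        long sum = 0;
        for (int x : data)
            sum += x & -(x >= 128);  // mask is all-ones iff x >= 128
        return sum;
    }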

------
grannyg00se
What about the time taken to do the sorting? Is there some rule of thumb that
hints at whether taking the time to sort the data is going to be a net win?

~~~
Tichy
I would expect sorting to take a lot longer than just summing.

------
raverbashing
Interesting test, but it's not news if you know something about code
optimization.

Caching of the data may also play a part in similar cases (but not this one;
or rather, the effect is negligible).

Here's a little test that can be tried: make the test result (inside the if)
alternate between true and false (for example, sum only if the index is even),
and see how long it takes.
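
A sketch of that experiment (array size is arbitrary):

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> data(10000000, 1);
        long sum = 0;
        for (std::size_t i = 0; i < data.size(); ++i)
            if (i % 2 == 0)  // the condition flips every iteration
                sum += data[i];
        std::printf("%ld\n", sum);
    }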

Spoiler: modern branch predictors can detect and predict cyclic conditions

------
hm8
TL;DR: Wouldn't it be great if the branch prediction implementation switched
itself on and off automatically (with every new process)? This would be
difficult to trace or even implement, but I'm just guessing: the CPU stops
using branch prediction if a lot of its predictions are failing, but starts
again after some time.

------
prophetjohn
Branch prediction is not the only reason this is faster, and probably not even
the biggest reason.

Since the array is sorted,

    if (data[c] >= 128)

will evaluate to true consecutively. When the cache requests something from
memory, it will request blocks containing multiple words at a time. Since
every data[c] that needs to be added to sum is in a contiguous piece of
memory, the code is minimizing the number of times a block is transferred from
memory to the cache. This is the concept of spatial locality[1].

[1][http://en.wikipedia.org/wiki/Locality_of_reference#Locality_...](http://en.wikipedia.org/wiki/Locality_of_reference#Locality_of_reference)

~~~
tptacek
No. The "true" or "false" value of a simple expression isn't stored in memory.
It's stored as a bit in the status register as a side effect of the "cmp"
instruction†. It is no faster for "cmp" to set "true" than it is to set
"false".

† _(CMP is actually setting carry, overflow, sign, zero, &c; "truth" or
"falsity" is decided by the specific conditional jump, here JL, which checks
if sign != overflow)._

~~~
prophetjohn
That's not what I was saying. I was indicating that more memory accesses would
be required in unsorted data for the `sum += data[c]` computation, obviously
overlooking that data[c] is already in a cache, and probably in a register as
a result of the comparison with 128.

~~~
drv
But that is true whether the array is sorted or not; the indexes (and
therefore memory accesses) in the loop are still in order even if the contents
of the array are not sorted.

~~~
prophetjohn
I know. Let me state it another way.

"I was wrong, but not in the way that you(tptacek) understood me to be."

~~~
tptacek
For the record: I acknowledge that you were not wrong in the way I understood
you to be wrong; I responded because I got the sense that the thread was now
discussing the storage of boolean expression results, and klaxons inside my
nerd brain started going off.

------
keefe
A very broad intuition: less entropy in the input means more predictability in
executing algorithms on that input, and more regularities to exploit.

------
marblar
Where can I learn more about this?

