Why is processing a sorted array faster than an unsorted array? (stackoverflow.com)
590 points by ashishgandhi 1446 days ago | 119 comments

Modern architectures really are quite complex, and it's important to keep your internal machine model up to date with what's happening. Examples: on a 486, even correctly predicted taken branches had a two-cycle penalty, so it used to be important whether you arranged your code so the most commonly taken path was on the fall-through path or not. Today, correctly predicted branches are free: the processor stitches together the current path and the new path so there are no bubbles in the pipeline. Virtual function calls also used to be more expensive back in the day, because CPUs didn't try to predict the target address of the branch. Today, CPUs have a buffer that maps from a program counter to a target address, so if your virtual function call predictably hits the same target each time, the only overhead will be the extra memory accesses to indirect through the v-table.

At the same time, things that are expensive today may not be so in the near future. Haswell is supposed to make uncontended synchronization operations almost free. It'll make a lot of algorithms, particularly lock-free algorithms, much more practical than they are today. For example, C++ smart pointers are slow in multi-threaded situations because the reference count manipulation needs to be done under a lock. But the lock is almost never needed (except when it is). Cheap synchronization should make it much more practical to use reference counting more pervasively in C++ programs.

"For example, C++ smart pointers are slow in multi-threaded situations because the reference count manipulation needs to be done under a lock."

Several architectures support atomic increment and decrement operations, so there's no need for any lock. Libraries like OpenSceneGraph use them for their reference counting.

Also, Haskell isn't magical in this regard. Its lock-free implementations will also presumably just use safe atomic operations.

Atomic increment/decrement operations still generate a locked memory transaction of some form or other at the processor level. Haswell is Intel's new CPU that is supposed to implement some level of hardware transactional memory, leveraging the infrastructure used for cache coherency. In theory, this will make transactional operations (an atomic increment/decrement is just a tiny transaction) as cheap as a regular memory operation so long as there is no contention.

Note the classic 'bus lock' is a worst case nowadays: caches in SMP machines have their own protocols for invalidating and locking individual lines, so the entirety of the machine's memory isn't necessarily serialized, though I'm not sure when this kind of locking applies.

Sort of, but it's not really a "lock". If, as is very likely in performance-sensitive-but-uncontended multithreaded code, the CPU already has the cache line in "exclusive" (in the MESI sense) state, then the atomic operation need be no slower than a standard read/write sequence. On x86 it is serializing, however, which can have performance impacts for some workloads.

An atomic operation on a cacheline owned exclusively by the local CPU avoids any bus lock operations, but still serializes the pipeline to avoid conflicts caused by memory operations on the same CPU. This basically destroys your memory parallelism if you use it in a situation where the atomic instruction happens often (e.g. every time an object reference is loaded or stored).

Isn't that exactly what I said? Note that "destroy memory parallelism" is often not a high penalty in typical workloads where all in-flight accesses are hitting L1 cache only. It's not nearly as high as the "full round trip to DRAM" latency implied by calling it a "lock".

I'm not really disagreeing with you, but I think we're using the terminology differently. To me, "memory" is the whole memory pipeline--everything from the load/store units, through the load/store buffers, the caches, to DRAM.

When I said it was a "locked memory transaction of some sort" I was including the effect of a serializing instruction which prevents concurrent load/store operations even if they hit the cache. I wasn't trying to imply a full round trip to DRAM.

Depends. One atomic writer with multiple readers works out to be pretty darn fast if protected by only a memory fence.

Oh sorry, I had been reading your post a bit sloppily, and transactional memory was just too strong an association with Haskell.

I think s/he meant the upcoming Haswell processor architectures, not Haskell the language.

It's important to note that even at the hardware level (for Intel x64, at least), atomic increments and decrements are, in fact, just the operations with locks around them.

So relying heavily on them in multithreaded cases with heavy contention can actually reduce performance over a more complex lockless scheme.

Ah this.. the highest voted Q on SO. Not again plz - http://www.hnsearch.com/search#request/all&q=Why+is+proc... (interesting case of same page diff urls bypassing HN duplicate url check though)

Go to http://stackoverflow.com/questions sort by votes and read from top if you like.

I'm a regular HN reader and I've never seen this before. I don't think it's so bad to re-post, especially since it's actually HN-quality content.

I'd second that -- the fact that it jumped to the top means there are enough people who haven't seen it before

This is totally missing the point of a major part of HN's (and other communities') value: the actual links may or may not be important; it's the community discussion that has more value. In many cases, I find myself not even clicking on the links on HN (either the topic is too mundane or I've already seen it from another source) but going straight for the comments.

This particular post is a good example: I have seen this interesting SO tidbit before but learned a few things from the HN discussion.

The results show that the four other links received very little attention (three links with 2 points, one with 21 points, all of them with no comments). Given that the HN audience is spread across the globe with a consistent interest in programming articles, I think posting time must have an influence. Poster karma is not a big factor here, because ColinWright has 33711 karma. Does anyone want to conduct such research?

This is actually quite interesting. What it tells me is that "luck" (not timing, nor karma) is the most significant determinant of whether something goes viral or not. Content comes second.

I also saw it a few weeks ago here on HN and was wondering why it is here again... At least I wasn't the only one who thought so ;-)

Sometimes I wonder how old 'news' can be and still climb up to the top of HN. Is there somewhere a search engine which can tell the author if something was on HN already? Or if it is about good websites and not about news, why has nobody posted a link to google.com during the last weeks? ;-)

> Sometimes I wonder how old 'news' can be and still climb up to the top of HN. Is there somewhere a search engine which can tell the author if something was on HN already?

What value is there to the people who are finding this content interesting today in rejecting it because some other subset of the HN community saw it at some arbitrarily distant time in the past?

Reposted interesting technical content beats the hell out of boring industry navel-gazing news story #82795 and Android-vs-Apple "news" intended to provoke a fanboy flamewar thread #762483101.

If you're spending so much time on HN that you not only recognize reposts but that you see them so often it bothers you enough to complain in the comments, consider doing something more productive with your time.

Wasn't there something from the 80's recently?

A week or two ago we had a text from the '70s about diamonds, AFAIR. :)

It's certainly better than those countless "why you should be hacking stuff with XYZ" (or why you shouldn't) posts.

Several months old, but still worth a read - especially for the end of Mysticial's comment, where he discusses the different optimizations performed by different compilers. Particularly interesting is the Intel compiler, which effectively optimizes out the benchmark itself: something to keep in mind when testing your own code, if you want your results to make sense.

One takeaway point from this for non-C coders is that higher-level languages will always be significantly slower than C, because it's far easier to make these kinds of fiddly pipeline optimizations when you don't have another abstraction layer in between mucking things up. There are few Java programmers, for instance, who intuitively understand how both the machine architecture and the JVM compiler will affect the code they write; there are too many quirks. While in theory it's possible to write comparably fast Java code (assuming you're willing to do sketchy things for memory-manipulation tasks), in practice it's significantly more difficult to get all the juice out of higher-level languages.

The counter argument is that if you've compiled into a semantic description (bytecodes for a virtual machine) then the underlying VM implementation can optimize a semantically correct pipeline based on features available to it. Thus programmer time is saved (they wrote their code once) at the expense of a higher burden on the VM to do the right thing.

This argument (that the VM knew best) was put forward by Bill Joy early on in the development of Java as the reason Java would eventually be faster than compiled C code. We discovered that in practice 'friction' [1] between hot-spot compilation and instruction set architectures nearly always dominated this equation (leaving Java code slower).

Architectures where the friction was reduced (like the Jazelle instruction set on ARM) could swing the balance the other way.

[1] Friction in this case is the cost to performance of the difference of expressibility between a byte coded instruction and the actual processor architecture.

What the JVM compiler does to your code isn't really that much harder to understand than what GCC or LLVM does to your code. Assuming equally skilled programmers, I think it's often easier to write fast code in a higher level language. The primary bottleneck for performance is usually developer time. If you can get the non-performance critical parts written in half the time, you have way more time to optimize. Higher level languages also generally have much better profiling and introspection tools, which make debugging much easier. Debugging tightly optimized code is, of course, one of the big time sucks in optimizing code.

This is not correct: there are things that can be faster than C because of the compilation aspect (the compiler can only optimize with the information it has, some of them can only be known at runtime).

A good example is iteration over non-trivial, multi-dimensional structures: if you iterate over a single dimension, then some variables become actual constants, but this can only be known once you know the iterator parameters (i.e. even profile-guided optimization would not work).

You could obviously implement a runtime-based optimizer in C, but then the assertion that you can never beat C is a tautology.

On the other hand, sometimes in higher-level languages the compiler can prove more constraints and then apply more aggressive optimizations. (I don't remember a specific case right now; perhaps we still need a few years for an interesting example.)

Another possibility is that using a higher-level language makes it possible to write more complicated code, with smarter optimizations, because the programmer doesn't have to take care of all the details. For example, the PyPy project has a Python interpreter written in a combination of RPython and C that is faster than the standard Python interpreter, which is written directly in C.

Fortran defaults to strict aliasing and C defaults to loose aliasing, leading to naive Fortran programs sometimes being optimized better than equivalent naive C programs. Or as some people say, "Fortran is faster than C".

Another lesson from Fortran is to include arrays as a native type in your language. Intel continues to push their Fortran compiler whenever you shop for their numeric C libraries with some convincing benchmarks. https://speakerdeck.com/u/multicoreworld/p/james-reinders-in...

You can carefully design an interpreter to mitigate these issues.

You can also have a JIT compiler remove branches altogether based on experimental data, something completely impossible in C.

Don't point at one thing and claim it makes a language class faster or slower.

Indeed, a JIT compiler will always be able to take advantage of something that isn't possible for something that is compiled once and set in stone.

However, that's not to say that upfront compilation is doomed never to compete.

Profiling C compilers can mimic this behaviour to an extent, although they can only profile the situations they experience during their training phase, so they can't adapt to unforeseen scenarios or dramatic shifts in the input stream.

C compilers could be extended to implement several/many variants of each function based on different possible code paths and, using information gained from on-the-fly timings, switch between the variants in order to optimise execution. Combinatorics would place a low limit on the percentage of possible variants implemented, though.

Finally, no-one ever said that C has to be a compiled language. There's nothing stopping someone writing a JIT C interpreter and getting the best of both worlds. (Ugh!)

True, I should have been clearer that I meant only C under the mainstream implementation style. If you get creative there are so many wonderful options. I know there's already TCC and a variety of REPLs.

C++/CLI probably comes close to "C with JIT"

Isn't it possible to JIT C code with LLVM?

It's not impossible in C. It's impossible statically.

There is nothing that stops you from JIT'ing C/C++. LLVM includes a JIT that functions just fine on C/C++ programs.

And assembly would be faster again?

Both the C compiler and the JVM JIT target machine code. I don't buy that one is automatically & universally superior to the other. You can equally argue that the JIT will always be superior because it's run-time aware.

Yes, it's always possible to write assembly code that is at least as fast as C code, if not faster.

But it's entirely another question whether you, as a mere mortal human, are able to do a better job at producing assembly code in practice, for non-trivial programs, than a compiler.

Think we completely agree.

It's pretty much demonstrated by the update to the piece -- certain compliers & compiler flags did produce the right optimization, whereas the original author did not.

In the counterfactual world where the JVM's JIT commonly won bechmarks against C, that would be a great point.

What kind of benchmarks? The benchmarks C folks always want to use are the ones that favor static compilation (matrix multiplication, etc). The best any language can do in that situation is tie--it's basically just an exercise of how much money has gone into the compiler's low-level optimizer.

Consider a benchmark implementing a common situation in real programs--a performance intensive loop calling some code located in a plugin that isn't known until runtime. The JVM will wipe the floor with a C compiler in that situation--since the JVM can do inlining at runtime between module boundaries, while a C compiler cannot do inlining at runtime at all.

Or, when memory management is involved, they always want to test situations where everything can be easily pool-allocated, instead of complex situations involving lots of dynamic allocation. But in situations where code is really allocating lots of short-lived objects (e.g. functional code for optimizing compiler trees), the GC in a JVM is going to wipe the floor with any malloc() implementation.

But, in "real programs", all this doesn't usually matter b/c clever JIT tricks with loops and branches fail to offset 5-10x the memory use and the abysmal corresponding CPU cache hit rate.

See "branch misprediction" vs. "fetch from main memory" on norvig's rough timings list: http://norvig.com/21-days.html#answers .

5-10x the memory use? Only if coders smoked something. I agree that JITed runtime has some overhead, but in a great majority of cases it is far below what you write here.

I think the best benchmark for this is to look at the industries where performance matters and see what they are using. Take high-frequency trading where lowest latency makes the most money. I don't see many using the JVM. AFAIK it's all assembly, C and C++. If they could use the JVM and have faster code, they would be doing so.

> But in situations where code is really allocating lots of short-lived objects (e.g. functional code for optimizing compiler trees), the GC in a JVM is going to wipe the floor with any malloc() implementation.

Even an arena allocator?

Arena allocation is feasible only for particular types of allocation patterns. If that weren't true, every malloc() implementation would just be based on an arena allocator.

Specifically, for arena allocation to work, you have to have points at which you know all objects in an arena will become garbage. There are many situations where you know that most objects will quickly become garbage (say 98%), but can only guarantee at a very large scope that all objects will become garbage.

Consider something like a compiler that optimizes a tree by replacing sets of nodes with simpler ones. Where do you put the pool delete operation? You know lots of nodes will become garbage after each optimization pass, but you know some will survive. But you can't statically segregate which ones. So you can put the pool delete at the very end after code generation, when you know you don't need the tree anymore. But then your memory use skyrockets--you don't reclaim any memory until the very end of the process.

GCC, for example, used to use a pool-allocation facility called obstacks. They ended up switching to a home-rolled GC because that system became too hard to manage.

This gets into a can of worms, because while much of the performance benefit of an arena is the fact that you don't need to do per-object bookkeeping, the other benefit is that it's fast to allocate, track, and free data when you know all the sizes in advance, and that is a benefit you can get from a pool allocator, which isn't much more complicated than an arena.

The general-purpose malloc(3) has to solve a very hard problem that most allocations in most programs probably don't really need it to work so hard to solve.

The benchmarks were JIT running time versus C compile time + runtime.

Not exactly apples to apples.

It's a question of degrees of freedom.

Lower-level languages like C can include ASM code with little overhead, and they can also swap code around at run time just like a JIT. Granted, the vast majority of CRUD code has no need for such capabilities, but since JVM languages have more constraints and absolutely no new capabilities when it comes to the ASM they produce, they fall behind when performance is most critical.

That said, the vast majority of programmers will never get anywhere near these limits.

If you need to super-optimize execution you can build C or ASM code and call it via a library or inline it into your high-level language. Java can call functions in shared libraries (or use JNI) and Perl can inline straight C or ASM.

I'm not sure why you'd want to get "all the juice" out of higher level languages, though. I thought the whole point was to abstract away from machine optimization? For example, to learn to write more efficient Perl I picked up "Mastering Algorithms with Perl" and it had loads of helpful tips.

Yes, there are some critical algorithms which will be always be faster in C.

In practice it's significantly more difficult to optimize all your code based on deep knowledge of architectures, compilers, and even compiler versions than it is to just let a JIT compiler optimize at runtime.

On the other hand, using a higher-level language and ready-made code can mean that someone else has already implemented all those optimizations you would never implement yourself.

> higher level languages will always be significantly slower than C

Looking at benchmarks, I can see this is factually incorrect.

Also... show me where C allows you to see the pipeline directly. It's hard to optimize something you can't see.

C is an abstraction layer. So is assembly. Both are relatively thick when you're talking about things like branch prediction and pipelining.

Mysticial mentions that Intel's compiler uses loop interchange to gain an extraordinary speedup. There's actually a great answer further down the page that details this, which was great reading, as I wasn't aware of how it worked: http://stackoverflow.com/a/11303693/300908

Meh, GCC can do the interchange as well, it just isn't on by default.

Isn't HN amazing? Here is the same article submitted on no less than four previous occasions:

* http://news.ycombinator.com/item?id=4185226

* http://news.ycombinator.com/item?id=4170972

* http://news.ycombinator.com/item?id=4355548

* http://news.ycombinator.com/item?id=4167834

Total upvotes: 29.

Total discussion: 0.

No wonder people try to game the system and find optimal times to submit things.

No wonder people think HN is broken.

Colin, I find it amazing that you did not confess that you were one of the people who submitted a duplicate of this story, your duplicate being the most recent:

4167834: zmanji 105 days ago

4170972: VeXocide 104 days ago

4185226: moobirubi 101 days ago

4355548: ColinWright 63 days ago

Especially given your fight against duplicated stories:

"My intention in doing what I did was always to try to create value by reducing duplication (and thereby reducing wasted effort)"

Glass houses?

I was going to reply privately, but could find no contact details. I see you espouse HN Notify, so I'm hoping you'll see this response.

My duplicate detector found the initial pair almost immediately. I did nothing, because there was no discussion, so I didn't know which one to point to. Then the third came. Still no discussion. I watched in considerable dismay as a submission that I thought was absolutely excellent, and deserving of significant discussion, sank without trace.

Was it just unlucky? Did no one read it? There were a few desultory upvotes - a few had read it. But was I so completely wrong about what HN would find interesting?

I couldn't believe it, so I submitted it again over a month later. Still no upvotes, still no discussion. It seemed I was wrong about the audience here.

And now I can see that I was right about HN being interested in deeply technical issues like this, and I'm happy. I'm less happy that a brilliant item like this took four submissions, sorry, five submissions before it got noticed.

That's why now I generally try to point out duplicates in two circumstances. One is when they come close together, and that's to help stop a split discussion. The other is when there was a significant discussion on an earlier occasion, and I want people to benefit from previous comments.

So no, I don't think it's a case of glass houses.

> But was I so completely wrong about what HN would find interesting?

I don't think so.

Whether a good submission gets noticed and upvoted early depends quite a bit on what time of day it is submitted. I have formed the opinion that 0700 Eastern might be the best time: the east coast sees it, the UK sees it, and if something gets a few upvotes within a few minutes it is likely to hit the front page, at which point it is much more likely to be evaluated on its merits.

Keep in mind that upvotes come from two sources. The first is when someone viewing an HN page, perhaps /newest, likes the article and clicks the uparrow. The other is when another user submits the same link, which counts as an automatic upvote. So if a popular tweet is noticed and submitted, the chance of a handful of submissions within a few minutes is higher.

And I have a suspicion that if one were to tally the number of times the uparrow was hit on the front page as opposed to the /newest page, the front page would be vastly higher.

Who the hell cares, you two? We're talking about it now. The system worked fine for this story. Can we move on from this sideshow?

This whole story generated a pretty great thread; why look for reasons to have a problem with that?

I realize that your comment was mainly rhetorical but...

I don't have a problem with duplicates. I think a lot of things are worth reading/seeing/doing multiple times. Moreover, I do not think or expect anyone to be perfect.

I do have a problem with Colin's crusade against duplicates. I don't think it sets the right tone for HN, and it is a disincentive for people to submit things. I would rather have people feel free and comfortable submitting things and then let the upvoting system separate the chaff. If you go through Colin's comments you can see hundreds of discussion-ending comments about duplicate stories. Given his distaste for duplicates, I found it disingenuous that he would bring up the problem of duplicates given that his was the third duplicate in the list.

If it's a recent exact duplicate, the submission is merely another upvote, which is a neat function of the system (this happened to me recently with a story I submitted about correlation/causation that you had already submitted). If it's a duplicate and boring, it will be flagged or at the very least not upvoted. If it's a duplicate and people find it interesting, it will be upvoted. If a duplicate I have read before is upvoted, I can choose to ignore it or read the discussion to see if there are new opinions. Duplicates are not a problem for me; reducing the likelihood of people contributing to HN is a problem.

> I do have a problem with Colin's crusade against duplicates.

You should re-read his comments about duplicates more carefully if you think he is on a crusade.

Here is an excerpt from his most recent comment:

    Those who are interested in this subject and want to 
    discuss it, you may be interested in seeing the points - 
    good and bad - that have been made on previous occasions 
    that this has been submitted. Here are a few previous 
That doesn't look like a crusade against duplicates to me. It looks helpful and I'm glad it's there.

Duplicate discussions are not the same as duplicate submissions.

You are the victim of branch prediction fail.

"Failure" is a perfectly serviceable word and the cutesy "fail" is just annoying in a technical discussion like this.

When I see colloquialisms used in technical discussions, I generally assume that the author is attempting to convey the topic in a more casual and approachable way so as to maximize the number of people who read and think about what is being said. You may not be aware of this, but pretentiousness does turn a lot of people off. This stops said people from inquiring further about things they may actually discover they care about. The downside of all this pretentiousness is that the world as a whole is less educated, because the people who possess the knowledge and know-how refuse to accommodate their audiences with simple word changes such as this.

There's nothing pretentious about the word "failure."

I do not believe the people who plod through a huge multi-page, multi-code-sample question at StackOverflow are just casual dabblers who are likely to leap away from the discussion when they discover the top-rated answer features the word "failure."

I suspect that you also believe that you are not being pretentious and pedantic by raising this point. I emphatically disagree. The latter half of the criticism raised in my comment was directed towards your initial comment and comments or explanations like yours that are prevalent in advanced fields. I apologize for not being clear.

I'm sorry, but I got turned off from reading the rest of your comment when I got to "emphatically." Maybe you should change it to "like totally." To accommodate me.

like totally noted, dude! :)

I didn't read "fail" there as being cutesy/meme-y. I [think I] heard people use the term "branch prediction fail" many years before "FAIL!" became "a thing".

That anyone would grok less word salad resulting from cutesyhood is dubious, bro.

> "fail" is just annoying in a technical discussion

I am sorry but you know what is REALLY annoying? Nit-picking at what must be the atomic level considering the depth and detail of this perfectly brilliant reply.

Technical discussions need a LOT more people like Alexander Yee aka "Mysticial" - and technical discussions sure as hell do not need ANY people like you complaining about the choice of words or slang as long as it is perfectly clear and understandable for the target audience and on that account he more than delivered.

> Nit-picking at what must be the atomic level considering the depth and detail of this perfectly brilliant reply.

Can't have it both ways. The more depth and detail a discussion tries to present, the more important it is to get the details right. The medium (StackExchange) even recognizes this and has editing, correcting, and collaborative fixing built into its DNA, for that very reason.

The more famous this response becomes, the more ESL readers there will be. Why trip them up this way? There's just no reason to use "fail" like this here.

"Branch prediction fail" is not actually a very clear or precise phrase. It could mean the failure of a single branch prediction attempt; it could mean the failure of the technique in a particular case, and it could refer to an author's opinion that it was a failure in general. Why muddy the waters?

You are being pedantic to an extreme I haven't even encountered amongst the worst "grammar nazis" and you are using nothing but FUD arguments. So his reply was "muddied" up by that? The amount of upvotes and comments and people linking to the reply very strongly beg to differ. It is a brilliant reply and I am not a native speaker, even I understood it perfectly well too so he must have done something right.

You do not have a single reason to complain and the sooner you realize that, the better it will be for your own sake.

So, to recap: I voiced my irritation at a particular word choice, and the conclusions are:

1) I am therefore worse than the Nazis

2) I'm the one who's being unreasonable

3) If something is popular it follows that it is also perfect and cannot be improved in any way.

Thanks for clearing that up.

word, kahawe, you just said what I was thinking.

If you're looking for a good online course on how computer systems work, from AND-gate scale up to data-center size, the iTunes U course "Computer Science 61C" (UC Berkeley, with Daniel Garcia) is really informative and fun to watch.


Not to hate on the great top-voted answer, but isn't it a bit strange that it was posted a mere five minutes after the question?

Not really, a very common pattern on SO is to jump on a question to which you know you have a good answer, put something very short (but correct) and then edit in a better, longer response. This avoids "losing" initial votes from people looking at responses while you're fleshing out the description, adding benchmarks, etc...

It's definitely gaming the system, but it's not really suspicious (in the sense of "guy asks a question and immediately puts a response he had pre-written in a text editor" suspicious)

The "suspicious" case you mention is explicitly supported by the system; you can actually write the question and answer at the same time if you want. The normal content rules still apply, of course; if the question isn't about an actual problem, you're more likely to win a bundle of downvotes.

I think without this ability to "game" the system SO wouldn't be half as good as it is now. You end up with brilliant and above all complete answers that can be kept up to date, and the moderation policy reduces duplication and clutter, keeping the SNR high.

Sure, some people might be motivated only by receiving high scores, but it seems that the benefits outweigh the potential for abuse.

It's become common on stackexchange sites to post a brief answer immediately, followed by a much longer edit.

Doesn't matter. You can answer your own question immediately after posting it if you want. It's designed to be a collection of questions and answers, not a forum, and so it doesn't matter who answers what question.

Although Stackexchange puts in some mild game dynamics to make it a competition. At least for some, if you're into that kind of thing.

Game != competition. You get a few points for writing an accepted answer, but two good answers can both get upvote points from one reader.

There is only one best answer for every question, and badges, points, etc. tied to this. Whenever you have limited awards you have competition.

Stop splitting hairs.

Seems like the answer has been edited after lots of upvotes to give a more detailed explanation.

I often do this. There is one answer which I've been quite literally editing for about three years. Every few months I add a bit more information or a few more references or links to it. It's become a kind of personal (but public) repository of information on the topic.

I hope you keep a backup of it yourself.

If you read the post author's comment on his own post, he went back and added all the explanation/analogy after making the initial post that just said it was down to branch prediction.

The fact that someone would take the time to help someone in such depth restores my faith in humanity. With a picture to boot.

The fact that someone would complain about the use of the word "fail", on HN no less, should nicely even out whatever faith was restored, unfortunately...

Nope, you can look away from the troll and focus your attention on the good stuff. You have the power to weigh good stuff more heavily.

The answer focuses on branch misprediction but there is also another factor - CPU data prefetch.

In order to offset increasing CAS latencies (largely due to the increasing number of channels), modern CPUs are greedy (in a good sense) and fetch more pages per memory-controller read request than are necessary.

If your data is laid out contiguously (in this case, sorted), the chances of the pages you need next being primed via prefetch are greatly increased, and further read requests to the memory controller may be reduced or avoided.

I benchmarked the effect of data prefetching on modern processors with large data structures (usually limited to L1d cache lines):


A plausible analogy for CPU prefetching would be "readahead" mechanisms implemented by modern filesystems and operating systems.

While prefetch is an important factor, it is not the cause here, since the data array is accessed sequentially in both the sorted and unsorted scenarios. Also, the initial number-generation loop already primes the data in memory for either scenario.

Alexander J. Yee, who answered the question, is a student who calculated Pi to 5 trillion digits:

http://www.geekosystem.com/pi-five-trillion-digits-alexander...

"The main challenge for a computation of such a size is that both software and hardware are pushed beyond their limits. For such a long computation and with so much hardware, failure is not just a probability. It is a given."

Sorted arrays make it easy for even a simple branch predictor to guess correctly, and when the branch predictor guesses correctly you avoid a pipeline stall.

Assuming the median value in the array is 128 then for the sorted case the first half of the array will have one cache miss and the second half will have one cache miss.

If the OP optimized their code, sorted/filtered the array and then summed only those values greater than 128 it would be obvious why sorting is faster.

What about the time taken to do the sorting? Is there some rule of thumb that hints at whether taking the time to sort the data is going to be a net win?

I would expect sorting to take a lot longer than just summing.

The time it takes to sort an array is a rather well-understood problem. If you can reason about how much performance gain you'll get from looping through a sorted versus unsorted array, you can probably estimate the additional cost of the sort itself.

Unless you have external information, you will need to take at least one pass through the data to determine if it's sorted or not.

Sorting an array of integers whose max value you know (as in the SO question) will take you O(n) with counting sort, so this will be pretty damn fast.

Interesting test, but it's not news if you know something about code optimization.

Caching of the data may also play a part in similar cases (but not this one; or rather, the effect is negligible)

Here's a little test that can be tried. Make the test result (inside the if) be true or false alternately (for example, sum only if the index is even), see how long it takes.

Spoiler: modern branch predictors can detect and predict cyclic conditions

TL;DR: wouldn't it be great if branch prediction could switch itself on and off automatically (per process)? This would be difficult to trace or even implement, but I'm just guessing: the CPU would stop using branch prediction if a lot of its predictions were failing, and start again after some time.

Branch prediction is not the only reason this is faster and probably not even the biggest reason.

Since the array is sorted,

    if (data[c] >= 128)
will evaluate to true consecutively. When the cache requests something from memory, it will request blocks containing multiple words at a time. Since every data[c] that needs to be added to sum is in a contiguous piece of memory, the code is minimizing the number of times a block is transferred from memory to the cache. This is the concept of spatial locality[1].


No. The "true" or "false" value of a simple expression isn't stored in memory. It's stored as a bit in the status register as a side effect of the "cmp" instruction†. It is no faster for "cmp" to set "true" than it is to set "false".

† (CMP is actually setting carry, overflow, sign, zero, &c; "truth" or "falsity" is decided by the specific conditional jump, here JL, which checks if sign != overflow).

That's not what I was saying. I was indicating that more memory accesses would be required in unsorted data for the `sum += data[c]` computation, obviously overlooking that data[c] is already in a cache, and probably in a register as a result of the comparison with 128.

Yes, the value is in a register.

But that is true whether the array is sorted or not; the indexes (and therefore memory accesses) in the loop are still in order even if the contents of the array are not sorted.

The counter also doesn't need cache; it gets assigned a register.

I know. Let me state it another way.

"I was wrong, but not in the way that you(tptacek) understood me to be."

For the record: I acknowledge that you were not wrong in the way I understood you to be wrong; I responded because I got the sense that the thread was now discussing the storage of boolean expression results, and klaxons inside my nerd brain started going off.

Eh, what new item is it requesting from memory inside the branch?

  if (data[c] >= 128)
    sum += data[c];
Whether the condition evaluates to true or false, it has already retrieved the value from the data array in order to test it against the condition.

You're right, that's a pretty careless mistake. Considering that, the only way to stretch this into locality would be to assume that sum would be used in rapid enough succession to live in a register instead of going back to a cache or even memory. That difference seems negligible, though.

You can just compile the code and look at it; yes, the sum gets a register.

I don't think this is true. The loop that sums the values always fetches the array items linearly. Whether they are sorted or not does not make any difference. Since the accumulator ('sum') can be kept in a register there should be no store operations in the inner loop, which means it reduces to simply reading consecutive memory, inspecting the values, and adding some of them to a register.

Branch mispredictions are the only likely cause for the runtime difference. It would be interesting to see whether using multiplexing (sum += data[c] > 128 ? data[c] : 0) or conditional assignments would make any difference. Some CPUs have opcodes for these that don't require branching, but I'm not sure whether they use the branch prediction logic.

In an optimized build data[c] is read from memory only once per loop iteration, so the memory read patterns are identical in the sorted and unsorted cases.

A very broad intuition: less entropy in the input means more predictability when executing algorithms on that input, and more regularities to exploit.

Where can I learn more about this?
