Why is it faster to process a sorted array than an unsorted array? (2012) (stackoverflow.com)
329 points by tosh on Sept 15, 2017 | 97 comments



By far the most interesting part of this post is the update with newer compilers. Intel's compiler, in particular, makes an awesome optimization.

Edit: The update is not new, either. Just the part that I found interesting. Apologies for any confusion.


The update with the newer compilers was done back in 2012. There have only been minor changes since then. (Not to say that it isn't interesting, but so that people don't go to the post expecting new content and get disappointed)


I updated to make this clearer at the top level. Thanks!


Thanks for pointing that out. I'd seen this before so I wasn't going to bother clicking the link, not knowing there was a cool new thing there.


Glad it was worth clicking. I see I accidentally misled some folks that it was recently updated.


Yes, but check out the Wikipedia page for Intel's C compiler. It points out that Intel's compiler selects the optimal code path if the CPU reports "GenuineIntel", but selects the least optimal code path if the CPU is non-Intel. AMD sued Intel for this, and now Intel has a mumbo-jumbo disclaimer that it may not generate optimal code on a non-Intel CPU (without disclosing that it generates the least optimal code).


Can someone explain how interchanging the loops makes it immune to mispredictions, please? I don't get it.


If I'm reading it right, the Intel compiler just took advantage of the outer test loop. It was supposed to loop over the data many times. Instead it looped over each item of the data many times. It got the same answer, but the branch predictor had a much easier job.
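To make that concrete, here's a sketch of the idea in terms of the loops from the SO benchmark (an illustration only, not the Intel compiler's actual output):

    #include <cstddef>

    // Original shape of the benchmark: the outer "test" loop repeats the whole
    // pass, so on unsorted data the inner branch flips unpredictably each element.
    long long original(const int* data, std::size_t arraySize) {
        long long sum = 0;
        for (unsigned i = 0; i < 100000; ++i)
            for (std::size_t c = 0; c < arraySize; ++c)
                if (data[c] >= 128)
                    sum += data[c];
        return sum;
    }

    // Interchanged: for each element the branch goes the same way 100000 times
    // in a row, so the predictor is almost always right even on unsorted data.
    long long interchanged(const int* data, std::size_t arraySize) {
        long long sum = 0;
        for (std::size_t c = 0; c < arraySize; ++c)
            for (unsigned i = 0; i < 100000; ++i)
                if (data[c] >= 128)
                    sum += data[c];
        return sum;
    }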


Can you point those out please? I don't see any substantive edits in the last few years. Mostly just changing the notice, the bounty award, and a rollback this year.


Updated my post. I merely meant the section. It was still old. :)


The question I have, which I do not see answered after a quick glance at the top responses, is: What is the total time of sorting + good branch prediction loop, versus not sorting and bad branch prediction loop? Does the good branch prediction save more time than it costs to sort the array on modern processors? What about old processors with shorter pipelines?


I wouldn't expect the savings to offset the cost of having to sort first. But if you need to process the same data many times, it may be a good idea to sort it up front.


In that case, loop fusion may be even more efficient - makes better use of CPU caches. Depends on the logic I suppose.


That's the thing, it feels impossible to generalize since this can get pretty arbitrary. The size of the data, how sorted the data already is, what you're processing, etc.


Yes, heuristic logic can have a role here, e.g. if the list has length N and the number of passes through it is M, compute N*M, and if the result is over some threshold then make the effort of sorting the list. But a good value for that threshold depends on many factors, hence a heuristic may make sense in some scenarios. It's the same sort of pattern used with SQL query optimizers, e.g. cardinality of a result set is one of the key factors that influence how some query on that result set will be executed, e.g. whether to scan versus use an index.
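As a rough sketch of that heuristic (the threshold value and function name here are made up purely for illustration; a real threshold would have to be tuned per workload):

    #include <algorithm>
    #include <vector>

    // If size * passes is large enough, pay for the sort so the later passes
    // become branch-predictor friendly; otherwise just process the data as-is.
    constexpr long long kSortThreshold = 1000000;  // hypothetical tuning constant

    long long process_many_passes(std::vector<int>& data, int passes) {
        if (static_cast<long long>(data.size()) * passes > kSortThreshold)
            std::sort(data.begin(), data.end());
        long long sum = 0;
        for (int p = 0; p < passes; ++p)
            for (int v : data)
                if (v >= 128)
                    sum += v;
        return sum;
    }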


Modern processors have at least a dozen pipeline stages. In this case, every other instruction is a branch and they are completely unpredictable so it's possible to see a 12.0 x 0.50 x 0.50 or 3x slowdown. If you can sort or partition the data in effectively 2N time for the size, then you can beat the misprediction.

The important thing is to be keenly aware that this is how a processor works. Branch predictors take up a big chunk of silicon on the CPU and keep very complicated histories, and in certain cases the compiler can outsmart some bad code, but in general it's way too easy to make this kind of mistake.


> 12.0 x 0.50 x 0.50

I assume this means "12 cycles delay × 50% instructions can miss × 50% miss rate". But that's not really right; you're comparing against an IPC of 1 sans misprediction, with each instruction latency-1 and all of them dependent on the prior. Less importantly, you're assuming 50% of the instructions are branches, where it's really more like 25% since you have the load and add as well. Your mispredict penalty also seems a tad small.

A not-horrific compiler and a fast CPU should do far better than that, with peak IPC of just under 4 (since theoretical peak is 4 and there's generally a little overhead). Mispredicts linearize the graph, reducing IPC to some fraction, which you can guess is a bit over 1/2, since you average ~2 loops of 4 instructions per mispredict. This means you've got a reduction of slightly less than 8x.


It might not matter for the simple case of searching through a list, but in general it makes a lot of sense to sort your data so that you improve cache locality on traversal. See for example how Frank McSherry uses Hilbert curves in graph processing: https://github.com/frankmcsherry/blog/blob/master/posts/2015...


This example isn't about branch prediction (not exactly), but it might still be interesting:

https://github.com/frankmcsherry/blog/blob/master/posts/2015...

It's about how sorting + random access can be faster than random access, because you introduce locality of reference. And in this case, it absolutely is faster to sort the data first and then do the work, even counting the sorting.

Edit: Aw crap, adrianN beat me to this a few screenfuls down, sorry! But, it is a different post, so maybe this is still helpful. :D


The second loop will do O(N) comparisons. If the array is unsorted, each will be poorly predicted. The best possible comparison sort will do O(N * log(N)) comparisons. If the array is unsorted, each of those will be poorly predicted. So regardless of the pipeline length, sorting first will be strictly slower than operating on the unsorted data. Sorting first might help if you subsequently need to do something that will do more than O(N * log(N)) comparisons.


Sorting is going to be more expensive than mis-predicting every item in the array (since sorting inherently involves multiple rounds of conditional swaps, getting it wrong a lot of the time on random data). The trick in the SO post is that they use the same sorted data tens of thousands of times.


You can partition the list with zero branches (aside from the loop test.) You don't need to sort it.

(Though, of course, the original algorithm can be written in a one-pass branchless fashion anyway, so the point is kinda moot.)
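For what it's worth, the "partition with zero branches" idea can look something like this: a branch-free compaction that keeps only the >= 128 side, which is all the original loop needs. A sketch only (the function name is mine, and whether it actually wins depends on the compiler and data):

    #include <cstddef>

    // Gathers every element >= 128 at the front of `out`, preserving order,
    // with no data-dependent branch (only the loop test remains).
    // `out` must have room for n elements.
    std::size_t keep_ge_128(const int* data, std::size_t n, int* out) {
        std::size_t kept = 0;
        for (std::size_t i = 0; i < n; ++i) {
            out[kept] = data[i];          // write unconditionally
            kept += (data[i] >= 128);     // advance only when the element is kept
        }
        return kept;                      // out[0..kept) holds the values >= 128
    }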


I feel like branch prediction is only a small part of the performance boost. If I do the same in Python 2500 times for an array of size 40k, I get 0.00183519964218 vs 0.00163079910278, or a ~11% boost.

I imagine the 6x factor is probably because the optimizer unfolds the array/loop structure.

edit: sorting the array increases the time by about 4x if you include the sort time, so it's not worth it


You can't use the Python implementation to get insight into what's happening with a C++ implementation, because the overhead of operations in Python may mask a real effect that can be seen in C++.


Not directly, but information is information. I am also more skeptical that branching alone would result in a 6x increase in run time.

We know optimizers unfold loops as well as arrays. It doesn't take a profound leap in logic that the optimizer may have unfolded everything and realized certain code was not going to change the state of the program, thus removed it. Maybe it didn't, but then again maybe it did.


> information is information

It's information, but not worthwhile information.

> It doesn't take a profound leap in logic that the optimizer may have unfolded everything and realized certain code was not going to change the state of the program, thus removed it. Maybe it didn't, but then again maybe it did.

Prove it. This article has had a ton of traffic. If you see the thing that everyone missed, a lot of people will be very impressed with you.


Did you try with an actual array (https://docs.python.org/3/library/array.html) or numpy, rather than a Python list? You might be able to recover the same effect. Your average line of Python is miles away from low-level, simple operations. Pretty much any dynamic language with a large runtime is, especially when it defaults to lists, not arrays. You should see your result as evidence of that, not assume this discussion is flawed.


I'm surprised you got such a high (11%) boost. I would not expect branch prediction issues from your code to have much of an effect in Python because the Python interpreter itself naturally introduces 10x as many branches under the hood as the code it interprets.


Guess: the branches in the Python VM are more accurately predicted than the branch on the data. The data is going to be random and thus unpredictable, but the branches in the VM are probably pretty predictable.


Python if statements all funnel through the same few processor instructions in the VM, so I don't think the processor's branch predictor is able to distinguish different Python ifs; they're all the same to it.

Well, that... and even simple Python VM instructions tend to take dozens of insns to execute.


Well, the code being measured does one pass through the array. Sorting cannot, in general, be done in linear time, so asymptotically you won't gain anything by sorting first.

Edit: Whoops, I misunderstood. Sorry!


Well sure, but from an asymptotic approach both unsorted and sorted are "linear", so asymptotics don't necessarily provide enough information to answer GP's question.


The update at the very end is pretty awesome. Basically saying, some compilers will optimize this for you now so there's no difference.

It would be interesting to see how Clang/LLVM do...


I had a surprise with sorting in JavaScript a few days ago, details also on SO.[0]

TL;DR: Chrome uses quick sort and we managed to hit its worst case by pre-ordering the input alphabetically on the server side.

[0] https://stackoverflow.com/questions/46228556/how-is-array-so...


Needs a "(2012)" at the end of the title. Also I'm not sure when cmov changed to 1 cycle latency, but it might affect some of the tangential topics here.


It appears to me that it is trivial for a compiler to use a conditional instruction (instead of a branch) here. As a result, I'm very surprised that it didn't. Any idea why this is the case?


Correctly predicted conditional branches are faster than conditional instructions, because they add fewer dependencies. The programmer usually has better knowledge of whether the branch is going to be 50/50 or somewhat predictable, and is thus better suited to make that choice.

http://yarchive.net/comp/linux/cmov.html


While that post is technically correct (conditional moves have high latency because you can't dispatch the instruction until the conditional variable has been evaluated), they're still good in many cases because there's also no immediate dependency on the output of the conditional move. You might end up with 10 conditional moves in your reservation stations, but that's fine if all you're doing is summing up the results. You don't actually act on that sum until the end of your loop, so it's OK if it takes a few cycles after the loop to flush all those pending conditional moves out of the reservation stations.


Cmov is 1 cycle latency now. Can't remember when the change occurred, but AF's info should point the way.


The issue isn't the `cmov` itself, which is extremely forgiving nowadays (4 per cycle!), but the fact that `cmov` introduces a dependency on both inputs and the condition. A predicted branch introduces a dependency on only the predicted input: the other input is never calculated and the condition is checked after-the-fact.


This is a fabulous explanation that I think a novice programmer could readily grasp.


Agreed. StackOverflow could make a lovely book with these kind of high quality answers. Stuff's gold.


I'm currently taking an OMS CS course in High Performance Computer Architecture and recently completed the branch prediction lesson. It is a free course on Udacity. Here's the source: https://www.udacity.com/course/high-performance-computer-arc...

Lesson 3 covers pipelines and lesson 4 covers branches. Milos does a great job of explaining "How It Works" for something that is really a hidden layer under the CPU.


Thanks for sharing, this course looks interesting. Cheers.


This is a recurring duplicate, last posted 3 months ago, then twice again a year ago. Should be tagged [dupe].


Some of the suggested ways to avoid the branch are bit manipulation (which is not necessarily portable) and the ternary operator (which seems hardly different from an if-else, though maybe the compiler usually treats it differently). It seems like another way would be `sum += data[c] * (data[c] >= 128);`, which adds 0 when the condition is false.
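For comparison, here are the three branch-free variants as small functions (a sketch; the bit-twiddling one follows the accepted SO answer and relies on arithmetic right shift of negative ints, which is implementation-defined):

    #include <cstddef>

    long long sum_mul(const int* data, std::size_t n) {        // multiply by the 0/1 comparison
        long long sum = 0;
        for (std::size_t c = 0; c < n; ++c)
            sum += data[c] * (data[c] >= 128);
        return sum;
    }

    long long sum_ternary(const int* data, std::size_t n) {    // often compiled to a conditional move
        long long sum = 0;
        for (std::size_t c = 0; c < n; ++c)
            sum += (data[c] >= 128) ? data[c] : 0;
        return sum;
    }

    long long sum_bits(const int* data, std::size_t n) {       // bit trick from the SO answer
        long long sum = 0;
        for (std::size_t c = 0; c < n; ++c) {
            int t = (data[c] - 128) >> 31;                      // all ones if data[c] < 128, else zero
            sum += ~t & data[c];
        }
        return sum;
    }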


Many processors have an instruction that will select between two operands depending on a condition. That can be faster than a branch. Although a sufficiently intelligent compiler would be able to determine that the instruction could be used in an if/else situation.


My immediate reaction would be to think of the entropy of the information in the array. Sorted sounds like energy was spent to put more information into that array. Intuitively, the lower entropy of a sorted array should help us predict and make better decisions along the way of searching for things in the array. Completely unsorted arrays give us less information to work with: the lack of order certainly can't help us make decisions!


Read a little further down to WiSaGaN's answer. There's an excellent discussion of the optimizations afforded by the cmovge instruction that is typically generated for the C ternary operator and how its implementation in the CPU pipeline allows it to avoid the branch misprediction penalty.

He references a textbook as well "Computer Systems: A Programmer's Perspective (2nd Edition)". Just bought a copy.


See this also for a case against CMOV predicates: http://yarchive.net/comp/linux/cmov.html


After it's sorted, you could add a break condition for when the value is less than 128; that way it would be faster.


This is just a simple example for this kind of situation.

Of course, you could make it faster by removing everything under 128 from the array before starting the loop, but it's not really the point here.


When I look at the code, I'm under the impression that it will always generate the same output because the random seed is fixed (besides the timer). Given current tech, could a compiler see that and just reduce the computation to a simple value?


Sort of. If rand() wasn't built into the standard, it would be impossible because you could override it at link-time with another function that connects to your database or something. And the linker runs after the compiler.

But because it's built-in, the compiler might be able to infer more about it, just like it reduces:

printf("x");

to

putchar('x');


(2012)


Question and top answer were both edited this year. Didn't look at diff to see how substantively.


The recent edits are all trivial. The meat of the question and top answer have been there since 2012.


Does the time this was posted really have any relevant bearing on the content here?


Among other things, for people to whom the title sounds familiar it gives them a hint that it's the same article they remember, not something brand new.


I was quite certain that this was the famous post with the railroad junction picture before I followed the link. This was submitted 3 times already (but only sort of gained traction once):

https://news.ycombinator.com/item?id=12490893

https://news.ycombinator.com/item?id=14459549

https://news.ycombinator.com/item?id=12272428

I am not against resubmissions, but if it is an older post the year should be in the title.


Older posts are regularly shared on Hacker News, because no, time often doesn't matter. But it's still valuable metadata that is usually posted or added to the title.


Yes. Architecture changes and some of the cmov info is probably incorrect now.


Technically the branch prediction benefit comes from partitioning the array around the critical value 128. Sorting is not necessary. Partitioning may be faster.
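As a sketch, a single linear pass with std::partition would do it (std::partition itself branches, of course, so this only pays off if the grouped data is then processed many times, as in the benchmark):

    #include <algorithm>
    #include <vector>

    // Group the values >= 128 at the front; no ordering within each group is needed,
    // which is enough to make the benchmark's branch predictable.
    void group_by_threshold(std::vector<int>& data) {
        std::partition(data.begin(), data.end(), [](int v) { return v >= 128; });
    }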


It's simple.

Cache, baby! Cache!

To fill in the blanks: computers (CPU, RAM, and other storage devices) are well optimized for sequential reading and writing.


Imagine that as an interview question!


Yep, the best interview questions are ones you yourself would fail before 2017.09.15 but would pass after 2017.09.15 based on having read a HN article.

This interview question is great for three reasons:

1. It identifies that someone is you.

2. It separates the bad-coder you (before 2017.09.15) from the good-coder you (after 2017.09.15). This means that it is immune from generating a false positive on a poor candidate due to time travel.

3. It identifies cultural fit: the person reads the same news articles you do. If you waste time reading random hacker news articles, you're going to want to hire people who do the same. Especially ones who were around and not too busy on exactly 2017.09.15! It easily weeds out people who were on vacation on that date, for example.

What I especially like about this is that it has nothing to do with anyone's code. (After all, anyone who works at a level that low can answer it very easily without having read this stack overflow question, so it's a strictly orthogonal puzzle: it's only hard for people who don't need it!)

You should go ahead and add this to your list of interview questions! In fact, why not make it the only one?

(It also avoids the fuss of having to come up with questions in any way related to the work that a candidate will be doing, which, in case the above sarcastic comment wasn't clear, is what you should actually be doing.)


OK, we get your point. But consider that there are jobs where this stuff is actually important, and it's useful to have working knowledge. If you were actually going to ask this in an interview, you would obviously want to ask something different but along the same lines, just in case the candidate was familiar with this post. But that goes for many interview questions.


For the jobs where this is actually important, this isn't a good interview question - it's not subtle/deep enough. (There are many deeper questions for those jobs - this one is a basic intro level question for a job such as that one.) The reason I enumerated so many ways this is wrong isn't to be funnier: it's because I really want people to stop doing this.


Sarcasm is like lying: the more you talk, the more you dig your own grave.


I really, really (really) want to get people to stop asking these types of interview questions. I want them to realize how awful it is. (I've enumerated the ways.)

This type of interview question simply needs to die. An interview question shouldn't be about whether you've seen something that is unrelated to the job. Which is what this is.

(For jobs where it is actually something they need to know, it is a poor question because it's too simple.)

I should add that I personally found the stack overflow question itself and its answers (especially about the Intel compiler) very interesting.


I actually think this would be a very good interview question for someone who would be working on performance-critical C++ software (maybe not for an entry-level position, though). Branch prediction is something that anyone working in that domain should understand at least the basics of, and it could quite easily open up the conversation to other hardware or optimisation topics.


Really, I was going to say this is a good weed-out question for an entry-level position. Topics like pipelining, branch prediction and cache locality were covered in 2nd year computer architecture classes when I went to undergrad. Aren't they teaching this stuff anymore?


> Aren't they teaching this stuff anymore?

You're describing a good question for an end-of-year test. Not an interview question for "entry level positions", most of which don't require or use any academic CS knowledge.


Yeah but if you don't use that knowledge for 10 years, you forget it. You wouldn't ask a web developer for the dates of the French Revolution in order to prove that they graduated high school.


I'm not sure what kind of developer works for 10 years and doesn't use knowledge of how microprocessors work. Is their code being executed by something other than a microprocessor?

On the other hand, maybe this is why my E-mail client has a 300MB memory footprint and my browser pegs the CPU when simply opening a web page.


> On the other hand, maybe this is why my E-mail client has a 300MB memory footprint and my browser pegs the CPU when simply opening a web page.

It sounds a lot more like you don't fully understand what the causes behind these things are and you're ready to arrogantly blame it on developers who dare use javascript without microoptimizing everything.

Your email client has a 300MB memory footprint most likely because of its dependencies and the appropriate amount of work invested into it. If it's a commercial product, the company probably doesn't care enough to spend decades optimizing every single layer of the stack down to the compiled assembly just to sell you something for twenty bucks. If it's an open source product, the devs behind it definitely don't have time to do that but OTOH you're welcome to apply your superior knowledge and show everyone how it's done.


I'm not sure what kind of developer works for 10 years and doesn't use knowledge of the chemical properties of silicon. Is their code being executed by something other than silicon?


In my school, we were taught about them, sure, but not in great detail. We certainly weren't taught how to detect code that has bad branch prediction behavior or how to write code that prevents branch mispredictions. I don't think we even covered cache locality.

I disagree on the original SO question as being a good weed-out question for an entry level position. This is a code optimization that I certainly wouldn't expect a newbie to know.


At my university it was the same, but this (second) class has become an optional elective that is cross-listed with the Computer Engineering major and no longer required. So, most CS majors no longer take it. The curriculum has "drifted" somewhat I'd say, to be a bit higher level, and perhaps more useful to people who go on to be developers at the application level.


> Aren't they teaching this stuff anymore?

They quite possibly are, but I didn't do a CS degree so I wasn't sure whether it was a standard thing to learn or not (hence "maybe" rather than "definitely").


You learn branch prediction in most CS, CE or EE programs.


What kind of class would include that in their curriculum?


It's usually a topic in the first computer architecture class. It's a mid-of-the-book topic in most computer organization books; the classics like Hennessy & Patterson or Tanenbaum have substantial portions dedicated to it. Both are books for a first course on computer architecture and design.

It's very important to understand it so that you can understand why sometimes pipelining is difficult or impossible.


Pipelining is done on basically every processor. It is like the laundry example: you are not going to wait for the dryer to finish before you load the washer. As soon as the clothes are done in the washer, you move them to the dryer and load the washer again.


Yeah, if you have branch prediction you can suddenly realize you put dirty clothes in the dryer together with clean ones.


Our school introduced it in a year 2 computer organization course that was required for all CS majors. It went into how the processor converts assembly into instructions for the ALU, instruction pipelining, branch prediction (though only qualitatively), and the instruction and data cache, amongst other things.


We learned about it in Computer Architecture (EECS 314).

We used the textbook "Computer Organization and Design" by David A Patterson and John L Hennessy. Branch Prediction comes up on page 341. I have it sitting on my desk here at work because it's such a useful book.


It would be an easy one, if applying for any speed/latency critical job (C/Java).

There are a lot of harder micro-benchmarks related to L2 cache size, page faults, compiler optimizations (esp. JVM), pointer dereferences/indirection, and so on.


I can perfectly imagine it, as it may give an indication of whether your interest is in the field or not. If you know the answer, chances are you read HN or follow other communities, as opposed to not giving a damn.


Not so great. Either you get it or you don't, and it's easily learned. Missing it just means you have more to learn.


If you are looking for someone to work on performance-critical code, you would probably prefer someone who has already learned this.

Apart from being useful knowledge for the job, it's a good indicator, showing whether the person has worked on such problems before (or alternatively has good knowledge to tackle such problems).


Yes, I'd prefer them but only slightly.

If they don't know it, they're not an expert. But did they claim to be, and how quickly could they learn?

If they do know it, it doesn't mean they're an expert either. It's easy to learn these things just by happening to read about them.


What are you trying to say?


Compare with Hash


Excellent answers


Imagine if the words in your dictionary weren't sorted and you will have your answer.


Why would you care, if you were reading them sequentially, like the code here is doing?



