
Why is it faster to process a sorted array than an unsorted array? (2012) - tosh
https://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array
======
mruts
I feel like we are now living in a world in which reasoning about the
performance of your code is almost impossible. Between HotSpot compilers,
bytecode, CPU intrinsics, vectorization, caches, pipelining, GC, branch
prediction, etc., how is one supposed to reason about one's code? In whatever
language I'm programming, I usually just imagine what the equivalent C would
look like and make performance decisions from there.

But things are so complicated nowadays that this doesn't even come close to
capturing the performance.

Is this the only way we can get good performance? Is there an alternative
paradigm of optimization in hardware or software that would make performance
tradeoffs more explicit? I don't know, but the current state of things makes
me uneasy because I feel I never understand what's truly going on (and I
definitely don't).

~~~
theonemind
I don't know. In general, it's advisable to write your code first largely
ignoring performance considerations, just do the simplest/most readable thing,
then profile it and see where it bottlenecks.

So, that's the best general wisdom I know. Don't even bother trying to guess
where you'll have performance problems. Profile it.

~~~
civility
This just defers the optimization process by one step. It's hard to improve
something without measuring it first, but either you know where your inner
loop is or the profiler tells you. Now you've still got to improve that
bottleneck somehow and take into account all those complicating factors.

I've seen very unexpected results doing this. In one case, I put the inner
loop in a separate function to enable swapping implementations and make it
simpler to time. Surprisingly, the overall program ran faster without any
other changes. Just pulling the inner loop into a function of its own sped
things up.

My best guess was that the compiler was willing to put more effort into
optimization on the smaller function. This contradicts common wisdom about
function call overhead, and it also shows another level of complexity in the
problem: Some compiler optimizations are exponential in their cost, so they
punt if you put too much of your code in one block or function.

Also note that I wasn't able to prove this is what happened. It's just my best
guess based on what I was able to see and infer.
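A minimal sketch of the refactor described above (all names are illustrative, not from the original code): the hot inner loop is pulled out of a larger function into a small function of its own, which can make it easier for the compiler to optimize aggressively and easier to time or swap in isolation.

```cpp
#include <cstddef>
#include <vector>

// The inner loop, isolated into a small function. Some compiler
// optimizations have superlinear cost in function size, so a compact
// kernel like this may get more aggressive treatment than the same
// loop buried inside a large function body.
static double dot_kernel(const double* a, const double* b, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

double process(const std::vector<double>& a, const std::vector<double>& b) {
    // ...setup, validation, logging elided...
    return dot_kernel(a.data(), b.data(), a.size());
}
```

Whether this helps is compiler- and workload-dependent, which is exactly the point of the anecdote: only measurement can tell you.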

------
rwmj
I had a fun accidental optimization last week: I needed to zero out a memory
buffer before calling a GnuTLS function to read in data (for security reasons
- we can't be sure that the callee will fill in the full buffer and don't want
to leak uninitialized memory back to the client). I wondered what the
performance penalty of this would be. In fact it made the code about 5-10%
faster. This is on the edge of being a measurable effect - could be random
variability in the test - but it certainly surprised me that it was not
slower.

The non-encrypted transport case had a similar change but there was no
measurable performance difference before and after.

I'd like to know what perf stats are available to show what's going on. (I
guess it's an L2 cache effect)
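The defensive-zeroing pattern described above looks roughly like this (the names here are illustrative, not the actual GnuTLS API): zero the buffer before handing it to the read call, so a short read cannot leak stale memory back to the client.

```cpp
#include <cstddef>
#include <cstring>

// Zero the buffer before the callee fills it. If recv_fn writes fewer
// than len bytes, the remainder is guaranteed zeros rather than
// whatever was previously in that memory.
std::size_t read_into(char* buf, std::size_t len,
                      std::size_t (*recv_fn)(char*, std::size_t)) {
    std::memset(buf, 0, len);     // defensive zeroing
    return recv_fn(buf, len);     // may fill fewer than len bytes
}
```

One plausible reason this can come out faster, as speculated below, is that the stores from memset touch and warm the cache lines before the real writes arrive.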

~~~
seanmcdirmid
The zeroing was somehow priming the cache. Maybe it was faster this way
because all the zero writes were independent and could easily be run in
parallel, whereas the normal cache misses would cause more actual stalling.

~~~
gameswithgo
Probably a SIMD-enhanced zero operation, so it's very quick.

------
azhenley
Prior discussion:
[https://news.ycombinator.com/item?id=15256543](https://news.ycombinator.com/item?id=15256543)

------
Rychard
I wonder if the microcode changes to mitigate branch prediction
vulnerabilities (i.e. Spectre) affect these results in any way.

~~~
jcranmer
Unlikely. The microcode changes are largely about disabling indirect branch
prediction--virtual functions, jump tables, function returns. The branch
prediction here is generally driven by simple branch history predictors (i.e.,
at a fundamental level, the branch predictor boils down to "guess what the
last branch did"), which are way too useful to disable even to mitigate
Spectre.
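The branch in question is the one from the original SO example, reduced here to a sketch: with sorted data the `>= 128` test is false for one long run and then true for the rest, so a history-based predictor guesses it almost perfectly, while with random data it mispredicts roughly half the time.

```cpp
#include <cstdint>
#include <vector>

// Same result either way; only the branch's predictability changes.
// Sorting the input first (std::sort) turns a random branch outcome
// into two long, perfectly predictable runs.
long long sum_over_threshold(const std::vector<int>& data) {
    long long sum = 0;
    for (int v : data)
        if (v >= 128)   // the branch the history predictor has to guess
            sum += v;
    return sum;
}
```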

------
ggeorgovassilis
After running the experiment (OpenJDK 8, Linux) with some variations, I'm
convinced it's a JIT optimisation and not a CPU branch-prediction feature,
e.g. increasing the threshold in if (data[c] >= 128) from 128 to a value
beyond the 16-bit range, e.g. 100000.

------
peterkelly
This looks like one of those questions to which the author already knows the
answer, and posts the question on an alternate account so they can get all the
upvotes for their subsequent detailed explanation for the behavior of their
contrived example.

~~~
bonoboTP
You don't have to do that on StackOverflow (and other StackExchange sites). It
is explicitly encouraged to answer your own questions, or, put another way, to
share your knowledge in a Q&A format.

~~~
otherme123
[https://stackoverflow.com/help/self-answer](https://stackoverflow.com/help/self-answer)

If you are in SO for the points, splitting them between two accounts seems
lame. Remember that the question also collects points and badges!

I've never checked the user that answers to see if it's the same user that is
asking, and it happened to me a couple of times that I tried to upvote my own
answer.

