

How doing something can be faster than not doing it - ingve
http://blogs.msdn.com/b/oldnewthing/archive/2014/06/13/10533875.aspx

======
userbinator
Note that were the comparisons _unsigned_, we could use adc instead, which
uses one fewer register and as a result gets rid of quite a few instructions;
the unsigned version is basically

    
    
         mov edx, [esp+8]      ; edx = boundary
         xor eax, eax          ; eax = count = 0
         mov ecx, array_start
        L:
         cmp [ecx], edx        ; CF = (unsigned) [ecx] < edx
         adc eax, 0            ; eax += CF
         add ecx, 4
         cmp ecx, array_end
         jb L
         ret
    

and this version is 40% faster on my machine than the signed version.

~~~
nkurz
Nice. I think you might be able to shave another cycle off by using
[base + index] addressing with base = array_end and a negative byte index.
This lets you get rid of the second compare: 'add index_register, 4' followed
by 'jnz L' serves as the loop test, exiting when the index reaches 0. That
saves another instruction as well as a cycle of dependency.
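
A rough sketch of what I mean (array_size_bytes is a made-up constant here,
standing in for 4 * the element count; the body reuses the unsigned adc trick
from above):

    
    
         mov edx, [esp+8]              ; edx = boundary
         xor eax, eax                  ; eax = count = 0
         mov ecx, -array_size_bytes    ; negative byte index
        L:
         cmp [array_end + ecx], edx    ; CF = (unsigned) element < edx
         adc eax, 0                    ; eax += CF
         add ecx, 4                    ; sets ZF when the index reaches 0
         jnz L                         ; doubles as the loop test
         ret
    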

Of course, you could also go all out and vectorize: vpcmpgtd, then vpaddd for
a per-lane subtotal, then sum the lanes at the end and negate. Combined with
the negative-index trick, you might be able to get it down to a single cycle
per 8 ints. Ahh, the joys of micro-optimizing toy loops!
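
For the curious, a rough sketch of that vector loop (AVX2; assumes the element
count is a multiple of 8 and reuses the hypothetical array_end /
array_size_bytes from above; note vpcmpgtd is a signed compare, like the
article's original loop):

    
    
         vpbroadcastd ymm1, [esp+8]     ; boundary in all 8 lanes
         vpxor ymm0, ymm0, ymm0         ; per-lane subtotals
         mov ecx, -array_size_bytes
        L:
         vpcmpgtd ymm2, ymm1, [array_end + ecx]  ; -1 in lanes where element < boundary
         vpaddd ymm0, ymm0, ymm2        ; subtotal -= 1 per match
         add ecx, 32                    ; 8 ints per iteration
         jnz L
         vextracti128 xmm1, ymm0, 1     ; horizontal sum of the 8 lanes
         vpaddd xmm0, xmm0, xmm1
         vphaddd xmm0, xmm0, xmm0
         vphaddd xmm0, xmm0, xmm0
         vmovd eax, xmm0
         neg eax                        ; lanes counted downward, so negate
         ret
    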

------
phkahler
Has anyone made a processor that speculatively executes both branch paths? I
guess good simulations would show whether that or branch prediction is better.

~~~
js2
I think this is called dual or multi-path execution if you want to Google for
it.

[http://people.cs.clemson.edu/~mark/eager.html](http://people.cs.clemson.edu/~mark/eager.html)
says _...I've seen it alleged that some mainframe processors in the 1960s and
1970s executed down both paths beyond a branch; but, as far as I'm aware,
multiple-path execution has never been done by a commercial processor._

~~~
onan_barbarian
From what I recall reading about this in the academic literature, the primary
concern here is that speculative operations on the alternate paths consume
power just like any other operation.

------
alain94040
If indeed

    
    
      if (array[i] < boundary) count++;
    

is slower than

    
    
      count += (array[i] < boundary) ? 1 : 0;
    

then is there any reason why compilers can't infer the latter?

~~~
yongjik
Linus once wrote a nice (typically Linus) rant on why cmov is a terrible idea
in most cases:

[http://yarchive.net/comp/linux/cmov.html](http://yarchive.net/comp/linux/cmov.html)

> In contrast, if you use a predicated instruction, ALL of it is on the
> critical path. Calculating the conditional is on the critical path.
> Calculating the value that gets used is obviously ALSO on the critical path,
> but so is the calculation for the value that DOESN'T get used too. So the
> cmov - rather than speeding things up - actually slows things down, because
> it makes more code be dependent on each other.
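
A contrived sketch of the difference (expensive_value is just a stand-in for
anything slow to load or compute):

    
    
        ; branch: a correct prediction means later code needn't wait
        ; on the compare, and the load is skipped entirely when eax >= ebx
         cmp eax, ebx
         jge skip
         mov ecx, [expensive_value]
        skip:
    
        ; cmov: the load runs either way, and ecx now depends on
        ; both the compare and the load
         cmp eax, ebx
         mov edx, [expensive_value]
         cmovl ecx, edx
    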

~~~
mtdewcmu
Huh. I guess the reason it works in the article is that the value doesn't have
to be calculated; it's just a constant.

------
mtdewcmu
The title of this is misleading, and so is this part:

>>The cost of a single increment operation is highly variable. At low boundary
values, it is around 0.03 time units per increment. But at high boundary
values, the cost drops to one tenth that.

The cost of an increment operation has nothing to do with it. It's very fast
and doesn't change. What matters is the cost of flushing the pipeline on every
mispredicted branch.

Maybe you're supposed to recognize that part as "naive" and ignore it once
you've read the whole article?

