
How We Beat C++ STL Binary Search - timanglade
https://realm.io/news/how-we-beat-cpp-stl-binary-search/
======
mmozeiko
For 8192 elements I also get the result that their search function is faster
than STL. But 8192 elements is a tiny array to search; for larger arrays I get
the opposite - STL is faster. Here is the output for 8 * 1024 * 1024 elements
(~8.4 million).

gcc 5.2.0 on Windows x64 (i5-3210M)

    
    
        stl      : 881 milliseconds
        version 1: 880 milliseconds
        version 2: 1607 milliseconds
        version 3: 1260 milliseconds
        version 4: 1271 milliseconds
    

gcc 5.2.0 on Linux x86_64 (i7-4770S)

    
    
        stl      : 629.231 milliseconds
        version 1: 629.436 milliseconds
        version 2: 897.143 milliseconds
        version 3: 862.827 milliseconds
        version 4: 863.22 milliseconds
    

clang 3.6.2 on Linux armv7h (CuBox-i, Cortex-A9)

    
    
        stl      : 3380.29 milliseconds
        version 1: 3428.9 milliseconds
        version 2: 3433.65 milliseconds
        version 3: 3391.86 milliseconds
        version 4: 3376.91 milliseconds
    

Oh, and Visual Studio's cl.exe doesn't have the "/O3" argument they say they
are using (/O2 is its highest standard optimization level).

~~~
nhaehnle
This is fascinating. Do you have any idea what might cause this?

A shot in the dark: their benchmarking uses a very simple "random" generation
to choose the index to search for, which is actually just a linear scan
(modulo the size of the array). Could it be that with the larger array, the
generated sequence of test indices happens to work nicely with the CPU's
branch prediction? After all, your results show a performance drop-off for
the versions that use conditional moves, and the behavior depends suspiciously
on the CPU microarchitecture.
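
For illustration, the kind of index generation being described would look
something like this (a hypothetical reconstruction, not the article's actual
harness):

    
    
        // A counter stepped by a constant and reduced modulo the array size:
        // a perfectly regular cycle that a branch predictor can learn, rather
        // than anything random. The step constant is illustrative.
        size_t next_index(size_t& state, size_t n)
        {
            state += 7919;
            return state % n;
        }
    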

------
Noughmad
There is no mention of the fact that the STL implementation is very generic:
it only assumes operator++ and operator* on the iterator, and operator< on
the value.

The "optimized" versions here all make more assumptions on the iterator,
starting from operator+(int) in Version 1, so it no longer works on iterators
with just "forward_iterator_tag". Further versions even restrict vector sizes
(albeit to a very high number) and assign -1 to an unsigned integer (size_t).
So this is something you can use in your project if you need the performance,
but can't put it into GCC.
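
For comparison, the generic shape std::upper_bound has to take looks roughly
like this (a sketch, not the actual libstdc++ source):

    
    
        #include <iterator>
        
        // Works with any forward iterator: std::advance degrades to repeated
        // operator++, and only operator< involving the value type is required.
        template <class ForwardIt, class T>
        ForwardIt generic_upper_bound(ForwardIt first, ForwardIt last, const T& value)
        {
            auto len = std::distance(first, last);
            while (len > 0) {
                auto half = len / 2;
                ForwardIt mid = first;
                std::advance(mid, half);   // O(half) steps for forward iterators
                if (!(value < *mid)) {     // *mid <= value: answer is to the right
                    first = ++mid;
                    len -= half + 1;
                } else {
                    len = half;            // answer is at or before mid
                }
            }
            return first;
        }
    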

~~~
lhecker
Just a nitpick: assigning a negative value to an unsigned integer is perfectly
safe and well defined (i.e. integer overflow for unsigned integers is well
defined in the C++ spec).

Using -1 is sometimes even preferred for assigning size_t's maximum value,
since there is no cross-platform #define for it (platforms use either
SIZE_T_MAX or SIZE_MAX).

In fact llvm's libc++ uses it in its std::numeric_limits implementation and
for std::string's npos.
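
For example, both spellings are equivalent and portable:

    
    
        #include <cstddef>
        #include <limits>
        
        // -1 converted to an unsigned type wraps to that type's maximum
        // value; the standard guarantees this.
        const std::size_t npos1 = std::size_t(-1);
        const std::size_t npos2 = std::numeric_limits<std::size_t>::max();
        static_assert(std::size_t(-1) == std::numeric_limits<std::size_t>::max(),
                      "unsigned wraparound yields the maximum value");
    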

~~~
Joky
I believe there is a standard cross-platform maximum value, and you mention it
in your last sentence: std::numeric_limits!

~~~
lhecker
Uhm... Using "-1" to assign the maximum value is in fact as standard and
cross-platform as using std::numeric_limits, which was basically my entire
point.

The difference between them is only a "textual solution" vs. a "mathematical
solution", and thus a matter of personal preference.

------
blt
Loosely related: STL binary search suffers from the same weakness as several
other comparator-based STL algorithms for certain classes of expensive
comparators:
[https://www.reddit.com/r/cpp/comments/36sqtq/more_efficient_...](https://www.reddit.com/r/cpp/comments/36sqtq/more_efficient_interface_for_algorithms_taking_a/)

------
elitheeli
If you take version 4, stop the first while loop once the size is at most 32,
and then do a linear search from there, it's faster (on my machine). For
example, version 4 takes 102ms and this tweak brings it to 83ms. (On clang,
x64.)
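
A sketch of that hybrid (version 4's code isn't reproduced here, so the
function below illustrates the idea rather than the article's exact code):

    
    
        #include <cstddef>
        #include <vector>
        
        // Binary-search the range down to at most 32 elements, then finish
        // with a short linear scan that runs without branch mispredictions.
        template <class T>
        std::size_t hybrid_upper_bound(const std::vector<T>& vec, const T& value)
        {
            std::size_t low = 0, size = vec.size();
            while (size > 32) {
                std::size_t half = size / 2;
                if (vec[low + half] <= value) {   // answer is past the midpoint
                    low += half + 1;
                    size -= half + 1;
                } else {
                    size = half;                  // answer is in the left half
                }
            }
            while (size > 0 && vec[low] <= value) {   // linear tail
                ++low;
                --size;
            }
            return low;
        }
    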

------
myle
Isn't version 2 wrong, because

    
    
        size_t probe = (low + high) / 2;
    

may overflow?

~~~
eloff
Not if low and high are both unsigned (size_t is unsigned). Even if it
overflows, the result will be correct. See:
[http://googleresearch.blogspot.com/2006/06/extra-extra-read-...](http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html)

~~~
cautious_int
That blog is wrong. The program won't be undefined (unsigned wrap is
defined), but the offset will not be correct.
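
The usual defensive spelling sidesteps the question entirely:

    
    
        // the difference high - low cannot wrap for valid indices, so the
        // midpoint is always computed correctly
        size_t probe = low + (high - low) / 2;
    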

~~~
dastbe
So the article isn't wrong, but it's misleading you into thinking this should
be correct for the wrong reasons.

They are using signed integers as their indices, which means the sign bit is
always 0. Thus the addition after casting to unsigned will never overflow,
and you can divide by two (shift by 1) and then cast back to a signed
integer, no harm no foul.

~~~
cautious_int
Actually it is misleading of them (and you) to assume that in C, an unsigned
int can represent values larger than the largest signed int value.

In other words, C allows UINT_MAX == INT_MAX, in which case you will
overflow.

If they made that assumption, they should have mentioned it explicitly, but
they didn't.

> Update 17 Feb 2008:... ...Now that we've made this change, we _know_ that
> the program is correct;)

It seems the article is aware of the irony. Another update would be in order.
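
For what it's worth, spelling the assumption out is a one-liner in C++11:

    
    
        #include <climits>
        
        // fails to compile on a (hypothetical) platform where unsigned int
        // has no extra range beyond int, i.e. where UINT_MAX == INT_MAX
        static_assert(UINT_MAX > static_cast<unsigned>(INT_MAX),
                      "assumes unsigned int can hold values above INT_MAX");
    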

~~~
dastbe
Sure, and if you expect your software to run on such a platform with any
degree of confidence then you're right to consider it. Even better: tell us
about a conforming implementation that you've used in production recently
that has UINT_MAX == INT_MAX.

Also, I'm not trying to mislead people into thinking that it's a good way to
implement this. The confounding bit from the article is that they started in
Java and ended up in C. If you were indexing with signed ints in C, C++, or
any language that has unsigned integers, then you already have a bug with or
without the bad mean check.

~~~
cautious_int
You've got it backwards there. Only if you know your implementation and plan
to code only for it can you even start to consider bending the C Standard,
not the other way around.

------
xjia
(76% + 81% + 80% + 75% + 70%) / 5 ≈ 76%

[http://www.cse.unsw.edu.au/~gernot/benchmarking-crimes.html](http://www.cse.unsw.edu.au/~gernot/benchmarking-crimes.html)

------
avdicius
My version:

    
    
      template <class T> INLINE size_t fast_upper_bound5(const vector<T>& vec, T value)
      {
          size_t index = 0;
          size_t size = vec.size();
          while (size > 0) {
              size /= 2;
              size_t probe = index + size;
              // bounds guard: on the rightmost path index + size can reach
              // vec.size(), which would read one past the end of the array
              if (probe < vec.size() && vec[probe] <= value)
                  index = probe + 1;
          }
          return index;
      }
    

Slightly faster with gcc:

    
    
      $ g++-mp-4.9 -std=c++11 -O3 -DNDEBUG -o blog blog.cpp 
      $ ./blog 
      size = 8192:
          stl      : 144.883 milliseconds
          version 1: 145.406 milliseconds
          version 2: 129.713 milliseconds
          version 3: 109.231 milliseconds
          version 4: 103.578 milliseconds
          version 5: 102.282 milliseconds
    

But it sucks with clang, apparently because of the described problem with
clang not using the cmov instruction.

    
    
      $ clang++-mp-3.6 -std=c++11 -O3 -DNDEBUG -o blog blog.cpp
      $ ./blog 
      size = 8192:
          stl      : 147.466 milliseconds
          version 1: 145.978 milliseconds
          version 2: 145.547 milliseconds
          version 3: 113.546 milliseconds
          version 4: 106.968 milliseconds
          version 5: 144.231 milliseconds

------
detrino
It should be noted that their version gives different results from
std::upper_bound if your range contains duplicates.

------
arbre
This is nice, but what are the use cases for such an optimization? My
understanding is that an optimization like this won't matter anyway compared
to all the I/O and other expensive operations. And if a program is really too
slow, then the true optimization is to scale with threads or distributed
computing.

------
twoodfin
Great link.

The most educational bit for me was the careful structuring and eventual
elimination of the if/else to shake out a conditional move rather than an
unpredictable branch.

Modern optimizers and CPU scheduling engines are so powerful that a lot of
received wisdom on how to code for speed is outdated; manual loop unrolling,
for example, is rarely very beneficial. It's nice to see that there's still
some room for craftsmanship in the most critical of paths. Structuring loops
for autovectorization is another useful habit to get into.
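
For reference, the shape of the trick is roughly this (a sketch of the
technique, not the article's exact code):

    
    
        #include <cstddef>
        #include <vector>
        
        // Both successor states are computed unconditionally; the ternary
        // selects one, which gcc/clang can lower to a cmov instead of a branch.
        std::size_t branchless_lower_bound(const std::vector<int>& vec, int value)
        {
            std::size_t low = 0, size = vec.size();
            while (size > 0) {
                std::size_t half = size / 2;
                std::size_t other_low = low + size - half;  // start of upper part
                int v = vec[low + half];                    // one probe per step
                size = half;
                low = v >= value ? low : other_low;         // select, don't branch
            }
            return low;
        }
    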

~~~
cautious_int
Wouldn't a conditional move transfer the cost from a branch mispredict to a
cache miss?

I wonder how realistic their tests were, and whether the same results would
be achieved with some cache thrashing.

~~~
twoodfin
If you're going to miss when you go back to the array, you're going to pay
that cost regardless. But without the mispredict, you'll be able to issue the
load that much sooner.

I'm not familiar enough with Intel's architecture to know one way or the
other, but it wouldn't surprise me if not mispredicting the next memory access
saves you more than just some pipeline flushing: The CPU could speculatively
issue the load you don't actually need, wasting resources that could be used
to service the correct addresses.

Or did you mean something else?

~~~
cautious_int
Well, a branch will fetch the incorrect cache line if it mispredicts, but a
conditional move will fetch both cache lines every time.

~~~
rrrlasse
The conditional move is the statement `low = v >= value ? low : other_low;`,
which either assigns `other_low` to `low` or does nothing.

And `other_low` is a variable, not an arbitrary element of the list (which is
a big difference with respect to cache), and it can be seen from the assembly
that it's stored in a register. So there is no "both cache lines" to fetch
anything from.

~~~
cautious_int
The conditional move still depends on both values taken from the array, via
other variables.

------
chardan
I think it's at the very least been hinted at below (and there are other good
points), but note that their implementation makes copies. If the copy
constructor of type T is the least bit expensive, the STL implementation (in
libstdc++) is far faster than theirs on this same benchmark. Try it. :)
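
Concretely, the kind of signature difference that matters here (illustrative,
not the article's code):

    
    
        #include <cstddef>
        #include <vector>
        
        template <class T>
        std::size_t ub_copying(const std::vector<T>& vec, T value);        // copies the needle per call
        
        template <class T>
        std::size_t ub_by_cref(const std::vector<T>& vec, const T& value); // no copy
        
        // likewise inside the loop:
        //     T v        = vec[probe];  // copies T on every iteration
        //     const T& v = vec[probe];  // binds to the element without a copy
    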

------
CJefferson
One of the problems with the extreme complexity of submitting code to GCC,
and to FSF projects in general (having to complete copyright assignments,
which are handled very slowly), is that this is unfortunately unlikely to end
up in libstdc++ (although I would be happy to see it there).

~~~
nanofortnight
I think the fact that this violates the standard would be the greatest
difficulty.

------
sbahra
Paul has been doing great work here:
[http://repnop.org/pd/slides/bsearch.pdf](http://repnop.org/pd/slides/bsearch.pdf)

