
Nearly All Binary Searches and Mergesorts are Broken (2006) - gruseom
http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html
======
lisper
It's not binary searches and merge sorts that are broken, it's the INT data
type that is "broken" (and I put "broken" in scare quotes because they aren't
really broken, they just don't do what you want in most cases). People tacitly
assume that INTs model the integers, but they don't. They model the integers
modulo 2^N for some value of N. If you code as if INTs were integers and you
hit the 2^N limit you will lose.
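
Concretely, here is a minimal C sketch of what hitting that limit does to the
classic midpoint computation, together with the usual wrap-free rewrite (the
index values are just illustrative):

    /* With 32-bit ints, low + high can exceed INT_MAX. Signed overflow is
       formally undefined behavior in C; on typical two's-complement
       hardware it wraps to a negative value, so the "midpoint" goes bad. */
    #include <stdio.h>

    int main(void)
    {
        int low  = 1 << 30;                /* 2^30, a perfectly valid index */
        int high = (1 << 30) + (1 << 29);  /* also a valid index */

        printf("naive mid: %d\n", (low + high) / 2);       /* wraps negative */
        printf("safe mid:  %d\n", low + (high - low) / 2); /* never forms the sum */
        return 0;
    }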

The algorithm as given would work perfectly well on a straightforward
translation into a language with a real integer data type like Python or Lisp.

And you can lose in the other direction too: if you're doing crypto, for
example, the algorithms are often designed to use arithmetic operations modulo
2^N, so INTs (and NOT integers) can be just what you need.
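
Crypto aside, a small non-cryptographic example in the same spirit, where the
wraparound is deliberate (unsigned overflow is well defined in C), is a 32-bit
FNV-1a hash:

    /* All arithmetic is intentionally done modulo 2^32: the multiply is
       expected to wrap. uint32_t gives exactly that behavior. */
    #include <stdint.h>
    #include <stddef.h>

    uint32_t fnv1a32(const unsigned char *p, size_t n)
    {
        uint32_t h = 2166136261u;      /* FNV offset basis */
        for (size_t i = 0; i < n; i++) {
            h ^= p[i];
            h *= 16777619u;            /* FNV prime; wraps mod 2^32 */
        }
        return h;
    }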

But this has nothing to do with binary search or mergesort.

~~~
barrkel
I can make a similar argument about floating-point numbers not modelling
reals, and the hysteria over the lack of reflexivity of equality in IEEE 754
is a little overblown.

~~~
jjs
Why trifle over tiny little epsilons, or NaN, which isn't even a number? ;-)

------
ggchappell
I remember reading this (back in 2006?) and looking at the versions of Binary
Search and Merge Sort that I had recently written for the Data Structures &
Algorithms class that I teach. I found that my versions did _not_ have this
bug, despite the fact that I had not given any thought to it.

The reason my code did not have the bug is that I represented the range to be
processed, not as low & high subscripts, but rather using a pair of iterators
(this was in C++). And you cannot add two iterators. So instead of _mid_ = (
_low_ + _high_ )/2, you get something like _size_ = _high_ - _low_ and then
_mid_ = _low_ + _size_ /2, where _size_ is an integer, and _low_ , _high_ ,
_mid_ are iterators.
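
For the curious, a rough C analogue with pointers standing in for the
iterators (this is a sketch in the lower_bound shape, not the actual class
code):

    /* Pointers, like C++ iterators, cannot be added to each other, only
       subtracted, so the midpoint is forced into the overflow-free form. */
    #include <stddef.h>

    const int *lower_bound(const int *low, const int *high, int key)
    {
        while (low < high) {
            ptrdiff_t size = high - low;      /* element count, not a sum */
            const int *mid = low + size / 2;  /* stays inside [low, high) */

            if (*mid < key)
                low = mid + 1;
            else
                high = mid;
        }
        return low;   /* first position whose element is >= key */
    }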

There is a lesson to be learned here, certainly. I am not exactly sure what it
is in its full generality, but I think we can conclude that the endpoints of a
range, regardless of how they are represented, are _not_ _things_ _that_
_should_ _be_ _added_ _together_.

~~~
stonemetal
How does that avoid the problem? Assuming you used a signed integer for size
and the list is large enough, there is a sign rollover on size. If size
becomes less than -2, then low + size/2 is less than low (on the first pass
of a binary search you now have an invalid iterator for mid). If you used
unsigned integers, size can still be incorrect after a rollover, though this
time mid will always be valid.

~~~
notaddicted
If you have a byte array on a 32-bit machine, then assuming your program's
text (i.e. instructions) takes up some space, the biggest possible array will
be less than 2^32 bytes = 4 GiB, and so will the midpoint. There can be no
overflow.

If you subtracted two 64-bit pointers and stored the result in an int32,
though, then you would have problems.

------
btilly
It seems to me that this "fix" just pushes the bug off by a factor of 2. What
happens when you have an array with more elements than can be expressed by an
int?

I also question the mergesort assertion. I've implemented mergesort several
times, almost always expressed in terms of writing to/from pipes of data (that
under the hood eventually wound up going to disk I/O in some way). Those
solutions will not suffer from any form of this bug for the simple reason that
you never access anything by index.

The curious might wonder why I felt a need to implement such a well-known
algorithm myself. Well in addition to the times I did it for fun, one time I
had a large dataset to process that had already broken the database, and the
computer I had to process it with didn't have enough disk space for the whole
dataset. So I needed to do a mergesort while keeping all data that hit disk in
compressed form...

~~~
wvenable
> What happens when you have an array with more elements than can be expressed
> by an int?

You probably couldn't have an array that big, and even if you could, this
function would likely still accept it at compile time. The problem here is
that you're passing it an array that it is advertised to handle, yet it
fails.

~~~
Groxx
It's pretty easy, actually.

Make two arrays, and have the end of one act like a linked list to the start
of the second. When an index x is larger than the size of the first, just
subtract the first's size from x and look that value up in the second. It's
even still O(1) time.
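
A rough sketch of that lookup (the struct and names are made up, just to show
the shape of it):

    /* Sketch only: indices past the first array fall through to the
       second, at the cost of one comparison per access. */
    #include <stddef.h>

    typedef struct {
        int    *first, *second;
        size_t  first_len, second_len;
    } split_array;

    int split_get(const split_array *a, size_t i)
    {
        if (i < a->first_len)
            return a->first[i];
        return a->second[i - a->first_len];   /* still O(1) */
    }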

Voilà, you've now got nearly _double_ the storage of what you can reference,
and it'll work in the generic case, but you'll never be able to refer to
anything larger than your index type can reach.

To make matters worse, do you know that your library isn't doing this for
you? It'd allow you to extend your array without having to copy the whole
thing, which saves a LOT of time on frequently-lengthening arrays. The
extremely small cost of checking the index before accessing the necessary
sub-array is often negligible compared to duplicating potentially _billions_
of values.

------
jmount
This sort of problem with sorting seems like it has always been with us. For
example: Bentley quoting Knuth: "in section 6.2.1 of his 'Sorting and
Searching,' Knuth points out that while the first binary search was published
in 1946, the first published binary search without bugs did not appear until
1962" (Programming Pearls 2nd edition, "Writing Correct Programs", section
4.1, p. 34).

I touch a bit on related topics (von Neumann's first program being a sorting
program: [http://www.win-vector.com/blog/2008/02/hello-world-an-
instan...](http://www.win-vector.com/blog/2008/02/hello-world-an-instance-
rhetoric-in-computer-science/) , and quicksort implementations almost always
being wrong: <http://www.win-vector.com/blog/2008/04/sorting-in-anger/> ) in
my blog.

~~~
barrkel
Knuth also said: "Beware of bugs in the above code; I have only proved it
correct, not tried it."

------
Groxx
So, the bug is that you're overflowing integer datatypes? It's a bug that
doesn't exist if your data never exceeds half the range of your integer type.
And with 64-bit, it's effectively impossible for this to happen (except in
exceedingly excessive circumstances, and then you'd better not be using such
limited primitives, or suffer the consequences).

I mean, heck. On equal footing, you can say that it's a "bug" that you can't
access array indices past what your integer can refer to, but you could
conceivably push data in further than that if your array implementation allows
it. The article's bug happens at 1/2 that limit, but it's the same thing, just
slightly harder to find. _Slightly_ , mind you, as something like this should
be checked any time you approach the limits of your formats.

Here's something to ask: _which_ are broken? Example code, code made for a use
significantly below 2^31 values, or _code in use in cases which may exceed
this value_? I'd be willing to bet it's the first and second. If you're not
making a _truly_ generic algorithm, don't make a generic algorithm. YAGNI.

~~~
sparky
Nit: ints are still 32 bits on most 64-bit systems (
<http://en.wikipedia.org/wiki/64-bit#Specific_data_models> )

Example:

    
    
      mrj10@mjlap:~$ uname -m
      x86_64
      mrj10@mjlap:~$ cat test.c
      #include <stdio.h>
      int main()
      {
        printf("%zu\n",sizeof(int));
        return 0;  
      }
      mrj10@mjlap:~$ gcc test.c -o test
      mrj10@mjlap:~$ ./test
      4

~~~
Groxx
Hah, yep, does the same on mine. Oh well. Thanks for the info. Long int gives
me 8, but I see that the standard only requires 4 _minimum_ , so I guess you
can't rely on that either...

Eh. I'll still stick with infinite Integers if I'm worried about running out
of space.

~~~
jedbrown
> I see that the standard only requires 4 minimum

Oh really? What standard? C99 section 5.2.4.2.1 says 16 bits minimum for int,
32 bits for long, 64 for long long. There may not exist 64-bit architectures
with 16-bit ints, but that's not prevented by the standard.

~~~
Groxx
I just pulled from here, no idea if there's something more accurate:
<http://en.wikipedia.org/wiki/Long_integer>

~~~
jedbrown
Sorry, I misread your comment. Long ints have a 4-byte minimum; plain ints
have a 2-byte minimum.

------
wrs
The only time I can remember actually needing to write a binary search from
scratch, I played it safe by copying it straight out of an algorithms textbook
(post-1962)...and it had a much worse bug than this one!

------
cousin_it
Let's go further and say that Nearly All C Code Using Integer Arithmetic Is
Broken.

------
donaq
This seems more like a limitation of the language implementation than a
program bug to me.

~~~
sparky
If the behavior is according to the language spec, it's not an implementation
thing. Perhaps you meant language spec instead of implementation, but even in
that case it seems like a poor idea to require all arithmetic to be arbitrary-
precision, no?

~~~
donaq
You're right on both counts, and it is indeed a poor idea unless we can figure
out a way to efficiently implement arbitrary precision arithmetic in hardware.
I was just quibbling that the implementation of the binary search wasn't
strictly wrong. :)

Having said that, I do think that such pitfalls might be easily avoided if a
check for the values being operated on were built into the language
implementation. That way, we can all write simpler code without having to take
into account the min/max values for types specific to the language.

~~~
sparky
It would be nice, but the overhead would likely be unacceptable, since most
code is dominated by integer arithmetic of one form or another. There was a
StackOverflow question here [http://stackoverflow.com/questions/199333/best-
way-to-detect...](http://stackoverflow.com/questions/199333/best-way-to-
detect-integer-overflow-in-c-c) that has some good suggestions. As for
hardware support, many (but not all) architectures have some support for
trapping or jumping to a specified address on integer overflow. The problem is
compiler support. It has been a long time since a mainstream compiler for a
mainstream architecture has supported it. GCC still has a -ftrapv flag, but it
appears not to work anymore.
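
Absent compiler support, the check can still be written by hand before the
addition; a minimal sketch of the usual pre-check pattern:

    /* Test against the limits first so the addition itself never
       overflows; works for any pair of signed ints. */
    #include <limits.h>
    #include <stdbool.h>

    bool add_would_overflow(int a, int b)
    {
        if (b > 0)
            return a > INT_MAX - b;   /* a + b would exceed INT_MAX */
        return a < INT_MIN - b;       /* a + b would go below INT_MIN */
    }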

There's a cool paper on a static analysis tool to detect possible overflows
here: <http://www.isoc.org/isoc/conferences/ndss/09/pdf/17.pdf> They used it
to find real bugs in open source projects.

------
dgreensp
It's interesting to consider the same situation in JavaScript, where integers
tend to be represented internally as doubles, meaning they don't overflow,
they just lose precision once they exceed 2^53. But this is unlikely to be a
problem if you're counting things.
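
Since JavaScript numbers are IEEE 754 doubles, the same effect is easy to see
from C (the constant below is 2^53):

    /* Integers are exact in a double only up to 2^53; one past that
       rounds back down, so adding 1 is silently absorbed. */
    #include <stdio.h>

    int main(void)
    {
        double big = 9007199254740992.0;   /* 2^53 */

        printf("%.0f\n", big);         /* 9007199254740992 */
        printf("%.0f\n", big + 1.0);   /* still 9007199254740992 */
        return 0;
    }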

------
JoeAltmaier
The fact is, with code, if you don't try it, it doesn't work. We all know
this is true; when code works "the first time" we tell the story to our
friends. I debugged a string-move routine ported from Linux to a RISC
processor with a requirement for aligned word moves. It had 11 bugs in, like,
20 lines of code. Because it hadn't been tried on that architecture before.
It's not a bug if the code worked where it was originally designed to work,
but not when moved to a new environment. Kind of makes "reusable code" an
oxymoron.

~~~
prodigal_erik
If the code relied on behavior the standard doesn't guarantee, the code was
always wrong. It just worked by accident (maybe only in the cases we tested),
and we didn't have a compiler and runtime good enough to tell us the code was
wrong.

~~~
JoeAltmaier
Sounds academic. The programmer verified correct operation in the environment.
In fact it was never going to fail where it was originally written and tested.
That is "correct" by most reasonable definitions.

------
JoachimSchipper
This, of course, is why you always use size_t for array indices. (unsigned int
blows up on amd64, too: a 4GB array of chars is huge but not impossibly
large).

~~~
mbreese
That wouldn't have helped in this case, since the issue was an overflow in the
(high+low) / 2 calculation. Wouldn't using size_t just make the problem less
likely to happen (requiring an even larger array to show up)?

~~~
spc476
The C Standard (admittedly from ISO/IEC 9899:1990, which describes C89; it still
holds for C99; no comment on C++, which I don't use) states: "... size_t which
is the unsigned integral type of the result of the sizeof operator ..."

P. J. Plauger in _The Standard C Library_ states: "When you apply the sizeof
operator in a C expression, the result has type size_t. It is an unsigned
integer type that can represent the size of the largest data object you can
declare. Almost certainly it is either _unsigned int_ or _unsigned long_.
[emphasis in original] ... It is the safest type to represent any integer data
object you use as an array subscript. You don't have to worry if a small array
evolves to a very large one as the program changes. Subscript arithmetic will
never overflow when performed in type size_t. You don't have to worry if the
program moves to a machine with peculiar properties, such as 32-bit bytes and
1-byte longs. Type size_t offers the greatest chance that your code won't be
unduly surprised. The only sensible type to use for computing the sizes of
data objects is size_t. ... You should make a point of using type size_t
_anywhere_ [emphasis in original] your program performs array subscripting or
address arithmetic."

Answer as I see it: the issue wouldn't come up at all.

~~~
Chris_Newton
That's all true, but unfortunately it doesn't help. You can still have two
indices greater than half the maximum value of a size_t, and cause an
overflow.

The problem here isn't the maximum value of whatever data type we are using,
it is that no matter what that maximum is, you can always exceed it by adding
two sufficiently large values of that same type together.

