
Nearly All Binary Searches and Mergesorts are Broken (2006) - prajjwal
http://googleresearch.blogspot.co.uk/2006/06/extra-extra-read-all-about-it-nearly.html
======
crntaylor
I don't agree with this statement

    
    
      "It is not sufficient merely to prove a program correct;
       you have to test it too."
    

It _is_ sufficient to prove a program correct - as long as your proof is not
faulty! The problem in this case was not that the program had a bug despite
being proved correct. The problem was that the 'proof' was not a proof at all.

Machine ints are not mathematical integers. Floats are not real numbers. You
can't prove things about programs that use ints/floats without taking these
things into account.
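
As a concrete illustration of that gap, the midpoint computation at the heart of this bug is correct over the mathematical integers but not over Java's 32-bit ints. A minimal sketch (the index values here are made up, but both are legal Java array indices):

```java
public class Midpoint {
    public static void main(String[] args) {
        // Hypothetical indices into a large (but legal) Java array.
        int low = 1_000_000_000;
        int high = 2_000_000_000;

        // The "proved correct" midpoint: fine over the integers, but
        // low + high wraps around once it exceeds 2^31 - 1.
        int broken = (low + high) / 2;

        // An equivalent formula that never overflows when 0 <= low <= high.
        int fixed = low + (high - low) / 2;

        System.out.println(broken); // -647483648: the sum wrapped negative
        System.out.println(fixed);  // 1500000000
    }
}
```

A proof carried out over Java's ints (i.e., modulo 2^32 with signed wraparound) would have caught this; a proof over the integers could not.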

Of course, the question of how one knows that a proof is correct is still left
open - but that's a metatheoretical argument that it might be best to leave
aside. I suspect that most faulty proofs are faulty for pedestrian reasons
(incorrect type assumptions, failing to deal with null/NaN etc) rather than
high-falutin' concerns about the validity of first-order logic.

~~~
stiff
Computers are now complicated enough that Computer Science has turned, to
some extent, into an empirical science. It is impossible for a single person
to hold in their head everything that goes on in a typical computer, operating
system, compiler and so forth, so one is often forced to resort to experiment
to find things out. It is no longer a theory where you can just reason things
out; maybe it never was one, in fact, because even many simple imperative
programs are very hard to reason about.

In empirical sciences, you not only have to have a mathematically valid
theory, but you also have to check if the theory fits reality by making
predictions and experimentally checking them with the real world. It's the
same now in Computer Science, there are so many places where theoretical
assumptions might deviate from reality that having a proof is not enough. In
fact, you have to have those assumptions to make things mathematically
tractable. Imagine mathematicians or computer scientists re-proving real
analysis theorems using floating point arithmetic...

In other words, your vision of proofs being enough as long as all the
assumptions are part of the theory seems utopian to me. In fact, even some of
the most devoted advocates of correctness proofs have admitted this:

[http://www.gwern.net/docs/1996-hoare.pdf](http://www.gwern.net/docs/1996-hoare.pdf)

~~~
crntaylor
> _Imagine mathematicians or computer scientists re-proving real analysis
> theorems using floating point arithmetic..._

Mathematicians and computer scientists _do_ prove theorems about floating
point arithmetic! For example, the most widely-cited floating point reference
contains no fewer than fifteen theorems about floating point:

[http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.ht...](http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html)

Or here's a presentation about representing functions with their Taylor
expansions using floating point arithmetic, providing strong error bounds on
the result of adding or multiplying two functions represented in this way:

[http://perso.ens-lyon.fr/nathalie.revol/talks/ICIAM07.pdf](http://perso.ens-lyon.fr/nathalie.revol/talks/ICIAM07.pdf)

I agree that many proofs in computer science are harder than proofs in
mathematics, because you can't deal with idealizations - you always have to
think about the machine. But unlike empirical sciences, we have access to the
design of the machine. We know what many of the axioms are. Formal reasoning
is valid for a far larger part of computer science than for the natural
sciences.

I'm not going to argue that proofs are a panacea for every situation. But I
also don't categorically reject them in the domains where they can be usefully
applied.

~~~
stiff
I know there are theorems about floating point; that's missing the point.
What I am saying is that the theories most useful for reasoning are often
nearly impossible to formulate if you want to include all the messy details of
something like floating point, to take just one example. What happens instead
is that we reason using nice idealized theories, and then we experiment to
assess the gap between theory and reality.

I am very much a fan of theory and proofs, but the point of the quoted
comment is that there is an empirical component to software development too,
and your original comment seemed to question that.

------
jgoodwin
A bit off topic, but most of statistics also breaks.

If you go back to _Mathematical Statistics_ by RA Fisher, early in the last
century, and look at his arguments about binning 'big data' into histograms,
he has a nice little construction that uses the notion of an 'angle' running
through the data set, does a Fourier Series expansion, keeps the 'DC' term
from the cosine series, and waves his hand about second order effects. He does
estimate them for the sine-like series, and finds for a data set of size N=1
Trillion it might be a 10% effect.

The only remnant of this whole proceeding in modern lore (and even Ph.D.
statisticians may not have heard of it) is Sheppard's correction for equal
class-interval histograms:

[http://mathworld.wolfram.com/SheppardsCorrection.html](http://mathworld.wolfram.com/SheppardsCorrection.html)

But of course, when your datasets routinely reach a billion rows, a 10%
effect a mere three orders of magnitude away in dataset size should start to
make you nervous.

Moral: once you get to a billion data points or so of anything, it's time to
redo the maths, very, very carefully.

~~~
yetanotherphd
I am a PhD statistician and I can't understand your comment.

What statistic is being calculated for the N=1 Trillion dataset? And what is
the way of calculating that would be off by 10%?

------
kagebe
In my opinion, the main take-away here is that the proofs were obviously not
proofs at all. If you program with modulo arithmetic, you have to do your
proofs with modulo arithmetic. If you use IEEE floating point, say goodbye to
your theorems about real arithmetic.

If you forget/omit a single fact about your target platform/machine/api
(whatever axioms you found your reasoning on) in your proof, it may be worth
nothing.
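
The floating-point half of this is easy to demonstrate: even basic laws of real arithmetic, such as associativity of addition, fail for IEEE doubles. A small sketch in Java (the constants are arbitrary):

```java
public class NotRealNumbers {
    public static void main(String[] args) {
        // Associativity holds for the reals, but not for IEEE 754 doubles:
        double left = (0.1 + 0.2) + 0.3;  // 0.6000000000000001
        double right = 0.1 + (0.2 + 0.3); // 0.6

        System.out.println(left == right); // false
    }
}
```

So any "proof" that silently reassociates floating-point sums is already reasoning about a different machine.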

~~~
koverstreet
Yup.

More annoyingly, for a binary search over an array (i.e. something that can
fit in memory) their code is still wrong - they should be using size_t.

~~~
MaulingMonkey
On top of that, this 'fix' from the article, even if it were corrected to use
size_t instead of unsigned int...

    
    
      In C and C++ (where you don't have the >>> operator), you can do this:
      6:             mid = ((unsigned int)low + (unsigned int)high) >> 1;
    

Has the same bug as the original snippet, just in unsigned space. Now,
granted, binary searching a >2GB byte array in a 32-bit process is an unlikely
use case... but it is technically feasible!
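
For contrast, in Java (where array indices are non-negative ints) the `>>>` trick does stay correct for any legal pair of indices: the wrapped sum is reinterpreted as a 32-bit unsigned value before the shift. A quick sketch with made-up indices:

```java
public class UnsignedShiftMidpoint {
    public static void main(String[] args) {
        int low = 1_500_000_000;
        int high = 2_000_000_000; // both are valid Java array indices

        int sum = low + high;         // wraps to -794967296
        int mid = (low + high) >>> 1; // unsigned shift: 1750000000, correct

        System.out.println(sum);
        System.out.println(mid);
    }
}
```

The C version has no such safety net once the unsigned 32-bit sum itself can exceed 2^32 - 1, which is exactly the >2GB case described above.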

~~~
PhantomGremlin
Yeah. That "fix" immediately jumped out at me in the sense of WTF??? All he
did was double the max size of the array before his code fails again!

That so-called "fixed" code just couldn't do anything useful in a 32-bit
address space. A billion 32-bit ints completely fills up the address space by
itself. As you note, perhaps it could "technically" be possible to search a
billion 8-bit bytes, but that's not what's being passed in to the function.

And if he's running in a 64-bit address space (otherwise how could he pass in
an array of a billion ints), then he should be using 64-bit integer
arithmetic. (I don't know enough about Java to know how feasible it is to do
that).
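
For what it's worth, 64-bit arithmetic is straightforward in Java: widen one operand to long before the addition and the sum cannot wrap. A sketch (the indices are made up):

```java
public class LongMidpoint {
    public static void main(String[] args) {
        int low = 1_500_000_000;
        int high = 2_000_000_000;

        // (long) low promotes the whole addition to 64 bits, so no wraparound;
        // the midpoint fits back in an int because it lies in [low, high].
        int mid = (int) (((long) low + high) / 2);

        System.out.println(mid); // 1750000000
    }
}
```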

------
yetanotherphd
I would guess that nearly all code, period, is vulnerable to integer overflow
issues. I don't think it makes sense to worry about this except in very
special cases.

~~~
AndrewBissell
Having an array with more than about 1.2 billion elements is all that it would
have taken to break the old binary search, and that's not all that uncommon
anymore.

Pure JavaScript code isn't vulnerable to integer overflow by virtue of not
having any integer types, and I believe errors like this were the reason for
leaving them out of the language.

~~~
vbit
JavaScript numbers are just floats and susceptible to precision loss as well
as overflow to Infinity, aren't they?

In fact, if incremented repeatedly, 64-bit floats lose precision before 64-bit
ints overflow.

~~~
masklinn
> 64-bit floats lose precision before 64-bit ints overflow.

More precisely, a 64-bit IEEE 754 double ("double precision") has 53 bits'
worth of "integer"[0], which allows for 15 decimal digits (and almost, but not
quite, 16: it can encode about 15.95 decimal digits).

[0] Even though only 52 bits are allocated to the fraction, the fraction part
has an implicit leading 53rd bit set to 1 outside of special values.
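
The 53-bit boundary is easy to check directly: every integer up to 2^53 is exactly representable in a double, and precision is lost immediately afterwards. A small sketch in Java:

```java
public class FiftyThreeBits {
    public static void main(String[] args) {
        double below = 9007199254740991.0; // 2^53 - 1: exactly representable
        double limit = 9007199254740992.0; // 2^53: the last gap-free integer

        System.out.println(below + 1.0 == limit); // true: still exact
        System.out.println(limit + 1.0 == limit); // true: 2^53 + 1 rounds back
    }
}
```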

------
Perseids
I understand why people actually using these huge data sets see this as a bug
but personally I don't agree. Consider how hard programming becomes when you
have to take into account integer overflows even for a /single addition/ of
array indices. Our mainstream programming languages are build around the
assumption that indices don't overflow and I would rather use the binary sort
code as argument to support this than to support the argument that bug free
code is really hard.

The real lesson in my opinion is to stop using 32 bit integers in big data
applications. Which unfortunately doesn't seem trivial in Java.

------
simula67
Previous discussions :

[https://news.ycombinator.com/item?id=1130463](https://news.ycombinator.com/item?id=1130463)

[https://news.ycombinator.com/item?id=621557](https://news.ycombinator.com/item?id=621557)

------
edanm
One of the uncommon times a sensational headline is actually correct!

I love this bug - it's been my go-to example of how there are bugs in every
piece of code, no matter how widely used. A bug in Java's implementation of
binary search - in one of the most popular languages, in one of the most used
algorithms - and still it managed to lie in wait for 9 years.

~~~
kokey
I was also thinking this was a stereotypical HN sensational headline blog
post. Apart from being strangely correct, it somehow makes me want to code
something in Java today.

------
bproctor
I don't believe this is necessarily a bug. All code will break under some
extremes and nowhere are the requirements specified for what this code is
supposed to be able to handle. Without requirements, you can't "prove" it
correct (or incorrect for that matter).

------
greenlakejake
Off topic (since it's not about C/Java), but Python has an arbitrary-precision
long integer type which won't overflow or underflow unless you run out of
memory.

    
    
      % python
      Python 2.7.5 (default, Aug 25 2013, 00:04:04)
      [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
      Type "help", "copyright", "credits" or "license" for more information.
      >>> i = long(9999999999999999999999999999999)
      >>> i
      9999999999999999999999999999999L
      >>>
    

~~~
masklinn
> Python has a long integer type which won't over/underflow unless you run out
> of memory.

IIRC, so do Erlang, Ruby or Haskell (when using `Integer`), FWIW. And Java has
BigInteger (though that one's a pain to use).

But there's a cost to their existence (they need to check for overflow at
every operation), and a cost to going above machine word size. Also, you now
have "integers" which can take arbitrary amounts of memory, and integer
operations become O(n).

Still, definitely a plus on the correctness side.

There's also the option of type-encoded value ranges as in Pascal or Ada.
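
For reference, here is what the overflow-free midpoint looks like with Java's BigInteger; it cannot wrap, but the verbosity (and the per-operation allocation) shows why it's considered a pain. A sketch with made-up values:

```java
import java.math.BigInteger;

public class BigIntegerMidpoint {
    public static void main(String[] args) {
        BigInteger low = BigInteger.valueOf(1_500_000_000);
        BigInteger high = BigInteger.valueOf(2_000_000_000);

        // No wraparound: the intermediate sum simply grows past 32 bits.
        BigInteger mid = low.add(high).shiftRight(1);

        System.out.println(mid); // 1750000000
    }
}
```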

------
graycat
I read far enough into that to see the code that could cause overflow and quit
reading.

I've programmed various cases of binary search, merge sort, etc. for decades
and in all cases wrote code with indexing that could never overflow. If some
silly academic make-work, prof-scam, busy-work, nonsense code could overflow,
so be it.

Heck, I even worry about

    
    
      do i = 1 to n
    

wondering whether the code exits the loop only once i = n + 1 tests larger
than n, which, if n is the largest representable integer, can never happen.
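
That worry is justified in Java too: int arithmetic wraps silently, so a loop whose bound is Integer.MAX_VALUE can never see its counter exceed the bound. A sketch of the wraparound:

```java
public class LoopBoundWrap {
    public static void main(String[] args) {
        int n = Integer.MAX_VALUE;
        int i = n;  // the last value a "for (int i = 1; i <= n; i++)" body sees
        i++;        // instead of exceeding n, i wraps to Integer.MIN_VALUE

        System.out.println(i <= n); // true: the exit test never fires
    }
}
```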

The title here "nearly all" is to me total BS. Take your insult of my code and
stuff it. Capiche?

------
gkhnarik
I don't understand why such an old article has come up here again.

~~~
VladRussian2
Because Java array lengths are ints, limited to 32 bits. I think we'll see
these articles more and more :)

------
mtp0101
Good thing I learned about this in 15-122 freshman year

------
eximius
Well, that was an overly melodramatic title.

