The algorithm as given would work perfectly well under a straightforward translation into a language with true arbitrary-precision integers, like Python or Lisp.
And you can lose in the other direction too: if you're doing crypto, for example, the algorithms are often designed around arithmetic modulo 2^N, so fixed-width ints (and NOT mathematical integers) can be just what you need.
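For what it's worth, unsigned arithmetic in C and C++ really is defined to wrap modulo 2^N, which is why fixed-width types are the natural fit there. A tiny illustration (standard behavior, nothing article-specific):

    #include <cinttypes>
    #include <cstdint>
    #include <cstdio>

    int main() {
        // Unsigned overflow is well-defined: the sum wraps modulo 2^32,
        // exactly the arithmetic many crypto primitives are specified in.
        uint32_t a = 0xFFFFFFF0u;
        uint32_t b = 0x20u;
        uint32_t sum = a + b;  // (0xFFFFFFF0 + 0x20) mod 2^32 = 0x10
        printf("0x%" PRIx32 "\n", sum);
    }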
But this has nothing to do with binary search or mergesort.
Regardless of our opinions of overflow, it exists in many languages. And I cannot think of anything that would better qualify this issue for "has something to do with binary search" status than that most programmers have seen, used, or written an implementation with this bug.
Not quite as eloquent, is it?
When I first saw an article about it, it was actually in the context of how hard it is to write correct code. That should be the takeaway.
It's a little disappointing to me that the HN community is more interested in the concrete example than the meta topic itself.
The reason my code did not have the bug is that I represented the range to be processed, not as low & high subscripts, but rather using a pair of iterators (this was in C++). And you cannot add two iterators. So instead of mid = (low + high)/2, you get something like size = high - low and then mid = low + size/2, where size is an integer, and low, high, mid are iterators.
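Roughly this shape (a reconstruction from memory, not my actual code):

    #include <iterator>

    // Binary search over the half-open range [low, high). Iterators can't
    // be added to each other, so the type system steers you to the
    // overflow-free form: take the distance first, then advance.
    template <typename Iter, typename T>
    Iter find_lower_bound(Iter low, Iter high, const T& value) {
        while (low != high) {
            auto size = std::distance(low, high);  // an integer
            Iter mid = std::next(low, size / 2);   // "low + size/2"
            if (*mid < value)
                low = std::next(mid);
            else
                high = mid;
        }
        return low;
    }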
There is a lesson to be learned here, certainly. I am not exactly sure what it is in its full generality, but I think we can conclude that the endpoints of a range, regardless of how they are represented, are not things that should be added together.
So when this article first came out I was all "Aren't you bagging on Bentley too much? He published pseudocode, and of course you have to take care about things like overflow when converting it to C." But then I saw Bentley saying somewhere that no, he really hadn't considered this problem. (IIRC)
Maybe something about torsors?
(Nice article. Thanks for the link.)
If you subtracted two 64-bit pointers and stored the result in an int32, though, then you would have problems.
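A contrived illustration, with integers standing in for two addresses more than 2^31 bytes apart (so it runs without needing an actual multi-GiB allocation):

    #include <cstdint>
    #include <cstdio>

    int main() {
        int64_t p = 0x200000000;  // pretend address at 8 GiB
        int64_t q = 0x100000000;  // pretend address at 4 GiB

        int64_t diff64 = p - q;             // 0x100000000: correct
        int32_t diff32 = (int32_t)(p - q);  // truncates to 0 on
                                            // two's-complement targets
        printf("%lld vs %d\n", (long long)diff64, diff32);
    }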
By "integer", I did not mean "int". size was a std::size_t.
I also question the mergesort assertion. I've implemented mergesort several times, almost always expressed in terms of writing to/from pipes of data (that under the hood eventually wound up going to disk I/O in some way). Those solutions will not suffer from any form of this bug for the simple reason that you never access anything by index.
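The shape is roughly this (a toy sketch; stringstreams stand in for the real pipes):

    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <sstream>

    int main() {
        // Two already-sorted runs arriving as streams; in the real code
        // these were pipes backed by (compressed) disk files.
        std::istringstream run1("1 4 7"), run2("2 3 9");

        // Merge by comparing the heads of the streams: no subscripts
        // anywhere, hence no midpoint computation to overflow.
        std::merge(std::istream_iterator<long>(run1), std::istream_iterator<long>(),
                   std::istream_iterator<long>(run2), std::istream_iterator<long>(),
                   std::ostream_iterator<long>(std::cout, " "));
    }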
The curious might wonder why I felt a need to implement such a well-known algorithm myself. Well, in addition to the times I did it for fun, one time I had a large dataset to process that had already broken the database, and the computer I had to process it with didn't have enough disk space for the whole dataset. So I needed to do a mergesort while keeping all data that hit disk in compressed form...
You probably couldn't have an array that big, and even if you could, it's likely this function would still accept it at compile time. The problem here is that you're passing it an array that it's advertised to handle, yet it fails.
Make two arrays, and have the end of one act like a linked list to the start of the second. When an index x is larger than the size of the first, just subtract the first's size from x and find that value in the second. It's even still O(1) time.
Voilà, you've now got nearly double the storage of what you can reference, and it'll work in a generic case, but you'll never be able to refer to anything larger than your index-number can reach.
To make matters worse, do you know whether your library isn't doing this for you already? It'd allow you to extend your array without having to copy the whole thing, which saves a LOT of time on frequently-lengthening arrays. The extremely small cost of checking the index before accessing the appropriate sub-array is often negligible compared to duplicating potentially billions of values.
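A rough sketch of the two-array trick (my own toy version, not any particular library's):

    #include <cstddef>
    #include <vector>

    // Present two backing arrays as one. Indexing stays O(1); the catch,
    // as noted above, is that idx itself is still bounded by its type.
    struct SplitArray {
        std::vector<int> first, second;

        int& operator[](std::size_t idx) {
            if (idx < first.size())
                return first[idx];
            return second[idx - first.size()];  // spill into the second
        }

        std::size_t size() const { return first.size() + second.size(); }
    };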
This contrasts with the overflow defect which would cause the mergesort routine to fail when given an otherwise perfectly valid input.
I don't think you can do that in Java. In the more general case, you're right, but I think you can solve it by just using a long long or whatever your language has instead of an int for the appropriate calculations.
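Something like this (C++ shown; Java is the same shape with long):

    #include <cstdint>

    // Do the sum in a wider type, then narrow the midpoint back down;
    // it always fits, since low <= mid <= high.
    int32_t midpoint(int32_t low, int32_t high) {
        return (int32_t)(((int64_t)low + high) / 2);
    }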
I touch a bit on related topics (von Neumann's first program being a sorting program: http://www.win-vector.com/blog/2008/02/hello-world-an-instan... , and quicksort implementations almost always being wrong: http://www.win-vector.com/blog/2008/04/sorting-in-anger/ ) in my blog.
I mean, heck. On equal footing, you can say that it's a "bug" that you can't access array indices past what your integer can refer to, but you could conceivably push data in further than that if your array implementation allows it. The article's bug happens at 1/2 that limit, but it's the same thing, just slightly harder to find. Slightly, mind you, as something like this should be checked any time you approach the limits of your formats.
Here's something to ask: which are broken? Example code, code made for a use significantly below 2^31 values, or code in use in cases which may exceed this value? I'd be willing to bet it's the first and second. If you're not making a truly generic algorithm, don't make a generic algorithm. YAGNI.
mrj10@mjlap:~$ uname -m
x86_64
mrj10@mjlap:~$ cat test.c
#include <stdio.h>
int main(void) { printf("%zu\n", sizeof(int)); return 0; }
mrj10@mjlap:~$ gcc test.c -o test
mrj10@mjlap:~$ ./test
4
Eh. I'll still stick with infinite Integers if I'm worried about running out of space.
Oh really? What standard? C99 section 5.2.4.2.1 says 16 bits minimum for int, 32 bits for long, 64 for long long. There may not exist 64-bit architectures with 16-bit ints, but that's not prevented by the standard.
Having said that, I do think that such pitfalls might be easily avoided if a check on the values being operated on were built into the language implementation. That way, we could all write simpler code without having to take into account the min/max values for the language's types.
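GCC and Clang already expose checked arithmetic as builtins, so a sketch of what such a language-level check could look like:

    #include <cstdio>

    int main() {
        int low = 1'500'000'000, high = 2'000'000'000, sum;

        // __builtin_add_overflow returns true if the addition overflowed,
        // instead of silently wrapping (or invoking undefined behavior).
        if (__builtin_add_overflow(low, high, &sum))
            puts("overflow detected");
        else
            printf("mid = %d\n", sum / 2);
    }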
There's a cool paper on a static analysis tool to detect possible overflows here: http://www.isoc.org/isoc/conferences/ndss/09/pdf/17.pdf They used it to find real bugs in open source projects.
P. J. Plauger in _The Standard C Library_ states: "When you apply the sizeof operator in a C expression, the result has type size_t. It is an unsigned integer type that can represent the size of the largest data object you can declare. Almost certainly it is either unsigned int or unsigned long. [emphasis in original] ... It is the safest type to represent any integer data object you use as an array subscript. You don't have to worry if a small array evolves to a very large one as the program changes. Subscript arithmetic will never overflow when performed in type size_t. You don't have to worry if the program moves to a machine with peculiar properties, such as 32-bit bytes and 1-byte longs. Type size_t offers the greatest chance that your code won't be unduly surprised. The only sensible type to use for computing the sizes of data objects is size_t. ... You should make a point of using type size_t anywhere [emphasis in original] your program performs array subscripting or address arithmetic."
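In that spirit, a small sketch of the subscripting advice (my example, not Plauger's):

    #include <cstddef>

    // Binary search with size_t subscripts over the half-open range
    // [0, n). The midpoint comes from the difference, which always fits.
    std::size_t lower_bound_index(const int* a, std::size_t n, int value) {
        std::size_t low = 0, high = n;
        while (low < high) {
            std::size_t mid = low + (high - low) / 2;
            if (a[mid] < value)
                low = mid + 1;
            else
                high = mid;
        }
        return low;
    }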
Answer as I see it: the issue wouldn't come up at all.
The problem here isn't the maximum value of whatever data type we are using; it's that no matter what that maximum is, you can always exceed it by adding two sufficiently large values of that same type together.
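Concretely (simulating the 32-bit wraparound via a 64-bit sum so the example stays well-defined):

    #include <cstdint>
    #include <cstdio>

    int main() {
        // Both are valid 32-bit indices, but their sum is not representable.
        int32_t low  = 1'500'000'000;
        int32_t high = 2'000'000'000;

        // What (low + high) / 2 produces once the sum wraps (shown via a
        // 64-bit sum narrowed back down, two's complement assumed):
        int32_t bad  = (int32_t)(uint32_t)((int64_t)low + high) / 2;
        int32_t good = low + (high - low) / 2;  // the difference always fits

        printf("bad: %d, good: %d\n", bad, good);
    }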
But that's system-specific; the standard doesn't actually fix this, so sorry for spreading the misinformation.