Hacker News new | past | comments | ask | show | jobs | submit login
Nearly All Binary Searches and Mergesorts are Broken (2006) (googleresearch.blogspot.com)
101 points by gruseom on Feb 17, 2010 | hide | past | web | favorite | 49 comments

It's not binary searches and merge sorts that are broken, it's the INT data type that is "broken" (and I put "broken" in scare quotes because they aren't really broken, they just don't do what you want in most cases). People tacitly assume that INTs model the integers, but they don't. They model the integers modulo 2^N for some value of N. If you code as if INTs were integers and you hit the 2^N limit you will lose.

The algorithm as given would work perfectly well on a straightforward translation into a language with a real integer data type like Python or Lisp.

And you can lose in the other direction too: if you're doing crypto, for example, the algorithms are often designed to use arithmetic operations modulo 2^N, so INTs (and NOT integers) can be just what you need.

But this has nothing do do with binary search or mergesort.

I can make a similar argument about floating-point numbers not modelling reals, and the hysteria over the lack of reflexivity of equality in IEEE 754 is a little overblown.

Why trifle over tiny little epsilons, or NaN, which isn't even a number? ;-)

Don't mean to give Python too much attention here, but I discovered the decimal module a couple months ago and it's fantastic. :)

While it's not a big deal: A common implementation strategy in language run-times is to use machine words when the integer value fits within a particular range. This may mean that the shown implementation will incur a performance penalty well before it's actually necessary to pay the price for using big ints.

But this has nothing do do with binary search or mergesort.

Regardless of our opinions of overflow, it exists in many languages. And I cannot think of anything that would better qualify this issue for "has something to do with binary search" status than that most programmers have seen, used, or written an implementation with this bug.

If Perlis was right when he said that Lispers know the value of everything and the cost of nothing, this bug proves the counterpoint: of Fortran-derived languages knowing the cost of everything but the value of, um, most things but not some others.

Not quite as eloquent, is it?

A more correct title is "most binary search implementations have a bug" but that isn't catchy enough.

When I first saw an article about it, it was actually in the context of how hard it is to write correct code. That should be the take away.

That's my reading, too. The author is really making a point about the fundamental challenge of programming purely logical models in the physical world.

It's a little disappointing to me that the HN community is more interested in the concrete example than the meta topic itself.

I remember reading this (back in 2006?) and looking at the versions of Binary Search and Merge Sort that I had recently written for the Data Structures & Algorithms class that I teach. I found that my versions did not have this bug, despite the fact that I had not given any thought to it.

The reason my code did not have the bug is that I represented the range to be processed, not as low & high subscripts, but rather using a pair of iterators (this was in C++). And you cannot add two iterators. So instead of mid = (low + high)/2, you get something like size = high - low and then mid = low + size/2, where size is an integer, and low, high, mid are iterators.

There is a lesson to be learned here, certainly. I am not exactly sure what it is in its full generality, but I think we can conclude that the endpoints of a range, regardless of how they are represented, are not things that should be added together.

Yes, I've usually represented the range with (low, size) instead of (low, high) (and then the high-low part never comes up).

So when this article first came out I was all "Aren't you bagging on Bentley too much? He published pseudocode, and of course you have to take care about things like overflow when converting it to C." But then I saw Bentley saying somewhere that no, he really hadn't considered this problem. (IIRC)

Bentley didn't just publish pseudo-code; he also published C, and the published C contains the bug.

Ah well, at least it fit the theme here when I misremembered.

regardless of how they are represented, are not things that should be added together.

Maybe something about torsors?



(Nice article. Thanks for the link.)

How does that avoid the problem? Assuming you used a signed integer for size and the list is large enough, there is a sign roll over on size. If size becomes less than -2 then low + size/2 is less than low (on the first pass of a binary search you now have an invalid iterator for mid). Assuming you used unsigned integers size can still not be the correct value if a rollover has occurred though this time mid will always be valid.

If you have a byte array on a 32bit machine, assuming your program's text (i.e. instructions) takes up some space, the biggest possible array will be less than 2^32 Bytes = 4GiB, so the midpoint will be too. There can be no overflow.

If you subtracted two 64bit pointers and stored the result in an int32 though, then you would have problems.

> Assuming you used a signed integer for size ....

By "integer", I did not mean "int". size was a std::size_t.

Perhaps even more generally, the range ADT should include a method to locate the midpoint of a range, in addition to the endpoints.

It seems to me that this "fix" just pushes the bug off by a factor of 2. What happens when you have an array with more elements than can be expressed by an int?

I also question the mergesort assertion. I've implemented mergesort several times, almost always expressed in terms of writing to/from pipes of data (that under the hood eventually wound up going to disk I/O in some way). Those solutions will not suffer from any form of this bug for the simple reason that you never access anything by index.

The curious might wonder why I felt a need to implement such a well-known algorithm myself. Well in addition to the times I did it for fun, one time I had a large dataset to process that had already broken the database, and the computer I had to process it with didn't have enough disk space for the whole dataset. So I needed to do a mergesort while keeping all data that hit disk in compressed form...

> What happens when you have an array with more elements than can be expressed by an int?

You probably couldn't have an array that big and even if you could, it's likely this function would accept it at compile time. The problem here is that you're passing it an array that it is advertised to handle, yet it fails.

It's pretty easy, actually.

Make two arrays, and have the end of one act like a linked list to the start of the second. When a [x] value is larger than the size of the first, just subtract the first's size from x, and find that value in the second. It's even still O(1) time.

Viola, you've now got nearly double the storage of what you can reference, and it'll work in a generic case, but you'll never be able to refer to anything larger than your index-number can reach.

To make matters worse, do you know if your library isn't doing this for you? It'd allow you to extend your array without having to copy the whole thing, which saves a LOT of time on frequently-lengthening arrays. The extremely small cost of checking an int's size before actually accessing the sub-array necessary is often negligible compared to duplicating potentially billions of values.

It's not at all uncommon to have an array that big these days. But the API should use size_t for the array size (or whatever is available in your language of choice).

The difference is that the constraint of using a signed int as the length of the array is known and available to users of the function. What happens when you try to do that is you get a compiler error, because it won't implicitly cast from, say, bigint or unsigned int down to signed int, for example. That's not necessarily a defect, that's just a constraint.

This contrasts with the overflow defect which would cause the mergesort routine to fail when given an otherwise perfectly valid input.

It seems to me that this "fix" just pushes the bug off by a factor of 2. What happens when you have an array with more elements than can be expressed by an int?

I don't think you can do that in Java. In the more general case, you're right, but I think you can solve it by just using a long long or whatever your language has instead of an int for the appropriate calculations.

This sort of problem with sorting seems like it has always been with us. For example: Bentley quoting Knuth: "in section 6.2.1 of his 'Sorting and Searching,' Knuth points out while the first binary search was published in 1946, the first published binary search without bugs did not appear until 1962" (Programming Pearls 2nd edition, "Writing Correct Programs", section 4.1, p. 34).

I touch a bit on related topics (von Neuman's first program being a sorting program: http://www.win-vector.com/blog/2008/02/hello-world-an-instan... , and quicksort implementations almost always being wrong: http://www.win-vector.com/blog/2008/04/sorting-in-anger/ ) in my blog.

Knuth also said: "Beware of bugs in the above code; I have only proved it correct, not tried it."

So, the bug is that you're overflowing integer datatypes? It's a bug that doesn't exist if your data never exceeds 1/2 of your integer value. And with 64-bit, it's effectively impossible for this to happen (except in exceedingly excessive circumstances, and then you better not be using such limited primitives, or suffer the consequences).

I mean, heck. On equal footing, you san say that it's a "bug" that you can't access array indices past what your integer can refer to, but you could conceivably push data in further than that if your array implementation allows it. The article's bug happens at 1/2 that limit, but it's the same thing, just slightly harder to find. Slightly, mind you, as something like this should be checked any time you approach the limits of your formats.

Here's something to ask: which are broken? Example code, code made for a use significantly below 2^31 values, or code in use in cases which may exceed this value? I'd be willing to bet it's the first and second. If you're not making a truly generic algorithm, don't make a generic algorithm. YAGNI.

Nit: ints are still 32 bits on most 64-bit systems ( http://en.wikipedia.org/wiki/64-bit#Specific_data_models )


  mrj10@mjlap:~$ uname -m
  mrj10@mjlap:~$ cat test.c
  #include <stdio.h>
  int main()
    return 0;  
  mrj10@mjlap:~$ gcc test.c -o test
  mrj10@mjlap:~$ ./test

Hah, yep, does the same on mine. Oh well. Thanks for the info. Long int gives me 8, but I see that the standard only requires 4 minimum, so I guess you can't rely on that either...

Eh. I'll still stick with infinite Integers if I'm worried about running out of space.

> I see that the standard only requires 4 minimum

Oh really? What standard? C99 section says 16 bits minimum for int, 32 bits for long, 64 for long long. There may not exist 64-bit architectures with 16-byte ints, but that's not prevented by the standard.

I just pulled from here, no idea if there's something more accurate: http://en.wikipedia.org/wiki/Long_integer

Sorry, I misread comment. Long ints have a 4-byte minimum, plaint ints have a 2-byte minimum.

They're Google. Their data set is bigger than your. :-)

The only time I can remember actually needing to write a binary search from scratch, I played it safe by copying it straight out of an algorithms textbook (post-1962)...and it had a much worse bug than this one!

Let's go further and say that Nearly All C Code Using Integer Arithmetic Is Broken.

This seems more like a limitation of the language implementation than a program bug to me.

If the behavior is according to the language spec, it's not an implementation thing. Perhaps you meant language spec instead of implementation, but even in that case it seems like a poor idea to require all arithmetic to be arbitrary-precision, no?

You're right on both counts, and it is indeed a poor idea unless we can figure out a way to efficiently implement arbitrary precision arithmetic in hardware. I was just quibbling that the implementation of the binary search wasn't strictly wrong. :)

Having said that, I do think that such pitfalls might be easily avoided if a check for the values being operated on were built into the language implementation. That way, we can all write simpler code without having to take into account the min/max values for types specific to the language.

It would be nice, but the overhead would likely be unacceptable, since most code is dominated by integer arithmetic of one form or another. There was a StackOverflow question here http://stackoverflow.com/questions/199333/best-way-to-detect... that has some good suggestions. As for hardware support, many (but not all) architectures have some support for trapping or jumping to a specified address on integer overflow. The problem is compiler support. It has been a long time since a mainstream compiler for a mainstream architecture has supported it. GCC still has a -ftrapv flag, but it appears not to work anymore.

There's a cool paper on a static analysis tool to detect possible overflows here: http://www.isoc.org/isoc/conferences/ndss/09/pdf/17.pdf They used it to find real bugs in open source projects.

This seems more like an artifact of 1970s programming to me; the fact that allegedly-modern languages still offer numeric types which differ only by the range of numbers they can represent is just plain nasty, and makes me happy that I use a language which doesn't make me worry about this when I want to work with numeric types.

It's interesting to consider the same situation in JavaScript, where integers tend to be represented internally as doubles, meaning they don't overflow, they just lose precision somewhere around 2^52. But this is unlikely to be a problem if you're counting things.

The fact is with code, if you don't try it, it doesn't work. We all know this is true; when code works "the first time" we tell the story to our friends. I debugged a string-move routine ported from linux to a RISC processor with a requirement for aligned word-moves. It had 11 bugs in like, 20 lines of code. Because it hadn't been tried on that architecture before. Its not a bug if the code worked where originally designed to work, but not when moved to a new environment. Kind of makes "reusable code" an oxymoron.

If the code relied on behavior the standard doesn't guarantee, the code was always wrong. It just worked by accident (maybe only in the cases we tested), and we didn't have a compiler and runtime good enough to tell us the code was wrong.

Sounds academic. The programmer verified correct operation in the environment. In fact it was never going to fail where it was originally written and tested. That is "correct" by most reasonable definitions.

This, of course, is why you always use size_t for array indices. (unsigned int blows up on amd64, too: a 4GB array of chars is huge but not impossibly large).

That wouldn't have helped in this case, since the issue was an overflow in the (high+low) / 2 calculation. Wouldn't using size_t just make the problem less likely to happen (requiring an even larger array to show up)?

The C Stardard (addmittedly from ISO:9899:1990, which describes C89; it still holds for C99; no comment on C++, which I don't use) states: "... size_t which is the unsigned integral type of the result of the sizeof operator ..."

P. J. Plauger in _The Standard C Library_ states: "When you apply the sizeof operator in a C expression, the result has type size_t. It is an unsigned integer type that can represent the size of the largest data object you can declare. Almost certainly it is either unsigned int or unsigned long. [emphasis in original] ... It is the safest type to represent any integer data object you use as an array subscript. You don't have to worry if a small array evolves to a vary large one as the program changes. Subscript arithmetic will never overflow when performed in type size_t. You don't have to worry if the program moves to a machine with peculiar properties, such as 32-bit bytes and 1-byte longs. Type size_t offers the greatest chance that your code won't be unduly surprised. The only sensible type to use for computing the sizes of data object is size_t. ... You should make a point of using type size_t anywhere [emphasis in original] your program performs array subscripting or address arithmetic."

Answer as I see it: the issue wouldn't come up at all.

That's all true, but unfortunately it doesn't help. You can still have two indices greater than half the maximum value of a size_t, and cause an overflow.

The problem here isn't the maximum value of whatever data type we are using, it is that no matter what that maximum is, you can always exceed it by adding two sufficiently large values of that same type together.

I'm sorry, I spoke too quickly. On the other hand, it would actually have worked: every OS I'm aware of (Windows, Linux, OpenBSD) limits 32-bit applications to 2GB of memory, and 64-bit applications to some ridiculously large number well short of 2^64. In neither case could this actually overflow.

But that's system-specific; it's not defined to fix this, so sorry for spreading the misinformation.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact