
I'd put the blame on languages that don't allow exceptions, and whose return value in case of error belongs to the same domain as the solution.

I've coded binary searches and sorts tons of times in C++, and yet none was susceptible to this bug. Why? Because, whenever you're talking indices, you should ALWAYS use unsigned int. Since an array can't have negative indices, if you use unsigned ints the problem is solved by design. And, if the element is not found, you throw an exception.

Instead, in C you don't have exceptions, and you have to figure out creative ways of returning errors. errno-like statics work badly with concurrency. And doing something like int search(..., int* err), and setting err inside your functions, feels cumbersome.

So what does everyone do? Return a positive int if the index is found, or -1 otherwise.

In other words, we artificially extend the domain of the solution just to include the error. We force into the signed integer domain something that was always supposed to be unsigned.

This is one of the most common causes of the integer overflow problems out there.
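Roughly, the two styles side by side (a minimal sketch; the names and signatures are just illustrative):

  #include <stdbool.h>
  #include <stddef.h>

  /* The usual C pattern: -1 doubles as "not found", so the error lives
   * in the same (signed) domain as the index. */
  int find_signed(const int *a, int n, int key) {
      for (int i = 0; i < n; i++)
          if (a[i] == key)
              return i;
      return -1;
  }

  /* Alternative: the index stays unsigned and "found" is reported
   * separately, so no sentinel value is forced into the index domain. */
  bool find_unsigned(const int *a, size_t n, int key, size_t *out) {
      for (size_t i = 0; i < n; i++)
          if (a[i] == key) {
              *out = i;
              return true;
          }
      return false;
  }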




When you’re talking indices, you should NEVER use int, unsigned or not. The world is 64-bit these days and int is stuck at 32 bits almost everywhere. And even on 32-bit systems indexing with unsigned int may not be safe unless you think about overflow, as this bug demonstrates (at least unsigned overflow is not immediate UB in C and C++ like signed overflow is…)

C has size_t. Use it.


To be fair, size_t doesn't solve this particular problem; you also need to use the correct array slice representation (ptr,len) rather than (start,end), and calculate the midpoint accordingly (i.e. (ptr,len/2) or (ptr+len/2,len-len/2)).

(And because C doesn't mandate correct handling of benign undefined behavior, you still have a problem if you `return ptr-orig_ptr` as a size_t offset (rather than returning the final ptr directly), because pointer subtraction is specified as producing ptrdiff_t (rather than size_t), which can 'overflow' for large arrays, despite being immediately converted back to a correct size_t value.)
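A minimal sketch of that approach (assuming a plain int array; names are illustrative): the slice is carried as (ptr,len), the midpoint is ptr+len/2, and the pointer itself is returned, so no sum of indices and no ptrdiff_t offset is ever formed.

  #include <stddef.h>

  /* Binary search over a (ptr,len) slice. */
  const int *bsearch_slice(const int *ptr, size_t len, int key) {
      while (len > 0) {
          size_t half = len / 2;
          const int *mid = ptr + half;      /* always in bounds: half < len */
          if (*mid < key) {                 /* continue in (mid+1, len-half-1) */
              ptr = mid + 1;
              len -= half + 1;
          } else if (*mid > key) {          /* continue in (ptr, half) */
              len = half;
          } else {
              return mid;                   /* found: return the pointer directly */
          }
      }
      return NULL;                          /* not found */
  }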


The problem is not solved by using unsigned ints though, because it stems from integer overflow. I'm afraid your implementations are, alas, also incorrect.


Confused, how does using unsigned integers not solve this particular problem? Doesn't the article itself show solutions with unsigned integers?


Example using 16-bit size_t for convenience:

  char array[60000]; // 5KB left for code and stack if not segmented
  size_t i = 40000;
  size_t j = 50000;
  size_t mid = (i+j)/2; // should be 45000
  // i+j = (size_t)90000 = 24464
  // mid = 24464/2 = 12232 != 45000
Larger integers make the necessary array size bigger, but don't change the overall issue.
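For comparison, the usual fix (the one the article also suggests) computes the midpoint without ever forming the overflowing sum; with the same 16-bit values:

  size_t mid = i + (j - i)/2; // j - i can't overflow since i <= j
                              // 40000 + (50000 - 40000)/2 = 45000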


Unsigned int is 32-bit, most address spaces are 48-bit or more.


Array.length is 32-bit in Java. This is from 2006.


Or you can return an unsigned int which is the highest valid index+1.

    ['a','b','c'].indexof('b') == 1 // found - return index
    ['a','b','c'].indexof('w') == 3 // not found - return size of array
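In C terms that convention looks roughly like this (a sketch; the function name is just illustrative):

  #include <stddef.h>

  /* "Not found" is signalled by returning len, one past the highest
   * valid index, so the result never leaves the unsigned domain. */
  size_t index_of(const int *a, size_t len, int key) {
      for (size_t i = 0; i < len; i++)
          if (a[i] == key)
              return i;       /* found */
      return len;             /* not found */
  }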


There should be a pointer type that is not an int and whose size depends on the host platform.

Casts from int to pointer should be explicit.

We are enamoured with programmer convenience at the expense of the safety of our systems. It’s unprofessional and we should all aim to fix it.


The type that’s meant for indexing in C and C++ is called `size_t`. It is pointer-sized. In Rust it’s called `usize` and Rust does not have implicit conversions, so if you accidentally use too narrow an integer type to compute an index, at least Rust forces you to add an explicit cast somewhere.


I've seen libraries that add a size_t type as an alias to int on certain systems. Rust gets it right here.


Seeing the new erroneous assumption being made in this post reminded me of Linus Torvalds' rant about C++: http://harmful.cat-v.org/software/c++/linus


For most practical purposes, an int64 index can include a universe of negative return codes with no loss of functionality.

The problems here are about using integers that are too narrow and not properly doing arithmetic to prevent overflow from impacting the result.


> For most practical purposes, an int64 index can include a universe of negative return codes with no loss of functionality.

Isn't this article a counterexample to that? Where using signed instead of unsigned actually does result in a loss of functionality?


No. This article explicitly mentions the "int" type, which is exactly 32 bits in Java and typically 32 bits in C and C++. 32-bit ints are not large enough for this purpose: they can only index 2 billion items directly (which will overflow a lot given that a standard server now has 256-512 GB of RAM), and this average calculation hits problems at around 1 billion items. Overflows on 64-bit ints (when used to store 63-bit unsigned numbers) are not going to happen for a very long time.


Wasn't Array.length 32-bit on Java when the article was written? In fact, isn't it 32-bit even now?

Moreover, I don't see how you can deny that using signed loses functionality in this case: it's pretty undeniable that it gives the wrong answer in cases where unsigned would give the correct answer; the code is right there in the article and you can test it out. This is true irrespective of any other cases that neither might handle correctly (assuming you believe any exist, but see my previous paragraph).


I didn't say that a signed int would be fine. I said that a signed 64-bit int would be fine.

Moreover, it is trivial to convert from a 32-bit signed or unsigned type to a 64-bit int, so you are not constrained by the size type of Java.


The naive (x+y)/2 returns the wrong number for x=UINT_MAX and y=UINT_MAX, for a trivial counterexample.
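Concretely, with 32-bit unsigned arithmetic, UINT_MAX + UINT_MAX wraps to 0xFFFFFFFE, and 0xFFFFFFFE / 2 is 0x7FFFFFFF rather than the correct midpoint UINT_MAX.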


"Because, whenever you're talking indices, you should ALWAYS use unsigned int."

sounds like a lot of your code is in fact broken...


I think it'd be nice if you gave some examples of how using unsigned integers for indices breaks code in cases where signed integers don't, because otherwise your comment is very unilluminating.


You need to google why size_t exists. size_t is guaranteed to be able to represent any array index. unsigned int can be arbitrarily small, and a big enough loop may cause unsigned overflow on some architectures. In other words, your code would be broken. size_t will make it work correctly everywhere; it is the correct way of representing array indices.


C & C++ allow implementations to trap (raise an exception) on signed integer overflow, since it's undefined behavior.



