
Pitfalls in C and C++: Unsigned Types (2011) - steveklabnik
http://www.soundsoftware.ac.uk/c-pitfall-unsigned
======
Mindless2112

      > “Unsigned has a bigger range.”
      > Only a bit bigger.
    

So... "only" twice as big.

I'm not particularly fond of the trend toward avoiding unsigned integers (in
particular the lack of unsigned types in Java and array lengths being signed
in C#).

A good programmer should be capable of determining whether to use a signed or
unsigned type; if it's not easy to decide, that indicates a design flaw to me.

------
limmeau
So signeds are better because they overflow differently?

If you program C like there's no overflow, it doesn't matter much whether you
mis-index the next array with 4294967295 or -1. There's no substitute for
overflow-vigilance, and particularly not rules like "such-and-such type is bad
and such-and-such type is better".

(BTW: an ex-boss of mine claimed that signed ints were evil and insisted on
using unsigned for everything, but that's a different story, and I'm glad
he's an ex-boss.)

~~~
oleganza
OP means to say that when dealing with small enough numbers, an unsigned
integer may overflow (wrapping around at zero), while a signed one won't.
With huge numbers they both overflow, but we usually don't have them.
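
A minimal illustration (assuming the usual 32-bit unsigned int):

      #include <stdio.h>

      int main(void) {
          unsigned int u = 0;
          int s = 0;
          printf("%u\n", u - 1);  /* wraps at zero: prints 4294967295 */
          printf("%d\n", s - 1);  /* stays meaningful: prints -1 */
          return 0;
      }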

------
pslam
I think the opposite - signed types are a big problem in statically typed
languages and the cause of countless bugs I've had to deal with (in other
people's code) for most of my career.

I think most languages would benefit from unsigned types being the default,
and arithmetic overflow being a hard error unless otherwise decorated.
Signedness and lenient overflow promote laziness. Array indexes don't make
sense as signed, yet most people prefer to iterate arrays with a signed type,
e.g. most commonly:

      for (int i = 0; i < 10; ++i) buf[i] = 0;

Far too many people rely on signed integer overflow working the way the
underlying machine handles it - but signed overflow is undefined behavior in
the C spec, and that's not how a compiler handles it either.
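
For example (a sketch; optimizing compilers are known to do exactly this
at -O2):

      /* With signed int, the compiler may assume x + 1 never overflows
         and fold the whole function to "return 1". What the underlying
         machine would do on wraparound never enters into it. */
      int always_true(int x) { return x + 1 > x; }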

There are countless security issues I've had to fix related to signed types in
what is supposed to be secure code. These would not have occurred if the
author was forced to use an unsigned type, and had to consider the extreme
limits of the values it can take on. Subtle things such as the addressable
limit of memory being naturally unsigned, but buffer sizes being passed as
signed, can cause easy exploits, and are stupidly commonplace.
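
A sketch of that pattern (the function and names here are hypothetical):

      #include <string.h>

      /* len arrives from untrusted input as a signed int. */
      void copy_packet(char *dst, const char *src, int len) {
          if (len > 64) return;   /* the bounds check passes for negative len */
          memcpy(dst, src, len);  /* len converts to size_t: a huge copy */
      }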

~~~
elbee
On the other hand unsigned types are a huge pain if you want to iterate
through an array backwards because you have to use subtraction. A lot of
people end up with something like this:

      unsigned int i = strlen(s) - 1;
      for (; i >= 0; --i) { // BUGBUG
          if (s[i] == '.') {
              break;
          }
      }

(Yes, you can make it work, but it is very error-prone when people try).

~~~
pslam
Simple transformation:

      for (unsigned i = strlen(s); i > 0; --i) {
        if (s[i - 1] == '.') break;
      }

Easy to see that s[i-1] does not underflow the array due to the loop
invariant i > 0. It's usually easy to convert signed iteration code to
unsigned, and when I see this, I can tell the author spent the time to
consider what happens at the limits of their inputs.

------
betterunix
The problem is not unsigned types. The problem is _overflow_ and _underflow_
conditions not being reported.

Compare this to, say, the "declarations-as-assertions" extension in
CMUCL/SBCL: when you declare an unsigned type, underflow conditions will be
reported:

      * (defun test (x)
          (the (integer 0) (1- x)))
      * (test 0)

      debugger invoked on a TYPE-ERROR in thread
      #<THREAD "main thread" RUNNING {1002B2AF23}>:
        The value -1 is not of type UNSIGNED-BYTE.

~~~
dllthomas
Sure. Of course, at the same time, sometimes overflow and underflow are fine
and should be ignored (when treating unsigned ints as the modular ring they
actually are). This is of course not a criticism of CL, which I understand
does not check these by default and so is clearly capable of not checking -
just a note to anyone who might take more from your comment than was there.
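
For example, hashing is one place where the wraparound is the point; a
sketch using the well-known FNV-1a constants:

      #include <stddef.h>
      #include <stdint.h>

      /* Unsigned overflow is defined as arithmetic mod 2^32, so the
         repeated multiply is free modular mixing, with no UB involved. */
      static uint32_t fnv1a(const unsigned char *s, size_t n) {
          uint32_t h = 2166136261u;        /* FNV offset basis */
          for (size_t i = 0; i < n; ++i) {
              h = (h ^ s[i]) * 16777619u;  /* FNV prime; wraps mod 2^32 */
          }
          return h;
      }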

~~~
betterunix
Actually, CL goes a step further, which I like: by default, integer arithmetic
is _arbitrary precision_. There are no overflows, unless you count running out
of RAM to store your numbers as an overflow. Modular arithmetic must be
explicitly specified, leaving no room for surprises:

      (mod (+ x y) N)

The reason I prefer this behavior to C's is that it is more natural. Most of
the time when we think of integer arithmetic, we do not assume that it is
modulo N; we assume things like "x - 1 < x" will always be true. Forcing
unexpected semantics on programmers is a recipe for bugs, and integer
underflow/overflow bugs are not uncommon in C programs.

~~~
dllthomas
I think that'd probably be inappropriate for C's use cases, but I think
generally it's a good move. I _do_ like encoding the modular behavior in the
type, FWIW; specifying a mod manually everywhere it goes leads to unreadable
code. I agree, however, that it should probably not be the default type.
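
Something like this C++ sketch is what I have in mind (hypothetical; it
assumes N is small enough that the intermediate product fits in 64 bits):

      #include <cstdint>

      // "Modular behavior in the type": arithmetic on Mod<N> values is
      // always reduced mod N, so no call site can forget the % N.
      template <std::uint64_t N>
      struct Mod {
          std::uint64_t v;  // invariant: v < N
          friend Mod operator+(Mod a, Mod b) { return {(a.v + b.v) % N}; }
          friend Mod operator*(Mod a, Mod b) { return {(a.v * b.v) % N}; }
      };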

------
detrino
I think a lot of the pitfalls with unsigned types relate to the for loop
construct being biased towards counting up. Consider the following code: the
counting-down for loop is really ugly and unintuitive, but the counting-down
while loop is symmetrical to the counting-up one. Maybe C/C++ need a
counterpart to the for loop where the loop body executes last?

        #include <iostream>
    
        int main()
        {
            unsigned I;
    
            // 0 to 4, for loop
            for (I = 0; I != 5; ++I) std::cout << I << "\n";
            std::cout << "\n";
    
            // 4 to 0, for loop
            for (I = 5; I-- != 0;) std::cout << I << "\n";
            std::cout << "\n";
    
            // 0 to 4, while loop
            I = 0;
            while (I != 5)
            {
                std::cout << I << "\n";
                ++I;
            }
            std::cout << "\n";
    
            // 4 to 0, while loop
            I = 5;
            while (I != 0)
            {
                --I;
                std::cout << I << "\n";
            }
            std::cout << "\n";
        }

------
ggchappell
When the rubber meets the road, and I need to write some actual code, I guess
I'm not sure what is being recommended here. Say I need to iterate through the
indices of a vector--his "strongest counterargument". He suggests "casting the
unsigneds to signed integers". Am I supposed to do this?

      for (ptrdiff_t i = 0; i != ptrdiff_t(v.size()); ++i)

I think the real answer here is for something like Python's itertools to be in
the C++ Standard Library, so that we _never_ need to do anything remotely like
the code above. But failing that, it isn't clear to me what the "best
practices" version of the above would be.

------
dbrower
I take this as an overgeneralization from the bugs the original author runs
into most often, which may be domain specific -- overflow in signal processing
being endemic to the problem.

In many other cases, one of the common bugs is unexpected sign extension when
a type is promoted to a larger one -- the most common being extension of a
signed character to an integer, which can result in some wildly negative
values.
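
A minimal illustration, assuming a platform where plain char is signed:

      #include <stdio.h>

      int main(void) {
          char c = '\xFF';
          unsigned char u = 0xFF;
          int from_signed = c;    /* sign-extends to -1 */
          int from_unsigned = u;  /* zero-extends to 255 */
          printf("%d %d\n", from_signed, from_unsigned);
          return 0;
      }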

Those who have been hit by _that_ frequently will prefer unsigned types.

As with most things, your mileage may vary.

-dB

------
dllthomas
Use unsigned when indexing arrays, for two reasons:

First, arr[(unsigned)-1] will segfault while arr[-1] will likely cause harder
to debug non-local problems.

Second, mod in C and C++ is broken(ish) on signed values: negative % positive
returns a negative number. This means in order to use it as an index you need
to check the value and add the table size post-mod if you want to wrap an
index into an array (as you might for a hash table or ring buffer).
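
A sketch of that check-and-add fixup (wrap_index is a hypothetical helper;
it assumes n fits in an int):

      #include <stddef.h>

      /* Wrap a possibly negative offset into [0, n). */
      static size_t wrap_index(int offset, size_t n) {
          int m = offset % (int)n;  /* C truncates toward zero: -3 % 8 == -3 */
          if (m < 0) m += (int)n;   /* the post-mod fixup */
          return (size_t)m;
      }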

~~~
Someone
I don't see why arr[(unsigned)-1] is guaranteed to segfault. Care to explain?

Isn't it equivalent to

      *(arr + (unsigned)-1)

which should be equivalent to

      *(arr + UINT_MAX)?

On typical PC hardware with sizeof(int) == sizeof(char *), that looks like
an undefined behavior case to me (maybe barring border conditions; I don't
feel like thinking of all the edge cases). Most C compilers choose the
fastest possible path for undefined behavior cases, so I would expect this
(typically; again, there will be border cases) to be equivalent to arr[-1].

(See also [http://stackoverflow.com/questions/2578455/pointer-arithmeti...](http://stackoverflow.com/questions/2578455/pointer-arithmetic-signed-unsigned-conversions))

~~~
dllthomas
"Typical PC hardware", at least in my recent experience, has sizeof(int) == 4,
sizeof(char*) == 8 and so the reason they're different should be apparent. I
believe you are correct that there is no difference when they are equal.
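
Roughly, on such an LP64 system (a sketch; strictly speaking even forming
these out-of-bounds pointers is undefined, they just tend to fail
differently in practice):

      char arr[16];
      char *a = arr + (unsigned)-1;  /* arr + 0xFFFFFFFF: ~4 GiB past arr,
                                        almost certainly unmapped: segfault */
      char *b = arr - 1;             /* one byte before arr: nearby memory,
                                        often silently readable */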

------
codehero
I've definitely been bitten by this kind of bug before, but I did not come to
this author's conclusion. What I learned is that each step of a calculation
has lower and upper bounds, regardless of int type. You always make sure your
inputs, intermediates and products fall within these bounds. Consequently, the
author's advice becomes irrelevant.
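
What I mean, as a sketch (checked_mul is a hypothetical helper; it assumes
non-negative inputs for simplicity):

      #include <limits.h>

      /* Guard an intermediate product against exceeding its bound. */
      int checked_mul(int a, int b, int *out) {
          if (a != 0 && b > INT_MAX / a) return 0;  /* would overflow */
          *out = a * b;
          return 1;
      }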

------
dnautics
If you are implementing fixed-point arithmetic, using uints is probably a
good idea, too.
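
For instance, a sketch of an unsigned Q16.16 multiply (assuming non-negative
values throughout):

      #include <stdint.h>

      /* Q16.16: 16 integer bits, 16 fractional bits in a uint32_t. */
      static uint32_t q16_mul(uint32_t a, uint32_t b) {
          /* Widen to 64 bits so the intermediate product cannot
             overflow, then shift back down to the Q16.16 scale. */
          return (uint32_t)(((uint64_t)a * b) >> 16);
      }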

~~~
deletes
How are you going to get negative coordinates then?

