

A Quiz About Integers in C - ch0wn
http://blog.regehr.org/archives/721

======
Teckla
Speaking as a long-time C programmer (20+ years), I'd say this quiz -- despite
its flaws -- is a good example of how difficult it really is to write 100%
portable C code.

I often hear developers talking about how C is a simple language, but it may
or may not be, depending on how you define "simple". For example, a written
alphabet with only two letters -- A and B -- is "simple" to grasp (there are
only two letters to learn!), but it would be very difficult, in the real
world, to read and write such a language.

I think C is similar: "simple" in a sense, but very, very complex in another
sense. C could use some massive clean-up, in my opinion. Much of its design
was there so that it could be ported to CPUs which, for a long time now,
haven't existed in the wild in great enough numbers to warrant all the
undefined and implementation-dependent behavior.

The language could be hugely simplified, in the usage sense, if much of the
cruft were jettisoned.

~~~
AngryParsley
That's a huge problem with programming languages: Designers can always add
things, but removing stuff is _hard_. JavaScript is a great example of this.
It still has automatic semicolon insertion, variable hoisting, and a broken
comparator (==). Practically everyone agrees these are bad things, but they
can't be changed. Doing so would break backwards compatibility, and there's a
huge ecosystem of JavaScript programs that would need to be updated.

Also, language users often rebel when language designers make breaking
changes. Python 3 tried to remove cruft, and look how slowly it's been
adopted. Instead of adopting 3.x, people backported the features they wanted
to 2.x.

I would love it if C had less cruft, but when I say that I mean, "I want C
with less cruft, but with the same huge ecosystem of documentation and
libraries and tools and debuggers and profilers that crufty-C has."

~~~
taliesinb
That's why I think the language Go is such a great development. The authors
seem to have added only the minimum set of features that make the language
workable for their initial needs.

As a result, these features are largely orthogonal, and the rules are simple
to state and learn -- even if they at first seem a bit odd (you have to cast
all numeric types to each other before they can interact).

Scala, another modern language I investigated recently, seemed by contrast
like quite a thicket of features... some of which seemed to be there just to
mediate the interaction of other features.

------
klodolph
I was really surprised how many errors there were in the quiz.

Errors in the quiz:

3. (unsigned short)1 > -1: This will evaluate to 1 on systems where
sizeof(short)<sizeof(int) (the unsigned short promotes to signed int, and
1 > -1 holds), and 0 on systems where sizeof(short)==sizeof(int) (it promotes
to unsigned int, so the -1 converts to UINT_MAX). Yes, such systems exist. Old
Cray supercomputers and various DSP processors don't have byte-addressable
memory, only word-addressable, so they make all primitive types a word long.
(See the sketch after this list.)

5. SCHAR_MAX == CHAR_MAX: The person who wrote the quiz even ACKNOWLEDGES
that the quiz is incorrect here and apologizes. This is implementation-
dependent: it is 1 when char is signed and 0 when char is unsigned. Both types
of systems exist. You can use -funsigned-char on GCC, for example.

11: int x; x << 31: This is only defined for some values on platforms where
int has at least 32 bits. There exist systems where int has 16 bits, in which
case this is undefined for all values. Old DOS PCs often used 16-bit ints, and
I believe that ints were 16 bits when C was invented.

12: int x; x << 32: This is only undefined on systems where int has no more
than 32 bits. There exist systems where int has more bits. Do a search for
ILP64 if you wish to hear about such systems.

14: unsigned x; x << 31: Again, this is only defined on systems where int has
at least 32 bits. See #11.

15: unsigned short x; x << 31: This one is tricky. There are four different
cases, depending on the size of int and whether sizeof(short)==sizeof(int).

15a: sizeof(short) == sizeof(int): defined for all x (provided int has at
least 32 bits), since x gets promoted to unsigned int.

15b: sizeof(short) < sizeof(int), int has less than 32 bits: defined for no x.
I don't think any such systems exist.

15c: sizeof(short) < sizeof(int), int has at least 32 bits but less than 32
more than a short: defined for some x. This is the most common, with 16-bit
short and 32-bit int.

15d: sizeof(short) < sizeof(int), int has at least 32 more bits than a short:
defined for all x. This is the uncommon ILP64 system.

18: int x; (short)x + 1: Strictly speaking, converting an int to short when
the value cannot be represented gives an implementation-defined result (or
raises an implementation-defined signal); truncation is only guaranteed to
occur for unsigned types.
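
A minimal sketch of items 3 and 18 under the common 16-bit short / 32-bit int
model (behavior will differ on the word-addressed machines described above):

    #include <stdio.h>
    
    int main(void) {
        /* Item 3: with short narrower than int, (unsigned short)1 is
           promoted to signed int, so this compares 1 > -1 and prints 1.
           With sizeof(short)==sizeof(int) it would promote to unsigned int,
           -1 would convert to UINT_MAX, and it would print 0. */
        printf("%d\n", (unsigned short)1 > -1);
    
        /* Item 18: the value of (short)x is implementation-defined when x
           does not fit in a short; most implementations simply truncate. */
        int x = 0x12345678;
        printf("%d\n", (short)x + 1);   /* 22137 (0x5678 + 1) here, not guaranteed */
        return 0;
    }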

~~~
DCoder
The intro text explicitly defines those datatype sizes:

 _In other words, please answer each question in the context of a C compiler
whose implementation-defined characteristics include two's complement signed
integers, 8-bit chars, 16-bit shorts, and 32-bit ints. The long type is 32
bits on x86, but 64 bits on x86-64 (this is LP64, for those who care about
such things)._

~~~
klodolph
Hm, okay... the intro text apparently disappears as soon as you click start.

~~~
sausagefeet
Specifying those things is really weak anyway. "Here is a portability test on
how the standard defines ints, but oh, assume this implementation."

~~~
pdw
I don't think it's intended as a portability test, but rather an
implementation-defined vs undefined behavior test. And in practice it's safe
to make assumptions about implementation-defined behavior as long as you're
programming for general-purpose computers.

Regehr clarifies this in the comments section: "Regarding signed overflow
being defined or not, compiler developers generally draw a sharp distinction
between undefined behavior and implementation-defined behavior. 32-bit ints,
2's complement, etc. are examples of the latter and signed overflow is an
example of the former. A lot of developers do not draw such a sharp
distinction, which is why I made a point of asking questions about this
issue."

~~~
klodolph
For an example of signed overflow versus unsigned overflow, certain compilers
are known to assume that int loop variables won't overflow. So "for (int i =
0; i != -1; ++i)" is transformed into "for (int i = 0; ; ++i)", since both are
equal in the eyes of the C standard (both will iterate through all non-
negative values that fit in an int, then both will invoke undefined behavior,
so they are the same).

The funny part is that it's often better to use int exactly because of the
undefined behavior on overflow. By signaling to the compiler that you don't
intend to overflow a particular variable, it can optimize appropriately.
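
A minimal sketch of the two loops (whether a given compiler actually performs
the transformation depends on version and flags):

    /* In the signed version, i can only reach -1 by overflowing past
       INT_MAX, which is undefined behavior, so a compiler may delete the
       exit test and produce an infinite loop. In the unsigned version,
       wraparound is defined, so the loop must terminate at UINT_MAX. */
    void signed_version(void) {
        for (int i = 0; i != -1; ++i)
            ;   /* may be compiled as for (;;) */
    }
    
    void unsigned_version(void) {
        for (unsigned i = 0; i != -1u; ++i)
            ;   /* runs UINT_MAX iterations, then exits */
    }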

------
JulianMorrison
Isn't the real answer to these questions "don't do that, you fool"?

~~~
antirez
In theory yes, but there are often cases where you can't avoid mixing signed
and unsigned values. Think about the write() and read() syscalls, which
return a signed type but are often used in contexts where buffer lengths are
expressed as unsigned types.
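
A minimal sketch of how that mixing goes wrong, using POSIX read() (the
function and variable names here are just illustrative):

    #include <stdio.h>    /* perror */
    #include <unistd.h>   /* read, ssize_t, size_t */
    
    void handle_input(int fd, char *buf, size_t len) {
        ssize_t n = read(fd, buf, len);   /* signed result: -1 on error */
    
        /* The naive test `if (n < len)` is a bug: n is implicitly converted
           to size_t, so an error return of -1 becomes SIZE_MAX and the
           comparison is false. Check the signed result first instead. */
        if (n < 0) {
            perror("read");
        } else if ((size_t)n < len) {
            /* short read: handle partial data */
        }
    }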

~~~
JulianMorrison
Then you probably want to mark it with an explicit function named something
like "int32_to_uint32()" which you have defined (and debugged) in one place
that performs the appropriate conversion with sanity checks.

Basically, IMO, C's implicit conversions are a flaw in the spec. A more
sensible language would require every conversion (aside from those involving
un-dimensioned constant data) to be spelled out explicitly. Since you can't
make C sensible, the next best thing is to scrupulously isolate the stupid
parts.
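
A sketch of what such a function might look like; the name int32_to_uint32()
comes from the comment above, and aborting on failure is just one possible
sanity-check policy:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    
    /* Defined (and debugged) in one place; rejects negative values rather
       than silently wrapping them modulo 2^32. */
    uint32_t int32_to_uint32(int32_t x) {
        if (x < 0) {
            fprintf(stderr, "int32_to_uint32: negative value %ld\n", (long)x);
            abort();
        }
        return (uint32_t)x;
    }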

------
matzahboy
I believe that the answers are incorrect. They assume that all true boolean
expressions evaluate to 1, whereas the C standard only guarantees that they
don't evaluate to 0.

Edit: As pointed out by the responses, I am wrong

~~~
aidenn0
Not true: !0 is always 1, and relational and equality operators likewise
yield exactly 0 or 1 in the C standard. Other parts are wrong though (they
assume short is narrower than int, and that int is 32 bits).
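
A quick check of that guarantee (the section numbers are from C99):

    #include <stdio.h>
    
    int main(void) {
        /* 6.5.8p6 (relational), 6.5.9p3 (equality), and 6.5.3.3p5 (!) all
           specify a result of exactly 0 or 1, not merely zero/non-zero. */
        printf("%d %d %d\n", 3 > 2, 3 == 3, !0);   /* prints: 1 1 1 */
        return 0;
    }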

~~~
chc
It's not wrong. If you read the intro, they note that those traits are
implementation-defined, so they say to assume you're using an implementation
with those characteristics.

~~~
nooop
So this is incorrectly titled. It should be "A Quiz About Integers in C on
the common x86 and x64 ABIs."

Unfortunately, it does not educate people about those kinds of issues.

------
prophetjohn
What is the reasoning behind the fact that, say, (INT_MAX + 1) != INT_MIN?

~~~
Someone
Some reasons (the first one being the most important one):

1. To give compilers some leeway when optimizing stuff (for examples, see
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html)

2. To make C usable for those worried about erroneous overflows in their
code.

3. To make it easier to write C compilers for CPUs that trap on integer
overflow.

4. To allow for performant C compilers on CPUs that use one's complement
arithmetic.

~~~
cliffbean
> 1. To give compilers some leeway when optimizing stuff (for examples, see
> <http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html>)

I'm curious about this. See below.

> 2. To make C usable for those worried about erroneous overflows in their
> code.

I think you're saying that the rule allows compilers to implement -fwrapv if
they choose, which sounds reasonable.

> 3. To make it easier to write C compilers for CPUs that trap on integer
> overflow.
>
> 4. To allow for performant C compilers on CPUs that use one's complement
> arithmetic.

I wonder how much these matter nowadays.

For optimization, the blog post you cite describes the following examples:

    - "X+1 > X" to true
    - "X*2/2" to "X"
    - "<= loops" and "int" induction variables

On the first two: presumably these usually only come up after macro expansion
and inlining; however, they're still suspicious. If a function is scaling its
return value and its caller is de-scaling it, that's usually a sign that the
API isn't designed quite right. I'd be curious to know how often these come
up.

Code using "int" induction variables to step through arrays on 64-bit targets
is often sloppy. Such code won't handle very large arrays properly, due to the
limited range of "int", which is a bug that may not be quickly noticed.

And for the "<= loop" itself:

    
    
      for (i = 0; i <= N; ++i) { ... }
    

it's really unlikely that the code is actually intended to be an infinite loop
in the case where N happens to be INT_MAX. Code like this would usually be
clearer written as "i < N + 1" to emphasize that it really does intend to
iterate N+1 times rather than just N times, and it just so happens that this
form makes optimizers happier as well.

Instead of having compiler writers sit around and think up clever ways to
repurpose anachronistic language rules, I might prefer to have them focus on
ways they can help me write better code instead :-).

~~~
jacquesgt
In response to the last paragraph, there's a tension between two different
visions for the C language. One of the visions is what's sometimes called
"high-level assembly language". In that case, undefined behavior lets
implementors choose the behavior that makes the most sense in that situation.
Someone writing software for a DSP is going to have different expectations for
the behavior of integers than someone writing software for a general-purpose
processor. Forcing both implementations to use the exact same semantics is
going to slow both of them down, because extra checks will have to be added to
work around the differences in the underlying hardware.

The other vision for C is a portable systems programming language. When trying
to write portable code, undefined or implementation-defined behavior is a big
problem. I'm glad that compiler writers employ a take-no-prisoners approach
here. People shouldn't be relying on undefined behavior when trying to write
portable code, so compiler writers shouldn't be bound to implement consistent
behavior in those cases. That's especially the case if code ever needs to be
compiled with a different compiler, which might decide to implement different
consistent behavior.

It's also worth noting that in some cases, the exploitation of undefined
behavior doesn't always happen in a single place in a compiler. Instead, it
can be a combination of applying a few different rules in different
optimization passes that produces surprising results.

With that being said, it sure would be nice if the compiler writers figured
out how to give more warnings when undefined behavior is detected. If x + 1 >
x is optimized to true, tell me! If a dereference of a pointer that could be
null leads to potentially dead code being eliminated, tell me about that too!
It's the silently surprising behavior that causes the most problems.
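
Tooling along these lines does exist now: Clang's -fsanitize=undefined
(UBSan, related to the Regehr IOC work cited below) reports many of these
cases at runtime. A tiny demo, assuming a UBSan-capable compiler:

    /* demo.c -- compile with: cc -fsanitize=undefined demo.c */
    #include <limits.h>
    #include <stdio.h>
    
    int main(void) {
        int x = INT_MAX;
        printf("%d\n", x + 1);   /* UBSan flags this as signed integer overflow */
        return 0;
    }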

~~~
cliffbean
Virtually all CPUs today implement two's complement signed integers natively.
Even the few C-capable DSPs that I know about do too (and have separate types
and operations for saturation etc.). The main loss would be the compiler
optimizations, and most of those can be recovered if programmers can avoid a
few pitfalls, such as the ones I discussed.

The problem with undefined behavior is not people ignoring portability. It's
that it's actually really easy to accidentally misuse it. For example,
Regehr's group has found quite a few such bugs in widely ported code written
by smart people [0].

[0] <http://embed.cs.utah.edu/ioc/>

------
cjensen
So in question 5, the C Standard answer is rejected because the x86-specific
answer is required. But then in question 7, the x86 (and all normal archs)
answer is rejected and the C Standard answer is required.

Would it really kill you to decide on the questionnaire rules before writing
the questions? :-)

~~~
delan
There's a difference between 'implementation-defined' and 'undefined'.
Question 5 is implementation-defined behaviour, while question 7 is undefined
behaviour.
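
A minimal illustration of that distinction (not taken from the quiz):
implementation-defined behavior must be chosen and documented by the
compiler, while undefined behavior carries no requirements at all.

    #include <limits.h>
    
    int id_example(void) {
        /* Implementation-defined: the result depends on whether plain char
           is signed, which each implementation must document. */
        return (char)-1 < 0;
    }
    
    int ub_example(int x) {
        /* Undefined when x == INT_MAX: the compiler may assume this
           addition never overflows. */
        return x + 1;
    }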

------
forgotusername
I didn't expect to do well on this (if for no other reason than the lack of
parentheses to demarcate evaluation order!), but could someone please explain
why CHAR_MAX == SCHAR_MAX? My understanding is that no such guarantee exists
for the range of these types.

A few of the questions seemed odd, mixing implementation specifics with the
standard.

Edit: s/width/range/

~~~
sltkr
That tripped me up too, and I'm pretty sure that answer is wrong. It's
implementation-defined, since it's implementation-defined whether 'char' is
signed.

~~~
delan
It is, but this quiz is in the context of x86 and x86-64.

> Also assume that x86 or x86-64 is the target. In other words, please answer
> each question in the context of a C compiler whose implementation-defined
> characteristics include two's complement signed integers, 8-bit chars,
> 16-bit shorts, and 32-bit ints. The long type is 32 bits on x86, but 64 bits
> on x86-64 (this is LP64, for those who care about such things).

------
wging
This might be a naive question.

On Question 12 ("Assume x has type int. Is the expression x<<32..."), why is
this considered an error? Why do we want a compiler to prefer this over "x<<n
means shift x by (n % sizeof(int))"?

~~~
nitrogen
One possible reason: it makes little sense for _1 << 33_ to be less than _1 <<
31_ but not zero (assuming 32-bit ints -- replace 31 and 33 with 63 and 65 for
64-bit ints). In other words, if you keep shifting a number to the left, it
should keep growing, until all the bits are gone.

Another possible reason: CPUs have instructions for implementing << as it now
stands.

~~~
pdw
On x86, the shift instruction only looks at the bottom 5 bits of the shift
distance, so x<<32 is effectively interpreted as x<<0. (I'm sure other
architectures have the same restriction.)
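
A small demonstration of what that masking looks like in practice (the result
is undefined in C, so this merely shows what current x86 hardware tends to
do):

    #include <stdio.h>
    
    int main(void) {
        unsigned x = 1;
        volatile unsigned n = 32;   /* volatile, so a real shift instruction runs */
        /* Undefined behavior in C. On x86 the SHL count is masked to its
           low 5 bits, so this typically prints 1 (i.e. x << 0). */
        printf("%u\n", x << n);
        return 0;
    }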

~~~
astrange
Different architectures use different numbers of bits. I think PPC uses 4 but
I can't remember how to check this at the moment.

~~~
nitrogen
Does that imply that PPC is limited to shifting at most 15 bits at a time?

~~~
astrange
In fact the answer was 5 bits for 32-bit and 7 bits for 64-bit[1]. x86-64
masks the count to 5 bits for 32-bit operands and 6 bits for 64-bit operands
(shld/shrd included).

7 bits is problematic because some code may assume that x << -y is an
optimization for x << (32-y). I thought there was some architecture where
this assumption didn't work on 32-bit either, which led to my post above.

[1]
http://pic.dhe.ibm.com/infocenter/aix/v7r1/topic/com.ibm.aix.aixassem/doc/alangref/alangref.pdf

