
A bit of background on compilers exploiting signed overflow - ingve
https://gist.github.com/rygorous/e0f055bfb74e3d5f0af20690759de5a7
======
CJefferson
One other advantage of signed overflow not listed here -- loop termination.
Consider the following loop:

    
    
        for(int i = A; i <= B; i+=4) ...
    

This loop can be infinite if overflow is defined as wrapping and B is within 4
of the largest representable value. Checking for this is a pain (and extra
code), and in practice (almost) no one would ever mean to write this and get
an infinite loop when B is close to the top of the integer range.

Special-case checking for this is a pain for various kinds of loop
unrolling / peeling, and also for removing empty loops that don't contain any
code (which often comes up in practice with C++ templates).

~~~
dsfuoi
It's just one line. Put this before the loop if the code logically shouldn't
trigger this.

    
    
      assert( B < B_TYPE_MAX - 4 );
    

Or use an if statement if it could trigger at runtime.

Also the code is clearer on intent.

~~~
haberman
Can assert() actually function this way? Can you use assert to tell the
optimizer it can assume something is true?

I've noticed that memcpy(x, y, 4) on x86 can generate very efficient code
(register move), but on ARM it expands to something much more verbose because
the addresses might not be aligned.

Could this effectively function as a way of promising to the compiler that the
addresses are aligned?

    
    
        #include <assert.h>
        #include <stdint.h>
        #include <string.h>
        
        void move4_aligned(void *dst, const void *src) {
          assert(((uintptr_t)dst & 0x3) == 0);
          assert(((uintptr_t)src & 0x3) == 0);
          memcpy(dst, src, 4);
        }

~~~
quotemstr
> Can assert() actually function this way? Can you use assert to tell the
> optimizer it can assume something is true?

Trivially.

    
    
      #ifdef NDEBUG
      # define assert_and_assume(cond) \
          do { if (!(cond)) __builtin_unreachable(); } while (0)
      #else
      # define assert_and_assume(cond) assert(cond)
      #endif

~~~
haberman
Interesting! This assert_and_assume() seems strictly better than vanilla
assert() for any predicates that don't have side effects. But I guess you have
to be sure that the compiler is able to deduce that there aren't side effects
and feels comfortable optimizing away the predicate in release mode.

------
rav
The biggest takeaway is the last paragraph:

> On most of the machines you're likely to use, "size_t" for loop counters is
> a good idea where signed values aren't required. It's unsigned and generally
> as wide as addresses, so there's normally no extra code for zero/sign
> extends, and the type is standard.

Just work with size_t and be happy!

If you need to iterate through a sequence a[0], a[1], ..., a[n-1] in backwards
order, the while-condition "i >= 0" of course won't work with size_t (which is
unsigned); instead, use the following idiom:

    
    
        for (size_t i = n; i--;) f(a[i]);
    

The above code will call f(a[n-1]), f(a[n-2]), ..., f(a[0]) with no issue,
thanks to the post-decrement operator.

~~~
fdej
Quick: how would you implement for (i = n; i >= 0; i -= k) with unsigned
integers?

I just use a signed pointer-sized type for all sizes and counters, and stay
happy...

~~~
detrino
Your loop:

    
    
        void TestSigned(int min, int max, int step) {
          for (int i = max; i >= min; i -= step) F(i);
        }
    

Becomes this with unsigned:

    
    
        void TestUnsigned(unsigned min, unsigned max, unsigned step) {
          while (true) {
            F(max);
            if (max < min + step) break;
            max -= step;
          }
        }
    

Now if I want to transform my loop to operate on pointers or iterators, the
transformation is trivial:

    
    
        void TestUnsigned(unsigned *min, unsigned *max, unsigned step) {
          while (true) {
            F(*max);
            if (max < min + step) break;
            max -= step;
          }
        }
    

Quick: Do the same for yours.

~~~
dsfuoi
It looks easy but it isn't.

The second example causes unexpected behavior (wraparound, and probably UB
later) if min + step > UINT_MAX, i.e. if the addition min + step wraps.

The last example causes undefined behavior if max - min < step - 1: forming
the pointer min + step already steps past one-past-the-end of the array.

------
skybrian
It sounds like making the width of int implementation-specific in C was a
failure. It was supposed to allow better performance (letting it be set to
the register width), but instead we get the worst of both worlds: portable
code can't depend on its size, yet compilers can't practically change the
size either.

~~~
cjensen
There's an added complication: in C, there are just five[1] names available
for signed values: signed char, short, int, long, and long long. Additionally,
the number of bits assigned to each of those must be at least as large as for
the previous name.

Suppose your C implementation wants to have signed types for 8-bit, 16-bit,
and 32-bit values. If int is 64 bits, then you literally don't have names left
for those smaller signed types. So in practice, int can be no more than 32
bits.

The solution is to always use size_t as a count or size of anything. Use
ptrdiff_t for a difference between size_t values, or a difference between
pointer values.

An alternate practice (one I subscribe to) is to use unsigned as your default
type instead of int. In this practice, 'int' means a variable that can be
negative. In real programs, the vast majority of variables never contain
negative values.

[1] char can be either signed or unsigned, so let's ignore it for this. Also,
let's ignore the new char type names introduced in recent C++.

------
userbinator
The other thing to consider is that "x86-64" or "amd64" is not really 64-bit.
Operand sizes are 32 bits by default (except the addressing modes, which leads
to this mess), and you can only use 64-bit widths with an extra prefix byte
--- on each instruction. There are no 64-bit equivalents of some instructions.
Compared to the 16-to-32-bit transition that happened with the 386, the
32-to-64 one just does not seem anywhere near as well thought out.

Then again, it makes sense to leave operand sizes at 32 bits, as otherwise
code would be twice as big, with the accompanying consequences for caching
etc.; in fact I'd be willing to bet that 4 billion representable values is
more than a lot of applications would ever need to put in an int. Extending
that to 64 bits is really for addressing and the other cases where that
exponentially larger range is truly needed.

------
ryuuchin
How much of an issue is this really though?

Isn't the worst case a 1 cycle penalty for using a 32-bit int on x86-64 in the
manner described in the article?

~~~
witty_username
But that's repeated in the loop, so it looks like around a 10-20% performance
difference. Quite a bit for something so non-obvious.

~~~
ryuuchin
Perhaps in a really tight loop, but the counter is still going to be several
iterations ahead thanks to the out-of-order capabilities of the CPU. You may
also have other things in play, such as the micro-op cache, depending on the
code size of the loop.

------
askee
I find the wording somewhat peculiar: 'the resulting address will change
dramatically, by (4 - 2^(32+2)). In 32-bit mode (all addresses modulo 2^32),
this is just an increase by 4 as usual; the "wraparound" part is a multiple of
2^32 and hence invisible'

Why would it matter that the address changes "dramatically"? Even if it didn't
(and only changed by, say, 16 bytes), you'd still have to deal with it. The
problem is that int overflows at 2^31 while long/a larger type doesn't. To
cope with the overflow you have to "simulate" it in the larger type as well
(or at least reproduce what the smaller type does), which is achieved via a
sign-extend.

------
nkurz
I've recently been diving into low-level loop optimization on current Intel
x64 processors. It's a rabbit hole, but makes me think there are considerably
better optimizations that compilers could be making. Here are some thoughts on
this "naive" loop example as they would execute on Skylake, with reference to
sections in the (excellent) Intel Architectures Reference Manual:
[http://www.intel.com/content/dam/www/public/us/en/documents/...](http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-
optimization-manual.pdf)

First, since the µops have already been decoded, instruction length doesn't
really matter. Also, the limiting factor in this loop is not the execution
ports but the "renamer": while we can supply 6 µops per cycle [2.1.1] and
dispatch and execute up to 8 µops per cycle [2.2.2] (limited by instruction
mix), only 4 (possibly fused) µops can be moved from the Decoded Instruction
Queue to the Scheduler per cycle [2.3.3.1].

Using "xor %reg -> %reg" for zeroing a register is considered a special "zero
idiom" (along with several other forms), and does not need an execution port
[2.3.3.1]. It is handled in the renamer with zero latency. It does however
take up one of the 4 available renamer slots, but since it's outside the loop
here this doesn't matter.

The "test %ecx -> %ecx; jng done" sequence is "macro-fused" into a single µop.
It's a single µop both for the renamer and for execution. These instructions
need to be consecutive for this to happen. Again, this is outside the loop, so
no real difference.

At the top of the loop, "movsxd %ebx -> %rdx" is a single µop, 1 cycle
latency, and can execute on any of the 4 arithmetic ports. While it can't be
substituted here, it's interesting to note that if sign extension is not
needed, "mov %ebx -> %edx" and "movzx %bl -> %edx" are specially treated as a
"zero latency move": 0 latency, no execution port, but do use a renamer slot.

The add from memory "add [%rsi+%rdx*4] -> %eax" is internally treated as a
single "micro-fused load op" when stored in the Decoded Instruction Queue. But
because it uses an addressing mode with an index register, it is "un-
laminated" as it is transferred to the scheduler [2.3.2.4], and thus uses two
renamer slots. Since we only have 4 available slots per cycle, this can be a
major reason to prefer a non-indexed address like "add [reg+offset] -> eax" as
in his unrolled example.

"inc %ebx" can perform poorly as an instruction, since it updates only part of
the flag register (leaves "carry" unmodified). While it doesn't stall post-
Sandy Bridge, in some cases it may require an extra µop to merge the flag
register. Intel suggests avoiding it, and using "sub $1 -> %ebx" instead
[3.5.2.6].

The "cmp %ecx, %ebx; jl lp" pair are fused and treated as a single µop. This
is good, and loop termination should almost always keep these instructions
consecutive for this reason. Even better would be to arrange the loop so that
one can update the loop counter and then check the flags resulting from the
arithmetic. The usual way to do this is to count down to zero instead of up to
a maximum: "sub $1 -> %ebx; jnz lp". Possibly worth noting is that fusion with
add/sub does not occur with jno (jump not overflow) or jns (jump not sign), so
terminating at zero is usually the best choice.

(Note that I intend this as supplementary information to the article, and not
a criticism. While I agree with CJefferson and cokernel_hacker that loop
termination is a more important reason for undefined overflow, I think the
technical details in the article are correct.)

------
akkartik
There's been a lot of discussion about undefined behavior on HN recently, and
I'm gradually starting to question whether the conventional approach to
compilers might be a giant dead end. In this case, our conventional languages
provide no
way for a programmer to say "I care about overflow". Because face it, the vast
majority of programs a compiler sees are written extremely sloppily. Most of
the time there's this huge disconnect between the amount of effort the
programmer and the ghost of the compiler writer are putting into considering
such situations.

Now, most of the time this disconnect is a good thing. Division of labor
ensures that the work of the compiler writer is amortized across many many
programmers. But the discussion around undefined behavior is a sign that the
load on the compiler writer has gotten too onerous, to the point that
everybody is paying a cost (in reduced performance, in increased compiler
complexity leading to reduced compiler stability, in compilers
antagonistically interpreting standards like lawyers) for a level of
correctness along one narrow dimension that only a tiny fraction of users
actually care about.

What if we instead said, "compilers will never ever consider overflow" and let
those who care about overflow fend for themselves in the source language?
Among other things that would allow integers to always be the word size of the
computer they're running on.

(My interest in these questions stems from trying to program in an idealized
assembly language for the past year:
[http://github.com/akkartik/mu](http://github.com/akkartik/mu). After writing
10kLoC of assembly to build a simple text editor
([http://akkartik.name/post/mu](http://akkartik.name/post/mu)) I'm surprised
to find it quite ergonomic. High level languages today seem to provide three
kinds of benefits over traditional/ugly assembly: expressiveness (e.g. nested
expressions, classes), safety (e.g. type checking) and automation (e.g.
garbage collection). An idealized assembly language gives up some
expressiveness, but doesn't seem to affect the other benefits. As such, it's
at an extreme end of the trade-off I describe. I'm hoping my experience will
largely translate even when Mu eventually implements a high-level language.)

 _Edit 21 minutes later_ : For reference, here's how I've been thinking about
undefined behavior while building Mu:
[https://github.com/akkartik/mu/blob/5dd82a8a70/001help.cc#L4...](https://github.com/akkartik/mu/blob/5dd82a8a70/001help.cc#L40).
Comments most welcome.

~~~
mempko
With every additional compiler check, you reduce the set of possible programs.
It would be interesting if the checks could be easily programmatically defined
and inserted by those who care. I think Perl 6 is probably the best modern
compiler that allows this.

I agree: it should not be the compiler's job to decide the set of possible
programs. The hardware guys figured this out long ago. After all, CPU
registers are untyped.

~~~
akkartik
Can you elaborate on what checks Perl6 permits programmers to ask for?

~~~
raiph
Aiui you can have any checks you want.

Aiui, simplifying, every operation is a function call so you just write a
function body that corresponds to what you want the operation to be with the
checks you want.

------
pepijndevos
Are there any examples of how big projects deal with this? Like using int64_t
and friends.

I'm also surprised that there does not seem to be an option to tell GCC to use
64-bit ints if this really produces better code.

[https://gcc.gnu.org/onlinedocs/gcc-4.5.3/gcc/i386-and-x86_00...](https://gcc.gnu.org/onlinedocs/gcc-4.5.3/gcc/i386-and-x86_002d64-Options.html)

~~~
GlitchMr
If you could change the meaning of int to be a 64-bit type, then your code's
ABI
([https://en.wikipedia.org/wiki/Application_binary_interface](https://en.wikipedia.org/wiki/Application_binary_interface))
would be essentially incompatible, and you wouldn't be able to use code (such
as the standard library, or other headers provided by operating system
libraries) that was compiled by a compiler where int was 32-bit, as, for
instance, the wrong structure fields would be read.

If you tried to "solve" that by assuming headers use the 32-bit int
definition, then you would break calls into your own code, as you would use
the 32-bit int definition provided in the headers, while your code was
compiled with the 64-bit int convention.

It's probably just easier to define your own int-like type, say, a short
alias for int_fast32_t.

