
The Performance Cost of Integer Overflow Checking - nicholasjbs
http://danluu.com/integer-overflow/
======
pslam
It's not that unreasonable to ask for _hardware_ support for overflow
checking. There are various ways to support it that make it much cheaper
than naive methods.

For example, ARM has a sticky overflow flag "Q". The idea is that you don't
have to explicitly check the flag after every flag-setting instruction -
instead you execute a bundle, and check if any set the "Q" flag at a point
where you must branch or otherwise accept the output. Sadly, this is only
implemented for a limited number of instructions (e.g. QADD, QSUB), and is pretty
much obsolete. Still - the idea is sound in that it doesn't cost a stall, if
properly scheduled, and greatly reduces the number of branching points.

You can somewhat do the same purely in software with the compiler, but the
lack of a "sticky" flag means it'll need to access - and potentially stall -
at each point where flags would be overwritten by another ALU instruction.
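
A minimal sketch of the software version of a "sticky" flag, assuming a compiler that provides `__builtin_add_overflow` (GCC 5 or later, recent Clang). The function name `sum_sticky` is hypothetical. The flag is only OR-ed into inside the loop and is inspected once, at the point where the result has to be accepted:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sums n values. *overflowed is set if any intermediate addition wrapped.
 * There is no branch on overflow inside the loop; the caller checks the
 * flag once, after the whole bundle of additions has run. */
int32_t sum_sticky(const int32_t *values, size_t n, bool *overflowed)
{
    int32_t total = 0;
    bool sticky = false;
    for (size_t i = 0; i < n; i++) {
        /* The builtin stores the wrapped sum and returns true when the
         * mathematically exact sum did not fit in int32_t. */
        sticky |= __builtin_add_overflow(total, values[i], &total);
    }
    *overflowed = sticky;
    return total;
}
```

Whether the compiler really avoids a flag read per addition depends on the target and optimizer; the sketch only shows the shape of the idea.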

I hear Rust did go down this path for a while but abandoned it. Pretty much
all (non-Rust) code I write these days would SERIOUSLY benefit from global
overflow detection. I would turn it off for only a few critical paths, audit
the hell out of those code snippets, and I bet I would get very close to the
original performance.

~~~
thesz
You wouldn't believe the cost of _hardware_ support.

The addition is usually done using carry-look-ahead scheme [1]. This scheme
has depth of O(log(N)) (N being number of bits). For 64 bits it is k * 6, and
k is about 1.5. So you are looking at ~9 logical operation depth.

The computation of overflow uses bits from both operands and result. You also
have to store that overflow bit somewhere. This means that 1) you have to add
logical operations (usually two) to compute it and 2) you have to lay wires to
store results of computations. Either way you waste timing resources (logical
ops for wider processes, wires for thinner ones).
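
As a software analogue of "uses bits from both operands and result", here is a hedged sketch of how signed-add overflow can be derived from the sign bits alone. It illustrates the logic, not any particular ALU design; `add_overflows_i32` is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when a + b would overflow a signed 32-bit integer.
 * The sum is formed in unsigned arithmetic, where wraparound is defined. */
bool add_overflows_i32(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a;
    uint32_t ub = (uint32_t)b;
    uint32_t sum = ua + ub;
    /* Overflow happened iff both operands share a sign and the result's
     * sign differs from it: bit 31 of (a ^ sum) & (b ^ sum) is set. */
    return (((ua ^ sum) & (ub ^ sum)) >> 31) != 0;
}
```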

For superscalar execution you end with another result dependence to resolve
and mostly ignore.

In the end you add about 5-10% of overhead of clock cycle time _due to
constant checking for overflow_.

I.e., your request will make all computers more expensive to operate.

I have relatively extensive experience in designing hardware for a software
engineer (accelerated video controller for STB, no less, from algorithmic
prototype to tests). My modus operandi in that area is that you should do in
hardware only what you cannot do in software.

Let's compare SPARC and MIPS. SPARC has a status register (the hardware
support for overflow you asked for) and MIPS doesn't. SPARC also has a
complicated register file, but it is out of the control path, which is
register-register addition, believe it or not. SPARC and MIPS are equivalent
otherwise. We estimated the operating frequencies of the SPARC and MIPS ALUs
for a 0.13um process: 400-450MHz for SPARC and about 500MHz for MIPS, without
any low-level tweaking. That is a 10% difference in speed. MIPS would be even
faster if we ditched the ADD/ADDI/SUB/SUBI instructions (add/subtract with
overflow checking).

The same is true for OpenRISC and RISC-V. OpenRISC generates exceptions for
any sneeze that may happen, RISC-V continues. Guess what is easier to develop,
test and will be faster in the end.

Please, do not add to hardware any functionality you really do not need. You
can check for integer overflow statically and generate special code that will
generate exceptions if you cannot prove their absence conclusively. This is
already done for division by zero for MIPS target (check it out, it is amazing
to see the difference between -O0 and -O3); it can be done for integer additions.

[1] [http://en.wikipedia.org/wiki/Carry-lookahead_adder](http://en.wikipedia.org/wiki/Carry-lookahead_adder)

~~~
solarexplorer
A flag register is indeed an additional dependency and may end up in the
critical path. But x86 and ARM already have a flag register. They already pay
the cost.

And you don't need a flag register to check for overflows. Trap on overflow
(like Alpha) will work just as well. The difference is that traps are
infrequent so you don't have to make them fast, just correct. You don't have
to raise the trap in the same cycle that you calculate the integer operation.
You just have to do it before the commit stage. (And the last time I
checked, the Alphas were quite a bit faster than MIPS.)

Of course hardware support implies some overhead and more complexity. I can
see why people would oppose it. But there really is software that would
greatly benefit from hardware support.

~~~
emjaygee
Modern processors can generate different micro-ops depending on whether the
flags are observed. In old non-pipelined/non-speculative/non-rewriting
processors what you said is true but all bets are off in the world of
massively funded x86 processor development.

------
TazeTSchnitzel
The fact many languages don't overflow check by default really saddens me.
Integer overflow is the cause of so many bugs (many user-facing:
[http://www.reddit.com/r/softwaregore/](http://www.reddit.com/r/softwaregore/)),
and yet people keep making new languages which don't check overflow. They
check buffer overruns, they check bounds, and yet not integer overflow. Why?
The supposed performance penalty.

The reckless removal of safety checks in the pursuit of performance would be
considered alarming were it not commonplace.

(Disclaimer: I really, really care about integer overflow for some odd reason,
going so far as to be going through the entire PHP codebase to add big integer
support...)

~~~
SamReidHughes
This is the main reason I don't trust the mob of people working on Rust to get
their act together and make it a good language. You can see some awareness of
the value of integer overflow detection among some individuals that work on
Rust but despite that, here we are without that feature. Swift, on the other
hand, is one language that has put some thought to the matter.

~~~
derefr
It seems to me that, despite all the flexibility people demand, they only
really want (and use) exactly three integer data-types in a language:

1\. machine bit-string of size 8(2^n) bits (e.g. b/w/d/q fields) with no
concept of being a signed/unsigned integer, but where you can apply both
signed and unsigned integer operations to it in an optimized manner. Such code
will be shimmed with sets of clever shifting ops if the target architecture
doesn't actually have a storage type of that width, but you have to be
explicit about what you'll allow (e.g. if an integer is of type "d|2w", then
it will compile fine on a 16-bit architecture (and operate using shimmed ops),
but fail to compile on an 8-bit architecture).

2\. unsigned integer of fixed exactly-specified bitsize, which exactly matches
the semantics of having a value on a modular-arithmetic ring. You go up, it
wraps around at 2^bitsize; you go down, it wraps back. This stays true even if
the code is compiled on an architecture where the exactly-specified bitsize
isn't a clean processor storage-width: a 27-bit ring on a 32-bit processor is
A-OK, and shims will be inserted to enforce the semantics. Shims will also be
inserted if you're targeting an ISA that doesn't _have_ an integer type with
wrapping semantics (e.g. the JVM.)

3\. signed arbitrary-precision integer (i.e. optimal-machine-word-size-minus-
a-tag-bit with bignum promotion checks). The optimizer might convert one of
these to something of type #1 if it can be very, very sure of the value range
you're operating within.

#1 is for doing pointer math in unsafe regions; #2 is for implementing
cryptographic primitives; #3 is for everyone else.
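
To make the "shim" for type #2 concrete, here is a rough sketch of a 27-bit modular ring on a 32-bit unsigned type. The names `RING_BITS`, `ring27_add` and `ring27_sub` are hypothetical; the only assumption is that values passed in are already reduced to 27 bits.

```c
#include <stdint.h>

#define RING_BITS 27u
#define RING_MASK ((UINT32_C(1) << RING_BITS) - 1u)

/* Unsigned arithmetic wraps modulo 2^32; because 2^27 divides 2^32,
 * masking the low 27 bits afterwards yields the result modulo 2^27. */
uint32_t ring27_add(uint32_t a, uint32_t b) { return (a + b) & RING_MASK; }
uint32_t ring27_sub(uint32_t a, uint32_t b) { return (a - b) & RING_MASK; }
```

Going up past `RING_MASK` wraps to 0 and going below 0 wraps back to `RING_MASK`, which is the behaviour described for #2.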

(There's also a variant on #1—let's call it #1A—which is a fixed-size array of
#1s you can repeat signed/unsigned integer transformations over, where this
will generate optimized SIMD code. This would be the "buffer" type for wire
protocol packing/unpacking, and also the backing store for non-sparse matrix
ADTs, bloom filters, etc.)

The only place integer overflow raising an _exception_ would make sense, to
me, is if you want some sort of hybrid type between #1 and #3: a value that
pretends to be arbitrary-precision, but in fact only has a constant-size
bitstring to operate within. I could see the use of this in something like
Cap'n Proto, where you're keeping wire-encoded integer bitstrings (#1s from a
#1A) around and pretending they're fully-functional integers (#3) for the sake
of zero-copy—but do you really gain that much by not letting a data structure
be recreated resized on the stack, if you're not also throwing down
optimizations like intrusive lists in fixed arenas?

~~~
nkurz
I have a small nit about #2 that doesn't affect your main argument. Similar to
the 'overflow' flag the processor automatically sets to signal signed
overflow, the processor flag for 'carry' is already being set by the processor
when you add or subtract unsigned numbers that overflow. There is no separate
signed or unsigned addition in x86/x64 assembly --- the processor just sets
the flags for both cases, and they can either be used or ignored.

This is the reason that overflow checking can be so inexpensive --- no
additional checking is necessary, just a conditional branch on an existing
flag. Correctly predicted branches not taken are very inexpensive, and this
branch will almost always be correctly predicted. The issue is that standard C
doesn't allow any direct defined way to make use of the overflow flag. Instead
you have to write something awkward and hope the compiler optimizes it down
for you.
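
For illustration, this is roughly what the "something awkward" looks like in portable C: the overflow is ruled out before the addition, since the flag cannot be inspected afterwards. `checked_add_int` is a hypothetical helper name.

```c
#include <limits.h>
#include <stdbool.h>

/* Stores a + b in *out and returns true, or returns false (leaving *out
 * untouched) when the exact sum would not fit in an int. */
bool checked_add_int(int a, int b, int *out)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        return false;
    }
    *out = a + b;
    return true;
}
```

Whether this compiles down to an add followed by a single branch on the overflow flag is up to the compiler.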

~~~
TazeTSchnitzel
Irritatingly, GCC lacks a "branch on overflow" intrinsic or "add and branch on
overflow" intrinsic, forcing you to write inline asm to do fast overflow
checking.
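
For later readers: GCC 5 and newer (and recent Clang) provide `__builtin_add_overflow`, which covers this case without inline asm. A minimal sketch, assuming such a compiler; `add_or_abort` is a hypothetical wrapper name:

```c
#include <stdio.h>
#include <stdlib.h>

/* Returns a + b, or aborts the program if the sum does not fit in an int. */
int add_or_abort(int a, int b)
{
    int result;
    if (__builtin_add_overflow(a, b, &result)) {
        fputs("integer overflow in add_or_abort\n", stderr);
        abort();
    }
    return result;
}
```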

~~~
derefr
Kind of makes sense, though. Since the C standard says that integer overflow
is undefined, you can think of the "C abstract machine" as being built on an
assumption of an underlying architecture that provides magic fixed-width-yet-
arbitrary-precision integers (and/or an assumption of omniscient programmers
who always know what their inputs will be and never let math happen if it
would result in overflow.) Basically, you can't talk about overflow in C, or
even to a C compiler, because as soon as you make the fact that overflow is
happening explicit, you've stepped outside the C abstract machine.

It's like trying to talk to Haskell about explicit thunks, or Lisp about
stack-vs-heap allocation. The language encapsulates the concept away from you,
so if you wrench it back, you're no longer quite writing the language; you're
writing a union of it and something else.

~~~
marvy
Yes, but in practice, that's what people do: write in a language more powerful
than the standard gives them. You kind of have to if you want to get anything
done without driving yourself crazy. For example:

    long mega_byte = 1024 * 1024 * 8; // bits per MB
    long mega_bit = 1000 * 1000;      // bits per megabit

If you follow the C90 standard, this code has all sorts of issues. First, //
comments are not allowed. Second, identifiers are only guaranteed to have six
significant characters, so the above two variables might actually be the same
variable. Finally, ints are only guaranteed to be 2 bytes, so the
multiplication may overflow and the program is undefined. (Assigning to longs
doesn't help: the multiplication already happened.)
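
One way to sidestep the last issue, sketched under the assumption that only the standard's minimum guarantees hold (`long` is at least 32 bits): give the operands type `long` so the multiplication itself happens in `long`. The function names are hypothetical.

```c
/* The L suffix makes every operand a long, so the products are computed
 * in long (at least 32 bits) instead of int (possibly only 16 bits). */
long bits_per_megabyte(void) { return 1024L * 1024L * 8L; }
long bits_per_megabit(void)  { return 1000L * 1000L; }
```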

------
nkurz
I just did my own comparison using bzip2
([http://www.bzip.org/](http://www.bzip.org/)), comparing both Clang 3.4 and
GCC 4.9 with -O3 -fsanitize=signed-integer-overflow. I omitted
'unsigned-integer-overflow' because my version of GCC didn't support it. ICC is
shown only for the non-sanitized version, as it doesn't support the automatic
overflow checking.

I did a smaller file than Dan because I didn't have the patience to test with
the 1GB file. Testing was Linux on Haswell, cycle and instruction counts are
coming from 'perf stat' and are in billions except for the last column,
(instructions/cycle). Not shown here is the number of mispredicted branches,
which as expected does not increase for adding the checks since they are all
well predicted.

    
    
                        cycles instructions branches  ipc
      clang-O3:          24.4      43.8      6.3      1.79
      clang-O3-sanitize: 26.5      54.8      9.3      2.06
      gcc-O3:            23.0      42.5      6.2      1.85
      gcc-O3-sanitize:   24.6      47.3      7.6      1.93
      icc-O3:            23.8      41.6      6.5      1.75
    

For Clang, adding signed integer overflow checking adds about 25% to the count
of instructions executed, but only about 9% to the runtime. For GCC, it is
about 11% more instructions and about a 7% increase in execution time. Agreeing
with this, the instructions per cycle ratio goes up more for Clang than for
GCC, but in both cases there is a significant increase showing that each check
costs less than a cycle.
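
The relative increases can be recomputed from the table. A small helper for that arithmetic (the name `percent_increase` is hypothetical):

```c
/* Relative increase from base to with_checks, in percent. */
double percent_increase(double base, double with_checks)
{
    return (with_checks - base) / base * 100.0;
}
```

Applied to the rows above, this gives roughly 25.1% more instructions and 8.6% more cycles for Clang (43.8 to 54.8, 24.4 to 26.5), and roughly 11.3% more instructions and 7.0% more cycles for GCC (42.5 to 47.3, 23.0 to 24.6).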

I'm somewhat confused as to why the increase in instructions executed is so
much greater than the increase in the count of branch instructions. This would
seem to confirm Dan's experience with poor generated code quality, but looking
at the tight loops with 'perf record', both GCC and Clang appear to be
generating reasonable code. Both sanitized and normal builds spill to the stack
much more than seems necessary, but this doesn't seem to be because of the
overflow checking. I'd guess the extra instructions are just by chance because
of how the 'jo' checks were laid out.

------
pbsd
GCC 4.9 does have `-fsanitize=signed-integer-overflow`, and it generates
pretty good code for that toy function:

    
    
        f(int, int):
            mov eax, edi
            add eax, esi
            jo  .L8
            ret
        .L8:
            ...

~~~
Marat_Dukhan
Older GCC versions have the `-ftrapv` option, which checks overflow for signed
addition, subtraction, and multiplication. However, the implementation calls
libgcc functions to do the checks, resulting in non-negligible overhead.

------
Animats
Ah yes, integer overflow checking. Here's iDrive, the commercial backup
service, failing at it:

[http://s24.postimg.org/xakc6eglx/idrivefail2.png](http://s24.postimg.org/xakc6eglx/idrivefail2.png)

iDrive backed up a 3GB file. This overflowed a 32-bit value, and so iDrive
shows the size as -149406.91 KB. This seems to have confused something in the
iDrive system, and now backups won't run. Their tech support people have been
working on this for two days now.

C officially considers unsigned integer overflow to be modular arithmetic. I
consider that a mistake. If you want modular arithmetic, you should write

    
    
        x += n % (2^32)
    

which can be easily optimized to eliminate the divide. Otherwise, it should be
an error.
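
Note that in actual C `^` is XOR, so `2^32` above reads as notation for two to the 32nd power. A hedged sketch of what an explicit "I want modular arithmetic" addition could look like in today's C, using a 64-bit intermediate (`add_mod_2_32` is a hypothetical name):

```c
#include <stdint.h>

/* (x + n) mod 2^32, with the reduction written out explicitly. */
uint32_t add_mod_2_32(uint32_t x, uint32_t n)
{
    uint64_t wide = (uint64_t)x + (uint64_t)n;     /* cannot overflow 64 bits */
    return (uint32_t)(wide % (UINT64_C(1) << 32)); /* reduce modulo 2^32 */
}
```

Since the modulus is a power of two, a compiler can reduce the `%` to a truncation rather than a divide.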

~~~
TazeTSchnitzel
More portable:

    
    
      x += n % (1 << (SIZEOF_LONG * CHAR_BIT));

~~~
jevinskie
Perhaps the C++ standard could add 'operator{+,-,etc}%'.

~~~
TazeTSchnitzel
Along the lines of Swift's & operators?

~~~
jevinskie
Yes, those seem quite nice!

Edit:
[https://developer.apple.com/library/ios/documentation/Swift/...](https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/AdvancedOperators.html)

------
dezgeg
I wonder if these estimations of integer overflow cost have taken into account
all the possible optimizations where the compiler could safely remove overflow
checks. For example, consider:

    
    
        void memzero_range(char* array, int start, int end) {
            for(int i = start; i < end; i++)
                 array[i] = 0;
        }
    

In that example, a sufficiently smart compiler could just check for the `start
> end` case once at the start of the loop, and remove the overflow check from
the body. Perhaps this is something that today's Clang/GCC can already do?

~~~
gsnedders
Rewriting it as follows, with an overflow check:

    
    
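        #include <assert.h>
        #include <limits.h>

        void memzero_range(char* array, int start, int end) {
          int i = start;
          while (i < end) {
            array[i] = 0;
            if (i == INT_MAX) {
              /* assert("overflow!") would always pass: a string literal is non-null */
              assert(0 && "overflow!");
            }
            i++;
          }
        }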
    

Clang does indeed manage to see that the check is equivalent to a start > end
case (with -O1) and hoists it out of the loop. It also then rewrites the body
of the loop to use the LLVM memset intrinsic!

------
sanxiyn
I hope optimization of integer overflow checking improves in LLVM, since now
there is a highly visible project (JavaScriptCore FTL JIT) using the feature.
JavaScript semantics pretty much forces you to do integer overflow checking
fast.

~~~
TazeTSchnitzel
You're referring to the optimisation of storing JS Numbers as integers until
they get too large, presumably?

~~~
sanxiyn
Yes.

~~~
TazeTSchnitzel
Native integer overflow checks can only go so far, there, since there's no
52-bit integer type.

~~~
sanxiyn
What? JavaScript engines check for 32-bit integer overflow and promote to
double.
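
A rough sketch of that check-and-promote pattern in C. This is not any engine's actual code: `js_number` and `js_add_int32` are hypothetical names, and `__builtin_add_overflow` assumes GCC 5 or later, or recent Clang.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical number representation: either an int32 or a double. */
typedef struct {
    bool is_int;
    union { int32_t i; double d; } as;
} js_number;

/* Adds two int32 values; stays on the integer path when the sum fits and
 * falls back to double arithmetic when the int32 addition overflows. */
js_number js_add_int32(int32_t a, int32_t b)
{
    js_number result;
    int32_t sum;
    if (!__builtin_add_overflow(a, b, &sum)) {
        result.is_int = true;
        result.as.i = sum;
    } else {
        result.is_int = false;
        result.as.d = (double)a + (double)b; /* exact: both fit in a double */
    }
    return result;
}
```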

~~~
kevingadd
IIRC v8 promotes to double at 31 bits. But yeah.

~~~
mraleph
On 64bit architectures V8 promotes at 32bits.

(and if we talk about optimizing compilation things are getting more
complicated as there int31, int32 and float64 can all coexist and you can have
31-bit integer stored in a float64 value - if operation was specialized for
floating point values)

------
phkahler
It is the programmer's job to check for overflow. There are a lot of comments
here about having a language check for overflow, but that is not practical.
Specifically, the language doesn't know what to do in the case of overflow.
The best it could possibly do is throw an exception, and I suppose that's what
those commenters want it to do. But it's still up to the programmer to decide
what to do and implement something. Are you going to do that for every math
operation in a program? If not then you're already thinking about where
potential problems may occur. Sometimes you want an overflow. I like to use 16
bit values to represent angles from 0 to 360 degrees - you can add and
subtract these and never need to worry about the wraparound - I don't want
that to throw an exception. If your code is going to overflow when you don't
expect it to, you've got a bug and the language isn't going to be able to save
you.

I feel like noobs hitting an overflow for the first time think the world would
be better if the machine could "just take care of it".

That said, one solution to overflow in some cases is saturating arithmetic.
Supported on some processors, but by no languages that I know of. That may be
worth considering.
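
Both behaviours can be sketched in plain C. The names are hypothetical, and the angle encoding (a full turn mapped onto the 65536 values of a `uint16_t`) is an assumption about how such angles would be stored:

```c
#include <stdint.h>

/* Saturating add: clamps to the int32_t range instead of wrapping. */
int32_t sat_add_i32(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a + (int64_t)b; /* exact in 64 bits */
    if (wide > INT32_MAX) return INT32_MAX;
    if (wide < INT32_MIN) return INT32_MIN;
    return (int32_t)wide;
}

/* Angle add: wraparound is the desired behaviour here. The conversion
 * back to uint16_t reduces the sum modulo 65536, i.e. one full turn. */
uint16_t angle_add(uint16_t a, uint16_t b)
{
    return (uint16_t)(a + b);
}
```

With that encoding, `angle_add(0xC000, 0x8000)` (270 degrees plus 180 degrees) yields `0x4000`, i.e. 90 degrees.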

~~~
aidenn0
Spot the signed integer overflow here that leads to undefined behavior on
targets where "int" is 32-bits:

    
    
        #include <stdint.h>

        uint32_t leToWord(const uint8_t *bytes)
        {
            return bytes[0] |
                   bytes[1] << 8 |
                   bytes[2] << 16 |
                   bytes[3] << 24;
        }

~~~
detrino
This is more about C's integer promotion rules than it is about overflow.
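
To spell it out: each `uint8_t` operand is promoted to `int` before the shift, so `bytes[3] << 24` shifts into the sign bit of a 32-bit `int` whenever the top bit of `bytes[3]` is set. A sketch of one way to avoid that, converting to `uint32_t` before shifting (`le_to_word` is a hypothetical name for the corrected variant):

```c
#include <stdint.h>

/* Assembles a little-endian 32-bit word from four bytes. Each byte is
 * converted to uint32_t first, so every shift is done on an unsigned type. */
uint32_t le_to_word(const uint8_t bytes[4])
{
    return (uint32_t)bytes[0]
         | (uint32_t)bytes[1] << 8
         | (uint32_t)bytes[2] << 16
         | (uint32_t)bytes[3] << 24;
}
```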

------
asgard1024
I wonder why not use interrupts for overflow checking? (Maybe x86 still
doesn't support that - my knowledge of it is very old.) That seems smarter, if
the overflow is really going to be an exceptional case.

I would think that the best approach is to set up a global overflow handler,
driven by an interrupt, and then maybe for cases where you actually want to be
able to overflow, a special C primitive (that would use instructions that
avoid overflow in some way) could be useful.

I am just wondering why the article doesn't mention this possibility at all.

~~~
gsnedders
x86 doesn't support it.

~~~
asgard1024
Ah well. I work on zSeries mainframes, and this architecture does support
interrupts on integer overflows (as well as many other number formats). So I
thought that x86 architecture already got that option too.

------
olliej
I'd be curious about the effect of manually substituting the clang overflow
arithmetic intrinsics on the codegen.

~~~
TazeTSchnitzel
I suspect that'd make it worse, as you've added a new inline call.

------
niche
This calls for BaseN computing; an algorithm that compiles code down to the
most optimal base (minimizing overflow) and then ran it on a quantum baseN
computer; post compilation basing; cross basing; etc

------
lelf
Honestly we should talk about “The Security Cost of No Overflow Checking”
first.

~~~
Rusky
It's good to talk about, but not every article on performance needs to be
prefaced with a talk about security. Being aware of the performance costs of
your tools is just as important in many cases.

Besides, if we're talking about language design and compiler optimizations,
there are plenty of ways to check overflow without resorting to a branch on
every arithmetic operation.

------
dgdsgdsg
fefe?

