Hacker News new | comments | ask | show | jobs | submit login
The Performance Cost of Integer Overflow Checking (danluu.com)
166 points by nicholasjbs on Dec 18, 2014 | hide | past | web | favorite | 106 comments

It's not that unreasonable to ask for hardware support for overflow checking. There's various ways it can be supported which make it much cheaper than naive methods.

For example, ARM has a sticky overflow flag "Q". The idea is that you don't have to explicitly check the flag after every flag-setting instruction - instead you execute a bundle, and check if any set the "Q" flag at a point where you must branch or otherwise accept the output. Sadly, this is only implemented for a limited number of instructions (e.g QADD QSUB), and pretty much obsolete. Still - the idea is sound in that it doesn't cost a stall, if properly scheduled, and greatly reduces the number of branching points.

You can somewhat do the same purely in software with the compiler, but the lack of a "sticky" flag means it'll need to access - and potentially stall - at each point where flags would be overwritten by another ALU instruction.

I hear Rust did go down this path for a while but abandoned it. Pretty much all (non-Rust) code I write these days would SERIOUSLY benefit from global overflow detection. I would turn it off for only a few critical paths, audit the hell out of those code snippets, and I bet I would get very close to the original performance.

> ARM has a sticky overflow flag "Q".

This is surprisingly expensive to implement fast and correctly in a modern OoO core. The most likely implementation would end up taking Q as a dependency for every qadd/sub, which would be terrible.

> You can somewhat do the same purely in software with the compiler, but the lack of a "sticky" flag means it'll need to access - and potentially stall - at each point where flags would be overwritten by another ALU instruction.

The jump depends on flags, but by definition has no further dependencies. So long as it's untaken and predicted as untaken the cpu can never stall on it.

I'm pretty rusty so excuse me if I'm spouting nonsense, but I think MRS (read register file) has a dependency on the value of Q (so q/add/sub/clear) and operations which OR a new value with Q (q/add/sub [do q |= 0 or q |= 1) have a dependency on operations which change the register file (MSR). Basically qadd and qusb commute on Q. I'm a bit fuzzy on the details, but I think this means that you just have to stall/predict when there's about to be a branch anyway. Please correct me/clarify if I'm wrong.

Yeah you have to predict - so just predict not taken, and treat JO as an uncommon trap. I think that works well in practice.

If you assume that reading (and clearing) the Q bit is infrequent, you can do something much cheaper. Because the bit is sticky, the order in which you set it does not matter. And when you need to read (or to clear) the Q bit, you just block the instruction that wants to read/clear until the OoO core is empty.

An OoO core with speculation has to be able to roll back arbitrary instructions when an earlier instruction causes a visible trap after a later instruction has been executed. Because of this, the core needs to be able to reconstruct the value of Q at any point of execution, even if it's not actually needed then. So the order matters.

Good point. So then instead of updating Q out-of-order you could update at commit. You just need to block any instruction that reads Q until the OoO core is empty. This is _much_ cheaper to implement than adding a dependency on all instructions.

> A sticky overflow flag would be great... "if properly scheduled"

That is the rub right there. Think about what this means. Every ALU operation needs to read whatever register contains this state and write a new value. So every ALU operation becomes serially dependent all all previous operations. You have just totally killed performance on out-of-order processors.

Processor designers work really hard to avoid these kinds of operations. In fact between the Pentium and the Pentium Pro (aka p6) the behavior of condition codes on several x86 instructions were changed from "keep old value" to "always cleared". We couldn't find any code that cared and it was never previously documented.

This code doesn't have to use branches, it could also OR the condition codes to some register after every operation. This would also hurt performance, but only for code that cares instead of all code.

Floating-point flags behave in exactly this "sticky" fashion on every mainstream architecture, and processor designers don't have trouble breaking the apparent serial dependency. It's annoying, but it's well-understood, and there are widely-used techniques that address it.

Itanium style not-a-thing constructs might be interesting for that.

Not-a-thing is basically a tagging scheme where the contents of registers are extended with a bit indicating whether the value is valid or not. Arithmetic operations on a NaT produce a NaT: branches or stores which consume a NaT raise an error. Given arithmetic instructions that generate NaT on overflow, the wrong result will be prevented from affecting the state of the program without explicit tests or jumps.

I'm not sure what the implementation costs would be on the hardware side, though.

The Mill CPU also works like that- they have instructions for wrapping, overflow, and saturated add, and the overflow ones generate viral bad values that only trap on store/branch/etc.

Given that it's generating the same bits as x86, but with the flag stored with the output rather than in a separate register, the hardware cost shouldn't be more than one more level to select the actual result.

This is also how they handle illegal reads, which (along with vector smear and pick instructions) makes vectorizing loops possible when they might otherwise cross protection boundaries.

You wouldn't believe the cost of hardware support.

The addition is usually done using carry-look-ahead scheme [1]. This scheme has depth of O(log(N)) (N being number of bits). For 64 bits it is k * 6, and k is about 1.5. So you are looking at ~9 logical operation depth.

The computation of overflow uses bits from both operands and result. You also have to store that overflow bit somewhere. This means that 1) you have to add logical operations (usually two) to compute it and 2) you have to lay wires to store results of computations. Either way you waste timing resources (logical ops for wider processes, wires for thinner ones).

For superscalar execution you end with another result dependence to resolve and mostly ignore.

In the end you add about 5-10% of overhead of clock cycle time due to constant checking for overflow.

E.g., your request will make all computers more expensive to operate.

I have relatively extensive experience in designing hardware for a software engineer (accelerated video controller for STB, no less, from algorithmic prototype to tests). My modus operandi in that area is that you should do in hardware only what you cannot do in software.

Let's compare SPARC and MIPS. SPARC has status register (hardware support for overflow you asked for) and MIPS doesn't. SPARC lso has complicated register file, but it is out of control path, which is register-register addition, you wouldn't believe. SPARC and MIPS are equavalent otherwise. We estimated operating frequency estimations for ALUs of SPARC and MIPS for 0.13um process and for SPARC it was 400-450MHz for SPARC and was about 500MHz for MIPS, without any tweaking in low level. We have here 10% in speed difference. MIPS would be even faster if we ditch ADD/ADDI/SUB/SUBI instructions (add/subtract with overflow checking).

The same is true for OpenRISC and RISC-V. OpenRISC generates exceptions for any sneeze that may happen, RISC-V continues. Guess what is easier to develop, test and will be faster in the end.

Please, do not add to hardware any functionality you really do not need. You can check for integer overflow statically and generate special code that will generate exceptions if you cannot prove their absence conclusively. This is already done for division by zero for MIPS target (check it out, it is amazing to see difference between -O0 and -03), it can be done for integer additions.

[1] http://en.wikipedia.org/wiki/Carry-lookahead_adder

How do you feel about John Regehr's suggestion (http://blog.regehr.org/archives/1154) mentioned below? His suggestion sounds reasonable to me because the instructions would optionally trap on overflow. That should avoid the cost (which I have an intuition for) for the common case.

x86 has special into command to trap on overflow. Thus I don't know what he is talking about.


You can do without carry and overflow flags in long arithmetic just fine.

INTO is not supported in 64-bit mode.

A flag register is indeed an additional dependency and may end up in the critical path. But x86 and ARM already have a flag register. They already pay the cost.

And you don't need a flag resister to check for overflows. Trap on overflow (like Alpha) will work just as well. The difference is that traps are infrequent so you don't have to make them fast, just correct. You don't have to raise the trap in the same cycle that you calculate the integer operation. You just have to to do it before the commit stage. (And the last time I checked, the Alphas were quite a bit faster than MIPS.)

Of course hardware support implies some overhead and more complexity. I can see why people would oppose it. But there really is software that would greatly benefit from hardware support.

Modern processors can generate different micro-ops depending on whether the flags are observed. In old non-pipelined/non-speculative/non-rewriting processors what you said is true but all bets are off in the world of massively funded x86 processor development.

The trap on overflow will need to compute said overflow flag. Yep, it is not a register. Yep, you still have to compute it. And here you again pay for things that are 1) infrequent and 2) mostly statically inferred.

The Alphas were faster than MIPS due to manual design. Instead of using standard components they laid everything out by hand and they often used domino logic, AFAIK. Classic five stage pipeline MIPS implemented with domino logic can easily get 1GHz (twice as fast as direct synthesis).

Domino logic: http://en.wikipedia.org/wiki/Domino_logic

And software would benefit from static error checking much more than from hardware support for integer overflow.

I am not describing constant checking of overflow. The sticky flag is the OR of any overflow flag between the last flag-clear operation and a speculation pivot.

It does not need evaluating between every operation. In fact in the example I gave (ARM SIMD) the entire point of its design is exactly so that you don't need to incur the latency of the flag generation logic for every operation.

I wouldn't dream of implementing anything quite as deliberately inefficient as what you're describing. We're decades beyond the point of two instructions serially depending on each other.

What I am saying is that for 3% of real use you sacrifice 5%+ performance for all other operations (all 100% of them) and make your future designs much more complex (not 5%).

IF you add overflow checking into hardware, you doing disservice to all your users. Pure and simple.

Hardware always evaluate everything. It is hard to get for software engineer, but it is true. The little overflow flag and scheme for its computation will sit there draining energy all the time.

Intel/AMD already handles flag-setting in this fashion because the vast majority of ALU instructions have a flag side-effect.

There is an additional latency to get flag results. The latency is not paid unless you depend on them. Overwriting is not a dependency.

Other architectures get around the cost altogether by having a separate "set" bit in their ISA, e.g ARM. If you don't want flags modified, then don't specify it, e.g ADD instead of ADDS.

I think you are quite mistaken by multiple orders of magnitude how much power cost it would have, especially when taking into account the rest of the system.

Yes! John Regehr seriously advocates that [1] and I fully agree to him.

[1] http://blog.regehr.org/archives/1154

Yes, AMD should have added it to x64. It isn't exactly new either. The Alpha's integer instructions had a bit in the instruction word that indicated if an overflow should cause a trap or not. IMHO that is the ideal solution.

No, it is not an ideal solution. It causes decrease in clock frequency for about 10% (or you have to make your design noticeably more complex).

The trap in Alpha ISA was done for MIPS compatibility reasons (Alpha was more or less MIPS assembly compatible).

The sticky overflow bit is useful when you do signal processing with integers. To detect bugs it would be better to have a real exception that indicates which instruction caused the overflow. But to do that you would have to change/extend the instruction set which is rather likely. ARM and AMD missed the change to add this when they migrated to 64bit ISAs.

The fact many languages don't overflow check by default really saddens me. Integer overflow is the cause of so many bugs (many user-facing: http://www.reddit.com/r/softwaregore/), and yet people keep making new languages which don't check overflow. They check buffer overruns, they check bounds, and yet not integer overflow. Why? The supposed performance penalty.

The reckless removal of safety checks in the pursuit of performance would be considered alarming were it not commonplace.

(Disclaimer: I really, really care about integer overflow for some odd reason, going so far as to be going through the entire PHP codebase to add big integer support...)

This is the main reason I don't trust the mob of people working on Rust to get their act together and make it a good language. You can see some awareness of the value of integer overflow detection among some individuals that work on Rust but despite that, here we are without that feature. Swift, on the other hand, is one language that has put some thought to the matter.

> You can see some awareness of the value of integer overflow detection among some individuals that work on Rust but despite that, here we are without that feature. Swift, on the other hand, is one language that has put some thought to the matter.

We have put a lot of thought into the matter, and the conclusion was that (a) the performance overhead, as measured by benchmarks, could be significant; (b) the relative benefits of doing the checking in Rust would be much less than in C and C++, since overflow in (safe) Rust cannot result in memory safety problems. However, we're looking into ways that we could allow folks to opt into the feature globally in the future should the performance story change or if they want to panic rather than overflow. You can already use the special checked types for arithmetic on a case-by-case basis if you'd like overflow checking.

Not to re-relitigate old discussion, but (as I'm sure you know) it's now more probable that you might overflow your integers and then pass them to unsafe code or external APIs, and there are many other bugs or undesirable behavior that could be killed dead instead of allowing them to behave badly. For a not contrived example, I don't want weird stuff to happen when my semaphore counter overflows, or underflows, I want it to just crash, right there. (My perspective and desires come from a project where I didn't care about "raw" performance that much -- the project being a storage engine where C++ was chosen with the performance concerns in mind being lack of GC and resultant traffic shockwaves, so my preferences lie a bit away from CPU integer operation performance and more towards crashing obviously at the spot of the foul. It's the worst kind of bug when you hang mysteriously in production or to write garbage to disk when the error could be caught, and I had many C++ integer underflow worries at that job.)

I also think overflow errors should be conveniently handleable without putting a try! or match expression on every operation, and so should other errors, but I'll spare you that rant :-)

> Not to re-relitigate old discussion, but (as I'm sure you know) it's now more probable that you might overflow your integers and then pass them to unsafe code

Unsafe code needs to validate its inputs. That's part of the contract of unsafe code. If your unsafe code causes memory safety errors when given MAX_INT as an input, then your unsafe code is dangerously broken regardless of how easy it is for safe code to accidentally create a MAX_INT value.

The relevant contract boundary isn't specifically at the edge of the unsafe block, it's at some other level like the public face of your ADT, inside which your unsafe and technically safe regions of code share the same set of invariants. It's immaterial whether the integer operation that results in overflow happens inside or outside what is technically the unsafe block, it should still be trapped. (Sorry about being unclear in that part of my previous comment.)

Does Rust do saturating arithmetic by default? If not, an integer overflow won't pass INT_MAX, it will pass some other int that might be a valid input to the unsafe code.

Rust does "computer" (silently overflowing) arithmetic by default. Saturating[0] and checked[1] are optionally available but require converting to methods (and some other changes for checked as it returns an Option)

[0] http://doc.rust-lang.org/std/num/trait.Int.html#tymethod.sat...

[1] http://doc.rust-lang.org/std/num/trait.Int.html#tymethod.che...

Unsafe code should not have memory safety errors when passed valid input. The point is that while integer overflow in Rust might cause incorrect program behaviour, it's not going to cause undefined behaviour (and therefore e.g. executing user input) the way it does in C.

How should the errors be handled in your opinion? I'm asking because I'm also a language designer, and this is one of my sticking points...

It's that old song. "Our language is safe!—for some very specific, rather convenient definition of 'safe' which we just picked ourselves."

I don't understand your criticism. Anyone who claims anything is "safe" has to be specific about the definition of "safe", or else they'd be making vague unsubstantiated claims.

Rust is "memory safe" per the commonly used Saraswat definition [1] (not proven yet, but the proof is in progress). This is not something we just made up.

[1]: http://www.cis.upenn.edu/~bcpierce/courses/629/papers/Sarasw...

I think I misread you. It's certainly reasonable to say that memory safety will reduce the harm, on average, of integer overflows and that this should affect any cost/benefit calculation for implementing overflow checking. What would have been unreasonable is to imply that the safety implications of integer overflow can't matter much or are somehow out of scope because of Rust memory-safety. No matter how nice and precisely-defined the memory-safety criteria are, they're only part (an important one ofc) of what people reasonably expect in terms of safety improvements from a language which aims to be a better, saner successor to C/C++ and is evangelised as such. "Not a deathtrap like C" may be a much fuzzier standard than "memory safe per the Saraswat definition", but that doesn't change the fact that the latter is just a partial means to the former.

> This is not something we just made up.

I deliberately did not say 'made up' or 'created', but rather 'picked' meaning 'selected'. The fact that the map is pre-existing and commonly-used (and rightly so) does not alter the fact that it is not the terrain.

  > What would have been unreasonable is to imply that the 
  > safety implications of integer overflow can't matter much 
  > or are somehow out of scope because of Rust memory-safety.
pcwalton is not suggesting this. The Rust devs care about reducing the amount of unsafe code in the wild, which means that they care about checked arithmetic operations. But making the world a safer place first requires displacing C++ where it is used in safety-conscious applications, which means being competitive with C++ in terms of speed. And yes, this is all about defaults, given that Rust already offers checked and saturating arithmetic types. That Rust leans on the side of speed here is a pragmatic choice (and one that is rooted in actual benchmarking, corroborating the 2-6x slowdown mentioned in the article), and is only possible because it doesn't violate memory safety by any accepted definition (see also how Rust gladly accepts the cost of checked array indexing, while also going to great lengths to provide elaborate abstractions (iterators) to eliminate that same cost). It's also an unfortunate accident of our modern historical context: once compilers get better at optimizing overflow checkes, and/or once hardware gains better support for efficient overflow checks, and/or once Rust begins to actually demonstrate a credible ability to replace C++ in safety-conscious applications (and thus doesn't need to beat the drum of performance quite so vigorously), then checked arithmetic by default can be trivially added to Rust after the fact (though it would obviously be a breaking change, necessitating a version bump).

(For completeness, I'd like to point out here that checked arithmetic is not a panacea. For any range that you wish to express in an integral type, it is vanishingly unlikely that INT_MAX is actually the logical upper bound. If I have an RPG where characters can quiver 100 arrows but I forget to enforce the upper bound, checked arithmetic won't save me from the fact that my program can now exist in somewhere between 28 invalid states (signed eight-bit variable) and nine sextillion invalid states (for a signed pointer-sized variable on a 64-bit platform, as is so common for "default" integral types). The real solution is reifying integral bounds in the type system, as I expect the language that someday replaces Rust to do.)

Other ways in which the choice of what constitutes safety is convenient: you can recurse until the stack overflows, and you can make infinite loops instead of having to prove through the type system that everything terminates. These are reasonable design decisions. If allowing silent integer overflow is the wrong choice for Rust, it's barely the wrong choice, because the costs and benefits are clearly arguable and user-specific.

> These are reasonable design decisions. If allowing silent integer overflow is the wrong choice for Rust, it's barely the wrong choice, because the costs and benefits are clearly arguable and user-specific.

Seems reasonable to me, and I certainly didn't demand overflow-checking by default in Rust. (OTOH, twelve hours ago you were so disappointed with that decision that it undermined your confidence in the Rust team...) But I don't see what that has to with picking the definitions for desirable adjectives like 'safety' when it comes to promoting or describing a system.

> OTOH, twelve hours ago you were so disappointed with that decision that it undermined your confidence in the Rust team...

Well it's relatively small damage levels but it's the wrong decision, and it's the inoffensive choice made of status quo bias.

So inaction on overflow checking in hardware is regularly justified by claims that overflow checking in software imposes only a minor performance burden (viz. OP), while inaction on overflow checking in software is justified by claims that that it imposes an unacceptable performance burden. I feel that something doesn't add up here.

> You can already use the special checked types for arithmetic on a case-by-case basis if you'd like overflow checking.

Already this is a HUGE improvement over C. Is it allowed to silently mix checked and unchecked integers?

No, Rust does exactly zero implicit coversions between numeric types. For that matter, it doesn't allow you to implicitly use an i8 where, say, an i32 is expected (forcing you to explicitly cast instead), even though the former represents a range that is a strict subset of the latter. Whether or not this is a feature or a flaw depends on the observer. :)

There was a discussion in the Rust mailing list about it a while back. I used that as a basis for looking at overflow detection at compile time in ATS: http://bluishcoder.co.nz/2013/05/07/ranged-integer-types-and...

Thanks for the great read. As someone who's been eyeing ATS for a while and checking it out more lately, this was very inspirational.

Huh, Swift actually does overflow checking and traps by default. That's quite refreshing.

Asides from pcwalton's argument, I think any successful (or to-be-successful) language is trying to solve some problems very well and other problems reasonably well. For Rust, the former includes the memory safety and the latter includes the integer overflow detection (but you can always name a lot more). It would be very hard to solve all problems at once (especially when known solutions to some problems worsen other problems, as seen in Rust). That can be said to Swift as well, it just picked different problems to solve very well.

I don't think it's hard to solve the problem of memory safety and integer overflow detection at the same time.

Both problems are undergoing active research. They are indeed orthogonal (albeit one can somehow reduce the severity of another and vice versa), but with a finite resource and time you sometimes have to choose.

It seems to me that, despite all the flexibility people demand, they only really want (and use) exactly three integer data-types in a language:

1. machine bit-string of size 8(2^n) bits (e.g. b/w/d/q fields) with no concept of being a signed/unsigned integer, but where you can apply both signed and unsigned integer operations to it in an optimized manner. Such code will be shimmed with sets of clever shifting ops if the target architecture doesn't actually have a storage type of that width, but you have to be explicit about what you'll allow (e.g. if an integer is of type "d|2w", then it will compile fine on a 16-bit architecture (and operate using shimmed ops), but fail to compile on an 8-bit architecture.

2. unsigned integer of fixed exactly-specified bitsize, which exactly matches the semantics of a having a value on a modular-arithmetic ring. You go up, it wraps around at 2^bitsize; you go down, it wraps back. This stays true even if the code is compiled on an architecture where the exactly-specified bitsize isn't a clean processor storage-width: a 27-bit ring on a 32-bit processor is A-OK, and shims will be inserted to enforce the semantics. Shims will also be inserted if you're targeting an ISA that doesn't have an integer type with wrapping semantics (e.g. the JVM.)

3. signed arbitrary-precision integer (i.e. optimal-machine-word-size-minus-a-tag-bit with bignum promotion checks). The optimizer might convert one of these to something of type #1 if it can be very, very sure of the value range you're operating within.

#1 is for doing pointer math in unsafe regions; #2 is for implementing cryptographic primitives; #3 is for everyone else.

(There's also a variant on #1—let's call it #1A—which is a fixed-size array of #1s you can repeat signed/unsigned integer transformations over, where this will generate optimized SIMD code. This would be the "buffer" type for wire protocol packing/unpacking, and also the backing store for non-sparse matrix ADTs, bloom filters, etc.)

The only place integer overflow raising an exception would make sense, to me, are if you want some sort of hybrid type between #1 and #3: a value that pretends to be arbitrary-precision, but in fact only has a constant-size bitstring to operate within. I could see the use of this in something like Cap'n Proto, where you're keeping wire-encoded integer bitstrings (#1s from a #1A) around and pretending they're fully-functional integers (#3) for the sake of zero-copy—but do you really gain that much by not letting a data structure be recreated resized on the stack, if you're not also throwing down optimizations like intrusive lists in fixed arenas?

I have a small nit about #2 that doesn't affect your main argument. Similar to the 'overflow' flag the processor automatically sets to signal signed overflow, the processor flag for 'carry' is already being set by the processor when you add or subtract unsigned numbers that overflow. There is no separate signed or unsigned addition in x86/x64 assembly --- the processor just sets the flags for both cases, and they can either be used or ignored.

This is the reason that overflow checking can be so inexpensive --- no additional checking is necessary, just a conditional branch on an existing flag. Correctly predicted branches not taken are very inexpensive, and this branch will almost always be correctly predicted. The issue is that standard C doesn't allow any direct defined way to make use of the overflow flag. Instead you have to write something awkward and hope the compiler optimizes it down for you.

Irritatingly, GCC lacks a "branch on overflow" intrinsic or "add and branch on overflow" intrinsic, forcing you to write inline asm to do fast overflow checking.

GCC 5.0 has them:

> A new set of built-in functions for arithmetics with overflow checking has been added: __builtin_add_overflow, __builtin_sub_overflow and __builtin_mul_overflow and for compatibility with clang also other variants

-- https://gcc.gnu.org/gcc-5/changes.html

As of GCC 5, it has __builtin_add_overflow, __builtin_sub_overflow, __builtin_mul_overflow


Kind of makes sense, though. Since the C standard says that integer overflow is undefined, you can think of the "C abstract machine" as being built on an assumption of an underlying architecture that provides magic fixed-width-yet-arbitrary-precision integers (and/or an assumption of omniscient programmers who always know what their inputs will be and never let math happen if it would result in overflow.) Basically, you can't talk about overflow in C, or even to a C compiler, because as soon as you make the fact that overflow is happening explicit, you've stepped outside the C abstract machine.

It's like trying to talk to Haskell about explicit thunks, or Lisp about stack-vs-heap allocation. The language encapsulates the concept away from you, so if you wrench it back, you're no longer quite writing the language; you're writing a union of it and something else.

Yes, but in practice, that's what people do: write in a language more powerful than the standard gives them. You kind of have to if you want to get anything done without driving yourself crazy. For example:

long mega_byte = 1024 * 1024 * 8; // bits per MB long mega_bit = 1000 * 1000; // bits per megabit

If you follow the C90 standard, this code has all sorts of issues. First, // comments are not allowed. Second, identifiers are only guaranteed to have six significant characters, so the above two variables might actually be the same variable. Finally, ints are only guaranteed to be 2 bytes, so the multiplication may overflow and the program is undefined. (Assigning to longs doesn't help: the multiplication already happened.)

You are correct that such things are outside of the C abstract machine, but GCC provides a lot of extensions that are outside of the C abstract machine. For example, __sync_synchronize and other atomic intrinsics (https://gcc.gnu.org/onlinedocs/gcc-4.6.2/gcc/Atomic-Builtins...). C had no memory model when these were introduced (which, of course, is why they were introduced), so they were outside of the C abstract machine.

signed integer overflow is undefined; undefined integer overflow is well defined (to use mod arithmetic).

> an ISA that doesn't have an integer type with wrapping semantics (e.g. the JVM.)

This is incorrect: the JVM has fully well-defined modulo arithmetic:


"The result is the 32 low-order bits of the true mathematical result in a sufficiently wide two's-complement format, represented as a value of type int. If overflow occurs, then the sign of the result may not be the same as the sign of the mathematical sum of the two values. Despite the fact that overflow may occur, execution of an iadd instruction never throws a run-time exception."

With this definition the JVM doesn't need distinct instructions for unsigned arithmetic since they're identical to signed arithmetic in two's complement (with some exceptions like shifting for which there are both ishr / iushr instructions).

Why limit #2 to 2^bitsize? Arithmetic modulo 86400 (or 12, 24, 60) can be handy, too (especially if there is an 'on overflow do' facility)

And people will want complex numbers, quarternions, non-Abelian groups, etc.

The important thing when giving people primitives to work with is that they come with time and space guarantees. People are writing low-level code basically in order to be able to explicitly make that choice, rather than having the compiler make it for them.

So, #2 is just a few extra guarantees about the intrinsics of regular machine-words, with minimal overhead for the fast path (even on the slower path of e.g. a 30-bit ring on 32-bit storage, there's still an O(1) set of barrel-shifts and masks that can be used to implement that.)

A fully-general ringed arithmetic value, on the other hand, is more of an ADT. It could certainly be implemented in the same language and act like a number, but it wouldn't be useful as backing storage for other types in the way #1, #2, and #3 are. (Note that #2 is a better backing store than #1 for many use-cases, e.g. encoding IP addresses and subnet masks.)

Time guarantees on a modern out-of-order processor with some complicated cache structure of three or more levels, with virtual memory enabled? Forget it.

Also, if you include bigints in your list of things people want, I do not find it unreasonable to add timestamps (maybe not the best example because of leap seconds) or 'day of the week', 'month of the year' to that list. Most language libraries get those (often including those perfect 86400 second days) before they get bigints.

I don't want 3, because I don't want tag bits or space wastage, but also because I don't want to have to link a bigint library.

What do you expect to happen, then, in cases where you have a "can't fail" transparent deserialization operation from a wire format that doesn't limit the precision of its integers? Say, a web request that receives and decodes a JSON object. Now, JSON itself has no integer type, just a double type—but when people are manually generating JSON, they tend to forget this and shove arbitrary-precision integers in as naked \d+ tokens; and when people are receiving JSON, they also tend to expect to receive integers, not doubles. So what would you expect this JSON-decoding library to do, in a sensible language:

• crash the request thread?

• output a floating-point value with some of the precision thrown away?

• output a machine integer with the overflow flag set?

• output an arbitrary-precision integer?

Basically, arbitrary-precision integers are a necessary part of any platform that needs to be able to hold onto, and do math to, externally-generated numbers that didn't arrive packed in #1A type structs.

To insist, in the modern day, on attempting to do arbitrary math with arbitrary numbers using strictly machine words, is like insisting on defining a Postgres column as a VARCHAR, when TEXT is just as optimal for the average case, while automatically coping with edge-cases.

Most numbers computers touch don't have bit patterns that fit in 8(2^n) bits, but not 8(2^n) - 1 bits. Tag bits are cheap, especially if you have the other two integer types to fall back on.

I would expect the JSON-decoding library to do one of the following, depending on the library author's or callee's wishes:

- output a floating-point value with some lost precision

- output an arbitrary-precision integer, or some general "number" type that can be an arbitrary-precision integer or double-precision float

- output an "integer too precise to decode" placeholder in that location in the data structure

- fail to decode the document entirely

In no case should it crash.

I don't know what you mean by "a machine integer with the overflow flag set", since it doesn't exactly work that way, so the "integer too precise to decode" value is my alternative.

For example, suppose you're making a "JSON document"-based database engine and its query language only supports double-precision numbers. In that case, I would output a correctly-rounded floating point value, if I were decoding JSON.

Usually in other cases of decoding an integer, unless you're hosting an arbitrary precision integer calculator web application, there should be maximum acceptable values on any integer you'd read off the wire. For example, there's a maximum data size, maximum timeout duration, and such.

> fail to decode the document entirely [...] in no case should it crash

Note that I was using the Erlang sense of "crash"—a crash from a pattern-matching failure is the default clause of most case statements, and you want those to be bubbled up by crashing linked and supervising processes until they hit something that has generic "something went wrong; I'm going to report this to the user" code in it (like e.g. Rack does in Ruby.) You don't want to specifically be throwing, immediately catching, and then converting exceptions into 500 errors in every part of your code.

Also, I was assuming in this example that you're using a modern web framework, where the JSON decoding is handled by the framework—not by you—so the framework likely doesn't expose any configuration for something like its JSON deserialization library, which it wants you to just consider an implementation detail. So you won't get to specify anything like "maximum data size". Instead, the JSON library needs to "just work" in all cases, returning its best-effort interpretation of the wire message to you. (The parallel for HTML parsing would be libraries like Beautiful Soup.)

Punting the choice up to application code is not a good idea; maybe you should allow them to override the config, but good defaults are important.

The only sane default is to return an arbitrary precision arithmetic type with an optional mode for truncation to double for performance.

"JSON itself has no integer type, just a double type"

Nit - actually the JSON standard doesn't specify a precision for the number type, but most JSON implementations don't support arbitrary precision numbers.

Though the more recent RFC does have some implementation notes for interoperability.


> "This specification allows implementations to set limits on the range and precision of numbers accepted. Since software that implements IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision. A JSON number such as 1E400 or 3.141592653589793238462643383279 may indicate potential interoperability problems, since it suggests that the software that created it expects receiving software to have greater capabilities for numeric magnitude and precision than is widely available."

> "Note that when such software is used, numbers that are integers and are in the range [-(253)+1, (253)-1] are interoperable in the sense that implementations will agree exactly on their numeric values."

What is missing is a good and easy way of doing the check:

- Without need checking after every operation

- Without impeding easy use of overflow (and/or allowing for saturation) when it's needed (in a lot of places)

So, I see a narrow use for it, IF you validated your inputs AND overflow is unexpected (or, it is expected and you do something different, but it's still on narrow cases)

I just did my own comparison using bzip2 (http://www.bzip.org/), comparing both CLang 3.4 and GCC 4.9 with -O3 -fsanitize=signed-integer-overflow. I omitted the 'unsigned-integer-overflow' because my version of GCC didn't support it. ICC is shown for the non-sanitized version, as it doesn't support it the automatic overflow checking.

I did a smaller file than Dan because I didn't have the patience to test with the 1GB file. Testing was Linux on Haswell, cycle and instruction counts are coming from 'perf stat' and are in billions except for the last column, (instructions/cycle). Not shown here is the number of mispredicted branches, which as expected does not increase for adding the checks since they are all well predicted.

                    cycles instructions branches  ipc
  clang-O3:          24.4      43.8      6.3      1.79
  clang-O3-sanitize: 26.5      54.8      9.3      2.06
  gcc-O3:            23.0      42.5      6.2      1.85
  gcc-O3-sanitize:   24.6      47.3      7.6      1.93
  icc-O3:            23.8      41.6      6.5      1.75
For Clang, adding signed integer overflow checking adds about 20% to the count of instructions executed, but only about 7% to the runtime. For GCC, 10% more instructions but also about 7% increase in execution time. Agreeing with this, the instructions per cycle ratio goes up more for CLang than for GCC, but in both cases there is a significant increase showing that each check costs less than a cycle.

I'm somewhat confused as to why the increase in instructions executed is so much greater than the increase in the count of branch instructions. This would seem to confirm Dan's experience with poor generated code quality, but looking the tight loops with 'perf record' both GCC and CLang appear to be generating reasonable code. Both sanitized and normal spill to the stack much more than seems necessary, but this doesn't seem to because of the overflow checking. I'd guess the extra instructions are just by chance because of how the 'jo' checks were laid out.

GCC 4.9 does have `-fsanitize=signed-integer-overflow`, and it generates pretty good code for that toy function:

    f(int, int):
        mov eax, edi
        add eax, esi
        jo  .L8

Older gcc versions have `-ftrapv` option, which checks overflow for signed addition, subtraction, and multiplication. However, the implementation calls libgcc functions to do the checks, resulting in non-negligible overhead.

Ah yes, integer overflow checking. Here's iDrive, the commercial backup service, failing at it:


iDrive backed up a 3GB file. This overflowed a 32-bit value, and so iDrive shows the size as -149406.91 KB. This seems to have confused something in the iDrive system, and now backups won't run. Their tech support people have been working on this for two days now.

C officially considers unsigned integer overflow to be modular arithmetic. I consider that a mistake. If you want modular arithmetic, you should write

    x += n % (2^32)
which can be easily optimized to eliminate the divide. Otherwise, it should be an error.

More portable:

  x += n % (1 << (SIZEOF_LONG * CHAR_BIT));

I don't think SIZEOF_LONG is portable C. You would want

     sizeof(long) * CHAR_BIT
instead (while at it, use sizeof(x) instead, ?and abort if sizeof(n) > sizeof(x)?)).

More importantly, I think that expression produces an int value, so if sizeof(long) > sizeof(int), that shift will produce undefined behaviour (or zero? I don't have a C standard to check that on at hand. Actually, shifting any integral type by its size in bits could well be undefined behavior)

Even if that weren't the case, that doesn't do modular arithmetic. You would want

    x = (x + n) % (1 << (SIZEOF_LONG * CHAR_BIT));
and of course, that computes x+n before doing the modulus, so even that doesn't do modular arithmetic.

I think one would need custom syntax for making modular arithmetic easy in C. I would suggest something like

  x += n (mod 37);
That may make parsing difficult, though, if one allows expressions after the 'mod'. Alternatively, put the 'do modular arithmetic' thing in the type:

  modular(43) int x;
  modular(23) int y;
  int n = x + y; // error?
Summary: this is way harder than it seems. I bet people will find errors in the above that show that conclusively.

Edit: yes, it is harder. According to http://blog.regehr.org/archives/738, left-shifting a signed integer by one bit already can trigger undefined behaviour.

If the program wants 32-bit unsigned modular arithmetic, that's what you should get, regardless of the underlying machine architecture. If you code that, you're probably doing a checksum, and you want the answer to be consistent across platforms.

Perhaps the C++ standard could add 'operator{+,-,etc}%'.

Along the lines of Swift's & operators?

I wonder if these estimations of integer overflow cost have taken into account all the possible optimizations where the compiler could safely remove overflow checks. For example, consider:

    void memzero_range(char* array, int start, int end) {
        for(int i = start; i < end; i++)
             array[i] = 0;
In that example, a sufficiently smart compiler could just check for the `start > end` case once at the start of the loop, and remove the overflow check from the body. Perhaps this is something that today's Clang/GCC can already do?

Rewriting it as follows, with an overflow check:

    void memzero_range(char* array, int start, int end) {
      int i = start;
      while (i < end) {
        array[i] = 0;
        if (i == INT_MAX) {
Clang does indeed manage to see that the check is equivalent to a start > end case (with -O1) and hoists it out of the loop. It also then rewrites the body of the loop to use the LLVM memset intrinsic!

I hope optimiztion of integer overflow checking improves in LLVM, since now there is a highly visible project(JavaScriptCore FTL JIT) using the feature. JavaScript semantics pretty much forces you to do integer overflow checking fast.

Not to mention Swift.

FTL has already precipitated other interesting changes in LLVM (to support deoptimisation and GC).

You're referring to the optimisation of storing JS Numbers as integers until they get too large, presumably?


Native integer overflow checks can only go so far, there, since there's no 52-bit integer type.

What? JavaScript engines check for 32-bit integer overflow and promote to double.

IIRC v8 promotes to double at 31 bits. But yeah.

On 64bit architectures V8 promotes at 32bits.

(and if we talk about optimizing compilation things are getting more complicated as there int31, int32 and float64 can all coexist and you can have 31-bit integer stored in a float64 value - if operation was specialized for floating point values)

Yeah, V8 is a bit of an odd one out here. Everyone else just promotes once a signed 32-bit int overflows.

It is the programmers job to check for overflow. There are a lot of comments here about having a language check for overflow, but that is not practical. Specifically, the language doesn't know what to do in the case of overflow. The best it could possibly do is throw an exception, and I suppose that's what those commenters want it to do. But it's still up to the programmer to decide what to do and implement something. Are you going to do that for every math operation in a program? If not then you're already thinking about where potential problems may occur. Sometimes you want an overflow. I like to use 16 bit values to represent angles from 0 to 360 degrees - you can add and subtract these and never need to worry about the wraparound - I don't want that to throw an exception. If your code is going to overflow when you don't expect it to, you've got a bug and the language isn't going to be able to save you.

I feel like noobs hitting an overflow for the first time think the world would be better if the machine could "just take care of it".

That said, one solution to overflow in some cases is saturating arithmetic. Supported on some processors, but by no languages that I know of. That may be worth considering.

Spot the signed integer overflow here that leads to undefined behavior on targets where "int" is 32-bits:

    uint32_t leToWord(uint8_t bytes)
        return bytes[0] |
               bytes[1] << 8 |
               bytes[2] << 16 |
               bytes[3] << 24;

This is more about C's integer promotion rules than it is about overflow.

I wonder why not use interrupts for overflow checking? (Maybe x86 still doesn't support that - my knowledge of it is very old.) That seems smarter, if the overflow is really going to be an exceptional case..

I would think that the best approach is to set up a global overflow handler, driven by an interrupt, and then maybe for cases where you actually want to be able to overflow, a special C primitive (that would use instructions that avoid overflow in some way) could be useful.

I am just wondering why the article doesn't mention this possibility at all.

x86 doesn't support it.

Ah well. I work on zSeries mainframes, and this architecture does support interrupts on integer overflows (as well as many other number formats). So I thought that x86 architecture already got that option too.

I'd be curious about the effect of manually substituting the clang overflow arithmetic intrinsics on the codegen.

I suspect that'd make it worse, as you've added a new inline call.

This calls for BaseN computing; an algorithm that compiles code down to the most optimal base (minimizing overflow) and then ran it on a quantum baseN computer; post compilation basing; cross basing; etc

Honestly we should talk about “The Security Cost of No Overflow Checking” first.

It's good to talk about, but not every article on performance needs to be prefaced with a talk about security. Being aware of the performance costs of your tools is just as important in many cases.

Besides, if we're talking about language design and compiler optimizations, there are plenty of ways to check overflow without resorting to a branch on every arithmetic operation.

Yes! You can make code that runs as fast as you want if it doesn't have to actually work.


Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact