
Harder than it looks: rounding a float to the nearest integer - pascal_cuoq
http://blog.frama-c.com/index.php?post/2013/05/02/nearbyintf1
======
forrestthewoods
Floats are outrageously complicated. Tread carefully! If you ever deal with
the nitty gritty of floating point numbers you should read every post on the
topic on Bruce Dawson's blog. It's up to 15 posts long.

TLDR - if you think you know all there is to know about floats then you're
wrong. Read on!

http://randomascii.wordpress.com/2013/02/07/float-precision-revisited-nine-digit-float-portability/ (has links to all posts in series)

~~~
GhotiFish
Thanks for the link. I think floats are pretty overused; I rarely see fixed-precision decimals getting used, which has always struck me as odd.

~~~
zokier
I'd love to see a language or even a library that would make no-surprise fixed
point and/or arbitrary precision math first class citizens, like what Python
did with integers.

~~~
whattssonn
COBOL and RPG, for example, have native support for fixed-point (decimal) types on Mainframe / Midrange systems. It's the second most used type, after the "character" types. Floats are seldom used in the context of business applications. Fixed-point math is even implemented directly in hardware on Power processors. It's always been like that on IBM platforms.

------
haberman
I wrote a very similar article about how testing for integer overflow is also
harder than it looks: http://blog.reverberate.org/2012/12/testing-for-integer-overflow-in-c-and-c.html

~~~
cjh_
Well-written article.

One of the things I found amazing about learning assembly was that in x86 asm
it is relatively trivial to check for overflow (after an operation has
happened) by checking if the overflow flag [1] is set.

Although this doesn't help with detecting if overflow will occur from an
operation, it can be used to detect when it has occurred and assist recovery.

[1] <http://en.wikipedia.org/wiki/Overflow_flag>

~~~
haberman
Thanks! Yes I am also impressed by how relatively easy this is to implement at
the machine level. One thing I meant to do for that article but didn't get to
is implement each of those algorithms in assembly directly and see if the
C/C++ solution would compile into the optimal machine code (or something
equally efficient).

In assembly you have the luxury of no undefined or implementation-defined
behavior. This makes things a lot easier, since you can count on things like
two's complement. In assembly these checks are probably as easy as (untested):

    
    
        will_u64_overflow_u32:
          ; If any of the high 32 bits are set.
          shr   rdi, 32
          setne al
          ret
    
        will_u64_overflow_i32:
          ; If any of the high 33 bits are set.
          shr   rdi, 33
          setne al
          ret
    
        will_i64_overflow_i32:
          ; If the high 32 bits are other than all 0 or all 1.
          shr   rdi, 32
          add   edi, 1
          cmp   edi, 1
          seta  al
          ret
    

I'm not sure if these are exactly right, but I think the right answers are
about this simple. Any of these could also be written simply in C/C++; the
main difference is that in assembly you can make machine-specific assumptions
(like two's complement), since you're already doing something
machine-specific.
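
For comparison, here is a sketch of the same three checks in portable C, using only well-defined unsigned shifts and range comparisons (the function names just mirror the assembly labels above, in the same untested spirit):

```c
#include <stdbool.h>
#include <stdint.h>

/* Does a uint64_t value overflow when narrowed to uint32_t? */
static bool will_u64_overflow_u32(uint64_t v) { return (v >> 32) != 0; }

/* Does a uint64_t value overflow when narrowed to int32_t? */
static bool will_u64_overflow_i32(uint64_t v) { return v > (uint64_t)INT32_MAX; }

/* Does an int64_t value overflow when narrowed to int32_t? */
static bool will_i64_overflow_i32(int64_t v) { return v < INT32_MIN || v > INT32_MAX; }
```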

~~~
pbsd
The shift in the second one should be 31, if I understand what you're doing
correctly.

The third one is incorrect. Consider, for example, the integer
0xFFFFFFFF00000001, which is -4294967295 in 2's complement. It passes your
test, but is not representable as a 32-bit signed integer. An alternative
might be:

    
    
        movsxd rax, edi ; signextend(rdi mod 2^32) == rdi
        cmp rax, rdi
        setz al
        ret
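
In C, the same check can be sketched as follows (the helper name is invented; this assumes the usual two's-complement narrowing conversion):

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the movsxd check in C: x fits in int32_t exactly when
 * sign-extending its low 32 bits reproduces the original value. */
static bool fits_in_i32(int64_t x) {
    return (int64_t)(int32_t)(uint32_t)x == x;
}
```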

~~~
haberman
Brilliant, I love it. Thanks for the corrections.

------
rthomas6
Wouldn't this be easier if both numbers were just represented bitwise, and the
float were converted to integer that way? Seems more straightforward to me.

Floating point representation works by having "exponent" bits, and "mantissa"
or fraction bits. Also a sign bit. For explanation's sake let's say we have a
single precision 32-bit FPU. IEEE 754 single precision is 1 sign bit, 8
exponent bits, and 23 mantissa bits. The exponent bits are subtracted by 127
to yield the exponent.

So, the basic formula is:

    
    
      (-1)^sign * 2^(exponent-127) * 1.mantissa
    

So right away, if exponent < 0x7E (the value is below 0.5), round to 0.

If exponent > 0x9E (for 32-bit int), round to infinity.

Multiply

    
    
      1.mantissa * 2^(exponent-127)
    

This will be a power of 2 so there's probably a fast bit-shifting method for
this. I'm just too lazy to look up how this should be done.

If exponent > 0x96, the exponent's too big to represent fractions of 1, so
you're done.

Otherwise, you're within 1 number. If the remaining decimal < 0.5, truncate.
Otherwise, add 1 and truncate.
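
A rough C sketch of this scheme, for positive finite inputs whose rounded result fits in 32 bits (the function name is invented, and half-way cases round up):

```c
#include <stdint.h>
#include <string.h>

/* Round a positive finite float to the nearest uint32_t by decoding
 * its IEEE 754 single-precision bits directly. Assumes the result
 * fits in 32 bits; ties round up. */
static uint32_t round_float_bitwise(float f) {
    uint32_t rep;
    memcpy(&rep, &f, sizeof rep);                       /* grab the raw bits */
    int32_t exponent = (int32_t)((rep >> 23) & 0xFF) - 127;
    uint32_t mantissa = (rep & 0x7FFFFF) | 0x800000;    /* restore implicit 1 */
    if (exponent < -1) return 0;                        /* f < 0.5 */
    if (exponent < 0)  return 1;                        /* 0.5 <= f < 1 */
    if (exponent >= 23)                                 /* already integral */
        return mantissa << (exponent - 23);
    uint32_t integral = mantissa >> (23 - exponent);    /* truncate */
    uint32_t half = 1u << (22 - exponent);              /* 0.5 at this scale */
    return (mantissa & half) ? integral + 1 : integral; /* round half up */
}
```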

~~~
pascal_cuoq
You can definitely round a float to the nearest integer-represented-as-a-float
by accessing its bits. It will work.

It will not be faster on a modern processor, though, because floating-point
numbers live in their own registers, distinct from integer registers.
Floating-point registers do not have bitwise operations, and moving data to
and from integer registers is comparatively expensive; what they do have is
IEEE 754 arithmetic.

So the game is to do as much as possible with the instructions there are. One
such instruction, for processors with SSE2, is "cvttss2si %xmm0, %eax", which
truncates a float in xmm0 to an int in eax. Intel provided an instruction for
that because the C language defines the cast float -> int as a truncation.

The next blog post in the series will be about using IEEE 754 addition to
avoid the round-trip through eax altogether.

~~~
pbsd
x87 has FRNDINT, and SSE4.1 added ROUNDSS/ROUNDSD (xmm), to do this task
purely in hardware.

Further, if one is seriously determined, there are ways to perform bitwise
operations on floating-point registers directly. On XMM it's trivial: just
take advantage of the integer (and non-integer, e.g. XORPS) instructions
available.

On x87, one can take advantage of the direct mapping between MMX and x87
registers to manipulate them using MMX's integer instructions. This has
problems, but as I said -- seriously determined.

~~~
gsg
On modern x86 hardware, integer instructions that act on float values in xmm
registers (and vice versa) require a somewhat expensive change of execution
domain. That the register names are the same is irrelevant: everything will be
renamed into what the CPU works with internally (which is separate integer and
floating point domains).

Mixing integer mmx and x87 is even worse, with the crazy emms junk. Not worth
it.

~~~
DarkShikari
The penalty is at most ~1 cycle of latency -- in practice I find it gets
completely absorbed by the out-of-order execution engine. I've never measured
any significant penalty for mixing float and int SSE operations on any x86
microarchitecture.

Floating point bitwise operations exist too: xorps, andps, and so on.

~~~
gsg
Hmm, ok. Intel recommend avoiding it pretty strongly: I guess they overstate
their case. One cycle in (and one out?) isn't exactly crushing.

------
svantana
The title piqued my interest in how to implement rounding of floats, on the
bit-fiddling level - certainly not an easy task! Disappointingly though, the
article starts out by using type casting. If it's a solution on that level
we're after, why not just use roundf()?

~~~
pascal_cuoq
If you feel that using type casting to implement round-to-nearest-integer is
cheating, an earlier post is on the subject of implementing a substitute for
type casting without type casting.

It uses only floating-point arithmetic, but it should still satisfy your
curiosity for bit-fiddling.

http://blog.frama-c.com/index.php?post/2013/05/01/A-conversionless-conversion-function2

------
Strilanc
Doesn't the modulus function avoid most of the issues, here?

    
    
        float RoundToNearestWholeNumber(float f) {
            if (float.IsNaN(f) || float.IsInfinity(f)) return f;
            if (f < 0) return -RoundToNearestWholeNumber(-f);
            
            // make sure we're working with a consistent value
            float v = DiscardExtraPrecision(f);
            // the fractional part is guaranteed to be representable
            float r = v % 1;
            // largest whole value up to v (the floor)
            float w = v - r;
            // switch to the ceiling if in the upper half
            if (r > 0.5) return w + 1;
            return w;
        }
    

Feel free to disillusion me.

------
X4
Hi,

I don't understand the problem with a) rounding floats to integer and I don't
understand b) why you would ever have to overflow an integers.

a) Wouldn't it solve the problem to write an algorithm that can carry out the
mathematical operations +, -, /, * virtually, using strings? I mean, you want
it to calculate 99999999999999999999999999999999999999999999999999999 +
9999999999999999999999999999, so why not teach it how we do this by hand
instead of having to use bit-hacking tricks? Maybe my example isn't good, but
I mean writing an algorithm that reads 999, for example, and knows that to
reach 1000 it now has to put a 1 in front of 999. You see, that's what I mean
by manually.

b) And can you please explain "why" it isn't enough to use ie. a linked list
of 8bit virtual address spaces and allocate using virtual address spaces
everything that fits into RAM and everything that goes beyond the addressable
space onto swap?

Why would you have to worry about flipping bits in an overflowed
integer/float when you could use that mechanism? Don't you just recursively
run through the virtual address map until you're able to modify the allocated
data in those memory segments again? I don't see a problem with this
implementation, except maybe for performance.

When the algorithm knows about the 32-bit limit, for example (a constant),
you can tell it not to allocate real memory of that size, but to allocate
virtual memory dependent on the input data size. Yes, I know that the CPU has
built-in mechanisms for the mathematical operations, but doing it manually
would solve the problem, from my limited perspective.

Sorry if I'm blind for seeing the problem, I'm not ignorant, just really
curious in how this can be solved once and forever.

~~~
omra
a) That solution is sometimes implemented. The problem is: it's slow, very,
very slow. It is not nearly as fast as regular integer arithmetic. (Also, +,
-, /, and * aren't called axioms.) And how do you handle things such as
powers and exponentiation (especially with non-integer arguments)?

b) Search for "BigInt libraries". The problem is the same as with the others:
it's slow. You still have to allocate memory, and the operations aren't as
fast. However, BigInt libraries are widely used when necessary.

~~~
X4
@omra, why downvote a question? It wasn't a proposed solution to the problem;
it was a question about whether there is a solution similar to that, and
otherwise why it would be wrong. I had added that I believe this would be
slow, if you read it carefully.

Btw. "axiom" was probably an unlucky translation of "Körperaxiom". See:
<http://de.wikipedia.org/wiki/K%C3%B6rper_%28Algebra%29> The english wikipedia
says:

"The most common way to formalize this is by defining a field as a set
together with two operations, usually called addition and multiplication, and
denoted by + and ·, respectively, such that the following axioms hold;
subtraction and division are defined implicitly in terms of the inverse
operations of addition and multiplication"

~~~
scott_s
omra likely did not downvote your question.

------
skrebbel
Just like there is a relatively clear separation between electrical
engineering and software engineering, I'm starting to think that we need
separate words for this kind of stuff and for hacking up a CRUD-some-tables
MVP using Ruby on Rails. They're nearly entirely unrelated skills.

I, for one, am very happy that I do not need to think about this stuff. I
think that rounding a float to an integer is super easy:

    
    
        5.3.round

------
v-yadli
Now that we know floats have coarse granularity on large scale, why not reduce
the scale first?

    
    
        float fr = (int)f;
        if (f - fr >= 0.5f) fr += 1.0f;
        return fr;
    

It works for all the examples in the article. I'm not saying that I've created
a good solution. I'm saying that the author is trying to create a problem out
of nowhere.

I thought it would be an article teaching people how to manually do float
rounding according to a specific fp encoding scheme... :-(

~~~
pascal_cuoq
This is a good solution that I did not think of at the time. But it is no less
expensive than the final (int)(f + 0.49999997f), to compare apples with apples
(both still need to be guarded against overflow in the conversion).

------
Someone
I don't think that _better function that works_ actually works flawlessly;
when passed a NaN or -Infinity, it triggers undefined behavior.

Still, this is a nice article.

~~~
pascal_cuoq
The contract says that you have to call the function with 0 <= f <= FLT_MAX.
So if you call it with NaN, any infinity, or a negative number other than -0
(which happens to satisfy 0 <= f), you are using it outside its
specification.

In that, it is like a standard library strlen(). It does not work if you pass
it NULL because it's specified as accepting well-formed strings.

------
stephencanon
There are lots of cute tricks for rounding to integer and to integral values.
A few of my favorites:

1\. Round to an integral value according to the current rounding mode,
allowing inexact to be set (rintf( ) in C, pretty close to what Pascal is
going for in his example):

    
    
        // All suitably large floats are integral already.
        if (fabsf(x) >= 0x1.0p23f) return x;
        // Otherwise we can force rounding to occur in the
        // units position by simply adding and subtracting
        // 2**23 with the correct sign.
        const float forceRound = copysignf(0x1.0p23f, x);
        return x + forceRound - forceRound;
    

If you want the result as an integer and it’s in the [-2^23, 2^23] range, you
can just rip out bits 0:22 of the floating-point representation of x +
forceRound via a mask, which is really simple (faster than hardware convert-
to-integer instructions on some architectures!).
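
That mask trick can be sketched like this (a hypothetical helper, for nonnegative x below 2^23 only, assuming the default round-to-nearest-even mode and no x87 excess precision):

```c
#include <stdint.h>
#include <string.h>

/* After y = x + 2^23, y's exponent is pinned at 23, so bits 0:22 of
 * y's encoding hold the rounded integer directly (for 0 <= x < 2^23;
 * rounding happens in the current mode, round-to-nearest-even by
 * default). */
static int32_t round_via_mask(float x) {
    float y = x + 0x1.0p23f;     /* forces rounding at the units place */
    uint32_t rep;
    memcpy(&rep, &y, sizeof rep);
    return (int32_t)(rep & 0x007fffff);
}
```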

If the calls to fabsf and copysignf look like hazards to you, keep in mind
that good compilers lower them to one or two bitwise operations on most
architectures (though I tend to write these routines in assembly to avoid any
risk of missed optimization).

2\. Round to nearest, ties away from zero (roundf( ) in C, which can be
annoying because it doesn’t correspond to a hardware rounding mode on most
architectures):

    
    
        // All suitably large floats are integral already.
        if (fabsf(x) >= 0x1.0p23f) return x;
        // Now that we know (int32_t)x cannot raise invalid,
        // we can use that to get truncf(x).
        const float truncated = (int32_t)x;
        // Compare the residual to 0.5 to determine if the
        // result needs to be adjusted away from zero.
        if (fabsf(x - truncated) >= 0.5f)
            return truncated + copysignf(1.0f, x);
        return truncated;
    

This is also a pretty close analogue to what Pascal was considering. I like my
implementation more because it doesn’t spuriously raise the inexact flag
(whereas Pascal’s implementation does). This still isn’t a complete
implementation of roundf( ), as it gets the sign of zero wrong when the result
should be negative zero. Left as an exercise to the reader.

3\. Since people asked about doing the rounding using integer operations
elsewhere in the discussion: assume the input is known to be in the
interesting range already, and that we have some utility functions for
converting between a floating-point number and its encoding (which is a NOP
architecturally, but requires a union or memcpy or other type-punning in C):

    
    
        const uint32_t xRep = toRep(x);
        const uint32_t xAbs = xRep & 0x7fffffff;
        const uint32_t xSgn = xRep & 0x80000000;
    
        // If |x| > 2**23, or x is Inf or NaN the result is x.
        if (xAbs > 0x4b000000) return x;
    
        // If |x| < 1.0f, the result is +/-0 if |x| < 0.5f and
        // +/-1 otherwise.
        if (xAbs < 0x3f800000) {
            if (xAbs < 0x3f000000) return fromRep(xSgn);
            return fromRep(xSgn | 0x3f800000); 
        }
    
        // Otherwise, use the exponent of x to identify the
        // units bit of x.
        const int exponent = xAbs - 0x3f800000 >> 23;
        const uint32_t unitsBit = 0x00800000 >> exponent;
        // "Add 0.5 to x” by adding half the units bit to the
        // encoding.  If this doesn’t change the exponent it
        // really is equivalent to adding 0.5.  If it carries
        // into the exponent, then it’s not actually the same,
        // but it still does what we need it to do.  After
        // “adding 0.5”, we simply mask off the fractional
        // bits to get the result.
        return fromRep(xSgn | ((xAbs + (unitsBit >> 1)) & -unitsBit));
    

This doesn’t set inexact, obviously, which makes it unsuitable for some uses
(but actually nearbyint( ) is _required_ not to set inexact, so this is pretty
useful there). It’s actually even less code than it looks like; C tends to
make this sort of thing relatively verbose. On ARM a complete implementation
is about 12 instructions, for example.

N.B. although I write this sort of stuff professionally, I’m just sketching
code here without even checking if it compiles. Test before use.

~~~
pascal_cuoq
I notice that your bit-twiddling version avoids the conditional branch for
choosing whether to round up or down. Nice, but not easier to explain.

(0x00800000 >> exponent) is easier to explain than what I did (1 << (23 - e)).

> C tends to make this sort of thing relatively verbose.

If this were assembly, there would be a chance that the test for whether to
return the argument without change can be factored with the computation of the
quantity (23 - e) (which would be computed directly from the unbiased
exponent), so I would really have to try both before deciding on a variant.

> although I write this sort of stuff professionally,…

Since you are reading this (assuming you are reading this), may I ask you what
tools you use? I have been making do with CRlibm and a good C compiler,
occasionally relying on long double when extra precision was needed, but it
has been close to inadequate a few times.

Is there a standard multi-precision floating-point calculator that all
floating-point experts use for quick computations, with hexadecimal input and
output, or does one have to build one's own? I was thinking of forcing myself
to learn either bc or Gappa, but they both seem to have horrible syntax (that
is, not a syntax similar to something I'm used to).

~~~
stephencanon
> If this were assembly, there would be a chance that the test for whether to
> return the argument without change can be factored with the computation of
> the quantity (23 - e) (which would be computed directly from the unbiased
> exponent), so I would really have to try both before deciding on a variant.

It works out exactly as you suggest.

Tools: MPFR and GMP are extremely useful if using GPLv3 code doesn’t pose a
problem for you and the small performance overhead of arbitrary-precision
isn’t an issue. I have a small “big float” (not arbitrary precision, just
more digits than needed to design math library routines for “standard”
precisions) library that I wrote myself. It’s not very complete, but it has
everything I need for library development.

------
jstanley
Aside from what other commenters have said, this article seems to totally
ignore negative numbers.

~~~
niggler
The article explicitly says

"For the sake of simplicity, in this series of posts, we assume that the
argument is positive and we allow the function to round any which way if the
float argument is exactly in-between two integers."

As a result, adding the condition

    
    
        float myround(float f)
        {
            if(f < 0) return -myround(-f); /* <-- the fix */
            /* ... */
        }
    

suffices.

~~~
jstanley
My mistake, I didn't notice that.

