What Every Computer Scientist Should Know About Floating-Point Arithmetic 74 points by hamidr on Nov 21, 2012 | hide | past | web | favorite | 39 comments

 Personally I think one of the hardest things to wrap your head around is how it's power-of-two fractions that can be represented exactly, not power-of-10 fractions. In other words, 0.5 can be represented exactly as IEEE floating point, but 0.1 cannot.This is not inherent to floating point; a floating point representation that was based on decimal digits instead of binary ones would not have this issue.It's not too hard to understand limited precision and rounding; we deal with both all the time. The part that's hard to understand is that you have to perform rounding simply to put a number like "0.1" into a floating-point value.Specifically, in 0.1 as an IEEE float is represented as:`````` val = 0x3dcccccd sign = 0 significand = 4ccccd (1.6000000) exponent = 0x7b (-4) `````` So 0.1 is internally represented as 1.6000000 * 2^-4, while 0.5 has the much more understandable representation 1 * 2^-1.This is made even more confusing by the fact that rounding performed on output will make it appear that 0.1 has been exactly represented in many cases.So take a data type that can't exactly represent some numbers that look very simple in decimal, but that appears to represent them exactly when you print them out, but then compares not equal when you try to actually compare them. No wonder people get confused!
 It's easier to explain that decimal 0.1 in binary is 0.000110011001100110011001100110011001100110011001100110011001100110011001100... It's impossible to present accurately with any combination of significand*2^exponent in a finite-memory binary system. Just like 1/3 in decimal system can't be written down accurately.
 Sure, but it's not obvious what numbers can be represented exactly and which can't, and since the numbers get rounded on output to look exact, you can easily be misled into thinking that 0.1 can be represented.
 Any rational number can be represented as a fraction p/q, where p and q are integers with no common factors (reduced fraction).If that fraction is representable as an exact "decimal" number in base b that means that p/q = m/b^n = m * b^(-n), where m and n are integers too. For example, 3/4 = 75/10^2 or 0.75 in decimal, 3/2^2 or 0.11 in binary.That means p * b^n = m * q. We said p and q have no common factors, so all of q's factors go into b^n. That means that q divides b^n, or in other words:p/q is representable as an exact "decimal" number in base b if and only if q divides some power of b.For example, 0.1 is 1/10. But there is no power of 2 which is divisible by 10, so 0.1 is not exactly representable in binary.As another example, 1/3=0.33333.... because there is no power of 10 divisible by 3.
 > Just like 1/3 in decimal system can't be written down accurately.Or like integers in base pi! ;)
 You mean, some integers? 0 is perfectly representable. And so might be 1.
 Am I missing something, or isn't 1.6 * 2^-4 = 1.6 / 16 = 0.1? Seems exact to me.
 Yes, 1.6 is. But the actual value being stored is between 1.6 and 1.6000000000000001 (in the double case) -- you have to round it to display it as decimal.I'm working on a program right now to give a detailed dump of a double/float's value, to make this as clear as possible (and I'm learning a lot in the process).EDIT: got an initial revision out there; check out: https://github.com/haberman/dumpfp
 The D programming language takes the (so far as I know) unique approach that the implementation is allowed to evaluate expressions in the highest precision it wants to - the different floating point data types only represent storage formats.The idea is that algorithms should be designed with a minimum required precision, but no maximum, and should not fail if higher precision is used.I think this rule serves very well.
 The .NET CLI takes the same approach. Any implementation is allowed to use a higher precision if it wants. The stated rational is for performance:"This design allows the CLI to choose a platform-specific high-performance representation for floating-point numbers until they are placed in storage locations. For example, it might be able to leave floating-point variables in hardware registers that provide more precision than a user has requested. "
 Indeed most optimizing compilers (including C and Fortran) maintain full hardware precision in intermediate results. It's one of the reasons that debug and release versions of the same numerical code often give different results.
 That seems a bit terrible. Now it's D so it's not like it's a real problem but: if I went from one "standards compliant" implementation to another and suddenly my numerical code started being shit because it turns out my first compiler was silently more precise... there would be much wailing and gnashing of teeth.
 The vagaries of floating point mean this "terrible" scenario essentially happens already. For example, a "double" variable isn't the same from one machine to another. And debug vs release versions of the same numerical code, built with the same compiler and run on the same machine, can arrive at different answers. These things lead to much gnashing of teeth in the scientific programming community already.I have a numerical code that I cross-compile with gcc -O0, gcc -O3, clang, icc, and MS Visual C, all on the same machine. Each version can give slightly different results. The fact that they don't give wildly different results is mainly down to the stability (in the ODE sense) of the problems run. Unstable problems would definitely give different results.
 I have yet to see any algorithm, outside of a test suite, that gave worse results when precision was increased.
 That 'other standards compliant compiler' need not be the one with higher precision.Also, it might have higher precision in most cases but just not in that case where your code needs it most.Real-world examples are x86 compilers that use the 80-bit floating point registers for intermediates, but flush them to 64-bit results (sounds stupid, but it is not; flushing typically is done to the bytes that hold the result variable). There, the register allocation algorithm used by the compiler can be the determiner of which operation sequences get done in 80 and which in 64 bits.gcc: http://gcc.gnu.org/wiki/FloatingPointMath (the 'Note on x86 and m68080 floating-point math' section), http://gcc.gnu.org/onlinedocs/gcc-3.4.4/gcc/Disappointments.... ("On 68000 and x86 systems [...] you can find that a floating point value which is not a NaN is not equal to itself")Microsoft's compilers: http://msdn.microsoft.com/en-us/library/e7s85ffb.aspx
 With this approach, how do you ensure that a higher intermediate precision does not result in reduced accuracy due to double rounding?
 This article seems to appear every so often on HN and Reddit. It used to be at sun.com. I'm glad Oracle didn't 404 it.The title isn't accurate though. The article is mostly about using IEEE 754 to manage computational error, which is a pretty esoteric subject.Most programmers spend precious little time using floating point and don't need to worry about computational error. Even among those who use it a great deal -- those doing numerical analysis -- computational error doesn't seem to be something they concern themselves with. My guess is that in most instances it's swamped by experimental error.My thoughts on this article are:1: It's certainly good that it's available online. It'd be better with a different title.2: It's good that modern CPUs support IEEE 754 for the rare instance that people use it.3: It's worth reading the first third or so of it, just to get an idea of what computational error is4: The most important thing the "typical" programmer needs to know is that dollar values shouldn't be represented in floating point. Ideally they shouldn't be represented as integers either because the integer rounding truncates towards zero, which is not how an accountant would round.
 It's perfectly possible to generate accountant rounding using integers. If the accuracy required is one cent then the system must merely multiply by 10 before calculating and then use dutch round to the nearest 10 before dividing by 10. I'm pretty sure you could do it with quarter cent accuracy rather than 10ths of a cent.You just shouldn't calculate with 1 cent accuracy if the rounding methodology is important to your business, but truth be told most businesses have a lot more to optimize than the rounding methodology used.
 >> but truth be told most businesses have a lot more to optimize than the rounding methodology used.Perfectly true. I can see exceptions being managing other people's money--banks, accountants, and the IRS had damn well better get this right.
 > Most programmers spend precious little time using floating point and don't need to worry about computational error.I think this is true of experienced programmers, but I see inexperienced programmers tripping over floating point misunderstandings quite often. In almost every case they shouldn't have been using floating point at all.
 I'm considering creating a programming language to be used on the internet, are you saying it would be a bad idea to make all numbers in my new language floating point?
 If you stick to integer math with 64-bit floating point numbers, then everything actually works out pretty reasonably, since all 32-bit ints can be represented exactly as 64-bit floats (doubles). As long as you use floor/ceil/round where appropriate, you get most of the benefits of integer arithmetic, without the type checking (which you lose in a dynamic language anyways).Some people feel strongly against this, but for dynamically typed language I enjoy its simplicity (for static languages, type systems are meant to enforce constraints on types, so multiple number types can be useful). It has done well in [at least] javascript and lua, and while the former is anything but an example in how to design a programming language, IMHO, the latter may be.So no, it wouldn't be a bad idea to make all numbers 64-bit floats. It sounds like you might likely be compiling to javascript, in which case using 64-bit floats will mean you have one less problem to deal with while implementing it.
 I think the latex typeset pdf is much nicer to read.
 The big thing I remember from my Numerical Computing course in college was floating point arithmetic is not commutative. Totally. Rocked. My world. You have to work hard to retain your significant digits throughout the computation.
 Floating point addition and multiplication are commutative, but are not necessarily associative.
 You are correct. I should have been more clear. Our expectation for addition to be associative is deeply ingrained and it's shocking to discover that's not the case when performing floating point computations.
 After spending a week ripping out someone's poorly thought out floating point code, the lesson I learned is that you just shouldn't use floating point unless you really need to.For example, if you're working with money, you don't need floating point.
 Not that big a deal. Most floating point numbers are imprecise rounded values to the degree of precision that is being used. An imprecise binary representation of an already imprecise rounded value at the edge of precision really doesn't matter - and if it does the problem is that you probably needed more precision in the first place.
 What Every Web Programmer Should Know About Floating Point:0. Do not calculate monetary amounts in FP. Businesses don't react well to being told about acceptable imprecision.1. Javascript is floating point.2. Do not calculate money amounts in Javascript.There are decimal libraries for Javascript (eg [1][2]). Use those if you need to calculate monetary amounts.
 I'm assuming that storing \$129.99 as 12999 and dividing by 100 when necessary would be preferable?
 The problem is that every javascript number is floating point.Even if you think you have integers ... sometimes you don't.
 A number of years ago the project I was working had problems in Ada because the conversion of floating point numbers from character string to internal floating point by the Ada compiler did not always produce the same values as runtime conversion. I had fun reconciling the two.
 What Every Programmer Should Know About Floating-Point Arithmetic:
 Should be http://floating-point-gui.de - sorry for that
 What every Computer Scientist should be asking about Floating-point arithmetic: Why aren't there types for fixed-point/rational arithmetic exposed by your language and/or standard library?
 Would also love 128 bit ints to be commonly available. And I even have a use for them: representing the size in bytes of some extremely large storage configurations (sparse ones -- as yet a fully allocated > 2^64 byte store would cost several billions).
 Don't most of the newer languages have these types? And even the older languages support them now with their most popular libraries (e.g. C++/Boost).
 Because most popular languages are a leaky abstraction on top of the CPUs they were built for.

Search: