Printing Floating-Point Numbers: A Faster, Always Correct Method [pdf] (ucsd.edu) 90 points by lifthrasiir on Jan 17, 2016 | 17 comments

 Some context: http://www.serpentine.com/blog/2011/06/29/here-be-dragons-ad...

 This probably marks the end of the quest to find an unconditionally correct and fast algorithm for printing floating-point numbers. The prior state of the art, Grisu3 (2010), is great but is only fast for 99.5% of inputs; it was nevertheless so useful that there are now several language implementations with Grisu3 (IIRC including V8, SpiderMonkey, Go, Julia and Rust). The new algorithm, Errol3, improves on Grisu3 by using the double-double floating-point format (instead of a 64-bit custom FP format) and applying a more precise error analysis.

 Quoting the highlight of the paper (emphases original):

 > After removing the narrowing and widening steps, we used the enumeration algorithm to generate a list of possibly incorrect or suboptimal inputs. Each input was run using Errol2 to enumerate a complete list of inputs that do not generate correct and optimal outputs. In total, we found that only 45 inputs (of the nearly 2^64 inputs) generate incorrect or suboptimal results; all other inputs are guaranteed to generate correct and optimal output. In order to correctly and optimally handle the failing inputs, they are hard-coded into a lookup table that maps the failing inputs to correct and optimal outputs. Combining the special handling of integers and this lookup table, Errol3 is guaranteed to be correct and optimal without runtime checks.
 > In order to correctly and optimally handle the failing inputs, they are hard-coded into a lookup table that maps the failing inputs to correct and optimal outputs.

 That makes one wonder: what's so special about these 45 values? I think it would be far closer to "the end of the quest" if that could be determined, and if those inputs could also be handled by the rest of the algorithm somehow.
 IIRC both Grisu and Errol have a fairly specific class of pathological inputs, referred to in the paper as pathological midpoints. I have implemented Grisu3 myself, so I'll use that as an example. (I don't know if this example is also problematic in Errol1.)

 10^23 is not exactly representable in the IEEE 754 binary64 format (aka double). 10^23 - 3 * 2^23, 10^23 - 2^23 and 10^23 + 2^23 are exactly representable, however, so any number from 10^23 - 2^24 to 10^23 (inclusive at both ends, because IEEE 754 rounds ties to the even mantissa here) rounds to 10^23 - 2^23. Clearly the shortest correct output in that range would be 10^23, but 10^23 is the exact midpoint between two FP values! Both Grisu and Errol use a fixed-size mantissa (64 bits and 106 bits respectively) and consequently have a conservative error margin, and that margin will always contain such midpoints.

 Errol uses a more precise error analysis to show that pathological midpoints are always integral and lie in a specific range. Since that range is <= 2^131, it simply calculates the midpoint as a 128-bit integer and works around the conservative error margin. This removes a major source of pathological inputs, but the conservative error margin still exists and numbers may still fall in that hole. The final enumeration shows that there are only 45 such numbers.
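 For anyone who wants to see this concretely: Python's repr already produces the shortest round-tripping decimal, so the 10^23 example above can be checked directly (a quick sanity check, not code from either paper):

```python
# Quick sanity check of the 10^23 example: the double nearest to 10^23
# stores 10^23 - 2^23, yet its shortest correct output is "1e+23", which
# sits exactly on the midpoint between two adjacent doubles.
d = float(10**23)                     # nearest double to 10^23
assert int(d) == 10**23 - 2**23       # i.e. 99999999999999991611392
assert 10**23 - int(d) == 2**23       # half an ulp away (ulp = 2^24 here)
assert repr(d) == '1e+23'             # shortest output lands on the midpoint
assert float('1e+23') == d            # and it round-trips
```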
 Author here. The simplest way to describe those 45 values: the value we want to output (the shortest value) is very close to the midpoint between two floating-point values. Since the best answer is so close to a midpoint, it's difficult for the algorithm to decide which direction that number will round.

 So, how can we make sure that such numbers are converted correctly? The answer from Steele & White's paper is to perform arbitrary-precision integer arithmetic and leave nothing up to chance. The Grisu3 algorithm by Loitsch showed that 64 bits of precision correctly converts about 99.5% of numbers. As you increase the amount of precision you improve your accuracy, but there is no guarantee that even 128 bits or 256 bits is enough to correctly convert all numbers. Instead of trying to be 100% correct that way, our algorithm sidesteps the issue entirely by enumerating all failures a priori.
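 The "leave nothing up to chance" part can be sketched with exact rational arithmetic. This is only the correctness check, not the digit-generation loop, and the helper names (`rounds_to`, `neighbors`) are mine, not from Steele & White; it needs Python 3.9+ for math.nextafter:

```python
# Sketch (not code from either paper): decide with exact rational
# arithmetic whether a decimal string rounds (nearest, ties-to-even)
# back to a given double.
import math
import struct
from fractions import Fraction

def neighbors(x):
    # The adjacent representable doubles below and above x.
    return math.nextafter(x, -math.inf), math.nextafter(x, math.inf)

def rounds_to(candidate, x):
    v = Fraction(candidate)                    # exact value of the string
    lo, hi = neighbors(x)
    mid_lo = (Fraction(lo) + Fraction(x)) / 2  # exact lower midpoint
    mid_hi = (Fraction(x) + Fraction(hi)) / 2  # exact upper midpoint
    # Ties go to the even mantissa, so the rounding interval is closed
    # at the midpoints exactly when x's mantissa is even.
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    if bits & 1 == 0:
        return mid_lo <= v <= mid_hi
    return mid_lo < v < mid_hi
```

 Run against the 10^23 example elsewhere in the thread, this confirms that "1e+23" is a valid output for the double 10^23 - 2^23 even though the decimal value is an exact midpoint.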
 This paper is first-rate. That said, it would have been nice to have links to the C code used for the performance comparisons (reproducible research).

 Python currently uses David Gay's algorithm and would likely benefit from Errol3. However, there is a concern that arises when switching from integer arithmetic to floating point: the algorithm becomes susceptible to FP rounding modes, and to C compilers generating code with double rounding when mixing 64-bit doubles with 80-bit extended precision. We encountered both problems when implementing Python's math.fsum() from Shewchuk's "Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates".
 http://goto.ucsd.edu/~andrysco/errol/ seems to be the artifact in question. (My friend also noted that there is a GitHub repo at https://github.com/marcandrysco/Errol, but I'm not sure if it is intended to be canonical.)

 > Python currently uses David Gay's algorithm and would likely benefit from Errol3. [...] the algorithm becomes susceptible to FP rounding modes and susceptible to C compilers generating code with double rounding when mixing 64-bit doubles with 80-bit extended precision.

 That would be really problematic, since double-double arithmetic relies on recovering the errors from prior operations. If you are interested in an integer-only algorithm, Rust's Grisu3 implementation derives from my own public-domain reimplementation of Grisu [1].
 Author here. We intend for the GitHub repo (https://github.com/marcandrysco/Errol) to be the primary source for Errol. We still want to add a few improvements to the existing implementation: selecting the middle-most shortest number and an implementation for 32-bit floats (and maybe __float128).
 I assume that for 32-bit floats, you could mostly do away with the HP stuff and just use 64-bit floats, right? Though it does make me wonder if that's always faster than two 32-bit floats. I would think so.
 It should be faster to replace the HP numbers entirely with 64-bit floats. Regardless, I think it would be useful for two purposes: 1) it would serve as an implementation for chips with only 32-bit floats (maybe some kinds of embedded systems), and 2) we could exhaustively test and verify that the algorithm works on all 32-bit numbers, providing more confidence in the 64-bit version.
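 For what it's worth, a brute-force reference for binary32 is easy to write, since 9 significant digits always suffice; something like this sketch of mine (not the Errol approach, and the helper names are hypothetical) could serve as the oracle for that exhaustive 32-bit test:

```python
# Brute-force shortest-repr oracle for binary32: emulate a 32-bit float
# via struct, then try 1..9 significant digits until the decimal
# round-trips.
import struct

def to_f32(x):
    """Round a double to the nearest binary32 value (returned as a double)."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

def shortest_f32(x32):
    """Shortest decimal that round-trips through binary32; 9 significant
    digits always suffice for finite binary32 values."""
    for prec in range(1, 10):
        s = f'{x32:.{prec}g}'
        if to_f32(float(s)) == x32:
            return s
    return repr(x32)  # unreachable for finite inputs
```

 Exhausting all 2^32 bit patterns against an oracle like this is exactly the kind of confidence check described above.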
 I took a quick glance through the code... it looks really good. I wonder how hard it would be to tweak it to not require compiler support for __uint128_t by replacing it with manual 64-bit ops... or is support for that type no longer limited to gcc/clang?
 Good catch! I mostly used __uint128_t out of haste. Ideally, I would like to remove dependencies on large integer types. For now, this remains future work.
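 Dropping the __uint128_t dependency usually means the standard portable trick: split each 64-bit operand into 32-bit halves and accumulate the four partial products. A sketch in Python as executable pseudocode (Python's big ints serve as the oracle here; a C version would return two uint64_t halves):

```python
# Portable 64x64 -> 128-bit multiply without a 128-bit integer type:
# four 32x32 -> 64 partial products plus carry propagation.
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def mul_64x64_to_128(a, b):
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32
    p0 = a_lo * b_lo          # contributes to bits 0..63
    p1 = a_hi * b_lo          # contributes to bits 32..95
    p2 = a_lo * b_hi          # contributes to bits 32..95
    p3 = a_hi * b_hi          # contributes to bits 64..127
    # carry out of the low 64 bits
    mid = (p0 >> 32) + (p1 & MASK32) + (p2 & MASK32)
    lo = (p0 + ((p1 & MASK32) << 32) + ((p2 & MASK32) << 32)) & MASK64
    hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32)
    return hi, lo             # (a * b) as high and low 64-bit words
```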
 > http://goto.ucsd.edu/~andrysco/errol/ seems to be the artifact in question.

 The artifact reveals that the authors of the paper benchmarked Grisu3 in debug mode (low compiler optimization level, ASSERTions enabled). Changing their scripts to actually build Grisu3 in release mode turns the tables:

     ==== Absolute Results ====
     Errol3               1196 cycles
     Grisu3                802 cycles
     Dragon4              6176 cycles
     Grisu3 w/fallback     833 cycles

     ==== Relative Speedup of Errol ====
     Grisu3               0.67x
     Dragon4              5.16x
     Grisu3 w/fallback    0.70x

 (These are results on my machine with both Grisu3 and Errol3 compiled with -O3; the original scripts build Errol with -O2, which makes it slower still relative to Grisu3.) So ultimately it seems that Grisu3 is still faster.

 I also wanted to try benchmarking it on ARM, but it actually fails to build due to its dependency on __uint128_t, which does not seem to be supported by my cross-compilation toolchain (at least for 32-bit ARM).
 > However there is a concern that arises when switching from integer arithmetic to floating point -- the algorithm becomes susceptible to FP rounding modes and susceptible to C compilers generating code with double rounding when mixing 64-bit doubles with 80-bit extended precision.The solution to this would presumably be to compile Errol with its own set of compiler flags, rather than with your normal ones.
 This problem is even trickier than that. The core of the algorithm should be robust to different rounding modes, but the values in the lookup table may differ based on the rounding mode. While it may seem odd for the rounding mode to change at run-time, it has happened before: https://blogs.msdn.microsoft.com/oldnewthing/20080703-00/?p=...

 I will have to look at this further. It might mean generating a lookup table for every rounding mode and using their union. In general, changing the global state of the FPU seems to be a risky proposition.
 Hmm, just out of curiosity, would it be possible to tweak Grisu to use the same error analysis as Errol? While possibly slower on modern architectures, it might result in a compact, fast-enough and hassle-free algorithm that is good for most [1] purposes.

 [1] I have seen some performance-oriented JSON encoders using Grisu2 instead of Grisu3 for performance. They may be willing to switch to Errol3 in spite of such configuration problems...
 That really depends on what you are trying to get out of the algorithm. Using 64-bit integers, I don't think you can turn Grisu2 into a failure-free algorithm. However, you can definitely improve its reliability by special-casing small integers (as is done in Errol2/Errol3). By my estimates, this should improve Grisu's success rate from ~99.5% to ~99.95% [see math at end]. The remaining 0.05% is far too large for enumeration and a lookup table.

 If you are willing to print numbers that are correct but possibly suboptimal, you can also use Errol2 and modify it to remove the checking. This will give you the speed of Errol3 without the lookup table. I might look at making such an implementation and maybe call it Errol2b.

 [Math for the approximate error rate] With an input floating-point format with p bits of precision and an intermediate format with q bits of precision, the "chance" of a value failing is close to 1 in 2^(q-p). With 2^p values per exponent, you're going to see about 2^(2p-q) errors per exponent. In the case of Grisu, q is 64 and p is 53, which comes to about 2^42 errors per exponent. These numbers only give rough estimates; you'd have to verify them manually.
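 The arithmetic in that estimate can be checked directly (a sanity check of the numbers above, nothing more):

```python
# Back-of-envelope check: p = 53 bits of input precision (binary64),
# q = 64 bits of intermediate precision (Grisu's custom FP format).
p, q = 53, 64
failure_rate = 2.0 ** (p - q)       # chance of landing in the error margin
assert failure_rate == 1 / 2048     # ~0.049%, hence ~99.95% success
assert 2 ** (2 * p - q) == 2 ** 42  # estimated hard cases per exponent
```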
 Reading that must have taken me two hours, but it was a good read.
