It probably marks the end of the quest to find an unconditionally correct and fast algorithm for printing floating-point numbers. The prior state of the art, Grisu3 (2010), is great but is only fast for about 99.5% of inputs; it was nevertheless so useful that several language implementations adopted it (IIRC, including V8, SpiderMonkey, Go, Julia and Rust). The new algorithm, Errol3, improves on Grisu3 by using the double-double floating-point format (instead of a custom 64-bit FP format) and by applying a more precise error analysis.
Quoting the highlight of the paper: (emphases original)
> After removing the narrowing and widening steps, we used the enumeration algorithm to generate a list of possibly incorrect or suboptimal inputs. Each input was run using Errol2 to enumerate a complete list of inputs that do not generate correct and optimal outputs. In total, we found that only 45 inputs (of the nearly 2^64 inputs) generate incorrect or suboptimal results; all other inputs are guaranteed to generate correct and optimal output. In order to correctly and optimally handle the failing inputs, they are hard-coded into a lookup table that maps the failing inputs to correct and optimal outputs. Combining the special handling of integers and this lookup table, Errol3 is guaranteed to be correct and optimal without runtime checks.
That makes one wonder: what's so special about these 45 values? I think it would be far closer to "the end of the quest" if that could be determined, so that they could also be handled by the rest of the algorithm somehow.
10^23 is not exactly representable in the IEEE 754 binary64 format (aka double). 10^23 - 3 * 2^23, 10^23 - 2^23 and 10^23 + 2^23 are exactly representable however (the spacing between adjacent doubles near 10^23 is 2^24), so any number from 10^23 - 2 * 2^23 to 10^23 (both inclusive, because the default round-half-to-even rule breaks the tie toward 10^23 - 2^23, whose significand is even) rounds to 10^23 - 2^23. Clearly the shortest correct output for that double would be 10^23, but 10^23 is the midpoint between two FP values! Both Grisu and Errol use a fixed-size mantissa (64 bits and 106 bits respectively) and consequently have a conservative error margin, and that margin always contains such midpoints.
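You can see this concretely with a minimal C sketch (assuming a C99 compiler and a libc such as glibc whose %f output is exact; compile with -lm):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double x = 1e23;  /* the literal 1e23 rounds (ties-to-even) to 10^23 - 2^23 */
        double up = nextafter(x, INFINITY);

        printf("below: %.0f\n", x);       /* 99999999999999991611392  = 10^23 - 2^23 */
        printf("above: %.0f\n", up);      /* 100000000000000008388608 = 10^23 + 2^23 */
        printf("ulp:   %.0f\n", up - x);  /* 16777216 = 2^24 */
        return 0;
    }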
Errol3 uses a more precise error analysis to show that these pathological midpoints are always integral and confined to a specific range. Since that range is bounded (<= 2^131), it simply tries to compute the midpoint exactly with 128-bit integer arithmetic, working around the conservative error margin. This removes a major source of pathological inputs, but the conservative error margin still exists elsewhere and some numbers can still fall into that hole. The final enumeration shows that there are only 45 such numbers.
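To illustrate the idea (this is only a sketch, not Errol3's actual code; it assumes a GCC/Clang-style compiler with __uint128_t and that the value and half-ulp involved are non-negative integers; compile with -lm):

    #include <math.h>
    #include <stdio.h>

    /* Compute the midpoint between v and the next larger double exactly as a
     * 128-bit integer, assuming both v and ulp/2 are integers that fit. */
    static unsigned __int128 upper_midpoint(double v) {
        double ulp = nextafter(v, INFINITY) - v;   /* exact for adjacent doubles */
        return (unsigned __int128)v + (unsigned __int128)(ulp / 2);
    }

    int main(void) {
        unsigned __int128 pow10_23 = 1;
        for (int i = 0; i < 23; i++) pow10_23 *= 10;

        /* The upper midpoint of the double nearest to 10^23 is exactly 10^23. */
        printf("%d\n", upper_midpoint(1e23) == pow10_23);   /* prints 1 */
        return 0;
    }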
So, how can we make sure that such numbers are converted correctly? The answer from Steele & White's paper is to perform arbitrary-precision integer arithmetic and leave nothing up to chance. The Grisu3 algorithm by Loitsch showed that 64 bits of precision correctly converts about 99.5% of numbers. As you increase the precision you improve the accuracy, but there is no guarantee that even 128 bits or 256 bits is enough to correctly convert all numbers.
Instead of trying to be 100% correct, our algorithm sidesteps the issue entirely by enumerating all failures a priori.
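A rough sketch of what such a lookup table could look like in C (the entry below is a placeholder, not one of the actual 45 values from the paper):

    #include <stdint.h>
    #include <string.h>

    /* Map a known-failing bit pattern to a hard-coded shortest output. */
    struct fixup { uint64_t bits; const char *digits; int exp10; };

    static const struct fixup fixups[] = {
        { 0, "", 0 },  /* placeholder; the real table holds the 45 failing inputs */
    };

    static int lookup_fixup(double v, const char **digits, int *exp10) {
        uint64_t bits;
        memcpy(&bits, &v, sizeof bits);
        for (size_t i = 0; i < sizeof fixups / sizeof fixups[0]; i++) {
            if (fixups[i].bits == bits) {
                *digits = fixups[i].digits;
                *exp10  = fixups[i].exp10;
                return 1;  /* use the hard-coded answer */
            }
        }
        return 0;  /* fall through to the normal conversion */
    }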
Python currently uses David Gay's algorithm and would likely benefit from Errol3. However, a concern arises when switching from integer arithmetic to floating point -- the algorithm becomes susceptible to FP rounding modes, and to C compilers generating code with double rounding when mixing 64-bit doubles with 80-bit extended precision. We encountered both problems when implementing Python's math.fsum() from Shewchuk's "Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates".
> Python currently uses David Gay's algorithm and would likely benefit from Errol3. [...] the algorithm becomes susceptible to FP rounding modes, and to C compilers generating code with double rounding when mixing 64-bit doubles with 80-bit extended precision.
That would be really problematic, since double-double arithmetic relies on recovering the rounding error of prior operations. If you are interested in an integer-only algorithm, Rust's Grisu3 implementation derives from my own public-domain reimplementation of Grisu.
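For reference, the error-recovery step in question is Knuth's TwoSum, the basic building block of double-double arithmetic; its exactness guarantee is precisely what double rounding through 80-bit registers can break. A minimal sketch, assuming strict binary64 arithmetic:

    #include <stdio.h>

    /* TwoSum: s + err == a + b exactly, provided every operation below is
     * rounded once to binary64.  Double rounding via x87 80-bit registers can
     * violate that, which is the concern raised above for double-doubles. */
    static void two_sum(double a, double b, double *s, double *err) {
        double sum = a + b;
        double bv  = sum - a;
        *s   = sum;
        *err = (a - (sum - bv)) + (b - bv);
    }

    int main(void) {
        double s, e;
        two_sum(1e16, 1.0, &s, &e);
        printf("%.17g %.17g\n", s, e);  /* 10000000000000000 1: the lost 1 is recovered */
        return 0;
    }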
The artifact reveals that the authors of the paper benchmarked Grisu3 in debug mode (low compiler optimization level, assertions enabled).
Changing their scripts to actually build Grisu3 in release mode turns the tables:
==== Absolute Results ====
Errol3               1196 cycles
Grisu3                802 cycles
Dragon4              6176 cycles
Grisu3 w/fallback     833 cycles

==== Relative Speedup of Errol ====
Grisu3 w/fallback    0.70x
So ultimately it seems that Grisu3 is still faster.
I also wanted to try benchmarking it on ARM, but it actually fails to build due to its dependency on __uint128_t, which does not seem to be supported by my cross-compilation toolchain (at least for 32-bit ARM).
The solution to this would presumably be to compile Errol with its own set of compiler flags, rather than with your normal ones.
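Where __uint128_t is unavailable (as on most 32-bit ARM toolchains), a portable fallback would be a pair of 64-bit limbs, along the lines of the sketch below (not from the Errol sources; every 128-bit operation Errol uses would need an equivalent):

    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } u128;   /* 128-bit value as two 64-bit limbs */

    static u128 u128_add(u128 a, u128 b) {
        u128 r;
        r.lo = a.lo + b.lo;
        r.hi = a.hi + b.hi + (r.lo < a.lo);     /* carry out of the low limb */
        return r;
    }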
I will have to look at this further. It might mean generating a lookup table for every rounding mode and using their union. In general, changing the global state of the FPU seems to be a risky proposition.
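To make the rounding-mode concern concrete: the double a given decimal parses to can depend on the current rounding mode, so a lookup table that is valid under round-to-nearest need not be valid under other modes. A small sketch, assuming a libc whose strtod honours the current rounding mode (modern glibc does; compile with -lm for fesetround):

    #include <fenv.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        fesetround(FE_TONEAREST);
        double a = strtod("1e23", NULL);   /* 10^23 - 2^23 under round-to-nearest */
        fesetround(FE_UPWARD);
        double b = strtod("1e23", NULL);   /* 10^23 + 2^23 under round-upward */
        fesetround(FE_TONEAREST);
        printf("%.0f\n%.0f\n", a, b);      /* two different doubles from the same string */
        return 0;
    }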
I have seen some performance-oriented JSON encoders use Grisu2 instead of Grisu3, trading guaranteed-shortest output for speed. They may be willing to switch to Errol3 in spite of such configuration problems...
If you are willing to print numbers that are correct but possibly suboptimal, you can also use Errol2 and modify it to remove the check. This would give you the speed of Errol3 without needing a lookup table. I might look at making such an implementation and maybe call it Errol2b.
[math for calculating approximate error rate]
With an input floating-point format with p bits of precision and an intermediate format with q bits of precision, the "chance" of a value failing is close to 1 in 2^(q-p). With 2^p values per exponent, you will see about 2^(2p-q) errors per exponent. In the case of Grisu, q is 64 and p is 53, giving about 2^42 errors per exponent. These numbers only give rough estimates; you would have to verify them manually.