
Formatting floating point numbers - matt_d
http://www.zverovich.net/2019/02/11/formatting-floating-point-numbers.html
======
floitsch
Author of Grisu here.

It's really surprising how long it took to find the most efficient algorithms
for double conversions. Coonen had already published a good algorithm in 1980,
but that one was kind of lost.

For a long time Steele & White's algorithm was the state-of-the-art (with some
improvements here and there over time).

Now, Ryu is by far the fastest algorithm out there, but that took ages...

Unfortunately, there doesn't seem to be an easy-to-use, complete library for
Ryu yet.

The Grisu library (https://github.com/google/double-conversion) is probably
still the go-to library if you don't want to implement it again...

~~~
haberman
My understanding is that _fixed-length_ conversion to string is more or less
trivial, the tricky part is writing a fast algorithm to compute the _shortest
round-trippable_ conversion to string, correct?

~~~
floitsch
Depends.

"Correct" conversion, where the length doesn't matter (which could be called
"fixed length" for a certain length that is big enough), is easier, because
one can avoid a lot of rounding and imprecision issues: using the right
technique, you just produce enough digits until the imprecisions don't matter
anymore.

However, fixed length is as difficult as shortest if the length is limited,
since the last digit sometimes lies on the boundary and thus runs into the
same difficulties as the shortest digit. Think of it this way: needing to
decide whether 6 digits is enough is often similar to asking whether the 6th
digit is a 0 or a 9 (roughly speaking). This means that producing the best
fixed-length representation (for length 6) runs into exactly the same
question.
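To make the distinction concrete, here is a small JavaScript sketch (not from the comment; it assumes V8-style formatting, where `toString` emits the shortest round-trippable form while `toPrecision`/`toFixed` emit fixed-length forms):

```javascript
// Shortest round-trippable output vs. fixed-length output for the same doubles.
const shortest = (0.1).toString();     // shortest form that parses back to the same double
const fixed20 = (0.1).toPrecision(20); // enough digits to expose the stored value
const fixed17 = (0.3).toFixed(17);     // a fixed length that lands on the hard boundary

console.log(shortest); // "0.1"
console.log(fixed20);  // "0.10000000000000000555"
console.log(fixed17);  // "0.29999999999999999"
```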

------
cbsmith
This highlights, for me, one of the reasons why our obsession with "human
readable" serialization formats is so misplaced. You're burning up a lot of
CPU power here, potentially creating a synchronization point, just to do
something that could be addressed with a simple 4- or 8-byte memcpy.

~~~
tomxor
It also creates a lot of misconceptions of how accurate floating point math
is.

An interesting aspect of decimal formatting is how frequently it masks
representation error (i.e. the encoding of 0.1, etc.): because the error is
symmetrical in the formatter/parser, it makes such non-representable fractions
appear to be stored perfectly to unsuspecting users.

This can be quite deceptive; if more users were aware of just how many of the
simple rational decimals they input are converted into imprecise
representations, they probably wouldn't trust computers as much as they do. To
confuse things more, when operating on periodic representations the result
often matches the representation error of the equivalent accurate decimal
value encoded directly (i.e. there were errors, but everything canceled out
through formatting) - when they occasionally do not (e.g. 0.1 + 0.2) it makes
the problem appear all the more elusive.

I think this detail is often lost in explanations of 0.1 + 0.2, that is:
representation error is extremely common; 0.1 + 0.2 is merely one of the cases
where it both persists through the formatter AND you notice it, because the
inputs were short decimals and it's so obvious that the output should be a
non-periodic decimal.

TL;DR formatting floats to decimals makes us trust floating point math far
more than we should - it's healthy to remember that the formatting process is
necessarily imprecise and that you are merely looking at a proxy for the
underlying value. Remember that next time you look at _seemingly_ non-periodic
decimal output.
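A quick JavaScript illustration of both cases (a sketch, assuming IEEE 754 doubles: errors canceling through the formatter vs. surviving it):

```javascript
// 0.1, 0.2, and 0.4 are all stored inexactly, but the formatter usually hides it.
const masked = 0.1 + 0.4;  // the two representation errors cancel in the final rounding
const visible = 0.1 + 0.2; // here they don't, so the error leaks into the output

console.log(masked);  // 0.5
console.log(visible); // 0.30000000000000004
```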

~~~
bunderbunder
Or perhaps it makes us mistrust more than we should? How often are we working
on problems where the difference between 0.1 and 0.100000000000000006 is of
any practical importance?

When I format a float out to 5 decimal places, I'm sort of making a statement
that anything beyond that doesn't matter to me.

~~~
tomxor
> When I format a float out to 5 decimal places, I'm sort of making a
> statement that anything beyond that doesn't matter to me.

Yes, it's true the vast majority of the time it doesn't actually matter.
However, decimal formatting does such a good job of giving us the impression
that these errors are merely edge cases, and calculators automatically
formatting to 10 significant figures etc. further that illusion. If people are
not aware it's only an illusion (or just how far the rabbit hole goes), it can
be dangerous when they go on to create or use things where that fact matters.

~~~
jancsika
> I think that illusion can be a bit dangerous when we create things or use
> things based on that incorrect assumption.

I'd be curious to hear some of the problems programmers have run into from
this conceptual discrepancy. We've got probably billions of running instances
of web frameworks built atop double-precision IEEE 754 to choose from. Are
there any obvious examples you know of?

~~~
karmakaze
Operations that you think of as associative are not. A simple example is
adding small and large numbers together. If you add the small numbers together
and then the large one (i.e. sum from smallest to largest), the small parts
are better represented in the sum than if you sum from largest to smallest.
This could happen if you have a series of small interest payments and are
applying them to the starting principal.
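A minimal JavaScript sketch of the effect (the principal and payment values are made up for illustration):

```javascript
// Summing the same values in different orders gives different results.
const payments = Array(10).fill(1); // ten small "interest payments"
const principal = 1e16;             // large starting principal (the ULP here is 2)

// Largest first: each +1 is exactly half a ULP and rounds back down (ties-to-even).
let largestFirst = principal;
for (const p of payments) largestFirst += p;

// Smallest first: the payments accumulate before meeting the principal.
const smallestFirst = payments.reduce((a, b) => a + b, 0) + principal;

console.log(largestFirst);  // 10000000000000000
console.log(smallestFirst); // 10000000000000010
```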

~~~
Demiurge
I've worked with large datasets, aggregating millions of numbers, summing,
dividing, averaging, etc... and I have tested the orders of operations, trying
to force some accumulated error, and I've actually never been able to show any
difference in the realm of 6-8 significant digits I looked at.

~~~
vbezhenar
JavaScript is precise for 2^53 range. It's unlikely that you're operating with
numbers outside of that range if you're dealing with real life things, so for
most practical purposes doubles are enough.

~~~
civility
> JavaScript is precise for 2^53 range

What does this mean to you? It's very easy to get horrible rounding error with
real-life sized things. For instance

    document.writeln(1.0 % 0.2);

The right answer is 0.0, and the most it can be wrong is 0.2. It's nearly as
wrong as possible. These are real-life sized numbers.

btw: I think IEEE-754 is really great, but it's also important to understand
your tools.

~~~
vbezhenar
I'm talking about integers. 2^53 = 9007199254740992. You can do any arithmetic
operations with any integers from -9007199254740992 to 9007199254740992 and
the results will be correct. E.g. 9007199254740991 + 1 = 9007199254740992. But
outside of that range there will be errors, e.g. 9007199254740992 + 1 =
9007199254740992.
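In JavaScript terms (a sketch; `Number.MAX_SAFE_INTEGER` is 2^53 - 1):

```javascript
const max = Number.MAX_SAFE_INTEGER; // 2^53 - 1 = 9007199254740991

const stillExact = max + 1; // 9007199254740992 (2^53 itself is representable)
const lost = max + 2;       // also 9007199254740992: 2^53 + 1 does not exist as a double

console.log(stillExact === lost);              // true
console.log(Number.isSafeInteger(stillExact)); // false
```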

~~~
tomxor
You are describing only one of the types of numerical error that can occur,
and it is not commonly a problem: it is only an edge case that occurs at the
significand limit, where the exponent alone must be used to approximate larger
magnitudes, at which point integers become non-contiguous.

The types of errors being discussed by others are all in the realm of non-
integer rationals, where limitations in either precision or representation
introduce error that then compounds through operations no matter the order of
magnitude... and btw, _real_ life tends to contain _real_ numbers, which
commonly include rationals in IEEE 754 usage.

------
jnordwick
We had such floating point printing performance issues in Java once that I
needed to implement a Grisu variant that didn't fall into bignum calculations
and had a few other tweaks to short-circuit the printing and improve some
common cases. It made a decent impact, especially on garbage generation (it
was specifically zero-GC). It has since been ported to a couple of other
languages, and I was thinking of redoing it lately for another project using
the newer algo. https://github.com/jnordwick/zerog-grisu

------
ramzeus
To me the most complicated thing about formatting and parsing floating point
numbers is usually whether they use a decimal comma or a decimal dot. It never
seems to be what you expect when using a mix of operating systems in different
languages.
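JavaScript's own parsing functions show the trap (a sketch; neither function accepts a decimal comma):

```javascript
// A comma is silently treated as the end of the number, not a decimal separator.
const truncated = parseFloat("3,14"); // parsing stops at the comma
const strict = Number("3,14");        // Number() rejects the whole string instead

console.log(truncated); // 3
console.log(strict);    // NaN
```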

~~~
mort96
As someone from a country which uses commas as a decimal separator (Norway)...
Just use a dot. If you're not going to make a _serious_ commitment to actually
make your software work well in all the various locales, use a dot. People
will understand it regardless.

You're probably going to do something stupid like showing the user a comma-
separated list of numbers at some point, which will be needlessly hard to
parse for a human when your numbers use a comma as a decimal separator. You
(or someone else, or your users if they are technical) will probably at some
point make something which tries to parse some output, and that will break if
you switch between points and commas arbitrarily. Your users will want to copy
a number your software prints and paste it into a calculator or REPL or
something, and that probably doesn't work with comma as a decimal separator.

Half-assed "localization" from people who don't know anything about how other
countries work is just needlessly annoying to be subjected to.

That's at least my perspective as a Norwegian who experiences a lot of _bad_
localization even though I know English fairly well and configure all my
computing devices to use English. The perspective of someone from a country
where English is less well known might be different.

<rant>

Examples of horrible localization from clueless American companies or
organizations include:

* A lot of software will use your IP address to determine your language. That's annoying when I'm in Norway and want my computers to use English, but it's horrible when abroad. No, Google, I don't want French text just because I'm staying in France for a bit.

* Software will translate error messages, but not provide an error code. All information about error messages online is in English on stackoverflow or whatever. If Debian prints an error message in Norwegian, there's absolutely no information about the error anywhere on the web.

* There was a trend for a while where websites would tick the "localization" checkbox by adding a Google Translate widget, so English websites would automatically translate themselves into completely broken Norwegian automatically. That would've been useless if I didn't know English, and it's even worse considering I already know the source language just as well as Norwegian. Luckily, most websites seem to have stopped doing that.

</rant>

~~~
molf
Adding to this: ALWAYS use spaces as thousand-separators.

Resist the temptation to use commas or dots as thousand-separators. Seeing a
number with a dot as a decimal separator instead of a comma will be fine for
most people (even if proper localisation would mean using a comma), but if you
throw in commas that mean something else you WILL confuse people. And I
imagine the inverse is also true.
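A minimal sketch of space-grouping in JavaScript (the `groupThousands` helper is made up for illustration, not a standard API):

```javascript
// Insert a space before each complete group of three trailing integer digits.
function groupThousands(n) {
  const [int, frac] = String(n).split(".");
  const grouped = int.replace(/\B(?=(\d{3})+(?!\d))/g, " ");
  return frac === undefined ? grouped : grouped + "." + frac;
}

console.log(groupThousands(3800000));   // "3 800 000"
console.log(groupThousands(3800000.5)); // "3 800 000.5"
```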

~~~
stronglikedan
A space between numbers makes two separate numbers.

~~~
stronglikedan
I would love to hear why that's wrong, when we use whitespace to separate
pretty much everything in textual representation.

~~~
NikkiA
Well, SI did standardize the thousands separator as a space, so there is at
least precedent for using spaced numbers there.

I personally don't like it though, and tend to prefer either the Swiss system
(i.e. 3'800'000.0) or the maritime system (i.e. 3_800_000.0) if separators
must be used.

------
ChuckMcM
Life is so much easier when you use decimal to represent your numbers
[http://speleotrove.com/decimal/IEEE-cowlishaw-arith16.pdf]

------
nabla9
Can this formatting round-trip all double floats? The article doesn't say.

~~~
ufo
I'm not sure, but generally speaking, converting doubles to strings will fail
to round-trip at least the NaNs.

If round-tripping is important, my recommendation would be to output something
that directly corresponds to the binary representation of the float. For
example, printf %a.
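JavaScript has no `%a`, but the same idea can be sketched with a `DataView` over the raw 64 bits (the helper names here are hypothetical):

```javascript
// Serialize a double as its exact 64-bit pattern instead of decimal text.
function doubleToHex(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);
  return view.getBigUint64(0).toString(16).padStart(16, "0");
}

function hexToDouble(s) {
  const view = new DataView(new ArrayBuffer(8));
  view.setBigUint64(0, BigInt("0x" + s));
  return view.getFloat64(0);
}

console.log(doubleToHex(0.1));                      // "3fb999999999999a"
console.log(hexToDouble(doubleToHex(0.1)) === 0.1); // true
```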

~~~
theoh
See this article for an in-depth look at implementing these conversions:
[https://research.swtch.com/ftoa](https://research.swtch.com/ftoa)

------
ncmncm
I don't understand why anyone would write a new Grisu conversion, with Ryu 3x
as fast. It might have educational benefits, preparing you for a Ryu
conversion, but shipping it?

~~~
shereadsthenews
Ryu is only a year old. Grisu is 10 years old. It takes time for people to
absorb these things.

------
ChrisMarshallNY
This is interesting. I have learned to just use whatever operating system/RTL
utilities are provided (like Apple's NumberFormatter class).

