John Gustafson’s crusade to replace floating point with something better (nextplatform.com)
432 points by galaxyLogic 13 days ago | 189 comments





Here is the official site[1] of the project. There is a request[2] to add it to Scryer Prolog[3] (an ISO Prolog implementation in Rust), and there are implementations in Rust[4] itself and in the Julia[5] language.

[1] https://posithub.org/index

[2] https://github.com/mthom/scryer-prolog/issues/6

[3] https://github.com/mthom/scryer-prolog

[4] https://gitlab.com/burrbull/softposit-rs

[5] https://juliacomputing.com/blog/2016/03/29/unums.html


He will keep failing to replace IEEE floating point as long as he insists on making NEGATIVE infinity the same as POSITIVE infinity.

Also, IEEE 754 floating point standard guarantees the results of addition, subtraction, multiplication, division, and square root to be the exact correctly rounded value, ie a deterministic result, contrary to what he says.


Not so sure. I mean, yes, what you say is right, but there are problems nevertheless, see eg Wikipedia:

> Reproducibility

> The IEEE 754-1985 allowed many variations in implementations (such as the encoding of some values and the detection of certain exceptions). IEEE 754-2008 has strengthened up many of these, but a few variations still remain (especially for binary formats). The reproducibility clause recommends that language standards should provide a means to write reproducible programs (i.e., programs that will produce the same result in all implementations of a language), and describes what needs to be done to achieve reproducible results.

https://en.wikipedia.org/wiki/IEEE_754#Reproducibility


Reproducibility is an orthogonal issue to Posits vs IEEE754.

Most developers prefer speed over reproducibility, and are encouraged to use denormals-to-zero, fast-math optimizations, fused MAC, approximate square root instructions, and whatever else is available to get results faster.

The IEEE754 standard provides a guarantee for deterministic results, and many multi-precision and interval arithmetic libraries depend on this guarantee to be true to function properly.

IEEE754 defines unique -infinity and +infinity values, and any "new and improved" standard that breaks this axiom is simply incompatible with all existing floating-point libraries written over the last 30+ years.

"These claims pander to Ignorance and Wishful Thinking." Kahan (main author of IEEE754) on Posits claims.


> Most developers prefer

You're free to voice your own opinion, but I take some issue with people asserting theirs as if they speak for "most developers". Especially if it comes from a new account with a name like "Gustafnot". That doesn't exactly scream "unbiased" to me.


Having written scientific numerical software for decades, and having been in situations where I want high speed or I want reproducibility, and having worked with hundreds of developers, I agree wholeheartedly with Gustafnot - the vast majority of developers prefer performance over bitwise reproducibility. Lose a few bits here and there, and most don’t care, because they treat floats as fuzzy to begin with (and almost never care about reproducibility since it’s very hard to obtain due to compilers, libraries, etc.). But slow down code, and they sure notice quickly.

If you really want to claim the opposite, do you have evidence? Or experience that it’s true?


> But slow down code, and they sure notice quickly.

Yet Electron is not only a thing for hobbyists, but something even large companies bet their livelihoods on.


He argues that underflow/overflow are usually caused by bugs in the code. Having so many special values means giving up part of the numeric representation (there are a lot of NaN encodings in IEEE-754) and adds hardware overhead to deal with all the special cases. That small hardware overhead can add up when working with thousands of FPUs.

The vast majority of applications don't need fine control of the FPU. There will always be hardware for the few applications that need the IEEE-754 features.


That's great, but if the hardware doesn't support it, then wouldn't the implementation be slow?

Besides claimed efficiency improvements, posits are claimed to have advantages for implementing numerical algorithms correctly compared to IEEE floats. A big part of Gustafson's book The End of Error is devoted to that, claiming to show that a range of numerical algorithms are implemented better with posits vs IEEE floats (better meaning some mix of clearer, easier to produce stability, better error bounds, etc.). It's something of a competitor to interval arithmetic in that respect [1].

A 'softposit' implementation can at least let people experiment with writing posit algorithms to investigate the claimed algorithm-design benefits, even if it's not going to beat hardware floats in speed.

[1] The interval arithmetic people don't seem very happy with his comparisons though: http://frederic.goualard.net/publications/MR3329180.pdf


Yes, those implementations are slow compared to classical floating point. The goal, as I understand it, is to let people experiment with posits in order to prove that they bring something to the table.

As far as I read https://posithub.org/docs/Posits4.pdf there are FPGA implementations. Also, in systems where floating point arithmetic isn't supported in hardware (e.g. the Arduino Uno) you rely on soft FP anyway.

Benchmark performance analysis of accuracy and speed on AVR platforms would indeed be very interesting and an actual, possible, immediate implementation scenario...

Posits and other floating point variants are seriously cool, and Gustafson's work is amazing.

Sadly, the guy has a very annoying writing style that makes him sound like a crackpot. The advantages of posits would shine much more if they were not mixed with ridiculous language (posits are floating point numbers, thus they cannot replace them) and outlandish claims (IEEE floating point is deterministic, to the apparent contradiction of many sentences written by Gustafson). It does not help either that Gustafson's work only seems to attract the attention of semi-illiterate journalists who do not really understand what they are talking about.


>IEEE floating point is deterministic, to the apparent contradiction of many sentences written by Gustafson

What about this then, which somebody comments below:

"On old x86 systems, the x87 registers were 80-bits and the bottom bits were undefined. You could theoretically have a 1-bit difference on some results depending on what the bottom 26 bits (that were undefined, because you had 64-bit floats most of the time, even though the machine did things 80-bits at a time)."

And below:

"I spent a long time once debugging an issue arising from this. Code was basically:

  double x = y;
  [...]
  if (x>y) fail;
Nothing in between the assignment and the check changed the value of x or y, yet the check triggered the fail condition. Turned out to be that one of the values stayed in the 80-bit x87 register and the other was written to RAM as a 64-bit value, then loaded back into an x87 register for the check, resulting in the inequality. Running in a debugger wrote both values out of the registers before the check, making the problem unreproducible."

This behavior might not be IEEE defined, but even as IEEE undefined (and implementation dependent), it still qualifies those IEEE compliant implementations as non-deterministic.

Or how about:

"parallel systems, most likely GPUs, where it is common to have slightly different results even on the same operation, on the same system. This is due to the fact that several pipelines will do calculations in parallel and depending on which pipeline ends first (which can depend on many factors like current temperature) the sums can happen in a different order, leading to different rounding results."

due to FP non-associativity?


Those are not problems a new number format can fix though. The issues with x87 precision are due to compilers freely converting between different representations without this being indicated in the program.

Posits also can't achieve associativity in all conditions, you need to use "quire" accumulators with higher precision. However, if the decision of using quires were left to the compiler, you'd end up with the exact same problem of unclear precision as for x87. To avoid that, those accumulators will need to be annotated in the code, and straightforward programs that do not use them will continue to exhibit non-associativity.
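To make the "explicitly annotated wider accumulator" point concrete, here is a minimal Python sketch, with float32 standing in for the working format and a double standing in for the quire (a real quire is a wide fixed-point register, so this only illustrates why the annotation matters, not how a quire is built):

    import struct

    def f32(x):
        # Round a Python float to float32, standing in for a narrow 32-bit format.
        return struct.unpack('<f', struct.pack('<f', x))[0]

    xs = [f32(1e8), f32(1.0), f32(-1e8), f32(1.0)]

    # Naive: every partial sum is rounded back to the narrow format.
    acc = 0.0
    for v in xs:
        acc = f32(acc + v)

    # "Quire-style": accumulate in a wider type, round to the narrow format once.
    wide = 0.0
    for v in xs:
        wide += v

    print(acc)        # 1.0 -- the 1.0 added next to 1e8 was rounded away
    print(f32(wide))  # 2.0 -- the wider accumulator preserved it

Whether the wider accumulator is spelled out in the source (as above) or silently chosen by the compiler is exactly the precision-transparency question raised for x87.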


The observation that all of the +-*/ operations in a basic block can be reduced to a single rounding with the quire would get you a reasonable part of the way with the compiler alone.

Overall I'm not sold on the quire thing. It seems like a great big accumulator is a solution that is fairly independent from the floating point format.


> It seems like a great big accumulator is a solution that is fairly independent from the floating point format.

It could still be a good idea to mandate a standard version in the spec for consistent behavior.


If you have an 80 bit object, convert it to a 64 bit object, then convert it back to an 80 bit object, and you expect to drop zero bits, of course you'll run into problems. Posits can't fix these problems.

Desktop graphics cards intentionally ignore IEEE 754 rules if it allows them to squeeze out a little bit of performance. (or die space, or power consumption depending on the application) If the bad guy is drawn half a pixel out of position or if his shirt is slightly the wrong shade of red, nobody cares. But if your graphics card draws at 72 frames per second and your competitor's cards draws at 73, every publication is going to say their card is better than yours. Posits won't fix this either.

These problems you point out aren't technical problems, they're institutional ones. IMHO it's generally a bad idea to try to fix institutional problems with technical solutions, but I won't try to stop you from attempting it.


”This behavior might not be IEEE defined, but even as IEEE undefined (and implementation dependent), it still qualifies those IEEE compliant implementations as non-deterministic.”

Although IEEE is non-deterministic for other reasons (the transcendental example below is a good illustration) this argument is incorrect.

Performing operations on a non-IEEE 80-bit representation and then using a non-IEEE comparison is not a valid argument about non-determinism in IEEE.


> >IEEE floating point is deterministic, to the apparent contradiction of many sentences written by Gustafson

> "parallel systems, most likely GPUs, where it is common to have slightly different results even on the same operation, on the same system. This is due to the fact that several pipelines will do calculations in parallel and depending on which pipeline ends first (which can depend on many factors like current temperature) the sums can happen in a different order, leading to different rounding results."

> due to FP non-associativity?

As far as I understand, that's not quite the case (on Fermi+ Nvidia, as previous generations had no real support for proper IEEE single compute).

There are pipelines in parallel, but almost all non-determinism in the order of execution (and hence in the results, via FP non-associativity) is exposed to the programmer, except for atomic instructions. Those can even depend on clock skew across PCIe or NVLink if they operate on remote memory. NVLink supports remote execution in a recent version (I suggest the microbenchmarking-derived microarchitecture analysis papers for Volta and Turing, even if they are less hands-on than Nervana's maxas (with the SGEMM walk-through, which I really recommend)), so any hope of reliably tricking synchronization behavior via undefined opcode/scheduler instruction trickery is deliberately out of the window.

On IBM POWER9 CPUs they can apparently delegate execution of atomic operations to the host CPU, which might (I don't know enough about POWER9 cache coherency /NUMA protocols) delegate them over the NUMA fabric to a different rack-mount unit bonded into the same NUMA mesh.

TL;DR: Recent Nvidia GPUs do not suffer implicit non-determinism in IEEE floats unless you use hardware atomics.

Non-determinism is exposed and introduced in explicit concurrent data structures [0].

[0]: If you count a datastructure with a simple spinlock guarding it as a concurrent datastructure.

PS: I'll add references / links later on the laptop.


> What about this then (...)

This is exactly the kind of fringe shitfuckery that must be avoided at all costs if you want to promote posits seriously.


I am implementing a high performance posit arithmetic unit for my uni thesis. The idea of using a Golomb-Rice prefix tree to encode the exponent is genius. Having said that, I totally agree his papers make a lot of bogus claims.

For instance, the claim that "posits can be a direct replacement of floating point" is not true. There are a few tricks, such as the fast inverse square root, that abuse the floating point format to do fast arithmetic operations. These would break if floats were replaced with posits.
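(For anyone who hasn't seen it, here is a rough Python rendition of the classic Quake III trick; the bit-reinterpretation step is exactly the part that assumes the IEEE-754 layout and would not survive a switch to posits:)

    import struct

    def fast_inv_sqrt(x):
        # Reinterpret the float's bits as an integer -- this is the step that
        # depends on the IEEE-754 sign/exponent/fraction layout.
        i = struct.unpack('<I', struct.pack('<f', x))[0]
        i = 0x5f3759df - (i >> 1)               # the famous magic constant
        y = struct.unpack('<f', struct.pack('<I', i))[0]
        return y * (1.5 - 0.5 * x * y * y)      # one Newton-Raphson refinement step

    print(fast_inv_sqrt(4.0))   # roughly 0.499, vs. the exact 0.5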

Also, he claims that an early FPGA posit implementation uses less area and has higher performance than a floating point implementation. This cannot be true^. The posit unit IS a floating point unit with an encoder/decoder unit. The posit unit will actually have a larger word size when decoded than IEEE floating point, because it has more precision than the IEEE format. Also, features such as the different rounding modes supported by IEEE but not by posits use a negligible amount of extra area.

^ Unless, ofc, he is comparing apples to oranges.


I'm eager to read your thesis, then! Please, make sure to include a (short) historical account from a neutral point of view. Most notably, Kahan's rejection of unums/posits mixes mostly reasonable criticism with a few points that are not relevant; and then Gustafson holds on to these irrelevant points and replies with his usual word salad. It is quite a sad state of affairs for what could be an enlightening scientific discussion.

The Golomb-Rice prefix tree for the exponents is certainly cool. It has the air of Gosper's "continued logarithms", which are like the same idea but turned up to the max: always use the same mantissa (which is implicit) and represent only the exponent to high precision.
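If you haven't seen the encoding, here's a rough Python sketch of how a posit bit pattern decodes, with nbits and es left as free parameters (rounding and some edge cases are glossed over, so treat it as an illustration of the regime/exponent/fraction split rather than a reference implementation):

    def decode_posit(p, nbits=8, es=0):
        # Rough sketch: sign bit, then a unary "regime" run, then up to `es`
        # exponent bits, then the fraction. Zero and NaR are special patterns.
        mask = (1 << nbits) - 1
        if p == 0:
            return 0.0
        if p == 1 << (nbits - 1):
            return float('nan')                     # NaR ("not a real")
        sign = -1.0 if p & (1 << (nbits - 1)) else 1.0
        if sign < 0:
            p = (-p) & mask                         # two's-complement negate, then decode
        bits = [(p >> i) & 1 for i in range(nbits - 2, -1, -1)]
        r0, m = bits[0], 1
        while m < len(bits) and bits[m] == r0:      # measure the regime run
            m += 1
        k = (m - 1) if r0 == 1 else -m
        rest = bits[m + 1:]                         # skip the regime-terminating bit
        e = 0
        for b in rest[:es]:
            e = (e << 1) | b
        e <<= max(0, es - len(rest[:es]))           # missing exponent bits count as zeros
        frac = 1.0                                  # implicit leading 1
        for i, b in enumerate(rest[es:]):
            frac += b / (1 << (i + 1))
        useed = 1 << (1 << es)                      # useed = 2**(2**es)
        return sign * (useed ** k) * (2.0 ** e) * frac

    # 8-bit, es=0 examples: 0b01000000 -> 1.0, 0b01100000 -> 2.0,
    # 0b00100000 -> 0.5, 0b11000000 -> -1.0
    print([decode_posit(b) for b in (0b01000000, 0b01100000, 0b00100000, 0b11000000)])

The regime run is the Golomb-Rice-like part: each additional regime bit scales the value by useed = 2^(2^es), which is why precision is densest near 1 and tapers off toward the extremes.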


I will get in touch once I finish the section on posits. I would appreciate some feedback! I'm writing in my language, but hopefully Google Translate will work well.

Gustafson is being too ambitious in trying to reinvent the whole floating point framework. I think he should focus on what really matters: the compact encoding. A lot of applications could save memory bandwidth using some kind of posit-inspired format. A pack/unpack unit that converted the posit encoding to floating point inside the core could be game changing.

He also proposed a good idea on code ergonomics: make the accumulator a type (the quire, in posit terms). Making the programmer aware of the accumulator can lead to better code.

By the way, reading Kahan's work on floating point I could understand why Gustafson wants to avoid the IEEE. The Kahan implementation favored Intel, not necessarily the community.

Anyway, my point of view is from the hardware side of things. The format seems to have some arithmetic properties that might be useful for mathematicians.


Papers have already been written. This one implemented posits on an FPGA: https://hal.inria.fr/hal-02131982

The summary:

These architectures are evaluated on recent FPGA hardware and compared to their IEEE-754 counterparts. The standard 32-bit posit adder is found to be twice as large as the corresponding floating-point adder. Posit multiplication requires about 7 times more LUTs and a few more DSPs, for a latency which is 2x worse than the IEEE-754 32-bit multiplier.


There are a few FPGA implementations. The posit hub website provides a table with all of them. Each has different trade-offs. The implementation I'm working on is focused on hardware reuse and high frequency operation.

> IEEE floating point is deterministic, to the apparent contradiction of many sentences written by Gustafson

How can that be? The IEEE standard does not place many requirements on the precision of the transcendental functions, and AFAICT a transcendental doesn't even need to return the exact same answer twice. That is, it is correct for a compiler to, if it can, evaluate a transcendental at compile-time with infinite precision, and if it cannot, call a run-time function that computes it. Math libraries can be dynamically linked, and all can produce different results here, so it is not that a compiler can go out of its way to make things deterministic either.


> posits are floating point numbers, thus they cannot replace them

Everyone who reads that who is cognitively typical will understand "floating-point" (which, by the way, needs a dash, if we are going to nitpick) as a nickname for "IEEE 754 floating-point" or "traditional floating-point", rather than an insinuation that posits are, in contrast, fixed-point or something other than floating-point.



The proposed format has changed several times since 2015, so not all those threads are discussing the same thing as this one.

Posits are Type III Unums IIRC, the original proposal was with Type I Unums.

Previous proposals were variable length formats, which is pretty much a nonstarter for a variety of reasons. This proposal (posits) is a fixed length format with variable length fields.


There have been several papers and blog posts written about the more recent posits proposals.

https://marc-b-reynolds.github.io/math/2019/02/06/Posit1.htm... https://hal.inria.fr/hal-01959581v2 https://hal.inria.fr/hal-02131982

The bottom line is they are a tradeoff; they certainly aren't purely better than IEEE floats. Plus they seem to have a non-trivial hardware cost (one of the linked papers shows the adder is twice as large as the equivalent IEEE adder and has double the latency!)


Also this 2017 thread with 'dnautics https://news.ycombinator.com/item?id=13633991

Very few developers truly understand floating point representation. Most think of it as base-10, and put in horrific kludges and workarounds when they discover it doesn't work as they (wrongly) expected. I shudder to think how many e-commerce sites use `float` for financial transactions!

So as far as I'm concerned, whatever performance cost these alternate methods may have, it would be well worth it to avoid the pitfalls of IEEE floats. Intel chips have had BCD support in machine code; I'm surprised nobody has made a decent fixed point lib that is widely used already.


Replacing all the IEEE 754 hardware with posits won't fix this, though.

If you don't care about performance, then the actual solution has no dependency on hardware:

1. Replace the default format for numbers with a decimal point in suitably high level languages with an infinite precision format.

2. Teach people using other languages about floating point and how they may want to use integers instead.

The end. No multi-generation hardware transition required.

IMO, IEEE 754 is an exceptionally good format. It has real problems, but they aren't widely known to people unfamiliar with floats (e.g. 1.0 + 2.0 != 3.0 isn't one of them).


> 1. Replace the default format for numbers with a decimal point in suitably high level languages with an infinite precision format.

Unlimited precision in any radix point based format does not solve representation error. If you don't understand why:

  - How many decimal places does it take to represent 1/3 (infinite, AKA out of memory)

  - Now how many places does it take to represent 1/3 in base3? (1)
If you are truly only working with rational numbers and only using the four basic arithmetic operations, then only a variable precision fractional representation (i.e. a numerator and denominator, which is indifferent to the underlying base) will be able to store any number without error (if it fits in memory). Of course, if you are using transcendental functions or want to use irrational numbers, e.g. pi, then by definition there is no way to avoid error in any finite system.
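A quick illustration in Python, using the standard library's fractions module for the numerator/denominator representation:

    from fractions import Fraction

    # A numerator/denominator pair stores 1/3 exactly, in any base.
    x = Fraction(1, 3)
    print(x + x + x == 1)              # True: no representation error accumulates
    print(0.1 + 0.1 + 0.1 == 0.3)      # False with binary floating point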

I'm pretty sure that 1.0 + 2.0 == 3.0 in IEEE 754. :) Now, 0.1 ...

For reference to what they're talking about, the helpful https://0.30000000000000004.com/

One of the great qualities of IEEE 754 is its ability to represent many integers and operate on them without rounding errors.

Yep, sorry, meant to put 0.1 ...

Thanks!


They are not the same thing, but they are close enough most of the time.

The issue with floating point arises when comparing very close numbers.


Great points. I agree - the performance hit is simply the cost of being accurate and having predictable behaviour.

I'm not suggesting we replace all our current HW with chips that implement posits (let's fix branch prediction first!!). More that FP should be opt-in for most HLLs.


In the paper "Do Developers Understand IEEE Floating Point" the authors surveys phd student and faculty members from computer science and found out that your observation is true: most people don't know how fp works.

They have the survey online at [1] in case you want to see how much you know about fp behavior.

[1] http://presciencelab.org/float


That was a waste of time: 6 pages of questions, and instead of "grading" it and letting me know where I stand on FP, it says "Thanks for the gift of your time".

Likewise!

FYI the actual paper is https://ieeexplore.ieee.org/document/8425212 and you can get the PDF here: http://pdinda.org/Papers/ipdps18.pdf


Interestingly, they got one of their own answers wrong.

The question is:

    if a and b are numbers, it is always the case that (a + b) == (b + a)
and their notes on it:

    Is a simple statement involving the commutativity over addition true for floating point? Generally, floating point arithmetic follows the same commutativity laws as real number arithmetic.
They make it clear in their notes that "are numbers" includes infinities but not NaNs. Now consider the case where a = inf and b = -inf. Then inf + (-inf) is NaN, and (-inf) + inf is NaN, and NaN != NaN.

    >>> a = float('inf')
    >>> b = float('-inf')
    >>> a + b == b + a
    False

Nice catch.

> They make it clear in their notes that "are numbers" includes infinities but not NaNs.

This is definitely not very clear on the form, though.


Sorry for tricking you into giving real data for the researchers LOL.

Another comment pointed to the paper with the answers.

A great resource to learn about the FP corner cases is the great Random ASCII blog:

https://randomascii.wordpress.com/category/floating-point/


Considering that back in the 1980's we figured out that you should use some kind of 'integer' or 'bcd' based type for financial calculations; it is astonishing that by almost 2020 people are making these same mistakes over and over again.

A 64-bit integer is big enough to express the US National Debt in Argentine Pesos.


> it is astonishing that by almost 2020 people are making these same mistakes over and over again.

It's actually logical: the number of developers doubles roughly every 5 years. It means that half of the developers have less than 5 years of experience. If they don't teach you this in school (university), you will have to learn from someone who knows, but chances are the other developers are as clueless as you.


OMG! The Eternal Eternal September.

It's not enough to represent one US cent in original Zimbabwean dollars as of 2009, though. Admittedly those 'first' dollars hadn't been legal tender for quite some time by then, the currency having gone through three rounds of massive devaluation and collapse, totalling something like 1x10^30.

> Most think of it as base-10 [...] I'm surprised nobody has made a decent fixed point lib that is widely used already.

Note that a fixed radix point does not solve the common issues with representing rational base-10 fractions. A base-10 fixed radix solution would, as would IEEE 754's decimal64 spec, which eliminates representation error when working exclusively in base 10 (e.g. finance), but these are not found in common hardware and do not help reduce propagation of error due to compounding with limited precision in any base.


The use of binary in the numerator is not the problem with using floats for financial math, it is the use of powers of 2 in the denominator.

The numerator is just an integer, and integers are just integers; the base doesn't matter. But if the exponent is base 2, then you can have 1/2, 1/4, 1/8 in the denominator but not 1/5 or 1/10.
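One way to see this from Python: Decimal(float) converts the binary value exactly, so it shows what the float really stores.

    from decimal import Decimal

    print(Decimal(0.5))   # 0.5 -- denominator is a power of two, so it's exact
    print(Decimal(0.1))   # 0.1000000000000000055511151231257827021181583404541015625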


Rational numbers are a good way to do many financial calculations (extra points for doing something useful with negative denominators!), since many financial calculations are specified with particular denominators (per cent, per mille, basis points; halves, sixths, twelfths, twenty-sixths, fifty-seconds of a basis point, etc.).

However, as soon as you start doing anything interesting, you have limited precision as a matter of course.


If there's one thing I've always really appreciated about Groovy, it's that it used BigDecimal as the default for fractions, because 9 times out of 10, you need accuracy more than you need high performance and large exponents (and if you do need high performance, you wouldn't be using Groovy anyway).

Sadly most languages don't support something like that out of the box.
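In Python, for example, the closest thing is opt-in rather than the default for literals, via the standard decimal module:

    from decimal import Decimal

    print(0.1 + 0.2)                           # 0.30000000000000004
    print(Decimal("0.1") + Decimal("0.2"))     # 0.3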


> Very few developers truly understand floating point representation.

Where would one go to better understand how floating points are represented?


“What every computer scientist should know about floating point” by David Goldberg. Readily available free online.

Thanks!

Sorry, but you actually sound like one of those people who doesn't really know how FP works.

> I shudder to think how many e-commerce sites use `float` for financial transactions!

An IEEE-754 float represents up to 9 significant decimal digits (23 stored mantissa bits). A double represents up to 17 significant decimal digits, with a relative error bound of about 10^-16 (half an ulp) per operation. Likely irrelevant for most e-commerce.

Also, databases store currency values using fixed point.

> Intel chips have had BCD support in machine code

BCD is a floating point encoding, not fixed point. AFAIK, only Intel supports it, and very precariously.

> I'm surprised nobody has made a decent fixed point lib that is widely used already.

Nonsense. If you do any scientific computation you likely have Boost, GMP, or MPFR installed on your system. They support arbitrary precision arithmetic with integer (a.k.a. fixed point), rational, and floating point types.


> Sorry, but you actually sound like one of those people who doesn't really know how FP works.

LOL, sure ok. Worked on banking systems for 2 years and been doing scientific computing for many more. Pretty comfortable with fixed and floating point.

> [error bounds] Likely irrelevant for most e-commerce.

Those bounds are theoretical, and there are plenty of occasions I have come across in the past when rounding errors were observed. It was forbidden in the bank to use floating point! We went to enormous lengths to ensure numerical accuracy and stability across systems.

I think this article has a pretty good explanation:

https://dzone.com/articles/never-use-float-and-double-for-mo...

> Nonsense. If you do any scientific computation you likely have Boost, GMP, or MPFR installed on your system. They support arbitrary precision arithmetic with integer (a.k.a. fixed point), rational, and floating point types.

Yes, absolutely right; I have used several of those 3rd party libs myself, as well as hand-rolling fixed point code (esp for embedded systems). I didn't write what I intended. I meant to say that very few languages have first-class fixed point in their standard library. So long as the simple `float` is available as a POD, people will (mis-)use it.

I think in a general purpose HLL, a fixed decimal type should be the default, and you should have to opt in to IEEE-754 floating point.


First, I would like to apologize for my attitude on the original post, my choice of words was very inappropriate for this forum. Second. I would like to thank you for bringing another point of view.

I'm talking about the average Joe e-commerce site. The precision requirements of financial institutions and large e-commerce operations are stricter than those of most websites. For the average e-commerce site, the extra cost of using a decimal arithmetic framework might not be reasonable. Of course, software engineers should use currency types when they are available.

> https://dzone.com/articles/never-use-float-and-double-for-mo...

The example used by the website is misleading. They print a value with all the significant digits. In reality, e-commerce sites are only concerned with two digits after the decimal point. I ran the same program in C with one billion (-1) iterations and the printed value with two decimal digits was exact. It was the same as it would have been if I had used decimal arithmetic.

> I think in a general purpose HLL, a fixed decimal type should be the default, and you should have to opt in to IEEE-754 floating point.

Most modern languages have some support for rational, fixed point, and currency arithmetic. Nowadays, even GCC's C supports fixed point arithmetic [1].

The hardware implementation of floating point arithmetic is concerned with hardware efficiency and support for a lot of use cases. The ranges and accuracy needed by different applications vary widely. The BCD encoding is very inefficient hardware-wise compared to binary encoding.

In summary, I do agree people working with currencies should use the proper currency type, but I also think using IEEE 754 is usually fine, especially in the front-end. Also, I don't think the trade-offs of changing from binary arithmetic to decimal arithmetic are worth it for most people and hardware systems.

[1] https://gcc.gnu.org/onlinedocs/gcc-4.9.0/gcc/Fixed-Point.htm...


Obviously for any 32 bit value there’s only 2^32 possible values. So logically, a better number format is all about the distribution of values and reducing redundancy.

It sounds like the basic idea here is that rather than a fixed amount of significant figures, you get more precision in the middle. Meanwhile, you can also get larger exponents. Is that right?


The ideas are (1) gradual instead of hard overflow [IEEE floats do gradual underflow via denormals], (2) crowding representable numbers more densely around 1 while letting very large or very small numbers get less and less precise, vs. floating point which (except for denormals) is scale free within its range, (3) making a format which can be extended by just adding bits without new definitions, so that you can get more or less precise numbers as needed by just adding zeros or truncating, and (4) a bit more uniform logic for basic arithmetic with fewer edge cases.

It's a bit sad that many of the creator’s claims are exaggerated and his examples cherry-picked though.


> floating point which (except for denormals) is scale free within its range

Presumably that's what you mean by scale free, but let's spell it out:

It's denser around 0, then has uniform density between, say, 1/8 and 1/4, half that density between 1/4 and 1/2, half that between 1/2 and 1, half that between 1 and 2, etc. etc.

If x = 1.0e22, you can add a million and it's still the same.
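A quick interactive check of that last claim (the spacing between adjacent doubles near 1e22 is 2^21, about 2.1e6):

    >>> x = 1.0e22
    >>> x + 1.0e6 == x      # 1e6 is less than half a ulp of 1e22, so it rounds away
    True
    >>> x + 2.0e6 == x      # 2e6 is more than half a ulp, so the sum moves
    False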


I recall in some of these earlier conversations, people pointing out that there are game engines where they invert some of their calculations to prevent artifacts caused by loss of significant figures with certain numbers.

that's the general idea. There is an additional tradeoff - error analysis in non-extreme (aka subnormal) cases is easier for IEEE than posits.

The quire stuff has been thrown in there, too, because while we're boiling the ocean, we might as well encourage people to "do the right thing". Think of the quire as a really large accumulator cache for things like matrix-vector multiplication or tensor outer products, which, I hear, AI uses a lot.


I believe this has even fewer allowable states because there's more than one way to represent the same number

There aren't. On top of that, look at all the ways to represent NaN with IEEE floats.

I agree with your analysis; think UTF-8 on IEEE 754's exponent.

> Better yet, he claims the new format is a “drop-in replacement” for standard floats, with no changes needed to an application’s source code.

Claims like that are best taken with a grain of salt:

> It also does away with rounding errors, overflow and underflow exceptions, subnormal (denormalized) numbers, and the plethora of not-a-number (NaN) values. Additionally, posits avoids the weirdness of 0 and -0 as two distinct values.

Ok, so posits will probably work fine as a drop-in replacement when my application makes simple use of floats. But, assuming my application is doing non-trivial math, it's probably aware of the above edge cases. Thus, dropping in posits might have lots of weird side effects where I had to work around weird side effects of floats.


> But, assuming my application is doing non-trivial math, it's probably aware of the above edge cases. Thus, dropping in posits might have lots of weird side effects where I had to work around weird side effects of floats.

Yeah, but posits appear to have fewer weird edge-cases. Plus the people trying out posits are constantly trying to find methods to work around the limitations too.

During this year's conference on posit maths, Florent de Dinechin had a really nice talk bringing up all the current issues with posits and the ways that floating point maths has found workarounds over the years, as a kind of challenge for posits to catch up[0][1]. The community took it really well, as far as I can see, and Gustafson in particular seemed delighted, because he genuinely wants everyone to start using better numerical methods.

[0] https://www.youtube.com/channel/UCOstJ2IVC4Y8mbgN0IsowKw

[1] https://www.youtube.com/watch?v=tcX2nRCdZvs


I think we can assume that the people who are sophisticated enough to “use floats correctly” are also sophisticated enough to know the article is not aimed at them.

Your boss, on the other hand, may try to armchair-architect you, which is a real (and underreported) problem.


Choosing to have -infinity and various NaNs can be done independently from choosing to add regime bits; one could just as well design this system to include them by adding more special values.

I didn't find a resource specifying how many exponent bits to actually use? (The 'es' value)

I don't see anything different in this system than regular floats that would ensure consistent results across machines/compilers/...? It's floats plus unary encoded regime bits to have var length exponent, so everything that can make floats inconsistent can still happen here: non commutativity, different exponent sizes or precisions of intermediate values, different rounding modes, ...


I just read the paper. Standard es bit counts are covered in section 7.2

Long story short: es = log2(nbits) - 3


> It also does away with rounding errors

What? Surely it does not have infinite precision with a finite number of bits, and also doesn't seem to be a rational number representation.


Hm, perhaps the answer is a subtle definition of "rounding error".

A rounding error is when you have a number, let's say 0.75, and due to rounding it is recorded as 1.00. The "rounding error" is 0.25.

An alternative to rounding to 1.00 would be to have a mechanism which says "the value is between 0.50 and 1.50". This way, there is no actual rounding, as it doesn't commit to a rounded value, so there is technically no "rounding error".

A neat advantage of recording an interval rather than rounding is that the "error" is preserved in the data through arithmetic, so if there is some following code that runs x * 100, a rounding mechanism would say "the value is 100" whereas an interval mechanism would say "the value is between 50 and 150". Then, if the user only looks at the output, it will be clear that there is a wide error range and something needs to be fixed, rather than the output indicating a precise answer when really it suffers from significant rounding errors.
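A toy Python sketch of that propagation (a hypothetical Interval class; a real interval library would also round the endpoints outward and handle negative multipliers):

    # A hypothetical minimal interval type, just to illustrate the idea above.
    class Interval:
        def __init__(self, lo, hi):
            self.lo, self.hi = lo, hi
        def __mul__(self, k):                 # scalar multiply; assumes k >= 0
            return Interval(self.lo * k, self.hi * k)
        def __repr__(self):
            return f"[{self.lo}, {self.hi}]"

    x = Interval(0.50, 1.50)                  # "the value is between 0.50 and 1.50"
    print(x * 100)                            # [50.0, 150.0] -- the uncertainty travels with the value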


I like posits, but they do not encode intervals. His older ideas on that are crap and that's part of the problem getting posits accepted IMO.

That claim can only be made with respect to using a Kulisch accumulator (what he calls the quire) for accumulation.

This can represent the exact sum of any number of floating point values (or products of floating point values) with the intermediate operations preserved at infinite precision, up to a single rounding at the end.

e.g., for computing an inner product of vectors \sum_i x_i y_i with x_i and y_i in floating point, a fused multiply-add operation (as on today's computers) would perform something like:

r(x_n y_n + r(... + r(x_2 y_2 + r(x_1 y_1 + 0))...))

where what is performed within the rounding function r() is done to infinite precision and a single rounding at the end. This is also only true if you are accumulating into the same value, and not splitting this operation across multiple values, otherwise there would be additional rounding steps performed.

Using a Kulisch accumulator the result would be:

r(\sum_i x_i y_i)

similarly also only preserving precision if everything is done in a Kulisch accumulator.

You can add a Kulisch accumulator to IEEE floating point or any other floating point as well. In fact the second? ever computer with floating point (the Zuse Z3) performed FP addition using a similar accumulator, but sums-of-products probably didn't preserve the multiplied value though.
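For a software flavour of "accumulate exactly, round once", Python's math.fsum behaves similarly in spirit (it tracks exact partial sums and rounds once at the end, though it is implemented very differently from a hardware Kulisch accumulator):

    import math

    xs = [1e16, 1.0, -1e16, 1.0]

    print(sum(xs))        # 1.0 -- left-to-right rounding loses one of the 1.0s
    print(math.fsum(xs))  # 2.0 -- exact accumulation, one rounding at the end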


Yes, this is the question I come to. The article betrays its own claim...

> Gustafson said that for training, 32-bit floating point is overkill and in some cases doesn’t even perform as well as the smaller 16-bit posits

If posits have a bit number, then they are not of variable accuracy. This is simply sloppy explaining.

At some point, rounding has to happen (call it an interval if you want, but you're not being helpful). As a programmer, I WANT rounding to happen. I depend on it. You don't break out the floats unless you are prepared for some information to be lost. If you try to never lose any information, then you wind up with a monstrosity like Maple and Mathematica in which all steps need to be curated by hand to occasionally reduce the giant glob of fractions and implicit solves into something concrete, but imperfect.

I read the article and took a glance at https://www.johndcook.com/blog/2018/04/11/anatomy-of-a-posit..., but I'm still completely unable to answer this question of how accuracy is shed.


That is unfortunately a mistake on the writer’s side: posits do not claim that anywhere in their specification or elsewhere.

Sloppy journalism.


One form of rounding error that comes up in floating point conversations has to do with data aggregation. If the average value of a series is 3.5 but after rounding the average drops to 3.4, then your rounding has lost some fidelity, and this upsets some people. This counts as a rounding error to them.


Thanks, came to say the same thing - wondered what William "Father of IEEE 754" Kahan had to say about this.

I haven't had time to digest it all, but he is certainly critical of Gustafson's proposals. Not sure though the articles linked above cover the latest and greatest Unum III. At any rate, I'd pay close attention to Kahan's critique.


"(even the same computation on the same system can produce different results for floats)"

wait...what? Is this real?


Yes and no.

1. Yes -- On old x86 systems, the x87 registers were 80 bits and the bottom bits were undefined. You could theoretically have a 1-bit difference on some results depending on what the bottom 26 bits happened to be (they were undefined, because you had 64-bit floats most of the time, even though the machine did things 80 bits at a time).

2. No -- Modern x86 systems use SSE registers, where doubles are computed at exactly 64 bits. There are a whole slew of configuration options, but if Windows / Linux does their job, you should have the same rounding errors across your program (unless you manually change the rounding options, but that's your own fault if you go that route).

3. Kinda yes -- Floating point operations are NON-associative. (A+B)+C does NOT equal A+(B+C). The best example is Python:

  >>> (1+1)+2.0**53
  9007199254740994.0
  >>> 1+(1+2.0**53)
  9007199254740992.0

2**53 is the first value which starts to "round off" 1.0 in double-precision. So (1 + 2.0**53) == 2.0**53, because the 1.0 was rounded off.

Due to #3, even if you did everything correctly (ie: set the rounding flags so they were consistent), if your arithmetic happened in slightly different orders (very common in network-games, where different players may have their updates in slightly different orders), you end up with a 1-bit error between clients... which completely borks your simulation.

Because #3 is nonintuitive, many people think that you get different results when using floats. But its simply due to non-associativity.


In case you're like me, reading this comment and thinking "how the heck is 1 plus 1 plus 2-ish like 90 gazillion?", it's:

  >>> (1+1)+2.0**53
  9007199254740994.0
  >>> 1+(1+2.0**53)
  9007199254740992.0
The exponent operator (double star) got transformed into italics apparently.

Oh! Thank you! The comment makes a lot more sense now!

The newlines also helped, by the way, because I was wondering what putting the two numbers next to each other was meant to indicate.


> On old x86 systems, the x87 registers were 80-bits and the bottom bits were undefined. You could theoretically have a 1-bit difference on some results

I spent a long time once debugging an issue arising from this. Code was basically:

  double x = y;
  [...]
  if (x>y) fail;
Nothing in between the assignment and the check changed the value of x or y, yet the check triggered the fail condition. Turned out to be that one of the values stayed in the 80-bit x87 register and the other was written to RAM as a 64-bit value, then loaded back into an x87 register for the check, resulting in the inequality. Running in a debugger wrote both values out of the registers before the check, making the problem unreproducible.

Yes, that's a hard bug to catch, but if the mantra of "every floating point comparison must use a context-appropriate epsilon" is followed, then it wouldn't have happened in the first place.
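For concreteness, that kind of comparison looks like this (here using Python's math.isclose, with the tolerance picked to fit the context):

    import math

    x = 0.1 + 0.2
    print(x == 0.3)                              # False
    print(math.isclose(x, 0.3, rel_tol=1e-12))   # True -- tolerance chosen for the context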

I hate that mantra, as it has produced more bugs and inaccuracies than necessary in the code I have to work with. Actually, you should be able to rely on the equality of a value with its copy no matter what. We are past the FPU days, so I think we can let go of these mindless epsilons sprinkled over the code.

(Even during the FPU days, every ; should have rounded each value to the size of the variables, unless you use a special compiler switch to make FP math faster)


Absolutely, and the fix involved adding a tolerance check. Not my code, I was the new guy at the time, and my job was trying to fix the bugs nobody else wanted to deal with...

> Nothing in between the assignment and the check changed the value of x or y

What was the check for?


There was code in between that could have changed the value, but it was never triggered in the bug case.

Ah yeah, that makes sense. Relying on FP equality should always make one's eye twitch a little but this is a neat way of hiding it.

Wrt #3, I've had huge errors (I think on the order of 10e-3) in non linear curve fitting spectroscopy algorithms because of this. One of the physics research fellows in the group looked at me like I was an idiot for not knowing the order of multiplication mattered (I still have no clue why he would think it's a standard thing to know this).

Lots of researchers cargo cult floating point programming. I had a Fortran program a few months ago where the author did "if var > 0.99 and var < 1.01 then" etc., where "var" was an integer. I tried to search back where and how that was ever done, but no scenario made any sense ("var" was a categorical variable and always had been). So I went back to some of the original authors and there too they looked at me like I was an idiot and said "you should never test for equality, always check for distance within a certain epsilon". So I asked about the difference between integers and floating point and then they "didn't have time for technical details in code written 10 years ago." Shrug, they pay me to fix that particular sort of weirdness I guess.

In all likelihood, the reason that's there isn't because a programmer was an idiot. Almost certainly, that variable was originally a float, then some time later another programmer came in and refactored it into an int, but missed fixing this condition, since it technically "works" for ints as well.

This is a good argument for stricter typechecking than anything else. In Rust, a condition like that wouldn't compile because of the mixed types. That's probably a good call in my opinion: it alerts the programmer that there's something funky going on, and it forces the programmer to be explicit about how the comparison should actually happen.

I do find myself doing the "cargo cult" thing from time to time. I was working on a codebase where Lua was embedded in C++, and there were all these flags that had to work in both systems. Because Lua only supports 64-bit floats as a numeric type, the natural type for these was double, and "by convention" they only ever had the values 0.0 or 1.0. It was a bit silly of me, but every time I saw the line `if (fSomeFlag == 1.0f) ...` in the code, I would wince. It just feels wrong to do that to floats. I found myself regularly writing `if (fSomeFlag > 0.5f) ...` instead.

Of course, I was wrong about this. These weren't values determined as a result of arithmetic, they were always set to whole numbers. There was never going to be a case where the actual value of the flag would be 0.9999998f or whatever. But still: once you've been burned by the "floating point equality comparison", you live in fear of it forever after.


It was the same in the situation I was talking about; all numbers that would always be integers because of their nature (and in the data source they came from, they were integers, there was no way this could ever have been floating point, or that the values ever had any non-0 digits after the comma). And well, they probably weren't 'idiots' in the absolute sense, in fact I'm 100% sure that they are much smarter (and/or just plain better humans in other ways) than I am in many respects.

My point was that many researchers can only cobble together somewhat barely working software by taking snippets from their undergraduate textbooks (or googling/stack overflowing, for those under 40), and have no idea about many/most of the underlying principles. If only they would recognize that, and leave the software development to professionals, instead of treating it as an inconsequential implementation detail that is beneath them. Then again, if they did, they wouldn't have to pay me what they do to fix it up afterwards. So meh?


x87 has been functionally superseded (largely by various SSE iterations), but it is still supported for backwards-compatibility.

Modern applications and compilers largely do not use x87, but they can, and the possibility of x87 results depending on hidden bits persists in modern x86_64 CPUs.


DLang allows using it explicitly via the real type, instead of the double and float types.

yet it is still used for arithmetic with "long double"

Wouldn't that resolve the issue? long double contains the whole 80-bit state, so there's not much space for undefined bits left.

Yes, it resolves the issue in practice. However, the issue is an implementation detail of the compiler, not of the floating point format itself.

> 2. No -- Modern x86 systems use SSE registers, where doubles are computed at exactly 64 bits. There are a whole slew of configuration options, but if Windows / Linux does their job, you should have the same rounding errors across your program

Knock, knock. Who is there? Floating-point contraction.


> Modern x86 systems use SSE registers

Are your sure of this? It's my understanding that SSE registers require the use of a special API and standard floating point operations do not use them.

But the last time I worked with them was writing a SIMD vector library 7 years ago.


64-bit x86 CPUs are defined to support SSE2 as a minimum, and all 64-bit operating systems use it by default.

For 32-bit systems, you still theoretically have to check, but I think every halfway modern compiler also uses SSE2 as the default target.


> It's my understanding that SSE registers require the use of a special API and standard floating point operations do not use them.

No, SSE registers are used by most compilers that do floating point these days. They support all the usual IEEE float math operations and a host of bit twiddling operations as well as vector operations. The vector operations do still require using compiler intrinsics in C++, although some autovectorization does occur in gcc, icc, and llvm.


Ah yes. This is what I was remembering. For vector operations you have to use special intrinsics

Compilers now exclusively target the SSE registers (with scalar SSE instructions like mulss), and they did 7 years ago too.

https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html

  -mfpmath=unit
  
      Generate floating-point arithmetic for selected unit unit. The choices for unit are:
  
      ‘387’
  
          Use the standard 387 floating-point coprocessor present on the majority of chips and emulated otherwise. Code compiled with this option runs almost everywhere. The temporary results are computed in 80-bit precision instead of the precision specified by the type, resulting in slightly different results compared to most of other chips. See -ffloat-store for more detailed description.
  
          This is the default choice for non-Darwin x86-32 targets.

I'm pretty certain all 64-bit x86 (so amd64) systems support SSE 2, so FP calcs are done with these registers.

Common Lisp over here like:

  * (+ (+ 1 1) (expt 2 256))
  115792089237316195423570985008687907853269984665640564039457584007913129639938

  * (+ 1 (+ 1 (expt 2 256)))
  115792089237316195423570985008687907853269984665640564039457584007913129639938

Python will happily do this too, if you remove the ".0" in the example, the values become integers, instead of floats, and the computation is precise.

The point is to demonstrate a property of IEEE floats, not the language. I'd expect CL has some means of using IEEE floats too, and with the appropriate syntax, the same issues could be demonstrated.


Many languages support BigFloats and BigInts, but this discussion is about IEEE 754 floats vs Gustafson's proposals.

Not really. I suspect they are referring to the original x87 FP functions, with their odd 80-bit FP stack and operations, where the exact result depends on when intermediate results are written to RAM and when they are kept in the FP stack, which might depend on opaque things like compiler optimisations.

They could refer to parallel systems, most likely GPUs, where it is common to have slightly different results even on the same operation, on the same system.

This is due to the fact that several pipelines will do calculations in parallel and depending on which pipeline ends first (which can depend on many factors like current temperature) the sums can happen in a different order, leading to different rounding results.


> They could refer to parallel systems, most likely GPUs, where it is common to have slightly different results even on the same operation.

Parallel systems have great difficulty in defining an order of operations. Floating-point numbers are non-associative, so order matters for bitwise compatibility.

With that being said, you CAN define an order, and it CAN be done in parallel. The parallel-prefix sum is a fully defined order of operations for example, and if you sort all numbers (from smallest to largest), you'll get a more accurate result.

https://en.wikipedia.org/wiki/Prefix_sum#Algorithm_1:_Shorte...

The + operations always happen in the same order, even on a GPU. So you should get the same results every time with floating-point math. But it is non-intuitive for many people to work with non-associative floating-point math.

-----------

Note that writing a sequential algorithm with fully defined order is still grossly difficult! IMO, if anyone wants a bit-accurate simulation, use 64-bit integers instead.

If you absolutely must use floats, then define a proper order of all operations, and include sorting the numbers (!!) to force a fully defined order as much as possible. You may need to sort your pool of numbers many, many times per calculation. But that's what you have to do to get a defined ordering.
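Here's a sketch of one such defined order (a pairwise/tree reduction with a fixed bracketing; a parallel prefix sum generalizes the same idea):

    def tree_sum(xs):
        # Fixed-order pairwise (tree) reduction: the bracketing is the same on
        # every run, so the result is reproducible even if the pairs are
        # computed in parallel.
        xs = list(xs)
        while len(xs) > 1:
            if len(xs) % 2:
                xs.append(0.0)
            xs = [xs[i] + xs[i + 1] for i in range(0, len(xs), 2)]
        return xs[0]

    vals = [0.1] * 10
    print(sum(vals))       # 0.9999999999999999 (left-to-right order)
    print(tree_sum(vals))  # 1.0 (tree order) -- different rounding, but the same every run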


I've never heard of that happening on a GPU. Can you give an example?

We have this in our GPU computations, and it is very annoying. The errors are usually too small to matter, but they make comparing binary results / reproducible builds impossible. Here is an example I just made:

    $ python3
    >>> s = [10**-x for x in range(16)]
    >>> s
    [1, 0.1, 0.01, 0.001, 0.0001, 1e-05, 1e-06, 1e-07, 1e-08, 1e-09, 1e-10, 1e-11, 1e-12, 1e-13, 1e-14, 1e-15]
    >>> sum(s)
    1.1111111111111112
    >>> sum(reversed(s))
    1.111111111111111
(it uses CPU for simplicity, but GPU has exactly the same math)

Summation of an array is surprisingly non-trivial. Below is a discussion about it with many interesting links (including Kahan summation and pairwise summation) if you want to go down some unexpected rabbit holes.

Somewhere I've seen a little routine (by Stefan Karpinski of Julia fame) that starts with a fixed pre-determined array of numbers, and you give it a desired target number, and it permutes the numbers of the array such that when you sum them up (naively, from left to right) the result is your desired target number [2].

https://discourse.julialang.org/t/accurate-summation-algorit...

EDIT: [2] https://discourse.julialang.org/t/array-ordering-and-naive-s...


If you don't care about cancellation error and are only concerned with perfect bitwise / reproducible results, then sort your numbers.

Sorting the numbers is necessary to minimize cancellation error. But absolute minimum cancellation error can only be done sequentially: the parallel version may return different results.

    l = list(s)
    l.sort()
    sum(l)
    l = list(reversed(s))
    l.sort()
    sum(l)
Both return 1.111111111111111 now. Sorting is the "synchronization" step you need to get all your parallel tasks back into a defined order.

You could have this kind of behavior if you use atomic functions - but you need to explicitly opt in to using those, and it is not limited to GPUs.

And performance will suffer as a consequence.

>> where the exact result depends on when intermediate results are written to RAM and when they are kept in the FP stack, which might depend on opaque things like compiler optimisations.

One thing I've never really been sure about is how context-switching affects this. I've always assumed this could also be a source of non-deterministic floating-point results: when the OS decides to switch threads, it has to save the CPU & FPU state, and restore it when switching back. Since context switches are (practically speaking) non-deterministic on a pre-emptive OS, this would imply truncating 80-bit to 64-bit floating point values could also happen on non-deterministic points in time.

Does anyone know if this is actually true? Or do the full 80-bits get saved/restored on context switches?


They are saved.

The FPU state is saved using the F(N)SAVE x86 instruction, which stores 108 bytes to memory, which contains all visible FPU state.


it doesn't happen much in practice, but the official IEEE spec is silent about it, so you can toss it into whatever bin contains 'undefined' behaviour of C.

Are any C undefined behaviors explicitly undefined in the spec?

This is not allowed by the standard. It is false in practice too.

How does the implementation of an FPU for this compare? I thought the existing IEEE 754 floating point standard was focused on reduced complexity of addition and multiplication hardware. This seems more complicated.

Having implemented it, with unoptimized Verilog (ok, I wrote a Verilog generator to generate it), it requires about 30% fewer LUTs on an FPGA relative to the Berkeley HardFloat implementation.

Is the variable length regime handling that much easier to deal with (space-wise) than the NaN and subnormal handling needed in IEEE floats? I'd think that the regime scheme would effectively be equivalent to creating a multitude of different-width subnormal routes. Is it really the NaN handling that kills IEEE float performance?

It's basically a barrel shifter; for addition you're going to need it anyway. Multiplication is a bit nastier, but most of the multiplier gates are adder gates anyway. A useful insight I made is that negative numbers are basically the same as positives, with a "minus two" invisible bit.

Here is a sample 8-bit multiplier. All code was generated using a Verilog DSL I wrote in Julia for this specific purpose. All Verilog is tested by transpiling it to C using Verilator and loading the shared object into a Julia runtime alongside a Julia implementation.

https://github.com/interplanetary-robot/mullinengine/blob/ma...


This may be good for numerical simulation, but I doubt this will ever replace common ieee754 float in general purpose programs. Two problems I can see right away:

- Sometimes, one needs precision for numbers not close to 1. A UNIX timestamp is a good example -- for the current date, it provides roughly microsecond resolution in 64-bit IEEE 754. I could not calculate what the resolution would be for posits, but I suspect it is much worse. (A quick check of the float64 side is sketched after this list.)

- The lack of positive and negative infinities will break naive min/max calculations which start the accumulator at +/-inf.
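
As a quick back-of-the-envelope check of the float64 side of that claim (1.56e9 is just a representative 2019-era timestamp):

    import math

    t = 1.56e9              # a representative UNIX timestamp stored as a double
    m, e = math.frexp(t)    # t == m * 2**e with 0.5 <= m < 1, so e == 31 here
    ulp = 2.0 ** (e - 53)   # spacing between adjacent doubles near t
    print(ulp)              # ~2.4e-07 s, i.e. a bit better than 1 us resolution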

Now, you might say that those things should not be done, and that software that uses those patterns is defective. But even then, there is a lot of software like this, and so posits will never become a default "float" type.


UNIX timestamps are normally stored as 32-bit or 64-bit signed integers, not as floating-point. If you want better than 1-second precision, then the type "struct timespec" (specified by POSIX) gives you nanosecond precision. Fixed-point types can also be used in languages that support them.

Yeah, timespec (or time_interval) is a proper way to go, but it is quite a pain to work with -- you need a library even if you just want to subtract the numbers.

On the other hand, floating-point time is pretty common in scripting languages -- for example Python has time.time(); Ruby has Time.now.to_f. It is not perfect, but great for smaller scripts: foolproof (except for the precision loss), round-trips via any serialization format, and easy to understand. And no timezone problems at all!
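
The precision loss is easy to see by round-tripping an integer nanosecond timestamp through the float representation (Python 3.7+ for time.time_ns; the exact digits lost will vary):

    import time

    t_ns = time.time_ns()           # integer nanoseconds since the epoch
    t_f  = t_ns / 1e9               # the float64 seconds value time.time() would give
    print(t_ns - round(t_f * 1e9))  # typically off by up to a couple of hundred nanoseconds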


I thought it was unsigned, 64 bits, holding the number of nanoseconds since the Unix zero time (which might be 1970, but I forget).

UNIX time is seconds since the epoch (hence the year 2038 problem: that's the limit of a signed 32-bit time_t).

gettimeofday() and clock_gettime() provide higher-resolution timestamps (µs and ns respectively), using structs (timeval, timespec) instead of plain numbers.

Some APIs return floating-point UNIX time in order to provide sub-second accuracy (the decimal part is the fractional second). Python's time.time() does that for instance.


It’s a signed type so dates before 1970 can be represented

time_t is usually signed.

> will break naive min/max calculations which start the accumulator at +/-inf.

Wouldn't the equivalent pattern just be to start at +/- typemax(Posit) or whatever? As you would for integers, which lack infinity.


I think this person is saying that it would break existing code, not the pattern.

The solution seems simple to me, though: whatever syntax programmers are using to initialize floats to +/-inf, make that map to posit min and max as well.


I read "drop-in" as dropped in to new code. Certainly this could never be dropped in transparently at an OS or processor level. As long as the change is opt-in, neither of the things you mention are problems. I think a new world where programmers don't have to be aware of negative zeros, etc, would be appealing to a lot of people.

Posits seem great, but LLNL seems to favor ZFP[1], not posits.[2] Maybe new chips should then implement both posits and ZFP.

[1] https://github.com/LLNL/zfp

[2] https://helper.ipam.ucla.edu/publications/bdcws2/bdcws2_1504...


Unrelated as far as I can tell. Posits are an alternative floating-point format, not an array compression algorithm. The "compression" benefit comes from being able to use posit floats instead of IEEE 754 doubles, because of the better precision.

Posits also take less space for the same total accuracy: they use less space on disk and in RAM, and less memory bandwidth.

It occurs to me that maybe you are referring to unums? The "unum" format discussed in "The End of Error" is variable sized. However, they have since dropped that design and switched to fixed-size number formats (16-, 32-, or 64-bit) with an adjustable division between exponent and fraction bits: the "posit."

So "posit" numbers get you the ability to tradeoff precision and range, but they're not anymore highly compressed than regular old IEEE floats. Unless, as Gustafson argues, you didn't need a double in the first place and the added features of posits let you switch to a float.


Not really... there are standard 32-bit and 64-bit representations of posits. They use the same memory as a float or a double, just with better accuracy and other desirable properties. You only reduce storage space if you switch from "double" to "float", which you can only do if you didn't need most of the precision of a double in the first place. They're not THAT much better.

Zfp is lossy compression just FYI. Fpzip (an earlier invention by Lindstrom et al) has a lossless mode. They really push zfp more though as it has nice features like random access decoding and (except on my data) higher compression ratios.

The paper for this was published in 2017[0]

[0] https://dl.acm.org/citation.cfm?id=3148220


It's odd. Someone told me about that on IRC at the time. People felt strongly about the topic. Quackery, mythomania... nobody wanted to hear about anything but the IEEE standards.

Lots and lots of people have proposed replacements for IEEE floats that fix (or claim to fix) various problems with floats. 90% of them are utter nonsense, so many people default to assuming that any new one they encounter is also utter nonsense, and they usually turn out to be right.

> Quackery, mythomania... nobody wanted to hear about anything but the IEEE standards.

Mostly because there is no good evidence that what is being proposed is better.

This isn't the olden days when it was difficult to demonstrate on a large enough CPU and dataset. Today we have cloud computing. If you create something better, you can demonstrate it by putting it into a numerics application and blowing everybody away.

The CFD people are always looking for better solutions. The numerical simulation people are always constrained.

Until you do that, people have a right to blow you off.


This paper on UK weather simulation, cited in the article, seems like pretty good evidence: https://posithub.org/conga/2019/docs/13/1100-MilanKlower.pdf

That only really compares posits to Float16 in a domain where the paper admits roundoff error mostly isn't really an issue even at single precision.

I'd be much more interested to see how these handle stiff systems with multiple time constants. That's a domain where everything has problems and improvement is going to actually move some things from completely infeasible to actually simulatable.

That's a much stronger use case. And you don't have to worry about implementation efficiency since the improvement ratio is "infinity"--you went from can't do it at all to actually being able to do it.


>Mostly because there is no good evidence that what is being proposed is better.

Err, it's math, there doesn't need to be "evidence". They can do the calculations for themselves and see...


It's not my job to prove your assertion.

If you want me to believe something, the job is yours to cough up the convincing evidence.


The curious skeptics here might want to check out the videos from CoNGA'19[0]. There are talks about posits being used and tried out in the wild with impressive results. For example, according to Milan Klöwer, 16-bit posits could be accurate enough to replace 64-bit floats in certain climate modelling problems[1].

EDIT: just realized that the example was mentioned in the article, with a link to the slides. Still, the YT talk may add some context

[0] https://www.youtube.com/channel/UCOstJ2IVC4Y8mbgN0IsowKw/vid...

[1] https://www.youtube.com/watch?v=XazIx0cMVyg


The innovative idea in posits is that they use a Golomb-Rice prefix to encode the exponent. The Golomb-Rice prefix lets you encode exponents closer to zero using less space.

For example, posit16 with nbits=16 and es=1 encodes the exponent like this:

    01.0 = 0
    01.1 = 1
    001.0 = 2
    001.1 = 3
    0001.0 = 4

The format has a normal exponent field with es bits that encodes the exponent in binary. When it overflows, it encodes the carry in unary format: the run length of repeated bits encodes a number. For example, 0001 would be 3, 001 would be 2, etc. Of course, the posit format is a bit more complicated than that (it supports negative exponents, for example), but the idea is pretty much this.
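
For illustration, a rough (untested) posit16 decoder in Python; decode_posit is just a name for this sketch, following the draft-standard layout where a run of 1s gives regime k = run-1, a run of 0s gives k = -run, and the overall scale is 2^(k * 2^es + e):

    def decode_posit(bits, nbits=16, es=1):
        mask = (1 << nbits) - 1
        bits &= mask
        if bits == 0:
            return 0.0
        if bits == 1 << (nbits - 1):
            return float("nan")                    # NaR: the single "not a real" pattern
        sign = -1.0 if bits >> (nbits - 1) else 1.0
        if sign < 0:
            bits = (-bits) & mask                  # posits negate by two's complement
        payload = bits & ((1 << (nbits - 1)) - 1)  # the nbits-1 bits after the sign
        # Regime: a run of identical bits, terminated by the opposite bit (or the end).
        lead = (payload >> (nbits - 2)) & 1
        run = 0
        while run < nbits - 1 and ((payload >> (nbits - 2 - run)) & 1) == lead:
            run += 1
        k = run - 1 if lead else -run
        rem = max(nbits - 1 - run - 1, 0)          # bits left for exponent + fraction
        tail = payload & ((1 << rem) - 1)
        e_bits = min(es, rem)
        e = (tail >> (rem - e_bits)) << (es - e_bits) if e_bits else 0
        f_bits = rem - e_bits
        f = tail & ((1 << f_bits) - 1)
        frac = 1 + f / (1 << f_bits) if f_bits else 1.0
        return sign * frac * 2.0 ** (k * (1 << es) + e)

    # With es=1 the scale factor is 2**(2*k + e):
    print(decode_posit(0x4000))  # 1.0 (k=0, e=0)
    print(decode_posit(0x5000))  # 2.0 (k=0, e=1)
    print(decode_posit(0x0001))  # 2**-28, the smallest positive posit16
    print(decode_posit(0x7FFF))  # 2**28, the largest posit16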

Because of this encoding, if you are working with numbers whose exponents are small in magnitude (i.e. numbers not too far from 1), posits will have a LOT more precision than your floating-point format.

But the claim that posit16 can have as much precision as binary64 from IEEE 754 is misleading. Posit16 can have at most 16-1-2-1 = 12 fraction bits, while binary64 always has 52 fraction bits (53 significant bits).

They likely compared binary64 with posit16 using the accumulator (aka quire). I'm not sure how the quire would map to a real-world FPU; it uses a lot of space.


> They likely compared binary64 with posit16 using the accumulator (aka quire).

Which I always find deceptive, because nothing stops us from using a quire with classical floating-point arithmetic.

It is in fact a comparison between a summation and a compensated summation: it is more precise because the algorithm is different, not because of posits or floats.

They tell you that the quire is even faster than a traditional sum while being more precise, but reading the associated reference reveals that this only holds with specific hardware, which could equally be used to implement a quire for floats.

(and, arguably, I would love to know that every programmer is aware of a solid implementation of compensated/exact summation/dot-product and uses it when appropriate)


> They tell you that the quire is even faster than a traditional sum while being more precise, but reading the associated reference reveals that this only holds with specific hardware, which could equally be used to implement a quire for floats.

Precisely. Once a posit is unpacked, it is indistinguishable from a floating-point number. It is not fair to let the posit use a massive accumulator while the IEEE 754 floating point works with a tiny one. Like I said before, the precision of a number represented in binary64 is greater than that of a posit8. This comparison ignores the biggest advantage of posits: an efficient data format.

> (and, arguably, I would love to know that every programmer is aware of a solid implementation of compensated/exact summation/dot-product and uses it when appropriate)

I like the idea of making the accumulator type (quire) accessible to the programmer. I think this brings awareness of the underlying hardware implementation to the average programmer.


> They likely compared binary64 with posit16 using the accumulator (aka quire).

As far as I can tell, Klöwer didn't. He mentions around the 13 minute mark (slide 9) that he used the SigmoidNumbers software package for Julia. From the looks of the examples he gives, if he used the quire, it must have happened implicitly. Which IIRC is not how using the quire works.

He did rescale all his inputs to minimize rounding errors that way, and he mentions that this has benefits for posits that floats don't have (because of the tapered precision of posits).


Hmm. Their results seem too good to be true without using the quire. Look:

The largest/smallest exponents posit16 can represent are:

    min exp: 2^(-14 * 2^es) * (1+0) = 2^(-28)   [bit pattern 0x0001]
    max exp: 2^(+14 * 2^es) * (1+0) = 2^(+28)   [bit pattern 0x7FFF]

Note that those numbers have only the implicit bit as significand; they don't have any space left to encode anything other than the sign and regime.

The double (binary64), on the other hand, can represent a much larger range of exponents:

    min exp: 2^(1 - 1023)    * (1+0) = 2^(-1022)
    max exp: 2^(2046 - 1023) * (1+0) = 2^(+1023)

Also, all doubles have 52 fraction bits, while posit16 has at most 16-1-2-1 = 12 fraction bits.

Their significands are encoded slightly differently, though. I'm not sure that would be enough to achieve such different results without the quire.

The scaling he mentions could be done very easily by introducing a bias parameter in an FPGA implementation. I might add that to my work.

I pinged Klöwer on twitter.


Tweet exchange + link, for anyone else stumbling upon this later:

> @milankloewer Would mind joining this thread on HN: https://news.ycombinator.com/item?id=20392612 We are discussing the results of your work with posit.

> Happy to join. In short: I did not use any quires so far. All simulations are entirely based on 16bit posits, but I compare them to Float16 and not Float64. Tricks are: Scaling and rewriting algorithms to avoid very large and very small numbers, that's it.

https://twitter.com/milankloewer/status/1148670158883475461


You clearly know more about this topic than me, thanks for taking the time to explain your viewpoints.

> I pinged Klöwer on twitter.

Probably the most sensible, easiest way to clear this up, haha :)


> It also does away with rounding errors, overflow and underflow exceptions, subnormal (denormalized) numbers, and the plethora of not-a-number (NaN) values.

All of those are important and useful features. Presenting their absence as some kind of advantage shows that Gustafson has no clue.


The article is too sloppy on details: these can be found in the specification and are not what the journalist writes.

One annoying part about floats is the abuse of the NaN space. Some runtimes like SpiderMonkey, JSC, and LuaJIT abuse the NaN space to store pointers in doubles. This practice is often called NaN-boxing, and it has a few variants. A more efficient use of the bits, as with posits, would break this.
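
For readers who haven't seen the trick, here is a toy model of NaN-boxing in Python (the payload value is made up for illustration; real engines do this in C on raw 64-bit words):

    import struct

    QNAN = 0x7ff8_0000_0000_0000          # exponent all ones + quiet bit: the NaN space

    def box(payload):
        # Hide a 48-bit payload (e.g. a pointer) in the low mantissa bits of a quiet NaN.
        return struct.unpack("<d", struct.pack("<Q", QNAN | payload))[0]

    def unbox(d):
        return struct.unpack("<Q", struct.pack("<d", d))[0] & ((1 << 48) - 1)

    boxed = box(0x7f12_3456_7890)
    print(boxed != boxed)                  # True: ordinary float code just sees a NaN
    print(hex(unbox(boxed)))               # 0x7f1234567890: the payload survives the trip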

As for the claim that an FPU takes up less space and power for posits than for floats, Facebook made that claim: https://code.fb.com/ai-research/floating-point-math/


> One annoying part about floats is the abuse of the NaN space.

I would say that this is certainly my favorite feature of IEEE floats :)


Do these techniques make it easier or harder to implement hardware floating point units?

Storage and transmission speed have progressed exponentially while execution units have become the bottleneck. A floating-point format that is 20% denser or more accurate but that requires 2x the number of gate delays to implement is a major step backwards, except maybe in highly specialized applications.


From [1]: " The standard 32 bits posit adder is found to be twice as large as the corresponding floating-point adder. Posit multiplication requires about 7 times more LUTs and a few more DSPs for a latency which is 2x worst than the IEEE-754 32 bit multiplier." It's for an FPGA implementation.

This being said, I object to your premise: transmission and memory storage are getting comparatively more costly vs. computation. From "Computer Architecture, A Quantitative Approach", 6th edition, figure 1.13, there are some values for the TSMC 45nm process (so hardly the leading edge, and it only gets worse with finer nodes):

- 32-bit integer multiplication: 3.2 pJ
- 32-bit float multiplication: 3.7 pJ
- 32-bit read from a small 8 kB SRAM: 5 pJ
- 32-bit read from external DRAM: 640 pJ

We have a lot of transistors nowadays, and as long as the computation can be pipelined, halving the memory accesses could be a win (TBC).

Also from the same Inria team as [1], covering hardware cost but more than just that: "Posits: the good, the bad and the ugly": https://hal.inria.fr/hal-01959581v3/document

[1] "Hardware cost evaluation of the posit number system", https://hal.inria.fr/hal-02131982/document


I'm not sure the Inria team used the correct optimization for posit addition. You do yourself a disservice by treating a posit like a float and bifurcating the cross (negative/positive) case from the negative/negative and positive/positive branches; since posits are two's complement, the posit adder should be smaller, not bigger.

> transmission speed

Memory bandwidth relative to instruction-to-instruction latency has gone way down. Memory is slower now relative to the amount of math one can do.

We are also swimming in extra silicon, so even if what you are saying is true (2x the number of gates, all adding to the delay), posits would still be a win.

You are arguing from an arbitrary what-if position, it doesn't look good.


This has always been interesting to me. It's been a while since I last read about it, though, and I remember reading some criticisms or concerns that posits (or unums? what's the difference?) would end up being slower for some reason. I don't remember the arguments, though, or where I saw them; I think the idea was that there were some edge cases common enough in practice that overall it would slow things down. It would be nice to see a balanced discussion of the ideas (pros and cons).

Performance per transistor has barely budged in thirty years. Posits seem like a very good development in that direction.

Is there any FPGA implementation of his ideas?

One could then solve some ODEs to check how these implementations perform.

Of course the speed will be far slower, but one can still get the number of instructions.


This sounds really good. I find floating points completely unusable for any situation where accuracy is important. It's fine when being vaguely in the right ballpark is good enough, but I don't want to have to deal with 2 + 4.1 = 6.1000000001, or x/1000000 + y/1000000 != (x+y)/1000000.

Well, those will not be solved by posits; it is still an approximate floating-point format. What it does is redistribute the precision to what Gustafson considers a better default, and drop edge cases in order to get more bits for precision.

edit: One of several examples found in the "Posits: the good, the bad and the ugly" paper linked in the thread: 10.0 * 2.0 = 16.0 in posit8


Really? My impression from the article was that this was exactly one of the things posits were supposed to fix, at least for numbers with small exponents.

I'm not sure how 10.0 * 2.0 = 16.0. I'm not sure what posit8 means, but it can only be correct if it switches halfway from base 8 representation to base 10, which is a bit weird, but at least the calculation is correct. (Otherwise it would be so incorrect as to be unusable for anything.)


I am serious (and recommend reading the paper I quoted to get a good understanding of the trade-off offered by posits).

Overall, posits are a new trade-off that will give you better precision (nothing exact, it is still an approximation) when you manage to keep all your numbers in a small range. Once you get out of that range, precision drops significantly (whereas the precision of classical floating point drops gradually).

Posit8 is equivalent to an 8-bit floating point (a minifloat), making it an easy target for pathological cases, but the example still illustrates the fact that, contrary to floating-point arithmetic, multiplication by a power of two is not always exact with posits (one of several good properties we take for granted and would lose when switching to posits).


If it's just a variation of the same problems behind floats, then I'm not that interested. Well, I guess it depends on how small the range is.

Here's the paper they mentioned: https://hal.inria.fr/hal-01959581v3/document

The problem you describe here arises from using binary fractions, that is, a power of 2 as a denominator. You cannot represent the decimal fraction 0.1 as a binary fraction. You would have the same problem representing 1/3 with decimal fractions. It just does not work.

You can solve it by switching to decimal floating point, which is defined by IEEE as well:

https://en.wikipedia.org/wiki/Decimal_floating_point
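
For example, with Python's decimal module (a software implementation, but it shows the behaviour):

    from decimal import Decimal

    print(0.1 + 0.2)                        # 0.30000000000000004 with binary floats
    print(Decimal("0.1") + Decimal("0.2"))  # 0.3, exact in decimal floating point
    print(Decimal(1) / Decimal(3))          # 0.3333... -- 1/3 still can't be exact in base 10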


This is awesome. Doubling performance for free sounds good to me.

Well, we still need to replace all the floating point hardware in everything. So not exactly free...

Gustafson commented on the article this morning.

At the risk of a "flame war" where there are no winners, I would like to comment on some the statements here before they get stale. If we avoid ad hominem attacks and stick to the math, the claims, and counterexamples, this can be a useful scientific discussion and I very much welcome all the criticism of my ideas.

The irreproducibility of IEEE 754 float calculations is well documented... on Wikipedia, by William Kahan, and in an excellent paper by David Monniaux titled "The pitfalls of floating-point computations". It is amazing that this is tolerated, but IEEE 754 has done a great deal to lower the expectations of computer users regarding mathematically correct behavior.

The posit approach is not merely a format but also the Draft Standard. Whereas floats can arbitrarily use "guard bits" to covertly do calculations with greater accuracy, the posit standard rules that out. Whereas the float standard recommends that math functions like log(x), cos(x) etc. be correctly rounded, the draft posit standard mandates that they be correctly rounded (or else they have to use a function name that clarifies that they are not the correctly-rounded function). By the draft posit standard, you cannot do anything not specified in the source code (like noticing that a multiply and an add could be fused into a multiply-add with deferred rounding, so calling fused multiply-add without telling anyone). The source code completely defines what the result will be, bitwise, or it is not posit-compliant. It cannot depend on internal processor flags, optimization levels, or special hardware with guard bits to improve accuracy; this is what corrupted the IEEE 754 Standard and made it an irreproducible environment to this day.

The claim that posits is a "drop-in" replacement for floating point needs a lot of clarification, and this is unfortunately left out of much of the coverage of the idea. Clearly, if an algorithm assigns a hexadecimal value to encode a real value, that will need work to port from IEEE floats to posits. The math libraries need to be rewritten, as well as scanf and printf in C and their equivalents for other languages. However, a number of researchers have found that they can substitute a posit representation for a float representation of the same size, and they get more accurate results with the same number of bits. I call that "plug-and-play" replacement; yes, there are a multitude of side effects that might need to be managed, but it's nothing like the jarring change, say, of moving from serial execution to parallel execution. It's really pretty easy, and it's easy to build tools that catch the 'gotcha' cases.

Some here have suggested the use of rational number representation, or said that there are redundant binary representations of the same numerical value. Unlike floats, posits do not have redundancy. I suspect someone is confused by the Morris approach to adjusting the tradeoff between fraction bits and exponent bits, which produces many redundant ways to express the same mathematical value.

Perfect additive associativity is available, as an option, with the quire. If needed. Multiplicative associativity is available, as an option, by calling fused multiply-multiply in the draft posit standard. Because quire operations appear to be both faster (free of renormalization and rounding) and more accurate (exact until converted back to posit form), I am puzzled regarding why anyone would want to do things more slowly and with less accuracy.

Kulisch blazed the way with his exact dot product; unfortunately, any exact dot product based on IEEE floats will have an accumulator with far too many bits (like 4,224 for IEEE double precision) and an accumulator that is just a bit larger than a power-of-two size. The "quire" of posits is always a power of two, much more hardware-friendly. It's 128 bits for 16-bit posits, and 512 bits for 32-bit posits, the width of a cache line on x86, or an AVX-512 instruction.
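
To illustrate the deferred-rounding idea behind the quire (a toy model only: exact rationals stand in for the fixed-point accumulator, and quire_style_dot is a made-up name):

    from fractions import Fraction

    def quire_style_dot(xs, ys):
        acc = Fraction(0)
        for x, y in zip(xs, ys):
            acc += Fraction(x) * Fraction(y)   # every float is an exact rational
        return float(acc)                      # a single rounding at the very end

    xs = [1e16, 1.0, -1e16]
    ys = [1.0, 1.0, 1.0]
    print(sum(x * y for x, y in zip(xs, ys)))  # 0.0: the 1.0 was lost to rounding
    print(quire_style_dot(xs, ys))             # 1.0: deferred rounding keeps it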

"A little knowledge is a dangerous thing." In evaluating posit arithmetic, please use more than what you see in a ycombinator blog. You might discover that there are several decades of careful decision-making behind the design of posit arithmetic. And unlike Kahan, I subject my ideas to critical review by the community and learn from their input. The 1985 IEEE format is grossly overdue for a change.


I want to add a few comments, as most of the discussion here concerned the hardware implementation and only a few comments pointed to possible applications. I work on weather and climate simulations, but my opinions should apply in general to CFD or PDE-type problems.

Yes, having redundant bit patterns is not great when designing a number format; however, even for Float16 (half precision), making use of the ~3% of bit patterns that are NaNs is wise but not going to be a game changer. Some others discussed pros/cons of negative zero and also negative infinity: in my view you want a bit pattern that tells you that the answer you get is not real, but whether it's +/-Inf or some NaN is pretty much irrelevant. Using these bit patterns for something else sounds like a very reasonable approach to me. Furthermore, I've never come across a good reason for -0 in our applications.

When it comes to weather and climate models in HPC, I see the following potential for posits: similar to how BFloat16 is supported on TPUs, I could see Posit16 being supported by some specialised hardware like GPUs, FPGAs etc. I'm saying that because for us it's not important to have a whole operating system running in posits (although I probably wouldn't mind), but to have them for some performance-critical algorithms. Unfortunately, weather and climate models are far more complex than some dot products, and we usually have to deal with a whole zoo of algorithms, causing weather and climate models to easily cover several million lines of code.

Now let's say we know our model spends 20% of the time in algorithm A, which only requires a certain (low) precision to be stable and to yield reasonable results; then it would indeed be a big game changer if we could run this algorithm in, say, 16 bit. In exchanging precision for speed we would probably want to push things to the edge, i.e. if we can just about do it in 16 bit, then we should.

Now there are several 16-bit formats: Float16, BFloat16, Posit16, Posit16_2 (with 2 exp bits), and technically also Int16. Let's forget about the technical details of these formats and focus on where they actually differ considerably: what is the dynamic range, and where on the real axis do I get how much precision to represent numbers? Yes, from a computer science perspective the technical details also matter, but from our perspective most of it is pretty irrelevant, and what actually matters are these two things: dynamic range and where the precision is. Because these two really determine whether your algorithm is gonna crash, or whether you can use it operationally on your desktop computer or in a big fat $$$ supercomputer.

For PDE-type problems (that includes CFD and also weather and climate models), over the last year of my research I came to the following preliminary conclusions regarding dynamic range and precision with respect to the above-mentioned formats:

- Int16: Let's forget about it.
- Float16: The precision is okay, but rarely needed towards the edges of the dynamic range. Floatmin might work; however, floatmax at 65504.0 is easily a killer. Might work with a no-overflow rounding mode and smart rewriting of algorithms to avoid large numbers.
- BFloat16: For our applications, having only 7 significant bits is not enough; I didn't come across a single sophisticated algorithm that works with BFloat16.
- Posit16 (with 1 exp bit): Great, puts a lot of precision where it's needed but also allows for a reasonable dynamic range.
- Posit16 (with 2 exp bits): Probably even better; the sacrifice of a bit of precision in the middle is fine, and the wide dynamic range gives it the potential to also work with algorithms that are hard to squeeze into a smaller dynamic range.

(Rough numbers for these dynamic ranges are sketched right after this list.)
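
For concreteness, the dynamic ranges in question can be tabulated quickly (rough numbers, assuming the draft-standard posit layouts; posit precision additionally tapers off away from magnitude 1):

    # Smallest positive (normal) and largest finite value for each 16-bit format.
    formats = {
        "Float16":      (2.0**-14,  65504.0),
        "BFloat16":     (2.0**-126, (2 - 2**-7) * 2.0**127),
        "Posit16 es=1": (2.0**-28,  2.0**28),
        "Posit16 es=2": (2.0**-56,  2.0**56),
    }
    for name, (lo, hi) in formats.items():
        print(f"{name:14s} {lo:9.3g} .. {hi:9.3g}")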

In short, posits actually fit the numbers our algorithms produce much better. And this can indeed be the game changer: if a GPU supports posit arithmetic and we can run algorithm A on it in 16 bit: wonderful, contract sold! But if we couldn't with BFloat16 or Float16, then there is no future for 16 bit in our field.

I explain more about this in this paper: dx.doi.org/10.1145/3316279.3316281

And there are two talks which tell a similar story: https://www.youtube.com/watch?v=XazIx0cMVyg https://www.youtube.com/watch?v=wp7AYMWlPLw

Or simply drop me an email if you have questions (I'm unlikely to respond here); you can find my address on my website: milank.de


Let's change the world.


