Q (Number Format) (wikipedia.org)
106 points by brudgers on Sept 2, 2022 | 52 comments



Last year I wrote some code in Rust to read GF files (the output from Metafont). Knuth was fond of these sorts of fixed-point representations of non-integers in his file formats: all of the binary formats he created around TeX and MF employ Q15.16 fixed-point numbers where integer values are not acceptable. Likewise, internally to TeX and MF, all non-integer calculations are done with Q15.16 fixed-point numbers. At the time DEK was writing his programs there was no standardized FP implementation, significant differences in FP calculations could occur between different architectures, and he wanted to be able to guarantee identical output on all platforms.
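A rough sketch (mine, not Knuth's code) of what a Q15.16 value is: an i32 read as stored-value/2^16, so addition is plain integer addition and multiplication needs a widening multiply plus a shift.

    // A Q15.16 value is an i32 interpreted as (stored integer) / 2^16:
    // 1 sign bit, 15 integer bits, 16 fraction bits.
    const FRAC_BITS: u32 = 16;
    const ONE: i32 = 1 << FRAC_BITS; // 65536 represents 1.0

    fn from_f64(x: f64) -> i32 {
        (x * ONE as f64).round() as i32
    }

    fn to_f64(q: i32) -> f64 {
        q as f64 / ONE as f64
    }

    // Addition is plain integer addition; multiplication needs a widening
    // multiply followed by a shift to drop the extra 16 fraction bits.
    fn q_mul(a: i32, b: i32) -> i32 {
        ((a as i64 * b as i64) >> FRAC_BITS) as i32
    }

    fn main() {
        let a = from_f64(1.5);   // stored as 98304
        let b = from_f64(0.25);  // stored as 16384
        println!("{}", to_f64(a + b));       // 1.75
        println!("{}", to_f64(q_mul(a, b))); // 0.375
    }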


Significant differences in FP calculations still occur between different architectures, and sometimes even when code is compiled with different optimizations on the same platform. In particular, Intel Gen doesn't support denormalized floats and gradual underflow (and I think other GPUs are similar here); FMA (fused multiply-add) changes the results of most algorithms, though usually only in the least significant bit; and the IEEE standard doesn't require that operations beyond the basic ones (+, -, *, /, and square root) produce correctly-rounded results because that is deemed to be too difficult.
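For a concrete sense of the FMA point, a small sketch in Rust (which does not contract a*a - 1.0 into an FMA on its own, so the two-rounding and one-rounding results can be compared directly; the input is chosen so the difference is visible):

    fn main() {
        // a = 1 + 2^-27, so a*a = 1 + 2^-26 + 2^-54 exactly.
        let a: f64 = 1.0 + (2.0_f64).powi(-27);

        // Two roundings: a*a first rounds away the 2^-54 term, then 1 is subtracted.
        let unfused = a * a - 1.0;

        // One rounding: the fused multiply-add keeps the 2^-54 term.
        let fused = a.mul_add(a, -1.0);

        println!("unfused = {unfused:e}");
        println!("fused   = {fused:e}");
        assert_ne!(unfused, fused);
    }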

FMA is what made me realize that expecting identical output from FP algorithms is a lost cause. Knuth's choice was the only reasonable one if you're trying to define a deterministic platform that makes your results reproducible on future hardware. We can expect with great confidence that 50 or 100 years from now, let alone 1000 or 5000 years from now, there will be other optimizations like FMA that change the results of our programs, rendering their executions irreproducible. But TeX will still produce the same page layout when run on the same input files.


> Intel Gen doesn't support denormalized floats and gradual underflow (and I think other GPUs are similar here)

It differs from vendor to vendor. I know at least one major GPU vendor that has full denormal support, at least for normal arithmetic (not sure about transcendentals etc). In Vulkan, it's possible to ask the GPU to either always flush denormals to zero, or (if the HW supports it) to never do so, but I think nobody uses this, and there may be a (huge) performance penalty from requesting either of these!


Subnormals in particular are, I think, a pretty big failure of the IEEE standard: because their implementation is often very slow and varies across architectures, the only time most people learn about them is when they hit FP performance issues and want to disable their support.


Their implementation is not necessarily slow. In the past there have been many implementations where using subnormals did not cause any speed penalty.

The real disadvantage appears to be that a fast implementation is likely to be more expensive, i.e. it requires more complex circuits, thus a greater chip area. Most designers of recent CPUs or GPUs appear to be much more concerned with obtaining a greater speed and a minimum chip area than with obtaining adequate numeric accuracy, because most buyers are influenced by speed benchmark results and know little or nothing about the accuracy required by their computations.

Moreover, subnormals are an optional feature of the floating-point standard; the alternative to using subnormals is to enable the underflow exception. However, an overwhelming majority of programmers are too lazy to enable and handle overflow exceptions, much less to enable and handle underflow exceptions.

Unlike using either underflow exceptions or subnormals, the use of flush-to-zero and denormals-are-zero is acceptable only in the programs where computation errors do not matter at all, such as games.
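To make the gradual-underflow part concrete, a small sketch: the smallest normal f32 can still be halved into a nonzero subnormal, and differences between distinct tiny numbers survive, which is exactly what flush-to-zero/denormals-are-zero give up (the FTZ/DAZ control bits themselves are per-CPU flags and are not toggled here):

    fn main() {
        let smallest_normal = f32::MIN_POSITIVE; // 2^-126
        let subnormal = smallest_normal / 2.0;   // 2^-127, a subnormal

        println!("{subnormal:e}");               // about 5.88e-39
        assert!(subnormal > 0.0);                // gradual underflow keeps it nonzero
        assert!(subnormal < smallest_normal);

        // With gradual underflow, x - y == 0 implies x == y; under flush-to-zero
        // the difference of two distinct tiny numbers can be flushed to 0.
        let x = smallest_normal * 1.5;
        let y = smallest_normal;
        assert!(x - y != 0.0);                   // the subnormal difference survives
    }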


> Unlike using either underflow exceptions or subnormals, the use of flush-to-zero and denormals-are-zero is acceptable only in the programs where computation errors do not matter at all, such as games.

That is not true. FTZ and DAZ are perfectly reasonable in a lot of scientific computing scenarios where computation errors in general are closely scrutinized.


It depends on what is meant by "scrutinized".

FTZ and DAZ and any other methods that are certain to introduce errors can be accepted only inside a computation for which there is an independent way to verify the results and that way is always used, without exceptions.

For example, it is indeed OK to use FTZ and DAZ when solving a system of equations if, and only if, after obtaining a solution, it is substituted back into the original equations and a residual error is computed (this time without using FTZ and DAZ), with an excessive residual error triggering an appropriate means of notifying the user about the failure.
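A minimal sketch of that pattern, with a hypothetical 2x2 solver standing in for the fast (possibly FTZ/DAZ-affected) computation; the point is only the structure of solve, substitute back, and compare the residual against a tolerance (toggling the actual FTZ/DAZ control bits is not shown):

    // Solve a 2x2 system A x = b by Cramer's rule (stand-in for the fast,
    // possibly FTZ/DAZ-affected solver), then verify by computing the
    // residual r = b - A x and checking its size.
    fn solve_2x2(a: [[f64; 2]; 2], b: [f64; 2]) -> Option<[f64; 2]> {
        let det = a[0][0] * a[1][1] - a[0][1] * a[1][0];
        if det == 0.0 {
            return None;
        }
        Some([
            (b[0] * a[1][1] - b[1] * a[0][1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det,
        ])
    }

    fn residual_norm(a: [[f64; 2]; 2], b: [f64; 2], x: [f64; 2]) -> f64 {
        let r0 = b[0] - (a[0][0] * x[0] + a[0][1] * x[1]);
        let r1 = b[1] - (a[1][0] * x[0] + a[1][1] * x[1]);
        (r0 * r0 + r1 * r1).sqrt()
    }

    fn main() {
        let a = [[4.0, 1.0], [2.0, 3.0]];
        let b = [1.0, 2.0];
        let x = solve_2x2(a, b).expect("singular system");
        // The independent verification step: an excessive residual means the
        // solution (however it was computed) must be rejected.
        if residual_norm(a, b, x) > 1e-9 {
            eprintln!("solution rejected: residual too large");
        }
    }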


Yeah. The IEEE standard was way better than not having it, but I think that for a lot of cases, posits are much better thought out now. Hopefully they'll start to gain broader hardware support.


Posits have advantages for low-precision numbers, e.g. up to 32 bits.

For high-precision numbers, e.g. for 64 bits, the standard double-precision floating-point numbers are the best.

Any number format using a given number of bits has the same number of values that can be represented. The only difference between the various formats, e.g. integers, fixed-point numbers, floating-point numbers or posits, is how that same number of points is distributed over the real number line: denser in some parts and sparser in others.

For scientific computation, the optimal distribution of the representable numbers is when the relative error is constant. The ideal number format would be logarithmic, but it is too difficult to add and subtract logarithmic numbers, so the second best format is used, i.e. floating-point numbers, which allow fast implementations for all important arithmetic operations.
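One quick way to see the roughly constant relative error of binary floating point: the gap from a positive f64 to the next representable value, divided by the value itself, stays between about 2^-53 and 2^-52 regardless of magnitude. A small sketch, computing the gap by bit manipulation:

    // Gap from a positive, finite f64 to the next representable value.
    fn ulp(x: f64) -> f64 {
        f64::from_bits(x.to_bits() + 1) - x
    }

    fn main() {
        for x in [1.0e-100_f64, 1.0, 1.0e100] {
            // The relative spacing stays between about 2^-53 and 2^-52
            // across roughly 200 orders of magnitude.
            println!("x = {x:e}, ulp(x)/x = {:e}", ulp(x) / x);
        }
    }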

Posits have a non-uniform relative error, low close to 1 and high for large and small numbers, so they are unsuitable for complex physics modelling, which needs more computations with large and small numbers than with numbers close to 1.

On the other hand, for the applications where low-precision numbers are suitable, e.g. machine learning, graphics or some kinds of DSP, posits could be better than floating-point numbers.


I agree with you that Posits only really make sense for <=32 bits. That said, I disagree that the optimal distribution of numbers is logarithmic. Especially for 16 bit or smaller computations, if you are dealing with big or small numbers, you need to renormalize to roughly 1 anyway to prevent overflow and underflow. As such, it makes a lot more sense to have extra accuracy near 1.


I agree with you that, for small precisions, a logarithmic distribution may not be optimal, and that is why posits may be a better choice for such applications.

Logarithmic distribution is optimal for complex physics models where there are relationships between many different kinds of physical quantities, so it is not possible or convenient to choose a scale factor that would make all quantities have values close to 1. In such applications many numbers are very large or very small and any other distribution except logarithmic leads to increased errors in the results. Such models also typically need a 64-bit precision. Moreover, having a much greater exponent range than provided by the 32-bit floating-point format, in order to avoid overflows and underflows, is frequently even more important for them than having a greater precision.

If for an application the range of the 32-bit floating-point numbers is good enough, then there are indeed chances that posits might be better for it.


What makes posits so good for hardware? Does the fact that they have variable sized exponent not cause similar difficulty as the subnormals in floats?


The main difference of posits (the variable-sized exponent) is slightly worse for hardware (but not much). Posits, however, win back a lot of points because they remove a lot of special cases. FP hardware generally needs special checks for NaNs, Infs, -0 and subnormals; posits only have two "special" values, 0 and NaR. The problem with subnormal floats isn't that they are inherently slow, it's that they are different and rare, which means they aren't well optimized for. With posits, everything is the same, which reduces the number of branches in hardware.


The question, though, is whether you actually need bit-identical results or whether you can just build a numerical model of your accuracy such that the result is within some bounded error.

Or you can just use infinite precision libraries at the cost of CPU.


TeX rendering (example from the parent of this thread) is highly non-linear though. Even a negligible difference in some computation can result in changes that cause an entire document to render completely differently. TeX layout is a global optimization problem, with a lot of big step functions in cost. If you want reproducible rendering in something like TeX, you need to use fixed point arithmetic.


Granted, I haven't played in this space; maybe text rendering is more complicated than indoor positioning. But there we simply made sure results were numerically deterministic on a given machine given a known seed. However, we would always run with a random seed to make sure we didn't over-tune to random initial parameters. And we did manage to make the system numerically stable (i.e. random perturbations still resulted in the expected value +/- some error). And yes, the system was highly nonlinear (lognormal probability calculations, Kalman filters, etc.).

Do you have specific knowledge that it is actually difficult due to some nuance specific to layout because you've spent time on the problem, or is that an intuition you have from first principles?


It may be illustrative to discuss a concrete example from text layout: consider deciding whether to break before or after a certain word on a line.

• Suppose the width of each line is L, and the inter-word space is "S plus P minus M" in TeX notation (i.e. default S with stretchability of P and shrinkability of M), and that after N words with total width C, we encounter the next word of width W, which would have to either go on the next line, or be squeezed onto that line.

    .....................|
    This is an example line
• If we break the line before that word, we'd have to stretch the default space of (C + (N-1)S) to L, so a stretch factor of (L - C - (N-1)S)/P.

• If we break the line after that word, we'd have to shrink the default space of (C + W + NS) to L, so a shrink factor of (C + W + NS - L)/M.

So we'd pick one choice or the other depending on which of these two fractions is smaller.

(This is not exactly TeX's algorithm — the above is a greedy/local "best-fit" rather than TeX's "optimum fit" that considers the paragraph as a whole; see the paper at http://www.eprg.org/G53DOC/pdfs/knuth-plass-breaking.pdf — but it's hopefully illustrative.)

Even in this simple case that just needs to compare two ratios (in practice we have to at least compare two sums of such fractions), if you compute these two fractions as real numbers, and you don't have any consistent floating-point arithmetic available across different machines and compilers (or across time), how would you ensure that the system would be deterministic / numerically stable as you said? (Would always make the same decision between the two choices?)

(And would that method have worked in 1980? The initial version of TeX, now called TeX78, was written in 1977–1978 in the SAIL language and only had to run on one computer/compiler (the ones at SAIL in Stanford), and it did use the "real" type. By 1980, (programs based on) TeX had already proliferated to various places—people were porting/rewriting TeX into their own OS/languages—and Knuth started the rewrite of TeX into the current TeX82 in WEB (based on Pascal). At that time both hardware and Pascal compilers varied widely in their support for and implementation of floating-point arithmetic, so Knuth pretty much had to use fixed point (i.e. only integer arithmetic staying within 2^31) to ensure that the same output would be produced everywhere. But I'm curious how we could get numerical stability in the sense you mentioned even today, as the problem seems to inherently have the property that small perturbations can lead to different output.)
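(For concreteness, a rough sketch, not TeX's actual code, of how the fixed-point route makes that comparison exact and hence reproducible; the names just mirror the hypothetical quantities above. With every dimension held as a scaled integer, the two ratios can be compared by cross-multiplying in 64-bit integer arithmetic, so every machine takes the same branch.)

    // All dimensions are TeX-style scaled integers: multiples of 1/65536 pt.
    type Scaled = i32;

    // Break before the word (stretch the line as it is) or after it (squeeze
    // the word in), comparing (L - C - (N-1)S)/P against (C + W + NS - L)/M
    // without any division: for positive P and M, a/P <= b/M <=> a*M <= b*P.
    fn break_before(l: Scaled, c: Scaled, w: Scaled, s: Scaled,
                    p: Scaled, m: Scaled, n: i64) -> bool {
        let stretch_needed = l as i64 - c as i64 - (n - 1) * s as i64;
        let shrink_needed = c as i64 + w as i64 + n * s as i64 - l as i64;
        stretch_needed * m as i64 <= shrink_needed * p as i64
    }

    fn main() {
        // 1 pt = 65536 scaled units; the dimensions here are made up.
        let pt: Scaled = 65536;
        println!("break before: {}",
                 break_before(30 * pt, 26 * pt, 3 * pt, pt, pt / 2, pt / 3, 5));
    }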


Reasoning from first principles:

When a problem is linear, a small perturbation in your input will result in a similarly small difference in the results, bounded by some constant factor. When a problem is non-linear, there is no such constant upper bound to the output error.

There are differences in the amount of nonlinearity, however. It seems that your algorithm was nonlinear, e.g. using log and exp functions, but otherwise pretty well behaved. So while the factor between input and output error might not be constant, but rather dependent on the input values, it is still the case that in the limit of the input error going to zero, the output error will also vanish (obviously in the real domain, not considering floating point).

Contrast this with a problem that has discontinuities in it. In that case it can happen that however small you make your input error, any nonzero perturbation will cause a significant change in the solution. The TeX layout problem is an example of this, but it also happens often enough in physical simulations.
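A toy illustration of that last point (the "does the word fit" test below is made up, but any step function behaves the same way): a one-ulp change in the input flips the decision, so the output difference does not shrink with the input difference.

    // A discontinuous step: either the word fits on the line or it doesn't.
    fn lines_needed(word_width: f64, line_width: f64) -> u32 {
        if word_width <= line_width { 1 } else { 2 }
    }

    fn main() {
        let line_width = 100.0_f64;
        let w = 100.0_f64;
        let w_perturbed = f64::from_bits(w.to_bits() + 1); // one ulp larger

        // The input difference is about 1.4e-14; the output difference is a
        // whole extra line, and it does not shrink as the perturbation does.
        println!("{} vs {}", lines_needed(w, line_width),
                 lines_needed(w_perturbed, line_width));
    }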


It’s not necessarily more complicated, but it has a strict reproducibility requirement. If I render TeX today, it has to end up 100% identical to what you get when you render it on your machine tomorrow, or what I get ten years from now.

If a single tiny rounding difference makes a letter even a millionth of its width wider, that may mean line breaks move to different locations, that the number of lines in the text changes, or even that the number of pages in the text changes.


The first release of TeX was in 1978.

Dr. Knuth could have plausibly gotten results similar to yours by writing a 'classic' floating point algorithm: but IEEE 734 [sic: 754] wasn't even published until 1985, so relying on hardware, let alone stable results from hardware, was completely impossible.

This would also have been very, very, very. Very slow. Adding a constant factor of 20-50X to TeX wouldn't have been fun at all.

TeX is also a famous program, the algorithms involved are well-known to anyone with an interest. The statement that it uses an expensive global strategy for layout is accurate.


(For those also curious: I tried to look up IEEE 734 thinking it was a previous standard, but it seems there is none and the parent commenter meant IEEE 754-1985.)


Thank you, that was a typo, I've made a note in the original.


Knuth has explained that he rewrote TeX with fixed point to get reproducibility; this is not a speculation from first principles, but a hard-won battle scar. But it's not specific to text layout either; almost any chaotic system has the same reproducibility problem if you rerun the same program with the same inputs (including seeds) on new hardware, or, often, even with a new version of the compiler or different compilation options.

Of course the whole point of things like Kalman filters is that they are insensitive to small quantitative errors, and as you say you worked hard to ensure your programs as a whole were also insensitive; experience with things like that can lead your intuition astray.


The objective for TeX was good, but not optimal, accuracy and performance, and perfect reproducibility. You are evidently accustomed to domains where you want good, but imperfect, reproducibility, but optimal accuracy and performance.

You are correct that requiring both perfect reproducibility and optimal accuracy requires sacrificing performance, but actually it is worse than that; this change makes some ordinary terminating algorithms fail to terminate.


Note that not all languages semantically allow combining a+b*c into fma. This isn't a floating point problem, but a language problem.


If you're doing long FP calculations, you can allow FMA, or have poor results. Sure, some languages think it's worth it for the determinism, but rounding after every operation can completely destroy your precision. Even more so for something like x87, where internal registers are higher precision, not that too many people use it anymore.


There's a 3rd option. Allow fma, but only where the programmer explicitly puts an `fma` (or specifically allows reordering). This does sacrifice a little bit of ease of use to get optimal performance, but allows for reproducibility and high performance.

It is a shame x87 was so poorly thought out. If it had a better defined programming model, it would be incredibly useful.


> where integer values are not acceptable

Er why not


Are you asking why there’s floating point in general? Or why Tex needed floating point to represent layout?

For the latter likely because you have to do layout. You could try defining a fundamental page unit but some layout decisions will likely still put you into fractional space where you need to preserve the fractions to get a correct result.


Ah I misunderstood.

Thought it was "integers not allowed" which would be strange.

Actually it is "integers not sufficient"


Qm.n notation is for fixed point numbers, not floating point. As OP said, floating point was not a viable option at the time.


So why not work with fractions instead? Then the problem becomes storing big or arbitrary-sized integers.


They are effectively fractions: everything is represented to the nearest 1/65536, i.e. as a fraction whose denominator is fixed at 65536.


I think my question is rather, "Why not use a representation as a fraction?" Technically every number inside a computer, which is not merely a symbol interpreted outside of the computer, is a fraction, because of the limited precision of finite memory. I guess it is all about the representation, then. So the question becomes why one representation is better, and why not use a representation as a fraction of two integers instead.


Well, scaled fixed point representation lets you use integer operations. With two separate memory locations, you'd have to manage things like borrow and carry yourself.
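For contrast, a small sketch of what the two-integer fraction route entails: even addition needs cross-multiplication, a gcd reduction to keep the numbers small, and attention to overflow, none of which arises with a fixed 1/65536 scale (the helper names here are just illustrative):

    fn gcd(mut a: i64, mut b: i64) -> i64 {
        while b != 0 {
            let t = a % b;
            a = b;
            b = t;
        }
        a.abs()
    }

    // Add a/b + c/d as exact rationals, reducing the result.
    fn add_frac((a, b): (i64, i64), (c, d): (i64, i64)) -> (i64, i64) {
        // Cross-multiply; these products are where overflow creeps in.
        let num = a * d + c * b;
        let den = b * d;
        let g = gcd(num, den);
        (num / g, den / g)
    }

    fn main() {
        println!("{:?}", add_frac((1, 3), (1, 6))); // (1, 2)
    }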


Fixed point is a super fun rabbit hole into quantization and numerical representations. I work with them all the time for FPGA hardware design, using HLS to build deep learning models; HLS is our bread and butter.


I recently discovered 'posits', which are clearly the best number format for representing rationals (don't say 'real'; we can exactly represent approximately 0% of the reals).


For reference, as I didn't know: posits are a modern floating-point alternative to IEEE 754. Wikipedia [1] claims they are better in many respects, but, as ubiquitous as IEEE numbers are, I don't see them getting replaced soon.

[1] https://en.m.wikipedia.org/wiki/Unum_(number_format)#Unum_II...


So what makes posits so great?


No infinities; it saturates instead. Only one bit pattern for 'NaR'. Only one bit pattern for 0. No subnormals. No overflow or underflow. Highest resolution around ±1. Natural, graceful degradation in resolution for larger numbers. A much higher percentage of operations hit an exactly representable result. Equality and ordinal comparison are the same as for integers. Much simpler addition and multiplication. -1/x is just two's complement. Perfect reciprocals of powers of 2. A 1-clock sigmoid function. A very clean conceptual design.

https://youtu.be/JJgT-YphE1Y


Also prohibits nan-boxing, afaiu, which may be a big loss for dynamic languages.


I think the advantages are big enough that you can afford to use a flag plus a 63 bit posit and still be ahead.


What does NaN-boxing mean in this context?


https://softwareengineering.stackexchange.com/a/185431

Basically it is a way to push primitive values of any type and pointers into a 64-bit double, to reduce memory/cache footprint in dynamic languages.
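A rough sketch of the idea; the tag layout below is made up rather than any particular VM's. Non-float values are parked in the otherwise unused payload bits of a quiet NaN, so every value of the dynamic language still fits in one 64-bit word:

    // Quiet-NaN prefix: sign 0, exponent all ones, top mantissa bit set.
    const QNAN: u64 = 0x7FF8_0000_0000_0000;
    // One spare mantissa bit as a made-up "boxed integer" tag, so hardware
    // NaNs (payload 0) never collide with boxed values.
    const TAG_INT: u64 = 0x0004_0000_0000_0000;

    fn box_f64(x: f64) -> u64 {
        x.to_bits()
    }

    fn box_int(v: u32) -> u64 {
        QNAN | TAG_INT | v as u64
    }

    fn unbox(bits: u64) -> String {
        if (bits & (QNAN | TAG_INT)) == (QNAN | TAG_INT) {
            format!("int {}", bits as u32)
        } else {
            format!("float {}", f64::from_bits(bits))
        }
    }

    fn main() {
        println!("{}", unbox(box_f64(3.25))); // float 3.25
        println!("{}", unbox(box_int(42)));   // int 42
    }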


It seems to be a complicated way to explain that, if you need fractional values, you can just multiply them by a power of two internally. E.g. if you just need a resolution of 1/256 (a case that comes to mind is measurements coming from an analog-to-digital converter), you just multiply everything by 256 and do your arithmetic as usual. Or actually you don't have to do anything except remember to scale your constants the same way and do the conversion for I/O where needed. It might be wise to protect those values with a dedicated type to prevent mistakes.

The "saturation" thing in the article is a joke; no way you can get correct results at the end if you cap the result of computations on overflow. That's slightly better than letting the overflow happen, but the right thing to do is to throw an error, not silently min/max the result.


Saturation is a real thing: lots of DSPs use Q-format fixed-point integer maths with saturation.

It's not designed to avoid errors; it's a way to detect overflows without floating-point capability. You can't always just throw an error in some hard real-time systems (though you obviously need to handle saturated results carefully, which depends on what you're up to).
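A small illustration with Rust's built-in saturating integer ops standing in for a DSP's saturating Q-format arithmetic (the Q1.15 framing is just for flavour):

    fn main() {
        // Read these i16 values as Q1.15: i16::MAX is just under +1.0.
        let acc: i16 = 30_000;
        let x: i16 = 10_000;

        // Wrapping overflow silently produces a nonsense negative value...
        println!("wrapping:   {}", acc.wrapping_add(x));   // -25536
        // ...saturation clamps to the largest representable value instead,
        // which a later stage can detect (result == i16::MAX) and handle.
        println!("saturating: {}", acc.saturating_add(x)); // 32767
    }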


IMHO it's unmoored to think of two's complement as having a sign bit. Two's complement was invented to escape from the sign bit, and it works by adding/subtracting the bits as if there were no sign.

(careful defending the sign bit unless you believe zero to be a positive number)


The sign bit in two’s complement is still called the sign bit.


not in the sense where you would always keep in mind "there is one bit of sign and 7 bits of value", and not in the sense that you "can run the sign bit through an ALU adder as data, don't worry, it'll work". It's just a confusing way to think about it.

One place where "the sign bit interpretation" shows up is still not a slam dunk for sign bits. Using C as the example, when casting a signed char to a signed word, you do have to take the "sign bit" and extend it: but you don't just move it to the new sign bit, you extend it through all those other "data/value" bits.


The representation sure works differently from one’s complement, but it is still called the sign bit, because the number is negative if and only if that bit is set.


yes, but my original comment did say don't "think" of it as a sign bit rather than don't call it that.

And one's complement is a complement, but two's complement isn't really. It does turn out that if you take the complement and add one you get the right value, but a better way to think of it (on the byte scale, e.g.) is "take a modulo-256 dial, now change the labeling counterclockwise from the top to negative numbers, because you note that 255+1 is 0 just like -1+1 is 0". I just think this explanation serves beginning C or ASM programmers better because it's more mentally agile.
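A tiny check of both the modulo-dial view and the sign extension mentioned upthread (a sketch in Rust; the casts are the usual two's-complement reinterpretations):

    fn main() {
        // Same 8 bits, two labelings of the modulo-256 dial:
        assert_eq!(-1i8 as u8, 255);
        assert_eq!(255u8.wrapping_add(1), 0);  // 255 + 1 == 0 (mod 256)
        assert_eq!((-1i8).wrapping_add(1), 0); // -1 + 1 == 0

        // Widening a signed value copies the top bit into all the new bits
        // ("sign extension"); it isn't just moved to a new sign position.
        assert_eq!(-2i8 as i16 as u16, 0xFFFE);
        println!("ok");
    }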

I learned it in what I would call "the dumb way", but since I was amazed that it worked, I studied it to convince myself it was true, and then I realized what was actually going on.


Learned about these from Rust In Action book. Seems like programmers keep rediscovering stuff now and then.



