Why Does Integer Addition Approximate Float Multiplication? (probablydance.com)
102 points by ibobev 9 hours ago | 28 comments





The most significant bits of a floating point representation are basically a logarithm of the number. Logarithms have this relation between multiplication and addition.

Yes, and overflowing the mantissa moves the exponent in the right direction by the right amount; it's just that the least significant bits are no longer quite in the right place. But that is by definition a small error.
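
If it helps to see it concretely, here's a minimal Rust sketch of the trick (my own illustration, not the article's exact code): treat the raw bit patterns as integers, add them, and subtract the bit pattern of 1.0 to remove the doubled exponent bias.

    // The bit pattern of a positive, normal float is roughly a scaled-and-biased
    // log2 of its value, so adding bit patterns and subtracting the bits of 1.0
    // approximates multiplication.
    fn approx_mul(a: f32, b: f32) -> f32 {
        const ONE_BITS: u32 = 0x3F80_0000; // bit pattern of 1.0f32
        f32::from_bits(a.to_bits() + b.to_bits() - ONE_BITS)
    }

    fn main() {
        let (a, b) = (3.7_f32, 11.3_f32);
        println!("exact:  {}", a * b);            // 41.81
        println!("approx: {}", approx_mul(a, b)); // ~40.4, within a few percent
    }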

To give a "yes and" side-track to your comment: saying "logarithms have this relation between multiplication and addition" is even underselling what logarithms are, because reducing multiplication to an additive operation was the whole motivation for John Napier[0] to discover/invent logarithms:

> “…nothing is more tedious, fellow mathematicians, in the practice of the mathematical arts, than the great delays suffered in the tedium of lengthy multiplications and divisions, the finding of ratios, and in the extraction of square and cube roots… [with] the many slippery errors that can arise…I have found an amazing way of shortening the proceedings [in which]… all the numbers associated with the multiplications, and divisions of numbers, and with the long arduous tasks of extracting square and cube roots are themselves rejected from the work, and in their place other numbers are substituted, which perform the tasks of these rejected by means of addition, subtraction, and division by two or three only.”[1]

Logarithms were honestly an enormous breakthrough in optimization; computers wouldn't be remotely as useful without them, even if most of us don't "see" the logarithms being used.

In fact I'd argue that they are the second-biggest computational optimization in use today, with only positional notation being a bigger deal. Which, funny enough, works kind of similarly: imagine you could only count by tallying (so, unary). Adding two numbers M and N would take M+N operations, e.g. 1234 + 5678 would require counting out all 6912 individual tally marks. Unary math scales O(n) in both data and computation. Systems like Roman numerals almost work, but as soon as we reach values larger than the largest symbol (M for 1000) it's O(n) again, just with a better constant factor.

With positional notation, numbers require only log(n) symbols to write down and log(n) operations to add, e.g. 1234 + 5678 requires one or two additions for each digit pair in a given position: one addition if there's no carry from the previous position, two if there is. So addition takes at most 2 × ceil( max( log(M), log(N) ) ) operations, i.e. O(log(n)).

Logarithms take that idea and "recursively" apply it to the notation, making the same optimization work for multiplication. Without it, the naive algorithm for multiplying two numbers requires iterating over each digit, e.g. 1234 × 5678 requires multiplying each of the four digits of the first number with each of the digits of the second number, and then adding all the resulting numbers. It scales O(di×dj), where di and dj are the number of digits of each number. If they're the same we can simplify that to O(d²). When the numbers are represented as logarithms, the operation is reduced to adding two numbers again, so O(d) plus whatever the log/inverse-log conversion costs. Of course d is a different value here, and the number of digits used affects the precision.
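
To make the O(d²) part concrete, here's a small sketch (my own illustration, with made-up helper names) of schoolbook long multiplication over digit vectors, counting the single-digit multiplications:

    // Schoolbook long multiplication over base-10 digit vectors
    // (least significant digit first). Counts single-digit multiplies
    // to show the O(d1 * d2) behaviour described above.
    fn long_mul(a: &[u8], b: &[u8]) -> (Vec<u8>, usize) {
        let mut out = vec![0u32; a.len() + b.len()];
        let mut single_digit_muls = 0;
        for (i, &da) in a.iter().enumerate() {
            for (j, &db) in b.iter().enumerate() {
                out[i + j] += (da as u32) * (db as u32);
                single_digit_muls += 1;
            }
        }
        // propagate carries so each slot holds a single decimal digit
        let mut carry = 0u32;
        let mut digits = Vec::with_capacity(out.len());
        for v in out {
            let t = v + carry;
            digits.push((t % 10) as u8);
            carry = t / 10;
        }
        (digits, single_digit_muls)
    }

    fn main() {
        // 1234 * 5678 = 7006652, digits stored least-significant first
        let (digits, count) = long_mul(&[4, 3, 2, 1], &[8, 7, 6, 5]);
        println!("{:?} ({} single-digit multiplications)", digits, count); // 16 = 4 * 4
    }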

I think the craziest thing of all this is that we're so used to positional notation that nobody ever seems to consider it a data compression technique. Even though almost no other data compression method would work without it as a building block (run-length encoding, Lempel-Ziv, Arithmetic coding? Useless without positional notation's O(log(n)) scaling factor). The only exceptions are data compression methods that are based on inventing their own numerical notation[2].

We do this every day ever since we first learned addition and subtraction as kids. Or as David Bessis[3] puts it in his book "Mathematica": ask almost any adult what one billion minus one is and they know the answer instantaneously, so most adults would appear to have mental superpowers in the eyes of pretty much all mathematicians from before positional notation was invented (well, everyone except Archimedes maybe[4]). Positional notation is magical, we're all math wizards, and it's so normalized that we don't even realize it.

But to get back to your original point: yes, you are entirely correct. IEEE floats are a form of lossy compression of fractions, and the basis of that lossy compression is logarithmic notation (but with a fixed number of binary digits and some curious rules for encoding other values like NaN and infinity).
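
For the concrete layout, here's a rough sketch of pulling an f32 apart into those fields (1 sign bit, 8 exponent bits stored with a bias of 127, 23 mantissa bits):

    // Split an f32 into the sign / exponent / mantissa fields described above.
    fn fields(x: f32) -> (u32, i32, u32) {
        let bits = x.to_bits();
        let sign = bits >> 31;
        let exponent = ((bits >> 23) & 0xFF) as i32 - 127; // remove the bias
        let mantissa = bits & 0x007F_FFFF; // fractional part; the leading 1 is implicit
        (sign, exponent, mantissa)
    }

    fn main() {
        // 6.5 = +1.625 * 2^2, and 0.625 * 2^23 = 5242880
        println!("{:?}", fields(6.5)); // (0, 2, 5242880)
    }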

[0] https://en.wikipedia.org/wiki/John_Napier

[1] https://en.wikipedia.org/wiki/Mirifici_Logarithmorum_Canonis...

[2] https://en.wikipedia.org/wiki/Asymmetric_numeral_systems

[3] https://www.quantamagazine.org/mathematical-thinking-isnt-wh...

[4] https://en.wikipedia.org/wiki/The_Sand_Reckoner


Excellent comment!

As a historical tidbit I'll add that the Romans did develop two ways to write larger numbers.

1. Writing a line (vinculum) over a numeral to multiply its value by 1,000. This was in fact extended to writing a line to the left and above a numeral to multiply its value by 1,000,000, and could in principle be extended to lines below and to the right to multiply by 10^9 and 10^12, and even nested boxes for larger powers.

2. The use of |), |)), |))), ... for 500, 5,000, 50,000, ... and (|), ((|)), (((|))), ... for 1,000, 10,000, 100,000, ... These can be continued indefinitely.

https://en.wikipedia.org/wiki/Roman_numerals#Large_numbers

Both require an ever increasing number of marks just to write the increasing powers, as well as an ever increasing number of powers being summed, but both increase only logarithmically, so we end up using O((log n)²) marks to write n. This is quadratically worse than positional notation, but exponentially better than just writing M over and over.


And if I may take this yet a step further: this is the point of mathematical transformations, to find a domain in which the operations you care about are easier, enough so that the savings offset the cost of the round trip into and out of that domain.

In the past, transformations like logarithms, Fourier transforms, and wavelets had to be proposed by hand. Machine learning automated all of that away by using differentiable building blocks to compose complex, universal estimators. The parameters of these building blocks are estimated through gradient descent in conjunction with a user-chosen loss function, which guides the optimization of the transform. Good representations can be manipulated through basic algebra (added, averaged, or compared for similarity, depending on the task) in a way that corresponds to semantic operations, where the raw, untransformed representations cannot.


I'm still amazed at word embeddings. They let you find related words, and even do addition and subtraction on them.

The way you can supposedly take "king", subtract "man", and add "woman" to get "queen"; this kind of thing is the mathematical basis of LLMs.


This is the core trick in the Fast Inverse Square Root[0], as made famous by the Quake III source code.

It uses a shift (equivalent to dividing by two) and a subtraction in integer-land to estimate x^(-0.5) in float-land.

[0]: https://en.m.wikipedia.org/wiki/Fast_inverse_square_root
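
For reference, a rough Rust transcription of the well-known routine (magic constant and single Newton step as described on the Wikipedia page above):

    // Shift the bit pattern right by one (halving the "integer logarithm"),
    // subtract from a magic constant, then refine with one Newton-Raphson step.
    fn fast_inv_sqrt(x: f32) -> f32 {
        let i = 0x5F37_59DF_u32.wrapping_sub(x.to_bits() >> 1);
        let y = f32::from_bits(i);
        y * (1.5 - 0.5 * x * y * y) // one Newton iteration
    }

    fn main() {
        println!("{}", fast_inv_sqrt(4.0));   // ~0.4991
        println!("{}", 1.0 / 4.0_f32.sqrt()); // 0.5
    }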


I've always hated that it's called an “inverse”. It's not an inverse. The inverse of square root is square. If they had called it “reciprocal”, it would have been clear to me what it does, but “inverse” confused the hell out of me when I first saw it.

It’s confusing for non-mathematicians, but (and you may know that) it is not incorrect. https://en.wikipedia.org/wiki/Inverse_element:

“In mathematics, the concept of an inverse element generalises the concepts of opposite (−x) and reciprocal (1/x) of numbers.”


Interestingly, Intel got it right, hence the rsqrt family of instructions.

The "inverse" is about the negative power, right? Square root is the 0.5 power, so inverse square root is the -0.5 power.

"inverse" has a very specific meaning: inverse(x) * x = 1

x^2 * x != 1 for any x other than 1. So no, x^2 is not the inverse of sqrt(x)


No. "inverse" has a very specific meaning: the operation that undoes whatever operation you're talking about.

https://en.m.wikipedia.org/wiki/Inverse_function

The inverse of multiplying by 3 is dividing by 3, and vice-versa.

The inverse of squaring is the square root, and vice-versa.

The inverse of a number or other element (rather than of an operation / function) means whatever element combines with it, under whatever binary operation you're talking about, to give the identity. So the additive inverse of 3 is –3, the inverse of "rotate clockwise 90 degrees" in the group of geometric operations is "rotate anticlockwise 90 degrees", and the multiplicative inverse of 3 is 1/3.

Saying "the inverse of 3" to mean 1/3, i.e. its multiplicative inverse, is sloppy but acceptable so long as everyone knows what you mean (and it's fairly common). Saying "the inverse square root" to mean 1/square root is just wrong.


But sqrt · square = 1, where "sqrt: R⁺ → R⁺" is the square root operation, "square: R⁺ → R⁺" is the squaring operation, "1: R⁺ → R⁺" is the identity operation, and (·) is function composition: i.e., working in the monoid of functions R⁺ → R⁺. So x² and sqrt(x), as elements of R, are not inverses, but sqrt and square, as elements of R⁺ → R⁺, are.

It depends on how you parse the phrase "fast inverse square root".


Neat. Of course it works for the exponent, but that it's not that far off for the mantissa is unexpected. It helps that the mantissa is normalized.

For the mantissa, the insight is probably that:

(1+e1)*(1+e2) = 1+e1+e2+(e1*e2)

If e1 and e2 are small, then e1*e2 is negligible.
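
A quick numeric check of that, just as an illustration:

    fn main() {
        // Relative error of dropping the e1*e2 cross term, as above.
        let (e1, e2) = (0.05_f64, 0.08_f64);
        let exact = (1.0 + e1) * (1.0 + e2); // 1.134
        let approx = 1.0 + e1 + e2;          // 1.13
        println!("relative error: {:.4}", (exact - approx) / exact); // ~0.0035
    }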


Huh, and in the case that e1 and e2 are large, the mantissa overflows into the exponent, multiplying the result by 2.

I guess we're lucky that the IEEE 754 committee put the exponent in front of the mantissa. Or maybe they were aware of this trick all along ...

This arrangement makes ordering easier anyway, which is probably why they chose it.

If we have some positive 32-bit integers A and B whose bit patterns aren't NaNs, then A < B implies f32::from_bits(A) < f32::from_bits(B)
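
A small sanity check of that ordering property (my own sketch, sticking to positive, non-NaN values):

    fn main() {
        // For bit patterns of positive, non-NaN floats, integer order matches
        // float order, because the exponent bits sit above the mantissa bits.
        let pairs = [(1.0_f32, 1.5_f32), (1.5, 2.0), (0.1, 1000.0)];
        for (a, b) in pairs {
            assert!(a < b);
            assert!(a.to_bits() < b.to_bits());
        }
        println!("integer order matches float order for these pairs");
    }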

Edited to add anecdote:

I actually had a bug in realistic (my Rust crate implementing Hans Boehm's "Towards an API for the Real Numbers") which I only fixed recently. When converting my Real number into a floating point type I might end up with too many mantissa bits; sometimes, coincidentally, the bottom bit of the exponent is zero, so instead of increasing the exponent by one I was overwriting it with a one, which happens to give the same result.

I finally caught that when it occurred to me to do "thorough testing". That is, take every possible 32-bit float, turn it into a Real, then turn that Real back into a 32-bit float; this should be a round trip, and if it isn't, we've found a bug. There are "only" a few billion values to test; a machine can do that.

I fixed the 64-bit routines for the same bugs, but I don't have enough years to run all the 64-bit tests the same way and discover any bugs that have no 32-bit counterpart.

[Of course there are presumably still bugs in the conversion algorithms, which may either be symmetric, so that they don't affect a round trip, or may not affect the binary fractions that floats use exclusively. Real has proper rationals and various other things like square roots of integers, pi, etc., which we can convert to a float and lose precision, but we never get these from converting a float, because floats are always binary fractions.]


Previous discussion of the paper:

Addition is all you need for energy-efficient language models - https://news.ycombinator.com/item?id=41784591 - Oct 2024 - 126 comments


It's math, and not just for integers. That's also how a slide rule works.

a×b = 10^(log(a×b))

log(a×b) = log(a)+log(b)

thus a×b = 10^(log(a)+log(b))
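
The same identity in code, just as an illustration:

    fn main() {
        let (a, b) = (3.7_f64, 11.3_f64);
        // a * b = 10^(log10(a) + log10(b)), which is what a slide rule does
        // mechanically by adding lengths proportional to the logarithms.
        let via_logs = 10f64.powf(a.log10() + b.log10());
        println!("{} vs {}", a * b, via_logs); // both ~41.81
    }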


Yes, but what is nontrivial here is how good the approximation is, despite the mantissa part being linear (much better than the upper bound of being within 2x of the real value).

Nontrivial yes, but it's not that complicated to figure out that floats are a piecewise linear approximation of the base 2 logarithm.

Edit: or base 2^k, depends a bit on how you view it.


This can be very worth it in circuit design for custom accelerators. Floating-point operations are often multicycle. If you can get close enough with addition, it will save a ton of resources and probably also simplify the design.

Would subtraction also approximate division?

Yes

Would negation also approximate the reciprocal?

Yes. Logarithms are wonderful like that :)
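
Both follow from the same bit-pattern view; a rough sketch (same 1.0 bias constant as in the multiplication trick, positive normal floats assumed, my own naming):

    // Subtracting bit patterns approximates division, and "negating" around
    // the bits of 1.0 approximates the reciprocal.
    const ONE_BITS: u32 = 0x3F80_0000; // bit pattern of 1.0f32

    fn approx_div(a: f32, b: f32) -> f32 {
        f32::from_bits(a.to_bits().wrapping_sub(b.to_bits()).wrapping_add(ONE_BITS))
    }

    fn approx_recip(x: f32) -> f32 {
        f32::from_bits((2 * ONE_BITS).wrapping_sub(x.to_bits()))
    }

    fn main() {
        println!("{} vs {}", 41.81_f32 / 11.3, approx_div(41.81, 11.3)); // ~3.70 vs ~3.79
        println!("{} vs {}", 1.0 / 3.7_f32, approx_recip(3.7));          // ~0.270 vs ~0.288
    }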


