
The radix 2^51 trick (2017) - mooreds
https://www.chosenplaintext.ca/articles/radix-2-51-trick.html
======
Strilanc
In hardware this would be called a carry save adder [1].

One interesting fact about carry save adders is that the carry part of the
register can be extended in order to avoid any carrying between words for very
long periods of time. Instead of using 53 bits to represent 52 bits, use 60
bits, and now you can perform 256 sequential additions with no carrying
between words before the carry part saturates and needs to be handled.

Somewhat surprisingly, it's even possible to use this construction in the
context of a quantum computation [2][3]. The reason it's surprising is because
tacking on additional registers like this can act as an information leakage
mechanism (which is an error in a quantum computation). For analogy, you can
imagine that if you computed a public key using this trick that you would not
be comfortable handing an attacker the registers storing your public key
before you removed the carry padding that made the computation faster. But it
turns out that if you just initialize the carry padding randomly, subtracting
out of the next word at the start to ensure everything sums to the correct
result at the end, then you can show that the information leakage and chance-
of-bad-overflow is exponentially suppressed in the length of the carry
padding. Currently the most efficient known way to implement Shor's algorithm
uses this technique [4].

1: [https://en.wikipedia.org/wiki/Carry-
save_adder](https://en.wikipedia.org/wiki/Carry-save_adder)

2: [https://arxiv.org/abs/1905.08488](https://arxiv.org/abs/1905.08488)

3: [https://youtu.be/upTipX9yXNg?t=777](https://youtu.be/upTipX9yXNg?t=777)

4: [https://arxiv.org/abs/1905.09749](https://arxiv.org/abs/1905.09749)

------
tspiteri
> Aside: Why 13 bits instead of 12? For our purposes, we’re going to ignore
> the carries in the most significant limb, allowing numbers to wrap when they
> overflow past 2^256 - 1 (just like how unsigned addition works in C with
> normal size integer types). As a result, we can assign 52 bits to the most
> significant limb and ignore the fact that it will run out of room for
> carries before the other limbs do.

If you're going to wrap, you could go all the way and assign 64 bits to the
most significant limb; that way you save 12 bits which you can spread to the
other four limbs. You can go from {52, 51, 51, 51, 51} to {64, 48, 48, 48,
48}, so you have a spare 16 bits instead of 13 bits.

~~~
gene91
Exactly. Any one can offer an explanation why they didn't take this path for
256-bit numbers instead?

~~~
dgoldstein0
Taking a guess: if you use this form to do a bunch of operations before
normalizing, you don't need to worry as much about intermediate overflow. E.g.
you could easily add 10 different 256 bit numbers in this form and normalize
at the end. Whereas if the leading chunk is 64 bits, every addition can
overflow.

So 52/51/51/51/51 seems like the more generally useful choice

~~~
bonzini
Why would you ever worry about overflow of the leading chunk (aka limb)?

~~~
dgoldstein0
Because you want to know when your result is inaccurate?

~~~
bonzini
But you're operating modulo 2^256.

------
brandmeyer
The design of Poly1305 was directly informed by this very technique, with the
addition (heh) of performing the arithmetic on the x87 FPU operating in 80-bit
mode. It even has some zero padding bits placed strategically to aid the
technique, hence the security bound of 2^106 instead of 2^128 or 2^130.

See also
[https://cr.yp.to/mac/poly1305-20050113.pdf](https://cr.yp.to/mac/poly1305-20050113.pdf)

------
steerablesafe
Redundant number systems are a really great topic.

[https://en.wikipedia.org/wiki/Redundant_binary_representatio...](https://en.wikipedia.org/wiki/Redundant_binary_representation)

[https://en.wikipedia.org/wiki/Carry-
save_adder](https://en.wikipedia.org/wiki/Carry-save_adder)

[http://lux.dmcs.pl/csII/ca2_RedundantNS.pdf](http://lux.dmcs.pl/csII/ca2_RedundantNS.pdf)

I have an arbitrary precision arithmetic library in my mind that represents
numbers as streams of digits starting from the most significant digit. A
redundant numeric representation could avoid unbounded backtracking (think
flipping between 0.9999... and 1.0000... in a regular representation) and
could guarantee forward progress for the generated digits.

~~~
jjoonathan
I love that the same trick works for human brains as well as CPUs. Your brain
already pays the price to represent numbers as high level abstract concepts
rather than tightly packed fields. Just let the digits overflow, then clean
them up later.

    
    
        14*14
        140 + 14*4
        140 + 4(16)  <-- 4 tens + 16 ones
        18(16)
        196

~~~
HeWhoLurksLate
I did the same thought and broke them up a little differently:

    
    
       14*14
       7*7*2*2 [b/c 14 = 2*7]
       49*2*2
       98*2
       196
    

For whatever reason, my brain likes using twos (and powers of it), so I play
to that strong spot wherever possible.

 _EDIT: as noted below, I made a mistake (7·7=49,not 47,) and carried it
halfway through my solution. Oops._

~~~
iso947
7x7 isn’t 47, and 94x2 isn’t 196

~~~
dr_zoidberg
Despite that, the breakup does hold up the correct calculation:

    
    
        14 * 14
        (7 * 2) * (7 * 2)
        7 * 7 * 2 * 2
        49 * 2 * 2
        98 * 2
        196
    

And on my side, 49 * 2 quickly resolves to "a hundred minus 2" (because 49 * 2
= (50 - 1) * 2), and pretty much the same for 196 (200 - 4).

------
majke
Funnily enough this is the code I was reading today

[https://github.com/constructor-
igor/cudafy/blob/11cdd4def4e7...](https://github.com/constructor-
igor/cudafy/blob/11cdd4def4e7cc349d9771e8251327d771a481c1/Cudafy/Cudafy.Translator/Resources/simd_functions.h#L1178-L1193)

    
    
        s = a ^ b;          // sum bits
        r = a & 0x7f7f7f7f; // clear msbs
        t = b & 0x7f7f7f7f; // clear msbs
        s = s & 0x80808080; // msb sum bits
        r = r + t;          // add without msbs, record carry-out in msbs
        r = r ^ s;          // sum of msb sum and carry-in bits, w/o carry-out

------
posix_me_less
Very interesting. It may seem that the bad performance of the ADC instruction
is due to bad design of the ALU, but I think this is more of an example of how
replacing (single sequential task) by (several parallel tasks whose results
are combined at the end) allows the CPU to finish faster.

~~~
csense
It's a data flow problem. Four ADD instructions can be parallelized. Four ADC
instructions have to run serially.

~~~
a1369209993
And in particular, sixteen ADC instructions in four chains _shouldn 't_ run
any slower than sixteen ADD instructions (assuming four execution ports and a
large enough reorder buffer), but a: Haswell sucks, and b: even on properly-
designed CPUs, making parallelization easier is almost always _some_ kind of
win, because general-purpose CPUs always have dispatch overhead.

------
CliffStoll
Carries were a major problem in mechanical calculators as well. For example,
the upper section of the Curta had a special bell & gearing to allow carries.
The Friden STW - a mechanical behemoth - had a complex set of gears within the
moving carriage to allow carries & borrows. Same with Monroe & Marchant
mechanical calculators. For electric motor driven calculators, the additional
weight of the moving carriage - as well as the gearing - set limits on how
fast the machine could add and subtract.

------
ramshorns
It's amazing that the 17 instructions at the end to tidy up all the carries,
which looks like it has to be done serially, is still faster. But I guess each
register's carry bits are independent from the ones that get carries added
from the last register, so it could be ...

    
    
      mov W, E
      mov V, D
      mov U, C
      mov T, B
      shr W, 51
      shr V, 51
      shr U, 51
      shr T, 51
      add D, W
      add C, V
      add B, U
      add A, T
    

which does seem like it could be parallelized.

~~~
Suncho
> each register's carry bits are independent from the ones that get carries
> added from the last register

No they're not. Let's say B's non-carry bits are all 1. If you carry anything
from C, that will affect B's carry bits.

~~~
londons_explore
You could check if the middle 64-12-12 bits are all 1's, and if they are
branch to a slow codepath.

It would be an incredibly rare case, so branch prediction will always get it
right, and the cost of a branch nearly nill.

~~~
a1369209993
If you're trying to add two 256-bit numbers, there's _probably_ cryptography
involved, and data-dependant branches are poison to crypto code.

------
bobbiechen
Cool trick. I wonder how it would compare in performance to a carry-lookahead
adder [1], which uses the fact that you can compute any carry digit directly
from a combination of the original inputs (which is increasingly long the
deeper you go).

[1] [https://en.wikipedia.org/wiki/Carry-
lookahead_adder](https://en.wikipedia.org/wiki/Carry-lookahead_adder)

~~~
jonsen
To compare you would have to implement carry-lookahead in software. How would
you do that?

~~~
bobbiechen
Yikes, good question. I thought about it for a bit:

Here's a toy example adding three-bit numbers, where all letters are bits (0
or 1) and you can normally only add one bit at a time.

    
    
          a b c
        + d e f
        -------
        g h i j
    

Instead of doing

    
    
        j = c ^ f
        j_carry = c & f
        i = j_carry ^ (b ^ e)
    

in two steps, we can condense it to

    
    
        j = c ^ f
        i = (c & f) ^ (b ^ e)
    

And similarly, we can get g and h combinationally from the input:

    
    
        #                   "j_carry AND one of b, e"
        i_carry = (b & e) | ( (c & f) & (b | e) )
    
        h = i_carry  ^ (a ^ d)
        h = ( (b & e) | ((c & f) & (b | e)) ) ^ (a ^ d)
    
        g = (a & d) | (i_carry & (a | d))
        g = (a & d) | ((b & e) | ( (c & f) & (b | e)) & (a | d))
    

So we end up directly computing

    
    
        g = (a & d) | ((b & e) | ( (c & f) & (b | e)) & (a | d))
        h = ( (b & e) | ((c & f) & (b | e)) ) ^ (a ^ d)
        i = (c & f) ^ (b ^ e)
        j = c ^ f
    

All of these innermost combinations like (a & d) you can get just by AND / OR
of the original inputs.

But as you suggest (I think), then you need to combine these further to
actually get results. In theory they're all parallelizable (if you picture
this as tree), but I don't see a good way to do that quickly. Though I'm no
assembly expert so maybe I'm missing something.

I guess thinking in terms of gate delays doesn't really help much at this
abstraction level.

------
munificent
Ah, this post is so good. This is why I come to HN.

------
zodiac
This reminds me of "zeroless number systems" or "bijective number systems"
([https://en.wikipedia.org/wiki/Bijective_numeration#The_bijec...](https://en.wikipedia.org/wiki/Bijective_numeration#The_bijective_base-10_system))
- in the language of the article, the allowed digits in base 10 are
123456789A. Note that 0 is not an allowed digit.

For instance here are the first few natural numbers in this system:
1,2,3,4,5,6,7,8,9,A,11,12...

I first read about them in Chris Okasaki's purely functional data structures
book, they have some applications in terms of designing data structures (e.g.
combining perfect-powers-of-two sized heaps to build heaps of arbitrary size).

They're conceptually interesting too, being a positional number system with no
redundancy (exactly one way to represent each number) and without 0. Since the
concept of a 0 digit is not required, one could argue that it's conceptually
easier for e.g. a classical Roman mathematician to learn this than the
positional system we do use today.

Also, excel's system for numbering columns uses this - A, B, C, ... X, Y, Z,
AA, AB ...

------
pachico
I really enjoyed it. I might never use what I learned by reading it but it was
so well explained. Bravo!

------
glangdale
A question here is whether this trick (which I acknowledge is very
interesting) is worth doing on Broadwell. Broadwell has a pair of new
instructions (ADCX and ADOX) that don't behave like a normal ADC - they only
read and set a single bit (carry bit and overflow bit). As such you can do two
of these at once and at least on recent Intel processors there are two
execution ports (0 and 6) that can handle these instructions.

You'd still need some cleverness if your goal is to add just one giant bignum
(as opposed to doing lots of independent add-with-carry sums).

------
peter_d_sherman
>"The key insight here is that we can use this technique to delay carry
propagation until the end. We can’t avoid carry propagation altogether, but we
can avoid it temporarily. If we save up the carries that occur during the
intermediate additions, we can propagate them all in one go at the end."

Brilliant and tremendously interesting!

Useful for if/when I write a fast super-long integer math library in the
future (yes, I know GNU already has one, but it's always interesting to
know/understand tricks like this for >64-bit, AKA "too-long-to-fit-in-a-
single-register" integers...)

------
Slartie
What I'm wondering about now is: does that 'adc' opcode have any other useful
application except for adding numbers bigger than the width of a single
register?

Because it does look a little bit as if it was engineered for that particular
situation, and if that was the case, with this situation now in practice being
solved much more efficiently without a single 'adc' instruction, the opcode
might suddenly be completely useless bloat - which of course still had to be
maintained forever, even though nobody uses it in practice, because it's part
of the spec.

~~~
mrfredward
I believe adc is still commonly used as a way of reading the carry bit. For
example, assuming eax has already been zeroed, adc eax, eax would be the
idiomatic way of reading the carry bit into eax.

The carry flag gets set by all sorts of instructions other than addition and
subtraction (comparisons, bit shifts, multiplication, etc) so this is more
useful than you might think.

~~~
giovannibajo1
Actually, setc is the idiomatic way, and doesn’t depend on the previous value
of the destination register so it’s much faster to execute because it has no
dependencies

~~~
zwegner
No, setcc actually does depend on the previous value of the register, because
it only comes in the low-byte variants (i.e. al, bl, cl, dl etc)--the top 7
bytes of the destination are left unmodified.

A one-instruction variant I've seen that gets you the negated value of the
carry flag is sbb rax, rax. This doesn't depend on the previous value of rax
in a mathematical sense, though I'm unsure if it depends in an out-of-order
sense; that is, if it's recognized as a zeroing idiom (or rather, zeroing or
all-one-ing in this case). Probably not.

EDIT: did a quick test, and sbb rax, rax etc are not recognized to be a
zeroing idiom, at least on my old Haswell. So it's still one instruction, but
has a dependency on the previous register value.

~~~
nayuki
> setcc actually does depend on the previous value of the register, because it
> only comes in the low-byte variants

You should be able to break the dependency and avoid the partial register
stall by doing: movzx eax, ax

See: [https://stackoverflow.com/questions/41573502/why-doesnt-
gcc-...](https://stackoverflow.com/questions/41573502/why-doesnt-gcc-use-
partial-registers) ; [https://software.intel.com/en-us/forums/intel-isa-
extensions...](https://software.intel.com/en-us/forums/intel-isa-
extensions/topic/297441) ;
[https://www.agner.org/optimize/microarchitecture.pdf](https://www.agner.org/optimize/microarchitecture.pdf)
(section 6.8)

~~~
zwegner
You're right, but only for pre-Haswell chips, where al etc. are renamed
separately. From Haswell onwards, they aren't, and depend on the old value. So
movzx eax, al gets you mathematical independence, but not out-of-order
independence. There's a bunch more detail in this Stack Overflow question
(referenced in the one you linked):
[https://stackoverflow.com/questions/45660139/how-exactly-
do-...](https://stackoverflow.com/questions/45660139/how-exactly-do-partial-
registers-on-haswell-skylake-perform-writing-al-seems-to)

------
eumenides1
This is really cool, but gives me PTSD from my compliers course.

~~~
jonsen
Were you forced to comply against your will?

~~~
chowyuncat
ED-209 was a killer class.

------
barbegal
What sort of real world use case adds large numbers like this?

~~~
commandlinefan
public-key crypto.

~~~
cirno
And djb's implementations of Ed25519, Curve25519 use this exact radix 2^51
trick.

~~~
mratsim
Actually on 32-bit it uses a 2^25.5 radix, some are 2^26-bit limbs and others
are 2^25.

~~~
cirno
I was meaning on 64-bit, but it's good to clarify that, thank you.

------
smithza
The post did not mention potential cost of splitting and masking these into 5
limbs instead of 4. Perhaps I am missing that in this proposed system, 256-bit
ints are always stored in 5 registers, then you only normalize when the result
is needed? What are the costs of splitting/masking into 5-limbs and could the
scaling impact the gains of parallel adds/subtracts meaningfully?

------
dreamcompiler
TLDR: Denormalized polynomial addition followed by normalization is faster
than normalized polynomial addition on Haswell. Nice trick!

------
gautamcgoel
What I want to know is what, if any, are the implications for CPU
architecture. How should bignum arithmetic be efficiently implemented in
hardware?

~~~
not2b
Since good bignum libraries already use tricks like this, and many others
besides (like using FFT to handle very large bignum multiplication), it
doesn't seem that the hardware needs to change.

------
asciimov
edit: Ignore this, I'm tired and dumb.

You aren't gaining performance by reducing carries, you are gaining
performance by running these additions in parallel.

The ALU is doing the addition in binary. Even though you add padding to the
beginning of the number chunks it's still a binary number. When you recombine
the chunks of numbers you will be doing all those carries you feel like you
got for free.

~~~
tspiteri
But the carry propagation blocks parallelism. So by reducing the need to
carry, you are getting the parallel gains.

~~~
asciimov
But you are doing the same number of carries. Op is making it sound like you
are doing less carries.

~~~
phkahler
It depends how many numbers you add. If you compute A+B it will be slower
because the Carrie's still have to be handled after the addition. But if you
compute A+B+C+D+E you do 4 fast addition and only deal with carry bits once.
TFA allows up to about 8000 numbers to be summed for a single pass of carry
propagation.

~~~
asciimov
Right, I understand we are parallelizing the addition operations and the
performance gain is due to adding chunks of the numbers in parallel.

What I am saying is the total number of carry operations is conserved, even
though they are delayed.

~~~
MauranKilom
I can't tell from your comments whether you realized yet, but the trick is
only useful if you add _several_ 256 bit numbers. You can add up to 2^13
numbers (each one 256 bits) before you actually have to do the four carry
shift-outs _once_. In effect, you _do_ reduce the number of carry operations
(to four, albeit slightly more involved), and that is what ends up saving
time.

There is no performance gain if you only add two numbers, even though that's
the setup the article discusses for 90% of its length. It seems that this is
leading to some confusion.

~~~
asciimov
Yeah, I re-read the whole thing again, and I see the mistake I made. I took a
wrong turn when op went into base 37 numbers, which made me consider every bit
carry not just the carry on word boundary.

