
Wallace Tree - eternalban
https://en.wikipedia.org/wiki/Wallace_tree
======
londons_explore
TL;DR: Multiply numbers in hardware by multiplying as you would in school -
each digit by each other digit, then add up all the results.

The apparent speedup comes from the fact that adding a list of N integers
doesn't take O(N) time, because it can be parallelized (for example by adding
pairs of numbers first to make a list half as long).

The hardware-specific tricks are that multiplication by 1 or 0 is simply a
very cheap AND operation, and that adding 3 numbers to make two results is
much faster in hardware than adding 2 numbers to make one result.
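
A rough Python sketch of the first trick, purely for illustration (the
function name is made up here): each partial product row is just the
multiplicand ANDed with one bit of the multiplier and shifted into place,
and all the real work is in summing the rows afterwards.

    def partial_products(a, b, width=4):
        # One row per bit of b: either a copy of a or all zeros,
        # shifted into position -- one AND gate per bit in hardware.
        rows = []
        for i in range(width):
            bit = (b >> i) & 1
            rows.append((a if bit else 0) << i)
        return rows

    rows = partial_products(0b1011, 0b0110)
    assert sum(rows) == 0b1011 * 0b0110  # summing the rows is the hard part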

~~~
klodolph
This summary has some important errors in it. If you added each partial
product in parallel, you would get O((log N)^2) delay, because each addition
has O(log N) delay and there are O(log N) levels in the tree. The Wallace
tree has total delay O(log N), which is smaller.

The way you learned in school is to add the numbers up in series, which
doesn’t give the parallel structure needed to multiply numbers quickly.

For example,

    
    
          1234
        x 5678
       -------
          9872
         8638
        7404
       6170
       -------
       7006652
    

In school, you would add each part column by column, and keep track of a carry
digit. This gives a fairly bad delay from right to left, since the result of a
carry from the right-hand columns will affect the results of the left-hand
columns, many operations later. If you add pairs of numbers in parallel, it
will be faster, but you will still have to propagate carry from right to left.

The key insight in the Wallace tree is to reduce these numbers hierarchically,
but only bitwise. Three bits in the same position are reduced to two bits in
different positions, at each level. Eventually, there are only two numbers
left, and we do a single addition.

This process for addition is completely alien to the way humans do it.
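
A minimal Python sketch of one such 3:2 step, for illustration only: three
bits of the same weight go into a full adder, which emits a sum bit of the
same weight and a carry bit of double the weight, with no chain between
columns.

    def full_adder(a, b, c):
        s = a ^ b ^ c                        # sum bit: stays in this column
        carry = (a & b) | (a & c) | (b & c)  # majority: moves one column left
        return s, carry

    # Three 1-bits of weight 1 become s=1 (weight 1) plus carry=1 (weight 2):
    # the total of 3 is preserved, and no other column is consulted.
    assert full_adder(1, 1, 1) == (1, 1)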

> (for example by adding pairs of numbers first to make a list half as long).

That is absolutely not how it works. There is only one pair of numbers being
added here—at the end, the very last two numbers are added together normally.
If you try to add each pair of numbers in parallel, you will add more gate
delay as the carry bit has to propagate all the way from the right side to the
left at each level in the tree (instead of only once at the end).

With the Wallace tree, the carry bit only has to propagate the width of the
numbers once, at the very end. This is much faster. At the beginning, again,
bits are being added with full adders which shrinks the tree at a rate of 3:2
(not 2:1) where possible. You will notice that the height of the numbers goes
from 8 to 6 (not 4), then 6 to 4, 4 to 3, and then 3 to 2. The final two
numbers are then added normally.
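
A rough Python analogy of that reduction (it works on whole rows as Python
integers rather than on individual columns, so it is not a gate-level model;
the function names are made up): applying the full-adder identity
a + b + c = (a ^ b ^ c) + 2*majority(a, b, c) to groups of three rows gives
exactly the 8 -> 6 -> 4 -> 3 -> 2 sequence, followed by one ordinary
carry-propagating addition.

    def csa(a, b, c):
        # Carry-save step: three rows in, two rows out, no carry chain.
        return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

    def reduce_rows(rows):
        heights = [len(rows)]
        while len(rows) > 2:
            nxt = []
            for i in range(0, len(rows) - 2, 3):
                nxt += list(csa(rows[i], rows[i + 1], rows[i + 2]))
            nxt += rows[len(rows) - len(rows) % 3:]  # leftover rows pass through
            rows = nxt
            heights.append(len(rows))
        return rows, heights

    a, b = 0b10110101, 0b01101110
    rows = [(a if (b >> i) & 1 else 0) << i for i in range(8)]
    (r0, r1), heights = reduce_rows(rows)
    print(heights)            # [8, 6, 4, 3, 2]
    assert r0 + r1 == a * b   # the single normal addition at the end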

~~~
londons_explore
Your explanation here is described in my original reply as "that adding 3
numbers to make two results is much faster in hardware than adding 2 numbers
to make one result."

I initially typed a further explanation of "no carry chain is necessary, so
each addition becomes constant time", but decided that most readers here
either wouldn't be hardware experts and wouldn't see why a carry chain was
relevant, or would be hardware experts, in which case it's already obvious,
since full adders are used in loads of things wired like this.

Thank you for expanding in a lot more detail though.

------
ars
What method do modern CPUs actually use?

~~~
SuperscalarMeme
Radix-4 modified Booth encoding to reduce the number of partial products and
then a sort of modified Wallace tree using 4:2 compressors (or some sort of
3:2 & 4:2 compressor combination based on technology node). There is another
type of multiplier sometimes used for maximum performance: unlike Wallace
trees where you go through steps of tree reduction, the "Three Dimensional
Method" looks at each output bit and generates the fastest possible tree from
the partial product matrix. The problem with this method (besides area) is
that the wiring and layout of cells is highly irregular. However, now that we
are in the age of automated tools, this type of design is more feasible. Paper
here: [https://www.ece.ucdavis.edu/~vojin/CLASSES/EEC280/Web-
page/p...](https://www.ece.ucdavis.edu/~vojin/CLASSES/EEC280/Web-
page/papers/Arithmetic/A%20Method%20for%20speed%20Optimized%20Partial%20product.pdf)
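
For the Booth part, a rough Python sketch of radix-4 recoding (illustrative
only, and assuming an even width and a multiplier whose top bit is 0, so the
signed recoding matches the unsigned value): overlapping 3-bit windows of the
multiplier each produce a digit in {-2, -1, 0, 1, 2}, so an n-bit multiplier
yields about n/2 partial products for the compressor tree instead of n.

    def booth_radix4_digits(b, width=8):
        # Windows over bits (b[2i+1], b[2i], b[2i-1]) with an implicit b[-1]=0;
        # digit = -2*b[2i+1] + b[2i] + b[2i-1].
        digits, prev = [], 0
        for i in range(0, width, 2):
            b0 = (b >> i) & 1
            b1 = (b >> (i + 1)) & 1
            digits.append(-2 * b1 + b0 + prev)
            prev = b1
        return digits  # least-significant digit first

    a, b = 181, 0b01101110
    digits = booth_radix4_digits(b)            # here: [-2, 0, -1, 2]
    assert sum(d * a * 4**i for i, d in enumerate(digits)) == a * b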

If you're interested, this paper from Synopsys has some neat information about
datapath synthesis:
[https://guest.iis.ee.ethz.ch/~zimmi/publications/datapath_sy...](https://guest.iis.ee.ethz.ch/~zimmi/publications/datapath_synthesis.pdf)

------
ggm
I always wanted to believe somebody would make a coherent case for a real
decimal-encoded adder/multiplier: use voltages or phase or frequency or
something in ways which were strictly additive and subtractive over 10
discrete values, and compute the sum by ... doing decimal. Not decimal BCD,
but real decimal in the logic: 10 distinct voltage levels.

Not that I was totally surprised: my hardware lecturers said "not a chance".

~~~
graphpapa
I understand enough to think the proposal sounds reasonable, but not enough to
know why it is apparently, in fact, unreasonable.

~~~
fanf2
Digital logic is built from amplifiers with the gain turned right up, so that
any voltage above the threshold goes to max and any voltage below goes to
zero. Very little tuning is needed to get a working circuit, compared to
multi-level or analogue circuitry. In CMOS, current basically only flows when
the circuit is switching between two states, whereas in an analogue circuit
it flows all the time, so digital logic can be more efficient. In digital
signal transmission, “eye diagrams” are used to visualise the quality of the
signal: the eye is the gap between low and high, and it needs to be clearly
open for receivers to be able to detect the signal with decent integrity.
[https://www.onsemi.com/pub/Collateral/AND9075-D.PDF](https://www.onsemi.com/pub/Collateral/AND9075-D.PDF)

