Yes. More generally, this will enable implementation via crazy-cheap bit-wise op...

phkahler · 2024-02-29T15:21:14

Fleshing out my thought above. Say we want to multiply A*B = C and every operand is stored as 2 separate bits, Ap and An (Ap = 1 if A = +1, An = 1 if A = -1, both 0 if A = 0). Then we can do a product with:

Cp = (Ap & Bp) | (An & Bn)

Cn = (An & Bp) | (Ap & Bn)

So 64 products in 6 instructions, or 256 in 6 instructions with AVX2, or 512 in 6 instructions with AVX512. If you can execute 2 instructions per cycle on different words, this becomes 1024 "products" in 6 cycles, or between 0.5 and 1 TOPS per core.
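A quick way to sanity-check the two formulas is a small Python sketch. The helper names `pack`, `unpack` and `mul_planes` are made up for illustration, and a Python int stands in for a machine word, so this shows only the logic, not the speed:

```python
def pack(values):
    """Pack a list of -1/0/+1 into (pos, neg) bit planes, one bit per element."""
    pos = neg = 0
    for i, v in enumerate(values):
        if v == 1:
            pos |= 1 << i
        elif v == -1:
            neg |= 1 << i
    return pos, neg

def unpack(pos, neg, n):
    """Inverse of pack: recover n ternary values from the two bit planes."""
    return [((pos >> i) & 1) - ((neg >> i) & 1) for i in range(n)]

def mul_planes(ap, an, bp, bn):
    """Element-wise ternary product: 4 ANDs and 2 ORs, i.e. 6 bitwise ops."""
    cp = (ap & bp) | (an & bn)   # (+1 * +1) or (-1 * -1) gives +1
    cn = (an & bp) | (ap & bn)   # (-1 * +1) or (+1 * -1) gives -1
    return cp, cn

# All nine combinations of {-1, 0, +1} x {-1, 0, +1}
a = [1, -1, 0, 1, -1, 0, 1, -1, 0]
b = [1, 1, 1, -1, -1, -1, 0, 0, 0]
cp, cn = mul_planes(*pack(a), *pack(b))
print(unpack(cp, cn, len(a)))  # element-wise a[i] * b[i]
```

As long as no input has both its positive and negative bit set, Cp and Cn never overlap either, so the output is a valid operand for the next product.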

The summing still involves using popcount on the positive and negative bits - I doubt AVX supports that but it's still a fast way to "sum" individual bits. I don't see custom hardware for this as a short term thing - they need to prove out the quantization concept more first.
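For illustration, the popcount step could look like the sketch below. `popcount` and `ternary_dot` are hypothetical names, and `bin(x).count("1")` stands in for a hardware popcount instruction:

```python
def popcount(x):
    """Number of set bits in a non-negative int (stand-in for hardware POPCNT)."""
    return bin(x).count("1")

def ternary_dot(ap, an, bp, bn):
    """Dot product of two ternary vectors stored as positive/negative bit planes."""
    cp = (ap & bp) | (an & bn)   # positions where the product is +1
    cn = (an & bp) | (ap & bn)   # positions where the product is -1
    return popcount(cp) - popcount(cn)

# Bit i of each plane is element i.
# a = [+1, -1, 0, +1], b = [+1, +1, -1, -1]
ap, an = 0b1001, 0b0010
bp, bn = 0b0011, 0b1100
print(ternary_dot(ap, an, bp, bn))  # products are +1, -1, 0, -1
```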

rep_lodsb · 2024-02-29T20:42:02

Another way would be to use one register for "zero" vs. "non-zero", and another for negative (basically a 2-bit sign-magnitude representation).

    C_sgn = A_sgn ^ B_sgn
    C_mag = A_mag & B_mag

The result can then be converted into bitmasks for positive and negative:

    C_plus = C_mag & ~C_sgn
    C_minus = C_mag & C_sgn

This solution should be more efficient if there is an "AND NOT" instruction, or when multiplying more than two factors.
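A rough Python sketch of the sign-magnitude variant with three factors, to show where the savings come from. The function names are made up for illustration, and Python ints stand in for registers:

```python
def mul_sm(a_sgn, a_mag, b_sgn, b_mag):
    """Ternary product in sign-magnitude form: one XOR and one AND per factor."""
    return a_sgn ^ b_sgn, a_mag & b_mag

def to_planes(sgn, mag):
    """Convert sign-magnitude to separate positive/negative bitmasks."""
    plus = mag & ~sgn    # non-zero and sign bit clear
    minus = mag & sgn    # non-zero and sign bit set
    return plus, minus

# Bit i is element i.
# a = [+1, -1, 0, -1], b = [-1, -1, +1, +1], c = [-1, +1, +1, 0]
a_sgn, a_mag = 0b1010, 0b1011
b_sgn, b_mag = 0b0011, 0b1111
c_sgn, c_mag = 0b0001, 0b0111

# Chain the factors; the conversion to bitmasks happens once, at the end.
sgn, mag = mul_sm(a_sgn, a_mag, b_sgn, b_mag)
sgn, mag = mul_sm(sgn, mag, c_sgn, c_mag)
plus, minus = to_planes(sgn, mag)
print(bin(plus), bin(minus))  # a*b*c = [+1, +1, 0, 0]
```

Note that the sign bit of a zero element can hold anything (bit 3 of `sgn` ends up set here), because the final masking with `mag` discards it.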

rep_lodsb · 2024-02-29T22:48:25

Thinking a bit more about this, you could eliminate the conversion and do

    sum = popcount(mag) - 2*popcount(mag & sgn)
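This works because popcount(mag & ~sgn) = popcount(mag) - popcount(mag & sgn), so plus minus minus collapses to the formula above. A small brute-force check in Python (function names are hypothetical):

```python
def popcount(x):
    """Number of set bits in a non-negative int."""
    return bin(x).count("1")

def sum_sm(sgn, mag):
    """Sum of ternary elements straight from sign-magnitude form."""
    return popcount(mag) - 2 * popcount(mag & sgn)

def sum_planes(sgn, mag):
    """Same sum via explicit positive/negative bitmasks, for comparison."""
    return popcount(mag & ~sgn) - popcount(mag & sgn)

# Exhaustive check over every 8-bit sgn/mag pair
ok = all(sum_sm(s, m) == sum_planes(s, m)
         for s in range(256) for m in range(256))
print(ok)
```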

cs702 · 2024-02-29T15:24:49

> I don't see custom hardware for this as a short term thing - they need to prove out the quantization concept more first.

Yes, I agree. This still needs to be more extensively tested.