
Fleshing out my thought above. If we want to multiply A*B = C, with all operands stored as 2 separate bits Ap and An (Ap = 1 if A = +1, An = 1 if A = -1), we can do a product with:

Cp = (Ap & Bp) | (An & Bn)

Cn = (An & Bp) | (Ap & Bn)

So that's 64 products in 6 instructions, or 256 in 6 instructions with AVX2, or 512 in 6 instructions with AVX-512. If you can execute 2 instructions at a time on different words, this becomes 1024 "products" in 6 cycles, or between 0.5 and 1 TOP per core.

The summing still involves using popcount on the positive and negative bits - I doubt AVX supports that, but it's still a fast way to "sum" individual bits. I don't see custom hardware for this as a short-term thing - they need to prove out the quantization concept more first.
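A minimal sketch of the whole idea in Python (helper names like `encode` and `ternary_dot` are mine, and 64-bit machine words are replaced by Python ints for clarity):

```python
def encode(vals):
    # Pack a list of {-1, 0, +1} values into (pos, neg) bitmasks:
    # bit i of pos is set iff vals[i] == +1, bit i of neg iff vals[i] == -1.
    pos = neg = 0
    for i, v in enumerate(vals):
        if v == +1:
            pos |= 1 << i
        elif v == -1:
            neg |= 1 << i
    return pos, neg

def ternary_dot(a, b):
    ap, an = a
    bp, bn = b
    cp = (ap & bp) | (an & bn)   # product is +1 where the signs agree
    cn = (an & bp) | (ap & bn)   # product is -1 where the signs differ
    # "Sum" the per-element products with popcounts.
    return bin(cp).count("1") - bin(cn).count("1")

a = encode([+1, -1, 0, +1])
b = encode([+1, +1, -1, -1])
print(ternary_dot(a, b))  # products are +1, -1, 0, -1 -> prints -1
```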




Another way would be to use one register for "zero" vs. "non-zero", and another for negative (basically a 2-bit sign-magnitude representation).

    C_sgn = A_sgn ^ B_sgn
    C_mag = A_mag & B_mag
The result can then be converted into bitmasks for positive and negative:

    C_plus = C_mag & ~C_sgn
    C_minus = C_mag & C_sgn
This solution should be more efficient if there is an "AND NOT" instruction, or when multiplying more than two factors.
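The same sketch in Python for the sign-magnitude variant (function names are mine; `mag & ~sgn` is the place where a fused "AND NOT" instruction like x86's ANDN would help):

```python
def encode_sm(vals):
    # Pack {-1, 0, +1} into (mag, sgn) bitmasks:
    # bit i of mag is set iff vals[i] != 0, bit i of sgn iff vals[i] < 0.
    mag = sgn = 0
    for i, v in enumerate(vals):
        if v != 0:
            mag |= 1 << i
        if v < 0:
            sgn |= 1 << i
    return mag, sgn

def ternary_mul_sm(a, b):
    a_mag, a_sgn = a
    b_mag, b_sgn = b
    c_sgn = a_sgn ^ b_sgn   # sign flips iff exactly one factor is negative
    c_mag = a_mag & b_mag   # product is zero if either factor is zero
    return c_mag, c_sgn

def to_masks(mag, sgn):
    # Convert to the positive/negative bitmask representation.
    return mag & ~sgn, mag & sgn
```

Note the product itself is only 2 instructions here instead of 6, and chaining more factors just repeats the XOR/AND pair; the conversion cost is paid once at the end.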


Thinking a bit more about this, you could eliminate the conversion and do

    sum = popcount(mag) - 2*popcount(mag & sgn)


> I don't see custom hardware for this as a short term thing - they need to prove out the quantization concept more first.

Yes, I agree. This still needs to be more extensively tested.




