Fleshing out my thought above. If we want to multiply A*B = C and all operands are stored in 2 separate bits Ap and An (Ap = 1 if A = +1 while An = 1 if A = -1). We can do a product with:
Cp = (Ap & Bp) | (An & Bn)
Cn = (An & Bp) | (Ap & Bn)
So 64 products in 6 instructions, or 256 in 6 instructions with AVX2, or 512 in six instructions using AVX512. If you can execute 2 instructions at a time on different words, this becomes 1024 "products" in 6 cycles or between 0.5 and 1 TOP per core.
The summing still involves using popcount on the positive and negative bits - I doubt AVX supports that but its still a fast way to "sum" individual bits. I don't see custom hardware for this as a short term thing - they need to prove out the quantization concept more first.
Cp = (Ap & Bp) | (An & Bn)
Cn = (An & Bp) | (Ap & Bn)
So 64 products in 6 instructions, or 256 in 6 instructions with AVX2, or 512 in six instructions using AVX512. If you can execute 2 instructions at a time on different words, this becomes 1024 "products" in 6 cycles or between 0.5 and 1 TOP per core.
The summing still involves using popcount on the positive and negative bits - I doubt AVX supports that but its still a fast way to "sum" individual bits. I don't see custom hardware for this as a short term thing - they need to prove out the quantization concept more first.