
Computing multiple hash values in parallel with AVX2 - kryptiskt
https://www.snellman.net/blog/archive/2017-03-19-parallel-hashing-with-avx2/
======
jandrewrogers
Great post and he raises a very valid point that most high-performance hash
functions are essentially optimized for high-end 64-bit superscalar
microarchitectures. As a practical consequence, the only exploitable
parallelism in the silicon is the multiple ALUs within a single core and the
algorithms are designed to exploit that effectively. However, as noted in the
article, this runs directly into the fact that vector pipelines only support a
limited subset of the operations supported by the scalar pipelines, and for
the purposes of high-performance hash functions these tend to be disjoint.

His prescription of a new high-performance hash function based on
mul32/xor/shift is also exactly correct in my opinion. Not only does it allow
you to scale the hash function down to limited microarchitectures but it also
allows you to scale it up to the largest/fastest vector engines in most cases
for extreme throughput on higher end hardware.

Coincidentally, I also started working on a new set of hash functions based on
the mul32/xor/shift motifs precisely for the purpose of getting something that
could be efficiently vectorized for performance while still able to run on,
say, a 32-bit ARM core.

~~~
jsnell
Thanks.

> Coincidentally, I also started working on a new set of hash functions based
> on the mul32/xor/shift motifs precisely for the purpose of getting something
> that could be efficiently vectorized for performance while still able to run
> on, say, a 32-bit ARM core.

That's cool to hear. I'm happy to reimplement hash functions to fit my bizarre
needs, but as per the post I really don't want to design them :)

What's your view on how switching from rotates to shift-xors affects quality?
Is it mostly neutral as long as the constants are chosen correctly, or is
there some quality loss?

~~~
jandrewrogers
The use of multiplication as an operator introduces a bit value bias that
arguably makes replacing rotate with shift less of a problem than it would be
for algorithms that use e.g. the add/rot/xor operators, since shift introduces
a (non-symmetrical) counter-bias to multiply, being basically a special case
of division. Rotate is nice for bit diffusion, but most of the necessary heavy
lifting is done by the high bits being shifted low.

I have no hard evidence but my intuition is that the loss of quality by moving
from rotate to shift is below the noise floor of what is required to patch up
the bit value bias introduced by multiply.

------
wmu
Really good and well-written article.

We did a similar trick with SSE. It's a good approach if a non-portable hash
is acceptable. In the system I'm working on we introduced two APIs:
hash(), which maps to "the fastest hash we have on this machine", and
portable_hash(), which is a predefined hash algorithm.

BTW, the CLMUL extension (PCLMULQDQ, introduced around the SSE4 era) added
carry-less multiplication, which can also be used to boost hashing[1].
However, it's best for large data; in our tests it outperformed other
algorithms for inputs larger than 1 kB.

[1]
[https://github.com/lemire/StronglyUniversalStringHashing](https://github.com/lemire/StronglyUniversalStringHashing)

------
tromp
Most proof-of-work CPU miners use this optimization. My own Cuckoo Cycle CPU
solver supports 4-way, 8-way, and even 16-way SipHash computation, where the
latter uses multiple sets of AVX registers to allow for more interleaving of
operations. On a recent Intel Core i7, the 8-way version performs best.

The Makefile at
[https://github.com/tromp/cuckoo/blob/master/src/Makefile](https://github.com/tromp/cuckoo/blob/master/src/Makefile)
has targets cuckoo28stx, cuckoo28stx4, cuckoo28stx8 and cuckoo28stx16 for
these different levels of parallelization.

------
nayuki
I parallelized SHA-512 for a competition - the compression function runs 4
copies in AVX2.

[https://www.nayuki.io/page/lowest-sha512-value-by-brute-forc...](https://www.nayuki.io/page/lowest-sha512-value-by-brute-force)

[https://www.reddit.com/r/programming/comments/1y17al/hashcha...](https://www.reddit.com/r/programming/comments/1y17al/hashchallenge_can_you_find_the_lowest_value/)

------
lathiat
I'd just like to take a moment to acknowledge the author for their excellent
"Don't look under the rug" section. Amazing. I wish every article had
information like that.

------
gumby
> and given the way things are going, I wonder if a general purpose CPU using
> AVX-512 will actually ever launch

Boy, I really hope you're wrong! Any further thoughts?

------
faragon
That's wonderful. For non-cryptographic uses, this could speed up a lot of
things, even git: e.g. replacing SHA-1 with a 128-bit hash made from
xxHash64(file_content) plus the size of the file (64 + 64 bits). In fact,
replacing SHA-1 with xxHash64 + file size, and replacing zlib with LZ4, could
allow git to reduce "repack" times to a minimum. BTW, both xxHash and LZ4 are
developments from Yann Collet, a very talented programmer (also the author of
Zstandard; most of the "magic" of both LZ4 and Zstandard comes from the hash
function, which is a huge breakthrough).

~~~
gwern
> For non-cryptographic uses, this could speed-up a lot of things, e.g. even
> git

------
yuhong
AMD XOP used to have a rotate instruction. Sadly it is dead in Ryzen. It seems
that AVX-512 will have it though.

~~~
wmu
XOP had a few nice instructions, but it was killed. AVX-512F has rotations,
but only for 32- and 64-bit elements. AVX-512BW will support rotation of
bytes and words (8- and 16-bit elements).

