Computing multiple hash values in parallel with AVX2 (snellman.net)
85 points by kryptiskt 65 days ago | 14 comments



Great post, and he raises a very valid point: most high-performance hash functions are essentially optimized for high-end 64-bit superscalar microarchitectures. As a practical consequence, the only parallelism they exploit in the silicon is the multiple ALUs within a single core, and the algorithms are designed to use those effectively. However, as the article notes, vector pipelines support only a limited subset of the operations the scalar pipelines do, and the operations these hash functions rely on tend to fall outside that subset.

His prescription of a new high-performance hash function built from mul32/xor/shift is also exactly correct in my opinion. Not only does it let you scale the hash function down to limited microarchitectures, it also lets you scale it up to the largest/fastest vector engines in most cases, for extreme throughput on higher-end hardware.

Coincidentally, I also started working on a new set of hash functions based on the mul32/xor/shift motifs precisely for the purpose of getting something that could be efficiently vectorized for performance while still able to run on, say, a 32-bit ARM core.
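To make the motif concrete, one round of such a construction looks roughly like this (a sketch, not a finalized design; the constant and shift amount are placeholders):

    #include <stdint.h>

    /* One hypothetical mul32/xor/shift mixing step. The multiply
       pushes entropy toward the high bits; the shift-xor folds
       those high bits back down into the low bits. */
    static uint32_t mix32(uint32_t h, uint32_t v) {
        h ^= v;
        h *= 0x9e3779b1u;  /* placeholder odd constant */
        h ^= h >> 16;      /* placeholder shift amount */
        return h;
    }

Every operation here has a direct AVX2 equivalent (_mm256_mullo_epi32, _mm256_xor_si256, _mm256_srli_epi32), which is what makes the motif vectorize so cleanly.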


Thanks.

> Coincidentally, I also started working on a new set of hash functions based on the mul32/xor/shift motifs precisely for the purpose of getting something that could be efficiently vectorized for performance while still able to run on, say, a 32-bit ARM core.

That's cool to hear. I'm happy to reimplement hash functions to fit my bizarre needs, but as per the post I really don't want to design them :)

What's your view on how switching from rotates to shift-xors affects quality? Is it mostly neutral as long as the constants are chosen correctly, or is there some quality loss?


The use of multiplication as an operator introduces a bit-value bias that arguably makes replacing rotate with shift less of a problem than it would be for algorithms built on, e.g., add/rot/xor, since shift introduces a (non-symmetrical) counter-bias to multiply: it is basically a special case of division by a power of two. Rotate is nice for bit diffusion, but most of the necessary heavy lifting is in the high bits being shifted low.

I have no hard evidence but my intuition is that the loss of quality by moving from rotate to shift is below the noise floor of what is required to patch up the bit value bias introduced by multiply.
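To make the comparison concrete, the two step variants being weighed look something like this (illustrative constants only):

    #include <stdint.h>

    static inline uint32_t rotl32(uint32_t x, int r) {
        return (x << r) | (x >> (32 - r));
    }

    /* rotate-based step: a pure bit permutation after the multiply */
    static uint32_t mix_rot(uint32_t h) {
        h *= 0x9e3779b1u;
        return rotl32(h, 13);
    }

    /* shift-xor step: still invertible, feeds high bits into low;
       in AVX2 this is 2 ops (shift, xor) where the rotate needs 3
       (shift, shift, or) */
    static uint32_t mix_shift(uint32_t h) {
        h *= 0x9e3779b1u;
        return h ^ (h >> 13);
    }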


Feature detection should be a goal for the function's container, exploiting whatever resources a particular environment provides. If AVX2 exists, use it; if CUDA does, use that; if both, use both. Why limit the implementation?
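For instance, a minimal runtime-dispatch sketch (the GCC/Clang builtin is real; hash_avx2/hash_scalar are hypothetical names):

    #include <stddef.h>
    #include <stdint.h>

    uint64_t hash_scalar(const void *p, size_t n);  /* portable fallback */
    uint64_t hash_avx2(const void *p, size_t n);    /* vectorized path */

    /* detect once, then dispatch on every call */
    uint64_t hash(const void *p, size_t n) {
        static int use_avx2 = -1;
        if (use_avx2 < 0)
            use_avx2 = __builtin_cpu_supports("avx2");
        return use_avx2 ? hash_avx2(p, n) : hash_scalar(p, n);
    }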


Two reasons. First, functions may need to cross system boundaries while producing identical results. Second, it simplifies the implementation and reduces the potential defect surface in the function design.


Really good and well-written article.

We did a similar trick with SSE. It's a good approach when a non-portable hash is acceptable. In the system I'm working on we introduced two APIs: hash(), which maps to "fastest hash we have on this machine", and portable_hash(), which is a predefined hash algorithm.
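Roughly this contract, as two prototypes (hypothetical signatures):

    #include <stddef.h>
    #include <stdint.h>

    /* fastest hash available on this machine; output may differ
       across builds and architectures, so never persist it */
    uint64_t hash(const void *p, size_t n);

    /* one fixed, documented algorithm everywhere; safe to store
       on disk or send over the wire */
    uint64_t portable_hash(const void *p, size_t n);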

BTW, the CLMUL extension (PCLMULQDQ, introduced alongside AES-NI) added carry-less multiplication, which can also be used to boost hashing[1]. However, it's good for large data; in our tests it outperformed other algorithms for inputs larger than 1kB.

[1] https://github.com/lemire/StronglyUniversalStringHashing
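For reference, the carry-less multiply itself is a single intrinsic; a sketch of the primitive (real schemes like the one at [1] also reduce the 128-bit product modulo an irreducible polynomial):

    #include <stdint.h>
    #include <wmmintrin.h>  /* PCLMULQDQ; compile with -mpclmul */

    /* 64x64 -> 128-bit carry-less product of a key and a data word */
    static __m128i clmul64(uint64_t k, uint64_t x) {
        __m128i a = _mm_set_epi64x(0, (long long)k);
        __m128i b = _mm_set_epi64x(0, (long long)x);
        return _mm_clmulepi64_si128(a, b, 0x00);
    }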


Most proof-of-work CPU miners use this optimization. My own Cuckoo Cycle CPU solver supports 4-way, 8-way, and even 16-way SipHash computation, where the latter uses multiple sets of AVX registers to allow more interleaving of operations. On a recent Core i7, the 8-way version performs best.

The Makefile at https://github.com/tromp/cuckoo/blob/master/src/Makefile has targets cuckoo28stx, cuckoo28stx4, cuckoo28stx8 and cuckoo28stx16 for these different levels of parallelization.
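The core of the trick is that one AVX2 register holds the same SipHash state word for four independent hashes, with rotates emulated by shifts. A sketch of the first half of one SIPROUND across four states (compile with -mavx2; not the solver's actual code):

    #include <immintrin.h>

    /* rotate each 64-bit lane left by r (AVX2 has no rotate) */
    static inline __m256i rotl64x4(__m256i x, int r) {
        return _mm256_or_si256(_mm256_slli_epi64(x, r),
                               _mm256_srli_epi64(x, 64 - r));
    }

    /* first half of SIPROUND, run on four states at once */
    static inline void half_round(__m256i *v0, __m256i *v1,
                                  __m256i *v2, __m256i *v3) {
        *v0 = _mm256_add_epi64(*v0, *v1);
        *v1 = rotl64x4(*v1, 13);
        *v1 = _mm256_xor_si256(*v1, *v0);
        *v0 = rotl64x4(*v0, 32);
        *v2 = _mm256_add_epi64(*v2, *v3);
        *v3 = rotl64x4(*v3, 16);
        *v3 = _mm256_xor_si256(*v3, *v2);
    }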


I parallelized SHA-512 for a competition: the compression function runs 4 copies in AVX2.

https://www.nayuki.io/page/lowest-sha512-value-by-brute-forc...

https://www.reddit.com/r/programming/comments/1y17al/hashcha...
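The layout is the usual one: four message blocks side by side in 64-bit lanes. For instance, SHA-512's Sigma0 across all four copies at once (a sketch, not the code behind the links):

    #include <immintrin.h>

    static inline __m256i rotr64x4(__m256i x, int r) {
        return _mm256_or_si256(_mm256_srli_epi64(x, r),
                               _mm256_slli_epi64(x, 64 - r));
    }

    /* SHA-512 Sigma0(a) = rotr(a,28) ^ rotr(a,34) ^ rotr(a,39),
       evaluated for four independent hash states at once */
    static inline __m256i Sigma0x4(__m256i a) {
        return _mm256_xor_si256(rotr64x4(a, 28),
               _mm256_xor_si256(rotr64x4(a, 34), rotr64x4(a, 39)));
    }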


I'd just like to take a moment to acknowledge the author for their excellent "Don't look under the rug" section. Amazing. I wish every article had information like that.


> and given the way things are going, I wonder if a general purpose CPU using AVX-512 will actually ever launch

Boy, I really hope you're wrong! Any further thoughts?


AMD's XOP extension had a rotate instruction. Sadly, XOP is dead in Ryzen. It seems that AVX-512 will have one, though.


XOP had a few nice instructions, but was killed. AVX-512F has rotations, but only for 32- and 64-bit elements. AVX-512BW will support rotation of bytes and words (8- and 16-bit elements).
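Concretely, where AVX2 needs three instructions per rotate, AVX-512F collapses it to one (a sketch; the AVX-512 path needs -mavx512f):

    #include <immintrin.h>

    /* AVX2: emulate a 32-bit lane rotate with shift+shift+or */
    static inline __m256i rotl32_avx2(__m256i x, int r) {
        return _mm256_or_si256(_mm256_slli_epi32(x, r),
                               _mm256_srli_epi32(x, 32 - r));
    }

    /* AVX-512F: VPROLD does it in one instruction, 16 lanes wide */
    static inline __m512i rotl32_avx512(__m512i x) {
        return _mm512_rol_epi32(x, 13);  /* rotate count is an immediate */
    }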


That's wonderful. For non-cryptographic uses, this could speed up a lot of things, e.g. even git: replacing SHA-1 with, say, a 128-bit hash made from xxHash64(file_content) plus the size of the file (64 + 64 bits). In fact, replacing SHA-1 with xxHash64 + file size, and replacing zlib with LZ4, could let git reduce "repack" times to the minimum. BTW, both xxHash and LZ4 are developments from Yann Collet, a very talented programmer (also author of Zstandard; most of the "magic" of both LZ4 and Zstandard comes from the hash function, which is a huge breakthrough).
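A sketch of what such an id could look like, using the real XXH64() API from the xxHash library (the struct and function names are hypothetical):

    #include <stddef.h>
    #include <stdint.h>
    #include "xxhash.h"  /* https://github.com/Cyan4973/xxHash */

    /* hypothetical 128-bit object id: 64-bit hash + 64-bit length */
    typedef struct { uint64_t hash; uint64_t size; } object_id;

    static object_id make_id(const void *data, size_t len) {
        object_id id;
        id.hash = XXH64(data, len, 0);  /* seed 0 */
        id.size = (uint64_t)len;
        return id;
    }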


> For non-cryptographic uses, this could speed-up a lot of things, e.g. even git



