This is a good-looking hash. Just from the construction I can see that it should be extremely fast for bulk hashing of large objects, with strong statistical properties.
I do want to comment for the wider audience on why the details of this construction do not lend themselves to small keys, and how you would fix that. I have an (unpublished, I'm lazy) AES-based hash function that is very fast for both bulk hashing and small keys, which required a mostly performance-neutral tweak to the bulk hashing loop (it likely fits in an unused ALU execution port).
Large-key performance is entirely in the bulk loop algorithm (an AES round in this case) and small-key performance is entirely in the mixer/finalizer algorithm. If you have a very wide bulk loop with no mixing of the individual lanes, then it usually requires a deep mixing stage that makes small keys very expensive, since that is a fixed overhead. The question then becomes: how do you cheaply "bank" dispersal of bits across lanes in the bulk loop to shorten the mixing stage, without creating an operation dependency between lanes that will massively reduce throughput?
As an important observation, vectorizing the AES lanes makes it difficult to disperse bits because there are no good, cheap operations that move bits between lanes. However, the CPU will happily run multiple AES operations and other 128-bit operations concurrently across ALU ports if they are in independent registers. This allows you to trivially create a "mixing" lane that exists solely to aid bit dispersal, because you are no longer trying to do sideways operations on a vector. So what does a mixing lane look like that doesn't create dependencies on your hashing lanes? It is so simple it is embarrassing: XOR the input data lanes. The mixing lane is worthless as a hash on its own; it exists only to represent all the bits from all the lanes.
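A minimal sketch of what I mean, assuming four hashing lanes and AES-NI intrinsics (the lane count, state layout, and names here are illustrative, not my actual design):

    #include <immintrin.h>  /* AES-NI + SSE2 intrinsics; compile with -maes */

    typedef struct {
        __m128i lane[4]; /* independent AES hashing lanes */
        __m128i mix;     /* extra lane that only XOR-accumulates the input */
    } HashState;

    static void absorb_block(HashState *s, const unsigned char *p)
    {
        for (int i = 0; i < 4; i++) {
            __m128i in = _mm_loadu_si128((const __m128i *)(p + 16 * i));
            /* Each lane depends only on its own previous value, so the four
               AES rounds can issue concurrently on independent ALU ports. */
            s->lane[i] = _mm_aesdec_si128(s->lane[i], in);
            /* The mixing lane is a plain XOR of the raw input: worthless as
               a hash by itself, but it carries bits from every lane. */
            s->mix = _mm_xor_si128(s->mix, in);
        }
    }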
At the mixing stage, you simply fold the mixing lane into each of the hashing lanes by running it through a round of AES -- one concurrent operation. This lets you avoid the deep mixing stage altogether, which means small keys can be hashed very quickly, since almost all of the computation is in the finalizer. Using this technique, I've been able to hash small keys in <30 cycles with concurrent AES lanes.
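Continuing the sketch above, the finalizer folds the mixing lane into each hashing lane with one AES round apiece; all four folds are independent, so they can issue concurrently (collapsing the lanes to one 128-bit value with XOR here is just for brevity, not a complete finalizer):

    static __m128i finalize(const HashState *s)
    {
        __m128i h = _mm_setzero_si128();
        for (int i = 0; i < 4; i++) {
            /* One round per lane is enough here because the mixing lane
               already represents bits from every input lane. */
            h = _mm_xor_si128(h, _mm_aesdec_si128(s->lane[i], s->mix));
        }
        return h;
    }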
As one more comment, because I've made the same mistake in the past: it is often a bad idea to use the length in the initialization vector of a general-purpose hash, because it means you can't make an incremental version of the hashing algorithm, e.g. for a stream of data of unknown size. This is a trivial change, but it has side effects in how you pad partial blocks.
If the length is not part of the initialization vector, then the padding vector for partial blocks should vary as a function of the partial data length. Otherwise it is trivial to create collisions by having partial blocks of different sizes that mimic the padding vector. Fixing this can be as easy as using _mm_set1_epi8(partial_length) as your padding vector.
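In code, that fix might look like the following (a hedged sketch; the helper name and 16-byte block size are just for illustration):

    #include <immintrin.h>
    #include <stddef.h>
    #include <string.h>

    /* Fill the final block with a byte derived from the partial length, so
       partial blocks of different sizes can't mimic each other's padding. */
    static void pad_partial_block(unsigned char block[16],
                                  const unsigned char *data,
                                  size_t len /* < 16 */)
    {
        _mm_storeu_si128((__m128i *)block, _mm_set1_epi8((char)len));
        memcpy(block, data, len); /* overlay the real bytes on the padding */
    }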
I have a completely new algorithm construction I have been using for a while that I am releasing any day now -- I was literally working on the documentation this weekend. It is a substantial improvement on the original algorithms in almost every regard and also easier to customize.
I look forward to the release, and I know the feeling (writing docs, cleaning up comments, etc.). I've been perpetually ~2 weeks from commercial release of my own product since July or so.
> Large-key performance is entirely in the bulk loop algorithm (an AES round in this case) and small-key performance is entirely in the mixer/finalizer algorithm. If you have a very wide bulk loop with no mixing of the individual lanes, then it usually requires a deep mixing stage that makes small keys very expensive, since that is a fixed overhead. The question then becomes: how do you cheaply "bank" dispersal of bits across lanes in the bulk loop to shorten the mixing stage, without creating an operation dependency between lanes that will massively reduce throughput?
This sounds deeply fascinating, but I'm way out of my depth. Can you suggest any reading that would help me appreciate the "architecture" of hash functions?
> To our surprise, we found a lack of published, well-optimized, large-data hash functions. Most hash work seems to focus on small input sizes (for things like dictionary lookup) or on cryptographic quality. We wanted the fastest possible hash that would be collision-free in practice (like SHA-1 was), and we didn’t need any cryptograhic [sic] security.
This is an important passage that could be explained in more detail. I can look up a list of non-cryptographic hash functions, nine of which support 128-bit output.
I think the big issue with bytes/cycle comparisons right now is that once you get to multiple bytes per cycle, most people are going to be limited by I/O, probably even to RAM. I assume Meow Hash was designed to run on high-end workstations, but if you are developing software for end users, my guess is that Meow Hash will end up not being significantly faster than xxHash.
Of course, if you are running a lot of hashes in parallel, I bet things get more interesting. (Example: would AES-NI instructions scale in parallel as fast as ordinary bitwise/vector instructions?)
When I say high-end workstation I really do mean workstation and not gaming PC. Most people do not have quad-channel memory in their home PCs, but it is in fact a pretty good step up in throughput.
xxHash is still very fast, and I am sure that some implementations of xxHash are faster than that. After Googling it, I can see that xxHash64 is quoted at 3.1 bytes per cycle, which is not bad.
xxHash is also very simple, and can be written easily in high- or low-level languages, which I like. Someone even implemented it in C++ templates, I believe. I don't know how Meow Hash is implemented, but I imagine the default implementation heavily exploits AMD64 CPUs and, based on the article, probably makes use of AES acceleration.
Basically, if you do this, at least be sure you're not bottlenecked on I/O already :)
Late to the party, some actual benchmarks using an older CPU with AES-NI:
Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
Kernel 4.18.14
GCC 7.3.0
Short version: it's ~2x as fast as anything else on my machine for data > 4k (maybe down to 1k), all the way up to sizes > cache (where various other algorithms also max out memory bandwidth). Other non-cryptographic hashes for comparison on my local machine: xxHash64 -> ~13 GB/s, Metrohash128 -> ~13 GB/s, Metrohash128_crc -> ~18 GB/s, t1ha_crc -> ~17.5 GB/s, etc. But Meow doesn't break the 1 GB/s barrier until 64-96 byte keys, whereas the others mentioned reach that speed at 8 bytes.
So yeah: faster for longer keys due to the fast AES mixing + parallel pipeline usage, but slower for shorter keys due to the final mixing dependency at the end.
Using a hacked-up version of Smhasher for a wider range of hashing sizes (cache on this CPU is 20 megs):
--- Testing Meow1_64 "parallel aes internals"
[[[ Speed Tests ]]]
Bulk speed test - 67108864-byte keys
Average - 4.556 bytes/cycle - 14337.09 MiB/sec @ 3 ghz
Bulk speed test - 16777216-byte keys
Average - 7.882 bytes/cycle - 24805.52 MiB/sec @ 3 ghz
Bulk speed test - 4194304-byte keys
Average - 9.813 bytes/cycle - 30884.30 MiB/sec @ 3 ghz
Bulk speed test - 1048576-byte keys
Average - 9.657 bytes/cycle - 30390.48 MiB/sec @ 3 ghz
Bulk speed test - 262144-byte keys
Average - 10.642 bytes/cycle - 33490.19 MiB/sec @ 3 ghz
Bulk speed test - 65536-byte keys
Average - 11.871 bytes/cycle - 37360.76 MiB/sec @ 3 ghz
Bulk speed test - 16384-byte keys
Average - 11.336 bytes/cycle - 35675.98 MiB/sec @ 3 ghz
Bulk speed test - 4096-byte keys
Average - 8.241 bytes/cycle - 25936.55 MiB/sec @ 3 ghz
Bulk speed test - 1024-byte keys
Average - 4.056 bytes/cycle - 12765.26 MiB/sec @ 3 ghz
Note that if you're just using it for a hash table, where you'll compare the full strings to resolve hash collisions, you don't actually need good randomness. FNV1A_VT is faster than most, has worse collision properties, yet provides a faster overall hash table, because the faster hash (every operation) outweighs the extra full string checks (rare).
Almost certainly faster than any of the ones you list for large data. I would expect the perf to be more comparable to the other AES-NI hashes (t1ha and falkhash), both of which bench around 4 bytes/cycle, while Meow claims 16 bytes/cycle.
Meow does include an implementation for smhasher, so benchmarking should be easy enough.
That's a cool link - thank you. There's so much there I couldn't tell if there were benchmarks for implementations on embedded hosts. I'm constantly looking for something suitable for use on embedded stuff. Having an x86 would be like having a unicorn as my deployment target.
Recently (this year) the MCUs I've asked for have hardware CRC32 and sometimes even AES-128, but the suits tell me to Remember the BOM, Luke, and those fancy peripherals aren't available on the common M0+ parts my hardware guy likes (NXP KE0xxx, for example). I'm left wondering how this one will work on a run-of-the-mill ARM.
> because the faster hash (every operation) outweighs the extra full string checks (rare)
I'm confused... don't you have to do at least one full string check for every table lookup on a hash table, to ensure the value you're returning is really the right key, and not just some other key that happens to collide?
In other words, if I have 32 buckets in my table and one value stored at table["hello"], and table["goodbye"] happens to hash to the same bucket, how do you know whether to return not-found for "goodbye" without a full string comparison? (Or for that matter, how do you know whether to return the stored value for "hello" without a comparison?)
Every operation requires one hash and at least one string check, but the more collisions there are, the more string checks you have to do. In theory, a hash algorithm with a great distribution and few collisions would require fewer string checks (i.e., generally just one) per operation than an algorithm with a poor distribution.
GP was saying that in their experience, the less frequent extra string checks generally aren't worth the slower hashing speed of an algorithm with a better distribution.
(It's also worth noting that getting or setting elements which don't exist requires 0 string compares unless there's a collision, which might be relevant for certain workloads.)
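To make the accounting concrete, here is a toy linear-probe lookup (names are hypothetical): a hit costs one hash plus at least one full compare, and a probe that lands on an empty slot costs zero compares.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct { const char *key; void *value; } Slot;

    /* Toy open-addressing lookup: one hash per operation, and a full string
       compare for each occupied slot probed along the way. Assumes the
       table always has at least one empty slot. */
    void *lookup(const Slot *slots, size_t nslots,
                 uint64_t (*hash)(const char *), const char *key)
    {
        for (size_t i = hash(key) % nslots; slots[i].key != NULL;
             i = (i + 1) % nslots) {
            if (strcmp(slots[i].key, key) == 0) /* required on a hit,      */
                return slots[i].value;          /* extra on each collision */
        }
        return NULL; /* empty slot reached: zero string compares */
    }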
Most of the cost of string checks is often just the cache miss to load the first bytes of the string.
Depending on the input, e.g. if key lengths are more or less uniformly distributed, full string checks might only be needed 10% of the time or so: store the string length in the bucket and you save the cache miss. Or you could store the first few bytes of the key in the bucket (though in other cases, for example a hash table containing absolute filesystem paths, the first few bytes are the same across keys).
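A bucket layout along those lines might look like this (a sketch; the field names are hypothetical):

    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint64_t    hash;    /* full hash, compared first         */
        uint32_t    key_len; /* cheap in-bucket second filter     */
        const char *key;     /* pointer chased only if both match */
        void       *value;
    } Bucket;

    /* Most negative probes are rejected by the in-bucket fields alone,
       avoiding the cache miss on the key bytes entirely. */
    static int bucket_matches(const Bucket *b, uint64_t h,
                              const char *key, uint32_t len)
    {
        return b->hash == h && b->key_len == len
            && memcmp(b->key, key, len) == 0;
    }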
I wonder if speed will still be a reason for doing something like this going forward now that Intel is starting to include SHA acceleration instructions in its processors.
It will, for a couple of reasons. The first is that Intel doesn't include the SHA instructions in their high-end consumer processors; only the low-end Atom stuff so far.
The second is that even with those instructions, this is (supposedly) faster. The SHA instructions only get you a few gigabytes a second (single digits); this purportedly gets tens of GB/s.
Phoronix is, once again, wrong. Skylake does not support the SHA instructions. The only currently shipping x86 CPUs which support them are Intel Goldmont (Atom) and AMD Zen (Ryzen, Threadripper).
Mainstream support for SHA in Intel CPUs will come with Cannon Lake, or whatever the next microarchitecture revision they release ends up being.
* Yes, Cannon Lake is technically shipping as the i3-8121U, but that isn't really relevant to anyone here.
However, I question "mainstream", since many Celeron- and Pentium-branded parts featured SHA-NI, and those are arguably mainstream, low-cost parts.
Also, in fairness to Phoronix, it looks like SHA-NI support may have been planned for Skylake but didn't make it, based on various posts I see online as early as 2014 and Intel's first postings about SHA-NI on Goldmont, etc., in 2013.
> many Celeron and Pentium branded parts featured SHA-NI
Do you have a source for this claim?
> Also, in fairness to Phoronix, it looks like SHA-NI support may have been planned for Skylake
Yeah, but on the other hand that article was published 4 months after the general availability of Skylake. It was possible for Larabel to verify the facts before publishing, and he just didn't. Nor did he correct the mistake at any point in the past three years.
I just happen to need a fast, non-cryptographic checksum for a large amount of data. I assumed using CRC32C would be the fastest thing available on x86_64, since SSE 4.2 includes an intrinsic for it.
Can I expect Meow Hash being faster than that on x86_64? What about on some other platforms such as common ARM chips? Which one is more collision resistant?
Also: the writer seems to be Casey Muratori, who is also the author of the Handmade Hero series! Kudos.
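For reference, the SSE 4.2 intrinsic I mean can be wrapped like this (a minimal sketch with the conventional initial value and final inversion; compile with -msse4.2):

    #include <nmmintrin.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    static uint32_t crc32c(const unsigned char *p, size_t n)
    {
        uint64_t crc = 0xFFFFFFFFu;
        while (n >= 8) {                 /* 8 bytes per instruction */
            uint64_t v;
            memcpy(&v, p, 8);
            crc = _mm_crc32_u64(crc, v);
            p += 8; n -= 8;
        }
        while (n--)                      /* byte-at-a-time tail */
            crc = _mm_crc32_u8((uint32_t)crc, *p++);
        return (uint32_t)crc ^ 0xFFFFFFFFu;
    }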
OK, after looking into it for just a few minutes, I found out: 1) CRC32C is not nearly as collision-resistant as many common fast non-cryptographic hashes; 2) there are many portable hash functions that are faster (and I think this is due to their bit size). See: https://github.com/Cyan4973/xxHash and https://github.com/leo-yuriev/t1ha
So, my initial assumption was wrong. I'll be trying Meow Hash and some others out!
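If anyone else wants to run the same comparison, xxHash's one-shot API is about as simple as it gets; a sketch assuming XXH64 from xxhash.h:

    #include <stdio.h>
    #include "xxhash.h" /* https://github.com/Cyan4973/xxHash */

    int main(void)
    {
        const char data[] = "the quick brown fox";
        /* One-shot 64-bit hash; the seed derives independent hashes. */
        unsigned long long h = XXH64(data, sizeof data - 1, 0);
        printf("xxHash64 = %016llx\n", h);
        return 0;
    }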
Meow Hash claims to be faster (16 B/cycle) than CityHashCrc (4.3-5.5 B/cycle), which is a variant of CityHash (ab)using the x86 CRC32C intrinsics (but with much better hash properties than plain CRC32).
Both are faster than xxHash (~1.7 B/cycle) on machines with AES-NI and CRC intrinsics. The advantage of xxHash is that it is very fast on machines without these intrinsics.
Other hashes tend to be slower, although I'm sure there are some other intrinsics-based hashes I don't have on the tip of my tongue.
Will this hash be useful for checking whether the content of files of all sizes has changed? I'm thinking about storing the hash of a file in a database to evaluate when changes have taken place.
> Don’t use the Meow hash for tiny inputs. The minimum block size for the Meow hash is 256 bytes, so if you’re feeding it 8 byte data, 16 byte data, etc., it’s going to waste all of its time padding the input
This seems like an important caveat that means it will only be used in specialized applications.
I think releasing it under yet-another-open-source-license instead of one of the well-known ones severely degrades the possibility of this code being used by companies.
The zlib license predates 3-clause BSD (zlib was released under the zlib license in 1995, BSD started releasing code under 3-clause BSD in 1999).
I will admit that when I first saw the zlib license, I was similarly confused but it's actually a license that is about the same age as most other common licenses (and quite a large body of software is licensed under it).
No software was licensed under the 3-clause BSD license (because it didn't exist) until 1999. Just because the text is similar to an earlier license doesn't change that software licensed under zlib precedes software licensed under 3-clause BSD.
But that's not really the point -- GGGP implied that the license was written from scratch ("yet-another-open-source-license") and that's obviously not true.
Currently, only an x64 implementation is provided. "While it can be implemented on ARM processors which have AES instructions, we have not yet created a reference implementation for that platform."
You're probably being downvoted since your comments appear to have nothing to do with the article or any ongoing discussion in the comments section. The use case the authors had was deduplication/change detection of large art asset files. Just how are reversible hash functions for small fixed-size inputs relevant to that?
Also, to the GP whose posts are getting downvoted to the point of being flagged, in case it's helpful in the future:
(1) The idea of using a bijection as a building block of hashing is not new, and it's not new to the creators of MeowHash. You point out that there exists a function from 512 bits to 512 bits that's guaranteed to be collision-free (i.e., a bijection).
Another such function that also has good avalanche properties and is hard to determine the original input from if you don't know the key: The AES mixing function.
MeowHash is based, in fact, on transforming individual input blocks with a single round of AES, and then merging streams of AES-mixed blocks. The AES part is straightforward -- but provides much better practical hash properties than your proposal (for example, I can use the low-order 128 or 64 bits of the output of MeowHash with the same general collision behavior). The merging step is more important from the perspective of collision-freedom, and you omit that fairly critical part.
(2) The contribution of MeowHash is its _speed_ and good hashing behavior. It's a very elegant use of highly-optimized on-CPU AES mixing (a feature used by previous hash functions), with a nice block size and stream mixing design that lets it keep a lot of blocks in flight for high throughput.
(Meta) You're engaged in a pattern that HN readers generally respond poorly to: Get downvoted -> Complain about downvotes in a way that implies the downvoters are idiots -> get downvoted -> double down.
It's pretty easy to rephrase your comments in a way that doesn't have that tone of "you're all idiots for not seeing the connection here".
don't you sometimes think you'd want a platform where the users are in control of the platform rules? of course we'd all be held to the standards to which we hold others...
MeowHash is also a bad cryptographic hash. Can we compare properties instead of intentions? MD5 is faster than SHA. Is MeowHash faster than MD5 or not?
I think what GP meant was that MD5 is meant to be cryptographic, and parenthetically, it's bad at it. MeowHash is not even meant to be cryptographic, so it is much faster.