This is a good-looking hash. Just from the construction I can see that it should be extremely fast for bulk hashing of large objects, with strong statistical properties.
I do want to comment for the wider audience on why the details of this construction do not lend themselves to small keys, and how you would fix that. I have an (unpublished, I'm lazy) AES-based hash function that is very fast for both bulk hashing and small keys, which required a mostly performance-neutral tweak to the bulk hashing loop (it likely fits in an unused ALU execution port).
Large-key performance is entirely in the bulk loop algorithm (an AES round in this case) and small-key performance is entirely in the mixer/finalizer algorithm. If you have a very wide bulk loop with no mixing of the individual lanes, then it usually requires a deep mixing stage that makes small keys very expensive, since that is a fixed overhead. The question then becomes: how do you cheaply "bank" dispersal of bits across lanes in the bulk loop to shorten the mixing stage, without creating an operation dependency between lanes that will massively reduce throughput?
As an important observation, vectorizing the AES lanes makes it difficult to disperse bits because there are no good, cheap operations that move bits between lanes. However, the CPU will happily run multiple AES operations and other 128-bit operations concurrently across ALU ports if they are in independent registers. This allows you to trivially create a "mixing" lane that exists solely to aid bit dispersal, because you are no longer trying to do sideways operations on a vector. So what does a mixing lane look like that doesn't create dependencies on your hashing lanes? It is so simple it is embarrassing: XOR the input data lanes. The mixing lane is worthless as a hash on its own; it exists only to represent all the bits from all the lanes.
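A minimal sketch of what I mean, assuming four hashing lanes and AES-NI intrinsics (the lane count, state layout, and names here are illustrative, not my actual design):

    #include <immintrin.h>  /* AES-NI + SSE2 intrinsics; compile with -maes */

    typedef struct {
        __m128i lane[4]; /* independent AES hashing lanes */
        __m128i mix;     /* extra lane that only XOR-accumulates the input */
    } HashState;

    static void absorb_block(HashState *s, const unsigned char *p)
    {
        for (int i = 0; i < 4; i++) {
            __m128i in = _mm_loadu_si128((const __m128i *)(p + 16 * i));
            /* Each lane depends only on its own previous value, so the four
               AES rounds can issue concurrently on independent ALU ports. */
            s->lane[i] = _mm_aesdec_si128(s->lane[i], in);
            /* The mixing lane is a plain XOR of the raw input: worthless as
               a hash by itself, but it carries bits from every lane. */
            s->mix = _mm_xor_si128(s->mix, in);
        }
    }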
At the mixing stage, you simply fold the mixing lane into each of the hashing lanes by running it through a round of AES -- one concurrent operation. This lets you avoid the deep mixing stage altogether, which means small keys can be hashed very quickly, since almost all of the computation is in the finalizer. Using this technique, I've been able to hash small keys in <30 cycles with concurrent AES lanes.
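Continuing the sketch above, the finalizer folds the mixing lane into each hashing lane with one AES round apiece; all four folds are independent, so they can issue concurrently (collapsing the lanes to one 128-bit value with XOR here is just for brevity, not a complete finalizer):

    static __m128i finalize(const HashState *s)
    {
        __m128i h = _mm_setzero_si128();
        for (int i = 0; i < 4; i++) {
            /* One round per lane is enough here because the mixing lane
               already represents bits from every input lane. */
            h = _mm_xor_si128(h, _mm_aesdec_si128(s->lane[i], s->mix));
        }
        return h;
    }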
As one more comment, because I've made the same mistake in the past: it is often a bad idea to use the length in the initialization vector of a general-purpose hash, because it means you can't make an incremental version of the hashing algorithm, e.g. for a stream of data of unknown size. This is a trivial change, but it has side effects in how you pad partial blocks.
If the length is not part of the initialization vector, then the padding vector for partial blocks should vary as a function of the partial data length. Otherwise it is trivial to create collisions by having partial blocks of different sizes that mimic the padding vector. Fixing this can be as easy as using _mm_set1_epi8(partial_length) as your padding vector.
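In code, that fix might look like the following (a hedged sketch; the helper name and 16-byte block size are just for illustration):

    #include <immintrin.h>
    #include <stddef.h>
    #include <string.h>

    /* Fill the final block with a byte derived from the partial length, so
       partial blocks of different sizes can't mimic each other's padding. */
    static void pad_partial_block(unsigned char block[16],
                                  const unsigned char *data,
                                  size_t len /* < 16 */)
    {
        _mm_storeu_si128((__m128i *)block, _mm_set1_epi8((char)len));
        memcpy(block, data, len); /* overlay the real bytes on the padding */
    }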
I have a completely new algorithm construction I have been using for a while that I am releasing any day now -- I was literally working on the documentation this weekend. It is a substantial improvement on the original algorithms in almost every regard and also easier to customize.
I look forward to the release, and I know the feeling (writing docs, cleaning up comments, etc.). I've been perpetually ~2 weeks from commercial release of my own product since July or so.
> Large-key performance is entirely in the bulk loop algorithm (an AES round in this case) and small-key performance is entirely in the mixer/finalizer algorithm. If you have a very wide bulk loop with no mixing of the individual lanes, then it usually requires a deep mixing stage that makes small keys very expensive, since that is a fixed overhead. The question then becomes: how do you cheaply "bank" dispersal of bits across lanes in the bulk loop to shorten the mixing stage, without creating an operation dependency between lanes that will massively reduce throughput?
This sounds deeply fascinating, but I'm way out of my depth. Can you suggest any reading that would help me appreciate the "architecture" of hash functions?
> To our surprise, we found a lack of published, well-optimized, large-data hash functions. Most hash work seems to focus on small input sizes (for things like dictionary lookup) or on cryptographic quality. We wanted the fastest possible hash that would be collision-free in practice (like SHA-1 was), and we didn’t need any cryptograhic [sic] security.
This is an important passage that could be explained in more detail. I can look up a list of non-cryptographic hash functions, nine of which support 128-bit output.
I think the big issue with bytes/cycle comparisons right now is that once you get to multiple bytes per cycle, most people are going to be limited by I/O, probably even to RAM. I assume Meow Hash was designed to run on high-end workstations, but if you are developing software for end users, my guess is that Meow Hash will end up not being significantly faster than xxHash.
Of course, if you are running a lot of hashes in parallel, I bet things get more interesting. (Example: would AES-NI instructions scale in parallel as fast as ordinary bitwise/vector instructions?)
When I say high-end workstation I really do mean workstation and not gaming PC. Most people do not have quad-channel memory in their home PCs, but it is in fact a pretty good step up in throughput.
xxHash is still very fast, and I am sure that some implementations of xxHash are faster than that. After Googling it, I can see that xxHash64 is quoted at 3.1 bytes per cycle, which is not bad.
xxHash is also very simple, and can be written easily in high- or low-level languages, which I like. Someone even implemented it in C++ templates, I believe. I don't know how Meow Hash is implemented, but I imagine the default implementation heavily exploits AMD64 CPUs and, based on the article, probably makes use of AES acceleration.
Basically, if you do this, at least be sure you're not bottlenecked on I/O already :)
Late to the party, some actual benchmarks using an older CPU with AES-NI:
Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
Kernel 4.18.14
GCC 7.3.0
Short version: it's ~2x as fast as anything else on my machine for data > 4k (maybe down to 1k), all the way up to sizes > cache (where various other algorithms also max out memory bandwidth). Other non-cryptographic hashes for comparison on my local machine: xxHash64 -> ~13 GB/s, Metrohash128 -> ~13 GB/s, Metrohash128_crc -> ~18 GB/s, t1ha_crc -> ~17.5 GB/s, etc. But Meow doesn't break the 1 GB/s barrier until 64-96 byte keys, whereas the others mentioned reach that speed at 8 bytes.
So yeah: faster for longer keys due to the fast AES mixing + parallel pipeline usage, but slower for shorter keys due to the final mixing dependency at the end.
Using a hacked-up version of Smhasher for a wider range of hashing sizes (cache on this CPU is 20 megs):
--- Testing Meow1_64 "parallel aes internals"
[[[ Speed Tests ]]]
Bulk speed test - 67108864-byte keys
Average - 4.556 bytes/cycle - 14337.09 MiB/sec @ 3 ghz
Bulk speed test - 16777216-byte keys
Average - 7.882 bytes/cycle - 24805.52 MiB/sec @ 3 ghz
Bulk speed test - 4194304-byte keys
Average - 9.813 bytes/cycle - 30884.30 MiB/sec @ 3 ghz
Bulk speed test - 1048576-byte keys
Average - 9.657 bytes/cycle - 30390.48 MiB/sec @ 3 ghz
Bulk speed test - 262144-byte keys
Average - 10.642 bytes/cycle - 33490.19 MiB/sec @ 3 ghz
Bulk speed test - 65536-byte keys
Average - 11.871 bytes/cycle - 37360.76 MiB/sec @ 3 ghz
Bulk speed test - 16384-byte keys
Average - 11.336 bytes/cycle - 35675.98 MiB/sec @ 3 ghz
Bulk speed test - 4096-byte keys
Average - 8.241 bytes/cycle - 25936.55 MiB/sec @ 3 ghz
Bulk speed test - 1024-byte keys
Average - 4.056 bytes/cycle - 12765.26 MiB/sec @ 3 ghz
Note that if you're just using it for a hash table, where you'll compare the full strings to resolve hash collisions, you don't actually need good randomness. FNV1A_VT is faster than most, has worse collision properties, yet provides a faster overall hash table, because the faster hash (every operation) outweighs the extra full string checks (rare).
Almost certainly faster than any of the ones you list for large data. I would expect the perf to be more comparable to the other AES-NI hashes (t1ha and falkhash), both of which bench around 4 bytes/cycle, while Meow claims 16 bytes/cycle.
Meow does include an implementation for smhasher, so benchmarking should be easy enough.
That's a cool link - thank you. There's so much there I couldn't tell if there were benchmarks for implementations on embedded hosts. I'm constantly looking for something suitable for use on embedded stuff. Having an x86 would be like having a unicorn as my deployment target.
Recently (this year) the MCUs I've asked for have hardware CRC32 and sometimes even AES-128, but the suits tell me to Remember the BOM, Luke, and those fancy peripherals aren't available on the common M0+ parts my hardware guy likes (NXP KE0xxx, for example). I'm left wondering how this one will work on a run-of-the-mill ARM.
> because the faster hash (every operation) outweighs the extra full string checks (rare)
I'm confused... don't you have to do at least one full string check for every table lookup on a hash table, to ensure the value you're returning is really the right key, and not just some other key that happens to collide?
In other words, if I have 32 buckets in my table and one value stored at table["hello"], and table["goodbye"] happens to hash to the same bucket, how do you know whether to return not-found for "goodbye" without a full string comparison? (Or for that matter, how do you know whether to return the stored value for "hello" without a comparison?)
Every operation requires one hash and at least one string check, but the more collisions there are, the more string checks you have to do. In theory, a hash algorithm with a great distribution and few collisions would require fewer string checks (i.e., generally just one) per operation than an algorithm with a poor distribution.
GP was saying that in their experience, the less frequent extra string checks generally aren't worth the slower hashing speed of an algorithm with a better distribution.
(It's also worth noting that getting or setting elements which don't exist requires 0 string compares unless there's a collision, which might be relevant for certain workloads.)
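To make the accounting concrete, here is a toy linear-probe lookup (names are hypothetical): a hit costs one hash plus at least one full compare, and a probe that lands on an empty slot costs zero compares.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct { const char *key; void *value; } Slot;

    /* Toy open-addressing lookup: one hash per operation, and a full string
       compare for each occupied slot probed along the way. Assumes the
       table always has at least one empty slot. */
    void *lookup(const Slot *slots, size_t nslots,
                 uint64_t (*hash)(const char *), const char *key)
    {
        for (size_t i = hash(key) % nslots; slots[i].key != NULL;
             i = (i + 1) % nslots) {
            if (strcmp(slots[i].key, key) == 0) /* required on a hit,      */
                return slots[i].value;          /* extra on each collision */
        }
        return NULL; /* empty slot reached: zero string compares */
    }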
Most of the cost of string checks is often just the cache miss to load the first bytes of the string.
Depending on the input, e.g. if key lengths are more or less uniformly distributed, full string checks might only be needed 10% of the time or so: store the string length in the bucket and you save the cache miss. Or you could store the first few bytes of the key in the bucket (though in other cases, for example a hash table containing absolute filesystem paths, the first few bytes are the same across keys).
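A bucket layout along those lines might look like this (a sketch; the field names are hypothetical):

    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint64_t    hash;    /* full hash, compared first         */
        uint32_t    key_len; /* cheap in-bucket second filter     */
        const char *key;     /* pointer chased only if both match */
        void       *value;
    } Bucket;

    /* Most negative probes are rejected by the in-bucket fields alone,
       avoiding the cache miss on the key bytes entirely. */
    static int bucket_matches(const Bucket *b, uint64_t h,
                              const char *key, uint32_t len)
    {
        return b->hash == h && b->key_len == len
            && memcmp(b->key, key, len) == 0;
    }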
I wonder if speed will still be a reason for doing something like this going forward now that Intel is starting to include SHA acceleration instructions in its processors.
It will, for a couple of reasons. The first is that Intel doesn't include the SHA instructions in their high-end consumer processors; only the low-end Atom stuff so far.
The second is that even with those instructions, this is (supposedly) faster. The SHA instructions only get you a few gigabytes a second (single digits); this purportedly gets tens of GB/s.
Phoronix is, once again, wrong. Skylake does not support the SHA instructions. The only currently shipping x86 CPUs which support them are Intel Goldmont (Atom) and AMD Zen (Ryzen, Threadripper).
Mainstream support for SHA in Intel CPUs will come with Cannon Lake, or whatever the next microarchitecture revision they release ends up being.
* Yes, Cannon Lake is technically shipping as the i3-8121U, but that isn't really relevant to anyone here.
However, I question "mainstream", since many Celeron- and Pentium-branded parts featured SHA-NI, and those are arguably mainstream, low-cost parts.
Also, in fairness to Phoronix, it looks like SHA-NI support may have been planned for Skylake but didn't make it, based on various posts I see online as early as 2014 and Intel's first postings about SHA-NI on Goldmont, etc., in 2013.
> many Celeron and Pentium branded parts featured SHA-NI
Do you have a source for this claim?
> Also, in fairness to Phoronix, it looks like SHA-NI support may have been planned for Skylake
Yeah, but on the other hand that article was published 4 months after the general availability of Skylake. It was possible for Larabel to verify the facts before publishing, and he just didn't. Nor did he correct the mistake at any point in the past three years.
I just happen to need a fast, non-cryptographic checksum for a large amount of data. I assumed using CRC32C would be the fastest thing available on x86_64, since SSE 4.2 includes an intrinsic for it.
Can I expect Meow Hash being faster than that on x86_64? What about on some other platforms such as common ARM chips? Which one is more collision resistant?
Also: the writer seems to be Casey Muratori, who is also the author of the Handmade Hero series! Kudos.
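For reference, the SSE 4.2 intrinsic I mean can be wrapped like this (a minimal sketch with the conventional initial value and final inversion; compile with -msse4.2):

    #include <nmmintrin.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    static uint32_t crc32c(const unsigned char *p, size_t n)
    {
        uint64_t crc = 0xFFFFFFFFu;
        while (n >= 8) {                 /* 8 bytes per instruction */
            uint64_t v;
            memcpy(&v, p, 8);
            crc = _mm_crc32_u64(crc, v);
            p += 8; n -= 8;
        }
        while (n--)                      /* byte-at-a-time tail */
            crc = _mm_crc32_u8((uint32_t)crc, *p++);
        return (uint32_t)crc ^ 0xFFFFFFFFu;
    }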
OK, after looking into it for just a few minutes, I found out: 1) CRC32C is not nearly as collision-resistant as many common fast non-cryptographic hashes; 2) there are many portable hash functions that are faster (and I think this is due to their bit size). See: https://github.com/Cyan4973/xxHash and https://github.com/leo-yuriev/t1ha
So, my initial assumption was wrong. I'll be trying Meow Hash and some others out!
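If anyone else wants to run the same comparison, xxHash's one-shot API is about as simple as it gets; a sketch assuming XXH64 from xxhash.h:

    #include <stdio.h>
    #include "xxhash.h" /* https://github.com/Cyan4973/xxHash */

    int main(void)
    {
        const char data[] = "the quick brown fox";
        /* One-shot 64-bit hash; the seed derives independent hashes. */
        unsigned long long h = XXH64(data, sizeof data - 1, 0);
        printf("xxHash64 = %016llx\n", h);
        return 0;
    }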
Meow Hash claims to be faster (16 B/cycle) than CityHashCrc (4.3-5.5 B/cycle), which is a variant of CityHash (ab)using the x86 CRC32C intrinsics (but with much better hash properties than plain CRC32).
Both are faster than xxHash (~1.7 B/cycle) on machines with AES-NI and CRC intrinsics. The advantage of xxHash is that it is very fast on machines without these intrinsics.
Other hashes tend to be slower, although I'm sure there are some other intrinsics-based hashes I don't have on the tip of my tongue.
Will this hash be useful for checking whether the content of files of all sizes has changed? I'm thinking about storing the hash of a file in a database to evaluate when changes have taken place.
> Don’t use the Meow hash for tiny inputs. The minimum block size for the Meow hash is 256 bytes, so if you’re feeding it 8 byte data, 16 byte data, etc., it’s going to waste all of its time padding the input
This seems like an important caveat that means it will only be used in specialized applications.
I think releasing it under yet-another-open-source-license instead of one of the well-known ones severely degrades the possibility of this code being used by companies.
The zlib license predates 3-clause BSD (zlib was released under the zlib license in 1995, BSD started releasing code under 3-clause BSD in 1999).
I will admit that when I first saw the zlib license, I was similarly confused but it's actually a license that is about the same age as most other common licenses (and quite a large body of software is licensed under it).
No software was licensed under the 3-clause BSD license (because it didn't exist) until 1999. Just because the text is similar to an earlier license doesn't change that software licensed under zlib precedes software licensed under 3-clause BSD.
But that's not really the point -- GGGP implied that the license was written from scratch ("yet-another-open-source-license") and that's obviously not true.
Currently, only an x64 implementation is provided. "While it can be implemented on ARM processors which have AES instructions, we have not yet created a reference implementation for that platform."
You're probably being downvoted since your comments appear to have nothing to do with the article or any ongoing discussion in the comments section. The use case the authors had was deduplication/change detection of large art asset files. Just how are reversible hash functions for small fixed-size inputs relevant to that?
Also, to the GP whose posts are getting downvoted to the point of being flagged, in case it's helpful in the future:
(1) The idea of using a bijection as a building block of hashing is not new, and it's not new to the creators of MeowHash. You point out that there exists a function from 512 bits to 512 bits that's guaranteed to be collision-free (i.e., a bijection).
Another such function that also has good avalanche properties and is hard to determine the original input from if you don't know the key: The AES mixing function.
MeowHash is based, in fact, on transforming individual input blocks with a single round of AES, and then merging streams of AES-mixed blocks. The AES part is straightforward -- but provides much better practical hash properties than your proposal (for example, I can use the low-order 128 or 64 bits of the output of MeowHash with the same general collision behavior). The merging step is more important from the perspective of collision-freedom, and you omit that fairly critical part.
(2) The contribution of MeowHash is its _speed_ and good hashing behavior. It's a very elegant use of highly-optimized on-CPU AES mixing (a feature used by previous hash functions), with a nice block size and stream mixing design that lets it keep a lot of blocks in flight for high throughput.
(Meta) You're engaged in a pattern that HN readers generally respond poorly to: Get downvoted -> Complain about downvotes in a way that implies the downvoters are idiots -> get downvoted -> double down.
It's pretty easy to rephrase your comments in a way that doesn't have that tone of "you're all idiots for not seeing the connection here".
don't you sometimes think you'd want a platform where the users are in control of the platform rules? of course we'd all be held to the standards to which we hold others...
MeowHash is also a bad cryptographic hash. Can we compare properties instead of intentions? MD5 is faster than SHA. Is MeowHash faster than MD5 or not?
I think what GP meant was that MD5 is meant to be cryptographic, and parenthetically, it's bad at it. MeowHash is not even meant to be cryptographic, so it is much faster.