
Meow Hash: A high-speed non-cryptographic hash function - spatulon
https://mollyrocket.com/meowhash
======
jandrewrogers
This is a good looking hash, just from the construction I can see that it
should be extremely fast for bulk hashing of large objects and with strong
statistical properties.

I do want to comment for the wider audience why the details of this
construction does not lend itself to small keys, and how you would fix that. I
have an (unpublished, I'm lazy) AES-based hash function that is very fast for
both bulk hashing and small keys, which required a mostly performance neutral
(likely fits in an unused ALU execution port) tweak to the bulk hashing loop.

Large key performance is entirely in the bulk loop algorithm (an AES round in
this case) and small key performance is entirely in the mixer/finalizer
algorithm. If you have a very wide bulk loops with no mixing of the individual
lanes then it usually requires a deep mixing stage that makes small keys very
expensive since that is a fixed overhead. The question then becomes, how do
you cheaply "bank" dispersal of bits across lanes in the bulk loop to shorten
the mixing stage _without_ creating an operation dependency between lanes that
will massively reduce throughput.

As an important observation, vectorizing the AES lanes makes it difficult to
disperse bits because there are no good, cheap operations that move bits
between lanes. However, the CPU will happily run multiple AES operations and
other 128-bit operations concurrently across ALU ports if they are in
independent registers. This allows you to trivially create a "mixing" lane
that solely exists to aid bit dispersal because you no longer are trying to do
sideways operations on a vector. So what does a mixing lane look like that
doesn't create dependencies on your hashing lanes? It is so simple it is
embarrassing: XOR the input data lanes. The mixing lane is worthless as a hash
but it only exists to represent all the bits from all the lanes.

At the mixing stage, you simply fold the mixing lane into each of the hashing
lanes by running it through a round of AES -- one concurrent operation. This
allows you to avoid the deep mixing stage altogether, which means small keys
can be hashed very quickly since almost all of the computation is in the
finalizer. Using this technique, I've been able to hash small keys in <30
cycles with concurrent AES lanes.

~~~
jandrewrogers
As one more comment, because I've made the same mistake in the past, it is
often a bad idea to use the length in the initialization vector of a general
purpose hash because it means you can't make an incremental version of the
hashing algorithm e.g. a stream of data of unknown size. This is a trivial
change but it has side effects in how you pad partial blocks.

If the length is not part of the initialization vector, then the padding
vector for partial blocks should vary as a function of partial data length.
Otherwise it is trivial to create collisions by having partial blocks of
different sizes that mimic the padding vector. Fixing this can be as easy as
using _mm_set1_epi8(partial_length) as your padding vector.

~~~
DrJosiah
I'm a fan of your work; I've personally mangled your Metrohash64 in some of my
own work for incremental updates.

Any chance you're thinking about releasing the program for generating new
(good) Metrohash constants + shifts?

~~~
jandrewrogers
I have a completely new algorithm construction I have been using for a while
that I am releasing any day now -- I was literally working on the
documentation this weekend. It is a substantial improvement on the original
algorithms in almost every regard and also easier to customize.

~~~
DrJosiah
I look forward to the release, and I know the feeling (writing docs, cleaning
up comments, etc.). I've been perpetually ~2 weeks from commercial release of
my own product since July or so.

------
apo
_To our surprise, we found a lack of published, well-optimized, large-data
hash functions. Most hash work seems to focus on small input sizes (for things
like dictionary lookup) or on cryptographic quality. We wanted the fastest
possible hash that would be collision-free in practice (like SHA-1 was), and
we didn’t need any cryptograhic [sic] security._

This is an important passage that could be explained in more detail. I can
look up a list of non-cryptographic hash functions, nine of which support
128-bit output:

[https://en.wikipedia.org/wiki/List_of_hash_functions#Non-
cry...](https://en.wikipedia.org/wiki/List_of_hash_functions#Non-
cryptographic_hash_functions)

Given that the claim is that Meow Hash is fast, where's the benchmarking to
prove it's faster than the alternatives?

~~~
jchw
16 bytes per cycle is what they quote. Imo, that's pretty impressive.

For comparison, xxhash is the closest thing to this that I'm aware of and
clocks in at around 1.7 bytes per cycle using the official implementation.

~~~
loeg
There are variants of CityHash that use the x86 CRC intrinsics (CityHashCrc128
and CityHashCrc256). They quote 4.3-5.5 bytes/cycle.[0]

(Better than xxhash on x86, worse than Meowhash -- assuming the bytes/cycle
numbers can be compared apples-to-apples, which is a big if.)

[0]: [https://github.com/google/cityhash](https://github.com/google/cityhash)

~~~
jchw
I think the big issue with bytes/cycle comparisons right now is that once you
get to multiple bytes per cycle, most people are going to be limited on I/O,
even to RAM probably. I assume Meow Hash was designed to run on high end
workstations, but if you are developing software for end users my guess is
that Meow Hash will end up being not significantly faster than xxHash.

Of course, if you are running a lot of hashes in parallel, I bet things get
more interesting. (Example: would AES-NI instructions scale in parallel as
fast as ordinary bitwise/vector instructions?)

~~~
gameswithgo
ram throughput isn't that different on a cheap to high end pc.

~~~
jchw
When I say high end workstation I really do mean workstation and not gaming
PC. Most people do not have quad channel memory on their home PCs but it is in
fact a pretty good step up in throughput.

------
DrJosiah
Late to the party, some actual benchmarks using an older CPU with AES-NI:
Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz bugs : cpu_meltdown spectre_v1
spectre_v2 spec_store_bypass l1tf Kernel 4.18.14 GCC 7.3.0

Short version, it's ~2x as fast as anything else on my machine for data > 4k,
maybe down to 1k, all the way up to > cache (where various other algorithms
also max out memory bandwidth). Other non-cryptographic hashes (for
comparison) on my local machine: xxHash64 -> ~13G/s, Metrohash128 -> ~13G/s,
Metrohash128_crc -> ~18G/s, t1ha_crc -> ~17.5G/s, etc., but Meow doesn't break
the 1G/s barrier until 64-96 bytes (the others mentioned reach that speed at 8
bytes).

So yeah, faster for longer keys due to the fast aes mixing + parallel pipeline
usage, but slower for shorter keys due to the final mixing dependency at the
end.

Using a hacked-up version of Smhasher for wider range of hashing sizes (cache
on this CPU is 20 megs):

    
    
        --- Testing Meow1_64 "parallel aes internals"
    
        [[[ Speed Tests ]]]
    
        Bulk speed test - 67108864-byte keys
        Average      -  4.556 bytes/cycle - 14337.09 MiB/sec @ 3 ghz
        Bulk speed test - 16777216-byte keys
        Average      -  7.882 bytes/cycle - 24805.52 MiB/sec @ 3 ghz
        Bulk speed test - 4194304-byte keys
        Average      -  9.813 bytes/cycle - 30884.30 MiB/sec @ 3 ghz
        Bulk speed test - 1048576-byte keys
        Average      -  9.657 bytes/cycle - 30390.48 MiB/sec @ 3 ghz
        Bulk speed test - 262144-byte keys
        Average      - 10.642 bytes/cycle - 33490.19 MiB/sec @ 3 ghz
        Bulk speed test - 65536-byte keys
        Average      - 11.871 bytes/cycle - 37360.76 MiB/sec @ 3 ghz
        Bulk speed test - 16384-byte keys
        Average      - 11.336 bytes/cycle - 35675.98 MiB/sec @ 3 ghz
        Bulk speed test - 4096-byte keys
        Average      -  8.241 bytes/cycle - 25936.55 MiB/sec @ 3 ghz
        Bulk speed test - 1024-byte keys
        Average      -  4.056 bytes/cycle - 12765.26 MiB/sec @ 3 ghz

------
ardaozkal
I just ran a simple collision test against it with real life, prod data. It
resulted in 13 collisions on 116275 files in 512b mode.

[https://github.com/cmuratori/meow_hash/issues/7#issuecomment...](https://github.com/cmuratori/meow_hash/issues/7#issuecomment-431657191)

~~~
jsnell
Given the insanely high rate of collisions for a 512 bit hash, the "we
probably won't update the main loop" response is quite something.

------
martincmartin
How does it compare to:

FarmHash

CityHash

SMHasher

MetroHash

FNV1a_YT

Note that if you're just using it for a hash table, where you'll compare the
full strings to resolve hash collisions, you don't actually need good
randomness. FNV1A_VT is faster than most, has worse collision properties, yet
provides a faster overall hash table, because the faster hash (every
operation) outweighs the extra full string checks (rare).

[https://github.com/rurban/smhasher](https://github.com/rurban/smhasher)

~~~
loeg
FNV has terrible hash properties (i.e. it fails many smhasher tests as
mentioned in TFA), so it isn't really comparable.

SMHasher isn't a hash function? It's a test suite for hash functions.

CityHash's CRC modes are nominally 4.3-5.5 bytes/cycle.

The others are slower.

This article quotes 16 b/c for Meowhash.

~~~
pkaye
Yup FNV is terrible from my experience using it in a custom hashmap
implementation. The lower bits of the FNV hash have a lot of bias.

------
aappleby
Yep, unrolled AES instructions are the way to go if you need to hash gigs of
stuff and you're on a CPU that supports them.

-Austin, SMHasher author.

------
esaym
Never heard of these guys, but they got several hundred hours of video footage
of building a game from scratch:
[https://www.youtube.com/user/handmadeheroarchive/videos](https://www.youtube.com/user/handmadeheroarchive/videos)
looks interesting.

------
pmorici
I wonder if speed will still be a reason for doing something like this going
forward now that Intel is starting to include SHA acceleration instructions in
its processors.

~~~
loeg
It will, for a couple reasons. The first is, Intel doesn't include the SHA
intrinsics in their high end consumer processors. Only the low end Atom stuff
so far.

The second is that even with the intrinsics, this is (supposedly) faster. The
SHA instructions only get you a few gigabytes a second (single digits). This
purportedly gets tens of GB/s.

~~~
binarycrusader
As far as I’m aware, that isn’t true. Skylake added support for 7 different
SHA extensions:

[https://www.phoronix.com/scan.php?page=news_item&px=Linux-4....](https://www.phoronix.com/scan.php?page=news_item&px=Linux-4.4-Crypto)

~~~
magila
Phoronix is, once again, wrong. Skylake does not support the SHA instructions.
The only currently shipping x86 CPUs which support them are Intel Goldmont
(Atom) and AMD Zen (Ryzen, Threadripper).

Mainstream support for SHA in Intel CPUs will come with Cannon Lake, or
whatever the next microarchitecture revision they release ends up being.

* Yes, Cannon Lake is technically shipping as the i3-8121U, but that isn't really relevant to anyone here.

~~~
binarycrusader
Ok, it seems you're correct about Cannon Lake:

[https://software.intel.com/sites/default/files/managed/c5/15...](https://software.intel.com/sites/default/files/managed/c5/15/architecture-
instruction-set-extensions-programming-reference.pdf)

Cannon Lake and later according to Intel.

However, I question "mainstream" since many Celeron and Pentium branded parts
featured SHA-NI, and those are arguably mainstream, low-cost parts.

Also, in fairness to Phoronix, it looks like SHA-NI support may have been
planned for Skylake, but didn't make it based on various posts I see online as
early as 2014 and Intel's first postings about SHA-NI on Goldmont, etc. in
2013.

~~~
loeg
> many Celeron and Pentium branded parts featured SHA-NI

Do you have a source for this claim?

> Also, in fairness to Phoronix, it looks like SHA-NI support may have been
> planned for Skylake

Yeah, but on the other hand that article was published 4 months _after_ the
general availability of Skylake. It was possible for Larabel to verify the
facts before publishing and he just didn't. Nor did he correct the mistake in
any of the past three years.

~~~
binarycrusader
Sure, quite a few processors were based on Goldmont:

[https://en.m.wikipedia.org/wiki/Goldmont](https://en.m.wikipedia.org/wiki/Goldmont)

------
based2
[https://www.reddit.com/r/programming/comments/9pr5ir/meow_ha...](https://www.reddit.com/r/programming/comments/9pr5ir/meow_hash_a_noncryptographic_hash_capable_of_16/)

------
GolDDranks
I just happen to need a fast, non-cryptographic checksum for large amount of
data. I assumed using CRC32C would be the fastest thing available on x86_64
since SSE 4.2 includes an intrinsic for it.

Can I expect Meow Hash being faster than that on x86_64? What about on some
other platforms such as common ARM chips? Which one is more collision
resistant?

Also: the writer seems to be Casey Muratori, who is also the author of the
Handmade Hero series! Kudos.

~~~
GolDDranks
OK, after looking into it for just a few minutes, I found out: 1) CRC32C is
not nearly as collision-resistant as many common fast non-cryptographical
hashes. 2) there are many portable hash functions that are faster (and I think
this is due to their bit size.) See:
[https://github.com/Cyan4973/xxHash](https://github.com/Cyan4973/xxHash) and
[https://github.com/leo-yuriev/t1ha](https://github.com/leo-yuriev/t1ha)

So, my initial assumption was wrong. I'll be trying Meow Hash and some others
out!

~~~
loeg
Meowhash claims to be faster (16B/cycle) than CityHashCrc (4.3-5.5 B/cycle),
which is a variant of CityHash (ab)using the x86 CRC32C intrinsics (but with
much better hash properties than plain CRC32).

Both are faster than xxHash on machines with AESNI and CRC intrinsics
(~1.7B/cycle). The advantage of xxHash is that it is very fast on machines
_without_ these instrinsics.

Other hashes tend to be slower although I'm sure there are some other
intrinsics-based hashes I don't have at the tip of my tongue.

------
wolfhumble
Will this hash be useful to check if the content of files of all sizes has
changed or not? Thinking about storing the hash of a file in a database to
evaluate when changes has taken place.

------
man-and-laptop
[https://github.com/cmuratori/meow_hash/blob/master/meow_hash...](https://github.com/cmuratori/meow_hash/blob/master/meow_hash.h)

What do AESRotate(foo,bar) and AESMerge(foo,bar) do?

[edit]

So it's a simple cyclic rotation preceded by A.Q0 = _mm512_aesdec_epi128(A.Q0,
B.Q0)

From Googling, _mm512_aesdec_epi128 looks like an Intel processor built-in.

------
im3w1l
> Don’t use the Meow hash for tiny inputs. The minimum block size for the Meow
> hash is 256 bytes, so if you’re feeding it 8 byte data, 16 byte data, etc.,
> it’s going to waste all of its time padding the input

This seems like an important caveat that means it will only be used in
specialized applications.

~~~
Sharlin
It's meant for file hashing and it's rare in the real world that files are
less than 256 bytes.

------
maq1234
I think releasing it on yet-another-open-source-license instead of some of
well-known ones severely degrades the possibility for this code to be used by
companies.

~~~
tbodt
It's the zlib license, which is used by GLFW, SDL, and SFML as well. Companies
seem to have no problem using things like SDL, so I think this is fine.

Still, it would be nice is to state that is is the zlib license, and to
include the license in the repository.

~~~
loeg
And it looks more or less fundamentally identical to the 3-clause BSD license.

~~~
cyphar
The zlib license predates 3-clause BSD (zlib was released under the zlib
license in 1995, BSD started releasing code under 3-clause BSD in 1999).

I will admit that when I first saw the zlib license, I was similarly confused
but it's actually a license that is about the same age as most other common
licenses (and quite a large body of software is licensed under it).

~~~
loeg
The 3-clause BSDL is just the 4-clause with clause #3 removed. Per Wikipedia,
the 4-clause dates to 1990:
[https://en.wikipedia.org/wiki/BSD_licenses#4-clause_license_...](https://en.wikipedia.org/wiki/BSD_licenses#4-clause_license_\(original_%22BSD_License%22\))
. (I.e., all of the text in the 3-clause BSD license was written in 1990.)

~~~
cyphar
No software was licensed under the 3-clause BSD license (because it didn't
exist) until 1999. Just because the text is similar to an earlier license
doesn't change that software licensed under zlib precedes software licensed
under 3-clause BSD.

But that's not really the point -- GGGP implied that the license was written
from scratch ("yet-another-open-source-license") and that's obviously not
true.

------
TimTheTinker
I wonder if a cryptographer could take a look at Meow Hash. I'm curious how it
would fail first when subjected to modern cryptanalytic methods.

------
amelius
How fast is it on other platforms, e.g. ARM?

~~~
tln
Currently, only an x64 implementation is provided. "While it can be
implemented on ARM processors which have AES instructions, we have not yet
created a reference implementation for that platform."

------
ynniv
Not even a mention of MD5?

~~~
UncleMeat
MD5 is a (bad) cryptographic hash and therefore has entirely different goals
and properties as this hash function.

~~~
ynniv
MeowHash is also a bad cryptographic hash. Can we compare properties instead
of intentions? MD5 is faster than SHA. Is MeowHash faster than MD5 or not?

~~~
masklinn
> MeowHash is also a bad cryptographic hash.

MeowHash is not cryptographic at all.

> Is MeowHash faster than MD5 or not?

Meowhash is advertised as 16 bytes per cycle. MD5 is around 5 cycles per
byte[0] or _0.2 bytes per cycle_.

Pre-existing fast non-cryptographic hashes apparently top out around 5.5 bytes
per cycle (CityHashCRC256).

[0]
[https://keccak.team/sw_performance.html](https://keccak.team/sw_performance.html)

~~~
ynniv
Thank you. The comparison of MeowHash to SHA without mentioning MD5 or any
actual performance numbers didn't make sense.

