Meow Hash: A high-speed non-cryptographic hash function (mollyrocket.com)
173 points by spatulon on Oct 20, 2018 | 68 comments



This is a good-looking hash; just from the construction I can see that it should be extremely fast for bulk hashing of large objects, with strong statistical properties.

I do want to comment for the wider audience on why the details of this construction do not lend themselves to small keys, and how you would fix that. I have an (unpublished, I'm lazy) AES-based hash function that is very fast for both bulk hashing and small keys, which required a mostly performance-neutral (it likely fits in an unused ALU execution port) tweak to the bulk hashing loop.

Large key performance is entirely in the bulk loop algorithm (an AES round in this case) and small key performance is entirely in the mixer/finalizer algorithm. If you have a very wide bulk loop with no mixing of the individual lanes, it usually requires a deep mixing stage that makes small keys very expensive, since that is a fixed overhead. The question then becomes: how do you cheaply "bank" dispersal of bits across lanes in the bulk loop to shorten the mixing stage, without creating an operation dependency between lanes that would massively reduce throughput?

As an important observation, vectorizing the AES lanes makes it difficult to disperse bits because there are no good, cheap operations that move bits between lanes. However, the CPU will happily run multiple AES operations and other 128-bit operations concurrently across ALU ports if they are in independent registers. This allows you to trivially create a "mixing" lane that solely exists to aid bit dispersal because you no longer are trying to do sideways operations on a vector. So what does a mixing lane look like that doesn't create dependencies on your hashing lanes? It is so simple it is embarrassing: XOR the input data lanes. The mixing lane is worthless as a hash but it only exists to represent all the bits from all the lanes.

At the mixing stage, you simply fold the mixing lane into each of the hashing lanes by running it through a round of AES -- one concurrent operation. This allows you to avoid the deep mixing stage altogether, which means small keys can be hashed very quickly since almost all of the computation is in the finalizer. Using this technique, I've been able to hash small keys in <30 cycles with concurrent AES lanes.
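
To make that concrete, here is a minimal sketch of the shape I mean (illustrative only -- two lanes instead of many, arbitrary use of the AES round, and not my actual unpublished hash):

    #include <immintrin.h>  // AES-NI, compile with -maes

    // Two independent AES hashing lanes plus one XOR-only "mixing" lane.
    typedef struct {
        __m128i lane0, lane1;  // independent AES lanes, no cross-dependency
        __m128i mix;           // XOR accumulator of every input block
    } mix_state;

    static void absorb(mix_state *s, __m128i in0, __m128i in1)
    {
        // Each hashing lane depends only on its own input stream...
        s->lane0 = _mm_aesdec_si128(s->lane0, in0);
        s->lane1 = _mm_aesdec_si128(s->lane1, in1);
        // ...while the mixing lane cheaply banks every input bit with XOR,
        // adding no dependency between the hashing lanes.
        s->mix = _mm_xor_si128(s->mix, _mm_xor_si128(in0, in1));
    }

    static __m128i fold(mix_state *s)
    {
        // Fold the mixing lane into each hashing lane with one AES round,
        // then combine, avoiding a deep multi-round mixing stage.
        __m128i a = _mm_aesdec_si128(s->lane0, s->mix);
        __m128i b = _mm_aesdec_si128(s->lane1, s->mix);
        return _mm_xor_si128(a, b);
    }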


As one more comment, because I've made the same mistake in the past: it is often a bad idea to use the length in the initialization vector of a general-purpose hash, because it means you can't make an incremental version of the hashing algorithm, e.g. for a stream of data of unknown size. This is a trivial change, but it has side effects in how you pad partial blocks.

If the length is not part of the initialization vector, then the padding vector for partial blocks should vary as a function of the partial data length. Otherwise it is trivial to create collisions by having partial blocks of different sizes that mimic the padding vector. Fixing this can be as easy as using _mm_set1_epi8(partial_length) as your padding vector.
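
For example (a sketch only, assuming 16-byte blocks and partial_len < 16):

    #include <immintrin.h>  // SSE2
    #include <string.h>

    // Pad a partial block so the padding itself encodes the partial length;
    // inputs of different lengths can then no longer produce the same padded
    // block by mimicking the padding bytes.
    static __m128i pad_partial_block(const unsigned char *data, size_t partial_len)
    {
        unsigned char block[16];
        _mm_storeu_si128((__m128i *)block, _mm_set1_epi8((char)partial_len));
        memcpy(block, data, partial_len);  // real bytes in front, length bytes behind
        return _mm_loadu_si128((const __m128i *)block);
    }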


On the other hand, https://en.wikipedia.org/wiki/Length_extension_attack

Although Meow Hash is described as non-cryptographic, so the described attack is likely not applicable


I'm a fan of your work; I've personally mangled your Metrohash64 in some of my own work for incremental updates.

Any chance you're thinking about releasing the program for generating new (good) Metrohash constants + shifts?


I have a completely new algorithm construction I have been using for a while that I am releasing any day now -- I was literally working on the documentation this weekend. It is a substantial improvement on the original algorithms in almost every regard and also easier to customize.


I look forward to the release, and I know the feeling (writing docs, cleaning up comments, etc.). I've been perpetually ~2 weeks from commercial release of my own product since July or so.


>Large key performance is entirely in the bulk loop algorithm (an AES round in this case) and small key performance is entirely in the mixer/finalizer algorithm. If you have a very wide bulk loop with no mixing of the individual lanes, it usually requires a deep mixing stage that makes small keys very expensive, since that is a fixed overhead. The question then becomes: how do you cheaply "bank" dispersal of bits across lanes in the bulk loop to shorten the mixing stage, without creating an operation dependency between lanes that would massively reduce throughput?

This sounds deeply fascinating, but I'm way out of my depth. Can you suggest any reading that would help me appreciate the "architecture" of hash functions?


Any clue why this hash isn’t cryptographically secure (beyond the designers’ stated intent)?


To our surprise, we found a lack of published, well-optimized, large-data hash functions. Most hash work seems to focus on small input sizes (for things like dictionary lookup) or on cryptographic quality. We wanted the fastest possible hash that would be collision-free in practice (like SHA-1 was), and we didn’t need any cryptograhic [sic] security.

This is an important passage that could be explained in more detail. I can look up a list of non-cryptographic hash functions, nine of which support 128-bit output:

https://en.wikipedia.org/wiki/List_of_hash_functions#Non-cry...

Given that the claim is that Meow Hash is fast, where's the benchmarking to prove it's faster than the alternatives?


16 bytes per cycle is what they quote. Imo, that's pretty impressive.

For comparison, xxhash is the closest thing to this that I'm aware of and clocks in at around 1.7 bytes per cycle using the official implementation.


There are variants of CityHash that use the x86 CRC intrinsics (CityHashCrc128 and CityHashCrc256). They quote 4.3-5.5 bytes/cycle.[0]

(Better than xxhash on x86, worse than Meowhash -- assuming the bytes/cycle numbers can be compared apples-to-apples, which is a big if.)

[0]: https://github.com/google/cityhash


I think the big issue with bytes/cycle comparisons right now is that once you get to multiple bytes per cycle, most people are going to be limited by I/O, probably even to RAM. I assume Meow Hash was designed to run on high-end workstations, but if you are developing software for end users, my guess is that Meow Hash will end up being not significantly faster than xxHash.

Of course, if you are running a lot of hashes in parallel, I bet things get more interesting. (Example: would AES-NI instructions scale in parallel as fast as ordinary bitwise/vector instructions?)


RAM throughput isn't that different between a cheap and a high-end PC.


When I say high end workstation I really do mean workstation and not gaming PC. Most people do not have quad channel memory on their home PCs but it is in fact a pretty good step up in throughput.


That is also for x86-64; I'd be interested in learning what the performance would look like on ARMv8-A.


goddamn it, I just spent a fair amount of time getting xxhash to work.. now I need to reevaluate lol


xxHash is still very fast. I am sure that some implementations of xxHash are faster than that; after Googling it, I can see that xxHash64 is quoted at 3.1 bytes per cycle, which is not bad.

xxHash is also very simple, and can be written easily in high or low level languages, which I like. Someone even implemented it in C++ templates, I believe. I don't know how portable Meow Hash is, but I imagine the default implementation heavily exploits AMD64 CPUs and, based on the article, probably makes use of AES acceleration.

Basically, if you do this, at least be sure you're not bottlenecked on I/O already :)


Late to the party, some actual benchmarks using an older CPU with AES-NI:

    CPU:    Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
    bugs:   cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
    Kernel: 4.18.14
    GCC:    7.3.0

Short version, it's ~2x as fast as anything else on my machine for data > 4k, maybe down to 1k, all the way up to > cache (where various other algorithms also max out memory bandwidth). Other non-cryptographic hashes (for comparison) on my local machine: xxHash64 -> ~13G/s, Metrohash128 -> ~13G/s, Metrohash128_crc -> ~18G/s, t1ha_crc -> ~17.5G/s, etc., but Meow doesn't break the 1G/s barrier until 64-96 bytes (the others mentioned reach that speed at 8 bytes).

So yeah, faster for longer keys due to the fast aes mixing + parallel pipeline usage, but slower for shorter keys due to the final mixing dependency at the end.

Using a hacked-up version of Smhasher for a wider range of hashing sizes (cache on this CPU is 20 megs):

    --- Testing Meow1_64 "parallel aes internals"

    [[[ Speed Tests ]]]

    Bulk speed test - 67108864-byte keys
    Average      -  4.556 bytes/cycle - 14337.09 MiB/sec @ 3 ghz
    Bulk speed test - 16777216-byte keys
    Average      -  7.882 bytes/cycle - 24805.52 MiB/sec @ 3 ghz
    Bulk speed test - 4194304-byte keys
    Average      -  9.813 bytes/cycle - 30884.30 MiB/sec @ 3 ghz
    Bulk speed test - 1048576-byte keys
    Average      -  9.657 bytes/cycle - 30390.48 MiB/sec @ 3 ghz
    Bulk speed test - 262144-byte keys
    Average      - 10.642 bytes/cycle - 33490.19 MiB/sec @ 3 ghz
    Bulk speed test - 65536-byte keys
    Average      - 11.871 bytes/cycle - 37360.76 MiB/sec @ 3 ghz
    Bulk speed test - 16384-byte keys
    Average      - 11.336 bytes/cycle - 35675.98 MiB/sec @ 3 ghz
    Bulk speed test - 4096-byte keys
    Average      -  8.241 bytes/cycle - 25936.55 MiB/sec @ 3 ghz
    Bulk speed test - 1024-byte keys
    Average      -  4.056 bytes/cycle - 12765.26 MiB/sec @ 3 ghz


I just ran a simple collision test against it with real life, prod data. It resulted in 13 collisions on 116275 files in 512b mode.

https://github.com/cmuratori/meow_hash/issues/7#issuecomment...


Given the insanely high rate of collisions for a 512 bit hash, the "we probably won't update the main loop" response is quite something.


How does it compare to:

FarmHash

CityHash

SMHasher

MetroHash

FNV1a_YT

Note that if you're just using it for a hash table, where you'll compare the full strings to resolve hash collisions, you don't actually need good randomness. FNV1a_YT is faster than most and has worse collision properties, yet provides a faster overall hash table, because the faster hash (every operation) outweighs the extra full string checks (rare).

https://github.com/rurban/smhasher


Almost certainly faster than any of the ones you list for large data. I would expect the perf to be more comparable to the other AES-NI hashes (t1ha and falkhash), both of which bench around 4 bytes/cycle, while Meow claims 16 bytes/cycle.

Meow does include an implementation for smhasher, so benchmarking should be easy enough:

https://github.com/cmuratori/meow_hash/blob/master/meow_smha...


FNV has terrible hash properties (i.e. it fails many smhasher tests as mentioned in TFA), so it isn't really comparable.

SMHasher isn't a hash function? It's a test suite for hash functions.

CityHash's CRC modes are nominally 4.3-5.5 bytes/cycle.

The others are slower.

This article quotes 16 b/c for Meowhash.


Yup, FNV is terrible in my experience using it in a custom hashmap implementation. The lower bits of the FNV hash have a lot of bias.


That's a cool link - thank you. There's so much there that I couldn't tell whether there are benchmarks for implementations on embedded hosts. I'm constantly looking for something suitable for use on embedded stuff. Having an x86 as my deployment target would be like having a unicorn.

Recently (this year) the MCUs I've asked for have hardware CRC32 and sometimes even AES-128, but the suits tell me to Remember the BOM, Luke, and those fancy peripherals aren't available on the common M0+ parts my hardware guy likes (NXP KE0xxx, for example). Left wondering how this one will work on a run-of-the-mill ARM.


> because the faster hash (every operation) outweighs the extra full string checks (rare)

I'm confused... don't you have to do at least one full string check for every table lookup on a hash table, to ensure the value you're returning is really the right key, and not just some other key that happens to collide?

In other words, if I have 32 buckets in my table and one value stored at table["hello"], and table["goodbye"] happens to hash to the same bucket, how do you know whether to return not-found for "goodbye" without a full string comparison? (Or for that matter, how do you know whether to return the stored value for "hello" without a comparison?)


Every operation requires one hash and at least one string check, but the more collisions there are, the more string checks you have to do. In theory, a hash algorithm with a great distribution and few collisions would require fewer string checks (i.e. generally just one) per operation than an algorithm with a poor distribution.

GP was saying that, in their experience, saving the occasional extra string check generally isn't worth the slower hashing speed of an algorithm with a better distribution.

(It's also worth noting that getting or setting elements which don't exist requires 0 string compares unless there's a collision, which might be relevant for certain workloads.)


Most of the cost of string checks is often just the cache miss to load the first bytes of the string.

Depending on the input, e.g. if lengths have a more or less uniform distribution, full string checks might only be needed 10% of the time or so; store the string length in the bucket and you save the cache miss. Or you could store the first few bytes of the key in the bucket (but then in other cases, for example a hash table containing absolute filesystem paths, the first few bytes are always the same across keys). A sketch of that layout is below.
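
Something like this (field names are illustrative, not from any particular implementation):

    #include <stdint.h>
    #include <string.h>

    // Keep the full hash and the key length in the bucket itself, so most
    // non-matching probes are rejected without touching the key's cache line.
    struct bucket {
        uint64_t    hash;     // full hash of the key
        uint32_t    key_len;  // cheap reject for most mismatches
        const char *key;      // only dereferenced when hash and length match
        void       *value;
    };

    static int bucket_matches(const struct bucket *b, uint64_t hash,
                              const char *key, uint32_t key_len)
    {
        if (b->hash != hash || b->key_len != key_len)
            return 0;                              // no cache miss on b->key
        return memcmp(b->key, key, key_len) == 0;  // rare full comparison
    }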


Yep, unrolled AES instructions are the way to go if you need to hash gigs of stuff and you're on a CPU that supports them.

-Austin, SMHasher author.


Never heard of these guys, but they have several hundred hours of video footage of building a game from scratch: https://www.youtube.com/user/handmadeheroarchive/videos. Looks interesting.


I wonder if speed will still be a reason for doing something like this going forward now that Intel is starting to include SHA acceleration instructions in its processors.


It will, for a couple of reasons. The first is that Intel doesn't include the SHA intrinsics in their high-end consumer processors, only the low-end Atom stuff so far.

The second is that even with the intrinsics, this is (supposedly) faster. The SHA instructions only get you a few gigabytes a second (single digits). This purportedly gets tens of GB/s.


As far as I’m aware, that isn’t true. Skylake added support for 7 different SHA extensions:

https://www.phoronix.com/scan.php?page=news_item&px=Linux-4....


Did it actually? Phoronix is generally a pretty questionable source.

Neither https://en.wikipedia.org/wiki/Skylake_(microarchitecture) nor https://en.wikipedia.org/wiki/Kaby_Lake nor even the Coffee Lake page lists SHA intrinsics. They start showing up on the https://en.wikipedia.org/wiki/Cascade_Lake_(microarchitectur... and Cannon Lake pages.


Phoronix is, once again, wrong. Skylake does not support the SHA instructions. The only currently shipping x86 CPUs which support them are Intel Goldmont (Atom) and AMD Zen (Ryzen, Threadripper).

Mainstream support for SHA in Intel CPUs will come with Cannon Lake, or whatever the next microarchitecture revision they release ends up being.

* Yes, Cannon Lake is technically shipping as the i3-8121U, but that isn't really relevant to anyone here.


Ok, it seems you're correct about Cannon Lake:

https://software.intel.com/sites/default/files/managed/c5/15...

Cannon Lake and later according to Intel.

However, I question "mainstream" since many Celeron and Pentium branded parts featured SHA-NI, and those are arguably mainstream, low-cost parts.

Also, in fairness to Phoronix, it looks like SHA-NI support may have been planned for Skylake but didn't make it, based on various posts I see online from as early as 2014 and Intel's first postings about SHA-NI on Goldmont, etc., in 2013.


> many Celeron and Pentium branded parts featured SHA-NI

Do you have a source for this claim?

> Also, in fairness to Phoronix, it looks like SHA-NI support may have been planned for Skylake

Yeah, but on the other hand that article was published 4 months after the general availability of Skylake. It was possible for Larabel to verify the facts before publishing and he just didn't. Nor did he correct the mistake in any of the past three years.


Sure, quite a few processors were based on Goldmont:

https://en.m.wikipedia.org/wiki/Goldmont



I just happen to need a fast, non-cryptographic checksum for a large amount of data. I assumed using CRC32C would be the fastest thing available on x86_64, since SSE 4.2 includes an intrinsic for it.

Can I expect Meow Hash to be faster than that on x86_64? What about on some other platforms, such as common ARM chips? Which one is more collision resistant?
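
(For reference, the SSE 4.2 CRC32C intrinsic I mean is used roughly like this -- a minimal sketch, with tail handling simplified:)

    #include <nmmintrin.h>  // SSE4.2, compile with -msse4.2
    #include <stdint.h>
    #include <string.h>

    // Hardware-accelerated CRC32C over a buffer, 8 bytes per instruction.
    static uint32_t crc32c_sse42(const void *data, size_t len)
    {
        const unsigned char *p = (const unsigned char *)data;
        uint64_t crc = 0xFFFFFFFFu;
        while (len >= 8) {
            uint64_t word;
            memcpy(&word, p, 8);
            crc = _mm_crc32_u64(crc, word);
            p += 8;
            len -= 8;
        }
        while (len--)
            crc = _mm_crc32_u8((uint32_t)crc, *p++);  // leftover bytes
        return (uint32_t)crc ^ 0xFFFFFFFFu;
    }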

Also: the writer seems to be Casey Muratori, who is also the author of the Handmade Hero series! Kudos.


OK, after looking into it for just a few minutes, I found out: 1) CRC32C is not nearly as collision-resistant as many common fast non-cryptographic hashes; 2) there are many portable hash functions that are faster (and I think this is due to their bit size). See: https://github.com/Cyan4973/xxHash and https://github.com/leo-yuriev/t1ha

So, my initial assumption was wrong. I'll be trying Meow Hash and some others out!


Meowhash claims to be faster (16B/cycle) than CityHashCrc (4.3-5.5 B/cycle), which is a variant of CityHash (ab)using the x86 CRC32C intrinsics (but with much better hash properties than plain CRC32).

Both are faster than xxHash on machines with AESNI and CRC intrinsics (~1.7B/cycle). The advantage of xxHash is that it is very fast on machines without these intrinsics.

Other hashes tend to be slower although I'm sure there are some other intrinsics-based hashes I don't have at the tip of my tongue.


Will this hash be useful for checking whether the content of files of all sizes has changed? Thinking about storing the hash of a file in a database to evaluate when changes have taken place.


https://github.com/cmuratori/meow_hash/blob/master/meow_hash...

What do AESRotate(foo,bar) and AESMerge(foo,bar) do?

[edit]

So it's a simple cyclic rotation preceded by A.Q0 = _mm512_aesdec_epi128(A.Q0, B.Q0)

From Googling, _mm512_aesdec_epi128 looks like an Intel intrinsic (it performs one AES decryption round on each 128-bit lane of a 512-bit register).
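
If it helps to picture it, a rough sketch of that pattern (purely illustrative -- the struct, names, and rotation are guesses at the shape, not Meow's actual code):

    #include <immintrin.h>  // VAES + AVX-512F, compile with -mvaes -mavx512f

    // _mm512_aesdec_epi128 runs one AES decryption round independently on
    // each of the four 128-bit lanes of a 512-bit register.
    typedef struct { __m512i Q0, Q1, Q2, Q3; } wide_state;

    // Mix B into A's first register, then cycle the registers so a different
    // one is mixed on the next call.
    static void aes_merge_rotate(wide_state *A, const wide_state *B)
    {
        __m512i t = _mm512_aesdec_epi128(A->Q0, B->Q0);
        A->Q0 = A->Q1;
        A->Q1 = A->Q2;
        A->Q2 = A->Q3;
        A->Q3 = t;
    }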


> Don’t use the Meow hash for tiny inputs. The minimum block size for the Meow hash is 256 bytes, so if you’re feeding it 8 byte data, 16 byte data, etc., it’s going to waste all of its time padding the input

This seems like an important caveat that means it will only be used in specialized applications.


It's meant for file hashing and it's rare in the real world that files are less than 256 bytes.


I think releasing it under yet-another-open-source-license instead of one of the well-known ones severely reduces the chance of this code being used by companies.


It's the zlib license, which is used by GLFW, SDL, and SFML as well. Companies seem to have no problem using things like SDL, so I think this is fine.

Still, it would be nice to state that it is the zlib license, and to include the license text in the repository.


And it looks more or less fundamentally identical to the 3-clause BSD license.


The zlib license predates 3-clause BSD (zlib was released under the zlib license in 1995, BSD started releasing code under 3-clause BSD in 1999).

I will admit that when I first saw the zlib license, I was similarly confused but it's actually a license that is about the same age as most other common licenses (and quite a large body of software is licensed under it).


The 3-clause BSDL is just the 4-clause with clause #3 removed. Per Wikipedia, the 4-clause dates to 1990: https://en.wikipedia.org/wiki/BSD_licenses#4-clause_license_... . (I.e., all of the text in the 3-clause BSD license was written in 1990.)


No software was licensed under the 3-clause BSD license (because it didn't exist) until 1999. Just because the text is similar to an earlier license doesn't change that software licensed under zlib precedes software licensed under 3-clause BSD.

But that's not really the point -- GGGP implied that the license was written from scratch ("yet-another-open-source-license") and that's obviously not true.


I wonder if a cryptographer could take a look at Meow Hash. I'm curious how it would fail first when subjected to modern cryptanalytic methods.


How fast is it on other platforms, e.g. ARM?


Currently, only an x64 implementation is provided. "While it can be implemented on ARM processors which have AES instructions, we have not yet created a reference implementation for that platform."


[flagged]


[flagged]


You're probably being downvoted since your comments appear to have nothing to do with the article or any ongoing discussion in the comments section. The use case the authors had was deduplication/change detection of large art asset files. Just how are reversible hash functions for small fixed size inputs relevant to that?


[flagged]


Because you're ignoring their first stated goal (which is immediately prior to the snippet you quote):

"we found a lack of published, well-optimized, large-data hash function"

Large-data != inputs the same size as the outputs. Your comment is irrelevant to their goals.

(And later in the doc: "Don’t use the Meow hash for tiny inputs. The minimum block size for the Meow hash is 256 bytes")


Also, to the GP whose posts are getting downvoted to the point of being flagged, in case it's helpful in the future:

(1) The idea of using a bijection as a building block of hashing is not new, and it's not new to the creators of MeowHash. You point out that there exists a function from 512 bits to 512 bits that's guaranteed to be unique.

Another such function that also has good avalanche properties and is hard to determine the original input from if you don't know the key: The AES mixing function.

MeowHash is based, in fact, on transforming individual input blocks with a single round of AES, and then merging streams of AES-mixed blocks. The AES part is straightforward -- but provides much better practical hash properties than your proposal (for example, I can use the low-order 128 or 64 bits of the output of MeowHash with the same general collision behavior). The merging step is more important from the perspective of collision-freedom, and you omit that fairly critical part.

(2) The contribution of meowhash is its _speed_ and good hashing behavior. It's a very elegant use of highly-optimized on-CPU AES mixing (a feature used by previous hash functions), with a nice block size and stream mixing design that lets it keep a lot of blocks in flight for high throughput.

(Meta) You're engaged in a pattern that HN readers generally respond poorly to: Get downvoted -> Complain about downvotes in a way that implies the downvoters are idiots -> get downvoted -> double down.

It's pretty easy to rephrase your comments in a way that doesn't have that tone of "you're all idiots for not seeing the connection here".


[flagged]


> Downvoting without replying is an act of idiocy.

It's perfectly fine on HN. Going on about downvoting is not.


don't you sometimes think you'd want a platform where the users are in control of the platform rules? of course we'd all be held to the standards to which we hold others...


Not even a mention of MD5?


MD5 is a (bad) cryptographic hash and therefore has entirely different goals and properties from this hash function.


MeowHash is also a bad cryptographic hash. Can we compare properties instead of intentions? MD5 is faster than SHA. Is MeowHash faster than MD5 or not?


> MeowHash is also a bad cryptographic hash.

MeowHash is not cryptographic at all.

> Is MeowHash faster than MD5 or not?

Meowhash is advertised as 16 bytes per cycle. MD5 is around 5 cycles per byte[0] or 0.2 bytes per cycle.

Pre-existing fast non-cryptographic hashes apparently top out around 5.5 bytes per cycle (CityHashCRC256).

[0] https://keccak.team/sw_performance.html


Thank you. The comparison of MeowHash to SHA without mentioning MD5 or any actual performance numbers didn't make sense.


I think what GP meant was that MD5 is meant to be cryptographic, and parenthetically, it's bad at it. MeowHash is not even meant to be cryptographic, so it is much faster.


Orders of magnitude faster.


And it's slow af, meeting neither cryptographic nor performance needs.



