New fastest portable hash: wyhash (github.com)
120 points by rurban 13 days ago | 49 comments





Here's a link to wyhash itself: https://github.com/wangyi-fudan/wyhash

Maybe the link could be updated to point to this instead?

Yeah the submitter didn't link to wyhash, they linked to their own project...without the title there'd be no context here. SMHasher is well known (at least the original Google project is, maybe not this fork), but why not just submit wyhash directly? There's no explanation of how wyhash works here, and I can't find a paper.

No, because you need to see it in context. Wang Yi only compared to t1ha, and only on one criterion.

This is the code: https://github.com/rurban/smhasher/blob/master/wyhash.h


Awesome! Found a bug (undefined behavior) and made a PR.

https://github.com/wangyi-fudan/wyhash/pull/3


I would love for this to be solid, but something here seems too good to be true. Modern hashing is well-studied, with lots of smart people attacking the problem. Modern hashes like Google's FarmHash are thousands of lines of code with platform-specific intrinsics for acceleration. This hash claims to beat other modern hashes in speed, without using platform-specific intrinsics, and without quality problems, and in a mere 100 lines of code to boot.

I would love for this to be true, but I'd love to see a more thorough explanation of how it manages to be both smaller and faster than other competing hashes.


> I would love for this to be solid, but something here seems too good to be true.

How so? It's much slower than AES-based hashes. Its throughput is about the same as xxHash64's, but it uses fewer cycles/hash than xxHash64.


> It's much slower than AES based hashes.

I must have misinterpreted the following:

> So the fastest hash functions on x86_64 without quality problems are:
>
> - wyhash
> - t1ha
> - [...]

I interpreted that to be an ordered list, saying that wyhash was the fastest.

But looking at the actual table I am confused. There are MB/s numbers and cycles/hash numbers. In cycles/hash wyhash appears to beat all FarmHash variants, but in MiB/s it is slower than some of them. I don't understand why these two performance measures would not track perfectly, since the CPUs should be running a constant number of cycles/second.


> I don't understand why these two performance measures would not track perfectly, since the CPUs should be running a constant number of cycles/second.

x86 isn't RISC and not all operations take the same number of cycles. Also, the code might have different levels of possible parallelism and might impact the pipeline differently.

Comp Sci education for the low level basics is often completely neglected nowadays. This should be freshman year stuff. I'm certainly not an expert, myself, just familiar with the issues. (Know enough to know what you don't know.)


> x86 isn't RISC and not all operations take the same number of cycles.

Yes but it doesn't say instructions per hash, it says cycles per hash. Unless some kind of frequency scaling is going on, the number of cycles per second should be very consistent. If bytes/hash is held constant and cycles/second is constant, then MiB/second and cycles/hash should be exact inverses. I don't understand why this is not the case in these tables.
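
To make the relationship concrete (my own back-of-the-envelope, under the assumption that both columns were measured at the same key size and a fixed clock):

    MiB/s ≈ bytes_per_hash × clock_Hz / cycles_per_hash / 2^20

    e.g. 32-byte keys at 3 GHz:
      30 cycles/hash → 32 × 3e9 / 30 / 2^20 ≈ 3050 MiB/s
      60 cycles/hash → half that, ≈ 1525 MiB/s

With those assumptions the two columns should be exact inverses of each other, which is why the tables confuse me.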

> Also, the code might have different levels of possible parallelism and might impact the pipeline differently.

Again this can affect the number of instructions being retired, but not the number of cycles. A cycle is a cycle, regardless of how much work is actually being accomplished.

> Comp Sci education for the low level basics is often completely neglected nowadays. This should be freshman year stuff.

I'm a low-level junkie who lives in godbolt.org and Agner Fog's tables, writes JIT compilers, and does FPGA design for fun on the side. It's possible that I'm mistaken here, but I do have a fair amount of background in this.


I wonder if the first number is sustained, pipelined rate for big values, and the second number is cycles for the first iteration, still assuming everything is in L1 cache.

Those would be the useful numbers.


The small key performance is very nice. It would be good to have more written about this new hash function, and less misinformed rant about siphash.

For someone who isn't following the hashing function 'scene', what's misinformed about this rant?

Most of his arguments simply do not apply to the relevant use case and attacks. The only one that does, the claim that the seed is recoverable, is not an argument at all but a vague, unsubstantiated claim.

His arguments on siphash are pretty practical; not sure what you found misinformed there.

From the readme: "SipHash is not secure enough for security purposes"

There's been nothing published to that effect, even though he has claimed for years that he can recover bits from the seed.




I didn't know much about this argument so I looked it up and funny enough it's you arguing here too:

https://github.com/google/highwayhash/issues/28

Still trying to wrap my head around exactly what is being argued... but it's starting to feel like there's something personal going on here, not gonna lie. It's a bit quaint.

(Also a bit curious why Rust is looped into this. It seems like Rust was insecure because the same key was being used for all tables, not really because the hash function was insecure?)


I think the claim isn't "SipHash is not a secure keyed hash function (PRF)" but rather "secure keyed hash functions are insufficient mitigation for hash DoS".

For long-lived hash tables even per-table keys don't prevent finding bucket collisions via side channels. So you have to either rekey when a large number of bucket collisions is detected, or fall back to a balanced-tree-based data structure for those buckets. If you prefer the latter mitigation, there is little reason to pay the performance hit of SipHash.

So the OP has a point, but is terrible at communicating it.
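
A minimal sketch of the collision-counting / treeification idea (names and the threshold here are made up, not taken from any particular implementation, though Java 8's HashMap does something similar):

    #include <stddef.h>
    #include <stdint.h>

    #define TREEIFY_THRESHOLD 8    /* arbitrary example threshold */

    struct node { uint64_t hash; void *key, *val; struct node *next; };

    struct bucket {
        size_t chain_len;      /* entries currently chained in this bucket */
        int    is_tree;        /* 0 = linked list, 1 = balanced tree       */
        union {
            struct node *head; /* list representation                      */
            void        *root; /* tree representation (details omitted)    */
        } u;
    };

    /* Called after inserting into a list bucket: if too many keys hash
     * into the same bucket (bad luck or an attacker), switch that bucket
     * to a balanced tree so lookups degrade to O(log n) instead of O(n). */
    static void maybe_treeify(struct bucket *b)
    {
        if (!b->is_tree && b->chain_len > TREEIFY_THRESHOLD) {
            /* rebuild b->u.head as a balanced tree into b->u.root (omitted) */
            b->is_tree = 1;
        }
    }

With that fallback in place, the hash only needs to keep treeification rare in the benign case, which is the argument for using a fast non-cryptographic hash plus this kind of structural defense.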


Well, I am not a cryptographer, so I can't really make much attestations myself, but here's what I read:

- OP claims that the security claims of SipHash are incorrect or misleading.

- SipHash paper author defends claims made in SipHash paper.

- OP responds, claims that the security properties are misleading because most people would just assume it meant it was a "secure hash" and not a secure PRF. To prove this, he alludes to a Rust bug that as far as I can tell would work exactly the same with any hash function.

So I'm curious. Can you recover the seed in SipHash with side channel attacks? The response can be summed up as "PoC||GTFO," and I tend to agree; it seems to contradict at least some of the security claims they are making, after all.

All in all, if they have a point, it's missed in the flurry of unrelated points that were brought up there, and I definitely am not sure what to take away from it.


He means the worst case scenario when the attacker knows the key. In which case it's trivial indeed.

As I explained there, getting the hash seed of a running process in a dynamic language is trivial. Every language has enough ropes to hang yourself and print the value from some known offset. And if not, there's still enough information from a typical bad hash table (95% of hash tables are bad) to get at the seed via ordering and timing. From there you just brute-force an attack, even with an extremely slow hash like siphash. A seed is no secure protection against a determined attacker, only against newbies. A newbie would just DoS the system with simpler means.

The point is, don't believe the theatre, use proper protection against the attack (no linear list on collisions), and keep the hash table fast. Collision counting or fallback to tree really is trivial.


> And if not, there's still enough information from a typical bad hash table (95% of hash tables are bad) to get at the seed via ordering and timing.

Really? How’s that? The SipHash seed isn’t supposed to be recoverable from hashes – that would be a pretty significant break – and I don’t see what other route the hash table would have to the seed.

> Every language has enough ropes to hang yourself and print the value from some known offset.

You mean explicitly reading the seed out for an attacker? If we’re modifying the program to help, I figure while (true) {} is probably a simpler DoS.


No, that onus is on you. If you've got specific critique of siphash I'm sure people would like to read them. As it is, you're making somewhat vague denunciations without providing any substance to back them up.

Have you published your SipHash seed finder yet?

A seed finder is independent of the hash function; it depends on the framework you are using, e.g. perl, ruby, python or php. I won't publish the perl solution, as nobody is protected and I had my own services on Red Hat OpenShift, which is unsafe forever. I only published the proof of the various attacks on most hash functions in my testsuite, to verify the proper protection against it.

You're the one posting your own page to HN, so I think maybe you have the burden of proof here?

From what I see Farmhash is still faster on IA32/AMD64 architectures (although they need to leverage the AES instruction for that, granted).

Algorithms using AES intrinsics are generally fastest at all key sizes now. For small keys, the AquaHash[0] algorithm is extremely fast, and several algorithms (e.g. MeowHash[1] and the aforementioned AquaHash) are essentially at the hardware limits for large keys. Algorithms like wyhash are useful for things like embedded CPUs without AES support, but if you are targeting x86/ARM then AES is often a better choice.

[0] https://github.com/jandrewrogers/AquaHash [1] https://github.com/cmuratori/meow_hash
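
For a sense of why the AES-based designs are so fast: the inner loop is basically one hardware AES round per 16-byte block. A stripped-down illustration (this is not the actual AquaHash or MeowHash code, just the general shape):

    #include <string.h>
    #include <wmmintrin.h>   /* AES-NI intrinsics; compile with -maes */

    /* One AES round per 16-byte block as the mixing step. Real designs
     * use several independent lanes plus a proper finalization; this
     * only shows why the per-byte cost is so low. */
    static __m128i toy_aes_mix(const void *data, size_t len, __m128i seed)
    {
        const unsigned char *p = (const unsigned char *)data;
        __m128i state = seed;
        while (len >= 16) {
            __m128i block;
            memcpy(&block, p, 16);                  /* unaligned-safe load */
            state = _mm_aesenc_si128(state, block); /* one AES round = mix */
            p += 16; len -= 16;
        }
        /* tail handling and finalization omitted */
        return _mm_aesenc_si128(state, state);
    }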


Some non-portable hashes are still faster, sure. But for a general hash it's remarkably simple, good and fast.

I’d really like to see progress on hashes that fit into a few lines of code. It seems FNV-1a is still the best option in that regard.

If this is all there is to it: https://github.com/wangyi-fudan/wyhash/blob/master/wyhash.h

That doesn’t seem like much.


Agreed, if that's the whole hash function I think small code size is a selling point of this hash compared with other modern fast hashes.

This hash function looks like it will do unaligned reads if you're hashing an unaligned string. This doesn't matter on x86, but portable code should avoid this. It would be helpful to have a wrapper function that could handle unaligned strings by special-casing the begin and end of buffer.
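
A common way to avoid that without special-casing the ends of the buffer is to do every load through memcpy, which GCC and Clang compile down to a single mov on x86 anyway. A sketch (not taken from wyhash itself):

    #include <stdint.h>
    #include <string.h>

    /* Portable unaligned 64-bit read: memcpy sidesteps the undefined
     * behavior of dereferencing a misaligned uint64_t*, and optimizing
     * compilers turn it into one load on x86. (Endianness handling is
     * omitted; a big-endian target would also want a byte swap.) */
    static inline uint64_t read_u64(const void *p)
    {
        uint64_t v;
        memcpy(&v, p, sizeof v);
        return v;
    }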


It’s bigger than I’d want to include in JavaScript for a web page.

> SipHash is not secure enough for security purposes and not fast enough for general usage.

I beg your pardon? The author seems very confused.


Oof.

To expand on this, SipHash is designed to mitigate a certain kind of DOS attack. It’s not a cryptographic hash. SipHash will still be a top choice for general purpose hashing with inputs that can be chosen by an adversary.


https://github.com/switch33/sha2592 is something I just coded, wondering how it measures up.

And I made another one too: https://github.com/switch33/sha29893

And a third one: https://github.com/switch33/sha5987

And an even better one: https://github.com/switch33/sha2999999

And maybe the best for quite some time: https://github.com/switch33/sha130000000-


> The fast hash functions tested here are recommendable as fast for file digests and maybe bigger databases, but not for 32bit hash tables.

Do people really try to optimize file hashing this way?


To me, the most interesting part:

>> Even if [...] worse hash functions will lead to more collisions, the overall speed advantage beats the slightly worse quality.


I'm not sure that part is correct. If I use something like SHA1 or MD5 (yes, I know they're no longer considered secure), I know with essentially 100% certainty that I will not get a collision. This means I don't have to implement chaining, quadratic probing, or another method of collision resolution. Though it might cost some speed in the end use of the app, it's often easier for initial development because I simply have to write less code.

Maybe a more experienced dev can comment on this? Still learning here, and haven't spent time in a lot of large codebases.


The strong collision avoidance guarantee is mostly a result of the large space of hashes those functions generate. So, yes, if you have a hashmap with 2^128 slots, you're technically right. But you don't have a hashmap with 2^128 slots. So you have to smash that output space into a smaller space somehow. And then you're back to considering what to do when you have collisions.
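
Concretely, the reduction step is just a mask or modulo, and that is where the collision probability comes from regardless of how strong the hash is (sizes and names here are arbitrary):

    #include <stdint.h>

    #define TABLE_SIZE 4096u   /* power of two, so we can mask */

    /* With 4096 slots, even a perfect 128-bit hash collapses to 12 bits
     * of useful output, so two distinct keys share a slot with
     * probability about 1/4096 no matter how good the hash is. */
    static inline uint32_t bucket_index(uint64_t hash)
    {
        return (uint32_t)(hash & (TABLE_SIZE - 1));
    }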

Not an expert, but I would have thought that collision avoidance would be the most important criterion.

Not in hash tables. There, the smaller, the faster. The selection criterion is: not bad (not failing any test), small and fast. In particular, it needs to be inlinable.

There will always be collisions on dynamic workloads; otherwise you would choose perfect hashes. In the usual programming language case SPOOKY32 has the fewest collisions, and is pretty fast too. But it has no chance against the small hash functions.

The smhasher speed test doesn't tell you which hash will be fastest in a hash table with small key lengths, only when used as a digest, e.g. for bigger files, DB or network blobs. The icache footprint in comparison to all the hash table code is very important.


> Not in hash tables. There, the smaller, the faster. The selection criterion is: not bad (not failing any test), small and fast. In particular, it needs to be inlinable.

Can you explain why? Like what use cases are there where you don't actually care how likely something is to collide, you just want it to be fast?

I've always thought being able to predict the chance of collision to be the most important factor on a hasher. When is it not?


First off, for small tables, you always take the hash value modulo some small number like 10000. Who cares how good the hash function is, if you end up with 1/10000 chance of collision in the end anyway? Speed is obviously more important in this setting.

Second, assume very large hash tables... assume that with no collision using the data structure takes time T1(H) and with a collision it takes T2(H) for hash function H, and that the probability of collision is P(H).

So your total cost is then

P(H) T2(H) + (1 - P(H)) T1(H)

Which is approximately

P(H) T2(H) + T1(H)

It's easy to play with numbers so that a worse but cheaper hash has a lower total cost. In fact I would expect this to usually be the case... T2 is multiplied by P, so it effectively disappears for all but the very worst hashes or an extremely expensive T2.
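
To put made-up numbers on it: say hash A costs 5 ns per lookup with a 2% collision rate, hash B costs 20 ns with a 0.1% collision rate, and a collision adds 50 ns of extra probing:

    cost(A) ≈ 5 + 0.02 × 50 = 6 ns
    cost(B) ≈ 20 + 0.001 × 50 ≈ 20 ns

So the "worse" hash wins comfortably unless collisions are made catastrophically expensive (numbers invented purely for illustration).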


You have to deal with the worst case regardless of how likely collisions are; you can't leave it to chance. At least if you want a decent hash table implementation.

For security purposes, I thought slower performance = better.

This would delay rainbow tables.


These kinds of hashes are used for hash tables, where you want a good compromise between speed and low collision probability. They can also be used as checksums to detect accidental data modification.

They are not meant for cryptographically strong fingerprinting.


Faster is better for non-cryptographic hashes (like this one) and for cryptographic hashes (like SHA-2, MD5, etc.)

For password hashing, you generally want a work function to increase the resource usage, or some alternative way of making cracking harder. That’s the case you’re thinking of, and it’s a very specific application, not what this new hash function is designed for.
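
If it helps to see the "work function" idea in miniature (a toy sketch only; real password hashing should use a vetted, memory-hard construction like bcrypt, scrypt or Argon2, and mix64 here is just a splitmix64-style finalizer standing in for "some hash"):

    #include <stdint.h>

    static uint64_t mix64(uint64_t x)
    {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }

    /* Iterating the mix N times makes every password guess cost the
     * attacker N times as much work; tune 'rounds' (e.g. 1u << 20) so a
     * legitimate login is still fast enough. */
    uint64_t stretched(uint64_t pw_hash, uint64_t salt, unsigned rounds)
    {
        uint64_t h = pw_hash ^ salt;
        for (unsigned i = 0; i < rounds; i++)
            h = mix64(h ^ i);
        return h;
    }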


Sure, but for performance purposes, faster performance is better.


