Hacker News new | past | comments | ask | show | jobs | submit login
SeaHash: Explained (ticki.github.io)
143 points by bretthoerner on Dec 9, 2016 | hide | past | favorite | 55 comments



Discussion that came up the other week for those that missed it.

https://news.ycombinator.com/item?id=13057797


> `f : {0,1}^n → {0,1}^n` is a perfect PRF if and only if given a distribution `d : {0,1}^n → [0,1]`, `f` maps inputs following the distribution `d` to the uniform distribution.

Is this even possible? If the distribution `d` always returns 0, how can a function make it uniform?

It would be nice, if the article would go into more details on how SeaHash obtains this property, and how it related to collision avoidance.


It's not clear what the author means here. This is not a very mathematical article.

(Nit: d is a distribution so it can not always return 0 -- it has to sum to 1. But your point stands.)


I believe the point is that you get a uniform distirbution of 1s and 0s after going through the PRF irrespectively of the distribution of 1s and 0s in the input. This is certainly a desirable property for a hash function.


A (translated) Dirac delta function might be a better example [1].

[1]: https://en.wikipedia.org/wiki/Dirac_delta_function


ticki and the rest of the crew working on Redox (jackpot51) are really doing some neat stuff. a lot of it is over my head but it is cool to see something being built from the ground up using some new techniques and ideas.


Could someone explain what the point is? Is there a use-case for this for "general hashing" and such, where sha256 is _genuinely_ insufficient?


A good example is for a bloom filter: using sha256 is much too slow for good filter performance, you want something like siphash or some other non-cryptographic hash.

Here's a good story of the performance benefits of switching from cryptographic to non-crypto hashes: https://github.com/bitly/dablooms/pull/19

(But I don't recommend you use murmur anymore: https://emboss.github.io/blog/2012/12/14/breaking-murmur-has... (although tbh I could be wrong on this one, not an expert))

(Shameless plug for my bloom filter tutorial https://llimllib.github.io/bloomfilter-tutorial/ )


> But I don't recommend you use murmur anymore

I think xxHash was/is the fastest good non-crypto hash, and now SeaHash may be best. Although I'd like to see a bit more data on that (small keys? large keys? benchmarking methodology) than SeaHash's author is providing.


yeah I should look into that more. aapleby seems to have given some pretty good arguments against SeaHash in the previous discussion: https://news.ycombinator.com/item?id=13058652


That argument depends on the initial values being the same, you can easily make sure they are not.


This isn't a cryptographic hash. SHA-256 and other cryptographic hashes are much slower, but give certain guarantees about difficulty of doing things like finding hash collisions. The advantage of SeaHash is that it's even faster than other commonly used non-cryptographic hashes, and has attractive statistical properties that will result in fewer collisions than other hash functions.


A lot of stuff.

SHA256 is very slow, and that's no surprise. It's cryptographic after all.

Here's a small list of usecases for non-cryptographic hash functions:

- Checksums and error correction codes, as long as there is no way to maliciously use this.

- Hash tables. These always use non-cryptographic hash functions.

- Bloom filters.

- Heuristic fingerprinting. They're not strong enough to be used for normal data fingerprints, but they can be used as a way to decide if two buffers are "probably equal" or "certainly not equal".

Hash tables are the main one. Cryptographic hash functions are almost never used in them. SipHash is a popular choice, but it is not cryptographic. That is a misunderstanding: It's a MAC function.


I may be totally off-base here, but can't the SHA256 CPU instructions be used for this? Then it'd just be 1 (? Or is it more) instruction to perform?


Hi,

I recently used a hash function like this as part of a React-style delta calculation engine but for Swift / iOS.

I needed a hash function that was fast but collision-resistant, and I did not need a cryptographic hash, as all data is trusted (and a hash collision would not actually matter that much).

I chose SpookyHash V2 based on the advice of some peers and http://aras-p.info/blog/2016/08/09/More-Hash-Function-Tests/

Hope that helps, Chad


Even when I've not needed a cryptographic hash, I've still used one, because why not? I've never not needed one so bad as to resort to some barely studied, homemade hashing algorithm.

> A hash collision wouldn't matter that much.

Interesting. What was the use for the hash function then?


In hash tables, you never use cryptographic hash functions. Why? Because they're slower.

Take SHA3, which is around 50x slower than SeaHash. That is really really bad for hash tables.

When hash collisions happen in hash tables, they're resolved through collision-resolution strategy, such a linear proping.

Fingerprints are one very narrow usecase for hash functions, and there are tousands of other uses.


> Even when I've not needed a cryptographic hash, I've still used one, because why not?

To avoid wasting cpu cycles preserving a property you don't need. Seahash should be 50x faster than sha3


A trade off between collision potential vs. speed. Sometimes you need speed more than you need cryptographic levels of collision avoidance.

For example, finding unique files on the file system. After looking at size, first and last bytes, it would be better to filter quickly on an imperfect hash (with, say, a 1 in 1^56 chance of collision) than slowly on a perfect hash (with a 1 in 1^256 chance).


I hope you mean 2^56 and 2^256 :)


We can increase that by one to the fourth power!

http://aperiodical.com/2013/05/the-maths-of-star-trek-the-or...


> Even when I've not needed a cryptographic hash, I've still used one, because why not?

Performance. Take a look at djb's (non-cryptographic) hash, with a constant multiplier chosen to be implemented with a shift and an add — that's the level of performance a non-cryptographic hash (e.g. for hash tables & similar purposes) needs.

https://gist.github.com/hmic/1676398


Please don't. DJB2 is a poor hash function. It's similar to FNV: Entropy only moves upwards, so flipping higher bits doesn't affect lower bits.

In other words, you risk mapping `n` and `-n` to the same value under some modulus.


Caching often relies on hash functions. If you run the Cloudflare cache, you'll start caring, considering hashing is usually >50% of the CPU workload and an optimized non-cryptographic hash function can be 20x faster.


> Is there a use-case for this for "general hashing" and such, where sha256 is _genuinely_ insufficient?

Yeah. There are often times when you want to hash data but don't want to waste sha256-levels of cycles doing so. Applications are pretty much everything except cryptographic signing. Load-balancing, higher-quality checksuming, etc.


> SeaHash has mathematically provable statistical guarantees

I'm all for proofs but would love to see an empirical head-to-head against metroHash. Is it as good or better output distribution?


MetroHash's main transformation actually loses entropy, so that's, well, pretty bad. It still passes Smhasher, though.


Should be easy to plug it into smhasher.


According to the author, it passes.


> there is a major difference between cryptographic and non-cryptographic hash functions. SeaHash is not cryptographic

Is this in reference to most used hash functions not leaving any way to trace the contents of what was hashed, or not being able to reliably reconstruct contents to generate a certain hash? Or something else entirely?


A cryptographic hash makes it hard for an active attacker to cause the hash function to do something unwanted (generate collisions, generate a pre-determined output for a chosen input, etc.). "Hard" here is defined as it usually is for crypto: it should not be any easier to break the algorithm by knowing the algorithm's details, than to find one by brute-forcing 2^128 (or however many) possible inputs and treating the algorithm as a black box.

A non-cryptographic hash is generally not worried about active attackers; it wants to give a pretty good, even distribution of outputs for an arbitrary set of inputs, but it's not worried about inputs that are specifically designed to abuse the hash function. If you want to, say, create a random-looking but deterministic and stateless color for each user in an chat room, or something, a non-cryptographic hash function is fine. It's possible that an attacker can create a bunch of users with names constructed to all have the same color, but that's not going to break your website.

However, if you're putting each user in a hash table, a large number of collisions in a hash table can easily get you O(n^2) performance, despite the hash table being O(n) for a randomly-selected set of n inputs. That's the usual reason for using cryptographic hash functions even in places you wouldn't usually expect to see crypto. For instance, SipHash is a fast cryptographic hash function with output too small for use in actual cryptosystems, but it's perfect for hash tables.

(Strictly speaking, a cryptographic hash function doesn't promise anything about the privacy of its input; H(x) = x[0] + SHA-256(x) satisfies the theoretical constraints on cryptographic hash functions for preimage resistance and collision resistance.)


The latter. Even a very simple hash algorithm will almost always make it impossible to get the original contents back. You'd have to write an intentionally pathological algorithm to achieve that.


Well, yeah, unless the input is smaller than the digest, hashing is effectively a very lossy compression!


I've been using blake2(b) for file hashing in some content addressable db stuff. What might be a scenario where i would choose Seahash over Blake2?

My main concern was speed and assurance that i would not see collisions. Beyond that, i am clearly naive on the subject.


The main question, aside from performance, is...

- Is there an (abstract) attack model? (Assuming that you have one if you ought to)

- Then: Can an attacker insert collisions into the DB, and is that problematic?

- Then: A non-cryptographic hash might be much easier to "reverse", especially for short inputs. Is that problematic?

If none of these are problematic you probably don't need a cryptographic hash.

Regarding performance: BLAKE2b on a Haswell gives you, in a "naive", pure C implementation (compiled to pure, non-vectorized AMD64 assembly), about 230 MB/s / GHz. (Referring to https://github.com/borgbackup/borg/issues/45#issuecomment-22... ), ie. something like 850-1000 MB/s on a desktop SKU. There are implementations that are around 10-30 % faster than that.

AFAIK all these newer n-c hash functions that popped up in the last couple years perform (on desktop SKUs) in the area of beyond ~10 GB/s.



SeaHash is obviously not cryptographic (nor is SipHash), but I hope it is a secure PRF (i.e. the keys cannot be extracted), and this was the best attack I was able to construct. Still, it isn't a practical attack, but I suppose it is possible to improve.

Note that I am not a cryptographer, and my only piece of advice is: For the sake of god, don't use hash functions not designed for cryptographic security, if you need cryptographic security. It's that simple.


SipHash is a cryptographic hash with a pretty good pedigree. The distinction between SipHash and Blake2 is more subtle than "cryptographic vs not".


Sure. For starters SipHash targets a lower-end spectrum of device (in a way) than Blake2, and is also less flexible. Since SipHash produces a 64 bit digest it isn't suitable for many applications in the first place. The two also differ significantly in other cryptographic properties, making Blake2 a more secure and easier (safer) to apply choice.

I would tend to say that SipHash is, in a cryptographic context, more an "if you know what you're doing" choice, and not at all a general purpose [cryptographic] hash function.


No, it's not a cryptographic hash function. It's a MAC function.

The paper clearly states that it is not collision resistant.


I believe that you and tptacek may be using differing definitions of "cryptographic", because he tends to knows what he's talking about when it comes to cryptography (e.g. https://gist.github.com/tqbf/be58d2d39690c3b366ad).


Oh, well.

What I think of as "cryptographic hash function" is a function resistent to pre-image attack, second pre-image attack, and collision generation.

Neither of those are satisfied by SipHash, and can thus not classify as a cryptographic hash function by the normal definition.


It's a cryptographic hash function, but not one suitable for all of the same applications as SHA2. It has security characteristics that other hash-table hashes (for instance) lack.


If you know the key, it is as weak as it gets (as the paper notes too, you can construct collisions easily if the key is known), so I disagree.


Honestly, I really don't care about this debate, except to the extent that it's about whether SipHash "isn't cryptographic and therefore you might as well use CityHash or SeaHash", which just isn't true.


I agree. SipHash is certainly strong if you don't know the key.


There's a huge difference between cryptographic and non-cryptographic. Note that blake2 has various length, whereas SeaHash is fixed to 64-bit (although I suppose it's not to hard to make a version with bigger length), and thus naturally collisions will happen.

If you need fingerprints, don't use SeaHash, but if you are looking to insert into e.g. a hash table, you shouldn't use BLAKE2. It's awfully slow for that.


For a content-addressable database, you absolutely want a cryptographic hash. (And please, don't make the mistake Git did: please include the name of the hash function in the "address", to make it possible to change the hash function.)


It seems like Git could change the disk format eventually with backwards compatibility for current SHA1 addresses. You don't even need to rewrite old objects. You just need to know which hash method a given tree/commit object is using to verify history or perform checkout. That could be a flag or tag in the object.


Git has some work in progress in that direction, starting by changing all the internal data structures to use a struct rather than a hardcoded array of bytes the size of a SHA1.


And on that note, has anyone seen any benches for blake2 vs Seahash/Metrohash? I'm not finding anything so far


BLAKE(2) is a cryptographic hash function, SeaHash is not. Even the fastest implementations of BLAKE only gets around 7.8 cycles/byte (hardware might do it twice as fast). SeaHash gets 0.24 cycles/byte.

That's a wide difference, around 32x faster.



How does this compare to murmurhash3?


Seahash is about 3x faster then murmur3

https://docs.rs/seahash/3.0.3/seahash/




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: