> `f : {0,1}^n → {0,1}^n` is a perfect PRF if and only if given a distribution `d : {0,1}^n → [0,1]`, `f` maps inputs following the distribution `d` to the uniform distribution.
Is this even possible? If the distribution `d` always returns 0, how can a function make it uniform?
It would be nice, if the article would go into more details on how SeaHash obtains this property, and how it related to collision avoidance.
I believe the point is that you get a uniform distirbution of 1s and 0s after going through the PRF irrespectively of the distribution of 1s and 0s in the input. This is certainly a desirable property for a hash function.
ticki and the rest of the crew working on Redox (jackpot51) are really doing some neat stuff. a lot of it is over my head but it is cool to see something being built from the ground up using some new techniques and ideas.
A good example is for a bloom filter: using sha256 is much too slow for good filter performance, you want something like siphash or some other non-cryptographic hash.
I think xxHash was/is the fastest good non-crypto hash, and now SeaHash may be best. Although I'd like to see a bit more data on that (small keys? large keys? benchmarking methodology) than SeaHash's author is providing.
This isn't a cryptographic hash. SHA-256 and other cryptographic hashes are much slower, but give certain guarantees about difficulty of doing things like finding hash collisions. The advantage of SeaHash is that it's even faster than other commonly used non-cryptographic hashes, and has attractive statistical properties that will result in fewer collisions than other hash functions.
SHA256 is very slow, and that's no surprise. It's cryptographic after all.
Here's a small list of usecases for non-cryptographic hash functions:
- Checksums and error correction codes, as long as there is no way to maliciously use this.
- Hash tables. These always use non-cryptographic hash functions.
- Bloom filters.
- Heuristic fingerprinting. They're not strong enough to be used for normal data fingerprints, but they can be used as a way to decide if two buffers are "probably equal" or "certainly not equal".
Hash tables are the main one. Cryptographic hash functions are almost never used in them. SipHash is a popular choice, but it is not cryptographic. That is a misunderstanding: It's a MAC function.
I recently used a hash function like this as part of a React-style delta calculation engine but for Swift / iOS.
I needed a hash function that was fast but collision-resistant, and I did not need a cryptographic hash, as all data is trusted (and a hash collision would not actually matter that much).
Even when I've not needed a cryptographic hash, I've still used one, because why not? I've never not needed one so bad as to resort to some barely studied, homemade hashing algorithm.
> A hash collision wouldn't matter that much.
Interesting. What was the use for the hash function then?
A trade off between collision potential vs. speed. Sometimes you need speed more than you need cryptographic levels of collision avoidance.
For example, finding unique files on the file system. After looking at size, first and last bytes, it would be better to filter quickly on an imperfect hash (with, say, a 1 in 1^56 chance of collision) than slowly on a perfect hash (with a 1 in 1^256 chance).
> Even when I've not needed a cryptographic hash, I've still used one, because why not?
Performance. Take a look at djb's (non-cryptographic) hash, with a constant multiplier chosen to be implemented with a shift and an add — that's the level of performance a non-cryptographic hash (e.g. for hash tables & similar purposes) needs.
Caching often relies on hash functions. If you run the Cloudflare cache, you'll start caring, considering hashing is usually >50% of the CPU workload and an optimized non-cryptographic hash function can be 20x faster.
> Is there a use-case for this for "general hashing" and such, where sha256 is _genuinely_ insufficient?
Yeah. There are often times when you want to hash data but don't want to waste sha256-levels of cycles doing so. Applications are pretty much everything except cryptographic signing. Load-balancing, higher-quality checksuming, etc.
> there is a major difference between cryptographic and non-cryptographic hash functions. SeaHash is not cryptographic
Is this in reference to most used hash functions not leaving any way to trace the contents of what was hashed, or not being able to reliably reconstruct contents to generate a certain hash? Or something else entirely?
A cryptographic hash makes it hard for an active attacker to cause the hash function to do something unwanted (generate collisions, generate a pre-determined output for a chosen input, etc.). "Hard" here is defined as it usually is for crypto: it should not be any easier to break the algorithm by knowing the algorithm's details, than to find one by brute-forcing 2^128 (or however many) possible inputs and treating the algorithm as a black box.
A non-cryptographic hash is generally not worried about active attackers; it wants to give a pretty good, even distribution of outputs for an arbitrary set of inputs, but it's not worried about inputs that are specifically designed to abuse the hash function. If you want to, say, create a random-looking but deterministic and stateless color for each user in an chat room, or something, a non-cryptographic hash function is fine. It's possible that an attacker can create a bunch of users with names constructed to all have the same color, but that's not going to break your website.
However, if you're putting each user in a hash table, a large number of collisions in a hash table can easily get you O(n^2) performance, despite the hash table being O(n) for a randomly-selected set of n inputs. That's the usual reason for using cryptographic hash functions even in places you wouldn't usually expect to see crypto. For instance, SipHash is a fast cryptographic hash function with output too small for use in actual cryptosystems, but it's perfect for hash tables.
(Strictly speaking, a cryptographic hash function doesn't promise anything about the privacy of its input; H(x) = x[0] + SHA-256(x) satisfies the theoretical constraints on cryptographic hash functions for preimage resistance and collision resistance.)
The latter. Even a very simple hash algorithm will almost always make it impossible to get the original contents back. You'd have to write an intentionally pathological algorithm to achieve that.
- Is there an (abstract) attack model? (Assuming that you have one if you ought to)
- Then: Can an attacker insert collisions into the DB, and is that problematic?
- Then: A non-cryptographic hash might be much easier to "reverse", especially for short inputs. Is that problematic?
If none of these are problematic you probably don't need a cryptographic hash.
Regarding performance: BLAKE2b on a Haswell gives you, in a "naive", pure C implementation (compiled to pure, non-vectorized AMD64 assembly), about 230 MB/s / GHz. (Referring to https://github.com/borgbackup/borg/issues/45#issuecomment-22... ), ie. something like 850-1000 MB/s on a desktop SKU. There are implementations that are around 10-30 % faster than that.
AFAIK all these newer n-c hash functions that popped up in the last couple years perform (on desktop SKUs) in the area of beyond ~10 GB/s.
SeaHash is obviously not cryptographic (nor is SipHash), but I hope it is a secure PRF (i.e. the keys cannot be extracted), and this was the best attack I was able to construct. Still, it isn't a practical attack, but I suppose it is possible to improve.
Note that I am not a cryptographer, and my only piece of advice is: For the sake of god, don't use hash functions not designed for cryptographic security, if you need cryptographic security. It's that simple.
Sure. For starters SipHash targets a lower-end spectrum of device (in a way) than Blake2, and is also less flexible. Since SipHash produces a 64 bit digest it isn't suitable for many applications in the first place. The two also differ significantly in other cryptographic properties, making Blake2 a more secure and easier (safer) to apply choice.
I would tend to say that SipHash is, in a cryptographic context, more an "if you know what you're doing" choice, and not at all a general purpose [cryptographic] hash function.
I believe that you and tptacek may be using differing definitions of "cryptographic", because he tends to knows what he's talking about when it comes to cryptography (e.g. https://gist.github.com/tqbf/be58d2d39690c3b366ad).
It's a cryptographic hash function, but not one suitable for all of the same applications as SHA2. It has security characteristics that other hash-table hashes (for instance) lack.
Honestly, I really don't care about this debate, except to the extent that it's about whether SipHash "isn't cryptographic and therefore you might as well use CityHash or SeaHash", which just isn't true.
There's a huge difference between cryptographic and non-cryptographic. Note that blake2 has various length, whereas SeaHash is fixed to 64-bit (although I suppose it's not to hard to make a version with bigger length), and thus naturally collisions will happen.
If you need fingerprints, don't use SeaHash, but if you are looking to insert into e.g. a hash table, you shouldn't use BLAKE2. It's awfully slow for that.
For a content-addressable database, you absolutely want a cryptographic hash. (And please, don't make the mistake Git did: please include the name of the hash function in the "address", to make it possible to change the hash function.)
It seems like Git could change the disk format eventually with backwards compatibility for current SHA1 addresses. You don't even need to rewrite old objects. You just need to know which hash method a given tree/commit object is using to verify history or perform checkout. That could be a flag or tag in the object.
Git has some work in progress in that direction, starting by changing all the internal data structures to use a struct rather than a hardcoded array of bytes the size of a SHA1.
BLAKE(2) is a cryptographic hash function, SeaHash is not. Even the fastest implementations of BLAKE only gets around 7.8 cycles/byte (hardware might do it twice as fast). SeaHash gets 0.24 cycles/byte.
https://news.ycombinator.com/item?id=13057797