Hacker News new | past | comments | ask | show | jobs | submit login

Don't do this. Use a real hash function that guarantees a highly random distribution, make your hash tables power-of-two sized, and map from hash value to table index using (hash & (size-1)).

The fibonacci constant thing will help clean up the distribution of a bad hash function, but it does nothing for collision resistance if the underlying hash function is weak.

-Austin, author of Murmurhash and SMHasher




He's clearly, and obviously not talking about using it as a hash, or even to consider it as a secondary hash as someone mentioned.

The use case according to the article is strictly to replace integer modulo to map into buckets for cases where that operation is the limiting factor. In his case that's when 9ns per key is too much, and roughly 1ns is good.

For small hash tables sometimes the worst case of a linear scan of the entire table is perfectly acceptable, and the additional performance the other 99.999% of the time is a welcome bonus.


Fibonacci hashing takes the form of "h * k>>(64-b)", where k is determined by golden ratio and b is the bit size of the table. This involves one generic multiplication, which is the bottleneck. This multiplication is not strictly necessary. You can replace it with k=1033 for example, which can be implemented as "h+(h<<10)+(h<<3)". This will be a good enough safeguard against naive hash functions but is much faster to compute.

With "h * k>>(64-b)", the result has a cycle of 2^b. Suppose b=3 and you have input 1<<3|1, 2<<3|1 and 3<<3|1, Fibonacci hashing will put them to the same bucket – it is not that effective. A safer strategy is to use a proper integer hash function like Thomas Wang's 32-bit hash function. Although it involves more steps, it only involves plus and bit operations and probably can be computed faster than generic multiplication.

At the end of day, however, Fibonacci hashing or similar ideas only helps when you hash keys to similar integers but has no effect when you hash different keys to the same integer. You have to use a reasonable hash function anyway.


> Although it involves more steps, it only involves plus and bit operations and probably can be computed faster than generic multiplication.

It the most recent popular CPUs the multiplication is exactly "one step" long, that's how wonderfully fast they got to be. See e.g. Agner Fog instruction tables. And where not, the number of "steps" is typically not more than 2 or 3. The multiplication is implemented very efficiently today, unless the CPU has to be with a very low transistor count (linke in some embedded systems). The more exotic CPUs can, of course, be different.


Of course it's a secondary hash. He's mapping buckets by a hash and a shift. This is not a new strategy, and xor, Fibonacci, and "real" hashes fall along a spectrum of average vs worst case runtime tradeoffs.


He is talking about using it as a hash function and clearly states so. To quote:

> 1. Hash the key > 2. Map the hash value to a slot

> Knuth [Fibonacci Hash] uses the term “hash function” to refer to something that does both step 1 and step 2.


Actually that quote is in the article for the opposite reason. He is referencing that because historically Fibonochi hashing was coupled with a hashing function and that is why it’s overlooked as a mapping function. His article is specifically advocating using the Fibonacci method only for mapping.

He is writing from the context of writing a hash table library where the user is expected to provide their own hashing function.


When someone else pointed out that you need another hash function on the front:

Author: "Why [would] Fibonacci hashing not work as a hash function alone? It should, unless you have a use case that results in lots of Fibonacci numbers being inserted. You can use the identity function as the hash when you use Fibonacci hashing to assign a hash to an index."


The whole point of fibonacci hashing is making hash tables more resistant to bad hash functions.

Expecting everybody to be using the perfect hash function for each case is extremely naive, and even then, it just takes somebody else refactoring some code and adding a new member to a struct somewhere without updating the hash function to break that completely.


Doubly so if you are making a library where de facto you don't know what the input data might be and hence you don't know in advance what the perfect hash function is.


Also, there is nothing magic about his fibonacci constant. Any large odd constant with roughly half the bits set at random will do the same thing.


I also did not follow what is special about 2^n/phi. I found this more concise justification: http://mathforum.org/kb/message.jspa?messageID=431065

"Knuth's finding was that the dispersion of indexes for a sequence of consecutive keys is maximized when M is chosen this way, thus a multiplicative hash table with a dense set of keys will have the fewest possible collisions when M approx= 2^N * R."

Although, if the keys are dense, then we could just use the low bits directly. I guess the unstated assumption is that in the real world, we'll have a mix of keys that are sequential and keys that are strided in such a way that using low bits directly would cause a lot of collisions. So we need a multiplicative hash to take care of the latter and we should use 2^n/phi to take care of the former.

Austin, you've done a lot of great work on this. What is the hash function you'd use today for small (<=32byte) keys?


Phi is the most irrational of the irrational numbers, meaning that successive rational approximations are just slightly better: 1/1, 1/2, 2/3, 3/5, 5/8, 8/13, etc. There are no rational approximations that "jump out" as being much better, so you won't find any unexpected patterns when using it.

The Golden Ratio (why it is so irrational) - Numberphile https://www.youtube.com/watch?v=sj8Sg8qnjOg


Murmur3 has good distribution properties and is much less code than City/Highway/Spooky. There are also some hashes based on hardware AES instructions (not crypto, just taking advantage of the mixing properties) that are near perfect but I don't recall their names and haven't benchmarked them.


The code for these hash tables is open source. You keep repeating that he should be using your hash functions for reasons xyz. Honestly, if you really believe this, show that doing so is better.

Just because you are the author of uvw does not make yo right.




The best is currently Leonid Yuriev's t1ha. See https://github.com/rurban/smhasher/

Note that in contrast with what Andy or DJB say, the collision safety is not a problem of the hash function per se, as you cannot fix collision attacks with any "safer" hash function. You can easily brute-force even the worst of all siphash in under 4min. Safety begins with 256 bits, in a hash table you got typically 10-14, max 32 to attack. It is only making it marginably safer, but much slower. It is only doable by checking collision attacks in the collision scheme, either by collision counting or using a non-attackable collision scheme.

Unfortunately most implementations have no idea, and just go with the slowest of all. The latest such sin done in the linux network stack. I thought at least those guys were above that, but apparently not.

This simple multiplication hash is indeed superior, just with linear probing it has problems. You would use double hashing with another similar constant then. Or find two such constants dynamically, as this would be universal then.

And also note that the upper bits of a hash function are always superior to the lower bits. power-by-2 & checks are thus always worse than right-shifting as with this one.


Thanks. Also, from smhasher text:

"See e.g. A Seven-Dimensional Analysis of Hashing Methods and its Implications on Query Processing

https://infosys.cs.uni-saarland.de/publications/p249-richter...

for a concise overview of the best hash table strategies, confirming that the simpliest Mult hashing (bernstein, FNV*, x17, sdbm) always beat "better" hash functions (Tabulation, Murmur, Farm, ...) when used in a hash table."

The conclusion is not properly stated, it's surely not "always" but there are enough use cases where the simple multiplications (or, even more probably, the "rotate, multiply, xor" and variants) are the fastest in practice.

From the paper:

"We could observe that Mult produces indeed more collisions than the expected amount on uniformly distributed keys. However, this larger amount of collisions does not get highly reflected in the observed performance. Thus, we consider Mult as the best candidate to be used in practice when quality results on high throughputs is desired, but at the cost of a high variance across data distributions."

It’s still a trade-off, but a good choice in some use cases.


I haven't found a usecase yet where a better hash function leads to a faster hash table. Theoretically this would happen with CPUs with extremely slow or small cache.

It's also confirmed with the recent trend to add slower hash functions everywhere, leading to dramatic performance regressions.


> You can easily brute-force even the worst of all siphash in under 4min

You assume that you know the seed or can directly observe the output, which is rare in practice.

> upper bits of a hash function are always superior to the lower bits

Why is that? The identity hash for ints is quite common. And for good hashes, the difference should be negligible.


Actually there is something special about the using the golden ratio. By some measure it is "the most irrational" number, and that property makes it really good at distributing numbers evenly along an interval.

You can use it to generate a nice sequence of different colours - set the hue to (n * the golden ratio) mod 1.

But I don't know if that property is really necessary for a hash table. I guess only if you are using a terrible key hash.


In fact, if you actually choose a random odd number, you'll get multiply-shift, which is probably the fastest universal hash function that exists.


Wait, I thought the author was saying to use this _after_ using a more secure hash function, not instead of it. Why wouldn't you do that?


Yes, the author specifically points out that there are really two separate steps here: a hash function, and scaling that result down to the number of bits needed at the current number of hash slots.

He's arguing that the fibonacci approach is both faster than modulo and at the same time is better when the input is bad.

He's certainly not arguing it replaces a good hash when you can provide one.

More pointing out that if writing a general hash table implementation you need to expect bad inputs, and since you need to scale the result down anyway you might as well improve on the input if you can do so very cheaply.


It's about "hash tables" (ie associative arrays, HashMap), right?

Is security even an issue in hash functions for this purpose? I honestly don't know, it's not obvious to me that it is.


Perhaps for DOS resistance like the hashdos vulnerability from 2011?

https://nakedsecurity.sophos.com/2011/12/28/large-percentage...


This is for a hashmap library where the hash functions are user-defined, so a secondary hash is justifiable.


A good secondary hash would be the fmix methods from Murmur, or multiply-byteswap-multiply.

However, having seen a lot of terrible user-defined hashes (I do a bit of consulting on an internal Google mailing list), I strongly advise against rolling your own.


The secondary hash must primarily be good at mapping the hashed value into a slot: that is, be as cheap as possible (ideally 0 cycles, fib hashing is ~5 cycles) if the primary hash was good while preventing catastrophic performance if the primary hash was bad. All of this without knowing anything about the primary hash.


It'd be interesting seeing comparisons there given that the issue here was that none of the widespread implementations he surveyed, including a Google one, even tries to address this, and several of them use methods that are both worse and slower.

Seems like it's something that has gotten little attention.


As a secondary (supplemental) hash, I found the Murmur finalizer methods are ok, but you can get better if you use different constants. See my result in this answer: https://stackoverflow.com/questions/664014/what-integer-hash...

I measured the avalanche effect (the number of output bits that change if a single input bit is changed; should be nearly 16 on average for a 32 bit hash), independence of output bit changes (output bits should not depend on each other), and the probability of a change in each output bit if any input bit is changed.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: