
Hash function performance - glower
http://scripting.com/stories/2010/09/09/hashFunctionPerformance.html
======
gojomo
That's an awful hash function. The largest bucket has more than 4X as many
items as the smallest. He should pick another function that uses all the
characters in the object name. Whatever language or library he's already
using probably has a better hash function for strings handy.
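
For instance, something along the lines of djb2 (sketched here in Python; the
11-bucket table size is the one mentioned elsewhere in the thread) already
mixes every character of the key:

    def djb2(s):
        # Classic djb2 string hash: every character contributes to the result.
        h = 5381
        for ch in s:
            h = (h * 33 + ord(ch)) & 0xFFFFFFFF  # stay within 32 bits
        return h

    bucket = djb2("someObjectName") % 11  # hypothetical key; 11 buckets per the thread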

~~~
acqq
I don't know which language he implements this in, but if it's any reasonably
efficient language he should certainly use all the characters. Some
interesting benchmarks (and also some very interesting links in the comments!)
can be found at:

<http://www.strchr.com/hash_functions>

------
Xk
Looks like a case of premature optimization to me. Optimizing for hash
function speed when you end up with thousands of comparisons is just pathetic.
Who cares if your hash function takes a whole millisecond instead of a
fraction of a millisecond if it gets you an even distribution across the
buckets?

(On average they're going to need to compare with 6324 other objects before
they've found the right one. A perfect hash function would end up needing only
5686 checks. Willing to bet that the 639 fewer checks would more than make up
for the cost of a better hash function.)

But that's just the beginning of the problem! I mean, even with a hash
function that's that bad, ~100 buckets would get them nearly 10X better
performance! And realistically, why not use several thousand buckets? Come on
...
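
A quick sketch of that arithmetic in Python (the bucket sizes below are
invented to mimic the skew reported in the thread; only the 17.6k and 3.8k
extremes come from it, so the printed numbers won't match the ones above):

    def expected_comparisons(bucket_sizes):
        # Expected entries examined per successful lookup, assuming keys are
        # looked up in proportion to bucket size and half a bucket is scanned
        # on average.
        total = sum(bucket_sizes)
        return sum(n * n / 2 for n in bucket_sizes) / total

    skewed = [3800, 6000, 7000, 8000, 9000, 10000, 17600,
              12000, 11000, 10000, 9000]        # hypothetical counts
    total = sum(skewed)

    print(expected_comparisons(skewed))                  # skewed, 11 buckets
    print(expected_comparisons([total / 11] * 11))       # even, 11 buckets
    print(expected_comparisons([total / 1000] * 1000))   # even, 1000 buckets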

~~~
wiredfool
This is used for everything in his environment, and has been for 20 years or
so. The ODB, local hash variables, stack frames. Both speed and size were
important when this was created. Why it's coming up now, I have no idea.

------
jallmann
The ignorance in this article is almost embarrassing. Why does Dave Winer keep
getting posts on HN?

~~~
codexon
I thought he was being sarcastic when he said it looked random. Isn't that why
he is asking for input?

~~~
Natsu
If he wants my input he should:

* Learn more about the theory behind hashes before designing one.

* Use a hash that distributes _evenly_ across buckets.

* Learn what random looks like.

If it were random, all the buckets would be hit with close to the same
probability, so there would be roughly the same number of items in each
bucket. What we have here shows that bucket 1 is rarely chosen and bucket 7 is
really common. The spread is so ridiculous (17.6k vs. 3.8k) that it cannot be
due to chance.
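
For what it's worth, here is what chance actually looks like (the 125,000
total is a made-up round figure consistent with the comparison counts quoted
elsewhere in the thread; only the 11 buckets and the 3.8k/17.6k extremes come
from the article):

    import random
    from collections import Counter

    # Assign 125,000 items to 11 buckets uniformly at random.
    counts = Counter(random.randrange(11) for _ in range(125000))
    print(sorted(counts.values()))
    # Every bucket lands within a couple of percent of 125000/11 ~= 11,364;
    # nothing remotely like a 3.8k-vs-17.6k spread.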

In short, this is an absurd hash function. Do not use it in cryptography or
elsewhere. If you need simple, low-cost, non-cryptographic hashes, look here:

[http://www.cs.hmc.edu/~geoff/classes/hmc.cs070.200101/homewo...](http://www.cs.hmc.edu/~geoff/classes/hmc.cs070.200101/homework10/hashfuncs.html)

------
wiredfool
Many of the design decisions in his software make a lot of sense for an
environment that could fit on a floppy.

~~~
jemfinch
None of the ones in this post do, though.

Using an appropriate number of buckets would actually reduce memory usage.

Using an appropriate hash function (one that at least considered all the
characters in the key) wouldn't noticeably increase the time to hash a string
read off a floppy disk; he's touching the first and last characters anyway, so
the whole key is already in memory.

~~~
wiredfool
What's the appropriate number of buckets, taking into account everywhere it's
used?

~~~
jemfinch
Depends on your way of hashing. If you're using separate chaining (basically,
keeping a linked list of entries in each bucket) you can get away with N
buckets for N items if you have a good hash function. If you're using open
addressing, you'll usually want your load factor (the ratio of filled slots to
total slots) not to be much higher than 2/3 or so. It may appear on the
surface that open addressing uses more memory, but don't forget the overhead
of chaining (an extra pointer for every element in the hash table, and many
more cache misses relative to a low-load open-addressed table).
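
For concreteness, a toy sketch of the chaining variant (a generic illustration
of separate chaining, not anything from the codebase under discussion):

    class ChainedTable:
        # Toy separate-chaining table: one Python list of (key, value) pairs
        # per bucket. With a decent hash and roughly as many buckets as items,
        # chains average about one entry each.

        def __init__(self, nbuckets):
            self.buckets = [[] for _ in range(nbuckets)]

        def _chain(self, key):
            return self.buckets[hash(key) % len(self.buckets)]

        def put(self, key, value):
            chain = self._chain(key)
            for i, (k, _) in enumerate(chain):
                if k == key:
                    chain[i] = (key, value)  # overwrite existing key
                    return
            chain.append((key, value))

        def get(self, key):
            for k, v in self._chain(key):
                if k == key:
                    return v
            raise KeyError(key)

    table = ChainedTable(nbuckets=1000)
    table.put("someObjectName", 42)
    print(table.get("someObjectName"))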

------
16s
I read that headline and straight away thought of md4 versus md5 performance.
I need to take a break :)

------
watersco
Why only 11 buckets? That seems like the bigger issue. Converting an O(n)
lookup to O(n/10) seems silly. Why not have 1000 buckets and get two more
orders of magnitude improvement in lookup performance?

Of course that will expose how bad the hash function really is. Saving a few
cycles in the hash function and then chaining through 17k linked list entries
doesn't make any sense.

~~~
wiredfool
Because it was a design decision from 20 years ago for small n hash tables.
It's baked into the design of his object db files.

------
houseabsolute
I upvoted because I really hope to hear from someone who knows this stuff. Off
the top of my head (as a non-expert), this sounds like a horrible hash
function that is vulnerable to a plethora of attacks, but I don't know for
sure. Why not just use the built-in hash function, or some function someone
smarter than you has written?

~~~
drblast
I'm not a hash function expert, but I did sleep at a Holiday Inn Express last
night.

This is a horrible hash function for a number of reasons. You want your hash
function to use every bit of information from the data you're keying on,
because that's what spreads keys evenly and makes collisions unlikely.
Consider this list of keys:

aaz abz axz

Those three keys would collide, and for no reason. Even adding up each letter
(another horrible hash function because it eliminates information about the
position of the character, so "the" and "eht" would collide) would be better
than this.
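
To make that concrete (both toy functions below are stand-ins written for
illustration, not the actual code being discussed):

    def first_last(key, nbuckets=11):
        # Stand-in for the criticized approach: only the first and last
        # characters of the key matter.
        return (ord(key[0]) + ord(key[-1])) % nbuckets

    def letter_sum(key, nbuckets=11):
        # Adds up every letter: uses all the characters, but is blind to
        # their positions.
        return sum(ord(c) for c in key) % nbuckets

    print([first_last(k) for k in ("aaz", "abz", "axz")])  # all three collide
    print([letter_sum(k) for k in ("the", "eht")])         # anagrams collide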

Information theory is important here; you want to preserve as much information
from your key as possible. Check out
<http://en.wikipedia.org/wiki/Entropy_(information_theory)> for a good
discussion on this.

Also, there is almost no earthly reason to use a hash table with only 11
buckets, and you certainly wouldn't want to evaluate your hash function based
on that. Assuming you'd have to linearly search a bucket of ~10,000 entries
for your match, hashing into 11 sections buys you very little time.

Also, there's no reason not to do something more complicated; assuming your
key is in CPU cache because you're going to add the first and last letters,
why not at least add up all the letters? You're wasting free CPU cycles after
you've already loaded the key from memory.

Finally, you don't want output that "looks pretty random." You want output
that distributes exactly evenly between buckets. He's nowhere close.

