
Problems with Hash Tables - RiderOfGiraffes
http://enfranchisedmind.com/blog/posts/problems-with-hash-tables/
======
ajross
This doesn't make much sense to me. The author correctly identifies that hash
resizes that are done by changing the size by a multiple remain O(1) amortized
over their construction and use. But then asserts that many hash tables
(who?!) don't do this, and that makes them O(N). I've written a ton of hash
tables in my career. None have that characteristic. Nor do any I've studied in
popular software.
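
To make the amortization argument concrete, here's a minimal sketch (mine, not
the article's; the function name and starting capacity are arbitrary) that
counts how many element moves a doubling policy costs over n insertions:

    def moves_with_doubling(n, initial_capacity=8):
        """Count element copies caused by resizes under a doubling policy."""
        capacity, size, moves = initial_capacity, 0, 0
        for _ in range(n):
            if size == capacity:      # table full: double and rehash everything
                moves += size         # each existing element is moved once
                capacity *= 2
            size += 1
        return moves

    # moves_with_doubling(1_000_000) comes out below 2 * 1_000_000, so the
    # per-insert cost of resizing is bounded by a small constant on average.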

Then he finishes up with a discussion about "bad hash functions" without
actually pointing to any examples of tables that use them.

Worst of all, the point of the whole (rather long!) article is about performance, and
yet there's not a benchmark to be seen. Nothing is measured; it's just one
assertion after another.

The whole thing is a giant straw man. Not worth reading.

~~~
joe_the_user
Good criticism...

Yet, if you want to avoid your application occasionally freezing when the user
hits a key, then isn't distinguishing worst-case scenarios good? You seem to
agree with the article's claim that hash tables sometimes give you an O(n) hit.
Even if this happens very occasionally, aren't there situations where this
could be worse than a consistent O(log(N)) hit?

On the other hand, what keeps the hash algorithm from expanding the buckets
little by little?

~~~
ajross
Sure, latencies can be a problem. Though even extraordinarily large tables
aren't going to have user-visible latencies, really. I'd worry more about real
time applications where occasionally high latencies would cause things like
buffer underruns. Think packet routers, video transcoders, things like that.

But even then, the naive solution would just be to pre-size the hash table and
be careful about insertion/deletion architecture. Not to chuck it in favor of
an AVL tree or whatnot.

~~~
stcredzero
_the naive solution would just be to pre-size the hash table and be careful
about insertion/deletion architecture. Not to chuck it in favor of an AVL tree
or whatnot._

Sometimes, a pre-sized hash table is the "best" solution, depending on
context, and for some perhaps debatable value of "best." If I was doing hard
real time (which is not what I generally do) and I didn't have any good idea
of how much space the system would need, then I'd use a largish pre-sized
table as a starting point, then think about cooking up some sort of incremental
growth/rehash code. Start out just using one hash table. When it gets too
close to full, you allocate the bigger table and you start moving things to
the larger one with each table access. To check for membership in this state,
you check both tables. To add a new entry, you add to the larger table. Once
you are done, you can throw away the old table. I don't do hard real time, but
this is just what comes off the top of my head. (Note that table allocation
might have to be done incrementally, if nulling out lots of entries would take
too long.)
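
A rough sketch of that two-table scheme (my illustration, not stcredzero's
code; the class name, bucket counts, and threshold are made up). Each get/put
migrates one bucket from the old table, so no single operation ever has to
rehash everything at once:

    class IncrementalHashTable:
        def __init__(self, buckets=8):
            self.old = None                       # table being drained, if any
            self.new = [[] for _ in range(buckets)]
            self.size = 0
            self.migrate_index = 0

        def _bucket(self, table, key):
            return table[hash(key) % len(table)]

        def _step_migration(self):
            # Move one bucket from the old table into the new one per operation.
            if self.old is None:
                return
            if self.migrate_index >= len(self.old):
                self.old = None                   # migration finished
                return
            for k, v in self.old[self.migrate_index]:
                self._bucket(self.new, k).append((k, v))
            self.old[self.migrate_index] = []
            self.migrate_index += 1

        def get(self, key):
            self._step_migration()
            for table in (self.new, self.old):    # membership: check both tables
                if table is None:
                    continue
                for k, v in self._bucket(table, key):
                    if k == key:
                        return v
            raise KeyError(key)

        def put(self, key, value):
            self._step_migration()
            # Start a resize once the live table gets too close to full.
            if self.old is None and self.size > 0.75 * len(self.new):
                self.old = self.new
                self.new = [[] for _ in range(2 * len(self.old))]
                self.migrate_index = 0
            # Drop any stale copy of the key, then insert into the larger table.
            for table in (self.new, self.old):
                if table is None:
                    continue
                bucket = self._bucket(table, key)
                for i, (k, _) in enumerate(bucket):
                    if k == key:
                        del bucket[i]
                        self.size -= 1
                        break
            self._bucket(self.new, key).append((key, value))
            self.size += 1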

------
nimrody
Potential benefit of hash tables: cache locality.

Assuming you already have the key in the CPU cache, computing the hash is
relatively cheap and gets you directly to the desired value.

A binary tree structure always makes you traverse several levels until you get
to the desired value. Of course, this can be minimized with high-fanout trees
(see Clojure, for example).

~~~
rozim
Cache locality: especially with open addressing and linear probing.
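
For anyone who hasn't seen it, a bare-bones sketch of linear probing (purely
illustrative; a real table would also resize and handle deletion with
tombstones). The point is that all probes for a key walk consecutive slots of
one flat array, which in a C-style layout means adjacent memory:

    EMPTY = object()                              # sentinel for an unused slot

    class LinearProbingTable:
        def __init__(self, capacity=16):
            self.slots = [EMPTY] * capacity       # flat array of (key, value)

        def _probe(self, key):
            # Visit consecutive slots starting at the hash position.
            start = hash(key) % len(self.slots)
            for step in range(len(self.slots)):
                yield (start + step) % len(self.slots)

        def put(self, key, value):
            for i in self._probe(key):
                if self.slots[i] is EMPTY or self.slots[i][0] == key:
                    self.slots[i] = (key, value)
                    return
            raise RuntimeError("table full; a real version would resize first")

        def get(self, key):
            for i in self._probe(key):
                if self.slots[i] is EMPTY:        # hit an empty slot: not present
                    raise KeyError(key)
                if self.slots[i][0] == key:
                    return self.slots[i][1]
            raise KeyError(key)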

------
RiderOfGiraffes
I've done a search and found that this was submitted a year ago ...

<http://news.ycombinator.com/item?id=123718>

However, it got no comments, and replies are now closed, so I'll leave this
here. Useful information, many lessons to learn. If you're not into algorithms
or low level details, you can probably ignore it.

------
antirez
If you want a compromise that takes little space like hash tables, has average
time complexity of O(log(N)) like balanced trees, and is as simple to
implement as hash tables, take a look at skip lists.
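
A toy sketch of the idea (illustrative only, not antirez's code; MAX_LEVEL and
the promotion probability are arbitrary). Each node carries a small tower of
forward pointers, and higher levels skip over more nodes, which is what gives
the expected O(log(N)) search with list-like code:

    import random

    MAX_LEVEL = 16          # enough levels for a few million elements
    P = 0.5                 # probability of promoting a node one level up

    class Node:
        def __init__(self, key, value, level):
            self.key, self.value = key, value
            self.forward = [None] * level

    class SkipList:
        def __init__(self):
            self.head = Node(None, None, MAX_LEVEL)
            self.level = 1

        def _random_level(self):
            level = 1
            while random.random() < P and level < MAX_LEVEL:
                level += 1
            return level

        def search(self, key):
            node = self.head
            for i in range(self.level - 1, -1, -1):   # descend level by level
                while node.forward[i] is not None and node.forward[i].key < key:
                    node = node.forward[i]
            node = node.forward[0]
            if node is not None and node.key == key:
                return node.value
            raise KeyError(key)

        def insert(self, key, value):
            update = [self.head] * MAX_LEVEL          # rightmost node per level
            node = self.head
            for i in range(self.level - 1, -1, -1):
                while node.forward[i] is not None and node.forward[i].key < key:
                    node = node.forward[i]
                update[i] = node
            nxt = node.forward[0]
            if nxt is not None and nxt.key == key:
                nxt.value = value                     # overwrite existing key
                return
            level = self._random_level()
            self.level = max(self.level, level)
            new = Node(key, value, level)
            for i in range(level):                    # splice into each level
                new.forward[i] = update[i].forward[i]
                update[i].forward[i] = new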

------
ggchappell
The efficiency of hash tables (insertions, in particular) is actually slightly
worse than he indicates. They are not amortized O(1) for all data, but only
for _typical_ data (or on average over all possible data sets). To see this,
suppose that every item inserted happens to go into the same bucket. The
amortized time complexity is linear.
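
One quick way to see this concretely (my example, not ggchappell's): give every
key the same hash value, and a Python dict has to compare each new key against
all the colliding keys already inserted, so n inserts cost O(n^2) comparisons
in total.

    class BadKey:
        """Every instance hashes to the same value, so all keys collide."""
        def __init__(self, x):
            self.x = x
        def __hash__(self):
            return 42
        def __eq__(self, other):
            return isinstance(other, BadKey) and self.x == other.x

    d = {}
    for i in range(2000):     # noticeably slower than inserting 2000 ints
        d[BadKey(i)] = i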

On the other hand, insertion is not O(1) even for average data; there is
always the periodic table resizing. So: _amortized_ O(1) for average data, but
not worst-case O(1) per operation, even for average data.

The upshot of all this is that hash tables are still pretty good, but
insertion is only constant time in a "double average" sense: on average over a
large number of consecutive operations (amortization) _and_ over all possible
data sets.

A note to those who say, "Why not just use a different hash function?": This
can greatly decrease the _probability_ of poor hash-table performance. But it
does not change the worst case, unless we allow for an arbitrarily large
number of hash functions, _and_ one of them is guaranteed to be a good one.
I've never seen this done (actually, I'm not sure anyone knows how to do it
within reasonable efficiency constraints).

And a note to those who say, "But in the real world it doesn't matter": No,
sometimes it matters, and sometimes it doesn't. (Don't use Python to program a
pacemaker ....)

------
peterwwillis
I would love to read this, except I am using Windows Mobile and am shown some
"use a different browser!" page instead. This is annoying.

~~~
blasdel
He's blocking Opera?

------
BearOfNH
Hashtable routines are far easier to write and debug than balanced binary tree
routines.

Of course there are open source implementations available for both. But I
would guess it is easier to write your own hash function than learn to use
somebody else's, and easier to learn somebody else's B-tree routines than to
write your own.

So we live in a world with O(N) hash implementations and O(1) B-tree
implementations, for N programmers.

