

Hash Tables Aren't a Magic Bullet - bobbydavid
http://notjustburritos.tumblr.com/post/21221997397

======
dpark
I agree that hash tables are not a magic bullet, but you're presenting them as
if they're a poor option in general, which is not the case.

> _Instead, they try to cheat by “knowing something” about the expected data.
> For instance, they may only run their hash function on the first six
> characters of a string instead of the whole thing. But what happens if I am
> attempting to insert ten-thousand lines that all begin with “ROBERT:”? O(n)
> look-up time. That’s what happens._

> _The above example is a bit contrived, but the sentiment is valid._

No, it's not. You either need to provide a sane hash function or you need to
use keys that are sane in combination with an existing hash function. If your
keys are massive, then the most likely case is that you're being lazy and
using an inappropriate object as a key because it's convenient. If you can't
or won't choose a reasonable key for a hash table, how likely are you to pick
a reasonable key for a BST-based dictionary (or any other dictionary)?
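
For anyone who wants to see the quoted "ROBERT:" scenario concretely, here is a rough sketch. The names `prefix_hash` and `bucket_counts` are made up for illustration, and the six-character cutoff is taken from the quote, not from any real library. It only shows that a prefix-only hash puts every such key in one bucket, while hashing the whole key spreads them out.

```python
def prefix_hash(key: str, prefix_len: int = 6) -> int:
    """Hypothetical 'cheating' hash: looks only at the first prefix_len characters."""
    return hash(key[:prefix_len])


def bucket_counts(keys, hash_fn, num_buckets):
    """Count how many keys land in each bucket for a given hash function."""
    counts = [0] * num_buckets
    for k in keys:
        counts[hash_fn(k) % num_buckets] += 1
    return counts


# Ten thousand keys that all share the same first six characters.
keys = [f"ROBERT: line {i}" for i in range(10_000)]

prefix_counts = bucket_counts(keys, prefix_hash, 1024)  # prefix-only hash
full_counts = bucket_counts(keys, hash, 1024)           # Python's built-in full-string hash

print("largest bucket, prefix hash:", max(prefix_counts))
print("largest bucket, full hash:  ", max(full_counts))
```

The bad case comes from pairing that hash function with those keys, which is the point above: pick a sane hash or pick sane keys.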

> _Hash tables waste space proportional to the amount of data you want to
> store. Hash tables rely on a lack of collisions. In a rosy situation where
> your hash function maps elements randomly (but deterministically of course)
> into your hash table, you still need to keep plenty of empty space lying
> around to avoid collisions._

A significantly overloaded hash table (with chaining) can still be faster than
alternative data structures in some cases. Collisions in a hash table are not
really a problem unless chain lengths approach O(N). If you shove O(N x M)
objects into an O(N)-sized hash table, you'll need to churn through O(M)
entries on every lookup (on average). Assuming a reasonable M and an efficient
comparison operation, that can be a pretty efficient approach.
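
A minimal sketch of that overloaded case, assuming a fixed-size table with chaining and no resizing. `ChainedTable` is a hypothetical toy class, and `N = 1000`, `M = 8` are arbitrary values picked for the example.

```python
class ChainedTable:
    """Toy fixed-size hash table with separate chaining (never resizes)."""

    def __init__(self, num_buckets):
        self.buckets = [[] for _ in range(num_buckets)]

    def _chain(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)  # overwrite an existing key
                return
        chain.append((key, value))

    def get(self, key):
        # Linear scan of one chain: the O(M) part of each lookup.
        for k, v in self._chain(key):
            if k == key:
                return v
        raise KeyError(key)


N, M = 1000, 8  # N buckets, N * M entries
table = ChainedTable(N)
for i in range(N * M):
    table.put(i, i * i)

# N * M entries over N buckets gives an average chain length of M.
average_chain = sum(len(chain) for chain in table.buckets) / N
print("average chain length:", average_chain)
print("lookup:", table.get(1234))
```

Each lookup hashes once and then scans a single chain, so the cost tracks M and not the total number of entries.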

> _So where does that leave us? Hash tables require very large data sets to be
> useful, yet they waste a lot of space for large data sets and aren’t great
> at storing large elements._

Neither of these claims is really true. Hash tables are not always the best
data structure (in fact, they often are not), but moderately-sized data sets
or large objects don't rule out hash tables as a good choice. (Small data sets
don't either. If you've only got a dozen elements, it probably doesn't really
matter how you store them.)

> _The only time hash tables seem useful is when we know something about the
> data ahead of time, so that we can vastly improve our hash function._

I should hope you know something about your data, or you are not in a position
to choose a good data structure.

> _But if we know that a certain part of the data is most likely to
> differentiate it, why not simply use that portion of the data as the key for
> the binary search?_

Because a BST isn't a magic bullet, either?

------
bobbydavid
I wonder how conflated the ideas of a "dictionary" and a "hash table" are
in a lot of developers' minds (a dictionary is an abstract data type that makes
no guarantees about how it's implemented, while a hash table is a specific way
to implement a dictionary).
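
To make the distinction concrete, here is a small sketch in Python. `SortedArrayDict` and `word_lengths` are hypothetical names invented for this example. The same calling code runs against Python's built-in `dict` (a hash table in CPython) and against a dictionary backed by a sorted array with binary search.

```python
import bisect


class SortedArrayDict:
    """Toy dictionary backed by parallel sorted lists, not a hash table."""

    def __init__(self):
        self._keys = []
        self._values = []

    def __setitem__(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value       # key already present: overwrite
        else:
            self._keys.insert(i, key)     # keep keys in sorted order
            self._values.insert(i, value)

    def __getitem__(self, key):
        i = bisect.bisect_left(self._keys, key)  # binary search
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        raise KeyError(key)


def word_lengths(mapping, words):
    """Uses only the dictionary interface; does not care how it is implemented."""
    for w in words:
        mapping[w] = len(w)
    return [mapping[w] for w in words]


words = ["hash", "table", "tree"]
from_hash = word_lengths({}, words)                   # built-in dict
from_sorted = word_lengths(SortedArrayDict(), words)  # sorted-array dictionary
print(from_hash, from_sorted)
```

The caller sees the same behavior either way; what changes is the cost model (hashing and equality versus ordering and comparison).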

For a savvy developer, is it worth it to know the differences? Do you know how
a dictionary is implemented in your language of choice?

------
VeejayRampay
"Hash tables aren’t a magic bullet data type. They have plenty of issues,
they’re not really constant time, and in most cases there are other data
structures that are much more appropriate."

Be sure to not tell us anything about said data structures though, we ain't
too keen on knowing stuff anyway.

