
Myths about Hash Tables - ptr
http://hughewilliams.com/2012/10/01/five-myths-about-hash-tables/
======
Udo
While the article is a bit polemic, I really wish more people knew how hash
tables work under the hood.

When I was looking for a job, I flunked out of one interview where I got into
an argument with an interviewer about the runtime properties of hash tables -
more specifically: the interviewer was adamant their access time was always
O(1), wanted me to admit my "mistake" and move on, but I just couldn't ;) It
was one of those rare cases where they give you feedback on why you're not
being hired, too. The lady on the phone said my "technical knowledge" was not
up to par with what they needed for the position. Yeah, I'm still salty about
that.
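
For what it's worth, the point is easy to demonstrate. A toy sketch of my own (not from the thread, and the `BadKey` class is purely hypothetical): when every key collides, even Python's dict degrades from expected O(1) to O(n) lookups.

```python
# Toy illustration: force every key into the same bucket and dict
# operations fall back to comparing against each colliding key in turn.

class BadKey:
    def __init__(self, n):
        self.n = n

    def __hash__(self):
        return 0                    # every key collides

    def __eq__(self, other):
        return self.n == other.n

d = {BadKey(i): i for i in range(500)}  # build time is now quadratic
assert d[BadKey(499)] == 499            # this lookup scans a 500-entry chain
```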

~~~
arethuza
Sounds to me more like the interviewer/company flunked your test rather than
you flunking theirs - you might have dodged a bullet there.

~~~
Udo
I wouldn't have worked for the interviewer, it's often just a gatekeeper you
need to overcome. Also, when you don't know how to make rent next month and
the company seems pretty cool (apart from that person), it's hard to see at
the time how you dodged any bullets...

~~~
arethuza
If someone reacts badly when you know more than they do, that's not a
great bit of behaviour; if a company chooses such a person as an _interviewer_,
that's a bit of a red flag to me - an interview should be as much about
presenting a positive image of the company as about testing the candidate.

Sorry to hear about the rent situation - I take it that worked out OK eventually?

~~~
Udo
_> if a company chooses such a person as an interviewer then that's a bit of a
red flag to me_

I think the basic corporate rules of signalling competence apply here.
Assuming that person was both technologically and socially incompetent (which
might be too harsh of a judgement), some people still have a way of getting
promoted into valuable positions by _appearing_ to do a great job.

 _> Sorry to hear about the rent situation - I take it that worked out OK
eventually?_

Thankfully that was some years back, I'm doing okay today :)

~~~
arethuza
"which might be too harsh of a judgement"

Not sure it is - the usual advice is "if in doubt then don't hire" so why not
apply that to employers as well as employees?

Of course, if there is an element of desperation involved then I can
understand it. But I have walked away from interview processes (or turned down
offers) when _I_ had some worries and never really regretted it.

Harshness both ways would seem to be fair.

------
seanwilson
> Engineers irrationally avoid hash tables because of the worst-case O(n)
> search time.

Not sure about other people here, but I haven't heard of any developers I work
with avoiding hash tables for reasons like this. I find most coders treat hash
tables as black boxes that offer instant lookups.

Also, I found from giving interviews that most developers can't explain at a
basic level how a hash table works, nor other fundamental data structures like
linked lists.
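
For reference, the basic mechanics fit in a few lines. A minimal separate-chaining sketch in Python (my illustration, not anything from the article):

```python
class ChainedHashTable:
    """Minimal separate-chaining hash table: an array of buckets,
    each bucket a list of (key, value) pairs."""

    def __init__(self, nbuckets=16):
        self.buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, key):
        # hash the key, then map the hash onto a bucket index
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # replace an existing entry
                return
        bucket.append((key, value))        # or chain on a new one

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.put("hash table", "keyed access")
assert table.get("hash table") == "keyed access"
```

There is no resizing here, so as entries pile up the chains grow and lookups drift towards O(n) - exactly the worst case the article is about.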

~~~
usrusr
Engineers avoiding hash tables, yes, that's one myth. Still waiting for the
other four promised in the headline. In the age of javascript everywhere I'd
say that hash tables have pretty much won.

I was curious about which corner of software technology the author was coming
from where hash tables are routinely questioned (embedded C maybe?), but even
the followup post does not quite explain it. (
[https://hughewilliams.com/2013/06/03/reflecting-on-my-hash-table-post/](https://hughewilliams.com/2013/06/03/reflecting-on-my-hash-table-post/)
, also references a previous hn discussion)

~~~
masklinn
> In the age of javascript everywhere I'd say that hash tables have pretty
> much won.

All of Lua, PHP, Ruby and Python are hash tables flying around, and pretty
much everything web is going to involve tons of hashtables whatever your
language, as HTTP headers, QS parameters and form-encoded data have arbitrary
named fields.

~~~
usrusr
Makes me wonder if today's Lisps are actually "Hasps" in disguise.

(unwritten rule of hn: don't post terrible puns like this before the majority
of voting is over)

------
acidbaseextract
A whole post about hash tables and he doesn't mention the real reason trees
are better than hash tables: lower variance!

> Hash tables become full, and bad things happen

The bad thing is not a crazy long probe or even rehashing to enlarge the
table—it's that some random individual insert is stuck eating the entire cost
of rebuilding the table. For many applications, I'm happy with a slightly
higher cost per operation in return for predictable performance.

> Hash functions are slow

Great, another unmentioned footgun - hash functions are hard. For example, the
hash function he provides doesn't use a prime number of buckets. Oops, non-
uniform input data in combination with unlucky bucket counts can generate high
numbers of collisions.

Writing a comparison function to put your data in a tree is pretty
stupidproof!
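
The bucket-count point is easy to see with patterned keys. A quick sketch (my numbers, not the author's): 100 keys that are all multiples of 16, dropped into a 16-bucket table versus a 17-bucket (prime) one.

```python
def bucket_histogram(keys, nbuckets):
    """Count how many keys land in each bucket, using the identity
    hash (hash(n) == n for small ints in CPython)."""
    counts = [0] * nbuckets
    for k in keys:
        counts[k % nbuckets] += 1
    return counts

keys = range(0, 1600, 16)          # 100 keys, all multiples of 16

# Power-of-two table: every single key collides into bucket 0.
assert bucket_histogram(keys, 16)[0] == 100

# Prime table: the same keys spread almost evenly (5-6 per bucket).
assert max(bucket_histogram(keys, 17)) <= 6
```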

~~~
whack
There are resizing algorithms out there which specifically avoid the single-
insertion-spike that you're worried about.

[http://stackoverflow.com/a/2200345](http://stackoverflow.com/a/2200345)

If you're working with small data sizes, the difference between O(1) and
O(lgN) isn't that big a deal, so you can use whatever you want. But once you
start working with larger datasets containing millions of elements, and the
only thing you care about is insertions and exact-match-lookups, it's hard to
justify using a tree over a hash map.
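
One common scheme - sketched below as a Python toy, not the exact algorithm from the linked answer - keeps the old table around after a resize and migrates a few entries per operation, so no single insert pays the whole rebuild cost:

```python
class IncrementalDict:
    """Toy incremental rehashing: entries drain from `old` into `new`
    a few at a time, amortising the resize across many operations."""

    MIGRATE_PER_OP = 4

    def __init__(self):
        self.old = {}   # entries still waiting to be migrated
        self.new = {}   # entries already in the "new" table

    def resize(self):
        # Finish any pending migration, then park everything in `old`;
        # subsequent operations drain it gradually.
        self.new.update(self.old)
        self.old, self.new = self.new, {}

    def _migrate_some(self):
        for _ in range(min(self.MIGRATE_PER_OP, len(self.old))):
            k, v = self.old.popitem()
            self.new[k] = v

    def put(self, key, value):
        self.old.pop(key, None)     # drop any stale copy
        self.new[key] = value
        self._migrate_some()

    def get(self, key):
        self._migrate_some()
        return self.new[key] if key in self.new else self.old[key]
```

A real implementation would migrate whole buckets of its own table rather than piggyback on two dicts, but the amortisation idea is the same.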

------
emodendroket
I'm all for learning how hash tables work, but is avoiding them really
widespread? Especially in dynamic languages, I feel like dynamic arrays and
hash tables are like 90% of the data structures used.

~~~
falcolas
No, dynamic languages use Maps and Dictionaries and Objects and Containers
and...

Sure, they're all hash tables under the hood, but if you go up to an average
JavaScript programmer and ask them the difference between a Hash Table and an
Object, I'm sure you'd get a surprising (to you) answer.

~~~
emodendroket
I mean, not even under the hood. Those are just different names for the same
thing.

~~~
masklinn
Due to the way JS objects are used, modern runtimes will generally try to use
structures for them (e.g. "hidden classes" in V8/Chrome) rather than "general-
purpose" hashmaps. The same goes for JS arrays, incidentally. They will often
need to deoptimise back to general-purpose hashmaps, but they'll try pretty
hard not to.

~~~
emodendroket
Oh. That's neat.

~~~
masklinn
Yep. V8's JIT will also try to specialise functions based on that, depending on
whether you always pass the same type (where type = hidden class), a limited
number of types, or "anything goes". Or at least it did a few years back.

------
saw-lau
Interesting that the author offers the chained hash table as an alternative
implementation - that was always how I was taught they worked back in school.
(1986?)

~~~
douche
Same here, graduated in the 2000s. The way hash tables are described in the
opening section sounds really naive, and I would hope that real
implementations aren't using that method.

~~~
emodendroket
AFAIK most real-world dictionary classes use linear probing.

~~~
gpderetta
std::unordered_map and friends use chaining (and they suffer from it).

------
josefx
The worst case was once used to attack Java based servers, using colliding
post parameters I think. Since then the Java HashMap sorts colliding entries
if the entries implement Comparable.

~~~
jkot
Also _String.hashCode_ is broken; it generates too many collisions. It cannot
be fixed, for historical reasons.

~~~
arethuza
Wasn't an early version _really_ broken - where it only looked at the first N
characters of a string when hashing it? Where N was actually pretty low.

~~~
jkot
Perhaps, but it made sense with 16 MB of RAM and a 486 CPU. The current
implementation uses a multiplier of 31, which is useless over 1M entries. The
same problem exists in Arrays.hash*() and many other places.
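
To make the weakness concrete, here is Java's `String.hashCode` transcribed into Python (the 31-multiplier polynomial hash, wrapped to a signed 32-bit int). Two-character collisions like "Aa"/"BB" compose into exponentially many longer colliding strings, which is what made the DoS attacks cheap:

```python
def java_string_hash(s):
    """Java's String.hashCode: h = 31*h + c over the code units,
    truncated to a signed 32-bit integer."""
    h = 0
    for c in s:
        h = (31 * h + ord(c)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

# "Aa" and "BB" collide (31*65 + 97 == 31*66 + 66 == 2112) ...
assert java_string_hash("Aa") == java_string_hash("BB") == 2112

# ... and any same-length concatenation of colliding blocks collides
# too, giving 2**n colliding strings of length 2n.
assert java_string_hash("AaAaBB") == java_string_hash("BBBBAa")
```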

------
Annatar
Hash arrays are great; I use them all the time in AWK. All searches using hash
indexed arrays in AWK are O(1).

I believe the biggest problem with using hashes as indices is getting people
to wrap their heads around the concept that the array index is a string (for
all intents and purposes), rather than the number of a field in an array (or
the address of a region in memory, which it eventually is anyway, once it
runs). For example, I used entire lines of text as the hash for an array. The
colleague I was showing this technique to was completely dumbfounded as to
how that worked. Why was I only storing the value "1" into Array["ORA-1234:
blablabla"]?

    
    
      # Pass 1 (the canonical file): remember every line we have seen.
      NR == FNR { Array[$0] = 1; next }
      
      # Pass 2 (the input): filter against the canonical lines.
      Array[$0] != 1 {
        #
        # This record is different, so print the entire record.
        #
        print;
      }
    

"How is that used to _filter out_ lines which are identical the ones we have
in the canonical file?" He seemed completely flabbergasted. He just couldn't
wrap his head around this concept, and ended up re-implementing the entire
thing using several lines of grep -v, sed, and if [ ...] then. I think there
were even several temporary files in the game, whereas the hash array
technique in AWK loaded both the canonical file and the input from stdin _into
memory_ , rather than using intermediate files.

------
krylon
One interesting alternative to hash tables that appears to get too little love
is the Judy tree:
[https://en.wikipedia.org/wiki/Judy_array](https://en.wikipedia.org/wiki/Judy_array)

They are not _quite_ as versatile, I think, but for simple cases where your
keys are strings or ints, they work just as well and are quite fast. The API
is very simple, too.

I don't do much programming in C these days (pretty much none at all), but I
used Judy in one toy project and have fond memories of it.

~~~
stuxnet79
Gotta love HN. I'm always learning new stuff. Thanks for posting this. The
"Judy Shop Manual"
([http://judy.sourceforge.net/application/shop_interm.pdf](http://judy.sourceforge.net/application/shop_interm.pdf))
looks very complicated so I don't know if I should take the time to fully
understand how they work, especially if they are not as versatile as hash
tables.

~~~
krylon
Judy's author appears to have poured bathtubs full of cleverness into his
brainchild to make it _fast_ at the cost of implementation complexity.

As long as you don't look behind the curtain, it's genius - an API so simple a
child could use it, and it's so fast... But try to understand how it works,
and your brain melts. Well, mine, at least.

Then again, I suspect the same can be said of what used to be called the STL,
or Boost. In a library that is meant to be used by lots of other projects, the
trade-off is legitimate.

------
greg7mdp
If you are interested in hash tables, check out my improved version of
Google's already excellent sparsehash at
[https://github.com/greg7mdp/sparsepp](https://github.com/greg7mdp/sparsepp).
Excellent performance, very low memory usage, grows as needed of course.

------
personjerry
This seems like Data Structures 101

------
stefs
potentially offtopic question here: i always wondered about multi-layered
hashtables, i.e. instead of a linked list for resolving collisions, use a
second hashtable with a different hash function (and then maybe a linked list
on the 3rd level).

i guess this might bring some potential improvement in the case of a hashtable
with too many collisions, but a prohibitively expensive overhead in all other
cases - i.e. it might only help a hashtable that was already broken.

~~~
rurban
This is a bad idea, since double hashing with open addressing has the same
benefits but is much faster: the indices are already in the cache, and there's
no need for the setup overhead of an additional hash table. Usually you go for
a variant of robin-hood, cuckoo or even quadratic probing, with the same
benefits but vastly better performance (all open addressing schemes). And if
the collision rate ("load factor"/"fill rate") is too high for your use case,
you go for lower load factors, down from 90% to 50%. It's still much faster
than using a completely separate 2nd data structure.

And if you need something better than a hash table, (pointers, fixed size
keys, ordered, ...) you usually go for patricia/crit-bit trees.

One popular database tried a RB-tree as second data structure once (forgot the
name and link. starts with R but not redis). Also see
[http://programmers.stackexchange.com/a/281785/27031](http://programmers.stackexchange.com/a/281785/27031),
DMDScript used that also.

------
hueving
He mentioned his own home-grown hash with the bit shifts. How does it compare
to common ones that use AES instructions when the hardware supports them?

~~~
rurban
Poor. See
[https://github.com/rurban/smhasher](https://github.com/rurban/smhasher)

------
stefs
i think the article omits one interesting point - cache locality. afaik this
is the one reason otherwise inefficient data structures might result in better
real-world performance for small data sets.

~~~
marcosdumay
Well, _asymptotic_ analysis is not very useful for small data sets.

