Dynamic Hash Tables (1988) [pdf]

antirez · on March 18, 2018

Hash tables for me were not just a data structure, I had some form of love for this way of organizing data. Can't forget the articles in the Dr. Dobb's issues about the new hashing schema, or how the new processors caches + access locality could change the game and so forth. But at this point, in 2018, I believe that hash tables should be rarely used.

Take Redis for instance. I used an hash table to start. Very early in the project I discovered that people were going to use it to store large amounts of data and rehashing times were a too big price to pay, so let's switch to an incremental rehashing schema, with all the complexity. Then there was to implement some form of iteration (see the Redis *SCAN family of commands). So let's fight against the complexity of an incrementally rehashing hash table in the context of a stateless cursor iterator. and so forth.

Basically a decent implementation of hash tables end being slower and more complex to handle than a good radix tree implementation (not to implement, radix trees when full featured are a nightmare to get right, if you have a stateful iterator). So once you implement a good radix tree (check for instance the "rax.c" file inside Redis source code, or simply https://github.com/antirez/rax), you start to see that, hey! you have all the advantages of an hash table, but it's faster, the key space is sorted, you can iterate it in many ways, and the memory usage is much smaller.

So I'm not sure at this point, other than "the simplest dictionary data structure I can implement just recalling from memory", if there are great applications for in-memory hash tables as data structures. They are perhaps starting to be more interesting for disk applications.

For all my new projects I imagine myself just copying rax.[ch] and use it, unless there is some very odd reason why an hash table will be preferable.

Veedrac · on March 18, 2018

The issues you give make sense when your datastructure grows very large or contains hash-unfriendly data, but it seems a plain hash table would still be appropriate in day-to-day scenarios. A simple int→int mapping is very fast in a hash table, and even string keyed hash tables are near unbeatable if you can reliably cache the hashes. The cases Radix trees win seem fairly small, as do the times you need incremental resizing.

proverbialbunny · on March 18, 2018

If anyone wants to know a bit more about Radix Trees (as I did) there is a comment on HN from 2008 that adds a bit to the picture here: https://news.ycombinator.com/item?id=101987

jnwatson · on March 18, 2018

Hash tables seem to work pretty well for cpython (the default Python interpreter). They are used for pretty much all name resolution including all attribute and variable access.