
Diving into the world of hash tables - majikarp
http://www.zeroequalsfalse.press/2017/02/20/hashtables/
======
VHRanger
If you care about performance, it's important to know the underlying
representation of a dictionary data structure.

For instance, std::unordered_map in C++ is built around linked lists of
key/value pair buckets. This means iterating through the keys or values is
very slow (lots of cache misses!), but insertion is fast. Retrieving a key is
O(log n), but a very slow O(log n) because of the cache misses.

Another implementation is to keep a sorted vector of keys and a co-indexed
vector of values. Loki::AssocVector and boost::container::flat_map do this,
for instance. Now insertion is slow, but iteration and retrieval are very
fast (a fast O(log n) by binary search with few cache misses). It's also
memory efficient, since there are no pointers between elements.

If you have one big dictionary you would use throughout an application, and
know the data up front, a good strategy is to reserve the necessary memory in
an array, fill it with keys/values, sort it once over the keys and coindex the
value array. Now you have a memory efficient and extremely fast dictionary
data structure.
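
The build-once strategy above can be sketched in Python (the thread discusses C++, but the structure is language-agnostic; `FrozenSortedDict` is an illustrative name, not a real library type):

```python
import bisect

class FrozenSortedDict:
    """Build once from known data: sort the keys, co-index the values,
    then look up by binary search over contiguous arrays."""

    def __init__(self, pairs):
        pairs = sorted(pairs)                   # one sort over the keys
        self.keys = [k for k, _ in pairs]       # contiguous, pointer-free
        self.values = [v for _, v in pairs]     # co-indexed with keys

    def __getitem__(self, key):
        i = bisect.bisect_left(self.keys, key)  # O(log n), few cache misses
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        raise KeyError(key)

d = FrozenSortedDict([("b", 2), ("a", 1), ("c", 3)])
```

In C++ the same idea would use a reserved `std::vector` sorted once over the keys.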

One other strategy any intermediate coder can implement is to keep two
unsorted co-indexed arrays. You don't even need a hash function for this. Now
iteration and insertion are extremely fast, and it is memory efficient, but
finding a key is a (fast) O(n) linear scan. So this is good for smaller
tables. In C++ you could implement it as a std::pair<vector<key>,
vector<value>>. If you need a quick small map in a function, this is often the
fastest data structure you can implement without too many headaches.
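
A minimal Python sketch of the two-unsorted-arrays idea (`TinyMap` is a made-up name; the point is just two co-indexed lists and a linear scan, no hashing):

```python
class TinyMap:
    """Two unsorted co-indexed lists; no hash function.
    O(1) insertion, O(n) lookup -- good for very small tables."""

    def __init__(self):
        self.keys, self.values = [], []

    def put(self, key, value):
        try:
            self.values[self.keys.index(key)] = value  # overwrite existing key
        except ValueError:
            self.keys.append(key)                      # cheap append on insert
            self.values.append(value)

    def get(self, key):
        return self.values[self.keys.index(key)]       # O(n) linear scan

m = TinyMap()
m.put("a", 1)
m.put("b", 2)
m.put("a", 10)   # overwrites the first "a"
```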

~~~
obstinate
> sorted vector of keys . . . reserve necessary memory and sort it . . .
> unsorted coindexed arrays . . .

Most of the things you mentioned are not hash tables, but members of a parent
concept, dictionaries. Hash tables all by definition involve some sort of
hashing of the key. The two main categories of hash table are chained hash
tables (std::unordered_map does this, at least in the implementations I'm
aware of) and open addressed hash tables, which use probing instead of
secondary data structures to resolve conflicts.
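
The open-addressing category described above can be sketched in a few lines of Python (a toy with linear probing; a real implementation would also resize when it fills up and handle deletion with tombstones):

```python
class OpenAddressedMap:
    """Toy open-addressed hash table with linear probing:
    on collision, step to the next slot rather than chaining."""

    def __init__(self, capacity=8):
        self.slots = [None] * capacity  # each slot: (key, value) or None

    def _probe(self, key):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)  # resolve conflicts by probing
        return i

    def put(self, key, value):
        self.slots[self._probe(key)] = (key, value)  # no resizing in this toy

    def get(self, key):
        slot = self.slots[self._probe(key)]
        if slot is None:
            raise KeyError(key)
        return slot[1]

h = OpenAddressedMap()
h.put("x", 1)
h.put("y", 2)
h.put("x", 3)   # overwrite via the same probe sequence
```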

~~~
VHRanger
You can implement the sorted vector of keys with a hash function instead of a
coindexed vector of values. It still keeps most of the properties we like,
especially if doing the coindex-sort operation is too expensive for some
reason.

~~~
obstinate
> You can implement the sorted vector of keys with a hash function instead of
> a coindexed vector of values

So you're saying you're going to hash the keys, then sort them according to
the hash, with tie breaking on the key itself? I'm not aware of any sorted
table that does this, but I'm sure some exist. I suppose you'd get something
of a win if N was large, and the keys had long common prefixes, and you didn't
care about the ordering property.

But in that case you'd probably use an actual hash table, not the algorithm
you just described. Unless there's something I'm missing.

~~~
VHRanger
Sorry, I misspoke. Lookup is always O(1) in a hash table. But this could be a
"weak dict", that is, one where you don't actually store the keys; as long as
you can reference a key through its hash you can look it up.

In the proper "data structure", we usually store the key and the value --
iteration through keys and/or values and/or pairs are probably supported
operations.

Finding a key in the structure (either with binary tree search, or binary
search on a sorted array, or linear lookup on an array) varies. So does
iteration and most other operations.

------
emeraldd
This is one of those data structures that everyone should try building at
least once, kind of like linked lists, etc. Semi-unrelated, I built a couple
of toy implementations a few years ago using the same basic ideas:

* This one is my original, written in something halfway resembling Scheme: [https://github.com/arlaneenalra/Bootstrap-Scheme/blob/master...](https://github.com/arlaneenalra/Bootstrap-Scheme/blob/master/lib/hash.scm)

* And this implementation is part of a half finished byte code scheme that I haven't touched in a few years. Another project I need to get back to. Interface: [https://github.com/arlaneenalra/insomniac/blob/master/src/in...](https://github.com/arlaneenalra/insomniac/blob/master/src/include/hash.h) Internals: [https://github.com/arlaneenalra/insomniac/tree/master/src/li...](https://github.com/arlaneenalra/insomniac/tree/master/src/libinsomniac_hash)

They were kind of fun to build and I'd recommend giving it a try, especially
if you have some skill but don't _think_ you have the chops. Getting a working
toy isn't really all that hard once you understand the principles.

------
rdtsc
It is really a basic data structure in most modern languages, either built in
or part of the standard library.

I could see this being published in the late 90's, but today, I don't know.

And just use the built-in ones; don't invent your own unless there is a very
good reason for it.

~~~
zzzcpan
Built-in implementations usually are kind of slow, waste memory, or have
broken iterators [1], even in modern languages. If any of those things are
important, it's better to do some research and maybe even invent your own.

[1] when you can't both iterate and insert items consistently in the same loop

~~~
iainmerrick
"Usually", really? Which languages are you thinking of, and have you
benchmarked them?

~~~
flukus
It's often true due to a few factors. One is that they are safe, whereas a
specific use case may not need the same level of safety. They generally
optimize for the general case as well, but in your own code you can optimize
for the very specific use case you have. One I ran into many years ago was
implementing my own singly linked list by adding the next pointer to another
piece of data. In this one specific case it was worth removing another layer
of indirection. I was still young though, so there were probably even better
ways of handling it.

I have only encountered these scenarios a small handful of times, but I'm not
a very low level developer.

------
taeric
Hash Table has basically been elevated to a general idea lately. Pretty much
anyone that is looking for an association between keys and values is using
what they will call a hash table. To that end, few people actually know
anything about how they are implemented. (And apologies if this comes across
as negative, I do not mean it as a value judgement.)

This was different back when you would pick something like an alist to store
associations. There the implementation stared you in the face. Same for the
times you had to implement your own hash table. I don't exactly yearn for
those days. Though, I am curious if alists actually win in speed for a large
number of smaller hash tables.

~~~
yxhuvud
Depends on implementations of the alist and of the hash table. It is even
fully possible that the small hash table will use an alist for the cases with
few (and small) elements.

~~~
taeric
Fair. I was making a huge assumption that the alist would be implemented as
you statically see it in code.

My point was supposed to be that an alist really has an obvious
implementation, whereas a hashtable actually does not. My main objection being
that there is a ton of glossing over what goes into an actual hashtable. While
I would expect someone to be able to do a basic alist implementation, I have
grown away from expecting folks to do a basic hash table.

------
hprotagonist
"this is a data structure that is at the guts of basically every python
object. Classes are dicts. Modules are dicts. Packages are probably dicts.
Don't worry though, it's underrated!"

@_@

~~~
whatshisface
Python has its own very cleverly optimized dict implementation:

[http://pybites.blogspot.com/2008/10/pure-python-dictionary-i...](http://pybites.blogspot.com/2008/10/pure-python-dictionary-implementation.html)

~~~
hprotagonist
Hettinger's updates at PyCon this year: it's better!
[https://www.youtube.com/watch?v=npw4s1QTmPg](https://www.youtube.com/watch?v=npw4s1QTmPg)

------
Cyph0n
Quick question: are there special hash functions that are optimized for use in
hash tables? Or do typical hash table implementations in e.g. Python just use
standard hash functions like MD5?

 _Edit:_ It's 100% clear now. Thanks for the great answers everyone!

~~~
robmccoll
Typically the hash functions you're familiar with from cryptographic or
data-integrity use (the SHA family, MD family, etc.) don't make good hash
table choices: they produce hashes much larger than needed and are slow to
compute, because they trade speed for cryptographic properties (extremely low
collision rates, no information leakage about inputs, difficulty of guessing
inputs). When picking a hash function for a hash table, you want a function
that makes a hash just big enough, with low enough collisions, while still
being fast and easily dealing with variable-length keys. This could be
something as simple as byte-wise XOR or addition with some shifting as you
iterate over the key, followed by a mod or even a bitwise AND mask to pick an
index.
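
The cheap, non-cryptographic style of function described above can be sketched like this (djb2-style multiply-and-add, shown as one classic illustrative choice, not the one any particular table uses):

```python
def simple_hash(key: bytes) -> int:
    """djb2-style: shift-and-add each byte. Fast, decent spread,
    but no cryptographic properties at all."""
    h = 5381
    for b in key:
        h = ((h << 5) + h + b) & 0xFFFFFFFF  # h*33 + b, kept to 32 bits
    return h

def bucket_index(key: bytes, table_size: int) -> int:
    """Pick a slot. If table_size is a power of two, a bitwise AND
    mask replaces the more expensive mod."""
    return simple_hash(key) & (table_size - 1)
```

Note this is exactly the sort of predictable function dom0's reply warns about: without a random seed, an attacker can precompute colliding keys.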

~~~
dom0
However, collision resistance must be still quite good for use in a general-
purpose hash table or a HT that is possibly exposed to attackers, otherwise
denial-of-service attacks become very easy.

Many "modern" implementations (Python, Ruby, Perl, Rust, Redis, ...) use
SipHash with a random seed for this very reason.

------
booleandilemma
"They didn't teach me about these in my coding bootcamp"

------
sjroot
"An underrated data structure"

Uh......

------
skybrian
This article explains "2-Level Hashing" which is apparently "dynamic perfect
hashing." While interesting, I'm wondering how much it's actually used? Or is
it just of theoretical interest?

[1]
[https://en.wikipedia.org/wiki/Dynamic_perfect_hashing](https://en.wikipedia.org/wiki/Dynamic_perfect_hashing)

------
relics443
My data structures professor made a habit of proclaiming every few classes:
"Give me a hash table, and I could rule the world"

------
spankalee
I wonder how hash tables are "underrated". They're probably _the_ data
structure, if there is one.

------
afinlayson
I think 90% of technical interview questions have a hashtable in the answer.
Underrated wouldn't be the word I'd use.

~~~
Izmaki
No they're not. And if they are, they are more advanced than this brief
overview.

------
ribs
This word "extremely" has me skeptical from sentence one. Maybe I'm jaded.

------
mhh__
wtf i love hash tables now! ;)

------
mabbo
I think when the author says "underrated", what they mean is "I didn't realize
how important this is". Hash tables are used everywhere, by everyone, for a
lot of things.

Maybe they don't give it enough time in school for people to realize it is the
king of practical software development.

~~~
rectangletangle
Nearly every expression in high-level languages relies on multiple hash
lookups. This is part of the reason these languages are regarded as "slow." I
suppose you could use a tree in its place and get reasonable performance.
However, the hash table's nature allows you to pretty much throw more memory
at the problem in exchange for speed (though this is hardly unique to this
particular data structure).

For instance `a = foo.bar.baz` in Python involves 3 hash gets (local scope,
foo scope, then bar scope), and a single set operation (local scope). This is
part of the reason Python programs can be optimized by assigning a deep
attribute lookup to the local scope outside of a loop's scope, and it will
yield improved performance relative to doing the deep attribute lookup inside
the loop's scope.

    
    
      a = foo.bar.baz
      for _ in range(20):
          print(a)
    

vs

    
    
      for _ in range(20):
          print(foo.bar.baz)

~~~
bogomipz
>"For instance `a = foo.bar.baz` in Python involves 3 hash gets (local scope,
foo scope, then bar scope), and a single set operation (local scope)"

Can you explain where exactly and why a set operation is performed? Thanks.

~~~
rectangletangle
When the name `a` is assigned the value of `baz`, Python is setting a key name
`a` in the local dictionary (hash table) `locals()`.

Basically `a = 1` is syntactic sugar for `locals()['a'] = 1`

    
    
      >>> a
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      NameError: name 'a' is not defined
      >>> locals()['a'] = 113
      >>> a
      113
      >>>
    

One interesting side-effect of this is you can assign names that are not valid
Python syntax explicitly.

For example:

    
    
      >>> locals()['foo-bar'] = 1
      >>> locals()['foo-bar']
      1
      >>> foo-bar
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      NameError: name 'foo' is not defined
      >>> 
    

The name `foo-bar` can't be literally referenced, because the interpreter
attempts to interpret it as the subtraction operation `foo - bar`.

~~~
hueving
locals() is presented to the user as a dictionary, but is that the way CPython
actually works with it internally? I've run into weird GC issues that imply
it's not a normal dictionary and it's just presented to the user as one for
programmatic access.

~~~
rectangletangle
That's a good question, I'm pretty sure it's not a normal dictionary. However,
I'd have to go through the CPython source to confirm. Maybe someone who's more
familiar with CPython's implementation will chime in.

------
fenomas
Mods, can we get a title change? As I write this every comment is taking issue
with the "underrated", which isn't claimed in the article.

~~~
OJFord
Though, in this case, changing it from the original does seem warranted. (This
was just another bad choice...)

~~~
iainmerrick
What's wrong with leaving it unchanged? The blog post title seems fine.

------
deckarep
It's actually sets that are underrated. All languages should come with a
reference implementation.

Hashtables aka Maps aka Dictionaries aka Associative arrays are just fine.

~~~
rdtsc
Though once you have hashes/maps/dicts, sets are within reach: make a set
from a hash where the keys are the elements and the values are just true or 1
or something like that.

But I think you probably meant having and using set operations effectively in
day-to-day tasks, as in "make 2 sets and do a set difference operation"
instead of "do a for loop on the first hash, check if each key is in the
second, then put results in an accumulator."

Another thing is to think about set of sets. Can that be useful sometimes?
Implementing that is slightly trickier. You'd need to be able to get a hash of
a set. Python has frozenset
[https://docs.python.org/3/library/stdtypes.html#frozenset](https://docs.python.org/3/library/stdtypes.html#frozenset).
I've used those on occasion.

Then of course there is Erlang sofs (sets of sets) module. Stumbled on it by
accident. Oh my, it comes complete with an introduction to set theory and
relational algebra:

[http://erlang.org/doc/man/sofs.html](http://erlang.org/doc/man/sofs.html)

It just struck me as so out of place with the rest of the standard library
modules. I'd like to know its history.

~~~
bogomipz
>"Though once you have hash/maps/dicts set are with reach by making sets from
hashes were keys are the element and values are just true or 1 or something
like that."

Can you elaborate on how you can derive a set from hashes by using values of
True or 1? Might you have a link? Thanks.

~~~
rdtsc
> Can you elaborate on how you can derive a set from hashes by using values of
> True or 1? Might you have a link? Thanks.

Sure. Np!

I mean that a simplified set is just a hash where the elements of the set are
the keys of the hash table and the values can be anything; I used 1 or True as
an example.

As in adding an element would be:

    
    
       my_dict[element] = 1
    

Then membership check is:

    
    
       if element in my_dict
    

Then removal is deleting:

    
    
       del my_dict[element]
    

and so on.

In other words, the reason sets are sometimes not explicitly there is because
they are easy to implement on top of existing data structures.

Basic operations like union, difference, intersection between two sets can be
done with a few simple for loops.
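
As a sketch of those "few simple for loops" (illustrative helper names, representing each set as a dict with dummy values, per the scheme above):

```python
def union(a: dict, b: dict) -> dict:
    out = dict(a)              # start from a copy of the first "set"
    for k in b:
        out[k] = 1             # add everything from the second
    return out

def intersection(a: dict, b: dict) -> dict:
    return {k: 1 for k in a if k in b}

def difference(a: dict, b: dict) -> dict:
    return {k: 1 for k in a if k not in b}

s1 = {"a": 1, "b": 1}
s2 = {"b": 1, "c": 1}
```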

But like I mentioned in another comment, there is one interesting aspect to
sets (and hashes): the elements now have to be hashable. That kind of depends
on how mutability and identity work in the particular language.

~~~
bogomipz
Thanks for the clear explanation, this makes total sense. Cheers.

