
Hash tables - milsebg
https://www.data-structures-in-practice.com/hash-tables/
======
dragontamer
High-performance hash table design is pretty interesting, because modern CPU
architectures have grossly changed the performance characteristics.

In the 90s, it was (probably) faster to do linked-list / chaining as a
collision resolution mechanism. But because L1 caches are so fast on today's
machines (and DDR4 remains very high latency), linear probing seems to be the
winner on modern machines.

Consider that DDR4 RAM takes maybe 200 clock cycles to access (50ns on a 4GHz
modern machine), while L1 cache can be accessed within 4 clock cycles. Fetching
an entire 64-byte cache line is like probing 8 positions of linear probing at once.

The if/else statements will execute all within single-digit nanoseconds
(modern CPUs run at 4GHz and can execute more than 1 instruction per
clock tick... up to 6 instructions per tick if all instructions are in the uop
cache and are properly branch-predicted), long before the 2nd location (ex:
the 2nd member of a linked list) can even be accessed.

Furthermore, modern CPUs will perform pre-fetching when they notice code
traversing through memory linearly. As such, the 200-clock cycle latency is
"pipelined" and performed in parallel in practice. Your "effective latency"
drops significantly when you traverse linearly.

EDIT: And finally: staying on the same DDR4 row (roughly 1024 bytes) means
that you only have to perform a column-read command over and over again. If
you leave a DDR4 row, the RAM needs to precharge, open the new row, and then
perform a new column read. That takes roughly 3x longer than reading from an
already-open row.
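To make that concrete, here's a minimal sketch of a linear-probing table in C (the layout, names, and mixer are mine, invented for illustration; a real table would also resize):

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 64            /* power of two keeps the modulo cheap */
#define EMPTY_KEY  0             /* key 0 is reserved to mean "empty slot" */

typedef struct { uint64_t key, value; } slot_t;

/* Any decent 64-bit mixer works here; this one is a multiply + xor-shift. */
static uint64_t hash_u64(uint64_t x) {
    x *= 0x9e3779b97f4a7c15u;
    return x ^ (x >> 32);
}

/* Forward scan from the home slot. Consecutive slots share cache lines,
 * so the hardware prefetcher stays ahead of us. Returns NULL if absent. */
static slot_t *lp_find(slot_t *t, uint64_t key) {
    for (uint64_t i = hash_u64(key) % TABLE_SIZE; ; i = (i + 1) % TABLE_SIZE) {
        if (t[i].key == key)       return &t[i];
        if (t[i].key == EMPTY_KEY) return NULL;  /* a hole ends the chain */
    }
}

/* Insert assumes the table is never full (a real table would resize). */
static void lp_insert(slot_t *t, uint64_t key, uint64_t value) {
    uint64_t i = hash_u64(key) % TABLE_SIZE;
    while (t[i].key != EMPTY_KEY && t[i].key != key)
        i = (i + 1) % TABLE_SIZE;
    t[i].key   = key;
    t[i].value = value;
}
```

The whole lookup touches one short, contiguous run of memory, which is exactly the access pattern the paragraphs above are describing.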

~~~
kazinator
Firstly, we have L2 caches, not only L1.

Secondly, it is possible for all of a linked list to in fact be in L1.

Thirdly, if by "linear probing" you are at all referring to "open addressing",
then there are some disadvantages to consider there.

If we suspect that linked lists are performing badly, the reasonable thing
would be to stick with the chained hash table, but replace the chains by
little vectors. Essentially, "linear probing" but sideways out of the table,
into little sub-tables.

Issues with open addressing are issues like clustering if linear probing is
used. If we use quadratic probing, then the table size has to be prime,
otherwise we may not be able to insert a key into the table even if it
is nowhere near full. When we delete keys, we have to leave "tombstone"
entries in their place, so that the linear probing can continue past them to
find other items. Things of that sort.
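A sketch of what those "little vectors" might look like (names and growth policy invented for illustration):

```c
#include <stdint.h>
#include <stdlib.h>

/* Each bucket is a small contiguous array instead of a linked list, so
 * walking a collision chain touches one allocation, not one node per hop. */
typedef struct {
    int       len, cap;
    uint64_t *keys;
} bucket_t;

static void bucket_push(bucket_t *b, uint64_t key) {
    if (b->len == b->cap) {                       /* grow geometrically */
        b->cap  = b->cap ? b->cap * 2 : 4;
        b->keys = realloc(b->keys, (size_t)b->cap * sizeof *b->keys);
    }
    b->keys[b->len++] = key;
}

static int bucket_contains(const bucket_t *b, uint64_t key) {
    for (int i = 0; i < b->len; i++)              /* linear scan of a tiny array */
        if (b->keys[i] == key) return 1;
    return 0;
}
```

Scanning a bucket is then the same cache-friendly forward walk as linear probing, just "sideways" out of the table.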

~~~
dragontamer
> Issues with open addressing are issues like clustering if linear probing is
> used.

Robin Hood hashing more or less solves that problem.

> When we delete keys, we have to leave "tombstone" entries in their place, so
> that the linear probing can continue past them to find other items.

Actually no. Even ignoring Robin Hood Hashing, you just swap a later element
into place.

    
    
        // Find the slot actually holding `value` (probe from hash(value);
        // the key may have been displaced past its home slot on insert)
        idx = find_slot(value);
        array[idx] = empty;
    
        for(int i = (idx+1) % TABLE_SIZE; array[i] != empty; i = (i+1) % TABLE_SIZE){
            int home = hash(array[i].value) % TABLE_SIZE;
            // Move back only if this element's home slot lies cyclically
            // outside (idx, i], i.e. it is allowed to occupy slot idx
            if((i - home + TABLE_SIZE) % TABLE_SIZE >= (i - idx + TABLE_SIZE) % TABLE_SIZE){
                array[idx] = array[i];
                array[i] = empty;  // the vacated slot becomes the new hole
                idx = i;           // Repeat for the new location
            }
        }
    

I just typed the above in like 2 minutes, so there's probably a bug. But it's
probably "correct enough" that you can get the concept. There's no tombstones
needed for linear probing (and only linear probing).

Conceptually, you can see that array[i] is simply being "re-hashed" into the
hash table. Donald Knuth in "The Art of Computer Programming" proves that the
above procedure (well... the correct bug-free version at least) is equivalent
to clearing out the hash table and rehashing all elements. Except of course,
the above procedure is way faster.

\--------

Robin Hood hashing solves the problem in a different way. See this other guy's
blog post for details: [https://www.sebastiansylvan.com/post/robin-hood-
hashing-shou...](https://www.sebastiansylvan.com/post/robin-hood-hashing-
should-be-your-default-hash-table-implementation/)
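For the curious, the heart of Robin Hood insertion fits in a few lines. This is a sketch with an identity "hash" (key % table size) purely to keep it short, not production code:

```c
#include <stdint.h>

#define RH_SIZE  64
#define RH_EMPTY UINT64_MAX

/* How far the key sitting in slot i is from its home slot. The "hash"
 * here is just key % RH_SIZE, purely to keep the sketch short. */
static uint64_t probe_dist(uint64_t key, uint64_t i) {
    return (i + RH_SIZE - key % RH_SIZE) % RH_SIZE;
}

/* Robin Hood insert: whenever the slot's current resident is closer to
 * its home than we are to ours, it is "richer" -- we take the slot and
 * carry the evicted key forward. This caps the variance of probe lengths. */
static void rh_insert(uint64_t *t, uint64_t key) {
    uint64_t i = key % RH_SIZE, dist = 0;
    for (;;) {
        if (t[i] == RH_EMPTY) { t[i] = key; return; }
        uint64_t resident = probe_dist(t[i], i);
        if (resident < dist) {                    /* rob from the rich */
            uint64_t evicted = t[i];
            t[i] = key;
            key  = evicted;
            dist = resident;
        }
        i = (i + 1) % RH_SIZE;
        dist++;
    }
}
```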

~~~
kazinator
What you have to do is walk the linear (or whatever) probe sequence completely
to find two elements: the to-be-deleted element, and the last element in the
sequence. Then if the to-be-deleted element exists, and is the same as the
last element, we just mark that element as a free slot. Otherwise, we move the
last element over the to-be-deleted element and mark last element's slot free.

Of course, "the last element" has to be one which belongs to the original
starting hash slot S. We must check that its hash value modulo the table size
is S. The first element that we encounter which is not followed by an occupied
slot is not necessarily that last element.

~~~
dragontamer
> Otherwise, we move the last element over the to-be-deleted element and mark
> last element's slot free.

That doesn't work at all.

Consider a hash table of size 5: [Foo, Bar, 0, 0, 0], where 0 represents
"empty" locations. Assume "Foo" is in "slot#0" (0-indexed arrays. Note that
Knuth in The Art of Computer Programming works with 1-indexed arrays)

Let's say we delete Foo. Bar does NOT necessarily want to go into location #0.
For example, hash(Bar) might == 1 (in the case of linear probing). In that
case, we want to leave Bar exactly where it is.

That's why the code needs the home-slot comparison in its if-statement.
However, that conditional seems impossible to write in the case of quadratic
(or other forms) of probing. The check ONLY works for linear probing.

\----------

Consider this other pattern (still quadratic probing): [Foo, Bar, 0, 0, Foo2],
where Hash(Foo) == Hash(Foo2) == 0.

While Hash(Bar) == 1.

Let's say you want to delete Bar. How do you know to "move" Foo2 into slot #1?
You don't. There's no easy pattern to check for here. Quadratic probing
requires the tombstone method.

\----------

The code I presented is very subtle (subtle enough that I probably have a few
bugs in it). It works because linear probing has a very predictable sequence.

Quadratic probing, and other forms of probing (ex: double hashing, Cuckoo
hashing, etc. etc.) are very irregular, and hard to figure out if an object
"should" be moved back.

\------------

There's a lot of ways of thinking about this problem. But I'm pretty sure the
description you gave is incorrect.

My preferred way of understanding the problem is as follows:

1\. Upon any deletion, you want to perform a set of operations that is
equivalent to rebuilding the Hash Table from scratch.

2\. The procedure I listed before is provably equivalent to rebuilding the
hash table from scratch. (Most elements stay where they are.) At least, if I
didn't accidentally write a bug into it...

\--------

> The first element that we encounter which is not followed by an occupied
> slot is not necessarily that last element.

I disagree.

An empty slot, __in the case of linear probing__, guarantees that you've
finished the chaining sequence.

That's why linear probing is best. Because you have simple guarantees for
which objects are part of a "collision chain", and which ones aren't.

This means that deletion under linear-probing can be implemented efficiently.
But deletion under other probing schemes (ex: Quadratic) requires the
inefficient "tombstone and vacuum" procedure you were describing.

There's a lot of subtleties at play here which makes linear probing the best.
And a lot of textbooks get these details wrong (ex: the Cormen book!!).

------
saagarjha
> A hash function is a one-way function that produces the same output for a
> given input. It’s one-way in the sense that, given the hash function, it
> should be difficult to convert the output back to the input without trial
> and error.

Usually I like to say that they convert an input into a uniform distribution
of fixed-length output, since reversibility isn't really important if you're
using it for a non-cryptographic purpose such as this one.

~~~
barbegal
Yeah, sometimes the best hash functions for hash tables are easily reversible.
And hash functions that can't be reversed faster than brute force can still be
really bad if their output is biased, making collisions more likely.

~~~
AdamN
It's an important distinction that many people forget:

* Hash tables need speed and uniformity

* Cryptographic hashes need irreversibility

~~~
dragontamer
Cryptographic hashes also need speed and uniformity. They just need
irreversibility significantly more.

GF(2) space is incredible for both speed and uniformity. IIRC, the fastest x86
hash functions just use the AESENC instruction (note: AESENC has a throughput
of 1 instruction per clock cycle on Intel, and 2 instructions per clock cycle
on AMD. It's an incredibly fast primitive).

~~~
jcranmer
That's the throughput you're citing; they still have 4-8 clock cycles of
latency, according to
[https://software.intel.com/sites/landingpage/IntrinsicsGuide...](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=132,133,227,236,233&othertechs=AES)

~~~
dragontamer
True, but in practice, throughput is what you should be thinking about as a
low level performance programmer.

CPUs can execute many, many instructions in parallel. If all your data
fits inside the L1 cache (4 clocks of latency), it's actually pretty easy to
achieve 2 instructions per clock or more!

Furthermore, modern CPUs are out-of-order processors. So the processor will
automatically execute independent instructions to "fill up your latency", at
least to some extent.

CPUs even have enough resources to hide main-memory fetches (the reorder
buffer is over 200 entries deep on Skylake, enough to cover the 200+ clocks of
latency of a DDR4 memory read or write). As long as you have "enough
independent work to do", it's not too bad. Compilers usually expose
independent chunks of work as they unroll loops, for example.

In my experience, the loop accounting (for(int i=0; i<100; i++)) will all
execute inside that latency, in parallel with the work inside the loop. So
there's almost always work to do, at least for the ~5 to 10 clocks' worth of
"misc" instructions in any bit of code.

The hard part is coming up with work to do for ~50ns of latency (ex: DDR4
Reads or Writes).
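One common way to scare up that work is to batch hash-table lookups and software-prefetch each home slot before probing any of them. A sketch (names invented; __builtin_prefetch is a GCC/Clang extension, and the modulo stands in for a real hash):

```c
#include <stddef.h>
#include <stdint.h>

#define BATCH_MAX 16

/* Compute every home slot and issue prefetches first, THEN probe. The N
 * memory fetches overlap instead of serializing at ~200 cycles each.
 * Key 0 marks an empty slot; keys[k] % table_size stands in for a hash. */
static void batch_lookup(const uint64_t *table, size_t table_size,
                         const uint64_t *keys, int n, int *found) {
    size_t home[BATCH_MAX];
    for (int k = 0; k < n; k++) {
        home[k] = keys[k] % table_size;
        __builtin_prefetch(&table[home[k]]);  /* a hint; a no-op at worst */
    }
    for (int k = 0; k < n; k++) {
        size_t i = home[k];
        found[k] = 0;
        while (table[i] != 0) {
            if (table[i] == keys[k]) { found[k] = 1; break; }
            i = (i + 1) % table_size;
        }
    }
}
```

By the time the probe loop touches each slot, its cache line is (hopefully) already in flight or resident.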

~~~
gpderetta
Actually no, latency is usually a bottleneck before throughput is.

Edit: for example, when accessing a hash table, the hash computation is on the
critical path.

~~~
dragontamer
> Edit: for example when accessing an hash table, the hash computation is in
> the critical path.

Hmmm... I think I'm biased a bit because of something I'm writing recently
where different iterations of a loop were independent.

In this case, you're right. The hash calculation is on the critical path and
therefore is latency bound.

~~~
gpderetta
> Hmmm... I think I'm biased a bit because of something I'm writing recently
> where different iterations of a loop were independent.

That's a great place to be in :D.

BTW, I haven't tried to implement a hash function in a while (I remember
playing with carryless multiplication), but IIRC 6 clock cycles is not too
bad.

~~~
dragontamer
> BTW, I haven't tried to get implement an hash function in a while

Multiply RAX, CONST1 / bswap RAX / XOR RAX, CONST2 / Multiply RAX, CONST3.

12 cycles of latency. CONST1 and CONST3 must be odd (bottom bit is 1). Pick
CONST1, CONST2, and CONST3 out of /dev/urandom.
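In C, that sequence looks roughly like this (a sketch; the constants below are placeholders I made up, not vetted ones, so pull your own from /dev/urandom):

```c
#include <stdint.h>

/* mul / bswap / xor / mul. The first multiply mixes low bits upward, the
 * bswap drags the well-mixed high bytes back down, and the second multiply
 * spreads them across the word again. CONST1 and CONST3 must be odd so
 * the multiplies are invertible mod 2^64. */
static uint64_t mix64(uint64_t x) {
    x *= 0x9e3779b97f4a7c15u;        /* CONST1 (odd) */
    x  = __builtin_bswap64(x);       /* GCC/Clang builtin emitting BSWAP */
    x ^= 0x8a183895eeac1537u;        /* CONST2: also makes hash(0) != 0 */
    x *= 0xd6e8feb86659fd93u;        /* CONST3 (odd) */
    return x;
}
```

Since every step is a bijection, distinct inputs always produce distinct outputs.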

\--------

BTW: This is exactly why latency didn't matter: the 12 cycles of latency here
are basically independent between loop iterations. Each new iteration cuts the
dependency on RAX, so register renaming gives the next iteration's "RAX" a
fresh physical register that executes independently.

\--------

AESEnc is a good baseline, but you need 2 or 3 iterations of it to mix well.
AESEnc also works on 128-bit vector registers, but most people want something
that works on the 64-bit registers.

If your data was already in XMM registers, AESEnc / AESDec will be great.
Otherwise, 64-bit multiply is really good at shuffling those bits around. Take
RAX (64-bit result), EAX (32-bit result), AX (16-bit result), or AL (8-bit
result) as needed.

~~~
nullc
If you are multiplying (or using any other t-function) for mixing, you
generally want the higher bits, rather than the lower bits.

~~~
dragontamer
Think about what bswap does:
[https://c9x.me/x86/html/file_module_x86_id_21.html](https://c9x.me/x86/html/file_module_x86_id_21.html)

After Multiply / BSwap / Multiply, all 64 bits will be "high quality". The XOR
"shifts the zero" (I don't like the fact that Hash(0) == 0, personally), but
honestly I haven't been able to measure a statistical change in my testing.
So I guess the XOR is optional.

~~~
nullc
er. confession: I somehow only read the last two sentences in your post!

------
sdegutis
Also [http://craftinginterpreters.com/hash-
tables.html](http://craftinginterpreters.com/hash-tables.html)

------
rurban
> SipHash is a relatively fast hash function.

Oh my! SipHash is by far the slowest of all practical hash functions.
[https://github.com/rurban/smhasher#smhasher](https://github.com/rurban/smhasher#smhasher)

~~~
pubby
Those benchmarks are for hashing data in the billions of bytes range. Hash
tables typically use data in the 1-20 byte range. Pretty big difference!

As it turns out, hash functions optimized for billions of bytes don't work so
well when you use them on only a few bytes. There's just too much startup time
and too many irrelevant branches and too much bloat for the icache to handle.
That's why most hash tables use simpler algorithms like FNV1A or SipHash,
which are faster for small data.

~~~
zzzcpan
> Those benchmarks are for hashing data in the billions of bytes range. Hash
> tables typically use data in the 1-20 byte range.

No, you misunderstood the purpose of the project. From the benchmarks,
SipHash:

    
    
       Small key speed test -    1-byte keys -   110.91 cycles/hash
       Small key speed test -    2-byte keys -   110.32 cycles/hash
       ...
       Small key speed test -   20-byte keys -   164.79 cycles/hash
    

wyhash:

    
    
       Small key speed test -    1-byte keys -    18.00 cycles/hash
       Small key speed test -    2-byte keys -    18.00 cycles/hash
       ...
       Small key speed test -   20-byte keys -    21.00 cycles/hash
    

The whole purpose of smhasher is to help choose hash functions for hash tables
and similar implementations.

