
Hash tables with open addressing - tosh
https://bugs.ruby-lang.org/issues/12142
======
masklinn
> I see a tendency to move from chaining hash tables to open addressing hash
> tables due to their better fit to modern CPU memory organizations. CPython
> recently made such switch

CPython recently switched to the naturally ordered dicts suggested by Raymond
Hettinger in 2012[0] (PyPy had already implemented the scheme in early
2015[1]), but AFAIK it has _never_ used chaining. In the latest revision of
dictobject.c you can find a note/comment[2] saying:

> The basic lookup function used by all operations. This is based on Algorithm
> D from Knuth Vol. 3, Sec. 6.4. Open addressing is preferred over chaining
> since the link overhead for chaining would be substantial (100% with typical
> malloc overhead).

This is attributed to guido@1256[3], and if you follow the link you end up
back in March 1993 in the commit "Generalized version of dictionaries, with
compatibility hacks." when the file was created...

[0] [https://mail.python.org/pipermail/python-
dev/2012-December/1...](https://mail.python.org/pipermail/python-
dev/2012-December/123028.html)

[1] [https://morepypy.blogspot.com/2015/01/faster-more-memory-
eff...](https://morepypy.blogspot.com/2015/01/faster-more-memory-efficient-
and-more.html)

[2]
[https://hg.python.org/cpython/annotate/default/Objects/dicto...](https://hg.python.org/cpython/annotate/default/Objects/dictobject.c#l663)

[3]
[https://hg.python.org/cpython/annotate/7aa9613ffd36/Objects/...](https://hg.python.org/cpython/annotate/7aa9613ffd36/Objects/dictobject.c#l103)

~~~
tomnipotent
Raymond Hettinger gave a presentation a few weeks ago on the evolution of the
Python dict implementation - one of the most insightful videos I've seen in a
long time.

[https://www.youtube.com/watch?v=p33CVV29OG8](https://www.youtube.com/watch?v=p33CVV29OG8)

------
fbernier
A good tldr of the resulting change here:
[https://blog.heroku.com/ruby-2-4-features-hashes-integers-
ro...](https://blog.heroku.com/ruby-2-4-features-hashes-integers-
rounding#hash-changes)

~~~
jlas
> The reason open addressing is considered open is that it frees us from the
> hash table. The table entries themselves are not stored directly in the bins
> anymore, as with a closed addressing hash table, but rather in a separate
> entries array, ordered by insertion.

> Open addressing uses the bins array to map keys to their index in the
> entries array.

Am I wrong or is this generally not true? Open addressing is about storing the
entries directly in the bins [1].

The new implementation is still open addressing, sure, but the bins contain an
index into a separate entries array, presumably to keep the size of the bins
array compact.

[1]
[https://en.wikipedia.org/wiki/Hash_table#Open_addressing](https://en.wikipedia.org/wiki/Hash_table#Open_addressing)

~~~
masklinn
> Am I wrong or is this generally not true? Open addressing is about storing
> the entries directly in the bins.

Correct.

> The new implementation is still open addressing, sure, but the bins contain
> an index to a separate entries array, presumably to keep the size of the
> bins array compact.

Yes, the original proposal for CPython[0] also noted improvements in iteration
speed, since the iterator no longer keeps branching on the empty/full cells of
the sparse array (it can just walk the dense one, which is mostly or entirely
full depending on the implementation), and improvements in resizing
performance.

It also allows further gains, e.g. PyPy switches the width of the values in
the sparse array depending on dict size (so under 256 _actual items_ the
sparse array will be 1 byte/item, then 2 bytes until 2^16, etc.)[1].

And it has the advantage of being _naturally ordered_ (that is, entries will
be iterated in original insertion order) at no additional cost (modulo how
_removals_ are implemented), whereas in older systems you'd need an additional
doubly-linked list for that; IIRC that was the case for both PHP and Ruby (the
base Python dict didn't preserve or guarantee ordering).

[0] [https://mail.python.org/pipermail/python-
dev/2012-December/1...](https://mail.python.org/pipermail/python-
dev/2012-December/123028.html)

[1] [https://morepypy.blogspot.fr/2015/01/faster-more-memory-
effi...](https://morepypy.blogspot.fr/2015/01/faster-more-memory-efficient-
and-more.html)

------
rawnlq
Is this probing scheme better than cuckoo hashing?

Also, Java hashmaps used to use pure separate chaining but switched to a
red-black tree per bucket once a bucket grows large[1]. The main reason was to
prevent attackers from choosing worst-case inputs where everything gets hashed
into one place and search degenerates to O(n).

It seems like with the new change Ruby is still vulnerable to these hash-DoS
attacks.

[1][http://grepcode.com/file/repository.grepcode.com/java/root/j...](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/HashMap.java#165)

~~~
jay-anderson
Java still uses linked-list chaining until a threshold is met (currently 8),
at which point it switches the bucket to a tree. I wish Java could switch to
open-addressing hashing, but without value types it wouldn't help much
(there's work on adding them:
[http://cr.openjdk.java.net/~jrose/values/values-0.html](http://cr.openjdk.java.net/~jrose/values/values-0.html)).

Linear probing would likely be better than cuckoo hashing from a CPU-cache
perspective (all probed locations are close in memory). There are probing
schemes which result in a low number of checks and allow for higher load
factors, similar to cuckoo hashing (e.g. robin hood hashing; see
[http://codecapsule.com/2013/11/17/robin-hood-hashing-
backwar...](http://codecapsule.com/2013/11/17/robin-hood-hashing-backward-
shift-deletion/))
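
For the curious, a minimal Python sketch of the robin hood insertion rule
(illustrative, not any particular implementation): on a collision, whichever
entry is further from its home slot keeps the slot, which evens out probe
lengths across the table.

```python
def rh_insert(table, key, value):
    """Insert into a fixed-size list of None or (key, value) slots
    using robin hood linear probing (no resizing, for brevity)."""
    mask = len(table) - 1
    slot = hash(key) & mask
    dist = 0                        # how far the incoming entry has travelled
    entry = (key, value)
    while True:
        occupant = table[slot]
        if occupant is None:
            table[slot] = entry
            return
        if occupant[0] == entry[0]:
            table[slot] = entry     # same key: update in place
            return
        # The occupant's own distance from its home slot:
        occ_dist = (slot - (hash(occupant[0]) & mask)) & mask
        if occ_dist < dist:
            # Occupant is "richer": steal its slot and keep probing
            # to re-place the evicted entry.
            table[slot], entry = entry, occupant
            dist = occ_dist
        slot = (slot + 1) & mask
        dist += 1
```

The swap step is the whole trick: no entry ends up much poorer (further from
home) than its neighbours, so worst-case probe lengths stay low even at the
higher load factors mentioned above.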

~~~
dom0
Strategy performance seems to be a rather mixed bag. In some testing we've
done, RH (fully implemented) performed somewhat worse across the board than
simple linear probing, but allowed higher load factors. On the other hand, RH
doesn't seem to (directly) suffer from tombstoning, which is rather
problematic with LP for long-lived tables.

------
kickscondor
I feel like the old adage "premature optimization is the root of all evil"
needs to be amended: "...unless we're talking about optimizing for the L1
cache, because: do that first!" Data locality is an architecture problem; I
really think it needs to be considered when you first map out your data
structures.

While it's great to draw attention to this and to bring any performance you
can to Ruby, I'm not sure the effects will be felt. While the hash indices are
now local to each other, the elements of the hash are still scattered all over
the atmosphere. You really want the hash table's contents to be local as well.
I wonder if we'll ever see a language that groups things in memory by type;
who knows, perhaps one of you can steer me in the right direction here.

My primary tool of the moment is the slotmap for this sort of thing:
[https://gist.github.com/kickscondor/e706145b20293dc05b0a262a...](https://gist.github.com/kickscondor/e706145b20293dc05b0a262a007046f1)

But it isn't a hash table in quite the same way, in that the "hash keys" are
generated from the data's location in the table rather than from its content.
I wonder what a hash table designed to keep everything in cache-friendly pages
would look like.
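
For readers unfamiliar with the idea, here's a rough Python sketch of a
slotmap (a hypothetical minimal version, not the gist linked above): keys are
(index, generation) handles issued by the container itself, so values sit in a
dense, cache-friendly array and stale handles can be detected.

```python
class SlotMap:
    """Toy slotmap: keys come from the value's location, not its content."""

    def __init__(self):
        self.values = []       # dense storage, good locality
        self.generations = []  # bumped on removal, invalidating old handles
        self.free = []         # recycled indices

    def insert(self, value):
        if self.free:
            idx = self.free.pop()
            self.values[idx] = value
        else:
            idx = len(self.values)
            self.values.append(value)
            self.generations.append(0)
        # The "key" is derived from where the value landed.
        return (idx, self.generations[idx])

    def get(self, handle):
        idx, gen = handle
        if idx >= len(self.values) or self.generations[idx] != gen:
            raise KeyError(handle)  # stale or invalid handle
        return self.values[idx]

    def remove(self, handle):
        idx, gen = handle
        if self.generations[idx] != gen:
            raise KeyError(handle)
        self.generations[idx] += 1  # outstanding handles become stale
        self.values[idx] = None
        self.free.append(idx)
```

The generation counter is what lets the slot be reused for a new value while
old handles to it safely fail, which is the key difference from an ordinary
index into an array.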

~~~
rcthompson
I don't think the adage needs revising. We just need to remember that it
applies to programming language users, not programming language writers. The
whole idea is that the writers of your programming language should be doing
most of the optimizing for you by making sure the language's primitives are
already well-optimized for common tasks.

~~~
kickscondor
Sure, what a programmer is building may allow for ignoring performance: a
microservice may be so short-lived and its memory usage so small that it all
fits into 32k anyway. No point optimizing.

But think about mobile developers. You aim to paint 60 times per second. If
Swift doesn't give you any tools for ensuring a specific memory layout (I
imagine it does, via some kind of contiguous array of structs), then how do
you optimize those heavily trodden pathways of your app that could really use
keeping a certain array within that 32k?

This applies to WebGL developers, too. It seems like I've seen code where
folks were using arrays of integers to achieve data locality - using a kind of
serialization almost to pack and unpack from this array - alas I can't seem to
recall where I saw that.

------
acqq
Note this "small" detail:

[https://en.wikipedia.org/wiki/Hash_table#Open_addressing](https://en.wikipedia.org/wiki/Hash_table#Open_addressing)

"A drawback of all these open addressing schemes is that the number of stored
entries cannot exceed the number of slots in the bucket array. In fact, even
with good hash functions, their performance dramatically degrades when the
load factor grows beyond 0.7 or so. For many applications, these restrictions
mandate the use of dynamic resizing, with its attendant costs."

Even if this is marked "citation needed" in Wikipedia, anybody who tries to
measure it will get the same results: open addressing is much worse when the
hash table is fuller rather than "relatively empty". It's also much less
forgiving of the hash function used. So unless you're clever enough to detect
when the table is "too full" and carefully regrow it each time, there's a
reasonable chance that an open addressing implementation which measures well
in optimistic benchmarks gets worse in real-life use.

Increasing data locality is a good thing, however.

I'd prefer carefully implemented chaining unless I could be sure that the
"unlucky" scenarios of open addressing won't happen.

So collect the real life uses, then measure, then decide what's better. Don't
trust pre-selected micro-benchmarks.

~~~
masklinn
> So unless you're clever enough to detect the cases where you'd carefully
> regrow the hash table each time it's "too full"

There's nothing particularly smart about it: open addressing implementations
decide on a fill threshold, keep track of the current fill factor, and resize
when it's exceeded.

The smart part is deciding on the maximum fill factor, which depends on the
probing strategy and the worst case you allow.

~~~
acqq
> which depend on the probing strategy

Not only. Resizing can involve rehashing all of the existing entries (unless
you spend memory on storing the full hash values all the time; even then it
still involves copying and reallocation). If without open addressing you can
get by with much less resizing and with a worse (but faster) hash function
(although modern languages often have to use non-trivial hash functions to
prevent some kinds of attacks), it's tricky to find the circumstances under
which open addressing is cheaper. Again, microbenchmarks aren't the complete
story; the real-world uses where users grow the hashes are: what are the
typical sizes? How often do tables regrow? A good solution behaves well for
the typical case (where a lot of the hashes are probably quite small) and for
the typical "big" case where some hashes grow significantly.

~~~
dietrichepp
Is that really true? Yes, you may need to rehash but the amount of resizing
you do is still going to be similar for similar performance targets.

------
chucknelson
While the actual change/patch is interesting, I found the long discussion and
the process of getting it merged just as interesting. It got a bit tense at
times in the "competition" between the two developers, but it seemed to end
politely enough.

------
faragon
Depending on element size, maps implemented with binary trees instead of hash
tables can give even faster results, with guaranteed O(log n) operations and
without requiring rehashing.

~~~
q3r3qr3q
For which sizes is this?

~~~
faragon
E.g. compare C++ std::map vs std::unordered_map on strings. You can find some
benchmarks here:

[https://github.com/faragon/libsrt/blob/master/doc/benchmarks...](https://github.com/faragon/libsrt/blob/master/doc/benchmarks.md)
(for 16 and 64 byte strings: cxx_map_s16, cxx_umap_s16, cxx_map_s64,
cxx_umap_s64)

------
arthurcolle
site seems down

