
Rust Standard Hash Table Could Go Quadratic - signa11
http://accidentallyquadratic.tumblr.com/post/153545455987/rust-hash-iteration-reinsertion
======
Animats
_The key effect of Robin Hood hashing is just that it gives you confidence and
/or hubris to push a table to 90% capacity, which greatly exacerbates the
problem._

If you push a hash table that handles collisions by linear probing that hard,
any non-randomness in the hash values produces long probe runs and you're in
trouble. I knew about Robin Hood hashing, but didn't know it was being used
to justify 90% table density. Usually, once a table is more than 50% full,
it's time to double its size.
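
To make that concrete, here's a minimal linear-probing insert with the 50%
growth rule (a sketch with made-up names, not Rust's actual internals):

    // Sketch only: illustrative layout, not Rust's std implementation.
    struct Table<V> {
        slots: Vec<Option<(u64, V)>>, // (hash, value); None = empty
        len: usize,
    }

    impl<V> Table<V> {
        fn insert(&mut self, hash: u64, value: V) {
            // Grow at 50% load, per the rule of thumb above.
            if self.len * 2 >= self.slots.len() {
                self.grow();
            }
            let mask = self.slots.len() - 1; // capacity is a power of two
            let mut i = (hash as usize) & mask;
            // The probe walks to the end of any run of occupied slots,
            // so each insert into a cluster also makes the cluster longer.
            while self.slots[i].is_some() {
                i = (i + 1) & mask;
            }
            self.slots[i] = Some((hash, value));
            self.len += 1;
        }

        fn grow(&mut self) {
            // Allocate 2x the slots and re-insert every entry (omitted).
        }
    }

With linear probing, expected probe length grows sharply as the load factor
approaches 1, so the jump from 50% to 90% density costs far more than it
looks.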

What does the Rust implementation do for deletions? The classic solution is
to mark the deleted cell as a tombstone: available for reuse, but treated as
non-empty by the linear part of the search during lookup. If you do that, at
some point you have to compact or rebuild the table. This is an argument for
a hash table with a linked list per hash slot (separate chaining): more
memory allocations and cache misses, but better behavior as the table fills
up.
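
A sketch of that classic tombstone scheme (illustrative only; as the reply
below notes, Rust's std does something different):

    // Sketch of tombstone deletion, not Rust's std implementation.
    enum Slot {
        Empty,
        Tombstone,     // reusable by insert, but lookup probes past it
        Occupied(u64), // stored key (hashes and values omitted)
    }

    fn lookup(slots: &[Slot], mut i: usize, key: u64) -> Option<usize> {
        loop {
            match slots[i] {
                Slot::Empty => return None, // true end of the probe run
                Slot::Occupied(k) if k == key => return Some(i),
                // Skip tombstones and non-matching keys alike.
                _ => i = (i + 1) % slots.len(),
            }
        }
    }

    fn remove(slots: &mut [Slot], i: usize) {
        slots[i] = Slot::Tombstone; // "available, but keep probing"
    }

Since tombstones never shorten the probe runs, lookups slow down over time
until the table is rebuilt.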

(Remember the YC discussion in which some job interview question insisted that
hash tables operated in constant (O(1)) time? Not quite. There's collision
overhead and growth overhead.)

~~~
doomrobo
According to the docs page [0] for std::collections::HashMap, Rust uses
backward shift deletion [1] instead of the tombstone method.

[0] https://doc.rust-lang.org/std/collections/struct.HashMap.html

[1] http://codecapsule.com/2013/11/17/robin-hood-hashing-backward-shift-deletion/
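
Roughly, backward shift deletion works like this (my sketch of the idea in
[1], not the actual std code); each entry here carries its ideal slot so its
probe distance is easy to check:

    // Sketch of backward shift deletion, not the actual std code.
    // Entries are (ideal_slot, key); `hole` is the slot just vacated.
    fn backward_shift_remove(slots: &mut [Option<(usize, u64)>], mut hole: usize) {
        let n = slots.len();
        slots[hole] = None;
        loop {
            let next = (hole + 1) % n;
            match slots[next] {
                // Stop at an empty slot, or at an entry that is already
                // in its ideal position (probe distance zero).
                None => return,
                Some((ideal, _)) if ideal == next => return,
                Some(entry) => {
                    slots[hole] = Some(entry); // shift back one slot
                    slots[next] = None;
                    hole = next;
                }
            }
        }
    }

No tombstones ever accumulate, so the table doesn't degrade with age.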

~~~
Animats
That makes sense. With the tombstone method, hashmaps get worse with age.

------
userbinator
Quadratic complexity is bad, but when I saw the example code I automatically
thought "that's definitely not a good way to make a copy". I'm not familiar
with Rust, but I assume the type has a special cloning method which does
something more like a memcpy()? If so, to make a slightly modified copy of a
hashtable, it's probably still faster to copy the whole thing in one go and
then add/remove the desired elements than to insert each element and filter
as you go.
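
Something like this, say (plain HashMap API; the key and value types and the
edits are stand-ins):

    use std::collections::HashMap;

    // Copy the whole map in one go, then apply the few desired edits,
    // instead of rebuilding it one insert at a time. (Hypothetical
    // example; the keys and edits are placeholders.)
    fn modified_copy(original: &HashMap<u64, u64>) -> HashMap<u64, u64> {
        let mut copy = original.clone(); // one-shot copy
        copy.insert(42, 0);              // add what's new
        copy.remove(&7);                 // drop what's unwanted
        copy
    }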

Otherwise, it somewhat reminds me of
[http://www.joelonsoftware.com/articles/fog0000000319.html](http://www.joelonsoftware.com/articles/fog0000000319.html)

~~~
masklinn
> when I saw the example code I automatically thought "that's definitely not a
> good way to make a copy"

Sure but that's not really the point, it's just an easy way to trigger the
issue. If you look at the original issue[0], the initial use case is merging
two maps:

    use std::collections::HashMap;

    let first_map: HashMap<u64, _> = (0..900000).map(|i| (i, ())).collect();
    let second_map: HashMap<u64, _> = (900000..1800000).map(|i| (i, ())).collect();

> The user wants to merge the hash maps, and does so naïvely,

    let mut merged = first_map;
    merged.extend(second_map);

> Time for merge when second_map is shuffled: 0.4s

> Time for merge when second_map is not shuffled: 40s (100x amplification)

The example in TFA is boiled down to the barest version that exposes the
issue. The more real-world case it came from "was transforming one HashMap
into another (with natural language words as keys), iterating over the keys
of the first HashMap, inserting them in the second with different values."
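
That pattern is roughly this (my reconstruction of the described case, not
code from the issue):

    use std::collections::HashMap;

    // Reconstruction of the described trigger: the keys come out of
    // `first` in probe order, and inserting them into a fresh map in
    // that same order recreates the collision runs.
    fn transform(first: &HashMap<String, u32>) -> HashMap<String, usize> {
        let mut second = HashMap::new();
        for key in first.keys() {
            second.insert(key.clone(), key.len()); // same keys, new values
        }
        second
    }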

> I'm not familiar with Rust, but I assume the class contains a special
> cloning method which does something more like a memcpy()?

You can't memcpy the hashmap, since it could contain non-trivial keys or
values (e.g. refcounted heap allocations)[1], but you could clone() it for a
one-shot copy. As noted above, though, that's missing the forest for the
trees: copying is just an easy trigger case, the point is not cloning maps.

[0] https://github.com/rust-lang/rust/issues/36481

[1] and the hashmap is mostly a pointer to a single big allocation holding
the keys, values and hashes, which you probably don't want to share

------
amelius
Why didn't they just use the same algorithms as typical implementations of
C++'s STL?

~~~
db48x
Any algorithm you pick is going to have edge cases, and the C++ STL only
defines the interfaces for interacting with them. This means that there is no
"typical" implementation. Do you actually know how the implementation you're
currently using is written?

~~~
Fede_V
I might be wrong, but I thought the STL specification also put restrictions
on algorithmic complexity? As in: you are free to choose any design you
want, as long as it has the given API and O(1) average-case access.

~~~
KMag
It's been a while, so my memory could be playing tricks on me. However, I
believe the specified std::unordered_map iterator and reference invalidation
behavior rules out efficient use of open addressing: the standard guarantees
that pointers and references to elements stay valid while other elements are
inserted or erased, which more or less forces a node-per-element (separate
chaining) layout, unless someone comes up with some timsort-level novel
implementation ideas.

