
A Critique of Rust's `std::collections` - pimeys
http://ticki.github.io/blog/horrible/
======
chowells
Yes, you really do need collision attack resistance in your standard hash
table implementation. You know why? Because we've known for a long time that
insecure defaults are broken. I'm glad Rust makes the correct choice here.

~~~
AstralStorm
No, a cryptographic hash has an additional property: it is one-way. Given a
hash, it is hard to derive the original input.

Collision resistance is a separate property, which can be fulfilled without
the one-way property.

Maybe you want hash values that are not predictable by an attacker, but
SipHash is not salted either. Plus, salting does not require a cryptographic
hash either.

~~~
valarauca1
>It is one way. So, given a hash it is hard to derive original input.

This is the definition of a hash function, not a cryptographic hash function.

Cryptographic hashes should __NEVER__ collide, on any inputs, ever, period.
The moment a collision is found, the algorithm is considered deprecated. This
is why MD5 shouldn't be used anymore: there is a generalized algorithm for
producing collisions.

Cryptographic hashes are overkill for HashTables/HashMaps because you can't
use 128/256/512-bit address spaces unless you're attempting to store all the
information in the galaxy or the observable universe. And modern computers
just aren't there yet. Also, they are very slow.

So when you're masking a 512-bit output down to 16 bits for your HashTable,
you just lose the _will never collide_ guarantee. Then something that operates
at 10GB/s and _can collide_ sounds better than something that runs at 150MB/s
and _will still collide anyway_.

~~~
luchs
>This is the definition of a Hash Function. Not a cryptographic Hash Function.

No, a hash function is just any function which can be used to put values into
a hash map. If your inputs are numbers, modulo will work fine as a hash
function, but is obviously not one-way.
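To make this concrete, here is a minimal sketch (illustration only, not production code) of modulo as a bucket function for integer keys: it distributes keys across buckets, it collides freely, and it is trivially not one-way.

```rust
// Modulo is a perfectly serviceable hash function for integer keys,
// but every output has infinitely many obvious preimages.
fn modulo_hash(key: u64, buckets: u64) -> u64 {
    key % buckets
}

fn main() {
    // 6, 16, and 26 all land in bucket 6 of a 10-bucket table: collisions.
    assert_eq!(modulo_hash(6, 10), 6);
    assert_eq!(modulo_hash(16, 10), 6);
    assert_eq!(modulo_hash(26, 10), 6);

    // "Inverting" it is trivial: given an output y, the input y itself
    // hashes to y, so the function is clearly not one-way.
    let y = 7;
    assert_eq!(modulo_hash(y, 10), y);
    println!("ok");
}
```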

>Cryptographic Hashes should NEVER collide, on any inputs, ever, period.

This is obviously not possible, as the output of a cryptographic hash
function is fixed-length while the input is variable-length. Finding
collisions just needs to be hard, not impossible.

~~~
arjie
How is modulo invertible in the general case? If it is, can you demonstrate? I
have picked certain numbers and their values modulo 10 are 6, 7, and 8. What
numbers did I pick?

~~~
chias
It doesn't have to be. Onewayness says nothing about finding "the original x".
If a function is one-way, it means that given y (a random n-bit string in the
output space of h), it is hard to find any x such that h(x) = y. If your h()
is just a modulo operation, this is trivial: just choose x = y.

~~~
kbenson
Ah, I see, way as in _path_ , not way as in _direction_ (which is I think the
misinterpretation that leads people to the root of the misunderstanding). So,
there's one path to get from x -> y, but it does not imply you can't get back
to the original x. In that respect, "way" is an unfortunate word to use, at
least as it's used in modern English.

------
cpeterso
Does Rust define collection traits for abstract interfaces like Set, Map, and
Queue like Java's Collections interfaces? I don't see any in the
std::collections documentation. Standardizing those APIs in the std lib seems
as (or more!) important than providing concrete implementations like BTreeMap.
I see some 2015 discussion about missing collections traits:

[https://internals.rust-lang.org/t/collection-traits-take-2/1...](https://internals.rust-lang.org/t/collection-traits-take-2/1272)

[https://internals.rust-lang.org/t/traits-that-should-be-in-s...](https://internals.rust-lang.org/t/traits-that-should-be-in-std-but-arent/3002)

For comparison, here are Java's core collection interfaces: Collection, Set,
SortedSet, List, Queue, Deque, Map, and SortedMap.

[http://docs.oracle.com/javase/tutorial/collections/interface...](http://docs.oracle.com/javase/tutorial/collections/interfaces/index.html)
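As an illustration of the kind of abstraction being asked for, here is a hypothetical minimal `Map` trait (the trait name and methods are invented for this sketch; nothing like it exists in `std`) that both `HashMap` and `BTreeMap` can satisfy in today's Rust. Abstracting over operations that *return* a collection of the same family is where the type-system limitations start to bite.

```rust
use std::collections::{BTreeMap, HashMap};
use std::hash::Hash;

// Hypothetical sketch of a minimal map abstraction; not a std trait.
trait Map<K, V> {
    fn get(&self, key: &K) -> Option<&V>;
    fn insert(&mut self, key: K, value: V) -> Option<V>;
}

impl<K: Eq + Hash, V> Map<K, V> for HashMap<K, V> {
    fn get(&self, key: &K) -> Option<&V> {
        HashMap::get(self, key)
    }
    fn insert(&mut self, key: K, value: V) -> Option<V> {
        HashMap::insert(self, key, value)
    }
}

impl<K: Ord, V> Map<K, V> for BTreeMap<K, V> {
    fn get(&self, key: &K) -> Option<&V> {
        BTreeMap::get(self, key)
    }
    fn insert(&mut self, key: K, value: V) -> Option<V> {
        BTreeMap::insert(self, key, value)
    }
}

// Generic code written against the trait works with either map.
fn count<M: Map<String, u32>>(m: &mut M, word: &str) {
    let n = m.get(&word.to_string()).copied().unwrap_or(0);
    m.insert(word.to_string(), n + 1);
}

fn main() {
    let mut h: HashMap<String, u32> = HashMap::new();
    count(&mut h, "hello");
    count(&mut h, "hello");
    assert_eq!(h["hello"], 2);
    println!("ok");
}
```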

~~~
steveklabnik
IIRC, we need higher kinded types before we can create a truly great set of
generic collections traits.

~~~
maxbrunsfeld
I'm curious about why higher-kinded types are needed for this. I'd have
thought that it would be possible to create these kinds of traits in rust's
current type system. Is there someplace I could read more explanation on this?

~~~
steveklabnik
[https://github.com/rust-lang/rfcs/blob/4b4fd5146c04c9c284094...](https://github.com/rust-lang/rfcs/blob/4b4fd5146c04c9c284094aad8f54ca5c2093c7f2/text/0235-collections-conventions.md#removing-the-traits)

------
grayrest
This showed up on r/rust last night and there's a lot of discussion there:

[https://www.reddit.com/r/rust/comments/52grcl/rusts_stdcolle...](https://www.reddit.com/r/rust/comments/52grcl/rusts_stdcollections_is_absolutely_horrible/)

------
jsnell
A lot of this article might be a matter of opinion, but the discussion on
HashMaps seems to be outright false. The author has set up some kind of a
double hashing strawman, when the actual implementation uses linear probing
(which is the only kind of Robin Hood hash table anyone would implement
today).

------
petters
> And, ignoring that point for a moment, the idea that your code is ‘secure by
> default’ is a dangerous one and promotes ignorance about security. Your code
> is not secure by default

Bah. This argument can be used to show anything. I could argue that C++ is
better than Rust with the same argument.

------
bluejekyll
I'm not entirely sure I get all of these arguments. It sounds like someone
complaining about the language being young, and the early implementations
aren't perfect.

For not being perfect, it's amazing how awesome the std lib is!

Also, the concurrent collections didn't show up in Java until 1.5, I believe.
I think that was ten years into the life of the language, after the 1.3/1.4
releases which actually made the JVM substantially more performant.

Rust is young, concurrent data models are hard, but it is being worked on, and
you can lend a hand! There are good points in this article, and some of them
are being discussed on the Rust forums. This is a great time to be involved in
helping shape a language and its libraries!

~~~
Manishearth
I strongly believe that concurrent datastructures should be kept out of the
stdlib. There are just way too many tradeoffs to worry about here, and these
tradeoffs are far more important in the concurrent case.

Take a concurrent hashmap for example. Do you want a "vec of lock free lists"
kind which is lock free and infinite-capacity (no need to copy-reallocate), or
do you want a probing one which has fine-grained locks and needs to reallocate
(but has better cache behavior)? Or something in between? If it matters enough
that `Mutex<HashMap<K,V>>` isn't good enough, then these tradeoffs matter too.
Rust could have a rogues' gallery of concurrent hashmaps, but that feels like
bloat.
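The `Mutex<HashMap<K, V>>` baseline can be sketched like this (wrapped in an `Arc` to share across threads). Every access takes the one global lock, which is exactly the serialization a purpose-built concurrent map tries to avoid:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

// Coarse-grained baseline: one lock around the whole map.
fn parallel_count() -> u64 {
    let map = Arc::new(Mutex::new(HashMap::new()));
    let mut handles = Vec::new();
    for _ in 0..4 {
        let map = Arc::clone(&map);
        handles.push(thread::spawn(move || {
            for i in 0..100u64 {
                // All threads contend on this single lock.
                let mut guard = map.lock().unwrap();
                *guard.entry(i % 10).or_insert(0u64) += 1;
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    let guard = map.lock().unwrap();
    guard.values().sum()
}

fn main() {
    // 4 threads x 100 increments each = 400, spread over 10 keys.
    assert_eq!(parallel_count(), 400);
    println!("ok");
}
```

This is correct and often fast enough; the tradeoff discussions above only matter once this lock becomes the bottleneck.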

Some already exist in the ecosystem, though, and you can always use those. But
I'd love to see more work here!

Having a semi-official/nursery library containing a lot of concurrent
datastructures would be nice, though. crossbeam provides better primitives for
writing such things, but not the actual datastructures.

~~~
paulddraper
Disagree.

There are tradeoffs in non-concurrent data structures too. Do you want a
boolean vector that aligns to word size, or one that conserves memory and fits
in cache? Pushed to the limit, nearly ever non-primitive part of the library
has compromises.

Don't let great be the enemy of good. Concurrent collections are a common
need. Include some reasonable implementations with compromises. No, you won't
answer every use case, but you can be better than `Mutex<HashMap<K,V>>` and
still provide something solid.

~~~
Manishearth
Sure, there are tradeoffs everywhere, but there's a distinction here. I'm
talking about things which can be cobbled together for most use cases from
what already exists in the stdlib. Vec<bool> is good enough for most use
cases, and thus BitVec doesn't exist in the stdlib. Mutex<HashMap> is also
good enough for many use cases. The better, more specific implementations
exist as separate crates.

Need a concurrent hashmap? Use Mutex<HashMap>. Think that's not good enough
for your use case? Then what the stdlib would have offered will probably not
fit for you either.

Remember, I'm not saying that we shouldn't have these things in a library
somewhere (even an official library), I'm saying they shouldn't be in the
stdlib, because there's an additional burden to that, and stability guarantees
mean that it's harder to evolve.

------
thenewwazoo
This particular quote rang _very_ true to me:

> If the allocation is hidden to the programmer, she might not realize the
> expensive operations behind the scenes.

I am a green Rust programmer, and I currently have little insight into the
allocation costs of the various things I'm doing, and I haven't yet found a
strategy for gaining it. I feel like the actual mechanisms for memory
management are _so_ deeply buried that it would take lots and lots of
spelunking to get an idea of what's going on behind the scenes. Even something
as simple as a "This function allocates <x> memory" tag with stdlib functions
would help me immensely.
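Until such tags exist, one way to observe a hidden allocation today is to watch a `Vec`'s buffer pointer and capacity (a sketch; the exact growth factor is an implementation detail):

```rust
// Pushes within the reserved capacity reuse the same buffer; the
// buffer pointer only changes when a hidden reallocation happens.
fn pushes_in_place(reserve: usize) -> bool {
    let mut v: Vec<u32> = Vec::with_capacity(reserve);
    let p0 = v.as_ptr();
    for i in 0..reserve as u32 {
        v.push(i);
    }
    v.as_ptr() == p0 // still the original allocation
}

fn main() {
    assert!(pushes_in_place(4));

    let mut v: Vec<u32> = Vec::with_capacity(4);
    for i in 0..5 {
        v.push(i); // the fifth push outgrows the reservation
    }
    assert!(v.capacity() > 4); // a hidden reallocation occurred
    println!("capacity grew to {}", v.capacity());
}
```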

~~~
kibwen
I agree that we should add some more documentation in this area (though we
always have to be careful that we don't document things so thoroughly that we
de-facto stabilize the internals and make it impossible for us to improve them
in the future). But in the meantime, if you're just learning the language, I
would recommend not being overly concerned with the performance details of
collections until you start seeing them at the top of your performance
profiles.

EDIT: And of course, I do encourage you to check out the documentation that we
have in the std::collections module that goes into at least a little detail
about each of the standard collections, along with tips for choosing when to
use each: [https://doc.rust-lang.org/std/collections/#when-should-you-u...](https://doc.rust-lang.org/std/collections/#when-should-you-use-which-collection)

~~~
thenewwazoo
Well, it's not so much about performance of existing code as it is about being
able to take a high-level view of the design of a program. To pick a bad
example, if I know one function will allocate memory according to a certain
pattern (e.g. 2n*sizeof(obj)) and another will just shift an internal pointer
around, I predict it'll get easier to reason about the behavior of the code.
The semantics of malloc+free are exceptionally easy to understand, but I have
no idea how or when Rust allocates memory, except for a vague notion of heap
vs stack. I suspect there's an elegant relationship between how Rust does its
(hidden) memory management and the behavior of the borrow checker, but I
haven't come across anything yet linking the two concepts.

Mostly, I think my problem is that I don't know what I don't know. Like I
said, I'm green. :)

------
paulddraper
He says BinaryHeap is superfluous, but then complains there's not a priority
queue. Huh?

A binary heap _is_ a priority queue. (Specifically, a binary heap is an
implementation of a priority queue, just as a linked list is an implementation
of a list.)
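For reference, using std's `BinaryHeap` as a priority queue looks like this (it is a max-heap by default; `Reverse` flips it into a min-priority queue):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// std's BinaryHeap is already a priority queue: pop returns the
// largest (highest-priority) element first.
fn pop_max(items: &[i32]) -> Option<i32> {
    let mut pq: BinaryHeap<i32> = items.iter().copied().collect();
    pq.pop()
}

fn main() {
    assert_eq!(pop_max(&[3, 10, 5]), Some(10));

    // Wrapping priorities in Reverse gives min-priority ordering.
    let mut min_pq: BinaryHeap<Reverse<i32>> = BinaryHeap::new();
    for n in [3, 10, 5] {
        min_pq.push(Reverse(n));
    }
    assert_eq!(min_pq.pop(), Some(Reverse(3)));
    println!("ok");
}
```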

Even after the "they're fundamentally very different" update, he's still
wrong.

Wanna know what the (non-synchronized) priority queue is called in Python?
_heapq_
[https://docs.python.org/3.5/library/heapq.html](https://docs.python.org/3.5/library/heapq.html)
Critique that next, please.

~~~
desdiv
Yeah, he made an embarrassing mistake. I googled "rust binary heap" and the
second link[0] is a priority queue example _using_ the std binary heap.

[0] [https://doc.rust-lang.org/std/collections/binary_heap/](https://doc.rust-lang.org/std/collections/binary_heap/)

~~~
paulddraper
The weird thing is...after having that brought to his attention, he reasserts
his error.

------
exDM69
The variant of the Robin Hood hash table used here isn't a double hashing
scheme like the one in the original research paper. It is closer to linear
probing with a collision-resolution trick that reduces the worst case in
lookups.

The article suggests using quadratic probing instead, but that has two
weaknesses. Cache locality is worse than with linear probing. But the deal
breaker is that the hash table must have a prime-number size for correctness,
which makes the table grow quite quickly once the size reaches the tens of
thousands. And finding primes is another issue.
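The locality difference is visible in the probe sequences themselves. A sketch of the two schemes (illustration only, not Rust's actual std code); note how quadratic probing on a power-of-two table even revisits a slot before covering the table, which is why it needs prime sizes for correctness:

```rust
// Linear probing visits consecutive slots from the home bucket.
fn linear_probe(home: usize, i: usize, cap: usize) -> usize {
    (home + i) % cap
}

// Quadratic probing jumps by growing strides, scattering accesses.
fn quadratic_probe(home: usize, i: usize, cap: usize) -> usize {
    (home + i * i) % cap
}

fn main() {
    let cap = 16; // power-of-two size, as Rust's HashMap uses
    let linear: Vec<usize> = (0..5).map(|i| linear_probe(3, i, cap)).collect();
    let quad: Vec<usize> = (0..5).map(|i| quadratic_probe(3, i, cap)).collect();

    assert_eq!(linear, vec![3, 4, 5, 6, 7]); // adjacent: cache-friendly
    assert_eq!(quad, vec![3, 4, 7, 12, 3]);  // scattered, and slot 3 repeats!
    println!("linear {:?} vs quadratic {:?}", linear, quad);
}
```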

------
agentgt
In all fairness, I have yet to see a language with an awesome built-in
collections library that meets everyone's needs (and I use JDK collections all
the time... it took a long time to get where it is, and it's still only sort
of OK). If you know of a language that does, please chime in.

The big thing I missed in my limited playing around with Rust is more types of
queues. This is in large part because I was attacking Rust in a similar way to
Java. Because of its lack of other concurrency models other than threads I was
resorting to queues (I'm sure there are more "reactive-like" libraries now).

Overall, I can't decide whether a better approach might be to let library
writers work on adding more structures instead of a batteries-included
approach. Maybe just have the traits included and that's it.

------
loeg
> The other popular self-balancing trees are good candidates as well (AVL and
> LLRB/Red-black). While they do essentially the same, they can have very
> different performance characteristics, and switching can influence the
> program’s performance vastly.

In what scenario would you prefer an RB-tree (or AVL tree) to a B-tree?

My understanding is that the performance characteristics of RB/AVL trees can
mostly be described as "worse." Maybe insert speed? But in that case, you'd
really prefer an LSM tree.

The author is also very confused about how hash table DoS attacks are
protected. And what a cryptographic hash is. There are very fast non-
cryptographic hash functions like xxHash (in fact, I thought siphash was much
faster than 1GB/s, but I could be misremembering).

~~~
aidenn0
GB/s is a fairly useless metric for hash functions designed to be used in
hash tables, because hash table keys are usually very short and keyed hashes
can have long setup times. I don't know the speed of SipHash for bulk data,
but I wouldn't be surprised if VMAC is faster for large inputs; VMAC, however,
has a long setup time (on the order of 5000 cycles IIRC, compared to on the
order of 100 for SipHash-2-4).

Heck, for large inputs, Python's randomized hash isn't much faster than
HMAC-MD5, but for 8-byte inputs Python's randomized hash is faster than
SipHash-2-4.

The better measurement is X hashes per second at N key size, and then perform
that at a few key sizes.
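A rough harness along those lines, using std's `DefaultHasher` (SipHash-based, though the exact algorithm is unspecified and may change); a real benchmark would also need warmup, many repetitions, and `black_box`:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::time::Instant;

// Measures roughly how many keys of a given size can be hashed per
// second, including per-hash setup cost.
fn hashes_per_sec(key_len: usize, iterations: u32) -> f64 {
    let key = vec![0xABu8; key_len];
    let mut acc = 0u64;
    let start = Instant::now();
    for i in 0..iterations {
        let mut h = DefaultHasher::new();
        h.write(&key);
        h.write_u32(i); // vary input so work can't be hoisted out
        acc ^= h.finish();
    }
    let secs = start.elapsed().as_secs_f64();
    if acc == u64::MAX {
        eprintln!("unlikely"); // keep `acc` observable to the optimizer
    }
    iterations as f64 / secs
}

fn main() {
    // Compare a few key sizes rather than quoting one bulk GB/s figure.
    for &len in &[8usize, 64, 1024] {
        let rate = hashes_per_sec(len, 100_000);
        assert!(rate > 0.0);
        println!("{:>5}-byte keys: {:.0} hashes/sec", len, rate);
    }
}
```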

[edit]

I finally found a comparison that does this and includes siphash:

[http://cglab.ca/~abeinges/blah/hash-rs/](http://cglab.ca/~abeinges/blah/hash-rs/)

Note that siphash is competitive with everything except FNV for small inputs,
but FNV blows it away up to about 60 bytes.

~~~
Gankro
These benchmarks are out of date -- Rust recently updated its SipHash
implementation from SipHash-2-4 to SipHash-1-3, making it actually the fastest
for 8-64 byte keys in some workloads (looking at the hashers outside the
context of using them in a map gives deceptive results).

(I need to get clearance to post updated results)

~~~
aidenn0
Good to know. The page I posted was nice in that it listed both (and included
BTree for a comparison). I know nothing about how the benchmarks were run, but
the fact that it shows varying sizes _and_ shows hashes both in isolation and
in a map gave me confidence that someone who knows what they were doing was
running the show.

I think not looking at both the bare hash performance and the map performance
is a bad idea because there are a lot more variables involved testing against
a Map, but poor distribution (which ought not be an issue with anything except
FNV on that graph) could make a fast hash not useful.

------
jeffdavis
I appreciate reading thoughtful, constructive criticism like this, but
ultimately I didn't find any of his points particularly compelling. Maybe the
author can point to a language that really gets this right?

I haven't written much Rust, though.

------
zerosign
I think std means standard collections, not custom ("special-case") collection
libraries. People should look for this kind of specialization outside std or
create their own library. Or you could be the first one to create such an
extended collections library... CMIIW

------
programmer_man
Typo: "alternatives and way(s) to improve it"

