Hacker News new | past | comments | ask | show | jobs | submit login
A Critique of Rust's `std::collections` (ticki.github.io)
85 points by pimeys on Sept 13, 2016 | hide | past | web | favorite | 60 comments

Yes, you really do need collision-attack resistance in your standard hash table implementation. You know why? Because we've known for a long time that insecure defaults get exploited. I'm glad Rust makes the correct choice here.

Moreover, despite the author's continued invocation of the term "cryptographic hash", SipHash was designed to be a default hash table hash. Very few things use it as a standalone cryptographic hash. You don't see a lot of SipHash-based KDFs.

It's not the fastest general purpose hash --- but it's not supposed to be. It's supposed to be performant enough to be used as a default, and in being a default, it's supposed to take a pretty common attack off the table.

It's the right default. In the rare situations where container hashing winds up high on the profile, the faster hash should be the optional hash.
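For the curious, opting into a faster hasher in Rust is a one-liner on the map's type. A minimal sketch (the FNV-1a constants are the standard published ones; the type names `Fnv1a` and `FastMap` are mine, not std's):

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// Toy FNV-1a hasher: much cheaper than SipHash, but NOT flood-resistant.
pub struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x100_0000_01b3); // FNV prime
        }
    }
}

/// A HashMap that opts out of the default SipHash-based hasher.
pub type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;
```

Usage is identical to an ordinary map: `let mut m: FastMap<&str, i32> = FastMap::default();`. The default stays safe; the fast hash is the opt-in.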

The SipHash paper proposed two uses: the hash-table use, but also network traffic authentication, with the idea that its output is small enough to MAC each network packet. Do you know if the network use got any traction? I thought the justification for MACing each packet rather than groups of packets was rather weak (a single forged packet means multiple packet retransmits), but if I cared about both latency and security, I could see authenticating UDP packets with this.


Also, as a slight additional nitpick to the article, the term "cryptographic hash," when used in isolation, typically implies collision resistance, and SipHash is specifically not collision resistant due to its small output size.

OpenBSD ping uses SipHash to MAC packets.

SipHash is also pretty fast, on my computer I get 2 GB/s. There are hash functions that are 5x faster than that (some AES-NI based ones are 15x faster) but none of them have any credible cryptanalysis for key recovery.

I studied SipHash a while ago and implemented a SIMD version, but it performed slower (it needs 4x64-bit integer vectors and doesn't allow any instruction-level parallelism). I'm sure you could make a faster hash function with 4x32-bit vectors using similar ARX (add-rotate-xor) networks, but I'm not smart enough to do the cryptanalysis.

There's a new hash function called HighwayHash which runs at about 11 GB/s, but the implementation is very tied to Intel CPUs (SSE4.1 or AVX512) and uses 4x64-bit vectors. And there's no cryptanalysis on that either, although it looks like it has similar characteristics to SipHash.

Hash table keys are typically quite short, so going crazy with SIMD to push the throughput might not be a great idea.

Nonsense. Rust made a non-technical, fear-based choice. Robin Hood probing is enough protection; SipHash only adds slowness. There are many hash functions twice as fast with fewer collisions, and with Robin Hood you are fine with FNV1. For concurrency, leapfrog would be better.

It does nothing to protect against having to do bit-to-bit comparison of keys when two keys hash to bit-exact same hash value. This is what's behind "hash flooding" algorithmic complexity DDoS attacks that cause O(n^2) CPU usage w.r.t. input length.
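To make this concrete, here's a sketch with a deliberately degenerate hasher that sends every key to the same hash value, which is effectively the state a flooding attack induces: the table stays correct, but every lookup falls back to full key comparisons (hasher and type names are mine):

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// Worst-case hasher: every key hashes to 0, so every key "collides".
#[derive(Default)]
pub struct ConstantHasher;

impl Hasher for ConstantHasher {
    fn finish(&self) -> u64 {
        0 // all keys land in the same bucket/probe chain
    }
    fn write(&mut self, _bytes: &[u8]) {}
}

pub type FloodedMap<K, V> = HashMap<K, V, BuildHasherDefault<ConstantHasher>>;
```

The map still answers queries correctly, because equal hashes only mean the table must compare the actual keys, which is exactly where the O(n) per-operation (O(n^2) total) blowup comes from.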

SipHash is the only hash function that has credible cryptanalysis against key recovery and collisions.

E.g. FNV1 is trivially simple to generate collisions against.

It's also trivial to produce SipHash collisions, and its "security" claims are dubious in the hash table context. Section 7 of the SipHash paper isn't very credible: linear chaining is not the recommended way to implement hash tables anymore, and even with chaining there are useful protections you can add; SipHash is not one of them.

Yes, it's non-reversible, and it doesn't allow easy seed-independent multicollisions as with other hash functions. But one hash function used in a hash table can never be secure. The collision resolution adds the security part. (or/plus some other schemes which are obviously over your head).

No, cryptographic hash has an additional property. It is one way. So, given a hash it is hard to derive original input.

Collision resistance is a separate property which can be fulfilled without one way property.

Maybe you want hash values that are not predictable by an attacker, but SipHash is not salted either. Plus, salting does not require a cryptographic hash.

Uh, what?

SipHash isn't a cryptographic hash function. It's a keyed hash function designed not to leak the key.

Collision resistance implies onewayness for compressing hash functions. If you could invert a hashed value, then, since the inversion procedure can be randomised, there is a high probability that you'd end up with an input different from the original one, and that pair is a collision.

> Collision resistance implies onewayness for compressing hash functions.

No it doesn't. These are independent properties. That is, collision resistance does not imply onewayness, and onewayness does not imply collision resistance.

Collision Resistance: it is hard to find x != x' such that h(x) = h(x')

One Wayness: Given y, an n-bit string randomly chosen from the output space of h(), it is hard to find any x such that h(x) = y.

Proof that CR does not imply OW: Suppose g() is collision resistant and outputs n bits. Define h(x) to be 0||g(x) if x is n bits long, otherwise 1||x. Collisions are hard to find: if two outputs start with 1, the inputs were passed through unchanged, so there are no collisions there; if they start with 0, you'd need to find a collision in g(), which is collision resistant. h() is not one way: given a random y in the output space, with probability 1/2 it starts with 1, and then the remaining bits of y are themselves a preimage.

Proof that OW does not imply CR: Suppose that g() is one-way. Define h(x) = g(x with the last bit of the input removed). h() is one-way because inverting h() would give you a preimage for g(). h() is not collision resistant because for any x, h(x||0) = h(x||1).
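Both separations can be sketched in a few lines. This uses Rust's DefaultHasher as a byte-level stand-in for g() (the real proofs need an actual CR or OW function; this is only an illustration, and all names are mine):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Stand-in for a collision-resistant g(). DefaultHasher::new() is
/// deterministic across instances, which is enough for this sketch.
fn g(input: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(input);
    h.finish()
}

/// Byte-level analogue of the CR-but-not-OW construction: tag N-byte
/// inputs with 0 and hash them; tag everything else with 1 and pass
/// it through unchanged. Outputs starting with 1 are trivially invertible.
const N: usize = 8;

fn h_not_one_way(input: &[u8]) -> Vec<u8> {
    if input.len() == N {
        let mut out = vec![0u8];
        out.extend_from_slice(&g(input).to_be_bytes());
        out
    } else {
        let mut out = vec![1u8];
        out.extend_from_slice(input);
        out
    }
}

/// OW-but-not-CR construction: drop the last byte of the *input*,
/// then hash. h(x || a) == h(x || b) for any trailing bytes a, b.
fn h_not_collision_resistant(input: &[u8]) -> u64 {
    g(&input[..input.len().saturating_sub(1)])
}
```

For `h_not_one_way`, any output tagged 1 carries its own preimage; for `h_not_collision_resistant`, collisions come for free by varying the dropped byte.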

>It is one way. So, given a hash it is hard to derive original input.

This is the definition of a Hash Function. Not a cryptographic Hash Function.

Cryptographic Hashes should NEVER collide, on any inputs, ever, period. The moment one is found to, the algorithm is considered deprecated. This is why MD5 shouldn't be used anymore; there is a generalized algorithm for producing collisions.

Cryptographic hashes are overkill for HashTables/HashMaps because you can't use 128/256/512-bit address spaces unless you attempt to store all the information in the galaxy or observable universe. And modern computers just aren't there yet. Also, they are very slow.

So when you're masking a 512-bit output down to the 16 bits of your HashTable, you just lose the will-never-collide guarantee. Then something that operates at 10 GB/s and can collide sounds better than something that runs at 150 MB/s and will still collide anyways.

>This is the definition of a Hash Function. Not a cryptographic Hash Function.

No, a hash function is just any function which can be used to put values into a hash map. If your inputs are numbers, modulo will work fine as a hash function, but is obviously not one-way.

>Cryptographic Hashes should NEVER collide, on any inputs, ever, period.

This is obviously not possible, as the output of the cryptographic hash function is of fixed length while the input is variable-length. Finding collisions just needs to be hard, not impossible.

How is modulo invertible in the general case? If it is, can you demonstrate? I have picked certain numbers and their values modulo 10 are 6, 7, and 8. What numbers did I pick?

It doesn't have to be. Onewayness says nothing about finding "the original x". If a function is one-way, it means that given y (a random n-bit string in the output space of h), it is hard to find any x such that h(x) = y. If your h() is just a modulo operation, this is trivial: just choose x = y.

Ah, I see, way as in path, not way as in direction (which is I think the misinterpretation that leads people to the root of the misunderstanding). So, there's one path to get from x -> y, but it does not imply you can't get back to the original x. In that respect, "way" is an unfortunate word to use, at least as it's used in modern English.

Woah, I was completely wrong. You're right. A pseudo-inverse is sufficient. However, there are no functions known to be one-way. How interesting.

You can demonstrate that a number input into modulo belongs to a specific group. Even without knowing the specific value of the modulus, you can derive it by observing statistical properties of the output. (Which is why modulo hashing gives many collisions.)

      modulo will work fine as a hash function, but is obviously not one-way.
Modulo is a one way trap door operation.

Proof via arrogance:

       X % 100 = 50
       Solve for X
       Solution: X = F(n) where F(n) = 100*n + 50 (for all n ≥ 0)
This does indeed meet the requirements of a hash function. Modulo is a bad hash function because it is poorly distributed.
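A quick mechanical check of that preimage family (function names are mine; the identity holds for any non-negative n):

```rust
/// "x mod 100" as a hash function.
fn modulo_hash(x: u64) -> u64 {
    x % 100
}

/// The family from the comment above: every F(n) = 100*n + 50
/// hashes to 50, so modulo is trivially invertible in the
/// pre-image sense -- any member of the family will do.
fn preimage_family(n: u64) -> u64 {
    100 * n + 50
}
```

This is exactly why "hard to find the original input" is the wrong reading of one-wayness: any preimage counts.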

      Cryptographic Hashes should NEVER collide, on any inputs, ever, period.
      >This is obviously not possible
NIST says if you observe 2 inputs of a cryptographic hash which compute to the same digest, the hash is considered deprecated. So yes, my definition is correct.

Obviously, yes, not all cryptographic hash outputs are unique for every input, as they may be only N bytes long, and simple logic will tell you there is more information in arbitrary input data than in a 64-byte (512-bit) digest.

The real issue is: prove it. Testing even the 512-bit address space will take you longer than the heat death of the universe, so good luck!

No - per the definition of a trapdoor function: "For any k ∈ K, without trapdoor tk, for any PPT algorithm, the probability to correctly invert fk (i.e., given fk(x), find a pre-image x' such that fk(x' ) = fk(x)) is negligible"


Note that the definition is about a pre-image, not necessarily the input used to create it in the first place.

Second, that definition is absolutely not correct. "Never" would require that the size of the output of the hash function be equivalent to the size of the universe of possible inputs, which it obviously isn't. There's a very important difference between "never be observed in practice" and "never, period." You're not using the language precisely, and that's very dangerous when talking about hashing and cryptography.

Pre-image resistance is even simpler. A good cryptographic hash also prevents second pre-image attacks (given one input, finding a different input with the same hash).

None of those properties is needed for indexing into hash table. Good collision resistance is all that is required and salting for more paranoid cases.

Any hash function provably collides for at least two inputs, unless we're talking about perfect hashing. The only exception would be a hash function that does not compress, where the output is larger than the input. That's not a hash-function, though.

> you just lose the will never collide guarantee

There is no such guarantee, in fact, the opposite is guaranteed.

It's just very unlikely that you get a collision (a desirable property of any hash function), and it should be hard to deliberately create a collision because the function is hard to reverse (a desirable property of any cryptographic hash function, as the OP laid out).

Does Rust define collection traits for abstract interfaces like Set, Map, and Queue like Java's Collections interfaces? I don't see any in the std::collections documentation. Standardizing those APIs in the std lib seems as (or more!) important than providing concrete implementations like BTreeMap. I see some 2015 discussion about missing collections traits:



For comparison, here are Java's core collection interfaces: Collection, Set, SortedSet, List, Queue, Deque, Map, and SortedMap.


IIRC, we need higher-kinded types before we can create a truly great set of generic collections traits.

I'm curious about why higher-kinded types are needed for this. I'd have thought that it would be possible to create these kinds of traits in rust's current type system. Is there someplace I could read more explanation on this?

Are higher-kinded types scheduled for inclusion? If so when?

They're at the "nobody has yet come up with a serious proposal" stage. Some are skeptical that they can ever fit in. We'll see.

Thanks. Do you know who is working on this? I'd like to talk with them.

I'm not aware of anyone seriously pursuing it yet.

This showed up on r/rust last night and there's a lot of discussion there:


A lot of this article might be a matter of opinion, but the discussion on HashMaps seems to be outright false. The author has set up some kind of a double hashing strawman, when the actual implementation uses linear probing (which is the only kind of Robin Hood hash table anyone would implement today).

> And, ignoring that point for a moment, the idea that your code is ‘secure by default’ is a dangerous one and promotes ignorance about security. Your code is not secure by default

Bah. This argument can be used to show anything. I could argue that C++ is better than Rust with the same argument.

I'm not entirely sure I get all of these arguments. It sounds like someone complaining about the language being young, and the early implementations aren't perfect.

For not being perfect, it's amazing how awesome the std lib is!

Also, the concurrent collections didn't show up in Java until 1.5, I believe. I think that was ten years into the life of the language, which was after the 1.3/4 releases which actually made the JVM substantially more performant.

Rust is young, concurrent data models are hard, but it is being worked on, and you can lend a hand! There are good points in this article, and some of them are being discussed on the Rust forums. This is a great time to be involved in helping shape a language and its libraries!

I strongly believe that concurrent datastructures should be kept out of the stdlib. There are just way too many tradeoffs to worry about here, and these tradeoffs are far more important in the concurrent case.

Take a concurrent hashmap for example. Do you want a "vec of lock free lists" kind which is lock free and infinite-capacity (no need to copy-reallocate), or do you want a probing one which has fine-grained locks and needs to reallocate (but has better cache behavior)? Or something in between? If it matters enough that `Mutex<HashMap<K,V>>` isn't good enough, then these tradeoffs matter too. Rust could have a rogues' gallery of concurrent hashmaps, but that feels like bloat.

Some already exist in the ecosystem, though, and you can always use those. But I'd love to see more work here!

Having a semi-official/nursery library containing a lot of concurrent datastructures would be nice, though. crossbeam provides better primitives for writing such things, but not the actual datastructures.


There are tradeoffs in non-concurrent data structures too. Do you want a boolean vector that aligns to word size, or one that conserves memory and fits in cache? Pushed to the limit, nearly every non-primitive part of the library has compromises.

Don't let great be the enemy of good. Concurrent collections are a common need. Include some reasonable implementations with compromises. No, you won't answer every use case, but you can be better than `Mutex<HashMap<K,V>>` and still provide something solid.

Sure, there are tradeoffs everywhere, but there's a distinction here. I'm talking about things which can be cobbled together for most use cases from what already exists in the stdlib. Vec<bool> is good enough for most use cases, and thus BitVec doesn't exist in the stdlib. Mutex<HashMap> is also good enough for many use cases. The better, more specific implementations exist as separate crates.

Need a concurrent hashmap? Use Mutex<HashMap>. Think that's not good enough for your use case? Then what the stdlib would have offered will probably not fit for you either.
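For reference, the Mutex<HashMap> approach really is just a few lines. A minimal sketch (the function name is mine):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// The "good enough" concurrent map: one big lock around a plain
/// HashMap, shared across threads via Arc.
pub fn parallel_fill(n_threads: usize, per_thread: usize) -> HashMap<usize, usize> {
    let map = Arc::new(Mutex::new(HashMap::new()));
    let mut handles = Vec::new();
    for t in 0..n_threads {
        let map = Arc::clone(&map);
        handles.push(thread::spawn(move || {
            for i in 0..per_thread {
                // The whole table is locked for each insert: coarse,
                // but correct, and often fast enough in practice.
                map.lock().unwrap().insert(t * per_thread + i, t);
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    // All clones are dropped after join, so unwrapping succeeds.
    Arc::try_unwrap(map).unwrap().into_inner().unwrap()
}
```

When the contention on that one lock shows up in a profile, that's the signal to reach for a dedicated concurrent map crate, with all the tradeoffs discussed above.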

Remember, I'm not saying that we shouldn't have these things in a library somewhere (even an official library), I'm saying they shouldn't be in the stdlib, because there's an additional burden to that, and stability guarantees mean that it's harder to evolve.

It's true that Java took a long time to develop (good) concurrent collections, but I will suggest that they're more important in 2016 than they were in 1995. Core counts are increasing, and the competition from other languages is better.

I'm not saying we should be disgusted if there aren't immediately awesome concurrent data structures, but you can't be satisfied with waiting until 2024 (the equivalent of Java 1.5, I believe).

I didn't mean to imply that a ten year wait was acceptable. That being said, expecting stable concurrent libraries just one year after the language stabilized isn't reasonable.

Futures just landed, and those seem like a good first step toward a good concurrency model.

This particular quote rang very true to me:

> If the allocation is hidden to the programmer, she might not realize the expensive operations behind the scenes.

I am a green Rust programmer, and I currently have little insight into the allocation costs of the various things I'm doing, and I haven't yet found a strategy for gaining it. I feel like the actual mechanisms for memory management are so deeply buried that it would take lots and lots of spelunking to get an idea of what's going on behind the scenes. Even something as simple as a "This function allocates <x> memory" tag with stdlib functions would help me immensely.

I agree that we should add some more documentation in this area (though we always have to be careful that we don't document things so thoroughly that we de-facto stabilize the internals and make it impossible for us to improve them in the future). But in the meantime, if you're just learning the language, I would recommend not being overly concerned with the performance details of collections until you start seeing them at the top of your performance profiles.

EDIT: And of course, I do encourage you to check out the documentation that we have in the std::collections module that goes into at least a little detail about each of the standard collections, along with tips for choosing when to use each: https://doc.rust-lang.org/std/collections/#when-should-you-u...

Well, it's not so much about performance of existing code as it is about being able to take a high-level view of the design of a program. To pick a bad example, if I know one function will allocate memory according to a certain pattern (e.g. 2n*sizeof(obj)) and another will just shift an internal pointer around, I predict it'll get easier to reason about the behavior of the code. The semantics of malloc+free are exceptionally easy to understand, but I have no idea how or when Rust allocates memory, except for a vague notion of heap vs stack. I suspect there's an elegant relationship between how Rust does its (hidden) memory management and the behavior of the borrow checker, but I haven't come across anything yet linking the two concepts.

Mostly, I think my problem is that I don't know what I don't know. Like I said, I'm green. :)

I am interested in improving this. It's tough because as an experienced Rust programmer, I feel like it's obvious, but have a hard time figuring out how to put it into words. It should be something that can be intuitive, rather than needing explicit docs, though those can help too.

He says BinaryHeap is superfluous, but then complains there's not a priority queue. Huh?

A binary heap is a priority queue. (Specifically, a binary heap is an implementation of a priority queue, just as a linked list is an implementation of a list.)
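Concretely, std's BinaryHeap is used directly as a priority queue: max-first by default, and min-first by wrapping priorities in std::cmp::Reverse. A small sketch (function names are mine):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// BinaryHeap is a max-heap: pop() returns the largest element first.
fn drain_max_first(items: &[i32]) -> Vec<i32> {
    let mut pq: BinaryHeap<i32> = items.iter().copied().collect();
    let mut out = Vec::new();
    while let Some(x) = pq.pop() {
        out.push(x);
    }
    out
}

/// Wrapping each element in Reverse flips the ordering, giving a
/// min-priority queue with the same type.
fn drain_min_first(items: &[i32]) -> Vec<i32> {
    let mut pq: BinaryHeap<Reverse<i32>> =
        items.iter().copied().map(Reverse).collect();
    let mut out = Vec::new();
    while let Some(Reverse(x)) = pq.pop() {
        out.push(x);
    }
    out
}
```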

Even after the "they're fundamentally very different" update, he's still wrong.

Wanna know what the (non-synchronized) priority queue is called in Python? heapq https://docs.python.org/3.5/library/heapq.html Critique that next, please.

Yeah, he made an embarrassing mistake. I googled "rust binary heap" and the second link[0] is a priority queue example using the std binary heap.

[0] https://doc.rust-lang.org/std/collections/binary_heap/

The weird thing is...after having that brought to his attention, he reasserts his error.

Rust's variant of the Robin Hood hash table isn't a double hashing scheme like the one in the original research paper. It's more like linear probing with a collision resolution trick to reduce the worst case in lookups.

The article suggests using quadratic probing instead, but that has two weaknesses. Cache locality is worse than linear probing. But the deal breaker is that the hash table must have a prime number size for correctness. This makes the hash table grow quite quickly once the size reaches tens of thousands. And finding primes is another issue.

In all fairness, I have yet to see a language with an awesome built-in collections library that meets everyone's needs (and I use JDK collections all the time... it took a long time to get where it is, and it's still only sort of OK). If you know of a language that does, please chime in.

The big thing I missed in my limited playing around with Rust is more types of queues. This is in large part because I was attacking Rust in a similar way to Java. Because of the lack of concurrency models other than threads, I was resorting to queues (I'm sure there are more "reactive-like" libraries now).

Overall, I can't decide if a better approach might be to let library writers work on adding more structures instead of a batteries-included approach. Maybe just have the traits included and that's it.

> The other popular self-balancing trees are good candidates as well (AVL and LLRB/Red-black). While they do essentially the same, they can have very different performance characteristics, and switching can influence the program’s performance vastly.

In what scenario would you prefer an RB-tree (or AVL tree) to a B-tree?

My understanding is that the performance characteristics of RB/AVL trees can mostly be described as "worse." Maybe insert speed? But in that case, you'd really prefer an LSM tree.

The author is also very confused about how hash table DoS attacks are prevented, and about what a cryptographic hash is. There are very fast non-cryptographic hash functions like xxHash (in fact, I thought SipHash was much faster than 1 GB/s, but I could be misremembering).

GB/s is a fairly useless metric for hash functions designed to be used in hash tables, because hash table keys are usually very short and keyed hashes can have long setup times. I don't know SipHash's speed on bulk data, but I wouldn't be surprised if VMAC were faster for large inputs; VMAC, however, has a long setup time (on the order of 5000 cycles IIRC, compared to on the order of 100 for SipHash-2-4).

Heck, for large inputs, python's random hash isn't much faster than hmacmd5, but for 8 byte inputs python's random hash is faster than siphash24.

The better measurement is X hashes per second at N key size, and then perform that at a few key sizes.
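A rough sketch of that measurement, using Rust's DefaultHasher (currently SipHash-based). This is only indicative; a real benchmark would control for warm-up, CPU pinning, and repeated runs:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::time::Instant;

/// Hashes per second at a given key size -- the metric argued for
/// above, instead of bulk GB/s.
fn hashes_per_sec(key_len: usize, iters: u32) -> f64 {
    let key = vec![0xabu8; key_len];
    let start = Instant::now();
    let mut sink = 0u64;
    for i in 0..iters {
        let mut h = DefaultHasher::new();
        h.write(&key);
        h.write_u32(i); // vary the input so each iteration does real work
        sink ^= h.finish();
    }
    let secs = start.elapsed().as_secs_f64();
    // Keep the optimizer from deleting the loop as dead code.
    std::hint::black_box(sink);
    iters as f64 / secs
}
```

You'd call this at a few representative key sizes, e.g. `for len in [8, 64, 1024] { println!("{}: {:.0} hashes/s", len, hashes_per_sec(len, 100_000)); }`.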


I finally found a comparison that does this and includes siphash:


Note that siphash is competitive with everything except FNV for small inputs, but FNV blows it away up to about 60 bytes.

These benchmarks are out of date -- Rust recently updated its SipHash impl from Sip2-4 to Sip1-3, making it actually the fastest for 8-64 bytes in some workloads (looking at the hashers outside of the context of using them in a map gives deceptive results).

(I need to get clearance to post updated results)

Good to know. The page I posted was nice in that it listed both (and included BTree for a comparison). I know nothing about how the benchmarks were run, but the fact that it shows varying sizes and shows hashes both in isolation and in a map gave me confidence that someone who knows what they were doing was running the show.

I think you have to look at both the bare hash performance and the map performance: there are a lot more variables involved when testing against a map, but poor distribution (which ought not be an issue with anything except FNV on that graph) can make a fast hash useless.

  > but for 8 byte inputs python's random hash is faster 
  > than siphash24.
For reference, note that Rust recently switched its default hash function from SipHash 2-4 to SipHash 1-3, which should lead to improved performance while retaining the usual flood-resistant properties (at least, that's what the algorithm's creator advised us to do :P ).

Also python adopted siphash in 3.4, due to security concerns around its hash function (see PEP-456).

Good point.

In C and C++ code it's common to use a red-black tree implementation where the node is part of the object. For example, using the standard BSD library

  #include <sys/tree.h>
  struct foo {
    RB_ENTRY(foo) rbe; // tree node embedded in the object
    int key;
  };
  int foo_cmp(struct foo *a, struct foo *b) { return a->key - b->key; }
  RB_HEAD(foo_tree, foo);
  RB_GENERATE(foo_tree, foo, rbe, foo_cmp)
  struct foo_tree rbt = RB_INITIALIZER(&rbt);
  struct foo *foo = /* ... allocated elsewhere ... */;
  RB_INSERT(foo_tree, &rbt, foo); // no memory allocation
In such implementations the RB tree will be faster. The reason to prefer a B-tree is to minimize the number of cache misses incurred when loading objects _through_ separately allocated nodes. But if the node is part of the object, you've already eliminated that cost entirely and so a B-tree is strictly inferior.

<sys/tree.h> is written in the same vein as <sys/queue.h>, a collection of list implementations originally included with BSD and even provided by glibc. The tree and especially list routines are used for most of the data structures in *BSD kernels and utilities, and the Linux kernel and other systems often use very similar libraries.

The downside to using predefined, preallocated nodes is that you can't mix-and-match types in the same collection. And an object can only be inserted into a single collection unless you define multiple nodes.

In practice this isn't a concern. Especially in systems programming, it's uncommon to need or even want to stash random objects in a collection, or for objects to exist in multiple trees or lists simultaneously.

And it's a very useful property to be able to rely on the fact that insertion can _never_ fail (always for lists; qualified never for trees). That dramatically simplifies code dealing with complex data structures and that must remain robust in the face of resource exhaustion. It means that once you've allocated a new object, you can guarantee insertion into any number of collections. That also simplifies backing out of a transaction because you can make guarantees about consistency--i.e. that an object is definitely a member of this list or that tree. This benefit is often overlooked in Rust and elsewhere because idiomatic Rust code is expected to exit on memory exhaustion anyhow.

I can't remember the last time I actually used a hash table in C that wasn't specialized (e.g. perfect hash function). The fact that a dynamic hash table may need to be resized introduces an undesirable failure mode; it may happen only once in a million insertions, but that's significantly more often than never. For general use I default to a red-black tree, which usefully can serve as a sorted container, too (e.g. for priority queues). When performance matters (i.e. a proven bottleneck) I'll invariably use a more specialized data structure and implementation suited to the particular problem, perhaps also refactoring other aspects of the code as appropriate.

IMHO, all this concern about the fastest general purpose algorithm and implementation seems like bike shedding. When performance _truly_, _manifestly_ matters, a general purpose implementation is unlikely to suffice.

I appreciate reading thoughtful, constructive criticisms like this. But ultimately, I didn't find any of his points particularly compelling. Maybe the author can point to a language that really gets this right?

I haven't written much rust, though.

I think std means standard collections, not a custom ("special-case") collections library. People can look for that kind of specialization outside std or create their own library. Or you could be the first one to create this extended collections library. CMIIW

Typo: "alternatives and way(s) to improve it"

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact