
SeaHash: A fast, portable hash function in Rust - pcwalton
https://docs.rs/seahash/2.0.0/seahash/
======
aappleby
Looks like a pretty straightforward 64-bit block hash unrolled 4 times. I'd
prefer a bit more asymmetry in the diffuse() method, but since it passes
SMHasher it's probably OK.

I wonder how the Rust version compares with plain-jane C.

-Austin (murmurhash guy).

~~~
aappleby
Actually, just noticed a minor issue - since there's no intermixing between
the four lanes and the diffuse() function is the same for all of them, if any
of the IVs match then I can swap all the blocks in those lanes and get the
same hash out.

For example, if IV1 and IV2 match and the block pattern is ABCDABCDABCD, then
BACDBACDBACD will produce the same hash value.

A minor finalizer change would fix it for any IV (pseudocode as I don't
actually know Rust) -

vec[0] ^= diffuse(vec[1]); vec[1] ^= diffuse(vec[0]); vec[2] ^=
diffuse(vec[3]); vec[3] ^= diffuse(vec[2]);

u64 result = diffuse(vec[1] ^ diffuse(vec[3]));

that's probably overkill but should work.
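Translated into Rust, the proposed finalizer might look like the sketch below. The `diffuse` here is a stand-in built from the xorshift-multiply rounds described elsewhere in the thread; the constant and exact round structure are illustrative, not necessarily what the crate ships.

```rust
// Stand-in diffusion round (xorshift-multiply); the multiplier constant is
// illustrative, not guaranteed to match the crate's.
fn diffuse(mut x: u64) -> u64 {
    const P: u64 = 0x6eed0e9da4d94a4f;
    x ^= x >> 32;
    x = x.wrapping_mul(P);
    x ^= x >> 32;
    x = x.wrapping_mul(P);
    x ^ (x >> 32)
}

// Cross-mix the lanes before folding, so that swapping blocks between two
// lanes with equal IVs no longer produces the same hash.
fn finalize(mut vec: [u64; 4]) -> u64 {
    vec[0] ^= diffuse(vec[1]);
    vec[1] ^= diffuse(vec[0]);
    vec[2] ^= diffuse(vec[3]);
    vec[3] ^= diffuse(vec[2]);
    diffuse(vec[1] ^ diffuse(vec[3]))
}

fn main() {
    // Swapping the first two lanes (the IV-collision scenario above)
    // now changes the output.
    assert_ne!(finalize([1, 2, 3, 4]), finalize([2, 1, 3, 4]));
}
```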

~~~
alekratz
I'm not a "hash guy" by any means - what impact would that have on its
performance?

~~~
loeg
None for large strings; you'd have to benchmark for small strings.

------
aleyan
This is not a knock against SeaHash, but I was looking at buffer.rs [0] and
noticed pretty much all the code is wrapped in unsafe {} blocks. How much
advantage is there to the Rust implementation vs. C++ if unsafe is used so
liberally? I ask this in earnest.

[0]
[https://docs.rs/crate/seahash/2.0.0/source/src/buffer.rs](https://docs.rs/crate/seahash/2.0.0/source/src/buffer.rs)

~~~
Animats
This code is C code written as unsafe Rust:

    
    
        let mut ptr = buf.as_ptr();        
        let end_ptr = buf.as_ptr().offset(buf.len() as isize & !0x1F) as usize;
        while end_ptr > ptr as usize {
            a = a ^ read_u64(ptr);
            ptr = ptr.offset(8);
            b = b ^ read_u64(ptr);
            ptr = ptr.offset(8);
            c = c ^ read_u64(ptr);
            ptr = ptr.offset(8);
            d = d ^ read_u64(ptr);
            ptr = ptr.offset(8);
    
            ....
            match excessive {
                0 => {},
                1...7 => {                
                    a = a ^ read_int(slice::from_raw_parts(ptr as *const u8, excessive));
                    a = diffuse(a);
                },
                8 => {
                    a = a ^ read_u64(ptr);
                    a = diffuse(a);
                },
                9...15 => {               
                    a = a ^ read_u64(ptr);
                    ptr = ptr.offset(8);
                    excessive = excessive - 8;
            ....
    

This bothers me about Rust. There's too much "unsafe" code in libraries. The
language is unable to express some essential concepts. Known areas of trouble
include partial initialization of an array, needed to implement growable
collections, and single ownership doubly linked lists. Neither of those is
expressible within Rust, which leads to unsafe code to implement them. Here,
though, it's purely a performance issue. That's disturbing. If you can't do
fast bit-banging in safe Rust, there's a problem somewhere.

If Rust let you access a slice of bytes as a slice of ints, alignment and
length permitting, the code above could be much more straightforward. That's
what I mean about expressive power. The hack to do that used here:

    
    
        let end_ptr = buf.as_ptr().offset(buf.len() as isize & !0x1F) as usize;
    

is iffy. Why is there an "isize" (a signed quantity) in there? They want to
round the length down to a 32-byte boundary, yes, but why the signed quantity? The
documentation for Rust's "std::ops::BitAnd" doesn't say what the semantics are
for signed numbers. What would happen on a 32-bit machine if someone allocated
a buffer bigger than 2GB? Exploitable?

~~~
Manishearth
> There's too much "unsafe" code in libraries.

Which libraries? I see very few that do this, and all of them are safe
abstractions containing some unsafe code.

You keep repeating this claim but I haven't seen any evidence to back it up.

> If Rust let you access a slice of bytes as an slice of ints, alignment and
> length permitting, the code above could be much more straightforward. That's
> what I mean about expressive power. The hack to do that used here:

The byteorder crate lets you do this. Of course, it uses unsafe code, but
that's safely encapsulated away (and easy to verify). This crate doesn't use
it, but it could. Not every operation needs to be baked into the language
semantics.
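For what it's worth, in current Rust (APIs stabilized after this thread) the word-at-a-time reads can be written without `unsafe` at all, using `chunks_exact` and `u64::from_le_bytes`. A minimal sketch, not the crate's actual code:

```rust
use std::convert::TryInto;

// Safe alternative to raw-pointer reads: walk 32-byte blocks and decode
// four little-endian u64 words per block, no `unsafe` required.
fn xor_lanes(buf: &[u8]) -> [u64; 4] {
    let mut lanes = [0u64; 4];
    for block in buf.chunks_exact(32) {
        for (lane, word) in lanes.iter_mut().zip(block.chunks_exact(8)) {
            *lane ^= u64::from_le_bytes(word.try_into().unwrap());
        }
    }
    lanes
}

fn main() {
    // Two identical 32-byte blocks XOR to zero in every lane.
    assert_eq!(xor_lanes(&[1u8; 64]), [0, 0, 0, 0]);
}
```

The bounds checks in `chunks_exact` are hoisted out of the loop, so in practice the compiler generates code comparable to the pointer version.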

~~~
yazaddaruvala
While I mostly agree with you, I'd like to play devil's advocate:

The Rust core team is relatively relaxed about the community's usage of
`unsafe`. I say this because they do not seem to be interested in actively
discouraging its usage; i.e., `unsafe` is discouraged in documentation, not
via tools.

"Hey, please don't use `unsafe` unless you know what you're doing" is like
writing a comment in JavaScript: `function(x /* int */) {`.

Ideally, the usage of `unsafe` should be discouraged by the compiler via
friction. The compiler should, at the least, spit out some metrics after a
compile cycle about the percentage of `unsafe` lines/instructions. Even more
ideal, when the compiler detects an `unsafe` it would pause and ask, "Is Crate
Z trusted [y/N]?". Cargo can then make this easy in Cargo.toml.

All of a sudden, crate writers need to think twice about their usage of
`unsafe`: will users be willing to trust my code? Is using `unsafe` here
really worth the risk of fewer users adopting it? Is there already a
generally trusted library that solves this?

~~~
Manishearth
Nobody is putting effort into propaganda about unsafe because the community is
already very strongly averse to this, and is careful about writing unsafe
code. It's not a problem. If it becomes a problem (I doubt it) we can put
effort into it. People learn about the language through discussion or
documentation, and both of these venues actively discourage unsafe. The one
resource out there that teaches unsafe code in depth (the Rustonomicon) is
very heavy on warning the reader about unsafe code pitfalls and in general
discouraging the reader from writing unsafe code.

A tool for tracking unsafe dependencies has been talked about before, though.
Sounds like a good idea to me. Like I said I don't think there's a particular
need for it, but it would be nice to have.

~~~
duaneb
Have you ever been bitten by unsafe code?

IMHO any metrics would be gamed and prioritized above actual quality. As
actual quality isn't suffering, why add the unsafe metric at all? Seems like
paranoia not borne out of experience.

~~~
Manishearth
> Have you ever been bitten by unsafe code?

Yeah. Most often in FFI code (when the invariants are much harder to uphold).
Rarely when writing unsafe abstractions. The few times I remember this
happening with abstractions is due to really old code that broke in a compiler
upgrade (pre-1.0).

> IMHO any metrics would be gamed and prioritized above actual quality.

Yeah, there have been discussions in the past about a "safe code" badge for
crates and stuff like this, and the conclusion is that it might discourage
people from using unsafe code where they actually should be.

------
sp332
I was trying to figure out how this is so much faster than FNV
[https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo...](https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function#FNV-1_hash)
Is it only because of the parallelism? Or are the operations really that much
cheaper somehow?

~~~
LeifCarrotson
Both.

The pseudocode for FNV looks like this:

    
    
       hash = FNV_offset_value
       for each byte_of_data to be hashed
       {
            hash = hash XOR byte_of_data
            hash = hash × FNV_prime
       }
       return hash
    
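That loop is small enough to write out in Rust directly. Note the XOR-then-multiply order shown above is the FNV-1a variant; this sketch uses the standard 64-bit FNV constants:

```rust
// FNV, 64-bit, byte-at-a-time, matching the pseudocode's
// XOR-then-multiply order (the FNV-1a variant).
const FNV_OFFSET: u64 = 0xcbf29ce484222325;
const FNV_PRIME: u64 = 0x100000001b3;

fn fnv1a(data: &[u8]) -> u64 {
    let mut hash = FNV_OFFSET;
    for &byte in data {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

fn main() {
    // Empty input returns the offset basis unchanged.
    assert_eq!(fnv1a(b""), FNV_OFFSET);
    // Order-sensitive, unlike a plain XOR fold.
    assert_ne!(fnv1a(b"abc"), fnv1a(b"acb"));
}
```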

The pseudocode for seahash looks like this (with '×' as the wrapping
multiplication operator, and some simplification for padding if the data
length in bytes is not a multiple of 8 bytes per word × 4 words in the hash
state):

    
    
        hash = {offset_1, offset_2, offset_3, offset_4}
    
        for (int data_index = 0; 
                 data_index < data.length_in_64_bit_words; 
                 data_index = data_index + 4_words_in_hash)
        {        
            for (int hash_index = 0; hash_index < 4; hash_index++)
            {
                // Mix in data
                hash[hash_index] = hash[hash_index] XOR data[data_index + hash_index]
                // Diffuse
                hash[hash_index] = hash[hash_index] XOR (hash[hash_index] RSHIFT 32)
                hash[hash_index] = hash[hash_index] × seahash_prime
                hash[hash_index] = hash[hash_index] XOR (hash[hash_index] RSHIFT 32)
                hash[hash_index] = hash[hash_index] × seahash_prime
                hash[hash_index] = hash[hash_index] XOR (hash[hash_index] RSHIFT 32)
            }
        }
    
        result = hash[0] XOR hash[1] XOR hash[2] XOR hash[3] XOR data.length_in_bytes
    
        result = result XOR (result RSHIFT 32)
        result = result × seahash_prime
        result = result XOR (result RSHIFT 32)
        result = result × seahash_prime
        result = result XOR (result RSHIFT 32)
    
        return result
    

FNV is operating on bytes of data, while seahash is operating on 64-bit words.
A modern processor will be able to handle 64 bits at once. True, it can
probably handle 8 bits independently in one instruction without having to
create a temporary value, but it still needs to do more operations.

FNV is completely sequential. Until the first byte is hashed, no work can be
done on the second byte. In seahash, as you observed, parallelism can be
exploited. The second, third, and fourth bytes are all completely independent
of the first byte, as bytes 6, 7, and 8 are independent of byte 5, and so on.
You can have four independent threads each do a quarter of the work, and then
put the result back together at the end.
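A rough Rust rendering of the pseudocode above, for the curious. This ignores tail handling, and the IVs and multiplier constant are illustrative rather than the crate's actual seeding:

```rust
use std::convert::TryInto;

// Illustrative stand-in for "seahash_prime" in the pseudocode above.
const P: u64 = 0x6eed0e9da4d94a4f;

// The xorshift-multiply rounds from the inner loop.
fn diffuse(mut x: u64) -> u64 {
    x ^= x >> 32;
    x = x.wrapping_mul(P);
    x ^= x >> 32;
    x = x.wrapping_mul(P);
    x ^ (x >> 32)
}

// Four independent lanes, each absorbing one 64-bit word per 32-byte
// block; the lanes only meet in the final fold, which is what lets the
// CPU's execution units overlap their work.
fn hash(data: &[u8], iv: [u64; 4]) -> u64 {
    let mut lanes = iv;
    for block in data.chunks_exact(32) {
        for (lane, word) in lanes.iter_mut().zip(block.chunks_exact(8)) {
            *lane = diffuse(*lane ^ u64::from_le_bytes(word.try_into().unwrap()));
        }
    }
    let folded = lanes[0] ^ lanes[1] ^ lanes[2] ^ lanes[3] ^ data.len() as u64;
    diffuse(folded)
}

fn main() {
    let iv = [1, 2, 3, 4]; // arbitrary example IVs, not the crate's defaults
    assert_ne!(hash(&[0u8; 32], iv), hash(&[0xffu8; 32], iv));
}
```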

~~~
irq-1
Hardware threads, just to clarify.

> SeaHash achieves the performance by heavily exploiting Instruction-Level
> Parallelism.

> This means that almost always the CPU will be able to run the instructions
> in parallel.

~~~
Sharlin
Execution units, to be precise. Separate threads of execution are not
involved, hardware or software.

------
tibbe
How does it perform on short strings (e.g. <= 16 bytes)? We've seen several
new hash functions lately with great throughput numbers, but unfortunately
they often end up being slower than FNV when used e.g. on keys in hash maps,
which are often short strings.

------
Dylan16807
I was unsure how to read the diffusion function's description, so I went to
the source.

The first line, "x ← x ≫ 32", might have a typo? It's actually assigning (x
XOR (x >> 32)).

With "x ← px", p is a fixed prime number being multiplied by x.

------
nullnilvoid
Great to see another piece of code written in Rust. That said, how do you make
claims of something being blazingly fast without any comparisons to
implementations in other languages such as C or C++?

~~~
tupshin
The claim is that it is a blazingly fast hash function _compared with other
hash functions_, and it happens to be written in Rust. Rust is an enabling
technology, but as a general rule it is not dramatically faster than a
comparable C/C++ implementation.

~~~
nullnilvoid
The title is confusing. If it tries to compete with other hash functions, why
does the title have to bear "in Rust"?

~~~
pcwalton
Because it's probably the most notable/unusual part of this hash function.
Virtually all other hash functions for at least a decade have had their
reference implementations written in C or C++.

------
eridius
Very cool!

BTW, the hyperlink for "reference" in the "Specification" section is broken.

------
jedisct1
Great. But keep in mind that this is not a keyed hash function.

~~~
pcwalton
You can use it as a keyed hash function by adjusting the IV. See:
[https://docs.rs/seahash/2.0.0/src/seahash/.cargo/registry/sr...](https://docs.rs/seahash/2.0.0/src/seahash/.cargo/registry/src/github.com-1ecc6299db9ec823/seahash-2.0.0/src/reference.rs.html#94)

~~~
tveita
It seems to already be implemented in
[https://docs.rs/seahash/2.1.1/src/seahash/.cargo/registry/sr...](https://docs.rs/seahash/2.1.1/src/seahash/.cargo/registry/src/github.com-1ecc6299db9ec823/seahash-2.1.1/src/buffer.rs.html#102)

But it only seeds one of the lanes, so you can still make collisions trivially
in one of the other lanes. I guess it could still be useful for namespacing.

------
crtc
What's the matter with this stupid word "blazingly" in the Rust community?

