XXH3 – a new speed-optimized hash algorithm (fastcompression.blogspot.com)
275 points by corysama 41 days ago | 83 comments



It's good to see a focus on small input sizes. Small keys are extremely important for hash functions and are often overlooked. Whenever I've looked at the distribution of key sizes in real world Rust code using hash maps it's always turned out to be shockingly small. 8 byte hash inputs are very common in real world code, for instance.

Most hash functions and benchmarks are not tuned for small inputs. This is why SipHash is disfavored in Rust these days, because while its performance is excellent on large keys it suffers quite a bit on small ones.


Yes, although even here, small is not all that small:

> The main focus is about short keys of random lengths, with a distribution of length roughly in the 20-30 bytes area, featuring occasional outliers, both tiny and large.

I have been playing around with writing interpreters and compilers as a hobby, and though I haven't gathered stats, I suspect that in source code, identifiers half that length are more than twice as common.

It makes me wonder if compilers are better off using old-school byte-by-byte hashes rather than new ones optimised for parallelism. (Then again, the hash function probably makes no real difference to speed in a real-world compiler that does optimisation.)


> It makes me wonder if compilers are better off using old-school byte-by-byte hashes rather than new ones optimised for parallelism.

Probably not.

Good hashes optimized for small sizes would probably still use wider loads and operations, perhaps overlapping them: e.g., for a 7-byte input you might do two 4-byte hashes over the first and second halves, meaning one byte gets processed twice, which isn't a problem. High-speed small memory copies follow the same approach.

One problem with byte-at-a-time hashes is that byte operations aren't really any faster than wider operations (at least up to 64 bits), so you'd do 7 mixing operations in the example above rather than 2.

The other problem is that wider chunks tend to "quantize" the number of loops in the algorithm, which is very helpful for branch prediction. A byte-at-a-time hash that uses a loop or some kind of jump into inline code will necessarily mispredict unless the exact length is predictable: but if it varies by even 1 byte, the branching is different. Hashes with larger chunks may take the same path for a range of lengths, so may have fewer mispredictions even if the size varies. For example, a hash that uses two possibly overlapping 4-byte reads may handle sizes from 5-8 (or even 4-8) bytes with the same path, so random variation within that range doesn't cause a mispredict.
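
For illustration, a minimal sketch of the overlapping-read idea; the structure and the odd multiplier constant are placeholders, not any particular hash:

    #include <stdint.h>
    #include <string.h>

    /* Sketch only: hash 4..8 byte keys with two possibly-overlapping 4-byte
       reads, so every length in that range takes the same code path. */
    static uint64_t hash_4to8(const uint8_t *p, size_t len, uint64_t seed)
    {
        uint32_t first, last;
        memcpy(&first, p, 4);               /* first 4 bytes              */
        memcpy(&last, p + len - 4, 4);      /* last 4 bytes, may overlap  */
        uint64_t x = ((uint64_t)first << 32) | last;
        x ^= seed + len;                    /* fold in seed and length    */
        x *= 0x9E3779B185EBCA87ULL;         /* placeholder odd multiplier */
        x ^= x >> 32;
        return x;
    }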

I think the era of the byte-at-a-time hash is really over, except perhaps if it gets some special hardware support.


> Yes, although even here, small is not that very small:

> > The main focus is about short keys of random lengths, with a distribution of length roughly in the 20-30 bytes area, featuring occasional outliers, both tiny and large.

Based on the graphs, XXH3 still seems to beat everything at key sizes > 6 (presumably bytes?)


> SipHash is disfavored in Rust these days

I wasn't aware of this!

What do people use instead?

(I remember reading that part of the motivation for making SipHash the default Hasher was that its randomness guards against hash flooding attacks. Is that something people care about in practice?)


https://github.com/cbreeden/fxhash is reasonably widely used. It doesn't have the security benefits of SipHash, however they are rather dubious in practice -- 64 bits of state is simply not enough to be secure, no matter the hash function (not to mention, real tables will not use the vast majority of those bits).


64 bits is plenty for avoiding hashdos attacks, though. Pretty much anything keyed is enough for that. What attacks are you talking about for a non-cryptographic hash function?


It's a sensitive topic that gets some people really riled up, but essentially it doesn't offer the protection people think it does.


Are you referring to this? http://perl11.org/blog/seed.html

That's controversial the way climate change is controversial.


The benefits or the critique? (I could guess, but prefer not to ...)


The critique. There's more rambling by (I think) the same author here: https://github.com/google/highwayhash/issues/28

His arguments are a little inconsistent and ambiguous, which suggests either a misunderstanding, perhaps a touch of mania, or both. But they're all wrong--whether at face value or when assuming the best possible version--at least regarding the security and the ease of generating collisions.

As for the argument that SipHash is too slow, that's a qualitative judgment that many people have made. But they don't usually need to first convince themselves, erroneously, that SipHash is insecure.


Because I got downvoted:

1) If he's arguing that SipHash (or any keyed) hash function is insufficient because one can presume that the key can be acquired, then that would render all encryption schemes insecure. The security of any keyed cipher or hash is always predicated on maintaining the secrecy of the key, period. If you throw that presumption out the door then they're all "broken".

2) If he's arguing that he can generate collisions by timing when an injected key collides, he never shows how that information can be used to reliably generate new collisions. At best he only _alludes_ to this. He says something about having code to do it, but then refuses to provide the code.

If a hash table only has two buckets, it's trivial to create collisions. For a uniformly random hashing function you have a 50% chance. But so what? Unless the hash table is a fixed size, when you try to inject more keys (usually 50%-80% of the hash table size) the table will be resized. This will happen regardless of the collision rate, and indeed you want to avoid predicating resize on actual collision rate because otherwise an attacker can force you to use up all memory. All you need for this is a simple counter for the table. (And of course your hash should evenly distribute keys, as SipHash does.)
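
A minimal sketch of that counter-based policy; the names and the 3/4 load factor are hypothetical, not from any particular library:

    /* Resize on occupancy (a simple counter), never on the observed collision
       rate, so an attacker can neither force unbounded growth nor suppress a
       resize. table_resize() and table_insert() are hypothetical helpers. */
    if (table->count + 1 > table->capacity * 3 / 4)
        table_resize(table, table->capacity * 2);
    table_insert(table, key, value);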

If you use a fixed-sized table, then not only will a secure hash not help you, but neither will anything else. If an attacker can reliably generate biased collisions, then he can always cause worst-case performance, including triggering whatever fancy reshuffle logic the hash table uses after a collision, violating the presumption of amortized O(1) work.

Note that the paper cited in the Github thread (Wool 2009) presumes a small key space:

  We have demonstrated that a remote algorithmic complexity 
  attack, against randomized hash tables, is possible if the 
  secret value is chosen from a small enough space. More
  secret bits cause more effort, time and space to be consumed
  in the information gathering stage. Thus, it seems that a 
  random value of 32 bits would render this attack impractical 
  with today's technology. Note though that in this paper the 
  attacker iterates over all possible random values in a 
  brute-force manner, searching for bucket collisions. 
  However, the search space may be limited to a smaller subset 
  of random numbers by taking advantage of the vulnerabilities 
  in the Linux Random Number Generator as suggested in 
  (Gutterman et al., 2006). This might lead to a feasible 
  attack against a server with a longer secret value.
SipHash takes a 128-bit key. The Wool attack isn't even remotely feasible. Even if the hash table is small and you're reducing the 64-bit SipHash output to only, say, 10 bits, you still can't recover the key and you can't generate biased collisions. As for targeting Linux's PRNG, that same argument applies to all cryptographic software schemes.

EDIT: Previously said the Wool paper relied on fixed-sized hash tables. But it actually relies on a small key space permitting a brute force, oracle attack.


I added xxh3 to my fork of smhasher: https://github.com/injinj/smhasher

I find it to be the fastest hash in my stable of hashes.

Graphs for 64 / 128:

https://github.com/injinj/smhasher/blob/master/plot64.svg

https://github.com/injinj/smhasher/blob/master/plot128.svg

Latency graphs for 64 / 128

https://github.com/injinj/smhasher/blob/master/plot64lat.svg

https://github.com/injinj/smhasher/blob/master/plot128lat.sv...

The latency test uses the hash result after each call; the original smhasher algo does not use the result.


For each size shown on the x-axis, is every hashing call using exactly that size?

It would be interesting to also have a test where an average size was chosen, but where the actual sizes varied randomly around that average - because that's a common case in real world code, and the results can be very different since branch prediction won't be perfect any more as it is with the fixed size tests.

Also, latency usually means feeding the output of the function back into the input of the next call. Throughput is what you get when you don't do that but you still need to consume the result, so it's not optimized away. You can use something like DoNotOptimize here:

https://github.com/google/benchmark/blob/master/include/benc...
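
For reference, a throughput-style loop with that helper might look like the sketch below; only benchmark::DoNotOptimize comes from the library, and some_hash is a stand-in for the function under test:

    #include <benchmark/benchmark.h>
    #include <cstdint>
    #include <cstddef>

    // Stand-in for the hash under test (FNV-style, purely illustrative).
    static uint64_t some_hash(const void* p, size_t len, uint64_t seed) {
        uint64_t h = seed ^ len;
        for (size_t i = 0; i < len; i++)
            h = (h ^ static_cast<const uint8_t*>(p)[i]) * 0x100000001B3ULL;
        return h;
    }

    static void BM_HashThroughput(benchmark::State& state) {
        char buf[32] = {};
        uint64_t seed = 0;
        for (auto _ : state) {
            uint64_t h = some_hash(buf, sizeof(buf), seed);
            benchmark::DoNotOptimize(h);   // consume the result, don't feed it back
        }
    }
    BENCHMARK(BM_HashThroughput);
    BENCHMARK_MAIN();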


Yes, it is the same size. I agree, a complex test like that in the original article is better (the Throughput versus Latency section).

The graphs from the "latency" test are doing this:

  for (;;) {
    h = hash( val, seed );
    seed += h;
  }
The original smhasher did this:

  for (;;) {
    hash( val, seed );
  }


Actually, making the next hash dependent on the previous hash through the `seed` doesn't work for benchmarking latency.

Several hashes only use the seed at the very end of the calculation. Among them, CityHash, FarmHash, MeowHash, probably a few more. This makes it possible for them to start calculating the next hash before the end of the previous one, so it's no longer "latency" for them, and the test conditions become uneven.

In reality, in a latency scenario, the hash function is waiting for the input. So it's the input which must depend on the previous hash result. This way, all algorithms must wait for the previous result, with no more dependency on how a specific algorithm handles the seed.


Good point.

I created a graph with these attributes:

  0: total += hash (XXH364)
  1: seed += hash (XXH364Seed)
  2: seed += hash, val += hash (XXH364SH)

https://gist.github.com/injinj/138543ccc6a23ceb1fcdc05f46288...

There are weird bumps at some key sizes with City and XXH when the key is updated with the previous hash. My guess is there is some underlying cache line latency added when the key is written to.

smhasher should really have a test for good seeding. As I understand it, bad City seeding was the reason people started using SipHash.


The cache line latency is strongly associated with the number of hash call repetitions. At 1, 2, 3, 4 repeats, the latency is not present in the timestamp counter. From 5 -> 9 repeats, the latency builds, adding 2 cycles of latency each repeat step. 9 is the max latency added, for a total of 10 additional cycles. I can make the latency go away with a memory fence added after each repetition, but the total number of cycles added is about 50.

Graphs:

  1.  The original, with repeat at 32

  2.  The repeats, 1 -> 9

  3.  The repeats + memory fence, 1 -> 9
https://gist.github.com/injinj/138543ccc6a23ceb1fcdc05f46288...


I'm afraid I don't follow. Is there any source code that could be read?

The way to force the hash algorithm to actually wait for input is to use the previous hash to determine the start position of the next hash's input. The jump doesn't have to be large. You can get good results with just a few hundred bytes of variance.
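
A minimal sketch of that idea; the buffer size and mask are arbitrary, and the names follow the smhasher snippets in this thread:

    /* Derive the start of the next input from the previous hash, so the hash
       call cannot begin before the previous result is known. */
    uint8_t buf[1024 + 64];                  /* variance window + max key length */
    uint64_t h = seed;
    for (int i = 0; i < repeat; i++) {
        size_t pos = (size_t)(h & 1023);     /* jump chosen by the previous hash */
        h = hash64(buf + pos, len, seed);    /* hash64 stands for the hash under test */
    }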


This is notable, because this is what rurban/smhasher does in the small key test, which many are using nowadays.

In SpeedTest.cpp, there is a function called timehash():

  uint64_t *aligned_result = (uint64_t *) aligned_alloc( 64, 256 / 8 );
  uint64_t *aligned_buf = (uint64_t *) aligned_alloc( 64, 256 );
  memcpy( aligned_buf, key, len );

      begin = rdtsc();
      for (int i = 0; i < repeat; i++) {
        hash(aligned_buf,len,seed,aligned_result);
        seed += aligned_result[0];
        aligned_buf[0] += aligned_result[0];
        aligned_buf[1] += aligned_result[1];
      }
      end = rdtsc();
Take away the aligned_buf += aligned_result, and the additional latency goes away.

When I get a chance, I'll try your method of randomizing input.


Right, so that is a latency test because each result feeds back into the next call (through seed). A typical throughput test would be like:

  total = 0;
  for (;;) {
    total += hash( val, seed );
  }
Note that the hash value is consumed, but doesn't feed into the next result so multiple hashes can run in parallel. If that wasn't the case, it is lucky the loop wasn't compiled away entirely (probably because the hash functions and the test loop are in different compilation units and LTO is not turned on).


I just re-ran this with a total += hash(). It adds about 2 - 4 cycles to the throughput test (the non-latency graph).


What I really loved was the different benchmark ideas e.g. variable input size and hashing as part of a bigger algorithm.


Thank you! Currently using XXH64 to good success for small keys, and xxh3 looks fantastic as the next upgrade.


https://github.com/RedSpah/xxhash_cpp/blob/master/xxhash/xxh...

The core of the function is as follows:

    static U64 XXH64_round(U64 acc, U64 input)
    {
        acc += input * PRIME64_2;
        acc  = XXH_rotl64(acc, 31);
        acc *= PRIME64_1;
        return acc;
    }
Multiplication, rotation, and a 2nd multiplication. These are 64-bit multiplications, which are well known to mix bits well, but CPUs don't have many 64-bit multipliers. Four rounds execute in parallel (see XXH64_update_endian), which can hide the long latency (~5 cycles) associated with a 64-bit multiply.
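
The surrounding loop is roughly this shape (paraphrased; the exact helper names in the source may differ):

    /* Four independent accumulators let the ~5-cycle multiplies overlap. */
    do {
        v1 = XXH64_round(v1, XXH_readLE64(p)); p += 8;
        v2 = XXH64_round(v2, XXH_readLE64(p)); p += 8;
        v3 = XXH64_round(v3, XXH_readLE64(p)); p += 8;
        v4 = XXH64_round(v4, XXH_readLE64(p)); p += 8;
    } while (p <= limit);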

My own experiments show that multiplication is a great function for hashing. Multiplication however only spreads the bits in "one direction" (towards the most-significant bit), and very poorly spreads bits towards the LSB.

Rotations, as well as a 2nd multiplication, are helpful at spreading the bits around better.

----------

The 64-bit version is somewhat interesting to me, but the 32-bit version probably has more potential for optimization. 32-bit vectorized multiplication exists in AVX2, so the 32-bit version probably can be vectorized. I'd expect the 32-bit version (if vectorized) would be faster, but the 64-bit version probably mixes the bits around better.

Overall it looks like a decent design. It uses ILP with 4x state variables (state->v1, state->v2, etc.) to allow CPUs to process these multiplications in parallel. Multiply + rotation + multiplication is good from my personal experiments, but the constants need to be carefully chosen.

Prime numbers are probably not necessary: they just need to be odd numbers to ensure invertibility (Ex: I've achieved decent mixing by multiplying with 0xAAAAAAAA5, which probably isn't a prime number). I'm curious how the prime-constants were selected. There could be some potential for "better numbers" for mixing, depending on how PRIME64_1 and PRIME64_2 were chosen.

Overall checks all of my boxes with regards to good design. Nice! I'm just curious about some details.

----------

I'll note that in my experiments, multiplication seems to "mix" better with XOR, rather than with ADD. I haven't done any tests with this hash yet, but I'm curious what effect changing the "+=" into "^=" would have on the statistical strength of the hash function.


I believe you are looking at XXH64 and XXH32, which is the old version. The new version XXH3 is located in https://github.com/Cyan4973/xxHash/blob/dev/xxh3.h which exposes the public prototypes `XXH3_64bits()` and `XXH3_128bits()` (and variants with seeds).


Thanks. This code is IMO far easier to read, despite being intrinsics. It just requires some familiarity with shuffle. Anyone who wants to follow along should use: https://software.intel.com/sites/landingpage/IntrinsicsGuide...

        assert(((size_t)acc) & 31 == 0);
        {   ALIGN(32) __m256i* const xacc  =       (__m256i *) acc;
            const     __m256i* const xdata = (const __m256i *) data;
            const     __m256i* const xkey  = (const __m256i *) key;
    
            size_t i;
            for (i=0; i < STRIPE_LEN/sizeof(__m256i); i++) {
                __m256i const d   = _mm256_loadu_si256 (xdata+i);
                __m256i const k   = _mm256_loadu_si256 (xkey+i);
                __m256i const dk  = _mm256_add_epi32 (d,k);                                  /* uint32 dk[8]  = {d0+k0, d1+k1, d2+k2, d3+k3, ...} */
                __m256i const res = _mm256_mul_epu32 (dk, _mm256_shuffle_epi32 (dk, 0x31));  /* uint64 res[4] = {dk0*dk1, dk2*dk3, ...} */
                __m256i const add = _mm256_add_epi64(d, xacc[i]);
                xacc[i]  = _mm256_add_epi64(res, add);
            }
        }
This is followed up by "ScrambleAcc":

        assert(((size_t)acc) & 31 == 0);
        {   ALIGN(32) __m256i* const xacc = (__m256i*) acc;
            const     __m256i* const xkey  = (const __m256i *) key;
    
            size_t i;
            for (i=0; i < STRIPE_LEN/sizeof(__m256i); i++) {
                __m256i data = xacc[i];
                __m256i const shifted = _mm256_srli_epi64(data, 47);
                data = _mm256_xor_si256(data, shifted);
    
                {   __m256i const k   = _mm256_loadu_si256 (xkey+i);
                    __m256i const dk  = _mm256_mul_epu32 (data,k);          /* U32 dk[4]  = {d0+k0, d1+k1, d2+k2, d3+k3} */
    
                    __m256i const d2  = _mm256_shuffle_epi32 (data,0x31);
                    __m256i const k2  = _mm256_shuffle_epi32 (k,0x31);
                    __m256i const dk2 = _mm256_mul_epu32 (d2,k2);           /* U32 dk[4]  = {d0+k0, d1+k1, d2+k2, d3+k3} */
    
                    xacc[i]  = _mm256_xor_si256(dk, dk2);
            }   }
        }
I don't believe AVX2 rotation exists (correct me if I'm wrong though). So the author has opted for _mm256_srli_epi64 (... 47), which is a 47-bit right shift, followed up with XOR. This is how the author gets those "highly mixed" high-bits back down to the lower-bits.
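
For reference, if a vector rotate were wanted it would have to be emulated with two shifts and an OR; a generic sketch, not code from xxh3:

    #include <immintrin.h>

    /* Emulated 64-bit rotate-left by 31 on AVX2: shift left, shift right, OR. */
    static inline __m256i rotl64_31_avx2(__m256i x)
    {
        return _mm256_or_si256(_mm256_slli_epi64(x, 31),
                               _mm256_srli_epi64(x, 64 - 31));
    }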

Vectorization has broadened to 512-bits of state (4x128) so that the vectorized steps can be ILP'd.

The keys are:

    ALIGN(64) static const U32 kKey[KEYSET_DEFAULT_SIZE] = {
        0xb8fe6c39,0x23a44bbe,0x7c01812c,0xf721ad1c,
        0xded46de9,0x839097db,0x7240a4a4,0xb7b3671f,
        0xcb79e64e,0xccc0e578,0x825ad07d,0xccff7221,
        0xb8084674,0xf743248e,0xe03590e6,0x813a264c,
        0x3c2852bb,0x91c300cb,0x88d0658b,0x1b532ea3,
        0x71644897,0xa20df94e,0x3819ef46,0xa9deacd8,
        0xa8fa763f,0xe39c343f,0xf9dcbbc7,0xc70b4f1d,
        0x8a51e04b,0xcdb45931,0xc89f7ec9,0xd9787364,
        0xeac5ac83,0x34d3ebc3,0xc581a0ff,0xfa1363eb,
        0x170ddd51,0xb7f0da49,0xd3165526,0x29d4689e,
        0x2b16be58,0x7d47a1fc,0x8ff8b8d1,0x7ad031ce,
        0x45cb3a8f,0x95160428,0xafd7fbca,0xbb4b407e,
    };
Hmmm... this key is added with the data as it gets mixed in.

---------

Definitely seems like cleaner code actually. Again, I'm not seeing any obvious red-flags in the code, so it looks pretty good to me.

The only thing is that the 32-bit to 64-bit multiply step seems odd to me. It probably isn't losing any information, but I'm kind of mind-farting and can't see whether or not that step is potentially losing entropy... I'll probably have to sleep on it.

Specifically this step:

    __m256i const res = _mm256_mul_epu32 (dk, _mm256_shuffle_epi32 (dk, 0x31));  /* uint64 res[4] = {dk0*dk1, dk2*dk3, ...} */
Those numbers aren't necessarily odd. It's a 32-bit multiply that takes two 32-bit numbers, and outputs a 64-bit number (vectorized). It doesn't seem like a "strong" hash, but maybe I need to think about it more. If this is a good hash (and with all the tests done on it... it probably is a good hash), then I'd be surprised.

It reminds me of an RNG that Knuth discussed in his book: where you multiply two numbers and then take the "middle" results of the multiplication (https://en.wikipedia.org/wiki/Middle-square_method). So I feel like there's potential for weakness here. I can't see it one way or the other yet; it's just something I'm trying to think about...


According to the UMAC paper, the 32x32=>64 multiplication only contains 32 bits of entropy, even though it uses a 64-bit space. That's understandable: most of the entropy will be in the middle of the register.

That's enough for XXH3. Since it maintains a 512-bit internal state, that means it transports 256 bits of entropy.


But there are 64 bits of input (32-bit A x 32-bit B). So if you have 64 bits of input, but only end up with 32 bits of output entropy, then you've lost information.

I realize you have to compress data down in a Hash function somehow, but ideally you want to minimize the loss of entropy / information from the input bits. The ideal mixing function would have 512-bits of internal state, with 512-bits of entropy starting off... and ending with 512-bits of entropy once all the mixing were done.

If your factoid is correct, then we're starting with 512-bits of entropy, but only 256-bits of entropy after the multiply.

> That's enough for XXH3. Since it maintains a 512-bit internal state, that means it transports 256-bit of entropy.

Why not optimize the function, and aim for only 256-bits of internal state (with 256-bits of entropy) ??

See: you can cut down on state and possibly improve performance. Maybe not on Intel Skylake, but probably on ARM Neon / AMD Zen (which have 128-bit SIMD registers internally). Hmmmm... 512-bit (aka 2x AVX2 registers) is probably needed to get good ILP on Intel processors. So there probably wouldn't be much improvement for Intel.

-----------

Again though: I'm not sure if it's losing information yet. It's just something I'm thinking about... (32-bit XOR would be 32-bit + 32-bit input, with only 32 bits of entropy output. But unlike multiply, you only have 32 bits of state.)

32-bit multiplication with a constant, keeping only the bottom 32 bits, keeps all 32 bits of entropy without being forced to expand to 64 bits. I definitely like the bit-mixing properties of multiply; it's just difficult to find configurations of multiplication that save every bit of entropy.


It's not exactly "losing information". We are not trying to regenerate original data, just make sure that all source bits can fairly influence the result. It's more a question of bit contribution.

In the resulting 64-bit register, bit 0 can only be contributed to by bit 0 of each 32-bit input, so it's only representative of these 2 bits. Same at the other end: bit 63 depends mostly on bit 31 of each 32-bit input, and also on carries from previous bits (making things more complex).

In the middle, many more bits participate, so that's where they are "well mixed".

This question of bit distribution would become critical if the 64-bit accumulator were used "as is" to produce a hash, but fortunately it's not. Accumulators will get mixed before that stage. When the mixing function is done correctly, every bit will get redistributed to the point of saturation. After that point, it does not matter that initially one accumulator's bit had fewer contributions than another: all source bits contribute fully. This can be easily verified with SMHasher's Avalanche test.

Finally, XXH3 does not follow UMAC too closely, and adds an operation which ensures that all bits are necessarily present at least once in the accumulator. This compensates for the risk of multiplying by zero, which _is_ dangerous for checksumming, as it would nullify a contribution.


> It's not exactly "losing information". We are not trying to regenerate original data, just make sure that all source bits can fairly influence the result. It's more a question of bit contribution.

I agree. But entropy is a great concept that helps allow us to calculate whether or not "input bits" can affect "output bits".

Lets look at XXH3_scrambleAcc:

    __m256i const k   = _mm256_loadu_si256 (xkey+i);
    __m256i const dk  = _mm256_mul_epu32 (data,k);
    __m256i const d2  = _mm256_shuffle_epi32 (data,0x31);
    __m256i const k2  = _mm256_shuffle_epi32 (k,0x31);
    __m256i const dk2 = _mm256_mul_epu32 (d2,k2);
    xacc[i] = _mm256_xor_si256(dk, dk2);
Assume you start with 512-bits of input entropy (that is: assume that all 512-bits of state are uniformly random). Will you have 512-bits of uniformly random output after the above operations? Or to put it in your words: are you 100% sure that the above steps allow every input bit to affect every output bit?

It appears not. My counter-example (not that I've debugged your code... but based on my reading) is as follows: key[2] is 0x7c01812c, and key[3] is 0xf721ad1c. This means that key[2] AND key[3] are even numbers.

Which means dk.int64[1] bit0 will always be zero. And dk2.int64[1] bit0 will ALSO always be zero (on iteration 0, i=0)

I bet you, at least... if I did my math correctly... that xacc[0].int64[1].bit0 will ALWAYS be zero, no matter what state you start with (bit#64 of xacc[0] == 0)

This is because xacc[0].int64[1].bit0 = dk.int64[1].bit0 XOR dk2.int64[1].bit0

And both of those component parts seem to always be zero. Which means you're at LEAST wasting the 64th bit of state each time you perform XXH3_scrambleAcc. It certainly seems like a weakness of the XXH3_scrambleAcc function to me.

That's a lot of operations you do, to literally throw away half your bits. There's probably an optimization you can do here to get better mixing.
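
A quick standalone check of the bit-0 claim above, using plain scalar arithmetic and the two even key words quoted:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* With kKey[2] and kKey[3] both even, (d2*k2) ^ (d3*k3) always has bit 0
       clear, whatever d2 and d3 are: an even multiplier gives an even product. */
    int main(void)
    {
        const uint32_t k2 = 0x7c01812c, k3 = 0xf721ad1c;   /* both even */
        for (int i = 0; i < 1000000; i++) {
            uint32_t d2 = (uint32_t)rand() * 2654435761u;  /* arbitrary inputs */
            uint32_t d3 = (uint32_t)rand() * 40503u;
            uint64_t x  = ((uint64_t)d2 * k2) ^ ((uint64_t)d3 * k3);
            if (x & 1) { printf("bit 0 was set\n"); return 1; }
        }
        printf("bit 0 was always zero\n");
        return 0;
    }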


Well, if you believe a better scrambling operation is possible, you are certainly welcome to suggest one. Considering feedback on the algorithm is one of the objectives of the test phase. If the issue is about the default keys, it's also possible to change them, though one will also have to consider what happens with custom keys and whether it implies ensuring some conditions (custom keys are a long-term objective of XXH3). In the end, XXH3 only generates 64- and 128-bit hashes, and maybe 256 in the future, so compression necessarily happens somewhere. I believe the issue would be more pressing if the intention was to generate a 512-bit hash, but that has never been an objective.


HighwayHash uses PSHUFB to spread entropy from twin 32x32=64 muls across 128-bit vectors [or vector halves, on AVX2]. See ZipperMerge in https://github.com/google/highwayhash/blob/master/highwayhas....


> Well, if you believe a better scrambling operation is possible, you are certainly welcomed to suggest one.

Oh, criticism of the algorithm is the easy part, especially when I'm being vague and can't think of solid proof of my suppositions. Coming up with a better scrambling operation is the hard part. :-)

As I stated earlier, there are no obvious "red flags" in your algorithm. It's just this obscure part of the function that makes me think that optimization potential exists.

Since you only have 256-bits of entropy in your 512-bit accumulator, my goal if I were to go through your code is to cut the state down to 256-bits, while keeping 256-bits of entropy.

1. All multiplication keys should be odd (bottom bit set to 1). This ensures that all multiplication steps that only keep the 32-bottom bits will keep all of their entropy.

2. Precisely throw away the top 32-bits of your multiplications, and only keep the bottom 32-bits. This will be slower, but it ensures full entropy when combined with odd numbers for the key.

3. Now that we have full entropy on the multiplication steps, cut the 512-bit accumulator down to 256 bits, which should improve performance on AMD Zen / ARM NEON. It may help Intel Skylake, but Skylake is so wide that it's hard to get ILP.

--------------------

Without testing the code, here's an idea.

    __m256i const k   = _mm256_loadu_si256 (xkey+i);
    __m256i const dk  = _mm256_mul_epu32 (data,k);
    __m256i const d2  = _mm256_shuffle_epi32 (data,0x31);
    __m256i const k2  = _mm256_shuffle_epi32 (k,0x31); // <---- This line can be precomputed by the way...
    __m256i const dk2 = _mm256_mul_epu32 (d2,k2);
    // Above this line is the same code as before
    __m256i const dk2s = _mm256_shuffle_epi32 (dk2, 0xc4);  /* dk2 is const above, so use a new name */
    xacc[i] = _mm256_blend_epi32(dk, dk2s, 0xaa);
As long as the keys were all odd (bottom bit set to 1), the above should mix the bits without losing any entropy. So you can now have a 256-bit state.

----------

The accumulate portion would be simply:

    __m256i const add = _mm256_xor_si256(d, xacc[i]);
    xacc[i] = _mm256_xor_si256(res, add);
XOR (or add) captures all the entropy from the input. In my experiments, XOR works better with multiply (I don't know why, but it's just something I've noticed). But otherwise, it's conceptually similar to your original "add" code.

----------

I mean, I'm just shooting from the hip here. I don't know how well the above code mixes things around. But those are kind of the steps that I would take to move forward.

----------

Although, if you want me to be 100% honest, anything I'd do "for real" would revolve around _mm_aesenc_si128. That'd be

    xacc[0] = _mm_aesenc_si128(xacc[0], data[0]);
    xacc[1] = _mm_aesdec_si128(xacc[1], data[1]);
    
    // Finalization:
    __m128i finalizationA = _mm_aesenc_si128(xacc[0], const1);
    finalizationA = _mm_aesenc_si128(finalizationA , const2);
    finalizationA = _mm_aesenc_si128(finalizationA , const3);
    finalizationA = _mm_aesenc_si128(finalizationA , const4);
   
    __m128i finalizationB = _mm_aesdec_si128(xacc[1], const1);
    finalizationB = _mm_aesdec_si128(finalizationB , const2);
    finalizationB = _mm_aesdec_si128(finalizationB , const3);
    finalizationB = _mm_aesdec_si128(finalizationB , const4);

    __m128i finalHash = _mm_xor_si128(finalizationA, finalizationB);

    // Return bottom 32, 64, or 128-bits of finalHash
 
    // 4 rounds of finalization should be enough. 
    // Testing required to find the minimum for an avalanche condition. It will be more than 2, but probably less than 4...
And... the end. That's it. It only uses the 128-bit functionality of the execution units, but still gets ILP off of two parallel instances. So 256 bits of state. It's probably a fast function, but I can't be bothered to test it right now. I'm curious if the simpler functionality above would outspeed the 256-bit execution units on Skylake, but I don't have a Skylake box to test on (just AMD Zen).

With AVX512, _mm512_aesenc_epi128 makes it possible to do this over 512 bits (4x 128-bit parallel instances of AES encryption). But I don't have a Skylake-X machine to test this idea with.

See here for my experiment on AES functionality: https://github.com/dragontamer/AESRand


Thanks for the suggestions @dragontamer. We are genuine when saying the algorithm is open to suggestions, and it can still change to improve its properties. Let's review yours:

> my goal if I were to go through your code is to cut the state down to 256-bits, while keeping 256-bits of entropy.

The point is, it's a lot more difficult to keep the registers "fully loaded" at maximum entropy _at every step_. By going double size, and accepting that this entropy is leniently dispersed into the larger register, we make our lives a lot easier, which translates into appreciable speed gains, as detailed below.

> All multiplication keys should be odd (bottom bit set to 1). This ensures that all multiplication steps that only keep the 32-bottom bits will keep all of their entropy.

One core idea of UMAC, which XXH3 is based upon, is that the keys could be any random number (they are supposed to be secret). Forcing them to be odd reduces available space by one bit. Not a very big deal, but still. This could also be ensured by adding an `OR 1` operation on loading the key.

> Precisely throw away the top 32-bits of your multiplications, and only keep the bottom 32-bits. This will be slower, but it ensures full entropy when combined with odd numbers for the key.

OK, so that's where the notion of "bit contribution" becomes useful. By making a 32x32=>32 multiplication, and ensuring the multiplier is odd, you have mixed the lowest bit correctly. But the other ones do not contribute to lower bits. At the other extreme, the highest bit only contributes to one (the highest) bit in the resulting product. It's clearly not well mixed.

This can be compensated for with a rotation, or a right shift + add operation, followed by another spreading step (another multiplication), and another right shift + add. But all this adds quite a few operations, right in the critical loop, so it's no longer the same speed.

> This line can be precomputed by the way...

That's a great point! It transforms the shuffle into a load; it's not completely free but is probably faster. More importantly, it requires memory to store the swapped table of keys, which can be troublesome if the size of the table of keys can be customized. I'll look into it, thanks for the suggestion!

> XOR (or add) captures all the entropy from the input. In my experiments, XOR works better with multiply

There are 2 parts here :

- It's not possible to XOR `d` with `res` (which is what happens transitively in your proposal). The whole point of adding d is that it avoids cancelling a contributor, which can happen if there is a multiplication by zero. With XOR, the same impact can still happen, but it requires a multiplication by 1 instead: `(1*d)^d = 0`. Same problem if `d` is subtracted. But with an add operation, cancelling `d`'s contribution requires a multiplication by `-1`. And that is precisely impossible when doing a 32x32=>64 multiplication. Unfortunately, when doing a 32x32=>32 multiplication, it becomes actually possible: -1 == 0xFFFFFFFF. So the addition doesn't save the situation, and it's now necessary to change the formula.

- Xoring `res` with `acc` seems more tractable. I tried it in the early days of XXH3, but unfortunately it proved worse in terms of hash quality. At this stage, I'm not too sure why. Maybe later changes indirectly solved an underlying issue, so it might be worth trying again to see if it proves any better.

> to be 100% honest, anything I'd do "for real" would revolve around _mm_aesenc_si128

I fully agree. Leveraging dedicated hardware capabilities is most likely efficient, and AES is doing a great job at mixing bits. This is more difficult to emulate with simpler instructions. There are only 2 minor issues to be aware of :

- It relies on the presence of a hardware AES module. While not an issue when the target platform basically guarantees its presence, it's a non-trivial problem when targeting broader portability. Platforms without AES, or without access to it (yes, even on Intel, some "systems" can't access the AES instruction, think `WASM` or kernel space for example), will pay a hefty performance price while using a software fallback. It's not necessarily a killer issue, just something to be aware of. xxHash tries to target very broad portability. This is a "handicap", but with its own benefits.

- AES instructions have excellent throughput, but latency is non-negligible. This is especially true on pre-Skylake CPUs. Latency is hard to measure, so it's frequently forgotten in benchmarks. In my own tests (on a latest-generation Intel CPU, so very favorable to AES), using AES instructions ends up in the 80M/s region when measuring latency, which is not bad. To be compared with XXH3, which ends up in the 110-150M/s region.

Don't read me wrong, using AES is likely a good choice. The only reason XXH3 doesn't use it is that it targets very broad portability, including targets without a hardware AES module.


> OK, so that's where the notion of "bit contribution" becomes useful. By making a 32x32=>32 multiplication, and ensuring the multiplier is odd, you have mixed correctly the lowest bit. But the other ones do not contribute to lower bits. At the other extreme, the highest bit only contribute to one (the highest) bit in the resulting product. It's clearly not well mixed.

True, but this is also true for the top 16-bits and bottom 16-bits of a 64-bit multiply. It's a problem innate to multiplication: the "middle" bit (bit#31) will be best, while the "edge" bits (bit#0 or bit#63) will be awful.

> This can be compensated, with a rotation, or a right shift + add operation, followed by another spreading (another multiplication), and another right shift + add. But all this adds quite a few operations, right in the critical loop, so this is no longer the same speed.

There are quite a few other ways to compensate, which will remain efficient. vpshufb is 1-latency and once-per-clock throughput. If this were ARM, NVidia GPU, or AMD GPU, bit-reversal would work (RBIT on ARM).

But since this is Intel, the fastest way to spread bits around is to do shuffle(state, [0,1,2,3]) (which is my special shorthand for _mm256_shuffle_epi32 (k,0x1b)).

In general, you just need to map the "strongly mixed" bits (bit#31) to the locations which will potentially affect the most bits (bit#0 in the next multiplication). Bit-reversal is best, but almost any vpshufb should do the trick.

> The whole point of adding d is that it avoids cancelling a contributor, which can happen if there is a multiplication by zero

"Cancelling a contributor" is bad, but I strongly disagree with your characterization of it!

"Cancellation of a contributor" only happens when two inputs map to the same output. By the pigeonhole principle, something MUST map to zero. Otherwise, the "zero" output is wasted.

You've identified the singular inputs that result in zero in various ways on 32-bit -> 32-bit multiplication. That's a good thing!! There is ONLY ONE input that maps to zero in my proposed multiplication.

Again: It's not so important that "nothing maps to zero". It's far more important that "exactly one thing maps to zero".

----------

In your algorithm: if d[0] and d[1] are zero, you get a zero as a result. (d[0] * k + d[1] * k == dk[0] == 0 when d[0] == d[1] == 0).

But... (d[0] * k[0] + d[1] * k[1] == 0) defines a whole slew of "potential mappings to zero". It's non-trivial to analyze this, especially with overflow (the "zero" on the right hand side is 64-bit unsigned with overflow).

In a 16-bit case, 0xffff * 0xfffc (d[0] == 0xffff, k[0] == 0xfffc) will be "inverted" by d[1] == 163838 k[1] == 2.

So in this case, we have a violation of the pigeonhole principle: d[0] == d[1] == 0 will map to zero, AND d[0] == 0xffff, k[0] == 0xfffc, d[1] == 163838, k[1] == 2 maps to zero as well.

That's how you "lose bits", with multiple inputs potentially mapping to the same output (in this case: zero, but in general... any input).

With regards to multiplication by an odd number, someone already proved it's reversible: https://lemire.me/blog/2017/09/18/computing-the-inverse-of-o...

Which means there is exactly one input that maps to exactly one output for all odd-multipliers.
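
For reference, the inverse of an odd 64-bit multiplier can be computed with a handful of Newton steps; a generic sketch of the technique from the linked post:

    #include <stdint.h>

    /* If x*a == 1 (mod 2^k) then x*(2 - a*x) == 1 (mod 2^(2k)).  Starting from
       x = a (already correct to 3 bits for odd a), five steps cover 64 bits. */
    static uint64_t inverse_of_odd64(uint64_t a)
    {
        uint64_t x = a;               /* a must be odd */
        for (int i = 0; i < 5; i++)
            x *= 2 - a * x;
        return x;                     /* a * x == 1 (mod 2^64) */
    }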

> Don't read me wrong, using AES is likely a good choice. The only reason XXH3 doesn't use it is that it targets very broad portability, including targets without a hardware AES module.

Hmmmm... in any case, a 32-bit multiply is certainly more portable. I think the throughput / latency issue can be easily solved by increasing the state. 5x128 bits of state to unroll the inner-loop should be sufficient (5-cycles of latency, once-per-cycle of throughput == 640-bit state should be best)

But in any case, you have a point about portability. Note that the AES instruction is implemented on ARM, POWER9, and Intel machines now. So it's surprisingly portable, even if systems as a whole (ex: WASM or maybe the Java Virtual Machine) don't necessarily support it yet.


Following your suggestion, I went ahead and modified the scrambling formula. The crux of the problem is that the secret is allowed to be absolutely anything (it's user provided). As a consequence, it's a bad idea to depend on it for the multiplier. Now, the secret key's only function is to be a mask, while the multiplier is under implementation control, guaranteeing a good prime.


> Why not optimize the function, and aim for only 256-bits of internal state (with 256-bits of entropy) ??

It's a lot more difficult to ensure that accumulators contain full-width entropy at all times. The trade-off in this algorithm is that we intentionally use bigger accumulators and don't even try to maintain the entropy at full level all the time. All it needs is to ensure that the level of entropy is _at least_ 32 bits per accumulator, which is much easier. This leads to better speed.


If I am not mistaken 0xAAAAAAAA5 is 45812984485 = 5 * 28297 * 323801. At least that is what

    factor $(echo -e "ibase=16\nAAAAAAAA5" | bc -l)
claims.


Oh man, a typo. Lol. I meant 0xAAAAAAA5... sorry about that.

Lemme muse on the concept of multiplying by 0xAAAAAAA5: the binary pattern is 1010,1010,1010...0101.

A multiplication, at its core, is equivalent to a LOT of bitshift-lefts and adds. Multiplying by binary "1000" is the same as a bitshift left by 3. Multiplying by "1010" is bitshift left by 3 + bitshift left by 1.

This is why it's important to multiply by an odd number: all those bitshifts lose the top-most bits. You only keep the topmost bit if the bottom-most bit of the multiplier is set.

So multiply by 0xAAAAAAA5 is the same as "bitshift left 31 + bitshift left 29 + bitshift left 27 ... + bitshift left 2 + bitshift left 0", all at once.

Addition itself is a great "mixing" instruction: it's innately an XOR with the potential for carries. The carries make the XORs far more complicated to analyze. The 31st bit after all of those bitshifts is not only the XOR of a bunch of bits... but also the "propagated carry bit".

The topmost bit of any multiplication operation is very complex, while the bottom most bit is always the original bit (bit0 is only affected by bit0 from the input). That's why a rotation + 2nd multiply is needed.
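
A small standalone check of the shift-and-add equivalence described above; the test value is arbitrary:

    #include <stdint.h>
    #include <assert.h>

    /* Multiplying by 0xAAAAAAA5 equals summing x shifted left by every set-bit
       position of the constant (mod 2^32, exactly like the multiply itself). */
    static uint32_t mul_by_shifts(uint32_t x)
    {
        const uint32_t k = 0xAAAAAAA5u;
        uint32_t sum = 0;
        for (int b = 0; b < 32; b++)
            if (k & (1u << b))
                sum += x << b;
        return sum;
    }

    int main(void)
    {
        assert(mul_by_shifts(0x12345678u) == 0x12345678u * 0xAAAAAAA5u);
        return 0;
    }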


0xAAAAAAA5 makes a lot more sense as it is a 32-bit integer. Btw it is 3^6 * 5^2 * 157109, so not a prime, just like you expected. Then again you don't care about that, only about the bit pattern.

Any reason why you use 0xAAAAAAA5 instead of 0xAAAAAAAA or 0xA5A5A5A5?


> 0xAAAAAAA5 instead of 0xAAAAAAAA or 0xA5A5A5A5?

0xAAAAAAAA isn't odd. Every time you multiply with 0xAAAAAAAA, you lose the top-most bit of your state (alternatively: the bottom-most bit of the result will ALWAYS be zero, so you lost a bit). If you start with 32 bits of entropy, multiplying with 0xAAAAAAAA leaves you with 31 bits of entropy.

> 0xA5A5A5A5

That's where statistical tests need to come into the picture, to differentiate. I've written GPGPU programs to brute-force search for constants based on a few metrics.

In these brute-force searches, even a simple number like 0xAAAAAAA5 works out decently. It's suboptimal in my experience, but it's surprisingly "not bad". My only point is that you shouldn't limit your search to prime numbers. Any odd number works pretty decently as a multiplication constant.


Addition and multiplication also leak more information via side channels than bit shift and xor.


I'd be surprised if that's true.

I'm pretty sure addition is always 1-cycle latency / 3 instructions per cycle, while multiplication is 5-cycle latency / 1 instruction per cycle IIRC.

So if the timing of those instructions are constant, it seems difficult to side-channel / timing attack them.


There's also power: 0xffffffff + 0x1 results in 32 bit flips, while 0x0 + 0x1 results in 1; they don't consume the same amount of energy.


So it took me a while to find the actual algorithm. As far as I can tell it is here:

https://github.com/Cyan4973/xxHash/blob/dev/xxh3.h

Is there a description of the algorithm? The C source is not only highly optimised, it is also heavily #ifdefed to take care of various scenarios, so the file is not optimised for readability.


I believe that this is just the initial release for testing, and the author plans to clean up the header file to separate out the interface from the implementation, and add documentation.


I've tested XXH3 using xxhash's built-in benchmark tool with clang-7.0.1 and gcc-8.2.1 on an Intel i9-9900K. The processor was otherwise idle, and was running at 5 GHz. The command I tested is `xxhsum -b5i10`.

    SSE2: CFLAGS=-O3
    AVX2: CFLAGS="-O3 -mavx2"
    ARCH: CFLAGS="-O3 -march=native"

    Compiler  Mode  Speed
    gcc-8     SSE2  32.8 GB/s
    clang-7   SSE2  36.5 GB/s 
    gcc-8     AVX2  44.1 GB/s
    clang-7   AVX2  68.3 GB/s
    gcc-8     ARCH  60.9 GB/s
    clang-7   ARCH  69.7 GB/s
gcc nearly catches up with clang when compiled with -march=native, but with only -mavx2 it isn't performing as well.


I haven't tried compiling this myself, but gcc does different things with a -march=x flag compared to specifying -mavx2 directly, mostly because -march implies -mtune.

Trying -march=haswell might give closer results.

On the other hand, if you're using an older gcc and it can't identify your processor, -march=native will result in -mtune=generic and much head scratching over the results.


That is pretty dang fast!


Here is a comparison with meowhash: https://twitter.com/mmozeiko/status/1106714583761215488

Meow is faster than xxh3 for all sizes, small or large.


Meow is a freak of nature; it has a constant overhead for all of the small sizes.

I added Meow to my test graphs: https://github.com/injinj/smhasher

I still find xxh3 to be slightly faster for small hashes on my machine (i9-7960X), except in the 128-bit throughput test at sizes > 32 bytes.


Hmm, Meow fails the TwoLongNeighbors test. More information on this test here: https://github.com/hmakholm/smhasher/blob/master/src/LongNei...

  [[[ Keyset 'TestLongNeighbors' Tests ]]]

  Looking for 2-bit collisions in the last 2400 bits.
  Looking for 4-bit collisions in the last 2048 bits.
  Looking for 6-bit collisions in the last 160 bits.
  Trying bases of length 10 to 300, 5 of each length.
  ......[16]...............[32]............... [48]...............[64].

  Among 285 tested bases, we expected 4.25749e-06 bads. 
  Actually there were 5.
  The most striking collision has surprise score 3.52004e+06 and length 46 bytes:
  000: BD 8D 12 8C FB EE 75 2E 17 80 4E 26 8C E7 75 94 DA EF 12 8C 95 CA 91 0D
  018: 08 BD 8D 5F F2 78 FF 64 88 27 88 3F 8E 8F CE FC E2 8A 99 CE 2B E3
                  ^D2^                 ^7F^              ^18^     ^0B^
  The hashes are length 8 bytes:
  000: 35 C1 E6 5B 6B 66 C7 28
  *********FAIL*********


I can make Meow pass the LongNeighbors test by doubling the number of AES rounds on each pass.

The original Meow algo:

    while(Len >= 64)
    {
        S0 = Meow128_AESDEC_Mem(S0, Source);
        S1 = Meow128_AESDEC_Mem(S1, Source + 16);
        S2 = Meow128_AESDEC_Mem(S2, Source + 32);
        S3 = Meow128_AESDEC_Mem(S3, Source + 48);

        Len -= 64;
        Source += 64;
    }
The modified algo:

    while(Len >= 64)
    {
        S0 = Meow128_AESDEC_Mem(S0, Source);
        S0 = Meow128_AESDEC_Mem(S0, Source);
        S1 = Meow128_AESDEC_Mem(S1, Source + 16);
        S1 = Meow128_AESDEC_Mem(S1, Source + 16);
        S2 = Meow128_AESDEC_Mem(S2, Source + 32);
        S2 = Meow128_AESDEC_Mem(S2, Source + 32);
        S3 = Meow128_AESDEC_Mem(S3, Source + 48);
        S3 = Meow128_AESDEC_Mem(S3, Source + 48);

        Len -= 64;
        Source += 64;
    }
(also changed the partial rounds, but this is not shown here)

This doesn't change its performance that much, maybe 1 or 2 extra CPU cycles. My i9 CPU must be able to run these extra rounds in parallel... I do have hyperthreading disabled, if that matters.

Update:

Ah, it does cause the bulk speed test to be 50% of the original (48 MiB vs 24 MiB). The small key test only adds 1/2/3/4 extra AES instructions since they are manipulating 16 bytes at a time. It looks like this is only taking a cycle.


I already looked at it last week and it looked very good indeed (minus some debugging code Yann ripped out today). It should still be the best hash function for large data digests. The previous version was already the best for this use case. What's interesting is the new behaviour on smaller keys (this was the weak point in xxhash), and the extended tests.


I would be interested in the avx2 transition penalty under real world workloads.


Seems a bit funny that CRC32 is so slow.


The comparison is to SMHasher's CRC32 implementation, which is simple but slow, forked from zlib some time ago.

SIMD CRC32 implementations can be more than 14x faster than recent zlib's.

I dropped some details as comments on the original blog post.


I expect that's because it was designed very well for the days of serial, in-order, static execution. But, CPUs today are all about pipelining and parallelism. Herb Sutter's "Free Lunch" has been over for longer than most people here have been writing software. http://www.gotw.ca/publications/concurrency-ddj.htm


Sutter's solution was to write more concurrent code, but that's not what helps in this case. Rather the CPU vendors have been throwing transistors at doing more math operations in parallel. All you need to do is make sure you're taking advantage of the vector instructions, whether that's done by the compiler or with intrinsic calls.

Traditionally the fastest CRC code has used lookup tables; I wonder if that's creating cache pressure these days? Or maybe that approach was abandoned while I wasn't paying attention.


> Rather the CPU vendors have been throwing transistors at doing more math operations in parallel. All you need to do is make sure you're taking advantage of the vector instructions, whether that's done by the compiler or with intrinsic calls.

Instruction-level parallelism is incredibly important. I'd say that any optimizing programmer needs to fully understand ILP, and how it interacts with pipelines and dependency cutting (and register renaming).

Modern CPUs are extremely well parallelized with ILP. Any good, modern hash function will take advantage of this feature of modern CPUs.

Case in point, it seems like xxhash is SCALAR 32-bit / 64-bit code. No vectorization as far as I can tell; it's purely using ILP to get its speed.

Intel's scalar ISA has a 64-bit multiplier (the vector units only have 32-bit multipliers). I've theorized to myself that this 64-bit multiplier could lead to better mixing than the vectorized instructions, and it seems like xxhash goes for that.

The 32-bit version of xxhash can likely be vectorized and optimized even further.



Nope. The fastest crc code now uses the intrinsics and some parallel execution tricks, see crclib from Cloudflare.


Where do I find this library? Googling “crclib from Cloudflare” without quotes returns your comment as the top hit :P



The OP probably finds it funny because Intel processors since Nehalem have had an instruction for computing CRC32.


Though, sadly, the CRC32 instruction computes the wrong polynomial for Zip and most other formats. :(


Where is the disparity between the two coming from?


There are two widely used polynomials, IEEE and Castagnoli. These parameterize the CRC-32 algorithm, and are sometimes known as crc32 and crc32c.

The IEEE polynomial is used by Bzip2, Ethernet (IEEE 802.3), Gzip, MPEG-2, PNG, SATA, Zip and other formats.

The Castagnoli polynomial is used by Btrfs, Ext4, iSCSI, SCTP and other formats.

For better or worse, the hardware instruction computes CRC-32-Castagnoli, which means it's not relevant for e.g. Zip.
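
For reference, the instruction is exposed through the SSE4.2 _mm_crc32_* intrinsics and is hard-wired to the Castagnoli polynomial; a minimal sketch using the usual initial/final inversion convention:

    #include <nmmintrin.h>   /* SSE4.2 */
    #include <stdint.h>
    #include <stddef.h>

    /* CRC-32C over whole 64-bit words; it cannot be repurposed for the IEEE
       polynomial that Zip/Gzip/PNG use. */
    static uint32_t crc32c_hw(const uint64_t *p, size_t n_words, uint32_t crc)
    {
        uint64_t c = crc ^ 0xFFFFFFFFu;       /* conventional initial inversion */
        for (size_t i = 0; i < n_words; i++)
            c = _mm_crc32_u64(c, p[i]);
        return (uint32_t)c ^ 0xFFFFFFFFu;     /* conventional final inversion */
    }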


Ahh interesting. Thank you.

Is there any particular advantage of one polynomial over the other?


https://en.wikipedia.org/wiki/Cyclic_redundancy_check#Standa... suggests that Castagnoli has better "error detection capacity". What that means exactly (in terms of information theory) and whether that difference matters in practice is getting beyond my expertise level. In any case, existing formats like Zip are what they are, and as I mentioned in the blog post comments, it is still possible to SIMD-accelerate the IEEE flavor.


It still should be vectorizable without the dedicated instruction. That isn't evident here so the comparison is questionable.


Slightly OT, but I recently had the need for a rolling hash, and wasn't able to find anything better than Fletcher. Is there anything of similar speed, but better characteristics?


Glad to see that part of the human race is still on the track to practical innovation (collision avoidance, small footprint, performance) as opposed to the perpetual Moore's Law paranoia arms race (crypto-secure, more bits, more rounds, constant utilization regardless of input to deter side channel attack).

If you're the slightest bit security conscious you realize that there is NO level below which security is not a factor. If that is the deciding factor, these compromises (or as they would tell it, lack of compromise) will decrease the overall UTILITY of technology and also increase its FRAGILITY.

The move to aggressively force HTTP to HTTPS even for Joe's Who Cares Web Site for no specifically stated reason is a prime example. If rounds and loops were temperature, it is as if we're shoving the Internet into thermal overload beyond convection cooling, without performing any additional useful work.

Meltdown mitigation is like economic inflation. A general performance decrease for a worst case threat scenario.

What price, paranoia? Is there an upper limit?


Those images are unreadable. Can we get higher res. images, or larger fonts? Thanks.


The images are high res, but you have to download them to view the larger image.


It's cool to see these hashing algorithms coming out and the excitement about building new libraries and adding them to existing ones, but we also need people to gather them into an organized and useful project space.


https://github.com/switch33?tab=repositories I think mine might match up; it's written in lorem ipsum and you can compile it with utf8.

Can someone benchmark it for me? Because sha with higher values might be faster.


Interesting to see this, when just a few days ago we had another "new fast hash" item on the front page: https://news.ycombinator.com/item?id=19357895.

Anyone have a TL;DR regarding how these two compare?


wyhash is a close relative of mumhashv2, and features approximately the same performance strengths and weaknesses.


Working on it tomorrow. This just got in today. I expect xxh3 to beat everybody on long keys, and because it uses optimized intrinsics it should be close to the optimized t1ha variants. wyhash is purely portable code.


Results are now at smhasher. I had to fix all the MSVC issues first.

wyhash is still the recommended hash function: portable and lowest cycles/hash.

However xxh3 is spectacularly fast, for small and especially large key sizes. This is now the recommended hash for file or db digests.

String hash table stats are not yet done; there FNV1A is still the recommendation, with safe collision handling.


Here's the smhasher output for XXH3 on an i9: https://pastebin.com/ryLN24Qy (wyhash has its output in its readme).

TLDR: XXH3 uses fewer cycles per key, by about 30%. The real throughput / latency numbers need a more equal and thorough benchmark of both, though.



