

CityHash: new hash function by Google (faster than Murmur) - markerdmann
http://code.google.com/p/cityhash/

======
coderdude
From the readme: "On a single core of a 2.67GHz Intel Xeon X5550,
CityHashCrc256 peaks at about 6.09 bytes/cycle. The other CityHashCrc
functions are wrappers around CityHashCrc256 and should have similar
performance on long strings. CityHash128 peaks at about 4.31 bytes/cycle."

I think it's fascinating that they're able to determine the number of bytes
their code processes during a single cycle. Does anyone know where to start
looking to learn more about how to do that? It reminds me of a comment I think
I read on here a month or two ago about the number of pixels that could be
processed per cycle with an algorithm implemented on a 500MHz FPGA.

Edit: Thanks for the answers!

~~~
vilda
Timestamp counter: <https://en.wikipedia.org/wiki/Time_Stamp_Counter>

The question is how representative this measurement really is. The performance
of a hash function depends heavily on memory caches. This looks like a test
with many iterations where all the data is as close to the processor as
possible - hardly a real-life scenario.

~~~
dalke
This is meant for high-speed hashing, so the comparison is this algorithm vs.
another high-speed hash algorithm. In that case, how does memory cache affect
the comparison between the two algorithms?

That is, if the data isn't "as close to the processor", won't another hash
algorithm in this class be affected in the same way? At most that would
de-emphasize the performance differences, but it wouldn't change which one is
faster, would it?

------
ot
Relevant comment on reddit by the author of MurmurHash:

 _Hi, I'm the author of MurmurHash (let me know if verification is needed) and
helped review the design of CityHash (I work at Google now). CityHash has
higher throughput and is more complex, MurmurHash has lower latency for very
short keys and is relatively simple. Both are more than adequate for any sort
of hashing needs you might have and should be virtually identical to a random
oracle for any keysets you might throw at them. I'm happy to answer questions
if you have any._

[http://www.reddit.com/r/programming/comments/ozodk/cityhash_...](http://www.reddit.com/r/programming/comments/ozodk/cityhash_new_hash_function_by_google_faster_than/c3lg3ih)

------
gioele
Is there an article that describes the algorithm in more abstract terms?

The C++ code is quite readable, but I'd say it contains too many magic
numbers. Code is no substitute for a clear description by the authors of the
algorithm, its tradeoffs, and its design choices.

------
Groxx
Whenever I see C code like this:

    
    
      #define CHUNK(multiplier, z)                                  \
        {                                                           \
          uint64 old_a = a;                                         \
          a = Rotate(b, 41 ^ z) * multiplier + Fetch64(s);          \
          b = Rotate(c, 27 ^ z) * multiplier + Fetch64(s + 8);      \
          c = Rotate(d, 41 ^ z) * multiplier + Fetch64(s + 16);     \
      ... etc
    

I wonder: isn't this the kind of thing that's better handled by creating a
function, and _maybe_ explicitly telling the compiler to inline it? I mean,
you can do this (from here: <http://gcc.gnu.org/onlinedocs/gcc/Inline.html> ):

    
    
       inline void foo (const char) __attribute__((always_inline));
    

or you could simply let your optimizer take care of it. It may be faster to
_not_ inline it in some circumstances, but with a macro you've _gone out of
your way_ to prevent the compiler from making that decision, and you've made
your code harder to read and more prone to problems from macro expansion.

Why? Is it habit? Or is there a good reason? (Currently, that is--historical
reasons are not good reasons to continue doing things.)
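For illustration, the visible portion of that CHUNK macro could be written as
an inline function along these lines (a sketch only: `Rotate`, `Fetch64`, and
the state struct here are my own stand-ins mirroring the names in the macro,
and the macro's elided tail is omitted):

```c
#include <stdint.h>
#include <string.h>

typedef uint64_t uint64;

/* Stand-in helpers matching the names used in the macro above. */
static inline uint64 Rotate(uint64 v, int shift) {
    /* rotate right by `shift` bits (shift assumed in 1..63 here) */
    return shift == 0 ? v : (v >> shift) | (v << (64 - shift));
}

static inline uint64 Fetch64(const char *p) {
    uint64 v;
    memcpy(&v, p, sizeof v); /* alignment-safe 8-byte load */
    return v;
}

typedef struct { uint64 a, b, c, d; } ChunkState;

/* Same body as the macro's visible lines, but with real scoping,
   type checking, and a symbol a debugger can see. */
static inline void Chunk(ChunkState *st, const char *s,
                         uint64 multiplier, int z) {
    uint64 old_a = st->a;
    st->a = Rotate(st->b, 41 ^ z) * multiplier + Fetch64(s);
    st->b = Rotate(st->c, 27 ^ z) * multiplier + Fetch64(s + 8);
    st->c = Rotate(st->d, 41 ^ z) * multiplier + Fetch64(s + 16);
    (void)old_a; /* the macro's remaining (elided) lines use this */
}
```

With `__attribute__((always_inline))`, or simply left to the optimizer, a
current GCC or Clang should generate essentially the same code as the macro
version.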

~~~
saurik
People who write hash functions are both exceedingly anal about performance
concerns and generally interested in having their work compile with any and
all reasonable C compilers (yes, even those that aren't gcc) and still have at
least good performance.

They also tend to be things that are not designed in code: they are designed
in math and translated to code as the last step, preferably in a way that can
remain static forever (they should not need maintenance by an engineer,
even/especially to support random new C compiler builds).

(Some of my friends entered the recent SHA3 NIST hash competition and lasted
long enough to have random other mathematicians scrutinize, improve, and
finally defeat their math. <http://en.wikipedia.org/wiki/Spectral_Hash>)

------
casca
Here's the announcement from Google in April last year: [http://google-
opensource.blogspot.com/2011/04/introducing-ci...](http://google-
opensource.blogspot.com/2011/04/introducing-cityhash.html)

The most important thing to take away from this is "These functions aren’t
suitable for cryptography". The requirements for a cryptographically strong
hash function are much stricter than those for a hash used only for hash-table
performance. That's not to say you shouldn't use it; just be careful that
there's no scope creep.

------
cmurphycode
The page outright says that it is not a cryptographic hash, but the Hash
Quality section doesn't expound on why (except for a minor flaw on short
strings for one of the functions). Does anyone know what the expected
collision rate for this hash is?

In other words, I don't care if the hash doesn't avalanche perfectly (half the
output bits changing for a one-bit input change; a requirement for
cryptographic hashing so that changes are "obvious") as long as the collision
rate is comparable to SHA-256's. Or, if I _should_ care, explanations would be
wonderful.

~~~
jedbrown
The issue is whether it is feasible to engineer collisions.

~~~
cmurphycode
True. But in the case I'm evaluating, that isn't my worry. All I care about is
the expected number of collisions over some extremely large number of hashes.

~~~
orijing
That's not the definition of a cryptographic hash. One of the four
requirements of a function being a cryptographic hash function is that "it is
infeasible to find two different messages with the same hash."

~~~
cmurphycode
I understand. I worded my question poorly- what I was looking for was what
they found that invalidated it as a cryptographic hash, so I could figure out
whether it would fit my (non-crypto) requirements.

------
DiabloD3
Sure, it's faster, but is it well distributed, with no known immediate flaws?
I understand it's not cryptographically secure, but does it perform as well as
MurmurHash3 in those areas?

~~~
cuu508
Their README has a "Hash Quality" section. It says there are some flaws, which
they consider harmless for the intended use in hash tables.

------
vilda
Looking at the code, it appears to be heavily optimized for the x86 family of
processors.

What is a suitable option for ARM processors? The similar Murmur hash depends
on an ultra-fast multiplication unit (which consumes a lot of energy).

------
nashby
BTW, here is a ruby binding for this hash function
<https://github.com/nashby/cityhash>

------
tripzilch
Any reason why I should use this instead of the (very simple to implement,
very good avalanche behaviour) FNV hash?
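For context on why FNV is attractive: 64-bit FNV-1a really is just a two-line
loop. The two constants below are the published FNV offset basis and prime;
the rest is my own sketch.

```c
#include <stdint.h>
#include <stddef.h>

/* 64-bit FNV-1a: xor in a byte, then multiply by the FNV prime. */
static uint64_t fnv1a_64(const void *data, size_t len) {
    const unsigned char *p = (const unsigned char *)data;
    uint64_t h = 14695981039346656037ULL; /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL; /* FNV prime */
    }
    return h;
}
```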

~~~
cbsmith
I've seen FNV have some unfortunate properties around symmetric keys. I'd at
least use Jenkins instead; if not that, then Murmur, this, or SpookyHash.

~~~
tripzilch
Symmetric keys?? Which part of "not suitable for cryptographic purposes" did
you not understand?

Neither FNV, Jenkins, Murmur _nor_ this new City Hash are supposed to be used
for cryptography.

Besides, in the ways one generally uses hash functions in cryptography,
performance is the _least_ of your worries (unless you're trying to crack
something), so there's no reason whatsoever not to go with a properly strong
cryptographic hash such as SHA-256.

That said, Jenkins has worse avalanche behaviour than FNV--though I'm sure
that could be fixed with a few tweaks; non-crypto hash functions aren't
exactly rocket science (check out where FNV's constants came from)--but as it
is, Jenkins is more of an academic example.

Spooky (also developed by Bob Jenkins) seems better. Oddly enough, Spooky
claims 3 bytes per cycle, and its author cannot confirm CityHash's 6 bytes per
cycle ( <http://burtleburtle.net/bob/hash/spooky.html> ). Whatever the exact
numbers, both are obviously blazingly fast. Picking one over the other for
performance reasons should only be done once you've established that the speed
of the hash function is indeed a bottleneck, and once you can compare their
performance on the specific hardware it's supposed to run on.

~~~
cbsmith
> Symmetric keys?? Which part of "not suitable for cryptographic purposes" did
> you not understand?

I don't mean symmetric in the cryptographic sense. I mean as in "ABBA" mapping
to the same hash value as "YZZY". I can't remember the specific case, but I
think it was with 8 byte keys, and it might be like the above or it might be
more like "ABAB" and "YZYZ". Either way, it caused some pretty catastrophic
failures in one case due to the unusually high probability of such keys
existing in our dataset.

I've done tests with pretty huge dictionaries on both Jenkins & FNV, and
despite FNV's enviable characteristics, empirically Jenkins did much better at
avoiding collisions. Sometimes stuff that looks good on paper has unfortunate
characteristics on real-world data sets (which typically aren't that random).

------
nerdo
Is this really new? The initial check-in on the Source tab is July 22nd.

~~~
foobarbazetc
This post (and title) were "stolen" from reddit:

[http://www.reddit.com/r/programming/comments/ozodk/cityhash_...](http://www.reddit.com/r/programming/comments/ozodk/cityhash_new_hash_function_by_google_faster_than/)

