
A Fast, Minimal Memory, Consistent Hash Algorithm - luu
http://arxiv.org/abs/1406.2294
======
teraflop
It doesn't seem to be mentioned anywhere in the paper, but this is a
description of the "consistent hashing" algorithm in Google's Guava library:
[http://docs.guava-libraries.googlecode.com/git/javadoc/src-html/com/google/common/hash/Hashing.html#line.323](http://docs.guava-libraries.googlecode.com/git/javadoc/src-html/com/google/common/hash/Hashing.html#line.323)

I find it kind of funny that they created and released an apparently novel
hashing method, and then waited 2.5 years to actually explain how it works.
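
For reference, the core of the algorithm ("jump consistent hash") is only a
handful of lines. This is essentially the C++ listing from the paper,
reproduced from memory, so treat it as a sketch:

    #include <cstdint>

    // Maps a 64-bit key to a bucket number in [0, num_buckets).
    int32_t JumpConsistentHash(uint64_t key, int32_t num_buckets) {
      int64_t b = -1, j = 0;
      while (j < num_buckets) {
        b = j;
        key = key * 2862933555777941757ULL + 1;  // 64-bit LCG step
        // (key >> 33) + 1 is pseudo-random in [1, 2^31]; dividing it into
        // 2^31 gives 1/r for r roughly uniform in (0, 1].
        j = (int64_t)((b + 1) *
                      ((double)(1LL << 31) / (double)((key >> 33) + 1)));
      }
      return (int32_t)b;
    }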

~~~
dchichkov
It takes time and effort to do a writeup. And it is easy to procrastinate.
That reminds me...

A bit of a plug here. I have a cute nearly-minimal perfect hashing algorithm
designed to have good cache-friendly properties. Briefly, it is somewhat
similar to hopscotch hashing, except that you pre-calculate the positions of
the elements, putting them into the 'best' spots by solving the assignment
problem. It works for up to about 50k elements. It feels like it might have
good theoretical properties too, maybe even optimality, but it has been a
while since I took an algorithms class.

If anyone is interested in doing a writeup and publishing clean source code,
you'd be welcome to.

~~~
jemfinch
It sounds similar to Robin Hood hashing. Is there source code anywhere?

~~~
dchichkov
Yes, it is similar to Robin Hood hashing, except that you actually place items
into _optimal_ positions (by solving the assignment problem on your
memory/cache access costs and access probabilities) rather than stochastically
swapping items.

I'll put up sample code if somebody would be willing to do a writeup ;)
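
In the meantime, here is a toy illustration of the flavor of the idea (not my
actual code; the probabilities and costs are invented for exposition). It
brute-forces the assignment for a tiny table, where a real implementation
would use something like the Hungarian algorithm:

    #include <algorithm>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    // Toy: assign n items to n slots so that the expected access cost
    // (access probability times slot cost) is minimized. Brute force over
    // all permutations; a real implementation would solve the assignment
    // problem with e.g. the Hungarian algorithm.
    int main() {
      const int n = 4;
      // Hypothetical access probabilities and per-slot costs (e.g. extra
      // cache lines touched when probing that slot).
      double prob[] = {0.5, 0.25, 0.15, 0.10};
      double slot_cost[] = {1.0, 1.0, 2.0, 3.0};

      std::vector<int> perm(n), best(n);
      std::iota(perm.begin(), perm.end(), 0);  // start from sorted 0..n-1
      double best_cost = 1e9;
      do {
        double cost = 0;
        for (int i = 0; i < n; i++) cost += prob[i] * slot_cost[perm[i]];
        if (cost < best_cost) { best_cost = cost; best = perm; }
      } while (std::next_permutation(perm.begin(), perm.end()));

      for (int i = 0; i < n; i++)
        std::printf("item %d -> slot %d\n", i, best[i]);
      std::printf("expected cost: %.3f\n", best_cost);
    }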

------
tryp
One point to consider is that this algorithm appears to rely on a double-
precision floating-point divide at its core, so the speed measured on a Xeon
E5 may not translate to architectures with weaker floating-point units.

~~~
def-lkb
The purpose of the reals is just to map from [0, 1) to [0, n), where n is the
number of hosts.

Floating point is used to ease the presentation; I think the algorithm can be
ported to integer operations without loss of performance. (I didn't prove it;
I just wrote a pure integer implementation and checked the distribution of
results on some inputs.)
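
A sketch of what such a port might look like (a reconstruction, not the
paper's code or my exact version; it need not produce bit-identical buckets to
the floating-point variant, but the distribution should match): compute
j = floor((b + 1) / r) directly in 64-bit fixed point.

    #include <cstdint>

    // Integer-only jump consistent hash (a sketch, not the paper's code):
    // computes j = floor((b + 1) / r), with r in (0, 1] represented in
    // fixed point as ((key >> 33) + 1) / 2^31.
    int32_t JumpHashInt(uint64_t key, int32_t num_buckets) {
      int64_t b = -1, j = 0;
      while (j < num_buckets) {
        b = j;
        key = key * 2862933555777941757ULL + 1;
        // (b + 1) << 31 fits in 64 bits since b < 2^31;
        // the divisor lies in [1, 2^31].
        j = (int64_t)((((uint64_t)(b + 1)) << 31) / ((key >> 33) + 1));
      }
      return (int32_t)b;
    }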

------
CJefferson
Using doubles in a hash algorithm seems (to me) very dangerous.

For years I have had problems with different optimisation levels producing
different results for floating-point code, since FP registers can carry extra
internal precision, intermediates may be kept at different widths depending on
register allocation, and so on.

Can this hash function be trusted to produce repeatable results?

~~~
arjie
From the paper, this originates from:

> ...Since i is a lower bound on j, j will equal the largest i for which P(j ≥
> i), thus the largest i satisfying i ≤ (b+1) / r. Thus, by the definition of
> the floor function, j = floor((b+1) / r).

It seems to me that since all the numbers are positive, you can safely use
integer division if you like. AFAIK, floor(((double)a)/b) and a/b coincide in
that case, at least as long as the values fit exactly in a double (below
2^53).
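
A quick check of that claim, plus the caveat (the values here are arbitrary):

    #include <cassert>
    #include <cmath>
    #include <cstdint>

    int main() {
      // For nonnegative operands, C/C++ integer division truncates toward
      // zero, which is the same as the floor.
      for (uint64_t a = 0; a < 100000; a++)
        for (uint64_t b = 1; b < 8; b++)
          assert(a / b == (uint64_t)std::floor((double)a / (double)b));

      // Caveat: once a no longer fits in a double's 53-bit mantissa, the
      // conversion itself rounds and the two can disagree.
      uint64_t big = (1ULL << 53) + 1;  // not exactly representable
      assert((double)big == (double)(1ULL << 53));
      assert(big / 1 != (uint64_t)std::floor((double)big / 1.0));
      return 0;
    }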

------
cordite
In case anyone is interested in the magic number, `2862933555777941757`, but
didn't catch it while scanning the PDF: it is the multiplier of the "64-bit
Linear Congruential Generator" [1].

The author specifies that if the key is larger than 64 bits, it should first
be reduced to a 64-bit hash for use as input. But I wonder, supposing the
compiler or processor handled more bits in a single register and we had a need
for it, would this algorithm handle a change of the magic number without a
problem? Where would one look for such numbers?

[1]:
[http://nuclear.llnl.gov/CNP/rng/rngman/node4.html](http://nuclear.llnl.gov/CNP/rng/rngman/node4.html)
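
For what it's worth: for an LCG modulo 2^64, the classic full-period
conditions (the Hull-Dobell theorem) require the increment to be odd and the
multiplier to be congruent to 1 mod 4; candidate multipliers are then vetted
with spectral tests, which is what tables like [1] or Knuth's TAOCP vol. 2
collect. Any multiplier passing those tests should in principle work here,
since the algorithm only needs (key >> 33) + 1 to look uniform. A minimal
sketch of the state update:

    #include <cstdint>

    // One step of the 64-bit LCG used inside the hash:
    //   x_{n+1} = a * x_n + c  (mod 2^64, via natural wraparound).
    // Full period requires c odd and a % 4 == 1 (Hull-Dobell); the
    // paper's a = 2862933555777941757 and c = 1 satisfy both.
    uint64_t lcg_step(uint64_t x) {
      return x * 2862933555777941757ULL + 1ULL;
    }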

------
robmccoll
super quick, super dumb comparison to a variant on bob jenkins 64-bit
mix/hash:
[https://gist.github.com/robmccoll/38f03971df66ca15e030](https://gist.github.com/robmccoll/38f03971df66ca15e030)

    $ ./bin/googlehash 512 1000000
    google hash       0.0765907  255374889    2.18324e+10
    bob jenkins hash  0.0158996  255484409    2.18324e+10
    google is 0.207592x faster
    bob is 4.81714x faster

    $ ./bin/googlehash 29 1000000
    google hash       0.0478618  14000129     6.99994e+07
    bob jenkins hash  0.0158052  13999989     6.99994e+07
    google is 0.330225x faster
    bob is 3.02824x faster

    $ ./bin/googlehash 65536 1000000
    google hash       0.117784   32781492370  3.57002e+14
    bob jenkins hash  0.0162152  32744680953  3.57004e+14
    google is 0.137668x faster
    bob is 7.26383x faster

The printout is time (s), sum, and variance. (I could use a suggestion for a
better test of uniformity; the only other idea I had was to look at a
histogram. Anyone have suggestions?)
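
One option would be a Pearson chi-squared test on the bucket counts. A sketch
(parameters made up; assumes the JumpConsistentHash function shown earlier in
the thread):

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Defined earlier in the thread (the paper's algorithm).
    extern int32_t JumpConsistentHash(uint64_t key, int32_t num_buckets);

    int main() {
      const int32_t buckets = 512;
      const uint64_t keys = 1000000;
      std::vector<uint64_t> count(buckets, 0);
      for (uint64_t k = 0; k < keys; k++)
        count[JumpConsistentHash(k, buckets)]++;

      // Pearson chi-squared statistic against the uniform expectation.
      const double expected = (double)keys / buckets;
      double chi2 = 0;
      for (int32_t b = 0; b < buckets; b++) {
        double d = count[b] - expected;
        chi2 += d * d / expected;
      }
      // For a uniform distribution, chi2 should be near the degrees of
      // freedom (511), give or take ~sqrt(2 * 511) = 32.
      std::printf("chi-squared = %f (df = %d)\n", chi2, buckets - 1);
    }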

edit: it is likely that this is a poor comparison :-)

~~~
def-lkb
You are comparing apples to oranges.

The purpose of the google algorithm is that, when the number of bins changes,
a minimal number of items get moved.

This version of your code prints the number of items that get reassigned:
[https://gist.github.com/def-lkb/58243299e114244d3b90](https://gist.github.com/def-lkb/58243299e114244d3b90)

    $ ./a.out 1000 1002 100001
    google hash       0.0343956  49967527  8.34091e+09, different bins 207
    bob jenkins hash  0.0039651  50033275  8.34095e+09, different bins 99794
    google is 0.115279x faster
    bob is 8.67458x faster
    google moved 0.206998x% objects
    bob moved 99.793x% objects
    optimal distribution required 200 movements (0.199998%)

bob is performing extremely badly.

edit: updated the gist to include the integer version

Also, the google algorithm does O(log(bins)) iterations in its loop, which
explains the speed advantage of the bob jenkins one. Given the task it
accomplishes, it's really efficient and near optimal.
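
For anyone who doesn't want to dig through the gist, the property being
measured looks roughly like this (a sketch, again assuming the
JumpConsistentHash function from earlier in the thread):

    #include <cstdint>
    #include <cstdio>

    // Defined earlier in the thread (the paper's algorithm).
    extern int32_t JumpConsistentHash(uint64_t key, int32_t num_buckets);

    int main() {
      const uint64_t keys = 100001;
      uint64_t moved = 0;
      for (uint64_t k = 0; k < keys; k++)
        if (JumpConsistentHash(k, 1000) != JumpConsistentHash(k, 1002))
          moved++;
      // Expect roughly keys * 2/1002, i.e. ~200 keys, to move, matching
      // the "optimal distribution required 200 movements" line above.
      std::printf("%llu of %llu keys moved\n",
                  (unsigned long long)moved, (unsigned long long)keys);
    }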

~~~
robmccoll
Very cool! Thanks, I clearly missed the point :-)

------
streametry
The algorithm is called "Jump". A quick search on GitHub revealed three
projects that implement it, all written in Go, for example:
[https://github.com/benbjohnson/jmphash](https://github.com/benbjohnson/jmphash)

~~~
vanderZwan
Given that the paper is by people from Google, where Go is fairly well
adopted, and that both Go's and this algorithm's primary use cases are server-
related, that kind of makes sense.

------
vilda
Just a small note: this is not a like-for-like competitor to consistent
hashing.

This algorithm requires a consistent mapping between nodes and (consecutive)
integers, and that's not something you get for free in distributed systems
where nodes may join or leave the pool at any time. Jump only supports adding
or removing buckets at the high end of the range; removing an arbitrary node
would renumber the rest.

------
simpsond
It seems a simple key mod bucket_size works to divide a workload based on a
numeric key. I imagine this has a different distribution that works better
for certain use cases. Anyone have an example of when mod will fail for
something like this?

Edit: The paper covers this; see also the quick illustration below.
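
A quick illustration of the failure mode (bucket counts arbitrary): with
plain mod, growing the bucket count moves almost every key, which is exactly
what consistent hashing avoids.

    #include <cstdint>
    #include <cstdio>

    int main() {
      const uint64_t keys = 1000000;
      uint64_t moved = 0;
      for (uint64_t k = 0; k < keys; k++)
        if (k % 10 != k % 11)  // grow from 10 to 11 buckets
          moved++;
      // Roughly 10/11 (~91%) of keys land in a different bucket, versus
      // ~1/11 (~9%) under a consistent scheme like jump hash.
      std::printf("%llu of %llu keys moved (%.1f%%)\n",
                  (unsigned long long)moved, (unsigned long long)keys,
                  100.0 * moved / keys);
    }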

~~~
cycrutchfield
[http://en.wikipedia.org/wiki/Consistent_hashing](http://en.wikipedia.org/wiki/Consistent_hashing)

