
Perfect Hashes and faster than memcmp - dmit
http://blogs.perl.org/users/rurban/2014/08/perfect-hashes-and-faster-than-memcmp.html
======
dchichkov
If anyone is interested, I have a cute nearly-minimal perfect hashing
algorithm designed to be cache-friendly. It works very well in practice, and
in my particular application it performed faster and more consistently (no
long tail) than anything found in CMPH.

Briefly, it is somewhat similar to hopscotch (or robin-hood) hashing, except
that you pre-calculate the positions of the elements and put them into optimal
spots by solving the assignment problem (via the Hungarian algorithm or
similar). It works for up to about 50k elements. It feels like it might have
good theoretical properties, maybe even be optimal (after all, we are
essentially calculating optimal positions for the elements based on the costs
of our memory/cache accesses and the probability of each element), but it's
been a while since I took an algorithms class.
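
To give the flavour, here is a rough sketch in C. Note it uses a greedy pass
instead of a real Hungarian solver, and the home buckets, lookup probabilities
and cost model are all invented for illustration:

```c
#include <stdio.h>

#define NKEYS  4
#define NSLOTS 8   /* nearly-minimal: a few spare slots */

int main(void) {
    /* Invented lookup probabilities (sorted hottest first) and invented
     * home buckets h(key); in the real scheme these come from profiling
     * and from the hash function. */
    double prob[NKEYS] = { 0.50, 0.25, 0.15, 0.10 };
    int    home[NKEYS] = { 1, 1, 4, 6 };

    /* cost[k][s]: expected cost of placing key k in slot s, here simply
     * probability * forward probe distance, so hot keys want to sit on
     * (or right after) their home bucket. */
    double cost[NKEYS][NSLOTS];
    for (int k = 0; k < NKEYS; k++)
        for (int s = 0; s < NSLOTS; s++) {
            int dist = s - home[k];
            if (dist < 0) dist += NSLOTS;
            cost[k][s] = prob[k] * dist;
        }

    /* Greedy stand-in for the assignment solver: hottest key first,
     * each into its cheapest still-free slot.  A Hungarian solver would
     * minimize the *total* expected cost globally instead. */
    int slot_of[NKEYS], taken[NSLOTS] = { 0 };
    for (int k = 0; k < NKEYS; k++) {
        int best = -1;
        for (int s = 0; s < NSLOTS; s++)
            if (!taken[s] && (best < 0 || cost[k][s] < cost[k][best]))
                best = s;
        taken[best] = 1;
        slot_of[k] = best;
    }

    for (int k = 0; k < NKEYS; k++)
        printf("key %d -> slot %d\n", k, slot_of[k]);
    return 0;
}
```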

If anyone is interested in doing a writeup and publishing clean source code,
you'd be welcome; ping me via e-mail.

~~~
phpnode
Your email address isn't listed in your profile (you might have entered it in
the "email" box, but that's not visible to other people)

~~~
dchichkov
It is dchichkov at gmail...

------
userbinator
_we can optimize memcmp away by doing word-size comparison directly. This is
50% - 2000% faster than doing memcmp on each string_

This is what a good implementation of memcmp would do anyway (and on x86 one
can compare up to 16 bytes at once using SSE), but the problem is that memcmp
is generic enough that it must work for all lengths, so for short comparisons
you lose more cycles switching between all the special cases than you gain
from the comparison itself.
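
For a constant key of known length, the whole thing collapses to a single word
compare, something like this (a sketch; the memcpy sidesteps alignment and
aliasing problems, and compilers fold both calls into one load and one
immediate compare):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Compare an 8-byte key against the constant "filename" with a single
 * 64-bit compare instead of a call into the generic memcmp. */
static int is_filename(const char *key) {
    uint64_t w, c;
    memcpy(&w, key, 8);          /* compiled to one unaligned load   */
    memcpy(&c, "filename", 8);   /* constant-folded at compile time  */
    return w == c;
}

int main(void) {
    printf("%d %d\n", is_filename("filename"), is_filename("filenamX"));
    return 0;
}
```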

------
fake-name
For the love of FSM, _please_ don't call your library "phash". "phash" is
already in wide use for "perceptual hash", or the hashing mechanisms used for
things like fuzzy image searching.

~~~
rurban
Oh, I am on HN :)

Okay, point taken. I didn't know about that name. "pperf" maybe, then. Note
that this is still a work in progress; I just had to finish another project,
and will come back to this one soon.

EDIT: I've renamed the frontend to "pperf" now
[https://github.com/rurban/Perfect-Hash/commit/b475886f37d9](https://github.com/rurban/Perfect-Hash/commit/b475886f37d9)

But the memcmp trick of unrolling the comparisons beforehand for constant
keys is already usable, and I believe someone has already considered that
optimization for gcc and llvm; if I remember correctly I even saw a patch. It
should be used for switch statements with constant keys.

~~~
octo_t
I think this is the first time I've seen a person on HN be told "oh a product
already exists with this name, maybe change it" and they actually have.

Props :)

------
dpayne
One caveat of perfect hash functions, including CMPH, is that for a very large
number of keys the data structures they use internally become very large. In
the case of CMPH, when I tried it with around 30 million keys the result was
several MB, which no longer fit in L3 cache. For large dictionaries, hash
functions like xxhash and farmhash are usually a better option than perfect
hash functions.
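
At that scale, something like the following is what I mean: a plain
power-of-two table with xxhash (the bucket count here is made up, and unlike a
PHF you still have to handle collisions with chaining or probing):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include "xxhash.h"   /* https://github.com/Cyan4973/xxHash */

#define NBUCKETS (1u << 26)   /* ~67M buckets for ~30M keys */

/* One fast, non-perfect hash per lookup; the table itself is big, but
 * there is no auxiliary PHF structure competing for L3. */
static uint32_t bucket_of(const void *key, size_t len) {
    return (uint32_t)(XXH64(key, len, 0) & (NBUCKETS - 1));
}

int main(void) {
    const char *k = "example-key";
    printf("%u\n", bucket_of(k, strlen(k)));
    return 0;
}
```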

~~~
t1m
You can combine perfect hashing with string compression (Huffman, prefix
tries) and de-duplication techniques to vastly improve memory usage for a wide
variety of use cases. DiscoDB does this to good effect
([http://discodb.readthedocs.org/en/latest/](http://discodb.readthedocs.org/en/latest/)).

------
ot
cmph is not the fastest library for MPHFs (minimal perfect hash functions),
mostly because it uses a slow ranking function to turn a PHF into an MPHF.
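
For context: a PHF maps the n keys injectively into [0, m) with m a bit larger
than n, and the ranking function rank(i) = "number of occupied slots below i"
compacts that to [0, n). A naive sketch of just the rank step (this is not
emphf's actual representation, and the bitmap is a toy):

```c
#include <stdint.h>
#include <stdio.h>

/* Toy bitmap of m = 128 PHF slots, 1 bit = slot occupied by some key. */
static uint64_t occupied[2] = { 0xF0F0F0F0F0F0F0F0ULL, 0x0FULL };

/* Naive rank: popcount everything below bit i.  Real libraries store
 * precomputed per-block counts so this is O(1) instead of O(m/64). */
static uint64_t rank_below(uint64_t i) {
    uint64_t r = 0, w = i / 64;
    for (uint64_t j = 0; j < w; j++)
        r += (uint64_t)__builtin_popcountll(occupied[j]);
    if (i % 64)
        r += (uint64_t)__builtin_popcountll(occupied[w] &
                                            ((1ULL << (i % 64)) - 1));
    return r;
}

int main(void) {
    /* mphf(key) = rank_below(phf(key)); here we only show the rank step. */
    printf("rank(68) = %llu\n", (unsigned long long)rank_below(68));
    return 0;
}
```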

I wrote a small library some time ago that performs much faster, both for in-
memory construction and for lookups [1; see the linked paper for benchmarks].
Unfortunately I have no time to maintain it, but I recently found out that
some projects are using it, so it wasn't all wasted time :)

[1] [https://github.com/ot/emphf](https://github.com/ot/emphf)

</shameless plug>

~~~
rurban
Excellent. I'll take it :)

------
jheriko
i still fail to see the enormous fuss about hash tables, as if they were some
revelatory data structure.

the special cases where you can construct perfect hashes are normally amenable
to assigning each item a fixed index 'hash' and storing the items in an array,
then never referring to their values for lookups in code, but always using the
indices.

in heavily compiled languages like C and C++, when you know everything at
compile-time the run-time cost should become zero. anything more is a failure
to use the language to sufficiently inform the compiler of what is happening.
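
e.g. the trivial version in c (a sketch; the key names and layout are
invented):

```c
#include <stdio.h>

/* give each known key a fixed index at compile-time and look things up
 * by index, never by string value */
enum key { KEY_COLOR, KEY_WIDTH, KEY_HEIGHT, KEY_COUNT };

/* the string names only exist for parsing/printing at the edges */
static const char *key_names[KEY_COUNT] = { "color", "width", "height" };

static int values[KEY_COUNT];

int main(void) {
    values[KEY_WIDTH] = 640;   /* the 'lookup' is a constant-offset array
                                  access: no hashing, no comparisons */
    printf("%s = %d\n", key_names[KEY_WIDTH], values[KEY_WIDTH]);
    return 0;
}
```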

i suspect this is why perfect hashes do not see enormously widespread use:
there is a solution for the most common use cases with the exceptional
run-time complexity of O(0).

this is enormously faster than the necessarily more complicated perfect hash
functions, and only falls down when you have a run-time requirement to do the
lookup from the value instead of the key... which in my experience goes hand
in hand with not knowing your data until run-time, which rules out a perfect
hashing function.

in most run-time cases a binary tree is good enough; it's very rare that you
need better, and even when you do, the optimisation can be found by tuning
that structure to pool-allocate nodes or similar. the usual array-of-linked-
lists approach can be better for small data sets too... but a perfect hash
function isn't even applicable there.

all that being said, it is the right solution in those exceptionally rare
cases where you know your data ahead of time, but need to look it up by value,
rather than hash, at run-time.

~~~
HelloNurse
Constrained input that is processed according to what the program knows about
each "key" it contains, with keys defined in advance and effectively
specifying a data structure or file format, is actually very common. Examples:
HTML, XML and similar kinds of text markup; tagged file formats, e.g. PNG; and
the traditional application to programming language keywords.

~~~
jheriko
right, that's a good use case, although i'm not sure it's terribly important.
the slow part of processing a png file is not identifying its contents...
although this may be different for XML and HTML parsers. i'm still not sure
i'd call it common, or even desirable.

for lots of chunk-based formats, using a switch statement on the chunk-ids is
clean and hands the optimisation task to the compiler, which tends to do a
pretty good job of these things... even if it generates a gigantic conditional
statement. since these ids are 4 or 8 bytes long, the 'memcmp is slow'
argument falls apart.
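
e.g. for png chunks, something like this (a sketch; the CHUNK macro as written
assumes a little-endian target, so the first byte lands in the low bits):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* pack a 4-byte chunk type into one integer and let the compiler lower
 * the switch however it likes (jump table, compare tree, ...) */
#define CHUNK(a,b,c,d) ((uint32_t)(a) | (uint32_t)(b) << 8 | \
                        (uint32_t)(c) << 16 | (uint32_t)(d) << 24)

static int handle_chunk(const unsigned char *p) {
    uint32_t id;
    memcpy(&id, p, 4);           /* one 32-bit load, no memcmp at all */
    switch (id) {
    case CHUNK('I','H','D','R'): return 0;  /* parse header */
    case CHUNK('I','D','A','T'): return 1;  /* image data   */
    case CHUNK('I','E','N','D'): return 2;  /* done         */
    default:                     return -1; /* unknown, skip */
    }
}

int main(void) {
    const unsigned char hdr[4] = { 'I','H','D','R' };
    printf("%d\n", handle_chunk(hdr));
    return 0;
}
```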

in fact this sort of switch statement usage is such a good optimisation here
that it is used by this perfect hash mechanism itself to solve its own
problem, effectively offloading some of the cleverness requirement to the
compiler...

also, in the context of programming languages this problem falls into the
category i described of being beatable at compile-time. usually this is
implemented by tokenising (hashing pieces of) all the string input and then
only ever using the tokens (hashes, indices) outside of that context. whatever
lookups need doing can then become indexing into a nice flat array defined at
compile-time.

in the other cases where lookups are needed, they are for things like
variables and constants, which are not known at compile-time, and the perfect
hash cannot help...

you might also want deliberate hash collisions in cases where languages have
aliases for keywords.

sure, you could lift some of the performance cost of the tokenisation itself
outside of run-time with that, but tokenising programming language input is
not the dominant cost of compiling code.

