
Random Hash Functions - mdwrigh2
http://research.swtch.com/randhash
======
ezyang
Which is distinct from a "totally random hash function", which is a hash
function for which the hash associated with every value is uniformly randomly
selected from the key space. Totally random hash functions have very good
properties, but they take a lot of space to store. (Exercise for the reader:
if the hash function is {0,1}^n -> {0,1}^m (so the input is an n-bit string,
and the output is an m-bit string), how much space do you need? Why isn't this
compressible?)

------
avar
This is off-topic but is there an official Google API + snippet for embedding
Google+ comments in a non-Google+ page like that? He seems to have some
hardcoded custom JavaScript code for it, is there no easy official way to do
so?

------
peterwwillis
....not sure if april fools joke or if i'm just really dumb

------
barrkel
I had to AdBlock typekit.com in order to disable the custom font on this site;
normally Firefox config option gfx.downloadable_fonts.enabled disables custom
fonts, but not here.

(I think the "idea" presented here is silly; I think it's an April Fools. Any
given NaN (IEEE NaNs have lots of different representations) should hash
consistently and compare consistently with functions specific to the
requirements of hash tables, i.e. one for which NaN = NaN for any given NaN.
You shouldn't be able to store multiple NaNs in the hash table (unless you're
specifically using distinct NaNs) no more than you can store multiple keys for
'1.0', so there should be no quadratic explosion.)

(Oh and furthermore: I only know rudimentary Python, but timeit taking a
string parameter for the code to time rather than a lambda is revolting.)

~~~
ezyang
Did you run the code?

The reason why the behavior here is bad is because Python uses probing to
implement hash collision detection. Essentially, once you've looked up the
hash, you have to check to make sure the value stored at this hash is actually
the key you are looking for. However, floating point equality is different
from bit-wise equality, and so once a NaN is stored as a key in the table it's
unretrievable, since it will always report that it's not equal to any key you
may be looking for.

Maybe it's April Fool's, but only in the fine tradition of WTF but true.

~~~
barrkel
I'm completely uninterested in the specifics of how Python hash tables are
implemented; in other words, the code is irrelevant. What concerns me is the
logic of hash functions as applied to hash tables, as a language independent
concern.

Hash tables require certain things from the key hash and compare functions. If
language defaults are used for comparing, particularly with NaNs, then hash
tables break down. The fix isn't to make your hash function random; it's to
make your compare function conform to the requirements of hash tables.

~~~
ezyang
I'm not the biggest fan of the solution in the article; but my biggest problem
is that you're using floating points as keys for your hash table.

Fix comparison for floating point? Well, now you're violating the IEEE
standard.

Introduce a new 'equality-for-hash-tables' magic function? Well, now you have
another special case that contradicts normal equality and doesn't even really
have well defined semantics in many cases.

~~~
barrkel
I'm not using floating point as keys in my hash table; Russ is.

If you want a hash table with meaningful semantics, you need certain
guarantees from your hash and comparison functions. If you're trying to store
values that break those comparisons, you either need to change the functions
or not store those values.

You're advocating not storing those values; fine. But some people will, not
even directly; for example, a 'memoize()' function that creates a cache behind
the scenes, using a hash table with keys formed from function arguments, whose
types may be floats. Here, you _do_ want NaN = NaN; you want hash table rules
to be followed; and you _don't_ want a random hash function.

Just saying "sorry, I can't memoize this function" on pedantry grounds isn't
good enough.

~~~
defrost
What you should be doing, if you're doing numerics, is back up and find out
why there's a NaN in the arguments in the first place.

Not all NaN's are created equal, in some contexts you'll want to always raise
an exception whenever you see them, and in others you'll want to provide an
appropriate closure or limit ... that depends on how you arrived at the Nan in
the first place.

For what it's worth, you need to always know your libraries, for any one using
Go to hash floats the behaviour _is_ to return random for has( NaN ) :

    
    
        // NOTE: Because NaN != NaN, a map can contain any
        // number of (mostly useless) entries keyed with NaNs.
        // To avoid long hash chains, we assign a random number
        // as the hash value for a NaN.
        
        void
        runtime·f32hash(uintptr *h, uintptr s, void *a)
        {
            uintptr hash;
            float32 f;
    
            USED(s);
            f = *(float32*)a;
            if(f == 0)
                    hash = 0;  // +0, -0
            else if(f != f)
                    hash = runtime·fastrand1();  // any kind of NaN
            else
                    hash = *(uint32*)a;
            *h ^= (*h ^ hash ^ 2860486313U) * 3267000013U;
        }
    
    

[http://code.google.com/p/go/source/browse/src/pkg/runtime/al...](http://code.google.com/p/go/source/browse/src/pkg/runtime/alg.c#236)

------
tantalor
> Since the inputs are identical

They aren't identical because NaN != NaN. That's what "identity" means.

