
A missing feature in most dynamic languages: take a random key from an hash table. - davidw
http://antirez.com/post/A-missing-feature-in-most-dynamic-languages%3A-take-a-random-key-from-an-hash-table..html
======
subwindow
In Ruby (w/ activesupport):

    
    
      myhash[myhash.keys.rand]
    

Is an O(n) operation, for no apparent reason. At least in 1.9, they're
obviously keeping track of the keys internally in an array to support the
ordered hashes. However, when you do Hash.keys, it does a foreach on the hash
and creates a brand new array.

I'm still not sure why getting a random key is particularly useful, but the
real problem is that getting the keys of a hash should be an O(1) operation,
instead of an O(n) one.

~~~
lsb
Uh, here's how a hash table is commonly built. Ruby's may be optimized, but
the gist is the same.

You have N buckets, a hashing function for key => bucket #, and each bucket
has a linked list of value pointers. You hash the key down to a bucket number,
walk the list until you find your key, which will give you your value.

To get all the keys, you need to walk all the buckets' lists, which is O(n).
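
Something like this toy sketch in Python (illustrative only, not Ruby's
actual C implementation) makes the cost visible: collecting the keys has to
visit every bucket and every chain node, so it is O(n) however the buckets
are organized.

    class ToyHash:
        def __init__(self, nbuckets=8):
            # each bucket holds a chain of (key, value) pairs
            self.buckets = [[] for _ in range(nbuckets)]

        def _bucket(self, key):
            return self.buckets[hash(key) % len(self.buckets)]

        def set(self, key, value):
            b = self._bucket(key)
            for i, (k, _) in enumerate(b):
                if k == key:                # key already present: overwrite
                    b[i] = (key, value)
                    return
            b.append((key, value))

        def get(self, key):
            for k, v in self._bucket(key):  # walk the chain
                if k == key:
                    return v
            raise KeyError(key)

        def keys(self):
            # O(n): every bucket and every chain node must be visited
            return [k for b in self.buckets for k, _ in b]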

~~~
subwindow
I'm perfectly aware of that, and if you read my comment you'll see that I
acknowledge that fact. However, I also said:

> they're obviously keeping track of the keys internally in an array to
> support the ordered hashes.

Which means that they can return them in an O(1) operation, but they choose
not to for some reason.

Edit: I'm wrong. I just remembered why they can't return them: they're not
storing the keys in an array. D'oh. They're using a doubly-linked list, so to
return a list of keys you'd need to walk the linked list, an O(n) operation.

See <http://www.igvita.com/2009/02/04/ruby-19-internals-ordered-hash/> for
more info.
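
A rough Python sketch of that scheme (a toy stand-in, not Ruby's actual
implementation): each entry is threaded onto an insertion-order linked list,
so producing the key list means walking the chain, which is O(n).

    class Entry:
        def __init__(self, key, value):
            self.key, self.value = key, value
            self.prev = self.next = None

    class OrderedToyHash:
        def __init__(self):
            self.index = {}            # key -> Entry (stands in for the buckets)
            self.head = self.tail = None

        def set(self, key, value):
            if key in self.index:
                self.index[key].value = value
                return
            e = Entry(key, value)
            self.index[key] = e
            if self.tail is None:      # first entry
                self.head = self.tail = e
            else:                      # append to the insertion-order chain
                e.prev = self.tail
                self.tail.next = e
                self.tail = e

        def keys(self):
            # O(n): walk the doubly-linked list from head to tail
            out, e = [], self.head
            while e is not None:
                out.append(e.key)
                e = e.next
            return out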

------
compay
It's "missing" because it's not an every day operation, and it's easy to do
using other built-in functions.

Languages in general should leave out things that are specialized use cases
but easy to implement using simpler pieces. If you fill a language's standard
library up with "useful" stuff like this, you eventually end up with a morass
like PHP.

~~~
antirez
I'm a minimalist too, but this time I don't agree. You can't have this
feature at all, at least not in O(1), unless it is implemented in the core.
It is general enough, hash.random_key() will not really bloat the language,
and it is very clear what it does. If there is 'partition' in Ruby's Array, I
think random_key is not so strange to include, especially given that you
can't implement it in Ruby itself.

~~~
omouse
Luckily in Smalltalk, you can add that method to your Dictionary class and re-
use the image :D

~~~
gnaritas
Don't need to, in Smalltalk you can just say ...

    
    
        hash keys atRandom

------
anuraggoel
Python does have built-in support for getting an arbitrary key: popitem() on a
dict. It removes the item from the dict, but you can always add it back.

<http://docs.python.org/library/stdtypes.html#dict.popitem>

Edit: This method returns arbitrary (not random) results; each element is not
equally likely to be picked.
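
For example (the exact pop order is an implementation detail and varies
across Python versions):

    d = {"a": 1, "b": 2, "c": 3}
    key, value = d.popitem()  # removes *some* item; arbitrary, not uniform
    d[key] = value            # put it back if you only wanted to peek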

~~~
ivank
That's "arbitrary", and not random at all. Some basic testing shows that the
same dictionary keys will always lead to the same popitem order.

~~~
antirez
I bet this scans the table from the first to the last bucket.

------
abecedarius
For this to be in a library, I'd want it parameterized by the random number
generator. (After being convinced it's really worth it.)

------
lacker
Typically you do not need to both randomly select elements and use them as
keys in a map. If you do need both, the easy way is to keep the keys in an
array as well as in a hash table, as sketched below.
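
A minimal sketch of that combined structure in Python (hypothetical code; to
also support O(1) deletion you'd add a key-to-index map and swap-and-pop on
the array):

    import random

    class RandomKeyDict:
        def __init__(self):
            self.data = {}       # key -> value: O(1) lookup
            self.key_list = []   # the same keys in an array: O(1) random pick

        def set(self, key, value):
            if key not in self.data:
                self.key_list.append(key)
            self.data[key] = value

        def get(self, key):
            return self.data[key]

        def random_key(self):
            return random.choice(self.key_list)  # O(1)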

The algorithm the author uses is not a good example. A simpler way to find the
approximate most common elements in a large collection is to use a heap to
store the (object, count) pairs. Still O(1), and you can remove the element
with the lowest count each time instead of getting an approximation.

~~~
antirez
Still from this page: <http://en.wikipedia.org/wiki/Heap_(data_structure)> I
can't see how it is possible, using a heap, to count the most common elements
in a large collection without approximation and using constant memory. Could
you please elaborate on the original comment? Thanks

~~~
abecedarius
Here's a Brian Hayes column on good algorithms for this problem:
<http://www.americanscientist.org/issues/id.3822,y.0,no.,content.true,page.1,css.print/issue.aspx>

I think lacker was probably thinking of a correct algorithm for finding the n
largest values in a set, instead of the n most frequent values.
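
Under that reading, a heap does solve it directly; a sketch with Python's
heapq (my interpretation of the suggestion, assuming the counts have already
been accumulated):

    import heapq
    from collections import Counter

    counts = Counter(["a", "b", "a", "c", "a", "b"])
    # n largest by count; heapq.nlargest runs in O(n log k) for k results
    top2 = heapq.nlargest(2, counts.items(), key=lambda kv: kv[1])
    print(top2)  # [('a', 3), ('b', 2)]

Note this still keeps a counter for every distinct element, which is exactly
the memory cost the original post was trying to avoid.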

~~~
antirez
Update: I wrote a simulation program; my algorithm appears to perform better
than the majority-finding algorithm from Yale described in the American
Scientist article, even though the algorithm I proposed is O(1) and Yale's is
O(M) (where M is the number of top items to track). I'll post an article with
more details and the source code on YN today.
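
For reference, a sketch of the frequent-items ("majority-finding") algorithm
as the Brian Hayes column describes it (my reconstruction; details may differ
from the original paper): keep at most M counters, and on a miss with a full
table decrement everything, which is the O(M) step mentioned above.

    def frequent_items(stream, m):
        counters = {}
        for x in stream:
            if x in counters:
                counters[x] += 1      # known item: bump its counter
            elif len(counters) < m:
                counters[x] = 1       # free slot: start tracking it
            else:
                # the O(M) step: decrement every counter, dropping zeros
                for k in list(counters):
                    counters[k] -= 1
                    if counters[k] == 0:
                        del counters[k]
        # survivors include every item occurring more than n/(m+1) times
        return counters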

------
wildwood
Maybe I'm missing something, but the pseudocode looks funky. Is he really
assuming at most one key per bucket?

This code seems to fail when the number of keys in the table exceeds the
number of buckets. Assuming that 'table.size' returns the number of keys and
not the number of buckets, he'll also be hitting 'index out of bounds' errors.

Of course, I could be wrong. Pseudocode can have a funky syntax. :)

------
mooism2
It's missing because you can't get O(1) _and_ have each key equally likely to
be returned, right?

~~~
antirez
Are you sure? I don't think so. See the new version of the pseudocode: just
select a new random index every time you reach an empty bucket and you are
done.
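
A sketch of that loop in Python (assuming, per wildwood's point above, an
open-addressed table with at most one entry per bucket, modeled here as a
hypothetical `buckets` array holding keys or None):

    import random

    def random_key(buckets):
        # re-draw until a non-empty bucket is hit; expected O(1) when the
        # load factor is bounded below, but with no worst-case guarantee
        while True:
            slot = buckets[random.randrange(len(buckets))]
            if slot is not None:
                return slot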

~~~
damienkatz
Unfortunately, it's not guaranteed to ever return. With a pseudo-random
generator it can (though it's not very likely) get stuck in a cycle of the
buckets it checks, making the worst-case time complexity O(∞). Even if it
uses a true random generator, I'm not sure it's guaranteed to return.

~~~
lacker
If you want to guarantee it returns, then after N failed operations, just
convert the hash table to a list and pick something at random. That will take
O(N) time, but since you've already spent O(N) time it amortizes out to no
additional cost.
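
A sketch of that combined strategy (hypothetical helper, same `buckets`
layout as above):

    import random

    def random_key_bounded(buckets):
        n = len(buckets)
        for _ in range(n):            # up to N random probes...
            slot = buckets[random.randrange(n)]
            if slot is not None:
                return slot
        # ...then fall back to an O(N) scan, so the worst case stays O(N)
        occupied = [s for s in buckets if s is not None]
        return random.choice(occupied)   # IndexError if the table is empty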

------
tptacek
How is this operation O(1)? In the worst case, in a table with 4096 buckets
and 1 entry, and assuming you "color" the buckets or track them somewhere
else, you spend 4096 lookup operations getting the key. In reality, as
implemented here, the algorithm doesn't even necessarily terminate.

