Hacker News
A missing feature in most dynamic languages: take a random key from an hash table. (antirez.com)
12 points by davidw on Feb 10, 2009 | 32 comments


In Ruby (w/ activesupport):

  myhash[myhash.keys.rand]
Is an O(n) operation, for no apparent reason. At least in 1.9, they're obviously keeping track of the keys internally in an array to support the ordered hashes. However, when you do Hash.keys, it does a foreach on the hash and creates a brand new array.

I'm still not sure why getting a random key is particularly useful, but the real problem is that getting the keys of a hash should be an O(1) operation, instead of an O(n) one.


Uh, here's how a hash table is commonly built. Ruby's may be optimized, but the gist is the same.

You have N buckets, a hashing function for key => bucket #, and each bucket has a linked list of value pointers. You hash the key down to a bucket number, walk the list until you find your key, which will give you your value.

To get all the keys, you need to walk all the buckets' lists, which is O(n).
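
To make that cost concrete, here is a minimal sketch of a separate-chaining table in Ruby (a toy model, not Ruby's actual implementation; the class name and bucket count are invented). Listing the keys has to visit every bucket and every chained pair, so it is O(n) no matter how the table is tuned:

    # Toy separate-chaining hash table, roughly the shape described above.
    class ToyHash
      def initialize(num_buckets = 11)
        @buckets = Array.new(num_buckets) { [] }
      end

      def []=(key, value)
        bucket = @buckets[key.hash % @buckets.size]
        pair = bucket.find { |k, _| k == key }
        if pair
          pair[1] = value
        else
          bucket << [key, value]
        end
      end

      def [](key)
        bucket = @buckets[key.hash % @buckets.size]
        pair = bucket.find { |k, _| k == key }
        pair && pair[1]
      end

      # O(n): every bucket and every chained pair has to be visited.
      def keys
        @buckets.flat_map { |bucket| bucket.map { |k, _| k } }
      end
    end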


I'm perfectly aware of that, and if you read my comment you'll see that I acknowledge that fact. However, I also said:

> they're obviously keeping track of the keys internally in an array to support the ordered hashes.

Which means that they can return them in an O(1) operation, but they choose not to for some reason.

Edit: I'm wrong: I just remembered why they can't return them, and it is because they're not storing the keys in an array. D'oh. They're using a doubly-linked list. So to return a list of keys you'd need to walk the linked list, an O(n) operation.

See: http://www.igvita.com/2009/02/04/ruby-19-internals-ordered-h... for more info.
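
For reference, the ordered-hash behaviour described in that article can be modeled roughly like this (a simplification, not the real C implementation; the class and field names are invented, and a plain Hash stands in for the bucket lookup). Each entry carries prev/next links in insertion order, so producing the ordered key list still means walking the whole chain:

    # Simplified model of an insertion-ordered hash: a key => entry lookup
    # table plus a doubly-linked list threading the entries together.
    Entry = Struct.new(:key, :value, :prev, :next_entry)

    class OrderedToyHash
      def initialize
        @table = {}          # key => Entry, for O(1) lookup
        @head = @tail = nil  # ends of the insertion-order list
      end

      def []=(key, value)
        if (entry = @table[key])
          entry.value = value
          return value
        end
        entry = Entry.new(key, value, @tail, nil)
        @tail.next_entry = entry if @tail
        @head ||= entry
        @tail = entry
        @table[key] = entry
        value
      end

      def [](key)
        entry = @table[key]
        entry && entry.value
      end

      # O(n): the only way to list the keys in order is to walk the chain.
      def keys
        result = []
        entry = @head
        while entry
          result << entry.key
          entry = entry.next_entry
        end
        result
      end
    end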


Indeed, I thought the same thing (re: your code snippet).

Even better, assuming you _really_ need to get the value for a random key out of the hash, might be:

    class Hash
      def random
        self[self.keys.sort{rand}.first]
      end
    end
    
    ...
    
    myhash.random
This doesn't require activesupport, just for comparison.


"sort{rand}"

I am not very experienced with Ruby, but this sounds like a really bad idea? In the worst case, sort might run forever?

I don't know what sort algorithm Ruby uses by default; maybe for some it doesn't matter, but it seems best not to make assumptions about the underlying algorithm?


You're right that `sort{rand}` is a bad idea, but not because it could run forever. `rand` in Ruby just returns a number like 0.9307038377384; used as a sort block, it will always sort the same way, since the result is always > 0. Ruby uses the block's return value to decide whether one element should be sorted before or after another. So maybe a better solution would be to use `sort{rand - 0.5}` or something, so that the result will randomly be either < or > 0.
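
To illustrate the comparator contract and a couple of alternatives (a quick sketch; the hash literal is made up, and Array#sample only exists in newer Rubies):

    # Array#sort expects the block to return a negative, zero, or positive
    # number, like <=> does:
    [3, 1, 2].sort { |a, b| a <=> b }   # => [1, 2, 3]

    # rand with no argument is always >= 0, so `sort { rand }` never says
    # "a comes before b" and the order is not meaningfully shuffled.
    # `sort { rand - 0.5 }` at least returns both signs, but it feeds the
    # sort inconsistent answers, so the shuffle is not guaranteed uniform.

    myhash = { :a => 1, :b => 2, :c => 3 }

    # Picking a random index directly avoids sorting entirely:
    keys = myhash.keys
    random_key = keys[Kernel.rand(keys.length)]

    # On newer Rubies (1.9+), Array#sample does the same thing:
    random_key = myhash.keys.sample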

I was curious, so I looked up Array#rand in the Active Support source code[1]; the implementation uses the following code to get a random element:

    def rand
      self[Kernel.rand(length)]
    end
So, this is much better than sorting the whole array in random order and pulling one off the top :) But it still uses `rand`.

[1] http://github.com/rails/rails/tree/e56b3e4c0b60b2b86f5ca9c5e...
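
So, with Active Support loaded, the snippet from the top of the thread is roughly equivalent to this (just for illustration; the O(n) cost comes from building the key array, not from the random pick):

    # myhash[myhash.keys.rand] with Active Support expands to roughly:
    keys = myhash.keys                       # O(n): builds a fresh key array
    myhash[keys[Kernel.rand(keys.length)]]   # O(1) once that array exists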


It's "missing" because it's not an every day operation, and it's easy to do using other built-in functions.

Languages in general should leave out things that are specialized use cases but easy to implement using simpler pieces. If you fill a language's standard library up with "useful" stuff like this, you eventually end up with a morass like PHP.


I'm a minimalist too, but this time I don't agree. If it is not implemented in the core you can't have this feature at all, at least not in O(1), and it is general enough: hash.random_key() will not really bloat the language, and it is very clear what it does. If Ruby's Array has 'partition', I don't think random_key is so strange to include, especially given that you can't implement it in Ruby itself.


Luckily in Smalltalk, you can add that method to your Dictionary class and re-use the image :D


No need to; in Smalltalk you can just say ...

    hash keys atRandom


Python does have built-in support for getting an arbitrary key: popitem() on a dict. It removes the item from the dict, but you can always add it back.

http://docs.python.org/library/stdtypes.html#dict.popitem

Edit: This method returns arbitrary (not random) results; each element is not equally likely to be picked.


That's "arbitrary", and not random at all. Some basic testing shows that the same dictionary keys will always lead to the same popitem order.


I bet this scans the table from the first to the last bucket.


Correct, nice catch.


For this to be in a library, I'd want it parameterized by the random number generator. (After being convinced it's really worth it.)


Typically you do not need to both select elements at random and use them as keys in a map. If you do need both, the easy way to do it is to just keep the keys in an array as well as in a hash table (sketched below).

The algorithm the author uses is not a good example. A simpler way to find the approximate most common elements in a large collection is to use a heap to store the (object, count) pairs. Still O(1), and you can remove the element with the lowest count each time instead of getting an approximation.
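
A minimal sketch of that array-plus-hash idea (class and method names are invented for illustration): the array gives an O(1) random pick, and swapping the deleted key with the last one keeps deletion O(1) as well.

    # Hash for key => value lookup, plus a parallel array of keys for O(1)
    # random selection. Deleting swaps the removed key with the last one so
    # the array never has holes.
    class RandomAccessHash
      def initialize
        @values = {}   # key => value
        @keys = []     # all keys, in arbitrary order
        @index = {}    # key => position in @keys
      end

      def []=(key, value)
        unless @values.key?(key)
          @index[key] = @keys.length
          @keys << key
        end
        @values[key] = value
      end

      def [](key)
        @values[key]
      end

      def delete(key)
        return nil unless @values.key?(key)
        pos = @index.delete(key)
        last = @keys.pop
        unless last == key
          @keys[pos] = last      # move the last key into the hole
          @index[last] = pos
        end
        @values.delete(key)
      end

      # O(1): just index into the key array at a random position.
      def random_key
        @keys[Kernel.rand(@keys.length)] unless @keys.empty?
      end
    end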


Still, from this page: http://en.wikipedia.org/wiki/Heap_(data_structure) I can't see how it is possible, using a heap, to count the most common elements in a large collection without approximation and using constant memory. Could you please elaborate on the original comment? Thanks


Here's a Brian Hayes column on good algorithms for this problem: http://www.americanscientist.org/issues/id.3822,y.0,no.,cont...

I think lacker was probably thinking of a correct algorithm for finding the n largest values in a set, instead of the n most frequent values.


Update: I wrote a simulation program; my algorithm appears to perform better than the majority-finding algorithm from Yale described in the American Scientist article, even though the algorithm I proposed is O(1) and Yale's is O(M) (where M is the number of top items to track). I'll post an article with more details and the source code on YN today.


Very interesting, thanks. I would like to simulate the problem in order to check how my simple algorithm performs compared to the others described in the article.

EDIT: I analyzed a bit the differences between the Yale University algorithm using M counters and my algorithm. It is interesting that while Yale's algorithm gives you all the top elements with a given minimal frequency, it is O(M), because you have to decrement M counters for every new element that does not match any of the elements in the M registers.

My algorithm, even if randomized (so you are not sure you track all the top M elements; the output will be an approximation of that), can report the M probably-top elements using O(1) time per element. The memory used is O(M) in both algorithms. I wonder whether the algorithm I proposed has been described before or not.

Note how you can trade time for accuracy by selecting more carefully which element to drop in order to add the next one. If you sample three elements and drop the minimum of the three, the accuracy will be one thing; if you sample ten elements, it will be different. I wonder if somebody here is able to analyze this algorithm more in depth.
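
A minimal sketch of the randomized-eviction idea as described above (the class name, default sample size, and the use of Array#sample are not from the thread; with a plain Ruby Hash the `keys` call makes the sampling step O(M), which is exactly where an O(1) random-key primitive would help):

    # Approximate top-M counting: keep at most `capacity` counters. When a
    # new element arrives and the table is full, sample a few tracked keys
    # at random and evict the one with the smallest count.
    class ApproxTopCounter
      def initialize(capacity, sample_size = 3)
        @capacity = capacity
        @sample_size = sample_size
        @counts = {}   # element => count
      end

      def add(element)
        if @counts.key?(element)
          @counts[element] += 1
        else
          evict_one if @counts.size >= @capacity
          @counts[element] = 1
        end
      end

      def top(n)
        @counts.sort_by { |_, count| -count }.first(n)
      end

      private

      # With a plain Ruby Hash, sampling random keys means building the key
      # array first; an O(1) random_key primitive would make this constant.
      def evict_one
        candidates = @counts.keys.sample(@sample_size)
        victim = candidates.min_by { |key| @counts[key] }
        @counts.delete(victim)
      end
    end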


You are using a randomized algorithm to approximate the smallest element in a collection of (object, count) pairs. Instead, use a min-heap to find the smallest element. You do need a hash table as well to keep track of items' locations in the heap. You don't have to do any pop operations; since you only need to remove something from the list when you have a new item with count = 1, you can just replace the item at the root of the heap (sketched after this comment).

This is probably also not the simplest way to get a reasonable approximation. Certainly those Hayes algorithms are better. You could also try something like, read in 2N elements, sort them, and drop the least-frequent N of them. Then read in N more and repeat. Pretty hacky though. You could also try bloom filters.
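
Here is a rough sketch of that min-heap-plus-position-hash structure from the first paragraph (names and details are not from the thread; note that the sift steps are O(log M) per update rather than strictly constant):

    # Binary min-heap of [count, key] pairs ordered by count, plus a
    # key => heap-index hash so a tracked element can be found and its
    # count bumped without scanning the heap.
    class HeapTopCounter
      def initialize(capacity)
        @capacity = capacity
        @heap = []   # array-backed min-heap of [count, key]
        @pos  = {}   # key => index of that key's entry in @heap
      end

      def add(key)
        if (i = @pos[key])
          @heap[i][0] += 1
          sift_down(i)                  # the count only grows, so it sinks
        elsif @heap.size < @capacity
          @heap << [1, key]
          @pos[key] = @heap.size - 1
          sift_up(@heap.size - 1)
        else
          # Replace the current minimum at the root. A fresh count of 1
          # cannot exceed any child, so the heap property still holds.
          old_key = @heap[0][1]
          @pos.delete(old_key)
          @heap[0] = [1, key]
          @pos[key] = 0
        end
      end

      def top(n)
        @heap.sort_by { |count, _| -count }.first(n)
      end

      private

      def swap(i, j)
        @heap[i], @heap[j] = @heap[j], @heap[i]
        @pos[@heap[i][1]] = i
        @pos[@heap[j][1]] = j
      end

      def sift_up(i)
        while i > 0
          parent = (i - 1) / 2
          break if @heap[parent][0] <= @heap[i][0]
          swap(i, parent)
          i = parent
        end
      end

      def sift_down(i)
        loop do
          smallest = i
          [2 * i + 1, 2 * i + 2].each do |child|
            if child < @heap.size && @heap[child][0] < @heap[smallest][0]
              smallest = child
            end
          end
          break if smallest == i
          swap(i, smallest)
          i = smallest
        end
      end
    end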


lacker, there are infinite solutions to every problem, but if you really want to find a competitor to my algorithm that performs better, you NEED to talk about algorithms that use no more than O(1) time for every new element processed.

Sorting the set from time to time is not valid under this assumption, for example. That is going to be O(M log M).

The registers approach is O(M).

p.s. M here is the number of top-elements tracked.


You asked for algorithms that use constant memory. So M is a constant and M log M is constant too.


lacker, the example was just useful to explain how it can sometimes be useful to run randomized algorithms against hash tables. A real-world example is a DNS cache that can't hold more than N elements: you can just remove one at random. Another is a cache where elements can expire and you want to perform a lazy cleanup.


Maybe I'm missing something, but the pseudocode looks funky. Is he really assuming at most one key per bucket?

This code seems like it fails when the number of keys in the table exceeds the number of buckets. Assuming that 'table.size' returns the number of keys, and not the number of buckets, he'll also be hitting 'index out of bounds' errors.

Of course, I could be wrong. Pseudocode can have a funky syntax. :)


It's missing because you can't get O(1) and have each key equally likely to be returned, right?


Are you sure? I don't think so. See the new version of the pseudocode: just select a new random index every time you reach an empty bucket and you are done.
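
A sketch of that retry loop, assuming direct access to the bucket array, which a real Ruby Hash does not expose (`buckets` here is an array of per-bucket [key, value] lists):

    # Probe buckets at random and retry on empty ones. On a half-full table
    # this takes about two probes on average. Note that keys in longer
    # chains are a bit less likely to be picked, which is the
    # equal-likelihood concern raised above.
    def random_bucket_key(buckets)
      return nil if buckets.all? { |b| b.empty? }
      loop do
        bucket = buckets[Kernel.rand(buckets.length)]
        next if bucket.empty?
        return bucket[Kernel.rand(bucket.length)][0]   # the chosen pair's key
      end
    end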


Unfortunately, it's not guaranteed to return, ever. It's a pseudo-random generator; it can (though it's not very likely) get stuck in a cycle of buckets it keeps checking, making the worst-case time complexity O(∞). Even if it uses a real random generator, I'm not sure it's guaranteed to return.


If you want to guarantee it returns, then after N failed operations, just convert the hash table to a list and pick something at random. That will take O(N) time, but since you've already spent O(N) time it amortizes out to no additional cost.


With a hash table 50% full, the probability of it not returning after just 100 tries is 0.5^100, roughly 0.0000000000000000000000000000007. I think we can sleep without problems about it ;)


Hmm, yes, I think you are right.


How is this operation O(1)? In the worst case, in a table with 4096 buckets and 1 entry, and assuming you "color" the buckets or track them somewhere else, you spend 4096 lookup operations getting the key. In reality, as implemented here, the algorithm doesn't even necessarily terminate.



