Hacker News new | past | comments | ask | show | jobs | submit login

That isn't a uniform distribution. It's biased in favor of records with a bigger index. (I.e., rand() returns the sane value twice.)



Yes - the code is wrong. A correct implementation would adjust the probability of sampling based on how many items have been trialed.

EDIT: I misread it as rand(n), rand(idx) is correct.


But that's happening here as well, right? The random number is an indexed picked between 0 and the current index.

Though, I think there is an off-by one in this implementation (assuming that Ruby indexing starts at 0):

    j = rand(idx)
    out[j] = i if j < n
Say that the index is n, it will call rand(n), which gives a random number [0..n). However, the index should be picked from [0..n].


You may be right...I spent a lot of time thinking about that, and I convinced myself that this was correct, but I could be wrong. My thinking is that rand(N) in ruby returns a value from [0, N-1] inclusive, which is the complete index range of the sampled stream so far.

That sounded right to me. I could be convinced otherwise.


You are not taking the index of the element that you are currently sampling into consideration.

Suppose that the sample size is 1 and you are getting the second item (index 1). You will call rand(1), which has 0 as the only possible outcome. So, you will always replace the first item (index 0). Whereas if you would call rand(2) (possible outcomes: 0 and 1), you replace the item in the sample with probability .5 (assuming that the random number generator is uniform).


D'oh. Yes, I think you're right. That'll teach me to try to mentally debug code at 2AM!

I'll fix the code and publish a new gem a bit later today. Thanks!


I'm not a ruby guy at all (never used it), but if I understand this correctly:

  j = rand(idx)
  out[j] = i if j < n
..it is keeping the sampling probability at n/idx where n is the sample/reservoir size and idx is the number of items seen so far. Then if this item is selected for sampling, each of the n items in the reservoir is equally likely to be replaced. I think the implementation is correct, assuming that idx is the number of items seen so far.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: