
Ask HN: Fast sampling and update of weighted items? (data structure like red-black trees?) - bravura
What is the appropriate data structure for this task?

I have a set of N items, where N is large. Each item has a positive weight associated with it.

I would like to do the following, quickly:

inner loop:

    Sample an item, according to its weight.

    [process...]

    Update the weights of K items, where K << N.

When I say sample by weight, this is different from uniform sampling: an item's likelihood of being drawn is proportional to its weight. So if there are two items, one with weight .8 and one with weight .2, they are drawn with probability 80% and 20% respectively.

The number of items N remains fixed. Weights lie in a bounded range, say [0, 1], and do not always sum to one.

A naive approach takes O(n) time to sample. Is there an O(log(n)) algorithm?

What is the appropriate data structure for this task?

I believe that red-black trees are inappropriate as-is, since they treat each item as having equal weight.
======
loicfevrier
Red-black trees are more machinery than you need for this task: any balanced
binary tree whose maximum height is O(log n) will do (red-black trees will
work, but pick whichever you like; they tend to be slow). On each node, store
two numbers: the sum of the weights in the left subtree, and the sum of the
weights in the right subtree. When you want to choose an item, walk down the
tree, choosing left or right at each node according to those two sums. ==>
O(log n) per sample. When you update an item's weight, update the stored sums
on the path from that item up to the root. ==> O(log n) per update.

~~~
nkurz
I don't think I'm quite understanding your algorithm. When you say "choose
left or right", do you mean to choose randomly between the branches with
probability equal to the ratio of the weights of the trees? So that each
sample requires log(n) random numbers before you reach a leaf? Or do you mean
something else?

~~~
bravura
You know the sum of the weights. Sample a random number between 0 and sum.
Then, use this value to guide your search in the tree. For example, if the
left branch has weight 30 and the right branch has weight 50, the total weight
is 80. If I pick 40, then I am going down the right branch.

At the next node, if the left branch has weight 12 and the right branch has
weight 38, the left branch spans 30-42 and the right branch spans 42-80. So
you follow the left branch, because 40 falls in 30-42.
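A minimal sketch of this sum-tree idea in Python (my own sketch, not from the thread; the class and method names are made up), using a flat array-based tree over a fixed set of N items:

```python
import random

class WeightedSampler:
    """Sum tree over a fixed set of n items: O(log n) sample, O(log n) update."""

    def __init__(self, weights):
        self.n = len(weights)
        # tree[1] is the root; the leaves live at indices n .. 2n-1.
        self.tree = [0.0] * (2 * self.n)
        for i, w in enumerate(weights):
            self.tree[self.n + i] = w
        for i in range(self.n - 1, 0, -1):      # fill internal sums bottom-up
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def update(self, i, w):
        """Set the weight of item i, fixing sums on the path to the root."""
        i += self.n
        self.tree[i] = w
        i //= 2
        while i >= 1:
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def sample(self):
        """Return an item index with probability proportional to its weight."""
        x = random.random() * self.tree[1]      # one random number per sample
        i = 1
        while i < self.n:                       # go left if x falls in the left sum
            if x < self.tree[2 * i]:
                i = 2 * i
            else:
                x -= self.tree[2 * i]
                i = 2 * i + 1
        return i - self.n
```

Each sample draws a single random number and walks one root-to-leaf path, so both operations are O(log n).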

------
mjtokelly
Here's a sampling algorithm that will work, as long as the maximum weight
isn't too much greater than the average weight:

1) Determine the maximum weight mw. (This is O(N) the first time; after an
update it is O(K), provided the maximum does not decrease.)

2) Repeat until an index i is accepted:

      a) Generate a random pair (i, x), where i is a uniform integer over [1,N], and x is a uniform float over [0,mw].

      b) Accept index i if x < weight(i).

3) Return the accepted index i.

This gives you the desired distribution in expected time O(mw/aw) (maximum
weight over average weight) per sample.
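As a sketch (my own, with the acceptance test oriented so that heavier items are accepted more often; it assumes at least one weight is positive):

```python
import random

def sample_index(weights, mw):
    """Rejection sampling: propose a uniform index, accept in proportion
    to its weight. Expected number of proposals is mw / (average weight)."""
    n = len(weights)
    while True:
        i = random.randrange(n)         # uniform integer over [0, n)
        x = random.random() * mw        # uniform float over [0, mw)
        if x < weights[i]:              # accept with probability weights[i] / mw
            return i
```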

~~~
bravura
Rejection sampling is perhaps not a good option, because the assumption does
not hold: "the maximum weight isn't too much greater than the average weight".

In learning algorithms, the distribution of the weights can get peaked.

------
drcode
First, put all items into an array, computing a number k for each item, where
k = (sum of the weights of all items before it in the array).

Then, just generate a random number 0<x<(sum of all weights).

Finally, do a binary search on the array for the largest k <= x. O(log(n))
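A sketch of this scheme (my own, using inclusive prefix sums, a slight variant) with Python's bisect:

```python
import bisect
import itertools
import random

def build_cumulative(weights):
    # running sum of the weights up to and including each item: O(n)
    return list(itertools.accumulate(weights))

def sample(cum):
    x = random.random() * cum[-1]       # uniform float over [0, total weight)
    # index of the first item whose cumulative weight exceeds x
    return bisect.bisect_right(cum, x)
```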

...As for the "updating" step, you didn't define that very well or state
whether you're interested in minimizing that as well. But given that the
number of updated items may be as large as N, it seems that a naive linear
update would be the best you could achieve.

~~~
loicfevrier
But you will need to re-compute the prefix sums each time you do an update,
which takes O(n). You can use Binary Indexed Trees to do that efficiently:
[http://www.topcoder.com/tc?module=Static&d1=tutorials...](http://www.topcoder.com/tc?module=Static&d1=tutorials&d2=binaryIndexedTrees)
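A minimal sketch of this (my own code, not from the tutorial). Note that only the item indices need to be integers, 1..N; the weights stored in the tree can be floats:

```python
import random

class Fenwick:
    """Binary indexed tree over items 1..n with real-valued weights."""

    def __init__(self, n):
        self.n = n
        self.tree = [0.0] * (n + 1)
        self.total = 0.0

    def add(self, i, delta):
        """Add delta to the weight of item i (1-indexed): O(log n)."""
        self.total += delta
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def sample(self):
        """Return an item with probability proportional to its weight: O(log n).

        Standard BIT descent: find the smallest i whose prefix sum exceeds x."""
        x = random.random() * self.total
        i = 0
        step = 1
        while step * 2 <= self.n:
            step *= 2
        while step:
            if i + step <= self.n and self.tree[i + step] <= x:
                i += step
                x -= self.tree[i]
            step //= 2
        return i + 1
```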

~~~
bravura
It appears that Binary Indexed Trees are appropriate only when the weights are
integral?

~~~
loicfevrier
Exactly, I forgot that point.

------
hotpockets
As long as the weights are distributed randomly in the array, you can just
perform the algorithm on a random subsection of the array. The subsection
needs to have (approximately) the same distribution of weights as the whole
array. I think that's the only requirement you need.

Ideally, you could pick a smaller fraction of the array as N gets bigger,
giving you sub-O(N) scaling.

In other words, if the array indexes that get picked by a slow algorithm are
distributed randomly, you might as well just pick a random subsection, as
long as it is big enough to have the same distribution of weights as the
whole array.

------
a-priori
Off the top of my head, here's how I would go about this.

First, preprocess your data:

1) Normalize the weights

2) Sort the items in descending order by weight

3) Calculate, for each item, the sum of the weights of all prior items

4) Enter the items into a binary search tree. The key into this tree is the
sum you calculated in step 3. You can use whichever data structure you want
here... red-black or splay would probably be best, but it depends on your
situation.

Now your O(log n) lookup:

1) Generate a random variate in [0,1)

2) Search the binary tree for the item with the largest key that is <= the
variate.

~~~
bravura
Each time you do an update step, if you renormalize, the algorithm becomes
O(n).

------
eru
Interesting. Could you tell us what problem you are solving?

~~~
bravura
I am trying to speed up a machine learning algorithm. There are a very large
number of possible output classes. It would be too expensive on each training
step to perform an update for all outputs. I use an exponential moving average
to keep track of an estimate of which outputs are correctly learned (low
weight) and which are not well-learned (high weight). In this way, I can focus
training on what the model currently has not learned well, and not spend
training time on what it already knows.

~~~
loicfevrier
Which language are you using? Matlab, C/C++?

What would be an estimate of N: 10^3, 10^6, 10^9?

~~~
bravura
python/numpy with C.

Funny story, that. At our lab, we are developing an optimizing compiler in
Python for math expressions. You write the function you want to repeatedly
evaluate in Python. Maybe there's a gradient thrown in. Our library optimizes
the function graph, converts it to C, and compiles it. You can then execute
your function from Python, but it's fast.

We should be releasing 0.1 in a week.

------
Autre
A priority queue [<http://en.wikipedia.org/wiki/Priority_queue>] seems to be
the way to go with this.

~~~
mjtokelly
This would give you the maximum weighted item, but not a weighted probability
distribution.

------
jonnyba
If the number of distinct weights is finite and small, you could create a
bucket for each distinct weight.

For example, if the possible weights are .25, .5, and .75, put all items into
one of those buckets (array-backed). Each bucket then gets a weight of
(individual weight)*(bucket size). First randomly choose a bucket based on
the bucket weight, then uniformly choose an object from the bucket.
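A sketch of the bucket idea (my own; the helper names are made up):

```python
import random

def make_buckets(items, weights):
    """Group items by weight; assumes a small, finite set of distinct weights."""
    buckets = {}
    for item, w in zip(items, weights):
        buckets.setdefault(w, []).append(item)
    return buckets

def sample(buckets):
    # bucket weight = (individual weight) * (bucket size)
    pairs = [(w * len(b), b) for w, b in buckets.items()]
    x = random.random() * sum(bw for bw, _ in pairs)
    for bw, b in pairs:
        x -= bw
        if x < 0:
            return random.choice(b)     # uniform choice within the chosen bucket
    return random.choice(pairs[-1][1])  # guard against float rounding
```

Sampling is O(number of buckets), and an update is just moving an item between two buckets.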

~~~
gojomo
Still O(n).

------
utnick
What does the << operator mean?

and how are these K items found? Do they have the same weight as the sampled
item?

~~~
loicfevrier
It means that K is much smaller than N. Because if K is O(N), then doing each
update in O(log n) is not worthwhile; it is better to rebuild the whole
structure in O(n).

