

Ask HN: Need help writing a Pig script - gumbo

Hi there.
As part of our start-up product, we need to compute a "similar users" feature, and we've decided to use Pig for it.
I've been learning Pig for a few days now and understand how it works.
To start, here is what the log file looks like.<p><pre><code>  user		url				time
  user1		http://someurl.com		1235416
  user1		http://anotherlik.com		1255330
  user2		http://someurl.com		1705012
  user3		http://something.com		1705042
  user3		http://someurl.com		1705042
</code></pre>
As the number of users and URLs can be huge, we can't use a brute-force approach here, so first we need to find the users that have accessed at least one common URL.<p>The algorithm could be split as below:<p>#Find all users that have accessed some common URLs.
#Generate pair-wise combinations of all users for each resource accessed.
#For each pair and URL, compute the similarity of those users: the similarity depends on the time interval between the accesses (so we need to keep track of the time).
#Sum up the similarity for each pair-URL.<p>Here is what I've written so far:<p><pre><code>  A = LOAD 'logs.txt' USING PigStorage('\t') AS (uid:bytearray, url:bytearray, time:long);

  grouped_pos = GROUP A BY ($1);
</code></pre>
I know it is not much yet, but I don't know how to generate the pairs or how to move further.
Any help would be appreciated.<p>Thanks.
======
SoftwarePatent
This isn't really the place for this; you should try Stack Overflow.
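<p>That said, the steps in the question could be sketched roughly as a self-join on url. This is a minimal, untested sketch under assumptions: the relation names, the output path 'similar_users', and the 1 / (1 + time gap) similarity are all placeholders (the question does not define the similarity function), and uid/url are loaded as chararray rather than bytearray so they compare as strings.<p><pre><code>  -- Load the log twice so the self-join has two distinct aliases.
  A = LOAD 'logs.txt' USING PigStorage('\t') AS (uid:chararray, url:chararray, time:long);
  B = LOAD 'logs.txt' USING PigStorage('\t') AS (uid:chararray, url:chararray, time:long);

  -- Pair up every two accesses to the same url.
  J = JOIN A BY url, B BY url;

  -- Keep each unordered user pair once and drop self-pairs.
  P = FILTER J BY A::uid < B::uid;

  -- Assumed similarity: closer access times give a higher score.
  S = FOREACH P GENERATE A::uid AS u1, B::uid AS u2, A::url AS url,
        1.0 / (1.0 + ABS((double)(A::time - B::time))) AS sim;

  -- Sum the similarity for each (pair, url).
  G = GROUP S BY (u1, u2, url);
  R = FOREACH G GENERATE FLATTEN(group) AS (u1, u2, url), SUM(S.sim) AS similarity;

  STORE R INTO 'similar_users';
</code></pre>
Dropping url from the GROUP key would give one total per user pair instead of one per pair and URL.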

