

Using your laptop to compute PageRank for millions of webpages  - helwr
http://michaelnielsen.org/blog/using-your-laptop-to-compute-pagerank-for-millions-of-webpages/

======
kragen
From skimming, this appears to be the algorithm given in the original PageRank
paper from 12 years ago. You don't need hundreds of megs of RAM to do it,
though; put yourself in the mindset of a 1960s COBOL programmer, and you can
do the matrix-vector multiplication in a few K of memory space by the
following approach. The adjacency matrix is a set of triples (src, dst,
weight), indicating that there is a link from URL src to URL dst with that
weight, where the weights in each row sum to the damping factor s. You're
computing an occupancy vector consisting of a bunch of pairs (url, weight)
where the weights sum to 1.

1. Sort the adjacency matrix by src and the occupancy vector by url.

2. Iterate through the current occupancy vector and the adjacency matrix in
parallel. For each triple (a, b, c) and pair (a, d), emit a partial-sum pair
(b, d*c).

3. The partial-sum pairs will be randomly ordered. Sort them by URL.

4. Sum the partial-sum elements for the same URL, and add in the contribution
of sD + tE. (I'm leaving out how to compute sD at this point, but it's not
that hard.) This is your new occupancy vector.

5. Go back to step 2 for a few more iterations.

This approach parallelizes nicely with MapReduce, and doesn't require anything
but sequential data access; you could do it with four tape drives and a
microcontroller. I know I've seen it in some paper somewhere, but I can't
remember where.
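For concreteness, here is a rough in-memory sketch of one such iteration in
Python. The damping factor S = 0.85, the uniform teleport vector E, and
spreading the dangling mass sD uniformly are assumptions on my part (the
comment above deliberately leaves sD open); in the tape version each sorted
list would be a file and each sort an external merge sort.

    from itertools import groupby
    from operator import itemgetter

    S = 0.85       # assumed damping factor; each adjacency row's weights sum to S
    T = 1.0 - S    # assumed teleport weight (the "tE" term)

    def pagerank_pass(adjacency, occupancy, urls):
        """One iteration. adjacency: [(src, dst, weight)] with row weights summing to S.
        occupancy: [(url, rank)] with ranks summing to 1. urls: every known URL."""
        n = len(urls)

        # Step 1: sort the adjacency matrix by src and the occupancy vector by url.
        adjacency = sorted(adjacency, key=itemgetter(0))
        occupancy = sorted(occupancy, key=itemgetter(0))

        # Step 2: walk both sorted streams in parallel; for each triple (a, b, c)
        # matching a pair (a, d), emit the partial sum (b, d*c). Rank held by
        # URLs with no outgoing links is collected as dangling mass ("sD").
        partials, dangling, i = [], 0.0, 0
        for url, rank in occupancy:
            while i < len(adjacency) and adjacency[i][0] < url:
                i += 1
            j = i
            while j < len(adjacency) and adjacency[j][0] == url:
                _, dst, weight = adjacency[j]
                partials.append((dst, rank * weight))
                j += 1
            if j == i:
                dangling += rank
            i = j

        # Step 3: the partial sums come out in arbitrary order; sort them by URL.
        partials.sort(key=itemgetter(0))

        # Step 4: sum the partial sums per URL and add in sD + tE. Here both the
        # dangling mass and the teleport vector are spread uniformly, an assumption.
        base = S * dangling / n + T / n
        new_rank = {url: base for url in urls}
        for url, group in groupby(partials, key=itemgetter(0)):
            new_rank[url] += sum(w for _, w in group)

        # Step 5: return the new occupancy vector, sorted by URL for the next pass.
        return sorted(new_rank.items())

On tape or disk, steps 1 and 3 become external merge sorts and step 2 a single
sequential merge pass, which is why nothing beyond sequential access is needed.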

~~~
jacquesm
Every programmer should, for a while at least, be forced to work on a machine
with no more than a megabyte of RAM and a very slow hard disk.

Nothing like a constrained environment to teach you about efficiency in
programming.

~~~
kragen
While that is true, this is not the most efficient way to solve the problem if
you have a lot of RAM. My point is merely that your laptop can compute
PageRank for, roughly, any link graph that will fit on its hard disk, or that
it can access at a reasonable speed over the network.

(At 30 links per web page and 5 bytes per link, you need about 160 bytes per
web page to represent this link graph. That would allow you to store a
6-billion-page graph on a 1TB notebook hard disk. Unfortunately, making 8
passes over that terabyte will probably take more than a day.)
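Roughly, the arithmetic behind that estimate (the ~100 MB/s sequential
transfer rate for a notebook drive is my assumption, not from the comment):

    bytes_per_page = 30 * 5                    # 150 bytes; the comment rounds to ~160
    pages_per_tb = 10**12 // bytes_per_page    # ~6.6 billion pages on a 1 TB disk
    read_seconds = 8 * 10**12 / (100 * 10**6)  # 8 sequential passes: ~80,000 s, ~22 hours
    # add the writes and the extra passes the external sorts need, and
    # "more than a day" follows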

------
notauser
{Related} I wonder how search engines are using the 'nofollow' attribute these
days.

I know it's supposed to mean that PR does not accrue to the destination, but
they would have to be mad to ignore such a huge source of data.

~~~
ddemchuk
Just because the PR doesn't flow through the nofollow link does not mean that
the search engines aren't following the links to see what's on the other side...

