
Exploration of the Google PageRank Algorithm - bflbfl
https://googledrive.com/host/0B2GQktu-wcTiaWw5OFVqT1k3bDA/
======
wfunction
This is way too complicated!

How about an easy explanation, one that a middle- or high-schooler who just
learned to solve 3-equations and 3-variables and has seen probability can
understand?

Let's say you have three websites: Reddit, Yahoo, and Google. Let's say Reddit
links to Yahoo and Google, and Yahoo links to Google, and Google links to
Reddit.

If you start on a random page and click randomly, what happens? You'll land on
Reddit with probability x, Yahoo with probability y, and Google with
probability z. Notice that if you continue clicking randomly, x, y, and z will
stop changing after a while.

Okay, now stop clicking. How likely are you to have landed on page Google?
(i.e., What's z?)

\- You're on Reddit with probability x, and 1/2 of the links there will take
you to Google.

\- You're on Yahoo with probability y, and 1/1 of the links there will take
you to Google.

\- You're on Google with probability z, and 0/1 of the links there will take
you to Google.

Therefore, z = x(1/2) + y(1/1) + z(0/1)

Similarly, x = x(0/2) + y(0/1) + z (1/1)

Similarly, y = x(1/2) + y(0/1) + z (0/1)

Since x + y + z = 1, we get: x = 2/5, y = 1/5, z = 2/5

What does that mean? It means Reddit and Google are twice as "important" as
Yahoo, since you're twice as likely to land on them.

There, we just learned how PageRank works with zero linear algebra business.

~~~
mullr
Nonetheless, linear algebra is the way it's done for real applications. It's
how the literature treats the subject as well, so it's important to
understand. This tool helps for that goal, so I'm glad it's there.

~~~
wfunction
In a real application, you don't even know the transition probabilities to
begin with, so actually performing the computation is the least of your
worries -- the first problem is coming up with numbers, which wasn't
mentioned. :)

So I feel that if the goal is to give a conceptual understanding, linear
algebra is overkill. If the goal is to show how it's done in practice then
this isn't terribly useful since it doesn't even show the tip of the iceberg!

~~~
bflbfl
Hey wfunction

I believe we know exactly what the matrix is, based on the hyperlink structure
itself. It's updated and displayed on the vis here based on that, too.

btw, David Austin's article on PageRank is highly recommended :
<http://www.ams.org/samplings/feature-column/fcarc-pagerank>

------
williamcotton
I made a similar example a few years ago. It even works in IE6!

<http://williamcotton.com/pagerank-explained-with-javascript>

~~~
merusame
Thanks & even great music you make!

------
chmullig
This is awesome if you actually want to understand the linear algebra behind
it.

A useful paper is also Bryan and Leise's The $25B Eigenvector, The Linear
Algebra Behind Google.

~~~
sgloutnikov
Love me some power method :) I was amazed when studying it in numerical
analysis, and the professor showed us how this works in the real world
(Google).

------
dewey
Anyone else just getting a 404 "We're sorry, but you do not have access to
this page." error?

~~~
RobAley
Yup. If you google the URL you'll get the cached version though.

------
deadairspace
What does G signify? I'm a bit confused as to why taking its eigenvector gives
you the pagerank vector.

~~~
jamessb
In essence, pagerank sorts pages by the probability that you'd arrive at them
by randomly following links form other pages.

G is a matrix of transition probabilities - each entry is the probability of
moving directly from a particular page to another particular page. It is
composed of 3 terms:

\- the hyperlink matrix: the probabilities obtained by assuming a user
randomly selects a link to follow from their current page, with an equal
probability for each

\- the dangling nodes matrix: to ensure that it is possible to leave every
page, this adds an equal probability for moving from a page with no outgoing
links to every other page

\- the matrix U of all ones: this provides G with 2 desirable properties, by
ensuring it is both irreducible/strongly-connected and aperiodic. It does this
by making it possible to move from any page to any other page, with some small
probability (exactly how small is determined by the damping factor d).

The pagerank vector represents the equilibrium probability distribution - in
other words, the probability of being on a particular page after starting on a
random page then spending a long time randomly moving between pages according
to the probabilities in G.

Now, if at a given time your probability of being on each page is given by
some vector X, the probability of being on each page after one random move is
the transition matrix multiplied by X.

The equilibrium probability vector thus has the property that it is unchanged
by multiplication by the transition matrix - it is therefore an eigenvector of
the transition matrix, with an eigenvalue of 1.

Edit: After typing this, I see that explanatory hypertext appears when you
mouse over the blue words.

------
j2kun
Interested in more of the mathematics behind the algorithm? I implement and
prove things about it on my website: [http://jeremykun.com/2011/06/18/googles-
pagerank-a-first-att...](http://jeremykun.com/2011/06/18/googles-pagerank-a-
first-attempt/)

------
bflbfl
For those who may come across this, I have added some extra stuff to let you
watch how the power method converges (or doesn't converge) to the PageRank
vector. Just click the "Explore Convergence" button over there where the
PageRank vector is written out.

------
chriscoyfish
As old as this algorithm may be, it's still relevant/useful in many practical
applications. It's crazy to think that these small examples on here scale to
gigantic data sets, especially as each PageRank gets updated.

------
mikescoffield
This is pretty cool. I'm playing around w/ different backlink structures. It
seems if you link back to a tier 1 link, it will boost your PR. Is this true
in practice?

~~~
bflbfl
Hey Mike - can you give me some details on the examples you're playing with?
If I get around to it, I was going to see about including some canned examples
to maybe show/disprove how things like that work (at least with this version
particular of the algorithm). The inclusion of the dangling nodes matrix and
that extra all-ones one can make it trickier to predict what will happen.

~~~
mikescoffield
I built a basic pyramid and then tinkered w/ variations from that.

~~~
bflbfl
Thanks Mike. I've been playing with small networks as well (partly to confirm
things working correctly) - wrote up a few notes on one little one with
slightly counter-intuitive behavior -
[http://www.nowherenearithaca.com/2013/05/google-pagerank-
eva...](http://www.nowherenearithaca.com/2013/05/google-pagerank-evaluating-
simple.html)

~~~
mikescoffield
That's interesting!

Sadly, it does remind me of how much math/engineering I've forgotten since I
graduated.

Are you still a student?

~~~
bflbfl
Hi Mike

thanks for taking a look at that. Linear algebra/eigenvector/eigenvalue stuff
is useful and usually fairly (re-)approachable.

I am a student of interesting things :)

------
ari_elle
Try deleting all the Nodes with DEL

When deleting the last one it crashes (tested in FF and Chrome)

~~~
bflbfl
thx - think I fixed that the other day.

