

Latent Dirichlet Allocation Surprisingly Well Correlated w/ Google Rankings - randfish
http://www.seomoz.org/blog/lda-and-googles-rankings-well-correlated

======
nkurz
This is a good layman's introduction to modern search techniques, but to
someone not in the SEO field it feels like a very strange inversion of
priorities. To me, like most people, the surprise is how effective techniques
like LDA[1] can be in characterizing a document, but the 'surprise' in the
article is that LDA correlates to Google search order better than a more
simplistic model.

To a technologically savvy but naive outsider, this might seem obvious:
shouldn't pages that rank highly in Google have strong topic-based correlation
to pages that the user wants to see? But from the SEO perspective, I guess the
conclusion would be that your page is more likely to be ranked highly if it
includes all the trappings of other high ranked pages, with, you know, like
synonyms and stuff. At a certain point, one has to start thinking, wouldn't it
be simpler to make a page that people actually want to find?

Are there good examples of actually useful pages that Google doesn't do a good
job of ranking? I occasionally find myself lately getting frustrated with
Google about ignoring my rarer search terms, but generally I find the good
pages are at the top if they exist at all.

[1] LDA is Latent Dirichlet Allocation, which is very similar to Latent
Semantic Analysis, which in turn is very similar to Principal Component
Analysis and Singular Value Decomposition. So it's possible you've already
heard of the concept, but coming from another angle in another field.
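For anyone who wants to see the family resemblance concretely, here's a rough numpy sketch of the LSA side of that comparison: a term-document count matrix reduced via truncated SVD. The counts and the choice of k=2 are made up purely for illustration.

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
# The counts are invented for illustration only.
X = np.array([
    [3, 0, 1, 0],   # "engine"
    [2, 0, 2, 0],   # "search"
    [0, 4, 0, 2],   # "dirichlet"
    [0, 3, 0, 3],   # "topic"
], dtype=float)

# LSA: keep only the top-k singular directions of the matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_coords = (np.diag(s[:k]) @ Vt[:k]).T  # each document in k latent dims

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents 0 and 2 share "engine"/"search" vocabulary, documents 1 and 3
# share "dirichlet"/"topic", so they land close together in latent space
# even though no two documents use identical words.
print(cosine(doc_coords[0], doc_coords[2]))  # close to 1
print(cosine(doc_coords[0], doc_coords[1]))  # close to 0
```

The same mechanics (find dominant directions of variation, throw the rest away) underlie PCA, which is why the concepts transfer across fields.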

~~~
nl
I thought it was kind of odd they did LDA rather than something more broadly
used (e.g. LSA).

But I've never really looked at LDA, and Wikipedia says: _Compared to standard
latent semantic analysis which stems from linear algebra and downsizes the
occurrence tables (usually via a singular value decomposition), probabilistic
latent semantic analysis is based on a mixture decomposition derived from a
latent class model. This results in a more principled approach which has a
solid foundation in statistics._ so maybe they made the right choice. (Not
that I see what "solid foundation in statistics" really means in this context)

~~~
cdavid
LSA is relatively similar in some abstract sense to the published pagerank
algorithm. LDA is more powerful, in the sense that it can account for more
complex relationships (but may be less accurate with large amounts of data - I
really have no idea how these would scale and compare at Google-like sizes).

~~~
nl
Can you explain this some more?

My understanding of LDA is that it gives you document scores against queries
based on the topics extracted using the LDA algorithm on the text in the page.

Pagerank, on the other hand, scores based on external pointers (i.e.,
references) but doesn't have anything to do with the text on the page.

~~~
noelwelsh
It's the abstract sense that is important. Pagerank is a dimensionality
reduction technique. It finds the first eigenvector of the transition matrix.
Eigenvectors = PCA. LSI is basically PCA, but applied to the document-term
matrix. LDA is a dimensionality reduction technique that makes use of more
information.
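As a rough sketch of what "finds the first eigenvector of the transition matrix" means in practice, here's power iteration on a made-up four-page link graph. The graph and the 0.85 damping factor are illustrative, not anything Google publishes.

```python
import numpy as np

# Toy web graph: page -> pages it links to (invented for illustration).
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n = 4
damping = 0.85

# Column-stochastic transition matrix: M[j, i] = 1/outdegree(i) if i links to j.
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

# Power iteration: repeatedly apply the damped transition matrix.
# The vector converges to the dominant eigenvector, i.e. the PageRank scores.
rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - damping) / n + damping * (M @ rank)

print(rank)  # page 2, which collects the most in-links, ranks highest
```

The "eigenvectors = PCA" shorthand above is exactly this: PageRank keeps one dominant direction of a matrix built from links, while LSI keeps several dominant directions of a matrix built from words.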

~~~
nl
Oh, I see.

I thought you were talking about some functional similarities, not the
mathematical similarities.

------
moultano
All good ranking functions are pretty correlated. There are many ways for a
ranking to be bad, and few ways for it to be good.

------
nl
This is news? Seriously????

They have found a correlation between a set of words related to a topic you
are searching for and how highly a search engine ranks that page?

Well duh! Did anyone really think search engines did a keyword search and then
applied Pagerank/HITS (<http://en.wikipedia.org/wiki/HITS_algorithm>) or
whatever? That would give dreadful results.

If you really want to understand this, I recommend _Building a Vector Space
Search Engine in Perl_
(<http://perl.about.com/b/2007/05/24/building-a-vector-space-search-engine-in-perl.htm>)

I built the vector space classifier in <http://classifier4j.sf.net> based
almost entirely on that article, even though I don't know Perl. It's very
readable, and gives you a great understanding.
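For the curious, the core of a vector space engine like that fits in a few lines: term-frequency vectors compared by cosine similarity. This is a toy sketch with made-up documents, not classifier4j's actual code.

```python
import math
from collections import Counter

def vectorize(text):
    # Term-frequency vector as a sparse word -> count map.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = [
    "latent dirichlet allocation is a topic model",
    "google ranks pages using links and content",
    "perl makes text processing easy",
]
query = vectorize("topic model allocation")

# Rank documents by similarity to the query vector.
results = sorted(docs, key=lambda d: cosine(query, vectorize(d)), reverse=True)
print(results[0])  # the topic-model document matches best
```

A real engine would add TF-IDF weighting and an inverted index, but the ranking idea is just this dot product.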

~~~
will_critchlow
The news isn't that there is a correlation but that there is such a strong
correlation. There are a bunch of specific techniques Google could be using
and it looks likely that this is close to what they actually use.

They also use a lot of other ranking factors beyond just the words on the page
so seeing such a high correlation from a "bag of words" model is pretty
interesting (to me at least).

~~~
gjm11
The correlation really isn't all that high.

If I've read the graph right, it's about 0.33. For a Pearson (product-moment)
correlation coefficient, that would mean that about 10% of the variance in
Google rankings is explained by a linear regression on LDA scores. They've
actually used the Spearman (ranking-based) correlation coefficient, which is
equivalent to ranking all the values of each variable from 1..N and then
computing the Pearson correlation coefficient for the ranks. So, kinda-sorta
with lots of handwaving, that means that about 10% of the ordering of the
Google rankings is explained by the LDA scores.
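(For anyone checking the arithmetic: 0.33^2 ≈ 0.109, hence the ~10%.) The rank-then-Pearson equivalence is easy to verify directly; here's a toy sketch with made-up scores, ignoring tied values for simplicity.

```python
# Spearman correlation computed exactly as described: rank each variable
# 1..N, then take the Pearson correlation of the ranks.
# (Ties would normally get averaged ranks; ignored here for simplicity.)
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

# Hypothetical LDA scores vs. search result positions (position 1 = best,
# so a good score correlates negatively with the numeric position).
lda_scores = [0.9, 0.4, 0.7, 0.2, 0.5]
positions = [1, 4, 2, 5, 3]
print(spearman(lda_scores, positions))  # -1.0: a perfect monotonic relation
```

Because only the ordering matters, Spearman can hit 1.0 for any monotonic relationship, which is why a 0.33 still leaves most of the ranking unexplained.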

Clearly that's a lot better than for the other scoring methods they mentioned,
and that probably indicates that Google are doing something a bit like LDA
(but this will be true for any approach that takes note of synonyms, and it's
hardly news that Google do that). But it doesn't, e.g., suggest that PageRank
and other things based on link structure aren't extremely important to
Google's rankings.

~~~
noelwelsh
This post should be upvoted a zillion times. The correlations they report are
really quite low and as such their claims are really quite bogus.

~~~
will_critchlow
I don't know if you saw my comment (looks like we posted at similar times). I'd
be interested in your comments relative to that: given that there are a lot of
ranking factors, this is quite a high correlation for a single one... (I think).

------
mark_l_watson
I sometimes use LDA (using Hadoop and Mahout) and it is not an inexpensive
calculation for large document sets. I wonder what the costs are of using this
at large scale.

------
madridorama
I'm sorry, but this is overthinking something that is relatively simple to
understand.

