

Machine Learning Applied to Google's Rankings - randfish
http://www.seomoz.org/blog/googles-algorithm-pretty-charts-math-stuff

======
ramanujan
1) Rand Moz and other SEO people should do an extremely thorough study of
23andMe's site. Their product may be of questionable value, but if there is
one site which has the skeleton key to SEO, it is an ecommerce site run by
Google's Wife. Any kind of convention or trick that they use is likely to be
preferred by Google.

2) I've messed around with this problem myself a bit. In general, predicting
rank as a function of page properties is equivalent to replicating Google's
own search ranking (i.e. if your predicted rank \hat{Y} = the true rank Y for
input features X then you can basically rank pages as google does from signals
on web pages, though of course you'll be doing it in batch without all the
semi-realtime crawling that goog now does).

That said, you can pretty easily get something decent that will (a) give you
an overall estimate of rank and (b) at least tell you quantitatively whether a
given feature impacts rankings. This can settle a lot of debates among SEO
people.

3) Specific proposal: calculate a non-parametric measure of correlation
between empirical page rank and each of the features mentioned in this post
(<http://www.seomoz.org/article/search-ranking-factors> ) on a sample of say
100k keywords. Examination of individual scatterplots will also be
informative.

Now you can do a more abstract analysis. Construct a table where rows
correspond to features and there are two columns: the empirical non-parametric
correlation with page rank (from the step above) and the estimate in the
SEOMoz post on ranking factors of that feature's importance.

Make a scatterplot here (and calculate just one more non-parametric
correlation) to see how good the experts were at determining how much each
feature contributed to rank.
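
A minimal sketch of the two steps above, assuming Spearman's rho as the non-parametric measure (Kendall's tau would work as well). The feature names, the toy numbers, and the `expert` scores are all made up for illustration; they are not taken from the SEOMoz data.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Plain Pearson correlation; undefined (raises) for constant input."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (vx * vy) ** 0.5

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical toy sample: one entry per search result, position 1 = top.
positions = [1, 2, 3, 4, 5, 6]
features = {
    "inbound_links": [900, 400, 450, 120, 80, 10],
    "title_has_keyword": [1, 1, 0, 1, 0, 0],
    "page_load_ms": [300, 350, 280, 500, 650, 700],
}

# Negate positions so a positive rho means "more of the feature, better rank".
better = [-p for p in positions]
empirical = {name: spearman(vals, better) for name, vals in features.items()}

# Made-up expert importance scores, standing in for the survey estimates.
expert = {"inbound_links": 0.8, "title_has_keyword": 0.6, "page_load_ms": 0.2}

# The "one more" correlation: do the experts order features like the data does?
names = sorted(features)
agreement = spearman([empirical[n] for n in names], [expert[n] for n in names])

for n in names:
    print(f"{n:20s} rho={empirical[n]:+.3f} expert={expert[n]:.1f}")
print(f"expert vs. empirical agreement: {agreement:+.3f}")
```

On a real sample you would compute `empirical` over many keywords and look at the per-feature scatterplots as well, since a single coefficient hides a lot.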

~~~
martian
23andMe is still using meta keywords, so they might be a little out of date.

------
paraschopra
Though the analysis is interesting, it is not "Machine Learning". There is no
test/training data set, no prediction, no model selection. It is just plain
old, but extremely useful correlation analysis.

If someone is interested in a similar kind of correlation (and regression)
analysis for website conversion rate, have a look at the study I did
recently -
[http://www.wingify.com/case-studies/predictive-web-analytics...](http://www.wingify.com/case-studies/predictive-web-analytics-conversion-case-study.php)

~~~
randfish
Actually, that's inaccurate. Please do read through the entire post. Although
it begins with correlation data, the conclusion reached halfway through the
post is that more sophisticated models are required, which is where the
machine learning is put into use - a large number of features about links and
on-page elements as well as derivatives of these features - modeled against
training data (10K SERPs), then shown against a different set of 10K SERPs.

~~~
paraschopra
That is very, very hazy. In my opinion the article just says "a machine
learning model that maps to the search results and produces a result that's
considerably better correlated with rankings than any single metric" without
_any_ other information on what the model is, how it is made, or how valid it
is. Please correct me if I missed something in the article, but where
exactly is that machine learning model described?

------
jgrahamc
If you're going to throw around a term like 'machine learning' then it would
be nice if you were to explain what you were doing. The article says:

 _We (well, technically, Ben) run them through a machine learning model that
maps to the search results and produces a result that's considerably better
correlated with rankings than any single metric._

------
kurtosis
If anyone is interested in what a real "machine learning" approach to the
problem of learning a ranking function from data looks like, see this paper:

Burges et al., "Learning to Rank using Gradient Descent"
<http://research.microsoft.com/apps/pubs/?id=69183>

This is most definitely an active research area, though, and the papers citing
this one should be pretty interesting.
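
For a rough feel of the idea in that paper, here is a minimal sketch of a pairwise logistic ranking loss minimised by gradient descent. It is a simplification, not the paper's implementation: the paper trains a neural network, while this uses a linear scorer, and the documents, feature values, and the `train` helper are hypothetical.

```python
import math

def score(w, x):
    """Linear scoring function s(x) = w . x (a stand-in for the paper's net)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def pairwise_loss(w, pairs):
    """Mean of log(1 + exp(-(s_hi - s_lo))) over (hi, lo) pairs."""
    total = 0.0
    for hi, lo in pairs:
        z = -(score(w, hi) - score(w, lo))
        # Stable softplus: log(1 + e^z) = max(z, 0) + log1p(e^-|z|)
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / len(pairs)

def train(pairs, n_features, lr=0.1, epochs=200):
    """Batch gradient descent on the pairwise loss, starting from w = 0."""
    w = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for hi, lo in pairs:
            diff = score(w, hi) - score(w, lo)
            # d/dw log(1 + exp(-diff)) = -(1 - sigmoid(diff)) * (hi - lo)
            coeff = -(1.0 - sigmoid(diff))
            for k in range(n_features):
                grad[k] += coeff * (hi[k] - lo[k])
        w = [wk - lr * gk / len(pairs) for wk, gk in zip(w, grad)]
    return w

# Hypothetical documents, listed from best to worst observed position.
# Features: [scaled inbound links, keyword in title].
docs = [[0.9, 1.0], [0.5, 1.0], [0.4, 0.0], [0.1, 0.0]]

# Every (higher-ranked, lower-ranked) pair becomes one training example.
pairs = [(docs[i], docs[j])
         for i in range(len(docs)) for j in range(i + 1, len(docs))]

w = train(pairs, n_features=2)
print("weights:", w)
print("loss before:", pairwise_loss([0.0, 0.0], pairs))
print("loss after: ", pairwise_loss(w, pairs))
```

The point of the pairwise setup is that the model never has to predict an absolute position, only which of two results should come first, which is all a ranking needs.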

------
carbocation
If you look at their model vs the correct result, their error appears
logarithmic. This is what I would expect from a linear model that is trying to
approximate a function known to be logarithmic. (The 1-10 PageRank values we
see are logarithms of the actual internal Google values, or so it is said.)

------
meatbag
This submission implies that SEOmoz is at least slightly interested in peer
review. Which would be a very good thing for them.

~~~
randfish
Absolutely! I can't promise we can divulge everything we're doing, but I know
that Ben (who runs these models) would love outside opinions and critiques.
You can reach him via Ben at SEOmoz.org.

