Some ask, "What's the difference?" If I tell you about 5 albums that you've already heard of, are the recommendations good? Even if we're pretty certain you'll like them? If you're buying an obscure jazz album and you get "Kind of Blue" as a recommendation (probably the most popular jazz album in history, and one any jazz fan would know), is that a good recommendation?
How do users build trust in recommendations? How does that factor into the algorithms? It turns out you need a mix of obvious and surprising results: all obvious and users don't discover anything; all surprising and they don't trust the system.
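Just to make that concrete, here's a toy way to interleave "obvious" and "surprising" picks (purely illustrative; the titles, ratings, and popularity numbers are all made up, and no real system works exactly like this):

    # Hypothetical candidates: (title, predicted_rating, popularity).
    candidates = [
        ("Kind of Blue", 4.8, 1000000),
        ("A Love Supreme", 4.6, 400000),
        ("Money Jungle", 4.3, 20000),
        ("Unit Structures", 4.1, 2000),
    ]

    by_score = sorted(candidates, key=lambda c: -c[1])   # obvious picks
    by_novelty = sorted(candidates, key=lambda c: c[2])  # surprising picks

    def mixed(k=4):
        # Alternate an obvious pick with a surprising one, so the list
        # has enough familiar items to be trusted and enough obscure
        # ones to enable discovery.
        out = []
        for obvious, surprising in zip(by_score, by_novelty):
            for c in (obvious, surprising):
                if c not in out and len(out) < k:
                    out.append(c)
        return out

    print([title for title, _, _ in mixed()])
    # ['Kind of Blue', 'Unit Structures', 'A Love Supreme', 'Money Jungle']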
Those are the right questions. A good recommendation algorithm is one that people actually interact with and use to discover things.
This is an awesome read (in fact, I, uhm, submitted it a few minutes ago from Greg's blog, but it's good enough that I upvoted it here too). As soon as I ran across it I blogged, tweeted, and submitted it here. I'd had a draft of an essay along these lines kicking around for ages.
It's one, amusingly, that I'd skipped because it seemed to be less technical. :-) Good stuff.
But a rating close to the middle is never going to be chosen as a recommendation if the algorithm just recommends the movies with the highest predicted scores. Instead, an RMSE-based system is always going to prefer safe 4's over risky 5's. This doesn't mean that improved predictions can't yield improved recommendations, but I don't see truly great ones ever coming from a prediction-based approach.
Personally, I want a recommendation system that maximizes the percentage of recommendations that I would rate as 5's, and don't much care if the misses are 1's, 2's, or 3's.
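To put numbers on the safe-4-vs-risky-5 point, here's a toy example (the rating distributions are invented for illustration):

    # Two hypothetical movies with rating distributions over 1..5 stars.
    # "safe" is almost always a 4; "risky" splits between 5's, 3's, and 1's.
    safe = {3: 0.05, 4: 0.90, 5: 0.05}
    risky = {1: 0.20, 3: 0.35, 5: 0.45}

    def expected_rating(dist):
        return sum(r * p for r, p in dist.items())

    def p_five(dist):
        return dist.get(5, 0.0)

    print(expected_rating(safe), p_five(safe))    # 4.0, 0.05
    print(expected_rating(risky), p_five(risky))  # 3.5, 0.45
    # A ranker that sorts by expected rating (the thing RMSE rewards)
    # recommends "safe"; one that sorts by P(rating == 5) recommends
    # "risky", which is what I'm asking for above.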
Presumably Netflix knows that the recommendation algorithm has a significant impact on their bottom line, which is why they launched the Netflix Prize to outsource new algorithm development.
Now, Netflix can't give revenue data to third parties, and they also don't want to let third-party recommendation algorithms run on their system because an "average" algorithm will hurt their bottom line.
The question then becomes: which well-understood metric correlates best with long-term revenue?
Perhaps the answer is RMSE, which is why Netflix chose it. That doesn't seem totally implausible to me.
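For anyone not following the contest, RMSE is just the square root of the mean squared error between predicted and actual ratings:

    import math

    def rmse(predicted, actual):
        # Root-mean-square error over paired predictions and true ratings.
        return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                         / len(actual))

    print(rmse([4.2, 3.1, 5.0], [4, 3, 4]))  # ~0.59

Note that every error counts symmetrically: being a star off on a movie you'd hate costs exactly as much as being a star off on a movie you'd love, which is the mismatch the prediction-vs-recommendation complaints above are getting at.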
However, I'm pretty sure that Amazon's recommendations don't do that, or don't do it much, anyway. Their "similar product" recommendations seem to be based on a very simple (and often mediocre-quality) pure counting correlation between two items' purchases. It's much harder to guess which algorithms are at work for personalized recommendations.
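If I had to guess at what that pure-counting approach looks like, it's something in the spirit of this sketch (entirely a guess on my part; the basket data is made up and nothing here comes from Amazon):

    from collections import defaultdict
    from itertools import combinations

    # Hypothetical purchase histories: each set is one customer's items.
    baskets = [
        {"kind_of_blue", "a_love_supreme"},
        {"kind_of_blue", "a_love_supreme", "giant_steps"},
        {"kind_of_blue", "giant_steps"},
    ]

    co_counts = defaultdict(int)
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            co_counts[(a, b)] += 1

    def also_bought(item, k=5):
        # "Customers who bought X also bought Y", ranked by raw
        # co-purchase count.
        scores = [(b if a == item else a, n)
                  for (a, b), n in co_counts.items() if item in (a, b)]
        return sorted(scores, key=lambda t: -t[1])[:k]

    print(also_bought("kind_of_blue"))
    # [('a_love_supreme', 2), ('giant_steps', 2)]

Raw counts like this systematically favor whatever's most popular, which is exactly how everyone ends up getting "Kind of Blue" recommended; normalizing by each item's overall sales is the obvious (and fiddlier) fix.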
At the end of the day, profit margins aside, there's a lot that goes into optimizing recommendations that can't be easily measured. How do you measure customer loyalty based on good recommendations? There have been a number of market research studies indicating that recommendations do drive customer loyalty, but it's hard to say where the sweet spot is between skewing things toward higher margins and skewing them toward customer utility. About 80% of Amazon's visitors aren't there to buy stuff -- and that's great for them! They've become an information portal / window-shopping destination that happens to also sell stuff. Which is a great position to be in when somebody does think of buying stuff.
That Netflix uses RMSE for their contest doesn't bother me. What I think Greg is reacting to (and it's certainly my sentiment; again, this is really similar to something I'd been writing) is that there's a blurring between stimulus and response here: the assumption, if not within this subfield then certainly among those casually tracking advances in recommendations, that RMSE is a good way of measuring a recommendation algorithm, not just "the metric Netflix is using," when in fact it's a much more inexact science.
There are also examples using Python, Java, and PHP/SQL.