What is a Good Recommendation Algorithm? (acm.org)
70 points by Anon84 on Mar 29, 2009 | 15 comments

Greg nails something that seems to be passing the academic world of recommendations by: you can't measure recommendation quality with RMSE. It's just not a good metric. User happiness is the goal, not the ability to predict ratings of unrated items. I'm glad to have someone with a little more clout than me saying this.

Some ask, "What's the difference?" If I tell you about 5 albums that you've already heard of, are the recommendations good? Even if we're pretty certain you'll like them? If you're buying an obscure jazz album and you get "Kind of Blue" as a recommendation (probably the most popular jazz album in history, and one any jazz fan would know of), is that a good recommendation?

How do users build trust of recommendations? How does that factor into the algorithms? It turns out you need a mix of obvious and surprising results. All obvious and they don't discover anything; all surprising and they don't trust them.

Those are the right questions. A good algorithm for recommendations is one that people interact with and discover things with.

This is an awesome read (in fact, I uhm, submitted it a few minutes before at Greg's blog, but it's good enough that I upvoted it here too). As soon as I ran across it I immediately blogged, tweeted, and submitted here. I'd had a draft of an essay along these lines kicking around for ages.

I think they use RMSE because it's easy, not because it's ideal. BellKor, the Netflix challenge team that won the progress prize, discussed this in the paper describing their method: they checked whether minute differences in RMSE improved the quality of top-10 results, and they did, pretty significantly.

Just fished it out -- paper is here for the curious:


It's one, amusingly, that I'd skipped because it seemed to be less technical. :-) Good stuff.

This hasn't been passing us by! Netflix were the ones who decided to make RMSE the criterion for their contest, and put up a million dollars to ride on it for good measure, so it's hardly a surprise all the papers are focused on it. Of course, RMSE doesn't measure user satisfaction; that's why we write papers describing the techniques that seem to work, and it's up to Netflix (and other recommendation service providers) to pick which of those they want to use given that they're maximizing something slightly different.

It's true that, not being in academia, I don't hear the conversations that fill the gaps between publications. But if one's simply going from the published output on collaborative filtering at the moment, there has been some convergence on RMSE as a benchmark. That's understandable, since it's easily measurable, and as you say, there are some folks throwing $1mm at it (which really isn't much considering what it'd do for their sales).

Still, wouldn't predicting how much somebody likes something form a good basis for running a recommendation engine on top of it? Maybe it is a waste of effort for many scenarios, but if you can do it well, you can still add all sorts of algorithms to pick the best recommendations from the predictions?

Well, that's the question underlying the article. Consider the hypothetical case of a movie that is very controversial: all 1's or 5's. Even if your system can tell that a user is quite likely to fall in the '5' camp, the only safe prediction for a high-variance movie is something close to the middle. Even if you are pretty sure the user would give this movie a 5, the squared error for the small chance of a 1 is enormous.

But a rating close to the middle is never going to be chosen as a recommendation if the algorithm involves recommending the movies with the highest predicted scores. Instead, an RMSE-based system is always going to prefer safe 4's over risky 5's. This doesn't mean that improved predictions can't yield improved recommendations, but I don't see truly great ones ever coming from a prediction-based approach.
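To put numbers on the "safe 4's over risky 5's" point, here is a minimal sketch in Python. The probabilities are made up for illustration (a user who lands in the '5' camp 70% of the time), and the function names are hypothetical. It relies on the standard fact that the mean of a distribution is the prediction that minimizes expected squared error.

```python
def expected_squared_error(prediction, outcomes):
    """Expected squared error of a prediction over (rating, probability) pairs."""
    return sum(p * (prediction - rating) ** 2 for rating, p in outcomes)


def mse_optimal_prediction(outcomes):
    """The mean of the distribution minimizes expected squared error."""
    return sum(p * rating for rating, p in outcomes)


# Hypothetical numbers: this user lands in the '5' camp 70% of the time.
polarizing = [(5, 0.7), (1, 0.3)]
# A hypothetical "safe" movie this user would always rate 4.
safe = [(4, 1.0)]

risky_guess = mse_optimal_prediction(polarizing)  # about 3.8
safe_guess = mse_optimal_prediction(safe)         # 4.0

# Predicting the most likely rating (5) costs more than hedging toward the middle.
cost_of_five = expected_squared_error(5, polarizing)             # about 4.8
cost_of_hedge = expected_squared_error(risky_guess, polarizing)  # about 3.36

# Ranking by predicted score puts the safe 4 ahead of the likely 5.
ranked = sorted([("polarizing", risky_guess), ("safe", safe_guess)],
                key=lambda pair: pair[1], reverse=True)
```

Under these assumed numbers the movie the user would most likely rate a 5 gets a predicted score below the safe 4, so a "highest predicted score first" list shows the safe movie first.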

Personally, I want a recommendation system that maximizes the percentage of recommendations that I would rate as 5's, and don't much care if the misses are 1's, 2's, or 3's.
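A rough sketch of what that metric could look like, in Python. The function name, the cutoff `n`, and the choice to skip recommended items the user never rated are all assumptions for illustration, not something from the article.

```python
def five_star_hit_rate(recommended, actual_ratings, n=10):
    """Fraction of the top-n recommended items that the user rated 5.

    recommended: item ids, best first.
    actual_ratings: {item_id: rating} for items the user has rated.
    Recommended items with no known rating are skipped (an assumption).
    """
    rated = [actual_ratings[item] for item in recommended[:n]
             if item in actual_ratings]
    if not rated:
        return 0.0
    # A miss counts the same whether it was a 1, a 2, or a 3.
    return sum(1 for rating in rated if rating == 5) / len(rated)


# Made-up data: two hits, one miss at 1 star, one miss at 3 stars.
example_ratings = {"a": 5, "b": 1, "c": 5, "d": 3}
example_rate = five_star_hit_rate(["a", "b", "c", "d"], example_ratings)
```

Unlike RMSE, this score doesn't change if the 1-star miss becomes a 3-star miss; only the share of 5's matters.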

And beyond that it's somewhat domain specific as to what the tolerance for misses is. In something like recommending online (non-paid) content, it doesn't matter much. It's worth more to take a gamble on something a user will really like than to give them something you're sure they won't hate. If you get two great hits and three bad hits, it's probably still a net win for the user. On the other hand, if you're, say, doing online dating recommendations, you probably want to avoid the polarized cases since you could lose a paid customer with one horrible recommendation.

I'd argue that "user happiness" isn't the goal for Netflix, long-term revenue is. That's relatively easy to measure, and certainly easier than something nebulous like "user happiness." You can even test different recommendation algorithms and see which maximizes long-term revenue.

Presumably Netflix knows that the recommendation algorithm has a significant impact on their bottom line, which is why they launched the Netflix Prize to outsource new algorithm development.

Now, Netflix can't give revenue data to third parties, and they also don't want to let third-party recommendation algorithms run on their system because an "average" algorithm will hurt their bottom line.

The question then becomes: which well-understood metric correlates best with long-term revenue?

Perhaps the answer is RMSE, which is why Netflix chose it. That doesn't seem totally implausible to me.

You'd expect that. In the recommendations world that's called "business rules" and includes things from skewing results based on margins to not showing inappropriate recommendations (say, women's clothing to men).

However, I'm pretty sure that Amazon's recommendations don't do that, or don't do it much, anyway. Their "similar product" recommendations seem to be based on a very simple (and often mediocre quality) pure counting correlation between two items' purchases. It's much harder to guess which algorithms are at work for personalized recommendations.
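For what "pure counting" could mean, here is a minimal Python sketch of co-purchase counting. It is a guess at the general technique, not Amazon's actual method; the function names and the basket data are invented for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_purchase_counts(baskets):
    """Count how often each pair of items appears in the same basket."""
    counts = defaultdict(Counter)
    for basket in baskets:
        # set() so a duplicate item in one basket is only counted once.
        for a, b in combinations(sorted(set(basket)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def similar_items(counts, item, n=3):
    """Items most often bought together with `item`, most frequent first."""
    return [other for other, _ in counts[item].most_common(n)]


# Made-up purchase history.
baskets = [
    ["kind_of_blue", "a_love_supreme"],
    ["kind_of_blue", "a_love_supreme", "mingus_ah_um"],
    ["kind_of_blue", "thriller"],
]
counts = co_purchase_counts(baskets)
top_for_coltrane = similar_items(counts, "a_love_supreme", n=1)
```

With raw counts like these, whatever is most popular overall tends to show up as "similar" to everything, which would be one reason the results feel obvious.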

At the end of the day, profit margins aside, there's a lot that goes into optimizing recommendations that can't be easily measured. How do you measure customer loyalty based on good recommendations? There have been a number of market research studies that indicate that recommendations do drive customer loyalty, but it's hard to say where the sweet spot is between skewing things toward higher margins vs. skewing things towards customer utility. About 80% of Amazon's visitors aren't there to buy stuff -- and that's great for them! They've become an information portal / window shopping location that happens to also sell stuff. Which is a great position to be in when somebody does think of buying stuff.

That Netflix uses RMSE for their contest doesn't bother me. What I think Greg is reacting to (and it's certainly my sentiment; again, this is really similar to something I'd been writing) is a blurring between stimulus and response. There's an assumption, if not in this subfield then certainly among those casually tracking recommendations advances, that RMSE is a good way of measuring a recommendations algorithm, not just "the metric Netflix is using", when in fact it's really a much more inexact science.

A simple item-based algorithm which has been reported to work quite well is Slope One. The advantages are that it is easy to implement, can be updated on the fly, and it works well (enough) for sparse data sets.


There are also examples in Python, Java, and PHP/SQL.
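For anyone who wants to see how small Slope One is, here is a minimal Python sketch of the weighted variant. It is not taken from the linked examples; the function names and the tiny ratings table are assumptions for illustration.

```python
from collections import defaultdict


def train_slope_one(ratings):
    """ratings: {user: {item: rating}}.

    Returns (diffs, counts): diffs[i][j] is the average of
    (rating of i - rating of j) over users who rated both items,
    and counts[i][j] is how many such users there were.
    """
    diffs = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for user_ratings in ratings.values():
        for i, r_i in user_ratings.items():
            for j, r_j in user_ratings.items():
                if i != j:
                    diffs[i][j] += r_i - r_j
                    counts[i][j] += 1
    for i in diffs:
        for j in diffs[i]:
            diffs[i][j] /= counts[i][j]
    return diffs, counts


def predict(user_ratings, item, diffs, counts):
    """Weighted Slope One prediction, or None if no item pairs overlap."""
    numerator = 0.0
    denominator = 0
    for j, r_j in user_ratings.items():
        c = counts.get(item, {}).get(j, 0)
        if j != item and c:
            # Shift the user's rating of j by the average difference,
            # weighted by how many users support that difference.
            numerator += (diffs[item][j] + r_j) * c
            denominator += c
    return numerator / denominator if denominator else None


# Made-up ratings for three users and three items.
ratings = {
    "john": {"A": 5, "B": 3, "C": 2},
    "mark": {"A": 3, "B": 4},
    "lucy": {"B": 2, "C": 5},
}
diffs, counts = train_slope_one(ratings)
lucy_on_a = predict(ratings["lucy"], "A", diffs, counts)
```

Here item A averages 0.5 above B (two users) and 3 above C (one user), so lucy's predicted rating for A is ((0.5 + 2) * 2 + (3 + 5) * 1) / 3, a bit over 4.3. Updating on the fly amounts to adding a new rating's differences into the running sums and counts.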

A friend of mine made a Rails plugin called acts_as_recommendable (a plugin for collaborative filtering): http://github.com/maccman/acts_as_recommendable/tree/master

1) Getting more preference-defining data from the user trumps algorithm improvements at this point. Netflix would have improved RMSE even more by turning over additional data like queue-adds, page views, user age/gender, etc.
2) Use caution criticizing RMSE as overly blunt. It may seem so, but it is not obvious that an algorithm can be improved for top-N prediction simply because you declare that as the focus.

Netflix needed a formal measure for their contest, so RMSE is a useful one while "making people happy" is not. A business that relies on recommendations can plug in a new algorithm with better RMSE and get improved results immediately; it is an important part of the puzzle.

"Making people happy" is hard to define, but you can pick better concrete metrics than RMSE, and this article offers suggestions on how. An important part of solving any problem is defining success correctly.
