Ask HN: Comments and help on improving our recommendation engine for news
4 points by haidut on Nov 24, 2009 | 6 comments
Hi all,

A friend of mine and I built this news ranking service of sorts, based on a ranking algorithm I came up with while I was in grad school for CS. Recently I also developed a recommendation engine based on a modified Support Vector Machine (SVM) algorithm. Why modified? Well, as most of you know, SVMs are really binary classifiers and need two classes of training data - i.e. good/bad, interesting/boring, etc. So in the case of news, you have to ask the user to select both interesting and uninteresting articles. There have been a number of studies on how asking the user to do both is sometimes too much of a burden. Ideally, the user should be tracked implicitly, and the model should account for the uncertainty that arises from the lack of explicit ratings.

So here is how it works. I created an account and seeded the system with some articles that seem to have been popular on HN, and the results are available. Just go to http://www.euraeka.com and click login on the top right. The user name is hackers (at) are (dot) us, and you can email me at haidut (at) gmail (dot) com for the password. Since the system has already generated recommendations, you will see them right when you log in.

If a user likes an article, they just click on it and the system records that. There is also a "remove" button next to each article, in case the user knows up front that they don't like the article and want it removed from view and recorded as uninteresting. The remove option also covers the case where you clicked on an article you thought you would like but it turned out not to be interesting, so you can reverse your initial decision and tell the system the article was not good. Like I said above, "removing" an article is optional. The system can work if the user only clicks on articles that he/she likes and ignores the rest.

Note: Obviously, I couldn't seed the system with every article that has been on HN. I just picked some that I liked from topics such as science, entrepreneurship, health, etc. Feel free to seed the system even more. How? Once you are logged in, either use the search box at the top to search for keywords of interest, or use the topics section at the right of the screen to find more specific news. If you find anything you like, just click on the article.

Note: If you create your own account, you will be able to test your own set of recommendations. Recommendations are generated every 24 hours, using ALL the news collected in the last 24 hours.

If any of you have used the Google News or Digg recommendation engines, I'd be very interested to hear how this stacks up against those two services, given that they are both based on user-to-user recommendation techniques (i.e. articles are recommended to you based on your similarity to other users) rather than content-based recommendations (which is what Euraeka is).
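
To make the positive-only idea concrete, here is a rough sketch (not the actual Euraeka code; it assumes scikit-learn and made-up article titles) of content-based recommendation from clicks alone, using a one-class SVM so that no "uninteresting" examples are required:

    # Illustrative sketch only: content-based recommendations from positive-only
    # clicks, using a one-class SVM over TF-IDF features of the article text.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import OneClassSVM

    # Hypothetical data: articles the user clicked, and today's crawled candidates.
    clicked_articles = [
        "New results on protein folding in yeast",
        "How we bootstrapped our startup to profitability",
    ]
    candidate_articles = [
        "A study of sleep and metabolic health",
        "Celebrity gossip roundup for the week",
    ]

    # Represent every article by its TF-IDF vector (content-based, no other users needed).
    vectorizer = TfidfVectorizer(stop_words="english")
    vectorizer.fit(clicked_articles + candidate_articles)
    X_clicked = vectorizer.transform(clicked_articles)
    X_candidates = vectorizer.transform(candidate_articles)

    # Train on positives only; the model learns a boundary around "interesting".
    model = OneClassSVM(kernel="linear", nu=0.5)
    model.fit(X_clicked)

    # Higher decision_function scores mean closer to the user's interests.
    scores = model.decision_function(X_candidates)
    for score, title in sorted(zip(scores, candidate_articles), reverse=True):
        print(f"{score:+.3f}  {title}")

When the "remove" signal is available it supplies explicit negatives, so a standard two-class SVM becomes possible; the sketch above only covers the clicks-only case.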

Thanks in advance.



Just a point on form. Why did you not blog this (or post an article on your web site) and then post a link to same on HN?

You would then have been able to format your text in a manner more suited to the content - and also made it more straightforward to edit your "content" in part in response to feedback from HN and elsewhere.

Good luck with the project though.


Good point, will blog next time. Thx


First, I think that just based on your description you are indeed asking a lot of the user. For our service, we simply ask the user to vote a piece of content up or down, and our dataset is still pretty sparse. Users like to consume; they don't have any interest in helping you tune your algorithm.

With that said, now let's address the algorithm. You're using SVM? Is that quick enough? Are your recs computed offline? To be honest, you don't really need SVM or any complex technique to do recommendations well. You can get by just fine doing hand-selected recs, even though that might not scale well. To scale, you can use something like Pearson correlation or naive Bayes and you'll do just fine.

With recs it's all about the quality of data and not so much the fancy algorithm.
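
For illustration, the naive Bayes route mentioned above could look roughly like this (toy data, assuming scikit-learn, with clicked articles as positives and removed ones as negatives; not anyone's production system):

    # Illustrative sketch: a naive Bayes text classifier trained on
    # clicked (liked) vs. removed (disliked) articles.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    liked = ["why functional programming matters",
             "scaling a web startup on a budget"]
    removed = ["top ten celebrity diets",
               "weekly horoscope for entrepreneurs"]

    texts = liked + removed
    labels = [1] * len(liked) + [0] * len(removed)

    # Bag-of-words features over the article titles/text.
    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(texts)

    clf = MultinomialNB()
    clf.fit(X, labels)

    # Rank unseen articles by the predicted probability of "liked".
    new_articles = ["a practical guide to scaling web services"]
    probs = clf.predict_proba(vectorizer.transform(new_articles))[:, 1]
    print(list(zip(new_articles, probs)))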


We have a number of options, with SVM being the fastest so far b/c the library is written in C. Pearson, naive Bayes, and a number of others have all been tried and "tested" for acceptance with several thousand users, and the majority picked the results from SVM as the most accurate. The recs are computed offline, but there is an option to do that on the fly given enough RAM to load the SVM models for all users.

As far as asking users for too much - what exactly do you mean? All we ask is that the user log in and, if they find articles they like (on the front page or from the topics menu), click on them. What could be simpler than that? The up/down voting system you talk about is exactly the same, except that you explicitly ask the user to vote, while we do the same implicitly. TechnologyReview ran an article last year about how asking the user to explicitly rate things is considered overburdening, and how the system should quietly monitor the user without asking direct questions. Anyway, your up/down approach and our click/remove approach are the same in my opinion. But thanks for the comments.


Sorry, when I wrote the comment I was on my iPhone and didn't check the website. I just did. I see what you mean. Why don't you just aggregate total click data, see which articles get clicked the most, bin the results into categories (5,4,3,2,1) by topic, then compute Pearson? SVM still sounds a bit heavyweight. Sure, I guess it'll handle a few thousand... how about 1-10 million? And how do you know SVM is the most accurate? What are your criteria?
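
A rough sketch of that suggestion, with made-up click counts and assuming numpy: bin each user's raw click counts per topic onto a 1-5 scale, then compute Pearson correlation between users.

    # Illustrative sketch: bin click counts per topic into 1-5 "ratings",
    # then measure user-to-user similarity with Pearson correlation.
    import numpy as np

    # Hypothetical click counts per user, per topic
    # (science, startups, health, sports).
    clicks = {
        "alice": [40, 25, 10, 0],
        "bob":   [35, 30,  5, 2],
        "carol": [ 1,  2, 30, 45],
    }

    def bin_to_ratings(counts, bins=5):
        """Map raw counts onto a 1..bins scale within the user's own history."""
        counts = np.asarray(counts, dtype=float)
        if counts.max() == counts.min():
            return np.full_like(counts, 3.0)
        scaled = (counts - counts.min()) / (counts.max() - counts.min())
        return np.floor(scaled * (bins - 1)) + 1

    ratings = {user: bin_to_ratings(c) for user, c in clicks.items()}

    def pearson(u, v):
        """Pearson correlation between two users' binned topic ratings."""
        return np.corrcoef(ratings[u], ratings[v])[0, 1]

    print("alice vs bob:  ", round(pearson("alice", "bob"), 3))
    print("alice vs carol:", round(pearson("alice", "carol"), 3))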


Paragraph breaks, please.




