Ask HN: Comments and help on improving our recommendation engine for news
4 points by haidut on Nov 24, 2009 | 6 comments
Hi all,

A friend of mine and I built this news ranking service of sorts, based on a ranking algorithm I came up with while I was in grad school for CS. Recently I also developed a recommendation engine based on a modified Support Vector Machine (SVM) algorithm. Why modified? Well, as most of you know, SVMs are really binary classifiers and need two classes of training data - i.e. good/bad, interesting/boring, etc. So in the case of news, you have to ask the user to select both interesting and uninteresting articles. There have been a number of studies on how asking the user to do both is sometimes too much of a burden. Ideally, the user should be tracked implicitly, and the model should account for the uncertainty that arises from the lack of explicit ratings.

So here is how it works. I created an account and seeded the system with some articles that seem to have been popular on HN, and the results are available. Just go to http://www.euraeka.com and click login on the top right. The user name is hackers (at) are (dot) us, and you can email me at haidut (at) gmail (dot) com for the password. Since the system has already generated recommendations, you will see them right when you log in.

If a user likes an article, they just click on it and the system records that. There is also a "remove" button next to each article, in case the user knows up front that they don't like the article and want it removed from view and recorded as uninteresting. The remove option also covers the case where you clicked on an article you thought you would like but it turned out not to be interesting, so you can reverse your initial decision and tell the system the article was not good. Like I said above, "removing" an article is optional. The system can work if the user only clicks on articles that he/she likes and ignores the rest.

Note: Obviously, I couldn't seed the system with every article that has been on HN. I just picked some that I liked from topics such as science, entrepreneurship, health, etc. Feel free to seed the system even more. How? Once you are logged in, either use the search box at the top to search for keywords of interest, or use the topics section at the right of the screen to find more specific news. If you find anything you like, just click on the article.

Note: If you create your own account, you will be able to test your own set of recommendations. Recommendations are generated every 24 hours, using ALL the news collected in the last 24 hours.

If any of you have used the Google News or Digg recommendation engines, I'd be very interested to hear how this stacks up against those two services, given that they are both based on user-to-user recommendation techniques (i.e. articles are recommended to you based on your similarity to other users) rather than content-based recommendations (which is what Euraeka is).
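
To make the positive-only idea concrete, here is a rough sketch (not the actual Euraeka code; it assumes scikit-learn and made-up article titles) of content-based recommendation from clicks alone, using a one-class SVM so that no "uninteresting" examples are required:

    # Illustrative sketch only: content-based recommendations from positive-only
    # clicks, using a one-class SVM over TF-IDF features of the article text.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import OneClassSVM

    # Hypothetical data: articles the user clicked, and today's crawled candidates.
    clicked_articles = [
        "New results on protein folding in yeast",
        "How we bootstrapped our startup to profitability",
    ]
    candidate_articles = [
        "A study of sleep and metabolic health",
        "Celebrity gossip roundup for the week",
    ]

    # Represent every article by its TF-IDF vector (content-based, no other users needed).
    vectorizer = TfidfVectorizer(stop_words="english")
    vectorizer.fit(clicked_articles + candidate_articles)
    X_clicked = vectorizer.transform(clicked_articles)
    X_candidates = vectorizer.transform(candidate_articles)

    # Train on positives only; the model learns a boundary around "interesting".
    model = OneClassSVM(kernel="linear", nu=0.5)
    model.fit(X_clicked)

    # Higher decision_function scores mean closer to the user's interests.
    scores = model.decision_function(X_candidates)
    for score, title in sorted(zip(scores, candidate_articles), reverse=True):
        print(f"{score:+.3f}  {title}")

When the "remove" signal is available it supplies explicit negatives, so a standard two-class SVM becomes possible; the sketch above only covers the clicks-only case.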

Thanks in advance.



Just a point on form. Why did you not blog this (or post an article on your web site) and then post a link to same on HN?

You would then have been able to format your text in a manner more suited to the content - and also made it more straightforward to edit your "content" in part in response to feedback from HN and elsewhere.

Good luck with the project though.


Good point, will blog next time. Thx


First, I think that just based on your description you are indeed asking a lot of the user. For our service, we simply ask the user to vote a piece of content up or down, and our dataset is still pretty sparse. Users like to consume; they don't have any interest in helping you tune your algorithm.

With that said, now let's address the algorithm. You're using SVM? Is that quick enough? Are your recs computed offline? To be honest, you don't really need SVM or any complex technique to do recommendations well. You can get by just fine doing hand-selected recs, even though that might not scale well. To scale, you can use something like Pearson correlation or naive Bayes and you'll do just fine.

With recs it's all about the quality of data and not so much the fancy algorithm.
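
For illustration, the naive Bayes route mentioned above could look roughly like this (toy data, assuming scikit-learn, with clicked articles as positives and removed ones as negatives; not anyone's production system):

    # Illustrative sketch: a naive Bayes text classifier trained on
    # clicked (liked) vs. removed (disliked) articles.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    liked = ["why functional programming matters",
             "scaling a web startup on a budget"]
    removed = ["top ten celebrity diets",
               "weekly horoscope for entrepreneurs"]

    texts = liked + removed
    labels = [1] * len(liked) + [0] * len(removed)

    # Bag-of-words features over the article titles/text.
    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(texts)

    clf = MultinomialNB()
    clf.fit(X, labels)

    # Rank unseen articles by the predicted probability of "liked".
    new_articles = ["a practical guide to scaling web services"]
    probs = clf.predict_proba(vectorizer.transform(new_articles))[:, 1]
    print(list(zip(new_articles, probs)))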


We have a number of options, with SVM being the fastest so far b/c the library is written in C. Pearson, naive Bayes, and a number of others have all been tried and "tested" for acceptance with several thousand users, and the majority picked the results from SVM as the most accurate. The recs are computed offline, but there is an option to do that on the fly given enough RAM to load the SVM models for all users.

As far as asking users for too much - what exactly do you mean? All we ask is that the user log in and, if they find articles they like (on the front page or from the topics menu), click on them. What could be simpler than that? The up/down voting system you talk about is exactly the same, except that you explicitly ask the user to vote, while we do the same implicitly. TechnologyReview ran an article last year about how asking the user to explicitly rate things is considered overburdening, and how the system should quietly monitor the user without asking direct questions. Anyway, your up/down approach and our click/remove approach are the same in my opinion. But thanks for the comments.


Sorry, when I wrote the comment I was on my iPhone and didn't check the website. I just did. I see what you mean. Why don't you just aggregate total click data, see which articles get clicked the most, bin the results into categories (5,4,3,2,1) by topic, then compute Pearson? SVM still sounds a bit heavyweight. Sure, I guess it'll handle a few thousand... how about 1-10 million? And how do you know SVM is the most accurate? What are your criteria?
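
A rough sketch of that suggestion, with made-up click counts and assuming numpy: bin each user's raw click counts per topic onto a 1-5 scale, then compute Pearson correlation between users.

    # Illustrative sketch: bin click counts per topic into 1-5 "ratings",
    # then measure user-to-user similarity with Pearson correlation.
    import numpy as np

    # Hypothetical click counts per user, per topic
    # (science, startups, health, sports).
    clicks = {
        "alice": [40, 25, 10, 0],
        "bob":   [35, 30,  5, 2],
        "carol": [ 1,  2, 30, 45],
    }

    def bin_to_ratings(counts, bins=5):
        """Map raw counts onto a 1..bins scale within the user's own history."""
        counts = np.asarray(counts, dtype=float)
        if counts.max() == counts.min():
            return np.full_like(counts, 3.0)
        scaled = (counts - counts.min()) / (counts.max() - counts.min())
        return np.floor(scaled * (bins - 1)) + 1

    ratings = {user: bin_to_ratings(c) for user, c in clicks.items()}

    def pearson(u, v):
        """Pearson correlation between two users' binned topic ratings."""
        return np.corrcoef(ratings[u], ratings[v])[0, 1]

    print("alice vs bob:  ", round(pearson("alice", "bob"), 3))
    print("alice vs carol:", round(pearson("alice", "carol"), 3))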


Paragraph breaks, please.




