

Ask HN: Comments and help on improving our recommendation engine for news - haidut

Hi all,<p>A friend of mine and I built this news ranking service of sorts based on a ranking algorithm I came up with while I was in grad school for CS. Recently I also developed a recommendation engine based on a modified Support Vector Machine (SVM) algorithm. Why modified? Well, as most of you know, SVMs are really binary classifiers and need two classes of training data - i.e. good/bad, interesting/boring, etc. So in the case of news, you have to ask the user to select both interesting and uninteresting articles. There have been a number of studies on how asking the user to do both is sometimes too much of a burden. Ideally, the user should be tracked implicitly and the model should account for the uncertainty that arises from the lack of explicit ratings.<p>So here is how it works. I created an account and seeded the system with some articles that seem to have been popular on HN, and the results are available. Just go to http://www.euraeka.com and click login at the top right. The user name is hackers (at) are (dot) us, and you can email me at haidut (at) gmail (dot) com for the password. Since the system has already generated recommendations, you will see them right when you log in. If a user likes an article, they just click on it and the system records that. There is also a "remove" button next to each article, in case the user knows up front that they don't like that article and want it removed from view and recorded as uninteresting. The remove option also lets you reverse your initial decision: if you clicked on an article that you thought you would like but it turned out not to be interesting, you can tell the system that the article was not good. Like I said above, "removing" an article is optional. The system can work if the user only clicks on articles that he/she likes and ignores the rest.
Note: Obviously, I couldn't seed the system with every article that has been on HN. I just picked some that I liked before from topics such as science, entrepreneurship, health, etc. Feel free to seed the system even more. How to do that? Basically, once you are logged in either use the search box at the top to search for keywords of interest, or use the topics section at the right of the screen to find more specific news. If you find anything you like just click on the article.
Note: Obviously, if you create your own account you will be able to test your own set of recommendations. Recommendations are generated every 24 hours using ALL the news collected in the last 24 hours.
If any of you have used the Google News or Digg recommendation engines, I'd be very interested to hear how this stacks up against those two services, given the fact that they are both based on user-to-user recommendation techniques (i.e. articles are recommended to you based on similarity to other users) rather than content-based recommendations (which is what Euraeka is).<p>Thanks in advance.
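To make the implicit-feedback idea above concrete, here is a minimal sketch (toy data, hypothetical article titles; the actual Euraeka model is a modified SVM, which this does not reproduce): treat only the articles a user clicked as positive signal, build a profile from them, and score new articles by cosine similarity to that profile - no "boring" labels required.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term-frequency vector for a piece of text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Average of the clicked-article vectors: the user's implicit profile."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: c / len(vectors) for t, c in total.items()})

# Clicks are the only signal; the user never explicitly rates anything.
clicked = [vectorize("svm kernel methods machine learning"),
           vectorize("training classifiers machine learning news")]
profile = centroid(clicked)

# Hypothetical candidate headlines, scored against the profile.
candidates = ["deep learning beats svm on news classification",
              "celebrity gossip roundup weekend edition"]
scores = {title: cosine(profile, vectorize(title)) for title in candidates}
```

The machine-learning headline shares terms with the profile and scores above zero; the unrelated one scores zero, so ranking by score surfaces the relevant article first.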
======
bdfh42
Just a point on form. Why did you not blog this (or post an article on your
web site) and then post a link to same on HN?

You would then have been able to format your text in a manner more suited to
the content - and also made it more straightforward to edit your "content" in
part in response to feedback from HN and elsewhere.

Good luck with the project though.

~~~
haidut
Good point, will blog next time. Thx

------
physcab
First, I think just based on your description you are indeed asking a lot of
the user. For our service, we simply ask the user to vote a piece of content
up or down, and our dataset is pretty sparse. Users like to consume; they
don't have any interest in helping you tune your algorithm.

With that said, now let's address the algorithm. You're using SVM? Is that
quick enough? Are your recs computed offline? To be honest, you don't really
need SVM or any complex technique to do recommendations well. You can get by
just fine doing hand selected recs even though that might not scale well. To
scale you can use something like Pearson correlation or naive Bayes and you'll
do just fine.

With recs it's all about the quality of data and not so much the fancy
algorithm.
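To illustrate how little machinery a simpler technique needs, here is a toy multinomial naive Bayes over word counts (hypothetical labels and documents; this is a sketch of the general technique, not anyone's production system):

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (label, text) pairs. Returns class priors,
    per-class word counts, and the vocabulary, for smoothed scoring."""
    priors, counts, vocab = Counter(), defaultdict(Counter), set()
    for label, text in labeled_docs:
        priors[label] += 1
        words = text.lower().split()
        counts[label].update(words)
        vocab.update(words)
    return priors, counts, vocab

def classify(text, priors, counts, vocab):
    """Pick the label with the highest log-posterior for the text."""
    total = sum(priors.values())
    best, best_score = None, -math.inf
    for label in priors:
        n = sum(counts[label].values())
        score = math.log(priors[label] / total)
        for w in text.lower().split():
            # Laplace smoothing so unseen words don't zero out the score.
            score += math.log((counts[label][w] + 1) / (n + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

# Toy training set: clicked articles vs. removed articles.
docs = [("interesting", "startup raises funding machine learning"),
        ("interesting", "new algorithm machine learning research"),
        ("boring", "celebrity wedding photos gossip"),
        ("boring", "gossip column weekend celebrity news")]
priors, counts, vocab = train_nb(docs)
label = classify("machine learning startup research", priors, counts, vocab)
```

Training is a single pass of counting and scoring is a handful of log additions, which is why it scales cheaply compared to retraining an SVM per user.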

~~~
haidut
We have a number of options, with SVM being the fastest so far b/c the library
is written in C. Pearson, naive Bayes and a number of others have all been
tried and "tested" for acceptance with several thousand users, and the majority
picked the results from SVM as the most accurate. The recs are computed
offline, but there is an option to do that on the fly given enough RAM to load
the SVM models for all users. As far as asking users for too much - what
exactly do you mean? All we are asking is for the user to log in and, if they
find an article they like (on the front page or from the topics menu), to
click on it. What could be simpler than that? The up/down voting system you
talk about is exactly the same, except that you explicitly ask the user to
vote, while we implicitly do the same. TechnologyReview ran an article last
year about how asking the user to explicitly rate stuff is considered
overburdening and how the system should quietly monitor the user without
asking direct questions. Anyway, your up/down approach and our click/remove
approach are the same in my opinion. But thanks for the comments.

~~~
physcab
Sorry, when I wrote the comment I was on my iPhone and didn't check the
website. I just did. I see what you mean. Why don't you just aggregate total
click data to see which articles get clicked the most, bin the results into
categories (5,4,3,2,1) by topic, then compute Pearson? SVM still sounds a bit
heavyweight. Sure, I guess it'll handle a few thousand users... how about 1-10
million? And how do you know SVM is the most accurate? What are your
criteria?
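That pipeline could be sketched roughly as follows (toy data, hypothetical topic names and click counts): map each user's per-topic click totals onto a 1-5 rating scale, then compute the Pearson correlation between two users' rating vectors as the similarity score.

```python
import math

def bin_clicks(clicks, max_clicks):
    """Map a raw click count onto a 1-5 rating scale."""
    if max_clicks == 0:
        return 1
    return 1 + round(4 * clicks / max_clicks)

def pearson(xs, ys):
    """Pearson correlation between two equal-length rating vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical per-topic click counts for two users.
topics = ["science", "startups", "health", "politics"]
alice = {"science": 12, "startups": 8, "health": 2, "politics": 0}
bob = {"science": 10, "startups": 9, "health": 1, "politics": 0}

a_max, b_max = max(alice.values()), max(bob.values())
a_ratings = [bin_clicks(alice[t], a_max) for t in topics]
b_ratings = [bin_clicks(bob[t], b_max) for t in topics]
similarity = pearson(a_ratings, b_ratings)
```

Binning first normalizes away how heavily each user clicks overall, so the correlation reflects shared topic preferences rather than raw volume; users with a high correlation can then serve as neighbors for user-to-user recommendations.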

------
tdoggette
Paragraph breaks, please.

