I tried to train Bayesian (and other) classifiers to reliably pick the same stories to read as I would. Despite looking at a variety of things (title, poster, domain, corpus from the article, corpus of the comments), I found their accuracy was never really better than 60%.
Then I tried rating the same set of articles myself several times. My accuracy was only around 60% too.
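For anyone who wants to repeat the experiment, here is a minimal sketch of the idea, assuming a naive Bayes model over title words only with add-one smoothing. The class name, the tokenizer, and the agreement helper are all made up for illustration; this is not the classifier described above, which also looked at poster, domain, and article/comment text.

```python
import math
import re
from collections import Counter


def tokenize(title):
    # Hypothetical tokenizer: lowercase words, keeping characters such as '.' and '+'.
    return re.findall(r"[a-z0-9.+#]+", title.lower())


class TitleNaiveBayes:
    """Two-class naive Bayes over title words (liked vs. not liked)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, title, liked):
        self.doc_counts[liked] += 1
        self.word_counts[liked].update(tokenize(title))

    def log_odds(self, title):
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        v = len(vocab) or 1
        n_liked = sum(self.word_counts[True].values())
        n_other = sum(self.word_counts[False].values())
        # Smoothed class prior, then one smoothed likelihood ratio per word.
        score = math.log((self.doc_counts[True] + 1) / (self.doc_counts[False] + 1))
        for word in tokenize(title):
            p_liked = (self.word_counts[True][word] + 1) / (n_liked + v)
            p_other = (self.word_counts[False][word] + 1) / (n_other + v)
            score += math.log(p_liked / p_other)
        return score

    def predict(self, title):
        return self.log_odds(title) > 0


def agreement(first_pass, second_pass):
    # Fraction of items that got the same label in two rating passes.
    same = sum(a == b for a, b in zip(first_pass, second_pass))
    return same / len(first_pass)
```

The agreement helper is the useful part: run it on two of your own rating passes before blaming the classifier, since your self-agreement is roughly the ceiling any model trained on your labels can be expected to reach.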
Yeah, I think it shows how hard it is to classify something with such a diverse set of stories. Each week for my Hacker Newsletter project I have to come up with a short list of links to share. I don't ever want to make it "automated", but at the same time I need to narrow down what to pick from. I have tried several things in the past, but what has worked best for me is using a combination of: what I voted up, votes, # of comments, whether <user> commented, time on the front page, and finally a lot of regex filters. Tying all that together with a simple interface/tool allows me to find a list of articles that I think my subscribers will enjoy.
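A rough sketch of how signals like these could be tied together, purely as an illustration: the field names, the weights, and the two exclude patterns are invented here and are not the actual tool. It only narrows the pool; the final pick stays manual.

```python
import re
from dataclasses import dataclass


@dataclass
class Story:
    title: str
    votes: int
    comments: int
    upvoted_by_me: bool
    watched_user_commented: bool
    hours_on_front_page: float


# Hypothetical regex filters; a real list would be much longer.
EXCLUDE_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"\bhiring\b", r"\bpoll\b")
]


def passes_filters(story):
    return not any(p.search(story.title) for p in EXCLUDE_PATTERNS)


def score(story):
    # Arbitrary weights, chosen only to show the shape of the combination.
    total = 0.0
    total += 5.0 if story.upvoted_by_me else 0.0
    total += 2.0 if story.watched_user_commented else 0.0
    total += story.votes / 50.0
    total += story.comments / 25.0
    total += story.hours_on_front_page / 6.0
    return total


def shortlist(stories, limit=20):
    # Drop filtered stories, then return the highest-scoring candidates.
    kept = [s for s in stories if passes_filters(s)]
    return sorted(kept, key=score, reverse=True)[:limit]
```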
Maybe it's easier simply to classify things you wouldn't want to read and hide those as less interesting. Because of the variety of topics, training something to figure out what you like seems much more restricting on the flow.
E.g. if you rarely read things with ".js" (stupid amounts of js library posts here), it'll be easier to say "this is uninteresting to me" directly, vs. marking everything else as interesting and leaving the algorithm to infer that you find js libraries uninteresting.
Although I'm pretty interested in Node, just not necessarily in JS libraries for APIs. Tough problem indeed.
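Something like the following is what I have in mind, as a sketch only: a hide list with a keep list that overrides it. Both pattern lists are made-up examples.

```python
import re

# Hypothetical patterns: hide anything mentioning ".js", except Node posts.
HIDE = [re.compile(r"\.js\b", re.IGNORECASE)]
KEEP = [re.compile(r"\bnode(\.js)?\b", re.IGNORECASE)]


def is_hidden(title):
    # A keep pattern always wins over a hide pattern.
    if any(p.search(title) for p in KEEP):
        return False
    return any(p.search(title) for p in HIDE)
```

Note that a title like "Node.js wrapper for some API" still gets through, which is exactly the tough case: regexes on the title can't tell the Node posts I want from the ones I don't.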
As an alternative, just trust the HN home page algorithm.
Stories seem to move up to a relevant max rank position, stay there and then move back down. Big stories stay in the top 5 for 20+ hours.
Here's what I do: If I only have time to look at 5 stories per day, I visit once per day at any point in time and look at the first 5 stories. If I have time to look at 20, I look at the first 20.
Set yourself a timeout, start reading at the top, stop when the time is up, repeat after 12 or 24 hours. Works very well for me, I get the best stories, and feel pretty well informed.
Don't you think that submissions are motivated by getting to the home page? So with time, submissions become similar to what's on the home page.
If the home page is so bad according to you, wouldn't you agree that the actual submissions would still be a sub-par source for your filter, and that a different source would be preferable?
The issue is that, at least for me, a lot of the most interesting articles get to fifth or sixth place and then lose steam and start slowly falling off the page. A lot of the time the articles that get to the very top are about boring subjects like Apple, so they worry me less.
Then I tried rating the same set of articles myself several times. My accuracy was only around 60% too.
Figures.