

Ask HN: Has anyone tried to write a Bayesian classifier for stories? - StavrosK

I had an idea that I should write a simple program to take the HN front page, look at the stories, and then use Bayesian inference to learn what I like. It sounds like this would be very simple to do, and would generally be the same idea as spam filtering. My interests are not that varied, and I think a bag-of-words model would easily be able to tell what I like or not.

However, I'm pretty sure lots of people have tried this, and it probably doesn't exist because nobody succeeded.

Have any of you tried doing some ML for interesting stories? Did it work? If so, is it available, and if not, why not?

Any insight on this would be valuable, thanks!
======
a_macgregor
StavrosK,

If all you want is to classify the stories on the front page into a preset
list of categories, that's actually pretty simple to do.

I've been working on a similar concept for a personal project. Here are my
recommendations:

- Be sure to remove stopwords from the titles before using the classifier.
- The ankusa gem will help you greatly <https://github.com/bmuller/ankusa>

Ankusa is a naive Bayes text classifier that will come in really handy for the
task you are trying to achieve.

Also make sure your training data sets are pretty clean, with as little
overlap between categories as possible.
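
To make the stopword and naive Bayes advice concrete, here is a minimal
sketch in plain Python. It does not use ankusa or its API; the stopword list,
the `like`/`skip` labels, and every name in it are assumptions made up for
illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption; use a fuller list in practice).
STOPWORDS = {"a", "an", "the", "to", "of", "for", "in", "on", "and", "is",
             "how", "why", "with"}


def tokenize(title):
    """Lowercase a title, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z0-9+#]+", title.lower())
    return [w for w in words if w not in STOPWORDS]


class NaiveBayes:
    """Multinomial naive Bayes over title words with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of titles
        self.vocab = set()

    def train(self, title, label):
        tokens = tokenize(title)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def log_scores(self, title):
        """Return log P(label) + sum of log P(word | label) for each label."""
        tokens = tokenize(title)
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                # Add-one smoothing so unseen words do not zero out a label.
                score += math.log((counts[tok] + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return scores

    def classify(self, title):
        scores = self.log_scores(title)
        return max(scores, key=scores.get)


nb = NaiveBayes()
nb.train("Ruby on Rails performance tips", "like")
nb.train("New Ruby gem for text classification", "like")
nb.train("Celebrity gossip roundup", "skip")
nb.train("Startup raises funding round", "skip")
print(nb.classify("A Ruby classifier gem"))
```

Working in log space avoids underflow when many small probabilities are
multiplied together.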

Finally, have fun and let us know how it goes!!

Cheers and let me know if you have more questions or if you want a hand coding
this thing.

~~~
StavrosK
Thanks for your answer! What I'm thinking of making is basically separating
posts into two categories, things that interest me and things that don't.
Then, I want to receive emails at intervals I specify. This is so I no longer
have the urge to check HN frequently, but still stay up to date.

The actual classification is probably the easy part, the hard part is training
the model, which is why I wanted to ask if anyone had done it before. Have you
managed to train anything to recognize your tastes, or is it objective
categories? How well does it work?

~~~
a_macgregor
Well, my classifier works based on categories like ruby, programming, php,
magento etc.

To train the classifier I grabbed feeds from different subreddits and used
them as a base data set. What you are trying to achieve sounds more like a
recommendation engine than a classifier, so recommendify might come in
handy <https://github.com/paulasmuth/recommendify>

You can still use the Bayesian classifier. For training it I would recommend
the supervised route: start with a small dataset (100 records) and manually
classify each of the training examples.

Also, you should leave some way to give feedback to your classifier so you
can improve the results and make corrections.
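
One way the feedback-and-corrections part could look, as a rough Python
sketch: each vote updates per-label word counts, and voting again on the same
story undoes the earlier vote first. The class, the story ids, and the
`like`/`skip` labels are all hypothetical.

```python
from collections import Counter


class FeedbackStore:
    """Keeps per-label word counts and lets a vote be corrected later."""

    def __init__(self):
        self.labels = {}   # story id -> current label
        self.tokens = {}   # story id -> tokens counted for that story
        self.counts = {"like": Counter(), "skip": Counter()}

    def vote(self, story_id, title, label):
        """Record an up/downvote; a repeated vote replaces the earlier one."""
        if label not in self.counts:
            raise ValueError(f"unknown label: {label!r}")
        old = self.labels.get(story_id)
        if old is not None:
            # Undo the previous vote so the correction does not double count.
            self.counts[old].subtract(self.tokens[story_id])
            # Unary plus drops entries whose count fell to zero or below.
            self.counts[old] = +self.counts[old]
        toks = title.lower().split()
        self.counts[label].update(toks)
        self.labels[story_id] = label
        self.tokens[story_id] = toks


store = FeedbackStore()
store.vote(1, "Ruby gem released", "skip")   # first guess
store.vote(1, "Ruby gem released", "like")   # correction after reading it
print(store.counts)
```

The counts it maintains are the same kind of per-label word counts a naive
Bayes classifier needs, so corrections feed straight back into scoring.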

~~~
StavrosK
Yeah, I'll have upvotes and downvotes to tell it what I liked or didn't.
Unfortunately, I can't see a way to do this without supervised learning (maybe
semi-supervised would work), which is why I posted here for ideas (I want to
avoid the costly supervision step if someone knows the result won't work).

Thanks for your comments, they help a lot.

------
Houshalter
I've been thinking about a very similar idea, mostly so I don't feel the
compulsion to check the internet so frequently. The good stuff would just wait
until I do.

Would searching through the text be enough though? If you could get several
people to use it that would give it more information. You could rank content
based on whether or not someone else with similar interests has liked it on
top of that.
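
A rough Python sketch of the "someone with similar interests liked it" idea,
assuming each user's likes are stored as a set of story ids. The user names,
ids, and the choice of Jaccard similarity are illustrative assumptions, not
anything an existing tool does.

```python
def jaccard(a, b):
    """Overlap between two sets of liked story ids, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_for(user, likes):
    """Rank stories the user has not liked yet by similar users' likes.

    likes maps a user name to the set of story ids that user liked.
    """
    mine = likes[user]
    scores = {}
    for other, theirs in likes.items():
        if other == user:
            continue
        sim = jaccard(mine, theirs)
        if sim == 0:
            continue  # no shared taste, so their likes carry no weight
        for story in theirs - mine:
            scores[story] = scores.get(story, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)


likes = {
    "me": {1, 2},
    "alice": {1, 2, 3},
    "bob": {2, 4},
    "carol": {5},
}
print(rank_for("me", likes))
```

This score could be combined with a text-based score rather than replacing
it.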

~~~
StavrosK
There are various things you can do and various machine learning techniques
you can use, but I imagine that the single user version would be enough, to
start with.

I'll give it a go and see if it works well. If it does, I might release it as
a service.

~~~
Houshalter
That would be cool. The basic info might be enough. The thing about the
previous guy who tried something like this (posted above) is that he was
mostly sorting by the words in the title, which seems like only a very weak
predictor. Better would be the number of comments and votes, and maybe other
stuff like how long the article is or whether or not certain words are in the
comments. He also trained it on whether he thought it sounded interesting, not
after reading the article and determining if it actually was.

~~~
StavrosK
I plan to implement votes (maybe), domain, actual raw text of the article,
title, submitter (maybe) and show articles that are deemed "important" (i.e.
have stayed on the front page for longer than X hours), as well as some random
ones, to avoid a bubble. Plus, I've already started training the filter
manually. I'll maybe write a simple web UI later on so I can up/downvote
articles from there.

I think that should give a good first draft.
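
Roughly, in Python, the plan above could be sketched like this. The story
fields (`title`, `url`, `submitter`, `hours_on_front_page`), the thresholds,
and the function names are all placeholders I made up, not a real HN API.

```python
import random
from urllib.parse import urlparse


def features(story):
    """Turn a story into prefixed tokens so title, domain and submitter
    do not collide in a bag-of-words model."""
    feats = [f"title:{w}" for w in story["title"].lower().split()]
    feats.append("domain:" + urlparse(story["url"]).netloc)
    feats.append("by:" + story["submitter"])
    return feats


def build_digest(stories, is_interesting, min_hours=6, n_random=2, rng=random):
    """Pick liked stories, long-lived "important" ones, and a few random
    ones to avoid a filter bubble."""
    picked = [s for s in stories
              if is_interesting(s) or s["hours_on_front_page"] >= min_hours]
    rest = [s for s in stories if s not in picked]
    picked += rng.sample(rest, min(n_random, len(rest)))
    return picked


stories = [
    {"title": "Ruby tips", "url": "https://github.com/bmuller/ankusa",
     "submitter": "a", "hours_on_front_page": 1},
    {"title": "Big news", "url": "http://example.com/x",
     "submitter": "b", "hours_on_front_page": 10},
    {"title": "Other", "url": "http://example.org/y",
     "submitter": "c", "hours_on_front_page": 0},
]
digest = build_digest(stories, lambda s: "ruby" in s["title"].lower(), n_random=1)
print([s["title"] for s in digest])
```

Here `is_interesting` stands in for whatever trained classifier ends up
making the call.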

------
jimminy
Here is a story from the beginning of last year of someone who did this.
<http://joelgrus.com/2012/02/16/hacking-hacker-news/>

Link to HN Thread: <http://news.ycombinator.com/item?id=3602407>

~~~
StavrosK
That's exactly what I was looking for, thank you. He doesn't include results,
but has some good ideas.

