

My personalized news site based on the book Programming Collective Intelligence - hashbucket
http://fyynd.com

======
hashbucket
This is a personalized news site that I wrote in two months based on
algorithms from the book Programming Collective Intelligence. Please tell me
what you think.

It has two main features: the ability to identify related/similar links, and
suggestions/recommendations that actually work.

The basis for all of the algorithms is a document similarity metric presented
in Chapter 3: Discovering Groups. Basically, to compare document A with
document B, we calculate the Pearson correlation coefficient between the word
frequencies of document A and the word frequencies of document B. (You can
imagine this as plotting a series of points on a graph, one point per word:
each point's x coordinate is the word's frequency in document A and its y
coordinate is the word's frequency in document B. The Pearson correlation
coefficient is a measure of how well the line-of-best-fit fits the points.)
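
As a rough illustration of the metric described above, here is a minimal Python sketch. It is not the site's actual code: the tokenizer, the choice to use the union of both documents' words as the data points, and the names word_counts and pearson_similarity are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt
import re


def word_counts(text):
    """Count lowercase word occurrences (hypothetical, very naive tokenizer)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def pearson_similarity(doc_a, doc_b):
    """Pearson correlation between the word counts of two documents.

    Every word that appears in either document is one data point:
    x = its count in doc_a, y = its count in doc_b.
    """
    a, b = word_counts(doc_a), word_counts(doc_b)
    vocab = sorted(set(a) | set(b))
    n = len(vocab)
    if n == 0:
        return 0.0
    xs = [a[w] for w in vocab]
    ys = [b[w] for w in vocab]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined when one series is constant
    return cov / sqrt(var_x * var_y)
```

A score near 1 means the two documents use the same words in similar proportions; a score at or below 0 means they have little vocabulary in common.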

Using this similarity metric, links can be clustered together using K-means
clustering. This is what you get when you click on “related” at the bottom of
each link. Clicking on “similar” gives the results of running K-NN. (“related”
doesn't work as well as it could right now because there are too few links
for a link to be similar to, but this is an example of where it does work:
<http://fyynd.com/links/197/related/>. “similar” usually works better right
now.)
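
To make the “similar” idea concrete, here is a small nearest-neighbour lookup in Python. This is only a sketch under assumptions: each link is represented by a hypothetical list of word counts over a shared vocabulary, and the function names pearson and most_similar are made up for the example rather than taken from the site.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def most_similar(target_id, vectors, k=3):
    """Return the ids of the k links whose word counts correlate best
    with the target link's word counts."""
    target = vectors[target_id]
    scored = [
        (pearson(target, vec), link_id)
        for link_id, vec in vectors.items()
        if link_id != target_id
    ]
    scored.sort(reverse=True)  # highest correlation first
    return [link_id for _, link_id in scored[:k]]


# Hypothetical word-count vectors keyed by link id.
vectors = {
    1: [3, 0, 1, 0],
    2: [2, 0, 1, 0],
    3: [0, 4, 0, 2],
    4: [3, 1, 1, 0],
}
nearest = most_similar(1, vectors, k=2)
```

Unlike clustering, this only needs one other link that is close to the target, which may be why it holds up better when there are few links.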

There are two algorithms for giving recommendations, “Suggested” and
“Recommended”. “Recommended” generally works better than “Suggested” when you
haven't yet voted much, but “Suggested” should be more in tune with your
preferences in the long run.

In layman's terms, the Recommendation algorithm works by "averaging" together
the links that you liked and then finding links that are similar to that
average, while the Suggestion algorithm tries to determine whether you will
like a particular link by seeing whether it is similar to any page that you
have already rated highly. As a result, "Recommended" will list pages in your
general interest area but will be insensitive to any "niche" interest that you
might have. The "Suggested" page will be sensitive to "niche" interests but
will require more votes to train. For example, if most of the links you rate
highly are about computer science, with only a few links about biology, when
the recommendation algorithm averages them together, the biology links would
count for very little. As a result, you wouldn't see much on biology. On the
other hand, the suggestion algorithm will not be hindered by this, though it
will have trouble if you don't vote much.
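
The difference between the two approaches can be sketched in a few lines of Python. This is a simplified stand-in, not the real implementation: the word-count vectors, the four-word vocabulary, and the function names recommended_scores and suggested_scores are all invented for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def recommended_scores(liked, candidates):
    """'Recommended' style: average the liked vectors, then compare
    every candidate with that single average."""
    n = len(liked)
    centroid = [sum(column) / n for column in zip(*liked)]
    return {cid: pearson(centroid, vec) for cid, vec in candidates.items()}


def suggested_scores(liked, candidates):
    """'Suggested' style: score every candidate by its best match
    among the individual liked links."""
    return {
        cid: max(pearson(vec, liked_vec) for liked_vec in liked)
        for cid, vec in candidates.items()
    }


# Hypothetical vocabulary: [algorithm, compiler, cell, protein].
# Four liked computer science links and one liked biology link.
liked = [[3, 2, 0, 0], [2, 3, 0, 0], [3, 3, 0, 0], [2, 2, 0, 0], [0, 0, 3, 2]]
candidates = {"cs": [4, 3, 0, 0], "bio": [0, 0, 2, 3]}

recommended = recommended_scores(liked, candidates)
suggested = suggested_scores(liked, candidates)
```

With this toy data the averaged profile is dominated by the computer science links, so the biology candidate gets a negative "recommended" score, while its "suggested" score stays high because it closely matches the one liked biology link.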

Please note that because predictions are so computationally intensive, they
are not updated in real time but on an hourly basis. Thus, you have to wait a
bit before they come out. Please be patient!

Please check it out and tell me what you think! Any
questions/comments/suggestions are more than welcome!

~~~
cstejerean
I really like the interface. It has some features I wish HN had, like the
ability to hide items from view. I've been meaning to write something like
this for a while but never got around to it. Keep up the good work.

Oh, please create a bookmarklet to let users submit stories while browsing,
this is VERY IMPORTANT, and shouldn't take much effort (use the HN one as an
example).

I'd like to feed the site with stories from here and create a Greasemonkey
plugin to automatically rate items on your site when I vote them up here (if I
can find a good way to vote up items programmatically on fyynd).

~~~
hashbucket
Bookmarklets: done. See <http://fyynd.com/bookmarklets/>. As for rating links
programmatically: it is a simple POST to
"http://fyynd.com/links/[link_id]/rate/" with a parameter "rating". "rating"
should be a float between 0 and 5. A rating of 0 will delete that vote.
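
For example, the request could be built roughly like this in Python. Treat it as a sketch: sending "rating" as a form-encoded body is an assumption, and build_rating_request is just a name made up for the example. Nothing is sent until the request is passed to urllib.request.urlopen.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://fyynd.com/links/"


def build_rating_request(link_id, rating):
    """Build (but do not send) the POST request that rates a link."""
    rating = float(rating)
    if not 0.0 <= rating <= 5.0:
        raise ValueError("rating must be between 0 and 5")
    url = f"{BASE_URL}{link_id}/rate/"
    # Assumption: the parameter is sent as a form-encoded POST body.
    body = urlencode({"rating": rating}).encode("ascii")
    return Request(url, data=body, method="POST")


request = build_rating_request(197, 4.5)
```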

Thanks for your interest.

~~~
cstejerean
How do I tell the application which user I am when posting a rating?

~~~
hashbucket
You have to include the cookie.
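
In Python that could look something like the sketch below. The cookie name "sessionid" and its value are placeholders, not the site's real cookie; copy whatever cookie your browser holds for the site after you log in.

```python
from urllib.request import Request


def with_cookie(request, cookie_header):
    """Attach a raw Cookie header so the server can tell who is voting."""
    request.add_header("Cookie", cookie_header)
    return request


# Placeholder cookie name and value; substitute the real ones from your browser.
request = with_cookie(
    Request("http://fyynd.com/links/197/rate/", data=b"rating=4.5", method="POST"),
    "sessionid=PASTE_VALUE_HERE",
)
```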

------
thorax
I'm reading this book currently. A lot of it is covered in traditional AI
courses in CS programs, but I like seeing the Pythonic representations of some
of the concepts.

The site itself seems kind of neat, feels like you'd need to use it for a
while to get it working well.

------
csmajorfive
Cool site. I am working on the same thing (but for school/fun). You should
look into Support Vector Machines, as they are much better at text
classification than kNN.

~~~
hashbucket
I have looked into SVMs but I don't think they would work well in this case
because: 1) A separate classifier would have to be trained for each user and
this would take too many resources. 2) I think an SVM would require too many
training cases before it becomes useful.

If you know different, let me know.

------
ews
Congrats for the site.

Regarding presentation, did you reuse any old reddit-like frontend? It looks
a lot like links <http://reddit.com/r/programming/info/61e7j/comments/c02j42c>

------
gscott
I viewed your site and I had a case of information overload. I would suggest
having a feature where a person can type in what they like (cars, technology,
etc.) and you feed them what they want, rather than dumping everything on
them at once.

