
How to Build a Popularity Algorithm You can be Proud of - 8plot
http://blog.linkibol.com/post/How-to-Build-a-Popularity-Algorithm-You-can-be-Proud-of.aspx
======
ssn
Interesting overview; however, he fails to address scalability issues properly.
Some of the algorithms presented need to periodically recompute each item's
score, which is a drawback if scalability is what you are looking for. A
scalable algorithm will compute each score on write and will not require batch
updates of previous items.

See: <http://code.google.com/appengine/articles/overheard.html>
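
A minimal sketch of the compute-on-write idea along the lines of that article
(the constants and names here are made up, not taken from it): fold time into
the score once, at write time, so sorting by the stored rank gives a "hot"
ordering without ever rescanning old items.

```python
EPOCH = 1234567890  # arbitrary fixed reference timestamp (seconds)
SECONDS_PER_VOTE = 3600  # assumption: each vote "buys" an hour of freshness

def rank(votes, created):
    # Computed once per write (item creation or new vote).  Time enters
    # only through the fixed creation timestamp, and each vote adds a
    # constant offset, so old items never need batch recomputation:
    # newer items and heavily-voted items both sort higher.
    return (created - EPOCH) + votes * SECONDS_PER_VOTE
```

On a datastore that can only sort by a stored property (like App Engine's),
this is the trick that makes "hot" sorting a plain indexed query.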

~~~
physcab
After looking at a number of these algorithms, it seems like you really need
to take each situation into consideration. I like that Google example: it is
easy and scalable, as you said. But for more complicated situations, you can
do batch updates using Hadoop/MapReduce, assuming you don't have popular items
that need to be calculated in real time.
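
The batch route has the shape of a word-count job over the vote log. A toy
single-process sketch of that MapReduce shape (a real Hadoop job would shard
the map and reduce phases across machines; the vote-log format here is
invented for illustration):

```python
from collections import defaultdict

def map_votes(votes):
    # Map phase: emit (item_id, 1) for every vote in the log.
    for item_id, _user in votes:
        yield item_id, 1

def reduce_counts(pairs):
    # Reduce phase: sum the emitted counts per item to get each
    # item's new score for this batch window.
    totals = defaultdict(int)
    for item_id, count in pairs:
        totals[item_id] += count
    return dict(totals)
```

Run periodically, this recomputes every item's score from scratch, which is
exactly the batch update the parent comment warns about, but it is easy to
scale horizontally when real-time freshness isn't required.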

------
profquail
I really liked this article. I've been playing around with some social-news-
rating algorithms of my own, which are quite different from any of the ones
listed here. One of these days I'll find the time to sit down and code a site
around them...

Also...I'm pretty sure that his argument for the "Dampening The Weighted Votes
By Record Age" section is wrong. If you assume that each vote has the same
weight (like HN, Digg, etc.), then you can rearrange the terms so that it's
possible to use an algorithm that updates the 'rating' of the story on-the-
fly.
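
One way to make that rearrangement concrete (a sketch, not the article's
exact formula, and the half-life constant is an assumption): if the dampening
is exponential and every vote has the same weight, you only need to store the
score and the time it was last touched, decaying lazily on each write.

```python
import math

HALF_LIFE = 24.0 * 3600.0  # assumption: a vote's weight halves every 24 hours
LAMBDA = math.log(2) / HALF_LIFE

class Story:
    """Exponentially-dampened vote score, updated on-the-fly per vote."""

    def __init__(self, created):
        self.score = 0.0        # decayed score as of self.updated
        self.updated = created  # timestamp of the last write

    def vote(self, now, weight=1.0):
        # Decay the stored score forward to 'now', then add the new
        # vote.  No batch pass over old stories is ever required.
        self.score *= math.exp(-LAMBDA * (now - self.updated))
        self.updated = now
        self.score += weight

    def current_score(self, now):
        # Reads also just decay forward from the last write.
        return self.score * math.exp(-LAMBDA * (now - self.updated))
```

The rearrangement works because exponential decay factors: decaying for
`t1` then `t2` seconds equals decaying once for `t1 + t2`, so the running
total stays exact without revisiting individual votes.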

~~~
roundsquare
> I've been playing around with some social-news-rating algorithms of my own

Where do you get the data for this? It's something I'd like to toy around
with a bit but I have no idea how to get data.

------
mseebach
Some discussion on a similar, but less comprehensive post:
<http://news.ycombinator.com/item?id=478632>

------
pixcavator
What you are likely to have is something like this: 1,000 users, each of whom
voted between 0 and 100 times, with 10 votes on average. Yet with this
approach all you have is a bag of 10,000 votes. It does not matter what you
do with this bag - all the information on how the individual users voted is
lost.

~~~
alain94040
None of these algorithms take into account the number of views for each item.
Wouldn't you define the popularity by starting with the number of people who
viewed the item, divided by the number of votes?

This would solve the issues reported about late night news being ignored
because no one is around.
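
Reading that ratio as votes per view, a sketch of how it might look (the
smoothing prior is my own assumption, added so items with very few views
aren't over-rewarded by one lucky vote):

```python
# Laplace-smoothed vote-per-view rate.  PRIOR_VOTES / PRIOR_VIEWS act as
# made-up pseudo-counts pulling low-traffic items toward a baseline rate.
PRIOR_VOTES = 1.0
PRIOR_VIEWS = 20.0

def view_adjusted_popularity(votes, views):
    # An item seen by few people (e.g. posted late at night) is judged
    # on its rate of votes per view, not its absolute vote count.
    return (votes + PRIOR_VOTES) / (views + PRIOR_VIEWS)
```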

Anyway, great article.

~~~
stratomorph
I think that would be a helpful factor in theory, but in practice would be too
inconsistent to rely on. Using Reddit as an example (because I have no
familiarity with HN's source) there is client-side JavaScript that intercepts
clicks on links and appends story IDs to one of the Reddit cookies. Next time
a request goes to Reddit, they get a list of recent clicks.

This can fail in a lot of ways, most obviously if JavaScript or cookies are
turned off. Also, the cookie isn't sent to Reddit until I load another page,
so if I read an entire page of links and then close the browser without
refreshing the page, the cookie doesn't get sent. Plus, the script clips the
list at around 20 elements, so even if I did refresh Reddit, it wouldn't know
I'd clicked on more than 20.

My point is not the numerous weaknesses of Reddit's approach. Instead, it's
that this is self-reported information, which must necessarily be suspect and
incomplete. If an article on an obscure programming language pops up here,
and every single person who reads it uses Lynx with cookies turned off for
security, there might be no opportunity to record any views.

