
Reddit's Ranking Algorithm - michael_dorfman
http://redflavor.com/reddit.cf.algorithm.png
======
aneesh
So, a couple people asked for an explanation, so here goes:

t_s basically serves as "gravity" to make older posts fall down the page. Why
Dec 8, 2005? Maybe that's when they launched. Anyway, what t_s does in the
function is equate a 10-fold increase in points with being submitted 12.5
hours (that's 45,000 seconds) later. So a 1-hour-old post would have to
improve its vote differential 10x over the next 12.5 hours just to maintain
its rating and compensate for elapsed time. If a post's vote differential
increases more than 10x in 12.5 hours, its rating goes up.

As for where the numbers come from, I'm pretty sure they're tuned by trial
and error. It's really hard to predict voting patterns beforehand (i.e., how
fast should items "fall" off the main page?).

The log function is there because your first 10 upvotes should have more
weight than the 101st to 110th upvotes. The way the formula is written (and
assuming 0 downvotes), your first 10 upvotes have the same weight as the next
100 upvotes, which have the same weight as the next 1000, etc. Again, the base
of the logarithm is somewhat arbitrary, and can be tuned by trial and error.

And needless to say, if you have more downvotes than upvotes, your rating is
negative. That's about it.

(note: I'm just reading the page and interpreting the algorithm - I don't have
any special insight into how they chose these particular constants)

Edit: Time since Dec 8, 2005 is an elegant way of doing it. My first (crude)
thought would've been to use "time since posting" to determine gravity, but
that requires keeping track of what time it is now. This method is completely
independent of the current time. So nicely done.
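
Putting the pieces together, here's the whole formula from the image as a
small Python sketch (my own reading of the picture, not Reddit's actual
source code). Note how the score depends only on the post's timestamp and
votes, never on the current time:

    from datetime import datetime, timezone
    from math import log10
    
    # Dec 8, 2005 7:46:43 AM, the epoch used in the formula
    REDDIT_EPOCH = datetime(2005, 12, 8, 7, 46, 43, tzinfo=timezone.utc)
    
    def hot(ups, downs, date):
        # vote differential; the log gives the first 10 net votes as much
        # weight as the next 100, which weigh as much as the next 1000
        s = ups - downs
        order = log10(max(abs(s), 1))
        sign = 1 if s > 0 else (-1 if s < 0 else 0)
        # seconds since the epoch act as "gravity": every 45,000 seconds
        # (12.5 hours) is worth one order of magnitude of net votes
        t_s = (date - REDDIT_EPOCH).total_seconds()
        return sign * order + t_s / 45000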

~~~
swombat
Hmm... so the rating increases exponentially, by an order of magnitude every
12.5 hours?

Doesn't that mean that the number of points for a new post would require
something like a 5-megabyte int to be stored in memory?

Or did I misunderstand this?

~~~
aneesh
No, ratings don't change without votes being cast. But new posts automatically
start out with a higher rating, so a post must grow its (positive) vote
differential by 10x to "keep up" with the new posts. The rating increases as a
log of the vote differential, so the _votes_ need to increase exponentially,
not the rating. The vote differential increasing by 100x would result in an
increase of 2 rating points (this is what a logarithm means).

A new post submitted on June 8, 2008 would have started out with about 1753
rating points, and this number grows by a little less than 2 every day. So
it's not that big to store.
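
A quick back-of-the-envelope check of those numbers (a sketch under the same
reading of the formula, not Reddit's source):

    from datetime import datetime, timezone
    
    REDDIT_EPOCH = datetime(2005, 12, 8, 7, 46, 43, tzinfo=timezone.utc)
    post_date = datetime(2008, 6, 8, tzinfo=timezone.utc)
    
    # a brand-new post with zero net votes scores purely on the time term
    t_s = (post_date - REDDIT_EPOCH).total_seconds()
    print(t_s / 45000)    # ~1752: the "about 1753" starting rating above
    print(86400 / 45000)  # ~1.92 added per day, "a little less than 2"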

~~~
swombat
d'oh. Thanks. My brain must not have been screwed on right while I was reading
this.

------
Anon84
Does anybody have other examples of algorithms of this type? How does HN work?
(I'm lisp-illiterate so please don't tell me to read YC's source)

~~~
pg
News.YC's is just

    (p - 1) / (t + 2)^1.5

where p = points and t = age in hours
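
In Python, that's simply (my transcription; the real implementation is in
Arc):

    def rank(points, age_hours):
        # one free submitter vote subtracted; gravity is (t + 2)^1.5
        return (points - 1) / (age_hours + 2) ** 1.5
    
    print(rank(10, 1))   # ~1.73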

~~~
lincolnq
In other words, unlike reddit, the rating of a story changes over time.
Stories get one free upvote, so p - 1 compensates for that (a brand-new story
with only its submitter's vote scores zero). Dividing by a power of time
makes the number of upvotes a story receives in its first few hours crucial
to how likely it is to stay on the front page.

In reddit's system, if you take a snapshot of the front page and then stop
voting or submitting stories, the page will never change. At HN, stories might
reorder themselves because a story with few points, which was given a boost
from being very new, will lose rating compared to high-value stories that have
been around a long time.
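
To illustrate with made-up numbers (using a direct transcription of pg's
formula above):

    def rank(points, age_hours):
        return (points - 1) / (age_hours + 2) ** 1.5
    
    # two stories frozen with no further votes: the young story starts
    # ahead but decays faster than the older, higher-point one
    for hours in (1, 6, 24):
        young = rank(10, hours)       # 10 points, submitted `hours` ago
        old = rank(100, hours + 20)   # 100 points, 20 hours older
        print(hours, round(young, 2), round(old, 2))
    # at 1h the young story leads (1.73 vs 0.90); by 6h it has fallen
    # behind (0.40 vs 0.67) without a single vote being cast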

~~~
palish
It's likely that a News.YC post isn't reordered until someone actually casts a
vote on it. So if all voting stopped, News.YC would remain static, just like
Reddit would.

------
ntoshev
I wish this was about the "Recommended" page, and why it doesn't work. Has
anyone looked at this part of the open-sourced Reddit?

~~~
apathy
Yes. See my explanation in the reddit thread. Basically, after orthogonalizing
a set of feature vectors for dimensionality reduction, the resulting landscape
of posts is clustered (k-NN as far as I can tell) and the 'closest' set of
'hot' posts to a user is returned. I'll be better able to fuss with this after
I have a little more free time (e.g., after my exam and the paper I'm working
on); the code is primarily in Recommender.cpp if you have checked out the r2
git repo.

Using an unsupervised clustering algorithm instead of a supervised algorithm
was, in my opinion, the Wrong Way to Go. After I get done with my screening
exam this week, I am planning to screw around with it and maybe see if libSVM
will offer a means of constructing arbitrary discriminators based on the
selections of, say, one's favorite users, or one's own feedback.

Obviously there are a great many nits that need to be worked out with my idea,
but I figure it may be worth a try.
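
For the curious, the general shape of the pipeline described above might look
roughly like this toy numpy sketch (SVD for the orthogonalization step, then
nearest neighbors; all names, dimensions, and constants are made up, and the
real logic lives in Recommender.cpp):

    import numpy as np
    
    # toy user-by-post vote matrix: +1 up, -1 down, 0 no vote
    rng = np.random.default_rng(0)
    votes = rng.choice([-1.0, 0.0, 1.0], size=(50, 200))
    
    # "orthogonalize" the feature vectors via a truncated SVD
    U, S, Vt = np.linalg.svd(votes, full_matrices=False)
    k = 10
    user_vecs = U[:, :k] * S[:k]   # users in the reduced space
    post_vecs = Vt[:k].T           # posts in the same space
    
    # return the 'closest' posts to a user by cosine similarity
    def recommend(user_id, n=5):
        u = user_vecs[user_id]
        sims = post_vecs @ u / (
            np.linalg.norm(post_vecs, axis=1) * np.linalg.norm(u) + 1e-9)
        return np.argsort(-sims)[:n]
    
    print(recommend(0))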

------
shafqat
I'd like to see some of the mathematical reasoning/logic behind this
algorithm. Why that particular log function? And what reasoning went into
each step? If the logic can't be explained in plain English, I'm always a bit
sceptical.

------
kylec
What's the significance of December 8, 2005 7:46:43 AM?

~~~
gwniobombux
That seems to be the time reddit.com went live. Compare this Guardian article:
[http://www.guardian.co.uk/technology/2005/dec/08/innovations...](http://www.guardian.co.uk/technology/2005/dec/08/innovations.guardianweeklytechnologysection1)

~~~
pg
Actually reddit was launched before that, but the initial ranking algorithm
didn't have enough force to pull popular stories back down. Dec 05 is probably
around when they switched to the current ranking algorithm.

------
andreyf
It's a pity the recommendations were never worked out properly - I think a
site like this with a proper recommendation engine would be quite valuable.

Also, taking note of how often a user visits (refreshes) the site could be
useful (varying the second component) - if I haven't been to the site in a
week, it would make sense to show me older stories than if I visited this
morning.
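
A hypothetical sketch of varying that second component per user (every name
and constant here is invented for illustration, not anything reddit does):

    from math import log10
    
    # reddit-style score, but scale the time term by how recently this
    # user last visited: absent users see a flatter, "older" front page
    def personal_hot(net_votes, t_s, hours_since_last_visit):
        # t_s: seconds since Reddit's Dec 8, 2005 epoch
        order = log10(max(abs(net_votes), 1))
        sign = 1 if net_votes > 0 else (-1 if net_votes < 0 else 0)
        # stretch the 45,000-second gravity window for absent users
        window = 45000 * (1 + hours_since_last_visit / 168)  # 168h = 1 week
        return sign * order + t_s / window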

~~~
apathy
You should join the discussion on reddit:

1) Spez actually weighs in on such matters

2) the suitability of supervised/unsupervised algorithms for recommending
links is discussed (note that Amazon and other highly successful
recommendation engines rely upon multi-layered feedback to self-tune, hence my
bias towards supervised algorithms)

3) the churn and refresh rate, along with maximization of the desirable
turnover and minimization of bombing, is discussed in the context of the 'hot'
algorithm and potentially useful changes to the algorithm

You might find it worth your while.

------
immad
I'm happy with how simple it is, and how simple the news.yc one is too. I
always thought ranking was more magical.

~~~
edw519
Your comment reminds me of something I once learned about simplicity in
complex business systems.

The early ERP systems had complex algorithms computing action items for
people to act upon (what to buy, what to adjust, what to move, what to make,
etc.). Nobody understood them, so everyone had a built-in excuse: "The
computer did it." The bosses had trouble holding people accountable because
a) the bosses didn't understand the algorithms themselves, and b) it was hard
to argue with that logic.

Then ERP systems started using much simpler algorithms and rules. For
example: "Don't tell me to change anything unless it's more than 3 days early
or more than 3 days late." Something anyone could understand. It took about
10 years, but man and machine finally started working together.

------
TweedHeads
Reddit should drop the algorithm and embrace Digg's logic of moving older
posts down as new posts come in.

I hate having to re-read everything just to find something new to read.

~~~
lincolnq
That would require knowing each individual user's rate of refreshing the
page. The current system is a reasonable balance between showing newer
stories and showing stories that were very popular. Many more articles are
posted than ever make the front page, so you'd be wading through a lot of
crap every time you refreshed.

~~~
TweedHeads
There is always a top-ten section refreshed every 24 hours. Also upcoming
stories don't get to the front page as easy, for THAT you use an algorythm.

