Hacker News new | past | comments | ask | show | jobs | submit login
Reddit's Ranking Algorithm (redflavor.com)
97 points by michael_dorfman on June 29, 2008 | hide | past | favorite | 34 comments

So, a couple people asked for an explanation, so here goes:

t_s basically serves as "gravity" to make older posts fall down the page. Why Dec 8, 2005? Maybe that's when they launched. Anyway, what t_s does in the function is equate a 10-fold increase in points with being submitted 12.5 hours (that's 45,000 secs) later. So a 1-hour-old post would have to improve its vote differential 10x over the next 12.5 hours to maintain it's rating to compensate for elapsed time. If a post's vote differential increases more than 10x in 12.5 hours, its rating goes up.

As for where the numbers come from, I'm pretty sure they're tuned by trial-and-error. It's really hard to predict voting patterns beforehand (ie how fast should items "fall" from the main page?)

The log function is there because your first 10 upvotes should have more weight than the 101st to 110th upvotes. The way the formula is written (and assuming 0 downvotes), your first 10 upvotes have the same weight as the next 100 upvotes, which have the same weight as the next 1000, etc. Again, the base of the logarithm is somewhat arbitrary, and can be tuned by trial and error.

And needless to say, if you have more downvotes than upvotes, your rating is negative. That's about it.

(note: I'm just reading the page and interpreting the algorithm - I don't have any special insight into how they chose these particular constants)

Edit: Time since Dec 8, 2005 is an elegant way of doing it. My first (crude) thought would've been to use "time since posting" to determine gravity, but that requires keeping track of what time it is now. This method is completely independent of the current time. So nicely done.

It's the other way around. yts goes up as the posts are made a greater and greater distance in the future from 12/08/2005. That means that the ratings of new posts are pushed UP, rather than older posts being pulled down.

Yeah, you're right. I simplified it since posts falling down is easier to conceptually understand. New posts start out with a higher rating, and an old posts needs to boost its rating 10x to be at the level a posts starts out at 12.5 hours later.

The effect is the same though.

Hmm... so the rating increases exponentially, by an order of magnitude every 12.5 hours?

Doesn't that mean that the number of points for a new post requires a something like a 5 megabyte sized int to be stored in memory?

Or did I misunderstand this?

No, ratings don't change without votes being cast. But new posts automatically start out with a higher rating, so a post must grow its (positive) vote differential by 10x to "keep up" with the new posts. The rating increases as a log of the vote differential, so the votes need to increase exponentially, not the rating. The vote differential increasing by 100x would result in an increase of 2 rating points (this is what a logarithm means).

A new post submitted on June 8, 2008 would have started out with about 1753 rating points, and this number grows by a little less than 2 every day. So it's not that big to store.

d'oh. Thanks. My brain must not have been screwed on right while I was reading this.

No. The rating doesn't change unless people vote on it, and is a function of the logarithm of number of votes.

Does anybody have other examples of algorithms of this type? How does HN work? (I'm lisp-illiterate so please don't tell me to read YC's source)

News.YC's is just

  (p - 1) / (t + 2)^1.5
where p = points and t = age in hours

I always imagined that the reddit/HN/Digg algorithms (if you can call it that) went something like this:

1) submit article

2) attach Unix timestamp

3) increment/decrement each timestamp by a fixed time interval of Unix seconds for each upvote/downvote

With this scheme, there is one operation (add a positive or negative increment). Eventually each article will naturally decay as time moves on. You can adjust the size of the increment to be weighted more closer to the actual submission time and have the increment decay to an average to accelerate "hotter" postings to the top if you don't like linear increments.

The overall idea is to project an article (in Unix timestamp) into the future by the number of upvotes; this timestamp is merely a ranking "key". The front page articles would have a Unix timestamp of one or two days into the future depending on how many votes. This would naturally place currently submitted articles somewhere a few pages back.

It more or less mimics the same thing (doesn't it?).

In other words, unlike reddit, the rating of a story changes over time. Stories get one free upvote, so p-1 compensates for that (new stories have no points). Dividing by a power of time makes the number of upvotes that a story receives in its first few hours crucial to how likely the story is to stay on the front page.

In reddit's system, if you take a snapshot of the front page and then stop voting or submitting stories, the page will never change. At HN, stories might reorder themselves because a story with few points, which was given a boost from being very new, will lose rating compared to high-value stories that have been around a long time.

It's likely that a News.YC post isn't reordered until someone actually casts a vote on it. So if all voting stopped, News.YC would remain static, just like Reddit would.

I think it would be better to normalize time for the number of submissions: right now when I submit a story late night US time, usually it goes nowhere because there is no one to upvote it.

It is simple to do that: just use the number of stories submitted as time measure instead of hours.

Yeah, I probably should. Paul Buchheit suggested this a long time ago.

What if t is not the age but the first time the story made it to front page? That way old links that suddenly gain popularity can bubble up one day.

I've submitted some good stories only to find that they were posted a few days ago but were not noticed because of bad timing or bad title.

I wonder what Hacker News would be like were the votes to decay in importance independently, or, alternatively, for there to be two rating systems: one for before something hits the front page (with very little time decay), and one for something after (say with the normal method).

One thing that often happens is that hackers, hunting and reading late at night (like me) submit something to YC News. But basically nobody's reading the news feed, and four or five hours later, one or two votes isn't enough to get something on the front page, which is sometimes needed to get recognized by many people.

I have a good metric for this: sometimes I'll submit something and it won't hit the front page at all. But one by one, over the next few weeks, votes trickle in. This must be from: (a) people reading my submission history, or (b) people submitting the same article. But because I submitted at a time when nobody was reading it, it never even gets a chance to hit the front page.

Edit: I did not notice gaika's similar, elegant proposal above. But if two people make the same sort of comments independently, it should say something.

One possibly better approach would be to replace the new feed with a feed composed only of items that have spent less than 15 minutes on the front page (because everything deserves 15 minutes of fame).

You then sort this new feed like the front page (though perhaps with nonzero start points) -- so there's both decay and a chance that good articles without a ton of initial votes, eventually get up there (though perhaps on slow news days).

I cannot resist to say, simple and effective. Simplicity is beauty :)

I wish this was about the "Recommended" page, and why it doesn't work. Has anyone looked at this part of the open-sourced Reddit?

Yes. See my explanation in the reddit thread. Basically, after orthogonalizing a set of feature vectors for dimensionality reduction, the resulting landscape of posts is clustered (k-NN as far as I can tell) and the 'closest' set of 'hot' posts to a user is returned. I'll be better able to fuss with this after I have a little more free time (eg. after my exam and the paper I'm working on); the code is primarily in Recommender.cpp if you have checked out the r2 git repo.

Using an unsupervised clustering algorithm instead of a supervised algorithm was, in my opinion, the Wrong Way to Go. After I get done with my screening exam this week, I am planning to screw around with it and maybe see if libSVM will offer a means of constructing arbitrary discriminators based on the selections of, say, one's favorite users, or one's own feedback.

Obviously there are a great many nits that need to be worked out with my idea, but I figure it may be worth a try.

I'd like to see some of the mathematical reasoning/logic behind this algorithm. Why that particular log function? Moreover, an explanation into what went behind each step? If the logic can't be explained in plain english, I'm always a bit sceptical.

What's the significance of December 8, 2005 7:46:43 AM?

Ah magic numbers; It's no CAFEBABE though. Most likely the time Reddit was relaunched after the Lisp to Python rewrite (Aaron's blog post)


That seems to be the time reddit.com went live. Compare this Guardian article: http://www.guardian.co.uk/technology/2005/dec/08/innovations...

Actually reddit was launched before that, but the initial ranking algorithm didn't have enough force to pull popular stories back down. Dec 05 is probably around when they switched to the current ranking algorithm.

It's a pity the recommendations were never worked out properly - I think such an engine with a proper recommendations system would be quite valuable.

Also, taking note of how often a user visits (refreshes) the site could be useful (varying the second component) - if I haven't been to the site in a week, it would make sense to show me older stories than if I visited this morning.

You should join the discussion on reddit:

1) Spez actually weighs in on such matters

2) the suitability of supervised/unsupervised algorithms for recommending links is discussed (note that Amazon and other highly successful recommendation engines rely upon multi-layered feedback to self-tune, hence my bias towards supervised algorithms)

3) the churn and refresh rate, along with maximization of the desirable turnover and minimization of bombing, is discussed in the context of the 'hot' algorithm and potentially useful changes to the algorithm

You might find it worth your while.

I am happy how simple it is. And how simple the news.yc one is too. I always thought ranking was more magical

Your comment reminds me of something I once learned about simplicity in complex business systems.

The early ERP systems had complex algorithms computing action items for people to act upon (what to buy, what to adjust, what to move, what to make, etc.) Nobody understood them. So everyone had a built in excuse, "The computer did it." The bosses had trouble holding people accountable because a) the bosses didn't understand the algorithms themselves and b) It was hard to argue with that logic.

Then ERP systems starting using much simpler altorithms and rules. For example, "Don't tell me to change anything unless it's more than 3 days early or more than 3 days late." Something anyone could understand. It took about 10 years, but man and machine finally started working together.

Reddit should drop the algorythm and embrace digg's logic of moving down older posts as new posts come alive.

I hate having to re-read everything just to look for something new to read.

That would require them knowing what each individual user's rate of refreshing the page was. This is a reasonable balance between showing newer stories and showing stories which were very popular. Many more articles are posted than make the front page. So you would be wading through a lot of severe crap all the time, every time you refreshed.

There is always a top-ten section refreshed every 24 hours. Also upcoming stories don't get to the front page as easy, for THAT you use an algorythm.

So... your browser doesn't track history for you?

Same applies to HN btw.

Applications are open for YC Winter 2024

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact