So, a couple people asked for an explanation, so here goes:
t_s basically serves as "gravity" to make older posts fall down the page. Why Dec 8, 2005? Maybe that's when they launched. Anyway, what t_s does in the function is equate a 10-fold increase in points with being submitted 12.5 hours (that's 45,000 seconds) later. So a 1-hour-old post would have to improve its vote differential 10x over the next 12.5 hours just to hold its rating steady against the elapsed time. If a post's vote differential increases more than 10x in 12.5 hours, its rating goes up.
As for where the numbers come from, I'm pretty sure they're tuned by trial and error. It's really hard to predict voting patterns beforehand (i.e., how fast should items "fall" from the main page?).
The log function is there because your first 10 upvotes should carry more weight than your 101st through 110th. The way the formula is written (and assuming 0 downvotes), your first 10 upvotes have the same weight as the next 90 (which take you from 10 to 100), which have the same weight as the next 900 (100 to 1,000), and so on. Again, the base of the logarithm is somewhat arbitrary and can be tuned by trial and error.
And needless to say, if you have more downvotes than upvotes, your rating is negative. That's about it.
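For reference, here's a rough Python sketch of the formula as I read it (my reconstruction from the page, not the actual source, and I'm using midnight Dec 8, 2005 as the epoch):

    from datetime import datetime
    from math import log

    EPOCH = datetime(2005, 12, 8)

    def hot(ups, downs, submitted):
        s = ups - downs                          # vote differential
        order = log(max(abs(s), 1), 10)          # log base 10 of the differential
        sign = 1 if s > 0 else -1 if s < 0 else 0
        t_s = (submitted - EPOCH).days * 86400 + (submitted - EPOCH).seconds
        return order + sign * t_s / 45000.0      # 45,000 s = 12.5 hours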
(note: I'm just reading the page and interpreting the algorithm - I don't have any special insight into how they chose these particular constants)
Edit: Time since Dec 8, 2005 is an elegant way of doing it. My first (crude) thought would've been to use "time since posting" to determine gravity, but that requires keeping track of what time it is now. This method is completely independent of the current time. So nicely done.
It's the other way around. t_s goes up as posts are made further and further after 12/08/2005. That means the ratings of new posts are pushed UP, rather than older posts being pulled down.
Yeah, you're right. I simplified it since posts falling down is easier to understand conceptually. New posts start out with a higher rating, and an old post needs to boost its rating 10x to be at the level a new post starts out at 12.5 hours later.
No, ratings don't change without votes being cast. But new posts automatically start out with a higher rating, so a post must grow its (positive) vote differential by 10x to "keep up" with the new posts. The rating increases as a log of the vote differential, so the votes need to increase exponentially, not the rating. The vote differential increasing by 100x would result in an increase of 2 rating points (this is what a logarithm means).
A new post submitted on June 8, 2008 would have started out with about 1753 rating points, and this number grows by a little less than 2 every day. So it's not that big to store.
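A quick back-of-the-envelope check of those numbers (assuming the Dec 8, 2005 epoch and the 45,000-second divisor from above):

    from datetime import datetime

    t_s = (datetime(2008, 6, 8) - datetime(2005, 12, 8)).days * 86400
    print(t_s / 45000.0)    # ~1753 rating points at submission
    print(86400 / 45000.0)  # ~1.92 points of "head start" gained per day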
I always imagined that the reddit/HN/Digg algorithms (if you can call it that) went something like this:
1) submit article
2) attach Unix timestamp
3) increment/decrement each timestamp by a fixed time interval of Unix seconds for each upvote/downvote
With this scheme, there is one operation (add a positive or negative increment), and each article naturally decays as time moves on. If you don't like linear increments, you can weight the increment more heavily for votes cast close to the actual submission time and let it decay toward an average, which accelerates "hotter" postings to the top.
The overall idea is to project an article (in Unix timestamp) into the future by the number of upvotes; this timestamp is merely a ranking "key". The front page articles would have a Unix timestamp of one or two days into the future depending on how many votes. This would naturally place currently submitted articles somewhere a few pages back.
It more or less mimics the same thing (doesn't it?).
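A minimal sketch of that scheme, with a made-up increment of 3,600 seconds per vote:

    import time

    VOTE_WEIGHT = 3600  # hypothetical: each net upvote projects the key 1 hour into the future

    def submit():
        # the ranking key starts as the submission time (Unix seconds)
        return {"key": time.time()}

    def vote(article, up=True):
        # one addition per vote, no other bookkeeping
        article["key"] += VOTE_WEIGHT if up else -VOTE_WEIGHT

    def front_page(articles, n=25):
        # largest "projected" timestamps first; everything else decays naturally
        return sorted(articles, key=lambda a: a["key"], reverse=True)[:n]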
In other words, unlike reddit, the rating of a story changes over time. Stories get one free upvote, so p-1 compensates for that (new stories have no points). Dividing by a power of time makes the number of upvotes that a story receives in its first few hours crucial to how likely the story is to stay on the front page.
In reddit's system, if you take a snapshot of the front page and then stop voting or submitting stories, the page will never change. At HN, stories might reorder themselves because a story with few points, which was given a boost from being very new, will lose rating compared to high-value stories that have been around a long time.
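Something like this, where the "+ 2" offset and the 1.5 exponent are commonly cited guesses rather than anything stated here:

    def hn_score(points, age_hours, gravity=1.5):
        # "points - 1" discounts the submitter's automatic upvote;
        # dividing by a power of age makes the earliest votes count most
        return (points - 1) / pow(age_hours + 2, gravity)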
It's likely that a News.YC post isn't reordered until someone actually casts a vote on it. So if all voting stopped, News.YC would remain static, just like Reddit would.
I think it would be better to normalize time by the number of submissions: right now, when I submit a story late at night US time, it usually goes nowhere because there is no one around to upvote it.
It is simple to do that: just use the number of stories submitted as the time measure instead of hours.
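A sketch of that tweak on top of the formula above (the divisor of 50 submissions is an invented tuning constant, standing in for the 45,000 seconds):

    from math import log

    def hot_by_submission_count(ups, downs, submission_index, divisor=50.0):
        # submission_index = how many stories had been submitted site-wide
        # when this one went up, replacing wall-clock seconds
        s = ups - downs
        order = log(max(abs(s), 1), 10)
        sign = 1 if s > 0 else -1 if s < 0 else 0
        return order + sign * submission_index / divisor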
I wonder what Hacker News would be like were the votes to decay in importance independently, or, alternatively, for there to be two rating systems: one for before something hits the front page (with very little time decay), and one for something after (say with the normal method).
One thing that often happens is that hackers, hunting and reading late at night (like me), submit something to YC News. But basically nobody's reading the news feed then, and four or five hours later, one or two votes isn't enough to get something onto the front page, and getting there is sometimes what it takes to be noticed by many people.
I have a good metric for this: sometimes I'll submit something and it won't hit the front page at all. But one by one, over the next few weeks, votes trickle in. This must be from: (a) people reading my submission history, or (b) people submitting the same article. But because I submitted at a time when nobody was reading it, it never even gets a chance to hit the front page.
Edit: I did not notice gaika's similar, elegant proposal above. But if two people make the same sort of comments independently, it should say something.
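A rough sketch of the two-system idea, building on the HN-style formula above (the 0.2 exponent for not-yet-front-page stories is an invented number):

    def two_phase_score(points, age_hours, has_hit_front_page):
        # almost no time decay until a story has had its shot at the front page,
        # normal decay afterwards; both exponents are illustrative only
        gravity = 1.5 if has_hit_front_page else 0.2
        return (points - 1) / pow(age_hours + 2, gravity)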
One possibly better approach would be to replace the new feed with a feed composed only of items that have spent less than 15 minutes on the front page (because everything deserves 15 minutes of fame).
You then sort this new feed like the front page (though perhaps with nonzero starting points), so there's both decay and a chance that good articles without a ton of initial votes eventually get up there (though perhaps on slow news days).
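A sketch of that filter, assuming each item tracks how long it has spent on the front page:

    FIFTEEN_MINUTES = 15 * 60

    def fifteen_minutes_of_fame(articles, score):
        # only items that haven't yet had their 15 minutes up front,
        # ranked by the same scoring function as the front page
        fresh = [a for a in articles if a["front_page_seconds"] < FIFTEEN_MINUTES]
        return sorted(fresh, key=score, reverse=True)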
Yes. See my explanation in the reddit thread. Basically, after orthogonalizing a set of feature vectors for dimensionality reduction, the resulting landscape of posts is clustered (k-NN, as far as I can tell) and the 'closest' set of 'hot' posts to a user is returned. I'll be better able to fuss with this after I have a little more free time (e.g., after my exam and the paper I'm working on); the code is primarily in Recommender.cpp if you have checked out the r2 git repo.
Using an unsupervised clustering algorithm instead of a supervised algorithm was, in my opinion, the Wrong Way to Go. After I get done with my screening exam this week, I am planning to screw around with it and maybe see if libSVM will offer a means of constructing arbitrary discriminators based on the selections of, say, one's favorite users, or one's own feedback.
Obviously there are a great many nits that need to be worked out with my idea, but I figure it may be worth a try.
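A toy version of the supervised idea, using scikit-learn's SVM in place of libSVM (the features and labels here are entirely made up for illustration):

    import numpy as np
    from sklearn.svm import SVC

    # toy per-post feature vectors (e.g. topic weights after dimensionality reduction)
    X_train = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.7]])
    # labels from one user's feedback: 1 = upvoted, 0 = ignored or downvoted
    y_train = np.array([1, 1, 0, 0])

    clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)

    # rank unseen posts by the discriminator's confidence that this user will like them
    X_new = np.array([[0.85, 0.2], [0.15, 0.8]])
    print(clf.predict_proba(X_new)[:, 1])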
I'd like to see some of the mathematical reasoning/logic behind this algorithm. Why that particular log function? Moreover, what is the reasoning behind each step? If the logic can't be explained in plain English, I'm always a bit sceptical.
Actually reddit was launched before that, but the initial ranking algorithm didn't have enough force to pull popular stories back down. Dec 05 is probably around when they switched to the current ranking algorithm.
It's a pity the recommendations were never worked out properly - I think such an engine with a proper recommendations system would be quite valuable.
Also, taking note of how often a user visits (refreshes) the site could be useful (varying the second component) - if I haven't been to the site in a week, it would make sense to show me older stories than if I visited this morning.
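One crude way to vary that second component (the scaling here is purely illustrative):

    def personalized_divisor(hours_since_last_visit, base=45000.0):
        # stretch the time divisor for infrequent visitors, so older stories
        # decay more slowly for someone who hasn't been around in a while
        return base * max(1.0, hours_since_last_visit / 24.0)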
2) the suitability of supervised/unsupervised algorithms for recommending links is discussed (note that Amazon and other highly successful recommendation engines rely upon multi-layered feedback to self-tune, hence my bias towards supervised algorithms)
3) the churn and refresh rate, along with maximization of the desirable turnover and minimization of bombing, is discussed in the context of the 'hot' algorithm and potentially useful changes to the algorithm
Your comment reminds me of something I once learned about simplicity in complex business systems.
The early ERP systems had complex algorithms computing action items for people to act upon (what to buy, what to adjust, what to move, what to make, etc.). Nobody understood them. So everyone had a built-in excuse: "The computer did it." The bosses had trouble holding people accountable because (a) the bosses didn't understand the algorithms themselves and (b) it was hard to argue with that logic.
Then ERP systems started using much simpler algorithms and rules. For example, "Don't tell me to change anything unless it's more than 3 days early or more than 3 days late." Something anyone could understand. It took about 10 years, but man and machine finally started working together.
That would require knowing each individual user's rate of refreshing the page. The current approach is a reasonable balance between showing newer stories and showing stories which were very popular. Many more articles are posted than ever make the front page, so you would be wading through a lot of severe crap every time you refreshed.