t_s basically serves as "gravity" to make older posts fall down the page. Why Dec 8, 2005? Maybe that's when they launched. Anyway, what t_s does in the formula is equate a 10-fold increase in points with being submitted 12.5 hours (that's 45,000 seconds) later. So a 1-hour-old post would have to improve its vote differential 10x over the next 12.5 hours just to maintain its rating against the elapsed time. If a post's vote differential increases more than 10x in 12.5 hours, its rating goes up.
As for where the numbers come from, I'm pretty sure they're tuned by trial and error. It's really hard to predict voting patterns beforehand (i.e., how fast should items "fall" off the main page?).
The log function is there because your first 10 upvotes should have more weight than the 101st to 110th upvotes. The way the formula is written (and assuming 0 downvotes), your first 10 upvotes have the same weight as the next 100 upvotes, which have the same weight as the next 1000, etc. Again, the base of the logarithm is somewhat arbitrary, and can be tuned by trial and error.
And needless to say, if you have more downvotes than upvotes, your rating is negative. That's about it.
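Putting the pieces together, here's a rough Python sketch of the formula as I read it off the page (my reconstruction, not reddit's actual code):

    from datetime import datetime, timezone
    from math import log10

    EPOCH = datetime(2005, 12, 8, tzinfo=timezone.utc)  # reddit's zero point

    def hot(ups, downs, submitted):
        s = ups - downs                        # vote differential
        order = log10(max(abs(s), 1))          # first 10 votes weigh like the next 100
        sign = 1 if s > 0 else -1 if s < 0 else 0
        t_s = (submitted - EPOCH).total_seconds()
        return sign * order + t_s / 45000      # 12.5 hours of age per decade of votes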
(note: I'm just reading the page and interpreting the algorithm - I don't have any special insight into how they chose these particular constants)
Edit: Time since Dec 8, 2005 is an elegant way of doing it. My first (crude) thought would've been to use "time since posting" to determine gravity, but that requires keeping track of what time it is now. This method is completely independent of the current time. So nicely done.
The effect is the same though.
Doesn't that mean that the number of points for a new post requires something like a 5-megabyte int to be stored in memory?
Or did I misunderstand this?
A new post submitted on June 8, 2008 would have started out with about 1753 rating points, and this number grows by a little less than 2 every day. So it's not that big to store.
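A quick sanity check on those numbers:

    from datetime import datetime, timezone

    epoch = datetime(2005, 12, 8, tzinfo=timezone.utc)
    posted = datetime(2008, 6, 8, tzinfo=timezone.utc)
    print((posted - epoch).total_seconds() / 45000)  # ~1753 starting points
    print(86400 / 45000)                             # ~1.92 points gained per day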
(p - 1) / (t + 2)^1.5
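In Python, reading p as points and t as age in hours (my interpretation, not official code):

    def hn_rank(points, age_hours):
        # the minus one presumably discounts the submitter's own vote;
        # the 1.5 exponent is the "gravity" pulling older stories down
        return (points - 1) / (age_hours + 2) ** 1.5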
1) submit article
2) attach Unix timestamp
3) increment/decrement each timestamp by a fixed time interval of Unix seconds for each upvote/downvote
With this scheme, there is one operation (add a positive or negative increment), and each article naturally decays as time moves on. If you don't like linear increments, you can weight the increment more heavily the closer a vote falls to the actual submission time, and have it decay toward an average to accelerate "hotter" postings to the top (see the sketch below).
The overall idea is to project an article (in Unix timestamp) into the future by the number of upvotes; this timestamp is merely a ranking "key". The front page articles would have a Unix timestamp of one or two days into the future depending on how many votes. This would naturally place currently submitted articles somewhere a few pages back.
It more or less mimics the same thing (doesn't it?).
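Roughly like this (my sketch of the scheme; the 4,500-second bump per vote is made up and would need tuning):

    VOTE_BUMP = 4500  # hypothetical: each vote moves a story 75 minutes

    def submit(article, now_ts):
        article["key"] = now_ts  # step 2: attach the Unix timestamp as the key

    def vote(article, up=True):
        # step 3: a single operation per vote, forward for up, back for down
        article["key"] += VOTE_BUMP if up else -VOTE_BUMP

    # Sort by "key" descending: well-voted stories rank as if submitted in the
    # future, and everything falls back as newer submissions arrive.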
In reddit's system, if you take a snapshot of the front page and then stop voting or submitting stories, the page will never change. At HN, stories might reorder themselves because a story with few points, which was given a boost from being very new, will lose rating compared to high-value stories that have been around a long time.
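A quick illustration with made-up numbers, using the (p - 1) / (t + 2)^1.5 formula from above:

    def hn_rank(points, age_hours):
        return (points - 1) / (age_hours + 2) ** 1.5

    # An old, high-value story vs. a brand-new one, with voting frozen:
    for hours_later in (1, 24):
        old = hn_rank(100, 48 + hours_later)
        new = hn_rank(5, hours_later)
        print(hours_later, old > new)  # False after 1 hour, True a day later
    # reddit-style ratings are constants once voting stops, so that
    # snapshot would never reorder.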
It is simple to do that: just use the number of stories submitted as the time measure instead of hours.
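Something along these lines (a sketch; global_count, a running counter of all submissions, is a name I'm inventing):

    def rank_by_activity(points, story_number, global_count):
        # age measured in stories submitted since this one rather than hours,
        # so the ordering only shifts when the site actually sees activity
        age = global_count - story_number
        return (points - 1) / (age + 2) ** 1.5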
I've submitted some good stories only to find that they were posted a few days ago but were not noticed because of bad timing or bad title.
One thing that often happens is that hackers, hunting and reading late at night (like me), submit something to YC News. But basically nobody's reading the news feed then, and four or five hours later, one or two votes isn't enough to get something onto the front page, which is sometimes what it takes to get noticed by many people.
I have a good metric for this: sometimes I'll submit something and it won't hit the front page at all. But one by one, over the next few weeks, votes trickle in. This must be from: (a) people reading my submission history, or (b) people submitting the same article. But because I submitted at a time when nobody was reading it, it never even gets a chance to hit the front page.
Edit: I did not notice gaika's similar, elegant proposal above. But if two people make the same sort of comment independently, that should say something.
You then sort this new feed like the front page (though perhaps with nonzero start points) -- so there's both decay and a chance that good articles without a ton of initial votes eventually get up there (though perhaps on slow news days).
Using an unsupervised clustering algorithm instead of a supervised algorithm was, in my opinion, the Wrong Way to Go. After I get done with my screening exam this week, I am planning to screw around with it and maybe see if libSVM will offer a means of constructing arbitrary discriminators based on the selections of, say, one's favorite users, or one's own feedback.
Obviously there are a great many nits that need to be worked out with my idea, but I figure it may be worth a try.
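For example, something like this with scikit-learn's libsvm-backed SVC (purely a sketch; the per-link features are invented for illustration):

    from sklearn.svm import SVC

    # Hypothetical features per link: [title length, domain score, submitter karma]
    X = [[54, 0.9, 1200],
         [12, 0.2, 40],
         [33, 0.7, 560],
         [90, 0.1, 15]]
    y = [1, 0, 1, 0]  # labels from one user's upvote/downvote history

    clf = SVC(kernel="rbf").fit(X, y)
    print(clf.predict([[40, 0.8, 300]]))  # score an unseen link against that taste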
Also, taking note of how often a user visits (refreshes) the site could be useful (varying the second component) - if I haven't been to the site in a week, it would make sense to show me older stories than if I visited this morning.
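For instance, a toy variation on the gravity formula (the 12-hour threshold and both exponents are arbitrary):

    def personalized_rank(points, age_hours, hours_since_last_visit):
        # readers who have been away longer get weaker gravity, so older
        # stories survive onto their front page
        gravity = 1.5 if hours_since_last_visit < 12 else 1.2
        return (points - 1) / (age_hours + 2) ** gravity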
1) Spez actually weighs in on such matters
2) the suitability of supervised/unsupervised algorithms for recommending links is discussed (note that Amazon and other highly successful recommendation engines rely upon multi-layered feedback to self-tune, hence my bias towards supervised algorithms)
3) the churn and refresh rate, along with maximization of the desirable turnover and minimization of bombing, is discussed in the context of the 'hot' algorithm and potentially useful changes to the algorithm
You might find it worth your while.
The early ERP systems had complex algorithms computing action items for people to act upon (what to buy, what to adjust, what to move, what to make, etc.). Nobody understood them. So everyone had a built-in excuse: "The computer did it." The bosses had trouble holding people accountable because a) the bosses didn't understand the algorithms themselves, and b) it was hard to argue with that logic.
Then ERP systems started using much simpler algorithms and rules. For example: "Don't tell me to change anything unless it's more than 3 days early or more than 3 days late." Something anyone could understand. It took about 10 years, but man and machine finally started working together.
I hate having to re-read everything just to look for something new to read.