
How Hacker News ranking algorithm works - cristoperb
http://amix.dk/blog/post/19574
======
pg
That's close to the current version, but a little out of date. Here's the code
running now:

    
    
        (= gravity* 1.8 timebase* 120 front-threshold* 1
           nourl-factor* .4 lightweight-factor* .17 gag-factor* .1)
    
        (def frontpage-rank (s (o scorefn realscore) (o gravity gravity*))
          (* (/ (let base (- (scorefn s) 1)
                  (if (> base 0) (expt base .8) base))
                (expt (/ (+ (item-age s) timebase*) 60) gravity))
             (if (no (in s!type 'story 'poll))  .8
                 (blank s!url)                  nourl-factor*
                 (mem 'bury s!keys)             .001
                                                (* (contro-factor s)
                                                   (if (mem 'gag s!keys)
                                                        gag-factor*
                                                       (lightweight s)
                                                        lightweight-factor*
                                                       1)))))

~~~
pypyguy
Any genius willing to translate this to python?

~~~
zackattack
I would prefer PHP but seconded

~~~
sstrudeau
This is pretty trivial but sure, something like:

    
    
      function calculate_score($votes, $item_hour_age, $gravity=1.8) {
        return ($votes - 1) / pow(($item_hour_age+2), $gravity);
      }
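
Since the ask was for Python: here is a rough port of the Arc code pg posted above, offered as a sketch and not as a faithful copy of what HN runs. It assumes `item-age` is in minutes (the `/ 60` in the Arc and the `+2` hours in the PHP version suggest so). `contro-factor` and `lightweight` are not shown in this thread, so they are stubbed as plain fields on a hypothetical `Item` class, and `realscore` is replaced by a stored `score`.

```python
from dataclasses import dataclass, field

# Constants copied from the Arc snippet.
GRAVITY = 1.8
TIMEBASE_MINUTES = 120
NOURL_FACTOR = 0.4
LIGHTWEIGHT_FACTOR = 0.17
GAG_FACTOR = 0.1


@dataclass
class Item:
    """Hypothetical stand-in for an Arc story record."""
    score: int
    age_minutes: float            # assumption: item-age is in minutes
    type: str = "story"
    url: str = ""
    keys: set = field(default_factory=set)
    lightweight: bool = False     # stub for (lightweight s)
    contro_factor: float = 1.0    # stub for (contro-factor s)


def frontpage_rank(item, gravity=GRAVITY):
    # Numerator: (score - 1), damped by a 0.8 exponent when positive.
    base = item.score - 1
    if base > 0:
        base = base ** 0.8
    # Denominator: age in hours (plus the 2-hour timebase) raised to gravity.
    decay = ((item.age_minutes + TIMEBASE_MINUTES) / 60) ** gravity

    # Penalty multiplier, mirroring the multi-branch Arc `if`.
    if item.type not in ("story", "poll"):
        factor = 0.8
    elif not item.url.strip():
        factor = NOURL_FACTOR
    elif "bury" in item.keys:
        factor = 0.001
    else:
        if "gag" in item.keys:
            inner = GAG_FACTOR
        elif item.lightweight:
            inner = LIGHTWEIGHT_FACTOR
        else:
            inner = 1.0
        factor = item.contro_factor * inner
    return base / decay * factor
```

With all penalties at 1 and the 0.8 exponent dropped, this reduces to the one-line PHP version.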

~~~
mkramlich
It's interesting to see such a direct comparison of PHP to Python. PHP is
very similar to Python except with more syntax noise. :) And probably
worse docs. And worse namespacing, etc. ;)

------
gojomo
Some weaknesses of this algorithm are:

(1) Wall-clock hours penalize an article even if no one is reading (overnight,
for example). A time denominated in ticks of actual activity (such as views of
the 'new' page, or even upvotes-to-all-submissions) might address this.

(2) An article that misses its audience first time through -- perhaps due to
(1) or a bad headline -- may never recover, even with a later flurry of votes
far beyond what new submissions are getting.

Without checking the exact numbers, consider a contrived example: Article A is
submitted at midnight and 3 votes trickle in until 8am. Then at 8am article B
is submitted. Over the next hour, B gets 6 votes and A gets 9 votes. (Perhaps
many of those are duplicate-submissions that get turned into upvotes.) A has
double the total votes, and 50% more votes even in the shared hour, but still
may never rank above B, because of the drag of its first 8 hours.
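
Plugging this example into the simplified formula posted elsewhere in this thread, `(points - 1) / (age_hours + 2) ^ 1.8`, gives a feel for the numbers. The sketch treats votes as points and ignores every other penalty factor, both of which are simplifying assumptions; `votes_needed` is a hypothetical helper that just inverts the formula.

```python
GRAVITY = 1.8


def rank(points, age_hours, gravity=GRAVITY):
    """Simplified rank: (points - 1) / (age_hours + 2) ** gravity."""
    return (points - 1) / (age_hours + 2) ** gravity


def votes_needed(target_rank, age_hours, gravity=GRAVITY):
    """Points an item of the given age needs to reach target_rank."""
    return target_rank * (age_hours + 2) ** gravity + 1


# Snapshot at 9am: A is 9 hours old with 3 + 9 = 12 votes,
# B is 1 hour old with 6 votes.
rank_a = rank(12, 9)
rank_b = rank(6, 1)
print(f"A: {rank_a:.3f}  B: {rank_b:.3f}")
print(f"A would need about {votes_needed(rank_b, 9):.0f} votes to match B")
```

Under these assumptions A comes out at roughly 0.15 against roughly 0.69 for B, and A would need on the order of 50 votes to draw level at that moment.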

(I think you'd need to timestamp each vote for an improved decay function.)

~~~
angusgr
_Wall-clock hours penalize an article even if no one is reading (overnight,
for example)_

I'd be interested to know what the hourly fluctuation for HN is actually like,
on account of having readers all over the world.

I'm in Australia, so your example "submitted at midnight" California time[1]
means submitted at 6pm my time. Also 8am London time, 11am Moscow time. :).

[1] I'm going to go ahead and assume you're in California. ;)

~~~
gojomo
I'm in California, usually, but have often observed HN through the California
night -- either because of my own odd online hours, or trips to distant time-
zones.

It's true there's never total quiescence, but the pace of actions changes by a
noticeable factor. (Without going to the data, I'd guess 5X from trough to
peak over a day's cycle, and a somewhat smaller weekend-to-weekday difference.
Holidays and nice bay area weather also play a factor.)

~~~
ig1
I've found optimal submission time to generally be midday london time, you
catch the european lunch-time traffic and the US wake-up/get-into-work
traffic. I don't think traffic from anywhere else is heavy enough to matter.

------
antirez
When I built oknotizie.virgilio.it many years ago, more or less at the same
time reddit was created, I used the same base algorithm, that is: RANK = SCORE
/ AGE^ALPHA, where ALPHA is the obsolescence factor.

This is a pretty obvious algorithm, but the devil is in the details. First,
since oknotizie is based in Italy, AGE is calculated in a special way so that
night hours count differently (every hour should be taken into account in
proportion to the traffic there is in that hour).
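
A minimal sketch of that kind of traffic-weighted AGE, not the actual oknotizie code: the `HOURLY_WEIGHT` table is made up for illustration (quiet from 23:00 to 07:00, busy otherwise), and a real site would derive it from its own traffic logs.

```python
from datetime import datetime, timedelta

# Hypothetical relative traffic per hour of day (index 0 = midnight).
HOURLY_WEIGHT = [0.2] * 7 + [1.0] * 16 + [0.2]


def effective_age_hours(submitted, now, weights=HOURLY_WEIGHT):
    """Elapsed time in hours, each clock hour scaled by its traffic weight."""
    age = 0.0
    t = submitted
    while t < now:
        # Walk to the end of the current clock hour, or to `now` if sooner.
        next_hour = t.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
        end = min(next_hour, now)
        age += weights[t.hour] * (end - t).total_seconds() / 3600
        t = end
    return age


# A post submitted at midnight has aged 8 wall-clock hours by 8am,
# but far fewer weighted hours with the table above.
print(effective_age_hours(datetime(2010, 1, 1, 0, 0), datetime(2010, 1, 1, 8, 0)))
```

The result would then stand in for AGE in RANK = SCORE / AGE^ALPHA.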

Second, there is a lot of filtering to do. Oknotizie is completely built out
of anti-spam measures: statistical analysis of users' voting patterns, cycle
detection, an algorithm penalizing similarities in general on the home page,
and so forth.

Running a simple HN-style site is easy as long as the community is not trying
hard to game it. Otherwise it becomes a much more complex (and sad)
affair.

------
barrkel
A problem (IMHO) with the HN ranking algorithm is that once a post fails to
get traction (perhaps because things were busy at the time it was submitted),
it won't really be able to get traction later, even if it's re-discovered 6
hours or 2 days later. Seems to me like velocity ought to be taken into
account a little more for items that have otherwise languished.
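
One possible way to weigh velocity, sketched here under the assumption that every vote's timestamp is stored: decay each vote individually instead of decaying the item as a whole. This is plain exponential decay per vote, not anything HN is known to do, and the 6-hour `half_life_hours` is an arbitrary illustrative value.

```python
def decayed_score(vote_times, now, half_life_hours=6.0):
    """Sum of votes, each weighted by 0.5 ** (vote age / half-life).

    vote_times and now are epoch seconds.
    """
    return sum(
        0.5 ** ((now - t) / 3600 / half_life_hours)
        for t in vote_times
    )


HOUR = 3600
now = 1_000_000  # arbitrary reference time in seconds

# An item that languished (10 votes two days ago), then got 9 fresh votes.
revived = [now - 48 * HOUR] * 10 + [now - 0.5 * HOUR] * 9
# A brand-new item with 6 fresh votes.
fresh = [now - 0.5 * HOUR] * 6

print(decayed_score(revived, now), decayed_score(fresh, now))
```

Because old votes fade but do not drag the item down, the rediscovered post can outrank the newer one once it collects more recent votes.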

~~~
wensing
This has been the case with my bootstrapping post. It has been revived a few
times thanks to tweets and other references since its initial publication to
HN, but it has never risen back to where it was, even though it has twice as
many points as it did 11 days ago.

------
yesbabyyes
Here's an explanation of other ranking algorithms, including Bayesian average,
Wilson score and the ones used on HN, Reddit, StumbleUpon and Del.icio.us:

<http://blog.linkibol.com/2010/05/07/how-to-build-a-popularity-algorithm-you-can-be-proud-of/>

------
jacquesm
This is not the 'hacker news' ranking algorithm, this is the ranking algorithm
distributed with 'ARC', which is the basis for the HN algorithm, but
definitely not equal to it.

The biggest missing ingredients are flagged posts dropping off quicker and
posts that contain no URL dropping off quicker but there are quite a few other
subtle tweaks.

The (very good) reason why the ARC sources do not give out the real ranking
algorithm is to make it a bit harder to game the system.

~~~
jackowayed
He glossed over it, but this code does include URL-free posts dropping off
faster. See the stuff dealing with "nourl-factor*". I don't know arc at all,
but it appears that having no URL multiplies your final score by a factor of
.4, meaning that it's ranked almost 3x lower than it otherwise would be. That
surprises me; I've noticed that Ask HN get rated lower, but it doesn't seem
that extreme.

So is Hacker News a fork of news.arc, rather than straight news.arc? I
figured it was, but never heard that officially (since people refer to
news.arc as the HN "source code").

Edit: Also, the "lightweight" thing is interesting. There's something in place
that sees if the post is a "rallying cry" or is mostly made of images.
Additionally, if you link directly to an image file, or to some list of
domains that have been deemed lightweight, that'll get marked as lightweight
as well. Lightweight posts have a .3 factor, meaning that they're even more
deflated than URL-free posts.

~~~
jacquesm
Ah yes, you're right, the 'nourl' is there, it's in the arc bit, but I
couldn't find that in the graphs or in the python code.

------
bergie
I built a reasonably similar ranking system a few years ago, but also taking
social media interaction (blog links, delicious bookmarks, etc) with the
content items being ranked into account:
<http://bergie.iki.fi/blog/calculating_news_item_relevance/>

You can see it in action on maemo.org: <http://maemo.org/news/>

PHP sources:
<http://trac.midgard-project.org/browser/branches/ragnaroek/midcom/org.maemo.socialnews>

------
qeorge
I made an HN filter for myself, that's basically:

points / comments

It works shockingly well. It's here if anyone would like to check it out:
<http://www.upthread.com/>
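
As a rough illustration of the idea (not the actual upthread.com code, and whether the site sorts or thresholds on the ratio isn't stated here), the sketch below sorts by the ratio. The sample stories are invented, and treating zero comments as one to avoid dividing by zero is an assumption.

```python
def ratio(points, comments):
    """points / comments, treating zero comments as one."""
    return points / max(comments, 1)


# Invented sample data, for illustration only.
stories = [
    {"title": "heated flamewar", "points": 40, "comments": 200},
    {"title": "quietly useful", "points": 90, "comments": 15},
    {"title": "brand new", "points": 3, "comments": 0},
]

ranked = sorted(
    stories,
    key=lambda s: ratio(s["points"], s["comments"]),
    reverse=True,
)
for s in ranked:
    print(s["title"], round(ratio(s["points"], s["comments"]), 2))
```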

~~~
fizx
Yep, you basically have a controversy filter.

------
DeusExMachina
Does anybody care to explain the strange indentation I see in the Arc code? I
know Lisp a little (mostly Clojure), but I don't get the indentation of the
code at the bottom of the algorithm where it looks like it's branching in two
parts. Is this peculiar to Arc? Or to other Lisps as well?

~~~
rntz
That's because of the way Arc does if-expressions. In Scheme and Common Lisp,
you have two conditional forms: 'if and 'cond. 'if takes three arguments (or
two, with an implied nil, in CL), and is analogous to an if-then-else in other
languages:

    
    
        > (if #t 'then 'else)
        then
        > (if #f 'then 'else)
        else
    

'cond, in contrast, takes as many branches as you like, but they're
parenthesized like so:

    
    
        > (cond (#f 'a) 
                (#t (display "I can have a body here!\n")
                    'b)
                (#t 'c))
        I can have a body here!
        b
    

In arc, there's just 'if, which is like 'cond with a lot of implicit
parenthesization and else-branches:

    
    
        arc> (if t 'then 'else)
        then
        arc> (if nil 'then)         ; if no else-branch is given, nil is implied
        nil
        arc> (if nil 'a
                 t   (do (prn "'do is like Scheme's 'begin or CL's 'progn.")
                         'b)
                 'else)
        'do is like Scheme's 'begin or CL's 'progn.
        b

------
kens
For more details on the algorithm, see my article "Inside the news.yc ranking
formula" from last year:
<http://www.arcfn.com/2009/06/how-does-newsyc-ranking-work.html>

------
gsivil
Nice post. Do you know what the algorithm for ranking comments is? I
think it would be interesting to write about that too.

------
johns
The home page algorithm seems to have changed recently, but the RSS feed
hasn't and is much noisier. It would be nice to see the RSS feed updated to
reflect the home page changes (or confirmation that I'm just perceiving a
difference that doesn't actually exist).

------
callmeed
So, if you're implementing this in a framework like Django or Rails, can you
get a result set in this order directly from a query? Or do you have to query
then sort?

~~~
endtime
In Django, you could make the ranking score a property of the model and then
do query.order_by('-score').
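
One caveat: Django's `order_by` sorts on database fields or annotations, so a Python-level property would have to be stored or expressed in the query first. To show the database doing the ordering itself, here is a framework-free sketch using the standard library's `sqlite3`, with the simplified formula from this thread registered as a SQL function. The `story` table, its columns, the sample rows, and the `hn_rank` name are all made up for illustration; a real app would compute the age from a timestamp column.

```python
import sqlite3


def rank(points, age_hours, gravity=1.8):
    """Simplified rank: (points - 1) / (age_hours + 2) ** gravity."""
    return (points - 1) / (age_hours + 2) ** gravity


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL as hn_rank(points, age_hours).
conn.create_function("hn_rank", 2, rank)

conn.execute("CREATE TABLE story (title TEXT, points INTEGER, age_hours REAL)")
conn.executemany(
    "INSERT INTO story VALUES (?, ?, ?)",
    [("old but popular", 120, 20.0), ("fresh", 6, 1.0), ("stale", 3, 9.0)],
)

# The ordering happens inside the query, not in application code afterwards.
rows = conn.execute(
    "SELECT title FROM story ORDER BY hn_rank(points, age_hours) DESC"
).fetchall()
titles = [row[0] for row in rows]
print(titles)
```

The score changes with every passing minute, so a stored score column goes stale unless it is recomputed periodically; ranking in the query avoids that at the cost of evaluating the function per row.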

------
tamersalama
A basic question: If this was performed on page load, and in-memory, how would
the first db fetch occur? Unless this is pushed 'somehow' to the database.

------
brianbreslin
so this means you'd be best served submitting something at or near peak
hours?

what are the peak use hours on HN? since everyone is a hacker, i'd assume it
was evening hours on east and west coast US as heaviest load? not 9-5 ET?

~~~
pavel_lishin
Don't forget employed people with ADD, or people who are bored at work, who
stroll over here when they need a break from their job.

------
b_emery
Looks like there is a built in max lifetime of about 5-10 hrs.

~~~
seldo
I feel like lots of big stories have managed to last overnight, so at least 12
hours, so I don't think this can be true. I've no idea how I would prove that
though.

~~~
thiele
This is also anecdotal but I think TechCrunch's "AngelGate" article was on the
front page for almost 3 days.

~~~
b_emery
Maybe 'half life' is a better way to put it. Looking at the graphs, the score
drops to ~1 in 5-10 hrs, unless the points are very high.
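
A quick check of that estimate, using the simplified `(points - 1) / (age_hours + 2) ^ 1.8` formula from this thread and assuming the points stop changing (which they rarely do on a live story). `hours_until_rank` is a hypothetical helper that solves the formula for age.

```python
def hours_until_rank(points, target=1.0, gravity=1.8):
    """Age in hours at which (points - 1) / (age + 2) ** gravity == target."""
    return ((points - 1) / target) ** (1 / gravity) - 2


for points in (50, 100, 500):
    print(points, round(hours_until_rank(points), 1))
```

Under those assumptions a 50-point story reaches a rank of 1 after roughly 7 hours and a 100-point story after roughly 11, so the 5-10 hour window fits stories in that range, while much higher scores last considerably longer.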

~~~
noarchy
I've wondered if there aren't some hands-on efforts (as in, non-automated) to
keep particular stories on the front page.

------
d0m
I thought the "secret sauce" was hidden from Arc sources..?

~~~
tptacek
My understanding is that the actual HN implementation is heavily customized
and not public.

------
10smom
Thanks for the info! now I need to get my son to explain to me. :)

