
How Not To Sort By Average Rating - ntoshev
http://www.evanmiller.org/how-not-to-sort-by-average-rating.html
======
gjm11
The wrong solutions the article describes are certainly wrong. It doesn't
really make much of a case that the particular solution it proposes is
particularly good.

For instance, here's another much simpler formula that avoids the problems
described in the article: score = (pos+1)/(pos+neg+2). It's the posterior mean
of Pr(random person likes the product), if your prior is uniform on [0,1]. You
can adjust this to favour items with more ratings by, e.g., using a prior that
"prefers" smaller values of that probability; one easy way to do that is to
pretend that every item starts with a certain number of positive reviews and a
certain number of negative reviews; you end up with a formula of the form
(pos+A)/(pos+neg+B).
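In code, that smoothing amounts to just a couple of lines (a minimal Python sketch; the function name and default prior are my own):

```python
def smoothed_score(pos, neg, prior_pos=1, prior_neg=1):
    """Posterior mean of Pr(random person likes the item): the
    (pos + A) / (pos + neg + B) form, where the priors act as
    fictitious head-start reviews (A = prior_pos, B = prior_pos + prior_neg)."""
    return (pos + prior_pos) / (pos + neg + prior_pos + prior_neg)

# Uniform prior on [0,1]: an unrated item scores exactly 0.5
print(smoothed_score(0, 0))  # 0.5
# A skeptical prior (extra fictitious negatives) penalizes sparsely rated items
print(smoothed_score(10, 1, 1, 5))
```

With the skeptical prior, the 10-positive/1-negative item drops well below its raw 10/11 average, which is exactly the "prefer smaller probabilities" effect described above.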

Is the article's formula better than this? Probably, but the author hasn't
said why.

~~~
tel
If we're going to use a Bayesian treatment, the real sin is throwing away your
uncertainty.

What I would love to see is a ranking algorithm that provides an elegant,
intuitive interface to fuzzy ranking. If we can't be confident of the
ordering, why pretend? It's a usability problem, not a statistics problem.

Your estimator is slightly more resilient than Amazon's and somewhat less
conservative than the author's. It's definitely simpler, but who cares?
Visitors never have to compute it, after all.

~~~
gjm11
I think it's a _huge_ usability problem, if we're talking about something
Amazon-like that's presenting ranked data to an unsophisticated audience. Why
pretend? Because it produces something the user might be able to understand,
that's why.

(Yes, it would be extremely cool to make an elegant, intuitive interface to a
fuzzy ranking. One reason why it would be extremely cool is that it's totally
not obvious how it can be done, or even _that_ it can be done.)

I wasn't suggesting that my estimator is better because it's simpler, nor that
anyone should care; just pointing out that the author of the article made a
huge leap from "here are two things that don't work" to "and here is the
specific arbitrary math-heavy thing I think you should do instead".

~~~
ntoshev
With a Bayesian average (<http://news.ycombinator.com/item?id=1219154>,
<http://wiki.answers.com/Q/What_does_true_Bayesian_estimate_mean_in_connection_with_the_IMDb_Top_250_ratings>)
you can easily put up a slider controlling how much the missing votes count (or
how much preference you give to items with more ratings).
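A sketch of the slider idea (variable names mine; `weight` is the slider value, interpreted as phantom votes cast at the prior mean):

```python
def bayesian_average(ratings, prior_mean, weight):
    """Shrink an item's mean rating toward a site-wide prior_mean.
    `weight` acts as the slider: it is the number of phantom votes at
    prior_mean, so a larger weight demands more real ratings before an
    item can move away from the prior."""
    return (weight * prior_mean + sum(ratings)) / (weight + len(ratings))

# Two 5-star ratings barely move an item under a heavy prior...
print(bayesian_average([5, 5], prior_mean=3.0, weight=20))
# ...but under a light prior the same two ratings dominate.
print(bayesian_average([5, 5], prior_mean=3.0, weight=2))  # 4.0
```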

------
JustSemantics
It's very nice to see a mathematically motivated piece on how to sort based on
user ratings. But there are two problems with the solution presented, the
first minor and the second more serious.

Minor: Distributions of ratings are not usually normal about the average
possible rating -- they are usually skewed toward the positive end, because
people are pretty good at selecting what they read or view. The solution
presented will tend to pull low-frequency ratings too far toward the average.
It should be easy to calculate an expected average rating for your site, and
use that instead of 1/2 in the formula given in the post.

Major: What you really need is a missing data point: the number of readers who
haven't made any rating at all. It's pretty clear if you think of two
examples:

Item 1: 1,000,000 views, 10 positive ratings, 1 negative rating

Item 2: 15 views, 10 positive ratings, 1 negative rating

Clearly, readers feel pretty blah about Item 1. You're much better off
displaying Item 2, other things being equal, but the formula doesn't
distinguish the two cases.
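To make the gap concrete, here is a Python translation of the post's Wilson lower bound (z hard-coded for roughly 95% confidence); note that view counts never appear among its inputs, so the two items above necessarily receive identical scores:

```python
from math import sqrt

def wilson_lower_bound(pos, n, z=1.96):
    """Lower bound of the Wilson score interval for a Bernoulli proportion.
    Takes only positive ratings and total ratings; views are not an input."""
    if n == 0:
        return 0.0
    phat = pos / n
    return ((phat + z*z/(2*n) - z * sqrt((phat*(1 - phat) + z*z/(4*n)) / n))
            / (1 + z*z/n))

# Both items: 10 positive, 1 negative -- the view counts can't influence this.
print(wilson_lower_bound(10, 11))  # same value for Item 1 and Item 2
```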

~~~
yish
I agree, especially that the skew needs to be site-dependent. For example, the
iPhone App Store actually has the opposite problem, where its rate-on-delete
"feature" skews the average ratings way down. Reviews are heavily concentrated
at 1-star and 5-star ratings.

Would also be curious to see a comparison of different rating systems; for
example, I'd like to see the distribution on a thumbs-up/down site like Digg
vs. a 5-star system.

~~~
JustSemantics
One data point is the Netflix rating set from their contest. The system is 1
to 5 stars. The average movie rating was 3.4 in 1998, climbing to 3.8 in 2005.

Back to the original point -- if you used a formula with a default rating of 3
it would underestimate low-frequency movies. People don't view movies at
random. They try to pick things they will like, which should be reflected in
the statistical model.

------
Automatt-2
This is a very helpful blog post from about a year ago. We implemented
something similar to a Wilson score to power the default sorting for the
millions of reviews we have at RateItAll. In general, with a system like this,
the items with the most reviews are going to rise to the top. This might not
be what you want. You can easily see this effect with something like our
"Events of 2010" list: <http://www.rateitall.com/t-2987869-events-of-2010.aspx>

~~~
Perceval
>items with the most reviews are going to rise to the top

Yes and no. Anything with a larger N is going to make for greater confidence
in the rating. But that rating still depends on the number of positive and
negative reviews. So if a product has a large number of ratings, but they are
50/50 positive and negative, then it's not going to rise to the top simply
because it has the most reviews. The algorithm would presumably be very
confident about placing it in the middle of the pack.

~~~
Automatt-2
>So if a product has a large number of ratings, but they are 50/50 positive
and negative, then it's not going to rise to the top simply because it has the
most reviews.

What I'm saying is that it will rise to the top because, even though only 50%
of its ratings are positive, it will have many, many more positive ratings
than the next nearest item. That second item may have a much higher percentage
of positive ratings, but a much lower total number of ratings.

~~~
DougBTX
It seems to work as described in the article:

    
    
        require 'rubygems'
        require 'statistics2'
    
        def ci_lower_bound(pos, n, power)
            if n == 0
                return 0
            end
            z = Statistics2.pnormaldist(1-power/2)
            phat = 1.0*pos/n
            (phat + z*z/(2*n) - z * Math.sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)
        end
    
        puts ci_lower_bound(60, 100, 0.10)   # => 0.517809505446319
        puts ci_lower_bound(500, 1000, 0.10) # => 0.474027691168875
        puts ci_lower_bound(50, 100, 0.10)   # => 0.418847795168265
    

The top ranked result has 60 positives out of 100, and beats 500 out of 1000 -
almost 10 times as many positive votes.

------
CWuestefeld
I recently needed to do this for sorting search results. My approach was to
calculate a Bayesian average (<http://en.wikipedia.org/wiki/Bayesian_average>
), which was also pretty simple (once I found a formula to copy from here:
<http://www.thebroth.com/blog/118/bayesian-rating> )

It seems to me that the Bayesian average is superior, because it considers the
value relative to other items.
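A sketch of the kind of formula those posts describe (parameter names are mine): the item's own mean rating is blended with the site-wide mean, weighted by the average number of votes per item, which is what makes it relative to the rest of the catalog:

```python
def bayesian_rating(avg_num_votes, avg_rating, this_num_votes, this_rating):
    """Weighted blend of an item's rating with the site-wide average rating;
    items with few votes are pulled strongly toward avg_rating."""
    return ((avg_num_votes * avg_rating + this_num_votes * this_rating)
            / (avg_num_votes + this_num_votes))

# Site-wide: items average 100 votes with mean rating 3.5.
# A new item with two 5-star votes is barely nudged above the site mean:
print(bayesian_rating(100, 3.5, 2, 5.0))
```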

~~~
Perceval
The Bayesian example given in the Wikipedia article looks like it works best
with interval level data. Not sure it would work as well with ordinal data,
because ordinals don't suffer to the same degree from outliers.

The second example also isn't ordinal data—it's just nominal "liked" vs
"didn't like."

Both examples are interesting, but don't necessarily get at the problem with
ordinal data introduced by the article.

------
jedberg
reddit uses the Wilson algo for comment sorting:

<http://blog.reddit.com/2009/10/reddits-new-comment-sorting-system.html>

------
ck2
I use what reddit uses, works well

<http://code.reddit.com/browser/r2/r2/lib/db/sorts.py>

    
    
      # excerpt from reddit's hot sort (r2/lib/db/sorts.py); assumes
      # `from math import log`, score(ups, downs) = ups - downs, and
      # epoch_seconds(date) = seconds since the Unix epoch
      def hot(ups, downs, date):
          s = score(ups, downs)
          order = log(max(abs(s), 1), 10)
          sign = 1 if s > 0 else -1 if s < 0 else 0
          seconds = epoch_seconds(date) - 1134028003
          return round(order + sign * seconds / 45000, 7)
    

Let me know if you want my attempt at a MySQL translation of that.

 _ps. why is my account (_ck_) so slow when logged in, and my comments marked
as "dead"? Was I flagged as a spammer by mistake?
Example: <http://news.ycombinator.com/item?id=1219475>_

~~~
araneae
That's not really appropriate for what this blog post was talking about, which
is rating products. It doesn't make sense for products' ratings to be time
sensitive. That would make the most recent items have an artificially inflated
score.

~~~
ck2
But you want to control the weight of a rank for an aging product, no?

Let's say you are ranking cellphones that a store sells and someone has a fond
memory of a cellphone they used a few years ago so they vote it up. But the
phone itself was posted years ago, compared to the newest phones that came out
this year. You don't want their vote to have as much weight because of the
product age.

~~~
araneae
You could make an argument for using that for, say, deciding on which products
should be featured on a page that's promoting products to consumers.

But for actual product "scores" that people see that's a very bad idea. A book
shouldn't have a 4/5 if everyone rated it a 5/5 just because it came out 50
years ago.

------
yesbabyyes
A more thorough post on the same subject has been posted on HN before: How to
build a popularity algorithm you can be proud of
<http://news.ycombinator.com/item?id=797740>.

It's been a great resource for me.

EDIT: Fixed the link.

------
aw3c2
He shows the flaws of both examples but only gives a solution to the first
one. How does one nicely sort a rating scale like Amazon's that does have more
than "yes" and "no"?

~~~
gjm11
I don't know offhand how the Wilson thing generalizes. The simple Bayesian
posterior mean I mentioned yesterday is pretty simple: suppose you have
a1, a2, ..., ak ratings with 1, 2, ..., k stars; then the posterior mean
number of stars is

    [(1+a1)*1 + (1+a2)*2 + ... + (1+ak)*k] / [(1+a1) + (1+a2) + ... + (1+ak)]

(Of course the denominator can be written more briefly, but I think it's
clearer that way.) Equivalently, you just calculate the average star count,
but you add in a bunch of fictitious reviews, one with each possible number of
stars. Once again, by adjusting the number of each kind of fictitious review
you can (e.g.) make less-reviewed items do better or worse relative to
more-reviewed ones; by changing the coefficients 1, 2, ..., k above you can
(e.g.) make a 5-star review count more than 5/4 times as much as a 4-star one;
etc.
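The k-star version above, as a short Python sketch (function name mine; the fictitious-review counts are the adjustable prior):

```python
def posterior_mean_stars(counts, prior=None):
    """counts[i] = number of (i+1)-star ratings received.
    Adds fictitious reviews (by default, one at each star level)
    before taking the ordinary weighted average."""
    if prior is None:
        prior = [1] * len(counts)
    totals = [c + p for c, p in zip(counts, prior)]
    return sum((i + 1) * t for i, t in enumerate(totals)) / sum(totals)

# An unrated item on a 5-star scale sits at the midpoint:
print(posterior_mean_stars([0, 0, 0, 0, 0]))  # 3.0
# A hundred 5-star ratings pull it up close to 5:
print(posterior_mean_stars([0, 0, 0, 0, 100]))
```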

------
JarekS
We have used this algorithm in our side project
<http://movieorwhat.com/month/> - it seems to be working very well!

------
ZeroGravitas
Nice. I'd figured out the problems with the 2 wrong solutions myself but I
wouldn't have known where to start looking for the better approach.

------
barmstrong
This is the most useful post I've read in a long time - AWESOME.

------
mrcharles
Oh god, wall of math!

/me hangs head in shame.

