
How Not To Sort By Average Rating - marcus
http://www.evanmiller.org/how-not-to-sort-by-average-rating.html
======
ntoshev
Why take the lower bound? It means you systematically underestimate the items
with fewer ratings. Also, this formula assumes a normal distribution.

There is another solution called 'True Bayesian Average' that is used on
IMDB.com, for example. For the formula and the explanation how it works see
here:

<http://answers.google.com/answers/threadview/id/507508.html>
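
A minimal sketch of that weighted-average formula in Python, following the
parameterization in the linked thread (m, the votes needed for full weight,
and C, the across-the-board mean rating, are values you tune yourself):

    def true_bayesian_average(R, v, m, C):
        # R = the item's mean rating, v = its number of votes,
        # m = votes required for full weight, C = site-wide mean rating.
        return (v / (v + m)) * R + (m / (v + m)) * C

    # Three perfect votes barely move off the site mean; 500 votes do.
    print(true_bayesian_average(5.0, 3, 25, 3.2))    # ~3.39
    print(true_bayesian_average(4.6, 500, 25, 3.2))  # ~4.53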

~~~
timr
It assumes that the aggregate outcome of a large number of Bernoulli trials
(i.e. true/false; up/down; good/bad) is distributed binomially, which can be
reasonably approximated by the normal distribution when the number of trials
is large. This is technically true, and works for HN-style voting.

This isn't true for Amazon ratings, since ranking something 1-5 isn't a
Bernoulli trial. But the central limit theorem says that the _average_ of
those ratings will be normally distributed (assuming that they're identically
distributed and independent), so it can still work. The confidence interval is
different, however.
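
For concreteness, the lower bound of the Wilson score interval the article
recommends is only a few lines of Python (z = 1.96 gives a 95% confidence
level):

    from math import sqrt

    def wilson_lower_bound(pos, n, z=1.96):
        # Lower bound of the Wilson score interval for the true fraction
        # of positive ratings; pos = positive ratings, n = total ratings.
        if n == 0:
            return 0.0
        phat = pos / n
        return ((phat + z * z / (2 * n)
                 - z * sqrt((phat * (1 - phat) + z * z / (4 * n)) / n))
                / (1 + z * z / n))

    # 10 up / 0 down outranks 60 up / 40 down, despite far fewer votes:
    print(wilson_lower_bound(10, 10))   # ~0.72
    print(wilson_lower_bound(60, 100))  # ~0.50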

------
DavidSJ
Actually, I think Amazon has it right, for several reasons:

1. The user can actually _understand_ and _predict_ the behavior.

2. The "problem" the OP identifies is partially self-correcting, because
items with a few positive ratings get more attention as a result of their high
ranking, and if they deserve more poor ratings, they'll get them.

3. As long as you _tell_ users how many ratings there are, they can use their
own judgment as to how important that is.

------
jim-greer
On Kongregate we do the same thing that Newgrounds does - games don't display
an average rating or appear in the rankings until they have a minimum number
of ratings. For us that's 75 ratings.

Of course we only get a hundred or so new games a week, so it's not hard to
get that many ratings. Much harder for a site with lots of stuff, especially
if they have lots of stuff from day one.
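
That policy is a one-line filter before ranking; a sketch in Python (the field
names are placeholders):

    MIN_RATINGS = 75  # Kongregate's threshold, per the comment above

    def ranked_games(games):
        # Hide games until they have enough ratings, then sort by average.
        eligible = [g for g in games if g["num_ratings"] >= MIN_RATINGS]
        return sorted(eligible, key=lambda g: g["avg_rating"], reverse=True)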

------
spolsky
Speaking of Amazon -- I'd rather buy a book with lots of 1s and 5s than a book
with straight 3s. Wouldn't you?

~~~
fgimenez
So you'd rather sort by variance?

I don't see why (outside of security issues) we can't just define our own
sorting functions for a site.
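
For what it's worth, that's easy to try; a sketch using the population
variance of the raw 1-5 star lists (high variance = polarizing, which is what
spolsky is asking for):

    from statistics import pvariance

    def by_polarization(items):
        # items: {title: [1-5 star ratings]}; most polarizing first.
        return sorted(items, key=lambda t: pvariance(items[t]), reverse=True)

    books = {"love-it-or-hate-it": [1, 1, 5, 5, 5, 1],
             "straight threes": [3, 3, 3, 3]}
    print(by_polarization(books))  # ['love-it-or-hate-it', 'straight threes']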

~~~
alexandros
This is a dream of mine as well, but performance may be a problem. Custom
functions mean no caching and arbitrary computational burden from each
function.

~~~
ars
It doesn't have to be totally custom. Just offer a few representative sort
functions and let users pick among those. This also solves the problems of
users not knowing how to write such a function and of running user code on the
server.
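
A sketch of that middle ground: a fixed, server-defined menu of sort keys, so
nothing user-written ever runs and each ordering stays cacheable (the menu
entries here are just examples):

    from statistics import mean, pvariance

    SORT_KEYS = {
        "average":  mean,        # plain average rating
        "count":    len,         # most-rated first
        "variance": pvariance,   # most polarizing first
    }

    def sort_items(items, strategy):
        # items: {title: [ratings]}; unknown strategy names raise KeyError.
        key = SORT_KEYS[strategy]
        return sorted(items, key=lambda t: key(items[t]), reverse=True)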

~~~
IsaacSchlueter
Better yet, play around with lots of different sorting methods, and look for
patterns in the sorting method used (s), the user's data and history (h), and
the amount of money the user ends up spending ($).

Maximize $, and don't ask stupid questions like "Would you rather sort by
variance, average rating, or Wilson score?"

(I'm pretty sure that's what Amazon is busy doing every day. They're
positively brilliant at turning data into money.)

------
sachinag
OK, I came to this late, but there are other solutions.

On Dawdle, we spent _a ton_ of time thinking about this. For their Marketplace
sellers, eBay does number one; Amazon does number two. What we do _is not
three_, but something that I think is better: we just rank our users.

Look, you can't ask buyers on the site to rank any particular transaction
against all others; that's crazy talk, especially as you get into large
numbers and people forget about all their experiences ever on a site. But we
guide all our buyers to leave feedback of 3 - not 5 or a positive. The hope is
that you'll have a semi-normal distribution of all feedback. _Then_, and only
then, do we do all sorts of stuff to that. We have bonus points for good
behaviors, some of which we talk about (shipping quickly, using Delivery
Confirmation, linking your Dawdle account to Facebook, MySpace, XBL, PSN, etc)
and some we don't. Then we just throw the ranks on a five point scale.

We call this our Seller Rating, not a feedback rating. Feedback is just the
beginning of the process. KFC starts with the chicken - necessary but not
sufficient - then adds their 11 herbs and spices. That's what we do, and it's
why we don't even allow users to see the individual feedbacks. They're
designed to be useless individually, and I don't want some new eBay expat
bitching about not having 100% feedback.

Does this mean that some people are going to end up with 1s on a 5 point
scale? You betcha. And we don't want them anyway - they're more hassle than
they're worth. They can go to eBay or Amazon.
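
A sketch of that final mapping step, under the assumption that each seller
already has a composite score (all names are invented; Dawdle's actual
weighting isn't public): rank the sellers, then bucket percentile ranks onto a
five-point scale.

    def seller_ratings(scores):
        # scores: {seller_id: composite score}. Returns {seller_id: 1-5},
        # assigned by percentile rank, so ratings are relative by design -
        # someone always ends up with a 1.
        ordered = sorted(scores, key=scores.get)
        n = len(ordered)
        return {s: 1 + (5 * i) // n for i, s in enumerate(ordered)}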

------
gojomo
Yelling "WRONG" at two popular options isn't much of an argument. I suspect
that Amazon has a good profit-maximizing reason for their ordering.

~~~
paulgb
But the author didn't just yell wrong, he provided examples exposing the flaws
in both scoring models.

~~~
gojomo
He showed screenshots, without explaining anything wrong with the rankings
unless the reader already intuitively agreed with him.

But why isn't the net-positive score the right thing for UrbanDictionary, and
the average the right thing for Amazon? No case was made.

I could argue these either way, but if I were to write an attention-grabbing
headline and yell "WRONG", I'd actually make a case. Attitude alone is not an
argument.

------
raganwald
Could it be that the _business requirement_ for the ranking system is actually
that users be given the illusion of participating in a social site, because
ultimately the rankings have very little impact on overall sales?

Meanwhile, the "Users who bought X also bought Y" feature has a very strong
impact on sales, so the engineers spend all of their time tweaking its
algorithm?

------
acangiano
The first example is wrong. 60 - 40 = 20, which is greater than 100 - 100 = 0.

~~~
pyroman
I noticed that too. The numbers in the picture do support his point, though:
209/259 = 80.7% is listed higher than 118/143 = 82.5%.

~~~
raganwald
Sadly, the proggit comments are overwhelmed by debate as to whether a small
error like this means the author is an idiot. What makes this happen? It's not
just debating the color of the bikeshed; it's questioning the qualifications
of the nuclear power plant architect on the basis of the bikeshed.

~~~
timr
Amen. This is a big reason that I can't see myself working as a developer for
the rest of my life. I can't deal with the abundance of passive-aggressive
personalities who will seize on tiny errors like this, and discount an
individual's intelligence.

If you ask me, it comes from the fact that software geeks are primarily young
men, many of whom were emotionally abused by their peers while growing up.
When you think the world only values your intelligence, you go out of your way
to prove that you're smarter than everyone else.

~~~
davidw
Hah! If you want abuse, you should see some of the frothing at the mouth that
happens about LangPop.com, which, as clearly stated on the site, is never
going to be more than an approximation. Man, does it set people off when their
favorite language doesn't do as well as they think it should:

<http://journal.dedasys.com/2009/01/08/angry-perl-users>

------
dilap
Silly that he messes up the first example, but I've wanted something like this
third solution when sorting by rating on Amazon -- when a product has only a
couple of positive reviews, it's almost the same as being unreviewed. (On the
other hand, just a couple of negative reviews can often be quite helpful,
since they can list concrete problems that make the product bad -- a positive
review has the much harder case to make, namely "there will be nothing bad
about this product".)

------
joshu
Yeah, this always drives me nuts.

Assuming that someone's rating is only a noisy estimate of their true opinion,
clipped to an integer - the more ratings there are, the lower the maximum
average you'll ever see.

You never see something with hundreds of ratings actually score a 5.0, even if
most people love it. And the fact that there ARE hundreds of ratings of that
thing, and not of some comparable thing, is also important.
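
A quick simulation of that ceiling effect (the numbers are invented: everyone
"loves" the item, but 10% of fans still give it a 4 instead of a 5):

    import random

    def max_observed_average(n_ratings, n_items=10000, p_five=0.9):
        # Best average seen across many items whose raters give a 5
        # with probability p_five and a 4 otherwise.
        best = 0.0
        for _ in range(n_items):
            total = sum(5 if random.random() < p_five else 4
                        for _ in range(n_ratings))
            best = max(best, total / n_ratings)
        return best

    for n in (1, 10, 100, 1000):
        print(n, max_observed_average(n))  # falls from 5.0 toward ~4.9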

~~~
ntoshev
If you treat the number of ratings as an indicator of quality, you need to
account for the time the item has been available for rating as well. Otherwise
you underestimate newer items.

~~~
anewaccountname
Not just time available but also units sold; if you are looking for a niche
item, you don't expect a lot of ratings.
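
A sketch of that normalization, assuming you record when an item was listed
and how many units it has sold (the field names are placeholders):

    import time

    def ratings_per_exposure(item, now=None):
        # Normalize the rating count by days listed and units sold, so
        # new or niche items aren't penalized just for low exposure.
        now = now if now is not None else time.time()
        days_listed = max((now - item["listed_at"]) / 86400.0, 1.0)
        units_sold = max(item["units_sold"], 1)
        return item["num_ratings"] / (days_listed * units_sold)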

------
aneesh
This is a decent way to estimate the average score. But sometimes you're not
really trying to estimate the average score. You're really trying to estimate
the score that the user who's currently viewing the page would give it,
especially for someone like Amazon (i.e., "People Like You Rated this Product
as 1-star").

His estimator might be a decent one at picking the average score, but in
Amazon's case, it's not a great estimator of the rating I would give the
product I am viewing. If you have such an extensive record of my past
purchases, use it to predict how I would like certain products! Surely the
5-star ratings of SICP are more applicable to me than the 1-star ratings.
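
A toy sketch of that direction - predict the viewer's rating as a
similarity-weighted average over other users who rated the item. The
similarity measure here is invented and crude; real recommenders use proper
collaborative filtering:

    def similarity(a, b):
        # Crude similarity: negative mean absolute difference over
        # co-rated items; None if the users share no rated items.
        common = set(a) & set(b)
        if not common:
            return None
        return -sum(abs(a[i] - b[i]) for i in common) / len(common)

    def predict_rating(me, others, item):
        # Similarity-weighted average of other users' ratings of `item`.
        num = den = 0.0
        for other in others:
            if item not in other:
                continue
            sim = similarity(me, other)
            if sim is None:
                continue
            weight = 1.0 / (1.0 - sim)  # maps (-inf, 0] onto (0, 1]
            num += weight * other[item]
            den += weight
        return num / den if den else None

    me = {"sicp": 5, "k&r": 5}
    others = [{"sicp": 5, "k&r": 4, "dummies": 1}, {"k&r": 1, "dummies": 5}]
    print(predict_rating(me, others, "dummies"))  # ~1.9: closer to 1 than 5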

------
tptacek
This is, I think, the bestest funniest post to hit Hacker News in awhile.

~~~
TimothyFitz
Evan and I worked together at IMVU, and at one point a senior engineer heard
one of my (appallingly bad) ideas and replied "You're fired." and went back to
work.

A few weeks later I, apparently, made a terrible joke. Seconds later this
shows up in my inbox:

Dear Timothy,

After careful deliberation, it's been decided that you need to be fired.

Reason: telling not funny jokes

You may take one (1) Odwalla for the road.

Thanks,

The Firing Committee

Not only was it hilarious but while I suspected Evan, he hadn't had time to
type that whole message! Over the next few weeks I received quite a few of
these e-mails, but with different messages:

Reason: distracting Evan from his conversation with Luc

Reason: Asking too many questions

etc.

It turns out that not only did Evan find the need to write a formal you're
fired letter; he wrote a web service to send formal you're fired letters. To
me.

~~~
patio11
_he wrote a web service to send formal you're fired letters. To me._

You're fired, for dragging down the average amount of awesome in any room Evan
is in.

------
timcederman
I wrote a little bit about a (mathematically naive) solution to the averages
problem on my blog (props to Eric Liu for his feedback) --
<http://www.cederman.com/?p=116>

------
kin
I have actually yet to encounter a single website with a rating system this
sophisticated.

It is particularly hard to find quality content on websites with a large
database (e.g. YouTube). Newer videos will always have a higher 'rating'
because they are newer. Older videos will always have the most 'views' because
they have had more time to accumulate them. If you can't reach a consensus
about the optimal rating system, implement a nested one. This way, the
end-user can sort by 'rating', then by 'date', then by 'views', or whatever
order he/she finds most useful.
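
That nested sort is a one-liner in Python with a tuple key (the field names
are placeholders); a sketch:

    def nested_sort(videos, keys=("rating", "date", "views")):
        # Sort by the first key, break ties with the second, then the third.
        return sorted(videos, key=lambda v: tuple(v[k] for k in keys),
                      reverse=True)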

------
jackowayed
_How Not To Sort By Average Rating_

Or, "How to Make Your Host Very Happy Because You Suddenly Have to Move Up To
a Server With Double the Processing Cycles"

~~~
patio11
I realize this is probably a joke, but the relative infrequency of rating, the
fact that this rating algorithm only requires information from the local
context[1], and the existence of caching mean that this is very, very
unlikely to be the bottleneck for any site.

[1] i.e. to calculate Harry Potter's rating you only need to know about Harry
Potter, not every other book ever printed
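
A sketch of that point, assuming a stored score column recomputed only on the
write path (db here is a placeholder object, and wilson_lower_bound is a
function like the one sketched earlier in the thread):

    def on_new_rating(db, item_id, is_positive):
        # Write path: bump the counts and recompute the stored score once.
        item = db.get(item_id)
        item["pos"] += 1 if is_positive else 0
        item["n"] += 1
        item["score"] = wilson_lower_bound(item["pos"], item["n"])
        db.save(item)
        # Read path never computes anything: ORDER BY score on an index.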

~~~
jackowayed
Yeah, you could cache it in the db until you get a new rating.

It would still slow you down a little though.

And yes, it was a joke :-)

------
biohacker42
Every site's crappy sorting has always bothered me; it reminds me of AltaVista
(before Google, kids).

I have not checked the math on this, but damn near anything should be better
than what most sites are doing today.

In fact, a better way to sort user ratings is a good idea for a startup.

------
lacker
The simple formula is to assume a Dirichlet prior. This gives you

    score = positives / (negatives + x)

and you can fiddle with x to get something that looks reasonable.
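
For comparison, the textbook posterior-mean version of that smoothing (a Beta
prior - the two-outcome case of the Dirichlet - where the pseudo-counts a and
b are the knobs to fiddle with) looks like:

    def smoothed_score(positives, negatives, a=1.0, b=1.0):
        # Posterior mean under a Beta(a, b) prior: behaves as if every
        # item starts with a extra upvotes and b extra downvotes.
        return (positives + a) / (positives + negatives + a + b)

    print(smoothed_score(1, 0))     # 0.667: one upvote, still uncertain
    print(smoothed_score(100, 10))  # ~0.902: the data dominates the prior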

------
diN0bot
pedantic: the urban dictionary example is voting, not rating. that's what
leads to the confusion of adding up all the positives and negatives.

amazon is rating. it's a scale and there is a clear mean, median, and standard
deviation.

whether solution #2 is wrong kind of depends. some of this is a social problem
of which items get the kind of followers who are likely to rate online. in the
extreme case you could imagine a product where (not) liking it was caused by
an inability to use html forms.

------
swombat
Excellent. Can we implement that on comments here? How hard would it be?

------
time_management
An easier solution is to give each item a number of implied mediocre ratings
(say ten 3s, worth 30 points) to start.

1 rating of 5.0: (30 + 5)/11 = 35/11 ≈ 3.18. 90 ratings averaging 4.5:
(30 + 405)/100 = 435/100 = 4.35.

The downside is that, since items with few ratings get mediocre scores, if
there isn't a way for them to become visible, they won't get out of that rut.
So a better approach might be to feature (on a "top 10" list) seven high
scorers and three "rising stars" selected purely on raw average score.

