
How Not to Sort by Average Rating (2009) - jmilloy
http://www.evanmiller.org/how-not-to-sort-by-average-rating.html
======
stdbrouw
Also discussed in Cameron Davidson-Pilon's _Bayesian Methods for Hackers_ in
the context of Reddit ups/downs:
[http://nbviewer.ipython.org/github/CamDavidsonPilon/Probabil...](http://nbviewer.ipython.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter4_TheGreatestTheoremNeverTold/Chapter4.ipynb#Example:-How-to-order-Reddit-comments)

For Amazon, though, which is the example in Evan Miller's post, I don't really
get why you'd first dichotomize the five-star rating into positive vs.
negative and then use Wilson intervals. Just construct a run-of-the-mill 95%
confidence interval for the mean of a continuous distribution and sort by the
(still plausible) worst-case scenario, i.e. the lower bound: `mean -
1.96 * SE`, where the standard error is `SE = stddev(scores)/sqrt(n)`.

Because of the central limit theorem, this works even if the scores themselves
are not normally distributed.
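
A minimal sketch of that sort (the `items` list and its `scores` attribute are
placeholder names, not anything from the thread):

    import numpy as np
    from scipy import stats

    def ci_lower_bound(scores, confidence=0.95):
        """Lower bound of the confidence interval for the mean score."""
        scores = np.asarray(scores, dtype=float)
        n = len(scores)
        if n < 2:
            return 0.0                            # too little data for a standard error
        se = scores.std(ddof=1) / np.sqrt(n)      # SE = stddev(scores)/sqrt(n)
        z = stats.norm.ppf((1 + confidence) / 2)  # 1.96 for a 95% interval
        return scores.mean() - z * se

    # sort by the (still plausible) worst case
    items.sort(key=lambda item: ci_lower_bound(item.scores), reverse=True)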

~~~
vcdimension
For better accuracy with small samples you could use the multinomial
distribution instead. The covariance matrix for the rating probabilities can
be found here for example:
[http://www.math.wsu.edu/faculty/genz/papers/mvnsing/node8.ht...](http://www.math.wsu.edu/faculty/genz/papers/mvnsing/node8.html)
Then the variance for the expected rating can be calculated as a weighted sum
of the values in the covariance matrix.
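
A minimal sketch of that calculation for a 1-5 star system; per the linked
page, the covariance matrix of the estimated probabilities is
(diag(p) - pp^T)/n:

    import numpy as np

    def rating_mean_and_var(counts, values=(1, 2, 3, 4, 5)):
        """Mean rating and its variance under a multinomial model.
        counts[k] is the number of ratings with value values[k]."""
        counts = np.asarray(counts, dtype=float)
        n = counts.sum()
        p = counts / n                           # estimated category probabilities
        w = np.asarray(values, dtype=float)
        cov = (np.diag(p) - np.outer(p, p)) / n  # covariance of the estimated p
        return w @ p, w @ cov @ w                # weighted sum over the covariance matrix

    mean, var = rating_mean_and_var([3, 0, 1, 2, 10])
    score = mean - 1.96 * var ** 0.5             # sortable lower bound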

These companies really should be hiring statistics consultants instead of
relying on the intuitions of their programmers.

~~~
stdbrouw
I'd prefer to just treat scores as continuous and correct using `t_ppf(.975,
n-1)` instead of the normal approximation (1.96), but I suppose working from a
multinomial distribution would give pretty similar results.
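
Assuming `t_ppf` refers to scipy's `stats.t.ppf`, the tweak to the sketch
above would be:

    from scipy import stats

    # replaces the fixed 1.96 from the normal approximation; n is the number of scores
    z = stats.t.ppf(0.975, df=n - 1)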

~~~
vcdimension
You're still relying on the central limit theorem (i.e. a reasonable amount of
data): using t instead of z just corrects for the fact that you only have
sample variances instead of population variances. However, I suppose it's not
unreasonable to assume that the ratings are likely to have a bell-shaped
distribution (which could be checked), so the normal/t approximation is
probably going to be OK.

~~~
stdbrouw
Ah yes, true. Let's call it a bias/variance tradeoff ;-)

------
danialtz
I have a fundamental problem with democratic voting systems: whatever the
general view likes tends to come out on top, hence cat pictures on Reddit. The
most philosophically elegant solution I've encountered so far is "quadratic
voting" (see
[https://news.ycombinator.com/item?id=9477747](https://news.ycombinator.com/item?id=9477747)),
where every user has a limited number of credits to spend per time period, and
every vote has a quadratic cost.

Assume a user has/obtains 1000 karma points a month. If he merely likes a
post, he gives it his 1 vote, which costs him 1 karma. If he strongly wants
one post up, he can spend a maximum of 31 votes on it (costing 961 karma).
This way minorities also get extra influence on the voting process.
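
A sketch of the cost rule behind those numbers (31 votes because 31² = 961 is
the most that fits in a 1000-karma budget):

    import math

    def cost(votes):
        """Casting k votes on a single post costs k**2 credits."""
        return votes ** 2

    def max_votes(budget):
        """Most votes one user can stack on a single post."""
        return math.isqrt(budget)

    assert cost(1) == 1            # merely liking a post costs 1 karma
    assert max_votes(1000) == 31   # 31**2 = 961 <= 1000 < 32**2 = 1024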

The requirements is that each user have 1 account, e.g. maybe by some form of
payment for 1000 karma to avoid fake voting fraud. Maybe using bitcoins if it
picks up to avoid privacy problems.

Do you think this will hinder the workings of sites like reddit?

~~~
bwy
What I always thought was that there really should be some user-based
weighting system. Like, if a user upvotes 90% of the things he sees, his
upvotes are probably worth less than upvotes by someone who upvotes only 1% of
the posts he sees.

Same thing applies to things like Yelp reviews. Maybe a user with close to a
5-star lifetime rating average should have his reviews "renormalized" to 3's
because his standards are probably just lower than the guy with a 1-star
average.
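
A hypothetical sketch of that renormalization; the 3.0 global mean and the
clamping are assumptions of mine, not anything from the comment:

    def renormalized(rating, user_mean, global_mean=3.0):
        """Shift a rating by how far this user's lifetime average sits above
        or below the global average, then clamp back onto the 1-5 scale."""
        shifted = rating - (user_mean - global_mean)
        return min(5.0, max(1.0, shifted))

    renormalized(5.0, user_mean=4.9)  # -> 3.1: a habitual 5-star rater's praise counts less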

The problem is that there are so many other factors here (maybe the person
with the 5-star average only visits, or just _reviews_, really good places;
maybe the crazy upvoter just spends more time reading each page on Reddit).
These are complicating factors that are hard to predict, and if the simple
case is working, why try? If there were a simple, clearly better way of
voting/rating, it would be done.

~~~
joshuapants
> Same thing applies to things like Yelp reviews. Maybe a user with close to a
> 5-star lifetime rating average should have his reviews "renormalized" to 3's
> because his standards are probably just lower than the guy with a 1-star
> average.

Another problem is the perception of star ratings. It seems like 5 (maybe 4
also) is the only "positive" rating for many people. Anything less and it
might as well be a 1.

~~~
bwy
Of course, there are many areas that I didn't even mention. Another thing in
the same vein that I sometimes think about is: what is the meaning of
upvoting? Does it mean "I like this," or can it also mean "I think this
submission should be higher"? Maybe I think too much, but I've refrained from
upvoting posts I like because I don't think they should be higher than their
current position.

~~~
travis_bickle
Could the design tell the user what an upvote means ("I like this" / "I think
this submission should be higher")? The significance of an upvote could be
either of the two depending on the site. Then one could also sort accordingly.

------
woah
Wrong solution #1 sounds like it could work quite well for UrbanDictionary,
since it would tend to reward posts that have a lot of engagement. It's
probably a good solution for a lot of sites.

~~~
imh
The problem here is feedback: the higher-rated posts get seen by more people,
so they collect more upvotes, so they rise even higher. That opens up a whole
extra can of worms you don't want to deal with.

~~~
sova
This is exactly true. You gotta balance freshness, quality, and uncertainty.

------
bbrazil
Previously:
[https://news.ycombinator.com/item?id=478632](https://news.ycombinator.com/item?id=478632)

~~~
ggreer
It was also discussed about 3 years ago:
[https://news.ycombinator.com/item?id=3792627](https://news.ycombinator.com/item?id=3792627)

Some of the comments from that posting give concrete examples where the
formula fails. For instance, an item with 1000 upvotes and 2000 downvotes will
be ranked above one with 1 upvote and 2 downvotes, because the formula uses
the lower bound of the Wilson interval.
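
This is easy to check with the standard Wilson lower bound from the article:

    import math

    def wilson_lower(pos, n, z=1.96):
        """Lower bound of the Wilson score interval for a binomial proportion."""
        if n == 0:
            return 0.0
        phat = pos / n
        centre = phat + z * z / (2 * n)
        margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
        return (centre - margin) / (1 + z * z / n)

    print(wilson_lower(1000, 3000))  # ~0.317
    print(wilson_lower(1, 3))        # ~0.061 -- so the heavily downvoted item ranks higher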

------
rurban
I'm ranking movies by critics' ratings. Most of them have too few ratings, and
thus naive Bayesian ranking by the average does not work. IMDB gets away with
it, but I cannot. And you should be able to see a good preview of the expected
ranking even with low numbers.

So you need to check the confidence interval with Wilson, but you also need to
check the quality of the reviewer. Some are in the 90% range, but there are
also often outliers, i.e. extremes. Mostly French, btw.

I updated the C and Perl versions, compiled and pure Perl, here:
[https://github.com/rurban/confidence_interval](https://github.com/rurban/confidence_interval)

------
anon4327733
The first two points are great, but why then do we see this:

"Given the ratings I have, there is a 95% chance that the "real" fraction of
positive ratings is at least what?"

What normal person thinks in terms of confidence intervals?

The obvious answer is that people want the product with the highest "real"
rating, i.e. the rating the product would get if it had arbitrarily many
ratings.

To get this you just take the mean of your posterior probability distribution.
For just positive and negative reviews that's basically (positive+a)/(total+b),
where a and b depend on your prior.
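
As a sketch: with a Beta(α, β) prior the posterior mean is
(positive + α)/(total + α + β), so the comment's a is α and its b is α + β;
the uniform prior α = β = 1 gives Laplace's rule of succession:

    def posterior_mean(positive, total, alpha=1.0, beta=1.0):
        """Mean of the Beta posterior for the 'real' positive fraction:
        (positive + alpha) / (total + alpha + beta)."""
        return (positive + alpha) / (total + alpha + beta)

    posterior_mean(1, 1)  # 2/3 -- one review, positive
    posterior_mean(0, 0)  # 1/2 -- no reviews: falls back to the prior mean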

His proposal means that a product with zero reviews would be rated below a
product with 1 positive review. This may help against spam and vote
manipulation, since items with less information are penalized more, but that
is a separate issue.

------
spacemoelte
I have always wondered what Amazon was thinking with that way of sorting.
Perhaps it's a deliberate way to spread purchases out over a range of products
instead of concentrating them on the top two?

~~~
randomtree
I think it's about product discovery. If we always sort this way, new products
don't stand a chance.

And I don't think Amazon would sort like this; it would make more sense for
them to use the HN/Reddit way of sorting, which gives new items a chance to
get to the top.

------
Houshalter
There is a much simpler and more elegant method: just rank posts by their
probability of getting an upvote, which is
(upvotes+1)/(upvotes+downvotes+2).

~~~
stdbrouw
This gives an advantage to new posts, for which the probability is much more
uncertain: it's easier to get 1 upvote and 0 downvotes (score 2/3) than 1999
upvotes and 999 downvotes (also score 2/3). Maybe that's what you want, but
the post is exactly about those cases where this is _not_ what you want.

~~~
dredmorbius
In existing systems, though, new posts frequently start at a disadvantage.
Temporarily biasing them favourably increases the odds of their getting _any_
moderation. Alternatively, you could present them only to a subset of the
readership; I've suggested this as a solution to HN's new-submissions queue
problem.

Then increase the exposure as ratings come in.

------
jrochkind1
Interestingly, I _think_ the Reddit algorithm basically makes this mistake too
-- although embedded in a more complicated algorithm that combines it with
'newest first', adjusted by positives minus negatives.

I don't think the HN algorithm is public, but I wouldn't be surprised if it
did the same.

Perhaps the generally much smaller number of 'votes' on an HN/Reddit post
makes it less significant.

~~~
gsteinb88
For posts, I'm not sure what the algorithm is (I think it's deliberately more
complicated, and has to take the time of posting into account?), but after
this article [the OP] was written, Reddit implemented the method for comments,
as explained by Randall Munroe:
[http://www.redditblog.com/2009/10/reddits-new-comment-sortin...](http://www.redditblog.com/2009/10/reddits-new-comment-sorting-system.html)

You only get this ranking method if you sort the comments by 'best', though.
~~~
jrochkind1
Not the default 'top' though, I think.

------
sova
Awesome if you are using only "up" and "down" ...

~~~
Houshalter
It should work for star ratings, and generalize to non-discrete rating systems
too.

------
discardorama
How well does this work when you don't have a binary (+/-) rating system but a
multi-valued one (1-5 stars)?

~~~
cgearhart
See the discussion elsewhere in this thread for confidence intervals in
multinomial distributions.
[https://news.ycombinator.com/reply?id=9856607&goto=item%3Fid...](https://news.ycombinator.com/reply?id=9856607&goto=item%3Fid%3D9855784)

------
dredmorbius
First off, for anyone looking at web reputation systems, _read the book on the
subject_: Randy Farmer and Bryce Glass, _Building Web Reputation Systems_:

Book:
[http://shop.oreilly.com/product/9780596159801.do](http://shop.oreilly.com/product/9780596159801.do)

Wiki:
[http://buildingreputation.com/doku.php](http://buildingreputation.com/doku.php)

Blog: [http://buildingreputation.com/](http://buildingreputation.com/)

I can pretty much guarantee there are elements of this you're not considering
which are addressed there (though there are also elements which Farmer and
Glass don't hit either). But it's an excellent foundation.

Second: If you're going to have a quality classification system, you need to
determine _what you are ranking for._ As the Cheshire Cat said, if you don't
know where you're going, it doesn't much matter how you get there. Rating for
popularity, sales revenue maximization, quality or truth, optimal experience,
ideological purity, etc., are all different.

Beyond that, I've compiled some of my own thoughts from 20+ years of using
(and occasionally building) reputation systems:

"Content rating, moderation, and ranking systems: some non-brief thoughts"
[http://redd.it/28jfk4](http://redd.it/28jfk4)

⚫ Long version: Moderation, Quality Assessment, & Reporting are Hard

⚫ Simple vote counts or sums are largely meaningless.

⚫ Indicating levels of agreement / disagreement can be useful.

⚫ Likert scale moderation can be useful.

⚫ There's a single-metric rating that combines many of these fairly well --
yes, Evan Miller's lower-bound Wilson score.

⚫ Rating for "popularity" vs. "truth" is very, very different.

⚫ Reporting independent statistics for popularity (n), rating (mean), and
variance or controversiality (standard deviation) is more informative than a
single statistic.

⚫ Indirect quality measures also matter. I should add: _a LOT._

⚫ There almost certainly isn't a single "best" ranking. Fuzzing scores with
randomness can help (see the sketch after this list).

⚫ Not all rating actions are equally valuable. Not everyone's ratings carry
the same weight.

⚫ There are things which don't work well.

⚫ _Showing_ scores and score components can be counterproductive and leads to
various perverse incentives.
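
On the fuzzing point above, a minimal sketch; the Gaussian noise and its scale
are assumptions of mine, not from the post:

    import random

    def fuzzed(score, sigma=0.05):
        """Jitter a ranking score so lower-ranked items occasionally surface."""
        return score + random.gauss(0, sigma)

    items.sort(key=lambda item: fuzzed(item.score), reverse=True)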

I'm also increasingly leaning toward a multi-part system, one which rates:

1\. Overall favorability.

2\. Any flaggable aspects. Ultimately, "ToS" is probably the best bucket,
comprising spam, harassment, illegal activity, NSFW/NSFL content (or
improperly labeled same), etc.

3\. A truth or validity rating. Likely rolled up in #2, but worth mentioning
separately.

4\. Long-term author reputation.

There's also the general problem associated with Gresham's Law, which I'm
increasingly convinced is a general _and quite serious_ challenge to market-
based and popularity-based systems. Assessment of complex products, especially
information products, is difficult, which is to say, _expensive_.

I'm increasingly in favour of presenting newer / unrated content to subsets of
the total audience, and increasing its reach as positive approval rolls in.
This seems like a behavior HN's "New" page could benefit from. Decrease the
exposure for any one rater, but spread ratings over more submissions, for
longer.

And there are other problems. Limiting individuals to a single vote (or
negating the negative effects of vote gaming) is key. Watching the watchmen.
Regression toward mean intelligence / content. The "evaporative cooling"
effect
([http://blog.bumblebeelabs.com/social-software-sundays-2-the-...](http://blog.bumblebeelabs.com/social-software-sundays-2-the-evaporative-cooling-effect/)).

------
fahadalie
The most important criterion for sorting should be 'engagement'.

~~~
dredmorbius
That's a useful element to consider, and I do favour implicit ranking inputs,
but "best" requires that you know your goal. _What are you selecting for?_

