
How Not to Sort by Average Rating (2009) - Aqwis
http://www.evanmiller.org/how-not-to-sort-by-average-rating.html
======
paulgb
Averages (even with the post's approach) still have the problem of not being
"honest" in the game theory sense. For example, if something is rated 4 stars
with 100 reviews, a reviewer who believes its true rating should be 3 stars is
motivated to give it 1 star because that will move the average rating closer
to his desired outcome. A look at rating distributions shows that this is in
fact how many people behave.

Median ratings are "honest" in this sense, as long as ties are broken
arbitrarily rather than by averaging. Math challenge: is there a way of
combining the desirable properties mentioned in the post with the property of
honesty? I suspect there is but I haven't tried it.
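A quick sketch of that incentive (hypothetical numbers, just to illustrate): a strategic 1-star vote pulls the mean further toward the reviewer's target of 3 stars than an honest 3-star vote does, while the median doesn't move at all.

    from statistics import mean, median
    
    # Hypothetical item: 100 ratings averaging 4.1 stars.
    ratings = [4] * 60 + [5] * 25 + [3] * 15
    
    honest = ratings + [3]     # reviewer votes their true belief
    strategic = ratings + [1]  # reviewer exaggerates to move the mean
    
    print(mean(honest), mean(strategic))      # ~4.089 vs ~4.069
    print(median(honest), median(strategic))  # 4 vs 4 -- median unmoved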

~~~
hyperpape
John Gruber has been arguing that the only meaningful way to do ratings is a
simple thumbs up/thumbs down. I don't necessarily agree, but I see the appeal.

I usually don't want ratings, I want the Wirecutter treatment. Sometimes, I
know/care enough to really research the topic, in which case star reviews are
relatively unhelpful. The rest of the time, I just want someone trustworthy to
say "buy this if you want to pay a lot, buy this if you want something cheap,
but this third thing is no good at any price".

~~~
dntrkv
I've been saying this for years, thumbs up/down is the only system that makes
sense to me.

Foursquare uses it and I've found their scores to be way more useful than
Yelp's.

The biggest problem with star ratings is that they're so arbitrary. What is the
difference between 3 and 3.5? What is a 1 vs a 2? 3/5 is 60%, which is almost
failing if you think about it on a grading scale. If I scored something as a
3/5 I would never use that product or service again, yet many of the best
restaurants are rated 3/5 on Yelp.

Unless the user has some scoring system in place for different qualities of
the product or service, there is no way you can get anything resembling an
accurate score.

I would never trust a user to accurately assess a score given 10 different
options (.5-5) but I would be way more likely to trust a user to say either "I
like this product" or "I do not like this product."

Yes, the Wirecutter approach works great, but it just doesn't scale.

~~~
crazygringo
Counterpoint: I almost solely rely on the stars histogram in Yelp (available
only on the website, not the app), completely ignoring whatever Yelp's
calculated "average" is.

If a place has more 5-star ratings than 4-star ratings, it's generally
amazing. If it has more 4-star ratings than 5-star ratings, it's generally
fine but not something particularly special.

Just thumbs up/down would eliminate what is, to me, the single most useful
aspect of Yelp.

It doesn't matter that star ratings are arbitrary -- when you average enough
of them out, a clear signal overrides the noise. You can distrust any given
user, while still trusting the aggregate.

(Curiously enough, I don't find any equivalent value on Amazon. On Yelp,
you're really evaluating an overall experience along a whole set of
dimensions, so there's a lot more to discriminate on. On Amazon, it does seem
to be more of a binary evaluation -- does the product work reliably or not?)

~~~
BoiledCabbage
I used to think the same thing until I realized that the most accurate and
consistent ratings I use on a regular basis are Rotten Tomatoes', and they're
based on a strict thumbs up/down.

It ensures votes hold equal weight and that "extreme polar" voters don't skew
things. It also avoids the opposite problem, where everything gets a neutral
vote unless it's horrible or incredible.

RT also handles highbrow and lowbrow well. You get fewer votes of "eh, I
didn't love it, but it's sophisticated so I'll give it an extra star."

I'm sold on simple up/down.

~~~
stinkytaco
Rotten Tomatoes is good at predicting a movie I (or others) will like, but not
really at "ranking". Zootopia, one of their top movies of 2016 with a 98%
rating, is a good movie, but one I'm unlikely to pursue again. The Godfather,
with a 99% rating, is a movie I will pick up on Blu-ray and revisit many
times. It's far more than 1% better than Zootopia.

So RT is good at predicting "should I watch this movie I haven't watched
before?", but bad at predicting more sophisticated habits or preferences. I
wouldn't buy the Blu-ray off an RT prediction, but I would rent.

So it becomes a question of what are you trying to accomplish? For some issues
up/down is a good way to solve a problem, for others it isn't.

~~~
icebraining
Rotten Tomatoes actually has both kinds of rating, meaning they recognize the
limitation you're referring to. On the other scale, Zootopia has 8.1/10 and
The Godfather has 9.2/10, showing that difference in quality.

~~~
Houshalter
Also, you just aren't the demographic for Zootopia. If you have kids, then it
probably is worth buying, and they will watch it many times. There are so many
genres of film that it's best to compare within a single genre rather than
between them.

------
kstenerud
That's what's always annoyed me with Amazon's "sort by average rating"
setting. I want to see the top 10 or so items by rating to give me a baseline
to investigate from, but instead I get page after page of cheap Chinese crap
with one 5-star review each from the resident fake reviewer.

Worse than useless.

Even a simple change like adding a "show only items with a minimum of X
reviews" filter would be a godsend.

~~~
nehushtan
What's crazy is everyone knows Amazon's ranking is crap, except apparently
Amazon - and it's been crappy in the same way for 10 years.

~~~
folli
For Amazon (and similarly large companies) I'm usually tempted to turn the
proverb "don't attribute to malice what is adequately explained by stupidity"
on its head. There's definitely a financial reason behind this.

~~~
oconnor663
I'd hasten to add "don't attribute to stupidity what is adequately explained
by people having to deal with more than one problem at the same time."

------
toniprada
Another approach for non-binary ratings is to use the true Bayesian estimate,
which uses all of the platform's ratings as the prior. This is what IMDb uses
in its _Top 250_:

"The following formula is used to calculate the Top Rated 250 titles. This
formula provides a true 'Bayesian estimate', which takes into account the
number of votes each title has received, minimum votes required to be on the
list, and the mean vote for all titles:

weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C

Where:

R = average rating for the movie (mean)
v = number of votes for the movie
m = minimum votes required to be listed in the Top 250
C = the mean vote across the whole report"

[http://www.imdb.com/help/show_leaf?votestopfaq&pf_rd_m=A2FGE...](http://www.imdb.com/help/show_leaf?votestopfaq&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=0XCZT7P6XFKP334FP3GT&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_faq)
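A direct transcription of that formula in Python, for reference (the vote counts and the prior mean C below are made-up illustrative values, not IMDb's actual numbers):

    def weighted_rating(R, v, m, C):
        # R = mean rating for the movie, v = its vote count,
        # m = minimum votes required to be listed, C = site-wide mean vote
        return (v / (v + m)) * R + (m / (v + m)) * C
    
    # A 9.0-rated film with few votes ranks below an 8.5 film with many
    # votes, because the small sample is pulled toward the prior mean C.
    print(weighted_rating(R=9.0, v=30000, m=25000, C=6.9))   # ~8.05
    print(weighted_rating(R=8.5, v=500000, m=25000, C=6.9))  # ~8.42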

------
poorman
I reference this article constantly at Untappd.

When we were building the NextGlass app, I took much of this into
consideration for giving wine and beer recommendations.

We recently ran the query on the Untappd database of 500 million check-ins and
it yielded some interesting results. The "whales" (rare beers) bubbled to the
top. I assume this is because users who have to trade and hunt down rare beers
are unlikely to rate them poorly. The movie industry doesn't have to worry
about users rating "rare movies", but I would think Amazon might have the same
issue with rare products.

~~~
chris_va
There is an interesting phenomenon of exclusivity/sunk cost boosting ratings
for rarer or harder-to-acquire items.

That is also a problem with movie ratings (I just noticed that you mentioned
movies). Critics (and audiences) at pre-screenings are generally significantly
more favorable to a movie than an equivalent group in a normal theater. I
would not be surprised if the same thing applied to foreign movies, and other
types of "whales".

------
intenscia
Implemented this after discovering it via
[https://www.gamasutra.com/blogs/LarsDoucet/20141006/227162/F...](https://www.gamasutra.com/blogs/LarsDoucet/20141006/227162/Fixing_Steams_User_Rating_Charts.php)

Works amazingly well and is so easy to calculate compared to, say, the way
IMDb rates things.

------
loisaidasam
Here's a SO post w/ a python implementation:

[https://stackoverflow.com/questions/10029588/python-implemen...](https://stackoverflow.com/questions/10029588/python-implementation-of-the-wilson-score-interval/45965534)

The accepted answer uses a hard-coded z-value.

In the event that you want a dynamic z-value like the ruby solution offers, I
just submitted the following solution:

[https://stackoverflow.com/questions/10029588/python-implemen...](https://stackoverflow.com/questions/10029588/python-implementation-of-the-wilson-score-interval/45965534#45965534)

------
dperfect
What's the best way to apply the suggested solution to a numeric 5-star rating
system? (The author mentions Amazon's 5-star system as using the wrong
approach, yet the solution is specific to binary positive/negative ratings.)

I suppose one could arbitrarily assign ratings above a certain threshold to
"positive" and those below to "negative", and use the same algorithm, but I
imagine there's probably a similar algorithm that works directly on numeric
ratings. Anyone know? Or if you must convert the numeric ratings to
positive/negative, how does one find the best cutoff value?
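One ad-hoc workaround (a sketch, not anything from the article): rescale each star rating onto [0, 1] via (stars - 1) / 4 and feed the fractional totals into the same Wilson lower bound. The interval is no longer statistically exact for non-binary data, but it avoids picking a cutoff entirely:

    import math
    
    def stars_to_wilson(ratings, z=1.96):
        # ratings: iterable of 1-5 star values; each is mapped to a
        # "fraction positive" in [0, 1], then the binary Wilson lower
        # bound is applied to the resulting pseudo-counts.
        n = len(ratings)
        if n == 0:
            return 0.0
        phat = sum((r - 1) / 4 for r in ratings) / n
        return ((phat + z * z / (2 * n)
                 - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n))
                / (1 + z * z / n))

(Evan Miller also has a later post, "Ranking Items With Star Ratings", that treats the multi-level case directly.)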

~~~
amrrs
What we do with a 5-star rating system is completely ignore the 2-, 3-, and
4-star ratings, which in a lot of ways just skew our analysis, and end up with
a new score similar to NPS: (5-star count minus 1-star count) / total ratings.
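A minimal sketch of that score (reading "total" as all ratings, which is how the formula's denominator reads; the function name is mine):

    def nps_like_score(n5, n1, total):
        # Promoters minus detractors as a fraction of all ratings;
        # 2-, 3-, and 4-star votes count toward the denominator only.
        return (n5 - n1) / total
    
    print(nps_like_score(n5=70, n1=10, total=100))  # 0.6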

~~~
overcast
Why bother having a 5-star rating system then? Sounds like Netflix went down
the right path with thumbs up or thumbs down. People either zero it out or
give it five stars.

------
jbochi
It's very common to see a "Most Popular" section in a website, but the way
it's usually done is not optimized for clicks.

Inspired by Evan's post, I wrote "How Not to Sort by Popularity" a few weeks
ago: [https://medium.com/@jbochi/how-not-to-sort-by-popularity-927...](https://medium.com/@jbochi/how-not-to-sort-by-popularity-92745397a7ae)

------
kuharich
Previous discussions:
[http://news.ycombinator.com/item?id=1218951](http://news.ycombinator.com/item?id=1218951),
[http://news.ycombinator.com/item?id=3792627](http://news.ycombinator.com/item?id=3792627),
[https://news.ycombinator.com/item?id=9855784](https://news.ycombinator.com/item?id=9855784)

------
hood_syntax
I've read this article before and really liked how to-the-point it is. More
than anything, can I just say how infuriating Amazon's rating system is?

------
eeZah7Ux
This is computationally very heavy, but, more importantly, for practical
purposes you want to have a tunable parameter to balance between sorting by
pure rating average and sorting by pure popularity.

Often you also want to give a configurable advantage or handicap to new
entries.

~~~
quantdev
For a fixed confidence level, it looks computationally lightweight: a dozen
or so multiplications and divisions plus one square root, which could be
approximated if needed. There is no inverse normal needed at run time.

------
amelius
> What we want to ask is: Given the ratings I have, there is a 95% chance that
> the “real” fraction of positive ratings is at least what? Wilson gives the
> answer.

Well, you can't answer that question without making assumptions. And these
seem to be missing in the article.

------
thanatropism
Arguably what Urban Dictionary is doing is to weigh by "net favorability" in
some sense _and_ quantity of votes. Quantity of votes correlates to relevance,
particularly because UD is meant to represent popular usage.

~~~
gleenn
We actually switched to the Wilson score. Doing it later has some weird
effects: when a lot of people have already been voting mostly on the first
definition, the order can suddenly get switched because something else has a
higher ratio, giving it higher confidence. We're honestly not sure it's done
anything that great for UD; sometimes simple is just better.

~~~
thanatropism
You might want to weight less controversial definitions (as in higher
abs(upvotes - downvotes)) more heavily. This would be somewhat like effect
size in science.

The Bayesian approach would be to assume the true vote distribution is
binomial and use a beta prior (possibly the bimodal Jeffreys prior). Then, as
the total number of votes increases, the posterior distribution tightens. The
ranking score is prob(score > 0).
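A sketch of that with SciPy (the Jeffreys prior is Beta(0.5, 0.5), and "score > 0" becomes "true upvote rate p > 1/2"):

    from scipy.stats import beta
    
    def prob_net_positive(up, down, a=0.5, b=0.5):
        # Beta(a, b) prior + binomial likelihood gives a
        # Beta(a + up, b + down) posterior; return P(p > 0.5).
        return beta.sf(0.5, a + up, b + down)
    
    print(prob_net_positive(8, 2))    # same 4:1 ratio, only 10 votes
    print(prob_net_positive(80, 20))  # 100 votes: posterior much tighter,
                                      # so the probability is closer to 1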

------
agentgt
This sort of reminds me of "voting theory": if I recall correctly, a Nobel
Prize winner proved that there cannot be a fair winner.

Obviously it's not entirely analogous, but I would not be surprised if it
mapped over to this domain.

Edit: on mobile, so I'm late with the link to Kenneth Arrow:
[https://en.m.wikipedia.org/wiki/Arrow%27s_impossibility_theo...](https://en.m.wikipedia.org/wiki/Arrow%27s_impossibility_theorem)

~~~
petters
That theorem, being about when every user provides a complete ranking, does
not apply in this case.

------
gesman
I think ratings need to be normalized to the personal beliefs and preferences
of the viewer.

In other words - I couldn't care less how Joe Blow rated the product - but
it's important to me how like-minded people rated the product.

Also - Amazon is _not making a mistake_ in its ratings.

Amazon is less interested in selling you the product that's relevant _for
you_.

Amazon is more interested in boosting its bottom line and moving stalled or
higher-margin inventory.

------
alexvay
I think the article is missing something visual to demonstrate the actual
scoring at work.

I've made a simple plot in Excel here:
[http://i.imgur.com/adjaLQ9.png](http://i.imgur.com/adjaLQ9.png)

The number of up-votes remains the same while the number of down-votes
increases linearly. The declining grey line is the score.

~~~
shaftway
Here's a 3d graph showing this as a function of upvotes and downvotes. I think
it's clearest with

x: [0, 100], y: [0, 100], z: [0, 1]

[https://www.google.com/search?q=graph+((x+%2B+1.9208)+%2F+(x...](https://www.google.com/search?q=graph+\(\(x+%2B+1.9208\)+%2F+\(x+%2B+y\)+-+1.96+*+\(\(\(x+*+y\)+%2F+\(x+%2B+y\)+%2B+0.9604\)+%5E+0.5\)+%2F+\(x+%2B+y\)\)+%2F+\(1+%2B+3.8416+%2F+\(x+%2B+y\)\)&oq=graph+\(\(x+%2B+1.9208\)+%2F+\(x+%2B+y\)+-+1.96+*+\(\(\(x+*+y\)+%2F+\(x+%2B+y\)+%2B+0.9604\)+%5E+0.5\)+%2F+\(x+%2B+y\)\)+%2F+\(1+%2B+3.8416+%2F+\(x+%2B+y\)))

------
tabtab
What about having a scaling factor to adjust the impact of the quantity
(total) of individual ratings as needed? Rough draft:

    sort_score = (pos / total) + (W * log(total))

Here, W is the weighting (scaling) factor, and total = positive + negative.

~~~
jules
That formula will give an item a higher score the more downvotes it gets. A
better approach is

score = (pos + a) / (tot + b)

where a < b, e.g. a = 1, b = 2.

See this post for why that formula follows from Bayesian reasoning:
[http://julesjacobs.github.io/2015/08/17/bayesian-scoring-of-...](http://julesjacobs.github.io/2015/08/17/bayesian-scoring-of-ratings.html)

~~~
tabtab
I'm not sure what you mean in the first sentence. Example? The problem with
the two weights is that two values have to be chosen, and for large quantities
neither makes much difference. That's why I used log().

~~~
jules
The log(total) term increases without bound whereas the pos/tot term is at
most 1, so in the limit of a lot of votes you will beat an item with fewer
votes even if all your votes are downvotes.

That there are two configurable parameters is a good thing. One parameter
controls how much of a penalty you get for having few votes, the other
controls how many votes count as "few".
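A quick numerical check of both points (a = 1, b = 2 as suggested above; W = 0.2 is an arbitrary choice):

    from math import log
    
    def log_score(pos, total, W=0.2):
        return pos / total + W * log(total)
    
    def bayes_score(pos, total, a=1, b=2):
        return (pos + a) / (total + b)
    
    # Under the log formula, an item with nothing but downvotes
    # eventually beats a well-liked item...
    print(log_score(0, 10000), log_score(9, 10))      # ~1.84 vs ~1.36
    # ...while the Bayesian score keeps them in the right order.
    print(bayes_score(0, 10000), bayes_score(9, 10))  # ~0.0001 vs ~0.83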

------
phunge
Classic post! It's like a gentle gateway to the world of Bayesian statistics
-- check out Cameron Davidson-Pilon's free book if you want to go deeper.

------
alexpetralia
Chris Stucchio and Evan Miller have amazing statistics blogs.

------
bradbeattie
I think this article is missing the next step: collaborative filtering. I only
care about the ratings it received from people who rate things like I do.

------
larkeith
This article is useful, but the author's tone really rubs me the wrong way -
to the point that I'm dubious about trusting the information without further
sources. Cutting the entire first part ("not calculating the average is not
how to calculate the average") would help, as would more accurately titling
the piece - no matter how effective this method is, it is NOT sorting by
average, strictly speaking.

------
ignawin
Any blog posts/papers on what the best general approach to online reviews is?

~~~
dredmorbius
Good, qualified, honest reviewers.

Hal Varian (UC Berkeley) has some 1990s refs, which remain good. "GroupLens"
is the project/product.

Randy Farmer literally wrote the book on the topic. There's a book, blog, and
wiki.

Frankly, Farmer's work, good as it is, largely reinforces my view that Varian
captured the essence of the problem, which I've summarised in my opening
'graph. You _cannot_ algorithmically correct for crap quality assessment.

If you're interested in the long-form answer, the fields are epistemology
(philosophy) and epistemics (science).

Enjoy!

[http://people.ischool.berkeley.edu/~hal/Papers/publish.html](http://people.ischool.berkeley.edu/~hal/Papers/publish.html)

[http://people.ischool.berkeley.edu/~ngood/](http://people.ischool.berkeley.edu/~ngood/)

[http://people.ischool.berkeley.edu/~hal/Papers/japan/](http://people.ischool.berkeley.edu/~hal/Papers/japan/)

[http://buildingreputation.com](http://buildingreputation.com)

------
Animats
Mandatory XKCD: [https://xkcd.com/937/](https://xkcd.com/937/)

------
autokad
A gamma-Poisson model might more accurately calculate the rating based on the
uncertainty of the data.

------
donatj
Does a decent Fortran implementation exist?

~~~
jimktrains2
Did you read the article?

~~~
jimktrains2
To those downvoting me and sibling, parent asked about a SQL implementation
originally.

