
Bayesian ranking of items with up and downvotes or 5 star ratings (2015) - mooreds
http://julesjacobs.github.io/2015/08/17/bayesian-scoring-of-ratings.html
======
EvanMiller
I recommend the approach described in this article:

[http://www.evanmiller.org/ranking-items-with-star-ratings.html](http://www.evanmiller.org/ranking-items-with-star-ratings.html)

In this formulation, s_k equals utility. Like the Wilson score formula (and
unlike the linked article), the provided equation takes into account the
variance of the expected utility.
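That style of scoring can be sketched as a lower bound on the expected star rating under a uniform Dirichlet prior (one pretend vote per star level). The function name and the `z` default here are mine; treat this as a sketch of the general technique, not a verbatim copy of the article's equation:

```python
import math

def star_lower_bound(counts, z=1.65):
    """Lower bound on the expected star rating, assuming the
    utility of k stars is k and a uniform Dirichlet prior
    (one pretend vote per star level). Illustrative sketch."""
    K = len(counts)           # number of star levels
    N = sum(counts)           # total ratings
    scores = range(1, K + 1)  # utility of k stars = k
    # posterior mean probability of each star level
    p = [(n + 1) / (N + K) for n in counts]
    mean = sum(s * pk for s, pk in zip(scores, p))
    second_moment = sum(s * s * pk for s, pk in zip(scores, p))
    variance = (second_moment - mean * mean) / (N + K + 1)
    # penalize uncertain estimates by subtracting z standard errors
    return mean - z * math.sqrt(variance)
```

An item with a single 5-star rating scores far below one with hundreds of 5-star ratings, because the variance term dominates when N is small.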

~~~
iainmerrick
I find that article very hard to follow -- there are lots of detailed
formulas, but no obvious place where the prior distribution is discussed, or
the utility score given to different star ratings. And the examples are all
very abstract.

 _Edit to add:_ ah, I think I see, the utility of N stars is assumed to be N,
and the prior is all ones. But aren't those the most important things to tune
in a Bayesian model?

~~~
wenc
Another practical Bayesian approach that is much easier to understand and to
productionize is described here:
[https://www.johndcook.com/blog/2011/09/27/bayesian-amazon/](https://www.johndcook.com/blog/2011/09/27/bayesian-amazon/)

It does assume a Beta(1,1) prior, however.
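One simple way to operationalize that Beta(1,1) prior is to rank items by the posterior mean of the upvote probability (a sketch of the general idea, not necessarily the exact computation in the blog post):

```python
def posterior_mean(upvotes, downvotes):
    """Mean of the Beta(1 + up, 1 + down) posterior over the
    upvote probability -- Laplace's rule of succession."""
    return (upvotes + 1) / (upvotes + downvotes + 2)
```

Under this scheme an item with 90 positive and 10 negative reviews (score ~0.89) outranks one with 2 positive and 0 negative (score 0.75), even though the latter has a perfect raw average.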

~~~
iainmerrick
With star ratings, I think an important point that often gets ignored is:
_different people use stars in different ways._ One user might 5-star most
things, but give the occasional 4- or 3-star review if they have a problem.
But another user might 3-star by default, and save their 4- and 5-star reviews
for exceptionally good cases.

I wonder if a simple way to fix that might be to reinterpret everyone's star
ratings as percentiles, based on the overall distribution of stars in their
reviews. "This user gives 5 stars 10% of the time, so we'll interpret a 5-star
review from them as anything in the range 90-100 -- assume 95%."

You would probably also want to reinterpret the results for each user. "This
review's scores average out to 84%. For user A, that's 4.5 stars, but for user
B, it's only 3.5 stars."
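That reinterpretation is easy to prototype. The function below (names are hypothetical) maps a star rating onto 0-100 using the midpoint of the percentile band it occupies in that user's own rating history:

```python
def stars_as_percentile(history, stars):
    """Map a star rating onto 0-100 using the midpoint of the
    percentile band it occupies in this user's rating history."""
    n = len(history)
    below = sum(1 for r in history if r < stars)
    at_or_below = sum(1 for r in history if r <= stars)
    return 100 * (below + at_or_below) / (2 * n)
```

For a user whose history is 10% 5-star reviews, a new 5-star review occupies the 90-100 band and comes out as 95, matching the example above.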

The big downside is that star ratings become subjective. But they're _already_
subjective, and ignoring that problem doesn't make the results any better.
Average star ratings on all the big websites and app stores right now are
garbage -- they'll usually warn you if some Amazon product is terrible, but
that's about all.

If you crunch all the review data and figure out the best possible
recommendations, you end up with collaborative filtering and the Netflix
Prize. It's a shame that so much great work was done for that competition, but
nobody seems to be using it now. Netflix themselves just use a trivial upvote
scheme now.

But I wonder if there's some much simpler approach that still gets pretty good
results.

~~~
nkristoffersen
Or even a simple thumbs up or thumbs down. Less open to interpretation on how
the user uses stars. 1 star or 5 star basically.

~~~
acrooks
I wrote this a couple of years ago [1]. I think we need to remove subjectivity
on ratings by asking more specific questions and only allowing a binary
answer.

1. Is the food good? 2. Is the service good? 3. Is the atmosphere good?

That's a pretty simple answer. Often when I see 1 star reviews it's because of
a single element of the experience but not the overall experience.

It's easier to leave a review because there's less cognitive load. It's easier
to search for what you want: if I have my foodie hat on, I don't particularly
care about the service. If it's a night out with a customer, that becomes more
important all of a sudden.

And then you can generate some sort of average score based on the answers to
these questions to calculate the 5 star rating.

[1] [https://medium.com/@acrooksie/no-more-5-star-rating-systems-a4a5032bb19d](https://medium.com/@acrooksie/no-more-5-star-rating-systems-a4a5032bb19d)

~~~
iainmerrick
I do prefer that over stars, but I think it potentially misses some
information. Let's say most people answer "good" for all the categories. Does
that just mean the place is good overall, or is it fantastic?

To put it another way, how do you distinguish the 4.0-star places from the
4.9-star places?

With conventional star ratings, you're reliant on most people using stars
consistently. With a series of yes/no questions, you're relying on a
potentially small pool of "no" answers to give you a useful signal.

I think stack ranking would be much more powerful. "How does this place
compare to others? Average, better than average, in your all time top 5?"
Everybody's feedback would be completely clear. It's not obvious how to
aggregate that into a single rating number though.

~~~
acrooks
Given a set of questions - e.g. "how's the food" "how's the atmosphere" "how's
the service" etc. - you could figure out how the restaurant scores relative to
others by stack ranking based on the % of answers to a particular question
that got a "Yes". The numbers should hopefully reflect a normal distribution
and from there you get your /5 rating.
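That scoring idea can be sketched as follows: take a restaurant's yes-rate on a question, find its percentile rank among all restaurants, and scale onto five stars (all names here are illustrative):

```python
def stars_from_yes_rate(yes_rate, all_yes_rates):
    """Rank this restaurant's yes-rate against every restaurant's
    yes-rate and scale the percentile onto a 0-5 star range."""
    rank = sum(1 for r in all_yes_rates if r <= yes_rate)
    return 5 * rank / len(all_yes_rates)
```

The best restaurant on a question lands at 5.0 by construction, and everything else spreads out below it according to the distribution of yes-rates.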

If everybody answers "yes" to all of the questions - good value, service,
food, atmosphere - then that suggests to me that it's a great restaurant. And
you can have a lot of questions that are even asked randomly to limit the
number of questions per user.

I rate a lot of places highly that have a lot of great things going for them
but not great service, because I don't think the service is bad enough to
bring them down. But that's data that is being lost.

I like your idea of stack ranking but with a different flavour. I think that
"in your all time top 5" is a hard question to answer. How about this though -
if we know you've been to Taco Place X and now you're going to Taco Place Y,
maybe the question is "are the tacos at Y better than X", "is the atmosphere
at Y better than X" or even "is Y better than X" (but I like the idea of
collecting more granular data).

If you collect this data to stack rank, it definitely gives you a better
distribution of restaurants relative to each other in each category.

As a consumer, with this level of granularity, I can select what I care about
tonight. If I'm grabbing takeout for lunch at work, does a five star rating
even matter? I should ask Siri "show me the top fast and delicious takeout
restaurants near me" and she should do: "select name from restaurants where
distance < 500m order by (speed + flavour) limit 3;" and from there I will
pick something from that list that looks nice. That seems like a nice UX.

~~~
hysthola
There's a body of research on this, and it suggests that ratings are more
meaningful if you add options, up to about 5 or 6 ratings.

That is, if you asked people to do the ratings once, and then asked them 1
hour later, there would be more consistency across time as you add options
from 2 to 3 to 4, up to about 5 or 6.

The problem with binary ratings is that, as much as you might think otherwise,
you're forcing a kind of hazy, grey experiential assessment into 0 or 1. And
in doing so, people near the boundary (whatever that might be) will vacillate
between them. E.g., people who feel "meh" about something are forced to choose
something else, and sometimes they'll say 0 and sometimes 1. The more options
you give, the more reliable / meaningful the ratings will be.

This example is interesting to me because it's something most people can
relate to and illustrates the complications of utility-based and Bayesian
formulations of the problem. You end up having to decide on utilities and/or
priors.

To me the answer is to weight the data maximally in forming a posterior, in
which case you end up using a reference prior. Similar kinds of arguments
about utilities lead to reference priors. Reference priors can be complicated
to compute, but for things like multinomials over ordinal ratings, reference
priors have been worked out fairly well.

To me it always made sense to allow people to sort by the center of the
estimate, or the lower bound (maybe using different language).

~~~
iainmerrick
Slight tangent--

I think 1-4 stars is the ideal rating style. I wish that were used more often.

A choice of 1-4 stars gives you enough freedom to express your opinion,
without being overwhelming. It's a small enough range to be reasonably
objective (almost everybody will interpret it as 1 star = bad, 2 = passable, 3
= good, 4 = great). And with an even number of choices there's no middle "meh"
option -- you're forced to make a choice between 2 and 3.

Of course it's important not to ruin it by adding extra options, like 0 stars
or half-stars. (That was Ebert's big mistake!)

 _Edit to add:_ to relate this to the parent post, I'm thinking that maybe
ranking things as 1-4 stars in several categories could be the best of both
worlds.

------
anameaname
Not a statistician, but this still seems flawed. The pretend votes need to be
related to the person seeing the list of items. These normally come from the
population (i.e. if you were ranking Netflix, the pretend votes would be the
sum of all votes that exist for every movie, grouped by star count). This
makes sense, because if you had no other information, your guess would just be
the average of all the existing ratings.

The problem is that the pretend votes need to be culled in order to be
predictive. Otherwise they dominate in the arithmetic. They need to be more
specific to the user looking at the ranking. Continuing with the Netflix
example, if a user was looking for scary movies, the pretend votes need to
come from the corpus of all scary movies, rather than all movies that exist.

Here's the problem: there doesn't seem to be a good way to narrow the pretend
votes. Worse, there isn't a good way to combine the two. If the pretend votes
came from two sources, it's not clear what to do. For example, if the user is
from California, the California pretend votes (priors?) need to be combined
with the scary movie pretend votes.

How can we add pretend votes without justifying where they came from?

~~~
iainmerrick
It doesn't have to be correct, just a plausible starting point. The "pretend
votes" have less importance as more real votes come in.

I do think this article suggests adding too many pretend votes. Without the kind
of justification you're talking about, it's usually better to add only a
couple (reflecting low confidence in the prior).
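In pseudo-count terms (the numbers here are purely illustrative), a couple of pretend votes fade quickly as real votes arrive, while a heavy-handed prior keeps dominating:

```python
def smoothed_score(up, down, pretend_up=1, pretend_down=1):
    """Upvote fraction smoothed with pretend votes; the pretend
    votes matter less and less as real votes accumulate."""
    return (up + pretend_up) / (up + down + pretend_up + pretend_down)
```

With only 4 real votes, a 10+10 pretend-vote prior drags a 75%-upvoted item down to about 54%; with 400 real votes and a 1+1 prior, the score is already within a hair of 75%.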

~~~
naasking
I'm just getting into stats, but the way I see it, a new item has 0 votes with
an error bar +/- the number of possible votes. Each sorting should then
include a randomization factor related to the error bar, and so randomly
promote some new items into top rankings so they get some exposure to gather
votes. As they accumulate votes, the error bar shrinks as the ranking becomes
a little more certain.

~~~
iainmerrick
I believe the right way to think about it isn't error bars, but the entire
probability distribution -- what's the probability that if everybody voted,
the upvote/downvote ratio would be 75/25, 80/20, 85/15, etc. Once you've
figured out the probability distribution, you can calculate error bars any way
you like (e.g. 95% confidence interval).

The beta distribution is one model you can choose for that probability
distribution, which happens to have some nice properties that make it easy to
work with.

The other question is, what's the "zero knowledge" probability distribution? I
think your "0 votes with an error bar +/- the number of possible votes" would
translate to "uniform probability of any result", which I think is beta(1,1).

Depending on the scenario, though, you might look at the data and observe that
extreme values are very uncommon, and therefore start with something like
beta(2,2) instead (a bell curve rather than a flat distribution). That has
minimal impact once you have lots of real upvote/downvote data, but it makes a
huge difference to how the first few votes are interpreted.
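The effect of that prior choice is easy to see from posterior means: the Beta(a, b) posterior after u upvotes and d downvotes is Beta(a + u, b + d), so (a sketch, with illustrative numbers):

```python
def posterior_mean(up, down, a=1, b=1):
    """Mean of the Beta(a + up, b + down) posterior."""
    return (a + up) / (a + b + up + down)

# A single downvote: the two priors disagree noticeably.
flat = posterior_mean(0, 1, 1, 1)  # 1/3 under beta(1,1)
bell = posterior_mean(0, 1, 2, 2)  # 2/5 under beta(2,2), less extreme

# With lots of data, the choice of prior barely matters.
flat_big = posterior_mean(800, 200, 1, 1)
bell_big = posterior_mean(800, 200, 2, 2)
```

After 1,000 votes the two estimates differ only in the fourth decimal place, so the prior really only shapes how the first few votes are read.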

~~~
naasking
Right, that sounds more like what I meant. Still familiarizing myself with the
terminology, thanks!

------
akie
Previous discussion:
[https://news.ycombinator.com/item?id=10481507](https://news.ycombinator.com/item?id=10481507)

------
willis77
Anyone here have thoughts on why, all these years later, Amazon still doesn't
have a sort option along the lines of these proposals? It seems like such an
easy win and an easy technical change. Do they have some business reason not
to change their default sort?

~~~
SomeStupidPoint
I'm not sure what you mean -- could you elaborate?

Amazon probably doesn't use straight score averaging to decide the "best"
items sort, and these are just proposals for how to improve that by not
relying on plain averages. So what is it you're looking for Amazon to add?

Disclaimer: work at Amazon, not on anything search related.

~~~
willis77
Amazon has the default "Featured" sort (I'm not sure what is behind this, but
it intuitively seems like some combination of popularity + availability +
rating). If this default doesn't fit your needs, your only option is to change
to sort by "Avg. Customer Review", which gets you a list that is sorted by
average rating regardless of the number of reviews. Evan called this out
nearly 10 years ago in the post that OP's article mentions:
[http://www.evanmiller.org/how-not-to-sort-by-average-rating.html](http://www.evanmiller.org/how-not-to-sort-by-average-rating.html).
The root problem is that one random obscure product with a single 5-star
rating out-ranks something with 499 5-star ratings and 1 4-star rating.

I'm often looking for what is the best/highest-quality item in a category,
meaning I want not just a high average, but a high average that is
statistically meaningful. I'm just surprised Amazon hasn't offered a way to do
that (and have read umpteen threads on HN in the past years expressing the
same frustration).
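The failure mode is easy to reproduce. Below, `adjusted` shrinks the raw average toward a neutral prior with a handful of pretend votes (an illustrative fix, not whatever Amazon actually does):

```python
def average(ratings):
    """Plain average -- the sort Amazon's 'Avg. Customer Review' uses."""
    return sum(ratings) / len(ratings)

def adjusted(ratings, prior_mean=3.0, pretend_votes=5):
    """Shrink the raw average toward prior_mean; items with only
    one or two ratings can't dominate the sort."""
    return (sum(ratings) + prior_mean * pretend_votes) / (len(ratings) + pretend_votes)

obscure = [5]                # one lone 5-star rating
popular = [5] * 499 + [4]    # 499 five-star ratings and 1 four-star

# Raw averages put the obscure product on top; the adjusted
# score ranks the heavily-reviewed product first.
```

This is the minimal version of "toss out the obscure crap": the lone 5-star product lands near 3.3 while the well-reviewed one stays near 5.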

~~~
SomeStupidPoint
....Default 'featured' sort?

When I go to Amazon.com and search, I see 'relevance' as my default, with
'featured', some price related ones, 'average', and 'new' as options.
('Featured' only seems to exist on some products, and be related to ads.)

Is it not the same for you?

-----

As for your main point (because I think that your complaint is still valid
even with 'relevance' as the default), it sounds like what you want is a way
to choose what factors are applied to your sort.

I'm not sure, but it seems likely that 'relevance' is doing more than just
averaging, and so being able to select which parts you apply (eg, only use a
statistical notion of best, don't consider availability or shipping times)
would cover your use case, right?

Well, you might want to be able to choose between a few models of 'best', but
the real issue, the core need, is that you want control over the model that
Amazon is using to sort what you see and to have some input on what that looks
like. (And not just have 'lolsux' or 'Amznsort', to be a little glib.)

Gotta say, that actually sounds like a pretty reasonable ask. I'm not sure why
it doesn't work that way, either.

~~~
willis77
Yeah, my above comment was not using a text search, hence no "relevance"
option (i.e. if you just drilled down the department hierarchy to, say, the
TVs department).

> the core need, is that you want control over the model that Amazon is using
> to sort what you see and to have some input on what that looks like

Indeed, but I'm not even looking to have that much granular control over it. I
just want "sort by rating, but toss out all the obscure crap that has 1 or 2
ratings, because that rating is meaningless."

------
grenoire
Unfortunately, no matter how you twist and turn it, ordinal data is going to
stay ordinal. You can't make it more meaningful than that, regardless of how
you aggregate it.

