
Bayesian ranking of items with up and downvotes or 5 star ratings
http://julesjacobs.github.io/2015/08/17/bayesian-scoring-of-ratings.html
======
EvanMiller
The method described here is simple because it's only looking at the mean of
the belief about each item; it uses the prior belief as a way either to
sandbag new items or to give them a bump. I tend to advocate methods that take
into account the _variance_ of the belief in order to minimize the risk of
showing bad stuff at the top of the heap.

I have a newer article (not mentioned here) that ranks 5-star items using the
variance of the belief. It ends up yielding a relatively simple formula, or at
least a formula that doesn't require special functions. Like the OP I use a
Dirichlet prior, but then I approximate the variance of the utility in
addition to the expected utility:

[http://www.evanmiller.org/ranking-items-with-star-ratings.ht...](http://www.evanmiller.org/ranking-items-with-star-ratings.html)

The weakness of this approach (and of the OP's) is that it doesn't really
define a loss function for decision-making (i.e. it doesn't properly
account for the costs of an incorrect belief), which one might argue is the
whole point of
being a Bayesian in the first place. In practice it seems that using a
percentile point on the belief ends up approximating a multi-linear loss
function, but I haven't worked out why that is.
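
For reference, here's a minimal Python sketch along those lines (my
wording, not a quote from the article): a uniform Dirichlet(1,...,1) prior
over the K star levels, and z = 1.65 for a one-sided 95% lower bound; both
choices are illustrative, not canonical.

    from math import sqrt

    def star_lower_bound(counts, z=1.65):
        # counts[k] = number of ratings with k+1 stars; the prior adds
        # one pretend rating at each star level (uniform Dirichlet).
        K = len(counts)
        N = sum(counts)
        probs = [(c + 1) / (N + K) for c in counts]  # posterior mean
        stars = range(1, K + 1)
        mean = sum(s * p for s, p in zip(stars, probs))
        second = sum(s * s * p for s, p in zip(stars, probs))
        variance = (second - mean * mean) / (N + K + 1)
        return mean - z * sqrt(variance)  # penalize uncertain items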

~~~
nkurz
_I tend to advocate methods that take into account the variance of the belief
in order to minimize the risk of showing bad stuff at the top of the heap._

Penalizing variance would be the opposite of my intuition. Given a boring
low-variance item with ten 3-star votes and a divisive item with five
1-star votes and five 5-star votes, I'd think you'd want the one at the top
to be the one with a medium chance that they'll "love" it rather than a
high chance that they'll find it passable.

If you further assume that the average person is going to check out the top
few results but only "buy" if they find something they really like, the risky
approach seems even more appealing. A list topped by known mediocre choices
has a low chance of "success". What's the scenario you are envisioning?

~~~
stdbrouw
The kind of divisive item you describe is rare, at least on Amazon. What
happens most commonly is that everyone loves something or everyone hates it,
with some noise (e.g. 10% 1 or 2 star reviews). In this case, it makes sense
to promote the item that has a 4.5 mean score and 100 reviews over one that
has a 4.7 mean score and only 5 reviews. You want to account for the
uncertainty when there are few ratings. If you don't, all the items at the top
of your search results will be 5-star 1-review products.
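
One common way to get that behavior is shrinkage toward a prior mean, in
the spirit of IMDb's "true Bayesian estimate"; the prior mean and weight
below are made-up numbers, not from the article:

    def shrunk_mean(mean_rating, n_ratings, prior_mean=3.5, prior_weight=10):
        # Few ratings: heavy pull toward the prior mean.
        # Many ratings: the observed mean dominates.
        total = prior_weight + n_ratings
        return (prior_weight * prior_mean + n_ratings * mean_rating) / total

    shrunk_mean(4.5, 100)  # ~4.41: 100 reviews, barely shrunk
    shrunk_mean(4.7, 5)    # ~3.90: 5 reviews, pulled down hard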

------
albertoleal
Here's a paper that analyzes various ways to rank items using
upvotes/downvotes and comes up with a way to determine which method is
best:
[http://www.dcs.bbk.ac.uk/~dell/publications/dellzhang_ictir2...](http://www.dcs.bbk.ac.uk/~dell/publications/dellzhang_ictir2011.pdf)

Also related: [http://planspace.org/2014/08/17/how-to-sort-by-average-ratin...](http://planspace.org/2014/08/17/how-to-sort-by-average-rating/)

~~~
rer0tsaz
To summarize and paraphrase the paper:

> Every upvote should increase the score, every downvote should decrease the
> score and the more votes there are the less an additional vote should
> matter. Only "adding pretend votes" satisfies this.

That really puts into words why "adding pretend votes" just felt right to me
in practice.
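
In code, "adding pretend votes" is a one-liner; a and b below are the prior
pseudo-counts (the 3-upvote/10-downvote prior is the OP's example):

    def score(upvotes, downvotes, a=3, b=10):
        # Posterior mean of Beta(a + upvotes, b + downvotes): every
        # upvote raises the score, every downvote lowers it, and each
        # extra vote matters less as the totals grow.
        return (a + upvotes) / (a + b + upvotes + downvotes)

    score(0, 0)    # ~0.23: a new item starts at a / (a + b)
    score(1, 0)    # ~0.29: the first upvote moves the score a lot
    score(101, 0)  # ~0.91: the hundred-and-first, barely at all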

------
Matumio
> _So if we want to maximize the expected utility we get out of the top spot,
> we should put the item with maximum expected popularity there._

Really? If users only ever look at the top 10 items, you'll never find out
that item #33 would end up much higher if it got some attention from voters.
This is not only a statistical problem, but also a policy/intervention
problem. There is an explore/exploit trade-off to be solved.

A very popular policy for similar problems is Thompson sampling: don't sort
items according to their expected scores; instead, draw a score for each
item at random from your current belief about its plausible true score
(e.g. the beta distribution you have learned) and sort according to those
draws.
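
A minimal sketch of that policy, assuming each item is an (upvotes,
downvotes) pair and a uniform Beta(1, 1) prior:

    import random

    def thompson_order(items):
        # Draw one plausible score per item from its Beta posterior and
        # sort by the draws. Uncertain items sometimes draw high, which
        # is exactly what buys the exploration.
        def draw(up, down):
            return random.betavariate(up + 1, down + 1)
        return sorted(items, key=lambda it: draw(*it), reverse=True)

    thompson_order([(50, 10), (2, 0), (500, 400)])  # order varies per call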

------
igonvalue
> A beta prior is equivalent to assigning “pretend votes” to each new item.
> Instead of starting out with 0 upvotes and 0 downvotes, we start out with a
> upvotes and b downvotes. If we expect that most items are bad, we could
> start with 3 upvotes and 10 downvotes. After getting u real upvotes, and d
> real downvotes the posterior distribution is Beta(a+u, b+d).

Fascinating. Does this follow as a straightforward consequence of how the
beta distribution is defined? If not, is there a proof that someone could
point me toward?

> The popularity of an item has a Beta(a,b) prior

Is there an optimal choice of a, b given, say, a specific utility function?

~~~
wuch
The beta distribution is the conjugate prior of the binomial distribution.
That means that if you start with a beta prior and the likelihood is
binomial, then the posterior obtained using Bayes' theorem will also be a
beta distribution (with different parameters).

If you write down the definition of the beta distribution and then the
formula for the posterior, it should be quite clear that the result is also
a beta distribution, modulo the normalizing constant, which may be a little
trickier to determine.
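
Concretely, writing p for the probability of an upvote, with u upvotes and
d downvotes observed, and dropping constants that don't depend on p:

    posterior(p) ∝ p^u (1-p)^d × p^(a-1) (1-p)^(b-1)
                 = p^(a+u-1) (1-p)^(b+d-1)

which is exactly the kernel of Beta(a+u, b+d), so the normalizing constant
is forced to be the beta function B(a+u, b+d).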

------
trishume
See this related article for another Bayesian method that can account for
time of submission if you don't want to penalize new submissions as much
for having less voting time:
[http://camdp.com/blogs/how-sort-comments-intelligently-reddi...](http://camdp.com/blogs/how-sort-comments-intelligently-reddit-and-hacker-)

~~~
trishume
This blog post is an excerpt from the following chapter of his ebook, if you
want more detail and cool analysis:
[http://nbviewer.ipython.org/github/CamDavidsonPilon/Probabil...](http://nbviewer.ipython.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter4_TheGreatestTheoremNeverTold/Chapter4.ipynb#Example:-How-to-order-Reddit-comments)

------
jwmerrill
It seems to me that "hot or not" style ratings, where the user is asked to
pick a preference between two items, would be more informative and easier to
interpret than star ratings or thumbs up/thumbs down ratings. I wonder why it
isn't used more often.

Star ratings have problems with compression of scales at the top and bottom.
You'll never know which item is someone's favorite (or least favorite) with
star ratings, because typically there will be several items with the maximum
or minimum number of stars.

Pair-wise comparisons are also more fun and easier for users. When I'm doing
star ratings, I often find myself trying to remember what star ratings I've
given to similar things that I liked a little more or a little less so that I
can try to be consistent.

Pair-wise comparisons probably make more sense for items in similar
categories, though. It makes a lot more sense to pick a preference between two
novels than it does to pick a preference between an ice cube tray and a camp
chair.
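
For what it's worth, one standard way to turn pairwise picks into a ranking
is an Elo-style update (a generic sketch, not anything from the article; k
controls how fast ratings move):

    def elo_update(winner_rating, loser_rating, k=32):
        # Expected win probability under the Elo logistic model.
        expected = 1 / (1 + 10 ** ((loser_rating - winner_rating) / 400))
        delta = k * (1 - expected)  # upsets move ratings more
        return winner_rating + delta, loser_rating - delta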

------
nkurz
If you are looking for more about this approach, "additive smoothing" might be
a good search term not mentioned in the article. Also "Laplace smoothing" or
"Lidstone smoothing". "Dirichlet prior" (add an 'h' to the spelling in the
article) might also be on topic.

------
brw12
I've been using an equivalent method, after reading about imdb's method for
comparing the best movies of all time.

I extend it, hackily, to allow categories to be declared to apply to objects
with arbitrary confidence at any time, and to declare the same categorization
multiple times.

I then consider both the confidence amounts and number of declarations in
comparing the overall confidence in different categorizations.

I use a 0-1.0 scale for confidence, then compute an adjusted confidence for
each potential categorization as (sum of confidences) / (number of
confidence declarations + 3).

This is equivalent to assuming a prior of three declarations of zero
confidence; this effectively rewards higher numbers of votes, such that a
single declaration for category A of confidence 1.0 will tie three
declarations for category B of confidence 0.5.
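
As a sketch (this just restates the formula above; the tie is easy to
check):

    def adjusted_confidence(confidences, pretend=3):
        # Sum of the 0-1.0 confidence declarations, divided by their
        # count plus three pretend zero-confidence declarations.
        return sum(confidences) / (len(confidences) + pretend)

    adjusted_confidence([1.0])            # 0.25
    adjusted_confidence([0.5, 0.5, 0.5])  # 0.25: ties the single 1.0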

------
vortico
I've always considered "bumping" to be the perfect scaling algorithm for
content sorting. If a comment is added to a thread, it will show up at the top
of the page until a comment is added to another thread. Thus the probability
of seeing a thread on the front page is proportional to the responses it
gathers. New threads will be seen immediately and will generate responses if
interesting. Unpopular threads will still get a small chance at reaching
someone rather than never being seen by anyone. Finally, two threads that
would otherwise generate an equal number of upvotes or stars would be scaled by
how deep and worthwhile their content is, since a simple like is less
meaningful than a like and a long comment.
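
The whole policy fits in a line of code; the field name below is made up
for the sketch:

    def bump_order(threads):
        # Most recently commented-on thread first.
        return sorted(threads, key=lambda t: t.last_comment_at, reverse=True)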

~~~
babuskov
I have seen many forums where this is exploited by users who like to keep
their threads on top. They deliberately don't submit the whole story up
front, then keep adding new pieces as comments, bumping the thread whenever
it drops off the front page.

So, I wouldn't call it perfect.

------
stared
IMHO ordinal values are the most misunderstood kind: they are neither
categorical nor numerical (even if they happen to be represented by
numbers). Setting utility by hand is, at best, a questionable way to do it.
The educated way is to use Item Response Theory (see e.g.
[https://github.com/philchalmers/mirt/wiki](https://github.com/philchalmers/mirt/wiki)).

------
amelius
I suspect that people are more likely to rate an item when they have strong
negative feelings about it, and I guess this should be taken into account too.

------
piyushchauhan
Only one error:

300 / (300 + 100) = 3/4, not 1/4.

Nice read!

------
ingenter
[http://www.evanmiller.org/how-not-to-sort-by-average-rating....](http://www.evanmiller.org/how-not-to-sort-by-average-rating.html)

~~~
DennisP
You realize that link was in the first line of the article, right?

