Bayesian ranking of items with up and downvotes or 5 star ratings (julesjacobs.github.io)



The method described here is simple because it's only looking at the mean of the belief about each item; it uses the prior belief as a way either to sandbag new items or to give them a bump. I tend to advocate methods that take into account the variance of the belief in order to minimize the risk of showing bad stuff at the top of the heap.

I have a newer article (not mentioned here) that ranks 5-star items using the variance of the belief. It ends up yielding a relatively simple formula, or at least a formula that doesn't require special functions. Like the OP I use a Dirichlet prior, but then I approximate the variance of the utility in addition to the expected utility:

http://www.evanmiller.org/ranking-items-with-star-ratings.ht...

The weakness of the approach (as well as the OP) is that it doesn't really define a loss function for decision-making (i.e. doesn't properly account for the costs of an incorrect belief), which one might argue is the whole point of being a Bayesian in the first place. In practice it seems that using a percentile point on the belief ends up approximating a multi-linear loss function, but I haven't worked out why that is.
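
If I'm reproducing it correctly, the score that article sorts by looks roughly like the sketch below: a uniform Dirichlet prior (one pretend vote per star level), a normal approximation to the posterior of the mean rating, and a lower bound at z standard deviations (the z value is a free parameter, not a recommendation).

    from math import sqrt

    def star_rating_lower_bound(counts, stars=(1, 2, 3, 4, 5), z=1.65):
        """Approximate lower bound on the expected star rating.
        counts[k] = number of ratings with value stars[k]."""
        N = sum(counts)
        K = len(counts)
        # Posterior mean of each star level's probability under Dirichlet(1, ..., 1)
        p = [(n + 1) / (N + K) for n in counts]
        mean = sum(s * pk for s, pk in zip(stars, p))
        second_moment = sum(s * s * pk for s, pk in zip(stars, p))
        variance = (second_moment - mean * mean) / (N + K + 1)
        return mean - z * sqrt(variance)

Sorting by that bound keeps items with only a handful of ratings from jumping straight to the top, while letting well-established high averages through.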


This is interesting stuff, but I wonder whether anyone has verified the results in practice. These methods are all quite simple. They assume, for example, that the quality of an item is independent of the quality of the surrounding content. This is clearly not true. When Steve Jobs died, for example, no other news in the tech community was going to get air time. There is also the need for a variety of content. I think we all know how boring it is to read endless "I wrote X in Y" posts on HN, where X is some simple software system like a blog and Y is the language du jour (Node.js / Go / whatever).

In the machine learning community the above problems are addressed with submodular loss functions, bandit algorithms, and no doubt other methods I don't know about. Now I don't value complexity for its own sake, so I wonder if the additional power these approaches bring is warranted.


> I tend to advocate methods that take into account the variance of the belief in order to minimize the risk of showing bad stuff at the top of the heap.

Penalizing variance would be the opposite of my intuition. Given a boring low-variance item with ten 3-star votes and a divisive item with five 1-star votes and five 5-star votes, I'd think you'd want the one at the top to be the one with a medium chance that users will "love" it rather than a high chance that they'll find it merely passable.

If you further assume that the average person is going to check out the top few results but only "buy" if they find something they really like, the risky approach seems even more appealing. A list topped by known mediocre choices has a low chance of "success". What's the scenario you are envisioning?


The kind of divisive item you describe is rare, at least on Amazon. What happens most commonly is that everyone loves something or everyone hates it, with some noise (e.g. 10% 1 or 2 star reviews). In this case, it makes sense to promote the item that has a 4.5 mean score and 100 reviews over one that has a 4.7 mean score and only 5 reviews. You want to account for the uncertainty when there are few ratings. If you don't, all the items at the top of your search results will be 5-star 1-review products.
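
To make that concrete with the numbers above, a quick back-of-the-envelope with pretend votes (the ten pretend 3-star votes here are an arbitrary choice, just to illustrate):

    pretend_n, pretend_mean = 10, 3.0  # assumed prior: ten pretend 3-star votes

    def smoothed_mean(mean, n):
        return (mean * n + pretend_mean * pretend_n) / (n + pretend_n)

    print(smoothed_mean(4.5, 100))  # ~4.36: 100 reviews averaging 4.5
    print(smoothed_mean(4.7, 5))    # ~3.57: 5 reviews averaging 4.7

The 100-review item comes out on top, and the ordering only flips once the 4.7 item has accumulated a few dozen real ratings at that average.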


I saw this post's headline and it reminded me of a "how not to sort by average rating" post I read many years ago. I looked it up on Google [1], read it, clicked through to this HN thread... lo and behold, the top comment was written by the author of the article I just looked up. Great stuff.

[1] http://www.evanmiller.org/how-not-to-sort-by-average-rating....


Here's a paper that analyzes various ways to rank items using upvotes and downvotes, and comes up with a way to determine which method is best: http://www.dcs.bbk.ac.uk/~dell/publications/dellzhang_ictir2...

Also related: http://planspace.org/2014/08/17/how-to-sort-by-average-ratin...


To summarize and paraphrase the paper:

> Every upvote should increase the score, every downvote should decrease the score and the more votes there are the less an additional vote should matter. Only "adding pretend votes" satisfies this.

That really puts into words why "adding pretend votes" just felt right to me in practice.
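
A minimal sketch of that scheme (a and b are the pretend votes; the defaults here are arbitrary):

    def score(upvotes, downvotes, a=1, b=1):
        # Plain upvote fraction after adding a pretend upvotes and b pretend
        # downvotes, i.e. the posterior mean of a Beta(a, b) prior updated
        # with the real votes.
        return (upvotes + a) / (upvotes + downvotes + a + b)

    # Each additional vote matters less as the totals grow:
    # score(1, 0) = 0.67, score(10, 0) = 0.92, score(100, 0) = 0.99

Both properties from the quote fall out directly: every vote nudges the score in its direction, and the nudge shrinks as the denominator grows.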


> So if we want to maximize the expected utility we get out of the top spot, we should put the item with maximum expected popularity there.

Really? If users only ever look at the top 10 items, you'll never find out that item #33 would end up much higher if it got some attention from voters. This is not only a statistical problem, but also a policy/intervention problem. There is an explore/exploit trade-off to be solved.

A very popular policy for similar problems is to use Thompson sampling: don't sort items according to their expected score; instead, draw a score for each item at random and sort according to those draws. (At random from your current belief about the plausible true scores, e.g. the beta distribution you have learned.)
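
A minimal sketch of that policy, assuming each item's belief is the Beta(a + upvotes, b + downvotes) posterior from the article:

    import random

    def thompson_rank(items, a=1, b=1):
        """items: iterable of (item_id, upvotes, downvotes) tuples.
        Draw one plausible upvote probability per item from its Beta
        posterior and sort by the draws, so uncertain new items
        occasionally rank high enough to collect votes."""
        draws = {
            item_id: random.betavariate(a + up, b + down)
            for item_id, up, down in items
        }
        return sorted(draws, key=draws.get, reverse=True)

Re-drawing on every page load gives new items a shot at the top without letting them dominate.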


> A beta prior is equivalent to assigning “pretend votes” to each new item. Instead of starting out with 0 upvotes and 0 downvotes, we start out with a upvotes and b downvotes. If we expect that most items are bad, we could start with 3 upvotes and 10 downvotes. After getting u real upvotes, and d real downvotes the posterior distribution is Beta(a+u, b+d).

Fascinating. Does this follow as a straightforward consequence of how the beta distribution is defined? Otherwise, is there a proof that someone could point me toward?

> The popularity of an item has a Beta(a,b) prior

Is there an optimal choice of a, b given, say, a specific utility function?


Beta is the conjugate prior of the binomial distribution. That means if you start with a beta prior and the likelihood has a binomial distribution, then the posterior obtained using Bayes' theorem will also have a beta distribution (with different parameters).

If you look up the definition of the beta distribution and then write down the formula for the posterior, it should be quite clear that the result is also a beta distribution, modulo the normalizing constant, which may be a little trickier to determine.
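
Spelling that out for the up/down-vote case (dropping normalizing constants), with theta the unknown upvote probability, u real upvotes and d real downvotes:

    p(\theta \mid u, d) \propto \theta^{u} (1-\theta)^{d} \cdot \theta^{a-1} (1-\theta)^{b-1}
                        = \theta^{(a+u)-1} (1-\theta)^{(b+d)-1}

That is exactly the kernel of Beta(a+u, b+d), so renormalizing gives that distribution; the normalizing constant is 1/B(a+u, b+d).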


This is the Hello World of Bayesian data analysis. A discussion and examples can be found in literally every Bayesian textbook.


See this related article for another Bayesian method that can account for time of submission if you don't want to penalize new submissions as much for having less voting time: http://camdp.com/blogs/how-sort-comments-intelligently-reddi...


This blog post is an excerpt from the following chapter of his ebook, if you want more detail and cool analysis: http://nbviewer.ipython.org/github/CamDavidsonPilon/Probabil...


Don't bother visiting that site on mobile on Android; it didn't work for me at all.


It seems to me that "hot or not" style ratings, where the user is asked to pick a preference between two items, would be more informative and easier to interpret than star ratings or thumbs up/thumbs down ratings. I wonder why it isn't used more often.

Star ratings have problems with compression of scales at the top and bottom. You'll never know which item is someone's favorite (or least favorite) with star ratings, because typically there will be several items with the maximum or minimum number of stars.

Pair-wise comparisons are also more fun and easier for users. When I'm doing star ratings, I often find myself trying to remember what star ratings I've given to similar things that I liked a little more or a little less so that I can try to be consistent.

Pair-wise comparisons probably make more sense for items in similar categories, though. It makes a lot more sense to pick a preference between two novels than it does to pick a preference between an ice cube tray and a camp chair.
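
The comment doesn't spell out how you'd turn the pairwise picks into a ranking; one standard option (my addition, not from the thread or the article) is an Elo-style / Bradley-Terry update:

    def elo_update(rating_a, rating_b, a_won, k=32):
        """One Elo-style update after a head-to-head pick between items A and B.
        The expected outcome comes from the current rating gap; the winner
        gains and the loser loses, scaled by how surprising the result was."""
        expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
        delta = k * ((1.0 if a_won else 0.0) - expected_a)
        return rating_a + delta, rating_b - delta

    # Start every item at 1500 and sort by rating; per the comment above,
    # only pit items from comparable categories against each other.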


If you are looking for more about this approach, "additive smoothing" might be a good search term not mentioned in the article. Also "Laplace smoothing" or "Lidstone smoothing". "Dirichlet prior" (add an 'h' to the spelling in the article) might also be on topic.


I've been using an equivalent method, after reading about imdb's method for comparing the best movies of all time.

I extend it, hackily, to allow categories to be declared to apply to objects with arbitrary confidence at any time, and to declare the same categorization multiple times.

I then consider both the confidence amounts and number of declarations in comparing the overall confidence in different categorizations.

I use a 0-1.0 scale for confidence, then compute the adjusted confidence for each potential categorization as (sum of confidences) / (number of confidence declarations + 3).

This is equivalent to assuming a prior of three declarations of zero confidence; it effectively rewards higher numbers of declarations, such that a single declaration for category A with confidence 1.0 will tie three declarations for category B with confidence 0.5 each.
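
In code that smoothing is just (the names are mine):

    def adjusted_confidence(confidences, pretend_declarations=3):
        # Sum of 0-1.0 confidence declarations, shrunk toward zero by a fixed
        # number of pretend declarations of zero confidence.
        return sum(confidences) / (len(confidences) + pretend_declarations)

    # adjusted_confidence([1.0])           -> 0.25
    # adjusted_confidence([0.5, 0.5, 0.5]) -> 0.25  (ties the single 1.0 declaration)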


I've always considered "bumping" to be the perfect scaling algorithm for content sorting. If a comment is added to a thread, that thread shows up at the top of the page until a comment is added to another thread. Thus the probability of seeing a thread on the front page is proportional to the responses it gathers. New threads will be seen immediately and will generate responses if interesting. Unpopular threads will still get a small chance of reaching someone rather than never being seen by anyone. Finally, two threads that would otherwise generate an equal number of upvotes or stars are scaled by how deep and worthwhile their content is, since a simple like is less meaningful than a like plus a long comment.


I have seen many forums where this is exploited by users who like to keep their threads on top. They don't submit the whole story at once; instead, they keep adding new pieces as comments, bumping the thread whenever it drops off the front page.

So, I wouldn't call it perfect.


Zombie bumping can be a problem: find some controversial thread from a year ago, bump it, and watch people ignore the date stamp and reply to it as if it were a new thread.

And rewarding bumps usually means rewarding bickering flamewars.


You have to account for trolls though. There is also often a naysaying or nitpicking thread on HN which explodes in reply count as people quibble back and forth over definitions.


IMHO ordinal values are the most misunderstood ones: they are neither categorical nor numerical (even if they happen to be represented by numbers). Setting the utility by hand is, at best, a questionable way to do it. The principled way is to use Item Response Theory (see e.g. https://github.com/philchalmers/mirt/wiki).


I suspect that people are more likely to rate an item when they have strong negative feelings about it, and I guess this should be taken into account too.


Only one error: 300 / (300 + 100) = 3/4, not 1/4.

Nice read!



You realize that link was in the first line of the article, right?



