How Not To Sort By Average Rating (evanmiller.org)
195 points by marcus on Feb 12, 2009 | 56 comments

Why take the lower bound? It means you systematically underestimate the items with fewer ratings. Also, this formula assumes a normal distribution.

There is another solution called 'True Bayesian Average' that is used on IMDB.com, for example. For the formula and the explanation how it works see here:


It assumes that the aggregate outcome of a large number of Bernoulli trials (i.e. true/false; up/down; good/bad) is distributed binomially, which can be reasonably approximated by the Normal distribution when the number of trials is large. This is technically true, and works for HN-style voting.
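For anyone who wants to poke at it, here is a minimal sketch of the lower bound of the Wilson score interval for up/down votes. The function name and the default z = 1.96 (roughly 95% confidence) are my own assumptions, not something taken from the article.

```python
import math


def wilson_lower_bound(positive: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a Bernoulli proportion.

    z = 1.96 is an assumed default (about 95% confidence).
    """
    if total == 0:
        return 0.0  # assumption: an unrated item gets the worst possible score
    p_hat = positive / total
    z2 = z * z
    centre = p_hat + z2 / (2 * total)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / total + z2 / (4 * total * total))
    return (centre - margin) / (1 + z2 / total)


# A single upvote ranks below 90 upvotes out of 100 votes.
print(wilson_lower_bound(1, 1))
print(wilson_lower_bound(90, 100))
```

The point of the lower bound is visible in the two calls: the lone upvote has a 100% positive fraction but very little evidence behind it, so it scores far lower.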

This isn't true for Amazon ratings, since ranking something 1-5 isn't a Bernoulli trial. But the central limit theorem says that the average of those ratings will be normally distributed (assuming that they're identically distributed and independent), so it can still work. The confidence interval is different, however.
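A rough sketch of what that different interval could look like for 1-5 star ratings, using the usual normal approximation for a sample mean. The function name, the z default, and the choice to fall back to the bottom of the scale when there are fewer than two ratings are all assumptions made for illustration.

```python
import math
import statistics


def mean_lower_bound(ratings: list[int], z: float = 1.96, scale_min: float = 1.0) -> float:
    """Normal-approximation lower bound on the mean of star ratings."""
    n = len(ratings)
    if n < 2:
        # Assumption: with fewer than two ratings the sample standard
        # deviation is undefined, so fall back to the bottom of the scale.
        return scale_min
    mean = statistics.mean(ratings)
    std_err = statistics.stdev(ratings) / math.sqrt(n)
    # Clamp so the bound never drops below the lowest possible rating.
    return max(scale_min, mean - z * std_err)


print(mean_lower_bound([3] * 100))     # everyone agrees: bound equals the mean
print(mean_lower_bound([1, 5] * 50))   # same mean, wider interval, lower bound
```

With few ratings the normal approximation is shaky, so treat this as a starting point rather than a drop-in replacement.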

Actually, I think Amazon has it right, for several reasons:

1. The user can actually understand and predict the behavior.

2. The "problem" the OP identifies is partially self-correcting, because items with a few positive ratings get more attention as a result of their high ranking, and if they deserve more poor ratings, they'll get them.

3. As long as you tell users how many ratings there are, they can use their own judgment as to how important that is.

On Kongregate we do the same thing that Newgrounds does - games don't display an average rating or appear in the rankings until they have a minimum number of ratings. For us that's 75 ratings.

Of course we only get a hundred or so new games a week, so it's not hard to get that many ratings. Much harder for a site with lots of stuff, especially if they have lots of stuff from day one.

Speaking of Amazon -- I'd rather buy a book with lots of 1s and 5s than a book with straight 3s. Wouldn't you?

Hmm, that depends on the book and who posted the 1s and the 5s. You just described the ratings of every Scientology book and every evolution book on Amazon.

But I agree that while there is a decent chance that a book with lots of 5s and lots of 1s will be of value, the chances that a book with straight 3s is worth anything are pretty slim.

And Wolfram's "A New Kind of Science", if I remember correctly.


So you'd rather sort by variance?
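To make the question concrete, here is a small sketch with two made-up rating lists that share the same average but differ completely in spread. The data and the names are hypothetical.

```python
import statistics

# Hypothetical rating histories: same mean, very different spread.
polarized = [1, 5] * 50   # lots of 1s and 5s
lukewarm = [3] * 100      # straight 3s

books = {"polarized": polarized, "lukewarm": lukewarm}

for title, ratings in books.items():
    print(title, statistics.mean(ratings), statistics.pvariance(ratings))

# Sorting by variance puts the polarizing book first even though
# an average-based sort cannot tell the two apart.
by_variance = sorted(
    books, key=lambda title: statistics.pvariance(books[title]), reverse=True
)
print(by_variance)
```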

I don't see why (outside of security issues) we can't just define our own sorting functions for a site.

This is a dream of mine as well, but performance may be a problem. Custom functions mean no caching and arbitrary computational burden from each function.

It doesn't have to be totally custom. Just solicit a few representative types and allow all those. This also solves the problem of not everyone knowing how to write such a function, and dealing with user code running on the server.

Better yet, play around with lots of different sorting methods, and look for patterns in the sorting method used (s), the user's data and history (h), and the amount of money the user ends up spending ($).

Maximize $, and don't ask stupid questions like "Would you rather sort by variance, average rating, or Wilson score?"

(I'm pretty sure that's what Amazon is busy doing every day. They're positively brilliant at turning data into money.)

> Custom functions mean no caching

This only applies if you are using a sub-par database. While caching computationally heavy operations is not always a good choice (since they will need to get updated when more entries are added to the dataset), chances are you will have orders of magnitude more lookups than writes and hence it makes sense to create a computed column and index the results (and hence cache them).
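A minimal sketch of that idea with SQLite from Python: the score lives in an ordinary indexed column that is recomputed on each vote, so reads are a plain indexed `ORDER BY`. The table layout, the function names, and the placeholder scoring function are all assumptions; swap in whatever score you actually want.

```python
import sqlite3


def smoothed_fraction(up: int, down: int) -> float:
    """Placeholder scoring function (hypothetical); any score of (up, down) works."""
    return (up + 1) / (up + down + 2)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items ("
    " id INTEGER PRIMARY KEY, name TEXT NOT NULL,"
    " up INTEGER NOT NULL DEFAULT 0, down INTEGER NOT NULL DEFAULT 0,"
    " score REAL NOT NULL DEFAULT 0)"
)
# The index is what makes reads cheap: sorting never recomputes the score.
conn.execute("CREATE INDEX items_by_score ON items (score DESC)")


def add_vote(conn: sqlite3.Connection, item_id: int, is_up: bool) -> None:
    """Record one vote and refresh the cached score for that item only."""
    column = "up" if is_up else "down"  # fixed set of names, not user input
    conn.execute(f"UPDATE items SET {column} = {column} + 1 WHERE id = ?", (item_id,))
    up, down = conn.execute(
        "SELECT up, down FROM items WHERE id = ?", (item_id,)
    ).fetchone()
    conn.execute(
        "UPDATE items SET score = ? WHERE id = ?",
        (smoothed_fraction(up, down), item_id),
    )


def top_items(conn: sqlite3.Connection, limit: int = 10) -> list[str]:
    rows = conn.execute("SELECT name FROM items ORDER BY score DESC LIMIT ?", (limit,))
    return [row[0] for row in rows]


conn.execute("INSERT INTO items (id, name) VALUES (1, 'A'), (2, 'B')")
add_vote(conn, 1, True)                 # A: 1 up
for _ in range(9):
    add_vote(conn, 2, True)             # B: 9 up ...
add_vote(conn, 2, False)                # ... and 1 down
print(top_items(conn))
```

The write path does a little extra work per vote, which is the trade the comment describes: many more lookups than writes.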

Because the vast majority of users out there can barely click a mouse, much less write code to define sorting functions.

In the case of Amazon, perhaps the Associates Web Service APIs could allow the creation of a tool allowing user-defined sorting functions. I've never used these APIs, but I found some details on them that lead me to think it would be possible. See http://aws.amazon.com/associates/#details

That said, asking honestly, are you entirely convinced of the mass appeal of user-defined (or user-selected) sorting algorithms? I find that the current algorithm usually gives fairly sane results.

(Edit: rewrote last sentence for clarity)

Yes. And similarly, for sites like News.YC: I'd rather read a "2 points" comment with 100 upvotes and 99 downvotes, than something banal with one upvote.

Comments like that are frequently dogmatic assertions about down-the-middle controversial issues (like religion or politics) that draw knee-jerk up/down votes from the two ever-combative groups. The comments with one upvote can sometimes be hidden gems in an abandoned thread, or possibly one great comment among many others.

They can be. But if they are shallow, dogmatic assertions, they often (and should) go negative and stay negative.

I've seen lots of good, fair comments with strong opinions (whether I agree with them or not) go initially negative -- early downvotes from the anonymous censorious underemployed peanut gallery -- then creep back up to small positive values after a full day's cycle. Those are much better comments than any pandering +20 one-liner.

Maybe Amazon can implement "Sort by Kurtosis" :)


OK, I came to this late, but there are other solutions.

On Dawdle, we spent a ton of time thinking about this. For their Marketplace sellers, eBay does number one; Amazon does number two. What we do is not three, but something that I think is better: we just rank our users.

Look, you can't ask buyers on the site to rank any particular transaction against all others; that's crazy talk, especially as you get into large numbers and people forget about all their experiences ever on a site. But we guide all our buyers to leave feedback of 3 - not 5 or a positive. The hope is that you'll have a semi-normal distribution of all feedback. Then, and only then, do we do all sorts of stuff to that. We have bonus points for good behaviors, some of which we talk about (shipping quickly, using Delivery Confirmation, linking your Dawdle account to Facebook, MySpace, XBL, PSN, etc) and some we don't. Then we just throw the ranks on a five point scale.

We call this our Seller Rating, not a feedback rating. Feedback is just the beginning of the process. KFC starts with the chicken - necessary but not sufficient - then adds their 13 herbs and spices. That's what we do, and it's why we don't even allow users to see the individual feedbacks. They're designed to be useless individually, and I don't want some new eBay expat bitching about not having 100% feedback.

Does this mean that some people are going to end up with 1s on a 5 point scale? You betcha. And we don't want them anyway - they're more hassle than they're worth. They can go to eBay or Amazon.

Yelling "WRONG" at two popular options isn't much of an argument. I suspect that Amazon has a good profit-maximizing reason for their ordering.

But the author didn't just yell wrong, he provided examples exposing the flaws in both scoring models.

He showed screenshots, without explaining anything wrong with the rankings unless the reader already intuitively agreed with him.

But why isn't the net-positive score the right thing for UrbanDictionary, and the average the right thing for Amazon? No case was made.

I could argue these either way, but if I were to write an attention-grabbing headline and yell "WRONG", I'd actually make a case. Attitude alone is not an argument.

Could it be that the business requirement for the ranking system is actually that users be given the illusion of participating in a social site, because ultimately the rankings have very little impact on overall sales?

While over there, the "Users who bought X also bought Y" feature has a very strong impact on sales, so the engineers spend all of their time tweaking its algorithm?

The first example is wrong. 60 - 40 = 20 which is greater than 100 - 100 = 0.

I noticed that too. The numbers in the picture do support his point though. 209/259 = 80.7% is listed higher than 118/143 = 82.5%
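The arithmetic is quick to check. A small sketch, with the negative counts (50 and 25) obtained by subtracting the positives from the totals quoted above; the item labels are just placeholders.

```python
# (positive, negative) counts from the screenshot numbers quoted above.
items = {"listed higher": (209, 50), "listed lower": (118, 25)}


def net_score(positive: int, negative: int) -> int:
    """Positive ratings minus negative ratings."""
    return positive - negative


def positive_fraction(positive: int, negative: int) -> float:
    """Positive ratings divided by total ratings."""
    return positive / (positive + negative)


for label, (pos, neg) in items.items():
    pct = round(100 * positive_fraction(pos, neg), 1)
    print(f"{label}: net = {net_score(pos, neg)}, positive = {pct}%")
```

Net score favors the item with more total votes (159 vs 93) even though its positive fraction is lower (80.7% vs 82.5%).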

Sadly, the proggit comments are overwhelmed by debate as to whether a small error like this means the author is an idiot. What makes this happen? It's not just debating the color of the bikeshed, it's questioning the qualifications of the Nuclear Power Plant architect on the basis of the bike shed.

Amen. This is a big reason that I can't see myself working as a developer for the rest of my life. I can't deal with the abundance of passive-aggressive personalities who will seize on tiny errors like this, and discount an individual's intelligence.

If you ask me, it comes from the fact that software geeks are primarily young men, many of whom were emotionally abused by their peers while growing up. When you think the world only values your intelligence, you go out of your way to prove that you're smarter than everyone else.

Hah! If you want abuse, you should see some of the frothing at the mouth that happens about LangPop.com, which, as clearly stated on the site, is never going to be more than an approximation. Man does it set people off when their favorite language doesn't do as well as they think:


> What makes this happen?

I'll answer the perhaps rhetorical question by defining a new law: "Any online community that doesn't actively encourage its users to be nice, and civil, to each other is bound to become a cesspool."

It wasn't meant to be rhetorical, nor to call proggit out by name. The very same thing happens within development teams when people "flip the bozo bit" on teammates for the most trivial errors and are biased against anything they say or do forever afterwards.

It can be a very serious problem when this happens within a team, IMO.

Yes, his point is still valid.

Silly that he messes up the first example, but I've wanted something like this third solution when sorting by rating on Amazon -- when a product has only a couple of positive reviews, it's almost the same as being unreviewed. (On the other hand, just a couple of negative reviews can often be quite helpful, since they can list concrete problems making the product bad -- a positive review has the much harder case to make of "there will be nothing bad about this product".)

Yeah, this always drives me nuts.

Assuming that someone's rating is only an estimator of their true rating, and is then clipped to an integer, the more ratings there are, the lower the maximum average you actually see.

You never see something with hundreds of ratings actually score a 5.0, even if most people love it. And the fact that there ARE hundreds of ratings of that thing, and not some comparison thing, is also important.

If you treat the number of ratings as an indicator of quality, you need to account for the time the item has been available for rating as well. Otherwise you underestimate newer items.

Not just time available but also units sold; if you are looking for a niche item, you don't expect a lot of ratings.

This is a decent way to estimate the average score. But sometimes you're not really trying to estimate the average score. You're really trying to estimate the score that the user who's currently viewing the page would give it, especially for someone like Amazon (ie, People Like You Rated this Product as 1-star).

His estimator might be a decent one at picking the average score, but in Amazon's case, it's not a great estimator of the rating I would give the product I am viewing. If you have such an extensive record of my past-purchases, use it to predict how I would like certain products! Surely the 5-star ratings of SICP are more applicable to me than the 1-star ratings.

This is, I think, the bestest funniest post to hit Hacker News in a while.

Evan and I worked together at IMVU, and at one point a senior engineer heard one of my (appallingly bad) ideas and replied "You're fired." and went back to work.

A few weeks later I, apparently, made a terrible joke. Seconds later this shows up in my inbox:

Dear Timothy,

After careful deliberation, it's been decided that you need to be fired.

Reason: telling not funny jokes

You may take one (1) Odwalla for the road.


The Firing Committee

Not only was it hilarious but while I suspected Evan, he hadn't had time to type that whole message! Over the next few weeks I received quite a few of these e-mails, but with different messages:

Reason: distracting Evan from his conversation with Luc

Reason: Asking too many questions


It turns out that not only did Evan find the need to write a formal you're fired letter; he wrote a web service to send formal you're fired letters. To me.

> he wrote a web service to send formal you're fired letters. To me.

You're fired, for dragging down the average amount of awesome in any room Evan is in.

i think tptacek's account has been compromised

You didn't laugh out loud when you got to "the correct solution"?

I did. The way it was presented was rather classic.

I've clearly missed something. Other than being a baroque way to generate an approximation that doesn't really mean much, I guess I don't see what makes it so funny?

Well, with the caveat that the surest way to kill humor is to explain it, I thought it was funny because option 1 was a super-simple equation, option 2 was a super-simple equation, and the "right" answer was Quantum Rocket Surgery.

(I suppose it's funnier if you've done research. Roughly 88.69% of all research literature is some dude trying to "improve" a simple model with pages of mathematics. Not that this guy was wrong -- it was just evocative.)

Alright thanks... To be fair, I am really really bad at picking up humor online. I have some kind of internet autism or something.

I wrote a little bit about a (mathematically naive) solution to the averages problem on my blog (props to Eric Liu for his feedback) -- http://www.cederman.com/?p=116

I actually have yet to encounter a single website with a rating system this sophisticated.

It is particularly hard to find quality content on websites with a large database (e.g. YouTube). Newer videos will always have a higher 'rating' because they are 'newer'. Older videos will always have the most 'views' because they have had more time to get to that state. If you can't reach a consensus about what the best rating system is, implement a nested rating system. This way, the end-user can specify to sort by 'rating' and then by 'date' and then by 'views', or in whatever order he/she feels is best.
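A nested sort like that is cheap to sketch, since Python's sort is stable: sort by the least important key first and the most important key last. The field names and the sample videos are made up for illustration.

```python
from datetime import date
from operator import itemgetter

# Hypothetical video records.
videos = [
    {"title": "a", "rating": 4.5, "uploaded": date(2009, 1, 10), "views": 1200},
    {"title": "b", "rating": 4.5, "uploaded": date(2009, 2, 1), "views": 300},
    {"title": "c", "rating": 4.0, "uploaded": date(2009, 2, 10), "views": 5000},
    {"title": "d", "rating": 4.5, "uploaded": date(2009, 2, 1), "views": 900},
]


def nested_sort(items, keys):
    """Sort by several (field, descending) keys, given in priority order."""
    result = list(items)
    # Apply the lowest-priority key first; stable sorting keeps earlier
    # orderings intact among items that tie on later keys.
    for field, descending in reversed(keys):
        result.sort(key=itemgetter(field), reverse=descending)
    return result


user_choice = [("rating", True), ("uploaded", True), ("views", True)]
ordered = nested_sort(videos, user_choice)
print([video["title"] for video in ordered])
```

The user-facing part is just the `user_choice` list, so letting people reorder the keys does not require running any code they wrote.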

> How Not To Sort By Average Rating

Or, "How to Make Your Host Very Happy Because You Suddenly Have to Move Up To a Server With Double the Processing Cycles"

I realize this is probably a joke, but the relative infrequency of rating, the fact that this rating algorithm only requires information from the local context[1], and the existence of caching means that this is very, very unlikely to be the bottleneck for any site.

[1] i.e. to calculate Harry Potter's rating you only need to know about Harry Potter, not every other book ever printed

yeah, you could cache it in the db until you get a new rating.

It would still slow you down a little though.

And yes, it was a joke :-)

Every site's crappy sorting has always bothered me. It reminds me of AltaVista (before Google, kids).

I have not checked the math on this, but damn near anything should be better than what most sites are doing today.

In fact a better way to sort user ratings is a good idea for a startup.

The simple formula is to assume a Dirichlet prior. This gives you

  score = positives / (negatives + x)

and you can fiddle with x to get something that looks reasonable.
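For comparison with the shorthand above: the textbook posterior mean under a Beta prior (the two-outcome case of a Dirichlet) adds pseudo-counts to both the positives and the total. A minimal sketch, where the function name and the default pseudo-counts of 1 each are assumptions:

```python
def posterior_mean(
    positives: int,
    negatives: int,
    prior_pos: float = 1.0,   # assumed pseudo-count of positive votes
    prior_neg: float = 1.0,   # assumed pseudo-count of negative votes
) -> float:
    """Posterior mean of the positive fraction under a Beta(prior_pos, prior_neg) prior."""
    return (positives + prior_pos) / (positives + negatives + prior_pos + prior_neg)


print(posterior_mean(0, 0))     # no votes: falls back to the prior mean
print(posterior_mean(1, 0))     # one upvote moves the score only a little
print(posterior_mean(100, 10))  # lots of votes: close to the raw fraction
```

Fiddling with the two pseudo-counts plays the same role as fiddling with x: larger values pull items with few votes harder toward the prior.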

pedantic: the urban dictionary example is voting not rating. that's what leads to the confusion of adding up all the positives and negatives.

amazon is rating. it's a scale and there is a clear mean, median and standard deviation.

whether solution #2 is wrong kind of depends. some of this is a social problem on what items get the kind of followers who are likely to rate online. at the extreme case you could imagine a product where (not) liking it was caused by an inability to use html forms.

Excellent. Can we implement that on comments here? How hard would it be?

An easier solution is to give each item a number of implied mediocre ratings (say 10 3s) to start.

1 rating, 5.0 = 35/11 = 3.18; 90 ratings, 4.5 = 435/100 = 4.35
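The same arithmetic as a tiny sketch. The function name is made up; the defaults of ten implied ratings of 3 are the ones suggested above.

```python
def damped_average(
    mean: float,
    count: int,
    prior_rating: float = 3.0,  # the implied mediocre rating
    prior_count: int = 10,      # how many implied ratings every item starts with
) -> float:
    """Average rating after mixing in a fixed number of implied ratings."""
    return (mean * count + prior_rating * prior_count) / (count + prior_count)


print(round(damped_average(5.0, 1), 2))   # one 5-star rating
print(round(damped_average(4.5, 90), 2))  # ninety ratings averaging 4.5
```

Raising `prior_count` makes new items climb more slowly; lowering it lets a handful of ratings dominate sooner.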

The downside is that, since items with few ratings get mediocre scores, if there isn't a way for them to be visible, they won't get out of that rut. So a better approach might be to feature (on a "top 10" list) seven high scorers and 3 "rising stars" selected purely on average score.
