There is another solution, called the 'True Bayesian Average', that is used on IMDB.com, for example. For the formula and an explanation of how it works, see here:
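For reference, a minimal Python sketch of the weighted-rating formula IMDB has published; the site-wide mean C and the vote threshold m below are illustrative guesses, not IMDB's actual values:

    def bayesian_average(item_mean, item_votes, site_mean, min_votes):
        # Weighted rating: items with few votes are pulled toward the
        # site-wide mean; the item's own mean dominates as votes grow.
        v, m = item_votes, min_votes
        return (v / (v + m)) * item_mean + (m / (v + m)) * site_mean

    # Assumed values: site mean C = 3.9, threshold m = 500
    print(bayesian_average(5.0, 3, 3.9, 500))      # ~3.91: three perfect votes barely move it
    print(bayesian_average(4.6, 12000, 3.9, 500))  # ~4.57: many votes, own mean mostly kept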
This isn't true for Amazon ratings, since rating something 1-5 isn't a Bernoulli trial. But the central limit theorem says that the average of those ratings will be normally distributed (assuming they're identically distributed and independent), so it can still work. The confidence interval is different, however.
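For concreteness, here's a minimal sketch of that normal-approximation interval (my own illustration, assuming i.i.d. ratings and a reasonably large n):

    import math

    def normal_lower_bound(ratings, z=1.96):
        # Lower bound of the ~95% normal-approximation confidence
        # interval on the mean rating (CLT-based, needs n >= 2).
        n = len(ratings)
        mean = sum(ratings) / n
        var = sum((r - mean) ** 2 for r in ratings) / (n - 1)  # sample variance
        return mean - z * math.sqrt(var / n)

    print(normal_lower_bound([5, 5, 5, 1, 4, 5, 3, 5]))  # pessimistic estimate of the true mean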
1. The user can actually understand and predict the behavior.
2. The "problem" the OP identifies is partially self-correcting, because items with a few positive ratings get more attention as a result of their high ranking, and if they deserve more poor ratings, they'll get them.
3. As long as you tell users how many ratings there are, they can use their own judgment as to how important that is.
Of course we only get a hundred or so new games a week, so it's not hard to get that many ratings. Much harder for a site with lots of stuff, especially if they have lots of stuff from day one.
But I agree that while there is a decent chance that a book with lots of 5s and lots of 1s will be of value, the chances that a book with straight 3s is worth anything are pretty slim.
I don't see why (outside of security issues) we can't just define our own sorting functions for a site.
Maximize $, and don't ask stupid questions like "Would you rather sort by variance, average rating, or Wilson score?"
(I'm pretty sure that's what Amazon is busy doing every day. They're positively brilliant at turning data into money.)
This only applies if you are using a sub-par database. While caching computationally heavy operations is not always a good choice (since they will need to be updated when more entries are added to the dataset), chances are you will have orders of magnitude more lookups than writes, so it makes sense to create a computed column and index the results (and hence cache them).
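A sketch of that read-heavy caching pattern with sqlite3; the schema and score formula are invented for illustration:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE items (
            id        INTEGER PRIMARY KEY,
            positives INTEGER NOT NULL,
            negatives INTEGER NOT NULL,
            score     REAL              -- cached, recomputed on write
        );
        CREATE INDEX idx_items_score ON items (score);
    """)

    def upsert_item(item_id, positives, negatives):
        # Recompute the cached score on the (rare) write path...
        total = positives + negatives
        score = positives / total if total else 0.0
        conn.execute(
            "INSERT OR REPLACE INTO items VALUES (?, ?, ?, ?)",
            (item_id, positives, negatives, score),
        )

    # ...so the (frequent) read path is a cheap indexed ORDER BY:
    top = conn.execute("SELECT id, score FROM items ORDER BY score DESC LIMIT 10")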
That said, asking honestly, are you entirely convinced of the mass appeal of user-defined (or user-selected) sorting algorithms? I find that the current algorithm usually gives fairly sane results.
(Edit: rewrote last sentence for clarity)
I've seen lots of good, fair comments with strong opinions (whether I agree with them or not) go initially negative -- early downvotes from the anonymous censorious underemployed peanut gallery -- then creep back up to small positive values after a full day's cycle. Those are much better comments than any pandering +20 one-liner.
On Dawdle, we spent a ton of time thinking about this. For their Marketplace sellers, eBay does number one; Amazon does number two. What we do is not three, but something that I think is better: we just rank our users.
Look, you can't ask buyers on the site to rank any particular transaction against all others; that's crazy talk, especially as you get into large numbers and people forget about all their experiences ever on a site. But we guide all our buyers to leave feedback of 3 - not 5 or a positive. The hope is that you'll have a semi-normal distribution of all feedback. Then, and only then, do we do all sorts of stuff to that. We have bonus points for good behaviors, some of which we talk about (shipping quickly, using Delivery Confirmation, linking your Dawdle account to Facebook, MySpace, XBL, PSN, etc) and some we don't. Then we just throw the ranks on a five point scale.
We call this our Seller Rating, not a feedback rating. Feedback is just the beginning of the process. KFC starts with the chicken - necessary but not sufficient - then adds their 11 herbs and spices. That's what we do, and it's why we don't even allow users to see the individual feedbacks. They're designed to be useless individually, and I don't want some new eBay expat bitching about not having 100% feedback.
Does this mean that some people are going to end up with 1s on a 5 point scale? You betcha. And we don't want them anyway - they're more hassle than they're worth. They can go to eBay or Amazon.
But why isn't the net-positive score the right thing for UrbanDictionary, and the average the right thing for Amazon? No case was made.
I could argue these either way, but if I were to write an attention-grabbing headline and yell "WRONG", I'd actually make a case. Attitude alone is not an argument.
While over there, the "Users who bought X also bought Y" feature has a very strong impact on sales, so the engineers spend all of their time tweaking its algorithm?
If you ask me, it comes from the fact that software geeks are primarily young men, many of whom were emotionally abused by their peers while growing up. When you think the world only values your intelligence, you go out of your way to prove that you're smarter than everyone else.
I'll answer the (perhaps rhetorical) question by defining a new law: "Any online community that doesn't actively encourage its users to be nice and civil to each other is bound to become a cesspool."
It can be a very serious problem when this happens within a team, IMO.
Assuming that each person's rating is only a noisy estimate of the true rating, which is then clipped to an integer: the more ratings there are, the lower the maximum average you can expect to see.
You never see something with hundreds of ratings actually score a 5.0, even if most people love it. And the fact that there ARE hundreds of ratings of that thing, and not some comparison thing, is also important.
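The arithmetic behind that is simple: if even a small fraction of raters give less than 5 stars, a perfect average dies off exponentially. A quick illustration, where the 95% figure is an assumption:

    # Probability an item's average is still a perfect 5.0 after n
    # ratings, if an (assumed) 95% of raters give 5 stars:
    p_five = 0.95
    for n in (10, 50, 200):
        print(n, p_five ** n)   # ~0.60, ~0.077, ~0.000035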
His estimator might be a decent one at picking the average score, but in Amazon's case it's not a great estimator of the rating I would give the product I am viewing. If you have such an extensive record of my past purchases, use it to predict how I would like certain products! Surely the 5-star ratings of SICP are more applicable to me than the 1-star ratings.
A few weeks later I, apparently, made a terrible joke. Seconds later this shows up in my inbox:
After careful deliberation, it's been decided that you need to be fired.
Reason: telling not funny jokes
You may take one (1) Odwalla for the road.
The Firing Committee
Not only was it hilarious, but while I suspected Evan, he hadn't had time to type that whole message! Over the next few weeks I received quite a few of these e-mails, but with different messages:
Reason: distracting Evan from his conversation with Luc
Reason: Asking too many questions
It turns out that not only did Evan feel the need to write a formal you're-fired letter; he wrote a web service to send formal you're-fired letters. To me.
You're fired, for dragging down the average amount of awesome in any room Evan is in.
(I suppose it's funnier if you've done research. Roughly 88.69% of all research literature is some dude trying to "improve" a simple model with pages of mathematics. Not that this guy was wrong -- it was just evocative.)
It is particularly hard to find quality content on websites with a large database (e.g. YouTube). Newer videos will always have a higher 'rating' simply because they are newer, and older videos will always have the most 'views' because they have had more time to accumulate them. If you can't reach a consensus on the optimal rating system, implement a nested rating system: let the end user sort by 'rating', then by 'date', then by 'views', or in whatever order they find most useful.
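That kind of nested sort is a one-liner in most languages. A sketch in Python, where the records and field names are made up for illustration:

    from datetime import date

    videos = [
        {"title": "a", "rating": 4.8, "uploaded": date(2009, 3, 1), "views": 1200},
        {"title": "b", "rating": 4.8, "uploaded": date(2007, 6, 5), "views": 90000},
        {"title": "c", "rating": 3.9, "uploaded": date(2009, 2, 1), "views": 50},
    ]

    # User-chosen key order: rating first, then date, then views (all descending).
    ranked = sorted(
        videos,
        key=lambda v: (v["rating"], v["uploaded"], v["views"]),
        reverse=True,
    )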
Or, "How to Make Your Host Very Happy Because You Suddenly Have to Move Up To a Server With Double the Processing Cycles"
I.e., to calculate Harry Potter's rating you only need to know about Harry Potter, not every other book ever printed.
It would still slow you down a little though.
And yes, it was a joke :-)
I have not checked the math on this, but damn near anything should be better than what most sites are doing today.
In fact, a better way to sort user ratings is a good idea for a startup.
score = positives / (negatives + x)
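If I'm reading that right, x is a smoothing constant that keeps a handful of positives (and zero negatives) from producing a huge score. A quick sketch, where x = 5 is just an assumed value:

    def score(positives, negatives, x=5):
        # x damps items with few ratings; without it, any item with
        # zero negatives would divide by zero (or dominate outright).
        return positives / (negatives + x)

    print(score(3, 0))     # 0.6  -- brand-new item with 3 positives
    print(score(300, 30))  # ~8.6 -- established item wins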
Amazon is rating: it's a scale, and there is a clear mean, median, and standard deviation.
Whether solution #2 is wrong kind of depends. Some of this is a social problem of which items attract the kind of followers who are likely to rate online. At the extreme, you could imagine a product where (not) liking it was caused by an inability to use HTML forms.
1 rating of 5.0 → (1×5.0 + 10×3.0) / (1 + 10) = 35/11 ≈ 3.18;
90 ratings averaging 4.5 → (90×4.5 + 10×3.0) / (90 + 10) = 435/100 = 4.35
(i.e., start every item with ten phantom ratings of 3.0 and take the damped mean)
The downside is that, since items with few ratings get mediocre scores, if there isn't a way for them to be visible, they won't get out of that rut. So a better approach might be to feature (on a "top 10" list) seven high scorers and three "rising stars" selected purely on average score.
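One possible shape of that list; every name and threshold here is an assumption:

    def front_page(items, n_top=7, n_rising=3, min_ratings=10):
        # items: (item_id, num_ratings, damped_score, raw_average) tuples
        by_score = sorted(items, key=lambda i: i[2], reverse=True)
        top = by_score[:n_top]
        # Rising stars: too few ratings to score well yet, ranked purely
        # on their raw average so they get a shot at visibility.
        rising = sorted(
            (i for i in items if i[1] < min_ratings and i not in top),
            key=lambda i: i[3],
            reverse=True,
        )[:n_rising]
        return top + rising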