Imagine a user chooses "Sort by rating" and then sees an item with a 4.5 average ranked above one with a 5.0 average because it has a higher Wilson score. Some portion of users will think "Ah, yes, this makes sense: the 4.5 rating is based on many more reviews, so its Wilson score is higher," but the vast, vast majority of users will think "What the heck? This site is rigging the system! How come this one is ranked higher than that one?" and lose confidence in the rankings.
In fact, these kinds of black-box rankings* frequently land sites like Yelp in trouble, because it is natural to assume that the company has a finger on the scale, so to speak, when it is in their financial interest to do so. In particular, entries with a higher Wilson score are likely to be more expensive, because their ostensibly superior quality commands (or depends upon) a higher price, and the perception of higher margins exacerbates this effect.
So the next logical step is to present the Wilson score directly, but this merely shifts the confusion elsewhere -- the user may find an item they're interested in buying, find it has one 5-star review, and yet its Wilson score is << 5, producing at least the same perception and possibly a worse one.
Instead, providing the statistically-sound score but de-emphasizing or hiding it, such as by making it accessible in the DOM but not visible, allows for the creation of alternative sorting mechanisms via e.g. browser extensions for the statistically-minded, without sacrificing the intuition of the top-line score.
* I assume that most companies would choose not to explain the statistical foundations of their ranking algorithm.
In that article, he even includes a formula for how many ratings you'd need:
> If you display average ratings to the nearest half-star, you probably don’t want to display an average rating unless the credible interval is a half-star wide or less
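That rule can be sketched roughly in code. This is my own normal-approximation sketch, not the article's exact formula (which derives the interval from a Bayesian posterior); the function name and the z value are assumptions:

```python
import math

def credible_interval_width(ratings, z=1.96):
    """Approximate width of an interval for the mean star rating,
    using a plain normal approximation (a sketch, not the article's
    exact Bayesian formula)."""
    n = len(ratings)
    if n < 2:
        return float("inf")  # can't estimate spread from one rating
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / n
    return 2 * z * math.sqrt(var / n)

# Only display an average once the interval is half a star wide or less.
ratings = [5, 4, 5, 3, 5, 4, 4, 5, 5, 4] * 10  # 100 hypothetical ratings
show_average = credible_interval_width(ratings) <= 0.5
```

With 100 ratings in this example the interval is roughly a quarter-star wide, so the average would be shown; with only the first 10 ratings it would not.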
In my experience, the second article is more generally useful, because it's more common to sort by star rating than by thumb-up/thumb-down ranking, which is what the currently linked article is about.
And the philosophical "weight on the scale" problem isn't as bad as you'd think when using these approaches. If you see an item with a perfect 5-star average and 10 reviews ranked below an item with a 4.8-star average and 1,000 reviews, and you call the sort ranking "sort by popularity," it's pretty clear that the item with 1,000 reviews is "more popular."
It also erodes confidence in ratings when something with one fake 5 star review sorts above something else with 1000 reviews averaging 4.9.
I think you're mainly focusing on the very start of a learning curve, but eventually people get the hang of the new system. Especially if it's named correctly (e.g. "sort by review-count weighted score").
This is already done to a degree on most sites. The author is just describing a better possible way to do it.
What’s the most popular office pen? Papermate, Bic? I may be looking for more quality.
What’s the most popular hotel in some city? Maybe I’m looking for location or other aspects other than popularity among college kids.
Yelp didn't get dinged because their algorithms were hidden. They lost credibility because they were extorting businesses. Intention matters.
The inherent problem, to me, is that we're trying to condense reviews into the tiny signal of an integer from 1 to 5.
For many things, this simply doesn't cut it.
2 stars, what does that mean? Was the coffee table not the advertised shade of grey? Does the graphics card overheat on medium load because of a poor cooler design? Was the delivery late (not related to the product, but many people leave these kinds of reviews)? Did you leave a 2 star review because you don't like the price but you didn't actually order the product?
All these things I've seen in reviews, and I've learned to ignore star ratings because not only can they be gamed, they are essentially useless.
Props to users who take the time to write detailed reviews of products, which give you an idea of what to expect without having to guess what a star rating means, although sometimes these can be gamed as well: many sellers on Amazon and the like will just give out free products in exchange for favourable reviews.
Being a consumer is not easy these days: you have to be knowledgeable about what you're buying and assume every seller is an adversary.
That's assuming you can trust the reviews themselves, of course.
There has to be a way to let users choose between "sort by rating, but put items without many reviews lower" and "sort by rating, even items with only one or two reviews" in a way that helps give control back to them.
To read reviews of awful products for entertainment, I guess?
I think there are better approaches that can be taken here to address possible confusion. E.g., if the Wilson score rating ever places an item below ones with a higher average rating, put a little tooltip next to that item's rating that says something like "This item has fewer reviews than ones higher up in the list." You don't need to understand the full statistical model to have the intuition that things with only a few ratings aren't as "safe".
> So the next logical step is to present the Wilson score directly, but this merely shifts the confusion elsewhere -- the user may find an item they're interested in buying, find it has one 5-star review, and yet its Wilson score is << 5, producing at least the same perception and possibly a worse one.
Though I'm not convinced how big of a deal this is. Even if you're worried about this, a further optimization may be to simply not display the score until there's enough reviews that it's unlikely anyone will manually compute the average rating.
A) labelling 1-2 review items with "needs more reviews" message
Or B) not giving an aggregate review score for low review items. Actually replacing the review star bar with "needs more reviews". Then when the user goes from the listing page to the detail page, you can show the reviews next to a message saying "this item only has a few reviews, so we can't be sure they're accurate until more people chime in"
Ranking a 4.9 star item with 500 reviews above a 5 star item is intuitive to many already, and will become intuition quickly for everyone else because it’s broadly more useful. The average customer doesn’t care that much how the sausage is made, they care about quality of the results.
Ranking items is basic functionality and it’s broken across the web. It shouldn’t be a feature that’s only available to users willing and able to fiddle with browser extensions.
I agree that if I “sort by rating” then an average rating sort is expected. The solution is to simply not make sorting by rating an option, or to keep the bad sorting mechanism but de-emphasize it in favor of the more useful sort. Your users will quickly catch on that you’re giving them a more useful tool than “sort by average rating.”
"Relevance" and "confidence" can mean a lot of different things, and I tend to expect those types of sorts to be gamed by the site to promote whatever they'd prefer I buy. For example, assuming an equal number of reviews, a site could decide a more expensive item rated at 4 stars is more "relevant" than a cheaper item with a 5-star rating.
If it's not explicitly explained what determines confidence and relevance, and/or users can't access the information used to assign those scores, it degrades trust that the promoted results are genuinely beneficial to the user rather than to the website/service.
Amazon, for example, uses "Featured," which is transparently gamed in Amazon's favor, and "Avg. Customer Review," which should be clear enough and removes most of the worst items; the number of reviews is easily seen in the list as well (although the legitimacy of reviews still has to be considered, and there are a lot of other problems with the way Amazon handles reviews in general).
Generally I'll sort by rating and look deeper at the reviews for the ones with both high ratings and a high number of reviews. It's not perfect, but it makes a great starting point.
I feel like all that's really needed is a clear indicator that it's some proprietary ranking system (for example, "Tomatometer" branding), plus a plain-language description of what it's doing for people who want to know more.
Result 1: (4.5 )
Result 2: (5.0 )
edit: HN stripped out the unicode characters :(. I was using something like this: https://blog.jonudell.net/2021/08/05/the-tao-of-unicode-spar....
Then the user can pick the regular average if they want, whereas the so-called weighted average (the algorithm described in the article) would be the default choice.
Edit: Link to paper, which looks like it actually attempts to use a linear prediction algorithm. https://github.com/rkuykendall/rkuykendall.com/blob/e65147f6...
For example, we have an internal phishing simulation/assessment program, and want to track metrics like improvement and general uncertainty. Since implementing this about a year ago, we've been able to make great improvements such as:
* for a given person, identify the Wilson lower bound on the probability that they would not get phished if they were targeted
* for the employee population as a whole, determine the 95% uncertainty on whether a sampled employee would get phished if targeted
It lets us make much more intelligent inferences, much more accurate risk assessments, and also lets us improve the program pretty significantly (e.g. your probability of being targeted is weighted by a combination of your Wilson lower bound and your Wilson uncertainty).
There are SO MANY opportunities to improve things by using this method. Obviously it isn't applicable everywhere, but I'd suggest you look at any metrics you have that use an average and just take a moment to ask yourself if a Wilson bound would be more appropriate, or might enable you to make marked improvements.
Though this property may be suboptimal for other reasons.
It rewards items with more ratings. Basically, you initialize the number of negative ratings to 1 instead of 0.
x / (x+y+1) :: https://www.wolframalpha.com/input/?i=plot+x+%2F%28x+%2B+y+%...
horrendous formula :: https://www.wolframalpha.com/input/?i=plot+%28%28x%2F%28x%2B...
Much less prone to typos.
(positive + constant1) / (positive + negative + constant1 + constant2)
For more details see the beta distribution.
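That one-liner can be sketched as follows (the function name is mine):

```python
def smoothed_score(x, y):
    """Score = x / (x + y + 1): identical to the plain positive
    fraction except that the negative count starts at 1 instead of 0."""
    return x / (x + y + 1)

# The phantom downvote matters less as ratings accumulate, so more
# ratings at the same positive fraction rank higher:
# smoothed_score(1, 0) = 0.5, while smoothed_score(10, 0) is about 0.909.
```

It trades the statistical guarantee of the Wilson bound for a formula you can verify at a glance.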
+10/-0 should rank higher than +1/-0
+10/-5 should rank higher than +10/-7
+100/-3 should rank higher than +3/-0
+10/-1 should rank higher than +900/-200
(class1 + 1)/(class1 + class2 + 2).
(effectively, initialize all counts to 1).
This basically makes the 'default' rating 50% or 3 stars or whatever, and votes move the rating from that default.
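A quick sketch of that rule, checked against the orderings listed above (function name mine):

```python
def rule_of_succession(pos, neg):
    """(class1 + 1) / (class1 + class2 + 2): every count starts at 1,
    so an unrated item defaults to 0.5."""
    return (pos + 1) / (pos + neg + 2)

# The desired orderings from the list above all hold:
desiderata = [
    ((10, 0), (1, 0)),      # +10/-0 beats +1/-0
    ((10, 5), (10, 7)),     # +10/-5 beats +10/-7
    ((100, 3), (3, 0)),     # +100/-3 beats +3/-0
    ((10, 1), (900, 200)),  # +10/-1 beats +900/-200
]
for better, worse in desiderata:
    assert rule_of_succession(*better) > rule_of_succession(*worse)
```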
There's a retail chain I write the software for, including the website and online store. They send out occasional customer surveys, which I also built, and which are constantly reviewed by corporate management. So, they wanted to raise their Google review rating, which was hovering between 4.6 and 4.8 stars, varying by location.
The CEO called me with this idea that at the end of taking a survey, people who give us five stars in all ten survey categories should be shown a link to directly review the store on Google.
Me, as an engineer - I was like, okay let's sit down and review metrics for each survey question, weight them and figure out how likely someone is to give us 5, rather than 4 stars in aggregate, and use that to decide whether to give them the link.
He said no. The only people who should get the link are people who gave 5 stars on the survey in every single category. This seemed counterintuitive to me, and when I ran the numbers I found that only a small fraction of people did. But I accepted the diktat and did it his way.
End result: About a 30% increase in 5-star google ratings across the board. His intuition was right, that the people most motivated to take the survey and complete it with straight-A's would be the people who would take the time to write great comments.
You can and probably should infer from this that if you actually need to balance "average" ratings, the outliers on the top and bottom of the scale exert an inordinate amount of pull and tend to skew results toward the extremes.
Some other thoughts - anyone know what state of the art is in ratings?
- especially for technology products, time on market seems like a big factor. How should the iPhone 8 be rated today? It was a great product in its day, but anyone looking at it now faces a lot more trade-offs.
- I wonder how we should consider the users in this. It seems like certain users are more harsh or fair than others, and have different rating patterns. Should we try to identify "good" reviewers?
- Similar to the above, I wonder if for example Uber drivers get harsher reviews if traffic or weather is bad. So should we normalize for local trends in ratings?
To this day I don't understand how the original Netflix algorithm for predicting user desires isn't co-opted for everything.
I don't mind people rating taco bell 5 stars. I just don't want any of their reviews influencing the restaurant ratings I see.
Same with basically every product category. We have this insanely intricate and rich data set that could truly help people find things that are great for them. But it's all locked away and only used for making a buck with nobody giving a shit about making people's lives better.
Weighted score = (positive + alpha) / (total + beta)
In which alpha and beta are the mean number of positive and total votes, respectively. You may wish to estimate optimal values of alpha and beta subject to some definition of optimal, but I find the mean tends to work well enough for most situations.
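A minimal sketch of that, with made-up vote counts (item data and variable names are mine):

```python
def weighted_score(positive, total, alpha, beta):
    """Shrink each item's raw fraction toward the collection-wide prior."""
    return (positive + alpha) / (total + beta)

# Hypothetical (positive, total) vote counts for three items
items = [(9, 10), (1, 1), (80, 100)]
alpha = sum(p for p, _ in items) / len(items)  # mean positive votes
beta = sum(t for _, t in items) / len(items)   # mean total votes

# The single-vote item's perfect 1/1 raw fraction is pulled back
# below the 9/10 item's weighted score.
```

The effect is the same kind of shrinkage as the rule of succession, but with the prior estimated from the data rather than fixed at one pseudo-vote per class.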
Humans will see 3 stars and not perceive that as 50%.
It seems like it’s purely a result of widget design deficiency: how do you turn a null into a 0 with a star widget? (You could add an extra button but naturally designers will poo poo that)
(in less important areas than sorting things by ratings to directly rank things for users; mentally bookmarked this idea for the next time I need something better, as this clearly looks better)
(I saw first this covered in Murphy's Machine Learning: A Probabilistic Perspective, which I'd recommend if you're interested in this stuff)
I generally pay the most attention to 3 star reviews, because they tend to be pretty balanced and actually tell you the plusses and minuses of the product. It seems like 2 star reviews would be somewhat like that, but leaning toward the negative/critical side. Is the negative/critical feedback what you're after?
"2 stars" means "I really don't like it, but I can control my emotions and explain myself".
Some previous discussions:
4 years ago https://news.ycombinator.com/item?id=15131611
6 years ago https://news.ycombinator.com/item?id=9855784
10 years ago https://news.ycombinator.com/item?id=3792627
13 years ago https://news.ycombinator.com/item?id=478632
Reminder: you can enjoy the article without upvoting it
How Not to Sort by Average Rating (2009) - https://news.ycombinator.com/item?id=15131611 - Aug 2017 (156 comments)
How Not to Sort by Average Rating (2009) - https://news.ycombinator.com/item?id=9855784 - July 2015 (59 comments)
How Not To Sort By Average Rating - https://news.ycombinator.com/item?id=3792627 - April 2012 (153 comments)
How Not To Sort By Average Rating - https://news.ycombinator.com/item?id=1218951 - March 2010 (31 comments)
How Not To Sort By Average Rating - https://news.ycombinator.com/item?id=478632 - Feb 2009 (56 comments)
I don't know if I knew about HN four years ago and if I did, I almost certainly missed that post, and if I didn't, I certainly don't remember the interesting discussion in the comments.
I enjoyed the article and I'm not sure I see a reason not to upvote it.
It adds a click event to each link for the article, and then after a day has passed, will start filtering that link out from HN results? I give it a gap of a day because maybe you'd want to return and leave a comment.
I might try my hand at a greasemonkey script if you're interested.
Though, personally, I have no great issue seeing high quality posts again occasionally.
I think you simply need separate search categories for this.
Say I want to look for underrated or undiscovered gems:
"Give me the best ranked items that have 500 votes or less."
It is misleading to throw a 12-vote item into the same list as a 12,000,000-vote item and present them as being ranked relative to each other.
That is not to say that (positive - negative) is perfect, but dismissing it because it doesn't satisfy your arbitrarily chosen criteria is just lazy.
The blog post that explained it: https://web.archive.org/web/20091210120206/http://blog.reddi...
> What we want to ask is: Given the ratings I have, there is a 95% chance that the “real” fraction of positive ratings is at least what?
> Score = Lower bound of Wilson score confidence interval for a Bernoulli parameter
There is actually a 97.5% chance that the "real" fraction of positive ratings is at least X, because the Wilson interval is a two-sided confidence interval which excludes a tail at each end.
I personally would prefer to use a Bayesian derivation and just compute a one-tailed statistic, rather than applying a CLT approximation out-of-the-box with no tailoring as is typical in frequentist statistics.
Regardless, though, the principles are much the same and the sorted results would probably match very closely, so if this computation is significantly faster, then stick with that.
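For reference, the frequentist computation being discussed is short. This is a standard implementation of the Wilson lower bound; z = 1.96 gives the usual "95%" two-sided interval, hence the 97.5% one-sided coverage noted above:

```python
import math

def wilson_lower_bound(pos, n, z=1.96):
    """Lower bound of the Wilson score interval for a Bernoulli
    parameter, given pos positive ratings out of n total."""
    if n == 0:
        return 0.0
    phat = pos / n
    denom = 1 + z * z / n
    centre = phat + z * z / (2 * n)
    margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
    return (centre - margin) / denom

# A single positive rating scores about 0.21, well below a
# 90%-positive item with 100 ratings.
```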
- Sum of votes divided by total votes
- More advanced statistical algorithms that take into account confidence (as this article suggests)
- Recommendation engines that provide a rating based on your taste profile
But I'm pretty sure you could take this further depending on what data you're looking to feed in and what the end-users' expectations of the system are.
* Included a graph of the resulting ordering of the two dimensional plane and some examples
* Included consideration of 5- or 10-star scales.
There's a whole section on their website with different statistics for programmers, including rating systems:
https://www.evanmiller.org/ ("Mathematics of user ratings" section)
For example, 3/5 stars turns into 0.6 of a positive observation and 0.4 of a negative one. Following the formula from there gives a lower-bound estimate between 0 and 1, and then you just multiply by 5 again to get it between 0 and 5.
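A sketch of that conversion (function names mine; note the Wilson interval strictly applies to integer Bernoulli counts, so feeding it fractional "observations" is the heuristic described above, not an exact result):

```python
import math

def wilson_lower_bound(pos, n, z=1.96):
    """Standard lower bound of the Wilson score interval."""
    if n == 0:
        return 0.0
    phat = pos / n
    denom = 1 + z * z / n
    centre = phat + z * z / (2 * n)
    margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
    return (centre - margin) / denom

def star_lower_bound(stars, z=1.96):
    """Map each rating on a 0-5 scale to fractional positive mass
    (3/5 stars -> 0.6 positive, 0.4 negative), then rescale the
    resulting bound back to stars."""
    pos = sum(s / 5 for s in stars)
    return 5 * wilson_lower_bound(pos, len(stars), z)
```

So a single 5-star review scores only about 1 star, while a thousand 3-star reviews score close to their true average of 3.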