While I agree with the author in principle, I think there is an implicit criterion they ignore, which is intuitive correctness from the user's perspective.
Imagine a user chooses "Sort by rating" and then observes an item with a 4.5 average ranking above one with a 5.0 average because it has a higher Wilson score. Some portion of users will think "Ah, yes, this makes sense, because the 4.5 rating is based on many more reviews, so its Wilson score is higher," but the vast, vast majority will think "What the heck? This site is rigging the system! How come this one is ranked higher than that one?" and lose confidence in the rankings.
In fact, these kinds of black-box rankings* frequently land sites like Yelp in trouble, because it is natural to assume the company has a finger on the scale, so to speak, when it is in their financial interest to do so. In particular, entries with a higher Wilson score are likely to be more expensive, because their ostensibly superior quality commands (or depends upon) a higher price, and the perceived higher margins exacerbate the effect.
So the next logical step is to present the Wilson score directly, but this merely shifts the confusion elsewhere -- the user may find an item they're interested in buying, find it has one 5-star review, and yet its Wilson score is << 5, producing at least the same perception and possibly a worse one.
Instead, providing the statistically-sound score but de-emphasizing or hiding it, such as by making it accessible in the DOM but not visible, allows for the creation of alternative sorting mechanisms via e.g. browser extensions for the statistically-minded, without sacrificing the intuition of the top-line score.
* I assume that most companies would choose not to explain the statistical foundations of their ranking algorithm.
In another article, the author (Evan Miller) recommends not showing the average unless there are enough ratings. You would say "2 ratings" but not show the average, and just sort it wherever it falls algorithmically.
In that article, he even includes a formula for how many ratings you'd need:
> If you display average ratings to the nearest half-star, you probably don’t want to display an average rating unless the credible interval is a half-star wide or less
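For anyone curious what that rule of thumb looks like in practice, here is a minimal Python sketch. It uses a plain normal approximation around the mean rather than the Bayesian credible interval from Miller's article, so treat it only as an illustration of the "don't show the average until the interval is half a star wide or less" idea:

    import math

    def displayable_average(ratings, max_width=0.5, z=1.96):
        # Return the average rating only if the interval around it is
        # narrower than max_width stars; otherwise return None (show
        # "2 ratings" but no average). Normal approximation, not Miller's
        # exact method.
        n = len(ratings)
        if n < 2:
            return None
        mean = sum(ratings) / n
        var = sum((r - mean) ** 2 for r in ratings) / (n - 1)
        width = 2 * z * math.sqrt(var / n)
        return mean if width <= max_width else None

    print(displayable_average([5, 5, 4, 5, 4, 5, 5, 4, 5, 5, 4, 5, 4, 5, 5, 4]))  # 4.625
    print(displayable_average([5, 2]))  # None -- too few ratings to pin down the average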
In my experience, the second article is more generally useful, because it's more common to sort by star rating than by thumb-up/thumb-down ranking, which is what the currently linked article is about.
And the philosophical "weight on the scale" problem isn't as bad as you'd think when using these approaches. If you see an item with a perfect 5-star average and 10 reviews ranked below an item with a 4.8-star average and 1,000 reviews, and you call the sort ranking "sort by popularity," it's pretty clear that the item with 1,000 reviews is "more popular."
> Imagine a user chooses "Sort by rating" and then observes an item with a 4.5 average ranking above one with a 5.0 average because it has a higher Wilson score. Some portion of users will think "Ah, yes, this makes sense, because the 4.5 rating is based on many more reviews, so its Wilson score is higher," but the vast, vast majority will think "What the heck? This site is rigging the system! How come this one is ranked higher than that one?" and lose confidence in the rankings.
It also erodes confidence in ratings when something with one fake 5 star review sorts above something else with 1000 reviews averaging 4.9.
I think you're mainly focusing on the very start of a learning curve, but eventually people get the hang of the new system. Especially if it's named correctly (e.g. "sort by review-count weighted score").
I'd opt for a simpler and less precise name like "Sort by Rating", but then offer the more precise definition via a tooltip or something, to minimize complexity for the typical user but ensure that accurate information is available for those who are interested.
When you use the OP article's formula, you're sorting by popularity. You may choose not to sort by popularity, but when you use it, you should call it sorting by "popularity."
Not having faith in the user is a giant step towards mediocrity. Does a weighted average provide better results? Then use a weighted average! The world isn't split into an elite group of power users and the unwashed masses. There are just people with enough time and attention to fiddle with browser extensions, and everyone else. And all of them want the best result to show up first.
Yelp didn't get dinged because their algorithms were hidden. They lost credibility because they were extorting businesses. Intention matters.
The inherent problem, to me, is that we're trying to condense reviews into a tiny signal: an integer in the range of 1 to 5.
For many things, this simply doesn't cut it.
2 stars, what does that mean? Was the coffee table not the advertised shade of grey? Does the graphics card overheat on medium load because of a poor cooler design? Was the delivery late (not related to the product, but many people leave these kinds of reviews)? Did you leave a 2 star review because you don't like the price but you didn't actually order the product?
All these things I've seen in reviews, and I've learned to ignore star ratings because not only can they be gamed, they are essentially useless.
Props to users who take the time to write out detailed reviews of products, which give you an idea of what to expect without having to guess what a star rating means. Sometimes these can be gamed as well, though: many sellers on Amazon and the like will just give out free products in exchange for favourable reviews.
Being a consumer is not easy these days: you have to be knowledgeable about what you're buying and assume every seller is an adversary.
My go-to method for reading reviews is to sort by negative and only look at detailed reviews, or at least ones that explain what they thought was lacking. Often the 1- and 2-star reviews have fair points but might be about a use case I don't care about or similar. This generally gives me an idea of the actual pros and cons of the product, as opposed to just a vague rating.
That's assuming you can trust the reviews themselves, of course.
The problem with having faith in your users is you have to actually do it. If you're sorting by Wilson score when the user clicks a column that displays a ranking out of five, then you're mixing two scores together in a frustrating way because you think your users are too dumb to understand.
There has to be a way to let users choose between "sort by rating, but put items without many reviews lower" and "sort by rating, even items with only one or two reviews" in a way that helps give control back to them.
This is a fair point, but it's not as if knowing which items are actually good is something that should only be available to power users. The real goal ought to be making sure your customers get access to actually good things, not merely satisfying what might be some customers' naive intuition that things with higher average ratings are actually better.
I think there are better approaches that can be taken here to address possible confusion. E.g., if the Wilson score ranking ever places an item below ones with a higher average rating, put a little tooltip next to that item's rating that says something like "This item has fewer reviews than ones higher up in the list." You don't need to understand the full statistical model to have the intuition that things with only a few ratings aren't as "safe".
> So the next logical step is to present the Wilson score directly, but this merely shifts the confusion elsewhere -- the user may find an item they're interested in buying, find it has one 5-star review, and yet its Wilson score is << 5, producing at least the same perception and possibly a worse one.
Though I'm not convinced how big of a deal this is. Even if you're worried about this, a further optimization may be to simply not display the score until there's enough reviews that it's unlikely anyone will manually compute the average rating.
A) labelling 1-2 review items with a "needs more reviews" message
Or B) not giving an aggregate review score for low-review items at all, actually replacing the review star bar with "needs more reviews". Then when the user goes from the listing page to the detail page, you can show the reviews next to a message saying "this item only has a few reviews, so we can't be sure they're accurate until more people chime in".
I think this can be solved with better UI: instead of stars, show a sparkline of the distribution of the scores. The user can then see the tiny dot representing the single 5-star review and the giant peak representing the many 4-star reviews.
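Something like this, as a rough text-mode illustration of the idea (a real site would presumably render a proper chart rather than unicode blocks):

    def rating_sparkline(counts):
        # counts[0] = number of 1-star reviews ... counts[4] = number of 5-star reviews
        bars = " ▁▂▃▄▅▆▇█"
        peak = max(counts) or 1
        # non-zero counts always get at least the smallest bar, so the lone
        # 5-star review stays visible as a tiny mark next to the 4-star peak
        return "".join(bars[max(1, round(8 * c / peak))] if c else " " for c in counts)

    print(rating_sparkline([0, 2, 6, 120, 1]))  # -> " ▁▁█▁"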
While I agree with your comment in principle, is any shopper happy with the utterly useless ranking system that everybody uses now?
Ranking a 4.9 star item with 500 reviews above a 5 star item is intuitive to many already, and will become intuition quickly for everyone else because it’s broadly more useful. The average customer doesn’t care that much how the sausage is made, they care about quality of the results.
Ranking items is basic functionality and it’s broken across the web. It shouldn’t be a feature that’s only available to users willing and able to fiddle with browser extensions.
If you don’t provide a “Sort by rating” option but instead include options like sort by “popularity,” “relevance,” “confidence,” or similar, then it is a more accurate description, more useful to the user, and not so misleading about what is being sorted.
I agree that if I “sort by rating” then an average rating sort is expected. The solution is to simply not make sorting by rating an option, or to keep the bad sorting mechanism but de-emphasize it in favor of the more useful sort. Your users will quickly catch on that you’re giving them a more useful tool than “sort by average rating.”
To me "rating" is pretty clear cut. I expect some sort of ranking based on the ratings provided by users.
"relevance" and "confidence" can mean a lot of different things and I tend to expect those types of sorts to be gamed by the site in order to promote whatever they'd prefer that I buy. For example, assuming an equal number of reviews a site could decide a more expensive item rated at 4 stars is more "relevant" than a cheaper item with a 5 star rating.
If it's not explicitly explained what determines confidence and relevance, and/or users can't access the information used to assign those scores, it degrades trust that the results being promoted are genuinely beneficial to the user rather than to the website/service.
Amazon, for example, uses "Featured," which is transparently gamed in Amazon's favor, and "Avg. Customer Review," which should be clear enough and removes most of the worst items; the number of reviews is easily seen in the list as well (although the legitimacy of reviews still has to be considered, and there are a lot of other problems with the way Amazon handles reviews in general).
Generally I'll sort by rating and look deeper at the reviews for the ones with both high ratings and a high number of reviews. It's not perfect, but it makes a great starting point.
I think you're overemphasizing the confusion that an alternate ranking schema would cause. We have Rotten Tomatoes as a very obvious example of one that a lot of people are perfectly happy with even though it's doing something very different from the usual meaning of X% ratings.
I feel like all that's really needed is a clear indicator that it's some proprietary ranking system (for example, "Tomatometer" branding), plus a plain-language description of what it's doing for people who want to know more.
I worked on an e-commerce site that attempted to solve the issue by simply not giving an average rating to an item until it had a certain number of reviews. We still showed the reviews and their scores, but there was no top-level average until it had enough reviews. We spent a lot of time in user testing and with surveys trying to figure out how to effectively communicate that.
In order to deal with that, I would place two sorting options related to average:
- regular average
- weighted average (recommended, default)
Then the user can pick the regular average if they want, whereas the so-called weighted average (the algorithm described in the article) would be the default choice.
This article inspired me so much that I based my shitty undergrad senior thesis on it. My idea was to predict the trend of the ratings using (I think) a trailing weighted average, weighted toward the most recent window. It managed to generate more "predictive" ratings for the following 6 months based on the Amazon dataset I used, but I doubt it would have held up to much scrutiny. I learned a ton though!
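Something along these lines, I'd guess -- the exact weighting scheme from that thesis isn't given, so this is just one plausible "trailing weighted average" (exponential decay toward older ratings):

    def trailing_weighted_average(ratings, decay=0.7):
        # ratings are oldest-first; recent ratings get exponentially more weight
        weights = [decay ** age for age in range(len(ratings) - 1, -1, -1)]
        return sum(w * r for w, r in zip(weights, ratings)) / sum(weights)

    # A product whose recent reviews are trending down scores below its plain average:
    history = [5, 5, 5, 4, 3, 2]
    print(sum(history) / len(history))         # 4.0
    print(trailing_weighted_average(history))  # ~3.3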
I've been using this at work for the last year or so to great success.
For example, we have an internal phishing simulation/assessment program, and want to track metrics like improvement and general uncertainty. Since implementing this about a year ago, we've been able to make great improvements such as:
* for a given person, identify the Wilson lower bound on the probability that they would not get phished if they were targeted
* for the employee population as a whole, determine the 95% uncertainty interval on whether a sampled employee would get phished if targeted
It lets us make much more intelligent inferences, much more accurate risk assessments, and also lets us improve the program pretty significantly (e.g. your probability of being targeted is weighted by a combination of your Wilson lower bound and your Wilson uncertainty).
There are SO MANY opportunities to improve things by using this method. Obviously it isn't applicable everywhere, but I'd suggest you look at any metrics you have that use an average and just take a moment to ask yourself if a Wilson bound would be more appropriate, or might enable you to make marked improvements.
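For anyone who wants to try this on their own metrics, the Wilson lower bound from the article is only a few lines of Python. The phishing-flavored example below is just my guess at how the parent comment maps its data onto successes/trials:

    import math

    def wilson_lower_bound(successes, trials, z=1.96):
        # Lower bound of the Wilson score interval for a Bernoulli proportion,
        # i.e. the formula from the linked article.
        if trials == 0:
            return 0.0
        phat = successes / trials
        denom = 1 + z * z / trials
        centre = phat + z * z / (2 * trials)
        margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * trials)) / trials)
        return (centre - margin) / denom

    # e.g. "resisted 9 of 10 simulated phishes" vs "resisted 90 of 100":
    print(wilson_lower_bound(9, 10))    # ~0.60
    print(wilson_lower_bound(90, 100))  # ~0.83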
This is another way of moving towards a Bayesian lower bound instead of a frequentist one. As your formula shows, in this case the Bayesian formula is super duper simple.
There's a retail chain I write the software for, including the website and online store. They send out occasional customer surveys, which I also built, and which are constantly reviewed by corporate management. So, they wanted to raise their Google review rating, which was hovering between 4.6 and 4.8 stars, varying by location.
The CEO called me with this idea that at the end of taking a survey, people who give us five stars in all ten survey categories should be shown a link to directly review the store on Google.
Me, as an engineer - I was like, okay let's sit down and review metrics for each survey question, weight them and figure out how likely someone is to give us 5, rather than 4 stars in aggregate, and use that to decide whether to give them the link.
He said no. The only people who should get the link are people who gave 5 stars on the survey in every single category. This seemed counterintuitive to me, and I ran the numbers and found that only a small fraction of people did. But I accepted the diktat and did it his way.
End result: About a 30% increase in 5-star google ratings across the board. His intuition was right, that the people most motivated to take the survey and complete it with straight-A's would be the people who would take the time to write great comments.
You can and probably should infer from this that if you actually need to balance "average" ratings, the outliers on the top and bottom of the scale exert an inordinate amount of pull and tend to skew results toward the extremes.
This is really neat, I'd never heard of the Wilson score.
Some other thoughts - anyone know what the state of the art is in ratings?
- especially for technology products, time on market seems like a big thing. How should the iPhone 8 be rated today? It was a great product in its day, but if someone is looking at it now there are a lot more trade-offs.
- I wonder how we should consider the users in this. It seems like certain users are more harsh or fair than others, and have different rating patterns. Should we try to identify "good" reviewers?
- Similar to the above, I wonder if for example Uber drivers get harsher reviews if traffic or weather is bad. So should we normalize for local trends in ratings?
To this day I don't understand how the original Netflix algorithm for predicting user desires isn't co-opted for everything.
I don't mind people rating taco bell 5 stars. I just don't want any of their reviews influencing the restaurant ratings I see.
Same with basically every product category. We have this insanely intricate and rich data set that could truly help people find things that are great for them. But it's all locked away and only used for making a buck with nobody giving a shit about making people's lives better.
In which alpha and beta are the mean number of positive and total votes, respectively. You may wish to estimate optimal values of alpha and beta subject to some definition of optimal, but I find the mean tends to work well enough for most situations.
Is that really a fatal flaw? It's humans reading the ratings, and humans doing the ratings, so our human-factors might balance out a bit. I don't think people come in expecting the rating system to be perfectly linear because we have a mental model of how other humans rate things -- 1 star and 5 stars are very common, even when there's obviously ways the thing could be worse/better. So even though 3 stars sounds like more than 50%, most people would consider 3.0 stars a very poor rating.
I think you make a good point. But I don’t think it completely defeats the bias. Especially given that the star system that existed before the Web had 0 and half stars.
It seems like it’s purely a result of widget design deficiency: how do you turn a null into a 0 with a star widget? (You could add an extra button but naturally designers will poo poo that)
Percentage systems aren't immune to this; various pieces of games media were often accused of effectively using a 70-100% rating scale. Anything below 70 was perceived as a terrible game, and they didn't want to harm their relationships with publishers. So 70 became the "you might like it if there are some specifics that appeal to you" score and 80 was a pretty average game.
This is cool. But what I usually do is replace x/y with x/(y+5), and hope for the best :). The 5 can be replaced by 3 or 50, depending on what I'm dealing with.
(in less important areas than sorting things by ratings to directly rank things for users; mentally bookmarked this idea for the next time I need something better, as this clearly looks better)
Heads up this weights all your scores towards 0. If you want to avoid this, an equally simple approach is to use (x+3)/(y+5) to weight towards 3/5, or any (x+a)/(y+b) to weight towards a/b. It turns out that this seemingly simple method has some (sorta) basis in mathematical rigor: you can model x and y as successes and total attempts from a Bernoulli random variable, a and b as the parameters in a beta prior distribution, and the final score to be the mean of the updated posterior distribution: https://en.wikipedia.org/wiki/Beta_distribution#Bayesian_inf...
(I first saw this covered in Murphy's Machine Learning: A Probabilistic Perspective, which I'd recommend if you're interested in this stuff)
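A minimal sketch of that posterior-mean smoothing, with the prior expressed as a mean and a strength. The default values below reproduce the (x+3)/(y+5) example from the comment above; they're illustrative, not recommendations:

    def smoothed_score(positive, total, prior_mean=0.6, prior_strength=5):
        # Posterior mean of a Beta-Bernoulli model: (x + a) / (y + a + b),
        # where a = prior_mean * prior_strength and b = prior_strength - a.
        a = prior_mean * prior_strength
        b = prior_strength - a
        return (positive + a) / (total + a + b)

    print(smoothed_score(1, 1))      # ~0.67: one positive vote gets pulled toward the 0.6 prior
    print(smoothed_score(100, 120))  # ~0.82: with many votes the prior barely matters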
Why 2 star? I get the whole "forget about the 5 star reviews, because they're not going to tell you any of the downsides of the product," and "forget the 1 star reviews, because they're often unrelated complaints about shipping or delivery, and generally don't tell you much about the product." But, why not 3 star reviews?
I generally pay the most attention to 3 star reviews, because they tend to be pretty balanced and actually tell you the plusses and minuses of the product. It seems like 2 star reviews would be somewhat like that, but leaning toward the negative/critical side. Is the negative/critical feedback what you're after?
Because therein I find the best explanations for product failures. 3-star reviews tend to contain fewer failures and more "this could have been much better if they ___". Again, it's anecdotal; I have no data to back my words.
I was just looking for some of his old blog posts about A/b testing the other day. Since I first read them, I'd lost my bookmarks and forgotten his name. Do you know how bad the google search results for A/B testing are now? They're atrocious! SEO services and low-content medium posts as far as the eye can see! I was only able to rediscover his blog after finding links to it in the readme of a random R project in github.
This is a genuine question, is there an HN guideline that says not to upvote reposts?
I don't know if I knew about HN four years ago and if I did, I almost certainly missed that post, and if I didn't, I certainly don't remember the interesting discussion in the comments.
I enjoyed the article and I'm not sure I see a reason not to upvote it.
Maybe what we need here is an extension where you can filter out articles?
It adds a click event to each link for the article, and then after a day has passed, will start filtering that link out from HN results? I give it a gap of a day because maybe you'd want to return and leave a comment.
I might try my hand at a greasemonkey script if you're interested.
Though, personally, I have no great issue seeing high quality posts again occasionally.
This still has the problem that some item with 12 votes will be ranked higher than some item with 12,000 votes. Oh, and also has the problem that some item with 12 votes will be ranked lower than some item with 12,000 votes.
I think you simply need separate search categories for this.
Say I want to look for underrated or undiscovered gems:
"Give me the best ranked items that have 500 votes or less."
It is misleading to throw a 12 vote item together into the same list as a 12,000,000 vote item, and present them as being ranked relative to each other.
Exactly. At a previous job we asked applicants for a sort that had to satisfy some criteria like "more votes + same ratio ranks higher" and "same amount of total votes, better ratio ranks higher", and (positive - negative) was the simplest solution that satisfied them all.
That is not to say that (positive - negative) is perfect, but just dismissing it because it doesn't satisfy your randomly chosen criteria is just lazy.
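For what it's worth, net votes really does satisfy both of those stated criteria, e.g.:

    def net_votes(positive, negative):
        # The interview answer from the comment above.
        return positive - negative

    # "more votes + same ratio ranks higher":
    print(net_votes(80, 20) > net_votes(8, 2))    # True (60 > 6)
    # "same amount of total votes, better ratio ranks higher":
    print(net_votes(90, 10) > net_votes(80, 20))  # True (80 > 60)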
Fun fact, this article inspired the sysadmin at XKCD to submit a patch to open source reddit to implement this sort on comments. It lives still today as the "best" sort.
> What we want to ask is: Given the ratings I have, there is a 95% chance that the “real” fraction of positive ratings is at least what?
> Score = Lower bound of Wilson score confidence interval for a Bernoulli parameter
There is actually a 97.5% chance that the "real" fraction of positive ratings is at least X, because the Wilson interval is a confidence interval which excludes a tail at each end.
I personally would prefer to use a Bayesian derivation and just compute a one-tailed statistic, rather than applying a CLT approximation out-of-the-box with no tailoring as is typical in frequentist statistics.
Regardless, the principles are much the same and the sorted result would probably match very closely, so if this computation is significantly faster, then stick with that.
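For comparison, here is roughly what that one-tailed Bayesian version could look like, assuming a uniform Beta(1,1) prior (the prior choice is mine, not something the comment specifies):

    from scipy.stats import beta

    def bayes_lower_bound(positive, total, confidence=0.95, prior_a=1.0, prior_b=1.0):
        # One-tailed lower credible bound on the true positive fraction,
        # using a Beta prior (uniform by default).
        return beta.ppf(1 - confidence, positive + prior_a, total - positive + prior_b)

    # 90 positive votes out of 100: "95% sure the real fraction is at least..."
    print(bayes_lower_bound(90, 100))  # ~0.84, close to the Wilson lower bound (~0.83)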
There are a number of approaches to this with increasing complexity:
- Sum of votes divided by total votes
- More advanced statistical algorithms that take into account confidence (as this article suggests)
- Recommendation engines that provide a rating based on your taste profile
But I'm pretty sure you could take this further depending on what data you're looking to feed in and what the end-users' expectations of the system are.
This is a blast from the past. It's also surprisingly simple to implement his "correct" sort. Seriously, this link should make the rounds every year or so here.
They have an article about K-star rating systems [0], which use Bayesian approximation [1] [2] (something I know little to nothing about, I'm just regurgitating the article).
There's a whole section on their website that has different statistics for programmers, including rating systems [3].
The formula still works for scales of 5 or 10; you just have to divide by the max rating first and then multiply by it again at the end.
For example, 3/5 stars turns into a 0.6 positive and 0.4 negative observation. Following the formula from there will give a lower-bound estimate between 0 and 1, so you then just multiply by 5 again to get it between 0 and 5.
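A sketch of that rescaling trick. Whether fractional "votes" are statistically kosher is debatable, so treat this as the heuristic described in the parent comment rather than the article's exact method:

    import math

    def wilson_lower_bound(positive, total, z=1.96):
        # Same formula as in the article, on the 0..1 scale.
        if total == 0:
            return 0.0
        phat = positive / total
        denom = 1 + z * z / total
        centre = phat + z * z / (2 * total)
        margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * total)) / total)
        return (centre - margin) / denom

    def star_score(ratings, max_stars=5):
        # 3/5 stars counts as 0.6 of a positive observation, etc.
        positive = sum(r / max_stars for r in ratings)
        return max_stars * wilson_lower_bound(positive, len(ratings))

    print(star_score([5]))              # ~1.0: a lone 5-star review gets a low lower bound
    print(star_score([5, 4, 5, 4, 5]))  # ~2.4: more reviews push the lower bound up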
If you don't have PostgreSQL, it might be hard to create an index on that function. You can use a trigger that updates a fixed field on the row each time positive/negative changes, or otherwise run the calculation and include it in your UPDATE statement when those numbers change.
No idea. It's customary to include the year in HN submission titles if it was published before the current year. When I made my comment, the title didn't include the year.