How Not to Sort by Average Rating (2009)

paulgb · on Aug 30, 2017

Averages (even with the post's approach) still have the problem of not being "honest" in the game theory sense. For example, if something is rated 4 stars with 100 reviews, a reviewer who believes its true rating should be 3 stars is motivated to give it 1 star because that will move the average rating closer to his desired outcome. A look at rating distributions shows that this is in fact how many people behave.

Median ratings are "honest" in this sense, as long as ties are broken arbitrarily rather than by averaging. Math challenge: is there a way of combining the desirable properties mentioned in the post with the property of honesty? I suspect there is but I haven't tried it.

hyperpape · on Aug 30, 2017

John Gruber has been arguing that the only meaningful way to do ratings is a simple thumbs up/thumbs down. I don't necessarily agree, but I see the appeal.

I usually don't want ratings, I want the Wirecutter treatment. Sometimes, I know/care enough to really research the topic, in which case star reviews are relatively unhelpful. The rest of the time, I just want someone trustworthy to say "buy this if you want to pay a lot, buy this if you want something cheap, but this third thing is no good at any price".

dntrkv · on Aug 30, 2017

I've been saying this for years, thumbs up/down is the only system that makes sense to me.

Foursquare uses it and I've found their scores to be way more useful than Yelp's.

The biggest problem with star ratings is that it's so arbitrary. What is the difference between 3 and 3.5? What is a 1 vs a 2? 3/5 is 60%, that's almost failing when you think about it on a grading scale, if I scored something as a 3/5 I would never use that product or service again, yet, many of the best restaurants are rated 3/5 on Yelp.

Unless the user has some scoring system in place for different qualities of the product or service, there is no way you can get anything resembling an accurate score.

I would never trust a user to accurately assess a score given 10 different options (.5-5) but I would be way more likely to trust a user to say either "I like this product" or "I do not like this product."

But yes, the Wirecutter approach works great, but it just doesn't scale.

crazygringo · on Aug 30, 2017

Counterpoint: I almost solely rely on the stars histogram in Yelp (available only on the website, not the app), completely ignoring whatever Yelp's calculated "average" is.

If a place has more 5-star ratings than 4-star ratings, it's generally amazing. If it has more 4-star ratings than 5-star ratings, it's generally fine but not something particularly special.

Just thumbs up/down would eliminate what is, to me, the single most useful aspect of Yelp.

It doesn't matter that star ratings are arbitrary -- when you average enough of them out, a clear signal overrides the noise. You can distrust any given user, while still trusting the aggregate.

(Curiously enough, I don't find any equivalent value on Amazon. On Yelp, you're really evaluating an overall experience along a whole set of dimensions, so there's a lot more to discriminate on. On Amazon, it does seem to be more of a binary evaluation -- does the product work reliably or not?)

BoiledCabbage · on Aug 30, 2017

I used to think the same thing until I realized the most accurate and consistent ratings I use on a regular basis is rotten tomatoes. And they're based on strict thumbs up/ down.

It ensures votes hold equal weight and that "extreme polar" voters don't skew things. It also avoids the opposite problem of "everything is neutral" vote unless horrible/incredible.

RT also handles high brow and low brow well. You get less voting of "eh I didn't love it, but it's sophisticated so I'll give it an extra star."

I'm sold on simple up/down.

stinkytaco · on Aug 30, 2017

Rotten Tomatoes is good and predicting a movie I (or others) like, but not really at "ranking". Zootopia, one of their top movies of 2016 and a 98% rating, is a good movie, but one I'm unlikely to pursue again. The Godfather (with a 99%) rating, is a movie I will pick up on Blu Ray and revisit many times. It's far more than 1% better than Zootopia.

So RT is good at predicting "should I watch this movie I haven't watched before", but bad at predicting more sophisticated habits or preferences. I wouldn't buy the Blu Ray off a RT prediction, but I would rent.

So it becomes a question of what are you trying to accomplish? For some issues up/down is a good way to solve a problem, for others it isn't.

icebraining · on Aug 30, 2017

Rotten Tomatoes actually has both ratings, meaning they recognize the limitation you're referring to. In the other, Zootopia has 8.1/10 and The Godfather has 9.2/10, showing that difference in quality.

Houshalter · on Aug 31, 2017

Also you just aren't the demographic for zootopia. If you have kids then it probably is worth buying and they will watch it many times. There are so many genres of films, it's best to compare within a single genre and not between.

BoiledCabbage · on Sept 1, 2017

> Rotten Tomatoes is good and predicting a movie I (or others) like, but not really at "ranking". Zootopia, one of their top movies of 2016 and a 98% rating, is a good movie, but one I'm unlikely to pursue again

It feels like you're mixing together two different arguments. Rotten Tomatoes is good at predicting whether someone will like a movie. What is "ranking"? That is a very undefined concept. Ranking of what? It's clearly not ranking of likelyhood of a person liking a movie because rotten tomatoes already does that.

Later you mention likelihood of repeat watchings of a movie. Rotten Tomatoes thumbs up or down based on whether someone liked a movie, as a result it produces a metric on likelihood of someone liking of movie. Instead if rotten tomatoes immediately after watching a movie, asking "Did you like this movie?", asked "Would you watch this movie again?" then it would produce an indicator of re-watchability.

Up/down doesn't matter - it's the question that's being asked.

note the caveat RT obviously doesn't actually ask critics these questions, they read and judge their reviews and interpret them as answering those questions.

ehhnetfliz · on Aug 30, 2017

In my experience, my favorite movies I find via glowing reviews. Rotten tomato completely obscures this view: if all the reviewers kind of like it, it'll get 100%, whereas polarizing films always suffer. I'll take "kids" over "star wars" any day for a better movie. Why? I'm gonna see star wars because i want to, not because I expect a meaningful aesthetic. But Rotten Tomatoes takes the opposite tact, pushing me towards crowd favorites rather than what i might rate highly.

Really this comes down to how terrible one dimensional comparisons are: it only measure popularity, which is a terrible filter for quality.

alexilliamson · on Aug 31, 2017

I used to religiously research movies on RT - with a lot of success in my mind. With the user rating, the critic rating, and the "top" critic rating, you can infer a surprising amount about who is going to like any given film, and you learn over time where you fall on the critic/top critic/audience graph.

Recently, however, it seems like more (imo undeserving) movies that are "just ok" - like decent, but nothing special, romantic comedies and big blockbusters - are scoring above 90%. I might be being curmudgeonly about it, but I've nearly stopped checking it because it feels like there's no information there. My theory is that this started happening once Roger Ebert died... without such a leader in the field, no one is willing to say they didn't like a film unless it's obviously very bad.

albedoa · on Aug 30, 2017

I pay a lot of attention to histograms when there are many high-rated options for the same Amazon product type. A histogram that curves sharply in its number of 5-star reviews to almost nothing on the other end is the product you want (ignoring fake reviews for the sake of this conversation).

Amassing a bunch of 4- and 5-star ratings is easy, but leaving nothing for even the most habitual of complainers to complain about? That's an monumental achievement.

ghaff · on Aug 30, 2017

For things like books, I also find that reading the middling reviews often gives the best S/N ratio. It weeds out the fanboys and weeds out those who were clearly not the audience for the book (or just have some ax to grind). You're more likely to get the "I really love this author in general but I didn't care for this book because 1.) 2.) 3.)."

VanillaCafe · on Aug 30, 2017

Agreed. For products in Amazon above a certain star threshold (say, 3+), I evaluate given the shape of the review histogram, particularly minimizing the size of the bump down at 1-star and 2-star.

stinkytaco · on Aug 30, 2017

If the provider is in a position to provide a prediction, then the rating system is useful. For example, on Netflix I used the Hated It, Didn't Like It, Liked It, Really Liked It and Loved It system. When they predicted a star rating, it was pretty close. When they said we predict you'll give this a three star rating(which is probably well below the "average") that was generally a movie I liked.

ghaff · on Aug 30, 2017

Which, in practice, tends to devolve to what's effectively a four-star rating of some sort: Want two hours of my life back, OK/meh, Good, Excellent

A humorous take: https://xkcd.com/1098/

stinkytaco · on Aug 30, 2017

But my point is that for me, it didn't. Netflix's system was good enough to take into account that people have different systems. Thus when Netflix says "we predict you'll give this 3 stars", that means it was a movie I would like. That might mean you gave it 4 stars or 2 stars or whatever, even though you liked it as much as me. They made my system the only one that matters, as long as I was consistent. Reviews in aggregate are pretty much meaningless, but a good system weighs that problem in.

beefsack · on Aug 31, 2017

Perhaps the issue isn't the granularity of a single dimensional rating scale, but the lack of expressive options when in reality your feeling about something is complex and multifaceted.

I've been really interested in the idea of emotive reviews as an alternative to single dimensional scores. The best idea I have at the moment is something akin to emoji reactions like you see on GitHub issues, finding a way to encode some feelings relevant to product reviews in a mechanism like that seems really intriguing to me.

evincarofautumn · on Aug 31, 2017

I envision a panel of emoticons akin to the Facebook reaction set, but where the user can select as many as they want to quickly convey different combinations of their reactions:

    (thumbs up)      I liked this
    (heart)          I loved this
    (thumbs down)    I didn’t like this
    (smiling face)   This made me happy or satisfied
    (frowning face)  This made me sad or disappointed
    (surprised face) This made me surprised or impressed
    (angry face)     This made me angry or frustrated

Of course, it gets complicated. Did Sam U. Zerr give that product an (angry face) because they used it and didn’t like it, or because they’re offended that you would recommend it, or what?

If you’re only using icons to make recommendations to an individual user based on their own history, maybe you don’t need to infer the actual meanings; you can add all sorts of icons without any particular meaning and just make recommendations by correlation:

    (thinking face)  I’m considering this / I’m confused by or dubious of this
    (gear)           This was useful / this made me think
    (fire)           This album was great / this sauce was spicy
    (heart eyes)     I really want this / this is adorable
    ...

E.g. a recommendation for me might be “(thumbs up)(gear)(heart eyes)” because some product or content is similar, by some hidden metrics, to other things that I’ve reacted to in those ways.

Just brainstorming here. There are obviously many possible approaches in this space.

astrobe_ · on Aug 31, 2017

Put differently, a set of binary choices: amusing, interesting, sad, ... It's a bit difficult to come up with a good set to rate any thing, but I can see it working for specific topics, like movies or games.

Or, one could just let users tag the subject and the interface would display the "weights" of the tags.

ghaff · on Aug 31, 2017

That's part of the problem here. Appropriately rating different types of things differ in various ways.

A simple utilitarian object? It mostly works or it doesn't.

A movie? Just to start with, there's the rating of the movie itself vs. the rating for this particular DVD. And then there are the dimensions on which the movie itself could be rated.

Or you just throw your hands up in the air and either do a thumbs up/down or a 5 star rating system on the grounds that it's better than nothing.

xaedes · on Aug 31, 2017

How about vision based emotion recognition of viewers with cameras in the televisions and monitors? Sure sounds creepy and behaving different when observed etc. But I believe people will forget they are "observed" so the effect dimishs after a time. Than we would have a quite honest emotional feedback for movies. Even for specific scenes, for advertisment, etc

throwaway613834 · on Aug 30, 2017

To be honest, the fact that 60% is a failing grade is a failure of the grading system, not a fact to take for granted. We've basically lost the entire dynamic range of 0-60% for no good reason.

andrewflnr · on Aug 31, 2017

I would actually say it's often not strict enough. In what serious field is it acceptable to only know, say, 70% of the material? Do you want to drive on a bridge designed by an engineer who only got 70% on their exams? It depends on how the test is structured, really, but unless it was one of those tests designed to bring smart people to their knees, I'd rather not.

adrianratnapala · on Aug 31, 2017

We probably cross bridges designed by engineers who only got 70% on their exams all the time. That was pretty satisfactory score when I was in Uni.

antod · on Aug 31, 2017

Yeah - Exam performance from a decade or two ago is quite irrelevant for evaluating senior design engineers.

I wouldn't trust an engineering graduate who scored 100% on all their exams to design a bridge at all. Where as someone with 10+yrs relevant experience but who got 60-70% in their exams would be preferable to me.

Mastery of the math isn't that relevant due to all the design standards you have to understand and comply with anyway, while all the little pragmatic solutions to real world constraints (incl how the builders work and what they need to be effective) learnt from experience and mentoring from your senior peers are far more important.

logfromblammo · on Aug 30, 2017

So you're saying that the measurement of student mastery has no noise floor?

waterhouse · on Aug 31, 2017

It depends on how things are graded. On a multiple-choice test with four choices per question, someone with no knowledge who guesses randomly will get ~25%. On a true-false test, someone with no knowledge gets ~50%. On a project graded by a human, or a worksheet whose answers are real numbers, someone with no knowledge and a hard-eyed grader might well get 0%. Different classes will have different proportions of these things that contribute to the overall grade (at least, I haven't heard of any requirement that classes have the same proportions of such). The simple approach of summing total points achieved over each graded item, divided by total points possible, is straightforward to calculate, but I think there's no mathematical justification for choosing one percentage-based grading scale and applying it uniformly to all classes.

hanbura · on Aug 31, 2017

>On a multiple-choice test with four choices per question, someone with no knowledge who guesses randomly will get ~25%.

That would be terrible test design. At my (German) university, most Multiple Choice tests give one point for a correct answer, minus half a point for a wrong answer. That way you expect negative points from people who think they know everything but are no better than random guessing, zero points from somebody who knows nothing, some points from someone who can always narrow it down to two choices.

I guess my point is that you can arbitrarily raise the floor with a bad grading scheme, but there's no inherent reason to do that.

ThrustVectoring · on Aug 30, 2017

Yup. You can't assume that what you think a 3/5 means is the same as what someone else thinks a 3/5 means. You can basically assume that for thumbs up/down. And really, the question you want answered is "how likely am I to like this", and thumbs-up % of the overall population is a decent proxy for that for good reason.

ehhnetfliz · on Aug 30, 2017

What does a thumbs up mean, though? In a netflix context, am i recommending it to others? Trying to train the recommendation for my own taste? Making sure i rewatch it if I don't remember watching it the first time? What do I do if i like a movie but it's objectively terrible? All of the above questions weigh heavily, and the end result is I just avoid binary voting systems (including voting on hn) and it becomes feature bloat with little use.

Strangely, it gets even harder with the thumbs down—there are vanishingly few things i actively wish didn't exist. Why downvote at all?

ThrustVectoring · on Aug 30, 2017

If I see an approve/disapprove button, I try to click it if it's for something I've chosen to consume (watch, buy, visit, etc). If it's a decision I'm glad I made, I thumb it up. If it's a decision I regret making, I thumb it down. People and systems will read that input for one of two ways: either optimizing stuff for my preferences, or using that data to make choices further in line with my preferences. Either way, the world is marginally more like I like it.

ehhnetfliz · on Aug 30, 2017

Right, but what about consumers that want the rating to be meaningful? Assumably netflix has a history of videos you've seen entirely; they don't need your rating to know you consumed it.

Personally I just stop watching the moment I feel regret—the thumbs down button has no role in how I consume.

Two star ratings, though—that is meaningful, at least to me.

dennisgorelik · on Aug 30, 2017

You may stop watching a movie on Netflix because you do not like it. You may also stop watching a movie on Netflix because you already saw it multiple times and only wanted to rewatch few minutes snippet from it.

Without your thumbs up/down feedback it is hard for Netflix to figure out what is your opinion about the movie.

Houshalter · on Aug 31, 2017

Just normalize ratings. If the average rating is in the 50th percentile of all ratings on the site, convert the rating to 50%. That way it carries the maximum possible information. If someone rates something 60% that just means it's better than 60% of similar products.

School grading systems serve a completely different purpose and are a terrible comparison.

core-utility · on Aug 30, 2017

What about a thumbs up, thumbs down, and a neutral? In the case of restaurants, there are plenty of places I've eaten where I wouldn't give them a thumb's up "best place ever", but also not deserving of a thumb's down "terrible."

dredmorbius · on Aug 30, 2017

This really depends on how thick or thin the data are.

If any given option only gets a small handful of votes, then you might see a strong bias (favourable or otherwise) where neutral would be appropriate.

In Likert scale design (where favourability options >2), there's a strong debate over even or odd choices -- should someone be able to give a "meh" rating, or do you want to force a positive or negative, if even slight.

Hence, 3, 4, 5, 6, and 7 point (typically) scales.

jonknee · on Aug 30, 2017

Even Netflix finally moved over to up/down and they were famous for squeezing every drop out of their previous star based reviews [1]. In theory stars work better, but the issue seems to be everyone has a different ranking system.

For example, Uber seems to think anything but a 5/5 is a failure. I know this so I skew to accommodate, but in my personal ranking system I've only had a couple 5 star rides (someone really going above and beyond).

Up/down with an optional qualifier afterwards (e.g. "why were you unhappy?" after a thumbs down) seems to remove a lot of confusion.

[1] https://en.wikipedia.org/wiki/Netflix_Prize

yoz-y · on Aug 30, 2017

Maybe the problem is with stars and wording. Currently most of systems are worded (e.g. amazon) in a way that 3 stars is the base and people would add stars if their expectations were exceeded and remove them if they were not met. However at all places that I have seen it is like you say 5 stars is for expectations being met and it only goes downhill from that.

I think that a wording and iconography in 4 steps could be useful. -2 = something really bad happened, -1 = below expectations, 0 = happy customer, 1 = exceeded expectations. Forcing people to write a detail on any rating other than 0 would make most ratings 0. Angry people usually like to write comments anyways.

ghaff · on Aug 30, 2017

I'm not sure an expectations-based rating system is the norm though.

To use an example I gave elsewhere. I order a cable from Amazon. It works. Therefore it met my expectations of a working cable. Yet, I think most people would interpret a 3-star rating as my being lukewarm on my purchase. I'm not. But what the heck do I expect a cable to do other than being a fair price and to work?

With respect to movies. Some movies get really built up and I go in expecting great things (e.g. Fury Road). I come out thinking they were just OK. So maybe 3 stars. But definitely not -1 or 2 stars. My personal expectations aren't necessarily a good baseline.

yoz-y · on Aug 31, 2017

For a single data point they are useless. But I thing a rating in aggregate could be informative. Of course, the rating is self defeating as too many people who were positively surprised will raise the expectations.

For movies I would love to have reviews in style: Somebody writes up their expectations before going to the movie, and then rates the movie according to that. I find most "press and critics" reviews useless as if somebody is not a fan of story-less action movies, then why the hell is he reviewing them.

enobrev · on Aug 31, 2017

This is how I see it as well. It would also likely get rid of comments / ratings like "I would give this 0 stars if I could". I think the numbering you mention is just as essential (especially 0).

The way I generally rate things falls in line with this idea:

5: Excellent

4: Pretty Good

3: Average; Unsurprising. Not Impressed, Not Disappointed

2: Kinda Sucks

1: Run Away

This definitely doesn't seem to be how everyone else is using the 5-star scale.

ghaff · on Aug 30, 2017

This was (is? I haven't really used them much in a while) an issue with eBay's reputational system as well. Anything other than a Positive and "A++++++++++++ seller" was interpreted to mean they shipped you a box full of bricks rather than they took a week to ship things.

dredmorbius · on Aug 30, 2017

Normalising a given rater's stars, or asssigning costs to higher / lower ratings, is another option. Essentially you have a "ratings budget" you can spend, up or down, on your assessments.

Stack Exchange has this to an extent, where negative ratings cost the rater points -- you have to really want to assign a negative.

stinkytaco · on Aug 30, 2017

Everyone does have a different ranking system, but Netflix was good at predicting what I would give a movie, so my ranking system was the only one that mattered. Their special sauce in the background made it pretty accurate.

The problem with an up/down for me is that it doesn't capture an ambivalent reaction. That means transactions will tend toward the mode, imo. You will do enough to get a thumbs up, that's all.

kalleboo · on Aug 31, 2017

I often wish it was an up/down/love

dennisgorelik · on Aug 30, 2017

5-star ratings probably carry more information than binary thumbs up/down, but every 5-star vote is more complex to collect, so Netflix was probably getting less 5-star votes than they are getting up/down votes.

Overall number of votes matters too.

dredmorbius · on Aug 30, 2017

Close, but still, IMO, wrong.

The ultimate question is is this going to be useful to me, and the answer to that is ... somewhat complicated.

Informative, timely, accurate, significant (which may be none of the above), funny (may be appropriate or inappropriate, based on context and/or volume).

Some information is often (though not always) better than no information. Bad information is almost always worse.

(Aside: troubleshooting a systems issue yesterday I had the problem of someone trying to offer answers to questions where "I don't know" was far more useful than "I think ...". Unfortunately, I was getting the "I think ..." response, though not phrased as such.)

What you describe, the wirecutter treatment, is the case of an expert opinion. Here there remain issues -- particular of the biased expert. But if I could give a hierarchy of opinions from least to most useful:

-2. Biased. -1. Ignorant. 0. None. 1. Lay. 2. Novice. 3. Experienced. 4. Expert. 5. Authority in field and unbiased.

Note that the problem of judging expertise itself recapitulates much of the same problem.

Qualification and reputation of the raters themselves is a critical element missing from virtually all ratings systems.

hammock · on Aug 30, 2017

Offering a five stars to the rater can cause them to treat it as a thumbs up/thumbs down (as your parent comment alludes to when he references "game theory" and giving a one-star), with one star being the most powerful thumbs down, the obvious choice.

The other alternative is for users to actually SORT and RANK all products in that category that they have reviewed. Not a tenable solution.

Side comment, the Yelp histograms ARE useful... but that is more of a side effect/emergent from a bad rating scheme than anything else. Because people are using the stars not ideally, the histogram gives you insight into that. So it's not a bad solution, but a better solution would be something other than the stars.

ehhnetfliz · on Aug 30, 2017

Ehh, netflix switched to that. It's even less useful now: there's no way to indicate you really like a show vs it's not terrible; this means your taste approximately correlates with abailable content, not content you prefer.

The real win would be empowering the user to choose their own rating style. I don't see this happening because it's much harder to push content at users this way.

mrexroad · on Aug 30, 2017

To be fair, Netflix is less interested in that you liked the thing itself, and more interested in the attributes of the films you liked or watched to the end. Note, this is based on an article from 2014 [1], but a good read nonetheless.

[1] https://www.theatlantic.com/technology/archive/2014/01/how-n...

ehhnetfliz · on Aug 30, 2017

Right, but they don't ask me what I like about it. Judging by their recommendations they certainly aren't discerning it well.

Netflix is interesting because their content is pretty bad (compared to say IMDB they have very few movies); recommendations are incredibly useful to pretend its library is much larger than it actually is.

I wonder why they didn't go the amazon route of actual reviews + semantic analysis.

Mostly I just hate cross referencing netflix with reviews to figure out if it's worth my time.

zolloie · on Aug 31, 2017

That's the funny thing to me, that people are using Netflix as an example. To me, Netflix ratings are just about the most useless ratings of all the ratings I'm aware of, maybe even more so than Amazon's ratings.

There's also things to consider, like time, that becomes relevant. Dichotomous ratings are known to be inferior statistically speaking, but they are faster, so there's a convenience angle. Tradeoffs.

These discussions always get frustrating to me because there's so much armchair ad hoc stuff that goes on when there's a huge scientific literature on this already.

People also don't seem to be aware of the assumptions they're making. About ratings being skewed, for example: for a lot of products, people probably do kind of want to know basically "is this meeting my needs?" and then everything is just a decrement away from that. Laundry detergent, for example, is something where I want it to clean my clothes well without damaging them. Why should that be normally distributed?

Also, there's a difference between ratings and how they're used. My guess is that 1-3 star rating variance is meaningful from an experiential point of view, but not from a purchasing point of view. That is, if you had the choice of a 3-star product or a 1-star product, I think people would prefer the 3-star product. When we say "1-3 stars don't matter" we don't actually mean that, we mean that they don't matter because it's below our threshold of what we'd be willing to spend money on.

ghaff · on Aug 30, 2017

Wirecutter is particularly good for a lot of reasons. I'd actually argue that the "Wirecutter treatment" is mostly less applicable to Wirecutter than other areas given that many of the items they review are relatively pricey and sophisticated.

But, to your overall point, there are a lot of things that I just want a hopefully mostly unbiased expert to tell me what to buy and I'm just fine with that. When I buy a garden hose nozzle, I'm just fine with whatever one of Wirecutter's sister sites tells me to buy. I don't need or want to do a lot of research into the finer points of garden hose nozzle design.

makecheck · on Aug 30, 2017

Up/down is sufficient to capture the vote itself but there is more data in there to consider, like:

1. How long did it take for the person to vote in the first place? (Might change weight; if we're talking cars, nobody really knows if they "like" it 4 minutes after purchasing and it means something different if the rating appears 3 months later.)

2. Has the vote changed between up/down? Has this happened twice?

3. Has the person voted for other things in similar categories? Might make sense for phones, over a period of years. Doesn't make sense for a person to buy and rate 20 different chairs in a week. Use it to give credibility.

zolloie · on Aug 31, 2017

I do research in this area and have many reactions to a lot of topics being brought up. I read this piece when it first was written and didn't think to look at the posting on HN until now.

The problem with dichotomous ratings (binary, thumbs up-down) is that they lose a lot of meaningful information without eliminating the problems you're referencing.

That is, the same problems apply to dichotomous ratings, in that people still have tendencies to use the rating scale differently. Some tend to give thumbs up a lot, others down, and people interpret what's good or bad differently. People who are ambivalent split the difference differently.

On top of that, you lose the valid variance in moderate ranges, and actually amplify a lot of these differences in use of the response scale, by forcing dichotomous decisions, because now you've elevated these response style differences to the same level of the "meaningful part" of the response. E.g., maybe one person tends to rate things more negatively than another person, rating 4 and 5 respectively. But when you dichotomize, now that becomes 1 and 2.

The question is whether or not, on balance, the variance associated with irrelevant response scale use is greater than the meaningful variance, and generally speaking studies show the meaningful variance is bigger. In general, you see a small but significant improvement in rating quality going from 2 to 3, and from 3 to 4, and then you get diminishing returns after 4-6 options.

Also, people really don't like being forced to take ambivalence and choose up or down, so in the very least having a middle option is better (unless you want to lose ratings).

It's fairly straightforward to adjust for rating style differences if you have a bunch of ratings of an individual on a bunch of things whose rating properties are fairly well-known. Amazon could do this if they wanted to, and Rotten Tomatoes I think might do something like this already.

RT, in fact, is kind of a bad example, because their situation is so different from typical product ratings, in that you have a small sample of experts who are rating a lot of things. They also are aggregating things that themselves are not standardized-- their use of the tomatometer in part stems from them having to aggregate a wild variety of things, as if everyone on Amazon used a different rating scale, or no rating scale at all. Note too that there's then a "filtering" process involved by RT. Finally I also feel obliged to note they do have ratings and not just the tomatometer, which I've started paying attention to after realizing that things like Citizen Kane show up as having the same tomatometer score as Get Out--a fine movie but not the same.

The game theory angle is interesting to think about. It's something I don't deal with usually because in the situation I'm used to, the raters don't have access to other rater's ratings. That's one solution, but impractical. A sort of meta-rating is one solution--a lot like Amazon's "helpfulness" ratings. It's imperfect but probably does well in adjusting for game theory-type phenomena, like retaliatory rating, etc.

SamBam · on Aug 30, 2017

That's an interesting hypothesis, but I'd want to see more evidence that "a reviewer who believes its true rating should be 3 stars is motivated to give it 1 star."

Surely you can't prove this simply by noticing that there are many 1- and 5-star reviews, as there could be many other reasons for that. One obvious one: people who strongly like or strongly dislike a product are more likely to take the time to review. I personally have never felt the need to review something if I felt "meh" about it.

One sample study might be to see how people's ratings change if they have a chance to see the average rating first or not, but that would be a tricky study as you'd need to get people to buy something without seeing the ratings.

paulgb · on Aug 30, 2017

That's a good alternative hypothesis. It could also be that people's experience with a product really is bimodal: if I order an alarm clock that works as advertised, it is easy to get 5 stars, if it doesn't work at all, it's 1 star. Your explanation works better for why the distribution persists in books and movies though.

In any case, I find the mechanism design angle interesting regardless of the behavioral angle :)

gknoy · on Aug 30, 2017

I've often hypothesized that most people are more likely to leave a negative review when they are upset, than a positive review when they like something. It certainly holds in my case: I nearly never review things, because it's a giant hassle, so I have to feel really strongly to be willing to spend the time on a review.

CM30 · on Aug 30, 2017

You're right. It's been proven a bunch of times that people are more likely to leave negative reviews than positive ones, or to remember bad experiences more overall.

Zendesk actually did a survey on this back in 2013:

http://cdn.zendesk.com/resources/whitepapers/Zendesk_WP_Cust...

And American Express found something similar in their Global Customer Service Barometer survey as well:

http://about.americanexpress.com/news/docs/2014x/2014-Global...

stordoff · on Aug 30, 2017

It makes sense - a product doing what it says it will doesn't stand out / incentivise a review, whereas a bad product makes you want to caution others / get some sense of justice on the company (not sure of a better way to word that).

JoshTriplett · on Aug 30, 2017

> That's an interesting hypothesis, but I'd want to see more evidence that "a reviewer who believes its true rating should be 3 stars is motivated to give it 1 star."

I doubt it's thought about in precisely those terms, but I've certainly thought "that's overrated/underrated" before, and had that affect my rating.

ghaff · on Aug 30, 2017

>I personally have never felt the need to review something if I felt "meh" about it.

Or for that matter if I got a simple item and it works as expected. What's an Amazon Basics HDMI cable supposed to do? Does 5 stars mean that it turns HD into 4K through magic? But if it does what an HD cable is supposed to do (and seems well-built) I guess I should give it 5 stars?

splonk · on Aug 30, 2017

There's a fair amount of research on this, although I don't have a link to any public results offhand. But I have sat in on several UX experience sessions with users being interviewed about their motivations for rating, and "fixing" a perceived bad average rating comes up repeatedly.

I'm fairly sure at one point years ago I did see some data showing that rating distribution changes depending on whether the rater sees the average rating first, but it was a long time ago and I don't recall the specific differences now. In practice you're right that this isn't feasible for most use cases.

I do know that the rating distribution is strongly bimodal - on a 5 star scale I think 80+% of ratings will be either 1 or 5 star. Mostly 5 stars - IIRC they were around half of all ratings.

nfriedly · on Aug 30, 2017

I wonder if a system that assigned weights to each individual user's rating based on that user's rating history could help there - if a user always rates products with 5-stars, then another 5-star rating shouldn't have nearly as much weight as one coming from a user that gives a fairly balanced range of ratings.

I'm not sure if that would actually work better in practice, but it's at least an interesting idea.

fnordian_slip · on Aug 30, 2017

At first your system seemed to me like a solution to bought reviews (mturk and otherwise) and bots.

Then I realized that it would just incentivise bots to add 1-star reviews to random products once their creators figure out this mechanism.

Sometimes these problems make me sad, it could all be so nice and easy if it weren't for these bad actors.

BearGoesChirp · on Aug 30, 2017

It would be a lot more work, but one could check the validity of ratings based on how other users rate things compared to this user.

Say there are 4 games. Most users who rate all 4 rate them similar, except for the 4th game that always gets really low. So 5,5,5,1 is a normal expected rating, but 1,1,1,5 isn't. So 5,5,4,4 from a high rater or 2,2,2,2 from a low rater would be given more weight than a 1,1,1,5. Other things can be added such as weighting a user's ratings a low impact if they have too few scores to determine ratings from.

This reminds me of the problem of determining the answer key to a multiple choice test given only the answers of the test takers.

bllguo · on Aug 30, 2017

unfortunately sample sizes will probably decrease drastically when you're looking at the subset of people who rated 4 specific games, or other such cases

fnordian_slip · on Aug 31, 2017

Yes, that alone is probably enough to make any system like that come with a huge false-positive rate.

I myself only rarely post a review, and I could easily see how - in a vacuum - my star ratings might look like those of a bot.

splonk · on Aug 30, 2017

This does actually happen, possibly for several other reasons. Bought ratings often come accompanied by other ratings, whether bought or not, just to attempt to cloak the bot. This even happens in systems where there aren't any countermeasures being taken to remove bots yet - I was vaguely involved in spam detection in one and saw all sorts of cloaking behavior even before we'd turned on any sort of mitigation.

nfriedly · on Aug 30, 2017

Yea, that's the main reason why I'm not sure it could be made to work.

Tyr42 · on Aug 31, 2017

What if you just only say visit restaurants which are well reviewed, and deserve 5 stars? Why should avoiding going to crappy restaurants penalize me when reviewing great restaurants? You're kinda assuming that people are visiting restaurants at random, and should experience the full range of good and bad things, but that's not true, especially for people who rely on existing ratings a lot.

dagw · on Aug 30, 2017

On the other hand most people are only generally moved to bother writing a review for something if it's exceptionally good or exceptionally bad. So someone only giving 5-star ratings could very well be someone who only bothers writing a review if a product is a life changing experience and as such his opinion isn't less valid than that of someone who reviews every little thing he has ever bought.

thanatropism · on Aug 30, 2017

Individual preferences cannot be aggregated into something that resembles a preference ranking. The most cited formalization of this is Arrow's impossibility theorem, but choice aggregation is this whole theory.

_Judgement_ is a slightly different problem. There's an entire issue (#145) of the _Journal of Economic Theory_ on this, but the panorama is still quite bleak, and the reddit approach is far from state-of-the-art.

(Personal experience: I've "returned" to reddit (I swore off facebook but I'm still addicted to having something on my phone), and the only way to get people to interact with you is to browse the "new" queue. Once something is "hot" it's basically dead -- new comments are queued to the end even if they're rising fast, and no one replies to you).

zolloie · on Aug 31, 2017

The flaw with that line of criticism is that it makes assumptions about the meaning of the ratings. Note, too, that Arrow's impossibility theorem applies to ranking but not ratings. That also applies to a very simplified, idealized case which can be superceded by more sophisticated voting/rating systems.

bo1024 · on Aug 30, 2017

Yes, but single-peaked preferences are a special case that apply here, and where the median is truthful.

(For those not familiar: single-peaked preferences assumes that the person always prefers the final rating to end up closer to their personal rating. So if I believe the restaurant is 3 stars, I'd most prefer it gets rated actually at 3 stars, and I'd rather see 2 stars than 1 star. If all the raters have single-peaked preferences, then using the median to produce the final rating is truthful: A person can't move the final rating closer to their own belief by lying. The mean is not truthful: If the current average is 4 and my rating is 3, I can pull the average closer to 3 by giving a 1-star review.)

grandalf · on Aug 30, 2017

Excellent point. Per your question I'm also curious if there is a way to make aggregate ratings more useful as a quality measurement.

For instance, if I see a product on Amazon with a 4.8 average rating but notice a lot of very angry 1 star ratings, I'm likely to infer that there may be quality control problems.

Amazon displays a histogram so the shopper can assess the meaning of the distribution heuristically.

There's also the issue of whether ratings should be absolute or based on value. If I buy some obviously knockoff ear buds for $6 and they are way better than expected, I'd give them 5 stars, but if they had cost $50 I'd have given a three star review.

So for shopping it seems that there are multiple signals being aliased into a single star rating.

paulgb · on Aug 30, 2017

Somewhat related, I think it was Nate Silver who theorized that given enough time all restaurant reviews trend towards four stars. The theory is that if something gets less than four stars it attracts a niche crowd that appreciates it uniquely (and rates highly), while if it gets five stars it attracts a general crowd that doesn't have a particular appreciation (and rates poorly).

ghaff · on Aug 30, 2017

And/or restaurants that everyone hates tend not to stay in business. There's also a more general effect that ratings (and reviews) affect the behaviors of people who haven't rated yet. I wouldn't rate many movies that I see below a 3/meh/OK level. That's not so much because I grade inflate but because I actively seek to avoid movies I'd rate 1 or 2 (and, indeed, mostly 3).

stordoff · on Aug 30, 2017

It also doesn't help when reviews aren't made on the merits of the product. Pretty common on, e.g., Steam:

> DOTA 2 users then had the brilliant idea to do the dumbest thing any fanbase can do to a game, flood Metacritic with bad user reviews. The slew of zeros since the forgotten Diretide has dropped the game's user score about two points to a 4.5.

https://www.forbes.com/sites/insertcoin/2013/11/02/valve-for...

dota · on Aug 30, 2017

Half-Life fans are currently leaving negative reviews on Dota 2 because Valve won't make HL3. Recent reviews went from "overwhelmingly positive" to "mixed".

https://arstechnica.com/gaming/2017/08/steam-reviewers-bomb-...

logfromblammo · on Aug 30, 2017

Valve should probably just farm out HL3 to Obsidian, and continue printing money with Steam.

pvdebbe · on Aug 31, 2017

Open-world Half-Life does sound intriguing.

svachalek · on Aug 30, 2017

There's also the hostage-taking effect, most notable in the iOS app store. "This 1-star review will be changed when you give me what I want."

KVFinn · on Aug 30, 2017

>Averages (even with the post's approach) still have the problem of not being "honest" in the game theory sense. For example, if something is rated 4 stars with 100 reviews, a reviewer who believes its true rating should be 3 stars is motivated to give it 1 star because that will move the average rating closer to his desired outcome.

Also I'd like to know if a 5/10 rating is mostly 5s, or an average of mostly 1s and 10s.

panic · on Aug 30, 2017

You can also model each vote as an "agent" that tries its best to move the star rating toward its desired value. If the current rating is a 4, each "agent" with a vote less than 4 will throw a 1 into the average, and each vote greater than 4 will throw a 5. This process converges, though the rating tends strongly toward 3 (or whatever the middle value is).

paulgb · on Aug 30, 2017

Good answer, this has a nice property that it can be applied to any reasonably behaved average-based system to get an honest mechanism. For a plain average it is equivalent to the median.

ajennings · on Aug 30, 2017

Not actually equivalent to the median.

If all the scores are 1s and 5s, then the median will be a 1 or a 5, but the average will be somewhere in between.

The problem is slightly easier to understand if we consider the grading scale from 0 to 100. Then every agent, trying to manipulate the score as much as they can toward their "ideal grade", will submit a grade of 0 or 100.

The average will converge to the unique number, X, where X percent of the graders want the final grade to be above (or equal to) X.

paulgb · on Aug 31, 2017

Yes, you're right and I was wrong.

dragonwriter · on Aug 30, 2017

Numeric ratings are GIGO anyway, since cultural differences in how people map satisfaction to star ratings mean that the same number of people with the same degrees of preference for your product can produce a near-infinite array of different sets of preference ranking simply depending on how preferences are distributed among subcultures.

pcollins123 · on Aug 31, 2017

For every service there are very relevant factors about what makes the product good or bad. If you don't separate these 2 or 3 factors, then over time everything becomes a score of 3.6.

In the case of Amazon, the relevant options are:

   1. Likert scale of quality:
       a junk, just don't buy it
       b cheap and works good enough for occassional use
       c higher quality: willing to spend more and you'll get a much better outcome.
       d overpriced

   2. bad shipping, bad vendor, poor customer service

I hate seeing a bad review for a product based on the last item, they're normally outlier issues or whiners and I normally try to filter them out.

In the case of rotten tomatoes it is, again a different set of parameters.

bo1024 · on Aug 30, 2017

> Math challenge: is there a way of combining the desirable properties mentioned in the post with the property of honesty? I suspect there is but I haven't tried it.

The suggested method seems to be asking for binary responses (like/dislike), then aggregating them with the confidence-bound formula. This should be truthful in, e.g., a model where users who like it want to maximize the score and users who dislike it want to minimize the score.

frgtpsswrdlame · on Aug 30, 2017

>A look at rating distributions shows that this is in fact how many people behave.

This is really interesting, do you have anything where I could read more about it?

paulgb · on Aug 30, 2017

Here's an analysis from a couple years ago: http://minimaxir.com/2014/06/reviewing-reviews/

In particular, the conclusion:

"The reviews on Amazon’s Electronics products very frequently rate the product 4 or 5 stars, and such reviews are almost always considered helpful. 1-stars are used to signify disapproval, and 2-star and 3-stars reviews have no significant impact at all. If that’s the case, then what’s the point of having a 5 star ranking system at all if the vast majority of reviewers favor the product? Would Amazon benefit if they made review ratings a binary like/dislike?"

bnegreve · on Aug 30, 2017

I remember a talk by Thorsten Joachims at ECML/PKDD 2013 where he was talking about this, you can watch it here.

https://www.youtube.com/watch?v=fX9lj0UdB9s

Around 11:40 he shows evidence of this "dishonest" behavior. As far as I remember, the whole talk was very good. He has some publications on the topic.

asr · on Aug 30, 2017

I am having a hard time digging it up, but I remember reading some reporting on the Netflix Prize that said (before Netflix abandoned the star system) that many users rated things only one-star or five-star.

But, counter to the OP's point, I wouldn't assume this is an attempt to move the average; I would guess this is for a number of reasons, including because it's too much mental energy to decide if a product (film) is worth four or five stars, if you rate something you are often just trying to say "liked it" or "didn't like it."

dredmorbius · on Aug 30, 2017

Some years back I designed a multi-point rating system for a social media site.

I used it precisely as you describe.

Ended up in a discussion/argument with one of the users (not aware of my role) over whether or not that constituted "abuse" of the system.

It was pointed out (by others) what my relationship to the system design was. I remain amused by the episode.

tshaddox · on Aug 30, 2017

If it's true that a significant number of people give 1-star reviews to drive down the average (I'm skeptical without seeing evidence), then would people really understand the idea of median ratings and stop doing that?

ajennings · on Aug 30, 2017

Good question.

Usually we consider "aggregation functions" with a fixed number of graders, N. It has been proven that if you want an aggregation function that is:

- anonymous: all graders treated equally

- unanimous: if all graders give the same grade, then that must be the output grade

- strategy-proof: a grader who submitted a grade higher (lower) than the output grade, if given the chance to change their grade, could do nothing to raise (lower) the output grade

- strictly monotone: if all graders raise (lower) their grade, then the output grade must rise (fall)

then your aggregation function must be an "order statistic": the median (if N is odd) or some other function which always chooses the Mth highest input grade.

If you relax the last criterion to:

- weakly monotone: if all graders raise (lower) their grade, then the output grade must rise (fall) or stay the same

then your aggregation function must be "the median of the input data and N-1 fixed values". As an example of this last type of function, let's take @panic's idea that each grader has an honest evaluation between 0 and 100 but has an agent that submits a fake grade (0 or 100 usually) to pull the average toward their honest evaluation.

As I say in a descendant comment, this system will converge to the unique number, X, such that X percent of the graders want the final grade to be X or above. You noted that this whole system (the average and the agents) is strategy-proof, so each grader should be honest with their agent. We might as well pull the agents into the system and say, "submit your honest evaluation and we'll calculate X, the unique number such that X percent of the graders want the final grade to be X or above."

This is an aggregation function. It is anonymous, unanimous, strategy-proof, and weakly monotone. I call it the "linear median" in my PhD thesis. Rob LeGrand called it "AAR DSV" in his thesis. We've been calling it the "chiastic median" more recently. It has some interesting properties. Considered in the context of "the median of the input data and N-1 fixed values", with 100 graders, the 99 fixed values are 1,2,...,99, and this function always returns the median of the input data with these 99 fixed values. (No matter how many graders there are, the fixed values will equally divide the number line between 0 and 100.)

You can see chapters 5-8 of my PhD dissertation for more info: http://ajennings.net/dissertation.pdf

Now, you're thinking about how the grade changes when a new vote is added, so we're really talking about a family of aggregation functions, one for each possible number of graders. We want each one to be strategy-proof in itself, but we also need to consider how they relate to each other.

Do you want strict monotonicity or weak? (I find strict monotonicity too restrictive, myself.) If you say "strict", then for each N you need to choose which order statistic you want. If you say "weak", then for each N you need to choose N-1 fixed values and you'll always take the median of the input data and the appropriate array of fixed values.

In my thesis (section 7.2) I talk about how you can create a "grading function" to unify a family of aggregation functions, but I don't think that's a perfect fit since we want to somehow "punish" subjects that don't have very many grades (that's what the OP is about). Do we want to pull them towards 0, or pull them towards some global neutral value (like 3 out of 5)?

kstenerud · on Aug 30, 2017

That's what's always annoyed me with Amazon's "sort by average rating" setting. I want to see the top 10 or so items by rating to give me a baseline to investigate from, but instead I get page after page of cheap Chinese crap with one 5-star review each from the resident fake reviewer.

Worse than useless.

Even a simple change like adding a "show only items with a minumum of X reviews" would be a godsend.

nehushtan · on Aug 30, 2017

What's crazy is everyone knows Amazon's ranking is crap, except apparently Amazon - and it's been crappy in the same way for 10 years.

SamBam · on Aug 30, 2017

While Amazon certainly has a vested interest in getting people to trust the reviews (they care more about people coming back to Amazon again and again than selling any one product), I wonder if they also have a vested interest in keeping a large number of products from multiple vendors available.

If Amazon ranked it's items the "proper" way, such that all one-rating products were far from the top, I imagine it would see a lot more clustering of purchases on the most popular version of every product type. All those variations that were not as popular would receive many fewer purchases, and some of those vendors might simply fold.

Amazon may have decided that having a larger ecosystem of vendors is worth more than implementing a better rating system. This, presumably, is not in the customer's interest (unless perhaps the "discoverability" of unknown products is on balance worth the risk).

Whatever the reason, Amazon certainly knows about other ranking systems, so it has to have made this choice deliberately.

folli · on Aug 31, 2017

For Amazon (and equally large companies) I usually tempted to put the proverb "don't attribute to malice which is adequately explained by stupidity" on its head. There's definitely a financial reason behind this.

oconnor663 · on Aug 31, 2017

I'd hasten to add "don't attribute to stupidity what is adequately explained by people having to deal with more than one problem at the same time."

logicallee · on Aug 30, 2017

> with one 5-star review each from the resident fake reviewer.

don't you think if they switched to this algorithm you would see page after page of cheap crap with 500 5-star reviews each from the 500 resident fake reviewers?

...though now that I've written it like that it does make it obvious that this would be higher a burden on fakers, and cut out some of the products doing this.

gesman · on Aug 30, 2017

There is a good plugin for Chrome "Amazon Sort - Number of Reviews".

It helped me to push chinese crap way below the fold and focus on well-researched items on top.

zolloie · on Aug 31, 2017

I seem to be the only one who remembers this, but for a brief period of time, Amazon implemented the lower confidence bound approach (where you were sorting on the lower bound to the average, not the average itself).

I loved it, but I noticed that not too long after (maybe a year?) they removed it. My sense was that small businesses were complaining that the system was unfairly benefiting larger businesses. E.g., if you have a new product, using the lower bound or something similar is unfair because it penalizes you for being new, relative to established players.

Honestly, I can see that perspective too (which is missing from the linked piece), and am not really sure what to do about it. The linked piece comes at it from the perspective of consumer risk minimization, and not from the perspective of the producer, which Amazon also has to contend with.

The solution is probably to allow sorting by both.

ocfnash · on Aug 30, 2017

I think this comment would be improved by removal of the word "Chinese".

coldtea · on Aug 30, 2017

Would it be more accurate though?

ocfnash · on Aug 30, 2017

I don't know; no doubt we could debate the point at length.

Nevertheless I stand by my original remark. IMHO the conflation of "Chinese" with "cheap crap" is:

  * Not relevant

  * Debatable / Subjective 

  * Divisive

and thus distracts attention from the actual point being made.

kstenerud · on Aug 30, 2017

Meh.

China today makes the lion's share of cheap crap. Before that it was Korea. Before that it was Japan. Next will probably be Bangladesh. Just follow where garment manufacturers go and you'll see the pattern.

They key to selling more cheap crap in this day and age is to game the ratings systems as inexpensively as possible, thus the Amazon one-fake-5-star-review problem.

stordoff · on Aug 30, 2017

FWIW, in this specific case, part of the problem can be that items are shipped from China (so longer lead times etc.).

toniprada · on Aug 30, 2017

Other approach for non binary ratings is to use the true Bayesian estimate, which uses all the platform ratings as the prior probability. This is what IMBD uses in its Top 250:

"The following formula is used to calculate the Top Rated 250 titles. This formula provides a true 'Bayesian estimate', which takes into account the number of votes each title has received, minimum votes required to be on the list, and the mean vote for all titles:

weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C

Where:

R = average for the movie (mean) = (Rating) v = number of votes for the movie = (votes) m = minimum votes required to be listed in the Top 250 C = the mean vote across the whole report"

http://www.imdb.com/help/show_leaf?votestopfaq&pf_rd_m=A2FGE...

poorman · on Aug 30, 2017

I reference this article constantly at Untappd.

When we were building the NextGlass app, I took much of this into consideration for giving wine and beer recommendations.

We recently ran the query on the Untappd database of 500 million checkins and it yielded some interesting results. The "whales" (rare beers) bubbled to the top. I assume this is because users who have to trade and hunt down rare beers are less likely to rate them lower. The movie industry doesn't have to worry about users rating "rare movies", but I would think Amazon might have the same issue with rare products.

chris_va · on Aug 30, 2017

There is an interesting phenomenon of exclusivity/sunk-cost boosting ratings for rarer or harder to acquire items.

That is also a problem with movie ratings (I just noticed that you mentioned movies). Critics (and audiences) at pre-screenings are generally significantly more favorable to a movie than an equivalent group in a normal theater. I would not be surprised if the same thing applied to foreign movies, and other types of "whales".

infomofo · on Aug 30, 2017

Untappd is also weird because you know that the producers of some of these small brewery beers actually look at these checkins. A lot of the beer drinkers I know will prefer to not rate a beer instead of giving it a sub-3 rating.

intenscia · on Aug 30, 2017

Implemented this after discovering it via https://www.gamasutra.com/blogs/LarsDoucet/20141006/227162/F...

Works amazingly well and so easy to calculate vs say the way IMDb rates things.

loisaidasam · on Aug 30, 2017

Here's a SO post w/ a python implementation:

https://stackoverflow.com/questions/10029588/python-implemen...

The accepted answer uses a hard-coded z-value.

In the event that you want a dynamic z-value like the ruby solution offers, I just submitted the following solution:

https://stackoverflow.com/questions/10029588/python-implemen...

dperfect · on Aug 30, 2017

What's the best way to apply the suggested solution to a numeric 5-star rating system (the author mentions Amazon's 5-star system using the wrong approach, yet the solution is specific to a rating system of binary positive/negative ratings)?

I suppose one could arbitrarily assign ratings above a certain threshold to "positive" and those below to "negative", and use the same algorithm, but I imagine there's probably a similar algorithm that works directly on numeric ratings. Anyone know? Or if you must convert the numeric ratings to positive/negative, how does one find the best cutoff value?

dbaupp · on Aug 30, 2017

The author has also written http://www.evanmiller.org/ranking-items-with-star-ratings.ht... .

amrrs · on Aug 30, 2017

What we do with 5-star rating system is completely ignore 2,3,4 stars which in a lot of ways just skew our analysis, hence ending up with a new score similar to Nps (5-star minus 1-star) / total stars

overcast · on Aug 30, 2017

Why bother have a 5-star rating system then? Sounds like Netflix went down the right path, with thumbs up or thumbs down. People either zero it out, or give it five stars.

jbochi · on Aug 30, 2017

It's very common to see a "Most Popular" section in a website, but the way it's usually done is not optimized for clicks.

Inspired by Evan's post, I wrote "How Not to Sort by Popularity" a few weeks ago: https://medium.com/@jbochi/how-not-to-sort-by-popularity-927...

kuharich · on Aug 30, 2017

Previous discussions: http://news.ycombinator.com/item?id=1218951, http://news.ycombinator.com/item?id=3792627, https://news.ycombinator.com/item?id=9855784

hood_syntax · on Aug 30, 2017

Read this article before and I really liked how to the point it is. More than anything, can I just say how infuriating Amazon's rating system is?

eeZah7Ux · on Aug 30, 2017

This is computationally very heavy, but, more importantly, for practical purposes you want to have a tunable parameter to balance between sorting by pure rating average and sorting by pure popularity.

Often you also want to give a configurable advantage or handicap to new entries.

quantdev · on Aug 30, 2017

For a fixed confidence level, it looks computationally light weight: a dozen or so multiplications and divisions plus one square root, which could be approximated if needed. There is no inverse normal needed at run time.

amelius · on Aug 30, 2017

> What we want to ask is: Given the ratings I have, there is a 95% chance that the “real” fraction of positive ratings is at least what? Wilson gives the answer.

Well, you can't answer that question without making assumptions. And these seem to be missing in the article.

thanatropism · on Aug 30, 2017

Arguably what Urban Dictionary is doing is to weigh by "net favorability" in some sense and quantity of votes. Quantity of votes correlates to relevance, particularly because UD is meant to represent popular usage.

gleenn · on Aug 30, 2017

We actually switched to Wilson score. Doing it later has some weird effects, when you've already have a lot of people typically voting on the first definition, and then suddenly the order gets switched because something has a higher ratio giving it higher confidence. We're honestly not sure it's done anything that great for UD, sometimes simple is just better.

thanatropism · on Aug 31, 2017

You might want to weigh less controversial (as in abs(upvotes - downvotes) higher. This would be somewhat like Effect Size in science.

The Bayesian approach would be to assume the true vote distribution is binomial and use a beta prior (possibly with Jeffrey's degenerate bimodal prior). Then as the total number of votes increases the posterior distribution tightens. Ranking score is prob(score>0).

onorton · on Aug 30, 2017

Would a better idea for UD be like the original Facebook "like" system? So you only vote if you think it's relevant and only popular definitions sit at the top.

agentgt · on Aug 31, 2017

This sort of reminds of "voting theory" and if I recall it was proven by I think a nobel prize winner that there cannot be a fair winner.

Obviously it's not entirely analogous but I would not be surprised if it mapped over to this domain.

Edit: on mobile so late on the link to Kenneth Arrow https://en.m.wikipedia.org/wiki/Arrow%27s_impossibility_theo...

petters · on Aug 31, 2017

That theorem, being about when every user provides a complete ranking, does not apply in this case.

gesman · on Aug 30, 2017

I think ratings need to be normalized to personal beliefs and preferences of the viewer.

In other words - I can care less how Joe Blow rated the product - but it's important to me how likeminded people like me rated the product.

Also - Amazon is not making mistake in ratings.

Amazon is less interested in selling you relevant product for you.

Amazon is more interested to boost it's bottom line, move stalled inventory or move higher margin inventory.

alexvay · on Aug 31, 2017

I think the article is missing something visual to demonstrate the actual scoring at work.

I've made a simple plot in Excel here: http://i.imgur.com/adjaLQ9.png

The number of up-votes remains the same, while down-votes increases linearly. The scoring declining line in grey is the score.

shaftway · on Sept 1, 2017

Here's a 3d graph showing this as a function of upvotes and downvotes. I think it's clearest with

x: [0, 100] y: [0, 100] z: [0, 1]

https://www.google.com/search?q=graph+((x+%2B+1.9208)+%2F+(x...)

bwaxxlo · on Aug 31, 2017

Label the axises please.

tabtab · on Aug 31, 2017

What about having a scaling factor to adjust the impact of quantity (total) of individual ratings as needed? Rough draft:

  sort_score = (pos / total) + (W * log(total))

Here, W is the weighting (scaling) factor. Total = positive + negative

autarch · on Aug 31, 2017

IMDB uses something like this. It's called a "weighted" rating system. In the IMDB version what happens is that you calculate the average of all ratings of all items, and then push an item's rating towards the average. The fewer ratings it has the more it's pushed.

See http://www.imdb.com/help/show_leaf?votes for details.

jules · on Aug 31, 2017

That formula will give an item a higher score the more down votes it gets. A better approach is

score = (pos + a) / (tot + b).

Where a<b, e.g. a=1, b=2.

See this post why that formula follows from Bayesian reasoning: http://julesjacobs.github.io/2015/08/17/bayesian-scoring-of-...

tabtab · on Aug 31, 2017

I'm not sure what you mean in the 1st sentence. Example? The problem with the 2 weights is that it's 2 values that have to be given, and for large quantities neither makes much difference. It's why I used log().

jules · on Sept 3, 2017

The log(total) term increases without bound whereas the pos/tot term is at most 1, so in the limit of a lot of votes you will beat an item with fewer votes even if all your votes are downvotes.

That there are two configurable parameters is a good thing. One parameter controls how much of a penalty you get for having few votes, the other controls how many votes count as "few".

alexpetralia · on Aug 30, 2017

Chris Stucchio and Evan Miller have amazing statistics blogs.

phunge · on Aug 30, 2017

Classic post! This post is like a gentle gateway to the world of Bayesian statistics -- check out Cameron Davidson Pilon's free book if you want to go deeper.

bradbeattie · on Aug 31, 2017

I think this article is missing the next step: collaborative filtering. I only care about the ratings it received from people that rate thing like I do.

larkeith · on Aug 30, 2017

This article is useful, but the author's tone really rubs me the wrong way - to the point I'm dubious about trusting the information without further sources. Cutting the entire first part ("not calculating the average is not how to calculate the average") would help, as would more accurately titling the piece - no matter how effective this method is, it is NOT sorting by average, strictly speaking.

ignawin · on Aug 30, 2017

Any blog posts/papers on what the best general approach to onliene reviews is?

dredmorbius · on Aug 30, 2017

Good, qualified, honest reviewers.

Hal Varian (UC Berkeley) has some 1990s refs, which remain good. "Grouplens" is the project/product.

Randy Farmer literally wrote the book on the topic. There's a book, blog, and wiki.

Frankly, Farmer's work, good as it is, largely reinforces my view that Varian captured the essence of the problem, which I've summarised in my opening 'graph. You cannot algorithmically correct for crap quality assessment.

If you're interested in the long-form answer, the fields are epistemology (philosophy) and epistemics (science).

Enjoy!

http://people.ischool.berkeley.edu/~hal/Papers/publish.html

http://people.ischool.berkeley.edu/~ngood/

http://people.ischool.berkeley.edu/~hal/Papers/japan/

http://buildingreputation.com

jules · on Aug 31, 2017

I wrote a blog post about a simpler and statistically grounded method: http://julesjacobs.github.io/2015/08/17/bayesian-scoring-of-...

Animats · on Aug 30, 2017

Mandatory XKCD: https://xkcd.com/937/

autokad · on Aug 31, 2017

a gamma poison might more accurately calculate the rating based off uncertainty of the data

donatj · on Aug 30, 2017

Does a decent Fortran implementation exist?

tom-lord · on Aug 30, 2017

It's written in the article?

jimktrains2 · on Aug 30, 2017

Did you read the article?

jimktrains2 · on Aug 30, 2017

To those downvoting me and sibling, parent asked about a SQL implementation originally.