Why Ratings Systems Don't Work (goodfil.ms)
196 points by geelen on Aug 22, 2012 | 162 comments



Terrible article. Calling histograms awful, based on nothing more than an opinion.

Then trying to conclude that some convoluted scatter plot system makes more sense is laughable.

Not to mention, this system is still just a star rating system. This would be no different than having two histograms side by side, assuming, of course, that you'd even want to rate different aspects of the same thing.

I can't even imagine scatter plots on Amazon, or trying to convince the general public that "it makes more sense".


Terrible comment. I found the article interesting. As someone who has played around with many different types of rating systems, I applaud their effort at trying something different. Sounds like you are a little too emotionally invested in histograms. I'm not even going to ask why.


I'll meet you halfway. It's a terrible comment about a terrible article. I agree that it's laudable to try to invent an improved version of online ratings, but this article wasn't effective in convincing me that they've succeeded.

Their argument that histograms are just.awful. (I didn't care for the extra periods) seems to have two components: 1. asserting that histograms are bad; 2. showing us 3 histograms and saying they tell you the same thing about all 3 movies, when in fact there is a very clear and important difference between the histograms.

It's completely obvious from inspection ("people are really good at seeing patterns") that Starship Troopers has a much lower percentage of 5 star ratings than the other two, and a much higher fraction of 0 star ratings. It also appears to me that the Fifth Element has a higher fraction of 4 or 5 star ratings, and is probably the most appreciated of the 3 films, although Blade Runner is fairly close.

If you are going to cherry-pick a set of 3 specific films to make your point, you should be sure to at least pick 3 films that support your point instead of refuting it.

We then learn about their hypothesis that while a 5-star rating system sucks, a system that relies on two correlated 5-star ratings is great. They demonstrate this by using the two question system to draw the exact same conclusion I drew from the histograms of the 1 question ratings.

I would've liked some sort of objective attempt to compare the two rating systems. Perhaps it would be possible to measure how frequently the two question system leads people to make a better choice than the one question system, or at least some sort of statistical wonkery that would purport to show me that the two question system in practice draws more distinctions than the one question system. Unfortunately we only get this one rather uninspired example ("watch this if you’re in the mood for something really good").

They also didn't address why "would you re-watch this film" is a better choice than any other second question. There are attempts to justify it being a good question, but no real evidence that other questions were tried and didn't perform.

Finally, the thing that really irked me was that this proposed system doesn't seem to do anything to address most of the actual problems with the regular 5 star system, namely that people who feel really strongly about something are more likely to rate so that most ratings tend towards the extremes, and that without context we have no idea why someone rated something a 5 instead of a 1. Those problems now exist along two dimensions instead of one.

I see this less as an article and more as an advertisement hitching a ride on an xkcd comic.


It's different from having 2 histograms because it lets you distinguish people who thought it was a good movie but not re-watchable, from people who thought it was a good movie and fairly re-watchable, from people who thought it was an OK movie and not re-watchable. With a bunch of histograms you can't correlate the variables.
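To make that concrete, here is a minimal sketch with made-up ratings (the two films and their scores are hypothetical, not taken from the article). Both films produce identical pairs of histograms, yet the joint distribution, which is what a scatter plot shows, is completely different.

```python
from collections import Counter

# Hypothetical (quality, rewatchability) ratings on a 1-5 scale.
film_a = [(5, 5), (5, 5), (1, 1), (1, 1)]  # the two scores move together
film_b = [(5, 1), (5, 1), (1, 5), (1, 5)]  # the two scores move in opposite directions


def marginals(pairs):
    """Return the two separate histograms, one per question."""
    quality = Counter(q for q, _ in pairs)
    rewatch = Counter(r for _, r in pairs)
    return quality, rewatch


def joint(pairs):
    """Return the 2-D histogram that a scatter plot effectively displays."""
    return Counter(pairs)


same_histograms = marginals(film_a) == marginals(film_b)  # True
same_joint = joint(film_a) == joint(film_b)               # False
```

Whether real rating data ever differs this sharply is a separate question; the sketch only shows that the pair of histograms cannot tell you.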


Why would I care about rewatchability? The torrents are full of thousands of lifetimes of video I'll never watch; what's the point of rewatching anything but the dearest, youth-defining films?


> Why would I care about rewatchability?

Speaking as a film buff, this is actually quite a good guide to the sort of movie it is (when combined with quality). If lots of people mark it good quality, but wouldn't watch it again, that implies that you have to be in the right mood for it.

If people mark rewatchability high, even if the quality rating varies, you know it is a much more easygoing film.

And so on. Combining data points is good :)


The comment I was replying to said "This would be no different than having two histograms side by side." I pointed out that it is different.

Oh well I'll answer your question anyway. Because 90% of everything is crap (probably more than that on P2P sites). Re-watching a movie can be like walking in the same park more than once, looking at pleasant things with a sense of recognition. You could put a movie in to suit your mood. Sometimes you'd rather watch a good "original" movie again than watch yet another crappy remake or rip-off.


If I have already seen this movie why would I need ratings of any kind?


What use case are you referring to? I don't know anyone who uses ratings sites after seeing a movie.


My point exactly. And why would you care about rewatchability if you didn't see the movie once?


Rewatchability should indicate a light-weight movie. If you want that, you look at rewatchability; if you want something more complex, you use quality. I think this is the underlying assumption.


I don't think this is right. People watch movies like Blade Runner and Citizen Kane repeatedly.


Possibly to know whether you should rent it or buy it.


Because the rewatchability rating doesn't just tell you how rewatchable a movie is, but also how good and "timeless" it is, i.e., mainly how deeply it resonates with people.

This is not about picking a movie to watch again after you have already seen it once: this is about picking a movie you haven't seen but that is so good that other people like to watch it again and again.


You're telling me it's stupid to choose the 100% chance of some known amount of fun from rewatching an old series rather than to bet on watching that latest Jennifer Aniston flick?


You are not trying to gauge if you would want to re-watch the film. If others would re-watch the film, maybe, just maybe, you would enjoy watching it the first time.


Because "rewatchability" is not just useful to judge how much people like to watch a movie again -- it's also useful to judge how much people like the movie, find it deep, and connect with it (over pure mindless fun).

So rewatchability roughly translates to "emotional connection + quality".


Now compare your point of view with the one expressed by mhellmic[1] in this very discussion. 'Rewatchability' sounded to me as something even more vague than 'quality' while reading the article. Also, my pattern for rewatching films differs greatly from both your description for rewatchability and mhellmic's, and from my own pattern for rereading books. I don't see how it can be applied to rating systems with just one additional parameter (the quality thing).

[1] http://news.ycombinator.com/item?id=4417184


For adults, this might be true, but for kids, I'm not so sure.

If you've got kids, you know they'll just watch the same flashy animated movies over and over again, even if their opinion of the content is "meh".

What would probably happen is that family films, especially animated ones, would have skewed results.


Kids of that age (that watch the same "flashy animated movies" over and over again) seldom buy/rent movies themselves.

Plus, they do not normally use online rating systems...


Good point, but I do see adult-entered reviews at places like Amazon where they write "My kid watches this all the time, bla bla".


> This would be no different than having two histograms side by side.

Yes it would, and the article shows why and how. Scatter plots are easy to read (for comp./math. educated people). Two histograms side by side are easy to read (and find correlations) for nobody.


The point is that scatter plots are harder to read than histograms. It's not two histograms vs. one scatter plot, it's replacing the current single histogram system with a single scatter plot.

Also, side-by-side histograms aren't the only way to display two parameters in a single histogram. What about stacked histograms? They scale up to an arbitrary number of parameters and everybody who can read a histogram can read a stacked histogram. Scatter plots seem like thermonuclear overkill for a problem which most movie sites seem to consider to be "solved".


I think it's a problem of brevity. You write something, you refine, you cut out the pet witty comment. You cut your message to the bone. Now you're done.

Read the text of the article describing each movie. The correlations are absolutely meaningless. He doesn't hit on them at all. In fact, with comments like "almost nobody" he's specifically looking at averages. Then coming to a conclusion effectively based on the average of Score 1 and the average of Score 2 to determine what type of movie it is.

So on this Quality/Rewatchability grading system:

Starship Troopers: 3:4 The Fifth Element: 4:4 Blade Runner: 4.5:3.8

Drop that in place of the graphs and the conclusions would sound as seemingly valid.


I'm with Henry ... the scatterplot is fundamentally no different to two histograms side-by-side (though it is marginally easier to read).


A scatter plot definitely shows more information than a pair of histograms. Who knows if that additional information will be useful.

It would be nice if the article actually compared different visualizations of the same data, rather than showing histograms of 2 separate data sets and scatter plots of a 3rd data set.


Though a scatterplot conveys less information (e.g. the correlations between the two axes), I think it takes longer to process. It also takes more screen real estate than a pair of histograms.


>Though a scatterplot conveys less information

A scatterplot conveys objectively more information.


> the scatterplot is fundamentally no different

Actually it is fundamentally different as the histograms show aggregate data and scatterplots show individual data points.


It's true it's harder to find correlations, but this particular correlation is not likely to be meaningful.

As far as reading the distribution goes, two separate histograms are definitely more readable and understandable. Adding the complexity of a scatter plot because it also shows a correlation that people are not actually interested in makes things less understandable.


It seems to me like "rewatchability" isn't really a useful metric when I am looking for a new-to-me movie to watch.

And when I am looking at user rankings for a movie, almost by definition I am only concerned with rankings for movies I haven't yet seen, since I already have an internal self-ranking for a movie I've seen already.

Histograms are extremely useful for knowing the spread of rankings, which the scatter plot also illuminates.


Possibly correct but a bit grumpily put.


My point was you can't create a scatter plot from a single dimension of data, and that histograms are awful because they don't allow you to see patterns. But of course you're right, it is just my opinion.


But histograms do allow you to easily see one of the most important patterns to me, at least for technical books, which is that the book is excellent but too "hard" for some who aren't willing to put in the effort. Those books tend to have a lot of 5 star and 1 star ratings, with few in between. Check out the SICP reviews on Amazon for example. When augmented with a set of comments, it's usually easy to predict which of the two groups I'd be in.


I find histograms very useful on amazon. Particularly the bimodal ones, which typically indicate controversial material.


You should consider throwing up some histogram pairs of the data shown on the scatter plots.

It looks like they would still give a good picture of the data, as the trends are mostly linear.


There's a pretty clear correlation between the histograms and the scatterplots, too, if you look.

If you gave someone those 3 histograms and those 3 scatterplots, I bet they could match them up correctly.

Aside: the scatterplots are an awful user interface because of the cognitive effort to interpret them, but perhaps there's a way to present the same information in a usable way.


Please show me any diagram that does not require cognitive effort to interpret.


It's not a terrible article at all. It's a suboptimal solution but the article is pointing out what's wrong with 5-star rating systems. It's labelled as a "response", not as a solution to all our earthly rating problems.


I agree, the good idea was to add more than one dimension to the rating. I suspect that it would be even more interesting if we had a couple more ratings to separate that one number into more precise groups.


I'm surprised that neither this article nor the discussion here addresses the main issue: just because you use numbers doesn't mean your data is quantitative.

Star scores are an attempt to map a qualitative experience (enjoyment of the film) to some quantitative measure. Which is fine if you just want to get a sense of 'how much' somebody liked something. If I say I give scotch A a 5, scotch B a 3, and scotch C a 4, then you know that I like the scotches in A, C, B order. It's a shorthand way to express my personal ordering of qualitative experience, just like we use the words 'good', 'better', 'best'.

The problem is this data is not really numerical, so even basic mathematical operations don't make any sense. When we add 2 heights, 2 masses, 2 speeds, etc., the result makes sense. But not so with ratings. Even a basic difference doesn't make sense: is the difference between 5 and 4 stars the same as between 4 and 3 stars? There is no 'unit' distance in a scoring system. So doing any sort of averaging is just going to give you nearly meaningless results.


I've had this intuition as well, and tried to put it to use when redesigning the Ubuntu Software Center's ratings system. Ratings are fundamentally ordinal data -- higher is better, but the difference between 4 and 5 is not the same as the difference between 3 and 4.

This implies that the arithmetic mean is a broken concept; however, the _median_ should still survive intact. I thought about ways to implement this in Software Center, but I'm still not quite sure what a good algorithm for ordinal rating data would look like.

Please feel free to post ideas on this stackexchange question: http://stats.stackexchange.com/questions/19115/how-do-i-sort...
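As a rough illustration of why the median survives on ordinal data, here is a minimal sketch that reads the (lower) median straight off a star histogram. The function name and the sample counts are made up for the example, and this is not what Software Center actually does. It only relies on the order of the star levels, never on the distance between them.

```python
def ordinal_median(counts):
    """Lower median star level from a histogram {stars: number_of_votes}.

    Only the ordering of the star levels is used, so the result stays
    meaningful even if the gap between 4 and 5 differs from 3 and 4.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no ratings")
    midpoint = (total + 1) // 2  # position of the lower median, 1-based
    running = 0
    for stars in sorted(counts):
        running += counts[stars]
        if running >= midpoint:
            return stars


# Hypothetical polarised item: lots of 1s and lots of 5s.
polarised = {1: 40, 2: 5, 3: 5, 4: 10, 5: 60}
median_stars = ordinal_median(polarised)
mean_stars = sum(s * n for s, n in polarised.items()) / sum(polarised.values())
```

Here the mean lands between two star levels that hardly anyone actually picked, while the median is always a level that real voters chose. The median alone is coarse, though, so sorting many items by it would still need some kind of tie-breaker.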


Ah, the top-sense comment is the third one from the top. :)

A store that is also a movie theatre could do away with numeric representations by just watching what the users are doing with its content. Things like "did they finish watching the movie?" or "did they get through the whole thing in one sitting?" could be helpful. Not to mention you can actually see if those titles were being watched again or not.

But there's still the problem of how to communicate the findings to the user, or formulate them.


Such data is called ordinal (see https://en.wikipedia.org/wiki/Level_of_measurement#Ordinal_s...). You could use arithmetic means etc. if the distance between the categories were equal. Unfortunately, nobody defined what 1, 2, 3 ... stars mean.


Scatter plots are definitely more informative, once one gives them a couple of minutes to get used to them. However, I think you're shooting at the wrong target and your solution would exacerbate the root problem: bias.

The first time I really noticed the problem was when I published my own Flash game on Kongregate and started paying closer attention to the ratings. That led me to examine my own rating habits and I conjectured that is probably what happens to everyone else.

The bias I'm talking about is caused by the fact that most people can't be bothered to rate something. Most people only rate something when there's a powerful impulse to do so, so most of the votes will be 5 stars or 1 star. The 4-star ratings come from people who liked something enough to be moved to rate it, but not enough to gush about it; note that the group of people who makes that distinction is already substantially smaller than the 5- and 1-star reviewers. The rest comes from a very small minority, most of whom are people who didn't have anything better to do at that moment and decided to spend some time rating, but don't do it on a regular basis.

By the way, I realize that this is just a conjecture, but from what I've seen so far, it seems to be pretty accurate.

I think that introducing an additional axis will only exacerbate this, by raising the bar for rating. If the act of rating starts demanding more effort, you'll get a distribution that is even more skewed than now.

The two improvements I would like to see are:

1. a system that infers ratings from users' actions

2. better mechanisms for gauging the relevance of someone's review/rating based on my preferences/tastes

The first would help reduce the bias and the second would help me extract more useful information from the biased dataset.


Similar to what you said, YouTube found most people only voted 5 stars, with 1 and 4 stars being used much less frequently: http://youtube-global.blogspot.com/2009/09/five-stars-domina...


Obviously, it doesn't explain everything from your example, but Kongregate rewards users for voting for games (1 point per rating in their levelling scheme). This will have some impact on how people vote.


I've noticed the same problem, where people tend to rate a 5 or a 1.

As a benefit, I've noticed I can usually find the best reviews on Amazon by looking for 3 star ratings, and to a lesser extent 4 and 2 stars. People who rate a 3 have looked at pros and cons of the product, and generally compare to similar goods. 3 star reviews usually provide FAR more information than glowing or glowering 4 and 1's.


The old Latin proverb "Quis custodiet ipsos custodes?"

http://en.wikipedia.org/wiki/Quis_custodiet_ipsos_custodes%3...

might in this context be paraphrased to "Who is rating the raters?" The hope in any online rating system is that enough people will come forward to rate something that you care about so that the people who have crazy opinions will be mere outliers among the majority of raters who share your well informed opinions. But how do you ever know that when you see an online rating of something that you haven't personally experienced?

Amazon has had star ratings for a long time. I largely ignore them. I read the reviews. For mathematics books (the thing I shop for the most on Amazon), I look for people writing reviews who have read other good mathematics books and who compare the book I don't know to books I do know. If an undergraduate student whines, "This book is really hard, and does a poor job of explaining the subject" while a mathematics professor says, "This book is more rigorous than most other treatments of the subject," I am likely to conclude that the book is a good book, ESPECIALLY if I can find comments about it being a good treatment of the subject on websites that review several titles at once, as for example websites that advise self-learners on how to study mathematics.

The problem with any commercial website with ratings (Amazon, Yelp, etc., etc.) is that there is HUGE incentive to game the ratings. Authors post bad ratings for books by other authors. The mother and sister and cousins of a restaurant owner post great ratings for their relative's restaurant, and lousy ratings for competing restaurants. I usually have no idea what bias enters into an online rating. So I try to look for the written descriptions of the good or service being sold, and I try to look for signals that the rater isn't just making things up and really knows what the competing offerings are like. When I am shopping for something, I ask my friends (via Facebook, often enough) for their personal recommendations of whatever I am shopping for. Online ratings are hopelessly broken, because of lack of authentication of the basis of knowledge of the raters, so minor details of dimensions of rating or of data display are of little consequence for improving online ratings.


> The problem with any commercial website with ratings (Amazon, Yelp, etc., etc.) is that there is HUGE incentive to game the ratings.

While I agree that this is a problem, I think a bigger problem is a simple matter of scale:

Amazon is huge, and many people buy things, but they don't split reviews and ratings by what kind of person is rating them. If they wanted to make an improvement, why not show me only ratings and reviews by people who are similar to me? They have tons of data about me and other people who use the service, so it should be possible for them to say "people like you rated this on average a 4, but everyone in the world rates it an average of 2.5."

That's much easier than having to read all the reviews and decide if the person is in my demographic or whether I agree with their review.
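A minimal sketch of that "people like you" number, assuming a 1-5 scale and a deliberately crude similarity measure. The function names and the tiny data set are invented for illustration; nothing here is how Amazon or anyone else actually computes it.

```python
def similarity(a, b):
    """Crude taste similarity in [0, 1]: 1 minus the mean absolute gap on shared items."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    gap = sum(abs(a[i] - b[i]) for i in shared) / len(shared)
    return 1.0 - gap / 4.0  # 4 is the largest possible gap on a 1-5 scale


def personalised_rating(me, others, item):
    """Similarity-weighted mean of other users' ratings for `item`, or None."""
    weighted = total = 0.0
    for ratings in others:
        if item in ratings:
            w = similarity(me, ratings)
            weighted += w * ratings[item]
            total += w
    return weighted / total if total else None


me = {"book_a": 5, "book_b": 1}
others = [
    {"book_a": 5, "book_b": 1, "book_c": 4},  # same taste as me
    {"book_a": 1, "book_b": 5, "book_c": 1},  # opposite taste
]
overall = sum(r["book_c"] for r in others) / len(others)  # everyone: 2.5
for_me = personalised_rating(me, others, "book_c")        # people like me: 4.0
```

With real data the hard parts are sparsity (few shared items per pair of users) and scale, which this sketch ignores entirely.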


Like Netflix


In my experience Netflix ratings are total garbage. IMDB and Rotten Tomatoes ratings are both way more accurate.


> "Who is rating the raters?"

Netflix does. By cross referencing your likes and dislikes against those of your fellow Netflix members, the company is able to create a meta rating system, in which the score you see for a movie is your own. You see that score because that's how much Netflix thinks you'll like it, based on how similar people liked it.

This is the only good way of going about this method. The trick is, it's easy to do this with movies, but much more difficult with product ratings and the like. Maybe this is an opportunity for someone to build something on top of Facebook or Amazon.


Pandora does something similar. They only have "like" and "dislike" ratings. If you like a song they look at other users who liked that song and try to find more songs/bands from the people with your taste. And the other way around for dislike, I guess.

It works exceptionally well. You just listen to the stream of incoming songs, you never pick songs yourself. After a good song you click like, after a mediocre song you keep listening, during a bad song you click skip. After a few days you will only get good songs! (with a few exceptions of course) It's like magic, I can't even count how many new bands I found through Pandora without any effort.

Too bad it doesn't work outside the US anymore. :(


Can the same rating calculation be used on Hacker News?


Is Netflix really the only site which does this nearly-braindead machine learning approach?

Once you realize that people have different tastes and you know someone's preferences that is the obvious solution. Or is the process of crawling through that much statistical data that expensive that it can only be offered to paying subscribers?


The more accurate you want to get, the more computationally expensive. Netflix actually did a contest with a million dollar prize to the team that could come up with the most accurate rating prediction algorithm. In the end, the million dollar algorithm was too expensive to implement, so they never ended up using it.


rateyourmusic.com sort of has this, but it's not in the default view. You have to go to an album and click a "View my suggested rating" button, then it whirs for a few seconds before giving you an average of ratings from users with similar taste. It would be much more useful to browse the whole site with those ratings showing, but I get the feeling it's a computationally expensive feature.


Re: Your watchmen quote

I think Amazon's "Was this rating helpful (Yes/No)?" provides a good filter for ratings. A lot of mindlessly negative reviews get filtered out by the users who come along afterwards and rate the rating in their own self-interest.


That can be easily gamed as well. If you want to boost a rating for a book, mark all of the negative comments as not helpful... In fact, I see that happen on Amazon and Newegg a lot.


They have a solution for that -- on Amazon (at least) you can filter for the most helpful unfavorable reviews.


Regarding the unhelpfulness of online reviews, my company has problems with manufacturers/sellers writing 5-star reviews of their own product listings (ASINs) on Amazon. We've begun (manually) data mining 5-star reviews to identify whether each 5-star reviewer has any other reviews (or a wish list, to indicate the possibility of a real user account), then calculating the % of reviews written by no-history user accounts. Of the ASINs we've assessed, the gut-level-doesn't-seem-like-heavy-review-fraud listings can be in the 6% range, whereas the looks-like-review-fraud ASINs are above 20%. We're working with Amazon to identify and penalize these manufacturers/sellers, but internally at Amazon the Seller Performance team is separate from their Community (user review) team, so it presents a challenge. It's also hard for them to separate valid complaints from sour-grapes complaints.
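The calculation itself is simple enough to sketch. The record fields (`other_reviews`, `has_wish_list`) and the sample listing below are hypothetical stand-ins for whatever gets collected by hand, and the 20% cut-off is just the rough figure mentioned above, not a validated threshold.

```python
def no_history_share(five_star_reviewers):
    """Fraction of 5-star reviewers with no other reviews and no wish list."""
    if not five_star_reviewers:
        return 0.0
    flagged = [
        r for r in five_star_reviewers
        if r["other_reviews"] == 0 and not r["has_wish_list"]
    ]
    return len(flagged) / len(five_star_reviewers)


# Hypothetical listing: 10 five-star reviewers, 3 of them with no history at all.
listing = (
    [{"other_reviews": 0, "has_wish_list": False}] * 3
    + [{"other_reviews": 4, "has_wish_list": False}] * 5
    + [{"other_reviews": 0, "has_wish_list": True}] * 2
)
share = no_history_share(listing)
looks_suspicious = share > 0.20  # rough cut-off, an assumption for this sketch
```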


"but internally at Amazon the Seller Performance team is separate from their Community (user review)"

Indeed, my suspicion is that organizational politics have more to do with the lack of a better rating system than any technical limitation.

The approach I use is to read 3 star ratings first before biasing myself with the more extreme ratings. I also check to see what else the reviewer has rated and if there's nothing there then I immediately dismiss the review.


Relative ratings are more useful. Everyone uses their own scale, but their ratings are relative to the constant movie. I want to see how people rated a movie relative to other movies I've watched.

N people rated this better than X movie, but less than Y movie.

Ranking movies can be easy. Show 5 movie posters instead of 5 stars, or have an auto-complete field for "this movie is up there with:".


Anyone interested in relative ratings should look at Dan Ariely's[1] Predictably Irrational. He is an economist who writes about behavioral economics and decision making.

I've often thought about some start-up ideas around relative ratings, and this book was the reason.

[1] - http://danariely.com/


In the context of a website like Netflix, where your recommendations and ratings for movies are based off of your history of ratings, aren't you the one rating the raters?

However, the type of rating mentioned in the OP and the type of rating on Netflix only seem to work in specific niches. I can't imagine how a website like Amazon would implement anything even close to what Goodfilms is doing.


Many reviewers are biased and judgemental. I prefer to look at the distribution of ratings that Amazon shows. Good items have a distribution with one peak, even if it isn't at 5 stars. When ratings create a saddle the product is probably a fluke.


The cute XKCD comic aside, the distribution is also useful because, for certain types of things, a lot of fairly to very negative comments are illuminating even if the average rating is still pretty high. It's not just about polarizing material. If you look, for example, at genre fiction you'll get a lot of fans who give 5s no matter what sort of crap the current book is. But if there are also a notable number of 1s and 2s, that's often a good red flag.


IMO the perfect solution would be to rate only by saying "I like" or "I dislike". And then writing a review (to show how much you liked/disliked the product/movie etc...)

When you first get on the website, you're asked to rate a number of products. The more you rate, the more accurate the solution becomes.

When it can couple your taste with users having the same taste, you now see these users' ratings in priority for new products. You can even ask them why they disliked/liked something if they didn't write a review. Because their opinion matters to you now.


I'd just like to point out that bias feeds into written reviews as well. I may think good service is someone not refilling my water every 5 seconds, whereas you might think it's that your food didn't get out in under 10 minutes.

While words do express the point better, this rating system is a step in the right direction.


I definitely go by the 4.5 stars == very good, <4 stars == crap heuristic, but to argue this is no good is ridiculous. It's actually very, very helpful.

E.g. when I go to Amazon I don't buy some random product with a 4.5 star review -- I search for a specific product or a specific kind of product and then reject candidates which are lousy. How is that not INCREDIBLY useful? Similarly, who goes to a movie simply based on whether it's good or not?

In general, if you create any point rating system people who like a thing will tend to rate it towards the top of the scale, e.g. 4/5 or 9/10.

I actually did an informal experiment -- I used to run role-playing tournaments, and do exit surveys on participants. For the first few years we asked players to rate us on a 5-point scale and scored slightly over 4/5 on average. Then we switched to a 10-point scale and scored slightly over 9/10. Not scientific -- but I don't think we suddenly got better.

This finding is backed up by serious research (which is why when a psychologist creates a scale, the numerical ranges need to stay constant in follow-up studies or the results are not statistically comparable).

Netflix, which tries to give users customized ratings, actually subtracts value (in my opinion) from its scores because it tries to make ratings mean "how much will you enjoy this?" BZZZT. I pick stuff for me, my wife, my au pair, and my kids. We don't all like the same stuff, and we don't want to track ratings individually. My kids want good kid stuff. I want good me stuff. Don't try to guess what I like based on our collective tastes.


The problem that XKCD gets at is simply translating/scaling the results; the article is solving a different problem.

Early on in the Netflix Challenge, I was able to get myself (very briefly) a leaderboard score with nothing more than analyzing every user's ratings, re-centering them by their mean, and re-scaling them according to their standard deviation. Then, by remembering their translations and scales, I could put a globally-predicted score back into their own language.

So just some very basic statistics is sufficient to erase much of the bias toward higher numbers, as well as halo effects and the like.

(I was pretty surprised that Netflix's own algorithm apparently wasn't doing anything this simple)
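For anyone curious, the re-centering and re-scaling step looks roughly like this. It is a minimal sketch with invented sample users, not the code from my actual entry, and the fallback of 1.0 for users who always give the same score is an assumption made here to avoid dividing by zero.

```python
from statistics import mean, pstdev


def fit_user_scale(ratings):
    """Remember a user's personal centre (mean) and spread (standard deviation)."""
    mu = mean(ratings)
    sigma = pstdev(ratings) or 1.0  # assumed fallback for users with zero spread
    return mu, sigma


def to_standard(rating, scale):
    """Translate a rating from the user's own language into standard units."""
    mu, sigma = scale
    return (rating - mu) / sigma


def to_user_scale(z, scale):
    """Translate a globally predicted standard score back into the user's language."""
    mu, sigma = scale
    return mu + z * sigma


generous = fit_user_scale([4, 5, 5, 4, 5, 4])  # rates almost everything 4 or 5
harsh = fit_user_scale([1, 2, 3, 2, 1, 3])     # rarely goes above 3

z = to_standard(5, generous)            # a 5 from the generous user, in standard units
harsh_equiv = to_user_scale(z, harsh)   # the same enthusiasm on the harsh user's scale
```

In this example a 5 from the generous user is one standard deviation above their own mean, which on the harsh user's scale comes out a little under 3 stars.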


> I was pretty surprised that Netflix's own algorithm apparently wasn't doing anything this simple

Netflix does have really interesting blind spots. They claim to take ratings seriously, to the point of offering a million dollars for the best rating algorithm. Then, as the GP says, they implement the rating algorithm in a way that renders it completely worthless to any household with more than one viewer.

Netflix does offer us a good demonstration of the failings of absolute technocracy, but it leaves the question of how best to rate movies wide open.


I never actually look at the average star rating. I read a couple 5-star reviews, read more 4-star and 2-star reviews, and decide based on that.

Considering how many Amazon 1-star reviews I find that can be summed up as, "UPS sucked," averages are kinda useless.


When I'm looking for people's opinion I want to know at least a little about the people and have more things in common with them, so that we have similar tastes. IMDB is almost useless for me, people like complete crap imo (is there any movie without at least one 10 star rating? That should be the single best movie ever). If you wouldn't fit in the community, the community's opinion on things is largely irrelevant and whether you look at the opinion through a histogram or a scatter plot is irrelevant.

If you actually have friends, why not ask them for a recommendation in person? If your friend is really into arty movies and recommends you an arty movie as being very arty (and well done), you can consider it. Collapsing it into a single number doesn't make sense </rant>

EDIT: that's not to say that the scatter plot isn't an interesting idea, it's just not going to help much because people's background is important for rating


You've absolutely hit the nail on the head. Knowing the rating of a friend is worth so much more than a bunch of strangers. That's what Goodfilms is all about - it puts your friends' opinions ahead.


Surprised to see no mention of the website Criticker. Been using it for a few years now (ever since I cancelled Netflix and missed the recommendation engine).

Criticker's rating system is out of 100 points but for each user it scales ratings into tiers (deciles) 1-10. So for someone like me who watches lots of movies that I sort of know I'm gonna like (thanks to Criticker!), most of my ratings end up in the 70 to 100 range, but I still have 5 tiers in that range. The wide range allows the system to adapt to a user's biased view of the scale. Also plenty of users simply keep their rankings from 0-10.
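
One way to picture the per-user tiers is a rank-based mapping over a user's own scores. This is only a sketch under that assumption; the function name `tier_map` is hypothetical and this is not Criticker's actual algorithm.

```python
from bisect import bisect_left


def tier_map(scores, n_tiers=10):
    """Map each distinct raw score to a tier 1..n_tiers by rank among the user's own scores."""
    if not scores:
        return {}
    ordered = sorted(scores)
    n = len(ordered)
    tiers = {}
    for score in set(ordered):
        # Number of this user's ratings that are strictly lower than this score.
        below = bisect_left(ordered, score)
        tiers[score] = below * n_tiers // n + 1
    return tiers
```

A user whose scores all sit between 70 and 100 still gets the full spread of tiers, because only the ordering within that user's own ratings matters.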

Criticker gives recommendations in two ways. First it predicts my ranking for a movie. So I can just browse unwatched movies and filter them however I like and then sort by how Criticker expects I will rate them. It is actually scary how predictable I am.

The other method of recommendations is to browse users who have very high correlation to my rankings and see what movies they've ranked highly which I have not seen. This might be the best way to find movies. It also seems to be the key to how the expected ratings I mentioned above are computed.

No doubt one of the things that keeps Criticker running so well is a community of serious film buffs. It makes it easy to find movies I would have never heard of otherwise (foreign, limited release, shorts).


sigh

A butterfly flaps its wings, xkcd puts up a comic on ratings, someone piggybacks on the comic, it makes the front page of HN, you wander by and mention criticker, a bunch of geeks pile onto the site to check it out... and it ends up crashy for a while.

Cool site, thanks for mentioning it. From what I saw before it went down (too many mysql connections?), it even looks like I can export my ratings.


Glen makes a good point about how "people are good at seeing patterns" but he still gives short shrift to the histograms. I see a big difference in the histograms. The "crescent" shape of an item's histogram, like the one for Starship Troopers, is often telling on Amazon or the iTunes App Store. That either means something about the product sucks (perhaps only in a small minority of purchases, but the risk is significant) or somebody is trying to lower the rating of the item.

The more a histogram resembles an exponential increase, the better it is. The higher the exponent, the better.

Sucky:

    XXXXXX
    XXXX
    XX
    XX
    XXX
Mediocre, still sketchy:

    XXXXXX
    XXXX
    XX
    XX
    XX
Excellent:

    XXXXXXXX
    XXXX
    XX
    X
    X


Actually, I find histograms pretty useful, primarily because if there is a secondary bump toward 1, it indicates there are a significant number of people who had bad experiences with the product -- more investigation required.

Having a two-dimensional graph might have more information, if the dimensions really matter. I'm doubtful that "stars" and "rewatchable" are really independent, and I'm unsure why I would care about it when I haven't seen the film. (If I have seen the film, I'll have my own opinion and not need the graph.)

I'm all for looking for improvements to the ratings game, though. What seems to work best for me is to actually read the reviews, but that's obviously time-intensive.


I used Goodfilms briefly, but the rating system is so bad I stopped. The odd thing is that, as this blog post demonstrates, they recognize the problem and then totally fail to solve it. Ratings tend to be fairly bimodal, with people either liking or disliking stuff and not making fine-grained choices. In response to this, the Goodfilms system makes ratings continuous, so that rather than trying to figure out whether a film is 4 or 5 stars, the user now has to figure out whether it's 3.8 stars or 4.6, then compounds the problem by making the user rate on two separate scales with a pretty opaque distinction. So the response to the observation that people's ratings tend to be simplistic is to make the rating system much more complicated; it's pretty much the exact opposite of a solution (I quite like letterboxd's system, which has five-star ratings and also a "like" button, which gives you some level of choice over how fine-grained you want to make your ratings).


If you're interested in a more rigorous analysis of the problem, I highly recommend reading the paper "How to Count Thumb-Ups and Thumb-Downs: User-Rating based Ranking of Items from an Axiomatic Perspective":

http://www.dcs.bbk.ac.uk/~dell/publications/dellzhang_ictir2...

It's very accessible for an academic paper.


I'll say histograms are useful for certain things.

If you are looking at electronic devices or camera lenses there's the issue that a certain fraction of people get lemons. Some bad reviews are because of that.

Other people have unrealistic expectations of the product and give a bad review.

A histogram gives some immediate insight into this problem, and then looking at stratified samples of the reviews helps from there on out.

Now, I will say the star ratings on Ebay are weak because a less-than-perfect ranking gets people in trouble, even though "acceptable" performance on Ebay covers a considerable range. (It's certainly a worse experience to have a long confused exchange with somebody with poor English -- this person shouldn't be punished, but they shouldn't be rewarded either.)


Some bad reviews on Amazon are from shipping snafus. If you're going to get any useful info, you have to read the reviews.


Or worse yet, reviews bitching about the price.

Edited to add: "If you're going to get any useful info, you have to read the reviews," is so incredibly true. I'm surprised it's not getting more mentions in these comments. The scatter plot is kind of cool, but I'd so much rather have a histogram and actual reviews to check so I can find out why the product got those ratings.


If you could mouse over each scatter plot point and get a corresponding view that explains its position, that would be cool... for nerds...


Overview of past discussions:

"How not to sort by average rating" (2009): https://news.ycombinator.com/item?id=3792627 For thumbs-up/thumbs-down systems, suggests using the lower bound of a Wilson confidence interval for a Bernoulli distribution, which is what Reddit does now. Convincingly refuted by How to Count Thumb-Ups and Thumb-Downs: User-Rating based Ranking of Items from an Axiomatic Perspective, http://www.dcs.bbk.ac.uk/~dell/publications/dellzhang_ictir2... by Dell Zhang et al., which argues for simple smoothing with a Dirichlet prior (i.e. (upvotes + x) ÷ (upvotes + x + downvotes + y)), which was also suggested by several people in the comments.

In 2010, William Morgan wrote http://masanjin.net/blog/how-to-rank-products-based-on-user-... partly in response, applying Bayesian statistics to the problem of ranking things rated using 5-star rating systems.

Perhaps related: HotOrNot started out displaying the mean of the rankings as the rating of each photo (after you clicked on it). But they found that there was a gradual drift down in ratings: they started with around 1-5 (out of a theoretical max of 10), then ended up around 1-3, etc., with the predictable damaging effects on egos, people's willingness to post their photos, and the information content of the ratings. The solution they adopted was to display not the mean of ratings but the percentile: a photo rated higher than 76% of other photos would have its "average" displayed as "7.6", even if the mean was 4.5. This trained the users to flatten the histogram!

http://www.nashcoding.com/2011/10/28/hackernews-needs-honeyp... suggested that fake "products" to attract ratings could distinguish intelligent ratings from unintelligent ones. Although written about thumbs-up/down systems, it applies to multi-star systems as well.
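
The smoothing formula quoted above, (upvotes + x) ÷ (upvotes + x + downvotes + y), can be sketched in a few lines. The default pseudo-counts of 1.0 are an assumption chosen for illustration, not values taken from the paper.

```python
def smoothed_score(upvotes, downvotes, x=1.0, y=1.0):
    """Smoothed fraction of positive votes.

    x and y act as pseudo-counts of prior upvotes and downvotes; the
    defaults here are illustrative assumptions. Requires x + y > 0.
    """
    return (upvotes + x) / (upvotes + x + downvotes + y)
```

With these defaults an item with a single upvote scores 2/3, while an item with 90 upvotes and 10 downvotes scores 91/102, so the lone perfect vote no longer outranks the well-established item.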


The author nails it when he/she points out that star ratings don't reveal how good a movie is for you ("the movies appeal to different kinds of people"), but then goes on to propose a 2-dimensional metric that still doesn't capture the personalization aspect ("rewatchability" doesn't say much about how good a movie is for me).

IMO movie ratings should iterate on Amazon's powerful statement "People that bought this item also bought...". That is, one should look at people with similar tastes and see how those people have rated the movie.

Easier said than done as it needs a ton of data in order for it to work, but that's the only way you're going to get close to more personalized ratings.


Yes. The scatter plot is just a different version of how others felt about the movie. It does not communicate how that movie is related to other movies (that I have rated), nor does it inform me of how similar I am to the people who rated the movie. Personalization is missing.


I like the two axes idea, although "Rewatchability" would probably be better as "How much I liked it."

There are very high quality, well made movies that I don't like, and there are some really crappy ones that I do. And that's a good distinction to see in a review system.

Because sometimes you just want to watch a good shitty movie, but it's really difficult to tell the good shitty movies from the bad shitty movies when The Brady Bunch movie (brilliant) has the same rating as any Adam Sandler movie (awful).


Similarly, I've heard a proposal here on HN about having separate voting buttons for Agreement and Contributing. I might not agree with a comment, but admit it raises a good point. I might agree with another comment (or think it's funny) without it contributing to the current conversation.


From a statistician's standpoint, ratings systems suck. But, from a consumer standpoint, they are super easy to understand. A scatter plot system makes sense to me, but I would never put it in front of a user.

In my opinion, current ratings systems are 80% UX and 20% data.

For example, Newegg uses a pretty intuitive system of allowing you to sort a product page by Best Reviews and Most Reviews. In my opinion, this allows the user to make a more educated decision if they seek the information out.


From a consumer standpoint, a single five star review pushing a product to the top of the list is not easy to understand, it's a pain.


The simple answer there relates back to UX, just don't show the stars when there isn't enough data. Set a minimum number of reviews as a baseline so that you don't get the result you mentioned.

If there is a written review component, make a note of the review but don't quantify the value of said review until the minimum threshold is reached.


When there are only a handful of reviews, I find myself using "gymnastics rules" and throwing out the best and worst score.

Probably not very scientific though.
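
For what it's worth, those "gymnastics rules" amount to a trimmed mean. A minimal sketch, assuming that with fewer than three scores there is nothing sensible to drop and the plain mean is used instead:

```python
def gymnastics_mean(scores):
    """Average after discarding the single best and single worst score."""
    if not scores:
        raise ValueError("need at least one score")
    if len(scores) < 3:
        # Assumption: too few scores to trim, so fall back to the plain mean.
        return sum(scores) / len(scores)
    kept = sorted(scores)[1:-1]
    return sum(kept) / len(kept)
```

For the scores 1, 4, 4, 5, 5 this drops the 1 and one of the 5s and averages the remaining 4, 4, 5.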


This is a big reason we went with scatter plots - when there's only a handful of points you don't get misled.


Why not try the lower bound of the Wilson score confidence interval for a Bernoulli parameter?

http://www.evanmiller.org/how-not-to-sort-by-average-rating....
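
A sketch of that lower bound for positive/negative votes, using the standard Wilson score interval formula. The default z = 1.96 (roughly 95% confidence) and the choice to return 0.0 for items with no votes are assumptions made for illustration.

```python
from math import sqrt


def wilson_lower_bound(positive, total, z=1.96):
    """Lower bound of the Wilson score interval for the fraction of positive votes."""
    if total == 0:
        return 0.0  # assumption: unrated items sort to the bottom
    phat = positive / total
    denom = 1 + z * z / total
    centre = phat + z * z / (2 * total)
    margin = z * sqrt((phat * (1 - phat) + z * z / (4 * total)) / total)
    return (centre - margin) / denom
```

A single positive vote gives a bound of about 0.21, while 90 positive votes out of 100 give about 0.83, so a thinly rated item no longer jumps to the top of a sorted list.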


We actually use this whenever we do any ranking within the site. On the film page though, we think presenting the raw data as a scatter plot is better than a single number.


Facebook has it right with "Like" -- either you like it or not. This eliminates these review patterns:

  5 stars - OMG I LOVE EVERY PRODUCT
  4 stars - Love this product, but I am withholding one star because of _____
  3 stars - Everything to me is just meh.
  2 stars - I hate everything but this product earned 1 star for ___ and another for ____.
  1 star - UPS drop-kicked my item and it arrived late, so this product is trash!
If you distill all reviews so that the reviewer has to decide whether they like it or not, then you have a less diluted overall ranking.


Why don't rating systems just give a simple yes/no question to the reader? In the case of rating a movie, just ask "Would you watch this again?" or in the case of purchasing a product from Amazon, "Would you purchase this item again?"

I'd rather a boolean system than one where someone's 4-star rating is different than my 4-star rating. Whenever I see a multi-star rating system, I remember back to a prof I once had that said "The top grade is B+. A's are reserved for God." Albeit disgusting, it taught me that everyone has a different rating scale.


Personally, I've consistently found that the best predictor of whether or not I was going to enjoy a movie was the NUMBER of ratings, not the rating itself. This also works for restaurants and other things on sites like Yelp. It almost seems that a movie should come with a simple "recommend!" button that simply counts recommendations.

But ratings are a tricky issue and I think they require a more sophisticated mathematical treatment and modeling if one wants to get it right, not just a few histograms that treat all people equal.

There are a few modeling challenges that come to mind. For example, people disagree on the quality of movies based on their taste. This could be modeled as a latent variable that must be inferred for every person in some graphical model. Another example of a relevant variable would be a person's rating habits: some people rate movies 5 or 1, some people have a Gaussian rating centered at some value. These should be explicitly modeled and normalized. Every rating could ideally be used to make a stochastic gradient update to the weights of the network, and since we are dealing with very sparse data, strong priors and a Bayesian treatment seem appropriate. Ratings could then be personalized through an inference process on the graph.

Has anyone heard of a more sophisticated model like this, or any efforts in this direction? I'd like to see more math, modeling and machine learning and less silly counting methods.


That metric starts to fail when you've got, say, mediocre products or brands with tons of exposure.

The classic is in "readers choice" reviews of restaurants or eateries. Fast-food franchises dominate. Why? Because the philosophy of such sites is often "majority rules", and the establishment (or brand) with the most votes wins. But there are far more McDonalds or Taco Bells than Jacks Cook Shacks or Trader Vics. Even when the quality of JCS or TV exceeds TB or MD, it's not going to be reflected in the ratings.

Adjustments such as taking a Likert (3-7 point scale) and adjusting reviews based on the number of reviewers, to give both the actual qualitative assessment, and the probable maximal review can help. This is how sites such as Reddit have adjusted their comments/submissions ratings.

The broader and more philosophical problem is that "quality" is not a one-dimensional attribute, interpretation of quality differs among individuals, and "fitness for purpose or task" should be considered when assessing quality as well. McDonalds may very well be appropriate when your goal is a quick, inexpensive meal on the run (a conclusion I'd differ with), while Trader Vics is where you'd head to impress the boss, date, in-laws, or client.

It's a tough problem. It's also one that sees a great many very poor proposed solutions.


This article posted here four months ago is much better:

http://evanmiller.org/how-not-to-sort-by-average-rating.html

Discussion:

http://news.ycombinator.com/item?id=3792627


Histograms are useful in the case of items with a strong love/hate split.

The canonical example is the SICP ratings on Amazon: 3.5 average; 177 ratings, 96 five stars, 53 one stars.

http://www.amazon.com/Structure-Interpretation-Computer-Prog...


Had to go look up that book. It's actually available free from MIT: http://mitpress.mit.edu/sicp/full-text/book/book.html


The videos of the lectures by the book authors (Gerald Sussman and Hal Abelson) are also available for free.

Pick the MPEG1 versions. They are much heavier than the MPEG4 versions, but the text on the projected computer screen is readable. IIRC, the MPEG4 are re-encoded versions of the MPEG1, which themselves were ripped from VHS.

http://archive.org/details/mit_ocw_sicp


Average rating as a measure of "goodness" is fraught with statistical problems. Without the context of other statistical modes, looking at mean is pretty useless. However, people don't want to look at summary statistics for each item (mean, median, mode, std/var, skew, etc). So we try to come up with scalar metrics that capture "goodness" or "coolness" or whatever. Popularity (however you define it) is a common one to use. Here's a good comparison of popularity models: http://blog.linkibol.com/2010/05/07/how-to-build-a-popularit.... In the past I've been pretty happy with "Bayesian average" - it's simple to implement and gives good results.

But if you really want to dig into it, you have to consider all kinds of stuff like bimodal distribution of ratings (controversial items), rater quality/consistency, age of ratings, etc, etc.

It's really not as simple as you'd think!
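
A minimal sketch of one common form of "Bayesian average": the item's ratings are blended with a prior mean that counts as a fixed number of pseudo-ratings. The parameter names `prior_mean` and `prior_weight` are hypothetical; in practice they might be the site-wide mean and a hand-tuned constant.

```python
def bayesian_average(ratings, prior_mean, prior_weight):
    """Item mean shrunk toward prior_mean.

    prior_weight is the number of pseudo-ratings the prior is worth
    (must be > 0); both parameters are assumptions to be tuned per site.
    """
    return (prior_weight * prior_mean + sum(ratings)) / (prior_weight + len(ratings))
```

With a prior mean of 3.5 worth 10 pseudo-ratings, a single 5-star rating yields about 3.64, while 1000 four-star ratings yield about 4.0, so items with few ratings stay close to the prior until evidence accumulates.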


"People like you rated it..." is much better for users than two axes.


I knew I had seen a similar XKCD before. http://xkcd.com/937/


That is because xkcd is the same 3 jokes repeated over and over.


It's easy to criticise, much harder to do.


So, watch this if you’re in the mood for something really good.

Something I never see addressed: what you want to watch eventually rarely correlates with what you want to watch now. On the whole, you'll spend lots of time putting "good" movies on your queue, but when time comes to pick something and hit Play you'll pass over those and pick some recent release which engages your excitement now and will be long forgotten soon after.

The star rating system confuses this even more by relying entirely on people who will bother to rate something at all - a very different crowd than the "what's good?" and "thrill me now" mindsets.

Insofar as ratings exist, I focus on written 1-star reviews (movies, apps, products, whatever), looking for a subclass of "there really was a particular problem" comments.


While this may not be applicable to goodfilms, a 2-dimensional system is harder to brand. As in, "Two thumbs up," or "90% on Rotten Tomatoes." There's value in a single understandable number.

I wonder if there's a measurable value for the system. I care about discovery, so I want a site that can recommend me movies that I wouldn't normally think about. How many more movies would this system recommend to me?

A lot of times, I don't care too much about accuracy as long as the system isn't too far off. This is simply because the cost of an inaccurate recommendation isn't too high when I can stream it on Netflix.

I like the idea of providing more data for people to make more accurate assessments, but I don't necessarily believe optimizing for accuracy optimizes the value provided to the user.


He jumps in and applies star rating problems just to movies, gives an alternative that works only for movies and in the end promotes his own movie site.


Why would someone who is looking for a film care about rewatchability? Presumably they'll be watching it for the first time and then can decide for themselves whether to ever watch it again.

Or did I miss the point and rewatchability is just a placeholder for something more useful?


The article says this about Starship Troopers:

there’s a lot of disagreement over whether it’s high quality of not, but generally this scores high-rewatchability. So, maybe not the most intelligent movie, but good fun.

What I intuitively deduced from this example is that rewatchability is a metric of enjoyability.

On a single 5-star rating scale, some people will give 5 stars because they really, really enjoyed the movie; others will give 5 stars because they thought the film was perfect from a cinematographic-quality point of view (i.e. scenario, cinematography, acting, casting, etc.; insert here some Academy Award technical category).


The problem with ratings is that they are not from a large enough random sample. Rating scores tell me what people that like to rate products think. I don't rate products, and in general I suspect that people that are happy with a product don't really care to take the time to go out of their way to tell people about it online (that may change or be changing, I don't know). But one thing is for sure: people that dislike a product WILL go out of their way to tell everyone, thus further shifting the data set toward reviews by people that are unhappy.

Take a random sample of the true population of the data set (everyone that has seen a movie) and not just the people that logon to rate it.


The article mentions an important point about 5 star systems - people tend to only use 5 stars or 1 star. This is sort of shown with the histograms.

A 2 axis system seems like a good idea. But I'd like to see it with 3 options per axis - [UP] [INDIFFERENT] [DOWN].

I'm also interested to know how the system will cope with "controversial" films (Life of Brian, for example) where some people are going to downvote whether they've seen it or not. And they'll campaign and ask all their friends to downvote too.


I'd say people only use "5" (really liked, would recommend) and "meh" (spent some time, would not recommend)

In their example, Blade Runner has slightly more "meh" votes and Starship Troopers is mostly "meh"


If one thing has 1000 ratings but only 3.5 stars while another has 1 rating and 5 stars, what does that say about the average? Not much. It just says there are more people who rated the first thing and have an opinion about it. Getting people to rate something is difficult in and of itself, and only the people who are on the love/hate spectrum will rate something.

I like the histograms as it reveals a little into the rating. After all, if a single person gave it 1 star and everyone else rated it 3 or more, the average is likely skewed because of that one person who is clearly "gaming" the system because they weren't happy.

I think ratings should take into account intent. If multiple people are rating it 1 star, then clearly it should be weighted downward. However, if a single person out of 100 people gave it 1 star, I don't think the average should be weighted evenly. It's a difficult problem to solve and XKCD is just making a joke.


"Trustworthiness" of the reviewer is always difficult, especially with movie reviews, because there's never any accounting for taste.

I find it hard to rely on aggregated ratings for that reason.

When it came to picking movies to watch, I used to love watching Siskel and Ebert, because I knew their tastes.

If only Siskel (whose tastes were more like mine than Ebert's) gave a thumbs up, I knew there was a pretty good chance I'd at least think the movie was "ok". On the other hand, I'd be less likely to give a movie a chance if only Ebert gave a thumbs up.

These days, what I have to do is go to Rotten Tomatoes and take a sampling of four or five reviewers that I trust/like (which actually includes Roger Ebert and a few of the people he used to have as guest reviewers on Ebert & Roeper) and base my decision on that.


Human curation is still the state of the art; computed curation is a miserable failure that utterly fails to capture my tastes. It's trivially easy to guess what iTunes or Netflix will recommend to me, which indicates to me that the decision taken is tautological.

The goal of a recommendation system should be to expose me to things I wouldn't be likely to find by myself.


Hmmm. Whatever the scientific or theoretical improvement such an approach may offer, having to educate users on how your ratings system works is going to add a huge amount of friction to user engagement.

And frankly, who has ever mentally rated a film in terms of "re-watchableness"? People just think in terms of "good" or "bad", and current ratings systems a la Amazon leverage that. It's simple, fast and, given the histogram presentation, tells me everything I need to know about the number and distribution of votes in a flash. Plus whether I want to rewatch a film or re-read a book is largely down to my mood at the time. But my opinion on whether it's "good" or "bad" is pretty static.

Maybe Amazon's system is not statistically bullet-proof, but who cares? We're talking movies here: a cheap, casual and discretionary purchase.


Aren't Bayesian statistics designed to deal with low ratings counts?

I think there was an article about that a while ago on HN.


In the pre-Internet days, the way I discovered content (movies, books, music), was to take a trip to a store (Blockbuster, local bookstore, B&N, record store). There were two broad categories of content: the mainstream stuff with the primo shelf space, and the mysterious aisles of Everything Else. Judgments were based on things like cover art, in store promotional material, how many copies were still sitting on the shelf, sampling the content in-store, and recommendations from friends or store employees.

I'm not sure if the success rate was any better or worse than the online star rating system these days, but it seemed more fun. However, the barrier to trying something else was also a lot higher if you made a poor choice, which might have had a side effect of narrowing one's tastes.


Maybe we need something like Pandora's Music Genome Project, but for movies and TV shows.

One of the most interesting features in Pandora is the "Why was this track selected?" action. Imagine something similar where a list of movies and TV shows are presented to you, with sentences for each as to why.

Netflix's recommendations were close, but they still seemed to always focus on one facet at a time, be it a user-predicted rating or a single subcategory of related shows.

Edit: Goodfilms seems to be better in that it tracks two facets at the same time, which does end up creating a diagonal scale from super funny movies you can watch again and again to super serious ones you'll watch once, but that's still not quite like filtering down on tons of facets at the same time.

The closest thing I can think of is the metadata from TV Tropes.


Ratings leave something to be desired for sure. You can tell if something is worth checking out if it's highly rated but other than that they aren't tremendously useful.

That being said I'm not a fan of the goodfil.ms graph at all. Why do I care about how rewatchable someone else thinks a movie is? Do I want to watch it again? I would know because I'd have seen it.

I think rewatchable is the wrong term. Fun is what you're looking for. Major Payne is a fun movie. It is incredibly rewatchable but I don't need a website to tell me that. I could use a website to point me to it if I haven't seen it before.

I think the site is on the fringe of something important though. There are multiple ways to rate a movie, and mood plays a huge role in what you want to watch at any given moment.


You can't rate the raters because that's all subjective. This is where a nice machine learning algorithm comes in: one that looks at your scores and finds people who have similar rating trends. It would then give weight to other people's ratings based on your ratings.

Thus is born a system in which everyone is happy. The site will get more ratings because people want the algorithm to be better and it needs more data for that.

The movies will be happy because this points to legal ways to find movies.

We'll all be happy because it's a complex solution to a complex problem and yet it can still be solved in an elegant and visually stimulating way. (i.e., you could color code the reviewers who are more likely to be similar to you)
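
A rough sketch of the "weight other people's ratings by how similar they are to you" idea, using Pearson correlation over co-rated items as the similarity measure. This is one simple choice among many, not a description of any particular site's system; the function names and the decision to ignore non-positive similarities are assumptions.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation over items both users rated; 0.0 when undefined."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    xs = [a[i] for i in common]
    ys = [b[i] for i in common]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def predict(me, others, item):
    """Similarity-weighted average of other users' ratings for item, or None."""
    num = den = 0.0
    for other in others:
        if item not in other:
            continue
        weight = pearson(me, other)
        if weight <= 0:
            continue  # assumption: ignore dissimilar or uncorrelated raters
        num += weight * other[item]
        den += weight
    return num / den if den else None
```

Each user is a dict of item -> rating. A rater whose history matches yours dominates the prediction, while one with opposite tastes is skipped entirely.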


Hey, check out the site we're working on: www.criticrania.com. We're doing exactly what you mention. You'll need to sign up to see these features, but it's something we're working hard towards perfecting.


I'm not sure why I would care about the "rewatchable" metric. I only look at reviews for a movie I haven't seen. If I've seen it I can decide myself if I'd enjoy watching again. So subtracting that you are left with the normal star system rating.


I really don't think you've tackled the problem at the correct angle here. Anyone can assign two or more axes to rate things on and call it a better rating system than a one dimensional one because it provides more information. Viewers want more information before deciding whether or not a thing (in this case, a movie) is worth their time. While your two dimensional rating system may work for some, this specific system is only justified for people who want information on "quality" and "rewatchability." For example, I don't rewatch movies, so the "quality" axis offers me way more value and insight compared to the "rewatchability" side.


If I'm deciding whether to watch a film for the first time, I really don't understand how whether or not other people would want to watch it a second time helps me make my decision.

"We rate movies on two criteria - ‘quality’ and ‘rewatchability’, so you can admit to your guilty pleasures and properly capture the feeling you get when a film leaves you exhausted."

You are using rewatchability to infer some potentially helpful labels ("guilty fun", "exhausting but worthy"). But there's no guarantee that those inferences are safe/generalisable across viewers/films/genres, or that people who see a rewatchable axis will know to interpret it like you do...


This article forgot to take into account one HUGE factor in people who use the rating systems.

ATTENTION SPAN.

Yes. People don't have the attention span to independently analyze 5 different scatter graphs of 5 similar products. Sometimes the scatter plots can actually be more confusing and less informative than something simple like a histogram.

I firmly believe that people's attention spans are more captured by things like star ratings and histograms. If they get past the stage of their interest being captured, THEN they read the reviews to find out more in-depth information and opinions. I think it's a system that works well, as Amazon has shown.


For movies a histogram may not be helpful (other than highlighting polarizing films), but I find it invaluable for electronics. While I also read reviews, I find that with electronics 1 star reviews (and sometimes 2 star) are usually given for DOA or other serious failure. I can do a quick calculation of defect rate by calculating the percentage of 1 star reviews. If it's 20% (or above norms for that product type) I avoid it. And yes, I realize that the reported defect rate is likely to be higher than the actual defect rate, but it's far better than nothing.
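
The quick calculation described above is just the share of 1-star reviews in the histogram. A small sketch, assuming the histogram is a dict of star value -> review count; the counts in the example are made up for illustration, and the 20% cutoff is the commenter's rule of thumb.

```python
def one_star_share(histogram):
    """Fraction of reviews that are 1 star, given {stars: count}."""
    total = sum(histogram.values())
    if total == 0:
        return 0.0
    return histogram.get(1, 0) / total


# Hypothetical review counts for one product.
example = {5: 60, 4: 15, 3: 5, 2: 0, 1: 20}
avoid = one_star_share(example) >= 0.20
```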


Would you rather go to a restaurant that has 17 reviews and 4 stars, or the next-door neighbor with 3.5 stars from 750 reviews, all other things being equal?

Yeah.

Ratings are about post-choice satisfaction, not about pre-choice decision making.


Randy Farmer goes over this concept in detail in Building Web Reputation Systems (http://shop.oreilly.com/product/9780596159801.do).

Google Tech Talk: http://www.youtube.com/watch?v=Yn7e0J9m6rE

See also: "YouTube: Five Stars Dominate Ratings" (http://youtube-global.blogspot.com/2009/09/five-stars-domina...)


High-rewatchability is a poor metric for a good film.

Examples of movies I would highly rate but would not want to rewatch: Schindler's List (1993), Hotel Rwanda (2004), Blindness (2008), Amistad (1997), etc...


Years ago I created a website for local bands. I also wanted a rating system, and tried an emotion rating system (example: http://achterband.nl/sickboys_and_lowmen/white_buffalo). I think for music and films this is more suitable than giving it a single number. And maybe this applies to other ratings as well.

(Translation of the emotions used: happy, relaxing, surprising, aggressive, sad, explosive)


I wish people would use a 5 or 7 point scale: http://blog.jgc.org/2007/12/seven-point-scale.html


When it comes to anything subjective, ratings (like those histograms) tend to be bimodal http://zedshaw.com/essays/rubrics_and_the_bimodality_of_rati...


Interesting article but it's still arbitrary. I don't think it solves the problem still. What it boils down to is trusting the raters. This is the only way to get by spam, ignorance, and trolling.

I think the solution is knowing who is rating what you're looking at. In the old sense - "Quality over quantity." I don't need to know what the whole world gave it, just a few people who I've come to trust. We're trying to do this here: www.criticrania.com.

I think this is the only real way around this problem.


Yeah, it makes the bold generic claim to answer "why [all] rating systems don't work", makes the bizarre claim that histograms that show the distribution are "worse" than a scalar mean, and then presents as a solution "We've solved the generic problem by doing two things: adding a 'rewatchable' rating, and plotted it against general quality rating as a scatter plot." Rewatchable is irrelevant to things that aren't movies. The scatter plot is only relevant if the correlation between the two is important, but the given graphs don't support that the correlation has meaningful information beyond that people who like a movie are more likely to want to rewatch it, which is a correlation that is likely consistent across all movies in their dataset and conveys little information about any particular movie.


I applaud their efforts to improve on movie ratings, but this is not an improvement. I agree with another commenter that they've really just created a 2D histogram, which is more difficult to assess than the histograms they complained about.

What's especially difficult with the scatter plot is that it requires you to assess density, rather than a simple scalar value. The other histograms have 5 numbers to indicate "weight" for each star, and a bar next to the star visually indicates the proportion of ratings received for that star. For their scatter plot, if there are 1000 ratings, how will it look different from a film with 100 ratings? The relative proportions of the ratings will only get muddied with the scatter plot approach.

The other thing about the scatter plot is that it still essentially maps to a 5 star rating, but only makes it more difficult to assess which star. That is, we are expected to visually assess: [1 star] - greatest density in lower left quadrant, [2 star] - greatest density in middle of graph, [3 star] - greatest density in lower right quadrant, [4 star] - greatest density in upper left quadrant, [5 star] - greatest density in upper right quadrant. There are only 5 useful density assessments, which brings us back to the same categories as the 5 star system. Only in the scatter plot, it's much, much, much more difficult to assess which quadrant (star) the ratings map to.

And really, what is the meaningful difference between the 2, 3 and 4 stars (in my example)? Those density groupings seem almost equivalent (or so some might argue). So in reality, the scatter plot will only be meaningful if there is very little deviation between quality and re-watchability (which isn't true), since that would group the ratings and make density easier to assess. If they diverge frequently, then the plots are just going to be ignored by users since they'll have to assess density in every plot to try and make sense of it. That's hard. That's work. Users don't like to do work.

Finally, re-watchability? The question on a user's mind is, "would I want to watch this movie?" not, "if I saw this movie, would I want to watch it again?" I rarely watch movies again. Even the ones I love and own. That seems to be true of most people. The reasons for wanting to re-watch a movie are unrelated to whether it would be worth watching the first time. I'd argue that re-watching a film is more of a personality type than anything related to a movie.


In my experience the best films glue you to the seat. I'm fairly sure that's a universal phenomenon ;)

They are watched without pausing, skipping, rewinding, from beginning to end.

If it's not true in your case then you're not a reliable rating source.

A rating based on this can be fully automated.

No need to depend entirely on this one rating, as it could be used for weighting a user's subjective rating. E.g. if the film was watched over 3 evenings, then your 5 star rating is worth 1 star.


Super interesting post. We created SquidCube (squidcube.com) to help companies build out a rating system that instead of being shown publicly, went straight to the company.

We're curious to see how people respond to rating/giving feedback when it doesn't show up in their social media stream. Will it be more honest because it is direct feedback?

This post has sparked some new ideas for me on developing a new rating system beyond nero, stars, etc.


Polls predict poorly because everyone says what they think they should say, instead of what they actually believe.

Ratings systems are worse - they are self-selective, so most adequately-satisfied folks will not take part.

A better system than both would be to somehow extract ratings from folks' behavior. Did they read the whole article? How long did they dwell on each part? Did they return to it more than once? Stuff like that.


The problem is people do not take the time to rate products (movies in this case) unless they are notably good or notably bad. I believe Rotten Tomatoes has the most realistically accurate system for rating movies. Either you think someone should see the movie, or shouldn't. If the majority of people are on the fence, then it will get a rating around 50%.
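
To make the "see it or don't" idea concrete, here is a minimal sketch of a binary recommend score. It is a simplification for illustration, not Rotten Tomatoes' actual method, and `percent_recommending` is a made-up name.

```python
def percent_recommending(votes):
    """Return the percentage of True ("go see it") votes."""
    if not votes:
        raise ValueError("need at least one vote")
    # True counts as 1 and False as 0 when summed.
    return 100.0 * sum(votes) / len(votes)


# Hypothetical audience that is evenly split on a film.
votes = [True] * 50 + [False] * 50
score = percent_recommending(votes)

print(f"{score:.0f}% would recommend")
```

An evenly split audience lands at 50%, which matches the "on the fence" case described above.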


I thought they would come up with a better system than stars. Like a "like/dislike" or "would watch it again". But no. They still use the same system.

They just made a nice blog post to buzz around.

But I looked past that and played a bit with their app, until I found out that everything you did was posted on your timeline by default.

I erased the application from my Facebook.


I think the two-axis argument is stronger than the anti-histogram argument. The problem isn't that the one-axis rating systems are aggregating their ratings into histogram bars rather than displaying them in some other way, but that the ratings curve just doesn't carry a lot of information.


The idea of explaining a self-explanatory comic already gave me clues about what I would read here. Wasn't disappointed. If one of something is bad, it becomes good if you throw two of them at your users. Sounds like a solution.

And the comic is also explained now. Thanks.


Out of curiosity, how do I sign up for goodfil.ms without a Facebook or Twitter account?


Relevant discussion on how not to calculate an average rating, i.e. without taking sample size into consideration:

http://news.ycombinator.com/item?id=3792627
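
For anyone who wants to see what "taking sample size into consideration" can look like, here is a minimal sketch using a Bayesian average, which pulls each item's mean toward a prior. This is one common approach and not necessarily the one from the linked discussion. The `prior_mean` and `prior_weight` values are assumptions chosen for illustration, and the review counts are borrowed from the restaurant question elsewhere in this thread.

```python
def bayesian_average(mean_rating, n_reviews, prior_mean=3.0, prior_weight=25):
    """Shrink an observed mean toward prior_mean.

    prior_weight acts like a number of imaginary reviews at prior_mean,
    so items with few real reviews are pulled toward the prior the most.
    Both defaults are illustrative assumptions, not tuned values.
    """
    total = prior_weight * prior_mean + n_reviews * mean_rating
    return total / (prior_weight + n_reviews)


# 4 stars from 17 reviews vs. 3.5 stars from 750 reviews.
small = bayesian_average(4.0, 17)
large = bayesian_average(3.5, 750)

print(f"17 reviews:  {small:.2f}")
print(f"750 reviews: {large:.2f}")
```

With these assumed priors the 750-review place comes out slightly ahead (roughly 3.48 against 3.40), even though its raw average is lower. A different prior would shift the result, so the choice of prior matters.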


Clustering across what you've previously "liked" à la Netflix is the solution to this problem, I think. The problem is fundamentally that single number ratings don't capture that people are non-homogeneous.


Often I'm presented with a choice of films at the cinema (some of which I've seen already) and only want to decide which one to see next. Perhaps an ordering by score would be more appropriate?


It's a nice idea, and the readership of HN would probably prefer the recommended way in this article, but I'm not sure that mass market consumers will find it that valuable.


I found some good info here and shared it in an answer here: http://ux.stackexchange.com/a/23008/7627


In summary then, using the mean value to summarise data from an unknown distribution can be problematic. Not exactly ground breaking.
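
A quick illustration of the point, with two made-up rating sets: one polarizing, one lukewarm. They have the same mean but look nothing alike once you check the spread.

```python
from statistics import mean, pstdev

# Hypothetical data: half the audience loves it, half hates it...
polarizing = [1] * 50 + [5] * 50
# ...versus everyone finding it merely okay.
consensus = [3] * 100

for name, ratings in [("polarizing", polarizing), ("consensus", consensus)]:
    # pstdev is the population standard deviation of the ratings.
    print(f"{name}: mean={mean(ratings)}, stdev={pstdev(ratings):.1f}")
```

Both sets average 3 stars, so the mean alone cannot tell them apart. The standard deviation (or simply the histogram) can.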


I still don't understand how "rewatchability" is different from "quality".

If the movie is good, I would like to watch it again. What is the difference?


The movie Requiem for a Dream is, to me, a perfect example of a high quality movie that I will not watch twice.

The acting, directing, cinematography, pacing, and sound design are all excellent. However, the film is such a grueling emotional experience, I don't foresee myself sitting down with it again.

Contrast that to, say, Airplane. It's a good movie. Funny. If I'm bored in a hotel room, I'll watch it again. However, it is emphatically not a "great film."


Fair point, I asked that myself. See ErrantX's answer[1] which IMO totally makes sense:

> Why would I care about rewatchability? Speaking as a film buff; this is actually quite a good guide to the sort of movie it is (when combined with quality). If lots of people mark it good quality, but wouldn't watch it again, that implies that you have to be in the right mood for it.

If people mark rewatchability high, even if the quality rating varies, you know it is much more easy going film.

And so on. Combining data points is good :)

[1]: http://news.ycombinator.com/item?id=4417078


This is not helpful at all. They are comparing the one-dimensional 1 to x star rating to something two dimensional.


Rewatchability and quality? Are you kidding me?

How do I glean from this whether or not I will like the movie?

A high quality movie can be terrible and a low quality movie can be great. What does quality mean? Does that mean they had good special effects and angles? Does that mean the color is true and the acting was great? How does this equate to me liking the movie?

Rewatchability? Many of my favorite movies I would not watch again (Lord of the Rings) because they were so long. There are also so many movies that I want to see in the future that I will choose to watch one of them instead of re-watching one that I have already watched. Re-watching is something that people do less and less as they get older (kids and teenagers maybe do it), but adults (your target audience) not so much.

Also those scatter charts mean the same to my brain as the histograms that you are blasting so I would stick with what people are already used to (the histograms). You are not doing anyone any favors by changing the presentation of the same data. People are not stupid. They will see both equally in most cases but they will prefer the familiarity of the histograms.

Trying to be different is not always the best thing to do. As many have already mentioned, use machine learning to augment ratings (like Netflix).

Good luck



