

Why Ratings Systems Don't Work - geelen
http://goodfil.ms/blog/posts/2012/08/22/why-ratings-systems-dont-work/

======
tokenadult
The old Latin proverb "Quis custodiet ipsos custodes?"

[http://en.wikipedia.org/wiki/Quis_custodiet_ipsos_custodes%3...](http://en.wikipedia.org/wiki/Quis_custodiet_ipsos_custodes%3F)

might in this context be paraphrased to "Who is rating the raters?" The hope
in any online rating system is that enough people will come forward to rate
something that you care about so that the people who have crazy opinions will
be mere outliers among the majority of raters who share your well informed
opinions. But how do you ever know that when you see an online rating of
something that you haven't personally experienced?

Amazon has had star ratings for a long time. I largely ignore them. I read the
reviews. For mathematics books (the thing I shop for the most on Amazon), I
look for people writing reviews who have read other good mathematics books and
who compare the book I don't know to books I do know. If an undergraduate
student whines, "This book is really hard, and does a poor job of explaining
the subject" while a mathematics professor says, "This book is more rigorous
than most other treatments of the subject," I am likely to conclude that the
book is a good book, ESPECIALLY if I can find comments about it being a good
treatment of the subject on websites that review several titles at once, as
for example websites that advise self-learners on how to study mathematics.

The problem with any commercial website with ratings (Amazon, Yelp, etc.,
etc.) is that there is HUGE incentive to game the ratings. Authors post bad
ratings for books by other authors. The mother and sister and cousins of a
restaurant owner post great ratings for their relative's restaurant, and lousy
ratings for competing restaurants. I usually have no idea what bias enters
into an online rating. So I try to look for the written descriptions of the
good or service being sold, and I try to look for signals that the rater isn't
just making things up and really knows what the competing offerings are like.
When I am shopping for something, I ask my friends (via Facebook, often
enough) for their personal recommendations of whatever I am shopping for.
Online ratings are hopelessly broken because there is no way to verify the
raters' basis of knowledge, so minor details of rating dimensions or of data
display are of little consequence for improving online ratings.

~~~
dclowd9901
> "Who is rating the raters?"

Netflix does. By cross referencing your likes and dislikes against those of
your fellow Netflix members, the company is able to create a meta rating
system, in which the score you see for a movie is _your own_. You see that
score because that's how much Netflix thinks you'll like it, based on how
similar people liked it.

 _This_ is the only good way of going about this method. The trick is, it's
easy to do this with movies, but much more difficult with product ratings and
the like. Maybe this is an opportunity for someone to build something on top
of Facebook or Amazon.
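
Roughly sketched (and emphatically not Netflix's actual algorithm), the idea
looks something like this: predict a user's score for a film as a
similarity-weighted average of other users' scores, where similarity is the
correlation between two users' ratings on films they've both rated. The names
and numbers below are invented for illustration.

    # Toy user-based collaborative filtering: predict how much a user would
    # like a film from the ratings of users with similar tastes.
    from math import sqrt

    def pearson(a, b):
        # Correlation between two users over the films both have rated.
        common = set(a) & set(b)
        if len(common) < 2:
            return 0.0
        xs = [a[f] for f in common]
        ys = [b[f] for f in common]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
        return num / den if den else 0.0

    def predict(ratings, user, film):
        # Similarity-weighted average of other users' ratings for the film.
        num = den = 0.0
        for other, theirs in ratings.items():
            if other == user or film not in theirs:
                continue
            w = pearson(ratings[user], theirs)
            if w > 0:
                num += w * theirs[film]
                den += w
        return num / den if den else None

    ratings = {
        "alice": {"Blade Runner": 5, "Starship Troopers": 3},
        "bob":   {"Blade Runner": 5, "Starship Troopers": 2, "Airplane!": 4},
        "carol": {"Blade Runner": 1, "Starship Troopers": 5, "Airplane!": 2},
    }
    print(predict(ratings, "alice", "Airplane!"))  # 4.0, pulled toward bob's taste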

~~~
biostat
Is Netflix really the only site which does this nearly-braindead machine
learning approach?

Once you realize that people have different tastes and you know someone's
preferences, that is the obvious solution. Or is the process of crawling
through that much statistical data so expensive that it can only be offered
to paying subscribers?

~~~
grokcode
The more accurate you want to get, the more computationally expensive. Netflix
actually did a contest with a million dollar prize to the team that could come
up with the most accurate rating prediction algorithm. In the end, the million
dollar algorithm was too expensive to implement, so they never ended up using
it.

------
s_henry_paulson
Terrible article. Calling histograms awful, based on nothing more than an
opinion.

Then trying to conclude that some convoluted scatter plot system makes more
sense is laughable.

Not to mention, this system is still just a star rating system. This would be
no different than having two histograms side by side.. assuming, of course,
that you'd even want to rate different aspects of the same thing.

I can't even imagine scatter plots on Amazon, or trying to convince the
general public that "it makes more sense".

~~~
vhf
_This would be no different than having two histograms side by side.._

Yes it would, and the article shows why and how. Scatter plots are easy to
read (at least for comp./math.-educated people). Two histograms side by side
are easy to read (and to spot correlations in) for nobody.

~~~
abrookewood
I'm with Henry ... the scatterplot is fundamentally no different to two
histograms side-by-side (though it is marginally easier to read).

~~~
dbecker
Though a scatterplot conveys less information (e.g. the correlations between
the two axes), I think it takes longer to process. It also takes more screen
real estate than a pair of histograms.

~~~
mistercow
>Though a scatterplot conveys less information

A scatterplot conveys objectively more information.

------
CodeMage
Scatter plots are definitely more informative, once one takes a couple of
minutes to get used to them. However, I think you're shooting at the wrong
target, and your solution would exacerbate the root problem: bias.

The first time I really noticed the problem was when I published my own Flash
game on Kongregate and started paying closer attention to the ratings. That
led me to examine my own rating habits, and to conjecture that the same
probably happens to everyone else.

The bias I'm talking about is caused by the fact that most people can't be
bothered to rate something. Most people only rate something when there's a
powerful impulse to do so, so most of the votes will be 5 stars or 1 star. The
4-star ratings come from people who liked something enough to be moved to rate
it, but not enough to gush about it; note that the group of people who makes
that distinction is already substantially smaller than the 5- and 1-star
reviewers. The rest come from a very small minority, most of whom are people
who didn't have anything better to do at that moment and decided to spend some
time rating, but don't do it on a regular basis.

By the way, I realize that this is just a conjecture, but from what I've seen
so far, it seems to be pretty accurate.

I think that introducing an additional axis will only exacerbate this, by
raising the bar for rating. If the act of rating starts demanding more effort,
you'll get a distribution that is even more skewed than now.

The two improvements I would like to see are:

1\. a system that infers ratings from users' actions

2\. better mechanisms for gauging the relevance of someone's review/rating
based on my preferences/tastes

The first would help reduce the bias and the second would help me extract more
useful information from the biased dataset.

~~~
tolos
Similar to what you said: YouTube found that most people only voted 5 stars,
with 1 and 4 stars used much less frequently: [http://youtube-
global.blogspot.com/2009/09/five-stars-domina...](http://youtube-
global.blogspot.com/2009/09/five-stars-dominate-ratings.html)

------
Homunculiheaded
I'm surprised that neither this article nor the discussion here addresses the
main issue: just because you use numbers doesn't mean your data is
quantitative.

Star scores are an attempt to map a qualitative experience (enjoyment of the
film) onto some quantitative measure. Which is fine if you just want to get a
sense of 'how much' somebody liked something. If I say I give scotch A a 5,
scotch B a 3, and scotch C a 4, then you know that I like the scotches in A,
C, B order. It's a shorthand way to express my personal ordering of a
qualitative experience, just like we use the words 'good', 'better', 'best'.

The problem is that this data is not really numerical, so even basic
mathematical operations don't make any sense. When we add 2 heights, 2 masses,
2 speeds, etc., the result makes sense. But not so with ratings. Even a basic
difference doesn't make sense: is the difference between 5 and 4 stars the
same as between 4 and 3 stars? There is no 'unit' distance in the scoring
system. So doing any sort of averaging is just going to give you nearly
meaningless results.

~~~
YokoZar
I've had this intuition as well, and tried to put it to use when redesigning
the Ubuntu Software Center's ratings system. Ratings are fundamentally ordinal
data -- higher is better, but the difference between 4 and 5 is not the same
as the difference between 3 and 4.

This implies that the arithmetic mean is a broken concept, however the
_median_ should still survive intact. I thought about ways to implement this
in Software Center, however I'm still not quite sure what a good algorithm for
ordinal rating data would look like.

Please feel free to post ideas on this stackexchange question:
[http://stats.stackexchange.com/questions/19115/how-do-i-
sort...](http://stats.stackexchange.com/questions/19115/how-do-i-sort-an-
ordinal-list-of-user-generated-ratings-data)
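
For what it's worth, here is a minimal sketch of one possible approach (not
Software Center's actual code): sort by the median, and break ties with a
second ordinal-friendly statistic such as the fraction of ratings falling
below the median. The app names and numbers are invented.

    # Rank items by the median of their ordinal (1-5) ratings instead of the mean.
    # Ties on the median are broken by how few ratings fall below the median.
    from statistics import median

    def ordinal_key(stars):
        m = median(stars)
        below = sum(1 for s in stars if s < m) / len(stars)
        return (m, -below)  # higher median first, then fewer low ratings

    apps = {
        "app_a": [5, 5, 4, 1, 1],  # polarizing: mean 3.2, median 4
        "app_b": [4, 4, 4, 4, 3],  # consistent: mean 3.8, median 4
    }
    for name in sorted(apps, key=lambda n: ordinal_key(apps[n]), reverse=True):
        print(name, ordinal_key(apps[name]))  # app_b ranks above app_a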

------
podperson
I definitely go by the 4.5 stars == very good, <4 stars == crap heuristic, but
to argue this is no good is ridiculous. It's actually very, very helpful.

E.g. when I go to Amazon I don't buy some random product with a 4.5 star
review -- I search for a specific product or a specific kind of product and
then reject candidates which are lousy. How is that not INCREDIBLY useful?
Similarly, who goes to a movie simply based on whether it's good or not.

In general, if you create any point rating system people who like a thing will
tend to rate it towards the top of the scale, e.g. 4/5 or 9/10.

I actually did an informal experiment -- I used to run role-playing
tournaments, and do exit surveys on participants. For the first few years we
asked players to rate us on a 5-point scale and scored slightly over 4/5 on
average. Then we switched to a 10-point scale and scored slightly over 9/10.
Not scientific -- but I don't think we suddenly got better.

This finding is backed up by serious research (which is why when a
psychologist creates a scale, the numerical ranges need to stay constant in
follow-up studies or the results are not statistically comparable).

Netflix, which tries to give users customized ratings, actually subtracts
value (in my opinion) from its scores because it tries to make ratings mean
"how much will you enjoy this?" BZZZT. I pick stuff for me, my wife, my au
pair, and my kids. We don't all like the same stuff, and we don't want to
track ratings individually. My kids want good kid stuff. I want good me stuff.
Don't try to guess what I like based on our collective tastes.

~~~
CWuestefeld
The problem that XKCD gets at is simply translating/scaling the results; the
article is solving a different problem.

Early on in the Netflix Challenge, I was able to get myself (very briefly) a
leaderboard score with nothing more than analyzing every user's ratings:
re-centering them by their mean, and re-scaling them according to their
standard deviation. Then, by remembering those translations and scales, I
could put a globally predicted score back into each user's own language.

So just some very basic statistics is sufficient to erase much of the bias
toward higher numbers, as well as halo effects and the like.

(I was pretty surprised that Netflix's own algorithm apparently wasn't doing
anything this simple)
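
A rough reconstruction of that normalization step (my own sketch, not the
poster's actual code): express each user's ratings as z-scores over their own
history, and map a globally predicted z-score back through that user's mean
and standard deviation.

    # Per-user re-centering and re-scaling (z-scores), plus the inverse mapping
    # that puts a globally predicted score back into the user's own "language".
    from statistics import mean, pstdev

    class UserScale:
        def __init__(self, ratings):
            self.mu = mean(ratings)
            self.sigma = pstdev(ratings) or 1.0  # guard against zero variance

        def to_global(self, rating):
            return (rating - self.mu) / self.sigma

        def to_user(self, z):
            return self.mu + z * self.sigma

    # A generous rater: almost everything gets 4 or 5 stars.
    scale = UserScale([5, 5, 4, 5, 4])
    print(scale.to_global(4))   # about -1.2: a "4" is below this user's own average
    print(scale.to_user(0.0))   # 4.6: a globally average film, in their terms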

~~~
CamperBob2
_I was pretty surprised that Netflix's own algorithm apparently wasn't doing
anything this simple_

Netflix does have really interesting blind spots. They claim to take ratings
seriously, to the point of offering a million dollars for the best rating
algorithm. Then, as the GP says, they implement the rating algorithm in a way
that renders it completely worthless to any household with more than one
viewer.

Netflix does offer us a good demonstration of the failings of absolute
technocracy, but it leaves the question of how best to rate movies wide open.

------
yread
When I'm looking for people's opinion I want to know at least a little about
the people and have more things in common with them, so that we have similar
tastes. IMDB is almost useless for me, people like complete crap imo (is there
any movie without at least one 10 star rating? That should be the single best
movie ever). If you wouldn't fit in the community, the community's opinion on
things is largely irrelevant and whether you look at the opinion through a
histogram or a scatter plot is irrelevant.

If you actually have friends, why don't you ask them for recommendation in
person. If your friend is really into arty movies and recommends you an arty
movie as being very arty (and well done) you can consider it. Collapsing it
into a single number doesn't make sense </rant>

EDIT: that's not to say that the scatter plot isn't an interesting idea, it's
just not going to help much because people's background is important for
rating

~~~
geelen
You've absolutely hit the nail on the head. Knowing the rating of a friend is
worth so much more than a bunch of strangers. That's what Goodfilms is all
about - it puts your friends' opinions ahead.

------
saucerful
Surprised to see no mention of the website Criticker. Been using it for a few
years now (ever since I cancelled Netflix and missed the recommendation
engine).

Criticker's rating system is out of 100 points but for each user it scales
ratings into tiers (deciles) 1-10. So for someone like me who watches lots of
movies that I sort of know I'm gonna like (thanks to Criticker!), most of my
ratings end up in the 70 to 100 range, but I still have 5 tiers in that range.
The wide range allows the system to adapt to a user's biased view of the
scale. Also plenty of users simply keep their rankings from 0-10.
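
A guess at how such per-user tiering might work (hypothetical, not Criticker's
actual implementation): rank each score within the user's own history and
bucket it into one of ten tiers.

    # Map a user's raw 0-100 scores into per-user tiers (1-10), so "tier 10"
    # means "in this user's personal top decile", whatever their raw scale is.
    def tier(score, all_scores):
        at_or_below = sum(1 for s in all_scores if s <= score)
        return max(1, min(10, round(10 * at_or_below / len(all_scores))))

    my_scores = [72, 78, 81, 85, 88, 90, 92, 95, 97, 100]  # a generous rater
    print(tier(85, my_scores))   # 4: only mid-tier for this particular user
    print(tier(100, my_scores))  # 10: this user's personal top tier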

Criticker gives recommendations in two ways. First it predicts my ranking for
a movie. So I can just browse unwatched movies and filter them however I like
and then sort by how Criticker expects I will rate them. It is actually scary
how predictable I am.

The other method of recommendations is to browse users who have very high
correlation to my rankings and see what movies they've ranked highly which I
have not seen. This might be the best way to find movies. It also seems to be
the key to how the expected ratings I mentioned above are computed.

No doubt one of the things that keeps Criticker running so well is a community
of serious film buffs. It makes it easy to find movies I would have never
heard of otherwise (foreign, limited release, shorts).

~~~
bstpierre
_sigh_

A butterfly flaps its wings, xkcd puts up a comic on ratings, someone
piggybacks on the comic, it makes the front page of HN, you wander by and
mention criticker, a bunch of geeks pile onto the site to check it out... and
it ends up crashy for a while.

Cool site, thanks for mentioning it. From what I saw before it went down (too
many mysql connections?), it even looks like I can export my ratings.

------
stcredzero
Glen makes a good point about how "people are good at seeing patterns" but he
still gives short shrift to the histograms. I see a big difference in the
histograms. The "crescent" shape of an item's histogram, like the one for
Starship Troopers, is often telling on Amazon or the iTunes App Store. That
either means something about the product sucks (perhaps only in a small
minority of purchases, but the risk is significant) or somebody is trying to
lower the rating of the item.

The more a histogram resembles an exponential increase, the better it is. The
higher the exponent, the better.

Sucky:

    
    
        XXXXXX
        XXXX
        XX
        XX
        XXX
    

Mediocre, still sketchy:

    
    
        XXXXXX
        XXXX
        XX
        XX
        XX
    

Excellent:

    
    
        XXXXXXXX
        XXXX
        XX
        X
        X

------
kragen
Overview of past discussions:

"How not to sort by average rating" (2009):
<https://news.ycombinator.com/item?id=3792627> For thumbs-up/thumbs-down
systems, suggests using the lower bound of a Wilson confidence interval for a
Bernoulli distribution, which is what Reddit does now. Convincingly refuted by
How to Count Thumb-Ups and Thumb-Downs: User-Rating based Ranking of Items
from an Axiomatic Perspective,
[http://www.dcs.bbk.ac.uk/~dell/publications/dellzhang_ictir2...](http://www.dcs.bbk.ac.uk/~dell/publications/dellzhang_ictir2011.pdf)
by Dell Zhang et al., which argues for simple smoothing with a Dirichlet prior
(i.e. (upvotes + x) ÷ (upvotes + x + downvotes + y)), which was also suggested
by several people in the comments.
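
For reference, here are both formulas in a small sketch (the z value and the
pseudocounts are illustrative choices, not anyone's recommended parameters):

    # Two ways to rank thumbs-up/thumbs-down items, per the links above.
    from math import sqrt

    def wilson_lower_bound(up, down, z=1.96):
        # Lower bound of the Wilson score interval (95% by default).
        n = up + down
        if n == 0:
            return 0.0
        p = up / n
        return (p + z*z/(2*n) - z * sqrt((p*(1 - p) + z*z/(4*n)) / n)) / (1 + z*z/n)

    def dirichlet_smoothed(up, down, x=3.0, y=3.0):
        # Pseudocount smoothing: (upvotes + x) / (upvotes + x + downvotes + y).
        return (up + x) / (up + x + down + y)

    # A 1:0 item no longer outranks a 90:10 item under either scheme.
    print(wilson_lower_bound(1, 0), wilson_lower_bound(90, 10))   # ~0.21 vs ~0.83
    print(dirichlet_smoothed(1, 0), dirichlet_smoothed(90, 10))   # ~0.57 vs ~0.88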

In 2010, William Morgan wrote [http://masanjin.net/blog/how-to-rank-products-
based-on-user-...](http://masanjin.net/blog/how-to-rank-products-based-on-
user-input) partly in response, applying Bayesian statistics to the problem of
ranking things rated using 5-star rating systems.

Perhaps related: HotOrNot started out displaying the mean of the rankings as
the rating of each photo (after you clicked on it). But they found that there
was a gradual drift down in ratings: they started with around 1-5 (out of a
theoretical max of 10), then ended up around 1-3, etc., with the predictable
damaging effects on egos, people's willingness to post their photos, and the
information content of the ratings. The solution they adopted was to display
not the mean of ratings but the _percentile_ : a photo rated higher than 76%
of other photos would have its "average" displayed as "7.6", even if the mean
was 4.5. This trained the users to flatten the histogram!
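
A tiny sketch of that display trick (my reconstruction of the idea, not
HotOrNot's code): show each item's percentile rank among all items' mean
ratings instead of the mean itself.

    # Display an item's "score" as its percentile among all items, not its mean.
    def percentile_display(item_mean, all_means):
        below = sum(1 for m in all_means if m < item_mean)
        return round(10 * below / len(all_means), 1)  # 7.6 means "beats 76%"

    means = [2.1, 3.0, 3.3, 3.7, 4.1, 4.2, 4.4, 4.5, 4.8, 4.9]
    print(percentile_display(4.5, means))  # 7.0: rated higher than 70% of photos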

[http://www.nashcoding.com/2011/10/28/hackernews-needs-
honeyp...](http://www.nashcoding.com/2011/10/28/hackernews-needs-honeypots/)
suggested that fake "products" to attract ratings could distinguish
intelligent ratings from unintelligent ones. Although written about thumbs-
up/down systems, it applies to multi-star systems as well.

------
adamc
Actually, I find histograms pretty useful, primarily because if there is a
secondary bump toward 1, it indicates there are a significant number of people
who had bad experiences with the product -- more investigation required.

Having a two-dimensional graph might have more information, if the dimensions
really matter. I'm doubtful that "stars" and "rewatchable" are really
independent, and I'm unsure why I would care about it when I haven't seen the
film. (If I have seen the film, I'll have my own opinion and not need the
graph.)

I'm all for looking for improvements to the ratings game, though. What seems
to work best for me is to actually read the reviews, but that's obviously
time-intensive.

------
voyou
I used Goodfilms briefly, but the rating system is so bad I stopped. The odd
thing is that, as this blog post demonstrates, they recognize the problem and
then totally fail to solve it. Ratings tend to be fairly bimodal, with people
either liking or disliking stuff and not making fine-grained choices. In
response to this, the Goodfilms system makes ratings continuous, so that
rather than trying to figure out whether a film is 4 or 5 stars, the user now
has to figure out whether it's 3.8 stars or 4.6, then compounds the problem by
making the user rate on two separate scales with a pretty opaque distinction.
So the response to the observation that people's ratings tend to be simplistic
is to make the rating system much more complicated; it's pretty much the exact
opposite of a solution (I quite like letterboxd's system, which has five-star
ratings and also a "like" button, which gives you some level of choice over
how fine-grained you want to make your ratings).

------
Adrock
If you're interested in a more rigorous analysis of the problem, I highly
recommend reading the paper "How to Count Thumb-Ups and Thumb-Downs: User-
Rating based Ranking of Items from an Axiomatic Perspective":

[http://www.dcs.bbk.ac.uk/~dell/publications/dellzhang_ictir2...](http://www.dcs.bbk.ac.uk/~dell/publications/dellzhang_ictir2011.pdf)

It's very accessible for an academic paper.

------
PaulHoule
I'll say histograms are useful for certain things.

If you are looking at electronic devices or camera lenses there's the issue
that a certain fraction of people get lemons. Some bad reviews are because of
that.

Other people have unrealistic expectations of the product and give a bad
review.

A histogram gives some immediate insight into this problem, and then looking
at stratified samples of the reviews helps there on out.

Now, I will say the star ratings on eBay are weak because a less-than-perfect
ranking gets people in trouble, even though "acceptable" performance on eBay
covers a considerable range. (It's certainly a worse experience to have a
long, confused exchange with somebody with poor English -- this person
shouldn't be punished, but they shouldn't be rewarded either.)

~~~
andrewflnr
Some bad reviews on Amazon are from shipping snafus. If you're going to get
any useful info, you have to read the reviews.

~~~
colomon
Or worse yet, reviews bitching about the price.

Edited to add: "If you're going to get any useful info, you have to read the
reviews," is so incredibly true. I'm surprised it's not getting more mentions
in these comments. The scatter plot is kind of cool, but I'd so much rather
have a histogram and actual reviews to check so I can find out why the product
got those ratings.

~~~
andrewflnr
If you could mouse over each scatter plot point and get a corresponding view
that explains its position, that would be cool... for nerds...

------
arturadib
The author nails it when he/she points out that star ratings don't reveal how
good a movie is _for you_ ("the movies appeal to different kinds of people"),
but then goes on to propose a 2-dimensional metric that still doesn't capture
the personalization aspect ("rewatchability" doesn't say much about how good a
movie is _for me_ ).

IMO movie ratings should iterate on Amazon's powerful statement "People that
bought this item also bought...". That is, one should look at people with
similar tastes and see how _those_ people have rated the movie.

Easier said than done as it needs a ton of data in order for it to work, but
that's the only way you're going to get close to more personalized ratings.

~~~
superqd
Yes. The scatter plot is just a different version of how others felt about the
movie. It does not communicate how that movie is related to other movies (that
I have rated), nor does it inform me of how similar I am to the people who
rated the movie. Personalization is missing.

------
drblast
I like the two axes idea, although "Rewatchability" would probably be better
as "How much I liked it."

There are very high quality, well made movies that I don't like, and there are
some really crappy ones that I do. And that's a good distinction to see in a
review system.

Because sometimes you just want to watch a good shitty movie, but it's really
difficult to tell the good shitty movies from the bad shitty movies when The
Brady Bunch movie (brilliant) has the same rating as any Adam Sandler movie
(awful).

~~~
thebigshane
Similarly, I've heard a proposal here on HN about having separate voting
buttons for Agreement and Contributing. I might not agree with a comment, but
admit it raises a good point. I might agree with another comment (or think
it's funny) without it contributing to the current conversation.

------
ericcholis
From a statistician's standpoint, ratings systems suck. But, from a consumer
standpoint, they are super easy to understand. A scatter plot system makes
sense to me, but I would never put it in front of a user.

In my opinion, current ratings systems are 80% UX and 20% data.

For example, Newegg uses a pretty intuitive system of allowing you to sort a
product page by Best Reviews and Most Reviews. In my opinion, this allows the
user to make a more educated decision if they seek the information out.

~~~
nodata
From a consumer standpoint, a single five star review pushing a product to the
top of the list is not easy to understand, it's a pain.

~~~
ericcholis
The simple answer there relates back to UX, just don't show the stars when
there isn't enough data. Set a minimum number of reviews as a baseline so that
you don't get the result you mentioned.

If there is a written review component, make a note of the review but don't
quantify the value of said review until the minimum threshold is reached.

~~~
slantyyz
When there are only a handful of reviews, I find myself using "gymnastics
rules" and throwing out the best and worst score.

Probably not very scientific though.

------
jeffehobbs
Why not try the lower bound of Wilson score confidence interval for a
Bernoulli parameter?

[http://www.evanmiller.org/how-not-to-sort-by-average-
rating....](http://www.evanmiller.org/how-not-to-sort-by-average-rating.html)

~~~
geelen
We actually use this whenever we do any ranking within the site. On the film
page though, we think presenting the raw data as a scatter plot is better than
a single number.

------
diego
This article posted here four months ago is much better:

<http://evanmiller.org/how-not-to-sort-by-average-rating.html>

Discussion:

<http://news.ycombinator.com/item?id=3792627>

------
magoon
Facebook has it right with "Like" -- either you like it or not. This
eliminates these review patterns:

    
    
      5 stars - OMG I LOVE EVERY PRODUCT
      4 stars - Love this product, but I am withholding one star   because of _____
      3 stars - Everything to me is just meh.
      2 stars - I hate everything but this product earned 1 star for ___ and another for ____.
      1 star - UPS drop-kicked my item and it arrived late, so this product is trash!
    

If you distill all reviews so that the reviewer has to decide whether they
like it or not, then you have a less diluted overall ranking.

------
typicalrunt
Why don't rating systems just pose a simple yes/no question to the reader? In
the case of rating a movie, just ask "Would you watch this again?" or in the
case of purchasing a product from Amazon, "Would you purchase this item
again?"

I'd rather have a boolean system than one where someone's 4-star rating is
different than my 4-star rating. Whenever I see a multi-star rating system, I
remember back to a prof I once had that said "The top grade is B+. A's are
reserved for God." Albeit disgusting, it taught me that everyone has a
different rating scale.

------
karpathy
Personally, I've consistently found that the best predictor of whether or not
I was going to enjoy a movie was the NUMBER of ratings, not the rating itself.
This also works for restaurants and other things on sites like Yelp. It almost
seems that a movie should come with a simple "recommend!" button that simply
counts recommendations.

But ratings are a tricky issue and I think they require a more sophisticated
mathematical treatment and modeling if one wants to get it right, not just a
few histograms that treat all people equally.

There are a few modeling challenges that come to mind: For example, people
disagree on quality of movies based on their taste. This could be modeled as a
latent variable that must be inferred for every person in some graphical
model. Another example of a relevant variable would be a person's rating habits:
some people rate movies 5 or 1, some people have a gaussian rating centered at
some value. These should be explicitly modeled and normalized. Every rating
could ideally be used to make a stochastic gradient update to the weights of
the network, and since we are dealing with very sparse data, strong priors and
Bayesian treatment seems appropriate. Ratings could then be personalized
through an inference process on the graph.
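
In that spirit, here is a very small sketch of the standard latent-factor
approach from the Netflix Prize literature (a toy simplification, not anyone's
production system): each rating is modeled as a global mean plus a user bias,
an item bias, and a dot product of latent taste vectors, fit by stochastic
gradient descent with regularization. The data below is invented.

    # Tiny latent-factor model: rating ~ mu + b_u + b_i + p_u . q_i, fit by SGD.
    import random

    def fit(ratings, k=2, lr=0.05, reg=0.02, epochs=200, seed=0):
        rng = random.Random(seed)
        mu = sum(r for _, _, r in ratings) / len(ratings)
        bu = {u: 0.0 for u, _, _ in ratings}  # per-user bias
        bi = {i: 0.0 for _, i, _ in ratings}  # per-item bias
        p = {u: [rng.gauss(0, 0.1) for _ in range(k)] for u in bu}  # user tastes
        q = {i: [rng.gauss(0, 0.1) for _ in range(k)] for i in bi}  # item traits
        for _ in range(epochs):
            for u, i, r in ratings:
                pred = mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))
                e = r - pred
                bu[u] += lr * (e - reg * bu[u])
                bi[i] += lr * (e - reg * bi[i])
                for f in range(k):
                    pu, qi = p[u][f], q[i][f]
                    p[u][f] += lr * (e * qi - reg * pu)
                    q[i][f] += lr * (e * pu - reg * qi)
        return lambda u, i: mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))

    ratings = [("alice", "Blade Runner", 5), ("alice", "Airplane!", 2),
               ("bob", "Blade Runner", 5), ("bob", "Airplane!", 1),
               ("carol", "Blade Runner", 2), ("carol", "Airplane!", 5)]
    predict = fit(ratings)
    print(predict("alice", "Blade Runner"))  # close to 5 after training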

Has anyone heard of a more sophisticated model like this, or any efforts in
this direction? I'd like to see more math, modeling and machine learning and
less silly counting methods.

~~~
dredmorbius
That metric starts to fail when you've got, say, mediocre products or brands
with tons of exposure.

The classic is in "readers choice" reviews of restaurants or eateries. Fast-
food franchises dominate? Why? Because the philosophy of such sites is often
"majority rules", and the establishment (or brand) with the most votes wins.
But there are far more McDonalds or Taco Bells than Jacks Cook Shacks or
Trader Vics. Even when the _quality_ of JCS or TV exceeds TB or MD, it's not
going to be reflected in the ratings.

Adjustments such as taking a Likert scale (3-7 points) and adjusting reviews
based on the number of reviewers, to give both the actual _qualitative_
assessment and the _probable maximal review_, can help. This is how sites such
as Reddit have adjusted their comment/submission ratings.

The broader and more philosophical problem is that "quality" is not a one-
dimensional attribute, interpretation of quality differs among individuals,
and "fitness for purpose or task" should be considered when assessing quality
as well. McDonalds _may_ very well be appropriate when your goal is a quick,
inexpensive meal on the run (a conclusion I'd differ with), while Trader Vics
is where you'd head to impress the boss, date, in-laws, or client.

It's a tough problem. It's also one that sees a great many very poor proposed
solutions.

------
pygy_
Histograms are useful in the case of items with a strong love/hate split.

The canonical example is the SICP ratings on Amazon: 3.5 average; 177 ratings,
96 five-star, 53 one-star.

[http://www.amazon.com/Structure-Interpretation-Computer-
Prog...](http://www.amazon.com/Structure-Interpretation-Computer-Programs-
Engineering/product-
reviews/0262011530/ref=dp_top_cm_cr_acr_pop_hist_all?ie=UTF8&showViewpoints=1)

~~~
Xcelerate
Had to go look up that book. It's actually available free from MIT:
<http://mitpress.mit.edu/sicp/full-text/book/book.html>

~~~
pygy_
The videos of the lectures by the book authors (Gerald Sussman and Hal
Abelson) are also available for free.

Pick the MPEG1 versions. They are much heavier than the MPEG4 versions, but
the text on the projected computer screen is at least readable. IIRC, the
MPEG4 files are re-encoded versions of the MPEG1 ones, which themselves were
ripped from VHS.

<http://archive.org/details/mit_ocw_sicp>

------
mumrah
Average rating as a measure of "goodness" is fraught with statistical
problems. Without the context of other statistical modes, looking at mean is
pretty useless. However, people don't want to look at summary statistics for
each item (mean, median, mode, std/var, skew, etc). So we try to come up with
scalar metrics that capture "goodness" or "coolness" or whatever. Popularity
(how ever you define it) is a common one to use. Here's a good comparison of
popularity models: [http://blog.linkibol.com/2010/05/07/how-to-build-a-
popularit...](http://blog.linkibol.com/2010/05/07/how-to-build-a-popularity-
algorithm-you-can-be-proud-of/). In the past I've been pretty happy with
"Bayesian average" - it's simple to implement and gives good results.

But if you really want to dig into it, you have to consider all kinds of stuff
like bimodal distributions of ratings (controversial items), rater
quality/consistency, age of ratings, etc.

It's really not as simple as you'd think!

------
superqd
I applaud their efforts to improve on movie ratings, but this is not an
improvement. I agree with another commenter that they've really just created a
2D histogram, which is more difficult to assess than the histograms they
complained about.

What's especially difficult with the scatter plot is that it requires you to
assess density, rather than a simple scalar value. The other histograms have 5
numbers to indicate "weight" for each star, and a bar next to the star
visually indicates the proportion the ratings received for that star. For
their scatter plot, if there are 1000 ratings, how will it look different from
a film with 100 ratings? The relative proportions of the ratings will only get
muddied with the scatter plot approach.

The other thing about the scatter plot is that it still essentially maps to a
5 star rating, but only makes it more difficult to assess which star. That is,
we are expected to visually assess: [1 star] - greatest density in lower left
quadrant, [2 star] - density greatest in middle of graph, [3 star] - greatest
density in lower right quadrant, [4 star] - greatest density in upper left
quadrant, [5 star] - greatest density in upper right quadrant. There are only
5 useful density assessments which brings us back to the same categories as
the 5 star system. Only with the scatter plot, it's much, much more difficult
to assess which quadrant (star) the ratings map to. And really, what is the
meaningful difference between the 2, 3 and 4 stars (in my example)? Those
density groupings seem almost equivalent (or some might argue). So in reality,
the scatter plot will really only be meaningful if there is very little
deviation between quality and re-watchability (which isn't true), which will
help to group the ratings making density easier to assess. If they diverge
frequently, then the plots are just going to be ignored by users since they'll
have to assess density in every plot to try and make sense of it. That's hard.
That's work. Users don't like to do work.

Finally, re-watchability? The question on a user's mind is, "would I want to
watch this movie?" not, "if I saw this movie, would I want to watch it again?"
I rarely watch movies again. Even the ones I love and own. That seems to be
true of most people. The reasons for wanting to re-watch a movie are unrelated
to whether it would be worth watching the first time. I'd argue that re-
watching a film is more of a personality type than anything related to a
movie.

------
bthomas
"People like you rated it..." is much better for users than two axes.

------
ctdonath
_So, watch this if you’re in the mood for something really good._

Something I never see addressed: what you want to watch _eventually_ rarely
correlates with what you want to watch _now_. On the whole, you'll spend lots
of time putting "good" movies on your queue, but when time comes to pick
something and hit Play you'll pass over those and pick some recent release
which engages your excitement now and will be long forgotten soon after.

The star rating system confuses this even more by relying entirely on people
who will bother to rate something at all - a very different crowd than the
"what's _good_?" and "thrill me now" mindsets.

Insofar as ratings exist, I focus on written 1-star reviews (movies, apps,
products, whatever), looking for a subclass of "there really was a particular
problem" comments.

------
sojong
While this may not be applicable to goodfilms, a 2-dimensional system is
harder to brand. As in, "Two thumbs up," or "90% on Rotten Tomatoes." There's
value in a single understandable number.

I wonder if there's a measurable value for the system. I care about discovery,
so I want a site that can recommend me movies that I wouldn't normally think
about. How many more movies would this system recommend to me?

A lot of times, I don't care too much about accuracy as long as the system
isn't too far off. This is simply because the cost of an inaccurate
recommendation isn't too high when I can stream it on Netflix.

I like the idea of providing more data for people to make more accurate
assessments, but I don't necessarily believe optimizing for accuracy optimizes
the value provided to the user.

------
manaskarekar
I knew I had seen a similar XKCD before. <http://xkcd.com/937/>

~~~
willlll
That is because xkcd is the same 3 jokes repeated over and over.

~~~
rythie
It's easy to criticise, much harder to do.

------
mhb
Why would someone who is looking for a film care about rewatchability?
Presumably they'll be watching it for the first time and then can decide for
themselves whether to ever watch it again.

Or did I miss the point and rewatchability is just a placeholder for something
more useful?

~~~
vhf
The article says about Starship Troopers:

 _there’s a lot of disagreement over whether it’s high quality of not, but
generally this scores high-rewatchability. So, maybe not the most intelligent
movie, but good fun._

What I intuitively deduced from this example is that _rewatchability_ is a
metric of enjoyability.

On a single 5-star rating scale, some people will give 5 stars because they
really, really enjoyed the movie, while others will give 5 stars because they
thought the film was perfect from a cinematographic-quality point of view
(i.e. screenplay, cinematography, acting, casting, etc. -- insert some
Academy Award technical category here).

------
webjunkie
He jumps in and applies star rating problems just to movies, gives an
alternative that works only for movies and in the end promotes his own movie
site.

------
K2h
The problem with ratings is that they are not from a large enough random
sample. Rating scores tell me what people who like to rate products think. I
don't rate products and in general I suspect that people that are happy with a
product don't really care to take the time to go out of their way to tell
people about it online (that may change or be changing, I don't know). but one
thing is for sure, people that dislike a product WILL go out of their way to
tell everyone, thus further shifting the data set to being reviewed by people
that are unhappy.

Take a random sample of the true population of the data set (everyone that has
seen a movie), not just the people who log on to rate it.

------
DanBC
The article mentions an important point about 5 star systems - people tend to
only use 5 stars or 1 star. This is sort of shown with the histograms.

A 2-axis system seems like a good idea. But I'd like to see it with 3 options
per axis - [UP] [INDIFFERENT] [DOWN].

I'm also interested to know how the system will cope with "controversial"
films ( _life of brian_ , for example) where some people are going to downvote
whether they've seen it or not. And they'll campaign and ask all their friends
to downvote too.

~~~
guard-of-terra
I'd say people only use "5" (really liked, would recommend) and "meh" (spent
some time, would not recommend)

In their example, Blade Runner has slightly more "meh" votes and Starship
Troopers is mostly "meh".

------
mmuro
If something has 1000 ratings but only 3.5 stars, while something else has 1
rating and 5 stars, what does that say about the average? Not much. It just
says more people rated the first one and had an opinion about it. Getting
people to rate something is difficult in and of itself, and only the people at
the extremes of the love/hate spectrum will rate something.

I like the histograms, as they reveal a little about the ratings. After all,
if a single person gave it 1 star and everyone else rated it 3 or more, the
average is likely skewed by that one person, who is clearly "gaming" the
system because they weren't happy.

I think ratings should take into account intent. If multiple people are rating
it 1 star, then clearly it should be weighted downward. However, if a single
person out of 100 people gave it 1 star, I don't think the average should be
weighted evenly. It's a difficult problem to solve and XKCD is just making a
joke.

~~~
slantyyz
"Trustworthiness" of the reviewer is always difficult, especially with movie
reviews, because there's never any accounting for taste.

I find it hard to rely on aggregated ratings for that reason.

When it came to picking movies to watch, I used to love watching Siskel and
Ebert, because I knew their tastes.

If only Siskel (whose tastes were more like mine than Ebert's) gave a thumbs
up, I knew there was a pretty good chance I'd at least think the movie was
"ok". On the other hand, I'd be less likely to give a movie a chance if only
Ebert gave a thumbs up.

These days, what I have to do is go to Rotten Tomatoes and take a sampling of
four or five reviewers that I trust/like (which actually includes Roger Ebert
and a few of the people he used to have as guest reviewers on Ebert & Roeper)
and base my decision on that.

~~~
jfb
Human curation is still the state of the art; computed curation is a miserable
failure that utterly fails to capture _my_ tastes. It's trivially easy to
guess what iTunes or Netflix will recommend to me, which indicates to me that
the decision taken is tautological.

The goal of a recommendation system should be to expose me to things I
_wouldn't_ be likely to find by myself.

------
VMG
Aren't Bayesian statistics designed to deal with low ratings counts?

I think there was an article about that a while ago on HN.

------
pixelcort
Maybe we need something like Pandora's Music Genome Project, but for movies
and TV shows.

One of the most interesting features in Pandora is the "Why was this track
selected?" action. Imagine something similar where a list of movies and TV
shows are presented to you, with sentences for each as to why.

Netflix's recommendations were close, but they still seemed to always focus on
one facet at a time, be it a user-predicted rating or a single subcategory of
related shows.

Edit: Goodfilms seems to be better in that it tracks two facets at the same
time, which does end up creating a diagonal scale from super funny movies you
can watch again and again to super serious ones you'll watch once, but that's
still not quite like filtering down on tons of facets at the same time.

The closest thing I can think of is the metadata from TV Tropes.

------
retube
Hmmm. Whatever the scientific or theoretical improvement such an approach may
offer, having to educate users on how your ratings system works is going to
add a huge amount of friction to user engagement.

And frankly, who has ever mentally rated a film in terms of
"re-watchableness"? People just think in terms of "good" or "bad", and current
ratings systems a la Amazon leverage that. It's simple, fast and given the
histogram presentation tells me everything I need to know about the number and
distribution of votes in a flash. Plus whether I want to rewatch a film or re-
read a book is largely down to my mood at the time. But my opinion on whether
it's "good" or "bad" is pretty static.

Maybe Amazon's system is not statistically bullet-proof, but who cares? We're
talking movies here: a cheap, casual and discretionary purchase.

------
pirateking
In the pre-Internet days, the way I discovered content (movies, books, music),
was to take a trip to a store (Blockbuster, local bookstore, B&N, record
store). There were two broad categories of content: the mainstream stuff with
the primo shelf space, and the mysterious aisles of Everything Else. Judgments
were based on things like cover art, in store promotional material, how many
copies were still sitting on the shelf, sampling the content in-store, and
recommendations from friends or store employees.

I'm not sure if the success rate was any better or worse than the online star
rating system these days, but it seemed more fun. However, the barrier to
trying something else was also a lot higher if you made a poor choice, which
might have had a side effect of narrowing one's tastes.

------
rnernento
Ratings leave something to be desired, for sure. You can tell something is
worth checking out if it's highly rated, but other than that they aren't
tremendously useful.

That being said I'm not a fan of the goodfil.ms graph at all. Why do I care
about how rewatchable someone else thinks a movie is? Do I want to watch it
again? I would know because I'd have seen it.

I think rewatchable is the wrong term. Fun is what you're looking for. Major
Payne is a fun movie. It is incredibly rewatchable but I don't need a website
to tell me that. I could use a website to point me to it if I haven't seen it
before.

I think the site is on the fringe of something important though. There are
multiple ways to rate a movie, and mood plays a huge role in what you want to
watch at any given moment.

------
think-large
You can't rate the raters, because that's all subjective. This is where a nice
machine learning algorithm comes in: one that looks at your scores and finds
people who have similar rating trends. It would then weight other people's
ratings based on your ratings.

Thus is born a system in which everyone is happy. The site will get more
ratings because people want the algorithm to be better and it needs more data
for that.

The movies will be happy because this points to legal ways to find movies.

We'll all be happy because it's a complex solution to a complex problem, and
yet it can still be solved in an elegant and visually stimulating way. (E.g.,
you could color-code the reviewers whose tastes are more likely to be similar
to yours.)

~~~
Stonewall9093
Hey, check out the site we're working on: www.criticrania.com. We're doing
exactly what you mention. You'll need to sign up to see these features, but
it's something we're working hard towards perfecting.

------
joshu
Would you rather go to a restaurant that has 17 reviews and 4 stars, or the
next-door neighbor with 3.5 stars from 750 reviews, all other things being
equal?

Yeah.

Ratings are about post-choice satisfaction, not about pre-choice decision
making.

------
efa
I'm not sure why I would care about the "rewatchable" metric. I only look at
reviews for a movie I haven't seen. If I've seen it I can decide myself if I'd
enjoy watching it again. So, subtracting that, you are left with the normal
star rating system.

------
SpectralShards
I really don't think you've tackled the problem from the correct angle here.
Anyone can assign two or more axes to rate things on and call it a better
rating system than a one-dimensional one because it provides more information.
Viewers want more information before deciding whether or not a thing (in this
case, a movie) is worth their time. While your two dimensional rating system
may work for some, this specific system is only justified for people who want
information on "quality" and "rewatchability." For example for a person like
me, I don't rewatch movies so the "quality" axis offers me way more value and
insight compared to the "rewatchability" side.

------
alexatkeplar
If I'm deciding whether to watch a film for the first time, I really don't
understand how whether or not other people would want to watch it a second
time helps me make my decision.

"We rate movies on two criteria - ‘quality’ and ‘rewatchability’, so you can
admit to your guilty pleasures and properly capture the feeling you get when a
film leaves you exhausted."

You are using rewatchability to infer some potentially helpful labels ("guilty
fun", "exhausting but worthy"). But there's no guarantee that those inferences
are safe/generalisable across viewers/films/genres, or that people who see a
rewatchable axis will know to interpret it like you do...

------
onedev
This article forgot to take into account one HUGE factor in people who use the
rating systems.

ATTENTION SPAN.

Yes. People don't have the attention span to independently analyze 5 different
scatter graphs of 5 similar products. Sometimes the scatter plots can actually
be more confusing and less informative than something simple like a histogram.

I firmly believe that people's attention spans are more captured by things
like star ratings and histograms. If they get past the stage of their interest
being captured, THEN they read the reviews to find out more in-depth
information and opinions. I think it's a system that works well, as Amazon has
shown.

------
espeed
Randy Farmer goes over this concept in detail in _Building Web Reputation
Systems_ (<http://shop.oreilly.com/product/9780596159801.do>).

Google Tech Talk: <http://www.youtube.com/watch?v=Yn7e0J9m6rE>

See also: "YouTube: Five Stars Dominate Ratings" ([http://youtube-
global.blogspot.com/2009/09/five-stars-domina...](http://youtube-
global.blogspot.com/2009/09/five-stars-dominate-ratings.html))

------
anonymous_mouse
High-rewatchability is a poor metric for a good film.

Examples of movies I would highly rate but would not want to rewatch:
Schindler's List (1993), Hotel Rwanda (2004), Blindness (2008), Amistad
(1997), etc...

------
liquidcool
For movies a histogram may not be helpful (other than highlighting polarizing
films), but I find it invaluable for electronics. While I also read reviews, I
find that with electronics 1 star reviews (and sometimes 2 star) are usually
given for DOA or other serious failure. I can do a quick calculation of defect
rate by calculating the percentage of 1 star reviews. If it's 20% (or above
norms for that product type) I avoid it. And yes, I realize that the reported
defect rate is likely to be higher than the actual defect rate, but it's far
better than nothing.

------
Stonewall9093
Interesting article, but it's still arbitrary. I still don't think it solves
the problem. What it boils down to is trusting the raters. This is the only
way to get past spam, ignorance, and trolling.

I think the solution is knowing who is rating what you're looking at. In the
old sense - "Quality over quantity." I don't need to know what the whole world
gave it, just a few people who I've come to trust. We're trying to do this
here: www.criticrania.com.

I think this is the only real way around this problem.

~~~
droithomme
Yeah, it makes the bold generic claim to answer "why [all] rating systems
don't work", makes the bizarre claim that histograms that show the
distribution are "worse" than a scalar mean, and then presents as a solution
"We've solved the generic problem by doing two things: adding a 'rewatchable'
rating, and plotted it against general quality rating as a scatter plot."
Rewatchable is irrelevant to things that aren't movies. The scatter plot is
only relevant if the correlation between the two is important, but the given
graphs don't support that the correlation has meaningful information beyond
that people who like a movie are more likely to want to rewatch it, which is a
correlation that is likely consistent across all movies in their dataset and
conveys little information about any particular movie.

------
lttlrck
In my experience the best films glue you to the seat. I'm fairly sure that's a
universal phenomenon ;)

They are watched without pausing, skipping, rewinding, from beginning to end.

If it's not true in your case then you're not a reliable rating source.

A rating based on this can be fully automated.

No need to depend entirely on this one rating, as it could be used for
weighting a user's subjective rating. E.g., if the film was watched over 3
evenings, then your 5 star rating is worth 1 star.

------
achterband
Years ago I created a website for local bands. I also liked to have a rating
system and tried an emotion rating system (example:
<http://achterband.nl/sickboys_and_lowmen/white_buffalo>). I think for music
and films this is more suitable than giving it a single number. And maybe this
applies to other ratings as well.

(Translation of the emotions used: happy, relaxing, surprising, aggressive,
sad, explosive)

------
jgrahamc
I wish people would use a 5 or 7 point scale:
<http://blog.jgc.org/2007/12/seven-point-scale.html>

~~~
sp332
When it comes to anything subjective, ratings (like those histograms) tend to
be bimodal
[http://zedshaw.com/essays/rubrics_and_the_bimodality_of_rati...](http://zedshaw.com/essays/rubrics_and_the_bimodality_of_ratings.html)

------
ajankelo
Super interesting post. We created SquidCube (squidcube.com) to help companies
build out a rating system where feedback, instead of being shown publicly,
goes straight to the company.

We're curious to see how people respond to rating/giving feedback when it
doesn't show up in their social media stream. Will it be more honest because
it is direct feedback?

This post has sparked some new ideas for me on developing a new rating system
beyond nero, stars, etc.

------
JoeAltmaier
Polls predict poorly because everyone says what they think they should say,
instead of what they actually believe.

Ratings systems are worse - they are self-selective, so most adequately-
satisfied folks will not take part.

A better system than both, would be to somehow extract ratings from folks
behavior. Did they read the whole article? How long did they dwell on each
part? Did they return to it more than once? Stuff like that.

------
baby
I thought they would come up with a better system than stars. Like a
"like/dislike" or "would watch it again". But no. They still use the same
system.

They just made a nice blog post to buzz around.

But I got past that and played a bit with their app, until I found
out that everything you did was posted on your timeline by default.

I erased the application from my facebook.

------
draggnar
The problem is people do not take the time to rate products (movies in this
case) unless they are notably good or notably bad. I believe Rotten Tomatoes
has the most realistically accurate system for rating movies. Either you think
someone should see the movie, or shouldn't. If the majority of people are on
the fence, then it will get a rating around 50%.

------
_delirium
I think the two-axis argument is stronger than the anti-histogram argument.
The problem isn't that the one-axis rating systems are aggregating their
ratings into histogram bars rather than displaying them in some other way, but
that the ratings curve just doesn't carry a lot of information.

------
erikb
The very idea of explaining a self-explanatory comic gave me a clue about what
I would read here. I wasn't disappointed. If one of something is bad, it
apparently becomes good if you throw 2 of them at your users. Sounds like a
solution.

And the comic is also explained now. Thanks.

------
dominik
Out of curiosity, how do I sign up for goodfil.ms without a Facebook or
Twitter account?

------
eddieplan9
Relevant discussion on how not to calculate an average rating without taking
sample size into consideration:

<http://news.ycombinator.com/item?id=3792627>

------
gburt
Clustering across what you've previously "liked" ala Netflix is the solution
to this problem, I think. The problem is fundamentally that single number
ratings don't capture that people are non-homogenous.

------
Mordor
Often I'm presented with a choice of films at the cinema (some of which I've
seen already) and only want to decide which one to see next. Perhaps an
ordering by score would be more appropriate?

------
reubensutton
It's a nice idea, and the readership of HN would probably prefer the
recommended way in this article, but I'm not sure that mass market consumers
will find it that valuable.

------
sirtaptap
I found some good info here and shared it in an answer here:
<http://ux.stackexchange.com/a/23008/7627>

------
runn1ng
I still don't understand how is "rewatchability" different from "quality".

If the movie is good, I would like to watch it again. What is the difference?

~~~
happimess
The movie Requiem for a Dream is, to me, a perfect example of a high quality
movie that I will not watch twice.

The acting, directing, cinematography, pacing, and sound design are all
excellent. However, the film is such a grueling emotional experience, I don't
foresee myself sitting down with it again.

Contrast that to, say, Airplane. It's a good movie. Funny. If I'm bored in a
hotel room, I'll watch it again. However, it is emphatically not a "great
film."

------
Toenex
In summary then, using the mean value to summarise data from an unknown
distribution can be problematic. Not exactly ground breaking.

------
aw3c2
This is not helpful at all. They are comparing the one-dimensional 1 to x star
rating to something two dimensional.

------
goggles99
Rewatchability and quality? Are you kidding me?

How do I glean from this whether or not I will like the movie?

A high-quality movie can be terrible and a low-quality movie can be great.
What does quality mean? Does that mean they had good special effects and
angles? Does that mean the color is true and the acting was great? How does
this equate to me liking the movie?

Rewatchability? Many of my favorite movies I would not watch again (Lord of
the Rings) because they were so long. There are also so many movies that I
want to see in the future that I will choose to watch one of them instead of
re-watching one that I have already watched. Re-watching is something that
people do less and less as they get older (kids and teenagers maybe do it) but
adults (your target audience) not so much.

Also those scatter charts mean the same to my brain as the histograms that you
are blasting so I would stick with what people are already used to (the
histograms). You are not doing anyone any favors by changing the presentation
of the same data. People are not stupid. They will see both equally in most
cases but they will prefer the familiarity of the histograms.

Trying to be different is not always the best thing to do. As many have
already mentioned, use machine learning to augment ratings (like Netflix).

Good luck

