
Brewing a Better Rating System - jackcheng
http://blog.steepster.com/post/226679106/better-rating-system
======
mpotter
Hi, I'm Mike from Steepster. We thought we'd share our new ratings system we
just deployed with HN as we think it's relevant for products with customer
reviews, ratings, etc. It's our attempt to combat the 4.3 dilemma (discussed
here recently: <http://news.ycombinator.com/item?id=883890>).

Background: Steepster is a community site for tea drinkers to share their
tasting notes, get recommendations, and discover new teas.

Feedback appreciated!

~~~
callmeed
Mike, great job on this. Very informative. I have 2 questions for you:

1. Is your slider from jQuery UI or another JS framework?

2. With regard to combating the 4.3 dilemma, have you found the average
ratings on Steepster to be lower? Maybe it's too early to tell, but I'd love to
see some sort of curve on your ratings distribution in a future post ...

thanks

~~~
mpotter
Thanks, callmeed.

1. Yep, slider implementation is jQuery UI.

2. It is too early to tell, but we're definitely planning to share a
follow-up. As mentioned in the post, we had a simple thumbs up/down for ratings and
were seeing a greater than 90% positive average, so we were definitely
experiencing that bias. Just today, albeit with a much too small sample size,
we're starting to see a more diverse mix of averages. We still expect to have
that positive skew but because we're now operating with a 100 point scale in
the UI, we hope the granularity will help users distinguish subtler
differences in rating.

~~~
physcab
It'd also be interesting to know if the number of ratings decreases or
increases. I wonder if your users will find the added granularity a nuisance or
an incentive.

~~~
mpotter
It will be interesting. It's important to note the nature of our community and
whom we expect to contribute. Generally, we're geared toward a more passionate
user who we find to be more than willing to contribute at this level of
granularity. So we've made the choice to cater toward their needs while still
trying to remain accessible.

But, this is a good point, and I think an important one to consider when
evaluating the mechanic that works best for your community/site.

------
alabut
I love that the sliding meter shows tick marks for your previous rankings of
other teas - the UI reflects that your judgement of a particular tea is
relative to your other experiences.

~~~
pbhjpbhj
I suspect under closer analysis one's scoring breaks down to be inconsistent -
"in retrospect I like tea X better than Y but not as much as Z, but I rated Z
lower than Y because I didn't like it as much as P which had a higher rating",
if you follow.

~~~
snprbob86
It seems that, pairwise, it is pretty easy to decide. Maybe one could go
further than this and eliminate the absolute scale altogether (at least at
rating time).

I'm imagining a UI which asks you to pick a favorite among the item you are
viewing and one similar item. You could stop there, or ask repeatedly with new
comparison items until the viewed item's position on the absolute scale is
unambiguous. The user could provide some rating data with just a single binary
decision, but some ajax-y fade out/in of another pair could enable further
ratings if they desired.
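
One way to read "ask repeatedly until the position is unambiguous" is a binary
search over the teas the user has already ranked. The sketch below is only an
illustration under that assumption: `insert_by_comparison` and the `prefers`
callback (standing in for the user's click on one of the two items) are
hypothetical names, and the list is assumed to be ordered best-first.

```python
def insert_by_comparison(ranked, new_item, prefers):
    """Place new_item into a best-first list using pairwise questions.

    prefers(a, b) must return True when the user likes a more than b.
    Returns the number of questions asked.
    """
    lo, hi = 0, len(ranked)
    questions = 0
    while lo < hi:
        mid = (lo + hi) // 2
        questions += 1
        if prefers(new_item, ranked[mid]):
            hi = mid        # new item belongs above ranked[mid]
        else:
            lo = mid + 1    # new item belongs below ranked[mid]
    ranked.insert(lo, new_item)
    return questions


# Simulated user whose hidden preferences are P > Z > X > Y.
hidden = {"P": 4, "Z": 3, "X": 2, "Y": 1}
my_teas = ["P", "Z", "Y"]
asked = insert_by_comparison(my_teas, "X", lambda a, b: hidden[a] > hidden[b])
print(my_teas, asked)
```

Each answer halves the remaining range, so a user who stops after one question
has still narrowed the position to half of the list.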

~~~
m_eiman
If you do that, the Elo rating system is a good place to start algorithm-wise.

<http://en.wikipedia.org/wiki/Elo_rating_system>
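
A minimal sketch of what an Elo-style update might look like for pairwise tea
picks. The function names, the 1500 starting rating, and K = 32 are
illustrative assumptions, not anything Steepster is known to use.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(winner, loser, k=32.0):
    """Return the new (winner, loser) ratings after one pairwise pick."""
    # The winner scored 1 against an expectation of expected_score(...),
    # so the adjustment shrinks as the result becomes less surprising.
    delta = k * (1.0 - expected_score(winner, loser))
    return winner + delta, loser - delta


# Two teas that both start at the assumed default rating of 1500.
print(update_elo(1500.0, 1500.0))
```

An upset (a lower-rated tea being picked over a higher-rated one) moves both
ratings further than an expected result does.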

------
lonestar
The problem with this system is in the sorting. The list of "Highest Rated"
teas is dominated by results where 1 person rated the tea 100.

Steepster should use a Bayesian average
(<http://en.wikipedia.org/wiki/Bayesian_average>) so that the uncertainty of a
small number of ratings is reflected in the sorting.
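
As a rough illustration of the idea, a Bayesian average blends each tea's
ratings with a prior mean, weighted as if the prior were a fixed number of
extra votes. The prior mean of 70 and prior weight of 5 below are made-up
values for the sketch; a real site would pick them from its own data.

```python
def bayesian_average(ratings, prior_mean, prior_weight):
    """(prior_weight * prior_mean + sum of ratings) / (prior_weight + count)."""
    return (prior_weight * prior_mean + sum(ratings)) / (prior_weight + len(ratings))


one_perfect_vote = [100]
many_good_votes = [90] * 20

# With few ratings the score stays close to the prior; with many it
# approaches the plain average of the ratings.
print(bayesian_average(one_perfect_vote, prior_mean=70, prior_weight=5))  # 75.0
print(bayesian_average(many_good_votes, prior_mean=70, prior_weight=5))   # 86.0
```

Under these assumed values, the tea with twenty ratings of 90 sorts above the
tea with a single rating of 100.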

~~~
mpotter
Yeah, sorting is an issue we're still looking at (and is still very much in
transition considering the new rating system). Appreciate the suggestion!
We'll add it to our list of potential solutions.

~~~
selven
Start everything off with a single 50-point score. That way one person ranking
it 100 will bump it up to 75, the next to 83, and so on. Such a system would
cause teas that have more people upvoting them to rank higher than those that
just happen to have one or two good opinions.
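
The arithmetic can be checked with a short sketch; `seeded_running_average` is
a hypothetical helper name, and the seed of 50 is the starting score proposed
above.

```python
def seeded_running_average(ratings, seed=50.0):
    """Average after each new rating, counting the seed as one extra vote."""
    total, count = seed, 1
    averages = []
    for rating in ratings:
        total += rating
        count += 1
        averages.append(total / count)
    return averages


# Three users in a row rate the tea 100.
print([round(a, 1) for a in seeded_running_average([100, 100, 100])])
# -> [75.0, 83.3, 87.5]
```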

~~~
Eliezer
That's a special case of a particular sort of Bayesian average.

------
ErrantX
The genius is adding some previous scores. I always struggle to rate stuff
fairly without anything obvious to compare it with.

------
mkinsella
This is THE best implementation of a ratings system I've seen. Very good job.

------
TrevorJ
I feel like this really combines the best of the granular 4 star systems with
the specificity of a percentage rating. Really good stuff, I'd be interested
to hear a follow up with user feedback on this approach and how it holds up
long term.

------
mhartl
This is cool, but I think virtually all rating systems suffer from the same
basic problem: there's no way to turn it up to 11.

Take movies, for example. They are usually rated on a four-star scale. And
yet, a three-star movie is a clear success. Few movies can realistically
aspire to more than three stars. Even many four-star movies are really just
trying desperately to avoid two-star land. Francis Ford Coppola was sure he
was going to be fired any day from _The Godfather_. The production crew and
actors on _Star Wars_ thought it was practically a joke. Please, God, let
_Star Wars_ not be a _B_ movie, they must have been thinking.

When you say ★★★ out of ★★★★, you make it look like it wasn't good enough:
75%. Movies really should be rated on a three-star scale: ★★★ out of ★★★; ★★★
= _A_ = 100%. Anything else is gravy.

So, rate tea on a three-star scale. Three stars means "excellent tea, no clear
way to make it better". ★★★½ means "Whoa, there _is_ something better than
★★★!" ★★★★ means "This is _The Godfather_ of tea! This tea makes me an offer I
can't refuse."

------
stuartjmoore
In this context, it looks like the user has incentive to rate (to get better
suggestions), but on rating in general:

Why even ask people how they feel?

Depending on the content, you can analyze how they use it to get a much more
accurate rating. For video: Did they watch the entire thing? Did they leave
after a few seconds? Did they share it somehow?

That (slightly off-topic) being said, this looks great.

------
robryan
Something else you could think about in a rating system like this would be,
instead of using generic faces, to associate each with a common tea
that most tea lovers have tried.

The notches kind of do this, but there's always the risk of someone rating their
first tea 80, then deciding subsequent teas are better so they need to
be rated higher, when the first one should have been more around 60.

------
fuzzythinker
I think the main reason sliders aren't used is that users find them too troublesome,
hence up/down and 5 stars are mainly used. I remember from my pys class that a
7 point rating system is best. But the 5 stars' simplicity and ubiquity
probably trumps the benefits gained by a 7 point system. I think the best
compromise is a 5 star UI implemented as 6 points by allowing 0 point
assignments.

~~~
fuzzythinker
For those who marked me down, would you please comment on the reason? I'm getting
tired of spending my time commenting and getting disapproval without reason. I
don't think down votes should be for disagreements; they should be for spammy or
childish comments, or comments that do not add anything to the topic. My main point
is that sliders aren't used much because they are too troublesome for a
typical user. If you disagree with that, please add your opinion. I'm not
trying to take anything away from the author. In fact, I think it's an
ingenious idea. But I usually dislike repeating "wow, cool" comments since so
many others have done so already. It's part of my DRYness kicking in.

~~~
nkurz
I voted you down because you asserted that a system was 'best' because you
remember from a 'pys' class. This has to be one of the weakest 'arguments from
authority' I have seen. You then asserted that a 5-star allowing zero is even
better. Then why didn't your (psychology?) professor say so?

I didn't vote down because I disagree, but because you haven't made much of a
case. I also downvote the 'wow, cool' comments as unhelpful, and upvote the
comments that seem like they will lead to useful discussion. Without intending
offense, I didn't think your comment was pitched at the right level for this
audience.

Personally, I think you are on the right track, although I think 5 stars
allowing halves is even better. Interestingly, Netflix (experts in this field)
started out with allowing half-stars and then got rid of them, making me worry
that they know better than I.

~~~
fuzzythinker
You are taking every single word of my comment too seriously. If every
assertion needs to have strong backing in order to be commented, the hn
comments will probably be only < 10% of what it is now (again, just a
guesstimate, don't take this one too seriously too). I forgot if my professor
has research backing for a 7 point sys being "best", maybe he did, maybe he
didn't. But I don't think I need to remember if there was indeed research
backing for it to add to the discussion. Again, I don't think you should down
vote every discussion just because they didn't state the research backing, but
I'm not the one to tell you that, maybe others can comment on this.

As for the 7 point system being "best" (for general purpose rating), I
remember it's because 5 star does not give enough granularity, while 10 points
is too much. Maybe that's why Netflix took that out. Now why not 6, 8, or 9? I
forgot, again, maybe there was research being done.

As for my "idea" of allow a 0 on a 5 point system; it makes it a 6 point
system while retaining a 5 point UI that everyone is accustom to. What is
wrong with that? Again, just asking for discussion, not trying to say it IS
the best.

Now back to the topic of down vote because I don't have enough backing. If I
need backing in order to comment, I wouldn't even be able to comment any of
this. Is this what you think is the way hn should work? Also, in order to not
make you think I have the research to back up my thoughts, I need to say that
in almost every sentence. I also don't think that should be the way hn works.

~~~
nkurz
_As for my "idea" of allow a 0 on a 5 point system; it makes it a 6 point
system while retaining a 5 point UI that everyone is accustom to. What is
wrong with that?_

Nothing is wrong with it necessarily --- it's all a matter of implementation
and audience. I think the first thing you are going to run into is a need to
visually differentiate a non-vote from a zero. I'm also not sure what problem
it's trying to solve.

What I would find more useful (from a 'build a better recommendations engine'
perspective) is a 5+: a short list of favorites that can stand in for
someone's favorites. Personally, I'd also like a better way to
differentiate the gradations between standard, good, and great. Whether I hate
something or 'hate-hate' it isn't going to make much of a difference. Do you
think your audience is going to be persuaded to reduce their average rating by
a point, or are you still going to find the oft-quoted 4.3 average? I'm
doubtful, but this doesn't make it a bad idea to try.

As to the downvote, I stand by it. My goal is to rearrange the page so that
the comments that are most useful to me are at the top. If others find your
comment useful, they will see the injustice and bring it back up to the top.

As to the need for 'strong backing', I think we just have different
worldviews. With due respect to my friends who are psychology professors, "a
[nameless] psychology professor told me" is barely a step up from "I'm not a
doctor but I play one on TV". We obviously respect different authorities in
our lives.

~~~
fuzzythinker
Re: "Need for 'strong backing'": I think this goes back to the seriousness of
how you take the discussions here. For me, forum/msg threads are just casual
discussions (one level lower than blog posts), it would be nice if the person
stated where they get their backing for the idea or assertion, but it's
neither "necessary" nor "too helpful". If every idea/assertion needs backing,
there would be almost no discussions at all. Innovation often comes from
ideas/assertions that have no backing.

It is not "necessary" because to me, if I believe in it and it's important to
me, I will test it out. It doesn't matter if the idea came from Steve Jobs or
Joe Doe. It's not "too helpful" because often the research being done on it is
flawed, outdated, or just really not too trustworthy. For example, I
remember reading that some group had done research on the max width of cell phones
people like before feeling discomfort. The Motorola design team was the first
to ignore that when they designed RAZR.

------
thinksketch
This is very cool, thank you. I posted earlier today about the need for a
better rating system than the five star system. I'm really glad to see you
working on a great solution. Thanks!

------
zeeone
Meh...

