
Recommender Systems: We're doing it (all) wrong - getp
http://technocalifornia.blogspot.com/2011/04/recommender-systems-were-doing-it-all.html
======
_delirium
The prediction context is somewhat different than the analysis context, I
think. For predictive recommender systems, the most relevant part of this
analysis is the critique of error measures. It may well be that MSE is not an
error measure that aligns with the system's actual accuracy goals (e.g.
something like perceived quality of the recommendations).

When it comes down to it, the end goal is just to predict whether someone
would like something, and/or present them a list of the things you are most
certain they'd like. In the analysis context (as with much of HCI), the scales
are being used to draw qualitative conclusions about tasks and preferences, so
it makes sense to directly attack erroneous modeling and assumptions, because
they can lead to wrong conclusions. But for prediction, erroneous modeling only
really matters to the extent that it means we're: 1) optimizing the wrong
thing; or 2) doing optimization suboptimally.

#1 is important to get right, but #2 is more of a "whatever works" sort of
thing, and we even have fairly good automatic methods for deciding. If
treating ratings as numerical data empirically leads to good predictions, then
it's fine to do; if not, then it's best avoided. Many recent systems avoid
even having a human make those kinds of decisions, by throwing in a giant bag
of possible ways of slicing the data, and then handing off the decision about
which of them to use, and how to weight them, to an ensemble method. IIRC,
that's what the winning Netflix-prize entry was like.
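
The "giant bag of slicings handed to an ensemble" idea can be sketched as a simple linear blend, roughly in the spirit of the Netflix-prize blends. This is a hedged toy example: the three base predictors below are synthetic stand-ins (not real models that treat ratings numerically, as ranks, etc.), and the weights are learned by plain least squares on a holdout split.

```python
import numpy as np

# Synthetic ground truth and three stand-in base predictors with
# different noise levels (proxies for different ways of slicing the
# ratings data).
rng = np.random.default_rng(0)
true_ratings = rng.integers(1, 6, size=200).astype(float)
base_preds = np.column_stack(
    [true_ratings + rng.normal(0, s, size=200) for s in (0.5, 1.0, 2.0)]
)

# Learn blend weights on the first half, evaluate on the second half,
# so the data decides how much to trust each predictor.
train, test = slice(0, 100), slice(100, 200)
weights, *_ = np.linalg.lstsq(base_preds[train], true_ratings[train], rcond=None)
blend = base_preds[test] @ weights

def rmse(pred):
    return float(np.sqrt(np.mean((pred - true_ratings[test]) ** 2)))

blend_rmse = rmse(blend)
single_rmses = [rmse(base_preds[test, i]) for i in range(3)]
print(blend_rmse, single_rmses)
```

The blend ends up leaning on the less noisy predictors, which is the whole point: no human had to decide up front whether treating ratings as numbers was legitimate.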

~~~
nkurz
I'd add an alternative:

    #3: Suggest items that they are likely to really love.

This is subtly different from predicting what the user is most likely to like.
To optimize with RMSE scoring, you are better off suggesting a sure "4" than a
risky "5". For buying an expensive item like a car or a stereo, the safe bet
might be a good approach. But for books, music, or movies --- easily sampled,
one of a series --- I'd be much more excited by a system that can predict A+
items with even 25% probability than one that offers up straight B items with
80% consistency.
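
A toy calculation (probabilities made up for illustration) shows the asymmetry: the best single prediction under squared error is the item's mean rating, and a risky item carries irreducible error that a sure one doesn't, so an RMSE-optimized system scores better by steering toward the safe bets.

```python
# Two hypothetical items for one user:
#   "safe":  rated 4 with certainty
#   "risky": rated 5 with probability 0.25, else 2
p = 0.25
risky_mean = p * 5 + (1 - p) * 2  # best squared-error prediction: 2.75
risky_mse = p * (5 - risky_mean) ** 2 + (1 - p) * (2 - risky_mean) ** 2
safe_mse = 0.0  # predicting 4 for the safe item is always exact

print(risky_mean, risky_mse, safe_mse)
```

The risky item's minimum achievable MSE is 1.6875 no matter what you predict, so the A+-with-25%-probability item looks strictly worse by this metric than the straight B.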

------
macrael
Really great point. Slightly off topic, the standard for rating systems seems
to be 5 stars, but I prefer 4-star systems because they force you to make a
+/- choice with no cop-out ambivalence option. I'd be curious to see the same
distance work applied to a four star system.

~~~
Deestan
Why is it a cop-out to be ambivalent?

~~~
RickHull
> _Why is it a cop-out to be ambivalent?_

Well, I would say it kind of is, but it kind of isn't.

------
pixcavator
A good way to deal with recommendation systems that avoids these problems is
via flows on graphs. Here’s one method of converting individual ratings to
global rankings. Each alternative A, B, C, etc. is a node of the graph. Now,
when the user/voter gives 5 stars to A and 4 stars to B, this is interpreted
simply as a preference of A over B. This preference contributes a single point
to the total flow from B to A. At this point, we remove all cycles from this
flow (there is a standard way to do that) and produce a gradient flow F. The
potential function h of this flow, grad h = F, is the rankings.
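
A minimal least-squares sketch of that construction, with made-up items and votes. One convenient fact (this is the HodgeRank-style reading of the method): the least-squares solution of grad h = F is exactly the potential of the flow's cycle-free gradient component, so the cycle-removal step happens implicitly in the fit.

```python
import numpy as np

# Made-up votes: each (winner, loser) pair contributes one unit of
# flow from the loser toward the winner.
items = ["A", "B", "C"]
idx = {name: i for i, name in enumerate(items)}
votes = [("A", "B"), ("A", "B"), ("B", "C"), ("A", "C"), ("C", "A")]

n = len(items)
flow = np.zeros((n, n))  # net flow[i][j] > 0 means i preferred over j
for winner, loser in votes:
    flow[idx[winner], idx[loser]] += 1
    flow[idx[loser], idx[winner]] -= 1

# Least squares: minimize sum over pairs (h[i] - h[j] - flow[i, j])^2,
# with one extra row pinning sum(h) = 0, since a potential is only
# defined up to a constant shift.
rows, rhs = [], []
for i in range(n):
    for j in range(i + 1, n):
        row = np.zeros(n)
        row[i], row[j] = 1.0, -1.0
        rows.append(row)
        rhs.append(flow[i, j])
rows.append(np.ones(n))
rhs.append(0.0)
h, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)

ranking = sorted(items, key=lambda name: -h[idx[name]])
print(ranking, h)
```

Note how the A-vs-C votes cancel out (a cycle-ish contribution), while A's wins over B still push A to the top of the potential.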

~~~
tel
This also eliminates the inconsistencies between Likert "intervals" across
different subjects. Very smooth.

------
logophobia
I wonder if it's possible to assign a weight to each of the possible scores to
correct for the "perceived" distance so you can still use all these existing
tools in a statistically valid way.

Also, this observation could be interpreted a bit differently:

> The probability that a user changes her rating between 2 and 3 is almost
> 0.35 while the probability she changes between 4 and 5 goes down to almost
> 0.1. This is a clear indication that users perceive that the distance
> between a 2 and a 3 is much lower than between a 4 and a 5.

It seems a bit counterintuitive that the distance between 3 (neutral) and 4
(positive) is smaller than between 4 and 5 (very positive). You could also
interpret this differently. When a user changes his mind, he has to change it
in such a way that the difference is significant enough to also change the
review (is the review now a little bit wrong or very wrong?). This means that
he might actually see the difference between 3 and 4 as larger than between 4
and 5: large enough for him/her to change the review. This effect is damped by
how often the user actually changes his mind this way. If you look at it that
way, then the number of pairwise inconsistencies is the wrong way to measure
the distance between these ordinal categories in this particular case, because
there might actually be two mechanisms that cancel each other out.

~~~
dhimes
Do we know if there is a difference in perception by the respondent if the
scale is numerical? A scale of 1-5 might be seen as having different intervals
than a scale whose range is strongly disagree to strongly agree.

------
luvcraft
Interesting. I've noticed on reccr ( <http://reccr.com> ), my own
recommendation project, that the recommendations for 4 and 5 star ratings are
much more accurate than those for 2 and 3 star ratings. I had initially
thought that it was simply that there were more 4 and 5 star ratings in the
system, and thus more data to base recommendations on, but the "larger
perceived gap" between 2 and 3 versus 4 and 5 makes a lot of sense and is
probably also a major contributing factor.

------
T_S_
tl;dr (my interpretation anyway). In utility theory a completely specified
preference ordering is the starting point and a utility function can be
derived to represent it. These functions are unique only up to a monotone
transformation. In recommender systems we take the quantity as given and
infer the preference order for missing items. If you reassign all the 5-star
items to 10 stars, it is perfectly consistent with the ordering, but
your inference method may be (will be!) sensitive to such reassignments. To me
this suggests you should also be estimating the shape of the best
transformation of the rating system. Think of it as an opportunity to reduce
bias. The next step would be to see if _your particular decision problem_ is
sensitive to such parameters. If not, chuck those parameters and go about your
business.

~~~
dr_strongarm
I think 11 stars is a better scale.

Additionally, a great body of work in behavioral psych tells us that humans
have a tough time measuring preferences on _any_ absolute scale; however, we
can consistently compare two items as better or worse (particularly when
they're of the same type, instead of apples versus oranges). "Riffle
independence" is a recent method for modeling these kinds of preference
distributions, and has been used quite successfully for social curation of the
blogosphere - i.e., showing the best set of blogs that span the topic space
and have little redundancy.

------
kvh
You could just map 5 to something farther away, like 6. In fact, this is how
most ordinal inference techniques work anyway: by taking an interval method
and learning cutoffs for your ordered categories. Learning more parameters
comes with a big cost though, which is why in practice the cutoffs are often
fixed from the get-go. Obviously, ordinal methods have been tried in the
literature. There is a reason they are not used in practice though, and that's
because the trade-off (being harder to learn vs modelling the data more
accurately) is not favorable.
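
A tiny sketch of that "interval method plus cutoffs" setup, with made-up latent scores and hypothetical cutoff values. An interval model produces a continuous score; ordered cutoffs turn it into a 1-5 star rating. Fixed, evenly spaced cutoffs are the from-the-get-go shortcut; learned cutoffs can stretch or shrink the perceived gaps.

```python
import numpy as np

def to_stars(score, cutoffs):
    """Map a continuous latent score to an ordinal 1-5 rating via
    ordered thresholds: the rating is 1 + number of cutoffs below it."""
    return 1 + int(np.searchsorted(cutoffs, score))

fixed_cutoffs = [1.5, 2.5, 3.5, 4.5]    # evenly spaced, fixed up front
learned_cutoffs = [1.5, 2.5, 3.2, 4.6]  # hypothetical learned values:
                                        # the 3/4 boundary moves down,
                                        # the 4/5 boundary moves up

for score in (2.4, 3.3, 4.55):
    print(score, to_stars(score, fixed_cutoffs), to_stars(score, learned_cutoffs))
```

With the hypothetical learned cutoffs, a latent 3.3 already counts as a 4 while a latent 4.55 is still only a 4, which is exactly the kind of unequal spacing the article is worried about, expressed as extra parameters.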

------
kenjackson
Put up or shut up (I don't mean this in a mean way... read on).

There is a great data set to test this theory... Netflix. This article
shouldn't end by just soliciting opinions, but with his results on the
Netflix data set.

~~~
xamat
Hi @kenjackson, this is Xavier here. How do you propose to validate this on
the Netflix dataset? It is clear that you cannot use RMSE to compare to other
existing approaches, right? The way to go would be to propose a different
success measure (i.e. ranking-based) and measure how different algorithms
perform. And _then_ validate this on users to prove that optimizing RMSE is
not as useful.

If you give me a few months, I might get there. But this is the reason I wrote
a blog post and not a paper ;-)

~~~
frak_your_couch
I just wanted to point out that the link on your blog to PureSVD points to the
wikipedia page for Discounted Cumulative Gain.

~~~
xamat
Oops! Thanks... just fixed it.

~~~
frak_your_couch
Hey, sorry, one more time. The updated link is also wrong, it's relative to
your blog rather than absolute i.e.
[http://technocalifornia.blogspot.com/2011/04/research.yahoo....](http://technocalifornia.blogspot.com/2011/04/research.yahoo.com/files/recsys2010_submission_150.pdf)
as opposed to
[http://research.yahoo.com/files/recsys2010_submission_150.pd...](http://research.yahoo.com/files/recsys2010_submission_150.pdf)

~~~
xamat
Oh man... damn blogger. Fixed it now... finally! Thanks!!!

------
noelwelsh
There is a lot of work on learning distance functions. A quick Google search
will turn up tens, if not hundreds, of papers on this topic (search for
learning distance metric or learning distance function).

------
2mur
My chief tech/gadget metric of recommendation is 'Would I purchase this
$expensive_electronic again?' I'm a big shopper at Amazon, but I don't care
about the exact breakdown of stars a certain product gets. My usual method is
to look at the total number of reviews (as a metric of popularity/community
etc), and then to read the 5 star and 1 star reviews (and any that get voted
up as most helpful).

I would love to see Amazon go to a binary recommendation system (thumbs
up/thumbs down) with a free text review.

~~~
dhimes
Sometimes, learning about the nuances is helpful, and they seem to be in the
4-star (and even 3-star) reviews, especially when there are many 5-star
reviews. That is, the reviewer says things like, "I would have given it 5
stars except ___." I zero in on those reviews to try to understand any edge
cases in usability or suitability.

EDITED for embarrassing grammar mistakes

------
mumrah
Treating ordinal ratings (1-5) as nominal labels is not the best solution,
IMO. It's true that the "distance" between each rating is not equal, however
disregarding the fact that 4 stars is greater than 3 loses quite a bit of
useful information.

------
farout
my 2 cents:

When I was building a recommender for a pet-supplies company, I used a
log-likelihood test.

Given that they bought product A, what are the chances they would buy B?

Also, since we had thousands of products, I sometimes looked at correlation
charts or even simple histograms to easily pinpoint what products and
quantities were purchased after the initial purchase of A. It made crunching
millions of transactions easier.
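
A common form of that test in recommender work is Dunning's log-likelihood ratio on a 2x2 co-occurrence table (it may or may not be exactly what farout used; the basket counts below are made up).

```python
import math

def xlogx(x):
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalized entropy of a count vector: N*H in nats.
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 table.

    k11: baskets with both A and B
    k12: baskets with A but not B
    k21: baskets with B but not A
    k22: baskets with neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

# Hypothetical counts: the first pair co-occurs far beyond chance,
# the second barely co-occurs at all.
print(llr(100, 50, 50, 10000))
print(llr(1, 500, 500, 10000))
```

The score is high exactly when "bought A" and "bought B" are far from independent, and, unlike raw co-occurrence counts, it doesn't reward merely popular items.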

------
NY_USA_Hacker
This article is a great example of why 'computer science' people should not
assume that they understand statistics or even much in data manipulation.

Then the article is an example of why computer people should be careful on
where they learn their statistics!

The article is awash in hand wringing about "interval scale" and "ordinal
scale" data without being at all clear on just why someone should care, and
for all the rest of the article they should not care.

So, the article has:

"For ordinal data, one should use non-parametric statistical tests which do
not assume a normal distribution of the data."

Mostly nonsense. In statistical testing, the normal distribution arises mostly
just via the central limit theorem which has quite meager assumptions
trivially satisfied by "Likert" scale data.

Then there is:

"Furthermore, because of this it makes no sense to report means of likert
scale data--you should report the mode."

Nonsense: The law of large numbers has especially meager assumptions also
trivially satisfied by Likert scale data. If you want to estimate expectation,
then definitely use the mean and not the mode.

Beyond the law of large numbers, there is also the classic

Paul R. Halmos, "The Theory of Unbiased Estimation", 'Annals of Mathematical
Statistics', Volume 17, Number 1, pages 34-43, 1946.

that makes clear that the mean is the most accurate way to estimate
expectation.

If you want to use the mode for something, then say what the heck you want to
use it for and then justify using the mode as the estimator.
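
The law-of-large-numbers point is easy to check with simulated Likert data (the rating distribution below is made up for illustration): the sample mean tracks the expectation, while the mode answers a different question entirely.

```python
import numpy as np

rng = np.random.default_rng(42)
p = [0.10, 0.10, 0.20, 0.25, 0.35]  # hypothetical P(rating = 1..5)
expectation = sum(r * pr for r, pr in zip(range(1, 6), p))  # 3.65

samples = rng.choice(np.arange(1, 6), size=100_000, p=p)
sample_mean = samples.mean()
values, counts = np.unique(samples, return_counts=True)
sample_mode = values[counts.argmax()]

print(expectation, sample_mean, sample_mode)
```

Here the mode is 5 while the expectation is 3.65; reporting only the mode throws away the very quantity the mean estimates consistently.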

There is:

"In order to defend that ratings can be treated as interval, we should have
some validation that the distance between different ratings is approximately
equal."

Nonsense. Instead, you get a 'rating', say, an integer in [1, 5]. Now you have
it. Use it. For

"validation that the distance between different ratings is approximately
equal."

why bother? Besides, "the distance" is undefined here!

For the

"This is a clear indication that users perceive that the distance between a 2
and a 3 is much lower than between a 4 and a 5."

the writer is just fishing in muddy waters.

There is

"All the neighbor based methods in collaborative filtering are based on the
use of some sort of distance measure. The most commonly used are Cosine
distance and Pearson Correlation. However, both these distances assume a
linear interval scale in their computations!"

Nonsense. Just write out the definitions of expectation, variance, covariance,
and Pearson correlation and see that sufficient is that the expectation of the
squared random variables be finite. There is nothing about "interval scale" in
the assumptions.

But why calculate Pearson correlation? When you dig into that, again, you
basically just want some MSE (mean square error) convergence, which again
makes no assumptions about "interval scale" data.

There is

"This is my favorite one... The most commonly accepted measure of success for
recommender systems is the Root Mean Squared Error (RMSE). But wait, this
measure is explicitly assuming that ratings are also interval data!"

Nonsense. There is no such assumption about MSE. The main point about MSE is
just that any sequence of random variables (e.g., estimates) that converges in
MSE will have a subsequence that converges almost surely. In practice,
convergence in MSE is convergence almost surely, and that's the best
convergence there can be. So, if your estimates are good in MSE, then
essentially always in practice they are close in every sense. Nowhere in this
argument is an assumption about "interval data".

This article sounds like 'statistics' from some psycho researcher who has an
obsession about interval scales and a phobia about using ordinal scale data!
In particular he has high anxieties about being charged with heresy by the
Statistical Religious Police! The guy needs 'special help'!

Did I mention that the article is nonsense?

~~~
xamat
Thanks @NY_USA_Hacker Yours is a great example of why smart people should also
invest in improving communication and social skills. Half of your comment is
_nonsense_. The other half does raise interesting points that would deserve a
reply if they were written in a different tone. I would be happy to go into
each of the points you mention if you decide to re-write the comment in a more
constructive way.

Really, you can be positive, constructive and even happy in your life without
sounding less smart by doing so.

~~~
NY_USA_Hacker
Your response is to style, not substance.

But there isn't a lot of room to respond with substance because the article
is, did I mention, nonsense.

There is a reason: Several paths led to some of the more central topics in
probability and statistics. Such paths included gambling, astronomical
observations, psychological testing, signal processing, control theory,
quality control, 'statistical' physics, quantum mechanics, mathematical models
in the social sciences, experimental design, especially in agriculture,
mathematical finance, and more. In addition, there is now a very solid,
polished field of probability, stochastic processes, and their statistics.

Some of these paths got lost in the swamp on their way to some reasonably
clear understanding. The solid material, so far, is rarely taught: the
prerequisites need quite a lot of pure math, and then the pure math
departments rarely follow through with the probability, stochastic processes,
and statistics.

Early in my career, I was dropped into parts of the swamp, but later I got the
rest of the pure math prerequisites and good coverage of the solid, polished
material.

So, at this point I see both the swamp and the solid, polished material.

Net, the paper is from the swamp, and I responded with just a little of the
solid, polished material.

For the swamp, not a lot of discussion is justified. The best response is the
one I gave: The stuff from the swamp is nonsense. That may sound harsh, but
it's on the center of the target.

~~~
xamat
Good luck with your life out of the swamp. Looks like you are going to need
it.

------
bigiain
Meta quibble...

Is it just me who starts reading articles like this, only to crash into
sentences like this one:

"A Likert scale is a unidimensional scale on which the respondent expresses
the level of agreement to a statement - typically in a 1 to 5 scale in which 1
is strongly disagree and 5 is strongly disagree."

And think snarky comments like "I should keep reading this article. Do I 1)
strongly disagree or 5) strongly disagree?" (Then flick back to the site that
linked to it to post that Snarky comment).

~~~
xamat
Thanks for reporting the mistake... In troll style, but still helpful.
Thanks!

