The prediction context is somewhat different than the analysis context, I think. For predictive recommender systems, the most relevant part of this analysis is the critique of error measures. It may well be that MSE is not an error measure that aligns with the system's actual accuracy goals (e.g. something like perceived quality of the recommendations).
When it comes down to it, the end goal is just to predict whether someone would like something, and/or present them a list of the things you are most certain they'd like. In the analysis context (as with much of HCI), the scales are being used to draw qualitative conclusions about tasks and preferences, so it makes sense to directly attack erroneous modeling and assumptions, because it can lead to wrong conclusions. But for prediction, erroneous modeling only really matters to the extent that it means we're: 1) optimizing the wrong thing; or 2) doing optimization suboptimally.
#1 is important to get right, but #2 is more of a "whatever works" sort of thing, and we even have fairly good automatic methods for deciding. If treating ratings as numerical data empirically leads to good predictions, then it's fine to do; if not, then it's best avoided. Many recent systems avoid even having a human make those kinds of decisions, by throwing in a giant bag of possible ways of slicing the data, and then handing off the decision about which of them to use, and how to weight them, to an ensemble method. Iirc, that's what the winning Netflix-prize entry was like.
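For what it's worth, here is a minimal sketch of that kind of blending, assuming you already have held-out predictions from a few base models; the function names and the simple least-squares blend are illustrative only, and the actual Netflix-prize ensembles were far more elaborate:

```python
# Minimal sketch of blending base predictors with least-squares weights, in the
# spirit of (but far simpler than) the Netflix-prize ensembles. Assumes you already
# have held-out predictions from each base model; all names here are illustrative.
import numpy as np

def learn_blend_weights(base_preds, true_ratings):
    """base_preds: (n_ratings, n_models) held-out predictions; returns weights + bias."""
    X = np.column_stack([base_preds, np.ones(len(true_ratings))])  # append bias column
    w, *_ = np.linalg.lstsq(X, true_ratings, rcond=None)
    return w

def blend(base_preds, w):
    """Combine new predictions with the learned weights."""
    X = np.column_stack([base_preds, np.ones(base_preds.shape[0])])
    return X @ w
```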
#3: Suggest items that they are likely to really love.
This is subtly different than predicting what the user is most likely to like. To optimize with RMSE scoring, you are better off suggesting a sure "4" than a risky "5". For buying an expensive item like a car or a stereo, the safe bet might be a good approach. But for books, music, or movies --- easily sampled, one of a series --- I'd be much more excited by a system that can predict A+ items with even 25% probability than one that offers up straight B items with 80% consistency.
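To make that concrete, here is a toy calculation (the items and probabilities are invented) showing why a system trained to minimize squared error will rank the sure 4 above the risky maybe-5:

```python
# Toy numbers only: why an RMSE-trained ranker prefers the sure 4 to the risky 5.
# Item A: the user will rate it 4 for sure.
# Item B: 5 with probability 0.25, otherwise 2 (the risky A+ candidate).
ratings_A = {4: 1.0}
ratings_B = {5: 0.25, 2: 0.75}

def expected_rating(dist):
    return sum(r * p for r, p in dist.items())

# Under squared error, the optimal point prediction is the expectation,
# so an RMSE-driven system effectively ranks items by it.
print(expected_rating(ratings_A))  # 4.00 -> recommended first
print(expected_rating(ratings_B))  # 2.75 -> buried, despite being the only shot at a 5
```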
Really great point. Slightly off topic, the standard for rating systems seems to be 5 stars, but I prefer 4 star systems because they force you to make a +/- choice with no cop out ambivalence choice. I'd be curious to see the same distance work applied to a four star system.
So you are forcing people who are neither in favour of nor against something to produce false pro/contra votes. There are some cases where this strategy makes sense, like questions about controversial topics. Other than that, it produces noise. In my experience, the debate over 4 vs. 5 options often has more to do with the authors' personality and training than anything else. Surprisingly little is based on solid, non-ambiguous evidence.
Using 4 points instead of 5 merely removes some of the inherent ambiguity, the results are still likely to organize around similar concepts.
For many things, most people don't bother to think about differences in degrees of crap unless you force them to, so in a 4-star system 2 stars becomes the "mediocre" rating while 3 and 4 differentiate between the good ones. How many people really care about the grade difference between a D and an F? Likewise, do people really spend that much time making sure their 1- and 2-star ratings form a consistent philosophy of relative crappiness?
If "forcing" people to make a choice about something that is supposed to be a subjective categorization to begin with is probably not helping anything. If you want to force a like/dislike you should get binary data and be done with it.
"Would you eat this? Yes/No."
That's easy to answer accurately. Everyone will agree on what you mean. If you haven't answered, that means you don't have an opinion. Beyond that, semantic ambiguity is impossible to avoid and gets worse the more numbers you add.
A good way to deal with recommendation systems that avoids these problems is via flows on graphs. Here's one method of converting individual ratings to global rankings. Each alternative A, B, C, etc. is a node of the graph. Now, when the user/voter gives 5 stars to A and 4 stars to B, this is interpreted simply as a preference of A over B. This preference contributes a single point to the total flow from B to A. At this point, we remove all cycles from this flow (there is a standard way to do that) and produce a gradient flow F. The potential function h of this flow, grad h = F, gives the rankings.
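If it helps, here is a rough sketch of one way to compute such a potential in practice: fit h to the pairwise net flows by least squares, which recovers the gradient (acyclic) part of the flow. This is the HodgeRank-style view, not necessarily exactly what the comment has in mind, and the counts below are made up:

```python
# Fit a potential h to pairwise net flows by least squares (HodgeRank-style).
# flow[(lo, hi)] = number of users who preferred hi over lo; toy data only.
import numpy as np

items = ["A", "B", "C"]
idx = {x: i for i, x in enumerate(items)}

flow = {("B", "A"): 3, ("C", "B"): 2, ("C", "A"): 1}

rows, targets = [], []
for (lo, hi), count in flow.items():
    row = np.zeros(len(items))
    row[idx[hi]], row[idx[lo]] = 1.0, -1.0   # h(hi) - h(lo) should match the net flow
    rows.append(row)
    targets.append(count)

h, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
ranking = sorted(items, key=lambda x: -h[idx[x]])
print(ranking)  # items ordered by fitted potential, here ['A', 'B', 'C']
```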
I wonder if it's possible to assign a weight to each of the possible scores to correct for the "perceived" distance, so you can still use all these existing tools in a statistically valid way.
Also, this observation could be interpreted a bit differently:
> The probability that a user changes her rating between 2 and 3 is almost 0.35 while the probability she changes between 4 and 5 goes down to almost 0.1. This is a clear indication that users perceive that the distance between a 2 and a 3 is much lower than between a 4 and a 5.
It seems a bit counterintuitive that the distance between 3 (neutral) and 4 (positive) is smaller than between 4 and 5 (very positive). You could also interpret this differently. When a user changes his mind, he has to change it in a way that is significant enough to also change the review (is the review now a little bit wrong or very wrong?). This means that he might actually see the difference between 3 and 4 as larger than between 4 and 5: large enough for him/her to change the review. This effect is dampened by how often the user actually changes his mind this way. If you look at it that way, then the number of pairwise inconsistencies is the wrong way to measure the distance between these ordinal categories in this particular case, because there might actually be two mechanisms that cancel each other out.
Do we know if there is a difference in perception by the respondent if the scale is numerical? A scale of 1-5 might be seen as having different intervals than a scale whose range is strongly-disagree -- strongly agree.
Interesting. I've noticed on reccr ( http://reccr.com ), my own recommendation project, that the recommendations for 4 and 5 star ratings are much more accurate than those for 2 and 3 star ratings. I had initially thought that it was simply that there were more 4 and 5 star ratings in the system, and thus more data to base recommendations on, but the "larger perceived gap" between 2 and 3 versus 4 and 5 makes a lot of sense and is probably also a major contributing factor.
tl;dr (my interpretation anyway): In utility theory, a completely specified preference ordering is the starting point, and a utility function can be derived to represent it. These functions are unique only up to a monotone transformation. In recommender systems we take the quantity as a given and infer the preference order for missing items. If you reassign all the 5-star items to 10 stars, it is perfectly consistent with the ordering, but your inference method may be (will be!) sensitive to such reassignments. To me this suggests you should also be estimating the shape of the best transformation of the rating system. Think of it as an opportunity to reduce bias. The next step would be to see if your particular decision problem is sensitive to such parameters. If not, chuck those parameters and go about your business.
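As a hedged illustration of what "estimating the shape of the transformation" could look like (the candidate maps and the trivial baseline predictor are placeholders, not anyone's actual system), one could try a few monotone remappings of the star values and keep the one whose predictions best rank-order held-out ratings:

```python
# Try a few monotone remappings of the stars, fit a trivial baseline on each, and
# keep the mapping whose predictions best rank-order held-out ratings (scale-free
# criterion). All names and numbers are illustrative only.
import numpy as np
from scipy.stats import spearmanr

candidate_maps = [
    {1: 1, 2: 2, 3: 3, 4: 4, 5: 5},    # raw stars
    {1: 1, 2: 1.5, 3: 2, 4: 4, 5: 6},  # compress 2-3, stretch the top
]

def holdout_rank_corr(train, test, mapping):
    """train/test: lists of (user, item, stars). Baseline: global mean + item offset."""
    mu = np.mean([mapping[s] for _, _, s in train])
    by_item = {}
    for _, item, s in train:
        by_item.setdefault(item, []).append(mapping[s] - mu)
    offsets = {i: np.mean(v) for i, v in by_item.items()}
    preds = [mu + offsets.get(item, 0.0) for _, item, _ in test]
    truth = [s for _, _, s in test]
    return spearmanr(preds, truth).correlation

# best_map = max(candidate_maps, key=lambda m: holdout_rank_corr(train, test, m))
```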
Additionally, a great body of work in behavioral psych tells us that humans have a tough time measuring preferences on any absolute scale; however, we can consistently compare two items as better or worse (particularly when they're of the same type, instead of apples versus oranges). "Riffle independence" is a recent method for modeling these kinds of preference distributions, and has been used quite successfully for social curation of the blogosphere - i.e., showing the best set of blogs that span the topic space and have little redundancy.
You could just map 5 to something farther away, like 6. In fact, this is how most ordinal inference techniques work anyways: by taking an interval method and learning cutoffs for your ordered categories. Learning more parameters comes with a big cost though, which is why in practice the cutoffs are often fixed from the get-go. Obviously, ordinal methods have been tried in the literature. There is a reason they are not used in practice though, and that's because the trade-off (being harder to learn vs modelling the data more accurately) is not favorable.
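For readers unfamiliar with the cutoff idea, here is a minimal sketch; the threshold values are placeholders, and in a learned ordinal model they would be fit jointly with the scoring function:

```python
# Sketch of the cutoff idea behind ordinal methods: a real-valued latent score is
# turned into a star rating by thresholds. Fixed placeholders here; ordinal
# regression would learn the cutoffs jointly with the model.
import numpy as np

thresholds = [1.5, 2.5, 3.5, 4.5]  # boundaries between the five ordered categories

def score_to_stars(latent_score, cuts=thresholds):
    # The label is 1 plus the number of boundaries the score exceeds.
    return int(np.searchsorted(cuts, latent_score)) + 1

print(score_to_stars(3.2))  # 3
print(score_to_stars(4.7))  # 5
```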
Put up or shut up (I don't mean this in a mean way... read on).
There is a great data set to test this theory... Netflix. This article shouldn't end by just soliciting opinions, but with the author's results on the Netflix data set.
Hi @kenjackson, this is Xavier here. How do you propose to validate this on the Netflix dataset? It is clear that you cannot use RMSE to compare to other existing approaches, right? The way to go would be to propose a different success measure (i.e. ranking-based) and measure how different algorithms perform. And then validate this on users to prove that optimizing RMSE is not as useful.
If you give me a few months, I might get there. But this is the reason I wrote a blog post and not a paper ;-)
There is a lot of work on learning distance functions. A quick Google search will turn up tens, if not hundreds, of papers on this topic (search for learning distance metric or learning distance function).
My chief tech/gadget metric of recommendation is 'Would I purchase this $expensive_electronic again?' I'm a big shopper at Amazon, but I don't care about the exact breakdown of stars a certain product gets. My usual method is to look at the total number of reviews (as a metric of popularity/community etc), and then to read the 5 star and 1 star reviews (and any that get voted up as most helpful).
I would love to see Amazon go to a binary recommendation system (thumbs up/thumbs down) with a free text review.
Sometimes, learning about the nuances is helpful, and they seem to be in the 4-star (and even 3-star) reviews, especially when there are many 5-star reviews. That is, the reviewer says things like, "I would have given it 5 stars except ___." I zero in on those reviews to try to understand any edge cases in usability or suitability.
Treating ordinal ratings (1-5) as nominal labels is not the best solution, IMO. It's true that the "distance" between each rating is not equal; however, disregarding the fact that 4 stars is greater than 3 loses quite a bit of useful information.
When I was doing a recommender for a pet supplies company, I used a log-likelihood test.
Given that they bought product A, what are the chances they would buy B?
Also, since we had thousands of products, I sometimes looked at correlation charts or even simple histograms to easily pinpoint what products and quantities were purchased after the initial purchase of A. It made crunching millions of transactions easier.
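The comment doesn't spell out the exact test, but a common choice for this "bought A, also bought B" question is the G² log-likelihood ratio on a 2x2 contingency table of baskets; the counts below are made up:

```python
# G^2 log-likelihood ratio for association between buying A and buying B,
# computed from a 2x2 contingency table of transaction counts (toy numbers).
import math

def llr(k11, k12, k21, k22):
    """k11: baskets with A and B, k12: A without B, k21: B without A, k22: neither."""
    def entropy(*counts):
        total = sum(counts)
        return sum(c * math.log(c / total) for c in counts if c > 0)
    return 2 * (entropy(k11, k12, k21, k22)
                - entropy(k11 + k12, k21 + k22)
                - entropy(k11 + k21, k12 + k22))

print(llr(k11=120, k12=400, k21=300, k22=9000))  # larger value -> stronger association
```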
This article is a great example of why "computer science" people should not assume that they understand statistics or even much about data manipulation.
Then the article is an example of why computer people should be careful about where they learn their statistics!
The article is awash in hand wringing about "interval scale" and "ordinal scale" data without being at all clear on just why someone should care, and for all the rest of the article they should not care.
So, the article has:
"For ordinal data, one should use non-parametric statistical tests which do not assume a normal distribution of the data."
Mostly nonsense. In statistical testing, the normal distribution arises mostly just via the central limit theorem which has quite meager assumptions trivially satisfied by "Likert" scale data.
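A quick simulation makes the point, assuming some skewed 5-point rating distribution (invented here): sample means of Likert-style data are already close to normal, and close to the true expectation, at modest sample sizes.

```python
# Sample means of a skewed 5-point "Likert" distribution; the distribution is made up.
import numpy as np

rng = np.random.default_rng(0)
values, probs = [1, 2, 3, 4, 5], [0.05, 0.10, 0.15, 0.30, 0.40]  # skewed toward 4-5

sample_means = rng.choice(values, p=probs, size=(10_000, 50)).mean(axis=1)
print(sample_means.mean(), np.dot(values, probs))  # close to the true expectation 3.9
print(sample_means.std())                          # ~ population std / sqrt(50)
```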
Then there is:
"Furthermore, because of this it makes no sense to report means of likert scale data--you should report the mode."
Nonsense: The law of large numbers has especially meager assumptions also trivially satisfied by Likert scale data. If you want to estimate expectation, then definitely use the mean and not the mode.
Beyond the law of large numbers, there is also the classic
Paul R. Halmos, "The Theory of Unbiased Estimation", 'Annals of Mathematical Statistics', Volume 17, Number 1, pages 34-43, 1946.
that makes clear that the mean is the most accurate way to estimate expectation.
If you want to use the mode for something, then say what the heck you want to use it for and then justify using the mode as the estimator.
There is:
"In order to defend that ratings can be treated as interval, we should have some validation that the distance between different ratings is approximately equal."
Nonsense. Instead, you get a 'rating', say, an integer in [1, 5]. Now you have it. Use it. For
"validation that the distance between different ratings is approximately equal."
why bother? Besides, "the distance" is undefined here!
For the
"This is a clear indication that users perceive that the distance between a 2 and a 3 is much lower than between a 4 and a 5."
the writer is just fishing in muddy waters.
There is
"All the neighbor based methods in collaborative filtering are based on the use of some sort of distance measure. The most commonly used are Cosine distance and Pearson Correlation. However, both these distances assume a linear interval scale in their computations!"
Nonsense. Just write out the definitions of expectation, variance, covariance, and Pearson correlation and see that it is sufficient that the expectation of the squared random variables be finite. There is nothing about "interval scale" in the assumptions.
But why calculate Pearson correlation? When you dig into that, again, you basically just want some MSE (mean squared error) convergence, which again makes no assumptions about "interval scale" data.
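For reference, here is what the Pearson similarity in neighbor-based CF computes over the items two users have both rated (toy ratings below); the computation itself only needs numbers, and whether those numbers are meaningful as distances is exactly the interval-scale debate above:

```python
# Pearson similarity between two users over co-rated items; toy data only.
import numpy as np

def pearson_sim(ratings_u, ratings_v):
    common = set(ratings_u) & set(ratings_v)
    if len(common) < 2:
        return 0.0
    x = np.array([ratings_u[i] for i in common], dtype=float)
    y = np.array([ratings_v[i] for i in common], dtype=float)
    x -= x.mean(); y -= y.mean()
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom > 0 else 0.0

print(pearson_sim({"a": 5, "b": 3, "c": 4}, {"a": 4, "b": 2, "c": 5}))  # ~0.65
```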
There is
"This is my favorite one... The most commonly accepted measure of success for recommender systems is the Root Mean Squared Error (RMSE). But wait, this measure is explicitly assuming that ratings are also interval data!"
Nonsense. There is no such assumption about MSE. The main point about MSE is just that any sequence of random variables (e.g., estimates) that converges in MSE will have a subsequence that converges almost surely. In practice, convergence in MSE is convergence almost surely, and that's the best convergence there can be. So, if your estimates are good in MSE, then essentially always in practice they are close in every sense. Nowhere in this argument is an assumption about "interval data".
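For concreteness, RMSE is just this computation on predicted vs. observed ratings (toy numbers below); the argument is about whether minimizing it is meaningful, not about how it is computed:

```python
# RMSE on toy predicted vs. observed ratings.
import numpy as np

predicted = np.array([3.8, 4.2, 2.5])
actual    = np.array([4,   5,   2  ])
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(rmse)  # ~0.56
```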
This article sounds like 'statistics' from some psycho researcher who has an obsession about interval scales and a phobia about using ordinal scale data! In particular he has high anxieties about being charged with heresy by the Statistical Religious Police! The guy needs 'special help'!
Thanks @NY_USA_Hacker. Yours is a great example of why smart people should also invest in improving their communication and social skills. Half of your comment is nonsense. The other half does raise interesting points that would deserve a reply if they were written in a different tone. I would be happy to go into each of the points you mention if you decide to re-write the comment in a more constructive way.
Really, you can be positive, constructive and even happy in your life without sounding less smart by doing so.
But there isn't a lot of room to respond with substance because the article is, did I mention, nonsense.
There is a reason: Several paths led to some of the more central topics in probability and statistics. Such paths included gambling, astronomical observations, psychological testing, signal processing, control theory, quality control, "statistical" physics, quantum mechanics, mathematical models in the social sciences, experimental design (especially in agriculture), mathematical finance, and more. In addition, there is now a very solid, polished field of probability, stochastic processes, and their statistics.
Some of these paths got lost in the swamp on their way to some reasonably clear understanding. As for the solid material, so far it is rarely taught: the prerequisites need quite a lot of pure math, and the pure math departments rarely follow through with the probability, stochastic processes, and statistics.
Early in my career, I was dropped into parts of the swamp, but later I got the rest of the pure math prerequisites and good coverage of the solid, polished material.
So, at this point I see both the swamp and the solid, polished material.
Net, the paper is from the swamp, and I responded with just a little of the solid, polished material.
For the swamp, not a lot of discussion is justified. The best response is the one I gave: The stuff from the swamp is nonsense. That may sound harsh, but it's on the center of the target.
Is it just me who starts reading articles like this, only to crash into sentences like this one:
"A Likert scale is a unidimensional scale on which the respondent expresses the level of agreement to a statement - typically in a 1 to 5 scale in which 1 is strongly disagree and 5 is strongly disagree."
And think snarky comments like "I should keep reading this article. Do I 1) strongly disagree or 5) strongly disagree?" (Then flick back to the site that linked to it to post that snarky comment.)