How Not To Sort By Average Rating (evanmiller.org)
750 points by llambda on April 3, 2012 | 153 comments



While I agree with the spirit of the article, this is one of those cases where a Bayesian treatment is conceptually much clearer.

Assume that ratings are being generated by a stable stochastic process whose underlying distribution is multinomial (ignoring the ordinal character of ratings for the time being) and use a Dirichlet conjugate prior. This gives you a posterior distribution over new ratings for an item. The benefit of a posterior here is that it lets you rank items in terms of the probability that a random viewer would rate one item higher than another. By adjusting the magnitude of the alpha parameter of the Dirichlet prior, you adjust your sensitivity to small numbers of observations. A small initial alpha leads to rapid changes in the posterior upon observing ratings, whereas a large alpha requires a significant body of evidence.

The best part of the multinomial model with a conjugate Dirichlet prior is that the math is REALLY simple. The normalizing constant for the Dirichlet distribution looks scary when stated in terms of the gamma function, but since this is the discrete case, just pretend that everywhere you see gamma(x) it is replaced with (x - 1)! and you will be OK.
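
As a rough illustration, here is a minimal Python sketch of the Dirichlet-multinomial update described above. The prior pseudo-counts and the observed ratings are made-up numbers, not anything from the article:

    # Dirichlet prior over 1..5 star ratings, expressed as pseudo-counts.
    alpha = [2.0, 2.0, 2.0, 2.0, 2.0]
    counts = [0, 1, 3, 10, 6]          # observed ratings for some item

    # Conjugacy: the posterior is Dirichlet(alpha + counts), and the posterior
    # predictive probability of each rating is just the normalized sum.
    posterior = [a + c for a, c in zip(alpha, counts)]
    total = sum(posterior)
    predictive = [p / total for p in posterior]

    # Expected star rating under the posterior predictive, usable as a score.
    expected_stars = sum((i + 1) * p for i, p in enumerate(predictive))
    print(predictive, expected_stars)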

Let me know if you would like to learn more, I would be happy to help.


You could also go one step further on the Bayesian path and infer even alpha from the data on your site, and introduce a loss function on your ordering.

Or you could do a semi-frequentist thing and simplify your math by using MAP estimates to rank. Basically instead of score = #pos/(#pos + #neg), it becomes score = (#pos+x)/(#pos+x + #neg+y), where you choose x and y to suit your needs. You could choose x/y in proportion to the average number of up/down votes on your site or you could even choose x/y in proportion to the average number of up/down votes of the author of the post. That would rank posts of trolls lower than posts of good users. By varying x and y you can tweak the strength of this effect. You can interpret this as giving each item by default x upvotes and y downvotes.
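
A minimal Python sketch of that smoothed score, with x and y as arbitrary made-up defaults:

    # Pretend every item starts with x upvotes and y downvotes; x and y are
    # tuning knobs, chosen here purely for illustration.
    def smoothed_score(pos, neg, x=2.0, y=2.0):
        return (pos + x) / (pos + x + neg + y)

    print(smoothed_score(1, 2))        # few votes: pulled toward 0.5
    print(smoothed_score(1000, 2000))  # many votes: stays close to 1/3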

This certainly works much better than the formula in the article. For example, if a post has 1 upvote and 2 downvotes, his formula says it should be ranked lower than a post with 1000 upvotes and 2000 downvotes (because he's using the lower bound of the confidence interval). That's obviously bad: while the first post could be a good one, we know for certain that the second one isn't. In general his method ranks posts with a low number of votes very low, even compared to posts with a high number of downvotes.


Absolutely! I just wrote a reply where I alluded to that, unfortunately I didn't refresh and see this post or I would have just plugged you instead.

The benefit of the Bayesian treatment here that I want to drill down on is how natural it is to adjust the prior to capture your beliefs about how items should be perceived in the presence of incomplete information. The frequentist approach is fine, but it does not provide such a pleasant, intuitive knob to tune.


Well, it depends on how you look at it. The Bayesian MAP approach is basically the same as the frequentist approach with made up data. Instead of using the maximum likelihood estimator p = pos/(pos+neg) you pretend that each new post already has some up/down votes by default, and then you use simply p = pos/(pos+neg). Seems like an even more intuitive explanation of the Bayesian knob to me! Rather than an abstract "alpha" to a "Dirichlet prior", you get something concrete (the number of made up votes). And you get a simple formula, which some would find desirable.

But I agree that the Bayesian approach is conceptually much cleaner. IMO the frequentist approach is just computational corner cutting for when the math in the Bayesian approach gets too involved, which is sometimes useful. What's nice about the Bayesian approach is that you state your assumptions and then it's just turning the math machinery. In contrast, in the frequentist approach the assumptions are interwoven and hidden in arbitrary choices in how the math is done (And then they claim that Bayesians are subjective! It's just that Bayesians admit that they are subjective. Frequentists try to hide the fact that they are more subjective in the math). The not so nice thing is that turning the math machinery is not always so easy and does not always produce fast algorithms. That's where maximum likelihood and friends come in, but I'd view them as an approximation to Bayesian methods.


Yes, the explanation of "imaginary votes" is by far the simplest way of thinking about the situation. Having mathematical formalism and rigorously studied methods makes me feel OK about doing it though; otherwise I would be very uncomfortable with the approach :)

I love Bayesian statistics from a conceptual point of view, but the ease with which one ventures into the land of analytic intractability kind of puts me off more complex models. MCMC is such a clumsy tool (in addition to taking forever); variational methods look interesting to me but I don't really feel they are quite there yet.


I read the article, went "The ^&#$?" and came here to post this, more or less. It's amazing the $&#! people will get up to when they don't know Bayesian statistics.


That really isn't fair to what this is doing. If a Bayesian wanted to lower-bound the distribution, they would be using exactly the same formula, but with an explicit prior. The typical Bayesian solution of a Dirichlet prior has a significantly different convergence rate, 1/n vs 1/sqrt(n), so the resulting ordering is substantially different as well, and I would argue worse for this application. The method of setting a pessimistic prior cares much less about the number of votes for a popular item than the method of lower-bounding the distribution does.


If you think most comments are bad, and you have an explicit prior that reflects this, then your ranking will update to "Hey, the comment was actually good!" at the optimal speed - no more, no less - given the rate of incoming ratings. Why would any other rate of updating be better?


Because there's a high variance on your estimate, so in a lot of circumstances just by variance you'll show users bad comments on top most of the time, rather than showing them the good comment that you are certain of.

In this context, the pessimistic prior is more of a hack than the confidence interval, because your mental model isn't that most comments suck. Your mental model is that most of them are ok. The pessimistic prior is just there to say, if 1000 people confirm that a comment is ok, we should show that over a comment for which we don't know whether it's ok or not yet, and this reasoning is much better modeled by a lower bound.


Here's a paper proposing a solution in that space, and which also compares itself to the article linked here (kind of nice to see... papers sometimes fail to cite stuff that's "only" posted online rather than properly published, even if the authors know about it and it's quite relevant): http://www.dcs.bbk.ac.uk/~dell/publications/dellzhang_ictir2...

I emailed Miller a while ago to see what he thought of this reply, and he thought it also seemed like a reasonable approach. But, in his view, the criticisms of his method within their framework include things that in practice he sees as features. In particular, they view the bias caused by using the lower bound as a bug, but he prefers rankings to be "risk-averse" in recommending, avoiding false positives more than false negatives. Of course, that biased preference could also be encoded explicitly in a more complex Bayesian setup, which would also be a bit more principled, since you could directly choose the degree of bias instead of indirectly choosing it via your choice of confidence level on the Wilson score interval.


I don't think you have to resort to any overly complex machinery to achieve similar behavior. The simplest approach is just to use a non-uniform prior. His pessimistic bound could be emulated by an initial alpha that places more weight on low star ratings; the intuitive interpretation being, roughly, "things are probably bad unless proven good." Another option would be to generate the prior from the posterior distributions of other items: just take the distribution of rating observations for all products of a given type (perhaps only items produced by that company?) to get a sensible prior on a new item in that category.
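
A small Python sketch of what such a pessimistic prior might look like for a 5-star item; the prior weights and the observed counts are purely illustrative:

    # Dirichlet pseudo-counts concentrated on low star ratings: "probably bad
    # unless proven good."
    pessimistic_alpha = [5.0, 3.0, 1.0, 0.5, 0.5]
    counts = [0, 0, 1, 2, 4]           # observed ratings

    posterior = [a + c for a, c in zip(pessimistic_alpha, counts)]
    total = sum(posterior)
    expected_stars = sum((i + 1) * p / total for i, p in enumerate(posterior))
    print(expected_stars)  # stays low until enough high ratings accumulate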

The strength of priors here is that it is very easy to take intuitions and encode them statistically, in an understandable way. Taking the lower bound of a test statistic doesn't admit much in the way of intuition.


I agree the lower bound of a test statistic is a pretty indirect way of encoding intuitions. Somehow I tend to find loss functions the conceptually clearest way of encoding preferences about inference outcomes, though. But, in this case my feeling in that direction isn't very strong, and the priors-based solution seems fine.


There's a distinct difference in the asymptotic behavior though between the lower bound and the prior. The lower bound goes to the mean as 1/sqrt(n), the prior goes to the mean as 1/n.

That makes for a pretty significant difference in practice, and I'm not sure which is preferable.
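
A rough numerical illustration of that difference in Python, holding the upvote ratio fixed at 80% and using an arbitrary prior strength of 10 pseudo-votes:

    from math import sqrt

    def wilson_lower(pos, n, z=1.96):
        phat = pos / n
        return (phat + z*z/(2*n) - z*sqrt((phat*(1-phat) + z*z/(4*n))/n)) / (1 + z*z/n)

    def prior_mean(pos, n, prior_pos=5.0, prior_n=10.0):
        return (pos + prior_pos) / (n + prior_n)

    # The gap to the raw mean (0.8) shrinks like 1/sqrt(n) for the lower bound
    # and like 1/n for the prior-smoothed mean.
    for n in (10, 100, 1000, 10000):
        pos = int(0.8 * n)
        print(n, 0.8 - wilson_lower(pos, n), 0.8 - prior_mean(pos, n))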


You are absolutely correct that they are not mathematically identical. I struggled to word it in a way that would not mislead people, the distinction is important to emphasize.


It has a really big effect I think on the tone of what gets selected at the top. The lower bound prefers things that are preferred by a majority and very popular. The prior method prefers things that are completely un-objectionable and liked by just enough people to be sure of that. My hunch is that with the lower bound you get more interesting things bubbling to the top because it puts a stronger emphasis on popularity.

In all of these models, the giant variable that is completely ignored is the actual choice to rate something at all, versus skipping over it and reading the next one. That's a very significant decision that the user makes. The behavior of each of these systems w.r.t that effect will be the dominant thing differentiating them.


Can anybody replicate their results at Proposition 5? My tests contradict their conclusion, namely that the total score is not monotonic (i.e. when I test I do get that the total score is monotonic).


I'm too much of a Math idiot to understand that paper. Have you seen a code implementation of it anywhere?


(Reply-to-self, too late to edit)

Here are slides from that paper's presentation, for a quicker overview: http://www.dcs.bbk.ac.uk/~dell/publications/dellzhang_ictir2...


> this is one of those cases where a Bayesian treatment is conceptually much clearer

Is there any other sort of case?


Well, I try not to be dogmatic :)


I am a beginning stats student. I took the AP class in high school, and have been doing independent work since: predicting voters, playing with data sets, using a bit of Python to try to make general solutions, and other non-serious but non-trivial projects. Bayesian solutions have always required more work from me, even when they seemed the optimal way to solve a problem. I still don't entirely understand how to work with continuous distributions of hypotheses, partially because I lack the deep mathematical intuition at the moment. I don't blame the math, but I do think that much of the time the Bayesian treatment can be more challenging. I suspect that this is why non-Bayesian techniques remain so popular.


Bayes all the things!


What he said! I'd like to add that a key to squeezing more out of NathanRice's post is the phrase "conjugate prior." Another totally natural thing would be to use a Gaussian prior and likelihood, then update the posterior as ratings arrive. This would take advantage of the ordinality of ratings, as NR suggests. Bishop's Pattern Recognition and Machine Learning goes into this sort of stuff in more depth.
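
For what it's worth, here is a minimal Python sketch of that Gaussian-prior, Gaussian-likelihood update, treating the observation noise as known; all the numbers are illustrative:

    # Belief about an item's "true" star rating, updated one rating at a time.
    prior_mean, prior_var = 3.0, 1.0
    obs_var = 1.5                      # assumed noise in individual ratings

    def update(mean, var, rating):
        # Standard conjugate update for a Gaussian mean with known noise.
        precision = 1.0 / var + 1.0 / obs_var
        new_var = 1.0 / precision
        new_mean = new_var * (mean / var + rating / obs_var)
        return new_mean, new_var

    mean, var = prior_mean, prior_var
    for rating in [5, 4, 5, 3, 5]:     # ratings arrive sequentially
        mean, var = update(mean, var, rating)
    print(mean, var)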


That's heavy on mathematical lingo and I confess I didn't understand it. Can you recommend an online resource that would explain what you just wrote in a newbie-friendly manner?


I would absolutely love to learn more. I've been trying to solve some novel NLP and machine learning problems lately but my lack of statistical knowledge is becoming apparent the further along I get.

Do you have any recommendations for a good introductory treatment of Bayesian statistics?


By far my favorite book on the subject is conveniently available for free on the internet! "Information Theory, Inference and Learning Algorithms" by David MacKay is wonderfully written, well paced and comprehensive. If you like the book, you should purchase a copy, David is a great guy.

http://www.inference.phy.cam.ac.uk/mackay/itila/book.html

Videolectures has some very good videos as well. Zoubin Ghahramani has a pretty solid lecture on Bayesian learning at http://videolectures.net/mlss05us_ghahramani_bl/ (he's a great researcher but not the most engaging speaker). Try Christopher Bishop's lecture at http://videolectures.net/mlss09uk_bishop_ibi/ as well; it might be slightly more palatable.


I want to second MacKay's book. I had a terrible statistics class in college. We spent the entire time looking up tables of p-values and t tests, without a very convincing explanation as to why. The entire topic was damaged for me from then on until I read MacKay online, and then bought the paper version.

His book starts from first principles -- simple ideas about probabilities -- and it builds a foundation for understanding Bayesian methods. And he explains, basically, why the p-values and t-test stuff is a bunch of crap, which was an immense relief.


So I don't need to flinch over the mention of "p hat" again?

Thanks for the resource, this looks fantastic.


Thank you very much for the recommendations.


I love it and I hate it.

Why I love it: It's precise. It's elegant. It's rigorous. It's based upon solid, proven science & theory. It's a perfect application for a computer. And most of all, it does what's intended: it works.

Why I hate it: What human can understand it?

I used to implement the first manufacturing and distribution systems that used thinking like this. They figured, "We finally have the horsepower to apply complex logic to everyday problems." Things like safety stock, economic order quantities, reorder points, make/buy decisions, etc.

But the designers of these systems overlooked one critical issue: these systems included humans. And as soon as humans saw that decision making formulas were too complex to understand, they relieved themselves of responsibility for those decisions. "Why didn't we place an order?" "Because the computer decided not to and I have no idea why."

I suppose the optimal solution is somewhere in between: a formula sophisticated enough to solve 95% of the problem but simple enough for any human's reptile brain to "get it". This isn't it.


>What human can understand it?

Let's start with Wilson's midpoint, since that's just high school math.

     // Wilson midpoint: a weighted average of the observed upvote proportion
     // and 0.5. The 4 is approximately z*z for z = 1.96 (95% confidence).
     def mid(upvotes: Int, downvotes: Int) = {
       val total = upvotes + downvotes + 0.0

       val up = upvotes / total
       val half = 0.5

       val a = total / (4 + total)
       val b = 4 / (4 + total)

       a * up + b * half
     }
So there are two weights a and b. Using these weights, the midpoint is a weighted average of half and the proportion of upvotes.

It should be very clear that if the total becomes large, a goes to 1 and b to zero. At that point you end up using the proportion of upvotes, just like Amazon.

Now, let's bring in the entire confidence interval.

     def wilson(upvotes: Int, downvotes: Int) = {
       val z = 1.96                        // 95% confidence
       val n = upvotes + downvotes + 0.0d
       val phat = upvotes / n

       val lower = (phat + z*z/(2*n) - z * math.sqrt((phat*(1-phat) + z*z/(4*n))/n)) / (1 + z*z/n)
       val upper = (phat + z*z/(2*n) + z * math.sqrt((phat*(1-phat) + z*z/(4*n))/n)) / (1 + z*z/n)
       (lower, upper)
     }
Using the author's data points, the results look like this:

     val itemVotes = List((600, 400), (5500, 4500), (2, 0), (100, 1))
     itemVotes.foreach( x => {
       val s = mid(x._1, x._2)
       val w = wilson(x._1, x._2)
       printf("Up:%5d\tDown:%5d\tMid:%.3f\tW:[%.3f,%.3f]\n",
         x._1, x._2, s, w._1, w._2)
     })

       scala> Up:  600	Down:  400	Mid:0.600	W:[0.569,0.630]
        Up: 5500	Down: 4500	Mid:0.550	W:[0.540,0.560]
        Up:    2	Down:    0	Mid:0.667	W:[0.342,1.000]
        Up:  100	Down:    1	Mid:0.971	W:[0.946,0.998]
Basically, we prefer the lower bound of the confidence interval instead of the midpoint of that same interval.


The problem isn't that it's impossible for humans to understand, nor that one can't hire the best in the field to implement this solution. The problem is that the people who need to understand the outcome won't be able to understand it.

Case in point, the article mentions Amazon. Amazon could easily hire the best and brightest mathematicians and developers to make the implementation completely rigorous and provably correct. However, they're still going to field questions from merchants saying "My product has a 5-star average, but this competitor's product only has a 4.5-star average. Why the hell is my product shown further down in the list?"

Similarly, you may work in the acquisitions department of BigCorp and you need to purchase some equipment. When you prepare your report for Big Boss of your recommendations, how do you explain that you are recommending something which has a 4.5-star average rating instead of the 5-star average rated one?

The ratings may be 100% mathematically sound, but the users of the rating system are not the same as the producers of the rating system. So it's largely irrelevant if the producers made it perfect if the users don't understand what "perfect" means.


They'll only have to field those questions if they display the plain averages despite not sorting on them. If we've established that average rating is the wrong metric, why would you even show it to the user/merchant/whoever? Display the same metric that you sort by, and then the merchants will be perfectly happy with higher-rated items showing up higher on the list, and Big Boss will be happy that you've chosen the #1-recommended option.


I really don't think that solves the problem: you look at product X, it has 0 reviews, you give it 5 stars, and now it shows 1 rating and 3.7 stars?


So you let people rate it with 5 stars, but then show it as 3.7 magic points. Magic points can be your proprietary ranking system.

You don't need to let them rate with stars and also show it as stars.


If I saw that happen, I would probably assume that the item was so popular that a bunch of other people had rated it concurrently, unless the number of ratings were displayed simultaneously. There's a bit of a comprehension barrier in letting people know that the ordering score is not the same as the thing they're inputting, and what the relationship between them is. I don't know how to solve that problem (if I had to, I'd probably just not show the ranking numbers at all; the user already sees the sorting order, so what do they need the raw numbers for?), but that's OK, because I prefer to sidestep it: just don't use a graded rating scale like "5 stars", at least not unless you have a really good justification and explanation of what precisely the scale means.

5-star rating systems tend to result in large numbers of 5s and large numbers of 1s anyway, so you might as well just go with a binary like/dislike system. The fact that a ranking is a real number and your input is an "up"/"down" enumeration makes it clear that the ranking you see is not the same thing as the data that you put in. It could also be useful, from a defusing-confusion standpoint, to make the ranking more complicated by mixing in an age factor (like HN does) or some other extra band of information. Then if a user doesn't immediately understand the relation between their rating and the final ranking, they can easily write it off as being due to some other factor that's not in their control anyway.


You can still do stars, but say you need at least X people to rate before showing any. That's what many sites do.


The problem isn't that it's impossible for humans to understand, nor that one can't hire the best in the field to implement this solution. The problem is that the people who need to understand the outcome won't be able to understand it.

I disagree that people won't be able to understand it - I propose they can't be bothered to understand it. Regular people do not give a shit about algorithms - either the computer/software works or it doesn't. Tech geeks regularly forget this simple fact.

Most people have other things they care about and all they want is to press a button and have an answer come out.

If it's instantly clear how the answer is generated, they think about it for a very short while, agree, and move on with their lives.

If it's not instantly clear how the answer is generated, they assume the computer is right and move on with their lives.


> "My product has a 5-star average, but this competitor's product only has a 4.5-star average. Why the hell is my product shown further down in the list?"

'Because you have fewer votes than your competitor'

I don't think it's that hard to understand.


Right. If you look at most-helpful ratios for product reviews on Amazon, you will see this. 500/600 helpful floats far above 10/10 helpful.


The article also misses the human factor. And intent.

For Amazon, is it better to show the best product first, or to help sell a few of the lacking-rating ones to generate more ratings and work out the uncertainty with real-world data instead of crazy math?


I don't think it is a good idea to misuse your customers for that kind of testing.


I think the original post still stands - try explaining the above to an irate customer.


I've dealt with people who have wanted a retraction of negative comments about their business on the web. I explained it didn't work that way as best I could, but there isn't much you can do. Same with this: if they are irate about a low rating, you play nice, but you don't expect them to learn the ins and outs.


Yeah, most merchants only really want to participate in rating schemes when it puts them above their competitors.


It's neither precise, nor elegant, and certainly not rigorous. On this approach, an item with 1 upvote and 2 downvotes will be ranked below an item with 1000 upvotes and 2000 downvotes, as one commenter pointed out; and in general, bright new items will almost never be presented to anyone. Search on "Bayesian" in the comments for the precise, elegant, rigorous solution below. I won't claim that it's easy to understand, but it's a lot easier to understand than the frequentist ad-hoc version.


I'm not sure that's a "critical issue". 99% of Google users don't care that PageRank is complicated; they just marvel at how good the results are. Just like most redditors simply talk about how great the comments are, and not the math that makes them so.


Amusingly the proposed solution in this blog post is almost line-for-line exactly the method Reddit uses to rank comments - proof, I think, that the method isn't "too complicated."

When you're working with a system managing orders, where there's a direct human interaction with the algorithm (like the top-level poster is), simplicity is somewhat important. When you're working with a system that's designed to look like magic to the end user, making the system more magical is often better, because it makes it less intuitive to game.


The math doesn't make the comments good. The math makes good comments easier to find, and buries comments with inadequate favor or confidence.

Besides, the math made no effort to write the sentences. ;)


I actually find issues with the "best" comment sorting, in some cases. I think it generally is the "best" sort, but in very popular posts it tends to break down. I often find comments with 5x or more upvotes buried down below the top few comments.

I think as comments get closer to the top, people start voting them down more, which might not be the case for comments which are rising (theory: people reading further down in the comment section are possibly more thoughtful and more likely to upvote, while at the top the ADHD crowd might be more likely to knee-jerk downvote?).

In any case, I've recently changed my default sort to "top" and feel it's an improvement.


I think you're overlooking a critical component of the Google equation: the supply.

Webmasters *constantly* complain about the impenetrability of the 'algorithm' and how it constantly changes for secret reasons.

So you're right in a way---the demand side doesn't care how you manage to curate supply, but supply is intimately aware of your curation and worries about being unfairly demerited.


what?

There's nothing wrong with the math. It just requires better educators to explain it, with analogies and metaphor. Why would you compromise the data/outcome in a hope to simplify the problem?

Further: wouldn't you look to hire (and train) the best people who understand the domain they are in, thereby being able to judge whether the outcome of an equation is valid or not?

Hint: insurance/loan underwriters regularly end up in situations where the human component of a transaction may look different than the data, and react accordingly...


There's nothing wrong with the math.

I never said there was. In fact, I praised it as an elegant solution.

It just requires better educators to explain it, with analogies and metaphor.

You're right. In theory. In practice, no one does this, mainly because they can't afford it. You're implementing technology that costs $200,000 to save $100,000.

Why would you compromise the data/outcome in a hope to simplify the problem?

Actually, simplifying the problem reduces the compromise when humans are involved. So, no.

...wouldn't you look to hire (and train) the best people who understand the domain they are in, thereby being able to judge whether the outcome of an equation is valid or not?

Only if it made economic sense to do so. I am not going to replace 800 workers earning $12/hour with "the best people who understand the domain" because the computer suddenly has formulas that people don't understand. Experience has shown repeatedly that workers, at any level, simply stop caring when they feel powerless in the face of "solutions" like OP's Wilson formula. That abdication of responsibility almost always far outweighs any incremental benefit that a more sophisticated but incomprehensible formula introduces.

Good points, nice discussion, but please, for the sake of this community, don't reply with words like "what?", "Further:", or "Hint:". You made your point without the snarkiness.


I'd imagine the right way to handle this would be Reddit style: Implement the naive solutions (i.e. sorted by average score) alongside the top-performing solutions.

Add in a UX that makes it obvious that there's a choice to be made and allows the user to easily evaluate the results of the naive solution alongside the optimized solution, and end users who are themselves evaluated on overall performance metrics will tend to choose the more effective solution whether they understand it or not.

The addition of metrics to allow for side by side evaluation of the simple strategy versus the optimized strategy is key - someone choosing the unintuitive yet superior strategy will likely have to explain themselves to management at one time or another.

An interesting parallel is NFL on-field playcalling decisions. There are plenty of instances where coaches make choices that are statistically unsound but reinforce the image of a tough, responsible coach. Conversely, there are coaches who play by the odds and get lambasted when their decision doesn't pan out.


re the what/further/hint: good point - pre-coffee grumpiness. :) sorry.


Workers who feel powerless by sophisticated solutions need to learn the solutions, and not be intimidated by them.

Yes, assembly line workers need simple steps to do a job. But that of course isn't who this is referring to. In your example managing inventory isn't an assembly line, checklist job.

You're not replacing 800 workers with the best. In your example, you wouldn't have 800 people managing inventory. You replace your inventory manager with someone who understands the math (if you want to manage your inventory that way).


Sufficiently advanced sophistication is indistinguishable from obfuscation. And obfuscation is a problem that's not ideally solved by hiring smarter people.

I think that's the point edw519 was trying to make.


> It just requires better educators to explain it, with analogies and metaphor.

So... anyone want to take a stab at explaining that equation to those of us who don't really get it?


If a comment has one upvote and zero downvotes, it has a 100% upvote rate, but since there's not very much data, the system will keep it near the bottom. But if it has 10 upvotes and only 1 downvote, the system might have enough confidence to place it above something with 40 upvotes and 20 downvotes -- figuring that by the time it's also gotten 40 upvotes, it's almost certain it will have fewer than 20 downvotes. And the best part is that if it's wrong (which it is 5% of the time), it will quickly get more data, since the comment with less data is near the top -- and when it gets that data, it will quickly correct the comment's position. The bottom line is that this system means good comments will jump quickly to the top and stay there, and bad comments will hover near the bottom.

[1] http://blog.reddit.com/2009/10/reddits-new-comment-sorting-s...


That much was fairly clear on the site. I'm talking about the math itself.


Wilson's 1927 paper is freely available [1]. I can't say I have brushed up on my statistics enough this decade to verify the math but the basics as I read them are as follows:

- You have a normal distribution (bell curve) of data points, in this case quality scores.

- You wish to sort these points based on their respective vote totals. Any given data point has pos positive votes out of n total votes for that item.

- You have a confidence interval, e.g. 95%. This confidence is expressed in terms of the bell curve, so a 95% confidence is within 1.96 standard deviations of the mean [2].

- You have a Ruby function accepting the aforementioned n, pos, and confidence variables and returning a decimal value representing the normalized confidence_interval_lower_bound, that is the quality score that our input data point has a 95% chance of meeting or exceeding.

- Given a set of data points, evaluate the ci_lower_bound for each, and then sort them accordingly. The results will give you a best-guess sorting that accounts for the fact that some data points will have more votes cast for/against them than others.

[1] http://www.med.mcgill.ca/epidemiology/hanley/tmp/Proportion/...

[2] http://en.wikipedia.org/wiki/1.96
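
For the code-minded, here is a Python sketch of the ci_lower_bound function those bullets describe (the article's version is in Ruby); the vote counts just echo examples from elsewhere in the thread:

    from math import sqrt

    def ci_lower_bound(pos, n, z=1.96):
        # Wilson score lower bound at ~95% confidence (z = 1.96).
        if n == 0:
            return 0.0
        phat = pos / n
        return (phat + z*z/(2*n) - z*sqrt((phat*(1-phat) + z*z/(4*n))/n)) / (1 + z*z/n)

    # Sort items by the quality score they very likely meet or exceed.
    items = [(600, 1000), (5500, 10000), (2, 2), (100, 101)]   # (pos, n)
    print(sorted(items, key=lambda it: ci_lower_bound(*it), reverse=True))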


With zero knowledge of statistics, here's how to think about it: Wilson wants to construct some score. That score can be as low as 0 and as high as 1. So he wants some interval [x,y]. The center of that interval would obviously be c = (x+y)/2. Wilson first decides what c to pick. So Wilson says, the center c must be decided by the proportion of upvotes. But then he thinks, c must also be close to a half. So he says, okay, let's figure out c using some sort of an average. So he chooses two weights a and b. The weighted average would then be a times half plus b times the upvote proportion. He chooses those weights in such a fashion that if you have lots of data, one of the weights vanishes and the other becomes unity. So the weights matter only if you have too few upvotes and downvotes.

Having figured out the midpoint c, Wilson has to actually figure out the lower bound x and upper bound y. Now he draws a distribution centered at c... OK, so at this point you would need to know what a distribution is, why you would need one, whether that distribution has a skew, whether it's homoskedastic, and so on, which is stats 101, so I won't go there. But if you've gotten this far, you should be able to at least see the intuition behind Wilson's procedure.


Say the 'actual' rating for an item is p (i.e., if you got an infinite number of people voting, the ratio of upvotes to total votes is p).

Now say your users vote, and they upvote with that probability p. The number of upvotes k you get out of n votes will follow a binomial distribution B(k; n, p). The binomial distribution has mean np and stdev sqrt(n p(1-p)), and is very close to Gaussian in shape. Since the stdev is a rough measure of the 'width' of the distribution, a common way to describe the error is (mean +/- lambda*stdev), where you can tune lambda to your desire. If you increase lambda you get a wider confidence interval, and therefore more certainty that a measurement will be within that confidence interval.

Now, say you measure k upvotes out of n votes. You can divide by n to get p0, your estimated rating based on those votes.

An easy estimate for the error of this measurement is to assume that p0 is approximately correct and equal to p. Then the expected number of upvotes would be n p0 with stdev sqrt(n p0(1-p0)). Divide this by n to get the fraction of upvotes, giving a final estimate for p of p0 +- lambda sqrt(p0(1-p0)/n).

Now, your estimate (p0) of p is not quite right, and therefore your estimate of the error (which depends on p0) is not quite right either. The wilson score attempts to correct for that. We don't know p, but imagine if we did, we would expect any measurement p0 to be in the range p +- lambda sqrt(p(1-p)/n). That is, we expect

        abs(p - p0) < lambda sqrt(p (1-p)/n)
If you solve this inequality for p in terms of p0, you get the formula given in the article, i.e. the confidence limits for p given p0.
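
As a sanity check, here is a small Python sketch that solves that quadratic numerically and compares the roots with the closed-form Wilson expression; the inputs are arbitrary:

    from math import sqrt

    p0, n, lam = 0.6, 1000, 1.96

    # Squaring abs(p - p0) < lam*sqrt(p(1-p)/n) and rearranging gives
    # a*p^2 + b*p + c < 0; its roots are the confidence limits.
    a = 1 + lam*lam/n
    b = -(2*p0 + lam*lam/n)
    c = p0*p0
    disc = sqrt(b*b - 4*a*c)
    lower, upper = (-b - disc) / (2*a), (-b + disc) / (2*a)

    # The closed-form Wilson bounds from the article.
    wl = (p0 + lam*lam/(2*n) - lam*sqrt((p0*(1-p0) + lam*lam/(4*n))/n)) / (1 + lam*lam/n)
    wu = (p0 + lam*lam/(2*n) + lam*sqrt((p0*(1-p0) + lam*lam/(4*n))/n)) / (1 + lam*lam/n)
    print(lower, wl)   # these agree
    print(upper, wu)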


I thought he was talking about the end user. They may not understand why something with 209 thumbs up and 100 thumbs down ranks better than one with 5 thumbs up (100% positive).


I hate it exclusively. It's clearly better than (1) and (2), but...

I talk about this with some of my med school friends that are interested in/want to create/hate/fear automated diagnosis. At one level, having appropriate statistics to make use of a wealth of prior experience, worldwide prevalence, epidemiological data, &c is basically a requirement for the future of proper healthcare.

It's also an obvious terrible mistake to ever convince people that these statistics are anything besides a decision making tool.

I think the right usage of statistics is to enlighten and confuse simultaneously, not to "answer". They should provide analytical depth to a decision, never an escape route.

For this reason, I think proper statistical application is an intersection between mathematics, computer science, and UI design.


> I think the right usage of statistics is to enlighten and confuse simultaneously, not to "answer". They should provide analytical depth to a decision, never an escape route.

You're right, but I would go further. I would argue that when statistics are presented as "the answer," in many cases they are being abused. A good scientist recognizes the limits of his dataset. Like how a computer program can only do what it is written to do, statistics can only tell you what is in the data, they can't tell you what is _not_.


No matter how much statistical data you reveal, you still need to present the results in some particular order. And this formula is the best ordering function I know of for things that are rated.


I think the situation is much more informatively modeled by a partial order though. You can induce a total order on a partial order and you can use this formula to do so conservatively, but maybe the real way to solve the problem is to create an intuitive way for users to appreciate partial orders.


Operations research and hard systems modeling are neat. The only gripe I have is that oftentimes the models make simplifying assumptions which hold tenuously at best. One example of this is the Gaussian copula used in VaR. In actuality, neither the marginals nor the copula are Gaussian in the large majority of cases; they ARE stable distributions, but the tails can be absolutely HUGE. The Gaussian distribution is attractive because it is well behaved, and in most cases we don't have enough observations to properly understand the full tail structure of the underlying distribution. As a result, you get quants that make a model and call a small subset of points (which are completely valid) "outliers". Then when we observe the tails of the distribution, things fall apart, because over-leveraged bankers made decisions based on the notion that the model accurately reflected reality.


You have overlooked a critical point: no one has to mess with this (otherwise quite simple) formula anymore, except the programmer, and even he already has a working piece of Ruby code.


»The computer decided not to and I have no idea why

This just means you need a better notification system, not a dumber model. I am fairly sure Netflix and Pandora use complicated recommendation algorithms - we don't complain because the complexity is abstracted away behind labels like "dark cerebral thrillers" or "songs with rhythmic gyrations like Pretty Lights".


Right, the solution here is that if you're going to sort by Bayesian average or Wilson score confidence interval or something besides what we're familiar with, reframe it to avoid my "something's wrong" detector. Hence the fact that people accept "trending" or "popular" but if I sort by "rating" and the results don't make sense because I expected an average, I don't think "I should probably understand this algorithm". I think "this site is broken and dumb".


And as soon as humans saw that decision making formulas were too complex to understand, they relieved themselves of responsibility for those decisions. "Why didn't we place an order?" "Because the computer decided not to and I have no idea why."

And somehow the solution isn't to have the computer output a detailed description explaining why it arrived at the decision that it did?


Math is hard, let's go shopping!

If you think this is too difficult then maybe you shouldn't try to design scoring/voting systems.


> "Why didn't we place an order?" "Because the computer decided not to and I have no idea why."

This may or may not be a problem. Remember, human intuition evolved to be able to give a fast answer to any question in the face of insufficient data, not an optimal answer to specific questions with sufficient data. In many domains, once you have appropriate data and the means to statistically analyze it, an algorithm using such analysis can reliably outperform human experts - and in some cases, the problem was the political one of how to get the human experts to accept this, take their hands off the wheel and stop second-guessing the algorithm.

In other words, if you are dealing with such a domain, you may be better off to count your blessings, let the computer do its job and save the humans for tasks that can't be handled by a formula.


To get a general idea, you can compare this formula with the simple version mu - 3 sigma (be careful when the number of samples is small), where mu is the average, sigma is the standard deviation, and 3 is a parameter that depends on the confidence (~93%). When the number of samples (n) is big, the two formulas give similar results.
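
Here is a small Python sketch of that comparison, reading mu as the observed upvote proportion and sigma as its standard error, which is one possible reading of the above:

    from math import sqrt

    def wilson_lower(pos, n, z=1.96):
        phat = pos / n
        return (phat + z*z/(2*n) - z*sqrt((phat*(1-phat) + z*z/(4*n))/n)) / (1 + z*z/n)

    def simple_lower(pos, n, k=3.0):
        phat = pos / n
        return phat - k * sqrt(phat * (1 - phat) / n)

    # Both approach the raw proportion (0.8) as n grows.
    for pos, n in [(8, 10), (80, 100), (800, 1000), (8000, 10000)]:
        print(n, wilson_lower(pos, n), simple_lower(pos, n))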


The way I think about it is this: comprehensibility is a feature. You have to weigh it against accuracy, effectiveness, etc., but it should be taken into account.


Put this code in a class or a library and let a smarter programmer maintain it.


Meh. "Why did you return the list of reviews in this order?" - "Here's the link to Wikipedia explaining it".

The formulas aren't "too complex to understand", the people are being too lazy to take the time and understand them. No reason to dumb it down and ruin it for the rest of us.


Original author here. For the academically inclined, there is a critique of this approach in this paper:

http://www.dcs.bbk.ac.uk/~dell/publications/dellzhang_ictir2...

Of course, I think the authors miss the point of the algorithm, since I basically wanted a system that is one-sided (i.e. false negatives are OK but false positives are bad).

Also, if you deal with more than two outcomes you might be interested in multinomial confidence intervals, described here:

http://www.math.wsu.edu/faculty/genz/papers/mvnsing/node8.ht...

The application to 5-star systems is not straightforward, since it's not clear to me how stars relate to each other. Is it a linear scale? Are they discrete buckets? Or maybe we want to use Tukey's froots and flogs? I'm not sure.

By the way, I'm coming out with a stats app for Mac soon that implements this algorithm and much more. Drop me your email address if interested:

http://wizard.evanmiller.org/


I appreciate people who take the time to apply math to things in the real world, and share it with non academic crowds. Thanks for that.

5-star rating systems are obnoxious. From a mathematical perspective, if you treat them in an ordinal fashion they are poorly behaved, and if you treat them categorically, you lose the relationship between stars. There seems to be some popular movement towards binary rating systems, and I think that is great. Not only do people tend towards binary rating behavior in the real world (only rating a movie they thought was very good or very bad), but binary systems also admit a much cleaner mathematical treatment.


> 5 star rating systems are obnoxious. From a mathematical perspective, if you treat them in an ordinal fashion they are poorly behaved, and if you treat them categorically, you lose the relationship between stars.

Helping out a friend with a statistics test, I recently read up on the Wilcoxon signed-rank test [1]. This one is intended to get a p-value for experiments with "before" and "after" measurements, but as I understand it, the idea is to take the ranks of a not-very-normally-behaved random variable and turn them into a summation of lots of things, so that, thanks to the central limit theorem, you can treat the result as a normal distribution again.

Though thinking about it, in this case it's the rank we're after, so maybe it's not useful at all. But it gives an interesting idea of the tricks you can pull if your input data isn't quite the sort that you can analyse very well.

[1] http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test


I really appreciated the post. Unfortunately we're using 5 stars, and need to do it a bit less wrong.

The main thorniness of 5 stars is that you have to answer the question of what the difference in star ratings actually mean. Is going from one star to two stars the same as four to five? Probably not, based on the way users rate, which means an algorithm like the arithmetic mean that treats them the same is wrong.

Personally, I think it's very reasonable to treat rating stars the way we should treat grades: as ordinal data, where we know that a higher rating is better but assume nothing beyond that. The difference between an F and a D is not the same as an A and a B, and the same is likely true of 1-2 stars and 4-5 stars.

I have made an attempt at applying this idea to the Ubuntu App Store's rating algorithm. I'm very much interested in comments. https://bugs.launchpad.net/ubuntu/+source/software-center/+b...


"we know that a higher rating is better."

Typically, you do not know that, especially if you are comparing across reviewers. Some will only score zero and 5 stars, others will have 10% two stars, 80% three stars, and 10% four stars, yet others will have 10% three stars, 80% four stars, and 10% five stars.

If you have sufficient data (rare), it may be possible to (somewhat) correct for that. IIRC, this was something that helped in winning the Netflix challenge.

For example, http://www.netflixprize.com/assets/GrandPrize2009_BPC_Pragma... models the assignment of stars as two parts:

- modeling the user's appreciation of the movie

- modeling how the user translates his appreciation to a star rating

"where we know that a higher rating is better but assume nothing beyond that"

Problem with that is that you throw away information with that assumption. You do know that 2 star scores are very unlikely to be about very good items.


Mathematician not a statistician...

Would it be reasonable for 5 stars to normalise the data? Should star ratings be on some distribution, for instance?


In the binary case the usual treatment in statistics is to use the logistic function (and logit) to work with real numbers, then transform back into probability space as the last step.

This is a little flakey for ordinal numbers, and the usual treatment is to use a learning algorithm to find a mapping from real numbers to ordinal values, either explicitly (if you need a "score") or implicitly. Support vector machines, radial basis functions and neural networks are typically used.


I've just seen a screenshot on youtube. It looks interesting.


Are we going to see an Nginx module?


There are a lot of comments complaining about how complicated the math is. This shouldn't be all that hard to understand.

The assumption is that there's some constant underlying probability p that a random person will rate a given thing positively. If we observe, for instance, 4 positive and 5 negative reviews or votes, there's a probability distribution (known as a Beta distribution) which tells us what the possible values of p are given the votes we observe; it's proportional to p^4 (1-p)^5. graph: https://www.google.com/search?q=x%5E4+(1-x)%5E5%20from%200%2...

Now if we observe 40 and 50, respectively, the curve looks like this: https://www.google.com/search?q=exp(20+%2B+40+log(x)+%2B+50+...

(I had to do it in the log domain because Google's grapher underflows otherwise -- the 20 is just to make the numbers big enough to graph. The more correct thing involves gamma functions and that just gets in the way right now)

The more you observe, the more sharply peaked the likelihood function is. The funky equation in the article is an approximation to the confidence interval of that graph -- 95% of the probability mass is said to be within those bounds.

It's not a great approximation, for one because the graph is skewed (try it with 10/50) and it assumes that the mean is exactly in the middle of the confidence interval. The correct computation involves the inversion of a messy integral called the incomplete beta function. Scipy has a package which includes betaincinv which solves this more exactly:

>>> import scipy.special

>>> scipy.special.betaincinv(5,6, [0.025, 0.975])

array([ 0.18708603, 0.73762192])

would be the 95% confidence interval for 4 positive and 5 negative votes;

>>> scipy.special.betaincinv(41,51, [0.025, 0.975])

array([ 0.34599562, 0.54754792])

for 40 and 50, respectively.

[edit: apologies, I had to run and get ready for work -- I didn't really have time to make this very comprehensible; but i just now fixed a bug in my confidence interval stuff above]


Or, in layman's terms, "If we rounded up the entire population and forced every single person to carefully review this item and issue a rating, what's our best guess as to the percentage of people who would rate it positively?"

And to make the description slightly more accurate, at the expense of more complexity: "What number are we 80% certain the approving percentage will exceed?"


Very interesting. So would you say developers should probably use the incomplete beta function, rather than Ev's method? Or is it too computationally expensive?


I haven't investigated it in depth -- quadrature over a single variable as this is can be pretty quick to compute. Not sure how scipy does it.

Anyway, I personally think 95% confidence intervals are a crutch. The correct Bayesian approach is to consider two items, each with their own up and down votes, integrate over all possible values of p1 and p2 (the underlying probabilities of upvotes for items 1 and 2, respectively) given the observed data, and compute the likelihood of superiority of p1 over p2.
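
A Monte Carlo sketch of that superiority probability in Python (assuming NumPy and a uniform prior on each p); the vote counts are made up:

    import numpy as np

    rng = np.random.default_rng(0)
    u1, d1 = 10, 1       # item 1: 10 up, 1 down
    u2, d2 = 40, 20      # item 2: 40 up, 20 down

    # Draw from each item's Beta posterior and count how often item 1 wins.
    p1 = rng.beta(u1 + 1, d1 + 1, size=100_000)
    p2 = rng.beta(u2 + 1, d2 + 1, size=100_000)
    print((p1 > p2).mean())   # estimated P(p1 > p2)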

How to turn that into an actual ranking function? No idea. I doubt it would work, but you could compute against a benchmark distribution (i.e. the uniform 0-1 distribution).

If you do that, it probably turns out that your ranking function is the mean of the Beta distribution, which is simple: (U+1)/(U+D+2) where U and D are the upvote/downvote counts [note: we started with the prior assumption that p could be anywhere between 0 and 1, uniformly]. Basically, the counts shrink towards 1/2 by 1. This is a hell of a lot less complicated, and it achieves the goal of ranking different items by votes pretty well with more votes being better.
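
A tiny Python sketch of that ranking rule, reusing vote counts from earlier in the thread:

    # Mean of the Beta posterior under a uniform prior: (U + 1) / (U + D + 2).
    def beta_mean_score(up, down):
        return (up + 1) / (up + down + 2)

    items = [(600, 400), (5500, 4500), (2, 0), (100, 1)]   # (up, down)
    print(sorted(items, key=lambda it: beta_mean_score(*it), reverse=True))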


While it is good to look at these sorts of mathematically rigorous algorithms, I think I would be frustrated if it was used everywhere. Or, well, maybe not me perhaps, but a non technical user.

The beauty of the second algorithm for rating products is that it is straightforward. Having never seen it before I can deduce that 5 stars come before 4 stars and more reviews come before fewer. If I want to skip ahead to the 4 stars I know what to do. I can internalize the sorting algorithm easily. And as a user, understanding the order items are presented to me is important.

If Amazon were to use the last algorithm and present items in that order (assuming we accounted for the 5-star vs positive/negative issue), it would look like a random order to most users and would be frustrating.

So I guess what I am saying is that this algorithm is very clever, but in some cases, it may be too clever. Sometimes you just want to keep it Simple Stupid.


The second algorithm, in my experience, is too simple, though. When browsing Amazon I'm pretty regularly annoyed by an item with one 5* review appearing ahead of an item with hundreds of 4* and 5* reviews.

One simple fix would be to avoid calculating an average until a minimum number of ratings have been given. But I do think the statistical way is lovely. If I were Amazon I'd give it some kind of snappy trademarked name and push it as a feature.


Instead of displaying stars, Amazon could display a percentage, which under the hood represents the Wilson confidence number. It would be totally intuitive to browse: first come all the 100% items, then the 99's, and so on.


You can't use Wilson's confidence with a star-rating system. Wilson's only works for binary systems.

Instead you could use a weighted Bayesian rating:

br = ( (avg_num_votes * avg_rating) + (this_num_votes * this_rating) ) / (avg_num_votes + this_num_votes)
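
A quick Python sketch of that weighted rating; the catalog-wide averages here are made up:

    # avg_rating and avg_num_votes would come from your whole catalog.
    def bayesian_rating(this_rating, this_num_votes, avg_rating=3.8, avg_num_votes=25):
        return ((avg_num_votes * avg_rating) + (this_num_votes * this_rating)) \
               / (avg_num_votes + this_num_votes)

    print(bayesian_rating(5.0, 2))     # two 5-star reviews: pulled toward 3.8
    print(bayesian_rating(4.5, 400))   # many reviews: stays near 4.5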


er... and the problem with bucket categorizing?

80-100% = * * * * *

60%-79% = * * * *

etc..


How is it a problem if the five-star reviews display first, then the four-star, and so on?


The point is we have to determine how to define a five-star item, a four-star one, etc. Currently, an Amazon item's star value is the average of the star values of every review. The author is saying that that's a bad way to compute the item's star value. The author would argue an item with only two reviews that are both fives should have a lower star value than an item with 400 fives and 1 four. We typically associate stars with the averaging algorithm (i.e. we define an item's star value as the average of the star values of its reviews), so it might help if we do away with the notion that each item has a star value, and just think of this as saying an item with 400 reviews of 5 stars and 1 review of 4 stars should be shown before an item that just has 2 reviews of 5 stars.

Currently, when we see an item's star value, we think of it as an indicator of the quality of the item. But if it's just the average of the star values of every review, the author would argue that we're not going to get an accurate indicator of quality. The author argues that whether the quality indicator of an item is expressed in stars or percentages, that value should be determined by the third algorithm, not the second, and that the order the items are shown in should be the result of sorting those quality indicators.


But how is Simple Stupid in the Amazon case a better output for the user? Do you, as an Amazon shopper, really believe that the item with one 5-star review is a better bet for you than the item with 580 reviews and an average of 4.5-stars?


I don't, but I can intuitively grasp that a 5-star item with 2 reviews is not reliable. Since I understand how the sorting works, I know I can jump ahead to the 4.5-star items in order, check how many reviews each has, and if that number is also small, jump ahead again to the 4-star items.

The point is, I understand the sorting order and can work around it if I am not satisfied with what is presented to me. Having a very esoteric algorithm is a risk. Maybe you'll present just what the user really wanted; but if you get it wrong, they will be at a loss to do anything about it. I tend to dislike systems that leave users helpless when something goes wrong.


I always assume that initially there are q voters who gave the average rating. This yields the following formula:

(pn) / (n+q)

This is simpler and gives similar results:

http://www.wolframalpha.com/input/?i=plot+%28p+%2B+z^2%2F2n+...

http://www.wolframalpha.com/input/?i=plot+%280.8+*+n%29+%2F+...


I implemented this in a rating system once. Got multiple bug reports, people complained that the system calculates averages wrong, because there are two ratings and the average is obviously not the number they are seeing.


Why show the second figure to the user? They don't need to see the calculation you made to determine the sort.

Just sort the objects in the order determined by this formula and only show the ratings given by users in the interface.


This was a system where I was supposed to display a star-rating (1-5 stars, with fractional stars as well) for each item.

The people reporting the problems were the users, and yes, this was bias, as they expected averages. That was exactly my point — while the statistics behind this method are sound, it is not what people expect. Building systems that don't do what people expect is difficult.


This is mildly amusing because above in this thread the exact opposite comment/response happened; someone said they used it for sorting but showed the mean next to items, and got bug reports that 4.5 was sorted above 5.0. The reply was that he should have just exposed the numbers that they are sorting by as the visible rating in the interface.


I'm curious, who filed the bugs - the users of the system or the fellow builders of the system? If you rolled back this algorithm based on the bug reports, it sounds like you might have ended up listening to the prior bias of the people on your team, rather than the feedback of the people meant to use the system.



It's well explained, informally. The giant equation sitting there without clearly defined parameters is mostly just showing off though. The final "QED" solution that you put at the end of a paper is not the proper form to introduce a concept.

But... so what? Amazon and Urban Dictionary are hardly failing in the market due to their "incorrect" score sorting. The whole problem is a heuristic, it's not amenable to rigorous treatment no matter how many giant equations you club your audience with.


It's been fine in those contexts, but imagine if HN ranked +1/-0 above +100/-1


It would be instantaneously annoying to someone, who would downvote it and push it off the front page. So the net effect is that the frequency with which you saw "low vote garbage" would increase, but probably not by much. And IMHO a reasonable argument can be made that this is a good thing, because it increases visibility for new posts.


For those who do not understand the Wilson algorithm, see this post which was on HN recently, explaining how it works in a little more detail: http://amix.dk/blog/post/19588

(I agree with other commenters that it is complicated and lacks common sense to average users, but I feel like I have a general understanding of the concept thanks to the above link)


Every time I see that page, I see the equation, I read statistical terms and I get overwhelmed. I use PHP so I have no pnormaldist. Would love to use it for some random page I run.


I was overwhelmed too (here's simple wrong solution #1, here's simple wrong solution #2, here's incredibly complicated but more correct statistical solution #3 that you may not understand) but since this is relevant to my day job and we use PHP, I did find this:

http://www.derivante.com/2009/09/01/php-content-rating-confi...

I'm not a statistician so I can't speak to the correctness of this implementation.


Here's my version that we use on MyBankTracker.com. We hardcode the z variable, using a significance level of 0.05.

https://gist.github.com/2292254


"if you don't have a statistics package handy or if performance is an issue you can always hard-code a value here for z. (Use 1.96 for a confidence level of 0.95.)"


Ouch, I managed not to see that at all. Thanks!


Googling "Wilson score confidence interval php" returns an implementation as the top result.


See this example: http://www.derivante.com/2009/09/01/php-content-rating-confi...

However, if you install the stats package for PHP you can use stats_cdf_normal() as well.


Aren't there libraries for PHP like for other languages?


There was an article about this a few years back: http://blog.linkibol.com/2010/05/07/how-to-build-a-popularit...

I've found, in practice, a Bayesian weighted average is easy to implement and pretty effective. It's also a good candidate for "stream" processing (i.e., calculating in a single pass)
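
A minimal sketch of that kind of Bayesian weighted average in Python. The prior mean m and prior weight C are placeholder values to tune (e.g. the site-wide average rating and a typical ratings count), not anything prescribed by the linked post:

    def bayes_weighted_avg(ratings, m=3.5, C=10):
        # Single pass: only a running sum and count are kept, which is why
        # this works well in a streaming setting.
        total, n = 0.0, 0
        for r in ratings:
            total += r
            n += 1
        return (C * m + total) / (C + n)

An item with a single 5-star rating then scores (10 * 3.5 + 5) / 11, roughly 3.6, instead of a flat 5.0.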



Here's an even simpler way to think about it: it's the left point of the standard 95% confidence interval from the Central Limit Theorem plus a hack for small sample sizes. The Wikipedia page says the hack is almost equivalent to estimating p = (X+2)/(n+4) i.e. assuming each item starts with two upvotes and two downvotes.
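
If I'm reading that right, the simplified version looks roughly like this (an approximation to the Wilson lower bound, not the article's exact formula):

    import math

    def approx_lower_bound(pos, n, z=1.96):
        # "Add two successes and two failures", then take the usual
        # CLT-style 95% lower bound on the adjusted estimate.
        p = (pos + 2) / (n + 4)
        return p - z * math.sqrt(p * (1 - p) / (n + 4))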


I've put some thought into metrics as well. A few other alternatives suggest themselves:

- Considering the standard deviation of ratings. On a 5-point scale, an item that averages 3 because its ratings are split between 1s and 5s differs from one that gets mostly 3s. The latter is a middling fit for anyone; the former has an enthusiastic but niche audience. If you're looking at sales, the former can be a valuable product if properly marketed.

- An item that gathers few votes, regardless of favorability, can exhibit multiple problems. One is that it isn't well marketed, publicised, or known. Another (particularly on content sites) is that there's very likely a sampling bias (mutual admiration society / negging attack / vote stuffing). I've tended to favor systems which take the total volume of voting into account, generally on a ln(n) basis, though not out of any particular statistical rigor. As an implementation, you'd start with a 5-point Likert score, then multiply by, say, ln(n+1), avoiding a zero multiplier on a single vote (a quick sketch of this follows the list).

- The pattern of ratings over time and space (IP or geographical) may reveal both opportunities for marketing and/or issues with your ratings system. Since any effective quality proxy will be abused, you've got to be sensitive to the latter.
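
A minimal sketch of that ln(n+1) volume weighting in Python; the weighting factor is the heuristic described in the list above, not a statistically derived quantity:

    import math

    def volume_weighted_score(avg_rating, n):
        # avg_rating: mean of a 1-5 Likert scale; n: number of votes.
        # ln(n + 1) rewards volume with diminishing returns and avoids a
        # zero multiplier when there is only a single vote.
        return avg_rating * math.log(n + 1)

For example, a 4.8 average from 3 votes scores about 6.7, while a 4.2 average from 200 votes scores about 22.3.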

The Wilson score is an improvement over multiple other methods, but it still assumes a relatively unbiased estimator and unbiased rating behavior. My feeling and experience is that excess reliance on any one metric is likely to cause problems -- reality is multidimensional, and metrics for assessing it should be as well.

There's also the question of whether or not you want to make specific recommendations for an individual, or general recommendations for a population. In the former case, correlating other rankings or behavior may give a better fit (and the Wilson score may still be useful).

Though for a suitably specific goal (marketing, suitability, revenue potential) a single encompassing metric may work.


The one example I've found of a site that does average ratings really well is Steepster: it picks teas that you have previously rated and indicates the rating you gave to them. This way the user's rating is much better informed and gives you a much more meaningful mean.


Discussions from earlier submissions are also interesting:

http://news.ycombinator.com/item?id=1218951 <- 31 comments

http://news.ycombinator.com/item?id=478632 <- 56 comments

Further, I hope JoshTriplett (http://news.ycombinator.com/user?id=JoshTriplett) isn't too disappointed that when he submitted this exact same item 2 days ago it got one upvote and no discussion. In submitting to HN, as with comedy, timing is everything. http://news.ycombinator.com/item?id=3784912


Urban Dictionary no longer sorts by positive - negative, see e.g. http://www.urbandictionary.com/define.php?term=usa. I don't know what they use now.


Doesn't seem to have improved the relevance of the definitions:

"USA: The only country keeping penguins from coquering [sic] the Earth"


So, anyone have the formula for 5-star ratings?


This is awesome. This is perfect for what we need for our startup. We are going to use this. We won't need to worry about the negative aspects listed in these comments due to our use case. Wow. Thanks, HN. :)


But shouldn't the solution (formula) be "simply elegant"? eBay seems to be on to something with its positive / (positive + negative) rating system. The user knows how many data points are in the pool, which overcomes the problem of a single positive rating earning five stars. Much in the same way, http://demanddriventech.com/home/solutions/replenishment/ has come up with a "simply elegant" formula for supply chains that is human-understandable and effectively solves the problem.


My favorite part was looking at the code and seeing a variable named `phat', and then looking up at the equation to find `p^' (p-hat).


What I don't get is how in 2012 sites like Amazon are still making this mistake. Amazon is a company that, much like Google, spends millions analyzing user behavior and trying to optimize the workflow (checkout, in their case).

This has been the number one complaint I have against Amazon for the past 10 years. And they haven't done a thing about it?


Well, you'll be glad to know that it changed over a year ago. See for example: http://www.amazon.com/Stand-Mixers-Small-Appliances-Kitchen/...


It seems to me there's still a problem at the point of data collection. Not all +1's are equal.

This algorithm needs to be paired with another algorithm that weights each plus one according to each user's ability to plus one something that gets a lot of plus ones.

I couldn't hope to do the math for something like that, but I'd sure like to talk to someone that could.


Lots of people seem to be missing the fact that Amazon changed their algorithm years ago to account for the number of reviews. For example, see:

http://www.amazon.com/Stand-Mixers-Small-Appliances-Kitchen/...


Huh. You're right. That seems pretty recent, or I didn't notice the switch-over. Thank goodness.


I implemented this algorithm using likes/views instead of positive/negative votes on http://mediaqueri.es/popular/ and have been quite happy with the results.


When I have the whole body of reviews readily available, I like to just do a Bayesian average. Mix in the average number of reviews at the average review score to keep small data sets from skewing results.


Haha, I don't know why, but I laughed when I got to the 3rd formula. It was like the punchline to a joke.


You had me at "Lower bound of Wilson score confidence interval for a Bernoulli parameter"


I've always thought about this, and to me, a very simple (though slightly inaccurate) solution would be to sort using this formula:

(TotalScore - 1) / MaxPossibleScore

Such that (using the Amazon examples from the article):

((2 * 5) - 1) / 10 = 9/10 = 90%

((100 * 5) + (1 * 1) - 1) / 505 = 500/505 = 99%


we use this algorithm for our office ping pong game tracking system. it's great because the person who just plays one game and wins doesn't get bragging rights.


This must be the third time this has been posted.


The last time was about a month ago IIRC. Can anyone explain why this keeps happening (and why people keep giving karma to those who just repost month old HN links)?


Probably because some haven't seen it yet and/or because people love discussing this topic.


That's why a friend of mine joined the army engineers.

As a civil engineer working for a local city, he might be involved in a 10-year approval process to add a freeway on-ramp, where most of his job would be checking that an army of subcontractors were all doing things to code - not that they were doing things well, just to the written requirements.

In Afghanistan, if they want a road or a barrier, he basically finds somebody of lower rank, points at a bulldozer, and tells them to do it.

An interesting point he made was about building a simple village clinic with a clean water supply that would save lives, for a few days' work and a few thousand dollars. At home he would be involved in a multi-hundred-million-dollar, 20-year project for a new hospital, where most of the money would go into pretty decoration and parking structures, and which would probably end up costing lives compared to the existing old hospital that was working perfectly well.


I read this twice and I can't figure out how this is relevant to the article, please explain?


EDIT - for some weird reason this cross-posted from another news.y topic. Please ignore it here.


Date: 9:12 AM Wednesday, April 4, 2012

From: the boss

To: dev3

Subject: URGENT - front page showcase selection broken!!

Body: Hey bro, I was looking into it, and our ratings average equation is totally busted and products with just a few ratings are hogging space from proven winners when it's just a sample bias. This is costing us money and needs to be fixed NOW.

I'd like this up before our morning meeting so I can boast about it and you'll get credit too, as this should massively increase our conversions right away by putting BETTER products right on the front page.

this should get you started: http://evanmiller.org/rating-equation.png

I'm sure you'll figure it out. If you could do an A/B test for bragging rights too that would MASSIVELY rock. Thanks!!!

Rock on,

Boss


how bout ((positive-negative)/total)


That's effectively the same thing as the average rating: (positive - negative) / total = 2 * (positive / total) - 1, a monotonic transform of the fraction of positive votes, so it produces the same ordering. It works great only if everything has about the same number of ratings.


The problem with star ratings is that they have nothing to do with measuring approval. They are a form of social inclusion mechanism to give the rubes the erroneous sense that someone cares about their opinions. It is done to attract users, not to guide them.



