Empirical Bayes for multiple sample sizes (chris-said.io)
104 points by csaid81 on May 4, 2017 | 17 comments



Although tangentially linked to in the article, David Robinson's Introduction to Empirical Bayes[1] is also an excellent resource. It deals primarily with beta-binomial distributions.

[1] http://varianceexplained.org/r/empirical-bayes-book/


It's an excellent blog post, although it's worth emphasizing that it is designed for the binomial case, where you wish to estimate the fraction of successes among some number of events, such as batting averages. For continuous variables, however, it makes more sense to use one of the methods described in the original post.
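To make the distinction concrete, here's a minimal beta-binomial sketch in Python. This isn't the code from David's book (which, as I recall, fits the Beta prior by maximum likelihood in R); I've used a method-of-moments fit for brevity, and the numbers are made up:

    # Empirical Bayes for the binomial case (batting-average style):
    # fit a Beta prior to the raw proportions, then shrink each raw
    # average toward the prior mean, with less shrinkage for more trials.
    import numpy as np

    def fit_beta_prior(successes, trials):
        """Method-of-moments fit of Beta(a, b) to the raw proportions
        (assumes the raw proportions actually vary)."""
        p = successes / trials
        mean, var = p.mean(), p.var()
        common = mean * (1 - mean) / var - 1  # estimate of a + b
        return mean * common, (1 - mean) * common

    def shrunken_average(successes, trials, a, b):
        """Posterior mean under the fitted prior."""
        return (successes + a) / (trials + a + b)

    successes = np.array([30, 2, 130, 40, 70])
    trials = np.array([100, 10, 400, 150, 200])
    a, b = fit_beta_prior(successes, trials)
    print(shrunken_average(successes, trials, a, b))

Note how the 2-for-10 hitter gets pulled much closer to the overall average than the 130-for-400 hitter does.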

TL;DR: One blog post is for Rotten Tomatoes and the other is for Metacritic.


Absolutely, and thanks for better defining the distinction.

I really just wanted to point out another solid Empirical Bayes resource, as there aren't that many around. Yours and David's make a good combination, covering different cases.


Stan is great! Glad to see it on HN. Nice write up too.


That definition of symbols! So good!


I know! I love when math books and papers do that.


The technical term for this is overkill. Just use the Bayesian average: https://en.wikipedia.org/wiki/Bayesian_average
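In code it's essentially a one-liner; the m (prior mean) and C (prior weight in pseudo-ratings) values below are just placeholders:

    def bayesian_average(ratings, m, C):
        """Treat the prior mean m as C pseudo-ratings. With few ratings the
        result stays near m; with many, it converges to the plain mean."""
        return (C * m + sum(ratings)) / (C + len(ratings))

    print(bayesian_average([5, 5, 4], m=3.2, C=10))       # pulled toward 3.2
    print(bayesian_average([5, 5, 4] * 50, m=3.2, C=10))  # approaches raw mean 4.67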


Thank you. As a statistician, I find it a death sentence, IMHO, that mixed-effects models (e.g. does this rater tend to rate high?) are overlooked. Too much nomenclature, too early (link to the table within the text, please, and omit needless words), and with too little attention paid to the value of an external citation.

Also, MCMC for ratings? Surely you jest. If the author had touched on mixed models, then maybe it would make sense. But given the sample sizes involved here, and the noise in the variance estimates, I recommend that the author investigate mixed models tout de suite if they do in fact care about the sources of shared and unshared effects on variance. Because that is what mixed models do.


Author here. Please see the section on mixed models in my post. As I mentioned there, I would love it if an expert could expand on the relationship between mixed effects and Empirical Bayes.

Regarding MCMC, one of the things I try to emphasize throughout the post is that the best solution depends on your needs (for example if you want a full posterior). In fact, most of the post is devoted to quick and simple methods -- not MCMC -- because they are good enough for most purposes. I welcome your feedback though on how I could make this point clearer.


> Author here.

Alright, I'll put on my Reviewer Number 3 hat and say that I learned some neat things from your work, including the existence of the National Swine Improvement Federation. I'll try and do a halfway decent job here.

> I would love it if an expert could expand on the relationship between mixed effects and Empirical Bayes.

A real expert? Here you go:

http://statweb.stanford.edu/~ckirby/brad/LSI/monograph_CUP.p...

Read it, all of it, but particularly chapter 1, section 2.5, and chapters 8, 10, and 11. Why do testing, effect size estimation, and high-dimensional analysis have anything to do with anything? Because...

1) Independence is largely a myth.

2) You are likely to have multiple ratings per reviewer on your site, whether your generating distribution is nearly continuous (0-10, mean-centered) or discrete (0/1, A/B/C). If you discard this, you are throwing away an enormous amount of information, and failing utterly to understand why a person would estimate not just the variance but the covariance even for a univariate response.

The second point is the one that matters.
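To make point 2 concrete, here is a rough statsmodels sketch. The column names and the simulated data are invented; statsmodels handles crossed random effects as variance components within a single dummy group:

    # Decompose rating variance into movie, rater, and residual parts
    # with crossed random effects.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n_movies, n_raters, n_obs = 20, 30, 400
    df = pd.DataFrame({
        "movie": rng.integers(n_movies, size=n_obs).astype(str),
        "rater": rng.integers(n_raters, size=n_obs).astype(str),
    })
    movie_eff = {str(i): e for i, e in enumerate(rng.normal(0, 1.0, n_movies))}
    rater_eff = {str(i): e for i, e in enumerate(rng.normal(0, 0.5, n_raters))}
    df["rating"] = (5 + df["movie"].map(movie_eff) + df["rater"].map(rater_eff)
                    + rng.normal(0, 1.0, n_obs))

    model = smf.mixedlm(
        "rating ~ 1", df,
        groups=np.ones(n_obs),   # one group, so the VCs act as crossed effects
        re_formula="0",          # no per-group random intercept
        vc_formula={"movie": "0 + C(movie)", "rater": "0 + C(rater)"},
    )
    print(model.fit().summary())  # movie and rater variance components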

Also, "empirical Bayes" is in modern parlance equivalent to "Bayes". What's the alternative? "Conjectural Bayes"? (Maybe I should quit while I'm ahead, pure frequentists may be lurking somewhere)

> I welcome your feedback though on how I could make this point clearer.

For starters, edit. Your post is too damned long.

Think about where you are getting diminishing returns and why. Is there ever a realistic situation where your ratings site would not keep track of who submitted the rating? (If so, it's certainly not going to be an unbiased sample; the ballot box will get stuffed.) So if you have to keep track of who's voting, you automatically have the information to decompose the covariance matrix, and everything else follows logically.

A univariate response with a multivariate predictor (say, rating ~ movie*rater) can have multiple sources of variance, and estimating these from small samples is hard. When you use a James-Stein estimator, you trade variance for bias. You're shrinking towards movie-specific variance estimates, but you almost certainly have enough information to shrink towards movie-centric and rater-centric estimates of fixed and random effects, tempered by the number of ratings per movie and the number of ratings per rater. (Obviously you should not have more than one rating per movie per rater, else your sample cannot be unbiased).
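For reference, the movie-only shrinkage in question looks roughly like the following moment-based sketch (not your exact method; sigma2 is treated as known or pooled, and, pointedly, nothing in it knows about raters):

    # Moment-based (James-Stein-style) shrinkage with unequal sample sizes:
    # pull each movie's raw mean toward the grand mean, less so when the
    # movie has many ratings. sigma2 is the within-movie rating variance.
    import numpy as np

    def shrink_means(means, counts, sigma2):
        grand = np.average(means, weights=counts)
        # Crude moment estimate of the between-movie variance tau^2:
        # spread of the raw means minus the part due to sampling noise.
        tau2 = max(means.var() - np.mean(sigma2 / counts), 0.0)
        weight = tau2 / (tau2 + sigma2 / counts)  # in [0, 1); 0 = full shrinkage
        return grand + weight * (means - grand)

    means = np.array([4.6, 2.1, 3.9, 3.0])
    counts = np.array([3, 200, 40, 8])
    print(shrink_means(means, counts, sigma2=1.0))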

I think you will return to this and write a much crisper, more concise, and more useful summary once this sinks in. I could be wrong. But you'll have learned something deeply useful even if I am. I do not think you can lose by it.


> Also, "empirical Bayes" is in modern parlance equivalent to "Bayes". What's the alternative? "Conjectural Bayes"?

My understanding of the difference, as a frequent user of empirical Bayes methods (mainly limma[1]), is that in "empirical Bayes" the prior is derived empirically from the data itself, so that it's not really a "prior" in the strictest sense of being specified a priori. I don't know whether this is enough of a difference in practice to warrant a different name, but my guess is that whoever coined the term did so to head off criticisms to the effect of "this isn't really Bayesian".

[1]: https://bioconductor.org/packages/release/bioc/html/limma.ht...


Do you have a webpage? I just helped my wife (a physician) with the stats for a research presentation that sought to track infection spread in hospitals (location-specific, by room number) via the movement of tagged equipment and staff. They then PCR'd the strains to make sure it was the same one.

The experimental design was good; the stats person they had to help them decipher the results... left much to be desired.

Could you please be so kind as to email me at jpolak{at} the email service of a company where a guy named Kalashnikov worked.


Yup, I agree about throwing away rater information. The actual application at my company that motivated me to research this doesn't have rater information, which is why I didn't think to adjust for it. The movie case was just an example I used to motivate this post for which, yes, I agree, rater information would be quite useful.


This seems different and a bit lacking in detail (although I don't dispute that it could be useful). How exactly does one choose m and C? And what are the conditions under which it would reduce to the James-Stein / Bühlmann / BLUP model?


The choice of m and C need not be exact. It is enough to choose them so that

1. If there are no ratings, the Bayesian average is close to the overall mean, and

2. If there are many ratings (how many depends on how big the site is), C and m do not affect the result much.

You can probably do a little better if you have a lot of data and the ability to run A/B tests, but for the vast majority of cases pseudocounts work just fine.
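A quick numeric check of point 2 (toy numbers, assuming a five-star scale):

    def bayesian_average(ratings_sum, n, m, C):
        return (C * m + ratings_sum) / (C + n)

    # A movie whose raw average is 4.2 stars, with 8 vs. 800 ratings:
    for n in (8, 800):
        for C in (5, 25):
            print(n, C, round(bayesian_average(4.2 * n, n, m=3.0, C=C), 2))
    # n=8:   C=5 -> 3.74, C=25 -> 3.29  (the choice of C matters a lot)
    # n=800: C=5 -> 4.19, C=25 -> 4.16  (the choice of C barely matters)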


Got it. Thanks for the clarification. In that case I would think that James-Stein / Bühlmann / BLUP is a better approach, since it is just as easy to implement and the amount of shrinkage is chosen optimally from the data rather than by guesswork. In fact it may be easier, because no guesswork is required.

It would be interesting though to have people try to guess suitable values of m and C and then see how close their MSEs get to the James-Stein MSE. I suspect that some people's guesses would be meaningfully off target.
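Something like the following simulation would do it. All the parameters are invented; in this particular setup the optimal C works out to sigma^2/tau^2 = 4, so a guess of C=40 is badly off:

    # Simulate movies with known true quality, then compare the MSE of a
    # guessed-pseudocount Bayesian average against moment-based shrinkage.
    import numpy as np

    rng = np.random.default_rng(1)
    n_movies, sigma, tau, mu = 2000, 1.0, 0.5, 3.0
    true_quality = rng.normal(mu, tau, n_movies)
    counts = rng.integers(1, 50, n_movies)
    raw_means = true_quality + rng.normal(0, sigma, n_movies) / np.sqrt(counts)
    grand = np.average(raw_means, weights=counts)

    def mse(est):
        return np.mean((est - true_quality) ** 2)

    for C in (2, 10, 40):  # guessed pseudocounts, m fixed at the grand mean
        bayes_avg = (C * grand + raw_means * counts) / (C + counts)
        print(f"Bayesian average, C={C:2d}: MSE = {mse(bayes_avg):.4f}")

    # Moment-based shrinkage estimates tau^2 from the data instead of guessing.
    tau2_hat = max(raw_means.var() - np.mean(sigma**2 / counts), 0.0)
    w = tau2_hat / (tau2_hat + sigma**2 / counts)
    print(f"Moment-based shrinkage: MSE = {mse(grand + w * (raw_means - grand)):.4f}")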


But that's not how you should measure it. Your goal is not to minimize MSE. Your goal is to rank movies in a way that users like.

So the test would be to randomly split users into test and control groups, show the control group a ranking based on Bayesian averaging, show the test group a ranking based on James-Stein or some other method, measure some metric of user happiness (a hard problem in itself; click rate on top titles?), and then compare.



