On a related note, you're wasting clicks using A/B testing. I emailed you guys about using a better online method (a bandit algorithm) but never heard back from anyone. If that's of interest, drop me a line (noel at untyped dot com).
Update: It's occurred to me that you're using GAE, and so probably can't run C libraries like the above two projects. There is a Java library here: http://code.google.com/p/boxer-bayesian-regression/ If you're going to do per-user and per-exercise models you'll have many fewer data points to train each model on. You should consider sharing data between models or using a model that gives some measure of uncertainty in its predictions. The Bayesian LR code I referenced above will give some measure of uncertainty. There is a stack of (really interesting) work on other methods that will also do this.
However, the significance calculation makes a critical assumption that you have probably violated without even realizing it: that the sample size was fixed in advance. If instead of deciding ahead of time, “this experiment will collect exactly 1,000 observations,” you say, “we’ll run it until we see a significant difference,” all the reported significance levels become meaningless.
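To make the "run it until it's significant" problem concrete, here is a minimal simulation sketch. It assumes two arms with identical true conversion rates (so every "significant" result is a false positive), a plain two-proportion z-test, and a peek every 25 observations; the function names, rates, and sample sizes are illustrative choices, not anything from the original article.

```python
import math
import random

def z_test_p_value(successes_a, successes_b, n):
    """Two-sided p-value of a two-proportion z-test with n observations per arm."""
    pooled = (successes_a + successes_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return 1.0  # no variation observed, so no evidence of a difference
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    z = (successes_a - successes_b) / n / se
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(peek, trials=1000, n_max=500, rate=0.2, alpha=0.05, seed=0):
    """Fraction of A/A experiments (no real difference) declared significant.

    peek=False: test once, at the fixed sample size n_max.
    peek=True:  test every 25 observations and stop at the first p < alpha.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = b = 0
        for n in range(1, n_max + 1):
            a += rng.random() < rate  # both arms share the same true rate
            b += rng.random() < rate
            check = (n % 25 == 0) if peek else (n == n_max)
            if check and z_test_p_value(a, b, n) < alpha:
                hits += 1
                break
    return hits / trials

if __name__ == "__main__":
    print("fixed sample size:", false_positive_rate(peek=False))
    print("peeking every 25: ", false_positive_rate(peek=True))
```

Under these assumptions the fixed-size run should land near the nominal 5%, while the peeking run should come out well above it.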
The original intent of the experimenter shouldn't affect his conclusions!
With that said, I'm not sure what the proper Bayesian answer is. If I were to stop my experiments every time that P(X|D) < 0.05 (or whatever), wouldn't that introduce bias? 
Incidentally, you (Eliezer) had a nice post where you explained the problem in detail (something about two doctors doing an experiment 50 times with different "stopping rules", who get the same data), but I couldn't find it.
EDIT: On further thought, there wouldn't be any bias if I do the correct thing and report my uncertainty: P(X|D) = 0.04. The "repeated significance testing errors" come from converting P = 0.04 into an absolute statement "X is true". Then the stopping rule will affect the statistics.
Consider a data set consisting of a single license plate number. (This example is from Richard Feynman.)
If you set the rules in retrospect, you can go "Wow, what were the odds my one license plate would be XJKDL-2342-KE? One in a million?" But that's wrong.
On the other hand if you predicted XJKDL-2342-KE in advance, then the same data point would have a different meaning. How did you predict it?
Patterns that you can predict in advance are different from ones you can find in retrospect after looking through whatever results you get. So the same data point -- XJKDL-2342-KE -- can take on different meaning depending on the original intent and design of the experimenter.
People make this mistake all the time with more mundane examples. Like they will roll snake eyes three times in a row, then calculate the odds of that happening, and then say "wow, 1/6^6, there was such a minuscule chance I'd get screwed like this". But they're just wrong. ANY exact ordering of the 6 individual dice rolls has a 1/6^6 chance of happening, and you have to get one of the "unlikely" results.
To help make this more intuitive, consider that they would have been surprised by rolling all 2s, or all 3s, or various other patterns. So you at least would have to figure out how many outcomes they'd deem surprising and figure out what proportion of the possibilities are in that category. And then take into account all the rolls they made when they weren't surprised and didn't record any data...
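A small worked example of that counting, as a sketch. The only assumption is the definition of "surprising": here it is taken to mean "all six dice show the same face", which is just one illustrative choice.

```python
from fractions import Fraction
from itertools import product

# Probability of one specific ordered outcome of six dice (e.g. six 1s).
p_exact = Fraction(1, 6) ** 6

# Enumerate all 6^6 ordered outcomes and count the ones where every die matches.
outcomes = list(product(range(1, 7), repeat=6))
all_same = sum(1 for o in outcomes if len(set(o)) == 1)

# Probability of landing in the "surprising" category at all.
p_any_all_same = Fraction(all_same, len(outcomes))

if __name__ == "__main__":
    print("one exact sequence:", p_exact)
    print("any all-same roll: ", p_any_all_same)
```

Widening the category of outcomes that would have felt surprising raises the probability further, and that is before accounting for all the unrecorded rolls.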
With that said, I'm not sure that I see the connection between what you're arguing and the significance problem in the original. What do you think of the example with the two doctors? http://lesswrong.com/lw/mt/beautiful_probability/
Changing stopping rules after seeing the data creates bias/distortions
We're talking about a fixed stopping rule, which depends on the data.
I think that even if you get the same data set as your results, as in the linked doctor example, the stopping rule does matter. A reason for this is that it affects the repeatability of experiments. Experiments ought to be repeatable (within margins of error, and using all the same procedures including stopping rules). Let's consider what could happen with repeat trials. Suppose the medicine almost but not quite works according to the standard you're trying to test for that will make the medicine considered a success (cures at least 60 out of 100 people, on average -- not the most realistic standard, but that's not important). Also suppose the data set had 60 cures exactly, not 70 (the following is technically possible, but unlikely, with 70).
So, the first experiment with a fixed N=100 will not be repeatable. Slightly too few people will be cured in future trials. The success on the first attempt was good luck within the margin for error (the real average cure rate is 59 out of 100).
The second experiment, however, will eventually report that the medicine works on at least 60% of patients on average in all 10 (or whatever) repeat trials, even though this is false. (I think. Maybe they won't make that false "on average" claim but will conclude something different instead? What?) The reason this will happen is basically the same reason that if you flip a coin enough you random walk away from the average (and eventually visit both sides of the average). And because the real cure proportion is so near the goal, there's a pretty good chance you could do a lot of repeat trials without any stalling out for years (which might clue some people in to the problem, though if they strictly ignore data from studies that haven't stopped yet, then maybe it wouldn't).
Similarly, imagine a study of coins which had a stopping rule to stop whenever you have at least 60% heads. You'll always be able to get that result and conclude the coin is biased, even if all coins used are fair. You'll often be able to get that result pretty quickly (and actually if you don't get it quickly, but hover around average, the expected time to get it will keep getting worse. But I bet we could come up with an example that doesn't have that property. Or we could consider 20 research groups, 15 of which report coins are biased and 5 of which never publish.). But the point is their result, claiming coins are biased, may be wrong. Even the possibility of the method getting a wrong answer, without anyone having made a mistake in doing it, is a major problem!
If someone said, "Never mind their stopping rules, I want to salvage their coin flipping data and use it for my other project" I think they would have a serious problem because it's not a proper random set of coin flip data but is instead limited to various possible sequences of flips and not others.
Now it could always be that trials with bad stopping rules get lucky and are correct, and using or believing their data won't work out badly. Their data set could happen to be identical to one that is properly collected. But I think one always has to fear the possibility that they didn't get lucky and their stopping conditions have spoiled the data (especially when you don't have a properly done trial with identical results to compare with) just as the people trying to prove coins are biased could easily spoil their data using fixed but unreasonable stopping conditions.
> Similarly, imagine a study of coins which had a stopping rule to stop whenever you have at least 60% heads. You'll always be able to get that result and conclude the coin is biased, even if all coins used are fair.
This is not true. Because of the law of large numbers, the probability of ever reaching the 60% decreases with time.
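A rough way to check this is to simulate fair coins under the "stop at 60% heads" rule. This is only a sketch: the flip budget, the number of trials, and the `min_flips` parameter (how many flips must happen before the rule is allowed to fire) are assumptions added for illustration.

```python
import random

def reaches_threshold(rng, min_flips, max_flips):
    """True if a fair coin ever shows >= 60% heads between min_flips and max_flips."""
    heads = 0
    for n in range(1, max_flips + 1):
        heads += rng.random() < 0.5
        # Integer form of heads / n >= 0.6, avoiding floating-point edge cases.
        if n >= min_flips and 5 * heads >= 3 * n:
            return True
    return False

def hit_rate(min_flips, max_flips=2000, trials=1000, seed=0):
    """Fraction of fair-coin runs in which the stopping rule fires within the budget."""
    rng = random.Random(seed)
    hits = sum(reaches_threshold(rng, min_flips, max_flips) for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    print("rule active from flip 1:  ", hit_rate(min_flips=1))
    print("rule active from flip 100:", hit_rate(min_flips=100))
```

If the rule is active from the very first flip it fires often (a single head is already 100% heads), but once a run has gone on for a while without hitting 60%, the chance that it ever will shrinks, which is the law-of-large-numbers point.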
I do think the one that says "If I get one conclusion, stop. If I get the other, keep trying," has got to be a bad idea!
I understand what you're talking about. I see the potential for a problem. But my understanding is that Bayesian statistics isn't subject to that.
Proper Bayesian result reporting doesn't say "We believe that the coin is biased". We would rather say "The probability that this coin is biased is 60%, subject to our assumptions and model".
My feeling is that this statement is true:
If the model and assumptions are correct, then the Bayesian outcome will be true regardless of the stopping rule.
In this case: 60% of coins for which the Bayesian analyst proclaims P(biased)=0.6 WILL be biased (barring sampling variations). The stopping rule doesn't matter.
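Here is a sketch of how one might test that claim by simulation. The setup is entirely assumed for illustration: each coin is either fair or biased to 75% heads with a 50/50 prior, the analyst's model matches that setup exactly, and the stopping rule is deliberately one-sided (stop as soon as the posterior for "biased" reaches 0.9, otherwise give up after 50 flips).

```python
import math
import random

P_BIASED = 0.75  # heads probability of a biased coin (assumed for illustration)

def run_coin(rng, max_flips=50, stop_at=0.9):
    """Flip one coin under a one-sided stopping rule; return (posterior, truly_biased)."""
    biased = rng.random() < 0.5           # the true prior: half the coins are biased
    p_heads = P_BIASED if biased else 0.5
    log_odds = 0.0                        # analyst's prior odds of biased vs. fair: 1:1
    for _ in range(max_flips):
        if rng.random() < p_heads:
            log_odds += math.log(P_BIASED / 0.5)        # a head favors "biased"
        else:
            log_odds += math.log((1 - P_BIASED) / 0.5)  # a tail favors "fair"
        posterior = 1 / (1 + math.exp(-log_odds))
        if posterior >= stop_at:          # stop as soon as the coin "looks biased"
            break
    return posterior, biased

def calibration(trials=20000, seed=0):
    """Among coins reported with P(biased) >= 0.9, compare claim vs. reality."""
    rng = random.Random(seed)
    results = [run_coin(rng) for _ in range(trials)]
    stopped = [(p, b) for p, b in results if p >= 0.9]
    mean_posterior = sum(p for p, _ in stopped) / len(stopped)
    frac_biased = sum(b for _, b in stopped) / len(stopped)
    return mean_posterior, frac_biased

if __name__ == "__main__":
    mean_posterior, frac_biased = calibration()
    print("average reported P(biased):", mean_posterior)
    print("fraction actually biased:  ", frac_biased)
```

When the model and prior are correct, as assumed here, the two numbers should agree closely despite the lopsided stopping rule. The sketch says nothing about what happens when the model is wrong.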
I'll try and figure out a solid explanation by tomorrow.
> Proper Bayesian result reporting doesn't say "We believe that the coin is biased". We would rather say "The probability that this coin is biased is 60%, subject to our assumptions and model".
I'm not really sure what you're getting at here. None of the coins are biased, by premise, so they shouldn't be concluding either thing.
If you throw in "if our model and assumptions are right" then you can shift the blame (if they assumed their stopping rule was OK, or came up with a model that says it's OK). But I'm not sure how that substantively helps.
Will check back tomorrow for further comments from you.
I've been thinking about the problem a lot today. I'm pretty sure that my point is basically right, if the model is correct, but my ideas are not clear enough to explain it properly. Model correctness in Bayesian statistics is a complicated problem, and as far as I can tell, it's not a completely solved one. Bayesians usually agree about their calculations, but there's heavy debate about the "philosophy".
In any case, maybe you'll find Eliezer's other post insightful:
I really hope to figure out model correctness, and this optional stopping problem looks like a good vector of attack.
Thank you for the discussion, and sorry for leaving you hanging!
(if there's any Bayesian out there willing to continue the discussion, my email is in my profile)
For an example of a stopping rule that might seem bad, but does halt and turns out to be OK:
Flip a coin until you have more heads than tails OR reach 500 flips.
This procedure will produce a majority of trials with more heads than tails, but I think the average over many trials will be 50/50. The conceptual reason is that stopping early sometimes prevents just as many heads as tails that would have come up after stopping. I haven't formally proved this but I did a simulation with a million trials with that stopping procedure and got a ratio of 1.0004 heads per tails which seems fine (and after some reruns, I saw a result under 1, so that is possible). Code here:
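A minimal sketch of that kind of simulation follows. It is not the original linked code; the function names, trial count, and seed are assumptions for illustration.

```python
import random

def run_trial(rng, max_flips=500):
    """Flip a fair coin until heads outnumber tails or max_flips is reached."""
    heads = tails = 0
    while heads <= tails and heads + tails < max_flips:
        if rng.random() < 0.5:
            heads += 1
        else:
            tails += 1
    return heads, tails

def simulate(trials=20000, seed=0):
    """Return (total heads / total tails, fraction of trials ending heads-ahead)."""
    rng = random.Random(seed)
    total_heads = total_tails = heads_ahead = 0
    for _ in range(trials):
        h, t = run_trial(rng)
        total_heads += h
        total_tails += t
        heads_ahead += h > t
    return total_heads / total_tails, heads_ahead / trials

if __name__ == "__main__":
    ratio, frac_ahead = simulate()
    print("heads per tails over all flips:", ratio)
    print("fraction of trials ending with more heads:", frac_ahead)
```

Most individual trials end with more heads than tails, yet the pooled ratio over all flips should stay close to 1, because the rare trials that run the full 500 flips carry a large tails surplus.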
With a guaranteed halt, a sequence of 500 tails and 0 heads can be counted. With no guaranteed halt, it's impossible to count a tails heavy sequence, which is not OK because it's basically ignoring data people don't like.
Does that make sense? I think it may satisfy the stuff you/Bayesians/Eliezer are concerned with. It means it's OK to stop collecting data early if you want, but you do need some rules to make sure all your results are reported, with no selectivity there.
There's also a further issue that these kinds of stopping procedures are not a very good idea. The reason is that while they are OK with unlimited data, they can be misleading with small data sets. It's like the guy who bets a dollar, and if he loses he bets two dollars, and if he loses again he bets 4 dollars (repeated up to a maximum bet of 1024 dollars). His expectation value in the long run is not changed by his behavior but he does affect his short term odds: he's creating an above 50% chance of a small win and an under 50% chance of a larger loss. If you only do 10 trials of this betting system, they might all come out wins, and you've raised the odds of getting that result despite leaving the long term expectation value alone. Doing essentially the same thing with scientific data is unwise.
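The betting example can be worked out exactly. This sketch assumes fair, even-money bets and that the bettor stops at the first win or after losing the maximum bet; the function name is made up for illustration.

```python
from fractions import Fraction

def doubling_outcome(max_bet=1024):
    """Exact odds for one run of the double-after-each-loss betting sequence."""
    bets = []
    bet = 1
    while bet <= max_bet:
        bets.append(bet)
        bet *= 2
    p_lose_all = Fraction(1, 2) ** len(bets)  # every bet in the sequence loses
    p_win = 1 - p_lose_all                    # some bet wins, for a net gain of 1
    loss = sum(bets)                          # total lost if every bet loses
    expected = p_win * 1 - p_lose_all * loss
    return p_win, loss, expected

if __name__ == "__main__":
    p_win, loss, expected = doubling_outcome()
    print("P(win 1 dollar):", p_win)
    print("loss if the whole sequence fails:", loss)
    print("expected value per run:", expected)
    print("P(10 runs all win):", float(p_win ** 10))
```

The expected value is unchanged, but almost all of the probability mass sits on the small win, so a short series of runs will usually look like a winning system.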
BTW/FYI I believe I have no objections to the Bayesian approach to probability but I do think the attempt to make it into an epistemology is mistaken (e.g. because it cannot address purely philosophical issues where there's no data to use, so it fails to solve the general problem in epistemology of how knowledge (of all types) is created.)
Under a uniform prior on [0, 1] the posterior mode is the empirical mean. How you sample is of no consequence. The likelihood/posterior f(p|#heads, #tails) is proportional to p^(#heads)(1-p)^(#tails) regardless of how you sample. Differentiate with respect to p, set the derivative to zero, and you get p*=heads/total.
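Spelled out, with h = #heads and t = #tails (symbols introduced here for brevity), the derivation goes roughly like this:

```latex
\begin{align*}
f(p \mid h, t) &\propto p^{h}(1-p)^{t} \\
\log f(p \mid h, t) &= h \log p + t \log(1-p) + \text{const} \\
\frac{d}{dp}\log f(p \mid h, t) &= \frac{h}{p} - \frac{t}{1-p} = 0
  \quad\Longrightarrow\quad p^{*} = \frac{h}{h+t}
\end{align*}
```

Note that p* is the maximizer, i.e. the posterior mode. With a uniform prior the posterior is Beta(h+1, t+1), whose mean is (h+1)/(h+t+2), which is close to h/(h+t) once there is a reasonable amount of data. The stopping rule never enters any of these expressions.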
It is rather amusing that most statistics professors are happy to have taught their students that the sampling procedures matter while at the same time crushing the natural intuition that your decisions should be based on the data you observe, not on what might have happened in a world that doesn't exist.
If someone then takes your dataset and assumes it's a random sample -- e.g. just the same as the N=100 doctor trial -- he's wrong. It's not, it's something else, and that something else is less useful.
You say "how you sample is of no consequence". But suppose your sampling method selectively throws out some data that it doesn't like. That is of consequence, right? So sampling methods do matter. Now consider a method which implicitly throws out data because some sample collections are never completed. That matters too.
See my other comment, up a few times then down the other branch, the one with the pastebin code.
However the example with the two doctors was not the halting type.
Can you agree to that? Or do you have a defense of non-halting stopping rules, even though they are incapable of reporting some data sets?
I think I figured this out but would be interested in criticism on this point if not. Is there some way of dealing with non-halting that makes it OK?
The book says if there's a stopping rule then inferences must depend only on the resulting sample but that assumes there is a resulting sample -- that the procedure halts.
The post you're looking for is http://lesswrong.com/lw/1gc/frequentist_statistics_are_frequ...
Doctor A decides to test a cure on 50 patients. 40 have gotten better. Doctor B independently decides to test the same cure on his patients. He will stop once he has reached 'significance'. Coincidentally, the results become significant at the 50th patient, and he also has a 40/50 success rate.
Doctor A says "I followed a fixed testing procedure, and the statistical analysis says that my data is not significant. We need more experiments."
Doctor B says "I followed an optional stopping procedure, and the statistical analysis says that my data is significant: the cure is good."
A Bayesian would claim that if they both have the same data, then they should reach the same conclusion, regardless of their intent.
A "frequentist" would hold that the doctors can legitimately disagree. I don't know much about frequentism, but it's the dominant perspective in statistics. Everything I've read about A/B testing is frequentist.
It makes me wonder if the data set is available as well.
Chapter 22 of David Barber's "Bayesian Reasoning and Machine Learning" (he makes it available online) does a nice (perhaps brief) job of explaining the progression through the Rasch model, the Bradley-Terry-Luce model and Elo.
As an aside, the way chesstempo generates the exercises is also cute. The tactical chess problems are positions taken from high-level (human) games, fed into a chess engine that identifies blunders to which there is a single distinctly best response. The challenge is to find that best move. Because they are taken from real games, they have the appearance and feel of real positions, which is important; many people believe pattern recognition is an important part of chess mastery. Apparently they've built up nearly 40000 such tactical exercises.
Arguably this is all just semantics (nature knows no stats/ML divide), but as a ML person I know this drives statistics folks crazy.
The terminology confusion, I think, stems from the earlier era of ML research.
I think this method of A/B testing has some faults. I blogged about it: "A/B testing. Is Khan doing it wrong?" http://liveatthewitchtrials.blogspot.com/2011/09/ab-testing-...
and Allen Downey ran some simulations at "Repeated tests: how bad can it be?" http://allendowney.blogspot.com/2011/10/repeated-tests-how-b...
I think you can model for randomness (kids getting lucky while answering a question), but if you can somehow add time-dependency to the model, then your predictability would be higher (of course this is pure speculation).
Does anyone have a good model I can look at? Any help would be appreciated.