
You're stuck in frequentist thinking. "Bias" is a property of repeated sampling -- the expectation over repeated samples. But we just have one! The relevant question is what is your best guess for p, the probability the coin will be heads.

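Under a uniform prior on [0, 1] the posterior mode is the empirical mean (the posterior mean is (#heads+1)/(total+2), which is close to it for large samples). How you sample is of no consequence. The likelihood, and hence the posterior f(p|#heads, #tails), is proportional to p^(#heads)(1-p)^(#tails) regardless of how you sample. Differentiate with respect to p, set the derivative to zero, and you get p*=#heads/total.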
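A rough sketch of that claim, for anyone who wants to check it numerically. It compares two sampling designs that end with the same counts: a fixed number of flips (binomial) and flipping until a set number of tails appears (negative binomial). The function names, the 7 heads / 3 tails counts, and the grid search are illustrative assumptions, not anything from the linked book.

```python
import math

def log_lik_fixed_n(p, heads, tails):
    """Binomial log-likelihood: the total number of flips was fixed in advance."""
    n = heads + tails
    return math.log(math.comb(n, heads)) + heads * math.log(p) + tails * math.log(1 - p)

def log_lik_stop_at_tails(p, heads, tails):
    """Negative-binomial log-likelihood: flip until `tails` tails have been seen."""
    n = heads + tails
    # The last flip must be a tail, so the heads fall among the first n - 1 flips.
    return math.log(math.comb(n - 1, heads)) + heads * math.log(p) + tails * math.log(1 - p)

def argmax_on_grid(log_lik, heads, tails, steps=1000):
    """Crude maximiser: evaluate the log-likelihood on an evenly spaced grid in (0, 1)."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda p: log_lik(p, heads, tails))

heads, tails = 7, 3  # hypothetical data
p_fixed = argmax_on_grid(log_lik_fixed_n, heads, tails)
p_stop = argmax_on_grid(log_lik_stop_at_tails, heads, tails)

# The two log-likelihoods differ only by a constant that does not involve p.
gap_low = log_lik_fixed_n(0.3, heads, tails) - log_lik_stop_at_tails(0.3, heads, tails)
gap_high = log_lik_fixed_n(0.8, heads, tails) - log_lik_stop_at_tails(0.8, heads, tails)

print(p_fixed, p_stop, gap_low, gap_high)
```

Both designs peak at the same p, and the gap between the two curves is the same at every p. That is the sense in which the design drops out of the posterior for rules of this kind. It says nothing about rules that discard data.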

It is rather amusing that most statistics professors are happy to have taught their students that sampling procedures matter while at the same time crushing the natural intuition that your decisions should be based on the data you observe, not on what might have happened in a world that doesn't exist.

http://books.google.com/books?id=6oQ4s8Pq9pYC&lpg=PA18&#...




Consider an infinite string of coin flips. Now consider a subset selected by a stopping rule to meet a particular criterion, and a different subset chosen with an N=100 criterion. The first stopping rule creates a bias: you have a non-random sample chosen to meet that criterion. The second stopping rule doesn't do that; it gets what we call a "random sample".

If someone then takes your dataset and assumes it's a random sample -- e.g. just the same as the N=100 doctor trial -- he's wrong. It's not, it's something else, and that something else is less useful.

You say "how you sample is of no consequence". But suppose your sampling method selectively throws out some data that it doesn't like. That is of consequence, right? So sampling methods do matter. Now consider a method which implicitly throws out data because some sample collections are never completed. That matters too.
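Here is a minimal simulation sketch of what I mean, under assumptions I picked for illustration: a fair coin, a rule that stops once the heads fraction reaches 0.6 after at least 10 flips, and a cap of 200 flips standing in for "never completed". None of these numbers come from the article.

```python
import random

def run_until_ratio(rng, p, threshold, min_flips, max_flips):
    """Flip until the heads fraction reaches `threshold`; return None if it never does."""
    heads = 0
    for n in range(1, max_flips + 1):
        heads += rng.random() < p
        if n >= min_flips and heads / n >= threshold:
            return heads / n
    return None  # this collection was never completed, so it is never reported

def run_fixed_n(rng, p, n):
    """Flip exactly n times and return the heads fraction."""
    return sum(rng.random() < p for _ in range(n)) / n

def simulate(trials=2000, p=0.5, seed=0):
    """Compare reported heads fractions under the two rules for a coin with P(heads) = p."""
    rng = random.Random(seed)
    fixed = [run_fixed_n(rng, p, 100) for _ in range(trials)]
    stopped = [run_until_ratio(rng, p, 0.6, 10, 200) for _ in range(trials)]
    completed = [x for x in stopped if x is not None]
    return {
        "fixed_mean": sum(fixed) / len(fixed),
        "stopped_mean": sum(completed) / len(completed),
        "never_completed": trials - len(completed),
    }

result = simulate()
print(result)
```

The N=100 runs average out near the true 0.5. Every run the criterion rule reports has a heads fraction of at least 0.6, and the runs that never hit the criterion simply vanish from the record. This only shows the repeated-sampling effect; it does not by itself settle what you should infer from a single completed run.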


Yes, clearly. I stated that too strongly. Sampling procedures can definitely matter enormously, but stopping rules are within a class of ignorable rules. The link above gives a more precise definition.


I think that you are mostly right about halting (guaranteed) stopping rules.

See my other comment, up a few times then down the other branch, the one with the pastebin code.

However the example with the two doctors was not the halting type.

Can you agree to that? Or do you have a defense of non-halting stopping rules, even though they are incapable of reporting some data sets?

I think I figured this out but would be interested in criticism on this point if not. Is there some way of dealing with non-halting that makes it OK?

The book says that if there's a stopping rule, then inferences must depend only on the resulting sample, but that assumes there is a resulting sample -- that the procedure halts.


Off-topic, but what happened to statsia? I was curious to see what you were working on.


Website is down but the project continues. Email beta@statsia.com if you want to be put on our insider's list.



