
Can you do this with a source of infinite samples? If every time I take a sample it's slightly higher, does this still hold?



I'm not good with stats, but five increasing measurements leave two options: (A) you've hit an unlikely coincidence and you're fine, or (B) you're not really drawing randomly from the same distribution.

Which scenario seems more likely ;)
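To put a rough number on option A: if the five samples really were i.i.d. draws from a continuous distribution, every ordering would be equally likely, so a strictly increasing run would show up with probability 1/5! = 1/120. A minimal sketch to check that by simulation (the uniform distribution, the trial count and the function name are just my choices for illustration):

```python
import math
import random

def prob_strictly_increasing(n=5, trials=200_000, seed=0):
    """Estimate how often n i.i.d. uniform draws come out strictly increasing."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        # Compare each draw with the next one in the sequence.
        if all(a < b for a, b in zip(xs, xs[1:])):
            hits += 1
    return hits / trials

# Exact value under the i.i.d. assumption: one ordering out of 5! is increasing.
exact = 1 / math.factorial(5)
estimate = prob_strictly_increasing()
print(f"exact = {exact:.5f}, simulated = {estimate:.5f}")
```

So under the i.i.d. assumption, option A is a bit under a 1% event, which is why B looks like the better bet.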


> If every time I take a sample it's slightly higher, does this still hold?

I don't know enough stats to give a firm answer, but I'd reckon there is a key assumption that the samples need to be drawn i.i.d. from a single underlying probability distribution, or perhaps need to satisfy the related assumption of being exchangeable.

https://en.wikipedia.org/wiki/Independent_and_identically_di...

https://en.wikipedia.org/wiki/Exchangeable_random_variables

In your example of a sequence of samples that increase, they're certainly not exchangeable. I think they're not independent either.

E.g. here's a thought experiment to give a concrete version of your example, defined with no randomness at all to make it easier to think about: suppose an idealised situation where we launch a space probe that travels away from the earth at 15 km / second. Suppose we have some way of measuring the distance d(t) of the probe from earth at some time t after launch. Regard each distance measurement d(t) as a sample. Let's assume we take 5 samples by measuring the distance every 10 seconds after launch. So t_1=10s, ..., t_5=50s, and d(t_1)=150km, ..., d(t_5)=750km.

The sequence of distance samples d(t_1), d(t_2), d(t_3), d(t_4), d(t_5) is not exchangeable. If we exchange two samples, say d(t_2) <-> d(t_4), the permuted sequence d(t_1), d(t_4), d(t_3), d(t_2), d(t_5) corresponds to the situation: "at 10 seconds the probe was 150 km away, at 20 seconds the probe was 600 km away, at 30 seconds the probe was 450 km away, at 40 seconds the probe was 300 km away, at 50 seconds the probe was 750 km away". Based on our understanding of how the physics of the situation works in this idealised example, the probability of observing that outcome is an awful lot lower than the probability of observing the outcome from the original sequence. (This is pretty sloppy as I am not clearly distinguishing between observed values and random variables, but hopefully it gives some vague intuition.)

So if you want to estimate the median distance of the probe from the earth from 5 samples, you roughly need to take 5 measurements at 5 times chosen uniformly at random from the entire period you are interested in. E.g. if you want to estimate the median distance of the probe from the earth during the first 10 years of travel, you need to draw 5 samples at 5 different times sampled from the uniform distribution over the period [0 seconds, 10 years]. The resulting estimate of the median distance would then only apply to the probe's distance during that time period; it could not be applied to any different time period.
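Here's a rough sketch of that comparison for the idealised probe. I'm assuming the claim being discussed is the usual one that the true median falls between the min and max of five i.i.d. samples with probability 1 - 2*(1/2)^5 = 93.75%; the thread doesn't spell that out, so treat it as my reading. The 10-year length in seconds (ignoring leap days), the trial count and the function names are also just illustrative choices:

```python
import random

SPEED_KM_S = 15.0                   # probe speed from the thought experiment
PERIOD_S = 10 * 365 * 24 * 3600     # "10 years" in seconds, ignoring leap days (assumption)

def distance(t):
    """Distance from earth in km at t seconds after launch (no randomness)."""
    return SPEED_KM_S * t

def true_median_distance():
    # With time uniform on [0, PERIOD_S] and distance increasing linearly in
    # time, the median distance is the distance at the midpoint of the period.
    return distance(PERIOD_S / 2)

def coverage_uniform_times(trials=100_000, seed=0):
    """Fraction of trials where min..max of 5 samples brackets the true median."""
    rng = random.Random(seed)
    target = true_median_distance()
    hits = 0
    for _ in range(trials):
        # Five measurement times drawn i.i.d. uniformly over the whole period.
        ds = [distance(rng.uniform(0, PERIOD_S)) for _ in range(5)]
        if min(ds) <= target <= max(ds):
            hits += 1
    return hits / trials

def first_five_sequential():
    """The original scheme: measure every 10 seconds right after launch."""
    return [distance(t) for t in (10, 20, 30, 40, 50)]

print("coverage with uniform random times:", coverage_uniform_times())
print("sequential samples (km):", first_five_sequential())
print("true median over the period (km):", true_median_distance())
```

With uniformly random times the bracket catches the median in roughly 94% of trials, while the five back-to-back measurements all sit far below it and never bracket it.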


> a key assumption that the samples need to be drawn i.i.d. from a single underlying probability distribution, or perhaps need to satisfy the related assumption of being exchangeable

Unfortunately this kind of key assumption is rarely made explicit when teaching people stats. I see research papers all the time making this assumption where it clearly isn't warranted - such as in benchmarking a computer.



