How large is that number in the Law of Large Numbers? (thepalindrome.org)
195 points by sebg 8 months ago | 75 comments



My high-school statistics class taught the following:

The number of samples you need is very difficult to calculate exactly, requiring careful analysis of standard deviations and variances.

But surprisingly, there's a simple way to know you've reached "large number" territory: each category contains more than 10 items.

---------

Ex: in a heads-vs-tails coin-flip experiment, you likely have a large number once you have over 10 heads and over 10 tails, no matter how biased the coin is.

Or in this 'Lotto ticket' example, you have a large number of samples after gathering enough data to find over 10 Jackpot winners.


Very cool rule.

I think you can justify it by approximating each category as an independent Poisson process. Then for each such process the variance equals the mean, so once you have 10 successes in a bin, you have evidence of a reasonably good estimate of that category's arrival rate. The book "The Probabilistic Method" calls a related idea "the Poisson paradigm."

(10 is a nice round number where the standard deviation, √10 ≈ 3.2, is comfortably below the mean.)
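
A minimal simulation sketch of the rule in plain Python (the stopping rule and the 0.1 bias are just illustrative assumptions):

    import random

    def flips_until_min_count(p_heads, min_count=10, seed=0):
        """Flip a (possibly biased) coin until both heads and tails reach min_count."""
        rng = random.Random(seed)
        heads = tails = 0
        while heads < min_count or tails < min_count:
            if rng.random() < p_heads:
                heads += 1
            else:
                tails += 1
        return heads, tails

    # With >= 10 hits per bin, the Poisson-style relative error sqrt(k)/k is ~30%,
    # so the estimated head probability is usually in the right ballpark,
    # even for a strongly biased coin.
    for p in (0.5, 0.1):
        h, t = flips_until_min_count(p)
        print(f"true p={p}  estimate={h / (h + t):.3f} after {h + t} flips")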


Small proviso: this is only true for a reasonable number of categories (or you run into repeated experiment problems).


What's a reasonable number of categories? 10?


That's defined by the phenomenon you're investigating.

In the case of six-sided dice, there are precisely six categories, ideally with even odds of occurrence. With the lottery jackpot given, there are eight categories, with highly asymmetric probabilities and values.

In real-world cases, you might be trying to distinguish:

- two cases (treatment and control in a medical experiment);

- multiple particles or isotopes (say, in physics or chemistry);

- different political divisions (countries, states or provinces, counties, cities, or other);

- political parties or candidates (which raises interesting questions over which and/or how many to include in consideration, in turn dependent on voting procedures, overall popularity, and the impacts of non-winning candidates or parties on others);

- multiple products;

- or different behavioural characteristics in some domain (e.g., highly-active, occasionally-active, and lurking participants in online fora).

There are times when categories are well and unambiguously defined. In others, where you choose to draw the divisions (say, generational groups, or wealth or income brackets) is highly arbitrary. Even where there are a large number of potential categories, choosing some limited number for specific analysis (2, 3, 5, 10, etc.) and lumping the rest into "other" may provide clearer insights and fewer distractions than choosing a large number of divisions.[1] In other cases, a very small number of individuals may account for an overwhelming majority of activity or outcome. I'd strongly argue that in such cases the analysis might be somewhat poorly focused, and that activities and outcomes, rather than individuals, are of greater interest.[2]

What's key is to match your sampling and sample sizes to the phenomenon being studied.

________________________________

Notes:

1. Power-law / Zipf distributions often mean that a very small number of participants has a highly disproportionate impact or significance.

2. This is often the flip side of power law distributions. If we look at all book titles, there are a huge number of individual items to consider; there are roughly 300k annual English-language "traditional" publications, and over 1 million "nontraditional" (self-published, or publish-on-demand) titles. But if your focus is instead titles by percentage of revenue or number of sales, a top-n analysis (5, 10, 20, etc.) often captures much of the activity, frequently well over half. This is typical of any informational good: music, cinema, blogs, social media posts, etc.


My intuition is telling me that a coin flip would be the worst case scenario, because I think it would be the easiest to hit 10 examples in both categories. Every other mix of probabilities I can come up with would average more rolls of RNG than a coinflip.

Am I mistaken?


That's a neat rule of thumb; is there a simple statistical argument for why 10 is the (not very large) number?


Does this work the other way? I.e. “you have enough buckets if adding one more puts the number of samples in the smallest bucket below 10?”


For heads or tails, that leaves a very large margin. In approx. 1 in 20 trials, you'll end up with a 10-20 split or one more extreme.


Yeah, a 95% confidence level (approximately two standard deviations) is pretty standard for statistical tests.

You gotta draw the line somewhere. At the high-school statistics level, it's basically universally drawn at the 95% confidence level. If you wanna draw new lines elsewhere, you gotta make new rules yourself and recalculate all the rules of thumb.


I remember my high school AP Psychology teacher mocking p=0.05 as practically meaningless. In retrospect it's funny for a psychologist to say that, but I guess it was because he was from the more empirically minded behaviorist/cognitive school, and from time to time they have done actual rigorous experiments[1] (in rodents).

[1] For example as described by Feynman in Cargo Cult Science.


The problem is two-fold:

1. p=0.05 means that one result in 20 is going to be the result of chance.

2. It's generally pretty easy (especially in psychology) to do 20 experiments, cherry-pick -- and publish! -- the p=0.05 result, and throw away the others.

The result is that published p=0.05 results are much more likely than 1 in 20 to be the result of chance.


> p=0.05 means that one result in 20 is going to be the result of chance.

You made the same mistake most people make here: you reversed the arrow of the implication. It is not "successful experiment implies chance (probability 5%)" but "chance implies successful experiment (probability 5%)".

What does that mean in practice? Imagine a hypothetical scientist who is fundamentally confused about something important, so all the hypotheses they generate are false. Yet, using p=0.05, 5% of those hypotheses will be "confirmed experimentally". In that case, it is not 5% of the "experimentally confirmed" hypotheses that are wrong -- it is a full 100%. Even without any cherry-picking.

The problem is not that p=0.05 is too high. The problem is, it doesn't actually mean what most people believe it means.
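
A toy simulation of that "confused scientist" in plain Python (the coin-flip setup, sample sizes, and normal approximation are just illustrative assumptions):

    import math, random

    def one_null_experiment(n=1000, rng=random):
        """One experiment where the hypothesis is false: the coin is actually fair,
        but we test for a bias anyway (two-sided z-test, normal approximation)."""
        heads = sum(rng.random() < 0.5 for _ in range(n))
        z = (heads - n * 0.5) / math.sqrt(n * 0.25)
        return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # p-value

    rng = random.Random(1)
    p_values = [one_null_experiment(rng=rng) for _ in range(2000)]
    confirmed = sum(p < 0.05 for p in p_values)
    # Roughly 5% of these all-false hypotheses come out "significant",
    # and every single one of those "confirmed" results is wrong.
    print(f"'experimentally confirmed': {confirmed}/{len(p_values)}"
          f" ({100 * confirmed / len(p_values):.1f}%)")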


I think we're actually in violent agreement here, but I just wasn't precise enough. Let me try again:

    p=0.05 means that one POSITIVE result in 20 is going to be the result of chance and not causality
In other words: if I have some kind of intervention or treatment, and that intervention or treatment produces some result in a test group relative to a control group with p=0.05, then the odds of getting that result simply by chance and not because the treatment or intervention actually had an effect are 5%.

The practical effect of this is that there are two different ways of getting a p=0.05 result:

1. Find a treatment or intervention that actually works or

2. Test ~20 different (useless) interventions. Or test one useless intervention ~20 times.

A single p=0.05 result in isolation is useless because there is no way to know which of the two methods produced it.

This is why replication is so important. The odds of getting a p=0.05 result by chance is 5%. But the odds of getting TWO of them in sequential trials is 0.25%, and the odds of a positive result being the result of pure chance decrease exponentially with each subsequent replication.


> Let me try again:

> p=0.05 means that one POSITIVE result in 20 is going to be the result of chance and not causality

No, you still didn't get it. In the example above, a full 100% of positive results, 20 out of every 20, are the result of chance and not causality.

Your followup discussion is better, but your statement at the top doesn't work.

(Note also that there is an interaction between p-threshold and sample size which guarantees that, if you're investigating an effect that your sample size is not large enough to detect, any statistically significant result that you get will be several times stronger than the actual effect. They're also quite likely to have the wrong sign.)


> No, you still didn't get it. In the example above, a full 100% of positive results, 20 out of every 20, are the result of chance and not causality.

Yep, you're right. I do think I understand this, but rendering it into words is turning out to be surprisingly challenging.

Let me try this one more time: p=0.05 means that there is a 5% chance that any one particular positive result is due to chance. If you test a false hypothesis repeatedly, or test multiple false hypotheses, then 5% of the time you will get false positives (at p=0.05).

However...

> Imagine a hypothetical scientist that is fundamentally confused about something important, so all hypotheses they generate are false. Yet, using p=0.05, 5% of those hypotheses will be "confirmed experimentally". In that case, it is not 5% of the "experimentally confirmed" hypotheses that are wrong -- it is full 100%.

This is not wrong, but it's a little misleading because you are presuming that all of the hypotheses being tested are false. If we're testing a hypothesis it's generally because we don't know whether or not it's true; we're trying to find out. That's why it's important to think of a positive result not as "confirmed experimentally" but rather as "not ruled out by this particular experimental result". It is only after failing to rule something out by multiple experiments that we can start to call it "confirmed". And nothing is ever 100% confirmed -- at best it is "not ruled out by the evidence so far".


> I do think I understand this, but rendering it into words is turning out to be surprisingly challenging.

A p-value of .05 means that, under the assumption that the null hypothesis you specified is true, you just observed a result which lies at the 5th percentile of the outcome space, sorted along some metric (usually "extremity of outcome"). That is to say, out of all possible outcomes, only 5% of them are as "extreme" as, or more "extreme" than, the outcome you observed.

It doesn't tell you anything about the odds that any result is due to chance. It tells you how often the null hypothesis gives you a result that is "similar", by some definition, to the result you observed.
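
A concrete worked example of that definition, assuming a fair-coin null and an observed 10 heads in 30 tosses (numbers chosen to match the split mentioned upthread):

    from math import comb

    def binom_pmf(n, j, p=0.5):
        return comb(n, j) * p**j * (1 - p)**(n - j)

    n, k = 30, 10   # observed: 10 heads in 30 tosses; expected under the null: 15
    lower = sum(binom_pmf(n, j) for j in range(k + 1))                     # P(X <= 10), ~0.05
    two_sided = lower + sum(binom_pmf(n, j) for j in range(n - k, n + 1))  # + P(X >= 20), ~0.10
    # The p-value is just the fraction of fair-coin outcomes at least as
    # "extreme" (far from 15) as the one observed.
    print(f"one-sided tail: {lower:.3f}   two-sided p-value: {two_sided:.3f}")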


What do you think that "due to chance" means?


That is a very reasonable question, and in this context we might reasonably say that "this [individual] outcome is due to chance" means the same thing as "the null hypothesis we stated in our introduction is platonically correct".

But I don't really see the relevance to this discussion?

Suppose you nail down a null hypothesis, define a similarity metric for data, run an experiment, and get some data. The p-value you calculate theoretically tells you this:

If the above-mentioned hypothesis is true, then X% of all data looks like your data

It doesn't tell you this:

If you have data that looks like your data, then there is an X% chance that the above-mentioned hypothesis is true

Those are two unrelated claims; one is not informative -- at all -- as to the other. The direction of implication is reversed between them.

Imagine that you're considering three hypotheses. You collect your data and make this calculation:

1. Hypothesis A says that data looks like what I collected 20% of the time.

2. Hypothesis B says that data looks like what I collected 45% of the time.

3. Hypothesis C says that data looks like what I collected 100% of the time.

Based only on this information, what are the odds that hypothesis A is correct? What are the odds that hypothesis C is correct? What are the odds that none of the three is correct?
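
One way to see why those questions can't be answered from the likelihoods alone: turning them into "odds that hypothesis X is correct" requires a prior (plus some prior mass on "none of the above"), and the answer swings completely with whatever prior you assume. A minimal Bayes sketch with made-up priors:

    def posteriors(likelihoods, priors):
        """Bayes' rule: P(H | data) is proportional to P(data | H) * P(H)."""
        unnormalized = [l * p for l, p in zip(likelihoods, priors)]
        total = sum(unnormalized)
        return [round(u / total, 2) for u in unnormalized]

    likelihoods = [0.20, 0.45, 1.00]   # "data like mine" under A, B, C from above

    print(posteriors(likelihoods, [1/3, 1/3, 1/3]))     # uniform prior      -> ~[0.12, 0.27, 0.61]
    print(posteriors(likelihoods, [0.98, 0.01, 0.01]))  # A strongly favoured -> ~[0.93, 0.02, 0.05]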


This is getting deep into the weeds of the philosophy of science. It is crucially important to choose good hypotheses to test. For example:

> Hypothesis C says that data looks like what I collected 100% of the time.

What this tells you depends entirely on what hypothesis C actually is. For example, if C is "There is an invisible pink unicorn in the room, but everyone will deny seeing it because it's invisible" then you learn nothing by observing that everyone denies seeing the unicorn despite the fact that this is exactly what the theory predicts.

On the other hand, if C is a tweak to the Standard Model or GR that explains observations currently attributed to dark matter, that would be a very different situation.


> It is crucially important to choose good hypotheses to test.

But if you were able to do that, you wouldn't need to test the hypotheses. You'd already know they were good.

I'm intrigued as to why you picked those two examples. They differ in aesthetics without differing in implications, but you seem to want to highlight them as being different in an important way!


Seriously? You don't see any substantive difference between an explanation of dark matter and positing invisible pink unicorns? How do I even begin to respond to that?

Well, let's start with the obvious: there is actual evidence for the existence of dark matter -- that's the entire reason that dark matter is discussed at all. There is no evidence for the existence of invisible pink unicorns. Not only is there no evidence for IPU's, the IPU hypothesis is specifically designed so that there cannot possibly be any. The IPU hypothesis is unfalsifiable by design. That's the whole point.


If the invisible pink unicorn hypothesis was true, what about the world would be different?

If the MOND hypothesis was true, what about the world would be different?

The whole reason we have a constant supply of theories attempting to explain observations currently attributed to dark matter in terms other than "dark matter" is that people feel the dark matter theory is stupid. There's nothing else to it. I assume you feel the same way about unicorns. What's the difference supposed to be?

> There is no evidence for the existence of invisible pink unicorns.

You need to be careful here too. The fact that a theory is false does not mean there is no evidence for that theory.


> The whole reason we have a constant supply of theories attempting to explain observations currently attributed to dark matter in terms other than "dark matter" is that people feel the dark matter theory is stupid.

No, that's not true. The reason we have a "constant supply" of dark matter theories is that all of the extant theories have been falsified by observations, including MOND. If this were not the case, dark matter would be a solved problem and would no longer be in the news.

> The fact that a theory is false does not mean there is no evidence for that theory.

What makes you think the IPU theory is false? The whole point of the IPU hypothesis is that it is unfalsifiable.


You can't simply ignore the base rate, even if you don't know it.

In a purely random world, 5% of experiments are false positives, at p=0.05. None are true positives.

In a well ordered world with brilliant hypotheses, there are no false positives.

If more than 5% of experiments show positive results at p=0.05, some of them are probably true, so you can try to replicate them with lower p.

p=0.05 is a filter for "worth trying to replicate" (but even that is modulated by cost of replication vs value of result).

The crisis in science is largely that people confuse "publishable" with "probably true". Anything "probably better than random guessing" is publishable to help other researchers, but that doesn't mean it's probably true.
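
A toy simulation of the base-rate point (the 20% base rate and 80% power are pure assumptions, just for illustration):

    import random

    def simulate(base_rate=0.2, power=0.8, alpha=0.05, n=100000, rng=random):
        """Assumed numbers: base_rate of tested hypotheses are true; a true effect
        is detected with probability `power`; a false one passes with probability alpha."""
        true_pos = false_pos = 0
        for _ in range(n):
            if rng.random() < base_rate:      # the hypothesis happens to be true
                true_pos += rng.random() < power
            else:                             # the hypothesis is false
                false_pos += rng.random() < alpha
        return true_pos, false_pos

    rng = random.Random(6)
    tp, fp = simulate(rng=rng)
    # With a 20% base rate, roughly 1 in 5 "significant" results is noise,
    # far more than the naive "1 in 20". Lower the base rate and it gets much worse.
    print(f"share of 'significant' results that are false: {fp / (tp + fp):.0%}")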


> p=0.05 is a filter for "worth trying to replicate"

Yes, I think that is an excellent way to put it.

> The crisis in science is largely that people confuse "publishable" with "probably true".

I would put it slightly differently: people conflate "published in a top-tier peer-reviewed journal" with "true beyond reasonable dispute". They also conflate "not published in a top-tier peer-reviewed journal" with "almost certainly false."

But I think we're in substantial agreement here.


Do you know the difference between "if A then B" and "if B then A"?

This is the same thing, but with probabilities: "if A, then 5% chance of B" and "if B, then 5% chance of A". Those are two very different things.

p=0.05 means "if hypothesis is wrong, then 5% chance of published research". It does not mean "if published research, then 5% chance of wrong hypothesis"; but most people believe it does, including probably most scientists.


> if hypothesis is wrong, then 5% chance of published research

I would say "5% chance of positive result nonetheless" but yes, I do get this. I'm just having inordinate trouble rendering it into words.


> What does that mean in practice? Imagine a hypothetical scientist that is fundamentally confused about something important, so all hypotheses they generate are false. Yet, using p=0.05, 5% of those hypotheses will be "confirmed experimentally". In that case, it is not 5% of the "experimentally confirmed" hypotheses that are wrong -- it is full 100%. Even without any cherry-picking.

Well, that example is also introducing dependence, which is of course a tricky thing whenever we talk about chance and stats.

But there's also another issue - a statement like "5% of positive published results are by chance since we have a p<=0.05 standard" treats every set of results as if p=0.05, whereas some of them are considerably lower anyway. Though the point of bad actors cherry-picking to screw up the data also comes into play here.

(And of course, fully independent things in life are much harder to find than one might think at first.)


I agree that the point about the 'confused scientist' is important, even if that itself is not stated clearly enough. Here is my own reading:

Imagine that a scientist is making experiments of the form: Does observable variable A correlate with observable variable B? Now imagine that there are billions of observable variables and almost all of them are not correlated. And imagine that there is no better way to come up with plausible correlations to test than randomly picking variables. Then it will take a very long time and a very large number of experiments to find a pair that is truly correlated. It will be inevitable that most positive results are bogus.


So run a meta-study upon the results published by a set of authors and double-check to make sure that their results are normally distributed across the p-values associated with their studies.

These problems are solved problems in the scientific community. Just announce that regular meta-studies will be done, publish the expectation that authors' results be normally distributed, and publicly show off the meta-study.

-------------

In any case, the discussion point you're making is well beyond the high-school level needed for a general education. If someone needs to run their own experiment (A/B testing upon their website) and cannot afford a proper set of tests/statistics, they should instead rely upon high-school level heuristics to design their personal studies.

This isn't a level of study about analyzing other people's results and finding flaws in other people's (possibly maliciously seeded) results. This is a heuristic about how to run your own experiments and how to prove something to yourself at a 95% confidence level. If you want to get published in the scientific community, the level of rigor is much higher of course, but no one tries to publish a scientific paper on just a high school education (which is the level my original comment was aimed at).


> and double-check to make sure that their results are normally distributed across the p-values associated with their studies

What is the distribution of a set of results over a set of p-values?

If you mean that you should check to make sure that the p-values themselves are normally distributed... wouldn't that be wrong? Assuming all hypotheses are false, p-values should be uniformly distributed. Assuming some hypotheses can sometimes be true, there's not a lot you can say about the appropriate distribution of p-values - it would depend on how often hypotheses are correct, and how strong the effects are.
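
A quick sketch of that last claim: when every null is exactly true, the p-values come out (approximately) uniform. The normal-approximation test and sample sizes here are just illustrative:

    import math, random, statistics

    def p_value_under_null(n=50, rng=random):
        """One experiment where the null is exactly true: data ~ N(0,1), test 'mean = 0'."""
        xs = [rng.gauss(0, 1) for _ in range(n)]
        z = statistics.fmean(xs) / (statistics.stdev(xs) / math.sqrt(n))
        return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

    rng = random.Random(2)
    ps = sorted(p_value_under_null(rng=rng) for _ in range(10000))
    deciles = [round(ps[int(len(ps) * q / 10)], 2) for q in range(1, 10)]
    print(deciles)   # roughly [0.1, 0.2, ..., 0.9]: uniform, not normal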


First, I was specifically responding to this:

> I remember my high school AP Psychology teacher mocking p=0.05 as practically meaningless.

and trying to explain why the OP's teacher was probably right.

Second:

> So run a meta-study upon the results published by a set of authors and double-check to make sure that their results are normally distributed across the p-values associated with their studies.

That won't work, especially if you only run the meta-study on published results because it is all but impossible to get negative results published. Authors don't need to cherry-pick, the peer-review system does it for them.

> These problems are solved problems in the scientific community.

No, they aren't. These are social and political problems, not mathematical ones. And the scientific community is pretty bad at solving those.

> the discussion point you're making is well beyond the high-school level needed for a general education

I strongly disagree. I think everyone needs to understand this so they can approach scientific claims with an appropriate level of skepticism. Understanding how the sausage is made is essential to understanding science.

And BTW, I am not some crazy anti-vaxxer climate-change denialist flat-earther. I was an academic researcher for 15 years -- in a STEM field, not psychology, and even that was sufficiently screwed up to make me change my career. I have advocated for science and the scientific method for decades. It's not science that's broken, it's the academic peer-review system, which is essentially unchanged since it was invented in the 19th century. That is what needs to change. And that has nothing to do with math and everything to do with politics and economics.


> It's not science that's broken, it's the academic peer-review system, which is essentially unchanged since it was invented in the 19th century.

In my experience, it's not even this. Rather, it is that outside of STEM, very, very few people truly understand hypothesis testing.

At least in my experience, even basic concepts such as "falsify the null hypothesis" are surprisingly hard, even for presumably intelligent people, such as MDs in PhD programmes.

They will still tend to believe that a "significant" result is proof of an effect, and often even believe it proves that the effect is causal with the direction they prefer.

At some point, stats just becomes a set of arcane conjurations for an entire field. At that point, the field as a whole tends to lose their ability to follow the scientific method and turns into something resembling a cult or clergy.


FWIW, I got through a Ph.D. program in CS without ever having to take a stats course. I took probability theory, which is related, but not the same thing. I had to figure out stats on my own. So yes, I think you're absolutely right, but it's not just "outside of STEM" -- sometimes it's inside of STEM too.


Yes. I was, however, not arguing that every student of the field would have to understand the scientific method well. It's enough that there is a critical mass of leaders with such understanding, to ensure that students (including PhD students) work in ways that support it.

What I was arguing was that there is almost nobody with this understanding in many fields outside STEM.

As for your case, I don't know exactly what "probability theory" meant at your college. But in principle, if it's teaching about probability density functions and how to integrate them to calculate various probabilities, you're a long way towards a basic understanding of stats, surpassing many "stats" courses taught to social science students.

I myself only took a single "stats" course before graduating, which was mostly calculus applied to probability theory, without applications such as hypothesis testing baked in. Then I went on to do a lot of physics that was essentially applied probability theory (statistical mechanics and quantum mechanics).

Around that time, my GF (who was a bit older than me) was teaching a course in scientific methodology to a class of MD students who wanted to become "real" doctors (PhD programme for Medical Doctors), and the math and logic part was kind of hard for her (physicians may not learn a lot of stats until this level, but most of the MD PhD students are quite smart). Anyway, with a proper STEM background, picking up these applications was really easy.

Since then, I've had many encounters with people from various backgrounds who try to grapple with stats or adjacent spaces (data mining, machine learning, etc.), and it seems that those who do not have a Math or Physics background, or at least a quite theoretical/mathematical Computer Science or Economics background, struggle quite hard.

Especially if they have to deal with a problem that is not covered by the set of conjurations they've been taught in their basic stats classes, since they only learned the "how" but not the "why".


There’s a professor of Human Evolutionary Biology at Harvard who only has a high school diploma[1]. Needless to say he’s been published and cited many times over.

[1] https://theconversation.com/profiles/louis-liebenberg-122680...


I don't know whether you're mocking them or being supportive of them or just stating a fact. Either way, education level has no bearing on subject knowledge. I know more about how computers, compilers, and software algorithms work than most post-docs and professors that I've run into in those subjects.

Am I smarter than them? Nope. Do I know as many fancy big words as them? Nope. Do I care about results and communicating complex topics to normal people? Yep. Do I care more about making the company money than chasing some bug-bear to go on my resume? Yep.

I fucking hate school and have no desire to ever go back. I can't put up with the bullshit, so I dropped out; I just never stopped studying and I don't need a piece of paper to affirm that fact.


To the people downvoting: at least offer a rebuttal.


The observation above is simply true. If you toss a coin 30 times, there's about a 5% chance that you'll end up with a 10-20 split or one more extreme.

NHST inverts the probability logic, makes the 5% holy, and skims over the high probability of finding something that is not exactly equal to a specific value. That procedure is then used for theory confirmation, while it was (in another form) meant for falsification. Everything is wrong about it, even if the experimental method is flawless. Hence the reproducibility crisis.


Tangential: syntax-highlighting math! This is the first time I’ve seen it. Not yet sure what I think about it, but I can definitely see the allure.


Yeah, I like it. I used this as a tutor in more finicky exercises, where it becomes really important to keep 2-3 very similar but different things apart. It takes a bit of dexterity, but you can switch fluently between 3 different whiteboard markers held in one hand while writing, haha.

I am kind of wondering if semantic highlighting makes sense as well. You often end up with some implicit assignment of lowercase Latin, uppercase Latin, lowercase Greek letters and such to certain meanings. Kinematics: xyzt for position and time, T_i(I_i) for the quaternion or transformation representing a certain joint of a robot.


It's easy on the eyes and it can make reading lots of equations less awkward if done correctly. I remember finding out this was possible while I was working on an assignment in LaTeX - it looked amazing.

It takes a little bit of work to colour in equations but I hope more people start doing it (including me, I'd forgotten about it for a while)


Pedant-man on the scene: this is just highlighting since the highlighting isn't derived from syntax.


Which makes me wonder what that would look like and would it be helpful?

But there's already such complex and varied typography in math I wonder if it would be kind of redundant. E.g. you don't need matching parentheses to be colored when they already come in different sets of matching heights.


It's been around for a long time (centuries?). In most textbooks, you'll get different semantics for italic and bold faces. Modern textbooks with color printing often use color in semantic ways.


This kind of intuition is why a high school level statistics or probability class seems so, so valuable. I know not everyone will use the math per se, but the concepts apply to everyday life and are really hard to grasp without having been taught them at some point.


The sad thing is, having a mandatory high school level statistics & probability class alone is not enough; you'll also need a good curriculum and a competent teacher to go along with it. Otherwise it won't work: a bad curriculum taught badly by an unmotivated or unqualified teacher will almost always fail to teach the intuition or, even worse, will alienate students from the material.


Stats class rule of thumb: if you need to calculate the relative probability of two outcomes, you can get to within about 10% once you have 100 samples of each outcome (so you need more samples overall if the distribution is skewed).


It's interesting that even in this thread the two answers differ by an order of magnitude lol


Just apply it recursively. Let’s get 100 samples of comments suggesting the number of samples to use. Then average those.


FWIW, the threshold I learned was 20 in each bucket, so now you have 3 answers.


Eh it's really the same rule, just applying a different threshold.


The problem is that the sensitivity to sample-size growth is supposed to be exponential. So if you need 100 samples for "within 10% of the value", then 10 samples should give you almost completely random behavior.

In reality, it depends on your actual distribution, but the OP's rule here is unreasonably conservative for something described as a "rule of thumb". Almost always, if you have at least 10 of every category, you can already discover every interesting thing that a rule of thumb will allow. And you probably could go with less. But if you want precision, you can't get it with rules of thumb.


The dependence on sample size is not exponential, it's sublinear. The heuristic rate of convergence to keep in mind is the square root of the sample size, i.e. getting 10x more samples shrinks the margin of error (in a multiplicative sense) by sqrt(10) ≈ 3ish.

The exponential bit applies to the probability densities as a function of the bounds themselves, i.e. how likely you are to fall x units away from the mean typically decreases exponentially with (some polynomial in) x.

Of course, this is all assuming a whole bunch of standard conditions on the data you're looking at (independence, identically distributed, bounded variance, etc.) and may not hold if these are violated.
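
A die-rolling sketch of that square-root rate (the sample sizes and trial counts are arbitrary):

    import random, statistics

    def typical_error(n, trials=2000, rng=random):
        """Typical |sample mean - true mean| over many runs of n fair-die rolls (true mean 3.5)."""
        return statistics.fmean(
            abs(statistics.fmean(rng.randint(1, 6) for _ in range(n)) - 3.5)
            for _ in range(trials))

    rng = random.Random(3)
    for n in (100, 1000, 10000):
        print(n, round(typical_error(n, rng=rng), 4))
    # Each 10x increase in n shrinks the typical error by roughly sqrt(10) ~= 3.2x.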


I think both answers are referencing the Central Limit Theorem, which [simplified] says that once you get beyond roughly 30 samples, the distribution of the sample mean is approximately normal.


If you had a gambling game that was simply "heads or tails, even money", you would expect that over a Large Number of trials you would get 0. But once you observe exactly one trial, the expected value becomes +1 or -1 unit. We know this is always going to happen one way or the other. Why, then, does the bell curve of "expected value" for this game not have two peaks, at 1 and -1? Why does it peak at 0 instead?

What I'm asking about, I know I'm wrong about - I just want to know how I can derive that for myself.


The intuitive explanation is that the effect of a single sample on the average diminishes as you take more samples. So, hand-waving a bit, let's assume it's true that over a large number of trials you would expect the average to converge to 0. You just tossed a coin and got heads, so you're at +1. The average of (1 + 0*n)/(n+1) still goes to 0 as n grows bigger and bigger.

That skips over the distinction between "average" and "probability distribution", but those nuances are probably better left for a proof of the central limit theorem.


There are a couple of confusions/ambiguities here.

The Law of Large Numbers is about the average, so it's not relevant here (an average of +1 would mean you got heads every single time, which is extremely unlikely for large n).

If you are looking at the sum, then the value depends on whether the number of trials (n) is even or odd. If n is odd, you would indeed get two peaks at 1 and -1, and you would never get exactly 0. If n is even, you would get a peak at 0 and you would never get exactly 1 or -1.

The expected value (aka average) is a number, not a distribution. The expected value for the sum is 0 even when n is odd and you can't get exactly 0 -- that's just how the expected value works (in the same way that the "expected value" for the number of children in a family can be 2.5 even though a family can't have half a child). If you look at the probability density function for a single trial, then it does have two peaks at 1 and -1 (and is zero everywhere else).

The curve you refer to might be the normal approximation (https://en.wikipedia.org/wiki/Binomial_distribution#Normal_a...). It's true that the normal approximation for the distribution of the sum in your gambling game has a peak at 0 even when n is odd and the sum can't be exactly 0. That's because the normal approximation is a continuous approximation and it doesn't capture the discrete nature of the underlying distribution.
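
For anyone who wants to see the even/odd point directly, a small simulation sketch (the trial counts are arbitrary):

    import random
    from collections import Counter

    def sum_distribution(n, trials=100000, rng=random):
        """Distribution of the sum of n fair +1/-1 bets."""
        return Counter(sum(rng.choice((1, -1)) for _ in range(n)) for _ in range(trials))

    rng = random.Random(4)
    for n in (9, 10):                                  # odd vs even number of bets
        counts = sum_distribution(n, rng=rng)
        print(n, [counts.get(s, 0) for s in range(-3, 4)])
    # Odd n: zero mass at 0 and twin peaks at -1/+1.  Even n: a single peak at 0.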


"The expected value of a random variable with a finite number of outcomes is a weighted average of all possible outcomes." -- https://en.wikipedia.org/wiki/Expected_value


That makes sense, I was always thinking of it as "Given an infinite number of trials..."


That would be the frequentist interpretation. A Bayesian would say that probability is to be interpreted as a degree of belief that an outcome will occur. Neither is really right or wrong; it depends on what you're modeling. If we're analyzing some kind of heavily repeated task (e.g., a sordid night of us glued to the blackjack table where we play a lot of hands, or data transmission over a noisy cable), a frequentist interpretation might make more sense. However, if you're talking about the probability of a candidate winning an election, you could take a Bayesian view where the probability asserts a confidence in an outcome. A radical frequentist would take umbrage with an event that only happens once. However, I suppose, depending on your election rules and model (e.g., a direct democracy), you could interpret the election winner in a frequentist manner: the probability of winning is the rate at which people vote for the candidate. For a more complicated system I'm not sure the frequentist view is as easily justified.

However to answer your question more directly, the expected value is just another name for the average or mean of a random variable. In this case, the variable is your profit. Assume we’re betting a dollar per toss on coin flips and I win if it’s heads (everyone knows heads always wins, right?). The expected value is probability of heads * 1 - probability of tails * 1. If the coin is fair, the probabilities are the same so the expected value is zero.

Aside: sequences of random variables that are “fair bets” are called martingales and are incredibly useful. It’s a fair bet because, given all prior knowledge of the value of the variable thus far, the expected value of the next value you witness is the current value of the variable. You could imagine looking at a history of stock values. Given all that information, it’s a martingale (and thus a fair bet) if given that information your expected profit from investing is 0.


Whether/when it's better to think in terms of "X has a 37% chance of happening in a single trial" vs "If you ran a lot of trials, X would happen in 37% of them" is kind of a fraught topic that I can't say much about, but you might find this interesting: https://en.wikipedia.org/wiki/Probability_interpretations


You're on the right track. The only thing you're missing is that adding (averaging) two bell-curve distributions that are offset by a little does not necessarily give a bimodal distribution. It will only be bimodal if the two unimodal distributions you are adding are far enough apart.

See this https://stats.stackexchange.com/questions/416204/why-is-a-mi...


The biggest problem for real processes is knowing whether in fact x ~ i.i.d., with regard to time as well as individual observations.


Curious how people are ‘applying’ the Law of Large Numbers in a way that needs this advice to be tacked on?

> Always keep the speed of convergence in mind when applying the law of large numbers.

Any ‘application’ of the LLN basically amounts to replacing some probabilistic number derived from a bunch of random samples with the expected value of that number… and tacking on ‘for sufficiently large n’ as a caveat to your subsequent conclusions.

Figuring out whether, in practical cases, you will have a sufficiently large n that the conclusion is valid is a necessary step in the analysis.


> Figuring out whether, in practical cases, you will have a sufficiently large n that the conclusion is valid is a necessary step in the analysis.

The econometrics textbook I studied has the word "asymptotic" in it more times than it has pages. Oftentimes it's impractical or even theoretically intractable to derive finite-sample properties (and thus to answer when n is really large enough).


> This means that on average, we’ll need a fifty million times larger sample for the sample average to be as close to the true average as in the case of dice rolls.

This is "as close" in an absolute sense, right?

If I take into account that the lottery value is 20x larger, and I'm targeting relative accuracy, then I need 2.5 million times as many samples?


Didn't make sense to me either. Maybe it's better to use something along the lines of variance/mean instead?


I really like how the plots and graphics look. Is it the library by 3blue1brown (manim, I think it's called)?


Am I the only one unreasonably annoyed that his graphs don't match the description of his rolls?


One way to gain intuition on why the LLN might work faster, slower, or not at all is to write the mean equation in recursive form (what's the next estimate of the expected value, given a new sample and the current estimate?).
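
A tiny sketch of that recursive form, using fair-die samples (the seed and sample count are arbitrary):

    import random

    def running_mean(samples):
        """Recursive form of the sample mean: each new sample nudges the current
        estimate by (x - mean) / n, so later samples move it less and less."""
        mean = 0.0
        for n, x in enumerate(samples, start=1):
            mean += (x - mean) / n
            yield mean

    rng = random.Random(5)
    estimates = list(running_mean(rng.randint(1, 6) for _ in range(10000)))
    print(estimates[9], estimates[99], estimates[9999])   # wandering toward 3.5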


In learning theory there is a focus on "non-asymptotic" results. Instead of only showing that our method converges on the right answer in the limit of infinite data, we must show how fast it converges.


If the blog author is reading, some notes for improvement:

- Your odds calculation is likely wrong. You assumed from the word "odds" that "odds ratio" was meant (Odds=3 meaning "odds 3:1 against", corresponding to p=25%), but the phrase is "approximate odds 1 in X" (Odds=3 meaning "odds of 1 in 3 to win", i.e., 33%), and recalculating gives the remarkably exact expected value of $80, which seems intentional?

- You phrase things in terms of variances; people will think more in terms of standard deviations. So 3.5 ± 1.7 vs $80 ± $12,526 (see the quick sanity check after these notes).

- Note that you try to make a direct comparison between those two but the two are in fact incomparable. The most direct comparison might be to subtract 1 from the die roll and multiply by $32, so that you have a 1/6 chance of winning $0, 1/6 of winning $32, ... 1/6 of winning $160. So then we have $80 ± $55 vs $80 ± $12,526. Then instead of saying you'd need 50 million more lottery tickets you'd actually say you need about 50 thousand more. This is closer to the "right ballpark" where you can tell that the whole lottery is expected to sell about 10,200,000 tickets on a good day.

- But where an article like this should really go is, "what are you using the numbers for?". In the case of the Texas lottery this is actually a strong constraint, they have to make sure that they make a "profit" (like, it's not a real profit, it probably goes to schools or something) on most lotteries, so you're actually trying to ensure that 5 sigma or so is less than the bias. So you've got a competition between $20 · n and 5 · $12,526 · √(n), or √(n) = 12526/4, n = 9.8 million. So that's what the Texas Lottery is targeting, right? So then we would calculate that the equivalent number of people that should play in the "roll a die linear lottery" we've constructed is 187, call it an even 200, if 200 people pay $100 for a lottery ticket on the linear lottery then we can pretty much always pay out even on a really bad day.

- So the 50,000x number that is actually correct is basically just saying that we can run a much smaller lottery, 50,000 times smaller, with that payoff structure. And there's something nice about phrasing it this way.

- To really get "law of large numbers" we should actually probably be looking at how much these distributions deviate from Gaussian, rather than complaining that the Gaussian is too wide? You can account for a wide Gaussian in a number of ways. But probably we want to take the cube root of the 3rd cumulant, for example, and try to argue when it "vanishes"? Except that, given the symmetry, the 3rd cumulant for the die is probably 0, so you might need to go out to the 4th cumulant for the die -- and this might give a better explanation for the die converging more rapidly in "shape" to the mean: it doesn't just come close faster, it also becomes Gaussian significantly faster because the payoff structure is symmetric about the mean.
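
A quick sanity check of a few of those numbers in plain Python (the $12,526 per-ticket standard deviation is taken from the notes above; everything else follows from the payoff structures as described):

    from statistics import fmean, pstdev

    die = range(1, 7)
    print(fmean(die), round(pstdev(die), 2))                          # 3.5 and ~1.71

    linear_lottery = [32 * (d - 1) for d in die]                      # $0, $32, ..., $160
    print(fmean(linear_lottery), round(pstdev(linear_lottery), 1))    # 80 and ~54.7

    sigma = 12526              # per-ticket standard deviation quoted above
    n = (5 * sigma / 20) ** 2  # tickets needed so a $20/ticket edge beats 5 sigma of noise
    print(round(n / 1e6, 1), "million tickets")                       # ~9.8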


About three fifty.

(No, just joking. Actually 42 plus or minus.)



