Birthday Paradox Revisited (datagenetics.com)
86 points by bookofjoe 7 months ago | 27 comments



The Birthday Problem [1]:

    How many people do you have to put in a room
    before there's a 50% chance of two having the
    same birthday?
Assuming a uniform distribution of birthdays, the answer turns out to be 23. In general, for M possible birthdays (other than 365, say), you need on the order of sqrt(M) people before you get a 50% chance; more precisely, about 1.18 * sqrt(M).
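
For anyone who wants to check the number, here is a minimal sketch that computes the exact probability under the uniform assumption and searches for the 50% threshold. The function names `p_shared` and `threshold` are my own, not from the article.

```python
import math

def p_shared(n: int, m: int = 365) -> float:
    """Probability that at least two of n people share a birthday,
    assuming m equally likely birthdays."""
    p_distinct = 1.0
    for i in range(n):
        # Person i+1 must avoid the i birthdays already taken.
        p_distinct *= (m - i) / m
    return 1.0 - p_distinct

def threshold(m: int = 365) -> int:
    """Smallest n for which p_shared(n, m) reaches 50%."""
    n = 1
    while p_shared(n, m) < 0.5:
        n += 1
    return n

print(threshold(365))                    # 23
print(math.sqrt(2 * math.log(2) * 365))  # square-root estimate, roughly 22.5
```

The second print is the usual approximation sqrt(2 * ln 2 * M), which is where the "about sqrt(M)" rule of thumb comes from.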

The article is a standard introduction to the Birthday Problem with some real-world data thrown in. The 'paradox' comes from the surprise that the number is so low (23) compared with the number of days in the year (365). As the article points out, one short description of why the number is so low is that you're comparing each new person with every person already in the room instead of drawing two numbers at random and seeing if they're the same.

For the curious, I have a minimal post on how to derive the Birthday Paradox and other canonical probability problems [2].

[1] https://en.wikipedia.org/wiki/Birthday_problem

[2] https://mechaelephant.com/dev/Assorted-Small-Probability-Pro...


An even shorter intuition: the number of pairs in a group of n people grows as O(n^2), which implies some square-root factor. I would lead with that before starting a more rigorous proof.


Agreed. To narrow down some of the details: There are n*(n-1)/2 = roughly n^2 / 2 pairs. If each pair has a 1/k chance of matching, then the expected number of pairs that match is (n^2 / 2) / k. Therefore, if we set n = √k, we get roughly 0.5 pairs on average.

Now, if it were impossible to ever get multiple matching pairs, then "the chance there's at least one match" would be equal to "the expected number of matches". Specifically: "chance of at least 1 match" = "expected number of matches" - "chance of at least 2 matches" - "chance of at least 3 matches" - ... Since it's possible but unlikely to get multiple matches, the approximation should be reasonably close.
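
A rough sketch of how close the pair-counting estimate gets, assuming uniform birthdays with k = 365. The function names are made up for illustration; the expected number of matching pairs should always sit at or above the exact chance of at least one match.

```python
def expected_matching_pairs(n: int, k: int = 365) -> float:
    """Expected number of matching pairs: n*(n-1)/2 pairs, each matching with chance 1/k."""
    return n * (n - 1) / 2 / k

def p_at_least_one_match(n: int, k: int = 365) -> float:
    """Exact chance that at least one pair matches, for k equally likely values."""
    p_none = 1.0
    for i in range(n):
        p_none *= (k - i) / k
    return 1.0 - p_none

# Compare the estimate with the exact value for a few group sizes.
for n in (5, 10, 19, 23, 30):
    print(n, round(expected_matching_pairs(n), 3), round(p_at_least_one_match(n), 3))
```

The gap between the two columns is the "chance of at least 2 matches" plus the further terms, so it widens as the group grows.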


It feels like the number 23 is specific to the context of birthdays. If it were switched to a different context, such as coming from the same part of the world or driving the same car, would the result be as surprising?


I believe statistical ecologists use the reverse in an interesting way. Tag 100 animals of a species, say goats (failure of imagination on my part), from a certain location. Then sample 100 other animals at a different time, controlling for time spent and so on, and the fraction of tagged animals that come up in the second sample will give statistical information about the total population of said species.
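
The simplest version of this idea is usually called mark-recapture, and the basic estimate is the Lincoln-Petersen estimator. A minimal sketch, with a hypothetical recapture count of 20 that is not from the comment above:

```python
def lincoln_petersen(marked_first: int, caught_second: int, recaptured: int) -> float:
    """Estimate total population as (tagged in first visit * caught in second visit)
    divided by the number of tagged animals seen again."""
    if recaptured == 0:
        raise ValueError("no tagged animals recaptured; the estimate is undefined")
    return marked_first * caught_second / recaptured

# Hypothetical numbers: 100 goats tagged, 100 sampled later, 20 of those carry a tag.
print(lincoln_petersen(100, 100, 20))  # 500.0
```

The reasoning is that the tagged fraction of the second sample (20/100) should roughly equal the tagged fraction of the whole population (100/N).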


I really enjoy reading about those kinds of problems because they are so opposite to my intuition.

Even after thinking about this more, and reading (and understanding) the solution, my gut guess would still be "well, with 182 people, half the pigeonholes would be taken, so there would be a 50% chance the next person getting in the room would pick a taken hole".

Another similar problem is the Monty Hall problem. Simple, easy to understand when explained, but still, despite understanding the solution, doesn't feel right!


You're formulating the problem wrong in your mind.

It's true that if you have 182 people in the room, all with unique birthdays, and you add one more random person to the room, there is a 50% chance of them sharing a birthday, just like you described. But you're assuming you already managed to gather 182 people without any birthday collisions. That's a different problem than the one originally posed. In your case you collected 182 people with unique birthdays and checked the probability of a collision when adding one more. But the real question asks: if you grabbed those 182 people at random, what is the chance that any two of them already share a birthday?
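
To put numbers on the two different questions, here is a small sketch assuming 365 equally likely birthdays. The function names are hypothetical.

```python
def p_next_matches(n_unique: int, days: int = 365) -> float:
    """Chance that one new person matches someone, given n_unique people
    who already have distinct birthdays."""
    return n_unique / days

def p_all_distinct(n: int, days: int = 365) -> float:
    """Chance that n randomly chosen people all have distinct birthdays."""
    p = 1.0
    for i in range(n):
        p *= (days - i) / days
    return p

print(p_next_matches(182))  # just under 0.5
print(p_all_distinct(182))  # vanishingly small
```

The first number is the 50% from the pigeonhole picture. The second is the chance of ever getting to that starting position with random people, which is effectively zero.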


Or even simpler: Wow, 50% per person. And you have hundreds of people. So basically a guarantee.


indeed, that makes more sense this way!


Unfortunately I don't have any good intuition to share for the Monty Hall problem, as I haven't been able to get an intuitive understanding of it yet.

However I did just read about Bertrand's Box Paradox[1], and it's very much the same sort of thinking as the Monty Hall problem, but more intuitively understandable for me at least.

[1] - https://en.wikipedia.org/wiki/Bertrand%27s_box_paradox
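
For what it's worth, Bertrand's box is small enough to enumerate exhaustively. A minimal sketch, assuming the standard setup of three boxes (gold/gold, gold/silver, silver/silver) with a box and then a coin picked uniformly at random:

```python
from fractions import Fraction

boxes = [("gold", "gold"), ("gold", "silver"), ("silver", "silver")]

# Every (drawn coin, remaining coin) outcome; all six are equally likely.
outcomes = [(box[i], box[1 - i]) for box in boxes for i in (0, 1)]

# Condition on having drawn a gold coin.
drew_gold = [(drawn, other) for drawn, other in outcomes if drawn == "gold"]
other_is_gold = [pair for pair in drew_gold if pair[1] == "gold"]

p = Fraction(len(other_is_gold), len(drew_gold))
print(p)  # 2/3
```

Counting coins rather than boxes is the whole trick: three of the six equally likely draws are gold, and two of those three come from the gold/gold box.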


> The result is that the non-uniformity does have an effect, but it's very, very small. When there were four or fewer people in the room, from my experiments, the non-uniformity of distribution resulted in a slight decrease in the chances of a birthday collision (but this could be loss of precision from too small a sample size, as the percentage of times, for instance, that two people collide is just 0.274%). After five people, the non-uniformity provided a slight increase in the chances of all collisions. This is what I would expect.

Does this make any sense? The author says that with a realistic non-uniform distribution of birthdays, they found that collisions were less likely when there were fewer than 5 people involved. I can't think of any reason this would be the case. If certain birthdays are more common than others, I'd think that the chances of a collision must go up regardless of the number of people involved.

Is there any plausible mathematical explanation for this effect? Or was the experiment just underpowered? Or worse, might the simulation code be buggy? Presumably they would have run the simulation more than once after getting such a counterintuitive answer, and it seems really unlikely that this effect would be consistent unless something was broken.


I agree with your intuition. I think the author must either be looking at statistical noise or have made a mistake.

At least for the case of two people, if we say that a_n is the probability that a person's birthday falls on day n (so that ∑(a_n) = 1), then the chance of a match is ∑(a_n^2). To construct a rigorous argument, one could bring in the power mean theorem[1], which states that, if x > y and the elements of {a_n} are all nonnegative, then (avg(a_n^x))^(1/x) ≥ (avg(a_n^y))^(1/y), with equality if and only if all elements of {a_n} are the same. Taking x = 2 and y = 1, this shows that, if ∑(a_n) is fixed, then ∑(a_n^2) (i.e. the probability of a match) is minimized only when {a_n} are all equal.

It's less obvious how to make a rigorous argument for the cases of 3 and more people, but https://en.wikipedia.org/wiki/Muirhead%27s_inequality might help (if you trust the proof of it).

[1] https://en.wikipedia.org/wiki/Generalized_mean#Generalized_m...
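
The small cases can also be checked exactly rather than by simulation. The chance that n people all have distinct birthdays is n! times the degree-n elementary symmetric polynomial of the day probabilities, which a short dynamic program can evaluate. This is only a sketch: the skewed distribution below is a made-up example, not the article's data, and `p_collision` is a name I chose.

```python
import math

def p_collision(probs, n):
    """Exact chance that at least two of n people share a day,
    where probs[d] is the probability of being born on day d."""
    # e[j] holds the degree-j elementary symmetric polynomial of the
    # probabilities processed so far.
    e = [1.0] + [0.0] * n
    for p in probs:
        for j in range(n, 0, -1):  # descending, so each day is used at most once
            e[j] += p * e[j - 1]
    return 1.0 - math.factorial(n) * e[n]

uniform = [1 / 365] * 365

# Hypothetical skew: the first 182 days are somewhat more likely than the rest.
weights = [1.1] * 182 + [0.9] * 183
total = sum(weights)
skewed = [w / total for w in weights]

for n in (2, 3, 4, 5, 23):
    print(n, p_collision(uniform, n), p_collision(skewed, n))
```

For this particular skew the collision chance comes out higher than uniform at every group size, including the small ones, which is what the inequality argument predicts.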


> I can't think of any reason this would be the case.

The author suggests such a reason: the loss of precision. As I understand it, this hints at the limited precision of float/double number representation. When you calculate a sum of millions of fractions, some of them could turn into zero because you have divided too small a number by too large a one.


I think the floating-point precision interpretation is negated by the author's suggestion that the loss of precision is "because of the small sample size". I can sort-of see how a purely analytical solution might be susceptible to rounding issues (and I say sort-of because I don't think such an approach would actually include multiplying millions of fractions, and the straightforward approach involves numbers near 1 instead of near 0), but the reference to "sample size" makes me think that this was a numerical simulation.

But presuming the author ran the simulation more than once (wouldn't you do this if you had such a surprising result?) I don't see how there could be any consistent effect unless there was a bug in the logic of the program. Unless maybe they ran it multiple times with the same random seed and got exactly the same results?
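
On the "small sample size" reading, a back-of-the-envelope check is the standard error of a simulated proportion. This is a sketch under assumptions: the thread doesn't say how many trials the author ran, so the trial counts below are hypothetical.

```python
import math

def standard_error(p: float, trials: int) -> float:
    """Standard error of an estimated proportion after `trials` independent runs."""
    return math.sqrt(p * (1 - p) / trials)

p_two = 1 / 365  # chance that two people collide under uniform birthdays

# Hypothetical simulation sizes; the relative error shrinks like 1/sqrt(trials).
for trials in (10_000, 1_000_000, 100_000_000):
    se = standard_error(p_two, trials)
    print(trials, se, se / p_two)
```

If the true gap between the uniform and non-uniform cases is only a percent or so of an already tiny probability, a modest number of trials would not resolve it, and the sign of the measured difference could go either way.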


In a similar vein, I saw this linked from HN not too long ago and thought it was a really neat application of the birthday paradox--all about finding duplicate packs of Skittles: https://possiblywrong.wordpress.com/2019/04/06/follow-up-i-f...


The US data show that the healthcare system is happy to induce labor to support holiday gatherings of hospital staff.

The interesting thing is that induction increases the risk of c-section, which is a negative for the family, but a positive for the healthcare system (for obvious reasons).

Also note that the US has one of the worst birthing mortality rates amongst western countries...

USA#1


Me and my 4 year younger sister share our birthday. Growing up people would often gasp and ask themselves "what are the odds??", to which I'd usually reply "1/365?".

My father and step-sister also share their birthdays.

I guess probabilities are just really unintuitive to people.


Obligatory quote by Richard Feynman:

"You know, the most amazing thing happened to me tonight... I saw a car with the license plate ARW 357. Can you imagine? Of all the millions of license plates in the state, what was the chance that I would see that particular one tonight? Amazing!"


Why is there a huge drop on July 4th?


In France it's May 1. The staff want to be on vacation, and if you had a complication it would be dangerous. So you see the usual distribution leading up to the 1st, with a lot of induced births in the week before and thus a small dip after the 1st.

I know this because our son's due date was the 1st of May and the staff was explicit about this.


By explicit, do you mean they pressured you to induce for safety?


I was due Christmas day. My mom's doctor said, "I don't want to work on Christmas, so we're gonna get him out early." No talk of safety or anything else was ever discussed. I suppose there was an implicit threat that she'd get whatever unlucky random doctor was on call Xmas instead of her actual doctor.

My older sibling was a C-section, so I was going to be a C-section anyways. It just became a planned c-section due to the holiday.


Yep. Even a friend of ours who was an obstetrician said it would be foolish not to.


As the article points out, induced labor provides an element of choice with regard to birth.

Not as many mothers (or doctors) in the U.S. seem to want to induce labor on U.S. Independence Day.


A lot of maternity wards will try to avoid booking inductions on public holidays, because they will have fewer staff to deal with the spontaneous births on the day.


It also seems that not as many people want to induce labour on the 13th.


Having worked with healthcare (not strictly natality) data in the US: there is a distinct weekly pattern to the data, and very pronounced dips on holidays (NY, LD, ID, LD, TD, XM, NYE), usually with a bump immediately after, summer holidays especially.

A seven-day smoothing really reveals this.
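
A minimal sketch of what seven-day smoothing does, using a centered moving average on made-up daily counts. The numbers and the function name are hypothetical.

```python
def centered_moving_average(values, window=7):
    """Average each point with its neighbours over an odd-sized window.
    Points too close to either end are dropped."""
    half = window // 2
    smoothed = []
    for i in range(half, len(values) - half):
        chunk = values[i - half : i + half + 1]
        smoothed.append(sum(chunk) / window)
    return smoothed

# Hypothetical daily counts: 100 on weekdays, 70 on weekends, for four weeks.
daily = ([100] * 5 + [70] * 2) * 4
daily[10] = 40  # hypothetical holiday dip on one weekday

for value in centered_moving_average(daily):
    print(round(value, 2))
```

Every seven-day window contains each day of the week exactly once, so the weekday/weekend cycle averages out to a flat line and only the holiday dip remains visible.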



