
The birthday paradox in action: calculating the probability of a hash collision - ColinWright
http://www.solipsys.co.uk/new/TheBirthdayParadox.html?HN2
======
VMG
Hidden gem on the right side:

 _[...] People often ask why they need to memorise formulas, or why they need
to practice solving equations, when they can simply look stuff up whenever
they need it, and on-line computer algebra systems can solve equations faster
than they can, and more reliably.

But this is an example of why the ability simply to look stuff up is near
useless on its own. Searches are deep and wide, and you need intuition to
guide you. You need to recognise what might work, things you've seen before,
directions to take that are more likely to be fruitful. [...]_

------
pedrocr
That's a neat formula, and directly applicable to having a bunch of objects in
a git repository.

Applying the formula for 160bit SHA-1 you need 1.7e23 objects to get a 1%
chance of collision. The current Linus kernel repository has 2.7 million
objects. So to get a collision you'd need a repository that's 6e16 times
larger. That should be plenty.

For some wacky perspective that's 10 million kernel sized contributions for
every man woman and child on earth together in a single repository. It would
seem git will reach plenty of other bottlenecks before SHA-1 becomes a
problem...

~~~
viraptor
It is a huge number. Of course the fun part is that after precisely
calculating that probability and proving that we don't have anything to worry
about, the next commit we make could collide... and it will be the only git
collision within the history of human race.

Probabilities are fun, because anything non-zero is non-zero ;) (I don't
expect to ever witness a git hash collision of course)

------
cmurphycode
At my day job, SHA-1 is the basis for our deduplication. As such, we love
talking about frames of reference for the probability of a SHA-1 collision.
One of our presentations includes such facts as "the probability of dying in a
vending machine accident", etc.

If you enjoy this stuff, you'll get a pleasant tickle out of this one:

What is the probability that there exists a SHA-1 hash which hashes to itself?
In other words, are there any fixed points?

(This post was edited to fix the phrasing mistake nicely pointed out by a
commenter. Originally, I asked "probability that a SHA-1" when I meant
"probability that any SHA-1")

~~~
Dylan16807
Pleasant tickle? Isn't it just 1/max(SHA-1)?

I do get a pleasant tickle from expanding that to the question of if there is
_any_ fixed point in SHA-1, but the chance for a particular value is the
answer you always get for "what is the chance of a specific hash".

~~~
jlgreco
If there are max(SHA-1) hashes, and a 1/max(SHA-1) chance any particular one
will hash to itself, does that mean we would expect only a single instance of
a hash hashing to itself?

I have no idea how you might go about finding that one expected instance.

~~~
Dylan16807
You would expect about 1 per unflawed hash function tested. But the law of
averages is less powerful when you focus on a specific hash. The chance of
having no fixed points in a particular unflawed hash is 1/e, so there is
_probably_ a fixed point, and possibly more than one.

------
Xylakant
It is very important to note that this is the probability that any two hashes
collide in a set of hashes. It is not the probability to find a second
plaintext that hashes to the same hash as a given one. It's far harder to find
a colliding value to a given one than to just grab a group of values and have
two of them hash to the same hash.

~~~
ColinWright
Indeed, pretty much exactly that point is made in the third box on the right:

    
    
        In case you are one of the many, many people who get this wrong
        and are astonished, you may be thinking of a different question.
        My birthday is a specific date. If we assume birthdays are
        uniformly distributed, how many people must you ask before you
        find someone who has the same birthday as me?
    
        On average, about 182.
    
        If you ask some 150 to 200 people, the odds are about 50% that
        you'll find someone who shares my specific birthday.
    
        But that's not the question I asked. They don't just need to share
        my birthday, it can be that any birthday is shared among them, and
        that changes the odds dramatically.

~~~
dsego
I don't see how that point relates to Xylakant's comment. The text you quoted
basically points to the difference between these two cases:

1) that if you have a specific hash, how many more hashes would have to be
generated to get the same hash.

2) how many hashes would you have to generate before a collision occurs
between any of them.

So in the two cases above we're dealing with sets of hashes, and we don't care
how they are generated.

Xylakant's comment was about a different type of collision. This collision
happens because you are mapping an arbitrary large text into a fixed size
hash, which means that different texts can map to the same hash, hence
creating a collision.

~~~
ColinWright
Xylakant:

    
    
      > It is very important to note that this is the probability
      > that any two hashes collide in a set of hashes. It is not
      > the probability to find a second plaintext that hashes to
      > the same hash as a given one.
    

ColinWright:

    
    
      > Indeed, pretty much exactly that point is made ...
    
      > > In case you are one of the many, many people who get this wrong
      > > and are astonished, you may be thinking of a different question.
      > > My birthday is a specific date. If we assume birthdays are
      > > uniformly distributed, how many people must you ask before you
      > > find someone who has the same birthday as me?
    
      > >  ...
    
      > > But that's not the question I asked. They don't just need to share
      > > my birthday, it can be that any birthday is shared among them, and
      > > that changes the odds dramatically.
    	

dsego:

    
    
      > I don't see how that point relates to Xylakant's comment.
    

OK, I'll try to explain my thinking:

    
    
      > The text you quoted basically points to the difference
      > between these two cases:
    
      > 1) that if you have a specific hash, how many more hashes
      >    would have to be generated to get the same hash.
    
      > 2) how many hashes would you have to generate before a
      >    collision occurs between any of them.
    

Yes.

    
    
      > So in the two cases above we're dealing with sets of hashes,
      > and we don't care how they are generated.
    

That's true.

    
    
      > Xylakant's comment was about a different type of collision.
      > This collision happens because you are mapping an arbitrary
      > large text into a fixed size hash, which means that different
      > texts can map to the same hash, hence creating a collision.
    

If you're using cryptographic hashes then I believe these amount to the same
thing.

The distinguishing characteristics of a cryptographic hash are that it has no
internal structure, and it is completely unpredictable based on knowledge of
similar source texts. In particular, techniques such as differential analysis
don't work.

In that sense, finding a text to map to a hash to give a collision is exactly
the same as picking a random number that happens to match. The act of starting
with a text and using that to create a hash is identical to the act of
choosing a random element of the hash space. If these two things were not
identical, then the hash would have some sort of predictability or structure
that related it to the source text.

The situation is different if you are using non-cryptographic hashes such as
CRCs. Perhaps I should edit the original article to help make that
distinction, but I'm not really sure it's worth it.

~~~
Xylakant
Actually I was trying to express that the birthday paradox does not apply when
you to find a plaintext that evaluates to a known hash, as you'd need to do
for password cracking or forging a signed email for example[1]. The birthday
paradox only gives you the probability that of a random set of hashes, two are
the same. It's "if I generate n random pieces of plaintext, how big is the
chance that two of them generate the same hash." Actually it gives you the
probability of picking the same element twice from a finite set of elements
when you pick n times.

[1] When generating emails more restrictions apply: The colliding plaintexts
have to be at least somewhat coherent and probably should express something
the attacker wants to express.

Edit: Forgot a negation in a crucial place. Darn.

~~~
ColinWright
Yes, and that's exactly the point that is addressed in the box I quoted.

You're exactly right, and I believe it's covered. I didn't go into detail
about the problems involved in generating a plain text that hashes to a
specific hash,such as you mention. I did simply mention that the problem I'm
talking about is not that one.

So I don't understand the point dsego was making, because I think my reference
is relevant.

At this point I'm no longer sure it really matter.

~~~
dsego
Sorry, I think I was a bit confused. Now I see that it is the same thing.
Sorry for wasting your time :(

~~~
ColinWright
You haven't wasted my time. You've made me think harder, you've made me
reconsider my explanation, and I've helped you become unconfused.

Not a problem, and you're welcome!

------
davidjohnstone
I once made a little web page that calculates the probability of a hash
collision: <http://davidjohnstone.net/pages/hash-collision-probability>

------
ericcholis
My favorite header ever "Mathematicians aren't normal people". The numbers,
they're everywhere!!!

------
loonoddoo
In what sense is the birthday paradox a paradox?

~~~
DanBC
It's sometimes called the Birthday Problem to avoid the paradox label. (See,
for example, the wikipedia article.)

The paradoxy bits of the problem are the things that people find confusing.
"Only 23?!" or "So if there are 23 people in a room there's a 50% chance that
one of them will share a birthday with me?"

~~~
pendext
Is "So if there are 23 people in a room there's a 50% chance that one of them
will share a birthday with me?" correct? If you pick a fixed person (you) that
would break the pigeonhole principle, I think.

I would think that saying "So if there are 23 people in the room there is a
50% chance any 2 of those people share the same birthday" is better.

Edit: I misunderstood, you're saying that other people are misunderstanding.
My mistake!

~~~
ColinWright
You're misquoting or misunderstanding. The comment to which you are replying
said:

    
    
        The paradoxy bits of the problem are the things that
        people find confusing. "Only 23?!" or "So if there are
        23 people in a room there's a 50% chance that one of
        them will share a birthday with me?"
    

In that comment he is saying that people mis-understand the question, and
assume that it means that once there are 23 people in the room, then there's a
50:50 chance they will share a birthday with them specifically.

And that's exactly the wrong question, as you point out. So when you say:

    
    
        Is "So if there are 23 people in a room there's a 50% chance
        that one of them will share a birthday with me?" correct?
    

No, that's not correct, but it is what people think they hear, and it's that
confusion that makes this whole thing sometimes called a paradox.

So let's be clear:

    
    
        If you're in a room with 22 other people, the chance
        that one of them shares a birthday specifically with
        you is nowhere near 50%
    
        However, the chance that among the 23 people in the
        room there is, somewhere, a shared birthday, is indeed
        slightly greater than 50%
    

And my experience is that it really doesn't matter how carefully you word
this, some people simply will not understand it.

~~~
pendext
You're right, I misunderstand the comment I replied was using that as an
example of how other people misunderstand the problem. D'oh. Edited!

------
alexcsm
Interesting analysis on the collision angle.

For those interested in another look (basic) at the Birthday Paradox, check
out <http://alexanderle.com/blog/2011/birthday-paradox.html> !

------
bigiain
So has anyone yet done the calculation to work out how many files github will
need to store to have a non-negligeable chance of hash collisions in the git
file names? Or perhaps the linux kernel?

~~~
dbaupp
The formula given at the end of the article implies that to have a 5% chance
of a collision you would need 3.87e23 objects. (SHA-1 is 160 bits.) Even
storing 1 trillion objects gives a collision probability of 3e-19.

So this is unlikely to be a concern, ever.

~~~
bigiain
Hmmm, I wonder how many objects are currently stored in S3? And what rate
that's rowing at? Or how many documents Google has stored in bigtable?

But yeah, 160 bits is probably "enough" :-)

------
maebert
In Python: <https://gist.github.com/4031135>

~~~
Peaker
In Haskell: <http://hpaste.org/77404>

------
mauerbac
I just took an exam on this....

