

Principles of Content Addressing - btrask
http://www.bentrask.com/notes/content-addressing.html

======
leephillips
This is an attractive idea. But wouldn't the large space of all documents on
the web likely lead to hash collisions?

~~~
btrask
The likelihood of a hash collision depends on the length of the hash and the
quality of the algorithm. AFAIK there has never been a _single_ SHA-1 hash
collision, and it's an old algorithm that people don't recommend using anymore.

As mentioned in the article, 12 bytes is enough to make collisions very
unlikely. Assuming a high quality hash algorithm, you can use this table to
determine the necessary length for a given number of documents:
https://en.wikipedia.org/wiki/Birthday_paradox#Probability_table
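
For anyone who wants to check the numbers rather than read them off the table, here is a rough sketch of the usual birthday-bound approximation. It assumes an ideal hash with uniformly distributed output; the function names are made up for illustration and do not come from the article.

```python
import math


def collision_probability(num_docs: int, hash_bits: int) -> float:
    """Approximate chance of at least one collision among num_docs hashes.

    Uses the birthday-bound approximation p ~= 1 - exp(-n(n-1) / (2 * 2^bits)),
    which assumes the hash output is uniformly distributed.
    """
    space = 2.0 ** hash_bits
    exponent = -num_docs * (num_docs - 1) / (2.0 * space)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


def docs_for_probability(p: float, hash_bits: int) -> float:
    """Approximate number of documents at which collision probability reaches p."""
    space = 2.0 ** hash_bits
    return math.sqrt(2.0 * space * -math.log1p(-p))


if __name__ == "__main__":
    bits = 12 * 8  # a 12-byte hash, as suggested in the article
    print(collision_probability(10**9, bits))
    print(docs_for_probability(0.5, bits))
```

With a 12-byte (96-bit) hash, a billion documents come out to a collision probability on the order of 6e-12, and it takes roughly 3.3e14 documents to reach a 50% chance, which lines up with the 96-bit row of the Wikipedia table.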

