

Guide to cryptographic hashes for content-based addressing (2007) - atmosx
http://valerieaurora.org/monkey.html
Good reading about hashes with spot-on examples of widely used open-source software.
======
tytso
This is a very old writeup (2007, revised 2009) and one of the things that it
misses is the difference between pre-image resistance, second pre-image
resistance, and collision resistance. Yes, if you are using a 64-bit hash, it
only takes 2^32 tries to find a collsion via the birthday paradox. This is a
problem if you want to protect against random corruption.

But an attacker who is trying to subvert a system generally wants to replace a
specific node in a content-based addressing scheme. For example, they might
want to replace a commit in a git tree. In order to do this, you need to be
able to carry out a 2nd pre-image attack. That is, given a message m which
hashes to h, you need to find another message m' which hashes to the same
value. In this case, the birthday attack doesn't apply, because you can't just
find any collsion --- you need to find a collision given a specific hash
value. This is the difference between trying to find two people in a crowd
that have the same birthday, and trying to find a single person in a crowd
that has the same birthday as another, specific person in that crowd (say, May
14th). The latter is far more difficult than the former.

So for example, even if git used MD5, for which a collision was demonstrated
in 2005, a theoretical pre-image attack was only demonstrated in 2009, and
even then it had a complexity of 2^123. So you could replace a file in a git
tree with another, but it would require trying 2^123 different messages ---
only slightly better than a brute force search. And even then, there is no
guarantee that the replacement file would make any sense as a valid C file.

And of course, git is not using MD5; it's using SHA-1.

It's really unfortunate that Val's compare-by-hash paper doesn't go into this
rather subtle point, because it causes people to get more excited than what
the facts really warrant.

~~~
lambda
Birthday attacks can work against content-based addressing schemes too. Sure,
they may be a little bit harder (you need to convince someone to merge
something that is able to carry a large random payload), but given that, say,
the Linux kernel has a bunch of binary blob firmwares, it wouldn't be too hard
for a hardware manufacturer to submit a driver for new hardware with one
firmware, and then replace it with another later on.

For a similar example, you may think that generating a CA signing certificate
would involve a second preimage attack, as you would need to generate a
certificate that had a hash that matched one of an existing CA. But it turns
out that you only need a birthday attack, because you just need to submit a
certificate for signing that is one of two that collide, where the second one
is marked as being a CA cert. This has actually been done a with certificate
authority that used MD5 for signing certificates.[1]

There are lots of file formats that allow you to fairly easily create MD5
collisions, by virtue of being able to stick a random string that won't be
interpreted somewhere in them.[2] If it were feasible to produce such a
collision using SHA-1, anyone who got such a file merged in Git could then
substitute it for an evil file and no one would be the wiser.

That said, this attack does require the ability to generate a collision on
SHA-1 (which has not yet been done), the ability to convince Linus (or a
subsystem maintainer) to pull your tree containing a big binary blob, and that
binary blob being able to do something harmful when substituted that it
couldn't do just by virtue of being a difficult to audit binary blob
(otherwise, why bother with the birthday attack when you could just subvert
the blob anyhow).

It's true that the writeup should distinguish between these cases, and
contrast their difficulty, but you shouldn't be quite so quick to handwave
away the possibility of collision attacks rather than second-preimage attacks.

If I were to build a content-addressed storage system or design Git today, I'd
probably use a SHA-2 or SHA-3 hash to be on the safe side. While no one has
yet demonstrated a SHA-1 collision publicly, it's been substantially weakened;
it just doesn't give a good enough security margin. Current estimate are that
finding a collisions would cost about $2.7M in cloud computing time[3], and
there are lots of adversaries who have well more than those kinds of
resources.

1: [http://www.win.tue.nl/hashclash/rogue-
ca/](http://www.win.tue.nl/hashclash/rogue-ca/) 2:
[http://www.mscs.dal.ca/~selinger/md5collision/](http://www.mscs.dal.ca/~selinger/md5collision/)
3:
[https://www.schneier.com/blog/archives/2012/10/when_will_we_...](https://www.schneier.com/blog/archives/2012/10/when_will_we_se.html)

~~~
tytso
Fair point. Although it's important to note that not any collision attack
would work. The collision attack still had to meet certain properties before
it could be used to successfully attack x509 certificates. So the ability to
use this to succesfully attack a git tree from a cryptographic perspective
falls somewhere between a collision attack and a pre-image attack.

Making the binary blob do something useful also assumes that you know enough
about the firmware that you could figure out how to create a valid blob as
well as a malicious blob. Very often with these binary blobs the low-level
detail of the hardware is not generally available except to the manufacturer
--- and if the manufacturer were malicious, they could insert the malware into
the official firmware image (or to an update of the official firmware image)
in the first place.

Finally, the cost estimates in [3] are for a brute force birthday attack,
which is definitely not going to be enough for the sort of replacement attack
which you are describing. That being said, certainly if we were starting from
scratch today, using a newer hash would probably be a good idea. But it's
still true that there's a goodly distance from "we can generate a collision"
to we can use that to attack a CBA system in real life.

------
Terretta
Speaking as a CDN responsible for caching artifacts from petabytes of storage
on relatively limited pools of "edge servers", I feel most any discussion of
content-based addressing needs a review of _consistent hashing_ as well.

[http://en.wikipedia.org/wiki/Consistent_hashing](http://en.wikipedia.org/wiki/Consistent_hashing)

Also, we've been hashing content into content keys for a decade and haven't
had a collision even using hashes one shouldn't use according to this
discussion. In practice, “fast” has been more important to us than “secure”,
as we are dealing with multi gigabyte files. Consider your trade offs
carefully and keep an eye on published work.

~~~
diminish
a technical question; how do you know you didn't have any collisions, do you
get alerts when one occurs?

~~~
Terretta
Because there'd be a key already there for a new file.

------
lvh
A wonderful article (and a repost, I think), but showing it's age by not
mentioning SHA-3 era stuff. Particularly BLAKE2 (IMHO) is interesting for this
application, since it's blazingly fast, and at least as secure as BLAKE, which
has received quite a bit of scrutiny as a SHA-3 finalist.

~~~
diminish
could be interesting to see a revised version of this article, including also
authenticated MACs (HMAC etc).

------
zamalek
Very good article, but I am not sure why Valerie dedicated so much of the
content of the article to the security of hashes. Sure, there may be some CBA
systems that would require secure hashes but they would likely be few and far
between (yes, BitTorrent 'cares', but they haven't seen any reason to move off
of SHA1).

One could argue that weakening hashes increases the expected probability of
two pieces of content having the same hash (which it does); however, the
probability of the collision being exploited (as opposed to occurring by
mistake) could be argued as much greater (and is what the research deals
with).

I would have preferred to have seen two articles, with this one dealing with
CBA exclusively and the other the security of hashing algorithms (and how that
applies to CBA, for those who care). I would have expected an article on CBA
to cover at least Merkle trees, for example. Non-cryptographic hashing
functions such as Murmur would have also been a nice addition.

~~~
anaphor
I also would've liked to see a discussion of hash lists and hash trees,
because they seem fairly difficult to implement correctly. At least for hash
trees my research has shown there isn't much help implementing them correctly,
except for maybe reading the bittorrent source, and there are all kinds of
subtle things like some sources recommending you use a different hash function
for leaves that are on the sides of the tree, etc...

~~~
chubot
When you say hash tree do you mean Merkle trees? What research are you
referring to?

Git uses the Merkle tree scheme. It just creates a binary listing of a
directory, which the SHA1 of each file in it, and then takes the SHA1 of that
dir object. Is there something insufficient about that?

~~~
anaphor
I mean merkle trees more or less yeah. I don't remember all of the sources I
researched but here is a source for the use of two different hash functions
(it was for leaves not edges, my bad):
[http://crypto.stackexchange.com/questions/2106/what-is-
the-p...](http://crypto.stackexchange.com/questions/2106/what-is-the-purpose-
of-using-different-hash-functions-for-the-leaves-and-interna)

------
yalogin
Why is the security of hashes really important in content-based addressing?
The idea there is the hash chosen should be resistant to collision and more
importantly fast. SHA1 for that is more than adequate.

