Guide to cryptographic hashes for content-based addressing (2007) (valerieaurora.org)
51 points by atmosx on Sept 10, 2013 | 21 comments



This is a very old writeup (2007, revised 2009), and one of the things it misses is the difference between pre-image resistance, second pre-image resistance, and collision resistance. Yes, if you are using a 64-bit hash, it only takes 2^32 tries to find a collision via the birthday paradox. This is a problem if you want to protect against random corruption.

But an attacker who is trying to subvert a system generally wants to replace a specific node in a content-based addressing scheme. For example, they might want to replace a commit in a git tree. In order to do this, you need to be able to carry out a second pre-image attack. That is, given a message m which hashes to h, you need to find another message m' which hashes to the same value. In this case, the birthday attack doesn't apply, because you can't just find any collision --- you need to find a collision with one specific hash value. This is the difference between finding any two people in a crowd who share a birthday, and finding a person in the crowd who shares a birthday with one specific person (say, May 14th). The latter is far more difficult than the former.
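To make the gap concrete, here's a toy sketch in Python (the 20-bit truncated SHA-256 and the message formats are made up purely for illustration, chosen small enough that both searches finish in seconds):

    import hashlib
    from itertools import count

    BITS = 20  # toy hash size so both attacks finish quickly

    def h(data):
        # Truncated SHA-256 as a stand-in for a weak hash.
        digest = hashlib.sha256(data).digest()
        return int.from_bytes(digest, "big") >> (256 - BITS)

    def birthday_collision():
        # Find ANY two inputs with the same hash: ~2^(BITS/2) work.
        seen = {}
        for i in count():
            m = b"msg-%d" % i
            hv = h(m)
            if hv in seen:
                return seen[hv], m, i
            seen[hv] = m

    def second_preimage(target):
        # Find a DIFFERENT input matching ONE specific hash: ~2^BITS work.
        want = h(target)
        for i in count():
            m = b"forged-%d" % i
            if m != target and h(m) == want:
                return m, i

    _, _, tries = birthday_collision()
    print("birthday collision after ~%d tries" % tries)   # on the order of 10^3
    _, tries = second_preimage(b"a specific commit")
    print("second pre-image after ~%d tries" % tries)     # on the order of 10^6

The second search needs roughly the square of the work of the first; at real hash sizes, that squaring is the whole ballgame.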

So for example, even if git used MD5, for which a collision was demonstrated in 2005, a theoretical pre-image attack was only demonstrated in 2009, and even then it had a complexity of 2^123 --- only slightly better than the 2^128 of a brute-force search. So you could replace a file in a git tree with another, but it would require trying on the order of 2^123 different messages. And even then, there is no guarantee that the replacement file would make any sense as a valid C file.

And of course, git is not using MD5; it's using SHA-1.

It's really unfortunate that Val's compare-by-hash paper doesn't go into this rather subtle point, because it causes people to get more excited than the facts really warrant.


Birthday attacks can work against content-based addressing schemes too. Sure, they may be a little harder (you need to convince someone to merge something that can carry a large random payload), but given that, say, the Linux kernel contains a bunch of binary firmware blobs, it wouldn't be too hard for a hardware manufacturer to submit a driver for new hardware with one firmware image, and then replace it with another later on.

For a similar example, you might think that forging a CA signing certificate would require a second pre-image attack, since you would need to generate a certificate whose hash matched an existing CA's. But it turns out that you only need a birthday attack, because you can submit one of two colliding certificates for signing, where the second one is marked as being a CA cert. This has actually been done with a certificate authority that used MD5 for signing certificates.[1]

There are lots of file formats that allow you to fairly easily create MD5 collisions, by virtue of being able to stick a random string that won't be interpreted somewhere in them.[2] If it were feasible to produce such a collision using SHA-1, anyone who got such a file merged in Git could then substitute it for an evil file and no one would be the wiser.
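A sketch of why that substitution is fatal to a naive content-addressed store (the store here is hypothetical, and the colliding evil blob is assumed rather than computed):

    import hashlib

    store = {}  # key -> content: a naive content-addressed store

    def key_of(data):
        return hashlib.sha1(data).hexdigest()

    def put(data):
        store[key_of(data)] = data
        return key_of(data)

    k = put(b"innocuous firmware blob")
    # If evil_blob collided with it under SHA-1 (hypothetical), whoever
    # controls the storage could later do:
    #   store[k] = evil_blob
    # and every consumer that pinned key k would silently fetch the evil
    # blob; nothing in the addressing scheme notices the swap.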

That said, this attack does require the ability to generate a collision on SHA-1 (which has not yet been done), the ability to convince Linus (or a subsystem maintainer) to pull your tree containing a big binary blob, and that binary blob being able to do something harmful when substituted that it couldn't do just by virtue of being a difficult-to-audit binary blob (otherwise, why bother with the birthday attack when you could just subvert the blob anyway?).

It's true that the writeup should distinguish between these cases, and contrast their difficulty, but you shouldn't be quite so quick to handwave away the possibility of collision attacks rather than second-preimage attacks.

If I were to build a content-addressed storage system or design Git today, I'd probably use a SHA-2 or SHA-3 hash to be on the safe side. While no one has yet demonstrated a SHA-1 collision publicly, it's been substantially weakened; it just doesn't give a good enough security margin. Current estimates are that finding a collision would cost about $2.7M in cloud computing time[3], and there are plenty of adversaries with well more than those kinds of resources.

1: http://www.win.tue.nl/hashclash/rogue-ca/
2: http://www.mscs.dal.ca/~selinger/md5collision/
3: https://www.schneier.com/blog/archives/2012/10/when_will_we_...


Fair point, although it's important to note that not just any collision attack would work. The collision attack still had to meet certain properties before it could be used to successfully attack X.509 certificates. So the difficulty of using this to successfully attack a git tree, from a cryptographic perspective, falls somewhere between a collision attack and a pre-image attack.

Making the binary blob do something useful also assumes that you know enough about the firmware to figure out how to create a valid blob as well as a malicious one. Very often with these binary blobs, the low-level details of the hardware are not generally available except to the manufacturer --- and if the manufacturer were malicious, they could insert the malware into the official firmware image (or into an update of it) in the first place.

Finally, the cost estimates in [3] are for a brute-force birthday attack, which is definitely not going to be enough for the sort of replacement attack you are describing. That being said, if we were starting from scratch today, using a newer hash would certainly be a good idea. But it's still true that there's a good distance from "we can generate a collision" to "we can use that to attack a CBA system in real life".


Speaking for a CDN responsible for caching artifacts from petabytes of storage on relatively limited pools of "edge servers", I feel almost any discussion of content-based addressing needs a review of consistent hashing as well.

http://en.wikipedia.org/wiki/Consistent_hashing
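For anyone unfamiliar, a minimal sketch of the idea (illustrative only; production rings add weighting, replication, and so on): each server owns the arc of the hash ring up to its position, so adding or removing a server only remaps keys on neighboring arcs instead of rehashing everything.

    import hashlib
    from bisect import bisect

    def ring_hash(s):
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    class Ring:
        def __init__(self, servers, vnodes=100):
            # vnodes: virtual points per server, to even out arc sizes.
            self.points = sorted(
                (ring_hash("%s#%d" % (s, i)), s)
                for s in servers for i in range(vnodes))
            self.keys = [p for p, _ in self.points]

        def server_for(self, content_key):
            i = bisect(self.keys, ring_hash(content_key)) % len(self.points)
            return self.points[i][1]

    ring = Ring(["edge-a", "edge-b", "edge-c"])
    print(ring.server_for("sha1:da39a3ee5e6b4b0d3255bfef95601890afd80709"))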

Also, we've been hashing content into content keys for a decade and haven't had a collision, even using hashes one shouldn't use according to this discussion. In practice, "fast" has been more important to us than "secure", as we are dealing with multi-gigabyte files. Consider your trade-offs carefully and keep an eye on published work.
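For files that size, the important trick is streaming the hash rather than reading the file into memory; a sketch with Python's hashlib (the chunk size and algorithm are arbitrary choices here):

    import hashlib

    def content_key(path, algo="sha1", chunk=1 << 20):
        # Stream the file through the hash in 1 MiB chunks so memory
        # use stays constant no matter how large the file is.
        h = hashlib.new(algo)
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()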


A technical question: how do you know you didn't have any collisions? Do you get alerts when one occurs?


Because a collision would show up as a new file hashing to a key that's already there, with different content behind it.
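In other words, collision detection falls out of the write path for free. Something like this (a sketch, not the actual system):

    import hashlib

    def put(store, data):
        # A collision would surface right here: same key, different bytes.
        key = hashlib.sha1(data).hexdigest()
        old = store.get(key)
        if old is not None and old != data:
            raise RuntimeError("hash collision on key " + key)
        store[key] = data
        return key

    store = {}
    put(store, b"some artifact bytes")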


As a CDN, do you have any opinion on submodular load balancing (as mentioned in http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.147....)?

We encountered the same problem at work, but in a different domain.


Without revealing our IP, yes, we believe it's a really good idea to use just "good enough" algorithms for traffic balancing, packing edge servers with content, content promotion and demotion, etc.

At scale, we find blind strategies (round robin for balancing, LRU for content allocation) do okay (say, 70% of the theoretical optimum) with almost no signal (e.g., server up/down), and many CDNs using off-the-shelf equipment are in that category. Adding another 20% of hardware doesn't really affect their overhead.

We're a wholesaler and rely on a competitive edge in the margins, so as part of our IP we've discovered and take advantage of classes of "good enough" strategies (say, 95% of optimal) that need just a couple of high-performance signals (e.g., overall server load). Getting above that to, say, 96% or 97% requires gathering detailed metrics with deltas (e.g., HDD and memory stats) and lengthy computation. For us, those final incremental gains are not worth the complexity.
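One published example of this class of strategy (not necessarily what we use) is "power of two choices": sample two servers at random and send the request to the less loaded one, which captures most of the benefit of full least-loaded balancing from a single coarse signal.

    import random

    def pick_server(servers, load):
        # Compare just two random candidates using one cheap signal.
        a, b = random.sample(servers, 2)
        return a if load[a] <= load[b] else b

    servers = ["edge-a", "edge-b", "edge-c", "edge-d"]
    load = {"edge-a": 0.91, "edge-b": 0.42, "edge-c": 0.77, "edge-d": 0.13}
    print(pick_server(servers, load))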


Thanks!


A wonderful article (and a repost, I think), but it's showing its age by not mentioning anything from the SHA-3 era. BLAKE2 in particular (IMHO) is interesting for this application, since it's blazingly fast and at least as secure as BLAKE, which received quite a bit of scrutiny as a SHA-3 finalist.
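Trying BLAKE2 for content keys is a one-liner in Python (recent versions of hashlib ship blake2b):

    import hashlib

    # 32-byte digest; blake2b also supports keyed hashing, which
    # doubles as a MAC.
    print(hashlib.blake2b(b"file contents here", digest_size=32).hexdigest())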


Could be interesting to see a revised version of this article that also covers message authentication codes (HMAC etc.).
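For reference, an HMAC is just a keyed hash; the key and message below are placeholders:

    import hashlib
    import hmac

    key, msg = b"shared-secret-key", b"the message"
    tag = hmac.new(key, msg, hashlib.sha256).hexdigest()

    # Verification should use a constant-time comparison.
    print(tag, hmac.compare_digest(tag, hmac.new(key, msg, hashlib.sha256).hexdigest()))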


Very good article, but I am not sure why Valerie dedicated so much of it to the security of hashes. Sure, there may be some CBA systems that require secure hashes, but they are likely few and far between (yes, BitTorrent 'cares', but they haven't seen any reason to move off of SHA-1).

One could argue that weakening hashes increases the expected probability of two pieces of content having the same hash (which it does); however, the probability of a collision being deliberately exploited is arguably much greater than that of one occurring by accident --- and the deliberate case is what the research deals with.

I would have preferred to see two articles: this one dealing with CBA exclusively, and another on the security of hashing algorithms (and how it applies to CBA, for those who care). I would have expected an article on CBA to cover at least Merkle trees, for example. Non-cryptographic hash functions such as MurmurHash would also have been a nice addition.


IIRC this article was triggered by concern about its use in Git and filesystems; the author is mainly a kernel developer, not a cryptographer, so it's not surprising that she hasn't extended it to cover more general applications.


That makes much more sense, thanks for the clarification.


I also would've liked to see a discussion of hash lists and hash trees, because they seem fairly difficult to implement correctly. At least for hash trees, my research suggests there isn't much help available for implementing them correctly, except maybe reading the BitTorrent source, and there are all kinds of subtle things, like some sources recommending you use a different hash function for leaves that are on the sides of the tree, etc.


When you say hash tree do you mean Merkle trees? What research are you referring to?

Git uses the Merkle tree scheme. It just creates a binary listing of a directory, with the SHA-1 of each file in it, and then takes the SHA-1 of that directory object. Is there something insufficient about that?
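For concreteness, a blob's ID is the SHA-1 of the content prefixed with a type-and-length header --- that header is itself a simple form of domain separation between object types:

    import hashlib

    def git_blob_id(content):
        # Git hashes "blob <size>\0" + content, not the raw bytes.
        header = b"blob %d\x00" % len(content)
        return hashlib.sha1(header + content).hexdigest()

    # Matches `git hash-object` for the same bytes:
    print(git_blob_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a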


I mean Merkle trees, more or less, yeah. I don't remember all of the sources I researched, but here is one on the use of two different hash functions (it was for leaves, not edges, my bad): http://crypto.stackexchange.com/questions/2106/what-is-the-p...
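The subtlety that answer describes is domain separation: hash leaves and interior nodes differently so one can never be passed off as the other. A minimal sketch using the RFC 6962 (Certificate Transparency) convention of a 0x00 prefix for leaves and 0x01 for interior nodes:

    import hashlib

    def _h(prefix, data):
        return hashlib.sha256(prefix + data).digest()

    def merkle_root(leaves):
        # 0x00 marks leaf hashes, 0x01 marks interior nodes, so leaf
        # bytes can never be reinterpreted as an interior node (the
        # second-preimage pitfall in question).
        if len(leaves) == 1:
            return _h(b"\x00", leaves[0])
        k = 1 << ((len(leaves) - 1).bit_length() - 1)  # largest power of 2 < n
        return _h(b"\x01", merkle_root(leaves[:k]) + merkle_root(leaves[k:]))

    print(merkle_root([b"block-%d" % i for i in range(5)]).hex())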


Have you read the Keccak group's Sakura paper? It's fairly accessible as these things go.


Is this the paper you mean? http://keccak.noekeon.org/Sakura.pdf

I'll definitely give it a read when I have some spare time, thanks for the source.


That's the one!


Why is the security of hashes really important in content-based addressing? The idea there is that the chosen hash should be collision-resistant and, more importantly, fast. SHA-1 is more than adequate for that.



