Hacker News new | past | comments | ask | show | jobs | submit login
Single-block collision for MD5 (marc-stevens.nl)
191 points by robinhouston on Jan 29, 2012 | hide | past | web | favorite | 56 comments

Nice! From PDF:

                   Message 1  
    4d c9 68 ff 0e e3 5c 20 95 72 d4 77 7b 72 15 87
    d3 6f a7 b2 1b dc 56 b7 4a 3d c0 78 3e 7b 95 18
    af bf a2[00]a8 28 4b f3 6e 8e 4b 55 b3 5f 42 75
    93 d8 49 67 6d a0 d1[55]5d 83 60 fb 5f 07 fe a2
                   Message 2
    4d c9 68 ff 0e e3 5c 20 95 72 d4 77 7b 72 15 87
    d3 6f a7 b2 1b dc 56 b7 4a 3d c0 78 3e 7b 95 18
    af bf a2[02]a8 28 4b f3 6e 8e 4b 55 b3 5f 42 75
    93 d8 49 67 6d a0 d1[d5]5d 83 60 fb 5f 07 fe a2
                 Common MD5 hash

Wow, so only two bits flipped? That's pretty bad even if you're using md5 just as a simple checksum protection against transmission errors.

If you are trying to protect against transmission errors then md5 is a poor choice; it is expensive, it gives no guarantees as to how accurate it is, and has no ability to correct errors. If you want to protect against transmission errors, and possibly correct them, use CRCs; apart from the fact that they provide all three of the above, that's what they were designed for.

Very well optimized CRC is significantly faster than MD5, but note that MD5 is very fast.

MD5 is also a poorer choice for UUID-type applications than the SHA-2 functions, which offer more flexibility in output size at not much cost in performance.

Not really. This collision required specific intent, and a great deal of compute power, to find.

md5 is still adequate for accidental/random error detection because the universe doesn't (usually) spend many hundreds of hours on hundreds of GPUs to try to corrupt your data without you noticing :)

Though sometimes it can seem that way. :)

Fair point. I just kind of assumed that any collision would require quite a lot of flipped bits, because usually just one flipped bit is enough to cause an "avalanche" of changes in the hash sum.

Would it be possible to find a collision where only one bit is flipped in the input?

From the paper:

"Based on our complexity analysis and a number of computers available for our collision search, we estimated that it would take approximately five weeks. As this was feasible enough, we started the actual search. It was our fortune that a collision was found a bit earlier, namely after only three weeks."

This on a Core2 Q9550, a four year old processor. I wonder how fast this could be done using an EC2 temporary compute cluster.

Or low end consumer GPUs which can do over half a billion hashes/sec.

Not that low-end, right? I think the cheapest "good" one was probably over $100, when I read about "Bitcoin mining" perhaps month or so ago.

$100 is pretty low end for a GPU.

MD5 is many times faster than SHA256, too.

$100 is low end. It's easily affordable for consumer purchase. You no longer need to be the NSA with a budget of $1,000,000 to do this.

These sorts of thing tend to parallelize well too. If it were written for CUDA or similar, I wouldn't be surprised to see it finish in a few hours.

This is not a straight brute-force attack. I don't know how well it parallelizes, but Marc is aware of the existence of GPUs. (Also, it takes a few weeks. No point in messing around with CUDA...)

Why is this coming back up again? It is well known that you can create collisions using MD5. However, you will have to try real hard to do so.

The implications of that are simple.

Do not use MD5 for any type of security or cryptography. On the other hand, if you're using MD5 for other purposes, you can continue to do so.

I frequently use MD5 to generate unique ids for a ton of stuff. There is little risk of a collision, since I'm not trying to make things collide. On the other hand, I would never use it for anything security related.

  However, you will have to try real hard to do so.
This is coming back up again, because it's not so hard anymore.

The question I care is: Is the probability of naturally occurring a collision in MD5 hashing significantly more than any other 128 bit hashing algorithm?

The answer is: No.

Sure, but if you're willing to accept lots collisions on "bad" input there are faster hashes (Bernstein and Jenkins have nice fast non-cryptographic hashes, for instance.)

Non-cryptographic hash functions may be a bad idea: e.g. don't store URI parameters (?foo=bar&baz=bar) in such a hash table, or you'll be vulnerable to rather simple DoS (this was all over the internet a week or two ago.)

> Non-cryptographic hash functions may be a bad idea: e.g. don't store URI parameters (?foo=bar&baz=bar) in such a hash table, or you'll be vulnerable to rather simple DoS (this was all over the internet a week or two ago.)

I think universal hashing is the usual protection against that kind of attack, and I think universal hashing is not considered cryptographic:


> there are faster hashes

Yes, but none as widespread. MD5 is available in every system, on every language.

> I frequently use MD5 to generate unique ids for a ton of stuff.

Why not just use a random 64 or 128 bit number (or UUID)? This would be faster and would not require that the input (that you are hashing) already be unique.

The only legitimate use of MD5 I can think of is to verify legacy MD5 checksums.

Content-addressable storage concepts can be quite valuable. See git, for example.

Using MD5 for this is undeniably a bad habit, of course. Just use the output of an SHA-2 function, even if you have to truncate it to 128 bits. Anything to get people to stop using MD5.

> Anything to get people to stop using MD5.

Why? Because a bunch of academics managed to create a collision after 2 weeks of trying on a beefed up machine? Really?

MD5 is much faster than SHA-2. As long as you're aware of the security implications, there's no need to stop using it.

> Because a bunch of academics managed to create a collision after 2 weeks of trying on a beefed up machine? Really?

WTF?! Are you out of your mind? MD5 was exploited to falsify a real-world RapidSSL CA certificate... in 2008. MD5's weaknesses haven't been "academic" in years.

People reach for what they know, and most do not have the capacity to determine the security of cryptographic algorithms in any particular context. Getting them to stop using it at all is the surest way to get them to stop using it where it's critical.

> MD5 is much faster than SHA-2.

Largely irrelevant in today's world. Single-threaded, my two year old laptop runs over 125MB/sec through SHA256, and around 100MB/sec through SHA512. SHA-2 is not a major bottleneck.

Does this have security implications? For example allowing you to compute an alternate, working password if you only know the hash? Are there other things this article implies?

Well, nobody should be using plain MD5 to hash passwords anyway. However, preimage attacks (finding an input that produces a specific hash) are still vastly more difficult than collision attacks (where the hash is not chosen in advance).

The security flaws introduced by collision attacks tend to be a bit subtler. For instance, if a digital signature scheme uses MD5 as the underlying scheme, you could generate two different documents with the same hash, convince a third party to sign one of them, and then transfer the signature to the other document.

can you clarify what you mean by "plain md5"? if someone is using md5 with crypt(3) is that ok?

i ask because of this - https://bugzilla.novell.com/show_bug.cgi?id=743715

AFAIK crypt(3) should not be using "plain" md5, it uses both a salt and variations of md5, plus the hash function is run many times so that bruteforce attacks are mitigated (it takes much more time to compute the hashes)

googling..... http://en.wikipedia.org/wiki/Crypt_(Unix)#MD5-based_scheme

I think teraflop had in mind salting.

A salt only protects against pre-computed dictionary attacks (rainbow tables). It does not offer any additional protection in this scenario.

I would recommend both salting and key strengthening.

It's not best practice (use bcrypt/PBKDF2/scrypt), but there are no known or suspected attacks on that construction and the original article does not seem likely to help in finding one.

I had always wondered about that, but it seems unlikely an MD5 could ever be spoofed in such a way that would make any sense.

I mean to say, you could find a hash that would match an existing, let say, word document. But that wouldn't be a legit word document or anything - it would likely be some random character string. The chances of changing anything in a meaningful way or adding a payload seems practically impossible to me. Is that a false assumption on my part?

Oh goodness no, there are lots of practical attacks that would work if people still trusted MD5.

For example, two self-extracting archives with the same MD5 that unpack to completely different contents: http://eprint.iacr.org/2004/356.pdf

Two postscript files with the same MD5 that print different letters: http://web.archive.org/web/20071226014140/http://www.cits.ru...

And plenty more as listed here by Peter Selinger: http://www.mscs.dal.ca/~selinger/md5collision/

3 years ago, an MD5 collision was exploited to allow the creation of a fraudulent CA certificate!


Basically they could create HTTPS certificates for any domain (microsoft.com, gmail.com, etc) and it would be shown valid by the browser. So MD5 collisions really are useful in practical real-world attacks.

It is. The MD5 preimage attacks support arbitrary (non-common) prefixes and arbitrary common suffixes. With complex file format like a word document, it's easy enough to put a binary blob at the end of the file that doesn't affect how it looks to the user, and the content before that point can be whatever you want (as long as it's the same binary length, which is also easy enough to arrange).

There are no known "preimage attacks" against MD5. Only collision attacks. The term you are looking for is a "chosen prefix collision attack".

Huh. I hadn't known about the MD5 preimage attacks. I'd planned to use salted MD5 to generate nonces, possibly with HMAC. How unsafe is that?

that's pretty interesting, i had assumed that it would be fairly easy to create duplicate content with the same signature, but nearly impossible to make that content something usable.

In theory there are no new security implications, since you shouldn't be using MD5 for secure stuff anyway. The writing has been on the wall for MD5 for a while. This is just another interesting nail in the coffin. It's getting easier and easier to do thing with MD5 that you shouldn't be able to do.

It's still not trivial at all to introduce a payload and make it so the whole package matches a MD5 hash. Haven't seen a PoC ever do this.

Would recommend to go SHA anyway.

Several proofs of concept have been listed elsewhere in the thread.

Thanks, they weren't there when I posted. Will look into that.

Forgive me if this is a poor question. I need to study up on my cryptography a little more...

Would salting both of these messages lead to md5 hashes that no longer match?

Yes, if they were given different salts. But the attack model for password authentication is very different (e.g. there's usually only one salt in play and the attacker doesn't get to choose salt or password he's trying to crack). So the collision attacks on MD5 don't seem obviously relevant.

Even with salting MD5 is still far too efficient to compute to be strong for password hashing. It could be combined in an iteration framework which made it secure, but there are plenty of other hash functions (with better reputations) that would be a better choice.

If the salts are different, it will definitely change.

On the other hand, using the same salt twice will change the hash if it's put before the data, but not if it's put after the data.

Given an MD5 collision, you can apply a common suffix and still have a hash collision. Applying a common prefix changes the state the hash process is in when it reaches the colliding data and this breaks the collision.

Good to know. Thank you =)

This is quite interesting. Common hashes is pretty cool. What about for SHA-1, can there be a collision?

There is a well-known principle called the pigeon-hole principle - you cannot put more pigeons than you have holes in holes without putting at least 2 in one.

The larger the number of bits you have, the larger the number of possible messages. Therefore if the size of the message exceeds the size of the hash, there are more possible messages than there are possible hashes they can be sent to. Therefore by the pigeonhole principle, some hash value represents more than one message, so there must be collisions.

That part is simple. The hard part is finding them. The simplest way is to just try lots of random messages, until 2 give the same hash value. This is called brute force. But there are enough possible hash values that this is not feasible.

To date we have not found an SHA-1 collision. However we know of algorithms that should be able to find one faster than brute force. But we have not actually found one yet.

However attacks only get better, and computers only get faster. It is widely accepted that an actual hash collision in SHA-1 is just a matter of time now.

its not just widely accepted.... hashing algorithms, due to the pigeonhole principle you just explained, by definition are full of collisions.

the only security relevant part is how hard it is to find them for use in various scenarios......

(this is why i really dont get why zfs ever did deduplication relying on hash only.... sure verify is an option, but it would be insane to use hash only....... unlessi am overlooking aomething statistical that makes it make sense (maybethe odds of a collision are fR less than the odds of total hardwarefailure ?still.....)

Your last sentence hits the nail on the head. Even when storing petabytes of data, the odds on a freak hash collision are still many orders of magnitude longer than the odds on a hardware failure.

There's a statement that has been "pretty much permanently" on a whiteboard-covered wall of the computer lab at my college telling a joke about "the difference between a mathematician and an engineer", that goes through the math behind a specific type of prime number generator, calculates the likelihood that it might fail, and then claims the mathematician cares about that while the engineer knows that is orders of magnitude less likely than a guaranteed algorithm failing due to a cosmic ray hitting it in RAM and flipping one of its bits. ;P

Indeed... a quick search turns up someone from the zfs team indicating that a collision in zfs dedup (sha256) is 50 orders of magnitude less likely than an uncorrected hardware error. i shoukd have looked before posting.

I meant, of course, that we'll find one.

We know the exist. We just don't have any examples to present.

If the number of possible messages is greater than the number of possible hashes, collisions are trivially guaranteed to occur. Since hashes are typically used with messages longer than 192 bits, SHA-1 collisions are certainly a possibility. Constructing one is orders of magnitude harder than for MD5, though.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact