Hacker News new | past | comments | ask | show | jobs | submit login

Yeah, I think that's pretty much the case. The first 320 bytes of the two PDFs released by Google result in the same SHA-1 state. Once you're at that point as long as you append identical data to each of the files you're going to get identical hashes. This is just taking those same 320 bytes and appending the combined images of your choice.

edit: as versteegen points out it's 320 bytes, not 304.

I had a whole discussion in another thread about this:


I learned a lot from it. One thing is that this property is true of any Merkle-Damgård-type hash if the hash internal state is the same size as the hash digest. This is true of SHA-1 and of several other famous and widely-used hashes, but not true of every hash, including some of the most recent designs like several SHA-3 candidates and SHA-3 itself. In a hash without this property, you can have a collision condition H(X)=H(Y) (and len(X)=len(Y)) yet typically H(X+a)≠H(Y+a).

Edit: len(X)=len(Y) is also necessary because Merkle-Damgård hashes encode the message length into internal padding, so if you happened to have two colliding inputs that were different lengths, they will generally not produce a collision when the same string is added to each.

This is really good to be aware of, even if there were no collisions. I could imagine someone making for example a signed cookie scheme that is value,SHA1(secret,value). Someone could then change it to value+foo,SHA1(secret,value+foo) without knowing the secret, and it would verify as a valid signed cookie.

Yes, that's called a length extension attack: https://en.wikipedia.org/wiki/Length_extension_attack

It's why you don't use a bare hash as authentication, but instead use a HMAC.

People sometimes overstate the impact of length extension attacks. If your format has a length prefix (really common) then you may well be "vulnerable" in the sense that appending arbitrary data is "valid", but a canonical form without the appended data is trivial to construct; and indeed most software would likely completely ignore that extra data.

HMAC is a neat trick to avoid length extension attacks (and other issues) in a generalized fashion, but that doesn't mean those risks actually apply in practice. (Some googling finds e.g. this paper: https://www.iacr.org/archive/fse2009/56650374/56650374.pdf which proposes an attack on length-and-key prefixed messages, using some sha1 weaknesses and merely over 2^84 memory and 2^154 queries - color me impressed, but not scared). Edit: just to be clear, I'm not suggesting anyone actally use LPMAC-sha1 given the current state of sha1.

For another example; in general it's unsafe to truncate a "secure" hash - hashes that satisfy most security requirements can be constructed that are not safe when truncated (e.g. sha3 prepended by zeros is still safe, but obviously not if truncate the sha3-provided bits off). But I don't know of any mainstream hash where this theoretical risk actually applies (e.g. no merkle-damgard hash suffers from such a risk); nobody constructs hashes intentionally with more bits than entropy.

It's probably still wise to stick with known-good constructions, but the risks seem overstated, and the difficulty is also overstated - assuming the primitives used aren't too flawed. Sure, it's cool that HMAC can use even flawed things like MD5 and retain safety, but typically nobody is forcing you to stick with md5. I guess the world is more complicated if you need to pick a protocol and then you're unable to change it, but most applications can (with some effort) be changed. You need something safe now, not for all eternity.

So, I think the rule is simpler: this has little to do with crypto per se; just don't be unnecessarily clever, in general. Crypto makes the consequences particularly nasty, often. But that's about it.

Indeed, it comes back to the usual rule of not rolling your own crypto.

This was good meme that served its function well when it was needed - early enthusiasm for reusable cryptographic primitives and a failure to recognise the foot-shooting potential lead to many easily broken schemes.

Now, however, "don't roll your own crypto" is dogma, and if anything we have the opposite problem of monoculture and slow progress. I think a more nuanced view is required, one that encourages experimentation when the stakes are low and more competing implementations when the stakes are high (or perhaps we should call them "complementing" - a standard ought to have multiple implementations).

As Wikipedia puts it, "Mathematical analysis of [security] protocols is, at the time of this writing, not mature... Protocol design is an art requiring deep knowledge and much practice; even then mistakes are common." How are programmers to practice, if they are not allowed to fail?

And SHA(-2)-512/X (The truncation to X forms) are also good for avoiding length extension

Yep, I have a long list in the other thread (transcribed from Wikipedia). It's nice to finally understand why the truncation forms exist!

320 bytes.

Applications are open for YC Winter 2022

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact