Hacker News
The first chosen-prefix collision for SHA-1 (sha-mbles.github.io)
928 points by ynezz on Jan 7, 2020 | hide | past | favorite | 352 comments

So to be clear about what this is (because the website doesn’t quite clarify): this collision lets you pick two different prefixes P1, P2, then calculate some pseudorandom data C1, C2 such that SHA1(P1+C1) = SHA1(P2+C2). The length extension property of SHA1 (and MD5) means that now SHA1(P1+C1+X) = SHA1(P2+C2+X) for any X.

A similar attack (which requires only a few hours on modest hardware nowadays) has been known for a long time for MD5, but this is the first time it’s been demonstrated for SHA-1.

The previous attack, called Shattered (https://shattered.io), was a regular collision, that is, they chose a single prefix P and found different C1, C2 such that SHA1(P+C1) = SHA1(P+C2). This can also be length extended, so that SHA1(P+C1+X) = SHA1(P+C2+X). However, this attack is more limited because there is little to no control over the pseudorandom C1 and C2 (the only differing parts of the messages).

With a chosen prefix collision, though, things are way worse. Now you can create two documents that are arbitrarily different, pad them to the same length, and tack on some extra blocks to make them collide.
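To make this concrete, here's a toy sketch of both steps: a chosen-prefix collision found by birthday search, then length extension for free. Everything here is my own invention for illustration (a 32-bit Merkle-Damgård toy hash, made-up prefixes), not anything from the paper; SHA-1's real 160-bit state puts the generic birthday search at ~2^80 compression calls, which the Shambles team beats with dedicated cryptanalysis.

```python
import hashlib

BLOCK = 8
IV = b"\x00" * 4

def absorb(msg, state=IV):
    """Process whole 8-byte blocks (len(msg) must be a multiple of BLOCK)."""
    for i in range(0, len(msg), BLOCK):
        state = hashlib.sha1(state + msg[i:i + BLOCK]).digest()[:4]
    return state

def toy_hash(msg):
    """Toy Merkle-Damgard hash with a 32-bit internal state -- weak on purpose."""
    padded = msg + b"\x80" + b"\x00" * (-(len(msg) + 1) % BLOCK)
    return absorb(padded)

# Two different, block-aligned chosen prefixes (made up for this demo).
p1 = b"Alice owes Bob $10.".ljust(24)
p2 = b"Alice owes Bob $100,000.".ljust(24)
s1, s2 = absorb(p1), absorb(p2)

# Birthday-search for one extra block per side that merges the two states.
# A 32-bit state needs only ~2^16 work; a real 160-bit state would need ~2^80.
c1 = c2 = None
from_p1 = {}
for i in range(1 << 18):
    c = i.to_bytes(BLOCK, "big")
    from_p1[hashlib.sha1(s1 + c).digest()[:4]] = c
for i in range(1 << 18):
    c = i.to_bytes(BLOCK, "big")
    s = hashlib.sha1(s2 + c).digest()[:4]
    if s in from_p1:
        c1, c2 = from_p1[s], c
        break
assert c1 is not None, "birthday search failed (retry with more samples)"

assert toy_hash(p1 + c1) == toy_hash(p2 + c2)
# Length extension: once the internal states merge, ANY common suffix
# chosen after the fact keeps the collision alive.
x = b" (suffix chosen later, after the collision was found)"
assert toy_hash(p1 + c1 + x) == toy_hash(p2 + c2 + x)
```

The point is that the search only has to merge the internal states once; after that, the shared state makes every common suffix X collide for free.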

Luckily, the first collision should have already warned people to get off of SHA1. It’s no longer safe to use for many applications. (Note, generally for basic integrity operations it might be OK since there’s no preimage attack, but I’d still be a bit wary myself).

Squeamish Ossifrage's answer to this Crypto Stack Exchange question does a predictably excellent job of putting these attacks in context.


Agreed. It's a great loss to the community that they have decided to step away from Stack Exchange: https://crypto.meta.stackexchange.com/questions/1361/im-leav...

* that the owners of Stack Exchange drove them and others away.

Yup, from reading that posted link, that's pretty much what happened, so why you're being downvoted I don't know.

Actually, the website does cover this in the "Q&A" section:

> What is a chosen-prefix collision?

> A classical collision (or identical-prefix collision) for a hash function H is simply two messages M and M’ that lead to the same hash output: H(M) = H(M’). Even though this security notion is fundamental in cryptography, exploiting a classical collision for attacks in practice is difficult.

> A chosen-prefix collision is a more constrained (and much more difficult to obtain) type of collision, where two message prefixes P and P’ are first given as challenge to the adversary, and his goal is then to compute two messages M and M’ such that H(P || M) = H(P’ || M’), where || denotes concatenation.

> With such an ability, the attacker can obtain a collision even though prefixes can be chosen arbitrarily (and thus potentially contain some meaningful information). This is particularly impactful when the hash function is used in a digital signature scheme, one of the most common usage of a hash function.

The confusing thing is that the team behind SHAttered did choose a prefix - they just had to choose a common prefix rather than two different prefixes (choosing a common prefix simply changes the initial state used by SHA-1). So I'm hoping that my post clarifies this point a bit.

The SHAttered attack (classical collision) can be used in practice - for example, my project https://github.com/nneonneo/sha1collider exploits their collision to turn any two PDFs into documents with identical SHA-1 hashes.

Is any practical application safe from Shattered but not Shambles?

Applications which don't accept the specific Shattered PDF prefix given there. In general Shattered could work for e.g. ELF executables with the same prefix, but the attack here lets you collide the hashes even if the prefixes are different.

Can you give a specific example of the danger here? I understand the principle behind the attack (kinda).

I just don't understand: what danger does being able to pad two documents to make them collide pose?

edit: My guess is that it can be abused to make something that I believe to be library X actually be library Y when I download it from the internet. Let's say I want to download something, and I check the signature provided. Assuming the attacker is able to send me the wrong library via a MITM attack, how can this prefix collision work? It seems that the original library AND the original signature on the library's website have not been altered, so their efforts to use this and make them match are impossible. And it seems like if they can alter the signature on the website and stuff, then all bets are off--why not just send the malicious library at that point?

Suppose your system uses SHA-1 hashes for codesigning verification (e.g. to load a system driver). I create an innocent-looking device driver and convince a signing authority to sign it. However, secretly I've created a malicious driver (e.g. a rootkit) which collides with my innocent one. Now, I can load the malicious one on your machine - which the signing authority has never seen - using the signing certificate of the legitimate one.

This might sound far-fetched; after all, you'd need to convince a signing authority to sign the code. But this is pretty much exactly how Apple's Gatekeeper verification works: your software is submitted to them, and they do some security checks and notarize your bundle (https://developer.apple.com/developer-id/), and I'm sure there's many more such examples out there.

Hey OP,

I'm not much of a math guy; I'm getting hung up on this part:

SHA1(P1+C1+X) = SHA1(P2+C2+X) for any X.

The example above makes it seem like SHA1(GOOD_DRIVER) == SHA1(BAD_DRIVER + C2 + X) somehow.

How do the C1 and X get appended to the good driver that gets signed?

Most executable file formats (including drivers) put the code first followed by the data. So you could construct your drivers thusly:

GOOD_DRIVER = P1 (good code and some data) + C1 (data) + X (more data)

BAD_DRIVER = P2 (bad code and some data) + C2 (data) + X (more data)

You'd disguise the random-looking block of C1 data in the middle of the good driver as e.g. a cryptographic key to avoid suspicion. The "more data" part couldn't be modified in the bad driver, but since you can arbitrarily modify P2 this wouldn't be a severe restriction.

thank you OP and everyone. I had only a shaky initial understanding, but it makes much more sense now.

He said "innocent-looking device driver".

The good driver is actually padded so that it can be later replaced with the bad driver. I.e., this doesn't allow the attacker to replace any driver, only one they prepared in advance to look innocent but have the right structure.

that makes a lot more sense, thank you

The good driver consists of P1+C1+X. That's what gets sent to the signing authority. They verify it doesn't do anything bad and return a signature listing SHA1(P1+C1+X). But that signature is also valid for the malicious driver P2+C2+X.

> you'd need to convince a signing authority to sign the code.

Not necessarily. You can also self-sign or just announce the hash, let the public inspect and test your good driver for a year, and then ship the bad driver to people who only check the hash.

The core of it is that a malicious document will pass the check of authenticity against the genuine signature. Someone could tamper with a mirror and infect you that way. Absolutely no tampering with the signature would be required, which is what makes this dangerous. Basically, it can be used to send you entirely fake data that you can't tell is fake.

Or someone could say that you have your legally binding signature attached to a contract with hash ABC, but then present a different contract with hash ABC but very different terms.

Those are the end state. This is a major step closer to that end state. Neither of those scenarios are the case today, but they're now close enough that it's a matter of time. And likely not a lot of time.

Time to abandon SHA1.

It’s a step on the same path that led to being able to spoof a CA for MD5 signed certs.


Please don't feed the cancer which is AMP.


Please don't overegg the issue with AMP by comparing it to cancer.

Don't make cancer anything it's not, either. Cancer is just growth of abnormal cells, unconstrained, to the point that it causes harm to the host.

That said, maybe AMP is more like a virus. A non-living organism that spreads by infecting living organisms and repurposing them to replicate itself instead of sustaining the organism they were a part of.

The more sites adopt AMP, the more everyone else says "well I guess we have to now." Seems pretty viral.

> Note, generally for basic integrity operations it might be OK since there’s no preimage attack, but I’d still be a bit wary myself

AFAIK there's still no practical preimage attack for MD5, for which you can generate collisions in seconds on hardware that's a decade old, so it seems making a hash function collision resistant is in general quite a bit harder than making it resistant to preimage attacks.

To some extent that’s just a function of the brute-force search space: due to the birthday paradox a 128-bit hash like MD5 only requires testing around 2^64 hashes to find a collision. However, to find a preimage you would need to search through 2^128 possible messages.

Attacks generally improve on brute force by factors; for example, MD5 collision finding is some 2^40 times more efficient than 2^64 brute force, but being 2^40 times more efficient than 2^128 brute force for preimage finding is still well outside of anyone’s capabilities.

MD5 also appears to be unusually preimage-resistant; the best attack requires something like 2^123 operations which is really infeasible.
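Those two search spaces can be compared with the standard birthday approximation (generic math, nothing specific to MD5's internals):

```python
import math

def collision_probability(k, n_bits):
    """Approximate chance that k random n_bits-bit values contain at
    least one collision: 1 - exp(-k(k-1) / 2^(n_bits+1))."""
    return 1.0 - math.exp(-k * (k - 1) / (2.0 ** (n_bits + 1)))

# Collisions for a 128-bit hash become likely around 2^64 samples...
print(collision_probability(2 ** 64, 128))   # ~0.39
print(collision_probability(2 ** 66, 128))   # ~1.0
# ...while the same 2^66 guesses at a specific preimage succeed with
# probability only k / 2^128:
print(2 ** 66 / 2 ** 128)                    # ~2.2e-19
```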

> We note that classical collisions and chosen-prefix collisions do not threaten all usages of SHA-1. In particular, HMAC-SHA-1 seems relatively safe, and preimage resistance (aka ability to invert the hash function) of SHA-1 remains unbroken as of today.

Nice to see this bit of intellectual honesty. Would be even nicer if they had explained what that means in terms of PGP keys.

It means if someone you want to impersonate uses the Web Of Trust, i.e. their key is signed by other people whose keys have been signed the same way, you can generate a GPG key for which all of these signatures are still valid.

For example, if an attacker gains access to a victim email account, they could send to their contacts a "trusted" key (as explained above) and then use it to send signed documents to the victim's contacts.

This would defeat an adversary "paranoid" enough to check a key signature, but not paranoid enough to obtain a clear explanation/confirmation of why the key changed...

>This would defeat an adversary "paranoid" enough to check a key signature, but not paranoid enough to obtain a clear explanation/confirmation of why the key changed...

Thereby turning the signal intelligence problem into a human intelligence problem.

> It means if someone you want to impersonate uses the Web Of Trust, i.e. their key is signed by other people whose keys have been signed the same way, you can generate a GPG key for which all of these signatures are still valid.

I'm not really knowledgeable about the implementation details of GPG. Mind explaining how this follows?

> It means if someone you want to impersonate uses the Web Of Trust, i.e. their key is signed by other people whose keys have been signed the same way, you can generate a GPG key for which all of these signatures are still valid.


> For example, if an attacker gains access to a victim email account, they could send to their contacts a "trusted" key (as explained above) and then use it to send signed documents to the victim's contacts.

Ok... But in this scenario the attacker has the victim’s new private key, so they don’t need to create a collision (using OP). They can just use the new private key to sign the documents. Right?

> No...

Why?

> in this scenario the attacker has the victim’s new private key

You don't want to keep your private key in cleartext on your email provider's servers, do you?

This allows you to take two messages and append some data to both of them which causes the modified versions to have the same SHA-1 hash - but you need to modify both messages, and in order to use this in an attack you need to set up a scenario where the SHA-1 hash of one of your modified messages is trusted for some purpose. Creating a message with the same hash as another, existing message requires a second-preimage attack which is much harder and not feasible for any cryptographic hash that's currently in use.

HMACs do not require collision resistance from the underlying hash to provide secure message authentication. HMAC-MD5 is still considered "secure", although that doesn't mean you should use it.


This kind of thing always brings me down a bit. It's not rational, but it does.

I mean I truly admire these folks skills, the math involved is obviously remarkable.

But I think the feeling is related to not being able to rely on anything in our field. Hard to justify going to the trouble of encrypting your backup. 10 years from now, it might be as good as plain text.

It's not security only, nothing seems to work in the long term. Imagine an engineer receiving a call at midnight about his bridge because gravity changed during daylight saving in a leap year. That's our field.

There's an MC Frontalot song called "Secrets from the Future" and the refrain is "You can't hide secrets from the future." It's something of a useful mantra to remind oneself that if "the future" is a part of your threat model, yes your encryption likely isn't enough because on a long enough timescale it is likely "the future" will crack it.

As with any other security issue, the question is "what is your threat model?" You can still justify encrypting your backup today if your threat model includes today's actors, however much you worry about "the future".

> 10 years from now, it might be as good as plain text.

Or 10 years from now it might be the next Linear A tablets to confuse cryptoarcheologists, unreadable and untranslatable and entirely foreign. If "the future" is in your threat model, don't forget the other fun forms of entropy beyond encryption being cracked such as encodings changing, file formats falling out of service/compatibility, "common knowledge" about slang or memes lost to the ages leaving things indecipherable, and so on and so forth. Most of those things are probably unlikely on only a 10 year time horizon, but you never can tell with "the future".

"You can’t hide secrets from the future with math. You can try, but I bet that in the future they laugh at the half-assed schemes and algorithms amassed to enforce cryptographs in the past."

Didn't expect to see Front on HN today, what a pleasant surprise.


The point about archeologists is a good one because it speaks to motive. In general, we should be very supportive of the efforts of distant historians who want to understand what humanity used to be like. We should not WANT to hide secrets from a sufficiently far future. I can't think of any secret that deserves to be hidden from them for any reason besides perhaps modesty.

Relatedly, that is a part of why I love the term "cryptoarcheology" in general, as a reminder that digital spaces will need archeologists too.

There's its somewhat shortened form "cryptarch", generally used more as a title ("cryptoarcheologist"), which was used in places in the Quantum Thief trilogy of books and is most probably already burned into your brain if you have played any of the Destiny series of videogames (and I presume was heavily influenced by Quantum Thief).

Hmm, I thought cryptarch was just crypt + arch with the arch part meaning “leader” (i.e. this sense https://www.etymonline.com/word/arch-), not archaeologist. Is there something about this in the Quantum Thief trilogy I’ve forgotten?

It's been a bit since I read it, but I recall the meaning overlap/double meaning between leader -arch (such as plutarch) and arch- from archaeo- was directly intentional word play in the book trilogy, and yes that leader -arch meaning does add spice to the neologism.

(I don't think Destiny does much with the playful dual meaning, though. Certainly the cryptarchs in Destiny have never yet been meaningful leaders.)

What is considered sufficiently far future may change with life extension technology.

Well I would assume that as long as you live you'll keep updating your crypto as new tech comes out. That way the clock only starts ticking when you die.

Your encrypted communications, intercepted today, will be encrypted with today's standards. The clock is ticking now.

> Or 10 years from now it might be the next Linear A tablets

At some point in the future, after the universe has expanded to the extent that other galaxies are moving away from ours faster than the speed of light, someone might read our astronomy papers and wonder whether they can believe something that cannot be verified.

I always think of this as well. But it may not be so:


This biblical prophecy from Luke 12:3 is so true: "What you have said in the dark will be heard in the daylight, and what you have whispered in the ear in the inner rooms will be proclaimed from the roofs."

I wonder what the intended meaning is. I presume it has to do with a presumed eventual urge to confess.

It’s basically “don’t try to get fresh with God”. Which conveniently translates into a call to humility and honesty towards its earthly representatives.

In context, it appears to be saying "you can't keep secrets from God."

Ok but it implies/states that the secrets will be broadly divulged.

> You can't hide secrets from the future

With the usual disclaimer about one-time pads.

> other fun forms of entropy beyond encryption being cracked such as encodings changing, file formats falling out of service/compatibility, "common knowledge" about slang or memes lost to the ages leaving things indecipherable

I wonder what the digital archaeologists of the future will make of today's programming languages.

(I was going to point out how far we've come since the languages of yesteryear, but we still have such horror-shows as Perl and Bash.)

Bridge engineers don't have to fear the progress of science working against them, but computer security is not alone here. Consider designing body armor or military aircraft and hoping that the state of the art will stay the same! An adversary who can use the progress of science against you is always dangerous. Computer security has been rather lucky so far: the asymmetry between hashing and cracking a hash, for example, is much more favorable to the defense than the balance between bulletproof vests and bullets.

On a long enough time scale Bridge Engineers still have to worry about decay, entropy, collisions. Some of that can even be attributed to the "progress of science", as bridges have collapsed because earlier assumptions were invalidated by larger trucks, for instance.

An issue so far for computer security is less that decay and entropy happen, but that they happen so fast that the timescales are decades or even years rather than lifetimes.

This is a great point in general: sometimes the problem isn't an adversary using knowledge in your field against you, it's the unintended consequences of progress / changes in adjacent fields.

It also underscores that, when dealing with things like passwords, it's helpful to be able to seamlessly upgrade to more robust methods down the line, e.g. "if this password hash was generated using the old method, check the login using that method but rehash and update the database using the new method after authentication."
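A minimal sketch of that upgrade-on-login flow. The record layout, scheme names, and iteration count are all invented for illustration; a real system should lean on a vetted password-hashing library.

```python
import hashlib, hmac, os

ITERATIONS = 100_000  # illustrative; tune to your hardware and threat model

def verify_and_upgrade(password: bytes, record: dict):
    """Check a password against a stored record; transparently re-hash
    legacy (unsalted SHA-1) records with PBKDF2 once the login succeeds."""
    if record["scheme"] == "sha1":  # legacy scheme, kept only for migration
        ok = hmac.compare_digest(hashlib.sha1(password).digest(), record["hash"])
    else:                           # current scheme: PBKDF2-HMAC-SHA256
        digest = hashlib.pbkdf2_hmac("sha256", password, record["salt"], ITERATIONS)
        ok = hmac.compare_digest(digest, record["hash"])
    if ok and record["scheme"] == "sha1":
        # The plaintext password is only available at login time, so this
        # is the one moment the record can migrate to the stronger scheme.
        salt = os.urandom(16)
        record = {"scheme": "pbkdf2", "salt": salt,
                  "hash": hashlib.pbkdf2_hmac("sha256", password, salt, ITERATIONS)}
    return ok, record

legacy = {"scheme": "sha1", "salt": b"", "hash": hashlib.sha1(b"hunter2").digest()}
ok, migrated = verify_and_upgrade(b"hunter2", legacy)
assert ok and migrated["scheme"] == "pbkdf2"  # silently upgraded on login
```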

> Consider designing body armor or military aircraft and hoping that the state of the art will stay the same!

This was particularly a problem for castles vs. the progress of artillery, because those were supposed to have a rather long lifetime.

The problem is that the expectations of software are much more akin to the expectations of a bridge than those of a bulletproof vest

> Hard to justify going to the trouble of encrypting your backup. 10 years from now, it might be as good as plain text.

When has that happened? Public key cryptography and symmetric key cryptography are still doing fine as far as I'm aware, and the latter doesn't even seem to be vulnerable to quantum computing.

Moreover, SHA-1 has been considered insecure for, what, at least 10 years? The fact that a cryptographic hash function has been widely considered insecure and widely recommended for deprecation a decade before a proof of concept even emerges is, to me, something to feel very good about.

"Public/symmetric key cryptography" is just the name of the practice, of course it's doing fine. What's not doing fine is picking a particular set of ciphers/hash functions/signature scheme and expecting to not fail in 40 years.

Incorrect, particular sets of ciphers are themselves doing just fine. There is no hint of well studied symmetric-key algorithms (like AES, which is in fact now over 20 years old) with long known sufficient key lengths "failing" (in terms of cold ciphertext being cracked) in 40 years, or 400 years for that matter. Even with a fully scalable general quantum computer, the best known improvement is Grover's Algorithm, which results in roughly a quadratic improvement (so a 128-bit key cracked in 2^64 iterations, 256-bit in 2^128, etc). This obviously is trivially defeated by doubling the key length, and AES-256 has been available from the beginning and has been becoming more standard for a while as well.

Now, there are various kinds of hot attacks, such as side channel leaks, that always present a serious challenge in many threat scenarios. And it's certainly possible to have bad implementations of a solid algorithm. But for AES implementations are very mature, and side channel attacks do not apply to data at rest. So in the context of "will my encrypted ciphertext with all my deepest secrets [sitting on a drive/in the cloud/wherever] be crackable in 40 years" no, symmetric-key crypto with 256-bit keys is dependable at this point.

A false sense of defeatism is itself plenty dangerous to security.

DES is a good example here: it was designed ~45 years ago and it's still theoretically surprisingly strong when you compare it with other areas of cryptography. By that, I mean it's held up very well to cryptanalysis that attempts to find faster-than-brute-force techniques, with the best attack taking 2^43 time (vs. the 2^56 brute-force time).

To be clear, 2^56 is trivially brute-force-able today, but that can be mitigated with constructs like 3DES, which are still (barely) secure despite being based on a 45-year-old cipher.

(So no one mistakes my intent, there are many reasons to prefer AES over DES. I just wanted to provide it as an example, especially since it happened to line up with the 40-year timeframe.)

> 2^56 is trivially brute-force-able today

Youch, this broke my intuition.


"we’ve built a system with 48 Virtex-6 LX240Ts which can exhaust the keyspace in around 26 hours"

> trivially

Having to rent thousands of dollars in GPU-hours or using specialized hardware doesn't sound "trivial" to me. Practical, yes. Accessible even to a layman with money to spare, yes. Trivial? Hell no.

It's funny to see how different the definitions of "trivial" the various people here have are. 10 years ago, I generated MD5 collisions in a few minutes with a tiny little executable I downloaded. That's what I call trivial.

If I had some DES-encrypted files that I'd lost the key to, and they were important enough that I needed their contents, I would probably be happy that I could crack the key today and recover them if I spent enough, but doing so would still not be "trivial".

Oh, sure, maybe trivial isn't the best word for it, but it's all relative (to me). $20k is remarkably cheap for a practical cryptographic attack compared to the rest of the field, and there's plenty of software to do it, and it's been done before (many times).

Apparently there's even a SaaS offering to do it for you in some cases (for staggeringly cheap, like $300), but I had no idea that existed when I wrote my comment.

Encrypting is still better than not encrypting and if you care about keeping your data private then you can take additional measures to ensure that. Nothing will last forever but that's not a good reason to be nihilistic.

It's like locking my front door. I chose to lock the door to prevent random acts of burglary. But that lock will not stop someone determined from BnE my house.

DES Encryption will stop someone from BnE your data.

They might BnE your key though, if you aren't careful enough.

Triple DES remains as secure today as in 1975.

>> "Hard to justify going to the trouble of encrypting your backup."

Huh? If you're "encrypting" using SHA, I've got some bad news about those backups of yours.

I think GP was talking about the general nature of "previously assumed to be unbreakable" methods being broken. Not sure if he was implying using a checksum for encryption as well.

What do you mean by "previously assumed to be unbreakable" ? SHA-1 has been known to be unsafe for a dozen years, we just went from "assumed to be breakable" to "yep, definitely breakable, here's how one exact attack will work".

But backups have existed for more than a dozen years. And SHA-1's replacements today, SHA-256 and SHA-3, will also be broken if you wait long enough.

I can see why backups might be needed for a dozen years, and I can see why encrypted backups might be needed, but outside plainly fake requirements like those of "national security" why would encrypted backups be needed for a dozen years? Aren't we throwing everything sensitive away after seven years? After that isn't it mostly about preserving history? Even things like balance sheets that might be sensitive today will be too out-of-date to be sensitive a dozen years from now.

The obvious counter-example is my library, however old my photos or music or videos are I'd like to keep them for as long as possible, and because they're private I'd like to keep them in an encrypted form

If you use SHA-256 to encrypt your backup, then I just need to steal your backup and wait 20 years, until that is cracked, and then I can decrypt your backup, even though today you’re using the “correct” encryption.

The GP was likely hinting at SHA-1 being a hashing function, not an encryption function, so just applying sha* wouldn't produce a working backup.

To be excessively pedantic, you can encrypt securely (but slowly for the SHA series) with a hash function of sufficient capacity by running the hash function in CTR mode. You turn it into a stream cipher. Ideally you also MAC the ciphertext, nonce, and other associated data. That's pretty easy with such a hash function (either use HMAC or a MAC mode of the hash function if supported).

Salsa20 & ChaCha20 cores are hash functions (though not collision-resistant compression functions since it's not needed for their design goal and would be slower) run in CTR mode.[1]

[1] https://cr.yp.to/salsa20.html
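For the curious, here's a bare-bones sketch of that hash-in-CTR-mode idea with encrypt-then-MAC on top. It's a toy, not a hardened design: the keys and nonce are hard-coded for brevity, where a real system would derive independent keys from a KDF and guarantee nonce uniqueness.

```python
import hashlib, hmac

def keystream(key, nonce, length):
    """SHA-256 run in CTR mode as a (slow) stream-cipher keystream."""
    blocks = []
    for counter in range((length + 31) // 32):
        blocks.append(hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest())
    return b"".join(blocks)[:length]

def seal(enc_key, mac_key, nonce, plaintext):
    ct = bytes(p ^ k for p, k in zip(plaintext, keystream(enc_key, nonce, len(plaintext))))
    # Encrypt-then-MAC over the nonce and ciphertext.
    tag = hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()
    return ct, tag

def open_sealed(enc_key, mac_key, nonce, ct, tag):
    expect = hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(expect, tag):
        raise ValueError("MAC check failed")
    return bytes(c ^ k for c, k in zip(ct, keystream(enc_key, nonce, len(ct))))

enc_key, mac_key = b"e" * 32, b"m" * 32  # toy keys; derive real ones from a KDF
nonce = b"\x01" * 12                     # must be unique per message
ct, tag = seal(enc_key, mac_key, nonce, b"attack at dawn")
assert open_sealed(enc_key, mac_key, nonce, ct, tag) == b"attack at dawn"
```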

> To be excessively pedantic

This is the best, most delicious, type of pedantry friend!

You're the best kind of pedantry friend!

(Just some more pedantry, friend.)

Ok man, now that's just over the top.


It probably will if your data is less than 128 bytes, and you're willing to wait a few decades to decrypt it.

You might be able to find bytes that result in your hash, but they probably won't be the same bytes you 'backed up'.

If the data is shorter than the hash shouldn't it be the same data I backed up with reasonably high probability?

I guess you get (infinitely?) many results which all have the same hash, and one (or more) of them will be shorter than the hash.

Can you explain the relevance? If I put N items randomly into >> N buckets the chance of there being a second item in a particular bucket is small (as opposed to there merely being a bucket with two items, as in the birthday "paradox").

That doesn't apply here, since the birthday paradox is about the existence of a collision, not that any particular sequence collides.

Most people in the room will still have unique birthdays even if one pair share theirs.

As an aside, sha-1 is smaller than 128 bytes.

From my numerical experiments (I hope I didn't mess up...) using the random oracle model, the probability that a given input is collision-free is 99.6% if the input is one byte shorter than the hash, 1/e if the input is the same size as the hash, and 6.6e-112 if the input is one byte longer than the hash.

And this holds basically irrespective of key size.
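Those figures are consistent with reading "collision-free" as "no other input shares this particular input's hash" under a random oracle. A quick check (my own re-derivation, so treat accordingly):

```python
import math

def p_no_twin(input_bits, hash_bits):
    """P(no other input collides with one fixed input) under a random
    oracle: (1 - 2^-h)^(2^i - 1) ~= exp(-2^(i - h))."""
    return math.exp(-(2.0 ** (input_bits - hash_bits)))

# For a 160-bit hash, inputs one byte shorter / same size / one byte longer:
print(p_no_twin(152, 160))   # ~0.996
print(p_no_twin(160, 160))   # ~1/e ~ 0.368
print(p_no_twin(168, 160))   # ~6.6e-112
```

Note the answer depends only on the difference input_bits - hash_bits, which matches the "basically irrespective of size" observation.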

If you're planning to brute-force count through 2^(128x8) possible bit inputs, it will be quite a few decades indeed. And you'll need a few spare solar systems to annihilate to get enough energy to drive your counting engine through that many states.


The idea is to wait for a preimage attack on sha, not brute force it.

I'm referring to not being able to rely on encryption in the long term.

Hashing is a separate problem from encryption. There is no proof that one way functions (the idea behind hashing) even exist (by proving this, you would actually prove P!=NP, IIRC). Encryption has a slightly better track record of not being broken. AES still holds its promise and is also secure against quantum computing (you might want longer keys, but that's it).

And if you want really, provably unbreakable encryption, there is still OTP. But then you'd need a key, that is as long as the data you want to encrypt.

The best known attack against AES reduces attack complexity by about two bits over brute force. Given the history of block ciphers, the idea that AES might not be broken in this life time is not uncommon.

> Hard to justify going to the trouble of encrypting your backup. 10 years from now, it might be as good as plain text.

That is an absurd statement. Every backup you encrypted 10 years ago with then up-to-date security is still well encrypted. The SHA-1 break came slowly, over a timespan of 15 years. It still doesn't threaten past usage; only active attacks are really relevant.

FWIW, a lot of very smart people seem to think that we actually do have symmetric encryption in a state where it "works" and is "understood" and that AES is unlikely to be "broken". What we don't really feel so great about is asymmetric encryption, but at least that feels like something P != NP might imply can be done... hashing algorithms just have this problem where if you glance at the information theory you would have guessed they couldn't be possible until someone shows you one that seems to work.

> something P != NP might imply can be done... hashing algorithms just have this problem where if you glance at the information theory you would have guessed they couldn't be possible until someone shows you one that seems to work

I'm confused what you mean by this? Why does the info theory suggest hashing wouldn't be possible? Also, you can easily derive a secure hash function from a secure symmetric cipher and vice-versa, so why would one seem to be possible but not the other?

1) The premise of a hash function is that you are going to take a large set of inputs and map it to a small set of outputs such that you don't find collisions, with the use case that the result is somehow so unique that the input file can be henceforth named by its hash and looked up in some giant dictionary of everything we have so far ever bothered to hash.

The hash is this tiny bit of information, and somehow is expected to be sufficient to uniquely describe some arbitrarily large amount of information? That just flies in the face of the naive expectation. The only argument that it is even sort of reasonable to expect would be the idea that while there are an insane number of files very similar but not quite your file, very few of them are "interesting" to a human in any way, and so the set of "interesting" files might be super small... but it isn't like we designed the hash functions to know that.

The reality that it seemingly can be done well enough that no one notices most of the time is fascinating and surprising--the kind of thing that inspires awe and wonder at the universe if true, or calls into question our hubris if it isn't--not something obvious in the way the existence of symmetric ciphers is, nor something reasonable to expect in the way asymmetric ciphers are: to the extent to which reality lets us pull this one off, we should be thankful.

2) If one can truly "easily" derive a "secure" hash function from a secure symmetric cipher, then why don't we ever have secure hash functions, despite seemingly having secure symmetric ciphers? If this is so easy, I suggest you do it and then wait 30 years and see if someone figures out how to break it (and we can see that AES is still holding strong).

>The hash is this tiny bit of information, and somehow is expected to be sufficient to uniquely describe some arbitrarily large amount of information

It doesn't really describe the information; it just indexes each piece. For that, it is a very large key space: a 128-bit key could index 10^38 objects. That's a lot of slots. Provided your hash has sufficient mixing, you will need a lot of shots to hit anything.
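To put rough numbers on that "lots of slots" intuition, here is a small birthday-bound estimate (my own back-of-envelope, not from the thread):

```python
from math import log2

# Birthday approximation: among n random b-bit values, the chance of any
# collision is roughly n^2 / 2^(b+1) while that probability is small.
def collision_prob_log2(n: float, bits: int) -> float:
    """log2 of the approximate collision probability."""
    return 2 * log2(n) - (bits + 1)

# Index a trillion objects with a 128-bit hash:
print(collision_prob_log2(1e12, 128))  # about -49: odds near 1 in 2^49
```

So even at a trillion indexed objects the odds of a single accidental collision are still astronomically small, which is why content-addressed storage gets away with it.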

>If one can truly "easily" derive a "secure" hash function

Semantic security has a precise definition. We just have to make empirical assumptions about whether an input cipher is semantically secure. The hash is only secure if the cipher was but the derivation is easy.

1) Oh I see what you mean now. Yeah I guess it depends on your intuition.

2) I mean, I'm not sure how correct your premise that symmetric ciphers are more secure than hash functions is, but it literally is something that is done. You can read more about it in [1], including the possible pitfalls. The transformation should provide more than enough intuition to see why both are equally plausible, which was the point of my reply. Whether or not it's best to actually implement them that way in practice is a separate question which I'm not trying to answer here.

[1] https://crypto.stackexchange.com/a/6476

So, the intuition defying part happens in the transformation you're talking about.

The block cipher behaves how we expect, I'm confident that many of the ciphers used inside real hashes like SHA-256 are good block ciphers and either would not be broken if we used them as block ciphers or could be repaired by competent cryptographers to be robust enough so that they didn't get broken over their lifetime after the repairs were done.

But the transformation to get a hash introduces the effect that your parent sees (I think correctly) as unintuitive. Normally if I put a megabyte of plaintext into a cipher I get back a megabyte (or slightly more, depending on my mode of operation and rounding up due to block size) of ciphertext, but this transformation ensures we always get one block back: the hash of a Wikipedia dump and of my address book are the same length for a particular hash function. The "block" is bigger than we're used to from popular block ciphers, but that's what it is, one block.

And that's where the concern appears. Why is that safe? Surely that shouldn't be safe at all? And yet, it seems, it is successful, more or less, until it isn't.
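To make the shape of that transformation concrete, here is a toy Merkle-Damgård hash built from a stand-in block cipher via the Davies-Meyer construction (my own illustrative sketch; the mixing function is invented and has no security whatsoever):

```python
# Toy Merkle-Damgard hash built from a block cipher via Davies-Meyer:
#   state = E(key=message_block, plaintext=state) XOR state
# The "cipher" here is a stand-in permutation, NOT secure -- the point is
# only to show how a fixed-size state absorbs an arbitrarily long message.

BLOCK = 16  # bytes per message block and per state

def toy_encrypt(key: bytes, block: bytes) -> bytes:
    # Stand-in for a real block cipher such as AES (do not use for security).
    x = int.from_bytes(block, "big")
    k = int.from_bytes(key, "big")
    for _ in range(8):
        x = (x * 0x9E3779B97F4A7C15 + k) % (1 << (8 * BLOCK))
        x ^= x >> 29
    return x.to_bytes(BLOCK, "big")

def toy_hash(msg: bytes) -> bytes:
    # Merkle-Damgard padding (0x80, zeros, 64-bit length), then
    # Davies-Meyer chaining over BLOCK-sized chunks.
    padded = msg + b"\x80" + b"\x00" * (-(len(msg) + 1 + 8) % BLOCK)
    padded += len(msg).to_bytes(8, "big")
    state = b"\x01" * BLOCK  # fixed IV
    for i in range(0, len(padded), BLOCK):
        block = padded[i:i + BLOCK]
        state = bytes(a ^ b for a, b in zip(toy_encrypt(block, state), state))
    return state

# A megabyte in, one block out:
digest = toy_hash(b"x" * 1_000_000)
assert len(digest) == BLOCK
```

Real designs like SHA-256 follow this same outline with a carefully designed internal cipher; the one-block-out property is exactly the part that feels like it shouldn't work.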

Yes: for specific examples of this confidence see JP Aumasson’s paper https://eprint.iacr.org/2019/1492 “Too much crypto”

AES may be fine, but what about the block cipher modes?

All of this stuff is so new that we'd expect to see churn. Many people involved in the discovery that asymmetric crypto was even possible (Diffie, Hellman, Rivest, etc) are still alive. If the people that first invented bridges were still alive we'd be running into all kinds of failures in bridge building techniques. Bridges have been built in the last 100 years that the wind blew down!

At least in our case the problem can be solved by re-encryption. Yes, we have to keep up to date with developments before everything is completely broken, but it is not as bad as discovering a bridge needs to be rebuilt. Speaking of which, it is common to have to retrofit for earthquakes, which probably wasn't the rule when they built the original bridge.

That depends on your threat model. A patient attacker can store all of your encrypted messages till some future point where decryption is economically feasible. Since they already have the weakly encrypted data, encrypting it again doesn't solve anything.

No, the bridge collapsed because it was never touched again after initial deployment, for 10 years. How are buildings doing in Chernobyl? Don’t neglect your data, if you want to keep it safe always. :P

Dude, the Chernobyl reactor was by design a nuclear bomb.

P.S. My father (still alive) was one of thousands of ordinary liquidators of the Chernobyl disaster from May to July 1986. Many of his coworkers, who lived with him in the same tent, are already dead.

Chernobyl was not quite a "nuclear bomb" in the sense that Pripyat isn't a charred crater. Nuclear reactors, when they fail, just spread their radiation all over.

> How are buildings doing in Chernobyl?

Not great, but not terrible?

A counterexample: Roman aqueducts.

Concrete is pretty resilient, and even when it cracks it can still hold up quite well; but once it does crack, the rebar gets exposed and starts to oxidize until the structure finally crumbles.

This can really apply to almost everything though, since if I understand you, it seems to be about progress in the face of impermanence. The history of science is basically a constant process of things kind of working, then breaking. But overall it goes forward and things are better off for it. If anything, I find it kind of inspiring.

Last part of this comment made me laugh, trying to imagine someone shouting over Slack "PATCH YOUR BRIDGE, NOW!!"

https://en.m.wikipedia.org/wiki/Citigroup_Center It has been done once after they found a bug.

DES is still basically as good as its insufficient key length. AES is 20 years old and holding strong. MD5 is still not fully broken (still nowhere near a practical preimage attack).

> 10 years from now, it might be as good as plain text.

I created an account just to chime in on this.

In short: horseshit.

Look, cryptology is a tricky field, but when it comes to standards development (especially OPEN standards development, without government intervention), it's mostly been hits, not misses. While there are occasional breaks and busts, the fact is, most of the trusted algorithms have remained trustworthy-- lasting through their designated lifespans, and often a lot longer. Many of the algorithms that you hear are "insecure" aren't really insecure due to any math advances; they have simply fallen victim to their already-known limitations, or are "insecure" when used in specific ways.

For instance, look at the Blowfish block cipher. Blowfish was designed and released almost thirty years ago. Its current biggest security issue is not some tricky biclique attack that allows key recovery on a desktop computer or something. The problem is that the block size is 64 bits, and in modern contexts, that's just too small-- networks shuffle enough bits around these days that a 32-gigabyte encrypted transfer is suddenly a realistic use case, and there are known attacks related to that. 64 bits was fine at the time, and the block size was a known limitation of the algorithm (birthday attacks are well-understood). The algorithm didn't fall; we outgrew the use case.
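Those "known attacks related to that" are just birthday-bound arithmetic, which you can do in two lines (my own back-of-envelope, in the spirit of the Sweet32 result on 64-bit block ciphers):

```python
# Back-of-envelope for the 64-bit block problem: 32 GiB of traffic under an
# 8-byte-block cipher is 2^32 blocks, and by the birthday approximation
# p ~ n^2 / 2^(b+1) a ciphertext-block collision is then at even odds.
n_blocks = (32 * 2**30) // 8   # 2^32 blocks in 32 GiB
p = n_blocks**2 / 2**65
print(p)  # 0.5
```

A repeated ciphertext block in a mode like CBC leaks information about the plaintext, which is why 64-bit blocks stop being acceptable once single connections can carry tens of gigabytes.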

Going back even further: DES was released nearly FORTY-FIVE years ago. It was released with a 56-bit key, which was known at the time to likely be in brute-force range for certain Nation State Adversaries. It has the same block-size issue as Blowfish. But it's actually stood up to cryptanalysis pretty well, given the power of the attacks that have been developed since then. Triple-DES, properly used, is plenty secure for lots of applications-- it's still even promulgated as a standard algorithm for a lot of financial industry systems.
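For a sense of just how exposed a 56-bit key is today, here is a rough brute-force estimate (the trial rate is an assumed, illustrative figure; EFF's Deep Crack already managed roughly 9×10^10 keys/sec back in 1998):

```python
# Brute-force time for DES's 56-bit keyspace at an assumed trial rate.
# 10^12 keys/sec is a made-up but plausible figure for modern dedicated
# hardware, chosen only to illustrate the scale.
seconds = 2**56 / 1e12
print(seconds / 3600, "hours")  # about 20 hours to sweep the whole keyspace
```

That is the "already-known limitation" falling, not the cipher's internal math: the same arithmetic against a 128-bit key gives a number of hours with about 21 more zeros.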

More germane to modern encryption practice, AES was standardized nearly 20 years ago, and has been studied for even longer. It's still a solid algorithm today. There are a few implementation caveats if you want to avoid things like cache timing issues, but the algorithm itself is holding up very nicely to mathematical advances-- the best known mathematical attacks drop the security levels by 1 to 4 bits, depending on key size, which is... well, it's worse than 0 bits, but it's certainly nothing to worry about. AES has been integrated into processor designs, it's used to protect US government information classified up to TOP SECRET, and it's supported by nearly every cryptographic suite out there-- because of all of that, the algorithm has remained a significant subject of ongoing research, and it has stood up to the scrutiny beautifully.

Cryptography is always advancing, and sometimes algorithms do fall to mathematical breakthroughs (RC4 has been battered pretty badly, for instance, and SHA-1 is clearly dead now). But for the most part, cryptologists try their damnedest to know the limitations of their algorithms, and they're pretty up front about them.

I'll also point out: cryptologists are aware that there's risk associated with mathematical advances, and they hedge their bets. Note that there is now a SHA-3 standard. That's not because the SHA-2 algorithms are dead. It's because, while we're pretty sure they're secure, the SHA-2 algorithms are cut from the same mathematical cloth as SHA-1 (they use the Merkle-Damgard construction). SHA-3 was standardized with the explicit purpose of adding diversity to the pool of standardized algorithms. NIST is currently running a post-quantum public-key standardization effort, and has made it very clear from the start that they'd like to select multiple "winners" from multiple categories. Part of this is to allow flexibility for different use cases (key size / message size / performance trade-offs vary wildly for different classes of algorithms). But another part is preventing a complete disaster if one class of algorithms is broken (either classically or with a quantum computer).

I agree with comments.

All ciphers older than 100 years have been broken, except for one.

However, it seems to me that designs improved exponentially, while attacks only improved quadratically. The old days of WWII where Enigma was broken before the end of the war will never happen again.

Conceptually, it makes sense: science improves quadratically because breakthroughs improve tooling that accelerates breakthroughs. I am simplifying here, but if you have N tools at T1 that make you progress at rate R1 = N×K, and that progress yields another tool, then at T2 you have N+1 tools, making you progress at rate R2 = R1 + K. Your progress is P2 = P1 + R1; in general Pn+1 = Pn + n×K, which sums to PN = P0 + K×N×(N-1)÷2, a quadratic progression.
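A quick sanity check of that closed form (my own snippet, just verifying the commenter's algebra):

```python
# If the rate gains K per new tool, P(n+1) = P(n) + n*K,
# which sums to P(N) = P(0) + K*N*(N-1)/2.
def progress(p0: float, k: float, n: int) -> float:
    p = p0
    for step in range(n):
        p += step * k
    return p

assert progress(3.0, 2.0, 10) == 3.0 + 2.0 * 10 * 9 / 2
```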

On the other hand, cryptographic security is exponentially better with every bit. If an old cipher uses its bits badly, it will still be good enough if it is long enough. Let's say it reaches a given difficulty level after 10 rounds; a cipher that uses its bits twice as well will reach the same level after 5 rounds, but would have something like 2^K times that level after 6, 2^2K times after 7, … 2^5K times after 10, reaching a level that quadratic improvements won't reach for an increasingly long time.

For instance, it took 5 years for MD4 (1990) to have a practical collision, 12 for MD5 (1992), 22 for SHA1 (1995), therefore roughly doubling every three years. If we extrapolate, SHA2 will have a practical collision in 2080.

There's a quantum leap in cryptography in WWII and it's a real shame that's so rarely explained even at Bletchley where they'd be well-placed to do it.

Enigma (like most systems in use in the 1930s) reflects thinking about cryptography as some sort of art in which the idea is to sort of stir letter symbols. Everything else often can't even be encrypted, or must be elaborately transformed into letter symbols first. Bletchley has replica "Bombes", which would be used to attack this mechanically, replicating the function of the Enigma machine to defeat it.

Lorenz, which was also attacked at Bletchley using the Colossus, is a modern stream cipher. It XORs a pseudo-random stream against your plaintext bits (albeit in the 1940s these were in 5-bit ITA telegraph code) and so it isn't different in core principles from say RC4 or at a greater distance in time indeed Salsa20 - Lorenz just has about 56-bit keys where these modern ciphers have more and are better designed. Attacking this mechanically was impractical, and that's why a computer like Colossus was needed.

As a result it really isn't fair to compare with 100 years ago. If we look back 50 years instead, the difference is stark.

> Hard to justify going to the trouble of encrypting your backup. 10 years from now, it might be as good as plain text.

Realistically, any archive is going to have to re-record data to stay ahead of age and equipment becoming obsolete. I've heard the lifetime for the physical media of tape backups is in the 10-30 year range.

Updating the encryption isn't a big deal once you're already rewriting everything.

The thing to remember about cryptography is that as long as it's based around computational power, it can always be broken at some point. Especially if we're building exponentially more powerful computers every N years.

Assuming the universe is infinite in some sense. Otherwise the exponential growth has to stop at some point. So base your cryptography around computational power beyond physical limits and you are safe.

It was always broken like this. These guys just figured that out.

Now we know. It's better to know.

Even if you could keep it secure for your whole lifetime, this is always at play: https://xkcd.com/538/ - In general probably nobody gives a dime about our backups, but if they did, they would find a way to get the data before it is encrypted or while we are decrypting the backup.

The way I see it, this is what is bad:

> We have tried to contact the authors of affected software before announcing this attack, but due to limited resources, we could not notify everyone.

Many in the 'security industrial complex' make as if they are doing god's work by their 'research' but from a common sense, man/woman on the street, and layman point of view that is not what appears to be going on at all.

What they do is self serving and a detriment. Then they try to justify it in some way as being good when it's not good. It's like going around and compiling a list of doors in a neighborhood that are open, attempting to contact everyone but not getting some people and then saying 'hey look at what we did for you'. Meanwhile if a list were not compiled at all almost all people would do just fine.

> But I think the feeling is related to not being able to rely on anything in our field. Hard to justify going to the trouble of encrypting your backup. 10 years from now, it might be as good as plain text.

Figure out the amount of issues there would be if there was not such open disclosure and an entire industry surrounding breaking things vs. same not being done. That is the issue.

> Imagine an engineer receiving a call at midnight about his bridge because gravity changed during daylight saving in a leap year. That's our field.

No, because nobody is actively trying to change gravity (or is able to).

The idea behind your complaint is "if we tell good people about risks, then bad people will know about them. If we keep them secret from good people, then bad people won't find out about them". Or, "if we don't make a list of open doors, then bad people can't find which doors are open". Which is .. not true. Bad people will already be making their own list of open doors, and sneaking through them without being noticed, over and over again.

> all almost all people would do just fine

Here's an ongoing ransom attack in the UK, started on New Year's Eve: https://www.bbc.com/news/business-51017852 The ransom is in the millions of dollars range, Travelex websites in 30 countries down for a week, Sainsbury's money exchange websites down for the same time, and "Dates of birth, credit card information and social security numbers [of customers] are all in their possession, they say."

Not looking at problems doesn't stop problems from existing.

> "if we tell good people about risks, then bad people will know about them. If we keep them secret from good people, then bad people won't find out about them". Or, "if we don't make a list of open doors, then bad people can't find which doors are open".

You are not recognizing the nuance, which is typically the case with people who support practically any and all disclosure and think it's good, plain and simple, with almost no downside at all (not the case).

In particular this: "then bad people won't find out about them"

My point is that disclosure makes it simpler for more bad people to find and learn and inflict damage. Unclear to me how you could think that isn't what has been and is happening. If someone publishes, say, how to create a card skimmer at a gas station, then more people (who are bad actors) will have what they need to give it a try. Sure, without the publication there will be people who have figured it out, and people who will put in the effort to figure it out, but you must realize vastly fewer will do that, right? The disclosure removes a great deal of friction.

With that effort and friction removed, the number of 'bad people' who can become actors is (I would argue) many magnitudes larger as a result of disclosure.

What I think you're missing is that the disclosure could come from a bad person who doesn't care about any of your arguments. It's like I'm saying "banks should invest in vaults to protect against theft" and you're saying "that costs money and disruption of building work, what if you just don't talk about where the money is kept". I agree that if people didn't steal the money, that would be nice. But most of the people who are talking about where banks keep money with a view to stealing it, aren't going to shut up in order to keep the money safe, because they don't care about keeping the money safe. So if us bank customers stop talking about it, a) that doesn't keep it quiet, and b) our money gets stolen.

It would be a nice world if we could tell companies about flaws and they fixed them, and nothing went public, but instead we tell companies with "responsible disclosure" and they ignore it, don't spend any effort on it, act incompetently leaving it with first line support people who don't understand it and have no path to escalate it, have no security contacts available for reports, cover it up or deny it or try to silence it with NDA style agreements, prioritise shareholder profit or public image over it, and generally behave irresponsibly in all possible ways that avoid them having to deal with it, with very few companies excepted.

In light of that, public disclosure with all its risks, actually does kick companies into taking action, and closing risk vectors for good. Like companies who say "we put customers first!" but it takes a complaining public Twitter thread for them to even respond at all. Telling people to not take it to Twitter ignores the fact that there's no other way which seems to actually work.

Give an alternative which also gets problems fixed, and I'll be much more in favour of it.

"Which is .. not true. Bad people will already be making their own list of open doors, and sneaking through them without being noticed, over and over again."

But the vast majority of people do literally have front doors which cannot guard them from 1% of the possible attacks in the world, if they were targeted. Never mind the best ones in the world.

The internet brings a much bigger attack surface than local people who can reach a front door, and home users access the same openssh as companies do, but companies (can) afford stronger doors. "Most people's home doors can't withstand a hit from a sledgehammer" -> "we shouldn't talk about cryptographic weaknesses in case someone abuses them" is a stretch, the comparison does break down.

[1] is a fun YouTube video about physical pen testing; one example at 13:45 in the video, the presenter is walking home from a bar, walks up to a locked high street bank, spits a mouthful of beer through the gap in the doors, triggers the presence sensor on the inside which lets people out, and the door opens and lets him in.

[1] https://www.youtube.com/watch?v=rnmcRTnTNC8

"The internet brings a much bigger attack surface than local people who can reach a front door"

"Local people", huh. My front door is visible to everyone on the internet, and I have no practical way to prevent that. Some obscure company went by with their mapping vehicle and...

Now, some people do live in big buildings where less is exposed to the outside, but millions don't.

> and...

and... what? Unless telekinesis has been invented, a photo of your door doesn't increase the amount of people who are able to try opening your door. If you're about to say "someone might choose to come a long way just for my door" then that seems like an argument in favour of what I'm saying - in that case, wouldn't you like to know about any vulnerabilities your door has which you could address, before they arrived, instead of relying on silence and hoping they won't know about them?

I'm not really arguing in favor of doing anything in particular; I was just pointing out that people rely on "security by obscurity" in day to day life, despite the fact that everything is connected to the internet. Perhaps there are some subtleties in exactly what "obscurity" is.

I'm saying the ideas I read about how the world is don't seem to be connected to my view of reality. I don't have to argue with your conclusions to find fault with your premises, and I'm feeling too lazy to do it right now.

We typically assume that organizations like the NSA, FSB, Mossad, MSS, and the like already know of such attacks.

The percentage of people impacted by the NSA et al is exceedingly small compared to the pain and impact on everyday citizens and companies from disclosures. Not all disclosures (some are for sure an upside), but it's out of control and has been for a long time.

The government (in the US) does not have the resources to go after everyone who commits a crime, and that would assume they are actually scooping up info and know of the crimes (they aren't and they don't). They don't even have the resources to audit tax returns (other than a very small percentage). This idea that you are being watched all the time is fantasy. In the US. Other countries? Unfortunate when that's the case, but that does not mean as a US citizen I can't view it as detrimental to me that this type of security disclosure makes it so easy for hackers to do a better job (and it does; nobody is going to dispute that fact, right?).

The government (in the US) does not have the resources to go after everyone...

In point of fact, since Snowden's revelations we know not only that the USA state actually does have the resources to monitor everyone, but also that it does so. Furthermore, the state is not a monolith. While it may not be in the interest of the state as a whole to capriciously victimize individual humans, it is often in the interest of particular officers and organizations that comprise the state to do so. Cf. "parallel construction".

General questions:

(edit: these are indeed general questions, not just about SHA1)

Has anyone else been worried about data deduplication done by storage and/or backup systems, considering that they usually use hashes to detect data blocks that are "the same" (without additional metadata) and avoid storing those "duplicate data blocks" again? Doesn't this seem far worse when you also consider that systems like Dropbox deduplicate data across all their users (expanding the footprint for collisions)? Are there any research papers/articles/investigations about this?

It depends on how the deduplication is configured, and the risk tolerance of the organization running it, I suppose.

ZFS, for example, has a dedup setting option that forces the file system to do a byte-for-byte verification for any deduped data: https://docs.oracle.com/cd/E19120-01/open.solaris/817-2271/g...
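The shape of that verify option, in pseudo-Python (illustrative only; the names and structure here are invented, not ZFS internals): a hash match is only a candidate, and the block is byte-compared before being deduplicated.

```python
import hashlib

# Sketch of dedup with a ZFS-style "verify" step: on a digest match,
# byte-compare before treating the block as a duplicate.
store: dict[bytes, bytes] = {}  # digest -> stored block

def write_block(data: bytes, verify: bool = True) -> bytes:
    digest = hashlib.sha1(data).digest()
    if digest in store:
        if not verify or store[digest] == data:
            return digest          # dedup hit: reference the existing block
        raise RuntimeError("hash collision caught by verify step")
    store[digest] = data
    return digest

d1 = write_block(b"hello")
d2 = write_block(b"hello")   # deduped: nothing new stored
assert d1 == d2 and len(store) == 1
```

The verify branch costs an extra read and compare on every dedup hit, which is exactly the performance trade-off organizations weigh when deciding whether to enable it.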

Yep, I'm afraid of that. I have no idea how they are mitigating this, and I doubt they'll ever disclose.

From what I recall when the first collision was proved it was with a PDF which is easier to craft a collision from. I also seem to recall that the collisions were from hashes of the whole file.

The dedupe engine written where I work chunks a file and hashes those chunks, meaning it's somewhat harder to craft collisions (I forget where the chunking boundaries are, but it's within a range iirc). The hashing algorithm was SHA-1 last I checked, but I've never heard even company folklore of corrupted backups caused by hash collisions. I get the feeling that it's near impossible in practical terms given the size of the string being hashed. Having said that, hubris is the downfall of programmers everywhere, so I wouldn't bet all my money on it.

In my generic question above, I wasn't referring to someone intentionally causing collisions, but about random collisions that could occur because of these data chunks coming from a large pool of data (say a system with tens or hundreds of millions of users with several hundred GB each of varied data among them). AFAIK, chunking methods usually do not use additional metadata (like a file timestamp or size or name) that could help do a more comprehensive comparison instead of relying on a hash alone. Are the chosen chunks very small (which means the number of hashes would be quite large, making deduplication take longer when looking for exact matches)? I'd be interested in writings from real systems online that explore the choices they have taken and how they pre-empt random collisions (as much as they can design it into the system).
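For scale, a back-of-envelope for purely accidental collisions in such a system (the numbers here are invented for illustration, not from any real deployment):

```python
from math import log2

# Odds of a purely accidental SHA-1 collision in a big dedup store, via the
# birthday approximation p ~ n^2 / 2^(bits+1). Assumed scale:
# 100 million users x 100 GiB each, split into 8 KiB chunks.
chunks = (100_000_000 * 100 * 2**30) // (8 * 2**10)
log2_p = 2 * log2(chunks) - 161   # SHA-1 digests are 160 bits
print(f"collision odds around 2^{log2_p:.0f}")
```

That works out to odds around 2^-60, which is why these systems worry about adversarial collisions (like this chosen-prefix attack) far more than random ones.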

are there any systems that do sha-1 for dedup? I am only aware of sha-256.

Subversion's dedup was broken by the http://shattered.io/ SHA-1 collision in 2017 https://medium.com/@hyrum/shattered-subversion-7edea4ba289c


git computes the SHA-1 checksum over a header that includes the length, plus the file data; it also checks the length in the header as well as the hash. Linus says that it would be less practical to find a collision that has the same length as the original data; I guess at some stage even that might become possible.
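Concretely, git's blob id covers a "blob <size>\0" header plus the contents (this header format is well documented); a quick sketch of the computation:

```python
import hashlib

# What git actually hashes for a blob: a "blob <len>\0" header is prepended
# to the file contents, so a raw-file collision does not automatically
# collide as a git object.
def git_blob_sha1(data: bytes) -> str:
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# For a file containing "foo\n", this matches the 257cc564... id that
# `git ls-files -s` reports, not the raw sha1sum of the file.
print(git_blob_sha1(b"foo\n"))  # 257cc5642cb1a054f08cc83f2d943e56fd3ebe99
```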

bup does. Even worse: rsync relies on md5 for finding duplicate blocks.

At least rsync does that only when --checksum is specified as an option; otherwise it assumes that files with the same date/time/size are the same (which is probably not too bad as a default).

SHA1 isn't broken for hashes.

SHA-1 (and SHA-256 and many other hash functions) are subject to length extension, which is how the 2017 shattered attack works: find a pair of short prefixes that work as PDF headers and that have colliding hashes; then append a common suffix such that the different headers display different parts of the suffixes. Once the collision has been found you can easily make your own colliding PDFs - https://alf.nu/SHA1

This breaks anything that dedups on just the SHA-1 hashes of raw files

What does "for hashes" mean? For a hash table? Of course it's fine for a hash table (but probably too slow). Hash tables don't need collision resistance. If you need collision resistance then SHA-1 is broken.

SHA1 is vulnerable to preimage attacks in reduced round variants. The findings keep steadily improving.


This means if a storage system just uses SHA1 to detect duplication, you can abuse the ability to create a collision to possibly do bad things to the storage system.

Yes, but for the specific concerns listed it should not be a problem. That is, if you upload to dropbox and check the sha1 when it comes back, yes you did get the same data back.

And your data can't be stolen by a hash-to-data oracle either, unless the evil attacker constructed your secret data for you.

So it depends on your threat model. Yes there are practical concerns, but not the ones listed.

I disagree. Deduplicating filesystems often depend on hash equivalency meaning data equivalency. They may or may not have modes to validate all data before deduplication, but these suck for performance and are often turned off.

E.g. I make two things that hash to the same thing. One is a contract where I'm obligated to pay back a loan. Another is some meaningless document. I give them to a counterparty who puts them in their filesystem, which later scrubs and deduplicates data. Since they hash the same, the filesystem removes the contract and leaves my meaningless document (or never bothers to store the contract because it already exists, if deduping online, etc).

Note that this is a chosen prefix collision, which is much more demanding (and more useful!) than finding a collision in general. And this leaves aside that SHA1 is looking increasingly vulnerable to preimage attacks which further broaden the attack scenarios.

The filesystem removes a document, but you can't expect it will be the right one more than 50% of the time.

An attacker may be able to control the way it happens.

It may be repeatable to always remove the first or the last. If it's online dedup, it'll always be the last.

Or the attacker can provide many copies of the document you want to still have a copy of, and one copy in the middle of the one that should be vanished.

This doesn't sound crazy farfetched. I bet a lot of people have files in their dropbox that were created by someone else.

>SHA1 is vulnerable to preimage attacks.

From your linked article:

>All currently known practical or almost-practical attacks on MD5 and SHA-1 are collision attacks. In general, a collision attack is easier to mount than a preimage attack, as it is not restricted by any set value (any two values can be used to collide).

Just a curiosity, since people are talking about Git still using SHA-1 (despite work on SHA-256 since 2017).

I see that Git doesn't actually use SHA-1 any more, it uses "hardened SHA-1": https://stackoverflow.com/questions/10434326/hash-collision-...

Well, according to that reference, it's hardened against a specific, previously known attack. Do you have any information on whether that also protects against the different, new attack which was just published?

Not so much the specific attack, as the broad class of attacks. I think this new work is in that same broad class but I am not a mathematician.

The idea in Marc Stevens' anti-collision work is that some inputs are "disturbance vectors" which do unusual things to the SHA-1 internals, and we want to detect those and handle that case specially since there is almost no chance of that happening by accident. It has a list of such vectors found during his research.

This paper doesn't talk about "disturbance vectors" but it does discuss ideas like "Boomerangs" which I think ends up being similar - I just don't understand the mathematics enough to know whether that means "the same" or not.

I was wondering the same thing, and hoping someone else would answer that.

Hardened sha1 does detect this new attack. Easy to test: Check their pair of files into a git repo and see that they have different checksums, while sha1sum(1) generates the same for both.

checks-out, thanks

    $ mkdir sha1
    $ cd sha1
    $ curl -O https://sha-mbles.github.io/messageA
    $ curl -O https://sha-mbles.github.io/messageB
    $ echo foo > bar
    $ echo foo > baz
    $ openssl sha1 *
    SHA1(bar)= f1d2d2f924e986ac86fdf7b36c94bcdf32beec15
    SHA1(baz)= f1d2d2f924e986ac86fdf7b36c94bcdf32beec15
    SHA1(messageA)= 8ac60ba76f1999a1ab70223f225aefdc78d4ddc0
    SHA1(messageB)= 8ac60ba76f1999a1ab70223f225aefdc78d4ddc0
    $ git init
    Initialized empty Git repository in ...
    $ git add *
    $ git commit
    [master (root-commit) b274c88] sha1 collision test
     4 files changed, 2 insertions(+)
     create mode 100644 bar
     create mode 100644 baz
     create mode 100644 messageA
     create mode 100644 messageB
    $ git ls-files -s *
    100644 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 0 bar
    100644 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 0 baz
    100644 5a7c30e97646c66422abe0a9793a5fcb9f1cf8d6 0 messageA
    100644 fe39178400a7ebeedca8ccfd0f3a64ceecdb9cda 0 messageB

No, you and joeyh are incorrect about the test (but correct about the result). As can be seen in the output, SHA1(bar) = f1d2d2f924e986ac86fdf7b36c94bcdf32beec15 but git_SHA1(bar) = 257cc5642cb1a054f08cc83f2d943e56fd3ebe99. Why is there a difference? Not because of hardened SHA-1: hardened SHA-1 essentially always produces identical outputs to SHA-1.

> git doesn't really use SHA-1 anymore, it uses Hardened-SHA-1 (they just so happen to produce the same outputs 99.99999999999...% of the time).[1]


There's essentially no chance that the string "foo\n" fell into that tiny probability of difference. The reason there's a difference is because before git hashes something, git will do various processing to it (maybe appending and prepending various things) and those things broke the carefully created collision. But a chosen-prefix attack might mean those various things can be accounted for, and a collision could still be found.
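Concretely, the main "processing" git does is prepend a fixed header before hashing. A minimal sketch in Python (using plain SHA-1 rather than the hardened variant git actually ships):

```python
import hashlib

def git_blob_sha1(data: bytes) -> str:
    # Git object ids hash a header of the form "blob <length>\0"
    # followed by the raw contents, so the object's length is part of
    # the hash input, and a collision crafted for the bare file bytes
    # is broken by the prepended header.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# "foo\n" yields the familiar blob id seen in `git ls-files -s` above.
print(git_blob_sha1(b"foo\n"))  # 257cc5642cb1a054f08cc83f2d943e56fd3ebe99
```

This reproduces the 257cc564... id from the transcript, versus the f1d2d2f9... that plain `sha1sum` of the file gives.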

So we need to directly run hardened SHA1 on the data, which I believe is located at https://github.com/cr-marcstevens/sha1collisiondetection

As seen in https://github.com/git/git/blob/master/sha1dc_git.c

So I tested that one:

    $ sha1collisiondetection-master/bin/sha1dcsum bar baz messageA messageB shattered-1.pdf shattered-2.pdf
    f1d2d2f924e986ac86fdf7b36c94bcdf32beec15  bar
    f1d2d2f924e986ac86fdf7b36c94bcdf32beec15  baz
    4f3d9be4a472c4dae83c6314aa6c36a064c1fd14 *coll* messageA
    9ed5d77a4f48be1dbf3e9e15650733eb850897f2 *coll* messageB
    16e96b70000dd1e7c85b8368ee197754400e58ec *coll* shattered-1.pdf
    e1761773e6a35916d99f891b77663e6405313587 *coll* shattered-2.pdf
So it does protect against the new attack.

I really appreciate this, thanks!

How would an attack on a git repo work? You create a repo with identical hashes but different content and next time the user clones from scratch they get your modified version?

yeah my thoughts about git are similar. look at the two messages they have as an example:

Key is part of a collision! It's a trap!yE'NsbK#ދW]{1gKmCx's/vr| -pJO_,1$)uB1qXv#U)9ESU;p~0G:Y ݕbBIjFra눰3&t'lB_!h5M([,˴QMK#|o5pv|i,+yYpݍD7_Rf\'GUZ,ϵdvAYAugV=Lk8_E 2 +nolBtxXoQt&+?Y3LP:'Qt(,ۛuԪWJm:A"M6<|B4kVv̨ޠA=M+m%殺j5N|EMA\Ed- s&@u@:a?pq^Xf0U?R}


Practical SHA-1 chosen-prefix collision!'lka}vbI3,·W]Ǟ+gK}Cxs/v&r| }-hRJO_ rO̳;bzC ,1&uRP-MXrU3aO;pr0:sY'2 l&r7#(A{oNyCJ_W,8 əbحBYީpFr2a8#&t+n_15q(_,ˤQMW#hzYMgVV=L,kO0E*N +oc@BpXoᯖd&?+?[{3LвP&'U t ( WJÏm\:A"6>>|SB(k;Vv̨ޠ^A=Y ;om%j-|cUAAۜEТ&@o@:La3psH^eXf0QJm ݶd

they have the same sha1sum, but for all practical purposes it's nonsense since both messages are pure trash. you couldn't have malicious C code that would have the same hash as non-malicious C code in this example

Isn't that like incredibly simple?

Dump your garbage string behind a // or inside an #if 0, restrict the garbage string character set to characters which will not disturb that, and your compiler will whistle while it works.
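A rough sketch of that idea (in Python rather than C, and with placeholder bytes standing in for the collision blocks a real attack would have to produce): as long as the garbage is restricted to characters that can't terminate the comment, both files remain valid source.

```python
# Hypothetical sketch: hide per-file collision garbage behind a comment
# so the file still parses. The garbage strings below are placeholders -
# a real chosen-prefix attack would dictate their (different) values.
benign = "def pay(amount):\n    return amount\n"
malicious = "def pay(amount):\n    return amount * 2\n"
garbage_a = "Qx9$kz" + "A" * 32  # stand-in for collision blocks for file A
garbage_b = "Zr4!mw" + "B" * 32  # stand-in for collision blocks for file B

file_a = benign + "# " + garbage_a + "\n"
file_b = malicious + "# " + garbage_b + "\n"

# Both variants compile fine despite the trailing junk.
compile(file_a, "<file_a>", "exec")
compile(file_b, "<file_b>", "exec")
print("both files are valid Python")
```

The same trick works in C with `//`, `/* ... */`, or `#if 0`, provided the collision search can be constrained to a comment-safe byte alphabet.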

Depends on if the chosen prefix attack allows the content to appear arbitrarily in the middle of the byte stream like that.

That's exactly what a chosen prefix attack means. You choose the arbitrary prefixes. Then the garbage is inserted. Then (due to SHA1's Merkle–Damgård construction) you append a postfix that's mostly arbitrary (but the same in both files).
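A toy illustration of the Merkle–Damgård point, with a deliberately trivial XOR "compression function" so state collisions are easy to construct by hand (real SHA-1 is obviously not this weak, and this ignores padding and length strengthening):

```python
BLOCK = 16

def toy_compress(state: bytes, block: bytes) -> bytes:
    # Deliberately weak toy compression: XOR the block into the state.
    # Used here only so we can build a "collision" by hand.
    return bytes(s ^ b for s, b in zip(state, block.ljust(BLOCK, b"\x00")))

def toy_md_hash(msg: bytes) -> bytes:
    # Merkle-Damgard iteration: chain the compression function over blocks.
    state = b"\x00" * BLOCK
    for i in range(0, len(msg), BLOCK):
        state = toy_compress(state, msg[i:i + BLOCK])
    return state

# Two different chosen prefixes...
p1 = b"PREFIX-ONE......"
p2 = b"PREFIX-TWO......"
# ...plus per-prefix "glue" blocks chosen so the internal states collide
# (trivial here, because XOR lets each glue block cancel its prefix):
c1 = toy_md_hash(p1)
c2 = toy_md_hash(p2)
assert toy_md_hash(p1 + c1) == toy_md_hash(p2 + c2)

# Once the internal states collide, ANY common suffix keeps the collision.
suffix = b"common suffix appended to both"
assert toy_md_hash(p1 + c1 + suffix) == toy_md_hash(p2 + c2 + suffix)
print("collision survives suffix extension")
```

The real attack does the hard part (finding the glue blocks for a strong compression function); the suffix-extension property then follows from the construction exactly as in the toy.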

anyone checking diffs, or working on the file, etc. would notice that. it wouldn't survive long

I think active projects would detect this fine - but what if that commit was pushed to left-pad and everyone ended up pulling it to local because it's a dependency of a dependency of a dependency in NPM?

Or what if it's a really obscure library for parsing like... pyramidal jpeg2000s, are the library consumers going to be checking the source? Heck, most people already don't check download checksums unless their downloader does it automatically.

Hmmm, does the garbage string actually have to survive long?

If there's a followup CL to "delete a garbage string that accidentally made it into the repo", which doesn't actually fix whatever else was added, would that get you anywhere?

If you could push up a commit that computed to the same hash of the last tagged release in a repo... I'm not certain, the tag might end up referencing the new object? Certain versions of git (i.e. maybe git for windows) may also react in different manners.

In theory you might get people building software packages for distros to build your malicious version; you may also temporarily shut down the ability for anyone to check out that version (basically a denial of service on builds?) but the time window would be weird.

You'd probably be most successful modifying the original repo - either by being the creator of the software or gaining their trust. However, it would have to be a rather powerful SHA1 attack for the commit to still be valid syntax, hard to detect, and make a meaningful malicious change.

> SHA-1 has been broken for 15 years, so there is no good reason to use this hash function in modern security software.

Why are cryptographers always exaggerating things and so out of touch with reality? The first actual collision was like 3 years ago. It's not like the world has been on fire in the meantime, and it's not like SHA-1 is broken for every single possible usage even now. And why the nonsense with "no good reason"? Obviously performance is one significant consideration for the unbroken use cases. Do they think painting a different reality than the one we live in somehow makes their case more compelling?

Cryptographers rely on precise definitions to make their assessments.

In particular, a primitive makes a number of useful promises. Any achievement that describes a way in which a promise is not kept makes that primitive broken, regardless of whether that achievement is theoretical or practical.

(They often talk about “theoretically broken” or “practically broken” to distinguish whether it was actually done.)

> it's not like SHA-1 is broken for every single possible usage

True, but it is extremely easy to believe your usage is unbroken and be wrong. Besides, often, primitives that are broken for one usage eventually break for others.

> why the nonsense with "no good reason"? Obviously performance is one significant consideration for the unbroken use cases

There are many much more robust primitives with better performance nowadays, such as BLAKE2, SHA-512/256, or KangarooTwelve.

Is SHA-1 not still faster than SHA-256/512? And BLAKE2 only barely faster? I suspect that with some hardware SHA-1 is still going to beat all of those out - am I wrong? I would love to learn more.


Regardless, no competent designer is going to use SHA1.

Thank you, that's an excellent chart.

CRC is faster than SHA1

It's not about the speed, it's about the 'is it secure.'

If my amazing hash function takes 10^10^10 years to compute, it's possibly more secure than anything else listed there, but it's also quite useless.

Ten years ago I was being pressured into not adding SHA-256 support to a new system because SHA-1 would allow them to use a stack of old crypto cards some idiot bought and hadn't been able to put to use, so they were just collecting dust in a closet somewhere. (They had a max of 1024-bit RSA keys, which were also considered broken at the time, or close to it when the cards were new.)

This wasn't for some throw-away project. Big, old company. This may be the closest I've come by far to writing software that will still be used after I'm dead (gets a little easier every year, though). If I gave them SHA-1 they'd still be using it for sure.

I refused, and the fact that people were calling MD5 very broken and SHA-1 broken helped.

(A completely different faction was trying to get me to jump straight to SHA-512. I said that was probably overkill, and yes I will implement it but we're using SHA-256 as a default. Then a couple years later it turns out SHA-256 is more resilient than 512 anyway. But what a schizophrenic place.)

Past experience, and documents that have since been declassified, show that serious attackers (e.g. the NSA) are at least a decade ahead of what's publicly known in cryptography; i.e. pretty much every time relevant groundbreaking math was published, the classified cryptographers had already known it for a long, long time.

So if this attack was developed today, then you should assume that the NSA has been able to execute it for at least ten years already, whenever it suited them, including for mass surveillance of random not-that-important people. The same applies to the collisions - the first published collision was 3 years ago, but we should assume that it was definitely not the first. Only a minority of the world's cryptography researchers participate in the public, open, academic community; the majority are employed on the condition that they won't get to publish anything important. And since we've known for the last ten years that such attacks were possible, there's no good reason why SHA-1 would have been considered safe.

Are there materials that show a ten year head start?

For an early example, differential cryptanalysis methods were 'discovered' in the late 1980s by Biham and Shamir; but it later turned out that resistance to it was pushed as a design consideration back in 1974 when DES was designed, so the NSA knew of it at least then - that's more than a decade.

We know that British intelligence (who don't have as many resources as the NSA) had developed the RSA equivalent something like 5 years before Rivest/Shamir/Adleman got to it; and we still have no idea how far the NSA was with that math at the time - all we have is circumstantial evidence such as travel reports of NSA representatives going to cryptography conferences and being satisfied that absolutely no math that was new (to them), or even potentially leading to something new, was being revealed there.

We also have NSA suddenly changing recommendations to use/stop using certain cryptosystems that still doesn't make sufficient sense - e.g. the 2015 turning away from 'suite B' ECC may have been due to some quantum discovery as is claimed, or some other weakness being found, but it's been five years and we (as far as I understand) still don't know as much as they did back in 2015, so they're more than 5 years ahead. But to know whether the current advantage is ten years or more or less, we'll have to wait a generation or so, it takes a long time for truth to leak.

This is fascinating; are there any readings/sources for similar events, i.e. governments "predicting" academia?

I am aware of the 2015 plan to "transition soon(tm)", but that's because I was alive 5 years ago. Other earlier events would be super cool to read up on.

How many mathematicians are working for NSA? How many public research cryptographers are there?

One thing is that proper public cryptographers are very rare, there's a handful of effective teams - e.g. MIT has Rivest and associates, there are a bunch of other places, but most universities, including quite serious ones, don't have anyone doing reasonable cryptographic research. Cryptocurrencies caused a recent boom, but it's a niche with separate, specific goals that doesn't advance the rest of the field much.

It's hard to give good numbers; we'd have to look at the Snowden leaks and others, but I haven't done much about that. Here's an earlier HN comment https://news.ycombinator.com/item?id=6338094 that estimates 600 proper mathematical researchers working on crypto, and it seems quite plausible to me that that would be more research power than the entire public academia - especially because in many countries that do take this field seriously (e.g. China, Russia, Iran) there's no real public research in crypto happening, because it's classified by default. I mean, pretty much all academic research happens through targeted grants by governments, and who other than a department of defense (or similar organizations in other countries) would be funding cryptographic research?

Also, I'll quote Bruce Schneier (2013, https://www.schneier.com/essays/archives/2013/09/how_advance...) regarding their budget - "According to the black budget summary, 35,000 people and $11 billion annually are part of the Department of Defense-wide Consolidated Cryptologic Program. Of that, 4 percent—or $440 million—goes to 'Research and Technology.' That's an enormous amount of money; probably more than everyone else on the planet spends on cryptography research put together."

This is a great summary, thank you!

I think there's skepticism in the community that NSA et al actually have such a far head-start now.

That said, historically speaking, back in the early 70s DES was being designed. The NSA made some unjustified changes to its S-boxes. At the time, there were allegations that they had made them intentionally weaker. (Or so I've read; I wasn't born yet.) In the early 90s, differential cryptanalysis was discovered for the first time, and it turns out that DES was already resistant to it (unlike other block ciphers at the time): in fact, the NSA already knew about differential cryptanalysis, 20 years ahead of the general public, and intentionally strengthened DES. (Also, IBM discovered it, too, but kept it quiet at the NSA's request.)

> So if this attack is developed today, then you should assume that NSA has been able to execute this attack for at least ten years already whenever it suited them, including mass surveilance of random not-that-important people.

They might be 10 years ahead on the algorithms side, but they aren't 10 years ahead on the hardware side. Also, spending 45k today gets you a single collision. That is hardly going to be useful for mass surveillance.

Other declassified documents from the past showed they were years ahead on hardware, too. The NRE costs got so high that commodity hardware became preferable in the general case. That said, they can still throw money at ASICs, FPGAs, semi-custom versions of Intel/AMD CPUs, maybe the same for GPUs, and so on.

They have access to better and more hardware than most threat actors.

They may have been, though - there's talk of liquid-helium-cooled, tens-of-gigahertz (8-bit) processors being purpose-built (in 2002). The National Cryptologic Museum next door to NSA HQ is a fascinating place and has some very cool displays, including a large disk changer and a piece of a cooling system. There's a PDF at [1], but the discussion at [2] might give a better idea.

[1]: https://www.nitrd.gov/pubs/nsa/sta.pdf [2]: https://it.slashdot.org/comments.pl?sid=485458&cid=22736288

Performance is the worst argument you could choose.

a) There's hardly an application where hash-performance matters. These things are fast.

b) Precisely because people may still complain about performance, cryptographers invented a hash function that is faster than all the old choices like MD5/SHA-1 and still secure (it's called BLAKE2).

The difference is that it was "a" collision. That wasn't very helpful if you wanted to forge a document.

This is chosen prefix collision. This means you can select a beginning of a document which in many cases is enough.

For anything requiring fast hashing performance today, is there a reason why BLAKE2 [1] wouldn't be chosen? It seems to be faster than SHA-1, and it was a finalist in the SHA-3 competition. As collisions become easier to create, I would think it'd cause a problem in many even insecure use cases. I'm pretty sure GitHub had to blacklist those two PDFs that originally had the same hash.

[1] https://blake2.net/
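For what it's worth, BLAKE2 has been in the Python standard library since 3.6, so for greenfield (non-interoperable) code it's often a one-line swap:

```python
import hashlib

# blake2b's digest size is tunable, so you can get a 256-bit digest
# directly instead of truncating by hand.
h = hashlib.blake2b(b"hello world", digest_size=32)
print(h.hexdigest())
print(len(h.digest()) * 8)  # 256-bit output
```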

As stated, they are talking about "modern security software".

So yeah, if you are building/improving software that has a clear focus on security, you should use a secure hash. Seems only natural to me.

Deprecating things takes a long time. It could be 10 years before 'that guy' in your office gets it into his head that SHA-1 isn't appropriate where security is concerned. Thus you want to make it unambiguous that now is a good time to move to something else, so that when it is thoroughly broken in 10 years' time you're minimising the number of people who have yet to switch.

Depreciating things takes however long your accountants and the tax laws in your country allow :)

Deprecating things does take a long time, and the only practical thing to do about that is to get ahead of the game as you say.

Because by shouting loudly now, we can maybe get changes into libraries within the next 5 years, before these attacks are commonplace

Like computer scientists, they think in binary: either it's secure, or it's not. In reality there's a spectrum where you also have "good enough".

"good enough" relies on a threat model. Cryptography researchers work in the abstract - without a threat model you must consider cases where your attacker has unlimited resources.

It's good enough for you and me, but research isn't meant to be practical, imo

What. The first thing any security paper defines is the assumed threat model. People design all kinds of schemes for different threat models.

The point with assuming conservative threat models for key primitives like hash functions is that the threat model can change rapidly even within the same application, and attackers only get stronger. So you err on the side of caution, and don't rely on luck to keep safe.

The security strategy with the most practical utility for most software engineers working on most projects is defense in depth: multiple layers, each assumed to be breakable.

It's a striking contrast with the stark mathematical language deployed by cryptographers, on whose work we rely.

If we differentiate between the two fields of software engineering and cryptography, it's easier to be generous in our appreciation for the different goals and mental models.

Most people think in yes/no logic. Unfortunately binary is a horrible oversimplification of a very analog reality and results in many of the world's problems that we're in now. Because we tend to think of a binary of yes/no, we often end up flying from one ditch on the side of the road to the other.

But in many contexts, "good enough" is more a question of perception than reality.

These statements serve to shift the Overton window of perception, and therefore help improve the odds that people aren't thinking "good enough" when things are broken.

>A countermeasure has been implemented in commit edc36f5, included in GnuPG version 2.2.18 (released on the 25th of November 2019): SHA-1-based identity signatures created after 2019-01-19 are now considered invalid.

Since SHA-1 was always possible to break, and since NSA probably gets access to big computers and sophisticated techniques before researchers, why doesn't this invalidate every SHA-1 signature ever made and not just ones from last year?

Actually it's even worse than that: signature creation time is added by the signer so it's totally under control of the attacker. IMHO all SHA-1 based signatures should be ignored.

The signer is assumed to be trustworthy. That's why we care for their signature in the first place. It's the sign-ee that is presumed to be the attacker.

In this case, signature creation time isn't under control of the attacker. The attack scenario being considered is that Mallory can convince Alice to sign a key K1, provided by Mallory, such that it looks like Alice signed K2. The party creating the signature is honest here.

The problem is that the dates are part of the document being signed.

Rather than keys, Alice will typically sign a document (e.g. an X.509 to-be-signed certificate)

Mallory creates two documents, the legitimate seeming document A (a to-be-signed certificate Alice willingly signs) and document B (the content of which is controlled by Mallory to an extent depending on the details of the collision). In document A I'm sure Alice will insist on the date being roughly correct so you'd detect that. But Alice never sees document B, she isn't aware it exists, so it can specify any date, including one chosen not to set off alarms.

For the Web PKI we were triply safe because:

1. We told Alice (the public CAs) never to sign anything at all with the dangerous algorithm after a set date. So long as Mallory wasn't able to develop and use a collision before that date and Alice did as she was told‡ this would be safe in perpetuity.

2. We already had a countermeasure in the documents, very early in each Web PKI X.509 certificate is the Serial Number, if you look at yours you'll notice it's a crazy huge number and seemingly not "serial" in any sense. It's random. Can't do a chosen prefix collision attack if you can't choose the prefix.

3. Since no more new documents were being signed clients in the Web PKI were able to stop recognising these signatures thus permanently ensuring the attack was impossible within about 18 months.

‡ A very small number of exceptions were explicitly granted, and a similarly small number of exceptional cases occurred for which no permission was asked. All investigated to everybody's satisfaction. As you may see if you poke around in the demo documents from this article, a collision document may not jump out as problematic from a crowd but it certainly isn't so innocuous as to survive careful scrutiny, and with such a small number of exceptions to look at this scrutiny was possible in a way it never would be for the wider Web PKI.

No, because this is a collision attack (without control over the hash value), not a preimage attack (where you match an existing hash).

We know how to make pairs of new files that collide with each other, but there's no known way of creating a file that collides with something specific that existed before.

Quick question about the "What should I do" section. It says "use instead SHA-256". Isn't SHA-512 both better and faster on modern hardware?

SHA-256 and SHA-512 are both in the same family (SHA-2).

Latacora says to use SHA-2. If you can get away with it, SHA-512/256 instead of SHA-256. But they're all SHA-2 family hash functions.


No need to bikeshed this. But if you must: SHA-512/256 > SHA-384 > SHA-512 = SHA-256

If you're wondering, "Why is SHA-384 better than SHA-512 and SHA-256?" the answer is the same reason why SHA-512/256 is the most preferred option: https://blog.skullsecurity.org/2012/everything-you-need-to-k...

Additionally, the Intel SHA extensions target SHA1 and SHA-256 (but not SHA-512), which makes SHA-256 faster than SHA-512 on newer processors.

Isn't crypto fun?

I'm super confused. Are SHA-256 and SHA256 different, and if so, why in the world would this be considered a sane naming scheme?

If not, I completely do not understand the inequation you wrote, which seemingly lists SHA-256 (and -512) multiple times.

You're probably confused by "SHA-512/256", which does not mean SHA-512 or 256, but rather a truncated version of SHA-512: https://en.wikipedia.org/wiki/SHA-2 in the third paragraph.

So why would a truncated version of SHA-512 be better than SHA-512? And why is SHA-512 = SHA-256?

Truncated hash functions are not vulnerable to length-extension attacks.

Length-extension attacks are relevant when you design a MAC by passing a secret and then a message to a hash function, where only the message is known.

Truncating the hash (which is what SHA-512/256 and SHA-384 do to SHA-512) removes the ability to grab an existing hash H(k || m) (where k is unknown and m might be known) and append junk because a truncated hash does not contain sufficient information to recover the full state of the hash function in order to append new blocks.
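One detail worth flagging, easily checked with Python's hashlib: the truncated variants also use different initial values, so SHA-384 is not literally the first 384 bits of a SHA-512 digest - but it's the truncation itself that provides the length-extension resistance.

```python
import hashlib

msg = b"hello world"
sha512 = hashlib.sha512(msg).hexdigest()
sha384 = hashlib.sha384(msg).hexdigest()

# SHA-384 starts from different IVs than SHA-512, so naively truncating
# a SHA-512 digest to 384 bits does NOT reproduce SHA-384:
assert sha512[:96] != sha384  # 96 hex chars = 384 bits

# What defeats length extension is the truncation: only 384 of the 512
# internal state bits are published, so an attacker is missing 128 bits
# of state and cannot resume the compression function to append blocks.
print("SHA-384 is not a naive truncation of SHA-512")
```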

Why do SHA-512/160 and SHA-512/128 not exist? They could be useful as drop-in replacements for SHA1 and MD5.

Because 224 bits is considered the minimum safe output length for a general purpose hash function. So they'd be drop-in replacements but still wouldn't be safe. Safer than MD5/SHA1, but not actually safe.

So rather than provide a footgun that lets people put off making things actually safe, NIST just didn't do that.

> 224 bits is considered the minimum safe output length for a general purpose hash function.

Considered by whom?

Truncating a hash function to 224 bits puts it at the 112-bit security level, which is roughly equivalent to 2048-bit RSA under today's understanding of the costs of distributed cracking attacks.

There are a lot of standards organizations all over the world with various recommendations. https://www.keylength.com collates quite a few of them. Pick the one most closely relevant for your jurisdiction.

Most of them recommend 2048-bit RSA as their minimum for asymmetric security, and AES-128 / SHA-256 as their minimum for symmetric security. This is a [112, 128]-bit security lower bound.

Truncating a hash to 160 bits yields 80-bit security, which is insufficient. 128 bits (64-bit security) is out of the question.

"Cryptographic hash functions with output size of n bits usually have a collision resistance security level n/2 and preimage resistance level n."

Depending on what you're doing, "SHA-512/128" could have a 128-bit security level. But I guess it's safer to assume n/2 when making a general recommendation.

You can truncate a hash anywhere you like. But 128 bits is considered too short now.

Ah! Makes sense now, thanks.

I've edited my parent comment to consistently use hyphens.

There are six SHA-2 family hash functions:

  * SHA-224
  * SHA-256
  * SHA-384
  * SHA-512
  * SHA-512/224
  * SHA-512/256
Hope that helps. (I know it's still confusing.)

SHA-512/256 is a truncated SHA-512 that's shortened to 256 bits, it's a separate standard.

"SHA-512/256" is a single entity

I would rather call them AMD extensions, because you'll find them in all current AMD Zen processors (Epyc, Ryzen, Threadripper) but not in any mainstream Intel processor other than Goldmont (i.e. cheap laptops).

I read the blog, but it doesn't really expand on why SHA-224/SHA-384 aren't vulnerable to that attack. Can you explain (or link to some place that does)?

A length-extension attack works by taking the output of a hash and using it as the starting point for a new hash; you can do this even for hashes of messages you haven't seen, minting new related hashes. Truncated SHA-2 hashes don't output the whole hash, and so you aren't given enough information to start a new related hash.

Because they're already truncated.

SHA-384 is SHA-512 with a different IV (which doesn't affect LEAs) truncated to 384 bits (which gives you 128 bits of resistance against LEAs).

SHA-224 is the same story but with SHA-256 instead (and only 32 bits of LEA resistance).

Depends on your definition of 'better'. Theoretically SHA-512 is harder to brute force than SHA-256, but 256 bits is already extremely strong, so really there is no practical safety benefit of SHA-512 over SHA-256.

On 64-bit capable processors SHA-512 has a slight performance gain over SHA-256, but only on larger inputs. However, the digest of SHA-512 is twice the size, so what you gain in processing time, you lose in storage.

You can truncate the SHA-512 digest down to 256 bits though. Cryptographic hash functions mix well and don't suffer from the truncation (except to the extent that the truncated digest is shorter, of course). You don't necessarily need more storage (after hashing completes) just because you're using a new hash function.

> but 256 bits is already extremely strong

The strength of hashes like SHA-256 doesn't just come from the number of output bits.

The 256 bits there is relevant for brute force attacks, but not more sophisticated attacks that take into account the internal structure of the hash algorithm, and in some cases "weak" values.

SHA-512 performs more "rounds" of computation than SHA-256.

Although it's impossible to compare two different hashes on rounds alone, in general a large number of rounds of the same type of hash decreases the likelihood of non-brute-force attacks finding a collision.

If you look at the literature for attacks on hashes, they will often say they could do it for a certain number of rounds, and that number increases over time as new methods are discovered.

The number of rounds in the hash design is chosen with this in mind, trying to balance being more than sufficient for future attacks yet not too slow.

Cryptographic engineers do not in fact think you should use SHA-2-512 so that you can maximize the number of times the round function is applied.

I'm not sure what that sentence means, can you rephrase it?

The analysis you have provided for why SHA-512 is superior to SHA-256 is faulty.

The analysis was only supposed to explain why the 256-bit length of the hash output is not the sole determinant of a hash's strength. The parent was saying that, that's why it's quoted in my reply.

Pity it didn't come across that way.

FWIW, the instructions in Intel SHA extensions (which are also supported on modern AMD processors starting from Ryzen onwards) only support SHA-1 and SHA-256.

SHA-256 is on the approved FIPS lists and is faster on 32-bit operating systems, which were surprisingly common in enterprise environments until recently.

People tend to be conservative in making changes for stuff like this, and don't do so until forced.

SHA-512 provides a higher security level (256 bits), but there's not much practical value in a security level higher than 128 bits. I think "better" is too strong a word here.

SHA-512 is faster than SHA-256 in software on 64-bit machines. That's a more important difference than the security level. However, there are two major caveats to consider: 1) Hash function performance is more likely to matter on cheap (non-64-bit) hardware where everything is slow, than on fancy hardware where everything is fast. 2) Some x86 and ARM chips have hardware accelerated implementations of SHA-256, but not of SHA-512.

I believe so also. Specifically, SHA-512/256 seems to be a better choice by almost every metric except 32-bit hashing speed.

SHA-256 is shorter, and there isn't much of a difference between SHA-256 and SHA-512 (they are both SHA-2 algorithms).

Out of curiosity, can anyone explain in layman's terms the differences in design that make SHA-1's successors immune to the known attacks against SHA-1? Ultimately was this the result of an apparent flaw in SHA-1 that only became obvious in retrospect, or was it something totally unforeseeable?

SHA-2 is based on similar techniques to those in SHA-1, which prompted the SHA-3 competition when weaknesses in SHA-1 were first discovered (as they could conceivably have been present in SHA-2 as well). As it turns out, SHA-2 appears to be resistant to the attacks found thus far.

SHA-3 (originally named Keccak) is built on an entirely different foundation (called a sponge function), so it is unlikely that any attack against SHA-1 will be relevant to SHA-3. However, sponge functions are a relatively new idea, and weaknesses in the basic principles could conceivably be found in the future, as could weaknesses in the Keccak algorithm specifically.

Partly this: Number of bits.

This attack is almost 2^64 work, and SHA-1 is 160 bits. All else being equal (a big, big if), scaling proportionally to output size puts the equivalent attack on SHA-256 at about 102 bits, meaning 362703572709.30493 times more expensive. Or about $16,321 trillion USD.
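For anyone checking the arithmetic: that figure comes from scaling the ~2^64 attack cost proportionally by output size (a crude heuristic, not real cryptanalysis), using the ~$45k cost quoted elsewhere in this thread for the current attack.

```python
# Back-of-the-envelope scaling of the ~2^64 SHA-1 attack to SHA-256,
# proportional to output size. This reproduces the numbers above; it is
# a heuristic, not a rigorous security estimate.
attack_bits = 64
scaled_bits = attack_bits * 256 / 160        # ~102.4 "attack bits"
factor = 2 ** (scaled_bits - attack_bits)    # ~3.63e11 times more work
cost = 45_000 * factor                       # ~$1.6e16, i.e. ~$16,321 trillion
print(scaled_bits, factor, cost)
```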

Q: Does this make it even more urgent for git to move to a different hash?

It may, because now an attacker can replace code with arbitrary other valid code as long as developers are willing to ignore the long weird random comment at the end ;-)

I’m gonna say many developers will not care, and many compilers will not care either.

So yeah, Linus’ main deterrent reason (code won’t compile) doesn’t apply anymore.


1. A chosen-prefix attack still needs to compute TWO suffixes m1 and m2 so that h(a1+m1) = h(a2+m2). This does NOT mean that given a1 and a2 you can find a single m2 so that h(a1) = h(a2+m2). So ONLY THE ORIGINAL AUTHOR OF THE COMMIT could spoof their own commit, by preparing in advance and attaching a long and weird comment at the end. And you could build tools to watch out for such commits in the first place.

2. If git had used HMAC based on SHA1 then it would have been fine, even after this attack has become feasible.
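Right: HMAC's security rests on the compression function behaving like a PRF, not on collision resistance, so collisions alone don't break it. A minimal sketch with Python's stdlib (the key and message here are purely illustrative):

```python
import hashlib
import hmac

key = b"shared-secret"        # illustrative key
msg = b"commit contents..."   # illustrative message

tag = hmac.new(key, msg, hashlib.sha1).hexdigest()

# Without the key, an attacker can't even evaluate the function, so a
# chosen-prefix collision on bare SHA-1 doesn't help forge tags.
ok = hmac.compare_digest(tag, hmac.new(key, msg, hashlib.sha1).hexdigest())
print(ok)  # True
```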

3. Furthermore, it is likely still kinda fine because Merkle Trees have nodes referencing previous nodes. You’d have to spoof every historical node as well, to push malicious code. BitTorrent also requires computers to supply an entire merkle branch when serving file chunks.

Maybe someone can elaborate on this.

If you look at this 2017 email from Linus (https://marc.info/?l=git&m=148787047422954), he discusses how git also encodes the length. That would mean you need a collision of the same length and with the right functionality, so you can't just append data.
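For reference, git's object hashing is easy to sketch: the hash covers a "<type> <size>\0" header plus the content, so the length is baked into the prefix. For a blob:

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    # git hashes "<type> <size>\0<content>"; commits and trees use the
    # same header scheme with their own type names.
    header = b"blob " + str(len(content)).encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

# Matches `git hash-object` for a file containing "hello\n":
print(git_blob_sha1(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```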

Now that you can arbitrarily produce collisions, the second step is easy enough for a skilled and well-funded attacker.

A: (from the article)

SHA-1 has been broken for 15 years, so there is no good reason to use this hash function in modern security software. Attacks only get better over time, and the goal of the cryptanalysis effort is to warn users so that they can deprecate algorithms before the attacks get practical. We actually expect our attack to cost just a couple thousand USD in a few years.

That's not a relevant answer. Git's use-case is very specific and it's quite possible that this attack won't be relevant. It needs analysis.

>no good reason to use this hash function in modern security software

This argument conveniently ignores the cost to switching existing software (i.e. it's completely detached from reality).

It adds to the weight of the argument, but there isn't a big issue. This article (https://www.zdnet.com/article/linus-torvalds-on-sha-1-and-gi...) and the linked email (https://marc.info/?l=git&m=148787047422954) both seem to still apply.

Further details as to why Torvalds is not concerned:

From the email...

"I haven't seen the attack yet, but git doesn't actually just hash the data, it does prepend a type/length field to it. That usually tends to make collision attacks much harder, because you either have to make the resulting size the same too, or you have to be able to also edit the size field in the header."


"I haven't seen the attack details, but I bet

(a) the fact that we have a separate size encoding makes it much harder to do on git objects in the first place

(b) we can probably easily add some extra sanity checks to the opaque data we do have, to make it much harder to do the hiding of random data that these attacks pretty much always depend on."

The fact that this attack is chosen prefix does weaken the first argument, though: you may now find a collision even accounting for any prefixed git "header". The rest is still completely valid.

I still feel like they really should've taken this problem more seriously, and earlier. The longer we wait, the more painful the migration will be when the day comes to move to a different hash function, because everybody knows that'll happen sooner or later. Two years ago we had a collision, now we have chosen prefix; how much longer until somebody actually manages to make a git object collision?

And keep in mind that public research is probably several years behind top secret state agency capabilities. Let's stop looking for excuses every time SHA-1 takes a hit and rip the bandaid off already. It's going to be messy and painful but it has to be done.

> you have to be able to also edit the size field in the header.”

As I read the OP [1] a chosen-prefix collision attack such as this allows you to “edit the size field in the header”. Or am I missing something?

1. “A chosen-prefix collision is a more constrained (and much more difficult to obtain) type of collision, where two message prefixes P and P’ are first given as challenge to the adversary, and his goal is then to compute two messages M and M’ such that H(P || M) = H(P’ || M’), where || denotes concatenation.”

EDIT: On second thought I was missing something: the adversary is further constrained in the git case because it must find M and M’ of correct length (specified in P and P’). Linus is right (as usual), this probably makes it much harder.

A few emails forward in the thread Linus explains though why we don’t need to worry much about this attack in practice: https://marc.info/?l=git&m=148787287624049&w=2

This argument sounds sound to me.

His argument assumes the file is text and people read the entire thing. If either of those assumptions are false, then it's not safe.

People store things in git that aren't text. Therefore it's not safe.

"(b)" is kind of amusing.. It's been known since 2011 that collision-generating garbage material can be put after a NUL in a git commit message and hidden from git log, git show, etc. Still not fixed.

With this chosen-prefix attack, they chose two prefixes and generated collisions by appending some data. So your two prefixes just need to be "tree {GOOD,BAD}\nauthor foo\n\nmerge me\0"

The only thing preventing injecting a backdoor into a pull request now seems to be git's use of hardened sha1.
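The NUL trick is easy to demonstrate (illustrative junk bytes here, not a real collision):

```python
# A commit message that hides junk after a NUL byte; tools that treat
# the message as a C string stop at the NUL and never show the junk.
msg = b"merge me\x00" + bytes([0x8f, 0x02, 0x9a, 0xb6, 0x21, 0xb2])

shown = msg.split(b"\x00", 1)[0].decode()
print(shown)  # merge me
```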

I think those are pretty practical approaches.

But it sounds as if the cost of changing the hash algorithm is high. What are the impacts of this change? How many things would break if git just changed the algorithm with each new release? Does git assume that the hash algorithm is statically given to be SHA-1 or are there qualifiers on which algorithm is enabled/permitted/configured?

After making the actual code change, the biggest problem is breaking compatibility with decades of tools in the ecosystem that rely on historically consistent SHA-1 hashes.

Git is moving to a flexible hash though. [1]

[1] https://stackoverflow.com/questions/28159071/why-doesnt-git-...

Maybe it's time for a version 3 that breaks a bit of compatibility.

The Python community would freak out, lol.

The cost is very high but it's only getting higher with time. People have known that SHA-1 was weak and deprecated for much of git's existence. Doing the switch in 2010 would've been painful, doing it now would be orders of magnitude more so and I doubt it'll get any easier in 2030 unless some other SCM manages to overtake git in popularity which seems unlikely at this point.

Unless Linus really believes that git will be fine using SHA-1 for decades to come I don't think it's very responsible to keep kicking the ball down the road waiting for the inevitable day when a viable proof of concept attack on git will be published and people will have to emergency-patch everything.

What if somebody makes an attack where they can choose the size and then find a collision?

Like the two files on the linked page?

The two files on the linked page were both full of junk data. I suspect that those files being of the same length isn't the norm.

Why wouldn't it be the norm? If they're able to make files of the same length, other attackers can make files of the same length too.

I would bet that a fixed size M' is part of the attack.

That first quote is misleading: git's special hashing scheme doesn't make the attack "much harder". First, the two files in the original shattered collision are already the same length:

$ curl -s https://shattered.io/static/shattered-1.pdf | wc -c


$ curl -s https://shattered.io/static/shattered-2.pdf | wc -c


Second, the length is already being hashed into the content during computation of a SHA-1 hash; see the Merkle–Damgård construction: https://en.wikipedia.org/wiki/Merkle%E2%80%93Damg%C3%A5rd_co...

There is a benefit to storing the length in the prefix as well: it prevents length-extension attacks. But that's not making attacks "much harder".

The point is that the hashed data must follow a specific data format, and can’t just be arbitrary data. This means that the collision data MUST contain the length at some specific offset in the data, which makes it harder to find a collision.

The more restrictive the serialization format of the hashed data, the harder it is to find a collision that’s valid in the given application context.

Yeah the data must start with a specific prefix, but can otherwise contain whatever it wants. Anyways, even the shattered attack, which this paper says costs 11k to execute today, had a pdf specific prefix (the shown part is the same in both files):

    $ curl -s https://shattered.io/static/shattered-1.pdf | hexdump -n 512 -C 
    00000000  25 50 44 46 2d 31 2e 33  0a 25 e2 e3 cf d3 0a 0a  |%PDF-1.3.%......|
    00000010  0a 31 20 30 20 6f 62 6a  0a 3c 3c 2f 57 69 64 74  |.1 0 obj.<</Widt|
    00000020  68 20 32 20 30 20 52 2f  48 65 69 67 68 74 20 33  |h 2 0 R/Height 3|
    00000030  20 30 20 52 2f 54 79 70  65 20 34 20 30 20 52 2f  | 0 R/Type 4 0 R/|
    00000040  53 75 62 74 79 70 65 20  35 20 30 20 52 2f 46 69  |Subtype 5 0 R/Fi|
    00000050  6c 74 65 72 20 36 20 30  20 52 2f 43 6f 6c 6f 72  |lter 6 0 R/Color|
    00000060  53 70 61 63 65 20 37 20  30 20 52 2f 4c 65 6e 67  |Space 7 0 R/Leng|
    00000070  74 68 20 38 20 30 20 52  2f 42 69 74 73 50 65 72  |th 8 0 R/BitsPer|
    00000080  43 6f 6d 70 6f 6e 65 6e  74 20 38 3e 3e 0a 73 74  |Component 8>>.st|
    00000090  72 65 61 6d 0a ff d8 ff  fe 00 24 53 48 41 2d 31  |ream......$SHA-1|
    000000a0  20 69 73 20 64 65 61 64  21 21 21 21 21 85 2f ec  | is dead!!!!!./.|
The shattered attack was about a so-called "identical prefix" collision, while the shambles paper's collision was a "chosen prefix" one. You can choose it in both cases, but in the "chosen prefix" one both colliding prefixes can be entirely different (and can be as long as you want btw, the attack doesn't cost more if the prefix is 4 KB vs 4 GB), while in the "identical prefix" case it has to be identical.
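You can see what "chosen prefix" means at toy scale by truncating SHA-1 to 24 bits, where a birthday search (~2^12 tries per side) finds a collision instantly; the real attack achieves the equivalent against the full 160-bit output:

```python
import hashlib
from itertools import count

def trunc(data: bytes) -> bytes:
    # SHA-1 truncated to 24 bits, so collisions are cheap to find.
    return hashlib.sha1(data).digest()[:3]

# Two arbitrarily different prefixes, chosen by the attacker:
p1 = b"tree GOOD\nauthor foo\n\nmerge me\x00"
p2 = b"tree BAD\nauthor foo\n\nmerge me\x00"

seen = {}
for i in count():
    s = str(i).encode()
    seen[trunc(p1 + s)] = s          # table of hashes from the p1 side
    h = trunc(p2 + s)
    if h in seen:                    # the p2 side hits the table
        m1, m2 = seen[h], s
        break

assert trunc(p1 + m1) == trunc(p2 + m2)
print("collision suffixes:", m1, m2)
```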

> the harder it is to find a collision that’s valid in the given application context

At double-digit thousands of dollars, an attack that gets 10x or 100x harder is still cheap for state actors.

Assuming the NSA is at least a year or two ahead of the field, git should now accelerate its migration process.

Yeah, that quote doesn't exactly make me confident about Linus's understanding of this particular issue.

It's not about it being the same length, but the length of the data being part of the hashed data, which, Linus assumes, will likely make it more difficult to find a collision. He even says at the beginning that he hasn't had a look at the attack yet and is just making an assumption.

> It's not about it being the same length, but the length of the data being part of the hashed data

As I tried to point out, the length is already part of what the SHA-1 function hashes:


    As a summary, a "1" followed by m "0"s followed by a 64-
    bit integer are appended to the end of the message to produce a
    padded message of length 512 * n.  The 64-bit integer is the length
    of the original message.  The padded message is then processed by the
    SHA-1 as n 512-bit blocks.
Now, storing the length as a prefix does give you advantages: you can't mount a length-extension attack, which limits your ability to reuse one shattered collision (e.g. the PDFs released by Google) for different files/types of files. But it doesn't make mounting a novel shattered attack "much harder" as Linus claims.
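For the curious, the padding step that RFC excerpt describes is easy to write down (my own sketch of RFC 3174's padding rule, not git code):

```python
def sha1_padding(msg_len: int) -> bytes:
    # A 0x80 byte (the "1" bit), zero bytes up to 56 mod 64, then the
    # original length in *bits* as a big-endian 64-bit integer, so the
    # padded message is a whole number of 512-bit blocks.
    zeros = (55 - msg_len) % 64
    return b"\x80" + b"\x00" * zeros + (msg_len * 8).to_bytes(8, "big")

for n in (0, 55, 56, 64, 12345):
    assert (n + len(sha1_padding(n))) % 64 == 0
```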

> But it doesn't make mounting a novel shattered attack "much harder" as Linus claims.

From what I understood the core of Linus' argument[1] is that it's very hard to make a "bad" variant of the code which has the same length _and_ the same hash while still looking like sane code. For random data files, sure those are more at risk.

[1]: https://marc.info/?l=git&m=148787287624049&w=2

> it's very hard to make a "bad" variant of the code [...] looking like sane code

That is very hard, but not what was quoted above. The length has no part in it. The core part needed for the shattered collision attacks involves basically binary data.

    $ curl -s https://shattered.io/static/shattered-1.pdf | hexdump -C > s1
    $ curl -s https://shattered.io/static/shattered-2.pdf | hexdump -C > s2
    $ diff s1 s2
    < 000000c0  73 46 dc 91 66 b6 7e 11  8f 02 9a b6 21 b2 56 0f  |sF..f.~.....!.V.|
    < 000000d0  f9 ca 67 cc a8 c7 f8 5b  a8 4c 79 03 0c 2b 3d e2  |..g....[.Ly..+=.|
    < 000000e0  18 f8 6d b3 a9 09 01 d5  df 45 c1 4f 26 fe df b3  |..m......E.O&...|
    < 000000f0  dc 38 e9 6a c2 2f e7 bd  72 8f 0e 45 bc e0 46 d2  |.8.j./..r..E..F.|
    < 00000100  3c 57 0f eb 14 13 98 bb  55 2e f5 a0 a8 2b e3 31  |<W......U....+.1|
    < 00000110  fe a4 80 37 b8 b5 d7 1f  0e 33 2e df 93 ac 35 00  |...7.....3....5.|
    < 00000120  eb 4d dc 0d ec c1 a8 64  79 0c 78 2c 76 21 56 60  |.M.....dy.x,v!V`|
    < 00000130  dd 30 97 91 d0 6b d0 af  3f 98 cd a4 bc 46 29 b1  |.0...k..?....F).|
    > 000000c0  7f 46 dc 93 a6 b6 7e 01  3b 02 9a aa 1d b2 56 0b  |.F....~.;.....V.|
    > 000000d0  45 ca 67 d6 88 c7 f8 4b  8c 4c 79 1f e0 2b 3d f6  |E.g....K.Ly..+=.|
    > 000000e0  14 f8 6d b1 69 09 01 c5  6b 45 c1 53 0a fe df b7  |..m.i...kE.S....|
    > 000000f0  60 38 e9 72 72 2f e7 ad  72 8f 0e 49 04 e0 46 c2  |`8.rr/..r..I..F.|
    > 00000100  30 57 0f e9 d4 13 98 ab  e1 2e f5 bc 94 2b e3 35  |0W...........+.5|
    > 00000110  42 a4 80 2d 98 b5 d7 0f  2a 33 2e c3 7f ac 35 14  |B..-....*3....5.|
    > 00000120  e7 4d dc 0f 2c c1 a8 74  cd 0c 78 30 5a 21 56 64  |.M..,..t..x0Z!Vd|
    > 00000130  61 30 97 89 60 6b d0 bf  3f 98 cd a8 04 46 29 a1  |a0..`k..?....F).|
An ASCII-formatted file only has text data. Also, with the shattered attack you can't choose what the two versions should be, so you have to cross-reference the different-looking binary data to turn some functionality on or off. So the attack is mostly interesting when you include binary data. With the chosen-prefix attack, you can have two arbitrary components, even textual ones, but they still have to be followed by such a binary component.

Also now git has collision detection code from sha1collisiondetection [1], making attacks even harder.

[1]: https://github.com/cr-marcstevens/sha1collisiondetection

A good thing no one checks binary assets into source control

But even if the lengths are the same, the resulting SHA-1 will be different, since you prefix the length before hashing.

The shattered prefix was chosen as well, see my other comment in the thread: https://news.ycombinator.com/item?id=21980759

The only thing that prefixing the length makes difficult is using the same prefix multiple times: you basically have to make up your mind about the type and length before mounting the shattered attack. Also, the prefix means you have to do your own shattered attack and can't use the PDFs that google provided as proof of their project's success. Price tag for that seems to be 11k.


What happens if you actually get a SHA-1 collision in git?

-> https://stackoverflow.com/questions/9392365/how-would-git-ha...

(This does not answer your question, but is still interesting.)

There is a migration path to SHA-256, see a good summary here: https://stackoverflow.com/a/47838703/109517

See a previous discussion here, regarding Linus's position on this in 2017: https://news.ycombinator.com/item?id=13719368
