It's not an issue to use SHA-1 for git. Heck, they could have used MD5 and it still would have achieved the same. The use of SHA-1 in Git is not for security purposes, it's for accidental data corruption and uniqueness purposes as you're guaranteed to never accidentally get a collision.
> The use of SHA-1 in Git is not for security purposes
It's for verifying data integrity, which is a cryptographic task (i.e. for security purposes). In Linus' Google tech talk on git he actually does say it's not a security thing, but then gives an example of verifying the data wasn't tampered with by a third party [1].
Whether or not the feature was conceived of as a security feature, it is de facto being relied on as a security feature. There are malicious actors on the internet that try to inject malware into software repositories. The fact that they can't silently change the history makes this task harder. If a non-cryptographic hash function like crc32 was used, it would be child's play to cause shenanigans with collisions.
As an anecdote, I once reverse engineered the checksum used by Warcraft 3 to verify that players had the custom map being played. It was not cryptographically secure. Just xoring values into a rotating accumulator [2]. Not hard to collide. Within a few months, there were versions of maps with built-in vision hacks and collided checksums being passed around. It was enough of a problem that the next game patch added SHA-1 as a checksum. If git had started with a non-cryptographic hash function, it would have been forced to switch to a cryptographic one for similar reasons.
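To show why a xor-into-a-rotating-accumulator checksum is easy to collide, here is a minimal sketch. The rotation amount, the 32-bit width, and the names `toy_checksum` and `forge_collision` are assumptions made up for illustration; this is not the actual Warcraft 3 routine. Because every step is linear over bits, flipping one bit in a byte can be cancelled by flipping the matching (rotated) bit in the next byte.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit accumulator


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit value left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK


def toy_checksum(data: bytes) -> int:
    """Hypothetical xor-and-rotate checksum (illustration only)."""
    acc = 0
    for byte in data:
        acc = rotl32(acc, 3) ^ byte  # rotate, then xor the next byte in
    return acc


def forge_collision(data: bytes, j: int) -> bytes:
    """Return different data with the same toy checksum.

    Bit 0 of byte j gets rotated left by 3 before byte j+1 is mixed in,
    so it lands on bit 3. Flipping bit 3 of byte j+1 cancels it out.
    """
    out = bytearray(data)
    out[j] ^= 0x01
    out[j + 1] ^= 0x08
    return bytes(out)


if __name__ == "__main__":
    original = b"fog_of_war=1"
    forged = forge_collision(original, 0)
    print(original, hex(toy_checksum(original)))
    print(forged, hex(toy_checksum(forged)))
```

No search is needed at all: the collision falls out of the algebra, which is the property a cryptographic hash is supposed to rule out.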
> The use of SHA-1 in Git is not for security purposes, it's for accidental data corruption and uniqueness purposes
In Git itself maybe, but many tools and developers rely on or automatically assume that commit hashes are collision resistant. For example from the GitHub Actions docs:
> Pinning an action to a full length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload.
Git is a tool and a tool should be useful. This tool would be much more useful, and also more intuitive to use, if it actually provided cryptographic hashes of its commits and didn't just pretend to.
You can shift some of the blame to the user if something goes terribly wrong, but at least partly it's also the tool's fault. Git's security is a footgun that is hardly productive.
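For context on what "a valid Git object payload" in the quoted docs refers to: a Git object ID is the SHA-1 of a short type-and-length header followed by the object's bytes. A minimal sketch of that computation (the function names here are my own, not Git's):

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 over '<type> <size>' + NUL + body, as used for SHA-1 object IDs."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def git_blob_id(content: bytes) -> str:
    """Object ID of a file's contents stored as a blob."""
    return git_object_id("blob", content)


if __name__ == "__main__":
    print(git_blob_id(b""))          # ID of the empty blob
    print(git_object_id("tree", b""))  # ID of the empty tree
```

`git_blob_id(b"")` should match what `git hash-object` reports for an empty file (`e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`). An attacker who wants to swap pinned content has to produce colliding bytes that still fit this header-plus-body format.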
Believe me, if Git were fundamentally broken, the issue would've been fixed already. It's in that "uncanny valley" of security issues, where it's bad enough to cause damage, but not bad enough to get people to stop what they're doing and fix it.
Well, as the article says, Git already fixed the issue, SHA-256 is already supported. Now it's up to GitHub, GitLab and the innumerable other platform/tool providers to update their solutions, until then Git can't in good conscience make SHA-256 the default, because that would condemn a repository to a life in eternal isolation...
Per the article, the biggest problem isn't third-party integration but that there is no interop with old repos. The feature is in an experimental state right now (and labeled as such), so you can't and shouldn't expect GitHub/GitLab/anyone to start using it in production.
That is a good plan on first consideration, but on close inspection appears to require that the tool author was omniscient and anticipated every possible use of their tool.
Traditionally a lot of the usefulness from tools comes from people doing things that were not intended. The modern web springs to mind, it was a terrible hack in the grand old IE days.
It is better for tools to have obvious failure modes.
> The use of SHA-1 in Git is not for security purposes, [...]
Only, it is. When you eg sign a commit in git, you only sign the hash. So someone else could pretend you signed a different commit (and commit history), if they can find a collision.
Git isn't relying on collision-resistance, it's relying on second-preimage[0] resistance, which is to say: in order to sneak a hash collision in to a git repository, you have to sneak _something else_ that's already trusted (e.g. via code review) into the repository; collisions can't (yet) be generated for arbitrary hashes.
I haven't heard of any second-preimage attacks against MD5, much less SHA-1, so mlindner was correct in asserting that MD5 would be fine (assuming 128 bits are enough). See also the analysis in [1].
More to the point, if you're able to sneak something into a repository in the first place (e.g. a benign file that generates a collision with a malicious file), then you're probably able to sneak in something more directly (e.g. [2]) that won't rely on both getting something in a trusted repository and then cloning from a different, untrusted source.
> if you're able to sneak something into a repository in the first place (e.g. a benign file that generates a collision with a malicious file), then you're probably able to sneak in something more directly
Could you imagine using an implementation of TLS that "probably" authenticated your network traffic though? I think there are two separate reasons we prefer to make strong guarantees in cryptography:
1. That's often really what I need. If I'm downloading e.g. software updates over the network, I really need those to be authentic.
2. Even when I arguably don't need strong authenticity, like just reading some news articles, I want to use the same strong tools, because I don't want to have to study and understand (much less teach) the situations where some weaker tool fails. Inevitably I'll get that wrong or just forget, and I'll end up using the weak tool in some case where I should've used the strong one.
In this case, if I imagine teaching how commit signing works with a weak hash function, it sounds like "Signing commits means that no one can sneak malicious content into your repository, unless they first steal your secret signing key, or else you ever committed (or allowed anyone else to commit) a non-text file that they created." Actually writing that second part out makes it feel really bad to me.
> "Signing commits means that no one can sneak malicious content into your repository
Signing commits does not mean that even when using a cryptographically secure hash function. All it means is that you put your signature over a particular state of the repo (and, by extension, its parent states). It has nothing to do with preventing "sneaking things in" - although it could be a (small) part of the whole set of measures taken to prevent someone from doing that.
> All it means is that you put your signature over a particular state of the repo (and, by extension, its parent states).
That's technically true. Though in practice I think the implied social contract is that signing of a commit means you signal some kind of approval for the diff between the signed commit and its immediate predecessor(s).
I'm not 100% sure I understand your point, but it sounds like you're concerned about signing something using a weak hash function (i.e. where the hash of something is what actually gets signed)?
If that's the case, then my point is pretty simple: yes, SHA-1 is broken for signing untrusted input (due to weak collision resistance), but it is not broken (so far) for signing trusted input (due to strong preimage resistance).
My point earlier was primarily that the contents of a repository are generally trusted (via mechanisms like code review), and signing trusted content still works even with SHA-1.
Note that certificate signing vulnerabilities (which I assume is why TLS was mentioned?) usually rely on a malicious actor presenting one certificate and then presenting a different cert later; they can't arbitrarily fake existing certs from somebody else.
The analogous scenario for git repositories would be to have a malicious actor make a commit (or blob, tree, etc.) that could be swapped out for another. But if you already have malicious actors able to make commits in your repository, then the hash function doesn't matter: they can cause damage in many, many other ways.
> The analogous scenario for git repositories would be to have a malicious actor make a commit (or blob, tree, etc.) that could be swapped out for another. But if you already have malicious actors able to make commits in your repository, then the hash function doesn't matter: they can cause damage in many, many other ways.
The malicious actor can pose as a good-faith contributor and submit Pull Requests to your repository.
You review the code in the PR, and perhaps even prove it correct. Later on, the malicious actor can do the swapping trick. (Eg by running a mirroring service for your repository.)
> You review the code in the PR, and perhaps even prove it correct. Later on, the malicious actor can do the swapping trick. (Eg by running a mirroring service for your repository.)
Having a copy of code that is reviewable and then searching for a malicious collision is a preimage attack; extending two chosen prefixes (e.g. one "valid" and one "malicious") until they meet at a hash collision is how most practical (?) collision attacks work. The latter scenario produces large junk sections in the results, which should be obvious under even mild scrutiny.
If the reviewer misses the kilobytes of garbage in the middle of a file they're reviewing, then an attacker can just sneak malicious code in directly without requiring a hash collision.
If the project relies on an effectively unreviewable binary file that could hold kilobytes of junk (like some YAML files I've seen...), then that's already breaking the review process without requiring a hash collision.
Ignoring all of that, anybody grabbing code from an untrusted source is already vulnerable to whatever attacks that untrusted source wants to employ, with "exploiting hash collision" being one of the higher-effort attacks that could be mounted.
Essentially, any repository that would be vulnerable to any of the known hash collision attacks (via bad review, untrusted upstream, etc.) would be vulnerable to more mundane, easier attacks against the same weaknesses that do not depend on hash collisions.
> Having a copy of code that is reviewable and then searching for a malicious collision is a preimage attack;
No, it's not. You can sneak extra entropy into minor formatting choices or variable names etc, or exactly what you write in your commit messages. Or probably even ordering of files in your directories. (I don't think the git protocol enforces that files have to be in eg alphabetical order.)
> Ignoring all of that, anybody grabbing code from an untrusted source is already vulnerable to whatever attacks that untrusted source wants to employ, with "exploiting hash collision" being one of the higher-effort attacks that could be mounted.
I'm not sure. If your hash works fine, as long as someone trusted gives you the commit hash, anyone untrusted can give you the actual source.
And if you mean accepting PRs: accepting PRs from the untrusted internet is basically how open source works.
First - if git really didn't care about collision resistance, there wouldn't have been a need to switch to SHA1DC as the hash function. They switched because they care enough that they were willing to accept the performance penalty.
Second - imagine this scenario: a user creates two commits with the same hash, one with a valid change and the second with a malicious one. The collision could be created by playing around with some data in a binary file - so, this is a collision attack not 2nd pre-image. The user then submits the change to the upstream and gets it approved. The user maintains a mirror of the upstream repo into which they place the malicious commit. Anyone that pulls from this mirror will think they have the same code as the upstream, even if they compare hashes.
So don't use an untrusted mirror? I guess - but that is something that should be possible with a strong hash. And if git really didn't want you to do that, it would provide for better ways of tracking where objects were actually pulled from.
Anyway, collision attacks are real and can impact git. They just aren't as bad as a 2nd pre-image attack.
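To make the mirror scenario concrete, here is a toy sketch. It assumes a deliberately broken hash (SHA-1 truncated to 16 bits, my own stand-in so a collision can be found in milliseconds) and a heavily simplified tree and commit layout with no author, committer, or parent lines. All names and the file name `asset.bin` are hypothetical. The point is only that the commit ID depends on the blob through its ID, so two colliding blobs yield the same commit ID.

```python
import hashlib
from itertools import count

WEAK_BYTES = 2  # assumption: keep 16 bits of SHA-1 so collisions are cheap


def weak_id(obj_type: str, body: bytes) -> bytes:
    """Git-style object ID, truncated to simulate a broken hash."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).digest()[:WEAK_BYTES]


def find_colliding_blobs(benign_prefix: bytes, evil_prefix: bytes):
    """Append junk to two chosen prefixes until their blob IDs collide."""
    seen = {}  # blob ID -> benign blob that produced it
    for i in count():
        junk = str(i).encode()
        benign = benign_prefix + junk
        seen[weak_id("blob", benign)] = benign
        evil = evil_prefix + junk
        hit = seen.get(weak_id("blob", evil))
        if hit is not None:
            return hit, evil


def commit_id(blob: bytes, message: str) -> bytes:
    """Simplified chain: blob ID -> one-entry tree -> commit naming the tree."""
    tree_body = b"100644 asset.bin\x00" + weak_id("blob", blob)
    tree_hex = weak_id("tree", tree_body).hex().encode()
    commit_body = b"tree " + tree_hex + b"\n\n" + message.encode()
    return weak_id("commit", commit_body)


if __name__ == "__main__":
    benign, evil = find_colliding_blobs(b"reviewed asset ", b"malicious asset ")
    print(benign, evil)
    # Same commit ID, different content behind it.
    print(commit_id(benign, "Add asset") == commit_id(evil, "Add asset"))
```

Anything that signs or compares only the commit ID cannot tell the two versions apart, which is why the attack needs the attacker to author both blobs in advance.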
> First - if git really didn't care about collision resistance, there wouldn't have been a need to switch to SHA1DC as the hash function. They switched because they care enough that they were willing to accept the performance penalty.
Git didn't _need_ to switch to SHA1DC, but they did because the cost was minimal and it's still a good idea to defend against known attacks.
> Second - imagine this scenario: a user creates two commits with the same hash, one with a valid change and the second with a malicious one. The collision could be created by playing around with some data in a binary file - so, this is a collision attack not 2nd pre-image. The user then submits the change to the upstream and gets it approved.
This is a general problem with binary files: they're hard to properly review. Having unreviewable files in a repository (binaries, machine-generated configs, etc.) is already a security problem; hash collisions would just be one (very difficult) way of exploiting that problem.
> The user maintains a mirror of the upstream repo into which they place the malicious commit. Anyone that pulls from this mirror will think they have the same code as the upstream, even if they compare hashes.
Having people pull data from an attacker-controlled source is a security issue, regardless of hash collisions.
> So don't use an untrusted mirror? I guess - but that is something that should be possible with a strong hash. And if git really didn't want you to do that, it would provide for better ways of tracking where objects were actually pulled from.
Git was designed for collaboration between trusted parties; collaboration between untrusted parties (e.g. pulling changes from untrusted sources) is a much harder problem that git doesn't pretend to solve.
> Anyway, collision attacks are real and can impact git. They just aren't as bad as a 2nd pre-image attack.
Collision attacks are real, but they have yet to impact git (beyond adopting SHA1DC, I guess), despite how big of a target popular git repositories are.
> Git didn't _need_ to switch to SHA1DC, but they did because the cost was minimal and it's still a good idea to defend against known attacks.
I'm confused about how a SHA1 collision being found is an "attack" if git truly doesn't care about collision resistance.
> This is a general problem with binary files: they're hard to properly review. Having unreviewable files in a repository (binaries, machine-generated configs, etc.) is already a security problem; hash collisions would just be one (very difficult) way of exploiting that problem.
I don't think you can ignore the use case - people do check binaries into git with the expectation that git will keep track of them.
> Git was designed for collaboration between trusted parties; collaboration between untrusted parties (e.g. pulling changes from untrusted sources) is a much harder problem that git doesn't pretend to solve.
Maybe that is how git was designed. But it's not how git is used. People do pull from repos that they don't fully trust. Maybe just to examine a change before throwing it away. What people don't expect is that, by pulling from such a source, an unexpected file could get into their repository due to a collision attack. That is why git switched to SHA1DC - if git truly didn't support that use case, they wouldn't have needed to.
> Collision attacks are real, but they have yet to impact git (beyond adopting SHA1DC, I guess), despite how big of a target popular git repositories are.
I agree that collision attacks are real but aren't a practical issue yet. What I was responding to was your comment:
> I haven't heard of any second-preimage attacks against MD5, much less SHA-1, so mlindner was correct in asserting that MD5 would be fine (assuming 128 bits are enough). See also the analysis in [1].
In that comment, it seems that you were saying that collision attacks weren't a problem at all. But it seems like you are saying in your more recent comment that "collision attacks are real"?
> This is a general problem with binary files: they're hard to properly review. Having unreviewable files in a repository (binaries, machine-generated configs, etc.) is already a security problem; hash collisions would just be one (very difficult) way of exploiting that problem.
That's not a problem in general. Eg having a binary bmp in your repository is fine as far as reviews go.
> Git isn't relying on collision-resistance, it's relying on second-preimage[0] resistance, which is to say: in order to sneak a hash collision in to a git repository, you have to sneak _something else_ that's already trusted (e.g. via code review) into the repository; collisions can't (yet) be generated for arbitrary hashes.
Yes, I know. I was arguing the more general point that 'The use of SHA-1 in Git is not for security purposes,'.
Of course, for anything crypto related we go by the maxim 'guilty, until proven innocent'. MD5 might not have a published second-preimage attack yet, but it's broken enough that you shouldn't rely on it for anything anymore: it's not an acceptable crypto-hash, and if you don't need a crypto-hash, you can use something simpler like a CRC instead.
Finding a collision is very hard, not something you will do in minutes, it requires a tremendous amount of resources. For any practical use (like git) that doesn't require an extreme level of security SHA-1 is still fine, and it will be for a lot of years to come.
I'm not sure why git would require less security than almost any other application?
Control over what software runs is really important. If an attacker can get you to run different source code, especially if it looks like it's still signed by the people you trust to produce or review sources, that would be a big deal.
MD5 is not vulnerable to a second-preimage attack, so signing a repo that doesn’t already have attacker-controlled data specially crafted ahead of time is perfectly safe.
A collision attack is not “hash is useless you can make up anything”, but a specific condition that breaks only some uses, not all.
You can generate a pair of files that hash to the same value, but you can’t control what that value is. You can’t make a new file that hashes to an existing hash.
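The gap between those two cases can be sketched with a toy hash. This assumes SHA-1 truncated to 16 bits (my own choice so brute force finishes quickly), and the function names are made up for illustration. With an n-bit hash, a generic collision search costs on the order of 2^(n/2) tries, while matching a fixed existing hash costs on the order of 2^n.

```python
import hashlib
from itertools import count

BITS = 16  # assumption: toy digest size; real SHA-1 has 160 bits


def toy_hash(data: bytes) -> int:
    """First BITS bits of SHA-1."""
    full = int.from_bytes(hashlib.sha1(data).digest(), "big")
    return full >> (160 - BITS)


def find_collision():
    """Attacker chooses BOTH inputs; the shared hash value is not chosen."""
    seen = {}  # hash value -> first message that produced it
    for attempts in count(1):
        msg = b"free-" + str(attempts).encode()
        h = toy_hash(msg)
        if h in seen:
            return seen[h], msg, attempts
        seen[h] = msg


def find_second_preimage(target: bytes):
    """Attacker must hit the hash of one FIXED, pre-existing input."""
    goal = toy_hash(target)
    for attempts in count(1):
        msg = b"forged-" + str(attempts).encode()
        if msg != target and toy_hash(msg) == goal:
            return msg, attempts


if __name__ == "__main__":
    a, b, n_collision = find_collision()
    forged, n_preimage = find_second_preimage(b"already-reviewed file")
    print("collision attempts:", n_collision)
    print("second-preimage attempts:", n_preimage)
```

In expectation the first search needs a few hundred tries and the second tens of thousands at 16 bits; at 160 bits the same generic gap is 2^80 versus 2^160, and the published SHA-1 attacks only shorten the collision side.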
They could do that, but it probably shouldn't be on as default and it doesn't actually achieve anything other than protecting from accidental addition of a colliding hash. It wouldn't offer any additional security benefit.