When a git commit hash is calculated, one of the inputs is the set of pointers to the parent commits. That gives commits two properties:
1) A child commit's hash cannot be predicted without knowing the parent commits
2) Once a commit is created, its parent pointers are practically immutable. The only way to change them is to create a new commit with a different hash - which cannot carry over the old commit's children.
Those properties mean it's impossible to construct a commit whose parent pointer points to one of its own descendants. Therefore it's safe to assume that a git history is always a DAG.
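For concreteness, here's a minimal sketch (in Python, with placeholder object IDs and identities of my own choosing) of how git derives a commit hash; the parent pointers sit right in the hashed content, which is what gives property (1):

    import hashlib

    # Placeholder tree/parent IDs and author info; only the hashing scheme matters here.
    body = (
        b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
        b"parent aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\n"
        b"author A U Thor <a@example.com> 1700000000 +0000\n"
        b"committer A U Thor <a@example.com> 1700000000 +0000\n"
        b"\n"
        b"demo commit\n"
    )
    # git hashes "<type> <size>\0<content>"; change a parent line and the hash changes.
    obj = b"commit %d\x00" % len(body) + body
    print(hashlib.sha1(obj).hexdigest())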
I think with deliberate SHA1 collisions now possible, someone could actually violate property (2) and introduce a cycle.
As the DAG assumption is pretty fundamental, I'd imagine such a cycle could break a lot of tools. Has this ever been tested?
So symlink-shenanigans shouldn't be possible using just the git objects.
My understanding is that it runs SHA-1 and examines the internal state of the digest along the way to see if it matches up with known vectors that could be manipulated to cause a collision.
Filesystems like Venti could conceivably, on the one hand, check for collisions, but on the other, compare the stored block with the newly committed block. If there's a collision, but the blocks differ, just perturb the new block with a NOP, a nonce header of some kind...
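As a rough illustration of that compare-on-write idea, a toy sketch (not how Venti actually works; all names here are made up):

    import hashlib

    class BlockStore:
        def __init__(self):
            self._blocks = {}

        def put(self, block: bytes) -> str:
            key = hashlib.sha1(block).hexdigest()
            existing = self._blocks.get(key)
            if existing is not None and existing != block:
                # Same digest, different contents: a genuine collision. The comment
                # above would perturb the new block (NOP / nonce header) instead.
                raise ValueError("hash collision on " + key)
            self._blocks[key] = block
            return key

    store = BlockStore()
    print(store.put(b"hello world"))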
Considering that the Bitcoin network burns through ~2^90 hashes in a year ... I'm not worried : )
> GitHub.com now performs this detection for each SHA-1 it computes, and aborts the operation if there is evidence that the object is half of a colliding pair.
Isn't it possible for a valid non-colliding object or commit to contain that pattern as well? It sounds like eventually, though possibly in the far distant future, someone will be unable to push a commit to Github because it matches the pattern but doesn't contain colliding objects.
Does anyone know what the pattern is they're looking for? I'm curious now.
There are metrics that will alert GitHub's infrastructure team if a collision is found (to confirm that we aren't seeing any false positives). Those metrics were quietly shipped (without the matching "die") for a week before flipping the final switch.
If you want to know more about the patterns, see the sha1collisiondetection project: https://github.com/cr-marcstevens/sha1collisiondetection
There's a research paper linked in the README.
A large project (like Firefox) might make a few hundred commits per day, or tens of thousands per year. So that comes out to 2^20-2^30 hashes per year. I don't know how many distinct repositories GitHub has, but I doubt it's larger than 2^30. So that means that GitHub has no more than 2^60 SHA-1 hashes related to git, and probably more like 2^40-2^45.
So the probability of a collision is <<2^-30 in the optimistic case (the collision function is logistic, so assuming linearity between 1 and 2^90 is a wildly overoptimistic assumption), and probably something more like <<2^-50. To put that in perspective: it's more likely that you will win the lottery tomorrow but never find out because you were killed by lightning than it is that GitHub will detect a chance SHA-1 collision.
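Spelling that estimate out as a back-of-the-envelope birthday-bound calculation (the 2^45 and 2^60 figures are just the bounds guessed above, not measurements):

    from math import log2

    for n_bits in (45, 60):
        n = 2 ** n_bits                 # total SHA-1 hashes GitHub might hold
        # For n far below 2^80, P(any accidental collision) ~= n^2 / 2^161
        p = n * n / 2 ** 161
        print("n = 2^%d: P(collision) ~ 2^%.0f" % (n_bits, log2(p)))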
Will someone please write a dystopian novel around this premise? It almost writes itself. GitHub turns evil, forcing the world population into subservience from birth; using Git is all anyone is ever allowed to do, and is how people conceptualize the universe and access entertainment; various cults form with the belief that randomness can be influenced via the right ceremony...
You can source inspiration from https://twitter.com/DystopianYA
Someone here had mentioned that they deliberately answer reCAPTCHAs incorrectly to spite Google for extracting free labor from them. My mind took that to the extreme of everyone working for Google... :)
...each believing in a Supreme branching strategy
So, to the question "Isn't it possible for a valid non-colliding object or commit to contain that pattern as well", the answer is "No, not really."
Just some food for thought.
I'm also not aware of any significant recent attempt at mitigating SHA-1 collisions that would've suddenly made the attack impractical. And in general these collisions are pretty difficult to exploit since you can't control the collision exactly.
This protection GitHub is implementing uses a heuristic to try to find collision attempts by looking at just a single file. So it prevents GitHub from hosting any collision attempts.
A direct comparison between what two things? In the attack they are worried about, they only see one half of the attack.
> hashing using a more secure algorithm
They don't control how Git works.
There may be new disturbance vectors discovered as the attack matures, but they can be added to the detector.
What happens when machines are writing commits and there are far more than 5 million and 1 per second? Is this a "640K ought to be enough for anybody" quote of our time?
... or the machines will be able to generate real, useful code that fast, in which case it will be the machines' problem to solve.
A single Titan X does ~2e9 hashes per second. Suppose you had 10,000 GPUs (expensive but not impossible); then you can do 2e13 hashes per second.
In 100 years there are ~3e9 seconds, so ~6e22 hashes. Still orders of magnitude below 2^90.
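Restating that arithmetic in code (same ballpark assumptions as above: ~2e9 hashes per second per card, 10,000 cards, 100 years):

    per_gpu = 2e9                      # SHA-1 hashes per second per GPU (ballpark)
    gpus = 10_000
    seconds = 100 * 365 * 24 * 3600    # ~3.15e9 seconds in a century
    total = per_gpu * gpus * seconds   # ~6e22 hashes
    print("hashes in a century: %.1e" % total)
    print("2^90              : %.1e" % 2 ** 90)   # ~1.2e27, about 20,000x more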
SHA-1 still isn't crackable by brute force in any meaningful time.
Then a collision will happen far sooner than the Sun engulfing the Earth.
Maybe in the next million years, instead of 7 billion.
The attack works regardless of whether accidental collisions ever occur, so there aren't centuries to fix the problem.
Someone: "Huh, they say it would take billions of years for an accidental collision at checkin rate X"
Someone else: "What if the collision rate were much higher? Famous last words, 640k, herf glorp"
Me: "Even at a million times higher checkin rate you are unlikely to see collisions anytime soon"
You: "It's about ethics in gaming journalism!"
My conclusion doesn't follow from your premise because your premise is simply wrong. Enough.
1) That quote wasn't actually said by Bill Gates
2) Very few people will quote that 15-20 years from now.
But that didn't stop someone from quoting it today, and that's about the third time this week that I've come across that quote. The fact that it's being quoted today - when there's really no good reason to - makes me suspect it will still be quoted 15-20 years from now. Hopefully less frequently.
0. Upgrade to SHA3/Keccak, eventually make SHA1 a read-only legacy feature (see the sketch after this list).
1. Upgrade signing to sign blobs and sign signatures for non-blobs instead of signing refs.
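A minimal sketch of what step 0 above could look like at the object level, using only hashlib (note: git's own hash-transition work later settled on SHA-256 rather than SHA3, but the object encoding is independent of the hash choice; the commit body below is a stripped-down demo, not a valid commit):

    import hashlib

    # The empty tree's well-known SHA-1; message and structure are just a demo.
    body = b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n\ndemo\n"
    obj = b"commit %d\x00" % len(body) + body

    print("sha1    :", hashlib.sha1(obj).hexdigest())
    print("sha3_256:", hashlib.sha3_256(obj).hexdigest())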
Cyphar has the right answer for why this would be problematic, though - GPG would just prove that the initial commit was from a trusted committer. It would be trivial to use the same signature with the fake commit. That's not an unsolvable problem, but switching to a different hashing algorithm seems to me like the more reasonable solution.