
SHA-1 Collision Detection on GitHub.com - samlambert
https://github.com/blog/2338-sha-1-collision-detection-on-github-com
======
xg15
One thing I'm wondering about that I haven't seen discussed yet is the
possibility of making circular git histories:

When a git commit hash is calculated, the hashed data includes the pointers
to the parent commits. That gives commits two properties:

1) A child commit's hash cannot be predicted without knowing the parent
commits

2) Once a commit is created, its parent pointers are practically immutable.
The only way to change them is to create a new commit with a different hash -
which cannot carry over the old commit's children.

Those properties meant it was impossible to construct a commit with a parent
pointer that points to one of its own descendants. Therefore it was safe to
assume that a git history is always a DAG.

I think with deliberate SHA1 collisions now possible, someone could actually
violate property (2) and introduce a cycle.

As the DAG assumption is pretty fundamental, I'd imagine such a cycle could
break a lot of tools. Has this ever been tested?
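
For what it's worth, here's a minimal sketch of how the parent pointers enter
the hash (the names are mine, but the header and field layout follow git's
actual commit-object format):

    import hashlib

    def commit_id(tree, parents, author, committer, message):
        # Build the commit object body the way git does. The parent IDs
        # are part of the hashed bytes, so a child's hash can't be chosen
        # before its parents exist (property 1), and changing a parent
        # pointer changes the hash (property 2).
        body = "tree %s\n" % tree
        for p in parents:
            body += "parent %s\n" % p
        body += "author %s\ncommitter %s\n\n%s" % (author, committer, message)
        data = body.encode()
        # git prefixes every object with "<type> <size>\0" before hashing
        return hashlib.sha1(b"commit %d\x00" % len(data) + data).hexdigest()

Splicing in a cycle would require a second commit object that hashes to the
same ID as an existing one but lists different parents - exactly what a SHA-1
collision would permit.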

~~~
rmc
A circular directory would be much worse. As soon as you check it out, your
disk fills up.

~~~
maweki
But git doesn't store directory information; it's implicit in the file
information.

So symlink-shenanigans shouldn't be possible using just the git objects.

~~~
rbehrends
Git doesn't store directory information per se, but the directory structure is
still reflected in tree objects.

------
koolba
The article links to this repo, which actually does the work of detecting
possible collisions:
[https://github.com/cr-marcstevens/sha1collisiondetection](https://github.com/cr-marcstevens/sha1collisiondetection)

My understanding is that it runs SHA-1 and examines the internal state of the
digest along the way, to see if it matches up with known disturbance vectors
that could be used to cause a collision.
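
Very roughly, and with made-up names (the real library recomputes only a
handful of SHA-1 steps per disturbance vector rather than a full second
compression, but the principle is the same):

    import struct

    def rotl(x, n):
        return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

    def sha1_compress(state, block):
        # One standard SHA-1 compression over a 64-byte block.
        w = list(struct.unpack(">16I", block))
        for i in range(16, 80):
            w.append(rotl(w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16], 1))
        a, b, c, d, e = state
        for i in range(80):
            if i < 20:
                f, k = (b & c) | (~b & d), 0x5A827999
            elif i < 40:
                f, k = b ^ c ^ d, 0x6ED9EBA1
            elif i < 60:
                f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
            else:
                f, k = b ^ c ^ d, 0xCA62C1D6
            a, b, c, d, e = ((rotl(a, 5) + f + e + k + w[i]) & 0xFFFFFFFF,
                             a, rotl(b, 30), c, d)
        return tuple((s + t) & 0xFFFFFFFF
                     for s, t in zip(state, (a, b, c, d, e)))

    def looks_crafted(state, block, message_diffs):
        # If XORing a known disturbance-vector message difference into
        # the block still reaches the same output chaining value, the
        # block is almost certainly half of an engineered collision.
        out = sha1_compress(state, block)
        words = struct.unpack(">16I", block)
        for diff in message_diffs:  # each diff: 16 x 32-bit word XORs
            other = struct.pack(">16I", *(w ^ d for w, d in zip(words, diff)))
            if sha1_compress(state, other) == out:
                return True
        return False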

~~~
unholiness
Interesting that it's hosted on GitHub. Everyone in this thread is wondering
about the 2^-90 chance that a random commit will fail, but I wonder how these
guys prevented their repo with test files from being rejected.

~~~
koolba
The files don't trigger a collision because git adds a header prior to
hashing: the object ID is the SHA-1 of the header plus the content, not of
the raw file bytes.
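
Concretely (a minimal sketch; the function name is mine, but the header
format is git's real one):

    import hashlib

    def git_blob_id(content: bytes) -> str:
        # git hashes "blob <size>\0" + content, not the raw file bytes,
        # so the SHAttered PDFs get distinct blob IDs even though their
        # raw SHA-1 digests collide.
        data = b"blob %d\x00" % len(content) + content
        return hashlib.sha1(data).hexdigest()

The prepended header shifts the crafted blocks' alignment and changes the
chaining value they start from, so the engineered collision no longer lines
up.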

~~~
theoh
On that note, if this kind of system hits a false positive (so a commit
fails, but shouldn't), there's always the option of adding a salt or a NOP of
some kind to the file being committed so as to alter its hash.

Filesystems like Venti could conceivably, on the one hand, check for
collisions, but on the other, compare the stored block with the newly
committed block. If there's a collision, but the blocks differ, just perturb
the new block with a NOP, a nonce header of some kind...
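
A toy sketch of that idea (everything here is hypothetical - a real store
would also have to record the nonce so the block stays retrievable by its
final key):

    import hashlib

    class PerturbingStore:
        def __init__(self):
            self.blocks = {}

        def put(self, block: bytes) -> str:
            key = hashlib.sha1(block).hexdigest()
            nonce = 0
            # On a collision where the bytes actually differ, prepend a
            # nonce header and rehash instead of rejecting the write.
            while key in self.blocks and self.blocks[key] != block:
                nonce += 1
                salted = b"nonce:%d\x00" % nonce + block
                key = hashlib.sha1(salted).hexdigest()
            self.blocks[key] = block
            return key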

~~~
collinmanderson
You could also just zip the files and unzip when testing

------
click170
> The recent attack uses special techniques to exploit weaknesses in the SHA-1
> algorithm that find a collision in much less time. These techniques leave a
> pattern in the bytes which can be detected when computing the SHA-1 of
> either half of a colliding pair.

> GitHub.com now performs this detection for each SHA-1 it computes, and
> aborts the operation if there is evidence that the object is half of a
> colliding pair.

Isn't it possible for a valid non-colliding object or commit to contain that
pattern as well? It sounds like eventually, though possibly in the far distant
future, someone will be unable to push a commit to Github because it matches
the pattern but doesn't contain colliding objects.

Does anyone know what the pattern is they're looking for? I'm curious now.

~~~
rspeer
They say that the chance of a false positive is less than 2^-90.

~~~
orblivion
Per commit, or over the expected lifetime of Git's utilization of SHA-1? If
it's the former, it's not very useful unless we have an idea of how many
commits are made.

~~~
jcranmer
You should expect to see a hash collision with probability >½ with about
sqrt(2^n) hashed objects for a hash of length n. I believe hashes are made per
changed file per commit.

A large project (like Firefox) might make a few hundred commits per day, or
tens of thousands per year. So that comes out to 2^20-2^30 hashes per year. I
don't know how many distinct repositories GitHub has, but I doubt it's larger
than 2^30. So that means that GitHub has no more than 2^60 SHA-1 hashes
related to git, and probably more like 2^40-2^45.

So the probability of a collision is <<2^-30 (the collision function is
logistic, so assuming linearity between 1 and 2^90 is a wildly overoptimistic
assumption) in the optimistic case, and probably something more like <<2^-50.
For perspective: it's more likely that you will win the lottery tomorrow but
never find out because you were struck by lightning than that GitHub will
detect a chance SHA-1 collision.
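
For anyone who wants to check the arithmetic (taking the parent's guess of
n = 2^45 hashes at face value):

    # Birthday bound: p is roughly n^2 / 2^(k+1) for n random k-bit values.
    n = 2 ** 45
    print(n * n / 2 ** 161)   # ~4.2e-22: chance SHA-1 collision on GitHub

    # Detector false positives scale linearly (union bound, 2^-90 each):
    print(n / 2 ** 90)        # ~2.8e-14: a benign object tripping the check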

~~~
orblivion
Wait, are we talking about chance SHA-1 collisions, or a false positive for
this test for this vulnerability?

------
yuhong
I wonder how many chosen prefix (not identical prefix) attacks this detects.
There are many ways to do it, ranging from 2^77 to 2^80 time. Of course, this
is still much more expensive than identical prefix, and git would probably
not be the best target for such attacks.

------
IncRnd
The SHAttered attack was made public shortly after browsers deprecated SHA-1
or started flagging SHA-1 in certs as insecure. This specific practical attack
may have existed for a while before it was made public. In fact it may have
been made public only when it no longer provided practical benefits.

Just some food for thought.

~~~
taejo
The team that found it is at least partly academic; they published a paper on
a free-start collision at CRYPTO 2015, and made a prediction there about how
long it would take them and how much it would cost to find a full collision,
which was more or less accurate. They are hash function researchers, not spies
or hackers. There's no reason to believe they were hiding anything.

------
everlost
Naive question - why can't GitHub use a SHA2 function for commit hashes?

~~~
detaro
Because Git only supports SHA-1 as of now, and the files in Git, distributed
to users, are what they worry about here. If it were some internal system,
they could switch to a different hash more easily.

------
tkremer
Why not just store a second hash with a different algorithm, and check both
hashes match?
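
A sketch of what that could look like (hypothetical helpers; git's object
format has no field for a second digest today):

    import hashlib

    def object_ids(data: bytes):
        # Record both digests at write time and require both to match on
        # read; forging an object would then mean colliding SHA-1 and
        # SHA-256 simultaneously.
        return (hashlib.sha1(data).hexdigest(),
                hashlib.sha256(data).hexdigest())

    def verify(data: bytes, ids) -> bool:
        return object_ids(data) == ids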

~~~
Achshar
I don't think github is in a place to change how git itself works. That's
Linus and the team's job.

------
gruez
Why use that specific method to detect sha1 collisions rather than doing a
direct comparison or hashing using a more secure algorithm?

~~~
chrisseaton
> rather than doing a direct comparison

A direct comparison between what two things? In the attack they are worried
about, they only see one half of the attack.

> hashing using a more secure algorithm

They don't control how Git works.

------
TorKlingberg
If I understand right, they detect files that contain the specific collision
that Google released. It will work fine until somebody else does the same
work Google did to generate another collision pair.

~~~
peff
No, it's better than that. It detects evidence of the cryptanalytic tricks
that were used to generate that collision. So it will find any collision
generated using their attack technique (which is ~2^63 operations, as opposed
to 2^80 for brute force).

There may be new disturbance vectors discovered as the attack matures, but
they can be added to the detector.

------
homakov
Consider detecting this too
[https://gist.github.com/homakov/c11f1e3b8b35300ffa07abf95a64...](https://gist.github.com/homakov/c11f1e3b8b35300ffa07abf95a64fb69)

~~~
uwu
what is it?

~~~
fny
Looks like a bug with Cyrillic character highlighting. I'm guessing this is
probably an issue with Rouge/Pygments[0][1], since I doubt GitHub is rolling
their own syntax highlighter.

[0]: [https://github.com/jneen/rouge](https://github.com/jneen/rouge)

[1]: [http://rouge.jneen.net/pastes/6nDQ](http://rouge.jneen.net/pastes/6nDQ)

------
a13n
> Two objects colliding accidentally is exceedingly unlikely. If you had five
> million programmers each generating one commit per second, your chances of
> generating a single accidental collision before the Sun turns into a red
> giant and engulfs the Earth is about 50%.

What happens when machines are writing commits and there are far more than 5
million and 1 per second? Is this a "640K ought to be enough for anybody"
quote of our time?

~~~
graton
> Is this a "640K ought to be enough for anybody" quote of our time?

No.

1) That quote wasn't actually said by Bill Gates

2) Very few people will quote that 15-20 years from now.

~~~
jacquesm
> Very few people will quote that 15-20 years from now.

But that didn't stop someone from quoting it today, and that's the third time
or so this week that I've come across that quote. The fact that it is quoted
today - for which there is really no good reason - makes me suspect it will
still be quoted 15-20 years from now. Hopefully less frequently.

~~~
pvg
One of the reasons it lives on is that it does represent what very quickly
turned out to be a design mistake in the PC architecture: sticking 'reserved'
memory at the top of a 20-bit address space. If it hadn't been for that,
nobody would ever have had to opine on whether 640k is or isn't enough;
they'd have just added more memory and moved on.

------
BrailleHunting
Forcing committers to GPG-sign their commits and accepting only verified
commits from certain key IDs would be a stronger, end-to-end chain-of-custody
process. Right now, the GPG verification badges on tags and commits are nice,
but they don't have "teeth." This is a nice step, but GH needs to keep going
and add stronger, mandatory defense-in-depth (optional security doesn't get
used).

~~~
homakov
It's not them; it's GPG that's too ugly to be used even by programmers. If
they enforce it, too many people will struggle.

~~~
FungalRaincloud
I see this line of thought pretty frequently. I don't really understand why we
collectively believe that GPG is so tough. I started using it as a teen,
barely even understanding the software, let alone how the actual encryption
worked. I've come to understand it, but even when I didn't, it didn't seem
difficult at all to get passable security out of it.

Cyphar has the right answer for why this would be problematic, though - GPG
would just prove that the initial commit was from a trusted committer. It
would be trivial to use the same signature with the fake commit. That's not an
unsolvable problem, but switching to a different hashing algorithm seems to me
like the more reasonable solution.

