Hacker News new | comments | show | ask | jobs | submit login
SHA-1 Collision Detection on GitHub.com (github.com)
374 points by samlambert 11 months ago | hide | past | web | favorite | 90 comments

One thing I'm wondering about that I haven't seen discussed yet is the possibility to make circular git histories:

When a git commit hash is calculated, one of the data that is used are the pointers to parent commits. That results in two properties for commits:

1) A child commit's hash cannot be predicted without knowing the parent commits

2) Once a commit is created, its parent pointers are practically immutable. The only way to change them is to create a new commit with a different hash - which cannot carry over the old commit's children.

Those properties meant that it's impossible to construct a commit with a parent pointer that points to its own children. Therefore it was safe to assume that a git history is always a DAG.

I think with deliberate SHA1 collisions now possible, someone could actually violate property (2) and introduce a cycle.

As the DAG assumption is pretty fundamental, I'd imagine such a cycle could break a lot of tools. Has this ever been tested?

You can already play with this using Git's "grafts" and "replace refs" features. Most of Git's graph algorithms are OK with this because they traverse the graph while setting one or more "SEEN" bits on each node. If you find one that loops infinitely, it would be worth making a bug report.

Interesting idea. Continuing your theory, if a git utility then got locked in an infinite cycle trying to import the commit due to the recursive parent-child relationship, then such an attack might be exploitable for DOSing cloud-based git repositories.

I don't think this is possible. While there are ways to use collisions to make a document contain its own hash, they rely on format-specific tricks to insert collisions between each character of the hash. There's no way I know of to do this with a bare, unformatted MD5 or SHA-1 hash.

It sees an easy way to fix this is to scan the report for sha1 tags, and put them on the stack. Then compare the next to the list. If match is found to be duplicated, refuse to load or load in degraded state with warning bells.

A circular directory would be much worse. As soon as you check it out, your disk fills up.

but git doesn't store directory information. it is implicit with the file information.

So symlink-shenanigans shouldn't be possible using just the git objects.

Git doesn't store directory information per se, but the directory structure is still reflected in tree objects.

The article links to this repo that actually does the work of finding possible collisions: https://github.com/cr-marcstevens/sha1collisiondetection

My understanding of it is that it runs a SHA-1 and examines the internal state of the of the digest along the way to see if it matches up with known vectors that could be manipulated to cause a collision.

Interesting that it's hosted on GitHub. Everyone in this thread is wondering about the 2^-90 chance that a random commit will fail, but I wonder how these guys prevented their repo with test files from being rejected.

The files don't trigger a collision because git adds a header prior to hashing. The hash is the sha1 of the total bytes.

On that note, in the case of this kind of system, the possibility of a false positive (so a commit fails, but shouldn't) there's always the option of adding a salt or a NOP of some kind to the file being committed so as to alter its hash.

Filesystems like Venti could conceivably, on the one hand, check for collisions, but on the other, compare the stored block with the newly committed block. If there's a collision, but the blocks differ, just perturb the new block with a NOP, a nonce header of some kind...

You could also just zip the files and unzip when testing

> Everyone in this thread is wondering about the 2^-90 chance that a random commit will fail

Considering that the Bitcoin network burns through ~2^90 hashes in a year ... I'm not worried : )

> The recent attack uses special techniques to exploit weaknesses in the SHA-1 algorithm that find a collision in much less time. These techniques leave a pattern in the bytes which can be detected when computing the SHA-1 of either half of a colliding pair.

> GitHub.com now performs this detection for each SHA-1 it computes, and aborts the operation if there is evidence that the object is half of a colliding pair.

Isn't it possible for a valid non-colliding object or commit to contain that pattern as well? It sounds like eventually, though possibly in the far distant future, someone will be unable to push a commit to Github because it matches the pattern but doesn't contain colliding objects.

Does anyone know what the pattern is they're looking for? I'm curious now.

It is possible; the researchers estimate the likelihood of a false positive at 2^-90 (which puts us back in "Sun engulfs the Earth" territory).

There are metrics that will alert GitHub's infrastructure team if a collision is found (to confirm that we aren't seeing any false positives). Those metrics were quietly shipped (without the matching "die") for a week before flipping the final switch.

If you want to know more about the patterns, see the sha1collisiondetection project:


There's a research paper linked in the README.

They say that the chance of a false positive is less than 2^-90.

Would it be possible to exploit the found bit-pattern in a "false positive" (that wasn't intentionally constructed to collide) to construct a second object with that preimage? Is a preimage attack in that situation more feasible?

Per commit, or over the expected lifetime of Git's utilization of sha-1? If it's the former it's not very useful unless we have an idea how many commits are made.

You should expect to see a hash collision with probability >½ with about sqrt(2^n) hashed objects for a hash of length n. I believe hashes are made per changed file per commit.

A large project (like Firefox) might make a few hundred commits per day, or tens of thousand per year. So that comes out to 2^20-2^30 hashes per year. I don't know how many distinct repositories GitHub has, but I doubt it's larger than 2^30. So that means that GitHub has no more than 2^60 SHA-1 hashes related to git, and probably more like 2^40-2^45.

So the probability of a collision is <<2^-30 (the collision function is logistic, so assuming linearity between 1 and 2^90 is a wildly overoptimistic assumption) in the optimistic case, probably something like <<2^50. In perspective, it's more likely that you will win the lottery tomorrow but fail to discover so because you were killed by lightning than it is that GitHub will detect a chance SHA-1 collision.

Wait, are we talking about chance SHA-1 collisions, or a false positive for this test for this vulnerability?

It's not logistic, that's a specific kind of function.

There are ~3.1e9 seconds in a century, and ~7.5e9 people on Earth. So if every human makes a git commit every second for a century, the odds of a single false positive is ~5.3e-7 (i.e. less than one in a million).

if every human makes a git commit every second for a century

Will someone please write a dystopian novel around this premise? It almost writes itself. GitHub turns evil, forcing the world population into subservience from birth; using Git is all anyone is ever allowed to do, and is how people conceptualize the universe and access entertainment; various cults form with the belief that randomness can be influenced via the right ceremony...

You can source inspiration from https://twitter.com/DystopianYA

I was recently thinking of another idea where humanity works for Google and everyone's job is to solve recaptchas. Everything is free for people who earn enough credits by solving the puzzles.

Isn't this an episode of black mirror?

It is.

Oh funny, which episode? I've never seen the show before.

Someone here had mentioned that they try to incorrectly answer recaptchas as a way to spite them for trying to steal free labor from them. My mind took that to the extreme of everyone working for Google... :)

I believe the parent poster is referring to the second episode [0], where people must earn credits for cycling on exercise bikes.

[0] https://en.wikipedia.org/wiki/Fifteen_Million_Merits

Yep, exactly. I didn't remember the title offhand, thanks.

Sounds like the plot to For-Profit Online University[1]. "I'm a digital gardener."

[1] https://www.youtube.com/watch?v=XQLdhVpLBVE

If it is not a movie yet.. you should definitely write one :D

> various cults form

...each believing in a Supreme branching strategy

And your status in society is determined by the number of consecutive digits of the hash of your very first git commit.

Alternative premise, tech interviews continue to become increasingly competitive, and if you want to be a true dev hirable by Google, etc. this is the output you need.

I could even see a matrix like aspect where those who are unplugged do so by creating hash collisions to hide what they are really doing.

You're right, I suck at probabilities and did it wrong: (3.1e9*7.5e9)/2^90 .

Well, I'd lay a fair amount of money that the number of commits is less than 10^20...

It's also possible for a SHA-1 collision to result from random chance, however it's extremely extremely unlikely to the point where it will almost certainly never happen.

This is true of any hash though. SHA-1 is broken because it is known how to create a collision intentionally.

True of any good hash, at least.

So, to the question "Isn't it possible for a valid non-colliding object or commit to contain that pattern as well", the answer is "No, not really."

I wonder how many chosen prefix (not identical prefix) attacks this detects. There are many ways to do it ranging from 2^77 to 2^80 time. Of course, this is still much more expensive than identical prefix, and git would probably not the best target for such attacks.

The SHAttered attack was made public shortly after browsers deprecated SHA-1 or started flagging SHA-1 in certs as insecure. This specific practical attack may have existed for a while before it was made public. In fact it may have been made public only when it no longer provided practical benefits.

Just some food for thought.

The team that found it is at least partly academic; they published a paper on a free-start collision at CRYPTO 2015, and made a prediction there about how long it would take them and how much it would cost to find a full collision, which was more or less accurate. They are hash function researchers, not spies or hackers. There's no reason to believe they were hiding anything.

That doesn't make a lot of sense to me, if anything this type of attack only gets more practical as the cost to compute a collision dwindles. It's not like a 0-day vulnerability.

I'm also not aware of any significant recent attempt at mitigating SHA-1 collisions that would've suddenly made the attack impractical. And in general these collisions are pretty difficult to exploit since you can't control the collision exactly.

Weaknesses with theoretical attacks have been known for over 10 years, and most major companies started phasing it out of use years ago. The recent news that someone was able to find two colliding inputs is just yet another red flag against continuous use, but there is no more need to panic now than a year ago. Remember that finding a collision for a spesific input is much harder than finding two inputs which collide.

Maybe. But SHA-1 is still widely used today, perhaps no long for signing certificates, but a lot of developers and systems today continue to use SHA1 to produce checksum.

We don't know what tools the NSA have in their arsenal.

Naive question - why can't GitHub use a SHA2 function for commit hashes?

Because Git only supports SHA-1 as of now, and files in Git, distributed to users, is what they worry about here. If it were some internal system, then they could switch to a different hash easier.

Why not just store a second hash with a different algorithm, and check both hashes match?

I don't think github is in a place to change how git itself works. That's Linus and the team's job.

That's a good idea, but only provides full protection if all git clients do it. The official git client itself doesn't support that, so people would still be vulnerable if they got a malicious half from github, or a signature on a non-malicious half from github and a malicious half somewhere else.

This protection github is implementing uses a heuristic to try to find collision attempts by just looking at a single file. So it prevents github from hosting any collision attempts.

Why use that specific method to detect sha1 collisions rather than doing a direct comparison or hashing using a more secure algorithm?

> rather than doing a direct comparison

A direct comparison between what two things? In the attack they are worried about, they only see one half of the attack.

> hashing using a more secure algorithm

They don't control how Git works.

If I understand right they detect files that contain the specific collision that Google released. It will work fine until somebody else does the same work Google did to generate an other collision pair.

No, it's better than that. It detects evidence of the cryptanalytic tricks that were used to generate that collision. So it will find any collision generated using their attack technique (which is ~2^63 operations, as opposed to 2^80 for brute force).

There may be new disturbance vectors discovered as the attack matures, but they can be added to the detector.

what is it?

Looks like a bug with Cyrillic character highlighting. I'm guessing this is probably an issue with Rogue/Pygments[0][1] since I doubt Github is rolling their own syntax highlighter.

[0]: https://github.com/jneen/rouge

[1]: http://rouge.jneen.net/pastes/6nDQ

>>> "params[:sandboх] = true".decode('ascii')

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 14: ordinal not in range(128)

> Two objects colliding accidentally is exceedingly unlikely. If you had five million programmers each generating one commit per second, your chances of generating a single accidental collision before the Sun turns into a red giant and engulfs the Earth is about 50%.

What happens when machines are writing commits and there are far more than 5 million and 1 per second? Is this a "640K ought to be enough for anybody" quote of our time?

One of two things will happen. Either the machines will be generating random commits, and nobody will care if they collide...

... or the machines will be able to generate real, useful code that fast, in which case it will be the machines' problem to solve.

And in the latter case, most human programmers will be unemployed, and thus unconcerned with hash collisions.

If you were generating hashes on the gpu for collisions, according to https://hashcat.net/forum/thread-4314.html

A single titanX does 2^9 hashes per second. Suppose you had a 10,000 Gpus (expensive but not impossible) you can do 2e13 hashes per second.

In 100 years there are ~3e9 seconds, so 6e22 hashes. Still magnitudes below 90.

Sha1 isn't still crackable by brute force in a meaningful time.

> What happens when machines are writing commits and there are far more than 5 million and 1 per second?

Then a collision will happen far sooner than the Sun engulfing the Earth.

Maybe in the next million years, instead of 7 billion.

No. Let's say the commit rate increases by a factor of a million. You still have centuries to upgrade and fix the problem.

Attacks always get better, not worse. You never have centuries to fix a broken security primitive, such as this.

Why are you telling me this? We are talking about accidental collisions not a 'security primitive' or attacks.

The github change was not put in place to handle accidental collisions but to detect an attack on the sha-1 security primitive.

So? This subthread is about a line in the write-up that talks about accidental collisions. Your objection to 'centuries' is a complete non-sequitur.

You apparently like to argue. Even if the sub-discussion is about accidental collisions centuries do not remain to fix the problem, which is that there was a recent practical attack of a mathematical speedup for certain types of collisions. Your conclusion doesn't follow from your premise.

The attack persists regardless of the lack or presence of accidental collisions, therefore there aren't centuries to fix the problem.

You're the one who started this stupid argument. There is no 'attack'. This is was the conversation:

Someone: "Huh, they say it would take billions of years for an accidental collision at checkin rate X"

Someone else: "What if the collision rate were much higher? Famous last words, 640k, herf glorp"

Me: "Even at a million times higher checkin rate you are unlikely to see collisions anytime soon"

You: "It's about ethics in gaming journalism!"

My conclusion doesn't follow your premise because your premise is simply wrong. Enough.

No, this is "640K ought to be enough for anybody while we fix the underlying problem and put a better hash function into git". This is more like living with a 4GB 32-bit memory space while we wait for 64-bit CPUs to become popular.

> Is this a "640K ought to be enough for anybody" quote of our time?


1) That quote wasn't actually said by Bill Gates

2) Very few people will quote that 15-20 years from now.

> Very few people will quote that 15-20 years from now.

But that didn't stop someone from quoting it today and that's the third time or so this week that I come across that quote. The fact that it is quoted today - for which there is really no good reason - makes me suspect it will still be quoted 15-20 years from now. Hopefully less frequently.

One of the reasons it lives on is because it does represent what very quickly turned out to be a design mistake in the PC architecture - sticking 'reserved' memory at the top of a 20 bit address space. If it hadn't been for that, nobody would have ever had to opine whether 640k is or isn't enough, they'd have just added more memory and moved on.

Pretty sure the GP is referring to GitHub's statement with "that", not the 640k quote. Nobody will quote the probability of a false positive of a SHA1 collision detector in 15-20 years.

Updated to make it modern: 640k of Minifed, Gzipped JS should be enough for anybody

Tell that to angular.

5 billion years divided by (speed of computers / 5 million per second) is probably still high enough for most practical terms.

I guess the pertinent question now is how close are we to 5 million and 1 commits per second!

Forcing committers to GPG sign their commits and authorizing only verified commits from certain key ids would be a stronger, end-to-end chain-of-custody process. Right now, GPG verification badges on tags and commits is nice but it doesn't have "teeth." This is a nice step, but GH needs to keep going and add stronger, mandatory defense-in-depth (optional security doesn't get used).

But commit signatures are actually just signing the SHA1 hash. So you still have the problem that you could just reuse the signature for the valid commit with the malicious one.

Then git also needs to change, but the process as above for github and others is still correct.

0. Upgrade to SHA3/Keccak, eventually make SHA1 a read-only legacy feature.

1. Upgrade signing to sign blobs and sign signatures for non-blobs instead of signing refs.

It's not them it's GPG that's too ugly to be used even by programmers. If they enforce it, too many people will struggle

I see this line of thought pretty frequently. I don't really understand why we collectively believe that GPG is so tough. I started using it as a teen, barely even understanding the software, let alone how the actual encryption worked. I've come to understand it, but even when I didn't, it didn't seem difficult at all to get passable security out of it.

Cyphar has the right answer for why this would be problematic, though - GPG would just prove that the initial commit was from a trusted committer. It would be trivial to use the same signature with the fake commit. That's not an unsolvable problem, but switching to a different hashing algorithm seems to me like the more reasonable solution.

I set up GPG for my account a couple of months ago, and found it surprisingly simple. The setup has even survived a recent machine change with no issues what so ever. Curious to know what makes this "too ugly" if you don't mind elaborating!

Applications are open for YC Summer 2018

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact