Linus on Git and SHA-1 (plus.google.com)
550 points by dankohn1 on Feb 25, 2017 | 170 comments



The actual mailing list discussion thread can be found here, and is infinitely more informative than any of the bull being spouted in this thread: http://public-inbox.org/git/20170226004657.zowlojdzqrrcalsm@...


My takeaway from all that - the git project is internally taking steps to mitigate vulnerabilities from this particular attack (making it harder to insert the arbitrary binary data necessary into git metadata), but a) is just throwing up their hands at the problem of projects that store binary blobs like image data in their repos, and b) is not taking this as a signal that more serious sha-1 attacks are on the horizon and they should speed up their hash-replacement efforts.

This latter point leads into the problems with Linus' positions in particular. In that thread, he does not take seriously the threats that this poses to the broader git userbase, because he only seems to care about the kernel use-case: trusted hosting infrastructure at kernel.org (itself an iffy assumption, given previous hacks and the use of mirrors), and the exclusive storage of human-readable text in the repo, which makes binary garbage harder to sneak in. These do not apply to most users of git. His rather extreme public position (paraphrased, "our security doesn't depend on SHA1") is even more troubling - it absolutely does depend on SHA1; this just isn't (yet) a strong enough attack to absolutely screw over the kernel. A stronger collision attack (e.g. a chosen-prefix as opposed to identical-prefix, or god forbid a pre-image attack) would absolutely invalidate the whole git security model.


> a) is just throwing up their hands at the problem of projects that store binary blobs like image data in their repos, and b) is not taking this as a signal that more serious sha-1 attacks are on the horizon and they should speed up their hash-replacement efforts.

As @tytso points out, there is ongoing work to replace the use of SHA-1. Also, yes, SHAttered raises serious concerns about the safety of SHA-1, but that doesn't mean everyone has to immediately work on the switch (and honestly, it's questionable how much it would help - throwing more developers at a problem doesn't necessarily help). Should Stefan Beller drop the work he's doing on submodules because the SHA-1 threat is more imminent now?

It's also worth noting Linus isn't really a core Git dev any more, he just submits patches occasionally. Junio Hamano is the primary maintainer.


SHA1 is only used for uniqueness. You still need to have write access to the repo to perform an attack. If you already have write access you do not need to make a collision...


Exactly. Why waste the time to fabricate a collision when all you have to do is "accidentally" commit exploitable code? At least then you got plausible deniability.


For this particular attack - what if you do not have write access, but have sufficient social capital to get a pull request merged with a benign-seeming file?


If we're just talking about this particular attack, the two files don't even resolve to the same SHA-1 in Git:

    $ sha1sum shattered*
    38762cf7f55934b34d179ae6a4c80cadccbb7f0a  shattered-1.pdf
    38762cf7f55934b34d179ae6a4c80cadccbb7f0a  shattered-2.pdf

    $ git hash-object shattered*
    ba9aaa145ccd24ef760cf31c74d8f7ca1a2e47b0
    b621eeccd5c7edac9b7dcba35a8d5afd075e24f2
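
For anyone wondering why the numbers differ: git doesn't hash the raw file contents, it hashes a "blob <size>\0" header followed by the contents, so the two colliding PDFs end up as distinct blob objects. A minimal Python sketch of the blob hashing (assuming the published shattered-*.pdf files are sitting in the current directory):

    import hashlib

    def git_blob_sha1(path):
        # git hashes "blob <size>\0" + contents, not the raw file,
        # which is why sha1sum and git hash-object disagree here.
        data = open(path, "rb").read()
        header = b"blob %d\x00" % len(data)
        return hashlib.sha1(header + data).hexdigest()

    for name in ("shattered-1.pdf", "shattered-2.pdf"):
        print(git_blob_sha1(name), name)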


As soon as somebody invests a few $100k there are going to be such files. Now it's known how much it takes. That much.

Luckily, there is also a known solution for detecting that kind of file.


That's not all "the git project" is doing. Work to allow git to support an alternate crypto checksum is ongoing. One of the key steps is to replace the hard-coded use of char[40] with a "struct object_id" object, and it is about 40% done[1].

[1] http://public-inbox.org/git/20170217214513.giua5ksuiqqs2laj@...

Like many volunteer projects, the work would go faster if more people helped. What "the git project" considers important works the same way as "how much is bitcoin worth", or "how much is gold worth". People who think it's important can show the value by putting their effort where they think it is most important. Or, of course, people can snipe from the side-lines, and make themselves feel all self-important.

There has actually been a lot of thought about how to do a graceful transition period, and also about how to switch from a soft cutover to a hard cutover (breaking backwards compatibility and requiring people to upgrade their clients), to allow time for developers to upgrade at a reasonable rate, where reasonable can be defined on a per-git-repository basis. So those people who want to force a flag day as soon as the code is available, and not wait for stability testing, etc. - people who value security über alles, and who don't want to rely on trusting kernel.org - can do so.

I will note, though, that most people are doing blind pulls, or worse, blind merges, from developers outside of their immediate circle of trust _all_ _the_ _time_. Heck, people will cut and paste command lines from web pages of the form "curl http://alfred.e.numman/what/me/worry | bash" into root shells all the time! So if you are not auditing every line of code before a git pull, the fact that git is using SHA-1 is the least of your worries.

Personally, if I were a nation state planning on trying to insert the equivalent of a DUAL-EC backdoor into open source software, I'd do that by spending a person year or ten getting a collection of developers to be trusted contributors to some key open source project, like Docker, or Python, or even yes, the Linux kernel or git, and then "accidentally on purpose" introducing a buffer-overrun or some other zero day into said OSS code base. Or heck, just simply invest in finding more zero days that people have inserted into their code just because they're not careful!

So, sure, "we" should upgrade git to be able to support multiple crypto hash algorithms, and there is work going into doing this. But at the same time, it's important to keep a sense of perspective on all of this --- unless, of course, your goal is to make yourself seem important by exuding a sense of self-righteousness.


> people will cut and paste command lines from web pages [...] into root shells all the time!

Do they? I've never encountered anyone using a root shell to go about their business unless they had to do some exclusive operation that could only be done as root, and even then they would rather just sudo that specific operation. Hell, even if they do do that, they have bigger problems than just git, given that cut and paste is inherently broken[1].

[1]: https://news.ycombinator.com/item?id=10554679


Do you know anyone who uses calibre, the e-book reader? You want to give it a try? Just cut and paste this:

sudo -v && wget -nv -O- https://download.calibre-ebook.com/linux-installer.py | sudo python -c "import sys; main=lambda:sys.stderr.write('Download failed\n'); exec(sys.stdin.read()); main()"

Or you can trust my docker image, where I've done this for you:

https://hub.docker.com/r/tytso/calibre/

(Hint: blindly using my docker image is only slightly better from a security perspective. What you _should_ do is download the dockerfile, audit it carefully, and then create your own docker image. And then you're _still_ trusting the Calibre folks to have access to your X server....)


I actually don't know anyone who uses it. Any Linux users I know would just use the book reader from their distro's repository (in my case, FBReader).

EDIT: Also, holy s@@@ e-calibre, your advice (with a mild warning) if people get certificate errors is to pass in --no-check-certificate.


That docker file downloads Calibre over HTTP. Awesome!


>Do they? I've never encountered anyone using a root shell to go about their business unless they had to do some exclusive operation that could only be done as root, and even then they would rather just sudo that specific operation.

Maybe you should encounter more devs?


I'm a developer, who works alongside other developers, on a floor full of developers. Not only does this apply to my work, but also at hack[spaces|athons] and jams I've been to (With ages ranging from ~8 - >30). I've literally never encountered anyone using su for a command that doesn't need it. Even when they need root, they just use sudo for that command.


Well, you'd be surprised.


Git has a security model?

edit: rephrase: Git has no security model.


Linus's transition plan seems to involve truncating SHA-256 to 160-bits. This is bad for several reasons:

- Truncating to 160-bits still has a birthday bound at 80-bits. That would still require a lot more brute force than the 2^63 computations involved to find this collision, but it is much weaker than is generally considered secure

- Post-quantum, this means there will only be 80-bits of preimage resistance

(Also: if he's going to truncate a hash, he should use SHA-512, which will be faster on 64-bit platforms)

Do either of these weak security levels impact Git?

Preimage resistance does matter if we're worried about attackers reversing commit hashes back into their contents. Linus doesn't seem to care about this one, but I think he should.

Collision resistance absolutely matters for the commit signing case, and once again Linus is downplaying this. He starts off talking about how they're not doing that, then halfway through adds an "oh wait, but some people do that", then tries to downplay it again by talking about how an attacker would need to influence the original commit.

Of course, this happens all the time: it's called a pull request. Linus insists that prior proper source code review will prevent an attacker who sends you a malicious pull request from being able to pull off a chosen prefix collision. I have doubts about that, especially in any repos containing binary blobs (and especially if those binary blobs are executables)

Linus just doesn't take this stuff seriously. I really wish he would, though.


> "A hash that is used for security is basically a statement of trust [..] In contrast, in a project like git, the hash isn't used for "trust". I don't pull on peoples trees because they have a hash of a4d442663580. Our trust is in people, and then we end up having lots of technology measures in place to secure the actual data."

This is horseshit, and Linus should not be saying these hugely misleading statements about security principles.

The point of a hash is to remove the need for trust between the trusted person who tells you the hash and the infrastructure from which you get the actual data that was hashed (edit: and between you and the latter).

In other words, once you get a good non-colliding hash from a trusted person, then you don't need to worry about malicious infrastructure sending you bad data claiming to be the source of that hash.

Linus trusting Tytso to sign the commit object that references the SHA-1 of the tree object says nothing about whether the infrastructure served him the tree object correctly. Sure, he might also trust the infrastructure providers, but when he says "trusts people" it does not sound like that is what he means. And even if he trusts the infrastructure providers, with a good hash HE DOESN'T HAVE TO.

The "trust" wording is serious horseshit.

(edit: there is also the case of people downloading "linux" from random git repos in the future. Right now if you GPG-sign a commit or tag, it has SHA-1 references to the tree object underneath it. Once SHA-1 is more broken it basically means you shouldn't trust random git repos across the internet to give you good content, even if it's "signed by Linus".)


You're saying Linus's statements are "hugely misleading", but it's just that you wish git were designed to be used differently. So, your argument is "horseshit".

Linus could have designed a cryptographically perfect system such that he could pull Tytso's signed commit from anywhere on the internet - but he didn't

Linus used sha1 as a useful tool for an effective DVCS with an initially simple implementation.

He still depends on the security of the kernel.org servers, his work computers, and the top submaintainers' work computers. His git trees, and those of all the submaintainers he pulls from, are hosted on kernel.org servers. Security of the kernel.org servers is taken very seriously, especially since the well-known break-in a few years ago. Now even two-factor auth is involved in all git pushes to kernel.org servers. Sub-maintainers make pull-requests by branch name - "please pull branch for-linus ...". More peripheral contributors submit their work via patches on LKML, no commit hashes involved.

Finally, no well-known SCM previous to git was based on perfect cryptographic proof of source history, or anything like that. It wasn't a big issue, and it's not the problem git focused on solving. Before git, we all used CVS, SVN, tarballs, and patches. And a significant portion of developers did not use any VCS at all. How could any of us have trusted any source code before 2005?! Somehow we did, though...


> Before git, we all used CVS, SVN, tarballs, and patches

Yeah, and that sucked.

Git is this close (I'm holding my fingers very close together) to providing cryptographic proof of source history. And that's why people think it should be included.


> How could any of us have trusted any source code before 2005?! Somehow we did, though...

They used pgp to sign the tarball, which was a way better idea, since you could just use a different hash function for your signature after SHA-1 was broken in 2005[1].

Everybody serious about security kept doing that, since signed git commits were just asking for trouble due to the hard dependency on sha1.

[1]: https://www.schneier.com/blog/archives/2005/02/sha1_broken.h...


"Finally, no well-known SCM previous to git was based on perfect cryptographic proof of source history, or anything like that."

This is actually false. Monotone, from which git takes a lot of cues (ask linus, or see wikipedia), was such a system.


Thus my qualification "well known". I was aware of monotone and Darcs ... but never used them, or knew anyone who did, or anyone who seriously considered it. I have no idea if Darcs is "cryptographically secure". But both of these were experimental and extremely slow at the time git was designed.


"Thus my qualification "well known"." So that, to you, means "i don't know anyone who used them"?

Because that's a pretty small definition of "well known". For reference:

Inside the VCS community, they were very well known. Graydon also went on to start Rust, of course. Outside of that, even to corporations I dealt with (i was working on SVN at the time), plenty knew it existed, a few evaluated it.

" But both of these were experimental and extremely slow at the time git was designed."

I feel like this is just you trying to say "well, they never would have worked anyway".

DARCS was slow and couldn't be fixed, monotone was actually not that bad, and could be made very fast if necessary.

Monotone was slow precisely because it cared about integrity. If you made git care about integrity and security in the same way ... it would be just as slow!

Saying they were "experimental" is silly. Monotone was self-hosting and nobody had found data corruption or other issues in quite a while (IE > 1 year), AFAIK. This is much better than git was for a long time. To put this in perspective, I converted the gcc repository (hundreds of thousands of revisions, in fact, many more than the kernel at the time, and history going back to 1983) to monotone, and 1. the conversion worked with no issues, and 2. speed was fine in most cases. If I worked hard, I could find issues, but ....

git was beyond experimental at the time it was designed (IE data loss and repeated crashes). But its design is essentially that of monotone in a different container and with slightly different goals. (IE they gave up precisely the integrity part that monotone verified and cared about, which is what people are now complaining about.)

In truth, if Linus had cared about cryptographic security, he probably would have just stuck with monotone and rewritten parts of it. But he didn't, so he gave that up in the name of speed.

Which is fine, but let's not pretend that Linus somehow has a monopoly on design, or even was the first to design a VCS that was like git.

Any of the distributed VCSen could have beaten git. They just didn't care about the tradeoffs that git was making (IE non-portability in favor of speed), and honestly, it's pretty obvious git would have been laughed out of existence if it hadn't been supported by someone so popular to so many people.

In fact, it was laughed out of existence, the past N times people had devised VCSen that favored speed over portability.

Git is interesting as an example of how good marketing with the right figures sometimes leads to winning in the marketplace for long enough that you can fix the rest of your serious issues (IE everyone else had the OS/2 problem). Even more interesting is that everyone else backfills history to make it seem like it was the best and clearly the right set of tradeoffs from the start. It wasn't. It may not even be now! It's like any other system. Please don't try to backfill history here. I was there, as were many others.


I didn't claim that Linus came up with the design from scratch, nor that git was less experimental than monotone in 2005. Just that no one used monotone, so git was not a reduction in safety/assurance compared to any in-use SCM. And as mentioned, its goal was speed, not cryptographic assurance.

I also was surprised that git became so popular, I thought mercurial would win for "ease of use". But I've never been an advocate or influencer of popularity. I think that's mostly due to GitHub, which got it to a market-share tipping point.


>Inside the VCS community, they were very well known.

Inside the PL community, tons of obscure languages with fewer than 10,000 users are "well known". Being known inside some niche of experts is a pretty low bar. The "VCS" community is insignificant in size compared to the developer community at large.

>Graydon also went on to start Rust, of course.

Which is irrelevant as to whether Darcs and Monotone are well known.


Of note is that Monotone also used SHA-1 but it was released in 2003.


Why go through all the effort of securing kernel.org etc etc etc when a simple s/SHA1/SHA512/g on git would suffice? (Ignoring the non-git infrastructure there, for the purposes of this argument.) Not just for him, but for everyone in the world who can't download directly from kernel.org?

What's the point of GPG signing tags and commits and adding these as git features, when SHA1 pointers make it pointless?

If the cryptographic strength of the hash doesn't matter, why bother even talking about changing the hash from SHA1 to SHA3-256?

It's a huge shame, because git is very close to being a cryptographically perfect system, but the original creator can't get his security reasoning right, gets it wrong so badly and so publicly, and minions who don't know what they're talking about come flocking to defend it.


> when a simple s/SHA1/SHA512/g on git would suffice?

That's ignoring a huge amount of changes that have to be made to have this work properly. Hardcoded constants must be changed. Backwards compatibility needs to be maintained for Git to be a viable product. Scripts that depend on the current length/format of SHA1 hashes would be broken if everything were changed all of a sudden. That would be much, much worse than the current exploit.

I agree that his points are not all correct, but he is correct in that basically, you shouldn't have all of your security eggs in one basket. SHA1 is just one piece of their security infrastructure, and now that it's shown to be a little shaky, it can be argued that the others help keep it up while they repair it.

SHA1 pointers being "pointless" is also an overstatement. SHA1 isn't completely broken yet. It will be pointless in 3+ years, which is why they're doing the work of transitioning, though it isn't easy. "Cryptographically perfect" wouldn't happen with SHA3-256 either. That'll be broken in 10+ years.

Security is, and always has been, along the lines of "good enough as far as we can see now", and when Git was made, it was good enough. Now it's not.


All git repos have "repositoryformatversion"; you can bump it. And nobody is going to complain about having to do a fresh clone, given what just happened.

"Not having security eggs in one basket" is exactly irrelevant here. Upgrading the hash does not make the other parts of the system weaker, and should have been done years ago. Yes, SHA2 will also become weak eventually - so then you bump it again. The point is to choose the best one for that time of history, which he has been resisting doing for years, based on bullshit pseudo-arguments about "trust".

No, git uses hashes in the same way that most other security systems use hashes. The "trust" comes at a different point of the system. No hashes by themselves are "trusted" anywhere by any system. Git is not special in this regard. I repeat, Linus is talking horseshit and this horseshit argument is unfortunately getting repeated and spread around because of his position in the community.

I said the use of GPG is pointless, not SHA1 pointers, especially given his "trust" arguments. To exaggerate slightly, just to show you the point: it's like locking up a crappy lock (SHA1) inside a really secure safe (GPG). But this crappy lock is what opens up the real safe where your actual money is kept (the tree objects). And then we have Linus talking shit on the side saying the keys are not so important, but what he really gets security from is his trust in the delivery company that ships the safe around.


> Yes, SHA2 will also become weak eventually - so then you bump it again.

I disagree here. I doubt SHA2 or any modern hash will become weak within our lifetime. JP Aumasson, who is one of the experts of the field, agrees with me on that: https://twitter.com/veorq/status/834872988445065218


It's funny because in that Twitter thread this[0] is linked:

     Reactions to stages in the life cycle of cryptographic hash functions
     Stage               | Expert reaction                             | Programmer reaction                             | Non-expert ("slashdotter") reaction
    ---------------------+---------------------------------------------+-------------------------------------------------+----------------------------------------
      ...                |  ...                                        |  ...                                            |  ...
     General acceptance  | Top-level researchers begin serious work on | Even Microsoft is using the hash function now   | Flame anyone who suggests the function
                         | finding a weakness (and international fame) |                                                 | may be broken in our lifetime

Just to be clear: I don't doubt JP Aumasson's reasons to believe so. I doubt yours, because as far as I can tell it's basically an argument from authority ("JP Aumasson said so!")

[0] http://valerieaurora.org/hash.html


Yes, exactly.

If Linus truly doesn't care about security, then git could use any error-detecting code that produces a uniform distribution of tags, such as CRC64. The size of the tag only affects the number of objects we'd expect to be able to commit before we see a collision: over 4 billion in the case of CRC64.

Linus mistakenly claims using a cryptographic hash function helps avoid non-malicious collisions, but this is not the case.

Where the choice of a cryptographic hash function matters is specifically if we expect an attacker to be trying to collide tags. CRC64 is a linear function of the input data and therefore fails miserably at preventing attackers from colliding tags, but still produces a uniform distribution of tags for non-malicious inputs.

git seems to be in the odd place where Linus argues he's using a cryptographic hash function but not for security purposes.
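
To make the "linear function" point concrete: a CRC is affine over GF(2), so for equal-length messages the checksum of an XOR combination is completely determined by the individual checksums. That is exactly the structure an attacker exploits, and exactly what a cryptographic hash is designed to destroy. A small sketch with Python's standard CRC-32 (shorter than CRC64, but the property holds across the whole family):

    import os
    import zlib

    # three random messages of equal length
    a, b, c = (os.urandom(64) for _ in range(3))

    # XOR the three messages together byte by byte
    xored = bytes(x ^ y ^ z for x, y, z in zip(a, b, c))

    # CRC is affine over GF(2), so the CRC of the XOR combination is just
    # the XOR of the individual CRCs (the init/xor-out constants cancel).
    assert zlib.crc32(xored) == zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)

    # Nothing remotely like this holds for SHA-1/SHA-256; their outputs are
    # designed to be unpredictable even for closely related inputs.

None of that affects uniformity for non-malicious inputs, which is the only property the de-duplication use case needs.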


Linus: "You can have people who try to be malicious... they won't succeed."

Linus talked about why git's use of a "strong hash" made it better than other source control options during his talk at Google in 2007.

https://youtu.be/4XpnKHJAok8

Edit: the whole talk is good but the discussion of using hashes starts at about 55 min.


Well, if the cost of computation is not too relevant, and if you don't explicitly need the ability to craft collisions, why would you use a non-cryptographic hash function?

Like, when I'm building a lookup index for files, I'm going to use sha-(something), because it's easy and well known. I don't particularly care about the security aspect; I care that everyone immediately knows the contract of sha-1.


There is nothing to be gained from using cryptographic primitives in a non-security context. You could just as easily use e.g. CRC32 for the case you're describing.

There is, however, a performance cost in using cryptographic primitives in non-security-related contexts. You may not care about performance, but it certainly matters for something like git.

Linus claims: "So in git, the hash is used for de-duplication and error detection, and the 'cryptographic' nature is mainly because a cryptographic hash is really good at those things."

CRC produces a distribution just as uniform as a cryptographic hash function, and it's faster to boot. If these are the only things he actually cares about, and he's explicitly discounting security, he's choosing a slower primitive for no reason.

He writes off CRC inexplicably earlier in the post:

"Other SCM's have used things like CRC's for error detection, although honestly the most common error handling method in most SCM's tends to be 'tough luck, maybe your data is there, maybe it isn't, I don't care'."

Linus seems to think that SHA1 has some sort of magic crypto sauce which magically makes the distribution it produces more uniform than CRC's. It doesn't. The only difference is SHA1 was originally designed to be resistant to preimage and collision attacks, both of which are irrelevant outside of a security context.


You can still make a not-totally-unreasonable argument that something like CRC64 is simply too small - that maybe the 1 in a million collision chance for a few million hashes is too high. The fast, keyed 'semi-cryptographic' big hashes that are common now weren't around when git was written so the easiest thing to reach for would have been something like SHA-1.


"maybe the 1 in a million collision chance for a few million hashes is too high"

Well, let's look at what the actual numbers are. There's a nice table on this page:

https://en.wikipedia.org/wiki/Birthday_attack

For a 64-bit tag, even with 6,100,000 objects we'd only have a 1 in 10 million chance of a collision, so a 64-bit tag is more than sufficient to meet your stated requirements.


No no, I don't have any requirements, I just did the numbers and missed a zero instead of sensibly looking at a table. But that doesn't change the argument much - if one in a million isn't crazy, one in ten million is not completely insane either. With a 64 bit hash, you probably should write code to deal with a potential collision. Besides the (tiny) chance, someone might legitimately plop some test vectors that collide into a repo, just like the webkit people did the other day. With SHA-1 (in 2005), you can just punt and spend the time you saved yelling at people complaining to you about the choice of SHA-1 on mailing lists. It seems like a pragmatic implementation decision for the time.

I'm not trying to defend Linus's somewhat confused explanations, it's just that git's 'security' requirements are somewhat woolly and one could reasonably get away with half-assing it a bit for a while.


If you feel CRC64 does not meet the requirements, use CRC128.

For what it's worth, I think CRC64 should be fine for git-like workloads (but would still recommend using a cryptographically secure hash function, because git's usage is security-critical despite Linus constantly insisting it's not).


Why would you do that? Even if you don't know exactly what you want, and are wrong about whether it's the basis of 'trust', for the purposes of writing git you'd just take SHA-1. Nothing terrible is going to happen if it's both overkill and you aren't really building a secure system. You seem to be arguing, if I'm understanding you right, that you should only use a cryptographically strong hash if you need all its properties. That seems like a really odd angle.


I am likewise perplexed why "cryptographic hash functions are unnecessary in the absence of an attacker" is such a difficult concept for you to grasp.

It is not an "odd angle". It is literally the very purpose for which they were created in the first place: to defend against attacks (preimage, collision)

If there are no attackers, the cryptography buys you nothing and merely makes the system slower.

Again, to go back to the original point: Linus's argument is that cryptographic functions have unique properties that make them specifically useful in non-security contexts. He's wrong. They don't. The non-cryptographic constructions he namedropped then glossed over work fine in these contexts.


> I am likewise perplexed why "cryptographic hash functions are unnecessary in the absence of an attacker" is such a difficult concept for you to grasp.

Well, if we're going to be dicks to each other about it I'll try to explain what I think appears to be difficult for you to grasp. :) If you were throwing together something like git in a hurry you'd want a hash that

Lets you not have to think about collisions at all even if:

- The collisions are by mere chance

- The collisions arise by non-malicious accident

- The collisions arise from malicious inputs [obviously, that implies an attacker but it also falls under 'I just don't want to think about collisions']

And right there, you grab the first non-completely-broken, not-too-giant, not-too-slow crypto hash around and get on with whatever else you had in mind. And this, to me, seems like the right call, especially since if collisions did eventually pop up, it'll probably be one generative collision and your entire system won't suddenly implode just because it exists. You'll have some time to fix stuff.

> Again, to go back to the original point: Linus's argument is that cryptographic functions have unique properties that make them specifically useful in non-security contexts. He's wrong. They don't. The non-cryptographic constructions he namedropped then glossed over work fine in these contexts.

No argument there. Maybe I misunderstood 'Linus said something wrong' as 'Linus did something horribly wrong' and we're arguing over nothing?


I recently worked on a project where I had to choose a hash function for non-cryptographic error checking. I investigated CRC in detail for it, even wrote two different implementations from scratch.

CRC-X is complicated to use and a terrible choice.

First of all, it is not a single algorithm, it is a family of algorithms. For a CRC of size N, you also have to choose an N-bit polynomial, an N-bit starting value, and an N-bit output XOR value. There is no single standard; there are tens or hundreds of popular options [1]. The optimal polynomial depends on both the CRC size and the length of the data you are feeding into it [2]. If you choose poor parameters, you will get terrible error detection characteristics (like allowing extra 0 bits at the start of data with no change to the checksum). If you choose good parameters, certain classes of common errors, like zeroing out a block of more than N bits, will still have much worse characteristics than the 1 / 2^N random chance of collision.

Second, implementation in standard libraries is patchy. Most programming languages have some CRC32 implementation - but do not document what parameters they use, or use different notation for the same parameters (forward and reverse), or do not let you change the parameters, or do not let you change the CRC size, or all of these. There is no easy way to get a "standard CRC64 or CRC128" compatible across platforms without putting it together from github snippets and example code yourself.

Third, CRC is fast when implemented in hardware, but not that much faster than SHA-1 or SHA-512 in software. It's only 1.5-2.5x faster [3], and when you're doing one checksum per uploaded file or something, it really does not matter. It's going to be even slower when you don't have a suitable-length CRC available in optimised native code, and have to write it yourself in simple C or pure Python.

The obvious and simple solution is to pick a known, popular cryptographic hash (like SHA-X) that is available in the standard library of every programming language under the exact same API with no parameters to configure, truncate its output to the digest size you want, and call it a day. No need to worry about error detection performance on specific error cases like long bursts of zeros, and you get some defense against malicious tampering as a free bonus.

[1] https://en.wikipedia.org/wiki/Cyclic_redundancy_check#Standa...

[2] http://repository.cmu.edu/cgi/viewcontent.cgi?article=1672&c...

[3] https://www.cryptopp.com/benchmarks.html
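
To illustrate the "truncate a standard hash" alternative from the last paragraph, a minimal sketch (SHA-256 and an 8-byte tag are arbitrary example choices):

    import hashlib

    def tag(data: bytes, size: int = 8) -> bytes:
        # Error-detection tag: the first `size` bytes of SHA-256.
        # Same parameters everywhere SHA-256 exists, no polynomial /
        # init / xor-out choices to document, and you keep some
        # resistance to deliberate tampering as a side effect.
        return hashlib.sha256(data).digest()[:size]

    print(tag(b"some file contents").hex())      # 64-bit tag
    print(tag(b"some file contents", 20).hex())  # 160-bit tag

Truncating a strong hash is itself a standard construction (SHA-512/256 is exactly that), though a short tag obviously only gives short-tag collision resistance.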


Would MD5 not also satisfy all the availability and standardization concerns you mentioned just as well as SHA-X? And also meet git's other non-security requirements just as well?

So why SHA-1 over MD5, which is probably more available and an order of magnitude faster? What problems does MD5 have, outside of its security, that SHA-1 does not?


> So why SHA-1 over MD5, which is probably more available and an order of magnitude faster?

This actually ends up not being true. Try

    openssl speed md5
    openssl speed sha1 

On a somewhat recent Intel CPU.


I'm glad you've at least looked into this, but you have a number of things wrong:

  First of all, it is not a single algorithm, it is a family of algorithms
This argument holds for SHA1, but all modern cryptographic hash functions are also families. The SHA3 family has SHA3-224, SHA3-256, SHA3-384, SHA3-512, SHAKE128, and SHAKE256

  For a CRC of size N, you have to also choose a N-bit
  polynomial, N-bit starting value, and N-bit output XOR
  value. There is no single standard, there are tens or
  hundreds of popular options
Apparently you've never heard of CRC32C!

http://www.evanjones.ca/crc32c.html

  Most programming languages have some CRC32 implementation
  - but do not document what parameters they use, or use
  different notation for the same parameters (forward and   
  reverse), or do not let you change the parameters, or do   
  not let you change the CRC size, or all of these. There is 
  no easy way to get a "standard CRC64 or CRC128" compatible 
  across platforms without putting it together from github  
  snippets and example code yourself.
Here are CRC32C implementations in a handful of popular languages:

- C: https://software.intel.com/en-us/node/503522

- Java: http://download.java.net/java/jdk9/docs/api/java/util/zip/CR...

- JavaScript: https://www.npmjs.com/package/fast-crc32c https://www.npmjs.com/package/crc32c

- Go: https://github.com/jacobsa/gcloud/blob/master/gcs/gcsutil/cr...

- Python: https://github.com/ludios/pycrc32c

- Ruby: http://www.rubydoc.info/gems/digest-crc/0.4.1/Digest/CRC32c

And again, I'm not actually recommending git switch to CRC (as a fan of security, I would prefer they use an actual collision resistant hash function). But CRC better meets Linus's stated requirements:

  - CRC will be faster (much faster as in ~4X, your numbers seem off to me) in almost all cases
  - CRC will *not* fail to detect a bitflip, whereas there's a certain probability a random oracle-based construction will


> I am likewise perplexed why "cryptographic hash functions are unnecessary in the absence of an attacker" is such a difficult concept for you to grasp.

Because you can't seem to tell the difference between "unnecessary" and "shouldn't be done".

If I build a shed then using larger screws on the door might be unnecessary but it only costs me .2% more and I know it won't fall over.

Using a recent SHA function might be overkill in a non-crypto context but it's high-quality and fast. And there's existing libraries for whatever language you want. Why the hell would anyone use CRC128? I've never even heard of it before.

It's a good hash choice, no matter how unnecessary.


Clearly it's not, because it's both several times slower than the CRC family, and was known to have cryptographic flaws before git was even released. "The worst of both worlds" so to speak...


> There is, however, a performance cost in using cryptographic primitives in non-security-related contexts. You may not care about performance, but it certainly matters for something like git.

The performance difference between SHA1 and SHA256 is less than 50%. Unless hashing is a significant percentage of git, which it isn't, this is an insignificant difference.

http://atodorov.org/blog/2013/02/05/performance-test-md5-sha...

Also, you can easily argue that version control is a security context.
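
If you'd rather measure than trust a 2013 blog post, a rough sketch with Python's hashlib (results vary a lot with the CPU, the OpenSSL build, and whether SHA hardware extensions are present):

    import hashlib
    import time

    def throughput(name, size=64 * 1024 * 1024):
        # Rough single-shot hashing throughput in MB/s for one algorithm.
        data = b"\x00" * size
        start = time.perf_counter()
        hashlib.new(name, data).digest()
        return size / (time.perf_counter() - start) / 1e6

    for algo in ("md5", "sha1", "sha256", "sha512"):
        print("%-8s %8.1f MB/s" % (algo, throughput(algo)))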


I absolutely believe git should use a cryptographically secure hash function.

But you're completely missing my point, which is about Linus's claim that git's use of SHA1 isn't security-critical, and about why he's using a cryptographically secure hash function anyway.


There are better cryptographic hash functions (e.g. Blake2) than SHA1 that are even faster. Check out the "results" slide of this slide deck for the cycles per byte table: https://blake2.net/acns/slides.html


Probably not. It's just that CRC128/CRC160/CRC256 primitives are not as common as MD5/SHA1/SHA256.

I did not, from Linus' post, get the impression he believes SHA1 is more magical than CRC except in the "preimage / second preimage" sense.


>Linus mistakenly claims using a cryptographic hash function helps avoid non-malicious collisions, but this is not the case.

Of course it does. It is borrowing from a different field (cryptography) to get a CRC that "really, really won't collide", because that whole field is completely busted if it does.

Let me put it this way. If I really, really need a random distribution of white noise, I might use a different field, cryptography, to provide it: because if the distribution is not effectively random and uniformly distributed, that field in some fundamental sense is broken: no information is supposed to make it into the ciphertext, it should be indistinguishable from white noise.

So encrypting your source of white noise for the sole purpose of making it statistically closer to noise is a perfectly valid choice.

Actually, in your comment you said it yourself: in as few as four billion commits CRC64 expects to see a collision. That is tiny compared to the search space cryptographers work with.

If you look at the history of git there was originally no reason to use cryptographic functions except in the same way as the analogy I just made (for white noise): he borrowed a property from a different field from the one he was working in.


You seem to be operating under the same sort of "cryptographic hash functions are magic!" delusions as Linus.

CRC and SHA1 both produce a uniform distribution. SHA1 does not magically do this better because cryptography. The only things that make CRC and SHA1 any different are:

  - SHA1 produces a longer tag (of course CRC256 is a thing)
  - SHA1 is hardened against preimage attacks
  - SHA1 was intended to be secure against collision attacks (not anymore!)
SHA1, truncated to 32 or 64-bits, will produce a distribution just as uniform as CRC.

In a non-security setting, we can pick the size of the tag based on the rough number of objects we'd like to be able to store before we'd expect to see a collision (i.e. the birthday bound). If that number is ~4 billion, then CRC64 is sufficient.


This should really be a top-level comment.


That's not the plan. That was an idea that was thrown out in case this was an emergency (the hard part is handling different-length hashes, and doing so in a way that doesn't force a flag day conversion), but once people realized that, in fact, the sky was not falling, the plan which Linus outlined in his G+ post was devised --- which does not involve truncating a 256-bit hash.


Can you please link me to "the plan" then? I have been trying to follow some of the ML discussion and that was the last plan I saw him put forth, e.g.:

https://marc.info/?l=git&m=148787047422954


By clicking "next in thread" in that very post you get Linus replying to himself with a non-truncating plan:

https://marc.info/?l=git&m=148787163023435&w=2


Thanks for the link.

Re: snark, I did hit "next in thread" but managed to skim over that in his response.

But again, thanks anyway.

(Note: this still sounds more like spitballing than "the plan", but at least it's a step in the right direction)


Can you link the new plan? I didn't see it described in his G+ post.


The G+ post was not meant to describe the plan, it was an overview of the situation for git users who aren't necessarily security experts.

The plan is outlined at https://marc.info/?l=git&m=148787163023435&w=2


Do you have any insight into the sha-256 vs sha-512 choice?


> Linus's transition plan seems to involve truncating SHA-256 to 160-bits.

This is not the plan. This is the backwards compatible system for tools that can only deal with 40 characters.


On the mailing list he suggested displaying 160 bits for backwards compatibility with existing tools. The full 256 bits would be used internally (and displayed externally with the proper --flag)


Additionally, a cosmetic point: instead of a hex string, could they not use a-zA-Z0-9 in the visual output to the user to make the git command-line text output shorter?

0efaa. -> 2AdC..
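
Roughly the idea (nothing git supports, just a sketch of the arithmetic with the a-zA-Z0-9 alphabet suggested above):

    import hashlib
    import string

    ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase  # base62

    def base62(digest: bytes) -> str:
        # Encode a digest as a big integer in base 62.
        n = int.from_bytes(digest, "big")
        out = []
        while n:
            n, r = divmod(n, 62)
            out.append(ALPHABET[r])
        return "".join(reversed(out)) or "0"

    digest = hashlib.sha1(b"example object").digest()
    print(digest.hex())    # 40 hex characters
    print(base62(digest))  # about 27 characters in base62

A 160-bit id drops from 40 characters to about 27, and a 256-bit id would be about 43 instead of 64.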


Unfortunately there's a lot of code out there that assumes commit hashes are case-insensitive and hexadecimal... I've written some myself. It'd be a hugely painful change.


Then you have stuff like 1, I and l, which are difficult to distinguish. Which was why base58 was invented (basically the range you suggested, without visually similar characters).


And zbase32 (https://philzimmermann.com/docs/human-oriented-base-32-encod...), which is my own preference for still being case-insensitive and thereby able to be used in subdomain names or email addresses.


This really should not be a problem, since that's task #1 that a good programmer's font needs to solve.


I often run into git hashes in places where a programming font wouldn't apply, like URLs and web pages.


>where a programming font wouldn't apply, like URLs

Any phisher's wet dream is a URL displayed in a font where l and I look the same. For anything that routinely displays URLs, font matters a lot. And any website displaying git hashes is likely to be programming related and have some code-font (good monospace or similar) in its repertoire.


> Linus just doesn't take this stuff seriously. I really wish he would, though.

Can't downvote this enough. This is plain FUD. Did you even read the complete thread on the git mailing list? This was just one proposal by him.


If he took this stuff seriously, he wouldn't have waited 12 years since SHA-1 was broken to even start considering any changes.


So did anyone on this thread send patches to the mailing list? Obviously you guys are very serious, right?


Linus ranted[0] in 2005 about how it's idiotic to worry about SHA-1 collisions, months after significant weaknesses[1] were found in it (which have recently been demonstrated by Google). I'm certainly not about to waste effort making a patch that he's going to reject as idiotic.

[0] http://www.gelato.unsw.edu.au/archives/git/0504/0885.html

[1] https://www.schneier.com/blog/archives/2005/02/sha1_broken.h...


This conversation has gone on several times over the course of years on the Git mailing list. In almost every situation it's been completely brushed off as a mostly non-issue to change from SHA-1 inside Git, despite the fact it's been known SHA-1 has basically been on life support. There are lots of opinions on both sides, but ultimately, until now, the Upstream decision seemed to be "WONTFIX".

Given this context, of course nobody wrote patches: they would have obviously been rejected and been a total waste of time. Until now, when we actually have to deal with it.

This is all aside from your argument being fundamentally weak, however ("you can't criticize anything unless I say so and you contributed by meeting these arbitrary standards. I mean, I didn't do anything either, I just get to make up the rules you abide by!!")


> This is all aside from your argument being fundamentally weak, however ("you can't criticize anything unless I say so and you contributed by meeting these arbitrary standards. I mean, I didn't do anything either, I just get to make up the rules you abide by!!")

Are we both reading the same GP comment? It reads as "If he took this stuff seriously, he wouldn't have waited 12 years since SHA-1 was broken to even start considering any changes.".


It's equivalent to the other comments. The others just have more facts to back it up & should've probably been the original. The reason it's equivalent is:

1. Saw a known issue that cryptographers and security people were warning him about.

2. Alternatives existed that didn't have that issue. People were pushing on it.

3. Ignored all that to tell them it wasn't a problem, the issue could never have real-world consequences, and he wasn't interested in fixes.

In practice, that means he didn't take it seriously. He also made sure it wouldn't get fixed by letting people know he wasn't making a change to it. One more example in a long line of them where Linus doesn't give a shit about security enough to apply known, good practices.


>(Also: if he's going to truncate a hash, he should use SHA-512, which will be faster on 64-bit platforms)

BLAKE2 is faster still! It's also at least as secure as SHA-3, and produces any choice of output size up to 512 bits.


I'm a fan of blake2, but SHA-2 is the conservative choice here given the sheer amount of cryptanalysis it's been subjected to.


> reversing commit hashes back into their contents

Somewhat off topic, but is this actually possible?

Given hashing is inherently lossy, I'm inclined to assume it's not possible for anything much longer than a password, but commits are text, which I suppose is low entropy per character, so I don't know.


It doesn't sound possible for all but the most trivial diffs. An attacker would have to guess not only the exact commit text, but also its timestamp. And the only attack scenario I can conceive of seems pretty silly. Maybe I'm missing something...

Alice clones an open source git repo, commits one secret change where she edits a config file's default password to her own secret password (a bad practice), and then publishes the new hash in public for some reason (build info?). Mallory would have to (a) know that exactly this happened, (b) guess the commit message, (c) guess the commit's timestamp to the second (or within a few seconds), and (d) preimage-attack her password.

And the preimage attack must pierce git's Merkle tree, which sounds downright impossible. (Unless Mallory is just bruteforcing, in which case a strong password is enough.)


Most commits are not just text, they are source code. If you had a feasible way to enumerate all the byte sequences that collapse into a given hash value, you might find the subset that is syntactically correct code to be very small. Except if it's Perl, of course.

And the likelihood of all the characters aligning in a way a compiler might find acceptable gets lower with every increase in length of your collisions, so it would be extremely unlikely that the shortest nontrivial match (for both the hash and for code sanity) would not be the right one. The code constraint certainly would not make it easier to find the collision in the first place, but it would give you great confidence in the result if you did.


No, you're correct that for large inputs there are obviously going to be many that lead to the same hash (see pigeonhole principle).


This is presumably true also of small inputs, given that the large and small inputs are all mapping to the same space.


Well not necessarily true for small inputs. Hash a single byte for example. There is no collision in sha1 for that so you can build a 1:1 mapping of hashes back to input examples for that case.

But yeah, as the input size approaches the output size, the probability of a collision existing gets to 1. The birthday paradox formula will give the probability of a collision (assumes random placement in output space) based on number of inputs.
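
For reference, the usual approximation is p ≈ 1 - exp(-n(n-1) / (2 * 2^bits)); a small Python sketch:

    import math

    def collision_probability(n, bits):
        # Birthday approximation: probability that n uniformly random
        # `bits`-bit values contain at least one collision.
        d = 2.0 ** bits
        return -math.expm1(-n * (n - 1) / (2.0 * d))

    # e.g. a few million objects against a 64-bit tag vs. git's 160 bits
    for bits in (64, 160):
        print(bits, collision_probability(5 * 10**6, bits))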


> Well not necessarily true for small inputs. Hash a single byte for example. There is no collision in sha1 for that so you can build a 1:1 mapping of hashes back to input examples for that case.

I suppose you are saying that if you know that the input size is sufficiently small, you don't have to worry about collisions, which is true. I was interpreting "for small inputs" to mean that if you give a small input to a hash function (which can take inputs much larger than the space of the output), that you can still reconstruct the small inputs uniquely. Unless the hash function is deliberately designed to provide unique 1:1 mappings for small inputs, I would think that it's not true that you can uniquely reconstruct small inputs because they will likely map to the same value as a large input would (i.e. 'a' might hash to the same value as some 14821-byte string).


Yes this is possible. It's called a preimage attack:

https://en.wikipedia.org/wiki/Preimage_attack

It's computationally infeasible against most cryptographically secure hash functions. However, Grover's algorithm results in a sqrt(keyspace) reduction in security levels (effectively halving bit-sized security levels):

https://en.wikipedia.org/wiki/Grover's_algorithm


A successful preimage attack against a cryptographic hash would most likely find you a different message that gets you the same hash. You wouldn't do a preimage attack to find what the original input to create a hash was. You would do a preimage attack to find a new input with the same hash so that you could pass this new input around claiming it was the original (and have any signatures for the old input be valid for your new input, etc).


What you described is called a "second preimage attack"; that's != a "first preimage attack".


In a second preimage attack, you start with an input that comes out to a hash, and you're tasked with making a distinct input that comes out to the same hash.

In a first preimage attack, you start with only the hash, and have to come up with an input from scratch that comes out to that hash. Even in a first preimage attack, you're extremely unlikely to craft the same input that someone else used to create the hash you were given. By the pigeonhole principle, the number of inputs that come out to any given (shorter) hash is enormous.


>Preimage resistance does matter

Is there a preimage weakness though? I thought this attack only reduced collision resistance.


GP said that in reference to quantum attacks.


I generally defer to someone such as Linus as having far more domain knowledge on something like this, but I'm concerned about the willingness to just drop bits of the hash like this. I mean, 20-30 years ago I could quasi-understand it for the sake of performance, but are we really so concerned about the performance of a few clock cycles vs opening ourselves up to a known attack?


I think the truncation is being considered because a lot of software assumes a git commit id is 160 bits long. It's not because of performance.


There are only 2 reasonable rationales for the 160 bit restriction: it was arbitrary because that was the length of the chosen hash, or for performance (and no one will ever need more than 640kb of RAM, or 160 bits of hash). I'm guessing it was probably the first, though, and Linus designed for the immediate now, and not for the future.


It's not even a "restriction", so much as a fragile ecosystem of tools that parse the output of git commands. A lot of them assume particular column widths.


What is this known vulnerable attack you are talking about? The birthday paradox applies to every hashing function.


While there may not be a known, direct, attack against git, WebKit's SVN repository was demonstrably attacked by the SHA-1 vulnerability, oddly by their own developers today (or yesterday?). That it took that little time from a PoC to an actual production issue leads me to believe it wouldn't take long for a dedicated individual or team to extend the vulnerability to git and its "mitigation" involving the header.

I prefer not to take the bury my head in the sand approach to this, at least when it comes to public repositories.


Linus talked about all of this and why the way git works makes it even harder, supremely harder, to pull off. Nobody is taking the "bury my head" road and Linus talked about ways forward.


- Truncating to 160-bits still has a birthday bound at 80-bits. That would still require a lot more brute force than the 2^63 computations involved to find this collision, but it is much weaker than is generally considered secure

What level is considered secure, then? The numbers for O(time) and O(space) should be many orders of magnitude apart to represent the relative costs.


> - Post-quantum, this means there will only be 80-bits of preimage resistance

Post-quantum, it's only ~2^53 work to find a collision. IMO that's worse.


> especially in any repos containing binary blobs

Yeah, that part is the real flaw in the argument.


If a repo contains binary blobs, especially executables, well, that's very bad practice right there. Also, how can somebody else modify a binary in a meaningful way and send a patch to it? How can you review a patch to a binary file before applying? I'd say that any sane project, especially if open source, would not include binaries (maybe apart from images), and even if it did, would not accept patches to them (if the members of a project were talking about changing, say, an icon, valid images would be exchanged on a mailing list/issue tracker; nobody would bother making, sending and applying binary diffs; then somebody w/ the commit bit would just commit it).

Let me put it more simply: if you're accepting patches for binary files in your repos you don't care about security at all. Maybe unless if you know how to decode machine code/JPEG manually.

Also, the proposition was to use the full hash internally, and truncate the representation. Nobody other than git itself uses full hashes anyways.


Executables in git are bad practice, but not all that uncommon. Images in git are the norm, and if somebody comes in and creates a pull request with an improved version of the existing images (better compression, better adapted for color blind people, fixing whitespace issues etc.) that's pretty unsuspicious and likely to succeed (and I've seen it multiple times).


But how would a safer hash improve resistance to that blob update scenario? How would a weaker hash make it more dangerous? If you have a fiercely audited branch next to one that is basically free for all, maybe?


One thing SHA-256 has going for it is that millions can be made from finding pre-image weaknesses in it, because it's used in Bitcoin mining. If you could "figure out" SHA-256, and use it to take over Bitcoin mining, you'd make $2M the first 24 hours, at current rates. And if you play it wise, it could take a long time before anyone figure out what's going on.

With regards to market price for a successful attack, I don't think any hash function stands close to SHA-256. And for that reason I think it would be the right choice.



I think the claim is interesting, and I certainly wouldn't reject it. But just relaying someone else's claim, without any substantiating argument, I find quite uninteresting.


Jean-Philippe Aumasson[1] is an important, currently active cryptographer who has studied this particular problem closely. If he says he does not expect an algorithmic break of SHA2, I believe him, and I don't need to plan to switch from SHA2 to SHA3 in a hurry.

I don't have the math to understand a more detailed explanation for his reasons to make such a statement. If you do have the math, ask him.

1 - https://131002.net/index.html


SHA-512/256 is faster (on most 64-bit hardware) and has the same output length as SHA-256.
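
A quick, unscientific way to see the 64-bit speed difference with Python's hashlib (a sketch; SHA-512 is a fair proxy here, since SHA-512/256 is just SHA-512 with a different IV and a truncated output, and note that CPUs with dedicated SHA-256 instructions can flip the result):

    import hashlib, os, timeit

    data = os.urandom(1 << 20)  # 1 MiB of input

    for name in ("sha256", "sha512"):
        algo = getattr(hashlib, name)
        secs = timeit.timeit(lambda: algo(data).digest(), number=200)
        print(name, "%.3fs for 200 MiB" % secs)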


I don't really get the threat model here. If an attacker is pushing commits into your repository, you're long since toast on all possible security fronts, right? Is there anything nefarious they could accomplish through hash collisions that couldn't be done simply by editing commit history?


Not really. From Linus — I think the most important point that has not been discussed extensively:

> But if you use git for source control like in the kernel, the stuff you really care about is source code, which is very much a transparent medium. If somebody inserts random odd generated crud in the middle of your source code, you will absolutely notice.


> If somebody inserts random odd generated crud in the middle of your source code, you will absolutely notice

Unless I'm misunderstanding how this would work in practice (I assume commits being added under my name or existing commits being modified?), I'm absolutely not buying that, though. Way too general, no? I would likely notice something like that, but I know enough colleagues who are just as likely not to notice. Mostly because they don't fully grasp git, so they're just like 'hmm, thing says I'm behind, ha, I know the fix is pull, yes I'm a git wizard' and just pull, then continue working without even checking what got changed.


Sure, most people don't review all the code that they pull.

But if you are in a position to insert "garbage data" into a commit in preparation for an attack, it will be much easier to insert malicious but safe-looking code in the first place. Omit an array bounds check, disable SSL certificate validation, or anything else that looks like a mistake but will allow you to compromise the running code later.
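
As a hypothetical illustration (not from any real project), here is the kind of "mistake" that sails through review far more easily than a blob of collision garbage would:

    import ssl

    # Looks like a leftover test shortcut, but it silently disables
    # TLS certificate validation for every connection using this context.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE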


How about something very subtle, hidden in with the noise? Like this attempt to backdoor Linux in 2003:

https://lwn.net/Articles/57135/

Hard to get a collision though I'm guessing.


> random odd generated crud in the middle of your source code, you will absolutely notice

like non-printing characters in comments?

Or, you know, random odd generated crud https://github.com/torvalds/linux/tree/master/firmware/radeo...


If you were planning on getting a compromised binary blob into a git repository, you would not bother with the whole SHA-1 collision and switching things out later.

You would just build a hidden backdoor into the first binary blob, like a deliberately omitted array bounds check if specific parameters are used. Reviewers and tests won't find that in compiled binary code either, it's much easier to hide than large amounts of garbage data, and you even get to keep plausible deniability of it being an honest mistake if discovered.


Also, to blunt my own criticism: that directory is not getting new commits.


> If somebody inserts random odd generated crud in the middle of your source code, you will absolutely notice.

Just as they noticed the intrusion on kernel.org servers after how long?

edit:

> If an attacker is pushing commits into your repository, you're long since toast on all possible security fronts, right?

Sure, that's always true, in the worst case. Perfect security doesn't exist. But does it need to be an attacker? Couldn't it just happen that an important bugfix isn't recorded because the hash collides?


> Just as they noticed the intrusion on kernel.org servers after how long?

It doesn't work retroactively. That crud needs to be there in the initial commit when it's accepted by multiple people.

> Couldn't it just happen that an important bugfix isn't recorded because the hash collides?

The hashes will never collide if there's no attacker. And that bugfix would have to have a blob of crud in the 'good' version, which would stick out like a sore thumb.


> It doesn't work retroactively. That crud needs to be there in the initial commit when it's accepted by multiple people.

Do you assume that a proper preimage attack would be needed? Instead, the good-looking object could be committed first, I guess, after reading https://news.ycombinator.com/item?id=13721237

edit:

> And that bugfix would have to have a blob of crud in the 'good' version

Looking at Google's PoC, the colliding PDFs don't stick out at all (edit: but they are horribly contrived, and you talked about code, not blobs) [https://shattered.io/]

edit: thanks for your answer, it motivated me to learn a bit more. I'm still not sure if "never" is a rounding error, hyperbole (which would be ill-advised when talking to laymen), or actual fact (that source files, due to the reduced entropy in byte patterns, could in fact never ever collide, which I doubt).


> the good looking object could be committed

In source code there will be an obvious block of "hash was attacked here" crud in each file. The 'good looking' object will function non-maliciously, but it will be obvious that something is up.

> I'm still not sure if "never" is a rounding error, hyperbole (which would be ill advised talking to laymen), or actual fact (that source files due to the reduced entropy in byte patterns could in fact never ever collide, which I doubt)

"never" is to be read as "over a billion times more unlikely than all life on the planet suddenly being wiped out". It's not going to happen accidentally.


Well, open source software like the Linux kernel has had many obvious bugs go unnoticed for years. But his point is still extremely fair. There may be code no one is really actively updating, but even being able to add random odd crud to the source code is really, really hard. The effort is not worth it for any reasonable attacker. One would rather exploit an existing zero-day, or just infect the network and wait to activate the infection. If someone modifies the local source code you just git-cloned (or got from a tarball), well, you have something else to worry about: your network.


The kernel, and OpenSSH to name another project, are full of random crud nobody noticed. Everyone does their best to review patches and ensure things are sane when they go in, but there's a lot of change in projects at scale, and sometimes the standards one patch wrangler has are different from another's. Mistakes happen.

If at-patch time is your only code review, you've got problems. If your code base is too huge to look over frequently you've got problems.


And context is important as well. Worrying about SHA exploits in an environment like this is like wondering whether you watered the plants while the house burns down. The security-theater troupe wails like there's no tomorrow when their buttons get pushed. As Linus said, given the context and the risk, it's of almost no consequence.


I'm in agreement with Linus, but it also highlights problems with code bases this complex. We depend on code being readable to spot malicious activity. Generating a SHA1 collision to jam in a very subtle bug is highly improbable, and might even be infeasible if the diff is small enough.

There's certainly a concern if your code is more opaque, as is the case in this bug. If you're taking in raw asm.js code, for example, from various sources...


If they edit the commit history and you're using a secure hash algorithm, then the hash of the current commit will change and no longer match the signed tag your trusted maintainer sent you.


One thing that I think is worth mentioning: this was completely avoidable. Git isn't that old; it wasn't taken by surprise by the SHA-1 attacks.

The first collision results from Wang et al., which should have put SHA-1 to rest, were announced in 2004-2005, right around the time the first ever Git version was released. It could have been easy: just take a secure hash from the beginning.


If anyone is really interested in more assurance of git commit contents, there's "git-evtag", which does a sha-512 hash over the full contents of a commit, including all trees and blob contents.

https://github.com/cgwalters/git-evtag


While this post sounds very reasonable to me there's one point that I really don't get: why does he keep saying that git commit hashes have nothing to do with security?

If he believes that, why does git allow signing tags and commits and why does Linus himself sign kernel release tags? Isn't that the very definition of "using a hash for security"?


Related, from Mozilla:

* The end of SHA-1 on the Public Web

https://blog.mozilla.org/security/2017/02/23/the-end-of-sha-...

As announced last fall, we’ve been disabling SHA-1 for increasing numbers of Firefox users since the release of Firefox 51 using a gradual phase-in technique. Tomorrow [Feb 24th], this deprecation policy will reach all Firefox users. It is enabled by default in Firefox 52.


To be fair (although I've been commenting angrily about git's continued use of SHA-1 elsewhere) it's a lot easier for a browser to change hash algorithms than for git.


Linus is a little behind the times with this comment:

``Other SCM's have used things like CRC's for error detection, although honestly the most common error handling method in most SCM's tends to be "tough luck, maybe your data is there, maybe it isn't, I don't care".''

BitKeeper has an error-detection (CRC per block) and error-correction (XOR block at the end) system. Any single-block loss is correctable. Block sizes vary with file size, so large files have to lose a large amount of data to be non-correctable.
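
The recovery trick is simple parity, something like this toy sketch (illustrating the idea only, not BitKeeper's actual on-disk format):

    # One XOR parity block appended at the end lets you rebuild any
    # single lost data block.
    blocks = [b"aaaa", b"bbbb", b"cccc"]
    parity = bytes(a ^ b ^ c for a, b, c in zip(*blocks))

    # Suppose blocks[1] is lost; XOR the survivors with the parity block.
    recovered = bytes(a ^ c ^ p for a, c, p in zip(blocks[0], blocks[2], parity))
    assert recovered == blocks[1]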


In the specific case of cryptography where it's unknown how bulletproof the algorithm will be why not use multiple hash functions? Perhaps using the top 10 best hash functions of the day. That way you're not putting all your eggs in one basket and if nefarious collisions are able to be created in the future you still have the other hash functions to both "trust" and double check against. It's even more unlikely that nefarious collisions will be able to be constructed that collide all the other hash functions as well. You could just append the hashes to each other or put them in a hash table or something. Maybe my computer science is not up to snuff but it seems like this would provide more resiliency against future and non-public mathematical breakthroughs as well as increased computing power such as quantum computing. Yes, it would take a little longer to compute all the hashes in day to day use, but with the benefit of a more robust system both now and in the future.
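
Mechanically it's easy enough; a hypothetical combined identifier might look like this (sketch only; for iterated hashes, concatenation buys less extra security than the total digest length suggests, though it is still at least as strong as the strongest member):

    import hashlib

    def multi_hash(data: bytes) -> str:
        # Hypothetical combined identifier: concatenated digests from
        # several independent hash functions.  A forged object would
        # have to collide under every one of them simultaneously.
        names = ("sha256", "sha3_256", "blake2b")
        return "-".join(hashlib.new(n, data).hexdigest() for n in names)

    print(multi_hash(b"example object contents"))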


Have there been writings on what exactly git's migration strategy to a new hash function will be? Apparently they have a seamless transition designed that won't require anyone to update their repositories, which seems like a pretty crazy promise in the absence of details.


In git the SHA-1 hash is simply an identifier for an object - it's used in the filename, but not stored in the object. And when a commit or tree object references others, it's just a name that can be looked up in the database. So a commit object hashed with SHA-256 can easily reference a previous commit that was hashed with SHA-1.

During the switch, a bit of deduplication may be lost. But the only interesting issue I can see is how git fsck will tell which hash an object was created with when verifying the hash (maybe with length?).
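
If it really does come down to length, a sketch of how the two generations could be told apart (purely hypothetical scheme, not anything git has committed to):

    # Hypothetical: distinguish hash generations by identifier length.
    def hash_kind(object_id: str) -> str:
        if len(object_id) == 40:    # 160-bit SHA-1, hex-encoded
            return "sha1"
        if len(object_id) == 64:    # 256-bit hash, hex-encoded
            return "sha256"
        raise ValueError("unrecognized object id: %r" % object_id)

    print(hash_kind("a" * 40))      # sha1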


Maybe some kind of "git update repo" command?


Also see discussion of Linus's earlier comments at https://news.ycombinator.com/item?id=13719368


Newbie question... can someone please help me understand the attack scenario? If I, as the attacker, want to inject malicious code/binary into a git repo, then I need to write my malicious code/binary in such a way that the resultant hash collides with one of the commits (? or the last one?) in the repo. Is this correct?


The sky probably isn't falling. But if knowing the length fixed all hash function issues, then cryptographic hashes would just use some more bits for length.


Can someone correct me? SVN/Subversion and Git are both affected by the SHA-1 problem. SVN uses SHA-1 internally but exposes only a numeric int as the revision. Git uses SHA-1 internally and as the revision. So if someone commits a modified PDF that collides, he can wreak havoc on both SVN and Git at the moment. It seems easier to fix the issue in SVN than in Git.


It's a somewhat different issue.

Git can probably not be havoced by committing two colliding files (and doing so would require doing another chosen-prefix attack with a git blob header). But git loses its cryptographic integrity promises due to this attack (aka: you can have different source trees with different histories leading to the same top commit hash). svn never had any cryptographic integrity to begin with.


Sure, here's a correction: there has been no evidence of havoc or repo corruption in git.

It appears that to even try to attack git, you'd need to spend the same amount of work again (the ~$110k or whatever it was) to create a new collision with the right git object headers.
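
That's because git hashes an object header plus the contents, not the raw file bytes, so an existing collision on raw files doesn't carry over. Roughly how a blob's id is computed (a minimal sketch; the test input is just "hello\n"):

    import hashlib

    def git_blob_sha1(content: bytes) -> str:
        # Git hashes "blob <size>\0" + content rather than the raw bytes,
        # so a colliding prefix has to be rebuilt around this header.
        header = b"blob %d\0" % len(content)
        return hashlib.sha1(header + content).hexdigest()

    # Should agree with `git hash-object --stdin` fed the same bytes.
    print(git_blob_sha1(b"hello\n"))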


I do wonder how many people outside of crypto circles knew about SHA-2 circa 2004.


This, btw, is why we have e-cigarette bans. The fact that the generally high-IQ, paid-to-think-about-subtle-categorization community of software developers needs to be inoculated against the "I Heard SHA-1 Was Bad Now" meme, should serve as a reminder for why most things should not be managed by democracy.

(Yeah, I know this will be read as a plea for monarchy and downvoted. It simply proves my point: people are WAY too subject to errors in the classes (1) "I hate him because he said something 'bad' about something 'good'" and (2) "I hate him because he said something 'good' about something The Tribe now knows is 'bad'.")


Save yourself some downvotes and remove the mention that you expect them.


Save myself from people proving my point? Why?


The HN guidelines specifically ask not to express the expectation of downvotes. You may be downvoted purely for ignoring the guidelines, regardless of the rest of your comment.

Please don't bait other users by inviting them to downvote you or proclaim that you expect to get downvoted.

https://news.ycombinator.com/newsguidelines.html


A generally reasonable guideline, but in this case, I am actually criticizing the tribalism that makes people rise to that bait.

It similarly leads to the discussion of how "I can't believe Linus is trying to defend SHA-1 when The Tribe already knows it is cryptographically 'bad'."


If you are taking the view that you're expecting downvotes to prove the point that people who are trying to uphold community standards are doing so blindly or ignorantly, you'll very likely think you're proven correct when you do receive downvotes. Can you blame them? You're explicitly flaunting the guidelines they choose to abide by while telling them they're wrong to do so in your special case.


ahem... "flouting"


You're right, of course. Thanks!


Um what? Software written in the past 20 years has a baked-in assumption that the length of some ID can't change?


I'm mystified as to why this is even a discussion.

SHA1 is busted. That impacts some git users. The fix is not invasive. Fix the bug. Make the transition. Move on.

Super unprofessional.


"Unprofessional" is an odd choice of word. I cannot see what you mean.

He explains why he wrote the post: "since the SHA1 collision attack was so prominently in the news." I think a lot of people on Hacker News and elsewhere are interested in both the SHA1 collision and how it affects git, even if it doesn't impact them.

Still, even if this were some obscure bug that nobody cared about, what would be unprofessional about explaining his plan to fix it? It's an open-source project, after all.

If anything, I'd say it's quite professional that he's calmly explained why there's no need to panic, that they have everything under control, and that the path forward should be smooth.


The issue is that this could have been fixed years ago. All the information needed to move forward today was also available the last time this got brought up. Fixing it then would have been better for git's users (because they wouldn't have freaked out over this), better for git (because it never would have even come up in the SHA1 discussion), and better for Linus (because he wouldn't have had to spend an hour writing a blog post about this). So why not do it then?

It's frustrating, because Linus clearly isn't stupid. And yet sometimes he does stuff like this where I can't help but go "how did you not see the unforced error you were making?".


It's not that simple. Git is widely used software, integrated into many places, so keeping some backwards compatibility is important.

Just going ahead and starting to break things would really be unprofessional.


They did break backwards compatibility with git v2.0 but sadly they did not bother to change the hash function.


What backwards compatibility was broken?



I'm not saying he should go out and break things now. I'm saying that the way forward he outlines could have been followed years ago, as essentially the rest of the industry did. Instead he waited until his user base was totally panicked to address their concerns.

In my view, that's not the right-- or even a rational-- way to do things.


What is the realistic worst case situation here? From what I understand, git uses SHA1 as a way of generating an id for a file, not for security. So two files might match up when they shouldn't? Is that it or is there more to this?

Or is this a proverbial "since SHA1 doesn't work over there, it shouldn't be used anywhere under any circumstances" attack?


As Linus touches on in the article, SHA1 is used for signing in git. This is clearly a security function, and should not have depended on SHA1 for at least the last several years.

Additionally, the defense offered against substituting binary blobs essentially comes down to "well, the kernel doesn't do that". Respectfully, other projects do. Those users' concerns were not taken seriously until lots of unrelated users freaked out about SHA1 for bad reasons.

Regarding the worst case scenario, yeah, substituting one binary blob for a different one (say, in Google's AOSP git repo) would be the worst. Which isn't sky-is-falling bad, but would still be pretty ugly.


> Additionally, the defense offered against substituting binary blobs essentially comes down to "well, the kernel doesn't do that".

Why should it matter? Git has been designed for the needs of the kernel project from the very beginning[1].

It is us, mere mortals, who are at fault for picking a tool that was not designed with our needs in mind, just because that tool happened to be better suited for our needs than the alternatives it was designed to replace.

The sky isn't falling, and if you really believe it is, well, there's Mercurial, although you may also feel uneasy since Mercurial also asks you not to panic[2]. Maybe you should just build your own SCM?

[1] https://git-scm.com/book/en/v2/Getting-Started-A-Short-Histo...

[2] https://www.mercurial-scm.org/wiki/mpm/SHA1


I'm not sure if you're being sarcastic here.

Git supports binary blobs. That's a choice that was made way back, and it is not the fault of the user for using that feature. If Linus didn't want people to use git for binary blobs he probably shouldn't have added support for it.

Once he did add support for binary blobs he had an obligation to take the security of that component and its users seriously. And as I said before, although the risk is not sky-is-falling bad, it is present. I think even Linus agrees with that now or he wouldn't be working on a path forward at all.


Perhaps you should re-read the last paragraph of the article.


Can you explain why? I went back and reread it and it said the same thing I thought it did: that the fix wasn't a big deal and they were moving forward on it.

I agree it isn't a hairy fix, but I think it would have been better for everybody if it had been made before a SHA1 collision was found, as has been repeatedly proposed to Linus in the past. This would have avoided the entire discussion about SHA1 and git, and all the angst that went with it.



