This latter leads into the problems with Linus' positions in particular. In that thread, he does not take seriously the threats that this poses to the broader git userbase, because he only seems to care about the kernel use-case: trusted hosting infrastructure at kernel.org (itself an iffy assumption, given previous hacks and the use of mirrors), and the exclusive storage of human-readable text in the repo, which makes binary garbage harder to sneak in. These do not apply to most users of git. His rather extreme public position (paraphrased, "our security doesn't depend on SHA1") is even more troubling - it absolutely does depend on SHA1; this just isn't (yet) a strong enough attack to absolutely screw over the kernel. A stronger collision attack (e.g. a chosen-prefix as opposed to identical-prefix, or god forbid a pre-image attack) would absolutely invalidate the whole git security model.
As @tytso points out, there is ongoing work to replace the use of SHA-1. Also, yes, SHAttered raises serious concerns about the safety of SHA-1, but that doesn't mean everyone has to immediately work on the switch (and honestly, it's questionable how much it would help - throwing more developers at a problem doesn't necessarily help). Should Stefan Beller drop the work he's doing on submodules because SHA-1's demise is more imminent now?
It's also worth noting Linus isn't really a core Git dev any more, he just submits patches occasionally. Junio Hamano is the primary maintainer.
$ sha1sum shattered*
$ git hash-object shattered*
Luckily, there is also a known solution for detecting that kind of file.
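For anyone puzzled why those two commands disagree: git doesn't hash the raw file, it hashes a "blob <size>\0" header followed by the contents, and that prefix changes the internal hash state before the colliding blocks, so the SHAttered PDFs get distinct git object IDs even though their plain SHA-1s match. A rough sketch in Python (not git's actual code):

    # Sketch: how git derives a blob's object ID -- SHA-1 over "blob <size>\0" + contents.
    # The prepended header is why the SHAttered PDFs no longer collide as git blobs.
    import hashlib
    import sys

    def git_blob_id(path):
        data = open(path, "rb").read()
        header = b"blob %d\0" % len(data)   # git's object header for blobs
        return hashlib.sha1(header + data).hexdigest()

    for path in sys.argv[1:]:               # e.g. shattered-1.pdf shattered-2.pdf
        print(git_blob_id(path), path)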
Like many volunteer projects, the work would go faster if more people helped. What "the git project" considers important works the same way as "how much is bitcoin worth", or "how much is gold worth". People who think it's important can show the value by putting their effort where they think it is most important. Or, of course, people can snipe from the side-lines, and make themselves feel all self-important.
There has actually been a lot of thought about how to do a graceful transition period, and also about when to switch from a soft cutover to a hard cutover (breaking backwards compatibility and requiring people to upgrade their clients), to allow developers to upgrade at a reasonable rate, where "reasonable" can be defined on a per-repository basis. So if you want to force a flag day as soon as the code is available, and not wait for stability testing, etc., those people who value security über alles, and who don't want to rely on trusting kernel.org, etc., can do so.
I will note, though, that most people are doing blind pulls, or worse, blind merges, from developers outside of their immediate circle of trust _all_ _the_ _time_. Heck, people will cut and paste command lines from web pages of the form "curl http://alfred.e.numman/what/me/worry | bash" into root shells all the time! So if you are not auditing every line of code before a git pull, the fact that git is using SHA-1 is the least of your worries.
Personally, if I were a nation state planning on trying to insert the equivalent of a DUAL-EC backdoor into open source software, I'd do that by spending a person year or ten getting a collection of developers to be trusted contributors to some key open source project, like Docker, or Python, or even yes, the Linux kernel or git, and then "accidentally on purpose" introducing a buffer-overrun or some other zero day into said OSS code base. Or heck, just simply invest in finding more zero days that people have inserted into their code just because they're not careful!
So, sure, "we" should upgrade git to be able to support multiple crypto hash algorithms, and there is work going into doing this. But at the same time, it's important to keep a sense of perspective on all of this --- unless, of course, your goal is to make yourself seem important by exuding a sense of self-righteousness.
Do they? I've never encountered anyone using a root shell to go about their business unless they had to do some exclusive operation that could only be done as root, and even then they would rather just sudo that specific operation. Hell, even if they do do that, they have bigger problems than just git, given that cut and paste is inherently broken.
sudo -v && wget -nv -O- https://download.calibre-ebook.com/linux-installer.py | sudo python -c "import sys; main=lambda:sys.stderr.write('Download failed\n'); exec(sys.stdin.read()); main()"
Or you can trust my docker image, where I've done this for you:
(Hint: blindly using my docker image is only slightly better from a security perspective. What you _should_ do is download the dockerfile, audit it carefully, and then create your own docker image. And then you're _still_ trusting the Calibre folks to have access to your X server....)
EDIT: Also, holy s@@@ e-calibre, your advice (with a mild warning) if people get certificate errors is to pass in --no-check-certificate.
Maybe you should encounter more devs?
edit: rephrase: Git has no security model.
- Truncating to 160 bits still has a birthday bound at 80 bits. That would still require a lot more brute force than the ~2^63 computations involved in finding this collision (rough numbers sketched below), but it is much weaker than is generally considered secure
- Post-quantum, this means there will only be 80 bits of preimage resistance
(Also: if he's going to truncate a hash, he should use SHA-512, which will be faster on 64-bit platforms)
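To put rough numbers on the bounds above (a back-of-the-envelope sketch; the only figure taken from the attack itself is the reported ~2^63 SHAttered cost, the rest are generic-attack estimates):

    # Back-of-the-envelope security levels for a hash truncated to 160 bits.
    bits = 160
    classical_collision = 2 ** (bits // 2)  # birthday bound: ~2^80
    shattered_work      = 2 ** 63           # reported cost of the SHAttered collision
    grover_preimage     = 2 ** (bits // 2)  # Grover's algorithm: ~2^80 preimage work

    # The birthday bound is 2^17 (~131,000x) more work than the SHAttered collision.
    print(classical_collision // shattered_work)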
Do either of these weak security levels impact Git?
Preimage resistance does matter if we're worried about attackers reversing commit hashes back into their contents. Linus doesn't seem to care about this one, but I think he should.
Collision resistance absolutely matters for the commit signing case, and once again Linus is downplaying this. He starts off talking about how they're not doing that, then halfway through adds an "oh wait, but some people do that", then tries to downplay it again by talking about how an attacker would need to influence the original commit.
Of course, this happens all the time: it's called a pull request. Linus insists that prior proper source code review will prevent an attacker who sends you a malicious pull request from being able to pull off a chosen prefix collision. I have doubts about that, especially in any repos containing binary blobs (and especially if those binary blobs are executables)
Linus just doesn't take this stuff seriously. I really wish he would, though.
This is horseshit, and Linus should not be saying these hugely misleading statements about security principles.
The point of a hash is to remove the need for trust between the trusted person who tells you the hash and the infrastructure you actually get the hashed data from (edit: and between you and the latter).
In other words, once you get a good non-colliding hash from a trusted person, then you don't need to worry about malicious infrastructure sending you bad data claiming to be the source of that hash.
Linus trusting Tytso to sign the commit object that references the SHA-1 of the tree object, says nothing about whether the infrastructure served him the tree object correctly. Sure, he might also trust the infrastructure providers, but when he says "trusts people" it does not sound like that is what he means. And even if he trusts the infrastructure providers, with a good hash HE DOESN'T HAVE TO.
The "trust" wording is serious horseshit.
(edit: there is also the case of people downloading "linux" from random git repos in the future. Right now if you GPG-sign a commit or tag, it has SHA-1 references to the tree object underneath it. Once SHA-1 is more broken it basically means you shouldn't trust random git repos across the internet to give you good content, even if it's "signed by Linus".)
Linus could have designed a cryptographically perfect system such that he could pull Tytso's signed commit from anywhere on the internet - but he didn't
Linus used sha1 as a useful tool for an effective DVCS with an initially simple implementation.
He still depends on the security of the kernel.org servers, his work computers, and the top submaintainer's work computers. His git trees and all the submaintainers he pulls from are hosted on kernel.org servers. Security of the kernel.org servers is taken very seriously, especially since the well-known break-in a few years ago. Now even two-factor auth is involved in all git pushes to kernel.org servers. Sub-maintainers make pull-requests by branch name - "please pull branch for-linus ...". More peripheral contributors submit their work via patches on LKML, no commit hashes involved.
Finally, no well-known SCM previous to git was based on perfect cryptographic proof of source history, or anything like that. It wasn't a big issue, and it's not the problem git focused on solving. Before git, we all used CVS, SVN, tarballs, and patches. And a significant portion of developers did not use any VCS at all. How could any of us have trusted any source code before 2005?! Somehow we did, though...
Yeah, and that sucked.
Git is this close (I'm holding my fingers very close together) to providing cryptographic proof of source history. And that's why people think it should be included.
They used pgp to sign the tarball, which was a much better idea, since you could just use a different hash function for your signature after sha1 was broken in 2005.
Everybody serious about security kept doing that, since signed git commits were just asking for trouble due to the hard dependency on sha1.
This is actually false.
Monotone, from which git takes a lot of cues (ask linus, or see wikipedia), was such a system.
Because that's a pretty small definition of "well known".
Inside the VCS community, they were very well known.
Graydon also went on to start Rust, of course.
Outside of that, even to corporations I dealt with (i was working on SVN at the time), plenty knew it existed, a few evaluated it.
" But both of these were experimental and extremely slow at the time git was designed."
I feel like this is just you trying to say "well, they never would have worked anyway".
DARCS was slow and couldn't be fixed, monotone was actually not that bad, and could be made very fast if necessary.
Monotone was slow precisely because it cared about integrity. If you made git care about integrity and security in the same way ... it would be just as slow!
Saying they were "experimental" is silly.
Monotone was self hosting and nobody had found data corruption or other issues in quite a while (IE > 1 year), AFAIK. This is much better than git was for a long time. To put this in perspective, i converted the gcc repository (hundreds of thousands of revisions, in fact, many more than the kernel at the time, and history going back to 1983) to monotone, and
1. the conversion worked with no issues
2. Speed was fine in most cases. If i worked hard, i could find issues, but ....
git was beyond experimental at the time it was designed (IE data loss and repeated crashes).
But its design is essentially that of monotone in a different container and with slightly different goals.
(IE they precisely gave up the integrity part that monotone verified and cared about, and people are now complaining about).
In truth, if Linus had cared about cryptographic security, he probably would have just stuck with monotone and rewritten parts of it.
But he didn't, so he gave that up in the name of speed.
Which is fine, but let's not pretend that Linus somehow has a monopoly on design, or even was the first to design a VCS that was like git.
Any of the distributed VCSen could have beaten git. They just didn't care about the tradeoffs that git was making (IE non-portability in favor of speed), and honestly, it's pretty obvious git would have been laughed out of existence if it hadn't been supported by someone so popular to so many people.
In fact, it was laughed out of existence, the past N times people had devised VCSen that favored speed over portability.
Git is interesting as an example of how good marketing with the right figures sometimes leads to winning in the marketplace for long enough that you can fix the rest of your serious issues (IE everyone else had the OS/2 problem). Even more interesting is that everyone else backfills history to make it seem like it was the best and clearly the right set of tradeoffs from the start. It wasn't. It may not even be now!
It's like any other system.
Please don't try to backfill history here. I was there, as were many others.
I also was surprised that git became so popular, I thought mercurial would win for "ease of use". But I've never been an advocate or influencer of popularity. I think that's mostly due to GitHub, which got it to a market-share tipping point.
Inside the PL community, tons of obscure languages with fewer than 10,000 users are "well known". Being known among some niche's experts is a pretty low bar. The "VCS" community is insignificant in size compared to the developer community at large.
>Graydon also went on to start Rust, of course.
Which is irrelevant as to whether Darcs and Monotone are well known.
What's the point of GPG signing tags and commits and adding these as git features, when SHA1 pointers make it pointless?
If the cryptographic strength of the hash doesn't matter, why bother even talking about changing the hash from SHA1 to SHA3-256?
It's a huge shame because git is very close to being a cryptographically perfect system, but the original creator can't get his security reasoning right, gets it wrong so badly and so publicly, and minions who don't know what they're talking about come flocking to defend it.
That's ignoring a huge amount of changes that have to be made to have this work properly. Hardcoded constants must be changed. Backwards compatibility needs to be maintained for Git to be a viable product. Scripts that depend on the current length/format of SHA1 hashes would be broken if everything were changed all of a sudden. That would be much, much worse than the current exploit.
I agree that his points are not all correct, but he is correct in that basically, you shouldn't have all of your security eggs in one basket. SHA1 is just one piece of their security infrastructure, and now that it's shown to be a little shaky, it can be argued that the others help keep it up while they repair it.
SHA1 pointers being "pointless" is also an overstatement. SHA1 isn't completely broken yet. It will be pointless in 3+ years, which is why they're doing the work of transitioning, though it isn't easy.
"Cryptographically perfect" wouldn't be occurring with SHA3-256 either. That'll be broken in 10+ years.
Security is, and always has been, along the lines of "good enough as far as we can see now", and when Git was made, it was good enough. Now it's not.
"Not having security eggs in one basket" is exactly irrelevant here. Upgrading the hash does not make the other parts of the system weaker, and should have been done years ago. Yes, SHA2 will also become weak eventually - so then you bump it again. The point is to choose the best one for that time of history, which he has been resisting doing for years, based on bullshit pseudo-arguments about "trust".
No, git uses hashes in the same way that most other security systems use hashes. The "trust" comes at a different point of the system. No hashes by themselves are "trusted" anywhere by any system. Git is not special in this regard. I repeat, Linus is talking horseshit and this horseshit argument is unfortunately getting repeated and spread around because of his position in the community.
I said the use of GPG is pointless, not SHA1 pointers, especially given his "trust" arguments. To exaggerate slightly, just to show you the point: it's like locking up a crappy lock (SHA1) inside a really secure safe (GPG). But this crappy lock is what opens up the real safe where your actual money is kept (the tree objects). And then we have Linus talking shit on the side saying the keys are not so important, but what he really gets security from is his trust in the delivery company that ships the safe around.
I disagree here. I doubt SHA2 or any modern hash will become weak within our lifetime. JP Aumasson, who is one of the experts of the field, agrees with me on that:
Reactions to stages in the life cycle of cryptographic hash functions
Stage | Expert reaction | Programmer reaction | Non-expert ("slashdotter") reaction
... | ... | ... | ...
General acceptance | Top-level researchers begin serious work on finding a weakness (and international fame) | Even Microsoft is using the hash function now | Flame anyone who suggests the function may be broken in our lifetime
If Linus truly doesn't care about security, then git could use any error correcting code that produces a uniform distribution of tags, such as CRC64. The size of the tag only affects the number of objects we'd expect to be able to commit before we see a collision: over 4 billion in the case of CRC64.
Linus mistakenly claims using a cryptographic hash function helps avoid non-malicious collisions, but this is not the case.
Where the choice of a cryptographic hash function matters is specifically if we expect an attacker to be trying to collide tags. CRC64 is a linear function of the input data and therefore fails miserably at preventing attackers from colliding tags, but still produces a uniform distribution of tags for non-malicious inputs.
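To make the "linear function" point concrete, here is a toy sketch (a plain CRC-64 with zero init and no output XOR, deliberately not any particular standard variant): XORing two equal-length messages XORs their CRCs, which is exactly the kind of structure an attacker can exploit and which a cryptographic hash is designed not to have.

    # Toy CRC-64 (non-reflected, init=0, no final XOR) illustrating linearity over GF(2).
    POLY = 0x42F0E1EBA9EA3693  # the CRC-64/ECMA-182 polynomial

    def crc64(data):
        crc = 0
        for byte in data:
            crc ^= byte << 56
            for _ in range(8):
                crc = ((crc << 1) ^ POLY) if crc & (1 << 63) else (crc << 1)
                crc &= 0xFFFFFFFFFFFFFFFF
        return crc

    a = bytes(range(64))                     # arbitrary 64-byte message
    b = bytes(x ^ 0x5A for x in range(64))   # another message of the same length
    delta = bytes(x ^ y for x, y in zip(a, b))

    # Linearity: crc(a) ^ crc(b) == crc(a ^ b) for equal-length inputs.
    assert crc64(a) ^ crc64(b) == crc64(delta)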
git seems to be in the odd place where Linus argues he's using a cryptographic hash function but not for security purposes.
Linus talked about why git's use of a "strong hash" made it better than other source control options during his talk at Google in 2007.
Edit: the whole talk is good but the discussion of using hashes starts at about 55 min.
Like, when I'm building a lookup index for files, I'm going to use sha-(something), because it's easy and well known. I don't particularly care about the security aspect; I care that everyone immediately knows the contract of sha-1.
There is, however, a performance cost in using cryptographic primitives in non-security-related contexts. You may not care about performance, but it certainly matters for something like git.
Linus claims: "So in git, the hash is used for de-duplication and error detection, and the 'cryptographic' nature is mainly because a cryptographic hash is really good at those things."
CRC produces a distribution just as uniform as a cryptographic hash function, and it's faster to boot. If these are the only things he actually cares about, and he's explicitly discounting security, he's choosing a slower primitive for no reason.
He writes off CRC inexplicably earlier in the post:
"Other SCM's have used things like CRC's for error detection, although honestly the most common error handling method in most SCM's tends to be 'tough luck, maybe your data is there, maybe it isn't, I don't care'."
Linus seems to think that SHA1 has some sort of magic crypto sauce which magically makes the distribution it produces more uniform than CRC's. It doesn't. The only difference is SHA1 was originally designed to be resistant to preimage and collision attacks, both of which are irrelevant outside of a security context.
Well, let's look at what the actual numbers are. There's a nice table on this page:
For a 64-bit tag, even with 6,100,000 objects we'd only have about a 1 in a million chance of a collision, so a 64-bit tag is more than sufficient to meet your stated requirements.
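If you want to reproduce that kind of figure, the standard birthday approximation is enough (a quick sketch, assuming uniformly distributed b-bit tags):

    # Approximate probability of at least one collision among n uniform b-bit tags.
    from math import exp

    def collision_probability(n, bits):
        return 1.0 - exp(-n * (n - 1) / 2 ** (bits + 1))

    print(collision_probability(6_100_000, 64))    # ~1e-6 for a 64-bit tag
    print(collision_probability(6_100_000, 160))   # rounds to 0.0 for SHA-1's 160 bits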
I'm not trying to defend Linus's somewhat confused explanations, it's just that git's 'security' requirements are somewhat woolly and one could reasonably get away with half-assing it a bit for a while.
For what it's worth, I think CRC64 should be fine for git-like workloads (but would still recommend using a cryptographically secure hash function, because git's usage is security-critical despite Linus constantly insisting it's not).
It is not an "odd angle". It is literally the very purpose for which they were created in the first place: to defend against attacks (preimage, collision)
If there are no attackers, the cryptography buys you nothing and merely makes the system slower.
Again, to go back to the original point: Linus's argument is that cryptographic functions have unique properties that make them specifically useful in non-security contexts. He's wrong. They don't. The non-cryptographic constructions he namedropped then glossed over work fine in these contexts.
Well, if we're going to be dicks to each other about it I'll try to explain what I think appears to be difficult for you to grasp. :) If you were throwing together something like git in a hurry you'd want a hash that
Lets you not have to think about collisions at all even if:
The collisions are by mere chance
The collisions arise by non-malicious accident
The collisions arise from malicious inputs [obviously, that implies an attacker but it also falls under 'I just don't want to think about collisions']
And right there, you grab the first non-completely-broken, not-too-giant, not-too-slow crypto hash around and get on with whatever else you had in mind. And this, to me, seems like the right call, especially since if collisions did eventually pop up, it'll probably be one generative collision and your entire system won't suddenly implode just because it exists. You'll have some time to fix stuff.
No argument there. Maybe I misunderstood 'Linus said something wrong' as 'Linus did something horribly wrong' and we're arguing over nothing?
CRC-X is complicated to use and a terrible choice.
First of all, it is not a single algorithm, it is a family of algorithms. For a CRC of size N, you have to also choose an N-bit polynomial, an N-bit starting value, and an N-bit output XOR value. There is no single standard; there are tens or hundreds of popular options. The optimal polynomial depends on both the CRC size and the length of the data you are feeding into it. If you choose poor parameters, you will get terrible error detection characteristics (like allowing extra 0 bits at the start of data with no change to the checksum). If you choose good parameters, certain classes of common errors, like zeroing out a block of more than N bits, will still have much worse characteristics than a 1 / 2^N random chance of collision.
Second, implementation in standard libraries is patchy. Most programming languages have some CRC32 implementation - but do not document what parameters they use, or use different notation for the same parameters (forward and reverse), or do not let you change the parameters, or do not let you change the CRC size, or all of these. There is no easy way to get a "standard CRC64 or CRC128" compatible across platforms without putting it together from github snippets and example code yourself.
Third, CRC is fast when implemented in hardware, but not that much faster than SHA-1 or SHA-512 in software. It's only 1.5-2.5x faster, and when you're doing one checksum per uploaded file or something, it really does not matter. It's going to be even slower when you don't have a suitable-length CRC available in optimised native code, and have to write it yourself in simple C or pure Python.
The obvious and simple solution is to pick a known popular cryptographic hash like SHA-X that is available in the standard library of every programming language under the exact same API with no parameters to configure, truncate its output to the digest size you want, and call it a day. No need to worry about error detection performance on specific error cases like long bursts of zeros, and you get some defense against malicious tampering as a free bonus.
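Something along these lines (a throwaway sketch; the 64-bit truncation is an arbitrary example, not a recommendation):

    # Use a standard-library hash and truncate to the tag size you want,
    # instead of hand-picking CRC parameters.
    import hashlib

    def tag64(data):
        return hashlib.sha256(data).hexdigest()[:16]   # first 64 bits, hex-encoded

    print(tag64(b"some file contents"))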
So why SHA-1 over MD5, which is probably more available and an order of magnitude faster? What problems does MD5 have, outside of its security, that SHA-1 does not?
This actually ends up not being true. Try
openssl speed md5
openssl speed sha1
> First of all, it is not a single algorithm, it is a family of algorithms. For a CRC of size N, you have to also choose a N-bit polynomial, N-bit starting value, and N-bit output XOR value. There is no single standard, there are tens or hundreds of popular options

> Most programming languages have some CRC32 implementation - but do not document what parameters they use, or use different notation for the same parameters (forward and reverse), or do not let you change the parameters, or do not let you change the CRC size, or all of these. There is no easy way to get a "standard CRC64 or CRC128" compatible across platforms without putting it together from github snippets and example code yourself.
- C: https://software.intel.com/en-us/node/503522
- Java: http://download.java.net/java/jdk9/docs/api/java/util/zip/CR...
- Go: https://github.com/jacobsa/gcloud/blob/master/gcs/gcsutil/cr...
- Python: https://github.com/ludios/pycrc32c
- Ruby: http://www.rubydoc.info/gems/digest-crc/0.4.1/Digest/CRC32c
And again, I'm not actually recommending git switch to CRC (as a fan of security, I would prefer they use an actual collision resistant hash function). But CRC better meets Linus's stated requirements:
- CRC will be faster (much faster as in ~4X, your numbers seem off to me) in almost all cases
- CRC will *not* fail to detect a bitflip, whereas there's a certain probability a random oracle-based construction will
Because you can't seem to tell the difference between "unnecessary" and "shouldn't be done".
If I build a shed then using larger screws on the door might be unnecessary but it only costs me .2% more and I know it won't fall over.
Using a recent SHA function might be overkill in a non-crypto context but it's high-quality and fast. And there's existing libraries for whatever language you want. Why the hell would anyone use CRC128? I've never even heard of it before.
It's a good hash choice, no matter how unnecessary.
The performance difference between SHA1 and SHA256 is less than 50%. Unless hashing is a significant percentage of git, which it isn't, this is an insignificant difference.
Also, you can easily argue that version control is a security context.
But you're completely missing my point, which is about Linus's claims that git's use of SHA1 isn't security-critical, but why he's using a cryptographically secure hash function anyway.
I did not, from Linus' post, get the impression he believes SHA1 is more magical than CRC except in the "preimage / second preimage" sense.
Of course it does. It is borrowing from a different field (cryptography) to get a CRC that "really, really won't collide", because that whole field (cryptography) is completely busted if it does.
Let me put it this way. If I really, really need a random distribution of white noise, I might use a different field, cryptography, to provide it: because if the distribution is not effectively random and uniformly distributed, that field in some fundamental sense is broken: no information is supposed to make it into the ciphertext, it should be indistinguishable from white noise.
So encrypting your source of white noise for the sole purpose of making it statistically closer to noise is a perfectly valid choice.
Actually in your comment you said it yourself: in as little as four billion commits CRC64 expects to see a collision. That is tiny compared to the search space cryptographers work with.
If you look at the history of git there was originally no reason to use cryptographic functions except in the same way as the analogy I just made (for white noise): he borrowed a property from a different field from the one he was working in.
CRC and SHA1 both produce a uniform distribution. SHA1 does not magically do this better because cryptography. The only things that make CRC and SHA1 any different are:
- SHA1 produces a longer tag (of course CRC256 is a thing)
- SHA1 is hardened against preimage attacks
- SHA1 was intended to be secure against collision attacks (not anymore!)
In a non-security setting, we can pick the size of the tag based on the rough number of objects we'd like to be able to store before we'd expect to see a collision (i.e. the birthday bound). If that number is ~4 billion, then CRC64 is sufficient.
Re: snark, I did hit "next in thread" but managed to skim over that in his response.
But again, thanks anyway.
(Note: this still sounds more like spitballing than "the plan", but at least it's a step in the right direction)
The plan is outlined at https://marc.info/?l=git&m=148787163023435&w=2
This is not the plan. This is the backwards compatible system for tools that can only deal with 40 characters.
0efaa. -> 2AdC..
Any phisher's wet dream is a URL displayed in a font where l and I look the same. For anything that routinely displays URLs, font matters a lot. And any website displaying git hashes is likely to be programming related and have some code font (good monospace or similar) in its repertoire.
Can't downvote this enough. This is plain FUD. Did you even read the complete thread on the git mailing list? This was just one proposal by him.
Given this context, of course nobody wrote patches: they would have obviously been rejected and been a total waste of time. Until now, when we actually have to deal with it.
This is all aside from your argument being fundamentally weak, however ("you can't criticize anything unless i say so and contributed by meeting these arbitrary standards. i mean, i didn't do anything either, i just get to make up the rules you abide by!!")
Are we both reading the same GP comment? It reads as "If he took this stuff seriously, he wouldn't have waited 12 years since SHA-1 was broken to even start considering any changes.".
1. Saw a known issue that cryptographers and security people were warning him about.
2. Alternatives existed that didn't have that issue. People were pushing on it.
3. Ignored all that to tell them it wasn't a problem, the issue could never have real-world consequences, and he wasn't interested in fixes.
In practice, that means he didn't take it seriously. He also made sure it wouldn't get fixed by letting people know he wasn't making a change to it. One more example in a long line of them where Linus doesn't give a shit about security enough to apply known, good practices.
BLAKE2 is faster still! It's also at least as secure as SHA-3, and produces any choice of output size up to 512bit.
Somewhat off topic, but is this actually possible?
Given hashing is inherently lossy, I'm inclined to assume it's not possible for anything much longer than a password, but commits are text, which I suppose is low entropy per character, so I don't know.
Alice clones an open source git repo, commits one secret change where she edits a config file's default password to her own secret password (a bad practice), and then publishes the new hash in public for some reason (build info?). Mallory would have to (a) know that exactly this happened, (b) guess the commit message, (c) guess the commit's timestamp to the second (or within a few seconds), and (d) preimage-attack her password.
And the preimage attack must pierce git's Merkle tree, which sounds downright impossible. (Unless Mallory is just bruteforcing, in which case a strong password is enough.)
And the likelihood of all the characters aligning in a way a compiler might find acceptable gets lower with every increase in length of your collisions, so it would be extremely unlikely that the shortest nontrivial match (for both the hash and for code sanity) would not be the right one. The code constraint certainly would not make it easier to find the collision in the first place, but it would give you great confidence in the result if you did.
But yeah, as the input size approaches the output size, the probability of a collision existing gets to 1. The birthday paradox formula will give the probability of a collision (assumes random placement in output space) based on number of inputs.
I suppose you are saying that if you know that the input size is sufficiently small, you don't have to worry about collisions, which is true. I was interpreting "for small inputs" to mean that if you give a small input to a hash function (which can take inputs much larger than the space of the output), that you can still reconstruct the small inputs uniquely. Unless the hash function is deliberately designed to provide unique 1:1 mappings for small inputs, I would think that it's not true that you can uniquely reconstruct small inputs because they will likely map to the same value as a large input would (i.e. 'a' might hash to the same value as some 14821-byte string).
It's computationally infeasible against most cryptographically secure hash functions. However, Grover's algorithm results in a sqrt(keyspace) reduction in security levels (effectively halving bit-sized security levels):
In a first preimage attack, you start with only the hash, and have to come up with an input from scratch that comes out to that hash. Even in a first preimage attack, you're extremely unlikely to craft the same input that someone else used to create the hash you were given. By pigeonhole principle, the number of inputs that come out to a shorter hash is extreme.
Is there a preimage weakness though? I thought this attack only reduced collision resistance.
I prefer not to take the bury my head in the sand approach to this, at least when it comes to public repositories.
What level is considered secure, then? The numbers for O(time) and O(space) should be many orders of magnitude apart to represent the relative costs.
Post-quantum, it's only ~2^53 work to find a collision. IMO that's worse.
Yeah, that part is the real flaw in the argument.
Let me put it more simply: if you're accepting patches for binary files in your repos you don't care about security at all. Maybe unless you know how to decode machine code/JPEG manually.
Also, the proposition was to use the full hash internally, and truncate the representation. Nobody other than git itself uses full hashes anyways.
With regards to market price for a successful attack, I don't think any hash function stands close to SHA-256. And for that reason I think it would be the right choice.
I don't have the math to understand a more detailed explanation for his reasons to make such a statement. If you do have the math, ask him.
1 - https://131002.net/index.html
> But if you use git for source control like in the kernel, the stuff you really care about is source code, which is very much a transparent medium. If somebody inserts random odd generated crud in the middle of your source code, you will absolutely notice.
Unless I'm misunderstanding how this would work in practice (I assume commits being added under my name or existing commits being modified?), I'm absolutely not buying that though. Way too general, no? I would likely notice something like that, but I know enough colleagues who are just as likely not to notice. Mostly because they don't fully grasp git, so they're just like 'hmm, thing says I'm behind, ha, I know the fix is pull, yes I'm a git wizard' and just pull then continue working without even checking what got changed.
But if you are in a position to insert "garbage data" into a commit in preparation for an attack, it will be much easier to insert malicious but safe-looking code in the first place. Omit an array bounds check, disable SSL certificate validation, or anything else that looks like a mistake but will allow you to compromise the running code later.
Hard to get a collision though I'm guessing.
like non-printing characters in comments?
Or, you know, random odd generated crud https://github.com/torvalds/linux/tree/master/firmware/radeo...
You would just build a hidden backdoor into the first binary blob, like a deliberately omitted array bounds check if specific parameters are used. Reviewers and tests won't find that in compiled binary code either, it's much easier to hide than large amounts of garbage data, and you even get to keep plausible deniability of it being an honest mistake if discovered.
Just as they noticed the intrusion on kernel.org servers after how long?
> If an attacker is pushing commits into your repository, you're long since toast on all possible security fronts, right?
Sure, that's always true, in the worst case. Perfect security doesn't exist. But does it need to be an attacker? Couldn't it just happen that an important bugfix isn't recorded because the hash collides?
It doesn't work retroactively. That crud needs to be there in the initial commit when it's accepted by multiple people.
> Couldn't it just happen that an important bugfix isn't recorded because the hash collides?
The hashes will never collide if there's no attacker. And that bugfix would have to have a blob of crud in the 'good' version, which would stick out like a sore thumb.
Do you assume that a proper preimage-attack would be needed? Instead, the good looking object could be committed, I guess after reading https://news.ycombinator.com/item?id=13721237
> And that bugfix would have to have a blob of crud in the 'good' version
Looking at google's POC, the colliding pdfs don't at all stick out (edit: but they are horribly contrived and you talked about code not blobs) [https://shattered.io/]
thanks for your answer, it motivated me to learn a bit more. I'm still not sure if "never" is a rounding error, hyperbole (which would be ill advised talking to laymen), or actual fact (that source files due to the reduced entropy in byte patterns could in fact never ever collide, which I doubt).
In source code there will be an obvious block of "hash was attacked here" crud in each file. The 'good looking' object will function non-maliciously, but it will be obvious that something is up.
> I'm still not sure if "never" is a rounding error, hyperbole (which would be ill advised talking to laymen), or actual fact (that source files due to the reduced entropy in byte patterns could in fact never ever collide, which I doubt)
"never" is to be read as "over a billion times more unlikely than all life on the planet suddenly being wiped out". It's not going to happen accidentally.
If at-patch time is your only code review, you've got problems. If your code base is too huge to look over frequently you've got problems.
There's certainly a concern if your code is more opaque, as is the case in this bug. If you're taking in raw asm.js code, for example, from various sources...
The first paper from Wang et al, which should've put SHA1 to rest, was published in 2004, the year before the first ever Git version was released. It could have been easy: Just take a secure hash from the beginning.
If he believes that, why does git allow signing tags and commits and why does Linus himself sign kernel release tags? Isn't that the very definition of "using a hash for security"?
* The end of SHA-1 on the Public Web
As announced last fall, we’ve been disabling SHA-1 for increasing numbers of Firefox users since the release of Firefox 51 using a gradual phase-in technique. Tomorrow [Feb 24th], this deprecation policy will reach all Firefox users. It is enabled by default in Firefox 52.
``Other SCM's have used things like CRC's for error detection, although honestly the most common error handling method in most SCM's tends to be "tough luck, maybe your data is there, maybe it isn't, I don't care".''
BitKeeper has an error detection (CRC per block) and error correction (XOR block at the end) system. Any single block loss is correctable. Block sizes vary with file size so large files have to lose a large amount of data to be non-correctable.
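For anyone unfamiliar with that scheme, the XOR-parity idea is simple enough to sketch (this is the generic technique only, not BitKeeper's actual on-disk format):

    # Generic XOR-parity recovery: store parity = XOR of all blocks; if exactly one
    # block is lost (detected, e.g., by its per-block CRC), XOR the survivors with
    # the parity block to reconstruct it.
    from functools import reduce

    def xor_blocks(x, y):
        return bytes(a ^ b for a, b in zip(x, y))

    blocks = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # equal-sized data blocks
    parity = reduce(xor_blocks, blocks)             # the appended "XOR block"

    lost = 2                                        # pretend block 2 failed its CRC
    survivors = [blk for i, blk in enumerate(blocks) if i != lost]
    recovered = reduce(xor_blocks, survivors + [parity])
    assert recovered == blocks[lost]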
During the switch, a bit of deduplication may be lost. But the only interesting issue I can see is how git fsck will tell which hash an object was created with when verifying the hash (maybe with length?).
Git can probably not be thrown into havoc by committing two colliding files (and doing so would require running another collision attack with the git blob header in the prefix). But git loses its cryptographic integrity promises due to this attack (aka: you can have different source trees with different histories leading to the same top commit hash). svn never had any cryptographic integrity to begin with.
It appears that to even try to attack git you'd need to spend the same amount of work again ($110K or what it was) to create a new collision with the right git object headers.
(Yeah, I know this will be read as a plea for monarchy and downvoted. It simply proves my point: people are WAY too subject to errors in the classes (1) "I hate him because he said something 'bad' about something 'good'." and (2) "I hate him because he said something 'good' about something The Tribe now knows is 'bad.')
Please don't bait other users by inviting them to downvote you or proclaim that you expect to get downvoted.
It similarly leads to the discussion of how "I can't believe Linus is trying to defend SHA-1 when The Tribe already knows it is cryptographically 'bad'."
SHA1 is busted. That impacts some git users. The fix is not invasive. Fix the bug. Make the transition. Move on.
He explains why he wrote the post: "since the SHA1 collision attack was so prominently in the news." I think a lot of people on Hacker News and elsewhere are interested in both the SHA1 collision and how it affects git, even if it doesn't impact them.
Still, even if this were some obscure bug that nobody cared about, what would be unprofessional about explaining his plan to fix it? It's an open-source project, after all.
If anything, I'd say it's quite professional that he's calmly explained why there's no need to panic, that they have everything under control, and that the path forward should be smooth.
It's frustrating, because Linus clearly isn't stupid. And yet sometimes he does stuff like this where I can't help but go "how did you not see the unforced error you were making?".
Just going ahead and start breaking things would really be unprofessional.
In my view, that's not the right-- or even a rational-- way to do things.
Or is this a proverbial "since SHA1 doesn't work over there, it shouldn't be used anywhere under any circumstances" attack?
Additionally, the defense offered against substituting binary blobs essentially comes down to "well, the kernel doesn't do that". Respectfully, other projects do. Those users' concerns were not taken seriously until lots of unrelated users freaked out about SHA1 for bad reasons.
Regarding the worst case scenario, yeah, substituting one binary blob for a different one (say, in Google's AOSP git repo) would be the worst. Which isn't sky-is-falling bad, but would still be pretty ugly.
Why should it matter? Git was designed for the needs of the kernel project since the very beginning.
It is us, mere mortals, who are at fault for picking a tool that was not designed with our needs in mind, just because that tool happened to be better suited for our needs than the alternatives it was designed to replace.
The sky isn't falling, and if you really believe so, well, there's Mercurial, although you may also feel uneasy since Mercurial also asks you not to panic. Maybe you should just build your own SCM?
Git supports binary blobs. That's a choice that was made way back, and it is not the fault of the user for using that feature. If Linus didn't want people to use git for binary blobs he probably shouldn't have added support for it.
Once he did add support for binary blobs he had an obligation to take the security of that component and its users seriously. And as I said before, although the risk is not sky-is-falling bad, it is present. I think even Linus agrees with that now or he wouldn't be working on a path forward at all.
I agree it isn't a hairy fix, but think that it would have been better for everybody if it had been made before a SHA1 collision was found, as has been repeatedly proposed to Linus in the past. This would have avoided the entire discussion about SHA1 and git, and all the angst that went with it.