Linus' reply on Git and SHA-1 collision (marc.info)
704 points by sampo on Feb 24, 2017 | 262 comments



Pertinent facts for the worried:

1) Git doesn't rely on SHA-1 for security. It relies on HTTPS, and a web of trust.

2) Even if git did rely on SHA-1, there's no imminent threat. What happened today was a SHA-1 collision, not a preimage attack. If a collision costs 2^n, a preimage attack costs 2^(2n).

3) Even if someone managed to pull off a preimage attack, creating a "poisonous" version of one of your git repository's objects, they'd still have to convince you to pull from their repo. This requires trust.

4) Even if you pulled it in, your git client would simply ignore their "poison" object, because it would say, "oh, no thanks, I already have that object". At worst, the code simply wouldn't work. No harm would be done.

When it comes to git, an attacker's time is better spent creating a secret buffer overflow than wasting millions of dollars on a SHA-1 collision.
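The gap between a collision and a (second) preimage in point 2 can be felt on a toy scale. The sketch below is only an illustration: it truncates SHA-1 to 16 bits (the `TRUNCATED_BYTES` constant and all function names are made up for this example), so a birthday-style collision search should need on the order of 2^8 tries while matching one specific target should need on the order of 2^16. It says nothing about the real cost of attacking full SHA-1.

```python
import hashlib
from itertools import count

TRUNCATED_BYTES = 2  # toy 16-bit "hash"; real SHA-1 digests are 20 bytes


def toy_hash(data: bytes) -> bytes:
    """Return the first TRUNCATED_BYTES of SHA-1, a deliberately weak toy hash."""
    return hashlib.sha1(data).digest()[:TRUNCATED_BYTES]


def find_collision():
    """Birthday search: find any two distinct messages with the same toy hash."""
    seen = {}
    for attempts in count(1):
        message = str(attempts).encode()
        digest = toy_hash(message)
        if digest in seen:
            return seen[digest], message, attempts
        seen[digest] = message


def find_second_preimage(target: bytes):
    """Brute force: find a different message matching one specific target's toy hash."""
    wanted = toy_hash(target)
    for attempts in count(1):
        message = str(attempts).encode()
        if message != target and toy_hash(message) == wanted:
            return message, attempts


if __name__ == "__main__":
    a, b, collision_tries = find_collision()
    m, preimage_tries = find_second_preimage(b"trusted object")
    print(f"collision after {collision_tries} tries, "
          f"second preimage after {preimage_tries} tries")
```

The attacker in the collision case gets to choose both messages, which is why it is so much cheaper than hitting an object that already exists in your repository.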


I don't like the living-on-the-edge-attitude that Linus and others here promote regarding Sha1 in git. First, attacks only get faster over time. What costs millions today is likely to be achievable on commodity hardware in the coming years. Second, attacks only get more flexible over time. A contrived collision on MD5 in 2004 got perfected to a single block collision in 2010 [1]. Third, devising an update strategy and rolling it out takes time. I can't guess how much hardwired use of Sha1 is in Github.

Fourth, people use git in creative ways. Linus may think it is a cardinal sin to commit binary blobs in a git repository, but I can't imagine I'm the only one using git as a poor man's backup and file sharing solution.

And last but not least, relying on Sha1 takes the effort of constantly asserting that its use in Git is still secure. Support requests to that end will skyrocket from now on, both of the constructive kind, like the technically concerned coworker ("but isn't git insecure now that Sha1 is broken"), and of the regulatory kind ("if you use Sha1, MD5, … please fill out these extra forms explaining why your process is still eligible for certification with ISO norm foobar").

Since we have to migrate away from Sha1 at some point in the future, I'd like it to be sooner rather than later.

[1] See Wikipedia for a timeline and references: https://en.wikipedia.org/wiki/MD5#History_and_cryptanalysis


"On the edge"

"A contrived collision on MD5 in 2004 got perfected to a single block collision in 2010 [1]" so they'd have at least years to fix it were they using md5?

He says they'll migrate, but it's no reason to go crazy. If anything, calmness of this sort is what we need more of (this industry, anyway... we go crazy about stuff way too much).


As someone who has done "security" full time before - there's nothing worse than the "Security By Jumping Up And Down Like An Excited Monkey" policy.

Fix what's broken, no doubt. But stay rational and look at the problem from all perspectives.


Sha1 was already a questionable idea in 2005 when git came out, because it was then already understood to have fundamental flaws.

https://www.schneier.com/blog/archives/2005/02/cryptanalysis...


> I don't like the living-on-the-edge-attitude that Linus and others here promote regarding Sha1 in git. First, attacks only get faster over time. What costs millions today is likely to be achievable on commodity hardware in the coming years.

If it helps, the git devs recognized that SHA-1 would be replaced at some point and have been preparing to move away from it. It's just a lot of work for basically one volunteer. A non-SHA-1 prototype might show up in a year or two, hopefully.


> Fourth, people use git in creative ways. Linus may think it is a cardinal sin to commit binary blobs in a git repository, but I can't imagine I'm the only one using git as a poor man's backup and file sharing solution.

git-annex (written by Joey Hess from the email) is a way to manage binary blobs using Git, but IIRC it uses SHA256.


Yes, that's the default. It's configurable up and down, though (from SHA-1 to SHA-3).


> 1) Git doesn't rely on SHA-1 for security. It relies on HTTPS, and a web of trust.

Some security-focused developers sign git tags and/or commits, specifically to have things verifiable end-to-end and not having to trust HTTPS and all the middle men that entails.

Would that not be a case where git relies on SHA1 for security? Someone could replace a tag or commit with a malicious version that verifies fine since it has the same hash the original developer signed.

> 3) Even if someone managed to pull off a preimage attack, creating a "poisonous" version of one of your git repository's objects, they'd still have to convince you to pull from their repo. This requires trust.

In the case of signed commits/tags, this opens projects up to malicious action by hosting companies and others. Usually signed commits and tags are used specifically to avoid that exposure, because the developers don't trust the infrastructure.

> 4) Even if you pulled it in, your git client would simply ignore their "poison" object, because it would say, "oh, no thanks, I already have that object". At worst, the code simply wouldn't work. No harm would be done.

That only protects existing checkouts that already have fetched that commit. What about new checkouts, or older checkouts that haven't been updated yet?

Not disputing that this SHA1 collision does not signal any immediate emergency, just pointing out that git is used in different ways by different people, and some of those uses very much do depend on git's SHA1 for security.


Another case where git uses SHA1 is submodules. You audit libfoo's well-regarded v1.3.3.7 release, commit badc00fee, decide it suits your needs and use that commit as a submodule in your project. You assume resolving the submodule will always give you the same code (or fail).

Alas, libfoo's author has been corrupted by The Adversary and later arranges for the same hash to resolve to different code. (Currently we're talking about a collision, not a 2nd preimage, but libfoo's author could have been planning it all along, before your audit.)

["libfoo" here is fictional, I'm not referring to any of the projects actually named that.]

Similar things happen when Git hashes are exchanged by any side channel. The ability to use strong hashes as pointers to content inside any other data is the whole beauty of Merkle DAGs. Git submodules just happen to be one such channel that's part of git, but I think it's important to accept that git commit hashes are widely used outside git itself.

P.S. I see git-evtag already covers submodules. Nice.


Regarding tags, https://github.com/cgwalters/git-evtag seems like a really interesting idea.


or, you know, this could just be fixed in git by using a proper hash function that allows you to trust your merkle tree.


> Some security-focused developers sign git tags and/or commits, specifically to have things verifiable end-to-end and not having to trust HTTPS and all the middle men that entails.

This is a completely orthogonal, separate layer of security that has nothing to do with this particular issue.


Maybe quote the second line as well?

> Would that not be a case where git relies on SHA1 for security? Someone could replace a tag or commit with a malicious version that verifies fine since it has the same hash the original developer signed.

What's your response to this? To me it seems like this would be a serious issue.


You're right.


I control a fleet of servers. I have a saltstack or ansible script. One of the steps in provisioning a new server is to pull library X from github.com.

One day Egor Homakov finds a new hack and gains access to the master branch for library X. As a prank, he force pushes a change to master.

Being aware of such a possibility, instead of setting up my script to pull from master or even a specific tag, perhaps I pull from a specific sha1sum. That way I know I'm getting the exact version I want. That's today. Tomorrow, when preimage SHA-1 attacks are cheap, it will no longer save me.


Sounds like sentence three is where you got it wrong. You should be pulling your software to a development host, building it after verification, and uploading the deployment package to an artifact repository inside your security perimeter. Sentence three becomes 'pull library X from my artifact server.'

There is no way to securely deploy a package directly from the internet. The sooner you understand that, the sooner you'll be able to sleep through needless catastrophes like this hypothetical attack -- or leftpad.js.


> There is no way to securely deploy a package directly from the internet.

Fetching a file from the internet and verifying it against a hash or signature is the only way to securely deploy a package from the internet. This is exactly one of the use cases these cryptographic primitives are built for. Don't blame the users when they utilize the tools the way they are supposed to be used, blame the tools when they fail when doing so.
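As a minimal sketch of the "verify it against a hash" step, assuming the expected digest was recorded out of band when the package was audited (the function name and the sample bytes are hypothetical, and SHA-256 is my choice here, not something the thread prescribes):

```python
import hashlib
import hmac


def matches_pinned_sha256(payload: bytes, pinned_hex: str) -> bool:
    """Return True only if payload hashes to the digest pinned ahead of time."""
    actual_hex = hashlib.sha256(payload).hexdigest()
    # compare_digest avoids leaking where the first mismatch occurs
    return hmac.compare_digest(actual_hex, pinned_hex.lower())


if __name__ == "__main__":
    # Hypothetical package bytes; in practice these would come from the download.
    package = b"example package contents\n"
    pinned = hashlib.sha256(package).hexdigest()  # recorded at audit time
    print(matches_pinned_sha256(package, pinned))                # True
    print(matches_pinned_sha256(package + b"tampered", pinned))  # False
```

The check is only as strong as the hash function behind it, which is the whole point of this thread.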

And by the way, if you ever need to run `apt-get upgrade` or `apt-get install` on your production server from a public mirror, then you are guilty of "deploying a package directly from the internet", too. (My apologies and congratulations if you don't.)


apt-get can (and does) check signatures of repos

But yeah, you could configure an unverified repo to be used


For the argument at hand (Can you deploy anything on a production server from a third party that is only verified cryptographically?) signatures and hashes fulfill the same function (and btw I also wrote "verifying it against a hash or signature" above). More to the point, under the hood GPG signatures only sign a hash of the file in question anyway. Verifying a file via a GPG signature is strictly less secure than verifying it by its hash (assuming you use the same hash function as the signature).


Git has tag signing, surely there's a way to clone a specific tag and check the signature against a specific GPG key fingerprint?


Tag signatures only cover the mapping from tag name to commit hash. In other words, specifying a manually-verified commit hash is actually more secure.
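To make that concrete, here is a rough sketch of the text an annotated tag object carries, which is what the signature is computed over. The field layout follows git's tag object format as I understand it; the function name, commit id, and tagger line are invented for illustration.

```python
def signed_tag_payload(commit_id: str, tag_name: str, tagger: str, message: str) -> bytes:
    """Build the text of an annotated tag object, i.e. what the signature covers."""
    lines = [
        f"object {commit_id}",  # the only link to the code: a SHA-1 object id
        "type commit",
        f"tag {tag_name}",
        f"tagger {tagger}",
        "",
        message,
    ]
    return ("\n".join(lines) + "\n").encode()


if __name__ == "__main__":
    payload = signed_tag_payload(
        commit_id="0123456789abcdef0123456789abcdef01234567",  # hypothetical id
        tag_name="0.1.0",
        tagger="A Developer <dev@example.com> 1487900000 +0000",  # hypothetical
        message="Release 0.1.0",
    )
    print(payload.decode())
```

Nothing in the payload contains file contents, so the signature vouches for the code only through that one object id.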

Tag signatures are mostly worthless now from a crypto point of view -- with the caveat that you can still get some value from them if you still trust sha1 to be secure against second-preimage attacks.


    git checkout 0.1.0
    git tag -v 0.1.0


Either way he'd be pulling from Github.com at some point, whether in the build environment or production. He won't ever know that the repository was compromised.

Besides, some people use Heroku simply to deploy their rails or node apps, and they don't have an artefacts server.


If you want a specific hash, why pull it from github every time?


Because, when provisioning a new server, where else can you get it from?


A host that you control.


Where did it get it from?


From github, the first time only. Not every single time.


Isn't getting it from GitHub the first time vulnerable to attack? Consider when you're scaling up or replacing a failed host.


From a custom package you have built? Did you know that RPMs and DEBs don't grow on magical trees, but are built?


So, the problem still stands, theoretically.

Not only do you likely need to populate your host with packages that don't come from your host, but your host will also still be connected to a public net, even if only indirectly (e.g. via a private net), and hence can potentially be manipulated.


No, you misunderstood what the actual problem is here. Pulling code during deployment from some random resource on the internet, one that can go down or get deleted on a whim, where you can't easily move to some other mirror and don't even control when the thing will be back up: that's the problem. Not the trust you need to put to use the code (this is still there, obviously). And the very same comment applies to third-party package repositories, like PPAs in Ubuntu.

Not to mention that with pre-built binary packages your deployment speed and repeatability get significantly better, as you don't need to rebuild the artifacts every single time.


I do understand, but "the trust you need to put to use the code" is what we were talking about, no?


> I control a fleet of servers. I have a saltstack or ansible script. One of the steps in provisioning a new server is to pull library X from github.com.

Ansible gives larger and easier to exploit attack vectors than git.

And then, with pulling semi-random things from internets you have much bigger problems with your deployment procedure. You should never ever download a git repository with software, instead you should be using package system supplied by your operating system.


> you should be using package system supplied by your operating system

2005-me vehemently agrees with you.


Package managers allow you to download from git repositories too. He'd suffer from the same attack vector.


Erm... Which ones?


I believe Homebrew on macOS uses it for formulae (Ruby scripts to actually download and build the application)


Well, I don't hear too often about a fleet of servers running MacOS (because this branch of comments started with provisioning in a fleet of servers).

Also, from rants I hear about Homebrew and its breaking random libraries on upgrades, I don't think it should be mentioned as if it were state-of-the-art or something.



And which of these are "package systems supplied by your operating system"?


While I agree that the OS package should be used first and foremost, it just often doesn't have the required software. (Even after adding extra repos that it might support)

You are picking on the useless details. Those are all commonly used package managers in production environments. They often provide software that simply never gets packaged with the OS. They likely always will, because they have more focused design goals.

A better way to argue this would be to point out specific ways to use the package managers better. For example: Bundler supports saving all the required packages offline. This provides the opportunity to do a security review and save the packages locally/internally rather than always trusting whatever is on the Internet.


> While I agree that the OS package should be used first and foremost, it just often doesn't have the required software. (Even after adding extra repos that it might support)

OS-supplied packages don't grow on magical trees. If you don't have the necessary software in official repositories (or if it's your software), you can package it yourself. Deployment then becomes a breeze, and you save yourself the otherwise completely useless process of recompiling things over and over again.

> You are picking on the useless details.

Quite the contrary. Those details make an important difference.

> Those are all commonly used package managers in production environments. They often provide software that simply never gets packaged with the OS.

Apart from Homebrew, which is for workstations (hardly anybody runs macOS servers), none of these "package managers used in production environments" provides you with a complete way to rebuild your software. You can be fine for a while if you stay away from modules that are interfaces to C or C++ libraries and from tools from other languages (e.g. I have used Python's Sphinx to document Erlang daemons quite successfully), but once you hit that, deployment starts to be a PITA, because you'll need to remember to install all the required libraries, -dev packages, compilers, and what not.

On the other hand, DEB or RPM with artifacts will just automatically pull the required libraries, and its build dependencies give a dedicated and standard place for the necessary build tools.

Your comment supports my opinion that today's programmers usually don't want to be bothered with learning things that have been working for sysadmins for twenty years already.


Gentoo's emerge and FreeBSD port are package managers that preferably build from source. At least emerge also accepts git sources, with tags and commit specifiers. I haven't used BSD in a while on a prod system, but I'd not be surprised if port gained the same functionality - after all, whether you're pulling a tarball and using a checksum for integrity checking or handing the task off to git makes exactly no difference at all.

And I have yet to hear that port and emerge are incapable of doing dependency resolution. They're battle tested systems that work well in production environments.


First, portage and ports are OS-supplied mechanisms for installing software. They are nothing like pip or gems or npm, which only can install things written in their respective languages of choice and fail miserably for modules touching any library external to them (unless you manually ensure the library's and compiler toolchain's presence, that is).

Second, ports and portage have support for and networks of mirror servers that keep copies of software available through these packaging systems. It's trivial to switch if one of the mirrors goes down. For pip, gems, or npm you need to plan ahead for the problems and deploy your own package cache, from what I know.

Third, I was using Gentoo with one of these "battle tested systems that work well in production" for several years. It was doable, but it wasn't pretty, could lead to breaking software after updating some random deep dependency (if it was recompiled with different flags), and generally required more work and attention than APT would, all that for very little gain (if any gain at all). Oh, and it ended up working with binary packages after all; I just needed to compile them myself instead of having half an hour of production MySQL downtime because it needed to get compiled (which could fail, leaving me with no working database installed).


Gems can be used to install anything. They often build Java, Go, C, or C++, and I think I heard of one building Rust once; generally anything in a faster language, to create faster implementations of a given gem's functionality.

I bring that up to highlight some of the wrong assumptions you make. You make several needless assumptions and use those to draw funny distinctions between things. I am not even sure of the point anymore.

Likely any of these systems could be used in a variety of environments for a variety of purposes.


You can do anything, including Rust, yeah. You can also distribute pre compiled stuff.


The ones that actually matter are more like apt or yum.


Most language-specific ones do. OS-specific ones less commonly do; Chocolatey, for example, can (IIRC) install packages by running PowerShell scripts, which may or may not include pulling from GitHub.


> Most language-specific ones do.

Those are development tools, not deployment ones (even though they are used as such; programmers usually don't bother with learning what sysadmins do, so it's not a surprise).


Gentoo's emerge can. A package definition contains a source URL, and git URLs (with branches and tags) are accepted sources.


yaourt on archlinux


Pip, too.


> Tomorrow, when preimage SHA-1 attacks are cheap, it will no longer save me.

Well, Moore's law is petering out isn't it? :)


Computers are not getting better at the rate they used to, but we are still improving both on performance and power.

However, the largest change we have seen in the last couple of years is the accessibility of computing power. These days an attacker mostly needs to be concerned with how many CPU/GPU hours he needs to rent from Amazon or equivalent providers to achieve his goal. So the accessibility, convenience, and price of processing power are still improving quite fast.


That depends on your interpretation. If you go by the relatively useless measure of transistors per fixed-price CPU then it is dead, but if you go by more practical measures like whole-system instructions per second per dollar then it is as fast as ever. GPUs are really fast and still advancing.


Moore's law is, but the cost of a given amount of computational ability is still going down quickly.


Wow, Linus raises an entirely different issue which is that the PDF-based attack won't work on git at all. Due to length prefixing, it is extremely difficult to insert nonsense into the middle of a git object which is how this attack works on PDFs. Linus correctly notes that using the first 160 bits (forty hex characters) of SHA-256 is an option if an attack against git's use of SHA1 were developed.

1) Git doesn't rely on SHA-1 for security. It relies on HTTPS, and a web of trust.

I don't think I've ever (intentionally) used git over HTTPS. I always clone using ssh (which has its own authentication mechanisms) or the git protocol (which is read-only).

2) Even if git did rely on SHA-1, there's no imminent threat. What happened today was a SHA-1 collision, not a preimage attack. If a collision costs 2^n, a preimage attack costs 2^(2n).

Thankfully, the collision attack doesn't apply to git (see above) so the cost ought to be greater than 2^n for a collision.

3) Even if someone managed to pull off a preimage attack, creating a "poisonous" version of one of your git repository's objects, they'd still have to convince you to pull from their repo. This requires trust.

Find a popular git host (say, Github, but if a popular project is on GitLab or Bitbucket, they will do just as well) and compromise them. Target a recent release for a popular project (say, Rails) and poison a relevant object that gets pulled down by all the downstream maintainers to package the release.

The benefit over straight up compromising a git repo without faking the hash lies in introducing a vulnerability without either the maintainers or the developers of the project noticing (or noticing only months/years after the fact).


> Wow, Linus raises an entirely different issue which is that the PDF-based attack can't and won't work on git at all. Due to length prefixing it is extremely difficult to insert some nonsense into the middle of a git object which is how this attack works on PDFs.

Please note that the shattered-{1,2}.pdf files both have exactly the same length. And even with cleartext it is easy to pad passages so they contain the same amount of bytes. See how the quoted paragraph above has exactly the same number of characters as this one.


  >>> len('''> Wow, Linus raises an entirely different issue which is that the PDF-based attack can't and won't work on git at all. Due to length prefixing it is extremely difficult to insert some nonsense into the middle of a git object which is how this attack works on PDFs.''')
  264
  >>> len('''Please note that the shattered-{1,2}.pdf files both have exactly the same length. And even with cleartext it is easy to pad passages so they contain the same amount of bytes. See how the quoted paragraph above has exactly the same number of characters as this one.''')
  264


Indeed, they probably have to have the same length since the length is inserted into the final block as part of SHA-1 hashing and every block after the collision must be the same. I don't think it's even possible to create an MD5 collision between two documents of different length yet.
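For anyone wondering what "the length is inserted into the final block" looks like, here is a small sketch of the padding SHA-1 appends before hashing (a 0x80 byte, zeros, then the bit length as a 64-bit big-endian integer). The function name is mine; it only builds the padding and does not hash anything.

```python
import struct

BLOCK_SIZE = 64  # SHA-1 processes 512-bit (64-byte) blocks


def sha1_padding(message_length: int) -> bytes:
    """Return the padding SHA-1 appends to a message of the given byte length."""
    # One 0x80 byte, then zeros until exactly 8 bytes remain in the final block
    zero_count = (BLOCK_SIZE - 9 - message_length) % BLOCK_SIZE
    # ...followed by the message length in bits as a 64-bit big-endian integer
    return b"\x80" + b"\x00" * zero_count + struct.pack(">Q", message_length * 8)


if __name__ == "__main__":
    for length in (0, 55, 56, 264):
        padding = sha1_padding(length)
        print(length, len(padding), padding[-8:].hex())
```

Two messages of different lengths therefore end in different final blocks, so a collision found mid-stream would not automatically carry through to the end.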


I'd be more impressed if both paragraphs also had the same SHA-1 hash ;)


What if the hash of the second paragraph had exactly the same number of characters as the hash of the first one? ;)


> 1) Git doesn't rely on SHA-1 for security. It relies on HTTPS, and a web of trust.

I thought that if you signed your git commits or tags with GPG, you implicitly relied on the commit checksum, and thus on SHA1.

Is that right?


That's what I read in the thread about the collision that was found.


That's not entirely true. An attacker could fork a popular git repo, and then make some kind of patch for a bug in some seldom-changed file. They would then force a collision between the new file with the benign change and their poisoned version, and convince you to pull in the changes. Afterwards they could reset their repository to the one with the poisoned version, and anyone who pulls from them first would get the poisoned version of the file instead of the right one. It seems extremely unlikely that a practical attack would come out of this though.


That's stretching it. If you could convince anybody to pull from you, then why even bother going to the great expense of creating a collision?


This reminds me of vulnerability reports that start with "if you have root access..."


Or, as Raymond Chen puts it (quoting Douglas Adams), "It rather involved being on the other side of this airtight hatchway."


It creates plausible deniability.

Imagine the NSA publishes a crypto algorithm and contributes it to OpenSSL or some hypothetical crypto library using git. If they commit their new algorithm, everyone will be looking at that. They could do something devious like tinker with the way random numbers are generated elsewhere and reduce the possible keyspace of another algorithm to something very small and easy to brute force.

When this keyspace shortening is found out it would be hard or impossible to track back. No amount of inspecting the files that reportedly changed would reveal that the NSA did this.


Fine with all this, and you are right: nothing to worry about for now... but in the end we should trust the math, and nothing else. And so, we need a schedule for updating it.


Not entirely true, it relies on SHA-1 for security when you use PGP signatures. That makes it possible to fetch git repositories from untrusted sources if you can validate the signatures.

I don't really see how HTTPS is relevant here either, I clone most of my repositories over SSH for instance. And you can use git over plain HTTP too.


> 1) Git doesn't rely on SHA-1 for security. It relies on HTTPS, and a web of trust.

HTTPS lets you verify that you're fetching changes from, say, a Github server. Just because a repo is hosted by Github doesn't mean that you can trust its contents.

> 2) Even if git did rely on SHA-1, there's no imminent threat. What happened today was a SHA-1 collision, not a preimage attack. If a collision costs 2^n, a preimage attack costs 2^(2n).

This needs investigation, but I suspect some plausible attack vectors exist based on collisions. Say you generate a good file and a bad file with the same SHA. If Github uses some kind of object cache for viewing files through the website, you could probably get the good file into their cache, then open a PR with a commit containing the bad file. The project maintainer would see the cached good file, but when they merge the PR, the bad file would be merged.

I'm not sure if this exact attack would work, but probably something like it would. If not with Github, then perhaps with Bitbucket or GitLab.

Another possible approach would be to send a PR which discreetly introduces the bad file into the project maintainer's object store. (It doesn't have to be in a commit you're asking them to merge; it could be in a separate branch which they likely wouldn't notice.) Once that's done, you send a PR which introduces the good file. If they review the second PR on a website, they'll see the good file, but if they merge manually, they'll get the bad file from their local object store. Even if the maintainer merges through Github, if they deploy from their own machine, they'll deploy the bad file.

> 3) Even if someone managed to pull off a preimage attack, creating a "poisonous" version of one of your git repository's objects, they'd still have to convince you to pull from their repo. This requires trust.

Not really -- on Github, contributors are often strangers, and project maintainers often fetch from a stranger's repo in order to try something out. They expect that fetching objects is harmless. Even if some project maintainers do look for signs of trust before fetching, they'd probably be easily fooled by fake info in a profile. A determined attacker could even make a large network of fake accounts which star each other's projects and so forth, similar to black hat web rings.

> 4) Even if you pulled it in, your git client would simply ignore their "poison" object, because it would say, "oh, no thanks, I already have that object". At worst, the code simply wouldn't work. No harm would be done.

If an attacker had the resources to perform a preimage attack, one thing they could do is take the latest version of jQuery on the day it's released. They could append some malicious code, then add junk in a comment until they get the desired SHA. Now they just have to get a project maintainer to fetch from their repo ("Check out this feature I added! Just fetch from my fork and run the server."), and then wait for the maintainer to upgrade jQuery.


Git prepends type and length, that makes an attack significantly more difficult, and certainly you cannot just directly translate the pdf attack to Git.
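For reference, a short sketch of how a git object id is derived: the SHA-1 is taken over a header of the form `<type> <length>` plus a NUL byte, followed by the content. The function name is hypothetical; the header layout is git's documented object format.

```python
import hashlib


def git_object_id(content: bytes, object_type: str = "blob") -> str:
    """Compute the SHA-1 object id the way git does: header, NUL byte, content."""
    header = f"{object_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # Git's well-known id for the empty blob
    print(git_object_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
    # The raw SHA-1 of the same bytes differs because it lacks the header
    print(hashlib.sha1(b"").hexdigest())  # da39a3ee5e6b4b0d3255bfef95601890afd80709
```

Because the header changes the bytes that get hashed, two files that collide under plain SHA-1 do not automatically collide as git blobs; an attacker would have to compute the collision with the header already in place.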

For applications, such as signing pdf or other documents, SHA-1 should be retired. But the collective crypto community has been saying that for more than a decade, so I have little sympathy for companies that are affected by this.

But for Git, there is no reason for immediate concern. Should they upgrade to a more secure hash function? Yes. Ideally they make it rather straightforward to use a different function in the future and perhaps multiple hash functions. I doubt anyone is able to find an input that MD5 and SHA-1 both hash to the same value. It would significantly reduce reliance on a single hash function.


> I doubt anyone is able to find an input that MD5 and SHA-1 both hash to the same value. It would significantly reduce reliance on a single hash function.

As was mentioned in the original collision thread, concatenating insecure hash functions has fundamental weaknesses: https://www.iacr.org/archive/crypto2004/31520306/multicollis...


Sort of. It does not improve brute force resistance in a meaningful way, but that is not the purpose either. It protects against one of the hash functions being broken.

Let us assume that we have two 160-bit hash functions, and one is broken to the degree that we can find collisions in constant time. This now means that we can break the combined hash in 2^80 rather than 2^80 + 2^80. The total brute force complexity was not improved, but the reliance on either hash function was.
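A minimal sketch of the kind of combined hash being discussed, using MD5 and SHA-1 purely as an illustration (the function name is made up, and this is not a proposal for what git should do):

```python
import hashlib


def combined_digest(data: bytes) -> bytes:
    """Concatenate MD5 and SHA-1 digests; a collision must hit both at once."""
    return hashlib.md5(data).digest() + hashlib.sha1(data).digest()


if __name__ == "__main__":
    digest = combined_digest(b"some git object")
    print(len(digest) * 8, "bits:", digest.hex())
```

The output is 288 bits long, but as the multicollision paper linked above argues, that should not be read as 144 bits of collision resistance; the gain is only that both functions have to fail on the same pair of inputs.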


Yes, but you picked two hash functions which are known to be insecure. Their complexity reduction argument appears to apply to cases where you are not using pure brute force for either hash, which would be the case if you were attacking MD5 and SHA1.


I did so quite on purpose. Both are considered broken, but combined, only the brute force attack is known to work, i.e. "the whole is greater than the sum of its parts".


> Git prepends type and length, that makes an attack significantly more difficult, and certainly you cannot just directly translate the pdf attack to Git.

With the attack vectors discussed above, the type would always be blob. I don't think the length field is much help either; see https://news.ycombinator.com/item?id=13720725


The length field makes the attack significantly more difficult. In the publish attack, you take 2 fixed input and prepend special data to make the collision. With Git, you have an additional constraint: either the data needs to be of an exact length (this may or may not be an issue, but most likely is unless we find a new weakness in SHA-1), or you must be able to handle the changing input.

While it is true that you can add "silent data" in all the examples in your link, they are all examples of structured data. So not only do you have to figure out the collision, you have to do so within the structure of the format you're attacking. This is a lot harder than just prepending the exact bytes you want.


> The length field makes the attack significantly more difficult. In the publish attack, you take 2 fixed input and prepend special data to make the collision.

That's not how the attack works. Not at all.


If you ignore the prepend part, it is in essence taking known input and calculating a collision.

> This is an identical-prefix collision attack, where a given prefix P is extended with two distinct near-collision block pairs such that they collide for any suffix S.

And

> Our example colliding files only differ in two successive random-looking message blocks generated by our attack. We exploit these limited differences to craft two colliding PDF documents containing arbitrary distinct images.


For 3 though, how is trusting a random author on git (with fake stars etc) practically different than a random author who has the exact sha1 of a "trusted" repo in their history? If you pull from a random author are you really doing a diff with the last trusted commit or something?

Face it - it's far more likely that a github account is compromised and that repo you rely on has been amended. And you don't really have a good way of verifying which commits are "safe", whatever that means. At best, commits can be cryptographically signed by their authors to prevent this. But if the author goes rogue then all who depend on them are up the creek.

And all this for what?


> 1) Git doesn't rely on SHA-1 for security. It relies on HTTPS, and a web of trust.

I would say cryptographic integrity definitely counts as "security".

I agree this isn't currently a big deal, but it's probably time to start migrating to a better hash.


But .. that's what Linus is saying in his mail. Like, really exactly what you say in your last sentence..?


The quote for those who didn't read the email. Linus Torvalds says:

> Do we want to migrate to another hash? Yes.


>1) Git doesn't rely on SHA-1 for security. It relies on HTTPS, and a web of trust.

This is a big lie. When the -S flag is used, git signs the SHA-1 of the commit. Moreover HTTPS does not provide any form of authentication due to the extremely broken CA model. NSA could simply ask any CA to give them a cert for github for example. Not to mention that https would only authenticate that you are talking to the github server, it would say nothing concerning the authenticity of the code.

>they'd still have to convince you pull from their repo. This requires trust.

How about compromising your servers instead? Or maybe simply have NSA asking github to let them modify a commit (which would end up having the same SHA-1 and being signed by you).

>4) Even if you pulled it in, your git client would simply ignore their "poison" object, because it would say, "oh, no thanks, I already have that object". At worst, the code simply wouldn't work. No harm would be done.

https://stackoverflow.com/a/34599081

Didn't Git v2.0 break backwards compatibility? Couldn't they simply move to Sha-2 during that time?


Fully agree with everything you said, except one point:

> Didn't Git v2.0 break backwards compatibility?

No, it did not.


Linus has toned down a lot from a decade ago.

> You are _literally_ arguing for the equivalent of "what if a meteorite hit my plane while it was in flight - maybe I should add three inches of high-tension armored steel around the plane, so that my passengers would be protected".

> That's not engineering. That's five-year-olds discussing building their imaginary forts ("I want gun-turrets and a mechanical horse one mile high, and my command center is 5 miles under-ground and totally encased in 5 meters of lead").

> If we want to have any kind of confidence that the hash is reall yunbreakable, we should make it not just longer than 160 bits, we should make sure that it's two or more hashes, and that they are based on totally different principles.

> And we should all digitally sign every single object too, and we should use 4096-bit PGP keys and unguessable passphrases that are at least 20 words in length. And we should then build a bunker 5 miles underground, encased in lead, so that somebody cannot flip a few bits with a ray-gun, and make us believe that the sha1's match when they don't. Oh, and we need to all wear aluminum propeller beanies to make sure that they don't use that ray-gun to make us do the modification _outselves_.

> So please stop with the theoretical sha1 attacks. It is simply NOT TRUE that you can generate an object that looks halfway sane and still gets you the sha1 you want. Even the "breakage" doesn't actually do that. And if it ever _does_ become true, it will quite possibly be thanks to some technology that breaks other hashes too.

> I worry about accidental hashes, and in 160 bits of good hashing, that just isn't an issue.

http://www.gelato.unsw.edu.au/archives/git/0504/0885.html


> "what if a meteorite hit my plane while it was in flight - maybe I should add three inches of high-tension armored steel around the plane, so that my passengers would be protected"

I think this is a shockingly good example of how smart people get security questions utterly wrong. The right analogy when it comes to security has to involve some type of adversary, not just random, unmotivated natural phenomena -- as long as we're using aircraft analogies, it's not so much "a meteorite might randomly hit my plane in flight" as "there's angry and armed people shooting at my plane with armour-piercing ammunition".

Indeed, in the real world, putting hundreds of kilograms of armour on aircraft isn't the absurd/childish idea Linus seems to think it is:

https://en.wikipedia.org/wiki/Fairchild_Republic_A-10_Thunde...


It was recently pointed out to me by a friend with a hobby for cryptography and mathematics that "secure" doesn't mean anything - something can only be secure with respect to a specific threat model.

It seems like a straightforward, almost painfully obvious definition in retrospect now, after having it articulated to me. I think education on security matters is poor and should be a standard topic.


When possible, measure security in attacker dollars.

What's interesting here is that, just 5 years ago, the 2017 cost of this very attack (the Stevens attack) was estimated at 2^18.4 = $350k [0]. The collision announced today cost about $100k. Perhaps even less. [1]

Cloud computing is cheaper today than many expected it to be, and it seems like we're only now entering the era of fierce competition. Who knows what the next decade will bring. If your modeled attacker cost is within an order of magnitude or two of the danger zone, beware. Better to have many orders of magnitude of headroom.

Along these lines, I really like what cperciva did with attacker cost modeling in his scrypt paper. See the table on page 14 [2]. I wish more security choices were presented this way, with estimated attacker costs. Taking that even further, I wish the numbers were updated dynamically against present day hardware & compute costs. Maybe even with trendline projections. It's difficult to make good security choices without knowing costs.

[0] https://www.schneier.com/blog/archives/2012/10/when_will_we_...

[1] https://sites.google.com/site/itstheshappening/

[2] https://www.tarsnap.com/scrypt/scrypt.pdf


The A-10 isn't a plane, it's a gun with a plane wrapped around it.

And it hunts tanks.


I'm sad that the AF wants to retire this plane. This plane was designed with one purpose: build a plane around this gun, that being a 30mm autocannon firing depleted uranium shells (not radioactive) that fires at such a rate that it retards the velocity of the plane carrying the gun, and would melt the gun if it fired from full to empty continuously.

The plane was built for survivability. It can withstand an engine being shot off (why the engines are external on "pods"). The cockpit is (I think) surrounded by a 2 inch titanium tub.

I'm not sure the JSF, i.e. F-35 will be capable of taking over this role, as is intended.


As a side note, depleted uranium is certainly still radioactive. "Depleted" refers to the percentage of U-235, the isotope used for making nuclear weapons. Only about 0.7% of naturally occurring uranium is U-235, with the rest being mainly U-238. U-238 emits alpha particles, with a half life of about 4 billion years.

The radioactivity is unrelated to its use as bullets, which relies on its high density, but it is radioactive.


I am told the gun is also quite good at shooting ground troops. A relative of mine had an A-10 stay with his group most of the night. Every time the enemy decided to go after them the A-10 came back and made them rethink their actions.


That zipper noise: To think that those are 30mm rounds being fired at that speed is just plain scary. http://www.xdtalk.com/attachments/img_7014-jpg.34438/

If one of those is lingering, I wouldn't even fight if I was the enemy.

It might not be able to kill a modern main battle tank, but I bet spraying it with rounds like that would fuck it up pretty seriously.


The A-10 delivers rounds at such a rate that the individual shots are indistinguishable to the human ear.

Another story that I've heard is that a B-1 flying at operational altitude (200 ft above ground level, mach 2) was often as effective as dropping munitions.


> Another story that I've heard is that a B-1 flying at operational altitude (200 ft above ground level, mach 2) was often as effective as dropping munitions.

That very topic is being discussed right now on the Aviation Stack Exchange: http://aviation.stackexchange.com/questions/35771/is-it-corr...


The B-1 cannot even reach Mach 1 at such a low altitude. Its maximum operating velocity (at any altitude) is Mach 1.25.

People tend to really underestimate how much harder it is to go faster once you reach Mach 0.90 or so, especially at low altitudes.


It's also going to present zero obstacle to a meteorite travelling at six times the speed of sound. It'll rip through that armor like it's not even there.


Mach 6 is a heck of a slow meteorite.

But if a small meteorite (only a few grams) were traveling that slowly, 1.5 inches of titanium would do quite a bit.


I'm not sure how fast these go once they hit the atmosphere. I've seen numbers in the 15 to 25km/hr range, which is pretty damned fast, but how much do they slow down in the thicker atmosphere closer to the ground?


15 to 25 km/hr is human running speed; you probably meant 15 to 25 km/s here (actually, some meteoroids hit the atmosphere at >70 km/s). It seems they decelerate pretty fast after that [0]:

"At some point, usually between 15 to 20 km (9-12 miles or 48,000-63,000 feet) altitude, the meteoroid remnants will decelerate to the point that the ablation process stops, and visible light is no longer generated. This occurs at a speed of about 2-4 km/sec (4500-9000 mph). From that point onward, the stones will rapidly decelerate further until they are falling at their terminal velocity, which will generally be somewhere between 0.1 and 0.2 km/sec (200 mph to 400 mph). Moving at these rapid speeds, the meteorite(s) will be essentially invisible during this final “dark flight” portion of their fall."

0. http://www.amsmeteors.org/fireballs/faqf/


> 15 to 25km/hr

just btw, 25 km/s is almost 100.000 km/h.


That depends on the size. Smaller ones just never make it to the ground - take a look at this table:

https://en.wikipedia.org/wiki/Impact_event#Airbursts

Once they get big enough, they slow down from ludicrously fast to still ludicrously fast. The smallest impactor shown in that table enters at 17 km/s, loses 90% of its energy traversing the atmosphere and smacks the ground at nearly 5 km/s. You'll probably want to take your titanium armour and stand somewhere else.
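As a rough sanity check on those figures, here is a small sketch that assumes the impactor keeps its mass (real ones ablate, so this is only an approximation) and that all of the energy loss shows up as lost speed. The function name is hypothetical.

```python
import math


def impact_speed(entry_speed_km_s: float, energy_fraction_lost: float) -> float:
    """Speed left after losing a fraction of kinetic energy, at constant mass."""
    # Kinetic energy scales with v**2, so the speed scales with the square
    # root of the fraction of energy that remains.
    return entry_speed_km_s * math.sqrt(1.0 - energy_fraction_lost)


if __name__ == "__main__":
    print(round(impact_speed(17.0, 0.90), 2))  # km/s
```

Under those assumptions the result lands a little above 5 km/s, in the same range as the table's figure.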


We took a couple of minutes off this afternoon to watch some Warthogs pulling some tight maneuvers... to think, USAF wants to sell them all to Poland...


Haha you reminded me of the old riddle about how British engineers decided where to place armor on their bombers, because, "if they put it everywhere the plane wouldn't get off the ground!"


Specifically, during the war the engineers took a Bayesian approach and put armor on the bombers in locations where bombers which came back did not have damage. The reasoning was that hits in those sections of the planes more likely than not resulted in the loss of the plane. Smart thinking.


It appears that the original 1943 memorandum by Wald takes a frequentist approach; see link below.

"A METHOD OF ESTIMATING PLANE VULNERABILITY BASED ON DAMAGE OF SURVIVORS" BY ABRAHAM WALD (1943) : http://www4.ncsu.edu/~swu6/documents/A_Reprint_Plane_Vulnera...


Definitely smart thinking, but not sure what makes it Bayesian.


We're interested in the quantity:

  (1) P(crash | section hit)

and adding armour to those sections where that quantity is maximized (maybe with some thought to the relative weight of armour needed for each section, but I digress).

Let's directly apply Bayes' rule:

  (2) P(crash | section hit) = P(section hit | crash) * P(crash) / P(section hit)

The denominator can be further expanded:

  (3) P(section hit) = P(section hit | crash) * P(crash) + P(section hit | no crash) * P(no crash)

So we can see from the 2nd term that if aircraft regularly come back with a section that's been hit and yet haven't crashed, then that directly reduces (1), meaning that section needs relatively less protection, all else being equal.

Another point in this method's favor is if crashed aircraft frames are too damaged to permit us to identify which sections were damaged. In that case, we can still estimate (1) just by replacing all the P(section hit | crash) terms with a uniform term.

This analysis can be further expanded to the actual amount of damage each section took as well. The more damage a section took on surviving aircraft, the less protection it needs.
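To make (2) and (3) concrete, here is a minimal sketch with made-up probabilities. The numbers are purely hypothetical, not taken from Wald's memorandum, and the function name is an assumption for illustration.

```python
def p_crash_given_hit(p_hit_given_crash: float,
                      p_hit_given_no_crash: float,
                      p_crash: float) -> float:
    """Bayes' rule for P(crash | section hit), following (2) and (3)."""
    p_no_crash = 1.0 - p_crash
    # Equation (3): total probability that the section is hit.
    p_hit = p_hit_given_crash * p_crash + p_hit_given_no_crash * p_no_crash
    # Equation (2): posterior probability of a crash given the hit.
    return p_hit_given_crash * p_crash / p_hit


if __name__ == "__main__":
    # Hypothetical inputs: 20% of sorties crash, and crashed aircraft are
    # assumed to have been hit in this section half the time.
    for p_hit_survivors in (0.1, 0.6):
        print(p_hit_survivors, round(p_crash_given_hit(0.5, p_hit_survivors, 0.2), 3))
```

With these inputs, a section that is rarely hit on survivors gets a much higher P(crash | section hit) than one that survivors often bring home damaged, which is the effect of the 2nd term in (3).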


Anti-aircraft fire is random, so it doesn't specifically target any section of the plane. Therefore, the planes that return to base are a biased sample of the planes that took on damage. See: http://www4.ncsu.edu/~swu6/documents/A_Reprint_Plane_Vulnera...


Funny thing is, they used all this advanced math to reach the same conclusion a 5-year-old child would come to: armor should be put around the engine and ammunition storage. Too much education sometimes makes you really stupid.


AA hits planes on the bottom, presumably they weren't adding all the armor to the top?


Does it? AA explodes at predetermined altitudes which may be above or below the plane. The resulting shrapnel....

It's not like a rifle bullet.


There's the shell itself that can still hit a plane, and also proximity fuses, which I guess would be more likely to go off on the way up (unless there is a delay after the proximity fuse is triggered). The air force would also know the approximate limit of the enemy's AA fire and sit above it if possible.


I would interpret it more charitably than that. He was responding to a claim that malicious collisions could be passed off as being pure chance:

> it could be incredible bad luck that caused that good-looking patch to be mistakenly matching a dangerous object


This was also in response to the original sender's claim that plain ol' non-malicious bad luck could cause sha1 collisions in git commits.

So this is a shockingly good example of how quotes taken out of context are pretty useless in a discussion.


A better counterargument to his argument would be: hashes are not "my plane", potential adversaries are not "a meteorite".

Using analogies is okay when you try to explain a difficult idea to somebody who wants to learn about your subject. Using analogies is not okay when you want to make a solid argument, when you want to convince somebody whose stance is very different from yours.


What's shocking is how badly people understand the purpose of an analogy.

It's to communicate an idea to another person in a way that can also convey subtleties, and not just the literal words being conveyed.

In this case, he was trying to convey the idea that the risk is so small and so remote that it really isn't worth spending a lot of time on.

You understood the point, I understood the point, and everyone else understood the point. Which means the analogy was successful.

So please, stop trying to pull the conversation on some tangent so you expound on why smart people should spend more time on the perfect analogy that you approve of.


> What's shocking is how badly people understand the purpose of an analogy.

The purpose of an analogy is to simplify something that's too hard to understand for the person you try to convey your idea to. Sometimes analogies are appropriate, e.g. when you teach something. When you want to convince somebody whose opinion is very different from yours, analogies aren't appropriate. They sound condescending: "because you are not smart enough to understand the real rationale behind my opinion, here is an oversimplified argument based on analogies I made just for you". The question is, did Torvalds have any real argument back then? Maybe he fell back to using analogies for the lack of any real argument.

> In this case, he was trying to convey the idea that the risk is so small and so remote that it really isn't worth spending a lot of time on.

To convey the idea that the risk is very small, one needs to have a proof. Real proofs shouldn't involve analogies. They should use facts and logic.


We're talking about a conversation amongst kernel/software developers about a very technical software issue.

Anyone who doesn't understand why the risk was so small doesn't belong in the conversation.

Can you even imagine where our medical field would be if we expected surgeons to talk amongst themselves as if they were speaking to the general public?

It is absolutely acceptable for the speaker to make assumptions about the listener's knowledge, and that doesn't reflect poorly on the speaker.

> The purpose of an analogy is to simplify something that's too hard to understand for the person you try to convey your idea to.

It's to convey an idea. That's it, anything you add to that is your own bias at work.

Linus was trying to get across the scale of just how small the risk was.


> Can you even imagine where our medical field would be if we expected surgeons to talk amongst themselves as if they were speaking to the general public?

The Git devs are not heart surgeons, software development is not a medical field. Again, you are using an analogy to "prove" your point. Can you, please, use a real argument?

>> The purpose of an analogy is to simplify something that's too hard to understand for the person you try to convey your idea to.

> It's to convey an idea. That's it, anything you add to that is your own bias at work.

Do you disagree that using an analogy is oversimplification? If you don't, then that part about "to simplify a complex idea" in my statement should absolutely stay. If you do disagree, then please show why and how an analogy doesn't oversimplify a complex idea.


An analogy isn't a proof.

When someone uses an analogy they're not trying to prove anything, they're trying to get you to see things from a specific perspective, or they're trying to communicate an idea.

You can still walk away from the analogy and disagree with them, but you should have a better understanding of their perspective or their argument.

> Do you disagree that using an analogy is oversimplification?

I think your entire approach to analogies is unnecessarily combative. You view an analogy as someone trying to prove something rather than trying to communicate their position better (or just an idea in general).

And your approach is to point out that the analogy isn't perfect, and therefore you've "disproved" the analogy.

Only analogies are, by their very nature, imperfect. When an analogy is perfect it ceases to be an analogy and becomes the thing being discussed.

It should automatically be understood that the analogy isn't perfect and anyone can find flaws in it. That doesn't mean it isn't an effective way to communicate.


Technically, he was emphasizing how low the probability was, not the risk. Since the impact is pretty high in both cases (accepting a malicious object/mid-air collision), the risk (as a combination of probability & impact) is significant.


Hi downvoters, without some post setting me straight I will be wrong forever, where can I learn more about risk analysis and mitigation?


I think you're getting downvoted because while you may not necessarily be wrong, you're also not really disagreeing with what I said, just making a distinction that isn't all that important (in the context of this conversation).


>The purpose of an analogy is to simplify something that's too hard to understand for the person you try to convey your idea to

An analogy applies a principle to a common setting without loss of specificity. Specifically the dedicated adversary is lost in this abstraction, so it's a bad analogy.


Yet other times, people use analogies to gloss over important facts, to sort of hide them from the listener. The logical conclusion drawn from the analogy will be different than what would be reasonable if all facts were taken into account. Great when your arguments are weaker, but you still wanna convince.


Other times, people introduce seemingly impertinent colour into an analogy in order to engage the target audience's imagination and reasoning better. They may even tailor the duration of their analogy in order to match that of other analogies which have previously hit home with the intended target. And the target may trust this, and potentially end up believing a great many falsehoods, because of a lack of critical reasoning in the contextualisation and import of the analogy.


If an analogy were perfect it would cease to be an analogy and would instead be the thing being discussed.

That's just the nature of an analogy, but that doesn't make it useless or fair to attack the speaker for using an analogy.


I'd like to remind everyone that this thread started with angry funny quotes from Linus.

And now we're debating on what does and doesn't qualify as an analogy.


Sometimes facts and logic can be extremely verbose - I feel many analogies are in place not to be condescending, but because the author has faith in the reader that they can make the connection between the analogy and the problem.

And to be quite honest abstracting ideas is core to problem solving, and I think it's a bit disingenuous to say that anyone misunderstood what Linus was getting at there.


> abstracting ideas is core to problem solving

Sure. But don't confuse abstractions with analogies.

Here is a good proof about both the integers and rational numbers. Every integer and rational is a real number (abstraction). When you add any 2 real numbers, you get the same result regardless of their order (fact). So it must be true that, when you add any 2 integers, you get the same result regardless of their order (correct conclusion #1). It also must be true that, when you add any 2 rationals, you get the same result too (correct conclusion #2).

Here is a bad proof about the integers and rational numbers. Both the integers and rationals are very similar: you can add them, subtract them and so on (analogy). Between every 2 rational numbers there is another rational number (fact). So it must be true that between every 2 integers there is another integer (wrong conclusion).


> Which means the analogy was successful.

Sure, if he's saying he is powerless against higher powers. But if he's trying to make a quantitative assertion, a qualitative analogy is not the right tool.


Well you said it... it's not a misunderstanding, it's a podium.


Here's another example of a person getting the scenario completely wrong. The adversary here is aliens and they're lobbing meteorites to take down planes. Get your head out of the groupthink, people! Just because it's difficult to conceptualize the foe doesn't mean that the foe isn't targeting you! Colanders on!


> It is simply NOT TRUE that you can generate an object that looks halfway sane and still gets you the sha1 you want

This was his point, and it's still true. Generating a specific SHA-1 hash is still not feasible.


It's not so inconceivable, after seeing the PDF collision, to contribute to another project a commit whose hash has a collision with another malicious commit you keep up your sleeve. Not saying it's easy, but now it's on the horizon.


> Not saying it's easy, but now it's on the horizon.

Not really. It's not a preimage attack. They spent on the order of $100k of compute to find two byte strings with the same SHA1 hash. There's still no way to SHA1-collide a specific byte string instead of random junk.


This is exactly what euyyn is saying: create two files with the same SHA1 (by adding bytes of gibberish to an unused section), commit one to the repository, and now you have a collision available.


That's not how git uses hashes. In that scenario, there would still be a diff and hence git would recognize the files were different.


> > If we want to have any kind of confidence that the hash is reall yunbreakable, we should make it not just longer than 160 bits, we should make sure that it's two or more hashes, and that they are based on totally different principles.

Why is this quoted in support of an argument that Linus used to come across as a lunatic in online correspondence?

This seems to me like an entirely reasonable way to make it very, very difficult to ever attack, because an attacker would have to be able to generate collisions for both of your hash functions.

(I'm not a crypto expert -- maybe you are, if so and my above comment is totally wrong, can you explain how it's wrong?)


I think Linus' point at the time was that the computational cost added to every transaction to protect against a (at the time) highly infeasible event of questionable impact was a poor cost vs. benefit argument. Adding complexity isn't free, and there needs to be a measurable benefit against a threat.

But of course, now is not then. And I think zkms' metaphor "there's angry and armed people shooting at my plane with armour-piercing ammunition" is apt here. The threat is much more sizable now, and so the cost vs. benefit analysis has changed.

Incidentally, one of the requirements of the SHA-3 competition was that it not be related to previous hashes. So while there are no demonstrable attacks against SHA-2, we nonetheless have "two ...hashes, and that they are based on totally different principles."


While we're at it, every SHA-3 finalist was deemed "good enough", so we now have more than 2 such hashes. (I personally love Blake2b.)


I actually asked this exact question in the SHA-1 thread and got an informative response [1]. Apparently creating a collision in two hash functions is not much harder [2].

[1] https://news.ycombinator.com/item?id=13715146

[2] https://www.iacr.org/archive/crypto2004/31520306/multicollis...


Not much harder than creating a collision in each of the hash functions, sequentially. You can't combine two broken hashes into a strong hash. But as long as at least one of them remains unbroken, then their concatenation is clearly secure too - assuming you keep the full result of each hash function and concatenate them into a long output, rather than trying to combine them somehow. Since it's impossible to know how resilient a given hash function will turn out to be against future cryptanalysis, it can make sense to hedge your bets by combining dissimilar functions.
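A minimal sketch of that kind of hedge, assuming SHA-256 and SHA3-256 as the two dissimilar functions. That pairing is just an illustrative choice, and `combined_digest` is a made-up name.

```python
import hashlib


def combined_digest(data: bytes) -> bytes:
    """Concatenate the full outputs of two structurally different hashes."""
    # SHA-256 (Merkle-Damgard) and SHA3-256 (sponge) are picked only as an
    # example of "dissimilar" functions; any pair could be substituted.
    return hashlib.sha256(data).digest() + hashlib.sha3_256(data).digest()


if __name__ == "__main__":
    print(combined_digest(b"example").hex())
```

Both 32-byte digests are kept in full, so two inputs only collide under the combined digest if they collide under both functions at once. The price is a 64-byte identifier and two hash passes over the data.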


It's worth noting that selecting SHA-1 ten years ago is probably akin to selecting SHA-512 today -- a collision on MD5 was first announced in 2004, and computers and GPUs were a lot slower ten years ago.


FWIW, the first warnings that SHA-1 was insecure and "It's time for us all to migrate away from SHA-1." were published prior to git's initial 2005 release: https://www.schneier.com/blog/archives/2005/02/cryptanalysis...


Because as we know, the security industry isn't a firehose of fear about everything that isn't airgapped (and then airgapping itself). Security recommendations come so thick and fast that even security researchers themselves don't bother following them. Even the bloke in your link explicitly apologises for using Word for some things, and he's considered a saint in the industry.

What all the commenters here doing a hatchet-job on Torvalds are missing is that he's saying "show me the money"/"perfect is the enemy of good". The hatchet-jobbers are too busy doing the usual tittering over his language to actually absorb the point.


Security industry != crypto academia.


Clearly, Linus didn't know at the time. If he'd known, he would have chosen another hash in 3 seconds: no additional complexity, no additional effort involved in not choosing a hash the security community starts to have doubts about.


Don't forget the extra computation cost with a more complex hash.


Nobody cares. Hashes are plenty fast. Even the code complexity doesn't matter: I've implemented both SHA-256 and SHA-512, and each takes less than a hundred lines of code. And of course, Linus would have used some existing implementation.

Also, "more secure" doesn't necessarily mean "more complex", or "slower". Blake2b for instance is as fast as md5 and has a simple ARX (add, rotate, xor) core derived from ChaCha20.


It wasn't like that in 2005.


Which is largely irrelevant on desktops. Nobody's using git on embedded devices, so hash computational cost shouldn't be the deciding factor. Maybe people storing BLOBs will notice.


There are blobs in the kernel, and speed was a very important motivation when the kernel switched to git.


So now we need to assess whether hashes are a significant bottleneck.


I don't think that comparison is quite right: ten years ago we already had two of Xiaoyun Wang's attacks which significantly reduced the work factor to break SHA-1. They were already cited at the time as reasons to move away from the algorithm (and I doubt we'd be able to do the 2⁸⁰ work needed to get a collision without cryptanalytic research! — notably the successful Stevens attack today reported performing a total of 2⁶³ hash operations while claiming to be "one of the largest computations ever completed").

There are also important concerns about SHA-512 (which have led to SHA-3 and other options), but I'm not sure it can be put in the same category; at least all of the best attacks today are significantly reduced-round.

https://en.wikipedia.org/wiki/SHA-2#Cryptanalysis_and_valida...

By contrast, Wang's original attack (which we had more than 10 years ago) achieved a 2¹¹ speedup for collisions against full SHA-1 and her second attack later the same year achieved 2¹⁷ speedup.

Evidently the speedup for the final version of Stevens's attack (which worked) was also around 2¹⁷. By contrast, a Moore's law improvement over 10 years would only be expected to make computers around 2⁶ times faster! (At the risk of mixed or mangled metaphors, we might say that mathematical insight during the time period you mention has been about 2¹¹ times more useful for attacking SHA-1 than computer power increases.)
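The arithmetic behind that comparison, as a small sketch. It assumes a doubling period of 18 months for Moore's law, which is one common reading and not something the figures above pin down; the names are made up for illustration.

```python
def moore_doublings(years: float, months_per_doubling: float = 18.0) -> float:
    """Number of performance doublings expected over a span of years."""
    return years * 12.0 / months_per_doubling


# Exponents (log2) of the work factors quoted above.
BRUTE_FORCE_LOG2 = 80   # generic birthday bound for a 160-bit hash
ATTACK_LOG2 = 63        # hash operations reported for the Stevens attack

cryptanalysis_speedup_log2 = BRUTE_FORCE_LOG2 - ATTACK_LOG2
hardware_speedup_log2 = moore_doublings(10)

if __name__ == "__main__":
    print(cryptanalysis_speedup_log2)          # 17
    print(round(hardware_speedup_log2, 1))     # 6.7
```

With a 24-month doubling period the hardware exponent drops to 5, so "around 2⁶" sits between the two readings.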

Edit: someone else linked to Valerie Aurora's chart, which I'd forgotten about, which gives a synopsis of the historical status of the most popular hashes. On that chart, SHA-1 from 2005 was one category worse ("Weakened") than SHA-512 is today ("Minor weakness").

http://valerieaurora.org/hash.html


Something that bothers me is, why doesn't there exist dynamic encryption that scales with the current computing power? Then we might have less of a migration mess. The hope is that encryption breakers would have to win by algorithmic strength rather than brute force "add more cores".


Where are the technical arguments? Just false analogies and hyperboles. I wouldn't like to work with a person like this in my team.


Wow you could have copy pasted that whole response. Every single sentence was fairly aggressive.


Classic availability bias and round trip bias

"I have no evidence of SHA-1 being broken which is evidence that SHA-1 can't be broken"


Extremely relevant discussion on Stack Overflow from 2012 on how git would handle a SHA-1 collision: someone called Ruben changed the hash function to be just 4 bits padded with zeroes and checked what git actually does on collisions: http://stackoverflow.com/a/34599081/308851


Downloading the PDFs [1] and comparing their sizes takes less than a minute. They're the exact same size. Yet here we have Linus making one bet after another that size has to be different for this attack.

Now to be fair, he also keeps repeating that he hasn't seen the attack yet. Which leads me to question why is this post interesting to HN? Is it to show how Linus aimlessly speculates and gets his guesses wrong?

--

[1] http://shattered.it/


I think you're violently agreeing with Linus here.

He says:

> pdf's don't have that issue, they have a fixed header and you can fairly arbitrarily add silent data to the middle that just doesn't get shown.

In other words, he expects the PDFs to have the same size because silent data has been arbitrarily added to the middle that doesn't get shown.


Padding a code commit with extra whitespace, in indentation, between operators and operands, and trailing the lines, isn't rocket science. As another commenter said, you then helpfully fix the style in the following commit.


Let's assume that you can use spaces, tabs, newlines and linefeeds. It's still only 4 bytes amongst 128 possibilities (using ASCII)... Good luck with that. Or, in other words : good luck to conceal pseudo-random _bits_ in _text_ files...


"git diff" highlights, in bright red if you have color enabled, trailing whitespace. The changes you'd be able to make without making it suspicious would be highly limited.

Maybe it's possible, but it doesn't seem very likely for source. You'd be more likely to succeed if someone stores binary files in git where people don't have an easy way to audit what's causing the difference. But in that case it would seem you likely have simpler attack vectors.


That's not his claim. He specifically goes on to say how PDFs are different from source files, and why, while it is trivial to produce two forged PDFs with the same size, it is not with source files.

That being said, his motto is more like "don't freak out", and in the specific git context, I can only agree.


I often see the suggestion that storing the length of the file too helps secure against hash collisions because it adds the additional requirement that both files be the same length, but every single MD5 and SHA1 collision I've seen is between values of the same length anyway! Where does this myth come from?


It's not a myth. It is obvious that you can find collisions with this constraint. When you make a collision attack such as this, you start with 2 inputs and have a target size, i.e. with a specific size of the padded data, then you simply iterate over paddings (of the same size), and when you find a collision, it will be of the correct size.

But it does add an extra constraint, and it prevents attacks that rely on random-length filler data. This is not much of an improvement, but an improvement nonetheless.


I can agree it strictly limits the type of attacks that can be done, but it doesn't appear to be any practical increase in security. It's like arguing for adding another step in the manufacturing process of bulletproof armor to add a few extra atoms to the armor's thickness. Sure it theoretically adds some protection against some 0.000...1% of bullets that would have just barely had enough force to get through, but it wouldn't have made a difference to any single previously observed successful attack of a similar nature. The fact that someone would argue that this increase in defense is good enough to obviate the need of switching to a different type of stronger defense seems to imply a misunderstanding of hashing.


>Is it to show how Linus aimlessly speculates and gets his guesses wrong?

I wasn't thinking of it like that, but now that you mention it, yeah. There's definitely something to learn from the way people make assumptions and are subsequently led by them.


The PDFs have the same size, but they do not have a header in the file that states their overall size. If PDF had a header at the beginning of the file that states the file size, then it could be harder to find a collision. From what I understand, the attack works by inserting garbage data after a fixed file prefix and before a fixed file suffix (anyone please correct me if I'm wrong).


> If PDF had a header at the beginning of the file that states the file size, then it could be harder to find a collision.

No. It doesn't change anything if the size is in the PDF header. Both PDFs are the same size, and the header is the same in both "shattered" files.

What Linus says is that if you tried to put these two PDF files in git, it would not see them as the same, as git calculates the sha1 differently. But Google would be able to produce two PDF files that would, as git sees them, appear to be same just as easy as these that were produced.

P.S. (answer to your answer to this message) Note, You wrote one level above

> If PDF had a header at the beginning of the file that states the file size, then it could be harder to find a collision.

And I argued that it isn't harder, but irrelevant.

From your answer:

> But to generate a collision with a different prefix q one would have to do the expensive computation all over again

Yes. Now read what your claim was again. It's not harder. Exactly as easy as the first time.


> But Google would be able to produce two PDF files that would, as git sees them, appear to be same just as easy as these that were produced.

Right, but they would have to re-do their enormous calculation. ("This attack required over 9,223,372,036,854,775,808 SHA1 computations.")

Google started with a common prefix p (the PDF header), then computed blocks M11, M12, M21 and M22, such that (p || M11 || M21 || S) and (p || M12 || M22 || S) collide for any suffix S. Given p, M11, M12, M21 and M22, anyone can make colliding PDFs that show different contents quickly. But to generate a collision with a different prefix q, e.g. one including the file size, one would have to do the expensive computation all over again, I think.

Note: I'm not trying to argue that SHA-1 can be made secure with padding. I was just trying to say that the statement "The PDFs have the same size" misses the point.
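To illustrate why the collision survives any common suffix S but has to be recomputed for a different prefix, here is a rough sketch with a toy Merkle-Damgard hash. Everything in it is an assumption made for illustration: the 24-bit state, the compression function (truncated SHA-256), and a single colliding block pair instead of the two block pairs used in the real attack.

```python
import hashlib
import struct

STATE_BYTES = 3          # toy 24-bit chaining state, so collisions are cheap
BLOCK_BYTES = 8
IV = b"\x00" * STATE_BYTES

def compress(state: bytes, block: bytes) -> bytes:
    """Toy compression function: truncated SHA-256 of state || block."""
    return hashlib.sha256(state + block).digest()[:STATE_BYTES]

def toy_hash(message: bytes) -> bytes:
    """Merkle-Damgard iteration with zero padding and a final length block."""
    padded = message + b"\x00" * (-len(message) % BLOCK_BYTES)
    state = IV
    for i in range(0, len(padded), BLOCK_BYTES):
        state = compress(state, padded[i:i + BLOCK_BYTES])
    return compress(state, struct.pack(">Q", len(message)))

def find_colliding_blocks(prefix: bytes):
    """Birthday search for two distinct blocks that collide after `prefix`."""
    assert len(prefix) % BLOCK_BYTES == 0
    state = IV
    for i in range(0, len(prefix), BLOCK_BYTES):
        state = compress(state, prefix[i:i + BLOCK_BYTES])
    seen = {}
    counter = 0
    while True:
        block = struct.pack(">Q", counter)
        out = compress(state, block)
        if out in seen:
            return seen[out], block
        seen[out] = block
        counter += 1

p = b"%PDF-1.3"                      # 8-byte prefix, exactly one block
m1, m2 = find_colliding_blocks(p)    # the expensive step, tied to this prefix

# Once the chaining states agree, any common suffix keeps them equal.
for suffix in (b"", b"trailer", b"any other suffix at all"):
    assert toy_hash(p + m1 + suffix) == toy_hash(p + m2 + suffix)

# With a different prefix the same blocks start from a different state,
# so a match would only happen by chance (about 2^-24 for this toy).
q = b"blob 42\x00"
print(toy_hash(q + m1) == toy_hash(q + m2))
```

The search depends on the chaining state left behind by the prefix, which is why changing the prefix (for example by putting a size field in it) means running the search again, at the same cost as before.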


Why would that make the attack harder? Both PDFs are the same length.


> Downloading the PDFs [1] and comparing their sizes takes less than a minute.

shattered-1.pdf and shattered-2.pdf have the same size and sha-1 hash, but git is still able to recognize that those are two different files and creates two different commit hashes for them. So clearly just having the same size and sha-1 hash is not enough to fool git.


Git calculates sha1(prefix+file), while the Google example was meant for sha1(file). However there is no difference in the attack. You could just as easily construct two files that have different sha1(file) but matching sha1(prefix+file).


>> Which leads me to question why is this post interesting to HN? Is it to show how Linus aimlessly speculates and gets his guesses wrong?

I think it's interesting because it is another example of his basic attitude toward security, and a lot of people who should know better use Linux believing it is secure because "it hasn't been broken yet."


Here is Mercurial's response to the SHA-1 attacks: "SHA1 and Mercurial security: Why you shouldn't panic yet."

https://www.mercurial-scm.org/wiki/mpm/SHA1


Several years ago I worked on a security product that used git as a sort of tripwire-type database. Since SHA1 was considered inadequate for Real Security, we had to hack jgit to use SHA256. It took a stupid amount of work - the 160-bit hash size was scattered all over the codebase in countless magic numbers. But it worked.

The product was cancelled. I always wondered if the patch would be of any use to anyone.


160 bits are still quite many. You could have done what Linus suggests and use a better hash but truncate it. If it's a good hash, a truncation of it should still be good (modulo the fewer number of bits, of course.)
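For what it's worth, the truncation itself is a one-liner. A minimal sketch in Python, assuming you simply keep the leading 160 bits of SHA-256 (the function name is made up for this example):

```python
import hashlib

def truncated_sha256(data: bytes, bits: int = 160) -> str:
    """Hex digest of SHA-256 cut down to its first `bits` bits (multiple of 8)."""
    if bits % 8 or not 0 < bits <= 256:
        raise ValueError("bits must be a multiple of 8 between 8 and 256")
    return hashlib.sha256(data).digest()[: bits // 8].hex()

digest = truncated_sha256(b"hello\n")
print(len(digest), digest)  # 40 hex characters, the same width as a SHA-1 name
```

The output fits wherever a 40-character hex name fits today. The generic birthday bound for a 160-bit output is still 2^80; the point is only to stop relying on SHA-1's known weaknesses.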


That is jerry rigging it. If a stronger hash function is known to be secure, you throw all that confidence out the window if you truncate it. At best you reduce the brute force complexity, at worst you enable pre-image attacks.


This is wrong. There is no faster way for preimage attacks with truncated SHA-2.

http://crypto.stackexchange.com/questions/9435/is-truncating...


I never said that, at all. I explicitly say

  At best you reduce the brute force complexity, at worst you enable pre-image attacks.

One thing I hate about crypto talk is statements like this

  So, truncating one of the SHA-2 functions to 160 bits is around 2^20 times stronger when it comes to collision resistance.

Which is all too broad. What if SHA-1 is down to 2^10, is truncated SHA-2 2^30? Does it mean we have proved that no weakness exists in SHA-2? A correct statement would simply be that no known attack exists on truncated SHA-2 yet.


You could have simply truncated the 256-bit hashes to 160-bit, exactly as Linus is suggesting in this post.


This is a typical outcome. You worry about some hypothetical threat, spend countless hours mitigating the threat instead of working on useful features, money goes down the drain, and then the project gets canceled. Security is just one of the features of a product. It's not more special than other features. You prioritize it like the rest of the product features.


> I always wondered if the patch would be of any use to anyone.

You think? I dunno why you wouldn't try to submit a pull request though.


Because for inclusion into upstream git it has to be done in some at least somewhat backward compatible manner, which is not exactly trivial.


In 20 years, the $100,000 attack will be a $100 attack (or perhaps a $1 attack), but programmers of the day will be overwhelmed with fixing all the 32-bit timestamps that everyone ignored for 50 years because the clearly forecast problem hadn't blown up in their faces quite yet.


> In 20 years, the $100,000 attack will be a $100 attack (or perhaps a $1 attack)

No. Moore's Law has been dead for years and will never come back. The benefits we saw in recent years came from people figuring out how to compile code for SIMD processors like GPU's, not faster or cheaper silicon.


Moore's law still holds, and is expected to hold true until at least 2025 (see wikipedia). I don't think it will be done by then, but that is just guess work.


Moore's Law (the 2-year version upwards-revised from the original 18-month version) stopped holding last year when 10nm (Cannonlake) was delayed to this year and Intel introduced a third step in its tick-tock process. The quotes you're looking at about 2025 were from 2012 (the one cited in 2015 has no support for the quote) and all three should be removed from the article or the assertion altered.


People predicted the death of Moore's law in 2005, and we all know what happened. Intel's CEO seems to think it still holds true, and will continue to do so for the foreseeable future. (http://fortune.com/2017/01/05/intel-ces-2017-moore-law/) This is probably partly marketing, but I'm sure there is some truth to it as well.


The linked article admits that the two-year doubling ended, which is what Moore's Law has been for most of its history. Moore's Law ending doesn't mean we won't ever have another die shrink, it means that the notion that we just have to wait two years to get twice the transistors for the same cost (or die area, depending on who you ask) is no longer true (and therefore, projections based on the notion of such a cadence should be considered even more silly than they already were). I don't understand why people continue to claim that Moore's Law's death "has been predicted many times" or whatever when it already ended; what happened in 2005 was that raw clock cycles stopped improving, and guess what: they still haven't improved that much for twelve years.


How much faster is a CPU from today than a CPU from 2012 when you compare them by single-thread performance at 3 GHz?


Moore's law is not about single-thread performance, and attacks like this are easily parallelized anyway. Not to mention that fixing them at 3 GHz is just trying to coerce the conclusion to be that we have seen little gain in the last 5 years.


Irrelevant for this particular issue since finding SHA-1 collisions is embarrassingly parallel.


Single thread performance is completely irrelevant to Moore's law.


So many paragraphs in the beginning, just to finally read:

> Do we want to migrate to another hash? Yes.

Wouldn't all that time trying to explain away the SHA-1 issues be better spent on developing a safe transition plan? Work on this could have started long ago, and if it would have started, going from SHA-256 to SHA-512 to SHA-3 to ... would be a no-brainer by now.

In the simplest case, ensure that all newly created git repositories work with SHA-256 by default (or SHA-512, or whatever), and fall back to SHA-1 for old repositories.

In the more advanced case, provide the possibility for existing repositories to have multiple hash values (SHA-1, SHA-256) for every blob/commit, then phasing out client support for old hashes as time goes on. When some SHA-1 collision happens, those who use newer git versions would notice and keep having a consistent repository.

If all those different browsers and web servers were able to coordinate a SSL/TLS hash transition SHA-1 to SHA-256, then a protocol like git with roughly 2 widespread implementations should be able to do that, too.


> Work on this could have started long ago

I read through this thread yesterday, and walked away with the impression that they have started working towards a hash migration (and general cryptoagility) already, albeit not with much priority.


There is some speculation on whether Linus got it right or wrong, but I haven't seen anyone actually test this with the shattered-1 & 2 files, so I did.

Git sees them as different despite them having the same hash. You can test with:

  mkdir shattered && cd shattered
  git init
  wget https://shattered.it/static/shattered-1.pdf
  git add shattered-1.pdf
  git commit -am "First shattered pdf"
  git status
  wget https://shattered.it/static/shattered-2.pdf
  sha1sum *
  md5sum *
  mv shattered-2.pdf shattered-1.pdf
  git status

So it doesn't see the files as the same.

Apologies for those on mobile (please fix this HN!): the commands are: mkdir shattered && cd shattered && git init && wget https://shattered.it/static/shattered-1.pdf && git add shattered-1.pdf && git commit -am "First shattered pdf" && git status && wget https://shattered.it/static/shattered-2.pdf && sha1sum * && md5sum * && mv shattered-2.pdf shattered-1.pdf && git status

EDIT: Ah, of course! git adds a header and takes the sha1sum of the header+content, which breaks the identical SHA1 trick. You can add a footer on and they keep the same SHA1 though. Don't have time to play about with this more just now, but try it with `cat`ing some identical headers and footers onto the pdfs.

EDIT2: Actually, this is discussed more extensively in the other thread which I hadn't read yet. Go there for more details: https://news.ycombinator.com/item?id=13713480
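If you'd rather see the header effect without downloading anything, here is a small sketch in Python of the two computations. It assumes the usual blob header format, "blob <size>" followed by a NUL byte; the function names are just for this example.

```python
import hashlib

def plain_sha1(data: bytes) -> str:
    """What sha1sum computes: SHA-1 over the file content only."""
    return hashlib.sha1(data).hexdigest()

def git_blob_id(data: bytes) -> str:
    """What git computes for a blob: SHA-1 over 'blob <size>\\0' + content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

data = b""
print(plain_sha1(data))   # da39a3ee5e6b4b0d3255bfef95601890afd80709
print(git_blob_id(data))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Feeding the bytes of shattered-1.pdf and shattered-2.pdf through both functions should, going by the test above, give equal `plain_sha1` values but different `git_blob_id` values, because the header changes the state the colliding blocks start from.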


> That usually tends to make collision attacks much harder, because you either have to make the resulting size the same too, or you have to be able to also edit the size field in the header.

> pdf's don't have that issue, they have a fixed header and you can fairly arbitrarily add silent data to the middle that just doesn't get shown.

This doesn't seem like much of an obstacle, since you can add silent data to all kinds of files, like

- With HTML, JS, etc. you can just add whitespace.

- Some formats like GIF89a have variable-length comments.

- With any media format that uses palettes, you can add extra, unused colors.

- Just about any compression algorithm can be tuned to manipulate the compressed size. E.g. with DEFLATE (which is used by PNG in addition to some archive formats), you can use a suboptimal static coding rather than the correct Huffman tree.

- With most human-readable document formats, you can add zero-width spaces or something similar.


Yes but the arbitrary data in the PDF doesn't have to be rendered.

It is much more difficult to change source code in a way that:

1) generates a collision

2) is still valid source code

3) the changes cause a desired effect (like a backdoor)

4) has the same file size


People store all kinds of binary data in git, not just source files. E.g. images or PDF files with specs or documentation.


>is still valid source code

pretty easy with comments.


Correct me if I'm wrong, but if you're letting untrusted people push to your git repositories, you're pretty much screwed anyway.

Given a case where someone with permission to push gets compromised and a malicious actor can pull this sha-1 attack off, aren't there bigger problems at hand? The history will be there and detectable or if they're rewriting history, usually that's pretty noticeable too.

I may be totally missing a situation where this could totally screw someone, but it just seems highly unlikely to me that people will get burned by this unless the stars align and they're totally oblivious to their repo history. So I guess I agree with the "the sky isn't falling" assessment.


The problem is I can't now download a git repository from someone I don't trust, verify it is correct, and then publish "I, Chris Jefferson, trust git commit abc125.... is good.". Now I would have to be sure everyone who has ever committed to that git repository wasn't trying to do something dodgy.

I have put git commits into scripts I run on automated servers for example, to be sure that every server runs exactly the same copy of the program.


What if there's a deep-cover operative for KGB/NSA/... who's already a respected kernel contributor? That's not implausible - those agencies already have people with the relevant skills. Hell, the NSA openly contributed SELinux.

Now suppose they've contributed a commit that collides with another one that puts a backdoor in, and they use this very selectively, MitMing git clones made by high-value targets. If anyone were to compare one of those clones against the real kernel repository then that would burn their operative, sure. But how likely is it that anyone would ever do that?


You can sign commits and tags. If you do this you can still offer some guarantees about your repo even if non-trusted people are able to access it (say, if you host your repo on github but don't necessarily trust them or something similar). If I have Linus's public key I can fetch the kernel from anywhere and just make sure to validate the signature of the tag I want to use for instance.

If you were able to forge commits with the same SHA-1 you might theoretically be able to rewrite the history without invalidating the signatures, which would be a problem. We're not there yet though, but it's one step closer.


Lots of hosted git solutions offer "protected branches", so it wouldn't be unheard of to allow less trustworthy contributors to push into a repository (to only some specific branches).


I posted this on the reddit thread, but I thought it would be interesting to hear feedback here too:

I don't know much about git internals, so forgive me if that is a bad idea, but what does everyone think about it working like this:

If future versions of git were updated to support multiple hash functions with the 'old legacy default' being sha1. In this mode of operation you could add or remove active hashes through a configuration, so that you could perform any integrity checks using possibly more than one hash at the same time (sha1 and sha256). If the performance gets bad, you could turn off the one that you didn't care about.

This way, by the time the same problem rolls around with the next hash function being weakened, someone will probably have already added support for various new hash functions. Once old hash functions become outdated you can just remove them from your config like you would remove insecure hash functions from HTTPS configurations or ssh config files. Also, you could namespace commit hashes, with sha1 being the default:

git checkout sha256:7f83b1657ff1fc53b92dc18148a1d...

git checkout sha512:861844d6704e8573fec34d967e20bcfef3...

Enabling/disabling active hash functions would probably be an expensive operation, but you wouldn't be doing it every day so it probably wouldn't be a huge problem.
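A rough sketch of what the namespacing could look like, in Python. None of this is how git works today; `ACTIVE_HASHES`, `DEFAULT_HASH` and the function names are hypothetical, and the sketch hashes raw content rather than real git objects.

```python
import hashlib

# Hypothetical configuration: the enabled algorithms, with sha1 as legacy default.
ACTIVE_HASHES = ("sha1", "sha256")
DEFAULT_HASH = "sha1"

def object_ids(data: bytes, algorithms=ACTIVE_HASHES) -> dict:
    """Compute one namespaced id per active algorithm for the same content."""
    return {name: f"{name}:{hashlib.new(name, data).hexdigest()}" for name in algorithms}

def parse_object_id(text: str):
    """Split 'algo:hex' into (algo, hex); a bare hex string means the default."""
    algo, sep, hexdigest = text.partition(":")
    if not sep:
        return DEFAULT_HASH, text
    return algo, hexdigest

def verify(data: bytes, ids: dict) -> bool:
    """Accept content only if every recorded hash still matches."""
    return all(object_ids(data, [name])[name] == value for name, value in ids.items())

ids = object_ids(b"some content\n")
print(ids["sha256"])
print(parse_object_id(ids["sha256"])[0])   # sha256
print(verify(b"some content\n", ids))      # True
```

With more than one hash recorded, an attacker would need content that collides under every active algorithm at once, and dropping an outdated algorithm is a matter of removing it from the list.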


Take a look at multihash[0]. I don't know the inner workings of the program, but I imagine it would be possible for `multihash` to periodically rehash files (as a cron job?) when a new crypto algorithm gets introduced.

[0]: https://github.com/multiformats/multihash


Do you know if git objects' size header was designed to deal with a possible collision or does it serve another purpose as well?

Just some context: git calculates an object's name from its content in the following way. Say we have a blob that represents a file whose content is 'Here be dragons!' followed by a newline (17 bytes); then the object name would be:

  printf "blob 17\0Here be dragons\!\n" | openssl sha1
  # => a54eff8e0fa05c40cca0ab3851be5aa8058f20ea

So the object gets stored in '.git/objects/a5/4eff8e0fa05c40cca0ab3851be5aa8058f20ea'


The PDFs released as proof are the same size, so if size and checksum are the same, git could certainly be fooled at checkout time.

So I could imagine that, in a large source file, it would be possible to have some malicious code plus some data in comment blocks to make the hash match. That said, the PDFs are 422k, and I think it's a much more difficult attack on the more typical, smaller source files that one would check out and build from git. Maybe Xcode .nibs and that sort of tool output could become relatively easy attack vectors, though.


Actually AFAIK this specific attack doesn't let the two files have different lengths, so the size header is irrelevant anyway.


The size is useful when you want to know the length of an object without unpacking it. The size is uncompressed length and not the size on disk.

For example, if you want to stream the blob over http you can use it to set the content-length. Otherwise, you have to use chunked transfer.


Also git compresses before hashing IIRC.


It doesn't.
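To make the ordering concrete: the object name is the SHA-1 of the uncompressed header plus content, and zlib compression is applied afterwards for storage. A minimal sketch in Python for a loose blob object (the helper name is made up; packfiles are a different story and are not modelled here):

```python
import hashlib
import zlib

def loose_object(data: bytes):
    """Return (object id, relative path, stored bytes) for a blob."""
    raw = b"blob %d\x00" % len(data) + data    # hashed form: uncompressed
    object_id = hashlib.sha1(raw).hexdigest()
    stored = zlib.compress(raw)                # on-disk form: compressed afterwards
    path = f".git/objects/{object_id[:2]}/{object_id[2:]}"
    return object_id, path, stored

object_id, path, stored = loose_object(b"hello\n")
print(path)

# Decompressing the stored bytes and hashing them reproduces the id,
# so the compression level never affects the object name.
assert hashlib.sha1(zlib.decompress(stored)).hexdigest() == object_id
```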


The Git docs have some very specific claims about cryptographic integrity of git:

https://git-scm.com/about/info-assurance

These claims are wrong as long as it uses SHA-1. Full stop.

It'd be really nice if git had cryptographic integrity. Not just because it'd prevent some attacks on git repos, but because it'd make git essentially a secure append only log. Which would be interesting, as it'd more or less automatically give some kind of software transparency for many projects.


Some more advanced Git operations (I sadly never needed so far) can be used to break the "append only" part, right?

Like for instance rebases?


You can always break append only logs, the point is, it's detectable.


What do you mean by detectable in the context of Git-as-append-only-log?


Mirrors that pull down the version with rewritten history will not be able to run a fast-forward (because they will contain commits that are not in the upstream version), and will loudly complain.
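The fast-forward check boils down to an ancestry test. A toy sketch in Python, assuming a history represented as a dict from commit id to parent ids (the ids and the `can_fast_forward` helper are hypothetical, not git's implementation):

```python
def ancestors(commit, parents):
    """All commits reachable from `commit` through parent links, inclusive."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def can_fast_forward(local_head, remote_head, parents):
    """A fast-forward is possible only if the local head is an ancestor of the remote head."""
    return local_head in ancestors(remote_head, parents)

# Hypothetical histories: commit id -> list of parent ids.
upstream = {"C": ["B"], "B": ["A"], "A": []}
rewritten = {"C2": ["B2"], "B2": ["A"], "A": []}

print(can_fast_forward("B", "C", upstream))     # True: C descends from B
print(can_fast_forward("B", "C2", rewritten))   # False: B was rewritten away
```

Note that this detection only works if a commit id pins down its content, which is exactly the property a hash collision would undermine.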


I prefer the entire thread: http://marc.info/?t=148786884600001&r=1&w=2


The ratio of new information relevant to this problem to new information overall is low.

If a good signal/noise ratio is still the purpose of information, then the thread is less interesting than Linus's mail:

- if we add size it will make forgery harder

- yes SHA1 should be replaced

What Linus is missing is people rewriting history. This will not be a concern for git, but it certainly will be for any cryptocurrency relying on SHA1 in the near future. (Hint: this transaction belonged to me)


I'm not aware of any crypto currencies that rely on SHA1.

Can you name one that does?


I wonder what would the effect be if there was one.


Bitcoin is a popular one. 2 sha-1 rounds.


No, it uses SHA-256


What does that have to do with git?


Sorry for my ignorance, but isn't SHA-1 in git supposed to protect only against data corruption and not against someone maliciously replacing the entire repo?


I wish they'd thought about this in advance. So many little things (like suffixing every hash with '1', and rejecting commits with hash values which don't end with '1') would have made the switchover much easier to do in a backwards-compatible way.


> but git doesn't actually just hash the data, it does prepend a type/length field to it.

To me it feels like this would just be a small hurdle? But I don't really know this stuff that well. Can someone with more knowledge share their thoughts?

I think Linus also argued that SHA-1 is not a security feature for git (https://youtu.be/4XpnKHJAok8?t=57m44s). Has that been changed?


Yeah what is the attack here?

If you don't have permissions to my repo, you can't attack it this way, which already limits the scope of attackers to people who have repo access. At that point there are tons of abuse avenues open that are simpler.

If someone could fork a repo, submit a pull request and push a sha for an already existing commit and that would get merged and accepted (but not show up in the PR on github) well that would certainly be troubling, but at that point I'm well past my understanding of git internals as to how plausible that kind of attack would be...


Repo access doesn't stop people from injecting code into your repository. A pull request actually puts objects into your repo, but under a different ref than heads and tags.

1. Go to GitHub and do a git clone 'repo' --mirror

2. cd to the bare repo.git and do a git show-ref, and you will see all the pull requests in that repo.

If any of those pull requests contained a duplicate hash, then in theory they would be colliding with your existing objects. But since git falls back to whatever was there first, I think it would be very challenging indeed to subvert a repo. You'd essentially have to find somebody who's forked a repo with the intention of submitting a PR, say, on a long-running feature branch.

You could then submit your PR based off hashes in their repo before they do, which would probably mean your colliding object would get preference.

It's pretty far-fetched, but the vector is non-zero.


It's easier than that. Someone could create a benign patch and a malicious patch that collides the hash. But what could they do with that?

For example, let's say it goes (letter##number represents the hashes):

State 1:

Real repo: R3 -> R2 -> R1 -> R0 (note: R3 is the head)

Benign PR: B1 -> B0 -> R1

Malicious PR: M1 -> M0 -> R1 (note: M0 = B0, but the contents are different)

State 2, after merging Benign PR:

Real repo: R4 -> R3 -> R2 -> R1 -> R0, R4 -> B1 -> B0 -> R1

If Malicious PR was merged now, Git would, I imagine, just believe that M1 is a commit that branched off of B0, since that's where the "pointer" is at.

So, yeah, what would this actually accomplish?


If the sha matched an existing one, would it be merged as part of the changeset, or just ignored?


If a sha matches, git won't fetch the object because it already has it.
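A toy model of that "I already have that object" behaviour, in Python. It uses a deliberately weak 4-bit id (in the spirit of the Stack Overflow experiment mentioned elsewhere in this thread) so a collision is trivial to find. `ObjectStore` and the helpers are invented for illustration and say nothing about git's actual code paths.

```python
import hashlib

def weak_id(data: bytes) -> str:
    """Deliberately weak 4-bit id: the first hex digit of SHA-1."""
    return hashlib.sha1(data).hexdigest()[0]

class ObjectStore:
    """Toy content-addressed store that never overwrites an existing id."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> bool:
        """Store data; return False if the id was already present (data ignored)."""
        key = weak_id(data)
        if key in self._objects:
            return False
        self._objects[key] = data
        return True

    def get(self, key: str) -> bytes:
        return self._objects[key]

def find_collision(original: bytes) -> bytes:
    """Search for different content with the same weak id."""
    target = weak_id(original)
    counter = 0
    while True:
        candidate = b"poison %d" % counter
        if candidate != original and weak_id(candidate) == target:
            return candidate
        counter += 1

store = ObjectStore()
good = b"int main(void) { return 0; }\n"
store.put(good)
poison = find_collision(good)
print(store.put(poison))                  # False: id already present
print(store.get(weak_id(good)) == good)   # True: the original content survives
```

Whoever gets their object in first wins, which is why the scenarios above depend on the attacker's object arriving before the legitimate one.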


Painful to read all the 'this isn't an issue because of bad reason X, Y, Z'.

Git can implement checking for easily collided data and warn the user, potentially even look to implement the safer hash countermeasures too. The fact that this isn't a second preimage, or that SHA1 isn't used to auth a repo doesn't really factor in to it.


But he doesn't say that they won't implement checks for collided data and warn the user (not in this post anyway).

He does say that the sky isn't falling, and there are some steps they can take to mitigate it.

Edit: or do you mean all the posts here in the comments?


Yes to your edit, but I can see the ambiguity.



This is a shocking aspect of crypto. Old standards get broken regularly, yet people waltz around with a "it would be embarrassing to do more since no one else does" attitude.


There is not much risk now, but git should be able to switch to another, longer hash. Truncating another hash to 40 chars does not "fix" the problem. It just moves it to another place.

Another possibility, but this is a hack to keep key length to 40 chars, would be to change key encoding from hex encoding to base64. In 40 chars you could encode 240 bits instead of 160. It is preferable to get rid of the hard coded 40 char limit. It shouldn't be that hard.
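The 240-bit figure is easy to check. A small sketch in Python, assuming URL-safe base64 so the names stay filename-friendly (that choice of alphabet is my assumption, not part of the suggestion above):

```python
import base64
import os

sha1_sized = os.urandom(20)    # 160 bits, the size of a SHA-1 digest
wider = os.urandom(30)         # 240 bits

hex_name = sha1_sized.hex()
b64_name = base64.urlsafe_b64encode(wider).decode("ascii")

print(len(hex_name), len(b64_name))  # 40 40
assert "=" not in b64_name           # 30 bytes is a multiple of 3, so no padding
```

The text form stays at 40 characters, but the binary form grows from 20 to 30 bytes, so any code that assumes a 20-byte digest would still have to change.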


The problem isn't the length of the text representation of the hash, it's the length of the actual binary representation of the hash.


The comment to which Linus responds reports that there are 40 constants in many places in the git code. The size of a SHA1 hash is 160 bits, which fits in 20 bytes. In hexadecimal, the length is 40 chars. So the problem reported in the initial comment was the text length, not the bit length.

Of course, 20 is probably hardcoded in many places too. So in this case the bit length is an issue too. You are right. Switching to base64 encoding would not solve the hardcoded bit length, if any.

Using file hashes as file identifiers doesn't look like a good idea, as suggested here https://valerieaurora.org/hash.html, because hash lifetimes are short. The system should support changing the hash every year. The work required to compute hashes increases too.
