Hacker News
Git hash function transition plan (github.com)
215 points by vszakats 79 days ago | 56 comments

I only had time to skim it and didn't see it mentioned anywhere, so this might be a good time to suggest multihash: https://multiformats.io/multihash/

Having git use that could be a great opportunity to settle on a de facto standard for encoding hash functions.

What would be the best way to suggest that (if it hasn't been already, though I am guessing it likely has)?

But that does not solve the problem. Multihashes are not unique identifiers of a message, which is what git mostly uses hashes for. Now, instead of a single unique identifier, you have N possible ones, where N is the number of hash implementations your multihash library has. And it is not possible to convert between two hash types without having the original message.
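To make that objection concrete, here is a minimal sketch of the multihash layout (`<function code><digest length><digest>`). It assumes only the `sha1` (0x11) and `sha2-256` (0x12) codes from the multihash table and single-byte varints; the helper name `multihash` is made up for illustration. The same message ends up with two different identifiers, and neither can be derived from the other without the message itself.

```python
import hashlib

# Function codes from the multihash table; both fit in a single-byte varint.
CODES = {"sha1": 0x11, "sha2-256": 0x12}
# Matching hashlib algorithm names.
HASHLIB_NAMES = {"sha1": "sha1", "sha2-256": "sha256"}


def multihash(name: str, message: bytes) -> bytes:
    """Return <code><digest length><digest> for the message (hypothetical helper)."""
    digest = hashlib.new(HASHLIB_NAMES[name], message).digest()
    return bytes([CODES[name], len(digest)]) + digest


message = b"hello"
ids = {name: multihash(name, message).hex() for name in CODES}
for name, ident in ids.items():
    # One message, one identifier per supported hash function.
    print(name, ident)
```

With N supported functions you get N such identifiers per object, which is the point being made above.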

Wasn't there an issue with JWT that was summarized as this:

"This is a good idea, but it doesn't solve the underlying problem: attackers control the choice of algorithm" ?

Here's another quote from the Wireguard paper[1]:

"Finally, WireGuard is cryptographically opinionated. It intentionally lacks cipher and protocol agility. If holes are found in the underlying primitives, all endpoints will be required to update"

[1]: https://www.wireguard.com/papers/wireguard.pdf

Sorry, I wasn’t suggesting allowing any algorithm to be used, just that whichever one is chosen next be encoded in a way that lets it be replaced again if needed, and also, if possible, that the numeric id for that algorithm be standardized beyond just git.


That’s only true of JWT if you allow your server to accept all algorithms.

You don’t actually have to.

Correct, your token authority should specify which algorithms are valid, and your clients should self configure via a secure back channel to only accept the algorithms your token authority issues.
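A minimal sketch of what "only accept the algorithms your token authority issues" can look like, using nothing but the standard library. It assumes the authority issues HS256 tokens only; the function names (`issue_hs256`, `verify`) and the allow-list constant are made up for illustration, and a real deployment would use a maintained JWT library and also check claims such as expiry.

```python
import base64
import hashlib
import hmac
import json

# Assumption: the token authority only ever issues HS256.
ALLOWED_ALGS = frozenset({"HS256"})


def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    # Restore the padding that JWT strips.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_hs256(payload: dict, key: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = (
        b64url_encode(json.dumps(header).encode())
        + "."
        + b64url_encode(json.dumps(payload).encode())
    )
    sig = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url_encode(sig)


def verify(token: str, key: bytes) -> dict:
    header_b64, payload_b64, sig_b64 = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    # The verifier, not the token, decides which algorithms are acceptable.
    if header.get("alg") not in ALLOWED_ALGS:
        raise ValueError("algorithm not allowed: %r" % header.get("alg"))
    signing_input = (header_b64 + "." + payload_b64).encode("ascii")
    expected = hmac.new(key, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    return json.loads(b64url_decode(payload_b64))
```

A token whose header claims `"alg": "none"` is rejected before any signature logic runs, because the header is only consulted against the allow-list.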

Exactly! JWT is a much misunderstood system, it seems. Though it doesn’t exactly help itself by being quite complex.

Well-designed protocols generally include algorithm identifiers. It doesn't mean that upgrade will always be easy though.

I really don't like giving this a new name ("multihash"). We have a name already: algorithm agility. We should use that name.

I also don't like this idea of having a standard for algorithm agility for hash functions (and another for encryption algorithms, and...).

It's also not obvious that making every hash/MAC/public key payload carry an algorithm ID is the right design for every protocol (it's not), though for git it is.

Yeah, this came out of the IPFS camp. All other things being equal, though, it might be sensible to use the same numeric ids for the hashing algorithms.

Generally, and this is just my gut feeling, I think that for any hash code written to disk or stored in some way having an identifier for the hashing algorithm used is such a common bite you in the ass later thing that it makes sense to always just do it from day one. To that end it’s easier to do day one if everyone agrees to a standard set of numeric codes.

Multihash is the standard set of numeric codes for different algorithms I am aware of.

Unifying here might allow git objects to be served natively over IPFS.

Just a quick note: while we would still really love to have git use multihash, you can already serve git objects natively over ipfs via: https://github.com/magik6k/git-remote-ipld

Which uses our new plugin system: https://github.com/ipfs/go-ipfs/blob/master/docs/plugins.md

> Generally, and this is just my gut feeling, I think that for any hash code written to disk or stored in some way having an identifier for the hashing algorithm used is such a common bite you in the ass later thing that it makes sense to always just do it from day one. To that end it’s easier to do day one if everyone agrees to a standard set of numeric codes.

Yes, that's the basic idea of all multiformats: "it's never gonna change" is considered harmful.

> Unifying here might allow git objects to be served natively over IPFS.

IPFS can already do that thanks to the CID format: https://github.com/ipld/cid

There are no good examples for Git specifically yet, but there's a good bunch of working code for transporting e.g. Ethereum and Zcash transaction blobs over IPFS. For Git it's in principle the same: import the raw object into IPFS, and start addressing it with /ipfs/<git-cid><original-git-hash>

Something about multihash makes me worry it's a security risk. Like I worry that it encourages this mistake:

1. Define a new protocol with multihash somewhere in it.

2. Import a super convenient multihash library.

3. Verify all hashes with a simple library function.

That sounds super natural and convenient to me, but if it means that you support MD4 by default, then you've introduced a downgrade attack into your protocol.

You can lock it down to specific hash functions no problem.
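For what it's worth, a sketch of a locked-down verifier, where the default is the narrow allow-list rather than "everything the library knows". It assumes single-byte multihash codes, trusts only `sha2-256` (0x12) by default, and uses `sha1` (0x11) as the stand-in for a legacy function; `verify_multihash` and both tables are hypothetical names, not any real multihash library's API.

```python
import hashlib
import hmac

# Multihash function codes this sketch can compute (single-byte codes only).
KNOWN = {0x11: "sha1", 0x12: "sha256"}
# Assumption: the protocol trusts sha2-256 only; anything else is opt-in.
ALLOWED = frozenset({0x12})


def verify_multihash(message: bytes, mh: bytes, allowed=ALLOWED) -> bool:
    """Check message against a <code><length><digest> multihash."""
    if len(mh) < 2:
        raise ValueError("truncated multihash")
    code, length, digest = mh[0], mh[1], mh[2:]
    # Reject disallowed functions before doing any hashing at all.
    if code not in allowed:
        raise ValueError("hash function 0x%02x is not allowed" % code)
    if code not in KNOWN:
        raise ValueError("hash function 0x%02x is not supported" % code)
    if length != len(digest):
        raise ValueError("digest length mismatch")
    expected = hashlib.new(KNOWN[code], message).digest()
    return hmac.compare_digest(expected, digest)
```

Whether real-world users keep the narrow default is a separate question, but at least the downgrade has to be asked for explicitly.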

If I’ve learned anything from being in this field it’s that:

  1) many if not most implementations will support lots of algorithms by default, and
  2) as a result, approximately zero users will lock it down

3) the users who do lock it down will be harangued about not being compatible with less secure versions barring a major incident

Yeah that's exactly what I'm worried about. The nature of the beast makes it tricky to define a safe default.

Funny, I always expected Git to transition by adding a stronger hash as a piece of metadata to each commit and continue using SHA-1 for the day-to-day identifier, seeing as most of the time Git doesn't actually go back and verify the whole commit chain unless you ask it to.

They actually considered the reverse (search for `Using hash functions in parallel`)

This doesn’t render very well on mobile. I wish the Git team would write their docs as a .md so GitHub could render as HTML with word wrap in all its glory.

Here is a rich text version of the same document: https://www.kernel.org/pub/software/scm/git/docs/technical/h...

It's written in asciidoc, which GitHub can render too. It's a matter of changing the extension or adding a vim modeline.

So, this is the transition plan. Is there anywhere where we can find what progress has been made on the plan? As far as I can tell, it is only a plan at the moment?

I also like the idea of a transition plan, but is there anywhere a proposed timeframe, for phasing out the non "post-transition" modes of operation? That is, as an organisation, is there anything that we can do with this now towards our future planning?

For something as widespread as Git, there is no "post-transition", I'm afraid: while maintained code will get migrated, old repositories will hang around Forever.

Note that Git is a protocol - all of its implementations will eventually need to change, and each repo using it as well. This is decentralized by the very purpose of Git.

So it says the protocol won't be extended initially, only the repo format. I'm trying to figure out the implications of that. IIUC this basically boils down to: can we make sure that when you have a signed tag (i.e. a hash signed with GPG), the content of your repo is truly the same as what the signer intended, and not a collision generated by a bad actor.

It says that there will be a new format for signed objects, i.e. you will now be able to sign tags with NewHash. But if the protocol is not extended, does that mean you can't push or fetch those objects? If so then I believe this is just foundational work with no immediate functional impact, right?

(Not shitting on it btw, it's obviously still a Good Idea!)

It explains this further, though rather later in the document.

There's a compatibility mode, where it understands a translation between SHA-1 named objects and NewHash named objects, and translates them at the boundary - i.e. during a pull or a push.

Obviously you're at risk to some extent of flaws in SHA-1 being exploited in your remote, although presumably if the translation layer detects the SHA-1 of something didn't change but the NewHash did then it'll scream.
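A rough sketch of that translation idea, restricted to blobs (which contain no references to other objects, so the same bytes are hashed under both functions). SHA-256 stands in for NewHash here, which is an assumption, and `TranslationTable` is a made-up name, not anything from the plan or from git's source.

```python
import hashlib


def object_id(kind: str, body: bytes, algo: str) -> str:
    """Hash "<kind> <size>", a NUL byte, then the body, as git does for loose objects."""
    header = f"{kind} {len(body)}\0".encode("ascii")
    return hashlib.new(algo, header + body).hexdigest()


class TranslationTable:
    """Hypothetical bidirectional SHA-1 <-> NewHash map (SHA-256 as stand-in)."""

    def __init__(self):
        self.sha1_to_new = {}
        self.new_to_sha1 = {}

    def record_blob(self, body: bytes):
        old = object_id("blob", body, "sha1")
        new = object_id("blob", body, "sha256")
        known = self.sha1_to_new.get(old)
        # Same SHA-1 name but a different NewHash name means two distinct
        # contents collided under SHA-1: refuse instead of overwriting.
        if known is not None and known != new:
            raise ValueError("SHA-1 collision suspected for " + old)
        self.sha1_to_new[old] = new
        self.new_to_sha1[new] = old
        return old, new


table = TranslationTable()
old, new = table.record_blob(b"")
print(old)  # SHA-1 name of the empty blob
print(new)  # its name under the stand-in NewHash
```

Trees, commits and tags are harder, since their content embeds other object names and therefore has to be rewritten before hashing under the new function.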

It does seem this is a temporary situation though, as it mentions in one small sentence that for the final transition stage they envisage the protocol also supporting NewHash, so they can throw away all SHA-1 metadata everywhere. What they don't address in that plan is how the protocol gets extended, but they do clearly rely on that happening for the full transition to take place.

He makes excellent points on tags; the one I hadn't considered before is that tags indeed can be separated from the tree, which makes them a unique asset in a git tree.

The problem with that however is how we use tags today. Creating a tag in the modern lingua franca of git means creating a new version. If you push that tag to Github or Gitlab or what have you, a handy "release" will be created for you. If you're signing all your commits for some security reason, you don't want that, aye?

So you'd want tags that are tracked separately and that's not easy to do. `git commit -S` (`--gpg-sign`) is going to include the signature in the commit, not create a separately-tracked tag with an appropriate name or whatever. It certainly sounds interesting, albeit unintuitive, and that summarizes git perfectly :)

>The problem with that however is how we use tags today.

"Doctor, it hurts when I cargo-cult workflow from GitHub..."

Hardly. It’s a workflow - that’s all. One that works exceedingly well for millions of people.

Do you have a point?

Point being that one shouldn't cargo-cult workflow from github.

The point is phrased using an old pop culture reference: https://en.wikipedia.org/wiki/Smith_and_Dale#.22Dr._Kronkhei...

It's not cargo-culting, it's not from github, and it's not even a workflow. If that was the point, it's a terrible one to make. I genuinely don't understand how people found that comment insightful, or anything short of trolling/hostile, but whatever.

Pushing a tag to a github-hosted repo certainly does not automatically create a 'release'.

Github "releases" - as listed on the repository index - are solely based on the tags of the repository. So yes, pushing a tag does create a release.

See here for documentation:


All releases are tags, not all tags are releases. Have you used the feature?

> 1. On GitHub, navigate to the main page of the repository.

> 2. Under your repository name, click Releases.

> 3. Click Draft a new release.


Pushing a tag does not create a release. You can have lots of tags that are not releases. You have to choose to create a release, as a separate step. All your releases are tagged though, yes (as they should be, using github and its release feature or not, to identify the state of the repo from which the release was built).

Tags are also how the code review tool Phabricator sends diffs to CircleCI for testing. If you have that integration enabled, you quickly end up with more GitHub Releases than your project has commits.

Been there, done that, it's really not a good idea.

Based on my own research, it appears that the first git tag was created before the first git commit.

The first tag (?) points to a tree.

  $ git cat-file -p v2.6.11-tree
  object c39ae07f393806ccf406ef966e9a15afc43cc36a
  type tree
  tag v2.6.11-tree

  This is the 2.6.11 tree object.

  NOTE! There's no commit for this, since it happened before I started with git.
  Eventually we'll import some sort of history, and that should tie this tree
  object up to a real commit. In the meantime, this acts as an anchor point for
  doing diffs etc under git.
  Version: GnuPG v1.2.4 (GNU/Linux)


First commit

  $ git cat-file -p v2.6.12-rc2
  object 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2
  type commit
  tag v2.6.12-rc2

  Linux v2.6.12-rc2 release
  Version: GnuPG v1.2.4 (GNU/Linux)

Unfortunately, I don't think I can confirm my suspicion using git alone. Maybe if I look at some mailing lists around July/August 2005 I could get a more accurate confirmation.

This is due to the fact that those tags pre-date the tagger header which came a short while later.

  $ git cat-file -p v2.6.13
  object 02b3e4e2d71b6058ec11cc01c72ac651eb3ded2b
  type commit
  tag v2.6.13
  tagger Linus Torvalds <torvalds@g5.osdl.org> 1125272548 -0700

  Linux 2.6.13 release
  Version: GnuPG v1.4.1 (GNU/Linux)


Just to reinforce the "First Commit" claim, here's the rev-list for the commit and the commit contents. (Notice it has no "parent" commit.)

  $ git rev-list 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2

  $ git dump 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2
  tree 0bba044c4ce775e45a88a51686b5d9f90697ea9d
  author Linus Torvalds <torvalds@ppc970.osdl.org> 1113690036 -0700
  committer Linus Torvalds <torvalds@ppc970.osdl.org> 1113690036 -0700


  Initial git repository build. I'm not bothering with the full history,
  even though we have it. We can create a separate "historical" git
  archive of that later if we want to, and in the meantime it's about
  3.2GB when imported into git - space that would just make the early
  git days unnecessarily complicated, when we don't have a lot of good
  infrastructure for it.

  Let it rip!

Mike Gerwitz on signing commits: https://mikegerwitz.com/papers/git-horror-story.

The main downside to switching the hash function is that, when explaining why developers should stop worrying about hash conflicts, we'll need to calculate a new analogy to replace the standard, 160-bit "every member of your programming team being attacked and killed by wolves in unrelated incidents on the same night" scenario.

That analogy presumes that the hash function's output is uniformly random; when you know how to manipulate it s.t. its output is not random, then obviously it doesn't hold.

The question of accidental collisions is still relevant, even with SHA-256, and the answer is still the same: it's so vanishingly improbable that it is assumed to be impossible.
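For a sense of scale, here is a back-of-the-envelope sketch using the usual birthday approximation, p ≈ C(n, 2) / 2^bits, which only holds while p is tiny and the hash output is uniformly random (exactly the assumption questioned above). The repository size of 2^30 objects is an arbitrary assumption and the function name is made up.

```python
import math


def log2_collision_probability(n_objects: int, hash_bits: int) -> float:
    """log2 of the birthday estimate C(n, 2) / 2**hash_bits."""
    pairs = n_objects * (n_objects - 1) // 2
    return math.log2(pairs) - hash_bits


n = 2 ** 30  # assumed repository size: about a billion objects
for bits in (160, 256):
    exponent = log2_collision_probability(n, bits)
    print(f"{bits}-bit hash: p is roughly 2^{exponent:.0f}")
```

That works out to roughly 2^-101 for a 160-bit hash and 2^-197 for a 256-bit one, so accidental collisions were never the reason to move off SHA-1; deliberate ones are.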

> Some hashes under consideration are SHA-256, SHA-512/256, SHA-256x16, K12, and BLAKE2bp-256.

Not sure what K12 is (Keccak?), but BLAKE2 is a very attractive option.

How does it prevent this exact same problem in the future?

> In early 2005, around the time that Git was written, Xiaoyun Wang, Yiqun Lisa Yin, and Hongbo Yu announced an attack finding SHA-1 collisions in 2^69 operations. In August they published details. Luckily, no practical demonstrations of a collision in full SHA-1 were published until 10 years later, in 2017.

> The hash function NewHash to replace SHA-1 should be stronger than SHA-1 was: we would like it to be trustworthy and useful in practice for at least 10 years.

Why is SHA-3 not explicitly mentioned as a candidate?

SHA3 is slow.

NewHash is a terrible name - on par with Xbox One [X] and iPad New. Googling stuff will be hard, and good luck explaining to less technically savvy users what this is all about.

Plus, in 100 years, when SHA-256 is compromised, what would be the name of a new new format?

I was under the impression that it's just a placeholder until the actual new hash function was decided on.

Also the whole point of this transition plan is that it will be a completely optional, per local repository, transition. So less technically savvy users won't have to worry about it in the first place.

Can someone explain the name? It does not look like a good name. Or is NewHash just a placeholder name for the git project because they haven't made a final decision on a new hash function? (It's hard to google and find out.)

Can someone explain why they would transition to a new hash function and not a block chain based system of tracking? If one of the goals of introducing a stronger hash function is signing of individual commits, it seems like a block chain would be ideal.

Chains of Git commits are already a blockchain - at least, already a DAG, and to be more specific, they are both Merkle trees. Internally, each commit contains the hash of the previous commit it was based on:

    $ git cat-file -p HEAD
    tree e013f4d121199d60b70043f525aef4a7e641b5f6
    parent 152bbb43b30ced1b32e9ed6f5ba2ac448de725b6
    author Linus Torvalds <torvalds@linux-foundation.org> 1510512373 -0800
    committer Linus Torvalds <torvalds@linux-foundation.org> 1510512373 -0800

    Linux 4.14

You can even GPG sign each commit if you want to ensure authenticity. The other aspects of cryptocurrency blockchains don't really apply here: we don't need a single "true chain," in fact that's the point of branching.
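A small sketch of why that parent line makes history tamper-evident. It hashes commit bodies the way git names objects (SHA-1 over `commit <size>`, a NUL byte, then the body); the author identity, timestamps and messages are made up, and the empty-tree id is used only as a placeholder tree.

```python
import hashlib

# SHA-1 name of git's empty tree, used here as a placeholder tree.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def commit_id(body: bytes) -> str:
    header = f"commit {len(body)}\0".encode("ascii")
    return hashlib.sha1(header + body).hexdigest()


def make_commit(message, parent=None) -> bytes:
    lines = [f"tree {EMPTY_TREE}"]
    if parent is not None:
        lines.append(f"parent {parent}")
    # Hypothetical identity and timestamp, for illustration only.
    lines.append("author A U Thor <author@example.com> 0 +0000")
    lines.append("committer A U Thor <author@example.com> 0 +0000")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")


root = make_commit("first")
child = make_commit("second", parent=commit_id(root))

# Rewrite the root: its id changes, so the child no longer points at it.
tampered_root = make_commit("first (tampered)")
print(commit_id(root) != commit_id(tampered_root))
print(commit_id(tampered_root).encode() in child)
```

To keep the child attached to a rewritten parent you would have to rewrite the child too, and so on up to the branch tip, unless you can produce a hash collision, which is what the transition plan is about.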

(Kids these days with their blockchains...)

git already is a hash linked datastore with the ability to sign your 'transactions'. The doc just points out that SHA1 is not a reliable hash to address objects anymore.

> ...it seems like a block chain would be ideal

If I could get just 1 satoshi every time I see this suggestion...
