
Updating the Git protocol for SHA-256 - chmaynard
https://lwn.net/SubscriberLink/823352/ea717c1e49390505/
======
kelnos
One of the comments mentioned that there was a suggestion (presumably
rejected) to "rotate" the first character of the hex string for SHA256
hashes by 16 places, so 0 becomes g, 1 becomes h, etc. (that way the
SHA256 hashes would be unambiguously not SHA1 hashes, even when abbreviated).
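
A minimal sketch of that rotation, assuming it shifts the first digit up by
16 places in the 0-9a-z alphabet:

    
    import hashlib
    
    ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyz'
    
    def rotate_first(hex_digest):
        # Shift the leading hex digit by 16: 0 -> g, 1 -> h, ..., f -> v.
        shifted = ALPHABET[ALPHABET.index(hex_digest[0]) + 16]
        return shifted + hex_digest[1:]
    
    rotate_first(hashlib.sha256(b'hello').hexdigest())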

This made me think... why are we using long, unwieldy base-16 hex strings at
all? Why not use an alphabetic (non-numeric) base-46 string: 20 lowercase
letters ([g-z]), 26 capital letters ([A-Z])? Then the new SHA256 hash strings
end up being _shorter_ than the old SHA1 strings, and there is no overlap with
the [0-9a-f] range of the base-16 strings.

If you wanted to even it out to 64 characters, you could create a "modified
base-64" that doesn't use [0-9a-f] and instead uses more special characters
(though for convenience you'd want to choose characters that are shell-safe
and possibly even URL-safe, which might make this not work). Alternatively you
could use a subset for a base-32 representation.

The downside -- perhaps a significant one? -- is that you can't use standard
tools like `sha256sum` or the base-conversion functions in the standard
libraries of many languages to generate these strings; it would require
custom code. Not sure if that's a concern, though.

~~~
mkl
> This made me think... why are we using long, unwieldy base-16 hex strings at
> all? Why not use an alphabetic (non-numeric) base-46 string: 20 lowercase
> letters ([g-z]), 26 capital letters ([A-Z])? Then the new SHA256 hash
> strings end up being shorter than the old SHA1 strings, and there is no
> overlap with the [0-9a-f] range of the base-16 strings.

Are you sure? SHA1 hashes are 40 hex digits, and SHA256 hashes are 64 hex
digits. But in base 46, 2^256-1 =
3ZS4A7V5Ki0LWg1f3Of06YNfgQXCA2P0Q6RACKhEIWQXe07, which is 47 base-46 digits
long, so still longer than SHA1. (This is not your base 46, since it's using
0..9, A..Z, a..j.)

    
    
    import gmpy2
    
    gmpy2.digits(2**256 - 1, 46)
    # -> '3ZS4A7V5Ki0LWg1f3Of06YNfgQXCA2P0Q6RACKhEIWQXe07'
    
    # Same value, but more clearly the "maximum" hash:
    gmpy2.digits(gmpy2.mpz('f' * 64, 16), 46)

edit: more code to try kelnos's proposed digits:

    
    
    # gmpy2's digit set for bases up to 62: 0-9, then A-Z, then a-z
    gmpy_digits = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
    # kelnos's proposed 46-character alphabet: g-z, then A-Z
    new_digits = 'ghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
    ''.join(new_digits[gmpy_digits.index(d)]
            for d in gmpy2.digits(gmpy2.mpz('f' * 64, 16), 46))
    # -> 'jPIkqnLlAYgBMWhVjEVgmODVWGNsqiFgGmHqsAXuyMGNUgn'

I think hex strings are probably still better, as there's less ambiguity, and
47 characters isn't much shorter than 64 for practical purposes.

~~~
mathnmusic
One of the mistakes Git made was that the hashes don't describe what algorithm
was used to generate them. That makes backward compatibility and incremental
upgrades harder. MultiHash is a solution for such issues:
[https://github.com/multiformats/multihash](https://github.com/multiformats/multihash)

~~~
heavenlyblue
You don’t need a specialised solution for this: just prefix the output of v1
of your hashing algorithm with some constant, and then you can easily extend
it later.
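
A minimal sketch of that idea, with a hypothetical "algo:" label in front of
the digest:

    
    import hashlib
    
    # Hypothetical self-describing format: "<algorithm>:<hex digest>".
    def labeled_hash(data):
        return 'sha256:' + hashlib.sha256(data).hexdigest()
    
    def verify(tag, data):
        algo, _, digest = tag.partition(':')
        return hashlib.new(algo, data).hexdigest() == digest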

------
chmaynard
This article by John Coggeshall is an example of great technical writing. I
look forward to reading more of his work.

The more I learn about Git, the more it blows my mind. Torvalds must have
designed Git in his head during the time he was using BitKeeper. When he
decided to implement Git, he went to work and in a few weeks he had it working
well enough to use for Linux kernel source control. Torvalds is either a
genius or merely a superstar computer scientist and software engineer. Choose
any two.

~~~
nujabe
I'm almost certain that if it weren't Torvalds, someone else would have
created something like Git. He just had the need and the skills to create it.
Like Bill Gates, he was a competent person at the right place and time.

~~~
nordsieck
> I'm almost certain that if it weren't Torvalds, someone else would have
> created something like Git. He just had the need and the skills to create
> it. Like Bill Gates, he was a competent person at the right place and time.

I think it's fair to point out that Torvalds didn't come up with Git on his
own - he was influenced by a good amount of prior art, not the least of which
was Bitkeeper.

However, I think calling him a "competent person, at the right place and time"
undersells him to a substantial degree.

1. He's a world-class software developer.

2. He's a world-class tech lead. (There are certainly criticisms of his
communication style, which I don't think are unfair. Here, I am judging him
by the artifacts created under his leadership.)

3. He was the write lock for an extremely large distributed team for more
than a decade.

It is not surprising to me that the 2nd most famous distributed source control
tool - Mercurial - was also initially created by a Linux kernel developer.

~~~
bch
> It is not surprising to me that the 2nd most famous distributed source
> control tool - Mercurial - was also initially created by a Linux kernel
> developer.

Because Linux was in a crisis and hot off an excellent experience with
bitkeeper? Or something more profound?
Or something more profound?

~~~
nordsieck
> Because Linux was in a crisis and hot off an excellent experience with
> bitkeeper? Or something more profound?

I'm sure the crisis was part of it.

But also, the way Linux development is organized is very weird compared to how
code is developed in most places. "Distributed" is not quite right, but it's
the best word I've got. A source control tool like Perforce just wouldn't fit
Linux development very well, whereas large companies at the time were
perfectly happy with it.

~~~
quicklime
“Decentralized”?

I agree with you that Linux development is weird compared to most places. Most
places have a designated central repo that acts as the source of truth,
whereas the Linux developers don’t really have this.

This is possibly an unpopular opinion, but I think that when there is an
official main repo, git is far more complicated than it needs to be. It’s
confusing to a lot of developers because of all this unnecessary complexity.
Most teams could probably be served better by a simpler system.

~~~
Koromix
I share this opinion. For the centralized-repo style that most teams need,
the Git model has three problems:

- It is hard to use. There's a reason most teams have one or two Git
"experts" to help everyone else resolve the repository / merge / pull /
whatever problems that happen regularly. The onboarding experience for
developers unfamiliar with Git is very bad, and many of them never become
able to use it. If you have non-developers (e.g. artists), it is even worse.

- Binary files. Yes, there is Git LFS, but there are many small integration
problems when you try to use it, and the simple fact that Git needs something
special to manage binary files is a problem. And no, storing binary files
outside the repository is NOT a good solution. In games, for example, the
binary files and the engine usually evolve together, and if you want to check
out an old version, the code and the assets need to stay in sync. You want to
version the game, not just the engine.

- Partial (subtree) clones and permissions. There's a reason monorepos are
popular: they're much simpler to manage. Unfortunately, because Git barely
does partial clones and cannot manage permissions (inside a repository) at
all, monorepos are much more limited than in SVN or P4. That's why many
projects end up divided into multiple repos, but this brings all sorts of
problems and a large administrative overhead for all developers.

~~~
nix23
Give bitkeeper a try, it's now FOSS:

- Simple, clean interface

- You can work the Linux way or with a central-repo style

- Binary asset manager built in

- Nested repositories (submodules done right)

~~~
Koromix
Interesting, thanks for the tip! I'm gonna give it a go during the weekend.

------
IshKebab
This is going to be such a pain for clients to support, especially because
there's not really a good specification for git - the official implementation
is the specification. For my client I think I will just have to wait until
this is officially enabled before implementing it. I guess it has to be done,
though.

I don't understand the bit about distinguishing hashes by length only. If
you're going to change it, why not add an identifier prefix? Then the logic
can be "if length == 40 { it's SHA-1 } else { read the first byte to
determine the hash algorithm }".

------
dec0dedab0de
Wouldn't an easier solution be to keep SHA-1 working exactly as is, but add
additional hashes to verify that the files haven't been maliciously modified
to keep the same SHA-1? It seems almost impossible to craft a file for which
several different hashing algorithms all produce colliding values. Assuming
older versions don't explode when new data is added, it should stay backwards
compatible.

The SHA-1 and short hashes would be the names that are displayed and used for
checking out commits, but not relied upon for security.

------
cletus
I've never seen an adequate answer to this but it boggles my mind that this
problem wasn't foreseen.

When Git came out we'd already seen a move from MD5 hashes to SHA-1 because
MD5 was no longer deemed secure. Of course this was going to happen again.

So why wasn't the protocol and the repository designed to have multiple (hash
algorithm, hash) pairs?

At this point it becomes a client and server issue whether you support a
given protocol, so you have a handshake to agree on what you're using (e.g. a
server may reject SHA-1), much like we have with SSH.

This seems like it would've been trivial to allow for in the beginning. Or am
I missing something?

~~~
corty
It would have been possible to include the ability to change algorithms in a
handshake (agility).

But the general current consensus in the cryptography and security community
(which I don't agree with) is: "Agility is bad, hmkay". The reasoning is
largely that there have been plenty of attacks based on an attacker
downgrading protocols like TLS to broken crypto. Also, the handshake
mechanism necessarily enlarges the attack surface of your software, usually
in a pre-auth path, which is very problematic.

The counter-argument is of course that a lack of agility leads to lots of pain
later on, when you need agility because your one chosen algorithm is broken.

There is also the added argument for not using SHA-1 in git: SHA-1 was already
starting to "smell" when git was first implemented. Picking something else
back then would at least have moved the current problems farther into the
future, when SHA-2 might start to be problematic.

~~~
hannob
There are good reasons why agility is bad. It never works the way people
imagine.

The fantasy is that people think "in case of a sudden algorithm break we can
quickly switch to a better algorithm".

This is wrong because:

a) there are no sudden algorithm breaks. You always know way in advance before
something is broken. (In the case of SHA-1 there was more than a decade
between early warnings and the first attack.) In the case of Git the warnings
about SHA-1 are older than Git itself. The problem wasn't lack of agility, the
problem was choosing an algorithm that was already known to be bad.

b) if you have agility it means people will continue supporting bad algorithms
and use agility as an excuse.

b) happened exactly in TLS. Padding oracles were known, but people chose to
ignore it. In TLS 1.2 they included better algorithms (GCM), but chose to keep
the bad ones (CBC/HMAC). Then more padding oracle attacks happened. Then
people said "oh we have agility, we can disable CBC and ... well nobody
supports GCM yet... let's use this other broken thing we have called RC4".
Then attacks on RC4 came in.

If, by the time people designed TLS 1.2, they had said "we know RC4 is weak,
we know CBC/HMAC (as used in TLS) is weak, we will just support GCM", none of
that mess would have happened.

~~~
corty
The changeover to TLS1.2 could only ever have worked because of agility.
Without at least protocol version agility (which implies cryptographic
agility), one couldn't talk TLS and its successor on the same port, because
there wouldn't be any other way to distinguish them. A new TCP port assignment
would have been possible, but a total and utter disaster. Countless firewalls,
appliances, routers, middleboxes, rulesets and similar stuff assume that HTTPS
is on port 443; changing that would take at least a decade.

That the allowed set of ciphers didn't change enough with TLS1.2 may be right,
but doesn't matter for why we do need agility.

Agility is necessary, and TLS is the poster child for why we need agility in
such protocols. Whether the same applies to git, where the network protocol
isn't as important or ingrained, is of course debatable; I think you are
right there. But what git has to do now is create agility by cramming it in
somewhere, using kludges, hacks, and whatever might be possible without
breaking too much. It remains to be seen whether the kludge will be free of
security problems.

~~~
cyphar
Given the plethora of downgrade attacks against TLS that have been found over
the decades, I'd really hope that there is a better poster-child for cipher
agility. If the canonical example of "cipher agility working as intended" is
TLS, then I want off this rollercoaster ride.

~~~
corty
TLS agility is definitely not the poster child for a good implementation. I
do not know of any cipher agility implementation I would really call good.

However, TLS is the prime example of why we desperately need agility anyway.
Just look at the myriad ways things broke just trying to get TLS1.3 to work,
and now imagine that without any agility mechanism at all. We would still be
planning for the big introduction of TLS1.3 a few years into the future.

------
rudolph9
I wish more people would adopt (and better formalize) multihash and related
projects
[https://github.com/multiformats/multihash](https://github.com/multiformats/multihash)

It’s a really simple, good idea that was developed in the IPFS world.
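
For reference, a minimal sketch of the encoding, using the registered
multihash code 0x12 for sha2-256 (per the table in that repo):

    
    import hashlib
    
    def multihash_sha256(data):
        # Multihash layout: <varint code><varint length><digest>.
        # For sha2-256 the code is 0x12 and the length is 32 (0x20),
        # so code and length each fit in a single varint byte.
        digest = hashlib.sha256(data).digest()
        return bytes([0x12, len(digest)]) + digest
    
    multihash_sha256(b'hello').hex()
    # -> '1220' followed by the usual 64 hex characters of the digest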

------
rudolph9
Perhaps I missed it but what are we doing to future proof the hash algorithm
to when we need to transition away from sha256?

Seems like we’re going to just have the same problem in another decade if we
don’t address this now?

~~~
matt_kantor
> In closing, it is worth noting that one of the reasons this transition has
> been so hard is that the original Git implementation was not designed to
> swap out hashing algorithms. Much of the work put in to the SHA-256
> implementation has been walking back this initial design flaw. With these
> changes almost complete, it not only provides an alternative to SHA-1, but
> also makes Git fundamentally indifferent to the hashing algorithm used. This
> should make Git more adaptable in the future should the need to replace
> SHA-256 with something stronger arise.

That's the last paragraph of the article.

------
miohtama
I remember that during the DVCS wars, an argument raised for Bazaar/Mercurial
was that they supported pluggable hash algorithms, while Git was hard-coded
to SHA. Now this design decision is harder to change.

------
qwename
Other than using a stronger hashing algorithm that produces longer hashes,
would there be any advantage in storing two or more separate hashes of an
object? The extra hashes could be from a different hash function, or a hash of
the reversed bits/bytes.

I wonder about the difficulty of producing collisions for a single 256-bit
hash function versus two 128-bit hash functions, four 64-bit hash functions
and so on.

~~~
TicklishTiger
That reminds me of my phone. It is kinda special. Instead of an eleven-digit
number, I have eleven one-digit phone numbers. When I write them down, I
usually just write them all in one row. So the written version of my eleven
phone numbers looks just like a normal eleven-digit phone number.

~~~
SftwreEngnr
What?

~~~
sophiebits
A function that concatenates the results of two 64-bit hash functions _is_ a
128-bit hash function.

Just as a string of individual digits makes a larger number.

~~~
hobofan
I mean, technically yes, but it would be a 128-bit hash function with the
security properties of a 64-bit hash function, so it offers little advantage
over just using a 64-bit hash (which I think was also the point you were
trying to make?).

However, that doesn't really address the original question of how much harder
cracking two 64-bit hashes would be than cracking a single 128-bit hash.

My best guess is that it's really hard to quantify, as you start opening up
more dimensions besides the number of bits. The gain would mostly come from
protection against other weaknesses in one of the algorithms, like a
potentially hidden backdoor or an undiscovered mathematical flaw. So as long
as the strength of the individual hash functions holds up, it probably makes
sense to diversify between hash functions. E.g. SHA3-256 + BLAKE3-256
probably offers better long-term security properties than using SHA3-512.
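
A sketch of that diversification idea; SHA3-256 and BLAKE2s stand in here
because both ship in Python's hashlib, while BLAKE3 needs a third-party
package:

    
    import hashlib
    
    def dual_hash(data):
        # Concatenate two independent 256-bit digests. Colliding the
        # pair requires colliding both functions simultaneously, so a
        # break of either algorithm alone is not enough.
        return hashlib.sha3_256(data).digest() + hashlib.blake2s(data).digest()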

~~~
dependenttypes
> but it it would be a 128-bit hash function with the security properties of a
> 64-bit hash function

This is not true. Consider two hash functions f and g

    
    
    f(x) = md5(x)[0..63]
    g(x) = md5(x)[64..127]

and a third function

    h(x) = f(x) || g(x)

where || is concat

So no, concatenating multiple smaller hash functions is not any weaker than
using a single big one.
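
A direct Python transcription of that construction, for anyone who wants to
poke at it:

    
    import hashlib
    
    def f(x):
        return hashlib.md5(x).digest()[:8]   # first 64 bits of MD5
    
    def g(x):
        return hashlib.md5(x).digest()[8:]   # last 64 bits of MD5
    
    def h(x):
        # h is just md5 reassembled: two "64-bit hash functions" whose
        # concatenation retains the full 128-bit strength of MD5.
        return f(x) + g(x)
    
    assert h(b'hello') == hashlib.md5(b'hello').digest()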

~~~
hobofan
So your point is that if you take the output of the _same_ 128-bit hash
function twice, split it into 64-bit parts and then put it back together, you
still have the full properties of the 128-bit hash function? Well, no shit.

I have to admit that I'm not the greatest cryptography whiz, but I can't
imagine that this holds up for _independent_ hash functions, where you should
be able to run a preimage attack more cheaply against two 64-bit hash
functions than against one 128-bit hash function.

~~~
dependenttypes
> where you should be able to more cheaply run a preimage attack against two
> 64bit hash functions than one 128bit hash function.

Try doing that in my example.

~~~
freemint
Your hash functions are constructed in a way such that the concatenation has
the security properties of a 128-bit hash function (because your construction
is equivalent to md5). I am not sure that result holds for all concatenations
of hash functions, or for concatenations of SHA variants.

I appreciate you making me think more deeply about hash functions. Nice
construction.

------
sargun
I wonder why SHA-256, and not BLAKE2(b/s) or BLAKE3. BLAKE3 is significantly
faster.

~~~
bawolff
Does performance matter in this context? Git isn't exactly hashing terabytes
of data.

~~~
JoshTriplett
It does; I've encountered some cases (ingesting large amounts of data into
git) where the limiting factor on performance is a combination of SHA-1 and
DEFLATE.
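
For a rough feel for relative hash throughput on a given machine, a quick
hashlib micro-benchmark sketch (single-threaded and crude, not rigorous;
BLAKE3 isn't in the standard library, so BLAKE2b stands in for the BLAKE
family):

    
    import hashlib
    import time
    
    # Hash 64 MiB of zeros once per algorithm; crude but indicative.
    data = b'\x00' * (64 * 1024 * 1024)
    
    for name in ('sha1', 'sha256', 'blake2b'):
        start = time.perf_counter()
        hashlib.new(name, data).digest()
        elapsed = time.perf_counter() - start
        print(f'{name}: {len(data) / elapsed / 1e6:.0f} MB/s')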

------
gentleman11
I am working on a game, and version control in the gaming world is definitely
weird to somebody coming from web dev. Almost nobody in this world uses git;
they all use Perforce, SVN, and Plastic. Apparently git doesn’t deal with
binary assets well; even with LFS, it’s hard to find hosting for your
100-500GB repositories that also supports file locking (another
binary-specific feature needed for games that many hosts don’t seem to
support, for some reason).

I wanted to try lfs, but where do you actually go to find cheap hosting for a
100-150GB repo?

------
austincheney
I couldn’t find it in the article, but why settle on SHA2-256? Why not
something bigger or newer, such as SHA2-512 or SHA3-256? Or why not a
variable-length hash like SHAKE256?

~~~
chmaynard
I believe this is discussed in
[https://lwn.net/Articles/811068/](https://lwn.net/Articles/811068/)

------
kzrdude
When browsing lwn in firefox, "unable to connect" errors are relatively
common. Is this just the server being aggressive with prioritizing which
requests to serve?

------
kazinator
Is there a way to opt out of this other than pinning your git executables to
an old version or forking?

I want to make sure all my git repos are forever readable by the original git
they were produced with, even if newer git versions are used for producing
commits.

The article makes some troubling comments like:

> _countless repositories need to be transitioned from SHA-1 to SHA-256._

Absolutely never doing this.

------
bluesign
Maybe a naive question, but why not hash the content with SHA256 and then
hash the result with SHA1?

~~~
umvi
That's like saying "why not choose a 20 character password and just use the
first 8 characters of it?" The result is still only 160 bits, so you keep
SHA-1's output size, and its generic birthday bound, instead of SHA-256's.

------
rurban
So brian finally got closer? I've been waiting for a long time.

