
The beginning of Git supporting other hash algorithms - _qxtl
https://github.com/git/git/commit/e1fae930193b3e8ff02cee936605625f63e1d1e4
======
bk2204
I'm the person who's been working on this conversion for some time. This
series of commits is actually the sixth, and there will be several more
coming. (I just posted the seventh to the list, and I have two more mostly
complete.)

The current transition plan is being discussed here: [https://public-
inbox.org/git/CA+dhYEViN4-boZLN+5QJyE7RtX+q6a...](https://public-
inbox.org/git/CA+dhYEViN4-boZLN+5QJyE7RtX+q6a92p0C2O6TA53==BZfTrQ@mail.gmail.com/T/)

~~~
rurban
I do like your hashname/nohash idea. Now if we could only come up with a
simple compression negotiation protocol as well: zlib -> zstd. But that will
be much harder, since hashes are internal only, while compression is part of
the protocol.

Kudos to brian m carlson for convincing Linus to use SHA3-256 over SHA-256;
this is really the only sane option we have.

~~~
lisper
> this is really the only sane option we have

Why?

~~~
wolf550e
Yeah, I would have gone with BLAKE2. It's much faster than SHA-256 and
SHA3-256: [https://blake2.net/skylake.png](https://blake2.net/skylake.png)
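
If you want to sanity-check the relative speeds yourself, here's a minimal
single-threaded benchmark sketch in C. It assumes OpenSSL 1.1.1 or later (for
the SHA3-256 and BLAKE2b implementations); absolute numbers vary a lot by CPU,
and BLAKE2's biggest wins show up with tuned implementations like the one in
that chart.

    /*
     * Minimal digest benchmark sketch. Assumes OpenSSL >= 1.1.1.
     * Compile with: cc bench.c -O2 -lcrypto
     */
    #include <openssl/evp.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static void bench(const char *name, const EVP_MD *md,
                      const unsigned char *buf, size_t len)
    {
        unsigned char out[EVP_MAX_MD_SIZE];
        unsigned int outlen;
        int i, iters = 20;
        clock_t start = clock();

        for (i = 0; i < iters; i++)
            EVP_Digest(buf, len, out, &outlen, md, NULL);
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("%-9s %7.1f MB/s\n", name, iters * (len / 1e6) / secs);
    }

    int main(void)
    {
        size_t len = 16 * 1024 * 1024;   /* hash 16 MiB of zeroes */
        unsigned char *buf = calloc(len, 1);

        bench("sha1", EVP_sha1(), buf, len);
        bench("sha256", EVP_sha256(), buf, len);
        bench("sha3-256", EVP_sha3_256(), buf, len);
        bench("blake2b", EVP_blake2b512(), buf, len);
        free(buf);
        return 0;
    }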

~~~
benchaney
This is a perfect example of a situation where hashing performance doesn't
matter at all.

~~~
harshreality
The hash function may not matter for overall git performance in virtually all
dev machine setups, but there will be a (maybe tiny, maybe larger, depending
on the repo and disk io speed) difference in cpu utilization and heat
generation, right?

~~~
copperx
That's a silly thing to worry about when you're developing Ruby or Java
applications. My PC boots faster than the Rails console or IntelliJ.

~~~
harshreality
The broader point is this: code repositories are not the only things git is
used for.

------
lvh
From a cryptographer's perspective, everything around SHA-3 is a little weird.
We ended up with something that's pretty slow even though we had faster
things, for which general consensus was that they were just as strong.
Similarly, consensus was that some SHA-3 candidates made it as far as they did
because they are drastically different from previous designs. Picking a major
standard takes a while, and immediately preceding it we saw scary advances in
attacks on traditional Merkle-Damgård hashes like SHA-0 and SHA-1. Not SHA-2, but
it's pretty similar, so the parallels are obvious.

Now that we have SHA-3, we ended up with a gazillion Keccak variants and
Keccak-likes. The authors of Keccak have suggested that Git may instead want
to consider e.g. SHAKE128. [0]

[0]: [https://public-
inbox.org/git/91a34c5b-7844-3db2-cf29-411df5b...](https://public-
inbox.org/git/91a34c5b-7844-3db2-cf29-411df5bcf886@noekeon.org/)

It's a bit unfortunate that this is really a cryptographic choice, and it
seems to mostly be made by non-cryptographers. Furthermore, the people making
that choice seem to be deeply unhappy about having to make it.

This makes me unhappy, because I wish making cryptographic choices got much
easier over time, not harder. While SHA-2 was the most recent SHA, picking the
correct hash function was easy: SHA-2. Sure, people built broken constructions
(like prefix-MAC or whatever) with SHA-2, but that was just SHA-2 being
abused, not SHA-2 being weak.

A lot of those footguns are removed with SHA-3, so I guess safe crypto choices
are getting easier to make. On the other hand, the "obvious" choice, being
made by the aforementioned unhappy maintainers, is slow in a way that probably
matters for some use cases. What's more, not even the designers think it's an
obvious choice, I think most cryptographers don't think it's the best tool we
have, and we have a design that we're less sure how to parametrize. There are
easy and safe ways to parametrize SHA-3 to e.g. fix flaws like Fossil's
artifact confusion -- but BLAKE2b's are faster and more obvious. Somehow, I
can't be terribly pleased with that.

------
lvh
FWIW, Fossil released a version with backwards compatibility and configurable,
graceful upgrades a week ago: [https://www.fossil-
scm.org/index.html/doc/trunk/www/changes....](https://www.fossil-
scm.org/index.html/doc/trunk/www/changes.wiki#v2_1)

~~~
wolf550e
Dmitry Chestnykh wrote a little about problems with the documented security
claims of Fossil SCM 3 days ago:

[https://twitter.com/dchest/status/842489752892968960](https://twitter.com/dchest/status/842489752892968960)

[https://twitter.com/dchest/status/842498609652383744](https://twitter.com/dchest/status/842498609652383744)

~~~
david-given
Given that both claims are unreferenced and using deliberately provocative
language, I'd say he wrote _very_ little...

~~~
dchest
I'm a long-time fan of Fossil and have contributed a bit to its development
(in particular, TLS support and some protections against timing attacks). I'm
not sure where you found provocative language, but let me try to explain it
here more clearly.

 _Design deficiency_

(This is unrelated to the choice of hash.)

Fossil stores blobs as-is. A file containing "hello world" will be stored as
"hello world" and referenced as HASH("hello world").

Commits are stored as plain-text manifests, which are also referenced as
HASH(manifest_contents). To distinguish between different types of artifacts
(commit, file, wiki page, etc.), Fossil checks the contents of the blob.

See [https://www.fossil-
scm.org/index.html/doc/trunk/www/fileform...](https://www.fossil-
scm.org/index.html/doc/trunk/www/fileformat.wiki) for a detailed description.

This made possible the following attack:

* Clone repository.

* Modify some files, commit.

* Deconstruct repository.

* Attach the deconstructed artifacts with changes to a ticket in the original repository or to a wiki.

By doing this, you could make commits to the target repository by attaching
files to tickets or the wiki, and these commits were visible only to people
who cloned the repo, until a rebuild (after which they would be visible to
everyone).

This attack was prevented by compressing every attached file with gzip, making
it impossible to attach a file that would be recognized as a commit, because
gzip adds its own header.

I think this design is deficient: instead, each blob should have a type
indicator — that is, file artifacts should have some prefix. This is how Git
works: each object has a prefix indicating its type. Also, Plan 9 had a
filesystem called... also Fossil! — which was based upon the Venti content-
addressable storage system, which stored _typed_ blobs.
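
To make the difference concrete, here's a small sketch in C of the two naming
schemes (using OpenSSL's SHA-1 purely for illustration). Under the
Fossil-style scheme the name depends only on the raw bytes, so any blob whose
bytes happen to parse as a manifest can be treated as one; under the Git-style
scheme the type is baked into the name:

    /*
     * Untyped vs. typed content addressing, sketched with SHA-1.
     * Compile with: cc naming.c -lcrypto
     */
    #include <openssl/sha.h>
    #include <stdio.h>
    #include <string.h>

    static void print_hex(const char *label, const unsigned char *h)
    {
        int i;
        printf("%s", label);
        for (i = 0; i < SHA_DIGEST_LENGTH; i++)
            printf("%02x", h[i]);
        printf("\n");
    }

    int main(void)
    {
        const char *content = "hello world";
        size_t clen = strlen(content);
        unsigned char hash[SHA_DIGEST_LENGTH];
        unsigned char obj[256];
        int hlen;

        /* Fossil-style: name = HASH(contents); type inferred by parsing */
        SHA1((const unsigned char *)content, clen, hash);
        print_hex("fossil: ", hash);

        /* Git-style: name = HASH("blob <len>\0" + contents) */
        hlen = snprintf((char *)obj, sizeof(obj), "blob %zu", clen) + 1;
        memcpy(obj + hlen, content, clen);       /* header NUL is kept */
        SHA1(obj, hlen + clen, hash);
        print_hex("git:    ", hash);
        return 0;
    }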

Unfortunately, changing this will break compatibility, and since Fossil
artifact format was built to last for ages, I don't think it will be changed.

 _SHA-1 claims_

What made me rant about Fossil after congratulating them on switching to
SHA3-256 is that they made false claims regarding their use of SHA-1 _in the
same documentation which shows these clams are false_ :

Quoting [https://www.fossil-
scm.org/index.html/doc/trunk/www/hashpoli...](https://www.fossil-
scm.org/index.html/doc/trunk/www/hashpolicy.wiki):

 _The SHA1 hash algorithm is used only to create names for artifacts in Fossil
(and in Git, Mercurial, and Monotone). It is not used for security.
Nevertheless, when the Shattered attack found two different PDF files with the
same SHA1 hash, many users learned that "SHA1 is broken". They see that Fossil
(and Git, Mercurial, and Monotone) use SHA1 and they therefore conclude that
"Fossil is broken". This is not true, but it is a public relations problem. So
the decision was made to migrate Fossil away from SHA1._

If you search the docs, you discover that they do use SHA-1 for security:

* To store passwords ([https://www.fossil-scm.org/index.html/doc/trunk/www/password...](https://www.fossil-scm.org/index.html/doc/trunk/www/password.wiki))

* In the client-server authentication protocol, in an ad-hoc MAC construction ([https://www.fossil-scm.org/index.html/doc/trunk/www/sync.wik...](https://www.fossil-scm.org/index.html/doc/trunk/www/sync.wiki))

Speaking of passwords, the automatically generated passwords are too short: I
just created a repo with Fossil v2.1 and got "efc6f5" as initial password.
It's 6 hex characters, or just 3 bytes — trivial to crack.
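
To put "trivial to crack" in numbers: 6 hex characters is a 2^24 keyspace,
which a single core exhausts in seconds. The sketch below assumes, purely for
illustration, an unsalted SHA-1 over the bare password; Fossil's actual
construction mixes in more data, but that doesn't enlarge the space an
attacker has to search.

    /*
     * Keyspace illustration ONLY: assumes an unsalted SHA-1 of the bare
     * password, which is not Fossil's exact construction.
     * Compile with: cc crack.c -O2 -lcrypto
     */
    #include <openssl/sha.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Pretend "efc6f5" is the unknown auto-generated password. */
        unsigned char target[SHA_DIGEST_LENGTH];
        unsigned char hash[SHA_DIGEST_LENGTH];
        char guess[8];
        unsigned v;

        SHA1((const unsigned char *)"efc6f5", 6, target);

        for (v = 0; v < 1u << 24; v++) {         /* all 16^6 candidates */
            snprintf(guess, sizeof(guess), "%06x", v);
            SHA1((const unsigned char *)guess, 6, hash);
            if (!memcmp(hash, target, sizeof(target))) {
                printf("recovered: %s\n", guess);
                return 0;
            }
        }
        return 1;
    }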

Finally, as I said, I really like Fossil even though I don't use it anymore
for open source projects (I still use it for some private projects), and I
have great respect for its author and other contributors. But in my opinion,
it needs at least a fundamental but simple change to the storage format to
introduce object types.

If something is unclear or you have questions, I'm happy to answer.

~~~
david-given
Thank you very much for posting this --- this is exactly the information which
was missing from the Twitter thread!

(My concern was that the Twitter thread basically contained _nothing_ of any
substance: no technical details, no link to a blog post, nothing that could be
checked or verified... which made it indistinguishable from insinuation, and
so I had to dismiss it out of hand.)

(Regarding provocative language: I don't think publicly calling someone out as
a liar, in so many words, on a medium like Twitter that doesn't really allow
for an effective response, is likely to produce a useful result...)

Anyway:

Re the manifest issue, to paraphrase (to check I'm understanding you
correctly): because each manifest refers to its predecessor, and not vice
versa, adding any blob which looks like a manifest implicitly adds that
manifest to the tree. Normally Fossil trusts authenticated users to add blobs
to the tree, because they're authenticated, but ticket attachments can be
added by anyone, which effectively means that the authentication can be
bypassed and commits can be made by anyone. Is that correct?

In which case, yeah, I agree; that's very bad. I can't spot any holes in your
reasoning. It _is_ possible to positively identify attachments by looking at
their parent manifest, as each one should be pointed at by an A record, so I
suppose you _could_ disallow manifests if they're referenced like this, but my
gut tells me that's going to be horribly fragile... you're right, adding a
type prefix is obviously the right way to go.

If you create a manifest and check it in as a normal file, so it's referenced
by an F record, is it still treated as a manifest? If not, could this
machinery be extended to attachments as well?

You _did_ bring this up on the mailing list, right?

~~~
SQLite
When adding an attachment in Fossil, if the attachment is syntactically
similar to a structural artifact (such as a manifest), then the attachment is
compressed prior to being hashed and stored, thus making it very dissimilar to
a manifest and incapable of being confused with a manifest. Hence, it is not
possible for someone to add a new manifest as an attachment and have that
confuse the system. Furthermore, there is an audit trail so that should an
attacker discover and exploit some bug in the previous mechanism and manage to
get a manifest inserted via an attachment, then the rogue manifest can be
easily identified and "shunned".

Users with commit privileges are granted more trust and do have the ability to
forge manifests. But as before, there is an audit trail and rogue manifests
(and the users that insert them) can be detected and dealt with after the
fact.

Structural artifacts have a very specific and pedantic format. You can forge a
structural artifact, but you will never generate one by accident during normal
software development activities.

~~~
david-given
Isn't the problem here, though, that they _can_ be forged? By attaching a
structural artifact in the correct format to a ticket, we're effectively
allowing people without commit privileges to make commits --- potentially
anonymous people.

I agree that this isn't likely to happen by accident, but Fossil servers are
usually public facing, so we have to worry about malice as well.

------
corbet
This work actually began in 2014...
[https://lwn.net/Articles/715716/](https://lwn.net/Articles/715716/)

------
VMG
Is there some explainer on what the support will look like in the end? I'm
curious to know how multiple hash algorithms will be supported in parallel.

~~~
pyed
Probably newer versions will commit using only the new hash algorithm, while
remaining fully able to deal with the old one.

~~~
ebbv
Can it really be that simple, though? If you are using a newer version of Git
on your repo, committing only with the newer hash, and I try to clone your
repo with an older version, I will be unable to do so. I guess maybe that's
acceptable, though?

~~~
CydeWeys
I think backwards incompatibility would be acceptable. Add read support for
the new format to git, but then don't have widespread repositories using the
new format until some period of time later. By the time they do become
commonplace, everyone should already be running a version of git supporting
them. It's not exactly hard to upgrade git in most situations anyway, just a
simple invocation of:

    $ sudo $PKG_MGR upgrade git

~~~
ufmace
Well, it's not hard to update the command-line git client on Unix-y systems
with package management. The trouble will be with the hundreds or thousands of
other programs that use Git in various ways and are essential to development
workflows in various places: GitHub themselves, Microsoft and JetBrains IDEs,
etc.

~~~
CydeWeys
There are two potentially mitigating factors at play here.

First, I suspect a lot of the tools you mentioned already treat hashes as
strings, not as 160-bit numeric types. The entire front-end JS for GitHub, for
example, just uses strings. That's what I'd do if I were writing IDE
integrations and such, too.

Secondly, the new format will likely still be a 160-bit numeric type, just
calculated using a different hash algorithm (e.g. it might be the first 160
bits of the SHA256 result). The tools you mentioned likely don't have to
calculate said hashes, they just display them. The entire GitHub front-end,
for instance, just displays whatever is given to it; commit hashes are input
data to it, not output data.

~~~
eklitzke
Just using the first 160 bits of a new hash function was proposed at one
point, but it's not part of the current plan. The new plan is to introduce
full SHA3-256 hashes (which are 256 bits in size). More information here:
[https://docs.google.com/document/d/18hYAQCTsDgaFUo-
VJGhT0Uqy...](https://docs.google.com/document/d/18hYAQCTsDgaFUo-
VJGhT0UqyetL2LbAzkWNK1fYS8R0/edit#)

(Of course, CLI and frontend tools could still truncate display output to 40
hex characters, but internally full size hashes will be used.)
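
Presumably display code would then look something like this sketch: compute
and store the full 256-bit id, and shorten it only at the presentation layer
(assumes OpenSSL 1.1.1+ for SHA3-256):

    /*
     * Full-width id internally; truncation only at the display layer.
     * Compile with: cc display.c -lcrypto
     */
    #include <openssl/evp.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned char md[EVP_MAX_MD_SIZE];
        unsigned int mdlen, i;
        char hex[2 * EVP_MAX_MD_SIZE + 1];

        EVP_Digest("hello", 5, md, &mdlen, EVP_sha3_256(), NULL);
        for (i = 0; i < mdlen; i++)
            sprintf(hex + 2 * i, "%02x", md[i]);

        printf("internal: %s\n", hex);      /* 64 hex chars */
        printf("display:  %.40s\n", hex);   /* truncated for humans */
        return 0;
    }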

~~~
CydeWeys
Woah, it's disorienting to be linked to a bluedoc unexpectedly like that,
considering I'm not on my work account. I even recognize one of the authors.

~~~
boxfish
What is a bluedoc?

~~~
CydeWeys
It's just a Google Docs template used for internal engineering design docs at
Google. The linked doc is a typical example.

------
benhoyt
I immediately looked at the length of this commit's hash to see if it was
longer than 40 hex chars -- but no, it's just an SHA-1. It would have been
cool if somehow the hash of this commit that added new hashes was a new hash.

Slightly similar: for a while I've wanted to recreate just enough of git's
functionality to commit and push to GitHub. My guess is the commit part would
be pretty trivial (as git's object and tree model is so simple), but the
push/network/remote part would be a bunch harder.
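
For anyone tempted to try: the commit path really is small. Here's a rough C
sketch of what `git hash-object -w` does for a blob (error handling omitted;
it assumes an existing .git/objects directory; compile with -lcrypto -lz).
Trees and commits are just more objects in the same envelope format.

    /*
     * Rough sketch of `git hash-object -w` for a blob: header + content,
     * SHA-1 for the name, zlib-deflate into .git/objects/xx/<rest>.
     * Error handling omitted. Compile with: cc blob.c -lcrypto -lz
     */
    #include <openssl/sha.h>
    #include <zlib.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    int main(void)
    {
        const char *content = "hello world\n";
        size_t clen = strlen(content);

        /* Object = "blob <len>\0" + content */
        unsigned char obj[4096];
        int hlen = snprintf((char *)obj, sizeof(obj), "blob %zu", clen) + 1;
        memcpy(obj + hlen, content, clen);
        size_t olen = hlen + clen;

        /* Name = SHA-1 over the whole object */
        unsigned char sha[SHA_DIGEST_LENGTH];
        char hex[41];
        int i;
        SHA1(obj, olen, sha);
        for (i = 0; i < SHA_DIGEST_LENGTH; i++)
            sprintf(hex + 2 * i, "%02x", sha[i]);

        /* Store = deflated object at .git/objects/xx/yyyy... */
        unsigned char zbuf[8192];
        uLongf zlen = sizeof(zbuf);
        compress(zbuf, &zlen, obj, olen);

        char path[64];
        snprintf(path, sizeof(path), ".git/objects/%.2s", hex);
        mkdir(path, 0755);
        snprintf(path, sizeof(path), ".git/objects/%.2s/%s", hex, hex + 2);
        FILE *f = fopen(path, "wb");
        fwrite(zbuf, 1, zlen, f);
        fclose(f);

        printf("%s\n", hex);
        return 0;
    }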

------
gkya
The commit on git.kernel.org:
[https://git.kernel.org/pub/scm/git/git.git/commit/?id=e1fae9...](https://git.kernel.org/pub/scm/git/git.git/commit/?id=e1fae930193b3e8ff02cee936605625f63e1d1e4)

------
zoren
Someone please remind me why the hash is not a type definition so the
representation would only have to be changed in one place.

~~~
dahart
That's exactly what this change is. You mean why wasn't it that way before the
change? Maybe because it wasn't ever needed before? Git's been good with only
sha-1 for 12 years. Think about the flip side of your question... what were
the alternatives 12 years ago, or 5 years ago? And why would someone write
code for alternatives that aren't expected to be used and maybe don't exist?

In my experience, generalizing ahead of need more often than not causes
problems, and I've watched over-engineering result in far more effort to fix
when the need it was anticipating does arrive than just waiting until the need
is there.

~~~
tedunangst
Even totally ignoring that SHA2 was a thing, anybody looking around would have
noticed that MD4 was broken, MD5 was broken, and it would be unlikely that the
hash of today would stand forever.

~~~
dahart
Yes, true. Correct. That is still true, and applies to SHA-2 as well. And
Linus was aware of exactly what you say, back in 2005.

My point was that the choice that was made was considered good enough for the
purposes for which it was intended. In the context of the OP's comment,
criticizing git for not making different code design choices doesn't mean that
Linus was wrong, it may mean that the OP doesn't know and/or understand all
the considerations Linus had. And Linus has said many times that the security
of the hash is _not_ the primary consideration in his design.

Git's choice of SHA-1 was not at the time predicated on having the single most
unbreakable hash in existence, the hash's use is not for security purposes,
and to talk with such incredulity about Linus' choice may be to misunderstand
git's design requirements.

~~~
QSIITurbo
Well, this whole mess proves that Linus was wrong. Typing "unsigned char
[20]" everywhere is beyond amateurish to me in any case, and it raises
concerns about the overall quality of the code in git and the Linux kernel.

~~~
yuhong
It shows that Linus is not a cryptographer, to be more precise. Though yes,
SHA-1 chosen-prefix attacks are still very expensive at this point. I wonder
how many non-cryptographers knew about SHA-2 back in 2003-2004.

~~~
indolering
Linus regularly treats security as a second-class citizen and is famous for
his outrageous harassment [0]:

> Of course, I'd also suggest that whoever was the genius who thought it was a
> good idea to read things ONE F*CKING BYTE AT A TIME with system calls for
> each byte should be retroactively aborted. Who the f*ck does idiotic things
> like that? How did they not die as babies, considering that they were likely
> too stupid to find a tit to suck on?

He deserves to eat this shit sandwich.

> I wonder how many non-cryptographers knew about SHA-2 back in 2003-2004.

Any systems engineer should have known about SHA-2. SHA-1 only provides 80
bits of security against collisions, so everyone else assumed that it would
need to be replaced.

[0]:
[https://en.wikiquote.org/wiki/Linus_Torvalds](https://en.wikiquote.org/wiki/Linus_Torvalds)

~~~
striking
What does his "outrageous harassment" have to do with his ignorance of
security?

I agree that he should've used SHA-2 or, better yet, made the hash algorithm
modular, but what does your quote add to the discussion?

~~~
indolering
> but what does your quote add to the discussion?

Not much, thanks for the gentle reminder :)

------
ossmaster
This could be my ignorance of the project's details, but where are the tests
for this?

~~~
smileysteve
The t/ directory.

[https://github.com/git/git/blob/master/t/README](https://github.com/git/git/blob/master/t/README)

------
kozak
Do they anticipate that one day we'll have to move from SHA-256 to something
else again? It's only a matter of time: hash functions have a lifecycle. The
transition has to be done in a way that will also make the next transition
more straightforward.

~~~
anilgulecha
A note on 'lifecycle': that's not how it works -- a hash's lifespan is not a
function of its bit length, nor is it inevitable that the current standard
will be broken.

Technically, MD5 (128 bits) and SHA-1 (160 bits) lengths are sufficient for
hashes, but they had cryptographic weaknesses -- the functions had
cryptanalytic attacks, which reduced brute force from the complete keyspace to
something of a much smaller magnitude. These weaknesses are what led to the
deprecation of MD5 and SHA-1.

It is definitely possible that new cryptanalytic attacks could be shown
against SHA-256/512, but none have so far been publicly demonstrated. Hence
the confidence in them.

~~~
amluto
> Technically, MD5 (128 bits) and SHA-1 (160 bits) lengths are sufficient for
> hashes, but they had cryptographic weaknesses

Not true. A 128-bit hash gets collisions after ~2^64 tries, and a big cluster
can find targeted 128-bit collisions. To attack something like git, the entire
attack can be done offline.

The big MD5 X.509 break needed cryptanalysis to make it feasible, because that
attack needed to happen in real time.
~~~
rocqua
The threat models in which 64 bits of security (the birthday bound on a
128-bit hash) are insufficient are really limited.

~~~
IncRnd
That is misleading, since a birthday attack is not required. The security
strength of hashes is not measured by the length of the hashes.

md5 was first "broken" in 1995. As of 10 years ago, a collision attack took
only a few seconds. Plus, there are a _number_ of other attacks on the hash.

~~~
rocqua
The argument I replied to concerned the best case for a given length, i.e. a
perfect hash at 128 bits delivers 64 bits of security against collisions.
(Note the 'perfect' part.)

64 bits of security is good enough against most non-nation state actors.

Obviously, MD5 (and sha-1) aren't anywhere near perfect hashes. And obviously,
you need to look at more than length when judging a hash.

Basically my point was that md5's hash length isn't a big problem.

~~~
IncRnd
You can rent Amazon time and create an md5 collision for less money than
people spend going to a movie. Restating the issue as "a perfect hash is
perfect" may be correct in a limited sense, but it is also highly misleading.

64 bits of ideal security is about half the industry accepted security
strength in bits for a hash function.

------
btrask
This is the chance to get rid of the object prefixes (i.e. "blob" plus file
length) that prevent the generated hashes from being compatible with hashes
generated by other software.

------
koolba
Since the majority of us are running x64 machines, will the hash be
SHA-512/256 (truncated SHA-512) or plain SHA-256? The former is significantly
faster on x64 machines.

~~~
snakeanus
>Since the majority of us are running x64 machines

We aren't.

~~~
koolba
I didn't say all, I said the majority. If you think I'm wrong, show me a
statistic showing that the most common platform for developers using git isn't
x86-64.

------
kazinator
What problem does this solve? Are collisions common?

~~~
krallja
Until a few weeks ago, SHA-1 collisions had never been demonstrated.

~~~
kazinator
But, in any case, that's in the cryptographic realm.

Git hashes aren't digital signatures for cryptographic authenticity.

~~~
hkdennis2k
They are.

The git tag-signing and verification logic assumes the SHA-1 hashes for
integrity.

~~~
kazinator
Hashing for integrity and authenticity are different things.

For instance, a mere four-byte CRC-32 can reasonably assure integrity in some
situations, such as when used on sufficiently small payload frames; yet it is
not useful as a digest for certifying authenticity.

SHA-1 is suitable for integrity.

~~~
belovedeagle
That it may be, but in git, SHA-1 is also used for authenticity. "Signing a
commit" only authenticates one commit, and is considered to authenticate the
state of the repository only insofar as it authenticates the SHA-1 references
contained in the topmost commit.
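
Concretely, a signature over a commit covers just the commit object's text,
which reaches the rest of the repository only through embedded hashes. Roughly
(the hash values below are placeholders):

    /*
     * The text a commit object contains. A signature covers exactly these
     * bytes, so everything below the signed commit is pinned only by the
     * tree/parent hashes -- which is why the hash function ends up
     * carrying the authenticity guarantee.
     */
    #include <stdio.h>

    int main(void)
    {
        const char *commit =
            "tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
            "parent e1fae930193b3e8ff02cee936605625f63e1d1e4\n"
            "author A U Thor <author@example.com> 1490000000 +0000\n"
            "committer A U Thor <author@example.com> 1490000000 +0000\n"
            "\n"
            "message goes here\n";
        fputs(commit, stdout);
        return 0;
    }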

------
pwdisswordfish
struct object_id was introduced in this commit, in 2015:

[https://git.kernel.org/pub/scm/git/git.git/commit/?id=5f7817...](https://git.kernel.org/pub/scm/git/git.git/commit/?id=5f7817c85d4b5f65626c8f49249a6c91292b8513)

So this change doesn't do much for now. Good to see, though.

~~~
bk2204
Yes, this is correct. The struct object_id changes don't actually change the
hash. What they do, however, is allow us to remove a lot of the hard-coded
instances of 20 and 40 (SHA-1 length in bytes and hex, respectively) in the
codebase.

The remaining instances of those values become constants or variables (which
I'm also doing as part of the series), and it then becomes much easier to add
a new hash function, since we've enumerated all the places we need to update
(and can do so with a simple sed one-liner).

The biggest impediment to adding a new hash function has been dealing with the
hard-coded constants everywhere.
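
For readers who haven't followed the series, the shape of the abstraction is
roughly the sketch below. The identifiers are modeled on git's cache.h of this
era but simplified; treat it as an illustration, not the actual patch.

    /*
     * Simplified sketch of the abstraction (modeled on git's cache.h;
     * not the exact code). The point is that "20" and "40" each appear
     * once, and callers only ever see struct object_id.
     */
    #define GIT_SHA1_RAWSZ 20
    #define GIT_SHA1_HEXSZ (2 * GIT_SHA1_RAWSZ)

    /* Sized for the largest hash we ever expect to support. */
    #define GIT_MAX_RAWSZ GIT_SHA1_RAWSZ

    struct object_id {
        unsigned char hash[GIT_MAX_RAWSZ];
    };

    /* Callers compare ids without knowing the width: */
    static int oidcmp_sketch(const struct object_id *a,
                             const struct object_id *b)
    {
        int i;
        for (i = 0; i < GIT_SHA1_RAWSZ; i++)
            if (a->hash[i] != b->hash[i])
                return a->hash[i] < b->hash[i] ? -1 : 1;
        return 0;
    }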

~~~
myst
Says something about the quality of the codebase.

