
A new hash algorithm for Git - Tomte
https://lwn.net/SubscriberLink/811068/cfeb6a67b8dfbe47/
======
donatj
Excuse my ignorance, but couldn’t they just add a SHA256 hash to commit
objects (or some new commit-verify object) of the entire tree’s current
concatenated content, leave everything else SHA1, and get the same benefit
without rewriting the entire thing from the ground up? Git could even do that
as part of the git gc step slowly over time - tag commits with a secondary
hash.

Rewriting the whole thing including every git repos history seems like
throwing the baby out with the bathwater, when you could just add a secondary
transparent verification instead. Just seems like there has to be a better
way.

~~~
wongarsu
You can't change past commits to add that hash (without changing all commit
hashes), so this method could only protect new commits. For any existing repo
this would lead to a very weird security model: We admit that sha1 hashes are
broken, and only guarantee that commits made by git versions newer than git
x.x.x are safe from after-the-fact modification (or alternatively only commits
made after date X).

~~~
gregmac
My inclination is that protecting only new commits might be enough, but it
gets me thinking: What would a practical attack on this look like, assuming
sha1 was broken? Let's say I'm trying to insert a line of code that does
something nefarious, and that it's now trivial to generate "magic text" you
can stick anywhere in a file (eg, inside a comment at the end of a line) to
get any desired sha1 hash.

Are all the other future commits still valid, or am I going to suddenly get
conflicts or garbled text? Depending on where the modification is done, that
code might have gone through much more churn -- especially if there are a
bunch of sha-256 commits after it (which I can't attack). I don't know enough
about how git stores content blobs to answer this.

Second problem: Can I push my replacement commit to another repository (eg,
github)? Would even force push work? Do I have to delete branches and re-push
my own? If I already have enough permission on the repository to do this, it
means I can already push whatever I want -- so does this attack _even matter
at all_?

Assuming that's successful (or I can trick people into using my own
repository), what will happen to someone that already has a clone and does a
pull? Will they get my change (and will it work or be a pile of conflicts or
garbled text)?

Even if only fresh clones will get the changes it could still be quite
devastating -- especially if using CI -- but I'm just not clear if this attack
is even theoretically possible.

~~~
OJFord
> My inclination is that protecting only new commits might be enough

Why? It's not the same as saying 'versions after vX are safe', it's the same
as saying 'any unsafety after vX was there before, not introduced since' (both
with 'as a result of SHA-1 collision' qualifiers of course).

> Can I push my replacement commit to another repository (eg, github)? Would
> even force push work?

Implementation dependent I suppose, but I wouldn't have thought so - I don't
see why they'd actually check the content when the hash is supposed to
indicate whether it differs or not.

> Do I have to delete branches and re-push my own? If I already have enough
> permission on the repository to do this, it means I can already push
> whatever I want -- so _does this attack even matter at all_?

I think an attack would look more like:


  1. Create hostile commit that collides with extant commit SHA
  2. Infiltrate a package repository, or GitHub, or corporate network, or ...
  3. Insert hostile commit in place of real one

Of course it's a problem if 2 & 3 happen alone anyway, but the problem with
the collision commit is that it makes it so much less detectable.

~~~
Nullabillity
Git commits are snapshots, not diffs. Each commit contains a tree, which
contains a list of files and their respective hashes. As long as its whole
tree is SHA-256, a commit should be safe, regardless of its history.

The downside to the migration would be that all unchanged files would be
stored twice (once identified by SHA1, once identified by SHA-256). But you
could work around that by hardlinking identical files.
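
For anyone unfamiliar with the object model, Git's content addressing can be sketched in a few lines of Python (a sketch of the current SHA-1 scheme; under the transition only the hash function changes):

```python
import hashlib

def git_object_id(obj_type: str, content: bytes, algo=hashlib.sha1) -> str:
    # Git hashes "<type> <size>\0" + content, not the content alone.
    header = b"%s %d\x00" % (obj_type.encode(), len(content))
    return algo(header + content).hexdigest()

# Matches `git hash-object`: the well-known id of a blob containing "hello\n"
print(git_object_id("blob", b"hello\n"))
# ce013625030ba8dba906f756967f9e9ca394464a
```

Tree objects are hashed the same way over their entries' ids, which is why a commit's single tree hash pins the entire snapshot.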

~~~
loeg
This doesn't protect subdirectories unless you rewrite the entire tree
structure with SHA256. I don't know if Git does that now, or not. Git
generally points to unmodified subdirectories with the existing content hash;
if the SHA1 is pointed to by SHA256, which is implied by the transition plan
proposed in the great-grandparent comment, then those subdirectories are
essentially unprotected.

------
strenholme
I’m already seeing a lot of discussion both here and over at LWN about which
hash algorithm to use.

The Git team made the right choice: SHA2-256 is the best choice here; it has
been around for 19 years and is still secure, in the sense that there are no
known attacks against it.

Both BLAKE[2/3] and SHA-3 (Keccak) have been around for 12 years and are both
secure; just as BLAKE2 and BLAKE3 are faster reduced-round variants of BLAKE,
Keccak/SHA-3 has the official faster reduced-round KangarooTwelve and
MarsupilamiFourteen variants.

BLAKE is faster when using software to perform the hash; Keccak is faster when
using hardware to perform the hash. I prefer the Keccak approach because it
gives us more room for improved performance once CPU makers create specialized
instructions to run it, while being fast enough in software. And, yes, SHA-3
has the advantage of being the official successor to SHA-2.

~~~
_verandaguy
Honest question: what are the use cases in Git where hash computation speed is
a meaningful optimization?

~~~
SQLite
My experience in developing and maintaining Fossil is that the hashing speed
is not a factor, unless you are checking in huge JPEGs or MP3s or something.
And even then, the relative performance of the various hash algorithms is not
enough to worry about.

~~~
_verandaguy
Thanks for the insight. My intuition was much the same: on modern hardware,
computing a digest-style hash (as opposed to a deliberately slow, password-
hashing one) is essentially imperceptible for payloads in the low MBs -- and
much above that is a use case for LFS.

------
speedgoose
>…a simple command like:

> git convert-repo --to-hash=sha-256 --frobnicate-blobs --climb-subtrees
> --liability-waiver=none --use-shovels --carbon-offsets

Is it sarcasm?

~~~
andrewflnr
Yes.

------
brobdingnagians
I was interested in how Fossil handled the SHA1 transition, and found it
nicely explained here:

[https://fossil-scm.org/home/doc/trunk/www/hashpolicy.wiki](https://fossil-scm.org/home/doc/trunk/www/hashpolicy.wiki)

~~~
velcrovan
Fossil's main author is chiming in on the discussion of this on Fossil's
forums:

([https://fossil-scm.org/forum/forumpost/50a5bea5fb](https://fossil-scm.org/forum/forumpost/50a5bea5fb))

> That's appalling. Fossil's implementation doesn't require a conversion.

“This is a key point, that I want to highlight. I'm sorry that it wasn't made
more clear in the LWN posting nor in the HN discussion.

“With Fossil, to begin using the new SHA3 hash algorithm, you just upgrade
your fossil binary. No further actions, workflow changes, disruptions, or
thought are required on the part of the user.

* “Old check-ins with SHA1 hashes continue to use their SHA1 hash names.”

* “New check-ins automatically get more secure SHA3 hash names.”

* “No repository conversions need to occur”

* “Given a hash prefix, Fossil automatically figures out whether it is dealing with a SHA1 or a SHA3 hash”

* “No human brain-cycles are wasted trying to navigate through a hash-algorithm cut-over.”

“Contrast this to Git, where a repository must be either all-SHA1 or all-SHA2.
Hence, to cut-over a repository requires rebuilding the repository and in the
process renaming all historical artifacts -- essentially rebasing the entire
repository. The historical artifact renaming means that external links to
historical check-ins (such as in tickets) are broken. And during the
transition period, users have to be constantly aware of whether they are using
SHA1 or SHA2 hash names. It is a big mess. It is no wonder, then, that few
people have been eager to transition their repositories over to the newer SHA2
format.”

~~~
seniorsassycat
The way I read the Fossil author's comments, old commits continue to use sha1
hashes. A repository will be vulnerable to sha1 collision attacks as long as
there is an object in the repository that has not been hashed with the new
algorithm.

For example, floppy.c could be replaced in a repo with a file with the same
sha1 hash, as long as the last commit that modified floppy.c used a sha1 hash.

Right?

~~~
SQLite
Just to be clear: Every time you modify a file, the new changes get put in
using SHA3. In an older repository, any given commit might have some files
identified using SHA1 (assuming they have not changed in 3 years) and others
identified using SHA3.

For example, the manifest of the latest SQLite check-in can be seen at
([https://www.sqlite.org/src/artifact/29a969d6b1709b80](https://www.sqlite.org/src/artifact/29a969d6b1709b80)).
You can see that most of the files have longer SHA3 hashes, but some of the
files that have not been touched in three years still carry SHA1 hashes.
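
Fossil's disambiguation rule is simple enough to sketch (a hypothetical helper, not Fossil's actual code): a full SHA1 name is 40 hex digits and a full SHA3-256 name is 64, so the length alone identifies the algorithm.

```python
def artifact_hash_algo(artifact_id: str) -> str:
    # Full SHA1 names are 40 hex digits; full SHA3-256 names are 64.
    return {40: "sha1", 64: "sha3-256"}[len(artifact_id)]

print(artifact_hash_algo("a" * 40))  # sha1
print(artifact_hash_algo("b" * 64))  # sha3-256
```

Abbreviated prefixes are ambiguous by length, so those get resolved by lookup instead.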

An attack like what you describe is possible _if_ you could generate an evil.c
file that has the exact same SHA1 hash as the older floppy.c file. Then you
could substitute the evil.c artifact in place of the floppy.c artifact, get
some unsuspecting victim to clone your modified repository, and cause mischief
that way. Note, however, that this is a pre-image attack, which is rather more
difficult to pull off than the collision attacks against SHA1, and (to my
knowledge) has never been publicly demonstrated. Furthermore, the evil.c file
with the same SHA1 hash would need to be valid C code that does something evil
while still yielding the same hash (good luck with that!). And Fossil (like
Git) has also switched over to Hardened SHA1, making the attack even harder
still.

As still more defense, Fossil also maintains an MD5 hash over the entire
content of the commit. So, in addition to finding an evil.c that compiles, does
your evil bidding, and has the same hardened-SHA1 hash as floppy.c, you also
have to make sure that the entire commit has the same MD5 hash after
substituting the text of evil.c in place of floppy.c.

So, no, it is not really practical to hack a Fossil repository as you
describe.

~~~
wyoung2
> Furthermore, the evil.c file with the same SHA1 hash would need to be valid
> C code that does something evil while still yielding the same hash

...and also produce an innocent-looking diff!

I mean, you could stuff a bunch of random bytes into a C comment to force the
desired hash in the output using these documented attack techniques, but
anyone inspecting the diffs between versions is likely to see such an
explosion of noise and call foul.

If you want an analogy, it's like someone saying they've learned to
counterfeit federal agent identification cards, only it requires the person
carrying the fake ID to have a thousand rainbow-dyed ducks on a leash in tow
behind him.

Such attacks are fine when it's dumb software systems doing the checks, but
for a source code repository where people do in fact visually check the diffs
occasionally?

Well, let's just say that when someone manages to use SHAttered and/or
SHAmbles type attacks on Git (or even Fossil) I expect that it won't take a
genius detective to see that the repo's been attacked.

~~~
tjoff
Many diff tools don't highlight whitespace-only changes. Or at least not in a
clear manner.

Also, if something is replaced in the history how often do people go back and
view diffs in old code? Hardly often enough to rely on it being spotted.

~~~
wyoung2
It only takes one person to raise the flag.

Sure, many thousands of people doing blind "git clone && configure && sudo
make install" could be burned by a problem like this, but _someone_ would
eventually do a diff and see the problem on any project big enough to have
those thousands of trusting users in the first place.

I'm not excusing these SHA-1 weaknesses, only pointing out that it won't be
trivial to apply them to program source code repos no matter how cheap the
attacks get.

For instance, the demonstration case for SHAttered was a pair of PDFs: humans
can't reasonably inspect those to find whatever noise had to be stuffed into
them to achieve the result.

I also understand that these SHA-1 weaknesses have been used to attack X.509
certificates, but there again you have a case very unlike a software code
repo, where the one doing the checking isn't another programmer but a program.

~~~
remram
The problem is that we are considering an issue where different people can get
different objects for the same hash. If the people checking all see the valid
files, they cannot raise any alarms to save the poor victims who got poisoned
with the wrong objects. They'll clone from the wrong fork, and no amount of
checking hashes or signed tags will prevent them from running compromised
code.

~~~
wyoung2
> If the people checking all see the valid files

...which will likely contain thousands of bytes of pseudorandom data in order
to force the hash collision...

> they cannot raise any alarms

You think a human won't be able to notice that the diff from the last version
they tested looks awfully funny? Code that can fool the compiler into
producing an evil binary is one thing, but code that can pass a human code
review is quite another.

You might be surprised how often that occurs.

I don't do a diff before each third-party DVCS repo pull, but I do diff the
code when integrating such third-party code into my projects, if only so I
understand what they've done since the last time I updated. Commit messages,
ChangeLogs, and release announcements only get you so far.

Back when I was producing binary packages for a popular software distribution,
I'd often be forced to diff the code when producing new binaries, since
several of the popular binary package distribution systems are based on
patches atop pristine upstream source packages. (RPM, DEB, Cygwin packages...)

Each time a binary package creator updates, there's a good chance they've had
to diff the versions to work out how to apply their old distro-specific
patches atop the new codebase.

_Someone's_ going to notice the first time this happens, and my guess is that
it'll happen rather quickly.

~~~
remram
If this is your threat model, you don't need hashes or signed tags at all.
Good for you. Thankfully both Fossil and Git disagree with you and take the
threat seriously :)

------
GlitchMr
I wonder if it would make sense to use a `concat(sha1, sha256)` hash
algorithm. This wouldn't change the prefixes while improving the strength of
the algorithm (by including SHA256 in the hash).
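
In hashlib terms the proposal would look something like this (a sketch of the idea, not anything Git implements):

```python
import hashlib

def concat_hash(data: bytes) -> str:
    # First 40 hex digits are the plain SHA1, so existing short
    # prefixes keep resolving; the SHA-256 digest is appended.
    return hashlib.sha1(data).hexdigest() + hashlib.sha256(data).hexdigest()

digest = concat_hash(b"tree snapshot contents")
print(len(digest))  # 104 hex digits: 40 (SHA1) + 64 (SHA-256)
```

One caveat worth knowing: for iterated hashes, concatenation buys less collision resistance than the combined length suggests (Joux's multicollision result).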

~~~
dchest
Something to remember about the security of concatenated hashes:
[https://crypto.stackexchange.com/a/63543/291](https://crypto.stackexchange.com/a/63543/291)

~~~
bangboombang
This is pretty interesting and shows you shouldn't try to pull any sort of
stunts if you're not a crypto expert. I've actually wondered before whether
md5 + sha1 would result in something stronger than those two used
individually. Now I know.

~~~
GlitchMr
By the way, this may be rather obvious, but concatenating hash algorithms is a
terrible idea for passwords. A password cracker could easily pick the less
secure algorithm to crack, and ignore the other hash.

Note that git doesn't concern itself with reversing a hash function. The
commit contents are part of the repository; there is no value in guessing the
commit contents based on the hash. Here, the hash function choice is purely
about collision resistance.

But yeah, don't do weird things with hashes. Cryptography is hard. Don't
invent memecrypto:
[https://twitter.com/sciresm/status/912082817412063233](https://twitter.com/sciresm/status/912082817412063233),
it's not going to increase the security. Use a single algorithm if you can.
Don't transform the output of a hash function in any way.

------
ericfrederich
I'll have to update my program which generates vanity hashes. I do enjoy
starting projects with an obligatory "Initial Commit" with a deadbeef SHA-1.

~~~
bmn__
I like to start a repo with an "empty" commit, that is to say its tree is the
magic 4b825dc.
[https://news.ycombinator.com/item?id=18342763](https://news.ycombinator.com/item?id=18342763)

I wonder if it would still be practically possible to manipulate the commit
id.

~~~
wyoung2
Wow! I wouldn't have guessed that Git had that vulnerability. Fossil solves it
easily: creating a new repo involves generating a random project code (a
nonce) which goes into the hash of the first commit, so that even two
identical commit sequences won't produce identical blockchains.

Fossil lets you force the project ID on creating the repo, but the capability
only exists for special purposes.

~~~
kzrdude
It doesn't seem to be a vulnerability at all

------
pkilgore
I love LWN's technical writing--it's worth the cost of a subscription!

------
tcharlton
I can't find documentation for the command in the article:


    git convert-repo --to-hash=sha-256 --frobnicate-blobs --climb-subtrees \
    --liability-waiver=none --use-shovels --carbon-offsets

Surely some of those options aren't real...

~~~
buserror
Of course they are ?!?!!? [https://git-man-page-generator.lokaltog.net/](https://git-man-page-generator.lokaltog.net/)

(never fails to amuse me)

~~~
bangboombang
Oh dear god. Because it's git related, my brain somehow still tries to make
sense of that stuff because it just seems so real.

~~~
ekimekim
It's a curious feeling. Like reading code that is syntactically valid but
utterly nonsensical.

------
jakeogh
Is there an archive of crypto related future predictions?

How long until a specified length preimage attack can break bittorrent blocks?

I remember a paper published about a decade ago estimating very short (well-
funded) ASIC sha1 collisions. Anyone have that ref?

EDIT: Should I have not said preimage? My understanding is bittorrent is
broken (by DDoS, not infohash(?)) if you can make a bad block that matches the
length and sha1 of a target block.

~~~
tialaramex
> EDIT: Should I have not said preimage? My understanding is bittorrent is
> broken (by DDoS, not infohash(?)) if you can make a bad block that matches
> the length and sha1 of a target block.

There are three different attacks

1. Collision, which is practical (expensive but practical) for SHA-1 today,
lets somebody make two documents A and B which have the same hash. This is
only useful if you can fool people somehow into accepting document B when they
think it's document A because of the hash, for example with digital
signatures.

2. Pre-image, which is not practical for any hashes you care about, including
MD5. This lets you find a document A given the hash(A) value. This is very
niche: for large documents, by the pigeonhole principle there will be many
such pre-images and it's impossible to get the "right" one; for small inputs
it can sometimes be relevant.

3. Second pre-image, likewise not practical. Given either document A or
hash(A) (which you could easily determine from document A), this lets you
produce a new document A' that is different from A but hash(A') == hash(A).
This would be extremely bad, and is what you'd need to attack somebody else's
real-world Bittorrent content.

Often people say "pre-image" when they mean strictly second pre-image; it's
usually clear from context, and a true pre-image attack as I explained above
is only rarely relevant.

Collision would only let bad guys corrupt their own purposefully constructed
collision bittorrent, which like, why? So yes, Bittorrent would only really be
in serious trouble if there was a second pre-image attack. But on the other
hand, don't use broken cryptographic primitives. Attacks only get better,
always.

~~~
SAI_Peregrinus
1.5. Chosen-prefix collision: given a prefix A, generate two values AB and AC,
where B and C differ but are both prefixed with A (AX is A concatenated with
X). This exists for SHA1. It's more powerful than a basic collision where you
can't pick the prefix, but weaker than either type of pre-image.

~~~
wyoung2
It's worth noting that this attack is a property of the Merkle–Damgård hash
construction, not of SHA-1 specifically, which means SHA-2 (Git's path
forward) is also vulnerable:

[https://en.wikipedia.org/wiki/Merkle%E2%80%93Damg%C3%A5rd_co...](https://en.wikipedia.org/wiki/Merkle%E2%80%93Damg%C3%A5rd_construction#Security_characteristics)

[https://www.reddit.com/r/crypto/comments/44p5jc/eli5_why_are...](https://www.reddit.com/r/crypto/comments/44p5jc/eli5_why_are_chosenprefix_collisions_specific_to/)

Fossil uses SHA-3, which has an entirely different construction, which is not
at this time known to have a similar weakness. SHA-3 is also much newer, with
a much shorter list of known attacks.

~~~
tialaramex
Ha, that ELI5 is adorable. I love how the person trying to answer in the
affirmative resorts to more and more frantic hand-waving as it becomes obvious
that none of what they've said is true and most of it doesn't even make sense,
while the person "flagged" for their supposedly "highly inaccurate" simple
statement that, er, no, chosen prefix isn't about MD at all remains calm and
doesn't care as people insist they must be wrong because, after all, they were
flagged -- and why would some anonymous user flag something as wrong unless
they were an expert...

Anyway, as hinted above, chosen prefix has nothing to do with the type of hash
construction, except in the sense that so far there were lots of
Merkle–Damgård hashes and some of them are no longer safe, whereas until
recently there weren't many of the Keccak family hashes.

The Wikipedia article is talking about Length Extension, which is a different
phenomenon from chosen prefix collision attacks, and if it was a problem in
Git (or indeed Fossil) would have doomed them both immediately anyway.

For a generic crypto hash you should use SHA-512/256 (NB: this is not offering
a choice; the slash is part of the name) to avert Length Extension, but since
the DVCSs already seemingly put in the effort to be safe against it, SHA-256
is a perfectly reasonable choice.

------
cestith
I just want to add something the article couldn't cover. I know bmc and he's
both a software geek's software geek and one of the friendliest, most helpful,
and most genuine people I've ever met.

------
alkonaut
I didn't get the argument against just converting. Sure, some code bases are
large and spread out, but any git repo needs to have one blessed central
point, and everyone needs to be able to just re-clone from the central
repository whenever history is rewritten for whatever reason (could be that a
huge file is trimmed from the past, etc.). Why can't all commits in the kernel
history be rewritten to SHA256, other than that it would be an annoying
interruption in the development?

~~~
corbet
The kernel doesn't really have the one central blessed point of which you
speak. Sure you can grab mainline releases from Linus's repository, but that's
not where the development actually happens. It really is a distributed
project, and having to delete all those old repositories would really hurt.

~~~
alkonaut
If two separate copies of the same repository do the same rewrite to sha256,
their histories are still compatible and equal up to the point where they
diverge. So other than the rewrite needing to happen in more places, it should
still be doable. It needs to happen at more or less the same time, however.

------
nsajko
Disappointed they went with an ARX-based hash instead of KangarooTwelve, which
uses the Keccak permutation. A lot of people on this thread think that SHA2 is
more secure because it is older, but my understanding is that that is
completely wrong. Keccak is not only standardized; to get there it had to win
the SHA3 competition, during and after which it received, as far as I
understand, unprecedented levels of scrutiny. And not only that: according to
what I read, Keccak-like cryptographic constructions (including the hash) are
much more amenable to mathematical/cryptographic analysis because they do not
use addition (word-wise rather than bit-wise, to be precise). The idea is that
a resourceful/moneyed attacker (like the NSA or China, etc.) could develop
successful attacks on an ARX hash without the public being able to reach the
same results, because no researchers have access to similar levels of
resources.

The sad thing is that the ARX BLAKEx functions seem to be gaining undeserved
amounts of hype. I do not think they are getting comparable scrutiny from
researchers, seeing as the BLAKEx hashes are ARX and have also changed
considerably since the SHA3 contest (so it is far from clear that the scrutiny
BLAKE did receive translates to BLAKE2 or BLAKE3).

~~~
nsajko
One thing that should also be noted is that ARX hashes are relatively less
well suited to silicon implementations.

The Keccak team published a short and pointed blog post on this back in 2017,
in answer to that notorious "Maybe skip SHA-3" blog post:
[https://keccak.team/2017/not_arx.html](https://keccak.team/2017/not_arx.html)

A HN commenter from 2017 explained ARX's safety downside better than I:
[https://news.ycombinator.com/item?id=15292103](https://news.ycombinator.com/item?id=15292103)

> The nuance that's being made here is that the public cryptanalytic results
> we have are from researchers that need to publish. However blackhats (be it
> government or private) have no such need. Thus, they do not care if the
> analysis is elegant or neat.

> This means that ARX functions will have less published analysis, but may
> still be successfully attacked.

> This isn't even a new argument they're making here. It's been well
> understood that simple cipher designs are better, because they are easier to
> understand. If you can understand it well, yet not break it, that gives
> confidence. If you don't understand it, it might break as soon as you do.

------
zackmorris
Summary of hashing function security in bits, for convenience:

[https://en.wikipedia.org/wiki/Secure_Hash_Algorithms](https://en.wikipedia.org/wiki/Secure_Hash_Algorithms)

Since collision resistance is roughly half the number of bits, it seems
unconscionable to me that anything below 256 bit hashes even exist, because 64
bits is crackable but 128 bits effectively never will be. This was well-
understood even in the 90s when MD5 and SHA were first published.
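
The "half the bits" figure is the birthday bound: a generic collision search against an n-bit hash needs on the order of 2^(n/2) evaluations, which is easy to tabulate:

```python
def generic_collision_work_log2(bits: int) -> float:
    # Birthday bound: ~sqrt(2^n) = 2^(n/2) evaluations find some collision.
    return bits / 2

for name, bits in [("MD5", 128), ("SHA-1", 160), ("SHA-256", 256)]:
    print(f"{name}: ~2^{generic_collision_work_log2(bits):.0f} work")
```

2^64 work is within reach of a well-funded attacker even without a cryptanalytic break; 2^128 is not, which is the point being made above.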

Just thinking about this for the first time, I don't buy any argument about
storage or performance, since those become less important as time goes on. It
feels like Linus made a mistake here, and offloaded the inevitable work of
upgrading repositories onto the general public (socialized the cost) which is
something that all programmers should work harder to avoid.

Said as an armchair warrior who has never accomplished anything of any
importance, I realize.

~~~
mathnmusic
Also relevant: Multihash is a format for self-describing hashes that helps
with data portability and future-proofing:
[https://github.com/multiformats/multihash](https://github.com/multiformats/multihash)
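
The format itself is tiny: a varint code naming the hash function, a varint digest length, then the digest. A sketch for sha2-256 (whose registered multihash code is 0x12; both varints fit in a single byte for digests under 128 bytes):

```python
import hashlib

SHA2_256_CODE = 0x12  # registered multihash code for sha2-256

def multihash_sha256(data: bytes) -> bytes:
    digest = hashlib.sha256(data).digest()
    # <code><length><digest>; single-byte varints suffice here
    return bytes([SHA2_256_CODE, len(digest)]) + digest

mh = multihash_sha256(b"hello")
print(mh[:2].hex())  # 1220: code 0x12, length 0x20 (32 bytes)
```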

------
anaisbetts
I don't understand the practical attack vector for breaking SHA1s in Git. Not
only are objects checksummed by SHA1, they also encode the _length_. Finding a
SHA1 collision is plausible, but finding a SHA1 collision that both lets you
do something Nefarious, _and_ is the length you need, seems really really
unlikely

~~~
HereBeBeasties
You're assuming that 100% of the source code matters, but most source code has
comments. Some has a lot of comments (boilerplate headers). Delete all the
comments and superfluous whitespace, add nefarious code, put in a comment in
the remaining bytes for the sole purpose of causing a hash collision (likely
plenty of bytes to play with).

~~~
scoutt
Yes, but...

> this new version would have to contain the desired hostile code, still
> function as a working floppy driver, and _not look like an obfuscated C code
> contest entry_

It's still plausible that one can pull a trick like that to introduce
malicious code into the repo, but improbable.

------
kazinator
> _There is, of course, a way to unambiguously give a hash value in the new
> Git code, and they can even be mixed on the command line; this example comes
> from the transition document:_

    git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256}

> _For a Git user interface this is relatively straightforward and concise_

No, it isn't. It's a complete and utter user interface clusterfuck. Just say
no to this insanity.

~~~
scarejunba
Perhaps suggest an alternative? It may help understand why this was chosen.

~~~
loeg
One alternative would be to just do lookups in both hash databases (until SHA1
is fully migrated away from), and reject invocations that conflict. Git's CLI
already rejects ambiguous short hash prefixes for SHA1, it could easily reject
ambiguous prefixes between SHA1 and SHA256 and otherwise allow unique prefixes
for either hash. This would be pretty ergonomic for users.
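
A sketch of that lookup policy (hypothetical helper, with the two object stores modeled as dicts):

```python
class AmbiguousPrefix(Exception):
    pass

def resolve(prefix: str, sha1_objs: dict, sha256_objs: dict) -> str:
    """Resolve an abbreviated id against both databases, rejecting
    prefixes that match more than one object across the two."""
    matches = [oid for db in (sha1_objs, sha256_objs)
               for oid in db if oid.startswith(prefix)]
    if len(matches) > 1:
        raise AmbiguousPrefix(prefix)
    if not matches:
        raise KeyError(prefix)
    return matches[0]

sha1_objs = {"abac87a" + "0" * 33: "commit"}    # 40-digit SHA1 names
sha256_objs = {"f787cac" + "0" * 57: "commit"}  # 64-digit SHA-256 names
print(resolve("f787", sha1_objs, sha256_objs)[:7])  # f787cac
```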

~~~
jolmg
For most cases that would suffice and would be ergonomic, but what if a full
SHA1 also qualifies as a prefix of one or more SHA256, and you want the SHA1?
There's still a need for a mechanism to disambiguate for these cases, even if
it ends up very rarely needed.

~~~
loeg
You're talking about a 160 bit truncated hash collision on SHA256, which is
extraordinarily unlikely if SHA256 is not itself completely broken (moreso
than SHA1 already is!). I don't think any syntax is needed for that in the
porcelain CLI; it could be handled with non-user-facing commands if it ever
came up (it won't).

~~~
jolmg
> extraordinarily unlikely if SHA256 is not itself completely broken (moreso
> than SHA1 already is

I was hoping I captured that by saying "very rarely". However, if SHA1
collisions can be made willingly, doesn't that mean that one can also
willingly make a SHA1 hash that matches with the prefix of an existing SHA256
hash?

~~~
loeg
As far as I know, that kind of collision isn't practical at this time. So
predicating UI decisions on that basis seems like a mistake to me (given how
long git has already ignored the looming threat of SHA1 being broken).

When and if someone injects a SHA1 attack into your repository, and the main
git CLI throws up its hands and says "hash collision" trying to access it, I'm
not seeing major problems here. The git CLI doesn't need to provide convenient
commands to interact with attacks that are not practical today. To the extent
that these will become practical, I think git should drop the SHA1 lookup
after a migration period regardless, and it would not hurt to provide a
gitconfig knob to disable SHA1 lookup.

------
nnx
Surprising they didn't go with Blake3 instead since it has much higher
performance and Git's performance-oriented ethos.

~~~
nullc
> Git's performance-oriented ethos

Then sha256 will likely be preferable in the long run: it's faster with SHA-NI
than blake3.

If you're not developing on a system with SHA-NI, get with the program. Zen2
is freaking awesome. :)

~~~
wolfgke
> Zen2 is freeking awesome. :)

SHA-NI was introduced with the Intel Goldmont microarchitecture.

~~~
nullc
Yes, but Goldmont is not particularly awesome. :) Presumably Goldmont would be
a downgrade for many people.

(On AMD, the first-generation Zen has SHA-NI, FWIW.)

~~~
wolfgke
Of course, processors that use one of the Atom/Celeron/Pentium
microarchitectures are not the best choice if you desire maximum speed, but
otherwise they are surprisingly interesting processors (IMHO much more
interesting than what Intel delivers with the Core series).

At this time, Intel often experiments with or introduces features that are
particularly interesting for embedded use first on the Atom, for example the
already-mentioned SHA-NI. Another example is the MOVBE instruction (insanely
useful if you handle big-endian data, for example in network packets; older
x86 processors only have BSWAP), which was first introduced with Atom.

------
zokier
Does anyone know of a standard format for a sort of tagged-union hash type,
something similar to the crypt format for passwords? It feels like everyone
needs to support multiple hash types at some point, and basically has to
reinvent that particular wheel again and again.
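A minimal sketch of what such a self-describing format could look like, in the
spirit of crypt(3)'s `$id$...` strings (the tag names and layout here are my
own invention, not an existing standard):

```python
import hashlib

# Registry of supported algorithms, keyed by the tag embedded in the string.
ALGOS = {"sha1": hashlib.sha1, "sha256": hashlib.sha256}

def tag_hash(algo: str, data: bytes) -> str:
    """Produce a self-describing hash string like '$sha256$<hexdigest>'."""
    return "$%s$%s" % (algo, ALGOS[algo](data).hexdigest())

def verify(tagged: str, data: bytes) -> bool:
    """Parse the tag, recompute with the named algorithm, and compare."""
    _, algo, digest = tagged.split("$")
    return ALGOS[algo](data).hexdigest() == digest
```

The point of the tag is that a verifier can keep accepting old `$sha1$...`
strings while issuing only `$sha256$...` ones, which is exactly the migration
situation git is in.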

~~~
loeg
It isn't too bad to just exhaustively look up provided hashes in all your
databases (at least for Git). You should probably only support one primary
hash at a time, plus one legacy hash for migration purposes. That makes lookup
twice as expensive, but for git this is not usually the slow part (the slow
part is 'git status' having to compare the entire local filesystem checkout to
the repo).
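A toy illustration of that lookup strategy, assuming two in-memory object
stores keyed by full hex digest (the store names and contents are made up for
the sketch):

```python
import hashlib

# Two hypothetical object stores, one per algorithm, keyed by hex digest.
contents = [b"blob 5\x00hello", b"blob 5\x00world"]
sha1_store = {hashlib.sha1(c).hexdigest(): c for c in contents}
sha256_store = {hashlib.sha256(c).hexdigest(): c for c in contents}

def lookup(hex_id):
    """Try the primary store first, then the legacy one: at most twice
    the single-store cost, and each probe is a cheap dict lookup."""
    for store in (sha256_store, sha1_store):
        if hex_id in store:
            return store[hex_id]
    return None
```

Since digest lengths differ (40 hex characters for SHA-1, 64 for SHA-256), a
real implementation could even dispatch on length and skip one probe entirely.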

------
sunil_saini
The above article suggests that a SHA-1 collision is infeasible because the
attacker has to come up with code that not only generates the same hash but
also benefits him. But can't he just add some malicious code and then add some
random text in comments to produce the same hash?

~~~
pornel
"Produce the same (specific) hash" is a pre-image attack, which is very, very
hard. So hard that even MD5 isn't broken for pre-images, and there's only a
theoretical pre-image attack against MD4.

We only know collision attacks, which are "produce 2 files with the same hash,
but you can't control what that hash is". So you can't target any existing
repo. You need to use social engineering to get one of your special files into
a repo.
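The cost gap between the two attack types shows up even against a deliberately
weakened hash. The sketch below truncates SHA-256 to 18 bits (the helper names
are mine) so both searches finish instantly: a birthday-style collision takes
on the order of 2^9 tries, while hitting one specific target hash takes on the
order of 2^18.

```python
import hashlib
import itertools

BITS = 18  # tiny digest, weak on purpose, so brute force is instant

def h(data):
    """SHA-256 truncated to BITS bits."""
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big") >> (32 - BITS)

# Collision attack: find ANY two inputs with the same hash.
# Birthday paradox => expected work ~ 2**(BITS/2) tries.
seen = {}
for i in itertools.count():
    m = b"msg-%d" % i
    v = h(m)
    if v in seen:
        pair, collision_work = (seen[v], m), i
        break
    seen[v] = m

# Pre-image attack: match one SPECIFIC target hash.
# Expected work ~ 2**BITS tries, quadratically more.
target = h(b"existing commit object")
for i in itertools.count():
    if h(b"forgery-%d" % i) == target:
        preimage_work = i
        break

print("collision after", collision_work, "tries; pre-image after", preimage_work)
```

At git's full 160 bits of SHA-1 the collision side has been driven down by
cryptanalysis, but the pre-image side remains far out of reach, which is the
asymmetry pornel is describing.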

------
k5hp
Just in case: [http://archive.is/omsjJ](http://archive.is/omsjJ)

------
lokedhs
How will rehashing work for commits that are signed? Do all the commits need
to be re-signed?

------
cm2187
Stating the obvious, but the hash is hex, which leaves lots of characters free
for a one-character prefix on SHA-256 hashes. Like the character "s", for
instance.

------
powerapple
Is it a real problem for git? Do we merge code based on hashes instead of
looking at the code?

~~~
donatj
The problem is that if you can make evil code with the same hash as innocuous
code, you can poison people who pull from a given repo you have access to. It
would allow you to make changes to the history without merging anything or
anyone being the wiser.

It makes the distributed aspect of git untrustworthy, as previously you knew
if you pulled from anywhere and the hash was good, you’d pulled the correct
code. With SHA1 being functionally broken that’s no longer necessarily the
case.

------
angrygoat
This article is via an LWN subscriber link; a cheerful reminder that LWN are
good and they are worth subscribing to :)
[https://lwn.net/subscribe/](https://lwn.net/subscribe/)

~~~
john-radio
OP's link says it is "subscription-only content," but it is still publicly
available. It says that it has been "made available by an LWN subscriber." How
does that work?

~~~
robjan
Subscribers receive a sharing link which can be used to share articles with
friends. They tolerate sharing on HN because it brings in new customers.

~~~
globuous
Damn, lwn is sweeet !! :)

------
throwaway-q2233
Can't we just take the SHA-1 of the SHA-256?

~~~
loeg
Nope.

------
sandGorgon
does anyone know if github/bitbucket support it today ?

~~~
freddie_mercury
Why would they support it? The article clearly states it is nowhere close to
being useful yet.

It is untested, unstable code that can only write to repositories and not read
them.

"Much of the work to implement the SHA‑256 transition has been done, but it
remains in a relatively unstable state and most of it is not even being
actively tested yet. In mid-January, carlson posted the first part of this
transition code, which clearly only solves part of the problem:

"First, it contains the pieces necessary to set up repositories and write _but
not read_ extensions.objectFormat. In other words, you can create a SHA‑256
repository, but will be unable to read it. "

~~~
sandGorgon
actually - i might have worded it confusingly.

For smaller projects (like my own), can i move to sha-256 with no expectation
of backward compatibility _today_ ?

~~~
SAI_Peregrinus
"First, it contains the pieces necessary to set up repositories and write _but
not read_ extensions.objectFormat. In other words, you can create a SHA‑256
repository, but will be unable to read it. "

If you want it to be write-only, sure, go ahead!
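For what it's worth, this eventually landed in mainline Git: since 2.29 (still
marked experimental) you can create a SHA-256 repository directly, assuming a
sufficiently new git binary:

```shell
# Create a repository whose object IDs are SHA-256 (git >= 2.29, experimental)
git init --object-format=sha256 demo-sha256

# Confirm which hash the repository uses
git -C demo-sha256 rev-parse --show-object-format
```

Such a repository cannot interoperate with SHA-1 remotes, which matches the
"no expectation of backward compatibility" caveat above.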

------
PaulHoule
I am not a fan of SHA-256; you are better off with SHA-384 or SHA-512/256,
which resist length-extension attacks and are actually a little faster on
64-bit machines.

------
velox_io
Unless I'm missing something, why not just allow repositories to be upgraded
to SHA2 hashes? The only problem is ensuring everyone's tooling supports it.

~~~
majewsky
This question is exactly what a major portion of the article covers.

~~~
velox_io
It isn't the easiest article to read, plus they overcomplicate things by
talking about things such as truncating SHA-2 hashes.

I don't see why changing the hashing algorithm is so problematic, hence why I
asked the question. Converting a repository to SHA-2 should be straightforward
(the only issue is everyone's tooling), and you could also run the
repositories side by side. I'm genuinely interested, as I think Git and
BitTorrent are quite elegant solutions to complex problems.

~~~
majewsky
> the only issue is everyone's tooling

Exactly! If you've ever worked in a corporate environment, you know the fun of
having to support 10-year-old versions of your favorite cutting-edge software.

------
kazinator
I'm completely against this security theater nonsense; please keep my git
SHA-1.

Please fork git for this and call it something else, like git6, and ensure
that git6 cannot push to git repos.

------
mratsim
> Thus, unlike some other source-code management systems, Git does not
> (conceptually, at least) record "deltas" from one revision to the next. It
> thus forms a sort of blockchain, with each block containing the state of the
> repository at a given commit.

Color me surprised: dropping the "blockchain" word in the middle of the
introduction.

~~~
AndrewDucker
Git is a blockchain.

Being, as it is, a chain of signed blocks.

~~~
tonyedgecombe
Git is a Merkle tree, as is a blockchain.

[https://en.wikipedia.org/wiki/Merkle_tree](https://en.wikipedia.org/wiki/Merkle_tree)

~~~
the8472
It is more a DAG than a tree.

~~~
afiori
And when talking about hash attacks, it becomes relevant to consider the
possibility of it being just a directed graph.

