
Git integrity - jordigh
https://groups.google.com/forum/#!topic/binary-transparency/f-BI4o8HZW0
======
glandium
Note that it's not only a problem of the SHA-1s not being checked, but also a
problem of consistency of the pack not being checked. That is, you can be
given new commits with missing objects and Git won't notice. Worse, you can
push a pack with missing objects, and the server won't notice (by default).
Many guides will tell you to enable fsckObjects on servers.
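Those guides typically boil down to a few settings; a sketch (see git-config(1) for `receive.fsckObjects`, `fetch.fsckObjects` and `transfer.fsckObjects`):

```shell
$ git config receive.fsckObjects true    # server side: check objects in pushed packs
$ git config fetch.fsckObjects true      # client side: check objects in fetched packs
$ git config transfer.fsckObjects true   # shorthand covering both directions
```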

It's also not a Git-only issue. Mercurial has the exact same problem,
defaulting to trust everything it's given, both on the client and the server.
And while the server.validate option allows enabling consistency checking (no
missing objects), I don't think there is a setting to enforce SHA-1 validation.
OTOH, on a repository the size of mozilla-central, this would have noticeable
consequences because file lists (manifests) are flat (but that's changing
thanks to the work for narrow clones).

~~~
indygreg2
Mercurial verifies SHA-1 on every read and write. The code is in revlog.py in
revision() and _addrevision() (search for "checkhash"). This is a lower level
than changegroup processing, which is where exchange occurs. Since you aren't
using revlogs from git-cinnabar, you should probably re-implement hash
checking in git-cinnabar if you haven't already.
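For reference, the hash that checkhash verifies can be sketched in a few lines of Python (an illustration, not Mercurial's actual code): the node id is the SHA-1 of the two parent nodes, sorted, followed by the revision's full text.

```python
import hashlib

def hg_node(text: bytes, p1: bytes, p2: bytes) -> bytes:
    """Sketch of Mercurial's node id: SHA-1 over the sorted parent
    nodes followed by the revision's full text (cf. revlog.py)."""
    s = hashlib.sha1(min(p1, p2) + max(p1, p2))
    s.update(text)
    return s.digest()

nullid = b"\0" * 20  # the null parent, for revisions with no ancestor
print(hg_node(b"a\n", nullid, nullid).hex())
```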

server.validate ensures that all referenced revisions from changegroups are in
fact present. It prevents repos from becoming "corrupt" (in the sense that `hg
verify` will complain) due to missing data.
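Enabling it is a one-line server-side configuration (a sketch; the option is described in `hg help config`):

```ini
[server]
validate = True
```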

~~~
glandium
I stand corrected: unbundling, which happens during transfer, does seem to
check SHA-1. That surprises me, because I would expect SHA-1 hashing of all
the manifests in mozilla-central to take a lot more time than a clone does
(250k+ manifests of more than 5MB each on average, with the most recent ones
larger than 10MB; even with a SHA-1 function doing 1GB/s, and I think we
barely reach half that, that should take more than 20 minutes). But maybe a
clone does take longer than that these days?
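The arithmetic behind that estimate, using the numbers above (a back-of-the-envelope sketch):

```python
# Numbers from the text: 250k+ manifests averaging more than 5MB each,
# hashed at an optimistic 1GB/s of SHA-1 throughput.
manifests = 250_000
avg_size = 5 * 1024 ** 2     # bytes per manifest
rate = 1024 ** 3             # bytes per second
minutes = manifests * avg_size / rate / 60
print(round(minutes, 1))  # ≈ 20.3, hence "more than 20 minutes"
```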

That said, while Git doesn't check SHA-1s during transfer, it does check them
when objects are accessed. The code is in object.c in parse_object(), which
calls check_sha1_signature(). But disappointingly, not everything goes through
that code path.
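What check_sha1_signature() verifies can be sketched in a few lines of Python: a blob's object name is the SHA-1 of a "blob &lt;size&gt;" header followed by the raw content. The two ids below are the ones that appear in the transcript that follows.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Sketch: recompute a git blob's object name from its content,
    the same check check_sha1_signature() performs on access."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_id(b"a\n"))  # 78981922613b2afb6025042ff6bd878ac1994e85
print(git_blob_id(b"b\n"))  # 61780798228d17af2d34fce4cfbdf35556832472
```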

$ git init
$ echo a > a ; echo b > b
$ git add a b
$ git cat-file blob 78981922613b2afb6025042ff6bd878ac1994e85
a
$ cp -f .git/objects/61/780798228d17af2d34fce4cfbdf35556832472 .git/objects/78/981922613b2afb6025042ff6bd878ac1994e85
$ git cat-file blob 78981922613b2afb6025042ff6bd878ac1994e85
b
$ git show 78981922613b2afb6025042ff6bd878ac1994e85
error: sha1 mismatch 78981922613b2afb6025042ff6bd878ac1994e85
fatal: bad object 78981922613b2afb6025042ff6bd878ac1994e85

~~~
glandium
So, it turns out that there's nothing to see here after all. When pulling,
what you get is a pack, not its index. The index is created locally after
retrieval. Packs don't contain the SHA-1s, so the process of creating the pack
index does, in fact, compute the SHA-1s. So if a pack is altered somehow to
contain objects with a different SHA-1 than the advertised one, as in the
parent comment, the altered objects simply get indexed under their real
SHA-1s, and the connectivity check that runs afterwards complains about the
missing commits, trees or blobs.

In the altered repository in the parent, actually doing a commit and then
cloning (non-local, because local clones cheat) will yield an error about the
missing 78981922613b2afb6025042ff6bd878ac1994e85.

