The git book explains the internals very well, so you can easily verify it for yourself. Files are referenced as objects in trees, which are pointed to by commits. Editing a file creates a new object for it. (edited for tone)
> You have two nearly identical 22K objects on your disk (each compressed to approximately 7K). Wouldn’t it be nice if Git could store one of them in full but then the second object only as the delta between it and the first?
> It turns out that it can. The initial format in which Git saves objects on disk is called a “loose” object format. However, occasionally Git packs up several of these objects into a single binary file called a “packfile” in order to save space and be more efficient. Git does this if you have too many loose objects around, if you run the git gc command manually, or if you push to a remote server.
> When Git packs objects, it looks for files that are named and sized similarly, and stores just the deltas from one version of the file to the next. You can look into the packfile and see what Git did to save space.
That's just for compression. Commits aren't diffs, and when you checkout stuff, git doesn't do diffs to give you the working directory at that point. See https://stackoverflow.com/a/25028688/8272371 for a detailed explanation.
There is a disconnect somewhere. The linked answer says:
> Now, git is different. Git stores references to complete blobs and this means that with git, only one commit is sufficient to recreate the codebase at that point in time. Git does not need to look up information from past revisions to create a snapshot.
> So if that is the case, then where does the delta compression that git uses come in?
> Well, it is nothing but a compression concept - there is no point storing the same information twice, if only a tiny amount has changed. Therefore, represent what has changed, but store a reference to it, so that the commit that it belongs to, which is in effect a tree of references, can still be re-created without looking at past commits.
You can recreate a file that is stored as a root blob plus some series of diffs without looking at information from past commits. But you can't recreate it without doing the diffs! You have to look at the root blob. This is, internally, tracked separately from the commit which created it. But your conclusion:
> when you checkout stuff, git doesn't do diffs to give you the working directory at that point.
cannot be true. If the working directory at that point corresponds to a blob which has only diff information stored, git must apply that diff to a separate blob in order to give you the working directory.
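To make the reconstruction step concrete, here is a toy sketch. It assumes a simplified delta made of "copy from base" and "insert literal" instructions, which is the general idea behind packfile deltas but not git's actual binary encoding. `make_delta` and `apply_delta` are hypothetical helper names, and `difflib` only stands in for git's own delta search.

```python
from difflib import SequenceMatcher

def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as copy-from-base and insert-literal instructions."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))      # offset and length within base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))  # literal bytes not taken from base
        # "delete" needs no instruction: those base bytes are simply never copied
    return ops

def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the full target from the base object plus the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)

base = b"line one\nline two\nline three\n"
target = b"line one\nline 2\nline three\nline four\n"
delta = make_delta(base, target)
assert apply_delta(base, delta) == target
```

Rebuilding `target` needs the base object and the delta, but nothing in the sketch records which commit introduced either of them.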
What you are missing is the difference between git's object model (with loose objects) and packfiles. The delta compression happens when you run `git gc` (git does this automatically as well on occasion), and packfiles are how git fetches and pushes history.
But when you create a new commit, that commit object is stored as a loose object, with any new file blobs and tree objects. This represents a complete snapshot of your working tree.
But git does not need to make a complete copy of the working tree on each commit. Because objects are referred to by the hash of their contents (with a git specific header), git only needs to store each version of a file once.
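A minimal sketch of that content addressing, assuming the default SHA-1 object format: a blob's ID is the hash of a `blob <size>\0` header followed by the file contents. `ObjectStore` is a hypothetical in-memory stand-in for `.git/objects`, not git's real storage code.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Hash the contents with git's blob header, as `git hash-object` does for SHA-1 repos."""
    header = b"blob " + str(len(content)).encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

class ObjectStore:
    """Toy object database keyed by content hash (illustrative only)."""

    def __init__(self):
        self.objects = {}

    def put(self, content: bytes) -> str:
        oid = git_blob_id(content)
        # Identical contents map to the same ID, so they are stored only once.
        self.objects.setdefault(oid, content)
        return oid

store = ObjectStore()
first = store.put(b"same contents\n")
second = store.put(b"same contents\n")       # e.g. the unchanged file in a later commit
third = store.put(b"different contents\n")

print(first == second)      # identical contents share one object
print(len(store.objects))   # number of distinct objects actually stored
```

Because the key is derived from the contents alone, an unchanged file in a later commit costs nothing extra to store.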
Everyone here is arguing past each other because one side defines "what git does" as the literal implementation details of git, and the other side defines "what git does" as the model it presents to the end user. I suspect the reason for this disconnect is partly the emphasis on understanding the "internals" of git, and the fact that this argument is about the difference between the internal implementation as it exists in code and the internal model/interface.
When git makes packfiles using delta compression, it ignores the history. It roughly sorts blobs for similarity (using heuristics such as object type, file name, and size), without regard to which commits they appear in. This sorting helps to make the packfile delta compression more efficient. The deltas on disk are completely unrelated to the diffs you see from `git show`.
Which is explicitly a lower-level optimization applied to files well-suited for it and not related to the concept of a commit. A commit does not reference a diff.
Fair enough. The blob stores a diff. The commit stores a reference to... the diff. This is a division between the concept of the object and the implementation. But it's not an example of a diff-storing model failing to model git as it is; git as it is is storing diffs.
If a commit references a "blob", and the "blob" that it references is, in fact, a diff, why would we say that the commit "does not reference a diff"?
Each commit consists of a structured collection of hash IDs for every file in the entire repo. The hash ID is generated by hashing the contents of the entire file. Not the diff.
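As a rough illustration of that structure, here is a toy model in which a "tree" is just a flat mapping from path to blob ID. Real git trees are nested objects with their own hashes, so treat `snapshot`, `checkout`, and the dict-based commits as hypothetical simplifications.

```python
import hashlib

objects = {}  # toy object store: blob id -> full file contents

def put_blob(content: bytes) -> str:
    """Store the whole file under the hash of its (header + contents)."""
    oid = hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()
    objects[oid] = content
    return oid

def snapshot(files: dict) -> dict:
    """Toy 'tree': one blob id for every file in the working tree."""
    return {path: put_blob(data) for path, data in files.items()}

def checkout(tree: dict) -> dict:
    """Recreate every file from this tree alone, without consulting any parent."""
    return {path: objects[oid] for path, oid in tree.items()}

v1 = {"README": b"hello\n", "main.c": b"int main(void) { return 0; }\n"}
v2 = dict(v1, README=b"hello, world\n")   # only README changes

commit1 = {"parent": None, "tree": snapshot(v1)}
commit2 = {"parent": commit1, "tree": snapshot(v2)}

assert checkout(commit2["tree"]) == v2
assert commit1["tree"]["main.c"] == commit2["tree"]["main.c"]   # unchanged file, same id
```

The second commit lists every file, but the unchanged `main.c` points at the same blob as before, so only the new `README` adds an object.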
The "diff" you're referring to is an implementation detail of the compression. It's not even always there; it depends on which commits are present in your clone. It's also not even the same "diff" you work with when you use git to generate or apply patches. Using the same word only leads to confusion.
And that's part of the problem with git! It requires you to get a mental model of how it works internally, but only part of how it works internally is important, and there are terminology conflicts. So people get incorrect ideas about how it works, then get surprised when something unexpected happens.
There are revision control systems based on diffs, but git's power (and durability) is that every commit references only blobs, which are (content-addressable) files, not diffs. All the diffs used in `git log` output, the `git diff` command, or the `git rebase` command are computed on the fly from the two stored versions. And yes, if you commit a giant file and then delete it in the next commit, it's there forever, until you remove or rewrite the history of any branch that references the file somewhere in its history.
There are optimizations at the storage level (compression etc.), but at the logical level those are transparent.
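A small sketch of "computed on the fly": given two complete stored versions of a file, a patch can be derived whenever someone asks for it. Python's `difflib` is used here purely as a stand-in for git's own diff algorithms, and `diff_versions` is a hypothetical helper.

```python
import difflib

def diff_versions(old: str, new: str, path: str) -> str:
    """Derive a unified diff from two full versions; nothing diff-like is stored."""
    lines = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)

# Two complete snapshots of the same file, as two commits would reference them.
old = "alpha\nbeta\ngamma\n"
new = "alpha\nBETA\ngamma\n"

patch = diff_versions(old, new, "notes.txt")
print(patch)
```

The patch exists only as output. Both versions remain stored in full at the logical level, whatever the packfile does underneath.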
I would understand "a commit is/contains a diff" as the commit referencing the difference to its parent commit(s), whereas in a packfile the diff might be against a blob belonging to an entirely different branch of the repository, if that's a better diff. Which might be different for each file. And the blob doesn't have to be a diff, it only is if the packer found a good candidate.