
“Git gets easier once you get the basic idea that branches are homeomorphic endofunctors mapping submanifolds of a Hilbert space.”

— Isaac Wolkerstorfer


While the above is technobabble, there /is/ a simple way to state what git is. It's an API to interact with a torsor.

We have files, which are inert objects and form a "file space". We have diffs. Diffs can "act" on a file to produce a new file or a conflict; we call this "applying a patch". Mathematicians would call it a "group(oid) action". Diffs are a groupoid because (1) there's an identity diff that does nothing, (2) diffs can be smashed together, (3) if the smashing of diffs succeeds, the operation of concatenation is associative, (4) all diffs are invertible: if we delete the added lines and add back the deleted lines, we have inverted the diff.

Finally, we have a rooted DAG where each node is a diff, and the root is the empty diff. Git queries let you manipulate paths in the DAG, which corresponds to smashing together the diffs along a path to produce a file.
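
To make the picture above concrete, here is a minimal sketch of diffs acting on files. Everything in it is a toy assumption for illustration (the `Diff`, `apply_diff`, `invert` and `apply_path` names are made up, and a diff is restricted to a single hunk); it is not how git represents anything.

```python
from dataclasses import dataclass
from functools import reduce


class Conflict(Exception):
    """Raised when a diff does not apply cleanly to a file."""


@dataclass(frozen=True)
class Diff:
    """Toy single-hunk diff: at `index`, replace `removed` lines with `added`."""
    index: int = 0
    removed: tuple = ()
    added: tuple = ()


IDENTITY = Diff()  # (1) the diff that does nothing


def apply_diff(lines, diff):
    """Act on a file (a list of lines); raise Conflict if the context differs."""
    end = diff.index + len(diff.removed)
    if diff.index > len(lines) or tuple(lines[diff.index:end]) != diff.removed:
        raise Conflict(f"diff does not apply at line {diff.index}")
    return lines[:diff.index] + list(diff.added) + lines[end:]


def invert(diff):
    """(4) Swap added and removed lines, which undoes the diff."""
    return Diff(diff.index, diff.added, diff.removed)


def apply_path(diffs, start=()):
    """(2) Smash diffs together along a path, starting from the empty file."""
    return reduce(apply_diff, diffs, list(start))
```

Applying a diff and then its inverse gives back the original file, and applying a diff to a file whose lines do not match `removed` raises `Conflict`, which is the "or a conflict" case.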

If one groks this, the rest of git is a pretty poor interface on top of this nice mathematical structure.

For more physics-y applications of torsors, check out the Baez article: https://math.ucr.edu/home/baez/torsors.html

How is this any less technobabble? I looked up torsor on wikipedia. I learned nothing, except that "torsor" is a real word, and not just something you made up for a joke.

How could you possibly not have learned anything by reading the wikipedia page for "torsor" that starts with the following?

In algebraic geometry, given a smooth algebraic group G, a G-torsor or a principal G-bundle P over a scheme X is a scheme (or even algebraic space) with an action of G that is locally trivial in the given Grothendieck topology in the sense that the base change Y ×_X P along "some" covering map Y → X is the trivial torsor Y × G → Y (G acts only on the second factor).[1] Equivalently, a G-torsor P on X is a principal homogeneous space for the group scheme G_X = X × G (i.e., G_X acts simply transitively on P.)

Look, there's even links to the sub-terms:

("algebraic group" article): In algebraic geometry, an algebraic group (or group variety) is a group that is an algebraic variety, such that the multiplication and inversion operations are given by regular maps on the variety.

("action" article): In algebraic geometry, an action of a group scheme is a generalization of a group action to a group scheme. Precisely, given a group S-scheme G, a left action of G on an S-scheme X is an S-morphism

("principal homogeneous space" article): In mathematics, a principal homogeneous space,[1] or torsor, for a group G is a homogeneous space X for G in which the stabilizer subgroup of every point is trivial. Equivalently, a principal homogeneous space for a group G is a non-empty set X on which G acts freely and transitively (meaning that, for any x, y in X, there exists a unique g in G such that x·g = y, where · denotes the (right) action of G on X). An analogous definition holds in other categories, where, for example,

Really, all you have to do is read the article carefully and follow the links to unknown terminology if you can't immediately intuit its meaning. However, it's only just basic category theory, really.

(yes, guys, /s).

Dude probably doesn't even know what a co-Yoneda embedding is, either. smh

While this sounds all nice it actually fails to model Git as it is. Git is an object database and its objects are blobs, trees and commits not diffs, so your premise is based on a misconception.

Git itself requires you to be able to think in both models, diffs and snapshots. For example, most uses of `git rebase` are clearer if your mental model while doing so is one of diffs.

That only one of them is how it's implemented is beside the point really, until you get _quite_ low level.

Of course when working with Git it makes sense to think in changesets. But OP was specifically modelling the technical side starting with "We have files, which are inert objects …".

And that's a bigger problem with git than the horrible commands. Every project I've worked on that has used git has had a diff-based workflow, and sometimes the mismatch is painful — I wish I didn't have to know about `--full-history` and that `git show X` isn't necessarily the same as `git diff X^..X`.

I really hope something like pijul takes off.

> Git is an object database and its objects are blobs, trees and commits not diffs

What do you think is the difference between a "commit" and a "diff"?

A "commit" doesn't contain a diff, it contains (references to) the blobs of the files at that state. Diffs are display-only, generated by comparing two full file states.

You really believe that git stores -- in full -- every version of a tracked file? Every commit that deletes the whitespace from an otherwise empty line in a 30KB file is another 30KB of hard drive space gone?

> You really believe that git stores -- in full -- every version of a tracked file?

Yes, it does.

> Every commit that deletes the whitespace from an otherwise empty line in a 30KB file is another 30KB of hard drive space

Yes, it is.

"It's worth repeating that git stores every revision of an object separately in the database, addressed by the SHA checksum of its contents. There is no obvious connection between two versions of a file; that connection is made by following the commit objects and looking at what objects were contained in the relevant trees. Git might thus be expected to consume a fair amount of disk space; unlike many source code management systems, it stores whole files, rather than the differences between revisions. It is, however, quite fast, and disk space is considered to be cheap." -- https://lwn.net/Articles/131657/

One of the insights of the git design was that, nowadays, disk space is cheap. The first releases of git always stored each object separately in its own file in the object database. Git still does so nowadays, but once the number of files gets over a certain threshold, newer releases of git run an "automatic GC" which combines these "loose objects" into a "pack file"; and within that "pack file", it uses a binary diff (a xdelta) between similar objects to reduce the total size. But that's just a physical storage optimization; in the logical model, whenever you ask for an object, you always get its full contents, not a delta against some other object.
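
A rough sketch of that split between logical model and physical storage, under loud assumptions: `ToyStore`, `pack_against` and the prefix/suffix "delta" are invented for illustration, the ids are a plain SHA-1 of the content, and git's real pack format and delta encoding are considerably more involved.

```python
import hashlib
import zlib


def _common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ToyStore:
    def __init__(self):
        self.loose = {}   # id -> zlib-compressed full content
        self.packed = {}  # id -> (base_id, prefix_len, suffix_len, middle bytes)

    def put(self, content: bytes) -> str:
        """Store a full object, addressed by a hash of its content."""
        oid = hashlib.sha1(content).hexdigest()
        if oid not in self.packed:
            self.loose.setdefault(oid, zlib.compress(content))
        return oid

    def get(self, oid: str) -> bytes:
        """Always return the full content, however it is physically stored."""
        if oid in self.loose:
            return zlib.decompress(self.loose[oid])
        base_id, prefix, suffix, middle = self.packed[oid]
        base = self.get(base_id)
        return base[:prefix] + middle + base[len(base) - suffix:]

    def pack_against(self, oid: str, base_id: str) -> None:
        """Re-store `oid` as a toy delta against `base_id` (storage detail only)."""
        if oid == base_id:
            return
        content, base = self.get(oid), self.get(base_id)
        prefix = _common_prefix_len(content, base)
        suffix = _common_prefix_len(content[prefix:][::-1], base[prefix:][::-1])
        middle = content[prefix:len(content) - suffix]
        self.packed[oid] = (base_id, prefix, suffix, middle)
        self.loose.pop(oid, None)
```

Callers of `get` cannot tell whether an object is loose or packed, which is the sense in which the delta is only a storage optimization. The sketch does nothing to prevent delta cycles; a real implementation would have to.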

> "automatic GC" which combines these "loose objects" into a "pack file"; and within that "pack file", it uses a binary diff (a xdelta) between similar objects to reduce the total size

Isn't it the case then that git doesn't store in full every version of a tracked file?

Deduplication and compression do not imply diffs.

git does perform delta compression. From what I can gather, git's storage engine uses conventional compression (zlib), deduplication (exactly identical files need not be stored twice) and delta compression (between similar files).

The question was does git's implementation store - in full - every version of a tracked file? The answer is that it doesn't. git has a sophisticated storage engine precisely to avoid the inefficiencies of the naive approach.

The git book explains the internals very well, so you can easily verify it for yourself. Files are referenced as objects in trees, which are pointed to by commits. Editing a file creates a new object for it. (edited for tone)

From the book:

> You have two nearly identical 22K objects on your disk (each compressed to approximately 7K). Wouldn’t it be nice if Git could store one of them in full but then the second object only as the delta between it and the first?

> It turns out that it can. The initial format in which Git saves objects on disk is called a “loose” object format. However, occasionally Git packs up several of these objects into a single binary file called a “packfile” in order to save space and be more efficient. Git does this if you have too many loose objects around, if you run the git gc command manually, or if you push to a remote server.

> When Git packs objects, it looks for files that are named and sized similarly, and stores just the deltas from one version of the file to the next. You can look into the packfile and see what Git did to save space.


That's just for compression. Commits aren't diffs, and when you checkout stuff, git doesn't do diffs to give you the working directory at that point. See https://stackoverflow.com/a/25028688/8272371 for detailed explanation.

> See https://stackoverflow.com/a/25028688/8272371 for detailed explanation.

There is a disconnect somewhere. The linked answer says:

> Now, git is different. Git stores references to complete blobs and this means that with git, only one commit is sufficient to recreate the codebase at that point in time. Git does not need to look up information from past revisions to create a snapshot.

> So if that is the case, then where does the delta compression that git uses come in?

> Well, it is nothing but a compression concept - there is no point storing the same information twice, if only a tiny amount has changed. Therefore, represent what has changed, but store a reference to it, so that the commit that it belongs to, which is in effect a tree of references, can still be re-created without looking at past commits.

You can recreate a file that is stored as a root blob plus some series of diffs without looking at information from past commits. But you can't recreate it without doing the diffs! You have to look at the root blob. This is, internally, tracked separately from the commit which created it. But your conclusion:

> when you checkout stuff, git doesn't do diffs to give you the working directory at that point.

cannot be true. If the working directory at that point corresponds to a blob which has only diff information stored, git must apply that diff to a separate blob in order to give you the working directory.

What you are missing is the difference between git's object model (with loose objects) and packfiles. The delta compression happens when you run `git gc` (git does this automatically as well on occasion), and packfiles are how git fetches and pushes history.

But when you create a new commit, that commit object is stored as a loose object, with any new file blobs and tree objects. This represents a complete snapshot of your working tree.

But git does not need to make a complete copy of the working tree on each commit. Because objects are referred to by the hash of their contents (with a git specific header), git only needs to store each version of a file once.
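
For anyone who wants to poke at this, a small sketch of how a blob id is derived for a SHA-1 repository: the hash covers a `blob <size>\0` header followed by the file contents. The `store` dict and `store_blob` helper are made up for illustration and only stand in for the object database.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """SHA-1 over the header 'blob <size>' + NUL byte, followed by the content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


store = {}  # hypothetical stand-in for the object database: id -> content


def store_blob(content: bytes) -> str:
    """Store a blob once per distinct content and return its id."""
    oid = git_blob_id(content)
    store.setdefault(oid, content)
    return oid


# Two commits that contain the same file contents share one blob.
first = store_blob(b"unchanged file\n")
second = store_blob(b"unchanged file\n")
print(first == second, len(store))
```

The id of the empty blob computed this way is `e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`, which you can compare against `git hash-object` on an empty file.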

Everyone here is arguing past each other because one side defines "what git does" as the literal implementation details of git, and the other side defines "what git does" as the model it presents to the end user. I suspect the reason for this disconnect is partly the emphasis on understanding the "internals" of git, when the real question here is the difference between the internal implementation as it exists in code and the internal model/interface.

When git makes packfiles using delta compression, it ignores the history. It roughly sorts blobs for similarity, completely ignoring their filenames or which commits they appear in. This sorting helps to make the packfile delta compression more efficient. The deltas on disk are completely unrelated to the diffs you see from `git show`.

Which is explicitly a lower-level optimization applied to files well-suited for it and not related to the concept of a commit. A commit does not reference a diff.

Fair enough. The blob stores a diff. The commit stores a reference to... the diff. This is a division between the concept of the object and the implementation. But it's not an example of a diff-storing model failing to model git as it is; git as it is is storing diffs.

If a commit references a "blob", and the "blob" that it references is, in fact, a diff, why would we say that the commit "does not reference a diff"?

Each commit consists of a structured collection of hash IDs for every file in the entire repo. The hash ID is generated by hashing the contents of the entire file. Not the diff.

The "diff" you're referring to is an implementation detail of the compression. It's not even always there; it depends on which commits are present in your clone. It's also not even the same "diff" you work with when you use git to generate or apply patches. Using the same word only leads to confusion.

And that's part of the problem with git! It requires you to get a mental model of how it works internally, but only part of how it works internally is important, and there are terminology conflicts. So people get incorrect ideas about how it works, then get surprised when something unexpected happens.

There're revision control systems based on diffs, but git's power (and durability) is that every commit references only blobs, which are (content-addressable) files, not diffs. All the diffs shown by git log, the git-diff command, or the git rebase command are computed on the fly from the two stored versions. And yes, if you commit a giant file and then delete it in the next commit, it's there forever, until you remove or rewrite the history of every branch that references this file.
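
A sketch of what "computed on the fly" means, with snapshots modeled as plain dicts from path to file text. `diff_snapshots` is a made-up helper built on Python's `difflib`; git's own diff machinery is a different implementation.

```python
import difflib


def diff_snapshots(old: dict, new: dict) -> list:
    """Compare two full snapshots (path -> text) and return unified diff lines."""
    out = []
    for path in sorted(set(old) | set(new)):
        a = old.get(path, "").splitlines()
        b = new.get(path, "").splitlines()
        if a == b:
            continue  # identical content, nothing to show
        out.extend(
            difflib.unified_diff(a, b, "a/" + path, "b/" + path, lineterm="")
        )
    return out


parent = {"f.txt": "one\ntwo\n"}
child = {"f.txt": "one\n2\n", "g.txt": "hi\n"}
print("\n".join(diff_snapshots(parent, child)))
```

Nothing diff-shaped is stored anywhere in this model: both snapshots are complete, and the diff exists only for as long as someone asks to see it.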

There are optimizations at the storage level (compression etc.), but at the logical level those are transparent.

I would understand "a commit is/contains a diff" as the commit referencing the difference to its parent commit(s), whereas in a packfile the diff might be against a blob belonging to an entirely different branch of the repository, if that's a better diff. Which might be different for each file. And the blob doesn't have to be a diff, it only is if the packer found a good candidate.

If it didn't, then the diff mechanism git would end up using would have to be purely internal to git and abstracted away from the user. Why? Because otherwise it would mean your diff algorithm is now your storage format and it is not allowed to change, ever. That's going to cause worse issues than large git repositories.

There is also the obvious performance problem that you would have to replay all the diffs to switch between commits.

It does, yes. To add a bit more color though, what happens is that when you run `git gc` (or it's run automatically for you sometimes) an extra compression step is done that uses diffs of some sort to avoid storing so many near copies. Packfiles are related to this.

git works off of snapshots which are blobs, and blobs are compressed very effectively, but if your file takes 30KB compressed then yes, 30KB compressed is being added for every white space added[0].

Another way to think of it: if it were diffs and you had 1000 commits, getting to the 'head' would take forever because git would have to replay every commit's diff just to get there.

Yes, you could combine diffs & snapshots, but that in itself is a tricky complexity in an already very complex system.

[0] https://tom.preston-werner.com/2009/05/19/the-git-parable.ht...

Oh my yes. Some years ago, I did not realize this. My mental model of Git had it de-duplicating common text between commits, but this is not the case. I learned the truth the hard way when I wrote a commit hook that automatically appended about a hundred lines to a text file with every commit. It worked fine at first, but eventually `git fetch` started failing.

A diff is something that describes the changes between two versions but does NOT refer to any specific version; or at least it is something that can be worked with independent of any specific version.

I.e., I can develop a fix for an issue for version 1.2.30 of some software, generate the diff using the diff tool, and then apply this diff using the patch tool to version 1.1.15 of the software. This might fail (or result in something undesired), but there is no problem in principle with moving the diff around and applying it somewhere else.

A git commit however is a particular version, so git is not really good at applying a commit somewhere else.

While this is a wonderful explanation for a person already familiar with mathematical concepts such as groupoids, torsors, and directed acyclic graphs, I think that this explanation is really unfriendly for the layman/novice developer who's just trying to get their changes up on Github.

All is well until you need merges, which is where the confusion happens and minds are lost.

- You have a branch `master`.

- You have a branch `feature` which contains commit C, which conflicts with `master`.

- You merge `feature` into `master`, fixing conflicts.

- You log the commits of `master`.

  * First is the merge commit, whose diff contains code added by commit C (including the conflict resolution).

  * Next you see commit C, whose diff contains code added by commit C (obviously)

Wait, how can 2 commits in a row have a diff that modifies the code in the same way? Well, mind you, the diff of a commit doesn't correspond to the moment the patch of the commit was added to the current branch, it only corresponds to the moment it was added to its own branch.
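
The scenario above can be sketched with a toy commit graph. The commit names, file contents and the `changed_paths` helper are all hypothetical, and "the diff of a commit" is assumed here to mean a comparison against the first parent only.

```python
# Toy history: B is the common base, A is on master, C is on feature,
# and M merges feature into master (parents listed first-parent first).
commits = {
    "B": {"parents": [], "tree": {"f.txt": "base"}},
    "A": {"parents": ["B"], "tree": {"f.txt": "master edit"}},
    "C": {"parents": ["B"],
          "tree": {"f.txt": "feature edit", "new.txt": "from C"}},
    "M": {"parents": ["A", "C"],
          "tree": {"f.txt": "master edit + feature edit", "new.txt": "from C"}},
}


def changed_paths(commit_id: str) -> set:
    """Paths whose content differs between a commit and its first parent."""
    commit = commits[commit_id]
    tree = commit["tree"]
    if not commit["parents"]:
        return set(tree)  # root commit: every path is new
    parent_tree = commits[commit["parents"][0]]["tree"]
    return {
        path
        for path in set(tree) | set(parent_tree)
        if tree.get(path) != parent_tree.get(path)
    }


print(sorted(changed_paths("M")))  # what the merge changed relative to master
print(sorted(changed_paths("C")))  # what C changed relative to its own parent
```

Both calls report the same paths, because M is compared with A (master before the merge) while C is compared with B (the point where feature branched off). The two diffs are measured against different parents, so seeing the same change twice in a flattened log is expected.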

The problem with that is thinking a directed acyclic graph is a flat sequential list. Try `git log --graph --oneline` or `gitk`.

Ah, so just a file space with a rooted DAG where each node is a diff with paths manipulable by queries where diffs can use a groupoid action to produce a new file or conflict, resulting in an API to interact with a torsor.

It's an API to interact with a torsor

Well why didn't you just say so!

What does simple mean to you?

What do you think of darcs?

this is your brain on category theory

This is funny but I think it strikes at a real truth: what Git is trying to do is legitimately difficult. I can see where the author of the article is coming from and I agree that Git has a lot of commands and it's hard to pull it all together, especially if you are new. And I agree that these commands could be better organized and presented and I think that's something the project is actively trying to address.

But isn't the core idea of Git, everyone has their own repository and they share branches, kind of complicated and tricky? In my experience the developers I see having problems with Git are having trouble with the hard stuff, not the easy bits like committing and pushing a change.

Mercurial shows that it's possible to hide much of that complexity behind a user interface that only exposes as much of that complexity as necessary.

Git's--and Mercurial's--secret sauce is the directed acyclic graph of commits.

Everything else is window dressing. It's complicated if you're used to a linear sequence of commits, but it's not too hard once you grasp the tree structure.

All operations are operations that manipulate (or share) the DAG, so it now becomes a matter of mapping how you want to manipulate the DAG onto the command set the tool gives you.

Git's toolset was designed by a dozen madmen that didn't talk to each other, so it's a loosely connected set of tools with conflicting syntax and meaning. Mercurial's toolset was designed to be used, so if you know the right terminology, you can easily figure out what command to write. Want to commit something? It's probably hg commit. Want to rebase? It's probably hg rebase. Etc.

Git exposes unnecessary complexity. But it won the DVCS popularity contest, so we're stuck with it. I wish something else had won. Oh well...

git is a fast, content-addressable, decentralized and symmetrically ad hoc synchronizable, cryptographically verified filesystem, stored as a directed acyclic graph of tags, commits, trees and blobs, in that order, with SHA-1 pointers as edges, backed by a POSIX filesystem, with both a simple storage format and a space- and seek-efficient, compressed delta chained pack format.

Seems like an obvious play on "a monad is a monoid in the category of endofunctors, what's the problem?"

But the monad one is true, so this particular joke falls a bit flat since it's just nonsense. (I think...)

What Hilbert space would that be?

I am not sure, but I think it's a Joke Hilbert space

There's also very little in a graph of changes and revisions that could be considered a manifold, which makes looking for homeomorphic endofunctors pointless.

But the first impression is strong, it's still a good joke.

This would have been funny if it weren't, in fact, technobabble.
