Commits are snapshots, not diffs (github.blog)
136 points by todsacerdoti on Dec 17, 2020 | 150 comments



A fascinating bit of history is that the reason for this data structure was to explicitly distance git from BitKeeper.

BitKeeper was used by many kernel developers, until one of them (the co-inventor of rsync, unsurprisingly) reverse-engineered its protocol, leading its proprietary owner to end its offer of a free license to Linus Torvalds and other developers.

While searching for a replacement, the Linux community sparked two projects: Mercurial (which uses changesets, similar to BitKeeper) and Git (which uses Merkle trees). Git was created by Linus in the unexpected free time that this event created; and he mentioned being wary of Mercurial and having adopted this design for Git to remove any ambiguity that they were creating a BitKeeper clone.

Incidentally, BitKeeper did demand that its customers not work on Mercurial[1].

[1]: https://web.archive.org/web/20070929115009/http://article.gm...


I wonder if this is why Mercurial is so much simpler and easier to understand than Git -- precisely because Git was forced to adopt a more "out there" approach. Personally I'm sad Git won out over Mercurial -- the latter is a much better technology from my experience.


> Mercurial is so much simpler and easier to understand than Git

I think it must just be a mindset thing. The way git works makes perfect sense to me and I don't know why you would do it any other way.

But git was already gaining significant traction when I started using version control.

I can see how having built your mental models around previous version control systems and then trying to superimpose that on git would cause dissonance.

But my entire concept of version control is based on how git does it, so it seems perfectly natural.


It's, of course, impossible to do a practical experiment where someone doesn't learn, and therefore become biased toward, one of them first, and then learn the other after.

For my part, what I can say is, I used Mercurial for one year, and Git almost exclusively (I do occasionally touch Subversion) for a straight decade after that. And I still find myself having to consult Git's documentation far more often than I ever did with Hg. And, even now, I find myself having to un-pick minor screw-ups in Git more often than I did after only a couple months on Mercurial.

I'm pretty sure that the problem here is ultimately the UI and not the data model. No, I'm sure it is. Every time there's a conversation about Mercurial vs Git, and someone says, "Git's not that bad, all you have to do is learn the data model, and then learn all the different ad-hoc ways the UI is bound to it," my immediate thought is, "You see, that's exactly the point. The thing that's nice about Hg is that you don't have to take this extra learning step, because the data model and the UI model are one and the same."

Also, merge conflicts seem much more common in Git. I'm not sure if that's a UI problem or a data model problem; I could see it being either or both. In any case, I felt much more comfortable allowing branches to live for days in Hg. In Git, branches that survive past 24 hours seem to always result in either shaving or getting trampled by yaks.


Underlying technology aside, mercurial has (I think) quite objectively better cli interface. All flags are the same among commands, and all commands have their undo counterparts.


> Underlying technology aside, mercurial has (I think) quite objectively better cli interface.

I disagree. The fact that mercurial relied extensively on extensions to implement basic features, thus exposing a non-standard interface to the world, made its mental load significantly higher than simply using a standardized (albeit debatable) interface to do standard things.

Case in point: requiring installing extensions to stash local changes.


> relied extensively on extensions to implement basic features

Sorry, this is a strawman. Yes, `hg shelve` is a very nice extension. But that's about the only add-on I've ever needed, while working more than 7 years on reasonably complex projects spanning about a dozen teams, in multiple timezones and repositories. And it takes less than 5 minutes to install.

We were forced to move to Git after being acquired, and while the migration was painless, the day-to-day friction caused by Git porcelain is notably higher than with Mercurial. The issues caused by line-endings alone waste more time than all the Mercurial issues combined.


> Sorry, this is a strawman.

It really isn't, and you cannot hide Mercurial's failings by trying to move the goalposts.

It's a fact that Mercurial's non-standard interface created a mental load for basic ops that is considerably higher than git's reliable and predictable (and, more importantly, learnable) interface.

> Yes, `hg shelve` is a very nice extension. But that's about the only add-on I've ever needed.

It's not a "nice extension". It's core functionality, which is a part of any basic introductory workflow.

And with Mercurial, instead of just being able to stash changes with a simple `git stash`, all of a sudden you need to bother with installations and setups and configs and checking that everything is in its place.

Just. To. Stash. A. Change.

And if you find stashing nothing more than "a nice extension", and nothing else pops into your mind, then you clearly have limited experience in using revision control systems.

> And it takes less than 5 minutes to install.

Did you fail to notice that it forces you to waste time ensuring you're bolting on all the stuff you need whenever you jump a seat or pop into an instance?

And did you fail to realize that this problem is not experienced with pretty much any other revision control system? Not just git, but pretty much all of them.


While Git is conceptually beautiful, its CLI seems something of an afterthought and is maddeningly inconsistent at times.


As another data point, I used CVS and later SVN for years before I used git, and I find git to be incredibly easy to use and understand. But I've been using it for over a decade, and spent time in the early days to understand how it works under the hood.

This isn't to say git has a good user interface. I don't think it does. I'm comfortable with it because I've spent many years using it, and because I expended extra effort (effort I don't think an average user should have to expend) to learn it well, and to be unafraid of making mistakes while poking around.


Short of breaking backwards compatibility at this point, there’s nothing preventing Git from being as simple as Mercurial CLI-wise. The two projects are nearly identical on a high level.


I know git is confusing on the one hand, and somehow instills over-confidence on the other. But the underlying implementation of "Store everything, SHA it, and build a Merkle tree" is, in my opinion, the kind of thoroughness I want in a version control system.
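The idea is compact enough to sketch in a few lines of Python. This is a toy model: the `blob` header below is git's real object format, but the tree serialization is simplified (real git trees record mode, name, and raw SHA bytes).

```python
import hashlib

def blob_id(content: bytes) -> str:
    """Hash file contents the way git does: a 'blob <size>\\0' header plus the bytes."""
    return hashlib.sha1(b"blob %d\0" % len(content) + content).hexdigest()

def tree_id(entries: dict) -> str:
    """Simplified tree hash: digest of sorted (name, child-id) pairs.
    Real git serializes mode/name/raw-sha, but the principle is the same:
    the tree's id depends on the ids of everything beneath it."""
    payload = b"".join(f"{name}:{oid}\n".encode() for name, oid in sorted(entries.items()))
    return hashlib.sha1(b"tree \0" + payload).hexdigest()

# Two snapshots sharing an unchanged file produce the same blob id for it
# (so storage is deduplicated), while any change ripples up to a new tree id.
v1 = {"README": blob_id(b"hello\n"), "main.c": blob_id(b"int main(){}\n")}
v2 = {"README": blob_id(b"hello\n"), "main.c": blob_id(b"int main(){return 0;}\n")}
```

Since every id is a hash over content plus child hashes, corrupting any object changes every id above it, which is exactly the thoroughness being praised here.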


Isn't Mercurial different only in nomenclature? The manifest id stored in a changeset is a tree snapshot. Otherwise I don't think hg-git would work as transparently as it usually does.

https://www.mercurial-scm.org/wiki/ChangeSet


They did add it; I think it wasn't there in v0.1[0]. Nowadays, the revlog contains a sequence of either snapshots or deltas[1], a bit like video codecs' I-frames and P-frames, except the heuristic is how much data must be read to check out the file; so most are likely deltas.

[0]: https://lore.kernel.org/lkml/42692470.9050605@tmr.com/T/

[1]: http://hgbook.red-bean.com/read/behind-the-scenes.html


They’re different in their implementations of how repositories are stored on disk. Other than that, you’re right in that the two systems have a gigantic (almost one-to-one) abstract/conceptual overlap.


I forget - was there another side to the story that made McVoy's actions seem more reasonable?

If all I had to go on was Bryan O'Sullivan's email, I'd be tempted to draw some unflattering conclusions about McVoy's conduct.


I read a lot of the mailing list threads back when all this was going on, and I don't recall feeling any particular sympathy toward McVoy at all.

The fundamental issue I saw was that McVoy decided to foist what I consider an unconscionable license agreement upon his unpaid open source users: users were prohibited from working on another version control system without having their license to use the read-only BitKeeper client revoked. (And reverse engineering BK's protocol was also forbidden.) From the starting line I already thought he was slimy for that.

This is where Torvalds should have gotten his big honking "I told you so", but I expect he took the wrong lesson from it since he was able to whip up an acceptable alternative in git in a fairly short amount of time[0].

Anyway, McVoy was usually pretty civil in email threads about the whole thing (likely with a few angry words here and there), but I don't think that really mattered to me: IMO he was asking something unethical of his users from the start, and so I think ethically Tridge was completely in the right to reverse engineer BK's protocol and write an unencumbered client. McVoy's response to take his ball and go home was of course entirely within his rights, but was oh so childish. And, surprise, BitKeeper has since failed as a product; I believe it's now open source, and has a tiny fraction of git's user base.

[0] I think it's easy for some to take the -- IMO mistaken -- view that "being allowed to" (eye roll) use BitKeeper was a "generous" gift, because it pushed someone to build a viable open-source alternative when their back was against a wall. My view is that it was just lucky for the open source community that McVoy went up against one of the few people with the talent and motivation to out-build BitKeeper. This is one of those times RMS's preaching against using closed-source software is so spot-on, publicly, painfully correct.


This is where Torvalds should have gotten his big honking "I told you so", but I expect he took the wrong lesson from it

which lesson did he take and what should he have learned instead?


What I think he should have learned: betting the workflow of your open source project on a closed source solution is going to come back to bite you, and you shouldn't do that in the first place.

What he probably learned: I'm badass enough to clone a proprietary tool in a matter of weeks if my decisions end up being bad.

Which I guess is fine for him, but likely not most people.


> The fundamental issue I saw was that McVoy decided to foist what I consider an unconscionable license agreement upon his unpaid open source users: users were prohibited from working on another version control system without having their license to use the read-only BitKeeper client revoked. (And reverse engineering BK's protocol was also forbidden.) From the starting line I already thought he was slimy for that.

I'm letting you use my commercial product under the understanding that you won't try to undermine it... I think that's actually pretty reasonable. I'm giving you something for nothing in return, and all I ask is not to try to take more from me? I always thought that Tridge was ungrateful in the situation.


I get that you might have no problem with that, but personally I find that kind of arrangement unethical, and I would never agree to that sort of license.

Also note that I believe the "free for open source users" version of the BK client was a read-only client that only allowed pulling code. So it's not like McVoy was giving his whole product away for free, out of the generosity of his heart.

> ...and all I ask is not to try to take more from me?

Reverse engineering is a neutral activity. There's no "taking" going on.

> I always thought that Tridge was ungrateful in the situation.

Ungrateful that he was forced to use a crippled, proprietary tool to interface with the development process of a project he's involved in? A tool that expressly disallows him from doing anything to make that situation better? That's rubbish.


Mercurial was officially named after Larry McVoy. (No idea about git.)

That said, in his defense, causing the creation of multiple open-source competitors to one’s moneymaker can be stressful.


> (No idea about git.)

Quoth wikipedia[1]:

> "I'm an egotistical bastard, and I name all my projects after myself. First 'Linux', now 'git'."

[1]: https://en.wikipedia.org/wiki/Git#Naming


For a simple, easy to understand overview of git, nothing beats The Git Parable [1]. Every time I talk to someone starting out with Git I recommend they read it first. Once they understand Git through that lens, usually I find the rest falls into place.

[1] https://tom.preston-werner.com/2009/05/19/the-git-parable.ht...


How about "Git For Ages 4 And Up" [1]?

[1] https://www.youtube.com/watch?v=1ffBJ4sVUb4


This is my go-to video for anyone looking to learn git. It won't teach you the various commands on the command line but it shows you what is actually happening inside git when you perform various actions. It made me transition from "I know how to run these commands to operate git" to "I know what git is doing, so I can reason about the system and adapt to unusual circumstances". Now when someone makes a mistake in my company's git, I am the person people go to for help.


Great article.

The first thing that I was told about git was that commits are snapshots. It is enough to understand the simple workflow: pull, commit -a, push.

What confused me was everything that involves applying diffs (git stash apply, git rebase, git cherry-pick).

The secret is that you need to think in terms of either snapshots or diffs, depending on the context.

Some git tutorials show you a few "magic" commands to get you started but it's best to really understand the git model as soon as possible. It's not that hard, and it's here to stay. We may still be using git in 30 years, just like we're still using bash and vim.
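One way to square the two views: model a snapshot as a filename-to-contents mapping, and treat a "diff" as something derived on demand between two snapshots. In this toy model (the names `diff`/`apply_diff` are made up, and conflict handling is ignored), cherry-pick is "diff against the parent, apply elsewhere":

```python
def diff(old: dict, new: dict) -> dict:
    """Derive a change-set between two snapshots: path -> (before, after)."""
    paths = set(old) | set(new)
    return {p: (old.get(p), new.get(p)) for p in paths if old.get(p) != new.get(p)}

def apply_diff(snapshot: dict, changes: dict) -> dict:
    """Replay a change-set onto some other snapshot (naive: no conflict detection)."""
    result = dict(snapshot)
    for path, (_, after) in changes.items():
        if after is None:
            result.pop(path, None)  # file was deleted in the change
        else:
            result[path] = after
    return result

# cherry-pick: take the change a commit made relative to ITS parent,
# and apply that change on top of a different snapshot.
parent = {"a.txt": "1", "b.txt": "x"}
commit = {"a.txt": "2", "b.txt": "x"}   # this commit changed a.txt
target = {"a.txt": "1", "b.txt": "y"}   # a different branch tip
picked = apply_diff(target, diff(parent, commit))
# picked == {"a.txt": "2", "b.txt": "y"}
```

Commits are stored as snapshots, but the diff-like operations (stash apply, rebase, cherry-pick) compute differences between snapshots on the fly, which is why both mental models are needed.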


Why does thinking about commits as snapshots make cherry-pick and rebase easier to understand? I've always thought of and taught commits as diffs because git does a good job of abstracting away the distinction.

EDIT: To be clear, I think this is one of the _few_ abstractions that git doesn't leak. It does a pretty bad job everywhere else.


It does a good job in simple scenarios.

It breaks down often. I've seen plenty of horribly botched merges and rebases that took many hours to fix up.

We learn to work around this by adapting and limiting our workflows.

Most devs are just so used to git that a different model like Pijul is hard to reason about or see the benefits of, but improvement is definitely possible.


To be clear, git has a horrible interface and is the leakiest abstraction. But I don't think that the snapshot-diff distinction is one of those leaks.


A snapshot is easy to understand. It's just the contents of a file at a point in time.

Diffs can have different definitions, and I've never seen a complete rigorous one for whatever git uses, if anything. Loosely, a "diff" is the changes between one version and another. But to understand what cherry-pick does, we (or maybe just I) need more than a loose definition.

Using snapshots makes `rebase` trivial. The contents of a commit doesn't change. Just the link to its predecessor. Using a diff model, it requires all kinds of fancy changes.


> Using snapshots makes `rebase` trivial. The contents of a commit doesn't change. Just the link to its predecessor.

It's the exact opposite. When you rebase, you want to apply the changes you're rebasing atop other existing changes.

If you just copy over the snapshot, you break everything. Rebasing a snapshot basically smashes your history and is utterly useless. That would make a rebase-pull… implicitly revert every intermediate commit.

Although AFAIK Git really uses the merge machinery to perform rebases: conceptually (I don't know if it does that technically) it merges each commit to rebase onto the rebase target, then copies over the commits without the old (rebased-from) parent.


Warning: When I made this comment, I was under some false assumptions. Here's the original.

---

A snapshot is the aggregate of all previous changes. I don't see what could be broken by copying the snapshot.

I'm not familiar with rebase-pull, but I don't see why any revert would be involved. The snapshot before the rebase is exactly the same as the snapshot after the rebase. Only the predecessor is changed. The contents of the predecessor doesn't matter, since a snapshot contains the entire state of the tree.


> A snapshot is the aggregate of all previous changes. I don't see what could be broken by copying the snapshot.

If you copy the snapshot, you lose all the intermediate commits you're rebasing onto, because they're not in the snapshot you're rebasing.

> The snapshot before the rebase is exactly the same as the snapshot after the rebase.

Which is exactly what you do not want.

> The contents of the predecessor doesn't matter, since a snapshot contains the entire state of the tree.

Which, again, is exactly what you do not want:

    a - b - c - d
              \ e - f
If I rebase `d` onto `f`, I want the changes performed by `e` and `f` as well as those performed by `d`. If I just copy over the `d` snapshot to d', I get the exact same state I had at `d` with a different history, losing the changes performed in e and f, thus reverting them.

That's why when you rebase a commit, not only does the commit id change (makes sense, different parent) the tree also changes, which also makes sense because the content changes.
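The disagreement can be made concrete with a toy model of the diagram above, using filename-to-contents dicts as snapshots. (Real git replays each commit with three-way merge machinery rather than this naive patch application, so this is only a sketch.)

```python
def diff(old: dict, new: dict) -> dict:
    """Change-set between two snapshots: path -> (before, after)."""
    paths = set(old) | set(new)
    return {p: (old.get(p), new.get(p)) for p in paths if old.get(p) != new.get(p)}

def apply_diff(snap: dict, changes: dict) -> dict:
    """Replay a change-set onto another snapshot (simplified: no deletes/conflicts)."""
    out = dict(snap)
    for p, (_, after) in changes.items():
        out[p] = after
    return out

c = {"file": "base"}                              # common ancestor
d = {"file": "base", "d.txt": "work on d"}        # snapshot at d
f = {"file": "base", "e.txt": "e", "f.txt": "f"}  # tip after e and f

# Naively copying d's snapshot onto f would "revert" e and f's work:
copied = dict(d)                                  # no e.txt / f.txt!

# Rebasing instead replays the CHANGE c->d on top of f:
rebased = apply_diff(f, diff(c, d))
# rebased == {"file": "base", "d.txt": "work on d", "e.txt": "e", "f.txt": "f"}
```

This is why the rebased commit d' ends up with a different tree than d: its content incorporates e and f, which a copied snapshot would not.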


Ok I see. I actually misunderstood what rebase actually did. So I guess it's a good thing then that I've never used it. Thanks for your persistence.


Rebase is useful for artificially straightening out simultaneous work.

It's less clear in the above example, where d is just a single commit. But if there's a d1--d2--d3--d4, it can be confusing to merge chronologically (d1--d2--e--d3--f--d4).

If there aren't any conflicts, it's conceptually simpler in the long run to rebase it as e--f--d1--d2--d3--d4. You artificially serialize the work, which isn't what actually happened, but it could have happened that way. When you go back and look at it, pretending that it did is simpler. For example, if you have to go back and debug something that happened while working on d1...d4.

It's not absolutely mandatory to learn about it. But the simple cases are common enough, and useful enough, that it's worth learning.


That didn't stop you from arguing that it was trivial upthread...


I was attempting to answer a question. It wasn't my intent to pose an argument.


> I've never seen a complete rigorous one for whatever git uses, if anything

Isn't that the point of a conceptual model? Thinking of commits as diffs abstracts away the actual implementation, which allows me to understand cherry picks and rebasing without worrying about object OIDs and trees. I can think of those actions as "applying the diffs onto other commits", even though technically it can't be implemented that way.


> Isn't that the point of a conceptual model?

Probably, but my understanding of the conceptual model is full of holes. There are many scenarios to which I don't know how cherry-pick would react. Maybe I should just try them.

"applying the diffs onto other commits" is all well and good for simple scenarios where there are not multiple interpretations of changes. But for other scenarios, it requires one to know what's actually in a diff, which I don't know. In fact, different git clients present diffs differently. Is that evidence that diffs aren't a real thing? I don't know.


It highlights why they're generally bad ideas that should be used sparingly, and certainly not as the default.


This is why I'm really excited about the potential of Pijul. In pijul, commits are diffs, and you avoid all of the rebase/cherry-picking craziness of Git.

Still alpha software, so tons of rough edges, but the potential is incredible. I think it'll be similar to the centralized -> distributed revolution that git ushered in.


There's nothing intrinsically superior about storing commits as diffs - subversion stores its commits as diffs (or at least, it did a decade or so ago), and I haven't heard anyone enthusing about svn in a long time.


It's about the model, rather than the storage per se. AFAIK, svn doesn't model dependencies between patches, so it doesn't get the benefits.

Have you ever had to laboriously cherry-pick a series of commits solving some bug between branches (e.g., from stable to dev or vice-versa)? It's painful, and then later on you get conflicts because when you cherry-pick, git doesn't "understand" what's happening. Pijul solves that.

At a fundamental level, imagine I'm editing two files A and B, that have nothing to do with each other. Because git's level of dependency is the entire repo, if I switch back and forth between editing A and B, git inserts a series of arbitrary and not-actually-real dependencies between those edits. Then, as a consequence, there's no easy way for me to decide, say, I would like to merge my changes to A somewhere, but not my changes to B.

Or how about the entire controversy(/complexity) of whether to rebase or merge when pulling? The choice is an artifact of git's model, really a question of "in which way would we like git to represent some not-real constraints". This difficulty doesn't exist in Pijul, where dependencies are more fine-grained and real.

Another pain point you may have hit w/ git is when you do a merge, but somehow actually the changes were not included; this usually happens as a result of confusion regarding a conflict. Then you have to do painful repo surgery to try to convince git to actually include the changes you want.

In general large conflicts are quite painful in git: for the duration of the conflict, you are in a special state, and cannot commit. In pijul, conflicts are a first-class state of the repo, so you can iteratively work towards resolving your conflicts, using all of the normal VC tools to checkpoint along the way.

Disclaimer: I've only played with Pijul, not used it in anger. I definitely see a lot of potential though.


That's right, and actually Pijul patches are not actual diffs, they just behave exactly like them for all uses.

The main thing I believe is intrinsically superior to Git (I use Pijul a lot) is patch commutation, not just patches. Pijul guarantees that two patches that could be written independently always commute.

This has important consequences:

- You can push patches from the same branch in any order (provided they don't depend on each other). This won't change their identity, meaning that if you decide to push the other patches later, you can do it without having to rebase, and review/test the new order.

- Conflicts happen between patches, and conflict resolutions are modeled as patches. This means that if you solve a conflict once, you can push your resolution, even if you have your own, private patches on the same branch.


Are you sure about that? See "Why a new version control system?" at

https://pijul.org/faq/

and the linked "badmerge" example:

https://tahoe-lafs.org/~zooko/badmerge/simple.html


If a project wants to hold up a bad merge as an example of why they are better, they need to do better than just a list of letters. They need an actual example of real code.


Actually, I wrote the graphical version of that example (the original authors are cited). I find that the code version is harder to understand. The letters make it much more explicit that Git is really shuffling lines around (I find that scary).


Replace the letters with lines of code and it will have the same result.


There is a link to example real code in the first sentence there.


Thanks.


Pijul patches can't be just diffs. I don't know the Pijul implementation but you can see what they're like in Darcs, for instance, in the text form they're meant to be mailed. (Darcs isn't alpha, nothing against Pijul.)


They're not just diffs indeed, they have a bit more information, see an example there:

https://nest.pijul.com/pijul/pijul/changes/XTMYHJZLWWT5I2PJA...

First, there are explicit dependencies, and also each section starts with a somewhat cryptic machine-readable description, for example:

B:BD[2.1195] → [2.1195:1291]

One of the challenges in the new version of Pijul was to find a description that was printable in text, and not too unintelligible to humans.

Darcs isn't alpha, but merges patches in exponential time, which was the initial motivation for Pijul.


I'd still like to know if you're claiming Darcs 3 (which may or may not be at a similar stage of development) still has that issue (which I don't remember biting me). I've no particular axe to grind as a Darcs user, as long as I have a nice patch-based system.


I don't think Darcs 3 was even started. Darcs 2 recently solved the exponential merge, but it is now quadratic (which is still exponentially slower than Pijul instead of double-exponentially).

I've also been really happy with Darcs, except for two things:

- Conflicts are not handled very well, as if they were not properly stored internally. `darcs revert` on a conflict doesn't always do what I expect.

- The fact that it doesn't scale (even with the quadratic algorithm) means that big users can't adopt it, which means that too few people write tools for it (including a hosting website: I'm aware of hub.darcs.net, but there are many things missing, including on security). Darcs is hard to install on platforms such as Windows. Also, few people know it, which means that it's hard to collaborate using it.


I should have explicitly said the version 3 theory, which has an implementation in Darcs 2.16. I've never seen worst case quadratic complexity, like quicksort's, labelled exponential before, which seems misleading. I think there's more to practical scalability than worst-case asymptotic complexity, even if guaranteed linear is likely a good start; have the two been compared for Linux or GCC, for instance? I'd be happy to use the best combination of speed and features/usability.

I just set up Trac for Darcs hosting -- though it has some tension with the Subversion model -- exporting to git for people who want the pain. I'd investigate Sourcehut integration if I had the time.


If you get a bad merge, why not just layer on a series of diffs? In short, a pijul/diff based front end merge tool to git, hg, etc to assist with any challenging merges.


You could certainly do that, and this is in fact one of the ways to use Pijul. However, you would completely miss out on the most important novelty of Pijul, which is patch commutation:

When we say "Pijul is patch-based", we don't only mean that it stores patches, but rather that patches that could be written independently commute, meaning that they can be applied in any order, and are guaranteed to give the same result in all cases (including when they conflict).

This simplifies many workflows.
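A toy illustration of what commutation buys, with dicts standing in for repository state (this is not Pijul's actual patch format, just the shape of the guarantee):

```python
def apply_patch(snapshot: dict, patch: dict) -> dict:
    """A 'patch' here is just path -> new contents; independence means
    the two patches touch disjoint files."""
    out = dict(snapshot)
    out.update(patch)
    return out

base = {"A": "v1", "B": "v1"}
p = {"A": "v2"}   # patch touching only file A
q = {"B": "v2"}   # independent patch touching only file B

# Commutation: either application order reaches the exact same state,
# so p and q can be pushed, pulled, or reviewed in any order.
pq = apply_patch(apply_patch(base, p), q)
qp = apply_patch(apply_patch(base, q), p)
# pq == qp == {"A": "v2", "B": "v2"}
```

Pijul's claim is much stronger than this trivial disjoint-files case (it extends to patches within one file, and to conflicts), but the workflow payoff is the same: order of application stops mattering for independent work.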


To be fair, Roundy's idea of commutation was novel in 2002 (unless it was a reinvention of something earlier, which wouldn't surprise me) whether or not it was done right then. I'm happy if Pijul improves something which definitely simplifies workflows and doesn't get me confused after 30 years of revision control and sometimes resorting to diffs.


> (...) and you avoid all of the rebase/cherry-picking craziness of Git.

What's that rebase/cherry-picking craziness?


> What's that rebase/cherry-picking craziness?

I can only guess what OP had in mind: when you rebase in Git, commits are "replayed" on top of other commits. But there's no way this replay can really work, for two reasons:

- conflicts are not even modeled in commits, which is why some solved conflicts can come back. You may say this never happens, but (1) the problem is so real that Git even has a `git rerere` command to fix it, and (2) many workflows have explicit ways of avoiding this situation: no matter what your natural workflow is, you must adapt it to suit Git.

- each step of the replay tries to commute patches, but Git uses 3-way merge for that, and this is not a 100% algorithm, merely a heuristic, as explained there: https://pijul.org/manual/why_pijul.html


Disregarding whether the title helps you or not, the content of the post is useful to put things into context if you're not already very familiar with git or find some things surprising.

Commits are not 'just' the snapshots as the ancestor references and metadata matter. I for one think the title is a distinction without a difference (if you already accept that a git hash represents both the current set of exact files and ancestors) but the content is well presented for anyone.


Conceptually, a git commit is a snapshot. Simple as that. The user doesn't need to worry about how git manages its data internally.

> I believe that Git becomes understandable if we peel back the curtain and look at how Git stores your repository data.

This strikes me as misleading in the same way this StackOverflow answer [0] is misleading. In terms of how git stores data, commits are not always snapshots. Internally, git sometimes uses delta compression to reduce storage space. This is of no concern to the user, conceptually a commit is still a snapshot, but it's just not true that git naively stores each commit as a full snapshot.

Peeling back the curtain isn't helpful to someone trying to understand the basic git model. Also, as the article makes no mention of delta compression or of 'packfiles', it seems to me it hasn't pulled back the curtain at all.

[0] https://stackoverflow.com/a/8198276/


> Conceptually, a git commit is a snapshot. Simple as that. The user doesn't need to worry about how git manages its data internally.

This "simple" model is completely useless when trying to understand what merges, cherry picks, rebases, and patches actually do. Snapshots don't naturally do anything, and can't be combined in any way.

If you think in terms of snapshots, then you have to also start thinking about diff algorithms, and how different diff algorithms can give completely different meanings to what a "merge" or a "rebase" actually mean. Which is true, but given that git comes with one specific diff algorithm built in, and beginners should DEFINITELY not mess with it, it muddies the waters quite a bit.


The delta compression in Git is about storing the file contents of an object as a diff against another object. This changes the literal size on-disk, but it doesn't change the logical unit.

In fact, the delta chains used by Git for space compression have no direct relation to the object model DAG. From the perspective of a user using Git, these deltas are completely invisible.

Edit: perhaps to help this point... If Git stores an object using a delta, that doesn't change the object ID of that object compared to storing it uncompressed.
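This is easy to demonstrate: git's object id is a hash of the logical content only, so different storage encodings of the same object share one id. A sketch in Python, using zlib compression levels as a stand-in for git's choice between loose and delta/packed storage:

```python
import hashlib
import zlib

def object_id(content: bytes) -> str:
    # The id is a hash of the logical content (plus git's real 'blob' type
    # header), computed before any storage encoding is chosen.
    return hashlib.sha1(b"blob %d\0" % len(content) + content).hexdigest()

content = b"hello, git\n"
raw = b"blob %d\0" % len(content) + content

# Two different on-disk encodings of the same logical object:
stored = zlib.compress(raw, 0)  # uncompressed "stored" blocks
packed = zlib.compress(raw, 9)  # maximum compression

# The bytes on disk differ, but both decode to the same content,
# so object_id(content) is identical no matter which encoding is used.
```

The same holds for deltas in a packfile: git reconstructs the full content before hashing or serving the object, which is why the compression layer is invisible to the object model.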


Git internal delta compression has nothing to do with why Git commits are most usefully thought of as diffs. All git operations start by computing diffs between commits, and then trying to apply those diffs before conceptually taking a snapshot. It is quite likely that the textual diff used in the operation and the internal diff produced by the delta compression are entirely different.


Respectfully, I already made all of those points quite explicitly.


Even though I knew everything about the internal structure already, reading this post was worth it for learning about `git range-diff`, which I'd never heard of before. This sounds exactly like a tool I've always wished Git had.


Git is the leakiest abstraction in the history of abstractions.

Diffs are a "natural" object for version control, yet git doesn't actually use them, and as we can see in this article, multiple git commands leak this implementation detail.


I don't think diffs are the "natural" object - if you talk to someone who isn't using an automated version control system but still doing revision control, what they have is "Resume 12-1-2020.doc," "Resume 12-2-2020.doc," "Resume reviewed by Joe.doc," "Resume for Contoso final.doc," "Resume for Contoso final final USE THIS VERSION.doc," etc.

Those are snapshots of individual versions of the files, not diffs.

Or think about your favorite wiki (Wikipedia or your corporate wiki) - if you hit the "History" page, what you see is a list of the versions of the page, and then some UI to compare any two given versions. While there's a button for computing a diff between any version and the previous, that's not what it appears to treat as the natural object.

Diffs are an emergent object when you have some mechanism to automatically create them and apply them. (This is generally a programmatic mechanism, but there have long been cases where the mechanism is manual effort - think laws that amend other laws by saying "After section 3 insert... and remove section 5".) But the ultimate goal is to produce a version of the file.

If you want a diff management system, play around with quilt, which treats diffs (patch files) as the first-class object. Just about everyone I know who's tried using quilt for more than the tiniest amount of work vastly prefers just importing the applied patches into git and working with them in git. This is largely because quilt is empirically much more leaky - it's very sensitive to the current state of your directory, because a diff is a second-class object which requires something to be applied to in order to make sense, and it's too easy for the contents of your directory to not quite line up.

(Put another way - there's a reason we call it "version control" / "revision control". The thing it tracks is versions/revisions, which are concrete instances of the file in time. Quilt is a patch management system, and very few people are interested in those.)


You're right but I think what you're describing is the absolute bare minimum of version control. Yes, people want to ostensibly see the different versions/states. But for anything past the simplest setups, they immediately want to see changes, since you're actually after change management, not version control. Viewing different versions scales very poorly. Even regular user-oriented tools such as Word offer "Track changes", which is a diff view. Wikipedia also has a "Compare versions" button quite prominently on the page you mention.

If anything, most users I've exposed even to simple diff tools immediately appreciate the value. It's just that mainstream tools rarely support diffing since it's very complex from a technical point of view (end user tools and apps rarely work with plain text, they use binary formats, marked up text, etc), so it's not a mainstream paradigm as a result, and people are less familiar with it.


All the arguments you give are based on shortcomings of the tools: Microsoft Word doesn't have version control, Wikipedia stores versions in a database.

> But the ultimate goal is to produce a version of the file.

The fact that the end result is a version doesn't mean versions should be the main object. If you add a semicolon to a file, the actions you do are "press the ; key", not "create an entire version from scratch that has a semicolon there".

> If you want a diff management system, play around with quilt

You're doing it again! Why pick the least suitable tools for the jobs? Quilt is not a patch-based version control system, Pijul and Darcs are.


> Diffs are a "natural" object for version control

Diffs are natural objects for evaluating commits--answering questions like "does this new version of the code make sense?"--but they are not the natural objects for storing commits.


To expand, since we agree.

Diffs are "natural" objects for users of source control.

Diffs are not natural objects for storing anything since that's not the job of the users, it's the job of the version control system. It can use ponies to store the "controlled versions" for all I (and probably other users) care.


I don't know about ponies, but for snapshots, I disagree: snapshots don't model diffs properly, and many operations in Git are essentially trying to simulate patches:

- Merge and rebase both try to "replay" patches. But because Git doesn't work with patches, it actually only simulates that, using heuristic algorithms such as 3-way merge. This works most of the time, but not always (see https://pijul.org/manual/why_pijul.html).

- Conflict resolutions in Git cannot be modeled as patches, and you need `git rerere` to simulate that. This is not anecdotal, since conflicts is the one situation where you need the best and most intuitive tool.

- Short-lived branches are another poor simulation of patches: if Git were really working on patches, you would be able to create a branch "after the fact", meaning that you would work (possibly on multiple features at the same time), and push different parts of your work separately, without having to worry about branches before you start.
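
For a feel of why 3-way merge is a heuristic rather than a model of patches, here is a deliberately naive, line-aligned sketch (nothing like git's real hunk-based algorithm, which also handles insertions and deletions) showing the core rule: keep whichever side diverged from the base, and flag a conflict when both did:

```python
def merge3_lines(base, ours, theirs):
    # Toy line-aligned 3-way merge (no insert/delete handling).
    merged, conflicts = [], []
    for b, o, t in zip(base, ours, theirs):
        if o == t:            # both sides agree
            merged.append(o)
        elif o == b:          # only "theirs" changed this line
            merged.append(t)
        elif t == b:          # only "ours" changed this line
            merged.append(o)
        else:                 # both changed the same line: conflict
            merged.append(f"<<< {o} ||| {b} >>> {t}")
            conflicts.append(b)
    return merged, conflicts

merged, conflicts = merge3_lines(["a", "b", "c"],
                                 ["a", "B", "c"],
                                 ["a", "b", "C"])
print(merged)  # ['a', 'B', 'C'] -- both edits kept, no conflict
```

Note there is no patch anywhere in this picture: the base is rediscovered from the commit graph and the "changes" are recomputed on the fly, which is exactly the simulation described above.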


> Short-lived branches are another poor simulation of patches: if Git were really working on patches, you would be able to create a branch "after the fact", meaning that you would work (possibly on multiple features at the same time), and push different parts of your work separately, without having to worry about branches before you start.

You made me remember something I've been annoyed with version control systems. Which I've found hard to explain. But your comment makes me think about what doesn't come across.

The problem with version control systems is they are stand alone tools that don't capture and record refactoring. What you want is not a patch/diff. You want a record of the refactoring commands, and the result as separate things. And the ability to tag those.

  1 Tag with meta data (commit message etc)
  2 Command: Change this method name from this to that. 
  3 ChangeSet: List of 371 places where the change was made.
Consider if you have a couple of projects that rely on a framework/library. You could apply the commands to each project. And automagically get change sets for each.


> snapshots don't model diffs properly

Of course a single snapshot can't model diffs properly, since a diff is a property of a pair of snapshots.

Your examples of how git doesn't treat things properly look to me like cases where git is picking the wrong pairs of snapshots to generate diffs.


So... we're agreeing? :-)

I was criticizing Git for not using diffs/patches as a regular version control user would expect.


It seems we're agreeing indeed! I understood your meant that, but I also thought you meant that snapshots were an implementation detail to simulate diffs. My point is, I don't think they're good at that.


> It can use ponies

This I would like to see. :-) Particularly the part about defining a protocol for sending messages over the Internet using ponies (there already is one for pigeons...)


And good thing it's leaked, as anyone who's ever maintained, say, Linux kernel drivers for multiple distros knows. Easy to maintain a stack of commits on Linus's kernel and patch them into the various distro kernels. Well, for some value of "easy". But it would be a hell of a lot harder if you didn't have access to diffs and what I think of as "patch arithmetic" (i.e. source - patch3 + patch1 + patch2 == source + patch2 + patch1 - patch3). Patches are commutative (barring conflicts).


> Patches are commutative (barring conflicts).

That isn't quite true, even without conflicts. The same example that shows that Git merge is not associative (https://pijul.org/manual/why_pijul.html) can be used to show that it also isn't commutative.


That's a significant bar, which doesn't apply to Darcs and Pijul.


I'd argue that diffs in the presence of renames aren't natural, and git handles them better than other version control systems.

That is, git is content-based and not name-based. Name-based systems explicitly track renames with metadata; git does not. git calculates renames dynamically based on content.


> I'd argue that diffs in the presence of renames aren't natural

That is true, at least for the text-based diff(1) formats, though obviously nothing precludes extending the format, or even using something else entirely (aside from the risk of not being compatible with diff(1), but then you could always present diff(1) compatible diffs externally and use something richer internally, for instance).

> git handles them better than other version control systems.

But that is not.

> That is, git is content-based and not named-based. Name-based systems explicitly track renames with metadata; git does not. git calculates renames dynamically based on content.

And as a result regularly fucks up tracking renames, or plain refuses to do it without nonsensical workarounds if you copied or moved a file and modified it in the same commit. It also makes browsing histories, blaming, or tracking merges through renames incredibly expensive.


That could be all be true, but IMO git's handling is still better than the alternatives I've used (Perforce, mercurial, subversion, CVS).

(Although honestly I never really have problems with it messing up. If you approximately separate renaming from editing, it seems to work very reliably for me.)

There's also nothing stopping anyone from writing a git history browser that caches the calculation. The point is that it's ephemeral / derived state, not authoritative state.

One problem with the name-based systems is that developers don't actually use the VCS rename operation. They just delete and add, and then you've lost the diff under every system I know of. Another problem is importing from other VCSes -- the metadata is often messed up subtlely.

Git keeps it simple. And they can and have improved the algorithm over time without making your repo data obsolete / incompatible.


Git rename tracking is better than P4 rename tracking with colleagues that don't use Rename. But if people are disciplined and DO use Rename/Move (and Copy), then P4 is much nicer to use. In particular, since many languages require file contents to change depending on directory structure, Git often forces you to commit invalid files just to try to ensure that it will pick up the changes.

With P4, you can move the files around, adjust everything until it compiles, and only then start telling P4 about what you actually did (you can even start from Reconcile Offline Work, which will try to use exactly Git's logic to identify moved/renamed/copied files based on content - but only one time, on your own machine, not every time someone looks at the history).


You could have the commit command do rename-detection and store the result.


> One problem with the name-based systems is that developers don't actually use the VCS rename operation.

That’s a matter of tooling. In languages and environments where using an IDE is the norm, use of VCS operations happens automatically, simply by virtue of performing all source file operations through the IDE.


> I'd argue that diffs in the presence of renames aren't natural, and git handles them better than other version control systems.

Maybe this is because I designed that part in Pijul, but I disagree. Pijul has renames commute with other types of edits, and renames are faithfully modeled in Pijul patches, not just "guessed" after the fact based on contents.


I can't agree, because I don't work on diffs, I work on whatever the current state of the file hierarchy is. When I make changes to a file, it's not enough to know just the few lines before, I need to know what versions of code are operating dozens of lines away and in different files.

Further, diffs only make sense if you have the full version of the files, pre-diff. Without that full file state, diffs could operate in all sorts of incorrect ways.

I think this is one thing that git actually does correctly: save full checkpoints of a directory structure. I want the same thing when I'm dealing with, say, edits of an essay or book. I want the saved full state, and may produce a diff for convenience, but I wouldn't want to save the diff as the fundamental object of interest.


> I can't agree, because I don't work on diffs, I work on whatever the current state of the file hierarchy is. When I make changes to a file,…

This is somewhat contradictory. Of course, everybody works on states. But as you write yourself, you make changes to a file, you don't create a full new version from scratch every time you add a semicolon.

So, if you "make changes", you actually do "work on diffs".

> Further, diffs only make sense if you have the full version of the files, pre-diff. Without that full file state, diffs could operate in all sorts of incorrect ways.

That actually is an argument against Git: because Git doesn't model diffs, yet tries to simulate them, Git merges and rebases do all sort of incorrect things. For example, even the following simple example is not correctly handled in Git:

https://pijul.org/manual/why_pijul.html (see the figure "git merge vs pijul merge").


I do create an entirely new state with every single semicolon, and my text editor agrees in the way that it allows me to undo and redo. As an implementation detail, these are likely stored as diffs, but since the diff is from a known start or final point, it works out OK. When we start talking about diffs that operate on different files, then everything falls apart.

Which is why I disagree heavily with the example pijul merge you linked. Since lines AB are copied directly in one edit, and the other edit adds a line X after AB, why does pijul decide that the correct merge is only adding the X line after the second instance of AB?

Viewing this in terms of diffs instead of states means that everything is ambiguous rather than clear. What does it mean to do a diff on a diff? There are many definitions and few programmers will agree on what the proper algebraic operations are, and in fact the same person will often want different algebraic operations in different settings.

In the example above, if X is a necessary cleanup step after invoking AB (e.g a file close after file open and read), then pijul's interpretation of the ambiguous situation is wrong. But if X is a summary step that's necessary only once after the double invocation of AB, (e.g. sum += read(); n += 1;) and X is, for example, mean = sum / n, then only one X is needed.

Thinking about a "diff" without tying it to a full and complete starting state will always lead to these sorts of foundational problems, and it is quite clear from this pijul example that they haven't really thought about the problem much.


> why does pijul decide that the correct merge is only adding the X line after the second instance of AB?

It doesn't decide, it only guarantees that the order between lines is preserved in all cases. Also, since you seem to have "thought a lot about the problem", you probably noticed that what Git does is totally wrong, since what it does depends on whether you merge both commits at once, or one at a time!

> the same person will often want different algebraic operations in different settings.

I doubt that when Alice adds line at the top of the file, while Bob edits the bottom of the file, anyone would ever want Bob's new lines merged in the middle of Alice's new lines.

Yet, that is what Git does.

> it is quite clear from this pijul example that they haven't really thought about the problem much.

Or maybe they have, and they have thought enough about it to notice that 3-way merge is doing the wrong thing, and that trying to merge snapshot doesn't even make any sense. They might have also thought enough about Git to understand that rebase is turning commits into patches, one by one, in order to "replay them on top of the current version".

While I do agree that full versions are important for storage (and all version control systems, including patch-based ones, are able to recover full versions), they are not very useful to merge work, or to solve conflicts.


Your output, however, should be a patch. That's all I care about when it's time to review and integrate.


You had better care that the patch is coming from the same starting state, or else you could be wasting you time on a patch that makes no sense.


Snapshots and diffs are dual. Most version control systems operate on snapshots, but there is at least one that doesn't: Darcs. Written in Haskell, it bases its operation on an algebra of patches. Unfortunately, while usable (and neat!) for smaller projects, Darcs has serious performance problems (due to inherent algorithmic complexity) and gets very slow with certain operations. Still, one can hope it will inspire a next generation of version control systems.


Darcs is quite usable in my experience, for medium-sized projects, at least. Say ~50MB of source with ~10 years of history. Darcs has inspired the next version of itself (version 3), but it's not clear how that will compare with Pijul.


This is not inherent to the patch model, just to Darcs' algorithm. Pijul (pijul.org) doesn't have these problems, and is indeed quite fast. The main limitation at the moment is merging very large histories, which has the same complexity as a Git rebase at the moment (this will be fixed).


I wonder if you could do the same thing video codecs does. Diffs, with the occasional keyframe.
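
That is roughly what delta-chain storage does, and the idea is easy to sketch. A toy using difflib, with a made-up KEYFRAME_EVERY constant (git's real packfiles sort objects by similarity rather than following history order, and cap chain depth instead of using fixed keyframes):

```python
import difflib

KEYFRAME_EVERY = 3  # hypothetical: full snapshot every 3rd version

class DeltaStore:
    """Store versions as occasional 'keyframes' plus diffs in between."""

    def __init__(self):
        self.entries = []

    def add(self, lines):
        if len(self.entries) % KEYFRAME_EVERY == 0:
            self.entries.append(("full", list(lines)))
            return
        prev = self.get(len(self.entries) - 1)
        ops = []
        for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(
                a=prev, b=lines).get_opcodes():
            # 'equal' runs are copied from the previous version at
            # reconstruction time; only changed lines are stored.
            ops.append((tag, i1, i2,
                        None if tag == "equal" else lines[j1:j2]))
        self.entries.append(("delta", ops))

    def get(self, version):
        kind, data = self.entries[version]
        if kind == "full":
            return list(data)
        prev = self.get(version - 1)  # walk the chain back to a keyframe
        out = []
        for tag, i1, i2, repl in data:
            out.extend(prev[i1:i2] if tag == "equal" else repl)
        return out

store = DeltaStore()
store.add(["a", "b"])
store.add(["a", "b", "c"])
store.add(["a", "x", "c"])
print(store.get(2))  # ['a', 'x', 'c']
```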


I'm not sure I understand what you mean. Snapshots and diffs are different views of the same thing. You need both. It doesn't really matter which one is more natural or primitive in the implementation.

When I git push, I upload my commits (i.e. snapshots). And when I git cherry-pick, I apply a diff. Both views are needed, I don't see why it's a leak.


> Snapshots and diffs are different views of the same thing.

Not completely: think of conflict resolutions for example. In patch-based systems such as Darcs and Pijul, you don't need a "rerere" command to handle that special case.

And I'm sure you'll agree that conflicts is the one critical situation where we need the best possible tool.

> When I git push, I upload my commits (i.e. snapshots). And when I git cherry-pick, I apply a diff

That's not how Git does it. Instead, Git runs 3-way merge, which doesn't do what you expect, see https://pijul.org/manual/why_pijul.html.


Git does use diffs extensively its algorithms, just not in its representation of history. Git's "domain model" is blobs, trees, commits, branches and tags.

At a lower layer than that (the pack files) it does use binary diffs to store similar objects as deltas to save on disk space.

TL;DR diffs play an important part in git's machinery at different levels of abstraction. They just aren't part of how it represents its domain model.
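
You can see the domain model directly: a commit object records a tree ID (the snapshot), parent IDs, and metadata, with no diff anywhere. A throwaway-repo demo (the hashes will of course differ for you):

```shell
cd "$(mktemp -d)"
git init -q
git config user.name demo
git config user.email demo@example.com

echo hello > file.txt
git add file.txt
git commit -qm "first"

# A commit is a pointer to a tree plus metadata; diffs are computed
# on demand by comparing two trees.
git cat-file -p HEAD
# tree <sha1 of the snapshot>
# author demo <demo@example.com> ...
# committer demo <demo@example.com> ...
#
# first
```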


I have never understood cherry-pick for this very reason. This helps, but I'm still confused.

> The git cherry-pick <oid> command creates a new commit with an identical diff to <oid> whose parent is the current commit.

How can two diffs ever be considered equivalent when they include a changed file that had different starting contents? Can they?


If I have a 100-line file and on 'main' it changes near the top, but in my 'topic' branch it changes near the bottom, then I can cherry-pick 'topic' onto 'main' and Git will resolve the diff correctly. The resulting diff or patch would only change in the line numbers for the context of the diff.

This is of course a very simple example. You might hit a conflict in your "git cherry-pick" command which gives you an opportunity to resolve the unexpected diff issue in an appropriate way, which ends up with a different diff than before.


> If I have a 100-line file and on 'main' it changes near the top, but in my 'topic' branch it changes near the bottom, then I can cherry-pick 'topic' onto 'main' and Git will resolve the diff correctly.

That is not true: sometimes Git will take the new lines from "topic" and merge them into the new lines from "main", see https://pijul.org/manual/why_pijul.html

> You might hit a conflict in your "git cherry-pick" command which gives you an opportunity to resolve the unexpected diff issue in an appropriate way, which ends up with a different diff than before.

Sometimes when you cherry-pick, you might not even hit a "true" conflict, but if you forgot to run "rerere", you might simply hit a previously solved conflict again.


I kind of intuitively get it, but that doesn't really seem well defined. I'm always a little bit spooked that `cherry-pick` will cleanly apply when it really shouldn't have. It's not clear to me under which circumstances it automatically resolves.


You're right to be spooked about that, but you're wrong if you think only cherry-pick has this problem. In fact, all git commands can and sometimes will cleanly apply and subtly mess your files (git merge, git pull, git rebase, git apply, git stash apply etc).

The definition of how changes are applied actually has nothing to do with git itself, and everything to do with the diff algorithm you choose (of course, you normally use a built-in one, but I believe you can customize it if you really want).

In general, the default Git diff algorithm, like all text-based diff algorithms, can have problems with structured data, such as removing closed parens or significant white-space. Naturally, it can also be problematic if you have declarations that must be unique in a file, but that can occur in different places. The Java or Go `package` statements are safe, since they must occur at the beginning of a file, so if they are different between the 2 files they are likely to be caught. But if two people have added a top-level function called `foo`, but they added it in different places in the file with different params, it's pretty likely that the diff algorithm will not see any conflict and you'll end up with both definitions in the file.

Cherry pick is in fact one of the places I would normally worry least about this, since it is usually done for limited sized commits. However, when merging a feature branch into master, the potential for errors goes up, and so does the work required to catch such errors during the review.


You don’t even need to think about diff algorithms to see why a cherry-pick, merge, etc may not do what you want.

If in my branch, I rename oldFunc to newFunc in file A, and change file B to replace the call to oldFunc with newFunc; and in your branch, you add a new call to oldFunc in file C... the code will break when we merge our branches. Our changes would both pass tests independently, but would break when we merge them. No file-level diff algorithm would detect a “conflict” here.

Diff algorithms only help with saying “are two branches trying to edit the same lines of code”, but the answer to that question is never enough to tell you whether two changes logically will apply cleanly to one another.


Thank you.

In all my (extensive) commentary in this topic, I feel this is the first response that addresses the root of my confusion in a way I can understand. Sincerely, this is helpful.


Glad to hear that! Rarely have I actually felt that a comment I wrote actually made a difference.


You're right that the commit produced by the cherry-pick operation won't be identical to the commit being cherry-picked. It's the diffs that are identical, not the final result.

It's analogous to how the difference between 5 and 15 is equal to the difference between 105 and 115.


> It's analogous to how the difference between 5 and 15 is equal to the difference between 105 and 115.

Patches are torsors? :o


I don't know what it means, but I recall seeing that word once before. That was in the context of explaining git also, so there's probably something to it.

https://news.ycombinator.com/item?id=25122863


That doesn't help. The result of subtraction is defined to be a number.

The result of a diff is... a nebulously defined concept that somehow describes the changes. It's not clearly (to me) defined which parts of the files are considered to part of the diff and which parts are to be excluded.


> The result of subtraction is defined to be a number.

A diff shows the difference between two commits (i.e. snapshots). Perhaps a better math analogy: one point can be subtracted from another to give a vector.

> It's not clearly (to me) defined which parts of the files are considered to part of the diff and which parts are to be excluded.

You're right that it's not precisely defined in that way, but there's a good reason for this.

If you move the top line of a file to the bottom, should the diff tool show you that one line was deleted and one new line was added (i.e. that one line having been moved), or should it show all the other lines in the file as having been deleted and re-added in a different place? Both are valid interpretations of the difference between the two snapshots, but of course a real-life diff tool will show the former, as that's what's helpful to the user.


I'm glad that real-life diff tools work in ways that they deem to be useful. But they don't always produce the most sensible result. In fact different algorithms produce different diff visualizations for the same input. I'm just a little uncomfortable with a feature that uses diffs for purposes other than display for humans, since they don't seem to be consistently defined.


> I'm just a little uncomfortable with a feature that uses diffs for purposes other than display for humans

Git is based on the idea that it is generally safe to do that. You can't do any kind of work with git other than commits on your own local repo without relying on diffs and merges based on diffs.


> I'm just a little uncomfortable with a feature that uses diffs for purposes other than display for humans, since they don't seem to be consistently defined.

Sure, I see your point as it applies to cherry-picks/rebasing. Yes, there's no formal guarantee it will do as you expect. The user should always check it manually. Occasionally it doesn't do what you want.

I'm of the opinion git cherry-pick should default to behaving like git cherry-pick --no-commit for that reason.

edit:

> they don't always produce the most sensible result

I don't know how much work has been done on language-aware diff/merge algorithms, but I can't see an obvious reason for it to be a dead-end.


I find it easiest to think in terms of diffs as patches. It’s just a bunch of search and update commands. Sure, there’s some hairiness around files moving or whatever, but for me, a cherry pick is just getting a patch file and applying it in the appropriate place. It doesn’t care about the file it’s being applied to, it just needs to find similar content in the file so it can apply itself.


Yeah, git cherry-pick is basically just git diff | git apply. And rebase is just cherry-picking many commits.
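
Modulo conflict handling (real cherry-pick does a 3-way merge so it can cope with moved context), you can do it by hand. A toy demo in a throwaway repo:

```shell
cd "$(mktemp -d)"
git init -q
git config user.name demo
git config user.email demo@example.com
main=$(git symbolic-ref --short HEAD)   # "main" or "master"

printf 'a\nb\nc\n' > f.txt
git add f.txt && git commit -qm "base"

git checkout -qb topic
printf 'a\nb\nc\nd\n' > f.txt
git commit -qam "add d"
oid=$(git rev-parse HEAD)

git checkout -q "$main"
git diff "${oid}^" "$oid" | git apply   # the "diff | apply" step
git commit -qam "manual cherry-pick"
tail -n 1 f.txt   # prints: d
```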


Exactly. And the other day I did exactly that with piping diff through apply so I had more control over conflicts I was hitting. Suspect all the years in svn just make it easier for me to think of it that way rather than some magical operation.


> How can two diffs ever be considered equivalent when they include a changed file that had different starting contents? Can they?

In darcs, patches commute. It works nicely sometimes!


We fixed the typo in the blog title (shapshots). Sorry!


When I need to explain git to someone for the first time I always start with the 'imagine you are copy/pasting your project folder, and you have multiple folders tagged with some details. Each folder is a commit'. People understand very easily how to keep copies of their work at a given time by simply copying the full folder (probably we all did this before using git).

So yes, a commit is just a snapshot, a copy, it is simply saved in a much more efficient way.

The issue is that some advanced (and not so advanced) commands, like a merge, aren't easily explainable with this metaphor, and that's when the 'commits are also diffs', or better 'commits can be seen as diffs' is needed.


> So yes, a commit is just a snapshot, a copy, it is simply saved in a much more efficient way.

Not really. Although it might be helpful to explain to someone entirely new to git that a commit marks a revision, and a revision is kind of a copy of a project folder, that mental model has no bearing on what revision control does or helps you accomplish.

I mean, as soon as you talk about diffs and branches and cherry picks and merges then thinking about copies of a folder spread around does not explain what's happening.

And what's the point of git if not to track and manage diffs?


> that mental model has no bearing to what a revision control does or helps you accomplish.

It's exactly what revision control does and helps you to accomplish. What do you think revision control is for if not that?


Yeah. That's a perfectly valid form of version control. You could tell them to imagine renaming the directory each time with an incrementing number. That's the version number, hence version control.

Merge can be explained too. Say you send the folder to your friend and that night you both make a new version. If you want to consolidate your changes you have to merge them together. I don't see why this requires commits to be seen as diffs.

Rebase is a bit more tricky. Maybe that's what you were thinking of?


In a merge without conflicts (for example you edit one file and your friend edits another) a merge could be explained as simply pasting one folder into the other, but then you need to be careful to overwrite only one of the two files.

Merging folders is not easy if you don't know what changed, but if you do know because your friend told you 'I simply modified this file' then you know how to merge them. But this information is a diff.

In this particular example you need the new folder and the what changed information. That's what I was referring to.

Rebasing...is far more advanced to explain, yes.

Edit: or you can paste your folder into theirs, but still you need to know what you changed, your diff, so it's the same.


Well you just say you compare both of your new versions to the common previous version and work out how to merge them. Seeing the differences between two snapshots is something easily grasped in my experience. If you can see John changed file A and Jane changed file B then it's obvious how to merge them.

When I teach git I try to convince people that such a system of keeping snapshots is a useful thing to do. Then I introduce diffing, which lets you do merges. Then show how git does that automatically.


All this does seem to help make the argument for simple patch-based systems, i.e. Darcs and Pijul, in common with much discussion I see about git. (Their patches aren't just deltas, of course.)


Snapshots and diffs are just a storage implementation detail, no? Can you not calculate one from the other?


> Can you not calculate one from the other?

Yes, you can.

> Snapshots and diffs are just a storage implementation detail, no?

If this is what you think, the headline is truly awful; git commits are stored as diffs where possible.


It’s more complicated than that!

Packfiles use delta compression to store repositories more compactly, but the deltas are not at all related to commits. The objects (blobs, trees, etc.) are sorted for similarity, completely ignoring the graph of commit history and completely ignoring filenames or directory structures, then delta compression is applied. This allows git’s storage to be smaller than version control systems that use a storage structure that preserves history or filenames.


> Packfiles use delta compression to store repositories more compactly, but the deltas are not at all related to commits. The objects (blobs, trees, etc.) are sorted for similarity, completely ignoring the graph of commit history

While I understand this point, I would make two further ones:

- The storage is still diffs, and definitely not snapshots. It's just that the diffs of which a given commit consists do not all refer to the same "base". Contrasting git with a system in which commits are stored as one large diff (whereas git stores them as many small diffs) is something you can do, but it's not the same thing as contrasting git with a system in which commits are stored as snapshots. (Whereas git stores them as diffs.)

- "Sorting for similarity" isn't a thing you can do; similarity does not induce a consistent ordering.

> This allows git’s storage to be smaller than version control systems that use a storage structure that preserves history

This is worded unfortunately; git wouldn't be a version control system at all if its storage structure didn't preserve history. It does, but it doesn't mix that concern with its concern for file contents.
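The separation of concerns described above is easy to see directly. A minimal sketch in a throwaway repo (the `g` helper only supplies a commit identity): a commit object records a tree and parent pointers, never a diff; deltas appear only later, inside packfiles.

```shell
set -e
cd "$(mktemp -d)"
git init -q
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }
echo one > file; g add file; g commit -qm first
echo two > file; g commit -qam second
git cat-file -p HEAD           # tree <sha>, parent <sha>, author... and no diff
git cat-file -t 'HEAD^{tree}'  # "tree": a full snapshot of the directory
```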


Well, snapshots and lists of all the diffs to apply in order (so long as there are no conflicts) are equivalent, but there are important ways in which git is fundamentally based on snapshots rather than diffs. A simple example is that rebasing changes the revision id and snapshot even if the diff doesn't change, or that rerere exists. If snapshots were equivalent to diffs then this wouldn't need to be the case: just do set union in the world of diffs.
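The "same diff, new revision id" point can be demonstrated in a throwaway repo. This is a hypothetical sketch (the `g` helper only supplies a commit identity): cherry-picking an identical change onto a new base keeps the patch-id but yields a different commit id.

```shell
set -e
cd "$(mktemp -d)"
git init -q
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }
echo base > f; g add f; g commit -qm base
base=$(git symbolic-ref --short HEAD)
g checkout -qb topic
echo change > f; g commit -qam change
orig=$(git rev-parse HEAD)
g checkout -q "$base"
echo unrelated > other; g add other; g commit -qm "move the base forward"
g cherry-pick topic                # same textual change, new parent
new=$(git rev-parse HEAD)
test "$orig" != "$new"             # commit ids differ...
id1=$(git show "$orig" | git patch-id | cut -d' ' -f1)
id2=$(git show "$new"  | git patch-id | cut -d' ' -f1)
test "$id1" = "$id2"               # ...but the diffs are identical
```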


Strictly speaking, it's a moot point. Snapshots and diffs are easily convertible to one another, which allows either to be presented, no matter the underlying structure.

We can talk, though, about how the data is actually represented in storage. Here efficiency comes forward, and good compression algorithms will detect a small change in a big dataset. But that's not necessarily what would be presented to the user, though it could be.


Is it normal to need that level of understanding of the inner machinery of a software to use it?


I don't quite agree with the author that you need to understand that commits are implemented as snapshots, but yes, you need to understand quite a bit of how git works behind the scenes in order to use it effectively. This is because git tries to hide more than it should, which leads to confusion when things Go Wrong.


I'd also say that when I've worked through people's confusion surrounding git (and other complex systems), it is usually because they are attempting to model the system as simpler than it actually has to be. They're not thinking of the cases the system must handle, and in the absence of that (essential) complexity, they come up with an incorrect mental model that they believe works, as they're not aware of the counter-examples to it.

This happens often. As another example, a former PM of mine used to assume that if our wearable recorded heart-rate data, it also had sweat data. That is, he assumed the wearable either had a sensor lock, and recorded, or didn't. But sweat data came from a different sensor than HR did, and the two sensors could independently acquire or lose their signal. And so, sometimes one was null, sometimes the other, sometimes both, sometimes neither.


> They're not thinking of cases the system must handle

That's not their fault; git's CLI attempts to abstract away the details and presents a falsely simple view of the world with "checkout", "add", "commit", and "push". This works for a while, until it doesn't, and then you have to face the complexity of the graph all at once. If I were going to redesign its CLI, I'd make the graph as transparent as possible.


1. It is normal, as in, it's common.

2. Personally, I think this just shows how primitive our industry still is.


Imagine using bash without pipes or an understanding of file descriptors.


Given that we're still figuring out new ways to use rocks, maybe?


I don't look at it as either.

I simply look at it as a change in code at a location.

The commit ID of that change in the history chain is calculated based on the commit ID before it, and the one before that is calculated based on the one before that, all the way down to the very first commit. (Conceptually similar to a blockchain.)

So if you rebase or cherry-pick or anything, you will be building on top of a new commit ID, so git will calculate a new commit ID for your new commit even though the code change is the same.

It's pretty simple... if you understand that concept you can get 90% of the way there with git.
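The chaining described above can be sketched in a few lines of Python. This is a simplified model, not git's actual code: a commit id is the SHA-1 of `"commit <size>\0"` plus the object body, and the body embeds the parent's id, so a new parent always means a new id even if nothing else changes.

```python
import hashlib

def commit_id(tree, parent, message,
              author="demo <demo@example.com> 1600000000 +0000"):
    # A git commit object body lists the tree, parent(s), author, committer,
    # and message; the id is SHA-1 over a typed, length-prefixed header + body.
    body = (f"tree {tree}\n"
            f"parent {parent}\n"
            f"author {author}\n"
            f"committer {author}\n"
            f"\n{message}\n").encode()
    return hashlib.sha1(b"commit %d\x00" % len(body) + body).hexdigest()

tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # git's well-known empty tree
a = commit_id(tree, "a" * 40, "same change")
b = commit_id(tree, "b" * 40, "same change")  # only the parent id differs
print(a != b)  # True: new parent, new commit id, identical change
```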


For the 12439th Time, Git Is Simple And Beautiful! Can You See It Now?

Another attempt to explain Git implementation details & lingo, because rewriting Git end-user docs to not refer to those is borderline insurmountable.


A surprising number of people think that git commits are diffs. To fully understand git you have to understand the DAG and that absolutely does not contain diffs! Great article.


shapshots?


I've been thinking about creating a VCS that uses file system snapshots ...


The third sentence is wrong:

> This is most apparent in commands that “rewrite history” such as git cherry-pick or git rebase.

git cherry-pick does not rewrite history.



