In a Git repository, where do your files live? (jvns.ca)
398 points by todsacerdoti on Sept 14, 2023 | 113 comments



I realize the Python script is likely provided for didactic purposes, but you can use `git hash-object` to get the object ID:

https://git-scm.com/docs/git-hash-object

With a bit of sed, the python script in the blog post becomes:

  git hash-object <path> | sed -E 's|(..)(.*)|.git/objects/\1/\2|'
The inverse of `git hash-object` is `git cat-file`:

  git cat-file -p $(git hash-object <path>)
https://git-scm.com/docs/git-cat-file
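
For anyone who wants to see what the one-liner is doing, here is a minimal Python sketch of the same mapping. It assumes a SHA-1 repository and raw file content with no filters applied (`git hash-object <path>` may apply conversions such as line-ending normalization); the function names are my own, not part of git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the ID git assigns to a blob with this content (SHA-1 repos)."""
    # The hash covers a small header, "blob <size>\0", followed by the content.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Map an object ID to its loose-object path, like the sed one-liner."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


if __name__ == "__main__":
    oid = git_blob_id(b"test content\n")
    print(oid)
    print(loose_object_path(oid))
```

The path only exists while the object is stored loose; once it has been packed, `git cat-file` still finds it but the file under `.git/objects/` does not.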

As an aside, when I learn a new subject, I like to go to the index, such as it is, to get an idea of what I don't know (first stage of learning: you don't know what you don't know). Then I can try to prioritize the order of learning about the topic. With git, I'd start with this page:

https://git-scm.com/docs/git


I have a local clone of a git repo with many local branches. I lost an object file (and I'm not certain how). Now git gc always fails, complaining about the object:

  Counting objects: 100% (11785289/11785289), done.
  Delta compression using up to 64 threads
  Compressing objects: 100% (4116944/4116944), done.
  fatal: unable to read 1cae71a9d5b24991c0d632b45186ca8a250e5d52
  fatal: failed to run repack

I've cloned the repo again, and that object does not appear in the new clone, so I assume it must be from a commit to a local branch.

The odd thing is that I think I locally cloned this repo, and saw no complaints about the missing object.

Is there any way to tell what branch and/or file(s) were referred to by the object? And, assuming it's from a stale branch, can I just delete the branch and thereby fix my repo?


Have you run git fsck? It should tell you what kind of an object is missing -- and might say who pointed there, I can't remember. Regardless, that's the next step to figuring out who points to it.

For this:

> I've cloned the repo again, and that object does not appear in the new clone, so I assume it must be from a commit to a local branch.

I hope you realize it can be within a pack, not just a loose object. Try `git cat-file -t 1cae71a9d5b24991c0d632b45186ca8a250e5d52` in the other clones.


You can try to git fsck in case that identifies the missing object. But I’m not sure there’s any ready made command to identify through which path a missing object is reached.

It could actually be the reflog. Pruning the reflog then running “git gc --prune=now” might do the trick.


Thank you. I've tried git fsck in the past, and it complains about the missing object as well.

And I just tried "git gc --prune=now", and sadly it still fails the same way.

I'm afraid I'm going to have to bite the bullet and clean up my 30 or so worktrees and re-clone the repo, and re-create the worktrees.


You might be able to recreate the git history without that object (or commit). Look into commands like filter-branch. I'd be surprised if there's not a way to recover from this situation.


Try a grep for that missing hash in the fatal message you posted and point it at a commit/object that does exist or just delete that line/file and see if that fixes it!


This is more or less what I've been trying to do for quite a while. How do I grep for it? And once I do that, how can I "point it at a commit/object that does exist ?"


If it's a ref somewhere then it'll just be a string of the hash deadbeef style. Just open the file up and change it to another commit hash or just delete the ref completely.

Try something like:

  grep -r "1cae71a9" ./.git/
And see if it comes up somewhere. My guess is that it's a ref for some old branch pointing at a commit that isn't around.


That won't work, as loose objects are stored zlib-compressed in .git, and packed objects can be stored in a delta format. You'd need something like this:

   git cat-file --unordered --batch-check='%(objectname)' --batch-all-objects | xargs -n1 -- git cat-file -p | grep 1cae71a9d5b24991c0d632b45186ca8a250e5d52
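
To make the point concrete, here is a small Python sketch with a made-up commit body that names the hash as its parent. Once the object is compressed the way a loose object is, the hex string is no longer present as plain bytes, which is why a recursive grep over .git finds nothing. (Tree objects are even less greppable: they store child IDs as 20 raw bytes rather than hex.) The commit content and identity below are hypothetical.

```python
import zlib

target = "1cae71a9d5b24991c0d632b45186ca8a250e5d52"

# Hypothetical commit body that names the target as its parent.
commit_body = (
    "tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    f"parent {target}\n"
    "author A U Thor <author@example.com> 0 +0000\n"
    "committer A U Thor <author@example.com> 0 +0000\n"
    "\n"
    "example\n"
).encode()

# A loose object is zlib(header + body), the header being "<type> <size>\0".
loose = zlib.compress(b"commit %d\x00" % len(commit_body) + commit_body)

# What a grep over the file on disk gets to look at:
print(target.encode() in loose)
# What `git cat-file -p` gets to look at after decompression:
print(target.encode() in zlib.decompress(loose))
```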


Thank you! This has been running for several hours now. If this allows me to fix it, I'll be very grateful. I've been delaying re-rooting all those worktrees for 6mo or more.. :)


That's interesting. Did you have a system crash recently? Or maybe a full disk? I think those are the only ways to lose an object like this.

Anyway, if this is your only repo containing those branches, then it's gone. The only thing you can try is to clone a fresh repo from a trusted source and then reconstruct every branch manually, basically squashing commits. And I bet one of those branches will be broken.


I'd first check whether `git fsck` gives you more information.

There isn't a simple and fast way to find parent objects, but here is a slow hack:

    git_find_references() {
        if [ "$#" -ne 1 ] || [ "${1#-}" != "$1" ]; then
            echo "usage: git_find_references <HASH>" >&2; return 1
        fi
        local target="$1"
        git cat-file --batch-check --batch-all-objects --unordered |
            while read -r hash type size; do
                if [ "$type" != "blob" ] && git cat-file -p "$hash" | grep -F -q "$target"; then
                    printf "%s %s\n" "$hash" "$type"
                fi
            done
    }

    # example:
    git_find_references 1cae71a9d5b24991c0d632b45186ca8a250e5d52
This will output the hashes and types of the objects that refer to the target. (The command is similar to the one posted by @Denvercoder9, but is slightly more sophisticated.)

Note that this function will be very slow on large repositories. I can't think of a way to make it faster without writing actual code. (My only tip is that you can search for multiple targets at once by changing the grep command to `grep -E -q "$target"` and then using `git_find_references "HASH1|HASH2|HASH3"`.)

Once you find the referring objects, you can use `git cat-file -p <hash>` on them to see their content, and you can repeatedly invoke this function to walk up a directory tree until you get to the enclosing commit.
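
If you do end up "writing actual code", one option would be to stream raw objects from a single `git cat-file --batch --batch-all-objects` process and parse them in-process instead of spawning `git cat-file -p` per object. The Python sketch below covers only the parsing half, under the assumption that you already have each object's type and raw body; the function names are hypothetical and the wiring to git is left out. It also matches exactly on tree entries and commit headers, so a hash that merely appears in a commit message does not count.

```python
import re


def tree_refs(body: bytes) -> list:
    """IDs referenced by a raw tree body: repeated '<mode> <name>\\0<20 raw bytes>'."""
    refs, pos = [], 0
    while pos < len(body):
        nul = body.index(b"\x00", pos)            # end of "<mode> <name>"
        refs.append(body[nul + 1:nul + 21].hex())  # the ID is stored as 20 raw bytes
        pos = nul + 21
    return refs


def commit_refs(body: bytes) -> list:
    """IDs referenced by a commit's header lines (its tree and its parents)."""
    header = body.split(b"\n\n", 1)[0].decode()
    return re.findall(r"^(?:tree|parent) ([0-9a-f]{40})$", header, re.M)


def refers_to(obj_type: str, body: bytes, target: str) -> bool:
    """True if an object of this type and raw body points at target."""
    if obj_type == "tree":
        return target in tree_refs(body)
    if obj_type == "commit":
        return target in commit_refs(body)
    return False  # blobs reference nothing; annotated tags are left out of this sketch
```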

---

To actually fix the corruption, you have a couple of options:

- If the corruption is in a branch that you don't care about, just delete the branch.

- Find a good copy of the object, e.g. from another repository (you said the object did not appear, but I just want to point out that you should check using `git cat-file -p <hash>` rather than by looking in .git/objects/, because objects can also be stored in packs). Or you might be able to work out what the content was (if the corrupt object is a blob, aka file, you might be able to work out the content based on the previous and next versions of that file; `git hash-object <file>` will tell you if it's exactly right, but won't give you any hints if it's slightly wrong).

- Rewrite the history to remove/replace the corrupt object. You can use a tool such as `git-filter-repo` (docs: https://htmlpreview.github.io/?https://github.com/newren/git..., in particular see --strip-blobs-with-ids). You can also use `git replace` to temporarily replace the corrupt object with a placeholder; it won't fix the commit history, but it might be useful if the corruption causes commands to fail.


Edit: To allow the corrupted object to be pruned, you should also run `git reflog expire --all --stale-fix`.


Thanks for sharing this very interesting link (I actually knew it would be good before clicking simply because of the domain name)!

If this sort of internal view of git interests you, I strongly suggest reading the "DIY Git in Python" series from here: https://www.leshenko.net/p/ugit/


Oh yeah, the Git internals... I tried to find my way through them using visualization but did not progress much: https://github.com/smartmic/git2pic


I encourage everyone to read A Plumber's Guide to Git: https://alexwlchan.net/a-plumbers-guide-to-git/

It's not a book, just a series of five short blog posts. Part 1 explains precisely where files live: https://alexwlchan.net/a-plumbers-guide-to-git/1-the-git-obj...

  $ mkdir animals

  $ cd animals

  $ git init

  $ echo "Big blue basilisks bawl in the basement" > animals.txt
  
  $ git hash-object -w animals.txt
  b13311e04762c322493e8562e6ce145a899ce570
  
  $ find .git/objects -type f
  .git/objects/b1/3311e04762c322493e8562e6ce145a899ce570

  $ rm animals.txt

  $ git cat-file -p b13311e04762c322493e8562e6ce145a899ce570 > animals.txt
Congratulations, you just did a `git restore animals.txt` manually.

Parts 2 through 5 are equally illuminating.
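
The `hash-object -w` and `cat-file -p` steps are small enough to imitate in Python, which helped me see that there is no magic in the file under `.git/objects`. This is only a sketch: it assumes SHA-1 and loose (unpacked) objects, uses a temporary directory as a stand-in for `.git/objects`, and the function names are mine.

```python
import hashlib
import os
import tempfile
import zlib


def write_blob(objects_dir: str, content: bytes) -> str:
    """Store content as a loose blob under objects_dir and return its ID."""
    store = b"blob %d\x00" % len(content) + content
    oid = hashlib.sha1(store).hexdigest()
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(store))  # loose objects are zlib-compressed
    return oid


def read_blob(objects_dir: str, oid: str) -> bytes:
    """Read a loose object back and strip the '<type> <size>\\0' header."""
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    with open(path, "rb") as f:
        store = zlib.decompress(f.read())
    _header, _, content = store.partition(b"\x00")
    return content


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as objects_dir:  # stand-in for .git/objects
        oid = write_blob(objects_dir, b"Big blue basilisks bawl in the basement\n")
        print(oid)
        print(read_blob(objects_dir, oid))
```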


Thank you a thousand times for sharing that. I just ran through the examples and they were very enlightening.



I found this page had very easy to understand visuals

https://www.devopsschool.com/blog/git-tutorial-objects-refer...


It is fun to explore the Git internals! Some time back, I used it to learn Golang [1]. Two birds with one stone!

[1] https://github.com/ssrathi/gogit



Let me add the YouTube series - Git and GitHub for Poets - for the poets, students and teachers. Not to everyone's taste but what is?


You don't need to know this, by the way. You can totally just think of git as simply storing a copy of your entire repo every time you make a commit. This is one part of git that actually is a good abstraction.

Of course it is fun to know this but it won't help you understand git really, and please don't tell people who are just learning git. Just say git makes a copy of the whole project every time you commit.


I feel like git culture is one of the worst about these sorts of things. I've run into people that think you're a loser if you don't use it from the command line, but my commits using a GUI client are much cleaner and I seem to have much less trouble than they do.


I use command line for pull/push/new branch, VSCode for commits and Tortoise for archeology and merges. I just use whatever seems easiest for the task.


I'm the "git guy" in every job I've had. I never use the CLI! I use magit which is probably the very best git UI there is. A GUI is simply the best thing when you are in the process of building state. CLI is great for functional type stuff (ie. I type a command providing both input and output but is completely stateless). Building commits is one such example of working with state. Most CLI users I know just add everything every time.


I’m a non-coder and use git for versioning text files for my fiction writing. I don’t have any sort of deep understanding of git, but I’m pretty comfortable at the command line for commits, pushes, pulls, etc. I must admit I’ve started liking the GitHub gui app lately, though.


I'm certainly skeptical of folks that use Git GUIs. It has been my experience that the folks that I've worked with who only use Git from the GUI only know the absolute minimal subset of the tool to allow them to get code/changes into the CI system. As soon as something happens outside of that comfort zone, things go haywire. Adding insult to injury, many of the Git GUIs that I've seen people work with don't use "standard" terminology, or they combine/conflate multiple actions (I'm looking at you, VS Code). This makes it difficult to help when things go awry, or when someone's having difficulty understanding how to work in the team's established workflow.

Ultimately, I don't care what tools people use if they produce quality work and can integrate with the team without creating additional friction. There's nothing that says that a good GUI can't exist, or even be more productive than the CLI in some cases. I have had enough experience of GUIs causing problems, however, to make me nervous when I hear one is being used by a team member. It's often followed by something like: "I ran into a problem with Git, so I just deleted the repository and cloned it again."

Perhaps it's time for me to start looking at the available GUIs again, so that I can make a recommendation for a solid one when someone doesn't want to use the command line. In the end, though, one still needs to understand Git's behavior, to some degree, to be productive and avoid creating situations which take the team's focus away from our goals. I don't know that any GUI is going to help junior developers with that.


What GUI are you using and how do I make my commits cleaner?


I use magit. You want a UI that allows you to easily see the changes in your worktree and add those line by line to the index, if necessary. Of course, you also want it to be easy to add whole files or hunks too, as always going line by line would be insane.

A good commit means a good version. That's what we're doing after all: version control. Every commit you produce should be a valid standalone version of the software. Commits can build on each other, e.g. you can add feature a then subsequently add feature b that depends on feature a, but a maintainer should always be able to only take feature a, there shouldn't be bits of feature b in there, and there shouldn't be fixes for feature a in feature b's commit.

With practice you can learn to make rough commits first then clean them up into proper commits later. For example, there are "end of day" commits and there are fixup commits. Those are both valid uses of git, but you shouldn't be exposing those to your team. You need to rebase them before sending them. A good git user will sow the seeds for a smooth rebase early.


For those who do prefer the command line (you do you), what you want is

    git add --patch


That's not really a CLI, though, it's a really basic, rudimentary text UI. You might as well put the effort of learning it into learning enough Emacs to use Magit. Can it even do line by line effectively? What about unstaging stuff you added accidentally? It's not a patch on Magit or even any semi-decent GUI.


Heh fair enough. For me, it comes with the advantages of the CLI that I care about (primarily: easily accessible with the keyboard, and launching without delay) without having to adopt a different editor.

I don't need line-by-line usually (but it can technically do it if needed), but undoing accidental adds is indeed a bit of a pain.


The emacs user has feelings about annoying elitist cultures.


Eh? Emacs is the complete opposite of an elitist culture. The entire point of Emacs and free software in general is to destroy the elite (developers) and put the users (me) in control.


Yeah.

I want to understand git. But by that I mean using it. I don't need to know about the internals for that; I need to know common approaches to typical merge problems (where all of a sudden git loses a lot of its elegance).


I found out that GPT-4 is pretty good at helping me find/format git commands and solve git issues. These are usually short responses, repeated in many places on the web, and easy to memorise for the LLM. GPT is really a game changer related to git anxiety.


"I don't commit that often because I don't have that much hard drive space."


I wonder if anyone has built a system with the philosophy of the opposite extreme.

"Oh yeah, space is cheap, so I just let my automated system make 10,000 commits per second. It'll be fine."


I have been thinking about this. Why not record all keystrokes? Then checkpoint where you are whenever you desire. These days AI could describe any arbitrary span of code changes.


>Why not record all keystrokes?

That was Google Wave, it stored everything as an "operational transform". The presentation they used when showing it off to the world was brain damaged: they insisted on showing it off as a ton of separate blocks, instead of one continuous document, which gave people the wrong impression of it, and tanked it.

In a similar way, the impression that Git stores deltas, got stuck in my head somehow at the beginning and made it an opaque mystery for me.

It was only years later that I somehow managed to learn the truth: Git stores full snapshots, and except for compression in extreme cases to save storage, it doesn't do anything with deltas (except fake them to show diffs).


It's not quite "all keystrokes", but git-wip is meant to add a commit every time you save a file: https://github.com/bartman/git-wip I've been using this (via magit-wip) for a while now and it stays out of the way. I've never needed to use it, though. I'm probably too careful now having been trained in a world without git-wip.


Magit stopped using the git-wip script ten years ago.


It's kind of true with git-lfs.


https://git-scm.com/book/en/v2

Git internals chapter. You'll learn more than you ever wanted to know about how git works.

Plus this classic: https://tom.preston-werner.com/2009/05/19/the-git-parable.ht...


Very apropos, I'm trying to write a guide for my team called "git beyond pull/push and checkout".

I'm trying to write something that will demystify git for the developers in this team. So I want to show them the files in .git, and connect them to the concepts they know like branches, etc. I'm always on the lookout for new stuff to include.

The more I can read, the better information I can put in this guide.


I’d definitely read this if you’re able to share it publicly



Heh, "git beyond pull/push and checkout" was also the idea behind my tutorial, though I figured visualising the commit graph would be more helpful than showing the hidden files. And also that a ~10 min guide would be more likely to be used than a very detailed book.

Anyway, in case you're interested: https://agripongit.vincenttunru.com


I'll probably announce it on HN when it's finished :)


That was very informative, but I wondered if there is an easy-to-read reference somewhere about WHY git works the way it does?

I am old enough to have used SCCS, RCS and CVS extensively. Each had their faults, but Git is the only VCS I have used where dealing with merge conflicts is unintuitive enough that I sometimes end up with the repository in an unusable state. I am sure I am doing something wrong, but I would like to understand why.

The VCS that maps closest to the way my brain works is ClearCase. You essentially have a versioned file system, and you can set up a view to present any previous state of that file system. Of course, administration is a nightmare, it is not distributed, it is expensive, yada, yada. But when using it I always felt I knew exactly what was going on under the covers, which is not the case with Git at all.


It's definitely not a super intuitive model, but it is a powerful one.

As I see it, git is a distributed database of snapshots of a directory of files. Every commit is a new snapshot. It's designed the way it is to achieve that goal while minimizing the space used, yet keeping an entire history of the project. It also has tools like git-diff to better compare those snapshots.

But distributed matters most because it was literally designed for managing Linux kernel development, a large open source project with thousands of contributors working concurrently.

Merge conflicts are hard in this model because you're trying to put two snapshots together, not just two diffs.


> It's designed the way it is to achieve that goal while minimizing the space used, yet keeping an entire history of the project

Most of git's design was done before they thought of 'minimizing the space used'. Originally, they didn't even do deltas and just stored complete snapshots for every object.

> Merge conflicts are hard in this model because you're trying to put two snapshots together, not just two diffs.

I don't think that's the case at all. Thanks to three-way-merges git has just as much access to the diffs as any other version control system when merging. It's just that in git diffs are a derived data structure, not a source of truth, but that doesn't make a difference.


One of the ideas behind Pijul is that implicit vs explicit diffs does make an important difference sometimes: https://pijul.org/manual/why_pijul.html


I am very interested in Pijul, but the development seems to be slow. Has it reached the point where it can be used in production? (Serious question)


I get the feeling that depends on who you ask at this point, but there has apparently been some recent development of the hosting platform (https://nest.pijul.com/). I don't know if it can have as good of a Git interoperability story as, say, jj.


I am not interested that much in interoperability with Git. But I am afraid of data corruption, crashes, etc. Also, Pijul used to have severe performance problems in some situations. I would call it ready for production if it is stable and performant.


Git interoperability can help with that: if you have an on-disk data format that's either compatible with git, or can easily be converted to git, you can keep your backups in git format, and work from them with git if something goes wrong.


> ClearCase [...] it is not distributed [...]

That's the key reason Git can't work the same way, and what makes it so powerful, and sometimes hard to grok.

Git is based on a Directed Acyclic Graph (DAG) of committed changes. The DAG is shared by everyone. Each commit is immutable, and only extends the DAG with a new node. You do not alter any of the DAG that other people already possess, you are creating a NEW chain of nodes connected to the DAG.

And that's it, that's how it allows anyone to make changes, at any time, on any number of systems. Because all operations are strictly additive to the DAG. Every time you commit changes, you add another immutable node onto the DAG. And at the very same moment, on another unrelated server, someone else is adding unrelated immutable nodes as well.

Later, you may obtain some of those remote changes, they can be fetched into your local repository. You can fetch them without merging them. You will just have a copy of how someone else extended the DAG. But there will be no connection between the changes you made locally, and the changes the other person made in their repository.

When you merge, you're simply updating the DAG with a new node, it will point at two, previously disconnected chains of commits. Both sides of the merge represent immutable changes that extended the DAG starting from a single commit, the branching point.

And that's what you see in merge conflicts. The first block shows what a section of code looks like in the local branch. And the second block (after the equal signs) shows what that same section of code looks like in the remote branch you're merging.

    <<<<<<< local
    a
    X
    c
    =======
    a
    Z
    c
    >>>>>>> remote
That's it.
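
If it helps, the DAG and the "branching point" can be sketched in a few lines of Python. This is a toy model, not how git stores anything: the single-letter commit names and the `parents` table are made up, and the merge-base search is the naive one (common ancestors that are not ancestors of another common ancestor).

```python
# Hypothetical history: each commit maps to its parent commits.
parents = {
    "A": [],
    "B": ["A"],        # branching point
    "C": ["B"],        # local work
    "D": ["B"],        # remote work, fetched but not yet merged
    "M": ["C", "D"],   # the merge: one new node with two parents
}


def ancestors(commit: str) -> set:
    """Return commit plus everything reachable through parent links."""
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen


def merge_bases(ours: str, theirs: str) -> set:
    """Common ancestors that are not themselves ancestors of another common ancestor."""
    common = ancestors(ours) & ancestors(theirs)
    return {c for c in common
            if not any(c != d and c in ancestors(d) for d in common)}


print(merge_bases("C", "D"))  # the branching point of the two chains
```

Note that adding "M" never modifies "C" or "D"; it only adds a node that points at both, which is the strictly additive property described above.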


I think Git is conceptually simple... each commit is just conceptually a diff from a previous commit -- possibly from two commits. Branches and tags are just pointers to a commit.

A merge is when you join two commits and thus may result in one merge conflict, while a rebase is resetting your branch to some other target commit and then /re-applying/ _each_ of your commits on top, which may generate more than 1 merge conflict (due to each _one_ of your commits effectively being re-committed all over again) but hides the fact that you are applying old changes to a newer base.

But the tooling around Git is not great. I think showing merge conflicts properly so you don't mess them up is also tooling issue more than anything.

I'm going to plug this app called SmartGit which I have no relationship with that I don't think a lot of people have tried but it's awesome. Doesn't obscure anything about Git and shows merge conflicts well IMO. Costs $ though.


> I think Git is conceptually simple... each commit is just conceptually a diff from a previous commit -- possibly from two commits. Branches and tags are just pointers to a commit.

That's backwards. Git commits are complete snapshots of the state of your repository at the time. You can compute diffs between arbitrary commits, be that between child and parent commits, or completely unrelated commits.
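
A toy illustration in Python of "snapshots are stored, diffs are computed": the two dictionaries below are hypothetical snapshots mapping paths to full file contents, and the diff is derived on demand with `difflib`. Comparing content strings here stands in for git comparing object IDs; the names are mine.

```python
import difflib

# Two hypothetical snapshots: path -> full file content. Nothing stored here is a diff.
snapshot_1 = {"a.txt": "a\nb\nc\n", "README": "hello\n"}
snapshot_2 = {"a.txt": "a\nX\nc\n", "README": "hello\n", "new.txt": "n\n"}


def changed_paths(old: dict, new: dict) -> list:
    """Paths whose content differs, was added, or was removed."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))


def diff_snapshots(old: dict, new: dict) -> str:
    """Derive a unified diff on demand from two complete snapshots."""
    out = []
    for path in changed_paths(old, new):
        out.extend(difflib.unified_diff(
            old.get(path, "").splitlines(keepends=True),
            new.get(path, "").splitlines(keepends=True),
            fromfile=f"a/{path}", tofile=f"b/{path}"))
    return "".join(out)


print(diff_snapshots(snapshot_1, snapshot_2))
```

The same function works for any pair of snapshots, whether or not one is the parent of the other.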

Have a look at https://stackoverflow.com/questions/4129049/why-is-a-3-way-m...


Oh I know. I do that all the time. (And in SmartGit, you can select any two commits.)

But I said conceptually and that’s how I see them and it works for me.


Conceptually, though, that's the wrong model. Pijul works like that, but git does not: commits are not diffs, they are full snapshots of the repository, including history. If you cherry-pick a commit from one branch to another, that's an entire new commit, it's not just "this patch moved from one branch to another". If your idea of a commit is "it's a diff to the previous commit", you'll get into trouble.


Yes and no.

When you do a checkout or reset etc. you are indeed operating on snapshots. You're just saying "get me this version right here".

But when you do cherry-pick or rebase etc, git is operating on diffs. A cherry-pick doesn't set your head or workdir to that version[0], it applies the changes to a new commit that sits on top of your current head. Similarly, a rebase walks through the changes and applies each one to the target.

Commits and diffs are a duality and it matters not one bit how git stores them underneath. You can always calculate one from the other.

[0] If you did want to do that for some reason then git reset will do it for you. That sets your current branch to that commit and with --hard it makes your workdir match. `git checkout <sha1> .` makes your workdir match that older commit, but it doesn't do deletes. If you want to make just your workdir match an older commit the only way I know is something like: `git reset --hard <sha1>; git reset --soft HEAD@{1}` which can be read as "set my current branch to be at sha1 and make working directory match that snapshot exactly, then set my branch to be at where it was before that operation, leaving my workdir alone". It's not a very common thing to want to do.


> Branches and tags are just pointers to a commit

Technically that's correct, but it is not how branches are thought of and used. Conceptually a branch is, well, a line of development, a sequence of commits with a name. (Actually it is not necessarily just a sequence, but this is not important here)

I think the implementation of branches in Git as just a kind of fancy tags is one of the sources of confusion about/around Git.

And I love Git, don't get me wrong! But some things in it make things harder than necessary.


A lot of its design decisions are based around the data storage model, and tooling built to operate on those data structures. I recall a good write-up from a decade gone now, but no dice googling for it. The short version is probably:

- Everything is a blob, a text file named after the SHA1 of its content.

- Files are just themselves.

- Directories list <entry sha1> <name> for their entries (file or subdirectory).

- Commits list a Directory (the project root), and some metadata about the commit like the author, commit message, and the parent(s) of the commit.

- A branch then is just an end-user-named reference to a commit's hash.

Everything flows from that - SHAs are reused, if you're doing a diff and two directory entries have the same sha referencing a file, there's no change. Switching a branch is modifying the special HEAD branch content, and recursively walking it to rehydrate the filesystem (and comparing to the previous checkout to optimize, skipping whole directories that don't change).
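
Roughly, in Python, and with the caveat that this is a from-memory sketch rather than the write-up I was looking for: every object is hashed as `<type> <size>\0` plus its body, directories (trees) hold the raw IDs of their entries, and a commit is a short text that names a tree. The identity, timestamp, and function names below are placeholders, and the entry sorting is simplified (fine for plain files).

```python
import hashlib


def object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 over '<type> <size>\\0' + body, the same scheme for every object type."""
    return hashlib.sha1(b"%s %d\x00" % (obj_type.encode(), len(body)) + body).hexdigest()


def tree_body(entries: dict) -> bytes:
    """entries maps name -> (mode, hex ID); each entry is '<mode> <name>\\0<raw ID>'."""
    return b"".join(
        b"%s %s\x00" % (mode.encode(), name.encode()) + bytes.fromhex(oid)
        for name, (mode, oid) in sorted(entries.items()))


def commit_body(tree: str, parents: list, message: str) -> bytes:
    """Commit text: tree, parents, author/committer metadata, blank line, message."""
    who = "A U Thor <author@example.com> 0 +0000"  # placeholder identity and timestamp
    lines = [f"tree {tree}"] + [f"parent {p}" for p in parents]
    lines += [f"author {who}", f"committer {who}", "", message]
    return ("\n".join(lines) + "\n").encode()


blob = object_id("blob", b"test content\n")
# Two files with identical content share one blob ID, so the content is stored once.
root = object_id("tree", tree_body({"a.txt": ("100644", blob),
                                    "copy.txt": ("100644", blob)}))
commit = object_id("commit", commit_body(root, [], "first"))
branches = {"main": commit}  # a branch is only a name pointing at a commit ID
```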



I've used git for years now. I’m comfortable with all of the basic functions. It is my experience that I have had several times where merging/branching has caused a repo to “break”. It doesn’t happen often, but when it does it’s super frustrating, and thus I avoid the merging workflows where possible. I’m sure if I educate myself a bit more about git and pay more careful attention to the details, I wouldn’t occasionally have this problem, but Git really shouldn’t be like this.


I've been using git for I guess 15 years, on a variety of server platforms (github, atlassian, MS devops, plain ssh) and branching models, with teams of various sizes, and I've never had a repo "break". Branching and merging is what git is amazing at.

The worst thing I've had happen is that someone was lazy when merging and just did a "take mine" on 100s of files, thinking that they'd do a "proper" merge later. Of course git doesn't work like that and their lazy merge had to be unwound very carefully.


What does a broken repo look like? What does broken mean here?


> It is my experience that I have had several times where merging/branching has caused a repo to “break”

What do you mean? Like you resolve the conflict correctly and git stops working? Stops working how? That sounds very strange, and anything else just seems like PEBKAC.


I wouldn't go so far as the term PEBKAC.

It's true that git does not 'break' from a merge; but merge conflicts (and rebase conflicts) can still be frustrating to resolve for ordinary users. And after things get too frustrating, they often do really random stuff that then might accidentally break the repo for real, or at least get it into a state where they don't have enough knowledge to recover.

Git's underlying data model is fine, but the user interface can be quite lacking. For example, 'git checkout' is a mess of almost unrelated functionality thrown together under one command.

I think the idea behind 'git switch' is a good one, and git could benefit from a complete overhaul of its user interface. Well, at least if you ignore the switching costs. Old farts like you and me have gotten used to the quirks of the bad old interface, and learned all the barely coherent options to 'git reset'.


Git could be a lot better in a lot of ways, particularly from a developer experience perspective. I’m a little surprised we haven’t seen a meaningful successor.

One example is how git will deceive you and tell you you’re up to date with your remote (e.g. origin/main). What it means is that your local branch is up to date with its local concept of remote, and makes no guarantees about the actual state of the remote. Which is really a nonsense concept that does not need to exist.

Similarly the whole concept of needing to specify “origin” at all is a bit bonkers and does no favors. Why is it that I can pull from a remote branch, commit some changes, run ‘git push’ and git has no idea what branch I want to push to. Another example: if main is a protected branch, don’t let me accidentally commit to it locally. I could keep going with the examples but I won’t.

And yeah you can forgive all this by quibbling that git was written in a time when internet access was not ubiquitous, and of course all these decisions make sense because x, y, or z advanced edge case for advanced users only, and I’m a shitty engineer because all of this complexity secretly makes my life better and I’m just too simpleminded to appreciate it.

Really though, if you rewrote git from a principles first approach (with developer experience being one of those principles), it certainly would not look like how it looks today. There is too much complexity, too many ways to do things, and too many bad decisions around defaults. Treat it like a proper distributed system, perhaps even backed by a real database. It’s not special because the data is code. The fact that it’s treated as such is the reason it feels so weird.


You're confused because you're treating the repo server as special.

Git doesn't do special. All branches, on all machines, are just branches. "main" on your device and "origin/main" (main on the machine you called "origin") are two different branches. They don't need to share a name; you can just as easily set your local "main" to have "origin/Release" as its upstream. The name "main" isn't special. You can name your branches anything you desire. If you want to lock yourself out from controlling your own code, that's a customization for you to make. To git, there's no difference between committing to "main", "feature111", or anything else. "origin" isn't special, either: you can have many different repo servers hosting different branches. Or maybe you don't have a repo server at all, and you just have all your fellow devs' machines and you coordinate via email.

While being perhaps a bit un-opinionated, it also makes Git conceptually extremely consistent, and thus simple.
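
To make the "names aren't special" point concrete, here is a minimal sketch, assuming a throwaway setup in a temp directory. The paths, the demo identity and the commit message are all made up for illustration; the only point is that a local "main" can have "origin/Release" as its upstream.

```shell
# Throwaway demo: every path, name and message below is made up.
demo=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# A stand-in "server" repo whose only branch is called Release.
git init -q "$demo/server"
git -C "$demo/server" symbolic-ref HEAD refs/heads/Release
git -C "$demo/server" commit -q --allow-empty -m "first release commit"

# Clone it; "origin" is just the name clone gives that remote.
git clone -q "$demo/server" "$demo/work"

# Create a local branch named main, then point its upstream at origin/Release.
git -C "$demo/work" branch --no-track main origin/Release
git -C "$demo/work" branch --set-upstream-to=origin/Release main

# Ask git what main's upstream is.
upstream=$(git -C "$demo/work" rev-parse --abbrev-ref 'main@{upstream}')
echo "main tracks: $upstream"
```

With that in place, a plain `git pull` on main merges from origin/Release. A plain `git push` is a different story: under the default `push.default=simple`, git refuses when the upstream's name differs from the local branch's name, so you would have to spell out the destination.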


I certainly understand the abstraction and how to use git in this sense. The foundation of my above opinion is that I don’t think this is a great idea that’s well applied to modern development.

Is it elegant that the same version control system can be used whether you have a remote server or you just coordinate over email? It certainly is! Is it necessary to have this level of generality and lack of opinions when 95% of users just want to do the same basic flow (and it does involve a server of some kind)? I don’t really think so, which is why I think there is so much confusion around really basic git functionality.

We live in a world where git is philosophically more like Perl than Python, and I don’t think it’s unreasonable to suggest that flipping that might actually be a good thing.


Then write your own wrapper on top of Git.


Lazy Git ftw


> Why is it that I can pull from a remote branch, commit some changes, run ‘git push’ and git has no idea what branch I want to push to

Git has automatically tracked remote branches for years now. "git switch foo" will do exactly what you expect if "origin/foo" exists.
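
A rough sketch of what that looks like, assuming a throwaway "server" repo in a temp directory with a branch called foo (the paths, identity and commit messages are invented for the demo, and `git switch` needs a reasonably recent git):

```shell
# Throwaway demo: every path, name and message below is made up.
demo=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in server with a default branch plus a branch named foo.
git init -q "$demo/server"
git -C "$demo/server" commit -q --allow-empty -m "initial commit"
git -C "$demo/server" branch foo

# Fresh clone: there is no local foo yet, only origin/foo.
git clone -q "$demo/server" "$demo/work"

# switch creates local foo from origin/foo and sets it as the upstream.
git -C "$demo/work" switch -q foo
tracking=$(git -C "$demo/work" rev-parse --abbrev-ref 'foo@{upstream}')
echo "foo tracks: $tracking"

# Commit, then push with no arguments: git already knows the destination.
git -C "$demo/work" commit -q --allow-empty -m "work on foo"
git -C "$demo/work" push -q

server_tip=$(git -C "$demo/server" rev-parse foo)
work_tip=$(git -C "$demo/work" rev-parse foo)
```

After the push, the server's foo and the local foo point at the same commit, without a remote or branch name ever being typed.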

> if main is a protected branch, don’t let me accidentally commit to it locally

"Protected branches" don't exist in Git. They're a GitHub concept. How would you be able to fork a repository if the permissions on a remote copy of the repository prevent you from making changes to your own copy?


I can see the case for a local UI option (perhaps even on by default) that warns you when you make changes to a local branch that's tracking a remote branch that's marked as 'protected' there. The option could also have a setting where it outright stops you from making that change.

Of course, you could always opt out of that setting.

Git doesn't know what 'protected' means, but you could teach it with a relatively small change to its code. Similarly, it might be useful to teach git about 'volatile' branches (in the same sense as C's volatile variables); a volatile branch is one where git would always check for upstream changes first before any operation. Eg origin/main would typically be marked as both protected and volatile.

Again, volatile would be something you can override locally, but it might be useful as a default.


The logic you're suggesting can easily be implemented with a Git hook:

https://stackoverflow.com/questions/40462111/prevent-commits...

Alternately, you can simply not check out that branch locally at all, and then you'll never have to worry about accidentally committing to it.
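
For the hook route, a minimal sketch might look like the following. It assumes a throwaway repo in a temp directory and hard-codes "main" as the branch to guard; the hook body is my own illustration, not the one from the linked answer.

```shell
# Throwaway demo: every path, name and message below is made up.
demo=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q "$demo/repo"
git -C "$demo/repo" symbolic-ref HEAD refs/heads/main

# Install a pre-commit hook that rejects commits while main is checked out.
mkdir -p "$demo/repo/.git/hooks"
cat > "$demo/repo/.git/hooks/pre-commit" <<'EOF'
#!/bin/sh
branch=$(git symbolic-ref --short -q HEAD)
if [ "$branch" = "main" ]; then
  echo "pre-commit: direct commits to main are blocked in this clone" >&2
  exit 1
fi
EOF
chmod +x "$demo/repo/.git/hooks/pre-commit"

# On main the hook exits non-zero, so the commit is refused...
if git -C "$demo/repo" commit -q --allow-empty -m "oops"; then
  on_main=allowed
else
  on_main=blocked
fi

# ...while on any other branch it goes through.
git -C "$demo/repo" checkout -q -b feature111
if git -C "$demo/repo" commit -q --allow-empty -m "real work"; then
  on_feature=allowed
else
  on_feature=blocked
fi
echo "main: $on_main, feature111: $on_feature"
```

Hooks live in each clone and are not copied by `git clone`, so this is strictly a local opt-in, and `git commit --no-verify` skips it when you really mean it.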


Also, many repo hosts have branch protection tools, and VS Code has a setting git.branchProtection to list branches which you want VS Code to remind you not to commit to, which can be handy.


Git separates its plumbing from its porcelain very well. That makes all the different front ends possible. Maybe there's one that will help you to work with git?

I differ from your opinion that the user experience is poor. I think it's great, and I prefer using the command line interface.


As an example of a different frontend, have you tried "Jujutsu" [1]? Every time I end up seeing a comment about git vs Hg usability I am reminded of it. It uses git in the backend but its workflow/frontend seems streamlined.

  Jujutsu is a Git-compatible DVCS. It combines features from Git (data model, speed), Mercurial (anonymous branching, simple CLI free from "the index", revsets, powerful history-rewriting), and Pijul/Darcs (first-class conflicts), with features not found in most of them (working-copy-as-a-commit, undo functionality, automatic rebase, safe replication via rsync, Dropbox, or distributed file system).
[1] https://github.com/martinvonz/jj


This is pretty interesting. Maybe a missed opportunity to call it jugitsu


> Similarly the whole concept of needing to specify “origin” at all is a bit bonkers and does no favors. Why is it that I can pull from a remote branch, commit some changes, run ‘git push’ and git has no idea what branch I want to push to.

Git actually works like that. (Though you might have to set push.autoSetupRemote to true in the config?)

However, when you have multiple remotes, it's only natural that you will sometimes need to specify which one you want.

> And yeah you can forgive all this by quibbling that git was written in a time when internet access was not ubiquitous, and of course all these decisions make sense because x, y, or z advanced edge case for advanced users only, and I’m a shitty engineer because all of this complexity secretly makes my life better and I’m just too simpleminded to appreciate it.

Git was explicitly written for Linux kernel development.

You are right that almost all other projects are simpler.

> Really though, if you rewrote git from a principles first approach (with developer experience being one of those principles), it certainly would not look like how it looks today.

I agree that git's UI is lacking. You don't need a from-scratch rewrite of the whole system. You can just rewrite the UI only and think carefully about the defaults in the config.

Eg 'git checkout' is an incoherent mess of barely related features. 'git switch' is a later addition, and goes in the right direction.

> Treat it like a proper distributed system, perhaps even backed by a real database.

Git is very much backed by a real database. That's actually one of the stronger points of its design. You can see that the guy who started git, Linus Torvalds, has a lot of experience writing filesystems (which are also a kind of database).

Git is also very much a distributed system with no trust between the nodes necessary, including no central trusted authority. How would you 'treat' it even more 'like a proper distributed system'?

> It’s not special because the data is code. The fact that it’s treated as such is the reason it feels so weird.

Where do you get that impression from?

The main influence of being designed for code first comes in the form of the default merge driver being line oriented. But you can plug in your own merge drivers for your own data formats.

But most everything else works the same, whether you stick C source code or funny cat pictures into your repository.
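
As a rough illustration of plugging in a merge driver, here is a sketch that assumes a throwaway repo, a hypothetical driver name `keep-ours` and a hypothetical file `data.txt`. The `merge.<name>.driver` setting and the `merge=` attribute in `.gitattributes` are the real hooks; using the command `true` as the driver is just the simplest possible stand-in, since it reports success and leaves the current version untouched.

```shell
# Throwaway demo: every path, name and message below is made up.
demo=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

r="$demo/repo"
git init -q "$r"
git -C "$r" symbolic-ref HEAD refs/heads/main

# Register a driver called keep-ours and route data.txt through it.
git -C "$r" config merge.keep-ours.driver true
echo 'data.txt merge=keep-ours' > "$r/.gitattributes"

echo 'base' > "$r/data.txt"
git -C "$r" add .
git -C "$r" commit -q -m "base"

# Change the same line on two branches, which would normally conflict.
git -C "$r" checkout -q -b other
echo 'theirs' > "$r/data.txt"
git -C "$r" commit -q -am "change on other"

git -C "$r" checkout -q main
echo 'ours' > "$r/data.txt"
git -C "$r" commit -q -am "change on main"

# The custom driver resolves the file instead of the line-oriented default.
git -C "$r" merge -q -m "merge other" other
merged=$(cat "$r/data.txt")
echo "merged content: $merged"
```

A real driver would be a program that receives the ancestor, current and other versions (`%O`, `%A`, `%B`), overwrites the current one with the merged result, and exits non-zero when it cannot merge cleanly.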


Yes, Git was not designed for your workflow, which has a single remote and a special "main" branch. It was designed for a very different workflow used in Linux development.

Git just happened to be flexible enough to be a reasonably good fit for other workflows, including the GitHub-centered workflow, which is relatively common. Mostly because of easy branching/merging and great support for working off-line, I guess.

However, it does not have built-in support for such workflows; their special requirements need to be satisfied by tools and/or practices external to Git.

What you are saying is that Git is not a perfect fit for your workflow, and that this workflow is fairly common, so why isn't there something better for that need? If you stopped there, your post would be perfectly reasonable from my point of view. But you called your workflow "developer experience", as if that's the only one that exists. It's kind of offensive to people with different workflows, and therefore different experiences.

Anyway, I don't know the answer to your question, but my guess is that Git + external tools/practices is "good enough" for your workflow, so nobody created a completely separate system.


How would that work when someone is working on a feature/bug and has to pause, switch to another bug, then return to the first feature/bug?

Sometimes you want to keep 2 different copies of Main with different names and state


Worktrees perhaps?

https://git-scm.com/docs/git-worktree

Can't recall the source I learned from anymore, but this looks pretty good: https://www.gitkraken.com/learn/git/git-worktree

I have two projects where I've got a directory that I do most of my work out of and another directory (worktree) focused on syncing from remote/assisting others on the team.

If you don't need the extras worktree gives you, clone another copy?
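
In case it helps, a minimal sketch of the two-directory setup, assuming a throwaway repo in a temp directory (the directory names and the branch name `bugfix` are made up):

```shell
# Throwaway demo: every path, name and message below is made up.
demo=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

r="$demo/project"
git init -q "$r"
git -C "$r" commit -q --allow-empty -m "initial commit"

# Add a second working directory on its own new branch.
# It shares the object store and refs with the first one.
git -C "$r" worktree add -b bugfix "$demo/project-bugfix"

# Each directory has its own HEAD, index and checked-out files.
side_branch=$(git -C "$demo/project-bugfix" rev-parse --abbrev-ref HEAD)
echo "second worktree is on: $side_branch"

# List every working directory attached to this repository.
git -C "$r" worktree list
```

The half-finished feature stays untouched in the first directory while the bug gets fixed in the second, and one `git fetch` in either place updates both.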


I use worktrees instead of multiple clones mainly so I only have to worry about remembering to 'fetch' once.


> Sometimes you want to keep 2 different copies of Main with different names and state

Wouldn't that just be two different branches? What does it have to do with 'main'?


> git has no idea what branch I want to push to

Maybe I misunderstood what they mean, I thought they wanted local branches to be mirrors of origin branches.


Git is much more data focused than all the other VCS. Sure, it exposes details that complicate its usage for novice users, but it makes everything more maintainable and extensible in the long run.


Blame the Linus Torvalds personality cult. Only way that terrible UX got traction in the first place. If anyone OTHER than Linus had written it, it would have been mocked, and deservedly so.


That's a pretty big claim. Git solved real problems at the time in a novel and extremely useful way. Coming from SVN (or god forbid VSS), the power and flexibility of git more than made up for its difficulty to grasp.

Why it "won" over alternatives with cleaner UX (eg Hg) is a different question, and I think has a lot to do with github.


Mercurial came out essentially simultaneously, solved the same problems, with a much more coherent UX.


Early on, I was definitely a proponent for hg over git. I assumed it would at the very least remain a peer, given how much more sane the UX is. I was wrong though, and I don't think it's just because of github.

Named branches being baked in turned out to actually be a real pain in mercurial. Sure, you can use bookmarks (and we did) but it quickly starts to feel like you're swimming against the current.


Git may solve real problems but honestly, it's a PITA to use. Even basic tasks need quite a large time sink to understand what's going on and what you need to do.

It gets in the way of the workflow IMO.


It's bigger than him. GitHub built their SaaS on top of it. Had they chosen subversion or fossil, history would be very different.

People use git because everybody else uses git and it started because GitHub provides easy code sharing and a better UI than the alternative — SourceForge and mailing lists.


People could have kept using SVN or whatever else. They jumped ship in droves, way before GitHub.

At the time it was the only distributed and open source VCS. Fossil may have come quickly on its heels, but it was already too late.


Mercurial (hg) was released mere weeks after git saw the light of day, so it's not that.


And others were released before git:

> The first open-source DVCS systems included Arch, Monotone, and Darcs. However, open source DVCSs were never very popular until the release of Git and Mercurial.

https://en.wikipedia.org/wiki/Distributed_version_control#Hi...


Yeah, I originally had some Linus snark in there but deleted it. My above comment is being rapidly downvoted though, which is in some sense validating.


> And yeah you can forgive all this by quibbling that git was written in a time when internet access was not ubiquitous

I don't think that's right. Git's remote concepts were built heavily on even more ubiquitous internet access than you are assuming, to some extent. Git was built where some upstreams were patches on email mailing lists. Git was built in environments where every contributor could relatively easily stand up a small server of their own (even as just a temporary server on a personal device with specified time windows) and you might have branch remotes tied to different colleagues' servers in a distributed fashion, the D in DVCS.

At the time those weren't advanced features for advanced users, those were simple features for flexible source control. There's a sort of simplicity to pull requests in email flows. There's a sort of simplicity in "hey, can you check out my branch and make notes on it, I'll serve it on my lab machine for a couple of hours so you can grab it, here's the URL." In some of those cases you don't even care to remember that remote URL after you've grabbed the branch because it will be a different IP address and port the next time they bring up that lab machine. (Git was built in a world where there was no "origin" and multiple repos were valid representations of progress, some of them transient and as-needed, and "origin" was a name and concept that came later.)

Some of that only exists in a world that assumes internet connectivity is ubiquitous, not just access, but service hosting and upload capabilities. The internet has some strange centralizing forces making access easier but anything other than raw consumption harder.

There are certainly a lot of good reasons for some of that centralization. Whether or not it is "simpler", there's a convenience in everyone sharing big centralized hosts. There's a lot of convenience in "there is mostly only one remote that everyone shares and it has a high uptime SLA and a ton of extra collaboration features in one place". There were certainly a lot of centralized version control systems before the DVCS was invented, and beyond convenience there was also a lot of familiarity that such centralizing operations benefited from.

It's interesting to me that in your last paragraph you think the solution is to make git a more "proper" distributed system, but one of the features you find too complex and don't like exists precisely because git was defined and built as a distributed system, and it just isn't as convenient when working with centralized providers. Git repos support multiple remotes because git was built to be distributed; they require you to fetch remotes explicitly because git was built to be fault-tolerant in a distributed system, where remotes may have very different SLAs from each other and losing access to one remote host shouldn't stop you from fetching updates from a different one. The DX there was built for a distributed system. It is mostly where we see everything revolving around some super special "origin" remote that the DX feels overly-complicated and maybe missing better "defaults". It is mostly on the internet where running a simple CLI command to spin up a quick code server on a random port on a random machine with an accessible IP address is increasingly hard that it also becomes harder to imagine why people ever needed remotes beyond that special sauce "origin" idea.


For me git reflog was always a lifesaver when a complicated rebase or merge fails. You can always check out the previous SHA and then just return your HEAD to its pre-screw-up state. In the case of a remote repo that already ended up "broken" you can still force push to those branches, but others will have to discard theirs. Otherwise you never force push to anything shared.
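
A minimal sketch of that kind of rescue, assuming a throwaway repo in a temp directory and using a hard reset as a stand-in for the botched rebase (paths, identity and commit messages are made up):

```shell
# Throwaway demo: every path, name and message below is made up.
demo=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

r="$demo/repo"
git init -q "$r"
git -C "$r" commit -q --allow-empty -m "one"
git -C "$r" commit -q --allow-empty -m "two"
good=$(git -C "$r" rev-parse HEAD)

# Simulate the screw-up: the latest commit vanishes from the branch.
git -C "$r" reset -q --hard HEAD~1

# The reflog still records every position HEAD has been at.
git -C "$r" reflog -3

# HEAD@{1} is where HEAD pointed just before the reset; go back there.
git -C "$r" reset -q --hard 'HEAD@{1}'
recovered=$(git -C "$r" rev-parse HEAD)
echo "recovered: $recovered"
```

After a real rebase there will be more entries to scan, so in practice you read the `git reflog` output and pick the entry from just before the operation started rather than assuming it is `HEAD@{1}`.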


+1 for ClearCase. I enjoyed the time I used ClearCase, and never found another one that was as pleasant to use as an end user.


Ugh. I've both used ClearCase and been an admin of it. While as a user it was pretty fine to use, as an admin I saw how complicated and fragile the whole thing was. Their distributed model is absolutely terrible. You need a powerful box as a VOB server and probably an equally or even more powerful box if you want to use dynamic views. We ended up using snapshot views.

I would choose Git every single time over ClearCase.


While I've used Git a bit, and appreciate it for what it is, I will admit that its syntax is what makes it a bit difficult. If the Git commands had kept to a single metaphor, then I think that would have helped tremendously. Even now, I'm still trying to figure out the best metaphor for using Git and its commands. So far the closest I have is a tree metaphor as follows:

Git init = tree planted

Git add and Git merge = trunk grows

Git branch = branch grows

Git merge = branch intersects

And so on. But even this falls apart once I get to Git diff and Git checkout.


If you haven't watched the brilliant Git For Ages 4 and Up I can't recommend it more highly. I make everyone on my teams, and my students, watch it. https://m.youtube.com/watch?v=3m7BgIvC-uQ Git is quite simple under the hood; it's the convoluted user interface that makes it seem so difficult. Once you understand technically what the commands are doing, it's a lot easier to know which one to reach for.


Git has the biggest moat of all software, and it is stifling innovation.

On one side it's good because everyone is using the same tool and we don't have to learn many different intricacies of other tools.

But on the other side, there's probably a better way of doing things that could dramatically simplify everything.

It was designed so long ago with such different computing and memory constraints, and the way we use repos today is very different - think monorepos.

There are so many projects trying to band-aid over its scaling issues, and so many projects are using monorepos today which are not well supported.

I genuinely believe re-writing a VCS is actually a very simple task.

If you look at the core use cases, or the commands people use, they are very few.


Discussions on improving git tend to be dominated by people who not only don't grok git but don't actually grok the problems inherent with distributed code development and want it to be little more than dropbox for github. Once you filter out these people, a good chunk of the remainders are people asking for UI changes that already exist but they didn't bother to learn.

Catering to these people means attempts at this problem usually punish power users by removing useful things like staging or a distinction between commit and push.


Even outside of the distributed aspect, discussions on improving git tend to be started by people who have never felt the need to express `HEAD~10` or `origin/main...main`, and... I just don't have much sympathy left in me for the naivety.

It would be neat to have a GUI rebase -i style "commit reorganizer" that let you visually reorder & drag commits between branches etc. But that'd just be an app calling git, and once you're experienced enough to write it, you no longer really need it. So nobody does.


> Discussions on improving git

I'm advocating for a replacement...and why it's so difficult.

> distributed code development

Code development is no longer distributed. It's centralized in GitHub/GitLab for about 99% of cases. And everyone is always connected in real-time to a central server...and also to all other contributors.

These possibilities are so far from the reality of when Git was created.

> staging or a distinction between commit and push

This is trivial to handle.



