> Git repositories sometimes have submodules. I don’t understand anything about submodules so right now I’m just ignoring them.
Submodules are interesting, because they’re next to unusable from a user perspective (they’re a pain to maintain and interact with unless you never ever update them) but they’re ridiculously simple technically which I assume is what made them attractive.
A submodule is an entry in “.gitmodules” mapping a path to a repository URL (and branch), then at the specified path in the repository is a tree entry of mode 160000 (S_IFDIR + S_IFLNK), whose oid is the commit to check out (in the submodule-linked repository).
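The mode arithmetic is easy to check. A tiny Python sketch using only the standard `stat` module; the name `GITLINK_MODE` is my own label for illustration (Git's source calls this kind of entry a "gitlink"), and it assumes the usual POSIX values for the constants:

```python
import stat

# A submodule is stored as a tree entry whose mode is octal 160000,
# which is the directory bit pattern combined with the symlink one.
GITLINK_MODE = stat.S_IFDIR | stat.S_IFLNK

# Print the combined mode in octal so it can be compared with 160000.
print(oct(GITLINK_MODE))
```

No regular file, directory, or symlink has that type, so Git can use it to mean "this entry is a commit somewhere else".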
To add to this, submodules are a hack on Git's data model.
Git's data model, put simply, is this:
* Branches are pointers to Commit objects
* Commit objects are a composite of {Commit_Comment, Tree, Parent Commit(s)}, referenced by the hash of that set
* a Tree (like a directory) is a list of Blobs and/or Trees (associating filenames with them) referenced by the hash of its contents.
* a Blob is a file, referenced by the hash of its contents.
So the object types in Git are: Commits, Trees, and Blobs (plus annotated Tag objects, which don't matter here).
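To make "referenced by the hash of its contents" concrete, here is a minimal Python sketch of how an object id is derived. It assumes a repository using the default SHA-1 object format; the helper name `git_object_id` is mine, not part of Git.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Hash '<type> <size>\\0<body>' the way Git does for loose objects (SHA-1 repos)."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


# A Blob holding the five... well, four bytes "foo\n", like `echo foo > file1`.
blob_id = git_object_id("blob", b"foo\n")
print(blob_id)
```

The same function works for Trees and Commits; only the type word and the body layout change. That is why two files with identical contents always share one Blob.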
Note that I said that a Tree can contain other Trees or Blobs... but what if... you put a Commit in it!?
That's a submodule!
Now if the Commit you reference doesn't exist in the current repo, Git can't do anything with it. That's where the .gitmodules file comes in, to associate a given path with a repo, so that Git can look up the Commit object in that repo.
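As a rough sketch of the above (assuming SHA-1 object ids; the helper names and the example URL are hypothetical), this builds a Tree body by hand with one ordinary file and one entry that points at a Commit, then looks up where that Commit is supposed to come from:

```python
import hashlib


def object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 of '<type> <size>\\0<body>', as used for Git object ids."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_entry(mode: str, name: str, oid_hex: str) -> bytes:
    """One Tree entry: '<mode> <name>\\0' followed by the raw 20-byte object id."""
    return f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(oid_hex)


blob_oid = object_id("blob", b"foo\n")
# Commit id taken from the example elsewhere in this thread; any 40-hex id works here.
commit_oid = "2d49d729fe39d1def8ce537d7efeeabbf3efb4f2"

# Entries are stored sorted by name; "file1" sorts before "submodule".
entries = [
    tree_entry("100644", "file1", blob_oid),        # regular file -> Blob
    tree_entry("160000", "submodule", commit_oid),  # the "Commit in a Tree"
]
tree_oid = object_id("tree", b"".join(entries))

# Stand-in for what .gitmodules provides: path -> repository URL (hypothetical URL).
gitmodules = {"submodule": "https://example.com/upstream.git"}
print(tree_oid, "-> fetch", commit_oid[:8], "from", gitmodules["submodule"])
```

Nothing in the Tree itself says where the Commit lives, which is exactly the gap `.gitmodules` fills.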
Nit: the data model has refs, which are pointers to objects.
Branches are the subset of refs in the special-cased heads/ namespace which should be pointing to commit objects.
And there are also tags, which are the subset of refs in the special-cased tags/ namespace, which should be pointing to commit (“lightweight”) or tag (“annotated”) objects.
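Put as code, the distinction is just a prefix check on the full ref name. A minimal sketch (the function name `classify_ref` is made up for illustration):

```python
def classify_ref(name: str) -> str:
    """Classify a full ref name by the namespace it lives in."""
    if name.startswith("refs/heads/"):
        return "branch"
    if name.startswith("refs/tags/"):
        return "tag"
    # Anything else is still a perfectly valid ref, just not special-cased here.
    return "other"


for ref in ("refs/heads/main", "refs/tags/v1.0", "refs/notes/commits"):
    print(ref, "->", classify_ref(ref))
```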
Going further, the little-known git-notes [1] feature also uses its own reference namespace, `refs/notes/`
Going even further, Gerrit [2] leverages the wide-open reference namespace/directories to create its own. For example, pushing under the `refs/for/` namespace creates a new review, and specific reviews can be looked up under the `refs/changes/` namespace.
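For the curious, a small sketch of how those ref names are put together. The function names are mine; the `refs/changes/` layout (last two digits of the change number, then the change number, then the patch set) is how I understand Gerrit's documentation, so treat it as an assumption and check against your server.

```python
def review_push_ref(branch: str) -> str:
    """Ref to push to in order to open a review targeting `branch`."""
    return f"refs/for/{branch}"


def change_ref(change_number: int, patch_set: int) -> str:
    """Ref under which a given patch set of a change can be fetched."""
    shard = f"{change_number % 100:02d}"  # last two digits, zero-padded
    return f"refs/changes/{shard}/{change_number}/{patch_set}"


print(review_push_ref("main"))
print(change_ref(1234, 1))
```

Because these are ordinary refs, a plain `git fetch` of such a ref name is enough to get a review locally.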
Even even further, Gerrit's special repos All-Users.git and All-Projects.git are "databases" for project configuration and user configuration, where for example external IDs (like usernames) are stored under the special `refs/meta/external-ids` ref/branch. This has the notable benefit that all configuration changes are tracked and auditable.
I believe git-appraise [3] also leverages special reference namespaces in Git for performing its review duties (but I don't know the details). Edit: actually no, it "just" leverages git-notes.
So, one could theoretically "embed" an older commit into their repository as a pointer (submodule folder)? And because Git knows what that commit ID means it will show it fine?
Unfortunately no, because of how it's a hack, and so extra logic had to be tacked on to support it (logic which the `git submodule` command implements).
I tried it out, and the local commit object just appears as an empty folder.
To try for yourself, do this:
git init recursive-submodule && cd recursive-submodule
echo "foo" > file1 && git add file1 && git commit -m "Add file1"
# this creates an initial commit; mine had hash 2d49d729, yours will differ because the hash covers author and timestamp. Then:
git update-index --add --cacheinfo 160000 "$(git rev-parse HEAD)" submodule && git commit -m "Add submodule"
The `update-index` command is the plumbing command for registering an arbitrary entry in the index, which I used to add the previous commit object under mode 160000. Since we only updated the index and not the working tree, git will note that the submodule is missing. You can then run `git restore .` to set the working tree to the state the history says (i.e. with the missing submodule)... but that just creates an empty directory.
Vanilla git won't try much further. To actually populate the submodule requires running `git submodule update --init`, and that requires a `.gitmodules` file, even for a local commit object.
I feel like submodules are one of Git's most misunderstood features. I agree that they're really not great for the use-case of "I have to work in a bunch of repos at the same time", but they're also not designed for that.
Submodules are a really good solution for problems that look like "this repo depends on some upstream repos that I don't control", and a bad solution for any other problem. They do what they were designed to do.
Imagine that your build-script needs to clone a bunch of third-party dependencies. So maybe you write some kind of clone.sh that loops through a bunch of Git repo URLs. Then later you want to also specify specific commit hashes, so you add a commit-ID field. Then you write a tool that makes it easy to update the fields in your clone.sh file. Guess what you've got? Git submodules.
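Here is roughly what that hand-rolled tool boils down to, as a Python sketch. Everything in it (the `PinnedRepo` type, `clone_plan`, the example URL and path) is hypothetical; it only builds the git command lines rather than running them.

```python
from typing import NamedTuple


class PinnedRepo(NamedTuple):
    path: str    # where the dependency lives in your tree
    url: str     # upstream repository you don't control
    commit: str  # exact commit you build against


def clone_plan(deps):
    """Return the git commands a naive clone.sh would run for each pinned dependency."""
    commands = []
    for dep in deps:
        commands.append(["git", "clone", dep.url, dep.path])
        # Check out the pinned commit directly, leaving a detached HEAD.
        commands.append(["git", "-C", dep.path, "checkout", "--detach", dep.commit])
    return commands


deps = [
    PinnedRepo("third_party/libfoo", "https://example.com/libfoo.git",
               "2d49d729fe39d1def8ce537d7efeeabbf3efb4f2"),
]
for cmd in clone_plan(deps):
    print(" ".join(cmd))
```

The (path, url) pairs are what `.gitmodules` stores, and the commit is what the mode-160000 tree entry stores.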
Git submodules can be useful for vendoring internal parts without code duplication. They can help if the tech you are writing that repo's code in doesn't have any specific/advanced tool for dependency management. By using `git checkout --recurse-submodules` you get a poor man's version of a package system.
I'm not endorsing it as the best feature ever or as the way to do dependency management but it can be used in certain situations.