Ask HN: Why are Git submodules so bad?
213 points by gavinhoward 11 months ago | 167 comments
I have been a git user for a long time, but I've never used Subversion or any other VCS more than a little.

I also hardly use Git submodules, but when I do, I don't struggle.

Yet people talk about Git submodules as though they are really hard. I presume I'm just not using them as much as other people, or that my use case for them happens to be on their happy path.

So why are Git submodules so bad?


Git submodules are fine and can be really useful, but they are really hard to use. I've run into problems like:

1. Git clone not cloning submodules by default. You need `git submodule update --init` after the fact, or `git clone --recursive` up front

2. Git submodules being out of sync because I forgot to pull them specifically. I'm pretty sure `git submodule update` doesn't always fix this, but maybe only in combination with 3)

3. Git diff returning something even after I commit, because the submodule has a change. I have to go into the submodule and either commit/push that as well or revert it. Basically, every operation I do on the main repo I also need to do on the submodule if I modified files in both

4. Fixing merge conflicts and using git in one repo is already hard enough. The team I was working with kept having issues with using the wrong submodule commit, not having the same commit/push requirements on submodules, etc.

All of these can be fixed by tools and smart techniques, like putting `git submodule update` in the Makefile. Git submodules aren't "bad", and honestly they're an essential feature of git. But they are a struggle, and lots of people use monorepos instead (which have their own problems...).
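For reference, the usual incantations for problems 1-3 look something like this (a rough sketch; exact flags vary by git version, and the paths are made up):

  # 1: get submodules at clone time, or init them after a plain clone
  git clone --recursive <url>
  git submodule update --init --recursive

  # 2: sync submodules to the commits the current branch expects
  git submodule update --recursive

  # 3: commit inside the submodule first, then record the new pointer in the parent
  cd path/to/sub && git commit -am "fix"
  cd - && git add path/to/sub && git commit -m "bump sub"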


Switching branches in a repository with submodules is a huge pain, especially if (like the Ansible repo) some branches have the subdirectory in the same repo like normal, and some branches have the same subdirectory in a submodule.


There are git options for managing these difficulties like:

git config --global submodule.recurse true

https://git-scm.com/book/en/v2/Git-Tools-Submodules search for "git config"
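A few more settings from that page that take the edge off (ones I believe are current; check your git version):

  git config --global diff.submodule log            # show submodule commits in diffs
  git config --global status.submodulesummary 1     # summarize submodule changes in status
  git config --global push.recurseSubmodules check  # refuse to push if a submodule commit isn't pushed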


> There are git options for managing these difficulties

This is git in a nutshell. Most defaults are very bad, and so using git from the command line is an exercise in learning which flags to set to achieve a sane workflow.


Thanks! That one option removes maybe two-thirds of the pain of using submodules: a pull or checkout becomes just a regular command.

Though it’d be nice if `git commit` supported it too and just did a `git submodule foreach git commit …`.
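In the meantime, a rough approximation (a sketch; the `|| true` keeps submodules with nothing to commit from aborting the loop, and the message is a placeholder):

  git submodule foreach 'git commit -am "my change" || true'
  git commit -am "my change"   # -a also records the updated submodule pointers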


Interesting!

I've never thought of doing commits in submodules. We use them a lot at work, but only in an "import a specific revision of this other repo into our repo" sense. If we needed to make changes, we would make them in the repo that the submodule points to and then just update the ref that the submodule points to.

What's your use case for making and committing changes inside the submodule tree?


I naively thought that was the whole point of submodules. Otherwise, why not use a package manager?

The use case being that you can work on and update some independent system/repo while getting real-time feedback on how the changes interact with the system as a whole.


That's how I've come to use git submodules. It's been helpful when working with various embedded projects. They often don't get updated for months or years (ideally), and then you only want to pull in specific changes.


In a previous project we had a main repo with the deploy/orchestration scripts, docs, etc., and the actual components as submodules (frontend, backend, periodic and on-demand jobs plus the job queue, and a blob-serving thing that did ACL in front of Ceph - because for some reason S3 was too mainstream :D).

The use case was doing a quick fix that changed an API and had to be made in the backend, the frontend, and the main repo together.

It's quite similar to today's GitOps flow, but with submodules :)


It's easy to overlook that the submodule is a complete git repository in its own right.

When working on both, it's really annoying to have to commit to a dependency's repo, push, and pull down in the dependent repo. If you do the commit in the submodule, it's just immediately available...and you can still have whatever remote to push to. So it just cuts down on the number of checkouts you have to have.


That's all true, but worktrees are cheap, and the workflow you describe means that your submodules are tracking a branch rather than being pinned to a revision, right?

For our purpose that's definitely worse; the submodule is supposed to be a pointer to a specific tree, with that tree being the same for all developers. If we want to change the tree that is pointed to, we should commit and push a change to the submodule ref.


Yeah, if you commit a change in a submodule, the parent repo gets marked as dirty, since the ref changes, and you need to commit.
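A minimal illustration (paths made up):

  cd libs/foo && git commit -am "fix inside the submodule"
  cd ../.. && git add libs/foo      # stage the submodule's new commit pointer
  git commit -m "bump libs/foo"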


Right. But why doesn’t it Just Work by default?


From a design point of view, is there a good reason why this isn't the default?


Agreed with all of this!

In git parlance, the submodule porcelain is hard to use (but the plumbing is good)


That's the entirety of git. Extremely fast. Bummer UI. It's the status quo and it's not changing. Despite using git for 10+ years, I frequently have to look up commands, and then I end up scratching my head as to why the CLI UI is the way it is.


I suppose the biggest problem is that the concept of SCM/VCS is just not simple enough to make both easy and useful/advanced at the same time.

You can have a 'pull, merge, push'-only system, but at that point we're re-inventing subversion. Making it more advanced means we also need the knowledge and skills to do other activities correctly, and that means the tooling can't make as many choices for you, because there simply isn't a default way that works all the time.

Most efforts at git alternatives run into the same problems: either they're just as advanced, with the same benefits and downsides, or they end up less advanced, but then they're not equally useful and you can't really make them work right.


Mercurial covers generally the same concepts as git and is thus also not trivial to learn for someone uninitiated; yet its interface was night and day compared to git's from their very early days. It proves one can design a decent interface if one actually cares about usability and friendliness. As I remember it, git won the rivalry squarely because GitHub became popular (though I assume there were some reasons why GH chose git over hg).


Back in the day there were actually 2 or 3 different cloud SCM hosting providers that chose Mercurial. As I recall, a couple of them, BitBucket and Kiln, also had better Web UIs than GitHub did. Versus just GitHub offering Git, and it was kind of a duffer in my opinion.

GitHub has come a long way. But I would guess the main reason it managed to become dominant is not because it had a better product. (At the time, it didn't.) It's because Git benefited from the celebrity of its author, Linus Torvalds.


It's crazy how people forget the past. Back in the day, it was git's "rewrites and rebases modify the past to make beautiful commits" vs. hg's "rewriting the past is bad; beautiful commits are lies about history". Turns out people don't care about truthful history.

(Nowadays Mercurial can do rebase/amend just fine... but it's too late.)


I think that the truth is probably somewhere in between.

I do like to squash and rebase before moving changes upstream. But, to me, that isn't really history quite yet. Or at least, it's not history that's worth recording. All those micro-commits from the work in progress are, in some ways, more akin to my editor's undo history. Which is also something I don't save.

It's also clear, in hindsight, that Mercurial's original position on this subject failed to anticipate AWS credentials accidentally being committed to source control.

But I have also seen (and done) some amount of history rewriting in long-lived branches that I don't think would have been necessary if Git had had some of Mercurial's ergonomics. Workflows for merging two different repositories while retaining the commit history from each, for example.


FWIW, Mercurial has had a "censor" command for blowing away the contents of revisions with AWS keys since 2015.

Although once stuff is pushed to the public repo you're probably going to want to change those keys regardless. And if it's on the local one, there's plenty of options for removing the commit.


Rewriting history is actively good because it means you can view it as a series of logical patches. I have worked on large projects using earlier versions of hg and they were absolutely full of merge commits just labeled "Merge" - some of them were safe, some of them had random changes in them, and some of them had automatic merged changes that actually caused problems.

It was also much slower than git. But I knew someone working on Google Code at the time who liked it better because it was "clean" and in Python.


It was much, much faster than git over HTTP at the time, though. That's why Google Code selected it. It was also faster at imports, which is why Mozilla selected it for their transition.

At some other things it was slower; that's changed over time, of course.

Also, in terms of clean history, Mercurial has the best of both worlds, with phases and hidden-by-default commits to keep track of such cleanup.


> Also, in terms of clean history, Mercurial has the best of both worlds, with phases and hidden-by-default commits to keep track of such cleanup.

Yeah, it has more features now, but at the time it didn't. There was something called a patch queue (the mq extension) you could use for early-stage work, but that was all.


The fact that mercurial is substantially slower than git was probably also a big factor.


in other words, the comment you’re replying to rewrote the past of rewriting the past.


Around 2013, having only knowledge of svn at the time, I tried both git and mercurial to see which I liked more and found git to be a lot more intuitive than mercurial.

It's been long enough that I don't remember the details of why I didn't like Mercurial, but fame had nothing to do with it, nor did website integration - I was only using it locally. How it worked just didn't fit with how I thought about version control.


That's not what I remember. Back in the days when people started implementing DVCSes, git was just much faster than everyone else (that was in fact the reason Linus wrote it). Once the kernel was using it, its mindshare grew much faster because of the publicity this implied. In other words, it was largely a case of "if the kernel devs are using it, it must be good". When GitHub and all the other hosting services started, many still had Mercurial or other DVCSes (Launchpad was bzr, for example), but by that time the ship had sailed already, I would argue.


> I suppose the biggest problem is that the concept of SCM/VCS is just not simple enough to make both easy and useful/advanced at the same time.

Git has, and always has had, a singularly bad UI among DVCSes.


> but at that point we're re-inventing subversion

So? Maybe Subversion is all most developers need?


Solo devs maybe. Merging and branching are so much worse in svn that it’s not good enough for “most developers”, aka those working on professional projects with a team of developers. Sure we used to make do with svn, but I have no desire to go back.


Odd, I found branching and merging so much easier in SVN.

Git doesn't even technically have branches, just pointers to commits, which can easily get mixed up, end up with a detached HEAD, and fail in ways that just never happened in SVN.

And rebasing a branch with several commits can be a nightmare, since you have to re-merge almost (but not exactly) the same code over and over again, at whatever state it was in at some point in the past when each earlier commit happened - unless you bail out and squash first. In SVN, you just merged the whole branch once, when both sides were in their final, current state.

Of course, git's a lot more powerful, but with that comes complexity. SVN branching and merging was a snap comparatively.


SVN will frequently insist that merge conflicts exist where there shouldn't be any if trees have been modified by deleting or moving directories. This is so pervasive that organizations will avoid doing merges because of the manual fixups you have to do on long-lived branches. There's metadata now for tracking branch history, but older versions couldn't figure out that two branches in a merge had a common ancestor. A puzzling thing to omit from a VCS.


> Git doesn't even technically have branches, just pointers to commits, which can easily get mixed up, end up with a detached HEAD, and fail in ways that just never happened in SVN.

I mean, if that's a reason to say git doesn't technically have branches, then neither does svn. It has subdirectories that you can make copies of at any level, and copy commits between them at any level, such that you can make an amazing repo-within-a-repo mess not possible in git.


Honestly I’d take CVS over Subversion. Merging was just bad in Subversion, and I found the IDE integrations confusing, obtuse, and buggy.

Sure, CVS was limited but it was reliable and straightforward.


> Sure, CVS was limited but it was reliable and straightforward.

It was so reliable that people complained about SVN using a DB (Berkeley DB) as its backend, since manually fixing CVS files was a "normal" part of operation and people didn't believe that might no longer be needed...


...And better than SCCS! ;)


> Solo devs maybe.

Like I’m gonna bother to set up a Subversion remote for every little repo that I create for myself.


The most beautiful part of svn, and the one I miss the most in git, is that there is no need to set up tons of separate repos: every subdirectory can be checked out as if it were its own repo.

This means you usually only have one svn repo, and you set it up the way you like. As an example, you may set things up so you can checkout:

server:/proj/small/hello - to get a single project

server:/proj/small - to get all small projects

server:/proj - to get all projects

If you already have one of those checked out, adding a new project is as simple as "mkdir bar", "svn add bar", "svn commit". So much easier than making a new GitHub repo. And multi-level hierarchical project nesting is still impossible in git.


> So much easier than making a new GitHub repo.

Were you paying attention to what I just wrote!? `git init`. What do I need a remote on the Internet for?


You don't need the internet; you can create the main repo somewhere else on your PC (preferably on a second disk, for a very minimal backup) and use a file URL to access it: "file:///F:/MyRepositoriesAreHere/MyProject-repo/ProjectName/trunk"


That’s better but still an extra step.


Don't you ever share your projects between machines? I at least have a laptop, a desktop, and the occasional Raspberry Pi. The ability to have personal projects on all of them is very handy. It also acts as a nice backup.

But yes, if all your work is on one machine, and you have backups of it, there is not much point in svn.


That’s a file-sync problem, not a VCS problem. And backup is a separate problem.


Git makes sense for the Linux kernel and merging patches at scale. 99% of software development would be fine with subversion


Git completely replaced Subversion so quickly because the benefits were apparent even at a small scale. Subversion was centralized and slow whereas Git branches were cheap and fast. It turns out the distributed model is just a lot better. Even my college project teams benefited from the superior experience of Git.


Interestingly, there was a distributed SCM built on top of SVN, called SVK (https://wiki.c2.com/?SvkVersionControl).

Being distributed, it solved the main gripes with SVN; it also added a better merging algorithm (https://foswiki.org/pub/Development/SVK/svk-visual-guide.pdf), solving another big gripe.

I was actually satisfied with it, and surprised that it never got attention, particularly because there were no requirements in order to use it with existing SVN repositories. I'm actually baffled, because SVN is still active, so SVK would still be useful nowadays.


May as well use Git SVN integration for this.


Except that everybody switched to GitHub, not git. And effectively recreated Subversion with caching.


"Subversion with caching" is not subversion.

I used subversion for a long time and was resistant to moving ardour.org to git. 24 hours after we switched (we never use 3rd-party git hosting as our canonical repo), I was already convinced it was not merely the right choice, but an excellent one.


It's also Subversion where you can commit your changes and write the message before pushing, instead of at the same time, so you get the chance to review it. That's enough to make it better.


No, I worked with gitlab, bitbucket, custom git server installs in the last 3 years alone.


Anything can be made to work. But I don’t see why I would want to handicap myself with a truly centralized SCM system.

Sure, we use one canonical repository. All our “pull requests” are really merge requests, mostly to the main branch. So that’s pretty centralized, right? So why use a distributed VCS? Well, why use a local editor or IDE for code that is ultimately going to end up in the cloud somewhere? Sure, you might want to out of preference, but why should you be forced to? The fact is that wherever the code will end up is beside the point when it comes to how to develop it.

The truly important thing about distributed VCS is that it forces almost all of the operations on the repository to be usable locally. And why should it not be? What’s “git log”, “git blame”, or “git merge” got to do with whether there is one canonical repo or a hierarchy of upstreams?

I think that this idea that non-distributed VCS is somehow the default—as in the obvious, simple thing to implement—is just backwards. Of course the default assumption for any VCS operation—unless it has a name like “send-email”—should be that it operates on your own local copy.

Sure, we use a centralized repo structure. And the only call-central-command operation I use is “git push”. All the other fiddling and querying—and all the things that make version-control-as-history useful—is local.


Have you used Subversion? If so, do you remember how slow it was? How cutting a branch and then merging it was seen as something for a senior dev to handle?


I did use subversion in college. Never professionally, so creating our own branches wasn't a pain.

I think there is a reason why this XKCD was made: https://xkcd.com/1597/


“Delete your work and clone again” is ridiculous. Just goes to show that you don’t know what you’re talking about (or the xkcd guy for that matter).


I'm talking about my needs, ditto with the XKCD guy. Basically I just need to be able to create a PR from my local changes. Not much more.


Subversion requires a server, so it’s not suitable for the small local one-off repo


No. It doesn’t. svn checkout file:///path/to/repo works absolutely fine.


But it does need a server to collaborate, doesn't it?


Define a server. NFS will be enough for more than one person to use a repo over the file protocol.

It will work fine over a Windows share, Samba, NFS, and so on. It doesn’t need svn or http protocol to operate.


I could define a server but I'm not a dictionary so I won't.

SVN doesn't work peer-to-peer or over email. And that's fine. It's just not ready to go with only local tools.

Of course, you can use GitHub with Subversion just fine, but that wasn't the point. The point was that Subversion alone is never enough if you want to collaborate.


> I could define a server but I'm not a dictionary so I won't.

That’s a bummer. It would help us point the discussion in the right direction.

> SVN doesn't work peer-to-peer or over email.

Why not? What is so different about people having their own subversion repositories over file protocol vs people having their git repositories?

Why would you not be able to send a subversion patch in an email?

As someone who uses git for 10 years, I understand it may not be as ergonomic as with git. But why not?

> The point was that Subversion alone is never enough if you want to collaborate.

Why not? Is git enough?


In which case they can use Subversion. Or Dropbox since it essentially offers the same features. I don't think there is anything bad about it, just that it solves different problems.

Older systems like CVS are also still in use, but it appears that none of the old systems really lasted more broadly; they aren't useful for the needs of today.


I loved this XKCD: https://xkcd.com/1597/ - use these commands, and if you get an error, save your changes elsewhere, delete the project, and download a fresh copy.


Pull, merge, rebase, push with local commits is what almost everyone cares about.


I think you're really, really undervaluing branches. The fact that patches are mostly shared as branches changes everything.

It's what enables three-way merges, and makes rebasing much more manageable.

The ability to traverse and jump to some other point in history is really missing here too.


You absolutely don't need the way Git does branching to do three-way merges. If you have local commits as first-class citizens, the need for local branch names disappears completely.

Local commits imply traversing history, but even that is terrible with Git. You can't just do obvious things like "git previous" or "git next" or "git checkout <commit hash>".


I'm not sure why you wouldn't want local branch names once commits are first-class. And you don't have to do it the way git does, but aside from Pijul, I think everything does it about the same way git does.

Also, git checkout <commit hash> works?


Local branch names for short-lived branches are a crutch, there's nothing they convey that commit messages don't express better.

What happens if you amend a commit after checking it out?


> Local branch names for short-lived branches are a crutch

Hm. We must use git completely differently, then. I can't really imagine what I'd do without them.

> there's nothing they convey that commit messages don't express better.

Presumably you'd still want to maintain a list of HEADs, so you just want to always refer to them by hash instead of a branch name? That's fine, I guess -- not sure what it buys you.

> What happens if you amend a commit after checking it out?

Then it becomes a new commit? Not sure what you're getting at.


It's slowly changing. They finally added `git restore`, for example.


> It's the status quo and it's not changing

Not entirely true: checkout was split into switch and restore, which is something I guess.


`git submodule update --init --recursive` is the magic phrase.

And, yes: submodules are really useful, as well as a PIA.


And you can also --recurse-submodules when cloning
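i.e.:

  git clone --recurse-submodules <url>   # or the older spelling, --recursive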


> Git submodules are fine and can be really useful, but they are really hard to use

If an important software tool is hard to use to the point that most people avoid it, then it's not fine. It's broken.


I agree with all of this. Submodules aren't easy but they perform a useful job. It's hard to see how they could be made significantly easier. Where else in software is dependency management easy and convenient?


"Make 'git checkout' of the top level repo also set the submodules to the contents they should have for that top level commit hash" is probably the main change I'd want. The current setup means that checking out a branch or doing a git bisect gives you an inconsistent source tree, which seems like a really unhelpful behaviour.


What tools fix this?


If you're on Windows, it just takes a few clicks with TortoiseGit. I never have to remember these command-line git commands.


The irony of needing a tool to make the first tool work well


Using something like Nix to specify the dependencies instead.


That sounds like a problem that exists between the chair and the keyboard.


In 25 years of software dev, and most of that in developing user interfaces to things, I have never found a commonly repeated error that could be attributed to user error. It's always badly designed software that ignores the user's mental model that's developed based on using the software.

In this case, "git clone" clones a repo 99% of the time, because 99% of repos are shallow and simple. For "git clone" to only clone the top level when you have su modules instead of prompting to ask if you want a deep clone, or doing a deep clone by default because that's the expected behavior, is pretty poor design IMO.


Git submodules aren't bad in the sense of being buggy; they do what the documentation says.

I think they're difficult to use because they break my mental model of how I expect a repository to work. They create a new level of abstraction for figuring out how everything is related, and what commands you need to keep things in sync (as opposed to just a normal pull/branch/push flow). It's a whole new layer to the way your VCS works that the consumer needs to understand.

The two alternatives are

1. Have a bunch of repositories with an understanding of what an expected file structure is, ala ./projects/repo_1, ./projects/repo_2. You have a master repo with a readme instructing people on how to set it up. In theory, there's a disadvantage here in that it puts more work on the end user to manually set that up, but the advantage is there's a simpler understanding of how everything works together.

2. A monorepo. If you absolutely want all of the files to be linked together in a large repo, why not just put them in the same repo rather than forking everything out across many repos? You lose a little flexibility in being able to mix and match branches, but nothing a good cherry-pick when needed can't fix.

Either of these strategies solves the same problem submodules are usually used to solve, without creating a more burdensome mental model, in my opinion. So the question becomes: why use them and add more to understand, if there are simpler patterns to use instead?


You completely missed the problem that submodules are actually supposed to solve though. Using them for either of those cases would almost definitely be the wrong choice.

What they're really for, is vendoring someone else's code into yours. They're still not great even at that, but sometimes they're the best option.


Interesting. When you say that's the problem they're intended to solve, do you have a link for that intended use case?

I.e., is that "a" use case or "the" use case? I've never seen submodules used for that, only for internal dep management, so if there's content about "what they're really for," I'd love to read more.


When I worked in games we did exactly this, but with Perforce. All of our libraries were in a vendor tree, all in source. We slapped our own build over the top of them and checked in the build artifacts (Perforce). If we needed an update, we updated their code and maybe our build script.

It'd be nice to use submodules for this but I gave up years ago.

The other big use is where you have your own libraries and you'd like to be able to share them across projects. My friend does game jams and has his own simple engine, he versions the engine and adds abilities and uses it across game projects.


Check the first paragraph on this page: https://git-scm.com/book/en/v2/Git-Tools-Submodules


This ↑. Also, subtree is an interesting, relevant tool too.


> vendoring someone else's code into yours

Vendoring usually implies some kind of snapshot-copying of third-party code. A repo you depend on by value. That's actually solved by subtrees. If you buy that metaphor, then submodules, in contrast, express a dependency by reference.

tl;dr anyone vendoring with submodules is prolly doing it wrong


"Have a bunch of repositories with an understanding of what an expected file structure is, ala ./projects/repo_1, ./projects/repo_2. You have a master repo with a readme instructing people on how to set it up. In theory, there's a disadvantage here in that it puts more work on the end user to manually set that up, but the advantage is there's a simpler understanding of how everything works together."

This is what I do. I have something like 17 code repos organized this way, plus lots of testing repos, plus an extra "hub" repo. (Credit to a friend for calling this repo "hub": short, to the point, requires no explanation.) The hub repo is a bunch of scripts and makefiles that configure everything and even clone the rest of the repos for me. It also has special grep and find scripts that run on all of the repos as their target. The hub repo just needs one env var to tell it where the root of all the repos is. Note that in the file system the hub repo is under the root and a sibling of the code repos, not their parent.

Each code or test repo has an "externs" subdir populated only with softlinks to the other repos on which it depends. The scripts configure this by default, but it is also straightforward to configure by hand if you want to do something non-typical. For example, if you want to have multiple versions of a repo checked out, say on different branches/commits, you can do that and name each directory with a suffix of the branch/commit. Then the client repos can just point at the one they want. You can have all kinds of different configurations set up at any time. Doing this makes it straightforward to know what you depend on just by looking at the softlinks. There is no confusion at any time.

There are ways of configuring the system that do not even need all of the repos, so this is ideal. Using the hub repo makefile I can clone the whole system with one make target (after cloning hub), I can build the whole system with one target, I can test the whole system with one target. It is a testament to how well it works that I don't even know exactly how many repos I have. In short, it works great.
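A sketch of the shape of it (all names invented for illustration):

  export REPO_ROOT=$HOME/work    # the one env var the hub repo needs
  ls "$REPO_ROOT"                # hub/ repo_1/ repo_2/ ...  (hub is a sibling, not a parent)
  # each repo declares its deps as softlinks in externs/
  ln -s ../../repo_2 "$REPO_ROOT/repo_1/externs/repo_2"
  # repoint by hand at a second checkout on another branch if needed
  ln -sfn ../../repo_2-branchX "$REPO_ROOT/repo_1/externs/repo_2"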


There's actually a third alternative, called Git X-Modules (https://gitmodules.com). It's a tool to relieve the pain submodules are causing, as described in my comment above :-) In short, it moves all synchronization to the server side. So you can combine repositories together in any way you like, and still work with a multi-module repository as if it were a regular one - no special commands, specific course of actions, etc.


Maybe it's more helpful to think of submodules as a convention in .git for managing the commit ids of external repos that your main repo's code depends on, with some assumptions (i.e. its own subdirectory) and porcelain that might or might not match your workflow with respect to how that external code is integrated. It can get tedious when dealing with submodules of submodules, etc., but so would other ways of tracking the ids of transitive deps.


The SVN implementation worked pretty seamlessly, almost like a regular subdirectory.

There was no gotcha of a non-recursive clone/checkout. If you used this feature, your users wouldn't keep getting "broken" checkouts.

There was no gotcha of state split between .gitmodules, top-level .git state, and submodule's .git, and the issues caused by them being out of sync.

There was no gotcha of referencing an unpushed commit.

Submodules are weirdly special and let all of their primitive implementation details leak out to the interface. You can't just clone the repo any more, you have to clone recursively. You can't just checkout branches any more, you have to init and update the submodules too, with extra complications if the same submodules don't exist in all branches. You can't just commit/push, you have to commit/push the submodules first. With submodules every basic git operation gets extra steps and novel failure modes. Some operations feel outright buggy, e.g. rebase gets confused and fails when an in-tree directory has been changed to a submodule.

Functionality provided by submodules is great, but the implementation feels like it's intentionally trying to make less-than-expert users feel stupid.


Submodules are just complicated because Git makes no decisions at all about how they should behave, beyond "never make a decision that could lose data".

So you have to understand the tradeoffs and make every decision at every step. It's the safe option.

Like, what happens if you remove a submodule between revisions? Git won't remove the files, you could have stuff in there. So it just dangles there, as a pile of changed files that you now have to separately, manually remove or commit, because it's no longer tracked as a submodule. And then repeat this same kind of "X could sometimes be risky, so don't do it" behavior for dozens of scenarios.

All of which is in some ways reasonable, and is very much "Git-like behavior". But it's annoying, and if you don't really understand it all it seems like it's just getting in your way all the time for no good reason. Git has been very very slowly improving this behavior in general, but it's still taking an extremely conservative stance on everything, so it'll probably never be streamlined or automagic - making one set of decisions implicitly would get in the way of someone who wants different behavior.


What's the mental model for the use of a git submodule?

I've always thought of them as a way to "vendor" a git repository, i.e. declare a dependency on a specific version of another project. I thought they made sense to use only when you're not actively developing the other project (at least within the same mental context). If you did want to develop the other project independently, I thought it best to clone it as a non-submodule somewhere else, push any commits, then pull them down into the submodule.


I think submodules end up highlighting failures of the (implementation of the) many small repo model. People want to develop in both the submodules and main repo simultaneously, and that’s relatively painful. If you find yourself doing that often, that’s a sign that your repos are highly coupled and failing the intent of the many small repo model anyway.


I have a custom game engine, and I make games that use that engine.

I have about ten different games, all using the same engine, but they were written over the course of many years and so are written against different commits within that engine repo, and git submodules capture that idea perfectly.

If I didn’t have something like that where I effectively had a long-lived library which I wanted to pull into multiple projects, I probably wouldn’t bother with the submodules. But it’s *so* much more convenient to have them in submodules than to copy the engine code separately into each project’s separate repos and then manage applying every change to the engine in all the different game repos individually.

Really, git submodules are exactly the same thing as subversion ‘externals’, but always pinned to a specific commit (which is an available-but-not-enabled-by-default option under subversion) and with a substantially easier interface - so maybe folks who don’t need them are more likely to notice them and grumble about how they don’t solve an issue they have?

IMHO git submodules are a huge quality-of-life improvement over the same system from subversion (as available back in the 1.8.x era; I haven’t really used svn in anger in a long time). I definitely wouldn’t want to go back, or to not have them available.


That's how I have thought of them too, so I've never struggled with them. Hence why I asked the question.

I'm glad to know that I'm not the only one to use them as such.


The main thing that vendoring is supposed to do is to make it so you can build your code even if all your deps disappear. Submodules don't get you that property.


As many others in this thread have stated, the main issue is they have fairly poor UX and if you aren't used to them they can be pretty annoying. They especially have quirks when they're removed from (or moved within) an existing Git repository.

One thing I haven't seen mentioned in this thread though is that they force an opinion of HTTPS vs SSH for the remote repository.

If a developer usually uses SSH, their credential manager might not be authenticated over HTTPS (if they even have one configured at all!) If they usually use HTTPS, they might not even have an SSH keypair associated with their identity. If they're on Windows setting up SSH is generally even higher friction than it is on Linux.

For someone just casually cloning a repository this is a lot of friction right out of the gate, and they haven't even had to start dealing with deciphering your build instructions yet!

-------

Personally I still use Git submodules despite their flaws because they mesh well with my workflow. (This is partially due to the fact that when I transitioned from Hg to Git it was at a job that used them heavily.)

The reality is that every solution for bringing external code into your project (whether it's submodules, subtrees, tools like Josh, scripts to clone separately, IDE features for multi-repo work, ensuring dependencies are always packages, or just plain ol' copy+paste) has different pros and cons. None of them is objectively bad, and none is objectively best for all situations. You need to determine which workflow makes the most sense for you, your project, and your collaborators.


I recently started using git subtree[0] instead of dealing with all the problems with git submodules, and have been very happy with the experience so far. It does copy every file into your repository, though.

[0]: https://github.com/git/git/blob/master/contrib/subtree/git-s...
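For reference, the basic commands look something like this (prefix and URL hypothetical):

  git subtree add  --prefix=vendor/lib https://example.com/lib.git main --squash
  git subtree pull --prefix=vendor/lib https://example.com/lib.git main --squash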


This is funny - I was going to chip in and say that you really start to appreciate submodules once you’ve experienced the fractal hell of subtrees!

If I ever have to untangle a messed-up repository containing a subtree again in my life I’m quitting development and moving to a cabin in the woods.


Haha, my use case is really simple. I have a couple of applications that need to run at the same time on the development machines, and they only need to be updated once or twice a year. With subtrees, only the one person who maintains the subtrees actually needs to care about the other applications, and the other developers can mostly ignore that the other applications exist.

If we used containers, a Docker compose file would make things even simpler for us, but alas.


I have exactly the opposite experience. Submodules are hell, subtrees are simple.


Agreed.


Beyond how hard they may or may not be to use, my personal hatred of git submodules is about bypassing your normal dependency management system. See 12 Factors on Dependencies[1].

I've not seen many uses of submodules that weren't better served by adding the package from pypi/npm/crates/...

[1]: https://12factor.net/dependencies


I've been reading a lot of research papers lately, sometimes with POC repos. When compiling the code, it would often fail because dependencies have changed over the years (we're talking about C/C++ code which doesn't have a package manager like you're probably used to). Most of these repos would fail to compile because it expected libraries installed on the system to not have changed in the past 10 years or so. In exactly one of these repos did the author use a git submodule so I was linking directly to the correct version of the library. Granted, those libraries _did not_ do this, so it still failed to compile 7 years later...

If everyone used git submodules for deps, you'd end up with a (block) chain (if everyone's commits were signed) for deps.


> we're talking about C/C++ code which doesn't have a package manager like you're probably used to

No, there are 3-4 suitable package managers [1], depending on what your bar for an acceptable package manager is. It's just that C and C++ engineers have gone so long without one that they don't even think about it.

Honestly, we just need to start expecting more (and contributing more) with respect to unpackaged C and C++.

[1] vcpkg, Conan, nix, and spack off the top of my head


That's pretty cool reproducibility indeed, love it.

I once worked in C++, and the lack of a (standard?) package manager was killing me for pretty much that reason. "This says it's using curl, but did they compile it with a weird flag, or was their version just old?", and stuff like that.

For the case you describe, a submodule is the next best thing to proper package management, and I'm glad those submodules are there to save you! But in themselves, the submodules don't guarantee build instructions, and the same guarantees of exactness can be had by locking dependencies.

I feel like your message mostly reinforces my belief in the NEED for package manager for every such language combo, having seen the immense value in other systems, and the extremes that we have to go to when they're missing.

I hear the Nix (and Guix) community has made great strides in wrangling these stray systems into reproducibility, using submodule chains of the sort you describe, but augmented with standardized build info to make it system-wide package management.


That’s a good point about build instructions, but hopefully they are encoded in a Makefile. In reality, I only had to make very minor changes to the experimental code to get it to compile, so I really appreciated maintainers keeping backwards compatibility for the most part. Some of this code was over 10 years old and it still worked… that was pretty impressive, since I mostly work with higher-level languages where dependencies can change very significantly between versions, or even be abandoned within a decade.


> bypassing your normal dependency management system

You can use release tags when using Git submodules, which makes it closer to a "normal dependency management system".

It's better than using the commit hash or branch, but still not that great without flexible semver dependencies...
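e.g., pinning a submodule to a tagged release might look like this (tag and paths made up):

  cd vendor/lib
  git fetch --tags && git checkout v1.2.3
  cd ../.. && git add vendor/lib && git commit -m "vendor/lib: v1.2.3"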


Indeed!

I've enjoyed the middle ground that some package managers provide, where one can point to a git repo's tag/branch/commit hash instead of the upstream package repository of choice.

This means if you REALLY wanted to do some vendoring approach with custom code, you don't have to do the git submodule, just use the dependency as usual, it's just fetched from git, and built the normal way.

It's not for everybody (I still recommend to point to known packages first), but it's cool to acknowledge the usecase of "I need to get this from git repo" without being savages about it and throwing away our tools.


Why would using pypi/npm/crates be better? If I'm already using git (which I will be) then using something else for package management needs to be a huge step up, especially if the other systems you mentioned are language specific and a git solution would probably be language agnostic.


My assumption here is that your language has already got a dependency management system beyond git, because there's just SO much value in specifying and isolating dependencies and associated build tools, things git doesn't care about (off scope).

Just a few off the top of my head: identifying/resolving transitive dependencies and conflicts, separating development-only deps, optional dependencies for extra features, dependency ranges for pinning and exact locking, proxying/caching for offline-ish rebuild, auto-upgrades, license checking (GPL...), CVE flagging... All these are ~standardized per language.

If you've got (say) a Python script in git, and submodules seem like a good idea for specifying all your dependencies, instead of going towards a package manager, it feels to me like we're a long way away from sound, practical engineering.

For language agnostic, I'm sure the Nix/Guix folks can hook you up with a variant of this, even while using git (maybe even submodules!), but all the value I specified above is still there.

I guess I don't mind git, maybe even with submodules, but I hate substituting "just import the next folder down, which I've now submoduled" for good package-management hygiene.


I interpreted the link as saying don’t make me manually install dependencies.

Using git submodules is no worse or better than say npm in that regard.


The security model seems to be "terrible UX defaults that you end up replacing with unsafe settings instead."

You end up with a lot of gotchas instead of them just working.

The mental model of juggling multiple repos in a non-atomic way also violates the rule of least astonishment. Working with read only submodules smooths this part out at least.

GUI support is slowly getting better at least.


> The security model seems to be "terrible UX defaults that you end up replacing with unsafe settings instead."

Except practically nothing in git is unsafe, it's like Plan9's Venti in the sense that you can only add content to it. You never lose data, but you can easily lose your bearings.

I presume a major reason why there's not much effort put into UX guard rails preventing losing one's bearings in the day-to-day work is a product of the fundamental fact that the underlying committed data is always there. People just need to at least familiarize themselves with `git reflog` to help lose the anxiety.


> the underlying committed data is always there

That’s not always true, due to reflog expiration and GC.

This is an important difference to e.g. Subversion where you can delete branches/tags just to hide them, but the history always remains 100% there and nothing is ever really deleted.


In git, if a commit is within the history of a branch or tag, it won't be GC'd, right?


That’s correct, but commits from that history can also be removed, for example by squashing a branch. The Git history is quite mutable in that sense and doesn’t necessarily reflect the actual commit history.


Except for that time there was a vulnerability in exactly the subsystem we're talking about:

https://nvd.nist.gov/vuln/detail/cve-2018-17456

RCE via recursive submodule clone, FTW.


It's really annoying that submodules give you a detached HEAD by default, so working on a submodule within a project is prone to mistakes. Otherwise they've been fine for me.


The biggest reason I find git submodules painful: a "commitlink" object in a git tree does not count as a reference to that commit or anything that commit references, for the purposes of garbage collecting the repository or pushing and pulling changes. You can't have the only reference in your repository to a given commit be a commitlink within another tree.

I'd like to jettison the entire model of "reference another repository that you may or may not have", along with the `.gitmodules` file as anything other than optional metadata, and instead have them be a fully integrated thing that necessarily comes along with any pull/push/clone.


The problem with submodules is that they're read-write. Read-only submodules would be completely fine.


Yeah, this is basically it in the course of actually using them; people should probably also only submodule in tags. Some of people's problems also come from using submodules where they're not necessary.


If you work on blorp, which contains openresty, which contains luajit, and you have a patch to luajit that enables some work in the end product, you need to: make a commit to luajit; make a commit to openresty with your changes; make another commit to openresty to change the luajit version; make a commit to blorp with your changes; and make another commit to blorp to change the openresty version. You will create 3 code reviews, none of which actually contains all of your changes together. And your coworkers have decent odds of not being able to build the software because they don't have a keybind for `git submodule update --init --recursive` yet.
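Spelled out, the dance is roughly this (a sketch using the names above; `-- <paths>` stands in for whichever files you touched):

  cd blorp/openresty/luajit && git commit -am "luajit: the actual fix"
  cd .. && git commit -m "openresty: adapt to luajit" -- <paths>
  git commit -m "openresty: bump luajit" -- luajit
  cd .. && git commit -m "blorp: adapt to openresty" -- <paths>
  git commit -m "blorp: bump openresty" -- openresty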


I don't use submodules, but I do use git repositories inside other git repositories, and let IntelliJ manage them both simultaneously as if they were two different projects. Works fine.

I once enabled it as a submodule to test. It adds nothing, and from that moment any change in the child creates a change in the parent, which for my use case is totally unnecessary (I want both of them to be independent, even if hierarchically one is inside the other).

Submodules are probably a good option for libraries that you rarely touch, so you can update/modify them as with a maven/gradle project. For most other use cases, submodules create more problems than advantages.


On a project where you have a lot of people working on the same code, an advantage (of submodules) is that you can check out someone's branch and it pins the specific commits in the submodules that work with their branch.


What's up with git refusing to clone/update/checkout submodules while showing all their files as deleted? I run into this quite a lot, and the solution seems to be a git submodule sync --recursive (or something like that), but I don't get why I hit it in the first place. Probably related to forgetting --recurse-submodules when cloning, but what do I know?


Git submodules are cool but can be confusing, because people are already used to their language's package manager. They also add overhead, as changes frequently have to be pulled/pushed downstream/upstream. But in cases where it makes sense to use them, they're a great tool. E.g.: a theme that is reused on 3 sites lives in its own repo and is a submodule in each site.


A lot of people here complain about the complex UX. This is a big problem, but it's something you get used to and can live with.

An even bigger problem is when you start substituting submodules for a dependency manager. Submodules have no way to deal with transitive dependencies or diamond dependencies. What are you going to do when lib A->B->D and A->C->D? Your workspace will now have two duplicate checkouts of D, and any update to D requires committing to 3 repos in sequence to update the hashes. If you are really unlucky, only one instance of D can run on the system, but the checkouts differ.

The correct way to deal with this is to have only one top-level superproject where all repos, even transitive ones, are vendored and pinned. The difficulty is knowing whether your repo really is the top level, or whether someone else will include it in an even bigger context. A rule of thumb would be that your superproject shouldn't itself contain any code, only the submodules (see the sketch below).
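A sketch of that rule of thumb (repo names hypothetical) - the superproject holds no code and pins D exactly once:

  git init super && cd super
  git submodule add https://example.com/A.git vendor/A
  git submodule add https://example.com/B.git vendor/B
  git submodule add https://example.com/C.git vendor/C
  git submodule add https://example.com/D.git vendor/D   # one pinned D, not one per consumer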


Git submodules also don't interact well with worktrees [1]. Do not try to change the branch of a worktree that has a submodule.

[1] https://git-scm.com/docs/git-worktree


I never found myself struggling with submodules, but at times I found myself slightly annoyed (especially when having to remove or replace submodules), particularly when they're used for simpler use cases.

I actually ended up creating https://carvel.dev/vendir/ for some of the overlapping use cases. Aside from not being git-specific (for source content or destination), it's entirely transparent to consumers of the repo, as they don't need to know how some subset of content is being managed. (I am, of course, a fan of committing vendored content into repos and ignoring the small price of increased repo size.)


They act almost the same as pinned-revision svn externals, which people don't really seem to have a problem with. The biggest difference I can think of is needing a special command to pull in the submodules, where svn pulls its externals automatically.


I always ran into issues when switching branches, merging, or rebasing. And then you have to figure out what's going on. If you're not used to working with submodules, that's the moment you have to learn it. And a lot of people get overwhelmed then.


This is what killed it for me. Changes to submodules don't seem to survive well across commits/rebases or bisect-type operations. After fixing it on so many people's computers, I just had us give up on submodules and move the stuff out.


A couple things off the top of my head:

* some folks had to deal with a LOT of submodules back in the day; for instance, it wasn't uncommon to have a dozen+ in your "vendor/plugins" directory in a Rails 1.x app. More submodules, more problems

* sometimes submodules get used to "decompose a monolith" into a confusing tangle of interdependent repos. Actually changing stuff winds up requiring coordination across multiple repos (and depending on the org, multiple teams) and velocity drops to zero. Eventually somebody declares "submodules SUCK!!!!!one!!!" and fixes things by re-monolithing...


They are unergonomic!

Annoying to set up and keep in sync "correctly" for a given project (especially edits).

Sure, this depends a bit on the other tooling and on what you use them for.

This doesn't mean they are hard or complicated. But the moment the defaults don't do what's usually needed, and you can't change the default in the project (rather than in user settings) - i.e. you need additional manual steps all the time - some people will hate it and many will dislike it.


Because git is a version control system that is so in love with its data structure, it can't find time to make the rest of the system coherent or useful.


Instead of submodules, you can just fetch whatever repo you want into your repo: create a tracking branch for it, move the stuff into a subdirectory, and then merge that into your master.

Moreover, when you do the initial fetch, you can limit the depth. That will save space if you don't care about the full history of that repo.
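The classic subtree-merge recipe for that, roughly (remote name and branch hypothetical):

  git remote add lib https://example.com/lib.git
  git fetch --depth 1 lib main                                 # optional: limit history depth
  git merge -s ours --no-commit --allow-unrelated-histories lib/main
  git read-tree --prefix=lib/ -u lib/main                      # place their tree under lib/
  git commit -m "Merge lib into lib/"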


They have their uses and misuses. People misuse them a lot, get burned, blame the tool, and then hate it.

If you lack a dependency-management system, work mostly solo or in a very small and tightly coordinated team, and just need some GitHubbed project to make yours work, submodules may be the right tool for you.


Exactly! Submodules are a tool within a toolkit. Learn the toolkit, learn the tool, and then decide whether or not it'll solve your problem.


But this is such a non-answer to what's good or bad about them.


It was a response to a comment, not the OP.

There is nothing "bad" about submodules. They do what they were intended to do very well. They are often misused because people don't take the time to understand how things work (event at a conceptual level) before diving into it.

I've found submodules most useful when I'm tracking an external dependency that doesn't change often -and- I don't have a dependency management system to lean on.

For example, I have a repository of 100+ various bash utility scripts. One of the scripts depends on another repository with 500MB of data. I don't use it often so I don't need that large repo cloned alongside my bash scripts every time. In this case:

1) The dependency doesn't change often and I like being deliberate when using a new version of that dependency.

2) The dependency is optional and can simply be ignored when using Git's default clone behavior for my repo.

3) Since it's just a bunch of bash scripts, I don't have a package management system to lean on. I use these scripts on multiple *nix distributions.

Could I make it work without Git submodules? Yes. I would have added the dependency to .gitignore and written a script to `git clone` the dependency and set it to a specific revision. Submodules just make that process a lot easier.
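i.e. roughly this in place of the submodule (names and revision made up):

  # fetch-data.sh: clone the optional dependency at a pinned revision
  git clone https://example.com/big-data.git data
  git -C data checkout 0123abcd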


> It was a response to a comment, not the OP.

You were elaborating on the comment, weren't you? And that comment was quite vague too.

> There is nothing "bad" about submodules.

The difference between "what are uses and misuses" and "what is good and bad" is not enormous. I think you're nitpicking. Plus, there are things that are flat-out bad. It's a human-made program under constant development, after all.


> You were elaborating on the comment, weren't you? And that comment was quite vague too.

What was vague about that comment?

> The difference between "what are uses and misuses" and "what is good and bad" is not enormous.

This is incorrect. The definition of "bad" is something of low quality. Misusing a tool and then calling it low quality is dumb.

> Plus, there are things that are flat-out bad. It's a human-made program under constant development, after all.

What about git submodules is "flat-out bad"?

It's ironic that you're criticizing others' comments for being vague and providing non-answers while you've contributed nothing to the discussion. Maybe educate yourself on technical things and you won't need to waste your time having meta discussions on internet forums. Looks to me like you want to contribute but have nothing of value to say.


> What about git submodules is "flat-out bad"?

Do you want me to grep the changelog for fixed mistakes?

Do you think they fixed all of them, just now, June 2022?

There are mistakes, which are flat-out bad. Trying to fix them is a continuous process that is not complete. And there are probably bad decisions that will be baked in forever, but that's not necessary to prove my point.

> It's ironic that you're criticizing others' comments on being vague and providing non-answers while you've contributed nothing to the discussion.

It's not ironic. One person asks for input. Another person gives a vague answer. I call the answer too vague to be useful. That doesn't make it my job to be specific now. I'm not answering OP. I'm giving feedback on a different answer.

Which is what you were originally doing too, so don't complain that it's not "of value".

And I'm not criticizing submodules here either.

I'll make my argument more clear. The idea that any software has "nothing bad" about it is ridiculous. That's a generic argument for all software. If you really want examples, look at the changelog. But I think it should be obvious. If you think it would even be useful for me to provide an example, then there's probably been a terrible miscommunication.


The miscommunication is in your mind.

Every software out there has defects. Bugs, performance issues, UX issues, etc. To assume that anyone in this thread actually means that submodules have no defects whatsoever... well... that's just creating a straw man to tear down because you're grumpy or something.

The issue appears to be with your understanding of what "bad" means in this context. When the OP (and commenter I was responding to) says "bad", they're referring to the overall experience of using git submodules. That specifically does not include the minor fixes/improvements you'd find in a changelog.

By your definition, all software is "bad" because all software has defects in varying degrees. That of course makes the entire conversation pointless and pedantic.

Git submodules work just fine when they're used properly. I've been using them for years without issue and so have many other commenters here. Are there things we (the users of git submodules) would like to see improved? Of course! But that doesn't violate the statement that there is "nothing bad" about submodules in the context of this discussion.

I really hope this helps you navigate conversations with a little more nuance. If not, that's okay too. Have a nice weekend.


> When the OP (and commenter I was responding to) says "bad", they're referring to the overall experience of using git submodules.

Ah, here's the part where things went wrong.

I said "what's good or bad", which is not a judgement of the overall experience. It's the same thing as "uses and misuses" from a different angle.

Talking about what software is better at and worse at is a good response to "why is it bad?". "No it isn't" is mostly a waste of time, especially when a lot of people have bad experiences.

And don't say that the miscommunication is in my mind when you glossed over the word "good" entirely.

> By your definition, all software is "bad"

By my definition, all software has bad elements and bad contexts. And good ones, too. Asking where those lie is not pedantic or pointless at all.

> that doesn't violate the statement that there is "nothing bad" about submodules in the context of this discussion.

If an "overall experience" being good enough means that there is "nothing bad", I'm not the one lacking nuance. You've just obliterated nuance.


If submodules were really slick, they'd become the primary way to manage build-time dependencies.


They're very useful. `git submodule update --init --recursive` will cover you 99% of the time. There are some weird corners when working with updates to submodules, but for the most part everything just works.


Git submodules aren't so bad, but they are a pretty leaky abstraction.


We've just transitioned from using submodules at work to subtrees.

I hate them slightly more than submodules. Git describe? Totally broken. Everything else is a tangled mess of junk.

But what armchairhacker says is SPOT ON.


Git submodules aren't perfect, but they can definitely be useful. I use them all the time and for my use case they are a good solution. They do take some getting used to though.


I don't think most people have a complete picture of git submodules, but it's easier to just call them bad outright when everyone around you is vilifying them.


I think Git submodules are good, but they have a very narrow set of use-cases. Sometimes people use submodules when what they really want is a sub-tree merge.


Submodules aren't bad. But in a world where I have to explain that running revert on the 250 megs of jar files someone committed isn't a fix, and where people often just delete and re-clone entire repos because they don't know what's going on, they impose a heavy support burden on the few people who know how to use them.

You know, the people that already carry everyone else through their job.


I feel your pain. I don't expect everyone to be an expert on everything, but given how important version control is in this profession, the lack of understanding (or desire to understand!) its fundamentals by many in our industry is downright sad.


mutable pointers in the middle of your immutable lineage. completely broken model


Pointers are hard.


I would love to hear from people who've been using the "git subtree" command instead. Any good experiences?

---

My colleagues and I lost work a few times when working on a project whose top-level Makefile / build script ran some "git submodule" commands from time to time, when it detected that a submodule appeared to be out of date.

Those commands wiped work in progress, because of submodules' tendency to leave junk around when switching top-level branches, when the set of submodules changed, or the same with recursion over vendored submodules. That junk caused a few end users to end up building and running incorrect code, or to get unexpected build errors, so the policy was for the build scripts to wipe it.

In other words, policy was to prioritise the experience for end-users who cloned the repo and occasionally updated or switched branches.

Unfortunately that meant it occasionally clobbered work in progress by devs. If you were careful it wouldn't happen, but if you left a trail of small changes and forgot about them, then did a git pull or such to get on with other work, intending to come back to those work-in-progress changes later, they'd sometimes be gone. Such changes included ones that needed edits both inside and outside a module (e.g. for API changes), improvements to a submodule's code that there was no hurry to finish or commit, changes to add diagnostics, spotted improvements to tests, etc.

Often when those changes were wiped, the dev wouldn't notice for a while. Then later, "eh, I thought I had already done this before...".

My solution was to stop using the standard build target, and remember to use "git submodule" myself at appropriate times such as when switching branches and updating. That way I never lost any work, but it was not how I was "supposed" to work.

The team discussed improvements to the auto-update-clean-and-hard-reset commands, to make it check more carefully so it wouldn't run as often. But the problem remained, and that refinement made the build options rather ugly, a kind of "make REALLY_REALLY_DONT_UPDATE=1" sort of thing. Sensible defaults that Just Work™ for everyone were never found.
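For example, a guard along these lines (just a sketch here, not what our scripts actually did) would at least refuse to clobber dirty submodules:

  # skip the destructive update if any submodule working tree is dirty
  if git submodule foreach --quiet 'git status --porcelain' | grep -q .; then
      echo "dirty submodule(s) found; skipping auto-update" >&2
      exit 1
  fi
  git submodule update --init --recursive

Even that wouldn't catch everything (e.g. committed-but-unrecorded work sitting on a detached HEAD in a submodule), which is part of why sensible defaults were so hard to find.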

I also found submodules annoyingly time-consuming when making many changes that spanned inside and outside modules, or across modules. The back-and-forth dance: make a PR in the module for something that's compatible both before and after, then a PR in what uses it, then a PR to clean up the module, perhaps over multiple iterations, with each step having a potentially slow review cycle in between, over GitHub's slow web interface as well. Understandable for stable APIs shared among multiple projects, but pointless busywork (and worse commit logs) for small changes that could be a single PR in a monorepo.

(ps. Please see first line of this comment.)


I recommend git subrepo


If you think git submodules are bad/confusing, wait until you see subtrees.



