Linus’s rules on keeping Git history clean (2009) (mail-archive.com)
229 points by jjuliano 3 months ago | 65 comments



When I first read the often-repeated advice "NEVER change public history", or something along those lines, my reaction was "yeah, duh". My opinion has changed a bit since, though: I reckon this advice follows the standard pattern of a good guideline being falsely presented as a strict rule that must never be broken.

In practice: I only work in smaller teams, and we use feature branches that get merged into master after review, with the net effect that pretty much no one ever works directly on master. So if someone pushes a commit only to find out two days later that it has a typo, or needs a small code change, or could be improved by something trivial that isn't really worth another commit, we just go ahead and fixup/rebase/force push, i.e. rewrite public history. Since the rest of the team always does pull --rebase of master anyway and/or rebases feature branches, this is not a problem at all.


The question is though, why would you even bother changing public history, even if you can work around the practical problems?

In my view, the concept of a commit mapping exactly to a functional change, and therefore being able to be correct or incorrect, improved, etc, is going against the grain of what revision control is. A commit just is what it is. If it contains a typo, a bug, etc, you notice and fix it 2 days later and that's another commit. Git just describes what happens. What is the utility in pretending that didn't happen and rewriting the history of changes as if you never made that mistake? Who benefits?

If you are concerned about keeping master 'stable' so that checking out any commit will result in a clean, working codebase, you can use abstractions on top such as tags to point out to people which commits are good and/or bad.

I get that the idea of a stable, neat git history, as though you were all-knowing and perfect, is comforting, but it's also nonsense, and trying to attain it is just wasted effort. Just let git describe what actually happened. Yes, it's chaotic; yes, there is constant rapid iteration, with mistakes made and corrected, but that's just the process of building stuff. That's the reason you shouldn't rewrite history. There are pragmatic exceptions, though, like removing egregious errors such as committed security keys that can't be quickly changed.


I'm convinced that the obsession with rewriting history is solely due to inadequate tools. Git doesn't keep the name of a branch after it's merged, so people want to make merges look like a single commit on top so that they don't face this ambiguity. GitHub doesn't even display the branching structure in its commit log, which also shows a woefully small number of commits per page, further incentivising squashing/editing. Many tools (some are better than others) display commit history in a similarly non-dense way, or in a way that implicitly discourages branching, e.g. gitk doesn't even display commits from other branches by default. Large numbers of commits are also unwieldy when commits are hashes that cannot be ordered mentally just by looking at them.

Over in mercurial land people are more likely to keep history, even though history rewriting there is not only equally powerful but safer than in git, via the 'evolve' extension. We can limit our bisecting to a single branch, such as a stable branch or the default (mercurial parlance for 'master') branch, skipping over commits in feature branches that have been merged in. We can do this because the branches retain their identities post-merge. The most widely used tool, tortoisehg, displays large numbers of commits densely, with the full tree structure and branch names on display by default. Commits can be referred to via their hash or by a simple incrementing integer (which is only valid on your local clone, but still, this makes things easier for local work).

So we keep all those typo commits - they're usually in feature branches anyway since we don't merge until features are done and we try to keep the default branch functioning. If a merge breaks something, we bisect on the default branch only, which will tell us which merge commit broke it.

I'm still sad that git won the VCS wars over mercurial.


> GitHub doesn't even display the branching structure in its commit log, which also shows a woefully small number of commits per page, further incentivising squashing/editing. Many tools (some are better than others) display commit history in a similarly non-dense way, or in a way that implicitly discourages branching, e.g. gitk doesn't even display commits from other branches by default.

GitLab and Fisheye usually display the graph of branches and merges very well.

Also, I have this wonderful git alias:

lg = log --color --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr) %C(bold blue)<%an>%Creset' --abbrev-commit --date-order


> I'm convinced that the obsession with rewriting history is solely due to inadequate tools.

The article at the top literally said explicitly never rewrite public history, so what obsession are you talking about exactly? Git has what you want as long as you don’t mistake local operations before push as “history”, and instead only consider history to be commits that have been shared with other people. That makes more sense anyway, there’s nothing sacred to preserve in the arbitrary, noisy sequence of things I did while I was bumbling around on my machine before I push.

Git was designed with a toolset that shows every commit and lets you clean up your own work before you contribute it to public history. Its tools work well when you understand git’s design and use it the way it was intended. Git is not Mercurial, though, that’s true. Perforce isn’t Mercurial either.

Git can limit bisect to a single branch, and normally does skip branches until you want to descend into them. Don’t confuse losing the branch name with losing the branch: git doesn’t lose the branches, only the names, and only if you delete the names.
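
One way to get "mainline only" in plain Git is first-parent traversal. The following is a rough sketch in a throwaway repository, assuming feature branches are always merged into the mainline with --no-ff so that the mainline stays the first parent of every merge. The branch, file, and commit names are made up for illustration.

```shell
set -e
# Throwaway identity and repository; every name here is hypothetical.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main

echo 1 > f.txt; git add f.txt; git commit -q -m "Mainline start"

# A feature branch with a typo-fix commit that we keep rather than squash.
git checkout -q -b feature
echo 2 > g.txt;  git add g.txt; git commit -q -m "Feature: part 1"
echo 3 >> g.txt; git commit -q -am "Feature: fix typo"

# Merge with --no-ff so a merge commit is always recorded on main.
git checkout -q main
git merge -q --no-ff -m "Merge feature" feature

all=$(git rev-list --count HEAD)                      # every ancestor
mainline=$(git rev-list --count --first-parent HEAD)  # first parents only
echo "all=$all mainline=$mainline"
git log --first-parent --format=%s
```

Recent Git versions (2.29 and later) also accept --first-parent on git bisect start, which bisects over the merge commits only.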


I'm talking about attitudes in general and not disagreeing with the post.

I agree with the advice never to rewrite public history, and I totally agree with Linus's approach. He is in the minority with this attitude though, since never rewriting public history means never doing a squash merge and never rebasing a merge/pull request at merge time (both of which are common practice). I suspect even people who endorse the idea of never rewriting public history kind of don't think of the fork from which a pull request is coming as 'public' even if it literally is.

I love the kernel's "keep-all" approach and want more people to use it, I bet if they did the tools would improve to actually work better with that style - whereas right now I think the tools are driving the workflow instead.


> right now I think the tools are driving the workflow instead.

Okay that's fair, I think that's true. To some degree that has to be true no matter which tools you use, right? Even if it's Mercurial.

I haven't personally seen squash merges and rebasing pull requests being used on pull requests of large multi-person branches very commonly, are you saying that's common? I agree that there's common practice of using squash merges and rebasing on private branches, or branches that contain commits by only a single person and contain only code commits.

I'm looking for clarity, not disagreeing with you. The 'principled' argument for never using rebase is almost always attacking the branching practices of individuals and not teams. There definitely is a fuzzy line between pushing to your own branch that is visible to others, but nobody else touches. I'd normally consider that case private, not public, even if it's "literally" public.

I don't feel like I'm hearing what the tangible advantages of never modifying history are. Why is history considered more sacred than clarity of semantic intent? People make mistakes and noise, a lot, so why shouldn't the tooling allow fixing mistakes and cleaning up irrelevant noise after the fact, as long as it doesn't affect others?

Edit: I'm realizing another conceptual line to draw beyond what makes history "public": the question is one of whether you're going to rewrite history out from underneath other people. If not, and you're the only person affected, then you made the local history in the first place, there's no principled reason to prevent you from updating your own work, because it's equivalent to making the same change before committing. If your rewrite is modifying commits that other people already have, then you're inflicting damage on other people. You may cause them to have merge conflicts, you may be modifying code dependencies they're working on but haven't pushed, it's bad for very practical reasons. Using this lens of what other people depend on, does that help clarify your examples of squash merges and rebased pull requests?


> I haven't personally seen squash merges and rebasing pull requests being used on pull requests of large multi-person branches very commonly, are you saying that's common?

I have. Github makes it quite easy to fall into this.

> Why is history considered more sacred than clarity of semantic intent? People make mistakes and noise, a lot, so why shouldn't the tooling allow fixing mistakes and cleaning up irrelevant noise after the fact, as long as it doesn't affect others?

I've got a concrete example of where it causes problems: code reviews. If you've reviewed a branch at a specific commit, and standard practice is to squash merge into master, or to otherwise allow rebases after the review point, you lose the confidence that what's on master is actually what was reviewed. I've seen cases where people got into the habit of getting reviews done, then doing a squash rebase locally, and including tidy-up commits which had never been seen by anyone else before merging straight into master.

If you're in an environment where the rule is that Everything Must Be Reviewed, that's a problem: it's far too easy for an accidental bug to end up on master despite the code reviews and the automated tests on the preceding branch being green.

With the example above, I never would have seen the problem unless I'd been trying to use the git history to measure some statistics about how long it was taking us to get code reviews done. It was only because I was looking at the history commit by commit that it jumped out.


That's a good example, IMO, and yeah it should be very much frowned on (or outright disallowed) to modify an approved code review before pushing without further review. That is kind-of a code review workflow problem, more than a discussion of whether rebase should "never" be used though, right?

The company I work for now has both notifications for commits in code reviews, so everyone sees if you modify something after it being approved, and some repos also have lockdown features where the approved review is tagged and cannot be checked in if modified. So this can be solved with some tooling around code reviews, and git itself doesn't exactly add up to a modern code review toolset. This may be as much or more of a Github problem than a git problem... acknowledging that there's a large swath of developers that doesn't really know the difference between them.


It's a bit of both. If you don't have a strong "thou shalt not rebase" culture, it can be difficult to get people to accept the inconvenience of getting re-reviews on the branch they've just committed a typo-fix to, so you end up leaning on more complex tooling to force the issue.


> Git doesn't keep the name of a branch after it's merged...

Eh? This is trivial to change by specifying the "--no-ff" option to `git merge`, or by setting the config option "merge.ff" to false.


That's not what I'm talking about - after a no-ff merge, sure you can tell that there were two branches, but how can you tell which was which? Which was master and which was the feature branch?

You can use convention to store this information, e.g. the first parent is always master, or you can put the info in the merge commit message. But it's hacky.

In mercurial the branch name lives on forever, attached to those commits whether they are merged or not. "Closing" a branch in mercurial is just a hint that it isn't going to be used for the time being and so shouldn't be listed in tools that list branches, but doesn't actually remove the label from previous commits. So the commit history has the branch names still after a merge. This way you can say "show me all commits in the master branch" (as distinct from "show me all ancestors of the tip of the master branch") and this will exclude feature branches, and is ideal for bisecting.


I mean...ok, I guess if you allow committing directly to master, and you allow merging in both directions...sure, I see your point.

In our repos, we allow neither, so all non-merge commits are, by definition, on a feature branch on the right-hand-side.

I guess one person's hacky convention is another's primary workflow. ¯\_(ツ)_/¯


Does this mean you can't have merge commits in feature branches?

And what about maintaining a release branch? There are good reasons to be merging in both directions sometimes, even if you never commit directly to master.


Who benefits?

We do, simply because it's less cognitive overhead to read log/blame output when it's less chaotic. This doesn't mean that there's zero chaos in the commit history of course. But less than when we'd just never fix simple mistakes right away.


I figured this benefit would be much greater in a public-facing repo. After all, there are often a few commits fixing minor linting errors etc. just before the final merge. Rebasing these away reduces noise by a lot.

Rebasing on master/stable/release branches is another story altogether.


Because git bisect.


Accidental commits/pushes do happen, too. Keys, connection strings, even just mildly embarrassing comments that might show your penchant for Taylor Swift lyrics. As far as I understand it, it's impossible to remove those completely from existing repositories, but for new clones it is possible, and it's a damn sight preferable to a commit with the message "oops, didn't mean to share this" which _duplicates_ the accident.


I’m all for staying flexible and realizing the rule isn’t absolute, but fixing a typo or improving a commit is a very low bar. On my teams, I reserve rebase on public history for more serious emergencies, like someone accidentally checking in security keys.

It’s probably fine on a very small team, but pull with rebase doesn’t protect you from merge conflicts when someone rebases master underneath you. This is why the practice of rebasing public history is, and should generally be considered, an anti-pattern.


I get your point, but checking in security keys is cause to rotate keys, not to rebase and force push.


100% absolutely agreed if the keys were accidentally made public outside the org. If not, rotating keys is sometimes a bit more work than a quick rebase & push.

It was the best example of an exception to the rule I could think about, but personally, my threshold for whether rebasing master is ever allowed might be closer to 'no if team is larger than 3 people, otherwise try not to, think twice, and ask everyone first.'


Yeah, pretty much all projects are smaller than the use case Linux has; main differences are scale, number of contributors, hierarchy, etc.

In Linux, iirc, there's a lot more sharing code via email and the like, peer-to-peer. In the rest of the world, it goes through a central repository.

Anyway, a bit more flexible is that it's fine to rebase history and force push, as long as it's on your own branch or as long as everyone working on the same branch is aware that it's about to happen and acts accordingly. If they don't though, you're going to have a bad time.

The most important thing is to be conscious of what effect your actions will have on others. And of course, never do it on master - and if for some reason you have to, take a good look at what caused the problem in the first place. It's good practice in most projects to lock down master in whatever repository host you use (e.g. github) and only allow changes via pull requests.


If you read Linus's last message in the chain, he says pretty much the same: it's not black and white, and sometimes you need to break the rules.

(That said, it's still a good rule, and unless you know what you are doing you should follow it.)


In my opinion, the availability of the force-push option decreases in proportion to the number of users of a repo. In a repo where about 400 people work, we needed authorization from management to make a force push, and had to do it over a weekend. In my own projects I amend quite often and force push about 1 time in 10.


I think advice like this is always geared towards people who don't yet understand the reasons. It'll prevent them from making some real messes.

When you really understand the implications of changing public history, you will also know when you can break the rule.


> In practice: I only work in smaller teams

Are you using git in an open source project, or is this proprietary software within a single company?


Mostly the latter, which of course makes it easier. The open source parts are either only open source in that they're publicly available but in reality there's no other users (we know of) or there aren't many users and they just deal with it e.g. in case of a fork of a more popular project where we have a feature branch which gets rebased on upstream regularly.


My workflow consists of working on some topic locally and committing in the smallest topical units possible. Then, as I make fixes for typos or plain wrong logic, I just amend or fix up. The longer I stay branched out, the bigger the likelihood of having a lot of fixup commits. Then I use this awesome magic: git rebase somehash^ -i --autosquash. It reorders all my fixups automatically, so it is painless to consolidate everything before pushing. Maybe it is well known, but this fixup/autosquash thing changed my productivity a lot.
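
For anyone who hasn't tried it, here is a minimal sketch of that fixup/autosquash flow in a throwaway repository. The file names and commit messages are invented for illustration, and GIT_SEQUENCE_EDITOR=true is only there to accept the generated todo list without opening an editor. In real use you would review the list interactively.

```shell
set -e
# Throwaway identity and repository; every name here is hypothetical.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_EDITOR=true
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main

echo base > base.txt; git add base.txt; git commit -q -m "Base"
echo one > a.txt;     git add a.txt;    git commit -q -m "Add a"
echo two > b.txt;     git add b.txt;    git commit -q -m "Add b"

# Later: fix a mistake that belongs to "Add a" and record it as a fixup.
echo ONE > a.txt
git add a.txt
git commit -q --fixup=HEAD~1    # creates a commit titled "fixup! Add a"

# --autosquash moves the fixup right after its target and folds it in.
GIT_SEQUENCE_EDITOR=true git rebase -q -i --autosquash HEAD~3

commit_count=$(git rev-list --count HEAD)
subjects=$(git log --format=%s)
echo "$subjects"
```

The history ends up with the three original commits only, and the correction is folded into "Add a".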


I don't think this feature is very well-known, in fact I've seen it mentioned before, thought "how useful! I will absorb this into my brain now" – and promptly forgot about it.

I think there is a critical mass of advanced git features required to be really fluent in git, and there is a sizeable fraction of everyday git users who simply haven't made it all the way there yet. Some teams have at least one person who has enough git features under their belt to recognize when merge conflicts are being created haphazardly and unnecessarily, to be really particular about the shape of the commit history and aware of which merges can cause those conflicts, and to occasionally show the rest of the team some of these tricks or bail someone out when things go haywire. But at least in my experience, spreading all of this knowledge to the rest of the team is a long, slow process.


I do this all the time: creating many small commits, fixing up, squashing, etc. To keep things orderly I also usually run rebase -i fairly often, so as not to end up with more fixup commits than fit easily on screen. But yes: this workflow seems to suit me particularly well and changed the way I work with git. It tends to lead to a nicely crafted, clear, and easy-to-follow history of all the steps of adding a new feature and/or refactoring. It made me more productive and seems to cause fewer bugs and less mess overall, because I review pieces of code more often.


If you want clean git history, you must burn "git merge".

No commit should ever have more than one parent.


That might be clean, but it's not history.


I'm not very experienced with git. I use `git rebase` to pull changes from a remote into my private repo in order to prepare pushing my changes. My understanding of rebase is that it takes a commit history with one parent and replays them all onto a different parent. So I don't understand how rebasing would destroy history. Could someone explain what the issue is?


In my opinion source control history is much overrated. Instead of documenting your diffs, you should just document the code itself.

Every developer has some limited time budget for writing documentation. If you spend lots of time and frustration trying to get the git history correct, you have less time left over for actual code documentation.

> See? All the rules really are pretty simple.

Hmmmm.


I'm guessing you've never had to find the source of a bug via git bisect.


'Commit to master, that way merging is somebody else's problem'


I bet a lot of people will find this advice completely confusing. Emailing a patch?


Even if not many developers use an email-based workflow for their own work, I think it’s pretty well-known that the Linux kernel does.


I suspect you're overestimating the number of developers who know or use Linux. Many developers use Git on Windows and don't know anything about Linux, let alone about the workflow of the kernel team.


While GitHub and its clones have taken over, Git was intended to be used with emailed patches. The Linux kernel is a good example of a project using Git that way. It’s reasonably common among mature projects and it’s really not that hard: https://git-send-email.io/


The git project uses the same workflow for development that the Linux kernel does. That is, using a mailing list to submit and discuss commits before merging them into the maintainer's repository.


Sure. But many users of git don't know (or care) about the workflow of the git team. They know github. Maybe gitlab. Or bitbucket. They could go their entire development careers without ever sending a patch as an email, instead relying on feature branches and pull requests.

This could be completely alien to them, and it seems wrong to downvote srg for pointing that out.


git format-patch and git am are still very useful features, even if you're not Linus, not a Linux developer, or on Windows.

Basically you can serialize your git history as patches, transmit them any way you like (including email), and seamlessly re-integrate them as fully complete git commits.

I sometimes use this even when working only with myself, i.e. if I want to test a change on multiple machines before I'm ready to push. Then I can use normal git workflow to rebase, etc.
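
A small sketch of that round trip, assuming two local clones stand in for two machines. The directory names (src, dst, patches) and commit messages are made up for the example.

```shell
set -e
# Throwaway identity and repositories; every name here is hypothetical.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
git init -q "$work/src"
cd "$work/src"
git symbolic-ref HEAD refs/heads/main
echo base > f.txt; git add f.txt; git commit -q -m "Base"

# A second clone standing in for "another machine".
git clone -q "$work/src" "$work/dst"

# Two local commits that have not been pushed anywhere.
echo change >> f.txt; git commit -q -am "Add a change"
echo more >> f.txt;   git commit -q -am "Add more"

# Serialize the last two commits as patch files (one file per commit).
git format-patch -q -2 -o "$work/patches"

# Re-create them as real commits on the other side.
cd "$work/dst"
git am -q "$work/patches"/*.patch

applied=$(git log --format=%s -2)
src_tree=$(git -C "$work/src" rev-parse 'HEAD^{tree}')
dst_tree=$(git -C "$work/dst" rev-parse 'HEAD^{tree}')
echo "$applied"
```

The commit messages and authorship travel with the patches, and the resulting tree is identical on both sides.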


Is this all about making it possible to do delta debugging with git commits?


Please add (2009) next time.


Isn't that obvious? How else could it be?


Look at the cpython history. There is a major change from Mercurial to Git. It's not that Mercurial doesn't support the "rebase" workflow. It does discourage it though. Personally, I much prefer the "rebase your WIP changes, avoid merges" style. It is much easier to work with the resulting history.


I never got this argument. What's the point of a "clean history"? History is a log of work, not something meant to be a nice read.


So said log of work is not to be read? Why even bother with commit messages then?

It might not be useful for you or your projects, but that's up to you. Linux is the other side of the coin though - they NEED to be able to read a commit from ten years ago, see what was changed (the diff), why it was changed, and by who it was changed and signed off. Take a look at their repository, e.g. via github for ease of access. Here's a random commit: https://github.com/torvalds/linux/commit/e1e54ec7fb55501c33b.... Even with my complete lack of knowledge of kernel development, I can tell that the minor code change is backed up by a lot of reasoning and intent.

But, it's a wholly different use case. Most applications I've worked with are basically webapps that will be thrown away within five years, where it's not as important.


> Why even bother with commit messages then?

For me, it's all about understanding context of a change, the "why" (commit message) not "what" (code comments).

I can run regression tests to find where things broke and, with the messages, have a good understanding of the context and mindset the developer was working in.

In my editor, I can select some code and click "show history for selection" and get a complete log of what happened on those lines. If the commit messages are good, I'll have an understanding of the context of the change.

Missing commit messages usually result in "I don't remember" type emails from the author when I inevitably ask them why they made a change.


"Commits ordered by when I figured out a bug" has a lot less value than "Commits where changes are grouped by semantic logic".

Stuff like git rebase doesn't work so well if you have a bunch of busted WIP commits, either.

Maybe you don't commit that often, but the people I work with (and I) commit pretty often, so it's easy to end up with outright mistaken commit messages like "fix X" followed by "actually fix X". Going through a rebase so that "fix X" really means "fix X" will be great for a future debugging session with git blame.


In addition, it makes cherry picking bugfixes to older releases much easier.


On the other hand, "fix x" followed by a couple lines of rationale, then "actually fix x" with a discussion of why the previous fix was wrong is more useful to someone trying to understand x.


I think the discussion is valuable. I think adding those insights to a wiki 'Pitfalls' page or whatever is valuable.

Off the top of my head, the cases where I'm looking at Git logs are:

1. Code Review. Most of the time I'm reviewing code is looking at a diff. But obviously one of "fix" and "actually fix" is redundant. A clean history also benefits if I want to focus in on one of the commits.

2. Annotation/Blame. If I'm debugging through some issue and looking at older changes, it's nice if coupled changes are in the same commit.

A warts-and-all history has some advantages over rewriting git history (e.g. you could find patterns of where "actually fix" happens and try and improve those), but rewriting history makes the log a better communication tool.


Yeah this is legitimate

I think it’s maybe the difference between emotional truth and literal truth. The emotional truth is the valuable one so rebasing (which doesn’t mean squashing to one commit!) can mean you can clean out the noise and get something valuable.


As a codebase gets larger and older, git bisect becomes invaluable. If you have not used it, you don't know what you're missing.

A pre-condition for git bisect working is that each commit must run well enough to successfully test for whatever behavior change is being tested for. Otherwise, you'll identify some commit where the code went from working to not working, but if it's just some idiotic typo instead of actually the change that you are looking for, you've lost.

I'd rather have a git bisectable history that correctly reflects a steady progression of the product than one that records every typo some developer made for posterity. That I typo'ed a variable name in a version that never shipped to anybody and then had to commit "FIXUP" is not useful information. A clean commit that changes behavior, but is subsequently revealed to cause a regression against a test that won't be written for another two years, is incredibly valuable.

Some of you say you don't get the appeal of a "clean" history; I say back I don't get the appeal of a pedantically historical view of history. I have never gone digging through some old history to figure out whether or not some particular piece of documentation was at some point in the past misspelled. Completely uninteresting. I have never cared about the process of how a particular thing was arrived at, with all the false starts, not to mention that if I did, trying to read a series of patches isn't how I'd want to do it. I have cared about being able to cherry-pick a single clean commit to backport some feature, I have cared about git bisect, and I have cared about the ability to revert a particular feature via "git revert" without having to figure out which discontinuous set of half-a-dozen patches need to be reverted because almost no commits from the past can be reverted without making fundamental breaks to the build, not for fundamental reasons, but because they introduce typos and break variable names and re-do accidental file deletions, etc.

History as a log of work is way less interesting than history as a queryable and manipulable data structure representing the various mostly-valid states of your project, and the ability to manipulate those mostly-valid states at a project level. Composing two valid states of the project together to get a third is incredibly powerful, and when two valid states compose together to create an invalid state, there's real information of some kind there. This doesn't work if your history mostly consists of invalid states. Composing two invalid states of the project together to get another invalid state isn't a surprise, it produces and teaches nothing.

This is why, whenever I can, I have a git pre-commit hook that checks the compile of everything and runs all my test cases. It's better for that to be the habit, and to have to occasionally bypass it for some reason, than for the default to be allowing any ol' commit to fundamentally break whatever. Doing it in a CI system is fine too; I do what I can to keep the local tests working but it's not always possible. The key is just that something is done to ensure validity is maintained.
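
A rough sketch of such a hook, assuming a trivial stand-in check (rejecting staged lines that contain "FIXME") in place of a real compile-and-test step. The check and the file names are hypothetical; the hook location and --no-verify are standard Git.

```shell
set -e
# Throwaway identity and repository; every name here is hypothetical.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main

# Install the hook. A real one would build the project and run the tests.
mkdir -p .git/hooks
cat > .git/hooks/pre-commit <<'EOF'
#!/bin/sh
# Stand-in check: refuse commits whose staged changes add a FIXME line.
if git diff --cached | grep -q '^+.*FIXME'; then
  echo "pre-commit: staged changes still contain FIXME" >&2
  exit 1
fi
EOF
chmod +x .git/hooks/pre-commit

echo "clean" > a.txt; git add a.txt
git commit -q -m "Passes the hook"

echo "FIXME later" > b.txt; git add b.txt
if git commit -q -m "Should be rejected" 2>/dev/null; then
  hook_result=committed
else
  hook_result=blocked
fi

# The occasional deliberate bypass: --no-verify skips the pre-commit hook.
git commit -q --no-verify -m "Deliberate bypass"
total=$(git rev-list --count HEAD)
echo "hook_result=$hook_result total=$total"
```

Hooks live in each clone and are not transferred by clone or fetch, which is one reason to run the same check in CI as well.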


Not disagreeing with the overall premise but this specific case:

> but if it's just some idiotic typo instead of actually the change that you are looking for, you've lost

I’ve used `git bisect skip` in this situation loads of times without issue
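
Roughly what that looks like when automated with git bisect run, where a test script that exits with code 125 tells bisect to skip the current commit. This is a sketch in a throwaway repository; the BROKEN marker file and the check script are invented stand-ins for "does not build" and "the real test".

```shell
set -e
# Throwaway identity and repository; every name here is hypothetical.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main

echo ok > app.txt;   git add app.txt; git commit -q -m "c1: good"
echo ok2 >> app.txt; git commit -q -am "c2: more work"
touch BROKEN; git add BROKEN; git commit -q -m "c3: typo, does not build"
git rm -q BROKEN;    git commit -q -m "c4: fix typo"
echo BUG >> app.txt; git commit -q -am "c5: introduces regression"
echo ok3 >> app.txt; git commit -q -am "c6: unrelated"

# Test script kept outside the work tree so checkouts never touch it.
check=$(mktemp)
cat > "$check" <<'EOF'
#!/bin/sh
[ -e BROKEN ] && exit 125        # untestable commit: ask bisect to skip it
grep -q BUG app.txt && exit 1    # regression present: bad
exit 0                           # good
EOF

git bisect start HEAD HEAD~5 >/dev/null   # bad = c6, good = c1
git bisect run sh "$check" >/dev/null
first_bad=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset >/dev/null 2>&1
echo "$first_bad"
```

Skipping works here because the untestable commit is not adjacent to the regression. When skipped commits sit right next to the first bad one, bisect can only report a range of candidates.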


What’s the argument for messy work? History is something everyone on the team needs to use every day for working. Messy history on a large team can be like a thousand little paper cuts, it drains everyone’s time little by little, and it increases the probability of mistakes.


History is extremely useful for being read. For backporting between trees, looking at historical development of a feature, to bisecting an introduced bug or regression.


If it's not readable, you are not going to read it. A readable git history is really valuable. The day you use git bisect, you will understand that.


If you're working with a group of people you should never git pull prior to a commit and always git push -f


Joking aside, "git pull -r" is pure magic in the face of upstream rewrites. It always does the right thing:

https://mergebase.com/doing-git-wrong/2018/03/07/fun-with-gi...

After I wrote that blog post someone told me the reason for the magic is the use of "git merge-base --fork-point" under the hood.


`pull --rebase` is nothing, go back and read the history of `rebase --preserve-merges` and now `rebase --rebase-merges`

I have worked with both, and the new one is much better. Of course this only matters if the branch you are rebasing has merges in it (so in all likelihood, you must be a release manager to need features like this)

Between rebase-merges and --onto, I don't spend hardly any time fixing up bad merges anymore.


I never type

-f or --force

on the same line as git. If someone feels it needs to be done, I try to stop them; failing that, I make sure they're the ones doing it. I don't need that kind of risk.


--force-with-lease is much safer if you're working with others on a shared branch. :3
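
A sketch of what the lease protects against, assuming a local bare repository as the shared remote and two clones for two people. The names alice and bob and the commit messages are invented for the example.

```shell
set -e
# Throwaway identity and repositories; every name here is hypothetical.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git -C "$work/remote.git" symbolic-ref HEAD refs/heads/main

# Alice publishes the shared branch.
git init -q "$work/alice"
cd "$work/alice"
git symbolic-ref HEAD refs/heads/main
git remote add origin "$work/remote.git"
echo v1 > f.txt; git add f.txt; git commit -q -m "Shared work"
git push -q origin main

# Bob clones, adds a commit, and pushes it.
git clone -q "$work/remote.git" "$work/bob"
cd "$work/bob"
echo bob >> f.txt; git commit -q -am "Bob's commit"
git push -q origin main

# Alice, who has not fetched since, rewrites her commit and force-pushes.
cd "$work/alice"
git commit -q --amend -m "Shared work (reworded)"
if git push -q --force-with-lease origin main 2>/dev/null; then
  lease_result=pushed
else
  lease_result=rejected
fi

remote_tip=$(git -C "$work/remote.git" log -1 --format=%s main)
echo "lease_result=$lease_result remote_tip=$remote_tip"
```

The push is refused because the remote branch no longer matches what Alice's clone last saw, so Bob's commit survives. A plain --force would have overwritten it.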



