At a former startup, our product was built on Chromium. As the build/release engineer, one of my daily responsibilities was merging Chromium's changes with ours.
Just performing the merge and conflict resolution was anywhere from 5 minutes to an hour of my time. Ensuring the code compiled was another 5 minutes to an hour. If someone on the Chromium team had significantly refactored a component, which typically occurred every couple weeks, I knew half my day was going to be spent dealing with the refactor.
The Chromium team at the time was many dozens of engineers, landing on the order of a hundred commits per day. Our team was a dozen engineers landing maybe a couple dozen commits daily. A large merge might have on the order of 100 conflicts, but typically it was just a dozen or so conflicts.
Which is to say: I don't understand how it's possible to deal with a merge that has 1k conflicts across 10k changes. How often does this occur? How many people are responsible for handling the merge? Do you have a way to distribute the conflict resolution across multiple engineers, and if so, how? And why don't you aim for more frequent merges so that the conflicts aren't so large?
(And also, your merge tool must be incredible. I assume it displays a three-way diff and provides an easy way to look at the history of both the left and right sides from the merge base up to the merge, along with showing which engineer(s) performed the change(s) on both sides. I found this essential many times for dealing with conflicts, and used a mix of the git CLI and Xcode's opendiff, which was one of the few at the time that would display a proper three-way diff.)
If you use git-mediate, you can re-apply those massive changes on the conflicted state, run git-mediate - and the conflicts get resolved.
For example: if you have 300 conflicts due to some massive rename, you can type in:
git-search-replace.py -f oldGlobalName///newGlobalName
Successfully resolved 377 conflicts and failed resolving 1 conflict.
<1 remaining conflict shown as 2 diffs here representing the 2 changes>
When maintaining multiple release lines and moving fixes between them:
Don't use a bad branching model. Things like "merging upwards" (=committing fixes to the oldest branch requiring the fix, then merging the oldest branch into the next older branch etc.), which seems to be somewhat popular, just don't scale, don't work very well, and produce near-unreadable histories. They also incentivise developing on maintenance branches (ick).
Instead, don't do merges between branches. Everything goes into master/dev, except stuff that really doesn't need to go there (e.g. a fix that only affects a specific branch(es)). Then cherry pick them into the maintenance branches.
I also observed that getting rid of merging upwards moved the focus of development back to the actual development version (=master), where it belongs.
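To make the "fix on master, then cherry-pick into the maintenance branch" flow concrete, here is a minimal sketch in a throwaway repo. The branch name release-1.0 and the file names are made up for illustration; the only real point is that the fix lands on master first and is then copied over with `git cherry-pick -x`, which records the original commit id in the message.

```shell
#!/bin/sh
set -e
# Throwaway identity so the demo does not depend on global git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master   # force the branch name to "master"

echo "v1" > app.txt
git add app.txt
git commit -q -m "initial release"
git branch release-1.0                    # hypothetical maintenance branch

# Development continues on master only.
echo "new feature" > feature.txt
git add feature.txt
git commit -q -m "feature for next release"

# The fix is committed to master first...
echo "v1 fixed" > app.txt
git commit -q -am "fix: correct app.txt"
fix=$(git rev-parse HEAD)

# ...and then cherry-picked into the maintenance branch.
git checkout -q release-1.0
git cherry-pick -x "$fix" >/dev/null

git log --oneline release-1.0
```

The maintenance branch ends up with the fix but not the unrelated feature commit, and no merge commit is created on either branch.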
From experience with many developers using this method, conflict resolution errors went down to virtually zero, and conflict resolution time has improved by 5x-10x.
You often take the apparent diff, apply it to the other version, and then git-mediate tells you "oops, you forgot to also apply this change". And this is one of the big sources of bugs that stem from conflict resolutions.
Another nice thing about git-mediate is that it lets you safely do the conflict resolution automatically, e.g: via applying a big rename as I showed in the example, and seeing how many conflicts disappear. This is much safer than manually resolving.
The only scenario in which I can see git-mediate working is if you don't actually resolve conflicts at all but instead just do a project-wide search&replace, but that's only going to handle really trivial conflicts, and even then if you're not actually looking at the conflict you run the risk of having the search & replace not actually do what it's supposed to do (e.g. catching something it shouldn't).
This is patently and empirically false:
1) See the automated rename example. How do you gain the same safety and ease of resolution without git mediate? This is, unlike you say, an incredibly common scenario. After all, conflicts are usually due to very wide, mechanical changes. The exact kinds of changes that are easy to re-apply project-wide.
It is true for not only renames, but also whitespace fixes which are infamous for causing conflicts and thus inserting bugs, and due to that are not allowed in many collaborative projects!
2) Instead of having to tediously compare the 3 versions to make sure you haven't missed any change when resolving (a very common error!) you now have to follow a simple guideline: Apply the same change to 2 versions.
This guideline is simple enough to virtually never fuck it up, unlike traditional conflict resolution which is incredibly error-prone. Your complaint that git mediate does not validate you followed this one guideline is moot, since this guideline is so easy to not fuck up - compared to the rest of the resolution process.
If you follow this guideline - git-mediate has tremendous value. It makes sure you did not forget any part of the change done by either side - and streamlines every other part of the conflict resolution process:
A) Takes my favorite editor directly to the conflict line
B) Shows me 2 diffs instead of 3 walls of text
C) Lets me choose the simpler diff to apply to the other sides, making a very minimal text change in my editor.
D) Validates that the change I applied was the last one (or decreases the size of the diff otherwise)
E) Does the "git add" for me, and takes me directly to the next conflict
This has been used by dozens of people, who can all testify that it:
A) Made conflict resolution easy and convenient
B) Reduced the error rate to zero (I don't remember the last bug inserted in a merge conflict)
C) Sped the process up by an order of magnitude
Is it? I'm not sure if I've ever had a conflict that would be resolved by a global find & replace. Globally renaming symbols isn't really all that common. In my experience conflicts are not "usually due to very wide, mechanical changes", they're due to two people modifying the same file at the same time.
> It is true for not only renames, but also whitespace fixes which are infamous for causing conflicts …
Most projects don't go doing whitespace fixes over and over again. In projects that do any sort of project-wide whitespace fixes, that sort of thing is usually done once, at which point the whitespace rules are enforced on new commits. So yes, global whitespace changes can cause conflicts, but they're rather rare.
> Instead of having to tediously compare the 3 versions to make sure you haven't missed any change when resolving (a very common error!) you now have to follow a simple guideline: Apply the same change to 2 versions.
> This guideline is simple enough to virtually never fuck it up
You know what's even simpler? "Apply the same change to 1 version". Saying "Fix the conflict, and then do extra busy-work on top of it" is not even remotely "simpler". It's literally twice the amount of work.
The only thing git-mediate appears to do is tell you if a project-wide find&replace was sufficient to resolve your conflict (and that's assuming the project-wide find&replace is even safe to do, as I mentioned in my previous comment). If you're not doing project-wide find&replaces, then git-mediate just makes your job harder.
From reading your "process" it appears that all you really need is a good merge tool, because most of what you describe as advantageous is what you'd get anyway when using any sort of reasonable tool (e.g. jumping between conflicts, showing correct diffs, making it easy to copy lines from one diff into another, making it easy to mark the hunk as resolved).
Then it is likely a reversal of cause & effect. People are more reluctant to make wide sweeping changes such as renames because they're worried about the ensuing conflicts.
> Most projects don't go doing whitespace fixes over and over again
Again, for similar reasons. Projects limp around with broken indentation (tabs/spaces), trailing whitespaces, dos newlines, etc - because fixing whitespace is against policy. Why? Conflicts.
> You know what's even simpler? "Apply the same change to 1 version".
It sounds simpler, but not fucking it up is not simple at all, as evidenced by conflicts being a constant source of bugs.
When you look at the 3 versions, and "apply the diff to 1 version" you actually think you apply the diff to 1 version. You're applying the perceived diff to 1 version, which often differs from the actual diff as it may include subtle differences that aren't easily visible. git-mediate detects this through the double-accounting of applying the perceived diff to both versions, which validates that the perceived diff equals the actual diff.
Without git-mediate? At best you bring up build errors. At worst, revive old bugs that were subtly fixed in the diff you think you applied.
> then git-mediate just makes your job harder.
You're talking out of your ass here. Me and 20 other people have been using git-mediate and it's been a huge game changer for conflict resolution. Every single user I've talked to claims huge productivity/reliability benefits from using it.
> all you really need is a good merge tool
I've used "good" merge tools a lot. They're incredibly inferior to my favorite text editor:
* Text editing within them is tedious and terrible
* Their supported actions of copying whole lines from one version to the other are useless 90% of the time.
Let me ask you this:
What percentage of the big conflicts you resolve with a "good merge tool" - build & run correctly on the first run after resolution?
For git-mediate, that is easily >99%.
I disagree. People generally don't do project-wide find&replaces because it's just not all that common to want to rename a global symbol.
> Projects limp around with broken indentation (tabs/spaces), trailing whitespaces, dos newlines, etc - because fixing whitespace is against policy. Why? Conflicts.
You seem to have completely missed the point of my comment. I'm not saying projects limp along with bad whitespace. I'm saying projects that decide upon whitespace rules typically do a single global fixup and then simply enforce whitespace rules on all new commits. That means you only ever have one conflict set due to applying whitespace policy, rather than doing it over and over again as you suggested.
> You're applying the perceived diff to 1 version, which often differs from the actual diff as it may include subtle differences that aren't easily visible.
Then you're using really shitty merge software. Any halfway-decent merge tool will highlight all the differences for you.
> Without git-mediate? At best you bring up build errors. At worst, revive old bugs that were subtly fixed in the diff you think you applied.
And this is just FUD.
It's simple. Just stop using Notepad.exe to handle your merge conflicts and use an actual merge tool. It's pretty hard to miss a change when the tool highlights all the changes for you.
> You're talking out of your ass here.
And you're being uncivil. I'm done with this conversation.
You just re-apply either side of the changes in the conflict (Base->A, or Base->B) and the conflict is then detected as resolved. Reapplying (e.g: via automated rename) is much easier than what people typically mean by "manually resolving the conflict".
Also, as a pretty big productivity boost, it prints the conflicts in a way that lets many editors (sublime, emacs, etc) directly jump to conflicts, do the "git add" for you, etc. This converts your everyday editor into a powerful conflict resolution tool. Using the editing capabilities in most merge tools is tedious.
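The "re-apply one side's change and the conflict goes away" idea can be illustrated without git-mediate itself. The sketch below is not how git-mediate is implemented; it only uses `git merge-file` on three made-up files to show that once the mechanical change (a rename here) has been applied to the base and to the other side as well, the three-way merge has nothing left to conflict on.

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
cd "$work"

# Base version, side A (did the rename), side B (changed the constant).
printf 'total = oldGlobalName + 1\n' > base.txt
printf 'total = newGlobalName + 1\n' > ours.txt
printf 'total = oldGlobalName + 2\n' > theirs.txt

# A plain three-way merge conflicts; --diff3 also shows the base section.
conflicts=0
git merge-file -p --diff3 ours.txt base.txt theirs.txt > conflicted.txt || conflicts=$?
cat conflicted.txt

# Re-apply side A's change (the rename) to the base and to side B.
sed 's/oldGlobalName/newGlobalName/g' base.txt > base.renamed.txt
sed 's/oldGlobalName/newGlobalName/g' theirs.txt > theirs.renamed.txt

# Now side A equals the base, so the merge is simply side B's change.
git merge-file -p ours.txt base.renamed.txt theirs.renamed.txt > resolved.txt
cat resolved.txt
```

The exit status of `git merge-file` is the number of conflicts, which is why the first call is allowed to fail and the second is not.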
(In the Sun model one does rewrite project branch history, but one also leaves behind tags, and downstream developers use the equivalent of git rebase --onto. But the true upstream never rewrites its history.)
* latest feature commit #3 (feature)
* | more master commits you wanted to include (master)
| * feature commit #2
| * merge
* | master commits you wanted to include
| * feature commit #1
* original master tip
* master history...
* feature commits #1, #2 and #3 (newbranch)
| * latest feature commit (feature)
| * merge
* | more master commits you wanted to include (master)
| * feature commit #2
| * merge
* | master commits you wanted to include
| * feature commit #1
* original master tip
* master history...
* feature commits #1, #2 and #3 (master, newbranch)
* more master commits you wanted to include
* master commits you wanted to include
* original master tip
* master history...
E.g., here's what the feature branch goes through:
feature$ git tag feature_05
<time passes; some downstreams push to feature branch/remote>
feature$ git fetch origin
feature$ git rebase origin/master
feature$ git tag feature_06
And here's what a downstream of the feature branch goes through:
downstream$ git fetch feature_remote
<time passes; this downstream does not push in time for the feature branch's rebase>
downstream$ git rebase --onto feature_remote/feature_06 feature_remote/feature_05
Easy peasy. The key is to make it easy to find the previous merge base and then use git rebase --onto to rebase from the old merge base to the new merge base.
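For anyone who wants to try this, here is a self-contained sketch of the same tag-then-`rebase --onto` dance. As a simplifying assumption it uses local branches in one throwaway repo instead of real remotes, so `feature_remote/feature_05` becomes plain `feature_05`; the branch and file names are invented.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master

echo base > base.txt
git add base.txt
git commit -q -m "master: base"

# The feature branch does some work and tags its current tip.
git checkout -q -b feature
echo f1 > feature.txt
git add feature.txt
git commit -q -m "feature: commit 1"
git tag feature_05

# A downstream developer builds on top of feature_05.
git checkout -q -b downstream
echo d1 > downstream.txt
git add downstream.txt
git commit -q -m "downstream: local work"

# Meanwhile master moves on, and the feature branch rebases and re-tags.
git checkout -q master
echo m2 > master2.txt
git add master2.txt
git commit -q -m "master: more upstream work"
git checkout -q feature
git rebase -q master
git tag feature_06

# Downstream moves only its own commits from the old base to the new one.
git checkout -q downstream
git rebase -q --onto feature_06 feature_05
git log --oneline
```

The old tag is what makes this cheap: it names the previous base exactly, so only the downstream's own commits get replayed.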
Everybody rebases all the time. Everybody except the true master -- that one [almost] never rebases (at Sun it would happen once in a blue moon).
It's open source. Also:
Also KDiff3. Struggling to remember the other ones I used to try unfortunately.
We ended up with a script that created a new git repository, checked out the base version of the code there. Then created a branch for our updated release and another for their codebase. Then attempted to do a merge between the two. For any files which couldn't be automatically merged it created a set of 4 files, the original then a triplet for doing a 3-way merge.
This is also when I bought myself a copy of Beyond Compare 4, which fits perfectly for the price and feature set for what we needed.
In any case, rebasing is better than merging. (Rebasing is a series of merges, naturally, but still.)
This isn't to say that KDiff3 isn't great - it is.
Still an extremely useful program.
You look at who caused conflicts and send out emails. Don't land the merge until all conflicts are resolved. People who don't resolve their changes get in trouble.
Not perfect, but there it is.
I worked on a chromium based product as well and had the exact same problem. Eventually we came up with a reasonable system for landing commits and just tried our best to build using chromium, but not having actual patches. Worked okay, not great. Better than the old system of having people do it manually/just porting our changes to each release.
Edit: If you're interested in helping out, e.g. porting the client to Windows, stop by the IRC channel #internetarchive.bak on efnet.
But that's not what I came in to say.
I came in to describe the rebase (not merge!) workflow we used at Sun, which I recommend to anyone running a project the size of Solaris (or larger, in the case of Windows), or, really, even to much smaller projects.
For single-developer projects, you just rebased onto the latest upstream periodically (and finally just before pushing).
For larger projects, the project would run their own upstream that developers would use. The project would periodically rebase onto the latest upstream. Developers would periodically rebase onto their upstream: the project's repo.
The result was clean, linear history in the master repository. By and large one never cared about intra-project history, though project repos were archived anyways so that where one needed to dig through project-internal history ("did they try a different alternative and found it didn't work well?"), one could.
I strongly recommend rebase workflows over merge workflows. In particular, I recommend it to Microsoft.
Moreover, resolution work during a rebase creates a fake history that does not reflect how the work was actually done, which is antithetical to the spirit of version control, in a sense.
A result of this is the loss of any ability to distinguish between bugs introduced in the original code (pre-rebase) vs. bugs introduced while resolving conflicts (which are arguably more likely in the rebase case since the total amount of conflict-resolving can be greater).
It comes down to Resolution Work is Real Work: your code is different before and after resolution (possibly in ways you didn't intend!), and rebasing to keep the illusion of a total ordering of commits is a bit of an outdated/misuse of abstractions we now have available that can understand projects' evolution in a more sophisticated way.
I was a dedicated rebaser for many years but have since decided that merging is superior, though we're still at the early stages of having sufficient tooling and awareness to properly leverage the more powerful "merge" abstraction, imho.
Ah, right, that's another reason to rebase: because your history is clean, linear, and merge-free, it makes it easier to pick commits from the mainline into release maintenance branches.
The "fake history" argument is no good. Who wants to see your "fix typo" commits if you never pushed code that needed them in the first place? I truly don't care how you worked your commits. I only care about the end result. Besides, if you have thousands of developers, each on a branch, each merging, then the upstream history will have an incomprehensible (i.e., _useless_) merge graph. History needs to be useful to those who will need it. Keep it clean to make it easier on them.
Rebase _is_ the "more powerful merge abstraction", IMO.
rebase : linked-list :: merge : DAG
If the work/repo is truly distributed and there isn't a single permanently-authoritative repo, a "clean, linear" history is nonsensical to even try to reason about.
In all cases it is a crutch: useful (and nice, and sufficient!) in simple settings, but restricting/misleading in more complex ones (to the point of causing many developers to not see the negative space).
You can get very far thinking of a project as a linked list, but there is a lot to be gained from being able to work effectively with DAGs when a more complex model would better fit the reality being modeled.
It's harder to grok the DAG world because the tooling is less mature, the abstractions are more complex (and powerful!), and almost all the time and money up to now has explored the hub-and-spoke model.
In many areas of technology, however, better tooling and socialization around moving from linked-lists (and even trees) to DAGs is going to unlock more advanced capabilities.
Final point: rebasing is just glorified cherry-picking. Cherry-picking definitely also has a role in a merge-focused/less-centralized world, but merges add something totally new on top of cherry-picking, which rebase does not.
You can have a hierarchical repo system (as we did at Sun).
Or you can have multiple hierarchies, contributing different series of rebased patches up the chain in each hierarchy.
Another possibility is that you are not contributing patches upstream but still have multiple upstreams. Even in this case your best bet is as follows: drop your local patches (save them in a branch), merge one of the upstreams, merge the other, re-apply (cherry-pick, rebase) your commits on top of the new merged head. This is nice because it lets you merge just the upstreams first, then your commits, and you're always left in a situation where your commits are easy to ID: they're the ones on top.
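A rough sketch of that multiple-upstreams recipe, under the assumption that both upstreams and the local patches live as branches in one throwaway repo (upstream-a, upstream-b, local and saved-patches are all made-up names):

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master

echo base > base.txt
git add base.txt
git commit -q -m "base"
git branch upstream-a
git branch upstream-b

# Local patches sit on top of the old common base.
git checkout -q -b local
echo mine > local.txt
git add local.txt
git commit -q -m "local: my patch"

# Both upstreams move on independently.
git checkout -q upstream-a
echo a > a.txt
git add a.txt
git commit -q -m "upstream-a: change"
git checkout -q upstream-b
echo b > b.txt
git add b.txt
git commit -q -m "upstream-b: change"

# 1. Save the local patches in a branch and remember where they started.
git checkout -q local
git branch saved-patches
fork=$(git merge-base local upstream-a)

# 2. Drop them, then merge the upstreams on their own.
git reset -q --hard upstream-a
git merge -q -m "merge upstream-b" upstream-b

# 3. Re-apply the local commits on top of the merged head.
git cherry-pick "$fork..saved-patches" >/dev/null
git log --oneline --graph
```

The local commits end up as the topmost entries in the log, which is what makes them easy to identify later.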
I agree that rebase == centralized. It's a math thing. If you rebase and someone has a clone of your work prior to the rebase chaos happens when they come together. So you have to enforce a centralized flow to make it work in all cases. It's pretty much provable as in a math proof.
Now, you don't want to do this with the ultimate upstream, though occasionally it happened at Sun with the OS/Net gate, usually due to some toxic commit that was best eliminated from the history rather than reverted, or through some accident.
But you'd be right to say that the Sun model was centralized in that there was just one ultimate upstream. (There was one per-"consolidation", since Solaris was broken up into multiple parts like that, but whatever, the point stands.)
Whereas with Linux, say, one might have multiple kernel gates kept by different gatekeepers. Still, if you're contributing to more than one of them, it's easier to cherry-pick (rebase!) your commits onto each upstream than to just merge your way around -- IMO. I.e., you can have a Linux kernel like decentralized dev model and still rebase.
However, as you can see from my comment in the previous paragraph, _rebase_ itself does not imply a centralized model.
a) a centralized model
b) you have to throw away any work based on the dag before the rebase
c) you have the history in the graph twice (which causes no end of problems).
(a) is the math way, (b) and (c) are ad-hoc hacks. You are well into the ad-hoc hacks, you've found a way to make it work but it includes "don't do that" warnings to users. My experience is that you don't want to have work flows that include "don't do that". Users will do that.
The event stack is a record of every tip that was ever present in this repo other than unpushed commits.
You were at cset 1234, you pull in 25 csets, the event stack has two events, 1 which points to 1234 and 2 which points at the tip after the pull.
You commit "wacked the crap out of it", then commit "fixed typo", then commit "added test", then commit $whatever. The event stack is
. which points at your current tip but is floating
Now you push. Your event stack is 1, 2, 3 and 3 points at the tip as of your push.
What about clone? You get your parent's event stack but other than that they are per repo.
The event stack is the linear history you want, it is the view that everyone wants. It's "what are the list of tips I care about in this repo?". Have a push that broke your tree but you don't know what the previous tip was because the push pushed 2500 commits? No problem. The event stack is a stack and there is a "pop" command that pops off the last change to the event stack. So you would just do "git pop" and see if that fixes your tree, repeat until it does.
We never built this in BitKeeper but I should try. If for no other reason than to show people you can have the messy (but historically accurate) history under the covers but have a linear view that is pleasant for humans.
Even with this, I'd want to rebase away "fixed typo" prior to pushing, and more, I'd want to:
- organize commits into logical chunks so that they might be cherry-picked (in the literal sense, not just the VCS sense) into maintenance release branches
- organize commits as the upstream prefers (some prefer to see test updates in separate commits)
IIUC BitKeeper does have a sort of branch push history, unlike git. Is this wrong?
Which begs the question "how do you do dev vs stable branches?" And the answer is that we have a central clone called "dev" and a central clone called "stable". In our case we have work:/home/bk/stable and work:/home/bk/dev. User repos are in work:/home/bk/$USER/dev-feature1 and work:/home/bk/$USER/stable-bugfix123.
We run a bkd in work:/home so our urls are
bk clone bk://work/dev dev-feature2
bk push bk://work/stable
The model works well until you have huge (like 10GB and bigger) repos. At that point you really want branches because you don't want to clone 10GB to do a bugfix.
Though we addressed that problem, to some extent, by having nested collections (think submodules that actually support all workflows, unlike git, they are submodules that work). So you can clone the subset you need to do your bugfix.
But yeah, there are cases where "a branch is a clone" just doesn't scale, no question. But where it does work it's a super simple and pleasant model
How do you deal with that?
Basically, if you pick a commit, and in the next line exec make && make check (or whatever) then that build & test command will run with the workspace HEAD at that commit. Add such an exec after every pick/squash/fixup and you'll build and test every commit.
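Here is a small sketch of that, assuming a throwaway repo and a stand-in "test" command that just records which commit it ran against (in real use it would be make && make check or whatever). `--exec` adds an exec line after every pick, and `GIT_SEQUENCE_EDITOR=true` accepts the generated todo list unchanged so no editor opens.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
log=$(mktemp)            # kept outside the work tree on purpose
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master

echo base > base.txt
git add base.txt
git commit -q -m "base"
git tag base

for n in 1 2 3; do
    echo "step $n" > "file$n.txt"
    git add "file$n.txt"
    git commit -q -m "step $n"
done

# Run the stand-in check after every pick; HEAD is the commit just picked.
GIT_SEQUENCE_EDITOR=true \
    git rebase -q -i --exec "git log -1 --format=%s >> '$log'" base

cat "$log"
```

If the exec command fails, the rebase stops at that commit so it can be fixed and continued with `git rebase --continue`.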
That way, looking at the history, you know what commits are stable/tested by looking at merge commits. Others that were brought in since the last merge commit can be considered intermediary commits that don't need to be individually tested.
(Of course, there's also the rebase-and-squash workflow which I've personally never used, but it accomplishes the same thing by erasing any intermediary history altogether.)
"Squashing" is just merging neighboring commits. I do that all the time!
Usually when I work on something, commit incomplete work, work some more, commit, rinse, repeat, then when the whole thing is done I rewrite the history so that I have changes segregated into meaningful commits. E.g., I might be adding a feature and find and fix a few bugs in the process, add tests, fix docs, add a second, minor feature, debug my code, add commits to fix my own bugs, then rewrite the whole thing into N bug fix commits and 2 feature commits, plus as many test commits as needed if they have to be separate from related bug fix commits. I find it difficult to ignore some bug I noticed while coding a feature just so that I can produce clean history in one go without re-writing it! People who propose one never rewrite local history propose to see a single merge commit from me for all that work. Or else the original commits that make no logical sense.
Too, I use "WIP" commits as a way to make it easy to backup my work: commit extant changes, git log -p or git format-patch to save it on a different filesystem. Sure, I could use git diff and thus never commit anything until I'm certain my work is done so I can then write clean history once without having to rewrite. But that's silly -- the end result is what matters, not how many cups of coffee I needed to produce it.
Suppose you want to push regression tests first, then bug fixes, but both together: this is useful for showing that the test catches the bug and the bug fix fixes it. But now you need to document that they go together, in case they need to be reverted, or cherry-picked onto release maintenance branches.
I think branch push history is really something that should be a first-class feature. I could live with using merge commits (or otherwise empty-commits) to achieve this, but I'll be filtering them from history most of the time!
My previous employer used a merge workflow (primarily because we didn't understand git very well at the time), and there were merge conflicts all the time when pulling new changes down or merging new changes in.
It was a headache to say the least. As the integration manager for one project, I usually spent the better part of an hour just going through the pull requests and merge conflicts from the previous day. I managed a team that was on the other side of the world, so there were always new changes when I started working in the morning.
"Amazing" is right. Sun was doing rebases in the 90s, and it never looked back.
Though, of course, rebasing is a win in general, even if you happen to have an awesome CR tool (a unicorn I've yet to run into).
Because your unpushed commits are on top, it's easy to isolate each set of merge conflicts (since you're going commit by commit) and to find the source of the conflicts upstream (with log/blame tools, without having to chase branch and merge histories).
When there is an undesired behavior that is hard to reason about, git-bisect can be used to determine the commit that first introduced it. With a normal merge, it will point to the merge commit, because it was the first time the 2 branches interacted. With a rebase, git bisect will point to one of the rebased commits, each of which already interacted with the branch coming before.
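As a quick illustration of bisecting a linear history, the sketch below builds a throwaway repo with five small commits, one of which flips a made-up status file from "pass" to "fail", and lets `git bisect run` find it. The file names and the check command are assumptions for the demo only.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master

echo pass > status.txt
git add status.txt
git commit -q -m "commit 0"
git tag good-base

# Five small linear commits; the third one introduces the regression.
for n in 1 2 3 4 5; do
    if [ "$n" -eq 3 ]; then echo fail > status.txt; fi
    echo "change $n" > "file$n.txt"
    git add -A
    git commit -q -m "commit $n"
done

# HEAD is known bad, good-base is known good; grep exits non-zero on "fail".
git bisect start HEAD good-base >/dev/null
git bisect run grep -q pass status.txt >/dev/null
first_bad=$(git rev-parse refs/bisect/bad)
git bisect reset >/dev/null 2>&1

git log -1 --format=%s "$first_bad"
```

With small rebased commits the answer is a single focused change rather than a merge commit that pulls in a whole branch.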
Resolving conflicts in a big merge commit vs in small rebased commits is like resolving conflicts in a distributed system by comparing only the final states, vs inspecting the actual sequences of changes.
Analogously: who cares how you think? Aside from psychologists and such, that is. We care about what you say, write, do.
We basically had a single branch per repo, and every repo other than the master one was a fork (ala github). But that was dictated by the limitations of the VCS we used (Teamware) before the advent of Hg and git.
So "branches" were just a developer's or project's private playgrounds. When done you pushed to the master (or abandoned the "branch"). Project branches got archived though.
In a git world what this means is that you can have all the branches you want, and you can even push them to the master repo if that's ok with its maintainers, or else keep them in your forks (ala github) or in a repo meant for archival.
But! There is only one true repo/branch, and that's the master branch in the master repo, and there are no merge commits in there.
For developers working on large projects the workflow went like this:
- clone the project repo/branch
- work and commit, pulling --rebase periodically
- push to the project repo/branch
- when the project repo/branch rebases onto a newer upstream the developer has to rebase their downstream onto the rebased project repo/branch
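In git terms, the first three steps above might look roughly like the sketch below. It assumes a local bare repo standing in for the project repo and a second clone (called gatekeeper here, a made-up name) that pushes a change while the developer is working.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

root=$(mktemp -d)
cd "$root"

# Stand-in for the project repo.
git init -q --bare project.git
git -C project.git symbolic-ref HEAD refs/heads/master

# Seed it with one commit.
git clone -q project.git gatekeeper 2>/dev/null
cd gatekeeper
git symbolic-ref HEAD refs/heads/master
echo base > base.txt
git add base.txt
git commit -q -m "project: base"
git push -q origin master
cd ..

# Developer: clone the project repo/branch.
git clone -q project.git dev

# Someone else pushes to the project repo in the meantime.
cd gatekeeper
echo other > other.txt
git add other.txt
git commit -q -m "project: other work"
git push -q origin master
cd ..

# Developer: work and commit, pull --rebase, then push.
cd dev
echo mine > mine.txt
git add mine.txt
git commit -q -m "dev: my work"
git pull -q --rebase origin master
git push -q origin master
git log --oneline
```

The developer's commit lands on top of the other work with no merge commit in between.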
Project techleads or gatekeepers (larger projects could have someone be gatekeeper but not techlead) would be responsible for rebasing the project onto the latest upstream.
To simplify things the upstream did a bi-weekly "release" (for internal purposes) that projects would rebase onto on either a bi-weekly or monthly schedule. This minimizes the number of rebases to do periodically.
When the project nears the dev complete time, the project will start rebasing more frequently.
For very large projects the upstream repo would close to all other developers so that the project could rebase, build, test, and push without having to rinse and repeat.
(Elsewhere I've seen uni-repo systems where there is no closing of the upstream for large projects. There a push might have to restart many times because of other pushes finishing before it. This is a terrible problem. But manually having to "close" a repo is a pain too. I think that one could automate the process of prioritizing pushes so as to minimize the restarts.)
Not necessarily implied by you; just checking.
I don't know what Oracle does nowadays with the gates that make up Solaris. My guess is that they still have a hodge podge, with some gates using git, some Mercurial, and some Teamware still. But that's just a guess. For all I know they may have done what Microsoft did and gone with a single repo for the whole thing.
I remember watching the process to choose a new VCS.
Mercurial is a fine choice but not the best choice. Even over the duration of the decision process the world was clearly moving overwhelmingly to Git.
I don't believe for one second this was a quick turnaround for them either. I've spoken to MS dev evangelists at work stuff over the past few years and they've continually said "it's going to get better", usually with a wry smile.
It bloody did too. They're nowhere near perfect, and the different product branches remain as disjointed as ever, but I'm genuinely impressed at the sheer scale of the organisational change they've implemented.
But this is seriously brave and well executed on their part.
Now that it is somewhat proven, maybe Google will leverage GVFS on Windows and create a FUSE solution for Linux.
And of course Google will shut it down in five years once they're bored of it.
Would you mind unpacking this? I'm intrigued.
I can imagine that part of it could be the need of a git clone --recursive, and everybody omits the --recursive if they don't know there are submodules inside the repository. There is another command to pull the submodules later but I admit it's far from ideal.
Now the article says that with windows they do branches.
If you're curious what it all ends up looking like, read this article. It's a fairly good overview and reasonably up to date.
- Complicates daily life for every engineer
- Becomes hard to make cross-cutting changes
- Complicates releasing the product
- There's still a core of "stuff" that's not easy to tease apart, so at least one of the smaller Windows repos would still have been a similar order of magnitude in most dimensions
This does seem like a negative, doesn't it?
But it's not. Making it hard to make cross-cutting changes is exactly the point of splitting up a repo.
It forces you to slow down, and—knowing that you can only rarely make cross-cutting changes—you have a strong incentive to move module boundaries to where they should be.
It puts pressure on you to really, actually separate concerns. Not just put "concerns" into a source file that can reach into any of a million other source files and twiddle the bits.
"Easy to make sweeping changes" really means "easy to limp along with a bad architecture."
I think that's one of the reasons why so much code rots: developers thinking it should be easy to make arbitrary changes.
No, it should be hard to make arbitrary changes. It should be easy to make changes with very few side effects, and hard to make changes that affect lots of other code. That's how you get modules that get smaller and smaller, and change less and less often, while still executing often. That's the opposite of code rot: code nirvana.
If you change the word "arbitrary" to "necessary" (implying a different bias than the one you went with) then all of a sudden this attitude sounds less helpful.
Similarly "easy to limp along with a bad architecture" could be re-written as "easy to work with the existing architecture".
At the end of the day, it's about getting work done, not making decisions that are the most "pure".
Windows ME/Vista/8 were terrible and widely hated pieces of software because of "getting things done" instead of making good decisions. They made billions of dollars doing it, don't get me wrong, but they've also lost a lot of market share and have been piling on bad sentiment for years. They've been pivoting, and that has nothing to do with "getting work done"; it comes from going back and making better decisions.
This attitude will lead to a total breakdown of the development process over the long term. You are privileging Work Done At The End Of The Day over everything else.
You need to consider work done at every relevant time scale.
How much can you get done today?
How much can you get done this month?
How much can you get done in 5 years?
Ignore any of these questions at your peril. I fundamentally agree with you about purity though. I'm not sure what in my piece made you think I think Purity Uber Alles is the right way to go.
As evidenced by Microsoft following the one repo rule and not being able to release any new software.
Wait, what?
Linux by itself is just a kernel and won't do anything for you without the rest of the bits that make up an operating system.
Which is to say, not shockingly, it is typically a tradeoff debate where there is no toggle between good and bad. Just a curve that constantly jumps back and forth between good and bad based on many many variables.
When it comes to process managers, there is obviously disagreement about how complex they should be, but systemd is still a system to manage and collect info about processes.
The Windows build system is where component boundaries get enforced. Having version control introduce additional arbitrary boundaries makes the problem of good modularity harder to solve.
I'm sure that, when dealing with stakeholder structures where different organizations can depend on different bits and pieces, having multiple repositories, with the resulting difficulty of making breaking and cross-cutting changes, becomes a good thing.
From the view of a single organization where the only users of a component are other components in the same organization, it seems like there is consensus around single-repository.
Which brings me to the point that in the open source world, you can't get away with a single-repository approach for your large system. And that also is telling, along with open source's successes. So which approach is better in the long run? I'd bet on the open source methods.
> organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations
Open source has a very different communication structure than a company. While the big three (MS, Google, FB) try to work towards good inter-department relations, it is usually either
- a single person
- a small group
that are the gatekeepers for a small amount of code, typically encapsulated in a "project". They do commit a lot to their project, yet rarely touch other projects in comparison.
Also, collaboration is infinitely harder, as in the office you can simply walk up to someone, call them, or chat with them. In OSS a lot of communication works via Issues and PRs, which are a fundamentally different way to communicate.
This all is reflected by the structure of how a functionality is managed: Each set of gatekeepers gets their own repository, for their function.
Interestingly this even happens with bigger repositories: DefinitelyTyped is a repository for TypeScript "typings" for different JS libraries, which has hundreds of collaborators.
Yet, if you open a pull-request for an existing folder the ones that have previously made big-ish changes can approve / decline the PR, so each folder is its own little repo.
So: maybe the solution is big repos for closed companies, small repos for open-source?
You say this forces you to think ahead, but predicting the future is quite difficult. The result is that you limp along with known-broken code because it would take so much effort to make the breaking changes to clean it up.
For example, let's say you discover that people are frequently misusing a blocking function because they don't realize that it blocks.
Let's say that we have a function `bool doThing()`. We discover that the bool return type is underspecified: there's a number of not-exactly-failure not-exactly-success cases. In a monorepo, it's pretty easy to modify this so that `doThing()` can return a `Result` instead. With multiple repos and artifacts, you either bring up the transitive closure of projects, or you leave it for someone to do later. For a widely used function, this can be prohibitive. That makes people frequently choose the "rename and deprecate" model, which means you get an increasing pile of known-bad functions.
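A rough sketch of what the "rename and deprecate" model tends to look like, translated to Python for illustration. The `Result` values, the `load` parameter, and the `do_thing_v2` name are all hypothetical; the comment above only says the bool is underspecified.

```python
import warnings
from enum import Enum


class Result(Enum):
    # Hypothetical outcomes standing in for the not-exactly-success cases.
    SUCCESS = "success"
    PARTIAL = "partial"
    FAILURE = "failure"


def do_thing_v2(load: float) -> Result:
    """New entry point with the richer return type (made-up logic)."""
    if load < 0.5:
        return Result.SUCCESS
    if load < 1.0:
        return Result.PARTIAL
    return Result.FAILURE


def do_thing(load: float) -> bool:
    """Old entry point, kept so callers in other repos keep building."""
    warnings.warn(
        "do_thing is deprecated; use do_thing_v2",
        DeprecationWarning,
        stacklevel=2,
    )
    # The old signature has to squash the in-between case into a bool.
    return do_thing_v2(load) is not Result.FAILURE
```

In a monorepo you would change `do_thing` and every caller in one commit. Across repos, both functions live on until the last caller migrates, which is how the pile of known-bad functions grows.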
And what happens if the inherent difficulty disappears or reduces? You're still left with the imposed external difficulty.
These types of use cases seem so commonly encountered that there should be a list of best practices in the Git docs.
EDIT: like if something was to be shared between Windows and Office, for example.
I'm not a frequent Windows user, or a Windows dev at all. Does anyone know of any consequences that MS's decision might have, if this hypothesis is true?
If you have a problem with Windows being overcomplicated or in need of refactor it is almost certainly something to do with not-the-kernel.
If you look at something like the Linux kernel, it's actually much larger than Windows. It needs to have every device driver known to man (except that one WiFi/GPU/Ethernet/Bluetooth driver you need) because internally the architecture is not cleanly defined and kernel changes also involve fixing all the broken drivers.
( https://googleprojectzero.blogspot.com.au/2015/07/one-font-v... )
Of course the issue here is that after NT 4, GDI has been in kernel mode; this is necessary for performance reasons. Prior to that it was a part of the user mode Windows subsystem.
I'd be curious to see if GDI moved back to userland would be acceptable with modern hardware, but I suspect MS is not interested in that level of churn for minimal gain.
It is not true that the internal architecture of Linux drivers is not clearly defined. It is just a practical approach to the maintenance of drivers (as the author of one Linux hardware driver, I'm pretty sure it's the best possible). The reasoning is outlined in the famous "Stable API nonsense" document http://elixir.free-electrons.com/linux/latest/source/Documen...
I don't think the Windows approach is worth praising here. It results in a lot of drivers no longer working after a few major Windows release upgrades. In Linux, available drivers can be maintained forever.
FWIW, I've stumbled upon both of those things in my personal research while I had legitimate access to the 5.2 sources as a student. It turns out Bing will link you directly to the Windows source code if you search for <arcane MASM directive here>.
Yes, I'm in the process of reporting this to Microsoft and cleansing myself of that poison apple.
I feel that those questions are valid, and are important in this field, not just kernel development. As someone who desires to continue learning, I will not yield to your counter.
I think the reason that this works well for Google is the amount of test automation that is in place, which seems to be very good at flagging any errors due to dependency changes before they get deployed. Not sure how many organizations have invested in and have an automated test infrastructure like the one Google has built.
It makes the entire system kind of pure-functional and stateless/predictable. Everything from computing which tests you need to run, to who to blame when something breaks, to caching build artifacts, or even sharing workspaces with co-workers.
While this could be implemented with multiple repos underneath, it would add much complexity.
So for most of us the only reasonable path is to split into multiple repositories.
It is also easier to create tools that deal with many repositories than it is to create a tool that virtualizes a single large repo.
Well, because of the unbelievable amount of engineering work involved in trying to get Git to operate at such insane scale? To say nothing of the risk involved in the alternative. This project in particular could easily have been a catastrophe.
It can be hard to understand this as an individual engineer, but large tech companies are incredible engineering machines with vast, nearly infinite, capacity at their disposal. At this scale, all that matters is that the organization is moving towards the right goals. It doesn't matter what's in the way; all the implementation details that engineers worry about day-to-day are meaningless at this scale. The organization just paves over any obstacles in pursuit of its goal.
In this case Microsoft decided that Git was the way to go for source control, that it's fundamentally a good fit for their needs. There were just implementation details in the way. So they just... did it. At their scale, this was not an incredible amount of work. It's just the cost of doing business.
They're running both source control systems in parallel, switching developers in blocks, and monitoring commit activity and feedback to watch for major issues. In the worst case, if GVFS failed or developers hated it, they could roll back to their old system.
Again, to my point above: there's a cost to doing this but it's negligible for very large organizations like Microsoft.
Not that I'd want to run it on such a mega-repository; it takes long enough running it on an average one with a decade of history.
100 urls? That's getting a bit annoying.
Android, Chromium, and Linux (among others, I'm sure) are different in that they use git for version control so they are in their own separate repos.
Why doesn't this make sense?
I personally think of a repo as an index, not a filesystem. You check out what you need, but there is one global constant state, which can e.g. be used for continuous integration tests.
The first big debate was – how many repos do you have – one for the whole company at one extreme or one for each small component? A big spectrum. Git is proven to work extremely well for a very large number of modest repos so we spent a bunch of time exploring what it would take to factor our large codebases into lots of tenable repos. Hmm. Ever worked in a huge code base for 20 years? Ever tried to go back afterwards and decompose it into small repos? You can guess what we discovered. The code is very hard to decompose. The cost would be very high. The risk from that level of churn would be enormous. And, we really do have scenarios where a single engineer needs to make sweeping changes across a very large swath of code. Trying to coordinate that across hundreds of repos would be very problematic.
After much hand wringing we decided our strategy needed to be “the right number of repos based on the character of the code”. Some code is separable (like microservices) and is ideal for isolated repos. Some code is not (like Windows core) and needs to be treated like a single repo.
The Windows source control system used to be organized as a large number (about 20 for product code alone, plus twice that number for test code and large test data files) of independent master source repositories.
The source checkouts from these repos would be arranged in a hierarchy on disk. For example, if root\ were the root of your working copy, files directly under root\ would come from the root repo, files under root\base\ from the base repo, root\testsrc\basetest from basetest, and so on.
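For anyone trying to picture the layout, a small sketch of how a path in the working copy maps back to its owning repo: the deepest mount point that is an ancestor of the file wins. The three mount points come from the example above; the file names in the usage lines and the `owning_repo` helper are made up.

```python
from pathlib import PureWindowsPath

# Mount points from the example above; the real layout had far more repos.
REPO_MOUNTS = {
    PureWindowsPath("root"): "root",
    PureWindowsPath("root/base"): "base",
    PureWindowsPath("root/testsrc/basetest"): "basetest",
}


def owning_repo(path: str) -> str:
    """Return the repo whose mount point is the deepest ancestor of path."""
    target = PureWindowsPath(path)
    best = None
    for mount in REPO_MOUNTS:
        if mount == target or mount in target.parents:
            if best is None or len(mount.parts) > len(best.parts):
                best = mount
    if best is None:
        raise ValueError(f"{path} is outside the working copy")
    return REPO_MOUNTS[best]


# Hypothetical file names, only to exercise the lookup.
print(owning_repo(r"root\readme.txt"))
print(owning_repo(r"root\base\kernel\sched.c"))
print(owning_repo(r"root\testsrc\basetest\smoke.cpp"))
```

A directory such as root\testsrc\ that has no repo of its own falls through to the root repo, which is the behavior the description implies for files sitting between mount points.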
To do cross repo operations you used a tool called "qx" (where "q" was the name of the basic source control client). Qx was a bunch of Perl scripts, some of which wrapped q functions and others that implemented higher-level functions such as merging branches. However, qx did not try to atomically do operations across all affected repos.
(The closest analog to this in git land would be submodules.)
While source control was organized this way, build and test ran on all of Windows as a single unit. There was investigation into trying to more thoroughly modularize Windows in the past, but I think the cost was always judged too great.
Mark Lucovsky did a talk several years ago on this source control system, among other aspects of Windows development:
I believe it is still valid for folks not using GVFS to access Windows sources.
I'm sure the build process is non-trivial and with 4,000 people working on it, the amount of updates it gets daily is probably insane. Any one person across teams trying to keep this all straight would surely fail.
Having done a lot of small git repos, I'm a big fan of one huge repo. It makes life easier for everyone, especially your QA team as it's less for them to worry about. In the future, anywhere I'm the technical lead I'm gonna push for one big repo. No submodules either. They're a big pain in the ass too.
It's news to me that Windows decided to go that route too. Personally, I think submodules and git sub-trees suck, so I'm all for putting things in a monorepo.
The bad one is just dumping snapshots of the code into a public repo every so often. You need to make sure your dependencies are open source, have tools that rewrite your code and build files accordingly, and put them in a staging directory for publishing.
The good one is developing that part publicly, and importing it periodically into your internal monorepo with the same (or similar) process to the one you use for importing any other third-party library.
There's also a hybrid approach which is to try and let internal developers use the internal tooling against the internal version of the code, and also external developers, with external tooling, against the public version. That one's harder, and you need a tool that does bidirectional syncing of change requests, commits, and issues.
No, seriously, that's the answer.
-don't have to spend time to think about defining interfaces
-history is full of crap you don't care about
-tests take forever to run
-tooling breaks down completely, though thanks to MS the limit was increased seriously
Defining what portions of the OS you'll have to look in for such changes helps a great deal.
As to building and testing... the system has to get much better about detecting which tests will need to be re-run for any particular change. That's difficult, but you can get 95% of the way there easily enough.
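The "easy 95%" is plausibly something like a walk over the reverse dependency graph: re-run the tests of whatever changed plus everything that depends on it, directly or transitively. A minimal sketch, with a made-up module graph and made-up test names:

```python
from collections import deque

# Hypothetical graph: each key depends on the modules in its list.
DEPENDS_ON = {
    "kernel": [],
    "net": ["kernel"],
    "shell": ["kernel"],
    "browser": ["net", "shell"],
}
# Hypothetical mapping from module to the tests that exercise it directly.
TESTS = {
    "kernel": ["kernel_tests"],
    "net": ["net_tests"],
    "shell": ["shell_tests"],
    "browser": ["browser_tests"],
}


def affected_tests(changed):
    """Tests for the changed modules and everything that depends on them."""
    # Invert the graph so we can walk from a module to its dependents.
    dependents = {module: [] for module in DEPENDS_ON}
    for module, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents[dep].append(module)

    seen = set(changed)
    queue = deque(changed)
    while queue:
        module = queue.popleft()
        for dependent in dependents[module]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)

    return sorted(test for module in seen for test in TESTS[module])


print(affected_tests(["net"]))
```

The hard remaining part is everything a static graph does not capture: runtime configuration, generated code, and tests that reach across module boundaries without declaring it.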
That seems like a design and policy choice, orthogonal to repos.
If they are in different repos, the change is not atomic and you need to version interfaces or keep backwards compatibility in some other way.
- They support checking out a subdirectory without downloading the rest of the repo, as well as omitting directories in a checkout. Indeed, in SVN, branches are just subdirectories, so almost all checkouts are of subdirectories. You can't really do this in Git; you can do sparse checkouts (i.e. omitting things when copying a working tree out of .git), but .git itself has to contain the entire repo, making them mostly useless.
- They don't require downloading the entire history of a repo, so the download size doesn't increase over time. Indeed, they don't support downloading history: svn log and co. are always requests to the server. Unfortunately, Git is the opposite, and only supports accessing previously downloaded history, with no option to offload to a server. Git does have the option to make shallow clones with a limited amount of (or no) history, and unlike sparse checkouts, shallow clones truly avoid downloading the stuff you don't want. But if you have a shallow clone, git log, git blame, etc. just stop at the earliest commit you have history for, making it hard to perform common development tasks.
I don't miss SVN, but there's a reason big companies still use gnarly old systems like Perforce, and not just because legacy: they're genuinely much better at scaling to huge repos (as well as large files). Maybe GVFS fixes this; I haven't looked at its architecture. But as a separate codebase bolted on to near-stock Git, I bet it's a hack; in particular, I bet it doesn't work well if you're offline. I suspect the notion of "maybe present locally, maybe on a server" needs to be baked into the data model and all the tools, rather than using a virtual file system to just pretend remote data is local.
GVFS intends to add the ability to scale to Git, through patches to Git itself and a custom driver. I don't think this is a hack - by no means is it the first version control system to introduce a filesystem level component. Git with GVFS works wonderfully while offline for any file that you already have fetched from the server.
If this sounds like a limitation, then remember that these systems like Perforce and TFVC _also_ have limitations when you're offline: you can continue to edit any file that you've checked out but you can't check out new files.
You can of course _force_ the issue with a checkout/edit/checkin but then you'll need to run some command to reconcile your changes once you return online. This seems increasingly less important as internet becomes ever more prevalent. I had wifi on my most recent trans-Atlantic flight.
I'm not sure what determines when something is "a hack" or not, but I'd certainly rather use Git with GVFS than a heavyweight centralized version control system if I could. Your mileage, as always, may vary.
I have a 100k commit svn repo I have been trying to migrate but the result is just too large. Partly this is due to tons of revisions of binary files that must be in the repo.
Does the virtualization also help provide a shallow set of recent commits locally but keep all history at the server (which is hundreds of gigs that is rarely used)?
A GVFS clone will contain all of the commits and all of the trees but none of the blobs. This lets you operate on history as normal, so long as you don't need the file content. As soon as you touch file content, GVFS will download those blobs on demand.
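A toy model of that behavior, to make the commits-and-trees-but-no-blobs split concrete. This is not how GVFS is implemented; the `LazyBlobStore` class, the dict standing in for the server, and the blob ids are all hypothetical.

```python
class LazyBlobStore:
    """Toy model: blob ids are known locally, contents are fetched on first read."""

    def __init__(self, fetch_from_server):
        self._fetch = fetch_from_server  # hypothetical callable: blob id -> bytes
        self._local = {}                 # blobs already downloaded
        self.downloads = 0

    def read(self, blob_id):
        if blob_id not in self._local:
            self._local[blob_id] = self._fetch(blob_id)
            self.downloads += 1
        return self._local[blob_id]


# Stand-in for the server; a real clone talks to a Git server, not a dict.
SERVER = {"a1": b"int main() {}", "b2": b"README"}

# Trees map paths to blob ids and are present after the clone,
# so listing files or walking history needs no download.
tree = {"src/main.c": "a1", "README": "b2"}

store = LazyBlobStore(SERVER.__getitem__)
listing = sorted(tree)                # no blob content touched yet
first = store.read(tree["README"])    # triggers one download
second = store.read(tree["README"])   # served from the local copy
print(listing, store.downloads)
```

Only the first read of a file pays the network cost; src/main.c is never downloaded here because nothing asked for its content.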
Is there anything people particularly miss about Source Depot? Something SD was good at, but git is not?
Anyway, MS shares source with various third parties (governments at least, and I believe large customers in general), so any of these are a potential leak source.
1) Why include it in default windows? It seems that 99.99% of users would never even know it existed, let alone use it
2) Does that mean GVFS isn't useable on *nix systems? Any plans to make it useable, if so?
2) GVFS is currently only available on Windows, but we are very interested in porting to other platforms.
The first internal version of GVFS was actually based on a 3rd party driver that looks a lot like FUSE. But because it is general purpose, and requires all IO to context switch from the kernel to user mode, we just couldn't make it fast enough.
Remember that our file system isn't going to be used just for git operations, but also to run builds. Once a file has been downloaded, we need to make sure that it can be read as fast as any local file would be, or dev productivity would tank.
With GvFlt, we're able to virtualize directory enumeration and first-time file reads, but after that get completely out of the way because the file becomes a normal NTFS file from that point on.
From your description it sounds like there could be usefulness in coordinating such efforts.
E- oops, missed that line. Cool, wonder if it will show up in more than Git
We're building a VFS (Virtual File System) for Git (G) so GVFS was a very natural name and it just kind of stuck once we came up with it.
1. How do you measure "largeness" of a git repo?
2. How are you confident that you have the largest?
3. How much technical debt does that translate to?
2. Fairly confident, at least as far as usable repos go. Given how unusable the Windows repo is without GVFS and the other things we've built, it seems pretty unlikely anyone's out there using a bigger one. If you know of something bigger, we'd love to hear about it and learn how they solved the same problems!
3. Windows is a 30 year old codebase. There's a lot of stuff in there supporting a lot of scenarios.
Microsoft does have an internal Source Code Archive, which does the moral equivalent of storing source code and binary artifacts for released software in an underground bunker. I used to have a bit of fun searching the NT 3.5 sources as taken from the Source Code Archive...
It's certainly possible that somebody created a 1 TB source tree in Git, but what we've never heard of is somebody actually _using_ such a source tree, with 4000 or more developers, for their daily work in producing a product.
I say this with some certainty because if somebody had succeeded, they would have needed to make similar changes to Git to be successful, though of course they could have kept such changes secret.
I'm curious, did you experiment with LFS prior to building GitVFS?
Also, I know that there is a (somewhat) active effort to port GitVFS to Linux. Do you know if any of the Git vendors (GitLab and/or GitHub) are planning to support GitVFS in their enterprise products?
That is an excellent point. Thanks!
- Availability of tools
- Familiarity of developers (both current and potential)
This is a little off-topic, but why can't Windows 10 users conclusively disable all telemetry?
(I consider the question only a little off-topic, because I have the impression that this story is part of an ongoing Microsoft charm-offensive.)