The largest Git repo (microsoft.com)
1053 points by ethomson 236 days ago | 403 comments

Windows, because of the size of the team and the nature of the work, often has VERY large merges across branches (10,000’s of changes with 1,000’s of conflicts).

At a former startup, our product was built on Chromium. As the build/release engineer, one of my daily responsibilities was merging Chromium's changes with ours.

Just performing the merge and conflict resolution was anywhere from 5 minutes to an hour of my time. Ensuring the code compiled was another 5 minutes to an hour. If someone on the Chromium team had significantly refactored a component, which typically occurred every couple weeks, I knew half my day was going to be spent dealing with the refactor.

The Chromium team at the time was many dozens of engineers, landing on the order of a hundred commits per day. Our team was a dozen engineers landing maybe a couple dozen commits daily. A large merge might have on the order of 100 conflicts, but typically it was just a dozen or so conflicts.

Which is to say: I don't understand how it's possible to deal with a merge that has 1k conflicts across 10k changes. How often does this occur? How many people are responsible for handling the merge? Do you have a way to distribute the conflict resolution across multiple engineers, and if so, how? And why don't you aim for more frequent merges so that the conflicts aren't so large?

(And also, your merge tool must be incredible. I assume it displays a three-way diff and provides an easy way to look at the history of both the left and right sides from the merge base up to the merge, along with showing which engineer(s) performed the change(s) on both sides. I found this essential many times for dealing with conflicts, and used a mix of the git CLI and Xcode's opendiff, which was one of the few at the time that would display a proper three-way diff.)

When you have that many conflicts, it's often due to massive renames, or just code moves.

If you use git-mediate[1], you can re-apply those massive changes on the conflicted state, run git-mediate - and the conflicts get resolved.

For example: if you have 300 conflicts due to some massive rename, you can type in:

  git-search-replace.py[2] -f oldGlobalName///newGlobalName
  git-mediate -d
  Successfully resolved 377 conflicts and failed resolving 1 conflict.
  <1 remaining conflict shown as 2 diffs here representing the 2 changes>
[1] https://medium.com/@yairchu/how-git-mediate-made-me-stop-fea...

[2] https://github.com/da-x/git-search-replace

Also git rerere

When maintaining multiple release lines and moving fixes between them:

Don't use a bad branching model. Things like "merging upwards" (=committing fixes to the oldest branch requiring the fix, then merging the oldest branch into the next older branch etc.), which seems to be somewhat popular, just don't scale, don't work very well, and produce near-unreadable histories. They also incentivise developing on maintenance branches (ick).

Instead, don't do merges between branches. Everything goes into master/dev, except stuff that really doesn't need to go there (e.g. a fix that only affects a specific branch(es)). Then cherry pick them into the maintenance branches.

Cherry picking hotfixes into maint branches is cool until you have stuff like diverging APIs or refactored modules between branches. I don't know of a better solution; it kind of requires understanding in detail what the fix does and how it does it, then knowing if that's directly applicable to every release which needs to be patched.

Use namespaces to separate API versions?

POST /v4/whatever

POST /v3/whatever

We version each of our individual resources. so a /v1/user might have many /v3/post . Seems to work for us as a smaller engineering team.

A better approach would be to alias /v3/user to /v1/user until there is a breaking change needed in the v3 code tier.
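A minimal sketch of the aliasing idea, assuming a simple in-process route table. The handler names, the `HANDLERS` and `ALIASES` tables, and the `dispatch` helper are all hypothetical and only illustrate the suggestion; they are not anyone's actual API.

```python
# Hypothetical handlers: only versions with real behavior get their own code.
def user_v1(payload):
    return {"handler": "user_v1", "payload": payload}


def post_v3(payload):
    return {"handler": "post_v3", "payload": payload}


HANDLERS = {
    ("v1", "user"): user_v1,
    ("v3", "post"): post_v3,
}

# A newer version path points at an older handler until a breaking change
# forces a real v3 implementation.
ALIASES = {
    ("v3", "user"): ("v1", "user"),
}


def resolve(version, resource):
    """Follow aliases until a concrete handler is found."""
    key = (version, resource)
    seen = set()
    while key in ALIASES:
        if key in seen:
            raise ValueError(f"alias cycle at {key}")
        seen.add(key)
        key = ALIASES[key]
    return HANDLERS[key]


def dispatch(path, payload):
    """Dispatch a path such as '/v3/user' to its handler."""
    version, resource = path.strip("/").split("/", 1)
    return resolve(version, resource)(payload)
```

With this layout, clients can call `/v3/user` everywhere, and introducing a breaking change later means adding a `("v3", "user")` handler and deleting the alias.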

On a rapidly developing API, that would be way too much churn on our front end. For an externally facing API, I completely agree.

Yes, although this applies to all forms of porting changesets or patches between branches or releases.

I don't understand the hate for merging up. I've worked with the 'cherry-pick into release branches' model, and also with an automated merge-upwards model, and I found the automerge to be WAY easier to deal with. If you make sure your automerge is integrated into your build system, so a failing automerge is a red build that sends emails to the responsible engineers, I found that doing it this way removed a ton of the work that was necessary for cherry-picking. I can understand not liking the slightly-messier history you get, but IMO it was vastly better. Do you have other problems with it, or just 'unreadable' histories and work happening on release branches? Seems like a good trade to me.

As the branches diverge, merges take more and more time to do (up to a couple hours, at which point we abandoned the model)... they won't be done automatically. Since merges are basically context-free it's hard to determine the "logic" of changed lines. Since merges always contain a bunch of changes, all have to be resolved before they can be tested, and tracing failures back to specific conflict resolutions takes extra time. Reviewing a merge is seriously difficult. Mismerges are also far more likely to go unnoticed in a large merge compared to a smaller cherry pick. With cherry picking you are only considering one change, and you know which one. You only have to resolve that one change, and can then test, if you feel that's necessary, or move on to the next change.

Also; https://news.ycombinator.com/item?id=14413681

I also observed that getting rid of merging upwards moved the focus of development back to the actual development version (=master), where it belongs.

I just looked at git-mediate and I'm very confused. It appears that all it does is remove the conflict markers from the file after you've already manually fixed the conflict. Except you need to do more work than normal, because you need to apply one of the changes not only to the other branch's version but also to the base. What am I missing here, why would I actually want to use git-mediate when I'm already doing all the work of resolving the conflicts anyway?

It looks like git-mediate does one more important thing; it checks that the conflict is actually solved. In my experience it's very easy to miss something when manually resolving a conflict and often the choices the merge tools give you are not the ones you want.

IntelliJ IDEA default merger knows when all conflicts in a file are handled and shows you a handy notification on top "Save changes and finish merging".

TFS's conflict resolver also does this

Well, all it checks is that you modified the base case to look like one of the two other cases. That doesn't actually tell you if you resolved the conflict though, just that you copied one of the cases over the base case.

True, but if you follow a simple mechanical guideline: Apply the change to the 2 other versions (preferably the base one last) - then your conflict resolutions are going to be correct.

From experience with many developers using this method, conflict resolution errors went down to virtually zero, and conflict resolution time has improved by 5x-10x.

There's a much simpler mechanical guideline that works without git-mediate: Apply the change to the other version. git-mediate requires you to apply the change twice, but normal conflict resolution only requires you to apply it once.

Except you don't really know if you actually applied the full change to the other version. That's what applying it to the base is all about.

You often take the apparent diff, apply it to the other version, and then git-mediate tells you "oops, you forgot to also apply this change". And this is one of the big sources of bugs that stem from conflict resolutions.

Another nice thing about git-mediate is that it lets you safely do the conflict resolution automatically, e.g: via applying a big rename as I showed in the example, and seeing how many conflicts disappear. This is much safer than manually resolving.
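The rule being argued about can be sketched roughly as follows. This is an illustration of the check as described in this thread, not git-mediate's actual implementation, and it assumes diff3-style conflict markers (the kind that include a base section). The function names are made up for the sketch.

```python
def parse_diff3_hunk(text):
    """Split one diff3-style conflict hunk into (ours, base, theirs) line lists."""
    ours, base, theirs = [], [], []
    current = None
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            current = ours
        elif line.startswith("|||||||"):
            current = base
        elif line.startswith("======="):
            current = theirs
        elif line.startswith(">>>>>>>"):
            current = None
        elif current is not None:
            current.append(line)
    return ours, base, theirs


def try_resolve(ours, base, theirs):
    """Return the resolved lines, or None if the hunk is still a real conflict."""
    if ours == base:
        # Our side no longer differs from the base, so the other side
        # carries every remaining change.
        return theirs
    if theirs == base:
        return ours
    if ours == theirs:
        # Both sides ended up identical; either one is the result.
        return ours
    return None


# One side renamed a symbol, the other side changed the constant.
hunk = """<<<<<<< HEAD
x = newGlobalName + 1
||||||| base
x = oldGlobalName + 1
=======
x = oldGlobalName + 2
>>>>>>> other
"""
ours, base, theirs = parse_diff3_hunk(hunk)


def rename(lines):
    """Re-apply the rename, as a project-wide search and replace would."""
    return [line.replace("oldGlobalName", "newGlobalName") for line in lines]


before = try_resolve(ours, base, theirs)
after = try_resolve(rename(ours), rename(base), rename(theirs))
```

Here `before` is `None` because all three parts differ. Once the rename is re-applied to every part, the base matches our side, so `after` is the other side's line with both changes in it. If the rename had been applied incompletely, the base would still differ from both sides and the hunk would stay flagged.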

Applying the change to the base doesn't prove that you applied the change to the other version. It only proves that you did the trivial thing of copying one version over the base. That's kinda the whole point of my complaint here, git-mediate is literally just having you do busy-work as a way of saying "I think I've applied this change", and that busy-work has literally no redeeming value because it's simply thrown away by git-mediate. Since git-mediate can't actually tell if you applied the change to the other version correctly, you're getting no real benefit compared to just deleting the conflict markers yourself.

The only scenario in which I can see git-mediate working is if you don't actually resolve conflicts at all but instead just do a project-wide search&replace, but that's only going to handle really trivial conflicts, and even then if you're not actually looking at the conflict you run the risk of having the search & replace not actually do what it's supposed to do (e.g. catching something it shouldn't).

> Since git-mediate can't actually tell if you applied the change to the other version correctly, you're getting no real benefit compared to just deleting the conflict markers yourself

This is patently and empirically false:

1) See the automated rename example. How do you gain the same safety and ease of resolution without git mediate? This is, unlike you say, an incredibly common scenario. After all, conflicts are usually due to very wide, mechanical changes. The exact kinds of changes that are easy to re-apply project-wide. It is true for not only renames, but also whitespace fixes which are infamous for causing conflicts and thus inserting bugs, and due to that are not allowed in many collaborative projects!

2) Instead of having to tediously compare the 3 versions to make sure you haven't missed any change when resolving (a very common error!) you now have to follow a simple guideline: Apply the same change to 2 versions.

This guideline is simple enough to virtually never fuck it up, unlike traditional conflict resolution which is incredibly error-prone. Your complaint that git mediate does not validate you followed this one guideline is moot, since this guideline is so easy to not fuck up - compared to the rest of the resolution process.

If you follow this guideline - git-mediate has tremendous value. It makes sure you did not forget any part of the change done by either side - and streamlines every other part of the conflict resolution process:

A) Takes my favorite editor directly to the conflict line

B) Shows me 2 diffs instead of 3 walls of text

C) Lets me choose the simpler diff to apply to the other sides, making a very minimal text change in my editor.

D) Validates that the change I applied was the last one (or decreases the size of the diff otherwise)

E) Does the "git add" for me, and takes me directly to the next conflict

This has been used by dozens of people, who can all testify that it:

A) Made conflict resolution easy and convenient

B) Reduced the error rate to zero (I don't remember the last bug inserted in a merge conflict)

C) Sped the process up by an order of magnitude

> See the automated rename example. How do you gain the same safety and ease of resolution without git mediate? This is, unlike you say, an incredibly common scenario.

Is it? I'm not sure if I've ever had a conflict that would be resolved by a global find & replace. Globally renaming symbols isn't really all that common. In my experience conflicts are not "usually due to very wide, mechanical changes", they're due to two people modifying the same file at the same time.

> It is true for not only renames, but also whitespace fixes which are infamous for causing conflicts …

Most projects don't go doing whitespace fixes over and over again. In projects that do any sort of project-wide whitespace fixes, that sort of thing is usually done once, at which point the whitespace rules are enforced on new commits. So yes, global whitespace changes can cause conflicts, but they're rather rare.

> Instead of having to tediously compare the 3 versions to make sure you haven't missed any change when resolving (a very common error!) you now have to follow a simple guideline: Apply the same change to 2 versions.

> This guideline is simple enough to virtually never fuck it up

You know what's even simpler? "Apply the same change to 1 version". Saying "Fix the conflict, and then do extra busy-work on top of it" is not even remotely "simpler". It's literally twice the amount of work.

The only thing git-mediate appears to do is tell you if a project-wide find&replace was sufficient to resolve your conflict (and that's assuming the project-wide find&replace is even safe to do, as I mentioned in my previous comment). If you're not doing project-wide find&replaces, then git-mediate just makes your job harder.

From reading your "process" it appears that all you really need is a good merge tool, because most of what you describe as advantageous is what you'd get anyway when using any sort of reasonable tool (e.g. jumping between conflicts, showing correct diffs, making it easy to copy lines from one diff into another, making it easy to mark the hunk as resolved).

> In my experience conflicts are not "usually due to very wide, mechanical changes", they're due to two people modifying the same file at the same time.

Then it is likely a reversal of cause & effect. People are more reluctant to make wide sweeping changes such as renames because they're worried about the ensuing conflicts.

> Most projects don't go doing whitespace fixes over and over again

Again, for similar reasons. Projects limp around with broken indentation (tabs/spaces), trailing whitespaces, dos newlines, etc - because fixing whitespace is against policy. Why? Conflicts.

> You know what's even simpler? "Apply the same change to 1 version".

It sounds simpler, but not fucking it up is not simple at all, as evidenced by conflicts being a constant source of bugs.

When you look at the 3 versions and "apply the diff to 1 version", you only think you are applying the diff. You're actually applying the perceived diff to 1 version, which often differs from the actual diff, as it may include subtle differences that aren't easily visible. git-mediate detects this through the double accounting of applying the perceived diff to both versions, which validates that the perceived diff equals the actual diff.

Without git-mediate? At best you bring up build errors. At worst, revive old bugs that were subtly fixed in the diff you think you applied.

> then git-mediate just makes your job harder.

You're talking out of your ass here. Me and 20 other people have been using git-mediate and it's been a huge game changer for conflict resolution. Every single user I've talked to claims huge productivity/reliability benefits from using it.

> all you really need is a good merge tool

I've used "good" merge tools a lot. They're incredibly inferior to my favorite text editor:

* Text editing within them is tedious and terrible

* Their supported actions of copying whole lines from one version to the other are useless 90% of the time.

Let me ask you this:

What percentage of the big conflicts you resolve with a "good merge tool" - build & run correctly on the first run after resolution?

For git-mediate, that is easily >99%.

> People are more reluctant to make wide sweeping changes such as renames because they're worried about the ensuing conflicts.

I disagree. People generally don't do project-wide find&replaces because it's just not all that common to want to rename a global symbol.

> Projects limp around with broken indentation (tabs/spaces), trailing whitespaces, dos newlines, etc - because fixing whitespace is against policy. Why? Conflicts.

You seem to have completely missed the point of my comment. I'm not saying projects limp along with bad whitespace. I'm saying projects that decide upon whitespace rules typically do a single global fixup and then simply enforce whitespace rules on all new commits. That means you only ever have one conflict set due to applying whitespace policy, rather than doing it over and over again as you suggested.

> You're applying the perceived diff to 1 version, which often differs from the actual diff as it may include subtle differences that aren't easily visible.

Then you're using really shitty merge software. Any halfway-decent merge tool will highlight all the differences for you.

> Without git-mediate? At best you bring up build errors. At worst, revive old bugs that were subtly fixed in the diff you think you applied.

And this is just FUD.

It's simple. Just stop using Notepad.exe to handle your merge conflicts and use an actual merge tool. It's pretty hard to miss a change when the tool highlights all the changes for you.

> You're talking out of your ass here.

And you're being uncivil. I'm done with this conversation.

That's why I showed an example of a rename. You write "manually fixed the conflict", where do you see that in the rename example?

You just re-apply either side of the changes in the conflict (Base->A, or Base->B) and the conflict is then detected as resolved. Reapplying (e.g: via automated rename) is much easier than what people typically mean by "manually resolving the conflict".

Also, as a pretty big productivity boost, it prints the conflicts in a way that lets many editors (sublime, emacs, etc) directly jump to conflicts, do the "git add" for you, etc. This converts your everyday editor into a powerful conflict resolution tool. Using the editing capabilities in most merge tools is tedious.

or worse, formatting changes

Windows developed an extension that lets them do conflict resolution on the web. We have a server-side API that it calls into, but the extension isn't fundamentally different from using BeyondCompare or $YOUR_FAVORITE_MERGETOOL.

An extension to what? Could you open source it?

Extension to VSTS [1], sorry. We're working with them on making the extension available to the public. It's possible we could open source it as well; I'll poke around.

[1] https://www.visualstudio.com/en-us/docs/integrate/extensions...

Is this CodeFlow you're talking about?

For 3 way merging, I've had good luck with beyondcompare

Generally, it's the responsibility of whoever made the changes to resolve conflicts (basically, git blame conflicting lines and the people who changed them get notified to resolve conflicts in that file). Distributing the work like this makes the merges more reasonable.
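As a rough sketch of that distribution step: assume some earlier step has already produced (path, author) pairs by running blame over the conflicting lines on each side. The pair format and the `assign_conflicts` name are assumptions for illustration, not a description of any real tooling.

```python
from collections import defaultdict


def assign_conflicts(conflicts):
    """Group conflicting files by the authors whose lines are involved.

    `conflicts` is an iterable of (path, author) pairs, e.g. derived from
    blame output for the conflicting lines (hypothetical input format).
    Returns a dict mapping each author to a sorted list of files to resolve.
    """
    todo = defaultdict(set)
    for path, author in conflicts:
        todo[author].add(path)
    return {author: sorted(paths) for author, paths in todo.items()}


# Made-up example data.
example = [
    ("net/socket.c", "alice"),
    ("ui/window.c", "bob"),
    ("net/socket.c", "bob"),
    ("net/socket.c", "alice"),
]
assignments = assign_conflicts(example)
```

Each author then gets one notification listing their files, and a file touched by two people shows up for both of them.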

Yep, this is the way I work. Always rebase against master, and fix conflicts there, so branches merging into master should always be up-to-date and have zero conflicts.

How does that work?

You first rebase against the master locally and push the merged feature branch after resolving all the conflicts yourself. Afterwards, you go to the master and merge it against the updated feature branch. The 2nd merge should not result in any conflicts.

This model can be annoying on running feature branches. Once you rebase, you have to force-push to the remote feature branch. It's not so bad if you use --force-with-lease to prevent blowing away work on the remote, but it still means a lot of rewriting history on anything other than one-off branches.

No no, you never force push to "published" branches, e.g., upstream master. What you're doing when you rebase onto the latest upstream is this: you're making your local history the _same_ as the upstream, plus your commits as the latest commits, which means if you push that, then you're NOT rewriting the upstream's history.

(In the Sun model one does rewrite project branch history, but one also leaves behind tags, and downstream developers use the equivalent of git rebase --onto. But the true upstream never rewrites its history.)

That's what I thought as well. But what happens when you need to rebase the feature branch against master? Won't you have to force push that rebase?

There's nothing stopping you from doing merges instead of rebases.

       * latest feature commit #3 (feature)
       * merge
     * | more master commits you wanted to include (master)
     | * feature commit #2
     | * merge
     * | master commits you wanted to include
     | * feature commit #1
     *   original master tip
     *   master history...
Then, when you're done with feature, if you really care about clean history, just rebase the entire history of the feature branch into one or more commits based on the latest from master. I think checkout -b newbranch master; merge --squash feature; commit does the trick here:

     *   feature commits #1, #2 and #3 (newbranch)
     | * latest feature commit (feature)
     | * merge
     * | more master commits you wanted to include (master)
     | * feature commit #2
     | * merge
     * | master commits you wanted to include
     | * feature commit #1
     *   original master tip
     *   master history...
Then checkout master, rebase newbranch, test it out and if you're all good, delete or ignore the original.

     * feature commits #1, #2 and #3 (master, newbranch)
     * more master commits you wanted to include
     * master commits you wanted to include
     * original master tip
     * master history...

  *latest feature c
* |more master comm. | feature commit #: | merge |/| * |master commits y | feature commit # |/ original master * master history ..

maybe try that again keeping 4 spaces before each line?

I've described this. Downstreams of the feature branch rebase from their previous feature branch merge base (a tag for which is left behind to make it easy to find it) --onto the new feature branch head.

E.g., here's what the feature branch goes through:

feature$ git tag feature_05

<time passes; some downstreams push to feature branch/remote>

feature$ git fetch origin

feature$ git rebase origin/master

feature$ git tag feature_06

And here's what a downstream of the feature branch goes through:

downstream$ git fetch feature_remote

<time passes; this downstream does not push in time for the feature branch's rebase>

downstream$ git rebase --onto feature_remote/feature_06 feature_remote/feature_05

Easy peasy. The key is to make it easy to find the previous merge base and then use git rebase --onto to rebase from the old merge base to the new merge base.
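A toy model of what the `--onto` step does, with histories as plain lists of commit labels (oldest first). This only illustrates the bookkeeping, not how git implements it. The `rebase_onto` helper and the labels are made up, and the apostrophe marks a replayed commit (same change, new commit).

```python
def rebase_onto(new_base, old_base, branch):
    """Model of `git rebase --onto new_base old_base` run on `branch`.

    Takes the commits `branch` has after `old_base` and replays them on
    top of `new_base`.
    """
    if branch[:len(old_base)] != old_base:
        raise ValueError("branch does not descend from old_base")
    local = branch[len(old_base):]
    return new_base + [commit + "'" for commit in local]


# feature_05: the feature branch as the downstream last saw it.
feature_05 = ["m1", "f1"]
# feature_06: the feature branch after it was rebased onto a newer master.
feature_06 = ["m1", "m2", "f1'"]
# The downstream has two local commits on top of feature_05.
downstream = ["m1", "f1", "d1", "d2"]

rebased = rebase_onto(feature_06, feature_05, downstream)
```

`rebased` ends up as `["m1", "m2", "f1'", "d1'", "d2'"]`: only the downstream's own commits are replayed. Without the old tag there is no easy way to tell that `f1` is already present upstream as `f1'`, which is why leaving the tag behind matters.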

Everybody rebases all the time. Everybody except the true master -- that one [almost] never rebases (at Sun it would happen once in a blue moon).

For 3-way merging the best tool I've found is Steve Losh's splice.vim (https://github.com/sjl/splice.vim/).

Yes back when using Linux, I used Meld a lot. I can recommend it - the directory comparison is good.

Also KDiff3. Struggling to remember the other ones I used to try unfortunately.

I spent the better part of a year in a team that was merging and bug fixing our company's engine releases into the customer (and owning company) code base. We also had to deal with different code repositories and version control systems.

We ended up with a script that created a new git repository, checked out the base version of the code there. Then created a branch for our updated release and another for their codebase. Then attempted to do a merge between the two. For any files which couldn't be automatically merged it created a set of 4 files, the original then a triplet for doing a 3-way merge.

This is also when I bought myself a copy of Beyond Compare 4, which fits perfectly for the price and feature set for what we needed.

It's usually not so bad. Generally a big project is actually split up into lots of different smaller projects, each of which have an owner. And typically a dev won't touch code that isn't theirs except in unusual cases. Teams that have more closely related or dependent code would typically try to work closer to each other and share code more often than teams that are more separated.

Most changes don't touch hundreds/thousands of files. If you were to split the repo into many (as Windows used to be) then you'd still have the problem of huge projects having MANY conflicts, but worse: now you need to do your merge/rebase for each such repo.

In any case, rebasing is better than merging. (Rebasing is a series of merges, naturally, but still.)

Are you aware of KDiff3[1]? If you are why do you prefer Xcode's opendiff?

[1] http://kdiff3.sourceforge.net/

For me, I found I did not need to do 3 way merges that often and opendiff's native UI fits in better than kdiff3 (for me). I think kdiff3 was Qt? Despite Trolltech's best efforts, Qt does not feel native on a Mac.

This isn't to say that KDiff3 isn't great - it is.

I don't think KDiff3 looks native anywhere, and I don't think it's because of Qt. It uses weird fonts and icons, and for some reason their toolbar buttons just look wrong.

Still an extremely useful program.

God, that sounds hellish.

It's not nearly as bad when you have all the people who work on it with you.

You look at who caused conflicts and send out emails. Don't land the merge until all commits are resolved. People who don't resolve their changes get in trouble.

Not perfect, but there it is.

I worked on a Chromium-based product as well and had the exact same problem. Eventually we came up with a reasonable system for landing commits and just tried our best to build using Chromium, but not having actual patches. Worked okay, not great. Better than the old system of having people do it manually/just porting our changes to each release.

Archive Team is making a distributed backup of the Internet Archive. http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK Currently the method getting the most attention is to put the data into git-annex repos, and then have clients just download as many files as they have storage space for. But because of limitations with git, each repo can only handle about 100,000 files even if they are not "hydrated". http://git-annex.branchable.com/design/iabackup/ If git performance were improved for files that have not been modified, this restriction could be lifted and the manual work of dividing collections up into repos could be a lot lower.

Edit: If you're interested in helping out, e.g. porting the client to Windows, stop by the IRC channel #internetarchive.bak on efnet.

internet archive sounds like the best ever use case for IPFS

It was considered but it just didn't get enough attention from anyone to get it done. http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/i...

Could bup be useful in addition to git-annex? https://github.com/bup/bup

Wouldn't IPFS be much, much more suitable for this purpose?

IIRC, there are permanence and equitable sharing guarantee concerns with IPFS. The former at least can be helped by pinning I think.

Ah, I see, yes. It would be probabilistic, unless there was some way to coordinate sharing (eg downloading the least shared file first).

At Sun Microsystems, Inc. (RIP), we had many "gates" (repos) that made up Solaris. Cross-gate development was somewhat more involved, but still not bad. Basically: you installed the latest build of all of Solaris, then updated the bits from your clones of the gates in question. Still, a single repo is great if it can scale, and GVFS sounds great!

But that's not what I came in to say.

I came in to describe the rebase (not merge!) workflow we used at Sun, which I recommend to anyone running a project the size of Solaris (or larger, in the case of Windows), or, really, even to much smaller projects.

For single-developer projects, you just rebased onto the latest upstream periodically (and finally just before pushing).

For larger projects, the project would run their own upstream that developers would use. The project would periodically rebase onto the latest upstream. Developers would periodically rebase onto their upstream: the project's repo.

The result was clean, linear history in the master repository. By and large one never cared about intra-project history, though project repos were archived anyways so that where one needed to dig through project-internal history ("did they try a different alternative and found it didn't work well?"), one could.

I strongly recommend rebase workflows over merge workflows. In particular, I recommend it to Microsoft.

A problem with rebase workflows that I don't see addressed (here or in the replies) is: if I have, say, 20 local commits and am rebasing them on top of some upstream, I have to fix conflicts up to 20 times; in general I will have to stop to fix conflicts at least as many times as I would have to while merging (namely 0 or 1 times).

Moreover, resolution work during a rebase creates a fake history that does not reflect how the work was actually done, which is antithetical to the spirit of version control, in a sense.

A result of this is the loss of any ability to distinguish between bugs introduced in the original code (pre-rebase) vs. bugs introduced while resolving conflicts (which are arguably more likely in the rebase case since the total amount of conflict-resolving can be greater).

It comes down to Resolution Work is Real Work: your code is different before and after resolution (possibly in ways you didn't intend!), and rebasing to keep the illusion of a total ordering of commits is a somewhat outdated use, or misuse, of the abstractions we now have available, which can capture a project's evolution in a more sophisticated way.

I was a dedicated rebaser for many years but have since decided that merging is superior, though we're still at the early stages of having sufficient tooling and awareness to properly leverage the more powerful "merge" abstraction, imho.

Well, git rerere helps here, though, honestly, this never happens to me even when I have 20 commits. Also, this is what you want, as it makes your commits easier to understand by others. Otherwise, with thousands of developers your merge graph is going to be a pile of incomprehensible spaghetti, and good luck cherry-picking commits into old release patch branches!

Ah, right, that's another reason to rebase: because your history is clean, linear, and merge-free, it makes it easier to pick commits from the mainline into release maintenance branches.

The "fake history" argument is no good. Who wants to see your "fix typo" commits if you never pushed code that needed them in the first place? I truly don't care how you worked your commits. I only care about the end result. Besides, if you have thousands of developers, each on a branch, each merging, then the upstream history will have an incomprehensible (i.e., _useless_) merge graph. History needs to be useful to those who will need it. Keep it clean to make it easier on them.

Rebase _is_ the "more powerful merge abstraction", IMO.

rebase : centralized repo :: merge : decentralized repo

rebase : linked-list :: merge : DAG

If the work/repo is truly distributed and there isn't a single permanently-authoritative repo, a "clean, linear" history is nonsensical to even try to reason about.

In all cases it is a crutch: useful (and nice, and sufficient!) in simple settings, but restricting/misleading in more complex ones (to the point of causing many developers to not see the negative space).

You can get very far thinking of a project as a linked list, but there is a lot to be gained from being able to work effectively with DAGs when a more complex model would better fit the reality being modeled.

It's harder to grok the DAG world because the tooling is less mature, the abstractions are more complex (and powerful!), and almost all the time and money up to now has explored the hub-and-spoke model.

In many areas of technology, however, better tooling and socialization around moving from linked-lists (and even trees) to DAGs is going to unlock more advanced capabilities.

Final point: rebasing is just glorified cherry-picking. Cherry-picking definitely also has a role in a merge-focused/less-centralized world, but merges add something totally new on top of cherry-picking, which rebase does not.

As @zeckalpha says, rebase != centralized repo.

You can have a hierarchical repo system (as we did at Sun).

Or you can have multiple hierarchies, contributing different series of rebased patches up the chain in each hierarchy.

Another possibility is that you are not contributing patches upstream but still have multiple upstreams. Even in this case your best bet is as follows: drop your local patches (save them in a branch), merge one of the upstreams, merge the other, re-apply (cherry-pick, rebase) your commits on top of the new merged head. This is nice because it lets you merge just the upstreams first, then your commits, and you're always left in a situation where your commits are easy to ID: they're the ones on top.

I'm the guy who started this DAG model (also at Sun with NSElite and then later with BitKeeper).

I agree that rebase == centralized. It's a math thing. If you rebase and someone has a clone of your work prior to the rebase chaos happens when they come together. So you have to enforce a centralized flow to make it work in all cases. It's pretty much provable as in a math proof.

Not true! At Sun we did this with project gates regularly. The way it works (as I've described several times in this thread now) is that you rebase --onto. That is, you use a tag for the pre-rebase project upstream to find the merge base for your branch, then cherry-pick your commits (i.e., all local commits after the merge base) onto the post-rebase project upstream.
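
The rebase --onto flow described above can be sketched with a toy model. This is only an illustration under stated assumptions, not how git works internally: commits are plain strings, a history is a list ordered oldest first, and `rebase_onto` is a hypothetical helper name.

```python
def rebase_onto(branch, old_upstream, new_upstream):
    """Replay the local commits of `branch` on top of `new_upstream`.

    Toy model (an assumption, not git internals): every history is a list
    of commit names, oldest first. `old_upstream` plays the role of the tag
    taken before the project upstream was itself rebased.
    """
    # The merge base is the end of the longest common prefix of the branch
    # and the pre-rebase upstream.
    base_len = 0
    for mine, theirs in zip(branch, old_upstream):
        if mine != theirs:
            break
        base_len += 1

    # Everything after the merge base is local work.
    local_commits = branch[base_len:]

    # "Cherry-pick" each local commit, in order, onto the rewritten upstream.
    # The trailing quote marks a replayed copy, since a rebased commit is a
    # new commit with the same change.
    return list(new_upstream) + [commit + "'" for commit in local_commits]


old_upstream = ["A", "B", "C"]           # tagged before the upstream rebased
new_upstream = ["A", "B2", "C2", "D"]    # the upstream after its own rebase
my_branch = ["A", "B", "C", "x", "y"]    # local commits x and y

print(rebase_onto(my_branch, old_upstream, new_upstream))
# ['A', 'B2', 'C2', 'D', "x'", "y'"]
```

In git terms this corresponds roughly to `git rebase --onto NEW_UPSTREAM OLD_UPSTREAM_TAG BRANCH`, where the three uppercase names are placeholders.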

Now, you don't want to do this with the ultimate upstream, though occasionally it happened at Sun with the OS/Net gate, usually due to some toxic commit that was best eliminated from the history rather than reverted, or through some accident.

But you'd be right to say that the Sun model was centralized in that there was just one ultimate upstream. (There was one per-"consolidation", since Solaris was broken up into multiple parts like that, but whatever, the point stands.)

Whereas with Linux, say, one might have multiple kernel gates kept by different gatekeepers. Still, if you're contributing to more than one of them, it's easier to cherry-pick (rebase!) your commits onto each upstream than to just merge your way around -- IMO. I.e., you can have a Linux-kernel-like decentralized dev model and still rebase.

However, as you can see from my comment in the previous paragraph, _rebase_ itself does not imply a centralized model.

I get that you can work around the problems; what you don't seem to get is that, from a math point of view, rebase forces one of:

a) a centralized model


b) you have to throw away any work based on the dag before the rebase


c) you have the history in the graph twice (which causes no end of problems).

(a) is the math way, (b) and (c) are ad-hoc hacks. You are well into the ad-hoc hacks, you've found a way to make it work but it includes "don't do that" warnings to users. My experience is that you don't want to have work flows that include "don't do that". Users will do that.

Also, it's harder to grok merge history because we humans have a hard time with complexity, and merge history in a system with thousands of developers and multiple upstreams can get insanely complex. The only way to cut through that complexity is to make sure that each upstream ends up with linear history -- that is: to rebase downstreams.

Nope, you want what I called the event stack. It lets you have your cake and eat it too.

The event stack is a record of every tip that was ever present in this repo other than unpushed commits.

You were at cset 1234, you pull in 25 csets, the event stack has two events, 1 which points to 1234 and 2 which points at the tip after the pull.

You commit "wacked the crap out of it", then commit "fixed typo", then commit "added test", then commit $whatever. The event stack is

1 2 . which points at your current tip but is floating

Now you push. Your event stack is 1, 2, 3 and 3 points at the tip as of your push.

What about clone? You get your parent's event stack but other than that they are per repo.

The event stack is the linear history you want, it is the view that everyone wants. It's "what is the list of tips I care about in this repo?". Have a push that broke your tree but you don't know what the previous tip was because the push pushed 2500 commits? No problem. The event stack is a stack and there is a "pop" command that pops off the last change to the event stack. So you would just do "git pop" and see if that fixes your tree, repeat until it does.
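
A minimal sketch of the event stack as described in this comment. It is a toy model only: neither BitKeeper nor git ships this, and the class and method names here are made up for illustration.

```python
class EventStack:
    """Toy model of the 'event stack': one entry per tip this repo has had.

    Hypothetical sketch; tips are plain strings standing in for changesets.
    """

    def __init__(self, initial_tip):
        self.events = [initial_tip]  # recorded tips, oldest first
        self.floating = None         # tip of unpushed local commits, if any

    def pull(self, new_tip):
        # A pull records one event, however many changesets it brought in.
        self.events.append(new_tip)

    def commit(self, new_tip):
        # Local commits only move the floating "." entry.
        self.floating = new_tip

    def push(self):
        # On push, the floating tip becomes a recorded event.
        if self.floating is not None:
            self.events.append(self.floating)
            self.floating = None

    def pop(self):
        """Drop the newest event and return the tip to roll back to."""
        if len(self.events) < 2:
            raise IndexError("no earlier tip to return to")
        self.events.pop()
        return self.events[-1]


stack = EventStack("1234")
stack.pull("1259")                 # pulled 25 csets: events 1 and 2
for tip in ("1260", "1261", "1262"):
    stack.commit(tip)              # unpushed work stays floating
stack.push()                       # event 3 points at the pushed tip
print(stack.events)                # ['1234', '1259', '1262']
print(stack.pop())                 # 1259
```

The point of the model is that the stack stays short and linear no matter how tangled the underlying graph is: one entry per pull or push, not one per commit.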

We never built this in BitKeeper but I should try. If for no other reason than to show people you can have the messy (but historically accurate) history under the covers but have a linear view that is pleasant for humans.

Yes, I've been asking for branch history (the reflog provides some, but it's insufficient because it's not shared in any way).

Even with this, I'd want to rebase away "fixed typo" prior to pushing, and more, I'd want to:

- organize commits into logical chunks so that they might be cherry-picked (in the literal sense, not just the VCS sense) into maintenance release branches

- organize commits as the upstream prefers (some prefer to see test updates in separate commits)

IIUC BitKeeper does have a sort of branch push history, unlike git. Is this wrong?

So the current BK doesn't really have branches, it has the model that if you want to branch you clone, each clone is a branch.

Which begs the question "how do you do dev vs stable branches?" And the answer is that we have a central clone called "dev" and a central clone called "stable". In our case we have work:/home/bk/stable and work:/home/bk/dev. User repos are in work:/home/bk/$USER/dev-feature1 and work:/home/bk/$USER/stable-bugfix123.

We run a bkd in work:/home so our urls are

BK has a concept of a level - you can't push from a higher level to a lower level. So stable would be level 1, dev would be level 2. Levels propagate on clone so when you do

    bk clone bk://work/dev dev-feature2
and then try and do

    bk push bk://work/stable
it will tell you that you can't push to a lower level. This prevents backflow of all the new feature work into your stable tree.
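
The level rule can be sketched in a few lines. This is a toy model of the behaviour described here, not BitKeeper's implementation; `Repo` and `can_push` are hypothetical names.

```python
class Repo:
    """Toy repository that carries only a name and a level."""

    def __init__(self, name, level):
        self.name = name
        self.level = level

    def clone(self, name):
        # Levels propagate on clone: the new repo inherits its parent's level.
        return Repo(name, self.level)


def can_push(source, destination):
    # Pushing from a higher level to a lower level is refused.
    return source.level <= destination.level


stable = Repo("stable", level=1)
dev = Repo("dev", level=2)

feature = dev.clone("dev-feature2")
print(can_push(feature, dev))     # True
print(can_push(feature, stable))  # False: would leak feature work into stable
```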

The model works well until you have huge (like 10GB and bigger) repos. At that point you really want branches because you don't want to clone 10GB to do a bugfix.

Though we addressed that problem, to some extent, by having nested collections (think submodules that actually support all workflows, unlike git, they are submodules that work). So you can clone the subset you need to do your bugfix.

But yeah, there are cases where "a branch is a clone" just doesn't scale, no question. But where it does work it's a super simple and pleasant model.

Decentralization at scale can result in a linear chain, too.

IMO, VC comes down not to tracking what was actually done, but to creating snapshots of logical steps that are reasonable to roll back to and git bisect with.

And cherry-pick onto release maintenance branches.

A pain I have with rebase workflow is that it creates untested commits (because diffs were blindly applied to a new version of the code). If I rebase 100 commits, some of the commits will be subtly broken.

How do you deal with that?

With git rebase you can in fact build and test each commit. That's what the 'exec' directive is for (among other things) in rebase scripts!

Basically, if you pick a commit, and in the next line exec make && make check (or whatever) then that build & test command will run with the workspace HEAD at that commit. Add such an exec after every pick/squash/fixup and you'll build and test every commit.

Or you could use the "-x" parameter to execute a command after every commit in the rebase.
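
To make the exec idea concrete, here is a small sketch that rewrites a rebase todo list by hand. It assumes the usual todo format of one "verb hash subject" line per commit; the helper name, the hashes, and the subjects below are made up.

```python
def add_exec_lines(todo_lines, command="make && make check"):
    """Insert an `exec` line after every pick/squash/fixup line.

    Follows the rule from the comment above literally; comment lines and
    blank lines in the todo list are passed through untouched.
    """
    commit_verbs = {"pick", "p", "squash", "s", "fixup", "f"}
    result = []
    for line in todo_lines:
        result.append(line)
        words = line.split()
        if words and words[0] in commit_verbs:
            result.append(f"exec {command}")
    return result


todo = [
    "pick 1a2b3c4 Add parser",
    "fixup 5d6e7f8 Fix typo in parser",
    "pick 9a8b7c6 Add parser tests",
]
print("\n".join(add_exec_lines(todo)))
```

In practice `git rebase -i -x "make && make check"` generates the exec lines for you, so a helper like this is only useful for seeing what the resulting todo list looks like.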

This is why, in git workflows with rebases, it's a good idea to create merge commits anyway, even if the master branch can be fast-forwarded.

That way, looking at the history, you know what commits are stable/tested by looking at merge commits. Others that were brought in since the last merge commit can be considered intermediary commits that don't need to be individually tested.

(Of course, there's also the rebase-and-squash workflow which I've personally never used, but it accomplishes the same thing by erasing any intermediary history altogether.)

Also, every commit upstream is stable by definition! Human failures aside, nothing should go upstream that isn't "stable/tested".

"Squashing" is just merging neighboring commits. I do that all the time!

Usually I work on something, commit incomplete work, work some more, commit, rinse, repeat; then when the whole thing is done I rewrite the history so that I have changes segregated into meaningful commits. E.g., I might be adding a feature and find and fix a few bugs in the process, add tests, fix docs, add a second, minor feature, debug my code, add commits to fix my own bugs, then rewrite the whole thing into N bug fix commits and 2 feature commits, plus as many test commits as needed if they have to be separate from related bug fix commits. I find it difficult to ignore some bug I noticed while coding a feature just so that I can produce clean history in one go without re-writing it! People who propose that one never rewrite local history are proposing to see a single merge commit from me for all that work. Or else the original commits that make no logical sense.

Too, I use "WIP" commits as a way to make it easy to back up my work: commit extant changes, git log -p or git format-patch to save it on a different filesystem. Sure, I could use git diff and thus never commit anything until I'm certain my work is done so I can then write clean history once without having to rewrite. But that's silly -- the end result is what matters, not how many cups of coffee I needed to produce it.

I've toyed with the idea of using merge commits to record sets of commits as being... atoms.

Suppose you want to push regression tests first, then bug fixes, but both together: this is useful for showing that the test catches the bug and the bug fix fixes it. But now you need to document that they go together, in case they need to be reverted, or cherry-picked onto release maintenance branches.

I think branch push history is really something that should be a first-class feature. I could live with using merge commits (or otherwise empty-commits) to achieve this, but I'll be filtering them from history most of the time!

we use a rebase workflow in git at my current employer, and it is amazing.

previous employer used a merge workflow (primarily because we didn't understand git very well at the time), and there were merge conflicts all the time when pulling new changes down or merging new changes in.

It was a headache to say the least. As the integration manager for one project, I usually spent the better part of an hour just going through the pull requests and merge conflicts from the previous day. I managed a team that was on the other side of the world, so there were always new changes when I started working in the morning.

Yes! One of the most important advantages of a rebase workflow is that you can see immediately what upstream commits you conflict with, as opposed to some massive merge where you have to go chasing branch history to figure out the semantics of the change in question.

"Amazing" is right. Sun was doing rebases in the 90s, and it never looked back.

My exact experience (in the context of "merging upwards"). Large merges are a huge pain to do, and are basically impossible to review, too.

Yes! Reviewing huge merges is infeasible. Besides, most CR tools are awful at capturing history, especially in multi-repo systems. So rebasing and keeping history clean and linear is a huge win there.

Though, of course, rebasing is a win in general, even if you happen to have an awesome CR tool (a unicorn I've yet to run into).

Another thing is that keeping your unpushed commits "on top" is a great aid in general (e.g., it makes it trivial to answer "what haven't I pushed here yet?"), but also is the source of rebasing's conflict resolution power.

Because your unpushed commits are on top, it's easy to isolate each set of merge conflicts (since you're going commit by commit) and to find the source of the conflicts upstream (with log/blame tools, without having to chase branch and merge histories).

We use a rebase workflow when working with third party source code. We keep all third party code on a main git branch and we create new branches off the main branch as we rebase our changes from third party code version to version.

Why wouldn't you use it for your own source?

Why is a clean linear history desirable? It's not reflective of how the product was built? Is it just for some naive desire of purity?

It's easier to see commits of a branch grouped together in most history viewers. Even though sorting commits topologically can help, most history viewers don't support that option.

When there is an undesired behavior that is hard to reason about, git-bisect can be used to determine the commit that first introduced it. With a normal merge, it will point to the merge commit, because it was the first time the 2 branches interacted. With a rebase, git bisect will point to one of the rebased commits, each of which already interacted with the branch coming before.
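
The bisect argument can be illustrated with a plain binary search over a linear history. This is a sketch of the idea, not git's actual algorithm (which also has to handle non-linear graphs); the commit names and the `is_bad` test are made up.

```python
def first_bad_commit(commits, is_bad):
    """Binary-search a linear history (oldest first) for the first bad commit.

    Assumes the first commit is good, the last one is bad, and that once a
    commit is bad every later commit is bad too.
    """
    good, bad = 0, len(commits) - 1  # commits[good] is good, commits[bad] is bad
    while bad - good > 1:
        middle = (good + bad) // 2
        if is_bad(commits[middle]):
            bad = middle
        else:
            good = middle
    return commits[bad]


# Rebased history: the feature commits sit on top of upstream one by one,
# so each of them has already "interacted" with the upstream changes.
rebased = ["base", "up1", "up2", "feat1'", "feat2'", "feat3'"]
broken_from = rebased.index("feat2'")

print(first_bad_commit(rebased, lambda c: rebased.index(c) >= broken_from))
# feat2'

# Mainline view of a merged history: if both branches were fine on their
# own, the only candidate left to blame is the merge itself.
merged = ["base", "up1", "up2", "merge feature"]
print(first_bad_commit(merged, lambda c: c == "merge feature"))
# merge feature
```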

Resolving conflicts in a big merge commit vs. in small rebased commits is like resolving conflicts in a distributed system by comparing only the final states vs. inspecting the actual sequences of changes.

Who cares about how a product's sub-projects were put together? What one should care about is how those sub-projects were put together into a final product. To be sure, the sub-projects' internal history can be archived, but it needn't pollute the upstream's history.

Analogously: who cares how you think? Aside from psychologists and such, that is. We care about what you say, write, do.

A trip through "git rebase" search results at HN sure is cringe-inducing. So many people fail to get it.

Can you describe it in a little more detail? Do you still use branches? If so, for what? For different versions?

Great question.

We basically had a single branch per repo, and every repo other than the master one was a fork (ala github). But that was dictated by the limitations of the VCS we used (Teamware) before the advent of Hg and git.

So "branches" were just a developer's or project's private playgrounds. When done you pushed to the master (or abandoned the "branch"). Project branches got archived though.

In a git world what this means is that you can have all the branches you want, and you can even push them to the master repo if that's ok with its maintainers, or else keep them in your forks (ala github) or in a repo meant for archival.

But! There is only one true repo/branch, and that's the master branch in the master repo, and there are no merge commits in there.

For developers working on large projects the workflow went like this:

- clone the project repo/branch

- work and commit, pulling --rebase periodically

- push to the project repo/branch

- when the project repo/branch rebases onto a newer upstream the developer has to rebase their downstream onto the rebased project repo/branch

Project techleads or gatekeepers (larger projects could have someone be gatekeeper but not techlead) would be responsible for rebasing the project onto the latest upstream.

To simplify things the upstream did a bi-weekly "release" (for internal purposes) that projects would rebase onto on either a bi-weekly or monthly schedule. This minimizes the number of rebases to do periodically.

When the project nears the dev complete time, the project will start rebasing more frequently.

For very large projects the upstream repo would close to all other developers so that the project could rebase, build, test, and push without having to rinse and repeat.

(Elsewhere I've seen uni-repo systems where there is no closing of the upstream for large projects. There a push might have to restart many times because of other pushes finishing before it. This is a terrible problem. But manually having to "close" a repo is a pain too. I think that one could automate the process of prioritizing pushes so as to minimize the restarts.)

Are you saying that you use Git instead of Mercurial these days?

Not necessarily implied by you; just checking.

Me personally? Yes, I use git whenever I can. I still have to use Mercurial for some things.

I don't know what Oracle does nowadays with the gates that make up Solaris. My guess is that they still have a hodge podge, with some gates using git, some Mercurial, and some Teamware still. But that's just a guess. For all I know they may have done what Microsoft did and gone with a single repo for the whole thing.

This is old but I meant Sun/Oracle/Java.

I remember watching the process to choose a new VCS. Mercurial is a fine choice but not the best choice. Even over the duration of the decision process the world was clearly moving overwhelmingly to Git.

I have tremendous respect for Microsoft pulling itself together over the past few years.

Such a relevant point, and I don't think they get enough props for it.

I don't believe for one second this was a quick turnaround for them either. I've spoken to MS dev evangelists at work stuff over the past few years and they've continually said "it's going to get better", usually with a wry smile.

It bloody did too. They're nowhere near perfect, and the different product branches remain as disjointed as ever, but I'm genuinely impressed at the sheer scale of the organisational change they've implemented.

Moving the entire code base from Source Depot (invented at Microsoft) to git (not MS) was a huge undertaking. I know many MS devs who hated git.

But this is seriously brave and well executed on their part.

Technically - Source Depot is a fork of Perforce. Not entirely invented at MSFT :).

I also know many MS devs who hated Source Depot :)

To give you a little idea of scale - they've been at this for at least 4 years. It started while I still worked in Windows.

This may be the thing that gets Google to switch. They like having every piece of code in a single repository which Git cannot handle.

Now that it is somewhat proven, maybe Google will leverage GVFS on Windows and create a FUSE solution for Linux.

I'd rather see google open up their monorepo as a platform, and compete with github. git is fine, but there's something compelling about a monorepo. Whether they do it one-monorepo-per-account, or one-global-monorepo, or some mix of the two, would be interesting to see how it shapes up.

Though as things are going, I wouldn't be surprised if Amazon goes from zero to production-quality public monorepo faster than Google gets from here to public beta. It's not in Google's blood.

And of course Google will shut it down in five years once they're bored of it.

Amazon doesn't do mono-repos. They have on the ORDER OF a million repos. They invested in excellent cross-repo meta-version and meta-build capabilities instead of going mono-repo.

"one-global-monorepo" caused me to envision a beautiful/horrifying Borg-like future where all code in the universe was in a single place and worked together.

This is how Linux (and BSD, and so on) distributions work. Of course there are proprietary and niche outliers, but you can't forbid those in the first place.

I felt like that when I first saw golang and how you can effortlessly use any repo from anywhere.

As the joke goes, Go assumes all your code lives in one place controlled by a harmonious organization. Rust assumes your dependencies are trying to kill you. This says a lot about the people who came up with each one.

> Rust assumes your dependencies are trying to kill you.

Would you mind unpacking this? I'm intrigued.

Cargo.lock for applications freezes the entire dependency graph incl. checksums of everything, for example.
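
The "freeze everything, including checksums" idea can be sketched generically. This is not Cargo's file format or code; it only shows the principle of pinning exact bytes by digest, using SHA-256 as the digest for this sketch. The package name and contents are made up.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def freeze(packages):
    """Build a lock table mapping (name, version) to a checksum of the bytes."""
    return {key: sha256_hex(blob) for key, blob in packages.items()}


def verify(lock, name, version, blob):
    """Accept a download only if it matches the checksum recorded in the lock."""
    expected = lock.get((name, version))
    return expected is not None and sha256_hex(blob) == expected


packages = {("examplelib", "1.0.0"): b"original source bytes"}
lock = freeze(packages)

print(verify(lock, "examplelib", "1.0.0", b"original source bytes"))  # True
print(verify(lock, "examplelib", "1.0.0", b"tampered source bytes"))  # False
print(verify(lock, "examplelib", "1.0.1", b"original source bytes"))  # False
```

A dependency that changes under the same name and version, or a version that was never locked, is rejected rather than trusted.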

This is the main thing I miss about subversion. You could check out any arbitrary subdirectory of a repository. On two projects the leads and full stack people had the whole thing checked out, everybody else just had the one submodule they were responsible for. Worked fairly well.

Mercurial has narrowspecs these days too! Facebook's monorepo lets you check out parts of the overall tree too. It's not like every Android engineer's laptop has all of fbios in it.

Git has submodules too and teams usually have access control on the main server used for sharing commits.

git submodules aren't seamless. None of the alternatives appear to be any better. A halfassed solution is no solution at all.

Would you mind explaining about this lack of seamlessness?

I can imagine that part of it could be the need of a git clone --recursive, and everybody omits the --recursive if they don't know there are submodules inside the repository. There is another command to pull the submodules later but I admit it's far from ideal.

What's wrong with git submodules?

Google already has a FUSE layer for source control: http://google-engtools.blogspot.com/2011/06/build-in-cloud-a...

Google used to have a Perforce frankenstein, but now they have their own VCS.

Piper is their perforce-like server. You check out certain files, as you would with p4, work on the tree using git, with reviews, tests, etc. You periodically do a sync which is like pull --rebase. Then you push your changes back into the perforce-like monorepo.

also they do all the development in one branch / in the trunk https://arxiv.org/abs/1702.01715 ( I never understood the explanations as to why they do that )

Now the article says that with windows they do branches.

They use (or used) mercurial, which is as good as git. In fact, I bet a lot of people at Google are happy to use mercurial instead of git, given git's bad reputation with its command line interface.

They don't use mercurial.

You're thinking of Facebook if I'm not mistaken.

I had seen several sources that affirmed that Google used Mercurial, but I'm not sure to what extent, so I will retract it :-)

I'm sure there are a few teams that use Mercurial incidentally somewhere, but our primary megarepo is all on a VCS called Piper. Piper has a Perforce-y interface and there are experimental efforts to use Mercurial with Piper. Also mentioned in the article below, there's limited interop with a Git client.

If you're curious what it all ends up looking like, read this article. It's a fairly good overview and reasonably up to date.


IIRC, the git client is deprecated; the mercurial one is meant to replace it for the use cases where you would want a DVCS client interfacing with Piper in the first place.

It's not deprecated. I use git-multi every day, much to the chagrin of my reviewers. Google thinks that long DIFFBASE chains are weird and exotic, and Google doesn't like weird and exotic as a rule.

I wonder why Windows is a single repository - why not split it into separate modules? I can imagine tools like Explorer, Internet Explorer/Edge, Notepad, Wordpad, Paint, etc. could each stay in their own repository. I can imagine you could split things up even further, like a kernel, a group of standard drivers, etc. If that is not already the case (separate repos, that is), are there plans to separate it in the future?

Really good question. Actually, splitting Windows up was the first approach we investigated. Full details here: https://www.visualstudio.com/learn/gvfs-design-history/


- Complicates daily life for every engineer

- Becomes hard to make cross-cutting changes

- Complicates releasing the product

- There's still a core of "stuff" that's not easy to tease apart, so at least one of the smaller Windows repos would still have been a similar order of magnitude in most dimensions

> - Becomes hard to make cross-cutting changes

This does seem like a negative, doesn't it?

But it's not. Making it hard to make cross-cutting changes is exactly the point of splitting up a repo.

It forces you to slow down, and—knowing that you can only rarely make cross-cutting changes—you have a strong incentive to move module boundaries to where they should be.

It puts pressure on you to really, actually separate concerns. Not just put "concerns" into a source file that can reach into any of a million other source files and twiddle the bits.

"Easy to make sweeping changes" really means "easy to limp along with a bad architecture."

I think that's one of the reasons why so much code rots: developers thinking it should be easy to make arbitrary changes.

No, it should be hard to make arbitrary changes. It should be easy to make changes with very few side effects, and hard to make changes that affect lots of other code. That's how you get modules that get smaller and smaller, and change less and less often, while still executing often. That's the opposite of code rot: code nirvana.

> No, it should be hard to make arbitrary changes.

If you change the word "arbitrary" to "necessary" (implying a different bias than the one you went with) then all of a sudden this attitude sounds less helpful.

Similarly "easy to limp along with a bad architecture" could be re-written as "easy to work with the existing architecture".

At the end of the day, it's about getting work done, not making decisions that are the most "pure".

You have to balance getting work done vs. purity, and Microsoft has spent years trying to fix a bad balance.

Windows ME/Vista/8 were terrible and widely hated pieces of software because of "getting things done" instead of making good decisions. They made billions of dollars doing it, don't get me wrong, but they've also lost a lot of market share and have been piling on bad sentiment for years. They've been pivoting, and it has nothing to do with "getting work done"; it's about going back and making better decisions.

Those releases (well, Vista and 8 anyway, I don't know about ME) came out of a long and slow planning process - if they made bad decisions I don't think it was about not taking long enough to make them.

I assumed that Windows 8 was hated because it broke the Start Menu and tried to force users onto Metro.

It also broke a lot of working user interfaces, e.g. wireless connection management.

> At the end of the day, it's about getting work done, not making decisions that are the most "pure".

This attitude will lead to a total breakdown of the development process over the long term. You are privileging Work Done At The End Of The Day over everything else.

You need to consider work done at every relevant time scale.

How much can you get done today?

How much can you get done this month?

How much can you get done in 5 years?

Ignore any of these questions at your peril. I fundamentally agree with you about purity though. I'm not sure what in my piece made you think I think Purity Uber Alles is the right way to go.

> This attitude will lead to a total breakdown of the development process over the long term.

As evidenced by Microsoft following the one repo rule and not being able to release any new software.

Wait, what?

The text I quoted had nothing to do with monolithic repos.

The linux codebase exists in stark contrast to your claim. Assuming your claim is that broken-up repos are the better way.

No it doesn't. I think you are thinking of the kernel, which is separate from all the distros.

Linux by itself is just a kernel and won't do anything for you without the rest of the bits that make up an operating system.

Then I'll point to the wide success of monolithic utilities such as systemd as evidence that consolidating typically helps long term.

Which is to say, not shockingly, it is typically a tradeoff debate where there is no toggle between good and bad. Just a curve that constantly jumps back and forth between good and bad based on many many variables.

systemd is also completely useless on its own. It still needs a bootloader, a kernel, and user-space programs to run.

When it comes to process managers, there is obviously disagreement about how complex they should be, but systemd is still a system to manage and collect info about processes.

The hierarchical merging workflow used by the Linux kernel does mean that there's more friction for wide-ranging, across-the-whole-tree changes than changes isolated to one subsystem.

Isolated changes will always be easier than cross cutting ones. The question really comes down to whether or not you have successfully removed cross cutting changes. If you have, then more isolation almost certainly helps. If you were wrong, and you have a cross cutting change you want to push, excessive isolation (with repos, build systems, languages, whatever), adds to the work. Which typically increases the odds of failure.

Arguing about purity is only pointless and sanctimonious if the water isn't contaminated. Being unable to break a several hundred megabyte codebase into modules isn't a "tap water vs bottled" purity argument, it's a "let's not all die of cholera" purity argument.

As the linked article says, modularizing and living in separate repos was the plan of record for a while. But after evaluating the tradeoffs, we decided that Windows needs to optimize for big, rapid refactors at this stage in its development. "Easy to make sweeping changes" also means "easy to clean up architecture and refactor for cleaner boundaries".

The Windows build system is where component boundaries get enforced. Having version control introduce additional arbitrary boundaries makes the problem of good modularity harder to solve.

You say that, but it is very telling that every large company out there (Google and Facebook come to mind) goes for the single-repository approach.

I'm sure that, when dealing with stakeholder structures where different organizations can depend on different bits and pieces, having multiple repositories with difficulty of making breaking and cross-cutting changes, becomes good.

From the view of a single organization where the only users of a component are other components in the same organization, it seems like there is consensus around single-repository.

It is very telling. Google has a cloud supercomputer doing nothing but building code to support their devs. I don't know about Facebook. (I really don't -- I'm constantly amazed that they have as many engineers as they do, what do they all work on?) Where I work (https://medium.com/salesforce-engineering/monolith-to-micros...) there's a big monolith but with more push towards breaking things up, at the architecture level and also on the code organization level. We also use and commit to open source projects (that use git), so to integrate those with the core requires a bit more effort than if they were there already but it's not a big burden and the benefits of having their tendrils being self-contained are big.

Which brings me to the point that in the open source world, you can't get away with a single-repository approach for your large system. And that also is telling, along with open source's successes. So which approach is better in the long run? I'd bet on the open source methods.

You forget Conway's law:

> organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations

Open source has a very different communication structure than a company. While the big three (MS, Google, FB) try to work towards good inter-department relations, in open source it is usually either

- a single person, or

- a small group

that is the gatekeeper for a small amount of code, typically encapsulated in a "project". They commit a lot to their own project, yet rarely touch other projects in comparison.

Also, collaboration is infinitely harder, as in the office you can simply walk up to someone, call them, or chat with them - in OSS a lot of communication works via Issues and PRs, which are a fundamentally different way to communicate.

This all is reflected by the structure of how a functionality is managed: Each set of gatekeepers gets their own repository, for their function.

Interestingly this even happens with bigger repositories: DefinitelyTyped is a repository for TypeScript "typings" for different JS libraries, which has hundreds of collaborators.

Yet, if you open a pull-request for an existing folder the ones that have previously made big-ish changes can approve / decline the PR, so each folder is its own little repo.


So: maybe the solution is big repos for closed companies, small repos for open-source?

Amazon doesn't.

From my experience with Java-style hard module dependencies, this makes it extremely difficult to refactor anything touching external interfaces.

You say this forces you to think ahead, but predicting the future is quite difficult. The result is that you limp along with known-broken code because it would take so much effort to make the breaking changes to clean it up.

For example, let's say you discover that people are frequently misusing a blocking function because they don't realize that it blocks.

Let's say that we have a function `bool doThing()`. We discover that the bool return type is underspecified: there's a number of not-exactly-failure not-exactly-success cases. In a monorepo, it's pretty easy to modify this so that `doThing()` can return a `Result` instead. With multiple repos and artifacts, you either bring up the transitive closure of projects, or you leave it for someone to do later. For a widely used function, this can be prohibitive. That makes people frequently choose the "rename and deprecate" model, which means you get an increasing pile of known-bad functions.
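
To make the "rename and deprecate" pattern concrete, here is a minimal sketch. Everything in it is made up for illustration: the `Status` cases, the `do_thing_v2` name, and the rule for what counts as a partial result are assumptions, not anything from a real codebase.

```python
import warnings
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OK = "ok"
    PARTIAL = "partial"  # neither a clean success nor an outright failure
    FAILED = "failed"


@dataclass(frozen=True)
class Result:
    status: Status
    detail: str = ""


def do_thing_v2(payload: str) -> Result:
    """New interface: reports the in-between cases explicitly."""
    if not payload:
        return Result(Status.FAILED, "empty payload")
    if payload.strip() != payload:
        return Result(Status.PARTIAL, "payload had surrounding whitespace")
    return Result(Status.OK)


def do_thing(payload: str) -> bool:
    """Old interface, kept only so callers in other repos keep working."""
    warnings.warn(
        "do_thing is deprecated; use do_thing_v2",
        DeprecationWarning,
        stacklevel=2,
    )
    # The bool collapses PARTIAL into success, which is exactly the
    # underspecification the new return type is meant to fix.
    return do_thing_v2(payload).status is not Status.FAILED
```

In a monorepo you could delete `do_thing` in the same commit that updates every caller. With multiple repos, the wrapper tends to stay around for as long as any downstream artifact still calls it.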

Have you actually worked on a repository at Windows scale..? If not, how can you know that your guesses about the workflow are accurate?

Making something difficult even more difficult is not helpful to anyone.

And what happens if the inherent difficulty disappears or reduces? You're still left with the imposed external difficulty.

Remember when they said IE was an integral part of the OS? Yeah...

Facebook cited similar reasons for having a single large repository: https://code.facebook.com/posts/218678814984400/scaling-merc...

I remember that, in the 90's, you'd often get new UI elements in Office releases that then would eventually move into Windows. There was a technical reason - the cross-cutting - but there also seemed to be a marketing reason - the moment those UI elements became part of core Windows, all developers (yours truly included) would be able to use those elements, effectively denying Office its fresh look ahead of the competition.

That's a really interesting article. I wish I would have found it before going down a similar path with my team, recently.

These types of use cases seem so commonly encountered that there should be a list of best practices in the Git docs.

It's harder to share code between repos, though.

EDIT: like if something was to be shared between Windows and Office, for example.

This was a very interesting point. It sounds like there are some serious architectural limitations on Windows, and this makes me believe the same might be true for the NT kernel, and that MS might not be interested in doing heavy refactoring of it.

I'm not a frequent Windows user, or a Windows dev at all. Does anyone know of any consequences that MS's decision might mean, if this hypothesis is true?

The NT kernel is surprisingly small and well-factored to begin with - it is a lot closer to a 'pure' philosophy (e.g. microkernel) than something like Linux.

If you have a problem with Windows being overcomplicated or in need of refactor it is almost certainly something to do with not-the-kernel.

If you look at something like the Linux kernel, it's actually much larger than Windows. It needs to have every device driver known to man (except that one WiFi/GPU/Ethernet/Bluetooth driver you need) because internally the architecture is not cleanly defined and kernel changes also involve fixing all the broken drivers.

Small and well-factored the core kernel may be, but if you're parsing fonts in kernel mode, you ain't a microkernel.

( https://googleprojectzero.blogspot.com.au/2015/07/one-font-v... )

For sure, and Windows is not a microkernel, but it does have separated kernel-in-kernel and executive layers; it would approach being a microkernel architecture if the executive were moved into userland. This is similar to how macOS would be a microkernel if everything weren't run in kernel mode (Mach, on which it is partially based, is a microkernel).

Of course the issue here is that after NT 4, GDI has been in kernel mode; this is necessary for performance reasons. Prior to that it was a part of the user mode Windows subsystem.

I'd be curious to see if GDI moved back to userland would be acceptable with modern hardware, but I suspect MS is not interested in that level of churn for minimal gain.

Could you please share a link to NT kernel sources so that I take a look?

It is not true that the internal architecture of Linux drivers is not clearly defined. It is just a practical approach to the maintenance of drivers (as the author of one Linux hardware driver, I'm pretty sure it's the best possible). The reasoning is outlined in the famous "Stable API nonsense" document http://elixir.free-electrons.com/linux/latest/source/Documen...

I don't think the Windows approach is worth praising here. It results in a lot of drivers no longer working after a few major Windows release upgrades. In Linux, available drivers can be maintained forever.

If you are sufficiently motivated, NT 4 leaked many years ago and you could find it; it even has interesting things like DEC Alpha support & various subsystems still included IIRC. Perhaps you could find a newer version like 5.2 on GitHub or another site, but beware, as a Linux dev/contributor, you probably don't want to have access to that.

FWIW, I've stumbled upon both of those things in my personal research while I had legitimate access to the 5.2 sources as a student. It turns out Bing will link you directly to the Windows source code if you search for <arcane MASM directive here>.

Yes, I'm in the process of reporting this to Microsoft and cleansing myself of that poison apple.

Yes! Windows Kernel is a much more "modern" microkernel architecture than any of the circa-1969 Unix-like architectures popular today. We use Windows 10 / Windows Server for everything at our company, and we have millions of simultaneously connected users on single boxes. No problems and easy to manage.

It seems presumptive to say that "If you can't use multiple repos, your architecture must be bad". I could just as easily counter with, "If you're using multiple repos, it must mean you have an unnecessarily complex and fragile microservice architecture"

I'm sorry that was implied. I simply want better insight into the kernel from people who have experience developing large kernels, and the decisions that are made as a consequence of architectural choices.

I feel that those questions are valid, and are important in this field, not just kernel development. As someone who desires to continue learning, I will not yield to your counter.

Google also uses a single giant repo...

Why is that a "good reason" to do it?

Not really :) and I am not the OP. This article [1] provides a very good overview of the repository organization Google has and the reasons behind it.

I think the reason that this works well for Google is the amount of test automation that is in place, which seems to be very good at flagging any errors due to dependency changes before they get deployed. Not sure how many organizations have invested in and have an automated test infrastructure like Google has built.

1. https://cacm.acm.org/magazines/2016/7/204032-why-google-stor...

It radically simplifies everything. Every commit is reproducible across the entire tool chain and ecosystem.

It makes the entire system kind of pure-functional and stateless/predictable. Everything from computing which tests you need to run, to who to blame when something breaks, to caching build artifacts, or even sharing workspaces with co-workers.

While this could be implemented with multiple repos underneath, it would add much complexity.

I think this is like the "your data isn't big enough for HDFS" argument earlier this week. The point I take away is that at some stage of your growth, this will be a logical decision. I don't think it implies that the same model works for your organization.

And Facebook.

Why split it into separate modules? Seeing that big companies are very successful with monorepos (Google, Facebook, Microsoft), has made me reconsider if repository modularization is actually worth it. There are a host of advantages to not modularizing repos, and I'm beginning to believe they outweigh those of modular repos.

mono repo only works if you have tooling. google's and facebook's tools are not opensource. also, ms tooling is windows only.

so for most of us the only reasonable path is to split into multiple repositories.

it is also easier to create tools that deal with many repositories than it is to create a tool that virtualizes a single large repo.

Microsoft's tooling is Windows only _today_. GVFS is open source _and_ we are actively hiring filesystem hackers for Linux/macOS.

but google's and facebook's repos are also orders of magnitude larger than what most people deal with, normal tools might work just fine in most cases.

> Why split it into separate modules?

Well, because of the unbelievable amount of engineering work involved in trying to get Git to operate at such insane scale? To say nothing of the risk involved in the alternative. This project in particular could easily have been a catastrophe.

Microsoft has about 50k developers. When you're dealing with an engineering organization of this size, you're looking at a run rate of $5B a year, or about $20M a day. It's a no-brainer to spend tens or even hundreds of millions of dollars on projects like this if you're going to get even a few percentage points of productivity.
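
For anyone who wants to check the arithmetic, here is a back-of-envelope sketch. The 50k headcount is from the comment above; the $100k per developer and the 250 working days are assumptions, picked only because they reproduce the $5B and $20M figures.

```python
# Back-of-envelope numbers. Only DEVELOPERS comes from the comment above;
# the per-developer cost and the working-day count are assumptions.
DEVELOPERS = 50_000
COST_PER_DEV_PER_YEAR = 100_000  # assumed cost per developer, USD
WORKING_DAYS_PER_YEAR = 250      # assumed

annual_run_rate = DEVELOPERS * COST_PER_DEV_PER_YEAR
daily_run_rate = annual_run_rate / WORKING_DAYS_PER_YEAR


def annual_value_of_gain(productivity_gain: float) -> float:
    """Dollar value per year of a fractional productivity gain (0.02 = 2%)."""
    return annual_run_rate * productivity_gain


if __name__ == "__main__":
    print(f"annual run rate: ${annual_run_rate:,}")
    print(f"daily run rate:  ${daily_run_rate:,.0f}")
    for gain in (0.01, 0.02, 0.05):
        print(f"{gain:.0%} gain is worth ${annual_value_of_gain(gain):,.0f} per year")
```

Under those assumptions even a 1% gain is worth tens of millions of dollars a year, which is the scale the argument relies on.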

It can be hard to understand this as an individual engineer, but large tech companies are incredible engineering machines with vast, nearly infinite, capacity at their disposal. At this scale, all that matters is that the organization is moving towards the right goals. It doesn't matter what's in the way; all the implementation details that engineers worry about day-to-day are meaningless at this scale. The organization just paves over any obstacles in pursuit of its goal.

In this case Microsoft decided that Git was the way to go for source control, that it's fundamentally a good fit for their needs. There were just implementation details in the way. So they just... did it. At their scale, this was not an incredible amount of work. It's just the cost of doing business.

You haven't said anything at all about the risk.

If there's one thing large organizations are good at, it's managing risk. And if you read their post, they've done this.

They're running both source control systems in parallel, switching developers in blocks, and monitoring commit activity and feedback to watch for major issues. In the worst case, if GVFS failed or developers hated it, they could roll back to their old system.

Again, to my point above: there's a cost to doing this but it's negligible for very large organizations like Microsoft.

Wait so like, at google, Inbox and Android are in the same repo as ChromeOS and oh I dunno, Google Search? That doesn't make any sense at all...

Android and Chrome are different, but most of Google's code lives in a single repository.


It makes total sense when the expectation is that any engineer in the company can build any part of the stack at any time with a minimum of drama.

This just blew my mind. I'm gonna go home and see about combining all my projects. That seems very useful!

It makes sense to have separate repos for things that don't interact. But when your modules or services do, having them together cuts out some overhead.

You don't have the same requirements as Google.

What are my requirements such that this wouldn't work for me?

It makes things tricky if you want to opensource just one of your projects.

Well, there's always "git filter-branch".

Not that I'd want to run it on such a mega-repository; it takes long enough running it on an average one with a decade of history.

What's dramatic about copy and pasting a clone uri into a command?

A url? Not much.

100 urls? That's getting a bit annoying.

Yes, Inbox, Maps, Search, etc. are all in one repo in a specialized version control system.

Android, Chromium, and Linux (among others, I'm sure) are different in that they use git for version control so they are in their own separate repos.

Unsure if all their clients are in the same repo as well but even if…

Why doesn't this make sense?

I personally think of a repo as an index, not a filesystem. You check out what you need, but there is one global constant state, which can e.g. be used for continuous integration tests.

Android is in a separate repo.

their earlier blogpost goes into it a bit https://blogs.msdn.microsoft.com/bharry/2017/02/03/scaling-g... :

The first big debate was – how many repos do you have – one for the whole company at one extreme or one for each small component? A big spectrum. Git is proven to work extremely well for a very large number of modest repos so we spent a bunch of time exploring what it would take to factor our large codebases into lots of tenable repos. Hmm. Ever worked in a huge code base for 20 years? Ever tried to go back afterwards and decompose it into small repos? You can guess what we discovered. The code is very hard to decompose. The cost would be very high. The risk from that level of churn would be enormous. And, we really do have scenarios where a single engineer needs to make sweeping changes across a very large swath of code. Trying to coordinate that across hundreds of repos would be very problematic.

After much hand wringing we decided our strategy needed to be “the right number of repos based on the character of the code”. Some code is separable (like microservices) and is ideal for isolated repos. Some code is not (like Windows core) and needs to be treated like a single repo.

As Brian Harry alluded to, this is in fact very close to the previous system. As I recall it from my time at Microsoft:

The Windows source control system used to be organized as a large number (about 20 for product code alone, plus twice that number for test code and large test data files) of independent master source repositories.

The source checkouts from these repos would be arranged in a hierarchy on disk. For example, if root\ were the root of your working copy, files directly under root\ would come from the root repo, files under root\base\ from the base repo, root\testsrc\basetest from basetest, and so on.
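
A rough sketch of how a working-copy path could be resolved to its owning repo under that layout. The mount table holds only the three examples given above, and `depot_for` is a hypothetical helper, not part of the actual tooling.

```python
from pathlib import PureWindowsPath

# Hypothetical mount table: working-copy subdirectory -> repo name.
# Only the three entries mentioned above; the real layout had many more.
MOUNTS = {
    PureWindowsPath("root"): "root",
    PureWindowsPath("root/base"): "base",
    PureWindowsPath("root/testsrc/basetest"): "basetest",
}


def depot_for(path: str) -> str:
    """Return the repo owning `path`, chosen by the longest matching mount prefix."""
    p = PureWindowsPath(path)
    candidates = [m for m in MOUNTS if p == m or m in p.parents]
    if not candidates:
        raise ValueError(f"{path} is outside the working copy")
    # The deepest mount wins, so root\base\... resolves to "base", not "root".
    deepest = max(candidates, key=lambda m: len(m.parts))
    return MOUNTS[deepest]
```

Anything that does not fall under a more specific mount, such as a file under `root\testsrc\` outside `basetest`, resolves to the root repo.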

To do cross repo operations you used a tool called "qx" (where "q" was the name of the basic source control client). Qx was a bunch of Perl scripts, some of which wrapped q functions and others that implemented higher-level functions such as merging branches. However, qx did not try to atomically do operations across all affected repos.

(The closest analog to this in git land would be submodules.)

While source control was organized this way, build and test ran on all of Windows as a single unit. There was investigation into trying to more thoroughly modularize Windows in the past, but I think the cost was always judged too great.

Mark Lucovsky did a talk several years ago on this source control system, among other aspects of Windows development:


I believe it is still valid for folks not using GVFS to access Windows sources.

It'd probably be hell on earth for the developers/engineers if it was more than one repo.

I'm sure the build process is non-trivial and with 4,000 people working on it, the amount of updates it gets daily is probably insane. Any one person across teams trying to keep this all straight would surely fail.

Having done a lot of small git repos, I'm a big fan of one huge repo. It makes life easier for everyone, especially your QA team as it's less for them to worry about. In the future, anywhere I'm the technical lead I'm gonna push for one big repo. No submodules either. They're a big pain in the ass too.

Most of the work that is the issue would be solved by meta repos and tools to keep components up to date and integrated upward.

So, this is actually pretty common. I know that both Google and Facebook use a huge mono-repo for literally everything (except I think Facebook split out their Android code into a separate repo?). So, all of Facebook's and Google's code for front-end, back-end, tools, infrastructure, literally everything, lives in one repo.

It's news to me that Windows decided to go that route too. Personally, I think submodules and git sub-trees suck, so I'm all for putting things in a monorepo.

How does a mono-repo company manage open sourcing a single part of their infrastructure if things are in one large repo? For example, if everything lived in one repo, how does Facebook manage open sourcing React? Or if I personally wanted to switch to one private mono-repo, how would I share individual projects easily?

So open sourcing can mean three different things:

The bad one is just dumping snapshots of the code into a public repo every so often. You need to make sure your dependencies are open source, have tools that rewrite your code and build files accordingly, and put them in a staging directory for publishing.

The good one is developing that part publicly, and importing it periodically into your internal monorepo with the same (or similar) process to the one you use for importing any other third-party library.

There's also a hybrid approach which is to try and let internal developers use the internal tooling against the internal version of the code, and also external developers, with external tooling, against the public version. That one's harder, and you need a tool that does bidirectional syncing of change requests, commits, and issues.

We have an internal tool that allows us to mirror subdirectories of our monorepo into individual github repositories, and another tool that helps us sync our internal source code review tool with PRs etc.

An internal tool which manages commits between individual repos, etc.: does it not seem that this is a logical extension to git itself? A little like submodules, but being able to publish only parts of the source tree. Maybe it would be impossible to keep any consistency and avoid leaking information from the rest of the tree.

With difficulty.

No, seriously, that's the answer.

They have an internal mono-repo and public repos on GitHub that are mirrors of their mono-repo.

pros to big repo:

-dont have to spend time to think about defining interfaces

cons:

-history is full of crap you dont care about

-tests take forever to run

-tooling breaks down completely, though thanks to MS the limit was increased seriously

Are the big monorepo companies actually waiting for global test suite completion for every change? I'd doubt that, I'm sure they're using intelligent tools to figure out what tests to actually run. Compute for testing is massively expensive at that scale so it's an obvious place to optimize

Google's build and testing system is smart in which tests to run, as you suspect, but it still has a very, very large footprint.

Right. My point is that the monorepo almost certainly isn't a problem in this regard.

You still have to do something about internal interfaces. The problem is that the moment you want to make a backwards-incompatible change to an internal interface now you have to go find users of it, and there go the benefits of GVFS... Or you can let the build and test system tell you what breaks (take a long coffee break, repeat as many times as it takes; could be many times). Or use something like OpenGrok to find all those uses in last night's index of the source.

Defining what portions of the OS you'll have to look in for such changes helps a great deal.

As to building and testing... the system has to get much better about detecting which tests will need to be re-run for any particular change. That's difficult, but you can get 95% of the way there easily enough.
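
One common way to get most of the way there is to walk the build graph backwards from whatever changed. This is a toy sketch under stated assumptions: the target names and the `DEPS` graph are invented, and it is not how any particular company's system works.

```python
from collections import deque

# Hypothetical build graph: target -> targets it depends on.
DEPS = {
    "//net:http": ["//base:strings"],
    "//ui:widgets": ["//base:strings", "//gfx:draw"],
    "//net:http_test": ["//net:http"],
    "//ui:widgets_test": ["//ui:widgets"],
    "//gfx:draw_test": ["//gfx:draw"],
}


def affected_tests(changed: set[str], deps: dict[str, list[str]]) -> set[str]:
    """Return every *_test target that transitively depends on a changed target."""
    # Invert the graph so we can walk from a target to its dependants.
    rdeps: dict[str, set[str]] = {}
    for target, target_deps in deps.items():
        for dep in target_deps:
            rdeps.setdefault(dep, set()).add(target)

    # Breadth-first walk over reverse dependencies.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for dependant in rdeps.get(node, ()):
            if dependant not in seen:
                seen.add(dependant)
                queue.append(dependant)

    return {t for t in seen if t.endswith("_test")}
```

A change to `//gfx:draw` selects `//gfx:draw_test` and `//ui:widgets_test` but leaves `//net:http_test` alone. The hard remaining part is dependencies the build graph does not know about, such as runtime configuration or data files.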

-dont have to spend time to think about defining interfaces

That seems like a design and policy choice, orthogonal to repos.

Not really. It's easier to make a single atomic breaking change to how different components talk to each other if they are in the same repository.

If they are in different repos, the change is not atomic and you need to version interfaces or keep backwards compatibility in some other way.

It's very much really. The fact that it's easier doesn't really matter - a repo is about access to the source code and its history with some degree of convenience. The process and policy of how you control actual change is quite orthogonal. You can have a single repo and enforce inter-module interfaces very strongly. You can have 20 repos and not enforce them at all. Same goes for builds, tests, history, etc. The underlying technology can influence the process but it doesn't make it.

I have always wondered how they deal with acquisitions and sales. I guess a single system makes sense there too.

I have worked at a company using one repo per team and at Google which uses a big monorepo. I much prefer the latter. As long as you have the infrastructure to support it, I see no other downsides (obviously, Google does not use git).

> Before the move to Git, in Source Depot, it was spread across 40+ depots and we had a tool to manage operations that spanned them.

Coming from the days of CVS and SVN, git was a freaking miracle in terms of performance, so I have to put things into perspective here when the topmost issue with git is performance. It's a testament to how huge the codebases we're dealing with are (Windows over there, but also Android, and surely countless others). The staggering amount of code we're wrangling these days and the level of collaboration are incredible, and I'm quite sure we would not have been able to do that (or at least not that nimbly and with such confidence) were it not for tools like git (and hg). There's a sense of scale regarding that growth across multiple dimensions that just puts me in awe.

Broadly speaking this is true, but note that in some ways CVS and SVN are better at scaling than Git.

- They support checking out a subdirectory without downloading the rest of the repo, as well as omitting directories in a checkout. Indeed, in SVN, branches are just subdirectories, so almost all checkouts are of subdirectories. You can't really do this in Git; you can do sparse checkouts (i.e. omitting things when copying a working tree out of .git), but .git itself has to contain the entire repo, making them mostly useless.

- They don't require downloading the entire history of a repo, so the download size doesn't increase over time. Indeed, they don't support downloading history: svn log and co. are always requests to the server. Unfortunately, Git is the opposite, and only supports accessing previously downloaded history, with no option to offload to a server. Git does have the option to make shallow clones with a limited amount of (or no) history, and unlike sparse checkouts, shallow clones truly avoid downloading the stuff you don't want. But if you have a shallow clone, git log, git blame, etc. just stop at the earliest commit you have history for, making it hard to perform common development tasks.

I don't miss SVN, but there's a reason big companies still use gnarly old systems like Perforce, and not just because legacy: they're genuinely much better at scaling to huge repos (as well as large files). Maybe GVFS fixes this; I haven't looked at its architecture. But as a separate codebase bolted on to near-stock Git, I bet it's a hack; in particular, I bet it doesn't work well if you're offline. I suspect the notion of "maybe present locally, maybe on a server" needs to be baked into the data model and all the tools, rather than using a virtual file system to just pretend remote data is local.

CVS and SVN are probably a bit better at scaling than (stock) Git. Perforce and TFVC _are certainly_ better at scaling than (again, stock, out-of-the-box) Git. That was their entire goal: handle very large source trees (Windows-sized source trees) effectively. That's why they have checkout/edit/checkin semantics, which is also one of the reasons that everybody hates using them.

GVFS intends to add the ability to scale to Git, through patches to Git itself and a custom driver. I don't think this is a hack - by no means is it the first version control system to introduce a filesystem level component. Git with GVFS works wonderfully while offline for any file that you already have fetched from the server.

If this sounds like a limitation, then remember that these systems like Perforce and TFVC _also_ have limitations when you're offline: you can continue to edit any file that you've checked out but you can't check out new files.

You can of course _force_ the issue with a checkout/edit/checkin but then you'll need to run some command to reconcile your changes once you return online. This seems increasingly less important as internet becomes ever more prevalent. I had wifi on my most recent trans-Atlantic flight.

I'm not sure what determines when something is "a hack" or not, but I'd certainly rather use Git with GVFS than a heavyweight centralized version control system if I could. Your mileage, as always, may vary.

GVFS won't fix this because you still need to lock opaque binary files, which is something Perforce supports.

Nothing about git prevents an "ask the remote" feature. It's just not there. I suspect that as git repos grow huge and shallow and partial cloning becomes more common, git will grow such a feature. Granted, it doesn't have it today. And the GVFS thing is... a bit of a hack around git not having that feature -- but it proves the point.

At the risk of sounding like a downer, this was a migration of an existing codebase.

I agree. I really think Linux needs a Nobel Prize.

For what? Peace?

A handful of us from the product team are around for a few hours to discuss if you're interested.

Does the virtualization work equally well for lots of history as it does for large working copies?

I have a 100k commit svn repo I have been trying to migrate but the result is just too large. Partly this is due to tons of revisions of binary files that must be in the repo.

Does the virtualization also help provide a shallow set of recent commits locally but keep all history at the server (which is hundreds of gigs that is rarely used)?

GVFS helps in both dimensions, working copy and lots of history. For the lots of history case, the win is simply not downloading all the old content.

A GVFS clone will contain all of the commits and all of the trees but none of the blobs. This lets you operate on history as normal, so long as you don't need the file content. As soon as you touch file content, GVFS will download those blobs on demand.
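
As a toy model of that idea (not GVFS's actual implementation): commits and trees are held locally, and file content is pulled from the "server" only when something reads it. The class and field names here are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    message: str
    tree: dict[str, str]  # path -> blob id
    parent: "Commit | None" = None


@dataclass
class LazyClone:
    head: Commit
    remote_blobs: dict[str, bytes]  # stands in for the server
    local_blobs: dict[str, bytes] = field(default_factory=dict)
    fetches: int = 0

    def log(self) -> list[str]:
        """Walk history using only commits; no blob is touched."""
        messages = []
        commit = self.head
        while commit is not None:
            messages.append(commit.message)
            commit = commit.parent
        return messages

    def read(self, commit: Commit, path: str) -> bytes:
        """Return file content, downloading the blob on first access only."""
        blob_id = commit.tree[path]
        if blob_id not in self.local_blobs:
            self.local_blobs[blob_id] = self.remote_blobs[blob_id]
            self.fetches += 1
        return self.local_blobs[blob_id]
```

In this model, old revisions of a large binary cost nothing locally until someone actually opens one of them, which is the property the question above was asking about.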

Thanks - that sounds perfect for lots of binary history as you never view history on the binaries, only the source files.

This is amazing, congrats. I worked on Windows briefly in 2005 (the same year git was released!) and was surprised at how well Source Depot worked, especially given the sheer size of the codebase and the other SCM tools at the time.

Is there anything people particularly miss about Source Depot? Something SD was good at, but git is not?

Another interesting complaint is one that we hear from a lot of people who move from CVCS to DVCS: there are too many steps to perform each action. For example, "why do I have to do so many steps to update my topic branch". While we find that people get better with these things over time, I do think it would be interesting to build a suite of wrapper commands that roll a bunch of these actions up.

I just got a request today for an API equivalent to `sd files`, which is not something Git is natively great at without a local copy of the repo.

How do you prevent data exfiltration? I mean, in theory you could restrict the visibility of repos to the user based on team membership/roles and so prevent a single person from unauditably exfiltrating the whole Windows source code tree. In contrast with a monorepo there likely won't be any alerts triggered if someone does do a full git clone, except for someone saturating his switch port...

What would someone do with the source code for Windows? No one in open source would want to touch it. No large company would want to touch it. Grey/black hats are probably happier with their decompilers. Surely it would be easier to pirate than build (assuming their build system scales with most build systems I've observed in the wild). No small company would want to touch it.

Anyway, MS shares source with various third parties (governments at least, and I believe large customers in general), so any of these is a potential leak source.

This is all correct. Also, we'd notice someone grabbing the whole 300GB whether it's in 40 SD depots or a single Git repo.

Mind if I ask how you'd notice?

The article mentions relying on a windows filesystem driver. Two questions about that:

1) Why include it in default windows? It seems that 99.99% of users would never even know it existed, let alone use it

2) Does that mean GVFS isn't useable on *nix systems? Any plans to make it useable, if so?

1) The file system driver is called GvFlt. If it does get included in Windows by default, it'll be to make it easier for products like GVFS, but GvFlt on its own is not usable by end users directly.

2) GVFS is currently only available on Windows, but we are very interested in porting to other platforms.

What are your thoughts on implementing something more general like Linux's FUSE[1] instead? A general Virtual Filesystem driver in Windows could be used for a wide range of things and means you don't just have a single-purpose driver sitting around.

[1] https://en.m.wikipedia.org/wiki/Filesystem_in_Userspace

A general FUSE-like API would be very useful to have, but unfortunately it can't meet our performance requirements.

The first internal version of GVFS was actually based on a 3rd party driver that looks a lot like FUSE. But because it is general purpose, and requires all IO to context switch from the kernel to user mode, we just couldn't make it fast enough.

Remember that our file system isn't going to be used just for git operations, but also to run builds. Once a file has been downloaded, we need to make sure that it can be read as fast as any local file would be, or dev productivity would tank.

With GvFlt, we're able to virtualize directory enumeration and first-time file reads, but after that get completely out of the way because the file becomes a normal NTFS file from that point on.
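
A user-space toy of that behaviour, for readers who want to see the shape of it. This is only a sketch: `VirtualWorkdir` and its manifest are hypothetical, and a real filter driver does this inside the filesystem stack rather than in a Python class.

```python
import os
import tempfile


class VirtualWorkdir:
    """Toy model: list files from a manifest, materialize each on first read."""

    def __init__(self, root: str, manifest: dict[str, bytes]):
        self.root = root
        self.manifest = manifest  # stands in for content held on the server
        self.hydrations = 0

    def listdir(self) -> list[str]:
        # Enumeration is answered from the manifest; nothing is written to disk.
        return sorted(self.manifest)

    def read(self, name: str) -> bytes:
        path = os.path.join(self.root, name)
        if not os.path.exists(path):
            # First read: write the real bytes to disk ("hydrate" the file).
            with open(path, "wb") as f:
                f.write(self.manifest[name])
            self.hydrations += 1
        # The content is always served by an ordinary local file read.
        with open(path, "rb") as f:
            return f.read()


def demo() -> tuple[list[str], int]:
    """Read one of two files twice; report what is on disk and the hydration count."""
    with tempfile.TemporaryDirectory() as root:
        workdir = VirtualWorkdir(root, {"a.c": b"int a;", "b.c": b"int b;"})
        workdir.read("a.c")
        workdir.read("a.c")
        return sorted(os.listdir(root)), workdir.hydrations
```

The difference from the real thing is that this toy still routes every read through `read()`, whereas the point made above is that the driver steps aside entirely once the file is a normal local file.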

I'm curious what the cross-over may be between GvFlt and the return of virtual files for OneDrive in the Fall Creators Update? Is the work being coordinated between the efforts?

From your description it sounds like there could be usefulness in coordinating such efforts.

For use cases that fit that limited virtualization, is GvFlt something to consider? Or is that a bad idea?

It's not included in Windows, which is why they have a signed drop of the driver for you to install. Even internally we have to install the driver.

E- oops, missed that line. Cool, wonder if it will show up in more than Git

Yes, I was referring to the plan to include it in future Windows builds

Probably as an optional feature?

Why do you name it "GVFS" instead of something more descriptive like "GitVFS"?

This was discussed a bit in the comments in Brian Harry's last post on GVFS: https://blogs.msdn.microsoft.com/bharry/2017/02/03/scaling-g...

We're building a VFS (Virtual File System) for Git (G) so GVFS was a very natural name and it just kind of stuck once we came up with it.

Are you aware that the name was already taken[0] for something which also has to do with file systems?

[0] https://wiki.gnome.org/Projects/gvfs

Sure, a couple of questions:

1. How do you measure "largeness" of a git repo?

2. How are you confident that you have the largest?

3. How much technical debt does that translate to?

1. Saeed is writing a really nice series of articles starting here: https://www.visualstudio.com/learn/git-at-scale/ In the first one, he lays out how we think about small/medium/large repos. Summary: size at tip, size of history, file count at tip, number of refs, and number of developers.

2. Fairly confident, at least as far as usable repos go. Given how unusable the Windows repo is without GVFS and the other things we've built, it seems pretty unlikely anyone's out there using a bigger one. If you know of something bigger, we'd love to hear about it and learn how they solved the same problems!

3. Windows is a 30 year old codebase. There's a lot of stuff in there supporting a lot of scenarios.

Is it possible to checkout (if not build) something like Windows 3.11 or NT 4?

As far as I can recall, this is not possible using Windows source control, as its history only goes back to the lifecycle of Windows XP (when the source control tool prior to GVFS was adopted).

Microsoft does have an internal Source Code Archive, which does the moral equivalent of storing source code and binary artifacts for released software in an underground bunker. I used to have a bit of fun searching the NT 3.5 sources as taken from the Source Code Archive...

I recently heard a story that someone tried to push a 1TB repo to our university Gitlab, which then ran out of disk space. Sure, that might not have been a usable repo, only an experiment. Still, I would bet against the claim that 300GB is the largest one.

300 GB is not the size of _the repository_. It's the size of the code base - the checked out tree of source, tests, build tools, etc - without history.

It's certainly possible that somebody created a 1 TB source tree in Git, but what we've never heard of is somebody actually _using_ such a source tree, with 4000 or more developers, for their daily work in producing a product.

I say this with some certainty because if somebody had succeeded, they would have needed to make similar changes to Git to be successful, though of course they could have kept such changes secret.

1 TB of code?

I'd sure like to run that as my operating system, browser, virtual assistant, car automation system, and overall do-everything-for-me system...

I'm currently investigating using GitLFS for a large repo that has many binary and other large artifacts.

I'm curious, did you experiment with LFS prior to building GitVFS?

Also, I know that there is a (somewhat) active effort to port GitVFS to Linux. Do you know if any of the Git vendors (GitLab and/or GitHub) are planning to support GitVFS in their enterprise products?

Yes we did evaluate LFS. The thing about LFS is that while it does help reduce the clone size, it doesn't reduce the number of files in the repo at all. The biggest bottleneck when working with a repo of this size is that so many of your local git operations are linear on the number of files. One of the main values of GVFS is that it allows Git to only consider the files you're actually working with, not all 3M+ files in the repo.

> One of the main values of GVFS is that it allows Git to only consider the files you're actually working with, not all 3M+ files in the repo.

That is an excellent point. Thanks!

We at GitLab are looking at GitVFS but have not made a decision yet https://gitlab.com/gitlab-org/gitlab-ce/issues/27895

Very cool blog! As I understand, you dynamically fetch a file from the remote git server once for the first time I open the file. Do you do any sort of pre-fetching of files? For example, if a file has an import and uses a few symbols from that file, do you also fetch the imported file beforehand or just fetch it when you access it first time?

For now, we're not that smart and simply fetch what's opened by the filesystem. With the cache servers in place, it's plenty fast. We do also have an optional prefetch to grab all the contents (at tip) for a folder or set of folders.

We don't currently do that sort of predictive prefetching, but it's a feature we've thought a lot about. For now, users can explicitly call "gvfs prefetch" if they want to, or just allow files to be downloaded on demand.

What's the PR review UI built in?

Custom JQuery-based framework, transitioning to React.

Actually, I think we finished the conversion to React :). So, React.

Taylor is the dev manager for that area so I'm inclined to believe his correction :)

What was the impetus for switching to git?

More or less:

- Availability of tools

- Familiarity of developers (both current and potential)

Any plans to port GVFS to Linux or macOS?

Why not TFS?

"A handful of us from the product team are around for a few hours to discuss if you're interested."


This is a little off-topic, but why can't Windows 10 users conclusively disable all telemetry?

(I consider the question only a little off-topic, because I have the impression that this story is part of an ongoing Microsoft charm-offensive.)

Haha, no answer, as expected. HN got butthurt as well lol.

I knew there was a risk of getting downvoted, but I was surprised that it went to "-4".

This is so awesome. Brilliant move MS! In addition to enabling Windows engineers to be significantly more productive (eventually), it will go a long way to enabling engineers in other departments to contribute to Windows. For example, I used to work in the Azure org and once noticed a relatively simple missing feature in Windows. I filed a bug and was in contact with a PM who suggested that, if I wanted, I could work on adding it myself. I dipped my toe in, but the onboarding costs were just too high and I quickly decided against it. With Windows on git, I'd have been much more likely to dive in.

I'm not so sure moving to Git alone would have helped your case. Getting an enlistment is only a small part of contributing to Windows.

True, but the move to Git is part of our larger "1ES" (One Engineering System) effort across the company. The idea is, if you know how to enlist/build/edit/submit in any team, you know how to do the same in any team.

Agreed, but probably one of the top 5 road blocks.

This is pretty crazy. It's very hard to imagine working on a single codebase with 4,000 other engineers.

> Another key performance area that I didn’t talk about in my last post is distributed teams. Windows has engineers scattered all over the globe – the US, Europe, the Middle East, India, China, etc. Pulling large amounts of data across very long distances, often over less than ideal bandwidth is a big problem. To tackle this problem, we invested in building a Git proxy solution for GVFS that allows us to cache Git data “at the edge”. We have also used proxies to offload very high volume traffic (like build servers) from the main Visual Studio Team Services service to avoid compromising end user’s experiences during peak loads. Overall, we have 20 Git proxies (which, BTW, we’ve just incorporated into the existing Team Foundation Server Proxy) scattered around the world.

If I was a hacker, this paragraph would probably encourage me to study the GVFS source code and see if I can find some of these Git proxies. I have no idea how you would find them, but there might be some public DNS records. This sounds like some very new technology and some huge infrastructure changes, which are pretty good conditions for security vulnerabilities. What kind of bounty would Microsoft pay if you could get access to the complete source code for Windows? $100,000? [1]

[1] https://technet.microsoft.com/en-us/library/dn425036.aspx

Looks like all of the charts were made in Excel… that's some dedication to staying on-brand!

Hah, they made sure there was just enough styling left for you to notice.

As opposed to what?

HTML and co. It is a website..

Probably also about familiarity. While they probably could have mashed together something using HTML and CSS, Excel is all about nice-looking tables and charts. If only they offered a way to embed them instead of needing to take low-res screenshots...

19 seconds for a commit (add + commit) might be long but the new improvements look promising (down to ~10s).

(Please correct me if the COMMIT column in the perf table includes the staging operations.)

This looks awesome. I just wish Facebook would also share some perf and time statistics on their own extensions for Mercurial, last time I checked their graphs were unitless.

Indeed, while 19 seconds for commit is far better than 30 minutes we would have seen without GVFS, it's way too slow to actually feel responsive while you're coding. And in fact, it was sometimes worse than 19 seconds because commands like status and add would generally get slower as you access and hydrate more files in the repo. With the big O(modified) update that we just made to GVFS, git commands no longer slow down as you access more files, so now our devs see a consistent commit time of around 10 seconds, and consistent and faster times for most other commands too.
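For anyone wondering what "O(modified)" buys you, here is a toy comparison in Python. It says nothing about how GVFS or Git actually track changes; `index`, `worktree`, and the `touched` set are hypothetical stand-ins. The sketch only shows the difference in work between comparing every tracked path and comparing just the paths known to have been written.

```python
def status_full_scan(index, worktree):
    """Compare every tracked path: cost grows with the size of the repo."""
    checked = 0
    changed = []
    for path, recorded in index.items():
        checked += 1
        if worktree.get(path) != recorded:
            changed.append(path)
    return sorted(changed), checked


def status_touched_only(index, worktree, touched):
    """Compare only paths known to have been written: cost grows with the edit set."""
    checked = 0
    changed = []
    for path in touched:
        checked += 1
        if worktree.get(path) != index.get(path):
            changed.append(path)
    return sorted(changed), checked


# A made-up repo with 100,000 tracked files and a single edit.
index = {f"src/file{i}.c": "v1" for i in range(100_000)}
worktree = dict(index)
worktree["src/file42.c"] = "v2"
touched = {"src/file42.c"}   # hypothetical record of what was written

full_changed, full_checked = status_full_scan(index, worktree)
fast_changed, fast_checked = status_touched_only(index, worktree, touched)
```

Both functions report the same changed file, but the first one looks at 100,000 paths and the second at one. The hard part in a real system is keeping the `touched` set trustworthy, which this sketch simply assumes.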

You have to put this into perspective with what they are replacing. You'd never get a submit done in less than 19 seconds using the old source depot tools anyways. When you work on projects this big that takes hours to compile and minutes to incremental compile - responsiveness just isn't something you get to have at scale.

This is mainly why I'm asking for data from Facebook. I've seen claims that their vcs operations (at least the most common ones) are near instant, but nothing official. It would appear that FB have solved the responsiveness problem with Mercurial but, again, no official data to back it up.

Isn't this perhaps the greatest validation of Linus's design genius that what was initially a weekend project[0] has successfully scaled to this?

They could no longer use their revision control system BitKeeper and no other Source Control Management (SCMs) met their needs for a distributed system. Linus Torvalds, the creator of Linux, took the challenge into his own hands and disappeared over the weekend to emerge the following week with Git.

[0] https://www.linux.com/blog/10-years-git-interview-git-creato...

> Isn't this perhaps the greatest validation of Linus's design genius that what was initially a weekend project has successfully scaled to this?

I thought the entire point of the article was to show how git didn't scale, and how they're basically rewriting the project and changing it as much as necessary to make it scale. It's not like Linus designed git to scale as O(modified).

On the other hand, becoming something bigger out of his hands is validation in its own right...

And Linux was a hobby project as well... now it runs most of the internet AND most of the personal computing devices on the planet.

No, not really. When it scaled enough for the Linux kernel he lost interest in scaling any further, and that's nowhere near what's needed for a monorepo (or a Linux distribution).

And I recall that there was a fairly infamous tech talk at Google where he basically dismissed Google's concerns in an arrogant way.

So I'd say it's mostly a validation of git and Github's popularity with developers, that other people were willing to put in so much work to improve it.

Well by "validating git" aren't you by proxy validating its design by its creator?

Or am I missing something key here? Yes, of course it's been extended by very smart people, has new features, etc., but a weekend project that was robust enough to handle the Linux kernel codebase for him? That is no small feat.

Oh yes, perhaps it's the new meme among developers that there are no real genius architects and coders, and that anyone could have done it really, so why distribute props to any one creator?

Yeah given my experience I have a hard time buying into that mindset.

Sure, that weekend project was extremely successful, and git has a pretty clean design.

My point is that it didn't scale to monorepo size, and was never intended to do anything like that. (How many years did it take before Microsoft came along and decided to do it?) Making a claim like that is basically hero worship.

When software is popular enough, people will jump through all sorts of hoops to extend it and maintain compatibility with it. (For example, POSIX, JavaScript, Java, or even PHP.) These technologies all have their good parts and bad parts. The quality of the design is less important than whether it fulfilled a need successfully and appeared at the right time.

If you have a large repo, it actually helps you to save storage if you don't need the full history on clone by using shallow clone.

hg is still behind on this, as far as I can tell from a search. FB has this as an extension. https://bitbucket.org/facebook/hg-experimental/

Is FB fully using hg internally, or both Git and hg? Obviously FB has public repos on Github.

My understanding is that fb is actively promoting hg for internal repositories. Not sure how they sync between public git and internal hg.

Those of us working on smaller codebases may wonder what the big deal is. Facebook had a similar problem leading them to switch out to mercurial. https://code.facebook.com/posts/218678814984400/scaling-merc... It is awesome that the problems could be solved in git itself.

Also, kudos to the writer of the blog. It is a really high quality blog post. The percentile measures of performance, survey responses from users etc are very typical of solid incremental approaches to challenges faced by startups except these are internal customers.

300GB of code WOW! Just for comparison the entire English Wikipedia dump including all media is about 50-60GB. What are you guys doing there and how large do you see this growing?

Most of that 300GB isn't text. There are test assets, images, videos, built binaries, vhd's, etc. Also, I should be clear that that 300GB is just at tip (no history). We can debate about whether or not those things should be checked into the repo but they are there now.

How did you go about creating the central repo and how long did it take? Converting an svn repo that is 2GB at tip with 100k commits is taking me many days, and each odd failure typically has me restart the process after filtering out some obscure part of the tree.

Edit: read in another comment that you dropped the history. Understandable, but can appreciate how that would add to friction (devs having to look through two different histories).

The Windows team developed a tool called "GitTrain" that knew how to:

- migrate the tip of a branch to Git (yes, the 300GB number is the tips of all the interesting branches, not the history)

- keep a Git branch and a SD branch in sync for a while

- be re-run over each of the 400+ branches they care about

But they went through some of the same trial-and-error process that you're describing.

Whoa. 300GB with a shallow clone?! What size does the whole repo use on the server side?

The pack file size for a full clone is 187GB. The 300GB is the working directory. We did not import the history of the code base, so the current repo only has about 5 months of history. As others have called out, there are a lot of assets in the repo that don't compress.

Why only 5 months? Will more of the history be added to the git repository eventually?

No, we'll keep the SD servers around for a while for servicing older products. We also have a "breadcrumbing" system that lets an engineer follow a file's history back from Git to the old system.

Was importing the complete history tried during the development? This is very interesting. The git history will grow at break-neck speed and will reach similar size soon enough. Is this to delay the inevitable tech wrangling for dealing with terabyte histories or were there issues with the import/sync?

Or maybe it was just the initial repo setup used for alpha testing that got promoted to production :)

A lot of that is probably static art assets. A lot of big corporate projects are guilty of just throwing everything in one tree like this where for some bizarre reason you are checking out jpgs and video files. It is part of the reason why the Chrome or Unreal Engine repos are preposterously large (hundreds of megs to gigabytes each).

> A lot of big corporate projects are guilty of just throwing everything in one tree like this where for some bizarre reason you are checking out jpgs and video files.

Would you want to build chrome without its icons for bookmark folders, the launcher, etc.? No?

What about a game without its intro videos? Even if you need to debug a crash during the initial intro, related to playing said video?

Some people prefer this kind of content to be managed with a different, media focused version control system, or a separate tree, but now you have 2+ disparate distribution / version control systems to try and keep in sync. You do have options like git submodules now for some VCSes, but I haven't been impressed with their workflow, and they weren't always an option. I will gladly burn several gigs just to simplify my workflow - I've got the spare SSD and bandwidth. I've had fully built checkouts in the hundreds of gigabytes range.

And don't get me wrong, "hundreds of gigabytes" is getting annoyingly large if the IT department skimped and bought me a 250GB SSD. I'll abuse directory junctions to offload some of that onto mechanical drives. This is relatively fire and forget though - multi-repository is not.

And there was an article a few weeks back about how certain content creation tools (Adobe) stuff lots of metadata into these assets by default.

Some of the MS teams were good about stripping those out for release; others didn't seem to notice.

Well, to be fair, the metadata in those files amounted to 5 MiB over the whole of Windows, so while there is a little overhead, it won't make up a significant chunk.

Oh, I didn't see that part. That's not nearly as bad as I thought based on my reading of the original article.

Hmm. I believe it's likely there's also lots of binary assets - the WAV sound files, BMP images, the "hello world" videos, for example - and possibly also the raw versions of the assets. And if it's really the whole history of Windows in there, that's a LOT of binary assets in LOTS of versions.

Sorry to see SourceDepot (slowly) decommissioned. I loved it and since it was a Perforce fork, what I've learned was directly applicable when I started using P4 in my subsequent job. Perhaps I'm old fashioned but I really see little appeal in DVCSes. I liked Hg but in the long run it's going to be completely run over by Git so I'd rather not invest in it. I'm rambling, sorry.

What was performance like for the Source Depot system? It would be interesting to note the comparison between the old SDX system and GVFS.

Quoting Brian Harry from a comment response at https://blogs.msdn.microsoft.com/bharry/2017/05/24/the-large...

"It depends a great deal on the operation. SourceDepot was much faster at some things – like “sd opened”, the equivalent of “git status”. sd opened was < .5s. git status is at 2.6s now. But SD was much slower at some other things – like branching. Creating a branch in SD would take hours. In Git, it's less than a minute. I saw a mail from one of our engineers at one point saying they'd been putting off doing a big refactoring for 9 months because the branch mechanics in SD would have been so cumbersome and after the switch to Git they were able to get the whole refactoring done in a topic branch in no time.

On an operation, by operation basis, SD is still much faster than our Git/GVFS solution. We're still working on it to close the gap but I'm not sure it will ever get as fast at everything. The broader question, though is about overall developer productivity and we think we are on a path to winning that."

Game development also has very large files and codebases, and Git LFS is sometimes not enough. This is great for everyone really, but very nice for game development and larger codebases that might have lots of assets along with the code.

Microsoft is doing great work here and I hope it makes it to bitbucket, github etc.

I don't know much about Windows development, but I'm sure the system is modularized in some way. Why wouldn't you want to break up the project into multiple repos for different parts of the system? That would let you work on and test each part independent of the rest. Each part should be able to function on its own, right? Of course some engineers would need to build and test the entire OS as a whole, but I'd wager that (for example) the team working on visual design of the settings app doesn't need to have the source code of how the login screen verifies passwords.

Clearly Microsoft's process works well enough for them, so I wonder what benefits there are to using the monolithic repo choice over many smaller repos.

Google has a single repo. The advantages are that you don't need to version anything because you always build against head. It's awesome but requires some discipline and good infrastructure.

If you have many small repos for a large interconnected project, you simply move the complexity of managing a commit that requires changes into another tool that can manage cross-repo changes and dependencies. With a single repo you can change something and build it, fix any breaks, and then commit it with just source control and a build system. The many-small-repos approach has in my experience been driven by either poor processes or tooling limitations.

Windows was developed a long time ago, and I'm guessing components were never fully separated as the codebase grew larger every day.

Linus must be very proud - his favourite software Windows - now depends on GIT.

Linus ought to be proud - it wasn't too long ago when Microsoft was calling his other work "a cancer", and now Windows depends on Git. Younger me would not believe this.

Linus definitely is a rare genius on design and execution - he made his mark not only on Kernels/OSes, but on version control systems as well. I salute you, Linus!

If I recall correctly, wasn't the "cancer" remark in reference to the GPL?

Your recollection is inaccurate.

> Microsoft CEO and incontinent over-stater of facts Steve Ballmer said that "Linux is a cancer that attaches itself in an intellectual property sense to everything it touches," during a commercial spot masquerading as a interview with the Chicago Sun-Times on June 1, 2001.[0]


it is cancer with good connotations

I am very certain that Linus would read this article and curse the jaw dropping stupidity of the whole endeavor. They've basically taken a tool he wrote to do real distributed source control that can scale and turned it into a central server.

Git was never meant to be used this way and I know he'd be horrified+amused in the extreme.

Centralization has nothing to do with this feature. It's about big repos, centralized or not.

Git is scalable in terms of developers count, not repo size.

Well how the tables have turned! Only about 3 yrs back I was having a conversation with a Microsoft engineer about them evaluating a closed source Hadoop clone because Microsoft policy prohibited them from using open source.

Different divisions have had different stances on open source code for a long time. Somewhere I still have the t-shirt from our first "Open Source Day" event back in 2008 (and it's not like that was the first time any MS employee had ever considered using open source). Things are a lot more standardized now, with a big push from both the top and the bottom to use open source wherever it makes sense. Why reinvent the wheel?

> Why reinvent the wheel?

Look who's talking :)!

Wasn't WiX released and developed as an Open Source project by MS?

IIRC, it was the first open source project started by Microsoft / by a Microsoft employee.

As a German, I find the name super cringeworthy, but it makes me giggle every now and then.

Using open source in their product is different from using it for development. I wonder which one you were talking about with that employee.

I'm pretty sure it was for a product (server side), not an internal dev tool. Back in 2000 one of my roommates was an intern with the VS team and he said a lot of the devs were using emacs.

I hope this largest repository has enough space for Clippy as Linus loves it.

The day has come that Microsoft employees are celebrating how good they're getting at running Linus Torvalds' source code management tool.

Cats and dogs, flying pigs.

Might be good to start work on a compatible client and server for FUSE-based systems (Linux, OpenBSD, macOS [with a FUSE kernel module]).

> You also see the 80th percentile result for the past 7 days […]

What'd be even more interesting to see is something like the 95th or 99th percentile, as showing that 80% of all operations finish in acceptable time is nice, but probably not what's necessary to have satisfied customers.
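A small made-up example of why the tail matters, in Python. The timings below are invented for illustration and have nothing to do with the numbers in the post; `percentile` is a plain nearest-rank implementation written for this sketch, not taken from any library.

```python
def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, so no floating-point rounding is involved.
    rank = (p * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Invented timings in seconds: 90 fast runs and 10 slow ones.
timings = [2.0] * 90 + [30.0] * 10

p80 = percentile(timings, 80)
p95 = percentile(timings, 95)
p99 = percentile(timings, 99)
```

With this data the 80th percentile is 2.0 seconds while the 95th and 99th are both 30.0 seconds, so a dashboard that only shows P80 would look healthy even though one command in ten takes half a minute.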

I have Git repositories much larger than 300GB, for binary data. The title should be "the largest Git repo for source code", in my opinion. BTW, it is a nice thing that Windows development is being moved to Git. What's Linus's opinion on that? What a victory :-)

Non-dev here, but does this replace/overlap with TFS? What was the driver to adopt Git?

So, TFS/VSTS is a suite of developer services. They fully support and integrate with git. In other words, git is a first-class citizen in TFS/VSTS. The centralized version control system in TFS/VSTS is called "Team Foundation Version Control" or TFVC.

There were a bunch of drivers to move to git: 1. DVCS has some great workflows. Local branching, transitive merging, offline commit, etc. 2. Git is becoming the industry standard and using that for our VC is both a recruiting and productivity advantage for us. 3. Git (and its workflow) helps foster a better sense of sharing, which is something we want to promote within the company. There are more but those are the major ones.

Good questions. TFS is a whole suite of services: 2 version control systems (TFVC and Git), work item tracking, build orchestration, package management, and more. VSTS is the roughly-analogous cloud-hosted version.

I'd have to dig up the link: a few years ago our VP had a good blog post on why we chose to add a Git server to our offering. TFVC is a classic centralized version control system. When we wanted to add a distributed version control, we looked at rolling our own but ultimately concluded that it was better to adopt the de facto standard.

> ultimately concluded that it was better to adopt the de facto standard

Thank you for that : )

I am kinda surprised that Microsoft doesn't use tfs - after all, it's their own version control system. But then again, we use tfs at work and not a day goes by on which I do not long for git.

Here is my cynical view: From what I know they have a history of not using their own tools. They didn't use SourceSafe, but Perforce. Then they made an effort to switch to TFS, realized that it sucks and moved on to git.

You can see the same pattern in Windows desktop apps. They didn't use MFC for themselves, didn't use Winforms, used WPF only a little.

To be somewhat less cynical, VSS and TFVC were not intended to scale to the size of Windows's codebase, thus they weren't used. And instead of inventing our own thing this time around, we went with (and make contributions back to) the de facto standard.

When I was in Xbox, we had a lot of things in TFVC: all the services code, most of the console and Windows apps, and many of the tools. Only the Windows-related bits were in SD.

If they didn't use MFC, WinForms and WPF what exactly were they using? Flash?


And WTL, Windows Template Library.


WTL is way old.

Yes, but it's far newer than Win32 and MFC. There's also ATL, Active Template Library.

UWP everywhere now though.

Is that true? Which larger app is written in UWP? Something like Office, Skype or Visual Studio.

Windows itself is gradually rewriting its shell components in UWP XAML.

The preferred Windows client for Skype has been UWP for a while now (though up until recently it was still often referred to as "Skype Preview").

There is a watered-down version of Office in UWP. The built-in OneNote is actually this UWP. It uses a different stylus filtering algorithm from the desktop version, which is why I don't use it myself, but rest assured it works just fine.

Why don't you install the 2013/2015 version of TFS and use git? It's included and fully supported.

And, just to be clear, git is a first-class citizen in VSTS/TFS. We've fully embraced git as THE DVCS solution within VSTS/TFS. It is seen as a companion to TFVC.

All the modern development on VSTS is focused on git as well.

This is both very cool and eerily reminiscent of MVFS and ClearCase. It's a huge change in Git, going from de-centralized to hyper-centralized. If I read it right, git status, git commit and even running a build or cat'ing a file may not work if your network or the central server is down.

I hope they have thought hard about how to get the "Git proxy" for remote sites working well. If they end up with remote sites working on a 15 min - 1 hour old tree that will be very annoying.

So, if windows engineers are using git now, who is using TFVC? That's Team Foundation Version Control- the original TFS version control engine.

Lots and lots of external customers, and a handful of internal folks. FWIW Windows was never on TFVC (at least not the main development group).

What is the opposite of dogfooding?

That wouldn't be dogfooding - the windows team is not responsible for TFVC, innit.

The pain increases with the square of the number of files and to the fourth power of the dependencies between specific versions of them.

I'm not sure a big repo is a wise thing, even though I understand it may make sense for multiple reasons for a company, understanding it may damage brains far more sophisticated than mammalian ones.

Maybe I missed it, but I wish they would have compared times to what they were using before (Source Depot).

I guess I would more specifically like to know what pain points drove Microsoft to even try such a massive change.

Any word on open sourcing parts of the windows OS now that MS is seeing the light? The head guys have to see the benefits by now.

It says something that MS chose Git over anything proprietary that they developed.

I'm guessing that would be a licensing nightmare. They must pay many companies for licensed technologies inside Windows, and many of those licenses likely wouldn't be compatible with open source licensing.

All of their source code would have to go through legal review, some with each check in. I don't see that happening for legacy code.

If you look at how long it took for Sun to make Solaris free software (and even then it wasn't truly free in some cases) I doubt Microsoft would ever consider spending that much time doing it.

Rather than worry about getting `status` under 10 seconds, just focus on `diff --stat` and `diff --cached --stat`. Those two replace most uses for `status`.

I am not sure, but doesn't this look like an open source version of what ClearCase provides?

I see that Windows engineering uses a merge workflow. I wonder why. See other comments in this thread about rebasing.

At one point, I believe Microsoft was using a modified Perforce server for source code. Is that completely gone now?

Most of the large Source Depot users have moved to Git or, like Windows, are in the process of moving to Git. Legacy stuff will probably live on in SD for a long time, possibly forever, for maintenance work.

You are thinking of Google.

I meant that Google was the company who used Perforce in the past, and not Microsoft. Google isn't using it anymore either; they switched to their own thing named Piper.


SD was based on a very old Perforce as well.

Can you go into any more detail of the breakdown of your repo structure? Thanks!

edit: forgot, no Markdown here

Do you mean across all of Microsoft? Different teams have different structures. Speaking only for TFS and VSTS, we have a single repo containing the code for both, a handful of "adjunct" repos containing tools like GVFS, a repo for the documentation [1], and a bunch of open source repos for the build and release agent [2], agent tasks [3], API samples [4], and probably more I don't know about.

[1] https://www.visualstudio.com/docs

[2] https://github.com/microsoft/vsts-agent

[3] https://github.com/Microsoft/vsts-tasks

[4] https://github.com/Microsoft/vsts-dotnet-samples

So THIS is why they developed GVFS.

Is GVFS portable to other OSes?

By design, yes. There are not (yet) implementations on other OSes.

Heh, the place I work at might have the single largest monolithic SVN repo. It works surprisingly well.

This is simply amazing.

Wow, the majority of posts here are from Microsoft employees.

So? Why is that surprising?

Sounds like a lot of good work.

But, in "Git repo", what the heck is a "repo"? A repossession as in repossessing a car?

In the OP with "Everything you want to know about Visual Studio ALM and Farming", what is ALM -- air launched missile? What do air launched missiles and "farming" have to do with Visual Studio?

To Bill Gates and Microsoft: For my startup, I downloaded, read, indexed, and abstracted 5000+ Web pages from the Microsoft Web site MSDN. That took many months. Then I typed in the software for my startup, 24,000 programming language statements in Visual Basic .NET 4 and ADO.NET (Active Data Objects, for getting to the relational data base management system SQL Server) and ASP.NET (Active Server Pages, for building Web pages) in 100,000 lines of typing. For that work, all of it that was unique to me and my startup was fast, fun, and easy.

Far and away the worst problem in my startup, that delayed my work for YEARS, was the poor quality of the technical writing in the Microsoft documentation.

Some of the worst of the documentation was for SQL Server: Gee, I read the J. Ullman book on data base quickly and easily while eating dinner at the Mount Kisco Diner. But the Microsoft documentation was clear as mud. Just installing SQL Server ruined my boot partition: SQL Server would not run, repair, reinstall, or uninstall, and I had to reinstall all of Windows and all my applications and try again, more than once.

Quickly I discovered that documentation of logins, users, etc. was a mess: Basically the ideas seemed to be old capabilities, attributes, authentication, and access control lists, but nothing from Microsoft was any help at all. Eventually via Google searches I discovered some simple SQL statements I could type into a simple file and run with the SQL Server utility SQLCMD.EXE; that way I got some commands that worked for much of what I needed. Now those little files are well documented and what I use. For getting a connection string that worked, again the documentation was useless, and I tried over and over with every variation I could think of until, for no good reason, I got a connection string to work. Once I tried to get a new installation of SQL Server to recognize, connect to, and use a SQL Server database from the previous installation of that version of SQL Server, but the result just killed the installation of SQL Server.

Again, once again, over again, yet again, one more time, far and away the worst problem in my startup is making sense out of Microsoft's documentation. I found W. Rudin, Real and Complex Analysis fast, fun, and easy reading; Microsoft's documentation was an unanesthetized root canal procedure -- OUCH!

So, again, once again, over again, yet again, one more time, please, Please, PLEASE, for the sake of my work, Microsoft, and computing, PLEASE get rid of undefined terms and acronyms in your technical writing. Get them out. Drive them out. Out. Out of your writing. Out of your company. Out of computing. No more undefined terms and acronyms, none, no more. I can't do it. You have to do it. Then, DO IT.


Application Lifecycle Management.

It's Brian Harry's blog, he has a farm, sometimes he posts about the (mis)adventures of his animals.

The definitions of "git repo" and "ALM" are the top search results on Google for both terms.

I can believe that.

My point is, shouldn't articles define or at least give links to definitions for terms?

Apparently Google has discovered that their usual keyword/phrase search of Web pages should be set aside when a search is really for some jargon or an acronym, and that it should instead do a special search, just for definitions, for such terms. So, if Google understands the crucial importance of unwinding jargon and acronyms, the rest of us in computing can also.

This would create a lot of noise for the regular readers of his blog, who already know the definitions of these terms. The terms are not even obscure. This is also a blog, not a piece of technical documentation.

This just seems like the exact opposite of "Do one thing; and do it well".

I would pay money to see a video camera of Linus' face reading this article. I think we'd probably get impossible new shades of the color red heretofore unknown to humanity.

Linus Torvalds rocks. Windows sucks. Subversion was crap so he just made git instead. git beat out svn, TFS and all that other crap legions of overpaid engineers came up with (or what they didn't get source control???) because unix design philosophy and therein lies the lesson still unlearned for they hath loaded all their bloat into one repo.

Windows. It sucks and it will forever suck because it sucks by design. Bill say 'Thank you Linus - I owe you sooo much because git is way better than the best I could do'

I mean has there ever been worse software ever written than the stuff being loaded into git right now? Awful, awful garbage, creaking and reeking of dirty hacks, different for the sake of it designs, misshapen, bolted together, bloated, willfully annoying, antisocial, phone home, locking-in, full of resolutely, defiant ancient unfixed bugs, butt ugly, horrible UI, full of errors and meaningless error messages, incessant nagging and weird quirks, wtf folders, command line from hell and urgh... note pad ... and oh dear god I almost forgot mmc consoles and visual studio and inconsistent flows, viral load by the galactic shit tonne, complete and utter drivel makes me want to vomit every time I hear that sickening jingle and after all those gazillions of engineering hours an absolute world wonder of fail?

Two guys working out of a garage could do better.

:P (Windows sucks btw)

Sadly we will have to quit blaming Bill Gates. I doubt he makes very many design decisions any more. :)

If he had only listened to me and re-released Xenix open source with a decent WM we could have avoided all this unpleasantness but no, he had to listen to Monkeyboy. :/

And then there's Windows Subsystem for Linux as well.
