
The largest Git repo - ethomson
https://blogs.msdn.microsoft.com/bharry/2017/05/24/the-largest-git-repo-on-the-planet/
======
js2
_Windows, because of the size of the team and the nature of the work, often
has VERY large merges across branches (10,000’s of changes with 1,000’s of
conflicts)._

At a former startup, our product was built on Chromium. As the build/release
engineer, one of my daily responsibilities was merging Chromium's changes with
ours.

Just performing the merge and conflict resolution was anywhere from 5 minutes
to an hour of my time. Ensuring the code compiled was another 5 minutes to an
hour. If someone on the Chromium team had significantly refactored a
component, which typically occurred every couple weeks, I knew half my day was
going to be spent dealing with the refactor.

The Chromium team at the time was many dozens of engineers, landing on the
order of a hundred commits per day. Our team was a dozen engineers landing
maybe a couple dozen commits daily. A large merge might have on the order of
100 conflicts, but typically it was just a dozen or so conflicts.

Which is to say: I don't understand how it's possible to deal with a merge
that has 1k conflicts across 10k changes. How often does this occur? How many
people are responsible for handling the merge? Do you have a way to distribute
the conflict resolution across multiple engineers, and if so, how? And why
don't you aim for more frequent merges so that the conflicts aren't so large?

(And also, your merge tool must be incredible. I assume it displays a three-
way diff and provides an easy way to look at the history of both the left and
right sides from the merge base up to the merge, along with showing which
engineer(s) performed the change(s) on both sides. I found this essential many
times for dealing with conflicts, and used a mix of the git CLI and Xcode's
opendiff, which was one of the few at the time that would display a proper
three-way diff.)

~~~
Peaker
When you have that many conflicts, it's often due to massive renames, or just
code moves.

If you use git-mediate[1], you can re-apply those massive changes on the
conflicted state, run git-mediate - and the conflicts get resolved.

For example: if you have 300 conflicts due to some massive rename, you can
type in:

    
    
      git-search-replace.py[2] -f oldGlobalName///newGlobalName
      git-mediate -d
      Successfully resolved 377 conflicts and failed resolving 1 conflict.
      <1 remaining conflict shown as 2 diffs here representing the 2 changes>
    

[1] [https://medium.com/@yairchu/how-git-mediate-made-me-stop-
fea...](https://medium.com/@yairchu/how-git-mediate-made-me-stop-fearing-
merge-conflicts-and-start-treating-them-like-an-easy-game-of-a2c71b919984)

[2] [https://github.com/da-x/git-search-replace](https://github.com/da-x/git-
search-replace)

~~~
dom0
Also git rerere

When maintaining multiple release lines and moving fixes between them:

Don't use a bad branching model. Things like "merging upwards" (= committing
fixes to the oldest branch requiring the fix, then merging the oldest branch
into the next newer branch, and so on), which seems to be somewhat popular, just
don't scale, don't work very well, and produce near-unreadable histories. They
also incentivise developing on maintenance branches (ick).

Instead, don't do merges between branches. Everything goes into master/dev,
except stuff that really doesn't need to go there (e.g. a fix that only
affects specific branches). Then cherry-pick the fixes into the maintenance
branches.
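
A minimal sketch of that flow (the branch names and commit hash are
illustrative):

    git config rerere.enabled true     # record and reuse conflict resolutions
    git checkout master
    git commit -am "Fix crash in parser"   # fix lands on master/dev first
    git checkout release-2.1               # maintenance branch
    git cherry-pick <sha-of-the-fix>       # copy the fix, no cross-branch merge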

~~~
seanp2k2
Cherry picking hotfixes into maint branches is cool until you have stuff like
diverging APIs or refactored modules between branches. I don't know of a
better solution; it kind of requires understanding in detail what the fix does
and how it does it, then knowing if that's directly applicable to every
release which needs to be patched.

~~~
jjawssd
Use namespaces to separate API versions?

POST /v4/whatever

POST /v3/whatever

~~~
_asummers
We version each of our individual resources, so a /v1/user might have many
/v3/post resources. Seems to work for us as a smaller engineering team.

~~~
jjawssd
A better approach would be to alias /v3/user to /v1/user until there is a
breaking change needed in the v3 code tier.

~~~
_asummers
On a rapidly developing API, that would be way too much churn on our front
end. For an externally facing API, I completely agree.

------
sp332
Archive Team is making a distributed backup of the Internet Archive.
[http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK](http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK)
Currently the method getting the most attention is to put the data into git-
annex repos, and then have clients just download as many files as they have
storage space for. But because of limitations with git, each repo can only
handle about 100,000 files even if they are not "hydrated". [http://git-
annex.branchable.com/design/iabackup/](http://git-
annex.branchable.com/design/iabackup/) If git performance were improved for
files that have not been modified, this restriction could be lifted and the
manual work of dividing collections up into repos could be a lot lower.

Edit: If you're interested in helping out, e.g. porting the client to Windows,
stop by the IRC channel #internetarchive.bak on efnet.

~~~
gcb0
internet archive sounds like the best ever use case for IPFS

~~~
sp332
It was considered but it just didn't get enough attention from anyone to get
it done.
[http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/i...](http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/ipfs_implementation)

------
cryptonector
At Sun Microsystems, Inc. (RIP), we had many "gates" (repos) that made up
Solaris. Cross-gate development was somewhat more involved, but still not bad.
Basically: you installed the latest build of all of Solaris, then updated the
bits from your clones of the gates in question. Still, a single repo is great
if it can scale, and GVFS sounds great!

But that's not what I came in to say.

I came in to describe the rebase (not merge!) workflow we used at Sun, which I
recommend to anyone running a project the size of Solaris (or larger, in the
case of Windows), or, really, even to much smaller projects.

For single-developer projects, you just rebased onto the latest upstream
periodically (and finally just before pushing).

For larger projects, the project would run their own upstream that developers
would use. The project would periodically rebase onto the latest upstream.
Developers would periodically rebase onto their upstream: the project's repo.
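
In git terms, a minimal sketch of both flows (the remote names "upstream" and
"project" are my own labels):

    # Single-developer flow: replay local commits on the latest upstream
    git fetch upstream
    git rebase upstream/master

    # Project flow: the project repo periodically rebases onto upstream,
    # and each developer rebases onto the project repo
    git fetch project
    git rebase project/master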

The result was clean, linear history in the master repository. By and large
one never cared about intra-project history, though project repos were
archived anyway, so that when one needed to dig through project-internal
history ("did they try a different alternative and find that it didn't work
well?"), one could.

I strongly recommend rebase workflows over merge workflows. In particular, I
recommend it to Microsoft.

~~~
rdubz
A problem with rebase workflows that I don't see addressed (here or in the
replies) is: if I have, say, 20 local commits and am rebasing them on top of
some upstream, I have to fix conflicts up to 20 times; in general I will have
to stop to fix conflicts at least as many times as I would have to while
merging (namely 0 or 1 times).

Moreover, resolution work during a rebase creates a fake history that does
not reflect how the work was actually done, which is antithetical to the
spirit of version control, in a sense.

A result of this is the loss of any ability to distinguish between bugs
introduced in the original code (pre-rebase) vs. bugs introduced while
resolving conflicts (which are arguably more likely in the rebase case since
the total amount of conflict-resolving can be greater).

It comes down to: Resolution Work is Real Work. Your code is different before
and after resolution (possibly in ways you didn't intend!), and rebasing to
keep the illusion of a total ordering of commits is a bit of a misuse of
abstractions we now have available that can understand projects' evolution in
a more sophisticated way.

I was a dedicated rebaser for many years but have since decided that merging
is superior, though we're still at the early stages of having sufficient
tooling and awareness to properly leverage the more powerful "merge"
abstraction, imho.

~~~
cryptonector
Well, git rerere helps here, though, honestly, this never happens to me even
when I have 20 commits. Also, this _is_ what you want, as it makes your
commits easier for others to understand. Otherwise, with thousands of
developers your merge graph is going to be a pile of incomprehensible
spaghetti, and good luck cherry-picking commits into old release patch
branches!

Ah, right, that's another reason to rebase: because your history is clean,
linear, and merge-free, it makes it easier to pick commits from the mainline
into release maintenance branches.

The "fake history" argument is no good. Who wants to see your "fix typo"
commits if you never pushed code that needed them in the first place? I truly
don't care how you worked your commits. I only care about the end result.
Besides, if you have thousands of developers, each on a branch, each merging,
then the upstream history will have an incomprehensible (i.e., _useless_)
merge graph. History needs to be useful to those who will need it. Keep it
clean to make it easier on them.

Rebase _is_ the "more powerful merge abstraction", IMO.

~~~
rdubz
rebase : centralized repo :: merge : decentralized repo

rebase : linked-list :: merge : DAG

If the work/repo is truly distributed and there isn't a single permanently-
authoritative repo, a "clean, linear" history is nonsensical to even try to
reason about.

In all cases it is a crutch: useful (and nice, and sufficient!) in simple
settings, but restricting/misleading in more complex ones (to the point of
causing many developers to not see the negative space).

You can get very far thinking of a project as a linked list, but there is a
lot to be gained from being able to work effectively with DAGs when a more
complex model would better fit the reality being modeled.

It's harder to grok the DAG world because the tooling is less mature, the
abstractions are more complex (and powerful!), and almost all the time and
money up to now has explored the hub-and-spoke model.

In many areas of technology, however, better tooling and socialization around
moving from linked-lists (and even trees) to DAGs is going to unlock more
advanced capabilities.

Final point: rebasing is just glorified cherry-picking. Cherry-picking
definitely also has a role in a merge-focused/less-centralized world, but
merges add something totally new on top of cherry-picking, which rebase does
not.

~~~
cryptonector
As @zeckalpha says, rebase != centralized repo.

You can have a hierarchical repo system (as we did at Sun).

Or you can have multiple hierarchies, contributing different series of rebased
patches up the chain in each hierarchy.

Another possibility is that you are not contributing patches upstream but
still have multiple upstreams. Even in this case your best bet is as follows:
drop your local patches (save them in a branch), merge one of the upstreams,
merge the other, re-apply (cherry-pick, rebase) your commits on top of the new
merged head. This is nice because it lets you merge just the upstreams first,
then your commits, and you're always left in a situation where your commits
are easy to ID: they're the ones on top.
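
A rough sketch of that sequence, assuming two remotes named upstream1 and
upstream2, with the local commits originally based on upstream1:

    git branch saved-work                 # keep the local commits reachable
    git reset --hard upstream1/master     # branch now matches the first upstream
    git merge upstream2/master            # merge in the second upstream
    git cherry-pick upstream1/master..saved-work   # replay local commits on top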

~~~
luckydude
I'm the guy who started this DAG model (also at Sun with NSElite and then
later with BitKeeper).

I agree that rebase == centralized. It's a math thing. If you rebase and
someone has a clone of your work prior to the rebase, chaos happens when they
come together. So you have to enforce a centralized flow to make it work in
all cases. It's pretty much provable, as in a math proof.

~~~
cryptonector
Not true! At Sun we did this with project gates regularly. The way it works
(as I've described several times in this thread now) is that you rebase
--onto. That is, you use a tag for the pre-rebase project upstream to find the
merge base for your branch, then cherry-pick your commits (i.e., all local
commits after the merge base) onto the post-rebase project upstream.
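
Roughly, in git commands (the tag and branch names here are illustrative):

    # project/master was rebased; its old head was tagged "pre-rebase" first.
    # Replay only your local commits (everything after the old merge base)
    # onto the new project head:
    git rebase --onto project/master pre-rebase my-topic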

Now, you don't want to do this with the ultimate upstream, though occasionally
it happened at Sun with the OS/Net gate, usually due to some toxic commit that
was best eliminated from the history rather than reverted, or through some
accident.

But you'd be right to say that the Sun model was centralized in that there was
just one ultimate upstream. (There was one per-"consolidation", since Solaris
was broken up into multiple parts like that, but whatever, the point stands.)

Whereas with Linux, say, one might have multiple kernel gates kept by
different gatekeepers. Still, if you're contributing to more than one of them,
it's easier to cherry-pick (rebase!) your commits onto each upstream than to
just merge your way around -- IMO. I.e., you can have a Linux kernel like
decentralized dev model and still rebase.

However, as you can see from my comment in the previous paragraph, _rebase_
itself does not imply a centralized model.

~~~
luckydude
I get that you can work around the problems, but you don't seem to get that,
from a math point of view, rebase forces either

a) a centralized model

or

b) you have to throw away any work based on the DAG before the rebase

or

c) you have the history in the graph twice (which causes no end of problems).

(a) is the math way; (b) and (c) are ad-hoc hacks. You are well into the ad-
hoc hacks: you've found a way to make it work, but it includes "don't do that"
warnings to users. My experience is that you don't want to have workflows
that include "don't do that". Users will do that.

------
quotemstr
I have tremendous respect for Microsoft pulling itself together over the past
few years.

~~~
ericfrederich
This may be the thing that gets Google to switch. They like having every piece
of code in a single repository which Git cannot handle.

Now that it is somewhat proven, maybe Google will leverage GVFS on Windows and
create a FUSE solution for Linux.

~~~
daxfohl
I'd rather see Google open up their monorepo as a platform and compete with
GitHub. Git is fine, but there's something compelling about a monorepo.
Whether they do it one-monorepo-per-account, one-global-monorepo, or some mix
of the two, it would be interesting to see how it shapes up.

~~~
jasonkostempski
"one-global-monorepo" caused me to envision a beautiful/horrifying Borg-like
future where all code in the universe was in a single place and worked
together.

~~~
rak00n
I felt like that when I first saw golang and how you can effortlessly use any
repo from anywhere.

~~~
wbl
As the joke goes, Go assumes all your code lives in one place controlled by a
harmonious organization. Rust assumes your dependencies are trying to kill
you. This says a lot about the people who came up with each one.

~~~
aeorgnoieang
> Rust assumes your dependencies are trying to kill you.

Would you mind unpacking this? I'm intrigued.

~~~
dom0
Cargo.lock for applications freezes the entire dependency graph incl.
checksums of everything, for example.

------
tobyhinloopen
I wonder why Windows is a single repository - why not split it into separate
modules? I can imagine tools like Explorer, Internet Explorer/Edge, Notepad,
Wordpad, Paint, etc. could each stay in their own repository. I can imagine
you could split things up even further, like a kernel, a group of standard
drivers, etc. If that is not already the case (separate repos, that is), are
there plans to separate it in the future?

~~~
vtbassmatt
Really good question. Actually, splitting Windows up was the first approach we
investigated. Full details here: [https://www.visualstudio.com/learn/gvfs-
design-history/](https://www.visualstudio.com/learn/gvfs-design-history/)

Summary:

\- Complicates daily life for every engineer

\- Becomes hard to make cross-cutting changes

\- Complicates releasing the product

\- There's still a core of "stuff" that's not easy to tease apart, so at
least one of the smaller Windows repos would still have been a similar order
of magnitude in most dimensions

~~~
erikpukinskis
> \- Becomes hard to make cross-cutting changes

This does seem like a negative, doesn't it?

But it's not. Making it hard to make cross-cutting changes is exactly the
point of splitting up a repo.

It forces you to slow down, and—knowing that you can only rarely make cross-
cutting changes—you have a strong incentive to move module boundaries to where
they should be.

It puts pressure on you to really, actually separate concerns. Not just put
"concerns" into a source file that can reach into any of a million other
source files and twiddle the bits.

"Easy to make sweeping changes" really means "easy to limp along with a bad
architecture."

I think that's one of the reasons why so much code rots: developers thinking
it should be easy to make arbitrary changes.

No, it should be hard to make arbitrary changes. It should be easy to make
changes with very few side effects, and hard to make changes that affect lots
of other code. That's how you get modules that get smaller and smaller, and
change less and less often, while still executing often. That's the opposite
of code rot: code nirvana.

~~~
Goronmon
_No, it should be hard to make arbitrary changes._

If you change the word "arbitrary" to "necessary" (implying a different bias
than the one you went with) then all of a sudden this attitude sounds less
helpful.

Similarly "easy to limp along with a bad architecture" could be re-written as
"easy to work with the existing architecture".

At the end of the day, it's about getting work done, not making decisions that
are the most "pure".

~~~
colechristensen
You have to balance getting work done vs. purity, and Microsoft has spent
years trying to fix a bad balance.

Windows ME/Vista/8 were terrible and widely hated pieces of software because
of "getting things done" instead of making good decisions. They made billions
of dollars doing it, don't get me wrong, but they've also lost a lot of market
share and have been piling up bad sentiment for years. They've been pivoting,
and it has nothing to do with "getting work done"; it comes from going back
and making better decisions.

~~~
ern
I assumed that Windows 8 was hated because it broke the Start Menu and tried
to force users onto Metro.

~~~
sjwright
It also broke a lot of working user interfaces, e.g. wireless connection
management.

------
lloeki
Coming from the days of CVS and SVN, git was a freaking miracle in terms of
performance, so I have to put things into perspective here when the topmost
issue with git is _performance_. It's a testament to how _huge_ the codebases
we're dealing with are (Windows over there, but also Android, and surely
countless others). The staggering amount of code we're wrangling around these
days and the level of collaboration are incredible, and I'm quite sure we
would not have been able to do that (or at least not that nimbly and with such
confidence) were it not for tools like git (and hg). There's a sense of scale
regarding that growth across multiple dimensions that just puts me in awe.

~~~
comex
Broadly speaking this is true, but note that in some ways CVS and SVN are
_better_ at scaling than Git.

\- They support checking out a subdirectory without downloading the rest of
the repo, as well as omitting directories in a checkout. Indeed, in SVN,
branches are just subdirectories, so almost all checkouts are of
subdirectories. You can't really do this in Git; you can do sparse _checkouts_
(i.e. omitting things when copying a working tree out of .git), but .git
itself has to contain the entire repo, making them mostly useless.

\- They don't require downloading the entire history of a repo, so the
download size doesn't increase over time. Indeed, they don't _support_
downloading history: svn log and co. are always requests to the server.
Unfortunately, Git is the opposite, and only supports accessing previously
downloaded history, with no option to offload to a server. Git does have the
option to make shallow clones with a limited amount of (or no) history, and
unlike sparse checkouts, shallow clones truly avoid downloading the stuff you
don't want. But if you have a shallow clone, git log, git blame, etc. just
stop at the earliest commit you have history for, making it hard to perform
common development tasks (see the sketch below).
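
A sketch of the difference (URLs and paths are made up; the Git side uses the
sparse-checkout config mechanism available at the time):

    # SVN: check out just one subdirectory, nothing else is downloaded
    svn checkout https://example.org/svn/repo/trunk/some/subdir

    # Git: a sparse checkout limits the working tree,
    # but .git still holds the entire repository
    git clone https://example.org/git/repo.git && cd repo
    git config core.sparseCheckout true
    echo "some/subdir/" > .git/info/sparse-checkout
    git read-tree -mu HEAD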

I don't miss SVN, but there's a reason big companies still use gnarly old
systems like Perforce, and not just because of legacy: they're genuinely much
better at scaling to huge repos (as well as large files). Maybe GVFS fixes
this; I haven't looked at its architecture. But as a separate codebase bolted
on to near-stock Git, I bet it's a hack; in particular, I bet it doesn't work
well if you're offline. I suspect the notion of "maybe present locally, maybe
on a server" needs to be baked into the data model and all the tools, rather
than using a virtual file system to just pretend remote data is local.

~~~
ethomson
CVS and SVN are probably a bit better at scaling than (stock) Git. Perforce
and TFVC _are certainly_ better at scaling than (again, stock, out-of-the-box)
Git. That was their entire goal: handle very large source trees (Windows-sized
source trees) effectively. That's why they have checkout/edit/checkin
semantics, which is also one of the reasons that everybody hates using them.

GVFS intends to add the ability to scale to Git, through patches to Git itself
and a custom driver. I don't think this is a hack - by no means is it the
first version control system to introduce a filesystem level component. Git
with GVFS works wonderfully while offline for any file that you already have
fetched from the server.

If this sounds like a limitation, then remember that these systems like
Perforce and TFVC _also_ have limitations when you're offline: you can
continue to edit any file that you've checked out but you can't check out new
files.

You can of course _force_ the issue with a checkout/edit/checkin but then
you'll need to run some command to reconcile your changes once you return
online. This seems increasingly less important as internet becomes ever more
prevalent. I had wifi on my most recent trans-Atlantic flight.

I'm not sure what determines when something is "a hack" or not, but I'd
certainly rather use Git with GVFS than a heavyweight centralized version
control system if I could. Your mileage, as always, may vary.

------
vtbassmatt
A handful of us from the product team are around for a few hours to discuss if
you're interested.

~~~
rajathagasthya
Very cool blog! As I understand it, you dynamically fetch a file from the
remote Git server the first time it is opened. Do you do any sort of pre-
fetching of files? For example, if a file has an import and uses a few symbols
from that file, do you also fetch the imported file beforehand, or just fetch
it when it is first accessed?

~~~
vtbassmatt
For now, we're not that smart and simply fetch what's opened by the
filesystem. With the cache servers in place, it's plenty fast. We do also have
an optional prefetch to grab all the contents (at tip) for a folder or set of
folders.

------
breck
This is so awesome. Brilliant move MS! In addition to enabling Windows
engineers to be significantly more productive (eventually), it will go a long
way to enabling engineers in other departments to contribute to Windows. For
example, I used to work in the Azure org and once noticed a relatively simple
missing feature in Windows. I filed a bug and was in contact with a PM who
suggested if I wanted I could work on adding it myself. I dipped my toe in,
but the onboarding costs were just too high and I quickly decided against it.
With Windows on git, I'd have been much more likely to dive in.

~~~
hyperrail
I'm not so sure moving to Git alone would have helped your case. Getting an
enlistment is only a small part of contributing to Windows.

~~~
vtbassmatt
True, but the move to Git is part of our larger "1ES" (One Engineering System)
effort across the company. The idea is, if you know how to
enlist/build/edit/submit in one team, you know how to do the same in any team.

------
nathan_f77
This is pretty crazy. It's very hard to imagine working on a single codebase
with 4,000 other engineers.

> Another key performance area that I didn’t talk about in my last post is
> distributed teams. Windows has engineers scattered all over the globe – the
> US, Europe, the Middle East, India, China, etc. Pulling large amounts of
> data across very long distances, often over less than ideal bandwidth is a
> big problem. To tackle this problem, we invested in building a Git proxy
> solution for GVFS that allows us to cache Git data “at the edge”. We have
> also used proxies to offload very high volume traffic (like build servers)
> from the main Visual Studio Team Services service to avoid compromising end
> user’s experiences during peak loads. Overall, we have 20 Git proxies
> (which, BTW, we’ve just incorporated into the existing Team Foundation
> Server Proxy) scattered around the world.

If I was a hacker, this paragraph would probably encourage me to study the
GVFS source code and see if I can find some of these Git proxies. I have no
idea how you would find them, but there might be some public DNS records. This
sounds like some very new technology and some huge infrastructure changes,
which are pretty good conditions for security vulnerabilities. What kind of
bounty would Microsoft pay if you could get access to the complete source code
for Windows? $100,000? [1]

[1] [https://technet.microsoft.com/en-
us/library/dn425036.aspx](https://technet.microsoft.com/en-
us/library/dn425036.aspx)

------
falsedan
Looks like all of the charts were made in Excel… that's some dedication to
staying on-brand!

~~~
bhauer
As opposed to what?

~~~
niklasrde
HTML and co. It is a website..

------
vmasto
19 seconds for a commit (add + commit) might be long but the new improvements
look promising (down to ~10s).

(Please correct me if the COMMIT column in the perf table includes the staging
operations.)

This looks awesome. I just wish Facebook would also share some perf and timing
statistics on their own extensions for Mercurial; last time I checked, their
graphs were unitless.

~~~
saeednoursalehi
Indeed, while 19 seconds for commit is far better than 30 minutes we would
have seen without GVFS, it's way too slow to actually feel responsive while
you're coding. And in fact, it was sometimes worse than 19 seconds because
commands like status and add would generally get slower as you access and
hydrate more files in the repo. With the big O(modified) update that we just
made to GVFS, git commands no longer slow down as you access more files, so
now our devs see a consistent commit time of around 10 seconds, and consistent
and faster times for most other commands too.

~~~
slededit
You have to put this into perspective with what they are replacing. You'd
never get a submit done in less than 19 seconds using the old Source Depot
tools anyway. When you work on projects this big, which take hours to compile
and minutes to compile incrementally, responsiveness just isn't something you
get to have at scale.

~~~
vmasto
This is mainly why I'm asking for data from Facebook. I've seen claims that
their vcs operations (at least the most common ones) are near instant, but
nothing official. It would appear that FB have solved the responsiveness
problem with Mercurial but, again, no official data to back it up.

------
isignal
Those of us working on smaller codebases may wonder what the big deal is.
Facebook had a similar problem, leading them to switch to Mercurial.
[https://code.facebook.com/posts/218678814984400/scaling-
merc...](https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-
facebook/) It is awesome that the problems could be solved in git itself.

Also, kudos to the writer of the blog. It is a really high quality blog post.
The percentile measures of performance, survey responses from users, etc. are
very typical of the solid incremental approach to challenges you see at
startups, except here the customers are internal.

------
cubano
Isn't this perhaps the greatest validation of Linus's design genius that what
was initially a weekend project[0] has successfully scaled to this?

 _They could no longer use their revision control system BitKeeper and no
other Source Control Management (SCMs) met their needs for a distributed
system. Linus Torvalds, the creator of Linux, took the challenge into his own
hands and disappeared over the weekend to emerge the following week with Git._

[0] [https://www.linux.com/blog/10-years-git-interview-git-
creato...](https://www.linux.com/blog/10-years-git-interview-git-creator-
linus-torvalds)

~~~
wfunction
> Isn't this perhaps the greatest validation of Linus's design genius that
> what was initially a weekend project has successfully scaled to this?

I thought the entire point of the article was to show how git _didn't_ scale,
and how they're basically rewriting the project and changing it as much as
necessary to make it scale. It's not like Linus designed git to scale as
O(modified).

~~~
happycube
On the other hand, becoming something bigger out of his hands is validation in
its own right...

------
yeukhon
If you have a large repo and you don't need the full history on clone, a
shallow clone actually helps you save a lot of storage.
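
For example (the repo URL is illustrative):

    git clone --depth 1 https://example.org/big-repo.git   # fetch only the tip
    git fetch --unshallow        # later, pull down the full history if needed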

hg is still behind this, AFAI can tell from search. FB has this as an
extension. [https://bitbucket.org/facebook/hg-
experimental/](https://bitbucket.org/facebook/hg-experimental/)

Is FB fully using hg internally, or both Git and hg? Because obviously FB has
public repos on GitHub.

~~~
isignal
My understanding is that fb is actively promoting hg for internal
repositories. Not sure how they sync between public git and internal hg.

------
kk1274
300GB of code, WOW! Just for comparison, the entire English Wikipedia dump
including all media is about 50-60GB. What are you guys doing there, and how
large do you see this growing?

~~~
taylorlafrinere
Most of that 300GB isn't text. There are test assets, images, videos, built
binaries, VHDs, etc. Also, I should be clear that that 300GB is just at tip
(no history). We can debate about whether or not those things should be
checked into the repo, but they are there now.

~~~
alkonaut
How did you go about creating the central repo, and how long did it take? A
2GB-at-tip SVN repo with 100k commits is taking me many days, and each odd
failure typically has me restart the process after filtering out some obscure
part of the tree.

Edit: I read in another comment that you dropped the history. Understandable,
but I can appreciate how that would add to the friction (devs having to look
through two different histories).

~~~
vtbassmatt
The Windows team developed a tool called "GitTrain" that knew how to:

\- migrate the tip of a branch to Git (yes, the 300GB number is the _tips_ of
all the interesting branches, not the history)

\- keep a Git branch and a SD branch in sync for a while

\- be re-run over each of the 400+ branches they care about

But they went through some of the same trial-and-error process that you're
describing.

------
DominikD
Sorry to see SourceDepot (slowly) decommissioned. I loved it and since it was
a Perforce fork, what I've learned was directly applicable when I started
using P4 in my subsequent job. Perhaps I'm old fashioned but I really see
little appeal in DVCSes. I liked Hg but in the long run it's going to be
completely run over by Git so I'd rather not invest in it. I'm rambling,
sorry.

------
steve_avery
What was performance like for the Source Depot system? It would be interesting
to note the comparison between the old SDX system and GVFS.

~~~
wilatmsft
Quoting Brian Harry from a comment response at
[https://blogs.msdn.microsoft.com/bharry/2017/05/24/the-
large...](https://blogs.msdn.microsoft.com/bharry/2017/05/24/the-largest-git-
repo-on-the-planet/)

"It depends a great deal on the operation. SourceDepot was much faster at some
things – like “sd opened”, the equivalent of “git status”. sd opened was <
.5s. git status is at 2.6s now. But SD was much slower at some other things –
like branching. Creating a branch in SD would take hours. In Git, it's less
than a minute. I saw a mail from one of our engineers at one point saying
they'd been putting off doing a big refactoring for 9 months because the
branch mechanics in SD would have been so cumbersome and after the switch to
Git they were able to get the whole refactoring done in a topic branch in no
time.

On an operation, by operation basis, SD is still much faster than our Git/GVFS
solution. We're still working on it to close the gap but I'm not sure it will
ever get as fast at everything. The broader question, though is about overall
developer productivity and we think we are on a path to winning that."

------
drawkbox
Game development also has very large files and codebases; Git LFS is sometimes
not enough. This is great for everyone really, but very nice for game
development and larger codebases that might have lots of assets along with
them.

Microsoft is doing great work here, and I hope it makes it to Bitbucket,
GitHub, etc.

------
Willamin
I don't know much about Windows development, but I'm sure the system is
modularized in some way. Why wouldn't you want to break up the project into
multiple repos for different parts of the system? That would let you work on
and test each part independent of the rest. Each part should be able to
function on its own, right? Of course some engineers would need to build and
test the entire OS as a whole, but I'd wager that (for example) the team
working on visual design of the settings app doesn't need to have the source
code of how the login screen verifies passwords.

Clearly Microsoft's process works well enough for them, so I wonder what
benefits there are to using the monolithic repo choice over many smaller
repos.

~~~
domoritz
Google has a single repo. The advantages are that you don't need to version
anything because you always build against head. It's awesome but requires some
discipline and good infrastructure.

------
garyclarke27
Linus must be very proud - his favourite software, Windows, now depends on
Git.

~~~
manyoso
I am _very_ certain that Linus would read this article and curse the jaw
dropping stupidity of the whole endeavor. They've basically taken a tool he
wrote to do _real_ distributed source control that can scale and turned it
into a central server.

Git was never meant to be used this way and I know he'd be horrified+amused in
the extreme.

~~~
e40
Centralization has nothing to do with this feature. It's about big repos,
centralized or not.

------
Dunedan
> You also see the 80th percentile result for the past 7 days […]

What'd be even more interesting to see is something like the 95th or 99th
percentile, as showing that 80% of all operations finish in acceptable time is
nice, but probably not what's necessary to have satisfied customers.

------
microcolonel
The day has come that Microsoft employees are celebrating how good they're
getting at running Linus Torvalds' source code management tool.

Cats and dogs, flying pigs.

Might be good to start work on a compatible client and server for FUSE-based
systems (Linux, OpenBSD, macOS [with a FUSE kernel module]).

------
faragon
I have Git repositories much larger than 300GB, for binary data. The title
should be "the largest Git repo for source code", in my opinion. BTW, it is a
nice thing that Windows development is being moved to Git. What's Linus's
opinion on that? What a victory :-)

------
JimA
Non-dev here, but does this replace/overlap with TFS? What was the driver to
adopt Git?

~~~
vtbassmatt
Good questions. TFS is a whole suite of services: 2 version control systems
(TFVC and Git), work item tracking, build orchestration, package management,
and more. VSTS is the roughly-analogous cloud-hosted version.

I'd have to dig up the link: a few years ago our VP had a good blog post on
why we chose to add a Git server to our offering. TFVC is a classic
centralized version control system. When we wanted to add a distributed
version control, we looked at rolling our own but ultimately concluded that it
was better to adopt the de facto standard.

~~~
FLGMwt

      ultimately concluded that it was better to adopt the de facto standard
    

Thank you for that : )

------
TorKlingberg
This is both very cool and eerily reminiscent of MVFS and ClearCase. It's a
huge change in Git, going from de-centralized to hyper-centralized. If I read
it right, git status, git commit and even running a build or cat'ing a file
may not work if your network or the central server is down.

I hope they have thought hard about how to get the "Git proxy" for remote
sites working well. If they end up with remote sites working on a tree that is
15 minutes to 1 hour old, that will be very annoying.

------
YeGoblynQueenne
So, if Windows engineers are using git now, who is using TFVC? That's Team
Foundation Version Control - the original TFS version control engine.

~~~
vtbassmatt
Lots and lots of external customers, and a handful of internal folks. FWIW
Windows was never on TFVC (at least not the main development group).

~~~
a_imho
What is the opposite of dogfooding?

~~~
YeGoblynQueenne
That wouldn't be dogfooding - the Windows team is not responsible for TFVC,
innit.

------
rbanffy
The pain increases with the square of the number of files and to the fourth
power of the dependencies between specific versions of them.

I'm not sure a big repo is a wise thing. Even though I understand it may make
sense for a company for multiple reasons, understanding it may damage brains
far more sophisticated than mammalian ones.

------
DigitalJack
Maybe I missed it, but I wish they would have compared times to what they were
using before (Source Depot).

I guess I would more specifically like to know what pain points drove
Microsoft to even try such a massive change.

------
bischofs
Any word on open sourcing parts of the Windows OS now that MS is seeing the
light? The head guys have to see the benefits by now.

It says something that MS chose Git over anything proprietary that they
developed.

~~~
midnitewarrior
I'm guessing that would be a licensing nightmare. They must pay many companies
for licensed technologies inside Windows, and many of those licenses likely
wouldn't be compatible with open source licensing.

All of their source code would have to go through legal review, some with each
check in. I don't see that happening for legacy code.

------
Myrmornis
Rather than worry about getting `status` under 10 seconds, just focus on `diff
--stat` and `diff --cached --stat`. Those two replace most uses for `status`.
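
For reference, a quick sketch of what those show:

    git diff --stat            # per-file summary of unstaged changes
    git diff --cached --stat   # per-file summary of staged changes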

------
cryptonector
I see that Windows engineering uses a merge workflow. I wonder why. See other
comments in this thread about rebasing.

------
criddell
At one point, I believe Microsoft was using a modified Perforce server for
source code. Is that completely gone now?

~~~
fmihaila
You are thinking of Google.

~~~
fmihaila
I meant that Google was the company who used Perforce in the past, and not
Microsoft. Google isn't using it anymore either; they switched to their own
thing named Piper.

[https://www.wired.com/2015/09/google-2-billion-lines-
codeand...](https://www.wired.com/2015/09/google-2-billion-lines-codeand-one-
place/)

~~~
vtbassmatt
SD was based on a very old Perforce as well.

------
debamitro
I am not sure, but doesn't this look like an open source version of what
ClearCase provides?

------
svanwaa
Can you go into any more detail of the breakdown of your repo structure?
Thanks!

~~~
vtbassmatt
edit: forgot, no Markdown here

Do you mean across all of Microsoft? Different teams have different
structures. Speaking only for TFS and VSTS, we have a single repo containing
the code for both, a handful of "adjunct" repos containing tools like GVFS, a
repo for the documentation [1], and a bunch of open source repos for the build
and release agent [2], agent tasks [3], API samples [4], and probably more I
don't know about.

[1] [https://www.visualstudio.com/docs](https://www.visualstudio.com/docs)

[2] [https://github.com/microsoft/vsts-
agent](https://github.com/microsoft/vsts-agent)

[3] [https://github.com/Microsoft/vsts-
tasks](https://github.com/Microsoft/vsts-tasks)

[4] [https://github.com/Microsoft/vsts-dotnet-
samples](https://github.com/Microsoft/vsts-dotnet-samples)

------
wodencafe
So THIS is why they developed GVFS.

------
systems
is GVFS portable to other OSes?

~~~
vtbassmatt
By design, yes. There are not (yet) implementations on other OSes.

------
adoggman
Heh, the place I work at might have the single largest monolithic SVN repo. It
works surprisingly well.

------
Kenji
I am kinda surprised that Microsoft doesn't use TFS - after all, it's their
own version control system. But then again, we use TFS at work, and not a day
goes by on which I do not long for git.

~~~
maxxxxx
Here is my cynical view: From what I know they have a history of not using
their own tools. They didn't use SourceSafe, but Perforce. Then they made an
effort to switch to TFS, realized that it sucks and moved on to git.

You can see the same pattern in Windows desktop apps. They didn't use MFC for
themselves, didn't use WinForms, and used WPF only a little.

~~~
eyq1
If they didn't use MFC, WinForms, and WPF, what exactly were they using? Flash?

~~~
maxxxxx
Win32

~~~
Flow
And WTL, Windows Template Library.

[https://en.wikipedia.org/wiki/Windows_Template_Library](https://en.wikipedia.org/wiki/Windows_Template_Library)

~~~
maxxxxx
WTL is way old.

~~~
Flow
Yes, but it's far newer than Win32 and MFC. There's also ATL, Active Template
Library.

------
MS_Buys_Upvotes
Wow the majority of posts here are from Microsoft employees.

~~~
scrollaway
So? Why is that surprising?

------
graycat
Sounds like a lot of good work.

But, in "Git repo", what the heck is a "repo"? A repossession as in
repossessing a car?

In the OP with "Everything you want to know about Visual Studio ALM and
Farming", what is ALM -- air launched missile? What do air launched missiles
and "farming" have to do with Visual Studio?

To Bill Gates and Microsoft: For my startup, I downloaded, read, indexed, and
abstracted 5000+ Web pages from the Microsoft Web site MSDN. That took many
months. Then I typed in the software for my startup, 24,000 programming
language statements in Visual Basic .NET 4 and ADO.NET (Active Data Objects,
for getting to the relational data base management system SQL Server) and
ASP.NET (Active Server Pages, for building Web pages) in 100,000 lines of
typing. For that work, all of it that was unique to me and my startup was
fast, fun, and easy.

Far and away the worst problem in my startup, that delayed my work for YEARS,
was the poor quality of the technical writing in the Microsoft documentation.

Some of the worst of the documentation was for SQL Server: Gee, I read the J.
Ullman book on data base quickly and easily while eating dinner at the Mount
Kisco Diner. But the Microsoft documentation was clear as mud. Just installing
SQL Server ruined my boot partition: SQL Server would not run, repair,
reinstall, or uninstall, and I had to reinstall all of Windows and all my
applications and try again, more than once.

Quickly I discovered that the documentation of logins, users, etc. was a mess:
Basically the ideas seemed to be old capabilities, attributes, authentication,
and access control lists, but nothing from Microsoft was any help at all.
Eventually via Google searches I discovered some simple SQL statements, I
could type into a simple file and run with the SQL Server utility SQLCMD.EXE;
that way I got some commands that worked for much of what I needed. Now those
little files are well documented and what I use. For getting a connection
string that worked, again the documentation was useless, and I tried over and
over with every variation I could think of until, for no good reason, I got a
connection string to work. Once I tried to get a new installation of SQL
Server to recognize, connect to, and use a SQL Server database from the
previous installation of that version of SQL Server, but the result just
killed the installation of SQL Server.

Again, once again, over again, yet again, one more time, far and away the
worst problem in my startup is making sense out of Microsoft's documentation.
I found W. Rudin, _Real and Complex Analysis_ fast, fun, and easy reading;
Microsoft's documentation was an unanesthetized root canal procedure -- OUCH!

So, again, once again, over again, yet again, one more time, please, Please,
PLEASE, for the sake of my work, Microsoft, and computing, PLEASE get rid of
undefined terms and acronyms in your technical writing. Get them out. Drive
them out. Out. Out of your writing. Out of your company. Out of computing. No
more undefined terms and acronyms, none, no more. I can't do it. You have to
do it. Then, DO IT.

~~~
Sacho
The definitions of "git repo" and "ALM" are the top search results on Google
for both terms.

~~~
graycat
I can believe that.

My point is, shouldn't articles define terms or at least give links to
definitions for them?

Apparently Google has discovered that their usual keyword/phrase search of Web
pages should be set aside when a search is really for some jargon or an
acronym, and that a special search, just for definitions, should be done for
such terms. So, if Google understands the crucial importance of unwinding
jargon and acronyms, the rest of us in computing can also.

~~~
Sacho
This would create a lot of noise for the regular readers of his blog, who
already know the definitions of these terms. The terms are not even obscure.
This is also a blog, not a piece of technical documentation.

------
iamNumber4
This just seems like the exact opposite of "Do one thing; and do it well".

------
manyoso
I would pay money to see a video camera of Linus' face reading this article. I
think we'd probably get impossible new shades of the color red heretofore
unknown to humanity.

------
nthcolumn
Linus Torvalds rocks. Windows sucks. Subversion was crap so he just made git
instead. git beat out svn, TFS and all that other crap legions of overpaid
engineers came up with (or what they didn't get source control???) because
unix design philosophy and therein lies the lesson still unlearned for they
hath loaded all their bloat into one repo.

Windows. It sucks and it will forever suck because it sucks by design. Bill
say 'Thank you Linus - I owe you sooo much because git is way better than the
best I could do'

I mean has there ever been worse software ever written than the stuff being
loaded into git right now? Awful, awful garbage, creaking and reeking of dirty
hacks, different for the sake of it designs, misshapen, bolted together,
bloated, willfully annoying, antisocial, phone home, locking-in, full of
resolutely, defiant ancient unfixed bugs, butt ugly, horrible UI, full of
errors and meaningless error messages, incessant nagging and weird quirks, wtf
folders, command line from hell and urgh... note pad ... and oh dear god I
almost forgot mmc consoles and visual studio and inconsistent flows, viral
load by the galactic shit tonne, complete and utter drivel makes me want to
vomit every time I hear that sickening jingle and after all those gazillions
of engineering hours an absolute world wonder of fail?

Two guys working out of a garage could do better.

:P (Windows sucks btw)

~~~
BenjiWiebe
Sadly we will have to quit blaming Bill Gates. I doubt he makes very many
design decisions any more. :)

~~~
nthcolumn
If he had only listened to me and re-released Xenix open source with a decent
WM we could have avoided all this unpleasantness but no, he had to listen to
Monkeyboy. :/

