Microsoft and GitHub team up to take Git virtual file system to macOS, Linux (arstechnica.com)
796 points by dmmalam on Nov 17, 2017 | 218 comments

"Git wasn't designed for such vast numbers of developers—more than 20,000 actively working on the codebase. Also, Git wasn't designed for a codebase that was so large, either in terms of the number of files and version history for each file, or in terms of sheer size, coming in at more than 300GB. When using standard Git, working with the source repository was unacceptably slow. Common operations (such as checking which files have been modified) would take multiple minutes."

These sentences in some parts gloss over the details, and in others are flat-out wrong. Git was designed for tens of thousands of developers (the Linux kernel), and it was designed for huge numbers of files. Large numbers of files work fine on Linux thanks to the dentry cache; they are slow on Windows because Windows has no cache with the same behaviour as the Linux dentry cache. Admittedly Git is slow for files that are large in size, but it was designed for source code; what sane developer would have source files that are hundreds of MB in size?

That said, monorepos holding several dozen software projects can be very slow due to O(n) operations such as git blame and the like. It's true that Git wasn't designed for large numbers of projects in one repo that are only indirectly related through shared library code and the like.
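
For what it's worth, even in a big monorepo you can sidestep some of that O(n) cost by scoping history queries to your project's subtree. A minimal sketch in a throwaway repo (paths and commit messages are made up for illustration):

```shell
# Throwaway repo showing path-scoped history: `git log -- <path>` only
# walks commits that touched that path, which helps in a monorepo.
tmp=$(mktemp -d) && cd "$tmp" && git init -q
mkdir proj1 proj2
echo a > proj1/file && git add . \
  && git -c user.name=demo -c user.email=demo@example.com commit -qm "proj1 change"
echo b > proj2/file && git add . \
  && git -c user.name=demo -c user.email=demo@example.com commit -qm "proj2 change"
git log --oneline -- proj1 | wc -l   # 1: only the commit touching proj1
```

The same pathspec trick works for `git blame <path>` itself, which only walks history relevant to that file.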

Unfortunately that was not written in the article.

> in others it is flat out wrong. Git was designed for tens of thousands of developers (the Linux kernel),

This is easy to check, and the article is right. The Linux kernel does not have tens of thousands of active developers. Over a little more than the last year:

    titan:~/src/linux geofft$ git log v4.8..v4.14 --format='%aE' | sort -u | wc -l
If you look at just the most recent release cycle:

    titan:~/src/linux geofft$ git log v4.13..v4.14 --format='%aE' | sort -u | wc -l
Alternatively, if you look over the last ~year, but at people who have more than five commits (chosen arbitrarily):

    titan:~/src/linux geofft$ git log v4.8..v4.14 --format='%aE' | sort | uniq -c | awk '$1 > 5' | wc -l
A codebase with 20,000 developers actively working on it is at least an order of magnitude more than the Linux kernel. If they're committing patches every single day, probably more than that. I have, last I checked, four commits in Linux: two from a summer internship in 2009 and two doc fixes. That certainly contributes to the scale of Linux the kernel in the sense of "there are so many people working on it!", but not in the sense of being an "active developer" of the git repo. I was fine sending my patches to a mailing list and hearing back at some arbitrarily later point, not triggering any CI, not asking anyone else to collaborate on my work, etc. That doesn't work if you have 20,000 engineers on the same codebase writing code every day.
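
The `uniq -c | awk` filter in the last pipeline is the least obvious part, so here it is run on fake author emails instead of a real kernel checkout (addresses are made up):

```shell
# `uniq -c` prefixes each distinct line with its occurrence count;
# awk then keeps only rows whose count exceeds the threshold (2 here).
printf 'a@x\na@x\na@x\nb@x\n' | sort | uniq -c | awk '$1 > 2' | wc -l   # 1
```

Only `a@x` appears more than twice, so one "author" survives the filter.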

> Admittedly it is slow for files that are large in size, but it was designed for source code; what sane developer would have source files that are hundreds of MB in size?

Sometimes that's the best way to get things done? The best Debian workflow I've ever worked with (and I've worked with a lot) involved actually committing binary .debs to SVN alongside their source code, because that meant that the SVN revision number was the single source of truth. There wasn't some external artifact system, nor was there a risk of picking up the wrong packages from an apt repo when rebuilding an older SVN revision.

I'm not defending this as pretty. But I will defend this as sane. It got the job done reliably and let me work on actually shipping the product and not yak-shaving workflows.

Good stuff. Came into this thread expecting the top comment to be a know-it-all answer trying to "debunk" the article, as so commonly seen on HN. This is a perfect, well-researched response.

Interesting read

I think this is all missing the point. Do they want to store, say, the entirety of the base Windows OS in a single git repo? If that's the case, sure, git isn't a great fit for their use case, and they should either use something else, or find a way (as they're doing?) to change git to fit their needs.

I can't see MS having tens of thousands of developers working on any single component of Windows, so I'm guessing they _do_ want a giant monorepo. If they were to break it up into separate repos per OS component, I doubt they'd have scaling issues (of course, breaking things up introduces coordination and dependency issues).

PM for Git at Microsoft here. We explored splitting it up. It's a 300GB repository and it's been a monorepo for the last 20 years. Splitting it up logically would take a lot of time (to put it mildly) and development would stall while we did it. And once we did, would we have 100 3.5GB repositories? 3,500 100MB repositories? Neither of these is particularly appealing, and getting changes checked in atomically across multiple repositories is insanely challenging. There's no doubt that we would need to build tooling to make this work for us. (We did actually explore this direction, but ultimately decided that it would be too much work for too poor an experience.)

Instead, we decided to - as you put it - change Git to fit our needs.

Does Git on Windows use a dentry-like cache in user land to speed up filename lookups? Does it use a stat()-equivalent cache? Is there a reason such caches weren't put into ntoskrnl.exe? Would you be able to give us a brief list (say, the top 5) of changes that sped up Git on Windows, and which ones had the most effect on super-sized monorepos? Thanks!

I have seen poor performance of git on Linux in very large repos, so I'm not super convinced that the dentry cache magically makes things better.

In particular, if my repo is big enough, I often don't have the entire tree in memory (because I'm doing other useful things with memory and caches got evicted). core.untrackedCache makes things a little better, but it's still not great.
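
For reference, the setting in question is just a per-repo config flag; a quick demo in a throwaway repo:

```shell
# core.untrackedCache stores directory mtimes in the index so that
# `git status` can skip re-scanning directories that haven't changed.
tmp=$(mktemp -d) && cd "$tmp" && git init -q
git config core.untrackedCache true
git config core.untrackedCache   # prints: true
```

`git update-index --test-untracked-cache` can be run first to check whether the filesystem's mtime behaviour actually supports the cache, which is exactly the caveat being raised here: the cache only helps when the relevant metadata is reliable and still resident.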

What were you using before this?

We were using a mix of tools throughout Microsoft: several teams were using Source Depot, which is an internally developed centralized version control system. It was built to handle large teams, like Windows and Office. It was the precursor to Team Foundation Version Control, which is the centralized version control system available in TFS and VSTS, which is also capable of scaling to large projects, and many organizations within the company were (and are) using that.

GVFS is part of our effort within the company to standardize on a single set of best-of-breed tools, and use the same tools inside Microsoft that we deliver to customers. So we're adopting one engineering system throughout the company and we're moving everybody to Visual Studio Team Services.

It's my understanding that Source Depot was not internally developed but rather was a fork off Perforce which MS purchased a source license to some years back.

True. Source Depot originated as Perforce and is a fork; it had (at one time) a pretty large team of developers working on it and it has been heavily, heavily modified in the many years since.

Changing Git to fit your needs probably just makes this monorepo function on borrowed time. Fixing the symptoms with band-aids will work for now, but at some point something will probably have to be done about this 300GB god-repo. Any insight into what/when/if something is going to be done about that?

I'm not the right person to ask about Windows organization, I'm afraid. But I will say that the nice thing about moving to Git is that its lightweight branching makes refactoring much easier.

I can't see it being easy to take a 20-year-old massive code base and split it up into components that are isolated enough to live smoothly in different repositories.

As they say in: https://blogs.msdn.microsoft.com/bharry/2017/02/03/scaling-g...

The first big debate was – how many repos do you have – one for the whole company at one extreme or one for each small component? A big spectrum. Git is proven to work extremely well for a very large number of modest repos so we spent a bunch of time exploring what it would take to factor our large codebases into lots of tenable repos. Hmm. Ever worked in a huge code base for 20 years? Ever tried to go back afterwards and decompose it into small repos? You can guess what we discovered. The code is very hard to decompose. The cost would be very high. The risk from that level of churn would be enormous. And, we really do have scenarios where a single engineer needs to make sweeping changes across a very large swath of code. Trying to coordinate that across hundreds of repos would be very problematic.

> I can't see it being easy to take a 20 year massive code base and split it up into components that are so isolated enough they can live in different repositories smoothly.

I can't see it being desirable either; they already have to deal with outdated external APIs, and surely they don't want outdated internal APIs. A single repository lets you replace an API throughout the codebase at once and remove the old one entirely; if you start "modularising", then all "internal" APIs are suddenly external, with all that entails.

+1 for the research and the distinction between sanity and purity.

> actually committing binary .debs to SVN alongside their source code, because that meant that the SVN revision number was the single source of truth

This is totally armchair devops, but it sounds like a more ideal setup would be CI that builds based on your commit. In my experience, developers get lazy and are likely to commit binaries that don't match the code, or miss some dependencies or steps.

I completely agree with your point; using the tools as intended isn't always the best way to get things done. Video games are an easy example: code is often dependent on and intermixed with binary assets, and you want those to be versioned together.

If you actually commit binaries to help with the CI, then those binaries should be installed by the CI. I'm not saying it's a good idea to do it this way, but if you really want to for some reason, then that's the way to enforce no steps or binaries are missing. (And if devs forget to add something, the build would just fail)
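
One way to enforce that, sketched in plain shell (the file names and "build" are hypothetical stand-ins): have CI rebuild the artifact and fail the build unless it matches the committed copy byte-for-byte.

```shell
# Hypothetical CI step: rebuild the artifact and compare it against the
# copy committed to the repo; fail the build on any mismatch, so a dev
# who forgets to recommit the binary breaks the build instead of prod.
committed=$(mktemp) && rebuilt=$(mktemp)
printf 'deb-payload' > "$committed"   # stand-in for the checked-in .deb
printf 'deb-payload' > "$rebuilt"     # stand-in for the CI-built .deb
if cmp -s "$committed" "$rebuilt"; then
  echo "artifact up to date"
else
  echo "committed artifact is stale; rebuild and recommit" >&2
  exit 1
fi
```

In practice you'd compare checksums of the real build output rather than temp files, but the failure mode is the point: a stale binary can't slip through silently.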

Apologies if I wasn't clear; that's what I was getting at. Commit the code, and CI generates the binaries. I guess you could commit the binaries if you wanted, but a separate caching repo would make things more manageable. You'd still be using the commit to reference the binary. But this is the ideal, and there are good, practical reasons for doing what he describes.

For games, the binary data isn't generated from code. It's textures and models produced in a package (like Photoshop). The bits of code are often tightly coupled with the textures, like a procedural waterfall that uses still images. You'd need to version both of those together.

I mean, it's certainly true that at least once I found that another developer had updated the .debs and forgotten to update the sources, and I had to attempt to reconstruct what he did because I needed to add another patch on top. So you're absolutely right that some better CI would have helped here.

But for the product we were building, it was important that developers were able to test their .debs before committing to the repo. That means that CI can't build post-commit: you have to have some way to take a local modification, turn it into a .deb, test it and see if it does what you want, and then commit that.

I've thrown something together for my current job that supports both building a package interactively for testing and building one in the CI pipeline post-commit (using a git repo as the source in both cases) but it's definitely more cumbersome, and I haven't figured out quite how I want to fix it.

(Also, yes, games are a more obvious example of this, but there are lots of purists who've never worked in game development who are somehow convinced that assets aren't code and shouldn't be in the code repo. Having not worked in game development myself either, I figured it was safer not to invite that argument.)

You could try gerrit + e.g. Jenkins. When pushing to review the CI can pick up the commit, build it and optionally run tests and give +1/-1 verified.

The commit is merged only after a developer will give +2 for the review.

Are private branches that the CI can pick up also not a workable solution for your usecase?

That's what I'm doing now at work, and it's sort of silly to git push, wait for CI, notice that it failed, fix, git push, wait for CI, and so on. In an SVN world, that would have meant one commit in the monorepo (since there's no such thing as temporary branches, and revisions are global across all branches) for every typo I make during development. You could make it work with Git, certainly, but on balance the check-in-a-.deb approach worked fine and didn't require spending more time writing tooling.

Question: what's the `sort -u` for? You're piping to `wc -l`.

"u" stands for unique, so after sorting it skips identical rows

Oh! TIL, thanks!!

Dude, teach me how to pipe like you

>These sentences in some parts gloss over the details, and in others it is flat out wrong. Git was designed for tens of thousands of developers (the Linux kernel),

Are there really ten thousand active developers on the Linux Kernel?

According to [1] there were ~2,000 over a 15-month period. It's an order-of-magnitude difference that seems like it would have an impact.

[1] https://arstechnica.com/information-technology/2015/02/linux...

And even then, the kernel doesn’t follow anything like the approach to distributed development that your average Github project does. The kernel consists of a bunch of maintainers who each have their own intake branches, to which other people submit PRs or patches as that maintainer wills; and then the mainline kernel, which only takes contributions via patches from one of the maintainers.

The nearest thing you could compare the development model to is the set of base packages for the BSDs: each subsystem/package has its own upstream development (like the Linux subsystems with their maintainers); and then Linus serves as the release engineer, taking updates to those packages and merging them into the release. (I compare to the BSDs specifically because Linux kernel subsystems, like non-cross-platform BSD base-system packages, have no other life outside of their potential integration into the release. People aren’t using the upstreams directly, other than in development of the new feature.)

I mean, that's the model git was designed for, and it's a model that was designed for a large number of people (by using gatekeepers doing bulk updates to mitigate the flood of small ones).

So I don't think raising the way linux uses git by its original intended users really contradicts the idea that it was designed for a large number of developers?

The article above was actually wrong and has been updated to show that Microsoft has 3,000 active developers instead of the original 20,000 that is quoted here. So it's only a difference of 50%, not a full order of magnitude.

Although you should consider how active those devs are... I imagine it's far more active than Linux kernel devs, so the activity levels can still be an order of magnitude higher.

It's probably more active than some Linux kernel developers (those who don't work on the kernel exclusively), but the majority of Linux kernel developers are employed to do kernel work, so I don't see why there would be a difference in the amount of code output by either set of developers.

I develop a desktop app with lots of test documents (test inputs and outputs) that are tied to specific revisions in order to work with the test suite that uses them. None are huge, but there are many, and some are frequently changed (2GB working copy, 15GB repo, 100k commits). They could be put in some (very) complex hierarchy on a network share, but having them in VCS is so convenient. If rev 9999 changes the code so that the expected output in testfileA differs, then you update the expected output in the same commit. If you don't have the file and you want to be able to clone+build+test any revision, then you still need the results for 9998 also on the share.

This scenario is nicely handled in svn, mercurial, tfvc and similar. If this lets git do it then it looks like a great idea.

Game developers with lots of churn in huge binary assets also have the same problem, but probably even worse.

Those kinds of use cases are what Git LFS [1] was created to help with. The major Git vendors support it (Bitbucket, GitHub, GitLab). I'm not as certain about the open source implementations.

[1] https://git-lfs.github.com/

Git LFS doesn't solve the gamedev/large asset problem.

The issue is that these giant binary files cannot be merged. Because of this you need a locking mechanism (which SVN and Perforce both support).

Without that, you have two people working on the same file and someone having to throw away work, which makes for a very unhappy team.

DVCS is great but not every problem domain maps to it.

This is why we’re mostly stuck in Perforce hell in the games industry.

git-lfs supports locking, beginning in 2.0.

Man, had to really dig to find that. Ctrl-F "lock" didn't hit anything on the landing page or main docs page.

Even then, it looks like an implementation that misses the mark. Both Perforce and SVN mark "lockable" files as read-only on a sync of the local copy. It looks like all this does is prevent a push of the file, which has the same result as if someone merged it ahead of you.

Without that guard in place it makes it really easy to make a mistake and then have changes to a binary file that will collide. Having a checkout process as part of locking workflow is pretty much mandatory.


Never mind, I finally found this [1], which explains that Git LFS does it correctly. Kudos! It could do with some more detailed documentation, since from the man pages it looks partially implemented.

[1] https://github.com/git-lfs/git-lfs/wiki/File-Locking

I dabbled a bit with LFS, but it seems specifically geared towards a few very large resources that you need to manually take care of syncing, etc.

What we need to move off large repos in TFVC or SVN is something where the inconvenience is small enough that it doesn't outweigh the benefits of Git. LFS was not that; this could be.

Git LFS' Achilles' heel is pruning old data.

I made a stupid decision to use git-lfs to track a pre-initialized Docker database image (using LFS to store initial_data.sql.bz2). Now I have to live with a few dozen gigabytes of history I don't really need (it's a small database).

I know I can rewrite the history and then find something to garbage-collect the unreferenced blobs (I haven't investigated it). The repo allows doing so, as it only keeps those dumps, the Dockerfile, and a pair of small shell scripts. And, most importantly, it has only two users: me and my CI. If it were large assets and lots of code, checked out and worked on by a large team, I guess rewriting the history wouldn't be an option.

Storage is cheap, but still...

Game devs just use Perforce (for binaries, or everything). Sometimes Git in combination.

The Windows repo is massive, at least 100GB. Consider that the component equivalent to Linux, the NT kernel, is but a small part of the whole thing. And it is indeed a single large project, where every commit has to leave the entirety of Windows in a good state. So if you change some widely used function, you need to fix every single caller in all of Windows. Also, the build system evolves along with the product, so it is checked into the repository as well. The right version of the build system is whatever is in the timestamp you are building. So there’s your seemingly elusive good reason for keeping large binaries in the repo. And while I worked there, the object files from the nightly build were also checked in, so you could rebuild mostly-unchanged components quickly.

The alternative to doing it this way is to introduce version management between components, which is totally unnecessary complexity. Windows is one product and all of the components (or subsets thereof) ship together. As far as internal-only interfaces are concerned, version N does not need to work with N-1 or N+1, only N. That is a massively simplifying assumption that makes a lot of things much easier.

There is a reason why all the big tech companies operate this way. Microsoft, Google, and Facebook, all have one giant repository with their entire product.

I worked at a game company for a while, and one of the interesting things there was that there really wasn't a good way around checking binaries into the repo. First off, a lot of your "source" assets are properly in binary form (images, for example), because that is the most compact form for them to exist in, by orders of magnitude, and also because trying to generate them from some other non-binary representation would add tremendous overhead to development work. On top of that, building the entire product from "source" might involve tens to hundreds of thousands of hours of CPU/GPU time. So you just didn't do that. Instead you built everything in parallel and checked the results in somewhere, and those would then get bundled together into the final product. There are ways around doing things this way, for sure, but they hardly seem worth it given that you can just go pay a fairly insignificant amount of dollars to some company and get a VCS that won't choke.

> There is a reason why all the big tech companies operate this way. Microsoft, Google, and Facebook, all have one giant repository with their entire product.

Cool, but that doesn't explain why Office and Windows need to share the same repo. Ditto for gmail and chrome os. Ditto for fb messenger and whatever face recognition algorithm they use.

I see how my post could have led to that interpretation. That’s not what I meant. There are certainly many many repos in use across Microsoft, for all sorts of different projects. Office doesn’t share the same repo as Windows. And neither does the toolchain (compiler/linker/etc.); only the binaries are checked into the Windows repo. Those are truly separate products, which need to work across different versions of Windows.

Office and Windows don't share the same repo...

IMHO it is a major failure that git doesn’t scale for org-wide mono repos. Software development process should not be dictated by limitations in the VCS. It surprises me that others are so willing to settle.

> IMHO it is a major failure that git doesn’t scale for org-wide mono repos.

git was (initially) written by Torvalds to serve his own purpose -- he needed a new VCS for Linux. Let's say that it "doesn't scale for org-wide mono repos". Unless that was a design goal (and I'm pretty sure it wasn't), how is that a "failure"? It apparently works just fine for the kernel developers.

Yet because it didn't work well for Microsoft's 300 GB worth of tens of millions of lines of code, it is somehow a "failure"?

Apparently Microsoft's own TFS is a failure too, then, because it seems that it wasn't up to the task either.

PM at Microsoft here. TFVC is completely different, though.

(To quickly clarify nomenclature: TFVC is "Team Foundation Version Control", the name of the centralized version control system in TFS and VSTS, which are our on-premises and cloud hosted development platforms, respectively. They both support both TFVC and Git - Git with GVFS, and they do more than just version control, hence the naming clarification. Apologies for all the three letter acronyms.)

Anyway, TFVC is totally different and a centralized version control system with all the workflows that go along with it like expensive branching. The desire to move to Git opens up all the awesome workflows that Git offers.

The impetus here was to standardize the entire company on a modern set of development tools - one engineering system across the company - hosted in Visual Studio Team Services. We could have moved them to TFVC; it's remarkably similar to its predecessor, "Source Depot", which is a Microsoft-internal tool that the Windows team was using.

But that's a lateral move. There aren't really any benefits to TFVC over Source Depot except that we sell and support one to end users and don't with the other. So that has some nice organizational benefits but not enough to warrant moving everybody and the build/release farms over.

But moving to Git unlocks all sorts of new workflows - lightweight branching, pull requests, all that good stuff that everybody who uses Git is totally used to.

Anyway, that's the background on Git vs TFVC. And, no, I totally agree that Git not supporting giant monorepos is not a failing, per se. But it is a limitation, and thankfully one that we're working to help overcome.

Well, what would be cool is if TFS could be installed on Linux and could use a database other than MSSQL, such as PostgreSQL or MySQL (I mean, search uses Elasticsearch anyway...). Even better if it were closer to the online service. And way better if it came with something like gitlab-runner.

Can you clarify what you mean by "more close to the online service"? We update TFS with the bits from VSTS pretty regularly; are you looking for a faster cadence or am I misunderstanding what you're saying?

Well, some views look cleaner on the online version, e.g. the project overview page.

Currently I'm evaluating on-premise Git hosting, and another downside is that TFS takes a lot of horsepower (this makes it look unsuitable for a small team).

There's also the popularity factor... not many care(d) about TFS. I would refrain from calling most pieces of technology a "failure", git especially.

When your screwdriver does a bad job at pounding nails, you're doing it wrong.

Is there a VCS that does?

When I was at Amazon, Perforce couldn't scale (it had already been partitioned into a few depots, but those had reached their reasonable capacity). Amazon moved to a per "module" git-repo system. It was convenient to push across multiple repos simultaneously, but it wasn't a deal breaker when we lost it (I never heard of any workflows being completely unworkable following the transition).

Piper: https://cacm.acm.org/magazines/2016/7/204032-why-google-stor...

Unfortunately, it's not available externally.

Even working at Google, my jaw still dropped at this:

    Google's codebase is shared by more than 25,000 Google software
    developers from dozens of offices in countries around the world.
    On a typical workday, they commit 16,000 changes to the codebase,
    and another 24,000 changes are committed by automated systems.
    Each day the repository serves billions of file read requests,
    with approximately 800,000 queries per second during peak traffic
    and an average of approximately 500,000 queries per second each
    workday. Most of this traffic originates from Google's
    distributed build-and-test systems.
Mostly I felt terribly unproductive — my changelist generation rate is way below average.

Can you please quote with ">" at the start of the paragraph instead of using a code snippet? This is unreadable on mobile; the text is 3x as wide as the (scrolling) viewport.

The number of engineers and commits is probably in the same ballpark at Amazon. With a DVCS, far fewer operations are done server-side, so I would imagine the read requests are an order of magnitude or two lower. They have tooling to get a global view across all repos; however, there are no cross-repo atomic operations (those can be managed in their build system in a relatively robust way).

What advantage does a company wide mono repo provide?

The ability to have something like CitC (or this post's git virtual filesystem) is certainly one big advantage -- no need to clone new packages, they're right there in your "local" source tree. Bazel (blaze) is another, particularly when coupled with working at HEAD.

My experience with farms of git repos is that the lack of atomic operations over many tiny repos leads to things like version sets and having to periodically merge dependencies. I've worked on teams where that was inevitably neglected during hectic periods resulting in painful merges of large numbers of changes. That problem simply doesn't exist with working at HEAD and high quality presubmit test automation/admission control. The single repo also allows for single code reviews spanning multiple packages which makes it MUCH simpler to re-arrange code (Bazel again helps here since a "package" is any directory with a BUILD file). Package creation is lighter weight for the same reason, and has fewer consequences for poor name choices since rearrangement is easy and well supported by automated tools.

Sharing one build system where a build command implicitly spans many packages also results in efficient caching of build artifacts and massively distributed builds (think a distributable and cacheable build action per executable command rather than a brazil-build per package). Each unit test result can be cached and only dependent tests re-run as you tweak an in-progress change. This is fantastic for a local workflow (flaky tests can be tackled with --runs_per_test=1000, which with a distributed build system is often only marginally slower than a single test run). Also, you can query all affected tests for a given change with a single "local" bazel query command. The list goes on from here -- I keep thinking of new things to add (finer grained dependencies, finer grained visibility controls, etc.).

It's not that you can't build most of this for distributed repos, but I'd argue it's harder and some things (like ease of code reorg) are nearly impossible.

Subjectively, having worked with both approaches at scale, Google's seems to result in much better code and repo hygiene.

Wow! Thanks for sharing. I hadn't contemplated how Google managed their source code. Sounds like Piper has some nice features!

Lots more details here [1] if you're interested.

[1] https://www.youtube.com/watch?v=W71BTkUbdqE

Which means nothing - e.g. by LOC, your change stats are an order of magnitude higher than mine.

Google has one internally, and Facebook has done a ton of work on Mercurial to build something similar, doing things like replacing files on disk with memcache servers running close to the dev machines. But I don't think there's anything that works "out of the box" at that scale.

For quite a while at Google there have been git extensions/tools for working with Piper. They're pretty widely used, but haven't been "officially" supported.

There's a new project that allows you to use hg with Piper/CitC; I've been using it for about a month and it's been working really well. No clue how it works, though.

Would you be willing to elaborate on the pros/cons of monorepos vs. what Git encourages? It seems to just be the wrong fit for a given use case, but works very well for others.

That's a good and valid point, which undoubtedly deserves a conversation. However, there are a lot of businesses out there with a large, legacy monolith application gigabytes in size, where the cost of slicing and dicing it into smaller repositories has to be driven by requirements and not technology.

My current client has a 17-year-old, multi-GB TFS monolith. New components go into Git, and as features evolve it's common for the respective team to hive them off into their own Git repos as stand-alone components. Nevertheless, TFS is painful, and if I could have an effective Git repo for the monolith too, then life would be far easier.

Is there a way to split a TFS repository? Surely you no longer need the past 10 years of commit history on a daily basis; archive it or something?

Also how feasible is migrating from TFS to Git?

Very, thanks to git-tfs [1]. My company is in the midst of a move from on-premise TFS/TFVC to cloud-based VSTS/git. We've run into some projects that managed to somehow mangle their branch history to the point that we gave up and only migrated the current checkin to git, but most projects pull out the full history and all branches with no trouble.

The only real complaint I have is speed. Dunno if it's our TFS server or the nature of TFS, but every checkin pulled into git takes anywhere from 5-30 seconds. In large projects with long histories you can easily spend 30+ minutes exporting history before running into a problem that prevents the export from completing.

[1]: https://github.com/git-tfs/git-tfs

> In large projects with long histories you can easily spend 30+ minutes exporting history before running into a problem that prevents the export from completing.

If it makes you feel any better I had to spend two continuous weeks exporting our huge TFS monolith before I managed to have a working git copy.

But yeah, Git-TFS is a godsend.

What I find more surprising is the length they went through to make git work for them.

It would probably have been better if they had assigned a team of highly skilled engineers to design a distributed source control from scratch specifically to solve the problems faced by the windows team.

After all, `git` itself was developed for Linux because Linus was not satisfied with any of the existing solutions.

Why would that have been better?

(Why in the world would that have been better?)

For the same reason that git turned out better than svn for a lot of use cases.

It seems like Windows is not really one of these use cases.

Tens of thousands of developers pushing changes is not the same as the Linux development process, at all. The key is in the word “actively.” Even Perforce couldn’t keep up with that type of load in a similar situation with Google’s monorepo, forcing them to develop a Perforce-compatible alternative.

It’s tough to scale VCS to very large teams, regardless of the software. No need to stand up for git.

Your comment reminds me that Microsoft should really double down on replacing NTFS. Even Apple has moved off HFS+ with their latest OS.

They unfortunately removed the ability to create ReFS disks from Windows 10 Pro and lower. Hopefully they reverse that decision soon. That said I'm not sure how stable it was considered.

I agree that these sentences gloss over the details - and rightly so: it's an article in Ars, after all, not a journal article. We (Microsoft) have published a lot[1] about our journey to put Enterprise-scale codebases, like the Windows codebase, into Git. I think you'll find that they expand upon some of the things that are glossed over.

And I would encourage you to read them, so I won't belabor the details here, but to give you just one example: git blame is not the issue that we're most concerned about. The basic, highest-priority functionality in Git - things like git add - is what we're most concerned about.

git add is O(n) in the number of files in the repository. When you run `add`, git reads the index, modifies the entry (or entries) you're changing, and writes the entire index back out.

The Windows repository has about 3.5 million files[2]. Running `git add` - when we started thinking about moving Windows into Git - took minutes.
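To make that O(n) cost concrete, here's a toy sketch (hypothetical - this is not Git's actual index code, just the shape of the problem) of why touching one entry still means serializing every entry:

```python
import io
import struct

def write_index(entries, changed_path, new_blob):
    """Rewrite the whole index even though only one entry changed."""
    buf = io.BytesIO()
    for path, blob in sorted(entries.items()):   # every entry, every time
        blob = new_blob if path == changed_path else blob
        buf.write(struct.pack(">I", len(path)))  # path length prefix
        buf.write(path.encode())                 # path
        buf.write(blob)                          # 20-byte object id
    return buf.getvalue()

# 100k files: one `add` still serializes all 100k records.
entries = {f"src/file{i}.c": b"\x00" * 20 for i in range(100_000)}
index = write_index(entries, "src/file42.c", b"\x01" * 20)
```

At 3.5 million entries, even a fast per-record cost adds up to minutes, which matches the numbers above.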

Can you imagine running `git add` and having it take 10 minutes?

Now obviously there are some inefficiencies here - there are quadratic operations and the like that went in on the assumption that nobody would ever put 3.5 million files into a Git repository. And we've been cutting those out over time[3].

Thankfully, Git does have some functionality to support very large repositories - using shallow clones, sparse checkouts and the like. We've added narrow clones, to only download portions of the tree, and automation to handle this automatically without user intervention.
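For reference, the stock-Git mechanisms mentioned above look roughly like this (illustrative commands against a hypothetical repo URL; GVFS automates the equivalent without any manual steps):

```shell
# Shallow clone: only the latest commit, no deep history.
git clone --depth 1 --no-checkout https://example.com/big/repo.git repo
cd repo

# Sparse checkout: materialize only part of the tree.
git config core.sparseCheckout true
echo "src/tools/*" > .git/info/sparse-checkout
git read-tree -mu HEAD   # populate the working tree per the patterns
```

Narrow clone (not downloading the unneeded objects at all, rather than just not checking them out) is the piece stock Git lacked and GVFS adds.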

That's the scaling work that we're doing with GVFS. And these changes bring the P80 time for git add down to 2.5 seconds[4]. We've been contributing these changes back to git itself, and we're thrilled to work with industry partners like GitHub who are also interested.

Sorry to go on - version control is a passion of mine, as it is for many of us working on Visual Studio Team Services. Your conclusion is very much correct: Git wasn't designed for this from the beginning. Thankfully, software isn't immutable, so we're scaling it up.

[1]: http://gvfs.io/

[2]: https://blogs.msdn.microsoft.com/bharry/2017/05/24/the-large...

[3]: https://blogs.msdn.microsoft.com/devops/2017/05/30/optimizin...

[4]: https://blogs.msdn.microsoft.com/devops/2017/11/15/updates-t...

The article was modified to "— more than 3,000 actively working on the codebase".

Interesting to see them try to fix Git when Facebook considered it (due to similar constraints) and decided to go with Hg instead: https://code.facebook.com/posts/218678814984400/scaling-merc...

Microsoft PM here. It's important to note that Hg also had similar constraints. So Facebook was in a position of having to fix Git or Hg to suit their needs. It turns out that for a variety of factors, Hg was the better choice for them to hack on.

Microsoft took the same look and went the opposite direction, and decided that Git was easier for us to hack on. This was a no-brainer for us. We build multiple tools that host people's Git repositories (TFS and VSTS) and have Git repositories as deployment mechanisms for Azure. We have several contributors to tools in the Git ecosystem on staff, including people contributing to core git, and the maintainer for Git for Windows, libgit2 and LibGit2Sharp. But we have comparatively little institutional knowledge of Hg.

That post you linked is awesome. Facebook has done a lot of really impressive work scaling Hg and a lot of the lessons learned at Facebook and implementations in Hg are very similar to what we've done with Git.

This video from Durham Goode on Facebook's version control team (from Git Merge, amusingly) is also awesome: https://m.youtube.com/watch?v=gOVD-DrUpwQ

It's really good to see Microsoft join in the open source community. I switched to Mac over a decade ago largely because of the Microsoft only mentality back then. Really makes me consider switching back.

They decided to go with mercurial because it was possible to monkey patch its guts from extensions written in python. They're now talking about implementing parts in rust, which, ironically, would have prevented them from doing what they originally chose mercurial for.

> They decided to go with mercurial because it was possible to monkey patch

The ability to extend, swap and replace deep into the stack was always a core component of Mercurial, hence having to enable extensions for things many users would consider core features e.g. terminal colors or graph log

Facebook didn't hack around like monkeys, they built extensions. And when they could not do it as extensions, they upstreamed improvements to the core.

The alternative would have been to fork the codebase entirely.

Monkey patching the internals would have been significantly less maintainable: having used it as a library, I can tell you that none of the internal stuff is considered stable, and pretty major components will change between point releases. (I was using the diff-parsing and patch-application routines for something else, and the API changed with basically every minor release. Forking would at least give you a heads-up conflict when upstream changed; monkey-patching would either blow up at runtime or silently not go through the patch anymore.)

> They're now talking about implementing parts in rust, which, ironically, would have prevented them from doing what they originally chose mercurial for.


Microsoft is used to implementing its own filesystems, so it probably feels more natural for them to do that than it does for Facebook.

Uhm... the author seems to be confused about the relationship between Github and git.

GitHub doesn't develop git; it's a service built on top of git. Any modifications to git have nothing to do with GitHub. GFVS is a GitHub project, not part of git.

> GitHub doesn't develop git

I'd be surprised if none of their employees have contributed to Git. I didn't interpret anything in the article as saying that GitHub develops Git, in its entirety (or even largely).

> GFVS is a GitHub project

GFVS is a Microsoft project and GitHub seems to be contributing. And the GFVS developers have been submitting, successfully, their changes to Git itself upstream to the actual Git project. So it is becoming a part of Git.

> I'd be surprised if none of their employees have contributed to Git.

In git.git, there are 7 commits with @github.com authors, last one from 2014.

(I hope it just means GitHub folks are trying to blend into the crowd by using personal addresses, or something...)
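For anyone who wants to check, a count like that can be reproduced along these lines (run inside a clone of git.git; `%aE` prints each commit's author email):

```shell
# Count commits whose author email is in the github.com domain.
git log --format='%aE' | grep -c '@github\.com$'
```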

I don't know what addresses they use but github engineering blog has a lot of examples of really useful contributions, e.g.:

> Shortly after our initial deploy, we also started the process of upstreaming the changes to Git so the whole community could benefit from them.

Source: https://githubengineering.com/counting-objects/

But I guess most of their git related work time goes to libgit2.

> I'd be surprised if none of their employees have contributed to Git.

Jeff King (peff) has worked for GitHub for a long time:


Michael Haggerty (mhagger) is also a GitHub employee.

So, yes.

> I didn't interpret anything in the article as saying that GitHub develops Git

"Microsoft [...] wanted to get these modifications accepted upstream and integrated into the standard Git client.

That plan appears to be going well. Yesterday, the company announced that GitHub was adopting its modifications and that the two would be working together to bring suitable clients to macOS and Linux."

This hints that Git upstream = GitHub. I mean, why mention GitHub at all if they aren't upstream? The rest of the article doesn't explain GitHub's role in this story either.

I still don't think your interpretation is warranted, but meh.

Microsoft made some contributions and they're working to get them accepted upstream. They're maintaining a fork basically.

GitHub is adopting their modifications, i.e. GitHub is running Microsoft's fork.

Microsoft's modifications are necessary for them, particularly so they can use Git with the giant Windows repo. Presumably, GitHub also has a need or desire for those same or similar modifications. IIRC, some of GitHub's largest customers need or want a version of Git that can also handle large repos, or repos with large files or large numbers of files.

Maybe quickly changing topics is what causes the confusion. I guess they picked github because it has enterprise offering and enterprise customers are interested in large git repos. So github is like a test lab for GVFS. In the mean time they upstream the changes to real git adjusting GVFS to what git maintainers think is right.

Small comment: it's GVFS, not GFVS. File System - the FS always goes at the back.

GitHub may not be the primary contributor to git, but they're certainly a significant user of it, and they _have_ contributed patches to upstream git.

(example, the '768 conflict' bug here: https://githubengineering.com/move-fast/)

For the avoidance of doubt, GitHub is not upstream of Git.

Yeah, this was really confusing in the article: everywhere it mentions upstreaming it refers to "git" only, then later it mentions how GitHub is accepting its modifications... so I'm not entirely sure which it is.

TL;DR pretty sure it's not touching git source at all.

Looking through the GVFS readme: https://github.com/Microsoft/GVFS it appears that it actually wraps git, which would make sense, because in that case they would have to explicitly add support to each git host and provide individual platform ports of GVFS (all of which would be unnecessary if it were actually upstreamed into git).

So that's good... git is one of my favourite open source tools and I would really hate for M$ to start polluting it. I don't care if GVFS is a good idea or not I just don't trust the fuckers and they will always deserve that suspicion.

No, they needed to do some modifications to the git suite of tools. Generally git expects all objects to be on disk and Microsoft wanted to have sparse checkouts of files in a specific revision.

Not really polluting but rather having some objects be fetched only on demand.

Source: https://blogs.msdn.microsoft.com/devops/2017/02/03/announcin...

> In addition to the GVFS sources, we’ve also made some changes to Git to allow it to work well on a GVFS-backed repo, and those sources are available at https://github.com/Microsoft/git.

For the record, as far as I understand GVFS, the article is correctly distinguishing git vs GitHub.

Thanks for the clarification, sparse object fetch seems like a small change in concept at least.

Did I misunderstand the idea, or is GVFS essentially making git a non-distributed VCS?

I think once you're checked out, you're still distributed as far as being able to make and push commits which modify the files you have. But obviously someone can't clone files from you which you don't yourself have.

See previous articles with discussions:



From reading the first post "The largest Git repo on the planet" - "Windows code base" of "3.5M files", "repo of about 300GB", "4,000 engineers" - I assumed that the "Windows code base" contains all the utilities and DLLs included in a Windows release, such as notepad, games, IE/Edge etc., and not just the Windows KERNEL.

If so, the comparable code base to check against is Android AOSP, Ubuntu, RedHat or FreeBSD.

If that is true, I believe the source code base for an Ubuntu/RedHat distribution with all the apps is likely bigger compared to Windows in terms of number of files, source repo size and number of engineers (open source developers for all the packages such as ff, chrome, openoffice).

Microsoft folks feel free to correct me here.

It seems that the existing Git process and dev model already work well for much bigger projects by using a different git repo for each app.

Still not sure what pain point the new GVFS solves...

AOSP is probably the closest comparison, Fedora/RHEL and Ubuntu don't keep application source code checked into git. Fedora specifically uses dist-git, there's a git repo per package with the spec file, patches and other files needed for the build and then source tarballs are pushed to a web server where they can be downloaded later with the dist-git tooling.

So yeah, all of the code and data actually stored in source control for various Linux distributions is pretty small.

Linux distributions are a different beast since they're mostly tracking a myriad of upstream third-party repositories overlaid when necessary by their own patch sets. And most of those upstream repos are by necessity highly decoupled from any particular distribution. Different distributions have different solutions to how they manage upstreams with local changes. E.g. OpenEmbedded uses BitBake where (similar to FreeBSD Ports and Gentoo Portage) the upstream can be pretty much anything, including just a tarball over HTTP, and local changes are captured in patch files, while others instead use one-to-one tracking repos where local changes are represented by version control revisions.

AOSP is closer to what you're imagining, but I haven't met anyone who thinks Repo (Android's meta-repo layer) and Gerrit (their Repo-aware code review and merge queue tool) are pleasant to work with. E.g. it takes forever and a day to do a Repo sync on a fresh machine. A demand-synced VFS would be very nice for AOSP development, even though it's not a monorepo but a polyrepo where Repo ties everything together.

It's more that it provides more control on the centralized<->distributed spectrum. Git by default, yes, is fully distributed with every copy of a repo supposed to maintain full history, etc. GVFS allows you the option to offload some of that storage effort to servers you specify. Those servers can be distributed themselves (similar to CDNs, etc), so there's still distribution flexibility there.

You can think of it as giving you somewhat flexible control of the "torrenter/seeder ratio" in BitTorrent: how many complete copies of the repo are available/accessible at a given time.

I've been thinking of it as more of making it on-demand — it doesn't sound like you lose the underlying distributed nature (e.g. from the description on https://blogs.msdn.microsoft.com/bharry/2017/02/03/scaling-g... it sounds like 'git push ssh://github.com/microsoft/windows.git' would work, although it might take longer to run than it does for you to be fired) but managed clients can choose to operate in a mode where they download everything on-demand rather than immediately.

That seems like a reasonable compromise for a workload which Git is otherwise completely unable to handle.

It actually takes a bit more than 24 hours... so yeah

It’s just a pragmatic recognition that most files are going to stay unchanged relative to some “master” repository anyway. It maintains the same distributed semantics as vanilla git, but allows you to use less disk space if you choose to rely on the availability of that master repository.

It's worth asking: how many Github users are actually using git as a fully distributed version control system? The typical Github workflow is to treat the Github repo as a preferred upstream--which sort of centralizes things.

I think a common open source workflow is to have a forked GitHub repo and working off of this.

So in the end you have your local repo, your fork, and the organization repo. During a pull request process third parties might make pull requests to your fork to try and fix things

   The typical Github workflow is to treat the Github repo as a preferred upstream
Is it typical? What metrics do you have to support that?

Speaking for myself I have only ever used triangular workflows: fork upstream; set local remotes to own fork; push to own fork; issue pull request; profit
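That triangular setup can be wired into git config so a plain `git push` goes to your fork while fetches keep tracking upstream (hypothetical URLs; `remote.pushDefault` and `push.default` are standard Git settings):

```shell
git clone https://github.com/someorg/project.git
cd project
git remote add fork https://github.com/you/project.git
git config remote.pushDefault fork   # bare `git push` targets your fork
git config push.default current      # push the branch you're on, by name
```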


The main repo is upstream of you regardless of whether it is labeled as such in `git remote -v`. If it goes offline, nobody systematically falls back on your fork. This makes the system effectively centralized, which is parent's point.

To elaborate in more (possibly excruciating) detail:

It's very rare for anyone on GitHub to do the sort of tiered collaboration that the Linux kernel uses. If, say, I want to contribute to VSCode, pretty much the only way to get my changes upstream is to submit pull requests directly to github.com/Microsoft/vscode.

Compare to the tiered approach, where I notice someone is an active and "trusted" contributor, so I submit my changes to their fork, they accept them, and then those changes eventually make their way into the canonical repo at some future merge point. That's virtually unheard of on GitHub, but it's the way the Linux kernel works.

Pretty much the only way you could get away with something even remotely similar and not have people look at you funny in the GitHub community is maybe if you stalked someone's fork, noticed they were working on a certain feature, then noticed there was some bug or deficiency in their topic branch, then they wake up in the morning to a request for review from you regarding the fix. Even that, which would be very unusual, would really only work in a limited set of cases where you're collaborating on something they've already undertaken—there's not really a clean way in the social climate surrounding GitHub to submit your own, unrelated work to that person (e.g., because it's something you think they'd be interested in), get them to pull from you, and then get upstream to eventually pull from that person.

That's detailed, but not excruciating. Thank you.

Good clarification. But, do we know that is what everyone does? It's obviously a cultural, rather than a technical limitation. I suspect there are significant bodies of code kept inside corporate forks of upstream (and regularly rebased to them) with only selected parts dribbled out to the public upstream repos by trusted representatives of said copies. But, I have nothing to prove that and the only public traces I would see would be commits to upstream from corporation X.

Depends on what you mean by distributed. If a repo contains 2 projects pA and pB and I’m only involved in the former while the other project has a 2GB binary asset with 1000 historical revisions, then I’m happy to just be distributed wrt my part of the repo.

To put it another way: A server with multiple repos on it is a central server with respect to the repos you don’t clone! This is the same, but for parts of repos.

To me it looked like one of the points of Git was that one could work on anything from the entire repository locally, i.e., for example, without having an internet connection, e.g. when on an airplane. (I think having one huge repository is the problem here.)

Well, Git was indeed created to serve Linux kernel development. So decentralized access for the people writing patches all over the world was useful. However, when companies started using source control for their monorepos which developers access through a work network or VPN decentralization takes a backseat to business concerns. Different use cases and all that.

I think the problem there is that you then basically wind up with Git on the backend and an SVN-like layer over the front. Which seems a bit silly. SVN is a very robust project that works really well. Why not add git-like porcelain to SVN, if you're going to force devs to have a network connection so they don't have to download a giant git working directory?

Smart people are working on all this, so I'm sure there are reasons, but in all the instances where I've had to interact with a monorepo, it was because the tech debt was too high to pay off in order to break it apart, not because it was better.

And if you're indebted to the point where there's no prospect of paying it off, you damn well better have leveraged assets against it (i.e. a cash cow of a business, like MS Word or Facebook).

That's interesting. I've never been an insider and it was several years ago, but Google spoke in pretty glowing terms of their monorepo when I heard them talk about it. That's neither here nor there though, not my personal preference either to be fair.

As far as Git's killer feature I think decentralization is only one part of it. Being able to deal with the change graph directly is sometimes handy. And knowing the basis of the model, I find Git pretty easy to understand even when history gets fuzzy. Not sure but I think Darcs might be even better for that, just without the mindshare.

But there are source control systems designed around that centralized use case. Lots of them. One of which Microsoft owns.

Why spend so much effort bashing a distributed peg into a centralized hole?

See my response to the sibling comment. Git's decentralization isn't its only upside.

Sounds a bit similar to Google's Piper (https://cacm.acm.org/magazines/2016/7/204032-why-google-stor...) that lets you check out a portion of the monorepo, no?

"The biggest complexity is that Git has a very conservative approach to compatibility, requiring that repositories remain compatible across versions."

This, above all else. That neither I, nor anyone else on our team, ever has to worry about whether they are running an "older version" of git is such a win.

It's not really a surprise that monorepo advocates struggle here. Compatibility through versioning and stable APIs is challenging and requires discipline.

Making wide-spread changes to all clients of your code when you make changes that break things doesn't require discipline.

Big problem. Just update your machines. Am I supposed to miss great features because of your admin laziness?

It has to fail gracefully on version mismatch, other than that it's just a bad decision.

Hmm, some part of me wouldn't mind a FUSE setup where /repository/github.com/<org>/<repo>/ is automatically cloned out from github when accessed the first time :)

It would be pretty cool to use as a $GOPATH. Or just for making drive-by github contributions super easy.

I built https://github.com/zimbatm/h to satisfy the same need but it's a command-line tool, not a FUSE filesystem. In some way I prefer it because transparent network access doesn't work really well with failure modes.

Usage: `h <owner>/<repo>` clones the repo, if it doesn't exist, and takes you there.

This and hub are integral to my developer workflow:

    $ h some/repo
    $ git checkout -b some-branch
    [hack and commit]
    $ hub fork
    $ git push -u zimbatm
    $ hub pull-request

It's all fun and games until someone like me tries to run `find` on the parent directory...

Children would be lazily loaded and would not appear in directory listing of the parent directory. `find` would search your existing local checkouts only and the file system would still be POSIX.

How do you trigger loading the projects if there isn’t a pointer to them showing up in the file system?

When you open or stat a directory that doesn't exist, try to checkout that path, if that fails, return an ENOENT error?
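A sketch of that lookup logic (hypothetical names throughout; the actual FUSE plumbing and the `git clone` invocation are stubbed out behind `run_clone`):

```python
from pathlib import PurePosixPath

MOUNT = PurePosixPath("/repository")

def resolve(path, cloned, run_clone):
    """Map /repository/<host>/<org>/<repo>/... to a clone URL,
    cloning on first access; return None (-> ENOENT) when the
    path is too shallow to name a repo."""
    rel = PurePosixPath(path).relative_to(MOUNT)
    if len(rel.parts) < 3:
        return None                      # not a repo path yet
    host, org, repo = rel.parts[:3]
    url = f"https://{host}/{org}/{repo}.git"
    if url not in cloned:                # first stat()/open(): clone lazily
        run_clone(url)
        cloned.add(url)
    return url

calls = []
cloned = set()
resolve("/repository/github.com/zimbatm/h/README.md", cloned, calls.append)
resolve("/repository/github.com/zimbatm/h", cloned, calls.append)
```

Here two accesses to the same repo trigger only one clone; a second repo would trigger its own.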

A couple of thoughts, given that the implementation downloads mainly placeholders:

that mechanism can also be combined with permissions on the server side so contractor X who is fixing a driver need only have access to the driver and some supporting code while still checking into the monorepo.

It also means that every developer’s laptop won’t have a full copy of the repo (so if said laptop is lost, the risk is more contained)

And it’s similar to part of LFS, bigfiles and other “blob” add-ons to git... could see in future a way to put .doc files and the like into repos.

> The open source, free GitHub hosting doesn't need the scaling work Microsoft has done

They might want to fix that sentence. It reads as saying that Github is open source. Perhaps it should read

> The free GitHub hosting that is typically used by open source projects doesn't need the scaling work Microsoft has done

> Microsoft and GitHub are also working to bring similar capabilities to other platforms, with macOS coming first, and later Linux. The obvious way to do this on both systems is to use FUSE, an infrastructure for building file systems that run in user mode rather than kernel mode (desirable because user-mode development is easier and safer than kernel mode). However, the companies have discovered that FUSE isn't fast enough for this—a lesson Dropbox also learned when developing a similar capability, Project Infinite. Currently, the companies believe that tapping into a macOS extensibility mechanism called Kauth (or KAuth) will be the best way forward.

I'd love to read more technical details on this. How can Kauth support a virtual filesystem?

It seems that lazy-downloading files isn't all that useful if you need to compile all the source code. Does anyone know which build tools can be used to avoid having to open most of the source files - maybe some centralized build cache?

This is kinda funny, it undoes one of the original pillars of why git was made

Does anyone else think this is another "Embrace, extend, and extinguish" move from MS? I don't know enough about the changes make that judgement.

The title is terrible. Here is an excerpt from the article that explains what's going on: "The company's solution was to develop Git Virtual File System (GVFS). With GVFS, a local replica of a Git repository is virtualized such that it contains metadata and only the source code files that have been explicitly retrieved. By eliminating the need to replicate every file (and, hence, check every file for modifications), both the disk footprint of the repository and the speed of working with it were greatly improved. Microsoft modified Git to handle this virtual file system. The client was altered so that it didn't needlessly try to access files that weren't available locally and a new transfer protocol was added for selectively retrieving individual files from a remote repository."

This style of working is needed in large code bases, where not all files are checked out on developer workstations (for performance or privacy reasons).

If you missed the earlier discussion, with details about how the technology actually works (rather than the xplat angle): https://news.ycombinator.com/item?id=13559662 linked Q&A https://www.reddit.com/r/programming/comments/5rtlk0/git_vir...

An open-source cross-platform virtual file system API that was also fast (as opposed to FUSE) would be amazing.

It would indeed, although FUSE has the virtue of simplicity, and it would be nice not to sacrifice that.

isn't this similar to what was done by rational (the company that ibm bought iirc) as part of their clearcase product? my experience with that was not so great unfortunately. it seemed very clunky to use and i think required an army of system-administration folks to run.

Naming things is among the hardest things for programmers... what would you title it?

And good naming is useful to others. So I think we shouldn't discourage useful debates on how article titles could be improved. That helps us improve our naming skills.

That was partially a reference to a popular programming joke (ex: https://twitter.com/codinghorror/status/506010907021828096?l...), but I would think that asking for an example of a better title is the opposite of discouraging debate. Apologies if that seemed sarcastic; it was not meant that way.

Wait... so they turned git into a centralized abomination and added access control to parts of the source tree (a feature that SVN had from the start but decentralized VCS cannot provide)? Is this correct?

What do you think is more likely?

(a) the result of the work of a major Git player like GitHub, and a major software company like Microsoft, with dozens of excellent engineers devoted to it, and that solves a real pain point they have, is a centralized abomination that merely replicates a feature SVN already had

(b) your description is a crude knee jerk reaction


In a similar rhetorical style:

If SVN is a terrible application with no merits to large developers like MS, why did it exist in the first place?

If the features SVN brings over git are not considered detrimental by Torvalds, why did he create his own VCS with them removed, and why hasn't he added them back in?

>If SVN is a terrible application with no merits to large developers like MS, why did it exist in the first place?

Where's the contradiction? All kinds of terrible apps exist. Terribleness and existence are not mutually exclusive qualities.

(Assuming SVN is terrible, of course, which I didn't say. I'd say SVN was an attempt to go beyond CVS, with some shortcomings that don't make it the best available option today).

>If the features SVN brings over git are not considered detrimental by Torvalds, why did he create his own VCS with them removed, and why hasn't he added them back in?

Lots of possible answers (given the assumption in your "if"):

E.g. he might not consider them detrimental for other people and use cases, but he doesn't need them for his use case (Linux kernel development) either.

Or he thinks that while they might be good, they complicate things too much, and he prefers a more minimum feature set.

Iirc google and facebook decided against using git because it didn't scale to stupidly enormous code bases.

Microsoft decided to stick with git but add non-strict fetching to it. On the plus side, you still get all the advantages that distributed VCSs bring - easy branching/merging, and working offline as long as you've already touched the required files. But you still need to be connected to work with parts of the code base you haven't used before.

So I guess if you run the test suite for your current task first then you can work offline since all relevant files will be fetched?

> But you still need to be connected to work with parts of the code base you haven't used before.

Which seems like a fair tradeoff to me. The existing "native solution" for a very very large codebase would be to have it split into multiple, logical repositories. If you fetched one repo you needed to work on, but not all the dependencies or sibling repos, you still wouldn't be able to work on those other parts of the codebase until you connected.

>Iirc google and facebook decided against using git because it didn't scale to stupidly enormous code bases.

Compared to Mercurial, which Facebook uses?

And Facebook added a lot to Mercurial so that it would scale to their stupidly enormous code base.


Meanwhile OpenJDK (which is another project composed of separate but interconnected components) wrote a mercurial extension to manage multiple repositories easily (http://openjdk.java.net/projects/code-tools/trees/). I think this is a much better solution than a monorepo: you can still see the "whole codebase" as if it were a single giant repository, but you don't have to deal with the scaling headaches in quite the same way.

I was pretty surprised, when I made my own build of OpenJDK 9, by how easy it was to work with: `hg tclone blah`, `hg tup jdk9.0.1+11`, `./configure blah; make images` and done. Even if git submodules were closer in functionality (checking out the same tag across multiple modules at once with ease), the song-and-dance of actually downloading the modules after cloning is annoying.

It's still distributed, just not entirely 100% distributed. It's like the difference between which IPFS nodes have which items pinned, or how many seeders versus downloaders you have in BitTorrent (and the resulting file availability in that swarm).

Are these things really comparable? I am not so sure. My understanding of GVFS is that it is hierarchical, not P2P.

In the reference implementation from Microsoft it most closely resembles a CDN: you give it a list of servers to back the git database when it needs to look up an object. You distribute those servers as you see need to, based on very similar logic to CDN distribution. For instance, you probably want at least one available per office to keep content close to the users that need it.

Even if a CDN is more "hierarchical" rather than P2P, it's still distributed, it's just distributed on a different axis than you are perhaps expecting.

Furthermore, to a very large extent that's an implementation detail. The GVFS protocol itself [1] is a very simple REST API, and there is absolutely nothing stopping you from building a GVFS "server" that is literally backed by IPFS or BitTorrent or some other P2P file system.

[1] https://github.com/Microsoft/gvfs/blob/master/Protocol.md
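To give a sense of how simple it is, here's a minimal sketch of building requests for two of the protocol's endpoints. The cache server URL and object ID are hypothetical; the endpoint shapes follow the linked Protocol.md, so check it for the authoritative details:

```python
import json

# Hypothetical GVFS cache server (you'd point this at a node close to you)
CACHE_SERVER = "https://gvfs-cache.example.com"

def single_object_url(object_id):
    """URL to fetch one loose object on demand (GET /gvfs/objects/{objectId})."""
    return f"{CACHE_SERVER}/gvfs/objects/{object_id}"

def batch_objects_body(object_ids, commit_depth=1):
    """JSON body for the batched POST /gvfs/objects endpoint; commitDepth
    controls how many ancestor commits come back for commit objects."""
    return json.dumps({"commitDepth": commit_depth,
                       "objectIds": list(object_ids)})

oid = "0123456789abcdef0123456789abcdef01234567"
print(single_object_url(oid))
print(batch_objects_body([oid]))
```

The point being: any server that can answer "give me this object by hash" over HTTP can stand in as a GVFS backend, which is why the CDN analogy fits.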

Man, I get sick of the incessant title criticism on HN. I wish the quality of the title weren't such a go-to topic. Unless the title grievously misrepresents the article, let's just discuss the topic.

Yes the latest crowd has created some memes for themselves. I agree we need to extinguish them for the sake of continued quality HN discussion. Including:

- X software package name reminds/confuses me of Y product with similar name

- X title is garbage. Here is the title I would write...

- I don't understand [basic concept available on Wikipedia]. Someone explain it to me (aka lazyweb)

- I'm not an expert on this, but [completely unqualified and uncited conjecture]

- I also once did X from TFA and [unrelated personal anecdote with no insight] (aka long form "me too" comment)

I come to HN so that I can read informed discussion from people who work in the field of TFA. To read discussion of basic topics between the uninformed, there is everywhere else on the internet.

That meme of "I don't understand [basic concept from wikipedia, googleable phrase, dictionary word], can you explain it?" has infected the hell out of our company Slack. It's considered very rude to not answer.

I feel like when I was coming up, where IRC or mailing lists were a thing, having done all possible research before asking humans was an absolute cultural requirement.

It sucked at first, but I definitely miss that and long for the old days. It was so much more respectful and efficient.

I mean, unlike a lot of aggregation sites, HN has a specific policy about link titles, which is generally sensible and does the right thing. (https://news.ycombinator.com/newsguidelines.html)

Sure, in some cases, implementation of this policy fails, and something is rewritten that shouldn't be, or something should be rewritten that isn't. But, I'd estimate maybe 80% of all discussions about titles, that I've seen, have been valid concerns, and typically resulted in a rewrite.

I have long been familiar with HN's policies. OP did not suggest a new title, they instead posted a long quote from TFA and added an oversimplified summary -- which I'm not sure adds to quality discussion. Articles which are only clickbait should be flagged. Correcting the title on a good article is different from complaining.

I sometimes wonder if it isn't almost literal bike-shedding: I have nothing to contribute on the topic at hand, but I'm compelled to contribute, so I'll go on about the title. Obviously there are some egregious examples that are correctly called out; in this case, yeah, I saw "the title is terrible" and quit reading.

Other similar types of responses:

- this project/library/company name is terrible

- this website hijacks my scroll bar

- this website doesn’t render well in mobile

- I can’t read this site because it’s too narrow

- I can’t read this site because it’s too wide

The last four I’m willing to forgive as a warning to the large demographic of HN: in case any of you other yahoos think this is a good idea, it’s not.

Maybe we can make this the go to reply on these comments.

But then we all miss out on the title criticism criticisms.

There is a [-] that you can click to minimize any thread you do not wish to read.

"Just delete it if you don't want to read it", said every contributor and defender of low-quality content ever. I've been hearing this since Usenet was popular, and yet the pattern doesn't change. Can you picture Donald Knuth typing, "just delete it if you don't want to read what I write"? No, because he doesn't have to.

Seems like Microsoft is trying to brand a cached proxy for Git...? This would make more sense as a paid service for those that need it. Possibly, a local appliance that a company would host internally. What am I missing...?

It's a virtualization layer for the Git object database (all the hash-named files in a repository's .git folder) intended for sparsely checking out large git repositories (lots of commit history, huge trees of files, etc), and only downloading the objects you need as you need them.

It defines a "CDN protocol" for downloading those objects as needed (which Bitbucket and GitHub are both supporting in various alpha/beta stages), which is essentially a cache offered as a paid service to big enterprise projects, but the GVFS project also has to make sure that git operates as efficiently as possible with sparse object databases, and implements how those sparse object databases work at all (which to this point was not something git concerned itself with, and partly why the work is being done as a filesystem proxy using placeholder files on the user's machine).

The project has included work in making sure that git commands touch as few objects from the object database as they can to get their work done (minimizing downloads from a remote server).

And they are still calling it GVFS -- a name that is already used by a prominent Linux technology. https://github.com/Microsoft/GVFS/issues/7

Every single hacker news discussion these days seems to have a complaint about reusing a name.

Every name is taken. There is nothing to be done about it. It's okay. I mean, it's sad, and terrible, but it's also fine.

They are definitely not all taken

Anyone else get superconfused and think about the GNOME Virtual Filesystem?

Having worked on GVFS (the GNOME one) in the past... every single time.

Maybe GNOME should start trademarking project names too :/...

Moving Windows development to git sounds like a completely irrational decision from a technical standpoint that was driven mostly by marketing concerns.

Could you please try to comment more substantively? We've already asked you this.


> We've already asked you this.

Am I on some kind of a black/gray list? Either that, or you specifically remember my name, which sounds odd.

I worked at Microsoft before and after my team (not Windows) did the Source Depot -> git transition, and I can say that it was mostly for the better. Source Depot was pretty good as non-distributed VCSs go, but it doesn't hold a candle to git. Some specifics:

- Branching in SD sucked, the only way to work on multiple features at once was to either have multiple copies of the repo or constantly fiddle with "package files" that contained your changes, and the whole thing shit the bed if the server went down.

- The new hires, both from college and from industry, all know git, and an increasing number of them found it bizarre that Microsoft was still on a proprietary, non-distributed system.

- Higher-ups (correctly, IMO) decided that VSTS/TFS had to have git in order to remain competitive, and that we should be eating our own dogfood. That's partly a marketing concern but also a legitimate technical decision.

Marketing concerns by whom? The nonexistent company behind git? GitHub, when MS doesn't use GitHub? Or do you think MS is going to be able to drum up some business just because they use git? Please elaborate!

While I don't agree with OP... you could make a case that switching to Git is a way for Microsoft to make itself more fashionable to the engineering community at large.

The marketing here is: it's cool to work at Microsoft again!

> you could make a case that switching to Git is a way for Microsoft to make itself more fashionable to the engineering community at large. The marketing here is: it's cool to work at Microsoft again!

But that makes it a technical decision! If it's easier to attract talented engineers because you're using git, that improves the product. I was trying to make this point in my other comment too: OP is drawing a sharp line between "marketing decisions" and "engineering decisions" and sneering at the former, but actually the boundary can be pretty fuzzy.

What did they use before and why is it better?

On what do you base that assumption?

Umm, why?

Because git is the latest fashion rage and by itself incapable of managing the size of the Windows code base. This is the ideal combination of reasons for corporate PHB to decide to use it.

Exhibit A: GVFS was solely invented as a hack to make git usable by the Windows team according to the Windows team blog.

(Yes, there is some sour grapes and sarcasm in this post)

Does GVFS support fine-grained access control to parts of the repo, or are all users guaranteed access to the whole repository if they want it?

I wonder if this is a precursor to open-sourcing Windows development...

No. Open Source and walled gardens don't mix. MS wants to turn Windows into the latter, obviously.

Does this mean we have to fdisk/format if we forget the commands to correct a broken git repo?


I seriously did that when first learning Git. And there are plenty of niches and side cases that I'm still not quite sure of. Going from client-server to distributed has a level of complexity that usually isn't discussed until you implement.

EDIT: My further understanding is that this provides a Git filesystem-based connection so that one can work on a multi-TB repo without downloading everything locally.

This seems to be the result of choosing to have all the software in one OMG-sized repo, rather than one project per repo. And evidently they need a "keep on server" for this. Makes me wonder more why they even went with Git or a distributed model at all. This seems more like they screwed up and now have tons of band-aids.

Look up the history of GVFS and why Microsoft chose Git. There are interesting reads about it.

There are also lots of interesting reads about 'monorepos' and their pros and cons. Note, though, that Google and Facebook, and lots of other companies, use monorepos. It's not just Microsoft, and, not surprisingly, they've all made (more or less) reasoned, thoughtful decisions, taking into account lots of factors that almost no one else in the world would ever think to.

Yep, I've done that.

This is only tangentially related, but is there any standard advice to get git on a remote file system (ceph) to run fast with a long history?

It's a little surprising to me that these companies use Git at all. What's the appeal of Git really? Is it just the "familiarity factor", and is that really a reason to pour so many resources into it?

I think almost all the people they hire nowadays would be familiar with git and not something else. So it would save a lot of training time to switch to git instead of training every newbie on your own peculiar source control mechanism.

Sure, but the CLI is the CLI. I can't imagine there's no way to replicate the Git CLI without rewriting the internals. Surely the basics of branch-and-merge will be the same, right? I doubt anyone would miss the dangerously overloaded `checkout` subcommand, or wrestling with rebase conflicts.

Personally, I imagine low-cost in-place branching is useful to most people doing serious work...?

I can't help thinking there is more to this than is discussed in the article. This tech can also be used to give developers selective access to a git repo. It lets Microsoft and other commercial software vendors get on the git bandwagon without actually sharing all the code with all the developers. Each git repo is a complete copy because git was designed to support Open Source, and in Open Source there are no hidden parts. But that's a hard pill for corps like Microsoft to swallow. This looks more like a solution to meet developer demands to use git while also satisfying management's demand to maintain control of their team's code base.

As stated in the article, git is also challenging for game development because of the large asset files. Git has limitations and they are part of the reasons a lot of companies and even indie game devs didn't jump into it (Jonathan Blow, John Carmack, being some of the critics). It's more about speed, simplicity, and efficiency rather than "Corp" vs "Open Source" conspiracy.

Git's issues with game dev aren't just because of large file sizes, though. Binary files are hard to manage in a DVCS because if two people spend 24 hours modifying the same binary asset at the same time, you can't just merge the changes together like you can with source code. Someone has to manually go in and replicate both changes together.

GVFS is more motivated by Microsoft wanting to use git as a monorepo. I think they have around 2k Windows devs on one git tree with this VFS.

Are there advantages to using a monorepo rather than having a proper package manager? Seems like the problems they're trying to solve would be better solved by not insisting on using a monorepo.

While this debate keeps being hashed out, it's important to note that some of the largest and most successful tech companies have decided to use monorepos. Facebook, Google, Microsoft, and others use monorepos.

I suspect one factor for Microsoft is that many patches will be cross-cutting, and tooling with multiple repositories or even submodules remains poor.

I think it's dangerous to assume a causal link between success and their use of monorepos. As aaronAgain points out, Microsoft's codebase existed since before monoliths were recognised to be a bad idea. They're already screwed, and need a solution for where they are today. That's a valid and important use-case, but doesn't suggest that someone starting from scratch should follow their lead.

It will take years to break apart some of the code into packages. In the meantime Git has to work on the monolith, so why not contribute back how we did that.

Disclosure: Work at MS

If that's the reason, that's a great use-case for it. I am looking at it from a greenfield perspective, and it seems more profitable to start with my company's codebase split up into packages from the outset, and to focus on building tooling to handle package dependency version management instead of tooling to handle monorepos. The existence of large companies using monorepos creates some cargo-cult pressure on those trying to decide which path to take, so I think it's important to discuss what the goal really is, to determine the best fit for my own use-case.

For what it's worth, I work on a project that is split into multiple repos that we now wish was a monorepo. Internal versioning issues are complex and end up eating a lot of time in our CI/CD workflow while delivering no actual value to the end customer, since it all ships as one discrete unit. Ultimately we have a couple of FTEs who are basically hand-picking commits out of different repos to try to assemble a complete system of versions that integrate with each other. If we were working in one repo and kept breaking interface changes in feature branches, this work would be significantly reduced, and the improved visibility of those cross-cutting changes would probably also motivate a faster turnaround time on them.

I would think that versioning and compatibility issues are the main driver of monorepos - if you are ultimately shipping one product, why get wrapped up in all the labor that can be involved in breaking it down while still being able to pull off working versions to test? Might be a much better decision to just treat it as one giant repo that always stays in a working state.

And finally, as a more soft-factors issue, I think that a monorepo can help to reduce siloization. We sometimes have issues with teams not liking it when people mess with "their" repo, which slows down cross-cutting changes by a huge factor. A monorepo would probably be a powerful factor against this kind of thing.

These are good observations. What about if you're explicitly not shipping all the code to each client, but have individual projects per client? Some of these criticisms still apply, but it seems more natural to have a repository per client/project when you're building bespoke solutions, or customizations from an inventory of shrink-wrapped solutions. [And clients must only see their own code.]

If the problem is the usability of multiple repositories, could this be solved with better tooling? Projects like GVFS suggest that monorepos do not avoid a need for strong tooling.

It's a cultural question like tabs versus spaces as much as anything: there are tools to support both, developers with plenty of opinions on both sides, many of which vary from language to language/environment to environment. Some languages have great package dependency systems, other languages excel when more of the code is more directly accessible.

It's easier for junior developers to deal with monorepos and it takes certain architecture considerations to plan for a strong component model and version management of that. Would you expect to have the right mix of senior-level staff to junior-level staff to handle that? What sort of turnover might you expect?

Furthermore, many monorepos don't happen intentionally; they just grow organically. It's sometimes only in hindsight that you realize that what you thought of as one system, one component, could have been cut into smaller pieces. It's sometimes only in hindsight that you realize that something you thought of as an internal-only API you didn't wish to version and package should have been componentized, versioned, and packaged separately.

On both sides of the monorepo/small-packages spectrum there are continual trade-offs of time versus planning versus skill level, and neither is necessarily the "right" answer, and likely what you end up doing is somewhere in the middle, some combination of both, based as much on pragmatic needs as anything else.

When you say "start your company", I assume a small total repo size. I would say definitely go for a single repo, every time, all the time. Internal version dependencies can be surprisingly hard to handle even with a handful of components/apps.

It is extremely unlikely that your company's code size would become too much to handle in a single repo. And before that, imho it's just premature optimization to split things up.

While it's possible for it to become premature optimization, there is a level of division that hits a sweet spot that should absolutely be done up-front. I think it is not unreasonable to have a repository per project, and have another "Core" repository for code you factor out of individual project repos for reuse. As the core becomes more complex, you are then already set up to handle it.

Likewise, if you don't write at least some documentation up-front, you're much more likely to never get it done at all.

Based on my experiences, I would disagree. The "core" becomes a mess of versions and dependencies - you still need ALL the other repos to be able to change the code/data there to check all dependencies.

Just split with folders within the repo.

Anyways this is my opinion, and I know many wise people that disagree with me. And I've seen companies work well with both multiple and single repo.

I.e. your mileage may vary.

If you separate things out, you have to now manage version compatibility among different components. Which is a massive waste of time when they are going to be shipped as a single unit anyway. Things are indeed being broken up into packages, but they still all live in the same repository.

Microsoft goes one step further and checks the entire build system into the source tree too, compiler and all.

The effect of all this is that you can sync to any known-good version of Windows and just build it without worrying about other dependencies.

I've worked at both Microsoft and Amazon, and they have opposite strategies for integrating large code bases. Microsoft prefers putting everything into one giant project and having different teams touch different parts of it (there are lots of advantages in this, incidentally, especially in terms of servicing). Amazon goes the opposite way, by essentially using package management to keep track of and manage dependencies between a huge number of smaller projects. Amazon's solution works well for them, partly because it matches their service-oriented architecture. But at the same time, it's the only thing of its sort I've ever seen that was actually built to a sufficient level of quality. And it represents a tremendous investment of dev work over more than a decade.

Microsoft is capable of doing that kind of work (and they sort of are with some of the azure stuff they're working on) but it's a non-trivial project that requires a tremendous amount of investment of time and resources. And that's just to get to square one, let alone something that is competitive with their other infrastructure (keep in mind that microsoft also has tremendous investments in build and CI infrastructure). That's a very hard sell, especially from a risk management perspective.

In Microsoft, it really depends on which org you're in, and which product you're working on. Older codebases (Windows, Office) tend to be monorepos. Newer ones are generally less so. And some of the older ones have been migrating slowly via componentization (VS).

It's not about maintaining control -- downloading any one of those files is as easy as running `touch` on them. It's about only checking out which pieces you actually need to build the component you care about. That way I don't need to download the whole IE source to build the NT kernel. (I just checked code into windows yesterday)
