
Microsoft and GitHub team up to take Git virtual file system to macOS, Linux - dmmalam
https://arstechnica.com/gadgets/2017/11/microsoft-and-github-team-up-to-take-git-virtual-file-system-to-macos-linux/
======
tankenmate
"Git wasn't designed for such vast numbers of developers—more than 20,000
actively working on the codebase. Also, Git wasn't designed for a codebase
that was so large, either in terms of the number of files and version history
for each file, or in terms of sheer size, coming in at more than 300GB. When
using standard Git, working with the source repository was unacceptably slow.
Common operations (such as checking which files have been modified) would take
multiple minutes."

These sentences gloss over the details in some parts, and in others are flat
out wrong. Git was designed for tens of thousands of developers (the Linux
kernel), and it was designed for huge numbers of files (large numbers of files
work fine on Linux due to the dentry cache; it sucks on Windows because Windows
doesn't have a cache with the same behaviour as the Linux dentry cache).
Admittedly it is slow for files that are large in size, but it was designed
for source code; what sane developer would have source files that are hundreds
of MB in size?

That said, monorepos with several dozen software projects can be very slow due
to O(n) functions such as git blame and the like. It's true that git wasn't
designed for large numbers of projects in one repo that are only indirectly
shared by library code and the like.

Unfortunately that was not written in the article.

~~~
geofft
> in others it is flat out wrong. Git was designed for tens of thousands of
> developers (the Linux kernel),

This is easy to check, and the article is right. The Linux kernel does not
have tens of thousands of active developers. Over a little more than the last
year:

    titan:~/src/linux geofft$ git log v4.8..v4.14 --format='%aE' | sort -u | wc -l
    4857

If you look at just the most recent release cycle:

    titan:~/src/linux geofft$ git log v4.13..v4.14 --format='%aE' | sort -u | wc -l
    1839

Alternatively, if you look over the last ~year, but at people who have more
than five commits (chosen arbitrarily):

    titan:~/src/linux geofft$ git log v4.8..v4.14 --format='%aE' | sort | uniq -c | awk '$1 > 5' | wc -l
    1814

A codebase with 20,000 people _actively_ working on it is at least an
order of magnitude beyond the Linux kernel. If they're committing patches
every single day, probably more than that. I have, last I checked, four
commits in Linux: two for a summer internship in 2009 and two doc fixes. That
certainly contributes to the scale of Linux the kernel in the sense of "there
are so many people working on it!", but not in the sense of being an "active
developer" of the git repo. I was fine sending in my patches to a mailing list
and hearing back at some arbitrarily later point, not triggering any CI, not
asking anyone else to collaborate on my work, etc. That doesn't work if you
have 20,000 engineers on the same codebase writing code every day.

> _Admittedly it is slow for files that are large in size, but it was designed
> for source code; what sane developer would have source files that are
> hundreds of MB in size?_

Sometimes that's the best way to get things done? The best Debian workflow
I've ever worked with (and I've worked with a _lot_ ) involved actually
committing binary .debs to SVN alongside their source code, because that meant
that the SVN revision number was the single source of truth. There wasn't some
external artifact system, nor was there a risk of picking up the wrong
packages from an apt repo when rebuilding an older SVN revision.

I'm not defending this as _pretty_. But I will defend this as _sane_. It got
the job done reliably and let me work on actually shipping the product and not
yak-shaving workflows.

~~~
kelnos
I think this is all missing the point. Do they want to store, say, the
entirety of the base Windows OS in a single git repo? If that's the case,
sure, git isn't a great fit for their use case, and they should either use
something else, or find a way (as they're doing?) to change git to fit their
needs.

I can't see MS having tens of thousands of developers working on any single
component of Windows, so I'm guessing they _do_ want a giant monorepo. If they
were to break it up into separate repos per OS component, I doubt they'd have
scaling issues (of course, breaking things up introduces coordination and
dependency issues).

~~~
ethomson
PM for Git at Microsoft here. We explored splitting it up. It's a 300GB
repository and it's been a monorepo for the last 20 years. Splitting it up
logically would take a lot of time (to put it mildly) and development would
stall while we did it. And once we did, we would have 100 3.5GB repositories?
3500 100MB repositories? Neither of these are particularly appealing and
getting changes checked in atomically across multiple repositories is insanely
challenging. There's no doubt that we would need to build tooling to make this
work for us. (We did actually explore this direction, but ultimately decided
that it would be too much work for too poor an experience.)

Instead, we decided to - as you put it - change Git to fit our needs.

~~~
tankenmate
Does Git on MSWindows use a dentry like cache in user land to speed up
filename lookups? Does it use a stat() equiv cache? Is there a reason that
such caches weren't put into ntoskrnl.exe? Would you be able to give us a
brief list (say top 5) of changes that sped up git on Windows, and which ones
had the most effect on super-sized monorepos? Thanks!

~~~
geofft
I have seen poor performance of git on Linux in very large repos, so I'm not
super convinced that the dentry cache magically makes things better.

In particular, if my repo is big enough, I often _don't_ have the entire tree
in memory (because I'm doing other useful things with memory and caches got
evicted). core.untrackedCache makes things a little better, but it's still not
great.

------
rokob
Interesting to see them try to fix Git when Facebook considered it (due to
similar constraints) and decided to go with Hg instead:
[https://code.facebook.com/posts/218678814984400/scaling-merc...](https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/)

~~~
ethomson
Microsoft PM here. It's important to note that Hg also had similar
constraints. So Facebook was in a position of having to fix Git or Hg to suit
their needs. It turns out that for a variety of factors, Hg was the better
choice for them to hack on.

Microsoft took the same look and went the opposite direction, and decided that
Git was easier for us to hack on. This was a no brainer for us. We build
multiple tools that host people's Git repositories (TFS and VSTS) and have Git
repositories as deployment mechanisms for Azure. We have several contributors
to tools in the Git ecosystem on staff, including people contributing to core
git, and the maintainer for Git for Windows, libgit2 and LibGit2Sharp. But we
have comparatively little institutional knowledge of Hg.

That post you linked is awesome. Facebook has done a lot of really impressive
work scaling Hg and a lot of the lessons learned at Facebook and
implementations in Hg are very similar to what we've done with Git.

This video from Durham Goode on Facebook's version control team (from Git
Merge, amusingly) is also awesome: [https://m.youtube.com/watch?v=gOVD-DrUpwQ](https://m.youtube.com/watch?v=gOVD-DrUpwQ)

~~~
normaljoe
It's really good to see Microsoft join in the open source community. I
switched to Mac over a decade ago largely because of the Microsoft only
mentality back then. Really makes me consider switching back.

------
paradite
Uhm... the author seems to be confused about the relationship between Github
and git.

GitHub doesn't develop git, it's a service built on top of git. Any
modifications to git have nothing to do with GitHub. GVFS is a GitHub project,
not part of git.

~~~
aeorgnoieang
> GitHub doesn't develop git

I'd be surprised if none of their employees have contributed to Git. I didn't
interpret anything in the article as saying that GitHub develops Git, in its
entirety (or even largely).

> GVFS is a GitHub project

GVFS is a Microsoft project and GitHub seems to be contributing. And the GVFS
developers have been successfully submitting their changes upstream to the
actual Git project. So it is _becoming_ a part of Git.

~~~
jwilk
> I didn't interpret anything in the article as saying that GitHub develops
> Git

" _Microsoft [...] wanted to get these modifications accepted upstream and
integrated into the standard Git client.

That plan appears to be going well. Yesterday, the company announced that
GitHub was adopting its modifications and that the two would be working
together to bring suitable clients to macOS and Linux._"

This hints that Git upstream = GitHub. I mean, why mention GitHub at all if
they aren't upstream? The rest of the article doesn't explain GitHub's role in
this story either.

~~~
aeorgnoieang
I still don't think your interpretation is warranted, but meh.

Microsoft made some contributions and they're working to get them accepted
upstream. They're maintaining a fork basically.

GitHub is adopting their modifications, i.e. GitHub is running Microsoft's
fork.

Microsoft's modifications are necessary _for them_ , particularly so they can
use Git with the giant Windows repo. Presumably, GitHub also has a need or
desire for those same or similar modifications. IIRC, some of GitHub's largest
customers need or want a version of Git that can also handle large repos, or
repos with large files or large numbers of files.

------
jwilk
For the avoidance of doubt, GitHub is _not_ upstream of Git.

~~~
tomxor
yeah, this was really confusing: everywhere the article mentions upstreaming
it refers to "git" only, then later it mentions how GitHub is accepting its
modifications... so I'm not entirely sure which it is.

TL;DR pretty sure it's not touching git source at all.

Looking though the GVFS readme:
[https://github.com/Microsoft/GVFS](https://github.com/Microsoft/GVFS) it
appears that it actually wraps git, which would make sense because in that
case they would have to explicitly add support to each git host and provide
individual platform ports of GVFS (all of which would be unnecessary if it was
actually upstreamed into git).

So that's good... git is one of my favourite open source tools and I would
really hate for M$ to start polluting it. I don't care if GVFS is a good idea
or not I just don't trust the fuckers and they will always deserve that
suspicion.

~~~
hdhzy
No, they needed to do some modifications to the git suite of tools. Generally
git expects all objects to be on disk and Microsoft wanted to have sparse
checkouts of files in a specific revision.

Not really polluting but rather having some objects be fetched only on demand.

Source:
[https://blogs.msdn.microsoft.com/devops/2017/02/03/announcin...](https://blogs.msdn.microsoft.com/devops/2017/02/03/announcing-gvfs-git-virtual-file-system/)

> In addition to the GVFS sources, we’ve also made some changes to Git to
> allow it to work well on a GVFS-backed repo, and those sources are available
> at [https://github.com/Microsoft/git](https://github.com/Microsoft/git).

For the record as far as I understand GVFS the article is correctly using git
vs Github.
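
For illustration, here's a toy sketch of that on-demand idea (in Python, with all names invented; this is not the actual GVFS protocol or git's object API): objects live on a server, and the local store fetches and caches them only when something first asks for them.

```python
# Hypothetical on-demand object store. `fetch_from_remote` stands in for
# a transfer protocol like the one GVFS added; nothing here is real git.

class LazyObjectStore:
    def __init__(self, fetch_from_remote):
        self._fetch = fetch_from_remote   # callable: sha -> bytes
        self._local = {}                  # objects already "on disk"
        self.fetches = 0                  # how often we hit the network

    def get(self, sha):
        # Stock git assumes every object is already in .git/objects;
        # here we fall back to the remote on a miss and cache the result.
        if sha not in self._local:
            self._local[sha] = self._fetch(sha)
            self.fetches += 1
        return self._local[sha]

# Simulated remote holding the full repository's objects.
remote = {"abc123": b"int main() {}", "def456": b"# README"}
store = LazyObjectStore(remote.__getitem__)

store.get("abc123")           # first access goes over the "network"
store.get("abc123")           # second access is served locally
assert store.fetches == 1
```

(Upstream git has since gained built-in partial clone, e.g. `git clone --filter=blob:none`, which defers blob downloads in a similar spirit.)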

~~~
tomxor
Thanks for the clarification, sparse object fetch seems like a small change in
concept at least.

------
tarruda
Did I misunderstand the idea, or is GVFS essentially making git a
non-distributed VCS?

~~~
snowwrestler
It's worth asking: how many Github users are actually using git as a fully
distributed version control system? The typical Github workflow is to treat
the Github repo as a preferred upstream--which sort of centralizes things.

~~~
frabbit

    The typical Github workflow is to treat the Github repo as a preferred upstream

Is it typical? What metrics do you have to support that?

Speaking for myself I have only ever used triangular workflows: fork upstream;
set local remotes to own fork; push to own fork; issue pull request; profit

[https://felipec.wordpress.com/2014/05/11/git-triangular-work...](https://felipec.wordpress.com/2014/05/11/git-triangular-workflows/)

~~~
ezrast
The main repo is upstream of you regardless of whether it is labeled as such
in `git remote -v`. If it goes offline, nobody systematically falls back on
your fork. This makes the system effectively centralized, which is parent's
point.

~~~
carussell
To elaborate in more (possibly excruciating) detail:

It's very rare for anyone on GitHub to do the sort of tiered collaboration
that the Linux kernel uses. If, say, I want to contribute to VSCode, pretty
much the only way to get my changes upstream is to submit pull requests
directly to github.com/Microsoft/vscode.

Compare to the tiered approach, where I notice someone is an active and
"trusted" contributor, so I submit my changes to their fork, they accept them,
and then those changes eventually make their way into the canonical repo at
some future merge point. That's virtually unheard of on GitHub, but it's the
way the Linux kernel works.

Pretty much the only way you could get away with something even remotely
similar and not have people look at you funny in the GitHub community is
_maybe_ if you stalked someone's fork, noticed they were working on a certain
feature, then noticed there was some bug or deficiency in their topic branch,
then they wake up in the morning to a request for review from you regarding
the fix. Even that, which would be very unusual, would really only work in a
limited set of cases where you're collaborating on something they've already
undertaken—there's not really a clean way in the social climate surrounding
GitHub to submit your own, unrelated work to that person (e.g., because it's
something you think they'd be interested in), get them to pull from you, and
then get upstream to eventually pull from that person.

~~~
frabbit
That's detailed, but not excruciating. Thank you.

------
Koshkin
To me it looked like one of the points of Git was that one could work on
anything from the entire repository locally, i.e., for example, without having
an internet connection, e.g. when on an airplane. (I think having one huge
repository is the problem here.)

~~~
jaccarmac
Well, Git was indeed created to serve Linux kernel development. So
decentralized access for the people writing patches all over the world was
useful. However, when companies started using source control for their
monorepos which developers access through a work network or VPN
decentralization takes a backseat to business concerns. Different use cases
and all that.

~~~
secstate
I think the problem there, is that you then basically wind up with Git on the
backend and an SVN-like layer over the front. Which seems a bit silly. SVN is
a very robust project that works really well. Why not add git-like porcelain
to SVN if you're gonna force devs to have an network connection so they don't
have to download a giant git working directory.

Smart people are working on all this, so I'm sure there are reasons, but in
the all the instances where I've had to interact with a monorepo, it was
because the tech debt was too high to pay off to break it apart, not because
it was better.

And if you're indebted to the point where you have no point in paying it off,
you damn well better have leveraged assets against it (i.e. a cash cow of a
business, like MS Word or Facebook)

~~~
jaccarmac
That's interesting. I've never been an insider and it was several years ago,
but Google spoke in pretty glowing terms of their monorepo when I heard them
talk about it. That's neither here nor there though, not my personal
preference either to be fair.

As far as Git's killer feature I think decentralization is only one part of
it. Being able to deal with the change graph directly is sometimes handy. And
knowing the basis of the model, I find Git pretty easy to understand even when
history gets fuzzy. Not sure but I think Darcs might be even better for that,
just without the mindshare.

------
drdrey
Sounds a bit similar to Google's Piper
([https://cacm.acm.org/magazines/2016/7/204032-why-google-stor...](https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext))
that lets you check out a portion of the monorepo, no?

------
mg74
"The biggest complexity is that Git has a very conservative approach to
compatibility, requiring that repositories remain compatible across versions."

This, above all else. That neither I nor anyone else on our team ever has to
worry about whether we're running an "older version" of git is such a win.

~~~
xyzzy_plugh
It's not really a surprise that monorepo advocates struggle here.
Compatibility through versioning and stable APIs is challenging and requires
discipline.

Making wide-spread changes to all clients of your code when you make changes
that break things doesn't require discipline.

------
jopsen
Hmm, some part of me wouldn't mind a FUSE setup where:
/repository/github.com/<org>/<repo>/ is automatically cloned out from github
when accessed the first time :)

It would be pretty cool to use as a $GOPATH. Or just for making drive-by
github contributions super easy.

~~~
hinkley
It's all fun and games until someone like me tries to run `find` on the parent
directory...

~~~
silentOpen
Children would be lazily loaded and would not appear in directory listing of
the parent directory. `find` would search your existing local checkouts only
and the file system would still be POSIX.

~~~
hinkley
How do you trigger loading the projects if there isn’t a pointer to them
showing up in the file system?

~~~
koffiezet
When you open or stat a directory that doesn't exist, try to checkout that
path, if that fails, return an ENOENT error?
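
That decision logic can be sketched without any real FUSE plumbing. Everything below (the `try_clone` callback, the `local_checkouts` dict) is invented for illustration; a real implementation would wire this into a FUSE lookup/getattr handler.

```python
# Sketch of lazy clone-on-access, assuming a callback-based design.
import errno

def lookup(path, local_checkouts, try_clone):
    """Resolve `path` under the virtual mount, cloning lazily on first access."""
    if path in local_checkouts:              # already materialized; `find`
        return local_checkouts[path]         # would only ever see these
    repo = try_clone(path)                   # e.g. clone github.com/<org>/<repo>
    if repo is None:                         # clone failed: behave exactly
        raise OSError(errno.ENOENT, path)    # like a missing directory
    local_checkouts[path] = repo
    return repo
```

The nice property is the one noted upthread: paths that were never touched simply don't exist locally, so ordinary tools only traverse what has already been checked out.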

------
gumby
A couple of thoughts as the implementation downloads mainly placeholders:

That mechanism can also be combined with permissions on the server side so
contractor X who is fixing a driver need only have access to the driver and
some supporting code while still checking into the monorepo.

It also means that every developer’s laptop won’t have a full copy of the repo
(so if said laptop is lost, the risk is more contained)

And it’s similar to parts of LFS, bigfiles, and other “blob” add-ons to
git... I could see in the future a way to put .doc files and the like into repos.

------
Myrmornis
> The open source, free GitHub hosting doesn't need the scaling work Microsoft
> has done

They might want to fix that sentence. It reads as saying that Github is open
source. Perhaps it should read

> The free GitHub hosting that is typically used by open source projects
> doesn't need the scaling work Microsoft has done

------
dfabulich
> _Microsoft and GitHub are also working to bring similar capabilities to
> other platforms, with macOS coming first, and later Linux. The obvious way
> to do this on both systems is to use FUSE, an infrastructure for building
> file systems that run in user mode rather than kernel mode (desirable
> because user-mode development is easier and safer than kernel mode).
However, the companies have discovered that FUSE isn't fast enough for
> this—a lesson Dropbox also learned when developing a similar capability,
> Project Infinite. Currently, the companies believe that tapping into a macOS
> extensibility mechanism called Kauth (or KAuth) will be the best way
> forward._

I'd love to read more technical details on this. How can Kauth support a
virtual filesystem?

------
m12k
It seems that lazy-downloading files isn't all that useful if you need to
compile all the source code. Does anyone know which build tools can be used to
avoid having to open most of the source files - maybe some centralized build
cache?

------
exabrial
This is kinda funny, it undoes one of the original pillars of why git was made

------
iceman2654
Does anyone else think this is another "Embrace, extend, and extinguish" move
from MS? I don't know enough about the changes to make that judgement.

------
scscsc
The title is terrible. Here is an excerpt from the article that explains
what's going on: "The company's solution was to develop Git Virtual File
System (GVFS). With GVFS, a local replica of a Git repository is virtualized
such that it contains metadata and only the source code files that have been
explicitly retrieved. By eliminating the need to replicate every file (and,
hence, check every file for modifications), both the disk footprint of the
repository and the speed of working with it were greatly improved. Microsoft
modified Git to handle this virtual file system. The client was altered so
that it didn't needlessly try to access files that weren't available locally
and a new transfer protocol was added for selectively retrieving individual
files from a remote repository."

This style of working is needed in large code bases, where not all files are
checked out on developer workstations (for performance or privacy reasons).
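
As a hedged sketch of what that virtualization buys (invented names, not GVFS's actual design): the working tree is enumerable from metadata alone, and only files that were explicitly read ever need to be retrieved or re-checked for modifications.

```python
# Toy placeholder tree: paths and sizes are known up front, contents are
# retrieved lazily. Purely illustrative; not the real GVFS driver.

class PlaceholderTree:
    def __init__(self, manifest, retrieve):
        self.manifest = manifest      # path -> size, known without contents
        self._retrieve = retrieve     # callable: path -> bytes
        self._hydrated = {}           # files explicitly retrieved so far

    def listdir(self):
        # Metadata alone is enough to enumerate the whole tree.
        return sorted(self.manifest)

    def read(self, path):
        # First read "hydrates" the placeholder into a real file.
        if path not in self._hydrated:
            self._hydrated[path] = self._retrieve(path)
        return self._hydrated[path]

    def modified_candidates(self):
        # A status check only has to examine hydrated files, which is why
        # both the disk footprint and the speed of working improve.
        return sorted(self._hydrated)

manifest = {"src/main.c": 13, "docs/spec.doc": 4096}
contents = {"src/main.c": b"int main() {}", "docs/spec.doc": b"..."}
tree = PlaceholderTree(manifest, contents.__getitem__)

assert tree.listdir() == ["docs/spec.doc", "src/main.c"]
assert tree.modified_candidates() == []    # nothing hydrated yet
```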

~~~
jcoffland
Man, I get sick of the incessant title criticism on HN. I wish the quality of
the title wasn't such a go-to topic. Unless the title is grievously
misrepresenting the article, let's just discuss the topic.

~~~
doublerebel
Yes the latest crowd has created some memes for themselves. I agree we need to
extinguish them for the sake of continued quality HN discussion. Including:

\- X software package name reminds/confuses me of Y product with similar name

\- X title is garbage. Here is the title I would write...

\- I don't understand [basic concept available on Wikipedia]. Someone explain
it to me (aka lazyweb)

\- I'm not an expert on this, but [completely unqualified and uncited
conjecture]

\- I also once did X from TFA and [unrelated personal anecdote with no
insight] (aka long form "me too" comment)

I come to HN so that I can read informed discussion from people who work in
the field of TFA. To read discussion of basic topics between the uninformed,
there is everywhere else on the internet.

~~~
Yen
I mean, unlike a lot of aggregation sites, HN has a specific policy about link
titles, which is generally sensible and does the right thing.
([https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html))

Sure, in some cases, implementation of this policy fails, and something is
rewritten that shouldn't be, or something should be rewritten that isn't. But,
I'd estimate maybe 80% of all discussions about titles, that I've seen, have
been valid concerns, and typically resulted in a rewrite.

~~~
doublerebel
I have long been familiar with HN's policies. OP did not suggest a new title,
they instead posted a long quote from TFA and added an oversimplified summary
-- which I'm not sure adds to quality discussion. Articles which are only
clickbait should be flagged. Correcting the title on a good article is
different from complaining.

------
sam0x17
And they are still calling it GVFS -- a name that is already used by a
prominent linux technology.
[https://github.com/Microsoft/GVFS/issues/7](https://github.com/Microsoft/GVFS/issues/7)

~~~
zellyn
Every single hacker news discussion these days seems to have a complaint about
reusing a name.

Every name is taken. There is nothing to be done about it. It's okay. I mean,
it's sad, and terrible, but it's also fine.

~~~
camgunz
They are definitely not all taken

------
chungy
Anyone else get superconfused and think about the GNOME Virtual Filesystem?

~~~
sam0x17
[https://github.com/Microsoft/GVFS/issues/7](https://github.com/Microsoft/GVFS/issues/7)

------
hasenj
Moving Windows development to git sounds like a completely irrational decision
from a technical standpoint that was driven mostly by marketing concerns.

~~~
sctb
Could you please try to comment more substantively? We've already asked you
this.

[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

~~~
hasenj
> We've already asked you this.

Am I on some kind of a black/gray list? Either that, or you specifically
remember my name, which sounds odd.

------
Nelkins
I wonder if this is a precursor to open-sourcing Windows development...

~~~
gmueckl
No. Open Source and walled gardens don't mix. MS wants to turn Windows into
the latter, obviously.

------
crankylinuxuser
Does this mean we have to fdisk/format if we forget the commands to correct a
broken git repo?

[https://xkcd.com/1597/](https://xkcd.com/1597/)

I seriously did that first learning Git. And there's plenty of niches and
side-cases that I'm still not quite sure of. Going from client-server to
distributed has a level of complexity that usually isn't discussed until you
implement.

EDIT: My further understanding is that this provides a Git filesystem-based
connection so that one can work on a multi-TB repo without downloading
everything locally.

This seems to be the result of choosing to have all the software in one OMG-
sized repo, rather than 1 project/repo. And evidently they need a "keep on
server" for this. Makes me wonder more why they even went with Git or a
distributed model at all. This seems more like they 'screwed up and now have
tons of bandaids'.

~~~
aeorgnoieang
Look up the history of GVFS and why Microsoft chose Git. There are interesting
reads about it.

There's also lots of interesting reads about 'monorepos' and their pros and
cons. Note tho that Google and Facebook, and lots of other companies, use
mono-repos. It's not just Microsoft and, not surprisingly, they've all made
(more or less) reasoned, thoughtful decisions taking into account lots of
factors that almost no one else in the world would ever think to do.

------
jeanmichelx
This is only tangentially related, but is there any standard advice to get git
on a remote file system (ceph) to run fast with a long history?

------
nerdponx
It's a little surprising to me that these companies use Git at all. What's the
appeal of Git really? Is it just the "familiarity factor", and is that really
a reason to pour so many resources into it?

~~~
dingo_bat
I think almost all the people they hire nowadays would be familiar with git
and not something else. So it would be a lot of time savings in training if
you switch to git instead of training every newbie to your own peculiar source
control mechanism.

~~~
nerdponx
Sure, but the CLI is the CLI. I can't imagine there's no way to replicate the
Git CLI without rewriting the internals. Surely the basics of branch-and-merge
will be the same, right? I doubt anyone would miss the dangerously overloaded
`checkout` subcommand, or wrestling with rebase conflicts.

------
jcoffland
I can't help thinking there is more to this than is discussed in the article.
This tech can also be used to give developers selective access to a git repo.
It lets Microsoft and other commercial software vendors get on the git
bandwagon with out actually sharing all the code with all the developers. Each
git repo is a complete copy because git was designed to support Open-Source.
In Open-Source there are no hidden parts. But that's a hard pill for corps
like Microsoft to swallow. This looks more like a solution to meet developer
demands to use git and also satisfy the demands of management to maintain
control of their team's code base.

~~~
binthere
As stated in the article, git is also challenging for game development because
of the large asset files. Git has limitations and they are part of the reasons
a lot of companies and even indie game devs didn't jump into it (Jonathan
Blow, John Carmack, being some of the critics). It's more about speed,
simplicity, and efficiency rather than "Corp" vs "Open Source" conspiracy.

~~~
KallDrexx
Git's issues with game dev aren't just about large file sizes, though.
Binary files are hard to control in a DVCS because if two people spend 24
hours to modify a binary asset at the same time you can't just merge the
changes together like with source code. Someone has to manually go in and
replicate both changes together.

