For the GNOME community this might be both poor form and confusing; for the rest of us it's ... meh.
Given that the products are quite similar - both are virtual file systems after all - confusion is actually quite likely. If you're the kind of person who has reasons to speak about virtual filesystems at all (which, honestly, most people don't) you should probably know about both.
Had this been back in the day one would guess it was deliberate... but hey, everybody tells me Microsoft are supposed to be good guys these days, so I guess Hanlon's Razor applies.
AFAICT, there's essentially no overlap of features beyond the fact that they're both virtual file systems. The feature sets are completely different, so there's no point where, having read more than a few words of whatever you're looking at, you'll be left scratching your head thinking "Oh jeez, which one is it this time?"
> Lots of branches – Users of Git create branches pretty prolifically. It’s not uncommon for an engineer to build up ~20 branches over time and multiply 20 by, say 5000 engineers and that’s 100,000 branches. Git just won’t be usable. To solve this, we built a feature we call “limited refs” into our Git service (Team Services and TFS) that will cause the service to pretend that only the branches “you care about” are projected to your Git client. You can favorite the branches you want and Git will be happy.
This is almost certainly a result of trying to have one company-wide monolithic repository that holds the source code of hundreds or thousands of separate projects.
Git is more pleasant when you break your codebase into isolated components. These can be pretty large—the Linux kernel has 16 million lines of code—but if your codebase is many times larger than a complete modern kernel, you might want to split it.
If you have 5,000 engineers all pushing branches to a single master git repository, you may want to either rethink your repository structure, or at least have maintainer subtrees the way Linux does.
As part of a company that switched to the same kind of monorepo structure from several separate repositories - there is nothing "pleasant" about having to deal with multiple Git repositories for connected components. Subtrees and submodules are an utter hell of maintenance, checkout bugs (which hurt CI) and bad UX across the board ("why doesn't this build?" "you forgot to check out submodules" "no, you forgot to move the commit pointer" "no, you forgot to change dependent tests because you didn't see them in an isolated repository"...)
It's not a GOOD approach, but it's the best approach compared to all other more terrible ones.
It's a lot more work than the hack that is submodules (we understand that hack, we did exactly the same thing for years to track outside source drops). But it is worth it, everything just works.
You could tell me to go try it out myself, but a nice side-by-side comparison showing that bk does all that git/hg do (if that's true) could go a long way to attracting git/mercurial people over to bk.
BK does more and less than git/hg. It's got richer history, so it was trivial to write a bk fast-export that git could import. Writing a bk fast-import wasn't so bad, but an incremental one is hard because git doesn't store as much history (no per-file history - actually no file object at all, just pathnames that have some data; we have a DAG per file, which makes history better and makes merges faster and easier).
Mercurial copied our UI, so for a lot of stuff if you know hg <cmd> then bk <cmd> will just work. BK predates all these systems; they picked up some of bk's commands.
There is a cheatsheet (somewhat out of date but it will get you going) at:
And you can email dev <at> bitkeeper.com or hit up http://bitkeeper.org for more info.
I do not recommend using submodules, except in very limited cases.
What I'm suggesting is that if your code base is big enough to break git (considerably larger than the Linux kernel's 16 million lines of code), you might want to break it into multiple projects with their own release cycles. Build them separately, release them via an internal release process, etc.
Personally, I like largish repos. But "an order of magnitude larger than the Linux kernel" is just too big, in my opinion, at least for git-based projects. If you have 100 or 200 million lines of code, you probably have natural points to split up your architecture.
Submodules used to be incredibly buggy and terrible; they've improved in recent versions of Git so now they're merely really bad.
Here's an example. Everyone who has used a good SCM discovers a pretty useful workflow, we call it merge and test. The idea is you have N pull requests sitting around. You don't want to just pull them in, you want to see if they all work together. So you clone your integration tree to a test tree. In BitKeeper, you'd then add each of the pull requests as an incoming parent (bk parent -i <repo>) and then you just merge them (bk pull). Once it's all merged, you build it, test it.
Facebook does this, a lot of people do this. Why do they do it that way? To avoid polluting the main tree with garbage. It's the opposite of continuous integration (aka shoveling shit into your integration tree as fast as possible). Smart companies want their integration tree to be a stable base on which to build, not a layer of continuously integrated quicksand. If the tests all pass you can push the whole wad to the integration tree, or rebase each to tip (we hate this; doing that means you've turned your history into CVS history and you lose all sorts of useful information).
So how do submodules break that workflow? Easy, they don't support sideways pulls. If two people are working on a submodule and they have not pushed to the main tree, try and sideways pull from Jane to John and you won't get Jane's work. As a dev who worked on the nested collections, I get why that doesn't work, there are a zillion corner cases where it gets complicated dealing with that state (and there is a goldmine of test cases in the open source BK code, see src/t/t.nested* and start reading).
Here's an example of an obvious thing you have to deal with (if you don't think in DAGs this is gonna hurt a little). Suppose you have a subrepo called libc. Jane has a clone of the collection as does John. Their clones are just the top repo, no subrepos are present. They both pull from different clones that have libc present and each of those clones have modified libc. So that means there is implied (different) work in Jane's missing libc and John's missing libc. Which means if Jane pulls John's clone then the libc DAG has forked and needs to be merged. BK recognizes that, even though the subrepos are not present, and tells Jane when she pulls John's collection that she needs to populate libc so she can merge it. The centralized model of submodules side steps that entire class of problems, at the cost of no sideways pulls, no workflow other than the centralized CVS like model.
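The ancestry check implied above can be sketched in a few lines of Python. This is just a toy model with made-up names (`is_ancestor`, `needs_merge`), not BitKeeper's code: each clone records the libc tip it knows about even when libc isn't populated, and a pull has to notice when neither tip contains the other.

```python
# Toy model of the forked-subrepo check described above: not BK's
# implementation, just the ancestry test such a tool has to perform.

def is_ancestor(dag, a, b):
    """True if commit a is an ancestor of (or equal to) commit b."""
    seen, stack = set(), [b]
    while stack:
        node = stack.pop()
        if node == a:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(dag.get(node, ()))
    return False

def needs_merge(dag, tip1, tip2):
    """The sub-DAG has forked iff neither tip contains the other."""
    return not (is_ancestor(dag, tip1, tip2) or
                is_ancestor(dag, tip2, tip1))

# libc history: Jane pulled a clone containing a1, John one containing b1.
libc_dag = {"base": [], "a1": ["base"], "b1": ["base"]}

print(needs_merge(libc_dag, "a1", "b1"))    # True: forked, must populate libc
print(needs_merge(libc_dag, "base", "a1"))  # False: plain fast-forward
```

The point is that the check only needs the recorded tips, not the subrepo contents, which is why a tool can tell Jane to populate libc before she can complete the pull.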
Submodules are the quick hack to get you semi-coupled collections, BK/nested is the hard work to make a collection have the semantics of a monolithic repo.
There have, AFAIK, been some orgs working on Mercurial in order to support large monorepos - I don't know if it's better or worse than Git today. I see that Illumos (OpenSolaris) has shifted from Mercurial to Git, for example:
(Solaris/Sun was a somewhat early adopter of "big" mercurial repos).
Working with big monorepos is not without challenges:
FWIW I tried using sub-modules for mercurial some years ago - in the end I realized it was probably not worth it. I'm also skeptical about go's use of "vendoring" dependencies for a similar reason.
That is, monorepos are great for single company-wide access: easy to fork and commit against different areas, with a single system for driving all CI changes across the codebase.
I agree that having a single huge repo kinda sucks in Git, but splitting is also basically a nonstarter for companies that want all the features of Git while their repo and build are monolithic. Would you require people to first completely rebuild their software into separate Git repos before using Git? That could take years!
I've been looking at this exact problem at my work, and there is no easy answer here. I think MS probably picked the easier path forward in their case.
Correct. But both Google and Microsoft use this method. And I'm sure they've put in a lot more man-years of investigation into it than you and I have!
Indirect evidence: several companies have released open source clones of Google's proprietary build system "Blaze" -- Facebook's "Buck" and Twitter's "Pants". Google was actually last to the party, with "Bazel".
I believe Git was never in contention due to the obvious scalability problems, but GVFS does seem like it might potentially be usable at Google.
[Edit] Not that I'm against GVFS, there is nothing wrong with not wanting to download the entire repository in some cases.
When the Chromium team forked WebKit (renaming it Blink) they merged it into Chromium, specifically citing ease of development.
To some extent I think the problems with split repositories are just down to bad design. WebKit was never a nice clean API, it was just a mess of whatever hacks Apple and Google needed for their respective browsers. The Chromium/WebKit split wasn't made on sensible engineering grounds, it just reflected the Google/Apple administrative divisions (Conway's law: "organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations")
I still feel like modular code is better, and properly designed modular code can work well in modular repositories. But good modular code can work in monorepos too. Monorepos have the advantage that you can do huge codebase-wide refactorings atomically. Clang seems to use that approach -- all the internal APIs are constantly in flux.
I assume all the repos involved have to be managed by the same Gerrit instance too. In that case it's not clear to me what you actually gain from multiple repos (besides working around Git's scaling issues).
> In that case it's not clear to me what you actually gain from multiple repos (besides working around Git's scaling issues).
I'm not sure what you mean here, would you please elaborate?
If you're pulling in repos from third parties, outside of your Gerrit workflow, that's different -- but in that case I don't see how you can enforce those atomic multi-repo commits.
We lock the top repo and then go do the work in the subrepos. All repos respect a lock in the top repo but we have a way to say "yeah, yeah, there is a lock but that is your lock, go ahead".
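A minimal sketch of that rule, assuming a simple owner-tagged lock (the class and names here are hypothetical, not BitKeeper's): subrepo work honors the top-repo lock unless the lock was taken by the same operation.

```python
# Hypothetical sketch of a top-repo lock that subrepo operations
# consult; "that is your lock, go ahead" is just the owner check.

class TopLock:
    def __init__(self):
        self.owner = None          # id of the operation holding the lock

    def acquire(self, op_id):
        if self.owner is not None:
            raise RuntimeError("top repo is locked by %s" % self.owner)
        self.owner = op_id

    def release(self, op_id):
        if self.owner == op_id:
            self.owner = None

    def may_proceed(self, op_id):
        # Subrepo work is allowed if unlocked, or if the lock is ours.
        return self.owner is None or self.owner == op_id

lock = TopLock()
lock.acquire("pull-42")               # the pull locks the top repo
print(lock.may_proceed("pull-42"))    # True: your lock, go ahead
print(lock.may_proceed("clone-7"))    # False: someone else must wait
```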
Git is absolutely bad at handling even moderately large repositories (unless the repo is 100% text-based for its entire life), and its support for submodules completely sucks. I didn't even work on a multi-million line codebase - a few hundred kLOC, but with like 20 submodules (library components) - and the amount of organizational overhead is annoying as shit. (Not to mention the tools we developed to stop our ~20 active developers from footgunning themselves by committing non-existent submodule pointers, the careful explanations of how those work, etc.) We have a binary or two, but we are anxious and careful about when and how we update them (they were added before our time, and every update at this point on such a long-lived repo can visibly affect performance; splitting them out is even more annoying, loses our history, and does not reduce the size of the original repo). Git is just bad here. Almost everyone is a volunteer. This kind of shit is a waste of time for everyone, frankly.
What if git were 5x as slow as it is now? Would that mean we should create 5x more repositories, divide every repository up into 5 more? Delete your historic repositories with 5x the frequency? Would trying to improve that mean we're trying to "force git to adapt for monolithic repositories", as you claim? There are very real limits to this kind of reasoning, bending over backwards for your tools.
Tools like BitKeeper are far better at this and have a way better UI out of the box. The "Product Lines" feature means subrepositories are much more transparent for all users (with full push/pull bidirectional interoperation, which is absolutely critical, and it does crazy things like -- gasp -- cloning all subrepositories by default). But BitKeeper can also handle 50GB repositories fairly easily and quickly, including large binary files. No multi-repos necessary, no LFS. It "just works". That's excellent, and it removes concerns and worries you might have later.
I also think you understate the true value of continuous history. It's not something to toss aside so quickly, and any extra effort required to use it means you lose a lot of insight. Maybe that's fine if you don't have to dig through old code constantly (or your company throws away/rewrites its product constantly, I guess).
In my current job, for the OSS project I worked on -- it's got VCS history dating back to 1995. Yes, I have traced changes, design decisions, old relics, and even bugs back nearly ~20 years. Why? Because it was easy and there, and the best way to discover why something was done. It is absolutely one of the most important discovery and historical tools, for this reason. Many developers -- including myself -- use it all the time.
New developers cannot possibly understand the rationale of an obscure change made 10 years ago by someone with a PhD without this kind of help, at least not easily. Some files may not have been touched in like 8 years! This repository is also relatively quick to clone, luckily. The 20 submodules put a damper on it, though.
One of my last jobs was working on a product that was 15 years old and had only one repository. They stored binaries in it, and many of them had been updated over the years (small JAR files, mostly, and one copy of GCC). It took 6 hours to check out, and honestly that was the main thing that sucked - performance. However, all of the same benefits applied. And that one truly was continuous -- you could get a look at the repo, as it was built for customers, years ago, to look at the behavior. Atomic refactorings across years of code, over a million lines, 50+ developers, and many libraries were possible in a single commit.
That's a good thing. But the performance is a tool deficiency, not something to be hand waved away. That second repository, the 15 year one, was like, 50GB total. That's peanuts in today's terms. Other VCSs would have been much more tolerable (including BK, or probably something like Perforce, in terms of pure speed).
I have to second this. When I was working on the VCL component of LibreOffice I often had to backtrack through positively ancient history to try to understand decisions.
Unfortunately, history before 2000 is not available, and even worse, someone made the absolutely stupid decision to merge changes without keeping the commit history of branches - consequently there are only poor summaries of multiple changes for a single code commit. It can be incredibly frustrating!
Did I mention he was bright? He was, and he took a look at the history in bk revtool (a graphical tool, shows you the dag, you can left/right click on a pair of nodes and see diffs, double click on a node and see that version of the file).
He double clicked on the very first rev and lo and behold, there was the code he was about to type in. He said it was exactly how he imagined it.
Hmm, says he. I wonder why it changed. He started clicking around and went "ohh, so on this OS they have that problem". Some more clicks "oh, and windows has this problem with wrapping pids". Etc.
Had he not had that history (and had it not been really fast to click around and see it), he would have rewritten the file to be back where we started, losing all the bug fixes.
History matters. Being able to see it quickly and easily matters.
Wait, are you saying... Git is a half-assed copy of just the bits of BitKeeper that were needed for the Linux kernel?!? Written by people who don't care about UI?
(I'm mostly joking, of course, Git is great in many ways. But it's so terrible in others. The living embodiment of "worse is better".)
Git has no per-file history data structure, no actual revisioning of renames, a blame that is insanely slow (see below for a demo), etc. Yeah, Git is really, really popular; it won, no question. But man, did the world get screwed because of that.
At this point I'm well on my way to being retired and playing with tractors, but I'd love it if Git ripped off everything useful in BK. I don't see it happening, Linus is proud of his "design", it's pretty entrenched. At least with BK out there as open source, you can fast-import your git tree, play around and see what the world could have been like.
Oh, yeah, the blame demo. Git's file format isn't a file format, it's a repository format. With no formal file object, Git has to paw through the entire history to get the annotations for a single file. Most of the time you don't care but if you want to be responsive to bug reports, support requests, being able to figure out who changed what in a file is really helpful. So we benchmarked our blame implementation against Git's blame implementation, here's a little video about it:
It's pretty hilarious IMHO.
"Optimise for the common case" is standard engineering practice, as I'm sure you know.
(BTW, putting design in quotes is a bit rude; as a fan of yours it pains me a little to say this.)
It's fine if what you want is a compressed tarball server, that's the essence of what it is and exactly what Linus wanted. From that point of view the design is brilliant. From an SCM point of view it is very lacking.
As for fast, Git is fast until it is not. See the Facebook benchmarking Git thread, it was posted here. Git commit and pull performance don't scale up at all.
Git is very fast when everything fits in memory. It's horrible when things don't fit in memory. Remember, it names everything by hash, doing that is neat from a math point of view, it sucks from a disk point of view, even an SSD point of view.
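To make the disk side concrete: Git stores a loose object by prefixing the content with a type header, hashing it, and sharding the hex digest into a two-character directory, so consecutive versions of the same file land wherever their hashes happen to fall. (A small sketch of the naming scheme; the helper name is mine, not Git's.)

```python
import hashlib

def loose_object_path(data: bytes) -> str:
    # A blob is stored as "blob <len>\0<data>", named by its SHA-1,
    # and sharded as .git/objects/<first 2 hex chars>/<remaining 38>.
    blob = b"blob %d\x00" % len(data) + data
    h = hashlib.sha1(blob).hexdigest()
    return ".git/objects/%s/%s" % (h[:2], h[2:])

# Two consecutive versions of "the same file" get unrelated names --
# nothing on disk ties them together as one file's history.
print(loose_object_path(b"version 1\n"))
print(loose_object_path(b"version 2\n"))
```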
Finally, "optimize for the common case" is, I think, a bit off the mark here. Linus wasn't optimizing for performance, he was optimizing for ease of implementation. He simply doesn't care about the history. If he could have figured out a way to have a sliding window of history that was just enough to merge every old repo out there, I suspect he would have done that. To him, history is baggage; he just cares about the tip.
SCM people care about all the history, it's all useful at some point. The goal is to record all of that, efficiently, so that when someone needs it, they have it. The video is just showing that we got the efficient part done.
"Big merges" break history in Git. Not in BitKeeper; BK passes changes by reference, not by value. What that means: suppose you have 3 users, A, B, and C, and a file with 1000 lines of code. A & B clone that repo and have identical versions of the file. A changes the top 501 lines, B changes the bottom 500 lines. Now C clones A and pulls B. In the merge, C has to pick between A's line 501 and B's line 501, but the other lines are merged unchanged. In most naive systems, it will look like C wrote either the top 501 lines or the bottom 500 lines when you run blame. That's what I mean by pass by value: the merged lines are a copy.
In BitKeeper, the unchanged merged lines are passed by reference, they are in the history exactly once. Only the line where it was merged (changed in the merge) will show up as belonging to C.
You'd be amazed at how long it takes people to trust this. They are so used to pass-by-value semantics (and hate them, because a bunch of code done by someone else appears to be done by C, and C gets stuck with the bugs).
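The 1000-line example shrunk to 10 lines, as a toy Python sketch (no real VCS code here): A rewrites the top six lines, B the bottom five, so index 5 is the one conflicting line, and only it should end up blamed on C.

```python
# Toy blame comparison: pass-by-value vs pass-by-reference merge
# attribution. Authorship is tracked per line as (author, text) pairs.

N = 10
base = [("orig", "line %d" % i) for i in range(N)]

a = [("A", "A's line %d" % i) if i < 6 else base[i] for i in range(N)]
b = [("B", "B's line %d" % i) if i >= 5 else base[i] for i in range(N)]

# C merges, resolving the single conflicting line (index 5) as A's text.
# Pass by reference: untouched lines keep their original authors, and
# only the line C actually resolved is attributed to C.
by_reference = a[:5] + [("C", a[5][1])] + b[6:]

# Pass by value: the merge stores one side's lines as fresh content, so
# blame shows C as the author of everything copied in from B's side.
by_value = a[:5] + [("C", a[5][1])] + [("C", text) for _, text in b[6:]]

print([who for who, _ in by_reference])  # 5x 'A', 1x 'C', 4x 'B'
print([who for who, _ in by_value])      # 5x 'A', then 5x 'C'
```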
Is there anything like Github or Bitbucket that supports BK?
If there were a service to easily host BK repos, and especially if you could easily make it readable via Git, that could be pretty compelling.
(Edit to add: http://bkbits.net maybe?)
Does brew tell you who did the packaging? I'm wondering if that's one of my guys or someone else (I'd love it if it were someone else, be good to get people hacking).
So it's using the installer from bitkeeper.org; I don't know if the people who added it to homebrew are affiliated.
Put another way: what if git was suddenly 100x faster, at every scale for every repository? Would you put more stuff in there? Some limits are real (e.g. Google has like an 80TB source repository, you're screwed there). But some of the limits are absolutely artificial in a sense.
There's definitely something to be said for misusing Git in ways it shouldn't be used. And if you're Google or Microsoft or Facebook -- whatever, you have a thousand other problems.
But honestly, "I want it to be able to handle large repositories" has never seemed like a fundamental "misuse" of any version control tool. It actually seems like something we've only told ourselves is a misuse because literally every available tool is just incredibly bad at it, fundamentally.
And the GVFS repo doesn't have the problems GVFS tries to solve. The current .git directory is only 425KB. You could put three copies on a floppy disk!
Never happened. We ended up with "Vista" instead.
Maybe they've learned their lessons on that.
I do recall some years ago seeing someone had posted (here maybe?) that they were using git to manage their home directory and everything in it.
Having the ability to easily switch between versions of my operating system could possibly be great.
git checkout -b upgrade/service-pack-12
# ...trying things...
# ...decide it was a terrible idea...
git reset --hard   # throw away the uncommitted changes
git checkout -     # back to the previous branch
Anyway, when they said this about a file system based on SQL Server, I really liked the idea. I had something similar in my head: instead of installing software to folders, you add a node and all necessary files are referenced in this node.
That is basically what apps are doing already. But imagine that your folders become tags with meaning, so when you put images into a 'year' or an 'event' node they are already sorted.
You might be interested by this:
In particular, Tagsistant allows things such as:
$ ls ~/myfiles/store/photos/@/London/
$ ls ~/myfiles/store/pictures:/aperture/gt/5.6/@
$ ls ~/myfiles/store/time:/hour/lt/3/@
There's also etckeeper for keeping /etc & co in a git repo, for a partial approach.
But GVFS feels more like the spiritual child of git and Dropbox Smart Sync (formerly Infinite).
GVFS isn't really a file system like NTFS or ext3. It's a way to work with a subset of a git repo in a flexible manner (not submodules).
Use GVFS instead of Windows Update when you need a new driver or the latest .NET framework.
Windows itself is already in a giant git repo, only pull the features you need?
A smaller OS could remove some of the pain points of small SSDs and the like on cheap consumer devices, and Windows Update can be rather finicky at times.
(I don't expect anything nefarious, Microsoft doesn't have the power these days to make GVFS the default git client for 90% of developers, it's just amusingly MS style...)
There is no point complaining that an earned and deserved reputation is earned and deserved.
Not a meme. An earned and deserved reputation. I'm sure Microsoft don't like their reputation and that's kind of the point.
The only thing that matters for me is Git gaining greater mindshare in the enterprise. And I think Microsoft's flurry of activity in the Git hosting world is not them trying to embrace it so they can extinguish it, but rather them trying to ensure they don't lose relevance.
In the past, when we had SCM diversity (ClearCase, Perforce, Subversion, Git, etc.), the companies that managed the SCM technology controlled the software development life cycle. However, with Git, where there is no owner, you can't control the software development life cycle like you used to.
Microsoft embracing Git like they are, is just them flexing their vast technical and financial resources, so they can get people to migrate to their ecosystem. They know they can't extinguish Git, and since they can't dictate how Git evolves, their next best option is to make working with Git better in their world.