git on my local machine eats this tree for breakfast. I seceded from p4 a few months ago, and my workflow has never been smoother; I only ever interact with perforce for actual checkins, and to pull in other developers' changes.
I guess the moral here is that to talk about whether something "scales", we need to be clear about what dimension we're trying to scale. In our experience, trying to get perforce to scale to huge numbers of developers has taken a lot of effort. Since git is largely indifferent to the number of developers, it has the potential to work better in our environment. YMMV.
You also may as well say who it is ;)
I speak from experience; the game I'm working on at my studio has an in-house tool to optimally palettize 2D assets for an embedded system. It's a computationally hard problem, so a lot of intermediate data is generated to make the final build relatively fast.
This tool generates about five files, IIRC, and one folder, for each frame of animation. Multiply this by 30 frames per second for each sprite's anim. Multiply this by an entire game's assets. The result is 80,000+ files and folders checked into SVN. That's after the coder optimized the number of files needed slightly. It can take 20 minutes or more to do an SVN update because of all the checks needed.
The solution we really need is a versioning system for binary files. There aren't many of those. Maybe Perforce works. I don't know, I've never used it.
They can be used on things that physically resemble source code but aren't, like text-based document formats (raw HTML, for instance), but you probably won't need their full power and, by design, they lack features that would be useful in that case. For your convenience they are capable of storing small quantities of binary data since most projects have a little here and there. But in both cases, you're a bit out-of-spec.
When you try to stuff tons of binary data into them, they break. They do not just break at the practical level, in the sense that operations take a long time but maybe someday somebody could fix them with enough performance work. They break at the conceptual level. Their whole worldview is no longer valid. The invariants they are built on are gone. It's not just a little problem, it's fundamental, and here's the really important thing: It can't be fixed without making them no longer great source control systems.
I use git on a fairly big repository and the scanning is currently at the "annoying" level for me, but the scanning is there for good reasons, reasons related to its use as a source control system. On my small personal projects it definitely helps me a lot.
SVN is the wrong tool for the job. I don't know what the right tool is. It may not even exist. But SVN is still the wrong tool. (If nothing else you could hack together one of those key/value stores that have been on HN lately and cobble together something with the resulting hash values.)
And, going back to the original link, criticizing git for not working with a repository with large numbers of binary files is not a very interesting critique. If Perforce does work under those circumstances, I would conclude they've almost certainly had to make tradeoffs that make it a less powerful source control system. Based on what I've heard from Perforce users and critics, that is an accurate conclusion. But I have no direct experience myself.
Why? Is there some deep architectural reason why Git can't perform like Perforce on large binary files? Something so deep it cannot ever be fixed? I've read through this whole thread and see no such reason yet, only hints that it exists.
The problem is that if you aren't hip-deep in both systems, you often can't see the tradeoffs, or if someone explains them to you, you might say "But just do this and this and this and you're done!" Hopefully, you've had some experience of someone coming up to you and saying that about some system you've written, perhaps your boss, so you know how it just doesn't work that way, because it's never that easy. If you haven't had this experience, you probably won't understand this point until you have.
There are always tradeoffs.
Lately at my work, I've run into a series of issues as I get closer to optimal in some parts of the product I'm responsible for, where I have to make a decision that will either please one third of my customer base or two thirds of it. Neither option is wrong, doing both isn't feasible, and the losing customers call in and wonder why they can't have it their way, and there isn't an answer I can give that satisfies them... but nevertheless, I have to choose. (I don't want to give specific examples, but broad ones would include "case sensitivity" (either way you lose in some cases) or whether or not to show a moderately important but unavoidable error message; half your customers are annoyed it shows up and the other half would call in to complain that it doesn't.) You can't have it all.
Git deals with the state of the tree as a whole, while Perforce, Subversion, and some others work at a file-by-file level, so they only need to scan the huge files when adding them, doing comparisons for updates, etc. (Updating or scanning for changes on perforce or subversion does scan the whole tree, though, which can be very slow.)
You can make git ignore the binary files via .gitignore, of course, and then they won't slow source tracking down anymore. You need to use something else to keep them in sync, though. (Rsync works well for me. Unison is supposed to be good for this, as well.) You can still track the file metadata, such as a checksum and a path to fetch it from automatically, in a text file in git. It won't be able to do merges on the binaries, but how often do you merge binaries?
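Something like this rough Python sketch is what I mean by tracking the metadata; the "assets" directory and "assets.manifest" file names are just placeholders, and rsync/unison moves the actual bytes around separately. Only the manifest gets committed:

    import hashlib
    import os

    ASSET_DIR = "assets"          # ignored via .gitignore (hypothetical path)
    MANIFEST = "assets.manifest"  # small text file, tracked by git

    def sha1_of(path):
        # Hash in chunks so large binaries never have to fit in memory.
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    with open(MANIFEST, "w") as out:
        for root, _dirs, files in os.walk(ASSET_DIR):
            for name in sorted(files):
                path = os.path.join(root, name)
                out.write("%s  %s\n" % (sha1_of(path), path))

Diffs on the manifest then tell you exactly which binaries changed between commits, even though git never sees their contents.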
I wonder if having some way to tell git to do the same for files with a particular extension/in a particular directory would get us a decent amount of the way there?
(oops, missed the edit window)
As I've said elsewhere in this thread, tracking metadata (path and sha1 hash) for large binary files in git and otherwise ignoring them via .gitignore works quite well. I'm pretty sure the "right tool" is either rsync or something similar.
If memory serves, you can automatically mount daily snapshots of FreeBSD's standard filesystem. (I'm using OpenBSD, which is slightly different.)
Vesta is a cool piece of technology in many ways. It has a pure functional build language, completely parallelisable builds with accurate caching between builds, etc. The guy who maintains it at Intel is a bit bitter because he can't understand why less capable version control or build systems are more popular, which is really because Vesta's advantages show up best in quite large projects, but once your project is that large, it's very hard to switch.
The web page looks moribund, but in fact it's still actively developed, and the developers hang out on IRC.
It does seem to handle large binary files reasonably well, at least, though fully scanning for any changes (the equivalent of "git status") generally takes about two minutes on my computer, so it's a mixed blessing. I think tracking binary data would be better handled by a fundamentally different kind of tool, really; there are major differences between managing large binaries vs. managing heuristically merge-able, predominantly textual data.
* One example: Two functions with similar names but reversed arguments (e.g., methodA(from, to) and methodB(to, from)) had the arguments transposed during the merge in many, many files. This introduced some really subtle bugs. It also happened again during the next major merge from ongoing-development to release.
I don't believe in 6GB source trees without binary blobs. Perhaps the Windows codebase is that large, but I would assume it's hosted in multiple repos and very rarely is built/checked out by individual developers all at once.
No kidding! Even the worst copy-and-paste programming would compress well. (Also, .gitignore + rsync seems like the best option to me, too.)
Also, keep in mind that git will hash all of a commit at once. So if 500 2KB files changed in a commit, the time spent hashing should be almost the same as hashing a 1MB file.
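For reference, here's a tiny Python sketch of how git names a blob: the SHA-1 covers a "blob <size>\0" header plus the raw contents, so hashing cost is roughly proportional to the total number of bytes rather than the number of files.

    import hashlib

    def git_blob_sha1(data):
        # Git's object name for a blob: SHA-1 over a small header plus the contents.
        header = b"blob %d\0" % len(data)
        return hashlib.sha1(header + data).hexdigest()

    # 500 files of 2 KB and a single 1 MB file involve about the same number of
    # hashed bytes, modulo a little per-object overhead.
    print(git_blob_sha1(b"hello\n"))  # should match `echo hello | git hash-object --stdin`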
Which may be true, but this article is pretty thin.
I can give you the latter - git does not support partial checkouts. If your repository is 80GB and you only need 5GB to work with, you will have to get all 80GB over the network, and it will be about 16 times as slow as it needs to be. The same problem does not exist in perforce - you can get any subset you wish.
Linus was pinned on this in the talk he gave at Google, and his response was "well, you can create multiple repositories and then write scripts on top to manage all this". To his credit he admitted the problem, but then he doesn't seem to care enough to do anything about it.
I'd even go so far as to say that the needs of versioning large asset files and source code are so different that a system optimized for one will always be deficient for the other. Therefore I don't think the idea of "scripts on top" is so bad. Actually the ideal would be a set of porcelain commands built on two separate subsystems.
Keeping a script (or a makefile, etc.) under VC that contains paths to the most recent versions of the large builds (and their sha1 hashes) would probably suffice, in most cases.
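Roughly the sort of script I have in mind, as a Python sketch with made-up names ("assets.manifest" as the checked-in list of sha1/path pairs, and a hypothetical rsync source): re-hash what's on disk against the recorded hashes and pull anything missing or stale.

    import hashlib
    import os
    import subprocess

    MANIFEST = "assets.manifest"        # under version control
    REMOTE = "fileserver:/srv/assets/"  # made-up rsync source

    def sha1_of(path):
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    with open(MANIFEST) as f:
        for line in f:
            expected, path = line.split(None, 1)
            path = path.strip()
            if os.path.exists(path) and sha1_of(path) == expected:
                continue  # local copy already matches the recorded hash
            os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
            subprocess.check_call(["rsync", "-a", REMOTE + path, path])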
Tracking large data dependencies requires very different techniques from tracking source code, config files, etc. which are predominantly textual. They're usually several orders of magnitude larger (so hashing everything is impractical), and merging things automatically is far more problematic. Doing data tracking really well involves framing it as a fundamentally different problem (algorithmically, if nothing else), and I'm not surprised that git and most other VCSs handle that poorly - they're not designed for it. Rsync and unison are, though, and they can be very complementary to VC systems.
In my experience, tracking icons, images for buttons, etc. doesn't really impact VC performance much, but large databases, video, etc. definitely can.
Similarly, git specializes in small, text-only projects, but will not work well for a project involving some sort of graphics-enriched GUI, a large code base, or both.
As I mention a few replies up, this covers about 99% of all programming projects. Very few projects have gigabytes of source code or graphics. If you just have a few hundred megabytes of graphics, git will do fine. If you only have 2 million lines of code, git will do fine.
Things larger than this are really rare, so it is not a problem that Git doesn't handle it well. Or rather, it's not a problem for very many people.
In other words it seems to me that everyone here agrees about the facts - git does not scale. And then some people seem to care about it and some don't. Are we on the same page now?
Git doesn't include a source code editor, profiler, or web browser, either. The Unix way of design involves the creation of independent programs that intercommunicate via files and/or pipes. The individual programs themselves can be small and simple due to not trying to solve several different problems at once, and since they communicate via buffered pipes, you get Erlang-style message-passing parallelism for free at the OS level.
Like I said, if you track the binary metadata (checksums and paths to fetch them from) in git but not the files themselves, your scaling problem goes away completely. If you have a personal problem with using more than one program jointly, that has nothing to do with git.
It's unfortunate that those people are forced to use Perforce... there's definitely a need for a "scalable" SCM, but git should not be it.
(BTW, there are repositories bigger than the Linux kernel that behave very well under Git. The Emacs git repository is an example, it has almost 30 years of history, weighs in at around 150M, but still performs fine. I know that I have never been paid to work on any application nearly this big, so I don't really worry about Git not meeting my needs.)
Git -does- scale. It just doesn't scale as one huge repository. You do need tools on top to manage multiple projects. It'd be nice to have an open source Github.
Consider also: is it possible that the multi-gigabyte repositories this guy's thinking of are byproducts of extremely crappy version control disciplines?
It seems that most of the people discussing here agree that "Git is good for source version control, while Perforce is for asset version control". So I'm not really sure what people are arguing about.
1. Check in symbolic links to git. You can include the SHA-1 or MD5 in the file name.
2. Have those symbolic links point to your large out-of-tree directory of binary files.
3. rsync the out-of-tree directory when you need to do work off the server
4. Have a git hook check to see whether those files are present on your machine when you pull, and to update the SHA-1s in the symbolic link filenames when you push
By using symbolic links, at least you have the dependencies encoded within git, even if the big files themselves aren't there.
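A rough sketch of what that hook (step 4) might look like, in Python, assuming a made-up convention of appending the SHA-1 to the link name; it just warns about missing or stale assets rather than failing the pull:

    #!/usr/bin/env python3
    import hashlib
    import os
    import subprocess

    def sha1_of(path):
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # List every tracked path; symlinks show up with mode 120000.
    listing = subprocess.check_output(["git", "ls-files", "-s"]).decode()
    for line in listing.splitlines():
        mode, _obj, _stage, name = line.split(None, 3)
        if mode != "120000":
            continue
        target = os.path.realpath(name)     # the out-of-tree file
        expected = name.rsplit(".", 1)[-1]  # SHA-1 embedded in the link name
        if not os.path.exists(target):
            print("missing asset: %s -> %s" % (name, target))
        elif len(expected) == 40 and sha1_of(target) != expected:
            print("stale asset: %s (hash does not match its name)" % name)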
I'm not sure "Git doesn't handle this dysfunctional source control usage" is really a valid complaint.
I'll agree with you that making git handle huge binary repositories speedily is probably not a worthwhile effort.
For program source control, throwing in a bunch of unnecessary files is dysfunctional. Assets are, by definition, not unnecessary files.
Sounds like an opportunity for a startup or at least a cool open source project ("make something people want")
To surpass the existing systems, you really need to work on the actual production assets, and in the environment where hundreds of artists/programmers are updating it daily. Such large "practical" production assets are not easily accessible to startups or open-source projects.
I'm not saying it's impossible, though; AlienBrain seems to manage to sustain a business. And I've seen few people who are truly happy with their asset management systems.