An article titled "Why" ..... should usually explain WHY. This article, if anything, just asserts that git IS un-scalable. The closest thing to a "Why" is this: "Don't believe me? Fine. Go ahead and wait a minute after every git command while it scans your entire repo."
Which may be true, but this article is pretty thin
Are you looking for "why" as in "what are the architecture problems" or as in "what are the scenarios and metrics demonstrating the case"?
I can give you the latter - git does not support partial checkouts. If your repository is 80GB and you only need 5GB to work with, you will have to get all 80GB over the network, and it will be about 16 times as slow as it needs to be. The same problem does not exist in Perforce - you can get any subset you wish.
Linus was pinned on this in the talk he gave at Google, and his response was "well, you can create multiple repositories and then write scripts on top to manage all this". To his credit he admitted the problem, but he doesn't seem to care enough to do anything about it.
You're right that Linus probably doesn't care about large binary files. I think that's just as well though. git can't be all things to all people. My gut feeling is that making git good for huge binary repositories would mean sacrificing 80% of what makes it so sweet for regular development.
I'd even go so far as to say that the needs of versioning large asset files and source code are so different that a system optimized for one will always be deficient for the other. Therefore I don't think the idea of "scripts on top" is so bad. Actually the ideal would be a set of porcelain commands built on two separate subsystems.
Agreed. It's a tool designed for managing incremental changes to files that are typically merged rather than replaced. While some version control systems do better with large binary data, it seems like that would be better handled by a completely different kind of tool.
Keeping a script (or a makefile, etc.) under VC that contains paths to the most recent versions of the large builds (and their sha1 hashes) would probably suffice, in most cases.
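A rough sketch of what I mean (file names and paths here are just made up for illustration; the large files themselves live on a file server, not in git):

    # manifest_assets.py - toy sketch: record the path and sha1 hash of each
    # large asset in a small text manifest that is kept under version control,
    # while the assets themselves stay outside the repository.
    import hashlib, os

    ASSET_DIR = "/mnt/bigfiles/project"   # assumed location of the large files
    MANIFEST = "assets.manifest"          # this small file is what git tracks

    def sha1_of(path):
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    with open(MANIFEST, "w") as out:
        for root, _, files in os.walk(ASSET_DIR):
            for name in files:
                path = os.path.join(root, name)
                out.write("%s  %s\n" % (sha1_of(path), path))

Commit the manifest and your build scripts; the gigabytes of binaries never touch the repository.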
So a script is needed to keep the GUI of a program and the artwork in sync? I think that keeping all parts of a product in a single place, in sync, is the most basic requirement of source control. Saying that git is not designed for this sort of thing is saying that git is not designed to be an adequate source control system.
Tracking large data dependencies requires very different techniques from tracking source code, config files, etc., which are predominantly textual. They're usually several orders of magnitude larger (so hashing everything is impractical), and merging them automatically is far more problematic. Doing data tracking really well involves framing it as a fundamentally different problem (algorithmically, if nothing else), and I'm not surprised that git and most other VCSs handle it poorly - they're not designed for it. Rsync and unison are, though, and they can be very complementary to VC systems.
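To make the "fundamentally different problem" point concrete: rsync-style tools lean on rolling checksums so they can find matching blocks in huge files without rehashing a full window at every byte offset. A toy version of the idea (not rsync's actual algorithm) looks like this:

    # Toy rolling checksum: sliding the window by one byte is O(1), not
    # O(window size), which is what makes block matching over multi-gigabyte
    # files practical. rsync's real scheme adds a strong hash on top of this.
    def rolling_sums(data, window):
        s = sum(data[:window])
        yield s
        for i in range(window, len(data)):
            s += data[i] - data[i - window]   # drop the oldest byte, add the newest
            yield s

    print(list(rolling_sums(b"the quick brown fox", 4)))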
In my experience, tracking icons, images for buttons, etc. doesn't really impact VC performance much, but large databases, video, etc. definitely can.
So, a medical student tells his professor that he does not want to study ear problems but plans to specialize in nose problems instead. "Really?" asks the prof, "so where exactly do you plan to specialize - the left nostril or the right nostril?"
Similarly, git specializes in small, text-only projects, but will not work well for a project involving some sort of graphics-enriched GUI, a large code base, or both.
As I mention a few replies up, this covers about 99% of all programming projects. Very few projects have gigabytes of source code or graphics. If you just have a few hundred megabytes of graphics, git will do fine. If you only have 2 million lines of code, git will do fine.
Things larger than this are really rare, so it is not a problem that Git doesn't handle it well. Or rather, it's not a problem for very many people.
I recall we started this discussion with the question of whether git scales or not. Your position is essentially that git doesn't scale, but you don't care.
In other words it seems to me that everyone here agrees about the facts - git does not scale. And then some people seem to care about it and some don't. Are we on the same page now?
I'm a different person from jrockway, but my position is that while git alone does not scale when tracking very large binaries, it's an easily solved problem, as git is intentionally designed to work with other tools better suited to such tasks. Git + rsync solves that problem for me in full, and should scale fine.
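Concretely, my "git + rsync" setup is nothing fancier than a wrapper along these lines (the host and paths are placeholders):

    # sync.py - toy wrapper: git handles the source tree, rsync handles the
    # large binary assets that live next to (not inside) the repository.
    import subprocess

    subprocess.check_call(["git", "pull"], cwd="/home/me/project")
    subprocess.check_call([
        "rsync", "-az", "--delete",
        "assets-server:/srv/project-assets/",   # placeholder host and path
        "/home/me/project-assets/",
    ])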
Git doesn't include a source code editor, profiler, or web browser, either. The Unix way of design involves the creation of independent programs that intercommunicate via files and/or pipes. The individual programs themselves can be small and simple due to not trying to solve several different problems at once, and since they communicate via buffered pipes, you get Erlang-style message-passing parallelism for free at the OS level.
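As a small illustration of that last point, two programs wired together by a pipe already run concurrently, with the OS doing the buffering:

    # git streams commit ids into the pipe while wc counts lines from it;
    # neither program knows or cares about the other.
    import subprocess

    rev_list = subprocess.Popen(["git", "rev-list", "--all"],
                                stdout=subprocess.PIPE)
    wc = subprocess.Popen(["wc", "-l"], stdin=rev_list.stdout,
                          stdout=subprocess.PIPE)
    rev_list.stdout.close()   # so wc sees EOF when git finishes
    print(wc.communicate()[0].decode().strip(), "commits")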
Like I said, if you track the binary metadata (checksums and paths to fetch them from) in git but not the files themselves, your scaling problem goes away completely. If you have a personal problem with using more than one program jointly, that has nothing to do with git.
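The verification side of that is equally small; a sketch, assuming a manifest of "sha1  path" lines tracked in git and assets fetched out of band:

    # verify_assets.py - toy sketch: check that the local copies of the large
    # files match the checksums recorded in the tracked manifest.
    import hashlib, os

    def sha1_of(path):
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    with open("assets.manifest") as manifest:    # assumed file name
        for line in manifest:
            expected, path = line.split(None, 1)
            path = path.strip()
            if not os.path.exists(path) or sha1_of(path) != expected:
                print("stale or missing:", path)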
I will go a step further and say the people who think git should "scale" are wrong. The reason being that there are real tradeoffs that would need to be made. The vast majority of software projects should not be forced to use an inferior SCM just so that the few really large projects can be supported.
It's unfortunate that those people are forced to use Perforce... there's definitely a need for a "scalable" SCM, but git should not be it.
Sure, but this problem applies to huge software products like Microsoft Office, not the sort of software that the average developer at the average company works on every day. So while Perforce may be good for that 1% of the population, Git probably meets the needs of everyone else better.
(BTW, there are repositories bigger than the Linux kernel that behave very well under Git. The Emacs git repository is an example, it has almost 30 years of history, weighs in at around 150M, but still performs fine. I know that I have never been paid to work on any application nearly this big, so I don't really worry about Git not meeting my needs.)
The former. Nobody is twisting this guy's arm to use git, so I don't care if it doesn't work for his scenario; I care if it works for me. I've heard the "can't deal with large binary files/repos" claim many times before; I thought he was going to explain why that was the case.
How much code do you think Github stores? I'd wager at least several gigabytes.
Git -does- scale. It just doesn't scale as one huge repository. You do need tools on top to manage multiple projects. It'd be nice to have an open source Github.
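The "tools on top" can be trivial, too; a sketch of the sort of thing I mean, assuming a plain text list of repository URLs:

    # repos.py - toy multi-repo helper: clone or update every repository
    # listed (one URL per line) in a plain text file.
    import os, subprocess

    with open("repos.txt") as f:                 # assumed file name
        for url in (line.strip() for line in f):
            if not url:
                continue
            name = url.rstrip("/").split("/")[-1].replace(".git", "")
            if os.path.isdir(name):
                subprocess.check_call(["git", "pull"], cwd=name)
            else:
                subprocess.check_call(["git", "clone", url])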