
Why Perforce is more scalable than Git  - peter123
http://gandolf.homelinux.org/blog/index.php?id=50
======
kmavm
I should probably be somewhat careful what I say here, but I work for a
commercial software company with one of the largest perforce databases
(according to perforce) in existence. 3000 developers working on several
million lines of code; I don't know how many MB offhand, but I'd guess O(1GB).
Our central perforce server is a monster that we literally cannot keep fed
with enough RAM, yet we frequently wait for a minute or two for a simple p4
submit, or p4 sync, to complete.

git on my local machine eats this tree for breakfast. I seceded from p4 a few
months ago, and my workflow has never been smoother; I only ever interact with
perforce for actual checkins, and to pull in other developers' changes.

I guess the moral here is that to talk about whether something "scales", we
need to be clear about what dimension we're trying to scale. In our
experience, trying to get perforce to scale to huge numbers of developers has
taken a lot of effort. Since git is indifferent to the number of developers,
it has
the potential to work better in our environment. YMMV.

~~~
michaelneale
good to know - the local-git remote-other model seems to be pretty common (and
one of the strengths of git - as it's so easy to get started with git).

You also may as well say who it is ;)

------
triplefox
This article doesn't really state the problem precisely; that problem is large
data sets. The Linux kernel has a lot of code but weighs in at ~300MB - not
even a gigabyte uncompressed. But if you start checking in lots of binaries -
or if you use source control for assets as happens in a game production
environment - you start having serious, serious scalability problems because
hashing those huge files is no longer fast, and traversing deep directory
structures with lots of files becomes a scary problem.

I speak from experience; the game I'm working on at my studio has an in-house
tool to optimally palettize 2D assets for an embedded system. It's a
computationally hard problem so a lot of intermediate data is generated to
make the final build relatively fast.

This tool generates about five files, IIRC, and one folder, for each frame of
animation. Multiply this by 30 frames per second for each sprite's anim.
Multiply this by an entire game's assets. The result is 80,000+ files and
folders checked into SVN. That's after the coder optimized the number of files
needed slightly. It can take 20 minutes or more to do an SVN update because of
all the checks needed.

The solution we really need is a versioning system for binary files. There
aren't many of those. Maybe Perforce works. I don't know, I've never used it.

~~~
jerf
Some of your repliers are beating around the bush a little, I'm going to come
right out and say it: Source control systems are for _controlling source
code_. They are built from top to bottom around the idea that they are storing
text files. They are built around the idea that a text-based patch is a
meaningful thing to use. They are built around the idea that there is a
reasonable merge algorithm to use to merge two people's changes.

They can be used on things that physically resemble source code but aren't,
like text-based document formats (raw HTML, for instance), but you probably
won't need their full power and, by design, they lack features that would be
useful in that case. For your convenience they are capable of storing small
quantities of binary data since most projects have a little here and there.
But in both cases, you're a bit out-of-spec.

When you try to stuff tons of binary data into them, they break. They do not
just break at the practical level, in the sense that operations take a long
time but maybe someday somebody could fix them with enough performance work.
They break at the _conceptual_ level. Their whole worldview is no longer
valid. The invariants they are built on are gone. It's not just a little
problem, it's fundamental, and here's the really important thing: It can't be
fixed without making them no longer great source control systems.

I use git on a fairly big repository and the scanning is currently at the
"annoying" level for me, but the scanning is there for good reasons, reasons
related to its use as a source control system. On my small personal projects
it definitely helps me a lot.

SVN is the wrong tool for the job. I don't know what the right tool is. It may
not even exist. But SVN is still the wrong tool. (If nothing else, you could
take one of those key/value stores that have been on HN lately and cobble
together something with the resulting hash values.)

And, going back to the original link, criticizing git for not working with a
repository with large numbers of binary files is not a very interesting
critique. If Perforce does work under those circumstances, I would conclude
they've almost certainly had to make tradeoffs that make it a less powerful
source control system. Based on what I've heard from Perforce users and
critics, that is an accurate conclusion. But I have no direct experience
myself.

~~~
cousin_it
_It can't be fixed without making them no longer great source control
systems._

Why? Is there some deep architectural reason why Git can't perform like
Perforce on large binary files? Something so deep it cannot ever be fixed?
I've read through this whole thread and see no such reason yet, only hints
that it exists.

~~~
silentbicycle
To my understanding: when git looks for changes, it scans all the tracked
files (checking timestamps before hashing) and hashes those that look changed.
Commits record the new contents of each changed file, generate a hash for each
file, then hash the set of files in each directory for a hash of the
directory, repeated until the state of the entire repository is collected into
one hash. This is normally pretty fast, and has a lot of advantages as a way
to represent state changes in the project, but it also means that if the
project has several huge binary files sitting around (or thousands of large
binaries, etc.), it will have to hash them as well. This requires a full pass
through each file any time it looks like it might have changed, when new files
are added, etc. (Mercurial works very similarly, though the internal data
structures are different.) Running sha1 on a 342MB file just took about 9
seconds on my computer; this goes up linearly with file size.
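
A rough sketch of that recursive hashing idea (purely illustrative - a
simplification, not git's actual blob/tree object format):

    import hashlib, os

    def hash_file(path):
        # Full pass over the file contents; cost grows linearly with size.
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)
        return h.hexdigest()

    def hash_tree(dirpath):
        # Hash each entry, then hash the sorted (hash, name) pairs, so one
        # hash summarizes the directory and, recursively, the whole tree.
        h = hashlib.sha1()
        for name in sorted(os.listdir(dirpath)):
            full = os.path.join(dirpath, name)
            entry = hash_tree(full) if os.path.isdir(full) else hash_file(full)
            h.update(('%s %s\n' % (entry, name)).encode())
        return h.hexdigest()

    print(hash_tree('.'))

A change anywhere in the tree changes the top-level hash, which is what makes
whole-tree comparisons cheap - but it also means every big binary that looks
changed has to be read end to end.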

Git deals with the state of the tree _as a whole_ , while Perforce,
Subversion, and some others work at a _file-by-file level_ , so they only need
to scan the huge files when adding them, doing comparisons for updates, etc.
(Updating or scanning for changes on perforce or subversion _does_ scan the
whole tree, though, which can be very slow.)

You can make git ignore the binary files via .gitignore, of course, and then
they won't slow source tracking down anymore. You need to use something else
to keep them in sync, though. (Rsync works well for me. Unison is supposed to
be good for this, as well.) You can still track the file metadata, such as a
checksum and a path to fetch it from automatically, in a text file in git. It
won't be able to do merges on the binaries, but how often do you merge
binaries?
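
As a sketch of that metadata approach (the manifest name and format here are
made up, not any standard tool's): keep a small text file under git that
records each binary's checksum and where to fetch it from, and verify local
copies with something like:

    import hashlib, os, sys

    # 'assets.manifest' is a hypothetical text file tracked in git; each line:
    #   <sha1>  <relative path>  <rsync source to fetch it from>
    def verify(manifest='assets.manifest'):
        for line in open(manifest):
            sha1, path, source = line.split()
            if not os.path.exists(path):
                print('missing: %s (fetch from %s)' % (path, source))
                continue
            h = hashlib.sha1()
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(1 << 20), b''):
                    h.update(chunk)
            if h.hexdigest() != sha1:
                print('stale: %s (expected %s)' % (path, sha1))

    if __name__ == '__main__':
        verify(*sys.argv[1:])

The binaries themselves stay out of the repository; rsync (or whatever you
prefer) moves the actual bytes.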

~~~
mst
Well, rsync has a mode where you say "look, if the file size and timestamp are
the same, please just assume it hasn't changed - I'm happy with this and am
willing to accept that if that isn't sufficient any resulting problems are
mine".

I wonder if having some way to tell git to do the same for files with a
particular extension/in a particular directory would get us a decent amount of
the way there?

~~~
silentbicycle
While I haven't checked the source, I'm pretty sure by default git doesn't re-
hash any files unless the timestamps have changed. (I know Mercurial doesn't.)

~~~
silentbicycle
The scaling situation for the top post would involve large binaries (builds or
generated data) being added on a regular basis.

(oops, missed the edit window)

------
krschultz
An article titled "Why ..." should usually explain WHY. This article, if
anything, just asserts that git IS un-scalable. The closest thing to a "Why"
is this "Don't believe me? Fine. Go ahead and wait a minute after every git
command while it scans your entire repo."

Which may be true, but this article is pretty thin.

~~~
DenisM
Are you looking for "why" as in "what are the architecture problems" or as in
"what are the scenarios and metrics demonstrating the case"?

I can give you the latter - git does not support partial checkouts. If your
repository is 80GB and you only need 5GB to work with, you will have to get
all 80GB over the network, and it will be about 16 times as slow as it needs
to be. The same problem does not exist in perforce - you can get any subset
you wish.

Linus was pinned on this in the talk he gave at Google, and his response was
"well, you can create multiple repositories and then write scripts on top to
manage all this". To his credit he admitted the problem, but then he doesn't
seem to care enough to do anything about it.

~~~
dasil003
You're right that Linus probably doesn't care about large binary files. I
think that's just as well though. git can't be all things to all people. My
gut feeling is that making git good for huge binary repositories would mean
sacrificing 80% of what makes it so sweet for regular development.

I'd even go so far as to say that the needs of versioning large asset files
and source code are so different that a system optimized for one will always
be deficient for the other. Therefore I don't think the idea of "scripts on
top" is so bad. Actually the ideal would be a set of porcelain commands built
on two separate subsystems.

~~~
silentbicycle
Agreed. It's a tool designed for managing incremental changes to files that
are typically merged rather than replaced. While some version control systems
do better with large binary data, it seems like that would be better handled
by a completely different kind of tool.

Keeping a script (or a makefile, etc.) under VC that contains paths to the
most recent versions of the large builds (and their sha1 hashes) would
probably suffice, in most cases.

~~~
DenisM
So a script is needed to keep the GUI of a program and the artwork in sync? I
think that keeping all parts of a product in a single place, in sync, is the
most basic requirement of source control. Saying that git is not designed
for this sort of thing is saying that git is not designed to be an adequate
source control system.

~~~
silentbicycle
_source_ control.

Tracking large data dependencies requires very different techniques from
tracking source code, config files, etc. which are predominantly textual.
They're usually several orders of magnitude larger (so hashing everything is
impractical), and merging things automatically is far more problematic. Doing
data tracking really well involves framing it as a fundamentally different
problem (algorithmically, if nothing else), and I'm not surprised that git and
most other VCSs handle that poorly - they're not designed for it. Rsync and
Unison are, though, and they can be very complementary to VC systems.

In my experience, tracking icons, images for buttons, etc. doesn't really
impact VC performance much, but large databases, video, etc. definitely can.

~~~
DenisM
So, a medical student tells his professor that he does not want to study ear
problems but plans to specialize in nose problems instead. "Really?" asks the
prof, "so where exactly do you plan to specialize - the left nostril or the
right nostril?"

Similarly, git specializes in small, text-only projects, but will not work
well for a project involving some sort of graphics-enriched GUI, a large code
base, or both.

~~~
jrockway
_Similarly, git specializes in small, text-only projects, but will not work
well for a project involving some sort of graphics-enriched GUI, a large code
base, or both._

As I mention a few replies up, this covers about 99% of all programming
projects. Very few projects have gigabytes of source code or graphics. If you
just have a few hundred megabytes of graphics, git will do fine. If you only
have 2 million lines of code, git will do fine.

Things larger than this are really rare, so it is not a problem that Git
doesn't handle it well. Or rather, it's not a problem for very many people.

~~~
DenisM
I recall we started this discussion with the question of whether git scales or
not. Your position is essentially that git doesn't scale but you don't care.

In other words, it seems to me that everyone here agrees about the facts - git
does not scale. And then some people seem to care about it and some don't. Are
we on the same page now?

~~~
silentbicycle
I'm a different person from jrockway, but my position is that while git
_alone_ does not scale when tracking very large binaries, it's an easily
solved problem, as it is intentionally designed to work with other tools
better suited to such tasks. Git + rsync solves that problem for me in full,
and should scale fine.

Git doesn't include a source code editor, profiler, or web browser, either.
The Unix way of design involves the creation of independent programs that
intercommunicate via files and/or pipes. The individual programs themselves
can be small and simple due to not trying to solve several different problems
at once, and since they communicate via buffered pipes, you get Erlang-style
message-passing parallelism for free at the OS level.

Like I said, if you track the binary metadata (checksums and paths to fetch
them from) in git but not the files themselves, your scaling problem goes away
completely. If you have a personal problem with using more than one program
jointly, that has nothing to do with git.

------
tptacek
I have never worked on a 6GB repository, and apart from my (pretty solid)
development background, I've spent the last 3 years doing code audits for huge
software shops. The numbers in this post aren't compelling.

Consider also: is it possible that the multi-gigabyte repositories this guy's
thinking of are byproducts of extremely crappy version control disciplines?

~~~
shiro
To your question: I guess not. He just has a different development background.
Productions like game dev or CG work require tracking huge binary objects and
source code in sync (I'd argue that, in these environments, any attempt to
manage binary objects and sources in separate systems has eventually failed;
here, source code includes shader programs, for example, which typically
depend on the texture images they work on; they're inseparable. I've been
there, done that)---but if you're in this industry and have managed to do it
well, I'm eager to hear about your experience.

It seems that most of the people discussing here agree that "Git is good for
_source_ version control, while Perforce is for _asset_ version control". So
I'm not really sure what people are arguing about.

------
balajis
Best way I've found to do joint versioning of code with large datasets
(whether binary or tab-delimited text):

1. Check in symbolic links to git. You can include the SHA-1 or MD5 in the
file name.

2. Have those symbolic links point to your large out-of-tree directory of
binary files.

3. rsync the out-of-tree directory when you need to do work off the server.

4. Have a git hook check to see whether those files are present on your
machine when you pull, and to update the SHA-1s in the symbolic link filenames
when you push.

By using symbolic links, at least you have the dependencies encoded within
git, even if the big files themselves aren't there.
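
A minimal sketch of steps 1, 2, and the pull-side check in step 4 (the asset
directory, link naming, and hook wiring here are hypothetical, not a standard
convention):

    import hashlib, os

    ASSET_DIR = '/data/assets'   # hypothetical out-of-tree directory, synced via rsync

    def sha1_of(path):
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)
        return h.hexdigest()

    def link_asset(name):
        # Steps 1-2: create a symlink whose name carries the content hash,
        # pointing at the real file in the out-of-tree asset directory.
        target = os.path.join(ASSET_DIR, name)
        link = '%s.%s.lnk' % (name, sha1_of(target))
        if not os.path.islink(link):
            os.symlink(target, link)
        return link

    def check_assets(links):
        # Step 4 (pull side): something a post-merge/post-checkout hook could
        # run to warn about asset files that are missing locally.
        for link in links:
            if not os.path.exists(os.readlink(link)):
                print('missing asset:', os.readlink(link))

The symlinks themselves are tiny, so git handles them happily, while the hash
in the name tells you which version of the big file the tree expects.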

------
andr
The article never actually says why Perforce is more scalable, except that you
have "an IT department taking care of it."

~~~
silentbicycle
Perforce only scans through everything when you update or scan for changed
files, but mostly tracks changes file by file. Git (and mercurial) work with
the state of the tree as a whole for most operations, because this makes
handling searching, branching, merging, etc. much nicer, but it means that
when you have gigantic binary files just sitting around, it takes time
scanning them as well.

~~~
andr
Why not cache the MD5/SHA1 hashes and only update them when the timestamp of
the file changes?

~~~
silentbicycle
Mercurial does, by default. I'm almost certain git already does, too.
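
A sketch of what such a cache could look like (purely illustrative; not how
git's index or Mercurial's dirstate are actually laid out):

    import hashlib, json, os

    CACHE_PATH = '.hashcache.json'   # hypothetical cache file

    def load_cache():
        try:
            with open(CACHE_PATH) as f:
                return json.load(f)
        except (IOError, ValueError):
            return {}

    def cached_sha1(path, cache):
        st = os.stat(path)
        key = [st.st_mtime, st.st_size]
        entry = cache.get(path)
        if entry and entry['key'] == key:
            return entry['sha1']       # mtime and size unchanged: skip hashing
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)
        cache[path] = {'key': key, 'sha1': h.hexdigest()}
        return cache[path]['sha1']

    def save_cache(cache):
        with open(CACHE_PATH, 'w') as f:
            json.dump(cache, f)

The big files still get read in full the first time and whenever their
timestamps change, which is where the cost shows up in practice.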

------
Semiapies
I actually go along with the "put any binary or tool you need to make this run
in the repository" school of thought, but I don't stuff generated files in the
repository.

I'm not sure "Git doesn't handle this dysfunctional source control usage" is
really a valid complaint.

~~~
dasil003
It's not really fair to call it "dysfunctional source control usage" (at least
no fairer than the OA). The problem is with large asset files, which though
almost always "generated" in the sense that people don't hand-craft them like
code, are not necessarily a function of what is stored in the repo.

I'll agree with you that making git handle huge binary repositories speedily
is probably not a worthwhile effort.

~~~
Semiapies
Asset files, as in triplefox's comment above, are a _very_ different thing
from the OA's mention of keeping the object files from nightly builds and
anything else they think of to throw into the repository.

For program source control, throwing in a bunch of unnecessary files is
dysfunctional. Assets are, by definition, not unnecessary files.

------
zacharypinter
I wonder how difficult it would be to update git to support special handling
of binary files? Perhaps a .gitbinary file (similar to .gitignore) which
basically tells git to ignore a directory unless a specific command (say git
binary sync) is run. Or, perhaps handling binary files better is just a matter
of faster hashes? Could the filesystem be any help here? I know ZFS keeps
checksums of all its data blocks.
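
One way something like that could work outside of git today (everything here
is hypothetical - git has no .gitbinary file or "git binary sync" command):
list the binary directories in a file, put the same patterns in .gitignore so
git never scans them, and have a small helper move the bytes around:

    import subprocess

    # '.gitbinary' is a made-up file listing directories git should not track;
    # the same patterns would also go into .gitignore.
    def binary_dirs(listing='.gitbinary'):
        return [line.strip() for line in open(listing) if line.strip()]

    def binary_sync(remote='server:/srv/assets'):
        # Stand-in for a "git binary sync" subcommand: rsync each listed
        # directory from a central location instead of storing it in git.
        for d in binary_dirs():
            subprocess.check_call(['rsync', '-a', '%s/%s/' % (remote, d), d + '/'])

    if __name__ == '__main__':
        binary_sync()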

------
plinkplonk
Maybe someone who works in an environment with large binary files that need to
be tracked should just buckle down and write a tool appropriate for that
scenario? (vs. trying to force-fit git, a tool designed for a different set of
use cases)

Sounds like an opportunity for a startup or at least a cool open source
project ("make something people want")

~~~
shiro
The main problem is that there is a limited number of users that really need
large asset tracking solutions, and these big players have spent many years on
it and now have something more or less working.

To surpass the existing systems, you really need to work on the actual
production assets, and in the environment where hundreds of
artists/programmers are updating it daily. Such large "practical" production
assets are not easily accessible to startups or open-source projects.

I'm not saying it is impossible, though; AlienBrain seems to sustain a
business. And I've seen few people who are truly happy with their asset
management systems.

