Hacker News
Why Perforce is more scalable than Git (homelinux.org)
29 points by peter123 on March 14, 2009 | 47 comments

I should probably be somewhat careful what I say here, but I work for a commercial software company with one of the largest perforce databases (according to perforce) in existence. 3000 developers working on several million lines of code; I don't know how many MB offhand, but I'd guess O(1GB). Our central perforce server is a monster that we literally cannot keep fed with enough RAM, yet we frequently wait for a minute or two for a simple p4 submit, or p4 sync, to complete.

git on my local machine eats this tree for breakfast. I seceded from p4 a few months ago, and my workflow has never been smoother; I only ever interact with perforce for actual checkins, and to pull in other developers' changes.

I guess the moral here is that to talk about whether something "scales", we need to be clear about what dimension we're trying to scale. In our experience, trying to get perforce to scale to huge numbers of developers has taken a lot of effort. Since git is numb to the number of developers, it has the potential to work better in our environment. YMMV.

Good to know - the local-git remote-other model seems to be pretty common (and one of the strengths of git, as it's so easy to get started with git).

You also may as well say who it is ;)

This article doesn't really state the problem precisely; that problem is large data sets. The Linux kernel has a lot of code but weighs in at ~300MB - not even a gigabyte uncompressed. But if you start checking in lots of binaries - or if you use source control for assets, as happens in a game production environment - you start having serious, serious scalability problems, because hashing those huge files is no longer fast, and traversing deep directory structures with lots of files becomes a scary problem.

I speak from experience; the game I'm working on at my studio has an in-house tool to optimally palettize 2D assets for an embedded system. It's a computationally hard problem, so a lot of intermediate data is generated to make the final build relatively fast.

This tool generates about five files, IIRC, and one folder, for each frame of animation. Multiply this by 30 frames per second for each sprite's anim. Multiply this by an entire game's assets. The result is 80,000+ files and folders checked into SVN. That's after the coder optimized the number of files needed slightly. It can take 20 minutes or more to do an SVN update because of all the checks needed.

The solution we really need is a versioning system for binary files. There aren't many of those. Maybe Perforce works. I don't know, I've never used it.

Some of your repliers are beating around the bush a little; I'm going to come right out and say it: Source control systems are for controlling source code. They are built from top to bottom around the idea that they are storing text files. They are built around the idea that a text-based patch is a meaningful thing to use. They are built around the idea that there is a reasonable merge algorithm to use to merge two people's changes.

They can be used on things that physically resemble source code but aren't, like text-based document formats (raw HTML, for instance), but you probably won't need their full power and, by design, they lack features that would be useful in that case. For your convenience they are capable of storing small quantities of binary data since most projects have a little here and there. But in both cases, you're a bit out-of-spec.

When you try to stuff tons of binary data into them, they break. They do not just break at the practical level, in the sense that operations take a long time but maybe someday somebody could fix them with enough performance work. They break at the conceptual level. Their whole worldview is no longer valid. The invariants they are built on are gone. It's not just a little problem, it's fundamental, and here's the really important thing: It can't be fixed without making them no longer great source control systems.

I use git on a fairly big repository and the scanning is currently at the "annoying" level for me, but the scanning is there for good reasons, reasons related to its use as a source control system. On my small personal projects it definitely helps me a lot.

SVN is the wrong tool for the job. I don't know what the right tool is. It may not even exist. But SVN is still the wrong tool. (If nothing else you could hack together one of those key/value stores that have been on HN lately and cobble together something with the resulting hash values.)

And, going back to the original link, criticizing git for not working with a repository with large numbers of binary files is not a very interesting critique. If Perforce does work under those circumstances, I would conclude they've almost certainly had to make tradeoffs that make it a less powerful source control system. Based on what I've heard from Perforce users and critics, that is an accurate conclusion. But I have no direct experience myself.

> It can't be fixed without making them no longer great source control systems.

Why? Is there some deep architectural reason why Git can't perform like Perforce on large binary files? Something so deep it cannot ever be fixed? I've read through this whole thread and see no such reason yet, only hints that it exists.

Tradeoffs. silentbicycle's explanation is pretty good, but I want to call out the fact that you simply can not have an optimal source control system and an optimal binary blob management system. The two share a lot of similarities and there's a core that you could probably extract to use to build both, but when you're talking optimal systems, there are forces that are in conflict.

The problem is that if you aren't hip-deep in both systems, you often can't see the tradeoffs, or if someone explains them to you, you might say "But just do this and this and this and you're done!" Hopefully, you've had some experience of someone coming up to you and saying that about some system you've written, perhaps your boss, so you know how it just doesn't work that way, because it's never that easy. If you haven't had this experience, you probably won't understand this point until you have.

There are always tradeoffs.

Lately at work, as I get closer to optimal in some parts of the product I'm responsible for, I've run into a series of issues where I have to make a decision that will please either one third of my customer base or the other two thirds. Neither group is wrong, doing both isn't feasible, and the losing customers call in and wonder why they can't have it their way. There isn't an answer I can give that satisfies them, but nevertheless I have to choose. (I don't want to give specific examples, but broad examples would include "case sensitivity" (either way you lose in some cases) or whether or not you show a particular, moderately important, unavoidable error message; half your customers are annoyed that it shows up and the other half would call in to complain if it didn't.) You can't have it all.

To my understanding: When git looks for changes it scans all the tracked files (checking timestamps before hashing) and hashes those that look changed. Commits generate patches for all changed files, then generate hashes for each file, then hash the set of files in each directory for a hash of the directory, repeated until the state of the entire repository is collected into one hash. This is normally pretty fast, and has a lot of advantages as a way to represent state changes in the project, but it also means that if the project has several huge binary files sitting about (or thousands of large binaries, etc.), it will have to hash them as well. This requires a full pass through each file any time it looks like it might have changed, when new files are added, etc. (Mercurial works very similarly, though the internal data structures are different.) Running sha1 on a 342M file just took about 9 seconds on my computer; this goes up linearly with file size.
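For anyone who wants to see the "hash files, then directories, then the root" idea concretely, here is a minimal sketch in Python. It is only an illustration of the general technique: the function names `hash_file` and `hash_tree` are made up for this example, and the byte layout it hashes is an assumption, not git's actual object format.

```python
import hashlib
import os


def hash_file(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks; the cost grows linearly with file size."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(directory):
    """Hash a directory from the sorted (name, child hash) pairs it contains.

    Hypothetical layout for illustration only; git's real tree objects differ.
    """
    digest = hashlib.sha1()
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        child = hash_tree(path) if os.path.isdir(path) else hash_file(path)
        digest.update(name.encode() + b"\0" + child.encode() + b"\n")
    return digest.hexdigest()
```

Changing one byte in any file changes that file's hash, which changes every directory hash above it, so a single root hash stands for the state of the whole tree. The catch described above is visible in `hash_file`: a huge binary needs a full read every time it has to be re-hashed.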

Git deals with the state of the tree as a whole, while Perforce, Subversion, and some others work at a file-by-file level, so they only need to scan the huge files when adding them, doing comparisons for updates, etc. (Updating or scanning for changes on perforce or subversion does scan the whole tree, though, which can be very slow.)

You can make git ignore the binary files via .gitignore, of course, and then they won't slow source tracking down anymore. You need to use something else to keep them in sync, though. (Rsync works well for me. Unison is supposed to be good for this, as well.) You can still track the file metadata, such as a checksum and a path to fetch it from automatically, in a text file in git. It won't be able to do merges on the binaries, but how often do you merge binaries?
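A rough sketch of that metadata file, in Python. Everything here is an assumption made for illustration: the JSON layout and the helper names `build_manifest`, `write_manifest` and `stale_assets` are hypothetical, and the actual transfer of the binaries is left to rsync or whatever you prefer.

```python
import hashlib
import json
import os


def sha1_of(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(asset_dir):
    """Map each asset's relative path to its size and SHA-1."""
    manifest = {}
    for root, _dirs, files in os.walk(asset_dir):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, asset_dir).replace(os.sep, "/")
            manifest[rel] = {"size": os.path.getsize(path), "sha1": sha1_of(path)}
    return manifest


def write_manifest(manifest, manifest_path):
    """Write the manifest as sorted JSON so it diffs cleanly as a text file."""
    with open(manifest_path, "w") as handle:
        json.dump(manifest, handle, indent=2, sort_keys=True)
        handle.write("\n")


def stale_assets(manifest, asset_dir):
    """Return relative paths that are missing locally or differ from the manifest."""
    stale = []
    for rel, entry in sorted(manifest.items()):
        path = os.path.join(asset_dir, *rel.split("/"))
        if not os.path.isfile(path) or sha1_of(path) != entry["sha1"]:
            stale.append(rel)
    return stale
```

The manifest is the only thing committed; the asset directory itself stays in .gitignore. After a pull, `stale_assets` tells you which binaries need to be fetched again.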

Well, rsync has a mode where you say "look, if the file size and timestamp are the same, please just assume it hasn't changed - I'm happy with this and am willing to accept that if that isn't sufficient any resulting problems are mine".

I wonder if having some way to tell git to do the same for files with a particular extension/in a particular directory would get us a decent amount of the way there?

While I haven't checked the source, I'm pretty sure by default git doesn't re-hash any files unless the timestamps have changed. (I know Mercurial doesn't.)

The scaling situation for the top post would involve large binaries (builds or generated data) being added on a regular basis.

(oops, missed the edit window)

Git IS a key/value store database. (A filesystem is a kind of database, too, of course.) There's a good summary of the internal data structures here -- http://www.kernel.org/pub/software/scm/git/docs/user-manual....

As I've said elsewhere in this thread, tracking metadata (path and sha1 hash) for large binary files in git and otherwise ignoring them via .gitignore works quite well. I'm pretty sure the "right tool" is either rsync or something similar.

You know, that got me thinking. ZFS will do versioning and, as a filesystem, you'd be keeping your binary data in it anyway. In ZFS, this versioning is implemented as a tree of data blocks, only those blocks that change between versions would be "new". If a block is unchanged, ZFS can exploit shared structure to avoid needless copying.

Right. You can do a lot of things (version control and encryption come to mind) at the filesystem level. A filesystem is a specialized kind of database, anyway, and databases are surprisingly versatile.

If memory serves, you can automatically mount daily snapshots of FreeBSD's standard filesystem. (I'm using OpenBSD, which is slightly different.)

There is an open source filesystem-based version control system: vesta (http://www.vestasys.org/). That's probably the sort of thing you need for this kind of work. It was developed by DEC and then Intel for chip development, and those guys check in binary blobs.

Vesta is a cool piece of technology in many ways. It has a pure functional build language, completely parallelisable builds with accurate caching between builds, etc. The guy who maintains it at Intel is a bit bitter because he can't understand why less capable version control or build systems are more popular. Which is really because vesta's advantages show up best in quite large projects, but once your project is that large, it's very hard to switch.

The web page looks moribund, but in fact it's still actively developed, and the developers hang out on IRC.

On several occasions, Perforce has introduced merge errors when merging from one branch to another.* Without mentioning (or speaking for) my employer, we have a project with about ten branches (4-5 in current use), 50-60k changelists in the history, 2-3 GB of data. The project is 10+ years old, but I don't believe the full history has been kept. We use Perforce, but several major developers don't fully trust it, and we're investigating other options.

It does seem to handle large binary files reasonably well, at least, though fully scanning for any changes (the equivalent of "git status") generally takes about two minutes on my computer, so it's a mixed blessing. I think tracking binary data would be better handled by a fundamentally different kind of tool, really; there are major differences between managing large binaries vs. managing heuristically merge-able, predominantly textual data.

* One example: Two functions with similar names but reversed arguments (i.e., methodA(from, to) and methodB(to, from)) had the arguments transposed during the merge in many, many files. It introduced some really subtle bugs. It also happened again during the next major merge from ongoing-development to release.

You are telling Perforce to force-merge files without you having a say in it, and you are complaining that some of them didn't go right? Sorry, but you won't get better results from anything else if you have that policy.

We have a very similar problem [except we have videos that need to sit right next to the code tree] and we have added those video subfolders into .gitignore and use rsync for those instead - simple and effective tool-per-task.

I don't believe in 6GB source trees without binary blobs. Perhaps the Windows codebase is that large, but I would assume it's hosted in multiple repos and is very rarely built or checked out all at once by individual developers.

> I don't believe in 6GB source trees without binary blobs.

No kidding! Even the worst copy-and-paste programming would compress well. (Also, .gitignore + rsync seems like the best option to me, too.)

Git has a feature called "superprojects". You would configure this project with all the parts of your project as "submodules". That way, when you need to pull down some new code (but not any media assets) you can just do that, but everything is still being tracked by the superproject.

Also, keep in mind that git will hash all of a commit at once. So if 500 2k files changed in a commit, the time spent hashing should be almost the same as hashing a 1MB file.

Media assets are definitely an interesting problem. (Of course, then you're beyond "source control" and into something else, but that something else is important.)

An article titled "Why..." should usually explain WHY. This article, if anything, just asserts that git IS un-scalable. The closest thing to a "Why" is this: "Don't believe me? Fine. Go ahead and wait a minute after every git command while it scans your entire repo."

Which may be true, but this article is pretty thin.

Are you looking for "why" as in "what are the architecture problems" or as in "what are the scenarios and metrics demonstrating the case"?

I can give you the latter - git does not support partial checkouts. If your repository is 80GB and you only need 5GB to work with, you will have to get all 80GB over the network, and it will be about 16 times as slow as it needs to be. The same problem does not exist in perforce - you can get any subset you wish.

Linus was pinned on this in the talk he gave at Google and his response was "well, you can create multiple repositories and then write scripts on top to manage all this". To his credit he admitted the problem, but then he doesn't seem to care enough to do anything about it.

You're right that Linus probably doesn't care about large binary files. I think that's just as well though. git can't be all things to all people. My gut feeling is that making git good for huge binary repositories would mean sacrificing 80% of what makes it so sweet for regular development.

I'd even go so far as to say that the needs of versioning large asset files and source code are so different that a system optimized for one will always be deficient for the other. Therefore I don't think the idea of "scripts on top" is so bad. Actually the ideal would be a set of porcelain commands built on two separate subsystems.

Agreed. It's a tool designed for managing incremental changes to files that are typically merged rather than replaced. While some version control systems do better with large binary data, it seems like that would be better handled by a completely different kind of tool.

Keeping a script (or a makefile, etc.) under VC that contains paths to the most recent versions of the large builds (and their sha1 hashes) would probably suffice, in most cases.

So a script is needed to keep the GUI of a program and the artwork in sync? I think that keeping all parts of a product in a single place, in sync, is the most basic requirement of source control. Saying that git is not designed for this sort of thing is saying that git is not designed to be an adequate source control system.

source control.

Tracking large data dependencies requires very different techniques from tracking source code, config files, etc., which are predominantly textual. They're usually several orders of magnitude larger (so hashing everything is impractical), and merging things automatically is far more problematic. Doing data tracking really well involves framing it as a fundamentally different problem (algorithmically, if nothing else), and I'm not surprised that git and most other VCSs handle that poorly - they're not designed for it. Rsync and unison are, though, and they can be very complementary to VC systems.

In my experience, tracking icons, images for buttons, etc. doesn't really impact VC performance much, but large databases, video, etc. definitely can.

So, a medical student tells his professor that he does not want to study the ear problems but plans to specialize on the nose problems instead. "Really?" asked the prof, "so where exactly do you plan to specialize - the left nostril or the right nostril?".

Similarly, git is specializing in small, text-only projects, but will not work well for a project involving some sort of graphics-enriched GUI or large code base or both.

> Similarly, git is specializing in small, text-only projects, but will not work well for a project involving some sort of graphics-enriched GUI or large code base or both.

As I mention a few replies up, this covers about 99% of all programming projects. Very few projects have gigabytes of source code or graphics. If you just have a few hundred megabytes of graphics, git will do fine. If you only have 2 million lines of code, git will do fine.

Things larger than this are really rare, so it is not a problem that Git doesn't handle it well. Or rather, it's not a problem for very many people.

I recall we started this discussion with the question of whether git scales or not. Your position is essentially that git doesn't scale but you don't care.

In other words it seems to me that everyone here agrees about the facts - git does not scale. And then some people seem to care about it and some don't. Are we on the same page now?

I'm a different person from jrockway, but my position is that while git alone does not scale when tracking very large binaries, it's an easily solved problem, as it is intentionally designed to work with other tools better suited to such tasks. Git + rsync solves that problem for me in full, and should scale fine.

Git doesn't include a source code editor, profiler, or web browser, either. The Unix way of design involves the creation of independent programs that intercommunicate via files and/or pipes. The individual programs themselves can be small and simple due to not trying to solve several different problems at once, and since they communicate via buffered pipes, you get Erlang-style message-passing parallelism for free at the OS level.

Like I said, if you track the binary metadata (checksums and paths to fetch them from) in git but not the files themselves, your scaling problem goes away completely. If you have a personal problem with using more than one program jointly, that has nothing to do with git.

I will go a step further and say the people who think git should "scale" are wrong. The reason being that there are real tradeoffs that would need to be made. The vast majority of software projects should not be forced to use an inferior SCM just so that the few really large projects can be supported.

It's unfortunate that those people are forced to use Perforce... there's definitely a need for a "scalable" SCM, but git should not be it.

Sure, but this problem applies to huge software products like Microsoft Office, not the sort of software that the average developer at the average company works on every day. So while Perforce may be good for that 1% of the population, Git probably meets the needs of everyone else better.

(BTW, there are repositories bigger than the Linux kernel that behave very well under Git. The Emacs git repository is an example, it has almost 30 years of history, weighs in at around 150M, but still performs fine. I know that I have never been paid to work on any application nearly this big, so I don't really worry about Git not meeting my needs.)

The former. Nobody is twisting this guy's arm to use git, so I don't care if it doesn't work for his scenario; I care if it works for me. I've heard the "can't deal with large binary files/repos" complaint many times before, and I thought he was going to explain why that was the case.

How much code do you think Github stores? I'd wager at least several gigabytes.

Git -does- scale. It just doesn't scale as one huge repository. You do need tools on top to manage multiple projects. It'd be nice to have an open source Github.

I have never worked on a 6GB repository, and apart from my (pretty solid) development background, I've spent the last 3 years doing code audits for huge software shops. The numbers in this post aren't compelling.

Consider also: is it possible that the multi-gigabyte repositories this guy's thinking of are byproducts of extremely crappy version control disciplines?

To your question: I guess not. He just has a different development background. Productions like game dev or CG work require tracking huge binary objects and source code in sync. (I'd argue that, in these environments, any attempt to manage binary objects and sources in separate systems has eventually failed; here, source code includes shader programs, for example, which typically depend on the texture images they work on; they're inseparable. I've been there, done that.) But if you're in this industry and have managed to do it well, I'm eager to hear about your experience.

It seems that most of the people discussing this here agree that "Git is good for source version control, while Perforce is for asset version control". So I'm not really sure what people are arguing about.

The article never actually says why Perforce is more scalable, except that you have "an IT department taking care of it."

Perforce only scans through everything when you update or scan for changed files, but mostly tracks changes file by file. Git (and mercurial) work with the state of the tree as a whole for most operations, because this makes handling searching, branching, merging, etc. much nicer, but it means that when you have gigantic binary files just sitting around, it takes time scanning them as well.

Why not cache the MD5/SHA1 hashes and only update them when the timestamp of the file changes?

Mercurial does, by default. I'm almost certain git already does, too.
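A minimal sketch of the caching idea from the question, in Python. This is not git's or Mercurial's actual code; the class name `HashCache` and its `rehashed` counter are invented for the example, and it assumes that an unchanged size and modification time mean unchanged contents, which is the same bet rsync makes in its quick check.

```python
import hashlib
import os


class HashCache:
    """Remember each file's hash alongside the size and mtime it was computed from."""

    def __init__(self):
        self._entries = {}  # path -> (size, mtime_ns, sha1 hex digest)
        self.rehashed = 0   # number of full passes over file contents

    def sha1(self, path):
        info = os.stat(path)
        key = (info.st_size, info.st_mtime_ns)
        cached = self._entries.get(path)
        if cached is not None and cached[:2] == key:
            # Size and mtime match: trust the stored hash, skip reading the file.
            return cached[2]
        digest = hashlib.sha1()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        self.rehashed += 1
        self._entries[path] = key + (digest.hexdigest(),)
        return digest.hexdigest()
```

With a cache like this, a status scan costs one `stat` call per file instead of a full read, so large binaries that just sit there are cheap. They still cost a full pass whenever they are added or actually change.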

Best way I've found to do joint versioning of code with large datasets (whether binary or tab-delimited text):

1. Check in symbolic links to git. You can include the SHA-1 or MD5 in the file name.

2. Have those symbolic links point to your large out-of-tree directory of binary files.

3. rsync the out-of-tree directory when you need to do work off the server.

4. Have a git hook check to see whether those files are present on your machine when you pull, and to update the SHA-1s in the symbolic link filenames when you push.

By using symbolic links, at least you have the dependencies encoded within git, even if the big files themselves aren't there.
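The "are the files present" half of step 4 could be sketched like this in Python. The function name `missing_assets` is hypothetical, and wiring it into an actual git hook is left out; it only walks a checkout and reports symbolic links whose targets do not exist on this machine.

```python
import os


def missing_assets(repo_dir):
    """List symlinks under repo_dir whose targets are not present locally."""
    missing = []
    for root, dirs, files in os.walk(repo_dir):
        # os.walk does not follow symlinked directories by default,
        # so links show up as plain entries in dirs or files.
        for name in dirs + files:
            path = os.path.join(root, name)
            # islink checks the link itself; exists follows it to the target.
            if os.path.islink(path) and not os.path.exists(path):
                missing.append(os.path.relpath(path, repo_dir))
    return sorted(missing)
```

Anything it returns is a dependency that git knows about but that still has to be rsynced from the out-of-tree directory.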

I actually go along with the "put any binary or tool you need to make this run in the repository" school of thought, but I don't stuff generated files in the repository.

I'm not sure "Git doesn't handle this dysfunctional source control usage" is really a valid complaint.

It's not really fair to call it "dysfunctional source control usage" (at least no fairer than the OA). The problem is with large asset files, which though almost always "generated" in the sense that people don't hand-craft them like code, are not necessarily a function of what is stored in the repo.

I'll agree with you that making git handle huge binary repositories speedily is probably not a worthwhile effort.

Asset files, as in triplefox's comment above, are a very different thing from the OA's mention of keeping the object files from nightly builds and anything else they think of to throw into the repository.

For program source control, throwing in a bunch of unnecessary files is dysfunctional. Assets are, by definition, not unnecessary files.

I wonder how difficult it would be to update git to support special-handling of binary files? Perhaps a .gitbinary file (similar to .gitignore) which basically tells git to ignore a directory unless a specific command (say git binary sync) is run. Or, perhaps handling binary files better is just a matter of faster hashes? Could the filesystem be any help here? I know ZFS keeps a SHA-1 hash of all directories and files.

Maybe someone who works in an environment with large binary files that need to be tracked should just buckle down and write a tool appropriate for that scenario (vs. trying to force-fit git, a tool designed for a different set of use cases)?

Sounds like an opportunity for a startup or at least a cool open source project ("make something people want").

The main problem is that there is a limited number of users that really need large asset tracking solutions, and these big players have spent many years on it and now have something more or less working.

To surpass the existing systems, you really need to work on the actual production assets, and in the environment where hundreds of artists/programmers are updating it daily. Such large "practical" production assets are not easily accessible to startups or open-source projects.

I don't say it is impossible, though; AlienBrain seems to manage to sustain a business. And I've seen few people who are truly happy with their asset management systems.
