I don't use Facebook, and I'm not suggesting that they're building the software of the future. But surely someone there is smart enough to know that, for this decision, time is on their side.
Review board and Gerrit are both awful in comparison
They each have their strengths, but both of them are infinitely preferable to not doing code review. Neither is awful.
In my admittedly limited experience (Windows 7 x64, ant, Android SDK) ant is terribly slow to build projects with multiple source library dependencies and throwing hardware at the problem doesn't speed it up that much.
For example, most open source library projects that you include in a project don't change from build to build, but ant dutifully recompiles them each time instead of caching the output until the files in that project are changed or I manually clean the build output.
Which will then lead down the path of building a set of scripts/utilities on top of said system to standardize on a set of targets and deal with known issues. And suddenly we have reinvented another old tool, autotools. We'll probably find ourselves in their troubles soon enough.
In fact, the only thing the autotools don't do that every current tool does is download dependencies. Which is not surprising, since fast network connections are a relatively new thing.
And I'm actually not sure I care for the dependency-downloading thing. It is nice at first, but it quickly turns into people pushing yet another thing to download instead of making do with what you have. That is plain embarrassing when the downloaded part offers practically nothing on top of what you had.
Make, when used properly, is still a pretty smart tool.
Also, auto-detecting inputs rather than being forced to specify them is nice, especially as virtually all input specs in Makefiles are wrong or incomplete.
Writing a build system is not such a big deal -- and outdoing the existing work is not very hard.
Explicit input specifications are virtually never correct. #include scanners, for example, are generally wrong because they do not express the dependency on the nonexistence of the header in earlier directories on the include path.
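To make that concrete, here's a rough Python sketch of a naive scanner (the function and names are just illustrative) and the dependency it fails to record:

    import os
    import re

    INCLUDE_RE = re.compile(r'^\s*#\s*include\s*"([^"]+)"', re.M)

    def scan_deps(source_path, include_dirs):
        """Naive #include scanner: returns the headers a source file resolves to."""
        with open(source_path) as f:
            text = f.read()
        deps = []
        for name in INCLUDE_RE.findall(text):
            for directory in include_dirs:
                candidate = os.path.join(directory, name)
                if os.path.exists(candidate):
                    deps.append(candidate)
                    break  # first hit on the include path wins
        return deps

    # The flaw: the result records only the header that WAS found. It does not
    # record that `name` was absent from the directories searched before it, so
    # if someone later drops a shadowing copy of the header earlier on the
    # include path, the resolution changes but nothing triggers a rebuild.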
Usually game assets are in one repository (including compiled binaries) and code in another. The repository containing the game itself can grow to hundreds of gigabytes in size due to tracking revision history on art assets (models, movies, textures, animation data, etc).
I wouldn't doubt there are some larger commercial game projects with repository sizes exceeding 1TB.
1TB is rather a lot. My previous record was 300GB and even that seemed a bit much. But it is very convenient having everything in one place, including all the stuff that you only need occasionally but is handy to have to hand, such as old builds of the game that you can just run in situ and installers for all the tools.
(I don't know what the entire repository size must be like, but many of the larger files have a limited history depth, so it's probably less than 5-10TB. So not exactly unimaginable, though I'm sure buying that much server-grade disk space - doubtless in duplicate - might cost more than I think.)
However, Perforce does have Git integration now, allowing for either a centralized or distributed version control model. Considering the popularity of Git, I wouldn't doubt smaller Perforce-based game projects are going the DVCS route.
Also, hypothetically speaking, consider if you had a game project that would eventually grow to 1-2TB in repository size. If you spent $100 per developer to augment each of their workstations with a dedicated 3TB hard drive, you would have an awesome level of redundancy using DVCS (plus all the other advantages). I know it's no replacement for cold, off-site backups, but it would still be nice.
So, you need some central system to manage the assets so that people know "hey, don't edit this file right now because so-and-so is editing it".
Ideally you'd like to know this BEFORE you start editing. In other words, you don't want to spend 15-60 minutes editing something and only be told when saving or trying to check in, "hey, sorry, but someone else was editing this file so you'll have to discard your changes and start over". Some editors are better at this than others.
You could try to write something outside of git to help with this, but why, when P4 already provides it?
Maybe because P4 is kind of a PITA? I used it for 10 months on a project (without any noticeable art assets, even; this was just code) and it regularly caused problems. The company had someone whose sole job was to administrate P4, and it was sorely needed.
Of course, it's been many years, and I no longer remember details about the precise problems encountered, just the overall frustration. Although the one thing I do remember is the aggravation caused when a coworker accidentally locked a file they weren't editing and then left for the day.
Like you say, games include all sorts of binary assets.
Any idea how much actual code is there?
The Linux kernel is only 175MB
The F22 has some 1.7 million LOC
This graph shows some pretty big things
>> At least according to the presentation by a Facebook engineer that I just watched, they're still on git. 
*8 GB plus 46 GB .git directory
> --depth <depth>: Create a shallow clone with a history truncated to the specified number of revisions.
If the OS and applications that are creating and modifying files are all honest about the file dates, it should be possible to only scan dates instead of reading out every file. Or even use ionotify-like events to track what changed.
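As a rough sketch of the date-scanning idea (the snapshot layout here is made up, not any particular tool's format):

    import os

    def snapshot(root):
        """Record (size, mtime) for every file under root -- no file contents read."""
        snap = {}
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                st = os.lstat(path)
                snap[path] = (st.st_size, st.st_mtime)
        return snap

    def changed_paths(old, new):
        """Only files whose metadata differs need their contents examined at all."""
        return [p for p in set(old) | set(new) if old.get(p) != new.get(p)]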
EDIT: As sisk points out below --depth itself is not new, but as of 1.9 the limitations that previously came with shallow clones were lifted. Thanks sisk.
In 2011 they had 10 million LoC and up to 500 commits a day, but if we assume the plot keeps going up like this, by now in 2014 it can be pretty big.
Their binary was 1.5GB when the paper was written.
I wrote https://github.com/polydawn/mdm/ to help with this. It goes a step further and puts each binary version in a separate root of history, which means you can pull down only what you need.
Don't other people do that, too? What's the benefit of having binaries stored? I've never needed that; I've never worked on any huge projects, so I might be missing something crucial.
As for images, icons, fonts and similar, I just have a build script that copies/generates them, if it's needed.
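Something along these lines, say; a minimal sketch where the paths are placeholders and the copy only happens when the source is newer:

    import os
    import shutil

    SRC_DIR = "assets/icons"   # placeholder paths
    DST_DIR = "build/icons"

    def copy_if_newer(src, dst):
        """Copy src over dst only when dst is missing or older than src."""
        if not os.path.exists(dst) or os.path.getmtime(src) > os.path.getmtime(dst):
            shutil.copy2(src, dst)

    if not os.path.isdir(DST_DIR):
        os.makedirs(DST_DIR)
    for name in os.listdir(SRC_DIR):
        copy_if_newer(os.path.join(SRC_DIR, name), os.path.join(DST_DIR, name))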
I guess I've always been a little bit "obsessed" about the tidiness of my repositories.
There's only so much git gc can do. We've got a 500MB repo (.git, excluding working copy) at work, for 100k revisions. That's with a fresh clone and having tried most combinations of gc and repack we could think of. Considering the size of facebook, I can only expect that their repo is deeper (longer history in revisions count), broader (more files), more complex and probably full of binary stuff.
Of course not, it's an actual project. There is code, there are data files (text and binary) and there are assets.
I actually prefer monolithic repos (I realize that the slide posted might be in jest). I have seen projects struggle with submodules and splitting up modules into separate repos. People change something in their module. They don't test any upstream modules because it's not their problem anymore. Software in fast-moving companies doesn't work like that. There are always subtle behavior dependencies (re: one module depends on a bug in another module, either by mistake or intentionally). I just prefer having all code and tests of all modules in one place.
While this is interesting, it also requires a lot of discipline and almost one person dedicated as a full-time "dba" to not end up with spaghetti. Since there is no unique version number for the repo, you have to store these queries and manually keep adding labels to be able to go back to an exact point in time.
It does have some uses, like being able to run a new test on old code to find the first version where something broke, or being able to change versions of external libraries or blob assets quickly, but it's hard to say if it's worth it since it comes with so many other problems.
They might have since modularized and cleaned it up but it seems unlikely they'd fully SOA-ize the Facebook web app.
It works absolutely brilliantly. Division of labor and responsibility becomes clear, repos stay manageable, large-scale rewrites can happen safely, piecemeal, over time... it really is the best way to do it.
So, yes, if you are able to control growth enough that you can make this happen, it is attractive. If you can't, then this leads to a version of the diamond problem in project dependencies. And is not fun.
If you're not growing, then there is no problem. If you have linear growth maybe you can keep pushing it, but who plans on linear growth?
Google is already on multiple Perforce servers because of scaling, and that is not a situation that is going to improve. If you start using multiple centralized version control servers, you are going to want a build/deployment system that has a concept of packages (and package versions) anyway.
> If you can't, then this leads to a version of the diamond problem in project dependencies. And is not fun.
These sorts of dependency resolution conflicts can and do happen, but far less often than you would think. Enforcing semantic versioning (and, along with it, specifying explicit version ranges) goes a long way. In practice, the benefits of versioned dependencies (such as avoiding ridiculous workarounds like the one described in this HN comment: https://news.ycombinator.com/item?id=7649374) far outweigh any downsides.
You can even create a system that uses versioned packages as dependencies while using a centralized versioning system. In fact, this is probably the easiest migration strategy. Build the infrastructure that you will eventually need anyway while you are still managing on one massive repository. Then you can 'lazy eval' the migration (pulling more and more packages off the centralized system as the company grows faster and faster, avoiding version control brownouts).
It is amusing how much hubris our industry has. Seriously, you are talking about outsmarting two of the most successful companies out there.
I mean, could they do better? Probably. But it is hard to grok the amount of second-guessing any of their decisions gets.
I do feel that the main reason they are successful is large manpower. That is, competent (but not necessarily stellar) leadership can accomplish a ton with dedicated hard workers. This shouldn't be used as evidence that what they are doing is absolutely correct. But, it should be considered when what they do is called into question.
If you have/know of any studies into that, I know I would love to read them. I doubt I am alone.
The counter-argument appears to be this. If one team checks in a change that breaks another team's code, then the focus should be on getting that fixed as soon as possible.
Now, if you are in multiple repositories, it can be easy to shift that responsibility onto the repository that is now broken. Things then get triaged and tasks must be done such that getting in a potentially breaking fix may take time.
Contrast that with the simple rule of "you can't break the build" in a single repository, where the onus is on whoever is checking in the code to make sure all use sites still work.
Granted, an "easy" solution to this is to greatly restrict any changes that could break use site code. The Kernel is a good example of this policy. Overall, I think that is a very good policy to follow.
Our workflow covers all the potential problems you named (e.g. scripts to keep everything up to date, tests that get run at build or push time after everything is already checked out from the individual repos, etc.).
We've been running this way for over a year with literally zero issues.
git log -- my-teams-subdirectory
If you use any sort of versioning this shouldn't ever cause a problem.
BUT this allows you to pay the price of versioning (downstream burdened with rewriting => they never do => old code lives indefinitely) only in the worst cases.
If done right (lots of tests, great CI infrastructure), fixing everything downstream is practical in many cases, and can be a win.
A subtler question is how this interplays with branching.
You can't be expected to update downstream on every random/abandoned branch out there, only the head. Which deters people from branching because then it's their responsibility to keep up...
It appears they are now using Mercurial and working on scaling that (also noted by several others in this discussion): https://code.facebook.com/posts/218678814984400/scaling-merc...
I also wonder if that size includes a snapshot of a subset of Facebook's Graph, so that each developer has a "mini-facebook" to work on that's large enough to be representative of the actual site (so that feed generation and other functionalities take somewhat the same time to execute.)
A unified repo scales well up to a certain point before troubles arise, e.g. a fully distributed VCS starts to break down when you have hundreds of MB and people with slow internet connections. Large projects like the Linux kernel and Firefox are beyond this point. You also have implementation details such as Git's repacks and garbage collection that introduce performance issues. Facebook is an order of magnitude past where troubles begin. The fact that they control the workstations and can throw fast disks, CPU, memory, and 1 Gbps+ links at the problem has bought them time.
Facebook made the determination that preserving a unified repository (and thus preserving developer productivity) was more important than dealing with the limitation of existing tools. So, they set out to improve one VCS system: Mercurial (https://code.facebook.com/posts/218678814984400/scaling-merc...). They are effectively leveraging the extensibility of Mercurial to turn it from a fully distributed VCS to one that supports shallow clones (remotefilelog extension) and can leverage filesystem watching primitives to make I/O operations fast (hgwatchman) and more. Unlike compiled tools (like Git), Facebook doesn't have to wait for upstream to accept possibly-controversial and difficult-to-land enhancements or maintain a forked Git distribution. They can write Mercurial extensions and monkeypatch the core of Mercurial (written in Python) to prove out ideas and they can upstream patches and extensions to benefit everybody. Mercurial is happily accepting their patches and every Mercurial user is better off because of Facebook.
Furthermore, Mercurial's extensibility makes it a perfect complement to a tailored and well-oiled development workflow. You can write Mercurial extensions that provide deep integration with existing tools and systems. See http://gregoryszorc.com/blog/2013/11/08/using-mercurial-to-q.... There are many compelling reasons why you would want to choose Mercurial over other solutions. Those reasons are even more compelling in corporate environments (such as Facebook) where the network effect of Git + GitHub (IMO the foremost reason to use Git) doesn't significantly factor into your decision.
What if multiple services are utilizing a shared library? For each service to be independent in the way I think you are advocating for, you would need multiple copies of that shared library (either via separate copies in separate repos or a shared copy via something like subrepos).
Multiple copies leads to copies getting out of sync. You (likely) lose the ability to perform a single atomic commit. Furthermore, you've increased the barrier to change (and to move fast) by introducing uncertainty. Are Service X and Service Y using the latest/greatest version of the library? Why did my change to this library break Service Z? Oh, it's because Service Z lags 3 versions behind on this library and can't talk with my new version.
Unified repositories help eliminate the sync problem and make a whole class of problems that are detrimental to productivity and moving fast go away.
Facebook isn't alone in making this decision. I believe Google maintains a large Perforce repository for the same reasons.
No, you have a notion of packages in your build system and deployment system.
You want to use the FooWiz framework for your new service BarQuxer? Include FooWiz+=2.0 as a dependency of your service. The build system will then fetch a suitable FooWiz package when building your BarQuxer. Another team on the other side of the company also wants to use FooWiz? They do the exact same thing. There is never a need for FooWiz to be duplicated; anybody can build with that package as a dependency.
SOA is beneficial over monolithic development for many other reasons unrelated to versioning. It just happens to enable saner versioning as one of its benefits.
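As a rough sketch of what the build system's resolution step might look like (FooWiz and the version list are hypothetical, and real tools are far more elaborate), picking the newest version compatible with the request:

    def parse(version):
        return tuple(int(part) for part in version.split("."))

    def resolve(requirement, available):
        """Newest available version that shares the requirement's major version
        and is at least as new -- a semver-style 'compatible release' match."""
        want = parse(requirement)
        ok = [v for v in available if parse(v)[0] == want[0] and parse(v) >= want]
        if not ok:
            raise ValueError("no package satisfies %s" % requirement)
        return max(ok, key=parse)

    # Hypothetical index of published FooWiz versions:
    print(resolve("2.0", ["1.9", "2.0", "2.3.1", "3.0"]))   # -> 2.3.1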
The first clone does not have to go over the wire. Part of git's distributed nature is that you can copy the .git to any hard drive and pass it on to someone else. Then...
> git checkout .
And I assume they have extra requirements for interns to push code.
I liked this idea too: "all engineers who contributed code must be available online during the push. The release system verifies this by contacting them automatically using a system of IRC bots; if an engineer is unavailable (at least for daily pushes), his or her commit will be reverted."
That way, they are able to react very quickly in case of a problem.
I doubt that an intern at Google would have access to the search codebase. I'd wager that only a handful of trusted employees have access to that codebase.
At a small company, I agree. But FB is around 5k people. Let's say they have 3k engineers; that's an awful lot of people they're trusting with their source code.
Everyone having access to everything must be worth the security trade-off. On the other hand, I suppose it's debatable whether it would be a trade-off at all.
Not with git. The hash chain mechanism requires the entire repo to generate valid chains, so it's all or nothing.
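A simplified illustration of the chaining idea (this is not git's actual object format): each commit id hashes over its parent's id, so producing or checking any id needs every ancestor.

    import hashlib

    def commit_id(parent_id, tree_hash, message):
        # Each commit id covers its parent's id, chaining the entire history.
        payload = "%s %s %s" % (parent_id, tree_hash, message)
        return hashlib.sha1(payload.encode("utf-8")).hexdigest()

    root = commit_id("", "tree-0", "initial commit")
    tip = commit_id(root, "tree-1", "second commit")
    # Recomputing or verifying `tip` requires `root`, which requires its
    # parent, and so on back to the beginning of history.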
I would find this extremely hard to believe, especially at Facebook. At any software company, your code base is what defines you as a company; there is no way they'd let the good stuff sneak out like that.
Now what? How do you get users? "We are just like facebook - only your friends aren't here" probably wouldn't get users excited.
And if you somehow DID manage to get users, don't you think there are "watermarks" in the code, that they could detect and sue you to death with?
How they capture this data and use it would also be in their source code, no? This is absolutely where Facebook gets its worth. I would assume this is what they would want to keep in a limited exposure set? I might be wrong, but this is why they hire the best engineers out there.
True, you do. What you gain is the ability of small pieces to move individually through API changes.
If your entire codebase is in one repo (as appears to be the case here), and you want to change an API, you must either do so in a backwards compatible way, and slowly eradicate any old callers, or change them all in one fell swoop.
By splitting to multiple repos, you can version them independently. Thus, a project can (hopefully temporarily) depend on the old API, which only gets bugfixes, while another project can depend on the new version.
The tricky bit is when you have one "binary" or something equivalent referring to two versions of a dependency. (Usually indirectly, i.e., A depends on B which depends on D v1, and A depends on C which depends on D v2, and D v1 and D v2 are incompatible.) You can't really do much about this, but if you keep your components small enough (think services with well separated interfaces) you should be able to keep the dependencies small enough as well.
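A small sketch of detecting exactly that situation in a dependency graph; the graph, pins and helper here are made up for illustration:

    def find_conflicts(app, deps, pins):
        """Walk the dependency graph and report packages pinned to two versions."""
        seen = {}        # package -> (version, required-by)
        conflicts = []
        stack = [app]
        while stack:
            node = stack.pop()
            for child in deps.get(node, []):
                version = pins[(node, child)]
                if child in seen and seen[child][0] != version:
                    conflicts.append((child, seen[child], (version, node)))
                else:
                    seen[child] = (version, node)
                stack.append(child)
        return conflicts

    deps = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
    pins = {("A", "B"): "1.0", ("A", "C"): "1.0", ("B", "D"): "1", ("C", "D"): "2"}
    print(find_conflicts("A", deps, pins))   # D is wanted at both v1 and v2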
Individual libraries/dependencies get worked on by themselves, with an API that other applications use. Then the other apps just bump a version number and get newer code.
That is, the reason an API changes is because a use site has need of a change. So, at a minimum, you need to make that change and test it against that site in a somewhat atomic commit.
Then, if the change has any effect on other uses, you need a good way to test that change on them at the same time. Otherwise, they will resist pulling this change until it is fixed.
Add in more than a handful of such use sites, and suddenly things are just unmanageable in this "manageable" situation.
Not that this is "easy" in a central repo. But at least with the source dependency, you can get a compiler flag at every place an API change breaks something.
And, true, you can do this with multiple repos, too. But every attempt I have seen to do that just uses a frighteningly complicated tool to "recreate" what looks like a single source tree out of many separate ones. (jhbuild, and friends)
So, if there is a good tool for doing that, I'd certainly love to hear about it.
Mostly because we want a 100% reproducible build environment, so a complete build environment (compilers + IDE + build system) is all checked into the repo.
I mean, the same guy that told me this, also said that the codebase size was about 50 times less than the one reported in this slide, so it may all be pure speculation.
"The deployed executable
size is around 1.5 Gbytes, including the Web server and compiled Facebook application. The code
and data propagate to all servers via BitTorrent, which is configured to minimize global traffic
by exploiting cluster and rack affinity. The time needed to propagate to all the servers is roughly
So using torrents isn't foreign to them.
Python 2.7.3 (default, Feb 27 2014, 19:58:35)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information
>>> # let's assume a char is 2mm wide, 500 chars per meter
>>> 54 * 1024**3 / 500.0
115964116.992 # meters of code
>>> # assume 80 chars per line, a char is 5mm high, 200 lines per meter
>>> 54 * 1024**3 / 80.0 / 200
3623878.656 # height of code in meters
>>> # 1000 meters per km
>>> 3623878.656 / 1000
3623.878656 # km of code, it's about 385,000 km from the Earth to the Moon
>>> from sys import stdout
>>> stdout.write("that's a hella lotta code\n")
Need I go on? :-) You've replaced a relatively simple system of merge requests with some pseudo in-code versioning system controlled through boolean global variables.
I'll take feature branches any day of the week over that mess. The github model is far superior IMO.
I think feature toggles can be extremely useful, but still develop in a branch and merge after review/qa.
It hasn't become part of everyone's workflow yet, but it's pretty useful.
So git rebase -i will be more readable, while actual linear history is always gibberish.
FB doesn't need to branch ... Gatekeeper (their A/B, feature flag system) really takes care of that concern logically
A quick google comes up with nothing but I could have SWORN I read that.
What about a re-index or something? Will that take forever?
I worry that at such a size the speed will suffer; I had the feeling git is only comfortable with a few GBs.
Anyway, it's good to know that 54GB is still usable!
How much of it is static resources, like CSS sprite images?
We're doing something wrong.
Hopefully they actually have some separate sites, separate tools and separate libraries. Or they could learn how to use submodules or something, rather than literally putting everything in one huge repository.
Whether to put images and other assets into git repos is a separate decision.
The fix for this is pretty simple: use filesystem watch hooks like inotify to update an lstat cache. I wrote something like this for an internal project and the speed difference was night and day. I remember reading that there had been progress on the inotify front on the git dev mailing list a few years ago, don't know what the current status is.
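Roughly along these lines, sketched here with the third-party watchdog package (which wraps inotify on Linux); the cache is just a dict for illustration:

    import os
    import time
    from watchdog.observers import Observer
    from watchdog.events import FileSystemEventHandler

    lstat_cache = {}   # path -> os.lstat() result, refreshed only when an event arrives

    class CacheUpdater(FileSystemEventHandler):
        def on_any_event(self, event):
            try:
                lstat_cache[event.src_path] = os.lstat(event.src_path)
            except OSError:
                lstat_cache.pop(event.src_path, None)   # the file was deleted

    observer = Observer()
    observer.schedule(CacheUpdater(), ".", recursive=True)
    observer.start()
    # A "status"-style operation can now consult lstat_cache instead of
    # lstat()ing every file in the working copy on each run.
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()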
From my experience, we can bet git is the one that takes the least amount of time
Not hitting the network to check which files changed, for a start
- Facebook's complex privacy vs Twitter's binary "public or private"
- Facebook's real names vs Twitter's @usernames
- Facebook's freeform posting length vs Twitter's 140 character limit