Scaling Mercurial at Facebook (2014) (fb.com)
31 points by based2 15 days ago | 78 comments

As an engineer who worked at both Google and Facebook, I vastly prefer Google's monorepo on Perforce. Combining that with CitC was a pretty solid way to develop your project.

Hg is a bit of a nightmare in a WFH situation: really slow, and it hangs for a long time if you haven't synced in a few days. Yes, I'm sure there are ways to tweak it, but I'm not sure you can tweak it enough!

As a fellow Facebook employee, FYI, Hacker News is not a great place to get tech support.

For the benefit of everyone else on this thread, note that Facebook uses https://github.com/facebookexperimental/eden. From the README:

> Despite having originally evolved from Mercurial, EdenSCM is not a distributed source control system.

i.e., this is not "stock" Mercurial. For example, the Eden server (Mononoke) is written in Rust and the complementary virtual filesystem is written in C++. This has very different performance characteristics than real Mercurial.

EdenSCM works with both EdenFS (the custom filesystem) and a traditional filesystem. If you use EdenFS, pulls will be much cheaper because you only fetch what you use. If you use a traditional filesystem, EdenSCM supports the same "sparse checkouts" feature as stock Mercurial (https://firefox-source-docs.mozilla.org/build/buildsystem/sp...), which can also be used to reduce the size of the slice of the monorepo you pull down.
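To make the sparse-checkout idea concrete, here is a minimal Python sketch of how an include/exclude profile narrows the slice of a monorepo that lands in a working copy. The `select_paths` helper, the glob patterns, and the file list are hypothetical; this does not reproduce the actual sparse-profile syntax or matching rules of Mercurial or EdenSCM.

```python
from fnmatch import fnmatch

def select_paths(paths, include, exclude=()):
    """Return the paths a hypothetical sparse profile would check out.

    A path is kept if it matches at least one include pattern and no
    exclude pattern. Patterns are shell-style globs via fnmatch, where
    '*' also matches '/' (unlike most VCS path matchers).
    """
    selected = []
    for path in paths:
        included = any(fnmatch(path, pat) for pat in include)
        excluded = any(fnmatch(path, pat) for pat in exclude)
        if included and not excluded:
            selected.append(path)
    return selected

# Hypothetical monorepo file list.
repo = [
    "www/index.php",
    "www/lib/util.php",
    "android/app/Main.java",
    "ios/App/AppDelegate.m",
    "www/tests/big_fixture.bin",
]

# Only the web slice, minus large test fixtures.
print(select_paths(repo, include=["www/*"], exclude=["www/tests/*"]))
# ['www/index.php', 'www/lib/util.php']
```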

Last I checked, Perforce (and Google's "implementation" of Perforce, Piper) did not provide nearly the same level of support for stacked diffs as Eden. As both Google and Facebook have cultures of pre-commit code review, working with stacked diffs makes it much easier to make progress while waiting for approvals on earlier diffs.

I believe there are relative advantages/disadvantages of Eden vs. Piper+CitC and that both projects aspire to have the best of each in the limit.
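A rough sketch of why stacked diffs help with pre-commit review: a stack lands bottom-up, so each diff becomes landable as soon as it and every diff beneath it is approved, while the author keeps stacking new work on top. The diff IDs and the `landable` helper below are hypothetical and only model the ordering rule, not any real review tool.

```python
def landable(stack, approved):
    """Return the diffs that can land now from a bottom-up stack.

    stack: diff IDs ordered from the bottom (closest to trunk) upward.
    approved: set of approved diff IDs.
    A diff can land only once it and every diff beneath it are approved.
    """
    ready = []
    for diff in stack:
        if diff not in approved:
            break  # an unapproved diff blocks everything stacked above it
        ready.append(diff)
    return ready

stack = ["D1", "D2", "D3", "D4"]
# D4 is approved, but it still waits on D3.
print(landable(stack, approved={"D1", "D2", "D4"}))  # ['D1', 'D2']
```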

I can only imagine an attempt at the initial clone... the hg client is ridiculously slow, memory-hungry, and full of dumb behavior that seems to assume local connections.

hg has a mode to make initial clones faster by offering compressed bundles for download.

It does range requests against these bundles, attempts to extract the partials while keeping the HTTP connection alive (using up GBs of memory in the process), and then fails outright pretty regularly when it needs to go back to the network. It makes no attempt at recovering those chunks, re-establishing connections, or using any intelligence at all.

It also regularly overwhelms HTTP header size norms and runs up memory when attempting to use those bundles.

Chunks/bundles are, in fact, strictly worse unless you get your initial clones with a good piece of software like curl.
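The missing recovery behavior described above can be sketched as follows: track how many bytes have arrived, and on a dropped connection retry with a byte range that starts where the last attempt stopped. This is a minimal Python sketch, not hg's clone-bundle code; `fetch_range` is a hypothetical callable standing in for an HTTP GET with a `Range: bytes=start-end` header, so the retry logic can be shown without a network.

```python
def download_with_resume(fetch_range, total_size, chunk_size=1 << 20, max_retries=3):
    """Download total_size bytes in chunks, resuming after failures.

    fetch_range(start, end) is a hypothetical stand-in for an HTTP request
    with "Range: bytes={start}-{end}" (end inclusive, as in HTTP). It
    returns bytes or raises ConnectionError.
    """
    data = bytearray()
    retries = 0
    while len(data) < total_size:
        start = len(data)
        end = min(start + chunk_size, total_size) - 1
        try:
            chunk = fetch_range(start, end)
            if not chunk:
                raise ConnectionError("empty response")
        except ConnectionError:
            retries += 1
            if retries > max_retries:
                raise
            continue  # keep what we have; retry from the same offset
        data += chunk
        retries = 0  # progress was made, so reset the retry budget
    return bytes(data)

# Usage with a fake server that drops every third request.
payload = b"hello, monorepo!"
calls = {"n": 0}

def flaky_fetch(start, end):
    calls["n"] += 1
    if calls["n"] % 3 == 0:
        raise ConnectionError("connection reset")
    return payload[start:end + 1]

print(download_with_resume(flaky_fetch, len(payload), chunk_size=4))
# b'hello, monorepo!'
```

Resuming from the byte offset means a dropped connection costs at most one chunk of rework instead of the whole bundle.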

When did you work at Facebook? Things used to be bad, but improved greatly after Eden was introduced around 2018. Eden is a FUSE filesystem which uses hg as the backing store.

(I also worked at Google but not the google3 monorepo - instead I got emerge, which was misery.)

Joined about 6 months ago, after 7.5 yrs at Google. Eden still doesn't support all repos :(

I do Android occasionally, from laptop, given wfh.

Can you share where Eden doesn't work? I no longer work at FB, but at the time I left, Eden was solid for www/Android, and mostly there on Mac/iOS.

Somehow FB has made all the common operations work in <6 seconds with millions of files...

It's an impressive feat, but something I find even more impressive is the scale they need to be prepared for. The performance challenges of maintaining infrastructure for as many developers as Facebook/Google employ are constant and demanding.

At my extremely small startup in the past ~6 months I've made about 1,000 files (@ HEAD, not including deletions). So, maybe, at the absolute worst case each engineer could produce ~150 files/month. Facebook likely has ~1,000 engineers. That's a worst case 1,800,000 files/year + an existing massive code base.
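The back-of-the-envelope estimate above is just files per engineer per month, times headcount, times twelve. A tiny Python helper makes it easy to plug in other numbers; the ~150 files/month and ~1,000 engineers are the commenter's own guesses, and the 10,000 figure is a purely hypothetical headcount for comparison.

```python
def worst_case_files_per_year(files_per_engineer_month, engineers):
    """Upper-bound estimate of new files added per year (ignores deletions)."""
    return files_per_engineer_month * engineers * 12

print(worst_case_files_per_year(150, 1_000))   # 1800000
print(worst_case_files_per_year(150, 10_000))  # 18000000 (hypothetical headcount)
```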

Good luck to the engineers who are forever battling the scaling issues underlying these orgs! It's extremely impressive to even hear about overcoming these huge hurdles.

Facebook has way more than 1000 engineers :)

It all breaks down when watchman decides to come to a halt and a basic arc pull takes more than 10 minutes after a week of no activity on the repo :)

Watchman is gone!

I'd be very interested to hear what Google's done to their perforce server, if anything.

I used to do vendor support for a company that does version control for chip design. Basically Perforce + some metadata and applications that automate away a bunch of the administrative grunt work.

Some of our larger customers had quite a few layers of abstraction to try and speed it up (to varying degrees of success). But I don't know how Google's scale would compare to a company that routinely had check-ins they claimed were hundreds of GB, or another that checked in their giant PDKs to send them to each of their sites globally.

Google has completely replaced it, in stages, with a whitebox implementation that's by now almost a complete protocol rewrite. You can read about Piper here[1]

[1] https://cacm.acm.org/magazines/2016/7/204032-why-google-stor...

I've heard about that! Was more thinking about what they did while there was still a Perforce server on the backend

That I can speak to (a little): they gave it a gigantic RAM-backed SAN and the best hardware money could buy. Nothing particularly interesting, but there were several Perforce servers with 512G of RAM attached to several SANs with terabytes of RAM storage (in 2012ish, when it was replaced). It was only ever explained to me in terms of SWE-hours, but my understanding is that the hardware budget was enough to fund the several software engineering teams that replaced it, and the licensing was more than the hardware.

Note: this article is from 2014; it would be interesting to see how Mercurial has worked for Facebook since then.

It's a lot better than it was in 2014, but I still miss using git.

Why? Git is such a worse, over-engineered experience.

GitHub won, so we're all forced to use git now, but Mercurial really has a simpler interface and a more straightforward mental model.

Because I really like the model/UX of the staging area. I'm not worried about the tool being simple; I use it every day, so it's worth learning inside and out.

It's a much nicer experience IMO to be able to work on various different pieces, then iteratively run `git add -p` or similar to stage individual files or hunks, verify everything I'm adding with `git diff --cached`, then `git commit` when I'm actually ready and happy. It lets me fix multiple parts of my project at the same time, without interruption, while still letting me compose my batch of changes into multiple, self-contained changesets.

`hg record` feels really limited by comparison, and rushes you immediately into committing your selected changes, and then further makes it difficult to selectively amend more (partial) changes to that commit afterwards.

One big idea that git should learn from Mercurial is history editing. git has only clunky, modal interactive rebase, and struggles with natural questions like "what is the child of this commit."

But hg makes it easy to walk up and down a stack, putting fixes in the right place.

At the risk of overreach, this feature would lead to better software. If your tools have better support for preparing a stack of commits, then code review will be easier, and software will be higher quality.

I wrote git-prev-next to scratch my itch but IMO git should embrace it as a core feature. http://github.com/ridiculousfish/git-prev-next
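Why "what is the child of this commit" is awkward in git: each commit records only its parents, so finding children means scanning commits and inverting those pointers. The Python sketch below does that over a hypothetical in-memory dict mapping each commit to its parents (not git's object store or any real API), then walks a stack with `prev`/`next` helpers in the spirit of the tool linked above; it is not that tool's implementation.

```python
from collections import defaultdict

def build_children(parents):
    """Invert a commit -> [parent, ...] map into parent -> [child, ...]."""
    children = defaultdict(list)
    for commit, commit_parents in parents.items():
        for parent in commit_parents:
            children[parent].append(commit)
    return children

def prev_commit(parents, commit):
    """Step down the stack: the first parent, or None for a root."""
    ps = parents.get(commit, [])
    return ps[0] if ps else None

def next_commit(children, commit):
    """Step up the stack; ambiguous if the commit has several children."""
    kids = children.get(commit, [])
    if len(kids) > 1:
        raise ValueError(f"{commit} has multiple children: {sorted(kids)}")
    return kids[0] if kids else None

# A hypothetical three-commit stack on top of main: main <- A <- B <- C
parents = {"main": [], "A": ["main"], "B": ["A"], "C": ["B"]}
children = build_children(parents)
print(next_commit(children, "A"))  # B
print(prev_commit(parents, "C"))   # B
```

The child map has to be rebuilt (or maintained) by scanning history, which is the extra work a tool needs before it can answer "next" cheaply.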

I started on Mercurial. I vastly prefer git. I admit it took a few months of asking teammates to help me get it, but I'd never go back.

Same here. I used to regard http://jordi.inversethought.com/blog/i-hate-git/ as my hero, and then I had to start using git at a new gig, and I haven't touched Mercurial since. If the goal of using a VCS is to have meaningful atomic commits, nothing beats git's staging area and interactive rebase.

Anyone know if they're still on Mercurial?

“Yes”, but with custom server, custom virtual filesystem layer, and heavily-tweaked client — https://github.com/facebookexperimental/eden

They also stopped participating in upstream Mercurial development; Google is pretty much the only major corporate contributor and steward of the project at this point.

I can't say for sure, but I'd be incredibly surprised if they weren't. What else would they be on? (Don't say Git, they tried to make Git scale for their monorepo and it would have required a lot of changes that the Git community wasn't interested in.)

Microsoft uses git for all its Windows development, which is a pretty big scale (the repo was 300GB back in 2017)

From what I understand, Microsoft "took Git and ran with it" the same way that Facebook did with Mercurial, or that Google did with Perforce. The command-line front-end tools are recognizably Git, but the underlying system is radically different.

Google's main repo was 86TB in January 2015.

Is this 86TB of files at that time? Or 86TB of files cumulatively over time + metadata?

Yep, and they've had to add support for several features that Facebook added to Mercurial years earlier.

> it would have required a lot of changes that the Git community wasn't interested in

Can you expand on this? I thought partial checkouts were accepted, and the user-facing part was even released with 1.7?

This goes back to 10 years ago now. It’s nice to see some improvement, but 10 years ago there was very little interest in these sorts of problems. The literal response to FB performance tests was “you’re doing it wrong, don’t use a monorepo.”

It shouldn’t be a surprise that FB decided to go elsewhere.

Fair point, years ago the situation was different. Nowadays really large scale git repos like Windows exist, and work by Jonathan Tan and others has made git more compatible with monorepos.

Perforce is still popular.

I think FB and Google both surpassed Perforce's scale a long time ago.

I've heard all these stories about build servers or source control servers where, for the longest time, the solution to scalability problems was to throw larger machines at it. At some point the machines each have terabytes of RAM, and it becomes a major priority to make the system distributed.

Perforce scales the best of the bunch.

It doesn't matter if you're the best if you're not good enough.

That's why neither Google nor Facebook use Perforce for most of their code. The repos are just too large.

True, but the terminal scale of Perforce at Google was very impressive, and shows that it can handle pretty much anything that 99.9% of companies would need.

Sure, but Facebook is absolutely in that 0.1%.

What makes you say it's not?

Google talked about Piper (its home-built Perforce replacement) in 2015. See:

- https://www.youtube.com/watch?v=W71BTkUbdqE

- https://cacm.acm.org/magazines/2016/7/204032-why-google-stor...

Personal conversations with people who used to run Perforce at Google.

Back when Google was still on Perforce, I used to go to perforce.com to read about how your engineers would never, ever wait for Perforce, while waiting for Perforce.

The Perforce servers were… large. And backed by large, extremely expensive SANs that worked with RAM as backing store. Yet, it was slow as molasses at Google's scale.

Perforce at Google scale in 2010 was "slow" in that it was orders of magnitude faster than git is on much smaller repos today. The only thing that git is really faster for is branching, which is something I almost never wanted to do with Perforce anyway. Git feels fast because it either front-loads all the pain (with checkouts) or defers it (with unexpected runs of git gc) but it doesn't really scale. That's why git users are so vocally opposed to monorepos.

As a committed monorepo opponent, that's not why. "Why" is that it's not possible for a monorepo to have meaningful history. If you don't want to have meaningful history, you're crippling yourself, but you also shouldn't use Git; there are better tools for version control with bad history.

Ah so I guess it's "not possible" that Google has a meaningful history of their source code and I guess their project, arguably among largest and most profitable of all time, is "crippled". Weird. I wonder how they get anything done over there.

Google is very unimpressive; the only way I can think of in which they are even tangentially associated with good engineering practices is, well, employing Junio Hamano to, basically, curate the meaningful history of Git.

Objectively great contributions to OSS and mercurial but I wonder how many companies or projects will need to do something similar.

I enjoy the patches that make Mercurial usable without the crazy extensions that few of us will ever use.

In 2014, a large site but not the behemoth it is today, how did Facebook manage to clock in at 17 million LOC?? Genuinely curious for anyone with first party experience.

Does anyone have any good videos about Facebook's use of watchman to ship changes to central systems & be performing continuous test/analysis on code as it is developed?

I feel like I saw a good video in 2017 or so but cannot find much these days.

Why not just have a bunch of repos? You can do what Google does with Chrome and use depot_tools to check out from a bunch of repos.

Edit: Why am I getting downvoted for asking a question?

We use multiple repos and we have a pretty large codebase consuming a TB a day.

At Spotify we tried the multi-repo thing and it was a total nightmare. Once you have lots of people working on a product, it becomes very hard to make changes in a multi-repo setup.

The alternatives:

1. Completely decoupled code, i.e. teams ship libraries, like a third-party API. This slows down development considerably, and makes re-use of code harder.

2. Keep everyone on one repo, and allow changes to stream in.

Option 1 slows down dev (it suddenly becomes more bureaucratic); option 2 requires scaling your code versioning tool (git/Mercurial, etc.) and the processes associated with it.

Also, single repos make sense within one domain (e.g. server side, iOS, Android, etc.). You can have different repos for different "domains" where code doesn't intersect much.

> makes re-use of code harder

I've found this to be a feature of polyrepos because monorepos can easily become a rat's nest of dependencies. Polyrepos make you think harder about what should really be exposed and shared.

Build systems can enforce visibility, which solves this.

Allowing fine grained visibility at the level of a file, package, or artifact is better than at the granularity of a repo.
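As a rough illustration of build-system-enforced visibility (in the spirit of Bazel's `visibility` attribute, but not its actual syntax or semantics), the Python sketch below rejects any dependency edge whose target does not allow the depending package. The target names, the `"*"` public marker, and the `check_visibility` helper are all hypothetical.

```python
def check_visibility(deps, visibility):
    """Return (dependent, target) edges that violate visibility rules.

    deps: dependent target -> list of targets it depends on.
    visibility: target -> set of package prefixes allowed to depend on it;
    "*" means public. A target missing from the dict is visible only to
    its own package tree. Targets are written as "package:name".
    """
    violations = []
    for dependent, targets in deps.items():
        dep_pkg = dependent.split(":")[0]
        for target in targets:
            target_pkg = target.split(":")[0]
            allowed = visibility.get(target, {target_pkg})
            ok = "*" in allowed or any(
                dep_pkg == pkg or dep_pkg.startswith(pkg + "/") for pkg in allowed
            )
            if not ok:
                violations.append((dependent, target))
    return violations

deps = {
    "ads/ranking:server": ["common/logging:lib", "search/index:internal"],
    "search/frontend:app": ["search/index:internal"],
}
visibility = {
    "common/logging:lib": {"*"},          # public
    "search/index:internal": {"search"},  # only the search/ tree
}
print(check_visibility(deps, visibility))
# [('ads/ranking:server', 'search/index:internal')]
```

Because the check runs per target rather than per repo, a team can expose one library widely while keeping its neighbors private, which is the granularity argument made above.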

Depends how easy it is to change visibility. My point is polyrepos make things you need to be careful about like adding internal dependencies and API changes hard. Monorepos make those deceptively easy.

It's not clear how a monorepo makes your changes hard (nor why they should be). They're really easy when you can do them atomically, and even the hard cases are easier in a monorepo than in a multi-repo setup.

(And at least at Google visibility changes required the approval of the team who you want to depend on)

A phrase I like to use sums this up: “dependencies are easy to add, but hard to remove.”

Multiple repos... the most bureaucratic and painful way to scale your code base.

Having seen "a bunch of repos" in action, I think people don't talk enough about just how awful the experience can be, and how much more work it is to manage multiple repos. As you increase the number of repos, the pain gets worse at a rate which is faster than linear. There are plenty of articles talking about how wonderful a monorepo is, just not many articles about how bad multirepo is.

NPM hell is a close approximation of the multirepo experience. Try upgrading the dependencies in a large NPM project and you'll see all sorts of problems. You might find that upgrading X breaks Y, but you have to upgrade X in order to upgrade Z, and you need to upgrade Z for some reason.

With monorepo, all the versions march forward in sync. If you fix trunk, you will probably ship it, eventually. With multirepo, you need to fight the tooling just to show (from the example above) that a patch to Y will let you upgrade your project to use a newer version of Z.
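The upgrade trap described above can be modeled as a search for one version of a shared dependency that satisfies every consumer's range. In this toy Python model (not npm's semver resolver; npm can sometimes sidestep the conflict by installing duplicate copies, which this ignores), X is the shared dependency, Y only works with X 1.x, and the newer Z needs X 2.x. The package names and ranges are hypothetical.

```python
def compatible_versions(available, constraints):
    """Versions of a shared dependency that satisfy every consumer.

    available: version tuples such as (1, 4) for 1.4.
    constraints: consumer -> (min_inclusive, max_exclusive) version range.
    """
    return [
        v for v in available
        if all(lo <= v < hi for lo, hi in constraints.values())
    ]

available = [(1, 0), (1, 4), (2, 0), (2, 1)]  # published versions of X

# Y pins X to 1.x; the newer Z requires X 2.x: no version of X works.
print(compatible_versions(available, {"Y": ((1, 0), (2, 0)), "Z_new": ((2, 0), (3, 0))}))
# []

# With the older Z, which accepts 1.x or 2.x, X 1.x still works.
print(compatible_versions(available, {"Y": ((1, 0), (2, 0)), "Z_old": ((1, 0), (3, 0))}))
# [(1, 0), (1, 4)]
```

In a monorepo there is only one version of X at head, so the equivalent fix is a single change that updates X and patches Y in the same commit.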

Because you lose the 'single head' concept of actually being able to always have a clear view of the current newest version and of linear version history. The ability to map a single revision number into the state of _all_ source code (including third party dependencies) is extremely powerful. You also lose the ability of performing sweeping changes across an entire codebase in lockstep (think: backwards compatibility breaking API changes).

Not to mention that Google internally avoids multiple repositories (and instead runs off a single monorepo); it's just the Android/Chrome/public stuff that's mostly split up.

> performing sweeping changes across an entire codebase in lockstep

This thinking right here sounds like a reason why Google retires a lot more services, at a much higher frequency, than other companies?

In https://thehftguy.com/2019/12/10/why-products-are-shutdown-t..., the HFT guy brought up the "5 years upgrade pain" as the reason why a service would be retired: at the point when the pain of upkeep outweighs the gain from revenue.

By regularly making backward incompatible API change, as afforded by the monorepo, while also having engineers moving freely between projects, the upgrade pain cycle becomes way shorter for Google services.

By the way, this line of thought also leaks into public-facing source code, with Guava being the poster child of breaking backward compatibility, although it learned its lesson starting with version 21.

Also, there isn't really a lot of great tooling for making sweeping changes across repos.

Don't make breaking API changes, you can't deploy them atomically. You have to support the old API until everyone has safely migrated and confirmed no rollback will be needed.
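A common way to follow that advice is to ship the new API alongside the old one and keep the old entry point as a thin, deprecated shim until every caller has migrated. The Python sketch below uses hypothetical function names; `warnings.warn` with `DeprecationWarning` is standard-library behavior.

```python
import warnings

def fetch_user_v2(user_id, *, include_profile=False):
    """New API (hypothetical): keyword-only options, returns a dict."""
    user = {"id": user_id}
    if include_profile:
        user["profile"] = {}
    return user

def fetch_user(user_id):
    """Old API (hypothetical), kept as a shim until callers have migrated
    and no rollback to a build that needs it is expected."""
    warnings.warn(
        "fetch_user is deprecated; use fetch_user_v2",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this shim
    )
    return fetch_user_v2(user_id)
```

Old and new callers keep working during a rolling deploy, and the shim can be deleted once nothing calls it.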

There are more APIs than just RPC ones. If I change my library's API, I can update the callers atomically.

Then services would be stuck in a sea of technical debt. There are too many moving pieces to do that.

Technical debt is a tool to be used (with caution). Getting ludicrously far behind is a risk I'd avoid, but our stability is more important than staying on the bleeding edge and testing every single version of each of our dependencies.

you can if it's a monorepo and all your callers are in the monorepo, which sort of answers OP's question

The commit can be atomic, but if prod is many machines, the deployment cannot.

Giant companies consistently prefer to vendor the open source code they ingest rather than give back.

Of course Google, Facebook, etc. have given back a lot to OSS. Just not at the same time as they received the value. It may be years until the internal rebuild of something some guy copied from somewhere is re-released.

Git was designed for OSS. That includes its radically transparent form of development. Then again with submodules you can easily vendor your private stuff, rather than doing things the other way around.

I have sympathy -- I "vendor" some libraries as I need to change something deep in the library. It's often not nice, and not something upstream would want.

This isn't really true. The Google OSS policies (opensource.google) require that you use a recent version. As a result, upstreaming functional changes is encouraged and easier than maintaining a local patch set. Sometimes there are local changes, but they're usually non-functional stuff to make the library integrate with Google infra.

They do vendor, but that's for security and efficiency reasons mostly, not to avoid giving back.

Dan Luu gives a good summary of benefits of monorepo: http://danluu.com/monorepo/


