How Facebook pushes updates to the site (facebook.com)
220 points by creativityhurts on May 28, 2011 | 40 comments

I loved this video. Gatekeeper blew me away.

Summarized some of the highlights here if you don't have time to watch:


I thought he said perforce and git, not subversion and git.

Then corrected himself to subversion and git.

Ahh, thanks. I wasn't listening that closely.

I'm pretty impressed by the "push karma" system for gauging how risky individual engineers' commits are on average.

Here's another (much shorter!) video where Chuck Rossi talks about push karma very briefly: https://www.facebook.com/video/video.php?v=778890205865&...

I like how their entire development cycle revolves around people getting drunk on weekends.

It's their entire business model as well.

JWZ suggests:

"How will this software get my users laid" should be on the minds of anyone writing social software (and these days, almost all software is social software). "Social software" is about making it easy for people to do other things that make them happy: meeting, communicating, and hooking up.


This coming from a single person that owns a bar.

There are lots of other types in the world. Married, with children, etc.

Anyone know of a more automated system to handle forward/backward compatibility? Obviously, there's a lot of manual coding work that has to be done but is there a system that categorizes these various changes (schema or new URL for a page or change in backend service interface), automatically tracks and gets rid of these dependencies after a certain period of time? To give a concrete example, let's say I switched the Facebook messages URL to "/mail" from "/messages". I would mark the old handler as deprecated and eventually, after the new changes have been pushed to everyone, the system will prompt the developer to get rid of essentially the dead code. This is a very simplified example but I believe such deprecation tracking would be useful for more complex changes too.
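
A minimal sketch of what such deprecation tracking could look like, assuming a simple in-process registry. Every name here (`Deprecation`, `DeprecationRegistry`, `grace_days`, the `/old_profile` entry) is hypothetical and not something described in the video; the idea is just to record when an old path was deprecated and report it once its grace period has passed.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Deprecation:
    name: str            # the old thing, e.g. the "/messages" URL handler
    replacement: str     # what supersedes it, e.g. "/mail"
    deprecated_on: date  # when the old path was marked deprecated
    grace_days: int      # how long the old path must keep working

    def removable_on(self) -> date:
        # First day on which the old code may be deleted.
        return self.deprecated_on + timedelta(days=self.grace_days)


class DeprecationRegistry:
    """Hypothetical registry; a real one would likely live in the build or review tooling."""

    def __init__(self):
        self._entries = []

    def mark(self, name, replacement, deprecated_on, grace_days=30):
        entry = Deprecation(name, replacement, deprecated_on, grace_days)
        self._entries.append(entry)
        return entry

    def due_for_removal(self, today):
        # Entries whose grace period has ended on or before `today`.
        return [e for e in self._entries if e.removable_on() <= today]


registry = DeprecationRegistry()
registry.mark("/messages", "/mail", date(2011, 5, 1), grace_days=14)
registry.mark("/old_profile", "/profile", date(2011, 5, 20), grace_days=30)

for entry in registry.due_for_removal(date(2011, 5, 28)):
    print(f"{entry.name} can be deleted (replaced by {entry.replacement})")
```

A nightly job could run `due_for_removal` and prompt the owning developer. It would not detect whether old clients still hit the deprecated path, which is probably the harder half of the problem.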

I just wanted to mention one thing about this video. It's missing the first 3 minutes or so where I introduce myself and my team. It's also missing the part where I gave credit for some of these slides to John Allspaw and Paul Hammond from flickr who gave an awesome talk at the 2009 O'Reilly Velocity Conference. Their talk inspired me to put together this presentation.

So Facebook is programmed in PHP, but everything on the server is in C++ thanks to "hiphop"? mind=blown

PHP is actually one of the slowest mainstream interpreted languages. At Facebook levels of scale, that becomes a serious problem.

It doesn't even take a billion hits. I'd say it's a problem as soon as you scale to a couple dozen servers. At that point you're wasting enough money running extra cores that you could have hired another dev instead. We picked up something like 6x web frontend performance by doing a pretty straightforward if tedious Java port (64-bit Sun JVM), and it also put us in a position to start using NIO (with Netty) much more heavily.

Interesting, thanks. I don't have experience running PHP apps beyond a few servers, and the database was always the bottleneck.

And they deploy the 1 GB binaries via BitTorrent!

What he says at the beginning is, to me, the most important part. Having great tools is great (!), but the most important thing is to have the right culture around QA and releases.

It's interesting that their entire release architecture seems to be focused on never pushing bad things out to production, whereas given their traffic they could probably push things out much sooner (minutes after they're committed) to small parts of their overall traffic, and slowly increase the traffic on those pieces of code as they prove themselves to be stable, or quickly revert them if they're not.

That would mean having a lot of versions of Facebook live at any one point, but as those parts prove themselves stable they'd gradually be rolled out to all of their traffic.

One point that also wasn't covered is that as they're pushing things out they'll only cherry-pick parts of their codebase depending on which engineers they have around. I wonder if they have a lot of hairy merge conflicts around release time due to that, and bugs in production resulting purely from those merge conflicts. Or worse, subtle bugs resulting in change A going out, but being programmed against a function that was changed in change B, which is not going out because the author of change B isn't in today.

"they could probably push things out much sooner (minutes after they're committed) to small parts of their overall traffic, and slowly increase the traffic on those pieces of code as they prove themselves to be stable, or quickly revert them if they're not."

The risk to user data is way too high.

This could have serious consequences. You could push client bugs with erroneous API calls, or server-side bugs that cause data loss. Rolling back isn't enough to fix the damage. The user's data has reached a permanent bad state that they didn't intentionally reach. You could roll back the data of every person who used the change, which would undo all of their work. You could analyze the data and try to fix it. This might work, or it might get the user data into a different bad state.

Plus, bugs in the view of the site might not cause errors that pop up in your error console, since it's hard to write tests for "looks wrong." Obvious errors - "when I click on my profile picture my name disappears" - are caught by external people instead of internal people, which adds a level of indirection between a problem appearing and a fix being written.

That being said, there are great uses for gradual rollouts. The video mentions that they do this for mature features with Gatekeeper - the developer can conditionally enable a Prod feature, and see what it does.
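
For anyone wondering what a percentage-based gate looks like in practice, here is a rough sketch. It is not Gatekeeper's actual implementation (the video doesn't show one); the function name `in_rollout`, the feature name, and the bucketing scheme are all assumptions. The point is that hashing the feature and user ID together gives each user a stable bucket, so raising the percentage only adds users and never flips existing ones off.

```python
import hashlib


def in_rollout(feature: str, user_id: int, percent: float) -> bool:
    """Return True if this user falls inside the rollout percentage (0-100)."""
    # Hash feature and user together so different features get
    # independent user samples.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10000  # stable bucket in 0..9999
    return bucket < percent * 100


users = range(100_000)
enabled_1 = {u for u in users if in_rollout("new_messages_ui", u, 1.0)}
enabled_10 = {u for u in users if in_rollout("new_messages_ui", u, 10.0)}

print(len(enabled_1), len(enabled_10))
print(enabled_1 <= enabled_10)  # users at 1% are still enabled at 10%
```

Rolling back is then just setting the percentage to 0, which doesn't help with the data-corruption concern raised above but does limit how many users are exposed.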

This is correct, especially for a company under as much government privacy scrutiny as Facebook. An erroneous push that exposes private user data could lead to a very heavy fine.

Cherry picking based on which engineers are around is much more about our daily releases. Everything checked into trunk on Sunday will go out on Tuesday. But if I've requested a diff be merged for the Wednesday release, it won't happen unless I've told request_bot that I'm around to support my changes. This also means that if there are merge conflicts, the engineers who wrote the patches will be there to help resolve them.
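
The rule described above (a requested diff only ships if its author has confirmed they're around) can be sketched roughly like this. This is only an illustration: `select_for_push`, the diff IDs, and the data layout are made up, and nothing here reflects how request_bot is actually built.

```python
def select_for_push(requested_diffs, authors_present):
    """Split requested diffs into those that ship and those held back.

    A diff ships only if its author has confirmed they are available
    to support it during the push.
    """
    ship, hold = [], []
    for diff in requested_diffs:
        if diff["author"] in authors_present:
            ship.append(diff)
        else:
            hold.append(diff)
    return ship, hold


# Hypothetical merge requests for a daily release.
requests = [
    {"id": "D101", "author": "alice"},
    {"id": "D102", "author": "bob"},
    {"id": "D103", "author": "carol"},
]

ship, hold = select_for_push(requests, authors_present={"alice", "carol"})
print("shipping:", [d["id"] for d in ship])
print("held:", [d["id"] for d in hold])
```

Note this sketch does nothing about the dependency problem raised upthread, where a shipped change relies on a held one; that would need some notion of which diffs depend on which.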

I haven't watched the video and I've been up all night, so forgive me if I'm contradicting the video. I'm probably wrong and the video's probably right. At least as of 2009, you are correct and that is how things were pushed. Code would have to be reviewed before it was pushed, but push would happen in stages, and chuckr and others would monitor its progress and revert commits that were found to be broken, as they went out. Errors were monitored and correlated to sets of patches, and would be investigated in real-time. There was the usual weekly push for typical changes, daily push for important changes, and unscheduled pushes for critical/very urgent changes.

Unless you mean to suggest continuously integrating developer commits to trunk into the live branch, in which case, no, that's a horrible idea. Not every bug manifests itself that quickly.

As for merging, my memory is pretty hazy, but I believe pushing and merging went hand in hand, and a conflict meant two people were working on the same code without communicating. Code was often documented with its owner. The code review utility at the time would (I think) take that✝, and (I think) run a blame and automatically CC those people on the code review, so it could be caught before it went out.

✝ Unless that was just done by convention so you know who to ask about a bit of code you might need to revise. Sorry my memory is unreliable on that bit.

You have to remember that Facebook is a compiled binary (hiphop). They ended up with the release cycle they have, rather than true Continuous Integration, because the original compilation times for this binary were extreme. Pretty sure Chuck mentioned a binary in the range of 1 GB. They've since cut the time it takes to compile and can push within 15 minutes of a trunk merge, but it's still not on the scale of a few lines here, a few lines there.

Anyone know the URL for the code review tool they said they use and open sourced? He said "fabrication", however I cannot seem to find it. It's also not listed in http://developers.facebook.com/opensource/

This is how I push updates for now:

    git pull
    lein uberjar
    sudo restart myprj

Fabric is a nice tool for automating deployments: http://docs.fabfile.org/en/1.0.1/index.html

For example:

    fab production deploy

    fab web-servers deploy
    fab database backup
.. etc.

Sweet, thanks.

Does anyone know if a non-video summary or set of slides for this video exists?

Hacker Newser rasmus4200 took notes here: http://agilewarrior.wordpress.com/2011/05/28/how-facebook-pu...

There's also a two minute version at https://www.facebook.com/video/video.php?v=778890205865&....

Was able to download this video with savevideo.me

This actually wound up being a helpful comment. After switching to HD in order to see his text examples more clearly, I found out that the session starts over from the beginning and you can't fast forward the video at all.

This meant that I had to either stop watching halfway through so that I didn't have to sit through the same half-hour again, or rip the video so that I can watch it like a human.

I fast-forwarded with no problem. Just clicked to where I wanted to start watching in the video.

not HD mode. stable chrome for mac 10.6.

So they don't really test the code, they just push it out, and fix the bugs on Thursdays. Genius strategy. No wonder their platform breaks so often.

The code is tested. There are unit tests, Watir tests, tests written by the developer for that specific change, and all changes require a test plan from the engineer.

We prefer the term "watirboarding".

They have tests, but for systems of that complexity, some bugs only appear in production whatever you do.
