To help spot problems, Facebook employees who access the social network from within the company's internal network will always see an experimental build of the site based on the very latest code, including proposed changes that haven't officially been accepted.
Probably the only place where your excuse for checking Facebook at work can be "Looking for bugs!"
The many data sources tracked by Facebook's internal monitoring tools even include tweets about Facebook. That information is displayed in a graph with separate trend lines to show the change in volume of positive and negative remarks.
Guess I need to tweet more about how slow their mobile app is getting...
One of the major ongoing development efforts at Facebook is a project to replace the HipHop transpiler. Facebook's developers are creating their own bytecode format and custom runtime environment.... the company can push thin bytecode deltas representing just the parts that have changed. Facebook may even be able to splice the updated bytecode into the application while it's running, avoiding a process restart.
Even though I'm doing nothing nearly this awesome, this article has done more to inspire and excite me about my own coding than anything I've read in a long time.
I'm also really intrigued by the "karma" rating for all of their developers.
Can anyone speak to how well that is working? Has it been effective? Any negative side-effects? Is it just for the release process, or for any development effort?
It is "Push karma", so generally only applies to the push process. It isn't visible to anyone but the pushers and yourself (at least, I can't see anyone else's push karma in the expected places).
It isn't a complex rating system: probably 97% of people are at the base karma level, maybe 0.1% one rung higher, 2.7% one rung lower, and 0.2% lower than that. The "Push" tech talk at https://www.facebook.com/video/video.php?v=10100259101684977 has more on it.
Mostly it is a way of letting you know that you made people's lives difficult by holding up the push process by not being available to support your changes. You know that you won't get away with that, that you need to make up for it, and you also know when you've made up for it.
(It also doesn't apply to a decent number of engineers, since they work on services and infrastructure that are not part of that process.)
I'm surprised by the monolithic all-or-nothing deployments that they have off of the single binary.
I prefer to componentise applications and allow those components to be deployed, released, and rolled back separately.
I also don't agree with the 'rollback is for losers' message as hinted at in the article.
A fast, dependable rollback (measured in seconds) is significantly preferable to getting a developer to implement a fix for some issue under pressure and push it out in a rush. Much better to roll back, take stock, then implement the right fix under the tested process.
This deployment of what's called "www" ("dub-dub-dub") is somewhat monolithic in the sense that it is large, but it is not really an "all-or-nothing" deployment. It is tested internally (for at least 2-3 days) by employees (engineers and non-engineers), and then on a subset of servers, a larger subset, and so on. Performance, error rates, interaction rates, and so forth are compared between the incumbent and new-hotness versions for a reasonably reliable indication of issues throughout the rollout. More on the www push at https://www.facebook.com/video/video.php?v=10100259101684977
Teams that run "services" (such as the one that powers the "typeahead" search bar at the top of Facebook) deploy their services separately, at whatever pace makes sense for them. These are almost never in tight lock-step with anything else, for the usual reasons.
If a new release goes out every day, how is that tested internally for 2-3 days? Do employees get split across versions or are incremental changes rolled out much more frequently?
Sounds like only the latter is being discussed: "Facebook typically rolls out a minor update on every single business day. Major updates are issued once a week, generally on Tuesday afternoons."
Facebook is updating tens of thousands of servers with every push. "Rolling back" a release could take as long as a regular push and contribute to problems as the version in use diverges.
Instead, FB has an aggressive and flexible internal system for "ungating" features to groups based on different criteria. Usually a feature would be pushed out in a deactivated state, then a developer will slowly ramp up its exposure to actual traffic. This limits the ability for a push to insta-break the site and means they can come back around for the next day's push with tweaks, then increase the code's exposure.
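As a rough illustration of what a percentage-based gate can look like (the function and rollout table here are invented; Facebook's actual gating system is far more sophisticated and supports many more targeting criteria):

    <?php
    // Deterministically bucket users so a dormant feature can be ramped up gradually.
    function feature_enabled($feature, $user_id, array $rollout) {
        $percent = isset($rollout[$feature]) ? $rollout[$feature] : 0; // 0 = fully gated off
        return ($user_id % 100) < $percent;
    }

    $rollout = array('new_composer' => 5); // expose the feature to ~5% of users

    if (feature_enabled('new_composer', 31337, $rollout)) {
        echo "render new composer\n";
    } else {
        echo "render old composer\n";
    }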
According to the article reverting does NOT involve re-deploying. Each server maintains the previous version of the binary and if needed the release team can pull the switch to revert all the servers. I assume that takes seconds, not 30 mins.
In fact I worked at a Dutch social network where we also used HipHop. The new compiled binary is pushed to all servers and then started; the old binary is stopped and a port handover is done, so the deploy happens without downtime. The old binary stays on the system, making a rollback very fast. However, old binaries are removed after a while, so you can only roll back quickly to a recent version.
"Facebook is updating tens of thousands of servers with every push. "Rolling back" a release would effectively take as long as a regular push and could contribute to problems if they're found."
Absolutely not. A lot of places get around this with a simple symlink switch. You keep N older releases, so you might have /code/releases with various datestamped releases...and /code/releases/current pointing to the currently running one. Want to roll back? Point current at the one before it. Done.
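A sketch of that flip in code, using an atomic rename so requests never see a half-updated link (the paths and release names are made up):

    <?php
    // Roll back by repointing "current" at the previous release.
    $releases = '/code/releases';
    $previous = $releases . '/release-2012-04-04'; // the build to roll back to
    $tmp      = $releases . '/current.tmp';

    @unlink($tmp);                        // clear any stale temporary link
    symlink($previous, $tmp);             // build the new link off to the side
    rename($tmp, $releases . '/current'); // atomic swap on the same filesystem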
Presumably the 1.5 GB Facebook binary is deployed as a daemon listening on port 80. It seems unlikely that they are CGI-ing a new process for each request.
Still, though, the 1.5 GB binary takes 30 minutes to get out to the servers, per the article.
In my business, we would be crucified if we had to wait for 3 minutes to roll back a change let alone 30.
I agree that you can and should have all of the gating and safety nets and checks in the world as you roll forward, but at the end of the day, you need the ultimate safety net of a very fast, very dependable rollback that you can run at will.
IMO reverting isn't for losers. It should be the first thing you do in case of errors, if your architecture supports it.
"In my business, we would be crucified if we had to wait for 3 minutes to roll back a change let alone 30."
According to Rossi the servers keep the old version around after rollout, so a rollback wouldn't require them to push out the old binary again, just to restart the old process. That will probably only take a couple of seconds, making rollback really fast if needed.
Facebook's developers are creating their own bytecode format and custom runtime environment.... the company can push thin bytecode deltas representing just the parts that have changed. Facebook may even be able to splice the updated bytecode into the application while it's running, avoiding a process restart.
NB: Fast JIT byte code VMs running web app frameworks in high level languages that can do all of this have been around since the early 2000s. (Including the distribution of binary deltas that can be applied atomically to live servers.) Smalltalk web app servers had all of this tech, plus refactoring capabilities and distributed version control years ahead of the rest of the industry. It makes me wonder what else is out there beneath the radar today.
Because the continuous integration runs tests for all components at HEAD, not at every random possible combination that could end up on a machine. The key to releases is repeatability and consistency. Copying one big blob to every machine is repeatable and consistent. Installing a bunch of libraries and updating things piecemeal is much more difficult to do right. Internal bandwidth is cheap, so this is almost a no-brainer. Even without an internal bittorrent distribution mechanism, it's still easy.
The reason why people tend to gravitate to incremental deployments for web applications is because the typical tools encourage it; modules get installed in separate directories, each part of the app is a separate file (back in the CGI days), etc. When you compile everything into one file, though, then you just copy that file around to deploy. It's easier. (Ask a PHP programmer how to change one file, and it will probably be "change that one file". Ask a Java programmer, and it will be "fix the file, build a WAR, and replace the WAR". Tools dictate process, and the "scripting language" default is to work on a file level instead of an application level.)
I've always wanted one-file deployment for my personal applications, but I never saw anyone doing it so I assumed I was wrong. But nope, it turns out that everyone else was wrong :)
jpeterson didn't suggest "installing a bunch of libraries and updating things piecemeal". Instead of transferring the entire binary for every release, they could generate a (likely much smaller) patch, transfer just that, and apply it. I expect they're not doing this because it's computationally expensive compared to transferring the entire binary.
I wonder whether they've considered (something similar to) Google Courgette[1], which distributes compiled binaries that share many similarities by translating them into abstract basic blocks and rewriting pointers to match.
Although both the generation of the diff-set and its application on each individual server might end up eating more resources than just using torrents.
"bsdiff is quite memory-hungry. It requires max(17n,9n+m)+O(1) bytes of memory, where n is the size of the old file and m is the size of the new file."
Further on in the article they explain that Facebook is developing a HipHop virtual machine that allows them to run custom bytecode in their own runtime environment, which would allow them to push deltas. I'm fairly certain that pushing deltas of compiled C++ code at this scale is impossible.
I find it most interesting that they rely on IRC internally. They work on one of the world's largest online communications platforms -- surely they could solve their problem in a way that gives it to their millions of users too?
We do make extensive use of Facebook messages (many people, including me, make good use of Facebook Messenger on mobile and/or desktop) and of Facebook groups.
IRC's support for quickly created temporary groups, temporary membership (essentially muting a discussion by leaving, or peeking in by joining), bot frameworks, and multiple clients are potential reasons why it might be preferable for the sort of things it is used for, I guess.
As I understand it, the main reason is that IRC is decoupled from Facebook completely (or should be). In an "Oh shit, the sky is falling!" SEV situation we can trust (sorta) that IRC will be there.
IRC is hostile to the kind of user that 37signals markets to. I don't know that it's necessarily hostile to the kind of user who works in Facebook engineering.
In much the same way that if you live in a world of Word, vi is hostile. Don't get me wrong -- I use and love IRC, for many of the reasons given above. I just think that if I could have all or many of its benefits in a way that all my non-technical Facebook friends could use too, there'd be something exceptionally powerful and useful.
It's almost like right now we're in the days of Pine and Mailman. What we could have is Gmail.
The article mentions that employees visiting Facebook from inside use an experimental build. Any idea how they manage this if the experimental build requires changes in the data structures used by the site?
I suspect that if new features require storage changes, the changes are strictly additive and won't affect old features, or old features are modified to use the new storage setup.
With the amount of data Facebook has, they probably don't have the option of an "iterate over every row and change a thing" kind of migration.
Eventually you'll have regular users in the same state, either because the push is not atomic or because the change is rolled out slowly. This means you have to build the change with that in mind regardless of the employee testing.
The company has two separate sets of these tests; one does some conventional sanity checking on the code and the other simulates user interaction to make sure that the website's user interface behaves properly.
Anyone know more about this? How are the user interactions simulated?
I'm guessing they're talking about unit tests and integration tests, the integration tests probably simulating user input using a framework such as Selenium(RC).
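For what it's worth, a browser-driven test of that kind might look roughly like this with PHPUnit's Selenium RC extension (the URL and element locators are made up; this is a generic sketch, not Facebook's actual harness):

    <?php
    class ComposerTest extends PHPUnit_Extensions_SeleniumTestCase
    {
        protected function setUp()
        {
            $this->setBrowser('*firefox');
            $this->setBrowserUrl('http://www.example.test/');
        }

        public function testPostingAStatusUpdate()
        {
            $this->open('/');
            $this->type('id=composer-input', 'Hello from the test suite');
            $this->click('id=composer-submit');
            $this->waitForPageToLoad('30000');
            $this->assertTrue($this->isTextPresent('Hello from the test suite'));
        }
    }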
How do you wind up with a 1.5 GB binary? That's incredible -- especially considering all their static assets are on their CDN, so this is basically their code and all the libraries they're pulling in.
(I work on the HipHop compiler.) You start by compiling PHP source.
Simple PHP statements take a lot more space in the binary than intuition suggests. E.g.:
if ($a == $b) ...
would seem like it should be
cmp $rax, $rbx
jz ...
But! If type inference has failed, we don't know what types $a and $b are, so they might be strings or objects or something crazy. So we're going to have to indirectly dispatch to $a's '==' method. We also spend a ton of space on reference counting code; the semantics of the language basically force you to do naive reference counting, since refcounts can be witnessed in various ways, so every time we pass an argument, do an assignment, sometimes even evaluate expressions, we need to manipulate reference counts, and if they've gone to zero call a destructor.
It ends up making the code really large, and one of the things that's unique about our efforts to run PHP fast relative to other dynamic language efforts is that sheer code bulk ends up being our largest enemy; if we're not careful, icache misses eat us alive.
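A tiny example of what "refcounts can be witnessed" means in practice: the destructor fires at the exact moment the count hits zero, so the generated code has to keep the counts accurate everywhere.

    <?php
    class Handle {
        public function __destruct() { echo "destroyed\n"; }
    }

    $a = new Handle(); // refcount 1
    $b = $a;           // refcount 2
    unset($a);         // refcount 1 -- nothing printed yet
    $b = null;         // refcount 0 -- "destroyed" prints here, deterministically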
Finally, I'll note that it's not quite a 1.5GB binary. The actual ELF binary is something like 1.1GB, and the remainder of the package we bittorrent around production is stuff like static resources (javascript, css) and primed contents for the APC cache that we want prepopulated on boot.
Based on your work heretofore, do you think it's wise for Facebook to continue on the PHP path instead of working on a backend rewrite in C# or some other, saner language? I find it odd that Facebook is still using PHP and pouring lots of effort and cash into things like HipHop when they're obviously hiring people smart enough to use another language, and when they obviously have the runway to perform a dark horse rewrite into a much cleaner, saner backend.
I wouldn't say that we've "continued on the PHP path." I'd say that we've refused to throw out the precious PHP parts of our application, while not being afraid to use more appropriate languages across Thrift boundaries when needed. Our search engine, newsfeed, and ad serving infrastructure, for instance, are in C++.
A drop-everything-and-rewrite of the PHP code is entirely out of the question, for all the reasons covered in Spolsky's 12-year-old classic on the subject: http://www.joelonsoftware.com/articles/fog0000000069.html. Those of us working on making PHP perform better are a tiny fraction of Facebook engineering as a whole; this small overhead cost is nothing compared to the risks inherent in a ground-up rewrite.
Most of PHP's language-level faults can be engineered around. For instance, we have a code-review-time script that parses (really parses) the code to warn engineers (and reviewers) about dangerous or deprecated idioms.
PHP also has some affirmative virtues. The programming model is more productive than that of compiled languages, and even many interpreted languages; save/reload the web page is just a better, tighter loop to get work done in than save/compile/restart my server/reload the web page. I'm actually a fan of PHP's concurrency model, which naifs often mistake as "no concurrency allowed"; PHP's concurrency primitive is curl[1], and if you wrap a tiny bit of library around it, you can make it behave like actors.
[1] Seriously. curl provides a shared-nothing way to asynchronously run code, and has the virtue of not caring what language the other side is written in to boot.
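A rough sketch of that pattern, fanning out two backend requests concurrently and collecting the results (the URLs are placeholders):

    <?php
    $urls = array(
        'http://service-a.internal/feed?user=42',
        'http://service-b.internal/ads?user=42',
    );

    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    do {                          // drive all transfers in parallel
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);   // block briefly instead of busy-waiting
    } while ($running > 0);

    $results = array();
    foreach ($handles as $url => $ch) {
        $results[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);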
Right, I'm familiar with Spolsky's piece, but I think there are times when a rewrite is legitimate. I think that a situation where you must roll a completely custom in-house compiler that generates binaries which exceed 1 GB in size in order to get adequate performance of your app is a good candidate for a new architecture, despite Spolsky's claims. Spolsky's article discusses throwing out pages of code because the programmers "don't know what half of these API calls are for" and "[wanting] to build something grand" -- these are quite different impulses than the real-world problems staring Facebook in the face by its continued usage of PHP.
I think also that there is a difference between writing a new backend for something that is solid and in place and just throwing the whole product out the window and re-imagining it from the ground up, and I think the latter is the kind of rewrite that should be avoided and considered dangerous. When you could throw 4-5 guys on a real C# or C++ rewrite and tell them the final product has to behave identically to the PHP version, you have a much less volatile situation.
As for the PHP workflow, I agree it's nice not to have an intermediate step, but that intermediate step can usually be circumvented pretty rapidly by throwing a script or two (or just flipping a config option) into your development environment.
There is pretty much no way 4 to 5 guys could rewrite "www" in another language in parallel to continuing development from a hundred developers, no matter how long they had. However, 4 to 5 guys could write HipHop.
What real-world problems do you see staring Facebook in the face in its continued usage of PHP? I'm no PHP fan, but many of the common PHP problems are avoided or mitigated in the code base.
I think developing in PHP and compiling to C++/binary probably results in much higher developer productivity than developing in C++/C# directly. Developer salaries are undoubtedly their largest expense, by far, dwarfing the salaries of the people who make PHP performant and keep 1.5 GB binary updates sane.
I expect that running their datacenters is a larger expense than developer salaries.
"In 2011, $606 million was allocated towards total capital investment in data center infrastructure by Facebook, which includes the cost of servers, networking equipment, construction, and storage." - http://www.colocationamerica.com/blog/facebook-data-center-i...
The assumption that migrating the entire www stack to something like C++ would help with the datacenter costs is not supported by reality. Please remember that the bits that are in PHP are mostly front-end code. This handles the presentation of the data, but the actual heavy-lifting and data manipulation is done by the back-end infrastructure which is mostly C/C++ with some Java thrown in for the hadoop bits.
I would disagree here - every percent CPU saved for the same workload is a 1% reduction in the number of machines needed. The number of web machines is sufficiently large that savings of even 1% are praiseworthy. Quite a bit of effort is expended to keep this going down and to the right (at least some of the time).
Ah, good point, my estimate of their hardware spending wasn't that high. With 3000+ employees, depending on average total compensation, it might be close. That said, I would still guess that they'd choose greater developer productivity and agility over savings on server costs.
I wonder if there are any public benchmarks of compiled Hiphop vs. C++
PHP is not such a bad language. Of course it has its quirks. However, bad code is written by a bad developer, not by a bad language. You can write beautiful code in PHP too.
And Facebook's huge code base would be impossible to rewrite in even a few years without throwing away knowledge, tools, and tested code.
Thanks for your insights, but this makes me wonder: how much PHP code is there to make a rewrite unfeasible? At least the public part shouldn't be so big; are there huge administrative interfaces that somehow can't be separated?
To add to what others have said most Facebook hires learn PHP during the boot camp process. The continued use of the PHP is definitely not due to lack of experience with other languages or some misplaced fondness for the language.
When you work on a codebase the size of Facebook's the core language becomes less important than the abstractions and framework that have been built around it. As well as having sane internal libraries Facebook have added extensions to the language like XHP and the yield keyword (http://www.serversidemagazine.com/news/10-questions-with-fac...) which make it much nicer to use.
I should amend my previous comment, though; most places I've worked at that are within a couple of orders of magnitude of Facebook's size have developed or adopted abstractions that constitute the majority of the code engineers interact with.
How good is the type inference? I did some research on PHP back in the day, and reckoned that a huge number of types (80%-90%) could be inferred - but I didn't run it on anything like Facebook's codebase.
Also, more info on the type inference please :) Context-sensitive? Do you model reference counts in your analysis? If so, do you get optimizations out of that?
Did the article really say a compile is 15 minutes? That's insane - how is it so fast? Does it limit the optimizations? You mustn't be able to do very deep analysis - or do you cache it somehow? What about FDO?
I've argued for years that the PHP interpreter's implementation leaks into the language semantics, but you seem to have cast-iron proof with the refcounts. Is it just destructors, or do you see errors due to copy-on-write semantics (like the weird array semantics when a reference is in a copy-on-written array)?
(Sorry for all the questions, it's just very interesting)
I haven't done any work on the type inference stuff, so my understanding is second-hand. My peers who actually know what they're talking about are working on a more complete write-up, so stay tuned.
1. (Type inference.) The compiler's type inference is basically a first-order symbolic execution without inlining; we don't iterate to a fixed point or anything, and any union types at all cause us to give up and call it a variant. Many other PHP surprises mess up type inference; for example, function parameters by reference aren't marked in the caller, so if anywhere in the codebase there exists a unary function that takes its first parameter by reference, $f($a) will spoil our inference of $a: $f might mutate it.
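A contrived example of that by-reference pitfall (the function names are invented): nothing at the call site says whether $a can be mutated, so the compiler has to assume it might be.

    <?php
    function bump(&$x) { $x += 1; }       // takes its argument by reference
    function show($x)  { echo $x, "\n"; } // takes it by value

    $f = mt_rand(0, 1) ? 'bump' : 'show'; // which one? only known at runtime
    $a = 41;
    $f($a); // $a may or may not be mutated here, so its inferred type degrades to a variant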
Still, simple things work pretty well. One major area we do well on is inferring the return types of function calls. It's hard to say how much we're missing out by not doing more sophisticated type inference; while we can count the sites at compile time, this doesn't tell us which sites are dynamically important. Our anecdotal experience has been that type inference is one of the more powerful optimizations we can perform.
We don't model refcounts, and I'm certain there are opportunities there.
2. (Compile times.) 15 minutes sounds about right to me, and chuckr would certainly know better. Compiler speed itself is something we pay attention to, since it inhibits our ability to turn pushes around. The actual global AST analysis is multithreaded, so we keep the whole build machine busy. So far the trade-offs between optimizations and compile time have not been too dire; there haven't been, say, optimizations that would make the binary twice as fast if only we could spend twenty times as long compiling.
3. (PHP implementation/semantics). Who've you been "arguing" against? :). It's a manifest fact that accidents of PHP's implementation have been sunk into the language. Destructors are one of the ways to witness refcounts, but there are others. var_dump() of arrays exposes the refcount of internal data; things whose refcount is >1 are displayed with a '&' preceding them, e.g.:
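(A minimal illustration; the exact output formatting differs a little between PHP versions.)

    <?php
    $a = array(1, 2);
    $b = &$a[0];  // $a[0] is now referenced from two places
    var_dump($a);
    // array(2) {
    //   [0]=>
    //   &int(1)  <-- the '&' leaks the engine's internal bookkeeping
    //   [1]=>
    //   int(2)
    // }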
Less trivially, it is hard to efficiently support the copy semantics for arrays without reference counting; other systems have overcome this, but it is less trivial than with refcounting.
1) So what about the (I would think fairly common) case of loops where the variable is first initialized in the loop? On one path it's null, on the other it's a typed value. Can you optimize that?
2) Cool! phc went the route of really advanced, context-sensitive analysis, and the result was massive memory use and huge compile times for even simple programs (when optimizing).
3) Have you read IBM's POPL 2010 paper on it? Interesting stuff, though maybe not actionable. They argue (there's that word again!) that PHP's copy-on-write is flawed for arrays containing references - the reference can lose its reference-ness when the array gets altered and copied. They have an implementation which fixes this for a relatively small slowdown. Interesting paper, as I recall.
FWIW, I think it's allowable to change the output of var_dump. Optimizing reference counts kinda requires it, and I know HipHop isn't in the business of strict conformance to PHP semantics.
1) Great question! Yes, it is super-common. Most of our type inferences on local variables are actually of form "Null-or-X". So we can emit code that skeptically checks for Null, but then goes ahead and assumes X otherwise.
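For instance, a made-up function in exactly that shape:

    <?php
    function total(array $rows) {
        $sum = null;            // stays null on the zero-iteration path
        foreach ($rows as $row) {
            if ($sum === null) {
                $sum = 0;
            }
            $sum += $row;       // inferred as Null-or-Int: check for null, then assume int
        }
        return $sum;
    }

    var_dump(total(array(1, 2, 3))); // int(6)
    var_dump(total(array()));        // NULL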
2) I'm not claiming we have all the answers here; but decent compile times has been a strong evolutionary pressure all along.
3) Yeah, we basically implement the "slightly relaxed" semantics described in that paper, and are addicted to the performance wins from doing so. WRT loosening semantics, though, in general you'd be surprised how close we come to quirk-for-quirk compatibility. The problem is that real PHP programs, including ours, end up depending, often unintentionally, on those quirks. I.e., nobody meant to rely on the order function params are evaluated in, but in practice we rely on it because param 0's evaluation has some side effect that param 1's evaluation needs to run correctly...
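A contrived example of that kind of accidental dependence: this only works because PHP happens to evaluate call arguments left to right.

    <?php
    $log = array();
    function record(&$buf, $value) { $buf[] = $value; return $value; }
    function last_entry($buf) { return end($buf); }

    // Argument 0's side effect (appending to $log) is what argument 1 reads.
    printf("%s then %s\n", record($log, 'a'), last_entry($log)); // prints "a then a"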
The HHVM JIT isn't in (much of) production yet, because it isn't fast enough yet. Once we have high confidence it can provide some perf wins, we'll go ahead and move over to it. Even if it were perf neutral, there is some virtue to having developers use the same environment as production.
I expect you end up with that when you decide that you never, ever want to have to resolve a problem with an executable and a dynamic library having incompatible versions.
The companies whose deployment processes I've seen did exactly that - after an update, a request to the CDN results in a cache miss, which forces the CDN to fetch the asset from the company's origin servers. In those cases, the static assets had to be distributed onto those servers.
I would imagine Facebook does the same, but the article pretty explicitly says that they don't do that (which is why I imagine you got downvoted - for the record, I think that was inappropriate), which raises the question of what exactly they do.
> The binary executable is just one part of the Facebook application stack, of course. Many external resources are referenced from Facebook pages, including JavaScript, CSS, and graphical assets. Those files are hosted on geographically distributed content delivery networks (CDNs).
It doesn't say whether or not these are part of the binary, but I'd suspect they aren't used as the CDN origin. You'd want these (randomly-but-uniquely named) resources to be pushed out to the CDN and warmed up before a deploy to avoid a thundering herd[1] problem from millions of users.
Ignoring technical ways to work around the thundering herd problem, the slow rollout (employees first for several days, then a subset of servers, then a larger subset) also mitigates the problem.
Facebook's testing practices and culture of developer accountability help to prevent serious bugs from being rolled out in production code. When a developer's code disrupts the website and necessitates a post-deployment fix, the incident is tracked and factored into Facebook's assessment of the developer's job performance.
[...]
Employees with low karma can regain their lost points over time by performing well—though some also try to help their odds by bringing Rossi goodies. Booze and cupcakes are Rossi's preferred currency of redemption; the release engineering team has an impressive supply of booze on hand, some of which was supplied by developers looking to restore their tarnished karma.
This sounds like Facebook strongly rewards developers who work on trivial, low-risk features rather than larger, more important features. Also, it sounds like bribery factors into your overall job performance rating.
Push karma primarily affects how likely the release engineering team is to accept any bending of the standard rules for getting your code into the push. It generally doesn't drop if you are responsive and take responsibility for any problems your change causes. The only way to restore points is to show respect and consideration for the hard work the release engineering team does.
(I'm not 100% sure, but I think most of the booze and cupcakes come from people who were appreciative of the release engineering team for bringing potential issues to their attention or for being accommodating in terms of hours and in terms of delay to get things fixed.)
Being irresponsible (not supporting your changes) will factor into your performance review, but working exclusively on low-risk features will most likely hurt it way more.