To help spot problems, Facebook employees who access the social network from within the company's internal network will always see an experimental build of the site based on the very latest code, including proposed changes that haven't officially been accepted.
Probably the only place where your excuse for checking Facebook at work can be "Looking for bugs!"
The many data sources tracked by Facebook's internal monitoring tools even include tweets about Facebook. That information is displayed in a graph with separate trend lines to show the change in volume of positive and negative remarks.
Guess I need to tweet more about how slow their mobile app is getting...
One of the major ongoing development efforts at Facebook is a project to replace the HipHop transpiler. Facebook's developers are creating their own bytecode format and custom runtime environment.... the company can push thin bytecode deltas representing just the parts that have changed. Facebook may even be able to splice the updated bytecode into the application while it's running, avoiding a process restart.
Even though I'm doing nothing nearly this awesome, this article has done more to inspire and excite me about my own coding than anything I've read in a long time.
Can anyone speak to how well that is working? Has it been effective? Any negative side-effects? Is it just for the release process, or for any development effort?
It isn't a complex rating system: probably 97% of people are at the base karma level, maybe 0.1% one rung higher, 2.7% one rung lower, and 0.2% lower than that. The "Push" tech talk at https://www.facebook.com/video/video.php?v=10100259101684977 has more on it.
Mostly it is a way of letting you know that you made people's lives difficult by holding up the push process by not being available to support your changes. You know that you won't get away with that, that you need to make up for it, and you also know when you've made up for it.
(It also doesn't apply to a decent number of engineers, since they work on services and infrastructure that are not part of that process.)
I prefer to componentise applications and allow those components to be deployed, released, and rolled back separately.
I also don't agree with the 'rollback is for losers' message as hinted at in the article.
A fast, dependable rollback (measured in seconds) is significantly preferable to getting a developer to implement a fix for some issue under pressure and push it out in a rush. Much better to roll back, take stock, then implement the right fix under the tested process.
Teams that run "services" (such as the one that powers the "typeahead" search bar at the top of Facebook) deploy their services separately, at whatever pace makes sense for them. These are almost never in tight lock-step with anything else, for the usual reasons.
Instead, FB has an aggressive and flexible internal system for "ungating" features to groups based on different criteria. Usually a feature is pushed out in a deactivated state, then a developer slowly ramps up its exposure to actual traffic. This limits the ability of a single push to insta-break the site and means they can come back around for the next day's push with tweaks, then increase the code's exposure.
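For the curious, a rough sketch of how percentage-based gating like that can work (my own toy code with hypothetical names and config; Facebook's actual Gatekeeper system is surely far more sophisticated):

    <?php
    // Hypothetical percentage-based feature gate: code ships dark, then
    // exposure ramps up by raising rollout_percent from 0 toward 100.
    function feature_enabled($feature, $user_id, array $config) {
        if (empty($config[$feature]['enabled'])) {
            return false;  // pushed out in a deactivated state
        }
        $percent = $config[$feature]['rollout_percent'];  // e.g. 1, then 5, then 100
        // Hash the user id so each user gets a stable answer as the ramp-up proceeds.
        $bucket = abs(crc32($feature . ':' . $user_id)) % 100;
        return $bucket < $percent;
    }

    // Usage: if (feature_enabled('new_timeline', $user_id, $gates)) { /* new code path */ }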
It is compiled PHP. Do they embed an HTTP server? Do they talk FastCGI to a well-known web server, like plain PHP would?
In my business, we would be crucified if we had to wait 3 minutes to roll back a change, let alone 30.
I agree that you can and should have all of the gating and safety nets and checks in the world as you roll forward, but at the end of the day, you need the ultimate safety net of a very fast, very dependable rollback that you can run at will.
IMO reverting isn't for losers. It should be the first thing you do in case of errors, if your architecture supports it.
According to Rossi, the servers keep the old version around after rollout, so rollback wouldn't require pushing out the old binary again, just restarting the old process. That would probably take only a couple of seconds, making rollback really fast when needed.
NB: Fast JIT byte code VMs running web app frameworks in high level languages that can do all of this have been around since the early 2000s. (Including the distribution of binary deltas that can be applied atomically to live servers.) Smalltalk web app servers had all of this tech, plus refactoring capabilities and distributed version control years ahead of the rest of the industry. It makes me wonder what else is out there beneath the radar today.
The reason why people tend to gravitate to incremental deployments for web applications is because the typical tools encourage it; modules get installed in separate directories, each part of the app is a separate file (back in the CGI days), etc. When you compile everything into one file, though, then you just copy that file around to deploy. It's easier. (Ask a PHP programmer how to change one file, and it will probably be "change that one file". Ask a Java programmer, and it will be "fix the file, build a WAR, and replace the WAR". Tools dictate process, and the "scripting language" default is to work on a file level instead of an application level.)
I've always wanted one-file deployment for my personal applications, but I never saw anyone doing it so I assumed I was wrong. But nope, it turns out that everyone else was wrong :)
Although both the generation of the diff-set and its application on each individual server might end up eating more resources than just using torrents.
IRC's ability to quickly create temporary groups, temporary membership (essentially muting discussion by leaving, or peeking by joining), bot frameworks, and multiple clients are potential reasons why IRC might be preferable for the sort of things it is used for, I guess.
There is no need to re-invent IRC.
A real-time low-friction shared chat service would actually be a hugely compelling thing inside Facebook.
(just curious, I used it before in a corporate environment, it was fine for us)
It's almost like right now we're in the days of pine and mailman. What we could have is GMail.
We were all Unix geeks, so perhaps that's why it worked well for us.
With the amount of data Facebook has, they probably don't have the option of an "iterate over every row and change a thing" kind of migration.
Anyone know more about this? How are the user interactions handled?
Simple PHP statements take a lot more space in the binary than intuition suggests. E.g.:
if ($a == $b) ...
would seem like it should be
cmp %rax, %rbx
But $a and $b can each hold any of PHP's types, so the emitted code has to check both operands' types at runtime and apply PHP's loose-comparison rules for every combination. It ends up making the code really large, and one of the things that's unique about our efforts to run PHP fast relative to other dynamic language efforts is that sheer code bulk ends up being our largest enemy; if we're not careful, icache misses eat us alive.
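(To see why a single compare won't do, a few of the cases '==' has to handle when the operand types are only known at runtime; my own examples, not from the parent:)

    <?php
    var_dump("1" == "01");      // true: numeric strings are compared as numbers
    var_dump(100 == "1e2");     // true: "1e2" is parsed as the float 100.0
    var_dump(0 == "a");         // true on PHP 5 (string coerced to 0); false on PHP 8
    var_dump(array() == false); // true: an empty array is loosely equal to false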
I wouldn't say that we've "continued on the PHP path." I'd say that we've refused to throw out the precious PHP parts of our application, while not being afraid to use more appropriate languages across Thrift boundaries when needed. Our search engine, newsfeed, and ad serving infrastructure, for instance, are in C++.
A drop-everything-and-rewrite of the PHP code is entirely out of the question, for all the reasons covered in Spolsky's 12-year-old classic on the subject: http://www.joelonsoftware.com/articles/fog0000000069.html. Those of us working on making PHP perform better are a tiny fraction of Facebook engineering as a whole; this small overhead cost is nothing compared to the risks inherent in a ground-up rewrite.
Most of PHP's language-level faults can be engineered around. For instance, we have a code-review-time script that parses (really parses) the code to warn engineers (and reviewers) about dangerous or deprecated idioms.
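(For flavor, a toy sketch of that kind of check using PHP's own tokenizer; the real script and the list of idioms it flags are obviously more involved and internal.)

    <?php
    // Toy review-time lint: tokenize a file and flag a couple of idioms
    // a reviewer might want called out.
    function lint_file($path) {
        $warnings = array();
        $tokens = token_get_all(file_get_contents($path));
        foreach ($tokens as $tok) {
            if ($tok === '@') {
                $warnings[] = "$path: error suppression with '@'";
            } elseif (is_array($tok) && $tok[0] === T_EVAL) {
                $warnings[] = "$path: eval() on line {$tok[2]}";
            }
        }
        return $warnings;
    }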
PHP also has some affirmative virtues. The programming model is more productive than that of compiled languages, and even many interpreted languages; save/reload the web page is just a better, tighter loop to get work done in than save/compile/restart my server/reload the web page. I'm actually a fan of PHP's concurrency model, which naifs often mistake as "no concurrency allowed"; PHP's concurrency primitive is curl, and if you wrap a tiny bit of library around it, you can make it behave like actors.
 Seriously. curl provides a shared-nothing way to asynchronously run code, and has the virtue of not caring what language the other side is written in to boot.
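For anyone who hasn't seen the pattern, the core of it is just curl_multi: start several transfers, let them run in parallel, then collect the results. A rough sketch (my own toy code, not Facebook's actual wrapper):

    <?php
    function fetch_all(array $urls) {
        $mh = curl_multi_init();
        $handles = array();
        foreach ($urls as $key => $url) {
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_multi_add_handle($mh, $ch);
            $handles[$key] = $ch;
        }
        // Drive all transfers until every handle has finished.
        do {
            curl_multi_exec($mh, $running);
            curl_multi_select($mh);
        } while ($running > 0);
        $results = array();
        foreach ($handles as $key => $ch) {
            $results[$key] = curl_multi_getcontent($ch);
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
        curl_multi_close($mh);
        return $results;
    }

    // e.g. $pages = fetch_all(array('a' => 'http://svc-a/', 'b' => 'http://svc-b/'));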
I also think there is a difference between writing a new backend for something that is solid and in place, and throwing the whole product out the window to re-imagine it from the ground up; the latter is the kind of rewrite that should be avoided and considered dangerous. If you can put 4-5 guys on a real C# or C++ rewrite and tell them the final product has to behave identically to the PHP version, you have a much less volatile situation.
As for the PHP workflow, I agree it's nice not to have an intermediate step, but that intermediate step can usually be circumvented pretty rapidly by throwing a script or two (or just flipping a config option) into your development environment.
What real-world problems do you see staring Facebook in the face in its continued usage of PHP? I'm no PHP fan, but many of the common PHP problems are avoided or mitigated in the code base.
"In 2011, $606 million was allocated towards total capital investment in data center infrastructure by Facebook, which includes the cost of servers, networking equipment, construction, and storage." - http://www.colocationamerica.com/blog/facebook-data-center-i...
I wonder if there are any public benchmarks of compiled HipHop vs. C++
And Facebook's huge code base would be impossible to rewrite in even a few years; you'd be throwing away knowledge, tools, and tested code.
At one time the chat thing was supposedly done with Erlang - is that still true? Probably ejabberd hacked up, or something like that.
When you work on a codebase the size of Facebook's, the core language becomes less important than the abstractions and framework that have been built around it. As well as having sane internal libraries, Facebook has added extensions to the language like XHP and the yield keyword (http://www.serversidemagazine.com/news/10-questions-with-fac...) which make it much nicer to use.
I should amend my previous comment though; most places I've worked at that are within a couple of orders of magnitude of Facebook's size have developed or adopted abstractions that constitute the majority of the code engineers interact with.
Also, more info on the type inference please :) Context-sensitive? Do you model reference counts in your analysis? If so, do you get optimizations out of that?
Did the article really say a compile is 15 minutes? That's insane - how is it so fast? Does it limit the optimizations? You mustn't be able to do very deep analysis - or do you cache it somehow? What about FDO?
I've argued for years that the PHP interpreter's implementation leaks into the language semantics, but you seem to have cast-iron proof with the refcounts. Is it just destructors, or do you see errors due to copy-on-write semantics (like the weird array semantics when a reference sits inside an array that has been copied on write)?
(Sorry for all the questions, it's just very interesting)
1. (Type inference.) The compiler's type inference is basically a first-order symbolic execution without inlining; we don't iterate to a fixed point or anything, and any union types at all cause us to give up and call it a variant. Many other PHP surprises mess up type inference; for example, function parameters by reference aren't marked in the caller, so if anywhere in the codebase there exists a unary function that takes its first parameter by reference, $f($a) will spoil our inference of $a: $f might mutate it.
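(A toy illustration, not from our codebase: nothing at the call site tells you the callee takes its argument by reference, so a variable call like $f($a) can silently change $a's type.)

    <?php
    function bump(&$x) { $x = array($x); }  // mutates its argument in place

    $a = 1;        // looks like an int here...
    $f = 'bump';
    $f($a);        // ...but the call site gives no hint that $a is passed by reference
    var_dump($a);  // array(1) { [0]=> int(1) }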
Still, simple things work pretty well. One major area we do well on is inferring the return types of function calls. It's hard to say how much we're missing out by not doing more sophisticated type inference; while we can count the sites at compile time, this doesn't tell us which sites are dynamically important. Our anecdotal experience has been that type inference is one of the more powerful optimizations we can perform.
We don't model refcounts, and I'm certain there are opportunities there.
2. (Compile times.) 15 minutes sounds about right to me, and chuckr would certainly know better. Compiler speed itself is something we pay attention to, since it inhibits our ability to turn pushes around. The actual global AST analysis is multithreaded, so we keep the whole build machine busy. So far the trade-offs between optimizations and compile time have not been too dire; there haven't been things that would make the binary twice as fast if only we could compile twenty times as long, e.g.
3. (PHP implementation/semantics). Who've you been "arguing" against? :). It's a manifest fact that accidents of PHP's implementation have been sunk into the language. Destructors are one of the ways to witness refcounts, but there are others. var_dump() of arrays exposes the refcount of internal data; things whose refcount is >1 are displayed with a '&' preceding them, e.g.:
$a = 1;
$data = array(&$a);
var_dump($data);  // array(1) { [0]=> &int(1) }
Less trivially, it is hard to efficiently support the copy semantics for arrays without reference counting; other systems have overcome this, but it is less trivial than with refcounting.
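(For anyone following along, the behavior in question, roughly: assigning an array doesn't copy it; the copy is deferred until one side writes, which refcounting makes cheap to detect. A tiny example, my own:)

    <?php
    $a = range(1, 1000000);  // one big array
    $b = $a;                 // no copy yet: both variables share it (refcount 2)
    $b[] = 42;               // the first write to $b triggers the actual copy
    echo count($a), " ", count($b), "\n";  // 1000000 1000001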
2) Cool! phc went the route of really advanced, context-sensitive analysis, and the result was massive memory use and huge compile times for even simple programs (if using optimization).
3) Have you read IBM's POPL 2010 paper on it? Interesting stuff, though maybe not actionable. They argue (there's that word again!) that PHP's copy-on-write is flawed for arrays containing references: the reference can lose its reference-ness when the array gets altered and copied. They have an implementation which fixes this for a relatively small slowdown. Interesting paper, as I recall.
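(The baseline quirk, for anyone who hasn't hit it; a toy example of my own, not from the paper: a reference stored in an array survives a "copy" of that array, so the copy still shares that slot.)

    <?php
    $x = 1;
    $a = array(&$x, 2);
    $b = $a;      // value "copy" of the array...
    $b[0] = 99;   // ...but element 0 is a reference, so the write goes through it
    echo $x;      // 99: original and copy still share that slot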
FWIW, I think it's allowable to change the output of var_dump. Optimizing reference counts kinda requires it, and I know HipHop isn't in the business of strict conformance to PHP semantics.
2) I'm not claiming we have all the answers here; but decent compile times have been a strong evolutionary pressure all along.
3) Yeah, we basically implement the "slightly relaxed" semantics described in that paper, and are addicted to the performance wins from doing so. WRT loosening semantics, though, in general you'd be surprised how close we come to quirk-for-quirk compatibility. The problem is that real PHP programs, including ours, end up depending, often unintentionally, on those quirks. I.e., nobody meant to rely on the order function params are evaluated in, but in practice we rely on it because param 0's evaluation has some side effect that param 1's evaluation needs to run correctly...
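(A contrived, hypothetical example of that kind of accidental dependence: g() only works because f() happens to be evaluated first.)

    <?php
    $ctx = null;
    function f() { global $ctx; $ctx = array('user' => 42); return 'ok'; }
    function g() { global $ctx; return $ctx['user']; }  // silently assumes f() already ran

    var_dump(f(), g());  // only correct because the params happen to evaluate left-to-right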
I would imagine Facebook does the same, but the article pretty explicitly says that they don't do that (which is why I imagine you got downvoted; for the record, I think that was inappropriate), which raises the question: what exactly do they do?
It doesn't say whether or not these are part of the binary, but I'd suspect they aren't used as the CDN origin. You'd want these (randomly-but-uniquely named) resources to be pushed out to the CDN and warmed up before a deploy to avoid a thundering herd problem from millions of users.
Employees with low karma can regain their lost points over time by performing well—though some also try to help their odds by bringing Rossi goodies. Booze and cupcakes are Rossi's preferred currency of redemption; the release engineering team has an impressive supply of booze on hand, some of which was supplied by developers looking to restore their tarnished karma.
This sounds like Facebook strongly rewards developers who work on trivial, low-risk features rather than larger, more important features. Also, it sounds like bribery factors into your overall job performance rating.
(I'm not 100% sure, but I think most of the booze and cupcakes come from people who were appreciative of the release engineering team for bringing potential issues to their attention or for being accommodating in terms of hours and in terms of delay to get things fixed.)
Being irresponsible (not supporting your changes) will factor into your performance review, but working exclusively on low-risk features will most likely hurt it way more.