Hacker News new | past | comments | ask | show | jobs | submit login

Facebook’s backups were written in PHP. Well, the second version was, anyway. (The first version didn’t work so well.) By backups, I mean the central MySQL databases with profile and post information; not media or messages. The “Crown Jewels,” so to speak.

It was written by an engineer in a week or so and then handed off to me. I was a storage guy with some programming chops. Fewer chops than I thought.

Multiprocess, forking, long-lived, server-side, I/O intensive PHP. And it worked brilliantly. Over the following years, it scaled from a backup set of a few hundred TB to many, many petabytes.

I extended the code quite a bit, fixed a bunch of bugs (it was version 0 when it launched, no hate to the guy who wrote it), and added statistics for reports and coverage.

The main reliability problem ended up being when we switched to HPHP, the predecessor to HipHop. Since the primary and largest customer of this code was the web front-end, changes there would often break our backend code. One I remember was changing fork() to only start a new green thread; saved $millions on the front end, completely broke backups. We ended up getting a special posix_fork() function or something from the HPHP team that restored the old behavior.

Eventually I rewrote it in Python 2.7. It took me two complete attempts and then a further six months of tweaking to get it anywhere near as stable as the PHP version, even with brilliant guys like Haiping Zhao constantly re-weaving the carpet under our feet. I never did like it as much as the PHP version.

PHP is surprisingly good at multiprocess and forking. Though their are plenty of things that can go wrong with using pcntl_fork() in regards to memory/sockets.(Most base PHP classes are not written with explicit multiprocess handling.) I have written a couple different message queue/worker systems in PHP with forking and it always surprised people when they ask what language the system is written in.

I'm confused by your use of the term "backups" do you mean backend?

I think they literally mean backups of critical data, managed by PHP code.

Barracuda Backup was also written in PHP - including code to backup VMware servers, receiving data from the Windows agent, and connecting to Linux servers via SSHFS/CIFS, etc.

It worked surprisingly well, using similar multiprocessing techniques.

Can you elaborate a bit on what that code did? You said it a backup system talking to a MySQL? But there was a different front end and a different back end system that were both dependents on it?

Sorry, I wasn’t very clear. The backups were of MySQL databases and we stored them for a period of time (it was in the Terms of Service). When I say “back end” or “backend” I mean internal servers, that didn’t directly take HTTP requests. This means everything from databases to backup hosts to other data stores to dashboards. “Front end” means web servers that did take HTTP requests and served facebook.com. PHP development at early Facebook was always focused on the latter, for good reason.

Question I possibly couldn't ask a more appropriate person: how were the long-running server processes managed?

I really want to use PHP to manage backend-type, long-lived stuff... in a way that is relatively lightweight and self-managing, and idiomatically tolerates the occasional bit of unparseable syntax on script load or mis-named function call, without throwing a hissy fit.

Like... the only thing that'll typically ever knock php-fpm completely over is generally a script-triggered interpreter segfault, which is Bad™, exceptional, and (given php-fpm's collective uptime on all installations everywhere) vanishingly rare. Fatal error? FCGI response message passed upstream. Script exceeded execution time? Hard timeout (and message passed upstream). CLI SAPI? Clunk; no more script execution for you. I've always felt a bit left out in the cold by this, just in terms of how the language itself handles this edge case.

I guess I probably just stop whining and go setup systemd unit files or similar. That would definitely make the most sense in a production environment; I should probably build the muscle memory.

It's just that, for slower-paced and/or prototype-stage projects that don't have a clear direction... my brain's trying to reach for an equivalent of `php -S` that isn't there, and... I guess it's really easy to get idiomaticity-sniped, heh. ("But this project isn't a thing yet... and systemd unit files go in /etc, which is like, systemwide... and I forgot the unit file syntax again..." etc etc)

TL;DR, if this made sense :), did you ever encounter this sort of scenario, and how did you handle it? A $systemd_equivalent, language-level patches, shell script wrappers, cron, ...?

Oh, another curiosity - whenever remembering how pcntl_fork() works I usually have to reach for `killall php` a few times :) (especially when the forks are all using 100% CPU... and I accidentally start slightly too many...). How was killall isolated from nuking important server processes? Shebang line? Renamed interpreter binary (...probably not)? Different user accounts?

Thanks very much for reading :)

I think your first question is asking about something different than we did at the time (PHP 5.x). There was no central process that “ran” the backups—is that what you mean? There was a cron job on each backup server that started the work (systemd would do the job nowadays). The code would figure out what database servers it was responsible for and kick off those jobs, then exit on completion. Reporting and restaging to central storage was another set of crons, and so on. So though they ran for hours or days, the PHP processes had a well-defined start and terminus. The “master” process on each host started the worker children, did some housekeeping, wait()ed for hours, and took note of the exit codes. What you’re trying to do sounds a lot harder.

If you’re missing a magical command that sets up a test environment, I recommend writing it in shell, PHP, whatever, and sticking it in $HOME/bin. Or a makefile with a target so you can just run “make testserver” or the like; that way it will stay with the project. Or scripts in $PROJECT/bin or $PROJECT/scripts. Doesn’t really matter as long as it’s documented in the README and simple to execute. It’s permissible and customary to have a cleanup command, as well, if you started a background process. You could even have those start and stop commands create and then disable a systemd unit—that way you won’t have to look it up every time.

> How was killall isolated from nuking important server processes?

In general we didn’t isolate our processes against signals, because there was occasionally reason to send them. When we did send them, we sent them precisely. If a few kill -9s didn’t stop a backup then there was almost certainly a disk issue on the host and it was stuck on a kernel I/O system call, and we’d “nuke” the host (send it through a self-diagnosis and reimage cycle; cloud analogy: terminate and reallocate). It was definitely a cattle-not-pets environment. Other backup hosts would take up the slack.

Thanks very much for replying! Apologies for response latency...

Cron jobs make a lot of sense for periodic batch work.

It's interesting you used PHP to start and manage child workers (I recall reading somewhere in the docs about PHP being unable to report error code status correctly under certain conditions, but I can't find it right now).

Regarding environment, I was mostly pining about PHP's lack of a "correct"/idiomatic way to handle genuinely fatal conditions (like syntax errors) in a way that meant the binary would stay alive. I'm suddenly reminded of Java's resilience to failure - it'll loudly tell you all about everything currently going sideways in stderr/logcat/elsewhere, but it's "difficult enough" to properly knock a java process over (IIUC) that it's very commonly used for long-running server processes.

PHP-FPM has this same longevity property, but the CLI was designed to prefer crashing. I just always wished I didn't have to bolt on an afterthought extra layer to get reliability. So I wondered if I could learn anything from this particular long-running-process scenario. Cron is hard to beat though :)

Hmm. Automating the systemd unit creation process. Hmmmm... :) <mumbling about not knowing whether su/sudo credential/config sanity will produce password prompts>

And... heh, that's right, `kill` exists. Need to step up my game and stop `killall php`-ing. Good point.

Thanks again for the insight.

I met Haiping at the HipHop pre-launch thing where they brought some non-facebook-folks in to learn about it. I really liked him.

Why did you switch to Python? Surely it wasn't as perfomant and you already had a working solution.

Mainly to get off HPHP and relieve both that team and ours of the constant coordination and toe-mashing. We thought “modernizing” the code would be a better investment than jerry-rigging the Zend engine back into our infrastructure. I think it was the right call.

Also, with Python we could interact with other services using basic Thrift bindings, and even provide our own endpoints. With PHP, all RPCs and queries were prebuilt into our standard libraries with custom glue code that increasingly made non-Zend assumptions. Again, we could have made it work, but we would be swimming against the current.

All the rest of the database management code ended up being moved to Python, and benefitting greatly, so it eventually worked out.

In summary, organizational concerns.

> Mainly to get off HPHP

Did you consider "moving back" to PHP instead of emigrating to Python? Given that the code was originally written for PHP that sounds like the easiest way out of any problems caused by HPHP. Was it politics which led to the decision that HPHP was the only allowable PHP engine?

If we wanted to continue to run Zend we would have had to build out the entire supporting toolchain, from system package up. FB servers at the time ran on an ancient Linux distribution (I don’t remember if it was ever made public so I won’t say) with security patches and a modern kernel, so the system PHP was 4.2 or something similarly unusable. I think our server team eventually disabled the package because it kept causing problems when it was inadvertently installed.

And if we went that route, we would have had to build Thrift bindings and write all the glue code for any other internal service we wanted to talk to. They all had official client libraries, but they weren’t built for Zend anymore, and since they were frequent targets of micro-optimizations, they would no longer run on it. We would have been the sole maintainers of this alternate distribution and libraries, as well. And THEN we would have had to talk to the server team and convince them that it’s cool install this custom package that directly conflicts with some of the most important code we run that’s literally maintained by a team of PhDs and industry luminaries, but we pinky swear it won’t mess up. (Ever built a .deb or .rpm?) That would have been politics, but it never got that far.

Another team had already done the work for Python to “compile” your script and all its dependencies into a self-executing compressed file. All you had to do to talk to another service was import the client libraries, write your logic, run the packager, and ship the binary. (Very Go-like, in retrospect.) It was already widely used and well-supported, as a group effort. I even fixed a few bugs. So, for this particular niche at this particular company, PHP was a victim of its success elsewhere.

Did you use those backups often?

Why? Have you lost some "friends"?

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact