It was written by an engineer in a week or so and then handed off to me. I was a storage guy with some programming chops. Fewer chops than I thought.
Multiprocess, forking, long-lived, server-side, I/O intensive PHP. And it worked brilliantly. Over the following years, it scaled from a backup set of a few hundred TB to many, many petabytes.
I extended the code quite a bit, fixed a bunch of bugs (it was version 0 when it launched, no hate to the guy who wrote it), and added statistics for reports and coverage.
The main reliability problem ended up being when we switched to HPHP (HipHop for PHP, the predecessor to HHVM). Since the primary and largest customer of this code was the web front-end, changes there would often break our backend code. One I remember was changing fork() to only start a new green thread; it saved $millions on the front end but completely broke backups. We ended up getting a special posix_fork() function or something from the HPHP team that restored the old behavior.
Eventually I rewrote it in Python 2.7. It took me two complete attempts and then a further six months of tweaking to get it anywhere near as stable as the PHP version, even with brilliant guys like Haiping Zhao constantly re-weaving the carpet under our feet. I never did like it as much as the PHP version.
It worked surprisingly well, using similar multiprocessing techniques.
I really want to use PHP to manage backend-type, long-lived stuff... in a way that is relatively lightweight and self-managing, and idiomatically tolerates the occasional bit of unparseable syntax on script load or mis-named function call, without throwing a hissy fit.
Like... the only thing that'll typically knock php-fpm completely over is a script-triggered interpreter segfault, which is Bad™, exceptional, and (given php-fpm's collective uptime across all installations everywhere) vanishingly rare. Fatal error? FCGI error response passed upstream. Script exceeded execution time? Hard timeout (and message passed upstream). CLI SAPI? Clunk; no more script execution for you. I've always felt a bit left out in the cold by this, just in terms of how the language itself handles this edge case.
I guess I should just stop whining and go set up systemd unit files or similar. That would definitely make the most sense in a production environment; I should probably build the muscle memory.
It's just that, for slower-paced and/or prototype-stage projects that don't have a clear direction... my brain's trying to reach for an equivalent of `php -S` that isn't there, and... I guess it's really easy to get idiomaticity-sniped, heh. ("But this project isn't a thing yet... and systemd unit files go in /etc, which is like, systemwide... and I forgot the unit file syntax again..." etc etc)
TL;DR, if this made sense :), did you ever encounter this sort of scenario, and how did you handle it? A $systemd_equivalent, language-level patches, shell script wrappers, cron, ...?
Oh, another curiosity - whenever remembering how pcntl_fork() works I usually have to reach for `killall php` a few times :) (especially when the forks are all using 100% CPU... and I accidentally start slightly too many...). How was killall isolated from nuking important server processes? Shebang line? Renamed interpreter binary (...probably not)? Different user accounts?
Thanks very much for reading :)
If you’re missing a magical command that sets up a test environment, I recommend writing it in shell, PHP, whatever, and sticking it in $HOME/bin. Or a makefile with a target so you can just run “make testserver” or the like; that way it will stay with the project. Or scripts in $PROJECT/bin or $PROJECT/scripts. Doesn’t really matter as long as it’s documented in the README and simple to execute. It’s permissible and customary to have a cleanup command, as well, if you started a background process. You could even have those start and stop commands create and then disable a systemd unit—that way you won’t have to look it up every time.
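For the `php -S` case specifically, that wrapper might look something like the following minimal pidfile-based sketch. Everything here is an assumption for illustration — the `bin/testserver` name, the pidfile location, and the `php -S localhost:8080 -t public` command are made up, and both can be overridden via environment variables:

```shell
#!/bin/sh
# bin/testserver (hypothetical) — start/stop/status for a background dev server.
# Usage: testserver start | stop | status
PIDFILE="${PIDFILE:-/tmp/testserver.pid}"
SERVER_CMD="${SERVER_CMD:-php -S localhost:8080 -t public}"

start() {
    if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        echo "already running (pid $(cat "$PIDFILE"))"; return 1
    fi
    $SERVER_CMD &           # background the dev server
    echo $! > "$PIDFILE"    # remember its pid for stop/status
    echo "started (pid $(cat "$PIDFILE"))"
}

stop() {
    [ -f "$PIDFILE" ] || { echo "not running"; return 1; }
    kill "$(cat "$PIDFILE")" 2>/dev/null   # precise kill, not killall
    rm -f "$PIDFILE"
    echo "stopped"
}

status() {
    if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        echo running
    else
        echo stopped
    fi
}

case "${1:-}" in
    start)  start ;;
    stop)   stop ;;
    status) status ;;
esac
```

A matching makefile target can then be a one-liner, e.g. `testserver: ; bin/testserver start`, so it stays with the project as suggested above.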
> How was killall isolated from nuking important server processes?
In general we didn’t isolate our processes against signals, because there was occasionally reason to send them. When we did send them, we sent them precisely. If a few kill -9s didn’t stop a backup then there was almost certainly a disk issue on the host and it was stuck on a kernel I/O system call, and we’d “nuke” the host (send it through a self-diagnosis and reimage cycle; cloud analogy: terminate and reallocate). It was definitely a cattle-not-pets environment. Other backup hosts would take up the slack.
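One way to make "we sent them precisely" concrete: `pgrep`/`pkill -f` match against the full command line, whereas `killall php` matches every process whose name is `php`. A sketch of the difference, using `sleep` processes as stand-ins for a hypothetical `php worker.php` (the durations are built at runtime just so the patterns are unique):

```shell
# Stand-ins: "sleep $target" plays the worker we want dead,
# "sleep $bystander" plays an unrelated process that must survive.
target="37.$$"; bystander="42.$$"
sleep "$target" &      # the worker we want to stop
sleep "$bystander" &   # the unrelated process that must survive
pkill -TERM -f "sleep $target"   # -f matches the full command line, so only this one
sleep 0.2                        # give TERM a moment to land
pgrep -f "sleep $bystander" >/dev/null && echo "bystander survived"
```

With distinct script names, `pkill -f worker.php` stops the runaway forks while php-fpm and everything else named `php` carries on.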
Cron jobs make a lot of sense for periodic batch work.
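For the periodic-batch case, one common trick is wrapping the cron command in `flock -n` so a slow run can't pile up behind itself. A hypothetical crontab entry (all paths made up):

```crontab
# m/h/dom/mon/dow  command
*/15 * * * *  flock -n /tmp/batch.lock /usr/bin/php /home/me/project/batch.php >> /home/me/project/batch.log 2>&1
```

`flock -n` exits immediately if the previous run still holds the lock, instead of stacking a second copy.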
It's interesting you used PHP to start and manage child workers (I recall reading somewhere in the docs about PHP being unable to report error code status correctly under certain conditions, but I can't find it right now).
Regarding environment, I was mostly pining about PHP's lack of a "correct"/idiomatic way to handle genuinely fatal conditions (like syntax errors) in a way that meant the binary would stay alive. I'm suddenly reminded of Java's resilience to failure - it'll loudly tell you all about everything currently going sideways in stderr/logcat/elsewhere, but it's "difficult enough" to properly knock a java process over (IIUC) that it's very commonly used for long-running server processes.
PHP-FPM has this same longevity property, but the CLI was designed to prefer crashing. I just always wished I didn't have to bolt on an afterthought extra layer to get reliability. So I wondered if I could learn anything from this particular long-running-process scenario. Cron is hard to beat though :)
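For what it's worth, the "afterthought extra layer" can be as small as a respawn loop in shell. A sketch, with the function name and the example worker command both made up: it restarts on any nonzero exit (which covers fatal errors and parse errors alike) and gives up after a bounded number of attempts so a permanently broken script can't spin forever:

```shell
#!/bin/sh
# Minimal respawn wrapper (sketch). Re-runs the given command until it
# exits cleanly, or until MAX_RESTARTS consecutive failures (default 5).
run_supervised() {
    # $@ is the worker command, e.g.: run_supervised php worker.php
    restarts=0
    until "$@"; do
        status=$?                       # exit status of the failed run
        restarts=$((restarts + 1))
        echo "worker exited with status $status (restart $restarts)" >&2
        [ "$restarts" -ge "${MAX_RESTARTS:-5}" ] && return 1
        sleep 1                         # brief backoff before respawning
    done
}
```

Usage would be something like `run_supervised php worker.php` (hypothetical script name); systemd's `Restart=` is the same idea with better bookkeeping.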
Hmm. Automating the systemd unit creation process. Hmmmm... :) <mumbling about not knowing whether su/sudo credential/config sanity will produce password prompts>
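On the su/sudo worry: user-level units sidestep it entirely. They live under `~/.config/systemd/user/` (not /etc, nothing systemwide), and `systemctl --user` never prompts for a password. A hypothetical unit, with `Restart=` doing the babysitting:

```ini
# ~/.config/systemd/user/myworker.service  (hypothetical name and paths)
[Unit]
Description=Long-lived PHP worker (sketch)

[Service]
ExecStart=/usr/bin/php /home/me/project/worker.php
Restart=on-failure
RestartSec=5

[Install]
WantedBy=default.target
```

`systemctl --user daemon-reload && systemctl --user enable --now myworker` starts it, and `loginctl enable-linger` keeps user units running after you log out.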
And... heh, that's right, `kill` exists. Need to step up my game and stop `killall php`-ing. Good point.
Thanks again for the insight.
Also, with Python we could interact with other services using basic Thrift bindings, and even provide our own endpoints. With PHP, all RPCs and queries were prebuilt into our standard libraries with custom glue code that increasingly made non-Zend assumptions. Again, we could have made it work, but we would be swimming against the current.
All the rest of the database management code ended up being moved to Python, and benefitting greatly, so it eventually worked out.
In summary, organizational concerns.
Did you consider "moving back" to PHP instead of emigrating to Python? Given that the code was originally written for PHP, that sounds like the easiest way out of any problems caused by HPHP. Was it politics that led to the decision that HPHP was the only allowable PHP engine?
And if we went that route, we would have had to build Thrift bindings and write all the glue code for any other internal service we wanted to talk to. They all had official client libraries, but they weren’t built for Zend anymore, and since they were frequent targets of micro-optimizations, they would no longer run on it. We would have been the sole maintainers of this alternate distribution and libraries, as well. And THEN we would have had to talk to the server team and convince them that it’s cool to install this custom package that directly conflicts with some of the most important code we run, that’s literally maintained by a team of PhDs and industry luminaries, but we pinky swear it won’t mess up. (Ever built a .deb or .rpm?) That would have been politics, but it never got that far.
Another team had already done the work for Python to “compile” your script and all its dependencies into a self-executing compressed file. All you had to do to talk to another service was import the client libraries, write your logic, run the packager, and ship the binary. (Very Go-like, in retrospect.) It was already widely used and well-supported, as a group effort. I even fixed a few bugs. So, for this particular niche at this particular company, PHP was a victim of its success elsewhere.