Poll: What do you use for Unix process management/monitoring?
142 points by gosuri on July 22, 2013 | 124 comments
I'm setting up production infrastructure for a new project and would love to know what you use for process management.
[Poll results: most option labels were lost from this copy. The surviving vote counts are 199, 131, 123, 88, 53, 44, 26, 5, and 3 points; "god" (26 points) is the only option whose label survived.]




I use runit in production for http://typing.io. I appreciate runit's strong Unix philosophy (shell scripts instead of DSLs). However, I'm starting to experiment with systemd because of features like properly tracking and killing services [1]. This would be useful for a task like upgrading the nginx binary without dropping connections [2], which isn't possible with runit (and most process monitors) because nginx double-forks, breaking out of its supervision tree.

[1] http://0pointer.de/blog/projects/systemd-for-admins-4.html

[2] http://wiki.nginx.org/CommandLine#Upgrading_To_a_New_Binary_...
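
For anyone who hasn't used runit: a service is just a directory containing an executable run script, so a minimal sketch for nginx might look like this (paths are illustrative, not typing.io's actual setup):

    #!/bin/sh
    # /etc/sv/nginx/run -- runit executes this and supervises the resulting process
    exec 2>&1
    # 'daemon off' keeps nginx in the foreground so runit can track it
    exec /usr/sbin/nginx -c /etc/nginx/nginx.conf -g 'daemon off;'

The service is enabled by symlinking the directory into the supervised directory (/etc/service on most distros, but the location varies).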


Side note: that's a wonderful page, but I think it would benefit greatly from using something like Mozilla Persona instead of a Google account.


why? Google account is so universal. Who has a Mozilla persona?


> Who has a Mozilla persona?

Anyone whose email provider provides persona automatically, or who runs their own identity provider, or who has taken 10 seconds to create an account on Mozilla's.


Because Google sells your data, I guess. And the NSA scandal.


And because you're telling Google more when you log into a site with OpenID (or any Google-specific mechanisms) than you're telling anyone when you log in with Persona.


Does nginx ever really crash, though? It hasn't in my experience.


Supervision has many benefits besides automatic restarts after crashes. Supervisors provide a consistent way to start, monitor, and log long-running programs. Nginx reinvents some of this functionality with its own interface (daemonizing, log rotation/compression, conf reloading, etc.), but it's useful for all services to work under the same interface. This is especially true for checking nginx's status, where a supervisor like runit is much nicer than 'pgrep nginx' or 'ps aux | grep $(cat /where/nginx/dumps/its/pid)'.
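
For example, with runit (assuming an nginx service directory), status checks and restarts go through one interface:

    # same interface for every supervised service
    sv status nginx
    sv restart nginx
    # versus the ad-hoc approach without a supervisor
    ps aux | grep "$(cat /var/run/nginx.pid)"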


There are some good points there, but they're not particularly applicable to nginx. nginx's relative robustness, its need to log to more than one file, and its need to fork itself in order to perform graceful configuration state transitions suggest that it's not really low-hanging fruit for robust process supervision.

Besides, its control interface is the same as any other subsystem if you've configured it correctly -- i.e., "service nginx <verb>".


I agree that nginx needs supervision less than most processes because it reinvents many wheels. However, supervision is still nice, e.g. your 'service nginx' example that uses Ubuntu's supervisor Upstart.

I agree it's not worth straining to make nginx's binary upgrade work with arbitrary supervision. However, if someone created a supervisor that solves this problem (systemd), I might give it a try.


service(8) is supposed to be supervisor-independent. If it's not, that's a problem.


My bad, I've only used the service command on Ubuntu post-upstart and thought they were more deeply related.


Like with all software... it doesn't until it does. And when it does you wish you had an auto-restart system in place.

It doesn't even have to be nginx's fault. It could be that some other process started fork-bombing the system and the OOM killer decided that killing nginx was the way to resolve it, before going after the actual offender.


I enjoyed how the page scales when I zoom in :)


You can turn off daemonization in nginx.


Disabling daemonization allows nginx to be monitored initially and through conf reloads, but not through binary upgrades (e.g. nginx 1.5.1 -> nginx 1.5.2). The binary upgrade process forks a new master process, which ends up 'orphaned' from runit and reparented under pid 1.

oops, edit: pid 0 -> pid 1


pid 1, actually :)


You forgot one:

sysv init!

All of my systems' processes are managed by it and have been for at least two decades.

Occasionally I do these periodic tasks as well, which are handled by a thing called "cron".

Yes this is sarcasm. There is a lot of wheel-reinventing done these days which is entirely unnecessary if you consider the long-forgotten "Unix philosophy".


sysv init doesn't monitor processes, so that doesn't fully answer the question. What do you use to ensure that the processes started by sysv init are still running?


Seriously, it does not? Add something like this:

   myapp:234:respawn:/path/to/myscript
to your inittab, and sysvinit will relaunch it (if it dies).
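
The fields are id:runlevels:action:process, so "234" here means runlevels 2 through 4. After editing /etc/inittab, init can be told to reread it without a reboot:

    telinit q    # ask sysvinit to re-examine /etc/inittab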

Regards.


ps is used for monitoring.

I make sure I write processes that don't fail. This is done with tooling (valgrind, cachegrind, unit tests, integration tests) and testing (automated testing, experienced testers).

The whole process restart mechanism is a combination of string and sticky tape to abstract away the problem of half-arsed poorly written bits of functionality that crap themselves every few minutes.

I can and have written processes that stay alive for years (5 years 122 days pegged at 100% CPU being the record after which a reboot was required as the Sun Ultra 2 hosting it was replaced).


> I make sure I write processes that don't fail.

Please name one major, non-trivial piece of network system software that has gone 5 years, 122 days without a fault in production anywhere.

Engineers in other disciplines make best efforts not to fuck up. And then they accept that they, or others, will fuck up anyhow and they plan for that.

So do we. And so we should. Betting it all on Red 19 is a strategy of roulette, not serious work.


We don't bet on that, we aim for that.


I can sense we're about to have a circular argument based on where to draw the boundaries of responsibility.

My point is that betting entirely on one strategy for mitigating faults is unnecessarily risky. Especially when additional levels of mitigation are easily installed and configured. In your other post you even point out a series of things that you do.

I don't see why process management is of a different kind.

Edit: removed unnecessary grandstanding.


You have clearly never written a distributed system or similar. Fail Fast, Fail Often.


That mantra is simply bullshit.

I have written massively distributed systems which have an insanely high reliability requirement and it is really not the answer. I've been doing this for 25 years.

A more appropriate statement is:

  Fail early and gracefully, recover always, expect failure.
Fail early - assertions up front. Prevention is better than cure.

Fail gracefully - don't allow assertions to take the entire process out. Don't allow the language to crash the process out.

Recover always - design your system for recovery and understand recovery conditions.

Expect failure - know where and when something is going to fail and handle it.

Fail fast, fail often results in nasty shit like processes hung in restart cycles etc.


This is much closer to the actual Erlang philosophy that gets misquoted so often.


Cron?!


Monit is by far your best bet. It's easy to install, packaged on most distros, and it performs reactive monitoring, as opposed to most traditional monitoring systems like Nagios. Plus you can open up a web interface if you want to allow easy browsing of monitored processes.

EDIT: I liked it so much (and it was so easy) that I wrote a blog post expounding on how much I liked it and how to use it. http://moduscreate.com/monit-easy-monitoring/
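
For reference, a minimal Monit check is only a few lines (the paths and the port check here are illustrative):

    check process nginx with pidfile /var/run/nginx.pid
      start program = "/etc/init.d/nginx start"
      stop program  = "/etc/init.d/nginx stop"
      if failed port 80 protocol http then restart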


As one of the guys behind Monit, thanks for that! We're about to improve Monit further by adding realtime (ms resolution) scheduling and monitoring (libev or libuv) which we hope will make Monit even more useful.


Hey, we use monit and mmonit on all our servers for a lot of varied services, and they are truly life savers! I only wish the web console for mmonit were a little more detailed and showed more system information. It would also be helpful to see graphs of memory, CPU usage, and server response time over time to gauge the current system load.


Thank you for writing such a quality piece of software! You have a huge fan.


Folk on Solaris / Illumos would probably like SMF to be added to the list.

I'd go on to mention various z/OS subsystems, but that's a bit esoteric even for HN :D

(Process management ties into my larger rant that nobody has properly combined it with configuration management. But nobody has time for this nonsense.)


Nothing. The answer is that you don't monitor production processes directly, ever; it's a waste of your time and effort. Certainly this sort of foolishness should not be used to page an employee off-hours, if that's where you're headed.

The only thing you need to monitor is whether a server answers the network requests it was designed to. Outside of that you might optionally want to know whether the disk is full, RAM is maxed (pushing Linux into swap), or the CPU runs too high to cope with losing some servers at peak, but really that's all optional if you're in EC2 and can just spin up more servers at a moment's notice.

You can gather all this data for yourself with New Relic, or you can send data to Graphite, or if you're old-fashioned you can use Icinga in place of Nagios because it keeps history in a database. If the developers want to know about the process for the application they implemented, you can put New Relic on the server for them, and put the system New Relic thing on there too; just don't pay attention to it or pretend it's important until something breaks.

The important catch here, the thing that is critical to this whole line of thinking: you have to have thought things through before you built them, focused on having one service per OS and real redundancy throughout the environment. Then, critically, your kick should be fast enough that if a server has some kind of problem in production you don't fix it, you just re-kick it. That means your kick throws the OS on there, then triggers Salt or Ansible or Chef to configure every single detail, and then triggers a deploy of internally developed applications. That also means you have to test the kick to death before you can rely on it to rebuild something live. If the problem is recurring you can use immediate tools, jdump or whatever, to get some data, give it to the application's developers, and let them try to recreate it in staging while you go ahead and re-kick the prod server and go back to writing documentation for lesser ops to not read, drinking at your desk, reading hackernews, acting as a CIA listening post for cat pictures, or whatever else passes the time.


Some points of note:

1. daemontools and runit are practically identical. I do prefer runit somewhat, as svlogd has a few more features than multilog (syslog forwarding, more timestamp options), and sv has a few more options than svc (it can issue more signals to the supervised process).

2. Among the criteria I look for in a process manager are: (1) the ability to issue any signal to a process (not just the limited set of TERM and HUP), and (2) the ability to run some kind of test as a predicate for restarting or reloading a service. The latter is especially useful to help avoid automating yourself into an outage. As far as I'm aware, none of the above process supervisors can do that, so I tend to eschew them in favor of initscripts and prefer server implementations that are reliable enough not to need supervision.
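
The test-as-predicate idea is easy enough to express in a wrapper or initscript; a one-line sketch, assuming nginx:

    # only reload if the new configuration actually parses
    nginx -t -q && service nginx reload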


Systemd provides a hook (ExecStartPre [1]) for running test commands before starting a service, e.g. checking the nginx configuration. However, I also prefer the flexibility of shell scripts over something like systemd's INI service definition format.

[1] http://www.freedesktop.org/software/systemd/man/systemd.serv...
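
A sketch of what that looks like in a unit file (paths assumed; not a complete unit):

    [Service]
    # if the config check fails, the service is not started
    ExecStartPre=/usr/sbin/nginx -t -q
    ExecStart=/usr/sbin/nginx -g 'daemon off;'
    Restart=on-failure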


I've been using supervisord for most everything (doesn't hurt that I'm primarily a python guy), but I'm slowly testing out Mozilla's circus (https://github.com/mozilla-services/circus) and it's been going great so far.
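
For comparison, a supervisord program entry is a short INI block (the names and paths here are made up):

    [program:myapp]
    command=/usr/bin/python /srv/myapp/app.py
    autostart=true
    autorestart=true
    redirect_stderr=true
    stdout_logfile=/var/log/myapp.log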


We've actually had the opposite experience - circus has been nothing but a pain for us, unfortunately. Timeouts with lots of processes (50+), random CPU usage through the roof, and lots and lots of bugs that shouldn't be there.

There was a nasty race condition for a while that locked circus up, and it wouldn't restart crashed procs. For a while, you couldn't specify a timeout on the commandline - on a major version, too. It was 1.0.0 in master for a while, and then went backwards to 0.7.0. We were sitting on master for the fix to the aforementioned timeout, and so no updates happened until we realized what happened and then manually "downgraded."

All in all, it really feels like we're either using it wrong (probably, we're adding & removing processes on the fly), or we're the only ones really loading it up with a ton of processes which may or may not flap a lot.

If you don't mind, why are you moving away from supervisord?


Circus is still pretty new and rough around the edges, so I haven't totally moved away from supervisord yet. My foray there is mostly exploratory, trying to get used to it and its quirks before really diving into a comparison between the two.

That said, the ability to manage sockets sounds very interesting, hopefully simplifying my stack even more (my current use case is in getting the most performance out of a small VPS, so removing things from the stack would hopefully clear up RAM for actual web workers). I've been running a small site using nginx->circus->chaussette->django and it's faster (and was simpler to configure) than my standard nginx->supervisord->uwsgi->django deployment.

Also, supervisord isn't without its issues, and one of them is managing processes at scale (see https://github.com/Supervisor/supervisor/issues/26, that bug is 2 years in the making). Circus at least claims they intend to support thousands (via http://circus.readthedocs.org/en/0.9.2/rationale/) so I'd be interested in seeing what they bring to the table on that front.


> There was a nasty race condition for a while that locked circus up, and it wouldn't restart crashed procs.

This was fixed in 0.7.1

> For a while, you couldn't specify a timeout on the commandline - on a major version, too.

To my knowledge this was never released.

> It was 1.0.0 in master for a while, and then went backwards to 0.7.0.

Yes, we decided for a while that the next version would be 1.0, then we changed our mind. It all happened in master and was never released, so I don't see the problem here.

> All in all, it really feels like we're either using it wrong (probably, we're adding & removing processes on the fly), or we're the only ones really loading it up with a ton of processes which may or may not flap a lot.

I am still available for any help. Circus is young but works for our needs. If you are happily using Supervisord, that's fine - but continuing to post your negative experience on HN from 3 months ago without having tried the tool recently, while we have (to my knowledge) addressed all the bugs you mention, is a bit inappropriate imho.


Hey, thanks for the reply!

> This was fixed in 0.7.1

We're still using Circus under production loads, and we're still seeing it go unresponsive and chew through a ton of CPU. Unfortunately we haven't been able to reliably reproduce it, so until we can, it's not something we can fix.

> To my knowledge this was never released.

It's happened a couple times, the second one was likely just on master, but the first was happening from a pip install. https://github.com/mozilla-services/circus/issues/457 https://github.com/mozilla-services/circus/issues/380

> If you are happily using Supervisord, that's fine

We're not, we're still using Circus while we find or build something else.


How are you doing dynamic process allocation with supervisord? I had to move to circus to get that API capability.


Update the supervisord config file and do a reread/update? Not as nice as doing it from the library, though.
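
i.e., something like this after adding a new [program:x] section:

    supervisorctl reread    # pick up config file changes
    supervisorctl update    # start/stop programs to match the new config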


I don't, currently, because I don't have to monitor services, but if I did, I think I'd likely use daemontools, based on the fact that djb really, really understands how to write Unix software.


I use daemontools to manage all the servers for SlickDNS (https://www.slickdns.com). Trivial to install and configure. Most of my run scripts are one liners, and of course like anything written by DJB daemontools just runs and runs. (Naturally I'm also using tinydns on the SlickDNS name servers.)


djb however doesn't know how to get on with other people. While the software is pretty neat, it does piss all over your disks and leave a hard to untangle mess.


systemd handles service monitoring natively, as well as socket management and many aspects of container management. It's a superset of most of the tools listed.


Right now, I use Upstart (and thus Ubuntu) because it can (kinda-sorta) do the same "supervision-tree" thing that Erlang/OTP does: you can have a virtual "service" that runs a script on startup to get a list of tokens, then starts a set of dynamic children using those tokens ("initctl start myservice/worker FOR=bob", that kind of thing), monitors them, and restarts them when they crash; bringing down the supervisor cleanly shuts down its currently-running dynamic children, and so on.
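
A rough sketch of that pattern with an Upstart instance job (the job name and the FOR variable are hypothetical):

    # /etc/init/myservice-worker.conf
    instance $FOR
    respawn
    exec /usr/local/bin/worker --name "$FOR"

    # started per token, e.g. from the parent job's pre-start script:
    #   initctl start myservice-worker FOR=bob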

Can systemd do that? I like everything else I've heard about it, but I haven't seen any documentation regarding this sort of usage.


Funnily enough, last time I set up a supervision tree hierarchy with Upstart, I was thinking the whole time, "this would be so much nicer with Erlang." I think that was my a-ha moment for OTP.

For most standalone systems though, Upstart works nicely enough as long as services don't need too much coordination.


systemd can, right now, launch "template" services based on a common template with a parameter. For instance, the usual set of text login prompts are services named getty@tty1.service, getty@tty2.service, ..., getty@tty6.service, all spawned from a template getty@.service.

systemd is currently adding support for dynamic service "instances", which would allow you to launch and close such services on the fly.

So, depending on how dynamic you need the list of virtual services to be, the answer is either "yes, that works right now" or "yes, that's available in the latest systemd release, but probably not in your distro yet".
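
The template mechanism looks roughly like this (unit and binary names are illustrative):

    # /etc/systemd/system/worker@.service
    [Unit]
    Description=Worker %i

    [Service]
    ExecStart=/usr/local/bin/worker --name %i
    Restart=on-failure

Each instance is then started with "systemctl start worker@bob.service", and %i expands to the instance name.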


Probably time to use a distribution that does rolling releases!


I'd love to try systemd in the future, but right now we deploy on Debian and for us it's easier to go with supervisord (and all our devs "speak" Python, which is a plus).

My main concern is that systemd handles too much and I feel it's going to be hard to change. I guess we can start using systemd without removing supervisord and move from there.


Thank you Josh.


SNMP for centralized systems monitoring, cron jobs for "daemon management". I dislike tools that take up large system footprints, sometimes larger than the things they're supposed to manage, to do simple things. No, they're not always extensible. When you can accomplish with a 4-5 line shell script that takes less than 100k of RAM what you'd otherwise do with a giant process that takes 40-50MB of RAM, I know which one I prefer (generally).
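
The kind of 4-5 line script in question might look like this, run from cron every minute (the pidfile path and init script are placeholders):

    #!/bin/sh
    # restart myapp if its pidfile is missing or the process is gone
    if ! kill -0 "$(cat /var/run/myapp.pid 2>/dev/null)" 2>/dev/null; then
        logger "myapp not running, restarting"
        /etc/init.d/myapp start
    fi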


Care to discuss how you use SNMP for centralized monitoring?


systemd for most servers; svc/rundir (a version of daemontools) from busybox on embedded systems.

The code is designed to be shot, and will always recover after a restart. rundir/svc will neatly reap a process and restart it. And they can be used separately.


Do any of these integrate with cgroups? I've found myself wanting to specify some rules about resource usage on occasion, and cgroups seems conceptually nice, but I'm not sure how to work it in nicely to my other tools, short of writing custom shell scripts to manipulate /proc/cgroups.


Systemd provides many knobs for tuning CPU, memory, and IO settings [1] using cgroups. This deep integration is one of the reasons why systemd only runs on Linux.

[1] http://0pointer.de/blog/projects/resources.html
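
Concretely, those knobs are per-unit directives in the [Service] section, e.g. (values arbitrary; directive names as of 2013-era systemd, per the linked post):

    [Service]
    CPUShares=512
    MemoryLimit=1G
    BlockIOWeight=500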


Anyone with any experience using Angel (https://github.com/MichaelXavier/Angel)?


Thanks, that looks great.


Have been using monit to keep a couple of node.js services online / monitor their PIDs and HTTP interfaces. It's been a positive experience so far.


Please add God[1] to the list, we use this in production.

Also I'll put in a shameless plug for my side project, a service management tool for multi project development; Hack On[2]

[1] http://godrb.com/ [2] https://github.com/snikch/hack


Please don't use God. You're in for a lot of flaky monitoring problems if you do.

For reference see this bug which has been open for 2 years.

https://github.com/mojombo/god/issues/51


Seconded. I like the DSL, but a monitoring solution has to be unfailingly rock-solid, which God most certainly is not.


Urgh, that doesn't sound great. We haven't encountered that issue, but we also only use it for a largely inconsequential service.

I think it's a bit unfair to downvote me; the OP posted a poll about what people use, not what they recommend. We use God, so I've asked for God to be added to the list.


I've been using god for two years and have never had this problem, or any problem with it. Maybe the god load command has problems, but I can't think of any reason to use it. Bottom line - try a few monitoring tools out and use what works for you.


I've been seeing this problem in prod for the last 8 months, consistently. I literally have one process that doesn't start 3/4ths of the time. My startup script has to iterate through all the processes god is supposed to start and, if they're not all running, kill god and try again. Not inspiring a lot of confidence in god as a monitoring solution.


But system operators love to use God. It's that whole male ego thing.


Thank you. Just added.


How many of those are actually derivatives of daemontools?


https://github.com/caldwell/daemon-manager

I've been dogfooding it in a production environment for a couple years and it's been pretty solid.


I wish that s6 (skarnet.org's small and secure supervision software suite) [1] were more widely packaged and available on distros. It's very much in the same vein as daemontools, but with some improvements. While certainly biased, the author wrote a pretty good breakdown of why s6 was developed and how it compares [2].

[1] http://www.skarnet.org/software/s6/

[2] http://www.skarnet.org/software/s6/why.html


launchd, but I can pretty safely assume that I am one of the few here running servers on OS X.


I use supervisord now, before I would use mon/mongroup [1] which is just a tiny C program to monitor stuff.

I have also used god at some point, but I kept having trouble. I can't remember exactly what was wrong but it never quite worked correctly for me. Probably PEBCAK.

[1] https://github.com/jgallen23/mongroup


Upstart, because I wouldn't use a job control system in prod that isn't included with the base distribution (do any base Unix distros ship monit or supervisord?). It's just too much useless work to rewrite job control logic for daemons when the OS already gives it to you, and I've been quite surprised by the feature completeness of Upstart.


On the other hand, if you want to switch distros or OS, you'll have to rewrite everything.


^C, ^Z, ps, fg, bg




I use upstart, but am not happy with it for a number of reasons. Two important ones: "restart" does not reread the configuration file and the DSL is poorly done (the "respawn" stanza and others).

I haven't looked recently at alternatives, but I'm open to it.


Nagios + a few licences of New Relic


Thanks. Would also love to know what you use for controlling process lifecycle.


I'm using htop. Very easy, but maybe not enough features for what you are looking for?


For daemons: none of the above (just init). I then monitor with Zabbix. I assume services don't crash, and hey, they don't (not that I know of, in any case).

Unless you really did mean processes and not daemons, in which case it's supervisord.


Personally I've had great success with supervisord, no success with god, and good experiences with monit, but I'm curious: whatever happened to good ol' Linux watchdog?


Currently using upstart, only because it's the default in Ubuntu.


No launchd love? Okay...


Reactive monitoring via Riemann: http://riemann.io/

We use this to monitor services at the application level.


Pacemaker if you need to keep it alive no matter what.

Systemd was pretty stable until user mode flat out broke in 205. I use it to manage my entire desktop session.




Supervisord, but I'd love to check out systemd and its capabilities for that in the future.


forever (https://github.com/nodejitsu/forever/) has worked great for me, but doesn't make any sense if you're not running node.js applications.


Why not? It seems perfectly suited to run any application.


Node/V8 is not lightweight; it's hard to justify a ~30MB process just for monitoring. By all means, if you already have Node on your system, use it; it's just kind of pointless to have Node only for forever.


I've used monit since forever, but have really come to like runit more recently.


SMF


top


upstart does the job nicely.


monit for active monitoring + munin for trends


htop


'ps' :)


it was a joke. no need to mod down.


sysv init


docker


forever


launchd


top


Kinda sad that "nothing" isn't on the list. I just use software that isn't broken, so it doesn't need to be constantly restarted.


While it's true that "nothing" or "custom" should be poll options, I don't agree with your reasons.

Supervisor programs aren't just about restarting processes if they die. They manage dependencies, reloading, a central authority for starting/stopping and other administrative tasks. Some additionally do things like logging.

And besides, even the most stable software can't be guaranteed bug-free (though I do agree that a better action might be to stop and yell loudly rather than blindly restart). Why forego a safety net?


I think there are some problems with your statement. All these process-reloading and management options are a symptom of a much larger problem. The problem is simply:

   People have stopped writing software designed to run continuously.
Some notes from 20 years of Unix fudging:

If a process dies, it's a problem with the process, not an external problem. Make it reliable. Software shouldn't fail at all. PostgreSQL, Postfix, and init never crash, in my experience. Why shouldn't your software do the same?

Reloading: system-wide: "service whatever restart/reload"

Dependencies: Dependencies are hell. Having used BIG Unix kit with power sequencers and stuff, I wouldn't ever take on a dependency. If I did, the dependent processes should fail gracefully and retry.

Central authority: service xxx restart/reload again...

Logging: syslog? On our kit, we use Splunk but it's pretty much overkill for most people.

If you can't sleep at night because your processes crash, it's no good employing someone (a supervisor) to go around doing CPR on them. Fix the root problem.


Funny you should mention PostgreSQL, since it includes a monitoring process - the postmaster - that restarts the workers if they die ;) (and they do, occasionally).


That's because the child processes can load and run third party code, which obviously can crash them. Just like apache modules can crash httpd worker processes. So why run an extra layer of monitoring on top of the existing one for apps like those?


I don't think that's what's being suggested.


All the software in the poll is general purpose monitoring/restarting software. It isn't software designed to be multi-process and monitor its own child processes.


Then you have never had to guarantee an SLA and pay for service downtime, so you have never seriously run a _business_ that earns money by providing a service.


We do guarantee an SLA and use precisely nothing other than sysv init.

We pick software that is reliable and trustworthy then test it thoroughly.

Those are forgotten arts outside of the enterprise. Elsewhere, any new technology that falls from the sky is picked whether it works properly or not.


How do you cope with hardware failure, corrupt memory banks, temporary network failures? Do you also just buy "hardware that never fails"? Just testing something and deeming it reliable enough to "just run" is a horrible practice. There is no software on earth that is bug-free, you do realize that, yes? Why do you think NASA builds rockets with many, many safety measures? Because they use the latest and greatest hardware and software? Or because they KNOW that at some point something WILL fail? It's not only a question of picking stable software. The skill is to be prepared for the emergency at any time, even if it is a rare occasion. If your only safety measure is to "just pick stable stuff", I'd surely never buy your service.


Firstly, calm down. I gave you a small peek into my world and you're drawing a lot of bad assumptions from it.

We mitigate hardware issues with either hot spares or clustering. We have three datacentres distributed geographically with entire redundant sets of equipment (4x 42U racks in each).

With respect to software failures, we test everything thoroughly including failover conditions etc. Everything is load tested as well.

We are prepared for emergency. We have dedicated people ready to jump on that.

However, preventing these things from ever being needed is a professional responsibility, which is my point.


The poll is about the fad of using a program to monitor your known-to-crash/leak/whatever app: generally starting it, monitoring it, and restarting it. Hardware failure is completely irrelevant to the discussion.


Yes I have. A whole ISP, in fact, including hosting corporate email for hundreds of small to medium businesses. Postfix never crashed. OpenLDAP never crashed. Courier never crashed. And what is monitoring and restarting your monitoring/restarting daemon?


I wish I lived in the perfect world you inhabit :( Some of us are dealt shitty cards and told to make it work.


Agree. I've been at this for a long time and never needed anything in this capacity. Not once.



