Of course, this works better when you have many small processes rather than few monolithic ones. But now you're designing an Erlang system :)
I'm curious if this works in practice for you. The current OOM algorithm in Linux sums up the memory usage of a process and all its children, so there's a good chance the restarter process gets killed first, and then the main software gets killed too (when the OOM killer realizes the last kill didn't free enough memory).
This is exactly the problem we're facing at work: on a computational cluster, users sometimes start wild code that consumes all the memory, but the OOM killer decides to kill the batch-queue daemon first, because it's the root of all the misbehaving processes. We have to explicitly set `oom_adj` on the important daemons to prevent the machines from becoming unresponsive because of a bad OOM decision.
Still, in your use-case, I'd definitely recommend only letting users run their "wild code" inside a memory cgroup+process namespace (e.g. an LXC container.)
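A minimal sketch of the cgroup side, assuming cgroup v1 with the memory controller mounted at /sys/fs/cgroup/memory (the group name and the job script are made up for illustration):

# create a memory cgroup with a hard 4 GB limit for untrusted jobs
mkdir /sys/fs/cgroup/memory/wildcode
echo $((4 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/memory/wildcode/memory.limit_in_bytes
# move the current shell into the group; children inherit it
echo $$ > /sys/fs/cgroup/memory/wildcode/tasks
./wild_code.sh  # hypothetical user job: it OOMs inside the group, not the whole box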
Crash-only systems only work when a faulty component crashes itself before it crashes you. Processes modellable as mutually-untrustworthy agents should always have a failure boundary drawn between them. (User A shouldn't be able to bring down the cluster-agent; but they shouldn't be able to snipe user B's job by OOMing their job on the same cluster node, either.) And on a Unix box, the only true failure boundaries are jails/zones/containers; nothing else really stops a user from using up any number of not-oft-considered resources (file descriptors, PIDs, etc.)
I think it's surprisingly easy to get yourself in the situation where this is a concern for you but you don't know how to solve it.
 Just run "adduser" and have SSH running, or just create an upstart job, or write a custom daemon that accepts and executes jobs from not-quite-trustworthy-undergrads, or...
It only summed in the children's memory up until Linux 2.6.36; newer kernels use a different heuristic.
Diff'rent strokes. I also like the OOM killer; it's a dastardly wonderful thing to tie lots of safety-critical things to .. and in the SIL-4 OS business (my domain), it is indeed a crash imperative to understand how to use the OOM killer properly. Or: not.
So in this light .. I know Erlang is "the thing" right now, but I feel I must just mention that:
>Of course, this works better when you have many small processes rather than few monolithic ones. But now you're designing an Erlang system :)
.. one could also be designing a Lua-based distribution, or JVM, or whatever you like, essentially, and integrating with oom_killer. There's nothing Erlang'y about it. Because if you're playing with the oom_killer, you're really making a distribution choice, in the topology.
If your app cares about oom, well kiddo .. you better not be doing anything less than exercising complete control over your distro, its launch policies, its use of the TextSegment as installed, and so on. Absolutely you're making Distribution decisions about the functionality of the combined system. oom_killer isn't useful by itself.
My point being, worrying about oom_killer isn't just something Erlang users need think about, nor are they the only ones who really 'get' why an oom_killer can be used nicely .. if you're building a distro, either for use as an embedded machine, a tight secure server image, or indeed even as a desktop user, well .. careful memory integration is, as you say, a harsh pass.
Incidentally, I use the oom_killer exactly as you mention, in a few embedded distro applications, specifically a kill of whatever 'lua' is hogging resources. It's an extraordinarily functional mechanism for recovery ..
I don't think Erlang is "the thing" right now. The "hot" languages for concurrent programming are Go and Node.js, for better or for worse.
ulimit is a better solution, i.e. set reasonable constraints based on available resources.
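For example, wrap the job in a shell that caps its address space and process count before launching it (the limits and the job script here are placeholders):

# cap each process at ~2 GB of virtual memory and this user at 256 processes
ulimit -v 2097152   # kilobytes
ulimit -u 256
./run_user_job.sh   # hypothetical job; it gets allocation/fork failures instead of triggering the OOM killer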
Ok, it doesn't exactly support the quoted feature ("stop restarting if it exceeds X attempts in Y seconds"), but it does sleep for a second between restart attempts to mitigate the same problem: http://cr.yp.to/daemontools/supervise.html
Is your objection that daemontools is largely unused / unmaintained and lacks features (such as flapping-avoidance)? Or something else?
Just curious, thanks.
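For what it's worth, the "X attempts in Y seconds" check is easy to bolt on with a small wrapper; a rough sketch (the service name and thresholds are made up, and this is not daemontools itself):

# give up if the service dies more than 5 times within 60 seconds
window_start=$(date +%s); failures=0
while true; do
    ./my-service            # hypothetical long-running daemon, runs in the foreground
    failures=$((failures + 1))
    now=$(date +%s)
    if [ $((now - window_start)) -gt 60 ]; then window_start=$now; failures=1; fi
    if [ "$failures" -gt 5 ]; then echo "flapping, giving up" >&2; exit 1; fi
    sleep 1                 # daemontools-style pause between restarts
done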
Any one of upstart, systemd, or supervisord will give you an equivalent solution with more practical features. Daemontools was good enough a couple of years ago. (I used it happily back then.)
In short - there's nothing wrong with it really, but many alternatives are much better.
# fork another copy of this script in the background...
sh $0 &
# ...and grow a Perl array without bound in this copy (one memory hog per invocation)
perl -e 'push @big, 1 while 1' &
This isn't disastrous of course (replication + snapshots + redis aof), but still annoying.
Thus, the database itself will never degrade due to spilling to disk, but all other processes might. If it's a dedicated DB box, though, that won't matter much.
They'll never swap out, and as a bonus less memory is used. (Note: transparent huge pages have only brought me pain so far.)
Though Ryanair could well be heading down this route 
The only exception that airlines will sometimes do is to juggle around destination and alternate airports so that they can save some fuel on that (you have to have enough fuel to reach an alternate airport, and then 45 minutes in addition to that).
I also spend significant time on aviation/pilot forums, and I've literally never heard of "lots of paperwork" for any emergency that resulted in a safe, no injuries landing. The topic comes up regularly in the "should I have declared an emergency or not?" discussions, the common argument against is "I don't want the paperwork" (a strange trade off against a life-safety question in any case), and the most I've heard as routine followup is a phone call from an inspector.
My experience is exclusively US, so perhaps it's different in other areas of the world.
The only issue I've ever caused was triggering a TCAS warning for a landing airplane because I was slow to start a turn. Even though conditions were VFR and we were at a safe distance, the other pilots said they had to report it as a matter of company policy, and my instructor had to make a few phone calls afterwards.
> At least two memorandums were sent to Ryanair pilots detailing the company's concern about what was described as "excess fuel explanations" -- a description of the reasons flight commanders have to give if they take on extra fuel over the recommended minimum fuel load.
Is "because it's only the minimum!" not a valid answer?
It was a "mayday" because they didn't have enough fuel to return to the original airport they were going to land at and/or to continue waiting in a holding pattern for another hour.
But think of a hypothetical situation where low-fuel situations were somehow common, unforeseeable, and unavoidable, and where dropping a passenger could indeed make the difference between a crash and a safe landing. It's a shitty situation, but it makes sense to give somebody a chute and kick them out instead of letting everyone die.
But critically low memory situations aren't unavoidable. Not overallocating is one very easy solution.
Sometimes is pretty much all of the time. My SO works as cabin crew for one of the large airlines you will have heard of, and they almost always overcommit. It isn't just because they are greedy and want lots of money (well, I guess it kind of is). As you said, they know some people won't turn up, but they may also reconfigure the cabin classes (e.g. economy and business -> economy, business and first) depending on how people book or what aircraft are available on the day.
United is infamous for this, prompting a sound piece of advice: if for some reason you're in first class on a flight scheduled for an A320, don't take a seat in row 3, since odds are high it'll be swapped to an A319 (which only has two rows in F) day of the flight. In theory there's a pecking order for who gets downgraded in that case (first, anyone who moved up to F on a status upgrade, then the paid F fares starting with whoever was latest to check in), but in practice the gate agent just bumps row 3 and calls it a day.
I think the parent poster was in part referring to operational upgrades (very common at British Airways — they will dramatically oversell the Y cabin and op-up people into J/F as needed), which don't necessarily require an equipment swap.
Another option uniquely available to European carriers that you might not have seen in the U.S. is to adjust the number of J/Y seats in an aircraft: on most intra-Europe flights, the "business-class cabin" is the same seats as the coach cabin, but with a blocked middle seat and a movable divider so you know where Y starts.
(Come to think of it, I guess BA will sometimes swap equipment to a flight with fewer classes of service — for example, the LHR-DME route can lose its lie-flat Club World seats overnight sometimes after unplanned equipment swaps — but I don't think BA does this for yield management, because it's costly to them to have to pay refunds to people they've inconvenienced.)
Certainly not in the current airline environment. Even with a perfectly loaded plane, they only make a little money, and you can't plan for a perfectly loaded plane with any kind of confidence without double-booking.
I phrased what I said poorly. They always allow overcommit, but they try to work it out so that on average they efficiently make use of their seats without constantly having to give people giveaways and bad customer experiences.
The problem with translating that approach to OSes is that you can easily deadlock the entire system.
So... a lot like killing a process.
OSX is just like, "Hey guys, if any of you happen to not need your memory sometimes, would you mind kindly letting me know and I'll go ahead and let you go at a convenient time?" Meanwhile Linux goes on a murderous rampage with unpredictable effects.
FWIW: Android baked that idea into the framework from the start. All apps are essentially required by the application lifecycle to be robust against sudden failure, and the system is designed to give apps fair warning to save themselves at various times (e.g. on backgrounding). If you think about it, on a battery-powered embedded system that's pretty much a firm requirement anyway.
This may have changed in recent versions of Android, as I haven't kept up with the internals that much.
But that's not quite what I was saying. The point was that the promise in OS X (that a process "could be killed if necessary") is baked into Android: it's true by default for all processes, and the framework guarantees a set of lifecycle callbacks to allow processes to persist their state robustly. This isn't part of kernel behavior at all.
It can be a bit finicky to work with, but it does seem to be pretty much strictly better than just killing random processes without giving them a chance to be better behaved.
To make it persist, just add vm.overcommit_memory=2 to /etc/sysctl.conf
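Note that mode 2 limits total commit to swap plus overcommit_ratio percent of RAM (50% by default), so you may want to raise the ratio as well; a quick sketch (the 80% figure is just an example):

# apply immediately
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=80
# persist across reboots
echo 'vm.overcommit_memory=2' >> /etc/sysctl.conf
echo 'vm.overcommit_ratio=80' >> /etc/sysctl.conf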
For instance, if you run a Redis server and turn off memory overcommit, you might not be able to background save.
echo -17 > /proc/$PID/oom_adj
where $PID is the process ID you want to protect.
oom_adj can be tuned with other values to make a process more or less likely to be killed.
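On 2.6.36 and later kernels, the preferred knob is oom_score_adj, which ranges from -1000 (never kill) to 1000 (kill first); oom_adj still works but is mapped onto it. For example (using sshd here purely as an illustration):

# exempt sshd from the OOM killer
echo -1000 > /proc/$(pidof -s sshd)/oom_score_adj
# make a disposable batch job the preferred victim ($BATCH_PID is whatever job you'd rather lose)
echo 500 > /proc/$BATCH_PID/oom_score_adj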
Maybe there is a way to suspend a process (feather the prop) rather than completely kill processes.
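There is, at least as a manual stopgap: SIGSTOP pauses a process and SIGCONT resumes it, without killing anything. (The stopped process still holds its memory, though, so it only buys relief if those pages can be swapped out.)

kill -STOP "$PID"   # freeze the hog
# ...investigate, free up memory, then...
kill -CONT "$PID"   # let it continue where it left off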
Usually, I recommend that database and queue servers run the database/queue with a priority that makes it unlikely for them to be killed.
I had a case where a colleague running a script on a server under high memory pressure got the queue killed, which is inadvisable even if the queue is crash-safe. Before that, the queue had been running for 1.5 years straight.
However it would be bad.
Under-committing resources (thus removing the need for an OOM killer) will NOT lead to a net gain compared to over-committing resources (and thus requiring an OOM killer of some sort).
If we are unwilling to overcommit resources, then it would be woefully uneconomical to run algorithms that have bad worst-case performance (because to avoid overcommitting you would necessarily need to assume the worst case is encountered every time).
It's just not feasible to avoid algorithms that have bad worst-case performance. Rather, we need to develop better abstractions for dealing with components (e.g. computations, programs, processes, threads, actors, functions etc.) that go over budget. Here's my attempt at developing a better abstraction for web servers: mikegagnon.com/beergarden
Ultimately, we need to treat every system like a soft real-time system, because at the end of the day every program has timeliness requirements and has resource constraints. The current POSIX model does not provide such abstractions and I think that's why we have these debates about OOM killers.
I like the idea of the doorman, but what if you could somehow pass back useful math to the client? Of course, then you'd have to disregard that useful work yourself, or double-check it, negating the energy savings. Or perhaps return a map of a traveling-salesman-type problem (maybe a map of metadata and their traversal costs), and the client could navigate that map depending on exactly what kind of data they really want. That would reduce your load for valid, heavy requests, and if they return a path with lots of nodes, you know to de-prioritize it or drop the request.
Thank you! Yes, I made the UI. It's open source: sidenote.io
> I like the idea of the doorman but what if you could somehow pass back useful math to the client?
I think it would be great to have clients perform useful computations instead of just burning cycles. But that's not MVP, even for a research project.
Well it seems to work for MySQL
Anyway, you wouldn't ever return the nulled data. If the process tries to access the data, THEN you crash it.
Perhaps something more in line with BSD would be a sacrifice(2) syscall? It would pick a random process and...
This isn't the case with the default MSVC implementation of malloc() in Windows. In Windows address space is reserved and committed with VirtualAlloc(), and typically that is done in one step.
I think memory is overcommitted because Linus wanted to keep the memory footprint lower than NT's early on in the development of the kernel. The drawback is that applications may segfault when writing to memory that was successfully returned by malloc().
 "Actual physical pages are not allocated unless/until the virtual addresses are actually accessed." http://msdn.microsoft.com/en-us/library/windows/desktop/aa36...
It does count against the total memory allowed. My laptop has 8GB of RAM and a 1GB pagefile, giving me a 9GB overall commit limit. If I spawn a process that eats up 1GB at a time, even Task Manager can clearly show me going up and hitting 8/9GB, and then I'll get OOM in my process.
Windows won't commit memory that a process can't use. You can't overcommit, although you might end up in the pagefile. Without the odd concept of fork, you don't end up with processes having huge "committed" address spaces that aren't ever going to be used.
The note about physical pages is just saying it's not mapping it, not that it's not guaranteeing it.
Edit: And remember, OOF goes off on less than .1% of flights, so they have hundreds of times as many parachutes per flight as they have people who need them. Rumors that parachutes are oversubscribed are therefore wildly inaccurate.
After a while I noticed when the bug triggered and the system started becoming unresponsive, and I had a terminal with killall -9 chromium & killall -9 flash-plugin ready to go, so I could preempt it myself and the OOM killer wouldn't get involved. There has to be a better mechanism than the OOM killer.
Like not overcommitting memory in the first place.
As for hinting to oom_killer: I have a script which searches for chrome and flash processes every minute, and sets their oom_score_adj in the high hundreds. This makes reasonably sure that oom killer will go after these processes first.
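Something along these lines, as a rough sketch of the idea (the process names, the 800 value, and running it from cron are all just illustrative, not the actual script):

#!/bin/sh
# run every minute from cron: make browser/flash processes the OOM killer's first choice
for pid in $(pgrep -f 'chrome|flash'); do
    echo 800 > "/proc/$pid/oom_score_adj"
done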