I guess I'm one of the few people(?) who like the OOM killer. If all your deployed software is written to be crash-only[1], and every process is supervised by some other process which will restart it on failure, then OOM is basically the trigger for a rather harsh Garbage Collection pass, where software that was leaking memory has its clock wound back by being forcefully restarted.
Of course, this works better when you have many small processes rather than few monolithic ones. But now you're designing an Erlang system :)
> every process is supervised by some other process which will restart it on failure
I'm curious if this works in practice for you. The current OOM algorithm in Linux sums up the memory usage of a process and all its children. So there is a good chance that the restarter process is killed first, and then the main software is killed too (once the OOM killer realizes the last kill didn't free enough memory).
This is exactly the problem we're facing at work here: on a computational cluster, users sometimes start wild code that consumes all the memory, but the OOM decides to kill the batch-queue daemon first, because it's the root of all misbehaving processes. We have to explicitly set `oom_adj` on the important daemons to prevent the machines from becoming unresponsive because of a bad OOM decision.
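For anyone curious, the adjustment itself is tiny. A minimal sketch in C of what I mean, assuming the modern /proc/<pid>/oom_score_adj interface (range -1000..1000; older kernels only expose oom_adj, with a -17..15 range), run as root against the daemon's PID:

    /* Sketch: protect a PID from the OOM killer by lowering its score.
     * Writing -1000 makes it effectively unkillable by the OOM killer. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    static int set_oom_score_adj(pid_t pid, int adj)
    {
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/oom_score_adj", (int)pid);
        f = fopen(path, "w");
        if (!f)
            return -1;            /* lowering the score needs root/CAP_SYS_RESOURCE */
        fprintf(f, "%d\n", adj);
        return fclose(f) == 0 ? 0 : -1;
    }

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <pid> <adj, -1000..1000>\n", argv[0]);
            return 1;
        }
        return set_oom_score_adj(atoi(argv[1]), atoi(argv[2])) ? 1 : 0;
    }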
My "restarter process" is upstart. It's convenient, since the OOM-killer tries to not kill init (for bad things happen when you kill init), so it's a somewhat-safe place to put supervisory logic. One of the better calls Canonical has made, I think. :)
Still, in your use-case, I'd definitely recommend only letting users run their "wild code" inside a memory cgroup+process namespace (e.g. an LXC container.)
Crash-only systems only work when a faulty component crashes itself before it crashes you. Processes modellable as mutually-untrustworthy agents should always have a failure boundary drawn between them. (User A shouldn't be able to bring down the cluster-agent; but they shouldn't be able to snipe user B's job by OOMing their job on the same cluster node, either.) And on a Unix box, the only true failure boundaries are jails/zones/containers; nothing else really stops a user from using up any number of not-oft-considered resources (file descriptors, PIDs, etc.)
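The memory-cgroup half of that is just a directory and a couple of file writes. A rough sketch in C, assuming a cgroup-v1 memory controller mounted at /sys/fs/cgroup/memory (file names differ under cgroup v2, and LXC will normally set all of this up for you); the group name and 512 MB cap are made up:

    /* Sketch: create a memory cgroup, cap it at 512 MB, move ourselves
     * into it, then exec the untrusted job. Anything the job forks
     * inherits the cap, and only that cgroup gets OOM-handled. Run as root. */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define CG "/sys/fs/cgroup/memory/wildcode"

    static int write_file(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%s\n", val);
        return fclose(f) == 0 ? 0 : -1;
    }

    int main(void)
    {
        char pid[32];

        mkdir(CG, 0755);                                       /* create the cgroup */
        write_file(CG "/memory.limit_in_bytes", "536870912");  /* 512 MB cap */

        snprintf(pid, sizeof(pid), "%d", (int)getpid());
        write_file(CG "/cgroup.procs", pid);                   /* move ourselves in */

        /* placeholder for the real job */
        execlp("sh", "sh", "-c", "echo running under a 512MB cap", (char *)NULL);
        return 1;
    }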
Do you have any good resources on where to get started going about setting up failure boundaries/jails/zones/containers like this properly?
I think it's surprisingly easy to get yourself in the situation where this is a concern for you[0] but you don't know how to solve it.
[0] Just run "adduser" and have SSH running, or just create an upstart job, or write a custom daemon that accepts and executes jobs from not-quite-trustworthy-undergrads, or...
>I guess I'm one of the few people(?) who like the OOM killer.
Diff'rent strokes. I also like the OOM killer; it's a dastardly wonderful thing to tie lots of safety-critical things to .. and in the SIL-4 OS business (my domain), it is indeed imperative to understand how to use the OOM killer properly. Or: not.
So in this light .. I know Erlang is "the thing" right now, but I feel I must just mention that:
>Of course, this works better when you have many small processes rather than few monolithic ones. But now you're designing an Erlang system :)
.. one could also be designing a Lua-based distribution, or JVM, or whatever you like, essentially, and integrating with oom_killer. There's nothing Erlang'y about it. Because if you're playing with the oom_killer, you're really making a distribution choice, in the topology.
If your app cares about oom, well kiddo .. you better not be doing anything less than exercising complete control over your distro, its launch policies, its use of the TextSegment as installed, and so on. Absolutely you're making Distribution decisions about the functionality of the combined system. oom_killer isn't useful by itself.
My point being, worrying about oom_killer isn't just something Erlang users need think about, nor are they the only ones who really 'get' why an oom_killer can be used nicely .. if you're building a distro, either for use as an embedded machine, a tight secure server image, or indeed even as a desktop user, well .. careful memory integration is, as you say, a harsh pass.
Incidentally, I use the oom_killer exactly as you mention, in a few embedded distro applications, specifically to kill whatever 'lua' is hogging resources. It's an extraordinarily functional mechanism for recovery ..
Those environments don't provide the supervision hierarchy which is being discussed in this thread. In Erlang, every actor you create has a supervisor, which will restart the actor (the behaviour is tunable) when it crashes.
That is, until you get a process that triggers OOM due to a bug on startup (pissing all over RAM) and ends up stuck in a restart loop which cripples your machine so you can't get in and fix it. Either that, or it fails and neutralises supervisord or something, resulting in the site being down anyway.
ulimit is a better solution, i.e. set reasonable constraints based on available resources.
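For completeness, a launcher can also impose the constraint on itself right before exec'ing anything risky. A minimal sketch using setrlimit(), which is what ulimit maps to; the 1 GB figure is only an example:

    /* Sketch: cap our own address space at 1 GB, so malloc/mmap fail with
     * ENOMEM instead of dragging the whole box into OOM-killer territory.
     * Equivalent to `ulimit -v 1048576` in the shell. */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit lim = {
            .rlim_cur = 1UL << 30,   /* soft limit: 1 GB */
            .rlim_max = 1UL << 30,   /* hard limit: 1 GB */
        };

        if (setrlimit(RLIMIT_AS, &lim) != 0) {
            fprintf(stderr, "setrlimit: %s\n", strerror(errno));
            return 1;
        }

        /* From here on (and in anything we exec), allocations beyond the
         * cap fail and a well-behaved program can handle that itself. */
        return 0;
    }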
I've never seen a process supervisor daemon that didn't have an "if process FOO exceeds X restarts in Y milliseconds, stop trying to start FOO" clause.
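The clause itself is only a handful of lines. A toy sketch, not any particular supervisor's actual code (the 5-restarts-in-10-seconds window and the /usr/bin/foo path are made up):

    /* Toy sketch of a flapping guard: restart FOO when it dies, but give
     * up if it exceeds MAX_RESTARTS within WINDOW_SECS. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    #define MAX_RESTARTS 5
    #define WINDOW_SECS  10

    int main(void)
    {
        int restarts = 0;
        time_t window_start = time(NULL);

        for (;;) {
            pid_t pid = fork();
            if (pid == 0) {
                execlp("/usr/bin/foo", "foo", (char *)NULL);  /* hypothetical daemon */
                _exit(127);
            }
            waitpid(pid, NULL, 0);      /* child died; decide whether to retry */

            time_t now = time(NULL);
            if (now - window_start > WINDOW_SECS) {
                window_start = now;     /* fresh window, reset the counter */
                restarts = 0;
            }
            if (++restarts > MAX_RESTARTS) {
                fprintf(stderr, "foo is flapping, giving up\n");
                return 1;
            }
            sleep(1);                   /* daemontools-style pause between tries */
        }
    }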
You've never seen an "enterprise" software init script then. 5-nines? Stick it in an endless loop and no one will notice (Atlassian I'm looking at you).
Yeah, but looking at Atlassian is easy, for pete's sake they gave us shitty software like JIRA, Confluence and others... and I say that with a HUGE dislike of those products due to how obtuse they can be.
> daemontools / supervise doesn't have it and it's a fairly popular solution.
Ok, it doesn't exactly support the quoted feature ("stop restarting if it exceeds X attempts in Y seconds"), but it does sleep for a second between restart attempts to mitigate the same problem: http://cr.yp.to/daemontools/supervise.html
I'm curious why you wouldn't recommend it. I've never used it myself, but I've been reading about it the past few weeks. The design seems really well done compared to init.d scripts in the sense that every init script must reimplement daemonization, pid files, etc. (which is helped by examples, start-stop-daemon, etc. but is still a huge and delicate chore in my experience).
Is your objection that daemontools is largely unused / unmaintained and lacks features (such as flapping-avoidance)? Or something else?
It's mostly about the lack of features. No support for cgroups, user switching, adjustable backoff, syslog, etc. Sure - it works, but it's the same thing as with qmail - it's the very minimum that you can call a useful SMTP daemon. The moment you want any feature on top, you have to implement it yourself.
Any one of upstart, systemd, or supervisord will give you an equivalent solution with more practical features. Daemontools was good enough a couple of years ago. (I used it happily back then.)
In short - there's nothing wrong with it really, but many alternatives are much better.
A particularly bad case of the OOM killer making a mistake is in-memory databases. If some other process is using lots of memory, but still less than the DB, it'll get killed for no good reason anyway.
One would hope that any machine hosting an in-memory database has gobloads of RAM and not much else to do. :) [Still, this is what replication is for! "In-memory database" is only a scary concept if you don't have any hot slaves.]
One might have 2/3 of memory used, with well known database size growth. Then a runaway process could use 1/3 of memory, cause trouble, but it's still the DB that gets killed.
This isn't disastrous of course (replication + snapshots + redis aof), but still annoying.
Another thing you can do, in this case, is to enable swap system-wide, then put just the database process into a cgroup with memory.swappiness = 0.
Thus, the database itself will never degrade due to spilling to disk, but all other processes might. But if it's a DB box, that won't precisely matter.
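Concretely, once the database is in its own cgroup that part is a single write. A sketch assuming a cgroup-v1 memory controller and a made-up group name ("dbgroup"); cgroup v2's memory controller doesn't expose memory.swappiness:

    /* Sketch: keep the DB cgroup's pages in RAM by zeroing its per-cgroup
     * swappiness; the rest of the system keeps swapping normally. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/sys/fs/cgroup/memory/dbgroup/memory.swappiness", "w");
        if (!f) {
            perror("memory.swappiness");
            return 1;
        }
        fputs("0\n", f);
        return fclose(f) == 0 ? 0 : 1;
    }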
The OOM killer has almost never worked for me. I think it has worked once or twice only; the rest of the time, high memory pressure combined with high storage pressure effectively locks up the system. The easiest way to trigger it is to start Eclipse and VirtualBox and then kick off a system update (just the package download, not the install), and I can't move my mouse any more.
The real irony here is that airlines actually do something very much like overcommit & OOM killer when it comes to reservations, and for precisely the same reasons: they know that not all the reservations will be used at the same time, but sometimes they do end up double booked, so then someone has to be kicked off the flight.
Very clever, that, once you think about it. "The on-time airline" running late, stuck in holding pattern, claim you're out of fuel, get landing priority, no more delay. About as classy as the other Ryanair tactics.
I would agree except that it seems to go deeper than that. Pilots are ranked by how little spare fuel they take. If they dare to take too much spare fuel they have to write a memo to management explaining why.
It's not company-approved, it's regulator-approved, which is designed to be sufficient. Airlines have no reason to go above that, except for planning reasons (eg you might want to add some extra fuel to have the option to fly faster on some legs to make up for delays). So I wouldn't expect Ryanair to carry less fuel than other airlines.
The only exception that airlines will sometimes do is to juggle around destination and alternate airports so that they can save some fuel on that (you have to have enough fuel to reach an alternate airport, and then 45 minutes in addition to that).
While I don't like Ryanair, this scenario isn't very likely. There will always be lots of paperwork etc. after declaring an emergency, which ultimately could end up with the pilots losing their license if it was discovered that they were repeatedly doing this.
I'm not an airline transport pilot, but in my personal flying, I've declared emergencies twice and asked for priority handling due to minimum fuel concerns once. No paperwork on any of them.
I also spend significant time on aviation/pilot forums, and I've literally never heard of "lots of paperwork" for any emergency that resulted in a safe, no injuries landing. The topic comes up regularly in the "should I have declared an emergency or not?" discussions, the common argument against is "I don't want the paperwork" (a strange trade off against a life-safety question in any case), and the most I've heard as routine followup is a phone call from an inspector.
My experience is exclusively US, so perhaps it's different in other areas of the world.
Maybe there is a difference between parts of the world, but I would be surprised if airlines don't have some kind of routine to follow when an emergency is declared. And I would be even more surprised if airline pilots on Ryanair routinely used emergencies as an excuse without it having any consequences.
The only issue I've ever caused was triggering a TCAS warning for a landing airplane because I was slow on starting a turn. Even though there were VFR conditions and we were at a safe distance, the other pilots said they had to report it as a matter of company policy, and my instructor had to make a few phone calls afterwards.
> At least two memorandums were sent to Ryanair pilots detailing the company's concern about what was described as "excess fuel explanations" -- a description of the reasons flight commanders have to give if they take on extra fuel over the recommended minimum fuel load.
Is "because it's only the minimum!" not a valid answer?
If you read the article to the end, they had 90 minutes extra worth of fuel, and landed with 30 minutes left after diverting due to a thunderstorm.
It was a "mayday" because they didn't have enough fuel to return to the original airport they were going to land at and/or to continue waiting in a holding pattern for another hour.
It's not the best analogy, as when an airplane is low on fuel, dropping a passenger wouldn't make a big difference (I'm assuming).
But think of a hypothetical situation where low fuel situations somehow were common, unforeseeable and unavoidable. And if dropping a passenger could indeed make the difference between a full crash and a safe landing. It's a shitty situation, but it makes sense to give somebody a chute and kick them out, instead of letting everyone die.
Right, so they do it at pretty much the precise time that everyone grabs the seats. They don't somehow get everyone on the plane, take off without everyone in their seats, and THEN kick out passengers.
Sometimes is pretty much all of the time. My SO works as cabin crew for one of the large airlines you will have heard of, and they typically always overcommit. It isn't just because they are greedy and want lots of money though (well I guess it kind of is). As you said, they know some people won't turn up, but also they may decide to change the class configuration of the aircraft (economy and business -> economy, business and first) depending on how people book or what aircraft are available on the day.
> but also they may decide to change the class configuration of the aircraft (economy and business -> economy, business and first) depending on how people book or what aircraft are available on the day.
United is infamous for this, prompting a sound piece of advice: if for some reason you're in first class on a flight scheduled for an A320, don't take a seat in row 3, since odds are high it'll be swapped to an A319 (which only has two rows in F) day of the flight. In theory there's a pecking order for who gets downgraded in that case (first, anyone who moved up to F on a status upgrade, then the paid F fares starting with whoever was latest to check in), but in practice the gate agent just bumps row 3 and calls it a day.
I feel like aircraft substitutions are more likely to happen for operational reasons (the A320 they wanted to commit is unavailable because of mx/wx, most likely, and instead of taking a delay, they pull an A319). Some carriers do seem to use last-minute equipment swaps for yield management, but I feel like for UA, doing this is not a good idea — they burn a lot of time re-assigning seats and refunding E+ fees, etc. after a 320->319 downgrade, so I think it's something they try to avoid because it's costly to deviate from the planned equipment.
I think the parent poster was in part referring to operational upgrades (very common at British Airways — they will dramatically oversell the Y cabin and op-up people into J/F as needed), which don't necessarily require an equipment swap.
Another option uniquely available to European carriers that you might not have seen in the U.S. is to adjust the number of J/Y seats in an aircraft: on most intra-Europe flights, the "business-class cabin" is the same seats as the coach cabin, but with a blocked middle seat and a movable divider so you know where Y starts.
(Come to think of it, I guess BA will sometimes swap equipment to a flight with fewer classes of service — for example, the LHR-DME route can lose its lie-flat Club World seats overnight sometimes after unplanned equipment swaps — but I don't think BA does this for yield management, because it's costly to them to have to pay refunds to people they've inconvenienced.)
I don't know why United does it, just that the A320/A319 swap is so common with them that on frequent-flyer forums it often gets a stickied thread with tips on what to do, warning signs, etc. Hence the advice about never taking a seat in row 3 (and for economy passengers, never booking into the last couple rows of the aircraft, since those seats don't exist on the 319).
There's also a counterintuitive "always choose row 3" option, where you make yourself more likely to be downgraded but also more likely to be offered a travel voucher under United's fairly generous distance-based compensation policy (internal documentation "GG OVS DOWNGRADE").
> It isn't just because they are greedy and want lots of money
Certainly not in the current airline environment. Even with a perfectly loaded plane, they only make a little money, and you can't plan for a perfectly loaded plane with any kind of confidence without double-booking.
> My SO works as cabin crew for one of the large airlines you will have heard of, and they typically always overcommit.
I phrased what I said poorly. They always allow overcommit, but they try to work it out so that on average they efficiently make use of their seats without constantly having to give people giveaways and bad customer experiences.
That's more analogous to the strategy of pausing processes which require memory when there isn't any available, and then resuming them once memory becomes available. The airline still lets you complete your trip when they bump you from a flight, you just have to wait until capacity becomes available for you.
The problem with translating that approach to OSes is that you can easily deadlock the entire system.
It depends on circumstance. For a lot of people, having their flight canceled and having to switch to another is all but catastrophic. You basically have to restart and build a new travel plan, and much of what you set out to accomplish gets lost.
Precisely. The OOM killer has its uses, and while it's debatable if it is suited for this workload or that workload, it is easily tunable for specific use cases, and you can disable it altogether if you'd like the classic behavior.
Yes, but that's more like fork or malloc failing in the first place than letting you run and then randomly killing something when the system gets overcommitted.
OSX is just like, "Hey guys, if any of you happen to not need your memory sometimes, would you mind kindly letting me know and I'll go ahead and let you go at a convenient time?" Meanwhile Linux goes on a murderous rampage with unpredictable effects.
... And Windows is sitting in a corner, fans kicked on high, trying valiantly to manage with swapping to disk until a sysadmin gives up trying to connect via RDC and yields to the age-old "have you tried turning it off and back on again?"
The part of this I hate most is that if you have physical access to the machine and can hit ctrl-alt-delete you are given an out-of-band dialog that functions flawlessly; this dialog happens to have a "task manager" button, but as the task manager functionality is part of a normal user application and not that magical dialog, you get thrown back into the swap storm, only now with yet another process (taskmgr) competing for memory :(.
That's a few layers up the stack. Sudden Termination is a message to the OS management layer that an app can recover robustly from an unexpected death. That's not a job for the kernel, and presumably Darwin has some handling analogous to the OOM killer for the situation where memory truly is exhausted.
FWIW: Android baked that idea into the framework from the start. All apps are essentially required by the application lifecycle to be robust against sudden failure, and the system is designed to give apps fair warning to save themselves at various times (e.g. on backgrounding). If you think about it, on a battery-powered embedded system that's pretty much a firm requirement anyway.
Android accomplishes this using a slightly modified OOM killer, which it can control directly through /proc. In addition to the "out of memory condition" which the traditional OOM killer supports, it has the "low memory killer" (LMK). Processes on Android that are not supposed to be randomly terminated are protected by adjusting the OOM/LMK settings, which the Android system does when launching the Dalvik VM for that activity/service.
This may have changed in recent versions of Android, as I haven't kept up with internals that much.
Actually it's much more than slightly modified. The lowmemkiller (which is still present, and mostly unmodified from what you saw) code runs out of the cache shrinker framework, not on an OOM event. And it chooses processes to kill based on categories that correspond to UI state (e.g. preferentially killing idle apps before background processes before foreground apps, etc...).
But that's not quite what I was saying. The point was that the promise in OS X (that a process "could be killed if necessary") is baked into Android: it's true by default of all processes, and the framework guarantees a set of lifecycle callbacks to allow processes to persist their state robustly. This isn't part of kernel behavior at all.
The memory alert stuff on iOS is similar; when the phone runs low on free memory it tells the app to free up as much memory as possible, and then only starts killing processes if there still isn't enough available.
It can be a bit finicky to work with, but it does seem to be pretty much strictly better than just killing random processes without giving them a chance to be better behaved.
This reminds me of my one and only question on Stackoverflow: "Throwing the fattest people off of an overloaded airplane." http://stackoverflow.com/q/7746648/67591
Some years back I was flying a small commuter airline that used small prop-type airplanes (I call them pterodactyl air). Partway through the flight, I noticed one prop seemed like it was not working, so I leaned forward to alert the co-pilot (the plane was that small). He told me that they would turn off one engine and "feather the prop" to save fuel. I told him that I would be happy to take up a collection back in the cabin from the other passengers to pay for the extra fuel to power both engines. He chuckled, but I was serious. I never flew with them again.
Maybe there is a way to suspend a process (feather the prop) rather than completely kill processes.
Just FYI, pretty much every plane with multiple engines is able to safely fly and land with only one engine working, and all commercial pilots are trained to do so. If you were halfway through the flight already, it's entirely possible you were already in descent and didn't need the other prop at all to complete the flight.
Wow, that was ridiculous. So, because you don't understand how planes work this is the pilots' fault. Even if you were serious about the fuel exactly where do you propose they put it? In-flight refuelling of commercial planes doesn't exist.
Usually, I recommend that database and queue servers run the database/queue with a priority that makes it unlikely for them to be killed.
I had a case where a colleague running a script on a server under high memory pressure got the queue killed, which is inadvisable even if the queue is crash-safe. Before that, the queue had been running for 1.5 years straight.
That post is a great, poetic allegory. But ultimately, I think the analogy presents a bad idea. The allegory makes the point that we could entirely avoid OOM errors by engineering a system such that resources are never overcommitted. This is true; we could do that.
However it would be bad.
Under-committing resources (thus removing the need for an OOM killer) will NOT lead to a net gain compared to over-committing resources (and thus requiring an OOM killer of some sort).
If we are unwilling to overcommit resources, then it would be woefully uneconomical to run algorithms that have bad worst-case performance (because to avoid overcommitting you would necessarily need to assume the worst case is encountered every time).
It's just not feasible to avoid algorithms that have bad worst-case performance. Rather, we need to develop better abstractions for dealing with components (e.g. computations, programs, processes, threads, actors, functions etc.) that go over budget. Here's my attempt at developing a better abstraction for web servers: mikegagnon.com/beergarden
Ultimately, we need to treat every system like a soft real-time system, because at the end of the day every program has timeliness requirements and has resource constraints. The current POSIX model does not provide such abstractions and I think that's why we have these debates about OOM killers.
I like the idea of the doorman, but what if you could somehow pass back useful math to the client? Of course, then you'd have to disregard that useful work yourself or double-check it, negating the energy savings. Or perhaps return a map of a traveling-salesman-type problem (maybe a map of metadata and their traversal costs), and they could navigate that map depending on exactly what kind of data they really want, thus reducing your load for valid, heavy requests; and if they return a path with lots of nodes, you know to de-prioritize it or drop the request.
Here's a novel way to deal with an out of memory situation caused by slow memory leaks in a long-running server process: start swapping memory that hasn't been touched in literally days or weeks to /dev/null, and pray the process doesn't ever need it again.
That's so indescribably worse than just killing that process the mind boggles. Breaking in a simple, predictable, and detectable way vs. corrupting data and hoping (excuse me, "praying") nobody notices.
Better yet instead of sending it to /dev/null save it on disk and reload it whenever the process needs it again.
Oh wait that is how virtual memory works.
Compressing the memory area used by the leaky process in question would be a gentler and more robust "solution" here. There already exists a Linux kernel module called "zram" [1] with which you can accomplish just that (though you might have to tune your swappiness a bit first).
I'm pretty sure there's a Linux kernel module. Obviously it'd be required for my method to work. Might not be doable in FreeBSD (incompatible with the branding).
Or, here's a crazy idea: how about we actually allocate the memory when you call malloc(), and if there isn't any, give you an error instead? Programs could check the return code and decide what to do when they run out of memory themselves. Crazy, I know.
You can do that if you like. Use vm.overcommit_memory - setting it to 2 will still allow overcommit if there is room in swap for it. (So a process doesn't have to use the memory, it's just assured of a place in swap should it need it.)
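To make that concrete: under overcommit_memory=2 the boring textbook NULL check actually fires at allocation time instead of showing up later as an OOM kill. A minimal sketch (the 64 GB probe is arbitrary and assumes a 64-bit build):

    /* Sketch: with vm.overcommit_memory=2 the kernel refuses commitments it
     * can't back with RAM+swap, so this malloc() really returns NULL.
     * With the default heuristic overcommit it will usually "succeed",
     * and the trouble only starts when the pages are touched below. */
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        size_t want = (size_t)64 << 30;          /* 64 GB probe */

        char *p = malloc(want);
        if (p == NULL) {
            fprintf(stderr, "malloc failed cleanly: %s\n", strerror(errno));
            return 1;                            /* degrade gracefully, don't die */
        }
        memset(p, 1, want);                      /* under overcommit, this is where it hurts */
        free(p);
        return 0;
    }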
My memory is a bit hazy in this area, but I think memory is overcommitted by default in Linux. What that means is that malloc() can return an address that doesn't have physical memory assigned in the page table. Memory isn't committed until it is written to.
This isn't the case with the default MSVC implementation of malloc() in Windows. In Windows address space is reserved and committed with VirtualAlloc(), and typically that is done in one step.
I think memory is overcommitted because Linus wanted to keep the memory footprint lower than NT's early in the development of the kernel. The drawback is that applications may be OOM-killed when writing to memory that was successfully returned by malloc().
You seem to have some terminology problems here. Windows VirtualAlloc may "commit" memory but that does not mean it actually reserves physical pages [1]. That always happens only when the memory is accessed. On the other hand, MSVC's malloc() probably uses HeapAlloc(), which in turn uses VirtualAlloc(). I don't think there are any fundamental differences between Linux and Windows here.
That link says:
"Allocates memory charges (from the overall size of memory and the paging files on disk) for the specified reserved memory pages."
It does count against the total memory allowed. My laptop has 8GB of RAM and a 1GB page file, giving me a 9GB overall commit limit. If I spawn a process that eats up 1GB at a time, even Task Manager can clearly show me going up and hitting 8/9GB, and then I get an OOM error in my process.
Windows won't commit memory that a process can't use. You can't overcommit, although you might end up in the pagefile. Without the odd concept of fork, you don't end up with processes having huge "committed" address spaces that aren't ever going to be used.
The note about physical pages is just saying it's not mapping it, not that it's not guaranteeing it.
VirtualAlloc can reserve address space and commit pages for it depending on the flag provided. Address space is a resource of the process, and pages are a resource for the entire system.
Also, they count as your carry on bag, and they only bring one parachute on board, so if multiple people have to be thrown off and they've both paid for a parachute, they have to draw straws to decide who gets it.
Edit: And remember, OOF goes off on less than .1% of flights, so they have hundreds of times as many parachutes per flight as they have people who need them. Rumors that parachutes are oversubscribed are therefore wildly inaccurate.
The few cases when I've seen OOM invoked, it took a couple of minutes to kill chromium after flash (of course) messed up; during that time the system was unresponsive and it killed a few random smaller processes until it hit the correct one, flash or chromium, in some weird interdependent bug. Either way, I wasn't too happy.
After a while I noticed when the bug triggered/the system started becoming unresponsive, and I had a terminal with killall -9 chromium & killall -9 flash-plugin ready to go, so I could preempt it myself and the OOM killer wouldn't get involved. There has to be a better mechanism than OOM.
Too slow for the command to pass through the usual "X - WM - TE - shell - killall" chain? Try Alt+SysRq+f; that, in my experience, is waaaaay faster for invoking oom_killer.
His issue is that oom_killer doesn't get it right straightaway. Alt+SysRq+f would still need multiple invocations before it gets flash.
Personally, I do use Alt+SysRq+f since it predictably targets GMail-on-Chrome every time on my system. That is usually enough on my desktop OS for me to jump in and manually kill the offender. I can then just F5 GMail.
Even so, in OOM situations, sysrq invocation is still an order of magnitude faster than killall invoked from a graphical terminal emulator.
As for hinting to oom_killer: I have a script which searches for chrome and flash processes every minute, and sets their oom_score_adj in the high hundreds. This makes reasonably sure that oom killer will go after these processes first.
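In case it's useful to anyone, the whole thing fits on one screen. Roughly this, as a sketch: it matches on /proc/<pid>/comm, the score of 800 is arbitrary, and you'd run it from cron with enough privilege to write other users' oom_score_adj:

    /* Sketch: walk /proc, find processes whose comm contains "chrome" or
     * "flash", and raise their oom_score_adj so the OOM killer picks on
     * them first. */
    #include <ctype.h>
    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    static void bump(const char *pid)
    {
        char path[64], comm[64] = "";
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%s/comm", pid);
        f = fopen(path, "r");
        if (!f)
            return;
        fgets(comm, sizeof(comm), f);
        fclose(f);

        if (!strstr(comm, "chrome") && !strstr(comm, "flash"))
            return;

        snprintf(path, sizeof(path), "/proc/%s/oom_score_adj", pid);
        f = fopen(path, "w");
        if (f) {
            fputs("800\n", f);      /* "please kill this one first" */
            fclose(f);
        }
    }

    int main(void)
    {
        DIR *proc = opendir("/proc");
        struct dirent *de;

        if (!proc)
            return 1;
        while ((de = readdir(proc)) != NULL)
            if (isdigit((unsigned char)de->d_name[0]))  /* numeric entries are PIDs */
                bump(de->d_name);
        closedir(proc);
        return 0;
    }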
---
[1] http://lwn.net/Articles/191059/