Hacker News new | past | comments | ask | show | jobs | submit login
Kill init by touching a bunch of files (rachelbythebay.com)
255 points by omnibrain on Nov 24, 2014 | hide | past | web | favorite | 123 comments

This is a great example of how bad many open source projects are at accepting contributions from 'non core' developers. The patch is just rejected, when it actually looks pretty valid to handle all cases of return value from a kernel interface. While it might not be a perfect solution, accepting it with suggestions for additional improvements could have led to those improvements.

> The patch is just rejected

Technically the patch isn't rejected (and I'm kinda peeved TFA claims it is). It's in limbo, waiting for further action from the submitter or an other contributor: it's marked as needing improvements since it hides the issue under the rug instead of reporting it to the caller/user.

This is a rejected patch: https://code.launchpad.net/~jamesodhunt/libnih/bug-776532/+m..., it has a "Status: Rejected" set (and a disapproving review).

Although of course the patch could have been merged and improved later, so that libnih wouldn't blow up the whole system in case of inotify overflow.

The OS disobeying your configuration files is really better than it restarting?

Looks to me that Upstart kept the lesser of two evils. I'd reject that patch too.

> The OS disobeying your configuration files is really better than it restarting?

It does not restart, it locks up with a mostly useless error message, then, maybe, at one point, possibly restarts. The restart is not a policy decision it's a side effect of the box being dead. Here's the better option: bubble up the issue to the caller and let it decide what to do.

> Looks to me that Upstart kept the lesser of two evils.

Upstart didn't do anything.

> I'd reject that patch too.

The patch isn't rejected.

> The patch isn't rejected.

Just because the reject status wasn't used does not mean it wasn't rejected. The patch, as it was written, was rejected. It won't be used and the recommendation was a complete rewrite in a completely different direction.

People legitimately object to accepting a patch that fixes one issue but can lead to other problems down the line. IMHO, even not accepting the patch, but with suggestions for additional improvements would have been better behaviour.

As it stands, the communication was:

1. There's a problem, here's a patch.

2. NAK, because (valid reasons).

3. Radio silence.

Perhaps better behaviour would have been:

2. NAK, because (valid reasons). However, I acknowledge the problem; perhaps you can fix it (some other way).

This way, the discussion is more likely to keep going and end up with a proper fix to the issue.

The communication to me looked more like:

1. There's a problem, here's a patch.

2. We won't apply it because it doesn't fix it perfectly. Sure it's better than what we did, and you offered the patch for free out of the kindness of your heart, but we aren't going to apply it until you work on it some more.

3. But... it's better in every way!

If you write a patch for a project that is an improvement in every way, but not yet perfect and they don't apply it... Well you're probably not going to spend much more time helping that project are you?

>> But... it's better in every way!

And this is where we disagree. I believe it's worse. The original crashes the system, which is really bad. The cases where this happens are few, and people are going to be aware of what they did to cause it (unzipping a bunch of files in there was suggested as a trigger). With the proposed patch, nothing will happen. The user will not be aware of the issue and the system will not update with the changes. The user may not even notice their action had no effect and if they do, it'll be a harder mystery than the crash to figure out.

Some prefer the system to crash than silently ignore input. Error codes and messages are better than either of those.

I suspect the 'goodness' or the 'badness' of the patch is contained in what init does when it misses notifies. Clearly the kernel has dropped some on the floor as it ran out of space, and it tells you this. What libnih does is then scream and shout and abort (which is a fine first configuration since you don't know how common this will be) but when you discover it does happen, you consider ignoring it, if the error is idempotent. Meaning of course if you ignore it, do you later get a notify when the kernel has more notify buffers to use? To understand that you need to read the notify code in the kernel so see how it is generating notifications.

If the kernel drops and never returns a notification, then init has to know that it missed some in order to operate under the correct set of init files. That requires a combination patch to init and to libnih.

If the kernel gets around to notifying anyway, just more slowly, then you can safely ignore it, init will eventually get the message the 'regular' way and you're done.

Given the bug, the next step might be to see if systemd suffers a similar challenge in the presence of a lot of config changes.

I came here to say, well, something.

Obviously, the patch sucks -- it just ignores the error, instead of doing the right thing.

OTOH, init not shitting the bed and taking your whole OS down is probably a good thing, even if getting there is imperfect.

I'm pretty sure James Hunt is a core Upstart developer, at least these days. Scott James Remnant, the original author, stopped being the lead maintainer when he left Canonical, but I don't remember if that had happened by 2011.

.NET's FileSystemWatcher documentation says that there's an internal buffer of a finite size, and that if (when) it fills up, you're going to be told about it and must do a complete traversal of the directory you're watching. No-one has invented a better way to deal with this, so that's what you need to do.

Many developers ignore this, so it's not really surprising that this has happened with inotify too. It's mentioned that a patch wasn't accepted, but it was with good reason - it doesn't fix the problem (by traversing the directory).

I'd generally take a config file failing to update (this may happen anyways from the looks of it, e.g. if "/proc/sys/fs/inotify/max_queued_events" isn't available) over the box dying with a terrible crash report.

I'd much prefer to accept the patch with a followup request, tracking issue, or TODO. There wass good reason to not consider the issue resolved, but I'm not of the opinion there was good reason to reject the patch either.

>I'd generally take a config file failing to update (this may happen anyways from the looks of it, e.g. if "/proc/sys/fs/inotify/max_queued_events" isn't available) over the box dying with a terrible crash report.

I'm not entirely sure I'd agree with that line of reasoning, to be honest. While the crash is a big problem and falls into the category of "obscure enough" problems that might leave your sysadmins scratching their heads for quite some time (especially because of the contrived and terrible crash report message), I would much prefer my system to crash and halt instead of failing to recognize/load a config file. Not loading config files properly (and failing silently, at that) means that you might be exposed to security attacks, your system might be in an inconsistent state until the next reboot and you might not even know it. I'll take a failure over prolonged inconsistent state of a machine any day. That's what redundancy solves after all.

If you are concerned about config changes taking effect, you should be explicitly reloading affected services anyway and/or verifying that the change has taken effect, not blindly rely on your init to have picked up the change.

This is a false dichotomy anyway: The solution is not to not apply the patch if the submitter doesn't provide an alternative, but recognise that this is a potentially serious problem and fix it anyway.

There are multiple alternatives:

* Fix libnih.

* Fork libnih for upstart, and fix it there.

* Ditch libnih for this, and catch the problem yourself.

* fork() and treat the child failing as a sign you need to rescan the entire config directory after restarting it.

* Regularly rescanning the directory anyway, on the assumption that Things-Go-Wrong and events will get missed out for some reason or the other.

More than one of those can be applied at once. Personally I'd opt for the two last in combination.

Failing silently should no be an option. This is a case where the best is enemy of the good. The patch is good, albeit not optimal. Hanging on to a bad solution, in presence of a good solution, because you can imagine an optimal solution was a bad call.

The patch should log a warning/error and avoid the crash. It does not preempt an optimal solution, by traversing the file tree, being developed afterwards.

loud suicide rather than quiet misbehavior (nobody reads logs until someone notices something wrong) is often preferred, at least for distributed systems.

> Failing silently should no be an option.

In this case, by my standards, it already is. Some relatively unrelated bit of code is later failing loudly.

> This is a case where the best is enemy of the good.

Neither best nor good happened in this case - so I'd flip your terms around. The best of intentions and standards prevented good.

> The patch should log a warning/error and avoid the crash. It does not preempt an optimal solution, by traversing the file tree, being developed afterwards.

On this we can agree wholeheartedly. That'd be a great followup changelist in lieu of doing the optimal solution, assuming the latter is too time consuming.

> your system might be in an inconsistent state until the next reboot and you might not even know it

With the kernel panic, you'll be ending up in an inconsistent state even after the reboot - assuming it reboots - when it happens in the middle of un-taring your configuration files.

At least you WILL know that your system is in an inconsistent state and you know something wrong happened and you should take action.

Plus, there are modern features in kernels that allow you to audit your kernel panics, figure out what is wrong and, hopefully, everything should be automatically reverted back to how it was thanks to backups and journaling on the file system (to prevent corrupted images, that is).

Of course this won't automatically fix your problems for you or finish your un-taring of config files, but that's why we have sysadmins, isn't it?

> It's mentioned that a patch wasn't accepted, but it was with good reason - it doesn't fix the problem (by traversing the directory).

The patch can not fix the problem: upstart can't scan the directory (or ignore the overflow, or handle it however it would) because libnih blows up without yielding to upstart. It does not call a NihWatch handler and does not return an error code.

So the first step is to have libnih at least not blow up (which the patch did), ideally notify the user (which the patch didn't). The purpose of the patch was not to fix the issue, it was to fix libnih's broken handling of inotify.

If I was using this system I would say that it isn't up to libnih to fix the problem; inform init and let init do a complete config file(s) reload. This is very similar to the way the kernel handles interrupts; when there aren't many it handles hardware in default interrupt mode (e.g. one interrupt per packet), but when the rate of interrupts gets too high it moves to non blocking synchronous polling (e.g. keep iterating through packets until a buffer fills/reschedule or the input queue empties).

> If I was using this system I would say that it isn't up to libnih to fix the problem

It isn't up to libnih to handle the overflow, but the problem right now is in libnih itself[0] and only libnih can fix that.

[0] and is twofold: libnih blows up on inotify overflow instead of not blowing up on a relatively normal (if rare) situation, and consequently (and obviously) libnih doesn't provide any way for its caller to handle the overflow.

That's true. What upstart could have done, though, is to watch for config changes in a separate process, so that it does not need to have an in-process dependency on something like this.

That would give upstart a way of handling the overflow and other (unknown) potential problems: rescan the config if the config watcher dies (after respawning the watcher).

This seems like a reasonable solution. And one that sandboxes failures in that code path to prevent PID 1 from crashing. It's a shame this is the only comment (that I've seen) even mentioning this.

There's a risk that this is a case of "best is the enemy of the good".

Rather than fix an issue that can crash a whole box, hold out until the "right thing" is done.

That's a good point, but there's also the argument that it's better to stop and log why, than silently fail and let people try and figure out why something's broken.

I don't think either are great here, so the only real way to fix this is to, well, fix it!

As far as I remember all the Linux kernel to userspace notifications eg netlink as well behave like this. Of course it is a pain to do and hard to test so few people do but if you are writing a library you clearly should especially if it is being run by init.

Perhaps it would be a good move to extract the file-watching behaviour from the part of Upstart that runs as PID 1, and put it in a separate daemon that can notify Upstart when it's time to reload a configuration file. That way, only the daemon would crash in this situation, and Upstart could restart it as it would any other daemon.

Although then you have the problem of the notification channel between the daemon and Upstart overflowing, i suppose.

"libnih" - I have no clue what this does, so I immediately read it as 'lib not invented here'. Otherwise, yay...more unstable software.

> so I immediately read it as 'lib not invented here'.

> libnih is a small library for C application development containing functions that, despite its name, are not implemented elsewhere in the standard library set.

It's a pretty fun name. :)

Here's the core code, for the curious: http://bazaar.launchpad.net/~scott/libnih/trunk/files/head:/....

I found it interesting that it uses a space after the unary ! operator in C, which (to me), provided yet another way of writing various (very common) tests. For instance "if(! fp)" after trying to open a file.

It's pretty dumb that pid 1 has any assert()s in it at all. Or libraries for that matter.

You're going to _love_ systemd. I think someone found a kitchen sink in there once.

I decided to see if there was any merit to your snarky comment:

    $ git clone git://anongit.freedesktop.org/systemd/systemd systemd-git
    $ cd systemd-git
    $ grep -r "assert(" | wc -l
That looks pretty bad at first. But I noticed this in src/shared/macro.h:

    /* We override the glibc assert() here. */
    #undef assert
    #ifdef NDEBUG
    #define assert(expr) do {} while(false)
    #define assert(expr) assert_se(expr)

I didn't verify that macro.h was included by every file that used assert(). But I'm assuming it does. So it looks like assert() at least doesn't call abort() in their world. In fact, it's a nop in non-debug builds. And even in debug builds, the assert_se() macro logs a message and continues.

I didn't check for any external libraries systemd might use.

Also, the systemd repository contains dozens of other programs than systemd itself. Even if any of those were to abort, the implications for systemd would be far less severe. Not nearly all the git code will ever be running in pid 1.

This would be a great place to insert a snarky comment about the sanity of overloading a standard function.

I believe it violates the principle of least surprise, which I generally would consider a good principle.

As a new reader of the code, you'd be quite sane to expect "assert" to mean assert(), not "our local assert(), which is something different".

There should be a --disallow-keyword-redefinitions flag to the preprocessor, or something. Grumble.

'assert' is not a keyword. Even if it were, they overrode it with a function that has the exact same signature and general runtime behaviour. Hardly anything to be surprised about.

Also, invoking the principle of least surprise on a C codebase is just ludicrous.

I didn't say it was a keyword. It's a standard function-like macro (not a function, as I wrote).

I would think it just as bad an idea to overload `NULL` with something non-standard.

The standard things in the language are something to be learned, and they should remain constant and not change per-project just because someone can do it differently. Do it however you like, but don't clobber the standard names, is all I'm saying.

The "principle of least surprise" would say that of course asserts are turned off with -NDEBUG. From assert(3) on osx: """The assert() macro may be removed at compile time with the cc(1) option -DNDEBUG."""

In general I would agree with you—it sucks to override a standard function with weird, local, non-standard functionality. However, in this case, it's an extremely common practice to override assert() so that it only aborts or warns during testing or when some DEBUG macro is defined. To attack that is to ignore long-standing, wide-spread idioms.

"The principle of least surprise" is, in every use I've ever seen, just an obnoxious way to say that it's not how you would have done it.

Yes, when I want a nonstandard assert() I normally give it a different name, like insist(). No reason not to.

I often need to wash up after particularly messy system administration tasks. That sink will come in handy.

I'd be curious to know if systemd is tickled by this one weird trick to crash your server. It does work with inotify (and other fun newish kernel techs), so it would be subject to the inotify queue length...but what'll happen when it hits it?

TTBOMK, systemd doesn't watch unit files for changes. If you change a unit, you need to run `systemctl daemon-reload` to make it notice. It does, however, try to work out if it's in a state where you are running `systemctl` without reloading, and warns that you may need to reload.

Eyeballing their inotify use, it does seem to be careful to check this field and log failures if it's not as expected, but doesn't appear to do anything that should intentionally or unintentionally trigger failure.

And then take this anti-pattern and multiply it by the sheer excess of feeping creaturism; a massive vulnerability surface. If you thought AWS's short notice reboot everything incident was/is uncommon I suspect we'll be seeing a lot more of that sort of thing in the nearish future.

Reportedly it's possible to achieve a similar effect on systemd by having too many journal files (on 208, but the distros backport patches into it constantly from the -stable branch, IIRC): https://bugzilla.redhat.com/show_bug.cgi?id=1112556

Well... on one hand side, it's probably better to abort than to get a possible exploit. On the other it's not a funny situation when one of your base services dies and it can happen easily (found some cases of remote-triggerable abort()s in systemd-resolve - patches were sent upsteam, nobody seems to care for a few days now).

Exploits include denial of service, surely?

Writing to /etc/init.d requires root access. If you are root, you can bring down the box as it is.

The problem here was that the box went down by accident due to this bug.

Really? How?

Because the bug kills pid1 which took down production boxes:

> I should point out that this is not theoretical. I went through all of the above because some real machines hit this for some reason. I don't have access to them, so I had to work backwards from just the message logged by init.

> […]

> Let me say this again: this happens in the wild. The "touch * * *..." repro looks contrived, because, well, that's the whole point of a repro.

No, how did you accidentally write to /etc/init.d?

> No, how did you accidentally write to /etc/init.d?

Let me quote again, with a finer slicing:

> I went through all of the above because some real machines hit this for some reason. I don't have access to them

So you don't know? Gotcha.

And you apparently never managed anything in production.

Have you ever heard of "reproducing a problem"?

I've never seen anything write to /etc/init.d in production (or anywhere actually), which is why I asked for the root cause.

Without, all the time, bothering to read the article, and asking here something that no-one here could know, because it's not in the article. If you really wanted to know, you could have mailed the author.

Boy, do you deserve these downvotes.

How do I get downvote access? I'm jelly.

Get a high enough score. Also don't use words like jelly. HN doesn't like reddit's tendency to use funny memes etc.

"jelly" is a word, not a funny meme. (Also, it's Hacker News, not Not-Reddit News.)

Thank you.

From the article Upstart is watching /etc so the example might have used /etc/init.d just to illustrate it

But yes, it is really strange. This usually means some weird proprietary tools

I'm a bit out of the loop on this stuff, but isn't this kind of thing (bulk-rewriting stuff in /etc) exactly what configuration management systems do in their normal operation?

I don't have a chef/puppet/etc system handy at the moment, but could this possibly be the result of someone setting a file immutable, with the chef recipe dealing with that bit of the filesystem changing, causing chef to repeatedly trying to apply a changeset?

The author indicates the watcher also follows /etc, so any files written there could cause this as well.

That's the point.

Maybe someone untarred a ton of config files into /etc and hit this limit?

I wonder though, could a regular user create a lot of watches and "overflow" the inotify "queues", then just 1 harmless change in /etc/init.d could bring down the system?

There are so many things in here that tempt me to comment about, so here goes:

1) For me, this is a prime example of why I personally like programming environments with exceptions. If libnih could throw an exception (I know it can't), then they could do that which would allow the caller to at least deal with the exception and not bring the system down. If they don't handle the exception, well, we're were we are today, but as it stands now, fixing this will require somebody to actually patch libnih.

Yes. libnih could also handle that error by returning an error code itself, but the library developers clearly didn't want to bother with that in other callers of the affected function.

By using exceptions, for the same amount of work it took to add the assertion they could also have at least provided the option for the machine to not go down.

Also, I do understand the reservations against exceptions, but stuff like this is what makes me personally prefer having exceptions over not having them.

2) I read some passive aggressive "this is what happens if your init system is too complicated" assertions between the lines.

Being able to quickly change something in /etc/init and then have the system react to that is actually very convenient (and a must if you are pid 1 and don't want to force restarts on users).

Yes, the system is not prepared to handle 10Ks of init scripts changing, but if you're root (which you have to be to trigger this), there are way more convenient (and quicker!) ways to bring down a machine (shutdown -h being one of them).

Just removing a convenient feature because of some risk that the feature could possibly be abused by an admin IMHO isn't the right thing to do.

3) I agree with not accepting the patch. You don't (ever! ever!) fix a problem by ignoring it somewhere down the stack. You also don't call exit() or an equivalent in a library either of course :-).

The correct fix would be to remove the assertion, to return an error code and to fix all call sites (good luck with the public API that you've just now changed).

Or to throw an exception which brings us back to point 1.

I'm not complaining, btw: Stuff has happened (great analysis of the issue, btw. Much appreciated as it allowed me to completely understand the issue and be able to write this comment without having to do the analysis myself). I see why and I also understand that fixing it isn't that easy. Software is complicated.

The one thing that I'm heavily disagreeing though is above point 2). Being able to just edit a file is way more convenient than also having to restart some daemon (especially if that has PID 1). The only fix from upstarts perspective would be to forego the usage of libnih (where the bug lives), but that would mean a lot of additional maintenance work in order to protect against a totally theoretical issue as this bug requires root rights to use.

I fail to see how an exception vs an error code makes any difference whatsoever.

The whole problem is that the library doesn't notice the error. If you don't notice an error you are not going to use an exception either.

Not to mention that an exception by default terminates the program, which is exactly what you don't want to do.

The library noticed the error and was (effectively) calling exit().

If the library was at liberty to throw, they would not have to change any internal callers and would get the same feature for them (not having to deal with errors anywhere) but they would still give the external caller a chance not to terminate.

Yes. By default, a caller would terminate, but it would get a chance not to in a way that doesn't involve changing the libraries public interface.

Exceptions are not a silver bullet. One can't simply turn an assert into an exception and expect things to work. That's a great way to add new and tricky bugs unrelated to the original issue.

If an exception crosses a stack frame that isn't expecting one, things could be left in an inconsistent state. Invariants may vary, constraints may be unconstrained.

To retrofit exceptions into the existing code at this point would be very difficult, as you would have to audit every caller of the function in question, a nd the callersof those, and on and on...

Imagine problems like Apple's "goto fail" SSL bug, but more obscure and harder to find.

To have used exceptions from the beginning would require an amount of work roughly equivalent to returning error codes. They aren't fundamentally different from error codes in that way. Well, unless you're a cowboy who ignores errors.

Exceptions wouldn't have solved anything here.

> Exceptions are not a silver bullet.

There are a lot of things that aren't silver bullets. Oddly though, they are still helpful.

> One can't simply turn an assert into an exception and expect things to work.

Actually, if you did that, you'd expect them to fail... but with better semantics.

> That's a great way to add new and tricky bugs unrelated to the original issue.

Only if your runtime doesn't allow for the clean expression of stack unwinding semantics.

> To retrofit exceptions into the existing code at this point would be very difficult.

Yes, I don't think that was at all related to the original point though. It'd be hard to retrofit almost any new language feature in to the existing code.

> To have used exceptions from the beginning would require an amount of work roughly equivalent to returning error codes. They aren't fundamentally different from error codes in that way. Well, unless you're a cowboy who ignores errors.

Not at all. Error codes require error handling in every caller up the call chain. Exceptions at least allow you to put your error handling logic just where you have handlers.

> Only if your runtime doesn't allow for the clean expression of stack unwinding semantics.

The point here is that if a library has already chosen to not cleanly unwind anything, your application catching the exception isn't going to help the garbled state created by the library.

I think cbsmith's point is that a a good language makes it so easy to correctly unwind the stack that it's almost harder to not do it right. If we could go back in time and do it over using exceptions and a better language, this likely would have been an exception from the start, which would have made this bug easier to fix. For example, Python's "assert" raises instead of exiting the process. The caller could just catch the exception from the failed assertion and deal with it, without need to update any other code. The problem is, we can't easily change languages now, nor can we safely retrofit exceptions. Hindsight is 20/20.

It's worth noting that that argument depends on the exception-raising assertion having been present from the beginning. The correctness of stack unwinding code can sometimes depend of which exceptions a function raises. A particular piece of code that is currently correct may break if one of the functions it calls is changed to raise a new type of exception.

That's essentially it. The whole, "yeah but the library may not handle exceptions properly" is essentially a C++ problem, not an exceptions problem.

An assert is what a programmer uses when they think something won't/can't ever happen.

So the authors did not realise inotify could legitimately return an error.

If they'd thought it could return an error, they'd return the error code themselves or recover gracefully or whatever.

The library can do the equivalent by offering a new interface to register an error callback.

Even without that, callers do have a way of preventing termination: Fork.

Use atexit to register a handler that setjmps back to main.

Hey, I like that. Always nice to use the more obscure things messing with the stack (set/longjmp) ;-)... Haven't encountered them for a while.


    [optiplex /home/chris/tmp/atexit_main]
    $ ./testme
    This is main. I will sleep a while, then try to exit.
    PANIC! main() has been restarted due to a unscheduled exit!
    This is main. I will sleep a while, then try to exit.
    PANIC! main() has been restarted due to a unscheduled exit!
    This is main. I will sleep a while, then try to exit.
    PANIC! main() has been restarted due to a unscheduled exit!
    This is main. I will sleep a while, then try to exit.

> Just removing a convenient feature because of some risk that the feature could possibly be abused by an admin IMHO isn't the right thing to do.

From my reading of the article the author saw this issue in the wild (the assert) and then developed the 1000 touch approach to demonstrate the issue.

1) Exceptions are not a solution and are in fact no different than error codes in this case. If your init system throws an exception, what does that mean? It means your computer goes down.

The actual solution here is Maybe / Optionals (depending on your language of choice) as Haskell and Rust and Scala have. These would require the caller to unwrap the result, and thus make this bug actually impossible. The "returns -1" really is the equivalent of a null pointer exception which is an issue that is solved by Options not by Exceptions. Exceptions are definitely not a solution in any way.

2) I didn't see the "too complicated" between the lines bit. However, you're wrong about having to be root to trigger this. The watch, if you missed it, was on all of /etc. There are many folders in /etc which are not root owned. Furthermore, just because you have to be root to cause unexpected behavior means little in this case. Sure, this won't likely be exploited maliciously to take down a box and if that were the fear it wouldn't matter. That wasn't the concern though. The concern is this happening by accident. Even if a program is running as root, it shouldn't be able to bring down the box purely by accident.

3) A patch that prevents an init system from crashing at the expense of missing a file change (which was under /etc and quite possibly not even relevant to the init system) isn't that bad. Take it and add an issue to improve it (e.g. by directory walking as suggested). Seriously, your init system crashing because a few programs decide to write a few hundred default config files to /etc, is dumb.

For an easy example: using ansible-galaxy to fetch a role/playbook, by default (at least in the past) installed in /etc/ansible/roles/$ROLENAME. These roles can be on the order of several thousand files. The init system need not care about this at all.

The most important point here, though, is that Exceptions are absolutely not the solution here and, in fact, we've known that exceptions are fundamentally wrong for the last 40 years and yet people still learn them as part of Java and think they're good design.

> Being able to quickly change something in /etc/init and then have the system react to that is actually very convenient (and a must if you are pid 1 and don't want to force restarts on users).

I'm not sure I see how it's a must. Why not just have init respond to SIGHUP like so many daemons and reload its configuration then?

And that points to a fairly simple defensive practice to avoid this type of problem:

Run it in a different process. Have said process signal if a change is found. If they want to handle the case where /etc/init.d is potentially huge and you don't want to rescan everything, have it write changes via a pipe.

(In fact there's a program that will do that for you: inotifywait)

Agreed, it's pretty standard behaviour to signal daemons to cause them to reload config. SIGHUP'ing it is how you get sysvinit to reload its config too.

I see why and I also understand that fixing it isn't that easy. Software is complicated.

The flip side of this is that perhaps we should aim to reduce the complexity of software. In this case, it's not like libnih is providing any significant increase in functionality over just using inotify anyway.

1. Exceptions are basically 'goto' with a pretty wrapper. Over time, I've come to believe that the best place to handle an error is close to where it occurs, rather than throwing some hail mary message up the stack and hope someone else can deal with it.

2. Well, it certainly is true that more code and more functionality is likely to contain more bugs. Whether your pid 1 needs to be doing all the things that upstart does is obviously a contentious issue.

3. This is a toss-up. I usually like 'crash loudly' over 'fail silently', but I think rejecting the patch outright was a poor decision.

> Exceptions are basically 'goto' with a pretty wrapper.

Famously, "for loops" are basically 'goto' with a pretty wrapper.

> Famously, "for loops" are basically 'goto' with a pretty wrapper.

Trivial response: for loops don't potentially involve jumping into code in a completely different file, which could have been written by someone else.

More substantive response: for loops don't imply a specific philosophy about error handling.

That said, I do like exceptions, as I think they encourage a healthy ignorance in most code, as long as the code which can handle the exception does handle it. But that's just a "don't write bad code" admonition, and you can abuse any philosophy to write bad code.

1. It is a different mindset. Handling each error code until your program is correct top to bottom makes for a much more stable existence, but it's harder when you rely on something that hasn't handled an error code. With exceptions it is easier to work with shitty middleware. Also, in environments where exceptions are idiomatic, there are ways of cleaning up aborted code and safely entering exception handlers. Far more straightforward to reason about than `GOTO 34550`.

I agree with your other two points though.

Yeah, asserts are pretty evil, too.

I like Go's approach: no asserts, and only use exceptions for really exceptional stuff. (Go calls exceptions panic/recover, and tweaks the recover syntax in such a way that it's much less tempting to use it for normal error handling.)

From an outsiders perspective, with zero-knowledge, it seems like libnih should be propagating back the buffer-full error.

I agree that silently failing to reload config files in only certain circumstances could be just as bad or worse behavior than just crashing, but both behaviors seem really really bad in core system software, n'est pas?

Out of curiosity, why do you like no asserts? I use assertions extensively, both as a way to check my logic and to "document" invariants.

The Go FAQ defends this decision by talking about error handling, but using assertions for error handling has always been wrong.

When events are dropped, init should manually scan the config files for changes until the event queue catches up again.

So assert is an interesting combination of several things:

- A check that can be disabled at compile time - A check that blows up the program if it fails - A check that has a really convenient syntax

The "disabled at compile time" bit is pretty much insane (at least in a language with side-effects) -- you'll inevitably end up accidentally having some sort of side-effect in your asserts, and if you turn them off for your production build, you're basically inserting a whole brand-new untested configuration, running in a build that is particularly hard to debug. But probably most people just leave asserts "on" all the time. (Even if you don't have side-effects in your asserts, you're still enabling untested run-paths, which is going to give weird, hard to debug error reports.)

The second thing is blowing up the program. Sometimes, this really is all you can do, and every language has some way for this to happen -- take the head of an empty list, access beyond an array bounds, whatever.

The problem is combining it with the really convenient syntax -- it's just very tempting to assert conditions which you should actually be handling more robustly, because assert is the easiest thing to write.

I'm doing a mix of Go server-side dev and Obj-C client side dev at the moment, and I do use asserts in Obj-C land.

I try to only use them only for truly "programmer screwed up" conditions, which should all be caught during development. But we still get crashes now and then from real users from asserts that shouldn't have been asserts; we fall into temptation.

So I suspect that actually we might be better of in Obj-C land, too, avoiding asserts.

(I also like the invariant-style thinking and documentation quality of asserts, though.)

There are languages where one cannot have side-effects in assertions.

And there are languages where assertions throw a standard (i.e. catchable) exception.

Seems to me a better approach might be to catch the error, and then retry the operation until it succeeds, maybe crashing out if no process is made after enough time (the reader died).

This will cause the library to stall, which might be unacceptable to some people, but if simply noting the error and dropping it is unacceptable then you really don't have a choice.

TL;DR: He overflowed inotify queue of /etc/init which is the Upstart configuration directory being monitored. Upstart doesn't deal with the overflow, exits, and causes kernel panic.

The bug is not fixed because in order to trigger it you need root to spam file operations in /etc/init, which implies bigger problems elsewhere. If you have root and want to see panics, just echo c >/proc/sysrq-trigger.

Rachel is a she I think you will find. A better TLDR is that init can die due to valid response from the kernel. She found this in a production use case. Perhaps involving fast changing services.

> Upstart doesn't deal with the overflow,

Actually it's libnih that doesn't deal with the overflow. The difference is that any other application using that library will also abort on that condition. Let the new bug hunt begin!

Actually, it watches all of /etc/. Rachel wrote this:

"If you poke around in the source, you can find that it actually registers watches for the entire directory of its config file for various reasons, so it winds up following /etc (due to /etc/init.conf) and /etc/init (its "job dir")."

> Upstart doesn't deal with the overflow

Tecnically Upstart doesn't even see the overflow, libnih blows up without notifying its caller.

Why is this getting downvoted? This is an important aspect to the bug.

All software has bugs, and Upstart is no exception, so it should not be a great surprise or tragedy that a bug is found. If you could do this as a non-root user, that would be worthy of the "I just can't believe Upstart could have a bug!" type post. But this is a pretty standard this-should-never-happen-in-a-normal-system bug.

Sometimes when a bug has a very low likelihood of happening in a normal system, it gets ignored; in this case the bug was noticed, but the bug needed work, which was never done. I know rachel says it came up on a production system, but that does not mean it was operating under normal circumstances. Still, this bug should be fixed.

Not that I need to mention it, but I will because i'm a dick: This problem would not happen with sysvinit. One more reason for me to keep using Slackware...

Yeah, while the digging in "step 5" is interesting, I think the reach of this bug is fairly negligible: "root user can break the system".

EDIT: perhaps not so negligible after all. See justincormack's comment.

What's the point of watching for init config changes? Who has init configs that change so often that this is useful?

The good news is that Upstart is in the process of being phased out. The bad news is its replacement. : )

It's worth noting that the root user has any number of pathological use cases that can bring down the system. This is but one of them. Interesting, but not particularly dangerous or likely to be triggered in any normal circumstance.

Maybe next time reading the full post before commenting:

> I should point out that this is not theoretical. I went through all of the above because some real machines hit this for some reason. I don't have access to them, so I had to work backwards from just the message logged by init. Then I worked forwards with a successful reproduction case to get to this point. I have no idea what the original machines are doing to make this fire, but it's probably something bizarre like spamming /etc with whatever kinds of behavior will generate those inotify events libnih asked to see.

She doesn't know why it happened.

I would argue that it was almost certainly a pathological use case, as I suggested above. i.e. Something was already broken, and it triggered this crash; had it been left to its own devices, it probably would have triggered some other bad behavior eventually (disk full, OOM killer, etc., many possibilities). I don't know it, of course, since she doesn't have the details on what caused these inotify events on such a massive and rapid scale, but I'm having a hard time imagining why /etc/init would be receiving thousands of events, short of something already being broken badly.

Working in industrial embedded systems, I know this ignorance in software design just to meet deadlines.

Doing testing myself on such device on one side I follow the mantra "The art of testing is to make border cases possible and not to assume that they will not happen."

On the other side I also have to deal with safety related stuff, where there is the rule: "The safety of the system must be maintained under any circumstances including during system with failures." That it is important to maintain human safety, like a crane should work _always_ within its limits even when failed sensors provide misreadings.

That is the same here, even when a certain service is going wild, system integrity and function must be maintained. Ignoring this fact under the assumption the cause is something else is for me just general ignorance in providing quality work.

But can't crashing be a good way to maintain integrity in the face of abnormal behavior? A critical system shouldn't depend on a single Linux box never crashing, in my opinion as an ignorant. It's too complicated a kernel to depend on that.

That is right for safety related system. You always need a second path at least, because of single point of failure.

But, being said that, part of safety related development is, to cover any theoretically possible behavior. Because not doing it, leads to systematic failures which will decrease the overall system safety. Knowing this will prevent certification with according authorities, like FDA in medical equipment, LLoyds in ships, TÜV in off-road vehicles.

At the end, knowing that such bugs are just ignored with such blatant arguments fuels the image of bad software quality.

Trying to find the root cause is also important after a safety incident. Rachel managed to find a way to reproduce the problem, and though it's not exactly what occurred, it seems like she figured out a way of crashing the box.

Perhaps the repro seems pathological. But fix this issue, and you may well have fixed a whole bunch of other issues that are not so pathological. Certainly, just touching Files should never force the system to reboot!

I don't understand what point you're making. Someone's written a post about a fatal bug in a popular piece of software, and it brings down the OS. You said it's interesting. But because it's got to be a pathological case, even though it's actually happened to people in production, it's not important? (This response wouldn't bother me if it didn't feel to me like part of a broader culture of minimizing the impact of serious software bugs under an unfounded assumption that they're not that common -- as though that were enough to make them unimportant in the first place.)

I didn't say unimportant. I said it is interesting, but no reason to panic (well, maybe a reason for the kernel to panic, but you and me probably shouldn't).

The library to help deal with inotify events does not correctly handle the kernel inotify API spec. It is a poor library, and using something like that in init is a poor choice. init should be absolutely robust to reasonable kernel responses.

The post said it happened in real usage, and was then reproduced by this contrived example. Whatever the original use case was, may have still been reckless or absurd, but I think it's also worth noting that it was not originally found by intentionally throwing pathological usecases at the system.

It rather sounds like something that would happen fairly regularly when using something like puppet, or some other automated system configurator to push a bunch of config changes all at once. Especially if your configurator wasn't all that well written, such that it wrote all config files even the unchanged ones. In that case your system would go down with every config change once the number of config files got large enough to trigger the issue.

As far as I understand, you need to be root (or another privileged user) who has write access to /etc/init. Conclusion: You can bring down a machine with superuser privileges. Breaking.


1. Apparently it watches all of /etc, not just /etc/init,

2. Which means that writing large-enough config changes in some part of /etc you thought was completely unrelated to init can unexpectedly take down your box.

That's not good, and not something that can be dismissed as "of course superusers can turn the box off".

Boy oh boy the responses to this are frustrating. (And we wonder why everything is broken and shitty.)

Yeah, this reminds me of a similar article from the same source (IIRC) about "issues" with disabling the oom-killer and replacing it with your own. Interesting, probably something that needs to be addressed, but not necessarily a game changer as the article would have you believe.

What really caught my eye is that this affected a newer init system, specifically because it was more dynamic (using inotify), which has been a goal of many init replacements. I'm curious if this affects other init systems, specifically old init or systemd. Or if you could find similar attacks against other init systems.

You can set this up as an unprivileged user in any directory you have access to. Just use inotifywait to configure your own inotify triggers first.

Applications are open for YC Winter 2020

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact