
Kill init by touching a bunch of files - omnibrain
http://rachelbythebay.com/w/2014/11/24/touch/
======
angry_octet
This is a great example of how bad many open source projects are at accepting
contributions from 'non core' developers. The patch is just rejected, when it
actually looks pretty valid to handle all cases of return value from a kernel
interface. While it might not be a perfect solution, accepting it with
suggestions for additional improvements could have led to those improvements.

~~~
masklinn
> The patch is just rejected

 _Technically_ the patch isn't rejected (and I'm kinda peeved TFA claims it
is). It's in limbo, waiting for further action from the submitter or an other
contributor: it's marked as needing improvements since it hides the issue
under the rug instead of reporting it to the caller/user.

This is a rejected patch:
[https://code.launchpad.net/~jamesodhunt/libnih/bug-776532/+m...](https://code.launchpad.net/~jamesodhunt/libnih/bug-776532/+merge/140150),
it has a "Status: Rejected" set (and a disapproving review).

Although of course the patch could have been merged and improved later, so
that libnih wouldn't blow up the whole system in case of inotify overflow.

~~~
marcosdumay
The OS disobeying your configuration files is really better than it
restarting?

Looks to me that Upstart kept the lesser of two evils. I'd reject that patch
too.

~~~
masklinn
> The OS disobeying your configuration files is really better than it
> restarting?

It does not restart, it locks up with a mostly useless error message, then,
maybe, at one point, possibly restarts. The restart is not a policy decision
it's a side effect of the box being dead. Here's the better option: bubble up
the issue to the caller and let it decide what to do.

> Looks to me that Upstart kept the lesser of two evils.

Upstart didn't do anything.

> I'd reject that patch too.

The patch isn't rejected.

~~~
mynameisvlad
> The patch isn't rejected.

Just because the reject status wasn't used does not mean it wasn't rejected.
The patch, as it was written, was rejected. It won't be used and the
recommendation was a complete rewrite in a completely different direction.

------
rikkus
.NET's FileSystemWatcher documentation says that there's an internal buffer of
a finite size, and that if (when) it fills up, you're going to be told about
it and must do a complete traversal of the directory you're watching. No-one
has invented a better way to deal with this, so that's what you need to do.

Many developers ignore this, so it's not really surprising that this has
happened with inotify too. It's mentioned that a patch wasn't accepted, but it
was with good reason - it doesn't fix the problem (by traversing the
directory).

~~~
MaulingMonkey
I'd generally take a config file failing to update (this may happen anyways
from the looks of it, e.g. if "/proc/sys/fs/inotify/max_queued_events" isn't
available) over the box dying with a terrible crash report.

I'd much prefer to accept the patch with a followup request, tracking issue,
or TODO. There wass good reason to not consider the issue resolved, but I'm
not of the opinion there was good reason to reject the patch either.

~~~
Morgawr
>I'd generally take a config file failing to update (this may happen anyways
from the looks of it, e.g. if "/proc/sys/fs/inotify/max_queued_events" isn't
available) over the box dying with a terrible crash report.

I'm not entirely sure I'd agree with that line of reasoning, to be honest.
While the crash is a big problem and falls into the category of "obscure
enough" problems that might leave your sysadmins scratching their heads for
quite some time (especially because of the contrived and terrible crash report
message), I would much prefer my system to crash and halt instead of failing
to recognize/load a config file. Not loading config files properly (and
failing silently, at that) means that you might be exposed to security
attacks, your system might be in an inconsistent state until the next reboot
and you might not even know it. I'll take a failure over prolonged
inconsistent state of a machine any day. That's what redundancy solves after
all.

~~~
sergiosgc
Failing silently should no be an option. This is a case where the best is
enemy of the good. The patch is good, albeit not optimal. Hanging on to a bad
solution, in presence of a good solution, because you can imagine an optimal
solution was a bad call.

The patch should log a warning/error and avoid the crash. It does not preempt
an optimal solution, by traversing the file tree, being developed afterwards.

~~~
x0x0
loud suicide rather than quiet misbehavior (nobody reads logs until someone
notices something wrong) is often preferred, at least for distributed systems.

------
twic
Perhaps it would be a good move to extract the file-watching behaviour from
the part of Upstart that runs as PID 1, and put it in a separate daemon that
can notify Upstart when it's time to reload a configuration file. That way,
only the daemon would crash in this situation, and Upstart could restart it as
it would any other daemon.

Although then you have the problem of the notification channel between the
daemon and Upstart overflowing, i suppose.

------
Havvy
"libnih" \- I have no clue what this does, so I immediately read it as 'lib
not invented here'. Otherwise, yay...more unstable software.

~~~
masklinn
> so I immediately read it as 'lib not invented here'.

> libnih is a small library for C application development containing functions
> that, despite its name, are not implemented elsewhere in the standard
> library set.

~~~
unwind
It's a pretty fun name. :)

Here's the core code, for the curious:
[http://bazaar.launchpad.net/~scott/libnih/trunk/files/head:/...](http://bazaar.launchpad.net/~scott/libnih/trunk/files/head:/nih/).

I found it interesting that it uses a space after the unary ! operator in C,
which (to me), provided yet another way of writing various (very common)
tests. For instance "if(! fp)" after trying to open a file.

------
shaurz
It's pretty dumb that pid 1 has any assert()s in it at all. Or libraries for
that matter.

~~~
erpellan
You're going to _love_ systemd. I think someone found a kitchen sink in there
once.

~~~
SwellJoe
I often need to wash up after particularly messy system administration tasks.
That sink will come in handy.

I'd be curious to know if systemd is tickled by this one weird trick to crash
your server. It does work with inotify (and other fun newish kernel techs), so
it would be subject to the inotify queue length...but what'll happen when it
hits it?

~~~
andrewaylett
TTBOMK, systemd _doesn 't_ watch unit files for changes. If you change a unit,
you need to run `systemctl daemon-reload` to make it notice. It does, however,
try to work out if it's in a state where you are running `systemctl` without
reloading, and warns that you may need to reload.

------
nodata
Writing to /etc/init.d requires root access. If you are root, you can bring
down the box as it is.

~~~
ge0rg
The problem here was that the box went down by accident due to this bug.

~~~
nodata
Really? How?

~~~
masklinn
Because the bug kills pid1 which took down production boxes:

> I should point out that this is not theoretical. I went through all of the
> above because some real machines hit this for some reason. I don't have
> access to them, so I had to work backwards from just the message logged by
> init.

> […]

> Let me say this again: this happens in the wild. The "touch * * *..." repro
> looks contrived, because, well, that's the whole point of a repro.

~~~
nodata
No, how did you accidentally write to /etc/init.d?

~~~
masklinn
> No, how did you accidentally write to /etc/init.d?

Let me quote again, with a finer slicing:

> I went through all of the above because some real machines hit this for some
> reason. I don't have access to them

~~~
nodata
So you don't know? Gotcha.

~~~
raverbashing
And you apparently never managed anything in production.

Have you ever heard of "reproducing a problem"?

~~~
nodata
I've never seen anything write to /etc/init.d in production (or anywhere
actually), which is why I asked for the root cause.

~~~
gpvos
Without, all the time, bothering to read the article, and asking _here_
something that no-one here could know, because it's not in the article. If you
_really_ wanted to know, you could have mailed the author.

Boy, do you deserve these downvotes.

~~~
cdelsolar
How do I get downvote access? I'm jelly.

~~~
csours
Get a high enough score. Also don't use words like jelly. HN doesn't like
reddit's tendency to use funny memes etc.

~~~
gohrt
"jelly" is a word, not a funny meme. (Also, it's Hacker News, not Not-Reddit
News.)

~~~
cdelsolar
Thank you.

------
pilif
There are so many things in here that tempt me to comment about, so here goes:

1) For me, this is a prime example of why I personally like programming
environments with exceptions. If libnih could throw an exception (I know it
can't), then they could do that which would allow the caller to at least deal
with the exception and not bring the system down. If they don't handle the
exception, well, we're were we are today, but as it stands now, fixing this
will require somebody to actually patch libnih.

Yes. libnih could also handle that error by returning an error code itself,
but the library developers clearly didn't want to bother with that in other
callers of the affected function.

By using exceptions, for the same amount of work it took to add the assertion
they could also have at least provided the option for the machine to not go
down.

Also, I do understand the reservations against exceptions, but stuff like this
is what makes me personally prefer having exceptions over not having them.

2) I read some passive aggressive "this is what happens if your init system is
too complicated" assertions between the lines.

Being able to quickly change something in /etc/init and then have the system
react to that is actually very convenient (and a must if you are pid 1 and
don't want to force restarts on users).

Yes, the system is not prepared to handle 10Ks of init scripts changing, but
if you're root (which you have to be to trigger this), there are _way_ more
convenient (and quicker!) ways to bring down a machine (shutdown -h being one
of them).

Just removing a convenient feature because of some risk that the feature could
possibly be abused by an admin IMHO isn't the right thing to do.

3) I agree with not accepting the patch. You don't (ever! ever!) fix a problem
by ignoring it somewhere down the stack. You also don't call exit() or an
equivalent in a library either of course :-).

The correct fix would be to remove the assertion, to return an error code and
to fix all call sites (good luck with the public API that you've just now
changed).

Or to throw an exception which brings us back to point 1.

I'm not complaining, btw: Stuff has happened (great analysis of the issue,
btw. Much appreciated as it allowed me to completely understand the issue and
be able to write this comment without having to do the analysis myself). I see
why and I also understand that fixing it isn't that easy. Software is
complicated.

The one thing that I'm heavily disagreeing though is above point 2). Being
able to just edit a file is way more convenient than also having to restart
some daemon (especially if that has PID 1). The only fix from upstarts
perspective would be to forego the usage of libnih (where the bug lives), but
that would mean a lot of additional maintenance work in order to protect
against a totally theoretical issue as this bug requires root rights to use.

~~~
michaelmior
> Being able to quickly change something in /etc/init and then have the system
> react to that is actually very convenient (and a must if you are pid 1 and
> don't want to force restarts on users).

I'm not sure I see how it's a must. Why not just have init respond to SIGHUP
like so many daemons and reload its configuration then?

~~~
vidarh
And that points to a fairly simple defensive practice to avoid this type of
problem:

Run it in a different process. Have said process signal if a change is found.
If they want to handle the case where /etc/init.d is potentially huge and you
don't want to rescan everything, have it write changes via a pipe.

(In fact there's a program that will do that for you: inotifywait)

------
xfs
TL;DR: He overflowed inotify queue of /etc/init which is the Upstart
configuration directory being monitored. Upstart doesn't deal with the
overflow, exits, and causes kernel panic.

The bug is not fixed because in order to trigger it you need root to spam file
operations in /etc/init, which implies bigger problems elsewhere. If you have
root and want to see panics, just echo c >/proc/sysrq-trigger.

~~~
justincormack
Rachel is a she I think you will find. A better TLDR is that init can die due
to valid response from the kernel. She found this in a production use case.
Perhaps involving fast changing services.

------
uint32
What's the point of watching for init config changes? Who has init configs
that change so often that this is useful?

------
weissadam
The good news is that Upstart is in the process of being phased out. The bad
news is its replacement. : )

------
SwellJoe
It's worth noting that the root user has any number of pathological use cases
that can bring down the system. This is but one of them. Interesting, but not
particularly dangerous or likely to be triggered in any normal circumstance.

~~~
PinguTS
Maybe next time reading the full post before commenting:

> I should point out that this is not theoretical. I went through all of the
> above because some real machines hit this for some reason. I don't have
> access to them, so I had to work backwards from just the message logged by
> init. Then I worked forwards with a successful reproduction case to get to
> this point. I have no idea what the original machines are doing to make this
> fire, but it's probably something bizarre like spamming /etc with whatever
> kinds of behavior will generate those inotify events libnih asked to see.

~~~
SwellJoe
She doesn't know why it happened.

I would argue that it was almost certainly a pathological use case, as I
suggested above. i.e. Something was already broken, and it triggered this
crash; had it been left to its own devices, it probably would have triggered
some other bad behavior eventually (disk full, OOM killer, etc., many
possibilities). I don't know it, of course, since she doesn't have the details
on what caused these inotify events on such a massive and rapid scale, but I'm
having a hard time imagining why /etc/init would be receiving thousands of
events, short of something already being broken badly.

~~~
PinguTS
Working in industrial embedded systems, I know this ignorance in software
design just to meet deadlines.

Doing testing myself on such device on one side I follow the mantra "The art
of testing is to make border cases possible and not to assume that they will
not happen."

On the other side I also have to deal with safety related stuff, where there
is the rule: "The safety of the system must be maintained under any
circumstances including during system with failures." That it is important to
maintain human safety, like a crane should work _always_ within its limits
even when failed sensors provide misreadings.

That is the same here, even when a certain service is going wild, system
integrity and function must be maintained. Ignoring this fact under the
assumption the cause is something else is for me just general ignorance in
providing quality work.

~~~
icebraining
But can't crashing be a good way to maintain integrity in the face of abnormal
behavior? A critical system shouldn't depend on a single Linux box never
crashing, in my opinion as an ignorant. It's too complicated a kernel to
depend on that.

~~~
PinguTS
That is right for safety related system. You always need a second path at
least, because of single point of failure.

But, being said that, part of safety related development is, to cover any
theoretically possible behavior. Because not doing it, leads to systematic
failures which will decrease the overall system safety. Knowing this will
prevent certification with according authorities, like FDA in medical
equipment, LLoyds in ships, TÜV in off-road vehicles.

At the end, knowing that such bugs are just ignored with such blatant
arguments fuels the image of bad software quality.

~~~
chris_wot
Trying to find the root cause is also important after a safety incident.
Rachel managed to find a way to reproduce the problem, and though it's not
exactly what occurred, it seems like she figured out a way of crashing the
box.

Perhaps the repro seems pathological. But fix this issue, and you may well
have fixed a whole bunch of other issues that are not so pathological.
Certainly, just touching Files should never force the system to reboot!

------
emeidi
As far as I understand, you need to be root (or another privileged user) who
has write access to /etc/init. Conclusion: You can bring down a machine with
superuser privileges. Breaking.

~~~
ubernostrum
Except:

1\. Apparently it watches _all of /etc_, not just /etc/init,

2\. Which means that writing large-enough config changes in some part of /etc
you thought was completely unrelated to init can unexpectedly take down your
box.

That's not good, and not something that can be dismissed as "of course
superusers can turn the box off".

