
Maybe next time read the full post before commenting:

> I should point out that this is not theoretical. I went through all of the above because some real machines hit this for some reason. I don't have access to them, so I had to work backwards from just the message logged by init. Then I worked forwards with a successful reproduction case to get to this point. I have no idea what the original machines are doing to make this fire, but it's probably something bizarre like spamming /etc with whatever kinds of behavior will generate those inotify events libnih asked to see.

She doesn't know why it happened.

I would argue that it was almost certainly a pathological use case, as I suggested above, i.e., something was already broken, and it triggered this crash; had it been left to its own devices, it probably would have triggered some other bad behavior eventually (disk full, OOM killer, etc., many possibilities). I can't know that for sure, of course, since she doesn't have the details on what caused these inotify events on such a massive and rapid scale, but I'm having a hard time imagining why /etc/init would be receiving thousands of events, short of something already being badly broken.

Working in industrial embedded systems, I know this kind of ignorance in software design, accepted just to meet deadlines.

On the one hand, when I test such devices myself, I follow the mantra: "The art of testing is to make border cases possible and not to assume that they will not happen."

On the other hand, I also have to deal with safety-related work, where the rule is: "The safety of the system must be maintained under any circumstances, including when the system has failures." This is essential for human safety: a crane, for example, should _always_ work within its limits, even when failed sensors provide misreadings.

The same applies here: even when a certain service goes wild, system integrity and function must be maintained. Ignoring this on the assumption that the real cause lies elsewhere is, to me, just general ignorance about providing quality work.

But can't crashing be a good way to maintain integrity in the face of abnormal behavior? In my admittedly uninformed opinion, a critical system shouldn't depend on a single Linux box never crashing. The kernel is too complicated to depend on that.

That is right for safety-related systems. You always need at least a second path, to avoid a single point of failure.

That said, part of safety-related development is covering any theoretically possible behavior. Not doing so leads to systematic failures, which decrease overall system safety. Known gaps of this kind will prevent certification with the relevant authorities, such as the FDA for medical equipment, Lloyd's for ships, or TÜV for off-road vehicles.

In the end, knowing that such bugs are simply dismissed with such blatant arguments fuels the image of bad software quality.

Trying to find the root cause is also important after a safety incident. Rachel managed to find a way to reproduce the problem, and though it's not exactly what occurred, it seems like she figured out a way of crashing the box.

Perhaps the repro seems pathological. But fix this issue, and you may well have fixed a whole bunch of other issues that are not so pathological. Certainly, just touching files should never force the system to reboot!

I don't understand what point you're making. Someone's written a post about a fatal bug in a popular piece of software, and it brings down the OS. You said it's interesting. But because it's got to be a pathological case, even though it's actually happened to people in production, it's not important? (This response wouldn't bother me if it didn't feel to me like part of a broader culture of minimizing the impact of serious software bugs under an unfounded assumption that they're not that common -- as though that were enough to make them unimportant in the first place.)

I didn't say unimportant. I said it is interesting, but no reason to panic (well, maybe a reason for the kernel to panic, but you and I probably shouldn't).

The library to help deal with inotify events does not correctly handle the kernel inotify API spec. It is a poor library, and using something like that in init is a poor choice. init should be absolutely robust to reasonable kernel responses.
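To make "robust to reasonable kernel responses" concrete, here is a rough sketch, not the library's actual code. It assumes the troublesome response is the queue-overflow event that inotify(7) documents: IN_Q_OVERFLOW, delivered with wd set to -1 when the kernel drops events. The thread does not say this is what the library mishandles. The names `parse_events`, `handle_events`, and the `watches` dict are hypothetical, and the buffer is parsed by hand from the documented `struct inotify_event` layout so the example runs without a real inotify descriptor.

```python
import struct

# Layout of struct inotify_event per inotify(7):
# int wd; uint32_t mask, cookie, len; followed by `len` bytes of name.
EVENT_HEADER = struct.Struct("iIII")
IN_CREATE = 0x00000100
IN_Q_OVERFLOW = 0x00004000  # kernel dropped events; wd is -1 for this event


def parse_events(buf):
    """Yield (wd, mask, cookie, name) for each event in a raw read buffer."""
    offset = 0
    while offset + EVENT_HEADER.size <= len(buf):
        wd, mask, cookie, name_len = EVENT_HEADER.unpack_from(buf, offset)
        offset += EVENT_HEADER.size
        raw_name = buf[offset:offset + name_len]
        offset += name_len
        # The name is NUL-padded to an alignment boundary; strip the padding.
        yield wd, mask, cookie, raw_name.rstrip(b"\0").decode(errors="replace")


def handle_events(buf, watches):
    """Hypothetical dispatcher that never assumes wd maps to a known watch."""
    actions = []
    for wd, mask, _cookie, name in parse_events(buf):
        if mask & IN_Q_OVERFLOW:
            # Events were lost, so fall back to rescanning instead of asserting.
            actions.append(("rescan_all", None))
            continue
        path = watches.get(wd)
        if path is None:
            # Unknown descriptor (e.g. a watch already removed): skip it.
            continue
        actions.append(("changed", f"{path}/{name}" if name else path))
    return actions
```

The point of the design is that an unexpected watch descriptor becomes a recoverable condition (rescan or skip) rather than a fatal assertion, which matters a great deal when the process is init.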
