
The Day “/Proc” Died - rkeene2
https://dev.to/rkeene/the-day-proc-died-53ic
======
verisimilitudes
Some things never change. Is there a single UNIX variant that doesn't fail in
some way when it can no longer write to disk? It's unacceptable for most
programs to fail if they can no longer write to disk; Firefox also fails, but
of course gives no indication as to what's happening.

Now, of course, no UNIX variant I'm aware of has a real notion of system
programs, and so there are no programs that get special privileges that would
protect them from this. It would be preferable for the system to die rather
than permit this debugging nonsense, considering the problem would perhaps
actually be fixed if the machine died in this case.

This is an asinine failure case mired in 1970s malpractice. Don't you agree
this is damning and unacceptable? No one has any right to be proud of this
mess. There are millions of lines of code, and yet basic failure cases aren't
truly accounted for, or are handled in the most asinine of ways, such as with
the “Out of Memory Killer”.

~~~
theamk
Did you know that Linux has this special notion of “root reserved space”
specifically for situations like this? In the same way, there is a per-process
“oom_adj” which can be used to control OOM killer priority, to spare the
system processes.

This problem has been solved, multiple times. Of course, many distributions
fail to mark the appropriate processes as system processes, so they fail
anyway, but this is just a bug, not a glaring design omission.
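A minimal sketch of the latter knob (sshd is just an example target;
oom_score_adj is the modern interface that supersedes oom_adj):

    # exempt a critical daemon from the OOM killer (-1000 disables it entirely)
    echo -1000 > /proc/$(pidof sshd)/oom_score_adj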

~~~
Avamander
The OOM killer has _never_, in the twelve years I've used Linux, triggered
before my system grinds to a halt and never recovers. This problem has _not_
been solved.

~~~
eikenberry
That is probably caused by your use of swap space, not an OOM issue. I've
had multiple cases of the OOM killer kicking off on my system, all without it
slowing way down.

~~~
rcxdude
Lacking swap space causes more severe symptoms in an OOM situation, not less,
in my experience. I think this is because everything that can be evicted from
RAM is evicted before the OOM killer gets invoked, which means every disk
access slows to a crawl.

------
aidenn0
Since ext4 became the default, my most common cause of bizarre behaviors has
been running out of inodes.

    
    
       df -i
    

will show this, but a bare df command will not.

Getting off topic now, but does anybody know if the ext4 utilities changed the
calculation of inodes when formatting compared to ext3/ext2 utilities? Running
out of inodes on those filesystems was fairly unusual, but I've seen it happen
a dozen times in the past 5 years on ext4.

~~~
simonjgreen
Across the approximately 2,000 VMs we look after as an MSP, we see this at
least once a week. It's almost always badly cleaned-up session files from a
PHP app or similar.

~~~
Symbiote
The default ratio of inodes is configured in /etc/mke2fs.conf. You can
either change that, or override it with the -i argument when you create the
filesystem.

My desktop's / filesystem is around 238,000,000,000 bytes, and with the
default ratio of one inode per 16384 bytes, I have around 14,500,000 inodes.

Note that with a default blocksize of 4096, you're limited to 4× as many
inodes as you have at present, so if you're seeing this weekly I recommend
monitoring the number of remaining inodes (df -i), or changing the app to
store sessions in a database.
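For reference, a sketch of both knobs (the device name is hypothetical):

    # format with one inode per 4096 bytes instead of the default 16384
    mkfs.ext4 -i 4096 /dev/sdb1
    # keep an eye on the remaining inodes of an existing filesystem
    df -i /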

~~~
simonjgreen
We're talking about preexisting systems. Inode counts are already tweaked up
as high as they'll go during the build, and we monitor, much as you suggest,
via a checkmk script. If the apps running were ours, we could change them to
clean up better or store sessions through other means; alas, this is just
something you have to handle when doing break/fix response for hundreds of
clients making their own decisions.

------
free652
A rookie mistake a lot of admins make is never creating a separate partition
for /var/log. I've lost count of how many times servers went into a weird
mode when the root filesystem filled up to 100%.

~~~
est
I always rm the directories inside /var/log and ln them back from another
disk, because I am too lazy to change the default configs sparsely located
across /etc for the many installed programs.
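Something like this, that is (the paths are hypothetical):

    # relocate a log directory to a bigger disk and symlink it back
    mv /var/log/nginx /bigdisk/log/nginx
    ln -s /bigdisk/log/nginx /var/log/nginx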

~~~
Galanwe
Most of the programs should by default use syslog, so you should not really
have to configure much of them beyond enabling syslog output. Then you just
have to configure your syslog implementation to write wherever you desire,
rotate the files, forward the messages to other machines for storage, and so
on.
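With rsyslog, for instance, that's only a couple of lines (the paths and host
are placeholders):

    # /etc/rsyslog.d/50-custom.conf
    # send everything to a file on a bigger disk
    *.*  /data/log/syslog
    # and forward a copy over TCP to a central host
    *.*  @@loghost.example.com:514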

~~~
majewsky
> Most of the programs should by default use syslog

Is that still true today? Docker wants you to log to stdout, so that's what
most newer applications do. systemd also wants you to log to stdout, and will
redirect stdout to journald/syslog automatically. In fact, an application that
only logs to syslog can turn into a minor headache when you want to dockerize
it. Which is why stuff like
[https://github.com/sapcc/syslog-stdout](https://github.com/sapcc/syslog-stdout)
exists.

I'm having a related problem with another project that I'm working on where I
wrap an OpenLDAP server. It would be much easier to properly wrap it if
OpenLDAP would just log to stdout instead of bypassing me and going for the
syslog. Maybe at some point I'll set up a separate mount namespace for it and
pass a /dev/log shim into it. But this shows that the log-to-stdout pattern
is much more Unix-y, because it composes better.

~~~
defanor
> Is that still true today?

I think it is.

> Docker wants you to log to stdout, so that's what most newer applications
> do.

According to the Debian popularity contest [1], which matches my observations,
Docker itself isn't a particularly common package to find on a system.

> systemd also wants you to log to stdout

While that's an option, there is sd-journal(3), which allows proper logging
with priorities and custom fields.

[1] [https://popcon.debian.org/by_inst](https://popcon.debian.org/by_inst)

Edit: Perhaps worth mentioning that even with systemd and logging to stdout,
syslogd (and maybe journald) configuration should be sufficient to sort out
the log files, as mentioned in the grandparent comment.
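Both routes are easy to try from a shell (the tags and messages are
illustrative):

    # write to syslog with an explicit priority
    logger -p daemon.err "disk nearly full"
    # run a command with its stdout captured into the journal
    systemd-cat -t demo -p info echo "hello journal"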

------
bpchaps
Interesting. I'm legitimately surprised the author put so much work into this
research (happy they did!). By far one of the biggest culprits for "really odd
behavior" is a full disk, or a disk in some failing/failed state - to the
point where, when troubleshooting, `df -k` is one of the first commands I'll
run.

Does this company not have disk monitoring?

~~~
segmondy
Did you not see that the date was 2011? I think this is a lessons-learned sort
of article. When things go wonky, look for simple reasons. At least he didn't
think it was a hardware bug.

~~~
bpchaps
Nope, I didn't actually. Which is funny, and makes a bit more sense, since I
haven't seen a full disk cause bad locking like that since around the date of
the article. Still, this never should have been an issue, since disk space
monitoring should have prevented the problem from ever happening in the first
place.

------
wahern
Wow, that's almost exactly the same problem I encountered in Linux last week,
where simply reading /proc/$pid/cmdline would block the process attempting to
read the process info. It appears to have been related to this issue:

    
    
      https://lkml.org/lkml/2018/2/20/576
    

And much like the Solaris issue, as best I could tell, the original processes
(this occurred multiple times on two different nodes) seemed to be blocked
either in the filesystem or memory management layers, flushing pages.

~~~
majewsky
Please don't put things that are not code in codeblocks. In this case, it
makes the link unclickable for no good reason. Working link:
[https://lkml.org/lkml/2018/2/20/576](https://lkml.org/lkml/2018/2/20/576)

------
drudru11
Wow - a cool example of live kernel debugging. Impressive that the kernel was
still usable while this issue occurred.

Also lucky that someone was logged into the zone without a /proc dependency.
Usually people have complex shell prompts that might require /proc lookups.

It is concerning, though, that a less privileged zone could affect the entire
system.

------
Iv
Slightly tangential, but I find it a bit hard to accept that systems do not
have a more graceful failure mode when their disks are full.

I keep a few GB free on my / but when I inadvertently fill it, it becomes
almost impossible to use. Would it be so hard to keep the last few MB as
reserved space for debugging purposes, and refuse any space allocation that is
not devoted to an 'ls' or a 'baobab' process?

~~~
theamk
They do:
[https://odzangba.wordpress.com/2010/02/20/how-to-free-reserved-space-on-ext4-partitions/](https://odzangba.wordpress.com/2010/02/20/how-to-free-reserved-space-on-ext4-partitions/)
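The relevant knob, for the curious (the device name is hypothetical):

    # show the current reserved block count
    tune2fs -l /dev/sda1 | grep -i 'reserved block'
    # shrink the root-reserved space to 1% of the filesystem
    tune2fs -m 1 /dev/sda1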

~~~
Iv
You can reboot as root in a terminal, but really, there is no good reason to
fall back to this mode instead of more gracefully refusing, from userland, to
use up more space.

------
spydum
I've never used mdb on Solaris, but it's where I first learned of strace (and
ptrace), which unlocked soooo much of how Unix worked for me and led me down
the path to Unix wizardry (well, that and access to open source code bases).
In fact, when I started reading this, that's where I thought it was headed; I
think it could have come to the conclusion quicker than by trying to read mdb.

------
roryrjb
I've never heard of mdb before. I mean, I've never actually used Solaris
(OpenSolaris/illumos, to be specific) for anything other than a few
experiments here and there, nothing in production. But I've been intrigued by
it, mostly due to watching talks by Bryan Cantrill. It seems that at least
Joyent bets heavily on it, mostly due to ZFS. There's also DTrace in this
area; again, I haven't used it, but I've been itching to use bpftrace on
Linux as an equivalent. Anyway, what I really came to ask is: who uses Solaris
or illumos? Does it have a future? How relevant is it?

~~~
aidenn0
I've been using illumos on and off since it was called "OpenSolaris", and
illumos is rapidly becoming less relevant just due to the tiny ecosystem.
dtrace and mdb are both great tools, but it's getting to the point where ZFS
on Linux (which possibly can't even be shipped as a binary without violating
the Linux EULA) is seeing more usage than all other downstream consumers (and
ZFS on FreeBSD has probably had wider usage than illumos for a long while
now).

Ultimately, being a libre *nix that is better in a few ways than Linux seems
to be a long-term losing proposition, as Linux will eventually check all the
boxes you need (even if it's not quite as nice), thus steadily shrinking your
niche.

~~~
krageon
> the Linux EULA

What are you talking about? This isn't a thing that exists as far as I'm
aware.

~~~
aidenn0
[https://www.kernel.org/doc/html/v4.18/process/license-rules.html](https://www.kernel.org/doc/html/v4.18/process/license-rules.html)

------
Sarki
And that's why you must monitor your servers' health. You know, for when they
start to act funny.

I'm not affiliated with them in any way, but as a personal favorite (I even
use it at home) I'd recommend Zabbix: it's open source and quite
straightforward to install and to deploy its agents, and once configured you
can even forget about it.

Gosh, its default alerts will give you hints on things you never considered
checking before, while integration of new/bespoke software can be done in a
matter of minutes.

------
DarkStar851
Heh. I've yet to witness a system that survived running out of disk space.
Lucky the whole thing didn't lock up; I had a remote mail server die like
that, and it required an on-site intervention.

------
tetha
Hm. I am impressed by the casual kernel debugging; I couldn't do that. But
why would I have to, if running check_disk on all important file systems
tells me when a file system is >80% or >90% full?

There'll be bugs I cannot debug, because I cannot casually debug my Linux
kernel. Yes. I have yet to encounter them personally, but I'll get there.
This one just seems weird, though, because a full disk is a very, very basic
problem to monitor for.
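For reference, a typical check_disk invocation looks something like this (the
thresholds are illustrative):

    # warn below 20% free, go critical below 10% free, on the root filesystem
    check_disk -w 20% -c 10% -p /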

~~~
icedchai
Nobody thought to run "df"? Disk being full is a common source of all sorts of
weird server problems.

~~~
Alupis
> Nobody thought to run "df"? Disk being full is a common source of all sorts
> of weird server problems.

Yep, to the point where `df -h` has become one of the first things I run when
a server starts acting funny or things stop working.

Disk being full is far too common - runaway logging, weird temp files, etc. -
or sometimes just a box that nobody has maintained for years and years.

The fun part starts after you've identified which partition or drive is full -
now you have to identify the problem files!
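A couple of stock commands help with that hunt (the mount point is
illustrative):

    # biggest directories on the full filesystem (-x stays on one device)
    du -xh / | sort -rh | head -20
    # or hunt for individual large files
    find / -xdev -type f -size +100M -exec ls -lh {} +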

~~~
mobilemidget
First things you run on login?

It ought to be monitored with alerting by default, and inode use too.

~~~
icedchai
True. Unfortunately good monitoring is an afterthought in most organizations.

------
Roboprog
No production monitoring alert went “blip” when the disk(s) started filling
past a certain percentage?

------
geofft
I've encountered a ton of these at my previous and current job, in Linux.

The primary problem is that forms of ps output that read the full command line
(from /proc/$pid/cmdline) require reading memory from the process. This
requires, at least, a read-lock on the process's memory map semaphore
(mmap_sem), and lots of other things like to access mmap_sem, including other
memory allocation (write lock), a page fault (read lock, so you can figure out
what to fault in), etc. In particular, if the process is in the middle of
mapping or faulting a mapped page from a slow filesystem - such as NFS or a
network-backed block device provided by a hypervisor - then it can sit around
with mmap_sem for arbitrarily long.

Usually the process taking a read lock on its own mmap_sem, or someone else
taking a read lock, is harmless, since it's a reader-writer lock and there can
be multiple readers. But as soon as a writer declares an intent to take a
write lock, further readers are blocked to avoid writer starvation, which
means a single slow reader will prevent all further readers. See
[http://blog.nelhage.com/post/rwlock-
contention/](http://blog.nelhage.com/post/rwlock-contention/) for some
excitement there.

You can generally read /proc/$pid/comm (short command line) and
/proc/$pid/status, which both just reference info in the kernel's task_struct,
and don't require taking a lock on the userspace memory map. You can also read
/proc/$pid/syscall, which will tell you what syscall it's in and the numeric
arguments, and usually you can read /proc/$pid/stack, which tells you the
kernel stack of the process. (Though I have recently found that that one
_also_ takes a lock, but fortunately one that's much less frequently
contended.) If you're trying to make sense of why a system is stuck, and ps
aux is unresponsive, my go-to is grep 'disk sleep' /proc/*/status, followed
by reading the corresponding /proc/$pid/stack. If you're lucky, you'll see
which module is slow (filesystem / block I/O? networked filesystem? FUSE?
etc.) and can try to address that. Or perhaps you'll see several processes
trying to get a lock on something and one that looks like it's holding a lock
and stuck doing work; if you can address (perhaps kill) that process, the
system might make progress.
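Spelled out as a quick sketch (the PID is whatever the grep turns up):

    # list processes stuck in uninterruptible (disk) sleep
    grep -l 'disk sleep' /proc/*/status
    # then read the kernel stack of a stuck PID
    cat /proc/12345/stack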

lsof likes to read /proc/$pid/maps, the list of mapped files, which of course
requires an mmap_sem read lock. It does this so that it can list files that
are mapped but no longer have a file descriptor (e.g., shared libraries get
opened, mmaped, and closed). If you know that you're only interested in files
with file descriptors - e.g., you're looking for a socket, or something - you
can do this with less contention by looking at /proc/$pid/fd/, which is a
directory of magical nodes that show up as symlinks to open files. (They're
not really symlinks; for instance, they'll work even if the actual file is
deleted. But you can ls -l them as if they were symlinks, so
ls -l /proc/*/fd/* | grep is a pretty decent alternative to lsof.)


------
unixhero
Sounds like an excellent [strikeout: Linux] kernel bug report.

~~~
earenndil
It's Solaris, not Linux.

~~~
rkeene2
And there was no point in reporting kernel bugs to Sun/Oracle at the time,
because they did not care.

------
riffraff
This is more confirmation of my anecdotal experience that when things fail,
it's 90% of the time a filesystem problem (ran out of space, ran out of
inodes, etc.). Thanks :)

------
de_watcher
Just df -h, dude.

~~~
lucb1e
Right, but how do you know to check that? And what if you're out of inodes -
would you know to check for that, just because a process is acting weird? You
learn something new every day.

~~~
de_watcher
The experience came from the days when I wasn't able to debug yet, and when
full disks were frequent for similar reasons.

