Understanding /proc (fredrb.github.io)
234 points by fredrb on Oct 5, 2016 | 42 comments

This is a great post. Recently I have been trying to re-learn and understand Linux (specifically Ubuntu) using monitoring tools. In my opinion htop and Facebook's osquery are the two best available tools for understanding how an operating system and its processes work. The osquery approach of recording all OS data in the form of relational tables (with PIDs as keys, etc.) is very useful.



The osquery query packs are especially useful: https://osquery.io/docs/packs/

Here is an incomplete draft of a similar post: https://github.com/AKSHAYUBHAT/TopDownGuideToLinux

Isn't htop just top with colours?

For a quick global view I like atop, then if I need to drill in a subsystem, iftop, vmstat, free, etc...

Not quite. You can access lsof and strace from inside htop. There is also a process tree view, and you can select the process you want to manipulate via arrow keys.

And you can click on things in htop with your mouse. That feature never gets old. I wish more command line tools supported that...

This reminds me that MC supports mouse interaction.

And playing around with it, I find that it is more elaborate than I first anticipated. The menus can even be operated via the scroll wheel, and it is sensitive to where the mouse is hovering.

All in all, I find myself pondering whether a more up-to-date console web browser is possible. Perhaps using the framebuffer rather than X to display sites (or maybe go all out and implement it using sixel).

I feel stupid now, I've been using htop for years and never noticed that...

Same here! I picked up htop ~2011 and learned about mouse support in ~2014. :D

Also, if you're on a Mac and you use tmux: I just started trying out the tmux integration with iterm2. There are pros and cons, but it's interesting to see first-class OS windows for tmux.

Edit: add missing iterm2 reference. https://gitlab.com/gnachman/iterm2/wikis/TmuxIntegration

Oh! Thanks. I never knew...

Very nice write-up, and a great way to dive deep into an interesting system! But if you plan on maintaining a project like this long term I would recommend using one of the many existing libraries like https://github.com/prometheus/procfs or http://pythonhosted.org/psutil/

There can be a lot of edge cases, and inevitably things will change in the future. Centralising the work of parsing /proc files goes a long way and helps keep things sane for maintenance.
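To give one concrete edge case: the second field of /proc/PID/stat is the command name in parentheses, and that name can itself contain spaces and parentheses, so splitting the line on whitespace breaks. A minimal sketch of one common workaround (splitting on the last closing parenthesis) is below. The function name `parse_stat` and the choice of returned fields are my own, purely for illustration, and this is not how any particular library does it.

```python
def parse_stat(text):
    """Parse the first few fields of a /proc/PID/stat line.

    Hypothetical helper for illustration. The comm field is wrapped in
    parentheses and may contain spaces or ')' itself, so we split on the
    LAST ')' rather than on whitespace.
    """
    left, _, right = text.rpartition(")")
    pid_str, _, comm = left.partition("(")
    rest = right.split()
    return {
        "pid": int(pid_str),
        "comm": comm,
        "state": rest[0],     # field 3 in proc(5)
        "ppid": int(rest[1]),  # field 4 in proc(5)
    }
```

A library such as the ones linked above already handles this kind of thing, which is the point of centralising the parsing.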

It's worth noting that `man proc` has pretty exhaustive, though incomplete, documentation of the various files. It's a great read to learn about some of the files available.

Best place to learn about proc IMHO... And possibly linux header files..

I wrote something to do similar parsing of process state recently. It seems nuts to me that you can't get this all in one call. The naïve way of `fopen()`ing the files you need has a race condition if the PID is reused between two calls to `fopen()`.

Admittedly, probably rare. But why route through calls to `fopen()` and `read()` when you could just provide a function that returns OS-defined structs?

You can use openat() with a file descriptor corresponding to the /proc/pid directory to avoid the race condition.
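For anyone who wants to see it, here is a rough sketch of that idea in Python, where passing `dir_fd` to `os.open()` results in an openat()-style call. The function name `read_proc_files` and the `proc_root` parameter are made up for this example (the latter only so it can be pointed at a test directory).

```python
import os


def read_proc_files(pid, names, proc_root="/proc"):
    """Read several files of one process via a single directory descriptor.

    Hypothetical helper: the directory is opened once, and every file is
    then opened relative to that descriptor instead of by full path, so
    all reads refer to the same /proc/PID directory.
    """
    dir_fd = os.open(os.path.join(proc_root, str(pid)),
                     os.O_RDONLY | os.O_DIRECTORY)
    try:
        contents = {}
        for name in names:
            # dir_fd makes this relative to the already-open directory
            fd = os.open(name, os.O_RDONLY, dir_fd=dir_fd)
            with os.fdopen(fd, "rb") as f:
                contents[name] = f.read()
        return contents
    finally:
        os.close(dir_fd)
```

Usage would be something like `read_proc_files(os.getpid(), ["stat", "cmdline"])`. If the process exits in between, the later opens should fail with an `OSError` rather than silently reading a different process.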

Huh. Does this mean that I can open() all the numerically-named directories in /proc, in a loop, and eventually prevent the system from being able to create new processes?

Or will operations on a file descriptor corresponding to an exited process fail?

According to my reading of the code, an open process file just holds a reference to a relatively lightweight handle to that process (struct pid), which means 1) resource exhaustion by holding onto full process structures doesn't happen 2) numerical process ids can get recycled even if someone is holding a long-dead /proc/PID directory open for that pid, but the old opened directory keeps referring to the dead process.

Thank you so much for working that out!

Yep, just saying the "obvious" approach is wrong, which is (IMHO) always a crappy way to design an API.

Based on what I learned a few months ago trying to deal with all these problems, the conclusion I've come to is that basically the entire "standard" set of POSIX calls are wrong. Correct handling requires an entire parallel set of calls like "openat". However, this parallel set of calls postdates the more conventional calls by a couple of decades and in some cases we found may still not yet be entirely complete. (I'd guess they probably are by now, we were a couple of versions back on the kernel.)

The problem is that those old calls have immense, immense inertia. They are how people think of files. They are how almost all, if not actually all, higher level languages interact with files by default, saving the "correct" calls for external modules, if you even get that. In fact, many higher-level languages are actively inimical to correct file handling by trying to abstract away the "file handle" so you only have to deal with file names, but for correct handling you really need to consider the file handle the real file and the file name merely a transient method for obtaining a file handle, which is to be never used again once you have the handle.

Bear in mind the "crappy" API in question is probably older than you are, so it's not really that surprising that it has needed some work as our world has changed.

Even if you fix the pid reuse problem, you'll still have difficulties parsing things like /proc/PID/maps. Unless your pseudo-file consists of fixed-size records (and as I recall, /proc/PID/maps doesn't, as each line has a variable-width path in it), your best option is just to read the entire file in at once and keep your fingers crossed that it ended up being atomic. (Obviously the system can't block operations that affect the maps file...)
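A sketch of the "read it all first, then parse" approach, assuming the usual line layout of `address perms offset dev inode [pathname]`. The names `Mapping`, `parse_maps` and `read_maps` are mine, and nothing here makes the read atomic; it only avoids interleaving parsing with reading.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    offset: int
    dev: str
    inode: int
    path: str  # empty string for anonymous mappings


def parse_maps(text):
    """Parse the text of a maps file into Mapping tuples (illustrative)."""
    mappings = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at most 5 times so a path containing spaces stays whole.
        fields = line.split(None, 5)
        addr, perms, offset, dev, inode = fields[:5]
        path = fields[5] if len(fields) > 5 else ""
        start, end = (int(part, 16) for part in addr.split("-"))
        mappings.append(
            Mapping(start, end, perms, int(offset, 16), dev, int(inode), path))
    return mappings


def read_maps(pid="self", proc_root="/proc"):
    """Read the whole maps file before parsing any of it."""
    with open(f"{proc_root}/{pid}/maps", "rb") as f:
        data = f.read()
    return parse_maps(data.decode("utf-8", "surrogateescape"))
```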

signalfd gets this bit right.

It didn't take much POSIX programming before I started to look at Windows in a whole new light...

> signalfd gets this bit right.

I'd be careful saying that any API based around signals is "right". In general, signals are just horrible and I really wish that the UNIX history had played out differently.

I agree that parsing /proc/pid/maps is a complete nightmare, but I'm pretty sure that the way pseudo-files on linux work, it is guaranteed the contents won't change out from under you while you are reading it.

Not all pseudo-files are like that. For example, /sys/fs/cgroup/.../cgroup.procs will only have consistent content if you read everything in a single page. Which is kinda dumb IMO.

Interesting... good to know.

Why would the OS recycle its process ids anyway? I mean, if you'd use 64 bits for the id, there's no way the counter could cycle in the lifetime of the hardware.

http://man7.org/linux/man-pages/man5/proc.5.html - "On 32-bit platforms, 32768 is the maximum value for pid_max. On 64-bit systems, pid_max can be set to any value up to 2^22 (PID_MAX_LIMIT, approximately 4 million)."

And this is Unix, so you can't do anything without running a process ;) - and I think PIDs and threads share a namespace on Linux, too, so the chance of wraparound is even higher again. (I also don't think there's any guarantee that PIDs will be a simple incrementing counter in the first place! Though that's the most obvious thing to do, so you can probably indeed be pretty certain that's exactly what will happen..)
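Some back-of-the-envelope arithmetic on this, as a sketch. It reads the limit from /proc/sys/kernel/pid_max (the file the man page quote is describing) and compares it with a 64-bit counter. The creation rate is an assumed number picked purely for illustration, and the helper names are made up.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def read_pid_max(path="/proc/sys/kernel/pid_max"):
    """Return the current PID limit as an integer."""
    with open(path) as f:
        return int(f.read().strip())


def seconds_until_wrap(id_space, creations_per_second):
    """Time to exhaust an ID space if IDs are handed out sequentially."""
    return id_space / creations_per_second


if __name__ == "__main__":
    rate = 1000.0  # ASSUMPTION: 1000 new tasks per second, for illustration
    try:
        pid_max = read_pid_max()
    except OSError:
        pid_max = 32768  # fall back to the 32-bit maximum quoted above
    print("pid_max:", pid_max)
    print("seconds to wrap at assumed rate:",
          seconds_until_wrap(pid_max, rate))
    print("years to wrap a 64-bit counter at the same rate:",
          seconds_until_wrap(2**64, rate) / SECONDS_PER_YEAR)
```

With a limit of 32768 the counter wraps in about half a minute at that assumed rate, while a 64-bit counter would take far longer than any hardware lifetime, which is the parent's point.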

OS X has a 64-bit thread ID that promises to be unique across all threads for the uptime of the system. What a good idea! - no prizes for guessing whether this comes from the BSD part, or the Mach part...

I think it is a bigger problem that you can't get a snapshot of all processes atomically. It is easy enough to solve the case of atomically accessing a single process, though annoying.

It isn't clear to me what an "atomic" snapshot of processes would even be in a multicore world, though. Even if process creation and destruction is strictly linearizable in how the kernel handles it, which I do not assert to be true, there is certainly a lot of other stuff in that snapshot which is not. And without that, there isn't anything like an "atomic" snapshot that makes sense to me.

(Even in a single core world I suspect you'd hit a lot of resistance in any method you could use to take an atomic state, for performance reasons. Even just telling the kernel to "stop doing everything else and give me a copy of all this information" is not going to go over well.)

Sure it means something. Both Windows[1] and OS X[2] have easy ways to do it, and the task list in the Linux kernel is just a linked list that could be easily atomically copied. Obviously a lot of the metadata associated with the processes couldn't be easily accessed in a consistent way, but the number of running processes, their pid, basic info like the command line and process image is totally knowable to the kernel at any given instant.

The fact that you can't get that info easily on Linux makes a lot of tools harder to write than they should be, and those tools often give misleading or incorrect information.

1. https://msdn.microsoft.com/en-us/library/windows/desktop/ms6... 2. https://developer.apple.com/legacy/library/documentation/Dar...

At least the Windows one is the function I wasn't willing to assert exists. However, it does not appear to include the "lot of other stuff in that snapshot which is not [linearizable]". The resulting structs[1] are missing many things that ps may want to display in Linux, which you will have to fetch nonatomically.

The OSX one, I don't know; I scanned over the docs and it went over my threshold for what I'm willing to poke through for a HN post.

In other words, this is exactly what I left myself an out for, and I see nothing that contradicts what I said for the Windows case.

[1]: https://msdn.microsoft.com/en-us/library/windows/desktop/ms6...

I don't disagree that there is a lot of information you cannot include in such a snapshot, and neither the Windows nor the OS X equivalent (it's the KERN_PROC argument to sysctl, btw) gives you much detail beyond all PIDs (which you can get with a directory listing of /proc on Linux, though this is not completely atomic either, even with getdents(2)!) and their command line/exe path. I know of no method to get those latter two reliably on Linux, and that results in inconsistent and/or incomplete info for certain use cases, especially when combined with the fact that PIDs, though randomized, can be reused on Linux.
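For reference, the non-atomic Linux approach I mean looks roughly like this. It is only a sketch: `list_processes` and `proc_root` are names invented for the example, and any process can appear, exit, or have its PID reused between the directory listing and the individual reads.

```python
import os


def list_processes(proc_root="/proc"):
    """Best-effort map of PID -> argv, built from a /proc directory listing.

    Illustrative only: this is NOT a consistent snapshot. Entries that
    vanish or cannot be read between listing and reading are skipped.
    """
    procs = {}
    for entry in os.listdir(proc_root):
        if not entry.isdecimal():
            continue  # skip non-PID entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # process exited, or we lack permission
        # cmdline is a sequence of NUL-terminated arguments
        args = [a.decode("utf-8", "replace") for a in raw.split(b"\0") if a]
        procs[int(entry)] = args
    return procs
```

Note that an empty list comes back for kernel threads and zombies, since their cmdline file is empty.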

Minor nitpick: calling it a clone of the Unix 'ps' wouldn't be exactly right, since I understand /proc is Linux-specific. On the other hand, how did the Unix 'ps' or the 'ps' in other Unix clones work? Is there an alternative method to expose the process data to userspace instead of using a VFS like procfs?

/proc is not Linux-specific; my late, great colleague Roger Faulkner did a lot of work on it in Solaris:


...and there's a general history here:


The part that may be specific to Linux is Linux provides a text-based interface, whereas systems like Solaris provide a binary interface.

That link is really interesting! Thanks for that. I should have probably said procfs is Linux-specific. Anyway, learned (rather unlearned) something new today.

Also interesting is that early Unix systems (before v8) used `ptrace()` for gathering process information - the same system call programs like strace/ltrace use today.

No, procfs is also not Linux-specific ;-) /proc is a filesystem, so most of us refer to it as 'procfs' for short. In fact, the header file you include in a C program on Solaris to use it is '<procfs.h>'.

As I said before, the only thing that's really Linux-specific is Linux chose to represent it as text instead of something machine-parsable.

Which other OS uses procfs?

Here is the history: https://en.wikipedia.org/wiki/Procfs I think the Linux one was inspired by Plan 9's implementation.

Thank you!

There are a couple of options. A good number of other UNIXes do in fact have a procfs. Some others, like macOS, have a system call that gives you the information you want (on macOS this is a sysctl, CTL_KERN / KERN_PROC / KERN_PROC_ALL, which returns a set of structs). And finally, especially on older UNIXes, the ps command is setuid root, and goes and opens /dev/kmem or similar and looks around in kernel memory and parses the live kernel's process table directly.

You may be thinking of sysfs.

I wrote a small ps clone as a side project, and "man proc" was invaluable in understanding what everything meant.

There was interesting work happening on a proposed newer api though, "task_diag" https://lwn.net/Articles/685791/ https://criu.org/Task-diag.

Thank you for writing this. This is a wonderfully insightful post :)

Is there something similar for /sys and /run?
