
I wrote something to do similar parsing of process state recently. It seems nuts to me that you can't get this all in one call. The naïve way of `fopen()`ing the files you need has a race condition if the PID is reused between two calls to `fopen()`.

Admittedly, probably rare. But why route through calls to `fopen()` and `read()` when you could just provide a function that returns OS-defined structs?
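For concreteness, the racy pattern I mean looks something like this (a minimal sketch; the helper name and the particular files are just examples):

    #include <stdio.h>
    #include <sys/types.h>

    /* Hypothetical helper, for illustration only. */
    void read_process_state_naive(pid_t pid) {
        char path[64];

        snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
        FILE *stat_f = fopen(path, "r");      /* refers to process A */

        /* ...the process can die and the PID be reused right here... */

        snprintf(path, sizeof(path), "/proc/%d/status", (int)pid);
        FILE *status_f = fopen(path, "r");    /* may refer to process B! */

        /* ...parse both, wrongly assuming they describe one process... */

        if (stat_f) fclose(stat_f);
        if (status_f) fclose(status_f);
    }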




You can use openat() with a file descriptor corresponding to the /proc/pid directory to avoid the race condition.
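Something like this, with error handling trimmed (the file names are just examples; the point is that both openat() calls resolve against the same pinned directory):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    void read_process_state(pid_t pid) {
        char path[64];
        snprintf(path, sizeof(path), "/proc/%d", (int)pid);

        /* Open the directory once; this fd pins the /proc entry, so
         * every openat() below refers to the same process even if the
         * numeric PID is recycled in the meantime. */
        int dirfd = open(path, O_RDONLY | O_DIRECTORY);
        if (dirfd < 0)
            return;

        int stat_fd   = openat(dirfd, "stat",   O_RDONLY);
        int status_fd = openat(dirfd, "status", O_RDONLY);

        /* ...read and parse; both fds describe the same process... */

        if (stat_fd >= 0)   close(stat_fd);
        if (status_fd >= 0) close(status_fd);
        close(dirfd);
    }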


Huh. Does this mean that I can open() all the numerically-named directories in /proc, in a loop, and eventually prevent the system from being able to create new processes?

Or will operations on a file descriptor corresponding to an exited process fail?


According to my reading of the code, an open process file just holds a reference to a relatively lightweight handle to that process (struct pid), which means 1) resource exhaustion by holding onto full process structures doesn't happen, and 2) numerical process IDs can get recycled even if someone is holding a long-dead /proc/PID directory open for that PID, but the old opened directory keeps referring to the dead process.
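A quick way to see the second point (a sketch; the exact errno you get back is a kernel detail I wouldn't swear to):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);                          /* child exits immediately */

        char path[64];
        snprintf(path, sizeof(path), "/proc/%d", (int)pid);
        int dirfd = open(path, O_RDONLY | O_DIRECTORY);  /* child is still a zombie, so this succeeds */

        waitpid(pid, NULL, 0);                 /* child is now fully reaped */

        /* The directory fd stays valid, but the process is gone, so
         * opening files inside it fails. */
        int fd = openat(dirfd, "stat", O_RDONLY);
        printf("openat after exit: %d\n", fd); /* expect -1 */
        return 0;
    }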


Thank you so much for working that out!


Yep, just saying that when the "obvious" approach is the wrong one, that's (IMHO) always a sign of a crappy API design.


Based on what I learned a few months ago trying to deal with all these problems, the conclusion I've come to is that basically the entire "standard" set of POSIX calls is wrong. Correct handling requires an entire parallel set of calls like "openat". However, this parallel set postdates the conventional calls by a couple of decades, and in some cases we found it may still not be entirely complete. (I'd guess it probably is by now; we were a couple of kernel versions behind.)

The problem is that those old calls have immense, immense inertia. They are how people think of files. They are how almost all, if not actually all, higher-level languages interact with files by default, relegating the "correct" calls to external modules, if you even get that. In fact, many higher-level languages are actively inimical to correct file handling: they try to abstract away the "file handle" so you only have to deal with file names. For correct handling, you really need to consider the file handle the real file, and the file name merely a transient means of obtaining a handle, never to be used again once you have it.
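As a sketch of that handle-first discipline (process_file is made up for illustration):

    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Resolve the name exactly once, then do everything through the fd.
     * stat("name") followed by open("name") can race with a rename;
     * fstat() on an already-open fd cannot. */
    int process_file(const char *name) {
        int fd = open(name, O_RDONLY);  /* the only use of the name */
        if (fd < 0)
            return -1;

        struct stat st;
        if (fstat(fd, &st) < 0) {       /* operates on the handle */
            close(fd);
            return -1;
        }

        /* ...read via fd; the name may be renamed or unlinked, but the
         * handle still refers to the same underlying file... */
        close(fd);
        return 0;
    }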

Bear in mind the "crappy" API in question is probably older than you are, so it's not really that surprising that it has needed some work as our world has changed.


Even if you fix the PID reuse problem, you'll still have difficulties parsing things like /proc/PID/maps. Unless your pseudo-file consists of fixed-size records - and as I recall, /proc/PID/maps doesn't, as each line has a variable-width path in it - your best option is just to read the entire file in one go and keep your fingers crossed that the read was atomic. (Obviously the system can't block operations that affect the maps file while you hold it open...)
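Concretely, the best-effort version is just this (a sketch; real code would grow the buffer and retry if the read fills it):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Read /proc/PID/maps with one large read(), hoping the kernel
     * hands back a single consistent snapshot. */
    ssize_t slurp_maps(pid_t pid, char *buf, size_t bufsize) {
        char path[64];
        snprintf(path, sizeof(path), "/proc/%d/maps", (int)pid);

        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        ssize_t n = read(fd, buf, bufsize);  /* one shot, fingers crossed */
        close(fd);
        return n;
    }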

signalfd gets this bit right.

It didn't take much POSIX programming before I started to look at Windows in a whole new light...


> signalfd gets this bit right.

I'd be careful about calling any API based around signals "right". In general, signals are just horrible, and I really wish that UNIX history had played out differently.


I agree that parsing /proc/pid/maps is a complete nightmare, but I'm pretty sure that, the way pseudo-files on Linux work, it's guaranteed the contents won't change out from under you while you're reading them.


Not all pseudo-files are like that. For example, /sys/fs/cgroup/.../cgroup.procs will only have consistent content if you read everything in a single page. Which is kinda dumb IMO.
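If I understand that behaviour right, a reader has to treat each page-sized chunk as its own consistent unit, roughly like this (the cgroup name is made up for illustration):

    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    void read_cgroup_procs(void) {
        /* "mygroup" is a made-up cgroup name. */
        int fd = open("/sys/fs/cgroup/mygroup/cgroup.procs", O_RDONLY);
        if (fd < 0)
            return;

        char page[4096];
        ssize_t n;
        while ((n = read(fd, page, sizeof(page))) > 0) {
            /* Each chunk is internally consistent, but PIDs can repeat
             * or go missing across chunk boundaries if the set changes
             * mid-listing. */
        }
        close(fd);
    }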


Interesting... good to know.


Why would the OS recycle its process IDs anyway? I mean, if you used 64 bits for the ID, there's no way the counter could cycle in the lifetime of the hardware.


http://man7.org/linux/man-pages/man5/proc.5.html - "On 32-bit platforms, 32768 is the maximum value for pid_max. On 64-bit systems, pid_max can be set to any value up to 2^22 (PID_MAX_LIMIT, approximately 4 million)."

And this is Unix, so you can't do anything without running a process ;) - and I think PIDs and thread IDs share a namespace on Linux, too, so the chance of wraparound is even higher. (I also don't think there's any guarantee that PIDs will be a simple incrementing counter in the first place! Though that's the most obvious thing to do, so you can probably indeed be pretty certain that's exactly what will happen...)

OS X has a 64-bit thread ID that promises to be unique across all threads for the uptime of the system. What a good idea! - no prizes for guessing whether this comes from the BSD part, or the Mach part...
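If memory serves, the call is pthread_threadid_np(); a minimal macOS-only sketch:

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t tid = 0;
        /* NULL = the calling thread; the ID is unique across all
         * threads for the uptime of the system. */
        if (pthread_threadid_np(NULL, &tid) == 0)
            printf("thread id: %llu\n", (unsigned long long)tid);
        return 0;
    }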


I think it is a bigger problem that you can't get a snapshot of all processes atomically. It is easy enough to solve the case of atomically accessing a single process, though annoying.


It isn't clear to me what an "atomic" snapshot of processes would even be in a multicore world, though. Even if process creation and destruction is strictly linearizable in how the kernel handles it, which I do not assert to be true, there is certainly a lot of other stuff in that snapshot which is not. And without that, there isn't anything like an "atomic" snapshot that makes sense to me.

(Even in a single core world I suspect you'd hit a lot of resistance in any method you could use to take an atomic state, for performance reasons. Even just telling the kernel to "stop doing everything else and give me a copy of all this information" is not going to go over well.)


Sure it means something. Both Windows[1] and OS X[2] have easy ways to do it, and the task list in the Linux kernel is just a linked list that could be easily atomically copied. Obviously a lot of the metadata associated with the processes couldn't be easily accessed in a consistent way, but the number of running processes, their pid, basic info like the command line and process image is totally knowable to the kernel at any given instant.

The fact that you can't get that info easily on Linux makes a lot of tools harder to write than they should be, and it means they often give misleading/incorrect information.

1. https://msdn.microsoft.com/en-us/library/windows/desktop/ms6... 2. https://developer.apple.com/legacy/library/documentation/Dar...
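For the Windows side, I believe [1] is CreateToolhelp32Snapshot; a minimal sketch:

    #include <windows.h>
    #include <tlhelp32.h>
    #include <stdio.h>
    #include <wchar.h>

    int main(void) {
        /* One call captures the whole process table at a point in time. */
        HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0);
        if (snap == INVALID_HANDLE_VALUE)
            return 1;

        PROCESSENTRY32W pe = { .dwSize = sizeof(pe) };
        if (Process32FirstW(snap, &pe)) {
            do {
                /* PID and exe name, consistent within the snapshot */
                wprintf(L"%lu %ls\n", (unsigned long)pe.th32ProcessID, pe.szExeFile);
            } while (Process32NextW(snap, &pe));
        }
        CloseHandle(snap);
        return 0;
    }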


At least the Windows one is the function I wasn't willing to assert exists. However, it does not appear to include the "lot of other stuff in that snapshot which is not [linearizable]". The resulting structs[1] are missing many things that ps may want to display on Linux, which you will have to fetch nonatomically.

The OS X one, I don't know; I scanned over the docs and they went over my threshold for what I'm willing to poke through for an HN post.

In other words, this is exactly what I left myself an out for, and I see nothing that contradicts what I said for the Windows case.

[1]: https://msdn.microsoft.com/en-us/library/windows/desktop/ms6...


I don't disagree that there is a lot of information you cannot include in such a snapshot. Neither the Windows nor the OS X equivalent (it's the KERN_PROC argument to sysctl, btw) gives you much detail beyond all PIDs (which you can get with a directory listing of /proc on Linux, though that is not completely atomic either, even with getdents(2)!) and their command line/exe path. Those latter two are completely unavailable to get reliably with any method I know of on Linux, and that results in some inconsistent and/or incomplete info for certain use cases, especially when combined with the fact that PIDs, though randomized, can be reused on Linux.
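A sketch of that sysctl route, for reference (real code should retry, since the table can grow between the two calls):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/sysctl.h>

    int main(void) {
        int mib[4] = { CTL_KERN, KERN_PROC, KERN_PROC_ALL, 0 };
        size_t len = 0;

        /* First call: ask how big the process table is. */
        if (sysctl(mib, 4, NULL, &len, NULL, 0) < 0)
            return 1;

        struct kinfo_proc *procs = malloc(len);
        if (!procs)
            return 1;

        /* Second call: copy the whole table out in one shot. */
        if (sysctl(mib, 4, procs, &len, NULL, 0) < 0) {
            free(procs);
            return 1;
        }

        for (size_t i = 0; i < len / sizeof(*procs); i++)
            printf("%d %s\n", (int)procs[i].kp_proc.p_pid,
                   procs[i].kp_proc.p_comm);

        free(procs);
        return 0;
    }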




