
Fundamentals of non blocking I/O on Linux/BSD - blopeur
https://medium.com/@copyconstruct/nonblocking-i-o-99948ad7c957
======
wahern

      > the file entry data structure maintains a file offset for
      > every process
    

File table entries aren't per-process. That's dangerously misleading. Without
getting lost in the weeds, just remember that new file table entries are
usually created for every open() or socket() call, in addition to a new
descriptor table entry. Contrast that with dup() or fork(), which create new
descriptor table entries that point to shared file table entries.

Why does this matter? Because, as the article suggests (albeit ambiguously),
fork() has semantics similar to dup(): the new process gets a cloned
descriptor table, _but_ with pointers to _shared_ file table entries. The
article is misleading because when two processes read through a shared file
table entry for a regular file, the order of the reads matters. The cursor is
shared, so the sequence of data each process reads can differ based on random
scheduling latencies.

File position cursors really only matter for regular files, not pipes or
sockets. But another thing to keep in mind is the distinction between
descriptor flags and file entry flags[1]. O_CLOEXEC is a descriptor flag,
which means it's not inherited when you dup() a descriptor (it is inherited
across a fork, but that's because you're getting a clone of the entry which is
thereafter distinct). However, flags like O_NONBLOCK are file entry flags,
which means if process A forks process B and process B does fcntl(fd4,
F_SETFL, O_NONBLOCK), all of a sudden fd4 will behave in a non-blocking manner
in process A. Likewise, if you do dup2(fd4, fd5) then fcntl(fd5, F_SETFL,
O_NONBLOCK), all of a sudden fd4 is non-blocking.

One interesting distinction between BSD and Linux is that on BSD opening
/dev/fd/N is identical to calling dup(); even though you're using open() you
get a shared file table entry. However, Linux symlinks /dev/fd to
/proc/self/fd and you get regular open() semantics with an unshared file table
entry.

On Linux, at least, you can create a new file table entry for an existing pipe
through /proc/self/fd. But AFAIK you can't open sockets through /proc/self/fd.
I've never really run into the need to do this, though. But it's good to keep
all of this in mind for regular files because shared cursors could easily
cause headaches, perhaps even security issues.

FWIW, Unix descriptor passing has dup() semantics.

[1] Technically the term is file status flags, but when discussing this I
prefer file entry flags (or file table entry flags) to highlight the
relationship to descriptor flags.

~~~
copyconstruct
I'm the author of that post. The goal wasn't to mislead - like I mentioned,
I'm learning these things myself and definitely could've gotten several things
wrong.

I meant _file offsets_ are per process, not that every process gets its own
table entry.

> when two processes perform reads through a shared file table entry for a
> regular file, the order of reads will matter as between the two processes
> because of the shared cursor; the sequence of data each process reads could
> differ based on random scheduling latencies.

Not sure I follow. Won't the two processes still have their own descriptors
which point to the same file entry but maintain their own offsets? I think
what I understood from your comment is that descriptors are _shared_ by the
parent and child with share by reference semantics? So both the parent and the
child _are using the same descriptor_ which in turn has an offset in the file
table entry.

~~~
wahern
But file offsets _aren't_ per process. File offsets (aka I/O position cursors)
are kept in the file table entry data structure, and those are shared for
descriptors that have been dup'd or fork'd. If the cursor weren't shared, then
this program

    
    
      #include <stdio.h>
      #include <stdlib.h>
      
      #include <err.h>
      #include <unistd.h>
      
      int
      main(void) {
      	FILE *fh = tmpfile();
      	if (!fh)
      		err(1, "tmpfile");  
      	int fd = fileno(fh);
      	if (fd == -1)
      		errx(1, "fileno: no descriptor");  
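      	/* Write the digits, then rewind so the next read starts at offset 0. */
      	const char digits[] = "0123456789";
      	if (sizeof digits != write(fd, digits, sizeof digits))
      		err(1, "write");
      	if (-1 == lseek(fd, 0, SEEK_SET))
      		err(1, "lseek");
      	/* After fork() both processes hold descriptors that point to the
      	 * same file table entry, and therefore share one cursor. */
      	if (-1 == fork())
      		err(1, "fork");
      	/* Parent and child each read one byte through that shared cursor. */
      	char ch;
      	switch (read(fd, &ch, 1)) {
      	case -1:
      		err(1, "read");
      	case 0:
      		errx(1, "read: EOF");
      	}
      	printf("%ld: %c\n", (long)getpid(), ch);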
      	return 0;
      }
    

would print '0' twice. However, it actually prints '0' then '1'.

Descriptor tables are per process, but the only thing a descriptor table entry
stores is a flags field (basically, O_CLOEXEC/FD_CLOEXEC, plus maybe some
esoteric platform specific flags), and a pointer to a file table entry data
structure. Most state, like the O_NONBLOCK flag and file offsets, is kept in
the [often shared] file table entry. The file table and its entries are
completely independent from any particular process; in fact, traditionally
there's only one global file table, just like there's only one process table.

These errors can usually be avoided if one always cites a primary source
(e.g. POSIX standard, vendor source code) for every assertion and/or validates
the assertion with actual code. Maybe it's my legal training, but whenever I
make an assertion, especially a technical assertion, I make it a habit of
following those two rules, even when posting comments. And quite often I end
up learning something new in the process.

