
POSIX close(2) is broken - cperciva
http://www.daemonology.net/blog/2011-12-17-POSIX-close-is-broken.html
======
cperciva
Also, there's a bug in HN: When I submitted this under the title "POSIX
close(2) is broken" it converted "POSIX" to "Posix" (presumably because PG
doesn't like shouting titles), but I was able to edit it back to reading
"POSIX".

In this case "POSIX" is correct, not "Posix", of course; but if HN is going to
mangle titles it ought to at least do so _consistently_.

~~~
fauigerzigerk
That's odd. I see lots of titles on the front page right now that include the
word SOPA.

~~~
cperciva
Maybe it's triggered by having 5 or more consecutive upper-case characters? I
have no idea.

------
Qerub
Recommended reading:

Richard P. Gabriel – The Rise of Worse is Better
<http://www.dreamsongs.com/RiseOfWorseIsBetter.html>

[...]

The MIT guy did not see any code that handled this case and asked the New
Jersey guy how the problem was handled. The New Jersey guy said that the Unix
folks were aware of the problem, but the solution was for the system routine
to always finish, but sometimes an error code would be returned that signaled
that the system routine had failed to complete its action. A correct user
program, then, had to check the error code to determine whether to simply try
the system routine again. The MIT guy did not like this solution because it
was not the right thing.

The New Jersey guy said that the Unix solution was right because the design
philosophy of Unix was simplicity and that the right thing was too complex.
Besides, programmers could easily insert this extra test and loop. The MIT guy
pointed out that the implementation was simple but the interface to the
functionality was complex. The New Jersey guy said that the right tradeoff has
been selected in Unix -- namely, implementation simplicity was more important
than interface simplicity.

The MIT guy then muttered that sometimes it takes a tough man to make a tender
chicken, but the New Jersey guy didn’t understand (I’m not sure I do either).

[...]

~~~
cperciva
I considered citing that in my post, but it didn't fit anywhere conveniently.
In general I like the New Jersey approach of "push complexity to the caller",
but not in cases like this where it's impossible for the caller to handle the
complexity correctly.

I have the same issue with eventual consistency, for what it's worth: I'm fine
with the fact that my updates might not be globally visible immediately, but I
want to have a way that at some point I can _know_ that they have propagated
(aka. "eventually known consistency").

~~~
eis
Is eventually known consistency even possible if at any point while the
notification of consistency is being transferred to you, the system can enter
a non/eventual-consistent state again?

~~~
cperciva
By "eventually known consistency" I don't mean "at some point I will know that
the system is consistent"; rather, I mean "at some point I will know that the
system is consistent _with respect to operation X_ " (or alternatively "...
with respect to all operations performed before time T").

------
saurik
Link to the standard:

[http://pubs.opengroup.org/onlinepubs/000095399/functions/clo...](http://pubs.opengroup.org/onlinepubs/000095399/functions/close.html)

Some discussion that is related to this:

<http://lwn.net/Articles/365294/>

[http://stackoverflow.com/questions/7297001/if-
close2-fails-w...](http://stackoverflow.com/questions/7297001/if-close2-fails-
with-eio-will-the-file-descriptor-still-be-deleted)

Something which I would consider to be "the answer" from Linus for Linux:

[http://linux.derkeiler.com/Mailing-
Lists/Kernel/2005-09/3000...](http://linux.derkeiler.com/Mailing-
Lists/Kernel/2005-09/3000.html)

[http://linux.derkeiler.com/Mailing-
Lists/Kernel/2005-09/3171...](http://linux.derkeiler.com/Mailing-
Lists/Kernel/2005-09/3171.html)

[http://linux.derkeiler.com/Mailing-
Lists/Kernel/2005-09/3202...](http://linux.derkeiler.com/Mailing-
Lists/Kernel/2005-09/3202.html)

(Apparently, the first thing Linux does is deallocate the file descriptor;
/then/ it starts flushing pending written data. If this process is
interrupted, it will return EINTR, but the file descriptor itself is already
deallocated, and may have been reused long before close() returned.)

Honestly, after reviewing all of this material, I think the author of this
post might simply not know what "unspecified" means (or, alternatively, is
making an abstract problem sound overly concrete and blown somewhat out of
proportion).

When a specification says "unspecified", it doesn't mean "unknown at runtime,
could be anything nondeterministically". It only means that the standard did
not specify the result.

Often, this is because there is either a historical artifact or reasonable
practical constraint that was posed by one or more vendors that were
implementing the standard that meant different implementations simply
/weren't/ going to agree on any one value.

However, for any given implementation, there very likely might be a correct
answer. In the case of Linux, you should not retry calls to close(): the
primary maintainer has told us that it deterministically guarantees the file
descriptor is closed even if the function returns EINTR (critical edit: the
wording of this sentence was incorrect in original draft).

That said, there may be another implementation out there, possibly one we may
use (but with a standard like this, could easily be one that we have never
heard of and that already died a death of obscurity years ago) that has the
opposite behavior: where EINTR interrupts something (such as pending writes)
before the file descriptor is closed, requiring developers to retry.

It is, however, unfortunate that this happened, and the author is right about
issues for developers who attempt to use this API "without assuming more than
the standard specified".

That said, there may be other options, such as using SA_RESTART, as argued in
this discussion I just found (which also happens to talk about the meaning of
"unspecified" ;P):

<http://permalink.gmane.org/gmane.os.hurd.devel.readers/265>

(edit:) Oh, and people may find it interesting that standards like this are
often modified over time, with "interpretations" and outright edits. As an
example, here is some vaguely related discussion regarding closing file
descriptors during process termination:

<http://austingroupbugs.net/view.php?id=498>

(edit:) Some more existing discussion:

[https://sites.google.com/site/michaelsafyan/software-
enginee...](https://sites.google.com/site/michaelsafyan/software-
engineering/checkforeintrwheninvokingclosethinkagain)

[http://lists.freebsd.org/pipermail/freebsd-
threads/2010-Augu...](http://lists.freebsd.org/pipermail/freebsd-
threads/2010-August/004866.html)

[http://unix.derkeiler.com/Newsgroups/comp.unix.programmer/20...](http://unix.derkeiler.com/Newsgroups/comp.unix.programmer/2009-01/msg00155.html)

<http://www.cpptalk.net/image-vp121851.html>

<http://programming.itags.org/unix-linux-programming/79468/>

Another (very long) Linux-specific thread involving some more responses from
Linus:

[http://lkml.indiana.edu/hypermail/linux/kernel/0207.2/0391.h...](http://lkml.indiana.edu/hypermail/linux/kernel/0207.2/0391.html)

[http://lkml.indiana.edu/hypermail/linux/kernel/0207.2/0409.h...](http://lkml.indiana.edu/hypermail/linux/kernel/0207.2/0409.html)

[http://lkml.indiana.edu/hypermail/linux/kernel/0207.2/1173.h...](http://lkml.indiana.edu/hypermail/linux/kernel/0207.2/1173.html)

In this thread we also find someone else claiming the standard is "broken":

[http://lkml.indiana.edu/hypermail/linux/kernel/0207.2/0413.h...](http://lkml.indiana.edu/hypermail/linux/kernel/0207.2/0413.html)

With a possibly not useful response, that at least demonstrates that the
examples from the specification did not retry:

[http://lkml.indiana.edu/hypermail/linux/kernel/0207.2/0395.h...](http://lkml.indiana.edu/hypermail/linux/kernel/0207.2/0395.html)

~~~
FooBarWidget
What about other operating systems like FreeBSD, Solaris and OS X? I have to
support many platforms and my code is heavily multithreaded.

~~~
saurik
FreeBSD: It seems pretty clear in FreeBSD that you should not retry the call
to close(); sys_close() calls kern_close() which locks the file, frees the
file descriptor (!), and /then/ calls closef() (which, if capable of returning
EINTR, is already too late).

[http://lists.freebsd.org/pipermail/freebsd-
threads/2010-Augu...](http://lists.freebsd.org/pipermail/freebsd-
threads/2010-August/004866.html)

<http://fxr.watson.org/fxr/source/kern/kern_descrip.c>

Mac OS X: Here you should not retry, but you should also be careful; depending
on whether you are UNIX2003 you will get different behavior from the close()
function from libSystem, directing you to either the syscall close() or
close_nocancel().

If you end up with close(), then the first thing that happens is
__pthread_testcancel(1) is called, and if the thread is cancelled it will
return EINTR before doing anything else: in this case, you would need to
retry.

However, I think close_nocancel(), which calls closef_locked(), might be
capable of returning EINTR, which will be held and only returned from
close_internal_locked() after _fdrelse() has already removed the descriptor.

So, if it is the case that EINTR is capable of being returned from the
closef_locked() call, you would need to /not/ retry, which thereby means that
the close() version of this call is impossible to use safely on Mac OS X: if I
were you I'd avoid it to use close_nocancel() (explicitly if warranted).

[http://opensource.apple.com/source/xnu/xnu-1699.24.8/bsd/ker...](http://opensource.apple.com/source/xnu/xnu-1699.24.8/bsd/kern/kern_descrip.c)

Solaris: Unfortunately, I do not know enough about Solaris to provide any
useful commentary. :(

~~~
nknight
Notice how long and complex your discussion of this issue is, and how many
different systems you have to investigate in depth to even begin to form a
coherent picture.

All we're trying to do is close a file safely. This is core functionality that
virtually every application will have to depend on. It would be like free()
saying it might be interrupted by a signal, and you have no way of knowing if
the memory has actually been deallocated or not. That would be insane.

------
1amzave
My first reaction to the thread-safety problem: fstat before close, if it
fails with EINTR, fstat again. Then you can tell both the state of the fd
(since fstat will return EBADF if it's closed), and if it's still a valid fd,
_what_ file it actually refers to. So if another thread _did_ open() and
reallocate the same fd number, you'll get a different struct stat back, and
then know not to retry the close.

As in:

    
    
      int safeclose(int fd)
      {
      	struct stat before, after;
      
      	if (fstat(fd,&before))
      		return -1;
      
      	while (close(fd)) {
      		if (errno != EINTR)
      			return -1;
      
      		/* If fstat() fails, our close() succeeded */
      		if (fstat(fd,&after))
      			return 0;
      
      		/* If we've got a different file, our close() succeeded */
      		if (before.st_dev != after.st_dev
      		    || before.st_ino != after.st_ino /* whatever other necessary checks here... */)
      			return 0;
      		/* Otherwise we've got the same file still open, retry */
      	}
      
      	return 0;		
      }
    

Any problems/races I'm not seeing? Is comparing 'struct stat' potentially not
100% reliable?

~~~
philh
It looks like if the other thread opened the same file you might get a false
positive.

    
    
        struct stat {
            dev_t     st_dev;     /* ID of device containing file */
            ino_t     st_ino;     /* inode number */
            mode_t    st_mode;    /* protection */
            nlink_t   st_nlink;   /* number of hard links */
            uid_t     st_uid;     /* user ID of owner */
            gid_t     st_gid;     /* group ID of owner */
            dev_t     st_rdev;    /* device ID (if special file) */
            off_t     st_size;    /* total size, in bytes */
            blksize_t st_blksize; /* blocksize for file system I/O */
            blkcnt_t  st_blocks;  /* number of 512B blocks allocated */
            time_t    st_atime;   /* time of last access */
            time_t    st_mtime;   /* time of last modification */
            time_t    st_ctime;   /* time of last status change */
        };
    

The only one of these likely to change is st_atime, but I don't think that's
guaranteed to update. (I think different behaviours can be compiled into the
linux kernel, including "only update atime if it equals ctime or mtime".)

~~~
justincormack
Or mounting a filesystem with the noatime option, which is common.

------
antirez
This is a good example but more generally it is hardly possible to write
serious long running and reliable software targeting posix in general...

~~~
cperciva
It isn't? The use of close(2) is the only place I'm aware of kivaloo assuming
anything POSIX doesn't provide.

~~~
FooBarWidget
Now that I think about it, dup2() to an existing file descriptor must be
broken in a similar manner. In fact, with dup2() you can't even figure out
whether the implicit close() succeeded.

~~~
FrankBooth
_From the man page:_

    
    
      If  newfd was open, any errors that would have been reported at close(2) time are
      lost.  A careful programmer will not use dup2() or dup3() without closing newfd
      first.

~~~
saurik
Which is actually "interesting" advice, as it cannot be followed in a
multithreaded program without insanely extensive process-wide locks ;P (as any
call to open() between close() and dup2() has an irritatingly high probability
of grabbing whatever file descriptor you just freed up). (Of course, most
usages of dup2() are in contexts just after a fork() or at the beginning of
main(), so I guess this could often be practical; it still seems unfortunate.)

~~~
finnw
Maybe this can still be worked around by copying the target fd first so it can
be closed after fdup()ing the source fd, e.g.

// Assume x and y are open fds and we want to replace y with a copy of x

z = open("/dev/null", O_RDONLY); // Guarantees no other thread is using z

dup2(y, z);

dup2(x, y);

close(z);

~~~
saurik
Hah! (You don't actually need that open("/dev/null") trick, though: you can
just use dup(). ;P)

------
JoeAltmaier
Its time to stop reusing identifiers in kernels, libraries etc. There is no
good reason not to use a uuid pretty much universally?

~~~
ajross
Other than compatibility with 30 years of software and the huge annoyance of
using an identifier that doesn't fit in a machine word. The integer file
descriptor is here to stay.

~~~
JoeAltmaier
New application environments are created all the time.

A new api that exports the true internal (uuid-based) identifier need not
obsolete existing (broken) applications.

------
swah
What about checking how language implementations solve this for their 'closes'
?

~~~
viraptor
If you mean different other languages, that's interesting... Python seems to
be completely oblivious to the issue and signals (Modules/_io/fileio.c
internal_close() and others). There is SA_RESTART added if possible in
Modules/faulthandler.c - but I'm not sure what conditions activate it.

~~~
saurik
Yeah. This can be a serious issue, for the record. It is difficult for me to
come up with an experience that has made me quite as angry as calling
urllib.urlopen().read() in a highly-contended program (Apache2 mpm_worker) and
getting an EINTR exception bubbled all the way up to the top... I mean, what
did they expect I do: retry the entire HTTP fetch? ;P

I finally switched to mpm_event, which caused me to no longer get signals
constantly, but until then I was seriously running a patched copy of Python to
work around this issue; an issue, by the way, which was reported at the
beginning of 2007, and only fixed midway through 2010. :(

<http://bugs.python.org/issue1628205>

The general idea, however, is that if you are using a low-level primitive, one
that nigh unto maps directly to a system call (as opposed to urlopen ;P), then
you actually "want" (supposedly) EINTR to not be handled for you, as it might
be your intention to use it to do something valuable (just as you might from a
C program).

------
halayli
close(2) can also block even when the socket is set to non-blocking.

------
ay
What about wrapping the first solution into a mutex lock ? Then the
multithreaded behavior will disappear. And closing the fd might not be the
most frequent operation usually so should not have much of performance impact.

~~~
JoeAltmaier
Which mutex? What about threads you don't 'own', inside libraries for
instance? I don't think you will be able to find "one ring to rule them all"

~~~
ay
Oh if they use the close direct, too bad. But libraries that launch the
starships behind your back with no control left to you are also not the best
gift.

Standard's bug aside, I wonder how the practical implementations behave. Is
this condition reproducible ?

------
ScottBurson
Amazing how, even in 2011, PC Lusering is still the Right Thing.

------
nknight
This gets even stranger than you might think. Consider fclose()[1], which
says, after the ordinary disclaimer that POSIX defers to the C standard, _"The
fclose() function shall perform the equivalent of a close() on the file
descriptor that is associated with the stream pointed to by stream."_.

And as part of the C standard, we have _"After the call to fclose(), any use
of stream results in undefined behavior."_.

Uh-oh. So does this imply that fclose(), at least, has to do the job right, or
does it mean it inherits the close() flaw, _and_ leaves no defined way to try
and clean up the mess?

[1]
[http://pubs.opengroup.org/onlinepubs/9699919799/functions/fc...](http://pubs.opengroup.org/onlinepubs/9699919799/functions/fclose.html)

~~~
cperciva
Given that the standard states

    
    
        [EINTR]
            The fclose() function was interrupted by a signal.
    

I think it's safe to say that fclose is allowed to fail with no defined way to
clean up the mess.

~~~
nknight
Yeah, I was just looking at that after I posted my initial comment, and I
think I agree with you. That's the only reading I can see that doesn't create
more questions than it answers.

I think in this case, Occam's razor says "this part of the standard is not
fully thought through".

~~~
cperciva
_this part of the standard is not fully thought through_

I agree... and this terrifies me, given that EINTR is mentioned in the man
page for close(2) back in BSD Net/2, over 20 years ago.

~~~
ajasmin
_this terrifies me_

Now think about all the libraries and application that end up calling close()
and fclose().

Add a few abstraction layers and good luck figuring out if closing a file is
safe in a multithreaded Java app.

