Also, there's a bug in HN: When I submitted this under the title "POSIX close(2) is broken" it converted "POSIX" to "Posix" (presumably because PG doesn't like shouting titles), but I was able to edit it back to reading "POSIX".
In this case "POSIX" is correct, not "Posix", of course; but if HN is going to mangle titles it ought to at least do so consistently.
There is no need to mangle titles consistently. The behaviour you describe is a good solution: remove shouting by default, and in the case that it is not really shouting (as in POSIX), the submitter can always fix it by editing it back.
The MIT guy did not see any code that handled this case and asked the New Jersey guy how the problem was handled. The New Jersey guy said that the Unix folks were aware of the problem, but the solution was for the system routine to always finish, but sometimes an error code would be returned that signaled that the system routine had failed to complete its action. A correct user program, then, had to check the error code to determine whether to simply try the system routine again. The MIT guy did not like this solution because it was not the right thing.
The New Jersey guy said that the Unix solution was right because the design philosophy of Unix was simplicity and that the right thing was too complex. Besides, programmers could easily insert this extra test and loop. The MIT guy pointed out that the implementation was simple but the interface to the functionality was complex. The New Jersey guy said that the right tradeoff has been selected in Unix -- namely, implementation simplicity was more important than interface simplicity.
The MIT guy then muttered that sometimes it takes a tough man to make a tender chicken, but the New Jersey guy didn’t understand (I’m not sure I do either).
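For anyone who hasn't written one, the "extra test and loop" from the story looks roughly like this. A minimal sketch, using read() as the interrupted call; the name read_retry is made up for illustration.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: call read() again for as long as it fails with EINTR. */
ssize_t read_retry(int fd, void *buf, size_t count)
{
    ssize_t n;

    do {
        n = read(fd, buf, count);
    } while (n == -1 && errno == EINTR);

    /* Either a byte count (possibly 0 at end of file) or -1 with errno != EINTR. */
    return n;
}
```

This works for read() because a read that fails with EINTR has transferred nothing, so trying again is harmless.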
I considered citing that in my post, but it didn't fit anywhere conveniently. In general I like the New Jersey approach of "push complexity to the caller", but not in cases like this where it's impossible for the caller to handle the complexity correctly.
I have the same issue with eventual consistency, for what it's worth: I'm fine with the fact that my updates might not be globally visible immediately, but I want to have a way that at some point I can know that they have propagated (aka "eventually known consistency").
The problem with "push complexity to the caller" is that there are a lot more callers than implementors, and to make non-buggy code you have to educate callers who
- may not have time ("my boss told me to ship this yesterday")
- may not know ("I couldn't afford a copy of the standard, and I got my documentation by browsing web pages")
- may not understand ("WTF is this extra thing I have to worry about?")
In short, a perfect recipe for making a world of software that mostly works, most of the time, but is full of bugs in edge conditions and under stress.
I don't write code like that. It's never fun to explain to a customer why your stuff falls apart under load, for instance.
The solution is layering: OS-level APIs can push complexity to the caller, but the "caller" can be a library that implements a higher-level API and hides this complexity from the application. Now you have the best of both worlds: people who want to do simple things use higher-level APIs but people who have more sophisticated needs can use the lower-level API directly.
If the lowest-level API does complicated things under the covers, then the power user is left with no recourse when the "one size fits all" API didn't fit.
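As a sketch of that layering, a library can offer a "write everything or fail" call on top of the raw write(), hiding both short writes and EINTR, while write() itself stays available to anyone who needs the low-level behavior. The name write_all is an assumption for illustration, not an existing libc function.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical higher-level helper: write all len bytes, or return -1. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n == -1) {
            if (errno == EINTR)
                continue;       /* interrupted before writing anything: retry */
            return -1;          /* real error, errno is set by write() */
        }
        p += n;                 /* short write: advance past what was written */
        len -= (size_t)n;
    }
    return 0;
}
```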
Yes, exactly. I don't understand why the Unix guys didn't solve this in libc, where it's both easy and simple (provided the system call returns all the information needed, as it surely would have been made to do had this approach been taken).
Sure, occasionally someone will need to issue the bare system call without the libc wrapper, but that could have a different name, e.g. '_close'.
Is eventually known consistency even possible if at any point while the notification of consistency is being transferred to you, the system can enter a non/eventual-consistent state again?
By "eventually known consistency" I don't mean "at some point I will know that the system is consistent"; rather, I mean "at some point I will know that the system is consistent with respect to operation X" (or alternatively "... with respect to all operations performed before time T").
There are actually two distinct things you might want to know:
1) The current state of the system.
2) Whether the current state of the system is equal to or later than one reflecting your change.
#1 is technically unknowable without a locking mechanism, but #2 is theoretically possible. Assuming reliable time synchronization, your worst-case scenario would be a timestamp on every record, and checking the timestamp on each node (you can do this with Cassandra, for example).
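A rough sketch of the check for #2, assuming you have already fetched the record's timestamp from every node by some means (nothing here is a Cassandra API, and the function and parameter names are made up):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns true when every node reports a record timestamp at or after the
 * timestamp of our write. Assumes synchronized clocks, as stated above.
 */
bool write_has_propagated(const int64_t *node_timestamps, size_t node_count,
                          int64_t write_timestamp)
{
    for (size_t i = 0; i < node_count; i++) {
        if (node_timestamps[i] < write_timestamp)
            return false;   /* this node still holds an older version */
    }
    return true;
}
```

A true result only tells you the state is "equal to or later than" your change; a later writer may already have replaced it.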
(Apparently, the first thing Linux does is deallocate the file descriptor; /then/ it starts flushing pending written data. If this process is interrupted, it will return EINTR, but the file descriptor itself is already deallocated, and may have been reused long before close() returned.)
Honestly, after reviewing all of this material, I think the author of this post might simply not know what "unspecified" means (or, alternatively, is making an abstract problem sound overly concrete and blown somewhat out of proportion).
When a specification says "unspecified", it doesn't mean "unknown at runtime, could be anything nondeterministically". It only means that the standard did not specify the result.
Often, this is because there is either a historical artifact or reasonable practical constraint that was posed by one or more vendors that were implementing the standard that meant different implementations simply /weren't/ going to agree on any one value.
However, for any given implementation, there very likely might be a correct answer. In the case of Linux, you should not retry calls to close(): the primary maintainer has told us that it deterministically guarantees the file descriptor is closed even if the function returns EINTR (critical edit: the wording of this sentence was incorrect in original draft).
That said, there may be another implementation out there, possibly one we may use (but with a standard like this, could easily be one that we have never heard of and that already died a death of obscurity years ago) that has the opposite behavior: where EINTR interrupts something (such as pending writes) before the file descriptor is closed, requiring developers to retry.
It is, however, unfortunate that this happened, and the author is right about issues for developers who attempt to use this API "without assuming more than the standard specified".
That said, there may be other options, such as using SA_RESTART, as argued in this discussion I just found (which also happens to talk about the meaning of "unspecified" ;P):
(edit:) Oh, and people may find it interesting that standards like this are often modified over time, with "interpretations" and outright edits. As an example, here is some vaguely related discussion regarding closing file descriptors during process termination:
When a specification says "unspecified", it doesn't mean "unknown at runtime, could be anything nondeterministically". It only means that the standard did not specify the result.
Sure. But if you want to write portable code, this means you can't assume any particular behaviour.
If you never want to run on anything except recent versions of Linux, life is much easier for you. But some of us don't like being so limited.
Sure. Or just don't close() from a thread where you want to handle signals (without terminating, that is). Signals are the broken bits, frankly, not the POSIX standard. They have never played well with system calls. There's an essay somewhere I remember reading where an ITS hacker looks at Unix for how it handles the interrupted system call problem and comes away horrified at the discovery that it makes the user do it via EINTR.
There's no good reason to be catching synchronous signals in modern code[1]. See signalfd() et al.
[1] Well, except for things that can only be delivered synchronously like SIGSEGV for user-handled paging. But that's complicated enough that the extra complexity of handling a SIGSEGV delivered out of a syscall is probably tolerable.
> There's an essay somewhere I remember reading where an ITS hacker looks at Unix for how it handles the interrupted system call problem and comes away horrified at the discovery that it makes the user do it via EINTR.
> There's no good reason to be catching synchronous signals in modern code
In libraries you can never know what the main program will do, and who wants to write a library that says "this library will malfunction if you use OS feature X"?
> See signalfd() et al.
That's very cool, I had not seen that! But it's Linux-specific.
> There's an essay somewhere I remember reading where an ITS hacker looks at Unix for how it handles the interrupted system call problem and comes away horrified at the discovery that it makes the user do it via EINTR.
Not really the real point of the essay, but yeah, there kind of is. Now come up with a design (with similar premises; you can't just argue that the whole system call semantics have to change...) which does not involve the user, and we will talk.
Given that SA_RESTART already manages to perform that restart without the user's intervention, it is obviously pretty simple to come up with such a design: you just make SA_RESTART the default behavior. ;P
I said don't catch signals, not don't use them. See signalfd() et al.
Though I'm curious why you want to interrupt those syscalls. In particular, why are you waiting on a process that you don't know is already dead? That's what SIGCHLD is for. And the only case I can think of for a blocking open() is a network/fuse mounted file. In which case the filesystem will surely eventually return an error (and if it doesn't no amount of voodoo in your application logic is going to correct the bugs in the underlying system).
FreeBSD: It seems pretty clear in FreeBSD that you should not retry the call to close(); sys_close() calls kern_close() which locks the file, frees the file descriptor (!), and /then/ calls closef() (which, if capable of returning EINTR, is already too late).
Mac OS X: Here you should not retry, but you should also be careful; depending on whether you are UNIX2003 you will get different behavior from the close() function from libSystem, directing you to either the syscall close() or close_nocancel().
If you end up with close(), then the first thing that happens is __pthread_testcancel(1) is called, and if the thread is cancelled it will return EINTR before doing anything else: in this case, you would need to retry.
However, I think close_nocancel(), which calls closef_locked(), might be capable of returning EINTR, which will be held and only returned from close_internal_locked() after _fdrelse() has already removed the descriptor.
So, if it is the case that EINTR is capable of being returned from the closef_locked() call, you would need to /not/ retry, which thereby means that the close() version of this call is impossible to use safely on Mac OS X: if I were you, I'd avoid it and use close_nocancel() instead (explicitly if warranted).
Notice how long and complex your discussion of this issue is, and how many different systems you have to investigate in depth to even begin to form a coherent picture.
All we're trying to do is close a file safely. This is core functionality that virtually every application will have to depend on. It would be like free() saying it might be interrupted by a signal, and you have no way of knowing if the memory has actually been deallocated or not. That would be insane.
I believe FreeBSD will never return EINTR for close, but I wouldn't bet my life on it. I've heard rumours that Solaris can return EINTR, but likewise I make no guarantees. I have no idea what OS X does.
Yes. Afterwards, I learned more about standards and specifications (including taking on a weird fetish for lurking on the mailing lists where people actively are working on them), and realized that my attempts were flawed by premise. ;P
Seriously, POSIX is /not/ able to provide for you the ability to have a single program always work on every system: they tried really hard, but the world isn't perfect. They were (and are) attempting to unify and control something really complex, and they made remarkable progress given how many people they were trying to bring together who already had existing incompatible implementations.
In this case, they carefully made clear that this behavior was unspecified. That does not mean that they failed or that their specification was broken. In fact, I'd argue the opposite: if they had required implementations to do something in specific, the specification would be broken as it would not have described reality. You can't just claim that the implementations of your standard that people are actively trying to code against don't exist or are incorrect.
Honestly, though, to take a more direct appraisal of your question, the real epiphany for me came when someone clubbed me over the head with the difference between "portable" and "ported", and then demonstrated that all of the people that had come before me whose work I most admired had concentrated on making "portable" code as opposed to "ported" code: the most amazing code I've ever seen is the code that has managed to easily be adapted to changing environments as it had the simplest design and most powerful abstractions from the underlying systems, often as a direct result of attempting to embrace so many unrelated platforms.
Which then leads to a "better" question: have you ever tried to write code that could easily be ported between multiple operating systems, whether they be any of the numerous implementations of Unix (old or new, BSD or System V, largely compliant or downright buggy), Windows (using native APIs, not compatibility wrappers), or Mac OS (9, not X)?
If not, I recommend trying it, as that is what "portability" really is: once you experience it for your own code, it is difficult to take projects that insist on only working on a single homogeneous set of environments seriously anymore.
> I think the author of this post simply doesn't know what "unspecified" means.
I'm pretty confident Colin Percival knows what "unspecified" means. I suspect he also knows how it's different from "implementation-defined" (also pointed out in the thread you link to), and how it means that, officially, we don't even have a way to determine what an implementation does, because they don't have to document their choice.
SA_RESTART is also not a guaranteed solution for someone trying to be standards-compliant, since it's part of the XSI extensions, which may not be present or otherwise required for the application.
For such core functionality, this is a dangerous bit of ambiguity.
I claim the issue here is that this standard is not evil/epic magic. It did not and could not solve all possible portability issues between different implementations.
After continuing to research this issue (as I love looking at this kind of stuff: understanding more about the history and complexity of implementations excites me), I finally hit jackpot.
HP-UX 11.22: """[EINTR] An attempt to close a slow device or connection or file with pending aio requests was interrupted by a signal. The file descriptor still points to an open device or connection or file."""
AIX 5.3: """If the close subroutine is interrupted by a signal that is caught, it returns a value of -1, the errno global variable is set to EINTR and the state of the FileDescriptor parameter is closed."""
So, here we have two Unix implementations--both of which predated POSIX--that have incompatible definitions of close(). I do not feel it is reasonable to demand POSIX solve this, and sure enough: it didn't.
Now, maybe that makes POSIX "useless" (if what you care about is being able to write code without knowing anything about the target), but I don't think it is fair to claim that the definition of close is "broken".
In distinction, it clearly states that the behavior is "unspecified", which is a bothersome yet practically acceptable tradeoff. Some things in life are simply unspecified. :(
I do not feel it is reasonable to demand POSIX solve this
I think it would be entirely reasonable to demand that POSIX resolve this. But I've always come down on the side of prescriptive standards rather than merely descriptive standards.
Interesting, I didn't know anything actually refused to close the file. HP-SUX sucks again.
From one standpoint, I can see that would even be the desired behavior, because it gives the app the opportunity to deal with the error in some meaningful way, like to finish flushing. But since everybody else has long since decided that if you care you flush then close, being different is just aggravating. Although it sounds like flush then close will work everywhere.
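A minimal sketch of "flush then close", reading "flush" as fsync() (that reading is an assumption, as is the helper name). The flush can be retried safely because the descriptor is unambiguously still open while it runs, so every data error surfaces before close() is reached.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: push data to storage first, then close. */
int flush_then_close(int fd)
{
    int rc;

    do {
        rc = fsync(fd);
    } while (rc == -1 && errno == EINTR);

    if (rc == -1)
        return -1;      /* flush failed; fd is still open and the caller decides */

    /* Nothing is left to write, so close() has much less to report. */
    return close(fd);
}
```

This does not remove the ambiguity if close() itself still returns EINTR; it only ensures no pending data is riding on the answer.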
I think you're missing the broader point. We don't actually have a problem with the close() behavior differing between implementations, we have a problem because there is no way to determine the correct course of action afterward.
For HP-UX, it's "close it again".
For AIX, it's apparently "everything is fine, don't close it again". Linux, too.
That could be OK, even when undocumented, IF we had a way to discover what had occurred. But we don't. close() isn't broken because it can behave in multiple ways that can't be predicted in advance, it's broken because there's no safe way to proceed.
And by the way, we do know something about the target when we're writing POSIX-compliant code. We know the target is POSIX-compliant. That's the whole point of POSIX.
That is totally fair: even something as simple as a #define available inside of a header file could have allowed this to be discovered from the source code, which would have not only not broken any existing code, but also would have allowed existing implementations to not have unreasonable burdens.
I just have a difficult time with the rather strong terminology that the definition is broken: the definition is consistent, reasonable, and even somewhat workable... it is just not terribly useful, satisfying, or comforting; if anything, I'd say "incomplete".
(Although, given that it chose "unspecified" rather than "implementation-defined", one has to wonder whether some existing system actually did something horrific in this case and left non-deterministic behavior. I guess there could always be a constant for "you're screwed", though. ;P)
Sure, there's a way to tell. There are lots of ways of telling which platform you're on (#defines), and platforms can document the expected behavior. If a platform doesn't, then complain to the vendor.
Yes, it sucks to resort to conditional compilation around "close", but it's already been explained why it's necessary. POSIX was not created in a vacuum as a theoretically ideal OS specification.
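Something like the following is what conditional compilation around close() might look like. It is a sketch built only on the behavior reported in this thread (HP-UX leaves the descriptor open on EINTR; AIX, Linux and FreeBSD do not). The macro and function names are made up, and treating every unlisted platform as "already closed" is an assumption you would have to verify against each vendor's documentation.

```c
#include <errno.h>
#include <unistd.h>

/* Assumption taken from the man-page excerpts quoted in this thread. */
#if defined(__hpux)
#define CLOSE_EINTR_LEAVES_FD_OPEN 1
#else
#define CLOSE_EINTR_LEAVES_FD_OPEN 0
#endif

/* Hypothetical wrapper: close fd using the platform's EINTR convention. */
int portable_close(int fd)
{
#if CLOSE_EINTR_LEAVES_FD_OPEN
    int rc;

    /* The descriptor survives EINTR here, so retrying is required. */
    do {
        rc = close(fd);
    } while (rc == -1 && errno == EINTR);
    return rc;
#else
    int rc = close(fd);

    /* The descriptor is already gone; a retry could close another thread's fd. */
    if (rc == -1 && errno == EINTR)
        return 0;
    return rc;
#endif
}
```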
> SA_RESTART is also not a guaranteed solution for someone trying to be standards-compliant, since it's part of the XSI extensions, which may not be present or otherwise required for the application.
I need to backtrack on this. I just discovered SA_RESTART appears without surrounding XSI tags in the 2008 (Issue 7) standard. This was not the case in the 2004 (Issue 6) standard.
EDIT: I had here a discussion of whether dropping the tags was intentional, but now I see I missed the key sentence in the 2008 update: "Functionality relating to the Realtime Signals Extension option is moved to the Base."
SA_RESTART is core POSIX functionality from the 2008 edition on. Of course there are many systems out there not compliant with 2008, but this does change the picture somewhat.
My first reaction to the thread-safety problem: fstat before close, if it fails with EINTR, fstat again. Then you can tell both the state of the fd (since fstat will return EBADF if it's closed), and if it's still a valid fd, what file it actually refers to. So if another thread did open() and reallocate the same fd number, you'll get a different struct stat back, and then know not to retry the close.
As in:
    #include <errno.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int safeclose(int fd)
    {
        struct stat before, after;

        /* Remember which file fd refers to before trying to close it */
        if (fstat(fd, &before))
            return -1;
        while (close(fd)) {
            if (errno != EINTR)
                return -1;
            /* If fstat() fails, our close() succeeded */
            if (fstat(fd, &after))
                return 0;
            /* If we've got a different file, our close() succeeded */
            if (before.st_dev != after.st_dev
                || before.st_ino != after.st_ino /* whatever other necessary checks here... */)
                return 0;
            /* Otherwise we've got the same file still open, retry */
        }
        return 0;
    }
Any problems/races I'm not seeing? Is comparing 'struct stat' potentially not 100% reliable?
It looks like if the other thread opened the same file you might get a false positive.
    struct stat {
        dev_t     st_dev;     /* ID of device containing file */
        ino_t     st_ino;     /* inode number */
        mode_t    st_mode;    /* protection */
        nlink_t   st_nlink;   /* number of hard links */
        uid_t     st_uid;     /* user ID of owner */
        gid_t     st_gid;     /* group ID of owner */
        dev_t     st_rdev;    /* device ID (if special file) */
        off_t     st_size;    /* total size, in bytes */
        blksize_t st_blksize; /* blocksize for file system I/O */
        blkcnt_t  st_blocks;  /* number of 512B blocks allocated */
        time_t    st_atime;   /* time of last access */
        time_t    st_mtime;   /* time of last modification */
        time_t    st_ctime;   /* time of last status change */
    };
The only one of these likely to change is st_atime, but I don't think that's guaranteed to update. (I think different behaviours can be compiled into the Linux kernel, including "only update atime if it equals ctime or mtime".)
True, hadn't thought of that. (Seems obvious in retrospect, of course.)
It's almost tempting to try to differentiate that with st_atime, but I don't think there's any way that could be reliable (especially given noatime mounts and such).
1) Thread A closes fd X for File Y, Thread B opens fd X for File Y. Thread A sees fd X referring to File Y, closes it again out from under Thread B.
2) Since the behavior is unspecified, it's possible you could have fd X briefly continue to point to File Y. This could lead to a case where your fstat() results cease to be valid by the time you run close() again.
3) fd doesn't have to be a normal on-disk file, and POSIX is somewhat vague about what the stat struct's values are for the various possible types of "file".
Just another example among countless others: you fsync a file descriptor in one thread, then write to the same fd from another thread, and write(2) will block. Dozens of semantic behaviors like this are NOT part of POSIX, but are part of real-world systems.
So anyway in order to write non trivial systems you have to understand the implementation of different system calls in different operating systems.
Now that I think about it, dup2() to an existing file descriptor must be broken in a similar manner. In fact, with dup2() you can't even figure out whether the implicit close() succeeded.
If newfd was open, any errors that would have been reported at close(2) time are
lost. A careful programmer will not use dup2() or dup3() without closing newfd
first.
Which is actually "interesting" advice, as it cannot be followed in a multithreaded program without insanely extensive process-wide locks ;P (as any call to open() between close() and dup2() has an irritatingly high probability of grabbing whatever file descriptor you just freed up). (Of course, most usages of dup2() are in contexts just after a fork() or at the beginning of main(), so I guess this could often be practical; it still seems unfortunate.)
Other than compatibility with 30 years of software and the huge annoyance of using an identifier that doesn't fit in a machine word. The integer file descriptor is here to stay.
Yes; but uuids are network-sharable, never expire (never get reused), can be conjured up without cooperation with anybody. You can merge two handle spaces without any issues, ever.
If you mean other languages, that's interesting... Python seems to be completely oblivious to the issue and to signals (Modules/_io/fileio.c internal_close() and others). SA_RESTART is added if possible in Modules/faulthandler.c, but I'm not sure what conditions activate it.
Yeah. This can be a serious issue, for the record. It is difficult for me to come up with an experience that has made me quite as angry as calling urllib.urlopen().read() in a highly-contended program (Apache2 mpm_worker) and getting an EINTR exception bubbled all the way up to the top... I mean, what did they expect I do: retry the entire HTTP fetch? ;P
I finally switched to mpm_event, which caused me to no longer get signals constantly, but until then I was seriously running a patched copy of Python to work around this issue; an issue, by the way, which was reported at the beginning of 2007, and only fixed midway through 2010. :(
The general idea, however, is that if you are using a low-level primitive, one that nigh unto maps directly to a system call (as opposed to urlopen ;P), then you actually "want" (supposedly) EINTR to not be handled for you, as it might be your intention to use it to do something valuable (just as you might from a C program).
I assume you're talking about dynamic languages like Python or Ruby. It's common for underlying system characteristics to leak through the abstractions, so something like this isn't necessarily "handled".
In CPython's case, as far as I can tell, it uses bare, unchecked close() calls, and just assumes they succeed.
(It does, however, provide an interface for dealing with signals, and one of the capabilities, as of 2.6, is to specify whether system calls will restart. See some discussion on SA_RESTART elsewhere in these comments.)
What about wrapping the first solution in a mutex lock? Then the multithreaded behavior will disappear. And closing the fd is usually not the most frequent operation, so it should not have much of a performance impact.
Only locking the close() is not enough. Another thread could still get the same file descriptor through some other syscall while you loop around the close(). You would need to lock all syscalls that create or close a file descriptor, which is a) infeasible and b) way too slow.
In other words: it's a race condition between allocating and deallocating a fd and not between two deallocations.
Oh, if they use close() directly, too bad. But libraries that launch the starships behind your back with no control left to you are also not the best gift.
Standard's bug aside, I wonder how the practical implementations behave. Is this condition reproducible?
This gets even stranger than you might think. Consider fclose()[1], which says, after the ordinary disclaimer that POSIX defers to the C standard, "The fclose() function shall perform the equivalent of a close() on the file descriptor that is associated with the stream pointed to by stream.".
And as part of the C standard, we have "After the call to fclose(), any use of stream results in undefined behavior.".
Uh-oh. So does this imply that fclose(), at least, has to do the job right, or does it mean it inherits the close() flaw, and leaves no defined way to try and clean up the mess?
Yeah, I was just looking at that after I posted my initial comment, and I think I agree with you. That's the only reading I can see that doesn't create more questions than it answers.
I think in this case, Occam's razor says "this part of the standard is not fully thought through".