Unix System Call Timeouts

Animats · on March 13, 2017

This is done right in QNX. In QNX, everything that can block can be given a timeout, using TimerTimeout().[1] This is useful in real-time code with moderate time constraints. If you're logging to disk, and the disk stalls because of a hardware problem, you may need to get control back so you can keep doing your real-time work. It's more important to, say, issue steering commands than log.

After a timeout, you don't know how the request that blocked ended. If you need that, a multi-thread or multi-process solution is indicated.

In general, QNX is much better at timing, hard real time scheduling, and inter-process communication than UNIX/Linux. In the Unix/Linux world, those were afterthoughts; in QNX, they're core features. Because all I/O is via interprocess communication, it takes an extra memory to memory copy; this seems to add maybe 10% CPU overhead. That's the microkernel penalty, a moderate fixed cost.

[1] http://www.qnx.com/developers/docs/7.0.0/#com.qnx.doc.neutri...

nwmcsween · on March 13, 2017

A better solution is to have async syscalls, this helps both the scheduler (you could batch return) and given a proper language the developer as well. The microkernel approach still suffers from the high context switch overhead and IMO it would be better to use something similar to an exokernel while implementing most things in userspace.

panic · on March 13, 2017

The problem with asynchronous operations is that there has to be a buffer somewhere. Once you have a buffer, you have to answer the question: what happens when the buffer is full? If the answer is to block, then you haven't really solved anything.

nwmcsween · on March 13, 2017

This is why you want something like an exokernel.

pjmlp · on March 13, 2017

Windows comes to my mind....

klodolph · on March 13, 2017

I'd like to point out here that the basic Unix assumption is that we don't care about blocking on local disk. So mkdir() should not have a timeout because it should always be fast enough not to need it. Yes, that's not true, local disk can be very slow, and there's NFS and FUSE, but it's the assumption Unix is built on.

Have you ever had a process waiting for disk that you couldn't kill? Like running "ls" in a big directory, which hangs, and you pound ^C over and over again in vain? That's this same assumption at work here. It's the difference between D and S states on Linux. So if you want timeouts on mkdir(), what you really want is a complete reengineering of the Unix system with the philosophy that local disk IO is interruptible.

kosinus · on March 13, 2017

Pretty much, yes. I think that assumption turned bad a really long time ago?

Interruptible sounds like a very strong requirement. Just being able to wait for the result sanely is enough, I think. A process may then decide to lose interest halfway, but having the result be undefined and fds invalid is just fine, from what I've seen in e.g. node.js/libuv.

microcolonel · on March 13, 2017

Keep in mind that local storage is getting lower latency year over year. NVMe was a big step, but there could soon be more hardware memory mapped flash storage (like DIMM SSDs), and 3D XPoint. I think the overhead will soon exceed the operation in terms of cost, especially if it's asynchronous.

vidarh · on March 13, 2017

That assumes uncontended access.

geofft · on March 13, 2017

You can effectively get wait() with a timeout by setting a handler for SIGCHLD, spawning the child, and using select(). This works on all UNIX and gets you everything signalfd would. And as long as you set the handler before you start the child, you don't have to worry about the self-pipe trick or pselect, since you avoid the race condition. The sigtimedwait() approach also seems fine.

The big trouble you have is that there is no approach to signals that composes; if someone else is waiting for a SIGCHLD via some other means, you're likely to eat the signal they were waiting for. If you're particularly unlucky and both children terminate while you're not scheduled, you'll only get a single SIGCHLD - even with siginfo_t, all that means is at least this child exited, maybe others did too.

Perhaps the better approach is to just spawn a thread for this, which is also the general-case answer for every other syscall, whether it's wait(), mkdir(), readlink(), or sync_file_range(). On the other hand, threads are a "now you have two problems" sort of solution.... I'd like to see a generic API where you can submit any system call to the kernel, as if it were on another thread, and select on its completion.

PopsiclePete · on March 13, 2017

The NT kernel got async I/O right, for the general case, over 20 years ago: https://msdn.microsoft.com/en-us/library/windows/desktop/ms6...

It's scary, really, how behind the times UNIX is, in so many ways.

DSMan195276 · on March 13, 2017

I would agree with you. The most natural way to do it seems to be SIGCHLD and a timeout function like select - when the signal hits, the timeout returns early.

I'd add that the most 'Unixy' way to do it probably is to use `alarm()`, though like threads it is a bit of a "now you have two problems" solution as well. But if you blocked all of your signals, spawned your child, called alarm(), and then did a `sigwait()` on SIGCHLD and SIGALRM I think that would work fairly well and be decently simple. You just check which of the two hits first and then handle them accordingly.

I'd also add that even with the threaded cases it's not actually guaranteed to work. Lots of these syscalls may put the program into uninterruptible sleep, which will prevent you from killing the thread until the action is done (And for lots of things this is a necessity, since otherwise you could leave things in an unknown state. Userspace shouldn't be able to corrupt the file-system just by specifying a weird timeout.).

IMO, like others have said I think the real solution to a timeout for things like `mkdir` is in the kernel: If you're attempting to access something like a NFS mount and it's taking too long, it will fail and you get an EIO, which is not extremely graceful, but does mean that you won't sit there forever - it will return after a period of time the kernel considers sufficient to know that it's not going to happen.

In userspace you ideally have no idea what you're actually accessing, and thus have no basis for what an accurate timeout should be anyway. It doesn't make sense to pass a hard-coded timeout to `mkdir()` when that value will probably be stupid long for some things and not long enough for others, and that's a much bigger usage issue then certain programs taking a long time on slow filesystems.

AstralStorm · on March 13, 2017

You may also interrupt most operations using pthread_cancel and a few other pthread functions.

geofft · on March 13, 2017

pthread_cancel to abort a syscall is usually internally implemented by sending the thread a signal (on glibc, it's one of the real-time signals below SIGRTMIN), causing the syscall to return EINTR. So it's the same logic as SIGALRM, essentially.

(And for some system calls, like certain NFS operations, they aren't interruptible by a signal of any sort other than perhaps SIGKILL.)

0xbadcafebee · on March 13, 2017

> there’s a trick to turning these kinds of I/O operations into something you can put into an event loop: you run the desired operation (mkdir() in this case) in another thread, and then wait for the thread to finish with a timeout

This isn't a trick, this is expected. Calls return when they're supposed to return, or not at all. If you need to return BEFORE the system call is done, you need to do it somewhere else (like in a new thread or process, or node). This is also not limited to I/O, but basically any system call: if you return before it's done, it may break something, so it might not provide a good timeout method.

Something to ask yourself is also why you need to return before the call is done? It's similar to the NFS hard-vs-soft-mounting argument. Soft mounting can cause damage when improperly interrupted; hard mounting prevents this by waiting until the system is behaving properly again, with the side effect of pissing off the users.

deathanatos · on March 13, 2017

In some cases, you have knowledge that you're about to do a whole bunch of I/O operations, sometimes all of the same time, sometimes not. (It doesn't really matter.) Ideally, I'd like to transfer this knowledge — that is, the list of I/O operations — to the kernel wholesale, so that it also has complete knowledge of the task at hand, and can better figure out the most optimal way to go about completing that (which it really can't do if it can't see the whole picture). This might mean scheduling disk operations more efficiently, to avoid seeking, or batching multiple network requests into a single packet, etc.

You can't do that with synchronous APIs without hackery, since the very structure of the API is self-defeating when it comes to getting the complete picture to the kernel. If I have a 1000 I/O operations, I do not want to spawn 1000 hardware threads: relative to the amount of information required to describe an I/O op, threads are incredibly expensive.

I don't want the syscall to represent the entirety of the work, simply, the request to have the work performed. The kernel's response is then essentially "Acknowledged, beginning this I/O. Here's a handle/means¹ to obtain the result of the operation." Then, I can batch-request notifications of results through some kernel I/O event queue … e.g., kqueue or epoll.

¹if handles are too much, you could also agree to have it stuck in some sort of queue of results, that might be usable with kqueue/epoll.

0xbadcafebee · on March 13, 2017

Do you want this bulk i/o syscall to inform you every time an operation is complete, or in stages? Do you want it to prioritize latency over bulk operations? Do you want it to take up more or less CPU? Will interrupts get thrown each time you query the status? Do you want to know when the operation is complete on spindle, in on-disk cache, in the filesystem cache? Do you want it to handle network filesystems differently? Do you want it to take advantage of multichannel NCQ and other features or implement your own in the kernel? Do you want this new i/o scheduler to affect the rest of the system's i/o, or only your application's? Do you want multiple applications to use different command queues or for yours to trump them (priority) ? Do you want the kernel to implement it's own batch ordering or rely on vendor firmware? (It sounded at first like you were describing vectored i/o but I assume you want something more abstract than that, kinda like a more generalized blk-multiqueue?)

deathanatos · on March 13, 2017

All good questions, but none of these seem possible in today's POSIX APIs either. (Most, I feel, probably are best just implemented as "options" to the syscall in either the sync or async view of the world.) The point was more to have async operations be possible, whereas today, they're not.

> It sounded at first like you were describing vectored i/o but I assume you want something more abstract than that, kinda like a more generalized blk-multiqueue?

Asynchronous I/O, not so much vectored (though vectored is similar, but I want to stay away from that term as most of the APIs (e.g., readv writev) I've seen for that aren't actually asynchronous and are just more efficient user-to-kernel bindings).

spc476 · on March 13, 2017

In our case at $WORK, it's because we have hard timeouts to meet our service level agreements with the Monopolistic Phone Companies. We need to return a response within X time, no exceptions.

0xbadcafebee · on March 13, 2017

In my past working with teams with network service SLAs, they had to design a robust multithreaded backend app and modify the frontend service to ensure all http transactions finished within 60ms. Timeouts were one of the smaller concerns...

sigil · on March 13, 2017

Shouldn't this work?

- In the parent, open a pipe. Fork.

- Parent closes the write side. Child closes the read side, clears FD_CLOEXEC on it, and execs.

- Parent does a select(2) on the pipe's read side with the desired timeout. If the child exits before the timeout, the pipe becomes "EOF readable." Either way select(2) returns on or before the timeout.

I get that this trick won't work for imposing timeouts on other syscalls (the author points to mkdir(2)) but isn't that the purvey of an RTOS?

AstralStorm · on March 13, 2017

Poll and epoll also work.

RTOS should prefer asynchronous calls and lockfree queues instead.

microcolonel · on March 13, 2017

That sounds like a nightmare. I doubt many mortals could do hard realtime on a fully-asynchronous RTOS.

sebcat · on March 13, 2017

FreeBSD has pdfork, which works with process descriptors instead of signals. No more signal handlers needed, and you can use process descriptors as you'd use file descriptors with calls to poll, select, kevent, &c.

sargun · on March 13, 2017

If you really need this kind of abstraction for syscalls, I might suggest looking at Erlang, or Golang. Both of these languages and run times provide abstractions for these "bugs" in the underlying APIs.

In Erlang, all I/O goes through things called "ports", and does not block, but it's all asynchronous. There is no reason you couldn't do the same in C -- spin up an external process with a SIGALARM, and maintain a socket that you use, instead of dispatching the syscalls in the same process / thread you run the rest of your code.

In Go, syscalls are scheduled on their own thread (typically), and you can run a goroutine, and wait for the output of the call.

Yes, these bugs exist, and they sometimes make writing code a pain, but they're mostly "solved" if you use the right tool.

vbernat · on March 13, 2017

Go doesn't have a general abstraction for that either. If you put something in a goroutine, you usually cannot cancel it. If you decide to timeout, the goroutine will continue to run. This may not be important for an mkdir(), but this may be for other stuff, like for example for accepting a connection. The not-cancelled goroutine will have an important side-effect on your program (stealing a connection).

Moreover, Go wrapper functions automatically restart a syscall when it is interrupted. Therefore, you cannot cancel a syscall by sending yourself a signal.

This may change in the future when everything will take a context (at the cost of an API change), but this is currently not the case.

loeg · on March 13, 2017

Yeah, wait() doesn't have a standard way to wait for a timeout. I would wait (with timeout) for the SIGCHLD with some OS-dependent interface (signalfd+poll/select/epoll on Linux, or kevent on BSD).

But: What possible semantics could you want for mkdir() such that a timeout is sane?

TheDong · on March 13, 2017

> But: What possible semantics could you want for mkdir() such that a timeout is sane?

nfs is mentioned a few lines later. Creating directories on any sort of network mount can and will time out, and it's better if your application can handle that sort of error rather than just wait forever.

Yes, the error can't conclusively say "the directory wasn't actually created", but at least it can say "it might not have been, dunno"

FUSE filesystems also can have bugs resulting in mkdir hanging forever.

Neither of those are all that insane.

loeg · on March 13, 2017

> nfs is mentioned a few lines later. Creating directories on any sort of network mount can and will time out

NFS basically assumes a reliable network. Operations should complete more or less immediately.

How is a syscall-level timeout more useful than a whole-application watchdog for handling flaky mounted filesystems?

If you want a flaky network filesystem, I think it's probably better to go with some application-level protocol rather than putting it below Unix's filesystem interface.

> FUSE filesystems also can have bugs resulting in mkdir hanging forever.

Filesystem bugs are filesystem bugs -- the application layer shouldn't have to work around them.

spc476 · on March 13, 2017

As if. At $WORK, we service requests (millions per day) that require access to NFS mounted files [1]. A network hiccup could effect one request but not another. I'm not sure how a "whole-application watchdog" would work for us.

As it is, the main thread will schedule a "wakeup" from another thread whose sole existence is to interrupt file system calls stuck in NFS. Yes, it would be nice if such code didn't have to exist. But it does.

On the plus side, it's there for a feature that isn't used that often. On the bad side, it's there for a feature that isn't used that often.

[1] Like most such decisions, it was a political decision [2] rather than a technical decision.

[2] Strippers and steak, baby! Strippers and steak! And those responsible are not bound by their decisions. We lowly engineers are.

xenadu02 · on March 13, 2017

I'm not picking on you specifically but when people say things like "Unix gives you brain damage" this is what they mean.

Your suggestion is that the network should be reliable (a strict impossibility) or that every application should use a bespoke protocol to do simple things like working with a network file system.

The Unix IO model isn't well designed and is far from the only Unix-ism to work poorly in the face of threads (signals and fork/exec are common offenders; not even memcpy is safe in a signal handler!)

This is hardly surprising: it was slapped together ad-hoc over the years.

fulafel · on March 13, 2017

NFS violates a number of basic Unix filesystem guarantees. It's well documented that it violates POSIX guaranteed filesystem semantics. The abstraction NFS chose is "filesystem operations don't time out", and implements this by just making your process hang as long as the network is dead - possibly forever.

I'm not sure it's a bad thing that the Unix FS API doesn't do networked filesystems. Everbody is using userspace APIs for remote file access, and the big and varied design space of remote file access semantics doesn't get bottlenecked on 30 year old system call APIs.

loeg · on March 13, 2017

> I'm not picking on you specifically but when people say things like "Unix gives you brain damage" this is what they mean.

You understand that's still offensive and implies I have brain damage, right?

> every application should use a bespoke protocol to do simple things like working with a network file system.

It doesn't have to be bespoke, but I agree that the Unix VFS interface isn't appropriate for it.

What operating system are you aware of that provides a mkdir interface with timeout? Windows' CreateDirectory() does not. OS X is just BSD. We're departing the realm of mainstream systems. QNX mkdir does not appear to take a timeout. Plan9 uses create() with a DIR flag, which does not take a timeout. Help me out here?

xenadu02 · on March 13, 2017

> You understand that's still offensive and implies I have brain damage, right?

My apologize, I didn't intend that. I mean that some people are used to Unix and use their familiarity as a post-hoc rationalization for some of the poorly designed aspects of it.

To be clear: my position is that several Unix APIs are bad designs. Rather than making excuses for them we should demand better cross-platform APIs.

> It doesn't have to be bespoke, but I agree that the Unix VFS interface isn't appropriate for it.

I don't agree. Network filesystems should be mountable and traversable just like local filesystems, but the OS and libraries should do more to be async and multi-thread aware, as should applications.

pjmlp · on March 13, 2017

> Rather than making excuses for them we should demand better cross-platform APIs.

We already have them in the form of language runtimes.

POSIX as cross-platform API is mostly relevant to C.

Even C++ is moving beyond POSIX, by providing proper OS abstractions on the standard library.

fiddlerwoaroof · on March 13, 2017

I suspect old systems like Multics did.

But, this is the sort of thing that Richard Gabriel is talking about in worse is better[1]

[1]: https://www.dreamsongs.com/WorseIsBetter.html

slrz · on March 13, 2017

> The Unix IO model isn't well designed and is far from the only Unix-ism to work poorly in the face of threads

Right, threads in the form of pthreads should've never been introduced into Unix, as they made a huge mess of many interfaces.

The rfork "you have processes that can share resources" model is a much better fit (although it doesn't solve all problems). It's what Linux as well as FreeBSD implement at the kernel level anyway except they add tons of ugly stuff on top of it to allow for performant implementations of POSIX threads semantics.

geofft · on March 13, 2017

> NFS basically assumes a reliable network. Operations should complete more or less immediately.

I don't know what you mean by this; are you claiming that NFS' design, by nature of being part of the kernel VFS, only makes sense in a reliable network because it pretends to work like a local filesystem? Then yes, we're discussing ways to fix that.

But if you're claiming NFS only works in reliable networks, that's absolutely untrue. NFSv3, at least, is designed to work in unreliable networks by being stateless (dare I say RESTful) and having well-defined semantics for operations that may not have completed. Close-to-open consistency is a way of sacrificing consistency for availability. And so forth. I've used it in production on remarkably unreliable networks and it absolutely works. But, of course, you need the application to be aware that it's using a filesystem that will expose the network's unreliability.

By "unreliable network" I mean a real-world network. I'm not suggesting NFS-over-wifi where you're expecting some sort of moshfs, I'm talking about inter-data-center WAN links, LAN links that occasionally get congested, etc.

> Filesystem bugs are filesystem bugs -- the application layer shouldn't have to work around them.

Shouldn't have to, no. But with the UNIX API, it doesn't even have the option of doing so. And being liberal in what you accept from others is a good way to build robust systems; this continues to apply when "you" is userspace and "others" is kernelspace.

For instance, a specific type of application, maybe a high-concurrency web server, might want to add timeouts on filesystem operations that can interrupt the current request and raise a 500 or something, but not abort the entire server. We have exactly this problem with https://lost-contact.mit.edu, a web server that quite intentionally provides access to all public AFS content via the web. It's running normal Apache on top of the normal OpenAFS client. Sometimes a remote AFS server maintained by someone else will go unresponsive, and eventually all threads on the entire Apache server will hang, one-by-one, making requests to the stuck server. There's no way to solve this in the UNIX design; we'd have to write a userspace AFS client to make this work.

AstralStorm · on March 13, 2017

Not true, you can set up an alarm signal to interrupt most syscalls. All fs ones at least.

geofft · on March 13, 2017

Not when you're using many network filesystems. From man 5 nfs:

"The intr / nointr mount option is deprecated after kernel 2.6.25. Only SIGKILL can interrupt a pending NFS operation on these kernels, and if specified, this mount option is ignored to provide backwards compatibility with older kernels."

I'm curious if you've used a network filesystem in production - this feels like such an everyday occurrence with either NFS or AFS that I'm surprised to hear people claim that it works differently.

TheDong · on March 13, 2017

> How is a syscall-level timeout more useful than a whole-application watchdog for handling flaky mounted filesystems?

It's more useful because once you enter into a syscall, it's out of the application's control.

The kernel's off in its happy place doing whatever it does, and the application has no control over that other than being able to specify timeouts to begin with or having async epoll-like interfaces.

AstralStorm · on March 13, 2017

You can also interrupt syscalls using signals or pthread_cancel.

ec109685 · on March 13, 2017

s/application/thread/

GauntletWizard · on March 13, 2017

In a distributed systems problem, almost everything should have a timeout - mkdir and all filesystem operations should be prepared for the possibility that backing store will be slow, unresponsive, or simply broken.

On the other hand, the application layer is probably the wrong place to implement such. Most NFS clients have the option to set a timeout, and system calls that take longer than it instead receive EIO. This isn't perfect, but if you're building an application that needs more complex behavior than that, you probably do need to be building on a more filesystem specific API. Still, there's a point to be made here - There should be a general template or recommendation for what that API is.

loeg · on March 13, 2017

> On the other hand, the application layer is probably the wrong place to implement such. Most NFS clients have the option to set a timeout, and system calls that take longer than it instead receive EIO. This isn't perfect, but if you're building an application that needs more complex behavior than that, you probably do need to be building on a more filesystem specific API.

Yes, exactly.

xelxebar · on March 13, 2017

Exactly. I'm kind of surprised to see the comments arguing that all potentially blocking syscalls should have a timeout field or something. That kind of thing just begs to be factored out into a more general framework.

GauntletWizard · on March 14, 2017

Something similar: https://golang.org/pkg/context/

tyingq · on March 13, 2017

I would guess the poster is just lamenting the lack of a straightforward solution.

Assume you're doing a mkdir() on an nfs mounted volume. It could hang, for quite some time. You would like to set a timeout and error out. There are no terrific solutions. There are workarounds, yes...but largely inelegant.

bonzini · on March 13, 2017

If you set a timeout on NFS operations, you could get an error and yet the mkdir could have succeeded---it could be the response that was delayed. This is really hard to handle, and that's the reason why NFS 's "soft" mount option is considered unsafe.

loeg · on March 13, 2017

There's nothing elegant about unreliable NFS in general. :-)

AstralStorm · on March 13, 2017

Wait is a poor interface anyway, waitpid is always superior and supports a timeout.

skarap · on March 13, 2017

And what about unlink()? Should it have a timeout? How would one use it?

rwallace · on March 13, 2017

Why does he say alarm is always the wrong answer? I've used alarm to kill the current process after a user-selectable timeout if it's taking too long, and it seems to work fine.

fh973 · on March 13, 2017

His use case should be solvable with a timer and a kill. Then asking for a timeout version for the syscall is for convenience. But that's a bad guiding principle for system interface design. You want to keep things simple and orthogonal and that rules out a timeout parameter. Maybe a general way to cancel in-progress syscalls would do it?

Anyway, as my late systems professor used to say: timeouts are always too short or too long.

cozzyd · on March 13, 2017

In the case of waiting on a child started with exec, I've used the crappy workaround of using the timeout command. It makes checking the return value a bit trickier though...

AstralStorm · on March 13, 2017

I've found a similar thing. The best way it seems is to open a pipe and poll on it.

jstimpfle · on March 13, 2017

Not saying the syscall APIs are not broken, but what is a use case for wait() with timeout? Just curious...

deathanatos · on March 13, 2017

From the article's first paragraph:

> The use case is something like this: you spawn a subprocess, and you expect the subprocess to complete within ten seconds. If it doesn’t complete in that time, you want to treat it as an error (and perhaps kill the child).

I've had this exact use case myself, several times. It's painful.

geofft · on March 13, 2017

Weirdly-behaved processes that get stuck (which is usually why you want it in a separate process in the first place). Things that come to mind include hardware utilities making ioctl()s, tools that fetch data from the network that don't have built-in functionality for timeouts, and untrusted code running inside a sandbox (e.g. https://play.rust-lang.org/ with "fn main() {loop {}}").

You can sort of work around it by calling waitpid(...WNOHANG) every tenth of a second or so.

klodolph · on March 13, 2017

WNOHANG is a bit of a hack. It's more straightforward to wait for SIGCHLD or just call alarm().

tomjakubowski · on March 13, 2017

Sure, it's probably a hack if you write a loop that looks like this:

    while (!waitpid(child_pid, &status, WNOHANG))
        usleep(10);
    }

But WNOHANG is perfectly fine, even necessary, if you're calling it from the body of some periodically scheduled event loop which you mustn't block, like in a video game or something.

klodolph · on March 13, 2017

I think some of the context for "WNOHANG is a bit of a hack" got lost.

> You can sort of work around it by calling waitpid(...WNOHANG) every tenth of a second or so.

For event loops, you can still use SIGCHLD (which is portable) or signalfd (which integrates better into event loops). This avoids the extra syscall every loop through the event loop.

geofft · on March 13, 2017

SIGCHLD (via a signal handler, signalfd, sigtimedwait, whatever) only tells you that some child exited, and also only one part of the process can deal with SIGCHLD. (You can't correctly compose a signal handler + signalfd.) waitpid on the specific pid will check that specific process and no others, and not interfere with any other part of the program that might be waiting for some other child.