
Unix System Call Timeouts - panic
https://eklitzke.org/unix-system-call-timeouts
======
Animats
This is done right in QNX. In QNX, everything that can block can be given a
timeout, using TimerTimeout().[1] This is useful in real-time code with
moderate time constraints. If you're logging to disk, and the disk stalls
because of a hardware problem, you may need to get control back so you can
keep doing your real-time work. It's more important to, say, issue steering
commands than log.

After a timeout, you don't know how the request that blocked ended. If you
need that, a multi-thread or multi-process solution is indicated.

In general, QNX is much better at timing, hard real time scheduling, and
inter-process communication than UNIX/Linux. In the Unix/Linux world, those
were afterthoughts; in QNX, they're core features. Because all I/O is via
interprocess communication, it takes an extra memory-to-memory copy; this
seems to add maybe 10% CPU overhead. That's the microkernel penalty, a
moderate fixed cost.

[1]
[http://www.qnx.com/developers/docs/7.0.0/#com.qnx.doc.neutri...](http://www.qnx.com/developers/docs/7.0.0/#com.qnx.doc.neutrino.lib_ref/topic/t/timertimeout.html)

~~~
nwmcsween
A better solution is to have async syscalls; this helps both the scheduler
(you could batch returns) and, given a proper language, the developer as well.
The microkernel approach still suffers from the high context switch overhead
and IMO it would be better to use something similar to an exokernel while
implementing most things in userspace.

~~~
panic
The problem with asynchronous operations is that there has to be a buffer
somewhere. Once you have a buffer, you have to answer the question: what
happens when the buffer is full? If the answer is to block, then you haven't
really solved anything.

~~~
nwmcsween
This is why you want something like an exokernel.

------
klodolph
I'd like to point out here that the basic Unix assumption is that we don't
care about blocking on local disk. So mkdir() should not have a timeout
because it should always be fast enough not to need it. Yes, that's not true,
local disk can be very slow, and there's NFS and FUSE, but it's the assumption
Unix is built on.

Have you ever had a process waiting for disk that you couldn't kill? Like
running "ls" in a big directory, which hangs, and you pound ^C over and over
again in vain? That's the same assumption at work. It's the difference
between D and S states on Linux. So if you want timeouts on mkdir(), what you
really want is a complete reengineering of the Unix system with the philosophy
that local disk IO is interruptible.

~~~
kosinus
Pretty much, yes. I think that assumption turned bad a really long time ago?

Interruptible sounds like a very strong requirement. Just being able to wait
for the result sanely is enough, I think. A process may then decide to lose
interest halfway, but having the result be undefined and fds invalid is just
fine, from what I've seen in e.g. node.js/libuv.

~~~
microcolonel
Keep in mind that local storage is getting lower latency year over year. NVMe
was a big step, but there could soon be more hardware memory mapped flash
storage (like DIMM SSDs) and 3D XPoint. I think the syscall overhead will soon
cost more than the operation itself, especially if it's asynchronous.

~~~
vidarh
That assumes uncontended access.

------
geofft
You can effectively get wait() with a timeout by setting a handler for
SIGCHLD, spawning the child, and using select(). This works on all UNIX and
gets you everything signalfd would. And as long as you set the handler before
you start the child, you don't have to worry about the self-pipe trick or
pselect, since you avoid the race condition. The sigtimedwait() approach also
seems fine.

The big trouble you have is that there is no approach to signals that
composes; if someone else is waiting for a SIGCHLD via some other means,
you're likely to eat the signal they were waiting for. If you're particularly
unlucky and both children terminate while you're not scheduled, you'll only
get a single SIGCHLD - even with siginfo_t, all that means is _at least_ this
child exited, maybe others did too.

Perhaps the better approach is to just spawn a thread for this, which is also
the general-case answer for every other syscall, whether it's wait(), mkdir(),
readlink(), or sync_file_range(). On the other hand, threads are a "now you
have two problems" sort of solution.... I'd like to see a generic API where
you can submit _any_ system call to the kernel, as if it were on another
thread, and select on its completion.

~~~
DSMan195276
I would agree with you. The most natural way to do it seems to be SIGCHLD plus
a function with a timeout, like select: when the signal hits, the call returns
before the timeout.

I'd add that the most 'Unixy' way to do it probably _is_ to use `alarm()`,
though like threads it is a bit of a "now you have two problems" solution as
well. But if you blocked all of your signals, spawned your child, called
alarm(), and then did a `sigwait()` on SIGCHLD and SIGALRM I think that would
work fairly well and be decently simple. You just check which of the two hits
first and then handle them accordingly.

I'd also add that even with the threaded cases it's not actually guaranteed to
work. Lots of these syscalls may put the program into uninterruptible sleep,
which will prevent you from killing the thread until the action is done. (And
for lots of things this is a necessity, since otherwise you could leave things
in an unknown state; userspace shouldn't be able to corrupt the filesystem
just by specifying a weird timeout.)

Like others have said, I think the real solution to a timeout for things like
`mkdir` is in the kernel: if you're attempting to access something like an NFS
mount and it's taking too long, it will fail and you get an EIO, which is
not extremely graceful, but does mean that you won't sit there forever - it
will return after a period of time the kernel considers sufficient to know
that it's not going to happen.

In userspace you ideally have no idea what you're actually accessing, and thus
have no basis for what an accurate timeout should be anyway. It doesn't make
sense to pass a hard-coded timeout to `mkdir()` when that value will probably
be stupid long for some things and not long enough for others, and that's a
much bigger usage issue than certain programs taking a long time on slow
filesystems.

~~~
AstralStorm
You may also interrupt most operations using pthread_cancel and a few other
pthread functions.

~~~
geofft
pthread_cancel to abort a syscall is usually internally implemented by sending
the thread a signal (on glibc, it's one of the real-time signals below
SIGRTMIN), causing the syscall to return EINTR. So it's the same logic as
SIGALRM, essentially.

(And for some system calls, like certain NFS operations, they aren't
interruptible by a signal of any sort other than perhaps SIGKILL.)

------
peterwwillis
_> there’s a trick to turning these kinds of I/O operations into something you
can put into an event loop: you run the desired operation (mkdir() in this
case) in another thread, and then wait for the thread to finish with a
timeout_

This isn't a trick, this is expected. Calls return when they're supposed to
return, or not at all. If you need to return BEFORE the system call is done,
you need to do it somewhere else (like in a new thread or process, or node).
This is also not limited to I/O, but basically any system call: if you return
before it's done, it may break something, so it might not provide a good
timeout method.

Something else to ask yourself: why do you need to return before the call is
done? It's similar to the NFS hard-vs-soft-mounting argument. Soft mounting
can cause damage when improperly interrupted; hard mounting prevents this by
waiting until the system is behaving properly again, with the side effect of
pissing off the users.

~~~
deathanatos
In some cases, you have knowledge that you're about to do a whole bunch of I/O
operations, sometimes all at the same time, sometimes not. (It doesn't really
matter.) _Ideally_, I'd like to transfer this knowledge (that is, the list of
I/O operations) to the kernel wholesale, so that it _also_ has complete
knowledge of the task at hand and can figure out the optimal way to complete
it (which it really can't do if it can't see the whole picture). This might
mean scheduling disk operations more efficiently,
to avoid seeking, or batching multiple network requests into a single packet,
etc.

You can't do that with synchronous APIs without hackery, since the very
structure of the API is self-defeating when it comes to getting the complete
picture to the kernel. If I have 1000 I/O operations, I do not want to spawn
1000 hardware threads: relative to the amount of information required to
describe an I/O op, threads are incredibly expensive.

I don't want the syscall to represent the entirety of the work, simply the
request to have the work performed. The kernel's response is then essentially
"Acknowledged, beginning this I/O. Here's a handle/means¹ to obtain the result
of the operation." Then, I can batch-request notifications of results through
some kernel I/O event queue … e.g., kqueue or epoll.

¹if handles are too much, you could also agree to have it stuck in some sort
of queue of results, that might be usable with kqueue/epoll.

~~~
peterwwillis
Do you want this bulk i/o syscall to inform you every time an operation is
complete, or in stages? Do you want it to prioritize latency over bulk
operations? Do you want it to take up more or less CPU? Will interrupts get
thrown each time you query the status? Do you want to know when the operation
is complete on spindle, in on-disk cache, in the filesystem cache? Do you want
it to handle network filesystems differently? Do you want it to take advantage
of multichannel NCQ and other features or implement your own in the kernel? Do
you want this new i/o scheduler to affect the rest of the system's i/o, or
only your application's? Do you want multiple applications to use different
command queues, or for yours to trump them (priority)? Do you want the kernel
to implement its own batch ordering or rely on vendor firmware? (It sounded
at first like you were describing vectored i/o but I assume you want something
more abstract than that, kinda like a more generalized blk-multiqueue?)

~~~
deathanatos
All good questions, but none of these seem possible in today's POSIX APIs
either. (Most, I feel, probably are best just implemented as "options" to the
syscall in either the sync or async view of the world.) The point was more to
have async operations be _possible_, whereas today, they're not.

> _It sounded at first like you were describing vectored i/o but I assume you
> want something more abstract than that, kinda like a more generalized
> blk-multiqueue?_

Asynchronous I/O, not so much vectored. (Vectored is similar, but I want to
stay away from that term, as most of the APIs I've seen for it (e.g.,
readv/writev) aren't actually asynchronous; they're just more efficient
user-to-kernel bindings.)

------
sigil
Shouldn't this work?

\- In the parent, open a pipe. Fork.

\- Parent closes the write side. Child closes the read side, clears FD_CLOEXEC
on it, and execs.

\- Parent does a select(2) on the pipe's read side with the desired timeout.
If the child exits before the timeout, the pipe becomes "EOF readable." Either
way select(2) returns on or before the timeout.

I get that this trick won't work for imposing timeouts on other syscalls (the
author points to mkdir(2)), but isn't that the purview of an RTOS?

~~~
AstralStorm
Poll and epoll also work.

RTOS should prefer asynchronous calls and lockfree queues instead.

~~~
microcolonel
That sounds like a nightmare. I doubt many mortals could do hard realtime on a
fully-asynchronous RTOS.

------
sebcat
FreeBSD has pdfork, which works with process descriptors instead of signals.
No more signal handlers needed, and you can use process descriptors as you'd
use file descriptors with calls to poll, select, kevent, &c.

------
sargun
If you really need this kind of abstraction for syscalls, I might suggest
looking at Erlang or Golang. Both of these languages and runtimes provide
abstractions for these "bugs" in the underlying APIs.

In Erlang, all I/O goes through things called "ports" and does not block; it's
all asynchronous. There is no reason you couldn't do the same in C --
spin up an external process with a SIGALRM, and maintain a socket that you
use, instead of dispatching the syscalls in the same process / thread you run
the rest of your code.

In Go, syscalls are scheduled on their own thread (typically), and you can run
a goroutine, and wait for the output of the call.

Yes, these bugs exist, and they sometimes make writing code a pain, but
they're mostly "solved" if you use the right tool.

~~~
vbernat
Go doesn't have a general abstraction for that either. If you put something in
a goroutine, you usually cannot cancel it. If you decide to timeout, the
goroutine will continue to run. This may not be important for an mkdir(), but
it may be for other things, like accepting a connection. The
not-cancelled goroutine will have an important side-effect on your program
(stealing a connection).

Moreover, Go wrapper functions automatically restart a syscall when it is
interrupted. Therefore, you cannot cancel a syscall by sending yourself a
signal.

This may change in the future when everything will take a context (at the cost
of an API change), but this is currently not the case.

------
loeg
Yeah, wait() doesn't have a standard way to wait with a timeout. I would wait
(with timeout) for the SIGCHLD with some OS-dependent interface
(signalfd+poll/select/epoll on Linux, or kevent on BSD).

But: What possible semantics could you want for mkdir() such that a timeout is
sane?

~~~
GauntletWizard
In a distributed systems problem, almost everything should have a timeout -
mkdir and all filesystem operations should be prepared for the possibility
that backing store will be slow, unresponsive, or simply broken.

On the other hand, the application layer is probably the wrong place to
implement this. Most NFS clients have the option to set a timeout, and system
calls that take longer than that instead receive EIO. This isn't perfect, but
if you're building an application that needs more complex behavior than that,
you probably need to be building on a more filesystem-specific API. Still,
there's a point to be made here: there should be a general template or
recommendation for what that API is.

~~~
xelxebar
Exactly. I'm kind of surprised to see the comments arguing that all
potentially blocking syscalls should have a timeout field or something. That
kind of thing just begs to be factored out into a more general framework.

~~~
GauntletWizard
Something similar:
[https://golang.org/pkg/context/](https://golang.org/pkg/context/)

------
rwallace
Why does he say alarm is always the wrong answer? I've used alarm to kill the
current process after a user-selectable timeout if it's taking too long, and
it seems to work fine.

------
fh973
His use case should be solvable with a timer and a kill. Asking for a timeout
version of the syscall, then, is just a convenience, and convenience is a bad
guiding principle for system interface design. You want to keep things simple
and orthogonal, and that rules out a timeout parameter. Maybe a general way to
cancel in-progress syscalls would do it?

Anyway, as my late systems professor used to say: timeouts are always too
short or too long.

------
cozzyd
In the case of waiting on a child started with exec, I've used the crappy
workaround of using the timeout command. It makes checking the return value a
bit trickier though...

------
AstralStorm
I've found the same thing. The best way, it seems, is to open a pipe and poll
on it.

------
jstimpfle
Not saying the syscall APIs are not broken, but what is a use case for wait()
with timeout? Just curious...

~~~
deathanatos
From the article's first paragraph:

> _The use case is something like this: you spawn a subprocess, and you expect
> the subprocess to complete within ten seconds. If it doesn’t complete in
> that time, you want to treat it as an error (and perhaps kill the child)._

I've had this exact use case myself, several times. It's painful.

