
Timeouts and cancellation for humans (2018) - vthriller
https://vorpus.org/blog/timeouts-and-cancellation-for-humans/
======
jrockway
Yes, properly handling timeouts and cancellation is the next frontier for
programmers to conquer. I was just thinking about this the other day because
some program I was using locked up, and of course it worked fine when
restarted, and I began to wonder why this happens so frequently. A lot of
obscure things can cause hangs, but if every blocking operation has a timeout,
the number goes way down.

I think it's unfortunate that even new languages still treat timeouts and
cancellation as an afterthought. For example, every Go program I've ever
written says:

    
    
        select {
        case <-ctx.Done():
            return nil, ctx.Err()
        case thing := <-thingICareAboutCh:
            return thing, nil
        }
    

Instead of:

    
    
        return <-thingICareAboutCh, nil
    

The language designers thought about needing to give up on blocking
operations, and then said "meh, let the programmer decide on a case-by-case
basis". And that's the state of the art.

(Getting off topic, this is why I avoid mutexes and other concurrency
operations that aren't channels; you can't cancel your wait on them. Not being
able to cancel something means that if there are any bugs in your program,
you'll find out when the program leaks thousands of goroutines that are stuck
waiting for something that will never happen and runs out of memory. Even if
the thing they're waiting for does happen, the browser that's going to tell
someone about it has long since been closed, and so you'll just die with a write
error when you finally generate a response. If you have a timeout and a
cancellation on every blocking task, your program gives up when the user gives
up, and will run unattended for a lot longer.)
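The same failure mode exists outside Go: any blocking wait without a timeout can strand a worker forever. A minimal Python sketch of the escape-hatch idea (the worker and queue here are hypothetical, purely for illustration):

```python
import queue
import threading

# A worker stuck on an uncancellable wait can never be reclaimed;
# a wait with a timeout (here queue.get) always has an escape hatch.
results = queue.Queue()

def worker(q):
    try:
        item = q.get(timeout=0.05)  # give up instead of blocking forever
        results.put(("got", item))
    except queue.Empty:
        results.put(("gave up", None))

q = queue.Queue()  # nothing is ever put here
t = threading.Thread(target=worker, args=(q,))
t.start()
t.join()
outcome = results.get()
print(outcome)  # the worker exits cleanly instead of leaking
```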

~~~
noelwelsh
Go is very much not the state of the art in anything. It's really based on
1980s ideas.

~~~
gpderetta
more like 60s: Go vs Brand X [1]

[1] [http://cowlark.com/2009-11-15-go/](http://cowlark.com/2009-11-15-go/)

------
rauhl
I think that this is a good example of where dynamic scope is helpful.

We’re used to lexical scope: it’s easy to reason about, and it is a really
good default. But sometimes it makes sense for one function to apply settings
for all the functions it calls, without interfering with _other_ functions,
scopes, threads or processes (like setting a global would).

It’d be nice to be able to say ‘this function should timeout within 10 ms’ and
then any function called will just automatically timeout.

Go’s contexts integrate timeouts and cancellation, and permit one to add _any_
value, should one wish to, but you have to be disciplined and add a context
argument to every single function. It’d be better, I think, to support it
natively in the language. Lisp does this: any variable declared with
DEFPARAMETER or DEFVAR is dynamic, and you can locally declare something
dynamic too.

One can fake dynamic scoping with thread-local storage and stacks or linked
lists, if one needs it, but it can get ugly.

Dynamic scoping doesn’t get the attention or respect I think it deserves. It’s
arguably the wrong thing by default, but when it’s useful, it’s _really_
useful.
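Python's contextvars module can sketch this dynamic-scope idea: a caller sets a deadline, callees read it, and no function signature mentions it. All the names below (_deadline, with_deadline, remaining) are hypothetical, invented for illustration:

```python
import contextvars
import time

# Dynamically scoped deadline: set by a caller, visible to all callees,
# restored on exit, and invisible to unrelated threads and scopes.
_deadline = contextvars.ContextVar("deadline", default=None)

def remaining():
    d = _deadline.get()
    return None if d is None else d - time.monotonic()

def with_deadline(seconds, fn, *args):
    token = _deadline.set(time.monotonic() + seconds)
    try:
        return fn(*args)
    finally:
        _deadline.reset(token)  # restore the outer scope's deadline

def blocking_op():
    left = remaining()  # reads the dynamically scoped deadline
    if left is not None and left <= 0:
        raise TimeoutError("deadline exceeded")
    return "done"

print(with_deadline(1.0, blocking_op))  # prints "done"
```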

~~~
quietbritishjim
I think that what you're describing is how Trio's cancellation scopes work
[1]. Trio is an alternative async library for Python [2] (as opposed to
asyncio). It's pretty neat.

[1] [https://vorpus.org/blog/timeouts-and-cancellation-for-humans/](https://vorpus.org/blog/timeouts-and-cancellation-for-humans/)

[2] [https://github.com/python-trio/trio](https://github.com/python-trio/trio)

~~~
BiteCode_dev
It's possible to implement the concept in asyncio as well, you just can't
enforce it for dependencies because of ensure_future().

See ayo for a very rough draft of the idea:
[https://github.com/Tygs/ayo](https://github.com/Tygs/ayo)
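For a per-call version of the idea, asyncio's built-in wait_for cancels the awaited coroutine when the timeout fires; the enforcement gap is that a task spawned with ensure_future inside it escapes the cancellation. A minimal sketch:

```python
import asyncio

async def worker():
    await asyncio.sleep(10)  # stands in for some slow operation

async def main():
    try:
        # wait_for cancels worker() when the timeout fires; a task
        # spawned inside worker() via ensure_future() would escape this.
        await asyncio.wait_for(worker(), timeout=0.01)
        return "finished"
    except asyncio.TimeoutError:
        return "timed out"

print(asyncio.run(main()))  # prints "timed out"
```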

------
ekimekim
This is something that is really nice in gevent. Under the hood it's doing
something similar to what the article says - every time you make a blocking
call (in gevent, this means yielding to the event loop until your event
occurs), you might have a gevent.Timeout raised.

Since gevent is generally used by monkey-patching all standard IO, most code
doesn't even need to be aware of this feature - it just treats a timeout as an
unhandled exception.

From the user's perspective, it can be used simply as a context manager that
cancels the timeout on exit from the block:

    
    
        with gevent.Timeout(10):
          requests.get(...)
    

By default this will cause the Timeout to be raised, which you can then catch
and handle. As a shorthand, you can also give it an option to suppress the
exception, effectively jumping to the end of the with block upon timeout:

    
    
        response = None
        with gevent.Timeout(10, False):
          response = requests.get(...)
        if response is None:
          pass  # handle timeout

------
amelius
If you think this is difficult, then try designing an abstraction that can
reliably (best effort) report progress of any operation in your program.

There is a reason why most progress indicators suck, and it's because it is in
general surprisingly hard to write one.

------
ruslan
I often wonder why SO_RCVTIMEO/SO_SNDTIMEO are available for sockets but not
for ordinary file descriptors. Setting the timeout once and then using classic
read/write on blocking FDs is easy, and the timeout comes back through the
normal error-handling path.
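Python's socket API exposes this model: set the timeout once, then use plain blocking calls (CPython implements settimeout with its own non-blocking machinery rather than SO_RCVTIMEO itself, but the programming model is the one described):

```python
import socket

# Set the timeout once, then use plain blocking calls: the timeout
# surfaces as an error on the call itself.
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.settimeout(0.05)        # roughly the SO_RCVTIMEO model: 50 ms
s.bind(("127.0.0.1", 0))  # ephemeral port; nothing will ever arrive
try:
    s.recv(1024)
    result = "received"
except socket.timeout:
    result = "timed out"
finally:
    s.close()
print(result)  # prints "timed out"
```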

~~~
gpderetta
I think it is because the network is inherently different from the disk. A
disk might be slow, but the data will eventually arrive (a bad disk or a
network mount being the exception, of course).

~~~
nybble41
> bad disk or network mounts being the exception of course

That's kind of the point. Programs shouldn't assume that any particular file
is local. Any file may be hosted on a network mount, and that means network-
style asynchronous interfaces should be used to access the data.

------
kc0bfv
I definitely assumed this would contain clever tips about how to handle it
when coworkers don't respond to your emails, don't accomplish things they
promised, or don't follow-through. Maybe some automation methods to handle
those situations.

------
jeffreygoesto
Very systematic and accessible description of the problem and various
alternatives including their origins. I learned a lot reading the post, thank
you!

------
dwohnitmok
I'm not convinced the task-based approach doesn't work. Perf-wise there's no
reason that tasks have to have the overhead of threads.

Syntactically, I think it is worth distinguishing between things that can time
out and things that can't, because you usually need to do some sort of cleanup
on timeout.

In fact as far as I can tell, the cancel scopes provided by Trio with the
async await syntax are exactly isomorphic to Scala's tasks from the cats
library (where they are called IO).

Also I'm not sure I understand the author's preference for thinking of
timeouts as level-triggered rather than edge-triggered. While it's an
interesting way of thinking about the problem, and would be the natural way a
timeout is implemented in an e.g. FRP system (a lot of flavors of FRP are
essentially entirely level-triggered systems), it doesn't seem like the way
you'd implement things in a non-reactive system. What's wrong with just
killing the entire tree of operations (as is usual when you propagate say, an
exception) on a timeout, or from a top-down manner when you put a timeout on a
task?

Timeouts are fundamentally tied with concurrency (they are a concurrent
operation: you're racing the clock against your code and seeing who wins) and
to me the tricky thing about timeouts is exactly the same trickiness that you
face with concurrency, namely shielding critical sections. How you decide to
pass timeout arguments seems like a secondary concern. Just like with normal
concurrency, you need to make sure that certain critical sections are atomic
with respect to timing out, either by disallowing timeouts during that
critical section (you therefore need to make the critical section as small as
possible, ideally a single atomic swap operation) or implementing a reasonable
form of rollback. (Of course you can always take the poll-based approach where
you poll for timeout status, but again this is just a specialization of a
general concurrency strategy)

------
noelwelsh
FP libraries have pretty much solved this IMO. You create a value that
describes what you want to happen and that description can include
cancellation if some condition is met (e.g. it takes too long). There are
limitations imposed by the runtime on what you can actually cancel (e.g. I
don't believe all OS calls can be interrupted) but beyond that it works as
specified.

Here's one example of such a library, though without a bit of FP background it
probably doesn't make a great deal of sense:

[https://typelevel.org/cats-
effect/typeclasses/concurrent.htm...](https://typelevel.org/cats-
effect/typeclasses/concurrent.html)

~~~
marcosdumay
I dunno. As bad as Haskell's exceptions are, I still haven't seen anything
better.

On a sane system, every little thing must be cancellable, otherwise nothing
really is. That's where this interface falls short.

------
dirtydroog
Boost.ASIO (C++) does not expose the SO_RCVTIMEO socket option and instead
makes you use a deadline_timer explicitly. It's very annoying, but this
article kind of explains why it is that way.

------
BiteCode_dev
My first thought was "oh, another covid-19 article". The human mind is funny.

~~~
chrisco255
We're all on timeout and everything is cancelled. Makes sense.

~~~
skocznymroczny
sudo killall -19

------
saurik
I have spent way too much of my time as a developer over the years hacking on
software to remove ill-conceived timeouts where some developer said--sometimes
not even in one place but for some insane reason at every single level of the
entire codebase--"this operation couldn't possibly take longer than 10
seconds"... and then it does, because my video is longer than they expected or
I have more files in a single directory than they expected or my network is
slower than they expected (whether because I have more packet loss or more
competition or more indirection) or my filesystem had more errors to fix
during fsck than they expected or I had activated more plugins than they
expected or I had installed more fonts than they expected or I had more email
that matches my search query than they expected or more people tried to follow
me than they expected (for months back when Instagram was new I seriously
couldn't open the Instagram app because it usually took more than the magic 10
seconds--an arbitrary timeout from Apple--to load my pending follower request
list for my private account; the information would get increasingly cached
every load so if I ran the app over and over again eventually it would work)
or my DNS configuration was more broken than they expected or I had a more
difficult-to-use keyboard than they expected or I had more layers of security
on my credit card than they expected or _any number of things that they didn't
expect_ (can you appreciate how increasingly specific these examples started
becoming, as I started having horrifying flashbacks of timeouts I had to
remove because some idiot developer decided they could predict how long
something could take and then aborted the operation, which seems like the
worst possible way of handling that situation? :/). Providing the user a way
to cancel something is great, but programming environments should make
timeouts maximally difficult to implement, preferably so complex that no one
ever implements them at all (and yes, I appreciate that this is a pipe dream,
as a powerful abstraction tends to make timeouts sadly so easy people strew
them around liberally... but certainly no timeout arguments should be provided
on any APIs lest someone arbitrarily guess "10 seconds"): if the user, all the
way up at the top of the stack, wants to give up, they can press a cancel
button. And to be clear: I don't think timeouts are something mostly just
amateur programmers tend to get wrong and which can be used effectively by
experts (as is the case with goto statements or random access memory or
multiple inheritance)... I have _never_ seen a timeout--a true "timeout" mind
you, as opposed to an idempotent retry (where the first operation is allowed
to still be in flight and the second will, without restarting, merge with the
first attempt as opposed to causing a stampede; these make sense when you have
lossy networks, for example)--in a piece of software that was a feature
instead of a bug, where the software would not have been trivially improved by
nothing more than simply deleting the timeout, and I would almost go so far as
to say they are theoretically unsound.

~~~
james_s_tayler
It's a double edged sword. A lot of system failures happen because every
single thread winds up accidentally blocking on something that is never going
to return for one reason or another. In such cases a timeout is able to
unblock the threads.

Maybe the developers collected lots of data, saw that p99 latency was X, and
set the timeout to X+alpha. If something takes drastically longer than most
requests, something is probably going wrong. Maybe your latency was at the
99.99999th percentile and they timed you out.

Or, yeah, the devs just guessed, and guessed badly, and the timeouts cause
more problems than they solve.

