
Don't trust default timeouts - kiyanwang
https://robertovitillo.com/default-timeouts/
======
joosters
No! The danger in forcing programmers to pick a timeout is that they will pick
the wrong value - most often one that is too short - because they have been
testing their software on a super-fast internal network and haven't considered
the poor users in the real world.

Case in point: Google's Waze. If I have a slow mobile connection (e.g. edge or
even 3g), Waze will repeatedly fail to load a driving route. It will think for
a few seconds at most, then timeout and tell me there was a problem. If it
only would wait a few more seconds to load, then the app would be useful.
Instead, due to their crappy choice of timeouts, the app becomes useless.

~~~
hrktb
Is the argument that a call that could get stuck for a few minutes is better
than a “wrong” value of a few dozen seconds? Even as a user I feel it’s a
waste of precious resources (including my time). It’s like waiting at the
register until the shop closes because the employee had to go somewhere,
instead of giving up and trying the next open register.

I’d think infinity is not a valid state.

In Waze’s case, I suppose their priority is not salvaging the 1% longest
requests (though critical to you), but preserving server resources for the
99% faster clients. That’s not a “wrong” value on their side, and it has
probably been carefully tailored to get the right tradeoff.

~~~
user5994461
A too-short timeout is more problematic than no timeout, because it breaks
the application.

Let's say 10 seconds, a typical intuitive but bad timeout. This will cause
requests to fail for no reason other than users being in Asia or Africa,
behind high latency. It will break the application when it's used or deployed
across datacenters, again because of high latency. It will cause requests to
fail when the server is a bit busy (a couple of seconds more to process
requests). Worse, it will cause chain reactions under load, creating more
retries and even more load, causing other services/servers to time out too.

Better to go for a long timeout. A long timeout doesn't break the application.

~~~
rfoo
> A long timeout doesn't break the application.

I'm pretty sure an infinite timeout also breaks the application, in a way
people rarely realize is because of the timeout. People would rather think it
"just didn't work, don't know why" than be very clever and realize "it must
be the timeouts!!!"

~~~
user5994461
Yes, go ahead and set 10 minutes rather than infinite or 10 seconds. That
will make it much easier to realize that things are frozen, because they will
raise exceptions and logs all over the place.

To be pedantic though, infinite timeouts don't break applications except in
some rare cases of resource exhaustion. If an application is completely
unresponsive, it is dead for good, not because of the timeout; you need to
fix the root cause (often resource exhaustion like swapping, or it's waiting
on another IO or service that's frozen).

~~~
astrobe_
Failing because of a too short timeout feels silly, but a stupidly large
timeout leads to frustration and hazardous user actions like killing the app
with the task manager.

You don't need a timeout, you need a "cancel" button.

~~~
user5994461
Funny you mention that; it reminds me of Windows task management. Windows
automatically shows a popup offering to terminate an application when it
detects the application is unresponsive.

This happens regularly when I open large files in some apps: they take a fair
bit of time to load, and Windows offers a popup to kill the app after a few
seconds. I have to carefully wait and not click anything.

------
asdfasgasdgasdg
I think the opposite. Use infinite timeouts for outbound calls. If you're an
interactive application, display progress/activity to the user. Allow the user
to manually cancel. If you're a server application or system, maintain an
operation wide timeout and, if you do time out, propagate cancellation.

Systems I work with that have a default timeout are a pita. You end up having
to make pointless retries when you'd have been happy to wait.

There are exceptions. If cancelling and retrying has a decent chance of
routing around the original problem, then a timeout makes total sense. The
other case is if you have a workload where operations tie up a mixed set of
resources (e.g. threads + blocked backend calls) and only some of your
incoming ops depend on the blocked resource. In that case, timeouts make
sense in that they at least allow you to make forward progress on the
unblocked requests. Although tbh separate queues and thread pools are the
safer way to handle this, because the caller with the timed-out calls is
going to keep retrying, and eventually those retries will crowd out the
requests that could make progress in your incoming request mix.

~~~
tlb
As a user, I prefer short timeouts that pop up an error message with a 'retry'
button.

You need the retry button anyway, in case the server is throwing errors. And
there's often no good place for a cancel button, without putting up a big
'Loading' animation.

~~~
cortesoft
Retrying doesn’t help on a slow connection... if the best-case connection
speed of your user is slower than your timeout, you can retry an infinite
number of times and it won’t help.

Maybe increase the timeout on retry?

~~~
nitrogen
Exponential backoff, but for the timeout instead of the interval, and/or a
"Wait Longer" button, both sound useful.

------
emptysea
Related to HTTP timeouts, I’ve run into database clients without default
timeouts. This meant that even though the HTTP request was cut off after the
timeout, the database query kept running, leaving tons of slow database
queries executing on the server.

With Postgres you can use roles to set timeouts, maybe you want a longer
timeout for crons, shorter for HTTP endpoints.

Sadly we were using mongo which doesn’t have equivalent functionality. Ended
up monkey patching the client library to define a reasonable default timeout.

~~~
znep
That isn't due to a missing timeout; that is due to not properly
communicating aborted requests down the stack, which, admittedly, isn't
always easy, and some clients/languages/etc. are very bad at it. A hardcoded
timeout, while a fine workaround in some applications, is not a good default
and not the proper fix for that.

Default timeouts in the database layers are hidden time bombs: they turn
operations that just legitimately take a bit longer than some value the
library author set (which you didn't even know existed) into failures that
get retried over and over, causing even more load than just doing the thing
once. Don't get me wrong, there are lots of uses for setting strict timeouts
and being able to do so is very important, but as a default, no thanks.

~~~
nitrogen
You sometimes won't know a TCP connection has been closed unless you try to
write to it (possibly there's a select/epoll/etc way to test), so if you are
using blocking I/O, you won't know that the HTTP client went away long ago.

~~~
user5994461
I highly advise turning on TCP keepalive to detect dropped connections.

~~~
Matthias247
Sure. But the parent poster's point was that you still won't observe the
error unless you interact with the socket again. If you have a blocking
thread-per-request model and your thread is blocked on the database IO, then
it won't look at the original request (and its source socket) for that
timeframe.

There is no great OS solution for handling this. You kind of need to run
async IO on the lowest layer, and at least still be able to receive the read
readiness and the associated close/reset notification, which you can somehow
forward to the application stack (maybe in the form of a `CancellationToken`).

------
hodgesrm
"Don't trust timeouts" is a better title and a better approach. The
fundamental problem with distributed systems is that you can't tell the
difference between slow and non-responsive/crashed services. Simple timeouts
are rarely the answer. Here are the obvious alternatives, depending on which
part of the problem you are trying to optimize.

* Keepalive - Have the server ping back on a short timeout while it's working. Use a very long timeout for the server response.

* Asynchronous queues - Use queues for requests and discard traffic/error out when the queue becomes full.

* Idempotence - Send another request if the first one does not return in a reasonable amount of time.

* Broadcast - Don't fetch the information, have it sent to you through UDP. Great for cumulative metrics. If you miss one, no problem, the next packet has the same data.

* Cancellation - Cancel the request if you don't get an answer.

* Multiple requests - Send requests to multiple services and return the one that gets back first.

Forcing clients to pick timeouts amounts to punting a hard problem over to
somebody who has even less idea how to solve it than you do.

Edit: clarity

~~~
ric2b
Great list.

That last suggestion (Multiple requests) can be tough to implement correctly,
I think it's usually called Happy Eyeballs:
[https://youtu.be/oLkfnc_UMcE?t=290](https://youtu.be/oLkfnc_UMcE?t=290)

------
nirui
Man! As a Chinese person, I really wish some companies would move the
developers who work on their network-related components to China for a few
weeks, because that would improve the stability of their products quite a bit.

You know we have a firewall that randomly disconnects connections and blocks
traffic. A lot of apps just get confused when that happens. And it happens a
lot, once every few minutes or even more often.

When I worked on my proxy, I had to define a new strategy to detect dead
connections, such as using separate timeouts for Dial and Read. The Dial
timeout is a shorter value, defaulting to 20 seconds; the Read timeout is a
rather normal one, usually defaulting to 120 seconds.

I found many programs just don't use any strategy. They just send the request
and wait, assuming everything will be fine, while it's actually hanging
forever on the user's end until the OS kicks them out. Many download systems
don't even have a retry/resume mechanism: you download 99% of a 600MB package
(and it takes about 48 hours), then the connection EOFs, and the software
says "yeah, you'd better download all of it again, hehe".

An example of a good strategy can be found in `apt`. The software detects
slow networks and timed-out connections, and automatically retries downloads
(not sure if it can resume a download; it would be great if it did). All of
that gives me a strong piece of software that I can trust: I know that when I
run the command, it will try its best to get things done. And usually it
does, causing far fewer issues than `npm`, `snap`, `git`, etc.

I suggest everybody give this mindset a try: when your software downloads
data and puts it on the user's computer, that copy of the data is now owned
by the user. If you remove the data, you're looting the user of what they've
got. It's as if you're making dinner for your user: everything that's been
made (downloaded) is already on the table, and one failed dish (packet)
shouldn't cause you to flip the table. Instead, you retry and retry until it
can't be done (for example, the source has changed, or the wait really is too
long).

------
acdha
This is a surprising blind spot for most large tech companies. I’m sure
Apple, Google, Microsoft, Facebook, etc. spend a lot of money on their
corporate networks, but you’d think they’d have at least one engineer who
encounters unreliable wireless on a regular basis. Only Netflix appears to
test for this - it’s the only one where I’ve never had to toggle networking
to restore functionality after a dropped packet.

~~~
throwaway2048
YouTube also handles it pretty well, but the Twitch player, for example,
gives up forever if the network is too flaky. Who exactly wants that
behavior?

~~~
powersnail
YouTube Music, however, has the opposite problem. It always tries to load the
online version, regardless of whether I have downloaded the file already.
Even if I specifically clicked a file in the downloaded list, it will still
try the online version when the next song begins.

When I have a flaky connection, I get so many pauses (in songs I have
downloaded) that I just deleted the app altogether.

------
mprovost
One of the main complaints about NFS - one of the original distributed
systems - is that client machines hang when the server is unavailable. The
problem is that the (Unix) filesystem layer assumes that disks are reliable
(spoiler alert: they're not), and NFS stretches disk access across a network.

The concept of a "soft mount" with a timeout was introduced to NFS, but it's
almost never recommended. This is because client programs have no idea how to
handle a timeout from the filesystem. This article shows how every HTTP
client has to be configured to handle failures. Imagine if every program that
accesses a file, from /bin/cat all the way up, had to have error-handling
code to deal with timeouts and retries. A sane choice is to wait infinitely
if there's nothing more intelligent you can do.

~~~
tyingq
_"NFS server xyz not responding, still trying"_ is still in my head, despite
not having used NFS for probably a decade.

------
Dunedan
I recently ran into this problem with the popular Python "requests" library,
which doesn't set a default timeout. That's especially annoying, as their
slogan is "HTTP for Humans", and hanging forever doesn't feel very
human-friendly.

There is a longstanding issue [1] to add a default timeout, but so far it
hasn't happened.

[1]:
[https://github.com/psf/requests/issues/3070](https://github.com/psf/requests/issues/3070)

------
hderms
In the AWS builders library they suggest setting the timeout to be at the p99
for the expected latency of the operation (or choose a different percentile if
you want to be more or less tolerant of false positives). That methodology
seems pretty solid, provided it's something that's continually re-evaluated
and tested under load.

It's also important to consider what the client is advised to do in the case
of a timeout. Retries, for instance, should likely have backoff and jitter
attached, or a retry budget.

~~~
user5994461
That number sounds like really bad advice to me. It should be more like
p99.99 in my experience.

Internal services have extremely low response times during normal operation
(p99 around a second), but then the database starts a snapshot, or a large
analytics query hits on the weekend (high IO), and latency goes through the
roof for a short while. Too bad if services have short timeouts: they're now
failing all requests for no reason.

p99 is normal operation. Services shouldn't be configured to systematically
fail 1% of operations.

~~~
hderms
Fair enough; that's why they call out that you need to load test it and
actually determine that the value you set meets expectations. Agreed that
blindly setting a value is problematic.

------
adamch
We kept running into this issue at my job. A lot of our original database
queries for our Go service called db.QueryRow, not db.QueryRowContext. The
former doesn't respect timeouts, the latter does. So I ended up writing my own
wrapper around Go's database/sql package. It basically just reexports all the
functions that accept Context and hides the ones that don't. Very helpful.
Timeouts are important.

------
awinter-py
> Network requests without timeouts are the top silent killer of distributed
> systems.

YES 1000% (I mean maybe not 'top' but it's up there)

languages _must_ move connection pooling, timeouts, and retry semantics into
the stdlib

API client libraries have to do a better job of documenting what happens when
a request fails

systems need to do a better job of centralizing how timeouts are configured;
this can't be left to chance

~~~
dnautics
May I introduce you to erlang/BEAM? These issues were solved 30 years ago, and
the ecosystem has had plenty of time to solidify best practices around exactly
what you're asking for.

~~~
Thaxll
How exactly does BEAM protect you against resource consumption? If you set no
timeout, all your processes will be up and idle.

~~~
dnautics
The default BEAM timeout is usually 5s (probably too long in some cases); if
you miss it, the default behavior is an unhandled exception, which crashes
the process that made the call (and only that process, no others). The VM
will then recover all of the resources (file descriptors, sockets, data to be
GC'd) associated with that process. All in zero lines of code.

Also, you can have millions of processes per core with minimal performance
regression, so you're likely to notice it in monitoring before it becomes a
problem.

------
mcqueenjordan
I like the idea of increasing the timeout on successive retries. Sort of like
exponential backoff: increase the timeout exponentially too. Of course, this
only helps if the reason for retrying could be helped by a larger timeout.

------
kalecserk
I work in the payments industry, and this issue has struck our systems
several times. One extra piece of advice is to also consider the compound
timeout when there are multiple calls to the same service. I still remember
our system hanging completely because RabbitMQ was unresponsive. We had a
50ms timeout on RabbitMQ, but that didn’t protect us, since we would hit the
service 50 times per request.

------
peterwwillis
Different timeouts exist for different purposes. Sometimes _infinite_ is the
only sane default. Sometimes the default must be dynamic. And sometimes you
just pick something that sort of makes sense with all the other system
components in mind.

There are a dozen or more timeouts just for a TCP connection. There's the
initialization timeout, the 3-way handshake timeout, the half-closed timeout,
the time-wait timeout, the unverified reset timeout, the established
connection timeout, the retransmission timeout, the timed wait delay, the
delayed ack timer, the arp cache timeout, the arp cache minimum reference
timeout, the keep-alive timeout, and more.

Every single person in the world depends upon default timeouts, so of course
they matter. When they are picked intelligently, they improve the default
behavior of the majority of system interactions. So we can trust default
timeouts, when they are useful. But if we're _building_ a system, it makes
sense for us to determine what the appropriate timeout is _for our system_.

------
svnpenn
> Javascript’s XMLHttpRequest is THE web API to retrieve data from a server
> asynchronously.

Uh what? Has he never heard of Fetch:

[https://developer.mozilla.org/Web/API/Fetch_API](https://developer.mozilla.org/Web/API/Fetch_API)

It's been around for at least 5 years, and it returns a Promise.

~~~
simonw
He covers fetch() in the section after that, and rightly complains that unlike
XHR it doesn't support timeouts at all.

------
ufmace
There could stand to be more thought put into timeouts and cancellation in
many applications, and into how it should all work in the face of APIs that
might be unresponsive or slow at times. But I don't think that putting some
arbitrary timeout as the default everywhere is really a good idea.

Many of these things are used for one-off scripts, where it isn't worth
thinking about. For many APIs, it isn't worth the trouble - if one of your
dependent services is unresponsive, there isn't really any meaningful thing
your application can do anyways. It doesn't become an issue until there are so
many timeouts that it's impacting other resources. Best to leave it off until
you know what you want to do with it.

------
faebi
Timeouts start to get really funny once you create lots of UDP connections
with NAT somewhere in between. Since UDP is connectionless, the NAT has no
idea whether it will receive any more packets, and therefore has to keep the
port mapping around for a certain amount of time. At some point UDP packets
will be dropped, since these mapping tables can‘t be of unlimited size.

~~~
herpderperator
conntrack timeouts don't just apply to UDP:

    
    
      net.netfilter.nf_conntrack_dccp_timeout_timewait = 240
      net.netfilter.nf_conntrack_frag6_timeout = 60
      net.netfilter.nf_conntrack_generic_timeout = 600
      net.netfilter.nf_conntrack_gre_timeout = 30
      net.netfilter.nf_conntrack_gre_timeout_stream = 180
      net.netfilter.nf_conntrack_icmp_timeout = 30
      net.netfilter.nf_conntrack_icmpv6_timeout = 30
    

and even for TCP, there is a timeout after the connection is closed. The fact
that UDP has no state and therefore no 'connection' doesn't mean that
conntrack only tracks TCP while the connection is open. Besides, you could
sever a cable and TCP wouldn't know anything had happened. So you do need
timeouts for anything in a NAT table.

------
karmakaze
Any mention of the Go http package and timeouts ought to also mention "never
use http.DefaultClient":

    
    
      package http
      var DefaultClient = &Client{}
    

It is a convenient global variable that uses whatever settings were last set
on it by any bit of code executed in any dependency.

------
jasonhansel
Also: set a timeout in your database to stop out-of-control queries from
taking a whole system down. Postgres's "statement_timeout" comes to mind; if
a statement exceeds the timeout, Postgres aborts it and effectively rolls the
system back to its previous state.

------
jlg23
> never use “infinity” as a default timeout.

Never say never. What can be tuned in the system (obviously relevant only for
server software) is better tuned there unless you really like (re-)negotiating
tuning options with ops.

------
Marazan
After being bitten twice by default timeout values I have the maxim "the
defaults will always be wrong" engraved in my heart.

------
tobyhinloopen
XHR does have a timeout, right? It’s just arbitrarily defined by the browser.

