
Make resilient Go servers using timeouts, deadlines and context cancellation - signa11
https://ieftimov.com/post/make-resilient-golang-net-http-servers-using-timeouts-deadlines-context-cancellation/
======
JyB
Almost all articles I've seen explaining the context pkg are done with
net/http examples.

That's fine but I feel like it might not be the best introduction for a novice
as a lot of concepts are mixed together and they might miss the bigger
picture. Context is not just for web servers. You don't have to know how
net/http works.

You could simply demonstrate the usefulness of the context package by
showing how to properly clean up your program on a SIGTERM, or even
gracefully stop a long-running operation so you are not afraid to
stop/start your program at the "wrong" time.

~~~
morelisp
> gracefully stop a long running operation so you are not afraid to stop/start
> your program at the "wrong" time.

This is one of my major pain points with Go's contexts. Where I work we do
have an "application wrapper" that gets cancelled on various signals, and it's
very handy for some things. But one thing it's not good at is shutting things
down safely!

Something we do pretty commonly is start multiple servers (e.g. a public and a
private HTTP server, or an HTTP server and DB background task, etc.) When we
get SIGINT, we want to cancel both's context (easy), then wait for both to
stop before continuing with our exit process (hard). Yes, this is the
canonical case for sync.WaitGroup, but those are hard to use correctly when
you need to transition into and out of "acceptable" states rather than just
count down (well, hard to use correctly period, to be honest - probably half
the time I see a junior dev using them, they call Add inside the goroutine
instead of before starting it). And it's hard to time out waiting for the
waitgroup when you want to continue despite an unclean shutdown -
WaitGroup.Wait doesn't itself take a context.

This is further complicated by the fact that for servers you often don't
really want to use the "application context" as the parent of the request
contexts. Rather you want the server to shut down cleanly when that context is
cancelled, processing all pending requests to completion but without
immediately cancelling them. So the base request context is ideally something
like "cancelling X seconds after the application context" which is not part of
the standard context toolbox.

And of course different libraries are not really consistent in how they shut
down: http.Server lets you close it with a timeout and so returns an error,
but you also need to check the error return of the Serve method (and you
can't distinguish a graceful stop from a hard stop from trying to restart a
stopped server); grpc.Server offers only hard and graceful stops with no
timeout, and only the Serve method returns an error; and sarama.Client
provides only a synchronous Close that returns an error.

I've not used them, but I'm told C#'s cancellation tokens are akin to Go's
context but are closely integrated with its async task state machine, such
that it's easy to hand out cancellation tokens _and_ wait for the tasks
holding those tokens to finish.

------
regecks
I tried running Go HTTP servers bare to the internet (after Cloudflare
promoted doing so in a blog post), but went back to using a reverse proxy the
next time.

The main benefit seems to be convenience. I can upgrade and graceful-restart
nginx instead of having to rebuild and redeploy the Go server (involving a
full app restart). Not having to worry about goroutine leaks because some jerk
decided to send the request line @ 1 byte/sec is just an added bonus.

~~~
jrockway
You have to worry about the jerk sending requests at 1 byte per second no
matter which webserver you use. It's always a problem to let an unlimited
number of people ask for an unlimited amount of resources; it's just that
things like goroutines are heavier than a file descriptor or a few bytes of
RAM, so you'll notice wasted goroutines more quickly than wasted fds or
memory.

Typically, you need to consider the total amount of memory you want your web
server to use, how much of that memory one request can use, and how long a
request can use that memory. (File descriptors must also be considered.)

Envoy has a section in their documentation about this here:
[https://www.envoyproxy.io/docs/envoy/latest/configuration/best_practices/edge](https://www.envoyproxy.io/docs/envoy/latest/configuration/best_practices/edge#best-practices-edge)

nginx similarly has a number of knobs to turn:
[https://www.nginx.com/blog/tuning-nginx/](https://www.nginx.com/blog/tuning-nginx/)

I use Envoy as my web proxy and nginx to serve static content. My envoy
configuration is complicated and my nginx configuration is simple, as a
result. I imagine that if you are hosting a serious amount of traffic with
Nginx as the edge proxy, more tuning is required. I've never tried, so I don't
really know.

~~~
zzzcpan
_> You have to worry about the jerk sending requests at 1 byte per second no
matter which webserver you use._

Not necessarily. It's just that free webservers don't bother dealing with
it, but there are plenty of simple approaches: drop connections that are
sending requests slower than some threshold, or drop the slowest connection
when some total number of connections is reached. Or, more complicated (but
it also protects against all kinds of attacks), drop the client with the
highest malicious score or the lowest reputation score when some resource
usage threshold is reached.

None of these are easy to implement with synchronous multithreaded networking
code though, like in Go. Realistically it's only viable with asynchronous
single threaded programming models or an actor model.

~~~
jlokier
> None of these are easy to implement with synchronous multithreaded
> networking code though, like in Go. Realistically it's only viable with
> asynchronous single threaded programming models or an actor model.

It's hard to see why synchronous multi-threaded code would find these things
any more difficult than async or actor models.

All three models are equally able to access shared data structures to keep
track of resource usage statistics, per-connection statistics, and timers.

OS kernels do this routinely, and are essentially multi-threaded on SMP
architectures or with kernel pre-emption.

~~~
zzzcpan
Basically the reason is that you can't just kill a thread that shares memory
with other threads. Go doesn't even have the ability to kill goroutines, so
your only choice is manual context tracking and manual cancellation in every
piece of code. But if you are in an event loop, for example, you can just
destroy any client at any point. Same with actors: if you are in an actor,
you can just kill other actors.

~~~
jlokier
Thanks, that's an interesting point of view.

Unfortunately, with event loops and async programming, including async-await
models, cancellation is just as fiddly and needs to be explicitly handled by
client event handlers/awaiters.

For example, think of JavaScript and its promises or their async-await
equivalent.

There is no standard, generic way to cancel those operations in progress,
because it's a tricky problem.

~~~
zzzcpan
_> cancellation is just as fiddly and needing to be explicitly handled by
client event handlers/awaiters_

That's not true. In an event loop, to cancel you simply remove the event
handlers for the associated client from whatever event notification
mechanism you are using and delete (free) the client's data structures,
including futures, promises or whatever you are using. Since the event loop
needs references to all of them just to be able to call the event handlers,
no awareness of any of it on the event handlers' side is required.

~~~
jlokier
That's not true; it only applies to a subclass of simpler event scenarios.

For example, in an event loop system you may have some code that operates on
two shared resources by obtaining a lock on the first, doing some work, then
obtaining a lock on the second, then intending to do more work and then
release both locks. All asynchronously non-blocking, using events (or awaits).

While waiting for the second lock, the client will have registered an event
handler to be called when the second lock is acquired.

("Lock" here doesn't have to mean a mutex. It can also mean other kinds of
exclusive state or temporary ownership over a resource.)

If the client is then cancelled, it is essential to run a client-specific code
path which cleans up whatever was performed after the first lock was obtained,
otherwise the system will remain in an inconsistent state.

Simply removing all the client's event handlers (assuming you kept track of
them all) and freeing unreferenced memory will result in an inconsistent state
that breaks other clients.

This is the same basic problem as with cancelling threads. And just like with
event/await systems, some thread systems do let you cancel threads, and it is
safe in simple cases, but an unsafe pattern in more general cases like the
above example. Which is why thread systems tend to discourage it.

~~~
zzzcpan
Nope, event loops and asynchronous programming in general don't have a concept
of taking a lock, because the code in any event handler already has exclusive
access to everything. I.e. everything is effectively sequentially consistent.

There are some broken ideas out there that mix different concurrency models,
in particular async programming with shared memory multithreading, not
realizing they are bounding themselves to the lowest common denominator, but I
was never talking about any of them.

~~~
jlokier
We are clearly working with very different kinds of event loops and
asynchronous programming then.

I think you use "in general" to mean "in a specific subset" here...

It is not true that every step in async programming is sequentially
consistent, except in a particular subset of async programming styles.

The concept of taking an _async_ mutex is not that unusual. Consider taking a
lock on a file in a filesystem, in order to modify other files consistently as
seen by other processes.

In your model where everything is fully consistent between events, assuming
you don't freeze the event loop waiting for filesystem operations, you've
ruled out this sort of consistent file updating entirely! That's quite an
extreme limitation.

In actual generality, where things like async I/O take place, you must deal
with consistency cleanup when destroying event-driven tasks.

For an example that I would think fits in what you consider a reasonable
model:

You open a connection to a database (requiring an event because it has a time
delay), submit your read and write transactions (more events, because of the
time to read or to stream large writes), then commit and close (a third
event). If you kill the task between steps 2 and 3 by simply deleting the
pending callback, what happens?

What should happen when you kill this task is the transaction is aborted.

But in garbage-collected environments, immediate RAII is not available, and
the transaction will linger, taking resources until it's collected: a
lingering connection containing transaction data. This is often a problem
with database connections.

In a less data-laden version, you simply opened, read, and closed a file.
This time, it's a file handle that lingers until collected.

You can call the more general style "broken" if you like, but it doesn't make
problems like this go away.

These problems are typically solved by having a cancellation-cleanup handler
run when the task is killed, either inline in the task (its callback is
called with an error meaning it has been cancelled) or registered separately.

They can also be solved by keeping track of all resources to clean up,
including database and file handles, and anything else. That is just another
kind of cleanup handler, but it's a nice model to work with; Erlang does this,
as do unix processes. C++ does it via RAII.

In any case, all of them have to do _something_ to handle the cancellation, in
addition to just deleting the task's event handlers.

------
telendt
There's an ugly bug in http.TimeoutHandler though - it obscures stack traces
so that it's impossible to use them to locate panic in decorated handler:
[https://github.com/golang/go/issues/27375](https://github.com/golang/go/issues/27375)

------
diamondo25
Ok, good, contexts now make sure you can handle upcoming timeouts decided by
an upper layer (the caller function). But what about the time.After
function? Will it still be running in the background? So you can still have
a memory or 'processing power' leak?

~~~
rakoo
The function you write after time.After should use the same context and
check its Done channel before continuing execution

~~~
diamondo25
So you have to propagate the Context, hm. IIRC, go test will panic the test
case when it times out. Not exactly sure tho. It would be nice if there was a
kind of 'abort' feature to clean up subroutines spun off this thread

~~~
thwarted
The best you can do with the context package is make sure to call the cancel
function given to you by contexts that have cancelation. Usually you do this
via defer. The cancel function is a no-op if the context has already
finished. All this ends up doing, though, is making sure that things that
clean themselves up know to clean themselves up eventually.

~~~
morelisp
> Usually you do this via defer.

I agree this is usually done by defer, but you probably _should not_ do it
that way unless your code is very simple. Consider a function body which I've
seen variants of many times:

    
    
        ctx, cancel := context.WithTimeout(pctx, timeout)
        defer cancel()
        resp, err := do(ctx, req)
        if err == nil {
            process(resp)
        }
        return err
    

Safe, yes, but optimal? process doesn't use the context and may take longer
than the timeout. The context will keep running, with some associated
resource cost (at the very least, the context's goroutine and timer). A
minimal change is:

    
    
        ctx, cancel := context.WithTimeout(pctx, timeout)
        resp, err := do(ctx, req)
        cancel()
        if err == nil {
            process(resp)
        } 
        return err
    

Which disposes of those resources much earlier.

(Depending on your Go compiler version there is also a potential cost
associated simply with using defer; this is independent of that.)

------
mjpuser
Nice intro to timeouts and context. Next step would be dealing with state
changes that happen in a cancelled request.

------
awinter-py
absolutely agree with the risk of slow clients saturating your connection
limit

when doing DB work with these, I'm a little shakier -- once I start a
multistep DB write, I probably want it to finish. Yes I can use a transaction
to roll back the whole thing, but I think there are cases where rollback is
wrong and I'd rather keep the write.

so while cancellation is cool, it's also a little fraught and hard to test.

~~~
shhsshs
In those rare cases you can choose not to propagate the same context through
those operations. Only check for cancellation once the operations have all
finished.

------
namanaggarwal
Shouldn't the read header timeout be less than the read timeout?

------
gigatexal
I scrolled through the whole article and didn’t get a blaring ad like that.

------
axaxs
Yeah, I quit reading when I got the unblockable full-page ad asking me to
pay 50 dollars for a Go course. Good spam bot.

~~~
Operyl
I did not get such an ad on my end? And this browser has no blockers of any
kind. I scrolled through it in its entirety.

~~~
axaxs
So you know I'm not lying... [https://ibb.co/v43sMXv](https://ibb.co/v43sMXv)

It's not actually unskippable, but on mobile you have to zoom out to click
the X at the top right.

~~~
Operyl
I wouldn’t call that an ad, but even then it never triggered for me. Must’ve
hit a JavaScript error on my end or something.

