
Deadlines, lies and videotape: The tale of a gRPC bug - fanf2
https://blog.hostedgraphite.com/2018/10/30/deadlines-lies-and-videotape-the-tale-of-a-grpc-bug/
======
anyfoo
Good war story.

One thing I was wondering:

> It essentially showed our layer0 service freezing, and even stop reporting
> its own internal metrics, only to recover a few minutes later. As we didn’t
> know how to reproduce the problem, it was hard to make progress.

I don’t know the environment (and don’t know much about the current state of
e.g. web development in general), but since the problem was noticeable and
lasted for minutes, up to almost an hour according to the graph, is there
anything that would prevent them from logging into an affected machine and
taking multiple stack shots of the relevant processes in close succession?

That would immediately show which threads are hanging where. The rest of
debugging is still required to figure out why it hangs there (especially if
you are not familiar with the internals of the code at the top of the obtained
stack shots), but at least you know where to look.

~~~
muststopmyths
Since the root problem seems to have been inside gRPC, the stacks may or may
not point to an actual cause, depending on underlying architecture.

For example, if there is some sort of asynchronous queue for message
send/receive, the stacks might just show threads waiting.

I'm speculating obviously. But for any complex application you usually are
looking at a mix of verbose logs and live debugging when troubleshooting.

The advantage of good logs is that you can follow the control/data flow in the
application, whereas attaching a debugger gives you current state. Figuring
out why you're in that state may be quite difficult, depending on the
application.

~~~
anyfoo
> For example, if there is some sort of asynchronous queue for message
> send/receive, the stacks might just show threads waiting.

Yeah, that's the point: it will show the threads waiting, but it will normally
also show exactly where, with what call stack, the threads are waiting. The
full call stack of several stuck processes usually gives you a pretty good
idea what is being waited on, and why.

I say "usually" because things get murkier with e.g. interpreted languages,
where what the program is actually doing is on a different abstraction layer
than the presented stack traces, but even then you have a good chance of at
least figuring out what the stalling resource is, and what processes are
involved.

> Since the root problem seems to have been inside gRPC, the stacks may or may
> not point to an actual cause

I'd go so far as to say that they are unlikely to point at the actual _cause_ (as
said, you still need to continue debugging, and good logs always help), but at
least you now know what code to look at without having to guess reproduction
steps.

It's how we debug hangs in modern complex operating systems, where there often
is a lot of IPC involved. The very first thing to ask for and look at is stack
shots comprising all processes and threads in the system (either in kernel
land or user land or both, depending on the problem), preferably many of those
taken in close-ish succession.
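
If the stuck service happens to be Python (as the gRPC issue in the article
suggests), you don't even need to attach a debugger for this; the stdlib
faulthandler module can dump every thread's stack on a signal. A rough sketch
of what I mean (the signal choice and the shell loop are just illustrative):

    # At process startup: make SIGUSR1 dump the traceback of every
    # thread to stderr, without stopping the process.
    import faulthandler
    import signal

    faulthandler.register(signal.SIGUSR1, all_threads=True)

    # Then, on an affected box, take shots in close succession:
    #   for i in 1 2 3; do kill -USR1 <pid>; sleep 1; done

A few dumps like that tell you whether the same frames stay stuck from shot to
shot, which is exactly the "where is it hanging" signal you want before you
start reading gRPC internals.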

------
klodolph
> It’s important to note that by default, this deadline is set to unlimited,
> which is almost certainly not what you want.

For interactive use this is somewhat reasonable as a default. When you’re
processing an RPC request, it’s also reasonable because when the incoming
request is canceled you can cancel the outgoing requests too.

I would encourage people to think carefully about deadlines and not just
pepper them everywhere.
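
To make the second point concrete: in Python gRPC you can tie the downstream
call's lifetime to the incoming RPC, so a cancelled caller cancels the
outgoing work too. A minimal sketch, with made-up stub and method names:

    # Inside a server handler: start the downstream call asynchronously and
    # register a callback that cancels it when the incoming RPC terminates
    # (including when the caller cancels).
    def HandleQuery(self, request, context):
        pending = backend_stub.Fetch.future(request)
        context.add_callback(pending.cancel)
        return pending.result()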

~~~
noselasd
How is that reasonable? With an infinite timeout, you are liable to have dead
connections pile up forever on any network event that causes the peer to go
silent.

Or you have to handle timeouts at another level in your code, which is
reasonable - but can make things a lot more complicated.

~~~
klodolph
Infinite timeout is often (rather, almost always!) better for layers where the
timeout should be higher in the stack. The timeout or cancellation happens at
the topmost level and then propagates downwards.

This is not any more complicated than a timeout, and it saves you an enormous
amount of pain.

For example, let's suppose a particular operation is timing out, and while you
might want to fix the underlying issues, for now you just want to increase the timeout.
You can fix it by adjusting the timeout at the topmost level without worrying
about other levels. If you use timeouts everywhere, you might have to adjust
timeouts everywhere.

So as a default, I would say, only put timeouts at the top. You can then
evaluate putting timeouts elsewhere as necessary. I've also found it very
helpful in certain cases to do things like stream results as they are generated, so
you can set a deadline and use whatever partial results you get before the
deadline. In particular this is very useful if you have flaky hardware, or
very large queries (in terms of number of systems they touch).
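
In Python gRPC terms, "timeout at the topmost level, propagated downwards"
looks roughly like this (the stub and method names are invented):

    # Client / topmost level: the only place a wall-clock deadline is chosen.
    response = frontend_stub.HandleQuery(request, timeout=5.0)

    # Server handler: forward whatever budget is left instead of inventing a
    # new, unrelated timeout at this layer.
    def HandleQuery(self, request, context):
        remaining = context.time_remaining()  # None if the caller set no deadline
        return backend_stub.Fetch(request, timeout=remaining)

If the caller set no deadline at all, time_remaining() is None and the
downstream call stays unlimited, which is the "infinite by default, bounded
from the top" behavior I'm describing.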

------
muststopmyths
Not handling the return of sendmsg (or any send operation on a socket) is a
common mistake (I hate to say "rookie", but the shoe does fit here).

The proposed solution of setting the TCP_USER_TIMEOUT is worth internalizing.

In my experience, the one thing anyone writing an RPC transport over TCP
should always do is set a short TCP keepalive timer to detect network
connectivity breakdowns. I am talking about a per-socket basis, of course. On
Linux, I guess you'd use the above option to get the same effect.

On Windows, you can use SIO_KEEPALIVE_VALS, which lets you set the TCP
keepalive interval and timeout per socket. On Linux the closest per-socket
equivalents I could find are TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT.

The TCP keepalive defaults were designed for an age when sending packets too
frequently was considered impolite. As a result, a network hole (where packets
are being silently dropped) can take several minutes to detect.

Have been bitten a few times by this, so now I just make this socket option
part of my standard template code for connection protocols.
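
Putting both knobs together, a rough per-socket sketch in Python on Linux
(the numbers are arbitrary examples, not recommendations):

    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    # Fail writes if data stays unacknowledged for 10s (TCP_USER_TIMEOUT is
    # in milliseconds).
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 10000)

    # Short keepalives: first probe after 10s idle, then every 5s, and give
    # up after 3 missed probes.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 10)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 5)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)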

~~~
eeeeffdsx
If I understand it correctly, the issue wasn’t caused by not checking the
result of sendmsg, but sendmsg blocking the program when the send buffer was
full.

~~~
muststopmyths
no, if you look at the github issue, sendmsg fails with EAGAIN. It's probably
an upper layer that hangs.

~~~
noselasd
Note that in this case EAGAIN is just an indication the socket buffer is full
- which can easily also happen in normal operation when you're blasting off
calls at full speed.
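
In Python that shows up as BlockingIOError on a non-blocking socket, and the
sane reaction is to treat it as backpressure rather than a failure. A trivial
sketch (the helper name is my own):

    import select

    def send_with_backpressure(sock, data):
        # sock is assumed to be non-blocking. EAGAIN/EWOULDBLOCK just means
        # the kernel send buffer is full, so wait until the socket is
        # writable again instead of giving up on the connection.
        sent = 0
        while sent < len(data):
            try:
                sent += sock.send(data[sent:])
            except BlockingIOError:
                select.select([], [sock], [])  # block until writable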

------
ru999gol
I've only ever had horrible experiences using grpc: it's overly complicated, badly
documented, has bad code quality, and is full of bugs, especially in combination
with multiprocessing/asyncio. Maybe I'm the only one with that experience, but
at least with Python you're probably better off using something less bloated.

------
eeeeffdsx
So it boils down to the Python implementation of gRPC using synchronous I/O.

~~~
detaro
It's not specific to the Python one, and if it were synchronous I/O sendmsg
would block, not return EAGAIN.

