
TCP Puzzlers - jsnell
https://www.joyent.com/blog/tcp-puzzlers
======
ChuckMcM
That is an _awesome_ post. If you are thinking about writing distributed
systems code, you need to understand this behavior of TCP/IP completely. Back
when 100 nodes was a "big" network, these sorts of problems were so rare that
you could be forgiven for not taking them into account, but these days with
virtualization you can have a 500-node network all on the same freakin' host!
And the law of large numbers says "rare things happen often, when you have _a
lot_ of things."

Fun anecdote: at Blekko we had people who tried to scrape the search engine by
fetching all 300 pages of results. They would do that with some script or
code, and it was clear they weren't human because they would ask for each page
right after the other. We sent them to a process that Greg wrote on a machine
that did most of the TCP handshake and then went away. As a result the
scraper's script would hang forever. We saw output that suggested some of
these things sat there for _months_ waiting for results that would never come.

~~~
scottlamb
> If you are thinking about writing distributed systems code you need to
> understand this behavior of TCP/IP completely.

Not necessarily. If you have a good RPC layer, it will abstract away most of
these details. For example, gRPC automatically does periodic health checking
so you never see a connection hang forever. (But you do see some CPU and
network cost associated with open channels!) gRPC (or at least gRPC's
predecessor, Stubby) has various other quirks which are more critical to
understand if you build your system on top of it. I expect the same is true of
other RPC layers. Some examples below:

* a boolean controlling "initial reachability" of a channel before a TCP connection is established (when load-balancing between several servers, you generally want it to be false so your RPC goes to a known-working server)

* situations in which existing channels are working but establishing new ones does not

* the notion of "fail-fast" when a channel is unreachable

* lame-ducking when a server is planning to shut down

* deadlines being automatically adjusted for expected round-trip transit time

* load-balanced channels: various algorithms, subsetting

* quirky but extremely useful built-in debugging webpages

It's valuable to understand all the layers underneath when abstractions leak,
but the abstractions work most of the time. I'd say it's more essential to
understand the abstraction you're using than the ones two or three layers
down.
People build (or at least contribute to) wildly successful distributed systems
without a complete understanding of TCP or IP or Ethernet every day.

~~~
est
> For example, gRPC automatically does periodic health checking so you never
> see a connection hang forever.

So, a simple setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE) call?

~~~
rdtsc
I found in practice that is usually not enough, and every distributed system I
designed or worked on ended up with heartbeats at some point. It could be OS
peculiarities, or it could be the inability to tweak the timing of keepalives.

Sometimes the heartbeats are sent by the server, sometimes by the client, it
depends. But they always end up in the application layer protocol.
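
As a rough illustration of what those application-level heartbeats tend to
look like (a minimal sketch; the 1-byte ping/ack protocol, the constants, and
the names are all made up for this example):

    #include <poll.h>
    #include <unistd.h>

    #define ACK_TIMEOUT_MS 3000    /* declare the peer dead after 3 s */

    /* Send a ping and wait briefly for the peer's ack. Returns 0 if the
     * peer answered, -1 if it should be declared dead. */
    int heartbeat(int fd)
    {
        char ping = 'P', ack;
        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        if (write(fd, &ping, 1) != 1)
            return -1;             /* send failed: connection is gone */
        if (poll(&pfd, 1, ACK_TIMEOUT_MS) <= 0)
            return -1;             /* error or timeout: peer presumed dead */
        return read(fd, &ack, 1) == 1 ? 0 : -1;
    }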

~~~
greglindahl
... because the TCP keepalive interval defaults to a minimum of 2 hours, per
RFC 1122. Which is far too long, so everyone adds one at the application
level.

~~~
gricardo99
On Linux you can set tcp_keepalive_time, tcp_keepalive_intvl, and
tcp_keepalive_probes to make this much shorter, but they're global to all
sockets, so app keepalives are better for finer control, among the other
things mentioned.

~~~
noselasd
There are 3 socket options (TCP_KEEPCNT/TCP_KEEPIDLE/TCP_KEEPINTVL) that allow
you to control this per socket too; it's not just global.
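
For the curious, a minimal sketch of tuning them on Linux (the specific
values here are arbitrary examples): probe after 60 s of idleness, every
10 s, and reset the connection after 5 unanswered probes.

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Enable keepalive on one socket and tighten its timing. */
    int enable_keepalive(int fd)
    {
        int on = 1, idle = 60, intvl = 10, cnt = 5;

        if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
            return -1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof idle) < 0)
            return -1;    /* seconds of idleness before the first probe */
        if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof intvl) < 0)
            return -1;    /* seconds between probes */
        if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof cnt) < 0)
            return -1;    /* unanswered probes before the reset */
        return 0;
    }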

------
Animats
Idle but half-dead TCP connections will eventually time out. It can take
hours, though, on some systems.[1] Early thinking was that if you have a
Telnet session open, TCP itself should not time out. The Telnet server might
have a timeout, but it wasn't TCP's job to decide how long you get for lunch.

Today most servers have shorter timeouts, mostly as a defense against denial
of service attacks. But it's often the HTTP server, not the TCP level, that
times out first.

[1]
[https://blogs.technet.microsoft.com/nettracer/2010/06/03/thi...](https://blogs.technet.microsoft.com/nettracer/2010/06/03/things-that-you-may-want-to-know-about-tcp-keepalives/)

~~~
dap
> Idle but half-dead TCP connections will eventually time out. It can take
> hours, though, on some systems.

This is not true, in general. I think you're describing connections which have
TCP KeepAlive enabled on them.

~~~
colanderman
I find that usually, long-lived idle TCP connections get killed by stateful
firewalls, of which there are often several along any given path through the
Internet.

e.g. my home router has a connection timeout of 24 hours.

~~~
noselasd
There are quite a number of stateful firewalls that just silently drop the
connection without sending a RST, though, meaning a TCP connection can idle
forever if it does not employ any TCP- or application-level keep-alives or
timeouts.

------
richm44
There are some other fun edge cases - for example:

- If you have a port-forward to a machine that is switched off, then you can
get ICMP network unreachable or ICMP host unreachable as the response to the
SYN in the initial handshake.

This can also happen at any point in the connection. Other ICMP messages can
also occur like this (e.g. admin prohibited).

It's always worth remembering that the TCP connection is sitting on an
underlying network stack that can also signal errors outside of the TCP
protocol itself.

~~~
rdtsc
Ooh, fun! I just learned something new. I knew about how blocking ICMP could
lead to path MTU discovery issues, but this is different.

Do you know how the ICMP unreachable is manifested at the socket API level? I
looked at connect() errors and saw ENETUNREACH -- guessing that's the one?
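
A sketch of the two places these errors can surface at the socket API, as I
understand it (the helper names here are made up):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    /* During the handshake: a blocking connect() fails outright. */
    void check_connect(int fd, const struct sockaddr *sa, socklen_t len)
    {
        if (connect(fd, sa, len) < 0) {
            switch (errno) {
            case ENETUNREACH:   /* ICMP network unreachable */
            case EHOSTUNREACH:  /* ICMP host unreachable */
                fprintf(stderr, "connect: %s\n", strerror(errno));
            }
        }
    }

    /* Mid-connection: the error is queued on the socket and shows up on
     * the next read/write, or can be fetched without doing any I/O. */
    int pending_error(int fd)
    {
        int err = 0;
        socklen_t len = sizeof err;
        getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
        return err;    /* e.g. EHOSTUNREACH from a later ICMP message */
    }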

------
droopybuns
On OSX:

    $ sudo dtruss -d -t bind,listen,accept,poll,read,write nc -l -p 8080
    dtrace: failed to execute nc: dtrace cannot control executables signed with restricted entitlements

If you copy nc over to /tmp/ and run dtruss again, it will work. Hope that
helps others. Fun article, OP!

~~~
sigjuice
I never liked dtruss much for complicated debugging because the output shows
way too many things in hex. OTOH, output from strace on Linux tends to be much
nicer. e.g.

    
    
      $ strace -tebind nc -l -p 8080
      14:27:44 bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
    
      $ strace -teconnect nc 10.88.88.140 8080
      14:28:29 connect(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("10.88.88.140")}, 16) = -1 EINPROGRESS (Operation now in progress)
    

Does anyone know if the various dtrace-based dtruss implementations can be
made to do this?

~~~
rdtsc
Yeah, same here. When I want to trace my code, I boot up a VM just because I
am more familiar with strace and gdb than with dtruss and lldb. I can see
dtruss having more and fancier features, but I am just too lazy to learn two
tools of the same type.

------
shabble
Other than using VMs and 'surprise' power-cycling them, what other ways are
there to test these sorts of behaviours? Would an entry like `iptables -I
FORWARD -s ... -d ... -j DROP` on a host the 2 endpoints route over suffice to
suddenly stop all traffic (or a pair of rules to cover a->b and b->a packets)?

Are there any tools for 'and suddenly, the network disappears/goes to 99%
packetloss/10s latency' etc that you can use to test existing software? I'm
imagining some sort of iptables/tc wrappers that can take defined scenarios
(or just chaos-monkey it) and apply them to traffic, possibly allowing
assertions as to how the application should behave for various situations.

~~~
greenleafjacob
Jepsen [1]. For previous discussions on HN, see MongoDB [2], Cassandra [3],
RabbitMQ [4], and etcd [5]. Others exist as well.

[1]: [https://github.com/aphyr/jepsen](https://github.com/aphyr/jepsen)

[2]:
[https://news.ycombinator.com/item?id=9417773](https://news.ycombinator.com/item?id=9417773)

[3]:
[https://news.ycombinator.com/item?id=6451885](https://news.ycombinator.com/item?id=6451885)

[4]:
[https://news.ycombinator.com/item?id=7863856](https://news.ycombinator.com/item?id=7863856)

[5]:
[https://news.ycombinator.com/item?id=7884640](https://news.ycombinator.com/item?id=7884640)

------
johnrob
I used to mimic the power-off scenario using a wired connection and a router.
The client machine connects over ethernet to an up-linked router, and to mimic
the server's power-off we just pull the uplink cable out of the router. As far
as the client can tell, the network interface is still up (because the router
is still alive). If you pull the client-side cable from the router instead,
the client's network stack detects the lost network interface and the socket
closes immediately.

~~~
caf
Some routers will send ICMP Destination Unreachable back in that situation.

------
arielb1
Puzzler #3 does not have much to do with _using_ a half-closed socket -- you
still have a socket that selects as writable, but is not connected to
anything.

For some reason I remember that when a process crashes, the OS's TCP stack
sends an RST to the opposing side rather than a FIN, preventing puzzler #3.

~~~
jtakkala
Whether a process exits abnormally or not, if there's unread data in the
receive buffer then the kernel will send a RST; otherwise it sends a FIN.
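
A self-contained way to watch this happen on the loopback (a rough sketch
with error handling omitted; run tcpdump -i lo alongside it and you should
see a RST instead of a FIN from the client side):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int ls = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in a = { .sin_family = AF_INET,
                                 .sin_port = htons(9999),
                                 .sin_addr.s_addr = htonl(INADDR_LOOPBACK) };

        bind(ls, (struct sockaddr *)&a, sizeof a);
        listen(ls, 1);

        if (fork() == 0) {              /* child plays the client */
            int c = socket(AF_INET, SOCK_STREAM, 0);
            connect(c, (struct sockaddr *)&a, sizeof a);
            sleep(1);                   /* let the server's byte arrive */
            close(c);                   /* unread data -> kernel sends RST */
            _exit(0);
        }

        int s = accept(ls, NULL, NULL);
        write(s, "x", 1);               /* data the client never reads */
        sleep(2);                       /* stay alive long enough to observe */
        return 0;
    }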

------
partycoder
This would be in terms of system calls. In terms of messages being sent, you
can capture the traffic and see what was sent, with which flags, etc.

------
saurik
"When I type the message and hit enter, kodos gets woken up, reads the message
from stdin, and sends it over the socket." <\- From Puzzler #1, this is a
typo: it should say "Kang", not "Kodos". Kang is sending a message to Kodos,
and this message was typed into Kang's console (waking up Kang) which sent the
message to Kodos.

------
tbarbugli
I didn't really get the difference between puzzlers 1 and 2 -- don't they test
the same scenario?

~~~
arielb1
Puzzler 1 is when a machine goes down and stays down, and you hang or
retransmit until a timeout. Puzzler 2 is when a machine goes down and then
comes back up, and you get reset.

~~~
rob-olmos
Puzzler #1 is an unclean reboot with a server-side TCP reset after the server
is back online.

Puzzler #2 is an unclean shutdown with a client-side TCP timeout since the
server never responds.

~~~
arielb1
Yeah. The other way around.

------
rob-olmos
For Puzzler #2, did the kernel send any additional attempts on its own, beyond
the initial attempt, before the 5min timeout?

~~~
caf
Yes - but you only see those if you're watching on the network side.

------
syncsynchalt
One thing missing from this article is TCP keepalives, which prevent the
first two conditions from going on indefinitely.

However, since the minimum allowed TCP keepalive interval is (usually) two
hours, host reboot/shutdown is usually detected via a heartbeat at the
application layer instead, which sends some no-op traffic every n seconds or
minutes.

~~~
ghusbands
The article mentions both. "It's possible to manage this problem using TCP or
application-level keep-alive." (and did before you commented)

------
webscalist
Wow, that modern CSS makes it difficult to read for longer than 2 seconds.

~~~
eridius
What?

~~~
lightlyused
What he means is that the contrast between the text and the background makes
the article hard to read. If they upped the font-weight from 300 to 400, it
would give much better contrast.

~~~
eridius
Huh, I didn't even notice it. I wonder if this is due to text rendering
differences between platforms? The page is pretty readable on Safari on OS X.

~~~
vacri
Any time you see mid- or light-grey text on white, you know that the designer
is on OSX and doesn't care one jot about other viewers.

~~~
forgetfullest
They may care, but they're simply unaware of how bad it looks, and also
unaware of browsershots.org and similar services :)

