TCP Puzzlers (joyent.com)
468 points by jsnell on Aug 18, 2016 | 70 comments

That is an awesome post. If you are thinking about writing distributed systems code you need to understand this behavior of TCP/IP completely. Back when a 100 nodes was a "big" network these sorts of problems were so rare that you could be forgiven for not taking them into account, but these days with virtualization you can have a 500 node network all on the same freakin' host! And the law of large numbers says "rare things happen often, when you have a lot of things."

Fun anecdote: at Blekko we had people who tried to scrape the search engine by fetching all 300 pages of results. They would do that with some script or code, and it would be clear they weren't human because they would ask for each page right after the other. We sent them to a process that Greg wrote on a machine that did most of the TCP handshake and then went away. As a result the scraper's script would hang forever. We saw output that suggested some of these things sat there for months waiting for results that would never come.

> If you are thinking about writing distributed systems code you need to understand this behavior of TCP/IP completely.

Not necessarily. If you have a good RPC layer, it will abstract away most of these details. For example, gRPC automatically does periodic health checking so you never see a connection hang forever. (But you do see some CPU and network cost associated with open channels!) gRPC (or at least gRPC's predecessor, Stubby) has various other quirks which are more critical to understand if you build your system on top of it. I expect the same is true of other RPC layers. Some examples below:

* a boolean controlling "initial reachability" of a channel before a TCP connection is established (when load-balancing between several servers, you generally want it to be false so your RPC goes to a known-working server)

* situations in which existing channels are working but establishing new ones does not

* the notion of "fail-fast" when a channel is unreachable

* lame-ducking when a server is planning to shut down

* deadlines being automatically adjusted for expected round-trip transit time

* load-balanced channels: various algorithms, subsetting

* quirky but extremely useful built-in debugging webpages

It's valuable to understand all the layers underneath when abstractions leak, but the abstractions work most of the time. I'd say it's more essential to understand the abstraction you're using than ones two or three layers down. People build (or at least contribute to) wildly successful distributed systems without a complete understanding of TCP or IP or Ethernet every day.

Hmm, I realize you can't say, but when I was there "Stubby hangs forever ..." was in a painfully large number of bug reports. So gRPC has fixed all that, has it? Great to know.

In my experience an abstraction is only as strong as an organization's ability to hold its invariant assumptions, well, invariant. And what I took away from that experience was that knowing how an abstraction was implemented allowed me to see those invariant violations well before my peers started to ask, "well, maybe this library isn't working like I expect it to."

> Hmm, I realize you can't say, but when I was there "Stubby hangs forever ..." was in a painfully large number of bug reports. So gRPC has fixed all that, has it? Great to know.

Huh. I don't see bug reports like that. "This RPC with no deadline hangs forever" definitely but I wouldn't call that a Stubby problem. I'd call it a buggy server (one that leaks its reply callback, has a deadlock in some thread pool it's using for request handling, etc.) and a client that isn't properly using deadline propagation.

> For example, gRPC automatically does periodic health checking so you never see a connection hang forever.

So, a simple setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE) call?

Application-level keepalives can be handy in other regards, eg. you can assert that your application's event loop isn't blocked on an infinite loop, deadlock, or blocking call in user-level code, and you can pass operational statistics back to your (presumably proprietary) monitoring system.

I found in practice that it is usually not enough, and every distributed system I designed or worked on ended up with heartbeats at some point. It could be OS peculiarities, or it could be the inability to tweak the keepalive timers.

Sometimes the heartbeats are sent by the server, sometimes by the client, it depends. But they always end up in the application layer protocol.
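For anyone who hasn't written one, an application-layer heartbeat can be very small. Below is a minimal sketch, not anyone's production protocol: the `PING`/`PONG` wire format and both function names are made up for illustration, and it assumes the five-byte reply arrives in a single `recv()`, which a real protocol should not rely on.

```python
import socket

# Hypothetical wire format, invented for this sketch.
PING, PONG = b"PING\n", b"PONG\n"


def answer_one_ping(sock):
    """Peer side: read one PING and reply with PONG."""
    if sock.recv(len(PING)) == PING:
        sock.sendall(PONG)


def peer_is_alive(sock, timeout=2.0):
    """Send one PING and report whether a PONG arrives within `timeout` seconds."""
    sock.settimeout(timeout)
    try:
        sock.sendall(PING)
        return sock.recv(len(PONG)) == PONG
    except OSError:
        # Covers socket.timeout as well as resets and other socket errors.
        return False
```

Whichever side calls `peer_is_alive()` on a timer gets to decide what "dead" means, which is the whole point of doing it above TCP.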

... because the TCP keepalive is a minimum of 2 hours, per RFC. Which is far too long, so everyone adds one at the application level.

The minimum default is 2 hours, but applications can configure this to much smaller intervals, like 10 seconds.

I feel like putting heartbeats themselves into the application level is a layering violation. They go in the session or presentation layer. WebSockets does it right, with its own heartbeat frame type that gets handled in userland but outside of your app's code.

On Linux you can set tcp_keepalive_intvl and tcp_keepalive_probes to make this much shorter, but it's global to all sockets, so app keepalives are better for finer control, among other things mentioned.

There are three socket options (TCP_KEEPCNT/TCP_KEEPIDLE/TCP_KEEPINTVL) that let you control this per socket too; it's not just global.
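A rough sketch of the per-socket version in Python, for reference. The helper name and the default values (10 s idle, 5 s between probes, 3 probes) are arbitrary choices for illustration. The three `TCP_KEEP*` constants are the Linux names; other platforms may spell them differently or not expose them, so each one is guarded.

```python
import socket


def enable_keepalive(sock, idle=10, interval=5, count=3):
    """Turn on TCP keepalive and tune the per-socket timers where available."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux option names; skipped silently on platforms that lack them.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # unanswered probes before giving up
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock
```

With these numbers a dead peer should be noticed after roughly idle + interval * count seconds of silence, rather than after the two-hour default.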

I guess the HTTP/2 PING messages.

A connection hanging for two hours is awful close to hanging forever.

> Back when a 100 nodes was a "big" network these sorts of problems were so rare that you could be forgiven for not taking them into account

Not really. Writing reliable network programs always had to take this into account. Take the scenario (not mentioned in TFA) where a firewall somewhere in the network path suddenly starts blocking traffic in one direction. Or handle a patch cable being pulled from a switch (or from your computer which has a different result). These scenarios always were real and resulted in comparable connection error states.

Do agree it's a good post though, explains it rather nicely.

Completely agree. I've run into a lot of these problems on a single node.

This practice is a type of tarpitting, and is fairly common and achievable in several ways. I've seen it implemented with several-second delays between bytes, but you have to weigh such active techniques against having the connection and resources set up.

Or you can re-write enough of the TCP stack in user land to do all of the handshaking and then go away. No connections, no resources wasted, infinite numbers of clients waiting forever and ever in their read() call.

Oh, sure, I actually liked your method and wish I had thought of it. I meant active techniques like mine.

So there's a way to hang the connection that won't be caught by a `timeout` setting in most HTTP clients? Or the bot writers were just too lazy to add timeouts in their bots?

I suspect the bot writers were too lazy to add timeouts.

He had no way to know if they were actually still waiting months later. Just because the peer has not sent a RST doesn't mean the remote end still has a process or machine in existence.

As one data point, Python feed parser has no timeout. I've seen it sit for days when the remote end disappeared.
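To make the "add a timeout" point concrete, here is a minimal sketch of a client read that gives up instead of waiting forever. The function name and defaults are made up; the only thing it assumes is that the timeout passed to `socket.create_connection()` stays in effect for later `recv()` calls on the returned socket.

```python
import socket


def read_with_timeout(host, port, timeout=5.0, nbytes=4096):
    """Connect and read once, giving up after `timeout` seconds of silence."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        try:
            return sock.recv(nbytes)
        except socket.timeout:
            # The handshake completed but the peer never sent anything.
            return None
```

A listener that completes the handshake and then never answers (like the tarpit described earlier in the thread) makes this return `None` after `timeout` seconds, where a bare blocking `recv()` would sit there indefinitely.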

Idle but half-dead TCP connections will eventually time out. It can take hours, though, on some systems.[1] Early thinking was that if you have a Telnet session open, TCP itself should not time out. The Telnet server might have a timeout, but it wasn't TCP's job to decide how long you get for lunch.

Today most servers have shorter timeouts, mostly as a defense against denial of service attacks. But it's often the HTTP server, not the TCP level, that times out first.

[1] https://blogs.technet.microsoft.com/nettracer/2010/06/03/thi...

> Idle but half-dead TCP connections will eventually time out. It can take hours, though, on some systems.

This is not true, in general. I think you're describing connections which have TCP KeepAlive enabled on them.

I find that usually, long-lived idle TCP connections get killed by stateful firewalls, of which there are often several along any given path through the Internet.

e.g. my home router has a connection timeout of 24 hours.

There are quite a number of stateful firewalls that just silently drop the connection without sending a RST, though, meaning a TCP connection can idle forever if the application does not employ any TCP or application-level keep-alives or timeouts.

See RFC 793 [1], page 77, "User Timeout". This is a bit ambiguous, and there's an attempt to clarify it in RFC 5482 [2].

[1] https://tools.ietf.org/html/rfc0793

[2] https://tools.ietf.org/html/rfc5482

From RFC 5482's abstract:

> The TCP user timeout controls how long transmitted data may remain unacknowledged before a connection is forcefully closed.

As I understand it, this only applies if there is data outstanding. In the puzzler, there was no data outstanding. You're right that if there had been, the side with data outstanding would eventually notice the problem and terminate the connection. The default timeout on most systems I've seen is 5-8 minutes.
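On Linux there is a per-socket knob for this, the `TCP_USER_TIMEOUT` socket option, which takes a value in milliseconds. A minimal sketch follows; the helper name is made up, and it assumes a Python build that exposes `socket.TCP_USER_TIMEOUT`, returning `False` where it does not. As discussed above, it only matters while sent data is waiting for an acknowledgement, so it does nothing for a fully idle connection.

```python
import socket


def set_user_timeout(sock, millis):
    """Abort the connection if sent data stays unacknowledged for `millis` ms.

    Returns False when this platform's socket module lacks TCP_USER_TIMEOUT.
    """
    opt = getattr(socket, "TCP_USER_TIMEOUT", None)
    if opt is None:
        return False
    sock.setsockopt(socket.IPPROTO_TCP, opt, millis)
    return True
```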

By contrast, the previous article you linked was about KeepAlive, which will always eventually detect this condition, but by default usually not for at least two hours.

Correct. Connection read/write timeouts only apply when, wait for it, reading or writing.

All TCP keepalive does is send a probe packet every so often, and in practice it is rarely enabled.

I'm battling this problem at the moment in a server I wrote (although I'm using a C++ framework above the socket interface, which adds to the complexity). The problem only occurs on Linux, so it wasn't detected for a while, and only with one client (running an embedded RTOS). My server ends up stuck in CLOSE_WAIT and therefore wastes a responder thread; eventually this reaches the limit and the server stops responding completely. It's a really difficult one to debug, as it takes about 3 minutes to cause the problem to occur. It's easy enough to see what is going on at the TCP level, but it gets more complex to try and resolve this, as the various software layers add further abstractions above it. It's one thing to see the TCP messages, but another to try and understand this at a higher code level. The CLOSE_WAIT state does appear to time out, but not for a very long time, too long in this case.

If you're stuck in CLOSE_WAIT, it's a bug in your software: you've received a FIN and need to close the socket if you're done with it.

The socket should be marked ready for reading, but when you try to read you'll get zero bytes back: Something in your framework may not realize that -- truss/strace the process and I'd guess you'll see a 0 byte read followed by not closing it; alternatively you may not be polling the socket for read availability?

Some things would change if you intended for the socket to be half closed, but I don't think you do?
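The pattern being described, sketched in Python rather than the poster's C++ framework (the function name is made up): a zero-byte read is the signal that the peer's FIN has arrived, and the socket stays in CLOSE_WAIT until your side calls close.

```python
import socket


def drain_and_close(sock, bufsize=4096):
    """Read until end-of-stream, then close our side.

    Returns everything received before the peer shut down.
    """
    chunks = []
    while True:
        data = sock.recv(bufsize)
        if not data:
            # b"" means the peer sent FIN; a TCP socket is now in CLOSE_WAIT.
            break
        chunks.append(data)
    # Closing sends our own FIN, which is what moves the socket out of CLOSE_WAIT.
    sock.close()
    return b"".join(chunks)
```

If the framework swallows the zero-byte read and never reaches the `close()`, you get exactly the leaked-thread symptom described above.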

Depends if TCP keepalives are enabled, but if the connection goes through a NAT gateway, that will certainly have a tracking state timeout. Usually it is on the order of at least hours though, sometimes days.

setsockopt [1], though, lets you change the timeout on a per-socket basis via the SO_RCVTIMEO and SO_SNDTIMEO options.

If you need to know sooner that your data isn't going to be sent, it's pretty trivial to set up a short timeout that overrides the system defaults.

[1] https://www.freebsd.org/cgi/man.cgi?query=setsockopt&sektion...

SO_RCVTIMEO and SO_SNDTIMEO are for setting a timeout on blocking socket operations. They don't tell you anything about whether the other end is still there or not.

Setting those options isn't much different from setting a timeout in your poll() call.

There are some other fun edge cases - for example:

- If you have a port-forward to a machine that is switched off, then you can get ICMP network unreachable or ICMP host unreachable as the response to the SYN in the initial handshake.

This can also happen at any point in the connection. Other ICMP messages can also occur like this (eg. admin prohibited).

It's always worth remembering that the TCP connection is sitting on an underlying network stack that can also signal errors outside of the TCP protocol itself.

Ooh fun! I just learned something new. I knew about how blocking ICMP could lead to path MTU discovery issues, but this is different.

Do you know how the ICMP unreachable is manifested at the socket API level? I looked at the connect() errors and saw ENETUNREACH; guessing that's the one?
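One way to check this yourself is to print the errno that connect() actually fails with. The helper below is a made-up sketch. As far as I know, a network-unreachable ICMP during the handshake surfaces as ENETUNREACH and a host-unreachable one as EHOSTUNREACH, but treat that mapping as an assumption to verify on your own stack rather than a guarantee.

```python
import errno
import socket


def connect_errno(host, port, timeout=3.0):
    """Try to connect; return None on success or a symbolic error name on failure."""
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return None
    except socket.timeout:
        # Nothing came back at all: no SYN-ACK, no RST, no ICMP error.
        return "TIMEOUT"
    except OSError as exc:
        # e.g. ECONNREFUSED, ENETUNREACH, EHOSTUNREACH
        return errno.errorcode.get(exc.errno, str(exc.errno))
```

Pointing it at the port-forward from the example above, with the target machine switched off, should show which of the two you get.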


  $ sudo dtruss -d -t bind,listen,accept,poll,read,write nc -l -p 8080
  dtrace: failed to execute nc: dtrace cannot control executables signed with restricted entitlements

If you copy nc over to /tmp/ and run dtruss again, it will work. Hope that helps others. Fun article, OP!

I never liked dtruss much for complicated debugging because the output shows way too many things in hex. OTOH, output from strace on Linux tends to be much nicer. e.g.

  $ strace -tebind nc -l -p 8080
  14:27:44 bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("")}, 16) = 0

  $ strace -teconnect nc 8080
  14:28:29 connect(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("")}, 16) = -1 EINPROGRESS (Operation now in progress)

Does anyone know if the various dtrace-based dtruss implementations can be made to do this?

Yeah, same here. When I want to trace my code, I boot up a VM just because I'm more familiar with strace and gdb than with dtruss and lldb. I can see dtruss having more and fancier features, but I am just too lazy to learn two tools of the same type.

Other than using VMs and 'surprise' power-cycling them, what other ways are there to test these sorts of behaviours? Would an entry like `iptables -I FORWARD -s ... -d ... -j DROP` on a host the 2 endpoints route over suffice to suddenly stop all traffic (or a pair of rules to cover a->b and b->a packets)?

Are there any tools for 'and suddenly, the network disappears/goes to 99% packetloss/10s latency' etc that you can use to test existing software? I'm imagining some sort of iptables/tc wrappers that can take defined scenarios (or just chaos-monkey it) and apply them to traffic, possibly allowing assertions as to how the application should behave for various situations.

Jepsen [1]. See the previous discussions on HN covering MongoDB [2], Cassandra [3], RabbitMQ [4], and etcd [5]. Others exist as well.

[1]: https://github.com/aphyr/jepsen

[2]: https://news.ycombinator.com/item?id=9417773

[3]: https://news.ycombinator.com/item?id=6451885

[4]: https://news.ycombinator.com/item?id=7863856

[5]: https://news.ycombinator.com/item?id=7884640

I wrote a tool for basically that exact use case. It causes deterministic network faults (on a per-flow rather than per-interface basis), and takes trace files from both sides of the fault.

[0] https://www.snellman.net/blog/archive/2015-10-01-flow-disrup...

Would ifconfig eth0 down do the trick?

I used to mimic the power-off scenario using a wired connection and a router. The client machine connects over Ethernet to an up-linked router, and to mimic the server's power-off, we just pull the uplink cable out of the router. As far as the client can tell, the network interface is still up (because the router is still alive). If you pull the client-side cable from the router, the client's network stack detects the lost network interface and the socket closes immediately.

Some routers will send ICMP Destination Unreachable back in that situation.

Puzzler #3 does not have much to do with using a half-closed socket: you still have a socket that selects as writable, but is not connected to anything.

For some reason I remember that when a process crashes the OS's TCP stack sends an RST to the opposing side, rather than a FIN, preventing puzzler #3.

When a process exits, abnormally or not, the kernel sends a RST if there's unread data in the receive buffer; otherwise it sends a FIN.
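This is easy to see over loopback without killing any processes, since closing the socket by hand triggers the same choice. A minimal sketch, assuming a Linux-like stack; the function name and the "FIN"/"RST"/"data" labels are invented for illustration.

```python
import select
import socket


def close_and_observe(read_first):
    """Send data to a loopback server socket, close that socket with or
    without reading the data first, and report what the client then sees."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.create_connection(listener.getsockname(), timeout=5.0)
    server, _ = listener.accept()
    try:
        client.sendall(b"hello")
        # Wait until the bytes are sitting in the server's receive buffer.
        select.select([server], [], [], 5.0)
        if read_first:
            server.recv(1024)
        server.close()
        try:
            # A zero-byte read is how a FIN shows up at the API level.
            return "FIN" if client.recv(1024) == b"" else "data"
        except ConnectionResetError:
            return "RST"
    finally:
        client.close()
        listener.close()
```

Calling it with `read_first=True` and then `read_first=False` exercises both branches of the rule above; capturing the loopback traffic at the same time shows the corresponding flags on the wire.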

This would be in terms of system calls. In terms of messages being sent you can capture the traffic and see what was sent, with which flags, etc.

"When I type the message and hit enter, kodos gets woken up, reads the message from stdin, and sends it over the socket." <- From Puzzler #1, this is a typo: it should say "Kang", not "Kodos". Kang is sending a message to Kodos, and this message was typed into Kang's console (waking up Kang) which sent the message to Kodos.

For Puzzler #2, did the kernel send any additional attempts on its own, beyond the initial attempt, before the 5min timeout?

Yes - but you only see those if you're watching on the network side.

I did not really get the difference between puzzlers 1 and 2. Don't they test the same scenario?

Puzzler 1 is when a machine goes down and stays down, and you hang or retransmit until a timeout. Puzzler 2 is when a machine goes down and then comes back up, and you get reset.

Puzzler #1 is an unclean reboot with a server-side TCP reset after the server is back online.

Puzzler #2 is an unclean shutdown with a client-side TCP timeout since the server never responds.

Yeah. The other way around.

One thing missing from this article is TCP keepalives, which prevent the first two conditions from going on indefinitely.

However, since the default keepalive interval is (usually) two hours, host reboot/shutdown is usually detected via a heartbeat at the application layer instead, which sends some no-op traffic every n seconds or minutes.

The article mentions both. "It's possible to manage this problem using TCP or application-level keep-alive." (and did before you commented)

Like you say, the heartbeat removes some instances, but not others. The article also says,

"It's the responsibility of the application to handle the cases where these differ (often using a keep-alive mechanism)."

Wow, that modern CSS makes it difficult to read for longer than 2 seconds.

At least for me, there's a "body { font-weight: 300; }" in https://www.joyent.com/assets/css/style.css that makes the text difficult to read.

Removing that using Chrome's inspector made the text legible again.


What he means is that the contrast between the text and background makes the article hard to read. If they would up the font-weight to 400 from 300 it would give a much better contrast.

Huh, I didn't even notice it. I wonder if this is due to text rendering differences between platforms? The page is pretty readable on Safari on OS X.

Any time you see mid- or light-grey text on white, you know that the designer is on OSX and doesn't care one jot about other viewers.

They may care, but they're simply unaware of how bad it looks, and also unaware of browsershots.org and similar services :)

It's a bit light in Safari on OS X on a non-Retina screen.

Looks nice in Safari on iPhone too.

Firefox on Ubuntu here.
