> A little known fact is that it's not possible to have any packet loss or congestion on the loopback interface.
This seems a bit misleading, given the two counterexamples that the article describes after this.
> If you think about it - why exactly the socket can automatically expire the FIN_WAIT state, but can't move off from CLOSE_WAIT after some grace time. This is very confusing... And it should be!
On illumos, the FIN_WAIT_2 -> TIME_WAIT transition happens only after 60 seconds if the application has closed the socket file descriptor. In that case, by definition the application has no handle with which to perform operations on the socket. The resource belongs exclusively to the kernel. If the other system disappeared forever, and there were no timeout, that socket would be held open forever.
By comparison, in CLOSE_WAIT, the application still has a handle on the socket, and it's responsible for the resource. The application can even keep sending more data in this case (as part of a graceful termination of a higher-level protocol). Or it could enable keep-alive. It's able to respond to the case where the other system has gone away, and it could break the application if the kernel closed the socket on its behalf.
I think the behavior is non-obvious, but pretty reasonable.
> This seems a bit misleading, given the two counterexamples that the article describes after this.
Ok, allow me to be more precise. The loopback _link_, the virtual cable, the virtual interface thingy has zero packet loss. If you imagine a wire that represents loopback, it will never congest, never have interference, never have any loss.
The packets will _always_ reach the other end. Now, what happens after they reach the other end, that's a separate story. The article indeed showed that it's possible that the kernel network stack dropped packets because of full buffers, or the CLOSE-WAIT thing.
But it's the kernel network stack on the receiving end dropping packets actively, not the loopback "wire".
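A quick way to see that distinction (a hypothetical sketch using UDP, where drops are easy to observe; exact counts depend on the kernel's buffer sizing):

```python
import socket

# Even over loopback, the receiving stack drops datagrams once the
# receiver's buffer is full -- the "wire" never loses anything, the
# kernel on the receiving end does.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)  # tiny buffer
rx.bind(("127.0.0.1", 0))

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for _ in range(1000):
    tx.sendto(b"x" * 1024, rx.getsockname())

# Drain whatever actually made it into the receive buffer.
rx.setblocking(False)
received = 0
while True:
    try:
        rx.recv(2048)
        received += 1
    except BlockingIOError:
        break
tx.close()
rx.close()
print(received)  # far fewer than the 1000 that were sent
```

Every one of those datagrams "reached the other end" of loopback; the receiving kernel simply discarded most of them for lack of buffer space.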
> ...and it could break the application if the kernel closed the socket on its behalf.
The article showed a case where the other peer just quit altogether. What the kernel thinks about a CLOSE-WAIT socket is irrelevant. The connection is _dead_. You can send data over it, you can attempt to read, you can try to close it gracefully - no difference. The other end refused to cooperate.
I argue that keeping the socket in CLOSE-WAIT for more than, say, 60 seconds is stupid, since the other party _will_ give up.
Basically, any application has 60 seconds to flush its data and call close() after receiving a FIN. After that, it all gets weird. So why not enforce these 60 seconds and just move the socket off CLOSE_WAIT to some next stage?
Now, what precisely a file descriptor represents is another story. Maybe you're right - maybe a file descriptor can't point to a socket in the TIME-WAIT or LAST-ACK states.
Although it's rare, it's legal and possible for peer A to send a FIN, for peer B to continue to send data on the half-closed connection, and for peer A to expect that data and continue to receive it. How is the kernel expected to tell the difference between that behavior and a socket leak? In this case, peer A fully closed the application socket, so if peer B sent data on the half-closed connection, it would get an RST from peer A (if peer A was still mildly cooperating), or the data would simply never be ACKed (if peer A left the network).
It's probably rare because people don't understand it well, and additionally some of the people who don't understand it create middle-boxes that break these types of connections in exciting ways.
> The packets will _always_ reach the other end. Now, what happens after they reach the other end, that's a separate story. The article indeed showed that it's possible that the kernel network stack dropped packets because of full buffers, or the CLOSE-WAIT thing.
Which is kind of ironic, because that's how packets get lost in the wild as well. The router's buffer gets full due to congestion and it drops them.
Remember that it's possible for an application to send FIN without closing the socket. In the article's case, the connection was dead, but from the TCP stack's perspective, it has no way to tell that case from the case where the connection is still alive on the other end.
While weird, and I don't think it's a good idea, it's possible for host A to send FIN over a connection to host B, causing host B's socket to enter CLOSE_WAIT, but to have the connections exist in this state for an extended period (many minutes) while host B continues to send data to host A. A totally plausible interaction is an HTTP client that connects to the server, sends the request headers, sends FIN (causing the server to wind up in CLOSE_WAIT), and then the server spends several minutes sending the requested data. Even if the socket were idle, the connection may be perfectly healthy.
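That interaction can be sketched over loopback (a hypothetical minimal example; `shutdown(SHUT_WR)` is what sends the FIN while leaving the read side usable):

```python
import socket
import threading

# The client sends a FIN with shutdown(SHUT_WR), leaving the server's
# socket in CLOSE_WAIT, yet the server can still send the whole
# response over the half-closed connection.
def serve(listener):
    conn, _ = listener.accept()
    request = b""
    while True:
        data = conn.recv(4096)
        if not data:              # client's FIN: we are now in CLOSE_WAIT
            break
        request += data
    conn.sendall(b"response to: " + request)  # still legal to send
    conn.close()                  # CLOSE_WAIT -> LAST_ACK -> CLOSED

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
t = threading.Thread(target=serve, args=(listener,))
t.start()

client = socket.create_connection(listener.getsockname())
client.sendall(b"hello")
client.shutdown(socket.SHUT_WR)   # send FIN, keep the read side open
reply = b""
while True:
    data = client.recv(4096)
    if not data:                  # server's FIN arrives when it's done
        break
    reply += data
client.close()
t.join()
listener.close()
print(reply)
```

From the server's point of view, nothing about CLOSE_WAIT prevents the sendall(); only the server's own close() moves the state machine forward.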
Plus, even if the kernel cleaned up the socket, the application has still leaked a file descriptor (also a limited resource).
> Basically, any application has 60 seconds to flush the data and call the close() after receiving FIN. After that, it all gets weird. So why not enforce this 60 seconds and just move the socket off CLOSE_WAIT to some next stage?
For the reasons I mentioned, as long as the socket is open in the application on host B, I don't think the kernel can conclude that the program is done with it.
That's okay, because this isn't all that hard to handle: if host A has actually closed its end of the socket (rather than just sending FIN), then as the post describes host A's socket will typically close about 60 seconds after receiving the ACK for its FIN. In that case, if host B continues sending data, it will eventually get an RST and socket operations will fail with ECONNRESET. No connection will be leaked.
As I understand it, the problem only happens when host B stops sending data (because the socket is leaked), in which case it will never realize that host A is gone. This is not very different than the case where the socket is still ESTABLISHED and host A has panicked, power-cycled, or failed in some other non-graceful way. A likely solution to both of these would be to use TCP keep-alive or application-level keep-alive, which will cause the same ECONNRESET behavior, and no connection will be leaked.
I agree that this case is very non-obvious, and I ran into a very similar problem recently that was painful to debug. But I don't think it's safe for the kernel to attempt to solve this, there would still be a leak even if it did, and the same mechanism that the application can use to identify failed connections in the ESTABLISHED state can likely be used to identify this case as well.
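For reference, a sketch of enabling keep-alive from the application (the `TCP_KEEP*` knobs are Linux-specific, and the values here are illustrative, not recommendations):

```python
import socket

# Enable TCP keep-alive so a vanished peer (leaked socket, power-cycled
# host, etc.) is eventually detected with ECONNRESET/ETIMEDOUT instead
# of the connection idling forever.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle seconds before first probe
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # failed probes before giving up
```

With those (hypothetical) values, a dead peer would be detected after roughly 60 + 5*10 seconds of silence, whether the socket is ESTABLISHED or CLOSE_WAIT.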
Wow. Awesome explanation. Ok, so basically as long as the socket is in CLOSE-WAIT the server may want to send() something.
Do you happen to know the detailed semantics of the `tcp_fin_timeout` sysctl? Will FIN-WAIT-2 progress to cleanup 60 seconds after entering the state, or 60 seconds after the last received packet?
Would this be sane:
- Let's allow kernel to expire CLOSE-WAIT if the application didn't send anything for more than 60 seconds.
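For what it's worth, the current `tcp_fin_timeout` value can be read from procfs on Linux (this sketch only locates the knob; the precise timer semantics asked about above are defined by the kernel source):

```python
# Linux-specific: tcp_fin_timeout controls how long an orphaned
# connection (file descriptor already closed) stays in FIN_WAIT_2.
with open("/proc/sys/net/ipv4/tcp_fin_timeout") as f:
    fin_timeout = int(f.read().strip())
print(fin_timeout)  # 60 by default on Linux
```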
> - Let's allow kernel to expire CLOSE-WAIT if the application didn't send anything for more than 60 seconds.
No, that's not sane either. If the application wants to expire the CLOSE_WAIT socket, it can close it; at that point a FIN will be sent and the state will go to LAST_ACK while waiting for the peer to ACK the FIN, at which point the socket is totally done. It violates the RFC, but it is perfectly reasonable for LAST_ACK to have a timeout if the other peer is non-responsive, as the alternative is sockets living forever when the peer goes away (the same goes for FIN_WAIT).
Even with no realistic use cases and no efficiency gains, we still got complaints from a small number of developers who wanted the flexibility of closing a socket one way but not the other.
It's like saying that EOF has no realistic use case; that _every_ protocol should implement a custom channeling and signaling mechanism at the application layer in order to get bidirectional streams, even though they're using TCP sockets.
Now, many of the most popular protocols do that, but they do that because they're intended to be universally deployed and must deal with all the random pathological cases out in the wild. But if you don't have to worry about broken software, such as when you know there won't be broken proxies or routers in your path, then you can dramatically simplify many purpose-built protocols.
As a practical matter most people can safely make that assumption. Most of the time it's cheaper and easier (in the grand scheme of things) to make the broken software shoulder the burden. Not everybody is writing software for Cloudflare or Chrome; and fortunately _most_ responsible vendors do a decent job at implementing standards correctly.
I hope there's some confusion on your part or my part, and that you just didn't admit that WinRT was completely broken by design. That should be a hanging offense for anybody designing or implementing core infrastructure software. Decisions like that impose millions, if not billions of dollars in unnecessary costs on the world.
If you are referring to the "Assuming the target application has some space in its buffers, packet loss over loopback is not possible." caveat, then yes, it is somewhat ambiguous what's really being implied. Maybe it's just network packet loss they are referring to in the initial statement? I'm not sure that really makes sense either.
> On illumos, the FIN_WAIT_2 -> TIME_WAIT transition happens only after 60 seconds if the application has closed the socket file descriptor. ... If the other system disappeared forever, and there were no timeout, that socket would be held open forever.
Isn't that exactly what the automatic closing is supposed to prevent? Couldn't deliberate or accidental connection interruptions eventually cause a DOS?
Yes, exactly. My point was that until the application closes the file descriptor, it's up to the application to deal with that issue. (Many public-facing web servers close idle sockets for this reason.) Only once the application has closed the socket does it become the TCP stack's responsibility, and that's why the 60-second timer exists. But that timer exists only for sockets whose file descriptors are closed.
There is actually a more "gracefully degrading" solution: keep those connections around indefinitely or with a very long timeout, but recycle them if the memory is needed for new connections. It seems Linux supports this for TIME-WAIT, but it is not enabled by default (net.ipv4.tcp_tw_recycle).
Actually lwIP does such recycling and it can even kill active connections - if there's no memory for a new connection it will try to kill connections in order: TIME-WAIT, LAST-ACK, CLOSING, active connections (incl. FIN-WAIT-2).
With this setting Linux no longer follows the RFC, and there is a risk that it could confuse packets from different connections. It falls back to the TCP timestamp from PAWS, but people often disable timestamps as well.
(One way I like to test multithreaded stuff is to have a test mode where it runs for a while and quits, then run N copies of it at once, repeatedly, and leave the whole thing running for an hour/until it goes wrong. So that's how I encountered this in my case.)
Typical idioms have the server as a connection responder and not an initiator. When the server is a responder it's typically on a well-known port for the sake of the clients knowing where to find it. Having a server on an ephemeral port sounds like a pretty uncommon use case.
I will admit that I probably wouldn't expect an ETIMEDOUT over a loopback, but I think I would very quickly assume that one side or the other has a bug regarding leaked resources.
> It seems that the design decisions made by the BSD Socket API have unexpected long lasting consequences. If you think about it - why exactly the socket can automatically expire the FIN_WAIT state, but can't move off from CLOSE_WAIT after some grace time. This is very confusing... And it should be!
One side-effect of having a timeout drive you from CLOSE_WAIT to LAST_ACK without a close() would be that the remote side would not be able to see the application-level reaction to its active close. Determining whether this was a graceful application-level closure would not be possible anymore (though I'll admit I'm not sure how critical that is to the protocol).
Interesting comment on potential implications of automatically transitioning from CLOSE_WAIT.
The CLOSE_WAIT state cannot simply be ignored in the general case. The other side could potentially still be out there, lagging badly. It's important to make a really diligent attempt to close, or else other, more horrible bugs may happen (two endpoints talking to your same port).
I'd suggest the stack should never reallocate a port that is not 'clean'.
The root of all this confusion is the small port number space (16 bits). Reusing ports in a lossy environment (or a buggy client environment) can put all your sockets in a bad state within a few seconds. Especially with random allocation - it's the 'how many people in the room before two share a birthday' puzzle - and the answer is 'not many' before a collision (reallocation of a dirty port) becomes likely.
A solution would be to enlarge the port space - e.g. use uuids that don't ever repeat.
IMO the leak is the real bug and the protocols work as intended. Keep in mind that the stack(s) distinguish TCP connections not by merely one 16-bit port number but by (local port, local addr, remote addr, remote port). This quadruple is effectively 96 bits of identification.
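A minimal illustration of the quadruple (hypothetical; two loopback connections share the same remote endpoint but differ in their local ephemeral port, so they are distinct connections):

```python
import socket

# TCP connections are identified by the full
# (local addr, local port, remote addr, remote port) quadruple,
# not by a single port number.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(2)

a = socket.create_connection(listener.getsockname())
b = socket.create_connection(listener.getsockname())

a_local, a_remote = a.getsockname(), a.getpeername()
b_local, b_remote = b.getsockname(), b.getpeername()
for s in (a, b, listener):
    s.close()

# Same remote endpoint (the listener), different local ephemeral ports:
print(a_local, a_remote)
print(b_local, b_remote)
```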
Enlarging the port space has an astonishingly high cost. IMO, it would be easier to just define a new transport protocol or use a network protocol with more addresses (IPv6, e.g.).
No. TCP connections are distinct from one another only by the entire quadruple. There's no (or very nearly no) invalid combinations of ports and IPs.
Some sharing is possible. TCP port 80 (HTTP) and UDP port 80 can be used simultaneously, for instance, since TCP and UDP have separate port spaces.
127.0.0.0/8 is a pretty huge selection of ips to talk to yourself with -- if that's not enough ports for you, you can use UNIX sockets or take over DoD address space. (or ipv6)
> SO_LINGER controls the action taken when unsent messages are queued on
> a socket and a close(2) is performed. If the socket promises reliable
> delivery of data and SO_LINGER is set, the system will block the process
> on the close attempt until it is able to transmit the data or until it
> decides it is unable to deliver the information (a timeout period, termed
> the linger interval, is specified in the setsockopt() call when SO_LINGER
> is requested). If SO_LINGER is disabled and a close is issued, the
> system will process the close in a manner that allows the process to
> continue as quickly as possible.
Am I wrong?
I suspect it has 60 seconds tops to send any remaining data, and SO_LINGER could be useful there.
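A sketch of setting the linger interval (10 seconds here is an arbitrary value; the `struct linger` layout of two ints is the Linux one):

```python
import socket
import struct

# Enable SO_LINGER with a 10-second linger interval, so close() blocks
# until unsent data is delivered or the timeout expires.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# struct linger { int l_onoff; int l_linger; }
sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 10))
onoff, linger = struct.unpack(
    "ii", sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, 8))
print(onoff, linger)
```

Beware that `l_onoff=1` with `l_linger=0` changes the behavior entirely: close() then discards unsent data and sends an RST instead of a graceful FIN.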
TCP is a spec, and it could have design bugs. There are many implementations of that spec - new ones created, existing ones modified. It should not be surprising that they have bugs.
Nevertheless your point is a good one: even old and heavily-used systems will harbor latent bugs. There are surely unfixed design problems in TCP as well!