
Automatically closing FIN_WAIT_2 is a violation of the TCP specification - jgrahamc
https://blog.cloudflare.com/this-is-strictly-a-violation-of-the-tcp-specification/
======
dap
This is an interesting case, but I'm confused about some of the details.

> A little known fact is that it's not possible to have any packet loss or
> congestion on the loopback interface.

This seems a bit misleading, given the two counterexamples that the article
describes after this.

> If you think about it - why exactly the socket can automatically expire the
> FIN_WAIT state, but can't move off from CLOSE_WAIT after some grace time.
> This is very confusing... And it should be!

On illumos, the FIN_WAIT_2 -> TIME_WAIT transition happens only after 60
seconds if the application has closed the socket file descriptor. In that
case, by definition the application has no handle with which to perform
operations on the socket. The resource belongs exclusively to the kernel. If
the other system disappeared forever, and there were no timeout, that socket
would be held open forever.

By comparison, in CLOSE_WAIT, the application still has a handle on the
socket, and it's responsible for the resource. The application can even keep
sending more data in this case (as part of a graceful termination of a higher-
level protocol). Or it could enable keep-alive. It's able to respond to the
case where the other system has gone away, and it could break the application
if the kernel closed the socket on its behalf.
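
For a concrete illustration, here is a minimal sketch of what the CLOSE_WAIT
side can still do (hypothetical conn_fd from accept(); error handling elided):

    #include <unistd.h>

    /* conn_fd came from accept(); the peer has sent FIN. */
    void drain_and_finish(int conn_fd)
    {
        char buf[4096];

        /* read() returning 0 means the peer sent FIN; our end is now
           in CLOSE_WAIT, but the descriptor is still usable. */
        while (read(conn_fd, buf, sizeof buf) > 0)
            ;

        /* We may still send data (say, a final response)... */
        write(conn_fd, "bye\n", 4);

        /* ...and only close() moves the socket on (to LAST_ACK). */
        close(conn_fd);
    }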

I think the behavior is non-obvious, but pretty reasonable.

~~~
majke
_> > A little known fact is that it's not possible to have any packet loss or
congestion on the loopback interface._

 _> This seems a bit misleading, given the two counterexamples that the
article describes after this._

Ok, allow me to be more precise. The loopback _link_, the virtual cable, the
virtual interface thingy has zero packet loss. If you imagine a wire that
represents loopback, it will never congest, never have interference, never
have any loss.

The packets will _always_ reach the other end. Now, what happens after they
reach the other end is a separate story. The article indeed showed that the
kernel network stack can drop packets because of full buffers, or because of
the CLOSE-WAIT thing.

But it's the kernel network stack on the receiving end dropping packets
actively, not the loopback "wire".

 _> ...and it could break the application if the kernel closed the socket on
its behalf._

The article showed a case where the other peer just quit altogether. What the
kernel thinks about a CLOSE-WAIT socket is irrelevant. The connection is
_dead_. You can send data over it, you can attempt to read, you can try to
close it gracefully; it makes no difference. The other end refused to
cooperate.

I argue that keeping the socket in CLOSE-WAIT for more than, say, 60 seconds
is stupid, since the other party _will_ give up.

Basically, any application has 60 seconds to flush its data and call close()
after receiving FIN. After that, it all gets weird. So why not enforce this
60-second limit and just move the socket off CLOSE_WAIT to some next stage?

Now, what precisely a file descriptor represents is another story. Maybe
you're right; maybe a file descriptor can't point to a socket in the TIME-WAIT
or LAST-ACK states.

~~~
dap
> The article showed a case where the other peer just quit altogether. What
> the kernel thinks about a CLOSE-WAIT socket is irrelevant. The connection is
> _dead_. You can send data over it, you can attempt to read, you can try to
> close it gracefully; it makes no difference. The other end refused to
> cooperate.

Remember that it's possible for an application to send FIN without closing the
socket. In the article's case, the connection was dead, but from the TCP
stack's perspective, it has no way to tell that case from the case where the
connection is still alive on the other end.

While weird, and I don't think it's a good idea, it's possible for host A to
send FIN over a connection to host B, causing host B's socket to enter
CLOSE_WAIT, but to have the connection exist in this state for an extended
period (many minutes) while host B continues to send data to host A. A totally
plausible interaction is an HTTP client that connects to the server, sends the
request headers, sends FIN (causing the server to wind up in CLOSE_WAIT), and
then the server spends several minutes sending the requested data. Even if the
socket were idle, the connection may be perfectly healthy.
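
To make that concrete, here is a rough sketch of the half-close pattern using
shutdown(2), which sends FIN while keeping the descriptor open (hypothetical
function; error handling elided):

    #include <sys/socket.h>
    #include <unistd.h>

    /* fd is a connected socket; send a request, half-close, then
       keep reading the (possibly long) response. */
    void request_then_read(int fd, const char *req, size_t len)
    {
        char buf[4096];

        write(fd, req, len);    /* send the request */
        shutdown(fd, SHUT_WR);  /* send FIN; the server's socket now
                                   sits in CLOSE_WAIT, yet the
                                   connection is perfectly healthy */

        while (read(fd, buf, sizeof buf) > 0)
            ;                   /* consume the response */

        close(fd);              /* only now is the fd released */
    }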

Plus, even if the kernel cleaned up the socket, the application has still
leaked a file descriptor (also a limited resource).

> Basically, any application has 60 seconds to flush its data and call
> close() after receiving FIN. After that, it all gets weird. So why not
> enforce this 60-second limit and just move the socket off CLOSE_WAIT to
> some next stage?

For the reasons I mentioned, as long as the socket is open in the application
on host B, I don't think the kernel can conclude that the program is done with
it.

That's okay, because this isn't all that hard to handle: if host A has
actually closed its end of the socket (rather than just sending FIN), then, as
the post describes, host A's socket will typically close about 60 seconds
after receiving the ACK for its FIN. In that case, if host B continues sending
data,
it will eventually get an RST and socket operations will fail with ECONNRESET.
No connection will be leaked.

As I understand it, the problem only happens when host B stops sending data
(because the socket is leaked), in which case it will never realize that host
A is gone. This is not very different from the case where the socket is still
ESTABLISHED and host A has panicked, power-cycled, or failed in some other
non-graceful way. A likely solution to both of these would be to use TCP keep-
alive or application-level keep-alive, which will cause the same ECONNRESET
behavior, and no connection will be leaked.
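
For reference, a sketch of enabling keep-alive on Linux (TCP_KEEPIDLE,
TCP_KEEPINTVL and TCP_KEEPCNT are Linux-specific; the values here are
arbitrary):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Probe after 60s of idleness, then every 10s, and declare the
       peer dead after 5 unanswered probes. */
    int enable_keepalive(int fd)
    {
        int on = 1, idle = 60, intvl = 10, cnt = 5;

        if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on))
            return -1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof idle))
            return -1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof intvl))
            return -1;
        return setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof cnt);
    }

A dead peer then surfaces as an error (ECONNRESET or ETIMEDOUT) on the next
socket operation, whether the socket is in ESTABLISHED or CLOSE_WAIT.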

I agree that this case is very non-obvious, and I ran into a very similar
problem recently that was painful to debug. But I don't think it's safe for
the kernel to attempt to solve this; there would still be a leak even if it
did; and the same mechanism that the application can use to identify failed
connections in the ESTABLISHED state can likely be used to identify this case
as well.

~~~
majke
_> For the reasons I mentioned, as long as the socket is open in the
application on host B, I don't think the kernel can conclude that the program
is done with it._

Wow. Awesome explanation. Ok, so basically as long as the socket is in
CLOSE-WAIT the server may want to send() something.

Do you happen to know the detailed semantics of the `tcp_fin_timeout` sysctl?
Will FIN-WAIT-2 progress to cleanup a flat 60 seconds after entering the
state, or 60 seconds after the last received packet?

Would this be sane:

\- Let's allow the kernel to expire CLOSE-WAIT if the application hasn't sent
anything for more than 60 seconds.

~~~
toast0
> Would this be sane:

> \- Let's allow the kernel to expire CLOSE-WAIT if the application hasn't
> sent anything for more than 60 seconds.

No, that's not sane either. If the application wants to expire the CLOSE_WAIT
socket, it can close it; at that point a FIN will be sent and the state will
go to LAST_ACK while waiting for the peer to ACK the FIN, after which the
socket is totally done. It violates the RFC, but it is perfectly reasonable
for LAST_ACK to have a timeout if the other peer is non-responsive, as the
alternative is sockets living forever when the peer goes away (the same goes
for FIN_WAIT).

------
to3m
Also in socket surprises, once you've moved on from CLOSE_WAIT to TIME_WAIT,
and doing everything pretty much by the book, you can run out of ephemeral
ports and hit this: [https://goodenoughsoftware.net/2013/07/15/self-
connects/](https://goodenoughsoftware.net/2013/07/15/self-connects/)
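
A minimal reproduction sketch of the self-connect, assuming port 50000 falls
inside net.ipv4.ip_local_port_range and nothing is listening on it:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        struct sockaddr_in sa;
        memset(&sa, 0, sizeof sa);
        sa.sin_family = AF_INET;
        sa.sin_port = htons(50000);
        sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

        for (;;) {
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            /* Normally this fails with ECONNREFUSED, but when the
               kernel happens to pick 50000 as the source port too,
               TCP simultaneous open connects the socket to itself. */
            if (connect(fd, (struct sockaddr *)&sa, sizeof sa) == 0) {
                struct sockaddr_in self;
                socklen_t len = sizeof self;
                getsockname(fd, (struct sockaddr *)&self, &len);
                printf("self-connect on port %d\n", ntohs(self.sin_port));
                return 0;
            }
            close(fd);
        }
    }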

(One way I like to test multithreaded stuff is to have a test mode where it
runs for a while and quits, then run N copies of it at once, repeatedly, and
leave the whole thing running for an hour/until it goes wrong. So that's how I
encountered this in my case.)

~~~
wyldfire
> Our mistake was the server was running on an ephemeral port.

Typical idioms have the server as a connection responder and not an initiator.
When the server is a responder it's typically on a well-known port for the
sake of the clients knowing where to find it. Having a server on an ephemeral
port sounds like a pretty uncommon use case.

~~~
to3m
One way you can have a listening port in the ephemeral range is to specify 0
as the port when binding, allowing the system to assign the port for you. You
might do this if you're using the socket for some kind of parent<->child IPC.
(Pipes might be more usual for this sort of thing, but sockets code is a bit
easier to make portable between Windows and POSIX, and you also get the option
of running both sides on different systems.)
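
A minimal sketch of that trick (error handling elided): bind to port 0, then
recover the kernel-assigned port with getsockname(), e.g. to hand it to a
child process:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in sa;
        socklen_t len = sizeof sa;

        memset(&sa, 0, sizeof sa);
        sa.sin_family = AF_INET;
        sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        sa.sin_port = 0;                    /* let the kernel choose */

        bind(fd, (struct sockaddr *)&sa, sizeof sa);
        listen(fd, 8);

        getsockname(fd, (struct sockaddr *)&sa, &len);
        printf("listening on port %d\n", ntohs(sa.sin_port));

        close(fd);
        return 0;
    }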

------
wyldfire
This is only really a problem when there's a collision of the quadruple
identifying the connection. The article indicates as much - "If you want to
reproduce this weird scenario consider running this script. It will greatly
increase the probability of netcat hitting the conflicted socket".

I will admit that I probably wouldn't expect an ETIMEDOUT over a loopback, but
I think I would very quickly assume that one side or the other has a bug
regarding leaked resources.

> It seems that the design decisions made by the BSD Socket API have
> unexpected long lasting consequences. If you think about it - why exactly
> the socket can automatically expire the FIN_WAIT state, but can't move off
> from CLOSE_WAIT after some grace time. This is very confusing... And it
> should be!

One side-effect of having a timeout drive you from CLOSE_WAIT to LAST_ACK
without a close() would be that the remote side would not be able to see the
application-level reaction to its active close. Determining whether this was a
graceful application-level closure would not be possible anymore (though I'll
admit I'm not sure how critical that is to the protocol).

~~~
majke
Correct. To be hit by this problem you need leaking sockets AND a client that
opens and closes sockets rapidly. What I didn't write is that we had a "bind
before connect" on the client side, which means you could hit the CLOSE_WAIT
_faster_ than after going through 30k+ sockets. Bind-before-connect sockets
get a random src port number, so depending on the setup we hit the colliding
socket within a thousand or so sockets, not 30k+.
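
For reference, the bind-before-connect pattern looks roughly like this
(hypothetical helper; error handling elided). The source port is assigned at
bind() time, before the destination is known, so it is drawn from the whole
range at random rather than steered around existing 4-tuples:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    int connect_from(const char *local_ip, const char *remote_ip,
                     int remote_port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in local, remote;

        memset(&local, 0, sizeof local);
        local.sin_family = AF_INET;
        inet_pton(AF_INET, local_ip, &local.sin_addr);
        local.sin_port = 0;  /* src port picked here, blind to the
                                eventual destination */
        bind(fd, (struct sockaddr *)&local, sizeof local);

        memset(&remote, 0, sizeof remote);
        remote.sin_family = AF_INET;
        inet_pton(AF_INET, remote_ip, &remote.sin_addr);
        remote.sin_port = htons(remote_port);
        connect(fd, (struct sockaddr *)&remote, sizeof remote);
        return fd;
    }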

Interesting comment on potential implications of automatically transitioning
from CLOSE_WAIT.

------
mcguire
Interesting. I would have thought that the server would have sent a reset when
it received a SYN on a CLOSE_WAIT socket, not ignored it.

~~~
JoeAltmaier
The OP describes a local-host scenario, which is a special case and could be
cleaned up better, as you describe.

The CLOSE_WAIT cannot simply be ignored in the general case. The other side
could still be out there, lagging badly. It's important to make a really
diligent attempt to close, or else other, more horrible bugs may happen (two
endpoints talking to your same port).

I'd suggest the stack should never reallocate a port that is not 'clean'.

The root of all this confusion is the small port number space (16 bits).
Reusing ports in a lossy environment (or a buggy client environment) can leave
_all your sockets_ in a bad state within a few seconds. Especially with random
allocation - it's the 'how many people in the room before two have the same
birthday' puzzle - and the answer is 'not many' before a collision
(reallocation of a dirty port) becomes likely.

A solution would be to enlarge the port space - e.g. use uuids that don't ever
repeat.

~~~
wyldfire
> The root of all this confusion is the small port number space (16 bits).

IMO the leak is the real bug and the protocols work as intended. Keep in mind
that the stack(s) distinguish TCP connections not by merely one 16-bit port
number but by (local port, local addr, remote addr, remote port). This
quadruple is effectively 96 bits of identification.

Enlarging the port space has an astonishingly high cost. IMO, it would be
easier to just define a new transport protocol or use a network protocol with
more addresses (IPv6, e.g.).

~~~
JoeAltmaier
It may distinguish them (filter them for correctness) but port numbers must be
unique across all legal tuples, right? So it's still a 16-bit space. And the
cost of adding 112 bits to each message is perhaps overstated in this modern
world. Messages often dwarf that. Though agreed, it would help to use a new
protocol - once you have a uuid in the message header, you can design a
protocol that omits all the other constant fields (ip etc.) because the uuid
can be associated at every stage with all of that.
~~~
wyldfire
> It may distinguish them (filter them for correctness) but port numbers must
> be unique across all legal tuples, right?

No. TCP connections are distinct from one another only by the entire
quadruple. There are no (or very nearly no) invalid combinations of ports and
IPs.

~~~
JoeAltmaier
Yes. First, the tuple has 5 elements (including the protocol). It's true that
a router can uniquely identify a connection by this tuple, but that doesn't
speak to dispatching on the client or server endpoint. Every server will
listen on a port with a protocol. But then when a connection is made, it will
migrate it to a unique local port.

Some sharing is possible. Port 80 can be used for TCP and UDP simultaneously,
for instance.

------
bluejekyll
I always thought CLOSE_WAIT was due to unsent bytes after a close(). This
article implies something different from my understanding, and I'm surprised
it doesn't discuss SO_LINGER at all. From the man pages:

> SO_LINGER controls the action taken when unsent messages are queued on
> socket and a close(2) is performed. If the socket promises reliable
> delivery of data and SO_LINGER is set, the system will block the process
> on the close attempt until it is able to transmit the data or until it
> decides it is unable to deliver the information (a timeout period, termed
> the linger interval, is specified in the setsockopt() call when SO_LINGER
> is requested). If SO_LINGER is disabled and a close is issued, the system
> will process the close in a manner that allows the process to continue as
> quickly as possible.

Am I wrong?

~~~
majke
This article describes a case where close() is _not_ called in the first
place. But indeed, it's very interesting what exactly the application can do
after it receives FIN.

I suspect it has 60 seconds tops to send any remaining data, and SO_LINGER
could be useful there.
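
A sketch of what that might look like (the 10-second linger interval is an
arbitrary choice):

    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Make close() block for up to 10 seconds while the kernel tries
       to deliver any unsent data, rather than returning immediately. */
    int close_lingering(int fd)
    {
        struct linger lg;
        memset(&lg, 0, sizeof lg);
        lg.l_onoff = 1;    /* enable lingering          */
        lg.l_linger = 10;  /* linger interval (seconds) */

        setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof lg);
        return close(fd);
    }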

------
peterwwillis
Good flow chart/overview here: [https://benohead.com/tcp-about-
fin_wait_2-time_wait-and-clos...](https://benohead.com/tcp-about-
fin_wait_2-time_wait-and-close_wait/)

------
theptip
Props to CloudFlare for putting out high-quality case studies. I consistently
learn at least one new flag/command to add to the quiver every time I read one
of these.

------
coldcode
Even today something as basic as TCP can have interesting bugs.

~~~
njharman
TCP is not simple.

TCP is a spec; it can have design bugs. There are many implementations of
that spec, with new ones created and existing ones modified all the time. It
should not be surprising that they have bugs.

