Automatically closing FIN_WAIT_2 is a violation of the TCP specification (cloudflare.com)
238 points by jgrahamc on Aug 12, 2016 | 32 comments



This is an interesting case, but I'm confused about some of the details.

> A little known fact is that it's not possible to have any packet loss or congestion on the loopback interface.

This seems a bit misleading, given the two counterexamples that the article describes after this.

> If you think about it - why exactly the socket can automatically expire the FIN_WAIT state, but can't move off from CLOSE_WAIT after some grace time. This is very confusing... And it should be!

On illumos, the FIN_WAIT_2 -> TIME_WAIT transition happens only after 60 seconds if the application has closed the socket file descriptor. In that case, by definition the application has no handle with which to perform operations on the socket. The resource belongs exclusively to the kernel. If the other system disappeared forever, and there were no timeout, that socket would be held open forever.

By comparison, in CLOSE_WAIT, the application still has a handle on the socket, and it's responsible for the resource. The application can even keep sending more data in this case (as part of a graceful termination of a higher-level protocol). Or it could enable keep-alive. It's able to respond to the case where the other system has gone away, and it could break the application if the kernel closed the socket on its behalf.

I think the behavior is non-obvious, but pretty reasonable.


>> A little known fact is that it's not possible to have any packet loss or congestion on the loopback interface.

> This seems a bit misleading, given the two counterexamples that the article describes after this.

Ok, allow me to be more precise. The loopback _link_, the virtual cable, the virtual interface thingy has zero packet loss. If you imagine a wire that represents loopback, it will never congest, never have interference, never have any loss.

The packets will _always_ reach the other end. Now, what happens after they reach the other end, that's a separate story. The article indeed showed that it's possible that the kernel network stack dropped packets because of full buffers, or the CLOSE-WAIT thing.

But it's the kernel network stack on the receiving end dropping packets actively, not the loopback "wire".

> ...and it could break the application if the kernel closed the socket on its behalf.

The article showed a case where the other peer just quit altogether. What the kernel thinks about the CLOSE-WAIT socket is irrelevant. The connection is _dead_. You can send data over it, you can attempt to read, you can try to close it gracefully; it makes no difference. The other end refused to cooperate.

I argue that keeping the socket in CLOSE-WAIT for more than, say, 60 seconds is stupid, since the other party _will_ give up.

Basically, any application has 60 seconds to flush its data and call close() after receiving a FIN. After that, it all gets weird. So why not enforce these 60 seconds and just move the socket out of CLOSE_WAIT to some next stage?

Now, what precisely a file descriptor represents is another story. Maybe you're right; maybe a file descriptor can't point to a socket in the TIME-WAIT or LAST-ACK states.


> I argue that keeping the socket in CLOSE-WAIT for more than, say, 60 seconds is stupid, since the other party _will_ give up.

Although it's rare[1], it's legal and possible for peer A to send a FIN, and for peer B to continue to send data on the half-closed connection while peer A expects the data and continues to receive it. How is the kernel expected to tell the difference between that behavior and a socket leak? In this case, peer A fully closed the application socket, so if peer B sent data on the half-closed socket, it would get a RST from peer A (if peer A was still mildly cooperating), or the data would simply not be ACKed (if peer A left the network).

[1] It's probably rare because people don't understand it well, and additionally some of the people who don't understand it create middle-boxes that break these types of connections in exciting ways.


> Ok, allow me to be more precise. The loopback _link_, the virtual cable, the virtual interface thingy has zero packet loss. If you imagine a wire that represents loopback, it will never congest, never have interference, never have any loss.

> The packets will _always_ reach the other end. Now, what happens after they reach the other end, that's a separate story. The article indeed showed that it's possible that the kernel network stack dropped packets because of full buffers, or the CLOSE-WAIT thing.

Which is kind of ironic, because that's how packets get lost in the wild as well. The router's buffer gets full due to congestion and it drops them.


> The article showed a case where the other peer just quit altogether. What the kernel thinks about the CLOSE-WAIT socket is irrelevant. The connection is _dead_. You can send data over it, you can attempt to read, you can try to close it gracefully; it makes no difference. The other end refused to cooperate.

Remember that it's possible for an application to send FIN without closing the socket. In the article's case, the connection was dead, but from the TCP stack's perspective, it has no way to tell that case from the case where the connection is still alive on the other end.

While weird, and I don't think it's a good idea, it's possible for host A to send FIN over a connection to host B, causing host B's socket to enter CLOSE_WAIT, but to have the connection exist in this state for an extended period (many minutes) while host B continues to send data to host A. A totally plausible interaction is an HTTP client that connects to the server, sends the request headers, sends FIN (causing the server to wind up in CLOSE_WAIT), and then the server spends several minutes sending the requested data. Even if the socket were idle, the connection may be perfectly healthy.

Plus, even if the kernel cleaned up the socket, the application has still leaked a file descriptor (also a limited resource).

> Basically, any application has 60 seconds to flush the data and call the close() after receiving FIN. After that, it all gets weird. So why not enforce this 60 seconds and just move the socket off CLOSE_WAIT to some next stage?

For the reasons I mentioned, as long as the socket is open in the application on host B, I don't think the kernel can conclude that the program is done with it.

That's okay, because this isn't all that hard to handle: if host A has actually closed its end of the socket (rather than just sending FIN), then as the post describes host A's socket will typically close about 60 seconds after receiving the ACK for its FIN. In that case, if host B continues sending data, it will eventually get an RST and socket operations will fail with ECONNRESET. No connection will be leaked.

As I understand it, the problem only happens when host B stops sending data (because the socket is leaked), in which case it will never realize that host A is gone. This is not very different than the case where the socket is still ESTABLISHED and host A has panicked, power-cycled, or failed in some other non-graceful way. A likely solution to both of these would be to use TCP keep-alive or application-level keep-alive, which will cause the same ECONNRESET behavior, and no connection will be leaked.

I agree that this case is very non-obvious, and I ran into a very similar problem recently that was painful to debug. But I don't think it's safe for the kernel to attempt to solve this, there would still be a leak even if it did, and the same mechanism that the application can use to identify failed connections in the ESTABLISHED state can likely be used to identify this case as well.
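
For what it's worth, a minimal sketch of enabling keep-alive from the application (Python; the fine-grained TCP_KEEP* knobs assumed here are the Linux-specific ones, and the timing values are arbitrary):

    import socket

    def enable_keepalive(sock, idle=60, interval=10, count=6):
        # Turn on keep-alive so a vanished peer eventually surfaces as an
        # error (e.g. ETIMEDOUT/ECONNRESET) instead of idling forever.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        # Linux-specific knobs: idle time before probing, interval between
        # probes, and the number of unanswered probes before giving up.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)

The same idea works for a socket sitting in CLOSE_WAIT, since the application still holds the descriptor.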


> For the reasons I mentioned, as long as the socket is open in the application on host B, I don't think the kernel can conclude that the program is done with it.

Wow. Awesome explanation. Ok, so basically as long as the socket is in CLOSE-WAIT the server may want to send() something.

Do you happen to know the detailed semantics of the `tcp_fin_timeout` sysctl? Will FIN-WAIT-2 progress to cleanup 60 seconds after entering that state, or 60 seconds after the last received packet?

Would this be sane:

- Let's allow the kernel to expire CLOSE-WAIT if the application hasn't sent anything for more than 60 seconds.


> Would this be sane:

> - Let's allow the kernel to expire CLOSE-WAIT if the application hasn't sent anything for more than 60 seconds.

No, that's not sane either. If the application wants to expire the CLOSE_WAIT socket, it can close it; a FIN will then be sent and the state will go to LAST_ACK while waiting for the peer to ACK the FIN, at which point the socket is totally done. It violates the RFC, but it is perfectly reasonable for LAST_ACK to have a timeout if the other peer is non-responsive, as the alternative is sockets living forever when the peer goes away (the same goes for FIN_WAIT).


This sounds like a half-closed socket. I helped with the WinRT socket API for Windows; we had to decide whether to support that or not. In the end, we couldn't come up with a truly legitimate case where an app really needs to do this.

Even with no realistic use cases and no efficiency gains, we still got complaints from a small number of developers who wanted the flexibility of closing a socket one way but not the other.


Huh? No realistic use cases? A half-closed socket is how you emulate EOF.

It's like saying that EOF has no realistic use case; that _every_ protocol should implement a custom channeling and signaling mechanism at the application layer in order to get bidirectional streams, even though they're using TCP sockets.

Now, many of the most popular protocols do that, but they do that because they're intended to be universally deployed and must deal with all the random pathological cases out in the wild. But if you don't have to worry about broken software, such as when you know there won't be broken proxies or routers in your path, then you can dramatically simplify many purpose-built protocols.

As a practical matter most people can safely make that assumption. Most of the time it's cheaper and easier (in the grand scheme of things) to make the broken software shoulder the burden. Not everybody is writing software for Cloudflare or Chrome; and fortunately _most_ responsible vendors do a decent job at implementing standards correctly.

I hope there's some confusion on your part or my part, and that you just didn't admit that WinRT was completely broken by design. That should be a hanging offense for anybody designing or implementing core infrastructure software. Decisions like that impose millions, if not billions of dollars in unnecessary costs on the world.
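
For reference, the EOF-emulation pattern being discussed is just a half-close of the write side followed by reading until the peer closes. A minimal Python sketch, with placeholder host/port:

    import socket

    def request_reply(host, port, payload):
        # Send the whole request, half-close to signal "no more data" (the
        # peer sees EOF), then read the reply until the peer closes its side.
        with socket.create_connection((host, port)) as s:
            s.sendall(payload)
            s.shutdown(socket.SHUT_WR)   # sends FIN; the read side stays open
            chunks = []
            while True:
                data = s.recv(4096)
                if not data:             # peer's FIN: end of reply
                    break
                chunks.append(data)
            return b"".join(chunks)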


> This seems a bit misleading, given the two counterexamples that the article describes after this.

If you are referring to the "Assuming the target application has some space in its buffers, packet loss over loopback is not possible." caveat, then yes, it is somewhat ambiguous what's really being implied. Maybe it's just network packet loss they are referring to in the initial statement? I'm not sure that really makes sense either.

> On illumos, the FIN_WAIT_2 -> TIME_WAIT transition happens only after 60 seconds if the application has closed the socket file descriptor. ... If the other system disappeared forever, and there were no timeout, that socket would be held open forever.

Isn't that exactly what the automatic closing is supposed to prevent? Couldn't deliberate or accidental connection interruptions eventually cause a DOS?


> Isn't that exactly what the automatic closing is supposed to prevent? Couldn't deliberate or accidental connection interruptions eventually cause a DOS?

Yes, exactly. My point was that until the application closes the file descriptor, it's up to the application to deal with that issue. (Many public-facing web servers close idle sockets for this reason.) Only once the application has closed the socket does it become the TCP stack's responsibility, and that's why the 60-second timer exists. But that timer exists only for sockets whose file descriptors are closed.


> If the other system disappeared forever, and there were no timeout, that socket would be held open forever.

There is actually a more "gracefully degrading" solution: keep those connections around potentially indefinitely, or with a very long timeout, but recycle them if the memory is needed for new connections. It seems Linux supports this for TIME-WAIT, though it is not enabled by default (net.ipv4.tcp_tw_recycle).

Actually lwIP does such recycling and it can even kill active connections - if there's no memory for a new connection it will try to kill connections in order: TIME-WAIT, LAST-ACK, CLOSING, active connections (incl. FIN-WAIT-2).


I would highly recommend not using that setting: it solves one problem and could introduce other, harder-to-debug problems, especially if a load balancer or NAT is in the way.

With this setting Linux no longer follows the RFC and there is a risk that it could confuse packets from different connections. It falls back to the TCP timestamp from PAWS, but people often disable timestamps as well.


Also in socket surprises, once you've moved on from CLOSE_WAIT to TIME_WAIT, and doing everything pretty much by the book, you can run out of ephemeral ports and hit this: https://goodenoughsoftware.net/2013/07/15/self-connects/

(One way I like to test multithreaded stuff is to have a test mode where it runs for a while and quits, then run N copies of it at once, repeatedly, and leave the whole thing running for an hour/until it goes wrong. So that's how I encountered this in my case.)
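
If anyone wants to see a self-connect for themselves, here's a rough Python sketch of the kind of loop that triggers it; the port and attempt count are arbitrary assumptions, and the port needs to lie inside the local ephemeral range with nothing listening on it:

    import socket

    PORT = 50000  # assumed to be inside the ephemeral range, with no listener

    for attempt in range(200_000):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.connect(("127.0.0.1", PORT))
        except ConnectionRefusedError:
            s.close()
            continue
        # The kernel happened to pick PORT as the source port: the socket
        # connected to itself via TCP simultaneous open.
        print("self-connect on attempt", attempt, s.getsockname(), s.getpeername())
        s.close()
        break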


> Our mistake was the server was running on an ephemeral port.

Typical idioms have the server as a connection responder and not an initiator. When the server is a responder it's typically on a well-known port for the sake of the clients knowing where to find it. Having a server on an ephemeral port sounds like a pretty uncommon use case.


One way you can have a listening port in the ephemeral range is to specify 0 as the port when binding, allowing the system to assign the port for you. You might do this if you're using the socket for some kind of parent<->child IPC. (Pipes might be more usual for this sort of thing, but sockets code is a bit easier to make portable between Windows and POSIX, and you also get the option of running both sides on different systems.)
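
A quick sketch of that pattern, for illustration:

    import socket

    # Ask the kernel to assign an ephemeral port, then discover which one
    # it chose so it can be handed to the child (e.g. via argv or env).
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # port 0 = "pick one for me"
    listener.listen(1)
    host, port = listener.getsockname()
    print("child should connect to", host, port)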


This is only really a problem when there's a collision of the quadruple identifying the connection. The article indicates as much - "If you want to reproduce this weird scenario consider running this script. It will greatly increase the probability of netcat hitting the conflicted socket".

I will admit that I probably wouldn't expect an ETIMEDOUT over a loopback, but I think I would very quickly assume that one side or the other has a bug regarding leaked resources.

> It seems that the design decisions made by the BSD Socket API have unexpected long lasting consequences. If you think about it - why exactly the socket can automatically expire the FIN_WAIT state, but can't move off from CLOSE_WAIT after some grace time. This is very confusing... And it should be!

One side-effect of having a timeout drive you from CLOSE_WAIT to LAST_ACK without a close() would be that the remote side would not be able to see the application-level reaction to its active close. Determining whether this was a graceful application-level closure would not be possible anymore (though I'll admit I'm not sure how critical that is to the protocol).


Correct. To be hit by this problem you need leaking sockets AND a client that opens and closes sockets rapidly. What I didn't write is that we had a "bind before connect" on the client side, which means you could hit the CLOSE_WAIT socket _faster_ than after going through 30k+ sockets. Bind-before-connect sockets get a random src port number, so depending on the setup we hit the colliding socket within a thousand or so sockets, not 30k+.

Interesting comment on potential implications of automatically transitioning from CLOSE_WAIT.
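
For illustration, "bind before connect" is roughly this (Python; the addresses are placeholders, and the source address has to be configured on the machine):

    import socket

    # The source address/port is fixed at bind() time, before the destination
    # (and thus the full 4-tuple) is known, so the kernel has to pick the
    # source port up front from the whole ephemeral range.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("192.0.2.10", 0))          # placeholder local address; port 0 = assign now
    s.connect(("198.51.100.20", 443))  # placeholder destination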


Interesting. I would have thought that the server would have sent a reset when it received a SYN on a CLOSE_WAIT socket, not ignored it.


The OP describes a local-host scenario, which is a special case and could be cleaned up better, as you describe.

The CLOSE_WAIT cannot simply be ignored in the general case. The other side could potentially still be out there, lagging badly. It's important to make a really diligent attempt to close, else other, more horrible bugs may happen (two endpoints talking to your same port).

I'd suggest the stack should never reallocate a port that is not 'clean'.

The root of all this confusion is the small port number space (16 bits). Reusing them in a lossy environment (or buggy client environment) can result in all your sockets being in a bad state in a few seconds. Especially with random allocation - it's the 'how many people in the room before two have the same birthday' puzzle - the answer is 'not many' before a collision (reallocation of a dirty port) is likely.

A solution would be to enlarge the port space - e.g. use uuids that don't ever repeat.


> The root of all this confusion is the small port number space (16 bits).

IMO the leak is the real bug and the protocols work as intended. Keep in mind that the stack(s) distinguish TCP connections not by merely one 16-bit port number but by (local port, local addr, remote addr, remote port). This quadruple is effectively 96 bits of identification.

Enlarging the port space has an astonishingly high cost. IMO, it would be easier to just define a new transport protocol or use a network protocol with more addresses (IPv6, e.g.).


It may distinguish them (filter them for correctness), but port numbers must be unique across all legal tuples, right? So it's still a 16-bit space. And the cost of adding 112 bits to each message is perhaps overstated in this modern world. Messages often dwarf that. Though agreed it would help to use a new protocol - once you have a uuid in the message header, you can design a protocol that omits all the other constant fields (ip etc) because the uuid can be associated at every stage with all that.


> It may distinguish them (filter them for correctness) but port numbers must be unique across all legal tuples, right?

No. TCP connections are distinct from one another only by the entire quadruple. There's no (or very nearly no) invalid combinations of ports and IPs.


Yes. First, the tuple has 5 elements (including the protocol). It's true that a router can uniquely identify a connection by this tuple, but that doesn't speak to dispatching on the client or server endpoint. Every server will listen on a port with a protocol. But then when a connection is made, it will migrate it to a unique local port.

Some sharing is possible. Port 80 can be used for TCP (HTTP) and UDP simultaneously, for instance.


You could easily run into the same scenario over a real network as well. Perhaps in that case, Linux would send a RST to the SYN; it's hard to know without testing since localhost on linux skips most of the TCP stack, FreeBSD runs the full stack though -- you can get congestion collapse on lo0 if you push it hard enough or find exciting (fixed) bugs in syncookies leading to desynchronized TCP states.

127.0.0.0/8 is a pretty huge selection of ips to talk to yourself with -- if that's not enough ports for you, you can use UNIX sockets or take over DoD address space. (or ipv6)
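
For example, on Linux the whole 127.0.0.0/8 answers on lo without extra configuration, so a rough sketch like this multiplies the available (src addr, src port, dst addr, dst port) combinations (other systems may only configure 127.0.0.1 by default):

    import socket

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.2", 0))        # listener on a non-default loopback address
    srv.listen(1)

    cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    cli.bind(("127.0.0.3", 0))        # different loopback source address
    cli.connect(srv.getsockname())
    conn, peer = srv.accept()
    print("connected from", peer)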


I always thought CLOSE_WAIT was due to unsent bytes after a close(). This definitely implies something different from my understanding. I'm surprised the article doesn't discuss SO_LINGER at all. From the man pages:

> SO_LINGER controls the action taken when unsent messages are queued on socket and a close(2) is performed. If the socket promises reliable delivery of data and SO_LINGER is set, the system will block the process on the close attempt until it is able to transmit the data or until it decides it is unable to deliver the information (a timeout period, termed the linger interval, is specified in the setsockopt() call when SO_LINGER is requested). If SO_LINGER is disabled and a close is issued, the system will process the close in a manner that allows the process to continue as quickly as possible.

Am I wrong?


This article describes a case where close() is _not_ being called in the first place. But indeed, it's very interesting what exactly the application can do after it receives a FIN.

I suspect it has 60 seconds tops to send any remaining data, and SO_LINGER could be useful there.
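
For reference, a minimal sketch of the two common SO_LINGER settings (Python; the struct layout assumed here is Linux's two-int struct linger):

    import socket
    import struct

    def close_with_linger(sock, seconds=10):
        # l_onoff=1, l_linger=seconds: close() blocks up to `seconds` while
        # the stack tries to deliver any unsent data.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                        struct.pack("ii", 1, seconds))
        sock.close()

    def abort_with_rst(sock):
        # l_onoff=1, l_linger=0: close() discards unsent data and sends RST
        # instead of the normal FIN handshake.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                        struct.pack("ii", 1, 0))
        sock.close()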



Props to CloudFlare for putting out high-quality case studies. I consistently learn at least one new flag/command to add to the quiver every time I read one of these.


Even today something as basic as TCP can have interesting bugs.


TCP is not simple.

TCP is a spec; it could have design bugs. There are many implementations of that spec, with new ones being created and existing ones modified. It should not be surprising that they have bugs.


I don't consider this a bug in TCP; it's a design mistake in the BSD socket interface.

Nevertheless your point is a good one: even old and heavily-used systems will harbor latent bugs. There are surely unfixed design problems in TCP as well!



