Why TCP over TCP is a bad idea (2001) (inka.de)
112 points by fanf2 on Nov 13, 2020 | 68 comments


I suppose the issue could be alleviated by having the tunnel know its payload is IP packets, keep a reasonably short buffer, and drop payload packets from the queue after a short timeout, e.g. 200ms. So essentially, the tunnel needs to behave like a switch would.

Edit: Obviously, it still reacts twice as badly to the base connection dropping packets. I recall once testing how a TCP connection behaves when you subject it to a random % of packet loss (regardless of the bandwidth it tries to use), and it resulted in completely stalled connections at surprisingly low drop rates.

Edit2: In case anyone wants to try themselves, here's how: https://www.pico.net/kb/how-can-i-simulate-delayed-and-dropp...


Interesting idea, although I'd argue that if it should ACK like UDP, retransmit like UDP and control flow and congestion like UDP... You should use UDP ;)

However, firewalls are usually much more permissive to TCP than to UDP. I wonder if there is any project that encapsulates UDP-semantic datagrams into TCP-looking segments?


UDP is treated fairly well by firewalls... at least compared to SCTP for example. QUIC/HTTP3 are UDP-based and even though there's usually a TCP/HTTP2 fallback they fare reasonably well.


I have various boxes which we send out to venues, on the whole outgoing connections are fine, but sometimes you get some really restrictive policies. I've had stuff that MITMs TCP/443, completely blocks UDP, etc.

My devices tend to try to connect back via

* UDP port 443, sometimes works

* an sstp vpn

* SSH to tcp/$highnumber, sometimes they block/MITM port 80, 443, but leave the standard

* DNS

I can't think of a time that one of them didn't get through.


It does make it through many firewalls these days, yes.

But all implementations I know use a much shorter timeout/keepalive period for UDP than they use for TCP because of firewalls/NATs. (I think the RFCs even recommend something like 300 seconds for TCP, but only 30 for UDP as a default?)

This has pretty significant implications on power consumption for mobile devices.


You meant it should not ack (like UDP doesn't), not retransmit (like UDP doesn't), and perform no flow/congestion control (like UDP doesn't), yes?


Your phrasing and their phrasing mean the same thing.


Thank you for confirming that; it wasn't clear to me.


That is very interesting!

What exactly do you mean with "completely stalled connections"?

Do you mean that the sending side is queueing up to-be-sent messages and can't clear the queue because it is working on correcting packet losses all the time, so the queue will just grow and never shrink?

Do you recall at which % of packet loss this behaviour started?


Stalled as in, sender has data to send, receiver is ready to receive, but the transfer makes no progress or proceeds very slowly.

I can't quote a number for the drop percentage, honestly I've forgotten. Discovering the limit was a side effect; we were simply looking to test how a piece of software would behave over a bad connection. I just remember being surprised that everything just stopped when I put in packet loss that wasn't anywhere near 100%.

My since-then-adjusted expectations would put any double-digit percentage (yes, starting from 10%) of random packet loss as unusable conditions for TCP.
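For a rough intuition about why modest random loss is so devastating, one can plug numbers into the well-known Mathis et al. approximation for steady-state TCP throughput, rate ≈ MSS / (RTT · √p). The MSS and RTT values below are illustrative assumptions, not measurements from the experiment above:

```python
# Back-of-the-envelope TCP throughput under random loss, using the
# Mathis et al. approximation: rate ≈ MSS / (RTT * sqrt(p)).
# MSS and RTT here are illustrative assumptions.
import math

def tcp_throughput_bps(mss_bytes=1460, rtt_s=0.05, loss=0.0001):
    """Approximate achievable throughput in bits per second."""
    return 8 * mss_bytes / (rtt_s * math.sqrt(loss))

for loss in (0.0001, 0.01, 0.1):
    print(f"{loss:>7.2%}: {tcp_throughput_bps(loss=loss) / 1e6:.2f} Mbit/s")
```

With a 50 ms RTT, going from 0.01% to 10% loss costs roughly a factor of 30 in throughput, before any of the compounding TCP-over-TCP timer effects kick in.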


I understand that the retransmit algorithm in the upper-layer TCP would be useless, because the lower layer is expected to be reliable.

A better option would be for the tunnel to generate fake ACKs to avoid retransmissions happening?


Right, the issue is not the tunnel, it's the essentially infinite buffer. Any large buffer will have a negative effect.


Thanks for the link. How could one monitor the simulation? Any guide?


About your experiment: TCP BBR should behave better, I suspect.


In my personal experience, BBR with SACK works pretty well even on fairly bad connections. It still slows way down but not as badly as others like Cubic.

That was with BitTorrent uploads of Linux ISOs to Taiwan. (Why do they download so many copies of Ubuntu 14.04 LTS?)

But since I didn't do controlled tests of multiple congestion types I could just be seeing things.


I've had very good experiences on restricted networks with udp2raw ( https://github.com/wangyu-/udp2raw-tunnel ). Basically, you configure openvpn in udp mode, then tunnel them over udp2raw, which pretends that they're a tcp stream that happens to be mysteriously out of order. (Firewalls usually believe this.)


“Note that the upper and the lower layer TCP have different timers. When an upper layer connection starts fast, its timers are fast too. Now it can happen that the lower connection has slower timers, perhaps as a leftover from a period with a slow or unreliable base connection.

Imagine what happens when, in this situation, the base connection starts losing packets. The lower layer TCP queues up a retransmission and increases its timeouts. Since the connection is blocked for this amount of time, the upper layer (i.e. payload) TCP won't get a timely ACK, and will also queue a retransmission. Because the timeout is still less than the lower layer timeout, the upper layer will queue up more retransmissions faster than the lower layer can process them. This makes the upper layer connection stall very quickly and every retransmission just adds to the problem - an internal meltdown effect.”
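The meltdown described in the quote can be sketched with a toy discrete-event model (the timer values below are arbitrary illustrations, not real TCP RTOs, and this is nothing like a full TCP model): as long as the upper layer's retransmission timer fires faster than the lower layer can deliver segments, the lower layer's queue only grows.

```python
# Toy model of the "internal meltdown": the upper-layer TCP queues a
# retransmission every upper_rto seconds, while the slowed-down lower
# layer delivers one segment every lower_service_time seconds.
def simulate(upper_rto=1.0, lower_service_time=3.0, duration=60.0):
    """Return (time, queue_length) samples for the lower layer's queue."""
    queue = 0
    t = 0.0
    next_upper = 0.0                     # next upper-layer retransmission
    next_lower = lower_service_time      # next lower-layer delivery
    samples = []
    while t < duration:
        if next_upper <= next_lower:
            t = next_upper
            queue += 1                   # upper layer queues a retransmission
            next_upper += upper_rto
        else:
            t = next_lower
            if queue > 0:
                queue -= 1               # lower layer delivers one segment
            next_lower += lower_service_time
        samples.append((t, queue))
    return samples

samples = simulate()
print(samples[-1])  # queue grows without bound: arrivals outpace service
```

Retransmissions arrive three times faster than they drain, so the backlog grows by about two segments every three seconds, which is the "every retransmission just adds to the problem" effect.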


Yggdrasil[1] (partially) works around this issue by using a very large (64k) MTU size.

[1]: https://yggdrasil-network.github.io/2018/07/13/about-mtu.htm...


That's interesting. Do you know how these huge packets are dealt with under the hood? Even with jumbo frames, most switches cannot forward frames larger than 9K. Fragmentation is also pretty broken / unreliable, does Yggdrasil fragment packets itself?


For a more layman explanation, see https://www.youtube.com/watch?v=AAssk2N_oPk


This is also why "SSL VPN" is an equally bad idea.

* Unless it's OpenVPN, which can run SSL over UDP.


Or Cisco AnyConnect / openconnect, or any other implementation that supports DTLS (UDP) upgrade.

The key thing to remember is while TCP over TCP is a bad idea, bad connectivity to your workplace network is not worse than no connectivity at all, which is why SSL VPN solutions exist.


Or it's over QUIC, which is TLS over UDP.


On LTE (which often falls back to 3G) my TCP OpenVPN is a godsend. It makes my network usable. It's especially visible during COVID, because all those people want to use the same mobile provider, because the DSL connection is rammed by others.

Sadly, OpenVPN over UDP does not work as well - DNS queries fail and so on. WireGuard falls even further behind.


That's strange. All else being equal, TCP should perform worse. I suspect it's because TCP is getting preferential treatment (e.g. less throttling/QoS, or better connection tracking in NAT), rather than TCP actually being better.


I think it would be possible to avoid this issue by ACKing the inner TCP packets at the sending end of the tunnel. This way the sending application believes that the packet has been sent successfully. Because the outer TCP will ensure reliable delivery, there should never be a need to retransmit a packet for the inner TCP.


That breaks your bandwidth estimation though? So you're more likely to fill up your sending buffers and then suddenly stall anyway


You would ACK after a suitable delay, to throttle the inner connection to a level where your sending buffers don't fill up.


A proxy server, essentially.


Essentially yes. You are terminating one TCP connection, shuffling the data stream over your tunnel however you want, and establishing another TCP connection at the other end. The only real difference is that the source and destination addresses aren't changed.
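The terminate-and-re-originate idea can be sketched as a minimal byte-stream proxy (addresses and ports here are illustrative, and a real tunnel entrance would carry the stream over its own transport instead of a plain outbound TCP connection):

```python
# Minimal sketch of terminating TCP at one end and re-originating it at
# the other: each side's kernel ACKs locally, so the two connections'
# retransmission timers never interact.
import socket
import threading

def pipe(src, dst):
    """Copy bytes one way until EOF; each TCP connection ACKs locally."""
    while (chunk := src.recv(4096)):
        dst.sendall(chunk)
    dst.close()

def serve(listen_port, remote_host, remote_port):
    """Accept clients and splice each onto a fresh outbound connection."""
    srv = socket.create_server(("127.0.0.1", listen_port))
    while True:
        client, _ = srv.accept()
        remote = socket.create_connection((remote_host, remote_port))
        threading.Thread(target=pipe, args=(client, remote), daemon=True).start()
        threading.Thread(target=pipe, args=(remote, client), daemon=True).start()
```

Only the byte stream survives the hop; the inner connection's segment boundaries, timers, and window state are all re-created by the far end's TCP stack.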


Would this not have a chance of breaking anything relying on TCP keepalives?


A lot of transports are in-order.

TCP can use this to detect the difference between delay and loss, and treat them differently.

Unfortunately, not all transports are in-order, and there is no way to know from the endpoints, hence the large number of TCP retransmit algorithms trying to find some optimal heuristic...


For historical interest, I have wondered if this also applies to TCP/IP over X.25, which was used for a while on JANET, the UK academic network, when IP became irresistible. I remember that as working well, but perhaps it didn't really.


IIRC X.25 (and the related ATM and Frame Relay protocols) have a built-in bandwidth allocation mechanism used to define virtual circuits. This effectively works like a dedicated link and thus doesn't interfere with TCP congestion control algorithms.


Thanks. That sounds vaguely familiar now.



This is a really interesting explanation. Is it not somehow possible to disable retransmission requests in the upper layer, perhaps by using a firewall to block them if the OS doesn't allow it to be configured on the interface?


The problem is that the retransmissions are handled by your TCP stack, which lives outside of your program. You are basically asking TCP to not be TCP at this point, so the easiest solution is to just use UDP instead.


I was viewing this with my sysadmin hat on, not from a programmer perspective. So firewalls and OS TCP configuration would be within scope, and sometimes you have no choice over the lower-level link you have. But yes, I agree that the lesson from the article is: don't do this if you don't have to. That doesn't mean I'm not interested in ways the problem could be alleviated. From other discussion, it looks like setting a high MTU on the upper layer could help.


If your problem is packet loss due to BER, a big MTU is hugely counterproductive. You flip one bit in a 32 KB frame and you've gotta resend the whole damn thing, and if you're doing TCP over TCP that retransmission might happen a dozen times or more due to the way the timers interact.

If you absolutely must do TCP over TCP the only sensible thing to do is have the outer tunnel terminate and buffer the stream internally. This is fairly memory intensive so routers won't do it, but applications can get away with it.


>Is it not somehow possible in to disable retranmission requests in the upper layer

...or just use UDP?


You don't necessarily have an option. ssh lets you tunnel with tun/tap interfaces which presumably suffers this problem given that ssh is over TCP. Sometimes you're forced to set something up over whatever link you happen to have.


The appropriate way to disable retransmissions is to use UDP.


Fascinating.

So then how do VPNs work performantly?

I'd always assumed my VPN implemented TCP and UDP over a TCP connection. Do they not? Is the VPN connection actually just UDP?


VPNs are almost always over UDP.

See IPSec (most common VPN implementation): https://www.cloudflare.com/learning/network-layer/what-is-ip...

See Wireguard (increasing popularity): https://www.wireguard.com/


Doesn't IPSec operate at the same level as TCP/UDP?

TCP is protocol 6, UDP protocol 17; while IPsec uses protocol 50 for ESP (encrypted) and 51 for AH (authenticated).


IPsec can operate in both modes, either as a pure IP protocol or in "NAT traversal" mode, over (IIRC) UDP ports 500 and 4500.


IPsec is really an IP protocol, just like UDP and TCP are.

For pragmatic reasons (firewalls, NAT, that sort of thing) it is nowadays mostly tunneled over UDP.


Most of them can use either, although TCP gets you terrible performance for the reasons mentioned in the article.

Modern VPNs will usually attempt UDP first, falling back to TCP only if that does not work; unfortunately, OpenVPN doesn't seem to have that option and requires manual configuration in my experience. This means that many sysadmins configure it to use TCP for the higher success rate in most environments.
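For reference, OpenVPN's UDP-first-then-TCP behavior has to be spelled out manually with connection profiles; a sketch of the client-side config (hostname and ports are placeholders):

```
# Illustrative OpenVPN client config: try UDP first, fall back to TCP.
<connection>
remote vpn.example.com 1194 udp
</connection>
<connection>
remote vpn.example.com 443 tcp-client
</connection>
```

The client tries each `<connection>` block in order, so UDP is only abandoned if it fails to come up.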

IPsec is UDP only, as is Wireguard.


IPsec isn't UDP though. It uses UDP for the key exchange, but the actual IPsec traffic uses ESP or AH, which are IP protocols at the same level as TCP, UDP, or ICMP.

(ignoring IPSec NAT-T)


True, but NAT-T is almost a given if one of the endpoints is a laptop or mobile device these days.

Also, UDP and ESP can almost be used interchangeably for this discussion (flow control, congestion control, segmentation etc. or a lack thereof).


There's also an advantage of UDP being able to resume after losing the link, as with mosh and a feature of wireguard (and other VPNs?).


The argument makes sense and I've run across it a few times before, but I'm using OpenVPN via TCP on a daily basis and have never had issues. I don't know how much better the connection would be via UDP, but TCP works like a charm as far as I can tell.


Because you have a good home connection?

The average wired connection has a near-0% drop rate, losing less than one packet in a million. Try to use TCP on a shitty connection and it's a different matter.


Well, I do sometimes use it on a train. Of course I experienced connection loss when on a train, but I cannot say whether that's because I'm using TCP or the connection is actually that shitty. But you might be right.


The way to think about it is as an amplifier: if you have issues with latency or packet loss they’ll be notably worse over that VPN tunnel. If you have a low-latency connection or aren’t doing latency-sensitive work, it can be okay - downloads won’t be too much different, SSH will be impacted, web browsing is in between.


Now that you mention it, the only application I have issues with is SMB. RDP for example works fine, but I cannot upload anything over 100MB with `smbclient`. I actually resort to the "network disk" feature over RDP for that because it works better.

Could this be explained by TCP over TCP?


Possibly - when you say “cannot upload” is that because it times out? One of the challenges using TCP-based VPNs is that it interferes with normal TCP window scaling so transfer rates ramp up much more slowly and crater if you get any kind of packet loss.


Since the article is from 2001, is this still true today?


Nearly everything in that article is still true. The algorithms have changed a bit, but I don't think they interact any better now. If anything, the failures should have become worse because buffers have gotten larger.

Still, TCP over SSH works. It's not perfect, but the problems are almost unnoticeable. It worked in 2001 too. There is something that makes all of this not apply most of the time. I believe that unused capacity makes the problem possible to recover from.


This article does not apply to TCP over SSH, as in SSH port forwarding (not OpenSSH's VPN feature): in that case the timers are tied together and you don't get the crappy cascading behavior.


That sounds very interesting. Do you have any technical sources that I could delve deeper into? I've always thought that SSH tunneling / port forwarding suffers from the exact problems outlined in TFA.


As I understand it, ssh "unwraps" the TCP stream back into a regular byte stream, sends that, and reconstitutes it into TCP packets on the other end.


Thanks. That sounds like a reasonable explanation, and anecdotally I don't remember ever having major performance problems with SSH, even on mobile networks, so it makes sense that there is something preventing this particular failure mode.


It works until you get packet loss. On a modern broadband connection doing just web browsing/email/etc... it will be fine. Even bulk data transfers will be ok as the packet loss will be spread out so you don't get the cascading failures mentioned in the article. The whole scheme could come crashing down if you're running over WiFi and your neighbor starts blasting out traffic on the same channel though.


Great, thanks for the reply.


And this is also the reason WireGuard did not want to run on top of TCP.


(2001)



