TCP over TCP is a bad idea (2000)

aarmenaa · 2024-10-25T21:54:15 1729893255

I've just spent the last month learning exactly why I definitely do want a TCP over TCP VPN. The short answer is almost every cloud vendor assumes you're doing TCP, and they've taken the "unreliable" part of UDP to heart. It is practically impossible to run any modern VPN on most cloud providers anymore.

Over the last month, I've been attempting to set up a fast Wireguard VPN tunnel between AWS and OVH. AWS killed all internet access on the instance with zero warning and sent us an email indicating that they suspected the instance was compromised and being used as part of a DDOS attack. OVH randomly performs "DDOS mitigation" anytime the tunnel is under any load. In both cases we were able to talk to someone and have the issue addressed, but I wanna stress: this is one stream between two IPs -- there's nothing that makes this anything close to looking like a DDOS. Even after getting everything properly blessed, OVH drops all UDP traffic over 1 Gbps. It took them a month of back-and-forth troubleshooting to tell us this.

The really terrible part is "TCP over TCP is bad" is now so prevalent there's basically no good VPN options for it if you need it. Wireguard won't do it directly, but there's hacks involving udp2raw. I tried it, and wasn't able to achieve more than 100 Mbps. OpenVPN can do it, but is single-threaded and won't reasonably do more than 1 Gbps without hardware acceleration, which didn't appear to work on EC2 instances. strongSwan cannot be configured to do unencapsulated ESP anymore -- they removed the option -- so it's UDP encapsulated only. Their reasoning is UDP is necessary for NAT traversal, and of course everybody needs that. It's also thread-per-SA so also not fast. The only solution I've found that can do something not UDP is Libreswan, which can still do unencapsulated ESP (IP Protocol 50) if you ask nicely. It's also thread-per-SA, but I've managed to wring 2 - 3 Gbps out of a single core after tinkering with the configuration.

For the love of all that's good in the world, just add performant TCP support to Wireguard. I do not care about what happens in non-optimal conditions.

/rant

kwantam · 2024-10-25T22:31:23 1729895483

The whole point of this article is that performant Wireguard-over-TCP support in Wireguard simply does not work. You're not fighting the prevalence of an idea, you're fighting an inherent behavior of the system as currently constituted.

In more detail, let's imagine we make a Wireguard-over-TCP tunnel. The "outer" TCP connection carrying the Wireguard tunnel is, well, a TCP connection. So Wireguard can't stop the connection from retransmitting. Likewise, any "inner" TCP connections routed through the Wireguard tunnel are plain-vanilla TCP connections; Wireguard cannot stop them from retransmitting, either. The retransmit-in-retransmit behavior is precisely the issue.

So, what could we possibly do about this? Well, Wireguard certainly cannot modify the inner TCP connections (because then it wouldn't be providing a tunnel).

Could it work with a modified outer TCP connection? Maybe---perhaps Wireguard could implement a user-space "TCP" stack that sends syntactically valid TCP segments but never retransmits, then run that on both ends of the connection. In essence, UDP masquerading as TCP. But there's no guarantee that this faux-TCP connection wouldn't break in weird ways because the network (especially, as you've discovered, any cloud provider's network!) isn't just a dumb pipe: middleboxes, for example, expect TCP to behave like TCP.

Good news (and oops), it looks like I've just accidentally described phantun (and maybe other solutions): https://github.com/dndx/phantun I'd be curious if this manages to sidestep the issues you're seeing with AWS and OVH.
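
For anyone wondering what "UDP masquerading as TCP" could look like on the wire, here's a minimal sketch. To be clear, this is not phantun's actual code. It wraps an opaque datagram in a syntactically valid TCP header with a correct checksum and does nothing else: no retransmits, no congestion control. The function names, ports, and addresses are made up for illustration, and actually sending the result would need a raw socket or TUN device, which isn't shown.

```python
import socket
import struct

TCP_PROTO = 6          # IP protocol number for TCP
FLAGS_PSH_ACK = 0x018  # PSH (0x08) + ACK (0x10)


def checksum16(data: bytes) -> int:
    """Internet checksum (RFC 1071): one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def pseudo_header(src_ip: str, dst_ip: str, tcp_len: int) -> bytes:
    """IPv4 pseudo-header, which the TCP checksum also covers."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, TCP_PROTO, tcp_len))


def wrap_as_tcp(payload: bytes, src_ip: str, dst_ip: str,
                src_port: int, dst_port: int, seq: int, ack: int) -> bytes:
    """Prefix a datagram with a 20-byte TCP header (hypothetical helper)."""
    offset_flags = (5 << 12) | FLAGS_PSH_ACK  # data offset = 5 words, no options
    header = struct.pack("!HHIIHHHH", src_port, dst_port, seq, ack,
                         offset_flags, 65535, 0, 0)  # checksum field zero for now
    csum = checksum16(
        pseudo_header(src_ip, dst_ip, len(header) + len(payload)) + header + payload
    )
    # Splice the computed checksum into bytes 16..17 of the header.
    return header[:16] + struct.pack("!H", csum) + header[18:] + payload
```

A middlebox that only checks headers would see an ordinary PSH/ACK segment; one that tracks sequence numbers or expects retransmissions after loss might not be fooled, which is the caveat above.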

aarmenaa · 2024-10-25T22:59:14 1729897154

> The retransmit-in-retransmit behavior is precisely the issue.

But you're concerned about an issue I do not have. In practice retransmits are rare between my endpoints, and if they did occur poor performance is acceptable for some period of time. I just need it to be fast most of the time. To reiterate: I do not care about what happens in non-optimal conditions.

> it looks like I've just accidentally described phantun (and maybe other solutions): https://github.com/dndx/phantun

I'll definitely look into that. They specifically mention being more performant than udp2raw, so that's nice.

lxgr · 2024-10-25T23:26:19 1729898779

> In practice retransmits are rare between my endpoints

You seem to be mistaken about how (most) TCP implementations work. They regularly trigger packet loss and retransmissions as part of their mechanism to determine the optimal transmission rate over an entire path (made up of potentially multiple point-to-point connections with dynamically varying capacity).

That mechanism breaks down horribly when using TCP-over-TCP.
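
A toy model of what I mean, as a rough sketch only: a loss-based sender keeps growing its window until the path drops something, then backs off. The function name and the numbers are made up, and real stacks add slow start, fast recovery, CUBIC/BBR and so on, but the sawtooth is the point. Loss shows up even on a path that never corrupts a single packet.

```python
def aimd(capacity: int, rounds: int, cwnd: int = 1):
    """Additive-increase/multiplicative-decrease on a path that can hold
    `capacity` packets per round trip. Returns (window history, loss count)."""
    history = []
    losses = 0
    for _ in range(rounds):
        if cwnd > capacity:
            # The sender overshot the path: a queue overflows and a packet drops.
            losses += 1
            cwnd = max(1, cwnd // 2)  # multiplicative decrease
        else:
            cwnd += 1                 # additive increase: probe for more bandwidth
        history.append(cwnd)
    return history, losses


history, losses = aimd(capacity=20, rounds=100)
print(f"peak window {max(history)}, losses {losses}")
```

The sender only learns the capacity by exceeding it, so "retransmits are rare" stops being true as soon as the inner connection tries to fill the tunnel.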

screcth · 2024-10-26T00:37:44 1729903064

Can't the tunneling software detect when the upper TCP is retransmitting segments and drop them?

That would give the lower TCP enough time to transmit the original segment.

lxgr · 2024-10-26T01:10:18 1729905018

Maybe, but packet loss isn't the only problem. You'll also want to preserve latency (TCP has a pretty sophisticated latency estimation mechanism), for example.

Some middleboxes will also do terrible things to your TCP streams (restrictive firewalls only allowing TCP are good candidates for that), and then all bets are off.

If you're really required to use TCP, the "fake TCP" approach that others in sibling threads have mentioned seems more promising (but again, beware of middleboxes).

mlyle · 2024-10-26T00:00:53 1729900853

But, my connection speed is usually greater and my loss is much less to my VPN endpoint than to whatever services I am accessing through that endpoint. As a result it doesn't affect things much. Further, accessing it with UDP is not always possible.

lxgr · 2024-10-26T01:32:10 1729906330

> [...] my loss is much less [...]

Unless it's actually zero, any loss on the "outer" TCP stream will cause a retransmission, visible to the inner one as a sharp jump in latency of all data following the loss. Most TCP stacks don't handle that very well either.
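
Here's a small sketch of that latency jump, under simplifying assumptions: packets are sent at a fixed interval, exactly one is lost on the outer stream, and its retransmitted copy arrives `retransmit_after_ms` later than the original would have. All names and numbers are illustrative.

```python
def inner_delays(n_packets, interval_ms, one_way_ms, lost_index, retransmit_after_ms):
    """Delay of each tunneled packet when the outer stream is reliable and in-order."""
    delays = []
    release_floor = 0  # the outer stream cannot hand over packet i before packet i-1
    for i in range(n_packets):
        sent = i * interval_ms
        arrived = sent + one_way_ms
        if i == lost_index:
            arrived += retransmit_after_ms  # only the retransmitted copy gets through
        delivered = max(arrived, release_floor)  # in-order delivery holds later packets back
        release_floor = delivered
        delays.append(delivered - sent)
    return delays


print(inner_delays(6, 10, 20, lost_index=2, retransmit_after_ms=200))
# -> [20, 20, 220, 210, 200, 190]
```

Over UDP only packet 2 would be affected. Over an outer TCP stream, every packet queued behind it inherits the stall, and the inner stack's RTT estimator sees a tenfold spike.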

mlyle · 2024-10-26T01:53:00 1729907580

Sure, even when outer loss is pretty close to zero, it's conceptually not great.

On the other hand, I get 400 Mbps over TCP-over-TCP connections, and can't connect in any other reasonable way. 400 Mbps > 0.

Even tunneling in UDP is not great due to MTU effects.

lxgr · 2024-10-25T23:16:45 1729898205

> just add performant TCP support to Wireguard

But IP over TCP is in principle non-performant. There's no (non-evil) magic Wireguard could perform to get around that.

Adding TCP support to Wireguard would add a whole bunch of complexity that it doesn't need – for a very niche use case (i.e. where you absolutely have to get an IP VPN to work over a restrictive firewall).

> Wireguard won't do it directly, but there's hacks involving udp2raw.

Which significantly does not do UDP over TCP in the problematic sense (it just masquerades UDP as TCP, without providing a second set of TCP control loops on top of the first one).

> AWS killed all internet access on the instance with zero warning and sent us an email indicating that they suspected the instance was compromised and being used as part of a DDOS attack.

It makes no sense for that to be due to Wireguard usage, though (not saying I don't believe you that it happened, just their explanation or your assumption of their motivation seems strange). Things like Tailscale use Wireguard and should be common enough for AWS to know about them by now, I'd assume?

Dylan16807 · 2024-10-26T04:00:14 1729915214

> But IP over TCP is in principle non-performant.

No it's not. In principle it risks meltdown, which is different. A link that occasionally breaks can be performant while it's working.

chgs · 2024-10-25T23:10:24 1729897824

I run WireGuard to all my ec2 and AWS instances with no problem. I also run UDP video streams into AWS with little issue.

amaccuish · 2024-10-26T11:36:46 1729942606

Ye I think there's either more to the story or a misconfiguration. I've done WireGuard at Azure, Hetzner, and AWS. All work fine.

aarmenaa · 2024-10-26T18:29:50 1729967390

It is very difficult to misconfigure Wireguard -- there's just not that much to tune aside from MTU. We've had a 1 Gbps tunnel between AWS and OVH for years and it worked mostly fine, except for the handful of times OVH's DDOS mitigation kicked in and killed the tunnel. The issue is when you start wanting to go beyond 1 Gbps.

I think AWS will do 5 Gbps with a capable peer -- which is their limit for a single flow [1] -- but you might need to tell them first so they don't kill public networking on the instance. I found that UDP iperf tests reliably got my instance's internet shut off, so keep that in mind. On the other hand, OVH will happily do 5-ish Gbps to/from my EC2 instance in a TCP iperf test, but won't tolerate more than 1 Gbps of inbound UDP. OVH support has indicated that this is expected, though they do not document that limitation and it seemed that both their support and network engineering people were themselves unaware of that limit until we complained. They don't seem to have the same limits on ESP, which is why I developed an interest in ipsec arcana.

[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...

mobilio · 2024-10-26T18:55:50 1729968950

Enter TunSafe: https://github.com/TunSafe/TunSafe

This comes with TCP implementation https://github.com/TunSafe/TunSafe/blob/master/docs/WireGuar...

Bad news is that runs only between TunSafe instances.

toast0 · 2024-10-26T16:37:10 1729960630

Worst case, can't you run a minimal TURN server and have TCP over WireGuard/UDP over TURN/TCP?

For a site to site VPN, something where you use transparent proxying at the routers to turn TCP into TCP over SOCKS (over TLS) might work. TCP proxying with 1:1 sockets avoids most of the issues with TCP over TCP, at the expense of needing to keep socket buffers at the proxy hosts.

Hikikomori · 2024-10-25T23:35:46 1729899346

Did ipsec over udp for client vpn, to datacenter and even to Azure from AWS. No issues whatsoever, never did more than a Gbit over 1 tunnel though.

aarmenaa · 2024-10-26T00:09:52 1729901392

We've run Wireguard tunnels that max out at 1 Gbps in AWS for years with no issues (on the AWS side, anyways). It seems like things get hairy once you want to do more than that.

danw1979 · 2024-10-26T08:53:21 1729932801

Did you try udp/443 to see if OVH clobber that traffic?

I was quite hoping that the advent of QUIC would let us all use UDP again, albeit on one port.

throwing_away · 2024-10-25T22:28:15 1729895295

Did you go down the shadowsocks path at all?

aarmenaa · 2024-10-26T00:08:06 1729901286

I did not. I'm not terribly familiar with it, but it doesn't look like I can do general routing with it, right? My end goal is to route between two subnets.

riobard · 2024-10-26T06:55:39 1729925739

Nope, shadowsocks is just a plain TCP-in-TCP (not TCP-over-TCP) proxy. If you cannot have performant routing between clouds due to UDP QoS, then the only sensible solution would be to set up proxy nodes on both sides and transparently redirect TCP (if that's all you need) traffic through the proxy.

(I wrote https://github.com/shadowsocks/go-shadowsocks2)

eqvinox · 2024-10-26T15:53:04 1729957984

> strongSwan cannot be configured to do unencapsulated ESP anymore -- they removed the option

wait, what? Pretty sure I still used unencapsulated ESP a few months ago… though I wouldn't necessarily notice if it negotiates UDP after some update I guess… starts looking at things

Edit: strongswan 6.0 Beta documentation still lists "<conn>.encap default: no" as config option — this wouldn't make any sense if UDP encapsulation was always on now. Are you sure about this?

aarmenaa · 2024-10-26T18:04:14 1729965854

Sorry, I misremembered the issue. Looking at my notes the issue is they don't allow disabling their NAT-T implementation, which detects NAT scenarios and automatically forces encapsulation on port 4500/udp. The issue is that every public IP on an EC2 instance is a 1:1 NAT IP. Every packet sent to the public IP is forwarded to the private IP -- including ESP -- but it is technically NAT and looks like NAT to strongSwan.

There's an issue open for years; it will probably never be fixed:

https://wiki.strongswan.org/issues/1265
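
For context, this is roughly how IKEv2 NAT detection works per RFC 7296 (a sketch of the standard mechanism, not strongSwan's code): each side sends a SHA-1 hash over the SPIs plus the IP and port it believes it is sending from, and the receiver recomputes the hash using the address it actually observed. The function names, SPI values, and addresses below are made up for illustration.

```python
import hashlib
import ipaddress
import struct


def nat_detection_hash(spi_i: bytes, spi_r: bytes, ip: str, port: int) -> bytes:
    """NAT_DETECTION_*_IP payload: SHA-1 over SPIi | SPIr | IP | port."""
    return hashlib.sha1(
        spi_i + spi_r + ipaddress.ip_address(ip).packed + struct.pack("!H", port)
    ).digest()


def nat_detected(claimed_hash: bytes, observed_ip: str, observed_port: int,
                 spi_i: bytes, spi_r: bytes) -> bool:
    """True if the sender's view of its own address differs from what we observed."""
    return claimed_hash != nat_detection_hash(spi_i, spi_r, observed_ip, observed_port)


spi_i, spi_r = bytes(range(8)), bytes(8)  # placeholder SPIs
# The EC2 instance only knows its private address...
claimed = nat_detection_hash(spi_i, spi_r, "172.31.0.10", 500)
# ...but the peer sees the 1:1-mapped public address, so the hashes differ.
print(nat_detected(claimed, "203.0.113.7", 500, spi_i, spi_r))  # True
```

So even though the 1:1 mapping passes ESP through untouched, the address rewrite alone is enough to trip detection.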

eqvinox · 2024-10-26T20:33:37 1729974817

Ah, OK, yeah that makes sense.

FWIW, using IPv6 might be an option here?

dang · 2024-10-25T19:23:19 1729884199

Why TCP Over TCP Is a Bad Idea (2001) - https://news.ycombinator.com/item?id=9281954 - March 2015 (43 comments)

Why TCP Over TCP Is A Bad Idea - https://news.ycombinator.com/item?id=2409090 - April 2011 (26 comments)

cma · 2024-10-26T05:09:04 1729919344

If you are in a situation where you have to anyway, you can use multiple TCP sockets and round robin them (with Nagle off) such that you are always sending just one packet over each. You'll get overhead and some unneeded acks, but no head-of-line blocking of the second layer of TCP mechanics going on.
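
A minimal sketch of the sending side, assuming the connections are already established. The class name and the 2-byte length prefix are my own choices for illustration, and this version does not enforce the "only one packet in flight per socket" part; it just spreads packets evenly.

```python
import itertools
import socket
import struct


def disable_nagle(sock: socket.socket) -> None:
    """Send small writes immediately instead of coalescing them."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)


class RoundRobinSender:
    """Spread tunneled packets across several outer TCP connections."""

    def __init__(self, conns):
        self._cycle = itertools.cycle(list(conns))

    def send_packet(self, packet: bytes) -> None:
        conn = next(self._cycle)
        # TCP is a byte stream, so prefix each packet with its length
        # to let the receiver recover packet boundaries.
        conn.sendall(struct.pack("!H", len(packet)) + packet)
```

Call `disable_nagle()` on each socket before handing it to `RoundRobinSender`. Packets can arrive out of order across sockets; the inner TCP tolerates that the same way it would on a real network.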

some_furry · 2024-10-26T00:47:11 1729903631

Yes, but what about IPv6 over Amazon S3?

https://xeiaso.net/blog/anything-message-queue/

bcrl · 2024-10-26T01:59:26 1729907966

Time to pull RFC 1149: A Standard for the Transmission of IP Datagrams on Avian Carriers. https://www.rfc-editor.org/rfc/rfc1149

Svip · 2024-10-25T20:01:04 1729886464

I notice that the earliest version of this post[0] is dated 1999, whilst the latest version is modified in 2001 (see the main link). Which year would be appropriate to mark it on HN? 1999? 2001?

[0] https://web.archive.org/web/20000310230940/http://sites.inka...

Shared404 · 2024-10-25T20:59:12 1729889952

IMO the latest update is the best option. Or a syntax like (<original date>, updated <update date>) if the update was not super substantial.

dmitrygr · 2024-10-25T20:13:21 1729887201

sqrt(1999 * 2001)

pif · 2024-10-25T21:34:20 1729892060

(|1999> + |2001>)*Sqrt(1/2)

schmidtleonard · 2024-10-25T20:14:52 1729887292

(1999^-1 + 2001^-1)^-1

exe34 · 2024-10-25T21:08:52 1729890532

2000ish

01HNNWZ0MV43FF · 2024-10-25T21:17:43 1729891063

Port forwarding doesn't count, right?

duskwuff · 2024-10-25T21:33:42 1729892022

Correct, because that doesn't transform the TCP stream.

"TCP over TCP" specifically means a TCP stream whose payload represents a sequence of TCP packets.

AStonesThrow · 2024-10-25T21:54:00 1729893240

TCP doesn't use "packets". They're called "segments".

lxgr · 2024-10-25T23:23:49 1729898629

A TCP packet is a pretty common term for an IP packet containing a TCP segment.

VPNs usually forward IP packets, so the usage seems correct to me here.

AStonesThrow · 2024-10-25T23:55:13 1729900513

It's wrong.

"TCP packet" is a pretty common colloquial term which is also factually wrong. There is no such thing as "packet" at the TCP protocol layer.

This paper is discussing the tunnelling of payloads with PPP over SSH.

You can see in the stacked diagram that "TCP" appears in two layers of the protocol stack, as does IP.

Essentially, they're encapsulating TCP segments inside other TCP segments, although there are also trappings of IP, PPP, and SSH protocols in-between those.

This isn't a VPN use case at all. And if a VPN is forwarding IP packets, then it's forwarding IP packets.

adrian_b · 2024-10-26T08:19:49 1729930789

"TCP packet" is an absolutely correct term.

As the previous poster said, a "TCP packet" is not a "TCP segment".

A "TCP packet" is an IP packet that carries some part of the TCP byte stream, i.e. normally some part of one TCP segment.

Each "TCP packet" carries its own TCP header, so the division of the TCP segments into TCP packets is visible and significant when you look at a TCP data flow from outside, even if it does not matter internally for the processes that communicate through TCP.

There are probably many more people who care about TCP packets than about TCP segments. When you are configuring firewall rules or monitoring network traffic, you are concerned about TCP packets, almost never about TCP segments.

AStonesThrow · 2024-10-26T08:26:40 1729931200

That's an IP packet. There's no such thing as "TCP packets".

IP packets can encapsulate various things depending on how you set their protocol fields. But packets are not TCP. This is quite straightforward. It simply has to do with distinct PDU nomenclature at each layer.

You can have Ethernet frames carrying IP packets too, but they are not "Ethernet packets"; that would be absurd.

icedchai · 2024-10-26T15:14:28 1729955668

Okay, let's call them "IP packets with protocol 6 (TCP) in the header." Does that make it clearer?
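
In code terms, that classification is a single byte in the IPv4 header. A minimal sketch, with a made-up function name and only a handful of protocol numbers listed:

```python
PROTOCOL_NAMES = {1: "ICMP", 6: "TCP", 17: "UDP", 50: "ESP"}


def ipv4_protocol(packet: bytes) -> str:
    """Name the payload protocol of an IPv4 packet from its header."""
    if len(packet) < 20 or packet[0] >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    proto = packet[9]  # the Protocol field is the tenth byte of the IPv4 header
    return PROTOCOL_NAMES.get(proto, f"protocol {proto}")
```

This is what a firewall rule or a packet capture filter is looking at when it says "TCP".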

lxgr · 2024-10-26T17:21:56 1729963316

> You can have Ethernet frames carrying IP packets too, but they are not "Ethernet packets"; that would be absurd.

Yes, because "Ethernet packet" would be the wrong way around, layer wise. The pattern is "x y", short for "a y of type x" or "a y containing x" in this case.

It would be "IP frames" by analogy, i.e. Ethernet frames containing IP packets, although I do find that association a bit harder to make.

simeonmiteff · 2024-10-26T10:59:10 1729940350

No, the nomenclature is not consistent at each layer.

Another layer down, IEEE 802.3 names the complete message (i.e., including preamble, SFD and FC) that an Ethernet PHY sends to an Ethernet MAC (which processes "frames")... a packet!

"TCP packet" is correct.

adrian_b · 2024-10-26T13:17:06 1729948626

The IP packets are divided into many kinds, based on the higher-level protocol that they carry, i.e. based on the type of the header that follows the IP header.

Thus there are ICMP packets, UDP packets, TCP packets, IPsec packets and so on.

Everybody who has anything to do with network management understands that when it is said "ICMP packet", "IPsec packet", "TCP packet" and so on, what is meant is "IP packet carrying the ICMP/IPsec/TCP protocol".

When you manage or monitor network traffic, you see and examine IP packets of various kinds. Nobody cares about how a TCP stream happened to be partitioned in TCP segments. That may be interesting only for those who look on the other side of a TCP socket, its internal side, i.e. those who debug some program that communicates through TCP and which may have some throughput or latency problems.

The Ethernet frames are also divided into many kinds based on the protocol that they carry, which may be ARP, IP, IPX and so on. For more consistency, those could have been called ARP, IP, IPX etc. frames, but they are also usually called ARP, IP, IPX etc. packets. The reason is that frame and packet are almost synonymous terms. One or the other has been preferred depending on the organization that has prepared a standard.

AStonesThrow · 2024-10-26T15:30:54 1729956654

> when it is said "ICMP packet", "IPsec packet", "TCP packet" and so on, what is meant is "IP packet carrying the ICMP/IPsec/TCP protocol".

But an "IP packet carrying TCP" is totally different from a "TCP segment", and so why the fuck is everyone trying to muddy the waters in this regard?

Likewise, "an Ethernet frame with IP stuff in it" is not an IP packet. Because it's at a lower layer!!!

Due to fragmentation, these higher protocol layers can be split up into two or more PDUs at lower levels. They are reassembled as steps to decoding the protocol. A TCP stream, made of multiple segments, will surely be split up across multiple IP packets. Even a single TCP segment is never guaranteed to map 1:1 with a single IP packet. That's why you can't equate them, and that's why it's idiotic to try and force the wrong PDU terminology in the wrong layer, because they simply don't match up, and you confuse newbies into believing these things are equivalent or interchangeable. They are not!

A TCP segment is a thing in its own right. It is not equivalent to an IP packet with protocol 6. Why is that a crazy idea to y'all?

adrian_b · 2024-10-26T17:55:01 1729965301

Yes, you are right that an "IP packet carrying TCP" is totally different from a "TCP segment".

Like I have explained, this is precisely the reason why we need two different terms, "TCP packet" and "TCP segment", to name the two different things. There exists no 1-to-1 mapping between TCP packets and TCP segments. It is frequent for a TCP segment to span multiple TCP packets.

Muddying the waters would be if the same term would be used for the two different things. When two different things have two different names, the waters are clear.

Likewise, there is a 1-to-1 mapping between the "Ethernet frame with IP stuff in it" and the IP packet contained inside the Ethernet frame (ignoring the distinction between complete IP packets and fragmented IP packets, which has been removed in IPv6). The IP packet is obtained by deleting or ignoring the Ethernet header & CRC, while the Ethernet frame is obtained by adding to the IP packet the Ethernet header and CRC.

Because of the 1-to-1 mapping, there is no risk of confusion when using a term like "IP packet", regardless of whether you speak about the IP packet alone, as existing in the memory before being sent or after reception, or about the IP packet as existing on the communication links or in some tool for network monitoring/sniffing, where it is encapsulated in the Ethernet frame. Similarly for an ARP packet or any other kind of packet that can be encapsulated in an Ethernet frame.

lxgr · 2024-10-26T17:18:33 1729963113

> A TCP segment is a thing in its own right. It is not equivalent to an IP packet with protocol 6. Why is that a crazy idea to y'all?

Nobody is claiming they're the same in this thread, as far as I can tell.

"TCP packet" is simply a pretty clear/unambiguous contraction of "an IP packet carrying TCP", or "an IP packet containing a TCP segment", to me.

lxgr · 2024-10-25T23:23:05 1729898585

Port forwarding is TCP "next to" TCP, so that's fine, yes!

It can even be beneficial in some cases: If a host has an old/bad TCP stack not able to deal well with some network situation (latency, packet loss, you name it), port forwarding from a closer/less affected host can resolve the issue.

Happened to me once for the terrible old eBook delivery server from my public library when a continent away: It handled the long latency so poorly, a 30 MB download would have taken two hours. SSH forwarding brought that down to seconds.