
Fragmentation, MTU, MSS clamping and tunnels (2014) - fauria
https://fredrikjj.wordpress.com/2014/08/10/fragmentation-mtu-mss-clamping-tunnels/
======
TrueDuality
Hah, I just encountered this for a completely different though similar reason. I'm still
not entirely sure what was going on, but in Azure it seems like the network
fabric that underpins the VM networking stack was combining TCP packets into
larger ones between two VMs.

One of those was tunnelling the traffic elsewhere and was getting packets
"from the network" with packet sizes up to 65k despite the MTU of the tunnel
being set to 1400. The transmitting host was respecting the MTU. I had to
force clamp the MSS to the PMTU in iptables to get things working again.
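
For anyone wanting the arithmetic behind that clamp, here is a rough sketch: the MSS is the MTU minus the IP and TCP headers. It assumes minimal headers (no IP options, no IPv6 extension headers, no TCP options such as timestamps, which shrink the usable payload further). `mss_for_mtu` is an illustrative helper, not something from iptables.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
IPV6_HEADER = 40  # bytes, fixed header only, assuming no extension headers
TCP_HEADER = 20   # bytes, assuming no TCP options


def mss_for_mtu(mtu: int, ipv6: bool = False) -> int:
    """Largest TCP payload that fits in one packet of size `mtu`."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    mss = mtu - ip_header - TCP_HEADER
    if mss <= 0:
        raise ValueError(f"MTU {mtu} leaves no room for TCP payload")
    return mss


print(mss_for_mtu(1400))             # 1360
print(mss_for_mtu(1400, ipv6=True))  # 1340
print(mss_for_mtu(1500))             # 1460
```

So for the 1400-byte tunnel above, a clamp would bring the advertised MSS down to roughly 1360 for IPv4 under these assumptions.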

~~~
bcrl
The performance of modern network cards leans heavily on two so-called stateless
offloads: TCP/transmit segmentation offload and large receive offload (TSO and
LRO; Linux's generic versions are GSO and GRO). On transmit the kernel sends TCP
packets in large chunks (~64KB) to the nic and the nic then puts them on the wire
in MTU sized chunks as specified by the host. On receive, the nic tries to merge
multiple small packets together into a single large packet. Both of these
optimizations have the effect of reducing the number of packets per second the
kernel and application have to handle, driving up throughput as a result.
Naturally, this introduces a new class of subtle network behaviour that many
folks aren't aware of. This also results in tcpdump no longer showing what
packets are actually being sent / received on the wire, making debugging tonnes
of fun for the whole family!
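
To put numbers on the transmit side, here is a toy sketch of what segmentation offload does to one large chunk. It assumes a 1460-byte MSS (1500 MTU, IPv4, no options). `segment_sizes` is a hypothetical helper: a real NIC also replicates and fixes up the headers on every segment, which this ignores.

```python
from typing import List


def segment_sizes(payload_len: int, mss: int) -> List[int]:
    """Split one large TCP payload into MSS-sized pieces, as TSO would."""
    if payload_len < 0 or mss <= 0:
        raise ValueError("payload_len must be >= 0 and mss must be > 0")
    full_segments, remainder = divmod(payload_len, mss)
    sizes = [mss] * full_segments
    if remainder:
        sizes.append(remainder)  # the final, shorter segment
    return sizes


sizes = segment_sizes(64 * 1024, 1460)
print(len(sizes), sizes[-1])  # 45 1296
```

The kernel handles one 64KB chunk here while the wire carries 45 packets, which is also why a capture taken on the host can show "packets" far larger than the MTU.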

------
Klasiaster
Besides disabling PMTU Discovery by allowing fragmentation with `sysctl
net.ipv4.ip_no_pmtu_disc=1`, the Linux kernel also has a setting to enable
dynamic MTU probing at the TCP layer (PLPMTUD, RFC 4821): `echo 1 >
/proc/sys/net/ipv4/tcp_mtu_probing`.
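
The idea behind packetization-layer probing, very roughly: send probes of different sizes and watch which ones get through, rather than waiting for ICMP. The sketch below uses a plain binary search to show the principle. It is not the kernel's actual algorithm or RFC 4821's exact procedure, and `probe` is a hypothetical callback standing in for "send a segment of this size and see whether it is acknowledged".

```python
from typing import Callable


def search_path_mtu(probe: Callable[[int], bool], low: int, high: int) -> int:
    """Binary-search the largest size in [low, high] that `probe` accepts.

    `low` must be a size already known to work; `high` is the largest
    size worth trying (typically the local interface MTU).
    """
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid       # probe got through, try something larger
        else:
            high = mid - 1  # probe was lost, the path MTU is below mid
    return low


def fake_probe(size: int) -> bool:
    """Stand-in for a real network probe: pretend the path MTU is 1400."""
    return size <= 1400


print(search_path_mtu(fake_probe, 576, 1500))  # 1400
```

A real implementation also has to tell a probe lost to the MTU apart from one lost to ordinary congestion, which this sketch does not attempt.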

------
zamadatix
Good article, gets to the point with good info. The "what's the biggest
message I can send across a path" problem is one of those things that I hadn't
thought of before running into it but ended up being really interesting/harder
to solve than I initially thought it was (which I think are the "fun"
surprises in learning new things).

MSS clamping is going away with protocols like QUIC and HTTP/3 encrypting L4
headers/options. PMTUD remains the proper way to handle things, or something
like PLPMTUD if you're not willing to assume ICMP. Some things create path MTU
black holes, commonly firewalls but also APs. APs can be particularly
troublesome because nowadays they commonly don't terminate the wireless
encryption, the controller at the other end of the tunnel does, which means
they can't inject an ICMP message to let the client know the packet is too big. MSS
clamping saves the day (for TCP), but otherwise the AP can either drop or fragment at
the encap layer, transparently to the client. I've yet to see a system that
delivers the fragmented tunnel packet so the controller can ICMP back that the
packet is too big but it'd be the cleanest thing in that scenario.

Also, if you're going to limit tunnel size arbitrarily beforehand, it's probably best
to make sure the resulting packet ends up being able to fit into 1428. A lot
of products use a 1300 "inside" MTU for this and similar reasons. 1400 +
auto IPsec calc will lead to trouble in this scenario.
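
A quick way to sanity-check an "inside" MTU against that 1428 budget is sketched below. The overhead figure is a made-up placeholder: real IPsec overhead depends on mode, cipher, padding and NAT traversal, so plug in your own number. Both helper names are hypothetical.

```python
ASSUMED_OVERHEAD = 73  # bytes; hypothetical value, NOT a real IPsec figure


def fits(inside_mtu: int, overhead: int, limit: int = 1428) -> bool:
    """True if an inside packet plus encapsulation fits within `limit`."""
    return inside_mtu + overhead <= limit


def max_inside_mtu(overhead: int, limit: int = 1428) -> int:
    """Largest inside MTU that still fits once `overhead` is added."""
    return limit - overhead


for inside in (1300, 1400):
    print(inside, fits(inside, ASSUMED_OVERHEAD))
# 1300 True
# 1400 False

print(max_inside_mtu(ASSUMED_OVERHEAD))  # 1355
```

With this placeholder overhead, 1300 leaves plenty of slack while 1400 overshoots, which matches the point above.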

Don't drop "inside" MTU below 1280 in IPv6 networks. IPv4 is much more
tolerant, with the standard being 576 but many implementations supporting less.

