
Fixing an old hack – why we are bumping the IPv6 MTU - jgrahamc
https://blog.cloudflare.com/increasing-ipv6-mtu/
======
zrm
2015:

[https://news.ycombinator.com/item?id=9001576](https://news.ycombinator.com/item?id=9001576)

------
jorangreef
From a software point of view, it has always seemed to me that typical MTUs
are an order of magnitude too small:

Disk operations as well as CPU operations (authentication, encryption, e.g.
TLS) all benefit from using larger block sizes of 65536 bytes and up.

This amortizes system call overhead, enables larger sequential operations,
enables more use of SIMD instructions, reduces instruction cache misses
(working on one thing for longer), reduces context switches, and serves to
increase the leverage that the control plane has on the data plane.

Just one specific example: something like TCP needs to keep track of packets
(for retransmission, acks, etc.). When you're sending Gbps, small packets mean
much more bookkeeping overhead (more method calls, more bits in ack packets,
more hash table lookups) compared to large packets. Throughput should be much
better with an MTU of 65536 (and latency-sensitive applications don't have to
send that much).

I can imagine that in the past the cost of memory was radically different,
implying a smaller MTU, but it seems like sticking with 1500 for IPv6 was an
opportunity missed.
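A rough back-of-the-envelope sketch of the bookkeeping argument (the link speed and MTUs here are illustrative, not from the article): the packets per second a stack has to track just to fill a 10 Gbps link at each MTU.

```python
# Illustrative only: packets per second needed to saturate a 10 Gbps
# link at a 1500-byte vs. a 65536-byte MTU.
link_bps = 10_000_000_000

for mtu in (1500, 65536):
    pps = link_bps / 8 / mtu
    print(f"MTU {mtu:>5}: ~{pps:,.0f} packets/s of per-packet bookkeeping")
```

At 1500 that works out to roughly 43x more packets (and therefore acks, timers, and hash-table lookups) than at 65536.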

~~~
phicoh
The problem here is Ethernet. Even the latest, greatest multi-gigabit
Ethernet can be bridged to plain old coax-based 10 Mbps Ethernet.

Within the Ethernet protocol there is no way to negotiate the maximum frame
length between stations. So the IEEE standard still specifies a maximum of
1500 octets for the payload.

There are some RFCs that go beyond 1500, but mostly the IETF follows the IEEE
specification.

Note that the 1500 link MTU applies to ethernet (and similar link
technologies). For anything else, you can have larger (or smaller) MTUs.

~~~
akvadrako
Most network devices support jumbo frames¹, which are an IEEE standard, but
they aren't the default. The point is not the IEEE standard, but what most
devices are configured for.

Transit operators and IXPs just don't want to configure their equipment to use
jumbo frames. It's probably just a coordination problem.²

[1] [https://www.networkworld.com/article/2224722/cisco-subnet/jumbo-frames.html](https://www.networkworld.com/article/2224722/cisco-subnet/jumbo-frames.html)

[2] [https://lists.gt.net/nanog/users/187984](https://lists.gt.net/nanog/users/187984)

~~~
zamadatix
> Most network devices support jumbo frames¹, which are an IEEE standard, but
> they aren't the default.

Jumbo frames are not an IEEE standard.

------
ggm
Lack of frag/reassembly along the path is a huge, huge barrier.

By now, link speed growth compared to MTU growth is a sad story. We should be
jumbo everywhere.

This is a pragmatic story; I understand why my jumbo dream isn't happening in
the wide.

But for anyone who direct-connects to an agency like Akamai or FB or Google or
(gasp) Cloudflare... why not up the default?

What's the downside?

~~~
lend000
Jumbo means lots of reassembly on the most common end link, Wi-Fi (which
cannot support larger datagrams without huge drop rates due to the error-prone
nature of the medium). Uncorrectable errors are pretty rare on wired links,
but regardless, as a general rule, the higher the MTU the higher the ratio of
bits that need to be retransmitted.

~~~
ggm
I don't have anything here. I sort of wonder if we misunderstand the problem
and conflate Wi-Fi noise/size with 4G noise/size issues. There are times I
think bursty RF signals with big packets might be better than pessimal small
packets in chains. At one level, the quoted 5 GHz Wi-Fi speeds have to be
taken with a pinch of salt, if what you say is true.

------
toast0
I recently spent some time with IPv4 MTU. I don't know how much of it applies
to IPv6, but my TL;DR is that MSS (+ the size of the minimum header) is
usually a good indicator of achievable MTU, but some small fraction of people
are misconfigured. A lot of that is SYNs indicating 1500 when the actual MTU
is 1492 (PPPoE is a scourge). But there are some other issues too that are
hard to track down. Some client OSes are capable of probing for MTU black
holes, but Android doesn't do it (thanks, Google), even though the kernel has
it available, and there's no way an app could turn it on (thanks, Google). If
you want to avoid this problem, you can send back MTU - X, depending on how
conservative you want to be. 8, 20 and 50 were good values of X, but you're
adding some overhead too.
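The "MSS + size of minimum header" arithmetic can be sketched like this (header sizes are the protocol minimums; TCP or IP options would make real headers bigger, which is why this is only an indicator):

```python
# MSS = MTU minus the minimum IP and TCP headers: 20 bytes each for
# IPv4 and TCP, 40 bytes for the fixed IPv6 header.
def mtu_from_mss(mss, ipv6=False):
    return mss + 20 + (40 if ipv6 else 20)

print(mtu_from_mss(1460))             # plain 1500-byte IPv4 path
print(mtu_from_mss(1452))             # 1492: PPPoE steals 8 bytes
print(mtu_from_mss(1440, ipv6=True))  # 1500-byte path over IPv6
```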

~~~
lathiat
As a quick summary: first of all, the host at each end of the connection knows
its MTU and advertises it as the MSS in the TCP connection, so that in theory
the connection is established at the right MTU; but this is never tested until
a large packet is sent (which may or may not arrive). (The MSS is actually
smaller than the MTU, since it's the data size without headers, and even on a
given link it can change, e.g. if IP options are in use; but in any case it's
derived from the MTU. RTFM for the gory details.)

Most consumer routers, having an MTU of 1492 due to PPPoE, will modify the MSS
of the TCP connection in flight to the MSS they know for the next hop. This is
not a standard feature in other routers, though, and obviously wouldn't really
work on the internet at large, where paths can change mid-connection. So it's
best used at endpoint routers only.
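On a Linux-based router this in-flight modification is usually spelled with the iptables TCPMSS target (a sketch; the `ppp0` interface name is illustrative):

```shell
# Rewrite the MSS option in forwarded SYNs so it matches the path MTU
# of the outgoing PPPoE link.
iptables -t mangle -A FORWARD -o ppp0 -p tcp --tcp-flags SYN,RST SYN \
  -j TCPMSS --clamp-mss-to-pmtu
```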

Ideally such routers might advertise the MTU using DHCP or similar, and it is
possible to do this now (and this is commonly used e.g. openstack clouds)
however it does unnecessarily then limit the MTU of the local network
unnecessarily. In theory I guess it could send a mtu for the default route and
local subnet separately but basically in practice no one does this afaik.

Originally, a router along the path would know the MTU was smaller on the
subsequent link, and would fragment the packet. For various reasons this is an
expensive operation for all involved, so we decided to instead take advantage
of the "DF" (Don't Fragment) flag to generate an ICMP message back to the
source asking them to please adjust their MTU/MSS and re-transmit. This flag
specifically prevents the router from fragmenting the message (that will never
happen now) and is the default behavior at least on Linux, and I assume most
OSes now.

This can go wrong in two ways. Firstly, the ICMP message may be blocked
somewhere, e.g. by some over-zealous firewall, so it never arrives at the end
host. Secondly, the MTU may be misconfigured somewhere at layer 2, so the
packet is silently dropped rather than having an ICMP message generated. The
ICMP message requires that a router *knows* the MTU of the complete L2 path to
the next hop. If it gets that wrong, it will just send the packet and assume
it is OK, when in reality it is dropped silently. (This same problem can be
seen on just a local network: if you configure e.g. a 9000 MTU on two hosts
but an MTU of 1500 on your switch, your connections will break. It's not
purely a routing issue.)

There is a second protocol designed to detect the PMTU "passively" to solve
this problem, called TCP PLPMTUD (Packetization Layer Path MTU Discovery).
This basically uses heuristics to detect that packets are being dropped and
then starts probing the real path MTU by doing something like a binary search
between some base MTU (e.g. 512) and the known max MTU (e.g. 1500). It tries
various values until it settles on the actual MTU. The problem with this, of
course, is that packets may be dropped by ordinary packet loss rather than MTU
issues, so there is a bit of luck and strategy to getting this right; it also
introduces some extra latency while packets are being dropped.
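A toy version of that search (not the real RFC 4821 state machine; `probe` is a hypothetical stand-in for "send a probe segment of this size and see whether it gets acked"):

```python
# Binary-search for the largest segment size that still gets through,
# between a base MTU and the known maximum.
def discover_pmtu(probe, base=512, limit=1500):
    lo, hi = base, limit
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if probe(mid):       # probe-sized segment was acked
            lo = mid
        else:                # probe lost: assume it was too big
            hi = mid - 1
    return lo

# e.g. a path whose real MTU is 1400:
print(discover_pmtu(lambda size: size <= 1400))  # 1400
```

The real thing is messier precisely because `probe` failing can also mean plain packet loss, as noted above.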

For whatever reason this option is currently off by default in Linux (e.g.
Ubuntu); if it were on, it would silently fix the MTU on a lot of connections
that are currently not fixed. The option is the sysctl
net.ipv4.tcp_mtu_probing. The technique has been enabled by default since
Windows Vista, which explains why we still see people with their MSS
misconfigured: Windows Vista onwards will just fix it. In Linux at least it
also has two modes: one where it waits to detect an ICMP blackhole, and
another where it always uses the protocol from the start. The former results
in a 1-2 s delay before it kicks in to fix a connection, so it's not ideal if
it's happening on every connection, but OK for the odd broken connection.
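For reference, the knob in question, as it appears on a typical Linux box (values: 0 = off, 1 = enable probing after a blackhole is suspected, 2 = always probe):

```shell
# Check the current setting, then enable blackhole-triggered probing.
sysctl net.ipv4.tcp_mtu_probing
sudo sysctl -w net.ipv4.tcp_mtu_probing=1
```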

There is of course, like all topics, more to it than this.. but hopefully that
helps.

~~~
toast0
Thanks for the summary. I'm aware of all of this; but from the server side,
when clients mysteriously don't send packets of a certain size, it's hard to
know why --- because I'm only on the server side, and I haven't had any luck
getting knowledgeable people on the other end.

I can tell it's an MTU issue because I get later packets, but I'm missing a
block that happens to be an exact multiple of the MSS. It could be simple
packet loss, except this pattern comes up too often: the client will sometimes
send probing acks, but never the missing packets, and it never retries with
smaller packets. This is especially bad when you're communicating with default
Linux servers that always send SYN+ACK with their local MSS, because so many
clients think they can send 1500-byte packets, but they can't. It's less bad
with most versions of FreeBSD [1], which return whichever is smaller of the
client-side MSS or the server MSS. There are a number of systems/networks that
clamp outgoing SYNs but not incoming SYNs, so downloads mostly work, but
uploads are broken. Mirroring the sent MSS isn't enough to make every
connection work, though, because of the exciting things you mentioned:
somewhere in the middle, something is dropping large packets, and ICMPs aren't
sent or otherwise don't make it to the client, so it just kind of sits around
waiting forever.

Edit to add: I wish the codified behavior were simply to truncate IP packets
to the size available, rather than either fragmenting or alerting. TCP peers
would be able to notice that only shorter packets get acked, and adjust, and
in the meantime some data would have gotten through. Not terribly great for
UDP, but I dunno.

[1] there's a change in -CURRENT to use the Linux behavior, unfortunately

~~~
p1mrx
> I wish the codified behavior was simply to truncate IP packets to the size
> available

If a router could replace the packet with a "truncation sentinel" that's
compatible with most firewalls, detectable by new implementations, and doesn't
cause data corruption for old implementations, then perhaps truncation could
be incrementally deployed? Like a "packet too big" error, but in the forward
direction.

However, it might be impossible to construct a packet satisfying all of those
conditions.

------
Sarkie
I can't believe they used a picture of a Vietnamese tunnel

