
Path MTU Discovery in Practice - jgrahamc
https://blog.cloudflare.com/path-mtu-discovery-in-practice/
======
zAy0LfpBZLC8mAC
I find it mildly embarrassing that people still manage to screw up PMTUD, a
big CDN, no less ... but then again, I guess I appreciate that they did
actually care to fix it once they noticed, which is not something to take for
granted, and at least it was not the idiotic idea of filtering ICMP (has
Amazon ever managed to fix EC2?), but an actual oversight.

However, that blog article seems somewhat misinformed: In addition to tunnels,
one major reason for PMTUs smaller than the standard Ethernet MTU of 1500 is
DSL connections based on PPPoE, both for IPv4 and for IPv6. The only reason
most people don't notice the breakage caused by incompetent admins (in the
let's filter ICMP case) is because home routers tend to mess with the TCP
options in SYN packets, adding an MSS matching the MTU of the DSL connection.
Of course, that's a braindead fix, as it doesn't work for anything other than
TCP.
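
To make the MSS rewriting concrete, here is a rough sketch of the arithmetic a
clamping home router does on a SYN. It assumes the usual 8 bytes of PPPoE
overhead and headers without options; the function names are made up for
illustration and are not taken from any real router firmware.

```python
# Header sizes in bytes, assuming no IP options or TCP options.
IPV4_HEADER = 20
IPV6_HEADER = 40
TCP_HEADER = 20
PPPOE_OVERHEAD = 8  # 6-byte PPPoE header + 2-byte PPP protocol field

ETHERNET_MTU = 1500


def mss_for_mtu(mtu: int, ipv6: bool = False) -> int:
    """Largest TCP payload that fits in a single packet of size `mtu`."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - TCP_HEADER


def clamp_mss(advertised_mss: int, link_mtu: int, ipv6: bool = False) -> int:
    """Value a clamping router would write into the MSS option of a SYN.

    Hypothetical helper: the MSS is only ever lowered, never raised.
    """
    return min(advertised_mss, mss_for_mtu(link_mtu, ipv6))


pppoe_mtu = ETHERNET_MTU - PPPOE_OVERHEAD

# A host on plain Ethernet advertises 1460 (IPv4) or 1440 (IPv6).
print(pppoe_mtu)                              # 1492
print(clamp_mss(1460, pppoe_mtu))             # 1452
print(clamp_mss(1440, pppoe_mtu, ipv6=True))  # 1432
```

Nothing comparable exists for UDP or other protocols, which is why this only
papers over the problem for TCP.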

BTW: I doubt it was that small a number of users that noticed. A customer of
mine also noticed (native IPv6 over ADSL), but given that this usually is
caused by intentionally clueless admins, I just told them that the server
admin probably was incompetent rather than reporting the problem to CloudFlare
...

~~~
majke
[citation needed]

But more seriously, PMTUD usually just works, so people tend to forget about
it. This is actually good - it's a sign that things have stabilised. It's only
the unusual setup at CloudFlare that caused the problems, not problems with
MTU itself.

The world would be a better place if people understood how things they rely on
work. That's why I wrote this article.

~~~
zAy0LfpBZLC8mAC
citation for what? ;-)

That PMTUD just works is not my experience, especially with IPv4: there are
lots of servers/sites out there that break if you don't mangle the MSS when
you are behind some link with an MTU < 1500--unfortunately.

But yeah, let's hope that your article will help spread the word that PMTUD is
important, and that completely filtering ICMP is idiotic--in any case it has
helped remove an "incompetent" flag for your company in my brain ;-)

------
donavanm
Doing a public post mortem is nice. But there's a bit of, uh, "derp" in the
post.

1) Conflating MTU and MSS. MTU is the maximum size of the layer 3 IP datagram.
MSS is the maximum size of the layer 4 TCP data section. Specifically MSS does
not include TCP/IP headers etc. MSS should be influenced by MTU, but they're
not interchangeable.

2) All hosts must accept a minimum IPv4 datagram size of 576 bytes. The minimum
_fragment size_ is 68 bytes. See RFC 791 "Total Length" and "Fragmentation and
Reassembly."

3) Not enabling pmtu discovery probing by default. A significant number (3%?)
of clients on the internet have broken ICMP MTU detection and MTU < 1500.

4) Using a routing platform that can't ECMP on the data portion of ICMP
packets. This is straight WTF. How are you doing dest unreachable/ttl exceeded
without this, for example?

5) Hashing on the four tuple. Kinda maybe? More often I see 5 tuple (+proto)
or 7 tuple (+proto, vlan, tos). And sweet jesus I hope you've disabled ingress
interface in the hash.

6) Picking magic numbers of 1280 & 1024 byte MTU. Why not look at what your
clients are advertising in practice? It's recorded on every socket.

7) Naming your hack "Path MTU Daemon." Holy name collisions batman, PMTUD is
already a thing the rest of the internet knows about. And it's not your
version!

8) Conflating asymmetric anycast failures with PMTUD failure. If the dst can't
get ICMP back to the original source why would any other IP make it back? I'm
not even sure how you'd get in this state, outside of flaps. Someone ECMP per
packet including proto to put ICMP on a different path which then has an
intermediate node with a different AS path or terrible PBR? Bizarro, and I
still don't see how the cloudflare "pmtud" would help.
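
For point 4, this is roughly what hashing on the data portion of an ICMP error
means. It is only a sketch: `Flow`, `bucket` and `pick_server` are invented
names, and SHA-256 stands in for whatever hash the hardware really uses. The
idea is that an ICMP error embeds the header of the packet that triggered it,
so the balancer can hash on that embedded flow (with source and destination
swapped) instead of on the ICMP packet's own addresses.

```python
import hashlib
from typing import NamedTuple, Optional


class Flow(NamedTuple):
    proto: str
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def bucket(flow: Flow, n_servers: int) -> int:
    """Map a flow to a server index (stand-in for a hardware ECMP hash)."""
    key = "|".join(map(str, flow)).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_servers


def reverse(flow: Flow) -> Flow:
    """Swap source and destination of a flow."""
    return Flow(flow.proto, flow.dst_ip, flow.dst_port,
                flow.src_ip, flow.src_port)


def pick_server(outer: Flow, embedded: Optional[Flow], n_servers: int) -> int:
    """Choose a server for an incoming packet.

    `embedded` is the original packet header quoted inside an ICMP error,
    or None for ordinary traffic. The quoted packet went server -> client,
    so it is reversed to match the client -> server direction that the
    TCP segments of the same connection are hashed on.
    """
    if embedded is not None:
        return bucket(reverse(embedded), n_servers)
    return bucket(outer, n_servers)


# Ordinary TCP segment from a client to the anycast address.
tcp_in = Flow("tcp", "2001:db8::1", 51000, "2001:db8:ffff::1", 443)

# "Packet too big" from some router on the path, quoting the server's packet.
icmp_outer = Flow("icmp6", "2001:db8:aaaa::1", 0, "2001:db8:ffff::1", 0)
quoted = reverse(tcp_in)

print(pick_server(tcp_in, None, 16) == pick_server(icmp_outer, quoted, 16))
```

Hashing on `icmp_outer` alone would key on the router's address, so the error
would only reach the right server by luck.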

~~~
majke
Thanks for feedback!

1) Indeed. I tried to speak only about MTU but well, MSS crept in a few
times. OTOH I don't want to get into details, I want to keep the article
short.

2) Point taken.

3) It's more complex. In IPv4, as mentioned, the problem has been known for
much longer. Enabling RFC4821 is not a perfect solution; better to fix your
ICMP. That's why it's off by default on Linux.

4) What can I say. Hardware is being hardware?

5) Simplification. I did say "for TCP" though.

6) 1280 is the safe choice for IPv6. For IPv4 it's a compromise. We do listen
to what the client says, 1024 is a fallback when RFC4821 kicks in.

8) I specifically avoided going into the details of this. Look for future blog
posts.

------
zrm
> As a temporary fix we reduced the MTU for all IPv6 paths to 1,280 (solution
> mentioned as #2). Many other providers have the same problem and use this
> trick on IPv6, and never send IPv6 packets greater than 1,280 bytes.

This feels like a bad idea. If you set your MTU to the minimum then anything
still encapsulated gets pushed below the minimum.
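
One way to read that, as back-of-the-envelope arithmetic. The 40-byte overhead
is an assumption for illustration (plain IPv6-in-IPv6 with no extension
headers); other tunnel types add different amounts.

```python
IPV6_MIN_MTU = 1280  # minimum link MTU required by the IPv6 specification
IPV6_HEADER = 40     # fixed IPv6 header, no extension headers


def inner_mtu(outer_mtu: int, overhead: int) -> int:
    """MTU left for packets carried inside a tunnel over `outer_mtu`."""
    return outer_mtu - overhead


for outer in (1500, 1280):
    inner = inner_mtu(outer, IPV6_HEADER)
    # Prints the outer MTU, the inner MTU, and whether the inner MTU
    # still meets the IPv6 minimum.
    print(outer, inner, inner >= IPV6_MIN_MTU)
```

With a 1500-byte outer MTU the tunnel still has 1460 bytes to offer. With the
outer MTU pinned at 1280 the inner side is left with 1240, which is below what
IPv6 requires a link to carry.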

------
spydum
It kills me that PMTU can still bite us. I remember working at an ISP ~15
years ago and fighting PMTU issues all the time because we first tried to roll
out ADSL service with PPP and later bridging, and there was a bit of overhead
we didn't account for. Match that up with my boss's conviction that all ICMP
should be blocked, and it led to fun times (he was also a big proponent of
disabling STP -- because you know it could cause wacky loops..).

