
Broken packets: IP fragmentation is flawed - majke
https://blog.cloudflare.com/ip-fragmentation-is-broken/
======
magila
While the designers of TCP/IP got an amazing amount of things right, one
thing they missed rather badly was the need to take the incentives and
motivations of the people participating in the network into account. I don't
blame them for that. At the time the internet was an academic pursuit, so the
idea of designing a protocol to resist breakage by self-interested commercial
ISPs simply wasn't something that was on people's minds.

Taking IP fragmentation as a particular example: The protocols should be
designed such that inability to perform PMTU discovery prevents any
connections being established at all.

Designing protocols which will be robust in the face of human actors who may-
or-may-not be entirely interested in doing things the intended way is
incredibly difficult. To make things worse, once you make a mistake it is
practically impossible to fix. Using fragmentation as an example again: Now
that middle boxes which break PMTU discovery are established, it would be
impractical to introduce a "fail unconditionally" protocol. People would
simply blame the products which introduced such a protocol for being broken.
You have to get it right from the start so that bad actors get caught out
before their broken products can be deployed.

~~~
gerdesj
"Taking IP fragmentation as a particular example: The protocols should be
designed such that inability to perform PMTU discovery prevents any
connections being established at all."

You seem to be saying "I'm right and you are wrong" here, and I would suggest
that is ill-advised - TCP/IP-based comms generally work despite ill-advised
configuration. You can bin ICMP on WAN and yet an OpenVPN connection still
works, for example.

"Designing protocols which will be robust" \- see Internet: all of it. I
suggest that the current internet protocols are a shining example of how to do
engineering properly.

~~~
paloaltokid
You know BGP was written on the back of a napkin, right? And no, I am not
joking.

I would also submit that BGP is _not_ a shining example of well-done
engineering.

~~~
aninhumer
That something was written on a napkin doesn't mean anything without context.

It could mean someone made something up off the cuff, or it could mean they'd
been thinking about it for months and were finally inspired during a meal and
didn't want to lose the idea.

~~~
jimktrains2
Or that it started on a napkin and was refined and fleshed out over time.

------
majke
Check out the online ICMP / Fragmentation checker:

IPv4 version: [http://icmpcheck.popcount.org](http://icmpcheck.popcount.org)

IPv6 version:
[http://icmpcheckv6.popcount.org](http://icmpcheckv6.popcount.org)

If you see red - this means your ISP / cable modem is not really internet
happy.

Funny thing: the ISP I use fails the ICMP PMTU message delivery test on the
first refresh, but succeeds on the second. What happens? The ICMP is delivered
to a middle box, which then FRAGMENTS MY TCP PACKETS. In other words - the
ICMP is stopped at a middle box, which ignores my packets' DF flag and
silently fragments them without my system even knowing. On the second refresh
the middle box remembers the Path MTU and reduces the MSS on the SYN packet.
This is remembered for about 15 minutes. Middle boxes suck.

~~~
user5994461
Or it means that ICMP traffic is not working. Misconfigured networks are
incredibly common. There is a massive misconception that ICMP is not needed or
should be blocked for security.

For example, you can google "should I open/block ICMP" and none of the Stack
Overflow answers will tell you the right thing to do.

~~~
jlgaddis
Indeed, and those who like to blindly drop all ICMP are gonna be hurting when
it comes time to deploy IPv6.

As noted in RFC4890 [0], "ICMPv6 is essential to the functioning of IPv6 ...
Filtering strategies designed for the corresponding protocol, ICMP, in IPv4
networks are not directly applicable."

So, if you currently just drop all ICMP at the edge, you might as well get
used to doing it the right way now and save yourself some trouble in the
future.

[0]:
[https://www.ietf.org/rfc/rfc4890.txt](https://www.ietf.org/rfc/rfc4890.txt)
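In practice, RFC 4890's advice boils down to permitting a handful of essential
ICMPv6 types rather than dropping the protocol wholesale. A minimal ip6tables
sketch (the chain and rate limit are illustrative choices, not from the RFC):

```shell
# Sketch only: permit the ICMPv6 messages RFC 4890 lists as essential,
# instead of dropping the protocol wholesale. Adapt to your own policy.
ip6tables -A FORWARD -p icmpv6 --icmpv6-type destination-unreachable -j ACCEPT
ip6tables -A FORWARD -p icmpv6 --icmpv6-type packet-too-big          -j ACCEPT
ip6tables -A FORWARD -p icmpv6 --icmpv6-type time-exceeded           -j ACCEPT
ip6tables -A FORWARD -p icmpv6 --icmpv6-type parameter-problem       -j ACCEPT
# Rate-limit echo if you must, but never silently drop Packet Too Big:
ip6tables -A FORWARD -p icmpv6 --icmpv6-type echo-request -m limit --limit 10/s -j ACCEPT
```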

------
jsnell
> These often completely ignore ICMP and perform very aggressive connection
> rewriting. For example Orange Polska not only ignores inbound "Packet too
> big" ICMP messages, but also rewrites the connection state and clamps the
> MSS to a non-negotiable 1344 bytes.

Most mobile operators will do MSS clamping. The IP packets get encapsulated in
GTP. Having to fragment the GTP packets would suck. (Now, you don't need to
clamp to 1344 just to account for GTP. So they have something else going on as
well, perhaps GTP+IPSec).

Of course, "most" isn't "all". Some don't do it. Some try to do it, but mess
up. The saddest case I saw was an operator with a setup like this:

client <-> GGSN <-> terminating proxy <-> NAT <-> server

That's a pretty normal traffic flow, right?

Now, for some reason they were doing the MSS clamping on the NAT, not the
GGSN. Sometimes that might be OK. But they didn't account for the proxy...
What happened in practice was that all the communication between the client
and the proxy was at a too-high MSS of 1460 (causing GTP fragmentation) and
the communication between the proxy and the server used a too-low MSS of 1380
(causing 5% increase in packet count for no reason). Oops.
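The GTP arithmetic is easy to sanity-check. A back-of-envelope sketch,
assuming GTP-U over IPv4 with no extension headers (36 bytes of outer
overhead) and no TCP options:

```python
# Back-of-envelope check of why mobile operators clamp MSS (sketch only;
# assumes GTP-U over IPv4: 20 B outer IP + 8 B UDP + 8 B GTP-U = 36 B).
IP_TCP_HEADERS = 40        # inner IPv4 (20) + TCP (20), no options
GTP_OVERHEAD = 20 + 8 + 8  # outer IPv4 + UDP + GTP-U

def outer_packet_size(mss: int) -> int:
    """On-wire size of the GTP-encapsulated packet for a full TCP segment."""
    return mss + IP_TCP_HEADERS + GTP_OVERHEAD

# Unclamped MSS of 1460 -> 1536-byte outer packet: GTP fragmentation on a
# 1500-byte backhaul link.
assert outer_packet_size(1460) == 1536

# Clamping to 1424 would be just enough for plain GTP-U...
assert outer_packet_size(1424) == 1500

# ...so a clamp to 1344 leaves 80 bytes of headroom, which is why extra
# encapsulation (e.g. GTP+IPSec) is a plausible guess.
print(1500 - outer_packet_size(1344))
```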

~~~
majke
Problem with MSS clamping - if the other party has an MTU of 1500 but the
path has an MTU of < 1344, then what?

Well, it's broken. The ICMP headed back to the sending party will probably be
dropped, and fragments will likely not work either (like large DNS
responses).

GTP MSS clamping assumes there is no <1344 link in the path! True, this is
often the case, but it completely misses the point of "the design of the
internet". Sad times...

~~~
jsnell
> GTP MSS clamping assume there is no <1344 link in the path!

Huh? You appear to be implying that clamping the MSS is causing large packets
to be dropped, or ICMP responses to be blackholed. Neither of those is true.
It's simply the mobile network asking the endpoints not to send larger TCP
packets than that. (And since TCP has traditionally been 95%+ of the traffic,
the GTP fragmentation for large non-TCP packets isn't a big deal).

~~~
majke
Hold your horses. It's not MSS clamping per se. It's the middle box MSS
clamping. Say a middle box rewrites the SYN packet from my phone to have a
smaller MSS - from 1500 down to, say, 1344.

The other party (the server) responds with a 1500 MSS, and both parties
successfully complete the TCP handshake assuming a PMTU of 1344. Now say
there is a 1300-byte MTU link in between, and say my phone sends a large
packet.

My phone will happily transmit a 1344-byte packet - say TCP, with the DF flag
set. The intermediate router with the small MTU will stop it and return an
ICMP PMTU message.

The point I'm making: middle boxes doing clamping usually don't do much about
this ICMP. They drop it. What they should do is forward this ICMP back to my
phone, but that's not what I saw in my tests. Or maybe I just tested some
buggy middle boxes.
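The scenario can be reduced to a toy model (a sketch; header sizes assume
plain IPv4 + TCP, and the function name is made up for illustration):

```python
# Toy model of the scenario above (sketch): a middlebox clamps the MSS, but a
# narrower link sits beyond it. Whether the connection blackholes depends
# entirely on the ICMP "Packet Too Big" making it back to the phone.
IP_TCP_HEADERS = 40  # IPv4 + TCP, no options

def df_packet_survives(mss: int, link_mtu: int, icmp_forwarded: bool) -> bool:
    """True if a full-MSS, DF-flagged segment eventually gets through."""
    packet = mss + IP_TCP_HEADERS
    if packet <= link_mtu:
        return True  # fits; no PMTUD needed
    # The router drops the packet and emits ICMP "fragmentation needed".
    # If the middle box forwards it, the sender lowers its PMTU and retries;
    # if it eats it, the connection is silently blackholed.
    return icmp_forwarded

assert df_packet_survives(mss=1344, link_mtu=1300, icmp_forwarded=True)
assert not df_packet_survives(mss=1344, link_mtu=1300, icmp_forwarded=False)
```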

~~~
jsnell
Yes, like I said you're conflating MSS clamping with all kinds of other
things. That's incorrect. Remove the MSS clamping from your example, and the
outcome is exactly the same.

MSS clamping by a middlebox is a totally fine practice. Dropping ICMP packets
is a bad practice. But they have nothing to do with each other. They're
probably not even being done by the same device.

~~~
majke
Aye. Got it.

------
xxpor
Oh man, I could rant about this forever. PMTUD is very broken on the internet.
This leads to things like MSS clamping based on the route your packet will
take.

If you run an ISP, or really a network of any sort, and block ICMP
indiscriminately, your network is broken. Please stop.

------
gsich
>The last fragment will almost never have the optimal size. For large
transfers this means a significant part of the traffic will be composed of
suboptimal short datagrams - a waste of precious router resources.

Why? With large transfers, most packets will have the maximum length; only
the last will have a "suboptimal" short length.

~~~
noselasd
Say the TCP stream sends packets of 1500 bytes, an optimal size for most of
the packet's journey.

At one point the MTU is smaller, and a router has to fragment each packet in
two: say one fragment of 1260 bytes, with the last fragment carrying the
remaining 240 bytes of payload.

So a lot of segments get fragmented into one 1260-byte fragment and one small
240-byte fragment. That small fragment is far from optimal; its
overhead-to-payload ratio is a lot worse than the 1260-byte fragment's.

A much better approach would be for the endpoint to adjust the segment size
and send all packets as 1260 bytes (which is what will happen if PMTUD
works).
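Putting rough numbers on the example above (a sketch; plain 20-byte IPv4
headers assumed, no options):

```python
# Rough numbers for the example above (IPv4, 20-byte headers, sketch only).
IP_HDR = 20

# Fragmented path: each 1500-byte packet (1480 B of payload) splits into a
# 1260-byte fragment (1240 B payload) and a 260-byte fragment (240 B payload,
# since the second fragment gets its own IP header).
frag_wire = 1260 + (IP_HDR + 240)  # wire bytes per 1480 B of payload
frag_packets = 2

# PMTU-adjusted path: the endpoint sends 1260-byte packets (1240 B payload).
adjusted_wire = 1260 * 1480 / 1240    # wire bytes per 1480 B of payload
adjusted_packets = 1480 / 1240        # ~1.19 packets instead of 2

assert frag_wire == 1520
print(f"fragmented: {frag_wire} wire bytes, {frag_packets} packets")
print(f"adjusted:   {adjusted_wire:.0f} wire bytes, {adjusted_packets:.2f} packets")
```

The byte savings are modest, but the adjusted sender emits roughly 40% fewer
packets for the same payload, which is the "precious router resources" point.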

~~~
kazinator
Path MTU discovery is supposed to ferret out the fact that the TCP segment
size should be reduced.

IP fragmentation is just a best effort mechanism that is better than the
datagrams being dropped due to their excessive size; it allows connections to
be established.

You don't want to be doing bulk transfers with fragmentation.

IP fragmentation won't help with latency, because it is redundant. The TCP
window is already a form of "fragmentation". TCP already fragments the data
into segments that can be received out of order, individually re-transmitted
and re-assembled.

For large transfers over fast, high-latency networks, TCP supports window
scaling. Larger individual segments (and thus IP fragmentation) are not
required for the sake of having more data "in the air".
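The window-scaling point can be made concrete with a bandwidth-delay product
calculation (illustrative numbers, not from the thread):

```python
# Why window scaling, not bigger packets, keeps a fast high-latency path
# full (illustrative numbers).
bandwidth_bps = 100_000_000   # 100 Mbit/s
rtt_s = 0.1                   # 100 ms round trip

# Bandwidth-delay product: bytes that must be in flight to fill the pipe.
bdp_bytes = bandwidth_bps / 8 * rtt_s
assert bdp_bytes == 1_250_000

# The classic 16-bit TCP window field tops out at 64 KiB; window scaling
# (RFC 1323) lets a receiver advertise far more. Packet size never enters
# into it.
classic_window = 64 * 1024
print(bdp_bytes / classic_window)  # pipe is ~19x what an unscaled window covers
```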

~~~
Borealid
This neglects the fact that IP imposes a fixed per-packet overhead. More
packets means lower efficiency.

Using larger packets reduces the overhead as a percentage of traffic, so is
desirable regardless of the TCP window size.

Higher efficiency means higher bandwidth, which means lower latency for small
transfers. Waiting for two 10-byte packets yields latency equal to the higher
of the two latencies. Waiting for one 20-byte packet is, on average, lower
latency (because you take one sample instead of worst-of-two).
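The worst-of-two claim is easy to sanity-check numerically; a sketch using
uniformly distributed latencies (a distribution chosen purely for
illustration):

```python
# Quick check of the "worst-of-two" latency claim (sketch): waiting for two
# independent packets completes at the max of two latency samples, whose
# mean exceeds a single sample's mean.
import random

random.seed(42)
N = 100_000
one = [random.random() for _ in range(N)]                       # one packet
two = [max(random.random(), random.random()) for _ in range(N)]  # wait for both

mean_one = sum(one) / N  # E[X] = 1/2 for Uniform(0, 1)
mean_two = sum(two) / N  # E[max(X, Y)] = 2/3
assert mean_two > mean_one
```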

~~~
kazinator
Of course, we want to be using the largest packets that the underlying network
allows.

My comment is about using larger packets specifically _while relying on
fragmentation_ to deliver them. Fragmentation re-introduces the overhead since
fragments carry IP headers.

If a protocol already segments data into datagrams and reassembles them into a
stream, it derives no advantage from IP fragmentation.

IP fragmentation basically provides the possibility of operating in the face
of misconfiguration (hopefully temporarily), with reduced performance (better
than no performance at all).

(Of course, I'm not saying that because TCP chops the data into pieces and
has a sliding window that can scale, there is therefore no downside to making
the pieces small just for the heck of it - like 50 bytes of payload!)

------
p1mrx
Instead of "never fragment in-transit", I think IPv6 should've put a single
"truncated" bit in the header. When a packet hits a link that's too small, the
router chops off the end and sets the bit. Let the endpoints figure out what
to do with that information.

------
colemannugent
My question after reading the article is how would we go about addressing
this?

It seems that IPv4 (and v6 to some extent) allow for MTUs of up to 64KiB. Is
the only thing limiting the selection of higher Path MTUs the fact that most
Ethernet implementations have a MTU of 1500 bytes?

Do network hardware companies sell devices that tolerate higher MTUs? Is this
a hardware limitation, or is it just unwillingness to go past 1500 bytes
because that's the IEEE standard?

~~~
majke
This is a subject for another blog post. Or a book!

This problem was extensively discussed in the early days of IP. The
alternative to the current model is to have a fixed packet size. This
inherently means that for some media the packet size will not be optimal.

The current model in IPv4 is reasonable - that is, a dynamic packet size,
with either routers doing fragmentation or a signaling mechanism pushing the
PMTU back onto the originating host.

The problem is that there is very little visibility into whether these
mechanisms work. The linked online tests are an attempt to fix this - by
easily showing everyone whether their CPE / ISP / path is reasonable and
behaves correctly.

Testing fragmentation / ICMP PMTUD delivery has historically been pretty
hard. We need more visibility.

~~~
bogomipz
>"The alternative to current model is to have a fixed packet size"

This was already done with ATM for L2 and the 53 byte fixed cell size. You
don't hear about ATM any more.

------
mino
Here is the relevant talk that Geoff Huston gave last week at NANOG71:

[https://www.youtube.com/watch?v=P4xH4MYagFE](https://www.youtube.com/watch?v=P4xH4MYagFE)

Very interesting and, as usual, well presented. The picture it draws for UDP
(i.e., DNS) fragmentation over IPv6 is very bleak. Basically: "it's not
working".

------
teddyh
Some argue that PLPMTU should be made the default in Linux:

[https://lists.debian.org/debian-
ipv6/2017/09/msg00000.html](https://lists.debian.org/debian-
ipv6/2017/09/msg00000.html)

~~~
majke
My favorite: net.ipv4.tcp_mtu_probing=2 - MTU probing / RFC 4821!

I spoke about it here:

[https://blog.cloudflare.com/path-mtu-discovery-in-
practice/](https://blog.cloudflare.com/path-mtu-discovery-in-practice/)
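For reference, turning it on looks something like this (a sketch; the
tcp_base_mss value is an example, not a recommendation):

```shell
# Sketch: enable RFC 4821 packetization-layer PMTU probing on Linux.
# 1 = probe only after a blackhole is detected; 2 = always probe.
sysctl -w net.ipv4.tcp_mtu_probing=2
# MSS to start probing from when a blackhole is suspected (example value):
sysctl -w net.ipv4.tcp_base_mss=1024
```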

There are a couple of issues. In old kernels it caused a "WARNING" with a
weird stack trace. I'm quite sure it was fixed before 4.8, but I'm not sure
if it's fixed in 4.4; 3.18 is definitely buggy. Very hard to reproduce.

Second, it can cause connection stalls. If the kernel sees a dropped packet,
it might get confused and, instead of just resending it, go into MTU probing
mode. This will effectively stall the connection for a while.

Finally, for long-running connections, packet loss has a high chance of being
misdiagnosed as an MTU blackhole, so the Path MTU has an increased chance of
being reduced over time. MTU probing has no mechanism to _increase_ the Path
MTU, only to decrease it. This can degrade the performance of long-running
connections.

Recap:

\- don't use MTU probing if you have long-running connections

\- it might cause connection stalls

More reading on the subject (suggested by a friend, hi Jari!):

[https://www.nlnetlabs.nl/downloads/publications/pmtu-
black-h...](https://www.nlnetlabs.nl/downloads/publications/pmtu-black-holes-
msc-thesis.pdf)

------
whatupmd
I work at an ISP and saw this problem yesterday...

This is a good read, with a lot of research, describing a common problem. I
will bookmark it, thanks for the share.

