Hacker News
Broken packets: IP fragmentation is flawed (cloudflare.com)
144 points by majke 10 months ago | 49 comments

While the designers of TCP/IP got an amazing amount of things right, one thing they missed rather badly was the need to take the incentives and motivations of the people participating in the network into account. I don't blame them for that. At the time the internet was an academic pursuit, so the idea of designing a protocol to resist breakage by self-interested commercial ISPs simply wasn't something that was on people's minds.

Taking IP fragmentation as a particular example: The protocols should be designed such that inability to perform PMTU discovery prevents any connections being established at all.

Designing protocols which will be robust in the face of human actors who may-or-may-not be entirely interested in doing things the intended way is incredibly difficult. To make things worse, once you make a mistake it is practically impossible to fix. Using fragmentation as an example again: Now that middle boxes which break PMTU discovery are established, it would be impractical to introduce a "fail unconditionally" protocol. People would simply blame the products which introduced such a protocol for being broken. You have to get it right from the start so that bad actors get caught out before their broken products can be deployed.

"Taking IP fragmentation as a particular example: The protocols should be designed such that inability to perform PMTU discovery prevents any connections being established at all."

You seem to be saying "I'm right and you are wrong" here, and I would suggest that is ill advised - TCP/IP based comms generally work despite ill-advised configuration. You can bin ICMP on the WAN and yet an OpenVPN connection still works, for example.

"Designing protocols which will be robust" - see Internet: all of it. I suggest that the current internet protocols are a shining example of how to do engineering properly.

The internet still works without ICMP, until it doesn't. Worse, when it does fail you have no recourse because the network has accumulated a pile of hacks to get 99% of people's use cases working and network operators are not inclined to spend the energy to fix their networks just for you. And so you layer another hack onto the pile and move on, as does everyone else. Not only is the time spent layering hacks on hacks significant, the result is a network which is much less useful because only the most common usage patterns can be expected to work. Want to do something novel? Good luck.

See also: https://tools.ietf.org/html/draft-thomson-postel-was-wrong-0...

You know BGP was written on the back of a napkin, right? And no, I am not joking.

I would also submit that BGP is not a shining example of well-done engineering.

That something was written on a napkin doesn't mean anything without context.

It could mean someone made something up off the cuff, or it could mean they'd been thinking about it for months and were finally inspired during a meal and didn't want to lose the idea.

Or that it started on a napkin and was refined and fleshed out over time.

Check out the online ICMP / Fragmentation checker:

IPv4 version: http://icmpcheck.popcount.org

IPv6 version: http://icmpcheckv6.popcount.org

If you see red - this means your ISP / cable modem is not really internet happy.

Funny thing: the ISP I use fails the ICMP PMTU message delivery test on the first refresh, but succeeds on the second. What happens? The ICMP is delivered to a middle box, which then FRAGMENTS MY TCP PACKETS. In other words, the ICMP is stopped at a middle box, which ignores my packets' DF flag and fragments them anyway, without my system even knowing. On the second refresh the middle box remembers the Path MTU and reduces the MSS on the SYN packet. This is remembered for about 15 minutes. Middle boxes suck.

Or it means that ICMP traffic is not working. Misconfigured networks are incredibly common. There is a massive misconception that ICMP is not needed or should be blocked for security.

For example, you can google "should I open/block ICMP" and none of the Stack Overflow answers will tell you the right thing to do.

Indeed, and those who like to blindly drop all ICMP are gonna be hurting when it comes time to deploy IPv6.

As noted in RFC4890 [0], "ICMPv6 is essential to the functioning of IPv6 ... Filtering strategies designed for the corresponding protocol, ICMP, in IPv4 networks are not directly applicable."

So, if you currently just drop all ICMP at the edge, you might as well get used to doing it the right way now and save yourself some trouble in the future.

[0]: https://www.ietf.org/rfc/rfc4890.txt
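To make RFC4890's advice concrete, here is a hedged sketch (type numbers are from RFC 4443 and RFC 4861; this is a condensed reading, not the RFC's full recommendation list) of the ICMPv6 message types that should never be filtered:

```python
# A condensed, simplified reading of RFC 4890: ICMPv6 types that must
# not be dropped at the edge, because IPv6 itself depends on them.
ESSENTIAL_ICMPV6 = {
    1: "Destination Unreachable",
    2: "Packet Too Big",          # without this, PMTUD is dead on arrival
    3: "Time Exceeded",
    4: "Parameter Problem",
    133: "Router Solicitation",   # NDP types: IPv6 has no ARP to fall back on
    134: "Router Advertisement",
    135: "Neighbor Solicitation",
    136: "Neighbor Advertisement",
}

print(sorted(ESSENTIAL_ICMPV6))
```

Note how "Packet Too Big" (type 2) sits in the must-pass list: dropping it is exactly the PMTUD breakage this thread is about.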

> These often completely ignore ICMP and perform very aggressive connection rewriting. For example Orange Polska not only ignores inbound "Packet too big" ICMP messages, but also rewrites the connection state and clamps the MSS to a non-negotiable 1344 bytes.

Most mobile operators will do MSS clamping. The IP packets get encapsulated in GTP. Having to fragment the GTP packets would suck. (Now, you don't need to clamp to 1344 just to account for GTP. So they have something else going on as well, perhaps GTP+IPSec).
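The GTP overhead arithmetic can be sketched as follows (header sizes are the standard ones for IPv4/UDP/GTP-U without options; the 1344 figure is the one from the article):

```python
# Back-of-envelope: why GTP encapsulation forces MSS clamping.
OUTER_MTU = 1500           # typical Ethernet MTU on the transport network
GTP_OVERHEAD = 20 + 8 + 8  # outer IPv4 + UDP + GTP-U header
IP_TCP_HEADERS = 20 + 20   # inner IPv4 + TCP headers (no options)

inner_mtu = OUTER_MTU - GTP_OVERHEAD       # largest inner IP packet that avoids GTP fragmentation
gtp_safe_mss = inner_mtu - IP_TCP_HEADERS  # MSS that keeps TCP segments under that limit

print(inner_mtu, gtp_safe_mss)   # 1464 1424
print(gtp_safe_mss - 1344)       # 80: slack below what plain GTP needs
```

Plain GTP only requires clamping to about 1424, so a clamp of 1344 leaves 80 bytes unexplained, which is consistent with the guess above that something else (e.g. IPsec) is also on the path.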

Of course, "most" isn't "all". Some don't do it. Some try to do it, but mess up. The saddest case I saw was an operator with a setup like this:

client <-> GGSN <-> terminating proxy <-> NAT <-> server

That's a pretty normal traffic flow, right?

Now, for some reason they were doing the MSS clamping on the NAT, not the GGSN. Sometimes that might be OK. But they didn't account for the proxy... What happened in practice was that all the communication between the client and the proxy was at a too-high MSS of 1460 (causing GTP fragmentation) and the communication between the proxy and the server used a too-low MSS of 1380 (causing 5% increase in packet count for no reason). Oops.

Problem with MSS clamping - if the other party has an MTU of 1500 but the path has an MTU of < 1344, then what?

Well, it's broken. The sending party behind the MSS clamping will probably have the ICMP dropped, and the fragments will likely not work either (like large DNS responses).

GTP MSS clamping assumes there is no <1344 link in the path! True, this is often the case, but it completely misses the point of "the design of the internet". Sad times...

> GTP MSS clamping assumes there is no <1344 link in the path!

Huh? You appear to be implying that clamping the MSS is causing large packets to be dropped, or ICMP responses to be blackholed. Neither of those is true. It's simply the mobile network asking the endpoints not to send larger TCP packets than that. (And since TCP has traditionally been 95%+ of the traffic, the GTP fragmentation for large non-TCP packets isn't a big deal).

Hold your horses. It's not MSS clamping per se, it's a middle box doing the MSS clamping. Say a middle box rewrites the SYN packet from my phone to have a smaller MSS - from 1500 down to, say, 1344.

The other party (server) responds with a 1500 MSS, and both parties complete the TCP handshake assuming a PMTU of 1344. Now say there is a 1300 link in between, and my phone sends a large packet.

My phone will happily transmit a 1344 byte packet - say TCP, with the DF flag set. The intermediate router with the small MTU will stop it and return an ICMP PMTU message.

The point I'm making: middle boxes doing clamping usually don't do much about this ICMP. They drop it. What they should do is forward this ICMP back to my phone, but that's not what I saw in my tests. Or maybe I just tested some buggy middle boxes.

Yes, like I said you're conflating MSS clamping with all kinds of other things. That's incorrect. Remove the MSS clamping from your example, and the outcome is exactly the same.

MSS clamping by a middlebox is a totally fine practice. Dropping ICMP packets is a bad practice. But they have nothing to do with each other. They're probably not even being done by the same device.

Aye. Got it.

Oh man, I could rant about this forever. PMTUD is very broken on the internet. This leads to things like MSS clamping based on the route your packet will take.

If you run an ISP, or really a network of any sort, and block ICMP indiscriminately, your network is broken. Please stop.

>The last fragment will almost never have the optimal size. For large transfers this means a significant part of the traffic will be composed of suboptimal short datagrams - a waste of precious router resources.

Why? With large transfers, most packets will have the maximum length; only the last will have a "suboptimal" short length.

You're thinking of streams. The fragmentation discussed here is of packets.

Recent equipment might be sending packets of 9000 bytes. When you cross over into older networks that gets split into 6 or 7 packets, so the length of the last packet is a problem, but it's only about 15% of the traffic.

But going from older networks into quirky ones (wireless, modems, or ancient systems) will split your ~1500 byte packet into two. Like 1344 and 166 or 900 and 600.

Even recent and not-so-recent equipment that supports jumbo frames does not have them enabled by default. Large-MTU hosts (or specific interfaces with a large MTU) should be in an isolated network where all hosts have that large MTU. Best practice is not to mix MTU sizes on hosts that should talk to each other, and never to go over 1500 for hosts/interfaces that need to talk to the internet.

Interesting. I haven't participated in a network architecture at the hardware level for a bit so I'm out of the loop.

Of course, this whole article is about the differences between what should be and what you encounter in the wild.

But if I have jumbo packets in my data center and a reverse proxy at the edge, I either have an application layer creating fragments, or I have another, lesser problem with packet sizes.

If I don't see fragmentation, I will instead see some jitter in packet transmission due to buffering. Many workloads wouldn't even notice but some care. The proxy will end up holding onto the last partial packet until the next one is processed. If that packet has a message boundary in it then the jitter turns into latency from the perspective of the destination system.

Say the TCP stream sends packets of 1500 bytes, an optimal size for most of the packet's journey.

At one point the MTU is smaller, and a router has to fragment the packet in two: say one fragment of 1260 bytes, and a last fragment containing the remaining 240 bytes.

So a lot of segments get fragmented into one 1260 byte fragment and one 240 byte fragment. That 240 byte fragment is far from optimal: its overhead-to-payload ratio is a lot worse than the 1260 byte fragment's.

A much more optimal approach would be for the endpoint to adjust the segment size and send all packets as 1260 bytes (which is what will happen if PMTUD works).
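The arithmetic above can be sketched in Python (a toy model: IPv4 fragment payloads, except the last, must be multiples of 8 bytes, and a 20-byte header without options is assumed):

```python
MTU, IP_HEADER = 1260, 20
packet = 1500  # the original 1500-byte packet

# Fragment payloads (except the last) must be multiples of 8 bytes,
# so the first fragment carries slightly less than MTU - header:
first_payload = (MTU - IP_HEADER) // 8 * 8           # 1240 bytes of payload
last_payload = (packet - IP_HEADER) - first_payload  # 240 bytes left over

# On-the-wire sizes: each fragment repeats the 20-byte IP header.
print(first_payload + IP_HEADER, last_payload + IP_HEADER)  # 1260 260
```

This matches the 1260/240 split above: the 240 leftover payload bytes go out as a 260-byte packet, since the IP header is duplicated on every fragment.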

Path MTU discovery is supposed to ferret out the fact that the TCP segment size should be reduced.

IP fragmentation is just a best effort mechanism that is better than the datagrams being dropped due to their excessive size; it allows connections to be established.

You don't want to be doing bulk transfers with fragmentation.

IP fragmentation won't help with latency, because it is redundant. The TCP window is already a form of "fragmentation". TCP already fragments the data into segments that can be received out of order, individually re-transmitted and re-assembled.

For large transfers over fast, high latency networks, TCP supports window scaling. Larger individual segments (and thus IP fragmentation) are not required for the sake of having more data "in the air".

This neglects the fact that IP imposes a fixed per-packet overhead. More packets means lower efficiency.

Using larger packets reduces the overhead as a percentage of traffic, so is desirable regardless of the TCP window size.

Higher efficiency means higher bandwidth, which means lower latency for small transfers. Waiting for two 10-byte packets yields latency equal to the worse of the two; waiting for one 20-byte packet is, on average, lower latency (because you take one sample instead of the worst of two).

Of course, we want to be using the largest packets that the underlying network allows.

My comment is about using larger packets specifically while relying on fragmentation to deliver them. Fragmentation re-introduces the overhead since fragments carry IP headers.

If a protocol already segments data into datagrams and reassembles them into a stream, it derives no advantage from IP fragmentation.

IP fragmentation basically provides the possibility of operating in the face of misconfiguration (hopefully temporarily), with reduced performance (better than no performance at all).

(Of course I'm not saying that since TCP chops the data into pieces and has a sliding window that can scale, there is no downside to making the pieces small just for the heck of it, like 50 bytes of payload!)

Why does it matter how efficient each individual packet is, in terms of overhead vs payload? If I send N bytes in M packets the total transfer overhead and efficiency are the same regardless of how uniform the packet sizes are.

The only problem I see is that the larger fragments are more likely to get refragmented, while uniform fragments are more likely to pass through subsequent MTU bottlenecks.

I don't quite understand what you are trying to convey. If you send 100 packets with a 10 byte payload (or TCP segments, if you will), I hope it is obvious that there is more overhead (the Ethernet/IP/TCP headers of 100 packets) compared to sending 1 packet with a 1000 byte payload.
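A quick back-of-envelope, assuming the usual 14-byte Ethernet + 20-byte IPv4 + 20-byte TCP headers (no options):

```python
import math

HEADERS = 14 + 20 + 20  # Ethernet + IPv4 + TCP header bytes per packet

def header_bytes(total_payload, payload_per_packet):
    """Total header overhead needed to move total_payload bytes."""
    packets = math.ceil(total_payload / payload_per_packet)
    return packets * HEADERS

print(header_bytes(1000, 10))    # 100 packets -> 5400 bytes of headers
print(header_bytes(1000, 1000))  #   1 packet  ->   54 bytes of headers
```

Same 1000 bytes of payload, a 100x difference in header overhead - which is the resource-usage point being made here.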

Normally this matters, as it is desirable for systems to use fewer resources, not more (e.g. we want to lessen the load on the network, and we want applications to have low latency).

I believe the context is where something in the middle is fragmenting almost every packet. Like if you don't negotiate path MTU.

So "last" meaning the last fragment of every packet sent, not just the literal last packet.

Say 3600 bytes sent with an initial mtu size of 1500, but frags needed at 1200. At least 6 packets sent, versus 3 packets if you used an initial size of 1200.

Add in the headers, frag needed packets, etc, and you've sent not only more packets, but more bytes.

> For large transfers

The implicit assumption is that a fragmented packet will be slightly more than the MTU. So for a large transfer - say N packets of size MTU + 1 - you will see N "first fragments" fragmented to the size of the MTU, and another N last fragments of size 1 (actually IP header + 1).

In other words, 50% of the packets flowing will carry a payload of 1 byte, which is a super bad use of router switching resources.
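A toy illustration of that worst case, assuming IPv4's rule that fragment payloads (except the last) are multiples of 8 bytes and a 20-byte header without options:

```python
def fragments(packet_len, mtu, ip_header=20):
    """On-the-wire sizes of the IPv4 fragments of a packet_len-byte packet
    crossing a link with the given MTU (offsets count in 8-byte units)."""
    payload = packet_len - ip_header
    step = (mtu - ip_header) // 8 * 8  # max payload per fragment
    sizes = []
    while payload > 0:
        chunk = min(payload, step)
        sizes.append(chunk + ip_header)  # each fragment repeats the header
        payload -= chunk
    return sizes

# Worst case from the comment: every packet is one byte over the MTU,
# so half of all packets on the wire carry a single byte of payload.
print(fragments(1501, 1500))  # -> [1500, 21]
```

One byte of useful data costs a full 21-byte packet (and a full routing decision) for every large packet sent.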

But wouldn't that be multiple transfers?

By "transfer" here I mean a thing like a curl on a large file.

It would even apply to a single socket send() of say, a 4k buffer, which is pretty common.

Broken MTU negotiation could easily result in twice as many packets as needed in that scenario.

Twice as many packets per reduced MTU hop. The worst case is pretty bad and far north of 2x the number of packets sent from origin.

The worst case is that the routers support N, N-1, N-2, N-3, which means one additional packet per hop.

I know what you are thinking: "well what if it's N/2, N/4, N/8 etc". And the answer is that in such a network you are screwed anyway. Lots of small packets would be unavoidable.

"With large transfers, the most packets will have the maximum length"

There is no universal "maximum length" among the links in a route. Some link will have a smaller maximum forcing the previous link to fragment nearly every packet; instant 2x (or more) packet overhead. "Large" transfers exacerbate this as nearly every packet is max size when it leaves the origin host.

The fragmentation is typically from a 1500 byte packet to a slightly smaller value, so you end up with one 1400 byte packet and one 100 byte packet.

The fragmentation happens after your computer has packetised the data.

Instead of "never fragment in-transit", I think IPv6 should've put a single "truncated" bit in the header. When a packet hits a link that's too small, the router chops off the end and sets the bit. Let the endpoints figure out what to do with that information.

My question after reading the article is how would we go about addressing this?

It seems that IPv4 (and v6 to some extent) allow for MTUs of up to 64KiB. Is the only thing limiting the selection of higher Path MTUs the fact that most Ethernet implementations have a MTU of 1500 bytes?

Do network hardware companies sell devices that tolerate higher MTUs? Is this a hardware limitation, or is it just unwillingness to go past 1500 bytes because that's the IEEE standard?
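For reference, the 64 KiB ceiling falls straight out of the header format - the IPv4 Total Length field is 16 bits (IPv6's Payload Length is likewise 16 bits, with a separate jumbogram option for more). A trivial sketch:

```python
# The IPv4 Total Length field is 16 bits, so no IP packet (and hence no
# usable MTU) can exceed 65535 bytes, header included.
MAX_IPV4_PACKET = 2**16 - 1
print(MAX_IPV4_PACKET)           # 65535

# Standard Ethernet, by contrast, stops at 1500 - a ~43x gap:
print(MAX_IPV4_PACKET // 1500)   # 43
```

So the protocol allows far larger packets than the dominant link layer ever carries, which is why the 1500-byte Ethernet MTU is the practical bottleneck.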

This is a subject for another blog post. Or a book!

This problem was extensively discussed in the early IP days. The alternative to the current model is to have a fixed packet size. This inherently means that for some media the packet size will not be optimal.

The current model in IPv4 is reasonable - that is: dynamic packet size, with routers either doing fragmentation or using a signaling mechanism to push the PMTU back to the originating host.

The problem is that there is very little visibility into whether these mechanisms work. The linked online tests are an attempt to fix this - by easily showing everyone whether their CPE / ISP / path is reasonable and behaves correctly.

Testing fragmentation / ICMP PMTUD delivery has historically been pretty hard. We need more visibility.

>"The alternative to current model is to have a fixed packet size"

This was already done with ATM for L2 and the 53 byte fixed cell size. You don't hear about ATM any more.

Sure, lots of hardware supports jumbo frames, which are 9K bytes in length. Ethernet, however, uses a 32 bit CRC to detect bit inversions, and this CRC is only reliable up to around 12K bytes.

That doesn't sound right. Do you have a citation for the ethernet CRC becoming unreliable over 12K?

Sure, see the following


11. Appendix 2. Comments from the draft's authors:

FDDI and Ethernet use the same error checking mechanism; CRC-32. The probability of undetected errors remains constant for frame sizes between 3007 and 91639 bits (approximately 376 to 11455 bytes). Setting the maximum size of jumbo frames to 9018 bytes falls well within this range. There is no increase in undetected errors when using jumbo frames and the existing CRC-32 as the error detection mechanism.

This is also worth a read:


>The strength of the Ethernet CRC checksum and the 16 bit Transport checksum has been found to reduce for data segments that are larger than the standard Ethernet MTU. Koopman et al. [Koopman] have explored a number of CRC polynomials as well as the polynomial used in the Ethernet CRC calculation.

And the following on page 32:


>"Due to the nature of the CRC algorithm, the probability of undetected errors is the same for frame sizes between 376 and 11,455 bytes. Thus to maintain the same bit error rate accuracy as standard Ethernet, frames should ideally not exceed 11455 bytes."
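As a sanity check on what the CRC does and doesn't promise: any CRC detects every single-bit error at any frame length; the ~11455-byte figure is about where the multi-bit guarantee (the Hamming distance) degrades. A quick sketch using Python's zlib, which implements the same CRC-32 polynomial as Ethernet:

```python
import zlib

frame = bytes(range(256)) * 36  # a 9216-byte jumbo-ish payload
crc = zlib.crc32(frame)

# Flip a single bit: the CRC always changes, regardless of frame size.
# The 11,455-byte limit concerns bursts of multiple bit errors, not this.
corrupted = bytearray(frame)
corrupted[5000] ^= 0x01
assert zlib.crc32(bytes(corrupted)) != crc
print("single-bit error detected")
```

In other words, going past ~11455 bytes doesn't make the CRC useless; it lowers the class of multi-bit errors it is guaranteed to catch.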

Interesting; thanks! (Also thanks to cesarb)

I did a quick search, and found one. The page at https://web.archive.org/web/20060918065212/www.aarnet.edu.au... says:

"The weakening of this frame check which greater frame sizes is explored in R. Jain's "Error Characteristics of Fiber Distributed Data Interface (FDDI)", which appeared in IEEE Transactions on Communications, August 1990. [...] Firstly, the power of ethernet's Frame Check Sequence is the major limitation on increasing the ethernet MTU beyond 11444 bytes. Secondly, frame sizes under 11445 bytes are as well protected by ethernet's Frame Check Sequence as frame sizes under 1518 bytes."

Another page at https://web.archive.org/web/20070426062813/http://sd.wareone... explains why 9000 bytes is commonly chosen:

"Why 9000? First because ethernet uses a 32 bit CRC that loses its effectiveness above about 12000 bytes. And secondly, 9000 was large enough to carry an 8 KB application datagram (e.g. NFS) plus packet header overhead."

(The page doesn't mention it, but I guess only multiples of 1500 were considered, and 9000 is the smallest multiple of 1500 greater than 8192.)

Other links:

"Raising the Internet MTU": https://web.archive.org/web/20070427184111/www.psc.edu/~math...

"Thoughts on increasing MTUs on the internet" (NANOG thread from 2007): https://www.nanog.org/mailinglist/mailarchives/old_archive/2...

Edit: found the source of the "9000 bytes" de facto standard: https://noc.net.internet2.edu/i2network/jumbo-frames/rrsum-a...

Most devices (at least most non-consumer devices) support Ethernet frames of up to about 9000 bytes - aka jumbo frames.

Here is the relevant talk that Geoff Huston gave last week at NANOG71:


Very interesting and, as usual, well presented. The picture it draws for UDP (i.e., DNS) fragmentation over IPv6 is very bleak. Basically: "it's not working".

Some argue that PLPMTU should be made the default in Linux:


My fav net.ipv4.tcp_mtu_probing=2 MTU probing / RFC4821!

I spoke about it here:


There are a couple of issues. In old kernels it caused a "WARNING" with a weird stack trace. I'm quite sure it was fixed before 4.8, but I'm not sure if it's fixed in 4.4. 3.18 is definitely buggy. Very hard to reproduce.

Second, it can cause stalls on connections. If the kernel sees a dropped packet, it might get confused and, instead of just resending it, go into MTU probing mode. This effectively stalls the connection for a while.

Finally, for long-running connections, packet loss has a high chance of being misdiagnosed as an MTU blackhole, so the Path MTU has an increased chance of being reduced over time. MTU probing has no mechanism to _increase_ the Path MTU, only to decrease it. This might degrade the performance of long-running connections.


- don't use MTU probing if you have long running connections

- it might cause connection stalls

More reading on the subject (suggested by a friend, hi Jari!):


I work at an ISP and saw this problem yesterday...

This is a good read, with a lot of research, describing a common problem. I will bookmark it, thanks for the share.
