Taking IP fragmentation as a particular example: The protocols should be designed such that inability to perform PMTU discovery prevents any connections being established at all.
Designing protocols which will be robust in the face of human actors who may-or-may-not be entirely interested in doing things the intended way is incredibly difficult. To make things worse, once you make a mistake it is practically impossible to fix. Using fragmentation as an example again: Now that middle boxes which break PMTU discovery are established, it would be impractical to introduce a "fail unconditionally" protocol. People would simply blame the products which introduced such a protocol for being broken. You have to get it right from the start so that bad actors get caught out before their broken products can be deployed.
You seem to be saying: "I'm right and you are wrong" here and I would suggest that is ill advised - TCP/IP based comms generally work despite ill advised configuration. You can bin ICMP on WAN and yet an OpenVPN connection still works, for example.
"Designing protocols which will be robust" - see Internet: all of it. I suggest that the current internet protocols are a shining example of how to do engineering properly.
See also: https://tools.ietf.org/html/draft-thomson-postel-was-wrong-0...
I would also submit that BGP is not a shining example of well-done engineering.
It could mean someone made something up off the cuff, or it could mean they'd been thinking about it for months and were finally inspired during a meal and didn't want to lose the idea.
IPv4 version: http://icmpcheck.popcount.org
IPv6 version: http://icmpcheckv6.popcount.org
If you see red - this means your ISP / cable modem is not really internet happy.
Funny thing: the ISP I use fails the ICMP PMTU message delivery test on the first refresh, but succeeds on the second. What happens? The ICMP is delivered to a middle box, which then FRAGMENTS MY TCP PACKETS. In other words, the ICMP is stopped at a middle box, which ignores the DF flag on my packets and fragments them anyway, without my system even knowing. On the second refresh the middle box has remembered the Path MTU, so it reduces the MSS on the SYN packet. This is remembered for about 15 minutes. Middle boxes suck.
For example, you can google "should i open/block ICMP" and none of the Stack Overflow answers will tell you what the right thing to do is.
As noted in RFC4890, "ICMPv6 is essential to the functioning of IPv6 ... Filtering strategies designed for the corresponding protocol, ICMP, in IPv4 networks are not directly applicable."
So, if you currently just drop all ICMP at the edge, you might as well get used to doing it the right way now and save yourself some trouble in the future.
If you run an ISP, or really a network of any sort, and block ICMP indiscriminately, your network is broken. Please stop.
Why? With large transfers, most packets will have the maximum length; only the last will have a "suboptimal" short length.
Recent equipment might be sending packets of 9000 bytes. When you cross over into older networks that gets split into 6 or 7 packets, so the length of the last packet is a problem but it's 15% of the traffic.
But going from older networks into quirky ones (wireless, modems, or ancient systems) will split your ~1500 byte packet into two. Like 1344 and 166 or 900 and 600.
Of course, this whole article is about the differences between what should be and what you encounter in the wild.
But if I have jumbo packets in my data center and a reverse proxy at the edge, I either have an application layer creating fragments, or I have another, lesser problem with packet sizes.
If I don't see fragmentation, I will instead see some jitter in packet transmission due to buffering. Many workloads wouldn't even notice but some care. The proxy will end up holding onto the last partial packet until the next one is processed. If that packet has a message boundary in it then the jitter turns into latency from the perspective of the destination system.
At some point the MTU is smaller, and a router has to fragment it in two: say, one fragment of 1260 bytes and a last fragment containing the remaining 240 bytes.
So, a lot of segments get fragmented into one 1260 byte fragment and one 240 byte fragment. That 240 byte fragment is far from optimal; the overhead to payload ratio is a lot worse than for the 1260 byte fragment.
A much more optimal approach would be for the endpoint to adjust the segment size, and send all packets as 1260 bytes. (which is what will happen if PMTU works)
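To put rough numbers on the 1260/240 example, here is a minimal Python sketch. It assumes IPv4 with a 20-byte header and no options, and the function names are made up for illustration. Note that the 240 leftover payload bytes travel in a 260-byte fragment, because every fragment carries its own IP header.

```python
IP_HEADER = 20  # bytes; IPv4 header without options (assumption)


def fragment_sizes(packet_len, mtu, header=IP_HEADER):
    """Return the on-the-wire sizes of the IPv4 fragments of one packet."""
    if packet_len <= mtu:
        return [packet_len]
    payload = packet_len - header
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    max_chunk = (mtu - header) // 8 * 8
    sizes = []
    while payload > 0:
        chunk = min(max_chunk, payload)
        sizes.append(chunk + header)
        payload -= chunk
    return sizes


def header_overhead(size, header=IP_HEADER):
    """Fraction of a fragment that is IP header rather than payload."""
    return header / size


frags = fragment_sizes(1500, 1260)
print(frags)
for size in frags:
    print(size, f"{header_overhead(size):.1%}")
```

Under these assumptions the small fragment spends several times more of its bytes on headers than the large one, which is the inefficiency described above.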
IP fragmentation is just a best effort mechanism that is better than the datagrams being dropped due to their excessive size; it allows connections to be established.
You don't want to be doing bulk transfers with fragmentation.
IP fragmentation won't help with latency, because it is redundant. The TCP window is already a form of "fragmentation". TCP already fragments the data into segments that can be received out of order, individually re-transmitted and re-assembled.
For large transfers over fast, high latency networks, TCP supports window scaling. Larger individual segments (and thus IP fragmentation) are not required for the sake of having more data "in the air".
Using larger packets reduces the overhead as a percentage of traffic, so is desirable regardless of the TCP window size.
Higher efficiency means higher bandwidth, which means lower latency for small transfers. Waiting for two 10-byte packets yields latency equal to the higher of the two latencies. Waiting for one 20-byte packet is, on average, lower latency (because you take one sample instead of worst-of-two).
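The worst-of-two argument can be checked with a toy calculation. This is only a sketch: the three latency values are invented, each is assumed equally likely, and the two packets are assumed to be delayed independently.

```python
from itertools import product
from statistics import mean

# Hypothetical one-way latencies in ms, each equally likely for any packet.
latencies = [10, 20, 30]

# One packet: the expected wait is just the average latency.
one_packet = mean(latencies)

# Two packets: the receiver has to wait for the slower of the two.
two_packets = mean(max(a, b) for a, b in product(latencies, repeat=2))

print(one_packet, round(two_packets, 2))
```

With these made-up numbers, waiting for two packets takes longer on average than waiting for one, even though each individual packet has the same latency distribution.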
My comment is about using larger packets specifically while relying on fragmentation to deliver them. Fragmentation re-introduces the overhead since fragments carry IP headers.
If a protocol already segments data into datagrams and reassembles them into a stream, it derives no advantage from IP fragmentation.
IP fragmentation basically provides the possibility of operating in the face of misconfiguration (hopefully temporarily), with reduced performance (better than no performance at all).
(Of course I'm not saying that since TCP chops the data into pieces and has a sliding window that can scale, there is no downside to making the pieces small just for the heck of it, like 50 bytes of payload!)
The only problem I see is that the larger fragments are more likely to get refragmented, while uniform fragments are more likely to pass through subsequent MTU bottlenecks.
Normally this matters as it is desirable for systems to use fewer resources, not more. (e.g. we'd want to lessen the load on the network, we'd want applications to have low latency)
So "last" meaning the last fragment of every packet sent, not just the literal last packet.
Say 3600 bytes sent with an initial mtu size of 1500, but frags needed at 1200. At least 6 packets sent, versus 3 packets if you used an initial size of 1200.
Add in the headers, frag needed packets, etc, and you've sent not only more packets, but more bytes.
The implicit assumption is that a fragmented packet will be slightly more than MTU. So for a large transfer - say N packets, of packet size MTU + 1 - you will see N "first fragments" fragmented to the size of MTU, and another N last fragments of size 1. (actually IP header + 1)
In other words 50% of the packets flowing will have size of 1 which is super bad use of router switching resources.
Broken MTU negotiation could easily result in twice as many packets as needed in that scenario.
I know what you are thinking: "well what if it's N/2, N/4, N/8 etc". And the answer is that in such a network you are screwed anyway. Lots of small packets would be unavoidable.
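The "MTU + 1" scenario is easy to count. A rough sketch, assuming IPv4 with a 20-byte header and router-side fragmentation; the link MTU of 1400 and the function name are arbitrary picks for illustration:

```python
import math

IP_HEADER = 20  # IPv4 header without options (assumption)


def packets_on_wire(n_packets, packet_len, link_mtu, header=IP_HEADER):
    """Packets crossing a link once routers have fragmented n_packets equal-sized packets."""
    if packet_len <= link_mtu:
        return n_packets
    # Fragment payloads (except the last) are aligned to 8 bytes.
    max_chunk = (link_mtu - header) // 8 * 8
    per_packet = math.ceil((packet_len - header) / max_chunk)
    return n_packets * per_packet


n = 1000
mtu = 1400
too_big = packets_on_wire(n, mtu + 1, mtu)  # every packet is one byte too large
fits = packets_on_wire(n, mtu, mtu)         # sender sized its packets to fit
print(too_big, fits)
```

In this model, packets that are a single byte too large double the packet count on the narrow link.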
There is no universal "maximum length" among the links in a route. Some link will have a smaller maximum forcing the previous link to fragment nearly every packet; instant 2x (or more) packet overhead. "Large" transfers exacerbate this as nearly every packet is max size when it leaves the origin host.
The fragmentation happens after your computer has packetised the data.
Most mobile operators will do MSS clamping. The IP packets get encapsulated in GTP. Having to fragment the GTP packets would suck. (Now, you don't need to clamp to 1344 just to account for GTP. So they have something else going on as well, perhaps GTP+IPSec).
Of course, "most" isn't "all". Some don't do it. Some try to do it, but mess up. The saddest case I saw was an operator with a setup like this:
client <-> GGSN <-> terminating proxy <-> NAT <-> server
That's a pretty normal traffic flow, right?
Now, for some reason they were doing the MSS clamping on the NAT, not the GGSN. Sometimes that might be OK. But they didn't account for the proxy... What happened in practice was that all the communication between the client and the proxy was at a too-high MSS of 1460 (causing GTP fragmentation) and the communication between the proxy and the server used a too-low MSS of 1380 (causing 5% increase in packet count for no reason). Oops.
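As a sanity check on the "5% increase in packet count" figure, here is a small sketch. The 10 MiB transfer size is arbitrary, and the 40-byte header figure assumes IPv4 and TCP without options:

```python
import math

TCP_IP_HEADERS = 40  # IPv4 (20) + TCP (20) without options (assumption)


def mss_for_mtu(mtu):
    """Largest TCP payload that fits into one packet of the given MTU."""
    return mtu - TCP_IP_HEADERS


def segments_needed(data_bytes, mss):
    """Number of full-or-partial TCP segments needed to carry data_bytes."""
    return math.ceil(data_bytes / mss)


transfer = 10 * 1024 * 1024  # hypothetical 10 MiB download
high = segments_needed(transfer, 1460)  # the client <-> proxy MSS
low = segments_needed(transfer, 1380)   # the proxy <-> server MSS
print(mss_for_mtu(1500), high, low, f"{low / high - 1:.1%}")
```

The extra packet count comes out between 5% and 6%, consistent with the rough figure quoted above.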
Well, it's broken. With a middle box doing MSS clamping, the ICMP meant for the sending party will probably get dropped, and the fragments will likely not work either (like large DNS responses).
GTP MSS clamping assumes there is no <1344 link in the path! True, this is often the case, but it completely misses the point of "the design of the internet". Sad times...
Huh? You appear to be implying that clamping the MSS is causing large packets to be dropped, or ICMP responses to be blackholed. Neither of those is true. It's simply the mobile network asking the endpoints not to send larger TCP packets than that. (And since TCP has traditionally been 95%+ of the traffic, the GTP fragmentation for large non-TCP packets isn't a big deal).
The other party (server) responds with 1500 MSS, and both parties successfully complete the TCP handshake assuming a PMTU of 1344. Now say there is a 1300 link in between. And say my phone sends a large packet.
My phone will happily transmit a 1344 byte packet. Say TCP, with the DF flag.... The intermediate router with the small MTU will stop it and return an ICMP PMTU message.
The point I'm making: the middle boxes doing clamping usually don't do much about this ICMP. They drop it. What they should do is send this ICMP back to my phone, but that's not what I saw in my tests. Or maybe I just tested some buggy middle boxes.
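The failure mode described here can be modelled in a few lines. This is a toy simulation, not how any real stack is implemented: the path is just a list of per-link MTUs, and `icmp_delivered` is a made-up switch standing in for a middle box that forwards or drops the "fragmentation needed" message.

```python
def discover_path_mtu(link_mtus, first_size, icmp_delivered=True, max_tries=10):
    """Simulate a sender with DF set probing a path of per-link MTUs.

    Returns the packet size that finally gets through, or None when the
    "fragmentation needed" ICMP never arrives and the sender is blackholed.
    """
    size = first_size
    for _ in range(max_tries):
        # The first link too small for the packet drops it (DF forbids fragmenting).
        blocking = next((mtu for mtu in link_mtus if mtu < size), None)
        if blocking is None:
            return size  # packet reached the far end
        if not icmp_delivered:
            return None  # sender never learns why its packets vanish
        size = blocking  # ICMP reported the next-hop MTU, so the sender shrinks
    return None


path = [1500, 1344, 1300, 1500]  # hypothetical path with a 1300-byte link
print(discover_path_mtu(path, 1344))
print(discover_path_mtu(path, 1344, icmp_delivered=False))
```

With the ICMP delivered, the sender settles on 1300; with it dropped, the large packets never arrive and the model returns None.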
MSS clamping by a middlebox is a totally fine practice. Dropping ICMP packets is a bad practice. But they have nothing to do with each other. They're probably not even being done by the same device.
It seems that IPv4 (and v6 to some extent) allow for MTUs of up to 64KiB. Is the only thing limiting the selection of higher Path MTUs the fact that most Ethernet implementations have an MTU of 1500 bytes?
Do network hardware companies sell devices that tolerate higher MTUs? Is this a hardware limitation, or is it just unwillingness to go past the 1500 bytes because that's the IEEE standard?
This problem was discussed extensively in the early IP days. The alternative to the current model is to have a fixed packet size. This inherently means that for some media the packet size will not be optimal.
The current model in IPv4 is reasonable - that is: dynamic packet size, with either routers doing fragmentation or a signaling mechanism to push the PMTU back onto the originating host.
The problem is that there is very little visibility into whether these mechanisms work. The linked online tests are an attempt to fix this - by easily showing everyone if their CPE / ISP / Path is reasonable and behaves correctly.
Testing fragmentation / ICMP PMTUD delivery historically was pretty hard. We need more visibility.
This was already done with ATM for L2 and the 53 byte fixed cell size. You don't hear about ATM any more.
11. Appendix 2. Comments from the draft's authors:
FDDI and Ethernet use the same error checking mechanism; CRC-32. The probability of undetected errors remains constant for frame sizes between 3007 and 91639 bits (approximately 376 to 11455 bytes). Setting the maximum size of jumbo frames to 9018 bytes falls well within this range. There is no increase in undetected errors when using jumbo frames and the existing CRC-32 as the error detection mechanism.
This is also worth a read:
>The strength of the Ethernet CRC checksum and the 16 bit Transport checksum has been found to reduce for data segments that are larger than the standard Ethernet MTU. Koopman et al. [Koopman] have explored a number of CRC polynomials as well as the polynomial used in the Ethernet CRC calculation.
And the following on page 32:
>"Due to the nature of the CRC algorithm, the probability of undetected errors is the same for frame sizes between 376 and 11,455 bytes. Thus to maintain the same bit error rate accuracy as standard Ethernet, frames should ideally not exceed 11455 bytes."
"The weakening of this frame check which greater frame sizes is explored in R. Jain's "Error Characteristics of Fiber Distributed Data Interface (FDDI)", which appeared in IEEE Transactions on Communications, August 1990. [...] Firstly, the power of ethernet's Frame Check Sequence is the major limitation on increasing the ethernet MTU beyond 11444 bytes. Secondly, frame sizes under 11445 bytes are as well protected by ethernet's Frame Check Sequence as frame sizes under 1518 bytes."
Another page at https://web.archive.org/web/20070426062813/http://sd.wareone... explains why 9000 bytes is commonly chosen:
"Why 9000? First because ethernet uses a 32 bit CRC that loses its effectiveness above about 12000 bytes. And secondly, 9000 was large enough to carry an 8 KB application datagram (e.g. NFS) plus packet header overhead."
(The page doesn't mention it, but I guess only multiples of 1500 were considered, and 9000 is the smallest multiple of 1500 greater than 8192.)
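The guess is quick to check, assuming (as the comment does) that only multiples of the standard 1500-byte Ethernet MTU were candidates:

```python
import math

ETHERNET_MTU = 1500
NFS_BLOCK = 8 * 1024  # the 8 KB application datagram from the quote

# Smallest multiple of 1500 that can hold an 8 KB datagram.
multiples_needed = math.ceil(NFS_BLOCK / ETHERNET_MTU)
jumbo = multiples_needed * ETHERNET_MTU
headroom = jumbo - NFS_BLOCK  # bytes left over for packet headers

print(multiples_needed, jumbo, headroom)
```

Five multiples (7500 bytes) fall short of 8192, so six are needed, and the leftover room is available for the packet header overhead the quote mentions.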
"Raising the Internet MTU": https://web.archive.org/web/20070427184111/www.psc.edu/~math...
"Thoughts on increasing MTUs on the internet" (NANOG thread from 2007): https://www.nanog.org/mailinglist/mailarchives/old_archive/2...
Edit: found the source of the "9000 bytes" de facto standard: https://noc.net.internet2.edu/i2network/jumbo-frames/rrsum-a...
Very interesting and, as usual, well presented. The picture it draws for UDP (i.e., DNS) fragmentation over IPv6 is very bleak. Basically: "it's not working".
I spoke about it here:
There are a couple of issues. In old kernels it caused a "WARNING", with some weird stack trace. I'm quite sure it has been fixed before 4.8, but I'm not sure if it's fixed in 4.4. 3.18 is definitely buggy. Very hard to reproduce.
Second, it causes stalls on the connections. If the kernel sees a dropped packet, it might get confused and instead of just resending it, it might go into the MTU Probing mode. This will effectively stall the connection for a while.
Finally, for long running connections, packet loss has a high chance of being misdiagnosed as an MTU Blackhole, so the Path MTU is increasingly likely to be reduced over time. MTU Probing has no mechanism to _increase_ the Path MTU. Only decrease. This might degrade the performance of long running connections.
- don't use MTU probing if you have long running connections
- it might cause connection stalls
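To see whether a Linux box has this turned on, something like the following sketch should work. It assumes the comment is about the Linux `net.ipv4.tcp_mtu_probing` sysctl, read here through `/proc`; the helper name is made up, and the mode descriptions paraphrase the kernel documentation.

```python
from pathlib import Path

SYSCTL = Path("/proc/sys/net/ipv4/tcp_mtu_probing")

MODES = {
    0: "disabled",
    1: "enabled only after a blackhole is suspected",
    2: "always enabled",
}


def mtu_probing_mode(path=SYSCTL):
    """Return (value, description), or None if the sysctl cannot be read."""
    try:
        value = int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None  # not Linux, no permission, or unexpected content
    return value, MODES.get(value, "unknown")


print(mtu_probing_mode())
```

On systems without that file the function returns None rather than raising, so it is safe to run anywhere.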
More reading on the subject (suggested by a friend, hi Jari!):
This is a good read, with a lot of research, describing a common problem. I will bookmark it, thanks for the share.