From what I recall, the 1500 number was chosen to limit the time wasted when a collision occurred. Collisions could happen on the original links because, instead of the point-to-point links we have today, all nodes shared the same channel. (This was in the days of hubs, before switches existed.) The CSMA (Carrier Sense, Multiple Access) algorithm would not transmit while another node was using the channel, and would back off for a random length of time before sending. If two nodes began transmitting at the same time, a collision would occur and the Ethernet frames would be lost. Error detection and correction (if used) occurs at higher layers. So if a collision occurred with a 1500-byte MTU at 10 Mbps, about 1.2 ms of time would be wasted. IIRC, the 1500-byte MTU was selected empirically.
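Just to sanity-check that figure, here's a back-of-the-envelope sketch (the 1500-byte payload and 10 Mbps rate are from the comment above; preamble and framing overhead add a little more):

```python
# Rough estimate of the time a maximum-length frame occupies a shared
# 10 Mbps Ethernet segment, i.e. the time lost if that frame collides.
PAYLOAD_BYTES = 1500          # MTU-sized payload
BIT_RATE = 10_000_000         # 10 Mbps shared channel

tx_time_s = PAYLOAD_BYTES * 8 / BIT_RATE
print(f"{tx_time_s * 1000:.2f} ms")   # ~1.20 ms of channel time per collision
```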
Another reason for the short MTU was the accuracy of the crystal oscillator on each Ethernet card. I believe 10BASE Ethernet used bi-phase (Manchester) encoding, but instead of recovering the clock from the data and using it to clock the bits into the receiver, the cheap hardware would just assume that everything stayed in sync. So if any crystal oscillator was off by more than 0.12%, the final bits of the frame would be corrupted.
I've actually encountered Ethernet cards that had this problem. They would work fine with short frames, but then hang forever once a long one came across the wire. The first one I saw was a pain to troubleshoot. I could telnet into a host without trouble, but as soon as I typed 'ls', the session would hang. The ARP, SYN/ACK and Telnet login exchanges were small frames, but as soon as I requested the directory listing, the remote end sent frames at max MTU and they were never received. They would be perpetually re-tried and perpetually fail because of the frequency error of the oscillator in the (3C509) NIC.
1500 bytes is also roughly the largest size for which the particular CRC32 used by Ethernet can still reliably detect errors. And that's a hard requirement, because sometimes the only way the receiver notices a collision is that the frame fails the CRC check. A more complicated CRC was not an option because of the limits of hardware complexity (and cost) at the time.
No comments on history, AFAIK, Packet Length Limit: 1500 was found on a stone tablet.
But note that many high-profile sites purposefully do not send 1500-byte packets. Path MTU discovery has many cases where it doesn't work, so it's simpler to say f*** it, set your effective MTU to 1450 or 1480, and call it a day.
Reflecting the MSS in the SYN minus 8 (assuming a poorly configured PPPoE link) or 20 (assuming a poorly configured IPIP tunnel) or even 28 (both) is probably more effective, but stats are hard to gather, and AFAIK no OS makes it easy to run such a test; you'd need to patch things, divert packets, and have a good representative cross-section of users, etc.
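For reference, the arithmetic behind those offsets looks roughly like this (a sketch only; the 8/20/28-byte overheads are the ones mentioned above, and I'm assuming the usual 20-byte IPv4 and 20-byte TCP headers with no options):

```python
# Illustrative MSS arithmetic for a few common encapsulations.
# MSS = link MTU - IP header (20) - TCP header (20).
ETHERNET_MTU = 1500
IP_TCP_HEADERS = 20 + 20

overheads = {
    "plain Ethernet": 0,
    "PPPoE (8-byte header)": 8,
    "IPIP tunnel (20-byte outer IP)": 20,
    "PPPoE + IPIP": 28,
}

for name, extra in overheads.items():
    mtu = ETHERNET_MTU - extra
    mss = mtu - IP_TCP_HEADERS
    print(f"{name}: effective MTU {mtu}, TCP MSS {mss}")
```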
Well yeah, PPPoE is super derpy; but at least the modems can be trained to adjust the TCP MSS on the way in and out to 1472, though they often don't, and sometimes they only adjust the MSS on the way out, not on the way in. My current PPPoE ISP doesn't support RFC4638, and my previous ISP was worse: I had a fiber connection with a 1500 MTU, but their 6rd gateway was only 1492 MTU (so the IPv6 MTU was 1472 after the tunnel), and ICMP needs-frag was rate limited, so IPv6 would only sometimes work until I figured out I needed to adjust the v6 MTU.
Did you know Windows doesn't request (or process) MTU information on v4 DHCP? That's a nice touch.
That is the whole point of RFC4638: it allows you to request an MTU of 1508 for the PPPoE connection so that the tunnelled frames can be 1500 bytes. The problem is that it tries to be backwards compatible, which makes it hard to deal with the case where the request isn't granted.
I've tried to detect MTU on PPPoE sessions with LCP echo requests, and it works... Until I came across consumer wireless routers that send back short LCP echo replies to 1500 byte LCP echo requests. Sometimes software is such complete garbage in the real world.
IPv6 made it a bit easier: assume 1280 unless PMTUD is able to show otherwise. IPv4 had something similar with 576, but that was a bit too small to just default to, and it started out before PMTUD was settled, so it couldn't mandate it.
I have a related question. Jumbo frames (e.g. 9000) look well suited for internal VLANs in DC networks: DC-grade routers/switches almost always support jumbo frames, and PMTUD is not an issue for internal networks (where servers don't talk to clients with filtered ICMP).
Why are they so rarely used despite an almost free performance benefit?
The extra complexity (best practice is to run a separate VLAN/subnet where clients have an extra NIC with a high MTU) isn't worth it, since the only real drawback of staying at 1500 is the on-the-wire header overhead (NICs have LRO/LSO and other offload mechanisms that absorb the per-packet CPU cost).
They are used often in storage networks. Back when I worked at a VMware shop, we also implemented them on the vMotion network, and it cut down the time to move very large DB servers to other hosts considerably.
It is pretty common for networks used for local NFS storage, or other uses where the packets are somewhat guaranteed to not leave the facility, like between Oracle RAC nodes, etc.
In DC networks, 9k frames and path MTU discovery are very common. The issue, however, is that the MTU across the internet is 1500, and you cannot selectively use MTUs depending on routes, so you end up with a 9k MTU at edge routers, which need to split that into 1500-byte packets.
This fragmentation is absolutely brutal for the performance of edge routers.
Doing this requires fragmenting at the IP level, which is notoriously unreliable over the internet. The other option is to send ICMP "fragmentation needed" / "packet too big" messages telling the sender to drop its MTU for the path, adding significant overhead for every new connection (assuming the sender even listens to such messages).
How so? In my example, the default route has an MTU of 1500 and the localnet route has an MTU of whatever the interface is (9000 in this example).
When you send a packet, the destination route is looked up to determine the MTU. If you send to the internet, you'll get small packets (not IP fragments), and destinations under 10.0.0.0/8 will get large packets.
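If you want to see what the kernel actually decided, here's a minimal sketch (Linux-specific; the addresses are placeholders, the route_mtu helper is mine, and the 9000/1500 expectations assume the routing setup described above). Connecting a UDP socket forces a route lookup without sending anything, and IP_MTU then reports the MTU attached to that route:

```python
import socket

def route_mtu(dest_ip: str) -> int:
    """Return the MTU the Linux kernel associates with the route to dest_ip."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((dest_ip, 9))                      # port is irrelevant; no packet is sent
        IP_MTU = getattr(socket, "IP_MTU", 14)       # 14 is the Linux value if the constant is missing
        return s.getsockopt(socket.IPPROTO_IP, IP_MTU)
    finally:
        s.close()

# Placeholder addresses: a host on the jumbo-frame subnet vs. an internet host.
print(route_mtu("10.0.0.2"))   # expect 9000 if the local route has a jumbo MTU
print(route_mtu("192.0.2.1"))  # expect 1500 (or less) via the default route
```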
That works fine for locally generated traffic which has knowledge of the routes. For hosts going through a router, they have no idea what the outbound link MTU is, requiring PMTU discovery or messing with TCP MSS.
The claim I replied to was that you couldn't run one MTU for some routes and one for other routes.
You absolutely can. But running your default route at 9000 is ill-advised as you mention.
If your datacenter routes aren't simply rfc1918 addresses, it would be harder to set the end host routes, and the effort to get larger packets on the wire might not be worth the rather small benefits.
Would it not be possible to have another network device before the edge router that encapsulates the packet to 1500? And vice versa for inbound packets?
I have a rack in a DC, and I've always wanted to configure jumbo frames between the host and the VMs. However, I haven't, because I am not sure of the recourse when serving traffic.
Of course this would be possible (with just another router), but it would still result in inefficiencies.
IP fragmentation is something you do not want (in-network fragmentation was removed from IPv6 for a good reason). TCP does not like it when IP packets get fragmented, as this usually results in lost PDUs when there is even the slightest packet loss, which causes TCP to resend the PDU.
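To put a rough number on that (a back-of-the-envelope sketch; the 9000-byte datagram and 1% loss rate are purely illustrative assumptions):

```python
# If a large datagram is fragmented, losing any one fragment loses the whole
# datagram, so the effective loss rate grows with the fragment count.
DATAGRAM_BYTES = 9000       # illustrative jumbo-sized datagram
FRAGMENT_MTU = 1500         # size of each on-the-wire fragment
LINK_LOSS = 0.01            # assumed 1% per-packet loss, purely illustrative

# Ceiling division: ~1480 bytes of payload per fragment after the 20-byte IP header.
fragments = -(-DATAGRAM_BYTES // (FRAGMENT_MTU - 20))
datagram_loss = 1 - (1 - LINK_LOSS) ** fragments
print(f"{fragments} fragments -> {datagram_loss:.1%} chance the whole datagram is lost")
# 7 fragments -> roughly 6.8% effective loss, versus 1% for an unfragmented packet
```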
I miss BNC because it was fun to build little sculptures out of the terminators and connectors. Networking is faster these days, but does it still make for good fidget toys? I think not.
Hehe, me too.
I only have one old network card lying around, but for some reason it's a Token Ring card. I have no idea where it came from, because I never used it, and I think by the time it found its way into my collection of obsolete hardware, I did not even own a computer with an ISA slot...
There would probably be some concrete value in having a flag day on the internet where we just universally agree to bump the MTU up to something a little larger, and everyone would gain across the board.
If I'm not mistaken, this happened with both TLS and DNS - there were multiple days where all the major service providers flipped the bits to allow everyone to test the systems at scale and fix any issues that occurred.
This is far, far harder to do than a DNS flag day.
Changing the value to something else is futile. What needs to happen is for people to allow ICMP into their networks so that PMTUD can work correctly.
Path MTU discovery has solved this issue, but people keep dropping ALL ICMP traffic.
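To make the failure mode concrete, here's a rough sketch of what a PMTUD-dependent sender looks like on Linux (the destination is a placeholder and the 1492-byte hop is just the PPPoE example from earlier): with DF forced on, the only way the sender ever learns about a narrower link is the ICMP "fragmentation needed" reply, and if that reply is filtered the datagram just vanishes.

```python
import socket

# Linux socket option values, used as fallbacks if the constants are missing.
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# Force the Don't Fragment bit on everything this socket sends. Routers on a
# narrower link must now return ICMP "fragmentation needed" instead of
# fragmenting, and that ICMP is exactly the message people filter away.
s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
s.connect(("192.0.2.1", 9))    # placeholder destination

# 1472 bytes of payload + 28 bytes of UDP/IP headers = a 1500-byte packet:
# it fits a 1500-byte local link, so the send succeeds here, but it exceeds a
# 1492-byte PPPoE hop further along. Whether the kernel ever learns the
# smaller path MTU depends entirely on the ICMP reply making it back; if it
# is filtered, this packet (and every retry) silently disappears -- the
# classic PMTUD black hole.
s.send(b"x" * 1472)
```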