From what I recall, the 1500 number was chosen to limit the time wasted when a collision occurred. Collisions could happen on the original links because, instead of the point-to-point links we have today, all nodes shared the same channel. (This was in the days of hubs, before switches existed.) The CSMA (Carrier Sense, Multiple Access) algorithm would not transmit while another node was using the channel, and would back off for a random length of time before sending. If two nodes began transmitting at the same time, a collision would occur and the Ethernet frames would be lost. Error detection and correction (if used) occurs at higher layers. So if a collision occurred with a 1500-byte MTU at 10 Mbps, about 1.2 ms of time would be wasted. IIRC, the 1500-byte MTU was selected empirically.
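Just to sanity-check that figure, here's a back-of-the-envelope sketch (the 1500-byte payload and 10 Mbps rate are from the comment above; preamble and framing overhead add a little more):

```python
# Rough estimate of the time a maximum-length frame occupies a shared
# 10 Mbps Ethernet segment, i.e. the time lost if that frame collides.
PAYLOAD_BYTES = 1500          # MTU-sized payload
BIT_RATE = 10_000_000         # 10 Mbps shared channel

tx_time_s = PAYLOAD_BYTES * 8 / BIT_RATE
print(f"{tx_time_s * 1000:.2f} ms")   # ~1.20 ms of channel time per collision
```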
Another reason for the short MTU was the accuracy of the crystal oscillator on each Ethernet card. I believe 10BASE Ethernet used bi-phase (Manchester) encoding, but instead of recovering the clock from the data and using it to clock the bits into the receiver, the cheap hardware would just assume that everything stayed in sync. So if any crystal oscillator was off by more than 0.12%, the final bits of the frame would be corrupted.
I've actually encountered Ethernet cards that had this problem. They would work fine with short frames, but then hang forever once a long one came across the wire. The first one I saw was a pain to troubleshoot. I could telnet into a host without trouble, but as soon as I typed 'ls', the session would hang. The ARP, SYN/ACK and Telnet login exchanges were small frames, but as soon as I requested the directory listing, the remote end sent frames at max MTU and they were never received. They would be perpetually re-tried and perpetually fail because of the frequency error of the oscillator in the (3C509) NIC.
1500 bytes is also roughly the largest size for which the particular CRC32 used by Ethernet can still reliably detect errors. And that's a hard requirement, because sometimes the only way the receiver notices a collision is that the frame fails the CRC check. A more complicated CRC was not an option because of the limits of hardware complexity (and cost) at the time.
No comments on history, AFAIK, Packet Length Limit: 1500 was found on a stone tablet.
But note that many high-profile sites purposefully do not send 1500-byte packets. Path MTU discovery has many cases where it doesn't work, so it's simpler to say f*** it, set your effective MTU to 1450 or 1480, and call it a day.
Reflecting the MSS in the SYN minus 8 (assuming a poorly configured PPPoE link) or 20 (assuming a poorly configured IPIP tunnel) or even 28 (both) is probably more effective, but stats are hard to gather, and AFAIK no OS makes it easy to run such a test; you'd need to patch things, divert packets, and have a good representative cross-section of users, etc.
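For reference, the arithmetic behind those offsets looks roughly like this (a sketch only; the 8/20/28-byte overheads are the ones mentioned above, and I'm assuming the usual 20-byte IPv4 and 20-byte TCP headers with no options):

```python
# Illustrative MSS arithmetic for a few common encapsulations.
# MSS = link MTU - IP header (20) - TCP header (20).
ETHERNET_MTU = 1500
IP_TCP_HEADERS = 20 + 20

overheads = {
    "plain Ethernet": 0,
    "PPPoE (8-byte header)": 8,
    "IPIP tunnel (20-byte outer IP)": 20,
    "PPPoE + IPIP": 28,
}

for name, extra in overheads.items():
    mtu = ETHERNET_MTU - extra
    mss = mtu - IP_TCP_HEADERS
    print(f"{name}: effective MTU {mtu}, TCP MSS {mss}")
```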
Well yeah, PPPoE is super derpy; but at least the modems can be trained to adjust the TCP MSS on the way in and out to 1472, though they often don't, and sometimes they only adjust the MSS on the way out, not on the way in. My current PPPoE ISP doesn't support RFC4638, and my previous ISP was worse: I had a fiber connection with a 1500 MTU, but their 6rd gateway was only 1492 MTU (so the IPv6 MTU was 1472 after the tunnel), and ICMP needs-frag was rate limited, so IPv6 would only sometimes work until I figured out I needed to adjust the v6 MTU.
Did you know Windows doesn't request (or process) MTU information on v4 DHCP? That's a nice touch.
That is the whole point of RFC4638: it allows you to request an MTU of 1508 for the PPPoE connection so that the tunnelled frames can be 1500 bytes. The problem is that it tries to be backwards compatible, which makes it hard to deal with the case where the request isn't granted.
I've tried to detect MTU on PPPoE sessions with LCP echo requests, and it works... Until I came across consumer wireless routers that send back short LCP echo replies to 1500 byte LCP echo requests. Sometimes software is such complete garbage in the real world.
IPv6 made it a bit easier: assume 1280 unless PMTUD is able to show otherwise. IPv4 had something similar with 576, but that was a bit too small to just default to, and it started out before PMTUD was settled, so it couldn't mandate it.
I have a related question. Jumbo frames (e.g. 9000) look well suited for internal VLANs in DC networks: DC-grade routers/switches almost always support jumbo frames, and PMTUD is not an issue for internal networks (where servers don't talk to clients with filtered ICMP).
Why are they so rarely used despite an almost free performance benefit?
The extra complexity (best practice is to run a separate VLAN/subnet where clients have an extra NIC with a high MTU) isn't worth it, since the only real drawback of staying at 1500 is the on-the-wire header overhead (NICs have LRO/LSO and other offload mechanisms that absorb the per-packet CPU cost).
They are used often in storage networks. Back when I worked at a VMware shop, we also implemented them on the vMotion network, and it cut down the time to move very large DB servers to other hosts considerably.
It is pretty common for networks used for local NFS storage, or other uses where the packets are somewhat guaranteed to not leave the facility, like between Oracle RAC nodes, etc.
In DC networks, 9k frames and path MTU discovery are very common. The issue, however, is that the MTU across the internet is 1500, and you cannot selectively use MTUs depending on routes, so you end up with a 9k MTU at edge routers, which need to split that into 1500-byte packets.
This fragmentation is absolutely brutal for the performance of edge routers.
Doing this requires fragmenting at the IP level, which is notoriously unreliable over the internet. The other option is to send ICMP "fragmentation needed" / "packet too big" messages telling the sender to drop its MTU for the path, adding significant overhead for every new connection (assuming the sender even listens to such messages).
How so? In my example, the default route has an MTU of 1500 and the localnet route has an MTU of whatever the interface is (9000 in this example).
When you send a packet, the destination route is looked up to determine the MTU. If you send to the internet, you'll get small packets (not IP fragments), and destinations under 10.0.0.0/8 will get large packets.
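If you want to see what the kernel actually decided, here's a minimal sketch (Linux-specific; the addresses are placeholders, the route_mtu helper is mine, and the 9000/1500 expectations assume the routing setup described above). Connecting a UDP socket forces a route lookup without sending anything, and IP_MTU then reports the MTU attached to that route:

```python
import socket

def route_mtu(dest_ip: str) -> int:
    """Return the MTU the Linux kernel associates with the route to dest_ip."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((dest_ip, 9))                      # port is irrelevant; no packet is sent
        IP_MTU = getattr(socket, "IP_MTU", 14)       # 14 is the Linux value if the constant is missing
        return s.getsockopt(socket.IPPROTO_IP, IP_MTU)
    finally:
        s.close()

# Placeholder addresses: a host on the jumbo-frame subnet vs. an internet host.
print(route_mtu("10.0.0.2"))   # expect 9000 if the local route has a jumbo MTU
print(route_mtu("192.0.2.1"))  # expect 1500 (or less) via the default route
```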
That works fine for locally generated traffic which has knowledge of the routes. For hosts going through a router, they have no idea what the outbound link MTU is, requiring PMTU discovery or messing with TCP MSS.
The claim I replied to was that you couldn't run one MTU for some routes and one for other routes.
You absolutely can. But running your default route at 9000 is ill-advised as you mention.
If your datacenter routes aren't simply rfc1918 addresses, it would be harder to set the end host routes, and the effort to get larger packets on the wire might not be worth the rather small benefits.
Would it not be possible to have another network device before the edge router that encapsulates the packet to 1500? And vice versa for inbound packets?
I have a rack in a DC, and I've always wanted to configure jumbo frames between the host and the VMs. However, I haven't, because I am not sure of the recourse when serving traffic.
Of course this would be possible (with just another router), but it would still result in inefficiencies.
IP fragmentation is something you do not want (in-network fragmentation was removed from IPv6 for a good reason). TCP does not like it when IP packets get fragmented, as this usually results in lost PDUs when there is even the slightest packet loss, which causes TCP to resend the PDU.
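To put a rough number on that (a back-of-the-envelope sketch; the 9000-byte datagram and 1% loss rate are purely illustrative assumptions):

```python
# If a large datagram is fragmented, losing any one fragment loses the whole
# datagram, so the effective loss rate grows with the fragment count.
DATAGRAM_BYTES = 9000       # illustrative jumbo-sized datagram
FRAGMENT_MTU = 1500         # size of each on-the-wire fragment
LINK_LOSS = 0.01            # assumed 1% per-packet loss, purely illustrative

# Ceiling division: ~1480 bytes of payload per fragment after the 20-byte IP header.
fragments = -(-DATAGRAM_BYTES // (FRAGMENT_MTU - 20))
datagram_loss = 1 - (1 - LINK_LOSS) ** fragments
print(f"{fragments} fragments -> {datagram_loss:.1%} chance the whole datagram is lost")
# 7 fragments -> roughly 6.8% effective loss, versus 1% for an unfragmented packet
```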
I miss BNC because it was fun to build little sculptures out of the terminators and connectors. Networking is faster these days, but does it still make for good fidget toys? I think not.
Hehe, me too.
I only have one old network card lying around, but for some reason it's a Token Ring card. I have no idea where it came from, because I never used it, and I think by the time it found its way into my collection of obsolete hardware, I did not even own a computer with an ISA slot...
There would probably be some concrete value in having a flag day on the internet where we just universally agree to bump the MTU up to something a little larger, and everyone would gain across the board.
If I'm not mistaken, this happened with both TLS and DNS - there were multiple days where all the major service providers flipped the bits to allow everyone to test the systems at scale and fix any issues that occurred.
This is far, far harder to do than a DNS flag day.
Changing the value to something else is futile. What needs to happen is for people to allow ICMP into their networks so that PMTUD can work correctly.
Path MTU discovery has solved this issue, but people keep dropping ALL ICMP traffic.
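To make the failure mode concrete, here's a rough sketch of what a PMTUD-dependent sender looks like on Linux (the destination is a placeholder and the 1492-byte hop is just the PPPoE example from earlier): with DF forced on, the only way the sender ever learns about a narrower link is the ICMP "fragmentation needed" reply, and if that reply is filtered the datagram just vanishes.

```python
import socket

# Linux socket option values, used as fallbacks if the constants are missing.
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# Force the Don't Fragment bit on everything this socket sends. Routers on a
# narrower link must now return ICMP "fragmentation needed" instead of
# fragmenting, and that ICMP is exactly the message people filter away.
s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
s.connect(("192.0.2.1", 9))    # placeholder destination

# 1472 bytes of payload + 28 bytes of UDP/IP headers = a 1500-byte packet:
# it fits a 1500-byte local link, so the send succeeds here, but it exceeds a
# 1492-byte PPPoE hop further along. Whether the kernel ever learns the
# smaller path MTU depends entirely on the ICMP reply making it back; if it
# is filtered, this packet (and every retry) silently disappears -- the
# classic PMTUD black hole.
s.send(b"x" * 1472)
```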