
I wish we could have another and bump the packet size.

We're at the point where we can have millions of packets per second going through a network interface, and it starts to get very silly.

It's at the point where even a 10G connection requires some thought to actually perform properly. I've managed to get bottlenecked on high-end hardware, requiring a whole detour into SR-IOV just to get back to decent speeds.



> I wish we could have another and bump the packet size.

The clock precision (100s of ppm) of the NIC oscillators on either side of a network connection gives a physical upper limit on the Ethernet packet size. The space between the packets lets the slower side "catch up". See https://en.wikipedia.org/wiki/Interpacket_gap for more info.

We could use more precise oscillators to allow longer packets, but at a higher cost.
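For a rough sense of scale (back-of-envelope only; the 200 ppm worst case and the frame sizes are my assumptions, not numbers from the 802.3 spec):

  # How many bit times of drift accumulate over one frame if the two ends'
  # clocks disagree by 200 ppm (each off by +/-100 ppm in opposite directions)?
  PPM_OFFSET = 200e-6   # assumed worst-case relative clock error
  IPG_BITS = 96         # the standard 12-byte interpacket gap

  for frame_bytes in (1500, 9000, 64000):
      drift_bits = frame_bytes * 8 * PPM_OFFSET
      print(f"{frame_bytes:>6} B frame: ~{drift_bits:.1f} bit times of drift "
            f"(IPG is {IPG_BITS} bit times)")

That prints roughly 2.4, 14.4 and 102.4 bit times, so with these assumed numbers the drift only catches up with the standard gap at frame sizes well beyond jumbo.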


You don't need that as much on modern protocols. The point of 8b/10b or 64b/66b encoding is that it guarantees enough edges for receivers to be self-clocking, with the incoming bits more or less thrown directly into a PLL.


That's a separate concern.

The previously mentioned issue is that to never buffer packets in a reclocking repeater on a link, you _need_ the incoming packet rate to never be higher than the rate at which you can send them back out, or else you'd fill up/buffer.

If your repeaters are actually switches, this manifests as whether you occasionally drop packets on a full link with an uncongested switching fabric. Think two switches with a 10G port and 8 1G ports each, used to compress 8 long cables into one (say, via VLAN tagging based on which of the 8 ports the traffic came in on).


Realistically I think we would be fine to make packet size significantly larger than Ethernet would currently allow if we really wanted. E.g. Infiniband already has 1x lane speeds of 200 Gbps without relying on any interpacket gap for clock sync at all. Ethernet, on the other hand, has been consistently increasing speed while decreasing the number of bits used for the interpacket gap since it's less and less relevant for clocking. Put a few bytes back and you could probably do enormous sizes.


I don't get how that limits the packet size. If a sender's clock is 500 ppm faster than an intermediate node's, you need 500 ppm of slack. That could be short packets with a short gap, or large packets with a large gap.

Ethernet specs the IPG as a fixed number of bits, but it could easily be proportional to the size of the previous packet.


(Intentional) jumbo frames at layer 2 and expanded MTUs at layer 3 are certainly available (as you may know). In fact it seems (I am, it should be obvious, not an expert) that using jumbo frames is more or less the common practice by now. There does in fact seem to have been some standards drama about this, too: I can't find it now, but IIRC in the '00s someone's proposal to extend the header protocols to allow the header to indicate a frame size of over 1500 bytes was rejected, and nothing seems to have been done since. At the moment it seems that the best way to indicate max. Ethernet frame sizes of over 1500 is an optional field in LLDP(!) https://www.juniper.net/documentation/us/en/software/junos/u... and the fall-back is sending successively larger pings and seeing when the network breaks(!) https://docs.oracle.com/cd/E36784_01/html/E36815/gmzds.html .
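That fall-back is easy enough to script, for what it's worth (rough sketch; the flags are Linux iputils ping, the host is a placeholder, and the 28 bytes assumes IPv4 + ICMP headers):

  # Grow pings with "don't fragment" set until they stop getting through.
  # -M do forbids fragmentation, -s sets the ICMP payload size (Linux iputils).
  import subprocess

  HOST = "192.0.2.1"  # placeholder address

  def fits(payload: int) -> bool:
      r = subprocess.run(
          ["ping", "-M", "do", "-c", "1", "-W", "1", "-s", str(payload), HOST],
          capture_output=True,
      )
      return r.returncode == 0

  # Binary search for the largest payload that still fits unfragmented.
  lo, hi = 0, 8972  # 8972 payload + 28 header bytes = a 9000-byte IP packet
  while lo < hi:
      mid = (lo + hi + 1) // 2
      lo, hi = (mid, hi) if fits(mid) else (lo, mid - 1)
  print(f"path MTU is roughly {lo + 28} bytes")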


The common advice I've heard for jumbo frames is not to enable them unless you can do it for every device on your LAN, and even then it's probably not worthwhile outside specific situations like a separate iSCSI network or such.

I just now ran iperf3 from my Mac to my Synology without jumbo frames:

  [ ID] Interval           Transfer     Bitrate         Retr
  [  7]   0.00-10.00  sec  10.0 GBytes  8.61 Gbits/sec    0             sender
  [  7]   0.00-10.00  sec  10.0 GBytes  8.60 Gbits/sec                  receiver

Given how rarely I actually care to saturate the 10Gbit link, I'd rather stick with the default settings, which are hypothetically slightly slower but highly likely to work in all scenarios.


It seems much more effective on plain 1000Base. It's the difference between 850Mb/s and 975Mb/s for me.


That difference is due to your devices being relatively poor at handling high-pps workloads, not the particular speed the link is running at. You can certainly get more than 850 Mbps on a 1000BASE-T link with a standard MTU, on the order of ~100 Mbps more.


That surprises me a little. From the Wikipedia article[0] I'd expected jumbo frames to be only about 5% more efficient.

[0] https://en.wikipedia.org/wiki/Jumbo_frame
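The rough math behind that figure, with the usual assumptions (38 bytes of per-frame Ethernet overhead including preamble/FCS/IPG, 40 bytes of TCP/IPv4 headers, no options):

  # What fraction of line rate ends up as TCP payload at a given MTU?
  ETH_OVERHEAD = 38    # preamble+SFD 8, MAC header 14, FCS 4, IPG 12
  TCPIP_HEADERS = 40   # IPv4 20 + TCP 20, no options

  def goodput_fraction(mtu: int) -> float:
      return (mtu - TCPIP_HEADERS) / (mtu + ETH_OVERHEAD)

  for mtu in (1500, 9000):
      print(f"MTU {mtu}: {goodput_fraction(mtu):.1%} of line rate is payload")

which works out to roughly 94.9% vs 99.1%, i.e. only a few percent on paper.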


The use of jumbo frames is NOT normal and is only used in very specific setups, in general something like a storage system that is isolated to its own layer-2 network. At some point your jumbo network has to hit the rest of the network, and packet fragmentation is then done in software on the CPU, which is very expensive and not line rate. The normal outcome is you break your network.


I'm not certain what my point is, but I wanted to mention that jumbo frames don't work over the Internet. More of a LAN thing.


My local internet exchange has a 1500 vlan and a 9000 vlan. My understanding is there are many fewer peers on the 9000 vlan, but it's not zero.

If you want to use jumbo packets on the internet at large, you need to have working path MTU discovery, which realistically means at least probing, but you really should have that at 1500 too, because there are still plenty of broken networks out there. My guess is you won't have many connections with an effective MTU above 1500, but you might have some.


Separate peering VLANs for those using 1500 byte peering and 9000 byte peering about sums up how much of a PITA it is to mix things and expect PMTUD to work.

I'd be willing to bet my lunch that 10x more places have been moving down to assuming 1280-byte MTUs (since IPv6 guarantees it) than have been peering on the internet at >1500 (not counting 1504 for VLAN tags and whathaveyou).


MTUs are one of the eternal gremlins of networking, and any choice of MTU will almost certainly be either too large for the present day or too small for the future. 1500 was chosen back when computers ran at dozens of megahertz and it was actually kind of large at the time.

Changing the MTU is awful because parameters like MTU get baked into hardware in the form of buffer sizes limited by actual RAM limits. Like everything else on a network once the network is deployed changing it is very hard because you have to change everything along the path. Networks are limited by the lowest common denominator.

This kind of thing is one of the downsides of packet switching networks like IP. The OSI folks envisioned a network that presented a higher level interface where you'd open channels or send messages and the underlying network would handle all the details for you. This would be more like the classic analog phone network where you make a phone call and a channel is opened and all details are invisible.

It's very loosely analogous to CISC vs RISC where the OSI approach is more akin to CISC. In networking RISC won out for numerous reasons, but its simplicity causes a lot of deep platform details to leak into upper application layers. High-level applications should arguably not even have to think about things like MTU, but they do.

Where it gets very ugly is when higher-level applications have to think about things like NAT and stateful firewall traversal, IPv4 vs IPv6, port remapping, etc.

The downside of the OSI approach is that innovation would require the cooperation of telecoms. Every type of connection, etc., would be a product offered by the underlying network. It would also give telecoms a ton of power to nickel and dime, censor, conduct surveillance, etc. and would make anonymity and privacy very hard. It would be a much more managed Internet as opposed to the packet switching Wild West that we got.


Most high level applications do not deal with any of the low level details. They open a channel by specifying a hostname and a port number, and are given a reliable bidirectional byte stream from the network layer.

As far as most applications are concerned, the hostname is just a string that is interpreted by the network layer, be that through a DNS lookup or parsing as an address native to the underlying network protocol.

A minority of applications get fancy and request a datagram-oriented link, which the network layer also provides (with an admittedly small limit of 65 KB that leaks through from the underlying protocol).

Few applications ever go deeper than that.
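To make that concrete, this is about all a typical application ever touches (minimal sketch; example.com and the HTTP request are just stand-ins):

  # Hostname, port, and a reliable byte stream; MTU, fragmentation and
  # framing never appear at this level.
  import socket

  with socket.create_connection(("example.com", 80), timeout=5) as conn:
      conn.sendall(b"HEAD / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
      reply = b""
      while chunk := conn.recv(4096):
          reply += chunk

  print(reply.decode(errors="replace").splitlines()[0])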


Maximum packet size is already configurable on most NICs. 9000 is a typical non-default limit. If you increase the limit, you must do so on all devices on the network. https://en.wikipedia.org/wiki/Jumbo_frame
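On Linux, for example, the configured value is easy to inspect (quick sketch; changing it, e.g. with ip link set eth0 mtu 9000 where eth0 is whatever your interface is called, needs root and matching settings on every device on the segment, as above):

  # Read the configured MTU of each interface from sysfs (Linux only).
  from pathlib import Path

  for iface in sorted(Path("/sys/class/net").iterdir()):
      mtu = (iface / "mtu").read_text().strip()
      print(f"{iface.name}: MTU {mtu}")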


I’ve seen Cisco gear that supports 8192… that was fun to figure out with a separate network team :) “yup, we’ve enabled jumbo frames!”


> I wish we could have another and bump the packet size.

That's why I'm in full support of a world ending apocalypse that allows society to restart from scratch. We've made so many bad decisions this time around, with packet sizes being some of the worst.


Maybe we can then also redefine pi as 2*pi, while we're at it.


Or just use tau and call it "tau"


What a throwback! I remember when the Tau Manifesto came out: https://tauday.com/tau-manifesto


As long as we also make sure electrons are positively charged this time.


Then I'm going to be really confused about positrons!


Oh, but have you heard the news about negatrons?


I mean, larger packets (and working path MTU discovery) could be useful, but with large (1500 byte) packets and reasonable hardware, I never had trouble pushing 10G from the network side. Add TLS and some other processing, and older hardware wouldn't keep up, but not because of packetization. Small packets are also a different story.

All my hardware at the time was Xeon 2690, v1-4. NICs were Intel X520/X540 or similar (whatever SuperMicro was using back then). IIRC, v1 could do 10G easily without TLS, 8-9G with TLS; v3 improved AES acceleration and we could push 2x10G. When I turned off NIC packetization acceleration, I didn't notice much change in CPU or throughput, but if packetization were a bottleneck it should have been significant.

At home, with a similar-age desktop processor, a dual-core Pentium G3470 (Haswell, same gen as a 2690v3), I can't quite hit 10G in iperf, but it's close; another two cores would probably do it.

In some cases, you can get some big gains in efficiency by lining up the user space cpu with the kernel cpu that handles the rx/tx queues that the NIC hashes the connection to, though.
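The pinning half of that is a one-liner if anyone wants to try (Linux-only sketch; which core actually services your queue's IRQ is something you have to dig out of /proc/interrupts and /proc/irq/<n>/smp_affinity_list, and the 2 below is made up):

  # Pin this process to the core that (we assume) handles the NIC queue IRQ,
  # so the userspace reads share cache with the kernel's softirq work.
  import os

  QUEUE_CPU = 2  # assumption: replace with the core found in /proc/interrupts
  os.sched_setaffinity(0, {QUEUE_CPU})   # pid 0 = the current process
  print("now running on CPUs:", os.sched_getaffinity(0))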


I discovered that putting a 10G interface into a bridge causes a very significant slowdown. Linux has to do the bridging on the CPU, so that turns off a large part of the card's acceleration.

That's not a good thing for a server that runs a bunch of VMs.

Fortunately SR-IOV exists, but it seems a tad silly to me that I have to do all this weird PCIe passthrough stuff just for this. It's nice, don't get me wrong, but a bit too exotic for what should be a simple setup.


I always found it ironic that virtualization made me have to care about the hardware more than I ever had to before.

2013 me: slap the app on a Dell. It'll be fine.

2017 me: aw crap, the NIC doesn't support SR-IOV. What do you mean, I need a special driver? Oh lordy, I'm pinning a whole damn CPU just so DPDK can pull packets off the wire?


Oh yeah, bridged mode on my little pentium system brings perf way down. It was fine on 1G, but when I upgraded to 10G and wanted to hit numbers, I needed to stop doing software bridging. For me, I have slots and NICs, so I moved away from virtual ethernet on a bridge to actual ports; main host gets the 10G, and everything else gets to use 1G ports.

No SR-IOV on my board, but that's ok.



