I'm a big believer in Postel's Law (also known as the robustness principle) [1]. Basically be liberal in what you accept and conservative in what you send.
The problem with TCP is that the actual Layer 2 and 3 infrastructure doesn't obey this principle. It's taken the opposite stance: simply reject anything that's weird or unexpected. There's a reason and history for why this is. The most defensible is security. Less defensible are things like deep packet inspection.
This is a well-known problem and falls under the term (which I love) "ossification" [2].
As an example, much of the Internet stops working with MTUs above ~1500 (which this article mentions). Large packets (e.g. MTU ~9000) are almost a necessity for 10+ GbE, but they often won't work outside of the LAN.
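On a LAN it's easy enough to try, though; a rough sketch, assuming a Linux host, an interface named eth0, and a switch and peer that also accept jumbo frames:

    # raise the interface MTU (every hop on the LAN has to agree)
    ip link set dev eth0 mtu 9000
    # verify with a DF-flagged ping: 9000 minus 28 bytes of IP+ICMP header
    ping -M do -c 3 -s 8972 <peer-address>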
So I guess my point is changes like this will take a long time (if ever) to get widespread support. This could still matter for, say, data center and other "local" deployments, but I guess you have to start somewhere.
And I am totally against this. I agree for human / interactive input. But for automated things, be as strict as you can be, and hard fail on anything else. Specs are there for a reason, and not adhering to them creates compatibility issues.
> RFC 1122 (1989) expanded on Postel's principle by recommending that programmers "assume that the network is filled with malevolent entities that will send in packets designed to have the worst possible effect".
Well, that's gonna be a tough one if you try to make sense of garbage data (or data which doesn't adhere to the specs)
> Protocols should allow for the addition of new codes for existing fields in future versions of protocols by accepting messages with unknown codes (possibly logging them).
I agree with that. That does not mean "be liberal in what you accept". It means the protocol is designed in a better way.
> Basically be liberal in what you accept and conservative in what you send
That's not the same as "be conservative in what you do, be liberal in what you accept from others". "Doing" is a lot more than "sending".
You know... many programmers sometimes say that when things go wrong it's "a user error" or "a human error". But most of the time it is either a programming error or a lazy programmer. It's so easy for a programmer to send the correct data. But we live in a world where usually the least correct/secure software is chosen, because it's slightly easier to get started or allows for input errors (MySQL data truncation, MongoDB, etc.).
> And I am totally against this. I agree for human / interactive input. But for automated things, be as strict as you can be, and hard fail on anything else. Specs are there for a reason, and not adhering them creates compatibility issues.
Totally agreed on this. When a human inputs a name with a trailing space, trim it for them. Or when they input a phone number with funny dashes, remove them.
But when an API does this, it means the API client is simply incorrect. Incorrect requests should be rejected. Don't leave it up to the server to try to interpret it; this just leads to incorrect behavior.
> As an example, much of the Internet stops working with MTUs above ~1500 (which this article mentions).
I experienced this first-hand recently with a lower MTU limit and it's pretty confusing. The annoying thing is that when it's only wrong at one end, most of the internet and web pages will actually carry on working, i.e. it's only some HTTP requests with a large enough request header to push past the MTU threshold that disappear. Which results in websites half loading, or sometimes loading fine, but not other times.
I have to use LTE to get decent internet in the city I live in, and have been experimenting with different providers. Turns out most LTE networks use quite low MTUs, and so PMTUD (path MTU discovery) _must_ work correctly, otherwise you get the behaviour I described above... For the network with the best reception and backhaul, PMTUD just didn't seem to work, so it defaults to 1500, and all the wifi connections to the LTE router will be told to use 1500, so my WireGuard sets itself to 1420 (1500 less 80 bytes) and even that won't save it when the UDP packets vanish.
Now when the internet seems fishy, my first response after a simple ping is to try a ping at the network interface's supposed MTU, e.g. `ping -M do -s 1472 1.1.1.1` (-28 bytes for the IP and ICMP headers).
Even more confusing is that for these LTE networks the effective MTU seems to change, possibly as different combinations of cells are used for carrier aggregation... so if you are trying to use this for home internet you need to keep testing the MTU over time to discover the lowest, then change your router's wifi MTU (not even possible for most LTE routers) to that minimum for reliable internet.
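For what it's worth, that repeated manual testing can be scripted with the same `ping -M do` trick; a rough sketch (the probe sizes and the 1.1.1.1 target are just examples):

    # walk down candidate ICMP payload sizes with DF set; the first one that
    # gets a reply gives a lower bound on the path MTU (payload + 28 header bytes)
    for size in 1472 1452 1420 1392 1372 1350; do
      if ping -M do -c 1 -W 2 -s "$size" 1.1.1.1 >/dev/null 2>&1; then
        echo "path MTU is at least $((size + 28))"
        break
      fi
    done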
PMTUD on the Internet, sadly, does not work anymore. It was wonky for a few years, then totally died when AWS and the other cloud providers started blocking all ICMP in the default configuration. (It is also notoriously hard to get to work through packet-based load balancers, since most routers won't include enough of the original packet in the “too big” ICMP to reliably send it the same way as all the data packets that form the flow.)
The reason why you can browse the web at all with PMTU <1500 is due to the awful hack that is TCP MSS clamping, where the router will rewrite your TCP SYN packets to advertise a different MSS (roughly the MTU minus the IP and TCP headers) based on what it thinks the PMTU to the host is.
Seriously, try browsing the Internet with MTU 1500, PMTU <1500 and a router in place that does _not_ do TCP MSS clamping. It's an incredibly frustrating experience.
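For completeness, the clamping described above usually amounts to one rule on the box doing the forwarding; a minimal sketch, assuming a Linux router with iptables:

    # rewrite the MSS option in forwarded SYNs down to what the path can carry
    iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
    # or pin an explicit value, e.g. 1452 for a PPPoE link with MTU 1492 (1492 - 40)
    iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1452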
> PMTUD on the Internet, sadly, does not work anymore.
I don't think PMTUD ever really worked over the whole internet. Path MTU blackhole discovery can work, but it defaults to off on a lot of OSes (including Android!), and its behavior isn't always great anyway. iOS actually does do a good job here (which doesn't change my overall opinion). I don't think Windows will pay attention to the MTU from DHCP, but I think it will try path MTU blackhole detection (though it may do it with RFC timers, which are way too slow).
Also, most TCP stacks send out their own MSS on the SYN+ACK instead of sending min(received MSS, local MSS), which means that those networks that tried to fix their MTU problem by tweaking outgoing SYNs but don't tweak incoming SYN+ACKs don't get fixed. FreeBSD used to do the right thing, but it changed recently. People will argue that the RFC says to do it the way I'm calling the wrong way, but how many connections do you run into that can send a 1500-byte packet, but only receive a 1492-byte packet?
Most/many large internet sites just send MSS indicating 1480 or less and call it a day.
IMHO, something in-band would have been a lot more workable; let the router that can't forward the whole packet mark it, truncate it, and forward it, because we can assume the router has a working forward path (otherwise there's no connection), but we can't assume the router has a working reverse path. For TCP, this would be relatively straightforward to manage; UDP wouldn't be great though.
> Path MTU blackhole discovery can work, but it defaults to off on a lot of OSes (including Android!), and it's behavior isn't always great anyway.
Interesting, I didn't know this existed. It's built into Linux too, but unfortunately it's only for TCP. From `man tcp`:
    tcp_mtu_probing (integer; default: 0; since Linux 2.6.17)
        This parameter controls TCP Packetization-Layer Path MTU Discovery.
        The following values may be assigned to the file:
            0  Disabled
            1  Disabled by default, enabled when an ICMP black hole detected
            2  Always enabled, use initial MSS of tcp_base_mss.
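For reference, turning it on is just a couple of sysctls on Linux (the values here are just the common settings, not a recommendation):

    # probe only after an ICMP black hole is suspected (2 = always probe)
    sysctl -w net.ipv4.tcp_mtu_probing=1
    # the MSS that probing starts from when it has to guess
    sysctl -w net.ipv4.tcp_base_mss=1024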
I get how it makes sense to do that at the TCP level for arbitrary paths, but I had the impression the main issue was high variability in endpoint MTU, i.e. the particular gateway all your internet is being routed through, which is when it makes sense to do PMTU black hole discovery at the IP layer, which is essentially what I've been doing manually... Am I over-generalising my problem?
Your problem is highly likely to be in the first hop, or first few hops. Adjusting the MTU for the whole interface is the right thing to do there.
Historically, ICMP blackholes tended to be on long distance international transit, though. So maybe your ISP was 1500 MTU throughout and most of its transit and peering was too (or sent ICMPs anyway), but routes to Mordor were few and far between, so your ISP's transit provider used one that was IPIP tunneled (so effective MTU was 1480) and either didn't send ICMPs or was rate limited on sending ICMPs so it was as if they didn't send them. Everything would work OK to most of the internet, but if you made a connection to Mordor (or someone from Mordor made a connection to you), small packets make it through and large packets get dropped, and you get that kind-of-works-but-doesn't-really-work behavior.
In that situation, if you're in Mordor and most of the things you connect to aren't, it makes sense to set a smaller MTU. If you're not in Mordor, but sometimes connect to it, you want probing to happen for those connections, but setting a smaller MTU probably isn't the right thing (but, popular sites do it). If you're in Mordor and mostly connect to Mordor services, but sometimes the outside world, you again want probing sometimes.
But keep in mind, the probing really only works if both sides are doing it. It's hard to induce your TCP peer to send smaller packets if you sent a too big MSS.
All that said, my experience is that in recent times, it seems like problems are usually some ISP PPPoE CPE that isn't clamping MSS. International transit has gotten pretty good over the past 30 years.
In this case I think the robustness principle is actually counterproductive, hear me out:
MTU discovery is "robust". Most nodes already forward any packet they can handle. The problem is that _some_ nodes filter out ICMP packets (hence breaking MTU discovery) but the link still "kind-of-works" (tm). Nodes that break MTU discovery should be considered 100% broken, but instead of fixing them we "accommodate" them by statically (manually) lowering our outgoing MTU, thus perpetuating the problem.
Robustness works against us. If half-broken firewalls were considered fully-broken, they would be replaced instead of accommodated.
Aside: the `Fragmentation Needed` signal should have been part of the IP header, not a separate protocol.
> Most nodes already forward any packet they can handle
On second thought, I guess I am wrong here, since the default MTU in Linux is a fairly low number (1500). Robustness would be to have no such limit by default and fully rely on MTU discovery.
Although the low default was probably established because of the aforementioned ICMP filtering issue.
If you read the actual LKML message, and not just the LWN article, it becomes clear this is really about bumping TSO/LRO sizes in Linux, not about using link-layer MTUs in excess of 64k. This is more akin to Microsoft's "LSOv2" (which is even mentioned in Eric's commit message for the mlx5 driver).
I'm not sure if the LWN author just doesn't understand the distinction, or if they were trying to simplify things for their readers..
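For anyone curious what their own NIC is doing, the knobs being discussed are visible from userspace; a quick check, assuming an interface named eth0:

    # which segmentation/receive offloads the driver currently advertises
    ethtool -k eth0 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload'
    # the per-device GSO size cap (the thing BIG TCP raises) shows up in detailed link output
    ip -d link show dev eth0 | grep -o 'gso_max_size [0-9]*'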
> Modern network interfaces perform segmentation offloading, meaning that much of the work of creating individual packets is done within the interface itself. Making segmentation offloading work with jumbo packets tends to involve a small number of tweaks; a few drivers are updated in the patch set.
I was keying off of this "The BIG TCP patch set adds the logic necessary to generate and accept jumbo packets when the maximum transmission unit (MTU) of a connection is set sufficiently high.", which made it sound to me like the author believed that the MTU was 185k
Part of this problem is there really isn't a layer 2 protocol for discovering the max MTU of a path. You have to use a higher protocol like ICMP to find it and then adjust your endpoint.
The Internet will work with large L3/L4 packets assuming the path is able to create fragments. The article talked about the kernel's ability to hold fragments but it wasn't clear to me how Internet fragmentation comes into play or whether the performance gain is lost when using this on the Internet versus a LAN.
Side tangent, this is one issue I have with the major cloud providers. Azure, AWS, et al. all advertise high-bandwidth interfaces upwards of 100 Gbps, but you can't saturate them anywhere near those rates with 1500-byte packets from the Internet.
Sure, but the amount of traffic inside AWS is way bigger than the amount of traffic that crosses its boundary. This is true for a lot of operators. Netflix and YouTube and similar loads that are virtually all egress are special cases.
I honestly don't think an implementation of TCP that tried to follow this principle would bring anything but misery. I've lived on the internet since IE5… imagine including “hacks for Cisco” in your packets.
I've played around with jumbo frames in the past. One issue is that none of the checksums at any layer were designed for packets this big. Both the Ethernet and TCP checksums don't have enough bits to reliably detect errors, even on 9k frames. This is ok if you're doing checksumming at the application layer, but often you're not, maybe it's something like NFS that assumes lower levels of the stack take care of message integrity. As transmission speeds have gone up, the bit error rate has stayed the same, so you can just do the math on how many errors you're going to have per day.
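Back-of-the-envelope on that last point, assuming the 802.3 spec-limit BER of 1e-12 (real links are often better, but not always):

    10 Gb/s = 1e10 bit/s; 1e10 x 86,400 s/day = 8.64e14 bits/day
    8.64e14 x 1e-12 BER = 864 flipped bits/day
    and a 16-bit checksum passes roughly 1 in 65,536 corrupted packets undetected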
You need to have application-level checksums anyway, since pretty much every router or switch these days will recalculate checksums on all layers (they need to if routing the packet, and since pretty much all COTS chipsets can do L3, they do so all the time even if the switch's software won't let you configure it), so if a bit flips in any router's processing, your checksums won't detect it.
Sure but that means rewriting every application. There was a mind blowing Defcon presentation a few years ago "Bit-squatting: DNS Hijacking Without Exploitation" by Artem Dinaburg where they registered a bunch of domains that were one bit off from well known cloud providers, and recorded all of the requests that leaked out of their internal networks when the DNS response had a bit flipped. That's not to say that you're going to have a huge DNS response, but there is already some amount of this going on at every level of the stack and unless you rewrite every application from DNS on up to check then you're still going to encounter errors.
The article says "Enabling a packet size of 185,000 bytes increased network throughput by nearly 50%" So they essentially tripled the max TSO/GRO size from ~64k to 185k and that resulted in more throughput, presumably due to less overhead.
That's a LOT of per-packet overhead, and it seems a bit surprising to me. In tests I've done, I have not seen much difference between a 16k max TSO and a 64k max TSO. Then again, my benchmark is generally a TCP_STREAM (throughput) test, not a TCP_RR (ping-pong) test.
In general, outside a datacenter, the last thing you want to do is dump 185KB for the same connection on the wire in a giant burst, as you'll blow out router buffers when transiting the internet from 100GbE in your datacenter to some crappy DSL router at your users' houses. So I wonder how much the benefit is reduced when using some kind of software based packet pacing? I think linux has some kind of packet scheduler designed to avoid large bursts, etc. Does it break apart these giant TSOs? Or does using it just cap the max TSO size?
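On the Linux side, the pacing I'm thinking of is the fq qdisc; a minimal sketch of enabling it (how it interacts with 185k aggregates is exactly my open question):

    # fair-queue qdisc with built-in pacing on the egress interface
    tc qdisc replace dev eth0 root fq
    # TCP then sizes its TSO bursts off the pacing rate ("TSO autosizing")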
At least in FreeBSD, the TCP stack controls pacing, and one thing it does is send down smaller TSOs to NICs that don't support hardware pacing. The goal is to reduce the burst size on the wire, so TCP will send down a few KB, wait several milliseconds, dribble down more, etc. This reduces our average TSO size and increases overhead.
One of the things datacenter people want is to avoid dpdk-everything just because they put 4x 200GbE NICs in their server. I sure would like to just zmq everything... Especially for internal streaming applications, that would be a boon.
I haven't read the patch set yet but I'm wondering whether it would work with UDP. TCP can be too much of a kitchen sink if you don't need in-order delivery or retransmissions.
>"Imagine, for a second, that you are trying to keep up with a 100Gb/s network adapter. As networking developer Jesper Brouer described back in 2015, if one is using the longstanding maximum packet size of 1,538 bytes, running the interface at full speed means coping with over eight-million packets per second."
Should this not be "frame size" instead of "packet size"? Unless I am reading this incorrectly, the maximum TCP packet size is 1500 (20 bytes for the IP header, 20 bytes for the TCP header and 1460 bytes for the actual payload). I'm not trying to be pedantic, but the other 38 bytes would all be considered part of the Ethernet frame and not TCP, no?
This is wrong for a few reasons. It is common to try to fit TCP packets into 1500 bytes (minus the bytes for the IP and TCP headers) because 1500 bytes is a common MTU for Ethernet, but this isn't a limitation of TCP, it's just something done to avoid fragmentation. Without extensions TCP packets can be sized up to 64KB. For example, in an environment with jumbo Ethernet frames you might have 9000 byte MTU at the link layer which is just fine. But it's also fine to send a 9000 byte or even 64KB TCP packet over a network with a 1500 byte MTU, the packet will just be split up at the IP layer using IP fragmentation. And as the article points out, RFC 2675 allows creating TCP packets whose payload is even larger than 64KB.
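You can see the fragmentation behaviour from any Linux box with a 1500-byte MTU; a quick illustration (1.1.1.1 is just an arbitrary target, and big ICMP replies aren't guaranteed):

    # 2000-byte payload without forcing DF: the kernel fragments it at the IP layer
    ping -c 1 -s 2000 1.1.1.1
    # the same payload with DF set fails locally with "message too long, mtu=1500"
    ping -c 1 -M do -s 2000 1.1.1.1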
No, it is not wrong. In fact you are supporting my point exactly. My question was about the terminology the author was using. Packets are TCP terminology. Frames are Ethernet terminology, and the 1500-byte limit is because this is the MTU of an Ethernet interface, again a layer 2 concern, not layer 4. See also elsewhere in this thread where I asked what the benefit of this is over jumbo frames.
Seems pretty useful once all the drivers, utilities, and network devices catch up. This will be all the rage in SAN and other high-performance network environments in a couple of years. Would be great to see a more thorough performance evaluation. For example, I wonder if the throughput and latency hold in different network conditions, e.g. conditions with high latency, packet loss, or both. I presume after a certain point of degradation the larger packet size becomes a penalty because your retransmissions become very wasteful.
Do virtual adapters on VMs for communication between host and guest stand to benefit from this, e.g. for scp between host and guest? The current max MTU size on VirtualBox is 9000, right?
Seems strange, there is no fundamental reason why a transaction should have to fit inside a single packet at a lower layer unless there is some nasty leaking of abstractions going on.
Jumbo frames would give you up to 6x fewer packets for the same bandwidth (9000/1500 = 6), so 6x less overhead on the packet processing.
BIG TCP could give up to ~123x fewer packets (185,000/1,500, only theoretically of course) but surely would dramatically lower the processing overhead for big payloads.
Interesting development. However, I do wonder at some point if those desiring fast packet processing shouldn't just bite the bullet and adopt something DPDK-based (e.g. FD.io/VPP).
I think pure-interrupt packet processing is challenged (for the reasons laid out in the article). But getting people to dedicate a core/thread [to RX/TX polling] is also a hard sell.
Kernel bypass is not a panacea. Even with a totally userspace networking stack, there will be a lot of per-packet rather than per-byte processing, and often packets from different sources will be sufficiently interleaved that you can't even do any kind of useful batching fast path.
Taking CPU cores out of normal scheduling and dedicating them to just packet forwarding etc. is the first step. Not sure how popular DDIO is outside the Intel ecosystem, but shunting packets between the NIC and CPU cache (to and from) is very useful for such workloads IMHO.
The organization in question already bypasses the kernel with their “SNAP” scheme. Perhaps they are motivated to keep improving kernel IP stack simply because practically all GCP customers still use it.
> But getting people to dedicate a core/thread [to RX/TX polling] is also a hard sell.
I think that if you're at the level where the kernel network stack is not enough, core pinning is a no-brainer. You're probably spending more than one core on receiving data; moving to a polling thread pinned to a single core will not only improve performance due to the polling, but also due to the scheduler not pushing your process out and to the possibility of putting the threads on the appropriate NUMA nodes.
Part of the pain with kernel bypass is memory management. For real performance gains you need zero copy, and for that you need to work inside the packets. Perfectly doable for a greenfield project, not so easy, I think, for a large existing system.
I wanted to pay someone to work on a DPDK layer for ZeroMQ; I'm not sure anyone would be interested, but I feel it'd help alleviate part of the DPDK pain.
You have to be root/suid, have to implement your netstack yourself (IP, TCP, etc.), have to use the special threads which may or may not play nice with other threading libraries, locking and lock-free stuff, and there are subtle differences between controller behaviours that are smoothed over by the kernel and drivers, but hey, you don't want those. For very simple forwarding and L2/L3 stuff it's easy to start and you can get to high perf quick enough. For anything application-level it can quickly become hard. I lost so much time with e.g. jumbo frame support, ARP, frame re-ordering and low-level footguns; honestly sometimes I miss the pain of plain sockets.
And in the end what I want is mostly giving a large reassembled 2-100MB buffer to userland to avx512-stuff all over - same as I do with zmq on a (bunch of) 10Gb link(s). Or send to the GPU via GPUDirect, and the joy of programming DMAs instead of 'normal' cudaMemcpyAsync. And don't forget to reserve a (bunch of) physical machines for CI.
I'm not saying it's not useful, on the contrary. You get to send/receive 800 Gbit/s of traffic on a single-socket system! But it feels like a setback in usability and learning curve.
I'm saying I'd like a simpler API over it, to transmit/receive pub/sub large messages.
Thanks for the wonderfully detailed response. I had a couple questions:
>"And in the end what I want is mostly giving a large reassembled 2-100MB buffer to userland to avx512-stuff all over - same as I do with zmq"
How does the AVX512 come into play here? I know it's mainly used for HPC-type workloads. Is there a more general usage pattern for it as well? I know Linus was very vocal about his disdain for AVX512 and mentioned people were using it for memcpy [1]. Is that the case for DPDK?
Second is "zmq" here ZeroMQ, the messaging library?
Ah it's mostly signal/video processing, very high throughput, compute-heavy, very little branching. AVX was kind of off-topic there, sorry...
AVX512 (and vector instructions in general) makes sense if you have very repetitive, dense operations to do on your data. Think convolutions, filters, BLAS, FFT... Even better with DPDK since it runs with hugepages, so no TLB thrashing.
DPDK has an option to use AVX512 for memory copy but I mostly do the computing + copy results in one pass, to maximize throughput.
Zmq for ZeroMQ, yes, sorry. The abstraction (send arbitrarily sized messages, receive messages, publish, subscribe, route messages... same API for in-process, inter-process, inter-node) is so nice I miss it when we go down some levels of abstraction, especially with large messages.
I'm not sure I agree with Linus there. Intel is/was in a pinch, unable to ramp up FLOPS/mm2 and core count, so they reached for vector instructions, in an extremely pragmatic 'what do you need them for - OK, you'll get exactly those! And if you need something else, well... have fun with the fun instruction set' way.
Intel also had (understandable) frequency-throttling problems with how dense/hot the fused multiply-adds could get, but it was still painful and not well documented (discover weird frequency quirks on your $8000 chip, yay!), and they have also been (to me) very unclear on the AVX512 roadmap (see the recent Alder Lake adventures).
I get that Linus would like simpler, less kludgy solutions (more cores, more efficient cores, etc.), but if they exist in the CPU world, I'm waiting. Maybe ARM SVE whenever it hits mainstream, maybe... The way to more raw FLOPS is for the moment GPUs/accelerators (TPUs, Tenstorrent, FPGAs, mining ASICs being the extreme there), but they're their own cans of worms :-).
So if you want very low latency (no going back and forth through PCIe, and not motivated enough to use GPUDirect...) and want to stay in the PC world, you're stuck. But AVX512 is actually a very good instruction set! Mask registers, most operations available, cross-lane instructions, all the shuffles you can imagine (well, more than I can imagine), and recently AVX512-FP16 and the vectorized popcount... It's all getting better and better.
A real shame we don't have more hardware with it. Starting with Tiger Lake you get it in consumer hardware! Good luck with the thermals, but it gives a taste.
[1]: https://en.wikipedia.org/wiki/Robustness_principle
[2]: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7738442