Let's compare a 100MB file served via TCP to the same file served via QUIC.
- web server sends 2MB at a time, 50 times, via async sendfile (50 syscalls & kqueue notifications)
- kernel reads data from disk and encrypts it; the data is read once and written once by kTLS in the kernel
- TCP sends data to the NIC in large-ish chunks, 1.5k to 64k at a time; let's say an average of 16k, so the network stack runs 6,250 times to transmit
- the client acks every other frame, so that's 33,333 acks; let's say they are collapsed 2:1 by LRO, so the TCP stack runs 16,666 times to process acks
- web server mmaps or read()'s the file and encrypts it in userspace and sends it 1500b at a time (1 extra memory copy & 66,666 system calls)
- UDP stack runs 66,666 times to send data
- UDP stack runs 33,333 times to receive QUIC acks (no idea what the aggregation is; let's say 2:1)
- kernel wakes up web server to process QUIC acks 33,333 times.
- 4x as many network stack traversals due to the lack of TSO/LRO.
- 1000x as many system calls, due to doing all the packet handling in userspace
- at least one more data copy (kernel -> user) due to data handling in userspace.
Disclaimer: I work on CDN performance. We've served 90Gb/s with a 12-core Xeon-D. To serve the same amount of traffic with QUIC, you'd probably need multiple Xeon Gold CPUs. I guess Google can afford this.
Additionally, since QUIC streams allow for client IP mobility, that creates an additional challenge for IP-level load balancing as well as for handling at the host level. In a well-configured host, TCP packets for a given stream will always arrive at the same NIC queue, on the same CPU, allowing the TCP data structures to be local to that CPU and avoiding cross-CPU locks. In QUIC, the next packet can come from a new IP, which could be ECMP-routed to a different host, or arrive on a different NIC queue and a different CPU. Perhaps your ECMP router and NIC can be taught to look for the QUIC connection IDs, but that doesn't seem at all certain.
sendmmsg (or the upcoming io_uring) let you send multiple UDP packets with a single syscall.
I also don't think a QUIC server would read the whole file into userspace at once - that's just a giant memory waste. Rather, it would be streamed and chunks would get encrypted. That process of course requires an extra copy (likely even two, for the unencrypted and encrypted versions), but that's the same for all userspace file-serving and encryption options and nothing new due to QUIC. For kTLS, it would need to be investigated whether the kernel solution doesn't also perform a copy somewhere (I honestly don't know).
Having written the FreeBSD kernel TLS, I can assure you that there is no copy. Data is brought into the kernel via DMA from storage into a page in the VM page cache. When the IO is done, it is encrypted into a connection-private page. That page is then sent and DMA'ed to the network adapter. So we have, in the kernel TLS case:
- memory DMA to kernel mem from storage.
- memory READ from kernel mem to read plaintext for crypto
- memory write to another chunk of kernel mem to write encrypted data
- memory DMA from kernel mem to NIC
For QUIC you have:
- memory DMA to kernel mem from storage
- memory read from kernel mem via mmap
- memory write to userspace mem to write encrypted data
- memory read from userspace mem to copy to kernel
- memory write to kernel mem
- memory DMA from kernel mem to NIC
Right now, we can just barely serve 100g from a Xeon-D because Intel limited the memory bandwidth to DDR4-2400. At an effective bandwidth limit of 60GB/sec, that's on the edge of being able to handle the kernel TLS data path. So even if everything else about QUIC was free, this extra memory copy from userspace would cut bandwidth by a third.
There are also more general efforts underway on Linux to reduce the system call and copying overhead of processing packets in userspace. TPACKET_V3 is an easy way to vastly increase the scalability of UDP recv processing with minimal application changes. AF_XDP is much more extreme, but it is going to be more implementable than the older DPDK-style semantics. It effectively puts packet buffer management into userspace with the transport. But once you're doing that, you've recaptured much of the advantage that TCP has by running in the kernel.
Even HTTP/2 involves some of the issues the parent mentions. HTTP/2 is not really helpful for large files, and might well perform worse than HTTP/1.1 due to the additional insertion and parsing of framing and flow-control headers. HTTP/2 helps small files most, by avoiding the overhead of connection establishment for them.
100KB is not that much different from 100MB, except the TCP window will not be as far open, so TSO will not be as effective.
Note that I work on a CDN that serves large media files, so I'm biased towards that workload.
But again, couldn't there be NICs with QUIC offload capabilities? Maybe this could even be done with firmware updates (I don't know how much of TCP offloading is done in real hardware).
QUIC/HTTP3 as a protocol is a great improvement to actual internet performance for users, which is what really matters.
Things don't automatically get better. It is hard work, it sucks, and it's not for everyone. It will take years to undo the damage of this transition. We will still be working on it in a decade. There are some very subtle gains, like the elimination of head-of-line blocking. I'm not convinced that outweighs currently realized improvements in TCP congestion control (BBR), and for any application I can think of, the places that really need something message-oriented seem better covered by WebRTC.
What you are really talking about is the Full Employment Theorem.
What "damage" are you talking about? The only issues are compatibility and increased resource utilization on the server-side, both of which will get better as usage increases. It's not a problem. We go through these cycles all the time with all kinds of technology and there's nothing special here.
Nobody would go back to 2010-era tech just because some people decided to make fat websites.
Can you explain more about how the negatives you mention weigh against the positives? Isn't there a net benefit somewhere? If not, can something be changed to give a better balance, like a hybrid solution?
A good document outlining considerations can be found here: https://http3-explained.haxx.se/en/
These aren't intrinsic problems with QUIC, they're common to all new protocols.