At least for TX it's much easier to get that performance today due to relatively recent kernel changes. Unsure about RX though.
A few years ago I worked on the upload service within the YouTube Android app. Shortly after YouTube started using QUIC for video streaming, I decided it would be interesting to experiment with QUIC for mobile uploads; my thesis was that its BBR implementation might lead to better throughput on lossy wireless connections and therefore faster mobile uploads. I discovered, though, that when tethered to my dev workstation I wasn't able to break ~100 Mbit/s, while the native TCP stack was able to saturate the USB-C gigabit ethernet adapter I was using. This problem got the attention of the QUIC folks, which led to a more detailed investigation than I was capable of, and eventually to the kernel patches described here: http://vger.kernel.org/lpc_net2018_talks/willemdebruijn-lpc2...
I'd like to think my little project was the major motivator for that work, but I'm pretty sure the server efficiency gains described in that paper were a much stronger motivation. :)
That paper is a real disappointment in the places it doesn't go.
Specifically, if QUIC is to have high performance and high power efficiency, just batching the packets isn't enough. Think of a use case: a CDN wants to send a block of data from disk to a client. It should be able to direct that data from the SSD to the network card without the CPU ever touching the payload.
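For contrast, TCP already has that shape today: with kTLS the kernel (or a capable NIC) handles the encryption, and sendfile(2) moves file pages to the socket without the payload ever entering userspace. A minimal sketch of that existing TCP path (kTLS setup omitted, error handling trimmed; serve_file is just an illustrative helper). QUIC has no equivalent:

    #include <sys/sendfile.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Stream a file over an already-connected TCP socket. The kernel
       copies pages from the page cache straight to the socket, so the
       bytes never pass through a userspace buffer. */
    ssize_t serve_file(int sock, const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        off_t off = 0;
        ssize_t sent = sendfile(sock, fd, &off, 1 << 20); /* up to 1 MiB */
        close(fd);
        return sent;
    }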
On the receive side, a mobile device should be able to download a big file to RAM or flash with the main CPU asleep most of the time.
That means the network device should be encrypting the packets, doing the pacing, doing the retries, etc.
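Pieces of that exist in software already: pair the fq qdisc with the SO_MAX_PACING_RATE socket option and the kernel does the pacing for you, which I believe is how some QUIC stacks avoid userspace pacing timers. A sketch (the 1 Gbit/s rate is just an illustrative value):

    #include <sys/socket.h>

    /* With the fq qdisc on the egress interface, the kernel paces this
       socket's packets to the given rate in bytes per second, so the
       application needs no pacing timers of its own. */
    void set_pacing(int fd)
    {
        unsigned int rate = 125000000; /* 1 Gbit/s */
        setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE, &rate, sizeof(rate));
    }

But encryption and retransmission still burn CPU cycles in userspace, so that only covers one of the three.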
It may be that a QUIC-specific set of options or syscalls is best for that, or it might be that the application should provide a "this is how to send my data" set of bytecode which the network hardware executes repeatedly.
But just batching UDP packets is a halfway measure that isn't in the direction of the final goal.
Right, that paper doesn't introduce any earth-shattering innovation; it just fixes a long-standing issue with UDP performance in the Linux kernel. That ain't nothing.
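Concretely, the headline fix is UDP segmentation offload: the UDP_SEGMENT socket option (Linux 4.18+) lets an application hand the kernel one large buffer per syscall and have the kernel, or the NIC, slice it into wire-size datagrams. A rough sketch:

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/udp.h> /* SOL_UDP; UDP_SEGMENT on newer glibc */

    #ifndef UDP_SEGMENT
    #define UDP_SEGMENT 103  /* from linux/udp.h, kernel 4.18+ */
    #endif

    /* Ask the kernel to slice each send() on this connected UDP socket
       into 1200-byte datagrams (a typical QUIC payload size). A single
       send(fd, buf, 50 * 1200, 0) then replaces fifty sendto() calls,
       amortizing the per-syscall overhead the paper measured. */
    void enable_udp_gso(int fd)
    {
        int gso_size = 1200;
        setsockopt(fd, SOL_UDP, UDP_SEGMENT, &gso_size, sizeof(gso_size));
    }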
A large motivator to develop and deploy QUIC is that by implementing much of the stack as a userspace library the protocol can be iterated on at a pace impossible with TCP. BBR congestion control experiments are a great example of that; there's enormous value in quick iteration, even at the cost of compute efficiency.
I agree with your sentiment, though; I have to imagine the protocol will eventually settle down and hardware acceleration will become a thing - but IMHO not to the degree you propose.
Wouldn't this mean the network interface is now operating at Layer 7? It would have to understand how QUIC works. Sounds like a SmartNIC, though those seem mainly enterprise-focused.
I don't remember too much detail off the top of my head, but the CPU was a newish 128-core AMD and the NIC was a Mellanox; I'd have to dig out my work stuff to get more detail.
Ah, so it's an absolute beast of a machine! I thought your response might have been pointing to additional efficiency in new hardware, but it's more that you have more powerful hardware.
Also, given that this was on 2014-ish hardware, 2 GHz / 8 cores sounds a lot like a Xeon L. A high-powered Xeon or something like an i7 probably could have done a lot better already, simply because of raw single-core performance. So these numbers definitely sound reasonable.
Five years later, I have servers that can do 2 million (multi-process) without messing with NUMA or NIC settings.