
QUIC costs something like 2x to 4x as much CPU time to serve large files or streams per byte as compared to TCP. This is because the anti-middlebox protections also mean that modern network hardware and software offloads that greatly reduce CPU time cannot work with QUIC. When combined with the fact that QUIC is userspace, that's just deadly for performance. I'm talking about TSO, LRO (aka GRO), kTLS, and kTLS + hw encryption.

Let's compare a 100MB file served via TCP to the same file served via QUIC.

TCP:

  - web server sends 2MB at a time, 50 times, via async sendfile (50 syscalls & kqueue notifications)
  - kernel reads data from disk and encrypts it. The data is read once and written once by kTLS in the kernel.
  - TCP sends data to the NIC in largish chunks, 1.5k to 64k at a time, let's say an average of 16k. So the network stack runs 6250 times to transmit.
  - The client acks every other frame, so that's 33,333 acks. Let's say they are collapsed 2:1 by LRO, so the TCP stack runs 16,666 times to process acks.

QUIC:

  - web server mmaps or read()s the file, encrypts it in userspace, and sends it 1500 bytes at a time (1 extra memory copy & 66,666 system calls)
  - UDP stack runs 66,666 times to send data
  - UDP stack runs 33,333 times to receive QUIC acks (no idea what the aggregation is, let's say 2:1)
  - kernel wakes up web server to process QUIC acks 33,333 times.
So for QUIC we have:

  - 4x as many network stack traversals due to the lack of TSO/LRO.
  - 1000x as many system calls, due to doing all the packet handling in userspace
  - at least one more data copy (kernel -> user) due to data handling in userspace.
Some of these can be solved by either moving QUIC into the kernel or using a DPDK-like userspace networking solution. However, the lack of TSO/LRO even by itself is a killer for performance.
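
The arithmetic behind the comparison above can be sketched in a few lines of Python. The constants are the round numbers from the comment (100MB file, 2MB sendfile chunks, 16k average TSO chunk, 1500-byte packets, one ack per two packets, 2:1 LRO collapse). The function and variable names are made up for illustration, and this is a back-of-envelope model, not a measurement.

```python
# Back-of-envelope model of the 100MB comparison. All constants are the
# round numbers from the comment above, not measurements.
FILE_BYTES = 100_000_000
SENDFILE_CHUNK = 2_000_000   # bytes per async sendfile call
TSO_CHUNK = 16_000           # assumed average bytes handed to the NIC per TCP send
PACKET_BYTES = 1_500         # bytes per packet on the wire


def tcp_costs():
    packets = FILE_BYTES // PACKET_BYTES   # packets on the wire
    acks = packets // 2                    # client acks every other packet
    return {
        "syscalls": FILE_BYTES // SENDFILE_CHUNK,
        "tx_traversals": FILE_BYTES // TSO_CHUNK,
        "rx_traversals": acks // 2,        # acks collapsed 2:1 by LRO
    }


def quic_costs():
    packets = FILE_BYTES // PACKET_BYTES
    acks = packets // 2
    return {
        "syscalls": packets,       # one send syscall per packet, no batching
        "tx_traversals": packets,  # no TSO
        "rx_traversals": acks,     # no LRO
    }


def traversals(costs):
    # Total trips through the network stack, transmit plus receive.
    return costs["tx_traversals"] + costs["rx_traversals"]


tcp, quic = tcp_costs(), quic_costs()
print("stack traversals: TCP", traversals(tcp), "QUIC", traversals(quic))
print("traversal ratio: %.1fx" % (traversals(quic) / traversals(tcp)))
print("send syscall ratio: %.0fx" % (quic["syscalls"] / tcp["syscalls"]))
```

With these inputs the traversal ratio comes out a bit above 4x and the send-syscall ratio a bit above 1000x, in line with the rough figures quoted.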

Disclaimer: I work on CDN performance. We've served 90Gb/s with a 12-core Xeon-D. To serve the same amount of traffic with QUIC, you'd probably need multiple Xeon Gold CPUs. I guess that Google can afford this.

In addition to those downsides, the QUIC spec points out that middleboxes tend to time out UDP streams pretty aggressively, so it recommends a ping timer of 10 seconds.

Additionally, since QUIC streams allow for client IP mobility, that creates an additional challenge for IP-level load balancing as well as handling at the host level. In a well-configured host, TCP packets for a given stream will always arrive at the same NIC queue, on the same CPU, allowing the TCP data structure to be local to that CPU and avoid cross-CPU locks. In QUIC, the next packet can come from a new IP, which could be ECMP routed to a different host, or arrive on a different NIC queue and a different CPU. Perhaps your ECMP router and NIC can be taught to look for the QUIC connection IDs, but that doesn't seem at all certain.
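
To illustrate the steering problem, here is a toy sketch of picking a receive queue by hashing the address 4-tuple versus hashing a connection ID. Everything in it is an assumption for illustration: the queue count, the addresses, the connection ID, and the use of CRC32 as a stand-in hash (real NICs typically use a Toeplitz hash for RSS).

```python
import zlib

NUM_QUEUES = 8  # hypothetical number of NIC receive queues


def queue_by_4tuple(src_ip, src_port, dst_ip, dst_port):
    # Stand-in for RSS-style hashing of the address 4-tuple.
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % NUM_QUEUES


def queue_by_connection_id(connection_id):
    # Steering on the connection ID ignores the client's address entirely.
    return zlib.crc32(connection_id) % NUM_QUEUES


cid = bytes.fromhex("a1b2c3d4e5f60718")              # made-up connection ID
before = ("203.0.113.7", 50000, "198.51.100.1", 443)
after = ("192.0.2.99", 41000, "198.51.100.1", 443)   # same client, new network

print("4-tuple queue before/after:",
      queue_by_4tuple(*before), queue_by_4tuple(*after))
print("conn-ID queue before/after:",
      queue_by_connection_id(cid), queue_by_connection_id(cid))
```

The 4-tuple hash may land on a different queue once the client address changes, while the connection-ID hash stays put by construction. Whether real routers and NICs can be made to do the latter is exactly the open question raised above.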

That's not really a fair comparison. In the case that the IP changes for QUIC, TCP would have to completely re-establish the connection. A cross-core memory access is tiny in comparison.

Thanks for sharing the insight... This being HN, it doesn't necessarily read to me like a disadvantage for QUIC the protocol as much as an opportunity for someone to come up with a way to do hardware-assisted QUIC in the networking interface...

My first thought as well. So much so that I fully expect that we are going to see multiple companies pop up in the next few years that will take a shot at making said hardware.

> and sends it 1500b at a time

sendmmsg (or the upcoming io_uring) let you send multiple UDP packets with a single syscall.

While this is useful, I don't think it would completely resolve the noted "tons of send syscalls" issue. QUIC performs its own flow control and I don't think it can just send all the packets composing a file at once (all the time, at least).

If your server handles many connections simultaneously you can still bundle a lot of packets in a single sendmmsg syscall; it can dispatch to a different destination address for each packet.

But I think each of these UDP packets will still travel separately from the syscall layer to the NIC (e.g., no TSO). So you're still a factor of 40 or so behind TCP + TSO.
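
A rough way to see where the "factor of 40 or so" comes from, assuming the same 100MB file and 1500-byte packets as above. The batch size of 64 packets per sendmmsg call and the 64KB TSO chunk are assumptions picked for illustration, and the names are hypothetical.

```python
import math

FILE_BYTES = 100_000_000
PACKET_BYTES = 1_500
TSO_MAX_BYTES = 65_536   # assumed largest chunk TCP hands to a TSO-capable NIC
BATCH = 64               # hypothetical packets per sendmmsg call

packets = FILE_BYTES // PACKET_BYTES


def udp_costs(batch):
    # Batching divides the syscall count, but every packet still makes its
    # own trip from the socket layer down to the NIC.
    return {
        "syscalls": math.ceil(packets / batch),
        "stack_traversals": packets,
    }


def tcp_tso_traversals():
    # Best case for TCP: every send is a full TSO-sized chunk.
    return math.ceil(FILE_BYTES / TSO_MAX_BYTES)


unbatched, batched = udp_costs(1), udp_costs(BATCH)
print("UDP syscalls:", unbatched["syscalls"], "->", batched["syscalls"])
print("UDP stack traversals (unchanged):", batched["stack_traversals"])
print("TCP+TSO stack traversals:", tcp_tso_traversals())
print("traversal gap: %.0fx" % (batched["stack_traversals"] / tcp_tso_traversals()))
```

Under these assumptions batching cuts the syscall count by the batch size but leaves the per-packet stack traversals untouched, and the gap to best-case TSO stays in the 40x range.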

I think in general I agree. However, the overhead numbers are exaggerated, and we should be fair about that. E.g. it was already mentioned that multiple UDP packets can be transmitted via a single syscall, and reasonable implementations can make use of it. I haven't read the QUIC spec (yet), so I don't know how much data can be aggregated without waiting for ACKs or interleaving other data - but if it's anything comparable to HTTP/2 then it should be configurable and support >= 64kB chunks.

I also don't think a QUIC server would read the whole file into user-space at once - that's just a giant memory waste. Rather it would be streamed and chunks would get encrypted. That process requires of course an extra copy (likely even two for the unencrypted and encrypted version), but that's the same for all user-space file serving and encryption options and nothing new due to QUIC. For KTLS it would need to get investigated whether the kernel solution doesn't also perform a copy somewhere (I honestly don't know).

Of course it is not going to read the entire file at once.

Having written the FreeBSD kernel TLS, I can assure you that there is no copy. Data is brought into the kernel via DMA from storage into a page in the VM page cache. When the IO is done, it is then encrypted into a connection-private page. That page is then sent and DMA'ed to the network adapter. So we have in the kernel TLS case:

  - memory DMA to kernel mem from storage.
  - memory READ from kernel mem to read plaintext for crypto
  - memory write to another chunk of kernel mem to write encrypted data
  - memory DMA from kernel mem to NIC
In the case where the NIC supports inline TLS offload, the middle 2 steps are skipped, and it devolves to essentially the unencrypted case.

For QUIC you have:

  - memory DMA to kernel mem from storage
  - memory read from kernel mem via mmap
  - memory write to userspace mem to write encrypted data
  - memory read from userspace mem to copy to kernel
  - memory write to kernel mem
  - memory DMA from kernel mem to NIC
So you go from 3 "copies" to 4 "copies", which increases memory bandwidth demands by 33%.

Right now, we can just barely serve 100g from a Xeon-D because Intel limited the memory bandwidth to DDR4-2400. At an effective bandwidth limit of 60GB/sec, that's on the edge of being able to handle the kernel TLS data path. So even if everything else about QUIC was free, this extra memory copy from userspace would cut bandwidth by a third.
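
The memory-bandwidth argument can be put into numbers with a small sketch. One assumption to flag loudly: this tally counts every memory touch in the two lists above (4 for kernel TLS, 6 for userspace QUIC), which is a different accounting from the "copies" count in the comment. The 60GB/sec figure is the one quoted above, and the function names are made up.

```python
MEM_BW_GBYTES = 60.0  # effective memory bandwidth quoted above, in GB/s

# Assumption: count each listed memory touch once.
KTLS_TOUCHES = 4  # DMA in, read plaintext, write ciphertext, DMA out
QUIC_TOUCHES = 6  # DMA in, read, write ciphertext, read for copy, write to kernel, DMA out


def memory_demand_gbytes(network_gbits, touches):
    # Memory traffic in GB/s needed to sustain a given network rate in Gb/s.
    return network_gbits / 8 * touches


def max_network_gbits(touches, mem_bw=MEM_BW_GBYTES):
    # Network rate in Gb/s at which memory bandwidth is fully consumed.
    return mem_bw / touches * 8


for name, touches in (("kTLS", KTLS_TOUCHES), ("QUIC", QUIC_TOUCHES)):
    print(name,
          "needs", memory_demand_gbytes(100, touches), "GB/s at 100Gb/s;",
          "ceiling", max_network_gbits(touches), "Gb/s")
```

On this tally, 100Gb/s through the kernel TLS path needs 50GB/s of memory traffic, which is close to the 60GB/s limit, and the userspace path lowers the memory-bound ceiling by a third.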

Good to know. Thanks for the explanation and all the insights!

What does the TLS vs QUIC look like though?

There's no reason the offloads can't work with QUIC. Linux already has UDP GSO (https://lwn.net/Articles/752184/). There's no technical reason I can think of that kTLS cannot be implemented for UDP on Linux, it's just not there today.
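
For the curious, here is a minimal sketch of turning on UDP GSO for a socket from Python. The option number 103 is `UDP_SEGMENT` from `linux/udp.h`, defined by hand here rather than assuming the `socket` module exports it. The helper name and the 1200-byte segment size are made up for illustration, and on kernels or platforms without the feature the call simply reports failure.

```python
import socket

UDP_SEGMENT = 103  # value from linux/udp.h, defined by hand here


def enable_udp_gso(sock, segment_size):
    # Ask the kernel to split large sends on this socket into datagrams of
    # segment_size bytes. Returns False where the option is unsupported.
    try:
        sock.setsockopt(socket.IPPROTO_UDP, UDP_SEGMENT, segment_size)
        return True
    except OSError:
        return False


sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
gso_on = enable_udp_gso(sock, 1200)  # 1200 is an arbitrary example size
print("UDP GSO enabled:", gso_on)
# When enabled, one large send on this socket is meant to traverse the stack
# once and be cut into segment_size datagrams at the bottom, which is the
# UDP analogue of what TSO does for TCP.
sock.close()
```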

There are also more general efforts underway on Linux to reduce the system call and copying overhead of processing packets in userspace. TPACKET_V3 is an easy way to vastly increase the scalability of UDP recv processing with minimal application changes. AF_XDP is much more extreme, but it is going to be more implementable than the older DPDK-style semantics. It effectively will put packet buffer management into userspace with the transport. But once you're doing that, you have recaptured many of the advantages that TCP has by running in the kernel.

Two questions: can't large files continue to be served on HTTP2? and won't https://www.dpdk.org/ allow user-space network stacks to do segmentation, etc...? (Maybe it's too immature?)

Regarding the first question:

Even HTTP/2 involves some of the issues the parent mentions. HTTP/2 is not really helpful for large files, and may well perform worse than HTTP/1.1 due to the additional insertion and parsing of control flow headers. HTTP/2 helps small files most, by avoiding the overhead of connection establishment for those.

How much of this applies to 1~100KB responses?

1k, not so much since there is no aggregation that can happen there anyway.

100k is not that much different than 100mb, except the TCP window will not be as far open, so TSO will not be as effective.

Note that I work on a CDN that serves large media files, so I'm biased towards that workload.

Awesome analysis. This is the first time I've read about the downsides of QUIC. I'm curious whether implementing it in userspace was a conscious trade-off made knowing the performance downside, or whether Google/IETF weren't aware of the problem at all?

Implementing new transport protocols in kernelspace has significant downsides for adoption. In fact, it has been a long time since anyone tried it.

Wouldn't pretty much all of that overhead compared to TCP vanish if QUIC was implemented in the kernel?

No, due to the lack of TSO/LRO. It's my understanding that QUIC is designed to encrypt packet metadata so that middleboxes cannot re-segment traffic. This same feature prevents NICs from doing TSO.

Ok thanks, that makes sense. For anyone else wondering what TSO is, see https://en.wikipedia.org/wiki/Large_send_offload

But again, couldn't there be NICs with offloading QUIC capabilities? Maybe this could even be done with firmware updates (I don't know how much of the TCP offloading is done in real hardware)

If the NIC is given the key for the connection, it can do the segmenting, encrypting and retransmissions.

This is just normal technological progress. CPU time is cheap and scalable, and the protocol will keep getting more optimized with better software and hardware. Similar issues were brought up with HTTP2 using TLS everywhere and messing with proxies but that's no longer a problem.

QUIC/HTTP3 as a protocol is a great improvement to actual internet performance for users which is what really matters.

Picking your comment as the newest instance but this is one of the dumbest memes I see in this thread.

Things don't automatically get better. It is hard work, it sucks, and it's not for everyone. It will take years to undo the damage of this transition. We will still be working on it in a decade. There are some very subtle gains like HOL-blocking. I'm not convinced that outweighs current actualized improvements in TCP congestion control (BBR), and for any application I can think of the places that really need something message-oriented seem better covered by WebRTC.

What you are really talking about is the Full Employment Theorem.

Yes, progress obviously takes effort. What part is a "meme"? Leave that nonsense out of HN.

What "damage" are you talking about? The only issues are compatibility and increased resource utilization on the server-side, both of which will get better as usage increases. It's not a problem. We go through these cycles all the time with all kinds of technology and there's nothing special here.

It's thinking like that which leads to web page bloat. CPU resources aren't free, especially in an environmental capacity.

No it's not. Webpage bloat is a developer issue, not a technical problem.

QUIC is a new protocol to make user experiences better. There's a tradeoff in more server CPU but that's cheaper, more scalable, and will only be short-term as things quickly improve. The actual comparison would be rendering engines and Javascript runtimes that have become more complicated to build and run but are faster and more functional in return.

Nobody would go back to the 2010 tech days just because some people decided to make fat websites.

> To serve the same amount of traffic with QUIC, you'd probably need multiple Xeon Gold CPUS. I guess that Google can afford this.

Can you explain more about how the negatives you mention weigh up against the positives? There isn't a net benefit somewhere? If not, can something be changed to give a better balance like a hybrid solution?

Personally I believe that the majority of the positives cater to privacy. That being said, there are other positive things about IETF QUIC that will likely play into new functionality over time.

A good document outlining considerations can be found here: https://http3-explained.haxx.se/en/

tl;dr - QUIC doesn't have kernel or hardware support (yet).

These aren't intrinsic problems with QUIC, they're common to all new protocols.
