Generic Segmentation Offload (GSO) has been supported for UDP in the Linux kernel since 4.18, so I would assume these optimizations should be easily possible for the kernel driver as well, and could take advantage of any drivers that pass GSO down to the network hardware too.
Kind of a weird optimization though; I'm not exactly sure why it works as well as it appears to. I suspect the gain is far less noticeable/important at <1 Gbps, which may be why there aren't any benchmarks for lower-end devices or bandwidth-constrained paths.
On another front, I wonder whether there would be any advantage to encapsulating a native UDP protocol like WireGuard in QUIC frames. Might this result in increased reliability and improved packet handling from firewalls and middleboxes?
GSO improves performance because you can send a ~64 kB "datagram" through the stack, and then it will be split up either by the network card (if it supports the offload) or by the kernel. This is a lot more efficient than sending a pile of 1500-byte packets through the stack, and it saves a lot of system calls and context switches into the kernel.
By way of comparison, when you send data over TCP, the system call interface lets you write very large chunks, so a single system call can write gigabytes of data.
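To make the mechanism concrete, here's a minimal sketch (not wireguard-go's actual code) of a single oversized UDP send that the kernel or a capable NIC then slices into wire-sized segments. It assumes Linux >= 4.18 and the golang.org/x/sys/unix package; the destination address is a placeholder.

```go
package main

import (
	"log"
	"net"

	"golang.org/x/sys/unix"
)

func main() {
	// Placeholder destination (TEST-NET-1 address, WireGuard's default port).
	conn, err := net.DialUDP("udp4", nil, &net.UDPAddr{IP: net.IPv4(192, 0, 2, 1), Port: 51820})
	if err != nil {
		log.Fatal(err)
	}
	raw, err := conn.SyscallConn()
	if err != nil {
		log.Fatal(err)
	}

	const gsoSize = 1400 // payload bytes per on-wire segment
	raw.Control(func(fd uintptr) {
		// After this, each write is split into gsoSize-byte UDP packets by the
		// kernel, or by the NIC if it supports the segmentation offload.
		if err := unix.SetsockoptInt(int(fd), unix.SOL_UDP, unix.UDP_SEGMENT, gsoSize); err != nil {
			log.Fatal(err)
		}
	})

	// One ~63 kB write traverses the stack once instead of ~45 separate sends.
	if _, err := conn.Write(make([]byte, 45*gsoSize)); err != nil {
		log.Fatal(err)
	}
}
```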
Isn't the article specifically talking about tx-udp-segmentation in the TUN driver though, and not the hardware NIC?
From the article:
> The TUN driver support in v6.2 was the missing piece
> needed to improve UDP throughput over wireguard-go. With
> it toggled on, wireguard-go can receive “monster” UDP
> datagrams from the kernel
WireGuard UDP packets are already pretty minimal in structure. The first four bytes of a data packet are 4, 0, 0, 0, followed by another 12 bytes of header and then the encrypted data.
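For reference, a rough sketch of that layout in Go (the type and field names here are mine, not wireguard-go's, and the sample packet is made up purely to show the split):

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// transportHeader mirrors the 16-byte header in front of every WireGuard
// data packet: message type 4 (the 4,0,0,0 bytes), receiver index, counter.
type transportHeader struct {
	Type     uint32 // always 4 for transport data
	Receiver uint32 // index the peer assigned during the handshake
	Counter  uint64 // nonce / replay counter for the AEAD
}

func parseTransport(pkt []byte) (transportHeader, []byte, error) {
	if len(pkt) < 16 {
		return transportHeader{}, nil, errors.New("short packet")
	}
	h := transportHeader{
		Type:     binary.LittleEndian.Uint32(pkt[0:4]),
		Receiver: binary.LittleEndian.Uint32(pkt[4:8]),
		Counter:  binary.LittleEndian.Uint64(pkt[8:16]),
	}
	// Everything past the header is ciphertext: the inner IP packet plus a
	// 16-byte Poly1305 tag.
	return h, pkt[16:], nil
}

func main() {
	// A made-up packet: 16-byte header plus a keepalive-sized ciphertext
	// (just the 16-byte auth tag, no inner packet).
	pkt := make([]byte, 32)
	pkt[0] = 4                                      // type
	binary.LittleEndian.PutUint32(pkt[4:8], 0x1234) // receiver index
	binary.LittleEndian.PutUint64(pkt[8:16], 7)     // counter
	h, ct, _ := parseTransport(pkt)
	fmt.Printf("type=%d receiver=%#x counter=%d ciphertext=%d bytes\n",
		h.Type, h.Receiver, h.Counter, len(ct))
}
```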
Tangentially, I've faced some interesting challenges getting a multi-gigabit WireGuard VPN operating over my 2 Gbps Frontier connection.
My UDM Pro seems to top out around ~800 Mbit/s per UDP stream, pegged at 100% CPU on a single core. Likely it can't keep up with the interrupt rate, given that it's ksoftirqd that's pegging it. I replaced the UDM Pro with a pfSense machine.
Then I started getting 100% packet loss at the edge of Frontier's network after a couple of minutes of sustained UDP near-line-rate throughput. In the end, after trying and failing to explain this to Frontier's tech support, I reached out to their engineering management on LinkedIn and got put in touch with the local NOC director. It turns out some intermediate hop is rebooting after a few minutes of this, and they're "in contact with the manufacturer". Haven't heard back in a few months.
tl;dr: as >1 Gbps connections become more common, other bottlenecks will become apparent!
You could look for a better ISP. The larger problem is that in the US it's completely normal for there to be no actual choice, or for your "choice" to be between two equally huge uninterested corporations who know they don't need to be better than each other to keep the same revenue.
Separating the last-mile infrastructure from the ISP can make it possible to have a natural monopoly for everybody's last mile, but widespread competition among ISPs. That might be really hard to pull off in the US, but I think it'd be worth striving for.
> Separating the last-mile infrastructure from the ISP can make it possible to have a natural monopoly for everybody's last mile, but widespread competition among ISPs. That might be really hard to pull off in the US, but I think it'd be worth striving for.
Or even better, the model we have in France: the last mile is a monopoly for a limited time only (2-3 years). So if you build a connection to some place that didn't have one, you can profit from the exclusivity for some time, and you're incentivised to be good to the consumers because they can switch, but they will probably only do so if you're shit/too expensive.
It'd require sensible regulation, so the Republicans simply won't stand for it, and it's not one of the Democrats' main issues, so they couldn't be bothered.
The problem is how satisfied people are when they get to just blame the other side and not bother with any further thought. As long as people like you reward that mentality then it will never be fixed.
Yes, it must be me to blame, not the bad faith actors in office. Would you like to collect the two cent payment I offer for exposure to my wrongthink now or later?
They customized it some, but it's all more or less upstream condoned code that Jason built.
Also, if you want to access your tailscale network, but don't have permissions to create a tun or wg device, the fully userspace implementation can work in that situation, which seems like a nice property to have.
Also, WireGuard is really easy to implement, which makes having multiple implementations less of a problem: each implementation is more likely to be correct and invulnerable.
A small implementation was a design objective of WireGuard, after the horrors of IPsec (see Linus's email praising the difference).
What's the right way to interpret the last section on CPU utilization? I.e., now that you're able to achieve 12.5 Gbps, how much overhead is this at a machine level?
Also, was ENA Express used on the c6i.8xlarge? That should allow getting past the EC2 single-flow limits.
I wonder what explains the large gap between the Mellanox and AWS kernel WireGuard results (the original Go code shows a much smaller difference, so it shouldn't just be CPU speed).
Off-topic: Tailscale seems like such a perfect acquisition target for Cloudflare. It seems like there's perfect product and culture alignment. Amirite?
This is pretty cool, but I would have liked to see more benchmarks from phones to servers instead of Linux box to Linux box.
UDP and QUIC are most effective serving traffic from phones to servers, not from Linux box to Linux box.
Linux boxes are typically either servers or behind a corporate firewall, e.g. user laptops such as the one I use at work. There are distinct disadvantages to running QUIC in that environment:
* UDP is often blocked by default by corporate firewalls.
* Having everything in user space means having to actually update the user software before getting a security patch deep in the transport layer, compared with getting a kernel patch to fix a TCP vulnerability, which typically happens more often on a Linux box and is more stable than updating userspace software.
* TCP throughput in a data center or behind a corporate firewall is typically fast enough for most needs.
However, from a phone on a cell tower, QUIC starts to make sense:
* Having everything in user space means I can pick up security patches every time I update the app, which is much more frequent than OS updates on, for example, Android.
* Having everything over UDP means I get the usual non-head-of-line-blocking benefits so often touted, with top-notch security as well.
> UDP is often blocked by default by corporate firewalls.
I mean, that sounds a lot like a “them problem”. Kind of like the people that can’t use grpc because their silly corporate firewall MITM’s the encryption. The rest of us aren’t beholden to archaic IT decisions.
> TCP throughput in a data center or behind a corporate firewall is typically fast enough for most needs
Yeah, but if I can go even faster, why wouldn't I? QUIC gives per-substream back pressure out of the box; that's so useful! No HoL blocking, substreams, out-of-order reassembly: there are so many neat features, why wouldn't you want to use it, given the chance?
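For example, here's a rough sketch of the substream point, assuming the github.com/quic-go/quic-go package (whose DialAddr signature has changed across releases); the address, ALPN string, and payloads are placeholders:

```go
package main

import (
	"context"
	"crypto/tls"
	"log"

	quic "github.com/quic-go/quic-go"
)

func main() {
	ctx := context.Background()
	// Placeholder TLS config and ALPN; a real deployment verifies certificates.
	tlsConf := &tls.Config{InsecureSkipVerify: true, NextProtos: []string{"example-proto"}}

	conn, err := quic.DialAddr(ctx, "example.com:4433", tlsConf, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Each stream gets its own flow control and ordering, so a lost packet on
	// the bulk stream doesn't stall the latency-sensitive one.
	for _, msg := range []string{"bulk transfer", "latency-sensitive ping"} {
		s, err := conn.OpenStreamSync(ctx)
		if err != nil {
			log.Fatal(err)
		}
		if _, err := s.Write([]byte(msg)); err != nil {
			log.Fatal(err)
		}
		s.Close()
	}
}
```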
> Having everything in user space…
Means we basically get "user-space networking for the rest of us", and that we can run QUIC implementations that fit our own applications' requirements.
Regarding kernel patches coming more often than userspace updates, presumably you mean in the context of random third-party apps not maintained as well as the major browsers. That's fair, but if QUIC becomes popular enough, I imagine we'll see distros including QUIC dlls that these minor apps link against.