Didn't really convey where the magic happens to get whatever frames are produced by the userspace TCP/IP stack to whatever mechanism puts those frames on some wire... isn't it still making some syscall? IOW, where's the I/O happening?
I haven't followed this space for a while, but last time I checked the state of the art was getting your packets in and out via an eBPF program attached with XDP, actually bypassing the kernel network stack (that's not what the article is doing). You'd then be talking to the NIC essentially directly through that hook, which is about as close as it gets to putting packets straight on the wire.
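For a sense of what that looks like, here's roughly the shape of a minimal XDP program (purely an illustrative sketch; a real bypass would normally redirect frames into an AF_XDP socket rather than just passing or dropping them):

    /* Minimal XDP sketch: runs in the NIC driver for every received frame,
     * before the kernel network stack ever sees it. */
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <bpf/bpf_helpers.h>

    SEC("xdp")
    int xdp_inspect(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        /* Drop anything too short to even hold an Ethernet header. */
        if (data + sizeof(struct ethhdr) > data_end)
            return XDP_DROP;

        /* A real bypass would typically bpf_redirect_map() into an AF_XDP
         * socket here; XDP_PASS hands the frame to the normal kernel stack. */
        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";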
You can't really just map a PCIe device into userland (by that I mean the "just" here does a lot of work).
As far as I know, DPDK uses either a BSD NIC driver running in userland (somewhat slow) or XDP on Linux.
No, that is indeed the working principle (plus or minus some yak shaving with IOMMUs and whatnot). It certainly doesn't use XDP; that would defeat the purpose.
I don't see in any way how it would defeat the purpose, especially considering it certainly does use XDP [1] for everything it doesn't have a driver for, and you can opt for it instead of an existing driver if you actually want something performant.
Also, "no" about what? DPDK doesn't just expose the PCIe device to userland. There is what you call the yak shaving with memory, and then it also has a full driver for the NIC so you can actually, you know, do networking.
The simplest way to do this, and the way I think this post is referring to, is to bring up a WireGuard session. WireGuard is essentially just a UDP protocol (so driveable from userland SOCK_DGRAM) that encapsulates whole IP packets, so you can run a TCP stack in userland over it.
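To make the kernel's-eye view concrete: everything the userland stack produces goes out as ordinary UDP datagrams on an ordinary, unprivileged socket. A rough sketch (the WireGuard crypto and the userland TCP/IP stack that would produce pkt are elided; the function name is made up):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* "pkt" would be an already-encrypted WireGuard message wrapping a whole
     * IP packet built in userspace; the kernel only ever sees UDP. */
    int send_encapsulated(const unsigned char *pkt, size_t len,
                          const char *peer_ip, unsigned short peer_port)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* plain, unprivileged UDP */
        if (fd < 0)
            return -1;

        struct sockaddr_in peer = {0};
        peer.sin_family = AF_INET;
        peer.sin_port   = htons(peer_port);
        inet_pton(AF_INET, peer_ip, &peer.sin_addr);

        ssize_t n = sendto(fd, pkt, len, 0,
                           (struct sockaddr *)&peer, sizeof(peer));
        close(fd);
        return n < 0 ? -1 : 0;
    }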
Thanks. I read a bit about WireGuard... seems to create a tunnel over UDP as you say. Where I got off track is the following comment in the article:
> Who says you need to do networking in the kernel though?
I thought it implied they were bypassing the kernel entirely. This isn't true at all if it's still using a UDP socket to send... there's still a ton of code in the kernel to handle sending and receiving UDP. It's also a helluva lot less efficient in terms of CPU consumed, unless one is using sendmmsg, because it's a syscall per tiny datagram between user space and kernel space to get the data onto the UDP socket...
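For what it's worth, the sendmmsg path looks roughly like this; one syscall pushes a whole burst of small datagrams instead of one syscall each (sketch only, error handling trimmed, names made up):

    #define _GNU_SOURCE                 /* sendmmsg(2) is Linux-specific */
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #define BATCH 64

    /* Send up to BATCH small datagrams to one peer with a single syscall. */
    int send_batch(int fd, unsigned char bufs[][1500], size_t lens[],
                   unsigned n, struct sockaddr_in *peer)
    {
        struct mmsghdr msgs[BATCH];
        struct iovec   iovs[BATCH];

        if (n > BATCH)
            n = BATCH;
        memset(msgs, 0, sizeof(msgs));
        for (unsigned i = 0; i < n; i++) {
            iovs[i].iov_base = bufs[i];
            iovs[i].iov_len  = lens[i];
            msgs[i].msg_hdr.msg_iov     = &iovs[i];
            msgs[i].msg_hdr.msg_iovlen  = 1;
            msgs[i].msg_hdr.msg_name    = peer;
            msgs[i].msg_hdr.msg_namelen = sizeof(*peer);
        }
        return sendmmsg(fd, msgs, n, 0);   /* returns how many got sent */
    }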
When people talk about bypassing the kernel, they mean taking control of TCP/IP (UDP, IPv6, ICMP, whatever) in userland; for instance, doing your own congestion control algorithm, or binding to every port on an address all at once, or doing your own routing --- all things the kernel doesn't expose interfaces for. They usually don't mean you're literally not using the kernel for anything.
There are near-total bypass solutions, in which the bulk of the code is the same as you'd use with this WireGuard bypass, where ultra-high performance is the goal, and you're wiring userland up to get packets directly to and from the card DMA rings. That's a possible thing to do too, but you basically need to build the whole system around what you're doing. The WireGuard thing you can do as an unprivileged user --- that is, you can ship an application to your users that does it, without them having to care.
Long story short: sometimes this is about perf, here though not so much.
If you're interested in the latter technique (hitting the HW directly from userspace), take a look at https://www.dpdk.org/. That's the framework that 95% of things that do that use (number pulled directly out of my rear end).
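If you want a feel for it, the core of a DPDK app is roughly the following (condensed from the style of the skeleton examples; take the details with a grain of salt, and EAL arguments / NIC binding setup are omitted):

    #include <stdint.h>
    #include <rte_eal.h>
    #include <rte_ethdev.h>
    #include <rte_lcore.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    int main(int argc, char **argv)
    {
        /* The EAL takes over hugepages, maps the PCIe device(s), etc. */
        if (rte_eal_init(argc, argv) < 0)
            return -1;

        struct rte_mempool *pool = rte_pktmbuf_pool_create(
            "MBUF_POOL", 8191, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

        uint16_t port = 0;                      /* first NIC bound to DPDK */
        struct rte_eth_conf conf = {0};
        rte_eth_dev_configure(port, 1, 1, &conf);
        rte_eth_rx_queue_setup(port, 0, 1024, rte_eth_dev_socket_id(port), NULL, pool);
        rte_eth_tx_queue_setup(port, 0, 1024, rte_eth_dev_socket_id(port), NULL);
        rte_eth_dev_start(port);

        /* Busy-poll the RX ring: frames come straight off the card's DMA
         * ring into userspace mbufs, with no syscalls and no kernel stack. */
        for (;;) {
            struct rte_mbuf *bufs[BURST_SIZE];
            uint16_t nb = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
            for (uint16_t i = 0; i < nb; i++)
                rte_pktmbuf_free(bufs[i]);      /* your own stack goes here */
        }
    }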
I was hoping someone else was going to call out DPDK, because I've never used it, am aware of it, and don't want to explain it. :) You can't possibly offend me by suggesting I'm not aware of some technology.
None of us can ever know ahead of time what techs a stranger has seen. So, it’s fine to say “you might find this helpful.” They will or they won’t but it’s still a kind act.
I've never used DPDK for kernel bypass, only OpenOnload. It was originally for Solarflare hardware; apparently it works with cards from other vendors via XDP (though I've never tried it). It gets loaded via LD_PRELOAD and replaces ordinary BSD socket functions with full userspace sockets that can fall back to the kernel when necessary (e.g. if the application wants to yield the CPU but still get notified about the socket).
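The LD_PRELOAD part is the easy bit, for anyone curious; the hard part is the userspace socket machinery behind it. A toy interposer looks something like this (real Onload obviously does far more than log and fall through to the kernel):

    /* shim.c: build with  gcc -shared -fPIC -o shim.so shim.c -ldl
     * and run with        LD_PRELOAD=./shim.so ./your_app            */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>

    /* Intercept socket(2). A bypass library would construct its own userspace
     * socket object here for the cases it supports and only fall back to the
     * kernel when it has to. */
    int socket(int domain, int type, int protocol)
    {
        static int (*real_socket)(int, int, int);
        if (!real_socket)
            real_socket = (int (*)(int, int, int))dlsym(RTLD_NEXT, "socket");

        fprintf(stderr, "socket(%d, %d, %d) intercepted\n", domain, type, protocol);
        return real_socket(domain, type, protocol);
    }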
To me the key thing with any TCP implementation is the timers. Just sending a single TCP packet causes the kernel to have to track a lot of state and then operate asynchronously from your program to manage this state, which also requires the kernel to manage packet buffering and queuing for you.
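Concretely, the state in question looks something like this per unacked segment (a made-up sketch, not any particular stack's structures); the point is that the retransmit deadlines have to be checked on a timer, completely independent of whether your program is doing anything:

    #include <stdint.h>
    #include <time.h>

    /* Per-segment bookkeeping a TCP implementation has to keep until the
     * peer ACKs the data. The event loop must scan these deadlines even
     * while the application itself is idle. */
    struct unacked_segment {
        uint32_t        seq;        /* first sequence number in the segment */
        uint32_t        len;        /* payload length                       */
        struct timespec rto_at;     /* when to retransmit if no ACK arrives */
        int             retries;    /* drives exponential backoff           */
    };

    static int rto_expired(const struct unacked_segment *s,
                           const struct timespec *now)
    {
        return (now->tv_sec > s->rto_at.tv_sec) ||
               (now->tv_sec == s->rto_at.tv_sec &&
                now->tv_nsec >= s->rto_at.tv_nsec);
    }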
Writing raw ethernet frames, can you do that as an unprivileged user? Or is the point more that we're doing userspace networking, which could nevertheless be run as root?
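For context on the raw-frame route: on Linux that generally means an AF_PACKET socket, which needs CAP_NET_RAW (in practice, root), which is exactly what the UDP/WireGuard approach sidesteps. A rough sketch of that privileged path:

    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>
    #include <net/if.h>
    #include <stdio.h>
    #include <sys/socket.h>

    /* Opening an AF_PACKET socket to read/write raw Ethernet frames requires
     * CAP_NET_RAW; without it this fails with EPERM. */
    int open_raw(const char *ifname)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) {
            perror("socket(AF_PACKET)");
            return -1;
        }

        struct sockaddr_ll addr = {0};
        addr.sll_family   = AF_PACKET;
        addr.sll_protocol = htons(ETH_P_ALL);
        addr.sll_ifindex  = if_nametoindex(ifname);
        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("bind");
            return -1;
        }
        return fd;          /* send()/recv() now move whole Ethernet frames */
    }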
K3s is a proper full k8s and has instructions for running multiple-node clusters, it just ships with batteries included and with defaults that play nice with having a single node.
I agree with Jonathan Blow: everything should go through userland/userspace. Even the graphics stack. I can't tell you how many times my workstation has frozen due to a weird USB driver implementation.
One advantage of Windows over Linux is the fact their graphics stack can easily be restarted while the system/apps/etc are running. The screen just dips to black for a second and you're back where you were. You can see this anytime you install a new graphics driver, but I have occasionally seen this happen spontaneously when a bug in a graphics driver gets caught and Windows reinitializes the driver. X11 definitely can't do that and I believe neither can Wayland. Not sure whether Windows does this for other things such as networking.
I once updated my graphics driver in the middle of a CS match. There was a second of black screen, a few seconds of lower frame rate, then everything was back to normal. I was amazed.
A lot of applications don’t handle the DEVICE_RESET event, though, so you’re likely to see them crash if the driver resets. Firefox used to have a bug where video playback stopped working after a driver reset, but that’s been fixed.
The way QNX handled hard faults was exceptional. Very important for certain environments like self-driving. We don’t have time for kernel panics when lives are on the line.
Really love seeing such a straightforward example of starting with some desire or need and ending up DIY'ing one's own operator.
A lot of the pushback against Kubernetes revolves around whether you 'really need it' or whether to do something else. Seeing someone go beyond just running containers like this highlights the extensibility and shows the core of Kubernetes as a pattern & paradigm for building any kind of platform.
It's neat seeing that done so quickly & readily here. That we can add and manage anything, consistently, quickly, is a promise that I love seeing fulfilled.
While I see what you're getting at, I find the comment funny after the first thing the article leads with is that certain things are a pain in Kubernetes.
Sort of. What the author is trying to do would actually be quite easy if the entire setup were Kubernetes. k3s allows you to connect nodes to your cluster over a Tailscale VPN. If you do that, you don't need to expose your remote AI server to any network at all except the Kubernetes internal network. I'm guessing fly.io doesn't just give you bare servers you can run k3s on, though. The only real difficulty here is hooking up totally different abstraction engines that aren't designed to work with each other. Putting the blame specifically on one of them and not the other doesn't make sense.
The pluggable storage (CSI) and networking (CNI) aspects of Kube tend to be a little limiting, in that a cluster generally only expects to have one at a time running.
There is Multus, an extremely well-regarded CNI that enables mixing and matching different CNI plugins. It works great, but it does involve diving into an additional level of complexity, and building the setup can be exciting/unstable.
Maybe the author didn't have rights over the cluster to do this, or had limited interest in mucking about deeply in this admittedly complex/highly capable system. Or maybe just using an addition like this was appealing as non-disruptive! This seems like a pretty creative & direct way to extend what they already had. There are good, well-trodden options for higher power and for mixed modes of networking, but this strikes me as a nice way to add more without having to revamp the base cluster, which I found super cool & direct.
It's a pain because kubernetes is designed to run multiple workloads on multiple servers. So if you want to access the VPN from some kubernetes containers you're going to have to figure something out.
But nothing is stopping you from just joining all your hosts into the VPN, just like a traditional deployment. Or set it up on your network gateway. This would make it available to all of your containers. Great. You're done.
But if that's not what you want, you'll need to figure something out.
I'm not sure if I follow. You said Kubernetes is great for building any kind of a platform, but then when someone wants controlled access to a VPN it suddenly turns into a no, not like that? Giving only certain parts of your architecture access to certain capabilities is far from a niche use case.
> You said Kubernetes is great for building any kind of a platform
I didn't, that was someone else. I wouldn't say any kind of platform, but it is a great foundation for many platforms.
> it suddenly turns into a no, not like that
Not at all. Read my post and the OP again! Several valid solutions have been offered, some that work out of the box, some that require a little tinkering.
If you want to do a traditional VM deployment you'd segment your workloads per host and put some hosts in the VPN. Sure. Cool. You can do exactly the same with Kubernetes using node pools.
It's only when you want some workloads that have access and some that don't running on the same host machine that you might need to do some tinkering. Just like you'd have to do otherwise. Kubernetes really changes nothing, other than giving you ways to deal with it.
Of course there are plenty of ready-built solutions for this you can just plug in, too. Search for Service Mesh.
> Giving only certain parts of your architecture access to certain capabilities is far from a niche use case
What is the "normal" best practice here then? I would just spin up multiple single-node k3s VM clusters and hook the AI k3s VM to the VPN and the others not.
At its heart, Kubernetes is a workload scheduler, and a fairly opinionated one at that. IMO, it's too complicated for folks that just need an "extensible platform" when all they really need is a dumb scheduler API without all the Kubernetes bells and whistles that have been bolted on over the years.
Kubernetes tries to be everything to everyone, and that makes the entire thing too complicated. Breaking Kubernetes up into smaller, more purpose-built components and letting folks pick, choose, or swap them could be helpful, at the risk of it becoming OpenStack.
There's an alternate reality where Fleet is the de facto scheduler and cats and dogs live together in harmony.
It's a judgement call to say kube is at its heart workload scheduling, and I don't think it's true.
For a while some of the managers were packaged together into some core, but that's no longer true. Yes, it still resembles most people's use case too. But neither of these technical truths convinces me that this incidental happenstance is a core truth.
I haven't heard any arguments for why you think it's "too complicated" to be an extensible platform, and I struggle to imagine what you would rest those arguments upon. There's a RESTful API server, and folks watch the api-server with operators and write code to make the real world resemble that desired state, as best they can. There are many hundreds of operators, and it seems not-hard for many people. So, like, what's so scary about that?
But you insist on calling it "a dumb scheduler API", which again uses bad definitions to argue a bad point that can't be made in good-faith recognition of what's actually at stake here, and which contorts things into a dumber, simpler, wrong view of what's happening.
> Seeing Kubernetes broken up into smaller, more purpose built components and let folks pick and choose or swap could be helpful
This is the exact opposite of what most people ask for, which is integrated kube distros. And they ask for that because kube is a collection of different isolated pieces that happen to compound into something bigger. You already have your wish, you just don't see it.
> At its heart, Kubernetes is a workload scheduler
Wouldn't it be more accurate to state Kubernetes at its core is a distributed control plane?
You want something that inspects the state of a distributed system and mutates it as necessary to reach some "desired state". That's the control plane bit. For fault tolerance, the control plane itself needs to be distributed. This is the exact reason for Kubernetes' existence.
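The whole idea fits in a few lines. A toy sketch of that level-triggered reconcile loop (illustrative only; real controllers watch the API server rather than polling, and these helper names are made up):

    #include <stdio.h>
    #include <unistd.h>

    /* Stand-ins for observing and mutating the real world. */
    static int  observe_running_replicas(void) { return 2; }
    static void start_replica(void)            { puts("starting a replica"); }
    static void stop_replica(void)             { puts("stopping a replica"); }

    /* Compare desired state with observed state and close the gap. */
    static void reconcile(int desired)
    {
        int actual = observe_running_replicas();
        while (actual < desired) { start_replica(); actual++; }
        while (actual > desired) { stop_replica();  actual--; }
    }

    int main(void)
    {
        for (;;) {          /* controllers run this forever, level-triggered */
            reconcile(3);
            sleep(5);
        }
    }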
I'd say the opposite instead: we need Kubernetes distributions, just like Linux needs distributions. Nobody wants to build their kernel from scratch and to hand pick various user space programs.
Same for Kubernetes: distributions which pack everything you need in an opinionated way, so that it's easy to use. Right now it's kinda build-your-own-Kubernetes on every platform: kubeadm, EKS, etc. all require you to install various add-on components before you have a fully usable cluster.
I think the Operator pattern will grow to become exactly what you describe. We're still at the early stage of that, but I can see that a group of operators could become a "distribution", in your example.
What OP did could be done with any containerization platform that abstracted away networking, which should be all of them.
Personally I think you should always be using containerization, because even single node Docker is easy. If you are running something for real, then definitely use Kubernetes.
If you are using containerization, setting up Tailscale is trivial.
Whenever you abstract something away, you can swap out core components like networking willy-nilly. Of course, you can always over-abstract, but a healthy amount is wonderful for most use cases that aren't extremely performance-sensitive.
For those that haven’t seen it: Do yourself a favor and watch it. At least two times. It’s such a strange (haha) movie. Quite unique IMO, can’t think of any other movie that is similar.
I'll second this suggestion. It's an old movie but it really holds up, and is, if anything, newly relevant right now. I'm reluctant to say too much about it because there are a couple of realizations about the movie that are best had yourself.
If you feel like sharing your realizations here, feel free to use rot13 or shoot me an email :)
The first time I saw it, in senior high school, I understood nothing of it. It was just weird. I saw it again in my late 30s and realized it was a great movie – however, I will admit that it’s quite hard for me to articulate why it’s so good. I guess it says something about society and being human, but the wisdom is not at all on-the-nose.
Sbe zr gur zbivr jnf nzhfvat ohg dhvgr pbashfvat hagvy V fybjyl fgnegrq gb ernyvmr ab bar va gur punva bs pbzznaq bs gur ahxrf vf rira erzbgryl fnar. Guvf jnfa'g n fhqqra ernyvmngvba ohg zber ercrngrqyl ernyvmvat gur fnzr guvat nobhg qvssrerag punenpgref. Pyrneyl fbzr ner zber fnar guna bguref, ohg abar bs gurz pbzr pybfr gb jung V'q qrfpevor nf npghnyyl fnar.
Nqq gb gung gur nofheqvfg uhzbe bs gur cerpvbhf obqvyl syhvqf fcrrpu be gur arrqvat gb svaq n dhnegre sbe gur cnlcubar naq lbh'ir tbg n terng zbivr. Vg'f bar cneg pbzrql bar cneg fpnguvat pevgvpvfz nobhg rira gur pbaprcg bs univat na ngbzvp obzo ng nyy.