Hi! This is a blog post sharing some of the low-level Linux networking we're doing at Modal with WireGuard.
As a serverless platform, we hit a tricky tradeoff: we run multi-tenant user workloads on machines around the world, and each serverless function is an autoscaling container pool. How do you let users give their functions static IPs while keeping those IPs decoupled from the compute, which needs to stay flexible?
We needed a high-availability VPN proxy for containers and didn't find one, so we built our own on top of WireGuard and open-sourced it at https://github.com/modal-labs/vprox
Let us know if you have thoughts! I'm relatively new to low-level container networking, and we (me + my coworkers Luis and Jeffrey + others) have enjoyed working on this.
Thanks. We did check out Tailscale, but they didn't quite have what we were looking for: a high-availability custom component that plugs into a low-level container runtime. (Which makes sense; it's pretty different from their intended use case.)
Modal is actually a happy customer of Tailscale (but for other purposes). :D
So if a company only needs an outbound VPN for their road warriors and not an inbound VPN to access internal servers, vprox could be a simpler alternative to Tailscale?
We use gVisor! It's an open-source application security sandbox spun off from Google. We work with the gVisor team to get the features we need (notably GPUs / CUDA support) and also help test gVisor upstream https://gvisor.dev/users/
It's also used by Google Kubernetes Engine, OpenAI, and Cloudflare among others to run untrusted code.
We don't use Kubernetes to run user workloads, we do use gVisor. We don't use MIG (multi-instance GPU) or MPS. If you run a container on Modal using N GPUs, you get the entire N GPUs.
Re not using Kubernetes, we have our own custom container runtime in Rust with optimizations like lazy loading of content-addressed file systems. https://www.youtube.com/watch?v=SlkEW4C2kd4
Yep! This is something we have internal tests for, haha; you have good instincts that it can be tricky. Here's an example of using that for multi-GPU training: https://modal.com/docs/examples/llm-finetuning
Okay, well think very deeply about what you are saying about isolation; the topology of the hardware; and why NVIDIA does not allow P2P access even in vGPU settings except in specific circumstances that are not yours. I think if it were as easy to make the isolation promises you are making, NVIDIA would already do it. Malformed NVLink messages make GPUs fall off the bus even in trusted applications.
Sorry, just realized the misunderstanding. To clarify: Modal still uses the kernel WireGuard module. The userspace part, which is in Go rather than the other languages we use, is the wgctrl library.
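If it helps make that split concrete, here's a rough sketch (not vprox's actual code; the device name, keys, and addresses are made up, and error handling is trimmed) of what driving the kernel module through wgctrl looks like:

```go
package main

import (
	"log"
	"net"

	"golang.zx2c4.com/wireguard/wgctrl"
	"golang.zx2c4.com/wireguard/wgctrl/wgtypes"
)

func main() {
	// wgctrl only sends configuration to the kernel WireGuard module
	// (over netlink); the encrypted packet path never leaves the kernel.
	client, err := wgctrl.New()
	if err != nil {
		log.Fatalf("open wgctrl: %v", err)
	}
	defer client.Close()

	// Keys generated inline purely for illustration; a real control plane
	// exchanges public keys out of band.
	serverKey, _ := wgtypes.GeneratePrivateKey()
	peerKey, _ := wgtypes.GeneratePrivateKey()

	port := 51820
	_, peerNet, _ := net.ParseCIDR("10.0.0.2/32")

	cfg := wgtypes.Config{
		PrivateKey:   &serverKey,
		ListenPort:   &port,
		ReplacePeers: true,
		Peers: []wgtypes.PeerConfig{{
			PublicKey:         peerKey.PublicKey(),
			ReplaceAllowedIPs: true,
			AllowedIPs:        []net.IPNet{*peerNet},
		}},
	}

	// "wg0" is assumed to already exist, e.g. created with
	// `ip link add wg0 type wireguard`.
	if err := client.ConfigureDevice("wg0", cfg); err != nil {
		log.Fatalf("configure wg0: %v", err)
	}
}
```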
this is a really neat writeup! the design choice to have each "exit node" control its local wireguard connections instead of relying on a global control plane is a clever one.
an unfinished project I worked on (https://github.com/redpwn/rvpn) was a bit more ambitious, with a global control plane, and I quickly learned that supporting multiple clients (especially for anything networking-related) is a tarpit. the focus on linux / aws specifically here, and the results achievable from it, are nice to see.
networking is challenging and this was a nice deep dive into some networking internals, thanks for sharing the details :)
Thanks for sharing. I'm interested in seeing what a global control plane might look like, seems like authentication might be tricky to get right!
Controlling our worker environment (like the `net.ipv4.conf.all.rp_filter` sysctl) is a big help for us, since it means we don't have to deal with the full range of possible network configurations.
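For anyone unfamiliar, sysctls like that are just files under /proc/sys, so a host agent can pin them directly. A minimal sketch (the helper and the exact value here are illustrative, not necessarily what we run in production):

```go
package main

import (
	"log"
	"os"
	"path/filepath"
	"strings"
)

// setSysctl writes a value to the corresponding file under /proc/sys,
// e.g. net.ipv4.conf.all.rp_filter -> /proc/sys/net/ipv4/conf/all/rp_filter.
func setSysctl(name, value string) error {
	path := filepath.Join("/proc/sys", strings.ReplaceAll(name, ".", "/"))
	return os.WriteFile(path, []byte(value), 0o644)
}

func main() {
	// 2 = "loose" reverse-path filtering, which tolerates asymmetric routes
	// (replies leaving via a different interface than the one the request
	// arrived on), something that comes up once traffic is steered through
	// a tunnel. The exact value is a deployment detail; this is the mechanism.
	if err := setSysctl("net.ipv4.conf.all.rp_filter", "2"); err != nil {
		log.Fatal(err)
	}
}
```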
Thanks for sharing. This new feature is neat! It might sound a bit out there, but here's a thought: could you enable assigning unique IP addresses to different serverless instances? For certain use cases, like web scraping, it's helpful to simulate requests coming from multiple locations instead of just one. I think allowing requests to originate from a pool of IP addresses would be doable given this proxy model.
JWT/OIDC, where the thing you're authenticating to (like MongoDB Atlas) trusts your identity provider (AWS, GCP, Modal, GitLab CI). It's better than mTLS because it allows for more flexibility in claims (extra metadata and security checks can be done with arbitrary data provided by the identity provider), and JWTs are usually shorter-lived than certificates.
A db connection driver? You pass the JWT as the username/password; it contains the information about your identity and is signed by the identity provider that the party you're authenticating to has been configured to trust.
Or you use a broker like Vault: you authenticate to it with that JWT, and it generates a just-in-time, ephemeral username/password for your database, which gets rotated at some point.
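Roughly, that broker flow looks like this against Vault's API (the role names, mount paths, and env var are placeholders, and error handling is pared down):

```go
package main

import (
	"fmt"
	"log"
	"os"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	// Reads VAULT_ADDR etc. from the environment.
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// 1. Trade the workload's identity-provider JWT for a Vault token.
	//    "my-app" is a placeholder JWT auth role bound to the IdP's claims.
	login, err := client.Logical().Write("auth/jwt/login", map[string]interface{}{
		"role": "my-app",
		"jwt":  os.Getenv("WORKLOAD_JWT"),
	})
	if err != nil {
		log.Fatal(err)
	}
	client.SetToken(login.Auth.ClientToken)

	// 2. Ask the database secrets engine for short-lived credentials.
	//    Vault creates the database user and revokes it when the lease expires.
	creds, err := client.Logical().Read("database/creds/my-app-role")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("user:", creds.Data["username"], "pass:", creds.Data["password"])
}
```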
It's not quite a zero-trust solution though due to the CA chain of trust.
mTLS is security at a different layer than IP source whitelisting, though. I'd say that a lot of companies we spoke to would want both as a defense-in-depth measure. Even with mTLS, network whitelisting is relevant: if your certificate were exposed, for instance, an attacker would still need to be able to forge a source IP address to start a connection.
If mTLS is combined with outbound connections, then IP source whitelisting is irrelevant; the external network cannot connect to your resources.
This (and more) is exactly what we built with open source OpenZiti (I work on it), a zero trust networking platform. Bonus points: it includes SDKs so you can embed ZTN into the serverless function; a colleague demonstrated it with a Python workload on AWS - https://blog.openziti.io/my-intern-assignment-call-a-dark-we....
I'd put it in the zero-trust category if the server (or owner of the server, etc) is the issuer of the client certificate and the client uses that certificate to authenticate itself, but I'll admit this is a pedantic point that adds nothing of substance. The idea being that you trust your issuance of the certificate and the various things that can be asserted based on how it was issued (stored in TPM, etc), rather than any parameter that could be controlled by the remote party.
Completely agree. IP addresses are almost never a good means of authentication. It also results in brittle and inflexible architecture: applications become aware of layers they should be abstracted from.
Firewalls exist; many network environments block everything not explicitly allowed.
Authentication is only part of the problem. Networks are firewalled (with dedicated appliances) and segmented to prevent lateral movement in the event of a compromise.
It's not authentication. People aren't using static IPs for authentication purposes.
But if I have firewall policies that allow connections only to specific services, I need a destination address and port (yes, some firewalls allow host names, but there are drawbacks to that).
> IP addresses aren't authenticated, they can be spoofed
For anything bidirectional you'd need the client to have a route back to you for that address, which would require compromising some routers and advertising it via BGP, etc.
You can spoof addresses all you want, but it will generally not do much for a stateful protocol.
> People aren't using static ips for authentication purposes
Lol. Of course they do. In fact, it's the only viable way to authenticate servers in Current Year. Unlike ssh host keys, which literally nobody on this planet takes seriously, or https certificates, which are just make-work security theater.
> Modal has an isolated container runtime that lets us share each host’s CPU and memory between workloads.
Looks like Modal hosts workloads in containers, not VMs. How do you enforce secure isolation with this design? A single kernel vulnerability could lead to remote execution on the host, impacting all workloads. Am I missing anything?
Yeah, great question. This came up early in the design process. A lot of our customers specifically needed IPv4 whitelisting. For example, MongoDB Atlas (a very popular database vendor) only supports IPv4. https://www.mongodb.com/community/forums/t/does-mongodb-atla...
The architecture of vprox is pretty generic though and could support IPv6 as well.
I guess that works until other customers need access to IPv6-only resources… (e.g.: we've stopped rolling IPv4 to any of our CI. No IPv6, no build artifacts…)
In a perfect world, I'd also be asking whether you considered NAT64, but unfortunately I'm well aware that's a giant world of pain to get to work on Linux (involving either out-of-tree Jool, or full-on VPP).
At my company (Fortune 100), we've been selling a lot of our public v4 space to implement... RFC1918 space. We've re-IP'd over 50,000 systems so far to private space. We just implemented NAT for the first time ever. I was surprised to see how far behind some companies are.
Couldn't a NAT instance in front of the containers accomplish this as well (assuming it's only needed for outbound traffic)? The open source project fck-nat[1] looks amazing for this purpose.
Right, vprox servers act as multiplexed NAT instances with a VPN attached. You do still need the VPN part though since our containers run around the world, in multiple regions and availability zones. Setting the gateway to a machine running fck-nat would only work if that machine is in the same subnet (e.g., for AWS, in one availability zone).
The other features that were hard requirements for us were multi-tenancy and high availability / failover.
By the way, fck-nat is just a basic shell script that sets the `ip_forward` and `rp_filter` sysctls and adds an IP masquerade rule. If you look at vprox, we also do this but build a lot on top of it. https://github.com/modal-labs/vprox
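For the curious, those steps boil down to roughly the following. This is a hedged sketch in Go rather than shell (since vprox is Go), not vprox's actual code; the interface name and subnet are made up, and real code would discover them:

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

// run executes a command and fails loudly if it errors.
func run(args ...string) {
	cmd := exec.Command(args[0], args[1:]...)
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("%v: %v", args, err)
	}
}

func main() {
	// Let the kernel forward packets between interfaces.
	if err := os.WriteFile("/proc/sys/net/ipv4/ip_forward", []byte("1"), 0o644); err != nil {
		log.Fatal(err)
	}

	// Rewrite the source address of anything leaving via the public interface,
	// so traffic from the tunnel subnet appears to come from this host's IP.
	run("iptables", "-t", "nat", "-A", "POSTROUTING",
		"-s", "10.0.0.0/24", "-o", "eth0", "-j", "MASQUERADE")
}
```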
Ahh, that makes sense. I do think that a single fck-nat instance can service multiple AZs in an AWS region, though; you just need to adjust the VPC routing table. Thanks for the reply and info.