
BPF: A New Type of Software - andrenth
http://www.brendangregg.com/blog/2019-12-02/bpf-a-new-type-of-software.html
======
pjmlp
Android also uses it a lot.

[https://source.android.com/devices/architecture/kernel/bpf](https://source.android.com/devices/architecture/kernel/bpf)

[https://android.googlesource.com/platform/external/adeb/+/ma...](https://android.googlesource.com/platform/external/adeb/+/master/BCC.md)

[https://linuxplumbersconf.org/event/4/contributions/411/atta...](https://linuxplumbersconf.org/event/4/contributions/411/attachments/354/585/2019_LPC_Lisbon__eBPF_use_in_Android_Networking.pdf)

[https://blog.linuxplumbersconf.org/2017/ocw/system/presentat...](https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4791/original/eBPF%20cgroup%20filters%20for%20data%20usage%20accounting%20on%20Android.pdf)

~~~
winternett
Is this why Google is possibly de-prioritizing non-AMP sites in its search
index, making non-AMP sites load slowly on Google Chrome, lobbying against Net
Neutrality, and doing many other things like implementing flagging for sites
in Chrome that aren't encrypted?

If that was ever a possibility then it would explain a lot... But that's none
of my business...

Not making any officially accusatory statements here though mind you. ಠ_ಠ

------
hackworks
eBPF can be viewed as a mechanism to safely run user code in kernel since it
uses a DSL and a compiler before the byte code is executed in kernel. This
opens up doors for running performance critical functionality in kernel
without having to bundle it with the kernel or very tightly coupled with the
kernel version.

Optimizing FUSE is an example:
[https://extfuse.github.io/](https://extfuse.github.io/)

I expect custom security auditing software, reverse proxies, and firewalls with
rule engines implemented in eBPF in the near future. This will avoid having
to copy data across the kernel/user boundary, along with the switching overheads.

~~~
throwaway894345
According to
[https://news.ycombinator.com/item?id=18496054](https://news.ycombinator.com/item?id=18496054),
these programs have to halt? How does this system guarantee that the programs
halt? Does this mean eBPF is not Turing complete?

~~~
nwallin
The language itself is Turing complete, but the kernel will refuse to run a
program that it cannot prove will halt.

There are three categories of programs; programs you can trivially prove will
halt, programs you can trivially prove won't halt, and programs where it's
difficult or impossible to prove whether or not they will halt. The third category
is what we call the halting problem. Only the first category will be run by
the kernel.

The nitty gritty is that the only jumps which are allowed to go backwards are
the end of a loop with a fixed number of iterations. All other jumps must jump
forward.

This sounds like it's really limiting, but in practice there's not a whole lot
of useful stuff you're unable to do.
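
As a rough illustration of the forward-jump rule, here is a toy Python check. The instruction tuples and the `only_forward_jumps` name are made up for this sketch; it is not the real BPF encoding or the kernel's verifier, and it ignores the bounded-loop exception entirely.

```python
from typing import List, Tuple

# Toy instruction format (hypothetical, not the real BPF encoding):
#   ("op",)          straight-line work
#   ("jmp", offset)  relative jump; offset counts from the next instruction
#   ("exit",)        stop
Insn = Tuple


def only_forward_jumps(prog: List[Insn]) -> bool:
    """Return True if every jump lands strictly ahead and inside the program."""
    for pc, insn in enumerate(prog):
        if insn[0] == "jmp":
            target = pc + 1 + insn[1]
            # A target at or before pc is a backward jump (or a self-loop).
            if target <= pc or target >= len(prog):
                return False
    return True


accepted = [("op",), ("jmp", 1), ("op",), ("exit",)]   # skips one instruction
rejected = [("op",), ("jmp", -2), ("exit",)]           # jumps back to the start
```

If every jump lands strictly ahead, the program counter only ever increases, so a program of N instructions finishes in at most N steps.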

~~~
fireant
Could you create an unbounded loop by scheduling another BPF execution on the
end of each program? I imagine something like a WASM runtime that would be
split into multiple BPF programs on loops. Which would practically achieve
[https://github.com/nebulet/nebulet](https://github.com/nebulet/nebulet),
right?

------
eddiezane
My coworker gave a talk [0] two weeks ago at KubeCon on the eBPF datapath in
regards to Cilium (a network implementation for Kubernetes). He touches on the
journey through different kernel and system calls.

0: [https://youtu.be/Kmm8Hl57WDU](https://youtu.be/Kmm8Hl57WDU)

------
JackRabbitSlim
BPF amuses me greatly; Real-Mode/Ring0 are dead! Long live Real-mode/Ring0!

Yes, yes. It's got some access control and security sugar sprinkled in but the
idea of 'userland for everything!' is clearly wrong if eBPF moving stuff back
into the kernel is "The biggest OS development in my career"

------
jokoon
I have a hard time understanding what you would use it for. I could understand
a use-case, but I fail to understand why it would be that useful.

I have a sense it allows much better performance for horizontal scaling, but
I'm not sure...

~~~
rusk
Real-time, low latency, network-based applications.

At the pace of network events, the CPU is still very fast, by perhaps at least
an order of magnitude. However, the latency introduced by system calls is
significant.

This allows you to run certain classes of application in kernel space with
these overheads largely mitigated.

Principally it's monitoring and "observability" applications, but apparently
it's much more flexible now than it has been historically ...

~~~
z3t4
Wouldn't virtualization kill the perf benefit, or is this supposed to run "on
the metal"?

~~~
gjulianm
I don't know about other devices, but network cards have pretty good support
for virtualization. A lot of them have virtual devices that the card generates
and are added to the system as if they were different PCI devices. Then, each
VM gets one of those virtual devices and they access it directly as a regular
PCI device, with no virtualization layer whatsoever.

------
anyzen
Off topic: this is the same Brendan Gregg of flame graph fame [0][1]. Flame
graphs have saved my skin quite a few times when trying to figure out performance
bottlenecks in Python apps (using pyflame[2] to capture data and FlameGraph[1]
to convert it to displayable SVG).

[0]
[http://www.brendangregg.com/flamegraphs.html](http://www.brendangregg.com/flamegraphs.html)

[1]
[https://github.com/brendangregg/FlameGraph](https://github.com/brendangregg/FlameGraph)

[2] [https://github.com/uber-archive/pyflame](https://github.com/uber-archive/pyflame)

~~~
emptysea
Looks like `pyflame` was recently deprecated & archived.

I've had success using `py-spy` for debugging perf issues. Flamegraphs are
much nicer to work with than cProfile's output.

[https://github.com/benfred/py-spy](https://github.com/benfred/py-spy)

~~~
simtel20
I just wrote up a quick survey of python profilers that hasn't been published
yet, and along with py-spy, there is austin
([https://github.com/P403n1x87/austin](https://github.com/P403n1x87/austin)).
The thing that I liked the most about austin is that it also samples the
memory usage of the system so that you have the context of the world outside
of the process being sampled, in case it is useful. That said, py-spy is
easier to install (it can be installed via pip).

~~~
lathiat
Facebook also appears to have published BPF code to do the same thing as py-spy
but in the kernel, on perf hooks. But it isn’t currently documented well enough
to be easily accessible, as far as I can tell.

I plan to test it out but haven’t yet.

For anyone who wants to understand how py-spy works, I’d also suggest the talk
on rbspy (see YouTube); it’s great and basically the same but for Ruby.

------
ComSubVie
Haven't used it till now (except maybe via nft?).

What I'm not sure about is: what prevents BPF programs from being used as
rootkits? Since they run inside the kernel and cannot be inspected (?), can
they be used to hide malicious activity?

~~~
rhinoceraptor
You can see which bpf programs are loaded in the kernel via the bpf() syscall.
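
For a command-line view, `bpftool prog show` lists the loaded programs (it needs root). A small Python wrapper might look like the sketch below. The JSON field names (`id`, `type`, `name`) are what I'd expect from `bpftool --json`, but treat them as assumptions and check the output on your own system; `summarize` and `list_loaded_programs` are hypothetical helper names.

```python
import json
import subprocess
from typing import Dict, List


def summarize(bpftool_json: str) -> List[Dict[str, object]]:
    """Reduce a JSON list of programs to id, type and name.

    The key names are assumptions about bpftool's JSON output; missing
    keys fall back to None (id, type) or an empty string (name).
    """
    progs = json.loads(bpftool_json)
    return [
        {"id": p.get("id"), "type": p.get("type"), "name": p.get("name", "")}
        for p in progs
    ]


def list_loaded_programs() -> List[Dict[str, object]]:
    """Run bpftool (needs root and bpftool on PATH) and summarize its output."""
    result = subprocess.run(
        ["bpftool", "--json", "prog", "show"],
        check=True,
        capture_output=True,
        text=True,
    )
    return summarize(result.stdout)
```

Calling `list_loaded_programs()` as root should give one small dict per loaded program, which is enough to spot something you did not load yourself.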

Theoretically it could be used for a rootkit, but the programs need to be
loaded as root, and they can't have side effects. BPF has also been around for
a long time, and it's in basically all of the *nix operating systems.

~~~
sanxiyn
Generally agreed, but Linux BPF is considerably more powerful than traditional
Unix BPF, so I wouldn't depend on "it has been around for a long time" for
safety.

I would like to see some academic research on Linux BPF verifier. If you are a
graduate student working on formal methods looking for a topic, this is a
hint.

~~~
allset_
If someone has root, it's already game over. An attacker could just hook the
syscalls directly which would be more stealthy than using BPF programs.

------
yetihehe
Looks nice, microservices going into super-micro territory, where they are
just simple small code snippets. Possible problems - many people will learn
the hard way that logging or printing every packet which comes through your
interfaces for further analysis will bring down your system. From the start
there should be some simple way to rate-limit those bpf programs, like "if
this exceeds some limits or bogs down system for more than X milliseconds,
disable and give error".

~~~
kbumsik
> "if this exceeds some limits or bogs down system for more than X
> milliseconds, disable and give error".

AFAIK unbounded loops in a BPF program are not allowed: the kernel eBPF verifier
will refuse to load such programs. So the execution time of BPF programs will
not be variable and will be predictable.

I'm not sure if users need to care about the cases of long execution times
when no loops are allowed.

~~~
andbberger
How is that possible?? If true, that would imply BPF programs are not Turing
complete.

~~~
kbumsik
You are right, eBPF is non-Turing complete. To be precise I heard that the
kernel eBPF verifier ensures the flow of a BPF program is a kind of a DAG (no
cycles = no loops). So eBPF VM is definitely not a general-purpose machine.

~~~
dathinab
Actually they now allow bounded loops (ones which you could theoretically
unroll completely). It's mentioned in the talk, too. There is also a limit on
the number of instructions a program can have. I wonder if bounded loops
multiply the "number of instructions" per iteration by the upper bound on
the number of iterations for this?
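
One way to picture the question: if the verifier walks every iteration of a bounded loop, the loop costs its body length times its iteration bound. The sketch below is only back-of-the-envelope arithmetic. The 1,000,000 default budget is my assumption about recent kernels rather than something from the talk, and both function names are made up.

```python
def verified_instructions(straight_line: int, loop_body: int, max_iterations: int) -> int:
    """Instructions a verifier would walk if it simulates every loop iteration."""
    return straight_line + loop_body * max_iterations


def fits_budget(straight_line: int, loop_body: int, max_iterations: int,
                budget: int = 1_000_000) -> bool:
    """True if the walked instruction count stays within the assumed budget."""
    return verified_instructions(straight_line, loop_body, max_iterations) <= budget
```

Under that assumption a 20-instruction body with a bound of 100 is cheap (2,000 instructions plus the straight-line part), while the same body with a bound of 100,000 would blow the budget.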

------
manyworlds
This is less about BPF vs native code, and more about the process model vs the
event based model of application programming.

Event based handling is inherently more efficient because it runs in the
context of the caller, instead of requiring its own context like in process-
based applications.

This is the main reason why file system code in the kernel is more efficient
than file system servers running in a different process, eg via FUSE. No
context switching

~~~
EsssM7QVMehFPAs
To be fair it should be considered that within the kernel space event driven
architectures are already ubiquitous though. I/O, filesystem, you name it.
Including powerful multiplexing and dispatch frameworks.

I'd rather see this as a way to ingest performance critical code pieces into
the kernel space more easily, with virtualization and verification options
providing safety within an otherwise dangerous/complicated domain.

I would not agree with the article that this kind of paradigm is new - neither
inside nor outside of kernel land.

~~~
youareawesome
Agree, it's not new, and I think GP's point is that event-driven is the norm
for the kernel and why kernel-level interfaces are efficient.

What's new here is that this is being made available to user-level custom
applications.

------
rusk
I wonder is this something that Futhark [0] could be used for?

[0] [https://futhark-lang.org/](https://futhark-lang.org/)

EDIT pleased to discover that the native `bpftrace` language is based on awk!

------
ClumsyPilot
This looks very similar to webassembly work going on right now, both use a
secure VM, and both run in kernel space.

Would webassembly be a more general purpose way of accomplishing something
like this?

~~~
TickleSteve
A BPF interpreter can literally be ~100 LoC. A WebAssembly VM on the other
hand will likely be ~1million LoC (without checking).

One is suitable for embedding into a kernel, the other isn't.
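
To give a feel for why such an interpreter can be tiny, here is a toy accumulator machine in Python, loosely modelled on classic BPF's load / compare / return style. The opcode names and tuple layout are invented for this sketch and do not match the real bytecode.

```python
from typing import List, Tuple


def run_filter(prog: List[Tuple], packet: bytes) -> int:
    """Interpret a toy classic-BPF-style program and return its verdict.

    Opcodes (hypothetical):
      ("ldb", offset)     load one packet byte into the accumulator
      ("jeq", k, jt, jf)  skip jt instructions if acc == k, else skip jf
      ("ret", k)          return k (0 means "drop the packet")
    """
    acc = 0   # the single accumulator register
    pc = 0    # program counter
    while pc < len(prog):
        insn = prog[pc]
        op = insn[0]
        if op == "ldb":
            offset = insn[1]
            if not 0 <= offset < len(packet):
                return 0          # out-of-bounds read: reject the packet
            acc = packet[offset]
            pc += 1
        elif op == "jeq":
            _, k, jt, jf = insn
            if jt < 0 or jf < 0:
                raise ValueError("jump offsets must be forward-only")
            pc += 1 + (jt if acc == k else jf)
        elif op == "ret":
            return insn[1]
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return 0                      # fell off the end: reject


# Accept IPv4 packets whose protocol field (header byte 9) is 6, i.e. TCP.
tcp_only = [("ldb", 9), ("jeq", 6, 0, 1), ("ret", 65535), ("ret", 0)]
```

Because the jump offsets can only move forward, the loop runs at most once per instruction, which is the same property that makes the real thing safe to embed.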

~~~
sanxiyn
Where did you get ~1million number?
[https://github.com/bytecodealliance/wasm-micro-runtime](https://github.com/bytecodealliance/wasm-micro-runtime)
is less than 100K LoC for example.

~~~
0-_-0
If a micro runtime is 100K LOC then 1M LOC wasn't unrealistic.

~~~
sanxiyn
About half is tests and sample codes. Runtime proper (core directory) is <50K.

It is also WASI runtime. WebAssembly runtime proper (core/iwasm/runtime
directory) is more like 10K.

------
GrayShade
Brendan has a lot of great content that gets posted here regularly:
[http://www.brendangregg.com/](http://www.brendangregg.com/) (I'm still trying
to find the time to get through it though).

There's also
[https://github.com/iovisor/bcc#tools](https://github.com/iovisor/bcc#tools)
as an easy way to get started using BPF.

~~~
thinkmassive
He also has a book coming out this month, BPF Performance Tools:
[http://www.brendangregg.com/bpf-performance-tools-book.html](http://www.brendangregg.com/bpf-performance-tools-book.html)

------
kizer
I've wondered why operating systems, aside from hypervisors, are
overwhelmingly the first abstraction - I know the obvious benefits, or rather
necessities (processor sharing, security, file system, etc etc) - but in
ultra-specialized perf-critical applications I'd have thought economic
pressures would have materialized a greater variety of ad-hoc bare-metal
software. I guess we're headed that way w/ the twilight of Moore's law, FPGAs.
Maybe I'm just ignorant of such implementations.

~~~
SahAssar
Isn't that sorta what a unikernel is?

~~~
eyberg
yep - and some of us are working on that

------
Iv
So from my understanding, that's a kind of "secure" (I'd like to know more
about the security model tbh) module that runs with kernel privilege with no
scheduling (so it runs until completion). These are supposed to be short and I
am assuming, can't call libs and can't allocate memory (outside a predefined
stack I would guess?)

Aren't they very similar to interrupts? What is the difference there? The
kernel API?

~~~
maxdamantus
I think it's more to do with avoiding overheads typically associated with
system calls (presumably involving some interrupt and
disabling/enabling/changing paging behaviour).

Here's an example of a syscall-heavy command on my system:

    
    
      $ time dd if=/dev/zero bs=1 count=10M of=/dev/null
      10485760+0 records in
      10485760+0 records out
      10485760 bytes (10 MB, 10 MiB) copied, 7.09089 s, 1.5 MB/s
     
      real    0m7.092s
      user    0m2.123s
      sys     0m4.968s
    

3 million system calls per second seems quite slow on a 3.2 GHz CPU when all
it should really be doing is dereferencing a couple of pointers until it finds
some functions that simply write a zero byte to a buffer (the "/dev/zero"
descriptor handler) and ignore bytes from a buffer (the "/dev/null" descriptor
handler).
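
For anyone who wants to check the arithmetic, here it is spelled out in Python using the figures from the `dd` output above. The only assumption is that `dd` issues one `read()` and one `write()` per 1-byte record.

```python
records = 10_485_760      # "10485760+0 records" from dd (count=10M, bs=1)
elapsed_s = 7.09089       # wall-clock seconds reported by dd
cpu_hz = 3.2e9            # 3.2 GHz CPU

# Assumption: one read() plus one write() system call per record.
syscalls = 2 * records
rate = syscalls / elapsed_s              # system calls per second
cycles_per_syscall = cpu_hz / rate       # clock cycles available per call

print(f"{rate / 1e6:.2f} M syscalls/s, about {cycles_per_syscall:.0f} cycles each")
```

That works out to a little under 3 million calls per second, or on the order of a thousand clock cycles per call.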

If you have a safe bytecode format for representing operations that are
performed in a loop, the kernel can just perform those operations without
having to switch back and forth to userspace.

~~~
barrkel
Have you forgotten about Meltdown, Spectre, and all the other cache attacks?

~~~
maxdamantus
These are things that kernel developers are surely mindful of when coming up
with and implementing eBPF functionality.

Regardless, I'm sure I've run this same test years ago and seen the system
call count still in the same order (that is, a couple of million per second).
I really doubt Spectre mitigations are what are causing what should be a few
dereferences and function calls to take around a thousand clock cycles.

------
etaioinshrdlu
It reminds me a wee bit of stuff that has happened in browsers. NaCl was a way
to run untrusted native code by scanning it for unsafe operations.

BPF seems to have a similar aim in a sense: run less-trustworthy code in an
unrestricted environment, but don't use CPU hardware based access control (and
therefore context switching) to limit it, rather, make the code more sandboxed
in software to prevent the need for context switches.

------
etaioinshrdlu
Very interesting. The BPF programs are often written in C and compiled using a
BPF backend for LLVM, using this project:
[https://github.com/iovisor/bcc](https://github.com/iovisor/bcc)

This should be useful any time you need a high performance, high security way
to instrument or extend a C based program during run time! Not just in
kernels.

------
kragen
There are a bunch of places where non-Turing-complete scripting can be used to
provide better interfaces between mutually-untrusting systems. Bitcoin Script,
which is loopless like BPF, is another example. (I don't mean eBPF, which
isn't loopless.) I've been thinking about using this approach in the Wercam
windowing system for BubbleOS to get reliably low-latency feedback for user
interface events, as Don Hopkins did with NeWS;
[https://gitlab.com/kragen/dercuano/blob/master/markdown/werc...](https://gitlab.com/kragen/dercuano/blob/master/markdown/wercam-scriptable-windows)
explains how, and also delves a bit into the history of
the approach. Other possible uses include active networking (by including a
routing program in your packet headers), specifying pub-sub subscriptions, and
specifying database queries.

------
willisbueller
Sounds a lot like SPIN OS
[https://en.wikipedia.org/wiki/SPIN_(operating_system)](https://en.wikipedia.org/wiki/SPIN_\(operating_system\))
They have to make do without type safety (in SPIN's case provided by
modula-3), but it's really cool to see it tried. For those interested: SPIN
and other hybrid kernels (like Exo) were created in the fall-out of
"microkernels are bad" by attempting to allow a hybrid approach. Linux was
created around the same time staying straight in the monolithic category (but
since moving to be more hybrid).

~~~
smorrow
Out of the loop, why is Linux now "more hybrid"?

~~~
pm90
Because it has a crap ton of drivers and shit in the kernel, which is the
opposite of unikernel/microkernel approach where most of this functionality is
implemented in userspace. See eg Minix

~~~
gok
I think the question was how is it a microkernel at all

------
smartmic
I do not know much about kernels and OS architecture. But I am wondering how
this relates to the idea of GNU Hurd. Wasn't its concept also about
microkernel-based design?

~~~
monocasa
This is pretty different, Hurd ran all of this stuff in user space.

It's really close to exokernels though. XOK had three different, but similar
VMs for running user code in kernel space. Like their version of futex was a
filter program that got run to see if the program should be woken up.

------
phkamp
The Lost Generation discovers IBM Mainframe Channel Programs?

Want to bet if they are going to make all the same mistakes themselves, or if
they are willing to learn from history?

~~~
leetrout
All snark aside, is there potential benefit to Varnish Cache with this
becoming widely adopted? Things that could only be accomplished at lower
layers like this implies access to?

~~~
EsssM7QVMehFPAs
Since there is already highly efficient event driven I/O and a context switch
is bound to happen for applications that require major business logic in user
space, I doubt there will be huge benefits for large server applications.

Maybe simplified versions work with this paradigm with significant performance
gains, but I doubt that you could simply plug Varnish or nginx in and see much
of an improvement.

There is still quite a complexity gap between e.g. interpreting firewall
filter rules and a full blown web reverse proxy..

~~~
eloff
I think you could use this to generate a reverse proxy for the current config
in nginx, maybe if only for some subset of configurations. Like a fast path
for common flows. But it wouldn't be easy.

------
musicale
I'm a big fan of eBPF (and Brendan Gregg), but it's not really a _new_ type of
software - the original BPF (1993) has been around for a while, not to mention
other user-extensible kernel architectures such as ring-based kernels and
microkernels (1960s-) and exokernels (1994).

eBPF reminds me a lot of exokernel type systems.

------
Aissen
gregkh said at Kernel Recipes this year that "we have a micro kernel and
nobody noticed" while talking about eBPF.

~~~
kasabali
I think he may have said that about "ELF modules" [1] specifically. Because
plain BPF code runs in kernel mode and that wouldn't make it a micro kernel. I
don't know if elf modules feature has been merged upstream.

1\. [https://lwn.net/Articles/749108/](https://lwn.net/Articles/749108/)

------
HIP_HOP
Can someone write an ELI5 of BPF, please?

~~~
leetrout
Take a look through
[https://qmonnet.github.io/whirl-offload/2016/09/01/dive-into...](https://qmonnet.github.io/whirl-offload/2016/09/01/dive-into-bpf/)

~~~
fizwhiz
Good god those linked articles ought to keep anyone busy for months

------
yehia2amer
This is one of those moments when I read an article and say to myself: “This
stuff needs a smarter programmer than I am.” I will try to watch the video;
hopefully it is simpler to understand!

~~~
badfrog
I've thought that many times and always been wrong. At the end of the day,
code is just code. After you take the time to understand the environment it's
running in, it's not that much different than anything else you could write.

Creating BPF in the first place took a lot of cleverness. Figuring out how to
fit it into a massive infrastructure at a place like Facebook took a lot of
cleverness. But most of the people actually working on the implementations are
just normal software engineers.

~~~
throwaway2048
There is lots of code that is difficult to understand, for anyone.

What about code that proves a complex, many page mathematical theorem?

What about code that does complex and very math heavy things like GCM
encryption modes, or statistical compression via prediction by partial
matching and arithmetic coding?

Just reading the code isn't always enough, plenty of complicated code requires
understanding of concepts far outside what you could hope to fit in a comment.

~~~
badfrog
> What about code that proves a complex, many page mathematical theorem?

The hard part should be understanding the theorem. If you understand that and
the code is still difficult, then the code should be better written.

I guess one exception is when you're optimizing for performance and make an
explicit decision to sacrifice maintainability.

------
wbl
I wonder what Netflix does on their FreeBSD machines.

------
sigstoat
if i like the guarantees that BPF gives the kernel, and want to embed it into
my user-space software so that it can execute BPF programs received from other
untrusted processes, are there any hints on where to start? the "BPF
beginners" material i come across doesn't seem to discuss this use case.

~~~
base0x10
This is currently not easy to do. First of all, the kernel BPF implementation
is GPLv2, which means you cannot rip out the runtime and embed it in your
userspace code unless you are able to distribute your userspace software as
GPLv2 (This is AFAIK and IANAL). For software engineering and license
compatibility reasons, you probably want to use a clean implementation of
eBPF.

Two such implementations exist. uBPF has a simple implementation. This
implementation is likely not feature complete, as it is not regularly updated
and the kernel BPF validator keeps getting new features. DPDK also has an
implementation of BPF. This appears to be actively developed, but it also
comes with a lot of baggage from DPDK. It may be possible to fork it for this
purpose, but it appears to be primarily used for running traditional network
filters on incoming packets in userspace.

What you are looking for is a really interesting possibility but would require
a significant software engineering effort.

If you relax the constraint that BPF runs in userspace, it is possible to
insert userspace dynamic tracing points (USDT) into your code, which will call
out to a BPF program in the kernel when executed.

~~~
sigstoat
> First of all, the kernel BPF implementation is GPLv2, which means you cannot
> rip out the runtime and embed it in your userspace code unless you are able
> to distribute your userspace software as GPLv2

mmm right, should've thought of that.

> What you are looking for is a really interesting possibility but would
> require a significant software engineering effort.

thanks for the info, i appreciate it, and it gives me a bit to chew on.

------
paggle
Can someone explain in one simple paragraph what this is?

------
steve1977
I still have to watch the presentation and/or read the book, but at first
glance this looks like a nasty kludge. I hope I'm wrong.

------
jeswin
Great video. Good content, delivery and impressive lighting and video
production work.

------
baybal2
Question from me, why reinvent the bicycle and not just write proper kernel
modules in C?

~~~
rhinoceraptor
BPF is completely production safe. So there is no way for a BPF program to
crash the kernel, introduce significant performance latency, or have any side
effects on the kernel/user space. Obviously, kernel modules have none of those
properties.

Also, BPF has been around for almost 30 years, and you're likely using it.
tcpdump is basically just a BPF bytecode frontend, for example.

~~~
manyworlds
You can maintain production safety by using a BPF->kernel module compiler.

This additionally removes the need to have the bpf compiler in the kernel,
reducing both core size and vulnerability surface area.

No reason BPF must imply JIT

~~~
monocasa
The end goal with bpf is to allow arbitrary untrusted programs to load bpf
programs. If you were just loading kernel modules you wouldn't be able to
maintain kernel integrity and let arbitrary programs load code.

~~~
sigjuice
I can't find the LKML thread, but there is some fundamental problem with
unloading modules safely. BPF programs might not have such limitations.

~~~
monocasa
Oh, neat! Yeah, I didn't think of this but that totally makes sense.

BPF programs declare their resources to the kernel (maps, etc.), but modules
are reliant on the __exit function cleaning everything up properly. The
correctness of __exit is unverified, difficult, and practically one of the
least tested pathways which makes it traditionally fraught with bugs.

~~~
manyworlds
A BPF->kernel module compiler would ensure all necessary cleanup happens in
__exit automatically

~~~
monocasa
A BPF to kernel module compiler wouldn't let the kernel verify the program in
a real way. There's still work to be done, but the end goal of BPF is pretty
obviously to allow non root users to load programs.

Doing the verification offline is a non starter. Appending the verification
information and reverifying it at load time is more work than a BPF runtime as
it is as you have to reproject ISA semantics in a more complex way.

~~~
manyworlds
Hmm I think bpf today is used in fully trusted environments. Kernel level
verification is unnecessary except in untrusted containerized environments or
when running untrusted applications, both use cases being relatively
rare/specialized.

I think the main benefit of bpf is that it prevents you from shooting yourself
in the foot. Running totally untrusted code in the kernel just seems like a
recipe for disaster and for that reason it makes sense bpf is still limited to
root.

~~~
monocasa
That's where it is today, but the end goal is removing that restriction. It
probably would have happened quicker if Spectre/Meltdown hadn't come out of
nowhere. Like it used to be that KVM required CAP_SYS_ADMIN as well, but now
that's been opened up to whoever has permissions to the device file. Start
requiring "own the box anyway" privileges while the feature bakes, but open it
up as it becomes more mature and attackers have a go at it.

It's sort of like how originally you could only jump forward, then they opened
it up to any DAG, then they allowed provably bounded loops.

And there's been OSes that don't require root for their in kernel virtual
machines, XOK and AEGIS being the prominent examples.

~~~
manyworlds
Sure but I don’t see any compelling use cases to motivate opening it up. Do
you have an example of one?

Even if the kernel opens it up without a compelling use case, it seems likely
that distribution policy will keep it default locked to root.

Which is my point here. I don’t see a compelling reason to have a JIT in the
kernel when AOT BPF seems to cover 90% of all existing use cases. In fact I
may even write a bpf to kernel module compiler myself.

~~~
monocasa
* Syscall tracing, sandboxing, and monitoring

* KVM device MMIO emulation

* OS personality emulation (like WSL but doesn't require root)

* New synchronization primitives to user space (like XOK's wake predicates)

* A lot of others..

Modern BPF is exploring the same cornerstones as exokernels and really
opens up a whole bunch of concepts that haven't been seen in mainstream
kernels, particularly if non privileged users are allowed to invoke it.

~~~
manyworlds
Thanks for the examples but those all still seem like things the vast majority
of Linux users can do today, since the vast majority of Linux users have root access.
Both desktop and server.

Mobile users like android don’t have root but I don’t see why an untrusted
mobile app would need bpf.

Only benefit of allowing non-root that I can see is enabling untrusted
containers in cloud environments to do the same. All large cloud providers use
KVM/Xen (not containers) for untrusted users in which case they already have
root.

Can you give an example of a scenario where the user doesn’t have root yet
still would want to do those things?

