If that was ever a possibility then it would explain a lot... But that's none of my business...
Not making any officially accusatory statements here though mind you. ಠ_ಠ
Optimizing FUSE is an example: https://extfuse.github.io/
I expect custom security auditing software, reverse proxies, and firewalls with rule engines implemented in eBPF in the future. This avoids having to copy data across the kernel/user boundary and pay the context-switching overhead.
There are three categories of programs: programs you can trivially prove will halt, programs you can trivially prove won't halt, and programs where it's difficult or impossible to prove whether or not they will halt. The third category is what we call the halting problem. Only the first category will be run by the kernel.
The nitty-gritty is that the only jumps allowed to go backwards are at the end of a loop with a fixed number of iterations. All other jumps must go forward.
This sounds like it's really limiting, but in practice there's not a whole lot of useful stuff you're unable to do.
With a fixed number of iterations (which also means a fixed amount of memory) it's really just a finite state automaton.
In practice that means a true Turing machine can compute things that aren't actually computable in our Universe, or at least not by the bits of the Universe we understand well. I doubt any engineer will think that "this architecture can't in theory compute things not computable in this Universe" is a strong argument against it.
Wasn't this the original purpose of BPF?
Flash was a lovecraftian horror novel of a code base that even Adobe didn't feel totally comfortable making changes in.
You are correct however that allowing arbitrary eBPF code to be installed by untrusted users increases the attack surface. For example, we need to trust that the eBPF checker is correct, as well as the implementation of the eBPF operations. Moreover, even though eBPF may be safe I suspect it could potentially trigger or exploit bugs in internal kernel subsystems that might be harder or even impossible to trigger through the regular system call interface.
> Server abilities can be changed by uploading JS functions/libs..
> Want to search a file format's meta-data on the server? Upload JS to read the meta-data and index it, and provide the search options..
> Servers should also use the best options for the tasks they provide: Grep to search text -- and not Grep-like functionality, but the optimized binary for the platform the server is running on.
> Locking, linking, ACLs, and binary access should all be changeable by JS.
Kind of like current serverless tech but instead of running at the edge, you run on the storage node.
EDIT: removed statement about network speed vs ssd speed. Pretty sure I was way off.
Yes, yes. It's got some access control and security sugar sprinkled in but the idea of 'userland for everything!' is clearly wrong if eBPF moving stuff back into the kernel is "The biggest OS development in my career"
I have a sense it allows much better performance for horizontal scaling, but I'm not sure...
At the pace of network events, the CPU is still very fast, by perhaps at least an order of magnitude. However, the latency introduced by system calls is significant.
This allows you to run certain classes of application in kernel space with these overheads largely mitigated.
Principally it's monitoring and "observability" applications, but apparently it's much more flexible now than it has been historically ...
I mean, suppose I want to make an eBPF program to pipe audio input into my very fancy DSP algo. Awesome. But now I've limited the remainder of my signal chain to remain in eBPF land, haven't I? The moment I throw the signal to Supercollider, Pd, some Jack client, etc., I'm back to being a mere userland mortal and taking the performance hit that entails. (Unless I sample that data coming into userland at a lower rate, but that severely limits the use cases for such a design.)
Judging by some of the screenshots I've seen Linux audio users seem to want to mix their samples through every single audio program they can get their hands on. So at least in this domain I have a hard time seeing how eBPF could practically help realtime scheduling.
Right, because jack is a user space collection of programs. It's obvious that any real-time or low latency DSP audio processing pipeline would have to be moved to BPF in its entirety.
So yes, BPF would work just fine if a new DAW pipeline was designed to work within it. User space programs would only function as GUI front ends to load and configure BPF audio modules. This way audio can be sourced or sunk via DAC/ADC devices, disk, and network devices without the user-kernel space latencies.
More interesting would be using BPF for the driver and being able to directly invoke JACK (or some other user-space endpoint) in a way that essentially bypasses the scheduler.
Put differently: rather than rely on the normal scheduler pathway to waking up JACK (or some other user-space endpoint), have BPF do it as soon as the device driver is ready.
But I don't think BPF has any way to do this at present - it seems to rely on accumulation of data in kernel space and a user space poll-driven process to pick it up and push it at the user.
For example, can there be an eBPF "hello world" Dalek modulator that just sits between ALSA's audio input and output?
I don't know enough about audio processing to answer your specific question ... but processing network traffic for instance, you have the opportunity to "do things" with traffic. For instance partially decode PDUs to see if they interest you, and then pass them on to user-land for more in-depth processing if so ...
EDIT I think you're thinking in terms exclusively of `mapping` operations, but this kind of thing can be great for `reducing` or filtering.
I suppose it's possible to do audio processing while limited to integer arithmetic. BPF does have multiply and divide instructions at least.
Scanning the x86_64 JIT compiler, for example, yields no mention of SIMD instructions: https://github.com/torvalds/linux/blob/master/arch/x86/net/b...
Yep: here's a Brendan Gregg presentation on the topic: https://www.youtube-nocookie.com/embed/7pmXdG8-7WU
io_uring was introduced to attack this too and essentially is an async, queue-based, batched syscall interface eliminating a lot of that latency overhead. With polling mode you only have to do a syscall when both the application and the kernel are out of work.
I thought these kinds of applications generally (try to) avoid the kernel entirely e.g. high frequency trading algorithms running on FPGAs.
Maybe using it to get data on/off of the FPGA for charting / updates / etc.
* you can use maps which are persistent key-value stores provided by the eBPF runtime, and you can call out to approved C functions provided by the kernel, but those won't allocate memory for you.
EDIT - but even without this, the comparison with user-space holds.
Two, as a better tool for general observability and problem tracing. For example, I recently listened to a Usenix (LISA19) talk by Brendan Gregg (the author of this blog) about Linux systems perf at Netflix, where he talks about how much strace can impact performance and posits that its future replacement will be 'perf trace', which uses a ring buffer and BPF.
rather than specify a limited set of api entry points (like segment gates) that you try to make safe... you can provide general security and resource guarantees and let the consumer do what they need. so that's cute, but imagine the round-trip reduction (nfsv4 does this to a very limited extent) and the increased generality.
what if your btree traversal could be safely run on the storage server.
I've had success using `py-spy` for debugging perf issues. Flamegraphs are much nicer to work with than cProfile's output.
I plan to test it out but haven't yet.
For anyone that wants to understand how py-spy works, I'd also suggest the talk on rbspy (see YouTube). It's great and basically the same, but for Ruby.
What I'm not sure about is: what prevents BPF from being used for rootkits? Since these programs run inside the kernel and cannot be inspected (?), can they be used to hide malicious activity?
Theoretically it could be used for a rootkit, but the programs need to be loaded as root, and they can't have side effects. BPF has also been around for a long time, and it's in basically all of the *nix operating systems.
I would like to see some academic research on Linux BPF verifier. If you are a graduate student working on formal methods looking for a topic, this is a hint.
AFAIK indefinite loops in BPF are not allowed: the kernel eBPF verifier will refuse to load such programs. So the execution time of BPF programs is bounded and predictable.
I'm not sure if users need to care about the cases of long execution times when no loops are allowed.
It's not intended to be turing complete, just to be useful enough while provably not harming the system (e.g. by never terminating). There's quite a good write-up at https://lwn.net/Articles/773605/
The programming language Zig is aiming to be able to calculate the stack requirements for a given function. This has enormous benefits if you can do it in areas like embedded software. You can guarantee that you don't run out of memory (functions in Zig can't use an allocator without permission either). But to do this, functions can't do recursion, and probably can't be Turing complete in general. I don't think Zig will ban recursion, but it will give you some powerful tools/options if you avoid it where you can.
Here is a short transcription (not verbatim):
> People have asked if BPF is Turing complete. Is it possible to have a BPF program that can run BPF programs? We have had some serious discussions about whether we are there yet, and the answer is no. The answer is no because the verifier rejects unbounded loops, so there are some things that are not possible. There are bounded loops in BPF now.
That's exactly what the kernel will do.
Event based handling is inherently more efficient because it runs in the context of the caller, instead of requiring its own context like in process-based applications.
This is the main reason why file system code in the kernel is more efficient than file system servers running in a different process, eg via FUSE. No context switching
I'd rather see this as a way to ingest performance critical code pieces into the kernel space more easily, with virtualization and verification options providing safety within an otherwise dangerous/complicated domain.
I would not agree with the article that this kind of paradigm is new - neither inside nor outside of kernel land.
What's new here is that this is being made available to user-level custom applications.
> No, you need not just the blocks, you need the actual cache chain data structures themselves. For doing things like good read-ahead, you need to be able to (efficiently) look up trivial things like "is that block already in the cache".
> So you need not only the data, you need the _tags_ too.
> In other words, your filesystem needs to have access to the whole disk cache layer, not just the contents. Or it will not perform well.
Edit: and the context of this discussion was fairly ancient systems with tagged TLBs and simple in order cores with syscalls nearly as cheap as regular user space call instructions. They were still ungodly slow with microkernels, and his explanation is the meat of his view as to why. It's all about having the data in the right place with as little synchronization required.
I actually have a lot of experience in this area, and I can say that effective readahead is a bit of a crapshoot. Only really works in trivial cases. Ultimately if IO latency sucks, nothing can save you.
His particular point doesn’t fully make sense either. It’s easy to kick off readahead when you only have access to block data, the kernel won’t issue redundant IO requests for blocks already in the cache. Also mlock/madvise give a lot of control in terms of dictating eviction strategies for special blocks.
All things being equal (costless syscalls/mm swapping, IO), I still think inter-task synchronization is the largest overhead, but I have no numbers to back that up. Something tells me marshalling all IO syscalls to a kernel thread would be about as slow as a user-space FUSE task.
EDIT pleased to discover that the native `bpftrace` language is based on awk!
Would webassembly be a more general purpose way of accomplishing something like this?
But I got your point. Thanks!
One is suitable for embedding into a kernel, the other isn't.
It is also a WASI runtime. The WebAssembly runtime proper (the core/iwasm/runtime directory) is more like 10K.
I'd be very surprised if the linux kernel doesn't eventually get web assembly support.
Well currently we run code (e.g. drivers) as trusted full-permission code. Surely, web-assembly would be better than this from a security perspective.
Basic examples: code that takes a lock and never releases it, code that loops forever, code that leaks kernel resources.
Then how can you possibly know? Do you just assume that because it has "Web" in the name, it must be insanely complicated?
Perhaps WebAssembly kernel-space programming would still be viable for major trusted projects, like Kafka, where you literally dedicate the entire machine to that application.
There's also https://github.com/iovisor/bcc#tools as an easy way to get started using BPF.
It's complex though, because the tooling wrt logging, monitoring, debugging aren't as mature yet.
Aren't they very similar to interrupts? What is the difference there? The kernel API?
You can thus get speed-ups because there's no API boundary, no memory management, and not even scheduling in the way.
This is different from hardware interrupts because they don't run in response to hardware events, and they don't have side effects. There is a similarity in that you want the filters to run quickly, though.
Here's an example of a syscall-heavy command on my system:
$ time dd if=/dev/zero bs=1 count=10M of=/dev/null
10485760+0 records in
10485760+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 7.09089 s, 1.5 MB/s
If you have a safe bytecode format for representing operations that are performed in a loop, the kernel can just perform those operations without having to switch back and forth to userspace.
You've got 4.968s of system time there (i.e. broadly the time spent in kernel code) and 2.123s of user time. Given that the user-space program is effectively a tight loop around read() and write() calls, we can assume that almost all of those 2 seconds are spent going through the syscall plumbing.
Now, there's going to also be some of the kernel-side time spent in the syscall plumbing too, but there's also a lot of I/O, buffer and filesystem layer code executing there. All of which will be in use with a BPF program too. So it's unclear how much of the effective time can be shaved off.
And given that I can write a program that makes 132 million calls per second to the glibc `putchar` function (which also buffers), I'm pretty sure there's a lot of time that can be shaved off as we start to replace the system call mechanism with plain function calls.
Regardless, I'm sure I've run this same test years ago and seen the system call count still in the same order (that is, a couple of million per second). I really doubt Spectre mitigations are what are causing what should be a few dereferences and function calls to take around a thousand clock cycles.
BPF seems to have a similar aim in a sense: run less-trustworthy code in an unrestricted environment, but don't use CPU hardware based access control (and therefore context switching) to limit it, rather, make the code more sandboxed in software to prevent the need for context switches.
This should be useful any time you need a high performance, high security way to instrument or extend a C based program during run time! Not just in kernels.
It's really close to exokernels though. XOK had three different, but similar, VMs for running user code in kernel space. For example, their version of futex was a filter program that got run to see if the program should be woken up.
Want to bet if they are going to make all the same mistakes themselves, or if they are willing to learn from history?
Maybe simplified versions work with this paradigm with significant performance gains, but I doubt that you could simply plug Varnish or nginx in and see much of an improvement.
There is still quite a complexity gap between e.g. interpreting firewall filter rules and a full blown web reverse proxy..
eBPF reminds me a lot of exokernel type systems.
Creating BPF in the first place took a lot of cleverness. Figuring out how to fit it into a massive infrastructure at a place like Facebook took a lot of cleverness. But most of the people actually working on the implementations are just normal software engineers.
What about code that proves a complex, many page mathematical theorem?
What about code that does complex and very math heavy things like GCM encryption modes, or statistical compression via prediction by partial matching and arithmetic coding?
Just reading the code isn't always enough, plenty of complicated code requires understanding of concepts far outside what you could hope to fit in a comment.
The hard part should be understanding the theorem. If you understand that and the code is still difficult, then the code should be better written.
I guess one exception is when you're optimizing for performance and make an explicit decision to sacrifice maintainability.
Two such implementations exist. uBPF is a simple implementation, but likely not feature complete, as it is not regularly updated and the kernel BPF validator keeps getting new features. DPDK also has an implementation of BPF. This appears to be actively developed, but it also comes with a lot of baggage from DPDK. It may be possible to fork it for this purpose, but it appears to be primarily used for running traditional network filters on incoming packets in userspace.
What you are looking for is a really interesting possibility but would require a significant software engineering effort.
If you relax the constraint that BPF runs in userspace, it is possible to insert userspace dynamic tracing points (USDT) into your code, which will call out to a BPF program in the kernel when executed.
mmm right, should've thought of that.
> What you are looking for is a really interesting possibility but would require a significant software engineering effort.
thanks for the info, i appreciate it, and it gives me a bit to chew on.
Also, BPF has been around for almost 30 years, and you're likely using it. tcpdump is basically just a BPF bytecode frontend, for example.
This additionally removes the need to have the bpf compiler in the kernel, reducing both core size and vulnerability surface area.
No reason BPF must imply JIT
BPF programs declare their resources to the kernel (maps, etc.), but modules are reliant on the __exit function cleaning everything up properly. The correctness of __exit is unverified, difficult, and practically one of the least tested pathways which makes it traditionally fraught with bugs.
Doing the verification offline is a non-starter. Appending the verification information and re-verifying it at load time is more work than the BPF runtime already is, since you have to re-project the ISA semantics in a more complex way.
I think the main benefit of bpf is that it prevents you from shooting yourself in the foot. Running totally untrusted code in the kernel just seems like a recipe for disaster and for that reason it makes sense bpf is still limited to root.
It's sort of like how originally you could only jump forward, then they opened it up to any DAG, then they allowed provably bounded loops.
And there's been OSes that don't require root for their in kernel virtual machines, XOK and AEGIS being the prominent examples.
Even if the kernel opens it up without a compelling use case, it seems likely that distribution policy will keep it default locked to root.
Which is my point here. I don’t see a compelling reason to have a JIT in the kernel when AOT BPF seems to cover 90% of all existing use cases. In fact I may even write a bpf to kernel module compiler myself.
* KVM device MMIO emulation
* OS personality emulation (like WSL but doesn't require root)
* New synchronization primitives to user space (like XOK's wake predicates)
* A lot of others..
Modern BPF is exploring the same ground as exokernels and really opens up a whole bunch of concepts that haven't been seen in mainstream kernels, particularly if non-privileged users are allowed to invoke it.
Mobile users, e.g. on Android, don't have root, but I don't see why an untrusted mobile app would need BPF.
The only benefit of allowing non-root use that I can see is enabling untrusted containers in cloud environments to do the same. All large cloud providers use KVM/Xen (not containers) for untrusted users, in which case they already have root.
Can you give an example of a scenario where the user doesn’t have root yet still would want to do those things?