BPF: A New Type of Software (brendangregg.com)
603 points by andrenth on Dec 3, 2019 | 185 comments

Is this why Google is possibly de-prioritizing non-AMP sites in its search index, making non-AMP sites load slowly on Google Chrome, lobbying against Net Neutrality, and doing many other things like implementing flagging for sites in Chrome that aren't encrypted?

If that was ever a possibility then it would explain a lot... But that's none of my business...

Not making any officially accusatory statements here though mind you. ಠ_ಠ

eBPF can be viewed as a mechanism to safely run user code in the kernel, since it uses a DSL and a compiler before the byte code is executed in the kernel. This opens up doors for running performance-critical functionality in the kernel without having to bundle it with the kernel or couple it tightly to the kernel version.

Optimizing FUSE is an example: https://extfuse.github.io/

I expect custom security auditing software, reverse proxies, and firewalls with rule engines implemented in eBPF in the future. This will avoid having to copy data across the kernel/user boundary, and the context-switching overhead.

According to https://news.ycombinator.com/item?id=18496054, these programs have to halt? How does this system guarantee that the programs halt? Does this mean eBPF is not Turing complete?

The language itself is Turing complete, but the kernel will refuse to run a program that it cannot prove will halt.

There are three categories of programs: programs you can trivially prove will halt, programs you can trivially prove won't halt, and programs where it's difficult or impossible to prove whether or not they will halt. The third category is what we call the halting problem. Only the first category will be run by the kernel.

The nitty gritty is that the only jumps which are allowed to go backwards are those at the end of a loop with a fixed number of iterations. All other jumps must jump forward.

This sounds like it's really limiting, but in practice there's not a whole lot of useful stuff you're unable to do.

Could you create an unbounded loop by scheduling another BPF execution on the end of each program? I imagine something like a WASM runtime that would be split into multiple BPF programs on loops. Which would practically achieve https://github.com/nebulet/nebulet, right?

This does actually mean that eBPF is not Turing complete, in that it can't simulate a Turing machine.

With a fixed number of iterations (which also means a fixed amount of memory) it's really just a finite state automaton.

From the accompanying video (17:11)[1] it's not Turing complete "because the verifier rejects unbounded loops and so there are some things that are not possible. We do have bounded loops in BPF now."

1: https://youtu.be/7pmXdG8-7WU?t=1031

Commonly known as total functional programming.


Indeed, eBPF is not turing complete.

It's the definition of Turing complete that's the problem. A Turing complete machine must be able to process countably infinite inputs. The halting problem arises from the same thing. There is no halting problem if we restrict ourselves to programs that will stop in a bounded time (obviously).

In practice that means a true Turing machine can compute things that aren't actually computable in our Universe, or at least not by the bits of the Universe we understand well. I doubt any engineer will think that "this architecture can't in theory compute things not computable in this Universe" is a strong argument against it.

> firewalls with rule engines implemented in eBPF in the coming future

Wasn't this the original purpose of BPF?

Bytecode + compiler was also how both Flash and Java Applets worked. Have we forgotten how secure those were?

Java implemented the sandbox as java code in the same VM as what it was sandboxing, then later attached reflection so you had a billion different ways to get references to those classes blocking your access and do brain surgery on them. Oof.

Flash was a lovecraftian horror novel of a code base that even Adobe didn't feel totally comfortable making changes in.

It's also how wasm works, and in some sense -- JavaScript. Somehow, these introduce far fewer security problems.


I'm not sure Javascript has fewer security problems at all. Sandboxes help a lot, but not all sandboxes are equally strong and Javascript engines routinely get popped.

JavaScript doesn’t give you nearly as much access to host OS features as the JVM does. (And there sometimes are security problems when it does.)

But it's not really interesting to compare js to java, we should be comparing to ebpf.

The 'somehow' seems rather clear to me when you look at the organizations that governed these languages and how they're run.

One difference with eBPF is that it limits the number of instructions you can execute (and also prohibits backward branches if I recall - "loops" must be unrolled) and that the semantics are restricted enough that it can be "proven" to be safe.

You are correct however that allowing arbitrary eBPF code to be installed by untrusted users increases the attack surface. For example, we need to trust that the eBPF checker is correct, as well as the implementation of the eBPF operations. Moreover, even though eBPF may be safe I suspect it could potentially trigger or exploit bugs in internal kernel subsystems that might be harder or even impossible to trigger through the regular system call interface.

Another example is a sandboxing file system https://lwn.net/Articles/803890/

I was thinking about something similar to extfuse, but for a remote filesystem. Here's some note scraps:

> Server abilities can be changed by uploading JS functions/libs..

> Want to search a file formats meta-data on the server? Upload JS to read the meta-data and index it, and provide the search options..

> Servers should also use the best options for the tasks they provide: grep to search text -- and not grep-like functionality, but the optimized binary for the platform the server is running.

> Locking, linking, ACLs, and binary access should all be changeable by JS.

This is an interesting idea. Sort of the inverse of a web app. Part of the problem with the cloud for large datasets (ie genomics) is getting the computation close enough to the data (the UI being the third leg of the stool). If you could upload small processing scripts (or ebpf/wasm) to the exact node where the data lives in real time, it might open up some novel techniques.

Kind of like current serverless tech but instead of running at the edge, you run on the storage node.

EDIT: removed statement about network speed vs ssd speed. Pretty sure I was way off.

I believe this was a huge strength of MapReduce back in the day. The mapping and reducing of initial data would happen on the node the data was stored on. The only thing being transferred was the code and the results.

My coworker gave a talk [0] two weeks ago at KubeCon on the eBPF datapath in regards to Cilium (a network implementation for Kubernetes). He touches on the journey through different kernel and system calls.

0: https://youtu.be/Kmm8Hl57WDU

BPF amuses me greatly; Real-Mode/Ring0 are dead! Long live Real-mode/Ring0!

Yes, yes. It's got some access control and security sugar sprinkled in but the idea of 'userland for everything!' is clearly wrong if eBPF moving stuff back into the kernel is "The biggest OS development in my career"

I have a hard time understanding what you would use it for. I could understand a use-case, but I fail to understand why it would be that much useful.

I have a sense it allows much better performance for horizontal scaling, but I'm not sure...

Cilium's container networking and security product [0]; Facebook's Katran [1]; Netflix's flowsrus (not public yet, but see my tcplife tool in BCC[2]). That's just the beginning. It's not just for performance, it's also security, as while BPF programs run in kernel-mode they have a limited and secured API for interacting with the system (BPF helpers).

Since extended BPF is in the Linux kernel (and will be in other kernels in the future), everyone is getting it, and we'll see more use cases over time. In some ways it's like the birth of JavaScript for the browser, and all the new applications it made possible. But it goes further than that: we could still analyze and debug JavaScript applications using traditional tools. But BPF programs are neither process-space nor kernel routines, and are outside the view of everything. No visibility in ps(1), top(1), or lsmod(8). We're having to create new tools to even see what's running on the CPUs. Every performance monitoring product that shows a process table with CPU consumption will now need a BPF program table as well.

[0] https://cilium.io/ [1] https://github.com/facebookincubator/katran [2] http://www.brendangregg.com/blog/2016-11-30/linux-bcc-tcplif...

Real-time, low latency, network-based applications.

At the pace of network events, the CPU is still very fast, perhaps by at least an order of magnitude. However, the latency introduced by system calls is significant.

This allows you to run certain classes of application in kernel space with these overheads largely mitigated.

Principally it's monitoring and "observability" applications, but apparently it's much more flexible now than it has been historically ...

To get the realtime benefits I'd have to run everything that requires realtime scheduling as an eBPF program, right?

I mean, suppose I want to make an eBPF program to pipe audio input into my very fancy DSP algo. Awesome. But now I've limited the remainder of my signal chain to remain in eBPF land, haven't I? The moment I throw the signal to Supercollider, Pd, some Jack client, etc., I'm back to being a mere userland mortal and taking the performance hit that entails. (Unless I sample that data coming into userland at a lower rate, but that severely limits the use cases for such a design.)

Judging by some of the screenshots I've seen Linux audio users seem to want to mix their samples through every single audio program they can get their hands on. So at least in this domain I have a hard time seeing how eBPF could practically help realtime scheduling.

Edit: clarification

> So at least in this domain I have a hard time seeing how eBPF could practically help realtime scheduling.

Right, because jack is a user space collection of programs. It's obvious that any real-time or low latency DSP audio processing pipeline would have to be moved to BPF in its entirety.

So yes, BPF would work just fine if a new DAW pipeline was designed to work with in it. User space programs would only function as GUI front ends to load and configure BPF audio modules. This way audio can be sourced or sinked via DAC/ADC devices, disk, and network devices without the user-kernel space latencies.

You're never going to fit Ardour's engine into 1M instructions, nor is it going to work without unbounded loops. So that's not going to happen.

More interesting would be using BPF for the driver and being able to directly invoke JACK (or some other user-space endpoint) in a way that essentially bypasses the scheduler.

Put differently: rather than rely on the normal scheduler pathway to waking up JACK (or some other user-space endpoint), have BPF do it as soon as the device driver is ready.

But I don't think BPF has any way to do this at present - it seems to rely on accumulation of data in kernel space and a user space poll-driven process to pick it up and push it at the user.

Does the eBPF program need to communicate with userspace?

For example, can there be an eBPF "hello world" Dalek modulator that just sits between ALSA's audio input and output?

sure there could, but that's utterly unrelated in practical terms to what i think we're discussing, which is ways in which eBPF could impact full-scale audio software rather than limited inline DSP.

Hm, good point. But couldn't you say the same thing about CUDA? There is an overhead involved in shunting data to and from the cores ... but the nature of the work being done, and the massive parallelism, mean the overall end-to-end volume of throughput increases dramatically.

I don't know enough about audio processing to answer your specific question ... but processing network traffic for instance, you have the opportunity to "do things" with traffic. For instance partially decode PDUs to see if they interest you, and then pass them on to user-land for more in-depth processing if so ...

EDIT I think you're thinking in terms exclusively of `mapping` operations, but this kind of thing can be great for `reducing` or filtering.

There's bandwidth and there's latency. CUDA has the bandwidth, but not the latency, for generalizable low latency audio processing.

It could still be considered low latency by adding the constraint that all use cases are in the category of call-and-response musical forms. :)

Great idea for audio but that sounds very challenging given BPF's lack of support for floating point types and instructions [0].

I suppose it's possible to do audio processing being limited to integer arithmetic. BPF does have multiply and divide instructions at least.

[0] Scanning the x86_64 JIT compiler for example yields no mention of SIMD instructions: https://github.com/torvalds/linux/blob/master/arch/x86/net/b...

> Principally it's monitoring and "observability" applications, but apparently it's much more flexible now than it has been historically ...

Yep: Here's a brendangregg presentation on the topic: https://www.youtube-nocookie.com/embed/7pmXdG8-7WU

That's exactly what the linked article is.

> However, latency introduced by system calls is significant.

io_uring was introduced to attack this too and essentially is an async, queue-based, batched syscall interface eliminating a lot of that latency overhead. With polling mode you only have to do a syscall when both the application and the kernel are out of work.

> Real-time, low latency, network-based applications.

I thought these kinds of applications generally (try to) avoid the kernel entirely e.g. high frequency trading algorithms running on FPGAs.

Maybe using it to get data on/off of the FPGA for charting / updates / etc.

Could you allocate a memory dynamically inside your BPF program?

No*. BPF programs are written in a very small subset of C (or Rust or C++), which does not include dynamic memory allocation or unbounded loops.

* you can use maps which are persistent key-value stores provided by the eBPF runtime, and you can call out to approved C functions provided by the kernel, but those won't allocate memory for you.

Wouldn't virtualization kill the perf benefit, or is this supposed to run "on the metal"?

I don't know about other devices, but network cards have pretty good support for virtualization. A lot of them have virtual devices that the card generates and are added to the system as if they were different PCI devices. Then, each VM gets one of those virtual devices and they access it directly as a regular PCI device, with no virtualization layer whatsoever.

It’s a VM in the JIT sense, not a VM in the VMWare sense.

I think that depends ... I'm not an expert on virtualisation but I've seen some cases where VMs get full bare-metal access to system functions, albeit supervised. I think this is the purpose of virtualisation extensions on modern architectures (e.g. VT-x) - this is how it's possible to have a 64-bit OS run virtualised on a 64-bit processor.

EDIT - but even without this, the comparison with user-space holds.

Virtualized OSes can still access network hardware directly via SR-IOV

The main use case for me as a linux admin is twofold. One, to augment iptables/nftables for increased speed and observability gains. It's possible to do BPF-only netfilter (some firewall/IDS tools are likely to use it heavily) but I think it works better just helping the other tools, and you can look up some benchmarks that show it.

Two, as a better tool for general observability and problem tracing. For example, I recently listened to a Usenix (LISA19) talk by Brendan Gregg (author of this blog) about Linux systems perf at Netflix, where he talks about how much strace can impact performance and he posits its future replacement will be 'perf trace', which uses a ring buffer and BPF. [1]

[1] https://youtu.be/fhBHvsi0Ql0?t=1300

Not sure if you were hinting at it already, but just in case: you may be interested in the ongoing work on bpfilter [0], which uses eBPF underneath the existing netfilter rule interfaces.

[0] https://lwn.net/Articles/747551/

Yeah I should have been more specific, you are right.

imagine how this might be useful in a distributed context.

rather than specify a limited set of api entry points (like segment gates) that you try to make safe....you can provide general security and resource guarantees and let the consumer do what they need. so that's cute, but imagine the round trip reduction (nfsv4 does this to a very limited extent) and increased generality.

what if your btree traversal could be safely run on the storage server.

One could replace JS with something that halts.

Look at all the examples Gregg provides.

Off topic: this is the same Brendan Gregg of flame graphs fame [0][1]. They have saved my skin quite a few times when trying to figure out performance bottlenecks in Python apps (using pyflame[2] to capture data and FlameGraph[1] to convert it to displayable SVG).

[0] http://www.brendangregg.com/flamegraphs.html

[1] https://github.com/brendangregg/FlameGraph

[2] https://github.com/uber-archive/pyflame

Looks like `pyflame` was recently deprecated & archived.

I've had success using `py-spy` for debugging perf issues. Flamegraphs are much nicer to work with than cProfile's output.


I just wrote up a quick survey of python profilers that hasn't been published yet, and along with py-spy, there is austin (https://github.com/P403n1x87/austin). The thing that I liked the most about austin is that it also samples the memory usage of the system so that you have the context of the world outside of the process being sampled, in case it is useful. That said, py-spy is easier to install (it can be installed via pip).

Facebook also appears to have published a BPF tool to do the same thing as py-spy, but in-kernel on perf hooks. It isn't currently well documented enough to be easily accessible, as far as I can tell.

I plan to test it out but hadn’t yet.

For anyone that wants to understand how py-spy works, I'd also suggest the talk on rbspy (see YouTube); it's great and basically the same, but for Ruby.

Haven't used it till now (except maybe via nft?).

What I'm not sure about is: what prevents BPF from being used for rootkits? Since they run inside the kernel and cannot be inspected (?), can they be used to hide malicious activity?

In addition to the compiler there is a verifier that imposes some strict limits on programs. One example is it must prove the program will halt. It would be foolish to say there couldn't be a vector there, but they have done really strong work in protecting against that type of attack.

You can see which bpf programs are loaded in the kernel via the bpf() syscall.

Theoretically it could be used for a rootkit, but the programs need to be loaded as root, and they can't have side effects. BPF has also been around for a long time, and it's in basically all of the *nix operating systems.

Generally agreed, but Linux BPF is considerably more powerful than traditional Unix BPF, so I wouldn't depend on "it has been around for a long time" for safety.

I would like to see some academic research on Linux BPF verifier. If you are a graduate student working on formal methods looking for a topic, this is a hint.

If someone has root, it's already game over. An attacker could just hook the syscalls directly, which would be more stealthy than using BPF programs.

I was wondering about that as well

Looks nice, microservices going into super-micro territory, where they are just simple small code snippets. Possible problems - many people will learn the hard way that logging or printing every packet which comes through your interfaces for further analysis will bring down your system. From the start there should be some simple way to rate-limit those bpf programs, like "if this exceeds some limits or bogs down system for more than X milliseconds, disable and give error".

> "if this exceeds some limits or bogs down system for more than X milliseconds, disable and give error".

AFAIK indefinite loops in BPF are not allowed: the kernel eBPF verifier will refuse to load such programs. So the execution time of BPF programs is bounded and predictable.

I'm not sure if users need to care about the cases of long execution times when no loops are allowed.

How is that possible?? If true, that would imply BPF programs are not Turing complete.

>If true, that would imply BPF programs are not Turing complete.

It's not intended to be turing complete, just to be useful enough while provably not harming the system (e.g. by never terminating). There's quite a good write-up at https://lwn.net/Articles/773605/

You are right, eBPF is not Turing complete. To be precise, I heard that the kernel eBPF verifier ensures the control flow of a BPF program is a DAG (no cycles = no loops). So the eBPF VM is definitely not a general-purpose machine.

Actually they now allow bounded loops (ones you could theoretically unroll completely). It's mentioned in the talk, too. There is also a limit on the number of instructions a program can have; I wonder if bounded loops multiply the number of instructions per iteration by the loop's upper bound for this?

I think it's one of the bigger misconceptions that programs have to be Turing complete to be useful, or even that functions should be allowed to be Turing complete by default. In fact, in many cases, we should probably have been programming in a way where functions are not Turing complete by default, just as in some modern languages, functions are not allowed to modify global state by default (functions are pure by default).

The programming language Zig is aiming to be able to calculate the stack requirements for a given function. This has enormous benefits if you can do it in areas like embedded software. You can guarantee that you don't run out of memory (functions in Zig can't use an allocator without permission either). But to do this, functions can't do recursion, and probably can't be Turing complete in general. I don't think Zig will ban recursion, but it will give you some powerful tools/options if you avoid it where you can.

Brendan Gregg talks about it in the talk in the link at 17:50.

Here is a short transcription (not verbatim):

> People have asked if BPF is Turing complete. Is it possible to have a BPF program that can run BPF programs? We have had some serious discussions as to whether we are there yet. And the answer is no. The answer is no because the verifier rejects unbounded loops, and so there are some things that are not possible. There are bounded loops in BPF now.

That is a generally-useful property in various contexts, such as user-defined stream-processing functions running on shared systems. Long ago my main contribution to the re-write of the "Liverpool sort program" for processing nuclear physics data tapes was ensuring that the DSL wasn't Turing complete, to ensure progress would be made.

That was specifically a design goal to make sure their runtime is bounded, and most tasks, especially in that problem space, can be done without being turing-complete.

> if this exceeds some limits or bogs down system for more than X milliseconds, disable and give error

That's exactly what the kernel will do

This is less about BPF vs native code, and more about the process model vs the event based model of application programming.

Event based handling is inherently more efficient because it runs in the context of the caller, instead of requiring its own context like in process-based applications.

This is the main reason why file system code in the kernel is more efficient than file system servers running in a different process, e.g. via FUSE. No context switching.

To be fair, it should be noted that within kernel space, event-driven architectures are already ubiquitous. I/O, filesystem, you name it. Including powerful multiplexing and dispatch frameworks.

I'd rather see this as a way to ingest performance critical code pieces into the kernel space more easily, with virtualization and verification options providing safety within an otherwise dangerous/complicated domain.

I would not agree with the article that this kind of paradigm is new - neither inside nor outside of kernel land.

Agree, it's not new, and I think GP's point is that event-driven is the norm for the kernel and why kernel-level interfaces are efficient.

What's new here is that this is being made available to user-level custom applications.

Sort of, but misses some of the larger picture. The main reason that fs code is faster in the kernel is the direct access to kernel data structures. File system, virtual memory, and buffer cache are all three sides of the same coin. Once you divorce yourself from direct (even if sandboxed) read and writes of the underlying data structures, you impose a massive overhead.

Hmm I don’t think this is the case, at least not when comparing fuse to in-kernel file systems. Having access to native VM structures only helps to the extent that you can avoid copies, yet in fuse, only one extra copy takes place. I think having to switch tasks (and associated work: swapping mm, flushing tlb, ireting/syscalling, synchronization) is really what kills perf

Torvalds disagrees with you, at least wrt the fundamental limitation here (there may be other issues layered on top of it, of course).

> No, you need not just the blocks, you need the actual cache chain data structures themselves. For doing things like good read-ahead, you need to be able to (efficiently) look up trivial things like "is that block already in the cache".

> So you need not only the data, you need the _tags_ too.

> In other words, your filesystem needs to have access to the whole disk cache layer, not just the contents. Or it will not perform well.


Edit: and the context of this discussion was fairly ancient systems with tagged TLBs and simple in order cores with syscalls nearly as cheap as regular user space call instructions. They were still ungodly slow with microkernels, and his explanation is the meat of his view as to why. It's all about having the data in the right place with as little synchronization required.

Ah okay, fine grained control of cache to minimize IO waiting is a good counterpoint.

I actually have a lot of experience in this area, and I can say that effective readahead is a bit of a crapshoot. Only really works in trivial cases. Ultimately if IO latency sucks, nothing can save you.

His particular point doesn’t fully make sense either. It’s easy to kick off readahead when you only have access to block data, the kernel won’t issue redundant IO requests for blocks already in the cache. Also mlock/madvise give a lot of control in terms of dictating eviction strategies for special blocks.

All things being equal (costless syscall/mm swapping, IO), I still think inter-task synchronization is the largest overhead, but I have no numbers to back it up. Something tells me marshalling all IO syscalls to a kernel thread would be about as slow as a user-space FUSE task.

I wonder is this something that Futhark [0] could be used for?

[0] https://futhark-lang.org/

EDIT pleased to discover that the native `bpftrace` language is based on awk!

This looks very similar to webassembly work going on right now, both use a secure VM, and both run in kernel space.

Would webassembly be a more general purpose way of accomplishing something like this?

webassembly "runs in kernel space"?

Yes, though it's early days yet. See:


Please no.

Care to explain?

The tendency to dump all kinds of unrelated stuff in the kernel is a very bad trend. It increases the attacks surface of the kernel in unpredictable ways and will lead to security issues, especially with something as complex as an interpreter.

The interpreter is a very small part of the overall kernel codebase, and it is the same interpreter for all kinds of applications written in wasm, so it needs to be analyzed only once, but the performance benefits of running code in kernel space are huge. Moreover, wasm could even make the kernel safer by rewriting certain not-very-performance-critical parts of the kernel in wasm.

But I got your point. Thanks!

A BPF interpreter can literally be ~100 LoC. A WebAssembly VM on the other hand will likely be ~1million LoC (without checking).

One is suitable for embedding into a kernel, the other isn't.

Where did you get ~1million number? https://github.com/bytecodealliance/wasm-micro-runtime is less than 100K LoC for example.

If a micro runtime is 100K LOC then 1M LOC wasn't unrealistic.

About half is tests and sample code. The runtime proper (core directory) is <50K.

It is also WASI runtime. WebAssembly runtime proper (core/iwasm/runtime directory) is more like 10K.

This is a micro runtime that acts as a tiny WASM interpreter. 1M LOC sounds fair when complex runtime features such as JIT or AOT are included.

> One is suitable for embedding into a kernel, the other isn't.

I'd be very surprised if the linux kernel doesn't eventually get web assembly support.

Doubtful. Webassembly is Turing complete, BPF isn’t. Running untrusted unbounded code in the kernel is not smart. BPF was invented with kernel constraints in mind, webassembly was invented with browser constraints in mind. Completely different use cases

> Running untrusted unbounded code in the kernel is not smart.

Well currently we run code (e.g. drivers) as trusted full-permission code. Surely, web-assembly would be better than this from a security perspective.

Generally kernel code is also written to specific constraints. Out-of-tree drivers are usually bad at conforming and hypothetical out-of-tree webassembly drivers would be bad for similar reasons. Memory safety isn’t enough, using kernel interfaces safely requires conformance that webassembly can’t statically guarantee like BPF can

Basic examples: code that takes a lock and never releases it, code that loops forever, code that leaks kernel resources.

What use case do you envision?

Out-of-tree drivers, support for closed-source drivers in Linux (assuming kernel API for wasm drivers is stable).

While I see how out-of-tree drivers using a stable API might be a desirable goal, I don't understand what WASM brings to the table. Maybe some degree of cross-arch portability, but that's already afforded by C and the kernel.

> without checking

Then how can you possibly know? Do you just assume that because it has "Web" in the name, it must be insanely complicated?

Thank you, I appreciate the perspective.

Perhaps WebAssembly kernel-space programming would still be viable for major trusted projects, like Kafka, where you literally dedicate the entire machine to that application.

Unikernel or rump kernel would be better for that purpose

Brendan has a lot of great content that gets posted here regularly: http://www.brendangregg.com/ (I'm still trying find the time to get through it though).

There's also https://github.com/iovisor/bcc#tools as an easy way to get started using BPF.

I'd also recommend looking at bpftrace[1], which provides a simple DSL specifically for writing BPF programs and attaching them to events. (Disclaimer: I started the project)

[1] https://github.com/iovisor/bpftrace/blob/master/docs/tutoria...

Brendan's sharing is amazing. I have learned and am learning so much from it, and have referred others to it when they have a performance issue but are missing the troubleshooting tools.

He also has a book coming out this month, BPF Performance Tools http://www.brendangregg.com/bpf-performance-tools-book.html

Strange that Brendan Gregg's blog is not available via an RSS feed. An oversight?

I've wondered why operating systems, aside from hypervisors, are overwhelmingly the first abstraction - I know the obvious benefits, or rather necessities (processor sharing, security, file system, etc etc) - but in ultra-specialized perf-critical applications I'd have thought economic pressures would have materialized a greater variety of ad-hoc bare-metal software. I guess we're headed that way w/ the twilight of Moore's law, FPGAs. Maybe I'm just ignorant of such implementations.

You're not wrong. Alibaba is really leading the charge on FPGAs as servers in the data center.

It's complex though, because the tooling wrt logging, monitoring, debugging aren't as mature yet.

Isn't that sorta what a unikernel is?

yep - and some of us are working on that

So from my understanding, that's a kind of "secure" (I'd like to know more about the security model, tbh) module that runs with kernel privilege and no scheduling (so it runs to completion). These are supposed to be short and, I am assuming, can't call libs and can't allocate memory (outside a predefined stack, I would guess?)

Aren't they very similar to interrupts? What is the difference there? The kernel API?

IIRC it's a VM(?) that has certain limits, and because of those limits it's OK to run it in kernel-space. IIRC it's not Turing complete, and has a bounded run time. So Linux can just say "OK, run this now" and not worry about scheduling. It's like putting a green thread in the kernel, but to do this safely you need very strict restrictions (finite memory it can access, finite number of steps, etc.).

You can thus get speed-ups because there's no API, memory management, or even scheduling overhead.

Got it, thanks! I did not realize it runs in a VM, ok, that indeed makes sense to call it a different kind of application.

My understanding is that they're program fragments that can be attached to kernel code and are generally used for debugging and observability. It's using a virtual machine that's designed to sandbox these, and they're limited in power so that a buggy filter can't hang the kernel.

This is different from hardware interrupts because they don't run in response to hardware events, and they don't have side effects. There is a similarity in that you want the filters to run quickly, though.

I think it's more to do with avoiding overheads typically associated with system calls (presumably involving some interrupt and disabling/enabling/changing paging behaviour).

Here's an example of a syscall-heavy command on my system:

  $ time dd if=/dev/zero bs=1 count=10M of=/dev/null
  10485760+0 records in
  10485760+0 records out
  10485760 bytes (10 MB, 10 MiB) copied, 7.09089 s, 1.5 MB/s
  real    0m7.092s
  user    0m2.123s
  sys     0m4.968s
3 million system calls per second seems quite slow on a 3.2 GHz CPU when all it should really be doing is dereferencing a couple of pointers until it finds some functions that simply write a zero byte to a buffer (the "/dev/zero" descriptor handler) and ignore bytes from a buffer (the "/dev/null" descriptor handler).

If you have a safe bytecode format for representing operations that are performed in a loop, the kernel can just perform those operations without having to switch back and forth to userspace.
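
Assuming the bottleneck really is per-syscall overhead, the same copy with a larger block size should amortize the kernel crossings away (the block size and count here are illustrative):

```shell
# Same 10 MiB, moved in 64 KiB blocks: ~160 read/write pairs
# instead of ~10 million one-byte ones.
time dd if=/dev/zero bs=64k count=160 of=/dev/null
```

On most machines this finishes in a few milliseconds, which suggests nearly all of the 7 seconds above is syscall plumbing rather than actual byte-moving.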

How much of that time is really spent in the system call interface?

You've got 4.968s of system time there (i.e. broadly the time spent in kernel code) and 2.123s of user time. Given that the user-space program is effectively a tight loop around read() and write() calls, we can assume that almost all of those 2 seconds are spent going through the syscall plumbing.

Now, there's going to also be some of the kernel-side time spent in the syscall plumbing too, but there's also a lot of I/O, buffer and filesystem layer code executing there. All of which will be in use with a BPF program too. So it's unclear how much of the effective time can be shaved off.

There shouldn't really be any significant filesystem code involved, since once `dd` has opened the files, it should have handlers for those devices more-or-less directly in its descriptor table. Once you have a descriptor to a pipe or device, there shouldn't be any filesystem-level checking in the middle of your reads/writes; all you're doing is filling/emptying buffers.

And given that I can write a program that makes 132 million calls per second to the glibc `putchar` function (which also buffers), I'm pretty sure there's a lot of time that can be shaved off as we start to replace the system call mechanism with plain function calls.

Have you forgotten about Meltdown, Spectre, and all the other cache attacks?

These are things that kernel developers are surely mindful of when coming up with and implementing eBPF functionality.

Regardless, I'm sure I've run this same test years ago and seen the system call count still in the same order (that is, a couple of million per second). I really doubt Spectre mitigations are what are causing what should be a few dereferences and function calls to take around a thousand clock cycles.

It's one of the two phases. We're back to the other one, wait for a couple of months.

Re-try with mitigations=off

It reminds me a wee bit of stuff that has happened in browsers. NaCl was a way to run untrusted native code by scanning it for unsafe operations.

BPF seems to have a similar aim in a sense: run less-trustworthy code in an unrestricted environment, but don't use CPU hardware based access control (and therefore context switching) to limit it, rather, make the code more sandboxed in software to prevent the need for context switches.

Very interesting. The BPF programs are often written in C and compiled using a BPF backed to llvm using this project: https://github.com/iovisor/bcc

This should be useful any time you need a high performance, high security way to instrument or extend a C based program during run time! Not just in kernels.

There are a bunch of places where non-Turing-complete scripting can be used to provide better interfaces between mutually-untrusting systems. Bitcoin Script, which is loopless like BPF, is another example. (I don't mean eBPF, which isn't loopless.) I've been thinking about using this approach in the Wercam windowing system for BubbleOS to get reliably low-latency feedback for user interface events, as Don Hopkins did with NeWS; https://gitlab.com/kragen/dercuano/blob/master/markdown/werc... explains how, and also delves a bit into the history of the approach. Other possible uses include active networking (by including a routing program in your packet headers), specifying pub-sub subscriptions, and specifying database queries.

Sounds a lot like SPIN OS: https://en.wikipedia.org/wiki/SPIN_(operating_system) BPF has to make do without the type safety that Modula-3 provided in SPIN's case, but it's really cool to see it tried. For those interested: SPIN and other hybrid kernels (like Exo) were created in the fallout of "microkernels are bad" by attempting a hybrid approach. Linux was created around the same time, staying squarely in the monolithic category (but it has since moved to be more hybrid).

Out of the loop, why is Linux now "more hybrid"?

It provides some capabilities previously limited to microkernels, around the ability of an application domain to customize the kernel to fit its needs. You can see it in loadable kernel modules, in user-space drivers which let applications control how they access hardware directly, and now in loadable application code which can run directly in the kernel. "Hybrid" here doesn't denote something with an address space layout similar to a microkernel's so much as a monolithic operating system that has some capabilities previously only available in a microkernel.

Because it has a crap ton of drivers and shit in the kernel, which is the opposite of unikernel/microkernel approach where most of this functionality is implemented in userspace. See eg Minix

I think the question was how is it a microkernel at all

Many of the benefits of microkernels have been implemented in Linux over the past 25 years. Modules, FUSE, and live patching for example.

I don't understand kernel modules, dynamic-loaded or otherwise, to be any more or less monolithic (in the sense of monolithic versus micro kernels). Do they have their own address space or something?

kernel modules and BPF ....

I do not know much about kernels and OS architecture. But I am wondering how this relates to the idea of GNU Hurd. Wasn't its concept also about microkernel-based design?

This is pretty different, Hurd ran all of this stuff in user space.

It's really close to exokernels though. XOK had three different, but similar, VMs for running user code in kernel space. For example, their version of futex was a filter program that got run to see if the program should be woken up.

The Lost Generation discovers IBM Mainframe Channel Programs?

Want to bet if they are going to make all the same mistakes themselves, or if they are willing to learn from history?

Seems like they only resemble Channel Programs in spirit since they don't use any specialized hardware. If it all runs on commodity hardware and is completely generic/customizable in software then it seems a lot more useful and economical than anything for a mainframe.

All snark aside, is there potential benefit to Varnish Cache with this becoming widely adopted? Things that could only be accomplished at lower layers like this implies access to?

Since there is already highly efficient event-driven I/O, and a context switch is bound to happen for applications that require major business logic in user space, I doubt there will be huge benefits for large server applications.

Maybe simplified versions work with this paradigm with significant performance gains, but I doubt that you could simply plug Varnish or nginx in and see much of an improvement.

There is still quite a complexity gap between e.g. interpreting firewall filter rules and a full-blown web reverse proxy.

I think you could use this to generate a reverse proxy for the current config in nginx, maybe if only for some subset of configurations. Like a fast path for common flows. But it wouldn't be easy.

Enlighten us grey one.

Which mistakes? Like, I've def got a few in mind, but I'd like to hear your perspective.

I'm a big fan of eBPF (and Brendan Gregg), but it's not really a new type of software - the original BPF (1993) has been around for a while, not to mention other user-extensible kernel architectures such as ring-based kernels and microkernels (1960s-) and exokernels (1994).

eBPF reminds me a lot of exokernel type systems.

gregkh said at Kernel Recipes this year that "we have a micro kernel and nobody noticed" while talking about eBPF.

I think he may have said that about "ELF modules" [1] specifically, because plain BPF code runs in kernel mode, and that alone wouldn't make it a microkernel. I don't know if the ELF modules feature has been merged upstream.

1. https://lwn.net/Articles/749108/

It's more an exokernel, but yeah.

Can someone write an ELI5 of BPF, please?

Good god those linked articles ought to keep anyone busy for months

This is one of those moments when I read an article and say to myself: “This stuff needs a smarter programmer than I am.” I will try to watch the video; hopefully it is simpler to understand!

I've thought that many times and always been wrong. At the end of the day, code is just code. After you take the time to understand the environment it's running in, it's not that much different than anything else you could write.

Creating BPF in the first place took a lot of cleverness. Figuring out how to fit it into a massive infrastructure at a place like Facebook took a lot of cleverness. But most of the people actually working on the implementations are just normal software engineers.

There is lots of code that is difficult to understand, for anyone.

What about code that proves a complex, many page mathematical theorem?

What about code that does complex and very math heavy things like GCM encryption modes, or statistical compression via prediction by partial matching and arithmetic coding?

Just reading the code isn't always enough, plenty of complicated code requires understanding of concepts far outside what you could hope to fit in a comment.

> What about code that proves a complex, many page mathematical theorem?

The hard part should be understanding the theorem. If you understand that and the code is still difficult, then the code should be better written.

I guess one exception is when you're optimizing for performance and make an explicit decision to sacrifice maintainability.

I wonder what Netflix does on their FreeBSD machines.

if i like the guarantees that BPF gives the kernel, and want to embed it into my user-space software so that it can execute BPF programs received from other untrusted processes, are there any hints on where to start? the "BPF beginners" material i come across doesn't seem to discuss this use case.

This is currently not easy to do. First of all, the kernel BPF implementation is GPLv2, which means you cannot rip out the runtime and embed it in your userspace code unless you are able to distribute your userspace software as GPLv2 (This is AFAIK and IANAL). For software engineering and license compatibility reasons, you probably want to use a clean implementation of eBPF.

Two such implementations exist. uBPF is a simple implementation. It is likely not feature complete, as it is not regularly updated and the kernel BPF verifier keeps getting new features. DPDK also has an implementation of BPF. This appears to be actively developed, but it also comes with a lot of baggage from DPDK. It may be possible to fork it for this purpose, but it appears to be primarily used for running traditional network filters on incoming packets in userspace.

What you are looking for is a really interesting possibility but would require a significant software engineering effort.

If you relax the constraint that BPF runs in userspace, it is possible to insert userspace dynamic tracing points (USDT) into your code, which will call out to a BPF program in the kernel when executed.

> First of all, the kernel BPF implementation is GPLv2, which means you cannot rip out the runtime and embed it in your userspace code unless you are able to distribute your userspace software as GPLv2

mmm right, should've thought of that.

> What you are looking for is a really interesting possibility but would require a significant software engineering effort.

thanks for the info, i appreciate it, and it gives me a bit to chew on.

Can someone explain in one simple paragraph what this is?

I still have to watch the presentation and/or read the book, but at first glance this looks like a nasty kludge. I hope I'm wrong.

Great video. Good content, delivery and impressive lighting and video production work.


Once you see it, it cannot be unseen!

Question from me, why reinvent the bicycle and not just write proper kernel modules in C?

Well, if you have seen the video, you probably wouldn't ask this. Kernel modules can do everything that BPF can do, but BPF can be easier to write, more isolated, and less error-prone.

BPF is completely production-safe: there is no way for a BPF program to crash the kernel, introduce significant performance latency, or have any side effects on kernel/user space. Obviously, kernel modules have none of those properties.

Also, BPF has been around for almost 30 years, and you're likely using it. tcpdump is basically just a BPF bytecode frontend, for example.

You can maintain production safety by using a BPF->kernel module compiler.

This additionally removes the need to have the bpf compiler in the kernel, reducing both core size and vulnerability surface area.

No reason BPF must imply JIT

The end goal with bpf is to allow arbitrary untrusted programs to load bpf programs. If you were just loading kernel modules you wouldn't be able to maintain kernel integrity and let arbitrary programs load code.

I can't find the LKML thread, but there is some fundamental problem with unloading modules safely. BPF programs might not have such limitations.

Oh, neat! Yeah, I didn't think of this but that totally makes sense.

BPF programs declare their resources to the kernel (maps, etc.), but modules rely on the __exit function cleaning everything up properly. The correctness of __exit is unverified, difficult to get right, and practically one of the least-tested pathways, which makes it traditionally fraught with bugs.

A BPF->kernel module compiler would ensure all necessary cleanup happens in __exit automatically.
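
As a sketch of the manual pairing a module author has to get right (illustrative only; this needs kernel headers and won't build as a userspace program):

```c
#include <linux/module.h>
#include <linux/slab.h>

static void *state;

static int __init demo_init(void)
{
    state = kmalloc(4096, GFP_KERNEL);   /* acquire on load... */
    return state ? 0 : -ENOMEM;
}

static void __exit demo_exit(void)
{
    kfree(state);   /* ...and remember to release on unload, by hand */
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
```

A verified BPF program declares its maps up front instead, and the kernel releases them itself when the program's last reference goes away, so there is no hand-written __exit to get wrong.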

A BPF->kernel module compiler wouldn't let the kernel verify the program in a real way. There's still work to be done, but the end goal of BPF is pretty obviously to allow non-root users to load programs.

Doing the verification offline is a non-starter. Appending the verification information and re-verifying it at load time is more work than a BPF runtime as it is, since you have to re-project ISA semantics in a more complex way.

Hmm I think bpf today is used in fully trusted environments. Kernel level verification is unnecessary except in untrusted containerized environments or when running untrusted applications, both use cases being relatively rare/specialized.

I think the main benefit of bpf is that it prevents you from shooting yourself in the foot. Running totally untrusted code in the kernel just seems like a recipe for disaster and for that reason it makes sense bpf is still limited to root.

That's where it is today, but the end goal is removing that restriction. It probably would have happened quicker if Spectre/Meltdown hadn't come out of nowhere. Like it used to be that KVM required CAP_SYS_ADMIN as well, but now that's been opened up to whoever has permissions to the device file. Start requiring "own the box anyway" privileges while the feature bakes, but open it up as it becomes more mature and attackers have a go at it.

It's sort of like how originally you could only jump forward, then they opened it up to any DAG, then they allowed provably bounded loops.

And there's been OSes that don't require root for their in kernel virtual machines, XOK and AEGIS being the prominent examples.

Sure but I don’t see any compelling use cases to motivate opening it up. Do you have an example of one?

Even if the kernel opens it up without a compelling use case, it seems likely that distribution policy will keep it default locked to root.

Which is my point here. I don’t see a compelling reason to have a JIT in the kernel when AOT BPF seems to cover 90% of all existing use cases. In fact I may even write a bpf to kernel module compiler myself.

* Syscall tracing, sandboxing, and monitoring

* KVM device MMIO emulation

* OS personality emulation (like WSL but doesn't require root)

* New synchronization primitives to user space (like XOK's wake predicates)

* A lot of others..

Modern BPF is exploring the same cornerstones as exokernels and really opens up a whole bunch of concepts that haven't been seen in mainstream kernels, particularly if non-privileged users are allowed to invoke it.

Thanks for the examples but those all still seem like things vast majority of Linux users can do today, since vast majority of Linux users have root access. Both desktop and server.

Mobile users like android don’t have root but I don’t see why an untrusted mobile app would need bpf.

The only benefit of allowing non-root that I can see is enabling untrusted containers in cloud environments to do the same. All large cloud providers use KVM/Xen (not containers) for untrusted users, in which case they already have root.

Can you give an example of a scenario where the user doesn’t have root yet still would want to do those things?

Loading BPF requires root, so does loading kernel modules. I.e. if you have permissions to load BPF, you can already load arbitrary code.

At the moment.

eBPF and kernel modules solve completely different tasks. If you need a kernel module to accomplish your task, then by definition you cannot accomplish it with eBPF.

I guess you are missing the point by 100,000 miles. The whole point of BPF is to avoid writing a C kernel module and pulling in the entire problem domain of C programs running in Ring 0. This was called out explicitly in the video.

He’s not completely off the mark: you could maintain production safety by using a BPF->kernel module compiler. It's unnecessary to have an entire JIT infrastructure in the kernel just to get the safety benefits of BPF.

You need root to load kernel modules. While this isn't possible yet, the intention is to allow loading BPF programs without privilege.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact