eBPF Is Awesome (filipnikolovski.com)
268 points by filipn 11 months ago | 44 comments



Most examples of BPF code are written in a mix of Python and C using BCC, the "BPF Compiler Collection", which essentially treats all of LLVM and clang as a library callable from Python code.

I can't get my head around using it that way, and have found it pretty straightforward to just write C programs, compiled with clang `-target bpf`. Until very recently, writing anything interesting this way required you to declare all functions inline, compile into a single ELF .o, and, of course, avoid most loops. But most of the kinds of things you'd write in BPF tend not to be especially loopy (you can factor most algorithmic code out into userland, communicating with BPF using maps).

A big issue for this kind of development is kernel compat; struct layouts can change from release to release, for instance. This isn't a problem for us at Fly, because we just run the same kernel everywhere, but it's a real problem if you're trying to ship a tool for other people's systems. But that's changing with CO-RE; recent kernels can export a simplified symbol table in a BPF-legible format called BTF, and the loader can perform relocations. Facebook has written a bunch of good stuff about this:

https://facebookmicrosites.github.io/bpf/blog/2020/02/20/bcc...
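To make the relocation idea concrete, here is a toy model of it in plain Python. This is only a sketch of the concept, not libbpf's API: the `relocate` function, the dictionary layout, and the byte offsets are all made up for illustration. The point is that the compiled program records which struct field it wants, and the loader fills in the offset from the type info of the kernel it is loading onto.

```python
# Toy model of CO-RE-style field relocation. Not libbpf's API:
# every name and every offset here is hypothetical.

# Stand-in "BTF" for two imaginary kernel builds:
# struct name -> field name -> byte offset.
BTF_KERNEL_A = {"task_struct": {"pid": 1200, "comm": 1504}}
BTF_KERNEL_B = {"task_struct": {"pid": 1216, "comm": 1528}}

# The compiled program records *which* field each load touches,
# instead of hard-coding the offset seen at build time.
PROGRAM = [
    {"op": "load", "struct": "task_struct", "field": "pid", "offset": None},
    {"op": "load", "struct": "task_struct", "field": "comm", "offset": None},
]


def relocate(program, btf):
    """Return a copy of program with offsets resolved against btf."""
    patched = []
    for insn in program:
        fields = btf[insn["struct"]]  # KeyError if the struct is missing
        patched.append({**insn, "offset": fields[insn["field"]]})
    return patched


offsets_a = [insn["offset"] for insn in relocate(PROGRAM, BTF_KERNEL_A)]
offsets_b = [insn["offset"] for insn in relocate(PROGRAM, BTF_KERNEL_B)]
print(offsets_a, offsets_b)
```

The same `PROGRAM` object loads on both imaginary kernels; only the patched copies differ.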


There's also https://github.com/alessandrod/bpf-linker to make compiling a bit easier, as it does necessary inlining at link-time.


I think dtrace has the same problem, i.e. it's pretty tightly coupled to the exact functions / trace points in the kernel. A different kernel can break a dtrace script, although I think their code changes a lot less than Linux does.

It seems somewhat unavoidable, if the goal is to introspect the kernel at a very intimate level ...


No, DTrace does not have this problem, though our solution to it is one of the least well known aspects of DTrace: we have a notion of explicit stability that allows for stable scripts to be built on top of very low level implementation details that themselves might change. See the chapter on "Stability" in the Dynamic Tracing Guide[1] for details.

[1] http://dtrace.org/guide/chp-stab.html


> Stability attributes are computed for most D language statements by taking the minimum stability and class of the entities in the statement.

That's a fascinating read and an amazing idea. To your knowledge are there any other software ecosystems that track stability in nearly as formalized a way? Has there been investigation into bringing these ideas into other modern languages? (I don't believe Rust has a concept like this, for instance, though it would even further strengthen the language's concept of correctness if it did!)
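The "take the minimum" rule in the quoted sentence can be sketched in a few lines of Python. The level names and their ordering below are as I recall them from the guide, so treat them as assumptions, and note that real DTrace attributes track name stability and data stability separately; this sketch collapses them into one value. `statement_attributes` and the two sample entities are hypothetical.

```python
from enum import IntEnum


class Stability(IntEnum):
    # Assumed ordering, lowest to highest stability.
    INTERNAL = 0
    PRIVATE = 1
    OBSOLETE = 2
    EXTERNAL = 3
    UNSTABLE = 4
    EVOLVING = 5
    STABLE = 6
    STANDARD = 7


class DepClass(IntEnum):
    # Assumed ordering, least to most portable.
    UNKNOWN = 0
    CPU = 1
    PLATFORM = 2
    GROUP = 3
    ISA = 4
    COMMON = 5


def statement_attributes(entities):
    """Combine (Stability, DepClass) pairs by taking the minimum of each."""
    entities = list(entities)
    stability = min(s for s, _ in entities)
    dep_class = min(c for _, c in entities)
    return stability, dep_class


# Hypothetical entities used by one statement.
builtin_variable = (Stability.STABLE, DepClass.COMMON)
kernel_symbol = (Stability.PRIVATE, DepClass.UNKNOWN)

combined = statement_attributes([builtin_variable, kernel_symbol])
print(combined)
```

A statement that touches one private kernel symbol is only as stable as that symbol, however stable everything else in it is.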


Well, we certainly thought it was a big deal! We were really trying to address this issue of writing stable scripts -- allowing for stable, powerful tooling without ossifying the underlying system. I'm really pleased with the work we did there -- but it's unquestionably one of the more esoteric aspects of DTrace.

Something of a funny story that this brings to mind: the taxonomy we have here is actually the interface taxonomy from what was Sun's Platform Software Architecture Review Committee (PSARC), which itself borrowed it from Sun's larger Software Development Framework (SDF). We had to get DTrace reviewed by PSARC, which we weren't necessarily looking forward to -- in part because of big developments like this one. To get past our PSARC review, we adopted several strategies, one of which was to separate out DTrace from its instrumentation providers as separate cases before the committee. When we first presented DTrace to PSARC, committee members wanted to fixate on instrumentation methodology -- and it was very helpful to be able to defer these fixations to later cases (after having let members pontificate and chew up some of the clock, of course). The other technique that we developed (which was devastatingly effective) was to distract the committee with issues that were irrelevant but amenable to debate. When a debate emerged among the committee members (and PSARC being more or less a debating society, this was practically guaranteed), we would effectively feed both sides of the debate -- and in the end, run out the clock on something we didn't care about. All of this worked exceedingly well -- and DTrace itself (one of the largest cases that had ever appeared before PSARC) was approved with essentially no modifications.

Shortly after the DTrace case was approved, we started bringing forward cases on instrumentation providers. With each case, we presented the stability matrix of that particular provider; on the first such case, I remember vividly one committee member asking: "what the hell is this and when do we review it?!" We explained that it was the stability matrix -- as explained at length in the case that they had in fact already approved. They realized in an instant that they had fixated on a dinghy of nomenclature while we had slipped behind them an ocean liner of semantics -- and it was glorious.


> The other technique that we developed (which was devastatingly effective) was to distract the committee with issues that were irrelevant but amenable to debate.

It's not exactly the same but this reminds me of the way Matt Stone described his interactions with the MPAA board in This Film Is Not Yet Rated (https://en.wikipedia.org/wiki/This_Film_Is_Not_Yet_Rated).

i.e. they went into the Team America rating negotiation with aggressive material they were prepared to cut, and probably wanted to cut anyway, and let the committee spend all their time on that.

See also (NSFW):

https://youtu.be/SgyG8y1vg1M?t=151

https://lettersofnote.com/2009/09/30/p-s-this-is-my-favorite...


Thanks for the reference... although it seems clear upon reading it that the problem is mitigated (by tagging/documentation) but still present. Although the tool support looks nice and is probably something that could be borrowed for eBPF.


Off topic, but thanks for being so inspiring ;)


BTF is designed to avoid the problem, by having the live kernel export symbols; supposedly --- I haven't used it --- the toolchain even converts it to a header file, so BPF programs just include a "vmlinux.h" instead of includes pointing into kernel source (which is a nightmare). It's ambitious and I'm surprised it does as much as it does but apparently they're solving this problem.


I can imagine it leading to ossification of kernel internals.

Imagine when someone comes up with a revolutionary new paging method, but it causes everyone's eBPF scripts to fail to load and a bunch of tools to break...


What's worse is that I've run into kernel bugs/panics a few times that made me hesitate to recommend BPF for production systems. Hopefully those become less frequent as the ecosystem matures, but they were pretty scary!


Handling input options and displaying output is a little easier in Python. It also lets you hack the tools quickly and run any changes instantly.


CO-RE is great, but for those who have to run on older kernels an approach is to loop, guessing the offset and running an experiment to see if correct:

https://github.com/weaveworks/tcptracer-bpf/blob/cd53e7c84ba...

This was done (by Kinvolk) for the visualisation tool Weave Scope; also picked up by DataDog https://github.com/DataDog/datadog-process-agent/tree/master...
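The guess-and-check loop can be simulated in userland Python. This is a sketch of the idea only, not the tcptracer-bpf code: the opaque blob stands in for a kernel struct whose layout the guesser does not know, and `make_fake_sock`, `guess_port_offset`, the port number, and the hidden offset are all made up.

```python
import struct


def make_fake_sock(port, port_offset, size=64):
    """Build an opaque blob standing in for a kernel socket struct."""
    blob = bytearray(b"\xaa" * size)
    struct.pack_into("!H", blob, port_offset, port)  # network byte order
    return bytes(blob)


def guess_port_offset(read_u16, known_port, max_offset):
    """Try each offset until the value read back matches known_port."""
    for offset in range(max_offset):
        if read_u16(offset) == known_port:
            return offset
    return None  # layout not recognised


# The experiment: we created the "socket" ourselves, so we know its port.
KNOWN_PORT = 40123
fake_sock = make_fake_sock(KNOWN_PORT, port_offset=22)


def read_u16(offset):
    """Stand-in for a probe that reads two bytes at a guessed offset."""
    return struct.unpack_from("!H", fake_sock, offset)[0]


found = guess_port_offset(read_u16, KNOWN_PORT, max_offset=len(fake_sock) - 1)
print(found)
```

A single match can be a coincidence, so a real tool would presumably repeat the experiment with different known values before trusting an offset.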


I got a bunch of Numba version related errors (Python 3.7) when I tried to run the example code on the website, and my thoughts were in the same direction. I was wondering if it is possible to write something like this in, say, Golang instead of Python.


There are Go bindings for BPF and BCC: https://github.com/iovisor/gobpf

I'm not sure of the state of them at this point, but it's the same paradigm GP mentioned.


I think the more idiomatic thing to use in Go is Cilium, which has tooling support for loading and attaching eBPF programs, and also a weird embedding system that calls clang9 directly.

I find the Cilium libraries sort of hit-or-miss†, though they mostly work well. But, again, I just build my BPF programs themselves with Makefiles into .o's, and use Cilium (or, for XDP/TC, iproute2) to load them.

https://twitter.com/tqbf/status/1336825568478834689


Spectre mitigations can make it go from awesome to useful.

The documentation is also pretty dire, but it's mostly implement-once remember-forever in my experience - it's all there but kernel samples are quite hard to read, and I'd rather not guess based on struct listings (e.g. variable length structs aren't particularly fun when you're fumbling around)


It seems worth mentioning that the code actually executing in the kernel, when it is running your eBPF, is native machine code, ahead-of-time compiled from the bytecode program you gave to the kernel.


Technically it's JITted not AOT compiled.


The line is blurry.

JIT compilation means compiling code on the fly right when you're about to execute it. When the code is part of a larger program, jitting lets you compile the parts of it you actually need and avoid wasting time compiling other parts. It also lets you compile the relevant code paths based on dynamic flow analysis, which often involves interpreting your program the first time you run it and emitting instructions for the next time around (tracing).

If the code unit is small and you know you're going to run it all, you can compile it in one swoop. If you compile it when you load it in the kernel, as opposed to compiling it lazily right before you run it the first time, I think it's fair to call this Ahead-of-Time compilation, even if the compilation happens right next to the use site and not as part of the developer tool chain.


What ithkuil said.

The time to translate the bytecodes is taken ahead of time, when the eBPF is installed, rather than the first time some other system call needs to actually execute the code.

It is not hard to imagine a system removing and installing eBPF bits dynamically according to realtime events, but it is a fair bet that in most uses they are set up at program start time and left in place.
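The distinction being argued here is about when translation happens, and that part can be shown with a small Python analogy. It is only an analogy: Python's built-in `compile` produces bytecode rather than machine code, and the two class names and the `pkt_len` filter expression are made up.

```python
class LazyProgram:
    """Translates on first run: JIT in the strict sense."""

    def __init__(self, source):
        self.source = source
        self.code = None  # nothing translated yet

    def run(self, env):
        if self.code is None:
            self.code = compile(self.source, "<lazy>", "eval")
        return eval(self.code, {}, env)


class LoadTimeProgram:
    """Translates when installed, before anything runs it."""

    def __init__(self, source):
        # Translation cost (and any SyntaxError) is paid here, at load.
        self.code = compile(source, "<load>", "eval")

    def run(self, env):
        return eval(self.code, {}, env)


lazy = LazyProgram("pkt_len > 64")
eager = LoadTimeProgram("pkt_len > 64")

lazy_translated_before_run = lazy.code is not None
eager_translated_before_run = eager.code is not None
lazy_result = lazy.run({"pkt_len": 128})
eager_result = eager.run({"pkt_len": 128})
print(lazy_translated_before_run, eager_translated_before_run)
```

Both give the same answers; the only difference is whether the first caller or the installer pays for translation, which is the sense in which load-time compilation looks like AOT.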


Yup, it is. My recent epiphany-blogpost:

https://blog.habets.se/2020/11/BPF-the-future-of-configs.htm...


> running a user space program inside the kernel

Isn't it actually running a user program in kernel space?


Is it possible to write device drivers in eBPF?

(I've asked this before, but haven't gotten any response, and no clear answer from Google/DDG either).


eBPF isn't Turing complete after being verified so I would assume no.


Do device drivers typically need to be Turing complete? I would have expected drivers for simple USB devices, for example, to be pretty simple state machines.


I'm not sure. Even if you could do it in eBPF, I really don't think you could get it past the verifier, because it's not just like formal verification: the program actually has to be trivially bounded (so parts of it definitely could be done without Turing completeness, but you ain't writing your GPU driver in it).

The JIT isn't necessarily turned on so it's probably not a great idea in the first place
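To give a feel for what "trivially bounded" means, here is a toy check in Python that rejects any program containing a backward jump. It is a deliberately crude model and not the kernel verifier's actual algorithm; the instruction format and the `has_back_edge` name are made up. If every jump goes forward, the program counter only ever increases, so the program must terminate.

```python
def has_back_edge(program):
    """Return True if any jump targets its own or an earlier instruction.

    program is a list of (op, arg) tuples; for 'jmp' and 'jeq' the arg is
    a jump offset relative to the next instruction.
    """
    for pc, (op, arg) in enumerate(program):
        if op in ("jmp", "jeq"):
            target = pc + 1 + arg
            if target > len(program):
                raise ValueError(f"jump at {pc} leaves the program")
            if target <= pc:
                return True  # could loop forever: reject
    return False


# Forward-only control flow: accepted by this toy check.
straight_line = [("mov", 1), ("jeq", 1), ("mov", 2), ("exit", 0)]

# Jumps back to instruction 0: rejected.
looping = [("mov", 1), ("jmp", -2), ("exit", 0)]

accepted = not has_back_edge(straight_line)
rejected = has_back_edge(looping)
print(accepted, rejected)
```

A check this conservative also throws out plenty of loops that would terminate, which is roughly why anything with open-ended control flow is hard to express.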


What are the benefits of using eBPF besides a promise of observability "for free"?

Can eBPF be used for observability with platforms like Java or .NET Core, or do their platform VMs obfuscate too much for monitoring them with eBPF to be feasible?

How does eBPF work wrt OpenTelemetry etc.? Should OpenTelemetry be seen as standardized interfaces to which eBPF reports data?


eBPF helps with kernel observability - an area that has been sorely lacking in the past. For the JVM or .NET, they give you virtually no insight at all into system calls - so eBPF is complementary to VM profilers, not a replacement. If you ever used Shark on OS X you will get a sense of how cool this is - this was a profiler for the OS X JVM which profiled the system calls as well and combined it all into a single trace tree. Maybe one day we'll get similar profilers on Linux for these systems - with eBPF it should be fairly straightforward.

OpenTelemetry is just a reference API. You could export metrics using eBPF as well. I'm pretty sure Sysdig does this for example.


See Brendan Gregg's excellent work in this space

http://www.brendangregg.com/blog/2014-06-12/java-flame-graph...


It’s definitely possible in some VMs. I’ve been working on a Ruby profiler that collects the stacks from a BPF program [1]. There are some BPF safety mechanisms that require some creativity to overcome, such as max instructions, not being Turing complete, etc.

[1]: https://github.com/facebookexperimental/rbperf


> The eBPF program is written in a pseudo-C code

Pseudo? This is a nit, but isn't it actually regular C?


I just happened to run into a FreeBSD video on dtrace (similar technology to eBPF, I think) that was created three weeks ago.

https://www.youtube.com/watch?v=E06GVdH-LX0


I think comparing dtrace to eBPF is missing what makes eBPF great. dtrace is just one application that can be implemented using eBPF.

Your toolbox can be used to fix things, but eBPF is a factory for making new types of tools and toolboxes.

eBPF can be used to make small programs that run at tracing points, thus making dtrace. But it can also be used to make packet filter decisions (thus altering what happens), and with at least one network card that eBPF program can be pushed to the network card and filter before the packet even hits RAM, much less the CPU!

eBPF can run at socket init time, and set some default TCP tuning parameters.

Another comment in this thread asked if one can write a whole device driver in eBPF. The answer is actually not clear.

eBPF is more similar to "the ability to load kernel modules" than it is "a tracing framework".


That sounds extremely cool.

Sadly, I don't program in Linux, so I can't use it. :'(


If you program on Windows you should check out Event Tracing for Windows (ETW). Similar to eBPF, ETW is a logging framework inside the Windows kernel. Microsoft.Diagnostics.Tracing.TraceEvent[0] is a nice NuGet package for logging and analyzing ETL files.

[0] https://github.com/microsoft/perfview/blob/master/documentat...


But only after reading this glorious and funny article about using ETW for logging thread context switches. https://caseymuratori.com/blog_0025


Thankfully all of this native API is abstracted away in C#.


If you just want to learn and try it, you can always do it in a Linux VM.

My general development skill (in Linux or otherwise) has definitely improved since I became a Linux native. But that didn't happen overnight.


You may try out generic eBPF outside of Linux: https://github.com/generic-ebpf/generic-ebpf


LLVM also has a BPF backend so you can compile any C++/C program to run on BPF.


There is also a bpf triple for gcc (I say that having never been able to actually use it)


If you're using an old enough MacOS X (I think 10.12 or older), DTrace has similar functionality. Unfortunately it has been broken in recent MacOS versions, at least unless you disable SIP.



