Hacker News new | past | comments | ask | show | jobs | submit login
BPF: The Forgotten Bytecode (cloudflare.com)
195 points by jgrahamc on May 22, 2014 | hide | past | favorite | 29 comments



We used BPF at Arbor Networks as a medium for signatures for DDoS attacks (other tools in our stack automatically generated them from statistical models), and I spent a pretty big chunk of my first year there working on ways to optimize evaluation of lots of BPF filters on NetFlow records in parallel.

The BPF virtual machine itself is trivial; any competent systems programmer given its specification should have no trouble implementing it.

What's most interesting about BPF is libpcap. There is a surprising amount of compiler theory baked into pcap, and it bears by far the larger share of the weight in making Unix packet capture applications workable and efficient (this may have changed with JIT'd BPF). I actually found this frustrating, as simple changes to the BPF VM (for instance, "vectorized" comparisons to support multiple filters) would break the libpcap optimizer.

I confess to being a bit baffled as to why things like seccomp-bpf would choose to use such a simple bytecode scheme; network packet captures uses it because of tradition and compatibility, and because a competently-designed high level filter language compiler supports only BPF.


I don't know if I'd call what libcap does "compiler theory" it appears to be a darker art that has extra knowledge about a lot of different things. It's remarkably sophisticated though. I implemented a BFF jit engine for a popular ips and spent a lot of time trying to work out a good way to replace libcap before we just went with it. Optimized libcap output will routinely make you scratch your head and wonder how it can work.


I would read with great interest a blog post about your experiences. I have no practical need, but I thought this tour of BPF was fascinating, and it left me hungry for more.


In addition to his blog post, which will probably be great, my reading list would be:

* First, the McCanne BPF+ paper.

* Then, the pcap source code, particularly pcap/optimize.c, perhaps starting with the peephole optimization stuff.

* Finally, one of the BPF JIT projects (there's one on SourceForge). Amusingly: the Linux BPF JIT was at one point used as an aid to memory corruption, by jit-spraying kernel memory. So also consider reading any of the jit-spray papers.


hey Steve here, the BPF+ work was done mainly by Andy Begel... really good stuff but it never got incorporated into libpcap, and I never otherwise published how the original optimizer worked. I did, however, give a retrospective talk on it a couple years back at sharkfest...

http://sharkfest.wireshark.org/sharkfest.11/

Click on "Keynote Video" for a really boring talk :-)


> I confess to being a bit baffled as to why things like seccomp-bpf would choose to use such a simple bytecode scheme;

Maybe they envisioned a hardware implementation?


What would you want to do with seccomp-bpf that can't be expressed with BPF? Or did I misunderstand the comment?

You want to allow or deny calls based on syscall args, which seems to fit the BPF model. Perhaps string matching is where it breaks down, since syscall args have strings and network packets don't? I suppose a stripped down regex like Lua's might be useful for matching arguments that are file system paths, and that is firmly outside BPF as far as I understand.

I would guess that they wanted to reuse a VM already in the kernel, and that had a fast JIT implementation, but I'm just speculating.


The BPF VM is suitable for system call arguments, but probably not optimal for performance.

The pcap compiler is not particularly well-suited for system call arguments.

The real reason to use the BPF VM, as opposed to any other VM or decision tree, is the pcap compiler, which doesn't buy much for system calls.


Well, I think it makes sense given the history. When seccomp-bpf and the other stuff was introduced, there was already a BPF VM in the kernel designed to efficiently run untrusted code (although seccomp-bpf can't use the JIT today); adding another VM is a maintainability and security hazard, and simplicity is good for security, so why not adapt it?

And now that it exists, it makes sense to evolve the bpf design to be more efficient but keep the result backwards compatible and unified between subsystems.


There was the DIF vm from dtrace too already, it would have been a better fit but that's difficult regarding linux.


There are a couple later innovations on the BPF design that are really interesting:

Mach Packet Filter: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.46.4...

MPF is interesting because it was designed to make packet filters fast as the primary way to dispatch packets to user-space endpoints- BPF just has to run all the filters on all the packets, but MPF will merge filter prologues and then do a switch on some common value (like the port).

Dynamic Packet Filter: http://pdos.csail.mit.edu/~engler/dpf.html

DPF takes this a step further by using a more declarative bytecode. It can merge multiple chunks of packet filters so that with layered protocols switching on IDs, ports, etc. you end up with a decision tree (or DAG maybe?). They also wrote a JIT with some interesting optimizations like picking the best dispatch method for each switch based on the number of branches, so they actually get better dispatch performance than hand-written packet demultiplexers.


Eric Dumazet, one of the many kernel hackers, wrote a JIT compiler that compiles BPFs directly to x86 code _in the kernel_. Much to learn from his code: http://lwn.net/Articles/437884/

As tptacek notes, the VM is very simple -- it's straightforward to write a BPF VM to C-code translator that you can then compile by gcc. But it will still be slower than edumazet's compiled code.


In my article about DynASM I mention the BPF JIT and suggest that it could use DynASM (it would make the JIT more readable at basically no cost): http://blog.reverberate.org/2012/12/hello-jit-world-joy-of-s...


FreeBSD has had limited support for this for a long time now (limitations may be lifted now, not sure).

http://freebsd.1045724.n5.nabble.com/PATCH-BPF-Just-In-Time-...



If only there was an LLVM backend to generate BPF. Oh wait, there is! As readers of LLVM Weekly (http://llvmweekly.org) will know, an implementation was posted to LKML a while back: https://lkml.org/lkml/2014/2/5/743


Full Disclosure: asb is the editor of llvm weekly


Given the restricted nature of BPF, what kind of input is expected to be used with this? Surely you can't just run any code through it. There wasn't any additional info in that posting, unfortunately.


This backend is for an "extended" BPF. https://lkml.org/lkml/2014/2/5/728


All the jumps are only forward, which guarantees that there aren't any loops in the BPF program.

In other words, BPF programs can be thought of as encoding a DAG that represents a decision tree.


Yes and no. It's important to keep in mind that BPF uses a register machine. If you look at the paper[0] describing BPF, the authors explicitly considered a decision tree model (CSPF). Turns out that trees perform redundant computation, can't deal with variable length headers, and are not memory efficient.

[0] http://www.tcpdump.org/papers/bpf-usenix93.pdf


Right, that's why I said it's a DAG -- a decision DAG -- instead of just a tree structure.


As with everything, there are tradeoffs - For instance this from a few years ago:

http://mainisusuallyafunction.blogspot.com/2012/11/attacking...

Which allows bpf-jit to be part of a larger privilege escalation scheme. (To properly use this attack you need to gain control of an applicaion first, then you can maybe use this attack, and if successful you own the box).

I haven't kept up with this at all, so I don't know if it has been fixed (or even is fixable) but it is interesting.


--

> Finally you can apply the rule like so:

    iptables -A INPUT \
        -p udp --dport 53 \
        -m bpf --bytecode "14,0 0 0 20,177 0 0 0,12 0 0 0,7 0 0 0,64 0 0 0,21 0 7 124090465,64 0 0 4,21 0 5 1836084325,64 0 0 8,21 0 3 56848237,80 0 0 12,21 0 1 0,6 0 0 1,6 0 0 0," \
        -j DROP
--

Writing raw opcodes to execute in the kernel context on each packet received? Yes, please! :)


Well... using the new “nftables” packet filter, this will be the "normal" way of packet filtering. So, yes, please! ☺

http://netfilter.org/projects/nftables/

❝Pseudo-state machine in kernel-space: the userspace utility nftables interprets the rule-set provided by the user (using a new syntax), it compiles it into the pseudo-state machine bytecode and then it transfers it to the kernel via the nftables Netlink's API. Roughly, the idea behind nftables is similar to the Berkeley Packet Filters (BPF).❞


Well, I like bytecode machines.

But I'd expect a compiler or and assembler of some sort that allows to write human-readable code.


tracedump is a single application packet sniffer that generates BPF code on the fly: http://mutrics.iitis.pl/tracedump


Wow, didn't know about that. Very useful!


Not forgotten, indeed we can say that the idea of BPF and proving that is secure is related to what Google did with NaCl. More food for thought http://www.csl.sri.com/users/neumann/2012resolve-cheri.pdf




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: