The BPF virtual machine itself is trivial; any competent systems programmer given its specification should have no trouble implementing it.
What's most interesting about BPF is libpcap. There is a surprising amount of compiler theory baked into pcap, and it bears by far the larger share of the weight in making Unix packet capture applications workable and efficient (this may have changed with JIT'd BPF). I actually found this frustrating, as simple changes to the BPF VM (for instance, "vectorized" comparisons to support multiple filters) would break the libpcap optimizer.
I confess to being a bit baffled as to why things like seccomp-bpf would choose to use such a simple bytecode scheme; network packet captures uses it because of tradition and compatibility, and because a competently-designed high level filter language compiler supports only BPF.
* First, the McCanne BPF+ paper.
* Then, the pcap source code, particularly pcap/optimize.c, perhaps starting with the peephole optimization stuff.
* Finally, one of the BPF JIT projects (there's one on SourceForge). Amusingly: the Linux BPF JIT was at one point used as an aid to memory corruption, by jit-spraying kernel memory. So also consider reading any of the jit-spray papers.
Click on "Keynote Video" for a really boring talk :-)
Maybe they envisioned a hardware implementation?
You want to allow or deny calls based on syscall args, which seems to fit the BPF model. Perhaps string matching is where it breaks down, since syscall args have strings and network packets don't? I suppose a stripped down regex like Lua's might be useful for matching arguments that are file system paths, and that is firmly outside BPF as far as I understand.
I would guess that they wanted to reuse a VM already in the kernel, and that had a fast JIT implementation, but I'm just speculating.
The pcap compiler is not particularly well-suited for system call arguments.
The real reason to use the BPF VM, as opposed to any other VM or decision tree, is the pcap compiler, which doesn't buy much for system calls.
And now that it exists, it makes sense to evolve the bpf design to be more efficient but keep the result backwards compatible and unified between subsystems.
Mach Packet Filter: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.46.4...
MPF is interesting because it was designed to make packet filters fast as the primary way to dispatch packets to user-space endpoints- BPF just has to run all the filters on all the packets, but MPF will merge filter prologues and then do a switch on some common value (like the port).
Dynamic Packet Filter: http://pdos.csail.mit.edu/~engler/dpf.html
DPF takes this a step further by using a more declarative bytecode. It can merge multiple chunks of packet filters so that with layered protocols switching on IDs, ports, etc. you end up with a decision tree (or DAG maybe?). They also wrote a JIT with some interesting optimizations like picking the best dispatch method for each switch based on the number of branches, so they actually get better dispatch performance than hand-written packet demultiplexers.
As tptacek notes, the VM is very simple -- it's straightforward to write a BPF VM to C-code translator that you can then compile by gcc. But it will still be slower than edumazet's compiled code.
(via: https://twitter.com/justincormack/status/469388089354633216 )
In other words, BPF programs can be thought of as encoding a DAG that represents a decision tree.
Which allows bpf-jit to be part of a larger privilege escalation scheme. (To properly use this attack you need to gain control of an applicaion first, then you can maybe use this attack, and if successful you own the box).
I haven't kept up with this at all, so I don't know if it has been fixed (or even is fixable) but it is interesting.
> Finally you can apply the rule like so:
iptables -A INPUT \
-p udp --dport 53 \
-m bpf --bytecode "14,0 0 0 20,177 0 0 0,12 0 0 0,7 0 0 0,64 0 0 0,21 0 7 124090465,64 0 0 4,21 0 5 1836084325,64 0 0 8,21 0 3 56848237,80 0 0 12,21 0 1 0,6 0 0 1,6 0 0 0," \
Writing raw opcodes to execute in the kernel context on each packet received? Yes, please! :)
❝Pseudo-state machine in kernel-space: the userspace utility nftables interprets the rule-set provided by the user (using a new syntax), it compiles it into the pseudo-state machine bytecode and then it transfers it to the kernel via the nftables Netlink's API. Roughly, the idea behind nftables is similar to the Berkeley Packet Filters (BPF).❞
But I'd expect a compiler or and assembler of some sort that allows to write human-readable code.