Hacker News
BPF for storage: an exokernel-inspired approach [pdf] (arxiv.org)
122 points by gbrown_ 17 days ago | hide | past | favorite | 39 comments

There's also some interest in making eBPF the standard for computational storage [1]: offloading processing to the SSD controller itself. Many vendors have found that it's easy to add a few extra ARM cores or some fixed-function accelerators to a high-end enterprise SSD controller, but the software ecosystem is lacking and fragmented.

This work may be a very complementary approach. Using eBPF to move some processing into the kernel should make it easier to later move it off the CPU entirely and into the storage device itself.

[1] https://nvmexpress.org/bringing-compute-to-storage-with-nvm-...

Can I get Postgres running on a RAID controller while we’re at it?

Samsung's SmartSSD comes with a Xilinx FPGA accelerator that can be used to execute code without having to move data outside the device. And their open source SDK includes domain-specific routines including databases.

See https://xilinx.github.io/Vitis_Libraries/ for details on their software stack.

It's not SQL, but Samsung made a Key Value SSD that uses short keys instead of "addresses" to index blocks of data.

It's been done.

Imagine an SSD serving up query responses! I love it.

Not too far from Netezza.

Wasn't it the other way around? The vendors got extra cores from their suppliers and started asking themselves what to do with them?

By "vendors", I meant drive and controller vendors rather than server/appliance vendors. They're looking for ways to differentiate their products and offer more value add, but extending an SSD's functionality beyond mere block storage (potentially with transparent compression or encryption) requires a lot of buy-in from end users who need to write software to use such capabilities.

I meant the Seagates and WDs of this world. Around 2014 they also had HDDs with an extra core, where I suspect they got it for free from their supplier with the message "look, we've stopped making these single-core CPUs; here's a dual core for the same price".

WD and Seagate are their own chip vendors. They design their own custom SoCs.

In the case of Seagate's Kinetic drives, one core was used by the controller and the other "extra" core was used to manage a key-value store on the drive. These were ARM processors. I don't think they make these themselves.

You would be wrong. I work with guys who used to do SoC design for both Seagate and WD.

This is how WD was able to jump to RISC-V so quickly; they didn't have any suppliers they needed to negotiate with, and the market for Cortex-R-style in-order, single-issue, fully pipelined real-time cores is sort of a commodity.

Well, that's why I was asking. Thx!

I'd prefer it to use spirv or wasm. eBPF is intentionally an extremely limited language.

But trivially provable termination is a really nice property if you don't want your hypervisor to need to trust every guest kernel.

eBPF the bytecode is not particularly limited. You can even parse complex formats like Arrow or Parquet. The Linux kernel overlays a verifier on top which adds all sorts of draconian restrictions (for good reason). When people talk of eBPF they don't always mean to include the Linux verifier limitations as well. In particular, that nvmexpress working group link in the parent post does not say one way or the other.

Why not use a different bytecode that is already more common, in that case?

Because eBPF is designed for verification under constrained circumstances like kernels, while allowing easy JITing that's almost a 1-to-1 translation. Stuff like not doing register allocation, being two-address, etc.
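To make the "almost 1-to-1 translation" point concrete, here's a toy sketch (not a real JIT): because eBPF is two-address (dst = dst OP src) and its registers r0-r5 happen to line up with the x86-64 SysV calling convention, each eBPF ALU instruction can become exactly one x86 instruction with no register allocation. The instruction tuples and opcode subset here are made-up simplifications.

```python
# Toy sketch: why eBPF's two-address, fixed-register design maps
# almost 1:1 onto x86-64. Register mapping follows the convention
# real eBPF JITs use for the first few registers.
EBPF_TO_X86 = {
    "r0": "rax",  # return value
    "r1": "rdi",  # first helper/call argument
    "r2": "rsi",
    "r3": "rdx",
    "r4": "rcx",
    "r5": "r8",
}

def jit_one(insn):
    """Translate one (op, dst, src) eBPF-style instruction to x86-64 text.

    Because eBPF is two-address (dst = dst OP src), each instruction
    becomes exactly one x86 instruction -- no register allocation needed.
    """
    op, dst, src = insn
    x86_op = {"add": "add", "sub": "sub", "mov": "mov"}[op]
    return f"{x86_op} {EBPF_TO_X86[dst]}, {EBPF_TO_X86[src]}"

prog = [("mov", "r0", "r1"), ("add", "r0", "r2")]
print([jit_one(i) for i in prog])
# ['mov rax, rdi', 'add rax, rsi']
```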

So basically back to SCSI.

I'm not seeing any new parallels to SCSI, beyond the similarities that have long existed between NVMe and SCSI. What kind of programmable functionality over SCSI are you referring to?

SCSI uses its own controllers that receive a set of commands and take it from there for the data retrieval operations.


SCSI uses simple fixed command sets, the same as standard NVMe drives. As far as I'm aware, SCSI doesn't have any way to chain commands together, so you can't really do anything more complex than predefined commands like copy or compare and write. It's nothing at all analogous to an actual programmable offload engine like you'd get with eBPF, and all the semi-advanced SCSI commands that are actually applicable to SSDs already have equivalents in the base NVMe command set.

Exokernels coming back would be so cool.

I wrote a prototype kernel/os [0] a few years back that ran wasm in kernel mode. I'd love to see that idea pursued further.

[0]: https://github.com/nebulet/nebulet

I really loved the idea of Nebulet! Do you think it's possible to reach the same security guarantees as eBPF with a feature set like wasm's? What other big challenges did you encounter?

I think the main issue is that eBPF cannot have loops, which restricts the programs you can write. Wasm does not have that restriction, so you cannot prove that a program will complete.

To be clear, eBPF programs can have loops -- they're just jumps with negative offsets (which are signed 16-bit numbers) -- but for security reasons many verifiers do not allow them, so they can ensure that the program halts.
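The conservative check described above is simple to sketch: if every jump goes forward, the program counter only increases, so the program must halt. This is a toy illustration with a made-up instruction format (tuples like ("jmp", offset)), not the real eBPF encoding or verifier.

```python
# Toy sketch of a conservative halting check: reject any program
# containing a backward jump. With only forward jumps the program
# counter strictly increases, so execution must terminate.
def rejects(program):
    """Return True if the program contains a back-edge (backward jump)."""
    for pc, (op, *args) in enumerate(program):
        if op == "jmp":
            offset = args[0]   # relative to the following instruction
            if offset < 0:     # back-edge => possible loop => reject
                return True
    return False

straight = [("mov",), ("add",), ("exit",)]
loop     = [("mov",), ("jmp", -2), ("exit",)]  # jumps back to the mov
print(rejects(straight), rejects(loop))
# False True
```

This is exactly the over-approximation the comment describes: plenty of looping programs do halt, but rejecting all back-edges is the cheapest way to be sure.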

I'd like to see more proof-carrying code techniques extended to WASM. Some of the work could be handed off to the compiler, embedding the proof in the binary. This could make expensive proofs, like termination checking, more tractable.

are you aware of the halting problem?

Uh... yeah. The halting problem doesn't mean you can't do termination checking. It just means there's no general algorithm for deciding whether an arbitrary machine halts.

We only care about the specific subset of programs that the compiler is able to prove properties about. https://en.m.wikipedia.org/wiki/Termination_analysis
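The proof-carrying idea can be sketched as follows: the compiler ships a "ranking function" alongside a loop, and the checker only verifies that the rank stays non-negative and strictly decreases each iteration, which guarantees termination. A real proof-carrying scheme checks this statically; the sketch below is a dynamic analogue, and the loop shape and check are deliberately simplified assumptions.

```python
# Illustrative sketch: verifying a termination certificate instead of
# deciding termination (which is impossible in general).
def check_ranking(loop_body, rank, state, max_steps=10_000):
    """Run the loop while checking the supplied termination certificate.

    loop_body: state -> next state, or None when the loop exits
    rank:      state -> int, the compiler-provided ranking function
    Returns True iff the loop exited without violating the certificate.
    """
    r = rank(state)
    for _ in range(max_steps):
        if r < 0:
            return False      # certificate violated: rank went negative
        nxt = loop_body(state)
        if nxt is None:
            return True       # loop exited with the certificate intact
        if rank(nxt) >= r:
            return False      # certificate violated: rank did not decrease
        state, r = nxt, rank(nxt)
    return False

# Counted loop: i runs 10 -> 0; the ranking function is simply i.
body = lambda i: i - 1 if i > 0 else None
print(check_ranking(body, lambda i: i, 10))              # True
print(check_ranking(lambda i: i + 1, lambda i: i, 10))   # False
```

The point is that checking a supplied proof is easy even though finding one is not, which is why embedding the proof in the binary makes expensive analyses tractable.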

Everything old is new again.


This kind of nails what has been obvious to me for a long time: eBPF is basically the revenge of exokernels.

Many databases can bypass the filesystem and use block devices (partitions) directly. But modifying NVMe drivers is a pretty novel approach. Maybe Redis compiled into the Linux kernel is not such a bad idea :)

Reminds me of the old Auspex file servers, with their functional processing units. For a while CPUs got fast enough to do everything. Then they didn't, and GPUs became a thing (though they had been something of a thing back in the 80s too, like the Amiga blitter), and now we're putting CPUs in storage controllers. The pendulum has swung a few times.

I wonder if unikernels would also see this speedup, though they're so much harder to implement/use that I assume BPF-powered storage will see mass adoption before unikernels do...

What's disappointing is that even newer OSes seem to ignore exokernels and go straight for the older microkernel-style architecture, which is 1980s tech.

Holy crap this is a cool idea.

And then there were "in" eBPF drivers

And they were compiled on the fly

Compatible forever

Running non stop

until the end of all time.
