Hacker News
Bpftop: Streamlining eBPF performance optimization (netflixtechblog.com)
133 points by mochomocha 9 months ago | 21 comments



Looks great. I remember using bpftrace at work to debug a nasty performance issue. I went down the rabbit hole only to find that a certain syscall was being called far too often. We managed to trace it back in the source code to a sleep(1 second), some sort of manual I/O scheduling committed by the CTO when the startup was in its early stage. Removing those few lines and installing kyber fixed the issues!


Interesting, what was the storage type?


Just tried this on the eBPF code I am working on. Works great! This one is going straight into the toolbox.

Even though eBPF is super fast, I have found that triggering complex probes many times has performance implications, which cannot easily be traced back to the instrumenting application. This will help with that a lot.


I am happy to hear that you had a good first impression. At Netflix, we do some Linux scheduler instrumentation with eBPF and overhead matters. I was inspired to create the tool to enable the traditional performance work loop: get a baseline, tweak code, get another reading, rinse & repeat.
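For anyone who wants to run that baseline/tweak/re-measure loop by hand, here is a minimal sketch of the arithmetic. It assumes you have two readings of the per-program `run_time_ns` and `run_cnt` counters that the kernel keeps when BPF stats are enabled. The `Sample` type, the `summarize` helper, and the numbers are made up for illustration, and this is not necessarily how bpftop computes its columns.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading of a BPF program's cumulative counters (hypothetical type)."""
    timestamp_ns: int  # when the reading was taken
    run_time_ns: int   # cumulative time spent running the program
    run_cnt: int       # cumulative number of times the program ran


def summarize(prev: Sample, curr: Sample) -> dict:
    """Derive per-interval metrics from two cumulative readings."""
    elapsed_ns = curr.timestamp_ns - prev.timestamp_ns
    if elapsed_ns <= 0:
        raise ValueError("samples must be taken at increasing timestamps")
    runs = curr.run_cnt - prev.run_cnt
    busy_ns = curr.run_time_ns - prev.run_time_ns
    return {
        # Average cost of one invocation during the interval.
        "avg_runtime_ns": busy_ns / runs if runs else 0.0,
        # How often the program fired.
        "events_per_sec": runs * 1_000_000_000 / elapsed_ns,
        # Share of one CPU's time spent in the program.
        "cpu_percent": busy_ns * 100 / elapsed_ns,
    }


# Made-up readings taken one second apart.
baseline = summarize(Sample(0, 0, 0), Sample(1_000_000_000, 5_000_000, 1000))
print(baseline)
```

Take one summary before a code change and one after, under the same load, and compare `avg_runtime_ns` and `cpu_percent`.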


What use cases do people use eBPF for these days?


We had a recent use case to log outbound TCP connections _excluding_ internal and known addresses from our k8s infrastructure, with the log including the process name/pid, uid, and a bunch of other metadata.

I wrote a tool that compiles to a small, statically linked binary (using CO-RE/libbpf), deployed to every node as a DaemonSet. It just works and uses minimal CPU and memory resources.
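The exclusion step can be sketched roughly as below. This is only an illustration in Python of the filtering idea, not the actual tool (which is built on CO-RE/libbpf). The CIDR list and the `should_log` name are assumptions made for the example.

```python
import ipaddress

# Hypothetical exclusion list: private and loopback IPv4 ranges.
# A real deployment would load its own internal and known networks.
EXCLUDED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "127.0.0.0/8")
]


def should_log(dest: str) -> bool:
    """Return True if a connection to `dest` falls outside every excluded network."""
    addr = ipaddress.ip_address(dest)
    return not any(addr in net for net in EXCLUDED_NETWORKS)


for dest in ("10.1.2.3", "8.8.8.8"):
    print(dest, "log" if should_log(dest) else "skip")
```

Doing the same check inside the eBPF program rather than in userspace would avoid sending excluded events to userspace at all, at the cost of a more constrained implementation.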


Tongue in cheek: lots of people have discovered they can replace Linux kernel modules with brittle eBPF code instead, which attaches itself to various parts of the kernel that are even less stable than the things modules have to deal with.


They are nice for quick experimentation, yes. But there are rock solid projects like Cilium using them. I think your point is that the barrier to abuse is lower?


The eBPF website has a list of projects using it, which can give you a decent flavour of what people use it for. https://ebpf.io/applications/


StackState, my current employer, uses eBPF in addition to OpenTelemetry for collecting observability data. https://www.stackstate.com/platform/features/


We use it for several parts of our network forwarding path (our private networking features are built in eBPF), for a variety of monitoring purposes, and (principally with bpftrace) as a debugging tool.


We have implemented zero-code distributed tracing with eBPF. https://github.com/deepflowio/deepflow


Using eBPF-based tools (like bcc) to debug issues. https://github.com/iovisor/bcc


I find this somewhat amusing given one of the primary use cases of eBPF is measuring performance


There are plenty of other applications.

Network routing can be implemented in (e)bpf. It's even the original use case.

But there's also the LSM based on ebpf, there's a user space scheduler (Google iirc), seccomp and some cgroup filters can be done in bpf...

It's the Lua of the kernel at this point. Provides a lot of extension points.


sched-ext is Meta's. Google has something else, but less open, I believe.

https://github.com/sched-ext/scx/blob/main/OVERVIEW.md



Who watches the Watchmen?


Nice! But I got it to freeze under higher load. Removing the load does not help.


bpftop author here. Would you mind creating an issue to track this? https://github.com/Netflix/bpftop/issues


Done (https://github.com/Netflix/bpftop/issues/17). Seems to be some futex issue, the kind of bug that tends to be hard to replicate.



