Show HN: Xcapture-BPF – like Linux top, but with Xray vision (0x.tools)
431 points by tanelpoder 2 days ago | 38 comments





I use BCC tools weekly to debug production issues. Recently I found we were putting massive pressure on the page cache because we had a large number of loopback devices, each with its own page cache. Enabling direct io on the loopback devices fixed the issue.

eBPF is really a superpower; it lets you do things that are incomprehensible to people who don’t know about it.


I've been learning BCC / bpftrace recently to debug a memory leak issue on a customer's system, and it has been super useful.

I'd love to hear more of this debugging story!

Containers are offered block storage by creating a loopback device with a backing file on the kubelet’s filesystem. We noticed that on some very heavily utilized nodes, iowait was consuming 60% of all the available cores on the node.

I first confirmed that the NVMe drives were healthy according to SMART, then worked up the stack and used BCC tools to look at block io latency. Block io latency was quite low for the NVMe drives (microseconds) but was hundreds of milliseconds for the loopback block devices.

This led me to believe that something was wrong with the loopback devices rather than the underlying NVMe drives. I used cachestat/cachetop and found that the page cache miss rate was very high and that we were thrashing the page cache, constantly paging data in and out. From there I inspected the loopback devices using losetup and found that direct io was disabled and that the sector size of the loopback devices did not match the block size of the backing filesystem.

I modified the loopback devices to use the same sector size as the block size of the underlying filesystem and enabled direct io. Instantly, the majority of the page cache was freed, iowait went way down, and io throughput went way up.

Without BCC tools I would have never been able to figure this out.

Double caching loopback devices is quite the footgun.

Another interesting thing we hit is that our version of losetup would happily fail to enable direct io but still give you a loopback device; this has since been fixed: https://github.com/util-linux/util-linux/commit/d53346ed082d...


There are also Composefs and Puzzlefs, both of which attempt to let the page cache be shared across containers!

https://github.com/containers/composefs https://github.com/project-machine/puzzlefs


Which container runtime are you using? As far as I know both Docker and containerd use overlay filesystems instead of loopback devices.

And how did you know that tweaking the sector size to equal the underlying filesystem's block size would prevent double caching? Where can one get this sort of knowledge?


The loopback devices came from a CSI which creates a backing file on the kubelet’s filesystem and mounts it into the container as a block device. We use containerd.

I knew that enabling direct io would most likely disable the double caching, because that is literally the point of enabling direct io on a loopback device. Initially I just tried enabling direct io on the loopback devices, but that failed with a cryptic “invalid argument” error. After some more research I found that in some cases direct io needs the sector size to match the filesystem’s block size to work.
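
For reference, as I understand it, losetup’s --sector-size and --direct-io options boil down to the LOOP_SET_BLOCK_SIZE and LOOP_SET_DIRECT_IO ioctls on the loop device. A minimal Python sketch of that ordering, assuming /dev/loop0 already exists and the backing filesystem uses 4 KiB blocks:

    import fcntl
    import os

    LOOP_SET_DIRECT_IO = 0x4C08   # from <linux/loop.h>
    LOOP_SET_BLOCK_SIZE = 0x4C09  # from <linux/loop.h>

    fd = os.open("/dev/loop0", os.O_RDWR)
    try:
        # Set the loop device's logical sector size to the backing filesystem's
        # block size first; enabling direct io can fail with EINVAL otherwise.
        fcntl.ioctl(fd, LOOP_SET_BLOCK_SIZE, 4096)
        # Enable direct io against the backing file, so data is cached once
        # (inside the filesystem on the loop device) instead of twice.
        fcntl.ioctl(fd, LOOP_SET_DIRECT_IO, 1)
    finally:
        os.close(fd)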


We had something similar about 10 years ago where I worked. Customer instances were backed by loopback devices on local disks. We didn’t think of this (face palm) on the loopback devices. What we ended up doing was writing a small daemon that used posix_fadvise to tell the kernel to skip the page cache… your solution is way simpler and more elegant… hats off to you

Folks who find this useful might also be interested in otel-profiling-agent [1], which Elastic recently open-sourced and donated to OpenTelemetry. It's a low-overhead eBPF-based continuous profiler which, besides native code, can unwind stacks from other widely used runtimes (HotSpot, V8, Python, .NET, Ruby, Perl, PHP).

[1] https://github.com/elastic/otel-profiling-agent


Grafana has one too called Beyla.

https://grafana.com/oss/beyla-ebpf/


I am trying to wrap my head around it; still unclear what it does.

That's like most of Grafana's documentation

Relatively speaking, how expensive is it to capture the call stack when doing sample profiling?

With Intel's CET there should be a way to capture the shadow stack, which really just contains return addresses, but I'm wondering if that's going to be used...


The on-CPU sample profiling is not a big deal for my use cases, as I don't need the "perf" sampling to happen at 10 kHz or anything (more like 1-10 Hz, but always on).

But the sched_switch tracepoint is the hottest event; without stack sampling it's 200-500 ns per event (on my Xeon 63xx CPUs), depending on what data is collected. I use #ifdefs to compile in only the fields that are actually used (smaller thread_state struct, fewer branches and instructions to decode & cache). Surprisingly, collecting the kernel stack adds more overhead than collecting the user stack (kstack takes the event from say 400 ns to 3200 ns, while ustack takes it to around 2800 ns).

I have done almost zero optimization so far (and I figure using libbpf/BTF/CO-RE will help too). But I'm ok with these numbers for most of my workloads of interest, and since eBPF programs are not cast in stone, I can reduce the overhead further, for example by sampling stacks in the sched_switch probe only on every 10th occurrence or something.
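
To make that every-Nth-event idea concrete, here's a rough BCC sketch (not xcapture itself; map names like stack_counts are made up) that hooks the sched_switch tracepoint but only grabs a kernel stack on every 10th event per CPU:

    #!/usr/bin/env python3
    from bcc import BPF
    import time

    prog = r"""
    BPF_STACK_TRACE(stack_traces, 16384);
    BPF_HASH(stack_counts, int, u64);      // kernel stack id -> sample count
    BPF_PERCPU_ARRAY(event_seq, u64, 1);   // per-CPU context switch counter

    TRACEPOINT_PROBE(sched, sched_switch) {
        int zero = 0;
        u64 *seq = event_seq.lookup(&zero);
        if (!seq)
            return 0;
        (*seq)++;
        if (*seq % 10 != 0)                // only sample every 10th switch
            return 0;

        int stack_id = stack_traces.get_stackid(args, 0);   // kernel stack
        if (stack_id < 0)
            return 0;
        u64 init = 0, *cnt = stack_counts.lookup_or_try_init(&stack_id, &init);
        if (cnt)
            (*cnt)++;
        return 0;
    }
    """

    b = BPF(text=prog)
    print("Sampling kernel stacks on every 10th sched_switch, Ctrl-C to print...")
    try:
        time.sleep(999999)
    except KeyboardInterrupt:
        pass

    stack_traces = b["stack_traces"]
    for sid, cnt in sorted(b["stack_counts"].items(), key=lambda kv: kv[1].value):
        print("\n%d samples:" % cnt.value)
        for addr in stack_traces.walk(sid.value):
            print("    %s" % b.ksym(addr).decode("ascii", "replace"))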

So in the worst case, this full-visibility approach might not be usable as always-on instrumentation for some workloads (like some redis/memcached/mysql lookup workloads doing 10M context switches/s on a big server), but even then, a temporary increase in instrumentation overhead might be ok when there are known recurring problems to troubleshoot.


Awesome info!!! Thanks a lot!

I’ve never used eBPF; does anyone have some good resources for learning it?

Brendan Gregg's site (and book) is probably the best starting point (he was involved in DTrace work & rollout 20 years ago when at Sun, and was/is instrumental in pushing eBPF in Linux even further than DTrace ever went):

https://brendangregg.com/ebpf.html


Just a quick clarification: while Brendan was certainly an active DTrace user and evangelist, he wasn't involved in the development of DTrace itself -- or its rollout. (Brendan came to Sun in 2006; DTrace was released in 2003.) As for eBPF with respect to DTrace, I would say that they are different systems with different goals and approaches rather than one eclipsing the other. (There are certainly many things that DTrace can do that eBPF/BCC cannot, some of the details of which we elaborated on in our 20th anniversary of DTrace's initial integration.[0])

Edit: We actually went into much more specific detail on eBPF/BCC in contrast to DTrace a few weeks after the 20th anniversary podcast.[1]

[0] https://www.youtube.com/watch?v=IeUFzBBRilM

[1] https://www.youtube.com/watch?v=mqvVmYhclAg#t=12m7s


Thanks, yes I was more or less aware of that (I'd been using DTrace since the Solaris 10 beta in 2004 or 2003?)... By rollout I really meant "getting the word out there"... that's half the battle in my experience (hence this post here! :-)

What I loved about DTrace was that once it was out, even in beta, it was pretty complete and worked - all the DTrace ports that I've tried, including on Windows (!) a few years ago, were very limited or had some showstopper issues. I guess eBPF was like that too some years ago, but by now it's pretty sweet even for more regular consumers who don't keep track of its development.

Edit: Oh, wasn't aware of the timeline, I may have some dates (years) wrong in my memory


Yes, not involved in DTrace itself, but he did write a bunch of DTrace Tools which led to an interesting meeting with a Sun exec: https://www.brendangregg.com/blog/2021-06-04/an-unbelievable...

>As for eBPF with respect to DTrace, I would say that they are different systems with different goals and approaches

For sure. Different systems, different times.

>rather than one eclipsing the other.

It does seem that DTrace has been eclipsed though, at least in Linux (which runs the vast majority of the world's compute). Is there a reason to use DTrace over eBPF for tracing and observability in Linux?

>There are certainly many things that DTrace can do that eBPF/BCC cannot

This may be true, but that gap is closing. There are certainly many things that eBPF can do that DTrace cannot, like Cilium.


Perhaps familiarity with the syntax of DTrace if coming from Solaris-heavy enterprise background. But then again, too many years have passed since Solaris was a major mainstream platform. Oracle ships and supports DTrace on (Oracle) Linux by the way, but DTrace 2.0 on Linux is a scripting frontend that gets compiled to eBPF under the hood.

Back when I tried to build xcapture with DTrace, I could launch the script and use something like /pid$oracle::func:entry/ but IIRC the probe was attached only to the processes that already existed and not any new ones that were started after loading the DTrace probes. Maybe I should have used some lower level APIs or something - but eBPF on Linux automatically handles both existing and new processes.
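
As a hedged illustration of that last point (this is not from xcapture; /bin/bash and readline() are just convenient example targets): a BCC uprobe attaches to the executable or library file itself, so it fires for processes that already have it mapped and for any processes started afterwards, with no re-attaching needed.

    from bcc import BPF

    b = BPF(text=r"""
    #include <uapi/linux/ptrace.h>

    int on_readline_return(struct pt_regs *ctx) {
        // Fires in every bash process, existing or newly started,
        // whenever readline() returns.
        bpf_trace_printk("readline() returned in pid %d\n",
                         bpf_get_current_pid_tgid() >> 32);
        return 0;
    }
    """)
    b.attach_uretprobe(name="/bin/bash", sym="readline", fn_name="on_readline_return")
    b.trace_print()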


> eBPF on Linux automatically handles both existing and new processes

Without knowing your particular case, DTrace does too - it’d certainly be tricky to use if you’re trying to debug software that “instantly crashes on startup” if it couldn’t do that. “execname” (not “pid”) is where I’d look, or perhaps that part of the predicate is skippable; regardless, it should be possible.


For example, I used something like a "pid:module:funcname:entry" probe for userspace things (not pid$123 or pid$target, just pid, to catch all PIDs using the module/funcname of interest). And back when I tested, it didn't automatically catch any new PIDs, so these probes were not fired for them unless I restarted my DTrace script (but it was probably before 2010 when I last tested this).

Execname is a variable in DTrace and not a probe (?), so how would it help with automatically attaching to new PIDs? Now that I recall more details, there was no issue with statically defined kernel "fbt" probes or with "profile" probes, but the userspace pid one was where I hit this limitation.


> Execname is a variable in DTrace and not a probe (?), so how would it help with automatically attaching to new PIDs?

You're correct, and I may have provided "a solution" to a misunderstanding of your problem - I don't think the "not matching new procs/pids" is inherent in DTrace, so indeed you might have run into an implementation issue (as it was 15 years ago). I misunderstood you as perhaps using a predicate matching a specific pid; my fault.


It lets you hook into various points in the kernel; ultimately you need to learn how the Linux kernel is structured to make the most of it.

Unlike a kernel module, it can only really read data, not modify kernel data structures, so it's nice for things like tracing kernel events.

The XDP subsystem in particular is designed to let you apply filters to network data before it reaches the network stack, but it still doesn't give you the same level of control or performance as DPDK, since you still need the data to go to the kernel.


Yep (the 0x.tools author here). If you look into my code, you'll see that I'm not a good developer :-) But I have a decent understanding of Linux kernel flow and kernel/app interaction dynamics, thanks to many years of troubleshooting large (Oracle) database workloads. So I knew exactly what I wanted to measure and how; I just had to learn the eBPF parts. That's why I picked BCC instead of libbpf, as I was somewhat familiar with it already, but a fully dynamic and "self-updating" libbpf loading approach is the goal for v3 (help appreciated!)

I was going to ask "why BCC" (BCC is super clunky) but you're way ahead of us. This is great work, thanks for posting it.

Yeah, I already see limitations; the latest one was yesterday when I installed earlier Ubuntu versions to see how far back this can go - and even Ubuntu 22.04 didn't work out of the box, I ended up with a BCC/kernel header mismatch issue [1] although the kernel itself supported it. A workaround would be to download & compile the latest BCC yourself, but I don't want to go there, as the customers/systems I work on wouldn't go there anyway.

But libbpf with CO-RE will solve these issues as I understand it, so as long as the kernel supports what you need, the CO-RE binary will work.

This raises another issue for me though: it's not easy, but it is easier, for enterprises to download and run a single Python + single C source file (with <500 lines of code to review) than a compiled CO-RE binary. But my long-term plan/hope is that I (we) get the Red Hats and AWSes of this world to just provide the eventual mature release as a standard package.

[1] https://github.com/iovisor/bcc/issues/3993


Myself, I've only built simple things, like tracing sched switch events for certain threads and killing the process if they happen (specifically designed as a safety for pinned threads).

Same here, until now. I built the earlier xcapture v1 (also in the repo) about 5 years ago; it just samples various /proc/PID/task/TID pseudofiles regularly. It also lets you get pretty far with the thread-level activity measurement approach, especially when combined with always-on, low-frequency on-CPU sampling with perf.
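
For anyone curious what that /proc sampling looks like, here's a rough sketch of the idea (not the actual xcapture v1 code): walk /proc/PID/task/TID/stat once per second and count the threads that are running or in uninterruptible sleep:

    import glob
    import time

    def sample_thread_states():
        states = []
        for path in glob.glob("/proc/[0-9]*/task/[0-9]*/stat"):
            try:
                with open(path) as f:
                    data = f.read()
            except OSError:
                continue                    # thread exited while we were sampling
            # comm is wrapped in parentheses and may contain spaces,
            # so split around the last ") " instead of naive whitespace splitting
            _, _, rest = data.rpartition(") ")
            states.append(rest.split()[0])  # R, S, D, Z, ...
        return states

    while True:
        states = sample_thread_states()
        active = sum(1 for s in states if s in ("R", "D"))
        print(time.strftime("%H:%M:%S"), active, "threads on CPU or in D state")
        time.sleep(1)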

XDP, in its intended configuration, passes pointers to packets still on the driver DMA rings (or whatever) directly to BPF code, which can modify packets and forward them to other devices, bypassing the kernel stack completely. You can XDP_PASS a packet if you'd like it to hit the kernel, creating an skbuff, and bouncing it through all the kernel's network stack code, but the idea is that you don't want to do that; if you do, just use TC BPF, which is equivalently powerful and more flexible.
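
For anyone who hasn't played with XDP, here's a tiny BCC-based sketch of the XDP_PASS case described above; it just counts frames and hands them to the normal kernel stack (the interface name "eth0" is an assumption, and a real fast path would rewrite the frame and return XDP_TX or XDP_REDIRECT instead):

    from bcc import BPF
    import time

    device = "eth0"   # assumed NIC name, adjust for your system

    b = BPF(text=r"""
    BPF_ARRAY(pkt_count, u64, 1);

    int xdp_count(struct xdp_md *ctx) {
        int zero = 0;
        u64 *val = pkt_count.lookup(&zero);
        if (val)
            __sync_fetch_and_add(val, 1);   // runs concurrently on all CPUs
        return XDP_PASS;   // hand the frame to the normal kernel network stack
    }
    """)

    fn = b.load_func("xdp_count", BPF.XDP)
    b.attach_xdp(device, fn, 0)
    try:
        while True:
            time.sleep(1)
            for _k, v in b["pkt_count"].items():
                print("frames passed to the kernel stack:", v.value)
    except KeyboardInterrupt:
        pass
    finally:
        b.remove_xdp(device, 0)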

Yes for XDP there is a dedicated API, but for any of the other hooks like tracepoints, it's all designed to give you read-only access.

The whole CO-RE thing is about having a kernel-version-agnostic way of reading fields from kernel data structures.


Right, I'm just pushing back on the DPDK thing.

DPDK polls the hardware directly from userland.

XDP reads the data in the normal NAPI kernel way, integrating with the IRQ system etc., which might or might not be desirable depending on your use case.

Then if you want to forward it to userland, you still need to write the data to a ring buffer, with your userland process polling it, at which point it's more akin to using io_uring.

It's mostly useful if you can write your entire logic in your eBPF program without going through userland, so it's nice for various tracing applications, filters or security checks, but that's about it as far as I can tell.


I'll toot my own horn here. But there are plenty of presentations about it, Brendan Gregg's are usually pretty great.

"bpftrace recipes: 5 real problems solved" - Trent Lloyd (Everything Open 2023) https://www.youtube.com/watch?v=ZDTfcrp9pJI


There's a bunch of examples over at https://github.com/iovisor/bcc

You might find some interesting stuff here

https://ebpf.io/




