KUtrace: Low-overhead Linux kernel tracing facility (github.com/dicksites)
172 points by luu 9 months ago | 34 comments



Early in my career, I reached out to the author and was able to grab lunch with him; he was about to retire! It was insightful to hear his thoughts on system performance, particularly systems involving more than one machine, which is something he had studied deeply.

It gave me appreciation for the amount of knowledge one accumulates over a career, and what a loss it is to an organization when one so knowledgeable retires.


I had the pleasure to work with Dick on getting KUtrace to work on Android devices last year. It was a great experience to work with one of the greats in systems performance. He was a wealth of information regarding performance bottlenecks and optimizations.

KUtrace is absolutely one of the most powerful tools I've used for deeply understanding performance bottlenecks (after isolating issues) such as poor scheduling behavior. I would highly recommend reading his book "Understanding Software Dynamics" [1] if you are interested in learning more about KUtrace or performance bottlenecks/optimizations in general. The book is quite dense and dives deep into the performance characteristics of many examples of the five fundamental resources (according to Dick): CPU, Memory, Disk/SSD, Network, and Software critical sections.

[1]: https://www.oreilly.com/library/view/understanding-software-...


> performance bottlenecks/optimizations in general

How applicable to the general cases is it? I’m deeply interested in the topic, but unlikely to actually be running KUTrace, fwiw.


So the book is split into four major sections: Measurement, Observation, KUtrace, Reasoning.

"Measurement" delves into understanding and measuring four fundamental resources: CPU, Memory, Disk/SSD, and Network. This section is quite dense and explores both the depth and breadth of understanding program performance. For example, there is a chapter on optimizing code to use caches more efficiently. I will say this section is obviously not a complete exploration of all aspects of performance, as many more things can affect performance in systems as complex as modern computers. But what Dick does in this section is give you more tools in your toolbelt for understanding performance.
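As an illustration of the kind of cache-locality optimization such a chapter covers (my own example, not one from the book): traversing a 2-D array row by row touches memory sequentially, while traversing it column by column jumps a full row stride per access. In C the gap can be several-fold; here is a sketch of the two access orders:

```python
# Illustrative only: the two traversal orders discussed in cache-locality
# material. The performance gap is dramatic in compiled languages, where
# rows are contiguous in memory; CPython overhead masks much of it.
N = 512
grid = [[1] * N for _ in range(N)]

def sum_row_major(g):
    # Sequential access: consecutive elements share cache lines.
    return sum(g[i][j] for i in range(len(g)) for j in range(len(g[0])))

def sum_col_major(g):
    # Strided access: each step jumps one full row, evicting lines early.
    return sum(g[i][j] for j in range(len(g[0])) for i in range(len(g)))

assert sum_row_major(grid) == sum_col_major(grid) == N * N
```

Both functions compute the same sum; only the memory access pattern differs.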

"Observation" looks at existing tooling (so profilers, tracing tools, etc.) and discusses where they are useful or where they fall short.

"KUtrace" introduces KUtrace, its kernel module, and its timeline visualization tool. It discusses its design and implementation and why it is so fast and low-overhead.

"Reasoning" has case studies that look at particular kinds of performance pathologies, such as "waiting for CPUs". Dick uses KUtrace here to tease out the underlying inefficiencies in the analyzed programs.

So the first two sections are essentially orthogonal to whether you want to use KUtrace or not, but the last two sections are about KUtrace and how to use it to understand performance bottlenecks. Even if you don't use KUtrace, the "Reasoning" section can still be insightful imo, as KUtrace is just a tool at the end of the day; the real insight is understanding what is causing the performance issue and why.


Thanks. I appreciate the thoughtful response, especially as so many others are also clamouring for comments regarding their specific cases.


Thanks, looks interesting. Does it cover measuring memory bandwidth consumption? This is something I feel there is a lack of good tooling for.


What precisely are you trying to measure? Theoretically, if you know the performance counters you want to measure, you can replace the IPC counter in the kernel module. I believe Dick has a different version of the kernel module which measures LLC misses instead of CPU cycles. Does that answer your question?


Hey, thanks for the response. Is it just a matter of measuring the LLC miss rate and then figuring out the max DRAM bandwidth somehow? What about in a multicore setting? NUMA? It would be nice to have a library or tool that works this out - always surprises me there isn't something off the shelf.
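One rough back-of-the-envelope conversion (my sketch, not from the thread): if each LLC miss fills one cache line from DRAM, then misses times line size over the interval approximates read bandwidth. It ignores hardware prefetches, write-backs, and non-temporal stores, so treat it as a lower bound:

```python
CACHE_LINE_BYTES = 64  # typical x86-64 cache line size

def est_read_bandwidth(llc_misses: int, interval_s: float) -> float:
    """Approximate DRAM read bandwidth (bytes/sec) from an LLC-miss count.

    Undercounts traffic from prefetchers, write-backs, and non-temporal
    stores, so it is a lower bound rather than a measurement.
    """
    return llc_misses * CACHE_LINE_BYTES / interval_s

# 100M misses over 0.5 s -> 12.8e9 bytes/sec (~12.8 GB/s)
print(est_read_bandwidth(100_000_000, 0.5))
```

On multi-socket NUMA machines this gets murkier still, since misses may be served by a remote node's memory controller rather than local DRAM.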


You might be interested in Intel VTune then, if you have an Intel CPU. I believe it has a profiling option that shows memory bandwidth over time [1].

[1]: https://www.intel.com/content/www/us/en/docs/vtune-profiler/...


Perfetto on Android is very very slick. Why did you need KUtrace? What was perfetto missing?

Nice handle btw. Grinding for that was unforgettable...


Perfetto is pretty cool (if an overloaded term, since there's the perfetto UI, the perfetto backend, etc.), but we were generally interested in fairly low-level aspects like waiting on memory, which require low-overhead tracing. ftrace (which perfetto uses under the hood for all the system events) does have observable overhead. KUtrace has nifty visualization for the different kinds of waiting (waiting for CPU, locks, memory, etc.). There was also the novelty of trying to get KUtrace to work on Android.
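Part of how this style of tracer stays low-overhead is packing each event into a single 8-byte word appended to a per-CPU buffer, rather than formatting anything at trace time. A hypothetical packing to show the idea (field widths are my own, not KUtrace's actual layout):

```python
# Hypothetical 64-bit trace-entry layout, illustrating the general
# technique of fixed-width packed events; KUtrace's real format differs.
#   bits 63..44  timestamp (20-bit wrapping counter)
#   bits 43..32  event id  (12 bits)
#   bits 31..0   argument  (32 bits, e.g. a syscall number or pid)
TS_BITS, EV_BITS, ARG_BITS = 20, 12, 32

def pack(ts: int, event: int, arg: int) -> int:
    return ((ts & ((1 << TS_BITS) - 1)) << (EV_BITS + ARG_BITS)
            | (event & ((1 << EV_BITS) - 1)) << ARG_BITS
            | (arg & ((1 << ARG_BITS) - 1)))

def unpack(word: int):
    return (word >> (EV_BITS + ARG_BITS),
            (word >> ARG_BITS) & ((1 << EV_BITS) - 1),
            word & ((1 << ARG_BITS) - 1))

assert unpack(pack(0x12345, 0x800, 0xDEADBEEF)) == (0x12345, 0x800, 0xDEADBEEF)
```

Recording an event is then just a timestamp read, a few shifts and ORs, and one store; all decoding happens offline in the visualization tool.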

The upside to perfetto of course is the much much richer tooling, infrastructure, and ease of use since it comes pre-installed on your phone.

> Nice handle btw. Grinding for that was unforgettable...

Haha thanks. Symphony of the Night is easily one of my favorite games -- I can pick it up any time and play it until 200.6% completion ;)


Interesting! Does it also work on non-rooted Android devices?


No it unfortunately does not. You require root to remount the read-only system partition, to insert the kernel modules, and to turn SELinux off. We used a "userdebug" build to get root, but I imagine most of this could also be done with a phone rooted through other means (I haven't tried it, however).


Sites's article Benchmarking "Hello, World!" is basically a KUtrace tutorial.

https://queue.acm.org/detail.cfm?id=3291278

Also, https://www.youtube.com/watch?v=D_qRuKO9qzM


His book, "Understanding Software Dynamics", is one of the best technical books I've ever read. Top 3 for me.


If you don't mind me asking, what are the other two?


- Managing Gigabytes

- Hacker's Delight


Much appreciated. I bought all 3.


Thanks a bunch! Will check them out.


Sounds very interesting.

But it works by patching the kernel rather than just using eBPF like many recent performance tools. So it needs constant maintenance, given the current velocity of internal kernel changes. And I would not be surprised if it didn't build or work correctly on a heavily patched and customized kernel.

On the positive side at a first glimpse the maintenance to adapt to new kernels looks very active.


Out of curiosity, is BPF now capable of capturing all the context-switch events, such as CPU traps?

Also, if the overhead is negligible, maybe the author could try to merge this into mainline, using a static key to make the incurred overhead switchable. Even with a static key, the degree of interference with the cache and branch predictor might be an intriguing topic, though.


Lots of the low level exception/trap/fault handling functions are blacklisted, probably to avoid lockups and unwanted recursion mayhem:

  $ sudo wc -l /sys/kernel/debug/kprobes/blacklist 
  783 /sys/kernel/debug/kprobes/blacklist
Edit: Perhaps an alternative approach would be to attach probes to relevant (precise) PMU events. There's also this prototype of adding breakpoint/watchpoint support to eBPF [1]. But actually doing stuff within this context may get complicated very fast, so would need to be severely limited, if feasible at all.

[1] https://ebpf.io/summit-2020-slides/eBPF_Summit_2020-Lightnin...


Why not just use the eBPF system?


Most people would. The author makes a comparison in this interview: https://www.usenix.org/system/files/login/articles/login_fal...

Note also that this work emerged within Google a decade before eBPF was really useful.


Wasn't there similar useful stuff from systemtap before eBPF? That's been around for quite a while.


The other alternative is LTTng, released in 2005 (requires loading a kernel module):

https://lttng.org/


True again, but SystemTap overhead is 20-100x higher.


SystemTap, for whatever reason, remained largely a Red Hat solution. It never really had much traction outside that ecosystem.


The argument is that KUtrace is faster, which honestly I'm not sure of.


I thought this read "training facility", and I was excited to sign up!


How would this interact with `io_uring`, especially the polling modes (IO_SETUP_SQPOLL, IO_SETUP_IOPOLL)?


You still have to wait for your cache to reload from main memory, or for disk or network I/O, or for processes to be scheduled to run, so while it's likely more efficient than epoll approaches, I doubt there's any really fundamental difference in the performance problems you would find.


Is the naming intentional? Or just a weird coincidence? Kut being Dutch for cunt, by an author called dick..?


KUtrace stands for "Kernel-User" trace as (most of) the tracepoints are at the kernel-userspace boundary. So it is a coincidence.



