Early in my career, I reached out to the author and was able to grab lunch with him; he was about to retire! It was insightful to hear his thoughts on system performance, particularly systems involving more than one machine, which is something he studied deeply.
It gave me appreciation for the amount of knowledge one accumulates over a career, and what a loss it is to an organization when one so knowledgeable retires.
I had the pleasure of working with Dick on getting KUtrace to work on Android devices last year. It was a great experience to work with one of the greats in systems performance. He was a wealth of information on performance bottlenecks and optimizations.
KUtrace is absolutely one of the most powerful tools I've used for deeply understanding performance bottlenecks (after isolating issues), such as poor scheduling behavior. I would highly recommend reading his book "Understanding Software Dynamics" [1] if you are interested in learning more about KUtrace or performance bottlenecks/optimizations in general. The book is quite dense and dives deep into the performance characteristics of many examples of the five fundamental resources (according to Dick): CPU, Memory, Disk/SSD, Network, and Software critical sections.
So the book is split into four major sections: Measurement, Observation, KUtrace, and Reasoning.
"Measurement" delves into understanding and measuring four fundamental resources: CPU, Memory, Disk/SSD, and Network. This section is quite dense and explores both the depth and breadth of understanding program performance. For example, there is a chapter on optimizing code to use caches more efficiently. Though I will say this section is obviously not a complete exploration of all aspects of performance, since many more things can affect the behavior of systems as complex as modern computers. What Dick does in this section is give you more tools in your toolbelt to understand performance better.
"Observation" looks at existing tooling (profilers, tracing tools, and so on) and discusses where they are useful and where they fall short.
"KUtrace" introduces KUtrace, its kernel module, and its timeline visualization tool. It discusses its design and implementation and why it is so fast and low-overhead.
"Reasoning" has case studies that look at particular kinds of performance pathologies, such as "waiting for CPUs". Dick uses KUtrace here to tease out the underlying inefficiencies in the analyzed programs.
So the first two sections are essentially orthogonal to whether you want to use KUtrace, while the last two are about KUtrace and how to use it to understand performance bottlenecks. Even if you don't use KUtrace, the "Reasoning" section can still be insightful imo: KUtrace is just a tool at the end of the day, and the real insight is in why something is slow and what is causing it.
What precisely are you trying to measure? In theory, if you know which performance counters you want to measure, you can replace the IPC counter in the kernel module. I believe Dick has a different version of the kernel module that measures LLC misses instead of CPU cycles. Does that answer your question?
Hey, thanks for the response. Is it just a matter of measuring the LLC miss rate and then figuring out the max DRAM bandwidth somehow? What about in a multicore setting? NUMA? It would be nice to have a library or tool that works this out - always surprises me there isn't something off the shelf.
You might be interested in Intel VTune then, if you have an Intel CPU. I believe it has a profiling option that shows memory bandwidth over time [1].
Perfetto is pretty cool (though it's a very overloaded term, since there's the Perfetto UI, the Perfetto backend, etc.), but we were generally interested in fairly low-level behavior like waiting on memory, which requires low-overhead tracing. ftrace (which Perfetto uses under the hood for all the system events) does have observable overhead. KUtrace has a nifty visualization of the different kinds of waiting (waiting for CPU, locks, memory, etc.). There was also the novelty of trying to get KUtrace to work on Android.
The upside of Perfetto, of course, is the much richer tooling, infrastructure, and ease of use, since it comes pre-installed on your phone.
> Nice handle btw. Grinding for that was unforgettable...
Haha thanks. Symphony of the Night is easily one of my favorite games -- I can pick it up any time and play it until 200.6% completion ;)
No, it unfortunately does not. You need root to remount the read-only system partition, insert the kernel modules, and turn SELinux off. We used a "userdebug" build to get root, but I imagine most of this could also be done on a phone rooted through other means (I haven't tried it, however).
But it works by patching the kernel, rather than using eBPF like many recent performance tools. So it needs constant maintenance, given the current velocity of internal kernel changes. And I would not be surprised if it failed to build or work correctly on a heavily patched and customized kernel.
On the positive side, at first glance the maintenance to adapt to new kernels looks very active.
Out of curiosity, is BPF now capable of capturing all the context-switch events, such as CPU traps?
Also, if the overhead is negligible, maybe the author could try to merge this into mainline, using a static key to make the incurred overhead switchable. Even with a static key, though, the degree of the accompanying interference with the cache and branch predictor might be an intriguing topic.
Edit: Perhaps an alternative approach would be to attach probes to relevant (precise) PMU events. There's also this prototype of adding breakpoint/watchpoint support to eBPF [1]. But actually doing anything within this context could get complicated very fast, so it would need to be severely limited, if it's feasible at all.
You still have to wait for your cache to reload from main memory, for disk or network I/O, or for processes to be scheduled to run, so while it's likely more efficient than epoll-based approaches, I doubt there's any fundamental difference in the performance problems you would find.