

Ask HN: What's your best C/C++ profiling tool, hints and best practice? - henrikm85

I frequently need to profile C++ on Linux. I have used perf a lot but do not have VTune handy. I have dabbled with the poor man's profiler, but that gets trickier with lots and lots of threads. What are your favorite outside-the-box approaches? How do you track down contention or I/O wait issues?
======
vandot
HPCToolkit and TAU are good options for profiling C++ applications. They come
from HPC so they are intended for use with parallel and highly concurrent
applications.

[http://hpctoolkit.org/](http://hpctoolkit.org/)
[https://www.cs.uoregon.edu/research/tau/home.php](https://www.cs.uoregon.edu/research/tau/home.php)

~~~
gnufx
Threadspotter (paratools.com) and maqao.org might be of interest, at least for
x86_64 GNU/Linux but I wouldn't know about C++ specifics.

[TAU is doubtless a good bet, but for what it's worth for general interest,
the other common systems for HPC are openspeedshop.org, cube/scalasca
(scalasca.org), and extrae/paraver (bsc.es). A good comparison of them all
would be useful, but I've not found one.]

------
ekr
Any tips on profiling Windows device drivers? I've tried xperf, but it doesn't
allow you (or I haven't seen how) to change the frequency of the sampling, and
can only do a system-wide profile.

I've also tried VTune, but it doesn't support stack-tracing (or things like
LBR) for system-wide profiling, and it doesn't have a specific option for
sampling drivers. You can attach to the System process, but then you're
missing a lot of your driver code that runs in other contexts.

I kept thinking about implementing my own sampling profiler (using LBR for
stack-tracing, and hardware performance events, like Linux's OProfile or
FreeBSD's hwpmc), but I can't see how I could profile only my driver, and not
the whole system, without hooking the Windows scheduler. I guess I will just
profile the whole system and check if the program counter is inside my module.
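
A minimal sketch of that last filtering step, assuming the sampled program
counters and the driver's load address and image size are already in hand.
`ModuleRange`, `in_module` and `filter_samples` are hypothetical names for
illustration, not part of any Windows or profiler API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of where the driver image is loaded.
struct ModuleRange {
    std::uintptr_t base;  // load address of the module
    std::size_t size;     // size of the mapped image in bytes
};

// True if the sampled program counter lies in [base, base + size).
// Written as a subtraction so the check cannot overflow near the top
// of the address space.
inline bool in_module(const ModuleRange& m, std::uintptr_t pc) {
    return pc >= m.base && (pc - m.base) < m.size;
}

// Keep only the system-wide samples that landed inside the module.
inline std::vector<std::uintptr_t> filter_samples(
        const ModuleRange& m, const std::vector<std::uintptr_t>& pcs) {
    assert(m.size > 0);  // an empty range would silently drop every sample
    std::vector<std::uintptr_t> kept;
    for (std::uintptr_t pc : pcs) {
        if (in_module(m, pc)) kept.push_back(pc);
    }
    return kept;
}
```

The ratio of kept samples to total samples then gives a rough share of system
time spent in the driver, regardless of which thread context it ran in.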

------
jononor
Before you start, determine what you are attempting to optimize. Throughput or
latency? Improving averages, or reducing how often below-acceptable
performance occurs?
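
To make the distinction concrete, here is a rough sketch of the two numbers
computed from the same set of latency samples. The function names and the
idea of a fixed per-operation "budget" are assumptions for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Mean latency: the number that "improving averages" targets.
inline double mean_ms(const std::vector<double>& samples) {
    assert(!samples.empty());
    return std::accumulate(samples.begin(), samples.end(), 0.0) /
           static_cast<double>(samples.size());
}

// Fraction of samples slower than the budget: how often
// below-acceptable performance occurs.
inline double fraction_over_budget(const std::vector<double>& samples,
                                   double budget_ms) {
    assert(!samples.empty());
    const auto slow = std::count_if(
        samples.begin(), samples.end(),
        [budget_ms](double s) { return s > budget_ms; });
    return static_cast<double>(slow) / static_cast<double>(samples.size());
}
```

For samples of 10, 10, 10 and 50 ms with a 16 ms budget, the mean is 20 ms
while a quarter of the samples miss the budget. An optimization can move one
of these numbers without moving the other.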

Write end-to-end tests that exercise the application as closely as possible to
how a user would. Then, use a profiler with an API so you can start the dump
once test/app setup is completed (to avoid extraneous noise/misleading data).
I like gperftools
combined with KCachegrind as a GUI. Used it very successfully for instance in
MyPaint: [http://www.jonnor.com/2012/11/improved-drawing-
performance-i...](http://www.jonnor.com/2012/11/improved-drawing-performance-
in-mypaint-brush-engine)
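
A minimal sketch of the "start after setup" pattern with the gperftools CPU
profiler, whose `ProfilerStart`/`ProfilerStop` calls are declared in
`<gperftools/profiler.h>`. The `USE_GPERFTOOLS` macro, the `ScopedCpuProfile`
wrapper, `workload` and the `app.prof` file name are assumptions made up for
this example, so that it also compiles without the library installed:

```cpp
#include <cassert>
#include <cstdint>
#ifdef USE_GPERFTOOLS             // hypothetical opt-in macro, see note below
#include <gperftools/profiler.h>  // ProfilerStart / ProfilerStop
#endif

// RAII guard: profiling covers only the lifetime of this object.
class ScopedCpuProfile {
public:
    explicit ScopedCpuProfile(const char* path) {
        assert(path != nullptr);
#ifdef USE_GPERFTOOLS
        ProfilerStart(path);  // begin sampling, profile is written to `path`
#else
        (void)path;           // no-op when built without gperftools
#endif
    }
    ~ScopedCpuProfile() {
#ifdef USE_GPERFTOOLS
        ProfilerStop();       // stop sampling and flush the profile
#endif
    }
    ScopedCpuProfile(const ScopedCpuProfile&) = delete;
    ScopedCpuProfile& operator=(const ScopedCpuProfile&) = delete;
};

// Hypothetical stand-in for the code path the test exercises.
inline std::uint64_t workload(std::uint64_t n) {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < n; ++i) acc += i * i;
    return acc;
}

inline std::uint64_t run_profiled(std::uint64_t n) {
    // ... test/app setup would happen here, outside the profile ...
    ScopedCpuProfile profile("app.prof");  // hypothetical output file
    return workload(n);
}
```

Building with something like `-DUSE_GPERFTOOLS -lprofiler` enables the real
calls. The resulting profile can then be converted for KCachegrind with
pprof's `--callgrind` output mode.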

------
JoachimSchipper
Your question is a bit all over the map - are you interested in reducing CPU
usage, reducing time spent in locks, or do you want to talk to the kernel more
efficiently? Are you targeting throughput or latency?

That said,
[http://www.brendangregg.com/flamegraphs.html](http://www.brendangregg.com/flamegraphs.html)
is a nice introduction to a site that has lots of material.

------
gricardo99
valgrind's got some good features along these lines:
[http://valgrind.org/info/tools.html](http://valgrind.org/info/tools.html)

~~~
henrikm85
Valgrind is nice; however, especially with multi-threaded programs, the
virtualized execution diverges quite a bit from a run outside the Valgrind VM,
so I am not a huge fan.

------
soulbadguy
Very interested in this too

~~~
olzhas
me too

