
Evaluation of CPU profiling tools: gperftools, Valgrind and gprof (2015) - adgnaf
http://gernotklingler.com/blog/gprof-valgrind-gperftools-evaluation-tools-application-level-cpu-profiling-linux/
======
woadwarrior01
Also worth mentioning here is perf[1], which is great for low-overhead
profiling. In addition, perf profiles can be turned into profiles compatible
with GCC and LLVM PGO to build optimized binaries based on production runs,
using autofdo[2]. In my use case, the instrumentation overhead was too high to
use regular profiling on production workloads.

[1]:
[https://en.m.wikipedia.org/wiki/Perf_%28Linux%29](https://en.m.wikipedia.org/wiki/Perf_%28Linux%29)

[2]: [https://github.com/google/autofdo](https://github.com/google/autofdo)
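
The AutoFDO flow described above looks roughly like this; a sketch assuming
GCC, a CPU with LBR support, and a placeholder binary name (`./server`):

```shell
# 1. Sample a representative production run; -b records LBR branch
#    stacks, which autofdo needs. Overhead stays low (a few percent).
perf record -b -- ./server --some-workload

# 2. Convert perf.data into GCC's AutoFDO format with autofdo's tool.
create_gcov --binary=./server --profile=perf.data \
            --gcov=server.afdo -gcov_version=1

# 3. Rebuild, feeding the profile to the optimizer.
gcc -O2 -fauto-profile=server.afdo -o server server.c
```

For LLVM the equivalent path uses autofdo's `create_llvm_prof` plus clang's
`-fprofile-sample-use=` flag.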

~~~
atq2119
Don't forget Hotspot as a way to visualize perf results, which vanilla perf is
unfortunately a bit lacking in:
[https://github.com/KDAB/hotspot](https://github.com/KDAB/hotspot)

~~~
the8472
Thank you, I have been looking for something that could display perf
recordings as a swim-lane view for threads.

------
eyegor
I'm just going to drop coz[1] as another suggestion. Ever since their
talk/paper I expected other implementations of a causal profiler but for some
reason everyone is steeped in the old ways. The concept just seems like such a
huge efficiency boost compared to raw flame graphs. If you have time to watch
their talk, it's linked on the github readme.

[1] [https://github.com/plasma-umass/coz/blob/master/README.md](https://github.com/plasma-umass/coz/blob/master/README.md)
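
For a sense of what causal profiling asks of you: coz needs "progress points"
marking a unit of work. A minimal sketch, assuming coz is installed (`coz.h`
on the include path) and run under `coz run --- ./a.out`; `process_request` is
a hypothetical unit of work:

```c
/* Sketch of coz progress-point instrumentation. Build with -g so coz
 * can map virtual speedups back to source lines. */
#include <coz.h>
#include <stddef.h>

void process_request(void);  /* hypothetical unit of work */

int main(void) {
    for (size_t i = 0; i < 1000000; i++) {
        process_request();
        COZ_PROGRESS;  /* one throughput progress point per request */
    }
    return 0;
}
```

coz then reports how much end-to-end throughput would improve if each source
line were sped up, rather than just where time is spent.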

~~~
zimbatm
coz requires the user to instrument the code. So it's interesting, but also
much more costly to run experiments with.

I suspect it's cheaper to start with callgrind to get an idea of the hot spots
in the code base and pick the low-hanging fruit, then switch to coz if you
really need to squeeze out the last bit of performance.
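
That callgrind-first workflow might look like this (hypothetical binary name;
KCachegrind is one common viewer):

```shell
# Run under callgrind: no recompilation needed, but expect a large
# (often 10-50x) slowdown while collecting.
valgrind --tool=callgrind ./myprog

# Quick textual summary of the hottest functions.
callgrind_annotate callgrind.out.<pid>

# Or browse the call tree interactively.
kcachegrind callgrind.out.<pid>
```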

~~~
vanderZwan
If you watch the talk, he finds a number of examples of already-implemented
optimizations that don't actually work. I wouldn't be surprised if it was
better to start with coz immediately. If nothing else, it forces you to model
the problem better, so there's overlap with test-driven design there, no?

------
praveenster
Julia Evans has a really nice zine on Linux debugging tools:
[https://wizardzines.com/zines/debugging/](https://wizardzines.com/zines/debugging/)

------
foota
I absolutely love flame graphs for analysing performance. If you haven't used
one before and you're interested in optimization (in particular, of large
programs you're unfamiliar with), check them out! I also find them an easy way
to get a grasp on complicated call stacks, since the way frames are stacked
visually makes call chains really easy to follow.

~~~
gnufx
I've asked before without luck: How are flamegraphs preferable to the well-
established sorts of visualizations in the common HPC performance tools, like
CUBE, Paraver, and TAU, say? They typically provide at least inclusive or
exclusive function/region views with choices of metrics for profiling and/or
tracing over serial, threaded, or distributed execution.

~~~
foota
Well, I'll start by saying I'm not familiar with any of those tools. I took a
quick look at them, though. It looks like Paraver offers a time-domain look at
performance, and CUBE seems to offer time-based views and a Graphviz rendering
of the call tree.

In a flame graph the width of a stack frame is proportional to the % of CPU
time spent in that frame, and the y axis shows the call stack itself.

This means that you can quickly tell what functions, and from what call sites,
are the most expensive.

The only visualization I know of that matches that ability to quickly zero in
on things while maintaining context is a call graph with nodes colored by
cumulative CPU time, but that has the issue that laying out the graph is hard,
and seeing everything at once is difficult.

~~~
gnufx
That may be OK in simple cases where you can easily eyeball it, if you're only
interested in aggregated CPU time as a metric, and if you win most from
optimizing the obvious function in all modes of the program. That's not
necessarily the case in complex scientific codes, for instance, especially
parallel ones.

~~~
foota
This is true. It's more useful for optimizing usage than it is for deep
sleuthing of "why is this particular thing performing poorly".

------
ss248
The pictures in the article don't work (at least for me). Here is the wayback
machine snapshot where everything displays correctly.

[https://web.archive.org/web/20160718172225/http://gernotklin...](https://web.archive.org/web/20160718172225/http://gernotklingler.com/blog/gprof-valgrind-gperftools-evaluation-tools-application-level-cpu-profiling-linux/)

------
frumiousirc
A problem with google perftools is that the `SIGPROF` signal used in sampling
will interrupt polling, such as that used in ZeroMQ. Otherwise, it is a good
tool in the toolbox.

~~~
vectorEQ
That is/was a problem with ZeroMQ, not gperftools:

    
       27    SIGPROF      terminate process    profiling timer alarm (see setitimer(2))
    

Sampling is specifically what that signal is for.

Additionally, this seems to have been fixed in ZeroMQ for some things back in
2016, so I doubt it's still a valid issue.

------
pmoriarty
The most interesting tool that I've found along these lines is sysdig.

The sheer number of ways it can be used to dig into and evaluate performance
and other characteristics, and the ease of doing so, is truly awesome.

It can't do everything. But it can do a lot.

------
pjc50
Also worth considering is Oprofile:
[https://en.m.wikipedia.org/wiki/OProfile](https://en.m.wikipedia.org/wiki/OProfile)

~~~
redis_mlc
I used to use oprofile, but I don't think it works on current kernels.

So I use "perf stat" now. :)

------
Thaxll
No love for VTune?

~~~
mdani
It was great on physical hosts, but I could not get it working in EC2 VMs.

~~~
gnufx
It requires access to hardware counters you don't normally have in EC2, and at
a privilege level I wouldn't want to enable in a multi-user compute system.

------
glouwbug
GCC has an address sanitizer and a thread sanitizer built in these days:

    
        gcc -fsanitize=address ...
        gcc -fsanitize=thread ...
    

They perform better than Valgrind.

~~~
wyldfire
This article refers to valgrind's profiling features (callgrind) and not its
more common/popular 'memcheck' feature.

Sanitizers and memcheck are unrelated to the profiling discussion here.

