
The limitations of sampling profilers - skybrian
http://danluu.com/perf-tracing/
======
bgirard
> we need something that does more than just show aggregates, like a sampling
> profiler would.

This isn't really a limitation of sampling profilers. It's a limitation of the
front end that nearly every sampling profiler feeds its data into. Basically
they just aggregate the sampling data and stop there.

For the Gecko platform we use a sampling profiler, but we're able to correlate
samples with unresponsiveness events. For instance you might have only 0.3% of
your samples running Garbage Collection or X but that 0.3% was all consecutive
and was scheduled during a VSync event and prevented the app from displaying
the frame on time. You want your UI to highlight things that impact latency
even if in aggregate the function was fast, because it still managed to make
you skip a frame on the screen.
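
A minimal sketch of that idea, not the Gecko profiler's actual code: given a
flat list of sampled function names, look for consecutive runs rather than
totals. The 1 ms sampling interval, the 16.7 ms frame budget (a 60 Hz display),
the function names and the synthetic data are all assumptions for illustration.

```python
from itertools import groupby

SAMPLE_INTERVAL_MS = 1.0   # assumed sampling period
FRAME_BUDGET_MS = 16.7     # assumed budget for one frame at 60 Hz


def consecutive_runs(samples):
    """Yield (label, start_index, run_length) for each run of equal labels."""
    index = 0
    for label, group in groupby(samples):
        length = sum(1 for _ in group)
        yield label, index, length
        index += length


def latency_offenders(samples, interval_ms=SAMPLE_INTERVAL_MS,
                      budget_ms=FRAME_BUDGET_MS):
    """Return (label, start_ms, duration_ms) for runs longer than one frame."""
    return [(label, start * interval_ms, length * interval_ms)
            for label, start, length in consecutive_runs(samples)
            if length * interval_ms > budget_ms]


# Synthetic data: 10,000 samples, 30 of them (0.3%) in GC, all back to back.
samples = ["layout", "paint"] * 2500 + ["gc"] * 30 + ["layout", "paint"] * 2485

aggregate_share = samples.count("gc") / len(samples)  # tiny in aggregate
offenders = latency_offenders(samples)                # but one 30 ms stall
```

In an aggregate view `gc` is noise; in the run-based view it is the only entry,
because those 30 ms arrived in one piece.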

Tracing vs. sampling is just a choice of data collection method. The issue
here is the UI: displaying the data in a strictly aggregate/bottom-up view
isn't enough.

Here's a demo of the open source, cross-platform, multi-process, low-level
_sampling_ profiler we use:

[http://people.mozilla.org/~bgirard/keyboard_timeline.mp4](http://people.mozilla.org/~bgirard/keyboard_timeline.mp4)

~~~
jholman
The ability to show something more than aggregates is, sure, a UI/front-end
issue. (And an issue with intermediate storage, obviously).

But it's misleading to say that tracing-vs-sampling isn't a factor, unless you
believe that the most-performant sampling and the most-performant tracing are
roughly comparable. TFA claims otherwise; the claim is that the maximum
resolution of tracing is much better than the maximum resolution of sampling.

~~~
_benedict
That's nonsense. If anything the maximum resolution of sampling is higher,
since it has lower overhead, so it perturbs the measured thing less.

However, most sampling profilers only report CPU time spent in a task, not
wall time, so the things he discusses, like time spent waiting for a task to
complete, would be obscured. It would be perfectly possible to report
the number of samples in which a thread was blocked in a method call (though
it could not be said with certainty it was the same invocation, it would be
enough to see that a majority of time blocked was elapsed there), it's just
that they do not typically do so.
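
A rough sketch of what such a report could look like, assuming the sampler
records a thread state alongside the top frame. The `(state, frame)` tuple
shape, the state names and the sample data are hypothetical, not taken from
any particular profiler.

```python
from collections import Counter


def summarize(samples):
    """Count samples per top frame, split into on-CPU and blocked."""
    on_cpu = Counter()
    blocked = Counter()
    for state, frame in samples:
        if state == "running":
            on_cpu[frame] += 1
        else:
            blocked[frame] += 1
    return on_cpu, blocked


# Hypothetical wall-clock samples of one thread.
samples = ([("running", "parse")] * 40
           + [("blocked", "rpc_wait")] * 55
           + [("running", "rpc_wait")] * 5)

on_cpu, blocked = summarize(samples)
# A CPU-time-only report would rank "parse" first; the blocked counter shows
# that most of the wall time was spent waiting in "rpc_wait".
```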

~~~
tbrownaw
_maximum resolution of sampling is higher, since it has lower overhead_

The article claims otherwise.

This is apparently mostly related to the costs of preemption and resulting
icache replacement.

 _though it could not be said with certainty it was the same invocation, it
would be enough to see that a majority of time blocked was elapsed there_

But that's not quite what's being discussed. The main interesting thing here
is tracking what oddness causes outlier worst-case times, which very much does
require tracking individual invocations.
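
To illustrate what tracking individual invocations means here, a minimal
sketch of a tracer that records one span per call and can pick out the worst
one. The `Tracer` class, its method names and the fake clock are hypothetical
and are not the tooling described in the article.

```python
import time
from contextlib import contextmanager


class Tracer:
    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.spans = []      # (name, invocation_id, start, end)
        self._next_id = 0

    @contextmanager
    def span(self, name):
        """Record the start and end time of one specific invocation."""
        invocation_id = self._next_id
        self._next_id += 1
        start = self.clock()
        try:
            yield invocation_id
        finally:
            self.spans.append((name, invocation_id, start, self.clock()))

    def slowest(self, name):
        """Return the single slowest recorded invocation of `name`."""
        candidates = [s for s in self.spans if s[0] == name]
        return max(candidates, key=lambda s: s[3] - s[2])


# A fake clock keeps the example deterministic; real code would use the default.
ticks = iter([0.0, 1.0, 1.0, 1.5, 2.0, 12.0])
tracer = Tracer(clock=lambda: next(ticks))
for _ in range(3):
    with tracer.span("handle_request"):
        pass

worst = tracer.slowest("handle_request")  # the third call, 10 time units long
```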

~~~
_benedict
And I'm calling that aspect of the article nonsense. The cost of preemption is
incurred only infrequently, so even if it slightly perturbs execution just
after a sample is taken, the point _at which it measures_ (assuming it has not
been affected by the prior sample) more accurately represents a system without
any instrumentation.

Assuming each invocation has a unique stack trace, each call site can still
also effectively be tracked through sampling. Looking at his examples, this
seems reasonably likely, as they all have quite different behaviours.

What tracing does do is permit a clear sequential analysis of an arbitrary
granularity of macro behaviours.

If the macro behaviours are chosen with sufficiently _low_ resolution that
they are large enough for the instrumentation overhead to be too small to
measure, then you obviously get a very clear and accurate picture of the
system behaviour at that reduced resolution.

Tracing RPC calls and other similar behaviours, as done in the blog, is a good
example, but it isn't down to increased resolution; quite the opposite.

------
lallysingh
Hi, cheap plug here for my really, really low-overhead event tracer: ppt.

[https://github.com/lally/libmet](https://github.com/lally/libmet) (BSD
license)

Basically, you define structs, fill them in from your code, and then hand them
to a ppt-generated library routine. It's a bit of a pain, but you get a few
nice features:

- The storage is completely non-blocking: you're just writing to shared
memory.

- Overhead is linear in the write rate: if you have a spare CPU core on the
machine, the overhead on your active cores is a constant plus the cost of
filling and copying (once) the struct.

- It'll prefer to lose some data if the disk falls behind in writing, so it
won't slow down your app. Auto-generated sequence numbers tell you what data,
if any, was lost.

- You can leave the code on in production: the listener attaches via ptrace()
and injects a destination buffer into your process (constant time), and if
there's no buffer, the data gets harmlessly dropped.
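
To make the drop-rather-than-block idea concrete, here is a toy single-process
sketch. It is not ppt's implementation and uses no shared memory or ptrace();
the `LossyRing` class and its methods are made up for illustration. The writer
overwrites the oldest slot when the reader falls behind, and the reader uses
sequence numbers to work out how much was lost.

```python
class LossyRing:
    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.next_seq = 0

    def write(self, payload):
        """Never blocks: overwrites the oldest slot if the reader is behind."""
        self.slots[self.next_seq % len(self.slots)] = (self.next_seq, payload)
        self.next_seq += 1

    def drain(self, last_seen):
        """Return (records newer than last_seen, count of records lost)."""
        records = sorted(r for r in self.slots
                         if r is not None and r[0] > last_seen)
        if not records:
            return [], 0
        # A gap between last_seen and the oldest surviving record means
        # those records were overwritten before they could be read.
        lost = records[0][0] - (last_seen + 1)
        return records, lost


ring = LossyRing(capacity=4)
for i in range(10):
    ring.write(f"event-{i}")

# The reader has seen nothing yet; only the 4 newest records survive.
records, lost = ring.drain(last_seen=-1)
```

The writer's cost per record stays constant no matter how slow the reader is,
which is the trade-off the list above describes: losing data is preferred to
slowing down the application.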

I've left it alone as I haven't needed it for myself in a while, and there
hasn't been interest (I don't really advertise it, though). If anyone's
interested, just ping me or file an issue.

------
barrkel
Sampling and instrumenting profilers both lie, in different ways. Sampling is
incomplete; instrumenting has a potentially strong observer effect. Neither is
the ultimate or necessary future. They are both useful.

------
skybrian
In retrospect I probably should have used the second half of the title:
"glimpses of tracing tools from the future".

------
chillydawg
The most interesting bit about that, to me, was the diagram of a Google
search's RPC tree. That was awesome, and I imagine these days it's even more
so.

------
tbrownaw
There's also a brief mention in
[http://joeduffyblog.com/2015/11/19/asynchronous-everything/](http://joeduffyblog.com/2015/11/19/asynchronous-everything/)
about making async/remote calls not break stack traces. That feels like it
ought to be somewhat related to the distributed performance profiling
discussed here.

~~~
abecedarius
[https://github.com/cocoonfx/causeway](https://github.com/cocoonfx/causeway)
seems to be the current version of the early work that must've influenced the
Midori work you link to.

------
LoSboccacc
Sampling is useful for deciding where to direct inquiry later. Creating a
flame graph of the whole application is quite taxing, and sampling can help
narrow down the area to investigate.

------
lsiebert
Google could open source the profiler at some point.

