
Google XRay: A Function Call Tracing System [pdf] - mnem
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45287.pdf
======
brendangregg
NOT ANOTHER TRACER!!

I'm sure it's impressive engineering work, but why oh why...

How does it compare to Linux uprobes, which are built into Linux mainline?
Bear in mind there are different front ends for uprobes (ftrace, perf_events,
bcc, ...), and these are also still in development, so if one lacked certain
features they needed, such features could be added. There's been a LOT of work
in this area in the past 6 months, as well (see lkml).

If the goal was the lowest overhead, then why compile with no-op sleds
("negligible overhead") instead of using dynamic tracing (literally "zero
overhead" when disabled)? Or, if the existing kernel-based dynamic tracers
benchmarked poorly, then why not something like LTTng?

How does it compare to DTrace, as well? (Doesn't Google have some FreeBSD?).

All the tracers I mentioned can not only do dynamic tracing, but also
instrument all user and kernel code, without special recompilation.

~~~
davidtgoldblatt
Most of XRay was written before uprobes was merged into the Linux kernel (and
well before such kernels were widely available).

I don't think any of the alternatives you mentioned are Pareto superior to
XRay when considering all of "speed while tracing", "speed while not tracing",
and "flexibility".

E.g.:

- In "speed while tracing", anything that takes a context switch per traced
function will probably be dramatically slower. Even if there's some fast
dispatch mechanism you have in mind that I'm not familiar with when you say
dynamic tracing, if it doesn't insert the moral equivalent of a nop-sled, it
will have to choose between logging the whole PC (spending data, which means
spending RAM and disk time) or figuring out how to map it to a
function-specific unique int (spending cycles).

- In "speed while not tracing", anything much more expensive than nop-sleds
will be too slow to run in production.

- Anything that doesn't have a compile-time component probably won't be able
to completely hook functions that get inlined or whose source you aren't able
to change, won't be able to pick out information the runtime wants to
summarize from function arguments, etc.

To me, the neat thing about XRay isn't so much the "function patching" aspect,
except insofar as it serves as a mechanism to execute arbitrary code at
function entry or exit in a way that's runtime-customizable and very low
overhead when you want it to be.
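
To make the sled idea concrete, here's a rough sketch of the patching step
(my own simplification on x86-64, not XRay's actual code; the trampoline
being installed would save registers and log a compact function ID plus a
timestamp):

    /* Hypothetical sketch of nop-sled patching, not XRay's implementation.
     * Assume the compiler emitted a >= 5-byte nop sled at each instrumented
     * function's entry and handed us its address. */
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static int enable_trace(uint8_t *sled, void *trampoline) {
        long pg = sysconf(_SC_PAGESIZE);
        uint8_t *page = (uint8_t *)((uintptr_t)sled & ~(uintptr_t)(pg - 1));
        if (mprotect(page, 2 * pg, PROT_READ | PROT_WRITE | PROT_EXEC))
            return -1;
        /* Encode "call rel32"; rel32 is relative to the end of the call. */
        int32_t rel = (int32_t)((uintptr_t)trampoline - (uintptr_t)(sled + 5));
        sled[0] = 0xE8;                          /* call opcode */
        memcpy(sled + 1, &rel, 4);
        /* NB: patching live instructions is unsafe on SMP without extra
         * synchronization; see the cross-modifying-code errata discussion
         * elsewhere in this thread. */
        return 0;
    }

Disabling tracing just writes the original sled bytes back, which is why the
not-tracing cost stays at "a few nops".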

~~~
brendangregg
Thanks for the reply; some replies:

> "anything that takes a context switch per traced function will probably be
> dramatically slower."

Good thing uprobes don't context switch:

      # perf stat -e context-switches -e probe_libc:re_search_internal sed '/./d' /mnt/data.txt

       Performance counter stats for 'sed /./d /mnt/data.txt':

                     6      context-switches
            15,122,432      probe_libc:re_search_internal

          19.744738204 seconds time elapsed

You mean mode switch? Cheaper, but yes, still costly. Here's the runtime
without the probe:

      # time sed '/./d' /mnt/data.txt

      real    0m3.349s
      user    0m3.345s
      sys     0m0.004s

Which means we can calculate the cost to be ~1.1 us per probe on my system:
(19.74 s - 3.35 s) / 15,122,432 probes ≈ 1.08 us each. Anyone know what XRay
is clocking in at?

AFAIK, LTTng has done work on user<->user instrumentation. I think uBPF will
be doing this
([https://github.com/iovisor/ubpf](https://github.com/iovisor/ubpf)),
although that project is very new. It could use help from more good engineers
(please do!).

> "In "speed while not tracing", anything much more expensive than nop-sleds
> will be too slow to run in production."

I'm not sure anyone is suggesting anything more than nop-sleds. Dynamic
tracing is zero, and static tracing is nop-sleds.

> "probably won't be able to completely hook functions that get inlined"

Sure. Sometimes there are static tracing probes (nop-sled based), sometimes
there aren't and it's dynamic probes, sometimes those dynamic probes are
inlined and you walk up the stack to find one that isn't. If it is inlined,
maybe you need to trace the address rather than the function entry.

In my experience it's pretty rare that something is just untraceable because
inlining is so insane. But yes, it does happen sometimes. Usually I figure out
a workaround before giving up.

> "a mechanism to execute arbitrary code at function entry or exit in a way
> that's runtime-customizable and very low overhead when you want it to be"

BPF! In-kernel virtual machine that runs JIT'd code on events, and is part of
mainline Linux. Lots of enhancements in the Linux 4.x series.
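
For flavor, here's roughly what the kernel-side C of a bcc program looks like
(a hypothetical sketch; bcc compiles it, attaches it to a uprobe, and reads
the map from user space):

    // Hypothetical bcc-style sketch: count calls to a uprobe'd user-space
    // function entirely in-kernel; only the small "calls" map crosses to
    // user space when the front end reads it.
    #include <uapi/linux/ptrace.h>

    BPF_HASH(calls, u64, u64);        // key: probed instruction pointer

    int on_entry(struct pt_regs *ctx) {
        u64 ip = PT_REGS_IP(ctx);     // address of the probed function
        calls.increment(ip);
        return 0;
    }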

~~~
davidtgoldblatt
> You mean mode switch? Cheaper, but yes, still costly.

Hah, yes.

> Which means we can calculate the cost to be ~1.1 us per probe (on my
> system). Anyone know what XRay is clocking in at?

I'm not at Google any more to check (and haven't touched the code since 2012),
but IIRC XRay overhead while tracing was something like 100 cycles per
function (which includes an entry/exit pair). Hard to compare across different
machines made years apart of course, but I think that a little over an order
of magnitude difference sounds about right.

~~~
brendangregg
Ok, if the requirement is 100 cycles per function, then you'll (for current
kernels) need user<->user. So, AFAIK, an LTTng-like thing or uBPF. And LTTng
does show such an order of magnitude difference: slide 25+ of
[http://www.efficios.com/pub/endusersummit2012/presentation-enduser2012.pdf](http://www.efficios.com/pub/endusersummit2012/presentation-enduser2012.pdf)

I'd also look at what problems need solving, in case they can be solved
without function tracing. If 100 cycles per function is the budget, that
implies a high rate of function calls, and a lot could be done with just
timer-based stack trace sampling.
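
Timer-based sampling is cheap enough that even an in-process toy shows the
idea (illustrative C only; perf does this out-of-process and gets kernel
stacks too):

    /* Minimal sketch of timer-based stack sampling: SIGPROF fires at
     * ~100 Hz of consumed CPU time, and the handler dumps the current
     * user stack. Real profilers sample out-of-process. */
    #include <execinfo.h>
    #include <signal.h>
    #include <sys/time.h>

    static void on_prof(int sig) {
        void *frames[64];
        int n = backtrace(frames, 64);
        backtrace_symbols_fd(frames, n, 2);      /* no malloc: signal-safe */
    }

    int main(void) {
        void *warm[1];
        backtrace(warm, 1);           /* pre-load libgcc outside the handler */
        signal(SIGPROF, on_prof);
        struct itimerval it = { { 0, 10000 }, { 0, 10000 } };  /* 100 Hz */
        setitimer(ITIMER_PROF, &it, 0);
        for (volatile unsigned long i = 0; i < 1000000000UL; i++)
            ;                          /* busy work to be sampled */
        return 0;
    }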

~~~
knweiss
Brendan, did you read this awesome blog post (which mentions your work, btw)
about Google's tracing framework, which _may_ explain the kind of problems
they want to solve?
[http://danluu.com/perf-tracing/](http://danluu.com/perf-tracing/)

" _Sampling profilers, the most common performance debugging tool, are
notoriously bad at debugging problems caused by tail latency because they
aggregate events into averages. But tail latency is, by definition, not
average._ "

~~~
brendangregg
I don't quite get it -- to me, that's like saying a hammer is notoriously bad
at screwing in screws. And then going ahead and proving it. Um. Good content,
but I don't get the premise. I wasn't using sampling profilers to study tail
latency in the first place. Who would? (Maybe it's a different usage of
"sampling profilers" than I'm familiar with.)

As for tail latency: one way is to dump every function event; a lower-cost
way is to summarize latency in-kernel as a histogram. Either way you're still
tracing every function, and care about overhead down to 100 cycles etc.
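
The histogram version is a small step from the counting example above; a
hypothetical bcc-style sketch (attach on_entry/on_return to a
uprobe/uretprobe pair; the names are made up):

    // Hypothetical bcc-style sketch: summarize one function's latency
    // in-kernel as a log2 histogram, so ~64 counters cross to user space
    // instead of one event per call.
    #include <uapi/linux/ptrace.h>

    BPF_HISTOGRAM(lat_us);            // log2 buckets, in microseconds
    BPF_HASH(start, u32, u64);        // per-thread entry timestamp

    int on_entry(struct pt_regs *ctx) {
        u32 tid = bpf_get_current_pid_tgid();    // low bits: thread id
        u64 ts = bpf_ktime_get_ns();
        start.update(&tid, &ts);
        return 0;
    }

    int on_return(struct pt_regs *ctx) {
        u32 tid = bpf_get_current_pid_tgid();
        u64 *tsp = start.lookup(&tid);
        if (tsp) {
            lat_us.increment(bpf_log2l((bpf_ktime_get_ns() - *tsp) / 1000));
            start.delete(&tid);
        }
        return 0;
    }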

But think about what causes the tails in the first place. Lock contention?
Trace that. Resource I/O? Trace that. If it costs 1 us to trace disk I/O, it's
usually not a problem. I like to time scheduler switch events with stack
traces -- a catch all (but a bit more expensive). Of course, these approaches
require kernel instrumentation. :)

~~~
compudj
Hi Brendan,

(Full disclosure: I'm Mathieu Desnoyers, part of the LTTng maintainer team.)

I would like to introduce a slightly less extreme point of view when
considering "on-the-fly" aggregation of traces vs. tracing to a buffer
followed by post-processing. I see from the current discussion thread that
it's very much either one or the other, but I think that combining the two
approaches helps create much more powerful tools. On-the-fly aggregation
based on trace instrumentation helps pinpoint latency outliers. Tracing to
buffers, on the other hand, provides very detailed information about the
system behavior that leads to those outliers. By using on-the-fly aggregation
as a "trigger" to collect the tracer's in-memory ring buffers, one can
investigate latency outliers with very small I/O overhead.
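
As a toy sketch of the combination in C (hypothetical names; LTTng's actual
snapshot machinery is much more involved): trace continuously into a ring
buffer, and let the on-the-fly aggregate fire the collection only when an
outlier shows up.

    /* Toy sketch: events stream into an in-memory ring buffer; when a
     * per-event latency crosses a threshold (the on-the-fly aggregate),
     * snapshot the buffer for post-processing. */
    #include <stdint.h>
    #include <stdio.h>

    #define RING 1024
    struct evt { uint64_t ts_ns; uint64_t id; };
    static struct evt ring[RING];
    static unsigned head;
    static const uint64_t threshold_ns = 1000000;   /* 1 ms outlier bar */

    static void snapshot(void) {                    /* the "collect" step */
        for (unsigned i = 0; i < RING; i++) {
            const struct evt *e = &ring[(head + i) % RING];
            if (e->ts_ns)
                printf("%llu %llu\n", (unsigned long long)e->ts_ns,
                       (unsigned long long)e->id);
        }
    }

    void record(uint64_t now_ns, uint64_t id, uint64_t lat_ns) {
        ring[head++ % RING] = (struct evt){ now_ns, id };
        if (lat_ns > threshold_ns)                  /* trigger on outlier */
            snapshot();
    }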

------
Mister_Snuggles
I would love to see a tracer that works across processes and machines.

For example, a request hits the web server, the web server tickles an
application server, the application server makes a number of database queries,
and the results propagate back the way they came.

I'd love to see something that could follow this request through all of the
layers and across all of the machines, without the luxury of having the source
code for most of the components.

I don't see why this wouldn't be possible, but I certainly see why it wouldn't
be EASY!

~~~
bg451
Seeing that people have already mentioned Dapper, I'd add that there are
already a few open source implementations of Dapper available. Appdash[1] is a
very lightweight tracer that isn't too fancy, but gets the job done. There's
also the much bigger, more popular tracer Zipkin[2], which was built by
Twitter.

[1]
[https://github.com/sourcegraph/appdash](https://github.com/sourcegraph/appdash)

[2]
[https://github.com/openzipkin/zipkin](https://github.com/openzipkin/zipkin)

~~~
Mister_Snuggles
It looks like both of these require changes to the application. Dapper appears
to have the same issue.

My use case is for understanding a closed source application which has a
number of separate pieces. Since I don't have the source code, these don't
look like viable options.

------
wslh
I never tire of this shameless plug: our state-of-the-art open source
instrumentation engines for Microsoft Windows let you hook applications
without even knowing about the complexities of hooking. The most
programmer-friendly is
[https://github.com/nektra/Deviare2](https://github.com/nektra/Deviare2),
while the Microsoft Detours competitor is
[https://github.com/nektra/Deviare-InProc](https://github.com/nektra/Deviare-InProc).

It has an embedded disassembler to smartly hook functions even if they have a
jmp in the prologue.

~~~
05
Yeah but what if they have a branch target in the prologue?

~~~
mxmauro
When the stub for calling the original function is created, most hooking
engines assume the prologue contains the standard "mov edi,edi / push ebp /
mov ebp,esp", which is wrong.

If the prologue contains, e.g., a relative jmp, copying the opcodes is not
enough: you must convert it to an absolute jump in the generated stub. The
same applies to several instructions that do relative/indirect addressing.
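
For the simplest relative case, a hedged sketch (x86-64 "jmp rel32" only;
the function name is mine, and a real engine uses its disassembler to find
instruction boundaries and handle the other relative forms):

    /* Hypothetical sketch: a 5-byte "jmp rel32" copied verbatim into a
     * stub would jump to the wrong place, because rel32 is relative to
     * the end of the jmp itself. Recompute it for the stub's address. */
    #include <stdint.h>
    #include <string.h>

    void relocate_jmp_rel32(const uint8_t *orig, uint8_t *stub) {
        int32_t rel;
        memcpy(&rel, orig + 1, 4);                      /* original disp */
        uintptr_t target = (uintptr_t)(orig + 5) + (intptr_t)rel;
        int32_t new_rel = (int32_t)(target - (uintptr_t)(stub + 5));
        stub[0] = 0xE9;                                 /* jmp rel32 */
        memcpy(stub + 1, &new_rel, 4);
        /* If the new displacement overflows 32 bits, a real engine falls
         * back to an absolute form, e.g. "jmp [rip+0]" + 8-byte target. */
    }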

------
dang
Url changed from
[https://research.google.com/pubs/pub45287.html](https://research.google.com/pubs/pub45287.html),
which points to this.

This project was discussed recently at
[https://news.ycombinator.com/item?id=11595287](https://news.ycombinator.com/item?id=11595287),
but perhaps the current post adds more information.

------
compudj
I don't see any mention of Intel's errata on cross-modifying code on SMP in
the paper. I wonder how the authors handle this? See "Unsynchronized
Cross-Modifying Code Operations Can Cause Unexpected Instruction Execution
Results", erratum AX72, in
[http://www.intel.com.tr/content/dam/www/public/us/en/documents/specification-updates/xeon-5400-spec-update.pdf](http://www.intel.com.tr/content/dam/www/public/us/en/documents/specification-updates/xeon-5400-spec-update.pdf).
This is one of the main challenges of cross-modifying code, and one key
reason why LTTng-UST does not use a nop-sled today. One possible approach is
to SIGSTOP the entire process while doing the code modification, which is
unwanted in real-time systems. Another approach would be to integrate with
uprobes and do a temporary breakpoint bypass, similar to what the Linux
kernel does today for jump labels.

------
erikpukinskis
More and more I think call tracing is a core process of programming, and am
re-architecting the code I write to be easily traceable.

I feel like this approach of using layer-cake architectures where your
function call has to plumb through a dozen layers that you didn't write and
then trying to make sense of that with data analytics is the wrong approach.

Instead, I have been ditching layer-cake libraries for vertically integrated
libraries that do one thing and take full responsibility for it. This requires
architecting your application in a different way... it generally means more
boilerplate. Libraries do the heavy lifting, but no sexy DSLs that turn your
boilerplate into terse method chains and such.

But the ability to simply put a breakpoint anywhere in the system and have
the stack be a good representation of which pieces of code are _doing
something right now_, vs. just hanging around because this-kind-of-thing
might need that-kind-of-interface later on, is worth it.

------
4ad
Seems _much_ more restricted than perf or DTrace, or all the other tracing
tools. I don't understand the point of this at all. NIH?

I'd love to hear brendangregg's or bcantrill's thoughts on this. (Does HN
give users some kind of alert when they are mentioned?)

~~~
asuffield
(Tedious disclaimer: my opinion only, not speaking for anybody else. I'm an
SRE at Google)

I recommend reading page 2 of the paper, which discusses the specific set of
features that XRay offers. How do I get perf or DTrace to give me the six
things listed there? I can only think of ways to get a couple of them.

~~~
dap
> The cost is acceptable when tracing and barely measurable when not tracing.

"Acceptable" is obviously relative, but with DTrace's pid provider, the cost
is zero when not tracing, and about the cost of a fastcall per probe point
when enabled.

> Instrumentation is automatic and directed towards functions that are
> important for understanding the binary’s execution time.

I'm not sure what this means, but with DTrace, you enumerate the functions or
binary objects (with wildcards and such) that you want to instrument, and the
framework takes care of reliably instrumenting them, no matter the state of
the process. Is that "automatic" and "directed"? I need to read the rest of
the paper more closely.

> Tracing is efficient in both space and time -- only recording what is
> required and what matters.

DTrace records exactly what you ask it to. It supports in-situ aggregation for
cases where it's not tenable to record a complete log of all interesting
events. This is an important part of the design.

> Tracing is configurable with thresholds for storage (how much memory to use)
> and accuracy (whether to log everything or only function calls taking at
> least some amount of time).

With DTrace, it's pretty easy to filter on function execution time. The buffer
size is configurable. There are also multiple buffer policies for different
use-cases (e.g., ringbuffer of the last N events leading up to some other
event).

> Tracing does not require changes to the operating system nor super-user
> privileges.

If they're running Linux, as I imagine they are, DTrace isn't necessarily an
option, though several other platforms do have ports of it. Using it to
record user-level state on your own processes does not require superuser
privileges.

> Tracing can be turned on and off dynamically without having to restart the
> server.

Absolutely -- that's what the "D" is for.

I'd strongly recommend checking out the DTrace paper:
[https://www.usenix.org/legacy/event/usenix04/tech/general/full_papers/cantrill/cantrill_html/](https://www.usenix.org/legacy/event/usenix04/tech/general/full_papers/cantrill/cantrill_html/)

There may be good reasons not to use DTrace for this, but I'm not sure which
of those six goals would be the sticking point other than OS availability.
(edit: I also haven't read beyond that yet!)

------
georgehm
Same idea, different use?
[https://blogs.msdn.microsoft.com/oldnewthing/20110921-00/?p=9583](https://blogs.msdn.microsoft.com/oldnewthing/20110921-00/?p=9583)

------
israrkhan
The fact that it relies on compiler instrumentation makes it less interesting
for people outside Google, given that there are other instrumentation systems
that do not require changes to the compiler.
------
kabdib
Seems unnecessary to do anything to registers in the call edge glue. We did
this (oh, decades ago...) on the 68000, on the Macintosh, and really all we
needed was a distilled link map and some return addresses. Maybe there are
some complications we didn't have, though.

~~~
4ad
I agree. This technique is as old as programming itself.

