
Faster Backtraces for Native Applications - sbahra
http://backtrace.io/blog/blog/2014/09/15/bt-lightweight-backtrace-tool/
======
js2
There are a couple of excellent reporting frameworks, PLCR and breakpad. What's
sorely missing is a really good backend.

There are commercial options in crittercism, crashlytics, bugsense, and
hockeyapp, but these are either mobile-only or subpar in what they provide.

And there really isn't any decent open-source backend. Mozilla Socorro is ugly
and tied to their infrastructure in many ways. The closest I can think of is
Sentry... but it's primarily designed for handling tracebacks from backend
apps. (Oh, I guess there's also squash.io)

Anyway, I'm curious to see what you're building on the backend, and whether you
intend to open-source it in addition to what you plan to offer commercially.

~~~
sbahra
Will reveal the backend soon but we do not have plans to open-source it as of
yet.

------
BruceM
It would be useful to know the commands that were run for gdb, lldb, and so on
to do our own comparisons.

I also need to be able to run a program under a harness that, should
it crash, lets me get the stack and other details. Bonus points if I get to
set up my own code to help decode and pretty-print my own data types. But for
my purposes, that needs to be something under a more open license, which is why
I've been working with LLDB to date.
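
A minimal sketch of the outer part of such a harness, assuming a POSIX system
and Python's standard library only. The names `run_under_harness` and
`on_crash` are hypothetical, not from any tool mentioned here. It only detects
that the child died from a signal; collecting the stack would still need a
debugger such as LLDB or a core dump, hooked in through the callback.

```python
import signal
import subprocess
import sys


def run_under_harness(argv, on_crash=None):
    """Run argv as a child process and report whether a signal killed it.

    Returns (returncode, signal_name); signal_name is None on a normal exit.
    on_crash is a hypothetical hook, called with the signal name, where a
    real harness would attach a debugger or process a core dump.
    """
    proc = subprocess.run(argv)
    if proc.returncode < 0:
        # On POSIX, subprocess reports death-by-signal as a negative code.
        signum = -proc.returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number without a named constant on this platform.
            name = "signal %d" % signum
        if on_crash is not None:
            on_crash(name)
        return proc.returncode, name
    return proc.returncode, None
```

Calling `run_under_harness(["./my_program"], on_crash=print)` would print the
signal name if `./my_program` (a placeholder path) crashes, and return its
normal exit code otherwise.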

------
sbahra
We'd appreciate any feedback on the early access core software, and we're all
ears on feature requests. Let us know if you encounter any sub-optimal output
and we'll fix it. Hope some of you guys find this useful and live a more
manageable life :-P

~~~
landonjf
As the author of a crash reporting framework, this is the first time I've seen
execution speed of a frame unwinder so strongly touted :-)

Without code, I can't make much of a comment otherwise, although I'm certainly
interested to see your patches to libunwind once they're upstreamed.

One of the reasons we didn't use libunwind in PLCrashReporter (and instead,
wrote our own DWARF unwinding code) was the relative difficulty we saw in
porting libunwind to Mach-O and the Mach thread/VM APIs, as compared to the
cost in porting our relatively platform-neutral code to other platforms.

The analysis side is super interesting too, and it'll be great to see more of
what emerges from your work. I can't think of much that's been published on
that front other than Microsoft's overview of their Windows Error Reporting
heuristics:

    http://www.sigops.org/sosp/sosp09/papers/glerum-sosp09.pdf

Congratulations on your release!

~~~
sbahra
Come on now, you're the author of _the_ crash reporting framework for mobile
:-P Big fan of PLCrashReporter and your work on it; the mobile error
reporting market has much to thank you for. I've looked at some of your code
and definitely appreciated the cleanliness of it.

On performance: The execution speed being touted here is not that of a frame
unwinder; we're building something that goes beyond this in scope. Our core
client-side technology is a debugging library optimized for tracing. Unwinding
was only a bottleneck for some specific ridiculous workloads. Usually the
bottleneck is elsewhere if program structure is being parsed (and this is
where the general-purpose debuggers end up taking so much time).

There are applications out there that are so complex that traditional
debuggers are simply infeasible (imagine 30 minute+ backtraces). However, what
we're excited about are all those spare cycles we get to make use of...

On libunwind: Yes. We'll be targeting some exotic platforms, and this is where
libunwind definitely helps. It is definitely lacking in file format
abstractions, but we support multiple unwinding backends for this reason. It
will be painful but not too painful.

Thanks for the SOSP link, looks interesting. I agree, this is an area that has
definitely been neglected by academia.

We're very excited about the first release of our platform and I'll keep you
posted on it. Your feedback would be great, and we think we've come up with
very useful technology.

------
kevingadd
So something I don't understand that doesn't seem to have an explanation in
the linked article:

Why does the speed of backtrace generation matter for handling/reporting
crashes? If the process is dead, I don't see how it taking an extra 50ms (if
that) to tear down makes a big difference - especially since crashes should be
an exceptional case, not a common one.

Is the optimization here actually intended to enable more accurate, less
intrusive realtime profiling, or something like that? Otherwise I'm having a
hard time understanding how optimizing for wall-clock time here is actually a
useful exercise (even if it is very interesting)

~~~
sbahra
1) What's most interesting to us is that all these spare cycles mean we
can start doing some computationally expensive analysis as part of crash
reporting and can even do it at scale. However, this type of analysis requires
a very efficient tracer. We will provide updates on the tracer in an upcoming
post.

2) The speed and efficiency of backtrace generation can affect recovery times.
It's far more than 50ms for a lot of server-side or embedded applications.

3) Large programs today cannot feasibly be debugged (as in, good luck
generating a detailed crash report; your system will likely not have the
resources), especially if they're time-sensitive as well (tracing is the
typical approach there). There are engineers out there who have to spend hours
just to extract a small memory dump from a single thread.

4) Certain classes of bugs are best observed over time and minimizing jitter
is important (more on that later as we unveil some features of the advanced
tracer and user interface).

Less intrusive real-time profiling is interesting to us, but currently only in
the context of state leading to a fatal bug (this also includes bugs that
involve hanging such as an infinite loop). The technology does have
applications for performance management but this isn't something we are
focusing on at the moment.

------
sbahra
Thanks for all the feedback, people. We've released a new version with robust
handling of attachment failures, additional DWARF features, and performance
improvements (up to 35% on targets with lots of shared objects).

