Building Sentry, a service to process native crash reports and minidumps

etaioinshrdlu · on June 14, 2019

Sentry is one of the nicest services I've had the pleasure of using. Having our errors centrally logged and managed is invaluable.

Source: a happy user.

PikachuEXE · on June 14, 2019

We use Sentry for Rails & JavaScript and we are happy with it. Haven't tried it with CSP (Content Security Policy) yet, but it seems great too (from reading their blog post).

stavros · on June 14, 2019

I have similarly been using Sentry for years and am extremely happy with it. I do have some UI woes occasionally, but they seem to be improving in that department.

robocat · on June 14, 2019

We have used Sentry for a long time with JavaScript. The main issues for us are:

* Obese JavaScript code. We had to write our own custom code to log events.

* Aimed at large scale companies. We only have 1000s of users, and we care about each individual exception, but I think it is really aimed at consolidating large numbers of events.

* Meaningless percentages on data. Tagged data is processed, but the end percentage value has little meaning e.g. send through 1000 similar events, with 1 event with a tag with value X, and 1 event with a tag with value Y, and 998 with no value. Sentry reports 50% X and 50% Y!
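The arithmetic in the last bullet can be reproduced with a small sketch. It assumes the percentage is computed over only the events that carry the tag, which is a guess at the behavior described and not Sentry's actual code; `tag_distribution` and the `flavor` tag are hypothetical names.

```python
from collections import Counter

def tag_distribution(events, tag, over_all_events=False):
    """Return {value: percent} for one tag.

    Hypothetical helper. With over_all_events=False the denominator is only
    the events that carry the tag, which reproduces the 50%/50% split.
    """
    values = [e[tag] for e in events if tag in e]
    denominator = len(events) if over_all_events else len(values)
    if denominator == 0:
        return {}
    return {v: 100.0 * n / denominator for v, n in Counter(values).items()}

# 1000 similar events: one tagged X, one tagged Y, 998 with no value.
events = [{"flavor": "X"}, {"flavor": "Y"}] + [{} for _ in range(998)]

tagged_only = tag_distribution(events, "flavor")                       # X and Y each 50%
all_events = tag_distribution(events, "flavor", over_all_events=True)  # X and Y each 0.1%
```

Dividing by the total event count instead gives 0.1% for each value, which is closer to what a small team looking at individual exceptions would expect to see.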

But they have given us really excellent service, especially given we are not paying enterprise rates.

Edit: also we are not in a US timezone, which makes the UI weird. And I do love the email integration: have a bug, get an email, fix it.

the_mitsuhiko · on June 14, 2019

> * Obese JavaScript code. We had to write our own custom code to log events.

For what it's worth we spent a lot of time to reduce bundle sizes recently. It's ultimately a tradeoff between how rich and complete the data is one wants to capture (and from how many browsers) and how big one wants to have the bundle :(

stavros · on June 14, 2019

Can you make a "lite" module for people who want basic exception handling, perhaps?

the_mitsuhiko · on June 14, 2019

In theory it should treeshake down quite well if you use esm modules. Depends obviously on quite a few factors.

Additionally there is another version of the JS SDK dubbed "loader" which lazy loads the real SDK on first use: https://docs.sentry.io/platforms/javascript/#lazy-loading-se...

mikedh · on June 14, 2019

They have @sentry/minimal as one of the 9 JavaScript options: https://github.com/getsentry/sentry-javascript

sciurus · on June 14, 2019

> Aimed at large scale companies. We only have 1000s of users, and we care about each individual exception, but I think it is really aimed at consolidating large numbers of events.

What you're saying here confuses me slightly. Sentry is aimed at consolidating large numbers of equivalent events. It looks at things like the stack trace to determine when to merge events into a single issue. If you deploy a bug that results in one exception occurring a thousand times per second, you want Sentry to create one issue with thousands of events, not thousands of issues, right?
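A minimal sketch of that kind of grouping, assuming events are merged when their exception type and stack frames match. This is an illustration only, not Sentry's real grouping algorithm; the event shape and the `fingerprint` and `group_into_issues` helpers are hypothetical.

```python
import hashlib
from collections import defaultdict

def fingerprint(event):
    """Hash the exception type plus (module, function) frames.

    The message is deliberately ignored so that events differing only in
    per-user details still land in the same issue.
    """
    parts = [event["type"]] + [f"{module}:{func}" for module, func in event["frames"]]
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()

def group_into_issues(events):
    """Return {fingerprint: [events]}; each key stands for one issue."""
    issues = defaultdict(list)
    for event in events:
        issues[fingerprint(event)].append(event)
    return dict(issues)

# One deployed bug firing 1000 times, plus one unrelated error.
same_bug = [
    {"type": "KeyError",
     "frames": [("app.views", "checkout"), ("app.cart", "total")],
     "message": f"user {i}"}
    for i in range(1000)
]
other_bug = [{"type": "ValueError", "frames": [("app.views", "login")], "message": "bad token"}]

issues = group_into_issues(same_bug + other_bug)
```

Here the 1001 events collapse into two issues, one holding 1000 events and one holding a single event.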

deepthought42 · on June 14, 2019

We pair Sentry with Segment, which makes handling events super easy instead of writing custom code for logging events.

config_yml · on June 14, 2019

Could you please expand a bit on what and how you're doing this?

deepthought42 · on June 14, 2019

I'm not sure how familiar you are with segment.io, but it's a super easy data piping service. So we use Segment to define all of our events across our stack, front-end, middle-ware, backend and then we just connect segment to sentry using the integration they provide. FWIW, we use segment for all of our data piping needs...errors, interaction tracking, traffic analytics, etc. This way we can not only have the errors show up in sentry for us to review, but we can also review in other services where in an event stream any given user ran into the error for easier debugging.

lacogubik · on June 14, 2019

How do you get errors to other services in Segment? Did you write some custom code to catch errors and then send them as segment's track calls?

We use Sentry with Segment, but on client segment just loads sentry and the errors are caught by sentry only, they do not appear in other integrations.

deepthought42 · on June 14, 2019

ya, we use a small custom listener to capture the errors and wrap them in a segment track call.
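The comment does not say what that listener looks like or which language it is written in. As a rough sketch of the idea in Python, assuming a `track(user_id, event_name, properties)` callable that stands in for whatever Segment client call is in use (the callable, the "Error Occurred" event name, and the property names are all assumptions):

```python
import sys
import traceback

def make_error_listener(track, user_id, previous_hook=None):
    """Build an excepthook that reports the error via `track`,
    then defers to the previously installed hook.

    `track` is a hypothetical stand-in for a Segment track call.
    """
    previous_hook = previous_hook or sys.__excepthook__

    def hook(exc_type, exc, tb):
        try:
            track(user_id, "Error Occurred", {
                "type": exc_type.__name__,
                "message": str(exc),
                "stack": "".join(traceback.format_exception(exc_type, exc, tb)),
            })
        except Exception:
            pass  # reporting must never mask the original error
        previous_hook(exc_type, exc, tb)

    return hook

# Usage (my_track is whatever function forwards to Segment):
# sys.excepthook = make_error_listener(my_track, "user-123", sys.excepthook)
```

Chaining to the previous hook keeps any other error reporter that was already installed working, so the error still reaches it as well as the event stream.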

config_yml · on June 14, 2019

Interesting, thanks for sharing that. What other services are you using? Is there anything I can read up on for this kind of operational stuff? It would be interesting to learn about the landscapes of tools people use for their production setups.

deepthought42 · on June 18, 2019

We use a variety of tools, and often when we add a new tool it's because we review the integrations for segment and discover a service that we can leverage. For a good starter setup though, I really enjoy segment, sentry, mixpanel, and heap.io. We use a few other services, but when it comes to troubleshooting for customers these are our main goto services.

xvilka · on June 14, 2019

They might want to check radare2[1] for processing crash dumps, since it supports all 3 major platforms (Windows, Linux, OS X), and lets you work with stripped files as well.

[1] https://github.com/radare/radare2

the_mitsuhiko · on June 14, 2019

Armin from Sentry here: the feature set overlap between Sentry's symbolicator and radare2 is not that great. The goals are also very different. We only want to unwind and symbolicate, but we need to do this at scale. radare, on the other hand, wants to disassemble and do all kinds of low-level debugging, and it's built for user-attended interactive sessions.

xvilka · on June 14, 2019

Some companies use r2 for batch jobs; moreover, it is available as a library. Anyway, just wanted to inform you, in case you didn't know.

rtpg · on June 14, 2019

Sentry has been very good to us, and it’s a generally great business model to boot! Overall great for the community and for ourselves

I am going to whine a bit that the recent move over to the unified SDK has been less than ideal for us. The fact that the raven docs would point us to the unified SDK but not to a “how to migrate” page made me super unsure about whether we were doing the right thing (esp. when it came to the logging integrations on Python)

It’s kind of an interesting problem, providing SDKs for each language. Sentry went with unifying the API across language boundaries and I’m not super happy with the results but I don’t have like 30 packages to maintain

the_mitsuhiko · on June 14, 2019

> Sentry went with unifying the API across language boundaries and I’m not super happy with the results but I don’t have like 30 packages to maintain

Yeah, that move and the docs did not go exactly as planned. There are a few reasons why we did it: a) the old SDKs had no sensible state management which caused endless issues such as incorrect breadcrumb collection in async code. b) it's really hard for customer support to understand the number of SDKs.

We're working on improving that, in particular docs.

Operyl · on June 14, 2019

The title cuts off “Symbolicator,” the specific name of the component here which is slightly confusing.

daniel_levine · on June 14, 2019

It was in the original title I submitted, but the admins must have changed it

sciurus · on June 14, 2019

This is cool stuff! It's nice to see what Sentry can develop in this space with the focus and resources that they have.

I handle ops for Mozilla's crash reporting pipeline for Firefox [0] and our symbol server [1], among other things. I know our respective development teams stay in touch, and I hope we can find a way to use symbolic/symbolicator to simplify our stack.

[0] https://socorro.readthedocs.io/en/latest/ [1] https://tecken.readthedocs.io/en/latest/

SEJeff · on June 17, 2019

I've used sentry since Dave Cramer (sentry original author) was working back at Disqus years ago. It's excellent software that fills a really important niche. It is wonderful to see he managed to build a solid team and company around it.

larrik · on June 14, 2019

I really like sentry, but I'm sad that the URL scheme changed (from sentry.io/<org name>/<project name>/ to sentry.io/organizations/<org>/issues/?project=<meaningless int>)

zeeg · on June 14, 2019

Yeah it's not great right now, but will likely change again.

e.g. sentry.io/issues/SEN-12345

We're also introducing a much more comprehensive event search, which will require event permalinks, so we're sorting some of that out.

Feel free to throw additional feedback our way. Best place would be on forum.sentry.io to make sure the team actually sees it.

scardine · on June 14, 2019

Hey @the_mitsuhiko, any plans to support Django Channels (daphne) out of the box? Debugging async stuff is tough.

the_mitsuhiko · on June 14, 2019

There are no concrete plans. It's not clear to me what this would entail.

js2 · on June 14, 2019

I built Yahoo's in-house mobile app crash reporting tool a few years ago (still in use). I used an on-premise install of Sentry as the UI. At the time, Sentry didn't really support mobile error reporting, so I built something much like what's detailed in this post and called it the Processor.

I regret never having made the time to open-source what I built. The Processor is written in Python, takes reports from mobile devices, unwinds, symbolicates, retraces, unminifies, etc as needed, then generates a Sentry "event" and forwards that to our on-prem Sentry instance.
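For readers unfamiliar with the symbolication step, it boils down to mapping a raw instruction address back to the function that contains it. A minimal sketch, assuming a symbol table of (start offset, name) pairs per binary image; `SymbolTable`, `symbolicate`, and the sample addresses are hypothetical and this is not the Processor's code:

```python
import bisect

class SymbolTable:
    """Sorted (start_offset, name) pairs for one binary image."""

    def __init__(self, symbols):
        self._symbols = sorted(symbols)
        self._starts = [start for start, _ in self._symbols]

    def lookup(self, offset):
        """Return (name, offset_into_function), or None if below the first symbol.

        No symbol sizes are stored, so an offset past the end of the last
        function still resolves to that function.
        """
        i = bisect.bisect_right(self._starts, offset) - 1
        if i < 0:
            return None
        start, name = self._symbols[i]
        return name, offset - start

def symbolicate(address, image_base, table):
    """Turn an absolute address into 'function + offset', or hex if unknown."""
    hit = table.lookup(address - image_base)
    if hit is None:
        return hex(address)
    name, delta = hit
    return f"{name} + {delta}"

table = SymbolTable([(0x1000, "main"), (0x1040, "parse_args"), (0x10A0, "run")])
frame = symbolicate(0x100001058, 0x100000000, table)    # "parse_args + 24"
unknown = symbolicate(0x100000010, 0x100000000, table)  # "0x100000010"
```

The binary search keeps each lookup logarithmic in the number of symbols, which matters once a single report carries many frames.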

I also built the SDKs. For iOS, I used PLCrashReporter. These days I'd probably use KSCrash. An important point here. On iOS, the unwinding is done on the device. So all you have to do on the backend is symbolicate it. Another point: it's relatively easy to get iOS system symbols. Plug an iOS device into a Mac running Xcode and the symbols are transferred from the device to the Mac. You can then harvest them however you need. In fact, Apple has apparently stopped encrypting OTA updates so you no longer need an iOS device to get the symbols:

https://github.com/Zuikyo/iOS-System-Symbols

For Android NDK crashes I've tried a few approaches and still don't have a satisfying solution. Originally I went with breakpad + minidumps on the device. On the backend, the Processor runs the breakpad stackwalker on the minidump. Another important point: the unwinding is occurring on the backend in this case, unlike iOS where it's done on the phone. (A minidump is basically just a snapshot of all the thread stack memory, plus some extra diagnostic info.) But to unwind reliably off-device you need the Android system symbols (in addition to the app's symbols obviously). Well good luck with that. Google makes the original Nexus Android OS images available so you can harvest those but you'll never get symbols for all the various Android devices. I built a tool that can harvest symbols off a device and tried to crowdsource them from Yahoo's developers but it's not been very successful (there's a lot of flavors of Android).

Another issue is that minidumps are relatively largish to deal with. So my second approach was two-fold. I'm still using breakpad's crash handler on the device, but I now have it generating the much smaller microdump format. In addition, I've added libunwind to our Android SDK so that after capturing the microdump, I attempt to unwind on the device (also collecting function names during unwinding) and add that info to the report. The Processor then only needs to unwind the microdump if the unwinding on the device failed. Otherwise it just needs to symbolicate. This hasn't been wildly successful though. Unwinding on an Android device is trickier than on an iOS device. Also, it's almost impossible (well I haven't figured out how) to unwind through the ART/Java frames that called into the native code.

Of course the vast majority of Android crashes are in Java code and these are much easier to deal with. They are unwound just fine on the device, so on the backend you only need to deal with deobfuscating the ProGuard minification, which is easily done using the mapping file generated by ProGuard.
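A rough sketch of that deobfuscation step, assuming the usual mapping file layout in which class lines read `original.Name -> obfuscated.name:` and member lines are indented beneath them. It restores class names only; resolving method names also needs the member lines and line-number ranges, which ProGuard's own ReTrace tool handles. The sample mapping and trace are made up.

```python
import re

CLASS_LINE = re.compile(r"^(\S+) -> (\S+):$")
FRAME = re.compile(r"^(\s*at )([\w.$]+)\.([\w$<>]+)\((.*)\)$")

def parse_class_mapping(mapping_text):
    """Return {obfuscated_class: original_class} from the class lines."""
    classes = {}
    for line in mapping_text.splitlines():
        match = CLASS_LINE.match(line)  # indented member lines do not match
        if match:
            original, obfuscated = match.groups()
            classes[obfuscated] = original
    return classes

def deobfuscate(trace, classes):
    """Rewrite obfuscated class names in 'at pkg.Class.method(...)' frames."""
    out = []
    for line in trace.splitlines():
        match = FRAME.match(line)
        if match and match.group(2) in classes:
            prefix, cls, method, location = match.groups()
            line = f"{prefix}{classes[cls]}.{method}({location})"
        out.append(line)
    return "\n".join(out)

MAPPING = """com.example.CartManager -> a.a.b:
    int itemCount -> a
    void checkout() -> a
com.example.Main -> a.a.c:
"""

TRACE = """java.lang.NullPointerException
    at a.a.b.a(Unknown Source)
    at a.a.c.onClick(Unknown Source)
    at android.view.View.performClick(View.java:5637)"""

readable = deobfuscate(TRACE, parse_class_mapping(MAPPING))
```

Frames from classes that are not in the mapping, such as the Android framework frame in the sample, pass through unchanged.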

What's really annoying with native mobile crashes is that both Android and iOS have their own services for both capturing crashes and unwinding on the device. And because these are integrated with the OS and work out-of-process, they are much more reliable than anything you can do in-process using something like PLCR, KSCrash, libunwind, etc.

But, neither OS gives an app access to its own system generated reports. All you get is the lame reports the devices upload to Google Play Console / iTunes Connect.

Anyway, thank you to Sentry for providing such a great product and I'm sorry again I wasn't able to contribute more. I'm not sure what I built would work at your scale. It's interesting we ended up with similar designs.

the_mitsuhiko · on June 14, 2019

> Another point: it's relatively easy to get iOS system symbols. Plug an iOS device into a Mac running Xcode and the symbols are transferred from the device to the Mac.

Indeed. But sadly Apple does not provide a symbol server like Microsoft does. We are maintaining our own internally. I wish we could open it up to the world but I'm pretty sure that it's not legal to redistribute these.

> For Android NDK crashes I've tried a few approaches and still don't have a satisfying solution.

That is indeed overall a pretty frustrating situation. It's similar for linux in general where it's really hard to get all the debug symbols collected. And even if debug symbols exist, they are not stored like you would expect from a symbol server. Very frustrating.

I'm quite annoyed that there is so little support from the platform holders to provide production debugging APIs. One would think there is a higher demand for this :(

js2 · on June 14, 2019

> Apple does not provide a symbol server like Microsoft does. We are maintaining our own internally.

Ditto. But the maintainer of the repo I linked to has thrown caution to the wind and thrown them all up on a Google Drive:

https://github.com/Zuikyo/iOS-System-Symbols/blob/master/col...